By Gordon Hull
Last time, I looked at a new paper by Fabian Offert, Paul Kim, and Qiaoyu Cai and used it to rework some of my earlier remarks on Derrida’s use of iterability in transformer-based large language models (LLMs) like ChatGPT. In particular, I tried to draw out some of the implications of subword tokenization for iterability. Here I want to continue that process with other aspects of the transformer model.
If subword tokenization deals with the complexity of semantics by limiting the number of tokens, another attempt to preserve semantics while reducing complexity is through word embeddings. Here, the model constructs a vector mapping that places words which tend to occur near one another in text at nearby locations in the vector space. We might imagine a two-dimensional space that locates words along axes of age and gender. In such a space, “man” and “grandfather” would likely be closer to one another than “grandfather” and “girl,” since man and grandfather are closer to one another on both dimensions than grandfather is to girl. As Offert, Kim and Cai explain, “Either learned during the training process or sourced from another model, embedding matrices position each word in relation to other words in the input sequence” (11). The sentence comes with a footnote about efforts to draw Derridean implications:
“This relational aspect of word embedding has often been compared to poststructuralist notions of meaning, particularly Derrida’s notion of différance. It should be noted, however, that the relational saliency of embedded words is a product only of their operationalization: only words that are, in fact, numbers, gain relational saliency”
In other words, if it is true that the meaning of a word exists only in relation to other words, it is also true that a language model’s available catalog of words is arbitrarily limited by the training data it ingests and by whatever other constraints determine which words get represented as vectors at all.
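To make the two-dimensional example above concrete, here is a toy sketch in Python. The coordinates are invented for illustration only (nothing here comes from a trained model); the only point is that “man” and “grandfather” end up nearer to one another than “grandfather” and “girl.”

```python
# Toy two-dimensional embedding space with invented (age, gender) coordinates,
# each scaled to [0, 1]. Purely illustrative; not from any trained model.
import numpy as np

embeddings = {
    "man":         np.array([0.5, 1.0]),   # adult, male
    "grandfather": np.array([0.9, 1.0]),   # old, male
    "girl":        np.array([0.1, 0.0]),   # young, female
}

def distance(w1, w2):
    """Euclidean distance between two word vectors."""
    return float(np.linalg.norm(embeddings[w1] - embeddings[w2]))

print(distance("man", "grandfather"))   # ~0.40: close on both axes
print(distance("grandfather", "girl"))  # ~1.28: far apart on both axes
```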
These mappings can be incredibly complex, but they are finite – which generates another limit on iterability. As Offert, Kim and Cai put it:
“Though this proximity is not arbitrary, it is still constrained by the continuous but finite dimensions of the embedding vector space (typically in the hundreds or even thousands) that may not capture all possible nuances in the relation between words. As such, relationships between words are to be understood as contextually determined and constrained by the fundamental limitations inherent in translating words to numbers. Nevertheless, as the famous analogy test shows, some semantic aspects are indeed preserved” (11).
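The “famous analogy test” mentioned here is usually run directly on the embedding vectors, with arithmetic of the form king − man + woman ≈ queen. A minimal sketch of that computation, using hypothetical hand-built vectors rather than a trained embedding matrix such as word2vec or GloVe:

```python
# Minimal sketch of the analogy test ("man is to king as woman is to ?")
# via vector arithmetic over word embeddings. The vectors are hand-built
# toys, not trained embeddings.
import numpy as np

vocab = {
    "man":    np.array([1.0, 0.0, 0.2]),
    "woman":  np.array([0.0, 1.0, 0.2]),
    "king":   np.array([1.0, 0.0, 0.9]),
    "queen":  np.array([0.0, 1.0, 0.9]),
    "prince": np.array([1.0, 0.0, 0.7]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" - "man" + "woman" should land nearest to "queen".
target = vocab["king"] - vocab["man"] + vocab["woman"]
candidates = [w for w in vocab if w not in {"king", "man", "woman"}]
answer = max(candidates, key=lambda w: cosine(vocab[w], target))
print(answer)  # -> "queen" with these toy vectors
```

With real trained embeddings, the nearest neighbor of the composed vector is typically, though not always, the expected answer, which is the sense in which “some semantic aspects are indeed preserved.”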
There are at least two aspects of this to look at. First, the caution about différance. This seems correct at least in part because of the way Derrida, well, embeds his usage of différance. In Limited, Inc, for example, he writes that “the parasitic structure is what I have tried to analyze everywhere, under the names of writing, mark, step, margin, différance, graft, undecidable, supplement, pharmakon, hymen, parergon, etc.” (103). The final indeterminacy of the list is of course marked by the indefinite “etc.” with which it closes, but the fact that Derrida associates it with the parasitic structure suggests that the target here is the sort of intentional phenomenology that Derrida says underlies speech act theory. Early in the text, for example, he proposes that différance is “the irreducible absence of intention or attendance to the performative utterance” (18-19). The meaning of words depends on their proximity to other words because it does not depend on authorial intent: iterability means that authorial intent can never be dispositive of the meaning of an utterance, and that the meaning of the utterance can never be made fully present. It is not clear what it means to take that argument and apply it to a vector representation.
Second, the analogy test. The analogy test refers to the model’s ability to fill in analogy statements: “x is to y as z is to…” The brilliance of this capacity and its limitations are best shown through an example. Here is ChatGPT:
Both cases show that the model has preserved a lot of semantics and sees the logical structure of the analogy. But it’s also limited: when you read “computer programming is to employment as philosophizing is to what,” I’m guessing your mind immediately jumped to “unemployment.” That’s both a longstanding joke and a real fear. But both of those reasons why you might have said “unemployment” require broadening the context quite a bit, or seeing contexts other than the typical ones. ChatGPT was capable of the joke – but it had to be explicitly told to make it:
It seems to me, at least, that this shows the connection between word embeddings and the sort of complaint Derrida makes against Searle’s demotion of parasitic forms of language: both of them rely on an implicit normativity to make their project tractable in the first place. But there are contexts where the parasitic meaning is in fact the first meaning, and precisely that sense of unusual context is what disappears in the language model.
More next time…