By Gordon Hull
In a fascinating new paper up on arxiv.org, Fabian Offert, Paul Kim, and Qiaoyu Cai start with the observation that both AlphaFold and ChatGPT are transformer architectures, and that for proponents there is frequently a significant sense in which “it is the language-ness of proteins (and of language) … that renders the transformer architecture universal, and enables it to model two very different domains” (3). As they add, to someone in the humanities, this “sounds like structuralism 2.0.” Indeed, there is a rich history of connection and communication between structuralist linguistics and information theory, as demonstrated by Bernard Dionysius Geoghegan’s Code: From Information Theory to French Theory (which they cite for the point; my synopsis of it is here).
Offert, Kim and Cai argue that this thesis is backwards: it is not that there is a universal epistemology of which language is the paradigm, which is then modeled by transformer architectures (so that the universality of language grounds both ChatGPT’s and AlphaFold’s ability to be captured by the transformer architecture, which itself follows the universality of language). Rather, modeling is what transformer architectures do, and language and protein folding are examples of how that modeling can be put to use. In both use cases, the architecture generates an approximation of the phenomenon in question; in both use cases, that is often enough. However, the transformer architecture can be seen as “the definition of a specific class of knowledge” and not the realization of (or really even a model of) something linguistic. This means that “any manifestation of the transformer architecture is, at the same time, a representation of a particular body of knowledge, and a representation of a particular kind of knowledge” (15).
The core of the paper is the demonstration that language models rely on some specific, non-linguistic steps. That is, “techniques like word embedding, positional encoding, and subword tokenization ‘infuse’ tokens with a continuity that could not be more language-unlike, and actually is much closer, analogically, to a physical, not a symbolic understanding of the world” (2-3). Consider subword tokenization. Subword tokenization solves a problem: if we turn each word into a token, we preserve semantics best but have too many tokens to be practicable. If we turn each letter into a token, we have a tidy number of tokens, but mostly lose semantics. Subword tokenization tries to split the difference; in their example, a word like “refactoring” is quite uncommon, but can be broken down into the common tokens “re,” “factor,” and “ing.” This process generates “more manageable units [that] still maintain contextual relevance and are semantically salient” (10). Of course, that means that “it is already in this very first pre-processing step that we can see the ‘linguistic’ nature of language, its dependence on discrete tokens organized in hierarchical structures, vanish” (10).
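To make the mechanics concrete, here is a minimal sketch of subword splitting. The toy vocabulary and the greedy longest-match strategy are my own illustrative assumptions, not the tokenizer of any actual model (real systems learn their vocabularies and merge rules from data, e.g. BPE or WordPiece), but the effect on a rare word like “refactoring” is the same: it only enters the model in pieces.

```python
# Toy sketch of subword tokenization (illustrative only, not any model's real tokenizer).
# The vocabulary is hand-picked; real systems learn merge rules from data (BPE, WordPiece).
TOY_VOCAB = {"re", "factor", "ing", "think", "token", "ize"} | set("abcdefghijklmnopqrstuvwxyz")

def tokenize(word: str) -> list[str]:
    """Split a word greedily into the longest vocabulary entries available.
    Single letters are always in the vocabulary, so the loop always advances."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in TOY_VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("refactoring"))  # ['re', 'factor', 'ing']
# The rare word never appears as a unit; only its common pieces do.
```

What gets embedded and attended over downstream are these pieces, not the word.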
It seems to me that this is an opportunity to develop, and somewhat refashion, some earlier thoughts I’ve had about LLMs in the context of Derrida. As I’ve explored (here, here and here), language models depend on what Derrida calls iterability: the ability of a word to be taken from one context and put into indefinitely many others. For Derrida, this means (among other things) that context can’t be dispositive for determining the meaning of a word. Iterability in LLMs guarantees that the training data can be repurposed: “good” can be applied in new contexts, independent of the contexts of the training data. Subword tokenization depends on a different kind of iterability, the iterability of tokens. So context fails to be dispositive at two levels, words and tokens. Indeed, depending on how many whole words get their own tokens (as opposed to being split into subwords), the context of words will be even less relevant. The contexts of tokens matter more than the contexts of words (the co-occurrence of “ing” with various verbs, for example), since “factor” might be common enough to get its own token, but “factoring” is not. Similarly, what “re” means isn’t dependent on its specific location in “refactoring.”
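The same toy setup can illustrate that token-level context. The tokenizations below are hand-assumed for the example (not produced by any real tokenizer); counting what co-occurs with “ing” shows the suffix recurring across a variety of stems, while the whole words never appear at all.

```python
# Toy co-occurrence count at the token level (tokenizations hand-assumed for illustration).
from collections import Counter

tokenized = {
    "refactoring": ["re", "factor", "ing"],
    "rethinking":  ["re", "think", "ing"],
    "factoring":   ["factor", "ing"],
    "parsing":     ["pars", "ing"],
}

# Which tokens appear alongside "ing"?
cooccurs_with_ing = Counter(
    tok
    for toks in tokenized.values() if "ing" in toks
    for tok in toks if tok != "ing"
)
print(cooccurs_with_ing)
# Counter({'re': 2, 'factor': 2, 'think': 1, 'pars': 1})
```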
In short, once we’re dealing with the semantics of tokens, we’ve pulled the rug out from under the direct iterability of words and replaced it with the more bounded iterability of tokens (recall that the project was to reduce the number of tokens needed). The iterability of tokens serves as a substitute for the iterability of words.
This means three things. First, the direct iterability of words simply disappears, except for words that make it into the model as entire tokens. The word “refactoring” is not iterable because it does not make it into the model as such. It exists only in its decomposition. This is a sharp break from language as we use it.
As Offert, Kim and Cai underscore, this moves the transformer architecture away from language. To get a sense of how far, consider how Barthes famously begins his “Death of the Author” essay: “writing is the destruction of every voice, of every point of origin. Writing is that neutral, composite, oblique space where our subject slips away, the negative where all identity is lost, starting with the very identity of the body writing” (Image, Music, Text, 142). Barthes elaborates his point in terms that are reminiscent of Derrida’s. Indeed, Barthes specifically makes what has to be a reference to speech act theory, noting that writing “designates exactly what linguists, referring to Oxford philosophy, call a performative, a rare verbal form … in which the enunciation has no other content (contains no other proposition) than the act by which it is uttered” (145-6; as far as I can tell, Barthes is not named in Limited Inc). Barthes adds:
“We know now that a text is not a line of words releasing a single ‘theological’ meaning (the ‘message’ of the Author-God) but a multi-dimensional space in which a variety of writings, none of them original, blend and clash. The text is a tissue of quotations drawn from the innumerable centres of culture” (146).
I flesh out this description to indicate that for theorists like Barthes and Derrida, there’s an important sense in which language depends on what one might call the tokenization of words: the limiting structures that were imposed by functions like authorial intent aren’t real. There is no inherent semantic limitation imposed by authorial intent. Again, language models depend on something like this. But to the extent that subword tokenization reduces the number of tokens, it sharply reduces the “innumerable centres” that Barthes refers to.
The flexibility of Barthes’ “text” also indicates that iterability operates at a super-word level, incorporating (for example) phrases. Humor makes good use of putting not just words but phrases in unexpected places, and a lot of social meaning is conveyed by the repetition of not just words but phrases or even entire texts (this point was made brilliantly in the context of copyright by Rebecca Tushnet, who points out that copying can itself be expressive and politically meaningful). As I understand it, this sort of iterability/referentiality relation also disappears in language models.
More next time…