By Gordon Hull
In a recent paper in Ethics and Information Technology, Paul Helm and Gábor Bella argue that current large language models (LLMs) exhibit what they call language modeling bias, a set of structural and design issues that constitutes a significant and underappreciated form of epistemic injustice. As they explain the concept, “A resource or tool exhibits language modeling bias if, by design, it is not capable of adequately representing or processing certain languages while it is for others” (2). Their basic argument is that the standard way of proceeding with non-English languages, which is more or less to throw more data at the model, builds in structural biases against other languages, especially those that are more morphologically complex than English (i.e., those with lots of inflections).
The proof of concept is in multilingual tools:
“The subject of language modeling bias are not just languages per se but also the design of language technology: corpora, lexical databases, dictionaries, machine translation systems, word vector models, etc. Language modeling bias is present in all of them, but it is easiest to observe with respect to multilingual resources and tools, where the relative correctness and completeness for each language can be observed and compared” (6).
They identify several kinds of such structural bias. The first is that prominent current architectures “tend to train slower on morphologically complex (synthetic, agglutinate) languages, meaning that more training data are required for these languages to achieve the same performance on downstream language understanding tasks” (7). Given the percentage of the available training data that’s in English, this magnifies what’s already a problem. Second, the models perform poorly on untranslatable words. Third, they cite a study showing “that both lexicon and morphology tend to become poorer in machine-translated text with respect to the original (untranslated) corpora: for example, features of number or gender for nouns tend to decrease. This is a form of language modeling bias against morphologically rich languages” (7).
Fourth, translations tend to use English as a pivot language. They cite one example: in translating “my (female) cousin married a tall man” from French to Italian, the interposition of English – which does not have gendered forms for cousin – causes the Italian output to default to a male gendered term (at best, it would have to guess, since the problem is the disappearance of the gender marker in the move from French to English). Finally, multilingual lexical databases, which are often based on WordNet (which is weird. And English-based), tend to flatten out concepts that are richer in non-English languages. They cite an example of going from Swahili to Japanese on rice. Both Swahili and Japanese have a rich vocabulary for rice, with separate words for uncooked versus cooked rice and so forth. But English has only one, so “uncooked rice” in Swahili goes through the database and comes out incorrectly in Japanese as just “rice.”
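The mechanism in both examples is the same, and can be sketched in a few lines of Python. The mini-lexicons below are hand-written toys, not real translation data, and the specific word pairs are my own illustration rather than the paper's implementation; the point is only that any distinction the English pivot lacks is unrecoverable downstream.

```python
# Toy sketch of pivot-translation loss: whatever English merges stays merged.

FR_TO_EN = {"cousine": "cousin", "cousin": "cousin"}   # gender collapses at the pivot
EN_TO_IT = {"cousin": "cugino"}                        # ambiguous input, masculine default out

SW_TO_EN = {"mchele": "rice", "wali": "rice"}          # uncooked/cooked collapse at the pivot
EN_TO_JA = {"rice": "kome"}                            # one default sense out

def pivot(word, src_to_en, en_to_tgt):
    """Translate source -> English -> target via toy lookup tables."""
    return en_to_tgt[src_to_en[word]]

# French "cousine" (female cousin) and "cousin" (male) become indistinguishable:
print(pivot("cousine", FR_TO_EN, EN_TO_IT))  # "cugino" (masculine)
print(pivot("cousin", FR_TO_EN, EN_TO_IT))   # "cugino" (masculine)

# Swahili "wali" (cooked rice) and "mchele" (uncooked rice) merge in Japanese:
print(pivot("wali", SW_TO_EN, EN_TO_JA))     # "kome"
print(pivot("mchele", SW_TO_EN, EN_TO_JA))   # "kome"
```

No amount of cleverness on the target side can restore a feature that the intermediate representation never encoded.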
Helm and Bella characterize these results as examples of hermeneutic injustice, adding to an already rich literature about epistemic injustice in machine learning systems. Still, the paper stands out in a few ways. First, the focus on structural issues in LLMs is, as far as I know, novel. Second, they also make an important theoretical move, tying epistemic injustice to colonialism by way of Spivak’s “Can the Subaltern Speak?” (all of the other work I know on ML/data and epistemic injustice goes through Fricker without the addition of Spivak). Helm and Bella write that “the concept of epistemic injustice also needs to be situated historically, as it can be understood as a further development of Gayatri Chakravorty Spivak’s notion of epistemic colonization” (5). They add:
“This involves the domination of a particular theory of knowledge (in the present case, it may be a belief in the universal power of AI systems developed in the West) over others, often marginalizing or suppressing local knowledge systems and ways of understanding the world.” (5).
The paper thus builds an explicit bridge between work on data and AI colonialism and on epistemic injustice.
It seems that the move to Spivak is also important for an unexpected reason. Spivak’s essay distinguishes between Deleuze and Foucault, on the one hand, and Derrida, on the other. She writes:
“I have tried to argue that the substantive concern for the politics of the oppressed which often accounts for Foucault’s appeal can hide a privileging of the intellectual and of the ‘concrete’ subject of oppression that, in fact, compounds the appeal. Conversely …. I will discuss a few aspects of Derrida’s work that retain a long-term usefulness for people outside the First World. This is not an apology. Derrida is hard to read; his real object of investigation is classical philosophy. Yet he is less dangerous when understood than the first-world intellectual masquerading as the absent nonrepresenter who lets the oppressed speak for themselves” (292; citing the original version in Marxism and the Interpretation of Culture)
Spivak suggests that, for the Derrida of Grammatology, “the question is how to keep the ethnocentric Subject from establishing itself by selectively defining an Other,” a question which is “not a program for the Subject as such; rather, it is a program for the benevolent Western intellectual.” Spivak notes that Derrida identifies three linguistic prejudices of the seventeenth century:
“The first can be indexed as: God wrote a primitive or natural script: Hebrew or Greek. The second: Chinese is a perfect blueprint for philosophical writing, but it is only a blueprint. True philosophical writing is 'independen[t] with regard to history' (OG, p. 79) and will sublate Chinese into an easy-to-learn script that will supersede actual Chinese. The third: that Egyptian script is too sublime to be deciphered. The first prejudice preserves the 'actuality' of Hebrew or Greek, the last two ('rational' and 'mystical', respectively) collude to support the first, where the center of the logos is seen as the Judaeo-Christian God (the appropriation of the Hellenic Other through assimilation is an earlier story, a 'prejudice' still sustained in efforts to give the cartography of the Judaeo-Christian myth the status of geopolitical history)” (292-3)
Derrida follows with “an account of the complicity between writing, the opening of domestic and civil society, and the structures of desire, power and capitalization” (293).
I don’t want to get too deep into questions of grammatology (or of Spivak) here. But I do want to note the resonance between Spivak’s Derrida and the problems Helm and Bella identify. At one point, they underscore that the point is not to get bias out of technology: “Unbiasedness is therefore a deceptive goal that, instead of solving social problems, reproduces problematic ideas, such as the unrealistic imaginary that technology can be neutral” (6).
It is the imaginary of a neutral technology that shows how deeply implicated LLMs are in the problems that Derrida highlights. Recall from his critique of Searle that one of the problems there (intentionality is another) is Searle’s non-innocent taking of a moment of “serious” (as opposed to “parasitic”) linguistic usage as the norm for a theory of what language does. Here, we see a parallel critique, articulated through the language of grammatology: the language models Helm and Bella critique precisely reenact the three prejudices. In place of Hebrew or Greek, English becomes the natural script against which all other languages are measured. Some actual languages are too morphologically complex, and they will be sublated into an easy-to-learn script that will supersede the actual language. And others – this is the third prejudice – are indecipherable and simply misrepresented.
In any case, English emerges as the de facto universal language against which others are measured. Helm and Bella underscore this point repeatedly, as noted above. The co-constitution of the default to English and the technical structure of language models is also emphasized by Hovy and Prabhumoye (upon whom Helm and Bella deliberately build):
“Overexposure can also create or feed into existing biases, for example, that English is the ‘default’ language, even though both morphology and syntax of English are global outliers. It is questionable whether NLP would have focused on n-gram models to the same extent if it had instead been developed on a morphologically complex language (e.g., Finnish, German). However, because of the unique structure of English, n-gram approaches worked well, spread to become the default approach and only encountered problems when faced with different languages” (10)
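A rough way to see the vocabulary problem Hovy and Prabhumoye point to: a word-level n-gram model treats every inflected surface form as a distinct type, so a morphologically rich language needs far more text for the model to see each type often enough. The toy comparison below (my own illustration, not from the paper) sets the two English surface forms of “house” against a small sample of the real Finnish case forms of talo, “house”:

```python
# Toy vocabulary-size comparison: a word-level n-gram model sees each
# inflected form as a separate vocabulary item, so per-form counts are
# spread much thinner in morphologically rich languages.

english_forms = {"house", "houses"}  # all English surface forms of the noun
finnish_forms = {                    # a sample of talo's singular case forms
    "talo", "talon", "taloa", "talossa",
    "talosta", "taloon", "talolla", "talolle",
}

ratio = len(finnish_forms) / len(english_forms)
print(ratio)  # 4.0 even for this partial sample; the full paradigm is far larger
```

Finnish has around fifteen cases, each with singular and plural forms, so the full paradigm runs to roughly thirty surface forms per noun; the sample above understates the disparity.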
Recent trends toward looking at less-well-represented languages underscore the reenactment. It isn’t just that the models structurally prioritize English as the norm. It is also that the outreach into other languages imitates the colonialist structures Spivak articulates through her reading of Derrida. Helm and Bella argue:
“Low-resource language research is only worthy of a top publication as long as (1) it provides a solution for multiple, preferably tens or hundreds of languages at the same time; (2) it involves mainstream AI technology, i.e. neural networks; and (3) it requires very little to no knowledge from experts or speakers of the languages targeted. The typical low-resource research contribution thus scrapes web content, such as Wikipedia pages, written in the languages in question, often without any understanding of their quality or content. It then trains or fine-tunes deep learning models based on the data, and finally demonstrates a few percentages of increase in quality (precision, recall, BLEU, etc.) over one of the standard tasks in computational linguistics, such as named entity recognition or machine translation, against corpora that the researchers themselves cannot read. This practice is certainly not in line with what we earlier described as accounting for meaningful diversity.” (9)
In this, the wrong kind of well-meaning gestures to linguistic diversity are like Spivak’s Foucault: “the first-world intellectual masquerading as the absent nonrepresenter who lets the oppressed speak for themselves,” this time by scraping their Wikipedia pages.