By Gordon Hull
This is somewhat circuitous – but I want to approach the question of Reinforcement Learning from Human Feedback (RLHF) by way of recent work on algorithmic transparency. So bear with me… RLHF is currently all the rage in improving large language models (LLMs). Basically, it’s a way to try to deal with the problem that LLMs aren’t referentially grounded, which means that their output is not in any direct way connected to the world outside the model.
LLMs train on large corpora of internet text – typically sources like Wikipedia, Reddit, patent applications and so forth. They learn to predict what kinds of text are likely to come next, given a specific input text. The results, as anybody who has sat down with ChatGPT for long knows, can be spectacular. Those results also evidence that the models function, in one paper’s memorable phrasing, as “stochastic parrots.” What they say tracks what their training data makes most likely, not what is, say, contextually appropriate. But appropriate human speech is context-dependent, and answers that sound right (in the statistical sense: these words, in general, are likely to come after those words) in one context may be wrong in another (because language does not get used “in general”). RLHF is designed to get at that problem, as a blogpost at HuggingFace explains:
“Wouldn't it be great if we use human feedback for generated text as a measure of performance or go even one step further and use that feedback as a loss to optimize the model? That's the idea of Reinforcement Learning from Human Feedback (RLHF); use methods from reinforcement learning to directly optimize a language model with human feedback. RLHF has enabled language models to begin to align a model trained on a general corpus of text data to that of complex human values.”
The core idea is simple (for much more detail, read the rest of the HuggingFace post): the model (or different models, or different versions of a model) generates two responses to a prompt, and a person then rates one of them as more appropriate than the other. Iterate this process enough times, and the model will eventually learn quite a lot about context-appropriate speech. It will thus perform better. We can therefore understand RLHF as a technique for rooting out bad responses, which in this sense are artifacts, defined as “a problem in functioning, a result or behavior that the designers did not expect or want” (Creel, cited below, 571). Inappropriate responses are artifacts, and RLHF assists the system in not producing them.
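To make the mechanics a little more concrete, here is a minimal sketch of the comparison step – in the spirit of the HuggingFace post, not any particular lab’s implementation. Human raters produce (preferred, rejected) pairs, and a small “reward model” is trained so that preferred responses score higher; the class names, loss function, and feature dimensions below are illustrative assumptions, not a real API.

```python
# Illustrative sketch of the pairwise-preference step behind RLHF.
# Assumes responses have already been encoded as fixed-size feature vectors;
# names and dimensions are placeholders, not any library's actual API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a response; training pushes human-preferred responses to score higher."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: rank the chosen response above the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy training step over a batch of human comparisons (random features stand in
# for encoded "chosen" and "rejected" responses).
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()
```

Once enough comparisons have been collected, the learned scorer stands in for the human raters at scale – which is exactly the move Dzieza describes below.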
Set that aside for just a moment. I just finished reading an excellent paper by Kathleen Creel that gets at the problem of algorithmic opacity/transparency. Algorithmic opacity tends to refer to the fact that it’s not possible for humans to understand or explain what algorithmic systems are actually doing (though there are other versions, as when the workings are hidden behind trade secret law; this excellent paper is a good primer and a standard cite). This is an especially acute problem when you’re dealing with neural networks with gazillions of weighted parameters. Even if you could somehow explain what the model did, it likely wouldn’t be understandable to people, even real experts. Creel suggests that this version of algorithmic opacity (or getting past it with transparency) isn’t the only one we can use, and that disaggregating it into different versions of transparency can be useful:
- Functional transparency occurs “when it is possible to know the high-level, logical rules according to which the system will transform a given input into an output” (573).
- Structural transparency requires “knowledge of how an algorithm is realized in code” (575). Issues here are often about the interaction between components of the system, or conflicting expectations between, say, a Python programmer and the person who developed a package she is using to perform a standard calculation. Of course, structural transparency is very hard to achieve in neural networks, because even if we know how the algorithm works, we likely don’t know how it generates a given classification, much less its classifications more generally.
- Run transparency “requires analysis of one particular run on an individual machine using actual data” (580) and is useful in discovering difficulties in, say, the interaction between the programming language and the hardware on which it runs.
Thus the basic taxonomy. Late in the paper, Creel applies this to a couple of efforts at algorithmic transparency. One, post-hoc explanation, “aims to explain the predictions of a classifier by fitting a linear model to the pattern of its prediction given the input data” (582). This is basically constructing a simpler surrogate model of the algorithm’s behavior. The advantage is that the surrogate is human-comprehensible, at least for that case. The disadvantage is that it does not describe what the algorithm actually did. Nonetheless, it increases functional transparency insofar as it “succeeds in explaining the functioning of the algorithm on that particular decision, albeit in coarse-grain, human-interpretable terms” (583).
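To see what that looks like in practice, here is a rough sketch of a local, post-hoc linear explanation of the kind Creel describes (in the spirit of LIME-style methods, not her specific example). The black-box classifier, the perturbation scale, and the feature encoding are all stand-in assumptions.

```python
# Sketch of post-hoc explanation by local linear surrogate: perturb one input,
# ask the black box what it predicts, and fit a simple linear model to those
# predictions. `black_box_predict` is a placeholder for the opaque classifier.
import numpy as np
from sklearn.linear_model import Ridge

def local_linear_explanation(black_box_predict, x, n_samples=500, scale=0.1, seed=0):
    """Return human-readable feature weights for the black box's behavior near x."""
    rng = np.random.default_rng(seed)
    perturbations = x + scale * rng.normal(size=(n_samples, x.shape[0]))
    predictions = black_box_predict(perturbations)      # the model's actual outputs
    surrogate = Ridge(alpha=1.0).fit(perturbations, predictions)
    return surrogate.coef_   # these weights describe the surrogate, not the network
```

The coefficients explain that one decision in human-interpretable terms, but they describe the linear stand-in rather than what the network itself computed – which is precisely the gap Creel flags.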
Her other example is feature reconstruction through visualization. Here the example is Google’s DeepDream image recognition software. Google devised an algorithm to reconstruct what image features DeepDream used to classify an object (a toy version of the procedure is sketched below): “for each label that DeepDream recognized, they iteratively fed the system a white noise image and asked it to determine which of two random modifications of that image were closer to its understanding of ‘dumbbell’” (585). What the researchers learned was that “the reversed images of dumbbells come with partial images of arms, [which] means that arm images are useful in raising the probability that this is an image of a dumbbell and therefore making a successful classification” (585). She then notes:
“However, perhaps because of implicit essentialism about categories, the Google researchers interpreted the presence of arms as an artifact: “The network failed to completely distill the essence of a dumbbell. Maybe it’s never been shown a dumbbell without an arm holding it. Visualization can help us correct these kinds of training mishaps” (Mordvintsev et al. 2015). The choice to eliminate the arms shows a commitment to the correctness of the researchers’ prior concept of the essence of dumbbell. Despite acknowledging elsewhere that the labels derived are cluster concepts and despite the status of dumbbells as an object created by humans for a purpose, the researchers are unwilling to accept lifting biceps as part of the functional concept of dumbbell” (585).
Creel then underscores that “the detection of artifacts is a normative decision based in part on what kinds of explanations satisfy humans” (585).
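Here is that toy version of the white-noise reconstruction described in the Mordvintsev quote: start from noise, propose two random tweaks, and keep whichever one the classifier scores higher for the target label. The `class_probability` function is a hypothetical placeholder for the model’s confidence in a label like “dumbbell”; the image size, step count, and noise scale are arbitrary choices.

```python
# Toy version of the reconstruction-by-comparison procedure described above:
# the image drifts toward whatever the classifier "thinks" the label looks like.
# `class_probability` is a placeholder for the model's confidence in the target label.
import numpy as np

def reconstruct_feature(class_probability, shape=(64, 64), steps=10_000, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    image = rng.random(shape)                      # white-noise starting point
    for _ in range(steps):
        a = np.clip(image + noise * rng.normal(size=shape), 0.0, 1.0)
        b = np.clip(image + noise * rng.normal(size=shape), 0.0, 1.0)
        # keep whichever random modification looks more like the label to the model
        image = a if class_probability(a) >= class_probability(b) else b
    return image   # for "dumbbell," this is where the phantom arms show up
```

If arms reliably raise the dumbbell score, arms will show up in the reconstruction – which is the observation Creel’s normativity point turns on.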
It seems to me that there is something important about RLHF here. RLHF is basically an artifact detection technique, insofar as it’s helping to root out inappropriate responses. But any sort of artifact detection depends on an implicit normativity, precisely based on “what kinds of explanations satisfy humans.”
The problem here is that RLHF work involves a shady, opaque (as in, not transparent) and precarious workforce, as noted in a long piece by Josh Dzieza. The same Turkers who detoxify the training data used for language models (at great personal cost: their job is literally to read and mark Nazi speech, child porn, etc. as such) are now being given RLHF tasks. The job in question is, in a sense, no different from any other form of annotation, except that in more complicated cases, it pays better. Here’s Dzieza’s account of one annotator:
“Each time Anna prompts Sparrow, it delivers two responses and she picks the best one, thereby creating something called “human-feedback data.” When ChatGPT debuted late last year, its impressively natural-seeming conversational style was credited to its having been trained on troves of internet data. But the language that fuels ChatGPT and its competitors is filtered through several rounds of human annotation. One group of contractors writes examples of how the engineers want the bot to behave, creating questions followed by correct answers, descriptions of computer programs followed by functional code, and requests for tips on committing crimes followed by polite refusals. After the model is trained on these examples, yet more contractors are brought in to prompt it and rank its responses. This is what Anna is doing with Sparrow. Exactly which criteria the raters are told to use varies — honesty, or helpfulness, or just personal preference. The point is that they are creating data on human taste, and once there’s enough of it, engineers can train a second model to mimic their preferences at scale, automating the ranking process and training their AI to act in ways humans approve of. The result is a remarkably human-seeming bot that mostly declines harmful requests and explains its AI nature with seeming self-awareness.”
As Dzieza notes, “Put another way, ChatGPT seems so human because it was trained by an AI that was mimicking humans who were rating an AI that was mimicking humans who were pretending to be a better version of an AI that was trained on human writing.”
He continues that RLHF is:
“so effective that it’s worth pausing to fully register what it doesn’t do. When annotators teach a model to be accurate, for example, the model isn’t learning to check answers against logic or external sources or about what accuracy as a concept even is. The model is still a text-prediction machine mimicking patterns in human writing, but now its training corpus has been supplemented with bespoke examples, and the model has been weighted to favor them. Maybe this results in the model extracting patterns from the part of its linguistic map labeled as accurate and producing text that happens to align with the truth, but it can also result in it mimicking the confident style and expert jargon of the accurate text while writing things that are totally wrong. There is no guarantee that the text the labelers marked as accurate is in fact accurate, and when it is, there is no guarantee that the model learns the right patterns from it”
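One simplified way to picture what that preference data does – setting full reinforcement-learning fine-tuning aside – is best-of-n reranking: generate several candidate responses and keep the one the learned reward model scores highest. Both functions below are hypothetical placeholders; the point is only that nothing in this loop consults logic or external sources.

```python
# Simplified stand-in for how preference data gets used: a reward model can only
# re-rank text the base model already produces; it never checks facts.
# `generate_candidates` and `reward_score` are hypothetical placeholders.
def best_of_n(prompt: str, generate_candidates, reward_score, n: int = 8) -> str:
    candidates = generate_candidates(prompt, n)
    # "Best" here means "most like what raters previously preferred,"
    # not "verified against the world."
    return max(candidates, key=reward_score)
```

This is a deliberate simplification – production systems typically fine-tune the model against the learned reward rather than rerank at inference time – but it makes the point of the passage vivid: the ranking rewards resemblance to preferred text, not accuracy.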
What I want to draw from this is that there is a problem of normativity in visualization that cuts across different techniques to reduce artifacts and other misfires of current machine learning. The essential rendering of a dumbbell as a discrete object is one example, and if you’ve read any Heidegger, you might wonder why we’d look at a dumbbell in isolation rather than seeing it as a tool being used – i.e., with the arm attached. Of course, either example presents a paradigmatic or model case and then judges images according to that standard. Bespoke uses of dumbbells will likely never be registered as such. In this sense, normativity is inescapable.
The normativity built into RLHF is harder to identify because it is more diffuse, spread out over however many Turkers and other gig workers. But that doesn’t mean it isn’t there, as Dzieza’s comments make clear. What is the normativity of the bespoke examples? They likely function in the aggregate, so we’re basically using RLHF to train the model to generate speech that is… acceptable to the average Mechanical Turker.
I was going to say something snarky about not knowing whether that’s a good idea or not, but I actually think there’s a non-snarky version of this, because we don’t actually know much about this precariat, except that they tend to be in lower-income countries. Except when they’re not. What are their individual preferences? How do they aggregate? Do they even care, or are they just pushing buttons to get paid?
Creel suggests that “visualization provided insight into features of the training data set, namely, the preponderance of biceps in dumbbell pictures, which influenced the outcome in ways the team deemed deleterious” (586). As she says, while this “allows for artifact detection of a kind, [it] obscures the algorithmic process by which the images are made” (586).
Something analogous is happening with RLHF. It’s providing a window into the training data for the LLM by showing us specific output examples, and then Turkers are using that as a form of artifact detection by labeling outputs as inferior/superior. But this leaves the original processes that generated that output obscure. It also introduces a new kind of opacity: the processes by which the RLHF humans generate their answers, and the ways that the model learns from them.
It seems to me that a few points should be underscored. First, the labor problem is real. Lots of discussions about the injustices perpetrated by AI or the harms it might do to marginalized groups fail to mention the significant exploitation of labor at its core. This exploitation is – outside of reporting like Dzieza’s, or Billy Perrigo’s at Time – largely invisible. There has also been relatively little scholarly attention to it – notable exceptions include parts of Kate Crawford’s Atlas of AI and Gray and Suri’s Ghost Work. Human labor is and always has been at the heart of AI, as a fantastic new book by Matteo Pasquinelli argues. To be sure, there have been efforts to automate tasks like image labeling (by using alt-text, as in the LAION dataset), but these introduce their own weird problems, as Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe demonstrated. Human involvement is here to stay, and the people doing it deserve better than they’re getting.
Second, if the connection I’m seeing is right, RLHF as an effort to remove artifacts is going to introduce some of the same kinds of problems we see with bias in supervised models, where contingent features of the dataset bias the entire model in specific and sometimes unpredictable ways. ImageNet is a mess, as Crawford has demonstrated and a recent Washington Post article illustrates. Image sets scraped from Flickr exhibit a rich-country bias (not surprisingly) – hammerhead sharks are trophies, lobsters are on plates, and soap is in a pump bottle with tiles around it. As the study notes:
“the object-classification accuracy in recognizing household items is substantially higher for high-income households than it is for low-income households. For all systems, the difference in accuracy for household items appearing in the lowest income bracket (less than US$50 per month) is approximately 10% lower than that for household items appearing in the highest income bracket (more than US$3,500 per month) …. the discrepancy [likely] stems from household items being very different across countries and income levels (e.g., dish soap) and from household items appearing in different contexts (e.g., toothbrushes appearing in households without bathroom); the ones that get their images from flickr undersample poorer, less dense places”
RLHF means that the language model is now supervised in some sense. What does this supervision mean? It’s hard to know. On the one hand, there are already concerns with the way LLMs operationalize “correct” or “high quality” language. On the other hand, the use of gig workers in RLHF introduces a normativity that is very difficult to quantify or measure, since so little is known about those workers.
In any case, and third, it’s important to recognize that this implicit normativity is simply a given. Language is uttered by speakers in specific contexts to other speakers in specific contexts, and this is something that LLMs are not equipped to handle particularly well, except insofar as they are meant to produce a language that’s a “blurry JPEG of the web.” But the problem there is not just that nobody speaks that language; it’s that it also has, unavoidably, an implicit normativity. Recent work suggests that RLHF may make some headway on the “grounding problem” for LLMs – their lack of referentiality. I think that’s right, but we need to recognize that it comes at a cost – the mediating, grounding layer itself is, in a strange way, partly ungrounded.