By Gordon Hull
Last time, I followed a reading of Kathleen Creel’s recent “Transparency in Complex Computational Systems” to think about the ways that RLHF (Reinforcement Learning from Human Feedback) in Large Language Models (LLMs) like ChatGPT necessarily involves an opaque, implicit normativity. To recap: RLHF improves the models by involving actual humans (usually gig workers) in their training: the model presents two possible answers to a prompt, and the human tells it which one is better. As I suggested, and will pursue in a later post, this introduces all sorts of weird and difficult-to-measure normative aspects into model performance, above and beyond those lurking in the training data. Here I want to pause to consider this as a question of opacity and transparency. I’ll end by proposing that there’s a fourth kind of transparency that we should care about, for both epistemic and moral reasons, which I’ll call “curation transparency.”
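For readers who want the mechanics behind that comparison step, here is a minimal sketch of how those human judgments are typically used: a reward model is trained so that the response the worker marked “better” scores higher than the one marked “worse.” The names (reward_model, chosen, rejected) are illustrative placeholders, not drawn from any particular company’s codebase.

```python
# Minimal sketch, assuming a PyTorch-style reward model that maps a
# batch of responses to scalar scores. This is the standard pairwise
# (Bradley-Terry style) preference loss used in RLHF pipelines.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Push the score of the human-preferred response above the other one."""
    r_chosen = reward_model(chosen)      # scalar score per example
    r_rejected = reward_model(rejected)  # scalar score per example
    # Negative log-probability that the model agrees with the human's choice
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The point to hold onto is that nothing in this loss says *why* one response is better; the normative judgment lives entirely in the workers’ choices that generate the (chosen, rejected) pairs.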
Creel, as I noted, distinguishes three types of transparency, and the question I want to pursue here is whether the implicit normativity should be characterized as a fourth. The three are functional, structural and run transparency. Run transparency, which “requires analysis of one particular run on an individual machine using actual data” (580), seems like the closest fit here: an individual LLM uses its training data, and RLHF basically embeds an artifact detection process into the training. The process is not too different from the one Creel describes, in which Google determined, by adding pixels to an initial image of white noise, that an image recognition system weighted the presence of an arm heavily in classifying a dumbbell. “For each label that DeepDream recognized, they iteratively fed the system a white noise image and asked it to determine which of two random modifications of that image were closer to its understanding of ‘dumbbell.’” (585). They then took this association to be an undesirable artifact (presumably derived from the training data, which would have involved mostly pictures of people at the gym: how often do you see a dumbbell somewhere else, or pictured not in use?) and corrected it.
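The iterative procedure Creel describes can be sketched as a simple comparison-driven search: start from white noise, propose two random perturbations, and keep whichever one the classifier scores higher for the target label. The function `score_for_label` below is a stand-in for whatever classifier is being probed, not DeepDream’s actual interface; this is a rough sketch under that assumption.

```python
# Rough sketch of comparison-driven feature visualization: repeatedly ask
# the model which of two random tweaks looks more like the target label.
import numpy as np

def visualize_label(score_for_label, shape=(224, 224, 3), steps=1000, noise=0.05):
    image = np.random.rand(*shape)  # start from white noise
    for _ in range(steps):
        a = np.clip(image + noise * np.random.randn(*shape), 0, 1)
        b = np.clip(image + noise * np.random.randn(*shape), 0, 1)
        # keep whichever modification the classifier rates as more label-like
        image = a if score_for_label(a) > score_for_label(b) else b
    return image  # what the model "thinks" the label looks like (arms and all)
```

Running this for “dumbbell” is what surfaced the arm artifact: the image that emerges shows what the model has actually learned to associate with the label.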
In the case of RLHF, the human worker sees two given outputs, and picks one as preferable. The human thus does the work of the Google engineers who decided that the better version of a dumbbell did not have human arms, and in so doing, tweaks the model to perform better. Suppose, for example, that a multimodal model is trained on the LAION dataset. We know from work led by Abeba Birhane that the LAION dataset, which is based on alt-text and images, disproportionately depicts Latina women pornographically. This, one assumes, is a problem in the data: for whatever reason, porn sites are more likely to use alt-text with their images than other places that depict Latina women. An RLHF session might involve a worker getting two samples in response to the prompt “show me a Latina woman.” One of them would be pornographic and the other not, and the human would mark the second as better. More likely, this would involve a lot of steps: Birhane indicated that the database was a complete disaster on this point, so there would likely be a lot of iterations involving “more” or “less” pornographic images until enough sessions guided the model toward satisfactory performance.
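What each of those iterations actually produces is just a preference record. The fields below are invented for illustration; they are not any vendor’s schema, but they show how thin the recorded trace of the worker’s judgment is compared to the judgment itself.

```python
# Hypothetical record of a single comparison in a session like the one
# imagined above. Over many such records, "more vs. less" judgments
# accumulate into the preference data that steers the model.
comparison = {
    "prompt": "show me a Latina woman",
    "output_a": "<image id A>",        # placeholder identifiers
    "output_b": "<image id B>",
    "preferred": "b",                  # the worker's judgment
    "annotator_id": "worker-031",      # rarely disclosed in practice
    "instructions_version": "unknown", # ditto
}
```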
This is thus recognizable as a technique for addressing problems of run transparency. Creel notes two examples of the breakdown of run transparency into opacity. The second is closest to what the RLHF workers are fixing and involves situations in which:
“Features of the data for which the programmers did not account can also interact with the program to create artifacts. Although the problem may be located in the data, in an opaque system it is more difficult to detect. Increasing run transparency can reveal previously undetected problems in the interactions between input data and the software” (580)
She points to the much-discussed COMPAS pretrial detention algorithm as an example. That program basically categorized minority defendants as higher risk than they turned out to be, because they were overrepresented in the training data. But this was because minorities are overpoliced. Here, pornographic images are unexpectedly overrepresented in the training data for LAION, or arms are overrepresented in the training data for dumbbells.
I emphasize the LAION dataset, despite how uncomfortable the example potentially is, to underscore four things. First, human annotators and RLHF workers do a lot. They are subjected to a lot. Much of this work can be damaging. Second, datasets are weird and often bad – and in complicated ways. Third, the stipulation that dumbbells do not involve arms, and the stipulation that Latina women are not appropriately represented pornographically, are both normative. The LAION case shows us – much like Safiya Noble’s work on search autofill – that a lot of what is out there on the Internet is a toxic mess. The non-toxic version involves overriding the views of a lot of people who put things online. Fourth, this sort of cleanup matters because when real people interact with language models, multimodal datasets, search autofill, etc., they are going to be subject to a lot of toxicity if the models aren’t cleaned up. Someone – end users, or RLHF workers (or other people who curate the datasets) – is going to be subjected to that toxicity. Usable AI, at least at this juncture, depends on causing either gatekeepers or end users to suffer.
So much for the run transparency problem generated by opacity in the dataset. But what about the RLHF fix? We don’t, as I argued last time, know a lot about what the workers are doing beyond being able to say that they annotate. Yet they are integral to the run of the model. Do they introduce their own opacity?
Creel’s other example of run opacity is the corruption of telescope data sensors by cosmic rays:
“One hardware problem that run transparency illuminates is the corruption of sensitive detector equipment by cosmic rays. When balloon-borne telescopes are launched to measure cosmic microwave background radiation, powerful cosmic rays flip the telescopes’ bits and corrupt their data. Therefore, balloon experiments use special “space-qualified” hardware: circuits and logic gates that are less susceptible to bit flipping and can detect and repair the flips that do occur. Changing the system level implementation and hardware components protects the system from errors that would be difficult to detect with access only to the algorithmic or structural levels. Knowledge about the hardware and the state of the data stored in a particular run is necessary in order to pinpoint the bit-flipping errors generated by cosmic rays” (580, citation omitted)
RLHF workers are similarly embedded in the system and correct problems as they occur. I know the analogy is far from exact, but I want to pick up on a point: we have a correct model of what the hardware is supposed to do. This can be specified as a matter of logic or engineering. So detecting and correcting the flips is itself a transparent process.
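A toy example makes the contrast vivid. Real space-qualified hardware uses more sophisticated error-correcting codes, but even the simplest scheme, storing each bit three times and taking a majority vote, shows why the correction is transparent: the correct behavior is fully specified in advance, so detecting and repairing a flip is just logic.

```python
# Triple modular redundancy: store each bit three times; a single
# cosmic-ray flip in one copy is outvoted by the other two.
def majority_vote(bits_a, bits_b, bits_c):
    return [1 if (a + b + c) >= 2 else 0
            for a, b, c in zip(bits_a, bits_b, bits_c)]

stored = [1, 0, 1, 1]
corrupted = [1, 1, 1, 1]  # one bit flipped in this copy
print(majority_vote(stored, corrupted, stored))  # -> [1, 0, 1, 1]
```

There is no analogous specification of what an RLHF worker’s “correct” judgment is supposed to be.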
None of that corrective transparency applies to RLHF workers. Some of the reasons were the ones I looked at last time; next time, I want to develop some thoughts about this as a matter of language. Here, the question is whether the opacity introduced by RLHF workers needs to be viewed as another axis of transparency/opacity. My impulse is that it does, not necessarily because we can’t view this under Creel’s category of run transparency, but because human intervention is so pervasive in the operation of LLMs and other AI. That intervention is generally used to correct problems in model runs or data and so fits with Creel’s category of run transparency. But it also introduces its own opacity because we can’t necessarily transparently say what the humans are doing, even if we know they’re doing it. Given the combination of the normative significance of what the humans are doing (in terms of toxicity etc.) and the difficulty in saying precisely what that is, it seems to me to be worth adding a term.
I propose “curation transparency.” A system is curation transparent if we know how it’s being curated by people: who’s doing the curation and how they’re doing it. If they are underpaid Turkers in Kenya who screen the dataset for child pornography, knowing that is essential for achieving curation transparency. If they are RLHF workers who are telling it which of two outputs is better, knowing that is essential for achieving curation transparency. And we need to know what these workers are told to do, and have some way of modeling their output. The AI companies try very hard to obscure this sort of information, probably because it undermines the myth of autonomous, intelligent systems to have to admit that the system won’t produce acceptable output without several thousand people telling it what’s acceptable or not. For that reason, curation transparency is something we should demand. It’s both an epistemic and a moral question.
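To make the demand a little more concrete, here is a minimal sketch of the kind of disclosure curation transparency might require. The schema is invented for illustration; it is not an existing standard, and the fields simply track the questions raised above: who curates, under what instructions, at what scale, and how their judgments are combined.

```python
# Hypothetical disclosure record for one curation pipeline feeding a model.
from dataclasses import dataclass

@dataclass
class CurationRecord:
    task: str          # e.g. "pairwise RLHF comparison", "dataset screening"
    workforce: str     # who does it: vendor, location, employment terms
    instructions: str  # what the workers are told to do
    compensation: str  # what they are paid for the work
    volume: int        # how many judgments feed the deployed model
    aggregation: str   # how individual judgments are combined into training signal
```

Even this much would give us a start on modeling what the curators are actually doing to the system, rather than treating their labor as invisible.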