By Gordon Hull
Machine Learning (ML) applications learn by repetition. That is, they come to recognize what, say, a chair looks like by seeing lots of images of chairs that have been correctly labeled as such. Since the machine is trying to figure out a pattern or set of characteristics that distinguishes chairs from other objects, the more chairs it sees, the better it will perform at its task. This collection of labeled chairs is the algorithm’s “training data.” There are notorious problems with bias in training data due to underrepresentation of certain groups. For example, one study found that datasets designed to train ML to recognize objects performed poorly on images from developing countries, most likely because those places were underrepresented; combine that with labels written only in English, and the fact that standard Western commodity forms of household objects can look very different in the developing world, and the ML was stumped. Most famously in this country, facial recognition software performs best at identifying white men and worst at identifying Black women. ImageNet, which is widely used for object recognition purposes, employs a hierarchy of labels that includes calling a child wearing sunglasses a “failure, loser, non-starter, unsuccessful person” but also differentiates between assistant and associate professors. Despite these and many other problems, massive datasets are essential for training ML applications.
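To make that dependence on training data concrete, here is a minimal sketch of my own (not from any of the studies above) of a supervised classifier. The feature vectors, class labels, and numbers are all invented for illustration; the point is only that the model can recognize nothing beyond what its labeled examples show it.

```python
# Illustration only: a toy "chair recognizer" trained on a skewed labeled dataset.
# The features here are made-up 2-D vectors standing in for image features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# The training set contains only one kind of chair (say, a standard Western
# commodity form) plus assorted non-chairs.
western_chairs = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(500, 2))
other_objects = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(500, 2))

X_train = np.vstack([western_chairs, other_objects])
y_train = np.array([1] * 500 + [0] * 500)  # 1 = "chair", 0 = "not chair"

clf = LogisticRegression().fit(X_train, y_train)

# A chair whose appearance differs from the training distribution
# (a form the dataset never included) gets misread.
unfamiliar_chair = np.array([[-1.5, 0.5]])
print(clf.predict(unfamiliar_chair))  # -> [0], "not chair": the model only
                                      # knows what its training data showed it
```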
For this reason, datasets have been called “infrastructural” for machine learning, defined as follows:
“Infrastructure is characterized, we argue, by a set of core features: it is embedded into, and acts as the foundation, for other tools and technologies; when working as intended for a particular community, it tends to seep into the background and become incorporated into routines; the invisibility of infrastructure, however, is situated - what is natural or taken for granted from one perspective may be highly visible or jarring from another; though frequently naturalized, infrastructure is built, and thus inherently contextual, situated, and shaped by specific aims.”
If AI is the cars and buses and trains that do what we want, the datasets it trains on are the roads and paths that determine where those vehicles can go; they provide AI’s material basis, become thereby incorporated into higher-level routines like search, and tend to disappear into the background when not actively attended to. But just like other infrastructure – say, the low bridges over the Long Island parkways – datasets can embed priorities and affordances. In this sense, dataset infrastructures have a politics, and serve as platforms on which applications are built.
The analogy is not perfect, but it’s good enough to do the work I need it to do here, which is to note that platform sites can be governed in different ways, and some of those ways are better suited to public values than others. Since AI is becoming more and more public, perhaps it’s time to start thinking about how we regulate datasets as infrastructure. Rather than make the move from infrastructure to platform, I’d like to go the other way, because centering the word “platform” highlights that there are governance strategies available. In a recent paper, I distinguished between three types of platform governance: infrastructure, modulation, and portal.
Infrastructure governance understands the Internet as a common pool or public resource, on the model of traditional infrastructures like roads and bridges. Modulation understands Internet governance as traffic shaping. Portal governance understands the Internet as creating a user experience that facilitates data mining. Modulation is what ISPs do in the absence of net neutrality rules, when they make traffic to preferred sites flow faster, for example. Portal is like what you see on FB or Google, where your experience is determined by data extraction. It is infrastructure that is of interest here.
The infrastructural model was born in the era of public utilities, and its premise is nondiscriminatory access: it doesn’t matter why you are at the public park, as long as you’re there (for an otherwise legal purpose). This doesn’t mean you can always be at the park; what it does mean is that you are to be treated the same as everybody else. If the park closes at 10pm, it closes for everyone, not just for poor people. Infrastructure became an even bigger deal as public utilities became social necessities. As K. Sabeel Rahman argues, utility regulation emerged out of the Progressive Era, and was marked by a concern to reconcile the private power of utility companies with the social necessity of the services they provided. Progressive “thinkers saw public utilities as required where a good was of sufficient social value to be a necessity, and where the provision of this necessity was at risk of subversion or corruption if left to private or market forces” (115). Designated as common carriers, utilities “were subject to special restrictions, such as the duty to provide a service once undertaken, to serve all comers, to demand reasonable prices, and to offer acceptable compensation” (115).
What if we were to think about datasets as utilities? Would that suggest rules analogous to common carrier provisions, such that datasets could be required to come with extensive documentation about where their data came from? Could we require certain levels of representativeness in data? These requirements would be justified because the value of infrastructure is in the productive activities it enables: the value of ImageNet is not in the dataset itself, but in the kinds of AI applications it enables.
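To make the documentation idea a bit more concrete, here is a sketch (my illustration, not a concrete regulatory proposal) of the kind of provenance record such a requirement might demand, loosely in the spirit of “datasheets for datasets.” All of the field names and example values are hypothetical.

```python
# Hypothetical sketch of mandated dataset documentation; every field and
# value below is invented for illustration.
from dataclasses import dataclass, field

@dataclass
class DatasetDocumentation:
    name: str
    collected_by: str
    collection_method: str          # e.g., "web scrape", "field photography"
    labeling_process: str           # who labeled the data, in what language(s)
    geographic_sources: dict = field(default_factory=dict)  # region -> share of samples
    known_gaps: list = field(default_factory=list)

doc = DatasetDocumentation(
    name="example-object-recognition-set",
    collected_by="example consortium",
    collection_method="web scrape of English-language image searches",
    labeling_process="crowdworkers, English labels only",
    geographic_sources={"North America": 0.6, "Europe": 0.3, "Rest of world": 0.1},
    known_gaps=["household objects as they appear outside wealthy Western markets"],
)
print(doc.known_gaps)
```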
Current research shows that the datasets most in use are of limited and shrinking number, and are located in a few mostly Western tech hubs – and that’s the other way they’re infrastructural: they respond to scale. It may well be that for most applications something like a “natural monopoly” exists for datasets, in that you want as large and diverse a dataset as possible for, say, natural language processing. Having multiple smaller ones won’t be as efficient, in the same way that having multiple utility companies is not efficient. This is at least the premise that ML research seems to be operating under for now, as training datasets are getting large enough that dissenting voices are worrying about their carbon footprint. If that is the case, then it is increasingly important that the datasets serve the public interest, for the same reason that it’s important that the power company serves the public interest. The other feature of datasets that is relevant here is that most are produced in wealthy, Western countries. That’s fine until you think of AI or the Internet as a global resource. If datasets are infrastructure, and infrastructure is about nondiscriminatory access, then they absolutely must diversify the data they contain. It won’t do to include only the utensils and soap dispensers in use in Palo Alto, and it ought to be possible to add more diverse data to datasets.
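One small piece of what “diversifying the data” could look like in practice is an audit that flags regions contributing less than some floor share of a dataset’s samples. This is a minimal sketch of my own; the threshold and the counts are made up.

```python
# Illustration only: flag underrepresented regions in a dataset's provenance counts.
from collections import Counter

def underrepresented_regions(region_counts, floor=0.05):
    """Return regions contributing less than `floor` of all samples."""
    total = sum(region_counts.values())
    return [region for region, n in region_counts.items() if n / total < floor]

counts = Counter({
    "North America": 6000,
    "Europe": 3000,
    "South Asia": 400,
    "Sub-Saharan Africa": 100,
})
print(underrepresented_regions(counts))  # ['South Asia', 'Sub-Saharan Africa']
```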
I’m not certain how you could pursue this as a regulatory matter (if I were, this would be an article and not a blog post, at the very least!). But I am becoming convinced that the alternative is that yet another part of the economy turns into platforms governed by modulation: you enter a world of language recognition where the options are limited and you have to communicate in those limited ways, where you’re nudged into thinking that the kid with sunglasses is a “loser” and that associate professors are ontologically different from assistant professors. In short, a whole business model built (as Kate Crawford argues) on extraction. That’s a world we want to avoid, or at least try to get out of, if we actually care about AI doing good things, and not exacerbating existing inequalities.