By Gordon Hull
Machine Learning (ML) applications learn by repetition. That is, they come to recognize what, say, a chair looks like by seeing lots of images of chairs that have been correctly labeled as such. Since the machine is trying to figure out a pattern or set of characteristics that distinguishes chairs from other objects, the more chairs it sees, the better it will perform at its task. This collection of labeled chairs is the algorithm’s “training data.” There are notorious problems with bias in training data due to underrepresentation of certain groups. For example, one study found that object-recognition systems trained on standard datasets performed poorly on images from developing countries, most likely because images from those places were underrepresented; combined with labels written in English and the fact that standard Western commodity forms of household objects can look very different in the developing world, the ML was stumped. Most famously in the United States, facial recognition software performs best at identifying white men and worst at identifying Black women. ImageNet, which is widely used for object recognition, employs a hierarchy of labels that includes calling a child wearing sunglasses a “failure, loser, non-starter, unsuccessful person” while also differentiating between assistant and associate professors. Despite these and many other problems, massive datasets are essential for training ML applications.
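To make the training-by-repetition idea concrete, here is a minimal, purely illustrative sketch in Python (using scikit-learn). The data here is synthetic and the “chair” labels are hypothetical placeholders, not any real dataset: the point is simply that a classifier is fit to a large collection of labeled examples, and its behavior reflects whatever that collection contains.

```python
# Illustrative sketch: supervised learning from labeled examples.
# The data is synthetic; in practice the feature vectors would be derived
# from images and the labels ("chair" / "not chair") supplied by human
# annotators -- the "training data" discussed above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for image features: 1,000 examples, 64 features each.
X = rng.normal(size=(1000, 64))
# Hypothetical labels: 1 = "chair", 0 = "not chair".
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model only learns the patterns present in the labeled collection;
# whatever is underrepresented there, it will handle poorly.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

The bias problems described above arise at exactly this step: if certain groups, places, or object forms rarely appear among the labeled examples, the fitted model has little basis for recognizing them.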
For this reason, datasets have been called “infrastructural” for machine learning, defined as follows:
“Infrastructure is characterized, we argue, by a set of core features: it is embedded into, and acts as the foundation for, other tools and technologies; when working as intended for a particular community, it tends to seep into the background and become incorporated into routines; the invisibility of infrastructure, however, is situated - what is natural or taken for granted from one perspective may be highly visible or jarring from another; though frequently naturalized, infrastructure is built, and thus inherently contextual, situated, and shaped by specific aims.”
If AI is the cars and buses and trains that do what we want, the datasets it trains on shape the roads and paths and determine where those roads go, provide their material basis, become thereby incorporated into higher-level routines like search, and tend to disappear into the background when not actively used. But like other examples of infrastructure – say, the low bridges over Long Island’s parkways – datasets can embed priorities and affordances. In this sense, dataset infrastructures have a politics, and serve as platforms on which applications are built.