A recent paper by Ermanno Bencivenga in Philosophical Forum argues that it’s “time for philosophy to step into the conversation” (135) about big data, in particular to refute the thesis, which the article identifies in a 2008 piece in Wired, that big data will mean that we no longer need theory: “with enough data, the numbers speak for themselves” (qtd. on 135). The paper draws on concerns about spurious correlations: to demonstrate that a correlation is legitimate, it “must be shown to manifest a lawlike regularity; there must be a theoretical account of it,” that laws have to cohere with one another, and so on (139). In other words, “knowledge is constitutionally dependent on theory” (ibid.). Bencivenga concludes:
“Big Data enthusiasts are (unwittingly) advocating a new definition of what it is to know. Their agenda is (unwittingly) semantical. Except that it is not worked out, and any attempt at developing it in the semantical terms that have been current (and antagonistic) for the past two millennia is hopeless. I will not rule out that a new set of terms might be forthcoming, but the burden is / on those enthusiasts to provide it; simply piling up data and being awed by them will not do. What would be needed, ironically, is a new theory of knowledge, which so far I have not seen. This is the reason why I have made an effort to get clearer about the claims being made, so that we can have a more orderly discussion of them and what it would take to make progress in it” (141-2).
Fair enough, though I do want to note that the paper does not engage with any literature about big data other than the dated Wired piece; and hearing enthusiastic techno-babble from Wired is not surprising. That’s what they do.
It’s also worth pointing out that these sorts of concerns have been expressed before. Here are danah boyd and Kate Crawford from a widely-cited paper in 2012. After noting that “Big data reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and categorization of reality” (665), they caution that:
“Interpretation is at the center of data analysis. Regardless of the size of a data set, it is subject to limitation and bias. Without those biases and limitations being understood and outlined, misinterpretation is the result. Data analysis is most effective when researchers take account of the complex methodological processes that underlie the analysis of that data” (668).
These are cherry-picked quotes – arguably, the entire paper is a response to the sort of enthusiasm in the Wired piece, and the focus on our hidden rules for interpretation is clearly directed at the view that somehow data self-interprets. And there are certainly more papers that bring up this and analogous topics; Luciano Floridi raises similar ones here (Floridi’s worries resonate well with Amoore’s, discussed below). That said, I think there’s something to be said for speaking of big data in Kantian terms, though not perhaps for the reasons Bencivenga advances.
First, Bencivenga is undoubtedly correct about the epistemology from a Kantian point of view. The enthusiast is basically making a Humean argument but without the skepticism: if things happen together often enough, then you have causal knowledge. As Kant quips in critique of Locke, universal concepts “must be in a position to show a certificate of birth quite other than that of descent from experience” (CPR B119). Or, as Kant says a few pages later, we have to say that causality “must either be grounded completely a priori in the understanding, or must be entirely given up as a mere phantom of the brain” (B123-4). Causality demands necessity, and experience can’t get there.
Of course, the data enthusiast isn’t making a claim of causal necessity but a statistical one. This doesn’t help as much as one might think; the data enthusiast still needs some epistemology, substantially because, as Bencivenga notes, she offers no way to differentiate between spurious and non-spurious correlations. Critics have noted this problem before, of course (as for example this 2006 piece, with regard to genomic data). But the data enthusiast seems more or less in the position of advocating what Kant calls “fate” (B116-17): all things are explained by the data, no matter what we observe. As Peter Thielke notes, “the notions of fate and fortune … stand as cautionary examples of concepts gone bad, and the labors of the Deduction seem designed to guarantee that the categories do not suffer the same downfall” (440). Thielke’s example: suppose a ball breaks a window. “The problem … is that an invocation of fate would work equally well as an explanation of why the window broke or why it didn’t break” (449). Analogously, if the data says the stock market fluctuates in correlation with the price of butter in Bangladesh (this is a published reductio that boyd and Crawford cite), then any change in the stock market can be explained. If stock prices fluctuate in tandem with the price of butter, then the correlation appears right. If they do not, the correlation is still correct, since it was never a 100% correlation. In other words, a claim about statistical correlation can’t really be invalidated by any given event, since the claim was never causal in the first place. Kant explains the need for the categories in juridical terms, and it’s perhaps worth noting that recent work proposes that our juridical mind is resistant to statistics: intuitively, the philosophical “we” would much rather have a 70% reliable witness to an event than a 70% reliable statistic. Part of the reason is that we can explain why the witness is wrong, if she is. The statistic isn’t “wrong,” whichever statistical group the event in question falls into.
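The point about spurious correlation is easy to see computationally. As a rough illustration (my own toy example, not drawn from any of the papers discussed): generate a few dozen mutually independent random series, and the strongest pairwise correlation among them will typically be very high, even though by construction it means nothing. The "price of butter in Bangladesh" problem in miniature:

```python
# Toy illustration: with enough unrelated data series, strikingly
# high correlations appear by chance alone.
import random

random.seed(0)

def random_walk(n):
    """An n-step random walk: each step is +1 or -1."""
    xs, x = [], 0.0
    for _ in range(n):
        x += random.choice([-1.0, 1.0])
        xs.append(x)
    return xs

def pearson(a, b):
    """Plain Pearson correlation coefficient, no libraries needed."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# 50 mutually independent "indicators", 100 observations each.
series = [random_walk(100) for _ in range(50)]

# The strongest correlation among the 1,225 possible pairs.
best = max(
    abs(pearson(series[i], series[j]))
    for i in range(50) for j in range(i + 1, 50)
)
print(f"strongest correlation among independent series: {best:.2f}")
```

The number printed is almost always well above anything a social scientist would get excited about, which is exactly the enthusiast’s problem: the data alone cannot tell you which of these correlations is worth anything.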
However, the vast majority of what I’ve seen about data analytics takes the activity not as knowledge generating in any traditional sense, but as action guiding (though a quick search of PhilPapers suggests that the philosophy of science literature is picking up steam). In other words, the question is not whether we “know” that people who like Hello Kitty tend to be low on emotional stability (that correlation is reported here). It’s that we want to know if the correlation is robust enough to justify some sort of action. For example, if there is a 70% correlation between ball-throwing and window-breaking, we might well want a rule against throwing balls near windows. Here, there is indeed theory, but it’s theory (implicit or otherwise) about how to act, and the problems confronted there are normative. For example, do the various aspects of generating the system act in ways that will reproduce racial and other social biases? How strong does the correlation need to be to justify doing something? (I remember a presentation by a representative of a company that markets data to online retailers, who pointed out that even a 1% improvement in click-through rates on online advertising could be worth a lot of money to a retailer. How that 1% happened would be irrelevant. In medicine, we need better than 1%. Etc.) Ethnographic research shows in granular detail the kinds of decisions that have to be made in designing systems, and they all involve judgments about the relative importance of data points, and about what kind of information the system should consider as data and what it should ignore.
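The click-through point is just arithmetic, but it is worth spelling out. A back-of-the-envelope sketch (all of the numbers below are invented for illustration, not from the presentation I recall):

```python
# Why a 1% *relative* lift in click-through rate can justify action
# in retail, even though the same signal would be useless in medicine.
# All figures are hypothetical.
impressions = 1_000_000_000    # ads shown per year
base_ctr = 0.02                # baseline click-through rate (2%)
lift = 0.01                    # a 1% relative improvement
revenue_per_click = 0.50       # average value of a click, in dollars

extra_clicks = impressions * base_ctr * lift
extra_revenue = extra_clicks * revenue_per_click
print(f"extra clicks: {extra_clicks:,.0f}")
print(f"extra revenue: ${extra_revenue:,.2f}")
```

At this (made-up) scale, a 1% lift is worth roughly $100,000 a year, and nobody needs to know *why* it worked. The same 1% effect on a treatment outcome would clear no clinical threshold at all; the normative stakes, not the statistics, set the bar.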
Louise Amoore offers a lucid presentation of the issues involved in the context of security. The basic point is that the effort is to produce actionable information. At the end of the day, this effort makes irrelevant virtually any sort of epistemological claim about the things the data represents (and there is no claim that the things the data represents are the “things themselves”). In a way, there is a thoroughgoing Kantianism in that regard:
“The pre-emptive deployment of a data derivative does not seek to predict the future, as in systems of pattern recognition that track forward from past data, for example, because it is precisely indifferent to whether a particular event occurs or not. What matters instead is the capacity to act in the face of uncertainty, to render data actionable” (29).
And:
“To put the matter simply, it is of lesser consequence whether data accurately captures a set of circumstances in the world than whether the models can be refined for precision. If the governing by norm we associate with census and survey required the ‘large number’ collection of data in order empirically to identify patterns, validate and calculate – such as for example in the prudentialism of actuarial models and insurance calculation – the mobile [i.e., constantly changing and being revised – GH] norms of data derivatives are oriented not to / the conventional archive and collection but to discarding” (32-3).
In other words, the system makes no claim to represent any specific individual – all it wants to do is indicate when some threshold condition is met for action. As Amoore puts it, “like its commercial origins in retail data mining, the data derivative does not seek out the settled categories of ‘this customer’, ‘this traveler’, ‘this migrant’ or ‘that visa applicant’, but instead wants to recognize bodies and objects in movement, in and through their very transaction” (36). The threshold condition for action is always in the process of being changed and revised on the basis of incoming data (including false positives), but that data may or may not have any representational content (in the sense of presenting an object to the intellect). What is presented is a guide to judgment. Amoore thus suggests that the critical questions that need asking are all normative: it is surely “a primary task for critical enquiry - to uncover and probe the moments that come together in the making of a calculation that will automate all future decisions,” and she calls for “attention to be paid to the specific temporalities and norms of algorithmic techniques that rule out, render invisible, other potential futures” (38).
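The mechanism can be caricatured in a few lines of code (my own hypothetical sketch, not anything from Amoore’s paper): a scoring rule flags whatever crosses a threshold, and the threshold itself is continually revised by feedback, including false positives. Nothing in it represents any particular person; it only gates action.

```python
# Hypothetical sketch of a "data derivative"-style flagging rule:
# it never models an individual, only a moving threshold for action.
def make_flagger(threshold, step=0.05):
    state = {"t": threshold}
    def flag(score):
        # Act when the score crosses the current threshold.
        return score >= state["t"]
    def feedback(was_false_positive):
        # A false positive raises the bar; a confirmed hit lowers it,
        # making the rule more sensitive.
        state["t"] += step if was_false_positive else -step
    return flag, feedback

flag, feedback = make_flagger(0.7)
print(flag(0.75))        # True: above the current threshold
feedback(True)           # a false positive raises the threshold to 0.75
print(flag(0.72))        # False: the same kind of score no longer acts
```

The epistemological question ("is the score accurate about this person?") never arises inside the loop; only the normative ones do – where to set the threshold, and who bears the cost of the false positives that tune it.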
This post is already long, and by situating the issue as one of practical reason, I have neither foreclosed epistemological points nor said anything about what a Kantian treatment of the recommendations of data systems might look like. The interaction is likely to be difficult, given Kant’s reliance on juridical and causal models of thought, and the way the algorithms behind these systems rely on statistics. But if algorithms encode rules, and big data relies on algorithms to generate/recommend actions, then Kantian questions about how one tests those algorithms ought to be central. Again, this might be tricky: if we take an analytics system as an assemblage of moving parts and correlations (etc.), Mike Ananny notes that “a purely deontological approach might be applied to the entire assemblage—asking whether its rules and policies adhere to ethical principles—but it may be difficult to trace which parts of an assemblage adhere to or deviate from deontological guidelines” (108). But this does seem like a start.