By Gordon Hull
As a criterion for algorithmic assessment, “fairness” has encountered numerous problems. Many of these emerged in the wake of ProPublica’s argument that Broward County’s pretrial detention system, COMPAS, was unfair to black suspects. To recall: In 2016, ProPublica published an investigative piece criticizing Broward County, Florida’s use of a software program called COMPAS in its pretrial detention system. COMPAS produced a recidivism risk score for each suspect, which could then be used in deciding whether someone should be detained prior to trial. ProPublica’s investigation found that, among suspects who were not rearrested before trial, black suspects were much more likely than white suspects to have been rated “high risk” for rearrest. Conversely, among suspects who were arrested a second time, white suspects were more likely than black suspects to have been labeled “low risk.” The system thus appeared to discriminate against black suspects. The story led to an extensive debate (for an accessible summary with cites, see Ben Green’s discussion here) over how fairness should be understood in a machine learning context.
The debate basically showed that ProPublica focused on outcomes and demonstrated that the system failed to achieve separation fairness, which is met when all groups subject to the algorithm’s decisions receive the same false positive and false negative rates. The system failed because black suspects labeled “high risk” were much more likely than white suspects to be false positives. In response, the software vendor argued that the system made fair predictions because, among those classified in the same way (high or low risk), both racial groups exhibited the predicted outcome at the same rate. In other words, among those classified as “high risk,” there was no racial difference in how likely they were to actually be rearrested. The algorithm thus satisfied the criterion of sufficiency fairness. In the ensuing debate, computer scientists arrived at a proof that, except in very limited cases (essentially, when the groups have equal base rates or the prediction is perfect), it is impossible to satisfy both separation and sufficiency at once.
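To make the two criteria concrete, here is a minimal Python sketch with invented confusion-matrix counts (they are not the COMPAS numbers): sufficiency compares predictive values within each risk label, while separation compares error rates within each outcome class.

```python
# Toy illustration of separation vs. sufficiency fairness.
# All counts are invented for illustration; they are not the COMPAS data.

def metrics(tp, fp, tn, fn):
    """Per-group error rates (separation) and predictive values (sufficiency)."""
    fpr = fp / (fp + tn)   # among people not rearrested, share labeled high risk
    fnr = fn / (tp + fn)   # among people rearrested, share labeled low risk
    ppv = tp / (tp + fp)   # among people labeled high risk, share actually rearrested
    npv = tn / (tn + fn)   # among people labeled low risk, share not rearrested
    return fpr, fnr, ppv, npv

# Hypothetical confusion matrices for two groups with different base rates
# of the predicted outcome (0.5 vs. 0.3).
groups = {
    "A": dict(tp=450, fp=300, tn=200, fn=50),
    "B": dict(tp=150, fp=100, tn=600, fn=150),
}

for name, g in groups.items():
    fpr, fnr, ppv, npv = metrics(**g)
    print(f"group {name}: FPR={fpr:.2f} FNR={fnr:.2f} PPV={ppv:.2f} NPV={npv:.2f}")

# group A: FPR=0.60 FNR=0.10 PPV=0.60 NPV=0.80
# group B: FPR=0.14 FNR=0.50 PPV=0.60 NPV=0.80
# Sufficiency holds (equal PPV and NPV), but separation fails (unequal FPR
# and FNR): once base rates differ, the two criteria pull apart.
```

This is the impossibility result in miniature: with different base rates, equal predictive values force unequal error rates, and vice versa.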
In the meantime, on the philosophy side, Brian Hedden has argued that a demonstrably fair algorithm can nonetheless violate 11 of 12 proposed fairness conditions. In a response piece, Benjamin Eva showed the limits of the twelfth with a different test and proposed a new criterion:
If an algorithm assigns one group a higher average risk score than another, that discrepancy has to be justified by a corresponding discrepancy between the base rates of those two groups, and the magnitudes of those discrepancies should be equivalent. In slogan form: an algorithm should only treat one group as much more risky than another if it really is much more risky (258).
He formalizes the criterion, which he calls “base rate tracking” as follows:
The difference between the average risk scores assigned to the relevant groups should be equal to the difference between the (expected) base rates of those groups (258).
His example is a redlining algorithm that tries to discriminate against black borrowers by using zip code as a proxy for race, and then assigning much higher risk scores to those in undesirable zip codes, even though actual loan default rates are only slightly higher. He argues, correctly it seems to me, that most of us would perceive that as intuitively unfair.
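To see what checking that criterion might look like in practice, here is a minimal sketch with invented numbers loosely modeled on the redlining case; the figures and the function name are hypothetical, not Eva’s.

```python
# Sketch of a base rate tracking check, with invented numbers loosely
# modeled on the redlining example (not real lending data).

def base_rate_tracking_gap(avg_scores, base_rates, group_a, group_b):
    """How much the risk-score gap between two groups exceeds the base-rate gap.
    Base rate tracking says this quantity should be (approximately) zero."""
    score_gap = avg_scores[group_a] - avg_scores[group_b]
    rate_gap = base_rates[group_a] - base_rates[group_b]
    return score_gap - rate_gap

# Hypothetical figures: the algorithm assigns a much higher average
# default-risk score to one zip-code group even though actual default
# rates differ only slightly.
avg_scores = {"redlined_zip": 0.45, "other_zip": 0.15}   # assigned risk scores
base_rates = {"redlined_zip": 0.08, "other_zip": 0.06}   # observed default rates

gap = base_rate_tracking_gap(avg_scores, base_rates, "redlined_zip", "other_zip")
print(f"score gap exceeds base-rate gap by {gap:.2f}")   # prints 0.28

# A 0.30 score gap against a 0.02 base-rate gap violates base rate tracking:
# the algorithm treats one group as far riskier than the difference in
# actual default rates can justify.
```

Note that running this check presupposes exactly what the rest of this post questions: that we have a trustworthy measure of the base rates in the first place.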
Nonetheless, I think a combination of the abstraction of philosophy and the redlining example makes it hard to see that this criterion will be very difficult to use except in limited cases. The problem is that, in many instances, we don’t actually know the base rate. Mortgage defaults are something for which accurate data are available. But in the most prominent case of predictive algorithms with racial bias – predictive policing – we simply do not know the base rate for the thing we care about, crime. Base rate tracking might work in theory, but it’s not going to help in a lot of situations.
Moreover, the difficulty of base rate tracking can make it easy to miss another serious problem: choosing the wrong target variable. In the case of pretrial detention, the algorithm used arrest rates as a proxy for crime. The system accurately reflected the fact that black people were arrested at much higher rates than white people. As Sandra Mayson emphasizes in a widely-cited paper, the fairness problem lies in the higher arrest rates themselves, and in their failure to accurately represent criminal behavior. But the difficulty of measuring criminal behavior is evident in the selection of arrest as a proxy for it: if we could actually measure crime, the algorithm could be trained on that. The proxy selection does more than sweep questions of measurement under the carpet. It also assumes that the meaning of “crime” is settled and that decisions about what kinds of crime should be targeted have been successfully resolved. That assumption is clearly not warranted. What neighborhoods would see more police and more arrests if white-collar crime were selected for extra scrutiny?
If we don’t know the base rate for the thing we really care about, or have trouble defining it, we won’t be able to tell how well the target variable represents it. In the case of crime, the main thing we know is that arrest rates don’t represent it accurately. But then we are not in a good position to use predictive algorithms in the first place, because the proxy is guaranteed to generate outcomes that mirror and reproduce whatever structural problems are in the data for the proxy variable.
This target variable problem, it should be emphasized, is not confined to policing. For example, as Ziad Obermeyer and colleagues have shown, it also caused a well-intentioned hospital algorithm designed to get extra care to high-risk and medically fragile patients to underestimate the needs of black patients. The reason was that the algorithm took health-care costs, as captured in billing records, as a proxy for health status, on the intuitively plausible premise that sicker patients would generate more costs. The problem was that black patients had less access to the healthcare system (or faced various inequities within it, such as less aggressive treatment), and so generated lower costs than comparably sick white patients. Billing records were a bad proxy. In that case, Obermeyer was able to find a better target variable and resolve the problem. But the case of crime rates shows this won’t always be possible. Even in the context of mortgage default rates, once one gets past clear cases of intentional redlining, there is a lot of injustice baked into racial disparities in borrowers’ ability to pay.
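The mechanism is easy to reproduce in a toy simulation; everything here (the access parameter, the distributions, the 10% threshold) is invented for illustration and is not drawn from Obermeyer’s study.

```python
# Toy simulation of the target-variable problem: identical underlying need,
# but one group generates less of the cost proxy because of unequal access.
# All parameters are invented for illustration.
import random

random.seed(0)

def simulate_group(n=10_000, access=1.0):
    """Return (true_need, observed_cost) pairs for one group."""
    people = []
    for _ in range(n):
        need = random.gauss(50, 10)                 # same need distribution for both groups
        cost = need * access + random.gauss(0, 5)   # cost proxy scales with access to care
        people.append((need, cost))
    return people

group_full = simulate_group(access=1.0)      # full access to care
group_reduced = simulate_group(access=0.7)   # same need, reduced access, lower costs

# Rank everyone by the cost proxy and flag the top 10% for extra care,
# as a stand-in for training on costs as the target variable.
everyone = [("full", need, cost) for need, cost in group_full] + \
           [("reduced", need, cost) for need, cost in group_reduced]
everyone.sort(key=lambda person: person[2], reverse=True)
flagged = everyone[: len(everyone) // 10]

share_reduced = sum(1 for label, _, _ in flagged if label == "reduced") / len(flagged)
print(f"share of flagged patients from the reduced-access group: {share_reduced:.1%}")
# Far below the 50% that equal need would warrant: the proxy, not the
# underlying need, drives the disparity.
```

Swapping the target variable from costs to a more direct measure of health need removes the disparity in a case like this; the point above is that for crime no comparably direct measure is available.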
In other words, it seems to me that the lack of available base-rate data, and the prevalence of poor target variables, are likely to substantially limit the use of base-rate tracking for fairness. To his credit, Benjamin concludes by noting that “there are some instances of the applications of predictive algorithms that involve grave injustices that simply cannot be properly diagnosed by purely statistical criteria. In cases like these, one can reasonably contend that the injustice is not an intrinsic property of the algorithm itself, but rather arises from the historical conditions pertaining to the development and application of the algorithm” (266). That seems right to me, and like a bigger problem than it initially appears. Perhaps one might suggest the following: the burden on algorithm developers ought to be to prove reasonable success in base rate tracking; absent such proof, the system should be presumed unfair.