Knee pain is common and debilitating, and it’s often caused by osteoarthritis in the knee. Treatment options range from analgesics (including opioids) to knee-replacement surgery. If you go to the doctor with arthritic knee pain, you can get an x-ray, which can then be interpreted using standard rubrics like the Kellgren–Lawrence Grade (KLG) to quantify damage to your knee and guide treatment options. The KLG isn’t perfect: the correlation between reported pain and objective scores of knee damage is far from exact. Some people’s knees are a wreck and they report no pain; others have pain beyond what their KLG scores indicate. But here’s the thing: Black patients consistently report more knee pain than white patients. They also tend to have more knee damage on the KLG – but even when you factor that in, Black patients report much more knee pain than white patients with comparable KLG scores. What’s going on?
One possibility is that factors external to the knee – stress, for example – explain the higher pain. If that’s the case, then those patients need less treatment directed at the knee itself. But what if their knees really are in worse shape than the KLG captures? To answer that question, you’d have to ask what in an x-ray actually indicates poor knee condition.
Disease is often measured through indicators, and we know that these indicators can introduce all sorts of complexity. In the context of Covid, for example, there are questions about testing and sensitivity that I’ve talked about before. Along the way, I referred to a fantastic paper on malaria testing in sub-Saharan Africa – suffice it to say that “cases of malaria” reported to donor organizations is a difficult number to parse, for reasons having to do with vagaries in testing and diagnosis.
In a new paper in Nature Medicine, a team led by Emma Pierson makes ingenious use of artificial intelligence to tackle the problem of racial disparities in knee pain. Since algorithms and data are so often implicated in creating or magnifying racial disparities (see, for example, Safiya Noble on Google, or Timnit Gebru on facial recognition, or Margaret Hu’s chilling “Algorithmic Jim Crow”), it’s encouraging to see machine learning used to undermine racial disparities. Ordinarily, you train an algorithm to perform like an excellent clinician. In this case, that would mean training it to look at a radiograph and determine the correct KLG score. The trick here was to instead train it on pain: to determine what features of the x-ray predicted that the patient would report pain. It turns out that the algorithm’s assessments reduced racial disparities in pain by a jaw-dropping 47%.
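To make the trick concrete, here’s a minimal sketch of what it means to train on pain rather than on the KLG: each radiograph is paired with the patient’s own pain score, and a standard image model regresses against that. This is not the paper’s pipeline – the dataset class, record format and hyperparameters below are my own illustrative assumptions.

```python
# Minimal sketch (not the paper's pipeline): regress patient-reported pain
# directly from knee radiographs, instead of predicting the KLG grade.
# The dataset class, record format and hyperparameters are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torchvision import models, transforms
from PIL import Image

class KneeXrayDataset(Dataset):
    """Pairs each radiograph with the patient's own pain score
    (e.g. a KOOS pain score), not a radiologist-assigned KLG grade."""
    def __init__(self, records):
        self.records = records  # list of (image_path, pain_score) tuples
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        path, pain = self.records[idx]
        img = self.transform(Image.open(path).convert("RGB"))
        return img, torch.tensor(pain, dtype=torch.float32)

def build_pain_model():
    # A standard pretrained CNN with a single regression output:
    # the target is reported pain, not the KLG.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, 1)
    return model

def train(model, loader, epochs=10, lr=1e-4, device="cpu"):
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for imgs, pain in loader:
            imgs, pain = imgs.to(device), pain.to(device)
            opt.zero_grad()
            loss = loss_fn(model(imgs).squeeze(1), pain)
            loss.backward()
            opt.step()
    return model
```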
One possible explanation? The algorithm was given deliberately diverse training data. The KLG, on the other hand, was based on studies of homogeneous white populations. In his Twitter thread on the article, Ziad Obermeyer, one of the researchers, points to one such study, of British coal miners from the 1950s. Is the diversity of the training data really driving the better outcomes? It would seem so:
“This was tested by retraining the neural network under two experimental conditions: (1) using a non-diverse training set from which all minority patients (for example, all Black patients; we also performed analogous experiments by removing all lower-income patients and all lower-education patients) had been removed and (2) using an equally sized diverse training set from which a subset of non-minority patients were removed. While models trained under both conditions outperformed KLG, models trained on the diverse training sets achieved better predictive performance for pain and greater reductions in racial and socioeconomic pain disparities than models trained on the non-diverse training sets of the same size. The model trained on a dataset with no Black patients reduced the racial pain disparity by only 2.3× KLG, as opposed to an average of 4.9× for models trained on five randomly sampled diverse training sets of the same size (P value for difference, <0.001 for all five randomly sampled training sets; results when removing all lower-income or all lower-education patients were similar). Thus, training set diversity contributes to the algorithm’s ability to reduce disparities.”
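Here’s an illustrative sketch of the two retraining conditions the quote describes – not the authors’ code; the record format and sampling details are my own assumptions. The point is that both training sets end up the same size, so any difference in outcomes is down to diversity rather than quantity of data.

```python
# Illustrative sketch (not the authors' code) of the two retraining
# conditions. records is assumed to be a list of
# (image_path, pain_score, race) tuples.
import random

def make_training_sets(records, minority="Black", seed=0):
    rng = random.Random(seed)
    # Condition 1: non-diverse training set -- drop every minority patient.
    non_diverse = [r for r in records if r[2] != minority]
    n_removed = len(records) - len(non_diverse)
    # Condition 2: an equally sized diverse training set -- instead drop
    # the same number of randomly chosen non-minority patients.
    non_minority_idx = [i for i, r in enumerate(records) if r[2] != minority]
    drop = set(rng.sample(non_minority_idx, n_removed))
    diverse = [r for i, r in enumerate(records) if i not in drop]
    return non_diverse, diverse

# Each set would then be used to retrain the pain model sketched above,
# and the resulting disparity reductions compared.
```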
So the algorithm was better than the KLG either way, but when it was trained on diverse patients, it produced less biased outcomes. This is as vivid a demonstration as I’ve seen of the importance of diverse training data; the perhaps unexpected (but obvious once you see it) point is that AI with diverse training data can expose the shaky basis of standard indicators. In other words, the standard indicator was based on bad “training data,” and starting over with better training data leads to better predictors of pain.
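One simple way to picture what “reduces the disparity” means (my own back-of-the-envelope framing, not the authors’ exact method): ask how much of the Black–white gap in reported pain disappears once you control for a severity measure, whether that measure is the KLG or the algorithm’s predicted pain. The column names below are hypothetical.

```python
# Back-of-the-envelope sketch (assumed, not the paper's exact method) of
# "how much of the racial pain gap does a severity measure explain?".
# df is assumed to have columns: pain, race, and a severity column
# such as "klg" or "alg_pain" (the model's predicted pain).
import numpy as np
import pandas as pd

def pain_gap(df):
    """Mean difference in reported pain, Black minus white patients."""
    return (df.loc[df.race == "Black", "pain"].mean()
            - df.loc[df.race == "white", "pain"].mean())

def residual_gap(df, severity_col):
    """The gap that remains after removing the part of pain that is
    linearly predicted by the severity measure."""
    x = df[severity_col].to_numpy()
    y = df["pain"].to_numpy()
    slope, intercept = np.polyfit(x, y, 1)
    return pain_gap(df.assign(pain=y - (slope * x + intercept)))

def disparity_reduction(df, severity_col):
    """Fraction of the raw pain gap explained by the severity measure."""
    return 1 - residual_gap(df, severity_col) / pain_gap(df)

# Hypothetical usage:
# disparity_reduction(df, "klg")       # how much the KLG explains
# disparity_reduction(df, "alg_pain")  # how much the algorithm explains
```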
As the authors of the study indicate, this has significant clinical implications. Patients with lower KLG scores tend to get shunted into non-specific interventions:
“In addition to raising important questions regarding how we understand potential sources of pain, our results have implications for the determination of who receives arthroplasty for knee pain. While radiographic severity is not part of the formal guideline in allocations for arthroplasty (which only requires evidence of radiographic damage), empirically, patients with higher KLGs are more likely to receive surgery. Consequently, we hypothesize that underserved patients with disabling pain but without severe radiographic disease could be less likely to receive surgical treatments and more likely to be offered non-specific therapies for pain. This approach could lead to overuse of pharmacological remedies, including opioids, for underserved patients and contribute to the well-documented disparities in access to knee arthroplasty”
So Black patients are more likely to be prescribed opioids and less likely to get knee-replacement surgery. In an accompanying editorial, Said Ibrahim points out that this is not the end of the discussion: the roles of pain in surgical decision-making and in predicting surgical outcomes both need further study (aside: there is also evidence that knee replacement is often overprescribed). In addition, the question of racial disparity in surgical decision-making opens up a big can of worms, many of which point to further racial disparities:
“A key factor for receiving a surgical recommendation for replacement is patient willingness. Black patients have lower preference for joint replacement than do white patients. This is in part because Black patients have concerns about surgical outcomes. There are also data showing that Black patients are more likely to receive joint-replacement surgery in low-quality hospitals in the USA. Therefore, whether better assessment of pain will address disparities in the use of elective joint-replacement surgery remains to be determined.”
Those are precisely the sorts of questions that need to be raised and addressed if we are to make progress understanding racial disparities in medicine. The only point I want to underline here is that these particular disparities are traceable to the lack of diversity in the data behind a standard indicator. And it was AI that got it right, or at least better.
It seems to me that this is also an excellent example of epistemic justice. By starting with patients’ own reports of pain, it’s possible to use those reports to make sense of the images of their knees. And that seems more just than using an objective metric like the KLG that requires discounting the pain reports of minorities. Given the long history of medicine not taking Black patients’ pain seriously, this is not a small thing.