
    Disparate Censorship & Undertesting: A Source of Label Bias in Clinical Machine Learning

    As machine learning (ML) models gain traction in clinical applications, understanding the impact of clinician and societal biases on ML models is increasingly important. While biases can arise in the labels used for model training, the many sources from which these biases arise are not yet well studied. In this paper, we highlight disparate censorship (i.e., differences in testing rates across patient groups) as a source of label bias that clinical ML models may amplify, potentially causing harm. Many patient risk-stratification models are trained using the results of clinician-ordered diagnostic and laboratory tests as labels. Patients without test results are often assigned a negative label, which assumes that untested patients do not experience the outcome. Since orders are affected by clinical and resource considerations, testing may not be uniform across patient populations, giving rise to disparate censorship. Disparate censorship in patients of equivalent risk leads to undertesting in certain groups and, in turn, more biased labels for those groups. Using such biased labels in standard ML pipelines could contribute to gaps in model performance across patient groups. Here, we theoretically and empirically characterize conditions in which disparate censorship or undertesting affect model performance across subgroups. Our findings call attention to disparate censorship as a source of label bias in clinical ML models.
    Comment: 48 pages, 18 figures. Machine Learning for Healthcare Conference (MLHC 2022)
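
    A minimal simulation sketch of the labeling mechanism described above, assuming two groups with identical true risk but different testing rates; the rates, prevalence, and variable names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two patient groups with identical true risk but different testing rates
# (hypothetical numbers chosen only to illustrate disparate censorship).
group = rng.integers(0, 2, n)            # 0 = well-tested group, 1 = undertested group
true_outcome = rng.random(n) < 0.10      # 10% true prevalence in both groups
test_rate = np.where(group == 0, 0.80, 0.40)
tested = rng.random(n) < test_rate

# Standard-pipeline assumption: untested patients are labeled negative.
observed_label = true_outcome & tested

for g in (0, 1):
    mask = group == g
    fnr = np.mean(true_outcome[mask] & ~observed_label[mask]) / np.mean(true_outcome[mask])
    print(f"group {g}: observed prevalence = {observed_label[mask].mean():.3f}, "
          f"label false-negative rate = {fnr:.3f}")
```

    Under these assumed rates, the undertested group ends up with a much larger share of true positives mislabeled as negative, which is the label bias the paper warns standard pipelines may amplify.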

    When do confounding by indication and inadequate risk adjustment bias critical care studies? A simulation study

    Abstract
    Introduction: In critical care observational studies, when clinicians administer different treatments to sicker patients, any treatment comparisons will be confounded by differences in severity of illness between patients. We sought to investigate the extent to which observational studies assessing treatments are at risk of incorrectly concluding that such treatments are ineffective or even harmful due to inadequate risk adjustment.
    Methods: We performed Monte Carlo simulations of observational studies evaluating the effect of a hypothetical treatment on mortality in critically ill patients. We set the treatment to have either no association with mortality or a truly beneficial effect, but to be administered more often to sicker patients. We varied the strength of the treatment’s true effect, the strength of confounding, study size, patient population, and the accuracy of the severity-of-illness risk adjustment (area under the receiver operating characteristic curve, AUROC). We measured the rates at which studies reached inaccurate conclusions about the treatment’s true effect due to confounding, and the odds ratios for mortality measured for such false associations.
    Results: Simulated observational studies employing adequate risk adjustment were generally able to measure a treatment’s true effect. As risk adjustment worsened, the rate of studies incorrectly concluding that the treatment provided no benefit or caused harm increased, especially when the sample size was large (n = 10,000). Even in scenarios with only low confounding, studies using the lower-accuracy risk adjustors (AUROC < 0.66) falsely concluded that a beneficial treatment was harmful. Measured odds ratios for mortality of 1.4 or higher were possible when the treatment’s true beneficial effect was an odds ratio for mortality of 0.6 or 0.8.
    Conclusions: Large observational studies confounded by severity of illness have a high likelihood of obtaining incorrect results even after employing conventionally “acceptable” levels of risk adjustment, with large effect sizes that may be construed as true associations. Reporting the AUROC of the risk adjustment used in the analysis may facilitate an evaluation of a study’s risk of confounding.
    http://deepblue.lib.umich.edu/bitstream/2027.42/111639/1/13054_2015_Article_923.pd
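
    A hedged Monte Carlo sketch of the core idea, assuming logistic models for treatment assignment and mortality and representing imperfect risk adjustment as a noisy severity score; the coefficients, noise levels, and study counts are illustrative assumptions, not the paper's simulation settings:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

def simulate_study(n=10_000, true_treatment_or=0.6, noise_sd=1.0):
    """One simulated observational study with confounding by indication.

    Sicker patients are more likely to be treated; the analysis adjusts for a
    noisy severity score to mimic imperfect risk adjustment (illustrative
    parameters only).
    """
    severity = rng.normal(size=n)                          # true severity of illness
    p_treat = 1 / (1 + np.exp(-(severity - 0.5)))          # sicker -> treated more often
    treated = rng.random(n) < p_treat
    logit_mort = -1.0 + 1.5 * severity + np.log(true_treatment_or) * treated
    died = rng.random(n) < 1 / (1 + np.exp(-logit_mort))

    measured_severity = severity + rng.normal(scale=noise_sd, size=n)  # imperfect adjustor
    X = sm.add_constant(np.column_stack([treated.astype(float), measured_severity]))
    fit = sm.Logit(died.astype(float), X).fit(disp=False)
    return np.exp(fit.params[1])                           # measured treatment odds ratio

# Poorer risk adjustment (noisier severity score) pushes the measured OR for a
# truly beneficial treatment (true OR = 0.6) toward, or past, the null.
for sd in (0.0, 1.0, 2.0):
    ors = [simulate_study(noise_sd=sd) for _ in range(20)]
    print(f"severity noise sd={sd}: median measured OR = {np.median(ors):.2f}")
```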

    Late mortality after acute hypoxic respiratory failure

    Background: Acute hypoxic respiratory failure (AHRF) is associated with significant acute mortality. It is unclear whether later mortality is predominantly driven by pre-existing comorbid disease, by the acute inciting event, or by AHRF itself.
    Methods: Observational cohort study of elderly US Health and Retirement Study (HRS) participants in fee-for-service Medicare (1998–2012). Patients hospitalised with AHRF were matched 1:1 to otherwise similar adults who were not currently hospitalised, and separately to patients hospitalised with acute inciting events (pneumonia, non-pulmonary infection, aspiration, trauma, pancreatitis) that may result in AHRF, here termed at-risk hospitalisations. The primary outcome was late mortality: death in the 31 days to 2 years following hospital admission.
    Results: Among 15 075 HRS participants, we identified 1268 AHRF and 13 117 at-risk hospitalisations. AHRF hospitalisations were matched to 1157 non-hospitalised adults and 1017 at-risk hospitalisations. Among patients who survived at least 30 days, AHRF was associated with a 24.4% (95% CI 19.9% to 28.9%, p<0.001) absolute increase in late mortality relative to adults not currently hospitalised and a 6.7% (95% CI 1.7% to 11.7%, p=0.01) increase relative to adults hospitalised with acute inciting event(s) alone. At-risk hospitalisation explained 71.2% of the increased odds of late mortality, whereas the development of AHRF itself explained 28.8%. Risk of death was equivalent to that of at-risk hospitalisation beyond 90 days, but remained elevated for more than 1 year compared with non-hospitalised controls.
    Conclusions: In this national sample of older Americans, approximately one in four survivors of AHRF had a late death not explained by pre-AHRF health status. More than 70% of this increased risk was associated with hospitalisation for acute inciting events, while 30% was associated with hypoxemic respiratory failure.
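
    One plausible way to express the "explained X% of the increased odds" arithmetic, using the two matched comparisons; the odds ratios below are made up purely to illustrate the calculation and this is not necessarily the study's exact method:

```python
import numpy as np

# Hypothetical odds ratios for late death among AHRF survivors (NOT study results).
or_vs_nonhospitalised = 2.8   # AHRF vs matched non-hospitalised adults (made up)
or_vs_at_risk = 1.35          # AHRF vs matched at-risk hospitalisations (made up)

excess_total = np.log(or_vs_nonhospitalised)     # total excess log-odds vs non-hospitalised
excess_ahrf = np.log(or_vs_at_risk)              # portion attributable to AHRF itself
share_at_risk = 1 - excess_ahrf / excess_total   # portion explained by the acute inciting event

print(f"at-risk hospitalisation explains {share_at_risk:.1%} of the excess odds")
print(f"AHRF itself explains {excess_ahrf / excess_total:.1%}")
```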

    Collaborative strategies for deploying artificial intelligence to complement physician diagnoses of acute respiratory distress syndrome

    Abstract There is a growing gap between studies describing the capabilities of artificial intelligence (AI) diagnostic systems built with deep learning and efforts to investigate how or when to integrate AI systems into real-world clinical practice to support physicians and improve diagnosis. To address this gap, we investigate four potential strategies for AI model deployment and physician collaboration to determine their potential impact on diagnostic accuracy. As a case study, we examine an AI model trained to identify findings of the acute respiratory distress syndrome (ARDS) on chest X-ray images. While this model outperforms physicians at identifying findings of ARDS, there are several reasons why fully automated ARDS detection may be neither optimal nor feasible in practice. Among the collaboration strategies tested, we find that having the AI model review the chest X-ray first and defer to a physician when it is uncertain achieves a higher diagnostic accuracy (0.869, 95% CI 0.835–0.903) than a strategy in which a physician reviews the chest X-ray first and defers to the AI model if uncertain (0.824, 95% CI 0.781–0.862), the physician reviews the chest X-ray alone (0.808, 95% CI 0.767–0.85), or the AI model reviews the chest X-ray alone (0.847, 95% CI 0.806–0.887). If the AI model reviews a chest X-ray first, the AI system can make decisions for up to 79% of cases, letting physicians focus on the most challenging subsets of chest X-rays.
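
    A short sketch of the "AI reviews first, defers when uncertain" strategy described above; the confidence band, helper names, and data structure are illustrative assumptions rather than the study's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    label: int        # 1 = ARDS findings present, 0 = absent
    decided_by: str   # "model" or "physician"

def ai_first_deferral(model_prob: float,
                      physician_label: int,
                      lower: float = 0.2,
                      upper: float = 0.8) -> Decision:
    """Let the model decide confident cases; defer uncertain ones to a physician."""
    if model_prob >= upper:
        return Decision(1, "model")
    if model_prob <= lower:
        return Decision(0, "model")
    return Decision(physician_label, "physician")

# Example: a chest X-ray the model scores at 0.55 falls in the uncertain band,
# so the physician's read (here, 1) becomes the final label.
print(ai_first_deferral(0.55, physician_label=1))
```

    With thresholds like these, confident model predictions are accepted automatically and only the uncertain band is routed to a physician, which is how the model can decide the bulk of cases while physicians concentrate on the hardest films.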