
    Is this model reliable for everyone? Testing for strong calibration

    In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, auditing a model for strong calibration is well-known to be difficult -- particularly for machine learning (ML) algorithms -- due to the sheer number of potential subgroups. As such, common practice is to assess calibration only with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their expected residuals, the association between the predicted and observed residuals should change along this sequence whenever a poorly calibrated subgroup exists. This reframes calibration testing as a changepoint detection problem, for which powerful methods already exist. We begin by introducing a sample-splitting procedure in which a portion of the data is used to train a suite of candidate models for predicting the residual, and the remaining data are used to perform a score-based cumulative sum (CUSUM) test. To further improve power, we then extend this adaptive CUSUM test to incorporate cross-validation, while maintaining Type I error control under minimal assumptions. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and more than doubled the power when auditing a mortality risk prediction model.
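    The following Python sketch illustrates the split-sample idea described above: one half of the data trains a model to predict residuals, the held-out half is reordered by the predicted residual, and a CUSUM scan for a change in mean residual along that ordering is calibrated by permutation. The exact score-based statistic, the suite of candidate models, and the cross-validated extension in the paper are more involved; the function names and the permutation null here are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    def cusum_statistic(resid_ordered):
        """Maximum absolute standardized partial sum along the ordering."""
        n = len(resid_ordered)
        centered = resid_ordered - resid_ordered.mean()
        partial = np.cumsum(centered)
        return np.max(np.abs(partial)) / (centered.std(ddof=1) * np.sqrt(n))

    def audit_calibration(X, y, p_hat, n_perm=999, seed=0):
        """Split-sample CUSUM test for a poorly calibrated subgroup (sketch)."""
        rng = np.random.default_rng(seed)
        resid = y - p_hat  # observed outcome minus predicted probability
        X_tr, X_te, r_tr, r_te = train_test_split(
            X, resid, test_size=0.5, random_state=seed
        )
        # Candidate model for the expected residual given covariates.
        model = GradientBoostingRegressor(random_state=seed).fit(X_tr, r_tr)
        order = np.argsort(model.predict(X_te))  # reorder by predicted residual
        observed = cusum_statistic(r_te[order])
        # Permutation null: shuffling the ordering destroys any residual trend.
        null = np.array(
            [cusum_statistic(rng.permutation(r_te)) for _ in range(n_perm)]
        )
        p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
        return observed, p_value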

    Book Review


    Net Benefit of Diagnostic Tests for Multistate Diseases: an Indicator Variables Approach

    A limitation of common measures of diagnostic test performance, such as sensitivity and specificity, is that they do not consider the relative importance of false negative and false positive test results, which are likely to have different clinical consequences. Therefore, using classification or prediction measures alone to compare diagnostic tests or biomarkers can be inconclusive for clinicians. Comparing tests on net benefit can be more conclusive because the clinical consequences of misdiagnoses are taken into account. The literature has suggested evaluating binary diagnostic tests based on net benefit but has not considered diagnostic tests that classify more than two disease states, e.g., stroke subtype (large-artery atherosclerosis, cardioembolism, small-vessel occlusion, stroke of other determined etiology, stroke of undetermined etiology), skin lesion subtype, breast cancer subtypes (benign, mass, calcification, architectural distortion, etc.), METAVIR liver fibrosis state (F0-F4), histopathological classification of cervical intraepithelial neoplasia (CIN), prostate Gleason grade, and brain injury (intracranial hemorrhage, mass effect, midline shift, cranial fracture). Other diseases also have more than two stages, such as Alzheimer's disease (dementia due to Alzheimer's disease, mild cognitive impairment (MCI) due to Alzheimer's disease, and preclinical, presymptomatic Alzheimer's disease). In diseases with more than two states, the benefits and risks may vary between states. This paper extends the net-benefit approach for evaluating binary diagnostic tests to multi-state clinical conditions, ruling a clinical condition in or out based on the adverse consequences of work-up delay (due to a false negative test result) and unnecessary work-up (due to a false positive test result). We demonstrate our approach with numerical examples and real data.
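    As a concrete illustration, the standard binary net benefit at threshold probability pt is NB = TP/N - (pt / (1 - pt)) * FP/N. The sketch below computes this quantity for each disease state via one-vs-rest indicator variables with state-specific thresholds; this per-state reading is an illustrative assumption standing in for the paper's multistate formulation, and the state labels and thresholds are invented for the example.

    import numpy as np

    def binary_net_benefit(y_true, y_pred, pt):
        """Standard net benefit of a binary test at threshold probability pt."""
        n = len(y_true)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        return tp / n - (pt / (1 - pt)) * fp / n

    def multistate_net_benefit(states_true, states_pred, thresholds):
        """One-vs-rest net benefit per disease state (illustrative only)."""
        return {
            state: binary_net_benefit(
                (states_true == state).astype(int),  # indicator variable for this state
                (states_pred == state).astype(int),
                pt,
            )
            for state, pt in thresholds.items()
        }

    # Example: METAVIR-like fibrosis states with state-specific thresholds.
    truth = np.array(["F0", "F2", "F4", "F4", "F1", "F3"])
    calls = np.array(["F0", "F2", "F4", "F3", "F0", "F3"])
    print(multistate_net_benefit(truth, calls, {"F4": 0.20, "F3": 0.10}))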

    Desirability of Outcome Ranking (DOOR) and Response Adjusted for Duration of Antibiotic Risk (RADAR)

    Clinical trials that compare strategies to optimize antibiotic use are of critical importance but are limited by competing risks that distort outcome interpretation, the complexities of noninferiority trials, large sample sizes, and inadequate evaluation of benefits and harms at the patient level. The Antibacterial Resistance Leadership Group strives to overcome these challenges through innovative trial design. Response adjusted for duration of antibiotic risk (RADAR) is a novel methodology utilizing a superiority design and a 2-step process: (1) categorizing patients into an overall clinical outcome (based on benefits and harms), and (2) ranking patients with respect to a desirability of outcome ranking (DOOR). DOORs are constructed by assigning higher ranks to patients with (1) better overall clinical outcomes and (2) shorter durations of antibiotic use for similar overall clinical outcomes. DOOR distributions are then compared between antibiotic use strategies, and the probability that a randomly selected patient will have a better DOOR if assigned to the new strategy is estimated. DOOR/RADAR represents a new paradigm for assessing the risks and benefits of new strategies to optimize antibiotic use.
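    Below is a minimal sketch of the DOOR comparison described above, assuming each patient is summarized by an ordered overall clinical outcome category and a duration of antibiotic use: better outcomes rank higher, shorter durations break ties within a category, and the strategies are compared by the probability that a random patient on the new strategy has a better DOOR. Counting exact ties as one half follows the usual Mann-Whitney convention; the data structures and field order are illustrative assumptions, not the trial group's implementation.

    def door_key(outcome_category, antibiotic_days):
        """Higher key = more desirable: better outcome, then fewer antibiotic days."""
        return (outcome_category, -antibiotic_days)

    def prob_new_better(new_patients, old_patients):
        """P(random new-strategy patient has a better DOOR), ties counted as 1/2."""
        wins = ties = 0
        for new in new_patients:
            for old in old_patients:
                if door_key(*new) > door_key(*old):
                    wins += 1
                elif door_key(*new) == door_key(*old):
                    ties += 1
        return (wins + 0.5 * ties) / (len(new_patients) * len(old_patients))

    # Each patient: (overall clinical outcome category, days of antibiotic use).
    new_strategy = [(3, 5), (3, 7), (2, 5), (1, 10)]
    old_strategy = [(3, 7), (2, 7), (2, 10), (1, 14)]
    print(prob_new_better(new_strategy, old_strategy))  # 0.71875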