1,128 research outputs found

    Uncertainty in Lung Cancer Stage for Outcome Estimation via Set-Valued Classification

    Full text link
    Difficulty in identifying cancer stage in health care claims data has limited oncology quality-of-care and health outcomes research. We fit prediction algorithms for classifying lung cancer stage into three classes (stages I/II, stage III, and stage IV) using claims data, and then demonstrate a method for incorporating the classification uncertainty in outcomes estimation. Leveraging set-valued classification and split conformal inference, we show how a fixed algorithm developed in one cohort of data may be deployed in another, while rigorously accounting for uncertainty from the initial classification step. We demonstrate this process using SEER cancer registry data linked with Medicare claims data. Comment: Code available at: https://github.com/sl-bergquist/cancer_classificatio
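    A minimal sketch of the split conformal step behind set-valued classification, assuming a generic scikit-learn classifier, synthetic three-class data, and a 10% miscoverage level; these choices are illustrative assumptions, not taken from the paper or its repository.

    # Split conformal prediction sets for a three-class problem (illustrative sketch).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a three-class staging task.
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                               n_classes=3, random_state=0)
    X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

    clf = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)

    # Nonconformity score: one minus the predicted probability of the true class.
    cal_probs = clf.predict_proba(X_cal)
    cal_scores = 1.0 - cal_probs[np.arange(len(y_cal)), y_cal]

    alpha = 0.1  # assumed target miscoverage level
    n = len(cal_scores)
    qhat = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

    def prediction_set(x_new):
        """Return the set of class labels whose score falls below the calibrated threshold."""
        probs = clf.predict_proba(x_new.reshape(1, -1))[0]
        return np.where(1.0 - probs <= qhat)[0]

    The returned label set covers the true stage with probability at least 1 - alpha over the calibration randomness; downstream outcome models can then propagate the whole set rather than a single hard label.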

    A review of probabilistic forecasting and prediction with machine learning

    Full text link
    Predictions and forecasts of machine learning models should take the form of probability distributions, aiming to increase the quantity of information communicated to end users. Although applications of probabilistic prediction and forecasting with machine learning models in academia and industry are becoming more frequent, related concepts and methods have not been formalized and structured under a holistic view of the entire field. Here, we review the topic of predictive uncertainty estimation with machine learning algorithms, as well as the related metrics (consistent scoring functions and proper scoring rules) for assessing probabilistic predictions. The review covers a period spanning from the introduction of early statistical methods (linear regression and time series models, based on Bayesian statistics or quantile regression) to recent machine learning algorithms (including generalized additive models for location, scale and shape, random forests, boosting and deep learning algorithms) that are more flexible by nature. This review of progress in the field expedites our understanding of how to develop new algorithms tailored to users' needs, since the latest advancements are based on some fundamental concepts applied to more complex algorithms. We conclude by classifying the material and discussing challenges that are becoming a hot topic of research. Comment: 83 pages, 5 figures
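    A minimal sketch of one proper scoring rule mentioned in this line of work, the pinball (quantile) loss, evaluated for quantile forecasts from gradient boosting. The synthetic data and the 0.1/0.5/0.9 quantile levels are assumptions for illustration, not details from the review.

    # Evaluating probabilistic (quantile) forecasts with the pinball loss.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_pinball_loss
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for q in (0.1, 0.5, 0.9):
        # One model per target quantile; together they describe the predictive distribution.
        model = GradientBoostingRegressor(loss="quantile", alpha=q, random_state=0)
        model.fit(X_tr, y_tr)
        score = mean_pinball_loss(y_te, model.predict(X_te), alpha=q)
        print(f"quantile {q}: mean pinball loss = {score:.3f}")

    Lower average pinball loss at each level indicates better-calibrated and sharper quantile forecasts, which is the sense in which such scoring rules reward honest probabilistic predictions.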

    A Comprehensive Framework for Evaluating Time to Event Predictions using the Restricted Mean Survival Time

    Full text link
    The restricted mean survival time (RMST) is a widely used quantity in survival analysis due to its straightforward interpretation. For instance, predicting the time to event based on patient attributes is of great interest when analyzing medical data. In this paper, we propose a novel framework for evaluating RMST estimations. Our criterion estimates the mean squared error of an RMST estimator using Inverse Probability of Censoring Weighting (IPCW). A model-agnostic conformal algorithm adapted to right-censored data is also introduced to compute prediction intervals and to evaluate variable importance. Our framework is valid for any RMST estimator that is asymptotically convergent and works under model misspecification.
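    A hypothetical sketch of an IPCW-weighted squared-error criterion for RMST predictions, using lifelines' Kaplan-Meier estimator for the censoring distribution. The weighting shown here is one standard IPCW construction on synthetic data and is only meant to illustrate the idea, not to reproduce the paper's exact criterion; the placeholder predictions and horizon tau are assumptions.

    import numpy as np
    from lifelines import KaplanMeierFitter

    rng = np.random.default_rng(0)
    n, tau = 500, 5.0
    event_time = rng.exponential(4.0, n)     # synthetic latent survival times
    censor_time = rng.exponential(6.0, n)    # synthetic latent censoring times
    observed = np.minimum(event_time, censor_time)
    delta = (event_time <= censor_time).astype(float)

    # Censoring survival function G(t), estimated by flipping the event indicator.
    kmf_cens = KaplanMeierFitter().fit(observed, event_observed=1 - delta)
    G = kmf_cens.survival_function_at_times(np.minimum(observed, tau)).to_numpy()

    restricted = np.minimum(observed, tau)   # observed restricted times
    rmst_pred = np.full(n, 3.2)              # hypothetical RMST predictions to evaluate

    # Only events (or follow-up past tau) contribute, each reweighted by the
    # inverse probability of remaining uncensored.
    contributes = (delta == 1) | (observed >= tau)
    weights = contributes / np.clip(G, 1e-8, None)
    ipcw_mse = np.sum(weights * (restricted - rmst_pred) ** 2) / np.sum(weights)
    print(f"IPCW-weighted MSE of RMST prediction: {ipcw_mse:.3f}")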

    Conformalized Survival Analysis

    Full text link
    Existing survival analysis techniques heavily rely on strong modelling assumptions and are, therefore, prone to model misspecification errors. In this paper, we develop an inferential method based on ideas from conformal prediction, which can wrap around any survival prediction algorithm to produce calibrated, covariate-dependent lower predictive bounds on survival times. In the Type I right-censoring setting, when the censoring times are completely exogenous, the lower predictive bounds have guaranteed coverage in finite samples without any assumptions other than that of operating on independent and identically distributed data points. Under a more general conditionally independent censoring assumption, the bounds satisfy a doubly robust property: marginal coverage is approximately guaranteed if either the censoring mechanism or the conditional survival function is estimated well. Further, we demonstrate that the lower predictive bounds remain valid and informative for other types of censoring. The validity and efficiency of our procedure are demonstrated on synthetic data and real COVID-19 data from the UK Biobank. Comment: 33 pages, 7 figures
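    A minimal sketch of the split-conformal idea behind a calibrated lower predictive bound (LPB) on survival time, shown here on fully observed synthetic data. The paper's actual procedure additionally corrects for censoring; the quantile model, log transform, and miscoverage level below are assumptions.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    n = 2000
    X = rng.normal(size=(n, 5))
    T = np.exp(1.0 + X[:, 0] + 0.5 * rng.normal(size=n))   # synthetic survival times

    X_fit, X_cal = X[:1000], X[1000:]
    T_fit, T_cal = T[:1000], T[1000:]

    alpha = 0.1
    # Model the conditional alpha-quantile of log survival time.
    q_model = GradientBoostingRegressor(loss="quantile", alpha=alpha, random_state=0)
    q_model.fit(X_fit, np.log(T_fit))

    # One-sided conformity score: how far the fitted lower quantile sits above the truth.
    scores = q_model.predict(X_cal) - np.log(T_cal)
    n_cal = len(scores)
    qhat = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal, method="higher")

    def lower_predictive_bound(x_new):
        """Calibrated lower bound on survival time for a new covariate vector."""
        return np.exp(q_model.predict(x_new.reshape(1, -1))[0] - qhat)

    By construction, the bound undershoots the true survival time with probability at least 1 - alpha; the conformal wrapper supplies this guarantee regardless of how well the underlying quantile model fits.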

    Post-selection Inference for Conformal Prediction: Trading off Coverage for Precision

    Full text link
    Conformal inference has played a pivotal role in providing uncertainty quantification for black-box ML prediction algorithms with finite-sample guarantees. Traditionally, conformal prediction requires a data-independent specification of the miscoverage level. In practical applications, one might want to update the miscoverage level after computing the prediction set. For example, in the context of binary classification, the analyst might start with 95% prediction sets and see that most of them contain both outcome classes. Since prediction sets containing both classes are undesirable, the analyst might instead consider, say, 80% prediction sets. Constructing prediction sets that guarantee coverage at a data-dependent miscoverage level can be viewed as a post-selection inference problem. In this work, we develop uniform conformal inference with a finite-sample prediction guarantee at arbitrary data-dependent miscoverage levels, using distribution-free confidence bands for distribution functions. This allows practitioners to freely trade coverage probability for the quality of the prediction set by any criterion of their choice (say, the size of the prediction set) while maintaining finite-sample guarantees similar to those of traditional conformal inference.
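    A hypothetical sketch of how a distribution-free (DKW-type) confidence band on the calibration score distribution allows the miscoverage level to be chosen after looking at the data: the empirical quantile is inflated by a uniform margin that holds simultaneously over all levels. The band choice, delta, and score distribution below are assumptions for illustration, not the paper's exact construction.

    import numpy as np

    def simultaneous_threshold(cal_scores, alpha, delta=0.05):
        """Score threshold valid simultaneously over all alpha, via a DKW band."""
        n = len(cal_scores)
        eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))   # DKW margin
        level = min(1.0, 1.0 - alpha + eps)              # inflate the quantile level
        return np.quantile(cal_scores, level, method="higher")

    # Synthetic stand-in for conformity scores from a black-box classifier's calibration split.
    rng = np.random.default_rng(0)
    cal_scores = rng.beta(2, 5, size=1000)

    # The analyst may inspect the 95% sets first, then switch to 80% sets; the
    # same band keeps a finite-sample guarantee for the data-dependent choice.
    for alpha in (0.05, 0.20):
        print(alpha, simultaneous_threshold(cal_scores, alpha))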

    Conformalized survival analysis with adaptive cutoffs

    Full text link
    This paper introduces a method that constructs valid and efficient lower predictive bounds (LPBs) for survival times with censored data. Traditional methods for survival analysis often assume a parametric model for the distribution of survival time as a function of the measured covariates, or assume that this conditional distribution is captured well with a non-parametric method such as random forests; however, these methods may lead to undercoverage if their assumptions are not satisfied. In this paper, we build on recent work by Candès et al. (2021), which offers a more assumption-lean approach to the problem. Their approach first subsets the data to discard any data points with early censoring times, and then uses a reweighting technique, namely weighted conformal inference (Tibshirani et al., 2019), to correct for the distribution shift introduced by this subsetting procedure. For our new method, instead of constraining to a fixed threshold for the censoring time when subsetting the data, we allow for a covariate-dependent and data-adaptive subsetting step, which is better able to capture the heterogeneity of the censoring mechanism. As a result, our method can lead to LPBs that are less conservative and give more accurate information. We show that in the Type I right-censoring setting, if either the censoring mechanism or the conditional quantile of survival time is well estimated, our proposed procedure achieves approximately exact marginal coverage, where in the latter case we additionally have approximate conditional coverage. We evaluate the validity and efficiency of our proposed algorithm in numerical experiments, illustrating its advantage when compared with other competing methods. Finally, our method is applied to a real dataset to generate LPBs for users' active times on a mobile app. Comment: 21 pages, 6 figures, and 1 table
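    A minimal sketch of the weighted conformal quantile used to correct for the covariate shift induced by subsetting, in the spirit of Tibshirani et al. (2019). The scores and likelihood-ratio weights below are synthetic placeholders; the paper's adaptive, covariate-dependent cutoff is not shown.

    import numpy as np

    def weighted_conformal_quantile(scores, weights, weight_new, alpha):
        """(1 - alpha) quantile of calibration scores under covariate-shift weights."""
        order = np.argsort(scores)
        s, w = scores[order], weights[order]
        # Normalize so the test point's weight is included in the denominator.
        p = w / (w.sum() + weight_new)
        cdf = np.cumsum(p)
        if cdf[-1] < 1.0 - alpha:
            return np.inf                      # not enough calibration mass: vacuous bound
        idx = np.searchsorted(cdf, 1.0 - alpha)
        return s[min(idx, len(s) - 1)]

    rng = np.random.default_rng(0)
    scores = rng.exponential(size=500)          # calibration conformity scores (synthetic)
    weights = rng.uniform(0.5, 2.0, size=500)   # estimated likelihood-ratio weights (synthetic)
    print(weighted_conformal_quantile(scores, weights, weight_new=1.0, alpha=0.1))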

    Automatically Score Tissue Images Like a Pathologist by Transfer Learning

    Full text link
    Cancer is the second leading cause of death in the world. Diagnosing cancer early can save many lives. Pathologists have to look at tissue microarray (TMA) images manually to identify tumors, which can be time-consuming, inconsistent and subjective. Existing algorithms that automatically detect tumors either fail to achieve the accuracy level of a pathologist or require substantial human involvement. A major challenge is that TMA images with different shapes, sizes, and locations can have the same score. Learning staining patterns in TMA images requires a huge number of images, which are severely limited due to privacy concerns and regulations in medical organizations. TMA images from different cancer types may have common characteristics that could provide valuable information, but using them directly harms accuracy. By selective transfer learning from multiple small auxiliary sets, the proposed algorithm is able to extract knowledge from tissue images showing a "similar" scoring pattern but with different cancer types. Remarkably, transfer learning has made it possible for the algorithm to break the critical accuracy barrier: the proposed algorithm reports an accuracy of 75.9% on breast cancer TMA images from the Stanford Tissue Microarray Database, reaching the 75% accuracy level of pathologists. This will allow pathologists to confidently use automatic algorithms to assist them in recognizing tumors consistently, with higher accuracy, in real time. Comment: 19 pages, 6 figures
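    A generic sketch of the transfer-learning ingredient: fine-tune a network pretrained on a large source domain for a small target set of tissue-image scores. This is a standard fine-tuning recipe, not the paper's selective multi-auxiliary-set algorithm; the backbone, class count, and layer choices are assumptions.

    import torch
    import torch.nn as nn
    from torchvision import models

    num_scores = 4  # hypothetical number of TMA score classes

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in model.parameters():          # freeze the pretrained backbone
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_scores)  # new task-specific head

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    def train_step(images, labels):
        """One fine-tuning step on a small batch of labelled tissue images."""
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()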