Uncertainty in Lung Cancer Stage for Outcome Estimation via Set-Valued Classification
Difficulty in identifying cancer stage in health care claims data has limited
research on oncology quality of care and health outcomes. We fit prediction
algorithms for classifying lung cancer stage into three classes (stages I/II,
stage III, and stage IV) using claims data, and then demonstrate a method for
incorporating the classification uncertainty in outcomes estimation. Leveraging
set-valued classification and split conformal inference, we show how a fixed
algorithm developed in one cohort of data may be deployed in another, while
rigorously accounting for uncertainty from the initial classification step. We
demonstrate this process using SEER cancer registry data linked with Medicare
claims data.
Comment: Code available at: https://github.com/sl-bergquist/cancer_classificatio
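As a rough illustration of the split conformal classification step described above, the following Python sketch builds set-valued predictions from held-out calibration scores. The score choice and function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal set-valued classification (illustrative sketch).

    cal_probs:  (n, K) class probabilities on a held-out calibration cohort
    cal_labels: (n,)   integer labels in {0, ..., K-1}
    test_probs: (m, K) class probabilities in the deployment cohort
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the predicted probability of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-adjusted quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    qhat = np.quantile(scores, level, method="higher")
    # A class enters the set whenever its own score falls below the cutoff.
    return [np.where(1.0 - row <= qhat)[0] for row in test_probs]
```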
A review of probabilistic forecasting and prediction with machine learning
Predictions and forecasts of machine learning models should take the form of
probability distributions, aiming to increase the quantity of information
communicated to end users. Although applications of probabilistic prediction
and forecasting with machine learning models in academia and industry are
becoming more frequent, related concepts and methods have not been formalized
and structured under a holistic view of the entire field. Here, we review the
topic of predictive uncertainty estimation with machine learning algorithms, as
well as the related metrics (consistent scoring functions and proper scoring
rules) for assessing probabilistic predictions. The review covers a period
spanning from the introduction of early statistical methods (linear regression
and time series models based on Bayesian statistics or quantile regression) to
recent machine learning algorithms (including generalized additive models for
location, scale and shape, random forests, boosting and deep learning
algorithms) that are more flexible by nature. Reviewing the field's progress
clarifies how to develop new algorithms tailored to users' needs, since the
latest advancements build on a few fundamental concepts applied to more complex
algorithms. We conclude by classifying the material and discussing challenges
that are becoming a hot topic of research.
Comment: 83 pages, 5 figures
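To make the evaluation side concrete, here is a minimal Python sketch of two of the scores the review discusses: the pinball loss (a consistent scoring function for quantiles) and an ensemble estimate of the CRPS (a proper scoring rule). The function names are illustrative:

```python
import numpy as np

def pinball_loss(y_true, q_pred, tau):
    """Pinball loss: a consistent scoring function for the tau-th quantile."""
    diff = y_true - q_pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

def crps_from_ensemble(y_true, samples):
    """CRPS (a proper scoring rule) estimated from predictive samples.

    samples: (n, m) array of m draws from the predictive distribution
             for each of the n observations.
    """
    term1 = np.mean(np.abs(samples - y_true[:, None]), axis=1)
    term2 = 0.5 * np.mean(
        np.abs(samples[:, :, None] - samples[:, None, :]), axis=(1, 2)
    )
    return np.mean(term1 - term2)
```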
A Comprehensive Framework for Evaluating Time to Event Predictions using the Restricted Mean Survival Time
The restricted mean survival time (RMST) is a widely used quantity in
survival analysis due to its straightforward interpretation. For instance,
predicting the time to event based on patient attributes is of great interest
when analyzing medical data. In this paper, we propose a novel framework for
evaluating RMST estimations. Our criterion estimates the mean squared error of
an RMST estimator using Inverse Probability Censoring Weighting (IPCW). A
model-agnostic conformal algorithm adapted to right-censored data is also
introduced to compute prediction intervals and to evaluate variable importance.
Our framework is valid for any RMST estimator that is asymptotically convergent
and works under model misspecification.
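A rough numpy sketch of an IPCW-weighted squared-error criterion of this kind is given below. The Kaplan-Meier censoring estimator and the weighting details are simplified assumptions for illustration, not the paper's exact estimator:

```python
import numpy as np

def km_censoring_survival(obs_times, events, eval_times):
    """Kaplan-Meier estimate of the censoring survival G(t) = P(C > t).

    events is 1 for an observed event and 0 for censoring; censorings
    play the role of "events" here. Ties are handled naively.
    """
    order = np.argsort(obs_times)
    t, d = obs_times[order], 1 - events[order]      # d = 1 when censored
    n = len(t)
    at_risk = n - np.arange(n)
    surv = np.cumprod(1.0 - d / at_risk)
    idx = np.searchsorted(t, eval_times, side="right") - 1
    return np.where(idx >= 0, surv[np.clip(idx, 0, n - 1)], 1.0)

def ipcw_rmst_mse(pred_rmst, obs_times, events, tau):
    """IPCW estimate of the MSE of an RMST predictor at horizon tau."""
    y = np.minimum(obs_times, tau)                  # restricted observed time
    # An observation is usable if the event occurred, or follow-up reached tau.
    usable = (events == 1) | (obs_times >= tau)
    G = km_censoring_survival(obs_times, events, y - 1e-8)
    w = usable / np.clip(G, 1e-8, None)             # inverse-censoring weights
    return np.sum(w * (y - pred_rmst) ** 2) / np.sum(w)
```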
Conformalized Survival Analysis
Existing survival analysis techniques heavily rely on strong modelling
assumptions and are, therefore, prone to model misspecification errors. In this
paper, we develop an inferential method based on ideas from conformal
prediction, which can wrap around any survival prediction algorithm to produce
calibrated, covariate-dependent lower predictive bounds on survival times. In
the Type I right-censoring setting, when the censoring times are completely
exogenous, the lower predictive bounds have guaranteed coverage in finite
samples without any assumptions other than that of operating on independent and
identically distributed data points. Under a more general conditionally
independent censoring assumption, the bounds satisfy a doubly robust property
which states the following: marginal coverage is approximately guaranteed if
either the censoring mechanism or the conditional survival function is
estimated well. Further, we demonstrate that the lower predictive bounds remain
valid and informative for other types of censoring. The validity and efficiency
of our procedure are demonstrated on synthetic data and real COVID-19 data from
the UK Biobank.
Comment: 33 pages, 7 figures
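The wrap-around calibration idea can be sketched as follows for the uncensored case; censoring requires the additional machinery described in the abstract, and `q_model` (any fitted regressor targeting a low conditional quantile) is an assumption of this sketch:

```python
import numpy as np

def conformal_lpb(q_model, X_cal, T_cal, X_test, alpha=0.1):
    """Calibrated lower predictive bounds on survival time (uncensored sketch).

    q_model is any fitted regressor with a .predict method targeting a low
    conditional quantile of T; the conformal step corrects its miscalibration.
    """
    # Signed residual: how far the raw bound overshoots the true survival time.
    scores = q_model.predict(X_cal) - T_cal
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    qhat = np.quantile(scores, level, method="higher")
    # Shift the raw bound down so that P(T >= LPB(X)) >= 1 - alpha.
    return q_model.predict(X_test) - qhat
```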
Post-selection Inference for Conformal Prediction: Trading off Coverage for Precision
Conformal inference has played a pivotal role in providing uncertainty
quantification for black-box ML prediction algorithms with finite sample
guarantees. Traditionally, conformal inference requires a data-independent
specification of the miscoverage level. In practical applications, one might
want to update the miscoverage level after computing the prediction set. For
example, in the context of binary classification, the analyst might start with
prediction sets at some initial miscoverage level and find that most of them
contain both outcome classes. Since prediction sets containing both classes are
uninformative, the analyst might then wish to move to a larger miscoverage
level. Constructing prediction sets that guarantee coverage at a data-dependent
miscoverage level can be framed as a post-selection inference problem. In this work, we
develop uniform conformal inference with finite-sample prediction guarantees at
arbitrary data-dependent miscoverage levels, using distribution-free confidence
bands for distribution functions. This allows practitioners to freely trade
coverage probability for the quality of the prediction set, by any criterion of
their choice (say, the size of the prediction set), while maintaining
finite-sample guarantees similar to those of traditional conformal inference.
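One way to obtain such a uniform guarantee is a Dvoretzky-Kiefer-Wolfowitz confidence band over the calibration-score distribution. The sketch below illustrates the idea and is a simplification of the paper's construction:

```python
import numpy as np

def dkw_calibrated_threshold(cal_scores, alpha, delta=0.05):
    """Score cutoff that stays valid for data-dependent miscoverage levels.

    The Dvoretzky-Kiefer-Wolfowitz inequality gives a band of half-width eps
    around the empirical CDF of the calibration scores that holds uniformly
    over all levels with probability 1 - delta. Inflating the target level
    by eps therefore preserves coverage even when alpha is chosen after
    inspecting the prediction sets.
    """
    n = len(cal_scores)
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    level = min(1.0, 1.0 - alpha + eps)
    return np.quantile(cal_scores, level, method="higher")
```

Because the band holds simultaneously across all levels, the same calibration scores can be re-queried with different alpha values after inspecting the resulting sets, at the price of the eps inflation.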
Conformalized survival analysis with adaptive cutoffs
This paper introduces a method that constructs valid and efficient lower
predictive bounds (LPBs) for survival times with censored data. Traditional
methods for survival analysis often assume a parametric model for the
distribution of survival time as a function of the measured covariates, or
assume that this conditional distribution is captured well with a
non-parametric method such as random forests; however, these methods may lead
to undercoverage if their assumptions are not satisfied. In this paper, we
build on recent work by Candès et al. (2021), which offers a more
assumption-lean approach to the problem. Their approach first subsets the data
to discard any data points with early censoring times and then uses a
reweighting technique (namely, weighted conformal inference (Tibshirani et al.,
2019)) to correct for the distribution shift introduced by this subsetting
procedure. For our new method, instead of constraining to a fixed threshold for
the censoring time when subsetting the data, we allow for a covariate-dependent
and data-adaptive subsetting step, which is better able to capture the
heterogeneity of the censoring mechanism. As a result, our method can lead to
LPBs that are less conservative and give more accurate information. We show
that in the Type I right-censoring setting, if either the censoring mechanism
or the conditional quantile of survival time is well estimated, our proposed
procedure achieves approximately exact marginal coverage; in the latter case,
we additionally obtain approximate conditional coverage. We evaluate
the validity and efficiency of our proposed algorithm in numerical experiments,
illustrating its advantage when compared with other competing methods. Finally,
our method is applied to a real dataset to generate LPBs for users' active
times on a mobile app.
Comment: 21 pages, 6 figures, and 1 table
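The covariate-dependent subsetting still relies on the weighted conformal quantile as its core primitive. A minimal numpy sketch of that primitive follows, with the weights taken as given rather than estimated from the censoring mechanism:

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, test_weight, alpha=0.1):
    """(1 - alpha) quantile of a weighted calibration-score distribution,
    as in weighted conformal inference (Tibshirani et al., 2019). In the
    survival application, the weights would come from an estimated
    censoring mechanism; here they are simply taken as given."""
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    # Conceptually, the test point carries its own weight and a score of +inf.
    p = np.append(w, test_weight)
    p = p / p.sum()
    cdf = np.cumsum(p[:-1])
    idx = np.searchsorted(cdf, 1.0 - alpha, side="left")
    return s[idx] if idx < len(s) else np.inf
```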
Automatically Score Tissue Images Like a Pathologist by Transfer Learning
Cancer is the second leading cause of death in the world. Diagnosing cancer
early on can save many lives. Pathologists have to look at tissue microarray
(TMA) images manually to identify tumors, which can be time-consuming,
inconsistent and subjective. Existing algorithms that automatically detect
tumors have either not achieved the accuracy level of a pathologist or require
substantial human involvement. A major challenge is that TMA images with
different shapes, sizes, and locations can have the same score. Learning
staining patterns in TMA images requires a huge number of images, which are
severely limited due to privacy concerns and regulations in medical
organizations. TMA images from different cancer types may have common
characteristics that could provide valuable information, but using them
directly harms the accuracy. By selective transfer learning from multiple small
auxiliary sets, the proposed algorithm is able to extract knowledge from tissue
images showing a "similar" scoring pattern but with different cancer types.
Remarkably, transfer learning has made it possible for the algorithm to break
the critical accuracy barrier: the proposed algorithm reports an accuracy of
75.9% on breast cancer TMA images from the Stanford Tissue Microarray Database,
reaching the 75% accuracy level of pathologists. This will allow pathologists
to confidently use automatic algorithms to assist them in recognizing tumors
consistently, with higher accuracy, in real time.
Comment: 19 pages, 6 figures
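For readers unfamiliar with the transfer-learning starting point, the snippet below shows generic fine-tuning of a pretrained network on a small target set in PyTorch. This is a baseline illustration only, not the paper's selective multi-source algorithm, and the class count is assumed:

```python
import torch
import torch.nn as nn
from torchvision import models

# Generic transfer-learning starting point: reuse ImageNet features and retrain
# only the classification head on a small set of scored TMA images. The number
# of staining-score classes (4) is an assumption made for this example.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                    # freeze the feature extractor
model.fc = nn.Linear(model.fc.in_features, 4)      # new trainable scoring head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
```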