Evaluation of Intelligent Medical Systems
This thesis presents novel, robust, analytic and algorithmic methods for calculating Bayesian
posterior intervals of receiver operating characteristic (ROC) curves and confusion
matrices used for the evaluation of intelligent medical systems tested with small amounts
of data.
Intelligent medical systems are potentially important in encapsulating rare and valuable
medical expertise and making it more widely available. The evaluation of intelligent medical
systems must make sure that such systems are safe and cost effective. To ensure systems
are safe and perform at expert level they must be tested against human experts. Human
experts are rare and busy which often severely restricts the number of test cases that may
be used for comparison.
The performance of expert human or machine can be represented objectively by ROC
curves or confusion matrices. ROC curves and confusion matrices are complex representations
and it is sometimes convenient to summarise them as a single value. In the case of
ROC curves, this is given as the Area Under the Curve (AUC), and for confusion matrices
by kappa, or weighted kappa statistics. While there is extensive literature on the statistics
of ROC curves and confusion matrices they are not applicable to the measurement of intelligent
systems when tested with small data samples, particularly when the AUC or kappa
statistic is high.
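As a rough illustration of the two single-value summaries named above, the sketch below computes a nonparametric AUC (the fraction of positive/negative pairs ranked correctly) and an unweighted Cohen's kappa. The function names and the toy scores and counts are hypothetical and are not taken from the thesis.

```python
def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def cohen_kappa(matrix):
    """Unweighted Cohen's kappa for a square confusion matrix.

    Rows are one rater (e.g. the human expert), columns the other
    (e.g. the system under test).
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / total
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: three diseased and three healthy cases.
auc = auc_mann_whitney([0.9, 0.8, 0.6], [0.7, 0.4, 0.2])

# Hypothetical 2x2 agreement table between an expert and a system.
kappa = cohen_kappa([[8, 2], [1, 9]])
```

With so few cases, a single misranked pair moves the AUC by more than 0.1, which is the small-sample sensitivity the abstract is concerned with.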
A fundamental Bayesian study has been carried out, and new methods devised, to provide
better statistical measures for ROC curves and confusion matrices at low sample sizes.
They enable exact Bayesian posterior intervals to be produced for: (1) the individual points
on a ROC curve; (2) comparison between matching points on two uncorrelated curves;
(3) the AUC of a ROC curve, using both parametric and nonparametric assumptions; (4)
the parameters of a parametric ROC curve; and (5) the weight of a weighted confusion
matrix.
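The abstract does not describe the thesis's algorithms. For item (1) only, a minimal textbook sketch is possible if one assumes a uniform Beta(1, 1) prior and treats a single ROC point's sensitivity as a binomial proportion: the posterior is then Beta(TP + 1, FN + 1), and for integer parameters its CDF can be evaluated exactly as a binomial tail sum. The function names and the counts (9 true positives, 1 false negative) are hypothetical.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile_int(q, a, b, tol=1e-10):
    """Invert the Beta(a, b) CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def posterior_interval(successes, failures, level=0.95):
    """Equal-tailed posterior interval under an assumed uniform Beta(1, 1) prior."""
    a, b = successes + 1, failures + 1
    tail = (1 - level) / 2
    return beta_quantile_int(tail, a, b), beta_quantile_int(1 - tail, a, b)


# Hypothetical small test set: 9 of 10 diseased cases detected.
sens_interval = posterior_interval(9, 1)
```

An equal-tailed interval is only one possible choice; a highest-density interval would be narrower when the posterior is skewed, as it is near a sensitivity of 1.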
These new methods have been implemented in software to provide a powerful and accurate
tool for developers and evaluators of intelligent medical systems in particular, and to a
much wider audience using ROC curves and confusion matrices in general. This should
enhance the ability to prove intelligent medical systems safe and effective and should lead
to their widespread deployment.
The mathematical and computational methods developed in this thesis should also provide
the basis for future research into determination of posterior intervals for other statistics
at small sample sizes.
Confidence Bands for ROC Curves
In this paper we study techniques for generating and evaluating
confidence bands on ROC curves. ROC curve evaluation is
rapidly becoming a commonly used evaluation metric in machine
learning, although evaluating ROC curves has thus far been limited
to studying the area under the curve (AUC) or generation of
one-dimensional confidence intervals by freezing one variable:
the false-positive rate, or threshold on the classification scoring
function. Researchers in the medical field have long been using
ROC curves and have many well-studied methods for analyzing
such curves, including generating confidence intervals as
well as simultaneous confidence bands. In this paper we introduce
these techniques to the machine learning community and
show their empirical fitness on the Covertype data set, a standard
machine learning benchmark from the UCI repository. We
show how some of these methods work remarkably well, others
are too loose, and that existing machine learning methods for generation
of 1-dimensional confidence intervals do not translate well
to generation of simultaneous bands; their bands are too tight.
Information Systems Working Papers Serie
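To make the idea of one-dimensional intervals obtained by freezing the false-positive rate concrete, here is a minimal sketch that bootstraps the true-positive rate at a few fixed false-positive rates. It is not claimed to be any specific method evaluated in the paper. The stratified resampling, the function names and the toy scores are all assumptions made for illustration.

```python
import random


def roc_points(scores, labels):
    """ROC points (fpr, tpr) from sweeping a threshold down through the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Cases with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def tpr_at(points, fpr):
    """Highest TPR reachable with a false-positive rate not exceeding fpr."""
    return max(t for f, t in points if f <= fpr + 1e-12)


def pointwise_band(scores, labels, grid, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the TPR at each frozen FPR in grid."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    samples = {f: [] for f in grid}
    for _ in range(n_boot):
        # Resample each class separately so both are always present.
        bp = rng.choices(pos, k=len(pos))
        bn = rng.choices(neg, k=len(neg))
        pts = roc_points(bp + bn, [1] * len(bp) + [0] * len(bn))
        for f in grid:
            samples[f].append(tpr_at(pts, f))
    band = []
    for f in grid:
        vals = sorted(samples[f])
        lo = vals[int((alpha / 2) * (n_boot - 1))]
        hi = vals[int((1 - alpha / 2) * (n_boot - 1))]
        band.append((f, lo, hi))
    return band


# Hypothetical toy data: 8 positive and 8 negative cases.
scores = [0.95, 0.9, 0.85, 0.7, 0.65, 0.6, 0.45, 0.3,
          0.8, 0.55, 0.5, 0.4, 0.35, 0.2, 0.15, 0.1]
labels = [1] * 8 + [0] * 8
band = pointwise_band(scores, labels, grid=[0.0, 0.25, 0.5, 1.0])
```

Each interval here covers the curve at one false-positive rate only. Joining them across the grid does not give a band that contains the whole curve at the stated level, which matches the abstract's remark that such bands are too tight.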
Confidence Bands for ROC Curves: Methods and an Empirical Study
In this paper we study techniques for generating
and evaluating confidence bands on ROC curves. ROC
curve evaluation is rapidly becoming a commonly used evaluation
metric in machine learning, although evaluating ROC
curves has thus far been limited to studying the area under
the curve (AUC) or generation of one-dimensional confidence
intervals by freezing one variable—the false-positive rate, or
threshold on the classification scoring function. Researchers in
the medical field have long been using ROC curves and have
many well-studied methods for analyzing such curves, including
generating confidence intervals as well as simultaneous
confidence bands. In this paper we introduce these techniques
to the machine learning community and show their empirical
fitness on the Covertype data set—a standard machine learning
benchmark from the UCI repository. We show how some
of these methods work remarkably well, others are too loose,
and that existing machine learning methods for generation
of 1-dimensional confidence intervals do not translate well to
generation of simultaneous bands; their bands are too tight.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
Cross-Modal Data Programming Enables Rapid Medical Machine Learning
Labeling training datasets has become a key barrier to building medical
machine learning models. One strategy is to generate training labels
programmatically, for example by applying natural language processing pipelines
to text reports associated with imaging studies. We propose cross-modal data
programming, which generalizes this intuitive strategy in a
theoretically-grounded way that enables simpler, clinician-driven input,
reduces required labeling time, and improves with additional unlabeled data. In
this approach, clinicians generate training labels for models defined over a
target modality (e.g. images or time series) by writing rules over an auxiliary
modality (e.g. text reports). The resulting technical challenge consists of
estimating the accuracies and correlations of these rules; we extend a recent
unsupervised generative modeling technique to handle this cross-modal setting
in a provably consistent way. Across four applications in radiography, computed
tomography, and electroencephalography, and using only several hours of
clinician time, our approach matches or exceeds the efficacy of
physician-months of hand-labeling with statistical significance, demonstrating
a fundamentally faster and more flexible way of building machine learning
models in medicine.
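The rule-writing step described above can be sketched as follows: rules read the text report and vote on a label, and the label is then attached to the paired image. This is a deliberate simplification. The paper estimates rule accuracies and correlations with a generative model, whereas this sketch uses a plain majority vote. The rule names, report texts and image identifiers are hypothetical.

```python
ABSTAIN = None  # a rule returns None when it has no opinion


def lf_mentions_fracture(report):
    """Vote positive if a fracture is mentioned and not explicitly negated."""
    text = report.lower()
    if "fracture" in text and "no fracture" not in text:
        return 1
    return ABSTAIN


def lf_negated_fracture(report):
    """Vote negative on an explicit negation."""
    return 0 if "no fracture" in report.lower() else ABSTAIN


def lf_normal_study(report):
    """Vote negative when the report calls the study normal."""
    return 0 if "normal study" in report.lower() else ABSTAIN


def majority_vote(report, rules):
    """Combine rule votes; return None on a tie or when every rule abstains."""
    votes = []
    for rule in rules:
        vote = rule(report)
        if vote is not ABSTAIN:
            votes.append(vote)
    pos = sum(votes)
    neg = len(votes) - pos
    if pos == neg:
        return None
    return 1 if pos > neg else 0


RULES = [lf_mentions_fracture, lf_negated_fracture, lf_normal_study]

# Hypothetical reports keyed by the image they describe.
reports = {
    "img_001": "Acute fracture of the distal radius.",
    "img_002": "No fracture. Normal study.",
    "img_003": "Lines and tubes in standard position.",
}

# Training labels for the image model; unlabeled images are dropped.
image_labels = {}
for image_id, text in reports.items():
    label = majority_vote(text, RULES)
    if label is not None:
        image_labels[image_id] = label
```

A majority vote treats every rule as equally reliable and independent. Estimating per-rule accuracies, as the abstract describes, is what allows noisy or overlapping rules to be weighted appropriately.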
Understanding metric-related pitfalls in image analysis validation
Validation metrics are key for the reliable tracking of scientific progress
and for bridging the current chasm between artificial intelligence (AI)
research and its translation into practice. However, increasing evidence shows
that particularly in image analysis, metrics are often chosen inadequately in
relation to the underlying research problem. This could be attributed to a lack
of accessibility of metric-related knowledge: While taking into account the
individual strengths, weaknesses, and limitations of validation metrics is a
critical prerequisite to making educated choices, the relevant knowledge is
currently scattered and poorly accessible to individual researchers. Based on a
multi-stage Delphi process conducted by a multidisciplinary expert consortium
as well as extensive community feedback, the present work provides the first
reliable and comprehensive common point of access to information on pitfalls
related to validation metrics in image analysis. Focusing on biomedical image
analysis but with the potential of transfer to other fields, the addressed
pitfalls generalize across application domains and are categorized according to
a newly created, domain-agnostic taxonomy. To facilitate comprehension,
illustrations and specific examples accompany each pitfall. As a structured
body of information accessible to researchers of all levels of expertise, this
work enhances global comprehension of a key topic in image analysis validation.
Comment: Shared first authors: Annika Reinke, Minu D. Tizabi; shared senior authors: Paul F. Jäger, Lena Maier-Hei
Early symptoms and sensations as predictors of lung cancer: a machine learning multivariate model.
The aim of this study was to identify a combination of early predictive symptoms/sensations attributable to primary lung cancer (LC). An interactive e-questionnaire comprised of pre-diagnostic descriptors of first symptoms/sensations was administered to patients referred for suspected LC. Respondents were included in the present analysis only if they later received a primary LC diagnosis or had no cancer; and inclusion of each descriptor required ≥4 observations. Fully-completed data from 506/670 individuals later diagnosed with primary LC (n = 311) or no cancer (n = 195) were modelled with orthogonal projections to latent structures (OPLS). After analysing 145/285 descriptors, meeting inclusion criteria, through randomised seven-fold cross-validation (six-fold training set: n = 433; test set: n = 73), 63 provided best LC prediction. The most-significant LC-positive descriptors included a cough that varied over the day, back pain/aches/discomfort, early satiety, appetite loss, and having less strength. Upon combining the descriptors with the background variables current smoking, a cold/flu or pneumonia within the past two years, female sex, older age, a history of COPD (positive LC-association); antibiotics within the past two years, and a history of pneumonia (negative LC-association); the resulting 70-variable model had accurate cross-validated test set performance: area under the ROC curve = 0.767 (descriptors only: 0.736/background predictors only: 0.652), sensitivity = 84.8% (73.9/76.1%, respectively), specificity = 55.6% (66.7/51.9%, respectively). In conclusion, accurate prediction of LC was found through 63 early symptoms/sensations and seven background factors. Further research and precision in this model may lead to a tool for referral and LC diagnostic decision-making
Investigating the detection of adverse drug events in a UK general practice electronic health-care database
Data-mining techniques have frequently been developed
for spontaneous reporting databases. These techniques
aim to find adverse drug events accurately and efficiently. Spontaneous reporting databases are prone to missing information, under-reporting and incorrect entries. This often results in a detection lag or prevents the detection of some adverse drug events. These limitations do not occur in electronic healthcare databases. In this paper, existing methods developed for spontaneous reporting databases are implemented on both a
spontaneous reporting database and a general practice electronic health-care database, and the results are compared. The results suggest that applying existing methods to the general practice database may help find signals that have gone undetected in the spontaneous reporting database. In addition, the general practice database provides far more supplementary information which, if incorporated into the analysis, could help identify adverse events more
accurately.