
    Evaluation of Intelligent Medical Systems

    This thesis presents novel, robust, analytic and algorithmic methods for calculating Bayesian posterior intervals of receiver operating characteristic (ROC) curves and confusion matrices used for the evaluation of intelligent medical systems tested with small amounts of data. Intelligent medical systems are potentially important in encapsulating rare and valuable medical expertise and making it more widely available. The evaluation of intelligent medical systems must make sure that such systems are safe and cost effective. To ensure systems are safe and perform at expert level, they must be tested against human experts. Human experts are rare and busy, which often severely restricts the number of test cases that may be used for comparison. The performance of an expert, human or machine, can be represented objectively by ROC curves or confusion matrices. ROC curves and confusion matrices are complex representations, and it is sometimes convenient to summarise them as a single value. In the case of ROC curves, this is given as the Area Under the Curve (AUC), and for confusion matrices by kappa or weighted kappa statistics. While there is extensive literature on the statistics of ROC curves and confusion matrices, the existing methods are not applicable to the measurement of intelligent systems when tested with small data samples, particularly when the AUC or kappa statistic is high. A fundamental Bayesian study has been carried out, and new methods devised, to provide better statistical measures for ROC curves and confusion matrices at low sample sizes. They enable exact Bayesian posterior intervals to be produced for: (1) the individual points on a ROC curve; (2) comparison between matching points on two uncorrelated curves; (3) the AUC of a ROC curve, using both parametric and nonparametric assumptions; (4) the parameters of a parametric ROC curve; and (5) the weight of a weighted confusion matrix. These new methods have been implemented in software to provide a powerful and accurate tool for developers and evaluators of intelligent medical systems in particular, and to a much wider audience using ROC curves and confusion matrices in general. This should enhance the ability to prove intelligent medical systems safe and effective and should lead to their widespread deployment. The mathematical and computational methods developed in this thesis should also provide the basis for future research into determination of posterior intervals for other statistics at small sample sizes.
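
As a rough illustration of the small-sample setting this abstract describes, the sketch below computes an equal-tailed posterior interval for one coordinate of a single ROC point (the sensitivity). It assumes a uniform Beta(1, 1) prior and approximates the interval by Monte Carlo sampling. This is a standard Beta-binomial model, not the exact method developed in the thesis; the function name `beta_posterior_interval`, the prior, and the counts are assumptions made for illustration.

```python
import random


def beta_posterior_interval(successes, trials, level=0.95, draws=20000, seed=0):
    """Equal-tailed posterior interval for a binomial rate under a Beta(1, 1) prior.

    Monte Carlo approximation: sample the Beta(successes + 1, failures + 1)
    posterior and read off the empirical quantiles.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    alpha = successes + 1
    beta = trials - successes + 1
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


# Hypothetical small test set: 9 of 10 diseased cases correctly flagged.
low, high = beta_posterior_interval(successes=9, trials=10)
print(f"sensitivity 95% posterior interval: ({low:.3f}, {high:.3f})")
```

With only ten cases the interval is wide even though the observed sensitivity is 0.9, which is the situation the abstract says arises when expert-labelled test cases are scarce.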

    Confidence Bands for ROC Curves

    In this paper we study techniques for generating and evaluating confidence bands on ROC curves. ROC curve evaluation is rapidly becoming a commonly used evaluation metric in machine learning, although evaluating ROC curves has thus far been limited to studying the area under the curve (AUC) or generation of one-dimensional confidence intervals by freezing one variable: the false-positive rate, or threshold on the classification scoring function. Researchers in the medical field have long been using ROC curves and have many well-studied methods for analyzing such curves, including generating confidence intervals as well as simultaneous confidence bands. In this paper we introduce these techniques to the machine learning community and show their empirical fitness on the Covertype data set, a standard machine learning benchmark from the UCI repository. We show how some of these methods work remarkably well, others are too loose, and that existing machine learning methods for generation of 1-dimensional confidence intervals do not translate well to generation of simultaneous bands: their bands are too tight. Information Systems Working Papers Serie

    Confidence Bands for ROC Curves: Methods and an Empirical Study

    In this paper we study techniques for generating and evaluating confidence bands on ROC curves. ROC curve evaluation is rapidly becoming a commonly used evaluation metric in machine learning, although evaluating ROC curves has thus far been limited to studying the area under the curve (AUC) or generation of one-dimensional confidence intervals by freezing one variable: the false-positive rate, or threshold on the classification scoring function. Researchers in the medical field have long been using ROC curves and have many well-studied methods for analyzing such curves, including generating confidence intervals as well as simultaneous confidence bands. In this paper we introduce these techniques to the machine learning community and show their empirical fitness on the Covertype data set, a standard machine learning benchmark from the UCI repository. We show how some of these methods work remarkably well, others are too loose, and that existing machine learning methods for generation of 1-dimensional confidence intervals do not translate well to generation of simultaneous bands: their bands are too tight. NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
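
The one-dimensional interval mentioned in this abstract (freezing the false-positive rate and asking how much the true-positive rate varies) can be sketched with a percentile bootstrap. This is a generic illustration and not necessarily one of the methods the paper evaluates. The function names, the synthetic Gaussian scores, and the resampling settings are all assumptions.

```python
import random


def tpr_at_fpr(pos_scores, neg_scores, target_fpr):
    """TPR at the lowest threshold whose empirical FPR does not exceed target_fpr."""
    neg_sorted = sorted(neg_scores, reverse=True)
    allowed_fp = int(target_fpr * len(neg_sorted))
    # A case is called positive when score > threshold, so at most
    # allowed_fp negatives can exceed the chosen threshold.
    if allowed_fp >= len(neg_sorted):
        threshold = float("-inf")
    else:
        threshold = neg_sorted[allowed_fp]
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


def bootstrap_tpr_interval(pos_scores, neg_scores, target_fpr,
                           level=0.95, resamples=2000, seed=0):
    """Percentile-bootstrap interval for the TPR at one fixed FPR."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(resamples):
        # Resample positives and negatives separately, with replacement.
        pos = rng.choices(pos_scores, k=len(pos_scores))
        neg = rng.choices(neg_scores, k=len(neg_scores))
        estimates.append(tpr_at_fpr(pos, neg, target_fpr))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * resamples)]
    upper = estimates[int((1.0 - tail) * resamples) - 1]
    return lower, upper


# Synthetic classifier scores (assumed for illustration only).
data_rng = random.Random(1)
positives = [data_rng.gauss(1.0, 1.0) for _ in range(200)]
negatives = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

lower, upper = bootstrap_tpr_interval(positives, negatives, target_fpr=0.2)
print(f"TPR at FPR=0.2: 95% interval ({lower:.3f}, {upper:.3f})")
```

Each such interval covers one false-positive rate at the stated level. Repeating it across many rates and joining the endpoints does not give a band that covers the whole curve at that level, which is consistent with the abstract's finding that bands built from one-dimensional intervals are too tight.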

    Cross-Modal Data Programming Enables Rapid Medical Machine Learning

    Labeling training datasets has become a key barrier to building medical machine learning models. One strategy is to generate training labels programmatically, for example by applying natural language processing pipelines to text reports associated with imaging studies. We propose cross-modal data programming, which generalizes this intuitive strategy in a theoretically grounded way that enables simpler, clinician-driven input, reduces required labeling time, and improves with additional unlabeled data. In this approach, clinicians generate training labels for models defined over a target modality (e.g. images or time series) by writing rules over an auxiliary modality (e.g. text reports). The resulting technical challenge consists of estimating the accuracies and correlations of these rules; we extend a recent unsupervised generative modeling technique to handle this cross-modal setting in a provably consistent way. Across four applications in radiography, computed tomography, and electroencephalography, and using only several hours of clinician time, our approach matches or exceeds the efficacy of physician-months of hand-labeling with statistical significance, demonstrating a fundamentally faster and more flexible way of building machine learning models in medicine.
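
The idea of writing rules over text reports to label the paired images can be sketched as follows. The rules, label names, and example reports are hypothetical. The combination step here is an unweighted majority vote, a deliberate simplification that stands in for the generative model the abstract describes for estimating rule accuracies and correlations.

```python
ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1


# Hypothetical clinician-written rules over the report text.
def lf_mentions_fracture(report):
    return ABNORMAL if "fracture" in report.lower() else ABSTAIN


def lf_no_acute(report):
    return NORMAL if "no acute" in report.lower() else ABSTAIN


def lf_unremarkable(report):
    return NORMAL if "unremarkable" in report.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_fracture, lf_no_acute, lf_unremarkable]


def majority_vote(report):
    """Combine rule outputs by unweighted majority vote; abstain on ties."""
    votes = [lf(report) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    abnormal = sum(v == ABNORMAL for v in votes)
    normal = len(votes) - abnormal
    if abnormal == normal:
        return ABSTAIN
    return ABNORMAL if abnormal > normal else NORMAL


reports = [
    "Displaced fracture of the distal radius.",
    "No acute cardiopulmonary process. Unremarkable study.",
    "No acute fracture.",
]
# The resulting labels would be attached to the paired images for training.
labels = [majority_vote(r) for r in reports]
print(labels)
```

The third report shows why simple voting falls short: the two rules that fire conflict, so the vote abstains. Conflicts like this are the reason the accuracies and correlations of the rules need to be estimated.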

    Understanding metric-related pitfalls in image analysis validation

    Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation. Comment: Shared first authors: Annika Reinke, Minu D. Tizabi; shared senior authors: Paul F. Jäger, Lena Maier-Hei

    Investigating the detection of adverse drug events in a UK general practice electronic health-care database

    Data-mining techniques have frequently been developed for spontaneous reporting databases. These techniques aim to find adverse drug events accurately and efficiently. Spontaneous reporting databases are prone to missing information, under-reporting and incorrect entries. This often results in a detection lag or prevents the detection of some adverse drug events. These limitations do not occur in electronic healthcare databases. In this paper, existing methods developed for spontaneous reporting databases are implemented on both a spontaneous reporting database and a general practice electronic health-care database and compared. The results suggest that the application of existing methods to the general practice database may help find signals that have gone undetected when using the spontaneous reporting system database. In addition, the general practice database provides far more supplementary information that, if incorporated in the analysis, could help identify adverse events more accurately.

    AUC confidence bounds for performance evaluation of Brain-Computer Interface
