
    Threshold Choice Methods: the Missing Link

    Many performance metrics have been introduced for the evaluation of classification performance, with different origins and niches of application: accuracy, macro-accuracy, area under the ROC curve, the ROC convex hull, the absolute error, and the Brier score (with its decomposition into refinement and calibration). One way of understanding the relations among some of these metrics is the use of variable operating conditions (either in the form of misclassification costs or class proportions). Thus, a metric may correspond to some expected loss over a range of operating conditions. One dimension for the analysis has been precisely the distribution we take for this range of operating conditions, leading to some important connections in the area of proper scoring rules. However, we show that there is another dimension which has not received attention in the analysis of performance metrics. This new dimension is given by the decision rule, which is typically implemented as a threshold choice method when using scoring models. In this paper, we explore many old and new threshold choice methods: fixed, score-uniform, score-driven, rate-driven and optimal, among others. By calculating the loss of these methods for a uniform range of operating conditions, we get the 0-1 loss, the absolute error, the Brier score (mean squared error), the AUC and the refinement loss, respectively. This provides a comprehensive view of performance metrics as well as a systematic approach to loss minimisation, namely: take a model, apply several threshold choice methods consistent with the information which is (and will be) available about the operating condition, and compare their expected losses. In order to assist in this procedure we also derive several connections between the aforementioned performance metrics, and we highlight the role of calibration in choosing the threshold choice method.
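
    The score-driven entry in this mapping can be checked numerically. The Python sketch below is a minimal illustration under assumptions of my own choosing, not the paper's notation: scores in [0, 1] are read as positive-class probabilities, the operating condition is the cost proportion c = c_FP / (c_FP + c_FN) with costs normalised so that c_FP + c_FN = 2, and the score-driven rule predicts positive whenever the score is at least c. Averaging the resulting loss over a uniform grid of c values reproduces the Brier score up to grid and sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: scores in [0, 1] read as estimated probabilities of the positive
# class; the labels are deliberately miscalibrated, the identity holds regardless.
n = 10_000
scores = rng.uniform(0, 1, size=n)
labels = (rng.uniform(0, 1, size=n) < scores**2).astype(int)

def expected_loss_score_driven(scores, labels, grid=1001):
    """Average cost-sensitive loss over a uniform grid of operating
    conditions c = c_FP / (c_FP + c_FN), with costs normalised so that
    c_FP = 2c and c_FN = 2(1 - c), and the score-driven threshold choice
    predicting positive iff score >= c."""
    losses = []
    for c in np.linspace(0.0, 1.0, grid):
        pred_pos = scores >= c
        fn_cost = 2 * (1 - c) * np.mean((labels == 1) & ~pred_pos)  # false-negative term
        fp_cost = 2 * c * np.mean((labels == 0) & pred_pos)         # false-positive term
        losses.append(fn_cost + fp_cost)
    return float(np.mean(losses))

print("expected score-driven loss:", expected_loss_score_driven(scores, labels))
print("Brier score (MSE):         ", np.mean((scores - labels) ** 2))
# The two values agree up to discretisation error, illustrating the
# score-driven / Brier-score connection described in the abstract.
```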

    A review of calibration methods for biometric systems in forensic applications

    When, in a criminal case, there are traces from a crime scene - e.g., finger marks or facial recordings from a surveillance camera - as well as a suspect, the judge has to accept either the hypothesis H_p of the prosecution, stating that the trace originates from the suspect, or the hypothesis H_d of the defense, stating the opposite. The current practice is that forensic experts provide a degree of support for either of the two hypotheses, based on their examination of the trace and reference data - e.g., fingerprints or photos - taken from the suspect. There is a growing interest in more objective quantitative support for these hypotheses based on the output of biometric systems instead of manual comparison. However, the output of a score-based biometric system is not directly suitable for quantifying the evidential value contained in a trace. A suitable measure that is gradually becoming accepted in the forensic community is the Likelihood Ratio (LR), which is the ratio of the probability of the evidence given H_p to the probability of the evidence given H_d. In this paper we study and compare different score-to-LR conversion methods (called calibration methods). We include four methods in this comparative study: Kernel Density Estimation (KDE), Logistic Regression (Log Reg), Histogram Binning (HB), and Pool Adjacent Violators (PAV). Useful statistics, such as the mean and bias of the bootstrap distribution of LRs for a single score value, are calculated for each method for varying population sizes and score locations.
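
    As a rough sketch of the PAV family only (not the paper's experimental protocol), the fragment below uses scikit-learn's IsotonicRegression as a stand-in PAV implementation: the monotone fit gives a posterior P(H_p | score), and dividing the posterior odds by the prior odds of the training set turns it into an LR. The probability clipping, the toy data and the function name are assumptions made for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def pav_score_to_lr(train_scores, train_labels, test_scores, eps=1e-6):
    """Map comparison scores to likelihood ratios via PAV calibration.

    train_labels: 1 for same-source (H_p true) pairs, 0 for different-source
    (H_d true) pairs.  Isotonic regression (the PAV algorithm) yields a
    monotone map from score to posterior P(H_p | score); dividing the
    posterior odds by the training prior odds gives an LR.  Probabilities
    are clipped away from 0 and 1 to keep the LRs finite."""
    iso = IsotonicRegression(y_min=eps, y_max=1 - eps, out_of_bounds="clip")
    iso.fit(train_scores, train_labels)
    post = iso.predict(test_scores)           # P(H_p | score)
    prior = np.mean(train_labels)             # proportion of same-source pairs
    return (post / (1 - post)) / (prior / (1 - prior))

# Toy usage: same-source scores tend to be higher than different-source ones.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(pav_score_to_lr(scores, labels, np.array([-1.0, 0.0, 1.0, 2.0, 3.0])))
```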

    ROC curves in cost space

    ROC curves and cost curves are two popular ways of visualising classifier performance, finding appropriate thresholds according to the operating condition, and deriving useful aggregated measures such as the area under the ROC curve (AUC) or the area under the optimal cost curve. In this paper we present new findings and connections between ROC space and cost space. In particular, we show that ROC curves can be transferred to cost space by means of a very natural threshold choice method, which sets the decision threshold such that the proportion of positive predictions equals the operating condition. We call these new curves rate-driven curves, and we demonstrate that the expected loss as measured by the area under these curves is linearly related to AUC. We show that the rate-driven curves are the genuine equivalent of ROC curves in cost space, establishing a point-point rather than a point-line correspondence. Furthermore, a decomposition of the rate-driven curves is introduced which separates the loss due to the threshold choice method from the ranking loss (Kendall τ distance). We also derive the corresponding curve to the ROC convex hull in cost space; this curve is different from the lower envelope of the cost lines, as the latter assumes only optimal thresholds are chosen.
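
    The curves discussed above can be sketched directly from scored data. The fragment below computes, for a grid of operating conditions, both the lower envelope of the cost lines (optimal thresholds) and the rate-driven curve, whose threshold makes the proportion of positive predictions equal to the operating condition. The cost-proportion convention (c attached to false negatives, costs normalised to sum to 2) and the toy data are assumptions of this sketch rather than the paper's own notation.

```python
import numpy as np

def cost_curves(scores, labels, n_cond=101):
    """Loss as a function of the operating condition c for two threshold
    choices.  Convention assumed here: c is the cost proportion attached to
    false negatives, with costs normalised so c_FN = 2c and c_FP = 2(1 - c).
    - lower envelope: best possible threshold at each c (optimal thresholds);
    - rate-driven:    threshold set so that a fraction c of all instances
                      is predicted positive."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pi1, pi0 = np.mean(labels == 1), np.mean(labels == 0)
    pos, neg = scores[labels == 1], scores[labels == 0]
    thresholds = np.concatenate(([-np.inf], np.unique(scores), [np.inf]))
    fnr = np.array([np.mean(pos < t) for t in thresholds])   # false negative rate
    fpr = np.array([np.mean(neg >= t) for t in thresholds])  # false positive rate

    cs = np.linspace(0, 1, n_cond)
    envelope, rate_driven = [], []
    for c in cs:
        cost_lines = 2 * c * pi1 * fnr + 2 * (1 - c) * pi0 * fpr
        envelope.append(cost_lines.min())
        t = np.quantile(scores, 1 - c)   # ~c of instances score at or above t
        rate_driven.append(2 * c * pi1 * np.mean(pos < t)
                           + 2 * (1 - c) * pi0 * np.mean(neg >= t))
    return cs, np.array(envelope), np.array(rate_driven)

# Toy usage: the rate-driven curve lies on or above the lower envelope; the
# average over c approximates the area under each curve in cost space.
rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(1, 1, 500), rng.normal(0, 1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
cs, env, rd = cost_curves(scores, labels)
print("expected loss, optimal thresholds:", env.mean())
print("expected loss, rate-driven:       ", rd.mean())
```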

    Evaluating Probabilistic Classifiers: The Triptych

    Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: the reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.
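
    The CORP decomposition mentioned in the abstract can be sketched with isotonic regression standing in for the PAV step. Under the assumption that the mean Brier score is the score being decomposed, the fragment below computes MCB, DSC and UNC by comparing the original forecasts, their PAV-recalibrated version, and a constant base-rate forecast; the toy data and names are illustrative, and the reliability diagram itself is not drawn.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def corp_brier_decomposition(forecasts, outcomes):
    """Decompose the mean Brier score into miscalibration (MCB),
    discrimination (DSC) and uncertainty (UNC), with PAV (isotonic
    regression) providing the recalibrated forecasts.  The identity
    mean score = MCB - DSC + UNC holds up to floating point error."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)

    def brier(p):
        return np.mean((p - outcomes) ** 2)

    # PAV recalibration: best monotone (in the forecast) fit to the outcomes.
    recal = IsotonicRegression(y_min=0.0, y_max=1.0).fit(forecasts, outcomes).predict(forecasts)
    base = np.full_like(outcomes, outcomes.mean())   # constant base-rate forecast

    score = brier(forecasts)
    mcb = score - brier(recal)          # what recalibration would save
    dsc = brier(base) - brier(recal)    # improvement over the constant forecast
    unc = brier(base)                   # score of the constant forecast
    return score, mcb, dsc, unc

# Toy usage: over-confident forecasts pick up a miscalibration penalty.
rng = np.random.default_rng(3)
p_true = rng.uniform(0, 1, 5000)
y = (rng.uniform(0, 1, 5000) < p_true).astype(float)
p_overconfident = np.clip(1.5 * p_true - 0.25, 0, 1)
s, mcb, dsc, unc = corp_brier_decomposition(p_overconfident, y)
print(f"Brier={s:.4f}  MCB={mcb:.4f}  DSC={dsc:.4f}  UNC={unc:.4f}  "
      f"MCB-DSC+UNC={mcb - dsc + unc:.4f}")
```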

    On classification, ranking, and probability estimation

    We propose a lexicographic ranker, LexRank, whose rankings are derived not from scores, but from a simple ranking of attribute values obtained from the training data. Although various metrics can be used, we show that by using the odds ratio to rank the attribute values we obtain a ranker that is conceptually close to the naive Bayes classifier, in the sense that for every instance of LexRank there exists an instance of naive Bayes that achieves the same ranking. However, the reverse is not true, which means that LexRank is more biased than naive Bayes. We systematically develop the relationships and differences between classification, ranking, and probability estimation, which leads to a novel connection between the Brier score and ROC curves. Combining LexRank with isotonic regression, which derives probability estimates from the ROC convex hull, results in the lexicographic probability estimator LexProb.
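
    As a small illustration of the odds-ratio statistic mentioned above (and only of that statistic, not of the LexRank construction itself), the sketch below ranks the values of a single categorical attribute by a smoothed empirical odds ratio with respect to the positive class; the smoothing constant and the toy data are arbitrary choices.

```python
import numpy as np

def rank_attribute_values(values, labels, smoothing=0.5):
    """Rank the values of one categorical attribute by their empirical odds
    ratio with respect to the positive class (labels in {0, 1}), using the
    2x2 contingency table of 'value v vs. not v' against 'positive vs.
    negative', with additive smoothing to avoid division by zero."""
    values, labels = np.asarray(values), np.asarray(labels)
    ors = {}
    for v in np.unique(values):
        a = np.sum((values == v) & (labels == 1)) + smoothing   # v, positive
        b = np.sum((values == v) & (labels == 0)) + smoothing   # v, negative
        c = np.sum((values != v) & (labels == 1)) + smoothing   # not v, positive
        d = np.sum((values != v) & (labels == 0)) + smoothing   # not v, negative
        ors[v] = (a * d) / (b * c)
    return sorted(ors, key=ors.get, reverse=True), ors

# Toy usage: 'high' is strongly associated with the positive class.
values = np.array(['high'] * 40 + ['low'] * 60)
labels = np.concatenate([np.ones(35), np.zeros(5), np.ones(10), np.zeros(50)])
order, ors = rank_attribute_values(values, labels)
print(order)
print(ors)
```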

    Precision-Recall-Gain Curves: PR Analysis Done Right

    Precision-Recall analysis abounds in applications of binary classification where true negatives do not add value and hence should not affect assessment of the classifier’s performance. Perhaps inspired by the many advantages of receiver operating characteristic (ROC) curves and the area under such curves for accuracy-based performance assessment, many researchers have taken to reporting Precision-Recall (PR) curves and associated areas as a performance metric. We demonstrate in this paper that this practice is fraught with difficulties, mainly because of incoherent scale assumptions – e.g., the area under a PR curve takes the arithmetic mean of precision values whereas the Fβ score applies the harmonic mean. We show how to fix this by plotting PR curves in a different coordinate system, and demonstrate that the new Precision-Recall-Gain curves inherit all key advantages of ROC curves. In particular, the area under Precision-Recall-Gain curves conveys an expected F1 score on a harmonic scale, and the convex hull of a Precision-Recall-Gain curve allows us to calibrate the classifier’s scores so as to determine, for each operating point on the convex hull, the interval of β values for which the point optimises Fβ. We demonstrate experimentally that the area under traditional PR curves can easily favour models with lower expected F1 score than others, and so the use of Precision-Recall-Gain curves will result in better model selection.
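
    The gain transformation is compact enough to sketch. Assuming the usual rescaling in which a gain of 0 corresponds to precision (or recall) equal to the positive prevalence and a gain of 1 to a perfect value, with the mapping linear in 1/precision and 1/recall, the fragment below converts PR points to PRG points; the example numbers are arbitrary.

```python
import numpy as np

def precision_recall_gain(precision, recall, pos_rate):
    """Rescale precision and recall to precision-gain and recall-gain.
    A gain of 0 corresponds to a value equal to the positive prevalence
    pos_rate, a gain of 1 to a perfect value; the rescaling is harmonic
    (linear in 1/precision and 1/recall).  Values below the prevalence
    come out negative and are typically clipped or ignored."""
    pi = pos_rate
    prec_gain = (precision - pi) / ((1 - pi) * precision)
    rec_gain = (recall - pi) / ((1 - pi) * recall)
    return prec_gain, rec_gain

# Toy usage: a PR point with precision 0.6 and recall 0.5 at 20% positives.
pg, rg = precision_recall_gain(np.array([0.6]), np.array([0.5]), pos_rate=0.2)
print(pg, rg)   # both strictly between 0 and 1
```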