Threshold Choice Methods: the Missing Link
Many performance metrics have been introduced for the evaluation of
classification performance, with different origins and niches of application:
accuracy, macro-accuracy, area under the ROC curve, the ROC convex hull, the
absolute error, and the Brier score (with its decomposition into refinement and
calibration). One way of understanding the relation among some of these metrics
is the use of variable operating conditions (either in the form of
misclassification costs or class proportions). Thus, a metric may correspond to
some expected loss over a range of operating conditions. One dimension for the
analysis has been precisely the distribution we take for this range of
operating conditions, leading to some important connections in the area of
proper scoring rules. However, we show that there is another dimension which
has not received attention in the analysis of performance metrics. This new
dimension is given by the decision rule, which is typically implemented as a
threshold choice method when using scoring models. In this paper, we explore
many old and new threshold choice methods: fixed, score-uniform, score-driven,
rate-driven and optimal, among others. By calculating the loss of these methods
for a uniform range of operating conditions we get the 0-1 loss, the absolute
error, the Brier score (mean squared error), the AUC and the refinement loss
respectively. This provides a comprehensive view of performance metrics as well
as a systematic approach to loss minimisation, namely: take a model, apply
several threshold choice methods consistent with the information which is (and
will be) available about the operating condition, and compare their expected
losses. In order to assist in this procedure we also derive several connections
between the aforementioned performance metrics, and we highlight the role of
calibration in choosing the threshold choice method.
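As a quick illustration of the score-driven case above, the following Python sketch (ours, not the paper's code) checks numerically that averaging the loss over a uniform range of operating conditions, with the threshold set equal to the operating condition, recovers the Brier score. The cost convention assumed here, 2c per false positive and 2(1 - c) per false negative at operating condition c, is our choice for the sketch; the paper's own notation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scores in [0, 1]; the labels are deliberately miscalibrated to show
# the identity does not depend on calibration.
n = 20_000
scores = rng.uniform(size=n)
labels = (rng.uniform(size=n) < scores ** 2).astype(float)

def cost_loss(t, c):
    """Loss at threshold t and operating condition c, assuming costs
    2c per false positive and 2(1 - c) per false negative."""
    pred_pos = scores > t
    fp = np.mean((labels == 0) & pred_pos)   # joint probability of a false positive
    fn = np.mean((labels == 1) & ~pred_pos)  # joint probability of a false negative
    return 2 * c * fp + 2 * (1 - c) * fn

# Score-driven threshold choice: t = c, averaged over c ~ Uniform(0, 1).
cs = np.linspace(0, 1, 2001)
score_driven = np.mean([cost_loss(c, c) for c in cs])

print(f"expected score-driven loss: {score_driven:.4f}")
print(f"Brier score:                {np.mean((scores - labels) ** 2):.4f}")
```

The two printed values agree up to grid and sampling error.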
A review of calibration methods for biometric systems in forensic applications
When, in a criminal case, there are traces from a crime scene - e.g., finger marks or facial recordings from a surveillance camera - as well as a suspect, the judge has to accept either the hypothesis of the prosecution, stating that the trace originates from the suspect, or the hypothesis of the defense, stating the opposite. The current practice is that forensic experts provide a degree of support for either of the two hypotheses, based on their examinations of the trace and reference data - e.g., fingerprints or photos - taken from the suspect. There is a growing interest in more objective quantitative support for these hypotheses based on the output of biometric systems instead of manual comparison. However, the output of a score-based biometric system is not directly suitable for quantifying the evidential value contained in a trace. A suitable measure that is gradually becoming accepted in the forensic community is the Likelihood Ratio (LR), the ratio of the probability of the evidence given the prosecution hypothesis to the probability of the evidence given the defense hypothesis. In this paper we study and compare different score-to-LR conversion methods (called calibration methods). We include four methods in this comparative study: Kernel Density Estimation (KDE), Logistic Regression (Log Reg), Histogram Binning (HB), and Pool Adjacent Violators (PAV). Useful statistics, such as the mean and bias of the bootstrap distribution of LRs for a single score value, are calculated for each method for varying population sizes and score locations.
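To give a flavour of one of these methods, here is a minimal sketch of PAV-based score-to-LR conversion on assumed toy data (not the study's data or code): isotonic regression maps scores to posterior probabilities, and dividing the posterior odds by the prior odds yields an LR.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
# Toy comparison scores (assumed): same-source pairs score higher on average.
same = rng.normal(2.0, 1.0, 500)
diff = rng.normal(0.0, 1.0, 500)
scores = np.concatenate([same, diff])
y = np.concatenate([np.ones(500), np.zeros(500)])

# PAV: fit a monotone map from score to posterior P(same-source | score).
pav = IsotonicRegression(out_of_bounds="clip").fit(scores, y)

def score_to_lr(s, prior=0.5):
    """Convert a raw score to an LR via the calibrated posterior:
    LR = posterior odds / prior odds (Bayes' rule). The default prior
    matches the 50/50 class balance of the toy training set."""
    p = np.clip(pav.predict(np.atleast_1d(s)), 1e-6, 1 - 1e-6)
    prior_odds = prior / (1 - prior)
    return (p / (1 - p)) / prior_odds

print(score_to_lr(1.5))
```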
ROC curves in cost space
ROC curves and cost curves are two popular ways of visualising classifier performance, finding appropriate thresholds according to the operating condition, and deriving useful aggregated measures such as the area under the ROC curve (AUC) or the area under the optimal cost curve. In this paper we present new findings and connections between ROC space and cost space. In particular, we show that ROC curves can be transferred to cost space by means of a very natural threshold choice method, which sets the decision threshold such that the proportion of positive predictions equals the operating condition. We call these new curves rate-driven curves, and we demonstrate that the expected loss as measured by the area under these curves is linearly related to AUC. We show that the rate-driven curves are the genuine equivalent of ROC curves in cost space, establishing a point-point rather than a point-line correspondence. Furthermore, a decomposition of the rate-driven curves is introduced which separates the loss due to the threshold choice method from the ranking loss (Kendall τ distance). We also derive the corresponding curve to the ROC convex hull in cost space; this curve is different from the lower envelope of the cost lines, as the latter assumes only optimal thresholds are chosen.
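The rate-driven threshold choice is straightforward to emulate: predict positive on the top-c fraction of the scores. The sketch below (ours; it reuses the assumed cost convention of 2c per false positive and 2(1 - c) per false negative) checks the linear relation with AUC empirically; under this convention the expected rate-driven loss works out to pi0 * pi1 * (1 - 2 * AUC) + 2/3.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 20_000
y = (rng.uniform(size=n) < 0.4).astype(int)      # toy data, ~40% positives
s = rng.normal(loc=y.astype(float), scale=1.5)   # imperfect ranker

order = np.argsort(-s)   # instances sorted by descending score
y_sorted = y[order]

def rate_driven_loss(c):
    """Predict positive on the top-c fraction, so the positive prediction
    rate equals the operating condition c; costs as in the earlier sketch."""
    k = int(round(c * n))
    fp = np.sum(y_sorted[:k] == 0) / n   # false positives above the cut
    fn = np.sum(y_sorted[k:] == 1) / n   # false negatives below the cut
    return 2 * c * fp + 2 * (1 - c) * fn

cs = np.linspace(0, 1, 1001)
loss = np.mean([rate_driven_loss(c) for c in cs])

pi1 = y.mean()
pi0 = 1 - pi1
auc = roc_auc_score(y, s)
print(f"expected rate-driven loss: {loss:.4f}")
print(f"pi0*pi1*(1 - 2*AUC) + 2/3: {pi0 * pi1 * (1 - 2 * auc) + 2 / 3:.4f}")
```

The two quantities agree up to small finite-sample and grid error, illustrating the linear relationship with AUC.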
Evaluating Probabilistic Classifiers: The Triptych
Probability forecasts for binary outcomes, often referred to as probabilistic
classifiers or confidence scores, are ubiquitous in science and society, and
methods for evaluating and comparing them are in great demand. We propose and
study a triptych of diagnostic graphics that focus on distinct and
complementary aspects of forecast performance: The reliability diagram
addresses calibration, the receiver operating characteristic (ROC) curve
diagnoses discrimination ability, and the Murphy diagram visualizes overall
predictive performance and value. A Murphy curve shows a forecast's mean
elementary scores, including the widely used misclassification rate, and the
area under a Murphy curve equals the mean Brier score. For a calibrated
forecast, the reliability curve lies on the diagonal, and for competing
calibrated forecasts, the ROC and Murphy curves share the same number of
crossing points. We invoke the recently developed CORP (Consistent, Optimally
binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based)
approach to craft reliability diagrams and decompose a mean score into
miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components.
Plots of the DSC measure of discrimination ability versus the calibration
metric MCB visualize classifier performance across multiple competitors. The
proposed tools are illustrated in empirical examples from astrophysics,
economics, and social science.
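Because the CORP decomposition is fully determined by PAV recalibration, it fits in a few lines. A sketch for the Brier score (function and variable names are ours):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def corp_brier_decomposition(p, y):
    """CORP decomposition of the mean Brier score: score = MCB - DSC + UNC.
    PAV-recalibrated forecasts give the calibrated score; the constant
    climatological forecast gives the uncertainty component."""
    brier = lambda f: np.mean((f - y) ** 2)
    p_pav = IsotonicRegression(out_of_bounds="clip").fit(p, y).predict(p)
    r = np.full_like(p, y.mean())
    S, S_pav, S_r = brier(p), brier(p_pav), brier(r)
    return {"score": S, "MCB": S - S_pav, "DSC": S_r - S_pav, "UNC": S_r}

# Example: overconfident forecasts for a toy binary outcome.
rng = np.random.default_rng(3)
p = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < 0.5 + 0.25 * (p - 0.5)).astype(float)
print(corp_brier_decomposition(p, y))
```

By construction the components satisfy MCB - DSC + UNC = score exactly.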
On classification, ranking, and probability estimation
We propose a lexicographic ranker, LexRank, whose rankings are derived not from scores, but from a simple ranking of attribute values obtained from the training data. Although various metrics can be used, we show that by using the odds ratio to rank the attribute values we obtain a ranker that is conceptually close to the naive Bayes classifier, in the sense that for every instance of LexRank there exists an instance of naive Bayes that achieves the same ranking. However, the reverse is not true, which means that LexRank is more biased than naive Bayes. We systematically develop the relationships and differences between classification, ranking, and probability estimation, which leads to a novel connection between the Brier score and ROC curves. Combining LexRank with isotonic regression, which derives probability estimates from the ROC convex hull, results in the lexicographic probability estimator LexProb.
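The last step, deriving probability estimates from the ROC convex hull, is what isotonic regression (PAV) computes when fitted to the labels in ranking order: the estimates are the slopes of the hull segments. A minimal sketch, with any ranker's output standing in for LexRank:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def probs_from_ranking(rank_scores, y):
    """Probability estimates from the ROC convex hull: isotonic regression
    (PAV) of the labels on the ranking yields the hull-segment slopes.
    Larger rank_scores are assumed to mean 'more likely positive'."""
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    return iso.fit(rank_scores, y).predict(rank_scores)

# Toy usage with an arbitrary ranker's output.
rng = np.random.default_rng(4)
r = rng.normal(size=1000)
y = (rng.uniform(size=1000) < 1 / (1 + np.exp(-r))).astype(float)
print(probs_from_ranking(r, y)[:5])
```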
Precision-Recall-Gain Curves: PR Analysis Done Right
Precision-Recall analysis abounds in applications of binary classification where true negatives do not add value and hence should not affect assessment of the classifier's performance. Perhaps inspired by the many advantages of receiver operating characteristic (ROC) curves and the area under such curves for accuracy-based performance assessment, many researchers have taken to reporting Precision-Recall (PR) curves and associated areas as a performance metric. We demonstrate in this paper that this practice is fraught with difficulties, mainly because of incoherent scale assumptions - e.g., the area under a PR curve takes the arithmetic mean of precision values whereas the Fβ score applies the harmonic mean. We show how to fix this by plotting PR curves in a different coordinate system, and demonstrate that the new Precision-Recall-Gain curves inherit all key advantages of ROC curves. In particular, the area under Precision-Recall-Gain curves conveys an expected F1 score on a harmonic scale, and the convex hull of a Precision-Recall-Gain curve allows us to calibrate the classifier's scores so as to determine, for each operating point on the convex hull, the interval of β values for which the point optimises Fβ. We demonstrate experimentally that the area under traditional PR curves can easily favour models with lower expected F1 score than others, and so the use of Precision-Recall-Gain curves will result in better model selection.
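The coordinate change itself is a one-line rescaling in each direction. A minimal sketch, assuming the standard gain transformation with pi denoting the proportion of positives (a detail not spelled out in this abstract); precision and recall values below pi map to negative gains.

```python
def precision_gain(prec, pi):
    """Precision-gain: 0 at the always-positive baseline (prec = pi), 1 at prec = 1."""
    return (prec - pi) / ((1 - pi) * prec)

def recall_gain(rec, pi):
    """Recall-gain: the same rescaling applied to recall."""
    return (rec - pi) / ((1 - pi) * rec)

# Example: with 20% positives, precision 0.5 sits well above the baseline.
pi = 0.2
print(precision_gain(0.5, pi))   # 0.75
print(recall_gain(0.25, pi))     # 0.25
```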