21 research outputs found

    Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation

    Get PDF
    Precision-recall (PR) curves and the areas under them are widely used to summarize machine learning results, especially for data sets exhibiting class skew. They are often used analogously to ROC curves and the area under ROC curves. It is known that PR curves vary as class skew changes. What was not recognized before this paper is that there is a region of PR space that is completely unachievable, and the size of this region depends only on the skew. This paper precisely characterizes the size of that region and discusses its implications for empirical evaluation methodology in machine learning.Comment: ICML2012, fixed citations to use correct tech report numbe

    Precision-Recall-Gain Curves: PR Analysis Done Right

    Get PDF
    Abstract Precision-Recall analysis abounds in applications of binary classification where true negatives do not add value and hence should not affect assessment of the classifier's performance. Perhaps inspired by the many advantages of receiver operating characteristic (ROC) curves and the area under such curves for accuracybased performance assessment, many researchers have taken to report PrecisionRecall (PR) curves and associated areas as performance metric. We demonstrate in this paper that this practice is fraught with difficulties, mainly because of incoherent scale assumptions -e.g., the area under a PR curve takes the arithmetic mean of precision values whereas the F β score applies the harmonic mean. We show how to fix this by plotting PR curves in a different coordinate system, and demonstrate that the new Precision-Recall-Gain curves inherit all key advantages of ROC curves. In particular, the area under Precision-Recall-Gain curves conveys an expected F 1 score on a harmonic scale, and the convex hull of a PrecisionRecall-Gain curve allows us to calibrate the classifier's scores so as to determine, for each operating point on the convex hull, the interval of β values for which the point optimises F β . We demonstrate experimentally that the area under traditional PR curves can easily favour models with lower expected F 1 score than others, and so the use of Precision-Recall-Gain curves will result in better model selection

    Doublet identification in single-cell sequencing data using scDblFinder

    Full text link
    Doublets are prevalent in single-cell sequencing data and can lead to artifactual findings. A number of strategies have therefore been proposed to detect them. Building on the strengths of existing approaches, we developed scDblFinder, a fast, flexible and accurate Bioconductor-based doublet detection method. Here we present the method, justify its design choices, demonstrate its performance on both single-cell RNA and accessibility (ATAC) sequencing data, and provide some observations on doublet formation, detection, and enrichment analysis. Even in complex datasets, scDblFinder can accurately identify most heterotypic doublets, and was already found by an independent benchmark to outcompete alternatives
    corecore