Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation
Precision-recall (PR) curves and the areas under them are widely used to
summarize machine learning results, especially for data sets exhibiting class
skew. They are often used analogously to ROC curves and the area under ROC
curves. It is known that PR curves vary as class skew changes. What was not
recognized before this paper is that there is a region of PR space that is
completely unachievable, and the size of this region depends only on the skew.
This paper precisely characterizes the size of that region and discusses its
implications for empirical evaluation methodology in machine learning. Comment: ICML2012, fixed citations to use correct tech report number
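The dependence on skew can be made concrete with a short sketch. It is derived only from the definitions of precision and recall, not quoted from the paper: if a fraction `pi` of the examples is positive, a classifier at recall `r` does worst when every negative is a false positive, which gives a precision floor of `pi*r / (pi*r + 1 - pi)`. The function names are illustrative assumptions.

```python
import math


def min_precision(recall: float, pi: float) -> float:
    """Lowest precision reachable at `recall` when a fraction `pi` of the
    examples is positive (worst case: every negative is a false positive)."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must be strictly between 0 and 1")
    return pi * recall / (pi * recall + (1.0 - pi))


def unachievable_area(pi: float) -> float:
    """Area under the precision floor for recall in [0, 1], in closed form.

    Obtained by integrating min_precision over recall; it depends on pi only.
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must be strictly between 0 and 1")
    return 1.0 + (1.0 - pi) * math.log(1.0 - pi) / pi


def unachievable_area_numeric(pi: float, steps: int = 100_000) -> float:
    """Midpoint-rule integration of the floor, as a check on the closed form."""
    h = 1.0 / steps
    return sum(min_precision((i + 0.5) * h, pi) for i in range(steps)) * h


if __name__ == "__main__":
    for pi in (0.01, 0.1, 0.5):
        print(pi, unachievable_area(pi), unachievable_area_numeric(pi))
```

Under these assumptions the area grows with `pi`, so raw areas under PR curves computed on data sets with different skews are not directly comparable.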
Precision-Recall-Gain Curves: PR Analysis Done Right
Precision-Recall analysis abounds in applications of binary classification where true negatives do not add value and hence should not affect assessment of the classifier's performance. Perhaps inspired by the many advantages of receiver operating characteristic (ROC) curves and the area under such curves for accuracy-based performance assessment, many researchers have taken to reporting Precision-Recall (PR) curves and associated areas as performance metrics. We demonstrate in this paper that this practice is fraught with difficulties, mainly because of incoherent scale assumptions: for example, the area under a PR curve takes the arithmetic mean of precision values whereas the Fβ score applies the harmonic mean. We show how to fix this by plotting PR curves in a different coordinate system, and demonstrate that the new Precision-Recall-Gain curves inherit all key advantages of ROC curves. In particular, the area under Precision-Recall-Gain curves conveys an expected F1 score on a harmonic scale, and the convex hull of a Precision-Recall-Gain curve allows us to calibrate the classifier's scores so as to determine, for each operating point on the convex hull, the interval of β values for which the point optimises Fβ. We demonstrate experimentally that the area under traditional PR curves can easily favour models with lower expected F1 score than others, and so the use of Precision-Recall-Gain curves will result in better model selection.
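The scale mismatch can be illustrated with a small sketch. The Fβ formula is the standard one. The gain transform `(x - pi) / ((1 - pi) * x)`, with `pi` the fraction of positives, is the Precision-Recall-Gain mapping as recalled from the paper's body rather than from this abstract, so it should be read as an assumption. The function names are illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: a weighted harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        return 0.0  # convention when both precision and recall are zero
    return (1.0 + b2) * precision * recall / denom


def gain(x: float, pi: float) -> float:
    """Assumed gain transform: maps the baseline x = pi to 0 and x = 1 to 1.

    It is linear in 1/x, which is what puts scores on a harmonic scale.
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must be strictly between 0 and 1")
    if x <= 0.0:
        raise ValueError("x must be positive")
    return (x - pi) / ((1.0 - pi) * x)


if __name__ == "__main__":
    pi, precision, recall = 0.2, 0.5, 0.8
    f1 = f_beta(precision, recall)
    # In gain space, the F1 gain is the plain average of the two gains.
    print(gain(f1, pi), (gain(precision, pi) + gain(recall, pi)) / 2.0)
```

Because the transform is linear in `1/x`, the harmonic mean in the original coordinates becomes an arithmetic mean in gain coordinates, which is why averaging (and hence area) is meaningful there.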
Identification of long non-coding transcripts with feature selection: a comparative study
Table S4. List of features ranked by each algorithm in each species. (XLS 63 kb)
Doublet identification in single-cell sequencing data using scDblFinder
Doublets are prevalent in single-cell sequencing data and can lead to artifactual findings. A number of strategies have therefore been proposed to detect them. Building on the strengths of existing approaches, we developed scDblFinder, a fast, flexible and accurate Bioconductor-based doublet detection method. Here we present the method, justify its design choices, demonstrate its performance on both single-cell RNA and accessibility (ATAC) sequencing data, and provide some observations on doublet formation, detection, and enrichment analysis. Even in complex datasets, scDblFinder can accurately identify most heterotypic doublets, and was already found by an independent benchmark to outcompete alternatives.