Statistical Theory for Imbalanced Binary Classification
Within the vast body of statistical theory developed for binary
classification, few meaningful results exist for imbalanced classification, in
which data are dominated by samples from one of the two classes. Existing
theory faces at least two main challenges. First, meaningful results must
consider more complex performance measures than classification accuracy. To
address this, we characterize a novel generalization of the Bayes-optimal
classifier to any performance metric computed from the confusion matrix, and we
use this to show how relative performance guarantees can be obtained in terms
of the error of estimating the class probability function under uniform
($L_\infty$) loss. Second, as we show, optimal classification
performance depends on certain properties of class imbalance that have not
previously been formalized. Specifically, we propose a novel sub-type of class
imbalance, which we call Uniform Class Imbalance. We analyze how Uniform Class
Imbalance influences optimal classifier performance and show that it
necessitates different classifier behavior than other types of class imbalance.
We further illustrate these two contributions in the case of $k$-nearest
neighbor classification, for which we develop novel guarantees. Together, these
results provide some of the first meaningful finite-sample statistical theory
for imbalanced binary classification.Comment: Parts of this paper have been revised from arXiv:2004.04715v2
[math.ST
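As a concrete illustration of the plug-in approach the abstract describes, the sketch below (ours, not the paper's) estimates the class probability function with $k$-nearest neighbors and then thresholds it to optimize a confusion-matrix metric, F1, rather than accuracy. The validation grid search over thresholds is an illustrative stand-in for the paper's theoretical characterization of the metric-optimal rule; the function names and the choice k=25 are assumptions.

```python
# A sketch (not the paper's code) of a plug-in classifier: estimate
# eta(x) = P(Y = 1 | X = x) with k-nearest neighbors, then pick the
# threshold that maximizes a confusion-matrix metric (F1) on held-out
# data instead of using the accuracy-optimal threshold 1/2.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

def fit_plugin_classifier(X_train, y_train, X_val, y_val, k=25):
    """Return (model, threshold) with the F1-maximizing threshold on X_val."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    eta_hat = knn.predict_proba(X_val)[:, 1]      # estimated P(Y=1 | x)
    thresholds = np.linspace(0.01, 0.99, 99)
    scores = [f1_score(y_val, (eta_hat >= t).astype(int)) for t in thresholds]
    return knn, thresholds[int(np.argmax(scores))]

def predict(knn, threshold, X):
    # Under heavy imbalance the chosen threshold is typically far from 1/2.
    return (knn.predict_proba(X)[:, 1] >= threshold).astype(int)
```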
A Symmetric Loss Perspective of Reliable Machine Learning
When minimizing the empirical risk in binary classification, it is a common
practice to replace the zero-one loss with a surrogate loss to make the
learning objective feasible to optimize. Examples of well-known surrogate
losses for binary classification include the logistic loss, hinge loss, and
sigmoid loss. It is known that the choice of surrogate loss can strongly
influence the performance of the trained classifier, so it should be chosen
carefully. Recently, surrogate losses that satisfy a certain symmetric
condition (known as symmetric losses) have demonstrated their usefulness in
learning from corrupted labels. In this article, we provide an overview of
symmetric losses and their applications. First, we review how a symmetric loss
can yield robust classification from corrupted labels in balanced error rate
(BER) minimization and area under the receiver operating characteristic curve
(AUC) maximization. Then, we demonstrate how the robust AUC maximization method
can benefit natural language processing in the problem where we want to learn
only from relevant keywords and unlabeled documents. Finally, we conclude this
article by discussing future directions, including potential applications of
symmetric losses for reliable machine learning and the design of non-symmetric
losses that can benefit from the symmetric condition.
Comment: Preprint of an Invited Review Article
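To make the symmetric condition concrete: a margin loss $\ell$ is symmetric if $\ell(z) + \ell(-z)$ is constant in $z$. The short numerical check below (ours, not from the article) verifies that the sigmoid loss satisfies this with constant 1, while the logistic and hinge losses do not.

```python
# Numerical check of the symmetric condition: ell(z) + ell(-z) should be
# constant in z for a symmetric loss. The sigmoid loss satisfies it with
# constant 1; the logistic and hinge losses do not.
import numpy as np

def sigmoid_loss(z):
    return 1.0 / (1.0 + np.exp(z))

def logistic_loss(z):
    return np.log1p(np.exp(-z))

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)

z = np.linspace(-5.0, 5.0, 101)
for name, loss in [("sigmoid", sigmoid_loss),
                   ("logistic", logistic_loss),
                   ("hinge", hinge_loss)]:
    s = loss(z) + loss(-z)
    print(f"{name:8s}  min={s.min():.3f}  max={s.max():.3f}")
# sigmoid prints min = max = 1.000 (symmetric); the other two vary with z.
```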
Leveraging Labeled and Unlabeled Data for Consistent Fair Binary Classification
We study the problem of fair binary classification using the notion of Equal Opportunity, which requires the true positive rate to be equal across the sensitive groups. Within this setting, we show that the fair optimal classifier is obtained by recalibrating the Bayes classifier with a group-dependent threshold, and we provide a constructive expression for the threshold. This result motivates us to devise a plug-in classification procedure based on both unlabeled and labeled datasets: the latter is used to learn the output conditional probability, while the former is used for calibration. The overall procedure can be computed in polynomial time and is shown to be statistically consistent both in terms of the classification error and the fairness measure. Finally, we present numerical experiments indicating that our method is often superior to, or competitive with, state-of-the-art methods on benchmark datasets.
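A minimal sketch of how such a plug-in procedure might look (ours; all names and the fixed calibration level are assumptions): a logistic model learned on the labeled set plays the role of the conditional-probability estimate, and group-dependent thresholds are calibrated on unlabeled data via the identity $\mathrm{TPR}(t \mid S=s) = E[\eta(X)\,1\{\eta(X) \ge t\} \mid S=s] \,/\, E[\eta(X) \mid S=s]$, which needs no labels once $\eta$ is estimated. Calibrating every group to one fixed level `target_tpr` is an illustrative simplification of the paper's constructive threshold expression.

```python
# Sketch (ours) of a plug-in procedure with group-dependent thresholds.
# The labeled set fits eta_hat; the unlabeled set calibrates thresholds,
# using the fact that the TPR of the rule 1{eta(x) >= t} within group s
# equals E[eta(X) * 1{eta(X) >= t} | S=s] / E[eta(X) | S=s], which needs
# no labels. The common level `target_tpr` is an illustrative constant,
# not the paper's derived optimal level.
import numpy as np
from sklearn.linear_model import LogisticRegression

def tpr_proxy(eta_vals, t):
    """Label-free TPR estimate for the rule 1{eta >= t} within one group."""
    return float((eta_vals * (eta_vals >= t)).sum() / eta_vals.sum())

def fair_plugin(X_lab, y_lab, X_unlab, s_unlab, target_tpr=0.8):
    """Fit eta_hat on labeled data; calibrate one threshold per group."""
    model = LogisticRegression().fit(X_lab, y_lab)
    thresholds = {}
    for s in np.unique(s_unlab):
        eta_s = model.predict_proba(X_unlab[s_unlab == s])[:, 1]
        grid = np.linspace(0.0, 1.0, 1001)
        # largest threshold whose label-free TPR estimate meets the target
        feasible = [t for t in grid if tpr_proxy(eta_s, t) >= target_tpr]
        thresholds[s] = max(feasible) if feasible else 0.0
    return model, thresholds

def predict_fair(model, thresholds, X, s):
    eta = model.predict_proba(X)[:, 1]
    cuts = np.array([thresholds[g] for g in s])
    return (eta >= cuts).astype(int)
```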