Continuous Sweep: an improved, binary quantifier
Quantification is a supervised machine learning task, focused on estimating
the class prevalence of a dataset rather than labeling its individual
observations. We introduce Continuous Sweep, a new parametric binary quantifier
inspired by the well-performing Median Sweep. Median Sweep is currently one of
the best binary quantifiers; we modify it in three ways: 1) using parametric
class distributions instead of empirical
distributions, 2) optimizing decision boundaries instead of applying discrete
decision rules, and 3) calculating the mean instead of the median. We derive
analytic expressions for the bias and variance of Continuous Sweep under
general model assumptions. This is one of the first theoretical contributions
in the field of quantification learning. Moreover, these derivations enable us
to find the optimal decision boundaries. Finally, our simulation study shows
that Continuous Sweep outperforms Median Sweep in a wide range of situations.
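The Median Sweep baseline that Continuous Sweep builds on is compact enough to sketch. The following is a hypothetical minimal re-implementation on synthetic scores, not the authors' code: at each candidate threshold it forms an adjusted-count estimate, keeps only thresholds where tpr and fpr are well separated (the 0.25 cut-off is one common choice), and aggregates by the median.

```python
import numpy as np

def median_sweep(scores_pos, scores_neg, scores_test, min_gap=0.25):
    """Median Sweep sketch: adjusted-count prevalence estimates over many
    thresholds, aggregated by the median (illustrative re-implementation)."""
    thresholds = np.unique(np.concatenate([scores_pos, scores_neg]))
    estimates = []
    for t in thresholds:
        tpr = np.mean(scores_pos > t)   # true-positive rate at threshold t
        fpr = np.mean(scores_neg > t)   # false-positive rate at threshold t
        if tpr - fpr > min_gap:         # keep only well-separated thresholds
            raw = np.mean(scores_test > t)          # classify-and-count rate
            estimates.append((raw - fpr) / (tpr - fpr))  # adjusted count
    return float(np.median(np.clip(estimates, 0.0, 1.0)))

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 2000)       # validation scores, positive class
neg = rng.normal(-1.0, 1.0, 2000)      # validation scores, negative class
# target sample with true prevalence 0.3
test = np.concatenate([rng.normal(1.0, 1.0, 300), rng.normal(-1.0, 1.0, 700)])
print(median_sweep(pos, neg, test))    # close to the true prevalence 0.3
```

Continuous Sweep replaces the empirical rates above with parametric class distributions, restricts the sweep via optimized decision boundaries, and takes the mean rather than the median of the resulting estimates.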
A Symmetric Loss Perspective of Reliable Machine Learning
When minimizing the empirical risk in binary classification, it is a common
practice to replace the zero-one loss with a surrogate loss to make the
learning objective feasible to optimize. Examples of well-known surrogate
losses for binary classification include the logistic loss, hinge loss, and
sigmoid loss. It is known that the choice of a surrogate loss can highly
influence the performance of the trained classifier and therefore it should be
carefully chosen. Recently, surrogate losses that satisfy a certain symmetric
condition (aka., symmetric losses) have demonstrated their usefulness in
learning from corrupted labels. In this article, we provide an overview of
symmetric losses and their applications. First, we review how a symmetric loss
can yield robust classification from corrupted labels in balanced error rate
(BER) minimization and area under the receiver operating characteristic curve
(AUC) maximization. Then, we demonstrate how the robust AUC maximization method
can benefit natural language processing in the problem where we want to learn
only from relevant keywords and unlabeled documents. Finally, we conclude this
article by discussing future directions, including potential applications of
symmetric losses for reliable machine learning and the design of non-symmetric
losses that can benefit from the symmetric condition.
Comment: Preprint of an Invited Review Article.
A Domain-Region Based Evaluation of ML Performance Robustness to Covariate Shift
Most machine learning methods assume that the input data distribution is the
same in the training and testing phases. However, in practice, this
stationarity is usually not met and the distribution of inputs differs, leading
to unexpected performance of the learned model in deployment. The issue in
which the training and test data inputs follow different probability
distributions while the input-output relationship remains unchanged is referred
to as covariate shift. In this paper, the performance of conventional machine
learning models was experimentally evaluated in the presence of covariate
shift. Furthermore, a region-based evaluation was performed by decomposing the
domain of probability density function of the input data to assess the
classifier's performance per domain region. Distributional changes were
simulated in a two-dimensional classification problem. Subsequently,
higher-dimensional experiments with four input features were conducted. Based on the experimental
analysis, the Random Forests algorithm is the most robust classifier in the
two-dimensional case, showing the lowest degradation rate for accuracy and
F1-score metrics, with a range between 0.1% and 2.08%. Moreover, the results
reveal that in higher-dimensional experiments, the performance of the models is
predominantly influenced by the complexity of the classification function,
leading to degradation rates exceeding 25% in most cases. It is also concluded
that the models exhibit high bias towards the region with high density in the
input space domain of the training samples.
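The phenomenon evaluated here, an unchanged input-output relationship paired with a shifted input distribution, can be reproduced in a toy sketch. The setup below is illustrative and not the paper's protocol: a deliberately simple least-squares linear classifier is trained on Gaussian inputs with a curved true boundary, then tested both in-distribution and on covariate-shifted inputs drawn from a low-density region of the training domain.

```python
import numpy as np

rng = np.random.default_rng(1)

def labels(X):
    # fixed input-output relationship P(y|x): y = 1 above a parabola
    return (X[:, 1] > X[:, 0] ** 2 - 1).astype(float)

def fit_linear(X, y):
    # least-squares linear classifier: intentionally too simple a model
    A = np.c_[np.ones(len(X)), X]
    w, *_ = np.linalg.lstsq(A, 2 * y - 1, rcond=None)
    return w

def accuracy(w, X, y):
    A = np.c_[np.ones(len(X)), X]
    return float(np.mean((A @ w > 0) == (y > 0.5)))

X_train = rng.normal(0.0, 1.0, (5000, 2))          # training inputs
X_same  = rng.normal(0.0, 1.0, (5000, 2))          # test, same distribution
X_shift = rng.normal([2.5, 2.0], 0.5, (5000, 2))   # test, covariate shift

w = fit_linear(X_train, labels(X_train))
print("accuracy, no shift:   ", accuracy(w, X_same, labels(X_same)))
print("accuracy, under shift:", accuracy(w, X_shift, labels(X_shift)))
```

The model fits the boundary well where training density is high and degrades sharply in the shifted region, mirroring the region-dependent bias reported above.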
A Comparative Evaluation of Quantification Methods
Quantification represents the problem of predicting class distributions in a
given target set. It also represents a growing research field in supervised
machine learning, for which a large variety of different algorithms has been
proposed in recent years. However, a comprehensive empirical comparison of
quantification methods that supports algorithm selection is not available yet.
In this work, we close this research gap by conducting a thorough empirical
performance comparison of 24 different quantification methods. To consider a
broad range of different scenarios for binary as well as multiclass
quantification settings, we carried out almost 3 million experimental runs on
40 data sets. We observe that no single algorithm generally outperforms all
competitors, but identify a group of methods including the Median Sweep and the
DyS framework that perform significantly better in binary settings. For the
multiclass setting, we observe that a different, broad group of algorithms
yields good performance, including the Generalized Probabilistic Adjusted
Count, the readme method, the energy distance minimization method, the EM
algorithm for quantification, and Friedman's method. More generally, we find
that the performance on multiclass quantification is inferior to the results
obtained in the binary setting. Our results can guide practitioners who intend
to apply quantification algorithms and help researchers to identify
opportunities for future research.
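Of the well-performing multiclass methods listed above, the EM algorithm for quantification is compact enough to sketch. The version below follows the usual Saerens-style prior-adjustment scheme on synthetic, calibrated posteriors; the data and function name are illustrative, not taken from the compared implementations.

```python
import numpy as np

def em_quantify(posteriors, train_prev, n_iter=200):
    """EM algorithm for quantification: re-estimate class prevalences on a
    target set from soft classifier outputs trained under train_prev."""
    prev = np.asarray(train_prev, dtype=float)
    for _ in range(n_iter):
        # E-step: reweight posteriors by the current prevalence estimate
        w = posteriors * (prev / train_prev)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: new prevalence = average of the adjusted posteriors
        prev = w.mean(axis=0)
    return prev

rng = np.random.default_rng(2)
# target sample: true prevalence (0.8, 0.2); classifier trained at (0.5, 0.5)
x = np.concatenate([rng.normal(-1, 1, 8000), rng.normal(1, 1, 2000)])
p1 = 1 / (1 + np.exp(-2 * x))   # calibrated posterior of class 1, 0.5 prior
posteriors = np.c_[1 - p1, p1]
est = em_quantify(posteriors, np.array([0.5, 0.5]))
print(est)                      # close to [0.8, 0.2]
```

The same loop extends to any number of classes, which is one reason this family of methods appears among the stronger multiclass performers.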
Doubly Flexible Estimation under Label Shift
In studies ranging from clinical medicine to policy research, complete data
are usually available from a source population P, but the quantity of
interest is often sought for a related but different target population Q,
which only has partial data. In this paper, we consider the setting in which
both the outcome Y and the covariate X are available from P, whereas only X
is available from Q, under the so-called label shift assumption, i.e., the
conditional distribution of X given Y remains the same across the two
populations. To estimate the parameter of interest in Q by leveraging the
information from P, the following three ingredients are essential: (a) the
common conditional distribution of X given Y, (b) the regression model of Y
given X in P, and (c) the density ratio of Y between the two populations. We
propose an estimation procedure that only needs a standard nonparametric
technique to approximate the conditional expectations with respect to (a),
while by no means needing an estimate or model for (b) or (c); i.e., it is
doubly flexible to possible model misspecifications of both (b) and (c).
This is conceptually different from the well-known doubly robust estimation
in that double robustness allows at most one model to be misspecified,
whereas our proposal can allow both (b) and (c) to be misspecified. This is
of particular interest in our setting because estimating (c) is difficult,
if not impossible, by virtue of the absence of the Y-data in Q. Furthermore,
even though the estimation of (b) is sometimes off-the-shelf, it can face
the curse of dimensionality or computational challenges. We develop the
large sample theory for the proposed estimator, and examine its
finite-sample performance through simulation studies as well as an
application to the MIMIC-III database.
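The role of ingredient (c) can be illustrated with the basic label-shift identity: if only the class prevalences change between the populations, then E_Q[g(X)] = E_P[w(Y) g(X)] with w(y) = q(y)/p(y). The sketch below simply verifies this identity on synthetic data with the ratio assumed known; estimating that ratio without Y-data in Q is exactly the difficulty the paper's doubly flexible procedure avoids.

```python
import numpy as np

rng = np.random.default_rng(3)

# label shift: P and Q share X | Y, only the label distribution differs
p_y = np.array([0.5, 0.5])     # class prevalences in the source P
q_y = np.array([0.2, 0.8])     # class prevalences in the target Q (known here)
means = np.array([-1.0, 2.0])  # X | Y=k ~ N(means[k], 1) in both populations

y_p = rng.choice(2, size=100_000, p=p_y)   # source labels
x_p = rng.normal(means[y_p], 1.0)          # source covariates

# importance weights on the label: w(y) = q(y) / p(y)
w = (q_y / p_y)[y_p]
est = np.mean(w * x_p)         # weighted source estimate of E_Q[X]
truth = q_y @ means            # 0.2 * (-1) + 0.8 * 2 = 1.4
print(est, truth)
```

Only labeled source data enter the weighted average, yet it targets a mean over the other population; in practice w(y) is unknown and is the hard-to-estimate ingredient (c).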
A review on quantification learning
The task of quantification consists in providing an aggregate estimation (e.g. the class distribution in a classification problem) for unseen test sets, applying a model that is trained using a training set with a different data distribution. Several real-world applications demand this kind of method, which does not require predictions for individual examples and focuses on obtaining accurate estimates at an aggregate level. During the past few years, several quantification methods have been proposed from different perspectives and with different goals. This paper presents a unified review of the main approaches with the aim of serving as an introductory tutorial for newcomers in the field.
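The two methods usually introduced first in such tutorials, Classify & Count and its adjusted variant, fit in a few lines. The sketch below uses synthetic scores (all data are illustrative): the naive count inherits the classifier's bias when the class distribution shifts, while the adjusted count corrects it using validation estimates of the true- and false-positive rates.

```python
import numpy as np

rng = np.random.default_rng(4)
# validation data with known labels, to characterize the classifier
val_pos = rng.normal(1, 1, 5000) > 0   # classifier: predict positive if score > 0
val_neg = rng.normal(-1, 1, 5000) > 0
tpr, fpr = val_pos.mean(), val_neg.mean()

# target set whose true prevalence (0.1) differs from the training one
test = np.concatenate([rng.normal(1, 1, 1000), rng.normal(-1, 1, 9000)])
cc = np.mean(test > 0)                 # Classify & Count: biased under shift
acc = (cc - fpr) / (tpr - fpr)         # Adjusted Count: corrects the bias
print(round(cc, 3), round(acc, 3))     # acc lands near the true 0.1
```

Most of the approaches surveyed in this line of work can be read as refinements of this correction, at one threshold, many thresholds, or on soft classifier outputs.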