A Two-stage Classification Method for High-dimensional Data and Point Clouds
High-dimensional data classification is a fundamental task in machine
learning and imaging science. In this paper, we propose a two-stage multiphase
semi-supervised classification method for classifying high-dimensional data and
unstructured point clouds. To begin with, a fuzzy classification method such as
the standard support vector machine is used to generate a warm initialization.
We then apply a two-stage approach named SaT (smoothing and thresholding) to
improve the classification. In the first stage, an unconstrained convex
variational model is implemented to purify and smooth the initialization; in
the second stage, the smoothed partition obtained at stage one is projected
to a binary partition. These two stages can be repeated,
with the latest result as a new initialization, to keep improving the
classification quality. We show that the convex model of the smoothing stage
has a unique solution and can be solved by a specifically designed primal-dual
algorithm whose convergence is guaranteed. We test our method and compare it
with the state-of-the-art methods on several benchmark data sets. The
experimental results demonstrate clearly that our method is superior in both
the classification accuracy and computation speed for high-dimensional data and
point clouds.
Comment: 21 pages, 4 figures
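The smoothing-and-thresholding loop described above can be sketched in a few lines. This is a hypothetical simplification, not the authors' method: neighbor averaging stands in for solving the convex variational model, and `fuzzy_init` and `neighbors` are assumed inputs (soft SVM scores and a nearest-neighbor graph).

```python
import numpy as np

def sat_classify(fuzzy_init, neighbors, n_rounds=3, alpha=0.5):
    """Hypothetical sketch of a smoothing-and-thresholding (SaT) loop.

    fuzzy_init : (n, K) soft class scores, e.g. probability outputs of a
                 standard SVM (the 'warm initialization').
    neighbors  : (n, m) nearest-neighbor indices, standing in for the
                 graph structure a convex smoothing model would use.
    """
    u = fuzzy_init.copy()
    for _ in range(n_rounds):
        # Stage 1 (smoothing): simple neighbor averaging is a crude
        # stand-in for solving the unconstrained convex model.
        u = (1 - alpha) * u + alpha * u[neighbors].mean(axis=1)
        # Stage 2 (thresholding): project onto a hard/binary partition.
        hard = np.zeros_like(u)
        hard[np.arange(len(u)), u.argmax(axis=1)] = 1.0
        u = hard  # the latest result becomes the new initialization
    return u.argmax(axis=1)
```

Repeating the two stages with the latest hard partition as the new initialization mirrors the iteration scheme the abstract describes.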
Dimensionality reduction of clustered data sets
We present a novel probabilistic latent variable model to perform linear dimensionality reduction on data sets that contain clusters. We prove that the maximum likelihood solution of the model is an unsupervised generalisation of linear discriminant analysis. This provides a completely new approach to one of the most established and widely used classification algorithms. The performance of the model is then demonstrated on a number of real and artificial data sets.
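For context on the linear discriminant analysis connection, here is a minimal sketch of the classical supervised Fisher LDA direction for two clusters; this is textbook LDA, not the paper's probabilistic model, which generalises it to the unsupervised setting.

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Classical two-class Fisher LDA direction: maximize between-class
    separation relative to within-class scatter. The paper's latent
    variable model recovers an unsupervised generalisation of this."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # within-class scatter
    w = np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))
    return w / np.linalg.norm(w)
```

Projecting clustered data onto this direction yields a one-dimensional embedding in which the two clusters are maximally separated relative to their spread.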
Deep Learning is Provably Robust to Symmetric Label Noise
Deep neural networks (DNNs) are capable of perfectly fitting the training
data, including memorizing noisy data. It is commonly believed that
memorization hurts generalization. Therefore, many recent works propose
mitigation strategies to avoid noisy data or correct memorization. In this
work, we step back and ask the question: Can deep learning be robust against
massive label noise without any mitigation? We provide an affirmative answer
for the case of symmetric label noise: We find that certain DNNs, including
under-parameterized and over-parameterized models, can tolerate massive
symmetric label noise up to the information-theoretic threshold. By appealing
to classical statistical theory and universal consistency of DNNs, we prove
that for multiclass classification, $L_1$-consistent DNN classifiers trained
under symmetric label noise can achieve Bayes optimality asymptotically if the
label noise probability is less than $(C-1)/C$, where $C$ is the
number of classes. Our results show that for symmetric label noise, no
mitigation is necessary for $L_1$-consistent estimators. We conjecture that for
general label noise, mitigation strategies that make use of the noisy data will
outperform those that ignore the noisy data.
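The $(C-1)/C$ threshold can be checked with a small simulation (not from the paper): below that noise level, the clean class remains the plurality label, which is the intuition behind consistent estimators still recovering the Bayes classifier.

```python
import numpy as np

def add_symmetric_noise(y, p, n_classes, rng):
    """Flip each label with probability p, uniformly to one of the
    other n_classes - 1 classes (symmetric label noise)."""
    y = y.copy()
    flip = rng.random(len(y)) < p
    shift = rng.integers(1, n_classes, size=len(y))
    y[flip] = (y[flip] + shift[flip]) % n_classes
    return y

# With C = 4 the threshold is 3/4; at noise level 0.6 the true class
# keeps ~40% of the labels while each wrong class gets only ~20%, so
# the clean class is still the plurality vote.
rng = np.random.default_rng(0)
C = 4
y = np.zeros(100_000, dtype=int)                              # one true class
noisy = add_symmetric_noise(y, p=0.6, n_classes=C, rng=rng)   # 0.6 < 3/4
counts = np.bincount(noisy, minlength=C)
```

Here `counts.argmax()` recovers the true class; past the threshold, every class would be equally likely and the signal is lost.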
Tracking the risk of a deployed model and detecting harmful distribution shifts
When deployed in the real world, machine learning models inevitably encounter
changes in the data distribution, and certain -- but not all -- distribution
shifts could result in significant performance degradation. In practice, it may
make sense to ignore benign shifts, under which the performance of a deployed
model does not degrade substantially, making interventions by a human expert
(or model retraining) unnecessary. While several works have developed tests for
distribution shifts, these typically either use non-sequential methods, or
detect arbitrary shifts (benign or harmful), or both. We argue that a sensible
method for firing off a warning has to both (a) detect harmful shifts while
ignoring benign ones, and (b) allow continuous monitoring of model performance
without increasing the false alarm rate. In this work, we design simple
sequential tools for testing if the difference between source (training) and
target (test) distributions leads to a significant increase in a risk function
of interest, like accuracy or calibration. Recent advances in constructing
time-uniform confidence sequences allow efficient aggregation of statistical
evidence accumulated during the tracking process. The designed framework is
applicable in settings where (some) true labels are revealed after the
prediction is performed, or when batches of labels become available in a
delayed fashion. We demonstrate the efficacy of the proposed framework through
an extensive empirical study on a collection of simulated and real datasets.
Comment: Accepted as a conference paper at ICLR 202
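The monitoring logic can be sketched as follows. This is an illustrative stand-in, not the paper's construction: it uses a crude anytime-valid Hoeffding bound (a union bound over time) in place of the tighter time-uniform confidence sequences, and assumes losses bounded in [0, 1].

```python
import math

def risk_tracker(losses, source_risk, tol=0.05, delta=0.05):
    """Fire a warning only when the target risk provably exceeds
    source_risk + tol, i.e. when the shift is harmful rather than
    benign. The confidence radius is valid at all times simultaneously
    (Hoeffding + union bound), so continuous monitoring does not
    inflate the false alarm rate above delta."""
    total = 0.0
    for t, loss in enumerate(losses, start=1):
        total += loss
        mean = total / t
        delta_t = delta / (t * (t + 1))        # sum over t is <= delta
        radius = math.sqrt(math.log(2 / delta_t) / (2 * t))
        if mean - radius > source_risk + tol:
            return t                           # harmful shift detected
    return None                                # no warning fired
```

Losses can be fed in one at a time as labels are revealed, or in delayed batches, matching the settings the abstract describes.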
One-Bit Quantization and Sparsification for Multiclass Linear Classification via Regularized Regression
We study the use of linear regression for multiclass classification in the
over-parametrized regime where some of the training data is mislabeled. In such
scenarios it is necessary to add an explicit regularization term, $\lambda f(\mathbf{w})$, for some convex function $f$, to avoid overfitting the mislabeled
data. In our analysis, we assume that the data is sampled from a Gaussian
Mixture Model with equal class sizes, and that a proportion of the training
labels is corrupted for each class. Under these assumptions, we prove that the
best classification performance is achieved when $f(\cdot)=\|\cdot\|_2^2$ and
$\lambda\to\infty$. We then proceed to analyze the classification errors for
$f(\cdot)=\|\cdot\|_1$ and $f(\cdot)=\|\cdot\|_\infty$ in the large $\lambda$
regime and notice that it is often possible to find sparse and
one-bit solutions, respectively, that perform almost as well as the one
corresponding to $f(\cdot)=\|\cdot\|_2^2$.
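A toy version of the setup can be sketched as below. This is an assumption-laden illustration, not the paper's analysis: it fits a ridge ($\ell_2^2$-regularized) one-vs-rest linear regression on an equal-size Gaussian mixture, then applies naive sign quantization as a crude proxy for the one-bit solutions that the paper obtains via $\ell_\infty$ regularization.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy Gaussian mixture: 3 classes with equal sizes, as in the paper's setup.
C, n_per, d = 3, 200, 50
means = rng.normal(size=(C, d)) * 3.0
X = np.vstack([rng.normal(size=(n_per, d)) + means[c] for c in range(C)])
y = np.repeat(np.arange(C), n_per)
Y = np.eye(C)[y]                       # one-hot targets for linear regression

lam = 100.0                            # large-lambda regime
# Ridge solution, i.e. f = squared l2 norm: W = (X'X + lam I)^{-1} X'Y
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Naive one-bit quantization of the weights (a stand-in for the
# l-infinity-regularized solutions analyzed in the paper).
W_onebit = np.sign(W)
acc      = (np.argmax(X @ W,        axis=1) == y).mean()
acc_1bit = (np.argmax(X @ W_onebit, axis=1) == y).mean()
```

On well-separated mixtures the quantized classifier loses little accuracy relative to the real-valued one, illustrating the qualitative claim.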
Confusion-Based Online Learning and a Passive-Aggressive Scheme
This paper provides the first analysis, to the best of our knowledge, of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices and, more precisely, to matrix martingales. We establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive), is a passive-aggressive learning algorithm; we show that the update equations for COPA can be computed analytically, so there is no need to resort to any optimization package to implement it.
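As background for COPA, here is the classic binary passive-aggressive update (PA-I, Crammer et al., 2006) that the scheme builds on; COPA itself modifies this family to target the confusion matrix, and that modification is not shown here.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One step of the classic PA-I passive-aggressive update.
    y must be +1 or -1; C caps the aggressiveness of the step."""
    loss = max(0.0, 1.0 - y * (w @ x))      # hinge loss on this example
    if loss == 0.0:
        return w                            # passive: margin met, no change
    tau = min(C, loss / (x @ x))            # aggressive: closed-form step size
    return w + tau * y * x
```

Like COPA, the update is available in closed form, so each round costs only a dot product and a scaled vector addition, with no optimization package required.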