1,561 research outputs found
Learning Geometric Concepts with Nasty Noise
We study the efficient learnability of geometric concept classes -
specifically, low-degree polynomial threshold functions (PTFs) and
intersections of halfspaces - when a fraction of the data is adversarially
corrupted. We give the first polynomial-time PAC learning algorithms for these
concept classes with dimension-independent error guarantees in the presence of
nasty noise under the Gaussian distribution. In the nasty noise model, an
omniscient adversary can arbitrarily corrupt a small fraction of both the
unlabeled data points and their labels. This model generalizes well-studied
noise models, including the malicious noise model and the agnostic (adversarial
label noise) model. Prior to our work, the only concept class for which
efficient malicious learning algorithms were known was the class of
origin-centered halfspaces.
Specifically, our robust learning algorithm for low-degree PTFs succeeds
under a number of tame distributions -- including the Gaussian distribution
and, more generally, any log-concave distribution with (approximately) known
low-degree moments. For LTFs under the Gaussian distribution, we give a
polynomial-time algorithm that achieves error , where
is the noise rate. At the core of our PAC learning results is an efficient
algorithm to approximate the low-degree Chow-parameters of any bounded function
in the presence of nasty noise. To achieve this, we employ an iterative
spectral method for outlier detection and removal, inspired by recent work in
robust unsupervised learning. Our aforementioned algorithm succeeds for a range
of distributions satisfying mild concentration bounds and moment assumptions.
The correctness of our robust learning algorithm for intersections of
halfspaces makes essential use of a novel robust inverse independence lemma
that may be of broader interest
Attribute-Efficient PAC Learning of Low-Degree Polynomial Threshold Functions with Nasty Noise
The concept class of low-degree polynomial threshold functions (PTFs) plays a
fundamental role in machine learning. In this paper, we study PAC learning of
-sparse degree- PTFs on , where any such concept depends
only on out of attributes of the input. Our main contribution is a new
algorithm that runs in time and under the Gaussian
marginal distribution, PAC learns the class up to error rate with
samples even when an fraction of them are corrupted by the nasty noise of
Bshouty et al. (2002), possibly the strongest corruption model. Prior to this
work, attribute-efficient robust algorithms are established only for the
special case of sparse homogeneous halfspaces. Our key ingredients are: 1) a
structural result that translates the attribute sparsity to a sparsity pattern
of the Chow vector under the basis of Hermite polynomials, and 2) a novel
attribute-efficient robust Chow vector estimation algorithm which uses
exclusively a restricted Frobenius norm to either certify a good approximation
or to validate a sparsity-induced degree- polynomial as a filter to detect
corrupted samples.Comment: ICML 202
Learning predictive models from massive, semantically disparate data
Machine learning approaches offer some of the most successful techniques for constructing predictive models from data. However, applying such techniques in practice requires overcoming several challenges: infeasibility of centralized access to the data because of the massive size of some of the data sets that often exceeds the size of memory available to the learner, distributed nature of data, access restrictions, data fragmentation, semantic disparities between the data sources, and data sources that evolve spatially or temporally (e.g. data streams and genomic data sources in which new data is being submitted continuously). Learning using statistical queries and semantic correspondences that present a unified view of disparate data sources to the learner offer a powerful general framework for addressing some of these challenges.
Against this background, this thesis describes (1) approaches to deal with missing values in the statistical query based algorithms for building predictors (Nayve Bayes and decision trees) and the techniques to minimize the number of required queries in such a setting. (2) Sufficient statistics based algorithms for constructing and updating sequence classifiers. (3) Reduction of several aspects of learning from semantically disparate data sources (such as (a) how errors in mappings affect the accuracy of the learned model and (b) how to choose an optimal mapping from among a set of alternative expert-supplied or automatically generated mappings) to the well-studied problems of domain adaptation and learning in presence of noise and (4) a software for learning predictive models from semantically disparate data
Fairness-aware PAC learning from corrupted data
Addressing fairness concerns about machine learning models is a crucial step towards their long-term adoption in real-world automated systems. While many approaches have been developed for training fair models from data, little is known about the robustness of these methods to data corruption. In this work we consider fairness-aware learning under worst-case data manipulations. We show that an adversary can in some situations force any learner to return an overly biased classifier, regardless of the sample size and with or without degrading
accuracy, and that the strength of the excess bias increases for learning problems with underrepresented protected groups in the data. We also prove that our hardness results are tight up to constant factors. To this end, we study two natural learning algorithms that optimize for both accuracy and fairness and show that these algorithms enjoy guarantees that are order-optimal in terms of the corruption ratio and the protected groups frequencies in the large data
limit
Multi-party Poisoning through Generalized -Tampering
In a poisoning attack against a learning algorithm, an adversary tampers with
a fraction of the training data with the goal of increasing the
classification error of the constructed hypothesis/model over the final test
distribution. In the distributed setting, might be gathered gradually from
data providers who generate and submit their shares of
in an online way.
In this work, we initiate a formal study of -poisoning attacks in
which an adversary controls of the parties, and even for each
corrupted party , the adversary submits some poisoned data on
behalf of that is still "-close" to the correct data (e.g.,
fraction of is still honestly generated). For , this model
becomes the traditional notion of poisoning, and for it coincides with
the standard notion of corruption in multi-party computation.
We prove that if there is an initial constant error for the generated
hypothesis , there is always a -poisoning attacker who can decrease
the confidence of (to have a small error), or alternatively increase the
error of , by . Our attacks can be implemented in
polynomial time given samples from the correct data, and they use no wrong
labels if the original distributions are not noisy.
At a technical level, we prove a general lemma about biasing bounded
functions through an attack model in which each
block might be controlled by an adversary with marginal probability
in an online way. When the probabilities are independent, this coincides with
the model of -tampering attacks, thus we call our model generalized
-tampering. We prove the power of such attacks by incorporating ideas from
the context of coin-flipping attacks into the -tampering model and
generalize the results in both of these areas
- …