1,561 research outputs found

    Learning Geometric Concepts with Nasty Noise

    Full text link
We study the efficient learnability of geometric concept classes - specifically, low-degree polynomial threshold functions (PTFs) and intersections of halfspaces - when a fraction of the data is adversarially corrupted. We give the first polynomial-time PAC learning algorithms for these concept classes with dimension-independent error guarantees in the presence of nasty noise under the Gaussian distribution. In the nasty noise model, an omniscient adversary can arbitrarily corrupt a small fraction of both the unlabeled data points and their labels. This model generalizes well-studied noise models, including the malicious noise model and the agnostic (adversarial label noise) model. Prior to our work, the only concept class for which efficient malicious learning algorithms were known was the class of origin-centered halfspaces. Specifically, our robust learning algorithm for low-degree PTFs succeeds under a number of tame distributions -- including the Gaussian distribution and, more generally, any log-concave distribution with (approximately) known low-degree moments. For LTFs under the Gaussian distribution, we give a polynomial-time algorithm that achieves error $O(\epsilon)$, where $\epsilon$ is the noise rate. At the core of our PAC learning results is an efficient algorithm to approximate the low-degree Chow parameters of any bounded function in the presence of nasty noise. To achieve this, we employ an iterative spectral method for outlier detection and removal, inspired by recent work in robust unsupervised learning. Our aforementioned algorithm succeeds for a range of distributions satisfying mild concentration bounds and moment assumptions. The correctness of our robust learning algorithm for intersections of halfspaces makes essential use of a novel robust inverse independence lemma that may be of broader interest.
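
    The iterative spectral filtering step can be illustrated with a minimal, hedged sketch (all function names, thresholds, and the stopping rule below are illustrative assumptions, not the paper's exact procedure): repeatedly estimate the mean of a low-degree feature map over the sample, find the direction of largest excess variance, and discard the points that project farthest along it, since a small set of corrupted points must account for most of that excess.

```python
import numpy as np

def spectral_filter_mean(features, eps, sigma_bound=1.0, max_iter=50):
    """Illustrative robust mean estimation via iterative spectral filtering.

    features : (n, d) array of low-degree feature vectors (e.g. Hermite
               monomials of the inputs), an eps-fraction of which may be
               adversarially corrupted.
    Returns an estimate of the mean of the uncorrupted feature vectors.
    """
    X = np.asarray(features, dtype=float).copy()
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        centered = X - mu
        cov = centered.T @ centered / len(X)
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
        lam, v = eigvals[-1], eigvecs[:, -1]       # top eigenpair
        # If no direction has variance much above what the clean distribution
        # allows, the remaining outliers can no longer shift the mean by much.
        if lam <= sigma_bound * (1 + 10 * eps):
            break
        # Otherwise remove the points with the largest squared projection onto
        # the top direction; corrupted points account for most of the excess
        # variance along it.
        scores = (centered @ v) ** 2
        cutoff = np.quantile(scores, 1 - eps)
        X = X[scores <= cutoff]
    return X.mean(axis=0)
```

    In the paper this kind of filter is applied to estimate the low-degree Chow parameters, with the stopping condition driven by the distribution's concentration and moment bounds rather than a fixed constant.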

    Attribute-Efficient PAC Learning of Low-Degree Polynomial Threshold Functions with Nasty Noise

    Full text link
The concept class of low-degree polynomial threshold functions (PTFs) plays a fundamental role in machine learning. In this paper, we study PAC learning of $K$-sparse degree-$d$ PTFs on $\mathbb{R}^n$, where any such concept depends only on $K$ out of $n$ attributes of the input. Our main contribution is a new algorithm that runs in time $(nd/\epsilon)^{O(d)}$ and, under the Gaussian marginal distribution, PAC learns the class up to error rate $\epsilon$ with $O(\frac{K^{4d}}{\epsilon^{2d}} \cdot \log^{5d} n)$ samples, even when an $\eta \leq O(\epsilon^d)$ fraction of them are corrupted by the nasty noise of Bshouty et al. (2002), possibly the strongest corruption model. Prior to this work, attribute-efficient robust algorithms had been established only for the special case of sparse homogeneous halfspaces. Our key ingredients are: 1) a structural result that translates the attribute sparsity to a sparsity pattern of the Chow vector under the basis of Hermite polynomials, and 2) a novel attribute-efficient robust Chow vector estimation algorithm which uses exclusively a restricted Frobenius norm either to certify a good approximation or to validate a sparsity-induced degree-$2d$ polynomial as a filter to detect corrupted samples. Comment: ICML 202
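
    As a rough illustration of the Chow-vector object being estimated (not the paper's robust estimator): under a standard Gaussian marginal, the degree-1 Chow parameters are the Hermite coefficients $E[y \cdot He_1(x_i)] = E[y \cdot x_i]$, and attribute sparsity of the concept induces sparsity of this vector, so an estimate can be hard-thresholded to its $K$ largest entries. The sampling and thresholding below are illustrative assumptions.

```python
import numpy as np

def sparse_chow_estimate(X, y, K):
    """Illustrative K-sparse estimate of the degree-1 Chow vector under a
    standard Gaussian marginal: chow[i] = E[y * x_i], i.e. the coefficient
    of the first Hermite polynomial He_1(x_i) = x_i.
    """
    chow = (y[:, None] * X).mean(axis=0)        # empirical E[y * x_i]
    keep = np.argsort(np.abs(chow))[-K:]        # K largest-magnitude entries
    sparse = np.zeros_like(chow)
    sparse[keep] = chow[keep]
    return sparse
```

    The paper's estimator replaces this naive empirical average with a robust one that either certifies a good approximation via a restricted Frobenius norm or produces a degree-$2d$ polynomial filter for removing corrupted samples.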

    Learning predictive models from massive, semantically disparate data

    Get PDF
Machine learning approaches offer some of the most successful techniques for constructing predictive models from data. However, applying such techniques in practice requires overcoming several challenges: infeasibility of centralized access to the data because the massive size of some data sets exceeds the memory available to the learner, the distributed nature of data, access restrictions, data fragmentation, semantic disparities between the data sources, and data sources that evolve spatially or temporally (e.g., data streams and genomic data sources to which new data is being submitted continuously). Learning using statistical queries and semantic correspondences that present a unified view of disparate data sources to the learner offers a powerful general framework for addressing some of these challenges. Against this background, this thesis describes (1) approaches to deal with missing values in statistical-query-based algorithms for building predictors (Naïve Bayes and decision trees) and techniques to minimize the number of required queries in such a setting; (2) sufficient-statistics-based algorithms for constructing and updating sequence classifiers; (3) the reduction of several aspects of learning from semantically disparate data sources (such as (a) how errors in mappings affect the accuracy of the learned model and (b) how to choose an optimal mapping from among a set of alternative expert-supplied or automatically generated mappings) to the well-studied problems of domain adaptation and learning in the presence of noise; and (4) software for learning predictive models from semantically disparate data.
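
    A minimal sketch of the statistical-query view of Naïve Bayes mentioned above, with one simple way of handling missing values (restricting each count query to records where the attribute is observed); the names and the Laplace smoothing are illustrative assumptions, not the thesis's exact algorithms.

```python
from collections import defaultdict

def naive_bayes_from_counts(records, missing=None):
    """Fit a discrete Naive Bayes model using only count queries.

    records : iterable of (attributes, label), where attributes is a tuple and
              a value equal to `missing` means the attribute is unobserved.
    Returns class priors and a conditional-probability function.
    """
    class_counts = defaultdict(int)
    cond_counts = defaultdict(int)     # (attr_index, value, label) -> count
    attr_totals = defaultdict(int)     # (attr_index, label) -> observed count
    for attrs, label in records:
        class_counts[label] += 1
        for i, v in enumerate(attrs):
            if v == missing:
                continue               # missing value: skip this count query
            cond_counts[(i, v, label)] += 1
            attr_totals[(i, label)] += 1
    total = sum(class_counts.values())
    priors = {c: n / total for c, n in class_counts.items()}

    def cond_prob(i, v, label, smoothing=1.0, n_values=2):
        # Laplace-smoothed estimate of P(x_i = v | label), computed only over
        # records where attribute i was actually observed.
        return (cond_counts[(i, v, label)] + smoothing) / \
               (attr_totals[(i, label)] + smoothing * n_values)

    return priors, cond_prob
```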

    Learning Stochastic Decision Trees

    Get PDF

    Fairness-aware PAC learning from corrupted data

    Get PDF
Addressing fairness concerns about machine learning models is a crucial step towards their long-term adoption in real-world automated systems. While many approaches have been developed for training fair models from data, little is known about the robustness of these methods to data corruption. In this work we consider fairness-aware learning under worst-case data manipulations. We show that an adversary can in some situations force any learner to return an overly biased classifier, regardless of the sample size and with or without degrading accuracy, and that the strength of the excess bias increases for learning problems with underrepresented protected groups in the data. We also prove that our hardness results are tight up to constant factors. To this end, we study two natural learning algorithms that optimize for both accuracy and fairness and show that these algorithms enjoy guarantees that are order-optimal in terms of the corruption ratio and the protected groups' frequencies in the large-data limit.
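
    A hedged sketch of the kind of learner studied here: an empirical risk minimizer over a finite hypothesis class that jointly penalizes error and a fairness violation. The demographic-parity criterion and the penalty weight are illustrative choices, not necessarily the two algorithms analyzed in the paper.

```python
import numpy as np

def fair_erm(hypotheses, X, y, groups, lam=1.0):
    """Pick the hypothesis minimizing empirical error plus a fairness penalty.

    hypotheses : list of callables h(X) -> predictions in {0, 1}
    groups     : 0/1 array marking membership in the protected group
                 (assumes both groups appear in the sample)
    lam        : trade-off weight between accuracy and fairness
    """
    best, best_score = None, np.inf
    for h in hypotheses:
        pred = h(X)
        err = np.mean(pred != y)
        # Demographic-parity gap: difference in positive rates between groups.
        gap = abs(pred[groups == 1].mean() - pred[groups == 0].mean())
        score = err + lam * gap
        if score < best_score:
            best, best_score = h, score
    return best
```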

    Preface

    Get PDF

Multi-party Poisoning through Generalized $p$-Tampering

    Get PDF
In a poisoning attack against a learning algorithm, an adversary tampers with a fraction of the training data $T$ with the goal of increasing the classification error of the constructed hypothesis/model over the final test distribution. In the distributed setting, $T$ might be gathered gradually from $m$ data providers $P_1,\dots,P_m$ who generate and submit their shares of $T$ in an online way. In this work, we initiate a formal study of $(k,p)$-poisoning attacks in which an adversary controls $k\in[m]$ of the parties, and even for each corrupted party $P_i$, the adversary submits some poisoned data $T'_i$ on behalf of $P_i$ that is still "$(1-p)$-close" to the correct data $T_i$ (e.g., a $1-p$ fraction of $T'_i$ is still honestly generated). For $k=m$, this model becomes the traditional notion of poisoning, and for $p=1$ it coincides with the standard notion of corruption in multi-party computation. We prove that if there is an initial constant error for the generated hypothesis $h$, there is always a $(k,p)$-poisoning attacker who can decrease the confidence of $h$ (to have a small error), or alternatively increase the error of $h$, by $\Omega(p \cdot k/m)$. Our attacks can be implemented in polynomial time given samples from the correct data, and they use no wrong labels if the original distributions are not noisy. At a technical level, we prove a general lemma about biasing bounded functions $f(x_1,\dots,x_n)\in[0,1]$ through an attack model in which each block $x_i$ might be controlled by an adversary with marginal probability $p$ in an online way. When the probabilities are independent, this coincides with the model of $p$-tampering attacks, thus we call our model generalized $p$-tampering. We prove the power of such attacks by incorporating ideas from the context of coin-flipping attacks into the $p$-tampering model and generalize the results in both of these areas.
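
    A minimal sketch of the (generalized) $p$-tampering idea for biasing a bounded function upward: each block is adversarially controlled with marginal probability $p$, and when in control the attacker picks the candidate that maximizes a Monte Carlo estimate of the conditional expectation of $f$ given the prefix. The candidate-pool size, the estimator, and all names are illustrative assumptions rather than the paper's construction.

```python
import random

def estimate_conditional(f, prefix, sampler, n_remaining, trials=200):
    """Monte Carlo estimate of E[f(prefix, X_{i+1..n})] under honest sampling."""
    total = 0.0
    for _ in range(trials):
        suffix = [sampler() for _ in range(n_remaining)]
        total += f(prefix + suffix)
    return total / trials

def p_tampering_attack(f, sampler, n, p, candidates=5):
    """Generate x_1..x_n, tampering each block independently with probability p
    so as to bias E[f] upward; f maps a length-n list into [0, 1]."""
    xs = []
    for i in range(n):
        if random.random() < p:
            # Adversarial block: among a few honest-looking candidates, pick the
            # one with the highest estimated conditional expectation of f.
            pool = [sampler() for _ in range(candidates)]
            x = max(pool, key=lambda c: estimate_conditional(
                f, xs + [c], sampler, n - i - 1))
        else:
            x = sampler()            # honest block
        xs.append(x)
    return xs
```

    For example, with sampler() drawing uniform bits and f returning the fraction of ones, this attack shifts the expectation above 1/2 by an amount proportional to p, in the spirit of the $\Omega(p \cdot k/m)$ bias stated above for the multi-party setting.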