
    Learning Discrete Distributions from Untrusted Batches

    We consider the problem of learning a discrete distribution in the presence of an $\epsilon$ fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, $p$, and each data source provides a batch of $\ge k$ samples, with the guarantee that at least a $(1 - \epsilon)$ fraction of the sources draw their samples from a distribution with total variation distance at most $\eta$ from $p$. We make no assumptions on the data provided by the remaining $\epsilon$ fraction of sources; this data can even be chosen as an adversarial function of the $(1 - \epsilon)$ fraction of "good" batches. We provide two algorithms: one with runtime exponential in the support size, $n$, but polynomial in $k$, $1/\epsilon$ and $1/\eta$, that takes $O((n + k)/\epsilon^2)$ batches and recovers $p$ to error $O(\eta + \epsilon/\sqrt{k})$. This recovery accuracy is information-theoretically optimal, to constant factors, even given an infinite number of data sources. Our second algorithm applies to the $\eta = 0$ setting and also achieves an $O(\epsilon/\sqrt{k})$ recovery guarantee, though it runs in $\mathrm{poly}((nk)^k)$ time. This second algorithm, which approximates a certain tensor via a rank-1 tensor minimizing $\ell_1$ distance, is surprising in light of the hardness of many low-rank tensor approximation problems, and may be of independent interest.
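    To make the setting concrete, here is a minimal Python sketch of the untrusted-batches data model and the naive pooled estimator; the batch size, distribution, and single-point adversarial strategy are illustrative assumptions, not anything from the paper. It shows why naive averaging incurs error on the order of $\epsilon$, the rate the paper's algorithms improve to $O(\eta + \epsilon/\sqrt{k})$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, m, eps = 10, 100, 500, 0.1       # support size, batch size, #sources, corruption rate
p = rng.dirichlet(np.ones(n))          # underlying distribution (illustrative)

def batch_from(q):
    """Empirical distribution of k i.i.d. samples drawn from q."""
    return np.bincount(rng.choice(n, size=k, p=q), minlength=n) / k

good = [batch_from(p) for _ in range(int((1 - eps) * m))]
# Adversarial sources: here they all report mass on a single point (one simple attack).
bad = [np.eye(n)[0] for _ in range(m - len(good))]

p_hat = np.mean(good + bad, axis=0)    # naive pooled estimator
tv = 0.5 * np.abs(p_hat - p).sum()
print(f"naive TV error: {tv:.3f}  (adversary contributes on the order of eps = {eps})")
```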

    PAC Verification of Statistical Algorithms

    Goldwasser et al. (2021) recently proposed the setting of PAC verification, where a hypothesis (machine learning model) that purportedly satisfies the agnostic PAC learning objective is verified using an interactive proof. In this paper we develop this notion further in a number of ways. First, we prove a lower bound of $\Omega\left(\sqrt{d}/\varepsilon^2\right)$ i.i.d. samples for PAC verification of hypothesis classes of VC dimension $d$. Second, we present a protocol for PAC verification of unions of intervals over $\mathbb{R}$ that improves upon their proposed protocol for that task, and matches our lower bound's dependence on $d$. Third, we introduce a natural generalization of their definition to verification of general statistical algorithms, which is applicable to a wider variety of settings beyond agnostic PAC learning. Showcasing our proposed definition, our final result is a protocol for the verification of statistical query algorithms that satisfy a combinatorial constraint on their queries.
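    For intuition on the sample costs involved, the toy check below verifies a claimed absolute risk bound for a single fixed hypothesis, which by Hoeffding's inequality needs only $O(\log(1/\delta)/\varepsilon^2)$ samples, with no dependence on the class. PAC verification is harder: the verifier must certify near-optimality within a class of VC dimension $d$, which is where the $\Omega(\sqrt{d}/\varepsilon^2)$ lower bound applies. This sketch is not the paper's interactive protocol, and all names in it are hypothetical.

```python
import math

def verify_risk(h, sample_oracle, eps, delta=0.05):
    """Accept if the risk of hypothesis h appears to be at most eps/2.

    Hoeffding with tolerance eps/4 separates risk <= eps/2 from risk > eps
    with probability 1 - delta, using O(log(1/delta)/eps^2) samples. Note
    this checks an absolute error claim only, not optimality over a class.
    """
    m = math.ceil(2 * math.log(2 / delta) / (eps / 4) ** 2)
    errs = sum(h(x) != y for x, y in (sample_oracle() for _ in range(m)))
    return errs / m <= 3 * eps / 4
```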

    Efficiently Learning Structured Distributions from Untrusted Batches

    We study the problem, introduced by Qiao and Valiant, of learning from untrusted batches. Here, we assume $m$ users, all of whom have samples from some underlying distribution $p$ over $1, \ldots, n$. Each user sends a batch of $k$ i.i.d. samples from this distribution; however, an $\epsilon$-fraction of users are untrustworthy and can send adversarially chosen responses. The goal is then to learn $p$ in total variation distance. When $k = 1$ this is the standard robust univariate density estimation setting and it is well understood that $\Omega(\epsilon)$ error is unavoidable. Surprisingly, Qiao and Valiant gave an estimator which improves upon this rate when $k$ is large. Unfortunately, their algorithms run in time exponential in either $n$ or $k$. We first give a sequence of polynomial time algorithms whose estimation error approaches the information-theoretically optimal bound for this problem. Our approach is based on recent algorithms derived from the sum-of-squares hierarchy, in the context of high-dimensional robust estimation. We show that algorithms for learning from untrusted batches can also be cast in this framework, but by working with a more complicated set of test functions. It turns out this abstraction is quite powerful and can be generalized to incorporate additional problem-specific constraints. Our second and main result is to show that this technology can be leveraged to build in prior knowledge about the shape of the distribution. Crucially, this allows us to reduce the sample complexity of learning from untrusted batches to polylogarithmic in $n$ for most natural classes of distributions, which is important in many applications. To do so, we demonstrate that these sum-of-squares algorithms for robust mean estimation can be made to handle complex combinatorial constraints (e.g. those arising from VC theory), which may be of independent technical interest.
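    A faithful sum-of-squares implementation is well beyond an abstract, but the sketch below shows the input representation such algorithms operate on, each batch collapsed to its empirical distribution over $1, \ldots, n$, together with a crude median-based filter. This baseline is for intuition only, assumes a simple attack model, and is not the paper's SoS algorithm.

```python
import numpy as np

def filter_and_average(batches, eps):
    """Crude robust aggregation over batch empirical distributions.

    batches: (m, n) array whose row i is the empirical distribution of
    batch i. Keeps the (1 - 2*eps) fraction of batches closest in l1
    distance to the coordinate-wise median, then averages them. The
    sum-of-squares algorithms in the paper replace this heuristic with
    far more refined families of test functions.
    """
    med = np.median(batches, axis=0)
    dists = np.abs(batches - med).sum(axis=1)
    keep = dists.argsort()[: int((1 - 2 * eps) * len(batches))]
    return batches[keep].mean(axis=0)
```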

    API design for machine learning software: experiences from the scikit-learn project

    Scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. In this paper, we present and discuss our design choices for the application programming interface (API) of the project. In particular, we describe the simple and elegant interface shared by all learning and processing units in the library and then discuss its advantages in terms of composition and reusability. The paper also comments on implementation details specific to the Python ecosystem and analyzes obstacles faced by users and developers of the library.
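    The shared interface in question is scikit-learn's estimator API: every object is configured through its constructor and exposes fit; learning units add predict and processing units add transform, which is what makes uniform composition possible. A brief example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Hyperparameters are set in __init__; data only ever enters through fit().
model = Pipeline([
    ("scale", StandardScaler()),         # processing unit: fit + transform
    ("clf", LogisticRegression(C=1.0)),  # learning unit: fit + predict
])
model.fit(X, y)
print(model.predict(X[:5]))
```

    Because every step follows the same contract, the Pipeline itself is again an estimator and can be fit, cross-validated, or grid-searched as a single unit.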

    FLEA: Provably Fair Multisource Learning from Unreliable Training Data

    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach through a diverse range of experiments on multiple datasets. Additionally, we prove formally that, given enough data, FLEA protects the learner against unreliable data as long as the fraction of affected data sources is less than half.
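    The generic shape of such a filtering step is sketched below: score the pairwise dissimilarity between sources and keep those closest to the rest, relying on unreliable sources being a minority, as in the paper's guarantee. The `score` function is a placeholder (FLEA combines discrepancy and fairness-disparity estimates between sources); every name here is hypothetical and this is not FLEA's exact selection rule.

```python
import numpy as np

def select_sources(sources, score, keep_fraction=0.5):
    """Skeleton of filtering-based multisource selection.

    sources: list of per-source datasets; score(S_i, S_j): estimated
    dissimilarity between two sources (placeholder for FLEA's combined
    discrepancy/disparity estimates). Keeps the sources whose total
    distance to all others is smallest.
    """
    m = len(sources)
    d = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            d[i, j] = d[j, i] = score(sources[i], sources[j])
    keep = d.sum(axis=1).argsort()[: max(1, int(keep_fraction * m))]
    return [sources[i] for i in keep]
```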