194 research outputs found

    Efficiently Learning Structured Distributions from Untrusted Batches

    Full text link
    We study the problem, introduced by Qiao and Valiant, of learning from untrusted batches. Here, we assume mm users, all of whom have samples from some underlying distribution pp over 1,,n1, \ldots, n. Each user sends a batch of kk i.i.d. samples from this distribution; however an ϵ\epsilon-fraction of users are untrustworthy and can send adversarially chosen responses. The goal is then to learn pp in total variation distance. When k=1k = 1 this is the standard robust univariate density estimation setting and it is well-understood that Ω(ϵ)\Omega (\epsilon) error is unavoidable. Suprisingly, Qiao and Valiant gave an estimator which improves upon this rate when kk is large. Unfortunately, their algorithms run in time exponential in either nn or kk. We first give a sequence of polynomial time algorithms whose estimation error approaches the information-theoretically optimal bound for this problem. Our approach is based on recent algorithms derived from the sum-of-squares hierarchy, in the context of high-dimensional robust estimation. We show that algorithms for learning from untrusted batches can also be cast in this framework, but by working with a more complicated set of test functions. It turns out this abstraction is quite powerful and can be generalized to incorporate additional problem specific constraints. Our second and main result is to show that this technology can be leveraged to build in prior knowledge about the shape of the distribution. Crucially, this allows us to reduce the sample complexity of learning from untrusted batches to polylogarithmic in nn for most natural classes of distributions, which is important in many applications. To do so, we demonstrate that these sum-of-squares algorithms for robust mean estimation can be made to handle complex combinatorial constraints (e.g. those arising from VC theory), which may be of independent technical interest.Comment: 46 page

    API design for machine learning software: experiences from the scikit-learn project

    Get PDF
    Scikit-learn is an increasingly popular machine learning li- brary. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. In this paper, we present and discuss our design choices for the application programming interface (API) of the project. In particular, we describe the simple and elegant interface shared by all learning and processing units in the library and then discuss its advantages in terms of composition and reusability. The paper also comments on implementation details specific to the Python ecosystem and analyzes obstacles faced by users and developers of the library

    IST Austria Thesis

    Get PDF
    Because of the increasing popularity of machine learning methods, it is becoming important to understand the impact of learned components on automated decision-making systems and to guarantee that their consequences are beneficial to society. In other words, it is necessary to ensure that machine learning is sufficiently trustworthy to be used in real-world applications. This thesis studies two properties of machine learning models that are highly desirable for the sake of reliability: robustness and fairness. In the first part of the thesis we study the robustness of learning algorithms to training data corruption. Previous work has shown that machine learning models are vulnerable to a range of training set issues, varying from label noise through systematic biases to worst-case data manipulations. This is an especially relevant problem from a present perspective, since modern machine learning methods are particularly data hungry and therefore practitioners often have to rely on data collected from various external sources, e.g. from the Internet, from app users or via crowdsourcing. Naturally, such sources vary greatly in the quality and reliability of the data they provide. With these considerations in mind, we study the problem of designing machine learning algorithms that are robust to corruptions in data coming from multiple sources. We show that, in contrast to the case of a single dataset with outliers, successful learning within this model is possible both theoretically and practically, even under worst-case data corruptions. The second part of this thesis deals with fairness-aware machine learning. There are multiple areas where machine learning models have shown promising results, but where careful considerations are required, in order to avoid discrimanative decisions taken by such learned components. Ensuring fairness can be particularly challenging, because real-world training datasets are expected to contain various forms of historical bias that may affect the learning process. In this thesis we show that data corruption can indeed render the problem of achieving fairness impossible, by tightly characterizing the theoretical limits of fair learning under worst-case data manipulations. However, assuming access to clean data, we also show how fairness-aware learning can be made practical in contexts beyond binary classification, in particular in the challenging learning to rank setting

    FLEA: Provably Fair Multisource Learning from Unreliable Training Data

    Full text link
    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might be not representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally we prove formally that, given enough data, FLEA protects the learner against unreliable data as long as the fraction of affected data sources is less than half
    corecore