68 research outputs found

    Optimal PAC Bounds Without Uniform Convergence

    In statistical learning theory, determining the sample complexity of realizable binary classification for VC classes was a long-standing open problem. The results of Simon and Hanneke established sharp upper bounds in this setting. However, the reliance of their argument on the uniform convergence principle limits its applicability to more general learning settings such as multiclass classification. In this paper, we address this issue by providing optimal high probability risk bounds through a framework that surpasses the limitations of uniform convergence arguments. Our framework converts the leave-one-out error of permutation invariant predictors into high probability risk bounds. As an application, by adapting the one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth, we propose an algorithm that achieves an optimal PAC bound for binary classification. Specifically, our result shows that certain aggregations of one-inclusion graph algorithms are optimal, addressing a variant of a classic question posed by Warmuth. We further instantiate our framework in three settings where uniform convergence is provably suboptimal. For multiclass classification, we prove an optimal risk bound that scales with the one-inclusion hypergraph density of the class, addressing the suboptimality of the analysis of Daniely and Shalev-Shwartz. For partial hypothesis classification, we determine the optimal sample complexity bound, resolving a question posed by Alon, Hanneke, Holzman, and Moran. For realizable bounded regression with absolute loss, we derive an optimal risk bound that relies on a modified version of the scale-sensitive dimension, refining the results of Bartlett and Long. Our rates surpass standard uniform convergence-based results due to the smaller complexity measure in our risk bound. Comment: 27 pages

    An adaptive multiclass nearest neighbor classifier

    We consider a problem of multiclass classification, where the training sample $S_n = \{(X_i, Y_i)\}_{i=1}^n$ is generated from the model $\mathbb P(Y = m \mid X = x) = \eta_m(x)$, $1 \leq m \leq M$, and $\eta_1(x), \dots, \eta_M(x)$ are unknown $\alpha$-Hölder continuous functions. Given a test point $X$, our goal is to predict its label. A widely used $\mathsf k$-nearest-neighbors classifier constructs estimates of $\eta_1(X), \dots, \eta_M(X)$ and uses a plug-in rule for the prediction. However, it requires a proper choice of the smoothing parameter $\mathsf k$, which may become tricky in some situations. In our solution, we fix several integers $n_1, \dots, n_K$, compute corresponding $n_k$-nearest-neighbor estimates for each $m$ and each $n_k$ and apply an aggregation procedure. We study an algorithm, which constructs a convex combination of these estimates such that the aggregated estimate behaves approximately as well as an oracle choice. We also provide a non-asymptotic analysis of the procedure, prove its adaptation to the unknown smoothness parameter $\alpha$ and to the margin and establish rates of convergence under mild assumptions. Comment: Accepted in ESAIM: Probability & Statistics. The original publication is available at www.esaim-ps.or
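The plug-in and aggregation steps described in the abstract above can be illustrated with a minimal sketch. It assumes one-dimensional inputs, absolute-difference distance, and fixed uniform weights; the paper's own procedure selects the convex weights from the data and is not reproduced here. The names `knn_estimate` and `aggregated_predict` are hypothetical.

```python
from collections import Counter


def knn_estimate(train, x, k, num_classes):
    """Estimate class probabilities at x from the k nearest training points."""
    # Assumption: inputs are real numbers and distance is |x_i - x|.
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    counts = Counter(label for _, label in neighbors)
    return [counts.get(m, 0) / k for m in range(num_classes)]


def aggregated_predict(train, x, ks, weights, num_classes):
    """Mix several k-NN estimates with convex weights, then apply a plug-in rule."""
    # Assumption: weights are given (non-negative, summing to 1), not learned.
    estimates = [knn_estimate(train, x, k, num_classes) for k in ks]
    mixed = [
        sum(w * est[m] for w, est in zip(weights, estimates))
        for m in range(num_classes)
    ]
    # Plug-in rule: predict the class with the largest aggregated estimate.
    label = max(range(num_classes), key=lambda m: mixed[m])
    return label, mixed


# Toy data: (feature, label) pairs for three classes.
train = [(0.0, 0), (0.1, 0), (0.2, 0), (0.9, 1),
         (1.0, 1), (1.1, 1), (2.0, 2), (2.1, 2)]
label, probs = aggregated_predict(
    train, 0.15, ks=[1, 3, 5], weights=[1 / 3, 1 / 3, 1 / 3], num_classes=3
)
print(label, probs)
```

Because each k-NN estimate is a probability vector and the weights form a convex combination, the aggregated estimate is again a probability vector.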

    Multiclass learnability and the ERM principle

    We study the sample complexity of multiclass prediction in several learning settings. For the PAC setting our analysis reveals a surprising phenomenon: In sharp contrast to binary classification, we show that there exist multiclass hypothesis classes for which some Empirical Risk Minimizers (ERM learners) have lower sample complexity than others. Furthermore, there are classes that are learnable by some ERM learners, while other ERM learners will fail to learn them. We propose a principle for designing good ERM learners, and use this principle to prove tight bounds on the sample complexity of learning symmetric multiclass hypothesis classes, that is, classes that are invariant under permutations of label names. We further provide a characterization of mistake and regret bounds for multiclass learning in the online setting and the bandit setting, using new generalizations of Littlestone's dimension.

    Multiclass Learnability Beyond the PAC Framework: Universal Rates and Partial Concept Classes

    In this paper we study the problem of multiclass classification with a bounded number of different labels $k$, in the realizable setting. We extend the traditional PAC model to a) distribution-dependent learning rates, and b) learning rates under data-dependent assumptions. First, we consider the universal learning setting (Bousquet, Hanneke, Moran, van Handel and Yehudayoff, STOC '21), for which we provide a complete characterization of the achievable learning rates that holds for every fixed distribution. In particular, we show the following trichotomy: for any concept class, the optimal learning rate is either exponential, linear or arbitrarily slow. Additionally, we provide complexity measures of the underlying hypothesis class that characterize when these rates occur. Second, we consider the problem of multiclass classification with structured data (such as data lying on a low dimensional manifold or satisfying margin conditions), a setting which is captured by partial concept classes (Alon, Hanneke, Holzman and Moran, FOCS '21). Partial concepts are functions that can be undefined in certain parts of the input space. We extend the traditional PAC learnability of total concept classes to partial concept classes in the multiclass setting and investigate differences between partial and total concepts.

    Theoretical Foundations of Adversarially Robust Learning

    Despite extraordinary progress, current machine learning systems have been shown to be brittle against adversarial examples: seemingly innocuous but carefully crafted perturbations of test examples that cause machine learning predictors to misclassify. Can we learn predictors robust to adversarial examples, and how? There has been much empirical interest in this contemporary challenge in machine learning, and in this thesis, we address it from a theoretical perspective. We explore what robustness properties we can hope to guarantee against adversarial examples and develop an understanding of how to algorithmically guarantee them. We illustrate the need to go beyond traditional approaches and principles such as empirical risk minimization and uniform convergence, and make contributions that can be categorized as follows: (1) introducing problem formulations capturing aspects of emerging practical challenges in robust learning, (2) designing new learning algorithms with provable robustness guarantees, and (3) characterizing the complexity of robust learning and fundamental limitations on the performance of any algorithm. Comment: PhD Thesis

    Two studies in resource-efficient inference: structural testing of networks, and selective classification

    Inference systems suffer costs arising from information acquisition, and from communication and computational costs of executing complex models. This dissertation proposes, in two distinct themes, systems-level methods to reduce these costs without affecting the accuracy of inference by using ancillary low-cost methods to cheaply address most queries, while only using resource-heavy methods on 'difficult' instances. The first theme concerns testing methods in structural inference of networks and graphical models, the proposal being that one first cheaply tests whether the structure underlying a dataset differs from a reference structure, and only estimates the new structure if this difference is large. This study focuses on theoretically establishing separations between the costs of testing and learning to determine when a strategy such as the above has benefits. For two canonical models, the Ising model and the stochastic block model, fundamental limits are derived on the costs of one- and two-sample goodness-of-fit tests by determining information-theoretic lower bounds, and developing matching tests. A biphasic behaviour in the costs of testing is demonstrated: there is a critical size scale such that detection of differences smaller than this size is nearly as expensive as recovering the structure, while detection of larger differences has vanishing costs relative to recovery. The second theme concerns using selective classification (SC), or classification with an option to abstain, to control inference-time costs in the machine learning framework. The proposal is to learn a low-complexity selective classifier that only abstains on hard instances, and to execute more expensive methods upon abstention. Herein, a novel SC formulation with a focus on high accuracy is developed, and used to obtain both theoretical characterisations, and a scheme for learning selective classifiers based on optimising a collection of class-wise decoupled one-sided risks. This scheme attains strong empirical performance, and admits efficient implementation, leading to an effective SC methodology. Finally, SC is studied in the online learning setting with feedback only provided upon abstention, modelling the practical lack of reliable labels without expensive feature collection, and a Pareto-optimal low-error scheme is described.

    Adaptive Online Learning

    The research that constitutes this thesis was driven by two related goals. The first one was to develop new efficient online learning algorithms and to study their properties and theoretical guarantees. The second one was to study real-world data and find algorithms appropriate for the particular real-world problems. This thesis studies online prediction with few assumptions about the nature of the data. This is important for real-world applications of machine learning as complex assumptions about the data are rarely justified. We consider two frameworks: conformal prediction, which is based on the randomness assumption, and prediction with expert advice, where no assumptions about the data are made at all. Conformal predictors are set predictors, that is, a set of possible labels is issued by Learner at each trial. After the prediction is made the real label is revealed and Learner's prediction is evaluated. In the case of classification the label space is finite, so Learner makes an error if the true label is not in the set produced by Learner. Conformal prediction was originally developed for the supervised learning task and was proved to be valid in the sense of making errors with a prespecified probability. We will study possible ways of extending this approach to the semi-supervised case and build a valid algorithm for this task. Also, we will apply the conformal prediction technique to the problem of diagnosing tuberculosis in cattle. Whereas conformal prediction relies on just the randomness assumption, prediction with expert advice drops this one as well. One may wonder whether it is possible to make good predictions under these circumstances. However, Learner is provided with predictions of a certain class of experts (or prediction strategies) and may base his prediction on them. The goal then is to perform not much worse than the best strategy in the class. This is achieved by carefully mixing (aggregating) predictions of the base experts. However, often the nature of data changes over time, such that there is a region where one expert is good, followed by a region where another is good and so on. This leads to the algorithms which we call adaptive: they take into account this structure of the data. We explore the possibilities offered by the framework of specialist experts to build adaptive algorithms. This line of thought allows us then to provide an intuitive explanation for the mysterious Mixing Past Posteriors algorithm and build a new algorithm with sharp bounds for Online Multitask Learning. EThOS - Electronic Theses Online Service, United Kingdom
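The idea of mixing the predictions of base experts, mentioned in the abstract above, can be sketched with the standard exponentially weighted average forecaster. This is a generic illustration, not the thesis's own algorithm: the square loss, the learning rate `eta = 0.5`, the toy data, and the function name `exponential_weights` are all assumptions made for the example.

```python
import math


def exponential_weights(expert_predictions, outcomes, eta=0.5):
    """Run the exponentially weighted average forecaster under square loss.

    expert_predictions: one list per round, holding each expert's prediction.
    outcomes: the true outcome revealed after each round.
    Returns Learner's cumulative loss and each expert's cumulative loss.
    """
    n = len(expert_predictions[0])
    log_weights = [0.0] * n  # uniform prior, kept in log space for stability
    learner_loss = 0.0
    expert_losses = [0.0] * n
    for preds, y in zip(expert_predictions, outcomes):
        shift = max(log_weights)
        weights = [math.exp(lw - shift) for lw in log_weights]
        total = sum(weights)
        # Learner predicts the weighted average of the experts' predictions.
        p = sum(w * q for w, q in zip(weights, preds)) / total
        learner_loss += (p - y) ** 2
        # After the outcome is revealed, down-weight experts by their loss.
        for i, q in enumerate(preds):
            loss = (q - y) ** 2
            expert_losses[i] += loss
            log_weights[i] -= eta * loss
    return learner_loss, expert_losses


# Three constant experts predicting 0, 1 and 0.5 on a short binary sequence.
outcomes = [1, 1, 0, 1, 1, 1]
rounds = [[0.0, 1.0, 0.5] for _ in outcomes]
learner_loss, expert_losses = exponential_weights(rounds, outcomes)
print(learner_loss, expert_losses)
```

For predictions and outcomes in [0, 1] and `eta` at most 1/2, the square loss is exp-concave, so Learner's cumulative loss exceeds that of the best expert by at most `ln(n) / eta`, where `n` is the number of experts.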