8 research outputs found

    Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach

    Full text link
    Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be "compressed" to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size. Combined with off-the-shelf compression algorithms, the bound leads to state of the art generalization guarantees; in particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. As additional evidence connecting compression and generalization, we show that compressibility of models that tend to overfit is limited: We establish an absolute limit on expected compressibility as a function of expected generalization error, where the expectations are over the random choice of training examples. The bounds are complemented by empirical results that show an increase in overfitting implies an increase in the number of bits required to describe a trained network.Comment: 16 pages, 1 figure. Accepted at ICLR 201

    A Primer on PAC-Bayesian Learning

    Get PDF
    International audienc

    PAC-Bayes Analysis Beyond the Usual Bounds

    Get PDF
    We focus on a stochastic learning model where the learner observes a finite set of training examples and the output of the learning process is a data-dependent distribution over a space of hypotheses. The learned data-dependent distribution is then used to make randomized predictions, and the high-level theme addressed here is guaranteeing the quality of predictions on examples that were not seen during training, i.e. generalization. In this setting the unknown quantity of interest is the expected risk of the data-dependent randomized predictor, for which upper bounds can be derived via a PAC-Bayes analysis, leading to PAC-Bayes bounds. Specifically, we present a basic PAC-Bayes inequality for stochastic kernels, from which one may derive extensions of various known PAC-Bayes bounds as well as novel bounds. We clarify the role of the requirements of fixed 'data-free' priors, bounded losses, and i.i.d. data. We highlight that those requirements were used to upper-bound an exponential moment term, while the basic PAC-Bayes theorem remains valid without those restrictions. We present three bounds that illustrate the use of data-dependent priors, including one for the unbounded square loss.Comment: In NeurIPS 2020. Version 3 is the final published paper. Note that this paper is an enhanced version of the short paper with the same title that was presented at the NeurIPS 2019 Workshop on Machine Learning with Guarantees. Important update: the PAC-Bayes type inequality for unbounded loss functions (Section 2.3) is ne

    Generalization Bounds: Perspectives from Information Theory and PAC-Bayes

    Full text link
    A fundamental question in theoretical machine learning is generalization. Over the past decades, the PAC-Bayesian approach has been established as a flexible framework to address the generalization capabilities of machine learning algorithms, and design new ones. Recently, it has garnered increased interest due to its potential applicability for a variety of learning algorithms, including deep neural networks. In parallel, an information-theoretic view of generalization has developed, wherein the relation between generalization and various information measures has been established. This framework is intimately connected to the PAC-Bayesian approach, and a number of results have been independently discovered in both strands. In this monograph, we highlight this strong connection and present a unified treatment of generalization. We present techniques and results that the two perspectives have in common, and discuss the approaches and interpretations that differ. In particular, we demonstrate how many proofs in the area share a modular structure, through which the underlying ideas can be intuited. We pay special attention to the conditional mutual information (CMI) framework; analytical studies of the information complexity of learning algorithms; and the application of the proposed methods to deep learning. This monograph is intended to provide a comprehensive introduction to information-theoretic generalization bounds and their connection to PAC-Bayes, serving as a foundation from which the most recent developments are accessible. It is aimed broadly towards researchers with an interest in generalization and theoretical machine learning.Comment: 222 page

    Apprentissage automatique avec garanties de généralisation à l'aide de méthodes d'ensemble maximisant le désaccord

    Get PDF
    Nous nous intéressons au domaine de l’apprentissage automatique, une branche de l’intelligence artificielle. Pour résoudre une tâche de classification, un algorithme d’apprentissage observe des données étiquetées et a comme objectif d’apprendre une fonction qui sera en mesure de classifier automatiquement les données qui lui seront présentées dans le futur. Plusieurs algorithmes classiques d’apprentissage cherchent à combiner des classificateurs simples en construisant avec ceux-ci un classificateur par vote de majorité. Dans cette thèse, nous explorons l’utilisation d’une borne sur le risque du classificateur par vote de majorité, nommée la C-borne. Celle-ci est définie en fonction de deux quantités : la performance individuelle des votants, et la corrélation de leurs erreurs (leur désaccord). Nous explorons d’une part son utilisation dans des bornes de généralisation des classificateurs par vote de majorité. D’autre part, nous l’étendons de la classification binaire vers un cadre généralisé de votes de majorité. Nous nous en inspirons finalement pour développer de nouveaux algorithmes d’apprentissage automatique, qui offrent des performances comparables aux algorithmes de l’état de l’art, en retournant des votes de majorité qui maximisent le désaccord entre les votants, tout en contrôlant la performance individuelle de ceux-ci. Les garanties de généralisation que nous développons dans cette thèse sont de la famille des bornes PAC-bayésiennes. Nous généralisons celles-ci en introduisant une borne générale, à partir de laquelle peuvent être retrouvées les bornes de la littérature. De cette même borne générale, nous introduisons des bornes de généralisation basées sur la C-borne. Nous simplifions également le processus de preuve des théorèmes PAC-bayésiens, nous permettant d’obtenir deux nouvelles familles de bornes. L’une est basée sur une différente notion de complexité, la divergence de Rényi plutôt que la divergence Kullback-Leibler classique, et l’autre est spécialisée au cadre de l’apprentissage transductif plutôt que l’apprentissage inductif. Les deux algorithmes d’apprentissage que nous introduisons, MinCq et CqBoost, retournent un classificateur par vote de majorité maximisant le désaccord des votants. Un hyperparamètre permet de directement contrôler leur performance individuelle. Ces deux algorithmes étant construits pour minimiser une borne PAC-bayésienne, ils sont rigoureusement justifiés théoriquement. À l’aide d’une évaluation empirique, nous montrons que MinCq et CqBoost ont une performance comparable aux algorithmes classiques de l’état de l’art.We focus on machine learning, a branch of artificial intelligence. When solving a classification problem, a learning algorithm is provided labelled data and has the task of learning a function that will be able to automatically classify future, unseen data. Many classical learning algorithms are designed to combine simple classifiers by building a weighted majority vote classifier out of them. In this thesis, we extend the usage of the C-bound, bound on the risk of the majority vote classifier. This bound is defined using two quantities : the individual performance of the voters, and the correlation of their errors (their disagreement). First, we design majority vote generalization bounds based on the C-bound. Then, we extend this bound from binary classification to generalized majority votes. Finally, we develop new learning algorithms with state-of-the-art performance, by constructing majority votes that maximize the voters’ disagreement, while controlling their individual performance. The generalization guarantees that we develop in this thesis are in the family of PAC-Bayesian bounds. We generalize the PAC-Bayesian theory by introducing a general theorem, from which the classical bounds from the literature can be recovered. Using this same theorem, we introduce generalization bounds based on the C-bound. We also simplify the proof process of PAC-Bayesian theorems, easing the development of new families of bounds. We introduce two new families of PAC-Bayesian bounds. One is based on a different notion of complexity than usual bounds, the Rényi divergence, instead of the classical Kullback-Leibler divergence. The second family is specialized to transductive learning, instead of inductive learning. The two learning algorithms that we introduce, MinCq and CqBoost, output a majority vote classifier that maximizes the disagreement between voters. An hyperparameter of the algorithms gives a direct control over the individual performance of the voters. These two algorithms being designed to minimize PAC-Bayesian generalization bounds on the risk of the majority vote classifier, they come with rigorous theoretical guarantees. By performing an empirical evaluation, we show that MinCq and CqBoost perform as well as classical stateof- the-art algorithms
    corecore