1,541 research outputs found

PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization

Author: Finzi Marc
Goldblum Micah
Kapoor Sanyam
Lotfi Sanae
Potapczynski Andres
Wilson Andrew Gordon
Publication date: 24/11/2022

While there has been progress in developing non-vacuous generalization bounds for deep neural networks, these bounds tend to be uninformative about why deep learning works. In this paper, we develop a compression approach based on quantizing neural network parameters in a linear subspace, profoundly improving on previous results to provide state-of-the-art generalization bounds on a variety of tasks, including transfer learning. We use these tight bounds to better understand the role of model size, equivariance, and the implicit biases of optimization, for generalization in deep learning. Notably, we find large models can be compressed to a much greater extent than previously known, encapsulating Occam's razor. We also argue for data-independent bounds in explaining generalization. Comment: NeurIPS 2022. Code is available at https://github.com/activatedgeek/tight-pac-baye

arXiv.org e-Print Archive

The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning

Author: Finzi Marc
Goldblum Micah
Rowan Keefer
Wilson Andrew Gordon
Publication date: 11/04/2023

No free lunch theorems for supervised learning state that no learner can solve all problems or that all learners achieve exactly the same accuracy on average over a uniform distribution on learning problems. Accordingly, these theorems are often referenced in support of the notion that individual problems require specially tailored inductive biases. While virtually all uniformly sampled datasets have high complexity, real-world problems disproportionately generate low-complexity data, and we argue that neural network models share this same preference, formalized using Kolmogorov complexity. Notably, we show that architectures designed for a particular domain, such as computer vision, can compress datasets on a variety of seemingly unrelated domains. Our experiments show that pre-trained and even randomly initialized language models prefer to generate low-complexity sequences. Whereas no free lunch theorems seemingly indicate that individual problems require specialized learners, we explain how tasks that often require human intervention such as picking an appropriately sized model when labeled data is scarce or plentiful can be automated into a single learning algorithm. These observations justify the trend in deep learning of unifying seemingly disparate problems with an increasingly small set of machine learning models.

arXiv.org e-Print Archive

Variational methods with dependence structure

Author: Yin Mingzhang
Publication date: 21/12/2020

It is a common practice among humans to deduce, to explain and to make predictions based on concepts that are not directly observable. In Bayesian statistics, the underlying propositions of the unobserved latent variables are summarized in the posterior distribution. With the increasing complexity of real-world data and statistical models, fast and accurate inference for the posterior becomes essential. Variational methods, by casting the posterior inference problem in the optimization framework, are widely used for their flexibility and computational efficiency. In this thesis, we develop new variational methods, studying their theoretical properties and applications. In the first part of the thesis, we utilize dependence structures towards addressing fundamental problems in variational inference (VI): posterior uncertainty estimation, convergence properties, and discrete optimization. Though it is flexible, variational inference often underestimates the posterior uncertainty. This is a consequence of the over-simplified variational family. Mean-field variational inference (MFVI), for example, uses a product of independent distributions as a coarse approximation to the posterior. As a remedy, we propose a hierarchical variational distribution with flexible parameterization that can model the dependence structure between latent variables. With a newly derived objective, we show that the proposed variational method can achieve accurate and efficient uncertainty estimation. We further theoretically study the structured variational inference in the setting of the Stochastic Blockmodel (SBM). The variational distribution is constructed with a pairwise structure among the nodes of a graph. We prove that, in a broad density regime and for general random initializations, the estimated class labels by structured VI converge to the ground truth with high probability. 
Empirically, we demonstrate that structured VI is more robust than MFVI when the graph is sparse and the signal-to-noise ratio is low. When the latent variables are discrete, gradient-descent-based VI often suffers from bias and high variance in gradient estimation. With correlated random samples, we propose a novel unbiased, low-variance gradient estimator. We demonstrate that, under certain constraints, such correlated sampling gives an optimal control variate for variance reduction. The efficient gradient estimation can be applied to a wide range of problems such as variable selection, reinforcement learning, and natural language processing, among others. For the second part of the thesis, we apply variational methods to the study of generalization problems in meta-learning. When training over multiple tasks, we identify that a variety of meta-learning algorithms implicitly require the tasks to have a mutually exclusive dependence structure. This prevents the task-level overfitting problem and ensures fast adaptation of the algorithm in the face of a new task. However, such a dependence structure may not exist for general tasks. When the tasks are non-mutually exclusive, we develop new meta-learning algorithms with variational regularization to prevent task-level overfitting. Consequently, we can extend meta-learning to domains where it was not previously effective. Statistic

Texas ScholarWorks

A Primer on Bayesian Neural Networks: Review and Debates

Author: Arbel Julyan
Fortuin Vincent
Pitas Konstantinos
Vladimirova Mariia
Publication date: 28/09/2023

Neural networks have achieved remarkable performance across various problem domains, but their widespread applicability is hindered by inherent limitations such as overconfidence in predictions, lack of interpretability, and vulnerability to adversarial attacks. To address these challenges, Bayesian neural networks (BNNs) have emerged as a compelling extension of conventional neural networks, integrating uncertainty estimation into their predictive capabilities. This comprehensive primer presents a systematic introduction to the fundamental concepts of neural networks and Bayesian inference, elucidating their synergistic integration for the development of BNNs. The target audience comprises statisticians with a potential background in Bayesian methods but lacking deep learning expertise, as well as machine learners proficient in deep neural networks but with limited exposure to Bayesian statistics. We provide an overview of commonly employed priors, examining their impact on model behavior and performance. Additionally, we delve into the practical considerations associated with training and inference in BNNs. Furthermore, we explore advanced topics within the realm of BNN research, acknowledging the existence of ongoing debates and controversies. By offering insights into cutting-edge developments, this primer not only equips researchers and practitioners with a solid foundation in BNNs, but also illuminates the potential applications of this dynamic field. As a valuable resource, it fosters an understanding of BNNs and their promising prospects, facilitating further advancements in the pursuit of knowledge and innovation. Comment: 65 pages

arXiv.org e-Print Archive

Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary $\beta$-Mixing Processes

Author: Ralaivola Liva
Stempfel Guillaume
Szafranski Marie
Publication date: 09/09/2009

Pac-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and this is particularly so for margin classifiers: recent contributions have shown how practical these bounds can be, either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first (to the best of our knowledge) Pac-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph, which encodes the dependencies within the data, into sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new Pac-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as Auc) and classifiers trained on data distributed according to a stationary $\beta$-mixing process. Along the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a Pac-Bayes generalization bound for classifiers learned on data from stationary $\varphi$-mixing distributions. Comment: Long version of the AISTATS 09 paper: http://jmlr.csail.mit.edu/proceedings/papers/v5/ralaivola09a/ralaivola09a.pd

arXiv.org e-Print Archive

HAL Evry

HAL AMU

Hypernetwork approach to Bayesian MAML

Author: Borycki Piotr
Kubacki Piotr
Kuśmierczyk Tomasz
Przewięźlikowski Marcin
Spurek Przemysław
Tabor Jacek
Publication date: 30/08/2023

The main goal of Few-Shot learning algorithms is to enable learning from small amounts of data. One of the most popular and elegant Few-Shot learning approaches is Model-Agnostic Meta-Learning (MAML). The main idea behind this method is to learn the shared universal weights of a meta-model, which are then adapted for specific tasks. However, the method suffers from over-fitting and poorly quantifies uncertainty due to limited data size. Bayesian approaches could, in principle, alleviate these shortcomings by learning weight distributions in place of point-wise weights. Unfortunately, previous modifications of MAML are limited due to the simplicity of Gaussian posteriors, MAML-like gradient-based weight updates, or by the same structure enforced for universal and adapted weights. In this paper, we propose a novel framework for Bayesian MAML called BayesianHMAML, which employs Hypernetworks for weight updates. It learns the universal weights point-wise, but a probabilistic structure is added when adapted for specific tasks. In such a framework, we can use simple Gaussian distributions or more complicated posteriors induced by Continuous Normalizing Flows. Comment: arXiv admin note: text overlap with arXiv:2205.1574

arXiv.org e-Print Archive

Social Contract AI: Aligning AI Assistants with Implicit Group Norms

Author: Arumugam Dilip
Fränken Jan-Philipp
Gandhi Kanishk
Gerstenberg Tobias
Goodman Noah D.
Kwok Sam
Moore Jared
Tamkin Alex
Ye Peixuan
Publication date: 03/12/2023

We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. To validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. We find that the AI assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). However, the assistant's learned policies lack robustness and exhibit limited generalization in an out-of-distribution setting when confronted with a currency (e.g., grams of medicine) that was not included in the assistant's training distribution. Additionally, we find that when there is inconsistency in the relationship between language use and an unknown policy (e.g., an altruistic policy combined with rude language), the assistant's learning of the policy is slowed. Overall, our preliminary results suggest that developing simulation frameworks in which AI assistants need to infer preferences from diverse users can provide a valuable approach for studying practical alignment questions. Comment: SoLaR NeurIPS 2023 Workshop (https://solar-neurips.github.io/)

arXiv.org e-Print Archive

Information-Theoretic Generalization Bounds: Tightness and Expressiveness

Author: Hellström Fredrik
Publication date: 01/01/2022

Machine learning has achieved impressive feats in numerous domains, largely driven by the emergence of deep neural networks. Due to the high complexity of these models, classical bounds on the generalization error (that is, the difference between training and test performance) fail to explain this success. This discrepancy between theory and practice motivates the search for new generalization guarantees, which must rely on properties other than function complexity. Information-theoretic bounds, which are intimately related to probably approximately correct (PAC)-Bayesian analysis, naturally incorporate a dependence on the relevant data distributions and learning algorithms. Hence, they are a promising candidate for studying generalization in deep neural networks. In this thesis, we derive and evaluate several such information-theoretic generalization bounds. First, we derive both average and high-probability bounds in a unified way, obtaining new results and recovering several bounds from the literature. We also develop new bounds by using tools from binary hypothesis testing. We extend these results to the conditional mutual information (CMI) framework, leading to results that depend on quantities such as the conditional information density and maximal leakage. While the aforementioned bounds achieve a so-called slow rate with respect to the number of training samples, we extend our techniques to obtain bounds with a fast rate. Furthermore, we show that the CMI framework can be viewed as a way of automatically obtaining data-dependent priors, an important technique for obtaining numerically tight PAC-Bayesian bounds. A numerical evaluation of these bounds demonstrates that they are nonvacuous for deep neural networks, but diverge as training progresses. To obtain numerically tighter results, we strengthen our bounds through the use of the samplewise evaluated CMI, which depends on the information captured by the losses of the neural network rather than its weights.
Furthermore, we make use of convex comparator functions, such as the binary relative entropy, to obtain tighter characterizations for low training losses. Numerically, we find that these bounds are nearly tight for several deep neural network settings, and remain stable throughout training. We demonstrate the expressiveness of the evaluated CMI framework by using it to rederive nearly optimal guarantees for multiclass classification, known from classical learning theory. Finally, we study the expressiveness of the evaluated CMI framework for meta learning, where data from several related tasks is used to improve performance on new tasks from the same task environment. Through the use of a one-step derivation and the evaluated CMI, we obtain new information-theoretic generalization bounds for meta learning that improve upon previous results. Under certain assumptions on the function classes used by the learning algorithm, we obtain convergence rates that match known classical results. By extending our analysis to oracle algorithms and considering a notion of task diversity, we obtain excess risk bounds for empirical risk minimizers.

Chalmers Research