    On the Role of Optimization in Double Descent: A Least Squares Study

    Empirically it has been observed that the performance of deep neural networks steadily improves as we increase model size, contradicting the classical view on overfitting and generalization. Recently, the double descent phenomenon has been proposed to reconcile this observation with theory, suggesting that the test error has a second descent when the model becomes sufficiently overparameterized, as the model size itself acts as an implicit regularizer. In this paper we add to the growing body of work in this space, providing a careful study of learning dynamics as a function of model size for the least squares scenario. We show an excess risk bound for the gradient descent solution of the least squares objective. The bound depends on the smallest non-zero eigenvalue of the covariance matrix of the input features, via a functional form that has the double descent behavior. This gives a new perspective on the double descent curves reported in the literature. Our analysis of the excess risk allows us to decouple the effects of optimization and generalization error. In particular, we find that in the case of noiseless regression, double descent is explained solely by optimization-related quantities, which was missed in studies focusing on the Moore-Penrose pseudoinverse solution. We believe that our derivation provides an alternative view compared to existing work, shedding some light on a possible cause of this phenomenon, at least in the considered least squares setting. We empirically explore whether our predictions hold for neural networks, in particular whether the covariance of intermediary hidden activations has a similar behavior to the one predicted by our derivations.
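
    As a rough illustration of the setting described above (a toy simulation of our own, not code from the paper), the sketch below fits least squares by gradient descent for a range of model sizes and prints the smallest non-zero eigenvalue of the empirical feature covariance next to the test error; near the interpolation threshold d ≈ n that eigenvalue typically shrinks and the test error peaks. All constants (sample sizes, step size, iteration budget) are arbitrary illustrative choices.

    # Toy double descent simulation for gradient-descent least squares (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    n_train, n_test, d_full = 100, 1000, 300
    w_true = rng.normal(size=d_full) / np.sqrt(d_full)
    X_tr, X_te = rng.normal(size=(n_train, d_full)), rng.normal(size=(n_test, d_full))
    y_tr, y_te = X_tr @ w_true, X_te @ w_true           # noiseless regression targets

    for d in (25, 50, 90, 100, 110, 150, 300):          # model size = number of features used
        X, Xt = X_tr[:, :d], X_te[:, :d]
        eigs = np.linalg.eigvalsh(X.T @ X / n_train)
        lam_min = eigs[eigs > 1e-10].min()              # smallest non-zero eigenvalue of the covariance
        w, lr = np.zeros(d), 1.0 / eigs.max()           # step size set from the largest eigenvalue
        for _ in range(2000):                           # fixed optimization budget
            w -= lr * X.T @ (X @ w - y_tr) / n_train
        test_mse = np.mean((Xt @ w - y_te) ** 2)
        print(f"d={d:4d}  lam_min={lam_min:.4f}  test_mse={test_mse:.4f}")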

    PAC-Bayes analysis beyond the usual bounds

    We focus on a stochastic learning model where the learner observes a finite set of training examples and the output of the learning process is a data-dependent distribution over a space of hypotheses. The learned data-dependent distribution is then used to make randomized predictions, and the high-level theme addressed here is guaranteeing the quality of predictions on examples that were not seen during training, i.e. generalization. In this setting the unknown quantity of interest is the expected risk of the data-dependent randomized predictor, for which upper bounds can be derived via a PAC-Bayes analysis, leading to PAC-Bayes bounds. Specifically, we present a basic PAC-Bayes inequality for stochastic kernels, from which one may derive extensions of various known PAC-Bayes bounds as well as novel bounds. We clarify the role of the requirements of fixed ‘data-free’ priors, bounded losses, and i.i.d. data. We highlight that those requirements were used to upper-bound an exponential moment term, while the basic PAC-Bayes theorem remains valid without those restrictions. We present three bounds that illustrate the use of data-dependent priors, including one for the unbounded square loss.
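
    For orientation, the change-of-measure step that underlies PAC-Bayes analyses in general (this is the standard Donsker-Varadhan inequality, not a restatement of the paper's theorem) reads, for any 'prior' \pi, any 'posterior' \rho, and any measurable function f on the hypothesis space:

    \[
    \mathbb{E}_{h \sim \rho}\,[f(h)] \;\le\; \mathrm{KL}(\rho \,\|\, \pi) \;+\; \log \mathbb{E}_{h \sim \pi}\!\left[ e^{f(h)} \right].
    \]

    Choosing f(h) proportional to the gap between population and empirical risk makes the last term exactly the exponential moment the abstract refers to; the familiar requirements (data-free prior, bounded loss, i.i.d. data) are what make that term controllable, while the inequality itself needs none of them.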

    Efficient Linear Bandits through Matrix Sketching

    We prove that two popular linear contextual bandit algorithms, OFUL and Thompson Sampling, can be made efficient using Frequent Directions, a deterministic online sketching technique. More precisely, we show that a sketch of size m allows an O(md) update time for both algorithms, as opposed to the Ω(d^2) required by their non-sketched versions in general (where d is the dimension of context vectors). This computational speedup is accompanied by regret bounds of order (1 + ε_m)^{3/2} d√T for OFUL and of order (1 + ε_m) d^{3/2} √T for Thompson Sampling, where ε_m is bounded by the sum of the tail eigenvalues not covered by the sketch. In particular, when the selected contexts span a subspace of dimension at most m, our algorithms have a regret bound matching that of their slower, non-sketched counterparts. Experiments on real-world datasets corroborate our theoretical results.
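
    Frequent Directions, the sketching technique named above, is a published deterministic algorithm; the sketch below is a compact implementation written from its standard description (not the authors' code). It maintains an m x d matrix B whose Gram matrix B^T B approximates that of all context vectors seen so far, the idea being that this m x d object can stand in for the d x d correlation matrix the non-sketched algorithms maintain.

    # Frequent Directions sketch of a stream of d-dimensional rows (assumes m <= d).
    import numpy as np

    class FrequentDirections:
        def __init__(self, m, d):
            self.B = np.zeros((m, d))

        def update(self, x):
            """Insert one context vector x of shape (d,) into the sketch."""
            zero_rows = np.where(~self.B.any(axis=1))[0]
            if len(zero_rows) == 0:          # sketch full: shrink to free a row
                self._shrink()
                zero_rows = np.where(~self.B.any(axis=1))[0]
            self.B[zero_rows[0]] = x

        def _shrink(self):
            # Subtract the smallest squared singular value from all squared
            # singular values; at least the last row of B becomes zero.
            _, s, Vt = np.linalg.svd(self.B, full_matrices=False)
            s_new = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))
            self.B = s_new[:, None] * Vt

        def gram(self):
            """B^T B, a rank-at-most-m approximation of the stream's Gram matrix."""
            return self.B.T @ self.B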

    Adding New Tasks to a Single Network with Weight Transformations using Binary Masks

    Visual recognition algorithms are required today to exhibit adaptive abilities. Given a deep model trained on a specific task, it would be highly desirable to be able to adapt incrementally to new tasks, preserving scalability as the number of new tasks increases, while at the same time avoiding catastrophic forgetting issues. Recent work has shown that masking the internal weights of a given original conv-net through learned binary variables is a promising strategy. We build upon this intuition and take into account more elaborate affine transformations of the convolutional weights that include learned binary masks. We show that with our generalization it is possible to achieve significantly higher levels of adaptation to new tasks, enabling the approach to compete with fine-tuning strategies while requiring only slightly more than 1 bit per network parameter per additional task. Experiments on two popular benchmarks showcase the power of our approach, which achieves a new state of the art on the Visual Decathlon Challenge.
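
    As a sketch of the kind of masked affine weight transformation described above (the exact parameterization used in the paper may differ; the scalar affine form and the straight-through binarization below are our assumptions for illustration), a PyTorch-style adapter for a single convolutional layer could look like this:

    # Hypothetical task adapter: frozen base kernel, learned binary mask, scalar affine terms.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskedAffineConv2d(nn.Module):
        def __init__(self, base_conv: nn.Conv2d):
            super().__init__()
            self.weight = nn.Parameter(base_conv.weight.detach().clone(),
                                       requires_grad=False)                 # shared backbone kernel, frozen
            self.mask_logits = nn.Parameter(torch.zeros_like(self.weight))  # binarized at test time: ~1 bit/parameter
            self.k0 = nn.Parameter(torch.ones(1))                           # task-specific scale of the base kernel
            self.k1 = nn.Parameter(torch.ones(1))                           # task-specific scale of the masked kernel
            self.stride, self.padding = base_conv.stride, base_conv.padding

        def forward(self, x):
            soft = torch.sigmoid(self.mask_logits)
            mask = (soft > 0.5).float() + soft - soft.detach()              # straight-through estimator
            w = self.k0 * self.weight + self.k1 * mask * self.weight        # affine transform with binary mask
            return F.conv2d(x, w, stride=self.stride, padding=self.padding)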

    Nonparametric Online Regression while Learning the Metric

    We study algorithms for online nonparametric regression that learn the directions along which the regression function is smoother. Our algorithm learns the Mahalanobis metric based on the gradient outer product matrix G of the regression function (automatically adapting to the effective rank of this matrix), while simultaneously bounding the regret (on the same data sequence) in terms of the spectrum of G. As a preliminary step in our analysis, we extend a nonparametric online learning algorithm by Hazan and Megiddo, enabling it to compete against functions whose Lipschitzness is measured with respect to an arbitrary Mahalanobis metric.
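
    To make the role of the gradient outer product matrix concrete, here is a small batch illustration (our own, not the paper's online algorithm): G is estimated by finite differences, and its spectrum reveals the directions along which the regression function actually varies, which is exactly the information a Mahalanobis metric built from G exploits.

    # Batch estimate of G = E[grad f(x) grad f(x)^T] via central finite differences.
    import numpy as np

    def gradient_outer_product(f, X, eps=1e-4):
        n, d = X.shape
        G = np.zeros((d, d))
        for x in X:
            g = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                          for e in np.eye(d)])
            G += np.outer(g, g)
        return G / n

    def mahalanobis(u, v, M):
        diff = u - v
        return float(np.sqrt(diff @ M @ diff))         # distance induced by M

    # Toy target depending only on x[0] + x[1]: G has effective rank one.
    f = lambda x: np.sin(x[0] + x[1])
    X = np.random.default_rng(1).uniform(-1, 1, size=(200, 5))
    G = gradient_outer_product(f, X)
    print(np.round(np.linalg.eigvalsh(G)[::-1], 3))    # one dominant eigenvalue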

    On the challenge of classifying 52 hand movements from surface electromyography

    The level of dexterity of myoelectric hand prostheses depends to a large extent on the feature representation and subsequent classification of surface electromyography signals. This work presents a comparison of various feature extraction and classification methods on a large-scale surface electromyography database containing 52 different hand movements obtained from 27 subjects. Results indicate that simple feature representations such as Mean Absolute Value and Waveform Length can achieve similar performance to the computationally more demanding marginal Discrete Wavelet Transform. With respect to classifiers, the Support Vector Machine was found to be the only method that consistently achieved top performance in combination with each feature extraction method.
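
    The two time-domain features singled out above are simple enough to state in a few lines; the sketch below (window length, channel count, and SVM settings are illustrative placeholders, not the study's protocol) computes Mean Absolute Value and Waveform Length per channel and feeds them to a scikit-learn SVM.

    # MAV and WL features from windowed multi-channel sEMG, classified with an SVM.
    import numpy as np
    from sklearn.svm import SVC

    def mav(window):
        """Mean Absolute Value per channel; window has shape (samples, channels)."""
        return np.mean(np.abs(window), axis=0)

    def waveform_length(window):
        """Waveform Length: cumulative absolute first difference, per channel."""
        return np.sum(np.abs(np.diff(window, axis=0)), axis=0)

    def featurize(windows):
        return np.array([np.concatenate([mav(w), waveform_length(w)]) for w in windows])

    # Random stand-in data: 300 windows of 200 samples x 10 channels, 52 movement labels.
    rng = np.random.default_rng(0)
    windows = rng.normal(size=(300, 200, 10))
    labels = rng.integers(0, 52, size=300)
    clf = SVC(kernel="rbf", C=1.0).fit(featurize(windows), labels)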

    Distribution-Dependent Analysis of Gibbs-ERM Principle

    Gibbs-ERM learning is a natural idealized model of learning with stochastic optimization algorithms (such as SGLD and, to some extent, SGD), while it also arises in other contexts, including PAC-Bayesian theory and sampling mechanisms. In this work we study the excess risk suffered by a Gibbs-ERM learner that uses a non-convex, regularized empirical risk, with the goal of understanding the interplay between the data-generating distribution and learning in large hypothesis spaces. Our main results are distribution-dependent upper bounds on several notions of excess risk. We show that, in all cases, the distribution-dependent excess risk is essentially controlled by the effective dimension tr(H*(H* + λI)^{-1}) of the problem, where H* is the Hessian matrix of the risk at a local minimum. This is a well-established notion of effective dimension appearing in several previous works, including the analyses of SGD and ridge regression, but ours is the first work that brings this dimension to the analysis of learning using Gibbs densities. The distribution-dependent view we advocate here improves upon earlier results of Raginsky et al. (2017), and can yield much tighter bounds depending on the interplay between the data-generating distribution and the loss function. The first part of our analysis focuses on the localized excess risk in the vicinity of a fixed local minimizer. This result is then extended to bounds on the global excess risk by characterizing the probabilities of local minima (and their complement) under Gibbs densities, a result which might be of independent interest.
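
    Since the bounds above hinge on the effective dimension tr(H*(H* + λI)^{-1}), it may help to see how it behaves numerically: each eigenvalue μ_i of H* contributes μ_i / (μ_i + λ), so directions that are flat relative to λ barely count. The spectrum below is an arbitrary example, not one taken from the paper.

    # Effective dimension from the eigenvalues of a (PSD) Hessian at a local minimum.
    import numpy as np

    def effective_dimension(hessian_eigs, lam):
        mu = np.asarray(hessian_eigs, dtype=float)
        return float(np.sum(mu / (mu + lam)))

    eigs = np.array([10.0, 5.0, 1.0, 0.01, 0.001])     # illustrative spectrum of H*
    print(effective_dimension(eigs, lam=0.1))           # ~3, well below the ambient dimension 5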
