24,532 research outputs found
Learning with SGD and Random Features
Sketching and stochastic gradient methods are arguably the most common
techniques to derive efficient large scale learning algorithms. In this paper,
we investigate their application in the context of nonparametric statistical
learning. More precisely, we study the estimator defined by stochastic gradient
with mini batches and random features. The latter can be seen as form of
nonlinear sketching and used to define approximate kernel methods. The
considered estimator is not explicitly penalized/constrained and regularization
is implicit. Indeed, our study highlights how different parameters, such as
number of features, iterations, step-size and mini-batch size control the
learning properties of the solutions. We do this by deriving optimal finite
sample bounds, under standard assumptions. The obtained results are
corroborated and illustrated by numerical experiments
Learning with SGD and Random Features
Sketching and stochastic gradient methods are arguably the most common techniques to derive efficient large scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini batches and random features. The latter can be seen as form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as number of features, iterations, step-size and mini-batch size control the learning properties of the solutions. We do this by deriving optimal finite sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments
Learning with SGD and Random Features
SpotlightInternational audienceSketching and stochastic gradient methods are arguably the most common techniques to derive efficient large scale learning algorithms. In this paper, we investigate their application in the context of nonparametric statistical learning. More precisely, we study the estimator defined by stochastic gradient with mini batches and random features. The latter can be seen as form of nonlinear sketching and used to define approximate kernel methods. The considered estimator is not explicitly penalized/constrained and regularization is implicit. Indeed, our study highlights how different parameters, such as number of features, iterations, step-size and mini-batch size control the learning properties of the solutions. We do this by deriving optimal finite sample bounds, under standard assumptions. The obtained results are corroborated and illustrated by numerical experiments
Exponential Error Convergence in Data Classification with Optimized Random Features: Acceleration by Quantum Machine Learning
Random features are a central technique for scalable learning algorithms
based on kernel methods. A recent work has shown that an algorithm for machine
learning by quantum computer, quantum machine learning (QML), can exponentially
speed up sampling of optimized random features, even without imposing
restrictive assumptions on sparsity and low-rankness of matrices that had
limited applicability of conventional QML algorithms; this QML algorithm makes
it possible to significantly reduce and provably minimize the required number
of features for regression tasks. However, a major interest in the field of QML
is how widely the advantages of quantum computation can be exploited, not only
in the regression tasks. We here construct a QML algorithm for a classification
task accelerated by the optimized random features. We prove that the QML
algorithm for sampling optimized random features, combined with stochastic
gradient descent (SGD), can achieve state-of-the-art exponential convergence
speed of reducing classification error in a classification task under a
low-noise condition; at the same time, our algorithm with optimized random
features can take advantage of the significant reduction of the required number
of features so as to accelerate each iteration in the SGD and evaluation of the
classifier obtained from our algorithm. These results discover a promising
application of QML to significant acceleration of the leading classification
algorithm based on kernel methods, without ruining its applicability to a
practical class of data sets and the exponential error-convergence speed.Comment: 28 pages, no figur
Shaping the learning landscape in neural networks around wide flat minima
Learning in Deep Neural Networks (DNN) takes place by minimizing a non-convex
high-dimensional loss function, typically by a stochastic gradient descent
(SGD) strategy. The learning process is observed to be able to find good
minimizers without getting stuck in local critical points, and that such
minimizers are often satisfactory at avoiding overfitting. How these two
features can be kept under control in nonlinear devices composed of millions of
tunable connections is a profound and far reaching open question. In this paper
we study basic non-convex one- and two-layer neural network models which learn
random patterns, and derive a number of basic geometrical and algorithmic
features which suggest some answers. We first show that the error loss function
presents few extremely wide flat minima (WFM) which coexist with narrower
minima and critical points. We then show that the minimizers of the
cross-entropy loss function overlap with the WFM of the error loss. We also
show examples of learning devices for which WFM do not exist. From the
algorithmic perspective we derive entropy driven greedy and message passing
algorithms which focus their search on wide flat regions of minimizers. In the
case of SGD and cross-entropy loss, we show that a slow reduction of the norm
of the weights along the learning process also leads to WFM. We corroborate the
results by a numerical study of the correlations between the volumes of the
minimizers, their Hessian and their generalization performance on real data.Comment: 37 pages (16 main text), 10 figures (7 main text
Exponential Machines
Modeling interactions between features improves the performance of machine
learning solutions in many domains (e.g. recommender systems or sentiment
analysis). In this paper, we introduce Exponential Machines (ExM), a predictor
that models all interactions of every order. The key idea is to represent an
exponentially large tensor of parameters in a factorized format called Tensor
Train (TT). The Tensor Train format regularizes the model and lets you control
the number of underlying parameters. To train the model, we develop a
stochastic Riemannian optimization procedure, which allows us to fit tensors
with 2^160 entries. We show that the model achieves state-of-the-art
performance on synthetic data with high-order interactions and that it works on
par with high-order factorization machines on a recommender system dataset
MovieLens 100K.Comment: ICLR-2017 workshop track pape
- …