Generalization Properties and Implicit Regularization for Multiple Passes SGM
We study the generalization properties of stochastic gradient methods for
learning with convex loss functions and linearly parameterized functions. We
show that, in the absence of penalizations or constraints, the stability and
approximation properties of the algorithm can be controlled by tuning either
the step-size or the number of passes over the data. In this view, these
parameters can be seen to control a form of implicit regularization. Numerical
results complement the theoretical findings.
Comment: 26 pages, 4 figures. To appear in ICML 201
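The regularization-by-tuning behavior this abstract describes is easy to see in a minimal sketch. The following is an illustrative toy, not the paper's algorithm: a plain multi-pass stochastic gradient loop for least squares with no explicit penalty, where the step size and the number of passes are the only knobs (all names, such as multipass_sgm, are hypothetical):

```python
import numpy as np

def multipass_sgm(X, y, step_size=0.01, n_passes=5, seed=0):
    """Multi-pass stochastic gradient for least squares, with no explicit
    penalty or constraint: regularization is implicit, controlled only by
    step_size and n_passes."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_passes):
        for i in rng.permutation(n):          # one pass = one shuffled sweep
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of the pointwise squared loss
            w -= step_size * grad
    return w

# Fewer passes (or a smaller step size) act like a stronger regularizer;
# more passes (or a larger step) let the iterate fit the data more closely.
```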
Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
Empirical studies show that gradient-based methods can learn deep neural
networks (DNNs) with very good generalization performance in the
over-parameterization regime, where DNNs can easily fit a random labeling of
the training data. Very recently, a line of work explains in theory that with
over-parameterization and proper random initialization, gradient-based methods
can find the global minima of the training loss for DNNs. However, existing
generalization error bounds are unable to explain the good generalization
performance of over-parameterized DNNs. The major limitation of most existing
generalization bounds is that they are based on uniform convergence and are
independent of the training algorithm. In this work, we derive an
algorithm-dependent generalization error bound for deep ReLU networks, and show
that under certain assumptions on the data distribution, gradient descent (GD)
with proper random initialization is able to train a sufficiently
over-parameterized DNN to achieve arbitrarily small generalization error. Our
work sheds light on explaining the good generalization performance of
over-parameterized deep neural networks.
Comment: 27 pages. This version simplifies the proof and improves the presentation in Version 3. In AAAI 202
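As a rough illustration of the training setup such results concern, here is a minimal sketch of full-batch gradient descent on a wide one-hidden-layer ReLU network with Gaussian random initialization. It is an assumption-laden toy (fixed output layer, 1/sqrt(fan-in) initialization scale, squared loss), not the paper's exact architecture or initialization scheme:

```python
import numpy as np

def train_wide_relu(X, y, width=4096, lr=0.5, steps=500, seed=0):
    """Full-batch gradient descent on a wide one-hidden-layer ReLU network.

    width much larger than the sample size puts us in the over-parameterized
    regime; the output layer is kept fixed after random initialization, a
    simplification common in analyses (an assumption, not the paper's setup).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, width))    # random init
    a = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)  # fixed top layer
    for _ in range(steps):
        H = np.maximum(X @ W, 0.0)        # hidden ReLU activations, (n, width)
        err = H @ a - y                   # residuals of the squared loss
        # chain rule: dL/dW = X^T [ (err a^T) * 1{pre-activation > 0} ] / n
        G = X.T @ (np.outer(err, a) * (H > 0.0))
        W -= (lr / n) * G
    return W, a
```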
Boosting with early stopping: Convergence and consistency
Boosting is one of the most significant advances in machine learning for
classification and regression. In its original and computationally flexible
version, boosting seeks to minimize empirically a loss function in a greedy
fashion. The resulting estimator takes an additive function form and is built
iteratively by applying a base estimator (or learner) to updated samples
depending on the previous iterations. An unusual regularization technique,
early stopping, is employed based on cross-validation (CV) or a test set. This paper studies
numerical convergence, consistency and statistical rates of convergence of
boosting with early stopping, when it is carried out over the linear span of a
family of basis functions. For general loss functions, we prove the convergence
of boosting's greedy optimization to the infimum of the loss function over
the linear span. Using the numerical convergence result, we find early-stopping
strategies under which boosting is shown to be consistent based on i.i.d.
samples, and we obtain bounds on the rates of convergence for boosting
estimators. Simulation studies are also presented to illustrate the relevance
of our theoretical results for providing insights to practical aspects of
boosting. As a side product, these results also reveal the importance of
restricting the greedy search step-sizes, as known in practice through the work
of Friedman and others. Moreover, our results lead to a rigorous proof that for
a linearly separable problem, AdaBoost with step-size $\epsilon \to 0$ becomes an $L^1$-margin maximizer when left to run to convergence.
Comment: Published at http://dx.doi.org/10.1214/009053605000000255 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
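To make the setup concrete, here is a minimal sketch of greedy L2 boosting over the linear span of a fixed dictionary, with a restricted step-size eps and early stopping driven by a held-out set. The dictionary matrices, the stopping rule, and all names here are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def boost_early_stop(B_train, y_train, B_val, y_val, eps=0.05, max_iters=2000):
    """Greedy L2 boosting over the linear span of a fixed dictionary.

    B_train, B_val: (n, m) matrices of m (roughly unit-norm) basis functions
    evaluated on the training / validation points. eps is the restricted
    step-size; training stops once validation loss degrades past its best.
    """
    coef = np.zeros(B_train.shape[1])
    best_coef, best_val = coef.copy(), np.inf
    resid = y_train.astype(float).copy()
    for t in range(max_iters):
        corr = B_train.T @ resid            # greedy pick: most correlated basis fn
        j = int(np.argmax(np.abs(corr)))
        step = eps * np.sign(corr[j])       # small step, as the theory requires
        coef[j] += step
        resid -= step * B_train[:, j]
        val_loss = np.mean((B_val @ coef - y_val) ** 2)
        if val_loss < best_val:
            best_val, best_coef = val_loss, coef.copy()
        elif val_loss > 1.1 * best_val:     # early stopping on held-out loss
            break
    return best_coef
```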
Sample Complexity Analysis for Learning Overcomplete Latent Variable Models through Tensor Methods
We provide guarantees for learning latent variable models, with an emphasis on the
overcomplete regime, where the dimensionality of the latent space can exceed
the observed dimensionality. In particular, we consider multiview mixtures,
spherical Gaussian mixtures, ICA, and sparse coding models. We provide tight
concentration bounds for empirical moments through novel covering arguments. We
analyze parameter recovery through a simple tensor power update algorithm. In
the semi-supervised setting, we exploit the label or prior information to get a
rough estimate of the model parameters, and then refine it using the tensor
method on unlabeled samples. We establish that learning is possible when the
number of components scales as $k = o(d^{p/2})$, where $d$ is the observed dimension, and $p$ is the order of the observed moment employed in the tensor
method. Our concentration bound analysis also leads to minimax sample
complexity for semi-supervised learning of spherical Gaussian mixtures. In the
unsupervised setting, we use a simple initialization algorithm based on SVD of
the tensor slices, and provide guarantees under the stricter condition that $k \le \beta d$ (where the constant $\beta$ can be larger than $1$), where the tensor method recovers the components in polynomial running time (and exponential in $\beta$). Our analysis establishes that a wide range of
overcomplete latent variable models can be learned efficiently with low
computational and sample complexity through tensor decomposition methods.
Comment: Title change
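The core update the abstract refers to is the tensor power iteration. Below is a minimal sketch for a symmetric third-order tensor, assuming for illustration exactly orthogonal components (the paper handles far more general overcomplete settings, and its SVD-based initialization of tensor slices is not reproduced here):

```python
import numpy as np

def tensor_power_update(T, u, n_iters=100):
    """Repeated tensor power update u <- T(I, u, u) / ||T(I, u, u)|| for a
    symmetric third-order tensor T. For T = sum_j w_j a_j (x) a_j (x) a_j
    with orthogonal a_j, the iterate converges to one of the components."""
    for _ in range(n_iters):
        v = np.einsum('ijk,j,k->i', T, u, u)  # the multilinear map T(I, u, u)
        u = v / np.linalg.norm(v)
    return u

# Toy check with two orthogonal components in R^3.
rng = np.random.default_rng(0)
a1, a2 = np.eye(3)[0], np.eye(3)[1]
T = 2.0 * np.einsum('i,j,k->ijk', a1, a1, a1) \
  + 1.0 * np.einsum('i,j,k->ijk', a2, a2, a2)
u0 = rng.normal(size=3)
print(tensor_power_update(T, u0 / np.linalg.norm(u0)))  # ~ a1 or a2
```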
Generalization Error in Deep Learning
Deep learning models have lately shown great performance in various fields
such as computer vision, speech recognition, speech translation, and natural
language processing. However, alongside their state-of-the-art performance, the source of their generalization ability remains largely unclear.
Thus, an important question is what makes deep neural networks able to
generalize well from the training set to new data. In this article, we provide
an overview of the existing theory and bounds for the characterization of the
generalization error of deep neural networks, combining both classical and more
recent theoretical and empirical results.
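As one concrete instance of the classical, algorithm-independent bounds such an overview typically covers, the standard Rademacher-complexity bound for a loss taking values in [0, 1] reads as follows (a textbook statement, not a result specific to this article):

```latex
% For a loss \ell with values in [0,1], with probability at least 1-\delta
% over an i.i.d. sample z_1,\dots,z_n, simultaneously for all h \in \mathcal{H}:
\mathbb{E}\,\ell(h; z)
  \;\le\; \frac{1}{n}\sum_{i=1}^{n} \ell(h; z_i)
  \;+\; 2\,\mathfrak{R}_n(\ell \circ \mathcal{H})
  \;+\; \sqrt{\frac{\log(1/\delta)}{2n}}
% \mathfrak{R}_n denotes the Rademacher complexity of the loss class.
```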