13 research outputs found
Implicit Regularization of Stochastic Gradient Descent in Natural Language Processing: Observations and Implications
Deep neural networks with remarkably strong generalization performance are
usually over-parameterized. Although practitioners use explicit regularization
strategies to avoid over-fitting, their impact is often small. Some
theoretical studies have analyzed the implicit regularization effect of
stochastic gradient descent (SGD) on simple machine learning models under
certain assumptions. However, how it behaves in practice, on state-of-the-art
models and real-world datasets, is still unknown. To bridge this gap, we study
the role of SGD implicit regularization in deep learning systems. We show that
pure SGD tends to converge to minima that have better generalization
performance across multiple natural language processing (NLP) tasks. This
phenomenon coexists with dropout, an explicit regularizer. In addition, the
finite learning capability of a neural network does not alter the intrinsic
nature of SGD's implicit regularization effect: under limited training samples
or with certain corrupted labels, the effect remains strong. We further
analyze its stability by varying the weight initialization range. We
corroborate these experimental findings with a decision-boundary visualization
using a 3-layer neural network for interpretation. Altogether, our work
deepens the understanding of how implicit regularization affects deep learning
models and sheds light on the future study of the generalization ability of
over-parameterized models.
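
As a toy illustration of the kind of comparison described above (not the paper's actual setup; the architecture, data, corruption fraction, and hyperparameters below are our own choices), one can pit pure SGD against SGD with dropout on a synthetic task with partially corrupted labels and compare test accuracy:

    # Hedged sketch: pure SGD vs. SGD + dropout on synthetic data
    # with 10% corrupted labels. All sizes and rates are illustrative.
    import torch, torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(2000, 20)
    y = (X[:, :2].sum(dim=1) > 0).long()
    y[:200] = 1 - y[:200]                      # corrupt 10% of the labels
    Xtr, ytr, Xte, yte = X[:1000], y[:1000], X[1000:], y[1000:]

    def train(use_dropout):
        layers = [nn.Linear(20, 256), nn.ReLU()]
        if use_dropout:
            layers.append(nn.Dropout(0.5))     # explicit regularizer
        layers.append(nn.Linear(256, 2))
        model = nn.Sequential(*layers)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)  # pure SGD, no momentum
        for _ in range(200):
            idx = torch.randint(0, 1000, (32,))            # mini-batch sampling
            loss = nn.functional.cross_entropy(model(Xtr[idx]), ytr[idx])
            opt.zero_grad(); loss.backward(); opt.step()
        model.eval()
        with torch.no_grad():
            return (model(Xte).argmax(1) == yte).float().mean().item()

    print("pure SGD:      test acc %.3f" % train(False))
    print("SGD + dropout: test acc %.3f" % train(True))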
Fixup Initialization: Residual Learning Without Normalization
Normalization layers are a staple in state-of-the-art deep neural network
architectures. They are widely believed to stabilize training, enable higher
learning rates, accelerate convergence, and improve generalization, though the
reason for their effectiveness is still an active research topic. In this work,
we challenge these commonly held beliefs by showing that none of the perceived
benefits is unique to normalization. Specifically, we propose fixed-update
initialization (Fixup), an initialization motivated by solving the exploding
and vanishing gradient problem at the beginning of training via properly
rescaling a standard initialization. We find training residual networks with
Fixup to be as stable as training with normalization -- even for networks with
10,000 layers. Furthermore, with proper regularization, Fixup enables residual
networks without normalization to achieve state-of-the-art performance in image
classification and machine translation.
Comment: Accepted for publication at ICLR 2019; see
https://openreview.net/forum?id=H1gsz30cK
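
A minimal sketch of Fixup's central rescaling rule, paraphrased from the paper's recipe (the full method also adds scalar bias and multiplier parameters, omitted here; the `fixup_init` helper and the toy branch below are our own illustration):

    # Hedged sketch of the Fixup rescaling for one residual branch.
    # L = number of residual branches in the network, m = layers per branch.
    import torch.nn as nn

    def fixup_init(branch_layers, num_branches):
        """Apply Fixup to one residual branch (a list of weight layers, m >= 2)."""
        m = len(branch_layers)
        for layer in branch_layers[:-1]:
            # standard (He) init, rescaled by L^{-1/(2m-2)}
            nn.init.kaiming_normal_(layer.weight)
            layer.weight.data.mul_(num_branches ** (-1.0 / (2 * m - 2)))
        # the last layer of each residual branch starts at zero, so every
        # residual block is the identity map at initialization
        nn.init.zeros_(branch_layers[-1].weight)

    branch = [nn.Conv2d(64, 64, 3, padding=1), nn.Conv2d(64, 64, 3, padding=1)]
    fixup_init(branch, num_branches=16)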
Just Interpolate: Kernel "Ridgeless" Regression Can Generalize
In the absence of explicit regularization, Kernel "Ridgeless" Regression with
nonlinear kernels has the potential to fit the training data perfectly. It has
been observed empirically, however, that such interpolated solutions can still
generalize well on test data. We isolate a phenomenon of implicit
regularization for minimum-norm interpolated solutions which is due to a
combination of high dimensionality of the input data, curvature of the kernel
function, and favorable geometric properties of the data such as an eigenvalue
decay of the empirical covariance and kernel matrices. In addition to deriving
a data-dependent upper bound on the out-of-sample error, we present
experimental evidence suggesting that the phenomenon occurs in the MNIST
dataset.
Comment: 28 pages, 8 figures
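
The minimum-norm interpolant itself is one line of linear algebra, which makes the phenomenon easy to probe numerically. A hedged sketch (our own toy data, kernel bandwidth, and scaling, not the paper's experiments):

    # Hedged sketch: minimum-norm "ridgeless" kernel interpolation.
    # alpha = K^+ y interpolates the training set exactly, yet the test
    # error typically stays well below chance on this high-dim toy task.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 50, 200                               # high-dimensional inputs
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    w = rng.standard_normal(d)
    y = np.sign(X @ w)
    Xte = rng.standard_normal((n, d)) / np.sqrt(d)
    yte = np.sign(Xte @ w)

    def rbf(A, B, gamma=1.0):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    K = rbf(X, X)
    alpha = np.linalg.pinv(K) @ y                # "ridgeless": no lambda * I term
    train_err = np.mean(np.sign(K @ alpha) != y)           # ~0: interpolation
    test_err = np.mean(np.sign(rbf(Xte, X) @ alpha) != yte)
    print(train_err, test_err)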
On the Generalization Gap in Reparameterizable Reinforcement Learning
Understanding generalization in reinforcement learning (RL) is a significant
challenge, as many common assumptions of traditional supervised learning theory
do not apply. We focus on the special class of reparameterizable RL problems,
where the trajectory distribution can be decomposed using the reparametrization
trick. For this problem class, estimating the expected return is efficient and
the trajectory can be computed deterministically given peripheral random
variables, which enables us to study reparametrizable RL using supervised
learning and transfer learning theory. Through these relationships, we derive
guarantees on the gap between the expected and empirical return for both
intrinsic and external errors, based on Rademacher complexity as well as the
PAC-Bayes bound. Our bound suggests that the generalization capability of
reparameterizable RL is related to multiple factors, including the
"smoothness" of the environment transition, the reward, and the agent policy
function class. We also empirically verify the relationship between the
generalization gap and these factors through simulations.
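
To make the reparametrization idea concrete, here is a toy sketch (our own construction, not the paper's setting): once the peripheral noise variables are fixed, the trajectory, and hence the empirical return, is a deterministic, differentiable function of the policy parameters, so it can be differentiated pathwise.

    # Hedged sketch: pathwise gradient of an empirical return on a toy
    # 1-D linear-control problem. Dynamics, horizon, and cost are illustrative.
    import torch

    theta = torch.tensor(0.5, requires_grad=True)   # linear policy a = theta * s
    eps = torch.randn(100, 20)                      # peripheral noise, drawn once

    def empirical_return(theta):
        ret = 0.0
        s = torch.zeros(100)                        # 100 trajectories in parallel
        for t in range(20):
            a = theta * s
            s = 0.9 * s + a + 0.1 * eps[:, t]       # deterministic given eps
            ret = ret - (s ** 2).mean()             # quadratic state cost
        return ret

    empirical_return(theta).backward()
    print(theta.grad)    # pathwise gradient of the empirical return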
Implicit Bias of Gradient Descent on Linear Convolutional Networks
We show that gradient descent on full-width linear convolutional networks of
depth $L$ converges to a linear predictor related to the $\ell_{2/L}$ bridge
penalty in the frequency domain. This is in contrast to linear fully
connected networks, where gradient descent converges to the hard-margin linear
support vector machine solution, regardless of depth.
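
Written out, the contrast is between two implicitly regularized problems (this is our transcription of the abstract's claim; $\hat{\beta}$ denotes the discrete Fourier transform of the effective linear predictor $\beta$). For a depth-$L$ full-width linear convolutional network, gradient descent converges in direction to a first-order stationary point of

$\min_{\beta} \|\hat{\beta}\|_{2/L} \quad \text{s.t.} \quad y_n \langle x_n, \beta \rangle \ge 1 \ \ \forall n,$

whereas for a fully connected linear network, of any depth, it converges in direction to the solution of

$\min_{\beta} \|\beta\|_{2} \quad \text{s.t.} \quad y_n \langle x_n, \beta \rangle \ge 1 \ \ \forall n.$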
Theoretical insights into the optimization landscape of over-parameterized shallow neural networks
In this paper we study the problem of learning a shallow artificial neural
network that best fits a training data set. We study this problem in the
over-parameterized regime, where the number of observations is smaller than
the number of parameters in the model. We show that with quadratic activations
the optimization landscape of training such shallow neural networks has certain
favorable characteristics that allow globally optimal models to be found
efficiently using a variety of local search heuristics. This result holds for
arbitrary training data of input/output pairs. For differentiable activation
functions we also show that gradient descent, when suitably initialized,
converges at a linear rate to a globally optimal model. This result focuses on
a realizable model where the inputs are chosen i.i.d. from a Gaussian
distribution and the labels are generated according to planted weight
coefficients.
Comment: Section 3 on numerical experiments is added. Theorems 2.1 and 2.2 are
improved to apply to almost all input data (not just Gaussian inputs). The
related-work section is expanded. The paper is accepted for publication in
IEEE Transactions on Information Theory (2018).
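
A hedged numerical sketch of the quadratic-activation claim (sizes and the choice of Adam as the local-search heuristic are ours): in the over-parameterized regime, local search drives the training loss of $f(x) = \sum_j v_j (w_j^\top x)^2$ toward zero on arbitrary labels.

    # Hedged sketch: shallow net with quadratic activations fit to
    # arbitrary labels in the over-parameterized regime (k*d >> n).
    import torch

    torch.manual_seed(0)
    n, d, k = 50, 10, 100
    X = torch.randn(n, d) / d ** 0.5
    y = torch.randn(n)                          # arbitrary labels

    W = torch.randn(k, d, requires_grad=True)
    v = (torch.randn(k) / k).requires_grad_()
    opt = torch.optim.Adam([W, v], lr=1e-2)     # one of many local-search heuristics
    for _ in range(2000):
        pred = ((X @ W.T) ** 2) @ v             # quadratic activation
        loss = ((pred - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    print(loss.item())                          # should approach zero (global fit)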
Provably convergent acceleration in factored gradient descent with applications in matrix sensing
We present theoretical results on the convergence of \emph{non-convex}
accelerated gradient descent in matrix factorization models with $\ell_2$-norm
loss. The purpose of this work is to study the effects of acceleration in
non-convex settings, where provable convergence with acceleration should not be
considered a \emph{de facto} property. The technique is applied to matrix
sensing problems, for the estimation of a rank-$r$ optimal solution $X^\star$.
Our contributions can be summarized as follows. We show that acceleration in
factored gradient descent converges at a linear rate; this fact is novel for
non-convex matrix factorization settings, under common assumptions. Our proof
technique requires the acceleration parameter to be carefully selected, based
on properties of the problem, such as the condition number of $X^\star$ and
the condition number of the objective function. Currently, our proof leads to
the same dependence on the condition number(s) in the contraction parameter as
recent results on non-accelerated algorithms. Acceleration is observed in
practice, both in synthetic examples and in two real applications: recovery of
neuronal multi-unit activities from single-electrode recordings, and quantum
state tomography on quantum computing simulators.
Comment: 23 pages
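
For intuition, here is a minimal sketch of accelerated factored gradient descent on the simplest matrix-sensing-style objective (identity sensing operator; the step size and momentum value below are illustrative, whereas the paper selects the acceleration parameter from problem constants such as condition numbers):

    # Hedged sketch: momentum-accelerated gradient descent on the factored
    # objective f(U) = 0.25 * ||U U^T - M||_F^2, with M = Ustar Ustar^T.
    import numpy as np

    rng = np.random.default_rng(0)
    n, r = 30, 3
    Ustar = rng.standard_normal((n, r))
    M = Ustar @ Ustar.T                         # rank-r ground truth

    def grad(U):                                # gradient of f at U
        return (U @ U.T - M) @ U

    U_prev = 0.1 * rng.standard_normal((n, r))  # small random init
    Z = U_prev.copy()
    eta, mu = 1e-2, 0.5                         # step size, momentum
    for t in range(500):
        U_next = Z - eta * grad(Z)              # gradient step at extrapolated point
        Z = U_next + mu * (U_next - U_prev)     # momentum extrapolation
        U_prev = U_next
    print(np.linalg.norm(U_prev @ U_prev.T - M) / np.linalg.norm(M))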
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
Neural networks have many successful applications, but much less theoretical
understanding has been gained. Toward bridging this gap, we study the problem
of learning a two-layer overparameterized ReLU neural network for multi-class
classification via stochastic gradient descent (SGD) from random
initialization. In the overparameterized setting, when the data come from
mixtures of well-separated distributions, we prove that SGD learns a network
with a small generalization error, even though the network has enough capacity
to fit arbitrary labels. Furthermore, the analysis provides interesting
insights into several aspects of learning neural networks and can be verified
through empirical studies on synthetic data and on the MNIST dataset.
Comment: NeurIPS'18 version. Appendix updated; additional experimental results
added.
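
A toy instance of this setting (our own construction; widths, separation, and learning rate are illustrative): data drawn from well-separated Gaussian mixtures, and an over-parameterized two-layer ReLU network trained by SGD from random initialization.

    # Hedged sketch: SGD on an over-parameterized two-layer ReLU net,
    # with data from a well-separated mixture (4 components, 2 classes).
    import torch, torch.nn as nn

    torch.manual_seed(0)
    centers = 6.0 * torch.randn(4, 10)          # well-separated component means
    def sample(n):
        c = torch.randint(0, 4, (n,))
        return centers[c] + 0.3 * torch.randn(n, 10), c % 2   # label by component
    Xtr, ytr = sample(500)
    Xte, yte = sample(2000)

    net = nn.Sequential(nn.Linear(10, 1000), nn.ReLU(), nn.Linear(1000, 2))
    opt = torch.optim.SGD(net.parameters(), lr=0.05)
    for _ in range(1000):
        i = torch.randint(0, 500, (16,))
        loss = nn.functional.cross_entropy(net(Xtr[i]), ytr[i])
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        print((net(Xte).argmax(1) == yte).float().mean())     # small test error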
Characterizing Implicit Bias in Terms of Optimization Geometry
We study the implicit bias of generic optimization methods, such as mirror
descent, natural gradient descent, and steepest descent with respect to
different potentials and norms, when optimizing underdetermined linear
regression or separable linear classification problems. We explore the question
of whether the specific global minimum (among the many possible global minima)
reached by an algorithm can be characterized in terms of the potential or norm
of the optimization geometry, and independently of hyperparameter choices such
as step-size and momentum.
Comment: (1) A bug in the proof of implicit bias for matrix factorization was
fixed. v2 gives a characterization of the asymptotic bias of the factor
matrices, while v1 made a stronger claim on the limit direction of the
unfactored matrix. (2) v2 also includes new results on the implicit bias of
mirror descent with realizable affine constraints.
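
The flavor of the question is easy to reproduce numerically. In this hedged sketch (our own toy problem, not the paper's analysis), gradient descent from zero on an underdetermined least-squares problem finds the minimum-$\ell_2$-norm interpolant, while mirror descent with the entropy potential (exponentiated gradient) lands on a different interpolant determined by its geometry:

    # Hedged sketch: implicit bias depends on the optimization geometry.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((5, 20))                     # 5 equations, 20 unknowns
    b = A @ rng.random(20)                      # a positive interpolant exists

    w = np.zeros(20)                            # gradient descent from zero
    for _ in range(20000):
        w -= 1e-2 * A.T @ (A @ w - b)

    u = np.ones(20)                             # mirror descent (entropy potential)
    for _ in range(20000):
        u *= np.exp(-1e-2 * A.T @ (A @ u - b))  # exponentiated-gradient update

    w_min = A.T @ np.linalg.solve(A @ A.T, b)   # closed-form min-norm interpolant
    print(np.linalg.norm(w - w_min))            # ~0: GD finds the min-l2 solution
    print(np.linalg.norm(u - w_min))            # > 0: a different implicit bias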
Goodness-of-fit tests on manifolds
We develop a general theory for the goodness-of-fit test for non-linear
models. In particular, we assume that the observations are noisy samples of a
submanifold defined by a sufficiently smooth non-linear map. The observation
noise is additive Gaussian. Our main result shows that the "residual" of the
model fit, obtained by solving a non-linear least-squares problem, follows a
(possibly noncentral) $\chi^2$ distribution. The parameters of the $\chi^2$
distribution are related to the model order and the dimension of the problem.
We further present a method to select the model order sequentially. We
demonstrate the broad applicability of the general theory in machine learning
and signal processing, including determining the rank of low-rank (possibly
complex-valued) matrices and tensors from noisy, partial, or indirect
observations, determining the number of sources in signal demixing, and
potential applications in determining the number of hidden nodes in neural
networks.
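
As one concrete instance of the rank-determination application, consider this hedged sketch (our own toy construction, not the paper's procedure; the degrees-of-freedom count $(n-k)^2$ comes from subtracting the rank-$k$ model's parameter count $k(2n-k)$ from $n^2$): fit rank-$k$ models sequentially and accept the smallest $k$ whose residual is consistent with a $\chi^2$ noise level.

    # Hedged sketch: sequential rank selection via a chi-square test on
    # the residual of the rank-k least-squares (truncated-SVD) fit.
    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    n, true_rank, sigma = 20, 3, 0.1
    U = rng.standard_normal((n, true_rank))
    V = rng.standard_normal((n, true_rank))
    Y = U @ V.T + sigma * rng.standard_normal((n, n))

    s = np.linalg.svd(Y, compute_uv=False)
    for k in range(1, 8):
        resid = np.sum(s[k:] ** 2) / sigma ** 2   # scaled residual of rank-k fit
        dof = (n - k) ** 2                        # n^2 minus k(2n-k) parameters
        if resid <= chi2.ppf(0.95, dof):          # residual consistent with noise?
            print("selected rank:", k)
            break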