Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
We study the problem of training deep neural networks with Rectified Linear
Unit (ReLU) activation function using gradient descent and stochastic gradient
descent. In particular, we study the binary classification problem and show
that for a broad family of loss functions, with proper random weight
initialization, both gradient descent and stochastic gradient descent can find
the global minima of the training loss for an over-parameterized deep ReLU
network, under mild assumptions on the training data. The key idea of our proof
is that Gaussian random initialization followed by (stochastic) gradient
descent produces a sequence of iterates that stay inside a small perturbation
region centered around the initial weights, in which the empirical loss
function of deep ReLU networks enjoys nice local curvature properties that
ensure the global convergence of (stochastic) gradient descent. Our theoretical
results shed light on understanding the optimization for deep learning, and
pave the way for studying the optimization dynamics of training modern deep
neural networks.
Comment: 54 pages. This version relaxes the assumptions on the loss functions
and data distribution, and improves the dependency on the problem-specific
parameters in the main theor
Stuck in a What? Adventures in Weight Space
Deep learning researchers commonly suggest that converged models are stuck in
local minima. More recently, some researchers observed that under reasonable
assumptions, the vast majority of critical points are saddle points, not true
minima. Both descriptions suggest that weights converge around a point in
weight space, be it a local optimum or merely a critical point. However, it's
possible that neither interpretation is accurate. As neural networks are
typically over-complete, it's easy to show the existence of vast continuous
regions through weight space with equal loss. In this paper, we build on recent
work empirically characterizing the error surfaces of neural networks. We
analyze training paths through weight space, presenting evidence that apparent
convergence of loss does not correspond to weights arriving at critical points,
but instead to large movements through flat regions of weight space. While it's
trivial to show that neural network error surfaces are globally non-convex, we
show that error surfaces are also locally non-convex, even after breaking
symmetry with a random initialization and also after partial training
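
One simple way to see the "continuous regions with equal loss" that the abstract mentions is the rescaling symmetry of ReLU units. The sketch below is only an illustration, not the paper's experiment: the tiny bias-free network, the toy data, and the names `forward` and `mse` are all assumptions made for the example. Multiplying a hidden unit's incoming weight by a > 0 and dividing its outgoing weight by a leaves the network function, and therefore the loss, unchanged.

```python
import random

def relu(z):
    return max(0.0, z)

def forward(x, w1, w2):
    """One-hidden-layer ReLU network, scalar input and output, no biases."""
    return sum(v * relu(u * x) for u, v in zip(w1, w2))

def mse(data, w1, w2):
    """Mean squared error of the network over (x, y) pairs."""
    return sum((forward(x, w1, w2) - y) ** 2 for x, y in data) / len(data)

random.seed(0)
w1 = [random.gauss(0, 1) for _ in range(4)]   # input -> hidden weights
w2 = [random.gauss(0, 1) for _ in range(4)]   # hidden -> output weights
data = [(x / 4.0, (x / 4.0) ** 2) for x in range(-4, 5)]  # toy regression set

base_loss = mse(data, w1, w2)

# Move along the rescaling direction: the weights change, the loss does not.
rescaled_losses = []
for a in (0.5, 2.0, 10.0):
    w1_scaled = [a * u for u in w1]
    w2_scaled = [v / a for v in w2]
    rescaled_losses.append(mse(data, w1_scaled, w2_scaled))
```

Every value in `rescaled_losses` matches `base_loss` up to floating-point rounding, even though the weight vectors are far apart, so a single loss value corresponds to a whole curve of weight settings.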
Snapshot Ensembles: Train 1, get M for free
Ensembles of neural networks are known to be much more robust and accurate
than individual networks. However, training multiple deep networks for model
averaging is computationally expensive. In this paper, we propose a method to
obtain the seemingly contradictory goal of ensembling multiple neural networks
at no additional training cost. We achieve this goal by training a single
neural network, converging to several local minima along its optimization path
and saving the model parameters. To obtain repeated rapid convergence, we
leverage recent work on cyclic learning rate schedules. The resulting
technique, which we refer to as Snapshot Ensembling, is simple, yet
surprisingly effective. We show in a series of experiments that our approach is
compatible with diverse network architectures and learning tasks. It
consistently yields lower error rates than state-of-the-art single models at no
additional training cost, and compares favorably with traditional network
ensembles. On CIFAR-10 and CIFAR-100 our DenseNet Snapshot Ensembles obtain
error rates of 3.4% and 17.4% respectively
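
The abstract does not state which cyclic schedule is used. A common choice for this kind of training is cosine annealing with warm restarts, sketched below under that assumption. The function names `snapshot_lr` and `snapshot_iterations` and their parameters are hypothetical and chosen for the example only.

```python
import math

def snapshot_lr(t, total_iters, num_cycles, lr_max):
    """Learning rate at 1-based iteration t under cosine annealing with restarts.

    Each cycle starts at lr_max and decays towards zero; the next cycle
    restarts at lr_max.
    """
    cycle_len = math.ceil(total_iters / num_cycles)
    pos = (t - 1) % cycle_len            # position inside the current cycle
    return 0.5 * lr_max * (math.cos(math.pi * pos / cycle_len) + 1.0)

def snapshot_iterations(total_iters, num_cycles):
    """Iterations at which a snapshot would be saved (the end of each cycle)."""
    cycle_len = math.ceil(total_iters / num_cycles)
    return [min(k * cycle_len, total_iters) for k in range(1, num_cycles + 1)]

# Example: 300 iterations split into 3 cycles.
schedule = [snapshot_lr(t, 300, 3, 0.1) for t in range(1, 301)]
save_points = snapshot_iterations(300, 3)
```

In this sketch a model copy would be stored at each entry of `save_points`, where the learning rate is near its minimum, and the stored copies would be averaged at test time.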
Data optimization for large batch distributed training of deep neural networks
Distributed training in deep learning (DL) is common practice as data and
models grow. The current practice for distributed training of deep neural
networks faces the challenges of communication bottlenecks when operating at
scale, and model accuracy deterioration with an increase in global batch size.
Present solutions focus on improving message exchange efficiency as well as
implementing techniques to tweak batch sizes and models in the training
process. The loss of training accuracy typically happens because the loss
function gets trapped in a local minimum. We observe that the loss landscape
minimization is shaped by both the model and training data and propose a data
optimization approach that utilizes machine learning to implicitly smooth out
the loss landscape resulting in fewer local minima. Our approach filters out
data points which are less important to feature learning, enabling us to speed
up the training of models on larger batch sizes to improved accuracy.
Comment: Computational Science & Computational Intelligence (CSCI'20), 7 page
Progressive Learning of Low-Precision Networks
Recent years have witnessed the great advance of deep learning in a variety
of vision tasks. Many state-of-the-art deep neural networks suffer from large
size and high complexity, which makes them difficult to deploy on
resource-limited platforms such as mobile devices.
To this end, low-precision neural networks are widely studied which quantize
weights or activations into the low-bit format.
Though being efficient, low-precision networks are usually hard to train and
encounter severe accuracy degradation.
In this paper, we propose a new training strategy through expanding
low-precision networks during training and removing the expanded parts for
network inference.
First, we equip each low-precision convolutional layer with an ancillary
full-precision convolutional layer based on a low-precision network structure,
which could guide the network to good local minima.
Second, a decay method is introduced to reduce the output of the added
full-precision convolution gradually, which keeps the resulting topology
the same as the original low-precision one.
Experiments on SVHN, CIFAR and ILSVRC-2012 datasets prove that the proposed
method can bring faster convergence and higher accuracy for low-precision
neural networks.
Comment: 10 pages, 8 figure
Langevin Dynamics with Continuous Tempering for Training Deep Neural Networks
Minimizing non-convex and high-dimensional objective functions is
challenging, especially when training modern deep neural networks. In this
paper, a novel approach is proposed which divides the training process into two
consecutive phases to obtain better generalization performance: Bayesian
sampling and stochastic optimization. The first phase is to explore the energy
landscape and to capture the "fat" modes; and the second one is to fine-tune
the parameter learned from the first phase. In the Bayesian learning phase, we
apply continuous tempering and stochastic approximation into the Langevin
dynamics to create an efficient and effective sampler, in which the temperature
is adjusted automatically according to the designed "temperature dynamics".
These strategies can overcome the challenge of early trapping into bad local
minima and have achieved remarkable improvements in various types of neural
networks as shown in our theoretical analysis and empirical experiments
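
For orientation, the basic Langevin update that such samplers build on can be sketched as follows. This is plain Langevin dynamics at a fixed temperature on a toy one-dimensional double-well potential, not the continuous-tempering sampler of the paper. The potential, the function names, and the step sizes are all assumptions made for the example.

```python
import math
import random

def grad_double_well(theta):
    """Gradient of U(theta) = (theta^2 - 1)^2, which has minima at -1 and +1."""
    return 4.0 * theta * (theta * theta - 1.0)

def langevin_step(theta, grad, step_size, temperature, rng):
    """One Euler step: gradient descent plus temperature-scaled Gaussian noise."""
    noise = rng.gauss(0.0, 1.0)
    return (theta
            - step_size * grad(theta)
            + math.sqrt(2.0 * step_size * temperature) * noise)

def run_chain(theta0, steps, step_size, temperature, seed=0):
    """Run the chain from theta0 and return every visited point."""
    rng = random.Random(seed)
    theta = theta0
    path = [theta]
    for _ in range(steps):
        theta = langevin_step(theta, grad_double_well, step_size, temperature, rng)
        path.append(theta)
    return path

# Temperature 0 removes the noise term and reduces to plain gradient descent.
cold_path = run_chain(0.5, 2000, 0.01, 0.0)
# A positive temperature injects noise that lets the chain explore.
warm_path = run_chain(0.5, 2000, 0.01, 0.5)
```

The temperature controls the trade-off: higher values favour exploration of the landscape, while a value of zero recovers ordinary descent into the nearest minimum.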
On the energy landscape of deep networks
We introduce "AnnealSGD", a regularized stochastic gradient descent algorithm
motivated by an analysis of the energy landscape of a particular class of deep
networks with sparse random weights. The loss function of such networks can be
approximated by the Hamiltonian of a spherical spin glass with Gaussian
coupling. While different from currently-popular architectures such as
convolutional ones, spin glasses are amenable to analysis, which provides
insights on the topology of the loss function and motivates algorithms to
minimize it. Specifically, we show that a regularization term akin to a
magnetic field can be modulated with a single scalar parameter to transition
the loss function from a complex, non-convex landscape with exponentially many
local minima, to a phase with a polynomial number of minima, all the way down
to a trivial landscape with a unique minimum. AnnealSGD starts training in the
relaxed polynomial regime and gradually tightens the regularization parameter
to steer the energy towards the original exponential regime. Even for
convolutional neural networks, which are quite unlike sparse random networks,
we empirically show that AnnealSGD improves the generalization error using
competitive baselines on MNIST and CIFAR-10
Integrating Deep Neural Networks with Full-waveform Inversion: Reparametrization, Regularization, and Uncertainty Quantification
Full-waveform inversion (FWI) is an imaging approach for modeling velocity
structure by minimizing the misfit between recorded and predicted seismic
waveforms. The strong non-linearity of FWI resulting from fitting oscillatory
waveforms can trap the optimization in local minima. We propose a
neural-network-based full waveform inversion method (NNFWI) that integrates
deep neural networks with FWI by representing the velocity model with a
generative neural network. Neural networks can naturally introduce spatial
correlations as regularization to the generated velocity model, which
suppresses noise in the gradients and mitigates local minima. The velocity
model generated by neural networks is input to the same partial differential
equation (PDE) solvers used in conventional FWI. The gradients of both the
neural networks and PDEs are calculated using automatic differentiation, which
back-propagates gradients through the acoustic/elastic PDEs and neural network
layers to update the weights and biases of the generative neural network.
Experiments on 1D velocity models, the Marmousi model, and the 2004 BP model
demonstrate that NNFWI can mitigate local minima, especially for imaging high
contrast features like salt bodies, and significantly improves the inversion in
the presence of noise. Adding dropout layers to the neural network model also
allows analyzing the uncertainty of the inversion results through Monte Carlo
dropout. NNFWI opens a new pathway to combine deep learning and FWI for
exploiting both the characteristics of deep neural networks and the high
accuracy of PDE solvers. Because NNFWI does not require extra training data and
optimization loops, it provides an attractive and straightforward alternative
to conventional FWI
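
The Monte Carlo dropout idea mentioned above can be sketched on a toy model: keep dropout active at prediction time, run several stochastic forward passes, and read the spread of the outputs as an uncertainty estimate. The tiny network and the names `stochastic_forward` and `mc_dropout_predict` are assumptions for illustration and are unrelated to the NNFWI implementation.

```python
import random
import statistics

def stochastic_forward(x, w_hidden, w_out, drop_prob, rng):
    """One forward pass of a one-hidden-layer ReLU net with inverted dropout."""
    keep = 1.0 - drop_prob
    out = 0.0
    for u, v in zip(w_hidden, w_out):
        h = max(0.0, u * x)
        if rng.random() < keep:          # unit survives with probability keep
            out += v * h / keep          # rescale so the expectation is unchanged
    return out

def mc_dropout_predict(x, w_hidden, w_out, drop_prob, num_samples, seed=0):
    """Mean and sample standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, w_hidden, w_out, drop_prob, rng)
               for _ in range(num_samples)]
    return statistics.mean(samples), statistics.stdev(samples)

w_hidden = [1.0, 2.0, 3.0]
w_out = [1.0, 1.0, 1.0]
mean_out, std_out = mc_dropout_predict(1.0, w_hidden, w_out, 0.5, 1000)
```

Here `mean_out` approximates the deterministic prediction and `std_out` gives a rough per-input uncertainty; in an inversion setting the same statistics would be computed for each cell of the generated velocity model.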
To Drop or Not to Drop: Robustness, Consistency and Differential Privacy Properties of Dropout
Training deep belief networks (DBNs) requires optimizing a non-convex
function with an extremely large number of parameters. Naturally, existing
gradient descent (GD) based methods are prone to arbitrarily poor local minima.
In this paper, we rigorously show that such local minima can be avoided (up to
an approximation error) by using the dropout technique, a widely used heuristic
in this domain. In particular, we show that by randomly dropping a few nodes of
a one-hidden layer neural network, the training objective function, up to a
certain approximation error, decreases by a multiplicative factor.
On the flip side, we show that for training convex empirical risk minimizers
(ERM), dropout in fact acts as a "stabilizer" or regularizer. That is, a simple
dropout based GD method for convex ERMs is stable in the face of arbitrary
changes to any one of the training points. Using the above assertion, we show
that dropout provides fast rates for generalization error in learning (convex)
generalized linear models (GLM). Moreover, using the above mentioned stability
properties of dropout, we design dropout based differentially private
algorithms for solving ERMs. The learned GLM thus preserves the privacy of each of
the individual training points while providing accurate predictions for new
test points. Finally, we empirically validate our stability assertions for
dropout in the context of convex ERMs and show that surprisingly, dropout
significantly outperforms (in terms of prediction accuracy) the L2
regularization based methods for several benchmark datasets.
Comment: Currently under review for ICML 201
Ill-Posedness and Optimization Geometry for Nonlinear Neural Network Training
In this work we analyze the role nonlinear activation functions play at
stationary points of dense neural network training problems. We consider a
generic least squares loss function training formulation. We show that the
nonlinear activation functions used in the network construction play a critical
role in classifying stationary points of the loss landscape. We show that for
shallow dense networks, the nonlinear activation function determines the
Hessian nullspace in the vicinity of global minima (if they exist), and
therefore determines the ill-posedness of the training problem. Furthermore,
for shallow nonlinear networks we show that the zeros of the activation
function and its derivatives can lead to spurious local minima, and discuss
conditions for strict saddle points. We extend these results to deep dense
neural networks, showing that the last activation function plays an important
role in classifying stationary points, due to how it shows up in the gradient
from the chain rule