Training feedforward neural networks using orthogonal iteration of the Hessian eigenvectors
Introduction
Training algorithms for Multilayer Perceptrons optimize the set of W weights and biases, w, so as to minimize an
error function, E, applied to a set of N training patterns. The well-known back propagation algorithm combines an
efficient method of estimating the gradient of the error function in weight space, ∇E = g, with a simple gradient
descent procedure to adjust the weights, Δw = -ηg. More efficient algorithms maintain the gradient estimation
procedure, but replace the update step with a faster non-linear optimization strategy [1].
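As a point of reference, the following is a minimal sketch of the plain gradient descent update Δw = -ηg; the names
grad_fn and eta are illustrative placeholders rather than part of any particular implementation.

```python
import numpy as np

def gradient_descent_step(w, grad_fn, eta=0.01):
    """One plain gradient descent update: w <- w - eta * g.

    grad_fn is assumed to return the gradient g = dE/dw of the error E
    with respect to the weight vector w (e.g. computed by back propagation).
    """
    g = grad_fn(w)       # gradient estimate in weight space
    return w - eta * g   # simple descent step, delta w = -eta * g

# Toy usage on a quadratic error E(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.ones(3)
w = gradient_descent_step(w, grad_fn=lambda w: w, eta=0.1)
```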
Efficient non-linear optimization algorithms are based upon second order approximation [2]. When sufficiently
close to a minimum the error surface is approximately quadratic, the shape being determined by the Hessian matrix.
Bishop [1] presents a detailed discussion of the properties and significance of the Hessian matrix. In principle, if
sufficiently close to a minimum it is possible to move directly to the minimum using the Newton step, -H⁻¹g.
In practice, the Newton step is not used as H⁻¹ is very expensive to evaluate; in addition, when not sufficiently close
to a minimum, the Newton step may be disastrously poor. Second order algorithms either build
up an approximation to H⁻¹, or construct a search strategy that implicitly exploits its structure without evaluating it;
they also either take precautions to prevent steps that lead to a deterioration in error, or explicitly reject such steps.
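To make the quadratic model concrete, the sketch below takes a full Newton step for a small, explicitly formed
Hessian; as noted above, this is exactly what practical algorithms avoid when W is large.

```python
import numpy as np

def newton_step(w, g, H):
    """Full Newton update w <- w - H^{-1} g (feasible only for small W).

    Near a minimum the error is approximately quadratic,
    E(w + dw) ~ E(w) + g.dw + 0.5 * dw.H.dw, and this step moves
    directly to the minimizer of that quadratic model.
    """
    dw = np.linalg.solve(H, -g)   # solve H dw = -g rather than forming H^{-1}
    return w + dw
```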
In applying non-linear optimization algorithms to neural networks, a key consideration is the high-dimensional
nature of the search space. Neural networks with thousands of weights are not uncommon. Some algorithms have
O(W²) or O(W³) memory or execution times, and are hence impracticable in such cases. It is desirable to identify
algorithms that have limited memory requirements, particularly algorithms where one may trade memory usage
against convergence speed.
The paper describes a new training algorithm that has scalable memory requirements, which may range from O(W)
to O(W²), although in practice the useful range is limited to lower complexity levels. The algorithm is based upon a
novel iterative estimation of the principal eigen-subspace of the Hessian, together with a quadratic step estimation
procedure.
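The paper's exact eigen-subspace and step estimation procedures are developed later; the following is only a generic
sketch of orthogonal (subspace) iteration driven by Hessian-vector products, where hvp is an assumed routine
returning Hv without forming H explicitly.

```python
import numpy as np

def principal_subspace(hvp, n_weights, k, n_iters=20, seed=0):
    """Estimate the k principal eigenvectors of the Hessian by orthogonal
    (subspace) iteration, without ever forming the full W x W matrix.

    hvp(v) is assumed to return the Hessian-vector product H v.
    Storage is O(W * k), so k trades memory usage against accuracy.
    """
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((n_weights, k)))  # random orthonormal basis
    for _ in range(n_iters):
        HV = np.column_stack([hvp(V[:, j]) for j in range(k)])  # apply H to the basis
        V, _ = np.linalg.qr(HV)                                  # re-orthogonalize
    eigvals = np.array([V[:, j] @ hvp(V[:, j]) for j in range(k)])  # Rayleigh quotients
    return V, eigvals
```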
It is shown that the new algorithm has convergence time comparable to conjugate gradient descent, and may be
preferable if early stopping is used as it converges more quickly during the initial phases.
Section 2 overviews the principles of second order training algorithms. Section 3 introduces the new algorithm.
Section 4 discusses experiments that confirm the algorithm's performance; Section 5 concludes the paper.
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
A central challenge to many fields of science and engineering involves
minimizing non-convex error functions over continuous, high dimensional spaces.
Gradient descent or quasi-Newton methods are almost ubiquitously used to
perform such minimizations, and it is often thought that a main source of
difficulty for these local methods to find the global minimum is the
proliferation of local minima with much higher error than the global minimum.
Here we argue, based on results from statistical physics, random matrix theory,
neural network theory, and empirical evidence, that a deeper and more profound
difficulty originates from the proliferation of saddle points, not local
minima, especially in high dimensional problems of practical interest. Such
saddle points are surrounded by high error plateaus that can dramatically slow
down learning, and give the illusory impression of the existence of a local
minimum. Motivated by these arguments, we propose a new approach to
second-order optimization, the saddle-free Newton method, that can rapidly
escape high dimensional saddle points, unlike gradient descent and quasi-Newton
methods. We apply this algorithm to deep or recurrent neural network training,
and provide numerical evidence for its superior optimization performance.
Comment: The theoretical review and analysis in this article draw heavily from arXiv:1405.4604 [cs.LG].
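The abstract above does not spell out the update rule; as the saddle-free Newton method is usually summarized, the
gradient is rescaled by the inverse absolute values of the Hessian eigenvalues, so negative-curvature directions are
descended rather than ascended. A minimal dense sketch follows (the paper itself works with a low-rank approximation
rather than the full Hessian):

```python
import numpy as np

def saddle_free_newton_step(w, g, H, damping=1e-3):
    """Newton-like step using |H|^{-1} g instead of H^{-1} g.

    Sketch of the idea: with H = Q diag(lam) Q^T, replace each eigenvalue
    by its absolute value so that directions of negative curvature are
    descended rather than ascended at saddle points.
    """
    lam, Q = np.linalg.eigh(H)                # eigendecomposition of the Hessian
    scale = 1.0 / (np.abs(lam) + damping)     # damped |lambda|^{-1}
    dw = -Q @ (scale * (Q.T @ g))             # -|H|^{-1} g
    return w + dw
```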
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Despite the widespread practical success of deep learning methods, our
theoretical understanding of the dynamics of learning in deep neural networks
remains quite sparse. We attempt to bridge the gap between the theory and
practice of deep learning by systematically analyzing learning dynamics for the
restricted case of deep linear neural networks. Despite the linearity of their
input-output map, such networks have nonlinear gradient descent dynamics on
weights that change with the addition of each new hidden layer. We show that
deep linear networks exhibit nonlinear learning phenomena similar to those seen
in simulations of nonlinear networks, including long plateaus followed by rapid
transitions to lower error solutions, and faster convergence from greedy
unsupervised pretraining initial conditions than from random initial
conditions. We provide an analytical description of these phenomena by finding
new exact solutions to the nonlinear dynamics of deep learning. Our theoretical
analysis also reveals the surprising finding that as the depth of a network
approaches infinity, learning speed can nevertheless remain finite: for a
special class of initial conditions on the weights, very deep networks incur
only a finite, depth independent, delay in learning speed relative to shallow
networks. We show that, under certain conditions on the training data,
unsupervised pretraining can find this special class of initial conditions,
while scaled random Gaussian initializations cannot. We further exhibit a new
class of random orthogonal initial conditions on weights that, like
unsupervised pre-training, enjoys depth independent learning times. We further
show that these initial conditions also lead to faithful propagation of
gradients even in deep nonlinear networks, as long as they operate in a special
regime known as the edge of chaos.
Comment: Submission to ICLR 2014. Revised based on reviewer feedback.
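For illustration, one standard way to draw a random orthogonal weight matrix of the kind referred to above is to keep
the Q factor of a Gaussian matrix; this is a sketch of that generic construction, not necessarily the paper's exact
scaled recipe.

```python
import numpy as np

def random_orthogonal(n_out, n_in, gain=1.0, seed=0):
    """Random (semi-)orthogonal weight matrix of shape (n_out, n_in).

    Generic construction: QR-decompose a Gaussian matrix and keep Q,
    fixing the sign ambiguity so the result is uniformly distributed.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(n_out, n_in), min(n_out, n_in)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))                  # resolve the sign ambiguity of QR
    return gain * (q if n_out >= n_in else q.T)
```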
Neural networks in geophysical applications
Neural networks are increasingly popular in geophysics.
Because they are universal approximators, these
tools can approximate any continuous function with
arbitrary precision. Hence, they may yield important
contributions to finding solutions to a variety of geophysical applications.
However, knowledge of many methods and techniques
recently developed to increase the performance
and to facilitate the use of neural networks does not seem
to be widespread in the geophysical community. Therefore,
the power of these tools has not yet been explored to
their full extent. In this paper, techniques are described
for faster training, better overall performance (i.e., generalization), and the automatic estimation of network size
and architecture.
Orthogonal SVD Covariance Conditioning and Latent Disentanglement
Inserting an SVD meta-layer into neural networks is prone to make the
covariance ill-conditioned, which could harm the model's training
stability and generalization ability. In this paper, we systematically study
how to improve the covariance conditioning by enforcing orthogonality to the
Pre-SVD layer. Existing orthogonal treatments of the weights are investigated
first. However, while these techniques can improve the conditioning, they would
hurt the performance. To avoid this side effect, we propose the Nearest
Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR). The effectiveness of
our methods is validated in two applications: decorrelated Batch Normalization
(BN) and Global Covariance Pooling (GCP). Extensive experiments on visual
recognition demonstrate that our methods can simultaneously improve covariance
conditioning and generalization. Combining them with orthogonal weight treatments can
further boost the performance. Moreover, we show that our orthogonality
techniques can benefit generative models for better latent disentanglement
through a series of experiments on various benchmarks. Code is available at:
\href{https://github.com/KingJamesSong/OrthoImproveCond}{https://github.com/KingJamesSong/OrthoImproveCond}.
Comment: Accepted by IEEE T-PAMI. arXiv admin note: substantial text overlap with arXiv:2207.0211
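As an illustration of the nearest-orthogonal idea behind NOG (not necessarily the paper's exact procedure for the
Pre-SVD layer), the closest orthogonal matrix to a gradient G in the Frobenius norm is U V^T from its SVD:

```python
import numpy as np

def nearest_orthogonal(G):
    """Project G onto the nearest (semi-)orthogonal matrix in Frobenius norm.

    With G = U S V^T, the closest orthogonal matrix is U V^T,
    i.e. the singular values are replaced by ones.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt
```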