5,784 research outputs found
Adaptive Regularization of Neural Networks Using Conjugate Gradient
Recently we suggested a regularization scheme which iteratively adapts regularization parameters by minimizing validation error using simple gradient descent. In this contribution we present an improved algorithm based on the conjugate gradient technique. Numerical experiments with feed-forward neural networks successfully demonstrate improved generalization ability and lower computational cost. 1. INTRODUCTION Neural networks are flexible tools for regression, timeseries modeling and pattern recognition which find expression in universal approximation theorems [6]. The risk of over-fitting on noisy data is of major concern in neural network design, as exemplified by the bias-variance dilemma, see e.g., [5]. Using regularization serves two purposes: first, it remedies numerical instabilities during training by imposing smoothness on the cost function; secondly, regularization is a tool for reducing variance by introducing extra bias. The overall goal is to minimize the generalization..
Small steps and giant leaps: Minimal Newton solvers for Deep Learning
We propose a fast second-order method that can be used as a drop-in
replacement for current deep learning solvers. Compared to stochastic gradient
descent (SGD), it only requires two additional forward-mode automatic
differentiation operations per iteration, which has a computational cost
comparable to two standard forward passes and is easy to implement. Our method
addresses long-standing issues with current second-order solvers, which invert
an approximate Hessian matrix every iteration exactly or by conjugate-gradient
methods, a procedure that is both costly and sensitive to noise. Instead, we
propose to keep a single estimate of the gradient projected by the inverse
Hessian matrix, and update it once per iteration. This estimate has the same
size and is similar to the momentum variable that is commonly used in SGD. No
estimate of the Hessian is maintained. We first validate our method, called
CurveBall, on small problems with known closed-form solutions (noisy Rosenbrock
function and degenerate 2-layer linear networks), where current deep learning
solvers seem to struggle. We then train several large models on CIFAR and
ImageNet, including ResNet and VGG-f networks, where we demonstrate faster
convergence with no hyperparameter tuning. Code is available
Adaptive Momentum for Neural Network Optimization
In this thesis, we develop a novel and efficient algorithm for optimizing neural networks inspired by a recently proposed geodesic optimization algorithm. Our algorithm, which we call Stochastic Geodesic Optimization (SGeO), utilizes an adaptive coefficient on top of Polyaks Heavy Ball method effectively controlling the amount of weight put on the previous update to the parameters based on the change of direction in the optimization path. Experimental results on strongly convex functions with Lipschitz gradients and deep Autoencoder benchmarks show that SGeO reaches lower errors than established first-order methods and competes well with lower or similar errors to a recent second-order method called K-FAC (Kronecker-Factored Approximate Curvature). We also incorporate Nesterov style lookahead gradient into our algorithm (SGeO-N) and observe notable improvements. We believe that our research will open up new directions for high-dimensional neural network optimization where combining the efficiency of first-order methods and the effectiveness of second-order methods proves a promising avenue to explore
Stochastic Training of Neural Networks via Successive Convex Approximations
This paper proposes a new family of algorithms for training neural networks
(NNs). These are based on recent developments in the field of non-convex
optimization, going under the general name of successive convex approximation
(SCA) techniques. The basic idea is to iteratively replace the original
(non-convex, highly dimensional) learning problem with a sequence of (strongly
convex) approximations, which are both accurate and simple to optimize.
Differently from similar ideas (e.g., quasi-Newton algorithms), the
approximations can be constructed using only first-order information of the
neural network function, in a stochastic fashion, while exploiting the overall
structure of the learning problem for a faster convergence. We discuss several
use cases, based on different choices for the loss function (e.g., squared loss
and cross-entropy loss), and for the regularization of the NN's weights. We
experiment on several medium-sized benchmark problems, and on a large-scale
dataset involving simulated physical data. The results show how the algorithm
outperforms state-of-the-art techniques, providing faster convergence to a
better minimum. Additionally, we show how the algorithm can be easily
parallelized over multiple computational units without hindering its
performance. In particular, each computational unit can optimize a tailored
surrogate function defined on a randomly assigned subset of the input
variables, whose dimension can be selected depending entirely on the available
computational power.Comment: Preprint submitted to IEEE Transactions on Neural Networks and
Learning System
- …