1,135 research outputs found
Let's Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence
Block coordinate descent (BCD) methods are widely-used for large-scale
numerical optimization because of their cheap iteration costs, low memory
requirements, amenability to parallelization, and ability to exploit problem
structure. Three main algorithmic choices influence the performance of BCD
methods: the block partitioning strategy, the block selection rule, and the
block update rule. In this paper we explore all three of these building blocks
and propose variations for each that can lead to significantly faster BCD
methods. We (i) propose new greedy block-selection strategies that guarantee
more progress per iteration than the Gauss-Southwell rule; (ii) explore
practical issues like how to implement the new rules when using "variable"
blocks; (iii) explore the use of message-passing to compute matrix or Newton
updates efficiently on huge blocks for problems with a sparse dependency
between variables; and (iv) consider optimal active manifold identification,
which leads to bounds on the "active set complexity" of BCD methods and leads
to superlinear convergence for certain problems with sparse solutions (and in
some cases finite termination at an optimal solution). We support all of our
findings with numerical results for the classic machine learning problems of
least squares, logistic regression, multi-class logistic regression, label
propagation, and L1-regularization
TMB: Automatic Differentiation and Laplace Approximation
TMB is an open source R package that enables quick implementation of complex
nonlinear random effect (latent variable) models in a manner similar to the
established AD Model Builder package (ADMB, admb-project.org). In addition, it
offers easy access to parallel computations. The user defines the joint
likelihood for the data and the random effects as a C++ template function,
while all the other operations are done in R; e.g., reading in the data. The
package evaluates and maximizes the Laplace approximation of the marginal
likelihood where the random effects are automatically integrated out. This
approximation, and its derivatives, are obtained using automatic
differentiation (up to order three) of the joint likelihood. The computations
are designed to be fast for problems with many random effects (~10^6) and
parameters (~10^3). Computation times using ADMB and TMB are compared on a
suite of examples ranging from simple models to large spatial models where the
random effects are a Gaussian random field. Speedups ranging from 1.5 to about
100 are obtained with increasing gains for large problems. The package and
examples are available at http://tmb-project.org
Backpropagation Beyond the Gradient
Automatic differentiation is a key enabler of deep learning: previously, practitioners were limited to models
for which they could manually compute derivatives. Now, they can create sophisticated models with almost
no restrictions and train them using first-order, i. e. gradient, information. Popular libraries like PyTorch
and TensorFlow compute this gradient efficiently, automatically, and conveniently with a single line of
code. Under the hood, reverse-mode automatic differentiation, or gradient backpropagation, powers the
gradient computation in these libraries. Their entire design centers around gradient backpropagation.
These frameworks are specialized around one specific task—computing the average gradient in a mini-batch.
This specialization often complicates the extraction of other information like higher-order statistical moments
of the gradient, or higher-order derivatives like the Hessian. It limits practitioners and researchers to methods
that rely on the gradient. Arguably, this hampers the field from exploring the potential of higher-order
information and there is evidence that focusing solely on the gradient has not lead to significant recent
advances in deep learning optimization.
To advance algorithmic research and inspire novel ideas, information beyond the batch-averaged gradient
must be made available at the same level of computational efficiency, automation, and convenience.
This thesis presents approaches to simplify experimentation with rich information beyond the gradient
by making it more readily accessible. We present an implementation of these ideas as an extension to the
backpropagation procedure in PyTorch. Using this newly accessible information, we demonstrate possible use
cases by (i) showing how it can inform our understanding of neural network training by building a diagnostic
tool, and (ii) enabling novel methods to efficiently compute and approximate curvature information.
First, we extend gradient backpropagation for sequential feedforward models to Hessian backpropagation
which enables computing approximate per-layer curvature. This perspective unifies recently proposed block-
diagonal curvature approximations. Like gradient backpropagation, the computation of these second-order
derivatives is modular, and therefore simple to automate and extend to new operations.
Based on the insight that rich information beyond the gradient can be computed efficiently and at the
same time, we extend the backpropagation in PyTorch with the BackPACK library. It provides efficient and
convenient access to statistical moments of the gradient and approximate curvature information, often at a
small overhead compared to computing just the gradient.
Next, we showcase the utility of such information to better understand neural network training. We build
the Cockpit library that visualizes what is happening inside the model during training through various
instruments that rely on BackPACK’s statistics. We show how Cockpit provides a meaningful statistical
summary report to the deep learning engineer to identify bugs in their machine learning pipeline, guide
hyperparameter tuning, and study deep learning phenomena.
Finally, we use BackPACK’s extended automatic differentiation functionality to develop ViViT, an approach
to efficiently compute curvature information, in particular curvature noise. It uses the low-rank structure
of the generalized Gauss-Newton approximation to the Hessian and addresses shortcomings in existing
curvature approximations. Through monitoring curvature noise, we demonstrate how ViViT’s information
helps in understanding challenges to make second-order optimization methods work in practice.
This work develops new tools to experiment more easily with higher-order information in complex deep
learning models. These tools have impacted works on Bayesian applications with Laplace approximations,
out-of-distribution generalization, differential privacy, and the design of automatic differentia-
tion systems. They constitute one important step towards developing and establishing more efficient deep
learning algorithms
Expectation Propagation for Nonlinear Inverse Problems -- with an Application to Electrical Impedance Tomography
In this paper, we study a fast approximate inference method based on
expectation propagation for exploring the posterior probability distribution
arising from the Bayesian formulation of nonlinear inverse problems. It is
capable of efficiently delivering reliable estimates of the posterior mean and
covariance, thereby providing an inverse solution together with quantified
uncertainties. Some theoretical properties of the iterative algorithm are
discussed, and the efficient implementation for an important class of problems
of projection type is described. The method is illustrated with one typical
nonlinear inverse problem, electrical impedance tomography with complete
electrode model, under sparsity constraints. Numerical results for real
experimental data are presented, and compared with that by Markov chain Monte
Carlo. The results indicate that the method is accurate and computationally
very efficient.Comment: Journal of Computational Physics, to appea
A second derivative SQP method: local convergence
In [19], we gave global convergence results for a second-derivative SQP method for minimizing the exact â„“1-merit function for a fixed value of the penalty parameter. To establish this result, we used the properties of the so-called Cauchy step, which was itself computed from the so-called predictor step. In addition, we allowed for the computation of a variety of (optional) SQP steps that were intended to improve the efficiency of the algorithm. \ud
\ud
Although we established global convergence of the algorithm, we did not discuss certain aspects that are critical when developing software capable of solving general optimization problems. In particular, we must have strategies for updating the penalty parameter and better techniques for defining the positive-definite matrix Bk used in computing the predictor step. In this paper we address both of these issues. We consider two techniques for defining the positive-definite matrix Bk—a simple diagonal approximation and a more sophisticated limited-memory BFGS update. We also analyze a strategy for updating the penalty paramter based on approximately minimizing the ℓ1-penalty function over a sequence of increasing values of the penalty parameter.\ud
\ud
Algorithms based on exact penalty functions have certain desirable properties. To be practical, however, these algorithms must be guaranteed to avoid the so-called Maratos effect. We show that a nonmonotone varient of our algorithm avoids this phenomenon and, therefore, results in asymptotically superlinear local convergence; this is verified by preliminary numerical results on the Hock and Shittkowski test set
- …