Optimization Methods for Inverse Problems
Optimization plays an important role in solving many inverse problems.
Indeed, the task of inversion often either involves or is fully cast as a
solution of an optimization problem. In this light, the mere non-linear,
non-convex, and large-scale nature of many of these inversions gives rise to
some very challenging optimization problems. The inverse problem community has
long been developing various techniques for solving such optimization tasks.
However, other, seemingly disjoint communities, such as that of machine
learning, have developed, almost in parallel, interesting alternative methods
which might have stayed under the radar of the inverse problem community. In
this survey, we aim to change that. In doing so, we first discuss current
state-of-the-art optimization methods widely used in inverse problems. We then
survey recent related advances in addressing similar challenges in problems
faced by the machine learning community, and discuss their potential advantages
for solving inverse problems. By highlighting the similarities among the
optimization challenges faced by the inverse problem and the machine learning
communities, we hope that this survey can serve as a bridge in bringing
together these two communities and encourage cross-fertilization of ideas.
Comment: 13 pages
Training (Overparametrized) Neural Networks in Near-Linear Time
The slow convergence rate and pathological curvature issues of first-order
gradient methods for training deep neural networks have initiated an ongoing
effort to develop faster second-order optimization
algorithms beyond SGD, without compromising the generalization error. Despite
their remarkable convergence rate (independent of the training batch
size $n$), second-order algorithms incur a daunting slowdown in the
cost per iteration (inverting the Hessian
matrix of the loss function), which renders them impractical. Very recently,
this computational overhead was mitigated by the works of [ZMG19, CGH+19],
yielding an $O(mn^2)$-time second-order algorithm for training two-layer
overparametrized neural networks of polynomial width $m$.
We show how to speed up the algorithm of [CGH+19], achieving an
$\tilde{O}(mn)$-time backpropagation algorithm for training (mildly
overparametrized) ReLU networks, which is near-linear in the dimension ($mn$)
of the full gradient (Jacobian) matrix. The centerpiece of our algorithm is to
reformulate the Gauss-Newton iteration as an $\ell_2$-regression problem, and
then use a Fast-JL type dimension reduction to precondition the
underlying Gram matrix in time independent of $M$, which allows finding a
sufficiently good approximate solution via first-order
conjugate gradient. Our result provides a proof-of-concept that advanced
machinery from randomized linear algebra -- which led to recent breakthroughs
in convex optimization (ERM, LPs, Regression) -- can be
carried over to the realm of deep learning as well.
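
As a rough illustration of the regression view described in this abstract, the sketch below treats a Gauss-Newton step as the least-squares problem min_d ||J d + r||^2 and solves its normal equations with plain conjugate gradient. It is a minimal sketch under stated assumptions, not the paper's algorithm: the Fast-JL preconditioning step is omitted entirely, the matrices are small dense Python lists, and every function name here (`matvec`, `rmatvec`, `conjugate_gradient`, `gauss_newton_step`) is hypothetical.

```python
def matvec(A, x):
    """Return A @ x for a dense matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T @ y without forming the transpose."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_A, b, max_iter=50, tol=1e-12):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs <= tol:    # squared residual norm is small enough
            break
        Ap = apply_A(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def gauss_newton_step(J, residual):
    """Approximately minimize ||J d + residual||^2 over d.

    The normal equations (J^T J) d = -J^T residual are solved with
    conjugate gradient, using only products with J and J^T.
    """
    rhs = [-g for g in rmatvec(J, residual)]
    return conjugate_gradient(lambda v: rmatvec(J, matvec(J, v)), rhs)


# Linear model: residual(w) = J w - y, evaluated at w = 0.
J = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
step = gauss_newton_step(J, [-yi for yi in y])
```

For a linear model such as this one, a single Gauss-Newton step from the origin already lands on the least-squares solution. The solver touches the Jacobian only through matrix-vector products, which is the property a sketching-based preconditioner would build on.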
A Survey on Intelligent Iterative Methods for Solving Sparse Linear Algebraic Equations
Efficiently solving sparse linear algebraic equations is an important
research topic of numerical simulation. Commonly used approaches include direct
methods and iterative methods. Compared with the direct methods, the iterative
methods have lower computational complexity and memory consumption, and are
thus often used to solve large-scale sparse linear equations. However, there
are numerous iterative methods, parameters, and components that need to be
carefully chosen, and an inappropriate combination may lead to an
inefficient solution process in practice. With the development of deep
learning, intelligent iterative methods have become popular in recent years.
These methods automatically select a sufficiently good combination and optimize
the parameters and components in accordance with the properties of the input
matrix. This survey reviews these intelligent iterative methods. For clarity, we
divide our discussion into three aspects: a method aspect, a component
aspect, and a parameter aspect. Moreover, we summarize the existing work and
propose potential research directions that may deserve deeper investigation.
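
To make the three aspects concrete, the following sketch shows one hand-picked combination of the kind such methods would choose automatically: conjugate gradient as the method, a Jacobi (diagonal) preconditioner as the component, and a tolerance and iteration cap as the parameters. It is only an illustration under stated assumptions: the matrix is symmetric positive definite, rows are stored as `{column: value}` dictionaries, and the names `tridiagonal`, `sparse_matvec`, and `pcg` are hypothetical, not taken from the survey.

```python
def tridiagonal(diag, off):
    """Build a sparse symmetric tridiagonal matrix as a list of row dicts."""
    n = len(diag)
    rows = []
    for i in range(n):
        row = {i: diag[i]}
        if i > 0:
            row[i - 1] = off
        if i < n - 1:
            row[i + 1] = off
        rows.append(row)
    return rows


def sparse_matvec(rows, x):
    """Multiply a sparse matrix by a vector, touching only stored entries."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def pcg(rows, b, tol=1e-10, max_iter=200, use_jacobi=True):
    """Preconditioned conjugate gradient for a sparse SPD system.

    Method: conjugate gradient. Component: optional Jacobi preconditioner.
    Parameters: tol and max_iter. Returns (solution, iterations used).
    """
    n = len(b)
    inv_diag = [1.0 / rows[i][i] if use_jacobi else 1.0 for i in range(n)]
    x = [0.0] * n
    r = list(b)                                   # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for k in range(max_iter):
        if sum(ri * ri for ri in r) ** 0.5 <= tol:
            return x, k
        Ap = sparse_matvec(rows, p)
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Diagonally dominant test system whose exact solution is all ones.
A = tridiagonal([2.0, 3.0, 4.0, 5.0, 6.0], -1.0)
b = sparse_matvec(A, [1.0] * 5)
solution, iterations = pcg(A, b)
```

Switching `use_jacobi`, `tol`, or `max_iter` changes the cost of the solve without changing the answer it converges to, which is the kind of choice the surveyed methods aim to automate.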
Making Scalable Meta Learning Practical
Despite its flexibility to learn diverse inductive biases in machine learning
programs, meta learning (i.e., learning to learn) has long been recognized to
suffer from poor scalability due to its tremendous compute/memory costs,
training instability, and a lack of efficient distributed training support. In
this work, we focus on making scalable meta learning practical by introducing
SAMA, which combines advances in both implicit differentiation algorithms and
systems. Specifically, SAMA is designed to flexibly support a broad range of
adaptive optimizers in the base level of meta learning programs, while reducing
computational burden by avoiding explicit computation of second-order gradient
information, and exploiting efficient distributed training techniques
implemented for first-order gradients. Evaluated on multiple large-scale meta
learning benchmarks, SAMA showcases up to 1.7/4.8x increase in throughput and
2.0/3.8x decrease in memory consumption respectively on single-/multi-GPU
setups compared to other baseline meta learning algorithms. Furthermore, we
show that SAMA-based data optimization leads to consistent improvements in text
classification accuracy with BERT and RoBERTa large language models, and
achieves state-of-the-art results in both small- and large-scale data pruning
on image classification tasks, demonstrating the practical applicability of
scalable meta learning across language and vision domains.
Explaining the Adaptive Generalisation Gap
We conjecture that the inherent difference in generalisation between adaptive
and non-adaptive gradient methods stems from the increased estimation noise in
the flattest directions of the true loss surface. We demonstrate that typical
schedules used for adaptive methods (with low numerical stability or damping
constants) bias movement towards flat directions relative to
sharp directions, effectively amplifying the noise-to-signal ratio and harming
generalisation. We further demonstrate that the numerical stability/damping
constant used in these methods can be decomposed into a learning rate reduction
and linear shrinkage of the estimated curvature matrix. We then demonstrate
significant generalisation improvements by increasing the shrinkage
coefficient, closing the generalisation gap entirely in both Logistic
Regression and Deep Neural Network experiments. Finally, we show that other
popular modifications to adaptive methods, such as decoupled weight decay and
partial adaptivity, calibrate parameter updates to make better
use of sharper, more reliable directions.
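
One way to see the decomposition mentioned in this abstract is through a simple algebraic identity, sketched below under the assumption of a diagonal curvature estimate c and a damping constant delta. With kappa = 1 / (1 + delta), the damped step lr * g / (c + delta) equals (lr * kappa) * g / (kappa * c + (1 - kappa)), that is, a reduced learning rate combined with shrinkage of the curvature towards the identity. The function names and the numbers are illustrative only and are not taken from the paper.

```python
def damped_step(grad, curvature, lr, damping):
    """Per-coordinate step with damping added to a diagonal curvature estimate."""
    return [lr * g / (c + damping) for g, c in zip(grad, curvature)]


def shrinkage_step(grad, curvature, lr, damping):
    """The same step, written as learning-rate reduction plus linear shrinkage."""
    kappa = 1.0 / (1.0 + damping)      # learning-rate reduction factor
    shrunk = [kappa * c + (1.0 - kappa) for c in curvature]  # shrink towards 1
    return [(lr * kappa) * g / s for g, s in zip(grad, shrunk)]


def flat_to_sharp_ratio(curvature, damping):
    """How much larger the step multiplier is in the flattest direction
    than in the sharpest one, for equal gradient magnitudes."""
    return (max(curvature) + damping) / (min(curvature) + damping)


grad = [0.5, -0.2]
curvature = [100.0, 0.01]              # one sharp and one flat direction
step_a = damped_step(grad, curvature, lr=0.1, damping=1e-3)
step_b = shrinkage_step(grad, curvature, lr=0.1, damping=1e-3)

ratio_small_damping = flat_to_sharp_ratio(curvature, 1e-8)
ratio_large_damping = flat_to_sharp_ratio(curvature, 1.0)
```

The two step functions agree up to floating-point rounding, and the ratio shrinks as the damping grows, which matches the abstract's point that small damping constants bias movement towards flat directions.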
- …