Three Mechanisms of Weight Decay Regularization
Weight decay is one of the standard tricks in the neural network toolbox, but
the reasons for its regularization effect are poorly understood, and recent
results have cast doubt on the traditional interpretation in terms of L2
regularization. Literal weight decay has been shown to outperform L2
regularization for optimizers for which the two differ. We empirically investigate
weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a
variety of network architectures. We identify three distinct mechanisms by
which weight decay exerts a regularization effect, depending on the particular
optimization algorithm and architecture: (1) increasing the effective learning
rate, (2) approximately regularizing the input-output Jacobian norm, and (3)
reducing the effective damping coefficient for second-order optimization. Our
results provide insight into how to improve the regularization of neural
networks.
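As a concrete illustration of the distinction the abstract turns on, here is a minimal sketch of an L2-regularized update versus a literal (decoupled) weight-decay update; the plain-NumPy setting and function names are ours, not the paper's.

```python
import numpy as np

def l2_regularized_step(w, grad_loss, lr, lam):
    # L2 regularization: the penalty gradient lam * w is folded into the
    # loss gradient, so any preconditioning also rescales the penalty.
    return w - lr * (grad_loss(w) + lam * w)

def decoupled_weight_decay_step(w, grad_loss, lr, lam):
    # Literal weight decay: shrink the weights directly, independently
    # of how the optimizer scales the loss gradient.
    return (1.0 - lr * lam) * w - lr * grad_loss(w)
```

For plain SGD the two updates are algebraically identical; for optimizers that precondition the gradient, such as Adam or K-FAC, they differ, which is the regime the abstract refers to.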
Convolution based smooth approximations to the absolute value function with application to non-smooth regularization
We present new convolution based smooth approximations to the absolute value
function and apply them to construct gradient based algorithms such as the
nonlinear conjugate gradient scheme to obtain sparse, regularized solutions of
linear systems, a problem often tackled via iterative algorithms which
attack the corresponding non-smooth minimization problem directly. In contrast,
the approximations we propose allow us to replace the generalized non-smooth
sparsity inducing functional by a smooth approximation of which we can readily
compute gradients and Hessians. The resulting gradient based algorithms often
yield a good estimate for the sought solution in few iterations and can either
be used directly or to quickly warm start existing algorithms.
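As a sketch of the construction, assume the mollifier is a Gaussian (one convenient kernel; the paper develops a broader family): the convolution of |x| with a Gaussian has a closed form, and its derivative is simply an error function, so an l1-style penalty becomes smooth enough for gradient and Hessian computations. Names and the least-squares setting are illustrative.

```python
import numpy as np
from scipy.special import erf

def smooth_abs(x, sigma):
    # Closed form of |.| convolved with a Gaussian of width sigma.
    return (x * erf(x / (sigma * np.sqrt(2.0)))
            + sigma * np.sqrt(2.0 / np.pi) * np.exp(-x**2 / (2.0 * sigma**2)))

def smooth_abs_grad(x, sigma):
    # The Gaussian terms cancel when differentiating, leaving only erf.
    return erf(x / (sigma * np.sqrt(2.0)))

def objective(x, A, b, lam, sigma):
    # Smoothed l1-regularized least squares: 0.5||Ax-b||^2 + lam * sum smooth_abs.
    r = A @ x - b
    return 0.5 * r @ r + lam * smooth_abs(x, sigma).sum()

def gradient(x, A, b, lam, sigma):
    return A.T @ (A @ x - b) + lam * smooth_abs_grad(x, sigma)
```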
Accelerated PDE's for efficient solution of regularized inversion problems
We further develop a new framework, called PDE Acceleration, by applying it
to calculus of variations problems defined for general functions on
$\mathbb{R}^n$, obtaining efficient numerical algorithms to solve the resulting
class of optimization problems based on simple discretizations of their
corresponding accelerated PDE's. While the resulting family of PDE's and
numerical schemes are quite general, we give special attention to their
application for regularized inversion problems, with particular illustrative
examples on some popular image processing applications. The method is a
generalization of momentum, or accelerated, gradient descent to the PDE
setting. For elliptic problems, the descent equations are a nonlinear damped
wave equation, instead of a diffusion equation, and the acceleration is
realized as an improvement in the CFL condition from $\Delta t \sim \Delta x^2$
(for diffusion) to $\Delta t \sim \Delta x$ (for wave equations). We work
out several explicit as well as a semi-implicit numerical schemes, together
with their necessary stability constraints, and include recursive update
formulations which allow minimal-effort adaptation of existing gradient descent
PDE codes into the accelerated PDE framework. We explore these schemes more
carefully for a broad class of regularized inversion applications, with special
attention to quadratic, Beltrami, and Total Variation regularization, where the
accelerated PDE takes the form of a nonlinear wave equation. Experimental
examples demonstrate the application of these schemes for image denoising,
deblurring, and inpainting, including comparisons against Primal Dual, Split
Bregman, and ADMM algorithms.
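A rough sketch of the flow for the simplest (quadratic) regularizer, assuming unit grid spacing and illustrative parameters: the diffusion equation of gradient descent is replaced by a damped wave equation and integrated with a simple explicit scheme. The paper derives its schemes and stability constraints carefully; this is only a caricature.

```python
import numpy as np

def laplacian(u):
    # 5-point stencil with replicated (Neumann-like) boundaries, dx = 1.
    p = np.pad(u, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * u

def accelerated_denoise(f, lam=2.0, a=1.0, dt=0.4, n_iter=300):
    # Damped wave flow  u_tt + a u_t = lam * Laplacian(u) - (u - f)
    # in place of the diffusion flow  u_t = lam * Laplacian(u) - (u - f).
    # Wave-type CFL: dt scales like dx rather than dx^2.
    u, u_prev = f.copy(), f.copy()          # start at rest (u_t = 0)
    for _ in range(n_iter):
        force = lam * laplacian(u) - (u - f)
        u_next = ((2.0 + a * dt) * u - u_prev + dt**2 * force) / (1.0 + a * dt)
        u, u_prev = u_next, u
    return u
```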
Understanding Machine-learned Density Functionals
Kernel ridge regression is used to approximate the kinetic energy of
non-interacting fermions in a one-dimensional box as a functional of their
density. The properties of different kernels and methods of cross-validation
are explored, and highly accurate energies are achieved. Accurate {\em
constrained optimal densities} are found via a modified Euler-Lagrange
constrained minimization of the total energy. A projected gradient descent
algorithm is derived using local principal component analysis. Additionally, a
sparse grid representation of the density can be used without degrading the
performance of the methods. The implications for machine-learned density
functional approximations are discussed.
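The regression component is standard kernel ridge regression; a minimal sketch, assuming each training density is discretized on a grid and stacked as a row of X, with sigma and lam to be chosen by cross-validation as the abstract describes:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # Pairwise squared distances between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_fit(X, y, sigma, lam):
    # Solve (K + lam I) alpha = y for the dual weights.
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```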
A distributed block coordinate descent method for training l1 regularized linear classifiers
Distributed training of l1 regularized classifiers has received great
attention recently. Most existing methods approach this problem by taking steps
obtained from approximating the objective by a quadratic approximation that is
decoupled at the individual variable level. These methods are designed for
multicore and MPI platforms where communication costs are low. They are
inefficient on systems such as Hadoop running on a cluster of commodity
machines where communication costs are substantial. In this paper we design a
distributed algorithm for l1 regularization that is much better suited for
such systems than existing algorithms. A careful cost analysis is used to
support these points and motivate our method. The main idea of our algorithm is
to do block optimization of many variables on the actual objective function
within each computing node; this increases the computational cost per step so
that it is better matched with the communication cost, and decreases the number
of outer iterations, thus yielding a faster overall method. Distributed Gauss-Seidel and
Gauss-Southwell greedy schemes are used for choosing variables to update in
each step. We establish global convergence theory for our algorithm, including
Q-linear rate of convergence. Experiments on two benchmark problems show our
method to be much faster than existing methods.
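As a serial caricature of the greedy selection rule (the paper's method is distributed and updates whole blocks of variables against the actual regularized objective), Gauss-Southwell descent picks the coordinate with the largest-magnitude partial derivative at each step, whereas Gauss-Seidel simply sweeps the coordinates cyclically:

```python
import numpy as np

def gauss_southwell_descent(grad, x0, lr=0.1, n_iter=1000):
    # Greedy coordinate descent: update the coordinate whose partial
    # derivative is currently largest in magnitude.
    x = x0.copy()
    for _ in range(n_iter):
        g = grad(x)
        j = int(np.argmax(np.abs(g)))   # Gauss-Southwell selection
        x[j] -= lr * g[j]
    return x
```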
Don't relax: early stopping for convex regularization
We consider the problem of designing efficient regularization algorithms when
regularization is encoded by a (strongly) convex functional. Unlike classical
penalization methods based on a relaxation approach, we propose an iterative
method where regularization is achieved via early stopping. Our results show
that the proposed procedure achieves the same recovery accuracy as penalization
methods, while naturally integrating computational considerations. An empirical
analysis on a number of problems provides promising results with respect to the
state of the art.
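The classical special case is Landweber iteration for least squares, where the iteration count itself acts as the regularization parameter; a minimal sketch with an explicit validation set as the stopping rule (the paper's results cover general strongly convex regularizers):

```python
import numpy as np

def landweber_early_stopping(A, b, A_val, b_val, step, max_iter=500):
    # Gradient descent on 0.5 * ||A x - b||^2; early stopping replaces
    # an explicit penalty. Take step < 2 / ||A||_2^2 for stability.
    x = np.zeros(A.shape[1])
    best_x, best_err = x.copy(), np.inf
    for _ in range(max_iter):
        x = x - step * (A.T @ (A @ x - b))
        err = np.linalg.norm(A_val @ x - b_val)
        if err < best_err:               # track the best early-stopped iterate
            best_x, best_err = x.copy(), err
    return best_x
```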
Faster gradient descent and the efficient recovery of images
Much recent attention has been devoted to gradient descent algorithms where
the steepest descent step size is replaced by a similar one from a previous
iteration or gets updated only once every second step, thus forming a {\em
faster gradient descent method}. For unconstrained convex quadratic
optimization these methods can converge much faster than steepest descent. But
the context of interest here is application to certain ill-posed inverse
problems, where the steepest descent method is known to have a smoothing,
regularizing effect, and where a strict optimization solution is not necessary.
Specifically, in this paper we examine the effect of replacing steepest
descent by a faster gradient descent algorithm in the practical context of
image deblurring and denoising tasks. We also propose several highly efficient
schemes for carrying out these tasks independently of the step size selection,
as well as a scheme for the case where both blur and significant noise are
present.
In the above context there are situations where many steepest descent steps
are required, thus building slowness into the solution procedure. Our general
conclusion regarding gradient descent methods is that in such cases the faster
gradient descent methods offer substantial advantages. In other situations
where no such slowness buildup arises the steepest descent method can still be
very effective.
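One concrete instance of such a scheme, sketched for the convex quadratic model problem 0.5 x^T A x - b^T x: the exact steepest-descent step size is computed at every iteration but applied one iteration late. This "lagged" variant is meant only to illustrate the mechanism the abstract describes.

```python
import numpy as np

def lagged_steepest_descent(A, b, x0, n_iter=100):
    # Minimize 0.5 x^T A x - b^T x (A symmetric positive definite).
    x = x0.copy()
    g = A @ x - b
    alpha_prev = (g @ g) / (g @ (A @ g))   # ordinary SD step to start
    for _ in range(n_iter):
        g = A @ x - b
        alpha = (g @ g) / (g @ (A @ g))    # current steepest-descent step size
        x = x - alpha_prev * g             # ...but apply the lagged one
        alpha_prev = alpha
    return x
```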
Riemannian Dictionary Learning and Sparse Coding for Positive Definite Matrices
Data encoded as symmetric positive definite (SPD) matrices frequently arise
in many areas of computer vision and machine learning. While these matrices
form an open subset of the Euclidean space of symmetric matrices, viewing them
through the lens of non-Euclidean Riemannian geometry often turns out to be
better suited in capturing several desirable data properties. However,
formulating classical machine learning algorithms within such a geometry is
often non-trivial and computationally expensive. Inspired by the great success
of dictionary learning and sparse coding for vector-valued data, our goal in
this paper is to represent data in the form of SPD matrices as sparse conic
combinations of SPD atoms from a learned dictionary via a Riemannian geometric
approach. To that end, we formulate a novel Riemannian optimization objective
for dictionary learning and sparse coding in which the representation loss is
characterized via the affine invariant Riemannian metric. We also present a
computationally simple algorithm for optimizing our model. Experiments on
several computer vision datasets demonstrate superior classification and
retrieval performance using our approach when compared to sparse coding via
alternative non-Riemannian formulations.
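The representation loss rests on the affine-invariant Riemannian metric (AIRM) on the SPD manifold; a minimal sketch of the induced distance, which reduces to the log-eigenvalues of X^{-1/2} Y X^{-1/2}:

```python
import numpy as np

def spd_inv_sqrt(X):
    # X^{-1/2} for a symmetric positive definite matrix via eigh.
    w, V = np.linalg.eigh(X)
    return (V / np.sqrt(w)) @ V.T

def airm_distance(X, Y):
    # d(X, Y) = || logm(X^{-1/2} Y X^{-1/2}) ||_F, computed from the
    # eigenvalues of the (again SPD) whitened matrix.
    P = spd_inv_sqrt(X)
    w = np.linalg.eigvalsh(P @ Y @ P)
    return np.sqrt((np.log(w) ** 2).sum())
```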
Cubic Regularization with Momentum for Nonconvex Optimization
Momentum is a popular technique to accelerate the convergence in practical
training, and its impact on convergence guarantee has been well-studied for
first-order algorithms. However, such a successful acceleration technique has
not yet been proposed for second-order algorithms in nonconvex optimization. In
this paper, we apply the momentum scheme to cubic regularized (CR) Newton's
method and explore the potential for acceleration. Our numerical experiments on
various nonconvex optimization problems demonstrate that the momentum scheme
can substantially facilitate the convergence of cubic regularization, and
perform even better than Nesterov's acceleration scheme for CR.
Theoretically, we prove that CR under momentum achieves the best possible
convergence rate to a second-order stationary point for nonconvex optimization.
Moreover, we study the proposed algorithm for solving problems satisfying an
error bound condition and establish a local quadratic convergence rate. Then,
particularly for finite-sum problems, we show that the proposed algorithm can
allow computational inexactness that reduces the overall sample complexity
without degrading the convergence rate.
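A schematic sketch of the two ingredients, assuming the cubic subproblem is solved via an eigendecomposition and bisection on the shifted-Newton equation (ignoring the so-called hard case), with momentum as a plain heavy-ball extrapolation; the paper's actual scheme and its safeguards may differ:

```python
import numpy as np

def cubic_step(g, H, M, tol=1e-10):
    # Minimize g^T s + 0.5 s^T H s + (M/6) ||s||^3 exactly:
    # s = -(H + lam I)^{-1} g with lam = (M/2)||s||, lam >= max(0, -lam_min).
    w, V = np.linalg.eigh(H)
    gt = V.T @ g
    s_norm = lambda lam: np.sqrt(((gt / (w + lam)) ** 2).sum())
    lo = max(0.0, -w[0]) + 1e-12
    hi = lo + 1.0
    while hi - 0.5 * M * s_norm(hi) < 0.0:   # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:                     # bisection on lam - (M/2)||s(lam)||
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mid - 0.5 * M * s_norm(mid) < 0.0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return -(V @ (gt / (w + lam)))

def cr_with_momentum(grad, hess, x0, M=1.0, beta=0.3, n_iter=50):
    # Heavy-ball extrapolation wrapped around cubic-regularized Newton steps.
    x_prev = x0.copy()
    x = x0 + cubic_step(grad(x0), hess(x0), M)
    for _ in range(n_iter):
        y = x + beta * (x - x_prev)          # momentum extrapolation
        x_prev, x = x, y + cubic_step(grad(y), hess(y), M)
    return x
```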
Convex Optimization without Projection Steps
For the general problem of minimizing a convex function over a compact convex
domain, we will investigate a simple iterative approximation algorithm based on
the method by Frank & Wolfe 1956, that does not need projection steps in order
to stay inside the optimization domain. Instead of a projection step, the
linearized problem defined by a current subgradient is solved, which gives a
step direction that will naturally stay in the domain. Our framework
generalizes the sparse greedy algorithm of Frank & Wolfe and its primal-dual
analysis by Clarkson 2010 (and the low-rank SDP approach by Hazan 2008) to
arbitrary convex domains. We give a convergence proof guaranteeing
{\epsilon}-small duality gap after O(1/{\epsilon}) iterations.
The method allows us to understand the sparsity of approximate solutions for
any l1-regularized convex optimization problem (and for optimization over the
simplex), expressed as a function of the approximation quality. We obtain
matching upper and lower bounds of {\Theta}(1/{\epsilon}) for the sparsity for
l1-problems. The same bounds apply to low-rank semidefinite optimization with
bounded trace, showing that rank O(1/{\epsilon}) is best possible here as well.
As another application, we obtain sparse matrices of O(1/{\epsilon}) non-zero
entries as {\epsilon}-approximate solutions when optimizing any convex function
over a class of diagonally dominant symmetric matrices.
We show that our proposed first-order method also applies to nuclear norm and
max-norm matrix optimization problems. For nuclear norm regularized
optimization, such as matrix completion and low-rank recovery, we demonstrate
the practical efficiency and scalability of our algorithm for large matrix
problems, as e.g. the Netflix dataset. For general convex optimization over
bounded matrix max-norm, our algorithm is the first with a convergence
guarantee, to the best of our knowledge.
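A minimal sketch over the l1 ball, where the linear minimization oracle returns a signed, scaled coordinate vector, so no projection is ever needed; oracle and step size follow the standard Frank-Wolfe template.

```python
import numpy as np

def frank_wolfe_l1(grad, dim, radius, n_iter=200):
    # Minimize a convex function over {x : ||x||_1 <= radius} without
    # projections: each step moves toward the vertex minimizing the
    # linearization, i.e. -radius * sign(g_j) * e_j for the largest |g_j|.
    x = np.zeros(dim)
    for k in range(n_iter):
        g = grad(x)
        j = int(np.argmax(np.abs(g)))        # linear minimization oracle
        vertex = np.zeros(dim)
        vertex[j] = -radius * np.sign(g[j])
        gamma = 2.0 / (k + 2.0)              # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * vertex
    return x
```

Starting from zero, the iterate after k steps has at most k nonzero entries, which is exactly the sparsity-versus-accuracy trade-off quantified by the {\Theta}(1/{\epsilon}) bounds above.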