Three Mechanisms of Weight Decay Regularization
Weight decay is one of the standard tricks in the neural network toolbox, but
the reasons for its regularization effect are poorly understood, and recent
results have cast doubt on the traditional interpretation in terms of L2
regularization. Literal weight decay has been shown to outperform L2
regularization for optimizers for which the two differ. We empirically investigate
weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a
variety of network architectures. We identify three distinct mechanisms by
which weight decay exerts a regularization effect, depending on the particular
optimization algorithm and architecture: (1) increasing the effective learning
rate, (2) approximately regularizing the input-output Jacobian norm, and (3)
reducing the effective damping coefficient for second-order optimization. Our
results provide insight into how to improve the regularization of neural
networks.
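As a concrete illustration of the distinction the abstract turns on, here is a minimal sketch of an L2-regularized update versus a literal (decoupled) weight-decay update; the plain-NumPy setting and function names are ours, not the paper's.

```python
import numpy as np

def l2_regularized_step(w, grad_loss, lr, lam):
    # L2 regularization: the penalty gradient lam * w is folded into the
    # loss gradient, so any preconditioning also rescales the penalty.
    return w - lr * (grad_loss(w) + lam * w)

def decoupled_weight_decay_step(w, grad_loss, lr, lam):
    # Literal weight decay: shrink the weights directly, independently
    # of how the optimizer scales the loss gradient.
    return (1.0 - lr * lam) * w - lr * grad_loss(w)
```

For plain SGD the two updates are algebraically identical; for optimizers that precondition the gradient, such as Adam or K-FAC, they differ, which is the regime the abstract refers to.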
Convolution based smooth approximations to the absolute value function with application to non-smooth regularization
We present new convolution based smooth approximations to the absolute value
function and apply them to construct gradient based algorithms such as the
nonlinear conjugate gradient scheme to obtain sparse, regularized solutions of
linear systems, a problem often tackled via iterative algorithms which
attack the corresponding non-smooth minimization problem directly. In contrast,
the approximations we propose allow us to replace the generalized non-smooth
sparsity inducing functional by a smooth approximation of which we can readily
compute gradients and Hessians. The resulting gradient based algorithms often
yield a good estimate for the sought solution in few iterations and can either
be used directly or to quickly warm start existing algorithms.
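As a sketch of the construction, assume the mollifier is a Gaussian (one convenient kernel; the paper develops a broader family): the convolution of |x| with a Gaussian has a closed form, and its derivative is simply an error function, so an l1-style penalty becomes smooth enough for gradient and Hessian computations. Names and the least-squares setting are illustrative.

```python
import numpy as np
from scipy.special import erf

def smooth_abs(x, sigma):
    # Closed form of |.| convolved with a Gaussian of width sigma.
    return (x * erf(x / (sigma * np.sqrt(2.0)))
            + sigma * np.sqrt(2.0 / np.pi) * np.exp(-x**2 / (2.0 * sigma**2)))

def smooth_abs_grad(x, sigma):
    # The Gaussian terms cancel when differentiating, leaving only erf.
    return erf(x / (sigma * np.sqrt(2.0)))

def objective(x, A, b, lam, sigma):
    # Smoothed l1-regularized least squares: 0.5||Ax-b||^2 + lam * sum smooth_abs.
    r = A @ x - b
    return 0.5 * r @ r + lam * smooth_abs(x, sigma).sum()

def gradient(x, A, b, lam, sigma):
    return A.T @ (A @ x - b) + lam * smooth_abs_grad(x, sigma)
```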
Accelerated PDE's for efficient solution of regularized inversion problems
We further develop a new framework, called PDE Acceleration, by applying it
to calculus of variations problems defined for general functions on
$\mathbb{R}^n$, obtaining efficient numerical algorithms to solve the resulting
class of optimization problems based on simple discretizations of their
corresponding accelerated PDE's. While the resulting family of PDE's and
numerical schemes are quite general, we give special attention to their
application for regularized inversion problems, with particular illustrative
examples on some popular image processing applications. The method is a
generalization of momentum, or accelerated, gradient descent to the PDE
setting. For elliptic problems, the descent equations are a nonlinear damped
wave equation, instead of a diffusion equation, and the acceleration is
realized as an improvement in the CFL condition from $\Delta t \sim \Delta x^2$
(for diffusion) to $\Delta t \sim \Delta x$ (for wave equations). We work
out several explicit as well as a semi-implicit numerical schemes, together
with their necessary stability constraints, and include recursive update
formulations which allow minimal-effort adaptation of existing gradient descent
PDE codes into the accelerated PDE framework. We explore these schemes more
carefully for a broad class of regularized inversion applications, with special
attention to quadratic, Beltrami, and Total Variation regularization, where the
accelerated PDE takes the form of a nonlinear wave equation. Experimental
examples demonstrate the application of these schemes for image denoising,
deblurring, and inpainting, including comparisons against Primal Dual, Split
Bregman, and ADMM algorithms.
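A rough sketch of the flow for the simplest (quadratic) regularizer, assuming unit grid spacing and illustrative parameters: the diffusion equation of gradient descent is replaced by a damped wave equation and integrated with a simple explicit scheme. The paper derives its schemes and stability constraints carefully; this is only a caricature.

```python
import numpy as np

def laplacian(u):
    # 5-point stencil with replicated (Neumann-like) boundaries, dx = 1.
    p = np.pad(u, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * u

def accelerated_denoise(f, lam=2.0, a=1.0, dt=0.4, n_iter=300):
    # Damped wave flow  u_tt + a u_t = lam * Laplacian(u) - (u - f)
    # in place of the diffusion flow  u_t = lam * Laplacian(u) - (u - f).
    # Wave-type CFL: dt scales like dx rather than dx^2.
    u, u_prev = f.copy(), f.copy()          # start at rest (u_t = 0)
    for _ in range(n_iter):
        force = lam * laplacian(u) - (u - f)
        u_next = ((2.0 + a * dt) * u - u_prev + dt**2 * force) / (1.0 + a * dt)
        u, u_prev = u_next, u
    return u
```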
Understanding Machine-learned Density Functionals
Kernel ridge regression is used to approximate the kinetic energy of
non-interacting fermions in a one-dimensional box as a functional of their
density. The properties of different kernels and methods of cross-validation
are explored, and highly accurate energies are achieved. Accurate {\em
constrained optimal densities} are found via a modified Euler-Lagrange
constrained minimization of the total energy. A projected gradient descent
algorithm is derived using local principal component analysis. Additionally, a
sparse grid representation of the density can be used without degrading the
performance of the methods. The implications for machine-learned density
functional approximations are discussed.
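The regression component is standard kernel ridge regression; a minimal sketch, assuming each training density is discretized on a grid and stacked as a row of X, with sigma and lam to be chosen by cross-validation as the abstract describes:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # Pairwise squared distances between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def krr_fit(X, y, sigma, lam):
    # Solve (K + lam I) alpha = y for the dual weights.
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```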
A distributed block coordinate descent method for training l1 regularized linear classifiers
Distributed training of l1 regularized classifiers has received great
attention recently. Most existing methods approach this problem by taking steps
obtained from approximating the objective by a quadratic approximation that is
decoupled at the individual variable level. These methods are designed for
multicore and MPI platforms where communication costs are low. They are
inefficient on systems such as Hadoop running on a cluster of commodity
machines where communication costs are substantial. In this paper we design a
distributed algorithm for l1 regularization that is much better suited for
such systems than existing algorithms. A careful cost analysis is used to
support these points and motivate our method. The main idea of our algorithm is
to do block optimization of many variables on the actual objective function
within each computing node; this increases the computational cost per step so
that it is better matched with the communication cost, and decreases the number
of outer iterations, thus yielding a faster overall method. Distributed Gauss-Seidel and
Gauss-Southwell greedy schemes are used for choosing variables to update in
each step. We establish global convergence theory for our algorithm, including
Q-linear rate of convergence. Experiments on two benchmark problems show our
method to be much faster than existing methods.
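As a serial caricature of the greedy selection rule (the paper's method is distributed and updates whole blocks of variables against the actual regularized objective), Gauss-Southwell descent picks the coordinate with the largest-magnitude partial derivative at each step, whereas Gauss-Seidel simply sweeps the coordinates cyclically:

```python
import numpy as np

def gauss_southwell_descent(grad, x0, lr=0.1, n_iter=1000):
    # Greedy coordinate descent: update the coordinate whose partial
    # derivative is currently largest in magnitude.
    x = x0.copy()
    for _ in range(n_iter):
        g = grad(x)
        j = int(np.argmax(np.abs(g)))   # Gauss-Southwell selection
        x[j] -= lr * g[j]
    return x
```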
Don't relax: early stopping for convex regularization
We consider the problem of designing efficient regularization algorithms when
regularization is encoded by a (strongly) convex functional. Unlike classical
penalization methods based on a relaxation approach, we propose an iterative
method where regularization is achieved via early stopping. Our results show
that the proposed procedure achieves the same recovery accuracy as penalization
methods, while naturally integrating computational considerations. An empirical
analysis on a number of problems provides promising results with respect to the
state of the art.
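The classical special case is Landweber iteration for least squares, where the iteration count itself acts as the regularization parameter; a minimal sketch with an explicit validation set as the stopping rule (the paper's results cover general strongly convex regularizers):

```python
import numpy as np

def landweber_early_stopping(A, b, A_val, b_val, step, max_iter=500):
    # Gradient descent on 0.5 * ||A x - b||^2; early stopping replaces
    # an explicit penalty. Take step < 2 / ||A||_2^2 for stability.
    x = np.zeros(A.shape[1])
    best_x, best_err = x.copy(), np.inf
    for _ in range(max_iter):
        x = x - step * (A.T @ (A @ x - b))
        err = np.linalg.norm(A_val @ x - b_val)
        if err < best_err:               # track the best early-stopped iterate
            best_x, best_err = x.copy(), err
    return best_x
```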
Faster gradient descent and the efficient recovery of images
Much recent attention has been devoted to gradient descent algorithms where
the steepest descent step size is replaced by a similar one from a previous
iteration or gets updated only once every second step, thus forming a {\em
faster gradient descent method}. For unconstrained convex quadratic
optimization these methods can converge much faster than steepest descent. But
the context of interest here is application to certain ill-posed inverse
problems, where the steepest descent method is known to have a smoothing,
regularizing effect, and where a strict optimization solution is not necessary.
Specifically, in this paper we examine the effect of replacing steepest
descent by a faster gradient descent algorithm in the practical context of
image deblurring and denoising tasks. We also propose several highly efficient
schemes for carrying out these tasks independently of the step size selection,
as well as a scheme for the case where both blur and significant noise are
present.
In the above context there are situations where many steepest descent steps
are required, thus building slowness into the solution procedure. Our general
conclusion regarding gradient descent methods is that in such cases the faster
gradient descent methods offer substantial advantages. In other situations
where no such slowness buildup arises the steepest descent method can still be
very effective.
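One concrete instance of such a scheme, sketched for the convex quadratic model problem 0.5 x^T A x - b^T x: the exact steepest-descent step size is computed at every iteration but applied one iteration late. This "lagged" variant is meant only to illustrate the mechanism the abstract describes.

```python
import numpy as np

def lagged_steepest_descent(A, b, x0, n_iter=100):
    # Minimize 0.5 x^T A x - b^T x (A symmetric positive definite).
    x = x0.copy()
    g = A @ x - b
    alpha_prev = (g @ g) / (g @ (A @ g))   # ordinary SD step to start
    for _ in range(n_iter):
        g = A @ x - b
        alpha = (g @ g) / (g @ (A @ g))    # current steepest-descent step size
        x = x - alpha_prev * g             # ...but apply the lagged one
        alpha_prev = alpha
    return x
```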
Riemannian Dictionary Learning and Sparse Coding for Positive Definite Matrices
Data encoded as symmetric positive definite (SPD) matrices frequently arise
in many areas of computer vision and machine learning. While these matrices
form an open subset of the Euclidean space of symmetric matrices, viewing them
through the lens of non-Euclidean Riemannian geometry often turns out to be
better suited in capturing several desirable data properties. However,
formulating classical machine learning algorithms within such a geometry is
often non-trivial and computationally expensive. Inspired by the great success
of dictionary learning and sparse coding for vector-valued data, our goal in
this paper is to represent data in the form of SPD matrices as sparse conic
combinations of SPD atoms from a learned dictionary via a Riemannian geometric
approach. To that end, we formulate a novel Riemannian optimization objective
for dictionary learning and sparse coding in which the representation loss is
characterized via the affine invariant Riemannian metric. We also present a
computationally simple algorithm for optimizing our model. Experiments on
several computer vision datasets demonstrate superior classification and
retrieval performance using our approach when compared to sparse coding via
alternative non-Riemannian formulations.
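The representation loss rests on the affine-invariant Riemannian metric (AIRM) on the SPD manifold; a minimal sketch of the induced distance, which reduces to the log-eigenvalues of X^{-1/2} Y X^{-1/2}:

```python
import numpy as np

def spd_inv_sqrt(X):
    # X^{-1/2} for a symmetric positive definite matrix via eigh.
    w, V = np.linalg.eigh(X)
    return (V / np.sqrt(w)) @ V.T

def airm_distance(X, Y):
    # d(X, Y) = || logm(X^{-1/2} Y X^{-1/2}) ||_F, computed from the
    # eigenvalues of the (again SPD) whitened matrix.
    P = spd_inv_sqrt(X)
    w = np.linalg.eigvalsh(P @ Y @ P)
    return np.sqrt((np.log(w) ** 2).sum())
```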
Cubic Regularization with Momentum for Nonconvex Optimization
Momentum is a popular technique to accelerate the convergence in practical
training, and its impact on convergence guarantee has been well-studied for
first-order algorithms. However, such a successful acceleration technique has
not yet been proposed for second-order algorithms in nonconvex optimization. In
this paper, we apply the momentum scheme to cubic regularized (CR) Newton's
method and explore the potential for acceleration. Our numerical experiments on
various nonconvex optimization problems demonstrate that the momentum scheme
can substantially facilitate the convergence of cubic regularization, and
perform even better than Nesterov's acceleration scheme for CR.
Theoretically, we prove that CR under momentum achieves the best possible
convergence rate to a second-order stationary point for nonconvex optimization.
Moreover, we study the proposed algorithm for solving problems satisfying an
error bound condition and establish a local quadratic convergence rate. Then,
particularly for finite-sum problems, we show that the proposed algorithm can
allow computational inexactness that reduces the overall sample complexity
without degrading the convergence rate.
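A schematic sketch of the two ingredients, assuming the cubic subproblem is solved via an eigendecomposition and bisection on the shifted-Newton equation (ignoring the so-called hard case), with momentum as a plain heavy-ball extrapolation; the paper's actual scheme and its safeguards may differ:

```python
import numpy as np

def cubic_step(g, H, M, tol=1e-10):
    # Minimize g^T s + 0.5 s^T H s + (M/6) ||s||^3 exactly:
    # s = -(H + lam I)^{-1} g with lam = (M/2)||s||, lam >= max(0, -lam_min).
    w, V = np.linalg.eigh(H)
    gt = V.T @ g
    s_norm = lambda lam: np.sqrt(((gt / (w + lam)) ** 2).sum())
    lo = max(0.0, -w[0]) + 1e-12
    hi = lo + 1.0
    while hi - 0.5 * M * s_norm(hi) < 0.0:   # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:                     # bisection on lam - (M/2)||s(lam)||
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mid - 0.5 * M * s_norm(mid) < 0.0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return -(V @ (gt / (w + lam)))

def cr_with_momentum(grad, hess, x0, M=1.0, beta=0.3, n_iter=50):
    # Heavy-ball extrapolation wrapped around cubic-regularized Newton steps.
    x_prev = x0.copy()
    x = x0 + cubic_step(grad(x0), hess(x0), M)
    for _ in range(n_iter):
        y = x + beta * (x - x_prev)          # momentum extrapolation
        x_prev, x = x, y + cubic_step(grad(y), hess(y), M)
    return x
```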
Convex Optimization without Projection Steps
For the general problem of minimizing a convex function over a compact convex
domain, we will investigate a simple iterative approximation algorithm based on
the method by Frank & Wolfe 1956, that does not need projection steps in order
to stay inside the optimization domain. Instead of a projection step, the
linearized problem defined by a current subgradient is solved, which gives a
step direction that will naturally stay in the domain. Our framework
generalizes the sparse greedy algorithm of Frank & Wolfe and its primal-dual
analysis by Clarkson 2010 (and the low-rank SDP approach by Hazan 2008) to
arbitrary convex domains. We give a convergence proof guaranteeing
{\epsilon}-small duality gap after O(1/{\epsilon}) iterations.
The method allows us to understand the sparsity of approximate solutions for
any l1-regularized convex optimization problem (and for optimization over the
simplex), expressed as a function of the approximation quality. We obtain
matching upper and lower bounds of {\Theta}(1/{\epsilon}) for the sparsity for
l1-problems. The same bounds apply to low-rank semidefinite optimization with
bounded trace, showing that rank O(1/{\epsilon}) is best possible here as well.
As another application, we obtain sparse matrices of O(1/{\epsilon}) non-zero
entries as {\epsilon}-approximate solutions when optimizing any convex function
over a class of diagonally dominant symmetric matrices.
We show that our proposed first-order method also applies to nuclear norm and
max-norm matrix optimization problems. For nuclear norm regularized
optimization, such as matrix completion and low-rank recovery, we demonstrate
the practical efficiency and scalability of our algorithm for large matrix
problems, as e.g. the Netflix dataset. For general convex optimization over
bounded matrix max-norm, our algorithm is the first with a convergence
guarantee, to the best of our knowledge.
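A minimal sketch over the l1 ball, where the linear minimization oracle returns a signed, scaled coordinate vector, so no projection is ever needed; oracle and step size follow the standard Frank-Wolfe template.

```python
import numpy as np

def frank_wolfe_l1(grad, dim, radius, n_iter=200):
    # Minimize a convex function over {x : ||x||_1 <= radius} without
    # projections: each step moves toward the vertex minimizing the
    # linearization, i.e. -radius * sign(g_j) * e_j for the largest |g_j|.
    x = np.zeros(dim)
    for k in range(n_iter):
        g = grad(x)
        j = int(np.argmax(np.abs(g)))        # linear minimization oracle
        vertex = np.zeros(dim)
        vertex[j] = -radius * np.sign(g[j])
        gamma = 2.0 / (k + 2.0)              # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * vertex
    return x
```

Starting from zero, the iterate after k steps has at most k nonzero entries, which is exactly the sparsity-versus-accuracy trade-off quantified by the {\Theta}(1/{\epsilon}) bounds above.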