A Unified Approach to Adaptive Regularization in Online and Stochastic Optimization
We describe a framework for deriving and analyzing online optimization
algorithms that incorporate adaptive, data-dependent regularization, also
termed preconditioning. Such algorithms have been proven useful in stochastic
optimization by reshaping the gradients according to the geometry of the data.
Our framework captures and unifies much of the existing literature on adaptive
online methods, including the AdaGrad and Online Newton Step algorithms as well
as their diagonal versions. As a result, we obtain new convergence proofs for
these algorithms that are substantially simpler than previous analyses. Our
framework also exposes the rationale for the different preconditioned updates
used in common stochastic optimization methods.
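For concreteness, here is a minimal sketch of the diagonal AdaGrad update, one of the algorithms the framework unifies; the function name, step size, and toy problem are illustrative choices, not taken from the paper.

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal-AdaGrad update: precondition the gradient by the
    inverse square root of the accumulated squared gradients."""
    accum += grad ** 2                       # per-coordinate curvature proxy
    x -= lr * grad / (np.sqrt(accum) + eps)  # adaptively scaled step
    return x, accum

# Toy usage: minimize f(x) = 0.5 * x^T diag(D) x, an ill-conditioned quadratic.
D = np.array([100.0, 1.0])
x, accum = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    grad = D * x
    x, accum = adagrad_step(x, grad, accum)
```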
A unified view of entropy-regularized Markov decision processes
We propose a general framework for entropy-regularized average-reward
reinforcement learning in Markov decision processes (MDPs). Our approach is
based on extending the linear-programming formulation of policy optimization in
MDPs to accommodate convex regularization functions. Our key result is showing
that using the conditional entropy of the joint state-action distributions as
regularization yields a dual optimization problem closely resembling the
Bellman optimality equations. This result enables us to formalize a number of
state-of-the-art entropy-regularized reinforcement learning algorithms as
approximate variants of Mirror Descent or Dual Averaging, and thus to argue
about the convergence properties of these methods. In particular, we show that
the exact version of the TRPO algorithm of Schulman et al. (2015) actually
converges to the optimal policy, while the entropy-regularized policy gradient
methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally,
we illustrate empirically the effects of using various regularization
techniques on learning performance in a simple reinforcement learning setup.
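For intuition, the discounted-horizon analogue of the dual problem the abstract describes is the familiar "soft" Bellman optimality equation; this is a standard form with regularization temperature \eta, not the paper's average-reward statement:

```latex
V^{*}(s) \;=\; \eta \,\log \sum_{a} \exp\!\Big(\tfrac{1}{\eta}\big(r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\, V^{*}(s')\big)\Big)
```

As \eta \to 0 the log-sum-exp tends to the maximum over actions, recovering the unregularized Bellman optimality equation.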
Decomposition into Low-rank plus Additive Matrices for Background/Foreground Separation: A Review for a Comparative Evaluation with a Large-Scale Dataset
Recent research on problem formulations based on decomposition into low-rank
plus sparse matrices shows that this is a suitable framework for separating
moving objects from the background. The most representative problem formulation
is Robust Principal Component Analysis (RPCA) solved via Principal Component
Pursuit (PCP), which decomposes a data matrix into a low-rank matrix plus a
sparse matrix.
However, similar robust implicit or explicit decompositions can be made in the
following problem formulations: Robust Non-negative Matrix Factorization
(RNMF), Robust Matrix Completion (RMC), Robust Subspace Recovery (RSR), Robust
Subspace Tracking (RST) and Robust Low-Rank Minimization (RLRM). The main goal
of these similar problem formulations is to obtain explicitly or implicitly a
decomposition into low-rank matrix plus additive matrices. In this context,
this work aims to initiate a rigorous and comprehensive review of the similar
problem formulations in robust subspace learning and tracking based on
decomposition into low-rank plus additive matrices for testing and ranking
existing algorithms for background/foreground separation. For this, we first
provide a preliminary review of the recent developments in the different
problem formulations, which allows us to define a unified view that we call
Decomposition into Low-rank plus Additive Matrices (DLAM). Then, we carefully
examine each method in each robust subspace learning/tracking framework,
covering its decomposition, loss function, optimization problem, and solver.
Furthermore, we investigate whether incremental algorithms and real-time
implementations can be achieved for background/foreground separation. Finally,
experimental results on a large-scale dataset called Background Models
Challenge (BMC 2012) show the comparative performance of 32 different robust
subspace learning/tracking methods.
Comment: 121 pages, 5 figures, submitted to Computer Science Review.
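As a concrete instance of the DLAM family, the PCP objective min ||L||_* + lambda ||S||_1 subject to M = L + S admits a simple thresholding sketch. The toy solver below alternates proximal steps on a relaxed objective; it is for illustration only and is far slower than the ALM-based solvers the review benchmarks.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: prox of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def soft(X, tau):
    """Soft thresholding: prox of tau * (l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def rpca_toy(M, lam=None, mu=0.25, iters=200):
    """Alternate prox steps on L and S so that L + S tracks M."""
    lam = lam or 1.0 / np.sqrt(max(M.shape))
    L, S = np.zeros_like(M), np.zeros_like(M)
    for _ in range(iters):
        L = svt(M - S, mu)         # low-rank part (background)
        S = soft(M - L, mu * lam)  # sparse part (moving foreground)
    return L, S
```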
Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates
In this paper, we describe a phenomenon, which we named "super-convergence",
where neural networks can be trained an order of magnitude faster than with
standard training methods. The existence of super-convergence is relevant to
understanding why deep networks generalize well. One of the key elements of
super-convergence is training with one learning rate cycle and a large maximum
learning rate. A primary insight that allows super-convergence training is that
large learning rates regularize the training, hence requiring a reduction of
all other forms of regularization in order to preserve an optimal
regularization balance. We also derive a simplification of the Hessian Free
optimization method to compute an estimate of the optimal learning rate.
Experiments demonstrate super-convergence for the CIFAR-10/100, MNIST, and
ImageNet datasets, and for ResNet, Wide ResNet, DenseNet, and Inception
architectures. In
addition, we show that super-convergence provides a greater boost in
performance relative to standard training when the amount of labeled training
data is limited. The architectures and code to replicate the figures in this
paper are available at github.com/lnsmith54/super-convergence. See
http://www.fast.ai/2018/04/30/dawnbench-fastai/ for an application of
super-convergence to win the DAWNBench challenge (see
https://dawn.cs.stanford.edu/benchmark/).
Comment: This paper was significantly revised to show super-convergence as a
general fast training methodology.
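The schedule at the heart of super-convergence fits in a few lines. Here is a minimal sketch of a one-cycle learning-rate policy with placeholder bounds, omitting the final low-rate annihilation phase also used in practice.

```python
def one_cycle_lr(step, total_steps, lr_min=0.1, lr_max=3.0):
    """Linearly ramp the learning rate up to a large maximum for the
    first half of training, then linearly back down for the second half."""
    half = total_steps // 2
    if step < half:
        return lr_min + (lr_max - lr_min) * step / half
    return lr_max - (lr_max - lr_min) * (step - half) / half
```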
A Unified View of Regularized Dual Averaging and Mirror Descent with Implicit Updates
We study three families of online convex optimization algorithms:
follow-the-proximally-regularized-leader (FTRL-Proximal), regularized dual
averaging (RDA), and composite-objective mirror descent. We first prove
equivalence theorems that show all of these algorithms are instantiations of a
general FTRL update. This provides theoretical insight on previous experimental
observations. In particular, even though the FOBOS composite mirror descent
algorithm handles L1 regularization explicitly, it has been observed that RDA
is even more effective at producing sparsity. Our results demonstrate that
FOBOS uses subgradient approximations to the L1 penalty from previous rounds,
leading to less sparsity than RDA, which handles the cumulative penalty in
closed form. The FTRL-Proximal algorithm can be seen as a hybrid of these two,
and outperforms both on a large, real-world dataset.
Our second contribution is a unified analysis which produces regret bounds
that match (up to logarithmic terms) or improve the best previously known
bounds. This analysis also extends these algorithms in two important ways: we
support a more general type of composite objective and we analyze implicit
updates, which replace the subgradient approximation of the current loss
function with an exact optimization.
Comment: Extensively updated version of an earlier draft, with a new analysis
including a general treatment of composite objectives and experiments. Also
fixes a small bug in one of the proofs in the early version.
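The sparsity contrast described above is visible directly in the RDA update with an L1 penalty, where the cumulative penalty is applied in closed form; the constants below are illustrative, and beta_t = gamma * sqrt(t) is one standard choice of prox strength.

```python
import numpy as np

def rda_l1_update(grad_sum, t, lam=0.1, gamma=1.0):
    """RDA step with an L1 penalty handled in closed form: any coordinate
    whose *cumulative average* subgradient stays below lam is exactly zero,
    which is why RDA tends to produce sparser solutions than FOBOS."""
    g_bar = grad_sum / t                 # average of all past subgradients
    scale = t / (gamma * np.sqrt(t))     # inverse prox weight, beta_t = gamma*sqrt(t)
    return -scale * np.sign(g_bar) * np.maximum(np.abs(g_bar) - lam, 0.0)
```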
A Survey of Algorithms and Analysis for Adaptive Online Learning
We present tools for the analysis of Follow-The-Regularized-Leader (FTRL),
Dual Averaging, and Mirror Descent algorithms when the regularizer
(equivalently, prox-function or learning rate schedule) is chosen adaptively
based on the data. Adaptivity can be used to prove regret bounds that hold on
every round, and also allows for data-dependent regret bounds as in
AdaGrad-style algorithms (e.g., Online Gradient Descent with adaptive
per-coordinate learning rates). We present results from a large number of prior
works in a unified manner, using a modular and tight analysis that isolates the
key arguments in easily re-usable lemmas. This approach strengthens previously
known FTRL analysis techniques to produce bounds as tight as those achieved by
potential functions or primal-dual analysis. Further, we prove a general and
exact equivalence between an arbitrary adaptive Mirror Descent algorithm and a
corresponding FTRL update, which allows us to analyze any Mirror Descent
algorithm in the same framework. The key to bridging the gap between Dual
Averaging and Mirror Descent algorithms lies in an analysis of the
FTRL-Proximal algorithm family. Our regret bounds are proved in the most
general form, holding for arbitrary norms and non-smooth regularizers with
time-varying weights.
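The unifying object in this line of analysis is the FTRL update itself, which in standard notation (our rendering, not a quotation from the paper) reads

```latex
x_{t+1} \;=\; \operatorname*{arg\,min}_{x \in \mathcal{X}} \;\sum_{s=1}^{t} \langle g_s, x\rangle \;+\; r_{0:t}(x)
```

where the g_s are the observed (sub)gradients and r_{0:t} = \sum_{s=0}^{t} r_s is the cumulative, possibly data-dependent regularizer; choosing the r_s adaptively is what yields AdaGrad-style per-coordinate learning rates.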
Online Linear Optimization via Smoothing
We present a new optimization-theoretic approach to analyzing
Follow-the-Leader style algorithms, particularly in the setting where
perturbations are used as a tool for regularization. We show that adding a
strongly convex penalty function to the decision rule and adding stochastic
perturbations to data correspond to deterministic and stochastic smoothing
operations, respectively. We establish an equivalence between "Follow the
Regularized Leader" and "Follow the Perturbed Leader" up to the smoothness
properties. This intuition leads to a new generic analysis framework that
recovers and improves the previously known regret bounds for the class of
algorithms commonly known as Follow the Perturbed Leader.
Comment: COLT 2014.
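A minimal Follow-the-Perturbed-Leader sketch over a finite set of experts illustrates the stochastic smoothing the abstract refers to; the Gumbel perturbation here is one standard choice (under which the expected play matches exponential weights, by the Gumbel-max trick), not necessarily the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def ftpl_choice(cum_losses, eta=1.0):
    """Follow the Perturbed Leader: perturb the cumulative losses, then
    pick the leader. The noise smooths the otherwise unstable argmin."""
    noise = rng.gumbel(size=cum_losses.shape)
    return int(np.argmin(cum_losses - eta * noise))
```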
Extreme Tensoring for Low-Memory Preconditioning
State-of-the-art models are now trained with billions of parameters, reaching
hardware limits in terms of memory consumption. This has created a recent
demand for memory-efficient optimizers. To this end, we investigate the limits
and performance tradeoffs of memory-efficient adaptively preconditioned
gradient methods. We propose extreme tensoring for high-dimensional stochastic
optimization, showing that an optimizer needs very little memory to benefit
from adaptive preconditioning. Our technique applies to arbitrary models (not
necessarily with tensor-shaped parameters), and is accompanied by regret and
convergence guarantees, which shed light on the tradeoffs between
preconditioner quality and expressivity. On a large-scale NLP model, we reduce
the optimizer memory overhead by three orders of magnitude, without degrading
performance.
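A rough sketch in the spirit of factored second-moment methods conveys the memory saving, though the paper's estimator and its application to arbitrary (non-tensor-shaped) parameters differ: store per-axis accumulators and rebuild an approximate per-entry preconditioner on the fly.

```python
import numpy as np

def factored_precond_step(W, G, row_acc, col_acc, lr=0.1, eps=1e-8):
    """Memory-light adaptive step for an (m, n) parameter: accumulate
    squared gradients along each axis only (O(m + n) memory instead of
    O(m * n)), rebuild an approximate per-entry second moment as a rank-1
    outer product, and precondition the gradient with it."""
    row_acc += (G ** 2).sum(axis=1)    # shape (m,)
    col_acc += (G ** 2).sum(axis=0)    # shape (n,)
    v_hat = np.outer(row_acc, col_acc) / max(row_acc.sum(), eps)
    W -= lr * G / (np.sqrt(v_hat) + eps)
    return W, row_acc, col_acc
```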
Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces
In this paper, we set forth a new vision of reinforcement learning developed
by us over the past few years, one that yields mathematically rigorous
solutions to longstanding important questions that have remained unresolved:
(i) how to design reliable, convergent, and robust reinforcement learning
algorithms; (ii) how to guarantee that reinforcement learning satisfies
pre-specified "safety" guarantees and remains in a stable region of the
parameter space; (iii) how to design "off-policy" temporal difference learning
algorithms in a reliable and stable manner; and finally (iv) how to integrate
the study of reinforcement learning into the rich theory of stochastic
optimization. In this paper, we provide detailed answers to all these questions
using the powerful framework of proximal operators.
The key idea that emerges is the use of primal-dual spaces connected through
a Legendre transform. This allows temporal difference updates to occur in dual
spaces, which brings a variety of important technical advantages. The
Legendre transform elegantly generalizes past algorithms for solving
reinforcement learning problems, such as natural gradient methods, which we
show relate closely to the previously unconnected framework of mirror descent
methods. Equally importantly, proximal operator theory enables the systematic
development of operator splitting methods that show how to safely and reliably
decompose complex products of gradients that occur in recent variants of
gradient-based temporal difference learning. This key technical innovation
makes it possible to finally design "true" stochastic gradient methods for
reinforcement learning. Finally, Legendre transforms enable a variety of other
benefits, including modeling sparsity and domain geometry. Our work builds
extensively on recent work on the convergence of saddle-point algorithms, and
on the theory of monotone operators.
Comment: 121 pages.
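The two objects doing the work in this program are the proximal operator and the mirror map; in standard notation (our rendering):

```latex
\operatorname{prox}_{f}(x) \;=\; \operatorname*{arg\,min}_{u}\; f(u) + \tfrac{1}{2}\|u - x\|^{2},
\qquad
x_{t+1} \;=\; \nabla\psi^{*}\big(\nabla\psi(x_t) - \eta_t\, g_t\big)
```

Here \psi is a strongly convex Legendre function and \psi^{*} its conjugate; the mirror-descent step maps the iterate into the dual space via \nabla\psi, takes the gradient step there, and maps back via \nabla\psi^{*}.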
Decoupled Weight Decay Regularization
L$_2$ regularization and weight decay regularization are equivalent for
standard stochastic gradient descent (when rescaled by the learning rate), but
as we demonstrate this is \emph{not} the case for adaptive gradient algorithms,
such as Adam. While common implementations of these algorithms employ L$_2$
regularization (often calling it "weight decay" in a way that may be misleading
due to the inequivalence we expose), we propose a simple modification to recover
the original formulation of weight decay regularization by \emph{decoupling}
the weight decay from the optimization steps taken w.r.t. the loss function. We
provide empirical evidence that our proposed modification (i) decouples the
optimal choice of weight decay factor from the setting of the learning rate for
both standard SGD and Adam and (ii) substantially improves Adam's
generalization performance, allowing it to compete with SGD with momentum on
image classification datasets (on which it was previously typically
outperformed by the latter). Our proposed decoupled weight decay has already
been adopted by many researchers, and the community has implemented it in
TensorFlow and PyTorch; the complete source code for our experiments is
available at https://github.com/loshchil/AdamW-and-SGDW
Comment: Published as a conference paper at ICLR 2019.
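The decoupling fits in a few lines. The sketch below follows the well-known Adam/AdamW recipe in simplified form (no learning-rate schedule multiplier); it is a sketch, not the paper's reference implementation.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
              eps=1e-8, wd=1e-2, decoupled=True):
    """One Adam step with either L2-in-the-gradient or decoupled decay."""
    if not decoupled:
        grad = grad + wd * w          # L2: decay passes through the moments
    m = b1 * m + (1 - b1) * grad      # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2 # second-moment estimate
    m_hat = m / (1 - b1 ** t)         # bias correction (t >= 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w           # AdamW: decay bypasses the moments
    return w, m, v
```

In the L2 variant, the decay term is divided by sqrt(v_hat) like everything else, so heavily updated weights are decayed less; decoupling removes that interaction, which is the inequivalence the paper exposes.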