MARTHE: Scheduling the Learning Rate Via Online Hypergradients
We study the problem of fitting task-specific learning rate schedules from
the perspective of hyperparameter optimization, aiming at good generalization.
We describe the structure of the gradient of a validation error w.r.t. the
learning rate schedule -- the hypergradient. Based on this, we introduce
MARTHE, a novel online algorithm guided by cheap approximations of the
hypergradient that uses past information from the optimization trajectory to
simulate future behaviour. It interpolates between two recent techniques, RTHO
(Franceschi et al., 2017) and HD (Baydin et al. 2018), and is able to produce
learning rate schedules that are more stable leading to models that generalize
better.
Comment: IJCAI 2020. Larger images. Code available at
https://github.com/awslabs/adatun
Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints
In most practical settings and theoretical analyses, one assumes that a model
can be trained until convergence. However, the growing complexity of machine
learning datasets and models may violate such assumptions. Indeed, current
approaches for hyper-parameter tuning and neural architecture search tend to be
limited by practical resource constraints. Therefore, we introduce a formal
setting for studying training under the non-asymptotic, resource-constrained
regime, i.e., budgeted training. We analyze the following problem: "given a
dataset, algorithm, and fixed resource budget, what is the best achievable
performance?" We focus on the number of optimization iterations as the
representative resource. Under such a setting, we show that it is critical to
adjust the learning rate schedule according to the given budget. Among
budget-aware learning schedules, we find simple linear decay to be both robust
and high-performing. We support our claim through extensive experiments with
state-of-the-art models on ImageNet (image classification), Kinetics (video
classification), MS COCO (object detection and instance segmentation), and
Cityscapes (semantic segmentation). We also analyze our results and find that
the key to a good schedule is budgeted convergence, a phenomenon whereby the
gradient vanishes at the end of each allowed budget. We also revisit existing
approaches for fast convergence and show that budget-aware learning schedules
readily outperform such approaches under (the practical but under-explored)
budgeted training setting.
Comment: ICLR 2020. Project page with code is at
http://www.cs.cmu.edu/~mengtial/proj/budgetnn
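The linear decay schedule named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `linear_decay_lr`, the default `base_lr`, and the choice to decay all the way to zero at the end of the budget are assumptions made for the example.

```python
def linear_decay_lr(step: int, budget: int, base_lr: float = 0.1) -> float:
    """Learning rate at iteration `step` for a run limited to `budget` iterations.

    Assumption: the rate falls linearly from `base_lr` at step 0 to 0 at
    step `budget`. Steps outside [0, budget] are clamped to that range.
    """
    if budget <= 0:
        raise ValueError("budget must be a positive number of iterations")
    progress = min(max(step, 0), budget) / budget  # fraction of budget used
    return base_lr * (1.0 - progress)


if __name__ == "__main__":
    # The same schedule shape is stretched or compressed to fit each budget.
    for budget in (100, 400):
        checkpoints = (0, budget // 2, budget)
        print(budget, [linear_decay_lr(s, budget) for s in checkpoints])
```

Because the schedule is a function of the fraction of the budget consumed, changing the budget rescales the whole curve rather than truncating it.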
The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares
Minimax optimal convergence rates for classes of stochastic convex
optimization problems are well characterized, where the majority of results
utilize iterate averaged stochastic gradient descent (SGD) with polynomially
decaying step sizes. In contrast, SGD's final iterate behavior has received
much less attention despite its widespread use in practice. Motivated by this
observation, this work provides a detailed study of the following question:
what rate is achievable using the final iterate of SGD for the streaming least
squares regression problem with and without strong convexity?
First, this work shows that even if the time horizon T (i.e. the number of
iterations SGD is run for) is known in advance, SGD's final iterate behavior
with any polynomially decaying learning rate scheme is highly sub-optimal
compared to the minimax rate (by a condition number factor in the strongly
convex case and a factor of $\sqrt{T}$ in the non-strongly convex case). In
contrast, this paper shows that Step Decay schedules, which cut the learning
rate by a constant factor every constant number of epochs (i.e., the learning
rate decays geometrically), offer significant improvements over any
polynomially decaying step sizes. In particular, the final iterate behavior
with a step decay schedule is off the minimax rate by only $\log$ factors (in
the condition number for strongly convex case, and in T for the non-strongly
convex case). Finally, in stark contrast to the known horizon case, this paper
shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is
poor (in that it queries iterates with highly sub-optimal function value
infinitely often, i.e. in a limsup sense) irrespective of the stepsizes
employed. These results demonstrate the subtlety in establishing optimal
learning rate schemes (for the final iterate) for stochastic gradient
procedures in fixed time horizon settings.
Comment: Appears in the proceedings of the Conference on Neural Information
Processing Systems (NeurIPS), 2019. 28 pages, 4 tables, 1 Algorithm, 7
figures
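The step decay schedule described in the abstract above (cut the learning rate by a constant factor every constant number of epochs) can be sketched as follows. The function name `step_decay_lr` and the default values for `base_lr`, `factor`, and `epochs_per_stage` are illustrative assumptions; the abstract does not give the constants analysed in the paper.

```python
def step_decay_lr(
    epoch: int,
    base_lr: float = 0.1,
    factor: float = 0.5,
    epochs_per_stage: int = 10,
) -> float:
    """Geometrically decaying learning rate for a 0-indexed `epoch`.

    Assumption: the rate is multiplied by `factor` once every
    `epochs_per_stage` epochs, so it is constant within each stage.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epochs_per_stage <= 0:
        raise ValueError("epochs_per_stage must be positive")
    if not 0.0 < factor < 1.0:
        raise ValueError("factor must lie strictly between 0 and 1")
    stage = epoch // epochs_per_stage  # number of cuts applied so far
    return base_lr * factor**stage


if __name__ == "__main__":
    for epoch in (0, 9, 10, 25):
        print(epoch, step_decay_lr(epoch))
```

For contrast, a polynomially decaying scheme would shrink the rate a little at every step instead of holding it constant within a stage.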
Facility Deployment Decisions through Warp Optimizaton of Regressed Gaussian Processes
A method for quickly determining deployment schedules that meet a given fuel
cycle demand is presented here. This algorithm is fast enough to perform in
situ within low-fidelity fuel cycle simulators. It uses Gaussian process
regression models to predict the production curve as a function of time and the
number of deployed facilities. Each of these predictions is measured against
the demand curve using the dynamic time warping distance. The minimum distance
deployment schedule is evaluated in a full fuel cycle simulation, whose
generated production curve then informs the model on the next optimization
iteration. The method converges within five to ten iterations to a distance
that is less than one percent of the total deployable production. A
representative once-through fuel cycle is used to demonstrate the methodology
for reactor deployment.
Comment: Number of Pages: 35, Number of Tables: 0, Number of Figures: 1
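The dynamic time warping distance used in the abstract above to compare a predicted production curve with a demand curve can be sketched with the textbook dynamic program. This is a generic illustration under stated assumptions: the absolute difference as local cost and the name `dtw_distance` are choices made here, and the paper may use a different variant.

```python
import math
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Assumption: the local cost between two samples is their absolute
    difference, and no warping-window constraint is applied.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    demand = [1.0, 2.0, 3.0]
    production = [1.0, 2.0, 2.0, 3.0]  # same shape, stretched in time
    print(dtw_distance(demand, production))
```

A curve that is merely stretched in time scores a distance of zero, which is why the measure suits comparing curves whose timing differs but whose shape matches.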
AutoLoss: Learning Discrete Schedules for Alternate Optimization
Many machine learning problems involve iteratively and alternately optimizing
different task objectives with respect to different sets of parameters.
Appropriately scheduling the optimization of a task objective or a set of
parameters is usually crucial to the quality of convergence. In this paper, we
present AutoLoss, a meta-learning framework that automatically learns and
determines the optimization schedule. AutoLoss provides a generic way to
represent and learn the discrete optimization schedule from metadata and allows
for a dynamic and data-driven schedule in ML problems that involve alternating
updates of different parameters or from different loss objectives. We apply
AutoLoss on four ML tasks: d-ary quadratic regression, classification using a
multi-layer perceptron (MLP), image generation using GANs, and multi-task
neural machine translation (NMT). We show that the AutoLoss controller is able
to capture the distribution of better optimization schedules that result in
higher quality of convergence on all four tasks. The trained AutoLoss
controller is generalizable -- it can guide and improve the learning of a new
task model with different specifications, or on different datasets.
Comment: 19-page manuscript. The first two authors contributed equally
Variational Optimization of Annealing Schedules
Annealed importance sampling (AIS) is a common algorithm to estimate
partition functions of useful stochastic models. One important problem for
obtaining accurate AIS estimates is the selection of an annealing schedule.
Conventionally, an annealing schedule is often determined heuristically or is
simply set as a linearly increasing sequence. In this paper, we propose an
algorithm for the optimal schedule by deriving a functional that dominates the
AIS estimation error and by numerically minimizing this functional. We
experimentally demonstrate that the proposed algorithm mostly outperforms
conventional scheduling schemes with large quantization numbers.
AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks
Training deep neural networks with Stochastic Gradient Descent, or its
variants, requires careful choice of both learning rate and batch size. While
smaller batch sizes generally converge in fewer training epochs, larger batch
sizes offer more parallelism and hence better computational efficiency. We have
developed a new training approach that, rather than statically choosing a
single batch size for all epochs, adaptively increases the batch size during
the training process. Our method delivers the convergence rate of small batch
sizes while achieving performance similar to large batch sizes. We analyse our
approach using the standard AlexNet, ResNet, and VGG networks operating on the
popular CIFAR-10, CIFAR-100, and ImageNet datasets. Our results demonstrate
that learning with adaptive batch sizes can improve performance by factors of
up to 6.25 on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1%
relative to training with fixed batch sizes.
Comment: 14 pages
On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants
We study optimization algorithms based on variance reduction for stochastic
gradient descent (SGD). Remarkable recent progress has been made in this
direction through development of algorithms like SAG, SVRG, SAGA. These
algorithms have been shown to outperform SGD, both theoretically and
empirically. However, asynchronous versions of these algorithms---a crucial
requirement for modern large-scale applications---have not been studied. We
bridge this gap by presenting a unifying framework for many variance reduction
techniques. Subsequently, we propose an asynchronous algorithm grounded in our
framework, and prove its fast convergence. An important consequence of our
general approach is that it yields asynchronous versions of variance reduction
algorithms such as SVRG and SAGA as a byproduct. Our method achieves near
linear speedup in sparse settings common to machine learning. We demonstrate
the empirical performance of our method through a concrete realization of
asynchronous SVRG.
A Block Coordinate Ascent Algorithm for Mean-Variance Optimization
Risk management in dynamic decision problems is a primary concern in many
fields, including financial investment, autonomous driving, and healthcare. The
mean-variance function is one of the most widely used objective functions in
risk management due to its simplicity and interpretability. Existing algorithms
for mean-variance optimization are based on multi-time-scale stochastic
approximation, whose learning rate schedules are often hard to tune and which
has only asymptotic convergence proofs. In this paper, we develop a model-free
policy search framework for mean-variance optimization with finite-sample error
bound analysis (to local optima). Our starting point is a reformulation of the
original mean-variance function with its Fenchel dual, from which we propose a
stochastic block coordinate ascent policy search algorithm. Both the asymptotic
convergence guarantee of the last iteration's solution and the convergence rate
of the randomly picked solution are provided, and their applicability is
demonstrated on several benchmark domains.
Comment: Accepted by NIPS 201
Integrating production scheduling and process control using latent variable dynamic models
Given their increasing participation in fast-changing markets, the
integration of scheduling and control is an important consideration in chemical
process operations. This generally involves computing optimal production
schedules using dynamic models, which is challenging due to the nonlinearity
and high-dimensionality of the models of chemical processes. In this paper, we
begin by observing that the intrinsic dimensionality of process dynamics (as
relevant to scheduling) is often much lower than the number of model state
and/or algebraic variables. We introduce a data mining approach to "learn"
closed-loop process dynamics on a low-dimensional, latent manifold. The
manifold dimensionality is selected based on a tradeoff between model accuracy
and complexity. After projecting process data, system identification and
optimal scheduling calculations can be performed in the low-dimensional,
latent-variable space. We apply these concepts to schedule an air separation
unit under time-varying electricity prices. We show that our approach reduces
the computational effort, while offering more detailed dynamic information
compared to previous related works.
Comment: Revised version