
    MARTHE: Scheduling the Learning Rate Via Online Hypergradients

    We study the problem of fitting task-specific learning rate schedules from the perspective of hyperparameter optimization, aiming at good generalization. We describe the structure of the gradient of a validation error w.r.t. the learning rate schedule -- the hypergradient. Based on this, we introduce MARTHE, a novel online algorithm that is guided by cheap approximations of the hypergradient and uses past information from the optimization trajectory to simulate future behaviour. It interpolates between two recent techniques, RTHO (Franceschi et al., 2017) and HD (Baydin et al., 2018), and produces learning rate schedules that are more stable, leading to models that generalize better.
    Comment: IJCAI 2020. Larger images. Code available at https://github.com/awslabs/adatun
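    To make the hypergradient idea concrete, below is a minimal sketch of an HD-style update (Baydin et al., 2018), one of the two methods MARTHE interpolates between: the learning rate itself is adjusted using the approximate gradient of the loss with respect to the learning rate, which reduces to the inner product of successive gradients. This is an illustrative stand-in, not the authors' released implementation; grad_fn, beta, and the quadratic toy problem are assumptions.

```python
import numpy as np

def hd_sgd(grad_fn, w, eta=0.01, beta=1e-4, steps=200):
    """SGD whose learning rate eta is itself adapted online.

    The hypergradient of the loss at step t w.r.t. eta is approximately
    -grad_t . grad_{t-1}, so gradient descent on eta raises the rate when
    successive gradients point in similar directions.
    """
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        eta = eta + beta * float(np.dot(g, g_prev))  # hypergradient step on eta
        w = w - eta * g                              # ordinary SGD step on w
        g_prev = g
    return w, eta

# Toy usage: minimise 0.5 * ||w||^2, whose gradient is simply w.
w_final, eta_final = hd_sgd(lambda w: w, np.ones(5))
print(w_final, eta_final)
```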

    Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints

    In most practical settings and theoretical analyses, one assumes that a model can be trained until convergence. However, the growing complexity of machine learning datasets and models may violate such assumptions. Indeed, current approaches for hyper-parameter tuning and neural architecture search tend to be limited by practical resource constraints. Therefore, we introduce a formal setting for studying training under the non-asymptotic, resource-constrained regime, i.e., budgeted training. We analyze the following problem: "given a dataset, algorithm, and fixed resource budget, what is the best achievable performance?" We focus on the number of optimization iterations as the representative resource. Under such a setting, we show that it is critical to adjust the learning rate schedule according to the given budget. Among budget-aware learning rate schedules, we find simple linear decay to be both robust and high-performing. We support our claim through extensive experiments with state-of-the-art models on ImageNet (image classification), Kinetics (video classification), MS COCO (object detection and instance segmentation), and Cityscapes (semantic segmentation). We also analyze our results and find that the key to a good schedule is budgeted convergence, a phenomenon whereby the gradient vanishes at the end of each allowed budget. We also revisit existing approaches for fast convergence and show that budget-aware learning rate schedules readily outperform them in the practical but under-explored budgeted training setting.
    Comment: ICLR 2020. Project page with code is at http://www.cs.cmu.edu/~mengtial/proj/budgetnn
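    As a minimal sketch of the budget-aware schedule the abstract advocates, the helper below decays the learning rate linearly from its base value to zero over the fixed iteration budget. The function name, base rate, and 90,000-iteration example are illustrative assumptions, not the authors' code.

```python
def linear_decay_lr(base_lr: float, step: int, budget: int) -> float:
    """Budget-aware linear decay: the rate falls linearly from base_lr at
    step 0 to zero at the last iteration of the allowed budget."""
    frac = min(max(step / float(budget), 0.0), 1.0)
    return base_lr * (1.0 - frac)

# e.g. a 90,000-iteration budget with base rate 0.1
print([linear_decay_lr(0.1, s, 90_000) for s in (0, 45_000, 90_000)])  # [0.1, 0.05, 0.0]
```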

    The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

    Minimax optimal convergence rates for classes of stochastic convex optimization problems are well characterized, and the majority of such results rely on iterate-averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, SGD's final iterate has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem, with and without strong convexity? First, this work shows that even if the time horizon T (i.e., the number of iterations SGD is run for) is known in advance, the final iterate of SGD with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly convex case and a factor of \sqrt{T} in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically), offer significant improvements over any polynomially decaying step sizes. In particular, the final iterate under a step decay schedule is off the minimax rate by only log factors (in the condition number for the strongly convex case, and in T for the non-strongly convex case). Finally, in stark contrast to the known-horizon case, this paper shows that the anytime (i.e., limiting) behavior of SGD's final iterate is poor, in that it queries iterates with highly sub-optimal function value infinitely often (i.e., in a limsup sense) irrespective of the step sizes employed. These results demonstrate the subtlety in establishing optimal learning rate schemes (for the final iterate) for stochastic gradient procedures in fixed time horizon settings.
    Comment: Appears in the proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019. 28 pages, 4 tables, 1 algorithm, 7 figures
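    A minimal sketch of the schedule family analysed above: cut the learning rate by a constant factor after every fixed block of epochs, so the rate decays geometrically. The drop factor 0.5 and 30-epoch interval are illustrative choices, not values from the paper.

```python
def step_decay_lr(base_lr: float, epoch: int, drop: float = 0.5, every: int = 30) -> float:
    """Geometric step decay: multiply the base rate by `drop` once per `every` epochs."""
    return base_lr * drop ** (epoch // every)

print([step_decay_lr(0.1, e) for e in (0, 29, 30, 60, 90)])  # 0.1, 0.1, 0.05, 0.025, 0.0125
```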

    Facility Deployment Decisions through Warp Optimization of Regressed Gaussian Processes

    A method for quickly determining deployment schedules that meet a given fuel cycle demand is presented here. The algorithm is fast enough to run in situ within low-fidelity fuel cycle simulators. It uses Gaussian process regression models to predict the production curve as a function of time and the number of deployed facilities. Each of these predictions is measured against the demand curve using the dynamic time warping distance. The minimum-distance deployment schedule is evaluated in a full fuel cycle simulation, whose generated production curve then informs the model on the next optimization iteration. The method converges within five to ten iterations to a distance that is less than one percent of the total deployable production. A representative once-through fuel cycle is used to demonstrate the methodology for reactor deployment.
    Comment: Number of Pages: 35, Number of Tables: 0, Number of Figures: 1
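    The two computational ingredients named above, a Gaussian process regression of the production curve and a dynamic time warping (DTW) distance to the demand curve, can be sketched as below. The use of scikit-learn's GaussianProcessRegressor, the toy production response, and the candidate deployment levels are illustrative assumptions, not the paper's fuel cycle models.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance between two curves."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Fit a GP to toy (time, facilities deployed) -> production samples ...
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 2))                        # columns: time, facilities deployed
y = 0.5 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0, 0.1, 40)  # toy production response
gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

# ... then score candidate deployment levels against a demand curve by DTW distance.
t = np.linspace(0, 10, 20)
demand = 2.0 * t                                            # toy demand curve
best = min(range(1, 6),
           key=lambda k: dtw_distance(gp.predict(np.c_[t, np.full_like(t, k)]), demand))
print("deployment level with minimum DTW distance to demand:", best)
```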

    AutoLoss: Learning Discrete Schedules for Alternate Optimization

    Many machine learning problems involve iteratively and alternately optimizing different task objectives with respect to different sets of parameters. Appropriately scheduling the optimization of a task objective or a set of parameters is usually crucial to the quality of convergence. In this paper, we present AutoLoss, a meta-learning framework that automatically learns and determines the optimization schedule. AutoLoss provides a generic way to represent and learn the discrete optimization schedule from metadata, and allows for a dynamic and data-driven schedule in ML problems that involve alternating updates of different parameters or of different loss objectives. We apply AutoLoss to four ML tasks: d-ary quadratic regression, classification using a multi-layer perceptron (MLP), image generation using GANs, and multi-task neural machine translation (NMT). We show that the AutoLoss controller is able to capture the distribution of better optimization schedules that result in higher quality of convergence on all four tasks. The trained AutoLoss controller is also generalizable -- it can guide and improve the learning of a new task model with different specifications, or on different datasets.
    Comment: 19-page manuscript. The first two authors contributed equally
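    As a rough illustration of the idea (not the authors' controller architecture), the sketch below implements a tiny softmax controller that, given features of the current optimization state, samples which objective or parameter block to update next and is trained with a REINFORCE-style signal. The feature vector and the reward are assumed to be supplied by the surrounding training loop.

```python
import numpy as np

rng = np.random.default_rng(1)

class ScheduleController:
    """Toy stand-in for a learned schedule controller: a softmax policy over
    which objective/parameter block to update next, trained with REINFORCE."""

    def __init__(self, n_features, n_tasks):
        self.W = np.zeros((n_tasks, n_features))

    def probs(self, feat):
        z = self.W @ feat
        e = np.exp(z - z.max())
        return e / e.sum()

    def sample(self, feat):
        p = self.probs(feat)
        return rng.choice(len(p), p=p), p

    def reinforce(self, feat, action, reward, lr=0.1):
        # Policy-gradient update: grad log p(action) = (onehot(action) - p) outer feat.
        p = self.probs(feat)
        grad = -p[:, None] * feat[None, :]
        grad[action] += feat
        self.W += lr * reward * grad

# Minimal usage: two blocks to alternate between; the (hypothetical) reward favours block 0.
ctrl = ScheduleController(n_features=2, n_tasks=2)
for _ in range(500):
    feat = np.array([1.0, rng.normal()])       # stand-in features of the training state
    a, _ = ctrl.sample(feat)
    ctrl.reinforce(feat, a, reward=1.0 if a == 0 else 0.0)
print(ctrl.probs(np.array([1.0, 0.0])))        # should now strongly favour block 0
```

    In an alternating-optimization setting, the surrounding loop would call sample() to pick which loss to optimize next (e.g. generator vs. discriminator steps in a GAN) and later call reinforce() with the observed improvement of a validation metric as the reward.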

    Variational Optimization of Annealing Schedules

    Annealed importance sampling (AIS) is a common algorithm for estimating partition functions of useful stochastic models. One important problem for obtaining accurate AIS estimates is the selection of an annealing schedule. Conventionally, the annealing schedule is determined heuristically or is simply set to a linearly increasing sequence. In this paper, we propose an algorithm for obtaining the optimal schedule by deriving a functional that dominates the AIS estimation error and by numerically minimizing this functional. We experimentally demonstrate that the proposed algorithm mostly outperforms conventional scheduling schemes with large quantization numbers.
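    To make the role of the schedule concrete, here is a minimal AIS sketch on a toy pair of 1-D Gaussians, comparing a linear schedule of inverse temperatures against a hand-skewed one. The skewed schedule only illustrates that the choice matters; it is not the variational optimization proposed in the paper, and the Gaussian widths, proposal step, and chain count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S0, S1 = 3.0, 0.5   # initial N(0, S0^2) and target N(0, S1^2), both zero-mean

def log_f(x, beta):
    """Unnormalised log density of the intermediate distribution
    f_beta = f0^(1 - beta) * f1^beta bridging the initial and target Gaussians."""
    return (1 - beta) * (-0.5 * x**2 / S0**2) + beta * (-0.5 * x**2 / S1**2)

def ais_log_z(betas, n_chains=2000, mh_step=0.5):
    """AIS estimate of log(Z1 / Z0) under the annealing schedule `betas`
    (an increasing sequence with beta_0 = 0 and beta_K = 1)."""
    x = rng.normal(0.0, S0, n_chains)                  # exact samples from f0
    logw = np.zeros(n_chains)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        logw += log_f(x, b) - log_f(x, b_prev)         # importance-weight increment
        prop = x + rng.normal(0.0, mh_step, n_chains)  # one Metropolis step targeting f_b
        accept = np.log(rng.uniform(size=n_chains)) < log_f(prop, b) - log_f(x, b)
        x = np.where(accept, prop, x)
    return float(np.log(np.mean(np.exp(logw - logw.max()))) + logw.max())

linear = np.linspace(0.0, 1.0, 50)
skewed = np.linspace(0.0, 1.0, 50) ** 3                # spends more steps near beta = 0
print("linear schedule:", ais_log_z(linear))
print("skewed schedule:", ais_log_z(skewed))
print("exact log(Z1/Z0):", np.log(S1 / S0))
```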

    AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

    Training deep neural networks with Stochastic Gradient Descent, or its variants, requires careful choice of both the learning rate and the batch size. While smaller batch sizes generally converge in fewer training epochs, larger batch sizes offer more parallelism and hence better computational efficiency. We have developed a new training approach that, rather than statically choosing a single batch size for all epochs, adaptively increases the batch size during the training process. Our method delivers the convergence rate of small batch sizes while achieving performance similar to large batch sizes. We analyse our approach using the standard AlexNet, ResNet, and VGG networks operating on the popular CIFAR-10, CIFAR-100, and ImageNet datasets. Our results demonstrate that learning with adaptive batch sizes can improve performance by factors of up to 6.25 on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1% relative to training with fixed batch sizes.
    Comment: 14 pages
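    A minimal sketch of the adaptive-batch-size idea described above: start with a small batch and enlarge it on a fixed schedule as training progresses, keeping the fast early convergence of small batches and the parallel efficiency of large ones. The doubling interval and cap are illustrative assumptions, not the paper's settings.

```python
def adaptive_batch_size(epoch: int, base_bs: int = 128, double_every: int = 30,
                        max_bs: int = 4096) -> int:
    """Adaptive batch size: small batches early in training, periodically
    doubled (up to a cap) to regain the throughput of large batches."""
    return min(base_bs * 2 ** (epoch // double_every), max_bs)

print([adaptive_batch_size(e) for e in (0, 30, 60, 90, 120)])  # 128, 256, 512, 1024, 2048
```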

    On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants

    We study optimization algorithms based on variance reduction for stochastic gradient descent (SGD). Remarkable recent progress has been made in this direction through the development of algorithms such as SAG, SVRG, and SAGA. These algorithms have been shown to outperform SGD, both theoretically and empirically. However, asynchronous versions of these algorithms---a crucial requirement for modern large-scale applications---have not been studied. We bridge this gap by presenting a unifying framework for many variance reduction techniques. Subsequently, we propose an asynchronous algorithm grounded in our framework, and prove its fast convergence. An important consequence of our general approach is that it yields asynchronous versions of variance reduction algorithms such as SVRG and SAGA as a byproduct. Our method achieves near-linear speedup in sparse settings common to machine learning. We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.
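    For concreteness, here is a minimal sketch of the serial SVRG update that frameworks like the one above generalize and then execute asynchronously. It is not the paper's asynchronous implementation, and the toy least squares problem, step size, and epoch counts are assumptions.

```python
import numpy as np

def svrg(grad_i, n, w0, lr=0.01, epochs=30, seed=0):
    """Serial SVRG: each epoch stores a snapshot w_tilde and its full gradient,
    and the inner loop uses them to variance-reduce each stochastic gradient."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        w_tilde = w.copy()
        full_grad = np.mean([grad_i(w_tilde, i) for i in range(n)], axis=0)
        for _ in range(n):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_tilde, i) + full_grad  # variance-reduced gradient
            w = w - lr * g
    return w

# Toy least squares: component gradients of f_i(w) = 0.5 * (a_i . w - b_i)^2.
rng = np.random.default_rng(1)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
grad_i = lambda w, i: (A[i] @ w - b[i]) * A[i]
w_hat = svrg(grad_i, n=50, w0=np.zeros(5))
print("residual norm:", np.linalg.norm(A @ w_hat - b))
```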

    A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

    Risk management in dynamic decision problems is a primary concern in many fields, including financial investment, autonomous driving, and healthcare. The mean-variance function is one of the most widely used objective functions in risk management due to its simplicity and interpretability. Existing algorithms for mean-variance optimization are based on multi-time-scale stochastic approximation, whose learning rate schedules are often hard to tune and which have only asymptotic convergence proofs. In this paper, we develop a model-free policy search framework for mean-variance optimization with finite-sample error bound analysis (to local optima). Our starting point is a reformulation of the original mean-variance function with its Fenchel dual, from which we propose a stochastic block coordinate ascent policy search algorithm. We provide both an asymptotic convergence guarantee for the last iterate and a convergence rate for a randomly picked solution, and demonstrate the algorithm's applicability on several benchmark domains.
    Comment: Accepted by NIPS 201
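    A rough, assumption-laden sketch of the Fenchel-dual idea on a toy bandit: using (E[R])^2 = max_y (2 y E[R] - y^2), the mean-variance objective E[R] - lam * Var(R) becomes a joint maximization over the policy parameters and a scalar dual variable y, which can be ascended block-wise (a tracking step on y, then a policy-gradient step on theta). This only illustrates the reformulation; the paper's actual algorithm, step sizes, and analysis differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit: the mean-variance objective E[R] - lam * Var(R) prefers
# arm 1 (slightly lower mean, much lower variance) over arm 0 when lam = 1.
means, stds, lam = np.array([1.0, 0.9, 0.2]), np.array([1.0, 0.1, 0.1]), 1.0

theta = np.zeros(3)   # softmax policy parameters
y = 0.0               # Fenchel dual variable; its optimum is E[R] under the current policy

def policy(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

for t in range(20_000):
    p = policy(theta)
    a = rng.choice(3, p=p)
    r = rng.normal(means[a], stds[a])
    # Block 1: ascent on the dual variable (a running estimate of E[R]).
    y += 0.01 * (r - y)
    # Block 2: policy-gradient ascent on theta for the dualised objective
    #   E[R] - lam * E[R^2] + lam * (2 * y * E[R] - y^2).
    pseudo_r = r - lam * r**2 + 2.0 * lam * y * r
    grad_logp = -p
    grad_logp[a] += 1.0
    theta += 0.005 * pseudo_r * grad_logp

print("final policy:", np.round(policy(theta), 3))   # should concentrate on arm 1
```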

    Integrating production scheduling and process control using latent variable dynamic models

    Given the increasing participation of chemical processes in fast-changing markets, the integration of scheduling and control is an important consideration in process operations. This generally involves computing optimal production schedules using dynamic models, which is challenging due to the nonlinearity and high dimensionality of chemical process models. In this paper, we begin by observing that the intrinsic dimensionality of process dynamics (as relevant to scheduling) is often much lower than the number of model state and/or algebraic variables. We introduce a data mining approach to "learn" closed-loop process dynamics on a low-dimensional, latent manifold. The manifold dimensionality is selected based on a tradeoff between model accuracy and complexity. After projecting process data, system identification and optimal scheduling calculations can be performed in the low-dimensional, latent-variable space. We apply these concepts to schedule an air separation unit under time-varying electricity prices. We show that our approach reduces the computational effort, while offering more detailed dynamic information compared to previous related work.
    Comment: Revised version
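    A minimal sketch of the pipeline's two core steps as described: project high-dimensional closed-loop process data onto a low-dimensional latent space, then identify a dynamic model there. PCA and ordinary least squares stand in for the paper's specific dimensionality-reduction and system-identification choices, and the simulated process data are entirely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulated closed-loop process data: many measured variables, but the dynamics
# actually live on a 2-dimensional latent manifold (illustrative stand-in).
T, n_meas, n_latent = 500, 30, 2
z = np.zeros((T, n_latent))
u = rng.normal(size=(T, 1))                             # scheduling-relevant input signal
A_true, B_true = np.array([[0.9, 0.1], [0.0, 0.8]]), np.array([[0.5], [0.2]])
for t in range(1, T):
    z[t] = z[t - 1] @ A_true.T + u[t - 1] @ B_true.T
C = rng.normal(size=(n_latent, n_meas))
X = z @ C + 0.01 * rng.normal(size=(T, n_meas))         # high-dimensional measurements

# 1) Learn the low-dimensional latent representation of the process data.
pca = PCA(n_components=n_latent).fit(X)
Z = pca.transform(X)

# 2) Identify a linear dynamic model z[t+1] = A z[t] + B u[t] in latent space.
phi = np.hstack([Z[:-1], u[:-1]])
theta, *_ = np.linalg.lstsq(phi, Z[1:], rcond=None)
A_hat, B_hat = theta[:n_latent].T, theta[n_latent:].T

# One-step-ahead prediction error of the identified latent model.
err = Z[1:] - phi @ theta
print("latent dims:", Z.shape[1], "one-step RMSE:", float(np.sqrt((err ** 2).mean())))
```

    Scheduling optimization would then use the identified low-dimensional model (A_hat, B_hat) in place of the full process model, which is the computational saving the abstract refers to.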