Population Based Training of Neural Networks
Neural networks dominate the modern machine learning landscape, but their
training and success still suffer from sensitivity to empirical choices of
hyperparameters such as model architecture, loss function, and optimisation
algorithm. In this work we present \emph{Population Based Training (PBT)}, a
simple asynchronous optimisation algorithm which effectively utilises a fixed
computational budget to jointly optimise a population of models and their
hyperparameters to maximise performance. Importantly, PBT discovers a schedule
of hyperparameter settings rather than following the generally sub-optimal
strategy of trying to find a single fixed set to use for the whole course of
training. With just a small modification to a typical distributed
hyperparameter training framework, our method allows robust and reliable
training of models. We demonstrate the effectiveness of PBT on deep
reinforcement learning problems, showing faster wall-clock convergence and
higher final performance of agents by optimising over a suite of
hyperparameters. In addition, we show the same method can be applied to
supervised learning for machine translation, where PBT is used to maximise the
BLEU score directly, and also to training of Generative Adversarial Networks to
maximise the Inception score of generated images. In all cases PBT results in
the automatic discovery of hyperparameter schedules and model selection that
yields stable training and better final performance.
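As a rough illustration of the exploit/explore loop described above, the sketch below runs PBT synchronously rather than asynchronously; `train_step`, `evaluate`, and `perturb` are hypothetical callables standing in for one optimisation step, a validation metric, and hyperparameter mutation.

```python
import copy
import random

def pbt(population, steps, ready_interval, train_step, evaluate, perturb):
    """Synchronous sketch of Population Based Training (PBT).

    `population` is a list of dicts with 'model' and 'hyperparams' entries;
    `train_step`, `evaluate`, and `perturb` are hypothetical callables.
    """
    for step in range(steps):
        for worker in population:
            train_step(worker['model'], worker['hyperparams'])
        if step % ready_interval != 0:
            continue
        # Exploit: bottom-quartile workers copy weights and hyperparameters
        # from a randomly chosen top-quartile worker.
        scored = sorted(population, key=lambda w: evaluate(w['model']))
        cutoff = max(1, len(population) // 4)
        bottom, top = scored[:cutoff], scored[-cutoff:]
        for worker in bottom:
            source = random.choice(top)
            worker['model'] = copy.deepcopy(source['model'])
            # Explore: perturb the inherited hyperparameters, which is what
            # produces a schedule of settings over the course of training.
            worker['hyperparams'] = perturb(dict(source['hyperparams']))
    return max(population, key=lambda w: evaluate(w['model']))
```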
LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence
Optimizing deep neural networks is largely thought to be an empirical
process, requiring manual tuning of several hyper-parameters, such as learning
rate, weight decay, and dropout rate. Arguably, the learning rate is the most
important of these to tune, and it has received increasing attention in recent work.
In this paper, we propose a novel method to compute the learning rate for
training deep neural networks with stochastic gradient descent. We first derive
a theoretical framework to compute learning rates dynamically based on the
Lipschitz constant of the loss function. We then extend this framework to other
commonly used optimization algorithms, such as gradient descent with momentum
and Adam. We run an extensive set of experiments that demonstrate the efficacy
of our approach on popular architectures and datasets, and show that commonly
used learning rates are an order of magnitude smaller than the ideal value.
Comment: v4; comparison studies added
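As a rough illustration of setting the step size from a Lipschitz constant, the sketch below estimates that constant empirically with random probes and takes lr = 1/L; the paper instead derives the constant analytically for specific loss functions, and `grad_fn` is a hypothetical gradient oracle.

```python
import numpy as np

def estimate_lipschitz(grad_fn, w, n_probes=20, radius=1e-2, rng=None):
    """Crude empirical estimate of the Lipschitz constant of the gradient
    near w: max ||g(w + d) - g(w)|| / ||d|| over random probes d. The paper
    derives this constant analytically per loss; sampling is a stand-in."""
    rng = rng or np.random.default_rng(0)
    g0 = grad_fn(w)
    best = 1e-12
    for _ in range(n_probes):
        delta = rng.normal(size=w.shape) * radius
        g1 = grad_fn(w + delta)
        best = max(best, np.linalg.norm(g1 - g0) / np.linalg.norm(delta))
    return best

def lipschitz_sgd_step(w, grad_fn):
    """One SGD step with the adaptive rate lr = 1/L suggested by the
    Lipschitz-based analysis."""
    lr = 1.0 / estimate_lipschitz(grad_fn, w)
    return w - lr * grad_fn(w)
```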
Neural Approaches to Conversational AI
The present paper surveys neural approaches to conversational AI that have
been developed in the last few years. We group conversational systems into
three categories: (1) question answering agents, (2) task-oriented dialogue
agents, and (3) chatbots. For each category, we present a review of
state-of-the-art neural approaches, draw the connection between them and
traditional approaches, and discuss the progress that has been made and
challenges still being faced, using specific systems and models as case
studies.
Comment: Foundations and Trends in Information Retrieval (95 pages)
Bayesian Optimisation for Machine Translation
This paper presents novel Bayesian optimisation algorithms for minimum error
rate training of statistical machine translation systems. We explore two
classes of algorithms for efficiently exploring the translation space, with the
first based on N-best lists and the second based on a hypergraph representation
that compactly represents an exponential number of translation options. Our
algorithms exhibit faster convergence and are capable of obtaining lower error
rates than the existing translation model specific approaches, all within a
generic Bayesian optimisation framework. Furthermore, we also introduce a
random embedding algorithm to scale our approach to sparse, high-dimensional
feature sets.
Comment: Bayesian optimisation workshop, NIPS 2014
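A minimal sketch of the generic Bayesian optimisation loop referred to above, assuming a hypothetical black-box `error_rate` function (e.g. 1 - BLEU scored against an N-best list) and using a Gaussian-process surrogate with expected improvement; the paper's N-best- and hypergraph-specific machinery is not reproduced here.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt_mert(error_rate, dim, n_init=5, n_iter=30, n_cand=1000, seed=0):
    """Minimise a black-box error over feature weights in [-1, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_init, dim))
    y = np.array([error_rate(w) for w in X])
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(-1.0, 1.0, size=(n_cand, dim))
        mu, sigma = gp.predict(cand, return_std=True)
        # Expected improvement over the lowest error seen so far.
        best = y.min()
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        w_next = cand[np.argmax(ei)]
        X = np.vstack([X, w_next])
        y = np.append(y, error_rate(w_next))
    return X[np.argmin(y)]
```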
Quasi-hyperbolic momentum and Adam for deep learning
Momentum-based acceleration of stochastic gradient descent (SGD) is widely
used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM)
as an extremely simple alteration of momentum SGD, averaging a plain SGD step
with a momentum step. We describe numerous connections to and identities with
other algorithms, and we characterize the set of two-state optimization
algorithms that QHM can recover. Finally, we propose a QH variant of Adam
called QHAdam, and we empirically demonstrate that our algorithms lead to
significantly improved training in a variety of settings, including a new
state-of-the-art result on WMT16 EN-DE. We hope that these empirical results,
combined with the conceptual and practical simplicity of QHM and QHAdam, will
spur interest from both practitioners and researchers. Code is immediately
available.
Comment: Published as a conference paper at ICLR 2019. This version corrects one typographical error in the published text.
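The QHM update itself is, as the abstract says, a weighted average of a plain SGD step and a momentum step. A minimal sketch, using the paper's rule-of-thumb defaults (nu = 0.7, beta = 0.999) and plain lists of parameter arrays:

```python
def qhm_step(params, grads, buf, lr, beta=0.999, nu=0.7):
    """One quasi-hyperbolic momentum (QHM) update:
        buf   <- beta * buf + (1 - beta) * grad
        param <- param - lr * ((1 - nu) * grad + nu * buf)
    nu = 0 recovers plain SGD; nu = 1 recovers SGD with dampened momentum.
    """
    new_params, new_buf = [], []
    for p, g, b in zip(params, grads, buf):
        b = beta * b + (1.0 - beta) * g
        new_buf.append(b)
        new_params.append(p - lr * ((1.0 - nu) * g + nu * b))
    return new_params, new_buf
```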
An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search
Globally normalized neural sequence models are considered superior to their
locally normalized equivalents because they may ameliorate the effects of label
bias. However, when considering high-capacity neural parametrizations that
condition on the whole input sequence, both model classes are theoretically
equivalent in terms of the distributions they are capable of representing.
Thus, the practical advantage of global normalization in the context of modern
neural methods remains unclear. In this paper, we attempt to shed light on this
problem through an empirical study. We extend an approach for search-aware
training via a continuous relaxation of beam search (Goyal et al., 2017b) in
order to enable training of globally normalized recurrent sequence models
through simple backpropagation. We then use this technique to conduct an
empirical study of the interaction between global normalization, high-capacity
encoders, and search-aware optimization. We observe that in the context of
inexact search, globally normalized neural models are still more effective than
their locally normalized counterparts. Further, since our training approach is
sensitive to warm-starting with pre-trained models, we also propose a novel
initialization strategy based on self-normalization for pre-training globally
normalized models. We perform analysis of our approach on two tasks: CCG
supertagging and Machine Translation, and demonstrate the importance of global
normalization under different conditions while using search-aware training.
Comment: Long paper at NAACL 2018
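A toy contrast of the two model classes discussed above, assuming per-step or per-sequence scores are already available from some hypothetical scorer; it only illustrates where the normalisation happens, not the continuous relaxation of beam search used for search-aware training.

```python
import numpy as np

def local_log_prob(step_scores):
    """Locally normalized model: a softmax is applied at every decoding
    step, so each prefix commits to a per-step distribution (the source
    of label bias). `step_scores` is a list of (vocab_scores, chosen_id)."""
    total = 0.0
    for scores, choice in step_scores:
        log_z = np.log(np.exp(scores).sum())
        total += scores[choice] - log_z
    return total

def global_log_prob(candidate_scores, target_index):
    """Globally normalized model: unnormalized sequence scores are
    renormalized once, over a candidate set (standing in for the beam)."""
    scores = np.asarray(candidate_scores, dtype=float)
    log_z = np.log(np.exp(scores).sum())
    return scores[target_index] - log_z
```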
Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation
algorithm for tokenizing sentences into subword units. We view the subword
segmentation of output sentences as a latent variable that should be
marginalized out for learning and inference. A mixed character-subword
transformer is proposed, which enables exact log marginal likelihood estimation
and exact MAP inference to find target segmentations with maximum posterior
probability. DPE uses a lightweight mixed character-subword transformer as a
means of pre-processing parallel data to segment output sentences using dynamic
programming. Empirical results on machine translation suggest that DPE is
effective for segmenting output sentences and can be combined with BPE dropout
for stochastic segmentation of source sentences. DPE achieves an average
improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average
improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several
WMT datasets that pair English with German, Romanian, Estonian, Finnish, and
Hungarian.
Comment: update related work
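The dynamic program underlying DPE can be sketched as a forward recursion over segmentation boundaries. In the sketch below, `vocab_log_prob` is a hypothetical context-independent score for a subword piece, standing in for the mixed character-subword transformer; logsumexp gives the log marginal over all segmentations, and max with back-pointers gives the MAP segmentation.

```python
import numpy as np

def dp_segment(sentence, vocab_log_prob, max_len=10):
    """Marginalize and maximize over subword segmentations of `sentence`."""
    n = len(sentence)
    alpha = np.full(n + 1, -np.inf)   # log marginal score of prefix x[:j]
    best = np.full(n + 1, -np.inf)    # best (MAP) score of prefix x[:j]
    back = [0] * (n + 1)
    alpha[0] = best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            s = vocab_log_prob(sentence[i:j])
            alpha[j] = np.logaddexp(alpha[j], alpha[i] + s)
            if best[i] + s > best[j]:
                best[j], back[j] = best[i] + s, i
    # Recover the MAP segmentation by following back-pointers.
    pieces, j = [], n
    while j > 0:
        pieces.append(sentence[back[j]:j])
        j = back[j]
    return alpha[n], pieces[::-1]
```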
Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model
Recent advances in conditional recurrent language modelling have mainly
focused on network architectures (e.g., attention mechanism), learning
algorithms (e.g., scheduled sampling and sequence-level training) and novel
applications (e.g., image/video description generation, speech recognition,
etc.). On the other hand, we notice that decoding algorithms/strategies have not
been investigated as much, and it has become standard to use greedy or beam
search. In this paper, we propose a novel decoding strategy motivated by an
earlier observation that nonlinear hidden layers of a deep neural network
stretch the data manifold. The proposed strategy is embarrassingly
parallelizable without any communication overhead, while improving an existing
decoding algorithm. We extensively evaluate it with attention-based neural
machine translation on the task of En->Cz translation.
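A rough sketch of the parallel decoding strategy, assuming hypothetical `decode_greedy` and `score` hooks into an existing conditional language model; the annealed Gaussian perturbation of the hidden state is one simple choice of noise schedule, and running the decoders independently is what makes the method embarrassingly parallel.

```python
import numpy as np

def npad_decode(decode_greedy, score, n_parallel=8, sigma0=0.5, seed=0):
    """Run several independent greedy decoders, each perturbing the hidden
    state at step t with Gaussian noise of scale sigma0 / t, then keep the
    hypothesis that the unperturbed model scores highest."""
    rng = np.random.default_rng(seed)

    def make_noise_fn():
        def noise_fn(hidden, t):
            # Annealed perturbation of the hidden state (t is 1-based).
            return hidden + rng.normal(scale=sigma0 / t, size=hidden.shape)
        return noise_fn

    candidates = [decode_greedy(make_noise_fn()) for _ in range(n_parallel)]
    return max(candidates, key=score)
```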
Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
The goal of this tutorial is to introduce key models, algorithms, and open
questions related to the use of optimization methods for solving problems
arising in machine learning. It is written with an INFORMS audience in mind,
specifically those readers who are familiar with the basics of optimization
algorithms, but less familiar with machine learning. We begin by deriving a
formulation of a supervised learning problem and show how it leads to various
optimization problems, depending on the context and underlying assumptions. We
then discuss some of the distinctive features of these optimization problems,
focusing on the examples of logistic regression and the training of deep neural
networks. The latter half of the tutorial focuses on optimization algorithms,
first for convex logistic regression, for which we discuss the use of
first-order methods, the stochastic gradient method, variance-reducing
stochastic methods, and second-order methods. Finally, we discuss how these
approaches can be applied to the training of deep neural networks, emphasizing
the difficulties that arise from the complex, nonconvex structure of these
models.
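As a concrete instance of the convex case the tutorial starts from, here is a minimal stochastic gradient method for binary logistic regression (labels in {0, 1}):

```python
import numpy as np

def sgd_logistic_regression(X, y, lr=0.1, epochs=10, seed=0):
    """Stochastic gradient method for binary logistic regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))   # sigmoid(w^T x_i)
            w -= lr * (p - y[i]) * X[i]           # gradient of the log loss
    return w
```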
A Machine Learning Approach to Routing
Can ideas and techniques from machine learning be leveraged to automatically
generate "good" routing configurations? We investigate the power of data-driven
routing protocols. Our results suggest that applying ideas and techniques from
deep reinforcement learning to this context yields high performance, motivating
further research along these lines.