Population Based Training of Neural Networks
Neural networks dominate the modern machine learning landscape, but their
training and success still suffer from sensitivity to empirical choices of
hyperparameters such as model architecture, loss function, and optimisation
algorithm. In this work we present \emph{Population Based Training (PBT)}, a
simple asynchronous optimisation algorithm which effectively utilises a fixed
computational budget to jointly optimise a population of models and their
hyperparameters to maximise performance. Importantly, PBT discovers a schedule
of hyperparameter settings rather than following the generally sub-optimal
strategy of trying to find a single fixed set to use for the whole course of
training. With just a small modification to a typical distributed
hyperparameter training framework, our method allows robust and reliable
training of models. We demonstrate the effectiveness of PBT on deep
reinforcement learning problems, showing faster wall-clock convergence and
higher final performance of agents by optimising over a suite of
hyperparameters. In addition, we show the same method can be applied to
supervised learning for machine translation, where PBT is used to maximise the
BLEU score directly, and also to training of Generative Adversarial Networks to
maximise the Inception score of generated images. In all cases PBT results in
the automatic discovery of hyperparameter schedules and model selection that
yields stable training and better final performance.
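As a rough illustration of the exploit/explore loop described above, the sketch below runs PBT synchronously rather than asynchronously; `train_step`, `evaluate`, and `perturb` are hypothetical callables standing in for one optimisation step, a validation metric, and hyperparameter mutation.

```python
import copy
import random

def pbt(population, steps, ready_interval, train_step, evaluate, perturb):
    """Synchronous sketch of Population Based Training (PBT).

    `population` is a list of dicts with 'model' and 'hyperparams' entries;
    `train_step`, `evaluate`, and `perturb` are hypothetical callables.
    """
    for step in range(steps):
        for worker in population:
            train_step(worker['model'], worker['hyperparams'])
        if step % ready_interval != 0:
            continue
        # Exploit: bottom-quartile workers copy weights and hyperparameters
        # from a randomly chosen top-quartile worker.
        scored = sorted(population, key=lambda w: evaluate(w['model']))
        cutoff = max(1, len(population) // 4)
        bottom, top = scored[:cutoff], scored[-cutoff:]
        for worker in bottom:
            source = random.choice(top)
            worker['model'] = copy.deepcopy(source['model'])
            # Explore: perturb the inherited hyperparameters, which is what
            # produces a schedule of settings over the course of training.
            worker['hyperparams'] = perturb(dict(source['hyperparams']))
    return max(population, key=lambda w: evaluate(w['model']))
```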
LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence
Optimizing deep neural networks is largely thought to be an empirical
process, requiring manual tuning of several hyper-parameters, such as learning
rate, weight decay, and dropout rate. Arguably, the learning rate is the most
important of these to tune, and it has received increasing attention in recent work.
In this paper, we propose a novel method to compute the learning rate for
training deep neural networks with stochastic gradient descent. We first derive
a theoretical framework to compute learning rates dynamically based on the
Lipschitz constant of the loss function. We then extend this framework to other
commonly used optimization algorithms, such as gradient descent with momentum
and Adam. We run an extensive set of experiments that demonstrate the efficacy
of our approach on popular architectures and datasets, and show that commonly
used learning rates are an order of magnitude smaller than the ideal value.
Comment: v4; comparison studies added
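As a rough illustration of setting the step size from a Lipschitz constant, the sketch below estimates that constant empirically with random probes and takes lr = 1/L; the paper instead derives the constant analytically for specific loss functions, and `grad_fn` is a hypothetical gradient oracle.

```python
import numpy as np

def estimate_lipschitz(grad_fn, w, n_probes=20, radius=1e-2, rng=None):
    """Crude empirical estimate of the Lipschitz constant of the gradient
    near w: max ||g(w + d) - g(w)|| / ||d|| over random probes d. The paper
    derives this constant analytically per loss; sampling is a stand-in."""
    rng = rng or np.random.default_rng(0)
    g0 = grad_fn(w)
    best = 1e-12
    for _ in range(n_probes):
        delta = rng.normal(size=w.shape) * radius
        g1 = grad_fn(w + delta)
        best = max(best, np.linalg.norm(g1 - g0) / np.linalg.norm(delta))
    return best

def lipschitz_sgd_step(w, grad_fn):
    """One SGD step with the adaptive rate lr = 1/L suggested by the
    Lipschitz-based analysis."""
    lr = 1.0 / estimate_lipschitz(grad_fn, w)
    return w - lr * grad_fn(w)
```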
Neural Approaches to Conversational AI
The present paper surveys neural approaches to conversational AI that have
been developed in the last few years. We group conversational systems into
three categories: (1) question answering agents, (2) task-oriented dialogue
agents, and (3) chatbots. For each category, we present a review of
state-of-the-art neural approaches, draw the connection between them and
traditional approaches, and discuss the progress that has been made and
challenges still being faced, using specific systems and models as case
studies.
Comment: Foundations and Trends in Information Retrieval (95 pages)
Bayesian Optimisation for Machine Translation
This paper presents novel Bayesian optimisation algorithms for minimum error
rate training of statistical machine translation systems. We explore two
classes of algorithms for efficiently exploring the translation space, with the
first based on N-best lists and the second based on a hypergraph representation
that compactly represents an exponential number of translation options. Our
algorithms exhibit faster convergence and are capable of obtaining lower error
rates than the existing translation model specific approaches, all within a
generic Bayesian optimisation framework. Furthermore, we also introduce a
random embedding algorithm to scale our approach to sparse, high-dimensional
feature sets.
Comment: Bayesian optimisation workshop, NIPS 2014
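A minimal sketch of the generic Bayesian optimisation loop referred to above, assuming a hypothetical black-box `error_rate` function (e.g. 1 - BLEU scored against an N-best list) and using a Gaussian-process surrogate with expected improvement; the paper's N-best- and hypergraph-specific machinery is not reproduced here.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt_mert(error_rate, dim, n_init=5, n_iter=30, n_cand=1000, seed=0):
    """Minimise a black-box error over feature weights in [-1, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_init, dim))
    y = np.array([error_rate(w) for w in X])
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(-1.0, 1.0, size=(n_cand, dim))
        mu, sigma = gp.predict(cand, return_std=True)
        # Expected improvement over the lowest error seen so far.
        best = y.min()
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        w_next = cand[np.argmax(ei)]
        X = np.vstack([X, w_next])
        y = np.append(y, error_rate(w_next))
    return X[np.argmin(y)]
```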
Quasi-hyperbolic momentum and Adam for deep learning
Momentum-based acceleration of stochastic gradient descent (SGD) is widely
used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM)
as an extremely simple alteration of momentum SGD, averaging a plain SGD step
with a momentum step. We describe numerous connections to and identities with
other algorithms, and we characterize the set of two-state optimization
algorithms that QHM can recover. Finally, we propose a QH variant of Adam
called QHAdam, and we empirically demonstrate that our algorithms lead to
significantly improved training in a variety of settings, including a new
state-of-the-art result on WMT16 EN-DE. We hope that these empirical results,
combined with the conceptual and practical simplicity of QHM and QHAdam, will
spur interest from both practitioners and researchers. Code is immediately
available.
Comment: Published as a conference paper at ICLR 2019. This version corrects one typographical error in the published text.
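The QHM update itself is, as the abstract says, a weighted average of a plain SGD step and a momentum step. A minimal sketch, using the paper's rule-of-thumb defaults (nu = 0.7, beta = 0.999) and plain lists of parameter arrays:

```python
def qhm_step(params, grads, buf, lr, beta=0.999, nu=0.7):
    """One quasi-hyperbolic momentum (QHM) update:
        buf   <- beta * buf + (1 - beta) * grad
        param <- param - lr * ((1 - nu) * grad + nu * buf)
    nu = 0 recovers plain SGD; nu = 1 recovers SGD with dampened momentum.
    """
    new_params, new_buf = [], []
    for p, g, b in zip(params, grads, buf):
        b = beta * b + (1.0 - beta) * g
        new_buf.append(b)
        new_params.append(p - lr * ((1.0 - nu) * g + nu * b))
    return new_params, new_buf
```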
An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search
Globally normalized neural sequence models are considered superior to their
locally normalized equivalents because they may ameliorate the effects of label
bias. However, when considering high-capacity neural parametrizations that
condition on the whole input sequence, both model classes are theoretically
equivalent in terms of the distributions they are capable of representing.
Thus, the practical advantage of global normalization in the context of modern
neural methods remains unclear. In this paper, we attempt to shed light on this
problem through an empirical study. We extend an approach for search-aware
training via a continuous relaxation of beam search (Goyal et al., 2017b) in
order to enable training of globally normalized recurrent sequence models
through simple backpropagation. We then use this technique to conduct an
empirical study of the interaction between global normalization, high-capacity
encoders, and search-aware optimization. We observe that in the context of
inexact search, globally normalized neural models are still more effective than
their locally normalized counterparts. Further, since our training approach is
sensitive to warm-starting with pre-trained models, we also propose a novel
initialization strategy based on self-normalization for pre-training globally
normalized models. We perform analysis of our approach on two tasks: CCG
supertagging and Machine Translation, and demonstrate the importance of global
normalization under different conditions while using search-aware training.
Comment: Long paper at NAACL 2018
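A toy contrast of the two model classes discussed above, assuming per-step or per-sequence scores are already available from some hypothetical scorer; it only illustrates where the normalisation happens, not the continuous relaxation of beam search used for search-aware training.

```python
import numpy as np

def local_log_prob(step_scores):
    """Locally normalized model: a softmax is applied at every decoding
    step, so each prefix commits to a per-step distribution (the source
    of label bias). `step_scores` is a list of (vocab_scores, chosen_id)."""
    total = 0.0
    for scores, choice in step_scores:
        log_z = np.log(np.exp(scores).sum())
        total += scores[choice] - log_z
    return total

def global_log_prob(candidate_scores, target_index):
    """Globally normalized model: unnormalized sequence scores are
    renormalized once, over a candidate set (standing in for the beam)."""
    scores = np.asarray(candidate_scores, dtype=float)
    log_z = np.log(np.exp(scores).sum())
    return scores[target_index] - log_z
```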
Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation
algorithm for tokenizing sentences into subword units. We view the subword
segmentation of output sentences as a latent variable that should be
marginalized out for learning and inference. A mixed character-subword
transformer is proposed, which enables exact log marginal likelihood estimation
and exact MAP inference to find target segmentations with maximum posterior
probability. DPE uses a lightweight mixed character-subword transformer as a
means of pre-processing parallel data to segment output sentences using dynamic
programming. Empirical results on machine translation suggest that DPE is
effective for segmenting output sentences and can be combined with BPE dropout
for stochastic segmentation of source sentences. DPE achieves an average
improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average
improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several
WMT datasets that pair English with German, Romanian, Estonian, Finnish, and
Hungarian.
Comment: update related work
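The dynamic program underlying DPE can be sketched as a forward recursion over segmentation boundaries. In the sketch below, `vocab_log_prob` is a hypothetical context-independent score for a subword piece, standing in for the mixed character-subword transformer; logsumexp gives the log marginal over all segmentations, and max with back-pointers gives the MAP segmentation.

```python
import numpy as np

def dp_segment(sentence, vocab_log_prob, max_len=10):
    """Marginalize and maximize over subword segmentations of `sentence`."""
    n = len(sentence)
    alpha = np.full(n + 1, -np.inf)   # log marginal score of prefix x[:j]
    best = np.full(n + 1, -np.inf)    # best (MAP) score of prefix x[:j]
    back = [0] * (n + 1)
    alpha[0] = best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            s = vocab_log_prob(sentence[i:j])
            alpha[j] = np.logaddexp(alpha[j], alpha[i] + s)
            if best[i] + s > best[j]:
                best[j], back[j] = best[i] + s, i
    # Recover the MAP segmentation by following back-pointers.
    pieces, j = [], n
    while j > 0:
        pieces.append(sentence[back[j]:j])
        j = back[j]
    return alpha[n], pieces[::-1]
```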
Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model
Recent advances in conditional recurrent language modelling have mainly
focused on network architectures (e.g., attention mechanism), learning
algorithms (e.g., scheduled sampling and sequence-level training) and novel
applications (e.g., image/video description generation, speech recognition,
etc.). On the other hand, we notice that decoding algorithms/strategies have not
been investigated as much, and it has become standard to use greedy or beam
search. In this paper, we propose a novel decoding strategy motivated by an
earlier observation that nonlinear hidden layers of a deep neural network
stretch the data manifold. The proposed strategy is embarrassingly
parallelizable without any communication overhead, while improving an existing
decoding algorithm. We extensively evaluate it with attention-based neural
machine translation on the task of En->Cz translation.
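A rough sketch of the parallel decoding strategy, assuming hypothetical `decode_greedy` and `score` hooks into an existing conditional language model; the annealed Gaussian perturbation of the hidden state is one simple choice of noise schedule, and running the decoders independently is what makes the method embarrassingly parallel.

```python
import numpy as np

def npad_decode(decode_greedy, score, n_parallel=8, sigma0=0.5, seed=0):
    """Run several independent greedy decoders, each perturbing the hidden
    state at step t with Gaussian noise of scale sigma0 / t, then keep the
    hypothesis that the unperturbed model scores highest."""
    rng = np.random.default_rng(seed)

    def make_noise_fn():
        def noise_fn(hidden, t):
            # Annealed perturbation of the hidden state (t is 1-based).
            return hidden + rng.normal(scale=sigma0 / t, size=hidden.shape)
        return noise_fn

    candidates = [decode_greedy(make_noise_fn()) for _ in range(n_parallel)]
    return max(candidates, key=score)
```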
Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning
The goal of this tutorial is to introduce key models, algorithms, and open
questions related to the use of optimization methods for solving problems
arising in machine learning. It is written with an INFORMS audience in mind,
specifically those readers who are familiar with the basics of optimization
algorithms, but less familiar with machine learning. We begin by deriving a
formulation of a supervised learning problem and show how it leads to various
optimization problems, depending on the context and underlying assumptions. We
then discuss some of the distinctive features of these optimization problems,
focusing on the examples of logistic regression and the training of deep neural
networks. The latter half of the tutorial focuses on optimization algorithms,
first for convex logistic regression, for which we discuss the use of
first-order methods, the stochastic gradient method, variance-reducing
stochastic methods, and second-order methods. Finally, we discuss how these
approaches can be applied to the training of deep neural networks, emphasizing
the difficulties that arise from the complex, nonconvex structure of these
models.
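As a concrete instance of the convex case the tutorial starts from, here is a minimal stochastic gradient method for binary logistic regression (labels in {0, 1}):

```python
import numpy as np

def sgd_logistic_regression(X, y, lr=0.1, epochs=10, seed=0):
    """Stochastic gradient method for binary logistic regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))   # sigmoid(w^T x_i)
            w -= lr * (p - y[i]) * X[i]           # gradient of the log loss
    return w
```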
A Machine Learning Approach to Routing
Can ideas and techniques from machine learning be leveraged to automatically
generate "good" routing configurations? We investigate the power of data-driven
routing protocols. Our results suggest that applying ideas and techniques from
deep reinforcement learning to this context yields high performance, motivating
further research along these lines.