Differentiable lower bound for expected BLEU score
In natural language processing tasks, model performance is often measured
with a non-differentiable metric such as the BLEU score. To use efficient
gradient-based methods for optimization, a common workaround is to optimize
a surrogate loss function instead. This approach is effective only if
optimizing the surrogate loss also improves the target metric; the
corresponding problem is referred to as loss-evaluation mismatch. In the
present work we propose a method for computing a differentiable lower bound
on the expected BLEU score that does not involve a computationally expensive
sampling procedure, such as the one required by the REINFORCE rule from the
reinforcement learning (RL) framework.
Comment: Presented at NIPS 2017 Workshop on Conversational AI: Today's
Practice and Tomorrow's Potential
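The idea behind avoiding sampling can be illustrated with a toy NumPy sketch (function names are illustrative; this is a simple differentiable proxy for unigram matches, not the paper's actual bound): if each decoder step is treated as an independent distribution over the vocabulary, the expected count of every word is just a sum of per-step probabilities, which is differentiable.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_unigram_matches(logits, reference_ids, vocab_size):
    # logits: (T, V) unnormalized decoder scores, one row per output step.
    # Under independent per-step distributions, the expected count of each
    # word is the sum of its per-step probabilities -- fully differentiable.
    probs = softmax(logits)              # (T, V)
    expected_counts = probs.sum(axis=0)  # (V,)
    ref_counts = np.bincount(reference_ids, minlength=vocab_size)
    # BLEU-style clipping applied to the expectation (a proxy, not a bound)
    return np.minimum(expected_counts, ref_counts).sum()
```

When the per-step distributions are sharply peaked on the reference tokens, this value approaches the hard unigram match count, with no sampling anywhere in the computation.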
Differentiable Scheduled Sampling for Credit Assignment
We demonstrate that a continuous relaxation of the argmax operation can be
used to create a differentiable approximation to greedy decoding for
sequence-to-sequence (seq2seq) models. By incorporating this approximation into
the scheduled sampling training procedure (Bengio et al., 2015)--a well-known
technique for correcting exposure bias--we introduce a new training objective
that is continuous and differentiable everywhere and that can provide
informative gradients near points where previous decoding decisions change
their value. In addition, by using a related approximation, we demonstrate a
similar approach to sampling-based training. Finally, we show that our approach
outperforms cross-entropy training and scheduled sampling procedures on two
sequence prediction tasks: named entity recognition and machine translation.
Comment: Accepted at ACL 2017 (http://bit.ly/2oj1muX)
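The continuous relaxation of argmax can be sketched in a few lines (names are illustrative, not from the paper's code): replace the hard argmax over the vocabulary with a peaked softmax and feed the probability-weighted average embedding to the next decoder step.

```python
import numpy as np

def soft_argmax_embedding(logits, embedding_matrix, temperature=0.1):
    # Peaked softmax over the vocabulary instead of a hard argmax:
    # as temperature -> 0 this approaches the argmax token's embedding,
    # but it stays differentiable everywhere, so gradients can flow
    # through previous decoding decisions.
    z = logits / temperature
    z = z - z.max()
    weights = np.exp(z)
    weights /= weights.sum()
    return weights @ embedding_matrix  # (embed_dim,)
```

With a low temperature the output is numerically indistinguishable from the hard argmax embedding, yet small changes in the logits still produce informative gradients near decision boundaries.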
Task Loss Estimation for Sequence Prediction
Often, the performance on a supervised machine learning task is evaluated
with a "task loss" function that cannot be optimized directly. Examples of
such loss functions include the classification error, the edit distance and the
BLEU score. A common workaround for this problem is to instead optimize a
"surrogate loss" function, such as cross-entropy or hinge loss. In order for
this remedy to be effective, it is important to ensure that minimization of
the surrogate loss results in minimization of the task loss, a condition that
we call "consistency with the task loss". In this work, we
propose another method for deriving differentiable surrogate losses that
provably meet this requirement. We focus on the broad class of models that
define a score for every input-output pair. Our idea is that this score can be
interpreted as an estimate of the task loss, and that the estimation error may
be used as a consistent surrogate loss. A distinct feature of such an approach
is that it defines the desirable value of the score for every input-output
pair. We use this property to design specialized surrogate losses for
Encoder-Decoder models often used for sequence prediction tasks. In our
experiment, we benchmark on the task of speech recognition. Using a new
surrogate loss instead of cross-entropy to train an Encoder-Decoder speech
recognizer brings a significant ~13% relative improvement in terms of Character
Error Rate (CER) in the case when no extra corpora are used for language
modeling.
Comment: Submitted to ICLR 2016
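The core idea admits a tiny numeric sketch (a toy scalar case, not the paper's Encoder-Decoder construction): read the model's score for an input-output pair as an estimate of the task loss; the squared estimation error is then a differentiable surrogate, and minimizing it drives the score toward the non-differentiable task-loss value.

```python
def surrogate(score, task_loss):
    # Squared estimation error: differentiable in `score`, even though
    # `task_loss` itself (e.g. edit distance) is not differentiable.
    return (score - task_loss) ** 2

# Gradient descent on the surrogate for one (input, output) pair;
# 0.37 is a made-up task-loss value for illustration.
score, task_loss = 0.0, 0.37
for _ in range(200):
    grad = 2.0 * (score - task_loss)
    score -= 0.1 * grad
```

After a few hundred steps the score matches the task-loss value, which is exactly the "score as task-loss estimate" interpretation the abstract describes.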
Bi-Directional Differentiable Input Reconstruction for Low-Resource Neural Machine Translation
We aim to better exploit the limited amounts of parallel text available in
low-resource settings by introducing a differentiable reconstruction loss for
neural machine translation (NMT). This loss compares original inputs to
reconstructed inputs, obtained by back-translating translation hypotheses into
the input language. We leverage differentiable sampling and bi-directional NMT
to train models end-to-end, without introducing additional parameters. This
approach achieves small but consistent BLEU improvements on four language pairs
in both translation directions, and outperforms an alternative differentiable
reconstruction strategy based on hidden states.
Comment: Accepted at NAACL 2019
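The differentiable sampling step can be sketched with a Gumbel-softmax relaxation (a standard technique; that the paper uses exactly this variant is an assumption here): add Gumbel noise to the logits and apply a tempered softmax, yielding a soft "sampled" token that the back-translation pass can consume while gradients still flow to the forward model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, temperature=0.5):
    # Gumbel(0, 1) noise reparameterizes categorical sampling; the
    # tempered softmax makes the sample differentiable in `logits`.
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))
    z = (logits + g) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()
```

As the temperature approaches zero, each sample approaches a one-hot vector, recovering discrete sampling in the limit.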
A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models
Beam search is a desirable choice of test-time decoding algorithm for neural
sequence models because it potentially avoids search errors made by simpler
greedy methods. However, typical cross entropy training procedures for these
models do not directly consider the behaviour of the final decoding method. As
a result, for cross-entropy trained models, beam decoding can sometimes yield
reduced test performance when compared with greedy decoding. In order to train
models that can more effectively make use of beam search, we propose a new
training procedure that focuses on the final loss metric (e.g. Hamming loss)
evaluated on the output of beam search. While well-defined, this "direct loss"
objective is itself discontinuous and thus difficult to optimize. Hence, in our
approach, we form a sub-differentiable surrogate objective by introducing a
novel continuous approximation of the beam search decoding procedure. In
experiments, we show that optimizing this new training objective yields
substantially better results on two sequence tasks (Named Entity Recognition
and CCG Supertagging) when compared with both cross entropy trained greedy
decoding and cross entropy trained beam decoding baselines.
Comment: Updated for clarity and notational consistency
Sparse Sequence-to-Sequence Models
Sequence-to-sequence models are a powerful workhorse of NLP. Most variants
employ a softmax transformation in both their attention mechanism and output
layer, leading to dense alignments and strictly positive output probabilities.
This density is wasteful, making models less interpretable and assigning
probability mass to many implausible outputs. In this paper, we propose sparse
sequence-to-sequence models, rooted in a new family of α-entmax
transformations, which includes softmax and sparsemax as particular cases, and
is sparse for any α > 1. We provide fast algorithms to evaluate these
transformations and their gradients, which scale well for large vocabulary
sizes. Our models are able to produce sparse alignments and to assign nonzero
probability to a short list of plausible outputs, sometimes rendering beam
search exact. Experiments on morphological inflection and machine translation
reveal consistent gains over dense models.
Comment: ACL 2019 Camera Ready
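Sparsemax, the α = 2 member of this family (softmax corresponds to α = 1), has a simple closed form: a Euclidean projection onto the probability simplex. A NumPy sketch of the forward pass:

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex;
    # unlike softmax, it can assign exact zeros.
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum    # which coordinates stay nonzero
    k_z = k[support][-1]                   # size of the support
    tau = (cumsum[k_z - 1] - 1.0) / k_z    # threshold subtracted from z
    return np.maximum(z - tau, 0.0)
```

The support shrinks as the gap between logits grows, which is what can render beam search exact: only a short list of outputs ever receives nonzero probability.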
Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis
Program synthesis is the task of automatically generating a program
consistent with a specification. Recent years have seen proposal of a number of
neural approaches for program synthesis, many of which adopt a sequence
generation paradigm similar to neural machine translation, in which
sequence-to-sequence models are trained to maximize the likelihood of known
reference programs. While achieving impressive results, this strategy has two
key limitations. First, it ignores Program Aliasing: the fact that many
different programs may satisfy a given specification (especially with
incomplete specifications such as a few input-output examples). By maximizing
the likelihood of only a single reference program, it penalizes many
semantically correct programs, which can adversely affect the synthesizer
performance. Second, this strategy overlooks the fact that programs have a
strict syntax that can be efficiently checked. To address the first limitation,
we perform reinforcement learning on top of a supervised model with an
objective that explicitly maximizes the likelihood of generating semantically
correct programs. For addressing the second limitation, we introduce a training
procedure that directly maximizes the probability of generating syntactically
correct programs that fulfill the specification. We show that our contributions
lead to improved accuracy of the models, especially in cases where the training
data is limited.
Comment: ICLR 2018
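The syntax-checking idea can be sketched with a grammar mask (illustrative, not the paper's exact mechanism): at each decoding step, tokens the grammar forbids receive probability exactly zero, so every program the model can emit is syntactically valid by construction.

```python
import numpy as np

def masked_softmax(logits, valid_mask):
    # Renormalize only over grammar-legal next tokens;
    # np.exp(-inf) == 0, so forbidden tokens get exactly zero mass.
    masked = np.where(valid_mask, logits, -np.inf)
    z = masked - masked.max()
    e = np.exp(z)
    return e / e.sum()
```

The mask itself would come from the target language's grammar, e.g. which token categories may follow the current parser state.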
Non-Autoregressive Neural Machine Translation
Existing approaches to neural machine translation condition each output word
on previously generated outputs. We introduce a model that avoids this
autoregressive property and produces its outputs in parallel, allowing an order
of magnitude lower latency during inference. Through knowledge distillation,
the use of input token fertilities as a latent variable, and policy gradient
fine-tuning, we achieve this at a cost of as little as 2.0 BLEU points relative
to the autoregressive Transformer network used as a teacher. We demonstrate
substantial cumulative improvements associated with each of the three aspects
of our training strategy, and validate our approach on IWSLT 2016
English-German and two WMT language pairs. By sampling fertilities in parallel
at inference time, our non-autoregressive model achieves near-state-of-the-art
performance of 29.8 BLEU on WMT 2016 English-Romanian.
Comment: Accepted by ICLR 2018
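The fertility mechanism is easy to sketch (a plain-Python illustration, not the paper's implementation): each source token is copied `fertility` times to build the decoder input, which fixes the target length up front so that every output position can be decoded in parallel.

```python
def fertility_inputs(source_tokens, fertilities):
    # fertilities: predicted number of target tokens each source token
    # translates into (0 means the source token produces no output).
    out = []
    for tok, f in zip(source_tokens, fertilities):
        out.extend([tok] * f)
    return out
```

Sampling several fertility sequences in parallel at inference time, as the abstract describes, amounts to building several such decoder inputs and rescoring the resulting translations.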
pix2code: Generating Code from a Graphical User Interface Screenshot
Transforming a graphical user interface screenshot created by a designer into
computer code is a typical task conducted by a developer in order to build
customized software, websites, and mobile applications. In this paper, we show
that deep learning methods can be leveraged to train a model end-to-end to
automatically generate code from a single input image with an accuracy of over
77% for three different platforms (i.e. iOS, Android, and web-based
technologies).
Search-Guided, Lightly-supervised Training of Structured Prediction Energy Networks
In structured output prediction tasks, labeling ground-truth training output
is often expensive. However, for many tasks, even when the true output is
unknown, we can evaluate predictions using a scalar reward function, which may
be easily assembled from human knowledge or non-differentiable pipelines. But
searching through the entire output space to find the best output with respect
to this reward function is typically intractable. In this paper, we instead use
efficient truncated randomized search in this reward function to train
structured prediction energy networks (SPENs), which provide efficient
test-time inference using gradient-based search on a smooth, learned
representation of the score landscape, and have previously yielded
state-of-the-art results in structured prediction. In particular, this
truncated randomized search in the reward function yields previously unknown
local improvements, providing effective supervision to SPENs, avoiding their
traditional need for labeled training data.
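The search component can be sketched as a short hill climb (hypothetical interface; `reward` stands in for any black-box scorer assembled from human knowledge or a non-differentiable pipeline): start from the model's prediction, randomly flip a few output labels, and keep a change only when the reward improves. The improved outputs then serve as supervision.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_random_search(y0, reward, n_steps=50, n_flips=1):
    # "Truncated": only n_steps proposals, each a small random edit,
    # instead of an intractable search over the whole output space.
    best, best_r = y0.copy(), reward(y0)
    for _ in range(n_steps):
        cand = best.copy()
        idx = rng.integers(len(cand), size=n_flips)
        cand[idx] = 1 - cand[idx]  # flip a few binary labels
        r = reward(cand)
        if r > best_r:             # keep only strict improvements
            best, best_r = cand, r
    return best, best_r
```

Any output the search improves over the model's own prediction is a "previously unknown local improvement" in the abstract's sense, and can be fed back as a training signal.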