A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models
Beam search is a desirable choice of test-time decoding algorithm for neural
sequence models because it potentially avoids search errors made by simpler
greedy methods. However, typical cross entropy training procedures for these
models do not directly consider the behaviour of the final decoding method. As
a result, for cross-entropy trained models, beam decoding can sometimes yield
reduced test performance when compared with greedy decoding. In order to train
models that can more effectively make use of beam search, we propose a new
training procedure that focuses on the final loss metric (e.g. Hamming loss)
evaluated on the output of beam search. While well-defined, this "direct loss"
objective is itself discontinuous and thus difficult to optimize. Hence, in our
approach, we form a sub-differentiable surrogate objective by introducing a
novel continuous approximation of the beam search decoding procedure. In
experiments, we show that optimizing this new training objective yields
substantially better results on two sequence tasks (Named Entity Recognition
and CCG Supertagging) when compared with both cross entropy trained greedy
decoding and cross entropy trained beam decoding baselines.
Comment: Updated for clarity and notational consistency.
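As a rough illustration of the idea (not the authors' implementation), the sketch below replaces the hard top-k candidate selection inside one beam-search step with peaked-softmax weights, so the new beam scores and state embeddings become convex combinations of the candidates and gradients can flow through the step. The shapes, the iterative suppression trick, and the temperature are assumptions.

```python
# A minimal sketch of one "soft" beam-search step: hard top-k selection of
# successor hypotheses is replaced by peaked softmax weights, keeping the
# whole step differentiable.
import torch
import torch.nn.functional as F

def soft_beam_step(beam_scores, succ_logprobs, succ_embeds, temperature=0.1):
    """beam_scores:   (k,)      cumulative scores of the current beam
       succ_logprobs: (k, V)    log-probs of each successor token
       succ_embeds:   (k, V, d) embeddings of each candidate continuation
       Returns new (k,) soft scores and (k, d) soft state embeddings."""
    k, V, d = succ_embeds.shape
    cand_scores = (beam_scores[:, None] + succ_logprobs).reshape(k * V)
    cand_embeds = succ_embeds.reshape(k * V, d)
    new_scores, new_embeds = [], []
    for _ in range(k):
        w = F.softmax(cand_scores / temperature, dim=0)  # soft "argmax"
        new_scores.append((w * cand_scores).sum())       # soft selected score
        new_embeds.append(w @ cand_embeds)               # convex combination
        # softly suppress the chosen candidate before picking the next one
        cand_scores = cand_scores - w * 1e3
    return torch.stack(new_scores), torch.stack(new_embeds)
```

Annealing the temperature toward zero recovers ordinary hard beam search, which is what makes the surrogate a relaxation of the original decoding procedure.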
Successive Halving Top-k Operator
We propose a differentiable successive-halving relaxation of the top-k
operator, making gradient-based optimization possible. A tournament-style
selection avoids applying softmax iteratively to the entire vector of scores.
As a result, the method approximates top-k more closely, and at lower
computational cost, than the previous approach.
Comment: Work in progress.
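A hedged sketch of the tournament idea, under the assumption that neighbouring elements are paired in each round (the paper's actual bracket may differ): each pair plays a "match" decided by a 2-way softmax, halving the candidate set each round until k soft winners remain, so no softmax over the full vector is ever needed.

```python
# Successive-halving-style soft top-k: repeated 2-way softmax "matches"
# instead of one softmax over all n scores.
import torch
import torch.nn.functional as F

def successive_halving_topk(scores, values, k, temperature=0.1):
    """scores: (n,), values: (n, d); n is assumed to be k * 2**r."""
    while scores.shape[0] > k:
        s = scores.reshape(-1, 2)                    # (n/2, 2) matches
        v = values.reshape(-1, 2, values.shape[-1])  # (n/2, 2, d)
        w = F.softmax(s / temperature, dim=-1)       # soft match outcome
        scores = (w * s).sum(-1)                     # winner's soft score
        values = (w.unsqueeze(-1) * v).sum(-2)       # winner's soft value
    return scores, values
```

Because two of the true top-k elements can land in the same bracket, this is an approximation; the tournament trades a little accuracy for avoiding full-vector softmax rounds.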
Sparsifying Transformer Models with Trainable Representation Pooling
We propose a novel method to sparsify attention in the Transformer model by
learning to select the most informative token representations during
training, thus focusing on task-specific parts of the input. A robust
trainable top-k operator reduces the quadratic time and memory complexity to
sublinear. For example, our experiments on a challenging long-document
summarization task show that our method is over 3 times faster and up to 16
times more memory-efficient while significantly outperforming both dense and
state-of-the-art sparse Transformer models. The method can be applied
effortlessly to many models used in NLP and CV, simultaneously with other
improvements.
Comment: Provided a formal overview. Re-evaluated with the Google Research
script.
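For illustration only, here is a plain hard-top-k variant of trainable representation pooling (the paper's operator is a more robust differentiable relaxation): a learned scorer ranks tokens, only the k best are forwarded to later layers, and multiplying the survivors by their scores keeps the scorer trainable.

```python
# Trainable top-k pooling sketch: a learned scorer selects which token
# representations later layers get to see.
import torch
import torch.nn as nn

class TopKPooling(nn.Module):
    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # learned per-token saliency
        self.k = k

    def forward(self, x):                    # x: (batch, seq, d_model)
        scores = self.scorer(x).squeeze(-1)  # (batch, seq)
        topk = scores.topk(self.k, dim=-1)
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        kept = x.gather(1, idx)              # (batch, k, d_model)
        # scaling by the (sigmoid) scores lets gradients reach the scorer
        return kept * torch.sigmoid(topk.values).unsqueeze(-1)
```

After pooling, downstream self-attention costs O(k^2) rather than O(n^2) in the sequence length, which is where the speed and memory gains come from.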
Learning to Summarize Videos by Contrasting Clips
Video summarization aims at choosing parts of a video that narrate a story as
closely as possible to the original one. Most existing video summarization
approaches rely on hand-crafted labels. As the number of videos grows
exponentially, there is an increasing need for methods that can learn
meaningful summaries without labeled annotations. In this paper, we aim to
maximally exploit unsupervised video summarization while limiting supervision
to a few personalized labels as an add-on. To do so, we formulate two key
requirements for informative video summarization and propose contrastive
learning as the answer to both. To further boost Contrastive video
Summarization (CSUM), we propose to contrast top-k features instead of the
mean video feature employed by existing methods, which we implement with a
differentiable top-k feature selector. Our experiments on several benchmarks
demonstrate that our approach yields meaningful and diverse summaries when no
labeled data is provided.
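A minimal sketch of contrasting top-k features rather than a mean video feature; the norm-based saliency proxy and the hard selection below are assumptions standing in for the paper's differentiable top-k selector.

```python
# Clip features pooled from the k most salient frames, compared with a
# standard InfoNCE contrastive loss.
import torch
import torch.nn.functional as F

def topk_clip_feature(frames, k):
    """frames: (num_frames, d) -> (d,) mean of the k most salient frames.
    Frame norm is an assumed saliency proxy; the paper learns selection
    with a differentiable top-k operator instead of this hard indexing."""
    saliency = frames.norm(dim=-1)
    idx = saliency.topk(k).indices
    return frames[idx].mean(0)

def infonce(anchor, positive, negatives, tau=0.07):
    """anchor, positive: (d,); negatives: (m, d). Standard InfoNCE loss."""
    a = F.normalize(anchor, dim=-1)
    pos = (a @ F.normalize(positive, dim=-1)) / tau
    neg = (a @ F.normalize(negatives, dim=-1).T) / tau
    return -pos + torch.logsumexp(torch.cat([pos[None], neg]), dim=0)
```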
Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement
The well-known Gumbel-Max trick for sampling from a categorical distribution
can be extended to sample elements without replacement. We show how to
implicitly apply this 'Gumbel-Top-k' trick to a factorized distribution over
sequences, allowing us to draw exact samples without replacement using
Stochastic Beam Search. Even for exponentially large domains, the number of
model evaluations grows only linearly in k and the maximum sampled sequence
length. The algorithm creates a theoretical connection between sampling and
(deterministic) beam search and can be used as a principled intermediate
alternative. In a translation task, the proposed method compares favourably
against alternatives to obtain diverse yet good quality translations. We show
that sequences sampled without replacement can be used to construct
low-variance estimators for expected sentence-level BLEU score and model
entropy.
Comment: ICML 2019; 13 pages, 4 figures.
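The underlying Gumbel-Top-k trick itself is simple to state: perturb each log-probability with i.i.d. Gumbel(0, 1) noise and take the k largest, which yields an exact sample of k distinct categories without replacement. A minimal sketch:

```python
# Gumbel-Top-k: sampling k categories without replacement in one shot.
import torch

def gumbel_top_k(logits, k):
    """logits: unnormalised log-probs (n,); returns k distinct indices."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))  # Gumbel(0, 1)
    return (logits + gumbel).topk(k).indices
```

The paper's contribution is applying this implicitly to a factorized sequence model, where the domain is too large to enumerate: Stochastic Beam Search propagates the perturbed scores down the search tree instead of materialising them all.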