Self-Supervised and Controlled Multi-Document Opinion Summarization
We address the problem of unsupervised abstractive summarization of
collections of user-generated reviews with self-supervision and control. We
propose a self-supervised setup that considers an individual document as a
target summary for a set of similar documents. This setting makes training
simpler than previous approaches by relying only on standard log-likelihood
loss. We address the problem of hallucinations through the use of control
codes that steer the generation towards more coherent and relevant
summaries. Finally, we extend the Transformer architecture to allow for multiple
reviews as input. Our benchmarks on two datasets against graph-based and recent
neural abstractive unsupervised models show that our proposed method generates
summaries of superior quality and relevance. This is confirmed in our human
evaluation, which focuses explicitly on the faithfulness of generated summaries.
We also provide an ablation study, which shows the importance of the control
setup in controlling hallucinations and in achieving high sentiment and topic
alignment of the summaries with the input reviews.
Comment: 18 pages including a 5-page appendix
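As a rough illustration of the self-supervised setup described above, the sketch below builds (control codes, input reviews, pseudo-summary) triples by treating each review as the target for its most similar neighbours. TF-IDF cosine similarity, the function name `build_training_pairs`, and the example control codes are assumptions made for illustration, not the paper's actual choices.

```python
# Minimal sketch: each review serves as the pseudo-summary target for a set of
# similar reviews, with hypothetical control codes prepended to steer generation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_training_pairs(reviews, k=8):
    """Return (control_codes, input_reviews, pseudo_summary) training triples."""
    tfidf = TfidfVectorizer().fit_transform(reviews)
    sims = cosine_similarity(tfidf)
    pairs = []
    for i, target in enumerate(reviews):
        # The k most similar other reviews form the multi-document input.
        neighbours = [j for j in sims[i].argsort()[::-1] if j != i][:k]
        sources = [reviews[j] for j in neighbours]
        # Placeholder control codes for sentiment/topic steering (hypothetical).
        control = "<sentiment=pos> <topic=battery>"
        pairs.append((control, sources, target))
    return pairs
```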
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
Multi-head self-attention is a key component of the Transformer, a
state-of-the-art architecture for neural machine translation. In this work we
evaluate the contribution made by individual attention heads in the encoder to
the overall performance of the model and analyze the roles played by them. We
find that the most important and confident heads play consistent and often
linguistically-interpretable roles. When pruning heads using a method based on
stochastic gates and a differentiable relaxation of the L0 penalty, we observe
that specialized heads are last to be pruned. Our novel pruning method removes
the vast majority of heads without seriously affecting performance. For
example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads
results in a drop of only 0.15 BLEU.
Comment: ACL 2019 (camera-ready)
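The pruning relies on stochastic gates trained with a differentiable relaxation of the L0 penalty. Below is a minimal PyTorch sketch of such a gate in the spirit of hard-concrete gates; the class name and hyperparameter values are assumptions, and the paper's exact formulation may differ. During training, each head's output would be scaled by its gate and a weighted `expected_l0()` term added to the loss, so heads whose gates collapse to zero can be removed.

```python
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """One stochastic gate per attention head; penalizing the expected L0 norm
    pushes unimportant gates towards exactly zero."""
    def __init__(self, n_heads, beta=2/3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_heads))  # learned gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (gamma, zeta), then clip to [0, 1] to obtain hard 0/1 gates.
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self):
        # Probability that each gate is non-zero; the sum is the sparsity penalty.
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()
```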
Simple Recurrent Units for Highly Parallelizable Recurrence
Common recurrent neural architectures scale poorly due to the intrinsic
difficulty in parallelizing their state computations. In this work, we propose
the Simple Recurrent Unit (SRU), a light recurrent unit that balances model
capacity and scalability. SRU is designed to provide expressive recurrence and
enable a highly parallelized implementation, and it comes with careful
initialization to facilitate the training of deep models. We demonstrate the
effectiveness of SRU on multiple NLP tasks. SRU achieves 5--9x speed-up over
cuDNN-optimized LSTM on classification and question answering datasets, and
delivers stronger results than LSTM and convolutional models. We also obtain an
average of 0.7 BLEU improvement over the Transformer model on translation by
incorporating SRU into the architecture.
Comment: EMNLP 2018
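To make the parallelization argument concrete, here is a simplified SRU-style layer in PyTorch: all heavy matrix multiplications are computed for every time step at once, and only cheap element-wise operations remain sequential. This sketch omits the paper's careful initialization and other variants, so it should be read as an illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn

class SimpleSRU(nn.Module):
    """Simplified SRU layer: projections are batched over time; only the
    element-wise recurrence is sequential."""
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, 3 * d)            # candidate, forget, reset projections
        self.v_f = nn.Parameter(torch.zeros(d))  # peephole weights for the forget gate
        self.v_r = nn.Parameter(torch.zeros(d))  # peephole weights for the reset gate

    def forward(self, x):                        # x: (seq_len, batch, d)
        u, f_in, r_in = self.W(x).chunk(3, dim=-1)  # parallel over all time steps
        c = torch.zeros_like(x[0])
        outputs = []
        for t in range(x.size(0)):               # sequential part is element-wise only
            f = torch.sigmoid(f_in[t] + self.v_f * c)
            r = torch.sigmoid(r_in[t] + self.v_r * c)
            c = f * c + (1 - f) * u[t]
            outputs.append(r * c + (1 - r) * x[t])  # highway connection to the input
        return torch.stack(outputs), c
```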
Contextualized Translation of Automatically Segmented Speech
Direct speech-to-text translation (ST) models are usually trained on corpora
segmented at sentence level, but at inference time they are commonly fed with
audio split by a voice activity detector (VAD). Since VAD segmentation is not
syntax-informed, the resulting segments do not necessarily correspond to
well-formed sentences uttered by the speaker but, most likely, to fragments of
one or more sentences. This segmentation mismatch considerably degrades the
quality of ST models' output. So far, researchers have focused on improving
audio segmentation towards producing sentence-like splits. In this paper,
instead, we address the issue in the model, making it more robust to a
different, potentially sub-optimal segmentation. To this end, we train our
models on randomly segmented data and compare two approaches: fine-tuning and
adding the previous segment as context. We show that our context-aware solution
is more robust to VAD-segmented input, outperforming both a strong base model
and a fine-tuned one on different VAD segmentations of an English-German test
set by up to 4.25 BLEU points.
Comment: Interspeech 2020
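A rough sketch of how such training data could be prepared is shown below, using transcripts as a stand-in for audio segments for brevity: utterances are re-split at random points, and each segment is paired with the previous one as context. Segment lengths, field names, and the text-level simplification are assumptions made for illustration, not the paper's pipeline.

```python
import random

def random_resegment(words, min_len=3, max_len=25):
    """Split a token sequence into random-length chunks, mimicking non-syntactic,
    VAD-like segmentation (length bounds are illustrative)."""
    segments, i = [], 0
    while i < len(words):
        n = random.randint(min_len, max_len)
        segments.append(" ".join(words[i:i + n]))
        i += n
    return segments

def with_previous_context(segments):
    """Attach the previous segment as context so the model can condition on it."""
    return [{"context": segments[i - 1] if i > 0 else "", "segment": s}
            for i, s in enumerate(segments)]
```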
On Sparsifying Encoder Outputs in Sequence-to-Sequence Models
Sequence-to-sequence models usually transfer all encoder outputs to the
decoder for generation. In this work, by contrast, we hypothesize that these
encoder outputs can be compressed to shorten the sequence delivered for
decoding. We take Transformer as the testbed and introduce a layer of
stochastic gates in between the encoder and the decoder. The gates are
regularized using the expected value of the sparsity-inducing L0 penalty,
which results in completely masking out a subset of encoder outputs. In other
words, via joint training, the L0Drop layer forces the Transformer to route
information through a subset of its encoder states. We investigate the effects
of this sparsification on two machine translation and two summarization tasks.
Experiments show that, depending on the task, around 40-70% of source encodings
can be pruned without significantly compromising quality. The reduced output
length gives L0Drop the potential to improve decoding efficiency: it yields a
speedup of up to 1.65x over the standard Transformer on document summarization
tasks. We analyze the L0Drop behaviour and
observe that it exhibits systematic preferences for pruning certain word types,
e.g., function words and punctuation get pruned most. Inspired by these
observations, we explore the feasibility of specifying rule-based patterns that
mask out encoder outputs based on information such as part-of-speech tags, word
frequency, and word position.
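As a toy illustration of such rule-based patterns, the snippet below keeps encoder positions for content words and drops those for function words and punctuation; the word lists and the rule itself are invented for illustration and are not the paper's actual patterns.

```python
# Hypothetical word lists standing in for part-of-speech or frequency rules.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "or", "is", "are"}
PUNCTUATION = {".", ",", ";", ":", "!", "?"}

def keep_mask(tokens):
    """True at positions whose encoder state is kept, False at pruned positions."""
    return [t.lower() not in (FUNCTION_WORDS | PUNCTUATION) for t in tokens]

tokens = "the cat sat on the mat .".split()
print(keep_mask(tokens))  # [False, True, True, False, False, True, False]
```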
Analyzing the Source and Target Contributions to Predictions in Neural Machine Translation
In Neural Machine Translation (and, more generally, conditional language
modeling), the generation of a target token is influenced by two types of
context: the source and the prefix of the target sequence. While many attempts
to understand the internal workings of NMT models have been made, none of them
explicitly evaluates the relative source and target contributions to a
generation decision. We argue that this relative contribution can be evaluated
by adopting a variant of Layerwise Relevance Propagation (LRP). Its underlying
‘conservation principle’ makes relevance propagation unique: unlike other
methods, it evaluates not an abstract quantity reflecting token importance, but
the proportion of each token’s influence. We extend LRP to the Transformer and
conduct an analysis of NMT models which explicitly evaluates the source and
target relative contributions to the generation process. We analyze changes in
these contributions when conditioning on different types of prefixes, when
varying the training objective or the amount of training data, and during the
training process. We find that models trained with more data tend to rely on
source information more and to have sharper token contributions; the training
process is non-monotonic, with several stages of a different nature.
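One way to state the conservation property informally (the notation below is ours, for illustration, not the paper's): the relevance assigned to a prediction decomposes exactly over source tokens and target-prefix tokens, so the source contribution can be read off as a proportion.

```latex
% Informal statement of the conservation property (illustrative notation):
\[
\sum_{i \in \text{source}} R_i \;+\; \sum_{j \in \text{target prefix}} R_j \;=\; R_{\text{total}},
\qquad
\text{source contribution} \;=\; \frac{\sum_{i \in \text{source}} R_i}{R_{\text{total}}} .
\]
```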