Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation
Boosted by the simultaneous translation shared task at IWSLT 2020, promising
end-to-end online speech translation approaches were recently proposed. They
consist of incrementally encoding a speech input (in a source language) and
decoding the corresponding text (in a target language) with the best possible
trade-off between latency and translation quality. This paper investigates two
key aspects of end-to-end simultaneous speech translation: (a) how to efficiently
encode the continuous speech flow, and (b) how to segment the speech flow
in order to alternate optimally between reading (R: encoding input) and writing
(W: decoding output) operations. We extend our previously proposed end-to-end
online decoding strategy and show that while replacing BLSTM by ULSTM encoding
degrades performance in offline mode, it actually improves both efficiency and
performance in online mode. We also measure the impact of different methods to
segment the speech signal (using fixed interval boundaries, oracle word
boundaries, or randomly set boundaries) and show that, surprisingly, our best
end-to-end online decoding strategy is the one that alternates R/W operations on
fixed-size blocks in our English-German speech translation setup.
Comment: Accepted for presentation at Interspeech 202
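The R/W alternation over fixed-size blocks described in this abstract can be pictured with a short sketch. The following is a minimal illustration, not the authors' implementation: the module names, the block size, the feature dimension, and the BOS/EOS constants are assumptions made only for demonstration, and the decoder is a trivial placeholder rather than a real attention-based model.

```python
import torch
import torch.nn as nn

BOS, EOS, VOCAB = 1, 2, 1000  # hypothetical special tokens / vocabulary size


class StreamingULSTMEncoder(nn.Module):
    """Unidirectional LSTM encoder that consumes the speech flow block by block."""

    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, block, state=None):
        # block: (1, T_block, feat_dim) acoustic features for one fixed-size segment.
        # The recurrent state is carried across blocks, so each R step only
        # encodes the newly arrived frames.
        return self.lstm(block, state)


class GreedyDecoder(nn.Module):
    """Toy decoder stub: predicts one token from the encoder memory seen so far
    (it ignores the target prefix; a real model would attend over both)."""

    def __init__(self, hidden=256):
        super().__init__()
        self.proj = nn.Linear(hidden, VOCAB)

    def step(self, prev_tokens, memory):
        pooled = memory.mean(dim=1)              # (1, hidden)
        return int(self.proj(pooled).argmax(dim=-1))


def online_decode(encoder, decoder, speech_blocks, max_len=200):
    """Alternate R (encode one fixed-size block) and W (emit one target token)."""
    enc_outputs, state, tokens = [], None, [BOS]
    for block in speech_blocks:                  # R: read the next block
        out, state = encoder(block, state)
        enc_outputs.append(out)
        memory = torch.cat(enc_outputs, dim=1)
        tok = decoder.step(tokens, memory)       # W: write one target token
        tokens.append(tok)
        if tok == EOS or len(tokens) >= max_len:
            break
    return tokens


# Usage: 10 fixed-size blocks of 25 frames of 40-dim features each.
blocks = [torch.randn(1, 25, 40) for _ in range(10)]
print(online_decode(StreamingULSTMEncoder(), GreedyDecoder(), blocks))
```

The point of the sketch is the control flow: a unidirectional (ULSTM) encoder keeps its recurrent state across blocks, so each R step only processes the newly arrived frames, which is what makes online encoding efficient compared to re-running a bidirectional encoder over the growing input.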
Efficient Wait-k Models for Simultaneous Machine Translation
Simultaneous machine translation consists of starting output generation
before the entire input sequence is available. Wait-k decoders offer a simple
but efficient approach to this problem. They first read k source tokens, after
which they alternate between producing a target token and reading another
source token. We investigate the behavior of wait-k decoding in low resource
settings for spoken corpora using IWSLT datasets. We improve the training of these
models by using unidirectional encoders and by training across multiple values of k.
Experiments with Transformer and 2D-convolutional architectures show that our
wait-k models generalize well across a wide range of latency levels. We also
show that the 2D-convolution architecture is competitive with Transformers for
simultaneous translation of spoken language.
Comment: Accepted at INTERSPEECH 202
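The wait-k read/write schedule described above is simple enough to sketch directly. The snippet below is only an illustration of that schedule, assuming a generic translate_step(source_prefix, target_prefix) callable that emits one target token; both that callable and the toy "model" used in the usage line are hypothetical placeholders, not the paper's code.

```python
def wait_k_decode(source_stream, translate_step, k=3, max_len=200):
    """Read k source tokens, then alternate writing one target token and
    reading one more source token (the classic wait-k schedule)."""
    src, tgt = [], []
    for token in source_stream:
        src.append(token)                        # READ one source token
        if len(src) < k:                         # still inside the initial wait-k window
            continue
        tgt.append(translate_step(src, tgt))     # WRITE one target token
        if tgt[-1] == "</s>" or len(tgt) >= max_len:
            return tgt
    # Source exhausted: finish writing with the full (now complete) source.
    while tgt[-1:] != ["</s>"] and len(tgt) < max_len:
        tgt.append(translate_step(src, tgt))
    return tgt


def copy_model(src, tgt):
    """Toy stand-in for a real wait-k NMT model: copies the source, then stops."""
    return src[len(tgt)] if len(tgt) < len(src) else "</s>"


print(wait_k_decode(["ich", "sehe", "die", "katze", "."], copy_model, k=2))
```

Training across multiple values of k, as the abstract proposes, amounts to sampling k per batch while keeping this same decoding schedule at test time, which is what lets a single model cover a wide range of latency levels.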
Efficient Monotonic Multihead Attention
We introduce the Efficient Monotonic Multihead Attention (EMMA), a
state-of-the-art simultaneous translation model with numerically-stable and
unbiased monotonic alignment estimation. In addition, we present improved
training and inference strategies, including simultaneous fine-tuning from an
offline translation model and reduction of monotonic alignment variance. The
experimental results demonstrate that the proposed model attains
state-of-the-art performance in simultaneous speech-to-text translation on the
Spanish and English translation task.
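For context, monotonic-attention models of this family are trained against an expected alignment computed from per-position selection probabilities. The recurrence below is the standard formulation from earlier monotonic attention work, not EMMA's specific estimator (whose numerical stability and bias corrections the abstract refers to); here α_{i,j} is the probability that target step i attends to source position j, p_{i,j} the probability of stopping at position j, and s_{i-1}, h_j the previous decoder state and the encoder state.

```latex
% Expected monotonic alignment (standard formulation from prior monotonic
% attention work; EMMA targets a numerically stable, unbiased estimate of it).
\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \alpha_{i-1,k} \prod_{l=k}^{j-1} \bigl(1 - p_{i,l}\bigr),
\qquad
p_{i,j} = \sigma\!\bigl(\mathrm{Energy}(s_{i-1}, h_j)\bigr)
```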
Online Versus Offline NMT Quality: An In-depth Analysis on English-German and German-English
In this work, we conduct an evaluation study comparing offline and online
neural machine translation architectures. Two sequence-to-sequence models are
considered: the convolutional Pervasive Attention model (Elbayad et al., 2018)
and the attention-based Transformer (Vaswani et al., 2017). We investigate, for both
architectures, the impact of online decoding constraints on the translation
quality through a carefully designed human evaluation on English-German and
German-English language pairs, the latter being particularly sensitive to
latency constraints. The evaluation results allow us to identify the strengths
and shortcomings of each model when we shift to the online setup.
Comment: Accepted at COLING 202