One-To-Many Multilingual End-to-end Speech Translation
Training end-to-end neural models for spoken language translation
(SLT) still has to cope with extreme data scarcity. The existing
SLT parallel corpora are indeed orders of magnitude smaller than those
available for the closely related tasks of automatic speech recognition (ASR)
and machine translation (MT), which usually comprise tens of millions of
instances. To cope with data paucity, in this paper we explore the
effectiveness of transfer learning in end-to-end SLT by presenting a
multilingual approach to the task. Multilingual solutions are widely studied in
MT and usually rely on "target forcing", in which multilingual
parallel data are combined to train a single model by prepending to the input
sequences a language token that specifies the target language. However, when
tested in speech translation, our experiments show that MT-like target
forcing, used as is, is not effective in discriminating among the target
languages. Thus, we propose a variant that uses target-language embeddings to
shift the input representations in different portions of the space according to
the language, so as to better support the production of output in the desired
target language. Our experiments on end-to-end SLT from English into six
languages show important improvements when translating into similar languages,
especially when these are supported by scarce data. Further improvements are
obtained when using English ASR data as an additional language (up to
BLEU points).
Comment: 8 pages, one figure, version accepted at ASRU 201
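As a rough illustration of the two conditioning schemes this abstract contrasts, the sketch below uses plain Python lists. The `<2de>` token format, the function names, the toy embedding values, and the choice to add the embedding to every frame are assumptions made for illustration, not details taken from the paper.

```python
# Toy sketch: names, token format, and values are illustrative assumptions.

def target_forcing(source_tokens, target_lang):
    """MT-style target forcing: prepend a target-language token to the input."""
    return [f"<2{target_lang}>"] + list(source_tokens)


def shift_by_language(frames, lang_embedding):
    """Embedding-based variant (one possible reading): add the same language
    vector to every input frame, moving the whole sequence to a
    language-specific region of the representation space."""
    return [[x + e for x, e in zip(frame, lang_embedding)] for frame in frames]


# Hypothetical 2-dimensional language embeddings (learned in a real model).
LANG_EMBEDDINGS = {"de": [1.0, 0.0], "it": [0.0, 1.0]}

tokens = target_forcing(["hello", "world"], "de")
frames = [[0.5, 0.5], [0.25, 0.75]]  # stand-in for audio feature frames
shifted = shift_by_language(frames, LANG_EMBEDDINGS["de"])
```

With a prepended token, the language signal occupies a single position of a long audio sequence; with the additive shift, every frame carries it.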
Instance-Based Model Adaptation For Direct Speech Translation
Despite recent technology advancements, the effectiveness of neural
approaches to end-to-end speech-to-text translation is still limited by the
paucity of publicly available training corpora. We tackle this limitation with
a method to improve data exploitation and boost the system's performance at
inference time. Our approach allows us to customize "on the fly" an existing
model to each incoming translation request. At its core, it exploits an
instance selection procedure to retrieve, from a given pool of data, a small
set of samples similar to the input query in terms of latent properties of its
audio signal. The retrieved samples are then used for an instance-specific
fine-tuning of the model. We evaluate our approach in three different
scenarios. In all data conditions (different languages, in/out-of-domain
adaptation), our instance-based adaptation yields consistent performance gains
over static models.
Comment: 6 pages, under review at ICASSP 202
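The instance selection step described in this abstract can be sketched as a nearest-neighbour lookup. Everything specific here is an assumption: the abstract does not say how the latent audio properties are encoded or compared, so the sketch uses fixed toy vectors and cosine similarity, and `retrieve_similar` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_similar(query_vec, pool, k=2):
    """Return the ids of the k pool samples most similar to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]


# Toy pool of (sample id, latent audio vector) pairs; values are made up.
pool = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
selected = retrieve_similar([1.0, 0.0], pool, k=2)

# In the approach described above, the retrieved samples would then drive a
# short, instance-specific fine-tuning of the model before the query is
# translated. That step is not shown here.
```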
End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020
This paper describes FBK's participation in the IWSLT 2020 offline speech
translation (ST) task. The task evaluates systems' ability to translate the audio
of English TED talks into German text. The test talks are provided in two versions:
one contains the data already segmented with automatic tools and the other is
the raw data without any segmentation. Participants can decide whether to work
on custom segmentation or not. We used the provided segmentation. Our system is
an end-to-end model based on an adaptation of the Transformer for speech data.
Its training process is the main focus of this paper and it is based on: i)
transfer learning (ASR pretraining and knowledge distillation), ii) data
augmentation (SpecAugment, time stretch and synthetic data), iii) combining
synthetic and real data marked as different domains, and iv) multi-task
learning using the CTC loss. Finally, after the training with word-level
knowledge distillation is complete, our ST models are fine-tuned using label-smoothed
cross entropy. Our best model scored 29 BLEU on the MuST-C En-De test
set, which is an excellent result compared to recent papers, and 23.7 BLEU on
the same data segmented with VAD, showing the need for researching solutions
addressing this specific data condition.
Comment: Accepted at IWSLT202
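The two training objectives named in this abstract can be written down compactly. The sketch below is a minimal pure-Python version under stated assumptions: the teacher is taken to supply a full probability distribution per target position, label smoothing spreads `eps` uniformly over the whole vocabulary, and the function names are hypothetical. It is not the system's actual implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def word_level_kd_loss(student_logits, teacher_probs):
    """Cross entropy between teacher and student distributions,
    summed over target positions."""
    loss = 0.0
    for logits, probs in zip(student_logits, teacher_probs):
        logp = log_softmax(logits)
        loss -= sum(p * lp for p, lp in zip(probs, logp))
    return loss


def label_smoothed_ce(student_logits, targets, eps=0.1):
    """Cross entropy against a smoothed target: (1 - eps) on the gold
    token plus eps spread uniformly over the vocabulary (assumed variant)."""
    loss = 0.0
    for logits, gold in zip(student_logits, targets):
        logp = log_softmax(logits)
        vocab_size = len(logp)
        for i, lp in enumerate(logp):
            q = (1.0 - eps) * (1.0 if i == gold else 0.0) + eps / vocab_size
            loss -= q * lp
    return loss


student = [[2.0, 0.0, -1.0]]   # one position, vocabulary of three tokens
one_hot_teacher = [[1.0, 0.0, 0.0]]
kd = word_level_kd_loss(student, one_hot_teacher)
nll = label_smoothed_ce(student, [0], eps=0.0)
```

When the teacher distribution is one-hot and `eps` is zero, both losses reduce to the ordinary negative log-likelihood of the gold token.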
On Target Segmentation for Direct Speech Translation
Recent studies on direct speech translation show continuous improvements by
means of data augmentation techniques and bigger deep learning models. While
these methods are helping to close the gap between this new approach and the
more traditional cascaded one, there are many incongruities among different
studies that make it difficult to assess the state of the art. Surprisingly,
one point of discussion is the segmentation of the target text. Character-level
segmentation was initially proposed to obtain an open vocabulary, but it
results in long sequences and long training times. Then, subword-level
segmentation became the state of the art in neural machine translation as it
produces shorter sequences that reduce the training time, while being superior
to word-level models. As such, recent works on speech translation started using
target subwords despite the initial use of characters and some recent claims of
better results at the character level. In this work, we perform an extensive
comparison of the two methods on three benchmarks covering 8 language
directions and multilingual training. Subword-level segmentation compares
favorably in all settings, outperforming its character-level counterpart by
1 to 3 BLEU points.
Comment: 14 pages single column, 4 figures, accepted for presentation at the
AMTA2020 research trac
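To make the sequence-length difference between the two segmentations concrete, here is a toy comparison. The subword segmenter is a simple greedy longest-match over a hand-written vocabulary, used only as a stand-in. It is not BPE or any specific tool used in the paper, and the vocabulary and function names are assumptions.

```python
def char_segment(text):
    """Character-level segmentation; spaces become a visible boundary marker."""
    return ["_" if ch == " " else ch for ch in text]


def subword_segment(text, vocab):
    """Greedy longest-match segmentation of each word, falling back to
    single characters when no longer vocabulary entry matches."""
    pieces = []
    for word in text.split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate until it is in the vocabulary or one char long.
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            pieces.append(word[start:end])
            start = end
    return pieces


VOCAB = {"trans", "lation", "speech"}  # hypothetical subword vocabulary
sentence = "speech translation"
chars = char_segment(sentence)
subwords = subword_segment(sentence, VOCAB)
```

For this sentence the character-level target has 18 symbols and the subword-level target has 3, which illustrates why subwords shorten sequences and training time.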