One-To-Many Multilingual End-to-end Speech Translation
Nowadays, training end-to-end neural models for spoken language translation
(SLT) still has to contend with extreme data scarcity. The existing
SLT parallel corpora are indeed orders of magnitude smaller than those
available for the closely related tasks of automatic speech recognition (ASR)
and machine translation (MT), which usually comprise tens of millions of
instances. To cope with data paucity, in this paper we explore the
effectiveness of transfer learning in end-to-end SLT by presenting a
multilingual approach to the task. Multilingual solutions are widely studied in
MT and usually rely on "target forcing", in which multilingual
parallel data are combined to train a single model by prepending to the input
sequences a language token that specifies the target language. However, when
tested in speech translation, our experiments show that MT-like target
forcing, used as is, is not effective in discriminating among the target
languages. We therefore propose a variant that uses target-language embeddings
to shift the input representations into different portions of the space
according to the language, so as to better support the production of output
in the desired
target language. Our experiments on end-to-end SLT from English into six
languages show important improvements when translating into similar languages,
especially those for which training data are scarce. Further improvements are
obtained when using English ASR data as an additional language (up to
BLEU points). Comment: 8 pages, one figure, version accepted at ASRU 2019
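
As an illustration of the two conditioning schemes discussed above, the following PyTorch sketch (not the authors' code; module and dimension names are assumptions) contrasts MT-like target forcing, which prepends a language token to the input sequence, with the proposed variant, which shifts every input frame by a target-language embedding.

import torch
import torch.nn as nn

class LanguageConditioner(nn.Module):
    def __init__(self, num_langs: int, d_model: int, mode: str = "shift"):
        super().__init__()
        self.mode = mode  # "prepend" = MT-like target forcing; "shift" = the proposed variant
        self.lang_emb = nn.Embedding(num_langs, d_model)

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) input features; lang_id: (batch,) target-language indices
        lang = self.lang_emb(lang_id)  # (batch, d_model)
        if self.mode == "prepend":
            # MT-style: prepend a single language token to the sequence.
            return torch.cat([lang.unsqueeze(1), x], dim=1)
        # Proposed variant: shift every frame by the language embedding,
        # moving the whole utterance into a language-specific region of the space.
        return x + lang.unsqueeze(1)

# Usage: condition two utterances on two different target languages.
cond = LanguageConditioner(num_langs=6, d_model=256, mode="shift")
out = cond(torch.randn(2, 100, 256), torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 100, 256])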
Towards a Deep Understanding of Multilingual End-to-End Speech Translation
In this paper, we employ Singular Value Canonical Correlation Analysis
(SVCCA) to analyze representations learnt in a multilingual end-to-end speech
translation model trained over 22 languages. SVCCA enables us to estimate
representational similarity across languages and layers, enhancing our
understanding of the functionality of multilingual speech translation and its
potential connection to multilingual neural machine translation. The
multilingual speech translation model is trained on the CoVoST 2 dataset in all
possible directions, and we utilize LASER to extract parallel bitext data for
SVCCA analysis. We derive three major findings from our analysis: (I)
Linguistic similarity loses its efficacy in multilingual speech translation
when the training data for a specific language is limited. (II) Enhanced
encoder representations and well-aligned audio-text data significantly improve
translation quality, surpassing the bilingual counterparts when the training
data is not compromised. (III) The encoder representations of the multilingual
speech translation model perform better at predicting phonetic features in
linguistic typology tasks. With these findings, we propose that relaxing the
constraint of limited data for low-resource languages and
subsequently combining them with linguistically related high-resource languages
could offer a more effective approach for multilingual end-to-end speech
translation. Comment: Accepted to Findings of EMNLP 2023
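
For readers unfamiliar with SVCCA, the following NumPy sketch shows the core computation: each representation matrix is reduced with an SVD, and the mean canonical correlation between the reduced subspaces serves as the similarity score. This is a generic illustration under assumed shapes, not the authors' analysis code.

import numpy as np

def svcca(X: np.ndarray, Y: np.ndarray, keep: float = 0.99) -> float:
    # X, Y: (n_samples, n_features) activations for the same n inputs.
    def svd_reduce(M):
        M = M - M.mean(axis=0, keepdims=True)
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        # Keep enough singular directions to explain `keep` of the variance.
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep)) + 1
        return U[:, :k] * s[:k]

    Xr, Yr = svd_reduce(X), svd_reduce(Y)
    # CCA step: canonical correlations are the singular values of Qx^T Qy.
    Qx, _ = np.linalg.qr(Xr)
    Qy, _ = np.linalg.qr(Yr)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.mean(np.clip(rho, 0.0, 1.0)))

# Sanity check: a representation is maximally similar to itself.
A = np.random.randn(500, 64)
print(svcca(A, A))                         # ~1.0
print(svcca(A, np.random.randn(500, 64)))  # substantially lower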
LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers
End-to-end formulation of automatic speech recognition (ASR) and speech
translation (ST) makes it easy to use a single model for both multilingual ASR
and many-to-many ST. In this paper, we propose streaming language-agnostic
multilingual speech recognition and translation using neural transducers
(LAMASSU). To enable multilingual text generation in LAMASSU, we conduct a
systematic comparison between language-specific and unified prediction and joint
networks. We leverage a language-agnostic multilingual encoder that
substantially outperforms shared encoders. To enhance LAMASSU, we propose
feeding the target language ID (LID) to the encoder. We also apply
connectionist temporal classification (CTC) regularization to transducer
training. Experimental results show that LAMASSU not only drastically reduces
the model size but also outperforms monolingual ASR and bilingual ST models.
Comment: Submitted to ICASSP 2023
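
Two of the ingredients named above, feeding the target LID to the encoder and CTC regularization, can be sketched compactly; the transducer loss itself is omitted. This is a toy PyTorch illustration under assumed dimensions, not the LAMASSU implementation.

import torch
import torch.nn as nn

class LIDConditionedEncoder(nn.Module):
    def __init__(self, num_langs=4, d_in=80, d_model=256, vocab=100):
        super().__init__()
        self.lang_emb = nn.Embedding(num_langs, d_model)
        self.proj = nn.Linear(d_in, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.ctc_head = nn.Linear(d_model, vocab)  # vocab includes blank at index 0

    def forward(self, feats, lang_id):
        # Inject the target LID by adding its embedding to every frame.
        x = self.proj(feats) + self.lang_emb(lang_id).unsqueeze(1)
        h, _ = self.rnn(x)
        return h, self.ctc_head(h).log_softmax(-1)

enc = LIDConditionedEncoder()
h, ctc_logp = enc(torch.randn(2, 50, 80), torch.tensor([1, 3]))

# Auxiliary CTC regularization on the encoder output; in transducer training this
# would be combined as: loss = rnnt_loss + weight * ctc_loss (weight is an assumption).
targets = torch.randint(1, 100, (2, 12))
ctc_loss = nn.CTCLoss(blank=0)(
    ctc_logp.transpose(0, 1),           # CTC expects (time, batch, vocab)
    targets,
    input_lengths=torch.tensor([50, 50]),
    target_lengths=torch.tensor([12, 12]),
)
print(ctc_loss.item())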
Multilingual Speech Translation KIT @ IWSLT2021
This paper describes the submission of the Karlsruhe Institute of Technology (KIT) to the Multilingual TEDx translation task in the IWSLT 2021 evaluation campaign. Our main approach is to develop both cascaded and end-to-end systems and eventually combine them to achieve the best possible results in this extremely low-resource setting. The report also confirms consistent architectural improvements to the Transformer architecture across all tasks: translation, transcription, and speech translation.
KIT's Multilingual Speech Translation System for IWSLT 2023
Many existing speech translation benchmarks focus on native-English speech in
high-quality recording conditions, which often do not match the conditions in
real-life use-cases. In this paper, we describe our speech translation system
for the multilingual track of IWSLT 2023, which focuses on the translation of
scientific conference talks. The test condition features accented input speech
and terminology-dense content. The task requires translation into 10
languages with varying amounts of resources. In the absence of training data from the
target domain, we use a retrieval-based approach (kNN-MT) for effective
adaptation (+0.8 BLEU for speech translation). We also use adapters to easily
integrate incremental training data from data augmentation, and show that it
matches the performance of re-training. We observe that cascaded systems are
more easily adaptable towards specific target domains, due to their separate
modules. Our cascaded speech system substantially outperforms its end-to-end
counterpart on scientific talk translation, although their performance remains
similar on TED talks. Comment: IWSLT 2023
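
The retrieval-based adaptation mentioned above follows the general kNN-MT recipe: decoder states are matched against a datastore of (state, next-token) pairs, and the retrieved distribution is interpolated with the model's own prediction. A minimal sketch, with lam, temp, and k as assumed hyperparameters, not the submission's actual configuration:

import torch
import torch.nn.functional as F

def knn_mt_probs(hidden, model_logp, keys, values, k=8, temp=10.0, lam=0.4):
    # hidden: (d,) current decoder state; model_logp: (V,) model log-probs
    # keys: (N, d) stored decoder states; values: (N,) stored next-token ids
    V = model_logp.numel()
    d2 = ((keys - hidden) ** 2).sum(-1)     # squared L2 distance to each entry
    knn_d, knn_i = torch.topk(-d2, k)       # k nearest neighbors
    w = F.softmax(knn_d / temp, dim=0)      # closer entries get more weight
    p_knn = torch.zeros(V).scatter_add_(0, values[knn_i], w)
    # Interpolate the retrieved distribution with the model's prediction.
    return lam * p_knn + (1 - lam) * model_logp.exp()

# Toy usage with a random datastore.
d, V, N = 16, 50, 1000
keys, values = torch.randn(N, d), torch.randint(0, V, (N,))
p = knn_mt_probs(torch.randn(d), torch.log_softmax(torch.randn(V), dim=0), keys, values)
print(p.sum())  # ~1.0, a valid distribution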
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
Joint speech-language training is challenging due to the high demand for
training data and GPU resources, as well as the modality gap between speech
and language. We present ComSL, a speech-language model built atop a composite
architecture of public pretrained speech-only and language-only models and
optimized data-efficiently for spoken language tasks. In particular, we propose
to incorporate cross-modality learning into transfer learning, conducting both
simultaneously for downstream tasks in a multi-task learning manner. Our
approach has demonstrated effectiveness in end-to-end speech-to-text
translation tasks, achieving a new state-of-the-art average BLEU score of 31.5
on the multilingual speech-to-English-text translation task for 21 languages,
as measured on the public CoVoST 2 evaluation set.
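
To make the composite idea concrete, here is a toy PyTorch sketch of the overall shape: a pretrained speech encoder feeding a pretrained text decoder through a small adapter that bridges the modality gap. The modules are stand-ins, not the actual ComSL components.

import torch
import torch.nn as nn

class CompositeST(nn.Module):
    def __init__(self, d_speech=512, d_text=768, vocab=32000):
        super().__init__()
        # Stand-in for a pretrained speech-only encoder.
        self.speech_enc = nn.GRU(80, d_speech, batch_first=True)
        # Adapter that maps speech states into the text model's space.
        self.adapter = nn.Sequential(
            nn.Linear(d_speech, d_text), nn.GELU(), nn.Linear(d_text, d_text))
        # Stand-in for a pretrained language-only decoder.
        self.text_dec = nn.TransformerDecoderLayer(d_text, nhead=8, batch_first=True)
        self.out = nn.Linear(d_text, vocab)

    def forward(self, feats, tgt_emb):
        h, _ = self.speech_enc(feats)
        mem = self.adapter(h)  # speech states projected into the text model's space
        return self.out(self.text_dec(tgt_emb, mem))

model = CompositeST()
logits = model(torch.randn(2, 120, 80), torch.randn(2, 20, 768))
print(logits.shape)  # torch.Size([2, 20, 32000])

In ComSL-style training, such a composite would be optimized jointly with MT, ASR, and cross-modality objectives in a multi-task manner; only the ST forward pass is shown here.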