Tight Integrated End-to-End Training for Cascaded Speech Translation
A cascaded speech translation model relies on discrete and non-differentiable
transcription, which provides a supervision signal from the source side and
helps the transformation between source speech and target text. Such modeling
suffers from error propagation between the ASR and MT models. Direct speech
translation is an alternative that avoids error propagation; however, its
performance often lags behind cascade systems. To use an intermediate
representation and preserve the end-to-end trainability, previous studies have
proposed using two-stage models by passing the hidden vectors of the recognizer
into the decoder of the MT model and ignoring the MT encoder. This work
explores the feasibility of collapsing the entire cascade components into a
single end-to-end trainable model by optimizing all parameters of ASR and MT
models jointly without ignoring any learned parameters. It is a tightly
integrated method that passes renormalized source word posterior distributions
as soft decisions instead of one-hot vectors, enabling backpropagation.
Therefore, it provides both transcriptions and translations and achieves strong
consistency between them. Our experiments on four tasks with different data
scenarios show that the model outperforms cascade models by up to 1.8% in BLEU
and 2.0% in TER and is superior to direct models.
Comment: 8 pages, accepted at SLT202
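The soft-decision coupling described above can be sketched in a few lines of NumPy: instead of feeding the MT encoder an embedding looked up from the ASR argmax (a non-differentiable one-hot decision), the renormalized posterior is used to take an expectation over the MT embedding table, so gradients can flow back into the ASR model. This is a minimal sketch under assumptions: the temperature `tau`, all shapes, and the softmax renormalization are illustrative, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes: ASR vocabulary of 6 source words, MT embedding dim 4.
vocab_size, embed_dim = 6, 4
rng = np.random.default_rng(0)
mt_embeddings = rng.normal(size=(vocab_size, embed_dim))  # MT input embedding table

# ASR decoder logits for a 3-token source hypothesis.
asr_logits = rng.normal(size=(3, vocab_size))

# Hard cascade: argmax -> one-hot -> embedding lookup (non-differentiable).
hard_ids = asr_logits.argmax(axis=-1)
hard_input = mt_embeddings[hard_ids]

# Tight integration: temperature-renormalized posteriors as soft decisions;
# the MT input becomes an expectation over embeddings, so backpropagation
# reaches the ASR parameters.
tau = 0.5  # sharpening temperature (assumed hyperparameter)
posteriors = softmax(asr_logits / tau)
soft_input = posteriors @ mt_embeddings  # shape (3, embed_dim)
```

The soft input has the same shape as the hard one, so the MT encoder is reused unchanged; only the decision at the interface is relaxed.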
Consecutive Decoding for Speech-to-text Translation
Speech-to-text translation (ST), which directly translates the source
language speech to the target language text, has attracted intensive attention
recently. However, the combination of speech recognition and machine
translation in a single model poses a heavy burden on the direct cross-modal
cross-lingual mapping. To reduce the learning difficulty, we propose
COnSecutive Transcription and Translation (COSTT), an integral approach for
speech-to-text translation. The key idea is to generate the source transcript and
target translation text with a single decoder. This benefits model training,
since an additional large parallel text corpus can be fully exploited to enhance
the speech translation training. Our method is verified on three mainstream
datasets, including Augmented LibriSpeech English-French dataset, TED
English-German dataset, and TED English-Chinese dataset. Experiments show that
our proposed COSTT outperforms the previous state-of-the-art methods. The code
is available at https://github.com/dqqcasia/st.
Comment: Accepted by AAAI 2021. arXiv admin note: text overlap with arXiv:2009.0970
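The consecutive-decoding idea can be illustrated as target-side data construction: one decoder sequence carries the transcript followed by the translation, which is also what lets text-only parallel corpora supervise the second half without audio. The marker token names below are hypothetical; the abstract does not specify COSTT's actual special symbols.

```python
def build_consecutive_target(transcript_tokens, translation_tokens):
    """Concatenate transcript and translation into the single target
    sequence a COSTT-style decoder would be trained to generate.
    Marker tokens are illustrative placeholders."""
    return (["<asr>"] + transcript_tokens
            + ["<mt>"] + translation_tokens + ["</s>"])

seq = build_consecutive_target(["hello", "world"], ["bonjour", "monde"])
# seq == ['<asr>', 'hello', 'world', '<mt>', 'bonjour', 'monde', '</s>']
```

Because both tasks share one decoder and one output sequence, a text-only ASR-transcript/translation pair can be formatted the same way and mixed into training.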
Enhancing Transformer for End-to-end Speech-to-Text Translation
Neural end-to-end architectures have recently been proposed for spoken language translation (SLT), following the state-of-the-art results obtained in machine translation (MT) and speech recognition (ASR). Motivated by this contiguity, we propose an SLT adaptation of Transformer (the state-of-the-art architecture in MT), which exploits the integration of ASR solutions to cope with long input sequences featuring low information density. Long audio representations hinder the training of large models due to Transformer's quadratic memory complexity. Moreover, for the sake of translation quality, handling such sequences requires capturing both short- and long-range dependencies between bi-dimensional features. Focusing on Transformer's encoder, our adaptation is based on: i) downsampling the input with convolutional neural networks, which enables model training on non-cutting-edge GPUs, ii) modeling the bidimensional nature of the audio spectrogram with 2D components, and iii) adding a distance penalty to the attention, which is able to bias it towards short-range dependencies. Our experiments show that our SLT-adapted Transformer outperforms the RNN-based baseline both in translation quality and training time, setting the state-of-the-art performance on six language directions.
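Of the three encoder adaptations, the distance penalty (iii) is the easiest to sketch: a term proportional to |i - j| is subtracted from the attention logits before the softmax, so each query favors nearby keys. The penalty weight `lam` and the exact functional form are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distance_penalized_attention(scores, lam=0.1):
    """Subtract lam * |i - j| from the attention logits, biasing each
    query toward nearby keys (lam is an illustrative hyperparameter)."""
    n = scores.shape[-1]
    idx = np.arange(n)
    penalty = lam * np.abs(idx[:, None] - idx[None, :])
    return softmax(scores - penalty, axis=-1)

# With uniform logits, the penalty alone shapes the weights: mass
# concentrates near the diagonal, i.e. on short-range dependencies.
weights = distance_penalized_attention(np.zeros((8, 8)))
```

For long audio sequences this local bias complements the CNN downsampling in (i), which shortens the sequence before attention is applied at all.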
"Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation
An end-to-end speech-to-text translation (ST) system takes audio in a source
language and outputs text in a target language. Existing methods are
limited by the amount of parallel corpora. Can we build a system that fully
utilizes the signals in a parallel ST corpus? We are inspired by the human
understanding system, which is composed of auditory perception and cognitive
processing. In this paper, we propose Listen-Understand-Translate (LUT), a
unified framework with triple supervision signals to decouple the end-to-end
speech-to-text translation task. LUT guides the acoustic encoder to extract as
much information as possible from the auditory input. In addition, LUT utilizes
a pre-trained BERT model to enforce the upper encoder to produce as much
semantic information
as possible, without extra data. We perform experiments on a diverse set of
speech translation benchmarks, including Librispeech English-French, IWSLT
English-German and TED English-Chinese. Our results demonstrate that LUT
achieves state-of-the-art performance, outperforming previous methods. The code
is available at https://github.com/dqqcasia/st.
Comment: Accepted by AAAI 202
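The BERT-based "understand" supervision can be sketched as a distance term between the upper encoder's states and frozen BERT representations of the gold transcript. This is a hedged approximation: the abstract names the three supervision signals but not the loss form, so the mean-squared-distance choice, the shapes, and the variable names here are all assumptions.

```python
import numpy as np

# LUT-style training combines three signals (acoustic, semantic, translation);
# only the semantic term is sketched below.
def semantic_distance_loss(encoder_states, bert_states):
    """Mean squared distance between the upper encoder's outputs and
    frozen BERT states of the gold transcript (assumed loss form)."""
    return float(np.mean((encoder_states - bert_states) ** 2))

enc = np.ones((5, 8))    # upper-encoder outputs: 5 transcript tokens, dim 8
bert = np.zeros((5, 8))  # frozen BERT states for the same transcript
loss = semantic_distance_loss(enc, bert)  # -> 1.0 for this toy input
```

Because BERT is pre-trained and frozen, this term injects semantic supervision without requiring any extra parallel data, matching the "without extra data" claim above.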