Fluent Translations from Disfluent Speech in End-to-End Speech Translation
Spoken language translation applications for speech suffer due to
conversational speech phenomena, particularly the presence of disfluencies.
With the rise of end-to-end speech translation models, processing steps such as
disfluency removal that were previously an intermediate step between speech
recognition and machine translation need to be incorporated into model
architectures. We use a sequence-to-sequence model to translate from noisy,
disfluent speech to fluent text with disfluencies removed using the recently
collected `copy-edited' references for the Fisher Spanish-English dataset. We
are able to directly generate fluent translations and introduce considerations
about how to evaluate success on this task. This work provides a baseline for a
new task, the translation of conversational speech with joint removal of
disfluencies.
Comment: Accepted at NAACL 201
Tutorial: End-to-End Speech Translation
Speech translation is the translation of speech in one language, typically to text in another, traditionally accomplished through a combination of automatic speech recognition and machine translation. Speech translation has attracted interest for many years, but the recent successful applications of deep learning to both individual tasks have enabled new opportunities through joint modeling, in what we today call 'end-to-end speech translation.' In this tutorial we will introduce the techniques used in cutting-edge research on speech translation. Starting from the traditional cascaded approach, we will give an overview of data sources and model architectures for achieving state-of-the-art performance with end-to-end speech translation for both high- and low-resource languages. In addition, we will discuss methods to evaluate and analyze the proposed solutions, as well as the challenges faced when applying speech translation models to real-world applications.
Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer
We introduce and demonstrate how to effectively train multilingual machine
translation models with pixel representations. We experiment with two different
data settings with a variety of language and script coverage, demonstrating
improved performance compared to subword embeddings. We explore various
properties of pixel representations such as parameter sharing within and across
scripts to better understand where they lead to positive transfer. We observe
that these properties not only enable seamless cross-lingual transfer to unseen
scripts, but make pixel representations more data-efficient than alternatives
such as vocabulary expansion. We hope this work contributes to more extensible
multilingual models for all languages and scripts.
Comment: EMNLP 202
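The core idea can be sketched in a few lines: render text as an image and slice it into fixed-size patches that stand in for subword embeddings, so any script that can be rendered needs no vocabulary expansion. The toy glyph renderer below is an illustrative stand-in (real pixel-representation systems rasterize text with actual fonts); the glyph and patch sizes are arbitrary assumptions.

```python
import numpy as np

def render_text(text, glyph_h=8, glyph_w=4):
    """Toy renderer: each codepoint maps to a deterministic pseudo-glyph bitmap.
    A real system would rasterize the string with a font covering the script."""
    glyphs, cols = {}, []
    for ch in text:
        if ch not in glyphs:
            rng = np.random.default_rng(ord(ch))  # deterministic per codepoint
            glyphs[ch] = (rng.random((glyph_h, glyph_w)) > 0.5).astype(np.float32)
        cols.append(glyphs[ch])
    # concatenate glyph bitmaps horizontally into one image strip
    return np.concatenate(cols, axis=1)

def to_patches(image, patch_w=8):
    """Slice the rendered strip into fixed-width patches (the model's 'tokens')."""
    h, w = image.shape
    image = np.pad(image, ((0, 0), (0, (-w) % patch_w)))  # pad to a patch multiple
    n = image.shape[1] // patch_w
    # (h, n*patch_w) -> (n, h*patch_w): one flattened vector per patch
    return image.reshape(h, n, patch_w).transpose(1, 0, 2).reshape(n, h * patch_w)
```

Because the encoder consumes pixels rather than vocabulary indices, identical visual forms share parameters across languages, which is one intuition behind transfer to unseen scripts.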
Relative Positional Encoding for Speech Recognition and Direct Translation
Transformer models are powerful sequence-to-sequence architectures that are
capable of directly mapping speech inputs to transcriptions or translations.
However, the mechanism for modeling positions in this model was tailored for
text modeling, and thus is less ideal for acoustic inputs. In this work, we
adapt the relative position encoding scheme to the Speech Transformer, where
the key addition is relative distance between input states in the
self-attention network. As a result, the network can better adapt to the
variable distributions present in speech data. Our experiments show that our
resulting model achieves the best recognition result on the Switchboard
benchmark in the non-augmentation condition, and the best published result in
the MuST-C speech translation benchmark. We also show that this model is able
to better utilize synthetic data than the Transformer, and adapts better to
variable sentence segmentation quality for speech translation.
Comment: Submitted to Interspeech 202
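The key addition described above, relative distance between input states inside self-attention, can be sketched as a single-head NumPy computation in the style of learned relative-position embeddings. This is a simplified illustration, not the paper's exact parameterization; the clipping distance and dimensions are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relative_self_attention(x, w_q, w_k, w_v, rel_emb, max_dist):
    """x: (T, d) input states (e.g. acoustic frames).
    rel_emb: (2*max_dist + 1, d) learned embeddings for clipped offsets j - i."""
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # content-content term, as in the standard Transformer
    scores = q @ k.T / np.sqrt(d)
    # clipped relative distances j - i, shifted to index rel_emb
    dist = np.arange(T)[None, :] - np.arange(T)[:, None]
    idx = np.clip(dist, -max_dist, max_dist) + max_dist
    # content-position term: each query attends to an offset embedding r_{j-i}
    rel = np.einsum('id,ijd->ij', q, rel_emb[idx]) / np.sqrt(d)
    return softmax(scores + rel) @ v
```

Because scores depend on offsets j - i rather than absolute indices, the same learned biases apply regardless of where a segment starts, which is why such models tolerate variable segmentation better than absolute positional encodings.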
On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss
Although unsupervised neural machine translation (UNMT) has achieved success in many language pairs, the copying problem, i.e., directly copying parts of the input sentence as the translation, is common among distant language pairs, especially when low-resource languages are involved. We find this issue is closely related to an unexpected copying behavior during online back-translation (BT). In this work, we propose a simple but effective training schedule that incorporates a language discriminator loss. The loss imposes constraints on the intermediate translation so that the translation is in the desired language. Through extensive experiments on different language pairs, covering both similar and distant pairs and both high- and low-resource languages, we find that our method alleviates the copying problem, thus improving translation performance on low-resource languages.
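The discriminator term can be illustrated with a minimal sketch. This is not the paper's exact formulation: here we assume the discriminator reduces to a vocabulary mask marking which tokens belong to the desired intermediate language, and we penalize probability mass placed outside it (e.g. on tokens copied verbatim from the source).

```python
import numpy as np

def language_discriminator_loss(token_probs, lang_mask, eps=1e-9):
    """token_probs: (T, V) model distribution at each intermediate-translation step.
    lang_mask: (V,) 1.0 for vocabulary items of the desired language, else 0.0.
    Returns the mean negative log of the in-language probability mass."""
    mass_in_lang = (token_probs * lang_mask).sum(axis=-1)  # (T,)
    return -np.log(mass_in_lang + eps).mean()

def training_loss(bt_loss, token_probs, lang_mask, weight=1.0):
    # total objective: online back-translation loss plus the discriminator term
    return bt_loss + weight * language_discriminator_loss(token_probs, lang_mask)
```

An intermediate translation that keeps its mass in the desired language incurs a small penalty, while one that copies source-language tokens is pushed back toward the target side of the vocabulary:

```python
mask = np.array([1, 1, 1, 0, 0, 0], dtype=float)      # first 3 ids: desired language
fluent = np.tile([0.3, 0.3, 0.3, 0.03, 0.03, 0.04], (4, 1))  # mostly in-language
copied = np.tile([0.03, 0.03, 0.04, 0.3, 0.3, 0.3], (4, 1))  # mostly copied source
assert language_discriminator_loss(fluent, mask) < language_discriminator_loss(copied, mask)
```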