Analyzing ASR Pretraining for Low-Resource Speech-to-Text Translation
Previous work has shown that for low-resource source languages, automatic
speech-to-text translation (AST) can be improved by pretraining an end-to-end
model on automatic speech recognition (ASR) data from a high-resource language.
However, it is not clear what factors (e.g., language relatedness or the size of
the pretraining data) yield the biggest improvements, or whether pretraining
can be effectively combined with other methods such as data augmentation. Here,
we experiment with pretraining on datasets of varying sizes, including
languages related and unrelated to the AST source language. We find that the
best predictor of final AST performance is the word error rate of the
pretrained ASR model, and that differences in ASR/AST performance correlate
with how phonetic information is encoded in the later RNN layers of our model.
We also show that pretraining and data augmentation yield complementary
benefits for AST.
Comment: Accepted at ICASSP 202
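Since the word error rate of the pretrained ASR model is reported as the best predictor of downstream AST quality, a minimal sketch of the standard WER computation (word-level edit distance; illustrative code, not the paper's implementation) may be useful:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cats sat"))  # 1 substitution out of 3 words
```

Lower WER on the pretraining language would, per the abstract's finding, predict better AST after transfer.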
Tight Integrated End-to-End Training for Cascaded Speech Translation
A cascaded speech translation model relies on discrete and non-differentiable
transcription, which provides a supervision signal from the source side and
helps the transformation between source speech and target text. Such modeling
suffers from error propagation between the ASR and MT models. Direct speech
translation is an alternative that avoids error propagation; however, its
performance often lags behind that of cascade systems. To use an intermediate
representation and preserve the end-to-end trainability, previous studies have
proposed using two-stage models by passing the hidden vectors of the recognizer
into the decoder of the MT model and ignoring the MT encoder. This work
explores the feasibility of collapsing the entire cascade components into a
single end-to-end trainable model by optimizing all parameters of ASR and MT
models jointly without ignoring any learned parameters. It is a tightly
integrated method that passes renormalized source word posterior distributions
as a soft decision instead of one-hot vectors and enables backpropagation.
Therefore, it provides both transcriptions and translations and achieves strong
consistency between them. Our experiments on four tasks with different data
scenarios show that the model outperforms cascade models by up to 1.8% in BLEU and
2.0% in TER and is superior to direct models.
Comment: 8 pages, accepted at SLT202
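The key trick above, passing renormalized posteriors instead of one-hot vectors, can be illustrated with a toy sketch (numpy; shapes, temperature value, and random inputs are made up, not the paper's configuration). The ASR posterior over the source vocabulary is used to take an expectation over the MT source embedding table, so the "decision" stays differentiable:

```python
import numpy as np

def soft_decision_embeddings(logits, embed_matrix, temperature=1.0):
    """Turn ASR output logits into differentiable 'soft' source embeddings.

    Instead of a non-differentiable argmax one-hot, renormalize the posterior
    with a temperature and take the expected MT source embedding, so gradients
    from the MT model can flow back into the ASR model.
    """
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    post = np.exp(z)
    post /= post.sum(axis=-1, keepdims=True)   # renormalized word posterior
    return post @ embed_matrix                 # (T, d): expected embeddings

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))   # 5 source positions, toy vocab of 8
E = rng.normal(size=(8, 16))       # toy MT source embedding table
soft = soft_decision_embeddings(logits, E, temperature=0.5)
print(soft.shape)  # (5, 16)
```

A sharper temperature pushes the posterior toward the one-hot cascade behaviour while keeping the whole pipeline trainable end-to-end.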
Adaptive Feature Selection for End-to-End Speech Translation
Information in speech signals is not evenly distributed, which makes it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR. An ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0DROP (Zhang et al., 2020) as the backbone for AFS and adapt it to sparsify speech features with respect to both the temporal and feature dimensions. Results on the LibriSpeech En-Fr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out ~84% of temporal features, yielding an average translation gain of ~1.3-1.6 BLEU and a decoding speedup of ~1.4x. In particular, AFS reduces the performance gap relative to the cascade baseline and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).
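The inference-time effect of AFS, dropping uninformative frames via learned gates, can be sketched as follows (numpy; the random gate logits stand in for trained L0Drop gates, and the deterministic threshold mimics how stochastic gates collapse at test time):

```python
import numpy as np

def afs_prune(features, gate_logits, threshold=0.5):
    """Keep only the frames whose learned gate exceeds the threshold.

    features:    (T, d) encoded speech frames
    gate_logits: (T,) scalar per frame from a trained L0Drop-style gate
    At inference the stochastic gates become a deterministic keep/drop
    decision; here approximated by sigmoid thresholding.
    """
    gates = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid
    keep = gates > threshold
    return features[keep], keep

T, d = 100, 4
rng = np.random.default_rng(1)
feats = rng.normal(size=(T, d))
# hypothetical trained gate that marks only a minority of frames informative
logits = rng.normal(loc=-2.0, scale=1.5, size=T)
pruned, mask = afs_prune(feats, logits)
print(pruned.shape[0] / T)  # fraction of frames passed on to the ST encoder
```

The ST encoder then attends over the much shorter pruned sequence, which is where the reported decoding speedup comes from.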
ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation
Joint speech-language training is challenging due to the large demand for
training data and GPU consumption, as well as the modality gap between speech
and language. We present ComSL, a speech-language model built atop a composite
architecture of public pretrained speech-only and language-only models and
optimized data-efficiently for spoken language tasks. In particular, we propose
to incorporate cross-modality learning into transfer learning and conduct them
simultaneously for downstream tasks in a multi-task learning manner. Our
approach has demonstrated effectiveness in end-to-end speech-to-text
translation tasks, achieving a new state-of-the-art average BLEU score of 31.5
on the multilingual speech to English text translation task for 21 languages,
as measured on the public CoVoST2 evaluation set.
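Conducting cross-modality learning and transfer learning "simultaneously ... in a multi-task learning manner" amounts to jointly minimizing a weighted sum of per-task objectives. A minimal sketch (pure Python; the task names and weights are illustrative placeholders, not ComSL's actual loss configuration):

```python
def composite_loss(task_losses, weights):
    """Weighted sum of per-task losses optimized jointly each step."""
    return sum(weights[task] * loss for task, loss in task_losses.items())

# hypothetical per-batch losses for the jointly trained objectives
losses = {"st": 2.4, "asr": 1.1, "mt": 0.9, "cross_modal": 0.3}
weights = {"st": 1.0, "asr": 0.5, "mt": 0.5, "cross_modal": 0.2}  # illustrative
total = composite_loss(losses, weights)
print(round(total, 2))
```

In practice the weights balance how strongly the auxiliary tasks regularize the main speech-to-text translation objective.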
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech
This paper proposes Virtuoso, a massively multilingual speech-text joint
semi-supervised learning framework for text-to-speech synthesis (TTS) models.
Existing multilingual TTS systems typically support tens of languages, a
small fraction of the thousands of languages in the world. One difficulty in
scaling multilingual TTS to hundreds of languages is collecting high-quality
speech-text paired data in low-resource languages. This study extends Maestro,
a speech-text joint pretraining framework for automatic speech recognition
(ASR), to speech generation tasks. To train a TTS model from various types of
speech and text data, different training schemes are designed to handle
supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and
unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS
models trained on Virtuoso can achieve significantly better naturalness and
intelligibility than baseline ones in seen languages, and 2) they can
synthesize reasonably intelligible and natural-sounding speech for unseen
languages for which no high-quality paired TTS data is available.
Comment: Submitted to ICASSP 202
Unit-based Speech-to-Speech Translation Without Parallel Data
We propose an unsupervised speech-to-speech translation (S2ST) system that
does not rely on parallel data between the source and target languages. Our
approach maps source and target language speech signals into automatically
discovered, discrete units and reformulates the problem as unsupervised
unit-to-unit machine translation. We develop a three-step training procedure
that involves (a) pre-training a unit-based encoder-decoder language model
with a denoising objective, (b) training it with word-by-word translated
utterance pairs created by aligning monolingual text embedding spaces, and (c)
running unsupervised backtranslation bootstrapping off of the initial
translation model. Our approach avoids mapping the speech signal into text and
uses speech-to-unit and unit-to-speech models instead of automatic speech
recognition and text-to-speech models. We evaluate our model on
synthetic-speaker Europarl-ST English-German and German-English evaluation
sets, finding that unit-based translation is feasible under this constrained
scenario, achieving 9.29 ASR-BLEU in German to English and 8.07 in English to
German.
Comment: 17 pages, 3 figure
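Step (c), unsupervised backtranslation, can be illustrated with a toy sketch (pure Python; the lookup table stands in for the real reverse unit-to-unit model, and the unit names are made up). Monolingual target-side unit sequences are translated back to produce synthetic (source, target) pairs for the forward model:

```python
def backtranslation_round(mono_tgt, tgt_to_src):
    """One backtranslation round: build synthetic (src, tgt) training pairs
    from monolingual target unit sequences using the current reverse model
    (here a toy lookup; unknown units are copied through)."""
    pairs = []
    for tgt_seq in mono_tgt:
        src_seq = [tgt_to_src.get(u, u) for u in tgt_seq]
        pairs.append((src_seq, tgt_seq))
    return pairs

# toy reverse "model" bootstrapped from word-by-word initialization
tgt_to_src = {"u7": "u1", "u9": "u3"}
mono = [["u7", "u9"], ["u9", "u5"]]
print(backtranslation_round(mono, tgt_to_src))
# → [(['u1', 'u3'], ['u7', 'u9']), (['u3', 'u5'], ['u9', 'u5'])]
```

Iterating this round with progressively better forward and reverse models is what lets the system bootstrap translation without any parallel data.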
Towards Robust End-to-End Speech Translation
Noisy inputs cause performance drops in speech recognition, and the same happens in the more complex case of speech translation. We explore speech enhancement techniques in a multi-task setting for end-to-end speech translation. Interest in speech-to-text translation systems has grown remarkably in recent years, driven mainly by the need to adapt digital content, for example on social networks or video streaming platforms, to its users. In addition, we now have high-quality automatic speech recognition and text translation systems, which makes it the perfect time to investigate speech translation systems. Traditionally, cascade systems (ASR + MT) have worked best, but great advances have recently been made in end-to-end systems, which show their potential. This work studies the robustness of both approaches, with the aim of establishing which one is more resistant to noise. A series of experiments was performed to determine which system is more robust: both cascade and end-to-end systems were trained with different noise levels using data from MuST-C En-Es, which contains 504 hours of speech, to study the difference in their performance. End-to-end systems systematically achieved higher performance. Despite that, the behaviour of cascade systems under noisy signals is quite similar, although they do not reach the same performance. Moreover, training with noise provides considerable stability and robustness to either system.
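Training with different noise levels, as in the experiments above, typically means mixing noise into clean speech at a target signal-to-noise ratio. A minimal sketch (numpy; the synthetic sine tone and Gaussian noise are placeholders for real speech and real noise recordings):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR in dB."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
sr = 16000
clean = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s toy "speech"
noise = rng.normal(size=sr)
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Sweeping `snr_db` over a range of values during training is one simple way to obtain the noise-robustness the abstract reports.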
Speech Translation with Foundation Models and Optimal Transport: UPC at IWSLT23
This paper describes the submission of the UPC Machine Translation group to
the IWSLT 2023 Offline Speech Translation task. Our Speech Translation systems
utilize foundation models for speech (wav2vec 2.0) and text (mBART50). We
incorporate a Siamese pretraining step of the speech and text encoders with CTC
and Optimal Transport, to adapt the speech representations to the space of the
text model, thus maximizing transfer learning from MT. After this pretraining,
we fine-tune our system end-to-end on ST, with Cross Entropy and Knowledge
Distillation. Apart from the available ST corpora, we create synthetic data
with SegAugment to better adapt our models to the custom segmentations of the
IWSLT test sets. Our best single model obtains 31.2 BLEU points on MuST-C
tst-COMMON, 29.8 points on IWSLT.tst2020 and 33.4 points on the newly released
IWSLT.ACLdev2023.
Comment: IWSLT 202
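The Optimal Transport part of the Siamese pretraining computes an OT cost between speech-encoder and text-encoder states and minimizes it to pull the two representation spaces together. A toy Sinkhorn sketch (numpy, uniform marginals, random stand-ins for encoder states; entropic regularization here is just one common way to compute OT, not necessarily the submission's exact solver):

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Entropy-regularized optimal transport plan with uniform marginals."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-cost / eps)
    v = np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)      # alternate scaling updates
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
speech = rng.normal(size=(12, 8))  # 12 toy speech-encoder states, dim 8
text = rng.normal(size=(7, 8))     # 7 toy text-encoder states
cost = ((speech[:, None, :] - text[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()           # normalize so exp(-cost/eps) stays stable
plan = sinkhorn_plan(cost)
ot_loss = float((plan * cost).sum())
```

Backpropagating `ot_loss` through the speech encoder (with the text side as the target space) is what adapts the speech representations toward the MT model.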