
    Analyzing ASR Pretraining for Low-Resource Speech-to-Text Translation

    Previous work has shown that for low-resource source languages, automatic speech-to-text translation (AST) can be improved by pretraining an end-to-end model on automatic speech recognition (ASR) data from a high-resource language. However, it is not clear which factors (e.g., language relatedness or size of the pretraining data) yield the biggest improvements, or whether pretraining can be effectively combined with other methods such as data augmentation. Here, we experiment with pretraining on datasets of varying sizes, including languages both related and unrelated to the AST source language. We find that the best predictor of final AST performance is the word error rate of the pretrained ASR model, and that differences in ASR/AST performance correlate with how phonetic information is encoded in the later RNN layers of our model. We also show that pretraining and data augmentation yield complementary benefits for AST.
    Comment: Accepted at ICASSP 202
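    As a rough illustration of the pretraining recipe studied here, the sketch below pretrains an encoder-decoder on high-resource ASR data and then initializes the AST encoder from it. The architecture, sizes, and vocabularies are placeholder assumptions, not the authors' exact model:

```python
import torch.nn as nn

class SpeechEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, layers=3):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=layers,
                           batch_first=True, bidirectional=True)

    def forward(self, feats):                    # feats: (batch, frames, n_mels)
        out, _ = self.rnn(feats)
        return out                               # (batch, frames, 2 * hidden)

class Seq2Seq(nn.Module):
    def __init__(self, encoder, vocab_size):
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Linear(2 * 256, vocab_size)  # stand-in for a full decoder

    def forward(self, feats):
        return self.proj(self.encoder(feats))

asr_model = Seq2Seq(SpeechEncoder(), vocab_size=1000)     # high-resource ASR
# ... train asr_model on high-resource ASR data, tracking its WER ...

# Transfer: initialize the AST encoder from the pretrained ASR encoder; the
# output layer is new because the target language and vocabulary differ.
ast_model = Seq2Seq(SpeechEncoder(), vocab_size=8000)
ast_model.encoder.load_state_dict(asr_model.encoder.state_dict())
# ... fine-tune ast_model end-to-end on the low-resource AST pairs ...
```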

    Adaptive Feature Selection for End-to-End Speech Translation

    Information in speech signals is not evenly distributed, which makes it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR. An ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0DROP (Zhang et al., 2020) as the backbone for AFS and adapt it to sparsify speech features with respect to both the temporal and feature dimensions. Results on LibriSpeech En-Fr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out ~84% of temporal features, yielding an average translation gain of ~1.3-1.6 BLEU and a decoding speedup of ~1.4x. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).
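    For readers unfamiliar with the backbone, below is a minimal sketch of an L0Drop-style hard-concrete gate of the kind AFS adapts, applied here along the temporal dimension only. Each encoder state gets a stochastic gate, and an expected-L0 penalty pushes gates to exactly zero so uninformative frames are pruned. The hyperparameters are the usual hard-concrete defaults, not necessarily the paper's:

```python
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Gates each encoder state; an L0 penalty drives gates to exactly zero."""
    def __init__(self, dim, beta=0.5, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.logit = nn.Linear(dim, 1)         # gate logit from the feature itself
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, h):                      # h: (batch, frames, dim)
        log_alpha = self.logit(h)
        if self.training:                      # stochastic gates at training time
            u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / self.beta)
        else:
            s = torch.sigmoid(log_alpha)
        z = (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)
        # Expected number of open gates: minimized as the sparsity loss.
        l0 = torch.sigmoid(
            log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).mean()
        return h * z, l0

gate = HardConcreteGate(dim=512)
gated, sparsity = gate(torch.randn(4, 100, 512))
```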

    Tight Integrated End-to-End Training for Cascaded Speech Translation

    A cascaded speech translation model relies on a discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between the ASR and MT models. Direct speech translation is an alternative that avoids error propagation; however, its performance often lags behind the cascade system. To use an intermediate representation while preserving end-to-end trainability, previous studies have proposed two-stage models that pass the hidden vectors of the recognizer into the decoder of the MT model, ignoring the MT encoder. This work explores the feasibility of collapsing the entire cascade into a single end-to-end trainable model by optimizing all parameters of the ASR and MT models jointly, without discarding any learned parameters. It is a tightly integrated method that passes renormalized source word posterior distributions as a soft decision instead of one-hot vectors, enabling backpropagation. It therefore provides both transcriptions and translations and achieves strong consistency between them. Our experiments on four tasks with different data scenarios show that the model outperforms cascade models by up to 1.8% in BLEU and 2.0% in TER, and is superior to direct models.
    Comment: 8 pages, accepted at SLT202
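    The differentiable hand-off is the key trick. Below is a minimal sketch of how a renormalized ASR posterior can be embedded as a soft decision: an expectation over the MT embedding table instead of an argmax lookup, so gradients flow across the ASR/MT boundary. The softmax temperature used for renormalization is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, emb_dim = 1000, 512
mt_embedding = nn.Embedding(vocab, emb_dim)      # the MT model's source embeddings

def soft_source_embeddings(asr_logits, temperature=1.0):
    """asr_logits: (batch, src_len, vocab) from the ASR decoder."""
    # Renormalize the posterior, then take its expectation over the MT embedding
    # table instead of a hard one-hot lookup, keeping the pipeline differentiable.
    posterior = F.softmax(asr_logits / temperature, dim=-1)
    return posterior @ mt_embedding.weight       # (batch, src_len, emb_dim)

asr_logits = torch.randn(2, 7, vocab, requires_grad=True)
soft_source_embeddings(asr_logits).sum().backward()  # gradients reach the ASR side
```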

    ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

    Joint speech-language training is challenging due to the large demand for training data and GPU resources, as well as the modality gap between speech and language. We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models and optimized in a data-efficient manner for spoken language tasks. In particular, we propose to incorporate cross-modality learning into transfer learning and to conduct them simultaneously for downstream tasks in a multi-task learning manner. Our approach has demonstrated effectiveness in end-to-end speech-to-text translation, achieving a new state-of-the-art average BLEU score of 31.5 on the multilingual speech-to-English-text translation task for 21 languages, as measured on the public CoVoST2 evaluation set.
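    A minimal sketch of the composite, multi-task idea follows, with stand-in modules rather than ComSL's actual pretrained components. The cross-modality term here is a simple mean-pooled representation-matching loss, and the loss weights are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_enc = nn.GRU(80, 512, batch_first=True)   # stand-in for a pretrained speech model
text_emb = nn.Embedding(8000, 512)
text_enc = nn.GRU(512, 512, batch_first=True)    # stand-in for a pretrained text model

def cross_modality_loss(feats, src_ids):
    """Pull mean-pooled speech and text representations of the same utterance together."""
    speech_h, _ = speech_enc(feats)              # (batch, frames, 512)
    text_h, _ = text_enc(text_emb(src_ids))      # (batch, tokens, 512)
    return F.mse_loss(speech_h.mean(dim=1), text_h.mean(dim=1))

# Multi-task objective: downstream task losses plus the cross-modality term.
feats = torch.randn(2, 300, 80)
src_ids = torch.randint(0, 8000, (2, 20))
st_loss = asr_loss = torch.tensor(0.0)           # placeholders for real task losses
total = st_loss + asr_loss + 0.5 * cross_modality_loss(feats, src_ids)
```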

    Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

    This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained with Virtuoso achieve significantly better naturalness and intelligibility than baselines in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages where no high-quality paired TTS data is available.
    Comment: Submitted to ICASSP 202
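    A minimal sketch of how such a joint training scheme can dispatch objectives by data type; the batch layout and the model methods named below are hypothetical placeholders, not Virtuoso's actual losses:

```python
def training_step(model, batch):
    """Dispatch the objective by data type; all model methods are hypothetical."""
    if batch["kind"] == "paired_tts":            # (text, speech) pairs
        return model.tts_loss(batch["text"], batch["speech"])
    if batch["kind"] == "paired_asr":            # (speech, text) pairs
        return model.asr_loss(batch["speech"], batch["text"])
    if batch["kind"] == "speech_only":           # untranscribed speech
        return model.speech_reconstruction_loss(batch["speech"])
    if batch["kind"] == "text_only":             # unspoken text
        return model.text_loss(batch["text"])
    raise ValueError(f"unknown batch kind: {batch['kind']}")
```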

    Unit-based Speech-to-Speech Translation Without Parallel Data

    We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages. Our approach maps source and target language speech signals into automatically discovered, discrete units and reformulates the problem as unsupervised unit-to-unit machine translation. We develop a three-step training procedure that involves (a) pre-training a unit-based encoder-decoder language model with a denoising objective, (b) training it with word-by-word translated utterance pairs created by aligning monolingual text embedding spaces, and (c) running unsupervised backtranslation, bootstrapping off the initial translation model. Our approach avoids mapping the speech signal into text, using speech-to-unit and unit-to-speech models instead of automatic speech recognition and text-to-speech models. We evaluate our model on synthetic-speaker Europarl-ST English-German and German-English evaluation sets, finding that unit-based translation is feasible under this constrained scenario, achieving 9.29 ASR-BLEU in German-to-English and 8.07 in English-to-German.
    Comment: 17 pages, 3 figure
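    A minimal sketch of the speech-to-unit step that replaces ASR in this kind of pipeline: continuous self-supervised features are quantized to their nearest k-means centroid, and consecutive repeats are collapsed before the unit-to-unit translation model consumes the sequence. The feature extractor and codebook below are random stand-ins:

```python
import torch

def speech_to_units(features, codebook):
    """features: (frames, dim) SSL features; codebook: (K, dim) k-means centroids."""
    units = torch.cdist(features, codebook).argmin(dim=-1)  # nearest centroid per frame
    keep = torch.ones_like(units, dtype=torch.bool)
    keep[1:] = units[1:] != units[:-1]           # collapse consecutive repeats
    return units[keep]

features = torch.randn(200, 768)                 # stand-in for e.g. HuBERT features
codebook = torch.randn(100, 768)                 # a 100-unit discrete vocabulary
print(speech_to_units(features, codebook)[:10])
```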

    Towards Robust End-to-End Speech Translation

    Noisy inputs cause performance drops in speech recognition, and the same happens in the more complex case of speech translation. We want to explore speech enhancement techniques in a multi-task setting for end-to-end speech translation.
    Interest in speech-to-text translation systems has grown remarkably in recent years. The main motivation is the need to adapt digital content to the users who consume it, for example, on social networks or video streaming platforms. In addition, we now have high-quality automatic speech recognition and text translation systems, which makes this the ideal time to investigate speech translation. Traditionally, cascade systems (ASR + MT) have worked best, but great advances have recently been made in end-to-end systems, which show their potential. This work studies the robustness of both approaches, with the aim of establishing which one is more resistant to noise. A series of experiments was performed to determine which system is more robust: both cascade and end-to-end systems were trained with different noise levels using data from MuST-C En-Es, which contains 504 hours of speech, to compare their performance. End-to-end systems systematically achieved higher performance; despite that, the behaviour of cascade systems under noise is quite similar, although they do not reach the same performance. Moreover, training with noise provides considerable stability and robustness to either system.
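    The core experimental knob here is noise-augmented training. Below is a minimal sketch of one common way to do it: mix noise into the clean waveform at a sampled signal-to-noise ratio before feature extraction, so models see noisy inputs at train time. The SNR range and noise source are illustrative assumptions, not the thesis's exact setup:

```python
import torch

def mix_at_snr(speech, noise, snr_db):
    """speech, noise: 1-D waveforms of equal length; snr_db: target SNR in dB."""
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    # Scale the noise so that 10 * log10(speech_power / noise_power') == snr_db.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

speech = torch.randn(16000)                  # 1 s of 16 kHz audio (placeholder)
noise = torch.randn(16000)
snr = torch.empty(1).uniform_(0, 20).item()  # sample an SNR between 0 and 20 dB
noisy = mix_at_snr(speech, noise, snr)
```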

    Speech Translation with Foundation Models and Optimal Transport: UPC at IWSLT23

    This paper describes the submission of the UPC Machine Translation group to the IWSLT 2023 Offline Speech Translation task. Our speech translation systems utilize foundation models for speech (wav2vec 2.0) and text (mBART50). We incorporate a Siamese pretraining step of the speech and text encoders with CTC and Optimal Transport, to adapt the speech representations to the space of the text model, thus maximizing transfer learning from MT. After this pretraining, we fine-tune our system end-to-end on ST, with Cross Entropy and Knowledge Distillation. Apart from the available ST corpora, we create synthetic data with SegAugment to better adapt our models to the custom segmentations of the IWSLT test sets. Our best single model obtains 31.2 BLEU points on MuST-C tst-COMMON, 29.8 points on IWSLT.tst2020 and 33.4 points on the newly released IWSLT.ACLdev2023.
    Comment: IWSLT 202
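    A minimal sketch of an entropic optimal transport loss of the kind used for such Siamese speech-text alignment, implemented with Sinkhorn iterations over the pairwise cost between encoder states. The cost normalization, epsilon, and iteration count are illustrative assumptions, not the submission's exact settings:

```python
import torch

def sinkhorn_ot_loss(speech_h, text_h, eps=0.1, iters=50):
    """speech_h: (T, d) speech encoder states; text_h: (S, d) text encoder states."""
    cost = torch.cdist(speech_h, text_h)         # pairwise L2 costs, (T, S)
    cost = cost / cost.detach().max()            # normalize for numerical stability
    a = torch.full((cost.shape[0],), 1.0 / cost.shape[0])  # uniform marginals
    b = torch.full((cost.shape[1],), 1.0 / cost.shape[1])
    K = torch.exp(-cost / eps)                   # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(iters):                       # Sinkhorn fixed-point iterations
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)   # soft alignment between sequences
    return (plan * cost).sum()                   # entropic OT transport cost

speech_h = torch.randn(120, 256, requires_grad=True)  # speech encoder states
text_h = torch.randn(20, 256)                         # text encoder states
sinkhorn_ot_loss(speech_h, text_h).backward()         # the alignment is differentiable
```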