485 research outputs found
Training ASR models by Generation of Contextual Information
Supervised ASR models have reached unprecedented levels of accuracy, thanks
in part to ever-increasing amounts of labelled training data. However, in many
applications and locales, only moderate amounts of data are available, which
has led to a surge in semi- and weakly-supervised learning research. In this
paper, we conduct a large-scale study evaluating the effectiveness of
weakly-supervised learning for speech recognition by using loosely related
contextual information as a surrogate for ground-truth labels. For weakly
supervised training, we use 50k hours of public English social media videos
along with their respective titles and post text to train an encoder-decoder
transformer model. Our best encoder-decoder models achieve an average of 20.8%
WER reduction over a 1000 hours supervised baseline, and an average of 13.4%
WER reduction when using only the weakly supervised encoder for CTC
fine-tuning. Our results show that our setup for weak supervision improved both
the encoder acoustic representations as well as the decoder language generation
abilities
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
We introduce VoxPopuli, a large-scale multilingual corpus providing 100K
hours of unlabelled speech data in 23 languages. It is the largest open data to
date for unsupervised representation learning as well as semi-supervised
learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16
languages and their aligned oral interpretations into 5 other languages
totaling 5.1K hours. We provide speech recognition baselines and validate the
versatility of VoxPopuli unlabelled data in semi-supervised learning under
challenging out-of-domain settings. We will release the corpus at
https://github.com/facebookresearch/voxpopuli under an open license.Comment: Accepted to ACL 2021 (long paper
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
We introduce the Universal Speech Model (USM), a single large model that
performs automatic speech recognition (ASR) across 100+ languages. This is
achieved by pre-training the encoder of the model on a large unlabeled
multilingual dataset of 12 million (M) hours spanning over 300 languages, and
fine-tuning on a smaller labeled dataset. We use multilingual pre-training with
random-projection quantization and speech-text modality matching to achieve
state-of-the-art performance on downstream multilingual ASR and speech-to-text
translation tasks. We also demonstrate that despite using a labeled training
set 1/7-th the size of that used for the Whisper model, our model exhibits
comparable or better performance on both in-domain and out-of-domain speech
recognition tasks across many languages.Comment: 20 pages, 7 figures, 8 table
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
We summarize the results of a host of efforts using giant automatic speech
recognition (ASR) models pre-trained using large, diverse unlabeled datasets
containing approximately a million hours of audio. We find that the combination
of pre-training, self-training and scaling up model size greatly increases data
efficiency, even for extremely large tasks with tens of thousands of hours of
labeled data. In particular, on an ASR task with 34k hours of labeled data, by
fine-tuning an 8 billion parameter pre-trained Conformer model we can match
state-of-the-art (SoTA) performance with only 3% of the training data and
significantly improve SoTA with the full training set. We also report on the
universal benefits gained from using big pre-trained and self-trained models
for a large set of downstream tasks that cover a wide range of speech domains
and span multiple orders of magnitudes of dataset sizes, including obtaining
SoTA performance on many public benchmarks. In addition, we utilize the learned
representation of pre-trained networks to achieve SoTA results on non-ASR
tasks.Comment: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference
baselines and bibliography updated; v3: corrections based on reviewer
feedback, bibliography update
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) aims to convert speech from one
language into another, and has demonstrated significant progress to date.
Despite the recent success, current S2ST models still suffer from distinct
degradation in noisy environments and fail to translate visual speech (i.e.,
the movement of lips and teeth). In this work, we present AV-TranSpeech, the
first audio-visual speech-to-speech (AV-S2ST) translation model without relying
on intermediate text. AV-TranSpeech complements the audio stream with visual
information to promote system robustness and opens up a host of practical
applications: dictation or dubbing archival films. To mitigate the data
scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised
pre-training with unlabeled audio-visual data to learn contextual
representation, and 2) introduce cross-modal distillation with S2ST models
trained on the audio-only corpus to further reduce the requirements of visual
data. Experimental results on two language pairs demonstrate that AV-TranSpeech
outperforms audio-only models under all settings regardless of the type of
noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation
yields an improvement of 7.6 BLEU on average compared with baselines. Audio
samples are available at https://AV-TranSpeech.github.ioComment: Accepted to ACL 202
- …