
485 research outputs found

Training ASR models by Generation of Contextual Information

Authors: Edunov Sergey; Girshick Ross; Liu Jun; Mohamed Abdelrahman; Okhonko Dmytro; Peng Fuchun; Saraf Yatharth; Singh Kritika; Wang Yongqiang; Zhang Frank; Zweig Geoffrey
Publication date: 14/02/2020

Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly supervised training, we use 50k hours of public English social media videos along with their respective titles and post text to train an encoder-decoder transformer model. Our best encoder-decoder models achieve an average of 20.8% WER reduction over a 1000-hour supervised baseline, and an average of 13.4% WER reduction when using only the weakly supervised encoder for CTC fine-tuning. Our results show that our setup for weak supervision improved both the encoder acoustic representations and the decoder language generation abilities.

arXiv.org e-Print Archive

Crossref

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Authors: Dupoux Emmanuel; Haziza Daniel; Lee Ann; Pino Juan; Rivière Morgane; Talnikar Chaitanya; Wang Changhan; Williamson Mary; Wu Anne
Publication date: 27/07/2021

We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We provide speech recognition baselines and validate the versatility of VoxPopuli unlabelled data in semi-supervised learning under challenging out-of-domain settings. We will release the corpus at https://github.com/facebookresearch/voxpopuli under an open license. Comment: Accepted to ACL 2021 (long paper).

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Authors: Axelrod Vera; Bapna Ankur; Beaufays Françoise; Chen Nanxin; Chen Zhehuai; Chiu Chung-Cheng; Haghani Parisa; Han Wei; Hu Ke; Li Bo; Meng Zhong; Moreno Pedro; Park Daniel S.; Perng Ginger; Prabhavalkar Rohit; Qin James; Ramabhadran Bhuvana; Riesa Jason; Rosenberg Andrew; Sainath Tara; Schalkwyk Johan; Soltau Hagen; Strohman Trevor; Wang Gary; Wang Yongqiang; Wu Yonghui; Zhang Yu
Publication date: 24/09/2023

We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages. Comment: 20 pages, 7 figures, 8 tables.

arXiv.org e-Print Archive

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Authors: Beaufays Françoise; Cao Liangliang; Chan William; Chen Zhifeng; Chiu Chung-Cheng; Gulati Anmol; Han Wei; Huang Yanping; Jansen Aren; Le Quoc V.; Li Bo; Ma Min; Pang Ruoming; Park Daniel S.; Qin James; Ramabhadran Bhuvana; Sainath Tara N.; Shor Joel; Sim Khe Chai; Wang Shibo; Wang Yongqiang; Wu Yonghui; Xu Yuanzhong; Yu Jiahui; Zhang Yu; Zhou Zongwei
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 21/07/2022

We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks. Comment: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated.

arXiv.org e-Print Archive

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Authors: Cheng Xize; He Jinzheng; Huang Rongjie; Li Linjun; Liu Huadai; Liu Jinglin; Ren Yi; Ye Zhenhui; Yin Xiang; Zhang Lichao; Zhao Zhou
Publication date: 24/05/2023

Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Despite the recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech (AV-S2ST) translation model without relying on intermediate text. AV-TranSpeech complements the audio stream with visual information to promote system robustness and opens up a host of practical applications: dictation or dubbing archival films. To mitigate the data scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representation, and 2) introduce cross-modal distillation with S2ST models trained on the audio-only corpus to further reduce the requirements of visual data. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at https://AV-TranSpeech.github.io. Comment: Accepted to ACL 2023.

arXiv.org e-Print Archive