Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Axelrod, Vera; Bapna, Ankur; Beaufays, Françoise; Chen, Nanxin; Chen, Zhehuai; Chiu, Chung-Cheng; Haghani, Parisa; Han, Wei; Hu, Ke; Li, Bo; Meng, Zhong; Moreno, Pedro; Park, Daniel S.; Perng, Ginger; Prabhavalkar, Rohit; Qin, James; Ramabhadran, Bhuvana; Riesa, Jason; Rosenberg, Andrew; Sainath, Tara; Schalkwyk, Johan; Soltau, Hagen; Strohman, Trevor; Wang, Gary; Wang, Yongqiang; Wu, Yonghui; Zhang, Yu

Abstract

We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set one-seventh the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.

Comment: 20 pages, 7 figures, 8 tables

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2303.01037

Last time updated on 20/03/2023