Improved Noisy Student Training for Automatic Speech Recognition
Recently, a semi-supervised learning method known as "noisy student training"
has been shown to improve image classification performance of deep networks
significantly. Noisy student training is an iterative self-training method that
leverages augmentation to improve network performance. In this work, we adapt
and improve noisy student training for automatic speech recognition, employing
(adaptive) SpecAugment as the augmentation method. We find effective methods to
filter, balance and augment the data generated in between self-training
iterations. By doing so, we are able to obtain word error rates (WERs) of
4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h
subset of LibriSpeech as the supervised set and the rest (860h) as the
unlabeled set. Furthermore, we are able to achieve WERs of 1.7%/3.4% on the
clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight
as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the
previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h
(4.74%/12.20%) and LibriSpeech 960h (1.9%/4.1%).
Comment: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added
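The iterative self-training loop this abstract describes (teacher pseudo-labels the unlabeled pool, the generated data is filtered and balanced, and a student is trained on augmented data) can be sketched as below. Every function is a toy placeholder: the real system uses deep ASR networks and (adaptive) SpecAugment, so treat this as a minimal illustration of the control flow only, not the paper's implementation.

```python
def train(labeled_pairs):
    """Toy 'training': returns a model that memorizes its labeled pairs."""
    table = dict(labeled_pairs)
    return lambda utterance: table.get(utterance, "<unk>")

def augment(utterance):
    """Placeholder for SpecAugment-style augmentation of the student's input."""
    return utterance  # a real system would mask time/frequency bands

def filter_and_balance(pseudo_pairs):
    """Placeholder for confidence filtering and balancing of generated data."""
    return [(u, t) for u, t in pseudo_pairs if t != "<unk>"]

def noisy_student(labeled, unlabeled, generations=3):
    model = train(labeled)  # initial teacher trained on supervised data only
    for _ in range(generations):
        # 1. the current teacher pseudo-labels the unlabeled pool
        pseudo = [(u, model(u)) for u in unlabeled]
        # 2. filter and balance the data generated between iterations
        pseudo = filter_and_balance(pseudo)
        # 3. retrain a student on labeled data plus augmented pseudo-labels;
        #    the student becomes the next generation's teacher
        mixed = labeled + [(augment(u), t) for u, t in pseudo]
        model = train(mixed)
    return model
```

The key design point the abstract emphasizes is step 2: what happens to the pseudo-labeled data between iterations (filtering, balancing, augmentation) is where the adaptation to ASR lies.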
Unsupervised ASR via Cross-Lingual Pseudo-Labeling
Recent work has shown that it is possible to train an
automatic speech recognition (ASR) system using only unpaired audio and text.
Existing unsupervised ASR methods assume that no labeled data can be used for
training. We argue that even if one does not have any labeled audio for a given
language, there is labeled data available for other
languages. We show that it is possible to use character-level acoustic models
(AMs) from other languages to bootstrap an AM in a new
language. Here, "unsupervised" means no labeled audio is available for the
target language. Our approach is based on two key ingredients: (i)
generating pseudo-labels (PLs) of the target language using an AM from
another language and (ii) constraining these PLs with a target-side
language model. Our approach is effective on Common Voice:
e.g. transfer of English AM to Swahili achieves 18% WER. It also outperforms
character-based wav2vec-U 2.0 by 15% absolute WER on LJSpeech with 800h of
labeled German data instead of 60k hours of unlabeled English data.
Comment: under review
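One way to read the "constraining the pseudo-labels" ingredient is LM-biased decoding: when transcribing the target language with a source-language AM, each decoding step combines the AM score with a target-language LM score. The toy sketch below (all scores invented, greedy decoding only, no real beam search) shows how a target-side bigram LM can flip a pseudo-label toward a valid target-language word.

```python
def decode_with_lm(am_frames, lm, alpha=0.5):
    """Greedy decoding: at each frame, pick the character that maximizes
    the AM log-prob plus alpha times the target-LM log-prob given the prefix.

    am_frames: list of dicts mapping character -> AM log-probability.
    lm: callable (prefix, char) -> log-probability under the target LM.
    alpha: LM weight; alpha=0 recovers the unconstrained AM decision.
    """
    prefix = ""
    for frame in am_frames:
        best = max(frame, key=lambda c: frame[c] + alpha * lm(prefix, c))
        prefix += best
    return prefix

# Invented example: the AM slightly prefers the implausible character "q"
# at the second frame, but a target bigram LM favoring "to" overrides it.
frames = [{"t": -0.1, "q": -3.0}, {"o": -1.0, "q": -0.9}]
bigram = {("", "t"): -0.3, ("t", "o"): -0.2}

def lm(prefix, c):
    # back off to a flat penalty for unseen bigrams
    return bigram.get((prefix[-1:], c), -4.0)
```

With `alpha=0.5` the decoder emits "to"; with `alpha=0.0` (no LM constraint) it emits the AM-preferred "tq", illustrating why constraining cross-lingual PLs matters.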
LMs with a Voice: Spoken Language Modeling beyond Speech Tokens
We present SPECTRON, a novel approach to adapting pre-trained language models
(LMs) to perform speech continuation. By leveraging pre-trained speech
encoders, our model generates both text and speech outputs with the entire
system being trained end-to-end operating directly on spectrograms. Training
the entire model in the spectrogram domain simplifies our speech continuation
system versus existing cascade methods which use discrete speech
representations. We further show our method surpasses existing spoken language
models both in semantic content and speaker preservation while also benefiting
from the knowledge transferred from pre-existing models. Audio samples can be
found on our website https://michelleramanovich.github.io/spectron/spectro