Improved Noisy Student Training for Automatic Speech Recognition
Recently, a semi-supervised learning method known as "noisy student training"
has been shown to improve image classification performance of deep networks
significantly. Noisy student training is an iterative self-training method that
leverages augmentation to improve network performance. In this work, we adapt
and improve noisy student training for automatic speech recognition, employing
(adaptive) SpecAugment as the augmentation method. We find effective methods to
filter, balance and augment the data generated in between self-training
iterations. By doing so, we are able to obtain word error rates (WERs) of
4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h
subset of LibriSpeech as the supervised set and the rest (860h) as the
unlabeled set. Furthermore, we are able to achieve WERs of 1.7%/3.4% on the
clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight
as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the
previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h
(4.74%/12.20%) and LibriSpeech 960h (1.9%/4.1%).
Comment: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added
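The iterative self-training loop this abstract describes (teacher pseudo-labels the unlabeled pool, the generated data is filtered and balanced, and a student is trained on augmented data) can be sketched as below. Every function is a toy placeholder: the real system uses deep ASR networks and (adaptive) SpecAugment, so treat this as a minimal illustration of the control flow only, not the paper's implementation.

```python
def train(labeled_pairs):
    """Toy 'training': returns a model that memorizes its labeled pairs."""
    table = dict(labeled_pairs)
    return lambda utterance: table.get(utterance, "<unk>")

def augment(utterance):
    """Placeholder for SpecAugment-style augmentation of the student's input."""
    return utterance  # a real system would mask time/frequency bands

def filter_and_balance(pseudo_pairs):
    """Placeholder for confidence filtering and balancing of generated data."""
    return [(u, t) for u, t in pseudo_pairs if t != "<unk>"]

def noisy_student(labeled, unlabeled, generations=3):
    model = train(labeled)  # initial teacher trained on supervised data only
    for _ in range(generations):
        # 1. the current teacher pseudo-labels the unlabeled pool
        pseudo = [(u, model(u)) for u in unlabeled]
        # 2. filter and balance the data generated between iterations
        pseudo = filter_and_balance(pseudo)
        # 3. retrain a student on labeled data plus augmented pseudo-labels;
        #    the student becomes the next generation's teacher
        mixed = labeled + [(augment(u), t) for u, t in pseudo]
        model = train(mixed)
    return model
```

The key design point the abstract emphasizes is step 2: what happens to the pseudo-labeled data between iterations (filtering, balancing, augmentation) is where the adaptation to ASR lies.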
Unsupervised ASR via Cross-Lingual Pseudo-Labeling
Recent work has shown that it is possible to train an
automatic speech recognition (ASR) system using only unpaired audio and text.
Existing unsupervised ASR methods assume that no labeled data can be used for
training. We argue that even if one does not have any labeled audio for a given
language, there is labeled data available for other
languages. We show that it is possible to use character-level acoustic models
(AMs) from other languages to bootstrap an AM in a new
language. Here, "unsupervised" means no labeled audio is available for the
target language. Our approach is based on two key ingredients: (i)
generating pseudo-labels (PLs) of the target language using an AM from
another language and (ii) constraining these PLs with a target-side
language model. Our approach is effective on Common Voice:
e.g. transfer of English AM to Swahili achieves 18% WER. It also outperforms
character-based wav2vec-U 2.0 by 15% absolute WER on LJSpeech with 800h of
labeled German data instead of 60k hours of unlabeled English data.
Comment: under review
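One way to read the "constraining the pseudo-labels" ingredient is LM-biased decoding: when transcribing the target language with a source-language AM, each decoding step combines the AM score with a target-language LM score. The toy sketch below (all scores invented, greedy decoding only, no real beam search) shows how a target-side bigram LM can flip a pseudo-label toward a valid target-language word.

```python
def decode_with_lm(am_frames, lm, alpha=0.5):
    """Greedy decoding: at each frame, pick the character that maximizes
    the AM log-prob plus alpha times the target-LM log-prob given the prefix.

    am_frames: list of dicts mapping character -> AM log-probability.
    lm: callable (prefix, char) -> log-probability under the target LM.
    alpha: LM weight; alpha=0 recovers the unconstrained AM decision.
    """
    prefix = ""
    for frame in am_frames:
        best = max(frame, key=lambda c: frame[c] + alpha * lm(prefix, c))
        prefix += best
    return prefix

# Invented example: the AM slightly prefers the implausible character "q"
# at the second frame, but a target bigram LM favoring "to" overrides it.
frames = [{"t": -0.1, "q": -3.0}, {"o": -1.0, "q": -0.9}]
bigram = {("", "t"): -0.3, ("t", "o"): -0.2}

def lm(prefix, c):
    # back off to a flat penalty for unseen bigrams
    return bigram.get((prefix[-1:], c), -4.0)
```

With `alpha=0.5` the decoder emits "to"; with `alpha=0.0` (no LM constraint) it emits the AM-preferred "tq", illustrating why constraining cross-lingual PLs matters.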
LMs with a Voice: Spoken Language Modeling beyond Speech Tokens
We present SPECTRON, a novel approach to adapting pre-trained language models
(LMs) to perform speech continuation. By leveraging pre-trained speech
encoders, our model generates both text and speech outputs with the entire
system being trained end-to-end operating directly on spectrograms. Training
the entire model in the spectrogram domain simplifies our speech continuation
system versus existing cascade methods which use discrete speech
representations. We further show our method surpasses existing spoken language
models both in semantic content and speaker preservation while also benefiting
from the knowledge transferred from pre-existing models. Audio samples can be
found on our website https://michelleramanovich.github.io/spectron/spectro