Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition
We propose a novel approach to semi-supervised automatic speech recognition
(ASR). We first exploit a large amount of unlabeled audio data via
representation learning, where we reconstruct a temporal slice of filterbank
features from past and future context frames. The resulting deep contextualized
acoustic representations (DeCoAR) are then used to train a CTC-based end-to-end
ASR system using a smaller amount of labeled audio data. In our experiments, we
show that systems trained on DeCoAR consistently outperform ones trained on
conventional filterbank features, giving 42% and 19% relative improvement over
the baseline on WSJ eval92 and LibriSpeech test-clean, respectively. Our
approach can drastically reduce the amount of labeled data required;
unsupervised training on LibriSpeech then supervision with 100 hours of labeled
data achieves performance on par with training on all 960 hours directly.
Pre-trained models and code will be released online. Comment: Accepted to ICASSP 2020 (oral).
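The pretext task described above lends itself to a compact sketch. The following PyTorch code is a minimal, hypothetical illustration of reconstructing a slice of filterbank frames from forward and backward context; the layer sizes, slice length, and L1 loss are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DeCoARSketch(nn.Module):
    """Minimal sketch of the DeCoAR pretext task: reconstruct a slice of
    filterbank frames from the forward context before the slice and the
    backward context after it. Sizes and loss are illustrative only."""

    def __init__(self, feat_dim=80, hidden=512, slice_len=8):
        super().__init__()
        self.slice_len = slice_len
        self.fwd = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.bwd = nn.LSTM(feat_dim, hidden, batch_first=True)
        # One prediction head per position inside the slice.
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden, feat_dim) for _ in range(slice_len)]
        )

    def forward(self, feats):
        # feats: (batch, time, feat_dim) log-filterbank features.
        fwd_ctx, _ = self.fwd(feats)                        # past context
        bwd_ctx, _ = self.bwd(torch.flip(feats, dims=[1]))  # future context
        bwd_ctx = torch.flip(bwd_ctx, dims=[1])
        batch, time, _ = feats.shape
        loss, count = feats.new_zeros(()), 0
        # Frames t+1 .. t+slice_len are predicted from the forward state
        # at t and the backward state at t+slice_len+1.
        for t in range(time - self.slice_len - 1):
            ctx = torch.cat(
                [fwd_ctx[:, t], bwd_ctx[:, t + self.slice_len + 1]], dim=-1
            )
            for k, head in enumerate(self.heads):
                target = feats[:, t + 1 + k]
                loss = loss + (head(ctx) - target).abs().mean()
                count += 1
        return loss / count

if __name__ == "__main__":
    model = DeCoARSketch(feat_dim=40, hidden=64, slice_len=4)
    fbank = torch.randn(2, 30, 40)  # toy batch of filterbank frames
    print("reconstruction loss:", float(model(fbank)))
```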
Realizing Petabyte Scale Acoustic Modeling
Large scale machine learning (ML) systems such as the Alexa automatic speech
recognition (ASR) system continue to improve with increasing amounts of
manually transcribed training data. Instead of scaling manual transcription to
impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic
models (AM) from the vast firehose of untranscribed audio data. Learning an AM
from 1 Million hours of audio presents unique ML and system design challenges.
We present the design and evaluation of a highly scalable and resource
efficient SSL system for AM. Employing the student/teacher learning paradigm,
we focus on the student learning subsystem: a scalable and robust data pipeline
that generates features and targets from raw audio, and an efficient model
pipeline, including the distributed trainer, that builds a student model. Our
evaluations show that, even without extensive hyper-parameter tuning, we obtain
relative accuracy improvements in the 10 to 20% range, with higher gains in
noisier conditions. The end-to-end processing time of this SSL system was 12
days, and several components in this system can trivially scale linearly with
more compute resources. Comment: © 2019 IEEE. Personal use is permitted, but
republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for
more information.
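As a rough illustration of the student/teacher objective at the heart of this pipeline, the toy NumPy example below fits a linear "student" to per-frame posteriors produced by a fixed "teacher"; the linear models, dimensions, and random features are stand-ins, not the Alexa AM components.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_target_xent(student_logits, teacher_probs):
    """Frame-level cross-entropy of student predictions against teacher
    posteriors, the core objective in student/teacher AM training."""
    log_q = student_logits - np.log(
        np.exp(student_logits).sum(axis=-1, keepdims=True))
    return -(teacher_probs * log_q).sum(axis=-1).mean()

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy illustration: the "teacher" is a fixed linear model over features and
# the "student" is another linear model fit to the teacher's posteriors.
feat_dim, n_senones = 40, 10
W_teacher = rng.normal(size=(feat_dim, n_senones))
W_student = np.zeros((feat_dim, n_senones))

lr = 0.1
for step in range(200):
    feats = rng.normal(size=(256, feat_dim))    # stands in for featurized untranscribed audio
    teacher_probs = softmax(feats @ W_teacher)  # per-frame soft targets from the teacher
    student_logits = feats @ W_student
    # Gradient of soft-target cross-entropy for a linear student.
    grad = feats.T @ (softmax(student_logits) - teacher_probs) / len(feats)
    W_student -= lr * grad

print("final soft-target xent:", soft_target_xent(feats @ W_student, teacher_probs))
```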
The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge
This paper describes the NTNU ASR system participating in the Interspeech
2020 Non-Native Children's Speech ASR Challenge supported by the SIG-CHILD
group of ISCA. This ASR shared task is made much more challenging due to the
coexisting diversity of non-native and children speaking characteristics. In
the setting of closed-track evaluation, all participants were restricted to
develop their systems merely based on the speech and text corpora provided by
the organizer. To work around this under-resourced issue, we built our ASR
system on top of CNN-TDNNF-based acoustic models, while harnessing the
synergistic power of various data augmentation strategies, including both
utterance- and word-level speed perturbation and spectrogram augmentation,
alongside a simple yet effective data-cleansing approach. All variants of our
ASR system employed an RNN-based language model to rescore the first-pass
recognition hypotheses, which was trained solely on the text dataset released
by the organizer. Our system with the best configuration came out in second
place, resulting in a word error rate (WER) of 17.59%, while those of the
top-performing, second runner-up and official baseline systems are 15.67%,
18.71% and 35.09%, respectively. Comment: Submitted to Interspeech 2020 Special Session: Shared Task on
Automatic Speech Recognition for Non-Native Children's Speech
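One of the augmentation strategies named above, spectrogram augmentation, can be sketched in a few lines. The function below applies SpecAugment-style time and frequency masking to a filterbank matrix; the mask counts and widths are illustrative defaults rather than the challenge system's settings.

```python
import numpy as np

def spec_augment(fbank, n_freq_masks=2, max_freq_width=8,
                 n_time_masks=2, max_time_width=20, rng=None):
    """Minimal sketch of spectrogram augmentation: zero out random frequency
    bands and time spans of a (frames, mel_bins) filterbank matrix."""
    rng = rng or np.random.default_rng()
    out = fbank.copy()
    n_frames, n_bins = out.shape
    for _ in range(n_freq_masks):
        w = rng.integers(0, max_freq_width + 1)
        f0 = rng.integers(0, max(1, n_bins - w))
        out[:, f0:f0 + w] = 0.0          # mask a band of mel channels
    for _ in range(n_time_masks):
        w = rng.integers(0, max_time_width + 1)
        t0 = rng.integers(0, max(1, n_frames - w))
        out[t0:t0 + w, :] = 0.0          # mask a span of frames
    return out

if __name__ == "__main__":
    fbank = np.random.randn(300, 40)
    augmented = spec_augment(fbank)
    print("masked cells:", int((augmented == 0).sum()))
```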
Lessons from Building Acoustic Models with a Million Hours of Speech
This is a report of our lessons learned building acoustic models from 1
Million hours of unlabeled speech, while labeled speech is restricted to 7,000
hours. We employ student/teacher training on unlabeled data, helping scale out
target generation in comparison to confidence model based methods, which
require a decoder and a confidence model. To optimize storage and to
parallelize target generation, we store high valued logits from the teacher
model. Introducing the notion of scheduled learning, we interleave learning on
unlabeled and labeled data. To scale distributed training across a large number
of GPUs, we use BMUF with 64 GPUs, while performing sequence training only on
labeled data with gradient threshold compression SGD using 16 GPUs. Our
experiments show that extremely large amounts of data are indeed useful; with
little hyper-parameter tuning, we obtain relative WER improvements in the 10 to
20% range, with higher gains in noisier conditions. Comment: Copyright 2019 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works.
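The storage trick mentioned above, keeping only high-valued teacher logits, can be illustrated as follows. This NumPy sketch compresses per-frame teacher outputs to their top-k entries and rebuilds dense targets at training time; k=20 and the fill value are assumptions, not the paper's settings.

```python
import numpy as np

def compress_teacher_logits(logits, k=20):
    """Keep only the top-k teacher logits per frame (indices + values) so
    that targets for very large unlabeled corpora fit in storage and can be
    generated in parallel."""
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]   # (frames, k)
    vals = np.take_along_axis(logits, idx, axis=-1)        # (frames, k)
    return idx.astype(np.int32), vals.astype(np.float16)

def expand_teacher_logits(idx, vals, n_senones, fill=-1e4):
    """Rebuild a dense (frames, n_senones) target matrix at training time;
    pruned entries get a large negative logit, i.e. ~zero probability."""
    dense = np.full((idx.shape[0], n_senones), fill, dtype=np.float32)
    np.put_along_axis(dense, idx, vals.astype(np.float32), axis=-1)
    return dense

if __name__ == "__main__":
    logits = np.random.randn(5, 3000).astype(np.float32)  # toy teacher output
    idx, vals = compress_teacher_logits(logits, k=20)
    dense = expand_teacher_logits(idx, vals, n_senones=3000)
    print("stored fraction:", idx.size / logits.size)
```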
Improved Noisy Student Training for Automatic Speech Recognition
Recently, a semi-supervised learning method known as "noisy student training"
has been shown to improve image classification performance of deep networks
significantly. Noisy student training is an iterative self-training method that
leverages augmentation to improve network performance. In this work, we adapt
and improve noisy student training for automatic speech recognition, employing
(adaptive) SpecAugment as the augmentation method. We find effective methods to
filter, balance and augment the data generated in between self-training
iterations. By doing so, we are able to obtain word error rates (WERs)
4.2%/8.6% on the clean/noisy LibriSpeech test sets by only using the clean 100h
subset of LibriSpeech as the supervised set and the rest (860h) as the
unlabeled set. Furthermore, we are able to achieve WERs 1.7%/3.4% on the
clean/noisy LibriSpeech test sets by using the unlab-60k subset of LibriLight
as the unlabeled set for LibriSpeech 960h. We are thus able to improve upon the
previous state-of-the-art clean/noisy test WERs achieved on LibriSpeech 100h
(4.74%/12.20%) and LibriSpeech (1.9%/4.1%). Comment: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added
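The iterative recipe, in which the teacher labels the pool, pseudo-labels are filtered and balanced, and the student trains on augmented data and becomes the next teacher, is easiest to see on a toy problem. The example below runs that loop with logistic regression on synthetic 2-D data, using Gaussian input noise as a stand-in for SpecAugment; none of it reflects the actual ASR models or the LibriSpeech setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_logreg(X, y, epochs=300, lr=0.5):
    """Plain gradient-descent logistic regression, standing in for the model."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict_proba(X, w):
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def sample(n):
    """Toy 2-class data: two Gaussian blobs."""
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.5, -1.5)
    return X, y

X_lab, y_lab = sample(50)        # small "supervised" set
X_unlab, _ = sample(5000)        # large unlabeled pool
X_test, y_test = sample(2000)

teacher = fit_logreg(X_lab, y_lab)
for generation in range(3):
    # 1) Teacher labels the unlabeled pool.
    p = predict_proba(X_unlab, teacher)
    pseudo = (p > 0.5).astype(int)
    # 2) Filter: keep only confident pseudo-labels.
    keep = np.abs(p - 0.5) > 0.4
    # 3) Balance: subsample each class to the size of the smaller one.
    idx0 = np.flatnonzero(keep & (pseudo == 0))
    idx1 = np.flatnonzero(keep & (pseudo == 1))
    m = min(len(idx0), len(idx1))
    idx = np.concatenate([rng.choice(idx0, m, False), rng.choice(idx1, m, False)])
    # 4) Augment: the student sees noised inputs (stand-in for SpecAugment).
    X_train = np.vstack([X_lab, X_unlab[idx] + 0.3 * rng.normal(size=(len(idx), 2))])
    y_train = np.concatenate([y_lab, pseudo[idx]])
    student = fit_logreg(X_train, y_train)
    acc = ((predict_proba(X_test, student) > 0.5) == y_test).mean()
    print(f"generation {generation}: test accuracy {acc:.3f}")
    teacher = student  # the student becomes the next teacher
```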
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
Convolutional neural networks (CNN) have shown promising results for
end-to-end speech recognition, albeit still behind other state-of-the-art
methods in performance. In this paper, we study how to bridge this gap and go
beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet.
ContextNet features a fully convolutional encoder that incorporates global
context information into convolution layers by adding squeeze-and-excitation
modules. In addition, we propose a simple scaling method that scales the widths
of ContextNet that achieves good trade-off between computation and accuracy. We
demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves
a word error rate (WER) of 2.1%/4.6% without external language model (LM),
1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy
LibriSpeech test sets. This compares to the previous best published system of
2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the
proposed ContextNet model is also verified on a much larger internal dataset. Comment: Submitted to Interspeech 2020
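The squeeze-and-excitation mechanism that ContextNet adds to its convolutional encoder can be sketched directly. The PyTorch module below pools a feature map over time, derives per-channel gates, and rescales the map; the reduction ratio and the surrounding toy block are illustrative, not ContextNet's actual layer configuration.

```python
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Squeeze-and-excitation over a 1-D (time) feature map: pool the whole
    sequence into one context vector, derive per-channel gates from it, and
    rescale the feature map so each channel sees global context."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, time) output of a convolution block.
        context = x.mean(dim=-1)        # squeeze: global average over time
        scale = self.gate(context)      # excitation: per-channel gates
        return x * scale.unsqueeze(-1)  # broadcast gates over time

if __name__ == "__main__":
    block = nn.Sequential(nn.Conv1d(80, 256, kernel_size=5, padding=2),
                          nn.ReLU(),
                          SqueezeExcite1d(256))
    feats = torch.randn(4, 80, 200)     # (batch, mel_bins, frames)
    print(block(feats).shape)           # torch.Size([4, 256, 200])
```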
NAUTILUS: a Versatile Voice Cloning System
We introduce a novel speech synthesis system, called NAUTILUS, that can
generate speech with a target voice either from a text input or a reference
utterance of an arbitrary source speaker. By using a multi-speaker speech
corpus to train all requisite encoders and decoders in the initial training
stage, our system can clone unseen voices using untranscribed speech of target
speakers on the basis of the backpropagation algorithm. Moreover, depending on
the data circumstance of the target speaker, the cloning strategy can be
adjusted to take advantage of additional data and modify the behaviors of
text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the
situation. We test the performance of the proposed framework by using deep
convolution layers to model the encoders, decoders and WaveNet vocoder.
Evaluations show that it achieves comparable quality with state-of-the-art TTS
and VC systems when cloning with just five minutes of untranscribed speech.
Moreover, it is demonstrated that the proposed framework has the ability to
switch between TTS and VC with high speaker consistency, which will be useful
for many applications. Comment: Submitted to The IEEE/ACM Transactions on Audio, Speech, and Language Processing
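As a very rough sketch of the cloning strategy described above, the toy PyTorch code below freezes text and speech encoders that share a latent space and fine-tunes the speech decoder on a target speaker's untranscribed features via backpropagation. The MLP modules, dimensions, and loss are hypothetical stand-ins for NAUTILUS's deep convolutional encoders, decoders, and WaveNet vocoder.

```python
import torch
import torch.nn as nn

# Hypothetical toy modules; the real system uses deep convolutional
# encoders/decoders and a WaveNet vocoder, not these small MLPs.
latent_dim, feat_dim, text_dim = 64, 80, 128
text_encoder = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
speech_encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
speech_decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

def clone_from_untranscribed(target_speech, steps=100, lr=1e-3):
    """Cloning stage sketch: with the encoders frozen, fine-tune the speech
    decoder by backpropagating a reconstruction loss on the target speaker's
    untranscribed speech routed through the shared latent space."""
    for p in text_encoder.parameters():
        p.requires_grad_(False)
    for p in speech_encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(speech_decoder.parameters(), lr=lr)
    for _ in range(steps):
        latent = speech_encoder(target_speech)   # untranscribed path
        recon = speech_decoder(latent)
        loss = nn.functional.l1_loss(recon, target_speech)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss

def synthesize_from_text(text_features):
    """After cloning, TTS runs text -> shared latent -> adapted decoder."""
    with torch.no_grad():
        return speech_decoder(text_encoder(text_features))

if __name__ == "__main__":
    target = torch.randn(200, feat_dim)   # toy stand-in for target-speaker frames
    print("final clone loss:", float(clone_from_untranscribed(target, steps=20)))
    print("synthesized shape:", synthesize_from_text(torch.randn(50, text_dim)).shape)
```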
Automatic Data Expansion for Customer-care Spoken Language Understanding
Spoken language understanding (SLU) systems are widely used in handling customer-care calls. A traditional SLU system consists of an acoustic model (AM) and a language model (LM) that are used to decode the utterance, and a natural language understanding (NLU) model that predicts the intent. While the AM can be shared across different domains, LM and NLU models need to be trained specifically for every new task. However, preparing enough data to train these models is prohibitively expensive. In this paper, we introduce an efficient method to expand the limited in-domain data. The process starts with training a preliminary NLU model based on logistic regression on the in-domain data. Since the features are based on n = 1, 2-grams, we can detect the most informative n-grams for each intent class. Using these n-grams, we find the samples in the out-of-domain corpus that 1) contain the desired n-gram and/or 2) have a similar intent label. The ones which meet the first constraint are used to train a new LM model and the ones that meet both constraints are used to train a new NLU model. Our results on two divergent experimental setups show that the proposed approach reduces the absolute classification error rate (CER) by 30% compared to the preliminary models, and it significantly outperforms traditional data expansion algorithms such as the ones based on semi-supervised learning, TF-IDF and embedding vectors. Comment: 10 pages, 4 figures, 5 tables
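The selection procedure can be sketched with standard tooling. The scikit-learn example below fits a logistic-regression NLU model on 1,2-gram counts, extracts the highest-weighted n-grams per intent, and keeps out-of-domain utterances containing them; the toy utterances, top_n value, and multi-class assumption are illustrative, not the paper's data or configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def informative_ngrams(in_domain_texts, intents, top_n=5):
    """Fit a logistic-regression NLU model on 1,2-gram counts of the
    in-domain data and read off, per intent class, the n-grams with the
    largest weights. A multi-class (3+ intents) setup is assumed so that
    coefficient rows align with classes."""
    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(in_domain_texts)
    clf = LogisticRegression(max_iter=1000).fit(X, intents)
    vocab = np.array(vec.get_feature_names_out())
    return {label: vocab[np.argsort(row)[-top_n:]].tolist()
            for label, row in zip(clf.classes_, clf.coef_)}

def select_for_lm(out_of_domain_texts, ngrams_by_intent):
    """Keep out-of-domain utterances containing a desired n-gram; these
    expand the LM training data (those that also match the intent label
    would additionally feed the new NLU model)."""
    wanted = {g for grams in ngrams_by_intent.values() for g in grams}
    return [t for t in out_of_domain_texts
            if any(g in t.lower() for g in wanted)]

if __name__ == "__main__":
    texts = ["cancel my order", "cancel the order please",
             "track my package", "where is my package",
             "talk to an agent", "connect me to an agent"]
    intents = ["cancel", "cancel", "track", "track", "agent", "agent"]
    ngrams = informative_ngrams(texts, intents)
    pool = ["i want to cancel my subscription", "the weather is nice",
            "please track my shipment"]
    print(select_for_lm(pool, ngrams))
```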
Listening while Speaking: Speech Chain by Deep Learning
Despite the close relationship between speech perception and production,
research in automatic speech recognition (ASR) and text-to-speech synthesis
(TTS) has progressed more or less independently without exerting much mutual
influence on each other. In human communication, on the other hand, a
closed-loop speech chain mechanism with auditory feedback from the speaker's
mouth to her ear is crucial. In this paper, we take a step further and develop
a closed-loop speech chain model based on deep learning. The
sequence-to-sequence model in a closed-loop architecture allows us to train our
model on the concatenation of both labeled and unlabeled data. While ASR
transcribes the unlabeled speech features, TTS attempts to reconstruct the
original speech waveform based on the text from ASR. In the opposite direction,
ASR also attempts to reconstruct the original text transcription given the
synthesized speech. To the best of our knowledge, this is the first deep
learning model that integrates human speech perception and production
behaviors. Our experimental results show that the proposed approach
significantly improved performance over separate systems trained only on
labeled data.
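The closed-loop objective can be illustrated with a toy analogue in which "speech" and "text" are fixed-size vectors and ASR/TTS are small MLPs: unlabeled speech is transcribed and reconstructed, unlabeled text is synthesized and recovered, and paired data gets ordinary supervised losses. The real speech chain uses sequence-to-sequence models over discrete text, so this sketch only shows the shape of the losses.

```python
import torch
import torch.nn as nn

# Toy stand-ins: "speech" and "text" are fixed-size vectors, and ASR/TTS
# are small MLPs rather than attention-based sequence-to-sequence models.
speech_dim, text_dim = 32, 16
asr = nn.Sequential(nn.Linear(speech_dim, 64), nn.ReLU(), nn.Linear(64, text_dim))
tts = nn.Sequential(nn.Linear(text_dim, 64), nn.ReLU(), nn.Linear(64, speech_dim))
opt = torch.optim.Adam(list(asr.parameters()) + list(tts.parameters()), lr=1e-3)

def chain_step(paired_speech, paired_text, unpaired_speech, unpaired_text):
    # Supervised losses on the small labeled set.
    sup = (nn.functional.mse_loss(asr(paired_speech), paired_text)
           + nn.functional.mse_loss(tts(paired_text), paired_speech))
    # Unlabeled speech: ASR transcribes it, TTS tries to reconstruct the speech.
    speech_cycle = nn.functional.mse_loss(tts(asr(unpaired_speech)), unpaired_speech)
    # Unlabeled text: TTS synthesizes speech, ASR tries to recover the text.
    text_cycle = nn.functional.mse_loss(asr(tts(unpaired_text)), unpaired_text)
    loss = sup + speech_cycle + text_cycle
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

for step in range(200):
    loss = chain_step(torch.randn(8, speech_dim), torch.randn(8, text_dim),
                      torch.randn(32, speech_dim), torch.randn(32, text_dim))
print("final chain loss:", loss)
```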
Toward domain-invariant speech recognition via large scale training
Current state-of-the-art automatic speech recognition systems are trained to
work in specific 'domains', defined based on factors like application, sampling
rate and codec. When such recognizers are used in conditions that do not match
the training domain, performance significantly drops. This work explores the
idea of building a single domain-invariant model for varied use-cases by
combining large scale training data from multiple application domains. Our
final system is trained using 162,000 hours of speech. Additionally, each
utterance is artificially distorted during training to simulate effects like
background noise, codec distortion, and sampling rates. Our results show that,
even at such a scale, a model thus trained works almost as well as those
fine-tuned to specific subsets: A single model can be robust to multiple
application domains, and variations like codecs and noise. More importantly,
such models generalize better to unseen conditions and allow for rapid
adaptation -- we show that by using as little as 10 hours of data from a new
domain, an adapted domain-invariant model can match performance of a
domain-specific model trained from scratch using 70 times as much data. We also
highlight some of the limitations of such models and areas that need addressing
in future work.
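The on-the-fly distortion mentioned above can be sketched minimally: the function below mixes in noise at a random SNR and crudely fakes a narrowband channel. Real systems would use recorded noise, room simulation, and actual codecs, so treat this purely as an illustration of the idea.

```python
import numpy as np

def distort(waveform, rng=None):
    """Minimal sketch of on-the-fly distortion for domain-invariant training:
    add background noise at a random SNR and simulate a narrowband channel by
    naive decimation/repetition. SNR range and probabilities are illustrative."""
    rng = rng or np.random.default_rng()
    x = waveform.astype(np.float64)

    # 1) Additive noise at an SNR drawn uniformly from 5-25 dB.
    snr_db = rng.uniform(5, 25)
    noise = rng.normal(size=x.shape)
    scale = np.sqrt((x ** 2).mean() / ((noise ** 2).mean() * 10 ** (snr_db / 10)))
    x = x + scale * noise

    # 2) Half the time, fake an 8 kHz channel by keeping every other sample
    #    and repeating it (a crude stand-in for resampling through a codec).
    if rng.random() < 0.5:
        x = np.repeat(x[::2], 2)[: len(waveform)]
    return x

if __name__ == "__main__":
    clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s toy tone
    noisy = distort(clean)
    print("output length:", len(noisy))
```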