3 research outputs found
Realizing Petabyte Scale Acoustic Modeling
Large scale machine learning (ML) systems such as the Alexa automatic speech
recognition (ASR) system continue to improve with increasing amounts of
manually transcribed training data. Instead of scaling manual transcription to
impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic
models (AM) from the vast firehose of untranscribed audio data. Learning an AM
from 1 million hours of audio presents unique ML and system design challenges.
We present the design and evaluation of a highly scalable and resource
efficient SSL system for AM. Employing the student/teacher learning paradigm,
we focus on the student learning subsystem: a scalable and robust data pipeline
that generates features and targets from raw audio, and an efficient model
pipeline, including the distributed trainer, that builds a student model. Our
evaluations show that, even without extensive hyper-parameter tuning, we obtain
relative accuracy improvements in the 10 to 20% range, with higher gains in
noisier conditions. The end-to-end processing time of this SSL system was 12
days, and several components in this system can trivially scale linearly with
more compute resources.
Comment: 2156-3357 \copyright 2019 IEEE. Personal use is permitted, but
republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for
more information.
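As a rough illustration of the student/teacher step this abstract describes, here is a minimal sketch. The PyTorch framing, the KL-divergence loss on per-frame posteriors, and the generic teacher/student modules are assumptions for the sketch, not the paper's actual pipeline.

```python
# Minimal sketch of one student/teacher SSL training step (illustrative
# assumptions throughout; not the paper's implementation).
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, features):
    """The teacher labels a batch of untranscribed audio features;
    the student learns to match its per-frame posteriors."""
    with torch.no_grad():
        soft_targets = F.softmax(teacher(features), dim=-1)  # teacher targets
    log_probs = F.log_softmax(student(features), dim=-1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```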
Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models
We present results from Alexa speech teams on semi-supervised learning (SSL)
of acoustic models (AM) with experiments spanning over 3000 hours of GPU time,
making our study one of the largest of its kind. We discuss SSL for AMs in a
small footprint setting, showing that a smaller capacity model trained with 1
million hours of unsupervised data can outperform a baseline supervised system
by 14.3% word error rate reduction (WERR). When the supervised data is
increased seven-fold, our gains diminish to 7.1% WERR; to improve SSL
efficiency in larger supervised data regimes, we employ step-wise distillation
into a smaller model, obtaining a WERR of 14.4%. We then switch to SSL with
larger student models in low supervised data regimes; learning efficiency with
unsupervised data is higher there, and student models may even outperform their
teacher models. We develop a theoretical sketch to explain this behavior.
Comment: TSD202
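For reference, the WERR figures quoted in this abstract are relative reductions against a baseline word error rate; the helper below is a hypothetical illustration of the arithmetic, not code from the paper.

```python
def werr(wer_baseline: float, wer_new: float) -> float:
    """Relative word error rate reduction, in percent.
    E.g. a baseline WER of 10.0 reduced to 8.57 gives ~14.3% WERR,
    matching the gain quoted above."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline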
Distributed Training of Deep Neural Network Acoustic Models for Automatic Speech Recognition
The past decade has witnessed great progress in Automatic Speech Recognition
(ASR) due to advances in deep learning. The improvements in performance can be
attributed to both improved models and large-scale training data. Key to
training such models is the employment of efficient distributed learning
techniques. In this article, we provide an overview of distributed training
techniques for deep neural network acoustic models for ASR. Starting with the
fundamentals of data parallel stochastic gradient descent (SGD) and ASR
acoustic modeling, we will investigate various distributed training strategies
and their realizations in high performance computing (HPC) environments with an
emphasis on striking the balance between communication and computation.
Experiments are carried out on a popular public benchmark to study the
convergence, speedup and recognition performance of the investigated
strategies.
Comment: Accepted to IEEE Signal Processing Magazine
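A conceptual sketch of the synchronous data-parallel SGD this article starts from: each worker computes a gradient on its own data shard, the gradients are averaged (the role an all-reduce plays in a real HPC backend), and every replica applies the identical update. Plain NumPy stands in for the communication layer; all names are illustrative assumptions.

```python
# Sketch of synchronous data-parallel SGD with simulated all-reduce.
import numpy as np

def sgd_allreduce_step(w, shards, grad_fn, lr=0.01):
    """One synchronous step across len(shards) workers."""
    local_grads = [grad_fn(w, x, y) for x, y in shards]  # per-worker compute
    g = np.mean(local_grads, axis=0)                     # all-reduce (average)
    return w - lr * g                                    # same update on every replica

# Toy usage: least-squares gradient on two data shards.
grad_fn = lambda w, x, y: 2 * x.T @ (x @ w - y) / len(y)
rng = np.random.default_rng(0)
w = np.zeros(3)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(2)]
for _ in range(100):
    w = sgd_allreduce_step(w, shards, grad_fn)
```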