Power-Law Nonlinearity with Maximally Uniform Distribution Criterion for Improved Neural Network Training in Automatic Speech Recognition
In this paper, we describe the Maximum Uniformity of Distribution (MUD)
algorithm with a power-law nonlinearity. In this approach, we hypothesize
that neural network training becomes more stable if the feature distribution
is not excessively skewed. We propose two types of MUD approaches:
power-function-based MUD and histogram-based MUD. In these approaches, we first
obtain the mel filterbank coefficients and apply a nonlinearity function to
each filterbank channel. With power-function-based MUD, we apply a
power-function nonlinearity whose coefficients are chosen
to maximize the likelihood under the assumption that the nonlinearity outputs
follow a uniform distribution. With histogram-based MUD, the empirical
Cumulative Distribution Function (CDF) from the training database is employed
to transform the original distribution into a uniform distribution. In MUD
processing, we do not
use any prior knowledge (e.g. logarithmic relation) about the energy of the
incoming signal and the intensity perceived by a human. Experimental results
using an end-to-end speech recognition system demonstrate that
power-function-based MUD yields better results than the conventional
Mel-Frequency Cepstral Coefficients (MFCCs). On the LibriSpeech database, we
achieved a 4.02 % WER on test-clean and a 13.34 % WER on test-other without
using any Language Models
(LMs). The major contribution of this work is that we developed a new
algorithm for designing the compressive nonlinearity in a data-driven way,
which is much more flexible than previous approaches and may be extended to
other domains as well. Comment: Accepted and presented at the ASRU 2019 conference.
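Both MUD variants admit a compact sketch. The code below is an illustration under the assumptions stated in the abstract, not the authors' implementation: for power-function-based MUD, if the output y = x**a is to be Uniform(0, 1), then x has density a*x**(a-1) on (0, 1], whose maximum-likelihood exponent has the closed form a_hat = -n / sum(log x_i); histogram-based MUD is approximated here by an empirical-CDF (rank) transform.

```python
import numpy as np

def power_mud_exponent(x):
    """ML estimate of the power exponent a for one filterbank channel,
    assuming the nonlinearity output y = x**a should be Uniform(0, 1).
    If Y ~ U(0, 1) and Y = X**a, then X has density a * x**(a - 1) on
    (0, 1], so the MLE is a_hat = -n / sum(log x_i)."""
    x = np.asarray(x, dtype=float)
    x = np.clip(x / x.max(), 1e-12, 1.0)  # normalize channel energies into (0, 1]
    return -x.size / np.log(x).sum()

def histogram_mud(x):
    """Empirical-CDF transform for one channel: each sample is mapped to
    its rank / n, which is uniform on (0, 1] by construction."""
    x = np.asarray(x, dtype=float)
    ranks = x.argsort().argsort() + 1  # ranks 1..n
    return ranks / x.size
```

As a sanity check, feeding in synthetic skewed data x = u**2 (u uniform) yields an estimated exponent near 0.5, exactly the power that restores uniformity.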
End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System
In this paper, we present an end-to-end training framework for building
state-of-the-art end-to-end speech recognition systems. Our training system
utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing
Units (GPUs). Data reading, large-scale data augmentation, and neural network
parameter updates are all performed "on-the-fly". We use vocal tract
length perturbation [1] and an acoustic simulator [2] for data augmentation.
The processed features and labels are sent to the GPU cluster. The Horovod
allreduce approach is employed to train neural network parameters. We evaluated
the effectiveness of our system on the standard Librispeech corpus [3] and the
10,000-hr anonymized Bixby English dataset. Our end-to-end speech recognition
system built using this training infrastructure showed a 2.44 % WER on the
test-clean subset of LibriSpeech after applying shallow fusion with a
Transformer language model (LM). For the proprietary English Bixby open domain
test set, we obtained a WER of 7.92 % using a Bidirectional Full Attention
(BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the
Monotonic Chunkwise Attention (MoChA) based approach is employed for streaming
speech recognition, we obtained a WER of 9.95 % on the same Bixby open domain
test set. Comment: Accepted and presented at the ASRU 2019 conference.
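Shallow fusion, used for both the LibriSpeech and Bixby results above, interpolates the end-to-end model's token score with an external LM score at each beam-search step. A minimal sketch follows; the fusion weight lam and the toy two-token vocabulary are hypothetical, not taken from the paper.

```python
import math

def fused_scores(asr_log_probs, lm_log_probs, lam=0.3):
    """Shallow fusion: score(t) = log p_asr(t) + lam * log p_lm(t) for each
    candidate token t.  lam is a fusion weight tuned on a dev set."""
    return {t: asr_log_probs[t] + lam * lm_log_probs[t] for t in asr_log_probs}

# Toy example: the ASR model slightly prefers token "a", but the external
# LM strongly prefers "b"; with fusion the beam search expands "b" first.
asr = {"a": math.log(0.55), "b": math.log(0.45)}
lm = {"a": math.log(0.10), "b": math.log(0.90)}
scores = fused_scores(asr, lm)
best = max(scores, key=scores.get)
```

Setting lam = 0 recovers the ASR-only ranking, which is how the "without using any Language Models" numbers in these abstracts are obtained.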