Building competitive direct acoustics-to-word models for English conversational speech recognition
Direct acoustics-to-word (A2W) models in the end-to-end paradigm have
received increasing attention compared to conventional sub-word based automatic
speech recognition models using phones, characters, or context-dependent hidden
Markov model states. This is because A2W models recognize words from speech
without any decoder, pronunciation lexicon, or externally-trained language
model, making training and decoding with such models simple. Prior work has
shown that A2W models require orders of magnitude more training data in order
to perform comparably to conventional models. Our work also showed this
accuracy gap when using the English Switchboard-Fisher data set. This paper
describes a recipe to train an A2W model that closes this gap and is on par
with state-of-the-art sub-word based models. We achieve a word error rate of
8.8%/13.9% on the Hub5-2000 Switchboard/CallHome test sets without any decoder
or language model. We find that model initialization, training data order, and
regularization have the most impact on the A2W model performance. Next, we
present a joint word-character A2W model that learns to first spell the word
and then recognize it. This model provides a rich output to the user instead of
simple word hypotheses, making it especially useful in the case of words unseen
or rarely-seen during training.
Comment: Submitted to IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 201
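To make the decoder-free setup concrete, below is a minimal sketch of a CTC-based A2W model in PyTorch: a bidirectional LSTM encoder maps acoustic frames straight to a word vocabulary plus a blank symbol, and greedy collapsing of the frame-level argmax yields word hypotheses without a lexicon or external language model. The layer sizes, vocabulary size, and feature dimension are illustrative assumptions, not the recipe described in the paper.

```python
# Minimal sketch of a direct acoustics-to-word (A2W) CTC model (illustrative
# hyperparameters, not the paper's recipe).
import torch
import torch.nn as nn

class A2WModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=320, num_words=10000):
        super().__init__()
        # Bidirectional LSTM encoder over acoustic frames (e.g. log-mel features).
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                               bidirectional=True, batch_first=True)
        # Project directly to a word vocabulary plus the CTC blank symbol,
        # so no lexicon or external language model is needed at decoding time.
        self.output = nn.Linear(2 * hidden, num_words + 1)

    def forward(self, feats):                     # feats: (batch, time, feat_dim)
        enc, _ = self.encoder(feats)
        return self.output(enc).log_softmax(-1)   # (batch, time, num_words + 1)

def greedy_words(log_probs, blank=0):
    # "Decoder-free" recognition: collapse repeated labels and drop blanks.
    best = log_probs.argmax(-1)                   # (batch, time)
    hyps = []
    for seq in best:
        words, prev = [], blank
        for idx in seq.tolist():
            if idx != blank and idx != prev:
                words.append(idx)
            prev = idx
        hyps.append(words)
    return hyps

model = A2WModel()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)          # training criterion
hyps = greedy_words(model(torch.randn(1, 200, 40)))          # toy usage
```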
Semi-Autoregressive Streaming ASR With Label Context
Non-autoregressive (NAR) modeling has gained significant interest in speech
processing since these models achieve dramatically lower inference time than
autoregressive (AR) models while also achieving good transcription accuracy.
Since NAR automatic speech recognition (ASR) models must wait for the
completion of the entire utterance before processing, some works explore
streaming NAR models based on blockwise attention for low-latency applications.
However, streaming NAR models significantly lag in accuracy compared to
streaming AR and non-streaming NAR models. To address this, we propose a
streaming "semi-autoregressive" ASR model that incorporates the labels emitted
in previous blocks as additional context using a Language Model (LM)
subnetwork. We also introduce a novel greedy decoding algorithm that addresses
insertion and deletion errors near block boundaries while not significantly
increasing the inference time. Experiments show that our method outperforms the
existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on
Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard (SWB)/
CallHome (CH) test sets. It also reduces the accuracy gap with streaming AR and
non-streaming NAR models while achieving 2.5x lower latency. We also
demonstrate that our approach can effectively utilize external text data to
pre-train the LM subnetwork to further improve streaming ASR accuracy.
Comment: Submitted to ICASSP 202
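The sketch below illustrates the general idea of conditioning blockwise non-autoregressive decoding on labels emitted in earlier blocks through an LM subnetwork. It is a simplified, assumption-laden toy (LSTM block encoder, greedy per-frame decoding, naive boundary handling), not the paper's architecture or its boundary-aware decoding algorithm.

```python
# Hedged sketch of blockwise "semi-autoregressive" decoding with label context.
import torch
import torch.nn as nn

class SemiAutoregressiveASR(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, vocab=5000):
        super().__init__()
        self.block_encoder = nn.LSTM(feat_dim, d_model, batch_first=True)
        # LM subnetwork: embeds labels emitted in *previous* blocks only,
        # so decoding within the current block stays non-autoregressive.
        self.lm_embed = nn.Embedding(vocab + 1, d_model)     # +1 for blank/pad
        self.lm_rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.output = nn.Linear(2 * d_model, vocab + 1)

    def decode_block(self, block_feats, prev_labels):
        enc, _ = self.block_encoder(block_feats)              # (1, T_blk, d)
        lm_out, _ = self.lm_rnn(self.lm_embed(prev_labels))   # (1, L, d)
        context = lm_out[:, -1:, :].expand(-1, enc.size(1), -1)
        logits = self.output(torch.cat([enc, context], dim=-1))
        return logits.argmax(-1)                               # greedy, per frame

def stream(model, feats, block_size=40, blank=0):
    # Emit labels block by block; each block sees only earlier blocks' labels.
    prev = torch.zeros(1, 1, dtype=torch.long)                 # start with blank
    hyp = []
    for start in range(0, feats.size(1), block_size):
        block = feats[:, start:start + block_size]
        frame_ids = model.decode_block(block, prev)[0].tolist()
        # Collapse repeats / drop blanks (CTC-style); a fuller version would
        # also resolve insertion/deletion errors near block boundaries.
        last = blank
        for idx in frame_ids:
            if idx != blank and idx != last:
                hyp.append(idx)
            last = idx
        if hyp:
            prev = torch.tensor([hyp], dtype=torch.long)
    return hyp
```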
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
Beam search, which is the dominant ASR decoding algorithm for end-to-end
models, generates tree-structured hypotheses. However, recent studies have
shown that decoding with hypothesis merging can achieve a more efficient search
with comparable or better performance. However, the full context maintained by
recurrent networks is not compatible with hypothesis merging. We propose to use
vector-quantized long short-term memory units (VQ-LSTM) in the prediction
network of RNN transducers. By training the discrete representation jointly
with the ASR network, hypotheses can be actively merged for lattice generation.
Our experiments on the Switchboard corpus show that the proposed VQ RNN
transducers improve ASR performance over transducers with regular prediction
networks while also producing denser lattices with a very low oracle word error
rate (WER) for the same beam size. Additional language model rescoring
experiments also demonstrate the effectiveness of the proposed lattice
generation scheme.
Comment: Interspeech 2022 accepted paper
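A rough sketch of the core mechanism, offered as an assumption rather than the paper's exact VQ-LSTM formulation: the prediction network's output is snapped to the nearest entry of a learned codebook (with a straight-through gradient), so two beam-search hypotheses that land on the same code index share an identical prediction-network state and can be merged into a lattice node.

```python
# Hedged sketch of a vector-quantized prediction network for an RNN transducer
# (codebook size and straight-through trick are illustrative assumptions).
import torch
import torch.nn as nn

class VQPredictionNet(nn.Module):
    def __init__(self, vocab=1000, d_model=256, codebook_size=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        # Codebook of discrete states; hypotheses that map to the same code
        # index can be merged during beam search to form a lattice.
        self.codebook = nn.Embedding(codebook_size, d_model)

    def quantize(self, h):                               # h: (batch, d_model)
        dists = torch.cdist(h, self.codebook.weight)     # (batch, K) distances
        idx = dists.argmin(dim=-1)                       # discrete state id
        q = self.codebook(idx)
        # Straight-through estimator: forward uses q, gradients flow to h.
        return h + (q - h).detach(), idx

    def forward(self, labels, state=None):
        out, state = self.lstm(self.embed(labels), state)
        q, idx = self.quantize(out[:, -1, :])
        return q, idx, state

# Toy usage: hypotheses with the same idx key may be merged by the search.
net = VQPredictionNet()
q, idx, _ = net(torch.randint(0, 1000, (4, 7)))
```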
Improvements to deep convolutional neural networks for LVCSR
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural
Networks (DNNs), as they are better able to reduce spectral variation in the
input signal. This has also been confirmed experimentally, with CNNs showing
relative word error rate (WER) improvements of 4-12% over DNNs
across a variety of LVCSR tasks. In this paper, we describe different methods
to further improve CNN performance. First, we conduct a deep analysis comparing
limited weight sharing and full weight sharing with state-of-the-art features.
Second, we apply various pooling strategies that have shown improvements in
computer vision to an LVCSR speech task. Third, we introduce a method to
effectively incorporate speaker adaptation, namely fMLLR, into log-mel
features. Fourth, we introduce an effective strategy to use dropout during
Hessian-free sequence training. We find that with these improvements,
particularly with fMLLR and dropout, we are able to achieve an additional 2-3%
relative improvement in WER on a 50-hour Broadcast News task over our previous
best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5%
relative improvement over our previous best CNN baseline.
Comment: 6 pages, 1 figure
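As a rough illustration of the kind of CNN acoustic model discussed here (convolution and pooling over log-mel feature maps, dropout in the fully connected layers, context-dependent HMM-state targets), the sketch below uses illustrative filter counts, pooling sizes, and target counts. It is not the paper's configuration, and dropout is shown as an ordinary layer rather than inside Hessian-free sequence training.

```python
# Hedged sketch of a CNN acoustic model for LVCSR over log-mel feature maps.
import torch
import torch.nn as nn

class CNNAcousticModel(nn.Module):
    def __init__(self, in_channels=3, num_states=6000, dropout=0.5):
        super().__init__()
        self.conv = nn.Sequential(
            # Convolution over (frequency, time) patches of the log-mel map;
            # input channels could hold static, delta, and delta-delta features.
            nn.Conv2d(in_channels, 128, kernel_size=(9, 9)),
            nn.ReLU(),
            # Pooling along the frequency axis reduces spectral variation;
            # max pooling is just one of the strategies one might compare.
            nn.MaxPool2d(kernel_size=(3, 1)),
            nn.Conv2d(128, 256, kernel_size=(4, 3)),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024),
            nn.ReLU(),
            nn.Dropout(dropout),              # dropout in fully connected layers
            nn.Linear(1024, num_states),      # context-dependent HMM-state targets
        )

    def forward(self, x):                     # x: (batch, channels, freq, frames)
        return self.classifier(self.conv(x))

model = CNNAcousticModel()
logits = model(torch.randn(2, 3, 40, 11))     # e.g. 40 mel bins, 11-frame context
```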