110 research outputs found
Linguistic Search Optimization for Deep Learning Based LVCSR
Recent advances in deep learning based large vocabulary continuous speech
recognition (LVCSR) have created growing demand for large-scale speech
transcription. The inference process of a speech recognizer is to find the
sequence of labels whose corresponding acoustic and language models best match
the input features [1]. The main computation comprises two stages: acoustic
model (AM) inference and linguistic search (over a weighted finite-state
transducer, WFST). The large computational overhead of both stages hampers the
wide application of LVCSR. Benefiting from stronger classifiers, deep learning,
and more powerful computing devices, we propose general ideas and some initial
trials to address these fundamental problems.
Comment: accepted by Doctoral Consortium, INTERSPEECH 201
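The two-stage inference the abstract describes corresponds to the standard MAP decoding rule; a minimal sketch in LaTeX with the usual notation (X for the input features, W for a label sequence), where AM inference supplies the acoustic score and the WFST search performs the arg max:

```latex
% MAP decoding rule for LVCSR: the recognizer searches for the label
% sequence W whose acoustic model p(X|W) and language model P(W)
% jointly best explain the observed features X.
\hat{W} = \arg\max_{W} \; p(X \mid W)\, P(W)
```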
Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting
Speech recognition is a sequence prediction problem. Besides employing
various deep learning approaches for frame-level classification, sequence-level
discriminative training has proved indispensable for achieving state-of-the-art
performance in large vocabulary continuous speech recognition (LVCSR). However,
keyword spotting (KWS), one of the most common speech recognition tasks, has
benefited almost exclusively from frame-level deep learning because of the
difficulty of obtaining competing sequence hypotheses. The few studies on
sequence discriminative training for KWS are limited to fixed-vocabulary or
LVCSR-based methods and have not been compared to state-of-the-art deep
learning based KWS approaches. In this paper, a sequence discriminative
training framework is proposed for both fixed-vocabulary and unrestricted
acoustic KWS. Sequence discriminative training for both sequence-level
generative and discriminative models is systematically investigated. By
introducing word-independent phone lattices or non-keyword blank symbols to
construct competing hypotheses, feasible and efficient sequence discriminative
training approaches are proposed for acoustic KWS. Experiments show that the
proposed approaches obtain consistent and significant improvements on both
fixed-vocabulary and unrestricted KWS tasks compared to previous frame-level
deep learning based acoustic KWS methods.
Comment: accepted by Speech Communication, 08/02/201
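The generic shape of such a sequence-level criterion, of which the paper's framework is one instance, is the MMI-style ratio of the reference-path score to the scores of competing hypotheses; a sketch of the standard criterion, not the paper's exact objective:

```latex
% Generic MMI-style sequence criterion: the numerator scores the
% reference sequence, the denominator sums over competing hypotheses
% (e.g. paths in a word-independent phone lattice).
\mathcal{F}_{\mathrm{MMI}} = \log
  \frac{p(X \mid W_{\mathrm{ref}})\, P(W_{\mathrm{ref}})}
       {\sum_{W} p(X \mid W)\, P(W)}
```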
On Modular Training of Neural Acoustics-to-Word Model for LVCSR
End-to-end (E2E) automatic speech recognition (ASR) systems directly map
acoustics to words using a unified model. Previous works mostly focus on E2E
training of a single model that integrates the acoustic and language models
into a whole. Although E2E training benefits from sequence modeling and a
simplified decoding pipeline, a large amount of transcribed acoustic data is
usually required, and traditional acoustic and language modelling techniques
cannot be utilized. In this paper, a novel modular training framework for E2E
ASR is proposed to separately train neural acoustic and language models during
the training stage, while still performing end-to-end inference in the decoding
stage. Here, an acoustics-to-phoneme model (A2P) and a phoneme-to-word model
(P2W) are trained using acoustic data and text data, respectively. A phone
synchronous decoding (PSD) module is inserted between A2P and P2W to reduce
sequence lengths without precision loss. Finally, the modules are integrated
into an acoustics-to-word model (A2W) and jointly optimized using acoustic data
to retain the advantage of sequence modeling. Experiments on a 300-hour
Switchboard task show significant improvement over the direct A2W model. Both
training and decoding efficiency also benefit from the proposed method.
Comment: accepted by ICASSP 201
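The PSD module's role can be illustrated with a tiny sketch; the function name and threshold below are hypothetical, and the logic is only the general blank-skipping idea, not the authors' code:

```python
import numpy as np

def phone_synchronous_decode(posteriors, blank_id=0, blank_thresh=0.95):
    """Minimal PSD-style sketch: drop frames dominated by the CTC blank
    symbol so the downstream phoneme-to-word (P2W) model sees a much
    shorter sequence (the 0.95 threshold is an assumption)."""
    kept = [t for t in range(len(posteriors))
            if posteriors[t, blank_id] < blank_thresh]
    return posteriors[kept]  # shortened sequence of phone posteriors

# Example: 100 frames of posteriors over 40 phones plus blank.
frames = np.random.dirichlet(np.ones(41), size=100)
print(phone_synchronous_decode(frames).shape)
```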
Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition
Commonly used automatic speech recognition (ASR) systems can be classified
into frame-synchronous and label-synchronous categories, based on whether the
speech is decoded on a per-frame or per-label basis. Frame-synchronous systems,
such as traditional hidden Markov model systems, can easily incorporate
existing knowledge and can support streaming ASR applications.
Label-synchronous systems, based on attention-based encoder-decoder models, can
jointly learn the acoustic and language information with a single model, and
can be regarded as audio-grounded language models. In this paper, we propose
rescoring the N-best hypotheses or lattices produced by a first-pass
frame-synchronous system with a label-synchronous system in a second pass. By
exploiting the complementary modelling of the different approaches, the
combined two-pass systems achieve competitive performance without using any
extra speech or text data on two standard ASR tasks. For the 80-hour AMI IHM
dataset, the combined system has a 13.7% word error rate (WER) on the
evaluation set, which is up to a 29% relative WER reduction over the individual
systems. For the 300-hour Switchboard dataset, the WERs of the combined system
are 5.7% and 12.1% on the Switchboard and CallHome subsets of Hub5'00, and
13.2% and 7.6% on the Switchboard Cellular and Fisher subsets of RT03, up to a
33% relative reduction in WER over the individual systems.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
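Conceptually, the proposed second pass is a log-linear rescoring of first-pass hypotheses; a minimal sketch, where the interpolation weight, scorer interface, and function name are assumptions rather than the authors' implementation:

```python
def rescore_nbest(nbest, label_sync_scorer, lam=0.5):
    """Sketch of a second-pass combination: interpolate each hypothesis's
    first-pass frame-synchronous score with a label-synchronous
    (attention-based) model score, then re-rank."""
    rescored = []
    for hyp, first_pass_score in nbest:
        combined = (1 - lam) * first_pass_score + lam * label_sync_scorer(hyp)
        rescored.append((hyp, combined))
    return max(rescored, key=lambda pair: pair[1])[0]

# Toy usage: a stand-in scorer that simply prefers shorter hypotheses.
nbest = [("hello world", -12.3), ("hello word", -12.9)]
print(rescore_nbest(nbest, lambda h: -len(h.split())))
```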
Tiny Transducer: A Highly-efficient Speech Recognition Model on Edge Devices
This paper proposes an extremely lightweight phone-based transducer model
with a tiny decoding graph for edge devices. First, a phone synchronous
decoding (PSD) algorithm based on blank label skipping is used to speed up the
transducer decoding process. Then, to reduce the deletion errors introduced by
high blank scores, a blank label deweighting approach is proposed. To reduce
parameters and computation, deep feedforward sequential memory network (DFSMN)
layers are used in the transducer encoder, and a CNN-based stateless predictor
is adopted. Singular value decomposition (SVD) compresses the model further. A
WFST-based decoding graph takes the context-independent (CI) phone posteriors
as input and allows us to flexibly bias decoding toward user-specific
information. Finally, with only 0.9M parameters after SVD, our system gives a
relative 9.1%-20.5% improvement over a larger conventional hybrid system on
edge devices.
Comment: Accepted by ICASSP 202
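Blank deweighting and PSD blank skipping are both simple operations on the per-frame output scores; a hedged sketch of the two steps, with a hypothetical penalty value and naming:

```python
import numpy as np

def deweight_blank(log_probs, blank_id=0, penalty=1.5):
    """Blank label deweighting sketch (the penalty value is an
    assumption): subtract a constant from the blank log-score so the
    decoder emits phones more readily, countering deletion errors."""
    log_probs = log_probs.copy()
    log_probs[:, blank_id] -= penalty
    return log_probs

def psd_skip_blanks(log_probs, blank_id=0):
    """Blank-skipping step of phone synchronous decoding: only frames
    where a non-blank label wins are passed on to the decoding graph."""
    winners = log_probs.argmax(axis=1)
    return log_probs[winners != blank_id]

frames = np.log(np.random.dirichlet(np.ones(100), size=50))
print(psd_skip_blanks(deweight_blank(frames)).shape)
```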
Vectorization of hypotheses and speech for faster beam search in encoder decoder-based speech recognition
Attention-based encoder-decoder networks use a left-to-right beam search
algorithm in the inference step. The current beam search expands hypotheses and
traverses the expanded hypotheses at the next time step. This traversal is
generally implemented with a for-loop, which slows down the recognition
process. In this paper, we propose a parallelization technique for beam search
that accelerates the search process by vectorizing multiple hypotheses to
eliminate the for-loop. We also propose a technique to batch multiple speech
utterances for offline recognition, which removes the for-loop over the
utterances. Unlike during training, this extension is non-trivial during beam
search because of the pruning and thresholding techniques used for efficient
decoding. In addition, our method can combine the scores of external modules,
an RNN language model (RNNLM) and CTC, in a batch as shallow fusion. We
achieved a 3.7x speedup over the original beam search algorithm by vectorizing
hypotheses, and a 10.5x speedup by further moving processing to a GPU.
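The core trick the abstract describes reduces to replacing the per-hypothesis for-loop with one batched top-k; a minimal numpy sketch, where the shapes and function name are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def expand_beam_vectorized(hyp_scores, log_probs, beam):
    """Vectorized beam-expansion sketch: broadcast the (beam,) hypothesis
    scores against the (beam, vocab) token log-probabilities and take a
    single global top-k instead of looping over hypotheses."""
    total = hyp_scores[:, None] + log_probs          # (beam, vocab)
    flat = total.ravel()
    top = np.argpartition(flat, -beam)[-beam:]       # global top-k, no for-loop
    hyp_idx, tok_idx = np.unravel_index(top, total.shape)
    return hyp_idx, tok_idx, flat[top]

scores = np.array([-1.0, -2.5, -3.0])
logp = np.log(np.random.dirichlet(np.ones(10), size=3))
print(expand_beam_vectorized(scores, logp, beam=3))
```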
Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks
We explore a keyword-based spoken language understanding system in which the
intent of the user can be derived directly from the detection of a sequence of
keywords in the query. In this paper, we focus on an open-vocabulary keyword
spotting method, allowing users to define their own keywords without having to
retrain the whole model. We describe the design choices leading to a fast and
small-footprint system, able to run on tiny devices for any arbitrary set of
user-defined keywords, without training data specific to those keywords. The
model, based on a quantized long short-term memory (LSTM) neural network
trained with connectionist temporal classification (CTC), weighs less than
500 KB. Our approach takes advantage of some properties of the predictions of
CTC-trained networks to calibrate the confidence scores and to implement a
fast detection algorithm. The proposed system outperforms a standard
keyword-filler model approach.
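One way such CTC properties can be exploited is to score a keyword directly from the per-frame posteriors; the greedy monotonic alignment below is an illustrative simplification of such a detector, not the paper's calibrated algorithm:

```python
import numpy as np

def keyword_confidence(log_posteriors, keyword_labels, blank_id=0):
    """Score a keyword from CTC log-posteriors by following its labels
    left to right, taking each label's best frame after the previous
    one (a simplification of a real CTC-based detector)."""
    score, t = 0.0, 0
    for label in keyword_labels:
        if t >= len(log_posteriors):
            return -np.inf  # keyword cannot fit in the remaining frames
        best = t + int(np.argmax(log_posteriors[t:, label]))
        score += log_posteriors[best, label]
        t = best + 1
    return score / len(keyword_labels)  # length-normalized confidence

frames = np.log(np.random.dirichlet(np.ones(30), size=80))
print(keyword_confidence(frames, keyword_labels=[7, 12, 3]))
```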
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progress made in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end, with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research.
Comment: This is an updated version, with the latest literature up to ICASSP
2018, of the paper: Dong Yu and Jinyu Li, "Recent Progresses in Deep Learning
based Acoustic Models," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 3, 201
A GPU-based WFST Decoder with Exact Lattice Generation
We describe initial work on an extension of the Kaldi toolkit that supports
weighted finite-state transducer (WFST) decoding on Graphics Processing Units
(GPUs). We implement token recombination as an atomic GPU operation in order to
fully parallelize the Viterbi beam search, and propose a dynamic load balancing
strategy for more efficient token passing scheduling among GPU threads. We also
redesign the exact lattice generation and lattice pruning algorithms for better
utilization of the GPUs. Experiments on the Switchboard corpus show that the
proposed method achieves identical 1-best results and lattice quality in
recognition and confidence measure tasks, while running 3 to 15 times faster
than the single-process Kaldi decoder. The above results are reported on
different GPU architectures. Additionally, we obtain a 46-fold speedup with
sequence parallelism and Multi-Process Service (MPS) on the GPU.
Comment: accepted by INTERSPEECH 201
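Token recombination itself is a keep-the-minimum update per WFST state; a serial Python analogue of the operation the paper implements atomically on the GPU:

```python
def recombine_tokens(tokens):
    """Serial sketch of token recombination: for each WFST state, keep
    only the best-cost token. The parallel version performs exactly this
    keep-the-minimum update as an atomic GPU operation so that many
    threads can push tokens to the same state without races."""
    best = {}
    for state, cost, backpointer in tokens:
        if state not in best or cost < best[state][0]:
            best[state] = (cost, backpointer)
    return best

tokens = [(3, 4.2, "a"), (3, 3.1, "b"), (7, 0.9, "c")]
print(recombine_tokens(tokens))  # state 3 keeps the 3.1-cost token
```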
Stream attention-based multi-array end-to-end speech recognition
Automatic Speech Recognition (ASR) using multiple microphone arrays has
achieved great success in far-field robustness. Taking advantage of all the
information that each array shares and contributes is crucial in this task.
Motivated by the advances of the joint Connectionist Temporal Classification
(CTC)/attention mechanism in End-to-End (E2E) ASR, a stream attention-based
multi-array framework is proposed in this work. Microphone arrays, acting as
information streams, are processed by separate encoders and decoded under the
guidance of both CTC and attention networks. For the attention component, a
hierarchical structure is adopted: on top of the regular attention networks,
stream attention is introduced to steer the decoder toward the most informative
encoders. Experiments have been conducted on the AMI and DIRHA multi-array
corpora using the encoder-decoder architecture. Compared with the best
single-array results, the proposed framework achieves relative Word Error Rate
(WER) reductions of 3.7% and 9.7% on the two datasets, respectively, which is
better than conventional strategies as well.
Comment: Submitted to ICASSP 201
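The hierarchical attention can be pictured as a second softmax over per-array context vectors; a minimal sketch, where the energy scores stand in for a learned stream scorer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def stream_attention(contexts, energies):
    """Stream-level attention sketch: per-frame context vectors from each
    array's encoder are assumed already computed by regular attention;
    stream attention then mixes the streams by their (learned) energies."""
    weights = softmax(np.asarray(energies))          # one weight per array
    return np.tensordot(weights, np.asarray(contexts), axes=1)

# Two arrays, 256-dim context vectors, array 0 judged more informative.
ctx = [np.random.randn(256), np.random.randn(256)]
print(stream_attention(ctx, energies=[2.0, 0.5]).shape)  # (256,)
```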