4 research outputs found
Multi-encoder multi-resolution framework for end-to-end speech recognition
Attention-based methods and Connectionist Temporal Classification (CTC)
networks have been promising research directions for end-to-end Automatic Speech
Recognition (ASR). The joint CTC/Attention model has achieved great success by
utilizing both architectures during multi-task training and joint decoding. In
this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework
based on the joint CTC/Attention model. Two heterogeneous encoders with
different architectures, temporal resolutions and separate CTC networks work in
parallel to extract complementary acoustic information. A hierarchical
attention mechanism is then used to combine the encoder-level information. To
demonstrate the effectiveness of the proposed model, experiments are conducted
on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate
(WER) reduction of 18.0-32.1%. Moreover, the proposed MEMR model achieves 3.6%
WER on the WSJ eval92 test set, the best WER reported for an end-to-end system
on this benchmark.
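The hierarchical attention described above can be sketched as scoring each encoder's context vector and taking a softmax-weighted sum. The NumPy sketch below is illustrative only, not the authors' implementation; `w`, `b`, and `v` stand in for hypothetical learned parameters:

```python
import numpy as np

def hierarchical_attention(ctx1, ctx2, w, b, v):
    # Score each encoder's context vector with a shared tanh scorer,
    # softmax over the two streams, and return the weighted sum.
    scores = np.array([v @ np.tanh(w @ c + b) for c in (ctx1, ctx2)])
    e = np.exp(scores - scores.max())
    weights = e / e.sum()
    return weights[0] * ctx1 + weights[1] * ctx2, weights

rng = np.random.default_rng(0)
d = 4                                    # toy context-vector dimension
ctx1, ctx2 = rng.normal(size=d), rng.normal(size=d)
w, b, v = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
fused, weights = hierarchical_attention(ctx1, ctx2, w, b, v)
```

In the full model this fusion happens at every decoding step, letting the decoder lean on whichever encoder is more reliable for the current frame.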
Performance Monitoring for End-to-End Speech Recognition
Measuring performance of an automatic speech recognition (ASR) system without
ground-truth could be beneficial in many scenarios, especially with data from
unseen domains, where performance can be highly inconsistent. In conventional
ASR systems, several performance monitoring (PM) techniques have been
well-developed to monitor performance by looking at tri-phone posteriors or
pre-softmax activations from the neural network acoustic model. However,
strategies for monitoring more recently developed end-to-end ASR systems have
not yet been explored, and so that is the focus of this paper. We adapt
previous PM measures (Entropy, M-measure and Auto-encoder) and apply our
proposed RNN predictor in the end-to-end setting. These measures utilize the
decoder output layer and attention probability vectors, and their predictive
power is measured with simple linear models. Our findings suggest that
decoder-level features are more feasible and informative than attention-level
probabilities for PM measures, and that M-measure on the decoder posteriors
achieves the best overall predictive performance with an average prediction
error of 8.8%. Entropy measures and RNN-based prediction also show competitive
predictability, especially for unseen conditions.
Comment: Submitted to Interspeech 201
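Of the adapted measures, the entropy of the decoder's per-step posteriors is the simplest to state: higher average entropy indicates lower model confidence, which tends to track higher error rates. A minimal NumPy sketch (our illustration, not the paper's code):

```python
import numpy as np

def mean_entropy(posteriors, eps=1e-12):
    # posteriors: (T, V) array of per-step softmax outputs over the
    # vocabulary. Returns the Shannon entropy averaged over the T
    # decoding steps; larger values suggest lower confidence.
    p = np.clip(posteriors, eps, 1.0)
    return float((-p * np.log(p)).sum(axis=1).mean())

# A confident decoder concentrates mass on one token; an uncertain
# one spreads it uniformly across the vocabulary.
confident = np.array([[0.97, 0.01, 0.01, 0.01]] * 5)
uncertain = np.full((5, 4), 0.25)
```

A PM pipeline would then fit a simple linear model mapping such scores to WER on held-out data.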
Multistream CNN for Robust Acoustic Modeling
This paper presents multistream CNN, a novel neural network architecture for
robust acoustic modeling in speech recognition tasks. The proposed architecture
accommodates diverse temporal resolutions in multiple streams to achieve
robustness in acoustic modeling. For the diversity of temporal resolution in
embedding processing, we consider dilation on TDNN-F, a variant of 1D-CNN. Each
stream stacks narrower TDNN-F layers whose kernel has a unique, stream-specific
dilation rate when processing input speech frames in parallel. Hence, it can
better represent acoustic events without increasing model complexity. We
validate the effectiveness of the proposed multistream CNN architecture by
showing consistent improvement across various data sets. Trained with data
augmentation methods, multistream CNN improves the WER of the test-other set in
the LibriSpeech corpus by 12% (relative). On custom data from ASAPP's
production system for a contact center, it records a relative WER improvement
of 11% on customer-channel audio (10% on average across the agent and customer
channels), demonstrating the robustness of the proposed architecture in the
wild. In terms of real-time factor (RTF), multistream CNN outperforms the
standard TDNN-F by 15%, which also suggests its practicality in production
systems and applications.
Comment: Submitted to Interspeech 202
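The core idea, parallel streams that see the same input at different temporal resolutions via stream-specific dilation rates, can be sketched with plain NumPy. This is a single-channel toy version of the mechanism, not the TDNN-F implementation:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    # Same-length dilated 1D convolution over a feature sequence
    # (single channel, zero padding at both ends).
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    return np.array([sum(kernel[j] * xp[t + j * dilation] for j in range(k))
                     for t in range(len(x))])

def multistream(x, kernel, dilations=(1, 2, 3)):
    # Each stream applies the same kernel at its own dilation rate,
    # so the streams cover different temporal contexts; a real model
    # would concatenate their outputs before the next layer.
    return np.stack([dilated_conv1d(x, kernel, d) for d in dilations])

x = np.arange(8, dtype=float)
out = multistream(x, np.array([1.0, 0.0, -1.0]))
```

With a kernel of length 3, dilation rates 1, 2, and 3 give effective receptive fields of 3, 5, and 7 frames per layer, which is where the resolution diversity comes from.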
Multi-QuartzNet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion
In this paper, we propose an end-to-end speech recognition network based on
Nvidia's QuartzNet model. To improve model performance, we design three
components: (1) a Multi-Resolution Convolution Module, which replaces the
original 1D time-channel separable convolution with multi-stream convolutions,
each stream using a unique dilation rate in its convolutional operations; (2) a
Channel-Wise Attention Module, which calculates the attention weight of each
convolutional stream by spatial channel-wise pooling; and (3) a Multi-Layer
Feature Fusion Module, which reweights each convolutional block using global
multi-layer feature maps. Our experiments demonstrate that the Multi-QuartzNet
model achieves a CER of 6.77% on the AISHELL-1 data set, outperforming the
original QuartzNet and approaching the state-of-the-art result.
Comment: will be presented in SLT 202
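Component (2) can be sketched as pooling each stream's output into a channel descriptor, scoring it with a small shared layer, and softmaxing over streams. The NumPy sketch below is a simplified stand-in for the module, with hypothetical parameters `w` and `b`:

```python
import numpy as np

def channel_wise_attention(streams, w, b):
    # streams: (S, T, C) outputs of S parallel convolution streams.
    # Average-pool each stream over time, score the pooled channel
    # descriptor with a shared linear layer, softmax over streams,
    # and return the attention-weighted sum of the streams.
    desc = streams.mean(axis=1)            # (S, C) channel descriptors
    scores = desc @ w + b                  # (S,) one score per stream
    e = np.exp(scores - scores.max())
    weights = e / e.sum()
    return np.tensordot(weights, streams, axes=1), weights

rng = np.random.default_rng(1)
streams = rng.normal(size=(3, 10, 8))      # 3 streams, 10 frames, 8 channels
fused, weights = channel_wise_attention(streams, rng.normal(size=8), 0.0)
```

The fused output keeps the per-frame shape of a single stream, so the module can be dropped between convolutional blocks without changing downstream dimensions.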