DFSMN-SAN with Persistent Memory Model for Automatic Speech Recognition
Self-attention networks (SANs) have been introduced into automatic speech recognition (ASR) and have achieved state-of-the-art performance owing to their superior ability to capture long-term dependencies. One of the key ingredients is the self-attention mechanism, which can be performed effectively over the whole utterance. In this paper, we investigate whether information beyond the whole-utterance level can be exploited and whether it is beneficial. We propose applying self-attention layers with augmented memory to ASR. Specifically, we first propose a variant architecture that combines the deep feed-forward sequential memory network (DFSMN) with self-attention layers to form a stronger baseline than a purely self-attention network. We then propose and compare two kinds of additional memory structures added to the self-attention layers. Experiments on large-scale LVCSR tasks show that, on four individual test sets, the DFSMN-SAN architecture outperforms the vanilla SAN encoder by a relative 5% in character error rate (CER). More importantly, the additional memory structures provide a further 5% to 11% relative improvement in
CER.
Comment: 5 pages, 2 figures, submitted to ICASSP 2020
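The abstract does not pin down the two memory structures, so the following is only a minimal sketch of the general idea, in PyTorch, with every name and dimension hypothetical: a bank of learnable persistent-memory vectors is prepended to the keys and values of a self-attention layer, so each query can attend to information that lives outside the current utterance.

    import torch
    import torch.nn as nn

    class MemoryAugmentedSelfAttention(nn.Module):
        """Self-attention whose keys/values are extended with learnable
        persistent-memory slots (hypothetical sketch, not the paper's
        exact memory structure)."""

        def __init__(self, d_model: int, n_heads: int, n_mem: int = 16):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Persistent memory: parameters shared across all utterances.
            self.mem = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, d_model)
            mem = self.mem.unsqueeze(0).expand(x.size(0), -1, -1)
            kv = torch.cat([mem, x], dim=1)  # memory slots precede the utterance
            out, _ = self.attn(query=x, key=kv, value=kv)
            return out

Because the memory slots are parameters rather than activations, they persist across utterances and are updated only by training, which matches the "persistent memory" framing in the title.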
SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition
End-to-end speech recognition has become popular in recent years, since it
can integrate the acoustic, pronunciation and language models into a single
neural network. Among end-to-end approaches, attention-based methods have emerged as superior; for example, the Transformer adopts an encoder-decoder architecture. The key improvement introduced by the Transformer is the use of self-attention in place of recurrent mechanisms, enabling both the encoder and decoder to capture long-range dependencies with lower computational complexity. In this work, we propose strengthening self-attention with a DFSMN memory block, forming the proposed memory-equipped self-attention (SAN-M) mechanism. Theoretical and empirical comparisons demonstrate the relevance and complementarity of self-attention and the DFSMN memory block. Furthermore, the proposed SAN-M
provides an efficient mechanism to integrate these two modules. We have
evaluated our approach on the public AISHELL-1 benchmark and an
industrial-level 20,000-hour Mandarin speech recognition task. On both tasks,
SAN-M systems achieved substantially better performance than the self-attention-based Transformer baseline system. In particular, SAN-M achieves a CER of 6.46% on the AISHELL-1 task even without any external language model, comfortably outperforming
other state-of-the-art systems.
Comment: submitted to INTERSPEECH 2020
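Concretely, one way to read the integration, as a sketch under the assumption that the DFSMN memory block is a depthwise temporal convolution whose output is summed with the multi-head attention output (module name and sizes are hypothetical):

    import torch
    import torch.nn as nn

    class SANM(nn.Module):
        """Memory-equipped self-attention: multi-head attention plus an
        FSMN-style memory branch realized as a depthwise temporal
        convolution. Simplified sketch; the memory block is applied to
        the layer input here rather than to the projected values."""

        def __init__(self, d_model: int, n_heads: int, kernel: int = 11):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Depthwise 1-D convolution standing in for the DFSMN memory block.
            self.fsmn = nn.Conv1d(d_model, d_model, kernel,
                                  padding=kernel // 2, groups=d_model, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, d_model)
            attn_out, _ = self.attn(x, x, x)
            mem_out = self.fsmn(x.transpose(1, 2)).transpose(1, 2)
            # Sum the global (attention) and local (memory) branches.
            return attn_out + mem_out

One reading of the claimed complementarity is that the attention branch captures global, content-based dependencies over the whole utterance, while the convolutional memory branch captures local, position-based context.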
Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition
Speech enhancement techniques based on deep learning have brought significant improvements in speech quality and intelligibility. Nevertheless, a large gain
in speech quality measured by objective metrics, such as perceptual evaluation
of speech quality (PESQ), does not necessarily lead to improved speech
recognition performance due to speech distortion in the enhancement stage. In
this paper, a multi-channel dilated convolutional network for frequency-domain modeling is presented to enhance the target speaker under far-field, noisy, multi-talker conditions. We study three approaches towards distortionless waveforms for overlapped speech recognition: estimating a complex ideal ratio mask with an unbounded range, incorporating the filterbank (fbank) loss in a multi-objective learning framework, and fine-tuning the enhancement model with an acoustic model.
Experimental results demonstrate the effectiveness of all three approaches in reducing speech distortion and improving recognition accuracy. In particular, the jointly tuned enhancement model also works well with a standalone acoustic model on real test data.
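As an illustrative sketch of the multi-objective learning step (all tensors, shapes, and the weight alpha are hypothetical, and the fbank computation is reduced to a mel projection of the power spectrum):

    import torch

    def multi_objective_loss(est_mask, ideal_mask, est_spec, clean_spec,
                             mel_fbank, alpha=0.5):
        """Sketch of a multi-objective enhancement loss: MSE on an
        unbounded complex ideal ratio mask plus an fbank-domain term.
        est_mask/ideal_mask: complex ratio masks; est_spec/clean_spec:
        complex STFTs (batch, time, n_freq); mel_fbank: (n_mels, n_freq)
        mel matrix; alpha: hypothetical mixing weight."""
        # Mask loss over the unbounded complex mask (real and imaginary parts).
        mask_loss = torch.mean(torch.abs(est_mask - ideal_mask) ** 2)
        # Fbank loss: compare log-mel energies of enhanced and clean speech.
        est_fbank = torch.log(est_spec.abs() ** 2 @ mel_fbank.T + 1e-8)
        ref_fbank = torch.log(clean_spec.abs() ** 2 @ mel_fbank.T + 1e-8)
        fbank_loss = torch.mean((est_fbank - ref_fbank) ** 2)
        return mask_loss + alpha * fbank_loss

The fbank term pulls the enhanced spectrum toward features the acoustic model actually consumes, which is one way to reconcile enhancement gains with recognition accuracy.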
Latency-Controlled Neural Architecture Search for Streaming Speech Recognition
Neural architecture search (NAS) has attracted much attention and has been
explored for automatic speech recognition (ASR). In this work, we focus on
streaming ASR scenarios and propose latency-controlled NAS for acoustic modeling. First, starting from the vanilla neural architecture, normal cells are converted into causal cells to control the total latency of the architecture.
Second, a revised operation space with a smaller receptive field is proposed to
generate the final architecture with low latency. Extensive experiments show
that: 1) based on the proposed neural architecture, networks with a medium latency of 550 ms and a low latency of 190 ms can be learned in the vanilla and revised operation spaces, respectively; 2) in the low-latency setting, the evaluated network achieves more than a 19% relative improvement (averaged over the four test sets) over the hybrid CLDNN baseline on a 10k-hour large-scale dataset.
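The latency control hinges on removing right context: a centered convolution looks (kernel - 1) / 2 frames into the future, while a causal one pads only on the left. A minimal sketch of that substitution, with hypothetical parameters, in PyTorch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv1d(nn.Module):
        """Causal stand-in for a 'normal' convolution cell: all padding is
        applied on the left, so the cell consumes no future frames and adds
        no algorithmic latency. Parameters are illustrative."""

        def __init__(self, channels: int, kernel: int = 3):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel)
            self.left_pad = kernel - 1  # look-back only, no look-ahead

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time); output keeps the same length.
            return self.conv(F.pad(x, (self.left_pad, 0)))

In the paper's terms, the causal cells cap the look-ahead of the architecture, and the revised operation space with a smaller receptive field shrinks it further, which is what separates the 550 ms and 190 ms settings.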