3 research outputs found
Deep Feed-forward Sequential Memory Networks for Speech Synthesis
The Bidirectional LSTM (BLSTM) RNN based speech synthesis system is among the
best parametric Text-to-Speech (TTS) systems in terms of the naturalness of
generated speech, especially in prosody. However, the model complexity and
inference cost of BLSTM prevent its use in many runtime applications.
Meanwhile, Deep Feed-forward Sequential Memory Networks (DFSMN) have
consistently outperformed BLSTM in both word error rate (WER) and runtime
computation cost on speech recognition tasks. Since speech synthesis, like
speech recognition, requires modeling long-term dependencies, in this paper we
investigate DFSMN for speech synthesis. Both objective and subjective
experiments show that, compared with the BLSTM TTS method, the DFSMN system can
generate synthesized speech of comparable quality while drastically reducing
model complexity and speech generation time.
Deep-FSMN for Large Vocabulary Continuous Speech Recognition
In this paper, we present an improved feedforward sequential memory network
(FSMN) architecture, namely Deep-FSMN (DFSMN), which introduces skip
connections between memory blocks in adjacent layers. These skip connections
enable information to flow across layers and thus alleviate the vanishing
gradient problem when building very deep structures. As a result, DFSMN
benefits significantly from the skip connections and the deep structure. We
compare the performance of DFSMN with BLSTM, both with and without lower frame
rate (LFR), on several large speech recognition tasks, including English and
Mandarin. Experimental results show that DFSMN consistently outperforms BLSTM
by a dramatic margin, especially when trained with LFR using CD-Phones as
modeling units. On the 2,000-hour Fisher (FSH) task, the proposed DFSMN
achieves a word error rate of 9.4% using only the cross-entropy criterion and
decoding with a 3-gram language model, a 1.5% absolute improvement over the
BLSTM. On a 20,000-hour Mandarin recognition task, the LFR-trained DFSMN
achieves more than 20% relative improvement over the LFR-trained BLSTM.
Moreover, the lookahead filter order of the memory blocks in DFSMN can easily
be adjusted to control the latency for real-time applications.
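To make the memory-block mechanism concrete, here is a minimal sketch of a single DFSMN layer in PyTorch. The class and hyperparameter names (DFSMNLayer, lookback_order, lookahead_order, stride) are chosen here for illustration and are not the authors' implementation: the tapped-delay memory is realized as a depthwise 1-D convolution over time, the skip connection adds the previous layer's memory output, and the lookahead order determines how many future frames (and hence how much latency) the layer requires.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DFSMNLayer(nn.Module):
    """One DFSMN layer: linear + projection + sequential memory block,
    with a skip connection from the previous layer's memory block."""

    def __init__(self, in_dim, hidden_dim, proj_dim,
                 lookback_order=10, lookahead_order=2, stride=1):
        super().__init__()
        self.linear = nn.Linear(in_dim, hidden_dim)
        self.project = nn.Linear(hidden_dim, proj_dim, bias=False)
        # Tapped-delay memory over lookback_order past frames and
        # lookahead_order future frames, implemented as a depthwise
        # 1-D convolution over time with dilation `stride`.
        kernel = lookback_order + lookahead_order + 1
        self.memory = nn.Conv1d(proj_dim, proj_dim, kernel_size=kernel,
                                dilation=stride, groups=proj_dim, bias=False)
        self.left_pad = lookback_order * stride
        self.right_pad = lookahead_order * stride  # lookahead => latency

    def forward(self, x, prev_memory=None):
        # x: (batch, time, in_dim); prev_memory: (batch, time, proj_dim) or None
        h = F.relu(self.linear(x))
        p = self.project(h)
        # Pad asymmetrically so past and future context stay explicit.
        padded = F.pad(p.transpose(1, 2), (self.left_pad, self.right_pad))
        m = self.memory(padded).transpose(1, 2) + p
        if prev_memory is not None:
            # Skip connection between memory blocks in adjacent layers.
            m = m + prev_memory
        return m
```

Stacking such layers and feeding each layer's memory output as prev_memory to the next layer gives the skip connections described above; increasing lookahead_order trades latency for more future context.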
Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
Although great progress has been made in automatic speech recognition (ASR),
significant performance degradation is still observed when recognizing
multi-talker mixed speech. In this paper, we propose and evaluate several
architectures to address this problem under the assumption that only a single
channel of the mixed signal is available. Our technique extends permutation
invariant training (PIT) by introducing a front-end feature separation module
trained with the minimum mean square error (MSE) criterion and a back-end
recognition module trained with the minimum cross-entropy (CE) criterion. More
specifically, during training we compute the average MSE or CE over the whole
utterance for each possible utterance-level output-target assignment, pick the
one with the minimum MSE or CE, and optimize for that assignment. This strategy
elegantly solves the label permutation problem observed in deep-learning-based
multi-talker mixed speech separation and recognition systems. The proposed
architectures are evaluated and compared on an artificially mixed AMI dataset
with both two- and three-talker mixed speech. The experimental results indicate
that our proposed architectures can cut the word error rate (WER) by 45.0% and
25.0% relative against the state-of-the-art single-talker speech recognition
system across all speakers, when their energies are comparable, for two- and
three-talker mixed speech, respectively. To our knowledge, this is the first
work on multi-talker mixed speech recognition for a challenging
speaker-independent, spontaneous, large-vocabulary continuous speech task.
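As a rough illustration of the utterance-level PIT criterion described above, the sketch below (in PyTorch, with hypothetical tensor shapes and the function name pit_cross_entropy chosen here for illustration) computes the average cross-entropy over the whole utterance for every possible output-target assignment, picks the minimum per utterance, and returns that loss. The MSE variant used for the front-end separation module would follow the same pattern with a per-frame squared error in place of CE.

```python
from itertools import permutations

import torch
import torch.nn.functional as F


def pit_cross_entropy(logits, targets):
    """Utterance-level PIT with the cross-entropy criterion.

    logits:  list of S tensors, one per output stream, each (batch, time, senones)
    targets: list of S tensors, one per talker, each (batch, time) senone labels
    """
    num_streams = len(logits)
    per_perm = []
    for perm in permutations(range(num_streams)):
        # Average CE over the whole utterance for this output-target assignment.
        ce = torch.zeros(logits[0].shape[0], device=logits[0].device)
        for out_idx, tgt_idx in enumerate(perm):
            frame_ce = F.cross_entropy(
                logits[out_idx].transpose(1, 2),  # (batch, senones, time)
                targets[tgt_idx],                 # (batch, time)
                reduction="none",
            )                                     # (batch, time)
            ce = ce + frame_ce.mean(dim=1)        # average over the utterance
        per_perm.append(ce / num_streams)
    # For each utterance, pick the assignment with the minimum utterance-level
    # CE and optimize for that assignment only.
    min_ce, _ = torch.stack(per_perm, dim=0).min(dim=0)
    return min_ce.mean()


# Example for two-talker mixed speech: loss = pit_cross_entropy([y1, y2], [t1, t2])
```

Choosing the assignment once per utterance, rather than per frame, is what resolves the label permutation problem: the same output stream is tied to the same talker for the whole utterance.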