Improving noise robust automatic speech recognition with single-channel time-domain enhancement network
With the advent of deep learning, research on noise-robust automatic speech
recognition (ASR) has progressed rapidly. However, ASR performance in noisy
conditions of single-channel systems remains unsatisfactory. Indeed, most
single-channel speech enhancement (SE) methods (denoising) have brought only
limited performance gains over state-of-the-art ASR back-ends trained on
multi-condition training data. Recently, there has been much research on neural
network-based SE methods working in the time-domain showing levels of
performance never attained before. However, it has not been established whether
the high enhancement performance achieved by such time-domain approaches could
be translated into ASR. In this paper, we show that a single-channel
time-domain denoising approach can significantly improve ASR performance,
providing more than 30% relative word error reduction over a strong ASR
back-end on the real evaluation data of the single-channel track of the CHiME-4
dataset. These positive results demonstrate that single-channel noise reduction
can still improve ASR performance, which should open the door to more research
in that direction.
Comment: 5 pages, to appear in ICASSP202
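For readers unfamiliar with the metric, the "relative word error reduction" quoted in the abstract above is conventionally the drop in error rate divided by the baseline error rate. The sketch below illustrates that arithmetic only; the function name and the example error rates are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_rate: float, new_rate: float) -> float:
    """Return the relative reduction of an error rate, in percent.

    Both arguments are error rates on the same scale (e.g. WER in percent).
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return 100.0 * (baseline_rate - new_rate) / baseline_rate


# Hypothetical numbers for illustration only (not results from the paper):
# a back-end at 20.0 WER that improves to 14.0 WER with enhancement.
reduction = relative_error_reduction(20.0, 14.0)
print(f"relative reduction: {reduction:.1f}%")
```

The same formula applies to character error rate (CER) when the recognition unit is a character rather than a word.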
Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition
Joint training frameworks for speech enhancement and recognition
have achieved quite good performance for robust end-to-end automatic speech
recognition (ASR). However, these methods use only the enhanced feature as
the input of the speech recognition component, which is affected by the speech
distortion problem. In order to address this problem, this paper proposes a
gated recurrent fusion (GRF) method with joint training framework for robust
end-to-end ASR. The GRF algorithm is used to dynamically combine the noisy and
enhanced features. Therefore, the GRF can not only remove the noise signals
from the enhanced features, but also learn the raw fine structures from the
noisy features so that it can alleviate the speech distortion. The proposed
method consists of speech enhancement, GRF and speech recognition. Firstly, the
mask-based speech enhancement network is applied to enhance the input speech.
Secondly, the GRF is applied to address the speech distortion problem. Thirdly,
to improve the performance of ASR, the state-of-the-art speech transformer
algorithm is used as the speech recognition component. Finally, the joint
training framework is utilized to optimize these three components
simultaneously. Our experiments are conducted on an open-source Mandarin speech
corpus called AISHELL-1. Experimental results show that the proposed method
achieves a relative character error rate (CER) reduction of 10.04% over the
conventional joint enhancement and transformer method that uses only the enhanced
features. In particular, at a low signal-to-noise ratio (0 dB), our proposed
method achieves better performance, with a 12.67% CER reduction, which
suggests the potential of our proposed method.
Comment: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language
Processin
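The idea of dynamically combining noisy and enhanced features through a gate can be sketched as follows. This is not the paper's GRF, whose recurrent structure and parameters the abstract does not specify; it is a simplified, non-recurrent, element-wise gate, and every name and weight here (`gated_fusion`, `w_noisy`, `w_enhanced`, `bias`) is a hypothetical stand-in for learned parameters.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(
    noisy: List[float],
    enhanced: List[float],
    w_noisy: float,
    w_enhanced: float,
    bias: float,
) -> List[float]:
    """Blend two feature vectors element-wise with a sigmoid gate.

    A gate near 1 keeps the enhanced feature; a gate near 0 falls back
    to the noisy feature. The scalar weights are assumptions standing in
    for parameters that a real model would learn.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for n, e in zip(noisy, enhanced):
        gate = sigmoid(w_noisy * n + w_enhanced * e + bias)
        fused.append(gate * e + (1.0 - gate) * n)
    return fused


# With zero weights and zero bias the gate is 0.5 everywhere,
# so the output is the plain average of the two inputs.
print(gated_fusion([1.0, 3.0], [3.0, 5.0], 0.0, 0.0, 0.0))
```

In a jointly trained system the gate parameters would be optimized together with the enhancement and recognition components, so the model itself decides, per feature, how much of the noisy input to retain.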