2 research outputs found
CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments
Casual conversations involving multiple speakers and noise from surrounding
devices are common in everyday environments and degrade the performance of
automatic speech recognition systems. Such challenging environments are the
target of the CHiME-5 challenge. By employing a
convolutional neural network (CNN)-based multichannel end-to-end speech
recognition system, this study attempts to overcome the difficulties presented
by everyday environments. The system comprises an attention-based
encoder-decoder neural network that directly generates text output from
a sound input. The multichannel CNN encoder, which uses residual connections
and batch renormalization, is trained with augmented data, including white
noise injection. The experimental results show that the word error rate is
reduced by 8.5% and 0.6% absolute on the CHiME-5 corpus relative to a
single-channel end-to-end system and the best baseline (LF-MMI TDNN),
respectively.
Comment: 5 pages, 1 figure, EUSIPCO 201
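As a rough illustration of the white-noise-injection augmentation mentioned in this abstract, below is a minimal NumPy sketch. The SNR-controlled formulation, the `inject_white_noise` name, and the 10-30 dB sampling range are assumptions for illustration; the abstract does not state how the noise level is chosen.

```python
import numpy as np

def inject_white_noise(waveform: np.ndarray, snr_db: float) -> np.ndarray:
    """Add white Gaussian noise to a waveform at a given signal-to-noise ratio.

    The SNR knob is hypothetical: the paper only says white noise injection
    is part of the data augmentation, not how the noise level is set.
    """
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

# Example: augment a 1-second, 16 kHz utterance at a randomly drawn SNR.
utterance = np.random.randn(16000).astype(np.float32)  # placeholder audio
augmented = inject_white_noise(utterance, snr_db=np.random.uniform(10.0, 30.0))
```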
Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription
While end-to-end ASR systems have proven competitive with the conventional
hybrid approach, they are prone to accuracy degradation when it comes to noisy
and low-resource conditions. In this paper, we argue that, even in such
difficult cases, some end-to-end approaches show performance close to the
hybrid baseline. To demonstrate this, we use the CHiME-6 Challenge data as an
example of challenging environments and noisy conditions of everyday speech. We
experimentally compare and analyze CTC-Attention versus RNN-Transducer
approaches along with RNN versus Transformer architectures. We also provide a
comparison of acoustic features and speech enhancement methods. In addition, we evaluate
the effectiveness of neural network language models for hypothesis re-scoring
in low-resource conditions. Our best end-to-end model based on RNN-Transducer,
together with an improved beam search, comes within 3.8% WER absolute of the
LF-MMI TDNN-F CHiME-6 Challenge baseline. With Guided Source Separation based
training data augmentation, this approach outperforms the hybrid baseline
system by 2.7% WER absolute and the best previously reported end-to-end system
by 25.7% WER absolute.
Comment: Accepted by Interspeech 202
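The n-best hypothesis re-scoring with a neural network language model described in this abstract can be sketched as follows. The log-linear combination, the `lm_weight` value, and the toy vocabulary-based LM are illustrative assumptions, not the paper's exact recipe.

```python
from typing import Callable, List, Tuple

def rescore_nbest(
    nbest: List[Tuple[str, float]],        # (hypothesis, ASR log-probability)
    lm_logprob: Callable[[str], float],    # neural LM scorer (e.g. an RNN/Transformer LM)
    lm_weight: float = 0.3,                # hypothetical interpolation weight
) -> str:
    """Return the hypothesis maximizing the interpolated ASR + LM score."""
    def combined(hyp_and_score: Tuple[str, float]) -> float:
        hyp, asr_score = hyp_and_score
        return asr_score + lm_weight * lm_logprob(hyp)
    return max(nbest, key=combined)[0]

# Toy usage with a stand-in LM that penalizes words outside a tiny vocabulary.
vocab = {"pass", "the", "salt", "please"}
toy_lm = lambda hyp: sum(0.0 if w in vocab else -5.0 for w in hyp.split())
nbest = [("pass the salt pleas", -3.9), ("pass the salt please", -4.1)]
print(rescore_nbest(nbest, toy_lm))  # -> "pass the salt please"
```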