2 research outputs found
CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments
Casual conversations involving multiple speakers and noise from surrounding
devices are common in everyday environments and degrade the performance of
automatic speech recognition systems. Such challenging environments are the
target of the CHiME-5 challenge. By employing a
convolutional neural network (CNN)-based multichannel end-to-end speech
recognition system, this study attempts to overcome the difficulties presented
by everyday environments. The system comprises an attention-based
encoder-decoder neural network that directly generates text output from
a sound input. The multichannel CNN encoder, which uses residual connections
and batch renormalization, is trained with augmented data, including white
noise injection. The experimental results show that the word error rate is
reduced by 8.5% and 0.6% absolute on the CHiME-5 corpus relative to a
single-channel end-to-end system and the best baseline (LF-MMI TDNN),
respectively.
Comment: 5 pages, 1 figure, EUSIPCO 201
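As a rough illustration of the white-noise-injection augmentation mentioned in this abstract, below is a minimal NumPy sketch. The SNR-controlled formulation, the `inject_white_noise` name, and the 10-30 dB sampling range are assumptions for illustration; the abstract does not state how the noise level is chosen.

```python
import numpy as np

def inject_white_noise(waveform: np.ndarray, snr_db: float) -> np.ndarray:
    """Add white Gaussian noise to a waveform at a given signal-to-noise ratio.

    The SNR knob is hypothetical: the paper only says white noise injection
    is part of the data augmentation, not how the noise level is set.
    """
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

# Example: augment a 1-second, 16 kHz utterance at a randomly drawn SNR.
utterance = np.random.randn(16000).astype(np.float32)  # placeholder audio
augmented = inject_white_noise(utterance, snr_db=np.random.uniform(10.0, 30.0))
```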
Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription
While end-to-end ASR systems have proven competitive with the conventional
hybrid approach, they are prone to accuracy degradation when it comes to noisy
and low-resource conditions. In this paper, we argue that, even in such
difficult cases, some end-to-end approaches show performance close to the
hybrid baseline. To demonstrate this, we use the CHiME-6 Challenge data as an
example of challenging environments and noisy conditions of everyday speech. We
experimentally compare and analyze CTC-Attention versus RNN-Transducer
approaches along with RNN versus Transformer architectures. We also provide a
comparison of acoustic features and speech enhancement methods. In addition, we evaluate
the effectiveness of neural network language models for hypothesis re-scoring
in low-resource conditions. Our best end-to-end model based on RNN-Transducer,
together with an improved beam search, comes within 3.8% WER absolute of the
LF-MMI TDNN-F CHiME-6 Challenge baseline. With Guided Source Separation based
training data augmentation, this approach outperforms the hybrid baseline
system by 2.7% WER absolute and the best previously reported end-to-end system
by 25.7% WER absolute.
Comment: Accepted by Interspeech 202
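The n-best hypothesis re-scoring with a neural network language model described in this abstract can be sketched as follows. The log-linear combination, the `lm_weight` value, and the toy vocabulary-based LM are illustrative assumptions, not the paper's exact recipe.

```python
from typing import Callable, List, Tuple

def rescore_nbest(
    nbest: List[Tuple[str, float]],        # (hypothesis, ASR log-probability)
    lm_logprob: Callable[[str], float],    # neural LM scorer (e.g. an RNN/Transformer LM)
    lm_weight: float = 0.3,                # hypothetical interpolation weight
) -> str:
    """Return the hypothesis maximizing the interpolated ASR + LM score."""
    def combined(hyp_and_score: Tuple[str, float]) -> float:
        hyp, asr_score = hyp_and_score
        return asr_score + lm_weight * lm_logprob(hyp)
    return max(nbest, key=combined)[0]

# Toy usage with a stand-in LM that penalizes words outside a tiny vocabulary.
vocab = {"pass", "the", "salt", "please"}
toy_lm = lambda hyp: sum(0.0 if w in vocab else -5.0 for w in hyp.split())
nbest = [("pass the salt pleas", -3.9), ("pass the salt please", -4.1)]
print(rescore_nbest(nbest, toy_lm))  # -> "pass the salt please"
```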