Deep Scattering and End-to-End Speech Models towards Low Resource Speech Recognition

Abstract

Automatic Speech Recognition (ASR) has made major advances largely due to two machine learning models: Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs). State-of-the-art results have been achieved by combining these two disparate methods into a hybrid system, which requires that the various components of the speech recognizer be trained independently under a probabilistic noisy-channel model. Although the hybrid HMM-DNN approach has been successful in recent studies, the independent development of its individual components makes ASR development fragile and expensive in terms of the time needed to build the various components and their associated sub-systems. The resulting trade-off is that ASR systems are difficult to develop and use, especially for new applications and languages. The alternative approach, known as the end-to-end paradigm, uses a single deep neural-network architecture to encapsulate as many subcomponents of speech recognition as possible in a single process: the latent variables of the sub-components are subsumed by the network's sub-architectures and their associated parameters. In turn, the simplified development process offered by the end-to-end paradigm is traded for higher internal model complexity and the greater computational resources needed to train end-to-end models. This research focuses on exploiting the development gains of end-to-end models for new and low-resource languages. Using a specialised, lightweight, convolution-like neural network called the deep scattering network (DSN) to replace the input layer of the end-to-end model, our objective was to measure the performance of the end-to-end model on these augmented speech features and to determine whether the lightweight, wavelet-based architecture brought any improvement for low-resource speech recognition in particular. The results showed that this compact strategy for speech pattern recognition is feasible, with deep scattering network features yielding higher-dimensional vectors than traditional speech features. With Word Error Rates of 26.8% and 76.7% on the SVCSR and LVCSR tasks respectively, the ASR system fell a few WER points short of its respective baselines. In addition, training times tended to be longer than those of the baselines, so the approach brought no significant improvement for low-resource speech recognition training.
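To make the front end described above concrete, the sketch below computes first- and second-order wavelet scattering coefficients for a batch of raw audio using the open-source kymatio library. This is a minimal illustration of a DSN-style feature extractor, not the configuration used in this work: the values of J, Q, and the input length are illustrative assumptions.

    # Minimal sketch: wavelet scattering features as an ASR front end.
    # Assumes the open-source kymatio library; J, Q, and the input
    # length T are illustrative choices, not the settings of this work.
    import torch
    from kymatio.torch import Scattering1D

    T = 2 ** 14   # samples per utterance chunk (assumed)
    J = 8         # maximum wavelet scale: averaging over 2**J samples
    Q = 8         # wavelets per octave in the first-order filter bank

    scattering = Scattering1D(J=J, shape=T, Q=Q)

    # A batch of 4 raw waveforms (random stand-in for real speech).
    x = torch.randn(4, T)

    # Sx has shape (batch, C, T / 2**J): C scattering coefficients
    # per output frame. C grows with J and Q, so these feature
    # vectors are higher-dimensional than, e.g., 13-coefficient
    # MFCCs, matching the trade-off noted in the abstract.
    Sx = scattering(x)
    print(Sx.shape)

In an end-to-end model, the output Sx would replace the usual spectral feature layer as input to the encoder, which is the substitution this research evaluates.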
