Network Training for Continuous Speech Recognition

Abstract

Spoken language processing is one of the oldest and most natural modes of information exchange between humans beings. For centuries, people have tried to develop machines that can understand and produce speech the way humans do so naturally. The biggest problem in our inability to model speech with computer programs and mathematics results from the fact that language is instinctive, whereas, the vocabulary and dialect used in communication are learned. Human beings are genetically equipped with the ability to learn languages, and culture imprints the vocabulary and dialect on each member of society. This thesis examines the role of pattern classification in the recognition of human speech, i.e., machine learning techniques that are currently being applied to the spoken language processing problem. The primary objective of this thesis is to create a network training paradigm that allows for direct training of multi-path models and alleviates the need for complicated systems and training recipes. A traditional trainer uses an expectation maximization (EM)based supervised training framework to estimate the parameters of a spoken language processing system. EM-based parameter estimation for speech recognition is performed using several complicated stages of iterative reestimation. These stages typically are prone to human error. The network training paradigm reduces the complexity of the training process while retaining the robustness of the EM-based supervised training framework. The hypothesis of this thesis is that the network training paradigm can achieve comparable recognition performance to a traditional trainer while alleviating the need for complicated systems and training recipes for spoken language processing systems

    Similar works