A critical examination of deep learning approaches to automated speech recognition

Abstract

Recently, deep learning techniques have been successfully applied to automatic speech recognition (ASR) tasks. Most current speech recognition systems use Hidden Markov Models (HMMs) to handle the temporal variability of speech, with Gaussian mixture models (GMMs) modeling the emission probabilities of the HMM. Deep Neural Networks (DNNs) and Deep Belief Networks (DBNs), however, have recently been shown to outperform GMMs at modeling HMM emission probabilities. Deep architectures such as DBNs with many hidden layers are well suited to multilevel feature representation, building a distributed representation of a given input at several levels of abstraction. These networks are first pre-trained in an unsupervised mode, without any discriminative information, as a multi-layer generative model of a window of feature vectors. Once the generative pre-training is complete, discriminative fine-tuning adjusts the model parameters to improve their predictive accuracy. Our aim is to study the different levels of representation of speech acoustic features produced by the hidden layers of DBNs. To this end, we estimate the phoneme recognition error and use classification accuracy, evaluated with Support Vector Machines (SVMs), as a measure of the separability of the DBN representations of 61 phoneme classes. In addition, we investigate the relations between different subgroups/categories of phonemes at various representation levels using correlation analysis. The experiments were performed on the TIMIT database, and the simulations were developed to run on a graphics processing unit (GPU) cluster at PDC/KTH.
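To make the evaluation procedure concrete, the following is a minimal sketch (not the paper's actual code) of the layer-wise separability analysis described above: acoustic feature vectors are propagated through a pre-trained DBN with a deterministic sigmoid pass, and a linear SVM is trained on each hidden layer's activations, with its held-out accuracy serving as a proxy for phoneme-class separability at that level. The arrays X and y, the trained weights dbn_W and biases dbn_b, and the use of scikit-learn's LinearSVC are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_activations(features, weights, biases):
    """Yield each DBN hidden layer's activation for `features`
    using a deterministic (mean-field) forward pass."""
    h = features
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
        yield h

def separability_per_layer(X, y, dbn_W, dbn_b):
    """Hypothetical inputs: X is (n_frames, n_dims) acoustic features,
    y holds the 61 TIMIT phone labels, dbn_W/dbn_b are pre-trained
    DBN parameters. Returns (layer index, SVM accuracy) pairs."""
    scores = []
    for level, H in enumerate(layer_activations(X, dbn_W, dbn_b), start=1):
        H_tr, H_te, y_tr, y_te = train_test_split(
            H, y, test_size=0.2, random_state=0)
        svm = LinearSVC(C=1.0, max_iter=5000).fit(H_tr, y_tr)
        scores.append((level, svm.score(H_te, y_te)))
    return scores
```

Comparing the accuracies across layers then indicates at which level of the DBN the phoneme classes become most linearly separable.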
