Learning Feature Representation for Automatic Speech Recognition

Abstract

Feature extraction in automatic speech recognition (ASR) can be regarded as learning representations that progress from lower-level features to more abstract, higher-level features. Lower-level features can be viewed as features from the signal domain, such as perceptual linear predictive (PLP) features and Mel-frequency cepstral coefficients (MFCCs). Higher-level feature representations can be considered as bottleneck features (BNFs) learned using deep neural networks (DNNs). In this thesis, we focus on improving feature extraction at different levels, mainly for ASR. The first part of this thesis focuses on learning features from the signal domain that help ASR. Hand-crafted spectral and cepstral features such as MFCCs are the main features used in most conventional ASR systems; all are inspired by physiological models of the human auditory system. However, some aspects of the signal, such as pitch, cannot be easily extracted from spectral features but are found to be useful for ASR. We explore a new algorithm to extract a pitch feature directly from the signal for ASR and show that this feature, appended to the other features, gives consistent improvements across various languages, especially tonal languages. We then investigate replacing the conventional features with features trained jointly from the signal domain, using time-domain and frequency-domain approaches. The results show that our time-domain joint feature learning setup achieves performance on par with the state of the art obtained using MFCCs, while our frequency-domain setup outperforms them on various datasets. Joint feature extraction results in learning data-dependent or language-dependent filter banks, which can degrade performance in unseen noise and channel conditions or in other languages. To tackle this, we investigate joint universal feature learning across different languages using the proposed direct-from-signal setups. We then investigate the filter banks learned in this setup and propose a new set of features as an extension to conventional Mel filter banks. The results show consistent word error rate (WER) improvements, especially in clean conditions.

The second part of this thesis focuses on learning higher-level feature embeddings. We investigate learning and transferring deep feature representations across different domains using multi-task learning and weight transfer approaches. These approaches have been adopted to explicitly learn intermediate-level features that are useful for several different tasks.
