[[abstract]]Speech is the primary and most convenient means of communication between people. With the successful development of ever smaller electronic devices and the popularity of wireless communication and networking, it is widely believed that speech will play a more active role and will serve as the major human-machine interface for interaction between people and different kinds of smart devices in the near future. Research on automatic speech recognition (ASR) is therefore receiving increasing emphasis. In particular, the development of discriminative and robust feature extraction approaches for ASR systems deployed in real and diverse environments has continuously gained attention over the past two decades. With this observation in mind, in this thesis we studied auditory-perception-based feature extraction and data-driven linear feature transformation for robust speech recognition. For auditory-perception-based feature extraction, we extensively compared the conventional Mel-frequency Cepstral Coefficients (MFCC) with the Perceptual Linear Prediction Coefficients (PLPC), and also compared various ways to derive and combine their corresponding time trajectory information. For data-driven linear feature transformation, we first set out to show that linear discriminant analysis (LDA) outperforms principal component analysis (PCA) as a feature transformation for speech recognition. We then investigated several improved approaches, such as heteroscedastic linear discriminant analysis (HLDA) and heteroscedastic discriminant analysis (HDA), which remove the assumption, inherent in the derivation of LDA, that all classes share the same covariance.
Moreover, we proposed using the minimum classification error (MCE) and maximum mutual information (MMI) criteria, respectively, to optimize the transformation matrices, and compared them with the maximum likelihood (ML) criterion. Finally, the maximum likelihood linear transformation (MLLT) and other robustness techniques, such as feature mean subtraction and/or variance normalization, were further applied. All experiments were carried out on the Mandarin broadcast news corpus (MATBN). The initial experimental results were very promising.
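As an illustration of the feature mean subtraction and variance normalization mentioned above, the following is a minimal sketch and not the implementation used in the thesis. It assumes per-utterance statistics and features stored as a list of frames; the function name `mean_variance_normalize`, the `eps` parameter, and the toy data are hypothetical and chosen only for this example.

```python
import math


def mean_variance_normalize(frames, eps=1e-8):
    """Normalize each feature dimension to zero mean and unit variance.

    `frames` is a list of equal-length feature vectors from one utterance
    (an assumption of this sketch). `eps` guards against division by zero
    for a dimension that is constant over the utterance.
    """
    n = len(frames)
    dim = len(frames[0])
    # Per-dimension mean over all frames of the utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Per-dimension (population) standard deviation.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    # Subtract the mean, then scale by the standard deviation.
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]


# Toy two-dimensional "features" for three frames (made-up values).
utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = mean_variance_normalize(utterance)
```

Dropping the division step gives mean subtraction alone, which corresponds to the "and/or" in the text.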