Feature extraction in automatic speech recognition (ASR) can be regarded
as learning representations from lower-level to more abstract higher-level features.
Lower-level features can be viewed as features from the signal domain,
such as perceptual linear predictive (PLP) features and Mel-frequency cepstral
coefficients (MFCCs). Higher-level feature representations can be considered
as bottleneck features (BNFs) learned using deep neural networks
(DNNs). In this thesis, we focus on improving feature extraction at different
levels, mainly for ASR.
The first part of this thesis focuses on learning features from the signal
domain that help ASR. Hand-crafted spectral and cepstral features such as
MFCCs are the main features used in most conventional ASR systems; all are
inspired by physiological models of the human auditory system. However, some
aspects of the signal, such as pitch, cannot be easily extracted from spectral
features but are found to be useful for ASR. We explore a new algorithm to extract
a pitch feature directly from the signal for ASR and show that this feature,
appended to the other features, gives consistent improvements across various
languages, especially tonal languages.
We then investigate replacing the conventional features with features trained
jointly from the signal domain, using time-domain and frequency-domain approaches.
The results show that our time-domain joint feature learning setup
achieves performance on par with state-of-the-art systems using MFCCs, while our
frequency-domain setup outperforms them on various datasets.
Joint feature extraction results in learning data- or language-dependent filter
banks, which can degrade performance in unseen noise and channel conditions
or in other languages. To tackle this, we investigate joint universal feature
learning across different languages using the proposed direct-from-signal
setups. We then investigate the filter banks learned in this setup and propose
a new set of features as an extension to conventional Mel filter banks. The results
show consistent word error rate (WER) improvements, especially in clean
conditions.
The second part of this thesis focuses on learning higher-level feature embeddings.
We investigate learning and transferring deep feature representations
across different domains using multi-task learning and weight transfer
approaches. These approaches have been adopted to explicitly learn intermediate-level features
that are useful for several different tasks