Learning Feature Representation for Automatic Speech Recognition
Feature extraction in automatic speech recognition (ASR) can be regarded
as learning representations from lower-level to more abstract higher-level features.
Lower-level features can be viewed as features from the signal domain,
such as perceptual linear predictive (PLP) and Mel-frequency cepstral coefficients
(MFCCs) features. Higher-level feature representations can be considered
as bottleneck features (BNFs) learned using deep neural networks
(DNNs). In this thesis, we focus on improving feature extraction at different
levels, mainly for ASR.
The first part of this thesis focuses on learning features from the signal
domain that help ASR. Hand-crafted spectral and cepstral features such as
MFCC are the main features used in most conventional ASR systems; all are
inspired by physiological models of the human auditory system. However, some
aspects of the signal such as pitch cannot be easily extracted from spectral
features, but are found to be useful for ASR. We explore a new algorithm to extract
a pitch feature directly from the signal for ASR and show that this feature, appended to the other features, gives consistent improvements in various languages,
especially tonal languages.
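As a rough illustration of the idea, the sketch below estimates a pitch value for one frame with a simple autocorrelation method (a stand-in; the thesis's actual extractor is not specified here) and appends it to a placeholder spectral feature vector. All names and parameters are illustrative assumptions:

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate F0 of one frame via autocorrelation (a simple stand-in
    for a direct-from-signal pitch extractor)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
    lag = lo + np.argmax(ac[lo:hi + 1])      # strongest periodicity in range
    return sr / lag

# Toy example: a 200 Hz tone; the pitch value is appended as one extra
# dimension to a stand-in spectral feature vector (MFCC/PLP in practice).
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 200.0 * t)
frame = sig[:400]
f0 = autocorr_pitch(frame, sr)
spectral_feats = np.abs(np.fft.rfft(frame))[:13]    # placeholder for MFCCs
augmented = np.concatenate([spectral_feats, [f0]])  # pitch-appended feature
```

In a real system the pitch track would be computed per frame and appended to every feature vector before training.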
We then investigate replacing the conventional features with features trained
jointly from the signal domain, using both time-domain and frequency-domain approaches.
The results show that our time-domain joint feature learning setup
matches the state-of-the-art performance of MFCCs, while our frequency-domain
setup outperforms them on various datasets.
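A minimal sketch of a direct-from-signal frontend, assuming a bank of FIR filters as the learnable parameters. The filter count, length, and the rectify-and-compress recipe are illustrative assumptions, not the thesis's exact setup; in the joint setting these weights would be updated by backpropagation together with the acoustic model:

```python
import numpy as np

rng = np.random.default_rng(0)

# A bank of FIR filters replacing fixed Mel analysis. Here they are just
# randomly initialised; in joint training they are learnable parameters.
n_filters, filt_len, hop = 40, 128, 64
filters = rng.standard_normal((n_filters, filt_len)) * 0.01

def frontend(signal):
    """Frame the waveform, apply the filter bank, rectify and compress --
    one common direct-from-signal recipe (an assumption for illustration)."""
    n_frames = 1 + (len(signal) - filt_len) // hop
    frames = np.stack([signal[i * hop: i * hop + filt_len]
                       for i in range(n_frames)])
    activations = frames @ filters.T       # (n_frames, n_filters)
    return np.log1p(np.abs(activations))   # log-style compression

sig = rng.standard_normal(16000)           # 1 s of audio at 16 kHz
feats = frontend(sig)                      # feature matrix fed to the DNN
```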
Joint feature extraction results in learning data- or language-dependent filter
banks, which can degrade performance in unseen noise and channel conditions
or on other languages. To tackle this, we investigate joint universal feature
learning across different languages using the proposed direct-from-signal
setups. We then analyse the filter banks learned in this setup and propose
a new set of features as an extension to the conventional Mel filter banks. The results
show consistent word error rate (WER) improvements, especially in clean
conditions.
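For reference, the conventional Mel filter bank that the proposed features extend can be constructed as below (standard triangular filters on the Mel scale; the thesis's extension itself is not shown here):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular filters spaced uniformly on the Mel scale, mapped to
    FFT bins -- the conventional baseline the learned features extend."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                 # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()   # (40 filters) x (257 FFT bins)
```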
The second part of this thesis focuses on learning higher-level feature embeddings.
We investigate learning and transferring deep feature representations
across different domains using multi-task learning and weight transfer
approaches. These have been adopted to explicitly learn intermediate-level features
that are useful for several different tasks.
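A toy sketch of the weight-transfer idea, representing a network as a dictionary of layer weights: shared hidden layers are copied from a source-domain network and only the task-specific output layer is re-initialised. All shapes, names, and the random "trained" weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Source DNN trained on a data-rich domain (weights are random here --
# an illustrative assumption; no real trained model is available).
source = {f"hidden{i}": rng.standard_normal((256, 256)) for i in range(4)}
source["output"] = rng.standard_normal((256, 3000))  # source-domain targets

# Weight transfer: copy the shared hidden layers, then re-initialise a
# new output layer for the target domain's (smaller) target set.
target = {k: v.copy() for k, v in source.items() if k.startswith("hidden")}
target["output"] = rng.standard_normal((256, 1500)) * 0.01

# The target network would then be fine-tuned on target-domain data.
```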
Neural PLDA Modeling for End-to-End Speaker Verification
While deep learning models have made significant advances in supervised
classification problems, the application of these models for out-of-set
verification tasks like speaker recognition has been limited to deriving
feature embeddings. The state-of-the-art x-vector PLDA based speaker
verification systems use a generative model based on probabilistic linear
discriminant analysis (PLDA) for computing the verification score. Recently, we
proposed a neural network approach for backend modeling in speaker
verification called the neural PLDA (NPLDA) where the likelihood ratio score of
the generative PLDA model is posed as a discriminative similarity function and
the learnable parameters of the score function are optimized using a
verification cost. In this paper, we extend this work to achieve joint
optimization of the embedding neural network (x-vector network) with the NPLDA
network in an end-to-end (E2E) fashion. The proposed end-to-end model is
optimized directly from the acoustic features with a verification cost function,
and during testing the model directly outputs the likelihood ratio score. With
various experiments using the NIST speaker recognition evaluation (SRE) 2018
and 2019 datasets, we show that the proposed E2E model improves significantly
over the x-vector PLDA baseline speaker verification system.
Comment: Accepted in Interspeech 2020. GitHub Implementation Repos:
https://github.com/iiscleap/E2E-NPLDA and
https://github.com/iiscleap/NeuralPld
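The NPLDA scoring idea can be sketched as a quadratic similarity function with the same functional form as the PLDA log-likelihood ratio, where the matrices are learnable parameters. Here they are randomly initialised for illustration; in the paper they would be optimised with a verification cost:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16  # embedding dim (x-vectors are typically larger; small for clarity)

# Learnable parameters of the discriminative PLDA-style score function.
P = rng.standard_normal((d, d)); P = 0.5 * (P + P.T)  # cross term (symmetric)
Q = rng.standard_normal((d, d)); Q = 0.5 * (Q + Q.T)  # within term (symmetric)
c = 0.0                                               # learnable offset

def nplda_score(x_e, x_t):
    """Quadratic similarity with the functional form of the PLDA
    log-likelihood ratio: s = x_e'P x_t + x_e'Q x_e + x_t'Q x_t + c."""
    return float(x_e @ P @ x_t + x_e @ Q @ x_e + x_t @ Q @ x_t + c)

x_enroll = rng.standard_normal(d)   # enrollment embedding (toy)
x_test = rng.standard_normal(d)     # test embedding (toy)
s = nplda_score(x_enroll, x_test)   # verification score
```

In the E2E extension, the x-vector network producing `x_enroll` and `x_test` is trained jointly with `P`, `Q`, and `c` under the same verification cost.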
Recent Advances in Signal Processing
Signal processing is a critical component of most new technological inventions and poses challenges in a wide variety of applications across science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian, and have always favored closed-form tractability over real-world accuracy; these constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five areas depending on the application at hand: image processing, speech processing, communication systems, time-series analysis, and educational packages. The book has the advantage of providing a collection of applications that are completely independent and self-contained; the interested reader can choose any chapter and skip to another without losing continuity.
Low latency modeling of temporal contexts for speech recognition
This thesis focuses on the development of neural network acoustic models for large vocabulary continuous speech recognition (LVCSR) to satisfy the design goals of low latency and low computational complexity. Low latency enables online speech recognition, and low computational complexity helps reduce the computational cost during both training and inference.
Long-span sequential dependencies and sequential distortions in the input vector sequence are a major challenge in acoustic modeling. Recurrent neural networks have been shown to effectively model these dependencies. Specifically, bidirectional long short-term memory (BLSTM) networks provide state-of-the-art performance across several LVCSR tasks. However, the deployment of bidirectional models for online LVCSR is non-trivial due to their large latency, and unidirectional LSTM models are typically preferred.
In this thesis we explore the use of hierarchical temporal convolution, in the form of time-delay neural networks (TDNNs), to model long-span temporal dependencies. We propose a sub-sampled variant of these networks which reduces the computational complexity by ~5x compared to standard TDNNs during frame-randomized pre-training. These models are shown to be effective in modeling long-span temporal contexts; however, there is a performance gap compared to (B)LSTMs.
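The splicing-with-gaps idea behind sub-sampled TDNNs can be sketched as a dilated temporal convolution: each layer splices frames at a few gapped offsets rather than a dense window, so context widens with depth at little extra cost. Layer sizes and offsets below are illustrative assumptions:

```python
import numpy as np

def tdnn_layer(x, w, dilation):
    """One TDNN layer: splice frames at offsets {-dilation, 0, +dilation}
    and apply a linear transform plus ReLU -- equivalent to a dilated
    1-D convolution over the frame sequence."""
    T, D = x.shape
    d = dilation
    out = []
    for t in range(d, T - d):
        spliced = np.concatenate([x[t - d], x[t], x[t + d]])  # (3*D,)
        out.append(np.maximum(spliced @ w, 0.0))
    return np.stack(out)

rng = np.random.default_rng(3)
T, D, H = 50, 8, 16                      # frames, input dim, hidden dim (toy)
x = rng.standard_normal((T, D))
w1 = rng.standard_normal((3 * D, H)) * 0.1
w2 = rng.standard_normal((3 * H, H)) * 0.1
h1 = tdnn_layer(x, w1, dilation=1)       # narrow local context
h2 = tdnn_layer(h1, w2, dilation=3)      # wider context via gapped splicing
```

Stacking layers with growing dilations gives a receptive field spanning many frames while each layer still touches only three inputs per output frame.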
As recent advancements in acoustic model training have eliminated the need for frame-randomized pre-training, we modify the TDNN architecture to use higher sampling rates, as the increased computation can be amortized over the sequence. These variants of sub-sampled TDNNs provide performance superior to unidirectional LSTM networks, while also affording a lower real-time factor (RTF) during inference. However, we show that BLSTM models outperform both the TDNN and LSTM models.
We propose a hybrid architecture interleaving temporal convolution and LSTM layers, which is shown to outperform the BLSTM models. Further, we improve these BLSTM models by using higher frame rates at lower layers, and show that the proposed TDNN-LSTM model performs similarly to these superior BLSTM models while reducing the overall latency to 200 ms.
Finally, we describe an online system for reverberation-robust ASR, using the above models in conjunction with data augmentation techniques like reverberation simulation, which simulates far-field environments, and volume perturbation, which helps tackle volume variation even without gain normalization.
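The two augmentation steps can be sketched as convolution with a room impulse response plus a random gain. The synthetic exponentially decaying RIR and the gain range below are illustrative assumptions, not the system's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(4)

def augment(signal, rir, gain_range=(0.125, 2.0)):
    """Far-field simulation by convolving with a room impulse response,
    followed by random volume perturbation (gain range is an assumed
    example; real systems draw RIRs from measured or simulated rooms)."""
    reverbed = np.convolve(signal, rir)[: len(signal)]  # keep original length
    gain = rng.uniform(*gain_range)
    return gain * reverbed

sig = rng.standard_normal(16000)  # 1 s of audio at 16 kHz (toy signal)
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # synthetic RIR
aug = augment(sig, rir)           # augmented training example
```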