Joint Training Methods for Tandem and Hybrid Speech Recognition Systems using Deep Neural Networks
Hidden Markov models (HMMs) have been the mainstream acoustic modelling approach for state-of-the-art automatic speech recognition (ASR) systems over the
past few decades. Recently, due to the rapid development of deep learning technologies, deep neural networks (DNNs) have become an essential part of nearly all kinds of ASR approaches. Among HMM-based ASR approaches, DNNs are most commonly used to extract features (tandem system configuration) or to directly produce HMM output probabilities (hybrid system configuration).
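As a rough illustration of the distinction (a minimal sketch with assumed shapes and function names, not code from the thesis or from any toolkit), the hybrid configuration turns DNN state posteriors into scaled likelihoods for HMM decoding, while the tandem configuration reads features off a narrow intermediate (bottleneck) layer and passes them to a separate GMM-HMM:

```python
import numpy as np

def hybrid_scaled_likelihoods(state_posteriors, state_priors, eps=1e-10):
    # Hybrid configuration: the DNN outputs P(state | frame), but HMM decoding
    # needs p(frame | state); dividing by the state prior (Bayes' rule with the
    # constant p(frame) dropped) gives a "scaled likelihood".  The log domain
    # is used for numerical stability.
    return np.log(state_posteriors + eps) - np.log(state_priors + eps)

def tandem_features(frame, layers, bottleneck_index):
    # Tandem configuration: propagate the frame through the DNN and return the
    # activations of the bottleneck layer as input features for a conventional
    # GMM-HMM acoustic model.
    h = frame
    for i, (W, b) in enumerate(layers):
        h = np.tanh(W @ h + b)      # hidden activation; the choice is illustrative
        if i == bottleneck_index:
            return h                # low-dimensional tandem feature vector
    return h
```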
Although DNN tandem and hybrid systems have been shown to have superior
performance to traditional ASR systems without any DNN models, there are still
issues with such systems. First, some of the DNN settings, such as the choice of
the context-dependent (CD) output target set and the hidden activation functions, are
usually determined independently from the DNN training process. Second, different
ASR modules are separately optimised based on different criteria following a greedy
build strategy. For instance, for tandem systems, the features are often extracted by a
DNN trained to classify individual speech frames, while acoustic models are built upon such features according to a sequence-level criterion. These issues mean that the best performance is not theoretically guaranteed.
This thesis focuses on alleviating both issues using joint training methods. In DNN
acoustic model joint training, the decision tree HMM state tying approach is extended
to cluster DNN-HMM states. Based on this method, an alternative CD-DNN training
procedure without relying on any additional system is proposed, which can produce
DNN acoustic models comparable in word error rate (WER) with those trained by the
conventional procedure. Meanwhile, the most common hidden activation functions,
the sigmoid and rectified linear unit (ReLU), are parameterised to enable automatic
learning of their function forms. Experiments on conversational telephone speech (CTS) Mandarin data give average relative character error rate (CER) reductions of 3.4% and 2.2% with the sigmoid and ReLU parameterisations, respectively. Such parameterised functions can also be applied to speaker adaptation tasks.
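The abstract does not give the exact parametric forms, so the following is only a plausible sketch of how such learnable activation functions can be set up (per-unit scale and slope parameters trained by backpropagation alongside the weights, written as assumed PyTorch-style modules):

```python
import torch
import torch.nn as nn

class PSigmoid(nn.Module):
    # Parameterised sigmoid f(x) = eta * sigmoid(gamma * x); with
    # eta = gamma = 1 it reduces to the standard sigmoid, and both
    # per-unit parameters are learned by backpropagation.
    def __init__(self, num_units):
        super().__init__()
        self.eta = nn.Parameter(torch.ones(num_units))
        self.gamma = nn.Parameter(torch.ones(num_units))

    def forward(self, x):
        return self.eta * torch.sigmoid(self.gamma * x)

class PReLU2(nn.Module):
    # Parameterised ReLU f(x) = alpha * max(x, 0) + beta * min(x, 0); with
    # alpha = 1 and beta = 0 it reduces to the standard ReLU.
    def __init__(self, num_units):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_units))
        self.beta = nn.Parameter(torch.zeros(num_units))

    def forward(self, x):
        return self.alpha * torch.clamp(x, min=0) + self.beta * torch.clamp(x, max=0)
```

Because the extra parameters are per hidden unit, they form a compact set that could be re-estimated on a small amount of speaker data while the weight matrices are kept fixed, which is one way such functions lend themselves to speaker adaptation.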
At the ASR system level, the DNN acoustic model and the corresponding speaker-dependent (SD) input feature transforms are jointly learned through minimum phone error
(MPE) training as an example of hybrid system joint training, which outperforms the
conventional hybrid system speaker adaptive training (SAT) method. MPE-based speaker-independent (SI) tandem system joint training is also studied. Experiments on
multi-genre broadcast (MGB) English data show that this method gives a reduction
in tandem system WER of 11.8% (relative), and the resulting tandem systems are
comparable to MPE hybrid systems in both WER and the number of parameters. In
addition, all approaches in this thesis have been implemented using the hidden Markov model toolkit (HTK), and the related source code has been or will be made publicly available with recent or future HTK releases, to increase the reproducibility of the work presented in this thesis.
Cambridge International Scholarship, Cambridge Overseas Trust
Research funding, EPSRC Natural Speech Technology Project
Research funding, DARPA BOLT Program
Research funding, IARPA Babel Program
Development of the CUHTK 2004 Mandarin conversational telephone speech transcription system
This paper describes the development of the CUHTK 2004 Mandarin conversational telephone speech transcription system. The paper details all aspects of the system, but concentrates on the development of the acoustic models. As there are significant differences between the available training corpora, both in terms of topics of conversation and accents, forms of data normalisation and adaptive training techniques are investigated. The baseline discriminatively trained acoustic models are compared to a system built with a Gaussianisation front-end, a speaker adaptively trained system and an adaptively trained structured precision matrix system. The models are finally evaluated within a multi-pass, multi-branch system combination framework. © 2005 IEEE