Improving large vocabulary continuous speech recognition by combining GMM-based and reservoir-based acoustic modeling
In earlier work we have shown that good phoneme recognition is possible with a so-called reservoir, a special type of recurrent neural network. In this paper, different architectures based on Reservoir Computing (RC) for large vocabulary continuous speech recognition are investigated. Besides experiments with HMM hybrids, it is shown that an RC-HMM tandem can achieve the same recognition accuracy as a classical HMM, which is a promising result for such a fairly new paradigm. It is also demonstrated that a state-level combination of the scores of the tandem and the baseline HMM leads to a significant improvement over the baseline. A word error rate reduction of the order of 20% relative is possible.
Cluster-Based Adaptation Using Density Forest for HMM Phone Recognition
Publication in the conference proceedings of EUSIPCO, Lisbon, Portugal, 201
Speaker adaptation and adaptive training for jointly optimised tandem systems
Speaker independent (SI) Tandem systems trained by joint optimisation of bottleneck (BN) deep neural networks (DNNs) and Gaussian mixture models (GMMs) have been found to produce similar word error rates (WERs) to Hybrid DNN systems. A key advantage of using GMMs is that existing speaker adaptation methods, such as maximum likelihood linear regression (MLLR), can be used to account for diverse speaker variations and improve system robustness. This paper investigates speaker adaptation and adaptive training (SAT) schemes for jointly optimised Tandem systems. Adaptation techniques investigated include constrained MLLR (CMLLR) transforms based on BN features for SAT as well as MLLR and parameterised sigmoid functions for unsupervised test-time adaptation. Experiments using English multi-genre broadcast (MGB3) data show that CMLLR SAT yields a 4% relative WER reduction over jointly trained Tandem and Hybrid SI systems, and further reductions in WER are obtained by system combination.
Phoneme and sentence-level ensembles for speech recognition
We address the question of whether and how boosting and bagging can be used for speech recognition. To do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods compared to a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.
Acoustic Scene Classification
This work was supported by the Centre for Digital Music Platform (grant EP/K009559/1) and a Leadership Fellowship (EP/G007144/1), both from the United Kingdom Engineering and Physical Sciences Research Council.