70 research outputs found
Advances in the application of support vector machines as probabilistic estimators for continuous automatic speech recognition
Tesis doctoral inédita. Universidad Autónoma de Madrid, Escuela Politécnica Superior, noviembre de 200
Modelling of a speech-to-text recognition system for air traffic control and NATO air command
Accent invariance in speech recognition is a challenging problem especially in the are of aviation. In this paper a speech recognition system is developed to transcribe accented speech between pilots and air traffic controllers. The system allows handling of accents in continuous speech by modelling phonemes using Hidden Markov Models (HMMs) with Gaussian mixture model (GMM) probability density functions for each state. These phonemes are used to build word models of the NATO phonetic alphabet as well as the numerals 0 to 9 with transcriptions obtained from the Carnegie Mellon University (CMU) pronouncing dictionary. Mel-Frequency Cepstral Co-efficients (MFCC) with delta and delta-delta coefficients are used for the feature extraction process. Amplitude normalisation and covariance scaling is implemented to improve recognition accuracy. A word error rate (WER) of 2% for seen speakers and 22% for unseen speakers is obtained.http://jit.ndhu.edu.twhj2023Electrical, Electronic and Computer Engineerin
Context-dependent modeling in a segment-based speech recognition system
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997.Includes bibliographical references (leaves 78-80).by Benjamin M. Serridge.M.Eng
Spoken command recognition for robotics
In this thesis, I investigate spoken command recognition technology for robotics. While high
robustness is expected, the distant and noisy conditions in which the system has to operate
make the task very challenging. Unlike commercial systems which all rely on a "wake-up"
word to initiate the interaction, the pipeline proposed here directly detect and recognizes
commands from the continuous audio stream. In order to keep the task manageable despite
low-resource conditions, I propose to focus on a limited set of commands, thus trading off
flexibility of the system against robustness.
Domain and speaker adaptation strategies based on a multi-task regularization paradigm
are first explored. More precisely, two different methods are proposed which rely on a tied
loss function which penalizes the distance between the output of several networks. The first
method considers each speaker or domain as a task. A canonical task-independent network is
jointly trained with task-dependent models, allowing both types of networks to improve by
learning from one another. While an improvement of 3.2% on the frame error rate (FER) of
the task-independent network is obtained, this only partially carried over to the phone error
rate (PER), with 1.5% of improvement. Similarly, a second method explored the parallel
training of the canonical network with a privileged model having access to i-vectors. This
method proved less effective with only 1.2% of improvement on the FER.
In order to make the developed technology more accessible, I also investigated the use
of a sequence-to-sequence (S2S) architecture for command classification. The use of an
attention-based encoder-decoder model reduced the classification error by 40% relative to a
strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing
the relevance of S2S architectures in such context. In order to improve the flexibility of the
trained system, I also explored strategies for few-shot learning, which allow to extend the
set of commands with minimum requirements in terms of data. Retraining a model on the
combination of original and new commands, I managed to achieve 40.5% of accuracy on the
new commands with only 10 examples for each of them. This scores goes up to 81.5% of
accuracy with a larger set of 100 examples per new command. An alternative strategy, based
on model adaptation achieved even better scores, with 68.8% and 88.4% of accuracy with 10
and 100 examples respectively, while being faster to train. This high performance is obtained
at the expense of the original categories though, on which the accuracy deteriorated. Those
results are very promising as the methods allow to easily extend an existing S2S model with
minimal resources.
Finally, a full spoken command recognition system (named iCubrec) has been developed
for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to
propose a fully hand-free experience. By segmenting only regions that are likely to contain
commands, the VAD module also allows to reduce greatly the computational cost of the
pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM
command recognition system for transcription. The VoCub dataset has been specifically
gathered to train a DNN-based acoustic model for our task. Through multi-condition training
with the CHiME4 dataset, an accuracy of 94.5% is reached on VoCub test set. A filler model,
complemented by a rejection mechanism based on a confidence score, is finally added to the
system to reject non-command speech in a live demonstration of the system
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
- …