Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Deep neural network features and semi-supervised training for low resource speech recognition
We propose a new technique for training deep neural networks (DNNs) as data-driven feature front-ends for large vocabulary continuous speech recognition (LVCSR) in low-resource settings. To circumvent the lack of sufficient training data for acoustic modeling in these scenarios, we use transcribed multilingual data and semi-supervised training to build the proposed feature front-ends. In our experiments, the proposed features provide an absolute improvement of 16% in a low-resource LVCSR setting with only one hour of in-domain training data. While close to three-fourths of these gains come from DNN-based features, the remainder comes from semi-supervised training. Index Terms: low resource, speech recognition, deep neural networks, semi-supervised training, bottleneck features
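As a rough illustration of what such a front-end looks like, here is a minimal PyTorch sketch of a bottleneck DNN; the layer sizes, sigmoid activations, and the extract helper are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn as nn

    class BottleneckDNN(nn.Module):
        """Illustrative bottleneck feature front-end. The network is trained
        to classify HMM states; afterwards the narrow hidden layer's
        activations are kept as features for the low-resource recognizer.
        All layer sizes here are placeholders, not values from the paper."""

        def __init__(self, input_dim=440, bottleneck_dim=40, num_targets=2000):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 1024), nn.Sigmoid(),
                nn.Linear(1024, 1024), nn.Sigmoid(),
                nn.Linear(1024, bottleneck_dim),          # the bottleneck layer
            )
            self.classifier = nn.Sequential(
                nn.Sigmoid(),
                nn.Linear(bottleneck_dim, num_targets),   # HMM-state posteriors
            )

        def forward(self, x):                  # used during (semi-)supervised training
            return self.classifier(self.encoder(x))

        def extract(self, x):                  # used as the feature front-end
            with torch.no_grad():
                return self.encoder(x)

In the paper's setting, the training targets would come from multilingual transcribed data plus automatically transcribed (semi-supervised) data; the extracted bottleneck activations then feed the in-domain acoustic model.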
Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information
This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance at different levels of the system, such as feature extraction, acoustic modeling, and language modeling. On the application side, this thesis also includes research on non-native and code-switching speech.
Analysis of Data Augmentation Methods for Low-Resource Maltese ASR
Recent years have seen increased interest in the computational speech processing of Maltese, but resources remain sparse. In this paper, we consider data augmentation techniques for improving speech recognition for low-resource languages, focusing on Maltese as a test case. We consider three different types of data augmentation: unsupervised training, multilingual training, and the use of synthesized speech as training data. The goal is to determine which of these techniques, or which combination of them, is the most effective at improving speech recognition for languages where the starting point is a small corpus of approximately 7 hours of transcribed speech. Our results show that combining the data augmentation techniques studied here leads to an absolute WER improvement of 15% without the use of a language model.
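Of the three techniques, the unsupervised-training component is the simplest to sketch: a seed model decodes untranscribed audio and confident hypotheses are added to the training pool. The outline below is hypothetical; seed_model.transcribe, the confidence threshold, and all other names are assumed for illustration and do not come from the paper.

    def pseudo_label(seed_model, untranscribed_utts, threshold=0.9):
        """Hypothetical semi-supervised augmentation: decode untranscribed
        audio with a seed ASR model and keep only confident hypotheses
        as extra (audio, transcript) training pairs."""
        augmented = []
        for utt in untranscribed_utts:
            hyp, confidence = seed_model.transcribe(utt)   # assumed interface
            if confidence >= threshold:
                augmented.append((utt, hyp))
        return augmented

The resulting pairs would then be pooled with the roughly 7 hours of transcribed Maltese speech, and in the paper's other conditions with multilingual and synthesized data, before retraining the recognizer.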
Cross-Lingual Subspace Gaussian Mixture Models for Low-Resource Speech Recognition
This paper studies cross-lingual acoustic modelling in the context of subspace Gaussian mixture models (SGMMs). SGMMs factorize the acoustic model parameters into a set that is globally shared between all the states of a hidden Markov model (HMM) and another that is specific to the HMM states. We demonstrate that the SGMM global parameters are transferable between languages, particularly when the parameters are trained multilingually. As a result, acoustic models may be trained using limited amounts of transcribed audio by borrowing the SGMM global parameters from one or more source languages and training only the state-specific parameters on the target-language audio. Model regularization using an ℓ1-norm penalty is shown to be particularly effective at avoiding overtraining, leading to lower word error rates. We investigate maximum a posteriori (MAP) adaptation of subspace parameters in order to reduce the mismatch between the SGMM global parameters of the source and target languages. In addition, monolingual and cross-lingual speaker adaptive training is used to reduce the model variance introduced by speakers. We have systematically evaluated these techniques through experiments on the GlobalPhone corpus.
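For reference, the factorization being described has the following standard form (shown in its single-substate variant for brevity); the matrices M_i, weight projections w_i, and covariances Σ_i are the globally shared, and hence transferable, parameters, while only the low-dimensional vectors v_j are specific to HMM state j:

    p(\mathbf{x} \mid j) = \sum_{i=1}^{I} w_{ji}\,
        \mathcal{N}\bigl(\mathbf{x};\, \boldsymbol{\mu}_{ji},\, \boldsymbol{\Sigma}_i\bigr),
    \qquad
    \boldsymbol{\mu}_{ji} = \mathbf{M}_i \mathbf{v}_j,
    \qquad
    w_{ji} = \frac{\exp\bigl(\mathbf{w}_i^{\top}\mathbf{v}_j\bigr)}
                  {\sum_{i'=1}^{I}\exp\bigl(\mathbf{w}_{i'}^{\top}\mathbf{v}_j\bigr)}

Cross-lingual training then amounts to estimating only the v_j on the limited target-language audio while borrowing the shared set {M_i, w_i, Σ_i} from the source languages.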
Deep Scattering and End-to-End Speech Models towards Low Resource Speech Recognition
Automatic Speech Recognition (ASR) has made major advances, largely due to two different machine learning models: Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs). State-of-the-art results have been achieved by combining these two disparate methods to form a hybrid system. This approach also requires that various components of the speech recognizer be trained independently, based on a probabilistic noisy-channel model. Although this hybrid HMM-DNN ASR method has been successful in recent studies, the independent development of the individual components makes ASR development fragile and expensive in terms of the time needed to develop the various components and their associated sub-systems. The resulting trade-off is that ASR systems are difficult to develop and use, especially for new applications and languages.
The alternative approach, known as the end-to-end paradigm, uses a single deep neural network architecture to encapsulate as many sub-components of speech recognition as possible within a single process. In the end-to-end paradigm, the latent variables of the sub-components are subsumed by the neural network sub-architectures and their associated parameters. The simplified ASR development process gained from the end-to-end paradigm is in turn traded for the higher internal model complexity and the computational resources needed to train end-to-end models.
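As a concrete illustration of the paradigm (the thesis does not prescribe this particular architecture), a minimal CTC-style recognizer collapses the acoustic model, pronunciation lexicon, and alignment machinery of a hybrid system into one network trained with a single loss:

    import torch.nn as nn

    class MinimalCTCRecognizer(nn.Module):
        """Illustrative end-to-end recognizer: one network maps feature
        frames straight to output-symbol log-probabilities; the CTC loss
        absorbs the alignment step that hybrid systems train separately."""

        def __init__(self, feat_dim=40, hidden=256, vocab_size=29):  # e.g. letters + blank
            super().__init__()
            self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, vocab_size)

        def forward(self, feats):               # feats: (batch, time, feat_dim)
            h, _ = self.rnn(feats)
            return self.out(h).log_softmax(dim=-1)

    # A single criterion trains the whole pipeline end to end (log-probs
    # must be transposed to (time, batch, vocab) before calling CTCLoss).
    ctc_loss = nn.CTCLoss(blank=0)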
This research focuses on exploiting these end-to-end development gains for new and low-resource languages. Using a specialised lightweight convolution-like neural network, the deep scattering network (DSN), to replace the input layer of the end-to-end model, our objective was to measure the performance of the end-to-end model with these augmented speech features, while checking whether the lightweight, wavelet-based architecture brought any improvements for low-resource speech recognition in particular.
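The thesis does not name an implementation, but as a rough sketch of this kind of front-end, the kymatio library's Scattering1D computes fixed wavelet scattering coefficients that could stand in for the learned input layer; the J, Q, batch size, and one-second input length below are arbitrary illustrative choices.

    import torch
    from kymatio.torch import Scattering1D   # one available scattering implementation

    T = 16000                                # one second of 16 kHz audio (illustrative)
    scattering = Scattering1D(J=6, shape=T, Q=8)   # wavelet scales, filters per octave

    waveform = torch.randn(4, T)             # (batch, samples); stand-in for real speech
    Sx = scattering(waveform)                # (batch, channels, T / 2**J) coefficients

Note that the channel dimension of the scattering output is considerably larger than the ~40 dimensions of typical filterbank features, consistent with the higher-dimensional vectors reported below.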
The results showed that it is possible to use this compact strategy for speech pattern recognition by deploying deep scattering network features, whose vectors are higher-dimensional than traditional speech features. With Word Error Rates of 26.8% and 76.7% on the SVCSR and LVCSR tasks respectively, the ASR system metrics fell a few WER points short of their respective baselines. In addition, training times tended to be longer than those of the respective baselines, so the approach yielded no significant improvement for low-resource speech recognition training.