207 research outputs found
Improved ASR for Under-Resourced Languages Through Multi-Task Learning with Acoustic Landmarks
Furui first demonstrated that the identities of both consonant and vowel can be
perceived from the C-V transition; later, Stevens proposed that acoustic
landmarks are the primary cues for speech perception, and that steady-state
regions are secondary or supplemental. Acoustic landmarks are perceptually
salient, even in a language one doesn't speak, and it has been demonstrated
that non-speakers of the language can identify features such as the primary
articulator of the landmark. These factors suggest a strategy for developing
language-independent automatic speech recognition: landmarks can potentially be
learned once from a suitably labeled corpus and rapidly applied to many other
languages. This paper proposes enhancing the cross-lingual portability of a
neural network by using landmarks as the secondary task in multi-task learning
(MTL). The network is trained in a well-resourced source language with both
phone and landmark labels (English), then adapted to an under-resourced target
language with only word labels (Iban). Landmark-tasked MTL reduces
source-language phone error rate by 2.9% relative, and reduces target-language
word error rate by 1.9%-5.9% depending on the amount of target-language
training data. These results suggest that landmark-tasked MTL causes the DNN to
learn hidden-node features that are useful for cross-lingual adaptation.
Comment: Submitted to Interspeech201
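The landmark-tasked MTL objective described above can be sketched as a weighted sum of two cross-entropy losses computed over a shared hidden representation, with the phone head as the primary task and the landmark head as the secondary task. This is an illustrative numpy sketch, not the paper's implementation; the interpolation weight `lam` and all shapes are assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true labels.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def mtl_loss(hidden, w_phone, w_landmark, phone_labels, landmark_labels,
             lam=0.3):
    """Multi-task loss over shared hidden features feeding two softmax heads.

    The phone head is the primary task; the landmark head is the secondary
    task, weighted by `lam` (a hypothetical interpolation weight -- the
    abstract does not specify one).
    """
    phone_probs = softmax(hidden @ w_phone)
    landmark_probs = softmax(hidden @ w_landmark)
    return (cross_entropy(phone_probs, phone_labels)
            + lam * cross_entropy(landmark_probs, landmark_labels))
```

Setting `lam=0` recovers plain single-task phone training, which makes it easy to compare against the MTL objective.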
Robust Open-Set Spoken Language Identification and the CU MultiLang Dataset
Most state-of-the-art spoken language identification models are closed-set;
in other words, they can only output a language label from the set of classes
they were trained on. Open-set spoken language identification systems, however,
gain the ability to detect when an input exhibits none of the original
languages. In this paper, we implement a novel approach to open-set spoken
language identification that uses MFCC and pitch features, a TDNN model to
extract meaningful feature embeddings, confidence thresholding on softmax
outputs, and LDA and pLDA for learning to classify new unknown languages. We
present a spoken language identification system that achieves 91.76% accuracy
on trained languages and has the capability to adapt to unknown languages on
the fly. To that end, we also built the CU MultiLang Dataset, a large and
diverse multilingual speech corpus which was used to train and evaluate our
system.
Comment: 6 pages, 1 table, 6 figures
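The confidence-thresholding step on softmax outputs described above can be sketched as follows. The threshold value and language labels here are illustrative assumptions, not values from the paper, which tunes its thresholds on held-out data.

```python
import numpy as np

def classify_open_set(logits, languages, threshold=0.7):
    """Closed-set softmax prediction with open-set rejection.

    If the top softmax probability falls below `threshold`, the input is
    flagged as an unknown language instead of being forced into one of
    the trained classes.
    """
    z = logits - np.max(logits)          # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "unknown"
    return languages[best]
```

A confident, peaked output keeps its closed-set label; a near-uniform output is rejected as an out-of-set language.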
Towards Weakly Supervised Acoustic Subword Unit Discovery and Lexicon Development Using Hidden Markov Models
Developing a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not be available, particularly for under-resourced languages. An alternative to developing a phonetic lexicon is to automatically derive subword units using acoustic information and generate the associated pronunciations. In the literature, this has mostly been studied from the pronunciation variation modeling perspective. In this article, we investigate automatic subword unit derivation from the under-resourced language point of view. Towards that, we present a novel hidden Markov model (HMM) formalism for automatic derivation of subword units and pronunciation generation using only transcribed speech data. In this approach, the subword units are derived from the clustered context-dependent units of a grapheme-based system using the maximum-likelihood criterion. The subword-unit-based pronunciations are then generated by either deterministic or probabilistic learning of the relationship between the graphemes and the acoustic subword units (ASWUs). In this article, we first establish the proposed framework on a well-resourced language by comparing it against related approaches in the literature and investigating the transferability of the derived subword units to other domains. We then show the scalability of the proposed approach in real under-resourced scenarios by conducting studies on Scottish Gaelic, a genuinely minority and endangered language, and comparing the approach against state-of-the-art grapheme-based approaches in under-resourced scenarios. Our experimental studies on English show that the derived subword units not only lead to better ASR systems than graphemes, but can also be exploited to build out-of-domain ASR systems. The experimental studies on Scottish Gaelic show that the proposed ASWU-based lexicon development approach retains its dominance over the grapheme-based lexicon.
Furthermore, the proposed approach yields significant gains in ASR performance even when multilingual resources from resource-rich languages are exploited in the development of the ASR systems.
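The probabilistic learning of the grapheme-to-ASWU relationship, followed by deterministic pronunciation generation, can be illustrated with a toy sketch. All names and the flat list of aligned pairs are assumptions for illustration; the actual system estimates these relationships from HMM alignments of transcribed speech, not from a pair list.

```python
from collections import Counter, defaultdict

def learn_g2u(aligned_pairs):
    """Estimate p(subword unit | grapheme) from aligned (grapheme, unit) pairs.

    A toy stand-in for the probabilistic grapheme-to-ASWU learning step:
    relative frequencies of each unit given each grapheme.
    """
    counts = defaultdict(Counter)
    for grapheme, unit in aligned_pairs:
        counts[grapheme][unit] += 1
    return {g: {u: n / sum(c.values()) for u, n in c.items()}
            for g, c in counts.items()}

def deterministic_pronunciation(word, g2u):
    # Deterministic lexicon: map each grapheme to its most likely unit.
    return [max(g2u[g], key=g2u[g].get) for g in word]
```

Keeping the full distribution instead of the argmax corresponds to the probabilistic variant of the lexicon.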
Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling
Automatic speech recognition (ASR) systems incorporate linguistic expertise through a phone pronunciation lexicon (or dictionary), in which each word is associated with a sequence of phones. Creating a phone pronunciation lexicon for a new language or domain is costly, as it requires linguistic expertise as well as time and money. In this thesis, we focus on effectively building ASR systems in the absence of linguistic expertise for a new domain or language. In particular, we consider graphemes as alternate subword units for speech recognition. In a grapheme lexicon, the pronunciation of a word is derived from its orthography. However, modeling graphemes for speech recognition is challenging for two reasons. First, the grapheme-to-phoneme (G2P) relationship can be ambiguous, as languages continue to evolve after their spelling has been standardized. Second, as elucidated in this thesis, ASR systems typically model the relationship between graphemes and acoustic features directly, and the acoustic features depict the envelope of speech, which is related to phones. In this thesis, a grapheme-based ASR approach is proposed in which the modeling of the relationship between graphemes and acoustic features is factored through a latent variable into two models, namely, an acoustic model and a lexical model. The acoustic model captures the relationship between latent variables and acoustic features, while the lexical model captures a probabilistic relationship between latent variables and graphemes. We refer to the proposed approach as probabilistic lexical modeling based ASR. In the thesis we show that the latent variables can be phones, multilingual phones, or clustered context-dependent subword units, and that the acoustic model can be trained on domain-independent or language-independent resources. The lexical model is trained on transcribed speech data from the target domain or language.
In doing so, the parameters of the lexical model capture a probabilistic relationship between graphemes and phones. In the proposed grapheme-based ASR approach, lexicon learning is implicitly integrated as a phase of ASR system training, as opposed to the conventional approach in which a phone pronunciation lexicon is first developed and a phone-based ASR system is then trained. The potential and efficacy of the proposed approach are demonstrated through experiments and comparisons with other standard approaches on ASR for resource-rich languages, non-native and accented speech, under-resourced languages, and minority languages. The studies revealed that the proposed framework is particularly suitable when the task is challenged by a lack of both linguistic expertise and transcribed data. Furthermore, our investigations also showed that standard ASR approaches, in which the lexical model is deterministic, are more suitable for phones than graphemes, while the probabilistic lexical modeling based approach is suitable for both. Finally, we show that the captured grapheme-to-phoneme relationship can be exploited to perform acoustic data-driven G2P conversion.
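The factorization through a latent variable amounts to marginalizing over the latent units: p(x | grapheme) = sum_i p(x | z_i) p(z_i | grapheme), where the first factor is the acoustic model and the second is the lexical model. A minimal numeric sketch, with shapes and values chosen purely for illustration:

```python
import numpy as np

def grapheme_likelihood(acoustic_ll, lexical_prob):
    """p(x | grapheme) = sum_i p(x | z_i) * p(z_i | grapheme).

    `acoustic_ll`: per-latent-unit acoustic likelihoods p(x | z_i)
    (the acoustic model); `lexical_prob`: p(z_i | grapheme)
    (the lexical model). A deterministic lexical model is the special
    case where `lexical_prob` is a one-hot vector.
    """
    return float(np.dot(acoustic_ll, lexical_prob))
```

With a one-hot lexical model the grapheme inherits the likelihood of a single latent unit; the probabilistic model blends the likelihoods of all units.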
Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning
Almost none of the 2,000+ languages spoken in Africa have widely available
automatic speech recognition systems, and the required data is also only
available for a few languages. We have experimented with two techniques which
may provide pathways to large vocabulary speech recognition for African
languages: multilingual modeling and self-supervised learning. We gathered
available open source data and collected data for 15 languages, and trained
experimental models using these techniques. Our results show that pooling the
small amounts of data available in multilingual end-to-end models, and
pre-training on unsupervised data can help improve speech recognition quality
for many African languages.
TheanoLM - An Extensible Toolkit for Neural Network Language Modeling
We present a new tool for training neural network language models (NNLMs),
scoring sentences, and generating text. The tool is written using the Python
library Theano, which allows researchers to easily extend it and tune any
aspect of the training process. Despite this flexibility, Theano is able to
generate extremely fast native code that can utilize a GPU or multiple CPU
cores to parallelize the heavy numerical computations. The tool has been
evaluated on difficult Finnish and English conversational speech recognition
tasks, where significant improvement was obtained over our best back-off
n-gram models. The results we obtained on the Finnish task were compared to
those from the existing RNNLM and RWTHLM toolkits and found to be as good or
better, while training times were an order of magnitude shorter.