86 research outputs found
Multilingual Non-Native Speech Recognition using Phonetic Confusion-Based Acoustic Model Modification and Graphemic Constraints
In this paper we present an automated approach for non-native speech recognition. We introduce a new phonetic confusion concept that associates sequences of native language (NL) phones with spoken language (SL) phones. Phonetic confusion rules are automatically extracted from a non-native speech database for a given NL and SL using both the NL and SL ASR systems. These rules are used to modify the acoustic models (HMMs) of the SL ASR system by adding acoustic models of NL phones. Because the pronunciation errors that non-native speakers produce depend on the spelling of the words, we have also used graphemic constraints in the phonetic confusion extraction process. In the lexicon, the phones in each word's pronunciation are linked to the corresponding graphemes (characters) of the word. In this way, the phonetic confusion is established between pairs of (SL phones, graphemes) and sequences of NL phones. We evaluated our approach on French, Italian, Spanish and Greek non-native speech databases, with English as the spoken language. The modified ASR system achieved significant improvements ranging from 20.3% to 43.2% (relative) in sentence error rate and from 26.6% to 50.0% in word error rate (WER).
Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition
In this paper, we present several adaptation methods for non-native speech
recognition. We have tested pronunciation modelling, MLLR and MAP non-native
pronunciation adaptation and HMM models retraining on the HIWIRE foreign
accented English speech database. The "phonetic confusion" scheme we have
developed consists of associating each spoken phone with several sequences of
confused phones. In our experiments, we have used different combinations of
acoustic models representing the canonical and the foreign pronunciations:
spoken and native models, models adapted to the non-native accent with MAP and
MLLR. The joint use of pronunciation modelling and acoustic adaptation led to
further improvements in recognition accuracy. The best combination of the above
mentioned techniques resulted in a relative word error reduction ranging from
46% to 71%
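For readers comparing the figures quoted in these abstracts, the following is a minimal sketch of how a relative error reduction is usually computed. The function name and the example error rates are hypothetical and are not taken from any of the listed papers.

```python
def relative_reduction(baseline_error, improved_error):
    """Return the relative error reduction as a fraction of the baseline.

    Both arguments are error rates on the same scale (for example, WER in
    percent). A result of 0.5 means the error rate was halved.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - improved_error) / baseline_error


# Hypothetical numbers: a baseline WER of 10.0% reduced to 5.0%.
print(f"{relative_reduction(10.0, 5.0):.0%}")
```

Under this definition, a relative reduction says nothing about the absolute error rate, so two systems with the same relative gain can differ widely in final accuracy.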
Non-native children speech recognition through transfer learning
This work deals with non-native children's speech and investigates both
multi-task and transfer learning approaches to adapt a multi-language Deep
Neural Network (DNN) to speakers, specifically children, learning a foreign
language. The application scenario is characterized by young students learning
English and German and reading sentences in these second languages, as well as
in their mother tongue. The paper analyzes and discusses techniques for
training effective DNN-based acoustic models starting from children's native
speech and performing adaptation with limited non-native audio material. A
multi-lingual model is adopted as baseline, where a common phonetic lexicon,
defined in terms of the units of the International Phonetic Alphabet (IPA), is
shared across the three languages at hand (Italian, German and English); DNN
adaptation methods based on transfer learning are evaluated on significant
non-native evaluation sets. Results show that the resulting non-native models
allow a significant improvement with respect to a mono-lingual system adapted
to speakers of the target language
Leveraging native language information for improved accented speech recognition
Recognition of accented speech is a long-standing challenge for automatic
speech recognition (ASR) systems, given the increasing worldwide population of
bi-lingual speakers with English as their second language. If we consider
foreign-accented speech as an interpolation of the native language (L1) and
English (L2), using a model that can simultaneously address both languages
would perform better at the acoustic level for accented speech. In this study,
we explore how an end-to-end recurrent neural network (RNN) trained system with
English and native languages (Spanish and Indian languages) could leverage data
of native languages to improve performance for accented English speech. To this
end, we examine pre-training with native languages, as well as multi-task
learning (MTL) in which the main task is trained with native English and the
secondary task is trained with Spanish or Indian Languages. We show that the
proposed MTL model performs better than the pre-training approach and
outperforms a baseline model trained simply with English data. We suggest a new
setting for MTL in which the secondary task is trained with both English and
the native language, using the same output set. This proposed scenario yields
better performance with +11.95% and +17.55% character error rate gains over
baseline for Hispanic and Indian accents, respectively. Comment: Accepted at Interspeech 201
Fast and flexible Kullback-Leibler divergence based acoustic modeling for non-native speech recognition
One of the main challenges in non-native speech recognition is how to handle the acoustic variability present in multi-accented non-native speech with a limited amount of training data. In this paper, we investigate an approach that addresses this challenge by using Kullback-Leibler divergence based hidden Markov models (KL-HMM). More precisely, the acoustic variability in the multi-accented speech is handled by using multilingual phoneme posterior probabilities, estimated by a multilayer perceptron trained on auxiliary data, as input features for the KL-HMM system. With limited training data, we then build better acoustic models by exploiting the fact that the KL-HMM system has fewer parameters. On the HIWIRE corpus, the proposed approach yields a word error rate (WER) of 1.9% with 149 minutes of training data and a WER of 5.5% with 2 minutes of training data.
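To illustrate the kind of local score a KL-HMM relies on, here is a minimal sketch that compares a phoneme posterior vector against per-state categorical distributions using the Kullback-Leibler divergence. The abstract does not say which direction of the divergence or which smoothing the authors use, so the argument order, the `eps` floor, and all names below are assumptions made for illustration only.

```python
import math


def _normalize(dist, eps):
    """Floor each probability at eps and rescale so the values sum to 1."""
    floored = [max(float(x), eps) for x in dist]
    total = sum(floored)
    return [x / total for x in floored]


def kl_divergence(p, q, eps=1e-10):
    """Return KL(p || q) in nats for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    p = _normalize(p, eps)
    q = _normalize(q, eps)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def best_state(posterior, state_dists):
    """Return the index of the state whose distribution is closest to the
    observed posterior vector (assumed direction: KL(state || posterior))."""
    scores = [kl_divergence(y, posterior) for y in state_dists]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical three-class posterior and two hypothetical state distributions.
posterior = [0.8, 0.1, 0.1]
states = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
print(best_state(posterior, states))
```

In this sketch each state is described only by one categorical distribution over phoneme classes, which is consistent with the abstract's remark that the model has few parameters. A full system would combine these local scores with HMM transitions during decoding.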
Acoustic Modelling for Under-Resourced Languages
Automatic speech recognition systems have so far been developed only for very few languages out of the 4,000-7,000 existing ones.
In this thesis we examine methods to rapidly create acoustic models in new, possibly under-resourced languages, in a time- and cost-effective manner. To this end, we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.
Automatic assessment of spoken language proficiency of non-native children
This paper describes technology developed to automatically grade Italian
students (ages 9-16) on their English and German spoken language proficiency.
The students' spoken answers are first transcribed by an automatic speech
recognition (ASR) system and then scored using a feedforward neural network
(NN) that processes features extracted from the automatic transcriptions.
In-domain acoustic models, employing deep neural networks (DNNs), are derived
by adapting the parameters of an original out-of-domain DNN.
Automatic Speech Recognition without Transcribed Speech or Pronunciation Lexicons
Rapid deployment of automatic speech recognition (ASR) in new languages, with very limited data, is of great interest and importance for intelligence gathering, as well as for humanitarian assistance and disaster relief (HADR). Deploying ASR systems in these languages often relies on cross-lingual acoustic modeling followed by supervised adaptation and almost always assumes that either a pronunciation lexicon using the International Phonetic Alphabet (IPA), and/or some amount of transcribed speech exist in the new language of interest. For many languages, neither requirement is generally true -- only a limited amount of text and untranscribed audio is available. This work focuses specifically on scalable techniques for building ASR systems in most languages without any existing transcribed speech or pronunciation lexicons.
We first demonstrate how cross-lingual acoustic model transfer, when phonemic pronunciation lexicons do exist in a new language, can significantly reduce the need for target-language transcribed speech. We then explore three methods for handling languages without a pronunciation lexicon. First we examine the effectiveness of graphemic acoustic model transfer, which allows for pronunciation lexicons to be trivially constructed. We then present two methods for rapid construction of phonemic pronunciation lexicons based on submodular selection of a small set of words for manual annotation, or words from other languages for which we have IPA pronunciations. We also explore techniques for training sequence-to-sequence models with very small amounts of data by transferring models trained on other languages, and leveraging large unpaired text corpora in training. Finally, as an alternative to acoustic model transfer, we present a novel hybrid generative/discriminative semi-supervised training framework that merges recent progress in Energy Based Models (EBMs) as well as lattice-free maximum mutual information (LF-MMI) training, capable of making use of purely untranscribed audio.
Together, these techniques enabled ASR capabilities that supported triage of spoken communications in real-world HADR workflows in many languages using fewer than 30 minutes of transcribed speech. These techniques were successfully applied in multiple NIST evaluations and were among the top-performing systems in each evaluation.