Current trends in multilingual speech processing
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years, and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are presenting new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS), as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies that help break down domestic and international language barriers, which are also barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies, at the heart of which lies multilingual speech processing.
Dual Language Models for Code Switched Speech Recognition
In this work, we present a simple and elegant approach to language modeling for bilingual code-switched text. Since code-switching is a blend of two or more different languages, a standard bilingual language model can be improved upon by using the structures of the monolingual language models. We propose a novel technique called dual language models, which involves building two complementary monolingual language models and combining them using a probabilistic model for switching between the two. We evaluate the efficacy of our approach using a conversational Mandarin-English speech corpus. We demonstrate the robustness of our model by showing significant improvements in perplexity over the standard bilingual language model without the use of any external information. Similar consistent improvements are also reflected in automatic speech recognition error rates.
Comment: Accepted at Interspeech 201
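The switching mechanism is straightforward to sketch. The Python fragment below is an illustrative toy, not the paper's exact formulation: it assumes unigram monolingual models and a single global switch probability, whereas the paper builds full complementary monolingual language models and a probabilistic switching model estimated from data.

    class DualLanguageModel:
        """Toy dual LM: two monolingual models plus a switch probability."""

        def __init__(self, lm_l1, lm_l2, p_switch):
            self.lms = {"L1": lm_l1, "L2": lm_l2}  # monolingual models: word -> prob
            self.p_switch = p_switch               # probability of switching language

        def prob(self, word, lang_of_word, lang_of_prev):
            # Stay in the current language with probability (1 - p_switch),
            # switch with probability p_switch, then score the word with the
            # monolingual model of its language.
            stay = lang_of_word == lang_of_prev
            switch_term = (1.0 - self.p_switch) if stay else self.p_switch
            return switch_term * self.lms[lang_of_word].get(word, 1e-9)

    # Toy usage: Mandarin-English code-switching with made-up unigram tables.
    lm_en = {"the": 0.05, "meeting": 0.01}
    lm_zh = {"明天": 0.02, "开会": 0.01}
    dlm = DualLanguageModel(lm_en, lm_zh, p_switch=0.1)
    print(dlm.prob("meeting", "L1", "L2"))  # a switch from Mandarin into English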
Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
There are a number of studies on the extraction of bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states for improving the performance of text-dependent speaker verification (TD-SV). However, only moderate success has been achieved. A recent study [1] presented a time-contrastive learning (TCL) concept to explore the non-stationarity of brain signals for classification of brain states. Speech signals have a similar non-stationarity property, and TCL has the further advantage of requiring no labeled data. We therefore present a TCL-based BN feature extraction method. The method uniformly partitions each speech utterance in a training dataset into a predefined number of multi-frame segments. Each segment in an utterance corresponds to one class, and class labels are shared across utterances. DNNs are then trained to discriminate all speech frames among the classes to exploit the temporal structure of speech. In addition, we propose a segment-based unsupervised clustering algorithm to re-assign class labels to the segments. TD-SV experiments were conducted on the RedDots challenge database. The TCL-DNNs were trained using speech data of fixed pass-phrases that were excluded from the TD-SV evaluation set, so the learned features can be considered phrase-independent. We compare the performance of the proposed TCL bottleneck (TCL-BN) features with those of short-time cepstral features and BN features extracted from DNNs discriminating speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels and boundaries are generated by three different automatic speech recognition (ASR) systems. Experimental results show that the proposed TCL-BN outperforms cepstral features and speaker+pass-phrase discriminant BN features, and its performance is on par with that of ASR-derived BN features. Moreover, ...
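The uniform partitioning step at the core of TCL is simple to illustrate. The sketch below is a toy under stated assumptions (the frame and class counts are made up, and the handling of leftover tail frames is this sketch's choice): each utterance is split into a fixed number of contiguous multi-frame segments, segment k gets class label k, and the labels are shared across utterances as described above.

    import numpy as np

    def tcl_labels(n_frames, n_classes):
        """Per-frame TCL class labels for one utterance (uniform partitioning)."""
        seg_len = max(n_frames // n_classes, 1)   # frames per segment
        labels = np.arange(n_frames) // seg_len   # segment index per frame
        return np.minimum(labels, n_classes - 1)  # clamp the leftover tail frames

    # Example: a 35-frame utterance partitioned into 6 TCL classes; the same
    # labels 0..5 are reused for every utterance in the training set.
    print(tcl_labels(35, 6))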
Fast and flexible Kullback-Leibler divergence based acoustic modeling for non-native speech recognition
One of the main challenges in non-native speech recognition is how to handle the acoustic variability present in multi-accented non-native speech with a limited amount of training data. In this paper, we investigate an approach that addresses this challenge by using Kullback-Leibler divergence based hidden Markov models (KL-HMM). More precisely, the acoustic variability in the multi-accented speech is handled by using multilingual phoneme posterior probabilities, estimated by a multilayer perceptron trained on auxiliary data, as input features for the KL-HMM system. With limited training data, we then build better acoustic models by exploiting the fact that the KL-HMM system has fewer parameters. On the HIWIRE corpus, the proposed approach yields a word error rate (WER) of 1.9% with 149 minutes of training data, and 5.5% WER with 2 minutes of training data.
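A hedged sketch of the KL-HMM local score may help here: each HMM state holds a categorical distribution over the multilingual phoneme classes, and the per-frame cost is a KL divergence between that distribution and the MLP posterior vector. The divergence direction shown is one common choice in the KL-HMM literature, not necessarily the exact variant used in this paper.

    import numpy as np

    def kl_local_score(y_d, z_t, eps=1e-12):
        """KL(y_d || z_t): local cost of emitting posterior z_t from state d."""
        y = np.asarray(y_d, dtype=float) + eps
        z = np.asarray(z_t, dtype=float) + eps
        return float(np.sum(y * np.log(y / z)))

    # Toy example with a 4-phoneme posterior space.
    state_dist = [0.7, 0.1, 0.1, 0.1]     # state's categorical parameters
    mlp_posterior = [0.6, 0.2, 0.1, 0.1]  # one frame of MLP output
    print(kl_local_score(state_dist, mlp_posterior))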
Using KL-divergence and multilingual information to improve ASR for under-resourced languages
Setting out from the point of view that automatic speech recognition (ASR) ought to benefit from data in languages other than the target language, we propose a novel Kullback-Leibler (KL) divergence based method that is able to exploit multilingual information in the form of universal phoneme posterior probabilities conditioned on the acoustics. We formulate a means to train a recognizer on several different languages and subsequently recognize speech in a target language for which only a small amount of data is available. Taking the Greek SpeechDat(II) data as an example, we show that the proposed formulation is sound and that it outperforms a current state-of-the-art HMM/GMM system. We also use a hybrid Tandem-like system to further understand the source of the benefit.
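On the training side, one closed-form update consistent with the KL(y||z) cost sketched above (an assumption of this sketch, not a claim about the paper's exact recipe) sets each state's categorical distribution to the normalised geometric mean of the multilingual posterior vectors aligned to that state:

    import numpy as np

    def update_state_distribution(posteriors, eps=1e-12):
        """Normalised geometric mean of the posterior vectors aligned to a state."""
        logs = np.log(np.asarray(posteriors, dtype=float) + eps)
        g = np.exp(logs.mean(axis=0))  # element-wise geometric mean over frames
        return g / g.sum()             # renormalise onto the probability simplex

    # Toy example: three frames aligned to one state.
    aligned = [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
    print(update_state_distribution(aligned))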
Improving non-native ASR through stochastic multilingual phoneme space transformations
We propose a stochastic phoneme space transformation technique that allows the conversion of source phoneme posterior probabilities (conditioned on the acoustics) into target phoneme posterior probabilities. The source and target phonemes can be in any language and phoneme format, such as the International Phonetic Alphabet. The technique makes use of a Kullback-Leibler divergence based hidden Markov model and can be applied to non-native and accented speech recognition, or used to adapt systems to under-resourced languages. In this paper, in the context of hybrid HMM/MLP recognizers, we successfully apply the proposed approach to non-native English speech recognition on the HIWIRE dataset.
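A minimal sketch of such a transformation, under the assumption that it can be summarised as a row-stochastic matrix mapping source phoneme posteriors onto the target phoneme set (the matrix values below are made up; in the paper the transformation is estimated within the KL-HMM framework rather than fixed by hand):

    import numpy as np

    def transform_posteriors(z_src, T):
        """Map a source posterior vector onto the target phoneme set."""
        z_tgt = z_src @ T           # each row of T sums to one
        return z_tgt / z_tgt.sum()  # guard against rounding drift

    # Toy example: 3 source phonemes mapped onto 2 target phonemes.
    T = np.array([[0.9, 0.1],      # rows are P(target phoneme | source phoneme)
                  [0.2, 0.8],
                  [0.5, 0.5]])
    z_src = np.array([0.7, 0.2, 0.1])
    print(transform_posteriors(z_src, T))  # -> [0.72 0.28]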
Unsupervised crosslingual adaptation of tokenisers for spoken language recognition
Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary phonetic information. We present a study on the use of deep neural network tokenisers. Unsupervised crosslingual adaptation was performed to adapt the baseline tokeniser, trained on English conversational telephone speech data, to different languages. Two training and adaptation approaches, namely cross-entropy adaptation and state-level minimum Bayes risk adaptation, were tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems using the tokenisers adapted to different languages were combined using score fusion, giving a 7-18% reduction in minimum detection cost function (minDCF) compared with the baseline configurations without adapted tokenisers. Analysis of the results showed that the ensemble of tokenisers gave diverse representations of the phonemes, bringing complementary effects when SLR systems with different tokenisers were combined. SLR performance was also shown to be related to the quality of the adapted tokenisers.
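The fusion step itself is standard. A minimal sketch with uniform weights follows; in practice the fusion weights would normally be tuned on a development set (for example with logistic regression) rather than fixed to the equal weights assumed here.

    import numpy as np

    def fuse_scores(system_scores, weights=None):
        """Weighted sum of per-language scores from several SLR systems."""
        S = np.asarray(system_scores, dtype=float)  # shape: (n_systems, n_languages)
        w = np.full(len(S), 1.0 / len(S)) if weights is None else np.asarray(weights)
        return w @ S

    # Toy example: three tokeniser-specific systems scoring four languages.
    scores = [[1.2, -0.3, 0.4, -1.0],
              [0.9,  0.1, 0.2, -0.7],
              [1.5, -0.5, 0.0, -1.2]]
    print(fuse_scores(scores))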