Current trends in multilingual speech processing
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.
Acoustic Modelling for Under-Resourced Languages
Automatic speech recognition systems have so far been developed for only a few of the 4,000-7,000 existing languages.
In this thesis we examine methods to rapidly create acoustic models in new, possibly under-resourced languages, in a time- and cost-effective manner. For this we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.
The Structure of the Kuria Verbal and Its Position in the Sentence.
Abstract Not Provided
Cross-lingual automatic speech recognition using tandem features
Automatic speech recognition requires many hours of transcribed speech recordings
in order for an acoustic model to be effectively trained. However, recording speech
corpora is time-consuming and expensive, so such quantities of data exist only for
a handful of languages — there are many languages for which little or no data exist.
Given that there are acoustic similarities between different languages, it may be fruitful
to use data from a well-supported source language for the task of training a recogniser
in a target language with little training data.
Since most languages do not share a common phonetic inventory, we propose an
indirect way of transferring information from a source language model to a target language
model. Tandem features, in which class posteriors from a separate classifier
are decorrelated and appended to conventional acoustic features, are used for this transfer.
They have the advantage that the language used to train the classifier, typically a Multilayer
Perceptron (MLP), need not be the same as the target language being recognised.
Consistent with prior work, positive results are achieved for monolingual systems in a
number of different languages.
Furthermore, improvements are also shown for the cross-lingual case, in which the
tandem features were generated using a classifier not trained for the target language.
We examine factors which may predict the relative improvements brought about by
tandem features for a given source and target pair. We examine some cross-corpus
normalization issues that naturally arise in multilingual speech recognition and validate
our solution in terms of recognition accuracy and a mutual information measure.
The tandem classifier in work up to this point in the thesis has been a phoneme classifier.
Articulatory features (AFs), represented here as a multi-stream, discrete, multivalued
labelling of speech, can be used as an alternative task. The motivation for this is
that since AFs are a set of physically grounded categories that are not language-specific
they may be more suitable for cross-lingual transfer. Then, using either phoneme or
AF classification as our MLP task, we look at training the MLP using data from more
than one language — again we hypothesise that AF tandem will result in greater improvements
in accuracy. We also examine performance where only limited amounts of
target language data are available, and see how our various tandem systems perform
under those conditions.
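As a rough illustration of the tandem pipeline described above, the following sketch log-compresses per-frame phone posteriors from a classifier, decorrelates them with PCA, and appends them to conventional acoustic features. All dimensions and the random stand-in data are hypothetical; the thesis's actual features, decorrelation transform, and dimensionalities may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 13-dim acoustic feature frames (e.g. MFCCs) and
# per-frame phone posteriors from an MLP classifier, simulated here.
n_frames, n_mfcc, n_phones = 200, 13, 40
mfcc = rng.normal(size=(n_frames, n_mfcc))
posteriors = rng.dirichlet(np.ones(n_phones), size=n_frames)

# 1. Log-compress the posteriors (softens their highly skewed distribution).
log_post = np.log(posteriors + 1e-10)

# 2. Decorrelate via PCA: project onto the top eigenvectors of the
#    covariance matrix (25 components is an arbitrary illustrative choice).
centered = log_post - log_post.mean(axis=0)
cov = centered.T @ centered / n_frames
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1][:25]
decorrelated = centered @ eigvecs[:, order]

# 3. Append the decorrelated posterior features to the acoustic features.
tandem = np.hstack([mfcc, decorrelated])
print(tandem.shape)  # (200, 38)
```

Because the classifier only supplies extra feature streams, its training language can differ from the recognition target, which is what makes this arrangement attractive for cross-lingual transfer.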
PHONOTACTIC AND ACOUSTIC LANGUAGE RECOGNITION
This thesis deals with phonotactic and acoustic techniques for automatic language recognition (LRE). The first part of the thesis deals with phonotactic language recognition based on the co-occurrence of phone sequences in speech. A thorough study of phone recognition as a tokenization technique for LRE is presented, with focus on the amount of training data for the phone recognizer and on the combination of phone recognizers trained on several languages (Parallel Phone Recognition followed by Language Models, PPRLM).
The thesis also presents a novel technique of anti-models in PPRLM and investigates the use of phone lattices instead of one-best strings. The work on the phonotactic approach is concluded by a comparison of classical n-gram modeling techniques and binary decision trees. Acoustic LRE was addressed too, with the main focus on discriminative techniques for training target-language acoustic models and on initial (but successful) experiments with removing channel dependencies. We have also investigated the fusion of the phonotactic and acoustic approaches. All experiments were performed on standard data from the NIST 2003, 2005 and 2007 evaluations, so the results are directly comparable to those of other laboratories in the LRE community. With the above-mentioned techniques, the fused systems defined the state of the art in the LRE field and reached excellent results in the NIST evaluations.
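The phonotactic idea above — score a phone-recognizer output under per-language phone n-gram models and pick the best-scoring language — can be sketched minimally as follows. The bigram model with add-one smoothing, the toy phone strings, and the language names are all illustrative assumptions, not the thesis's actual recognizers, phone inventories, or smoothing.

```python
import math
from collections import defaultdict

def train_bigram(phone_strings):
    """Train a bigram phone LM with add-one smoothing over a toy phone set."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for seq in phone_strings:
        toks = ["<s>"] + seq.split() + ["</s>"]
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
    return counts, len(vocab)

def score(model, seq):
    """Log-probability of a phone string under a smoothed bigram LM."""
    counts, vocab_size = model
    toks = ["<s>"] + seq.split() + ["</s>"]
    logp = 0.0
    for a, b in zip(toks, toks[1:]):
        total = sum(counts[a].values())
        logp += math.log((counts[a][b] + 1) / (total + vocab_size))
    return logp

# Toy phone-string "transcripts" per target language (illustrative symbols
# only, not real phone inventories).
lang_data = {
    "english": ["dh ax k ae t", "dh ax d ao g"],
    "czech":   ["p r a h a", "d o b r ii"],
}
models = {lang: train_bigram(data) for lang, data in lang_data.items()}

test_utterance = "dh ax m ae n"
scores = {lang: score(m, test_utterance) for lang, m in models.items()}
best = max(scores, key=scores.get)
print(best)  # "english" — its bigrams better match the test string
```

In full PPRLM, this scoring is repeated for several phone recognizers trained on different languages, and the resulting score vectors are fused in a backend classifier; the thesis additionally replaces one-best strings with phone lattices and n-grams with decision trees.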