Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as
they have been shown to benefit from more training data, and better lend
themselves to adaptation to under-resourced languages. However, initialisation
from monolingual context-dependent models leads to an explosion of
context-dependent states. Connectionist Temporal Classification (CTC) is a
potential solution to this as it performs well with monophone labels.
We investigate multilingual CTC in the context of adaptation and
regularisation techniques that have been shown to be beneficial in more
conventional contexts. The multilingual model is trained to model a universal
International Phonetic Alphabet (IPA)-based phone set using the CTC loss
function. Learning Hidden Unit Contributions (LHUC) is investigated as a means
of language-adaptive training. In addition, dropout during cross-lingual
adaptation is studied as a way to mitigate overfitting.
Experiments show that the performance of the universal phoneme-based CTC
system can be improved by applying LHUC, and that the system is extensible to
new phonemes during cross-lingual adaptation. Updating all the parameters
yields consistent improvements on limited data. Applying dropout during
adaptation further improves the system, achieving performance competitive with
Deep Neural Network / Hidden Markov Model (DNN/HMM) systems on limited data.
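As a concrete illustration of the LHUC re-scaling described above, here is a minimal PyTorch sketch; the layer sizes, language count, and module names are assumptions for illustration, not taken from the paper. LHUC learns one amplitude vector per language and multiplies each hidden unit's activation by 2*sigmoid(r), so amplitudes lie in (0, 2) and start at 1.

```python
import torch
import torch.nn as nn

class LHUC(nn.Module):
    """Element-wise re-scaling of hidden activations with one learnable
    amplitude vector per language; at adaptation time only these vectors
    are updated while the shared network stays frozen."""
    def __init__(self, hidden_dim: int, num_languages: int):
        super().__init__()
        # r = 0 gives 2*sigmoid(0) = 1, i.e. an identity starting point.
        self.r = nn.Parameter(torch.zeros(num_languages, hidden_dim))

    def forward(self, h: torch.Tensor, lang_id: int) -> torch.Tensor:
        return h * (2.0 * torch.sigmoid(self.r[lang_id]))

# Usage: wrap each hidden layer of the shared multilingual acoustic model.
hidden = nn.Linear(80, 512)
lhuc = LHUC(hidden_dim=512, num_languages=9)
x = torch.randn(4, 80)            # a batch of acoustic feature vectors
h = torch.relu(hidden(x))
h_adapted = lhuc(h, lang_id=3)    # language-dependent re-scaling
```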
Multilingual Speech Recognition With A Single End-To-End Model
Training a conventional automatic speech recognition (ASR) system to support
multiple languages is challenging because the sub-word unit, lexicon and word
inventories are typically language specific. In contrast, sequence-to-sequence
models are well suited for multilingual ASR because they encapsulate an
acoustic, pronunciation and language model jointly in a single network. In this
work we present a single sequence-to-sequence ASR model trained on 9 different
Indian languages, which have very little overlap in their scripts.
Specifically, we take a union of language-specific grapheme sets and train a
grapheme-based sequence-to-sequence model jointly on data from all languages.
We find that this model, which is not explicitly given any information about
language identity, improves recognition performance by 21% relative compared to
analogous sequence-to-sequence models trained on each language individually. By
modifying the model to accept a language identifier as an additional input
feature, we further improve performance by an additional 7% relative and
eliminate confusion between different languages.
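A hedged sketch of the two ingredients the abstract describes: a union grapheme vocabulary over all training languages, and a one-hot language identifier appended to every input feature frame. All names and dimensions below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def union_grapheme_vocab(transcripts_per_language):
    """Map every grapheme seen in any language to one shared output index."""
    graphemes = sorted({ch for texts in transcripts_per_language.values()
                        for text in texts for ch in text})
    return {g: i for i, g in enumerate(graphemes)}

def append_language_id(features, lang_index, num_languages):
    """Concatenate a one-hot language vector to every feature frame."""
    one_hot = np.zeros((features.shape[0], num_languages), dtype=features.dtype)
    one_hot[:, lang_index] = 1.0
    return np.concatenate([features, one_hot], axis=1)

vocab = union_grapheme_vocab({"hi": ["नमस्ते"], "ta": ["வணக்கம்"]})
feats = np.random.randn(100, 80).astype(np.float32)   # 100 frames, 80-dim
feats_lid = append_language_id(feats, lang_index=2, num_languages=9)
assert feats_lid.shape == (100, 89)
```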
Acoustic Modelling for Under-Resourced Languages
Automatic speech recognition systems have so far been developed only for very few languages out of the 4,000-7,000 existing ones.
In this thesis we examine methods to rapidly create acoustic models in new, possibly under-resourced languages, in a time and cost effective manner. For this we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages
Language Identification Using Visual Features
Automatic visual language identification (VLID) is the technology of using information derived from the visual appearance and movement of the speech articulators to identify the language being spoken, without the use of any audio information. This technique for language identification (LID) is useful in situations in which conventional audio processing is ineffective (very noisy environments) or impossible (no audio signal is available). Research in this field also benefits the related field of automatic lip-reading. This paper introduces several methods for VLID. They are based upon audio LID techniques, which exploit language phonology and phonotactics to discriminate between languages. We show that VLID is possible in a speaker-dependent mode by discriminating between different languages spoken by an individual, and we then extend the technique to speaker-independent operation, taking pains to ensure that discrimination is not due to artefacts, either visual (e.g. skin tone) or audio (e.g. rate of speaking). Although the low accuracy of visual speech recognition currently limits the performance of VLID, we obtain an error rate below 10% in discriminating between Arabic and English on 19 speakers, using about 30 s of visual speech.
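To make the phonotactic idea concrete: audio LID systems of this family recognise a token sequence and score it under per-language n-gram models, and the same scoring can in principle be applied to recognised viseme sequences. The sketch below is an illustrative toy, with made-up viseme labels and training sequences, not the paper's system.

```python
import math
from collections import Counter

def train_bigram(sequences, alpha=1.0):
    """Train an add-alpha smoothed bigram model over token sequences and
    return a function scoring a new sequence in log-probability."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        contexts.update(seq[:-1])
        bigrams.update(zip(seq, seq[1:]))
    V = len(vocab)
    def logprob(seq):
        return sum(math.log((bigrams[(a, b)] + alpha) /
                            (contexts[a] + alpha * V))
                   for a, b in zip(seq, seq[1:]))
    return logprob

# Hypothetical viseme inventories and training sequences.
models = {
    "english": train_bigram([["V1", "V3", "V2", "V3"], ["V3", "V2", "V1"]]),
    "arabic":  train_bigram([["V2", "V2", "V4", "V1"], ["V4", "V2", "V2"]]),
}
test = ["V2", "V2", "V4"]
print(max(models, key=lambda lang: models[lang](test)))   # -> "arabic"
```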
Current trends in multilingual speech processing
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years, and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS), as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies that help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies, at the heart of which lies multilingual speech processing.
Speech vocoding for laboratory phonology
Using phonological speech vocoding, we propose a platform for exploring
relations between phonology and speech processing, and in broader terms, for
exploring relations between the abstract and physical structures of a speech
signal. Our goal is to make a step towards bridging phonology and speech
processing and to contribute to the program of Laboratory Phonology. We show
three application examples for laboratory phonology: compositional phonological
speech modelling, a comparison of phonological systems and an experimental
phonological parametric text-to-speech (TTS) system. The featural
representations of the following three phonological systems are considered in
this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English
(SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded
speech, we conclude that the latter achieves slightly better results than the
former. However, GP - the most compact phonological speech representation -
performs comparably to the systems with a higher number of phonological
features. The parametric TTS system based on the phonological speech
representation, trained on an unlabelled audiobook in an unsupervised manner,
achieves 85% of the intelligibility of state-of-the-art parametric speech synthesis. We
envision that the presented approach paves the way for researchers in both
fields to form meaningful hypotheses that are explicitly testable using the
concepts developed and exemplified in this paper. On the one hand, laboratory
phonologists might test the applied concepts of their theoretical models, and
on the other hand, the speech processing community may utilize the concepts
developed for the theoretical phonological models for improvements of the
current state-of-the-art applications.
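To illustrate what a featural representation of the SPE/eSPE kind looks like in practice, here is a toy sketch in which each phone maps to a fixed vector of binary phonological features. The feature names follow common SPE usage, but this tiny inventory and its values are illustrative assumptions, not the paper's GP/SPE/eSPE feature sets.

```python
SPE_FEATURES = ["voiced", "nasal", "continuant", "labial", "coronal", "high"]

PHONE_TABLE = {
    #     voiced nasal cont labial coronal high
    "p": [0,     0,    0,   1,     0,      0],
    "b": [1,     0,    0,   1,     0,      0],
    "m": [1,     1,    0,   1,     0,      0],
    "s": [0,     0,    1,   0,     1,      0],
    "i": [1,     0,    1,   0,     0,      1],
}

def encode(phones):
    """Turn a phone sequence into per-phone binary feature vectors, the kind
    of trajectory a phonological vocoder could be driven by."""
    return [PHONE_TABLE[p] for p in phones]

print(encode(["b", "i", "s"]))
```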
Language-based multimedia information retrieval
This paper describes various methods and approaches for language-based multimedia information retrieval, which have been developed in the projects POP-EYE and OLIVE and which will be developed further in the MUMIS project. All of these projects aim at supporting automated indexing of video material through human language technologies. Thus, in contrast to image- or sound-based retrieval methods, where both the query language and the indexing methods build on non-linguistic data, these methods attempt to exploit advanced text retrieval technologies for the retrieval of non-textual material. While POP-EYE built on subtitles or captions as the prime language key for disclosing video fragments, OLIVE makes use of speech recognition to automatically derive transcriptions of the soundtracks, generating time-coded linguistic elements which then serve as the basis for text-based retrieval functionality.
Improved acoustic word embeddings for zero-resource languages using multilingual transfer
Acoustic word embeddings are fixed-dimensional representations of
variable-length speech segments. Such embeddings can form the basis for speech
search, indexing and discovery systems when conventional speech recognition is
not possible. In zero-resource settings where unlabelled speech is the only
available resource, we need a method that gives robust embeddings on an
arbitrary language. Here we explore multilingual transfer: we train a single
supervised embedding model on labelled data from multiple well-resourced
languages and then apply it to unseen zero-resource languages. We consider
three multilingual recurrent neural network (RNN) models: a classifier trained
on the joint vocabularies of all training languages; a Siamese RNN trained to
discriminate between same and different words from multiple languages; and a
correspondence autoencoder (CAE) RNN trained to reconstruct word pairs. In a
word discrimination task on six target languages, all of these models
outperform state-of-the-art unsupervised models trained on the zero-resource
languages themselves, giving relative improvements of more than 30% in average
precision. When using only a few training languages, the multilingual CAE
performs better, but with more training languages the other multilingual models
perform similarly. Using more training languages is generally beneficial, but
improvements are marginal on some languages. We present probing experiments
which show that the CAE encodes more phonetic, word duration, language identity
and speaker information than the other multilingual models.
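A minimal PyTorch sketch of the correspondence autoencoder (CAE) RNN idea: encode one instance of a word to a fixed-dimensional embedding, then train a decoder to reconstruct the features of a different instance of the same word. Dimensions and names are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CAERNN(nn.Module):
    def __init__(self, feat_dim=39, embed_dim=130, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.to_embed = nn.Linear(hidden, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden, batch_first=True)
        self.to_feats = nn.Linear(hidden, feat_dim)

    def embed(self, x):                       # x: (batch, frames, feat_dim)
        _, h = self.encoder(x)
        return self.to_embed(h[-1])           # fixed-dimensional embedding

    def forward(self, x, target_len):
        z = self.embed(x)
        # Feed the embedding at every decoder step of the paired word.
        out, _ = self.decoder(z.unsqueeze(1).repeat(1, target_len, 1))
        return self.to_feats(out)

model = CAERNN()
word_a = torch.randn(8, 60, 39)               # one instance of each word
word_b = torch.randn(8, 75, 39)               # another instance, same word
recon = model(word_a, target_len=word_b.shape[1])
loss = nn.functional.mse_loss(recon, word_b)  # correspondence objective
loss.backward()
```

At test time only the encoder is kept: word segments are mapped to embeddings, and word discrimination reduces to a distance comparison (e.g. cosine) in the embedding space.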
Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
We summarize the accomplishments of a multi-disciplinary workshop exploring
the computational and scientific issues surrounding the discovery of linguistic
units (subwords and words) in a language without orthography. We study the
replacement of orthographic transcriptions by images and/or translated text in
a well-resourced language to help unsupervised discovery from raw speech.