Boosting End-to-End Multilingual Phoneme Recognition through Exploiting Universal Speech Attributes Constraints
We propose a first step toward multilingual end-to-end automatic speech
recognition (ASR) by integrating knowledge about speech articulators. The key
idea is to leverage a rich set of fundamental units that can be defined
"universally" across all spoken languages, referred to as speech attributes,
namely manner and place of articulation. Specifically, several deterministic
attribute-to-phoneme mapping matrices are constructed based on a predefined
universal attribute inventory; these matrices project the knowledge-rich
articulatory attribute logits into output phoneme logits. The mapping puts
knowledge-based constraints to limit inconsistency with acoustic-phonetic
evidence in the integrated prediction. Combined with phoneme recognition, our
phone recognizer is able to infer from both attribute and phoneme information.
The proposed joint multilingual model is evaluated through phoneme recognition.
In multilingual experiments over 6 languages on benchmark datasets LibriSpeech
and CommonVoice, we find that our proposed solution outperforms conventional
multilingual approaches with a relative improvement of 6.85% on average, and it
also demonstrates a much better performance compared to a monolingual model.
Further analysis conclusively demonstrates that the proposed solution
eliminates phoneme predictions that are inconsistent with attributes.
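The attribute-to-phoneme projection described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy attribute inventory, the phoneme set, and the choice of a 0/1 matrix whose rows sum the logits of a phoneme's attributes are all assumptions made for the example, and the helper names `build_mapping`, `project`, and `softmax` are hypothetical.

```python
import math

# Hypothetical toy inventory; the real attribute and phoneme sets are not given in the abstract.
ATTRIBUTES = ["stop", "nasal", "fricative", "bilabial", "alveolar"]
PHONEME_ATTRS = {
    "p": {"stop", "bilabial"},
    "t": {"stop", "alveolar"},
    "m": {"nasal", "bilabial"},
    "n": {"nasal", "alveolar"},
    "s": {"fricative", "alveolar"},
}


def build_mapping(attributes, phoneme_attrs):
    """Build a deterministic 0/1 matrix: one row per phoneme, one column per attribute."""
    return {
        ph: [1.0 if a in attrs else 0.0 for a in attributes]
        for ph, attrs in phoneme_attrs.items()
    }


def project(attr_logits, mapping):
    """Project attribute logits into phoneme logits (assumed here: matrix-vector product)."""
    return {
        ph: sum(w * x for w, x in zip(row, attr_logits))
        for ph, row in mapping.items()
    }


def softmax(logits):
    """Numerically stable softmax over a dict of logits."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}


MAPPING = build_mapping(ATTRIBUTES, PHONEME_ATTRS)

# One frame of attribute logits: strong evidence for "stop" and "bilabial".
attr_logits = [2.0, 0.0, 0.0, 1.5, 0.0]
phoneme_logits = project(attr_logits, MAPPING)
phoneme_probs = softmax(phoneme_logits)
best = max(phoneme_probs, key=phoneme_probs.get)
print(best, round(phoneme_probs[best], 3))
```

Because the matrix is fixed rather than learned, a phoneme can only score highly when the attributes it is defined by score highly, which is one way to read the "knowledge-based constraints" mentioned above.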
Towards Zero-shot Learning for Automatic Phonemic Transcription
Automatic phonemic transcription tools are useful for low-resource language
documentation. However, due to the lack of training sets, only a tiny fraction
of languages have phonemic transcription tools. Fortunately, multilingual
acoustic modeling provides a solution given limited audio training data. A more
challenging problem is to build phonemic transcribers for languages with zero
training data. The difficulty of this task is that phoneme inventories often
differ between the training languages and the target language, making it
infeasible to recognize unseen phonemes. In this work, we address this problem
by adopting the idea of zero-shot learning. Our model is able to recognize
unseen phonemes in the target language without any training data. In our model,
we decompose phonemes into corresponding articulatory attributes such as vowel
and consonant. Instead of predicting phonemes directly, we first predict
distributions over articulatory attributes, and then compute phoneme
distributions with a customized acoustic model. We evaluate our model by
training it using 13 languages and testing it using 7 unseen languages. We find
that it achieves 7.7% better phoneme error rate on average over a standard
multilingual model.
Comment: AAAI 202
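As a rough illustration of how attribute predictions can yield a distribution over phonemes that never appeared in training, the sketch below scores each phoneme of a target inventory by how well its attribute signature matches per-attribute probabilities. It is an assumption-laden toy, not the paper's customized acoustic model: the independence assumption across attributes, the attribute names, the three-phoneme inventory, and the function name `phoneme_distribution` are all hypothetical.

```python
import math


def phoneme_distribution(attr_probs, inventory):
    """Turn per-attribute probabilities into a distribution over an inventory.

    attr_probs: dict mapping attribute -> P(attribute is present) for one frame.
    inventory:  dict mapping phoneme -> set of attributes that phoneme has.
    Assumes attributes are independent, which is a simplification.
    """
    scores = {}
    for phoneme, attrs in inventory.items():
        log_p = 0.0
        for attribute, p in attr_probs.items():
            p = min(max(p, 1e-6), 1.0 - 1e-6)  # avoid log(0)
            log_p += math.log(p if attribute in attrs else 1.0 - p)
        scores[phoneme] = log_p
    # Normalize over the target inventory with a stable softmax.
    m = max(scores.values())
    exps = {ph: math.exp(s - m) for ph, s in scores.items()}
    z = sum(exps.values())
    return {ph: v / z for ph, v in exps.items()}


# Hypothetical attribute predictions for one frame.
attr_probs = {"vowel": 0.1, "consonant": 0.9, "nasal": 0.8, "velar": 0.7, "bilabial": 0.2}

# Hypothetical target-language inventory; "ŋ" stands in for a phoneme unseen in training.
target_inventory = {
    "ŋ": {"consonant", "nasal", "velar"},
    "m": {"consonant", "nasal", "bilabial"},
    "a": {"vowel"},
}

dist = phoneme_distribution(attr_probs, target_inventory)
best = max(dist, key=dist.get)
print(best)
```

The point of the design is that only the inventory table changes per language; the attribute predictor itself needs no target-language training data.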
Investigation of multilingual deep neural networks for spoken term detection
The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to address the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (∼10 hours/language) with 4 languages for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A language independent acoustic model test on the target language showed that retraining or adapting of the acoustic models to the target language is currently minimally needed to achieve reasonable performance. © 2013 IEEE
Cross-lingual automatic speech recognition using tandem features
Automatic speech recognition requires many hours of transcribed speech recordings
in order for an acoustic model to be effectively trained. However, recording speech
corpora is time-consuming and expensive, so such quantities of data exist only for
a handful of languages — there are many languages for which little or no data exist.
Given that there are acoustic similarities between different languages, it may be fruitful
to use data from a well-supported source language for the task of training a recogniser
in a target language with little training data.
Since most languages do not share a common phonetic inventory, we propose an
indirect way of transferring information from a source language model to a target language
model. Tandem features, in which class-posteriors from a separate classifier
are decorrelated and appended to conventional acoustic features, are used to do that.
They have the advantage that the language used to train the classifier, typically a Multilayer
Perceptron (MLP), need not be the same as the target language being recognised.
Consistent with prior work, positive results are achieved for monolingual systems in a
number of different languages.
Furthermore, improvements are also shown for the cross-lingual case, in which the
tandem features were generated using a classifier not trained for the target language.
We examine factors which may predict the relative improvements brought about by
tandem features for a given source and target pair. We examine some cross-corpus
normalization issues that naturally arise in multilingual speech recognition and validate
our solution in terms of recognition accuracy and a mutual information measure.
The tandem classifier in work up to this point in the thesis has been a phoneme classifier.
Articulatory features (AFs), represented here as a multi-stream, discrete, multivalued
labelling of speech, can be used as an alternative task. The motivation for this is
that since AFs are a set of physically grounded categories that are not language-specific
they may be more suitable for cross-lingual transfer. Then, using either phoneme or
AF classification as our MLP task, we look at training the MLP using data from more
than one language; again we hypothesise that AF tandem will result in greater improvements
in accuracy. We also examine performance where only limited amounts of
target language data are available, and see how our various tandem systems perform
under those conditions.
AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Self-supervision has shown great potential for audio-visual speech
recognition by vastly reducing the amount of labeled data required to build
good systems. However, existing methods are either not entirely end-to-end or
do not train joint representations of both modalities. In this paper, we
introduce AV-data2vec which addresses these challenges and builds audio-visual
representations based on predicting contextualized representations, an approach
that has been successful in the uni-modal case. The model uses a shared transformer
encoder for both audio and video and can combine both modalities to improve
speech recognition. Results on LRS3 show that AV-data2vec consistently
outperforms existing methods under all settings with the same amount of data
and model size.
Comment: 2023 ASR
Neurocognitive Informatics Manifesto.
Informatics studies all aspects of the structure of natural and artificial information systems. Theoretical and abstract approaches to information have made great advances, but human information processing is still unmatched in many areas, including information management, representation and understanding. Neurocognitive informatics is a new, emerging field that should help to improve the matching of artificial and natural systems, and inspire better computational algorithms to solve problems that are still beyond the reach of machines. In this position paper, examples of neurocognitive inspirations and promising directions in this area are given.