70 research outputs found
Multilingual Bottleneck Features for Improving ASR Performance of Code-Switched Speech in Under-Resourced Languages
In this work, we explore the benefits of using multilingual bottleneck
features (mBNF) in acoustic modelling for the automatic speech recognition of
code-switched (CS) speech in African languages. The unavailability of annotated
corpora in the languages of interest has always been a primary challenge when
developing speech recognition systems for this severely under-resourced type of
speech. Hence, it is worthwhile to investigate the potential of using speech
corpora available for other better-resourced languages to improve speech
recognition performance. To achieve this, we train a mBNF extractor using nine
Southern Bantu languages that form part of the freely available multilingual
NCHLT corpus. We append these mBNFs to the existing MFCCs, pitch features and
i-vectors to train acoustic models for automatic speech recognition (ASR) in
the target code-switched languages. Our results show that the inclusion of the
mBNF features leads to clear performance improvements over a baseline trained
without the mBNFs for code-switched English-isiZulu, English-isiXhosa,
English-Sesotho and English-Setswana speech.
Comment: In Proceedings of The First Workshop on Speech Technologies for Code-Switching in Multilingual Communities
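Below is a minimal sketch of the feature setup described in this abstract: per-frame MFCC and pitch vectors, the multilingual bottleneck features and an utterance-level i-vector are concatenated into one acoustic feature matrix. The dimensions and array names are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch: appending mBNFs to MFCC+pitch features and an i-vector.
import numpy as np

def build_acoustic_features(mfcc_pitch, mbnf, ivector):
    """Concatenate per-frame features with a tiled utterance-level i-vector.

    mfcc_pitch: (num_frames, 43), e.g. 40 MFCCs + 3 pitch features (assumed dims)
    mbnf:       (num_frames, 40), bottleneck outputs from the multilingual extractor
    ivector:    (100,), one i-vector per utterance
    """
    assert mfcc_pitch.shape[0] == mbnf.shape[0], "frame counts must match"
    ivec_tiled = np.tile(ivector, (mfcc_pitch.shape[0], 1))  # repeat for every frame
    return np.hstack([mfcc_pitch, mbnf, ivec_tiled])

# Example with random stand-in data.
feats = build_acoustic_features(np.random.randn(500, 43),
                                np.random.randn(500, 40),
                                np.random.randn(100))
print(feats.shape)  # (500, 183)
```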
Code-Switching Detection with Data-Augmented Acoustic and Language Models
In this paper, we investigate the code-switching detection performance of a
code-switching (CS) automatic speech recognition (ASR) system with
data-augmented acoustic and language models. We focus on the recognition of
Frisian-Dutch radio broadcasts where one of the mixed languages, namely
Frisian, is under-resourced. Recently, we have explored how the acoustic
modeling (AM) can benefit from monolingual speech data belonging to the
high-resourced mixed language. For this purpose, we have trained
state-of-the-art AMs on a significantly increased amount of CS speech obtained
by automatic transcription, together with monolingual Dutch speech. Moreover, we
have improved the language model (LM) by creating CS text in various ways,
including text generation using recurrent LMs trained on existing CS text.
Motivated by the significantly improved CS ASR performance, we delve into the
CS detection performance of the same ASR system in this work by reporting CS
detection accuracies together with a detailed detection error analysis.
Comment: Accepted for publication at SLTU 2018. arXiv admin note: substantial text overlap with arXiv:1807.1094
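As an illustration of what code-switching detection from ASR output can look like, here is a minimal sketch: each time-aligned word carries a language tag, switches are detected where consecutive tags differ, and a simple tag accuracy is computed against a reference. The data structures and the word-level metric are assumptions for illustration, not the evaluation protocol of this paper.

```python
# Hypothetical sketch: scoring code-switching detection from tagged ASR output.
from typing import List, Tuple

Word = Tuple[str, str]  # (token, language tag such as "fry" or "nld")

def switch_points(words: List[Word]) -> List[int]:
    """Indices where the language tag changes relative to the previous word."""
    return [i for i in range(1, len(words)) if words[i][1] != words[i - 1][1]]

def tag_accuracy(hyp: List[Word], ref: List[Word]) -> float:
    """Fraction of word positions whose language tag matches the reference.

    Assumes hypothesis and reference are already aligned to the same length.
    """
    assert len(hyp) == len(ref), "sequences must be aligned first"
    return sum(h[1] == r[1] for h, r in zip(hyp, ref)) / len(ref)

ref = [("hjoed", "fry"), ("is", "fry"), ("het", "nld"), ("mooi", "nld")]
hyp = [("hjoed", "fry"), ("is", "nld"), ("het", "nld"), ("mooi", "nld")]
print(switch_points(ref), tag_accuracy(hyp, ref))  # [2] 0.75
```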
Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech
This paper considers the impact of automatic segmentation on the
fully-automatic, semi-supervised training of automatic speech recognition (ASR)
systems for five-lingual code-switched (CS) speech. Four automatic segmentation
techniques were evaluated in terms of the recognition performance of an ASR
system trained on the resulting segments in a semi-supervised manner. The
system's recognition rates were compared with those achieved by a
semi-supervised system trained on manually assigned segments. Three of the
automatic techniques use a newly proposed convolutional neural network (CNN)
model for framewise classification, and include a novel form of HMM smoothing
of the CNN outputs. Automatic segmentation was applied in combination with
automatic speaker diarization. The best-performing segmentation technique was
also tested without speaker diarization. An evaluation based on 248 unsegmented
soap opera episodes indicated that voice activity detection (VAD) based on a
CNN followed by Gaussian mixture model-hidden Markov model smoothing
(CNN-GMM-HMM) yields the best ASR performance. The semi-supervised system
trained with the resulting segments achieved an overall WER improvement of 1.1%
absolute over the system trained with manually created segments. Furthermore,
we found that system performance improved even further when the automatic
segmentation was used in conjunction with speaker diarization.
Comment: SLTU 2020. arXiv admin note: text overlap with arXiv:2003.0313
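The CNN-GMM-HMM idea above amounts to smoothing framewise speech/non-speech decisions with an HMM so that segment boundaries do not flicker. A minimal sketch of such smoothing via two-state Viterbi decoding is given below; the transition probabilities and toy posteriors are assumptions, not the parameters used in the paper.

```python
# Hypothetical sketch: HMM smoothing of framewise speech/non-speech posteriors.
import numpy as np

def viterbi_smooth(posteriors, stay_prob=0.99):
    """Viterbi decoding over two states (0 = non-speech, 1 = speech).

    posteriors: (num_frames, 2) per-frame class probabilities, e.g. from a CNN.
    Returns the smoothed state sequence as an int array of length num_frames.
    """
    num_frames = posteriors.shape[0]
    log_trans = np.log(np.array([[stay_prob, 1 - stay_prob],
                                 [1 - stay_prob, stay_prob]]))
    log_obs = np.log(posteriors + 1e-10)
    delta = np.zeros((num_frames, 2))              # best log-score per state
    backptr = np.zeros((num_frames, 2), dtype=int)
    delta[0] = log_obs[0]
    for t in range(1, num_frames):
        scores = delta[t - 1][:, None] + log_trans  # rows: previous state
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    path = np.zeros(num_frames, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(num_frames - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path

# A single noisy "speech" frame surrounded by non-speech gets smoothed away.
post = np.array([[0.9, 0.1]] * 5 + [[0.2, 0.8]] + [[0.9, 0.1]] * 5)
print(viterbi_smooth(post))  # all zeros: the one-frame blip is removed
```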
Using heterogeneity in semi-supervised transcription hypotheses to improve code-switched speech recognition
Modeling code-switched speech is an important problem in automatic speech
recognition (ASR). Labeled code-switched data are rare, so monolingual data are
often used to model code-switched speech. These monolingual data may be more
closely matched to one of the languages in the code-switch pair. We show that
such asymmetry can bias prediction toward the better-matched language and
degrade overall model performance. To address this issue, we propose a
semi-supervised approach for code-switched ASR. We consider the case of
English-Mandarin code-switching, and the problem of using monolingual data to
build bilingual "transcription models'' for annotation of unlabeled
code-switched data. We first build multiple transcription models so that their
individual predictions are variously biased toward either English or Mandarin.
We then combine these biased transcriptions using confidence-based selection.
This strategy generates a superior transcript for semi-supervised training, and
obtains a 19% relative improvement compared to a semi-supervised system that
relies on a transcription model built with only the best-matched monolingual
data.
Comment: 5 pages
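A minimal sketch of the confidence-based selection step described above: several transcription models, each biased toward a different language, decode the same unlabeled utterance, and the most confident hypothesis is kept for semi-supervised training. The Hypothesis structure and the confidence scores are illustrative assumptions.

```python
# Hypothetical sketch: confidence-based selection among biased transcription models.
from dataclasses import dataclass
from typing import List

@dataclass
class Hypothesis:
    model_name: str    # which biased transcription model produced it
    text: str          # decoded transcript
    confidence: float  # e.g. average per-word posterior from the decoder

def select_transcript(hypotheses: List[Hypothesis]) -> Hypothesis:
    """Keep the most confident hypothesis for this utterance."""
    return max(hypotheses, key=lambda h: h.confidence)

hyps = [
    Hypothesis("english_biased", "we can 明天 再 go", 0.62),
    Hypothesis("mandarin_biased", "我们 可以 明天 再 go", 0.81),
]
print(select_transcript(hyps).model_name)  # "mandarin_biased"
```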
Language Model Bootstrapping Using Neural Machine Translation For Conversational Speech Recognition
Building conversational speech recognition systems for new languages is
constrained by the availability of utterances that capture user-device
interactions. Data collection is both expensive and limited by the speed of
manual transcription. In order to address this, we advocate the use of neural
machine translation as a data augmentation technique for bootstrapping language
models. Machine translation (MT) offers a systematic way of incorporating
collections from mature, resource-rich conversational systems that may be
available for a different language. However, ingesting raw translations from a
general-purpose MT system may not be effective owing to the presence of named
entities, intra-sentential code-switching and the domain mismatch between the
conversational data being translated and the parallel text used for MT
training. To circumvent this, we explore the following domain adaptation
techniques: (a) sentence embedding based data selection for MT training, (b)
model finetuning, and (c) rescoring and filtering translated hypotheses. Using
Hindi as the experimental testbed, we translate US English utterances to
supplement the transcribed collections. We observe a relative word error rate
reduction of 7.8-15.6%, depending on the bootstrapping phase. Fine-grained
analysis reveals that translation particularly aids the interaction scenarios
which are underrepresented in the transcribed data.
Comment: Accepted by IEEE ASRU workshop, 201
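For technique (a), a minimal sketch of sentence-embedding based data selection is shown below: general-domain MT training sentences are ranked by cosine similarity to the centroid of a small in-domain conversational set, and only the closest fraction is kept. The embed callable is a stand-in assumption; any sentence encoder could be plugged in.

```python
# Hypothetical sketch: sentence-embedding based data selection for MT training.
import numpy as np
from typing import Callable, List

def select_in_domain(general: List[str],
                     in_domain: List[str],
                     embed: Callable[[List[str]], np.ndarray],
                     keep_fraction: float = 0.2) -> List[str]:
    """Return the general-domain sentences closest to the in-domain centroid."""
    centroid = embed(in_domain).mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    gen_vecs = embed(general)
    gen_vecs = gen_vecs / np.linalg.norm(gen_vecs, axis=1, keepdims=True)
    sims = gen_vecs @ centroid                      # cosine similarity per sentence
    keep = max(1, int(keep_fraction * len(general)))
    top = np.argsort(-sims)[:keep]
    return [general[i] for i in top]
```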
Multi-Graph Decoding for Code-Switching ASR
In the FAME! Project, a code-switching (CS) automatic speech recognition
(ASR) system for Frisian-Dutch speech is developed that can accurately
transcribe the local broadcaster's bilingual archives with CS speech. This
archive contains recordings with monolingual Frisian and Dutch speech segments
as well as Frisian-Dutch CS speech, hence the recognition performance on
monolingual segments is also vital for accurate transcriptions. In this work,
we propose a multi-graph decoding and rescoring strategy using bilingual and
monolingual graphs together with a unified acoustic model for CS ASR. The
proposed decoding scheme gives the freedom to design and employ alternative
search spaces for each (monolingual or bilingual) recognition task and enables
the effective use of monolingual resources of the high-resourced mixed language
in low-resourced CS scenarios. In our scenario, Dutch is the high-resourced and
Frisian is the low-resourced language. We therefore use additional monolingual
Dutch text resources to improve the Dutch language model (LM) and compare the
performance of single- and multi-graph CS ASR systems on Dutch segments using
larger Dutch LMs. The ASR results show that the proposed approach outperforms
baseline single-graph CS ASR systems, providing better performance on the
monolingual Dutch segments without any accuracy loss on monolingual Frisian and
code-mixed segments.
Comment: Accepted for publication at Interspeech 201
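A minimal sketch of the multi-graph idea is shown below: every segment is decoded against a bilingual graph and against each monolingual graph, all sharing one acoustic model, and the best-scoring hypothesis is kept. The decode callable and the plain score comparison are assumptions for illustration; the paper additionally rescores with larger LMs.

```python
# Hypothetical sketch: pick the best hypothesis across several decoding graphs.
from typing import Callable, Dict, Tuple

# decode(graph_name, audio) -> (transcript, total log-score); assumed interface.
Decoder = Callable[[str, bytes], Tuple[str, float]]

def multi_graph_decode(audio: bytes,
                       decode: Decoder,
                       graphs=("frisian_dutch_cs", "frisian", "dutch")):
    """Return (best_graph, transcript) over all decoding graphs."""
    results: Dict[str, Tuple[str, float]] = {g: decode(g, audio) for g in graphs}
    best = max(results, key=lambda g: results[g][1])
    return best, results[best][0]
```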
An Overview of Indian Spoken Language Recognition from Machine Learning Perspective
Automatic spoken language identification (LID) is a very important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps to improve the performance of many speech-based applications in the multilingual scenario. India is a populous country with diverse cultures and languages. The majority of the Indian population needs to use their respective native languages for verbal interaction with machines. Therefore, the development of efficient Indian spoken language recognition systems is useful for adapting smart technologies in every section of Indian society. The field of Indian LID has started gaining momentum in the last two decades, mainly due to the development of several standard multilingual speech corpora for the Indian languages. Even though significant research progress has already been made in this field, to the best of our knowledge, there have been few attempts to review it analytically and collectively. In this work, we present one of the very first comprehensive reviews of the Indian spoken language recognition research field. In-depth analysis is presented to emphasize the unique challenges of low resource availability and mutual language influence that arise when developing LID systems in the Indian context. Several essential aspects of Indian LID research are discussed, including a detailed description of the available speech corpora, the major research contributions, ranging from earlier attempts based on statistical modeling to recent approaches based on different neural network architectures, and future research trends. This review will help active researchers and research enthusiasts from related fields assess the present state of Indian LID research.
Non-autoregressive Mandarin-English Code-switching Speech Recognition with Pinyin Mask-CTC and Word Embedding Regularization
Mandarin-English code-switching (CS) is frequently used among East and
Southeast Asian people. However, the intra-sentence language switching of the
two very different languages makes recognizing CS speech challenging.
Meanwhile, recent successful non-autoregressive (NAR) ASR models remove the
need for left-to-right beam decoding used in autoregressive (AR) models and
achieve outstanding performance with fast inference speed. Therefore, in this
paper, we take advantage of the Mask-CTC NAR ASR framework to tackle the CS speech
recognition issue. We propose changing the Mandarin output target of the
encoder to Pinyin for faster encoder training, and introduce a Pinyin-to-Mandarin
decoder to learn contextualized information. Moreover, we propose word
embedding label smoothing to regularize the decoder with contextualized
information, and projection matrix regularization to bridge the gap between the
encoder and decoder. We evaluate the proposed methods on the SEAME corpus and
achieve exciting results.
Comment: 5 pages, 1 figure, submitted to INTERSPEECH202
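A minimal sketch of word embedding label smoothing as described above: instead of a one-hot target (or uniform label smoothing), part of the probability mass is spread over tokens whose embeddings are similar to the gold token. The embedding matrix, the smoothing weight and the softmax over raw similarities are illustrative assumptions.

```python
# Hypothetical sketch: word embedding label smoothing for the decoder targets.
import torch
import torch.nn.functional as F

def embedding_smoothed_targets(gold_ids, embeddings, alpha=0.1):
    """Build soft targets from word embedding similarity.

    gold_ids:   (batch,) gold token indices (LongTensor)
    embeddings: (vocab, dim) decoder word embedding matrix
    Returns (batch, vocab): (1 - alpha) on the gold token plus alpha spread
    according to embedding similarity.
    """
    vocab = embeddings.size(0)
    one_hot = F.one_hot(gold_ids, vocab).float()
    sims = embeddings[gold_ids] @ embeddings.t()   # (batch, vocab) similarities
    sim_dist = F.softmax(sims, dim=-1)
    return (1 - alpha) * one_hot + alpha * sim_dist

# Training would then minimise cross-entropy / KL against these soft targets
# instead of the usual hard labels.
emb = torch.randn(3000, 256)
targets = embedding_smoothed_targets(torch.tensor([5, 17]), emb)
print(targets.shape)  # torch.Size([2, 3000])
```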
- …