XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese
This research paper focuses on the development and evaluation of Automatic
Speech Recognition (ASR) technology using the XLS-R 300m model. The study aims
to improve ASR performance in converting spoken language into written text,
specifically for Indonesian, Javanese, and Sundanese languages. The paper
discusses the testing procedures, datasets used, and methodology employed in
training and evaluating the ASR systems. The results show that the XLS-R 300m
model achieves competitive Word Error Rate (WER) measurements, with a slight
compromise in performance for Javanese and Sundanese languages. The integration
of a 5-gram KenLM language model significantly reduces WER and enhances ASR
accuracy. The research contributes to the advancement of ASR technology by
addressing linguistic diversity and improving performance across various
languages. The findings provide insights into optimizing ASR accuracy and
applicability for diverse linguistic contexts.
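For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper; the function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four words.
print(word_error_rate("saya makan nasi goreng", "saya minum nasi"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.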
Self-supervised end-to-end ASR for low resource L2 Swedish
Publisher Copyright: Copyright © 2021 ISCA. Unlike traditional (hybrid) Automatic Speech Recognition (ASR), end-to-end ASR systems simplify the training procedure by directly mapping acoustic features to sequences of graphemes or characters, thereby eliminating the need for specialized acoustic, language, or pronunciation models. However, one drawback of end-to-end ASR systems is that they require more training data than conventional ASR systems to achieve a similar word error rate (WER). This makes it difficult to develop ASR systems for tasks where transcribed target data is limited, such as developing ASR for Second Language (L2) speakers of Swedish. Nonetheless, recent advancements in self-supervised acoustic learning, manifested in wav2vec models [1, 2, 3], leverage the available untranscribed speech data to provide a compact acoustic representation that can achieve low WER when incorporated in end-to-end systems. To this end, we experiment with several monolingual and cross-lingual self-supervised acoustic models to develop an end-to-end ASR system for L2 Swedish. Even though our test is very small, it indicates that these systems are competitive in performance with the traditional ASR pipeline. Our best model seems to reduce the WER by 7% relative to our traditional ASR baseline trained on the same target data. Peer reviewed
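A relative WER reduction such as the 7% quoted above is the drop in WER expressed as a fraction of the baseline WER, as opposed to an absolute difference in percentage points. The sketch below illustrates the arithmetic only; the helper name `relative_wer_reduction` and the input values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values: a baseline at 20.0% WER and a new system at 18.6% WER.
# The absolute gain is 1.4 points; the relative gain is about 7%.
reduction = relative_wer_reduction(0.200, 0.186)
print(f"{reduction:.1%}")
```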
Bridging Speech and Textual Pre-trained Models with Unsupervised ASR
Spoken language understanding (SLU) is a task aiming to extract high-level
semantics from spoken utterances. Previous works have investigated the use of
speech self-supervised models and textual pre-trained models, which have shown
reasonable improvements to various SLU tasks. However, because of the
mismatched modalities between speech signals and text tokens, previous methods
usually need complex designs of the frameworks. This work proposes a simple yet
efficient unsupervised paradigm that connects speech and textual pre-trained
models, resulting in an unsupervised speech-to-semantic pre-trained model for
various tasks in SLU. To be specific, we propose to use unsupervised automatic
speech recognition (ASR) as a connector that bridges different modalities used
in speech and textual pre-trained models. Our experiments show that
unsupervised ASR itself can improve the representations from speech
self-supervised models. More importantly, it is shown as an efficient connector
between speech and textual pre-trained models, improving the performances of
five different SLU tasks. Notably, on spoken question answering, we reach the
state-of-the-art result over the challenging NMSQA benchmark. Comment: ICASSP 2023 submission
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech
Self-Supervised Learning (SSL) using huge unlabeled data has been
successfully explored for image and natural language processing. Recent works
also investigated SSL from speech. They were notably successful in improving
performance on downstream tasks such as automatic speech recognition (ASR).
While these works suggest it is possible to reduce dependence on labeled data
for building efficient speech systems, their evaluation was mostly made on ASR
and using multiple and heterogeneous experimental settings (most of them for
English). This questions the objective comparison of SSL approaches and the
evaluation of their impact on building speech systems. In this paper, we
propose LeBenchmark: a reproducible framework for assessing SSL from speech. It
not only includes ASR (high and low resource) tasks but also spoken language
understanding, speech translation and emotion recognition. We also focus on
speech technologies in a language other than English: French. SSL models of
different sizes are trained from carefully sourced and documented datasets.
Experiments show that SSL is beneficial for most but not all tasks, which
confirms the need for exhaustive and reliable benchmarks to evaluate its real
impact. LeBenchmark is shared with the scientific community for reproducible
research in SSL from speech. Comment: Will be presented at Interspeech 202