Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching
While many speakers of low-resource languages regularly code-switch between
their languages and other regional languages or English, datasets of
code-switched speech are too small to train bespoke acoustic models from
scratch or to perform language-model rescoring. Here we propose finetuning self-supervised
speech representations such as wav2vec 2.0 XLSR to recognize code-switched
data. We find that finetuning self-supervised multilingual representations and
augmenting them with n-gram language models trained from transcripts reduces
absolute word error rates by up to 20% compared to baselines of hybrid models
trained from scratch on code-switched data. Our findings suggest that, in
circumstances with limited training data, finetuning self-supervised
representations is a viable and better-performing solution.
Comment: 5 pages, 1 figure. Computational Approaches to Linguistic
Code-Switching, CALCS 2023 (co-located with EMNLP 2023)
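The n-gram rescoring component mentioned in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the toy transcripts, the `bigram_logprob` helper, and the add-one smoothing choice are all illustrative assumptions, standing in for an LM trained on real code-switched transcripts.

```python
from collections import defaultdict
import math

# Toy code-switched transcripts; purely illustrative, not the paper's data.
transcripts = [
    "ek het die meeting gemis",
    "ek het die bus gemis",
    "die meeting was lank",
]

# Count unigrams and bigrams, with explicit sentence-boundary tokens.
unigrams = defaultdict(int)
bigrams = defaultdict(int)
for line in transcripts:
    tokens = ["<s>"] + line.split() + ["</s>"]
    for w in tokens:
        unigrams[w] += 1
    for a, b in zip(tokens, tokens[1:]):
        bigrams[(a, b)] += 1

vocab_size = len(unigrams)

def bigram_logprob(sentence: str) -> float:
    """Add-one-smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return logp

# Rescoring idea: an in-domain ASR hypothesis outscores a garbled word order.
print(bigram_logprob("ek het die meeting gemis") >
      bigram_logprob("meeting die ek gemis het"))  # True
```

In rescoring, such scores would be combined with the acoustic model's scores to rerank an n-best list of hypotheses; production systems typically use a dedicated toolkit rather than hand-rolled counts.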
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
Multilingualism is widespread around the world and code-switching (CSW) is a
common practice among different language pairs/tuples across locations and
regions. However, there is still not much progress in building successful CSW
systems, despite the recent advances in Massive Multilingual Language Models
(MMLMs). We investigate the reasons behind this setback through a critical
study of 68 existing CSW data sets across language pairs, focusing on the
collection and preparation (e.g., transcription and annotation) stages. This
in-depth analysis reveals that (a) most CSW data involves English, neglecting
other language pairs/tuples, and (b) there are representativeness flaws in the
data collection and preparation stages stemming from neglect of location-based,
socio-demographic, and register variation in CSW. In addition, a lack of
clarity on the data selection and filtering stages obscures the
representativeness of CSW data sets. We conclude by providing a short
check-list to improve representativeness in forthcoming studies involving
CSW data collection and preparation.
Comment: Accepted for EMNLP'23 Findings (to appear in the EMNLP'23 Proceedings)
MASRI-HEADSET: A Maltese Corpus for Speech Recognition
Maltese, the national language of Malta, is spoken by approximately 500,000
people. Speech processing for Maltese is still in its early stages of
development. In this paper, we present the first spoken Maltese corpus designed
purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was
developed by the MASRI project at the University of Malta. It consists of 8
hours of speech paired with text, recorded from short text snippets read in a
laboratory environment. The speakers were recruited from different geographical
locations all over the Maltese islands, and were roughly evenly distributed by
gender. This paper also presents some initial results achieved in baseline
experiments for Maltese ASR using Sphinx and Kaldi. The MASRI-HEADSET Corpus is
publicly available for research and academic purposes.
Comment: 8 pages, 2 figures, 4 tables, 1 appendix. Appears in Proceedings of
the 12th edition of the Language Resources and Evaluation Conference
(LREC'20)