Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching
While many speakers of low-resource languages regularly code-switch between
their languages and other regional languages or English, datasets of
code-switched speech are too small to train bespoke acoustic models from
scratch or to perform language-model rescoring. Here we propose finetuning self-supervised
speech representations such as wav2vec 2.0 XLSR to recognize code-switched
data. We find that finetuning self-supervised multilingual representations and
augmenting them with n-gram language models trained from transcripts reduces
absolute word error rates by up to 20% compared to baselines of hybrid models
trained from scratch on code-switched data. Our findings suggest that, in
circumstances with limited training data, finetuning self-supervised
representations is a viable and better-performing solution.
Comment: 5 pages, 1 figure. Computational Approaches to Linguistic
Code-Switching, CALCS 2023 (co-located with EMNLP 2023)
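The n-gram rescoring component mentioned in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the toy transcripts, the `bigram_logprob` helper, and the add-one smoothing choice are all illustrative assumptions, standing in for an LM trained on real code-switched transcripts.

```python
from collections import defaultdict
import math

# Toy code-switched transcripts; purely illustrative, not the paper's data.
transcripts = [
    "ek het die meeting gemis",
    "ek het die bus gemis",
    "die meeting was lank",
]

# Count unigrams and bigrams, with explicit sentence-boundary tokens.
unigrams = defaultdict(int)
bigrams = defaultdict(int)
for line in transcripts:
    tokens = ["<s>"] + line.split() + ["</s>"]
    for w in tokens:
        unigrams[w] += 1
    for a, b in zip(tokens, tokens[1:]):
        bigrams[(a, b)] += 1

vocab_size = len(unigrams)

def bigram_logprob(sentence: str) -> float:
    """Add-one-smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for a, b in zip(tokens, tokens[1:]):
        logp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return logp

# Rescoring idea: an in-domain ASR hypothesis outscores a garbled word order.
print(bigram_logprob("ek het die meeting gemis") >
      bigram_logprob("meeting die ek gemis het"))  # True
```

In rescoring, such scores would be combined with the acoustic model's scores to rerank an n-best list of hypotheses; production systems typically use a dedicated toolkit rather than hand-rolled counts.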
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
Multilingualism is widespread around the world and code-switching (CSW) is a
common practice among different language pairs/tuples across locations and
regions. However, there is still not much progress in building successful CSW
systems, despite the recent advances in Massive Multilingual Language Models
(MMLMs). We investigate the reasons behind this setback through a critical
study of 68 existing CSW data sets across language pairs, focusing on the
collection and preparation (e.g., transcription and annotation) stages. This
in-depth analysis reveals that (a) most CSW data involves English, neglecting
other language pairs/tuples, and (b) there are representativeness flaws in the
data collection and preparation stages stemming from neglect of location-based,
socio-demographic, and register variation in CSW. In addition, a lack of
clarity on the data selection and filtering stages obscures the
representativeness of CSW data sets. We conclude by providing a short
check-list to improve representativeness in forthcoming studies involving
CSW data collection and preparation.
Comment: Accepted for EMNLP'23 Findings (to appear in the EMNLP'23 Proceedings)
MASRI-HEADSET: A Maltese Corpus for Speech Recognition
Maltese, the national language of Malta, is spoken by approximately 500,000
people. Speech processing for Maltese is still in its early stages of
development. In this paper, we present the first spoken Maltese corpus designed
purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was
developed by the MASRI project at the University of Malta. It consists of 8
hours of speech paired with text, recorded from short text snippets read in a
laboratory environment. The speakers were recruited from different geographical
locations all over the Maltese islands, and were roughly evenly distributed by
gender. This paper also presents some initial results achieved in baseline
experiments for Maltese ASR using Sphinx and Kaldi. The MASRI-HEADSET Corpus is
publicly available for research and academic purposes.
Comment: 8 pages, 2 figures, 4 tables, 1 appendix. Appears in Proceedings of
the 12th edition of the Language Resources and Evaluation Conference
(LREC'20)