    Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation

    Multilingualism is widespread around the world, and code-switching (CSW) is a common practice across many language pairs and tuples in different locations and regions. However, despite recent advances in Massive Multilingual Language Models (MMLMs), there has been little progress in building successful CSW systems. We investigate the reasons behind this setback through a critical study of 68 existing CSW data sets across language pairs, focusing on their collection and preparation (e.g. transcription and annotation) stages. This in-depth analysis reveals that a) most CSW data involves English, ignoring other language pairs and tuples, and b) there are flaws in representativeness at the data collection and preparation stages because location-based, socio-demographic, and register variation in CSW is ignored. In addition, a lack of clarity about the data selection and filtering stages further obscures the representativeness of CSW data sets. We conclude by providing a short checklist to improve representativeness in forthcoming studies involving CSW data collection and preparation. Comment: Accepted for EMNLP'23 Findings (to appear in the EMNLP'23 Proceedings).

    Marathi-English Code-mixed Text Generation

    Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings, yielding hybrid languages like Hinglish and Minglish. Marathi, India's third most spoken language, often integrates English for precision and formality. Developing code-mixed language systems, such as Marathi-English (Minglish), faces resource constraints. This research introduces a Marathi-English code-mixed text generation algorithm, assessed with the Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics. Across 2987 code-mixed questions, it achieved an average CMI of 0.2 and an average DCM of 7.4, indicating effective and comprehensible code-mixed sentences. These results offer potential for enhanced NLP tools, bridging linguistic gaps in multilingual societies.
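    The abstract reports CMI scores but does not spell out the computation. The Python sketch below illustrates one common formulation of the Code Mixing Index, CMI = (n - u - max_i w_i) / (n - u), where n is the number of tokens, u the number of language-independent tokens, and w_i the number of tokens in language i, reported on a 0-1 scale to match the 0.2 figure above. The token-level tags, the "univ" label, and the example sentence are assumptions for illustration, not details from the paper, and the DCM metric is not reproduced here.

    ```python
    from collections import Counter

    def code_mixing_index(lang_tags):
        """Sentence-level Code Mixing Index (CMI) on a 0-1 scale.

        `lang_tags` is a list of per-token language labels, e.g.
        ["mr", "en", "mr", "univ"], where "univ" marks language-independent
        tokens (named entities, numbers, punctuation).  Returns 0.0 for
        empty or fully language-independent input.
        """
        n = len(lang_tags)
        u = sum(1 for tag in lang_tags if tag == "univ")
        counts = Counter(tag for tag in lang_tags if tag != "univ")
        if n == 0 or n == u:
            return 0.0
        return (n - u - max(counts.values())) / (n - u)

    # Hypothetical Minglish sentence: "mala ek meeting schedule karaychi aahe"
    tags = ["mr", "mr", "en", "en", "mr", "mr"]
    print(round(code_mixing_index(tags), 2))  # 0.33 for this toy example
    ```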

    Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien

    In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often have scarce resources and lack an official writing system, which limits the development of dialect CM research. In this paper, we propose a method to construct a Hokkien-Mandarin CM dataset that mitigates this limitation, addresses morphological issues within the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use the proposed dataset and transfer learning to train XLM (a cross-lingual language model) for translation tasks, adapting it slightly to fit the code-mixing scenario. We found that by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality. Comment: Accepted for EMNLP 2022 Findings.
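    The abstract notes that language tags help the adapted XLM handle code-mixed input but does not describe the tagging scheme. The sketch below is an illustrative assumption rather than the authors' pipeline: it assigns a per-token language tag to an already-segmented Hokkien-Mandarin sentence using small lexicons, the kind of auxiliary signal a language-embedding layer could consume. The lexicons, tag names ("nan", "zh", "shared"), and the example segmentation are all hypothetical.

    ```python
    # Illustrative sketch only: token-level language tagging for a
    # code-mixed Hokkien-Mandarin sentence.  The lexicons below are toy
    # placeholders; a real system would rely on a linguistics-based toolkit
    # for segmentation and much larger dictionaries.
    HOKKIEN_LEXICON = {"歹勢", "呷飯", "佮"}           # hypothetical Hokkien-only entries
    MANDARIN_LEXICON = {"我們", "今天", "一起", "嗎"}   # hypothetical Mandarin-only entries

    def tag_tokens(tokens):
        """Return (token, tag) pairs with tags "nan", "zh", or "shared"."""
        tagged = []
        for tok in tokens:
            if tok in HOKKIEN_LEXICON:
                tagged.append((tok, "nan"))      # Taiwanese Hokkien
            elif tok in MANDARIN_LEXICON:
                tagged.append((tok, "zh"))       # Mandarin
            else:
                tagged.append((tok, "shared"))   # ambiguous / shared vocabulary
        return tagged

    # Pre-segmented toy sentence mixing Hokkien and Mandarin tokens.
    segmented = ["我們", "今天", "佮", "朋友", "呷飯", "嗎"]
    print(tag_tokens(segmented))
    ```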