NusaCrowd: Open Source Initiative for Indonesian NLP Resources

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.

arXiv.org e-Print Archive

NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

Author: Adhista Dea
Aji Alham Fikri
Akbar Salsabil Maulana
Cahyawijaya Samuel
Cenggoro Tjeng Wawan
Dave Emmanuel
Fung Pascale
Koto Fajri
Lee Jhonson
Linuwih Hanung Wahyuning
Lovenia Holy
Moeljadi David
Muridan Galih Pradipta
Oktavianti Sarah
Purwarianti Ayu
Shadieq Nuur
Wilie Bryan
Winata Genta Indra
Publication date: 19/09/2023

Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our experiments with existing multilingual large language models indicate the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes

arXiv.org e-Print Archive

Designing a Collaborative Process to Create Bilingual Dictionaries of Indonesian Ethnic Languages

Author: Ishida Toru
Murakami Yohei
Nasution Arbi Haza
Publication date: 01/01/2018

The constraint-based approach has proven useful for inducing bilingual dictionaries for closely related low-resource languages. When we want to create multiple bilingual dictionaries linking several languages, we need to consider manual creation by native speakers if no machine-readable dictionaries are available as input. Because planning the creation of bilingual dictionaries requires weighing various methods and costs, plan optimization is essential. Utilizing both the constraint-based approach and a plan optimizer, we design a collaborative process for creating 10 bilingual dictionaries from every combination of 5 languages, i.e., Indonesian, Malay, Minangkabau, Javanese, and Sundanese. We further design an online collaborative dictionary generation process to bridge the spatial gap between native speakers. To evaluate our optimal plan, we define a heuristic plan that relies only on manual investment by native speakers, with total cost as the evaluation metric. The optimal plan outperformed the heuristic plan with a 63.3% cost reduction.

Repository Universitas Islam Riau

Social actors in an Intercultural Communication classroom: A discursive lens of intercultural education

Author: Abdullah Fuad
Lulita -
Publication venue: 'Atma Jaya Catholic University of Indonesia'
Publication date: 31/05/2018

This study focused on how teachers and students, as the social actors in an Intercultural Communication (IC) classroom, were represented discursively. A video recording transcript of IC classroom activities at a state university in Indonesia was selected as the data source. The data source was rigorously analysed through van Leeuwen’s socio-semantic inventory of social actors framework (Van Leeuwen, 1996). The main findings show that social actors in the IC classroom can be categorised into two main thematic representations, namely positive and negative ones. The analysis disclosed that Hamzah, as the representative of classroom presenters, was represented as a victimised, oppressed, intimidated and minoritised actor. Hamzah’s Mathematics teacher was depicted as an intolerant, dehumanising, discriminatory and oppressing actor. Hamzah’s Social Sciences teacher was illustrated as a racial, stereotyping, dominant and provoking actor. The Intercultural Communication teacher was delineated as the actor endeavoring to encourage his students to be tolerant, critical, supportive and open-minded people. Hamzah’s classmates in the IC classroom were characterised as sympathetic, supportive, friendly and reactionary actors.

Indonesian JELT

eJournal Unika Atma Jaya (Universitas Katolik Indonesia)

Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages

Author: Leung Cheung Chi
Ma Bin
Ni Chongjia
Sivadas Sunil
Tong Rong
Wang Lei
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/10/2022

This paper provides an overview of our Automatic Speech Recognition (ASR) systems for Southeast Asian languages. As little existing work has been carried out on such regional languages, a few difficulties must be addressed before building the systems: limited speech and text resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia and Thai as examples to illustrate the strategies for collecting the various resources required for building ASR systems. Comment: Published by the 2017 IEEE International Conference on Orange Technologies (ICOT 2017)

arXiv.org e-Print Archive