NusaCrowd: Open Source Initiative for Indonesian NLP Resources
We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
Democratizing access to natural language processing (NLP) technology is
crucial, especially for underrepresented and extremely low-resource languages.
Previous research has focused on developing labeled and unlabeled corpora for
these languages through online scraping and document translation. While these
methods have proven effective and cost-efficient, we have identified
limitations in the resulting corpora, including a lack of lexical diversity and
cultural relevance to local communities. To address this gap, we conduct a case
study on Indonesian local languages. We compare the effectiveness of online
scraping, human translation, and paragraph writing by native speakers in
constructing datasets. Our findings demonstrate that datasets generated through
paragraph writing by native speakers exhibit superior quality in terms of
lexical diversity and cultural content. In addition, we present the
NusaWrites benchmark, encompassing 12 underrepresented and extremely
low-resource languages spoken by millions of individuals in Indonesia. Our
empirical results with existing multilingual large language models
indicate the need to extend these models to more underrepresented languages. We
release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes
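The abstract above does not state how lexical diversity was measured. As a rough illustration only, one common proxy is the type-token ratio (unique words divided by total words). The function name and the whitespace tokenization below are assumptions made for this sketch, not the authors' method.

```python
def type_token_ratio(text: str) -> float:
    """Return unique tokens / total tokens for a whitespace-tokenized text.

    Assumption: lowercased whitespace tokens stand in for words; a real
    study would likely use a language-aware tokenizer.
    """
    tokens = text.lower().split()
    if not tokens:
        # Avoid division by zero on empty input.
        return 0.0
    return len(set(tokens)) / len(tokens)


# A repetitive text scores lower than a varied one.
repetitive = type_token_ratio("a b a b")
varied = type_token_ratio("a b c d")
print(repetitive, varied)
```

Because the ratio falls as texts get longer, comparisons between corpora are only meaningful on samples of similar length.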
Designing a Collaborative Process to Create Bilingual Dictionaries of Indonesian Ethnic Languages
The constraint-based approach has proven useful for inducing bilingual dictionaries for closely related low-resource languages.
When we want to create multiple bilingual dictionaries linking several languages, we need to consider manual creation by a native
speaker if no machine-readable dictionaries are available as input. Because planning the creation of bilingual dictionaries
requires weighing various methods and costs, plan optimization is essential. Utilizing both the constraint-based
approach and a plan optimizer, we design a collaborative process for creating 10 bilingual dictionaries from every combination of 5
languages, i.e., Indonesian, Malay, Minangkabau, Javanese, and Sundanese. We further design an online collaborative dictionary
generation process to bridge the spatial gap between native speakers. We define a heuristic plan that relies only on manual investment by native
speakers, and use it to evaluate our optimal plan with total cost as the evaluation metric. The optimal plan outperformed the heuristic plan with a
63.3% cost reduction.
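The figure of 10 dictionaries follows from choosing every unordered pair among the 5 languages. The sketch below checks that count and shows how a percentage cost reduction is computed. The helper names and the example cost values are hypothetical and were chosen only to illustrate the arithmetic; they are not taken from the paper.

```python
from itertools import combinations

LANGUAGES = ["Indonesian", "Malay", "Minangkabau", "Javanese", "Sundanese"]


def language_pairs(languages):
    """Return every unordered pair of languages (one bilingual dictionary each)."""
    return list(combinations(languages, 2))


def cost_reduction(baseline_cost, optimized_cost):
    """Return the percentage by which optimized_cost undercuts baseline_cost."""
    return (baseline_cost - optimized_cost) / baseline_cost * 100


pairs = language_pairs(LANGUAGES)
print(len(pairs))  # 5 choose 2 = 10 dictionaries

# Hypothetical costs (not from the paper), picked to show the formula.
print(round(cost_reduction(1000, 367), 1))
```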
Social actors in an Intercultural Communication classroom: A discursive lens of intercultural education
This study focused on how teachers and students, as the social actors in an Intercultural Communication (IC) classroom, were represented discursively. A video recording transcript of IC classroom activities at a state university in Indonesia was selected as the data source. The data source was rigorously analysed through van Leeuwen’s socio-semantic inventory of social actors framework (Van Leeuwen, 1996). The main findings show that social actors in the IC classroom can be categorised into two main thematic representations, namely positive and negative ones. The analysis disclosed that Hamzah, as the representative of classroom presenters, was represented as a victimised, oppressed, intimidated and minoritised actor. Hamzah’s Mathematics teacher was depicted as an intolerant, dehumanising, discriminatory and oppressing actor. Hamzah’s Social Sciences teacher was illustrated as a racial, stereotyping, dominant and provoking actor. The Intercultural Communication teacher was delineated as the actor endeavouring to encourage his students to be tolerant, critical, supportive and open-minded people. Hamzah’s classmates in the IC classroom were characterised as sympathetic, supportive, friendly and reactionary actors
Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages
This paper provides an overall introduction of our Automatic Speech
Recognition (ASR) systems for Southeast Asian languages. As not much existing
work has been carried out on such regional languages, a few difficulties should
be addressed before building the systems: limitation on speech and text
resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia
and Thai as examples to illustrate the strategies of collecting various
resources required for building ASR systems.
Comment: Published by the 2017 IEEE International Conference on Orange Technologies (ICOT 2017)