
    NusaCrowd: Open Source Initiative for Indonesian NLP Resources

    We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.

    NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

    Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our empirical results with existing multilingual large language models indicate the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes

    Designing a Collaborative Process to Create Bilingual Dictionaries of Indonesian Ethnic Languages

    The constraint-based approach has been proven useful for inducing bilingual dictionaries for closely related low-resource languages. When we want to create multiple bilingual dictionaries linking several languages, we need to consider manual creation by a native speaker if no machine-readable dictionaries are available as input. Because planning the creation of bilingual dictionaries requires weighing various methods and costs, plan optimization is essential. Utilizing both the constraint-based approach and a plan optimizer, we design a collaborative process for creating 10 bilingual dictionaries from every combination of 5 languages, i.e., Indonesian, Malay, Minangkabau, Javanese, and Sundanese. We further design an online collaborative dictionary generation process to bridge the spatial gap between native speakers. We define a heuristic plan that only utilizes manual investment by the native speaker to evaluate our optimal plan, with total cost as the evaluation metric. The optimal plan outperformed the heuristic plan with a 63.3% cost reduction.
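
    The count of 10 dictionaries follows from choosing 2 of the 5 languages, C(5, 2) = 10. The short sketch below only illustrates that arithmetic; the variable names are hypothetical, and it assumes one dictionary per unordered language pair. It is not the authors' planning method.

```python
from itertools import combinations

# The five languages named in the abstract.
languages = ["Indonesian", "Malay", "Minangkabau", "Javanese", "Sundanese"]

# Assumption: one bilingual dictionary per unordered pair of distinct languages.
dictionary_pairs = list(combinations(languages, 2))

for first, second in dictionary_pairs:
    print(f"{first} - {second}")

# Number of pairs, C(5, 2).
print(len(dictionary_pairs))
```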

    Social actors in an Intercultural Communication classroom: A discursive lens of intercultural education

    This study focused on how teachers and students, as the social actors in an Intercultural Communication (IC) classroom, were represented discursively. A video recording transcript of IC classroom activities at a state university in Indonesia was selected as the data source. The data source was rigorously analysed through van Leeuwen’s socio-semantic inventory of social actors framework (Van Leeuwen, 1996). The main findings show that social actors in the IC classroom can be categorised into two main thematic representations, namely positive and negative ones. The analysis disclosed that Hamzah, as the representative of classroom presenters, was represented as a victimised, oppressed, intimidated and minoritised actor. Hamzah’s Mathematics teacher was depicted as an intolerant, dehumanising, discriminatory and oppressing actor. Hamzah’s Social Sciences teacher was illustrated as a racial, stereotyping, dominant and provoking actor. The Intercultural Communication teacher was delineated as the actor endeavoring to encourage his students to be tolerant, critical, supportive and open-minded people. Hamzah’s classmates in the IC classroom were characterised as sympathetic, supportive, friendly and reactionary actors.

    Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages

    This paper provides an overview of our Automatic Speech Recognition (ASR) systems for Southeast Asian languages. As little existing work has been carried out on such regional languages, several difficulties must be addressed before building the systems: limited speech and text resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia and Thai as examples to illustrate the strategies for collecting the various resources required for building ASR systems. Comment: Published by the 2017 IEEE International Conference on Orange Technologies (ICOT 2017)