
719 research outputs found

Developing an Online Indonesian Corpora Repository

Authors: Distiawan Bayu; Manurung Ruli; Putra Desmond Darma
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

Crowdsourcing in developing repository of phrase definition in Bahasa Indonesia

Authors: Ariyanto Gunawan; Pranoto Wawan Joko; Thamrin Husni; Yuliana Irma
Publication venue: Universitas Ahmad Dahlan
Publication date: 01/10/2019

A language repository is valuable as a reference for using the language, for preserving it, and for developing and implementing natural language processing algorithms. Bahasa Indonesia is a natural language that has hardly any repository, despite its large number of speakers and previous attempts to build one. We devised a way to develop a repository of phrase definitions in Bahasa Indonesia using a form of crowdsourcing and investigated its implementation. An application add-on was inserted into an information system that manages the final-year projects of undergraduate students. The add-on invites students to participate in writing keyword definitions and validating them. An investigation over a period of six months reveals that about 25% of application users take part in the voluntary activities as definition writers, validators, or both. During the period, about 1200 phrase definitions were added to the repository, and on average each definition was validated by two participants. The activity is supported by users who are well aware of the tasks and have a positive perception of the work, despite the different reasons that motivate their contributions.

Journal of Education and Learning (EduLearn)

TELKOMNIKA (Telecommunication Computing Electronics and Control)

UAD Journal Management System

WatsaQ: Repository of Al Hadith in Bahasa (Case Study: Hadith Bukhari)

Authors: Aulia Atqia; Bahaweres Rizal Broer; Hakiem Nashrul; Khairani Dewi
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/11/2017

The Hadith is one of the two sources of Islamic law, after the Qur'an. It is a fact that there are a number of false hadith, recognised by Muslim scholars since the end of the first century of Hijra, and even earlier. Given how widely false hadith circulate among the public today, it is difficult to determine the source of authenticity and to distinguish false hadith from genuine ones. This is partly because the genuine documents are written in Arabic. To that end, the authors have built a repository collection of hadith al-Bukhari in the Indonesian language. The hadith chosen have secured originality, and standardisation has been applied to assist users in learning the content of the hadith. The authors implemented a repository of Bukhari Hadith translated into Bahasa using an XML schema. To study the repository's performance, we use a web presentation written in PHP that employs a brute-force string-matching algorithm to display search results based on keywords entered by the user. Our analysis shows that the average search time of the proposed repository is 0.85 milliseconds faster than that of a repository based on unstructured data.
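The keyword search described in this abstract can be illustrated with a minimal sketch. It is written in Python rather than the PHP the authors mention, and the XML layout (`hadiths`, `hadith`, `text`, the `id` attribute), the placeholder entries, and the function names are all assumptions made for illustration; the abstract does not show the actual schema or code.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout with placeholder entries; the real schema is not
# given in the abstract.
SAMPLE_XML = """<hadiths>
  <hadith id="1"><text>placeholder entry about intention</text></hadith>
  <hadith id="2"><text>placeholder entry about sincere advice</text></hadith>
</hadiths>"""


def brute_force_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Tries every start position and compares character by character,
    which is the classic brute-force (naive) string-matching scheme.
    """
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        offset = 0
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:
            return start
    return -1


def search_hadiths(xml_source, keyword):
    """Return the ids of entries whose text contains the keyword (case-insensitive)."""
    root = ET.fromstring(xml_source)
    needle = keyword.lower()
    hits = []
    for node in root.iter("hadith"):
        body = node.findtext("text", default="")
        if brute_force_find(body.lower(), needle) != -1:
            hits.append(node.get("id"))
    return hits


if __name__ == "__main__":
    print(search_hadiths(SAMPLE_XML, "advice"))
```

Because each entry is a separate XML element, the matcher only scans the text of individual records, which is one plausible reason a structured repository could be searched faster than a single unstructured document.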

Proceeding of the Electrical Engineering Computer Science and Informatics

ANNOTATED DISJUNCT FOR MACHINE TRANSLATION

Author: Teguh Bharata Adji
Publication date: 01/05/2010

Most information found on the Internet is available in English. However, most people in the world are not English speakers. Hence, a reliable Machine Translation (MT) tool would be of great advantage to them. There are many approaches to developing MT systems, including direct, rule-based/transfer, interlingua, and statistical approaches. This thesis focuses on developing an MT system for less-resourced languages, i.e. languages that lack an available grammar formalism, parser, and corpus, such as some languages in South East Asia. The nonexistence of bilingual corpora motivates us to use direct or transfer approaches. Moreover, the unavailability of a grammar formalism and parser for the target languages motivates us to develop a hybrid of the direct and transfer approaches. This hybrid is referred to as a hybrid transfer approach. It uses the Annotated Disjunct (ADJ) method. This method, based on the Link Grammar (LG) formalism, can theoretically handle one-to-one, many-to-one, and many-to-many word translations. It consists of a transfer rules module that maps source words in a source sentence (SS) to target words in the correct position in a target sentence (TS). The developed transfer rules are demonstrated on English → Indonesian translation tasks. An experimental evaluation is conducted to compare the performance of the developed system against available English-Indonesian MT systems. The developed ADJ-based MT system translated simple, compound, and complex English sentences in the present, present continuous, present perfect, past, past perfect, and future tenses with better precision than the other systems, with an accuracy of 71.17% on the Subjective Sentence Error Rate metric.

UTPedia

Establishing a COVID-19 lemmatized word list for journalists and ESP learners

Authors: Al-Salman Saleh; Haider Ahmad S; Hussein Riyad F.; Odeh Iyad M.; Saed Hadeel A
Publication venue: Universitas Pendidikan Indonesia (UPI)
Publication date: 31/01/2022

The aim of this research is two-fold: first, to explore the most frequent COVID-19-inspired words in medical news reporting contexts, and second, to classify them into different categories. This paper adopts a corpus-based approach to build a lemmatized academic word list (AWL) inspired by the COVID-19 pandemic. Factiva was used to retrieve the pandemic-related articles published in News Rx from January 1 to October 31, 2020. A corpus totaling 18,249,093 words was compiled. The corpus linguistics software program Wordsmith (WS-6) (Scott, 2012) was used to generate a word list from the compiled corpus. After compiling, lemmatizing, and analyzing the AWL, six categories were identified, namely acronyms and abbreviations, diseases, COVID-19, biology, medicine, and scientific disciplines, all of which are of essential use for media workers and for ESP learners of journalism, medicine, nursing, pharmacy, and allied health sciences. Building such a discipline-specific glossary will be of special pedagogical value for health journalists, textbook writers and curriculum designers, instructors, and ESP learners in the health sciences field. One of the major contributions of this research is establishing lemmas for a large set of AWL items. This set can be utilized by news media workers, health communication specialists, and ESP learners. Lemmatization will ensure rapid dissemination of the word list and its integration into the linguistic system through derivation and other word-formation processes.
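The general idea of a lemmatized frequency list can be sketched as follows. This is not the WordSmith workflow the authors used; it is a minimal Python illustration in which the lemma table `LEMMAS`, the tokenizer pattern, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical hand-written lemma table; a real study would rely on a full
# lemma list or a dedicated lemmatizer.
LEMMAS = {
    "vaccines": "vaccine",
    "infections": "infection",
    "reduces": "reduce",
}


def lemmatized_word_list(text, lemmas=LEMMAS):
    """Return (lemma, frequency) pairs sorted from most to least frequent."""
    # Lowercase the text and keep hyphenated tokens such as "covid-19" whole.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    # Map each token to its lemma, falling back to the token itself.
    counts = Counter(lemmas.get(token, token) for token in tokens)
    return counts.most_common()


if __name__ == "__main__":
    sample = "Vaccines reduce infections. The vaccine reduces COVID-19 infection."
    for lemma, frequency in lemmatized_word_list(sample):
        print(lemma, frequency)
```

Grouping inflected forms under one lemma before counting is what lets a single entry such as "vaccine" stand for all of its surface forms in the resulting list.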

Indonesian Journal of Applied Linguistics

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken

arXiv.org e-Print Archive

Resources and benchmark corpora for hate speech detection: a systematic review

Authors: Basile Valerio; Bosco Cristina; Patti Viviana; Poletto Fabio; Sanguinetti Manuela
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2021

Institutional Research Information System University of Turin