719 research outputs found
Crowdsourcing in developing repository of phrase definition in Bahasa Indonesia
A language repository is valuable as a reference for using a language, for preserving it, and for developing and implementing natural language processing algorithms. Bahasa Indonesia is a natural language that has hardly any such repository, despite its large number of speakers and previous attempts to build one. We devised a way to develop a repository of phrase definitions in Bahasa Indonesia using a form of crowdsourcing and investigated its implementation. An add-on was inserted into an information system that manages the final-year projects of undergraduate students. The add-on invites students to participate in writing keyword definitions and validating them. An investigation over a period of six months reveals that about 25% of application users take part in these voluntary activities as definition writers, validators, or both. During the period, about 1200 phrase definitions were added to the repository, and on average each definition was validated by two participants. The activity is supported by users who are well aware of the tasks and have a positive perception of the work, despite the different reasons that motivate their contributions
WatsaQ: Repository of Al Hadith in Bahasa (Case Study: Hadith Bukhari)
The Hadith is one of the two sources of Islamic law, after the Qur'an. A number of false hadith exist, a fact recognised by Muslim scholars since the end of the first century of Hijra, and even earlier. Beyond the many false hadith now circulating among the public, it is difficult to determine the source and authenticity of a hadith and to distinguish false hadith from genuine ones, in part because the genuine documents are in Arabic. To address this, the authors have built a repository of the hadith of al-Bukhari in the Indonesian language. The hadith chosen are of established authenticity, and standardisation has been applied to assist users in learning their content. The repository of Bahasa translations of the Bukhari hadith was implemented using an XML schema. To study the repository's performance, we use a PHP web front end that employs a brute-force string-matching algorithm to display search results for keywords entered by the user. Our analysis shows that the average search time of the proposed repository implementation is 0.85 milliseconds faster than that of the unstructured repository
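The brute-force keyword search over an XML repository described in this abstract can be sketched roughly as follows. This is a minimal illustration in Python rather than the authors' PHP implementation. The element names (`hadiths`, `hadith`, `text`), the `number` attribute, and the sample entries are all assumptions made for the example, since the abstract does not publish its schema, and the entry bodies are placeholder strings, not actual hadith translations.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the abstract does not give the real XML schema.
# Entry bodies are placeholder text, not actual hadith translations.
SAMPLE_XML = """<hadiths>
  <hadith number="1"><text>contoh teks tentang niat dan amal</text></hadith>
  <hadith number="2"><text>contoh teks tentang ilmu</text></hadith>
  <hadith number="3"><text>contoh lain tentang amal</text></hadith>
</hadiths>"""


def brute_force_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Naive matching: try every start position and compare character by character.
    """
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i
    return -1


def search(xml_source, keyword):
    """Return the 'number' attribute of every entry whose text contains keyword."""
    root = ET.fromstring(xml_source)
    keyword = keyword.lower()
    hits = []
    for entry in root.iter("hadith"):
        body = (entry.findtext("text") or "").lower()
        if brute_force_find(body, keyword) != -1:
            hits.append(entry.get("number"))
    return hits
```

Calling `search(SAMPLE_XML, "amal")` scans each entry in document order and collects the identifiers of those that contain the keyword. The naive matcher costs O(n·m) per entry in the worst case, which is usually acceptable for short texts.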
ANNOTATED DISJUNCT FOR MACHINE TRANSLATION
Most information found on the Internet is available in English. However,
most people in the world are non-English speakers. Hence, it would be of great advantage
to have a reliable Machine Translation tool for those people. There are many
approaches for developing Machine Translation (MT) systems, including
direct, rule-based/transfer, interlingua, and statistical approaches. This thesis focuses
on developing an MT system for less-resourced languages, i.e. languages that lack an
available grammar formalism, parser, and corpus, such as some languages in South
East Asia. The nonexistence of bilingual corpora motivates us to use direct or transfer
approaches. Moreover, the unavailability of a grammar formalism and parser for the
target languages motivates us to develop a hybrid between direct and transfer
approaches. This hybrid approach is referred to as a hybrid transfer approach. This
approach uses the Annotated Disjunct (ADJ) method. This method, based on Link
Grammar (LG) formalism, can theoretically handle one-to-one, many-to-one, and
many-to-many word(s) translations. This method consists of a transfer rules module,
which maps source words in a source sentence (SS) into target words in the correct
position in a target sentence (TS). The developed transfer rules are demonstrated on
English → Indonesian translation tasks. An experimental evaluation is conducted to
measure the performance of the developed system against available English-Indonesian
MT systems. The developed ADJ-based MT system translated simple, compound, and
complex English sentences in present, present continuous, present perfect, past, past
perfect, and future tenses with better precision than other systems, with an accuracy
of 71.17% on the Subjective Sentence Error Rate metric
Establishing a COVID-19 lemmatized word list for journalists and ESP learners
The aim of this research is two-fold: first, to explore the most frequent COVID-19-inspired words in medical news reporting contexts, and second, to classify them into different categories. This paper adopts a corpus-based approach to build a lemmatized academic word list (AWL) inspired by the COVID-19 pandemic. Factiva was used to retrieve the pandemic-related articles published in News Rx from January 1 to October 31, 2020. A corpus of 18,249,093 words was compiled. The corpus linguistic software program Wordsmith (WS-6) (Scott, 2012) was used to generate a word list based on the compiled corpus. After compiling, lemmatizing, and analyzing the AWL, six categories were identified, namely, acronyms and abbreviations, diseases, COVID-19, biology, medicine, and scientific disciplines, all of which are of essential use for media workers and for ESP learners of journalism, medicine, nursing, pharmacy, and allied health sciences. Building such a discipline-specific glossary will be of special pedagogical value for health journalists, textbook writers and curriculum designers, instructors, and ESP learners in the health sciences field. One of the major contributions of this research is establishing lemmas for a large AWL. This set can be utilized by news media workers, health communication specialists, and ESP learners. Lemmatization will ensure rapid dissemination of the word list and its integration in the linguistic system through derivation and other word-formation processes
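The general idea of a lemmatized frequency list, as used in this abstract, can be illustrated with a short Python sketch. It does not reproduce the Wordsmith (WS-6) workflow the authors used. The lemma table, the tokenizer pattern, and the function name `lemmatized_word_list` are assumptions made for the example; a real study would rely on a full lemma list rather than a hand-written dictionary.

```python
import re
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only.
LEMMA_MAP = {
    "vaccines": "vaccine",
    "infections": "infection",
    "reduced": "reduce",
    "rates": "rate",
}


def lemmatized_word_list(text, lemma_map, top_n=None):
    """Count word forms after folding each one onto its lemma.

    Returns (lemma, frequency) pairs, most frequent first.
    """
    # Lowercase, then keep alphanumeric tokens; internal hyphens
    # are preserved so that "covid-19" stays a single token.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    counts = Counter(lemma_map.get(token, token) for token in tokens)
    return counts.most_common(top_n)
```

For instance, `lemmatized_word_list("Vaccines reduce infections. The vaccine reduced infection rates.", LEMMA_MAP)` folds "vaccines" and "vaccine" into one entry, so inflected forms contribute to a single lemma count instead of being listed separately.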
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken
- …