ERRORS BY AUTO-MORPHOLOGICAL ANALYSIS IN A CHILDREN STORY CORPUS: AN EVALUATION OF MORPHIND PROGRAM

Author: Alfiani Noveka Erviana Nur
Publication date: 01/01/2017

The Indonesian morphological tool Morphind is meant to produce a proper morphological analysis before further automatic language processing. Morphind is applied to enrich raw Indonesian text with morphological information, as the preprocessing stage of an Indonesian corpus. In this study, the data were obtained from children's stories on the website ceritaanak.org by taking 500 types out of a total of 2101 types. The purpose of this study is to identify and classify the types of errors that occur when data are processed with the Morphind program. In the analysis, I use the introspective method and the Indonesian dictionary (KBBI) to validate the results. The findings suggest that many aspects of Morphind can still be improved. The recommendations are to fix the database, especially for OOV (out-of-vocabulary) words and dictionary accuracy, to improve the display of allomorphs, and to improve the algorithm for morpheme segmentation.

Neliti

Diponegoro University Institutional Repository

Samawa Part of Speech Tagging using Brill Tagger

Author: Aida Saori
Hariyanti Trienani
Kameda Hiroyuki
Publication venue: Talenta Publisher
Publication date: 31/07/2019

Ethnologue cites 7,097 living languages in the world. Most of them, however, do not exist on the Internet as objects of research, which indicates a gap in language resources. One of them is the Samawa language, which has over 500,000 native speakers and is identified as an endangered language by UNESCO. What we know about Samawa so far is that information, tools, and resources to maintain its sustainability are lacking. This paper aims to contribute to NLP, a growing field of research, by exploring the Samawa part-of-speech tagging problem using a rule-based approach, i.e. the Brill tagger. It was trained on a very limited Samawa corpus of 24,627 tokens, including punctuation marks, with 24 tags from our original tagset. K-fold cross-validation (k = 5 and k = 10) was applied to compare Brill's performance with Unigram, HMM, and TnT. The Brill tagger, with a combination of default tagger, Unigram, Bigram, and Trigram as the baseline tagger, achieves higher accuracy than the others, at over 95%. This suggests that the Brill tagger can be used to extend the Samawa corpus automatically.
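
The evaluation setup described in this abstract (k-fold cross-validation of a tagger that backs off to a default tag) can be sketched roughly as follows. This is only an illustration, not the authors' code: it uses a plain unigram tagger rather than the Brill tagger, and the toy sentences, tag names, and function names are all invented for the example.

```python
from collections import Counter, defaultdict

def train_unigram(train_sents):
    """Learn each word's most frequent tag, plus the overall most frequent tag."""
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sent in train_sents:
        for word, tag in sent:
            word_tags[word][tag] += 1
            all_tags[tag] += 1
    model = {word: counts.most_common(1)[0][0] for word, counts in word_tags.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return model, default_tag

def accuracy(model, default_tag, test_sents):
    """Token-level accuracy; unseen words back off to the default tag."""
    correct = total = 0
    for sent in test_sents:
        for word, gold in sent:
            predicted = model.get(word, default_tag)
            correct += predicted == gold
            total += 1
    return correct / total if total else 0.0

def k_fold_accuracy(sents, k):
    """Mean accuracy over k folds; fold i holds every k-th sentence starting at i."""
    folds = [sents[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model, default_tag = train_unigram(train)
        scores.append(accuracy(model, default_tag, folds[i]))
    return sum(scores) / k

# Invented toy corpus (English words, made-up tags); NOT Samawa data.
TOY = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("bird", "NOUN"), ("sings", "VERB")],
    [("a", "DET"), ("bird", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sings", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sings", "VERB")],
]

if __name__ == "__main__":
    for k in (5, 10):
        print(f"k={k}: mean accuracy = {k_fold_accuracy(TOY, k):.3f}")
```

A real comparison along the lines of the abstract would swap the unigram model for each tagger under test (Brill, HMM, TnT) while keeping the fold splits fixed, so that every tagger is scored on the same held-out sentences.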

Talenta Publisher (E-Journals, Universitas Sumatera Utara)

Linguistic studies using large annotated corpora: Introduction

Author: Moeljadi David
Nomoto Hiroki
Publication venue: Research Institute for Languages and Cultures of Asia and Africa (ILCAA), Tokyo University of Foreign Studies
Publication date: 30/09/2019

Prometheus-Academic Collections

Building Cendana: a Treebank for Informal Indonesian

Author: Goswami Debaditya
Kurniawan Aditya
Moeljadi David
Publication venue: Waseda Institute for the Study of Language and Information
Publication date: 01/01/2019

conference paper

Waseda University Repository

A Novel Part-of-Speech Set Developing Method for Statistical Machine Translation

Author: Akhmad Arman Arry
Kuspriyanto Kuspriyanto
Purwarianti Ayu
Sujaini Herry
Publication venue: Universitas Ahmad Dahlan
Publication date: 01/09/2014

Part of speech (PoS) is one of the features that can be used to improve the quality of statistical machine translation. Typically, a language's PoS set is determined from the grammar of that language or adopted from the PoS set of another language. This work aims to formulate a model for developing a PoS set automatically, as a linguistic factor that improves the quality of machine translation. The research method uses a word similarity approach, in which we cluster the words contained in a corpus. The resulting classes are then defined as the PoS set for the given language. We evaluated the computationally defined PoS set with the MOSES machine translation system, comparing its output against that of an SMT system using manually generated PoS sets; the systems were assessed with the BLEU method. The evaluation uses English as the source language and Indonesian as the target language.

Journal of Education and Learning (EduLearn)

TELKOMNIKA (Telecommunication Computing Electronics and Control)

UAD Journal Management System

Indonesian Language Term Extraction using Multi-Task Neural Network

Author: Ferdinandus Fransiskus Xaverius
Gunawan Gunawan
Hernandez Leonel
Santoso Joan
Setiawan Esther Irawati
Publication venue: State University of Malang (UM)
Publication date: 01/12/2022

The rapidly expanding size of data makes it difficult to extract information and store it as computerized knowledge. Relation extraction and term extraction play a crucial role in resolving this issue. Automatically finding a concealed relationship between terms that appear in the text can help people build computer-based knowledge more quickly. Term extraction is required as one of the components because identifying terms that play a significant role in the text is the essential step before determining their relationship. We propose an end-to-end system capable of extracting terms from text to address this issue for the Indonesian language. Our method combines two multilayer perceptron neural networks to perform Part-of-Speech (PoS) labeling and Noun Phrase Chunking. Our models were trained as a joint model to solve this problem. Our proposed method, with an f-score of 86.80%, can be considered a state-of-the-art algorithm for performing term extraction in the Indonesian language using noun phrase chunking.

Portal Jurnal Elektronik Universitas Negeri Malang

Directory of Open Access Journals

Rule-based Reordering and Post-Processing for Indonesian-Korean Statistical Machine Translation

Author: Lestari Dessi Puji
Mawalim Candy Olivia
Purwarianti Ayu
Publication venue: the National University (Philippines)
Publication date: 01/01/2017

Waseda University Repository

Universal Dependencies Parsing for Colloquial Singaporean English

Author: Chan GuangYong Leonard
Chieu Hai Leong
Wang Hongmin
Yang Jie
Zhang Yue
Publication date: 01/01/2017

Singlish can be interesting to the ACL community both linguistically, as a major creole based on English, and computationally, for information extraction and sentiment analysis of regional social media. We investigate dependency parsing of Singlish by constructing a dependency treebank under the Universal Dependencies scheme, and then training a neural network model by integrating English syntactic knowledge into a state-of-the-art parser trained on the Singlish treebank. Results show that English knowledge can lead to a 25% relative error reduction, resulting in a parser with 84.47% accuracy. To the best of our knowledge, we are the first to use neural stacking to improve cross-lingual dependency parsing on low-resource languages. We make both our annotation and parser available for further research. Comment: Accepted by ACL 201
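
For readers unfamiliar with the term, relative error reduction compares error rates rather than accuracies. The sketch below shows the arithmetic under the assumption that the 25% figure and the 84.47% figure refer to the same accuracy metric. The abstract does not report a baseline, so the value computed here is only what those two numbers would imply, not a result from the paper. The function names are made up for this example.

```python
def relative_error_reduction(baseline_accuracy, final_accuracy):
    """Fraction of the baseline's errors that the improved system removes."""
    baseline_error = 1.0 - baseline_accuracy
    final_error = 1.0 - final_accuracy
    return (baseline_error - final_error) / baseline_error

def implied_baseline_accuracy(final_accuracy, reduction):
    """Invert the definition: which baseline accuracy gives this reduction?"""
    final_error = 1.0 - final_accuracy
    baseline_error = final_error / (1.0 - reduction)
    return 1.0 - baseline_error

if __name__ == "__main__":
    # Figures quoted in the abstract: 84.47% accuracy after a 25% relative error reduction.
    implied = implied_baseline_accuracy(0.8447, 0.25)
    print(f"implied baseline accuracy: {implied * 100:.2f}%")
```

Under that assumption the two quoted figures would correspond to a baseline of roughly 79.3% accuracy, i.e. an absolute gain of about five points.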

arXiv.org e-Print Archive

Crossref