449 research outputs found
Topic identification using filtering and rule generation algorithm for textual document
Information stored digitally in text documents is seldom arranged according to specific topics. Having to read whole documents is time-consuming and reduces interest in searching for information. Most existing topic identification methods depend on the occurrence of terms in the text. However, not all frequently occurring terms are relevant. The term extraction phase in topic identification methods yields extracted terms that might have similar meanings, which is known as the synonymy problem. Filtering and rule generation algorithms are introduced in this study to identify topics in textual documents. The proposed filtering algorithm (PFA) extracts the most relevant terms from the text and solves the synonymy problem amongst the extracted terms. The rule generation algorithm (TopId) is proposed to identify the topic of each verse based on the extracted terms. The PFA processes and filters each sentence based on nouns and predefined keywords to produce suitable terms for the topic. Rules are then generated from the extracted terms using the rule-based classifier. An experiment was performed on 224 English-translated Quran verses related to female issues. Topics identified by both TopId and the Rough Set technique were compared and later verified by experts. PFA extracted more relevant terms than other filtering techniques. TopId identified topics that are closer to the experts' topics, with an accuracy of 70%. The proposed algorithms were able to extract relevant terms without losing important terms and to identify the topic of each verse.
Enhanced Affixation Word Stemmer with Stemming Error Reducer to Solve Affixation Stemming Errors
A word stemming algorithm (or word stemmer) is an important preprocessing component in information retrieval and text categorization that aims to reduce derived words to their respective root words. Most existing Malay word stemmers adopt a rule-based affix removal method and dictionary lookup to stem affixation words. Although many stemming approaches have been proposed in past research, existing Malay word stemmers still suffer from affixation stemming errors due to the complexity of Malay morphology. These stemming errors can be classified into overstemming, understemming, unstemmed words, and special variations and exceptions. Hence, this paper presents an enhanced affixation word stemmer that aims to solve these stemming errors. This paper also examines the root causes of these stemming errors in existing Malay stemmers. The experimental results indicate that the enhanced word stemmer is able to stem prefixation, suffixation, confixation and infixation words with better stemming accuracy by using an enhanced Rule Application Order and a Stemming Errors Reducer.
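To make the combination of rule-based affix removal and dictionary lookup concrete, here is a minimal sketch. It is not the stemmer described in the abstract: the affix lists, the root dictionary `ROOTS`, and the function name `stem` are illustrative assumptions, and real Malay morphology needs far more rules (spelling variations, infixes, rule ordering).

```python
# Toy affix-removal stemmer with dictionary lookup.
# The affix lists and root dictionary below are illustrative assumptions,
# not the rule set of any published Malay stemmer.
ROOTS = {"main", "makan", "ajar"}   # hypothetical root-word dictionary
PREFIXES = ["ber", "per", "di"]     # hypothetical prefix rules
SUFFIXES = ["an", "kan", "i"]       # hypothetical suffix rules


def stem(word: str) -> str:
    """Return the dictionary root of `word`, or `word` unchanged if none is found."""
    word = word.lower()
    if word in ROOTS:               # looking up first avoids over-stemming a root
        return word

    candidates = []
    for p in PREFIXES:              # prefix removal only
        if word.startswith(p):
            candidates.append(word[len(p):])
    for s in SUFFIXES:              # suffix removal only
        if word.endswith(s):
            candidates.append(word[:-len(s)])
    for p in PREFIXES:              # prefix and suffix together (confix)
        for s in SUFFIXES:
            if word.startswith(p) and word.endswith(s):
                candidates.append(word[len(p):-len(s)])

    for candidate in candidates:    # accept the first candidate found in the dictionary
        if candidate in ROOTS:
            return candidate
    return word                     # no rule produced a known root: leave unstemmed
```

In this sketch the dictionary check is what guards against over-stemming: a stripped form is accepted only if it is a known root, so a word such as `buku` passes through unchanged.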
Enhanced text stemmer for standard and non-standard word patterns in Malay texts
Text stemming is a useful language preprocessing tool in the fields of information retrieval, text classification and natural language processing. A text stemmer is a computer program that removes affixes, clitics and particles to obtain the root words from derived words. Over the past few years, a few text stemmers have been developed for the Malay language, but these text stemmers suffer from various stemming errors because of the difficulty of dealing with the complex morphological rules of the Malay language. These text stemmers were developed for stemming affixation words only, whereas the Malay language also has reduplication and compounding words. Furthermore, none of these text stemmers has been developed for stemming social media texts, which comprise non-standard derived words. Therefore, this research study aims to improve the existing text stemmers' capability of stemming affixation, reduplication and compounding words while minimising possible stemming errors. Moreover, this research study also aims to address the text stemming process for non-standard derived words on social media platforms by removing non-standard affixes, clitics and particles. This research study adopts a multiple text stemming approach that uses an affix removal method and dictionary lookup in a specific order to correctly stem standard and non-standard affixation, reduplication and compounding words in standard texts and social media texts. The proposed text stemmer is evaluated against various text documents using the direct evaluation method, and text classification is used as the indirect evaluation method to validate the effectiveness of the proposed enhanced text stemmer. In general, the proposed enhanced text stemmer outperforms the baseline text stemmer.
The proposed enhanced text stemmer achieves an average stemming accuracy of 98.7% on the standard texts and 73.7% on the social media texts. In the text classification applications, it achieves an average accuracy of 85% for sports news classification and 75% for illicit content classification. By comparison, the baseline text stemmer achieves an average stemming accuracy of 63.5% on the standard texts and is unable to stem non-standard derived words in the social media texts. The baseline text stemmer also performs poorly in sports news classification and illicit content classification, with average accuracies of 78% and 63% respectively. In short, the experimental results suggest that the proposed enhanced text stemmer has promising stemming accuracy on both standard texts and social media texts. It also influences the performance of the text classification applications.
Clustering bilingual documents using various clustering linkages coupled with different proximity measurement techniques
With the rich data on the web, a document clustering task for monolingual documents is insufficient to produce an efficient information retrieval system. Multilingual Document Clustering (MDC) has been introduced and is one of the most popular trends in the area of natural language processing (NLP). In this paper, the effects of applying different clustering linkages coupled with different proximity measurements on clustering bilingual Malay-English documents in parallel are investigated. Hierarchical Agglomerative Clustering (HAC) has been implemented and applied to clustering bilingual Malay-English documents. Several different linkages are used in the HAC method, including Single, Complete, Centroid and Average linkages. In addition, the cosine similarity and the extended Jaccard coefficient are applied in order to investigate which proximity measurement can best be coupled with the different types of clustering linkage used for clustering bilingual news articles written in English and Malay. The HAC method coupled with the average linkage can be considered to produce reasonable clustering results even though the average Davies-Bouldin Index (DBI) is a bit high. The study also shows that the extended Jaccard coefficient produces better clustering results than the cosine similarity.
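The two proximity measurements named in the abstract can be compared with a short sketch. This is an illustration using the standard textbook definitions, not the paper's implementation; the function names and the plain-list vector representation are assumptions.

```python
import math


def dot(x, y):
    """Dot product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """cos(x, y) = x.y / (|x| * |y|); returns 0.0 if either vector is all zeros."""
    denom = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / denom if denom else 0.0


def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto): x.y / (|x|^2 + |y|^2 - x.y); 0.0 for two zero vectors."""
    xy = dot(x, y)
    denom = dot(x, x) + dot(y, y) - xy
    return xy / denom if denom else 0.0
```

For the vectors `[1, 2, 0]` and `[2, 4, 0]`, which point in the same direction but differ in length, the cosine similarity is 1 while the extended Jaccard coefficient is 10 / 15, roughly 0.67. The cosine measure ignores vector length, whereas the extended Jaccard coefficient penalises it, which is one reason the two can lead to different clusterings.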
Ontological Approach for Semantic Modelling of Malay Translated Qur’an
This thesis contributes to the areas of ontology development and analysis, natural language processing (NLP), Information Retrieval (IR), and Language Resource and Corpus Development. Research in Natural Language Processing and semantic search for English has shown successful results for more than a decade. However, it is difficult to adapt those techniques to the Malay language, because its complex morphology and orthographic forms are very different from English. Moreover, limited resources and tools for computational linguistic analysis are available for Malay. In this thesis, we address those issues and challenges by proposing MyQOS, the Malay Qur’an Ontology System, a prototype ontology-based IR with semantics for representing and accessing a Malay translation of the Qur’an. This supports the development of a semantic search engine and a question answering system and provides a framework for storing and accessing a Malay language corpus and providing computational linguistics resources. The primary use of MyQOS in the current research is for creating and improving the quality and accuracy of the query mechanism to retrieve information embedded in the Malay text of the Qur’an translation. To demonstrate the feasibility of this approach, we describe a new architecture of morphological analysis for MyQOS and query algorithms based on MyQOS. Data analysis used two measures, precision and recall, computed on data from the MyQOS Corpus across three search engines. The precision and recall for semantic search are 0.8409 (84%) and 0.8043 (80%), higher than the results of the question-answer search, which are 0.4971 (50%) for precision and 0.6027 (60%) for recall. The semantic search gives high precision and high recall compared with the other two methods. This indicates that semantic search returns more relevant results than irrelevant ones.
To conclude, this research is among the studies on the retrieval of Qur’an texts in the Malay language that outline state-of-the-art information retrieval system models. The use of MyQOS will thus help Malay readers to understand the Qur’an better. Furthermore, the creation of a Malay language corpus and computational linguistics resources will benefit other researchers, especially in religious texts, morphological analysis, and semantic modelling.
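As a reminder of how the two reported measures are defined, the following sketch computes precision and recall for a single query. It uses the standard set-based definitions; the function name and the verse identifiers in the usage note are hypothetical and are not taken from MyQOS.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    An empty denominator yields 0.0 for that measure.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a query returns the hypothetical verses `v1` to `v4` and the relevant set is `v2`, `v4`, `v7`, two of the four retrieved verses are relevant (precision 0.5) and two of the three relevant verses are found (recall 2/3).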
ALGORITMA STEMMING BAHASA MALAYSIA
AUDINA SRI REZEKI (2021): ALGORITMA STEMMING BAHASA MALAYSIA (Malay Language Stemming Algorithm)
Malaysia is a neighbouring country whose language closely resembles Indonesian. Dictionaries list root words only. This makes it difficult to look up the meaning of an affixed word, because dictionary entries are ordered alphabetically by root word rather than by affix. A stemming algorithm is at the core of natural language processing techniques for effective and efficient information retrieval that is widely accepted by users. White-box testing on 50 affixed words ran successfully and produced the expected results. Accuracy testing on 500 affixed words was carried out with 6 combinations. Combination 1 achieved 95.2%, combination 2 achieved 92.2%, combination 3 achieved 95.2%, combination 4 achieved 53.6%, combination 5 achieved 54.4%, and combination 6 achieved 51.8%.
Keywords: Bahasa Malaysia (Malay language), Stemming Algorithm, Natural Language Processing, White Box, Accuracy
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Indonesian and Malay are underrepresented in the development of natural language processing (NLP) technologies, and available resources are difficult to find. A clear picture of existing work can invigorate and inform how researchers conceptualise worthwhile projects. Using an education sector project to motivate the study, we conducted a wide-ranging overview of Indonesian and Malay human language technologies and corpus work. We charted 657 included studies according to Hirschberg and Manning's 2015 description of NLP, concluding that the field was dominated by exploratory corpus work, machine reading of text gathered from the Internet, and sentiment analysis. In this paper, we identify the most published authors and research hubs, and make a number of recommendations to encourage future collaboration and efficiency within NLP in Indonesian and Malay.
Terms interrelationship query expansion to improve accuracy of Quran search
Quran retrieval systems are becoming an instrument for users to search for the information they need. Search engines are among the most popular tools that have been successfully implemented for retrieving verses relevant to queries. However, a major challenge for Quran search engines is word ambiguity, specifically lexical ambiguity. Even with the advent of query expansion techniques, Quran retrieval systems still have problems retrieving the information users need. Current semantic techniques still lack precision because they do not consider several semantic dictionaries. Therefore, this study proposes a stemmed terms interrelationship query expansion approach to improve Quran search results. More specifically, related terms were collected from different semantic dictionaries and then reduced to their roots using a stemming algorithm. To assess the performance of the stemmed terms interrelationship query expansion, experiments were conducted using eight Quran datasets from the Tanzil website. Overall, the results indicate that the stemmed terms interrelationship query expansion is superior to unstemmed terms interrelationship query expansion in Mean Average Precision, with Yusuf Ali 68%, Sarawar 67%, Arberry 72%, Malay 65%, Hausa 62%, Urdu 62%, Modern Arabic 60% and Classical Arabic 59%.
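Mean Average Precision, the metric reported above, can be sketched as follows. This uses the standard definition (average precision per query, averaged over queries) and is not the study's evaluation code; the function names and the `(ranked, relevant)` input format are assumptions, and each ranked list is assumed to contain no duplicates.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k where a relevant item appears.

    Divides by the total number of relevant items, so relevant items
    that are never retrieved lower the score.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average_precision over a list of (ranked, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking `["a", "b", "c", "d"]` with relevant items `a` and `c`, the average precision is (1/1 + 2/3) / 2, about 0.83. A query expansion method scores higher when it moves relevant verses nearer the top of the ranking, not merely when it retrieves them.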
Sentiment Analysis in Digital Spaces: An Overview of Reviews
Sentiment analysis (SA) is commonly applied to digital textual data,
revealing insight into opinions and feelings. Many systematic reviews have
summarized existing work, but often overlook discussions of validity and
scientific practices. Here, we present an overview of reviews, synthesizing 38
systematic reviews, containing 2,275 primary studies. We devise a bespoke
quality assessment framework designed to assess the rigor and quality of
systematic review methodologies and reporting standards. Our findings show
diverse applications and methods, limited reporting rigor, and challenges over
time. We discuss how future research and practitioners can address these issues
and highlight their importance across numerous applications. Comment: 44 pages, 4 figures, 6 tables, 3 appendices