
    Modification of a Stemming Algorithm Using a Non-Deterministic Approach to Indonesian Text

     Natural Language Processing (NLP) is a branch of Artificial Intelligence that focuses on language processing. One stage of NLP is preprocessing, which prepares data before it is processed. Preprocessing comprises many steps, one of which is stemming: the process of finding the root word of an inflected word. Errors in determining root words can cause misinformation. In addition, stemming does not always produce a single root word, because some Indonesian words have two possible interpretations, as a root word or as an affixed word, e.g. the word “beruang”. To handle these problems, this study proposes a stemmer that yields more accurate results by employing a non-deterministic algorithm that returns more than one candidate word. All rules are checked and the resulting words are kept in a candidate list. If several candidates are found, one result is then chosen. This stemmer has been tested on 15,934 words and achieved an accuracy of 93%. Therefore, the stemmer can be used to detect words with more than one root word.
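The candidate-list idea in this abstract can be sketched as follows; the toy dictionary, affix lists, and rules below are illustrative assumptions, not the paper's actual rule set:

```python
# Hypothetical sketch of candidate-list (non-deterministic) stemming.
ROOT_DICT = {"uang", "beruang", "ajar", "main"}   # toy root-word dictionary
PREFIXES = ["ber", "me", "di", "ke", "pe"]
SUFFIXES = ["kan", "an", "i"]

def stem_candidates(word):
    """Apply every applicable rule and keep all dictionary hits."""
    candidates = set()
    if word in ROOT_DICT:
        candidates.add(word)            # the word itself may already be a root
    for p in PREFIXES:
        if word.startswith(p) and word[len(p):] in ROOT_DICT:
            candidates.add(word[len(p):])
    for s in SUFFIXES:
        if word.endswith(s) and word[:-len(s)] in ROOT_DICT:
            candidates.add(word[:-len(s)])
    return candidates

# "beruang" is ambiguous: a root word ("bear") or ber- + "uang" ("money")
print(sorted(stem_candidates("beruang")))  # ['beruang', 'uang']
```

When several candidates survive, the stemmer described in the abstract selects one; that selection step is not detailed here.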

    Sundanese Stemming using Syllable Pattern

    Stemming is a technique for reducing a derived word to its root or base word. Stemming is widely used in data processing tasks such as word-index searching, translation, and information retrieval from documents in a database. In general, stemming uses the morphological pattern of a derived word to produce the original or root word. Previous research with this technique faced over-stemming and under-stemming problems. In this study, the stemming process is improved with a syllable (canonical) pattern based on Sundanese phonological rules. Stemming with syllable patterns achieves an accuracy of 89%, and execution on the test data recovers 95% of all base words. This simple algorithm has the advantage of adjusting the position of the syllable pattern to the word being stemmed. Given some data-quality constraints (typos, loanwords, and words that are non-deterministic with respect to the syllable pattern), accuracy could be further improved, for example by adjusting words and adding reference dictionaries. The algorithm also has a drawback that can cause over-stemming.
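A minimal sketch of the syllable-pattern idea, assuming a simplified vowel inventory and a toy set of canonical root patterns (the paper's actual Sundanese phonological rules are richer than this):

```python
# Simplified vowel set; Sundanese also uses e.g. é and eu.
VOWELS = set("aeiou")

def cv_pattern(word):
    """Map each letter to C (consonant) or V (vowel), e.g. 'kuda' -> 'CVCV'."""
    return "".join("V" if ch in VOWELS else "C" for ch in word)

# Assumed canonical root-word patterns, for illustration only.
CANONICAL = {"CVCV", "CVCVC", "VCVC", "CVCCVC"}

def looks_like_root(word):
    """Accept a stemming candidate only if its CV pattern is canonical."""
    return cv_pattern(word) in CANONICAL

print(cv_pattern("kuda"), looks_like_root("kuda"))  # CVCV True
```

The position-adjustment advantage mentioned in the abstract would correspond to sliding such patterns along the word after affix removal.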

    New instances classification framework on Quran ontology applied to question answering system

    Instance classification with a small dataset is a current research problem in Quran ontology development. The existing classification approach used machine learning, specifically a Backpropagation Neural Network. However, this method has a drawback: if the training set is small, classifier accuracy can decline, and the Holy Quran corpus is small. Based on this problem, our study aims to formulate a new instance classification framework for small training corpora, applied to a semantic question answering system. The resulting framework consists of several essential components: preprocessing, morphological analysis, semantic analysis, feature extraction, instance classification with the Radial Basis Function Network algorithm, and a transformation module. This algorithm was chosen because it is robust to noisy data and performs well on small datasets. Furthermore, the document processing module of the question answering system is used to access the instance classification results in the Quran ontology.
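The Radial Basis Function Network mentioned here can be sketched in a drastically simplified form, where every training sample is a center and the output weights are one-hot class indicators; the data and the gamma value are toy assumptions, not the paper's trained network:

```python
import math

def rbf(x, c, gamma=2.0):
    """Gaussian radial basis activation between a point and a center."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, c))
    return math.exp(-gamma * d2)

def predict(x, centers, labels, classes=(0, 1), gamma=2.0):
    """Score each class by the summed activations of its centers."""
    scores = {k: 0.0 for k in classes}
    for c, y in zip(centers, labels):
        scores[y] += rbf(x, c, gamma)
    return max(scores, key=scores.get)

# Toy training set: XOR, which a linear model cannot separate,
# but which local Gaussian bases handle even with only 4 samples.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]
preds = [predict(x, X, y) for x in X]
print(preds)  # [0, 1, 1, 0]
```

Using every sample as a center is one reason RBF-style models suit small datasets, which matches the motivation given in the abstract.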

    Open Source Natural Language Processing

    Our MQP aimed to introduce finite-state-machine-based techniques for natural language processing into Hunspell, the world's premier open-source spell checker, used in several prominent projects such as Firefox and OpenOffice. We created compact machine-readable finite state transducer representations of 26 of the most commonly used languages on Wikipedia. We then created an automaton-based spell checker. In addition, we implemented a transducer-based stemmer, which will be used in future transducer-based morphological analysis.
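The dictionary-as-automaton idea behind these transducer representations can be illustrated with a plain trie acceptor; this is a sketch of the concept only, not Hunspell's actual data structures:

```python
def build_trie(words):
    """Build a simple acceptor: nested dicts, with '$' marking end-of-word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def accepts(trie, word):
    """Spell checking = running the word through the automaton."""
    node = trie
    for ch in word:
        if ch not in node:
            return False       # no outgoing transition: misspelled
        node = node[ch]
    return "$" in node         # must end in an accepting state

lexicon = build_trie(["cat", "car", "card"])  # toy word list
print(accepts(lexicon, "car"), accepts(lexicon, "ca"))  # True False
```

A finite state transducer extends this acceptor with output labels, which is what allows the same machinery to also emit stems or analyses.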

    Incremental Stemming Based on Morphological Rules for Indonesian Text (Ekstraksi Kata Dasar Secara Berjenjang Berbasis Aturan Morfologi untuk Teks Berbahasa Indonesia)

    Root word extraction, or stemming, in Indonesian is a complex process in which several of the 13 known prefixes, 3 infixes, and 19 suffixes can be applied to a word simultaneously. In addition, stemming does not always produce a single root word (it is non-deterministic), because some Indonesian words have two possible interpretations, as a root word or as an affixed word, for example the word “beruang”. Previous research used lists of impossible prefix-suffix combinations and applied heuristics to choose the root word. This study proposes an incremental stemming method in which, following a particular order, suffix and prefix particles are alternately stripped from a word until a root word is produced. If several root-word candidates are found, one of them is chosen. The method was tested on 6,464 documents of the Indonesian translation of the Al-Quran, using a dictionary of 5,000 words randomly sampled from the Kamus Besar Bahasa Indonesia. Of the 3,432 unique words processed, 94.7% of root words could be extracted directly, and only 5.3% required further processing because more than one root-word candidate was found. Compared with manual root-word selection, the method chooses the correct root word in up to 79.12% of cases.
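The alternating strip-and-check procedure described above might look like the following sketch; the affix lists and dictionary are toy assumptions, and the paper's specific ordering rules are not reproduced:

```python
# Hypothetical sketch of incremental (alternating) affix stripping.
PREFIXES = ["meng", "men", "me", "ber", "di", "ter", "ke", "pe"]
SUFFIXES = ["kan", "lah", "nya", "an", "i"]
ROOTS = {"ajar", "makan", "baca"}          # toy root-word dictionary

def incremental_stem(word):
    """Alternately strip suffixes and prefixes until a root is found."""
    queue = [word]
    seen = set()
    while queue:
        w = queue.pop(0)                   # breadth-first over candidates
        if w in ROOTS:
            return w
        if w in seen:
            continue
        seen.add(w)
        for s in SUFFIXES:                 # strip one suffix...
            if w.endswith(s):
                queue.append(w[:-len(s)])
        for p in PREFIXES:                 # ...then one prefix
            if w.startswith(p):
                queue.append(w[len(p):])
    return word                            # fall back to the original word

print(incremental_stem("mengajarkan"))  # ajar
```

Because stripping order is explored breadth-first, several root candidates can surface, which mirrors the non-determinism the abstract describes.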


    Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers

    With the advent of modern pre-trained Transformers, text preprocessing has started to be neglected and is not specifically addressed in recent NLP literature. However, from both a linguistic and a computer science point of view, we believe that even with modern Transformers, text preprocessing can significantly impact the performance of a classification model. Through this study, we investigate and compare how preprocessing affects the Text Classification (TC) performance of modern and traditional classification models. We report and discuss the preprocessing techniques found in the literature and their most recent variants or applications to TC tasks in different domains. To assess how much preprocessing affects classification performance, we apply the three most-referenced preprocessing techniques (alone or in combination) to four publicly available datasets from different domains. Then, nine machine learning models, including modern Transformers, receive the preprocessed text as input. The results show that an educated choice of text preprocessing strategy should be based on the task as well as on the model considered. Outcomes of this survey show that choosing the best preprocessing technique, rather than the worst, can significantly improve classification accuracy (by up to 25%, as in the case of XLNet on the IMDB dataset). In some cases, with a suitable preprocessing strategy, even a simple Naïve Bayes classifier outperformed the best-performing Transformer (by 2% in accuracy). We found that both Transformers and traditional models are noticeably affected by preprocessing in their TC performance.
    Our main findings are: (1) even on modern pre-trained language models, preprocessing can affect performance, depending on the datasets and on the preprocessing technique or combination of techniques used; (2) in some cases, with a proper preprocessing strategy, simple models can outperform Transformers on TC tasks; (3) similar classes of models exhibit similar levels of sensitivity to text preprocessing.
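The abstract does not name the three top-referenced techniques; lowercasing, punctuation removal, and stop-word removal are common choices, and chaining them looks like this sketch (the stop-word list is a toy assumption):

```python
import string

STOPWORDS = {"the", "a", "an", "is", "to", "and"}  # toy stop-word list

def lowercase(text):
    return text.lower()

def remove_punct(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    return " ".join(t for t in text.split() if t not in STOPWORDS)

# Techniques can be applied alone or in combination; the order of the
# chain is itself a design choice that affects the result.
pipeline = [lowercase, remove_punct, remove_stopwords]
text = "The movie is GREAT, and the plot is a surprise!"
for step in pipeline:
    text = step(text)
print(text)  # movie great plot surprise
```

Comparing classifier accuracy with and without each step, as the study does, then quantifies how much each technique matters per model and dataset.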

    The text classification pipeline: Starting shallow, going deeper

    Text Classification (TC) is an increasingly relevant and crucial subfield of Natural Language Processing (NLP), tackled in this PhD thesis from a computer science and engineering perspective. In this field too, the exceptional success of deep learning has sparked a boom over the past ten years. Text retrieval and categorization, information extraction, and summarization all rely heavily on TC. The literature has presented numerous datasets, models, and evaluation criteria. Although languages such as Arabic, Chinese, and Hindi are employed in several works, from a computer science perspective the most used and cited language in the TC literature is English, and it is also the language mainly referenced in the rest of this PhD thesis. Even though numerous machine learning techniques have shown outstanding results, classifier effectiveness depends on the capability to comprehend intricate relations and non-linear correlations in texts. To achieve this level of understanding, it is necessary to pay attention not only to the architecture of a model but also to the other stages of the TC pipeline. Within NLP, a range of text representation techniques and model designs have emerged, including large language models, which can turn massive amounts of text into useful vector representations that effectively capture semantically significant information. Of crucial interest is the fact that this field has been investigated by numerous communities, including data mining, linguistics, and information retrieval. These communities frequently overlap, but are mostly separate and conduct their research independently. Bringing researchers from these groups together to improve the multidisciplinary comprehension of this field is one of the objectives of this dissertation. Additionally, this dissertation examines text mining from both a traditional and a modern perspective.
    This thesis covers the whole TC pipeline in detail. Its main contribution, however, is to investigate every element of the TC pipeline and evaluate its impact on the final performance of a TC model. The TC pipeline is discussed, covering both traditional models and the most recent deep-learning-based models. The pipeline consists of the State-Of-The-Art (SOTA) datasets used as benchmarks in the literature, text preprocessing, text representation, machine learning models for TC, evaluation metrics, and current SOTA results. In each chapter of this dissertation, I go over each of these steps, covering both the technical advancements and my most significant and recent findings from experiments and novel models. The advantages and disadvantages of the various options are also listed, along with a thorough comparison of the different approaches. At the end of each chapter are my contributions, with experimental evaluations and discussions of the results obtained during my three-year PhD course. The experiments and analyses related to each chapter (i.e., each element of the TC pipeline) are the main contributions I provide, extending the basic knowledge of a regular survey on TC.
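The pipeline stages named in the abstract (preprocessing, representation, model, evaluation) can be schematized end to end; every concrete component below is a toy placeholder, not the thesis's implementation:

```python
def preprocess(texts):
    """Stage 1: normalize and tokenize."""
    return [t.lower().split() for t in texts]

def represent(tokenized, vocab):
    """Stage 2: bag-of-words counts, the simplest text representation."""
    return [[toks.count(w) for w in vocab] for toks in tokenized]

def evaluate(preds, labels):
    """Stage 4: accuracy as the evaluation metric."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

texts = ["good movie", "bad movie"]
labels = [1, 0]
tokens = preprocess(texts)
vocab = sorted({w for t in tokens for w in t})
X = represent(tokens, vocab)
# Stage 3, a trivial stand-in "model": predict 1 iff "good" appears.
preds = [row[vocab.index("good")] > 0 for row in X]
print(evaluate(preds, labels))  # 1.0
```

Swapping any single stage (a different tokenizer, an embedding representation, a Transformer model, another metric) while holding the rest fixed is exactly the kind of per-element evaluation the thesis describes.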

    The Jaro-Winkler Distance Algorithm: Autocorrect and Spelling Suggestion Features for Indonesian-Language Scripts at BMS TV (Algoritma Jaro-Winkler Distance: Fitur Autocorrect dan Spelling Suggestion pada Penulisan Naskah Bahasa Indonesia di BMS TV)

    Autocorrect is a software system that automatically identifies and corrects misspelled words. Nowadays, the autocorrect feature is often found in various devices and applications, such as smartphone keyboards and Microsoft Word. Such an autocorrect system instantly replaces a word it considers wrong without notifying the user, so users are often unaware that their writing has changed, while the replacement word does not always match what the user intended. Microsoft Word's autocorrect feature uses English, so it cannot be applied to writing news scripts at BMS TV. Every day, the BMS TV News Director checks the scripts to be broadcast, which includes spell checking. Indonesian-language autocorrect and spelling-suggestion features are expected to help the BMS TV News Director check and fix misspelled words automatically and suggest the correct Indonesian spellings. The software development method used is Extreme Programming, together with the Jaro-Winkler Distance algorithm. Jaro-Winkler is an algorithm for calculating the proximity between two texts. The result of this study is a system that helps the BMS TV News Director identify misspelled words in Indonesian scripts and makes it easier for the central News Director to collect scripts from the various BMS TV contributors. It can be concluded that the autocorrect and spelling-suggestion features can handle misspelled words: in a test of 60 words covering various misspelling scenarios, the feature corrected ten words automatically and showed correct spelling suggestions for 39 words.
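The Jaro-Winkler Distance used in the study is a standard string-similarity measure; a straightforward implementation (with the usual prefix scale p = 0.1 and 4-character prefix cap) looks like this:

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):                  # greedy matching in window
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                                 # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for strings sharing a prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a == b and prefix < 4:
            prefix += 1
        else:
            break
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

A spelling-suggestion feature like the one described would rank dictionary words by this score against the misspelled word and surface the highest-scoring candidates.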