A Case Study of Using Domain Analysis for the Conflation Algorithms Domain

Author: Frakes William; Yilmaz Okan
Publication date: 01/01/2007

This paper documents the domain engineering process for much of the conflation algorithms domain. Empirical data on the process and products of domain engineering were collected. Six conflation algorithms of four different types were analyzed: three affix removal, one successor variety, one table lookup, and one n-gram. Products of the analysis include a generic architecture, reusable components, a little language, and an application generator that extends the scope of the domain analysis beyond previous generators. The application generator produces source code not only for affix removal stemmers but also for successor variety, table lookup, and n-gram stemmers. The automatically generated stemmers were compared with manually developed stemmers in terms of stem similarity, source and executable sizes, and development and execution times. All five stemmers generated by the application generator produced more than 99.9% identical stems with the manually developed stemmers. Some of the generated stemmers were as efficient as their manual equivalents and some were not.

Computer Science Technical Reports @Virginia Tech
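
The n-gram type of conflation named in the abstract above is often described as grouping words by the letter pairs they share. The following is a minimal sketch of that idea, assuming Dice's coefficient over unique bigrams as the similarity measure. The function names are hypothetical, and this is not the code produced by the paper's application generator.

```python
def bigrams(word):
    """Return the set of unique adjacent letter pairs in a word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice_similarity(a, b):
    """Dice's coefficient over unique bigrams: 2 * |A & B| / (|A| + |B|)."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a and not set_b:
        # Words shorter than two letters have no bigrams to compare.
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


if __name__ == "__main__":
    # "statistics" has 7 unique bigrams, "statistical" has 8, and 6 are shared.
    print(dice_similarity("statistics", "statistical"))
```

Word pairs whose similarity exceeds a chosen threshold would then be clustered together; the threshold and the clustering method are left open here.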

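PENGARUH AFFIX REMOVAL DENGAN PORTER STEMMER DAN KROVETZ STEMMER DALAM KATEGORISASI BERITA BERBAHASA INDONESIA [The Effect of Affix Removal with the Porter Stemmer and Krovetz Stemmer on the Categorization of Indonesian-Language News]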

Author: FIQI DWISWISTYAN
Publication venue: Universitas Telkom
Publication date: 01/07/2009

ABSTRACT: The growth of online technology has greatly increased the amount of information available as articles. A method is therefore needed to help readers find information, and one such data mining function is categorization. Good categorization results require a good data preprocessing stage, and a commonly used preprocessing step is stemming. Stemming is the process of finding the root (base) word by separating all affixes attached to a word. Stemming reduces the dimensionality of the data in the categorization process and can thereby improve categorization results, because words that share the same root can be grouped into a single stem. There are several kinds of stemming methods, one of which is affix removal. This final project discusses two affix removal methods, the Porter Stemmer and the Krovetz Stemmer, and their effect on the categorization process. Stemmer performance is measured by accuracy and ICF (Index Compression Factor). In testing, the Improved (modified) Porter Stemmer achieved better accuracy and ICF scores than the Porter Stemmer and the Krovetz Stemmer. However, the best stemmer performance, obtained by the Improved Porter Stemmer, does not necessarily increase precision and recall in categorization.

Keywords: stemming, affix removal, Porter Stemmer, Krovetz Stemmer, Index Compression Factor, accuracy

Open Library
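
The Index Compression Factor used in the abstract above is commonly defined as ICF = (N - S) / N, where N is the number of distinct words before stemming and S is the number of distinct stems afterwards. The sketch below assumes that definition. The `toy_stem` function is a hypothetical English suffix stripper included only to make the example runnable; it is neither the Porter nor the Krovetz stemmer, and it does not handle Indonesian affixes.

```python
def index_compression_factor(words, stem):
    """ICF = (N - S) / N over the distinct words of a vocabulary."""
    unique_words = set(words)
    if not unique_words:
        raise ValueError("vocabulary must not be empty")
    unique_stems = {stem(w) for w in unique_words}
    return (len(unique_words) - len(unique_stems)) / len(unique_words)


def toy_stem(word):
    """Hypothetical stemmer: strip one common English suffix if enough remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    vocab = ["connect", "connected", "connecting", "connects", "walk", "walked"]
    # Six distinct words collapse to two stems, so ICF = (6 - 2) / 6.
    print(index_compression_factor(vocab, toy_stem))
```

A higher ICF means the stemmer merges more word forms into shared stems. It measures stemmer strength, not correctness, which is consistent with the abstract's remark that the best-scoring stemmer need not improve precision and recall.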

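IMPLEMENTASI DAN ANALISIS PENGARUH AFFIX REMOVAL STEMMING TERHADAP CLUSTERING STUDI KASUS CLUSTERING TERJEMAHAN AYAT-AYAT AL-QUR’AN TENTANG PERMASALAHAN AKIDAH [Implementation and Analysis of the Effect of Affix Removal Stemming on Clustering: A Case Study of Clustering Translations of Qur’anic Verses on Matters of Creed (Akidah)]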

Author: MOHAMMAD SHOBRI
Publication venue: Universitas Telkom
Publication date: 01/01/2011

ABSTRACT: Clustering is the grouping of unlabeled data into groups, or clusters, whose quality is measured by a parameter called cluster purity. Before the clusters are formed, the data are processed into a collection of text, a process called text mining. In text mining there are two kinds of words: words that carry affixes and words whose affixes have been removed. Forming base words from affixed words requires a process called stemming. The affix removal stemming algorithm is one stemming algorithm that can be applied to Indonesian-language text data, because it can stem text data with an accuracy of up to 90%. In a case study on clustering the translated text of Qur’anic verses, it is shown that applying stemming in text mining reduces the amount of text data to be processed in clustering and yields better cluster purity than clustering without stemming.

Keywords: Clustering, Cluster, Text Mining, Affix Removal Stemming, Purity Cluster, Al-Qur’an

Open Library
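
The cluster purity measure mentioned in the abstract above is usually computed by counting, for each cluster, the items that belong to its most frequent true class, and dividing the total by the number of items. The sketch below assumes that standard definition. The function name and the toy labels are hypothetical, and the thesis may compute purity differently.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Purity = (1 / N) * sum over clusters of the majority-class count."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")
    # Count how often each true label occurs inside each cluster.
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    labels = ["a", "a", "b", "b", "b", "a"]
    # Each cluster has a majority class of size 2, so purity = 4 / 6.
    print(cluster_purity(clusters, labels))
```

Purity requires reference labels for the clustered items, so comparing runs with and without stemming means evaluating both against the same labeled set.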

Unsupervised induction of Arabic root and pattern lexicons using machine learning

Author: Carroll John; Khaliq Bilal
Publication date: 01/09/2013

We describe an approach to building a morphological analyser of Arabic by inducing a lexicon of root and pattern templates from an unannotated corpus. Using maximum entropy modelling, we capture orthographic features from surface words, and cluster the words based on the similarity of their possible roots or patterns. From these clusters, we extract root and pattern lexicons, which allow us to morphologically analyse words. Further enhancements are applied, adjusting for morpheme length and structure. A final root extraction accuracy of 87.2% is achieved. In contrast to previous work on unsupervised learning of Arabic morphology, our approach is applicable to naturally written, unvowelled Arabic text.

Sussex Research Online