3 research outputs found

    Visual analysis of research paper collections using normalized relative compression

    Analyzing collections of research papers can reveal whether a research area is stalled on the same problems or produces substantial novelty every year. Previous research has addressed similar tasks by analyzing keywords or reference lists, with varying degrees of human intervention. In this paper, we demonstrate how Normalized Relative Compression, together with a set of automated data-processing tasks, lets us visually compare research articles and document collections. We also achieve very similar results with Normalized Conditional Compression, which can be applied with a regular compressor. With our approach, we can group papers from different disciplines, analyze how a conference evolves across its editions, or track how the profile of a researcher changes over time. We provide a set of tests that validate our technique and show that it performs better on these tasks than previously proposed techniques. Peer Reviewed. Postprint (published version).
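The abstract notes that Normalized Conditional Compression can be computed with a regular compressor. The following is a minimal sketch of that idea, not the authors' implementation: it assumes `zlib` as the compressor, approximates the conditional compression of `x` given `y` as `C(y + x) - C(y)`, and normalizes by `C(x)`. The function names `csize` and `ncc` and the sample texts are hypothetical, and the exact normalization used in the paper may differ.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes, using zlib at its highest compression level."""
    return len(zlib.compress(data, 9))


def ncc(x: bytes, y: bytes) -> float:
    """Assumed approximation of a normalized conditional compression of x given y.

    The extra bytes needed to compress x after the compressor has already
    seen y are divided by the cost of compressing x alone. Lower values
    suggest that y already contains much of the information in x.
    """
    return (csize(y + x) - csize(y)) / csize(x)


# Illustrative inputs only; real use would pass full document texts.
paper_a = (b"We compare research articles with compression-based similarity "
           b"measures and visualize how a conference evolves across editions.")
paper_b = (b"Resep nasi goreng: tumis bawang, masukkan nasi, kecap manis, "
           b"dan telur, lalu aduk hingga matang merata.")

same_score = ncc(paper_a, paper_a)   # a text conditioned on itself
other_score = ncc(paper_a, paper_b)  # a text conditioned on unrelated text
```

A general-purpose compressor such as zlib only looks back over a limited window (32 KB), so this sketch is reasonable for short texts but would need a different compressor for long documents.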

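    Perbaikan Kinerja Praproses Karakter Berulang Dalam Mengenali Kata Pada Klasifikasi Sentimen Berbahasa Indonesia (Improving the Performance of Repeated-Character Preprocessing for Word Recognition in Indonesian-Language Sentiment Classification)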

    Relevant data is obtained in the preprocessing stage by removing noise so that the data to be processed fits the task. Part of this noise removal is deleting repeated characters, which are common in Twitter data because of writing errors. A problem arises when words that themselves contain repetition are processed: the words lose their meaning and cannot be processed properly. This study modifies repeated-character removal by adding a similarity measurement that scores how closely a word matches a dictionary. Four types of repetition in standard words are handled: words containing repetition with more than one type of repeated-character error, words containing repetition without a repeated-character error, words without repetition with a repeated-character error, and words without repetition with more than one type of repeated-character error. The modified removal is intended to improve the quality of sentiment analysis results obtained with Support Vector Machines (SVM). Three configurations are compared: without repeated-character removal, with repeated-character removal, and with the modified removal. The modification gives the best classification performance, with an accuracy of 74.46%, compared with 71.71% for the Illecker method and 68.04% for the Jaccard method. The modification also plays a significant role in reducing errors in word meaning: in the best result, 59% of words are recognized after the modified character removal. In addition, it improves performance at the stemming and stop-word stages. The stemming improvement is shown by 682 words that can be recognized, and the stop-word improvement by 86 words that can be reduced, which lowers the diversity of words that share the same meaning and intent.
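The idea of checking repeated-character removal against a dictionary can be sketched as follows. This is an illustrative assumption, not the method evaluated in the paper: the tiny `DICTIONARY`, the function names, the 0.8 cutoff, and the use of Python's `difflib` ratio as the similarity measure are all hypothetical choices.

```python
import difflib
import re

# Hypothetical mini-dictionary; a real system would load a full word list.
DICTIONARY = ["bagus", "senang", "maaf", "sekali"]


def collapse_candidates(token: str) -> list[str]:
    """Return the token, then its runs cut to two characters, then to one."""
    double = re.sub(r"(.)\1+", r"\1\1", token)
    single = re.sub(r"(.)\1+", r"\1", token)
    return [token, double, single]


def normalize(token: str, dictionary=DICTIONARY, cutoff: float = 0.8) -> str:
    """Pick the collapsed form that the dictionary recognizes.

    Exact dictionary hits win, so a legitimate double letter ("maaf") is
    kept instead of being flattened. Otherwise the fully collapsed form is
    matched to the closest dictionary word above the similarity cutoff,
    and returned unchanged if nothing is close enough.
    """
    candidates = collapse_candidates(token)
    for candidate in candidates:
        if candidate in dictionary:
            return candidate
    fallback = candidates[-1]
    matches = difflib.get_close_matches(fallback, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else fallback


cleaned = [normalize(t) for t in ["baguuuus", "maaaaf", "senaaang"]]
```

Trying the two-character form before the one-character form is what keeps words with a genuine double letter from losing their meaning, which is the failure mode the abstract describes for plain repeated-character removal.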

    An Information Theoretic Approach to Text Sentiment Analysis

    No full text