13 research outputs found

    Klasifikasi Sentimen Pada Ulasan Buku Berbahasa Inggris Menggunakan Information Gain dan Naïve Bayes

    As information technology develops, data on book reviews is growing ever larger and faster. By reading reviews based on other readers' experiences, we can judge the quality of a book, but the sheer number of reviews makes it hard for other users to draw a conclusion from them. Sentiment analysis addresses this by classifying sentiment into positive and negative. In this study, sentiment classification of book reviews uses Information Gain and Naïve Bayes. Information Gain serves as the feature selection method; feature selection can raise accuracy by removing less informative features. Naïve Bayes is then used to classify whether a review leans toward a positive or a negative opinion based on its probability values. Based on the test scenarios carried out, the performance of sentiment classification on English-language book reviews using Information Gain and Naïve Bayes, measured as the average F1-score under 5-fold cross-validation, is 88.28.
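
    A minimal sketch of this pipeline, assuming scikit-learn and an invented toy corpus: mutual information stands in for the Information Gain criterion (the two coincide for this use), and the fold count is reduced only because the toy data has four reviews where the study used 5-fold cross-validation.

```python
# Hedged sketch: Information-Gain-style feature selection + Naive Bayes
# sentiment classification, evaluated with cross-validated F1. Data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

reviews = ["a wonderful, moving book", "dull plot and flat characters",
           "beautifully written and engaging", "a waste of time"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

pipeline = Pipeline([
    ("bow", CountVectorizer()),                     # bag-of-words features
    ("ig", SelectKBest(mutual_info_classif, k=3)),  # keep the most informative terms
    ("nb", MultinomialNB()),                        # classify by posterior probability
])

# The paper reports the mean F1 over 5 folds; cv=2 here only because the
# placeholder corpus is tiny.
print(cross_val_score(pipeline, reviews, labels, cv=2, scoring="f1").mean())
```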

    Analisis Sentimen dan Peringkasan Opini pada Ulasan Produk Menggunakan Algoritma Random Forest

    Product reviews are one useful criterion for prospective buyers deciding whether to purchase a product. The large number of reviews means their content cannot be summarized quickly, which makes it hard for consumers to reach a purchase decision. To address this, a system is needed that automatically identifies product features in reviews, classifies them into positive and negative polarity, and generates a summary of the reviews to ease reading. Two stages precede summary generation. First, product features are extracted using association mining to obtain frequent itemsets, with two word-selection schemes: noun filtering and noun phrase filtering. Second, the extracted product features are classified into positive and negative orientation using a supervised learning approach with the Random Forest algorithm. A single review sentence may contain more than one product feature, so sentiment is determined at the aspect level. Review summarization is performed extractively for each feature, presenting product features with positive and negative orientations separated. Association mining with the two word-selection schemes yields an F-score of roughly 20%-40%, depending on the chosen minimum support. This happens because many extracted product features do not match the expert-judgement features, and there are also many expert-judgement labelling errors that affect the evaluation scores. In the classification stage, the choice of classification attributes affects the resulting accuracy. Keywords: product reviews, product feature extraction, association mining, classification, opinion summarization, supervised learning.
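
    A minimal sketch of the two stages, assuming the mlxtend and scikit-learn libraries and invented review data: apriori over noun "transactions" yields frequent candidate product features, and a Random Forest assigns polarity to sentences.

```python
# Stage 1: frequent-itemset mining over nouns (noun filtering) to find
# candidate product features. Stage 2: Random Forest polarity classification.
# All data below is illustrative, not from the paper.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# One transaction per review sentence: the nouns it contains
transactions = [["battery", "screen"], ["battery", "price"],
                ["screen"], ["battery", "price", "camera"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
frequent_features = apriori(onehot, min_support=0.5, use_colnames=True)
print(frequent_features)  # candidate product features and their support

# Polarity classification of sentences mentioning an extracted feature
sentences = ["battery lasts all day", "battery dies quickly",
             "screen is sharp", "price is far too high"]
polarity = [1, 0, 1, 0]  # 1 = positive, 0 = negative
X = TfidfVectorizer().fit_transform(sentences)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, polarity)
print(clf.predict(X))
```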

    Icelandic Language Resources and Technology: Status and Prospects

    Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources. Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson and Koenraad de Smedt. NEALT Proceedings Series, Vol. 5 (2009), 27-32. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9207

    Penentuan Pola Kalimat Bahasa Inggris Pada Simple Present Tense Menggunakan Metode Bottom Up Parsing

    English is a language widely used in many countries around the world. Its grammar comprises many rules, each defining a structural pattern that governs how clauses, phrases, and words are composed in natural language. The Simple Present Tense is one of the 16 tenses of English grammar; it is the first and most basic tense, used to express regular and habitual actions. Most junior high school students have difficulty determining the grammatical patterns of tenses, particularly the Simple Present Tense. This research discusses an application that determines the sentence pattern of Simple Present Tense sentences using the bottom-up parsing method, a parsing technique for decomposing sentences: the English sentence is broken down and translated into pattern forms. The constructed form is rule-based and implements a Context-Free Grammar (CFG). Across 110 test inputs, bottom-up parsing determined sentence patterns with 88.1% accuracy. The system could not detect a person's name in interrogative sentences when it appeared as the second word, functioning as the subject.
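
    A small illustration of the approach, assuming NLTK; the grammar below is an invented Simple Present Tense fragment for demonstration, not the rule base actually used in the paper.

```python
# Hedged sketch: bottom-up chart parsing of a Simple Present Tense sentence
# against a tiny hand-written CFG.
from nltk import CFG
from nltk.parse.chart import BottomUpChartParser

grammar = CFG.fromstring("""
    S  -> NP VP
    NP -> PRP | NNP
    VP -> VBZ NP | VBP NP
    PRP -> 'she' | 'they'
    NNP -> 'john'
    VBZ -> 'reads'
    VBP -> 'read'
""")

parser = BottomUpChartParser(grammar)
for tree in parser.parse("she reads john".split()):
    tree.pretty_print()  # derived pattern: S -> NP VP -> PRP VBZ NNP
```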

    Lexicon acquisition through Noun Clustering

    This paper describes an experiment with clustering of Icelandic nouns based on semantic relatedness. This work is part of a larger project aiming at semi-automatically constructing a semantic database for Icelandic language technology. The harvested semantic clusters also provide valuable information for traditional lexicography.
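
    As an illustration of the idea, a minimal sketch assuming scikit-learn, with invented co-occurrence counts standing in for statistics harvested from a corpus:

```python
# Hedged sketch: cluster nouns by distributional similarity. The counts of
# context features per noun below are placeholders.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

nouns = ["hestur", "hundur", "bill", "flugvel"]  # horse, dog, car, airplane
cooc = np.array([[9, 1, 0],
                 [8, 2, 0],
                 [0, 1, 9],
                 [1, 0, 8]], dtype=float)

vectors = normalize(cooc)  # length-normalized, cosine-comparable rows
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)
for noun, lab in zip(nouns, labels):
    print(lab, noun)  # groups animals vs. vehicles
```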

    Sprogteknologiske ressourcer for islandsk leksikografi

    Ten years ago, the Icelandic government launched a special Language Technology Program with the aim of supporting institutions and companies in creating basic resources for Icelandic language technology work. This initiative resulted in the creation and development of several important resources and tools that have had profound influence on Icelandic language technology, and are also valuable for Icelandic lexicography and linguistic research in general. The present paper briefly describes some of the most important of these products, such as a morphological database (260,000 lemmas), a 25 million word balanced PoS tagged corpus, a lemmatiser, a rule-based tagger, and a shallow parser. Finally, it is pointed out that all the tools that the Icelandic language technology community has developed in the past few years have been made Open Source, and the importance of adopting an Open Source policy for small language communities is emphasized.

    A Hybrid Framework for Text Analysis

    2015 - 2016
    In Computational Linguistics there is an essential dichotomy between linguists and computer scientists. The first, with strong knowledge of language structures, lack engineering skills; the second, expert in computing and mathematics, assign little value to the basic mechanisms and structures of language. This discrepancy has grown over the last decades with the growth of computational resources and the gradual computerization of the world; the use of Machine Learning in Artificial Intelligence problem solving, which for example allows machines to learn from manually generated examples, has been used more and more often in Computational Linguistics to overcome the obstacle posed by language structures and their formal representation. The dichotomy has resulted in two main approaches to Computational Linguistics: rule-based methods, which try to imitate the way humans use and understand language, reproducing the syntactic structures on which the understanding process is based and building lexical resources such as electronic dictionaries, taxonomies, or ontologies; and statistics-based methods, which treat language as a set of elements, quantifying words mathematically and trying to extract information without identifying syntactic structures or, in some algorithms, trying to give the machine the ability to learn those structures.
    One of the main problems is the lack of communication between these two approaches, due to the substantial differences between them: on one side there is a strong focus on how language works and on its characteristics, with a tendency toward analytical, manual work; on the other, the engineering perspective sees language as an obstacle and looks to algorithms as the fastest way around it. However, the lack of communication is not an outright incompatibility: following Harris, the best way to approach natural language may be to take the best of both. At the moment there is a large number of open-source tools for text analysis and Natural Language Processing. A great part of them are based on statistical models and consist of separate modules that can be combined into a text-processing pipeline. Many of these resources are code packages without a GUI (Graphical User Interface) and are impossible to use for people without programming skills. Furthermore, the vast majority of these open-source tools support only English; when Italian is included, performance drops significantly, and open-source tools for Italian are very few.
    This work aims to fill that gap by presenting a new hybrid framework for the analysis of Italian texts. It is not intended as a commercial tool; it was built to help linguists and other scholars perform rapid text analysis and produce linguistic data. The framework, which performs both statistical and rule-based analysis, is called LG-Starship. The idea is to build modular software that includes, to begin with, the basic algorithms for different kinds of analysis. The modules perform the following tasks:
    Preprocessing Module: loads a text, normalizes it, and removes stop words; as output, it presents the list of tokens and letters that compose the text, with their occurrence counts, along with the processed text.
    Mr. Ling Module: performs POS tagging and lemmatization; it also returns the table of lemmas with occurrence counts and the table quantifying grammatical tags.
    Statistic Module: computes Term Frequency and TF-IDF of tokens or lemmas, extracts bigram and trigram units, and exports the results as tables.
    Semantic Module: uses the Hyperspace Analogue to Language algorithm to compute semantic similarity between words, returning word-by-word similarity matrices that can be exported and analyzed.
    Syntactic Module: analyzes the syntactic structure of a selected sentence and tags the verb and its arguments with semantic labels.
    The objective of the framework is an all-in-one NLP platform that allows any kind of user to perform basic and advanced text analysis. To make the framework accessible to users without specific computer science and programming skills, the modules are provided with an intuitive GUI. The framework can be considered hybrid in a double sense: as explained above, it uses both statistical and rule-based methods, relying on standard statistical algorithms and techniques and, at the same time, on Lexicon-Grammar syntactic theory; in addition, it is written in both the Java and Python programming languages. The LG-Starship framework has a simple Graphical User Interface but will also be released as separate modules that can be included independently in any NLP pipeline. There are many resources of this kind, but the large majority work for English; free resources for Italian are very few, and this work tries to cover that need by proposing a tool usable both by linguists and other scientists interested in language and text analysis who know nothing about programming languages, and by computer scientists, who can use the free modules in their own code or in combination with different NLP algorithms.
    The framework starts from a text or corpus written directly by the user or loaded from an external resource. The LG-Starship workflow is described in the flowchart shown in fig. 1. The Pre-Processing Module is applied to the original imported or generated text to produce a clean, normalized preprocessed text; this module includes a text-splitting function, a stop-word list, and a tokenization method. The Statistic Module or the Mr. Ling Module can then be applied to the preprocessed text. The first, which includes basic statistical algorithms such as Term Frequency, TF-IDF, and n-gram extraction, produces as output databases of lexical and numerical data that can be used to produce charts or to perform further external analysis. The second is divided into two main tasks: a POS tagger, based on the Averaged Perceptron tagger [?] and trained on the Paisà Corpus [Lyding et al., 2014], performs Part-of-Speech tagging and produces an annotated text; a lemmatization method, which relies on a set of electronic dictionaries developed at the University of Salerno [Elia, 1995, Elia et al., 2010], takes the POS-tagged text as input and produces a new lemmatized version of the original text with information about syntactic and semantic properties.
    This lemmatized text, which can also be processed with the Statistic Module, serves as input for two deeper levels of text analysis carried out by the Syntactic Module and the Semantic Module. The first rests on Lexicon-Grammar theory [Gross, 1971, 1975] and uses a database of predicate structures under development at the Department of Political, Social and Communication Science; its objective is to produce a dependency graph of the sentences that compose the text. The Semantic Module uses the Hyperspace Analogue to Language distributional semantics algorithm [Lund and Burgess, 1996], trained on the Paisà Corpus, to produce a semantic network of the words of the text.
    This workflow has been used in two experiments involving two user-generated corpora. The first is a statistical study of the language of rap music in Italy, analyzing a large corpus of rap song lyrics downloaded from online databases of user-generated lyrics. The second is a feature-based Sentiment Analysis project on user product reviews; for this project we integrated a large domain database of linguistic resources for Sentiment Analysis, developed over the past years by the Department of Political, Social and Communication Science of the University of Salerno, consisting of polarized dictionaries of verbs, adjectives, adverbs, and nouns. The two experiments show how the framework can be applied at different levels of analysis to produce both qualitative and quantitative data. As for the results obtained, the framework, still in a beta version, achieves reasonable results both in processing time and in precision. Nevertheless, the work is far from complete: more algorithms will be added to the Statistic Module, the Syntactic Module will be completed, the GUI will be improved and made more attractive and modern, and an open-source online version of the modules will be published. [edited by author]
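
    As one concrete piece of the picture, here is a simplified sketch of the Hyperspace Analogue to Language idea used by the Semantic Module, with an invented token stream; full HAL keeps left and right contexts in separate dimensions, which this sketch collapses for brevity.

```python
# Hedged sketch of HAL: co-occurrence counts weighted by distance within a
# sliding window, then cosine similarity between the resulting word vectors.
from collections import defaultdict
from math import sqrt

def hal_vectors(tokens, window=5):
    vec = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                # closer neighbours get higher weight, as in HAL
                vec[w][tokens[i + d]] += window - d + 1
                vec[tokens[i + d]][w] += window - d + 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

tokens = "il gatto dorme il cane dorme il gatto mangia".split()
vecs = hal_vectors(tokens)
print(cosine(vecs["gatto"], vecs["cane"]))  # distributional similarity
```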

    Topic Modelling Skripsi menggunakan metode Latent Dirichlet Allocation

    The English Literature Study Program at Universitas Islam Negeri Sunan Ampel Surabaya (UINSA) is one of the study programs whose undergraduate theses are written entirely in English. The problem at the program is that the thesis topics taken by students have never been clustered, while clustering is needed to see trends and how topics align with the program's concentrations. Latent Dirichlet Allocation (LDA) is currently one of the most popular topic modelling methods; besides being able to summarize, cluster, and relate documents, its main strength is its ability to process very large data, which is why this study uses it. The dataset consists of 584 thesis abstracts from the program. Abstracts from this program were used because, for preprocessing, the standard stopword data and the resources supporting lemmatization and stemming are so far available only for English. After these steps, the dataset is converted into a document-term matrix using the bag-of-words method. LDA performs clustering with the bag of words as its input, given a chosen number of clusters (the number of topics) and a number of iterations: it assigns each word to a topic in a semi-random distribution and, at each iteration, computes the probability of each topic in a document and of each word in a topic. In this study, five iteration tests were run with different iteration counts: 100, 500, 1000, and 5000; within each test the number of topics was varied over 2, 3, 4, 5, and 7. From these experiments, 3 emerged as the best-fitting number of topics. This result was evaluated qualitatively with the stakeholders of the English Literature Study Program and judged to match the trends and concentrations present in the program.
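
    A compact sketch of this LDA setup, assuming gensim, with a few invented tokenized abstracts standing in for the real 584:

```python
# Hedged sketch: bag-of-words + LDA topic modelling, mirroring the setup
# described above (topic counts from 2 to 7, iteration counts up to 5000).
from gensim import corpora
from gensim.models import LdaModel

docs = [["sentiment", "analysis", "movie", "review"],
        ["translation", "strategy", "novel", "subtitle"],
        ["speech", "act", "politeness", "utterance"],
        ["sentiment", "lexicon", "review", "polarity"]]  # preprocessed abstracts

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]  # document-term matrix

lda = LdaModel(corpus=bow, id2word=dictionary,
               num_topics=3,     # the value the study found to fit best
               iterations=1000,  # one of the tested iteration counts
               random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```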

    Persönliche Wege der Interaktion mit multimedialen Inhalten

    Today the world of multimedia is almost completely device- and content-centered. It focuses its energy nearly exclusively on technical issues such as computing power, network specifics, or content and device characteristics and capabilities. In most multimedia systems, the presentation of multimedia content and the basic controls for playback are the main issues; because of this, a very passive user experience, comparable to that of traditional TV, is most often provided. In the face of recent developments and changes in the realm of multimedia and mass media, this "traditional" focus seems outdated. The increasing use of multimedia content on mobile devices, along with the continuous growth in the amount and variety of content available, makes an urgent re-orientation of this domain necessary. To highlight the depth of the increasingly difficult situation faced by users of such systems, it is only logical that these individuals be brought to the center of attention. In this thesis we address these trends and developments by applying concepts and mechanisms to multimedia systems that were first introduced in the domain of user-centrism. Central to the concept of user-centrism is that devices should provide users with an easy way to access services and applications; the current challenge is thus to combine mobility, additional services, and easy access in a single user-centric approach. This thesis presents a framework for introducing and supporting several of the key concepts of user-centrism in multimedia systems. Additionally, a new definition of a user-centric multimedia framework has been developed and implemented. To satisfy the user's need for mobility and flexibility, our framework makes seamless media and service consumption possible. The main aim of session mobility is to help people cope with the increasing number of different devices in use: using a mobile agent system, multimedia sessions can be transferred between different devices in a context-sensitive way, and the use of the international standard MPEG-21 guarantees extensibility and the integration of content-adaptation mechanisms. Furthermore, a concept is presented that allows for individualized and personalized selection and addresses the need to find appropriate content, all in an easy and intuitive way. Especially in the realm of television, the demand that such systems cater to the needs of the audience is constantly growing. Our approach combines content-filtering methods, state-of-the-art classification techniques, and mechanisms well known from information retrieval and text mining, all utilized for the generation of recommendations in a promising new way; concepts from the area of collaborative tagging systems are also used. An extensive experimental evaluation produced several interesting findings and proves the applicability of our approach. In contrast to the "lean-back" experience of traditional media consumption, interactive media services offer a way to enable the active participation of the audience. We therefore present a concept that enables the personalized use of interactive media services on mobile devices. Finally, a use case enriching TV with additional content and services demonstrates the feasibility of this concept.
    Today's world of media and multimedia content is almost exclusively content- and device-oriented. Various systems and developments focus primarily on the manner of content presentation and on technical, mostly device-dependent specifics. The growing amount and variety of multimedia content and the increasing use of mobile devices urgently require a rethinking of the design of multimedia systems and frameworks. Instead of clinging to the rather rigid and passive concepts familiar from the TV domain, the user should move into the focus of multimedia concepts. Helping users cope with this ever more complex and difficult situation requires a change in the basic paradigm of media consumption; focusing on the user counteracts the situation described. This thesis draws on concepts from the field of user-centrism, transfers them to the media domain, and applies them toward a more user-specific and user-oriented design. The focus here is on the TV domain, although most of the concepts also carry over to media use in general. A framework supporting the most important user-centric concepts in the multimedia domain is presented. To accommodate the trend toward mobile media use, the framework enables the consumption of multimedia services and content across the boundaries of different devices and networks (session mobility). Using a mobile agent platform in combination with the MPEG-21 standard, a new and flexibly extensible approach to session mobility was realized. In view of the steadily growing amount of content and services, this thesis presents a concept for the simple, individualized selection and discovery of interesting content and services in a context-specific way: concepts and methods of content-based filtering, state-of-the-art classification mechanisms, and methods from text mining are employed in a new way in a multimedia recommender system, and Web 2.0 methods are integrated as a tag-based collaborative component. A comprehensive evaluation demonstrated both the feasibility and the added value of this component. Our iTV component enables more active participation in media consumption: it supports offering and using interactive services on mobile devices alongside the media being consumed. The feasibility of this concept was demonstrated with a scenario enriching TV broadcasts with interactive services.
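
    As an illustration of the content-based filtering ingredient of such a recommender, a minimal sketch assuming scikit-learn; the programme descriptions and the user profile below are invented.

```python
# Hedged sketch: rank TV programmes by cosine similarity between TF-IDF
# vectors of their descriptions and a user profile built from history/tags.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

programmes = {
    "News at Nine": "politics economy daily news report",
    "Goal!": "football league match highlights sport",
    "Star Kitchen": "cooking recipes chef kitchen show",
}
user_profile = "football sport match"  # placeholder viewing-history profile

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(programmes.values())
profile_vec = vectorizer.transform([user_profile])

scores = cosine_similarity(profile_vec, doc_matrix).ravel()
for title, score in sorted(zip(programmes, scores), key=lambda t: -t[1]):
    print(f"{score:.2f}  {title}")  # highest-scoring recommendation first
```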