454 research outputs found

A novel two stage scheme utilizing the test set for model selection in text classification

Authors: Mayo Michael; Pfahringer Bernhard; Reutemann Peter
Publication venue: 'University of Technology, Sydney (UTS)'
Publication date: 01/01/2005

Text classification is a natural application domain for semi-supervised learning: labeling documents is expensive, whereas unlabeled documents are usually abundant. We describe a novel, simple two-stage scheme based on dagging that allows the test set to be used in model selection. The dagging ensemble can also be used by itself instead of the original classifier. We evaluate the performance of a meta classifier that chooses between various base learners and their respective dagging ensembles. The selection process seems to perform robustly, especially when only a small percentage of labels is available for training.

Research Commons@Waikato

Twitter’s Sentiment Analysis on Gsm Services using Multinomial Naïve Bayes

Authors: Djatna Taufik; Kusuma Wisnu Ananta; Susanti Aisah Rini
Publication venue: 'Universitas Ahmad Dahlan'
Publication date: 01/09/2017

The number of telecommunication users grows rapidly each year. As people demand better service for Short Message Service (SMS), telephone, and data use, service providers compete to attract customers, and customer feedback on platforms such as Twitter is one of their sources of information. Multinomial Naïve Bayes Tree, adapted from the Multinomial Naïve Bayes and Decision Tree methods, is a data mining technique used to classify raw data or feedback from customers. The Multinomial Naïve Bayes method specifically addresses term frequency in the text of a sentence or document. The documents used in this study are comments by Twitter users about GSM telecommunications providers in Indonesia. This research employed the Multinomial Naïve Bayes Tree classification technique to categorize customer sentiment towards telecommunication providers in Indonesia. Sentiment analysis included only the positive, negative, and neutral classes. This research generated a Decision Tree rooted at the feature "aktif", where the probability of the feature "aktif" came from the positive class in the Multinomial Naïve Bayes method. The evaluation showed that the highest classification accuracy of the Multinomial Naïve Bayes Tree (MNBTree) method was 16.26% using 145 features. Moreover, Multinomial Naïve Bayes (MNB) yielded the highest accuracy of 73.15% using the whole dataset with 1665 features. The expected benefit of this research is that Indonesian telecommunications providers can evaluate their performance and services in order to meet customers' varied needs.

TELKOMNIKA (Telecommunication Computing Electronics and Control)

UAD Journal Management System
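
The frequency-based Multinomial Naïve Bayes classifier mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `train_mnb` and `predict_mnb`, the add-one smoothing constant `alpha`, and the toy English tokens are assumptions made only for illustration, and the MNBTree variant is not covered.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, alpha=1.0):
    """Fit a multinomial naive Bayes model on (tokens, label) pairs."""
    class_docs = Counter()               # number of documents per class
    word_counts = defaultdict(Counter)   # term frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "log_prior": {c: math.log(k / len(docs)) for c, k in class_docs.items()},
        "counts": dict(word_counts),
        "totals": {c: sum(word_counts[c].values()) for c in class_docs},
        "vocab": vocab,
        "alpha": alpha,                  # assumed add-alpha (Laplace) smoothing
    }


def predict_mnb(model, tokens):
    """Return the class with the highest smoothed log-probability."""
    alpha, vocab = model["alpha"], model["vocab"]
    scores = {}
    for label, log_prior in model["log_prior"].items():
        denom = model["totals"][label] + alpha * len(vocab)
        score = log_prior
        for tok in tokens:
            if tok in vocab:             # tokens never seen in training are skipped
                score += math.log((model["counts"][label][tok] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the study.
TRAIN = [
    (["fast", "good", "signal"], "positive"),
    (["slow", "bad", "signal"], "negative"),
]
model = train_mnb(TRAIN)
print(predict_mnb(model, ["good", "fast"]))
```

Each repeated token contributes once per occurrence, which is how the multinomial model takes term frequency into account.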

Augmenting Naive Bayes Classifiers with Statistical Language Models

Author: Peng Fuchun
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2003

We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier which allows for a local Markov dependence among observations; a model we refer to as the Chain Augmented Naive Bayes (CAN) Bayes classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes, allowing a local Markov chain dependence in the observed variables, while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which allows one to obtain better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language independent and task independent nature of these classifiers, we present experimental results on several text classification problems (authorship attribution, text genre classification, and topic detection) in several languages (Greek, English, Japanese and Chinese). We then systematically study the key factors in the CAN model that can influence the classification performance, and analyze the strengths and weaknesses of the model.

ScholarWorks@UMass Amherst
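
The chain-augmented idea in the abstract above, one n-gram language model per class with classification by the highest class-conditional likelihood, can be sketched as follows. This is an illustrative assumption rather than the author's code: it uses bigrams only, swaps in plain add-one smoothing where the abstract calls for more sophisticated smoothing, and the names `train_can`, `classify_can`, and `BOS` are hypothetical.

```python
import math
from collections import Counter

BOS = "<s>"  # hypothetical start-of-sequence marker


def train_can(docs):
    """Collect per-class bigram and context counts from (tokens, label) pairs."""
    bigrams, contexts, class_docs, vocab = {}, {}, Counter(), set()
    for tokens, label in docs:
        class_docs[label] += 1
        bg = bigrams.setdefault(label, Counter())
        ctx = contexts.setdefault(label, Counter())
        prev = BOS
        for tok in tokens:
            bg[(prev, tok)] += 1   # count of tok following prev in this class
            ctx[prev] += 1         # count of prev used as a context
            prev = tok
        vocab.update(tokens)
    log_prior = {c: math.log(k / len(docs)) for c, k in class_docs.items()}
    # One extra vocabulary slot is reserved for tokens unseen in training.
    return {"bigrams": bigrams, "contexts": contexts,
            "log_prior": log_prior, "vocab_size": len(vocab) + 1}


def classify_can(model, tokens, alpha=1.0):
    """Pick the class maximizing log P(c) + sum_i log P(w_i | w_{i-1}, c)."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        bg, ctx = model["bigrams"][label], model["contexts"][label]
        score, prev = log_prior, BOS
        for tok in tokens:
            num = bg[(prev, tok)] + alpha
            den = ctx[prev] + alpha * model["vocab_size"]
            score += math.log(num / den)
            prev = tok
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: both classes use the same words in a different order.
docs = [(["x", "y", "x", "y"], "A"), (["y", "x", "y", "x"], "B")]
can = train_can(docs)
print(classify_can(can, ["x", "y"]), classify_can(can, ["y", "x"]))
```

In this toy example a unigram naive Bayes model could not separate the two classes, since their word counts are identical; the bigram dependence is what distinguishes them.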

A unified approach to authorship attribution and verification

Authors: Font Valverde Martí; Ginebra Molins Josep; Puig Oriol Xavier
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2016

In authorship attribution, one assigns texts from an unknown author to either one of two or more candidate authors by comparing the disputed texts with texts known to have been written by the candidate authors. In authorship verification, one decides whether a text or a set of texts could have been written by a given author. These two problems are usually treated separately. By assuming an open-set classification framework for the attribution problem, contemplating the possibility that none of the candidate authors is the unknown author, the verification problem becomes a special case of the attribution problem. Here both problems are posed as a formal Bayesian multinomial model selection problem and are given a closed-form solution, tailored for categorical data, naturally incorporating text length and dependence in the analysis, and coping well with settings with a small number of training texts. The approach to authorship verification is illustrated by exploring whether a court ruling sentence could have been written by the judge that signs it, and the approach to authorship attribution is illustrated by revisiting the authorship attribution of the Federalist papers and through a small simulation study.

Peer reviewed. Postprint (author's final draft).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Analyzing Twitter Feeds to Facilitate Crises Informatics and Disaster Response During Mass Emergencies

Author: Kaur Arshdeep
Publication venue: Dublin Institute of Technology
Publication date: 01/01/2019

It is common practice these days for the general public to use various micro-blogging platforms, predominantly Twitter, to share ideas, opinions and information about things and life. Twitter is also increasingly used as a popular channel for sharing information during natural disasters and mass emergencies: to update and communicate the extent of the geographic phenomena, report the affected population and casualties, request or provide volunteering services, and share the status of the disaster recovery process initiated by humanitarian-aid and disaster-management organizations. Recent research in this area has affirmed the potential use of such social media data for various disaster response tasks. Even though social media data is massive, open and free, there is a significant limitation in making sense of this data because of its high volume, variety, velocity, value, variability and veracity. The current work provides a comprehensive framework of text processing and analysis performed on several thousand tweets shared on Twitter during natural disaster events. Specifically, this work employs state-of-the-art machine learning techniques from natural language processing on tweet content to process the enormous volume of data generated at the time of disasters. This study shall serve as a basis for providing useful, actionable information to crisis management and mitigation teams in planning and preparing an effective disaster response, and for facilitating the development of future automated systems for handling crisis situations.

Arrow@TUDublin

Scalable Text Mining with Sparse Generative Models

Author: Puurula Antti
Publication venue: 'University of Waikato'
Publication date: 22/06/2015

The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets is conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with an order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.

arXiv.org e-Print Archive

Research Commons@Waikato
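
The idea of sparse inference with an inverted index, mentioned in the abstract above, can be sketched for a smoothed multinomial model. This is a simplified illustration under stated assumptions, not the thesis's algorithm: it assumes add-alpha smoothing, so that every term a class has never seen shares one class-specific default log-probability, and the names `build_index` and `sparse_scores` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(class_term_counts, alpha=1.0):
    """Map each term to (class, delta) postings for classes that contain it.

    delta = log P(term | class) - default[class], where default[class] is the
    smoothed log-probability of any term the class has never seen.
    """
    vocab = {t for counts in class_term_counts.values() for t in counts}
    default, index = {}, defaultdict(list)
    for c, counts in class_term_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        default[c] = math.log(alpha / denom)
        for term, k in counts.items():
            index[term].append((c, math.log((k + alpha) / denom) - default[c]))
    return index, default


def sparse_scores(index, default, log_prior, doc_tokens):
    """Score all classes, touching only postings of terms in the document."""
    n = len(doc_tokens)
    # Start every class at its prior plus the default cost of n unseen terms.
    scores = {c: log_prior[c] + n * default[c] for c in default}
    for term, tf in Counter(doc_tokens).items():
        for c, delta in index.get(term, ()):
            scores[c] += tf * delta   # correct the default for observed terms
    return scores


# Hypothetical toy counts.
counts = {
    "sports": Counter({"ball": 3, "goal": 2}),
    "tech": Counter({"code": 4, "ball": 1}),
}
log_prior = {"sports": math.log(0.5), "tech": math.log(0.5)}
index, default = build_index(counts)
scores = sparse_scores(index, default, log_prior, ["ball", "goal", "goal", "unknown"])
print(max(scores, key=scores.get))
```

The inner loop visits only (term, class) pairs with nonzero counts, so its cost follows the sparsity of the data rather than the product of vocabulary size and number of classes. The initialization still touches every class once, which a production system would need to handle differently at the scale of a million classes.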