A Comparative Study of Text Classification Methods: An Experimental Approach
Text classification is the process of assigning a text document to one or more predefined categories based on its content. This paper reports experiments with our implementation of three popular machine learning algorithms and compares their performance on the categorization of sample English text documents. Three well-known classifiers, namely Naïve Bayes (NB), Centroid-Based (CB), and K-Nearest Neighbor (KNN), were implemented and tested on the same dataset, R-52, chosen from the Reuters-21578 corpus. For performance evaluation, classical metrics such as precision, recall, and micro- and macro-averaged F1 measures were used. For statistical comparison of the three classifiers, the Randomized Block Design method with a t-test was applied. The experimental results showed that the centroid-based classifier performed best, with a micro F1 measure of 97%. NB and KNN also produced satisfactory performance on the test dataset, with micro F1 measures of 91% and 89%, respectively.
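The abstract reports both micro- and macro-averaged F1, which can diverge sharply when categories are unbalanced. The sketch below is not the paper's code; it is a minimal illustration for the single-label case, and the function names and the toy labels (borrowed from Reuters category names) are assumptions made for the example. Micro-averaging pools the true-positive, false-positive, and false-negative counts over all categories before computing F1, while macro-averaging computes F1 per category and takes the unweighted mean.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted category gains a false positive
            fn[truth] += 1  # true category gains a false negative
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy labels, not data from the paper.
y_true = ["earn", "earn", "earn", "acq", "acq", "crude"]
y_pred = ["earn", "earn", "acq", "acq", "acq", "earn"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the rare category `crude` is never predicted correctly, so it pulls the macro average well below the micro average, which is dominated by the frequent categories.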
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
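The three problems named in the abstract (document representation, classifier construction, and classifier evaluation) can be illustrated with a very small sketch. This is not taken from the survey: the bag-of-words representation, the centroid classifier with cosine similarity, the accuracy measure, and the toy documents are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


def represent(text):
    """Document representation: lowercase bag-of-words term counts."""
    return Counter(text.lower().split())


def train_centroids(labeled_docs):
    """Classifier construction: learn one term vector per category from
    preclassified documents. Vectors are summed rather than averaged,
    which is equivalent here because cosine similarity ignores scale."""
    centroids = defaultdict(Counter)
    for text, label in labeled_docs:
        centroids[label].update(represent(text))
    return dict(centroids)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(centroids, text):
    """Assign the category whose centroid is most similar to the text."""
    vec = represent(text)
    return max(sorted(centroids), key=lambda c: cosine(vec, centroids[c]))


def accuracy(centroids, labeled_docs):
    """Classifier evaluation: fraction of held-out documents labeled correctly."""
    correct = sum(classify(centroids, t) == label for t, label in labeled_docs)
    return correct / len(labeled_docs)


# Hypothetical toy corpus, for illustration only.
train = [
    ("profit rose this quarter", "earn"),
    ("quarterly profit and dividend", "earn"),
    ("company agrees to buy rival", "acq"),
    ("merger deal to buy shares", "acq"),
]
test = [
    ("dividend and profit up", "earn"),
    ("rival accepts merger deal", "acq"),
]
model = train_centroids(train)
print(classify(model, "dividend and profit up"))
print(accuracy(model, test))
```

A realistic system would replace each stage with something stronger, for example term weighting in the representation step and per-category precision and recall in the evaluation step, but the division into three stages stays the same.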