Comparing SVM and Naive Bayes classifiers for text categorization with Wikitology as knowledge enrichment
The activity of labeling documents according to their content is known as
text categorization. Many experiments have been carried out to enhance text
categorization by adding background knowledge to the document using knowledge
repositories such as WordNet, the Open Directory Project (ODP), Wikipedia and
Wikitology. In our previous work, we carried out intensive experiments by
extracting knowledge from Wikitology and evaluating the results on a Support
Vector Machine with 10-fold cross-validation. The results clearly indicate
that Wikitology is far better than the other knowledge bases. In this paper we
compare Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers under
text enrichment through Wikitology. We validated the results with 10-fold
cross-validation and show that NB gives an improvement of +28.78%, whereas
SVM gives an improvement of +6.36%, when compared with baseline results. The
Naïve Bayes classifier is the better choice when documents are enriched through
any external knowledge base.
Comment: 5 pages
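The evaluation protocol named in this abstract (10-fold cross-validation of a Naïve Bayes classifier, then an improvement figure against a baseline) can be sketched in plain Python. This is only an illustrative sketch, not the authors' setup: the tokenized toy documents, the Laplace smoothing, the contiguous fold layout, and the reading of "improvement" as a relative percentage are all assumptions, and every function name is hypothetical.

```python
import math
from collections import Counter, defaultdict

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one.
    Assumes n >= k so that no fold is empty."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_nb(docs, labels):
    """Collect the counts a multinomial Naive Bayes model needs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            if t in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(docs, labels, k=10):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(docs), k):
        held_out = set(test_idx)
        train_docs = [d for i, d in enumerate(docs) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        model = train_nb(train_docs, train_labels)
        correct = sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

def relative_improvement(enriched, baseline):
    """Percentage gain of an enriched score over a baseline score (assumed definition)."""
    return 100.0 * (enriched - baseline) / baseline
```

Under these assumptions, enrichment would be compared by running `cross_validate` once on the plain documents and once on the enriched ones, then passing both scores to `relative_improvement`.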
Toward Optimal Feature Selection in Naive Bayes for Text Categorization
Automated feature selection is important for text categorization to reduce
the feature size and to speed up the learning process of classifiers. In this
paper, we present a novel and efficient feature selection framework based on
information theory, which aims to rank the features by their
discriminative capacity for classification. We first revisit two information
measures: Kullback-Leibler divergence and Jeffreys divergence for binary
hypothesis testing, and analyze their asymptotic properties relating to type I
and type II errors of a Bayesian classifier. We then introduce a new divergence
measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure
multi-distribution divergence for multi-class classification. Based on the
JMH-divergence, we develop two efficient feature selection methods, termed
maximum discrimination (MD) and MD-χ² methods, for text categorization.
The promising results of extensive experiments demonstrate the effectiveness of
the proposed approaches.
Comment: This paper has been submitted to the IEEE Trans. Knowledge and Data
Engineering. 14 pages, 5 figures
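The two information measures this abstract revisits have standard definitions for discrete distributions: KL(P || Q) = sum_i p_i log(p_i / q_i), and the Jeffreys divergence is the symmetrized sum KL(P || Q) + KL(Q || P). The sketch below computes both and, purely as an illustration, ranks terms by the Jeffreys divergence between their presence probabilities in two classes. The ranking helper and its input layout are assumptions; it is not the paper's JMH-based method, which the abstract does not specify.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

def jeffreys_divergence(p, q):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def rank_terms(presence_by_class):
    """Rank terms by Jeffreys divergence between two Bernoulli distributions.

    presence_by_class maps a term to (P(term present | class A),
    P(term present | class B)); this layout is a hypothetical choice.
    """
    def score(term):
        pa, pb = presence_by_class[term]
        return jeffreys_divergence([pa, 1.0 - pa], [pb, 1.0 - pb])
    return sorted(presence_by_class, key=score, reverse=True)
```

A term that appears equally often in both classes scores zero and falls to the bottom of the ranking, which matches the intuition that it carries no discriminative information.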
Evolving text classification rules with genetic programming
We describe a novel method that uses genetic programming to create compact classification rules from combinations of N-grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers in terms of precision and recall when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters-21578 dataset. We also suggest that the rules may have a number of other uses beyond classification and provide a basis for text mining applications.
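The fitness idea in this abstract (score a candidate rule built from character N-grams by its precision and recall on training documents) can be sketched as follows. The rule form (required and forbidden substrings) and the use of F1 to combine precision and recall are assumptions made for illustration; the abstract does not state the authors' function set or fitness formula, and the genetic search itself is omitted.

```python
def char_ngrams(text, n):
    """All distinct character n-grams of a string (candidate rule terminals)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def rule_matches(text, required, forbidden=()):
    """Hypothetical rule: every required n-gram occurs and no forbidden one does."""
    return all(g in text for g in required) and not any(g in text for g in forbidden)

def precision_recall(predictions, truths):
    """Precision and recall of boolean predictions against boolean labels."""
    tp = sum(1 for p, t in zip(predictions, truths) if p and t)
    fp = sum(1 for p, t in zip(predictions, truths) if p and not t)
    fn = sum(1 for p, t in zip(predictions, truths) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1_fitness(rule, documents, truths):
    """Fitness of a rule as the harmonic mean of precision and recall (assumed)."""
    predictions = [rule(doc) for doc in documents]
    precision, recall = precision_recall(predictions, truths)
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, on a toy set where the positive documents are the ones about wheat, the rule `lambda d: rule_matches(d, ["whe"])` would score higher than a rule keyed on an N-gram shared with unrelated documents.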
Linking Data Across Universities: An Integrated Video Lectures Dataset
This paper presents our work and experience interlinking educational information across universities through the use of Linked Data principles and technologies. More specifically, this paper focuses on selecting, extracting, structuring and interlinking information about video lectures produced by 27 different educational institutions. For this purpose, selected information from several websites and YouTube channels has been scraped and structured according to well-known vocabularies, such as FOAF or the W3C Ontology for Media Resources. To integrate this information, the extracted videos have been categorized under a common classification space, the taxonomy defined by the Open Directory Project. An evaluation of this categorization process has been conducted, obtaining a 98% degree of coverage and an 89% degree of correctness. As a result of this process, a new Linked Data dataset has been released, containing more than 14,000 video lectures from 27 different institutions, categorized under a common classification scheme.