16,559 research outputs found
Toward Optimal Feature Selection in Naive Bayes for Text Categorization
Automated feature selection is important for text categorization to reduce
the feature size and to speed up the learning process of classifiers. In this
paper, we present a novel and efficient feature selection framework based on
the Information Theory, which aims to rank the features with their
discriminative capacity for classification. We first revisit two information
measures: Kullback-Leibler divergence and Jeffreys divergence for binary
hypothesis testing, and analyze their asymptotic properties relating to type I
and type II errors of a Bayesian classifier. We then introduce a new divergence
measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure
multi-distribution divergence for multi-class classification. Based on the
JMH-divergence, we develop two efficient feature selection methods, termed
maximum discrimination () and methods, for text categorization.
The promising results of extensive experiments demonstrate the effectiveness of
the proposed approaches.Comment: This paper has been submitted to the IEEE Trans. Knowledge and Data
Engineering. 14 pages, 5 figure
KACST Arabic Text Classification Project: Overview and Preliminary Results
Electronically formatted Arabic free-texts can be found in abundance these days on the World Wide Web, often linked to commercial enterprises and/or government organizations. Vast tracts of knowledge and relations lie hidden within these texts, knowledge that can be exploited once the correct intelligent tools have been identified and applied. For example, text mining may help with text classification and categorization. Text classification aims to automatically assign text to a predefined category based on identifiable linguistic features. Such a process has different useful applications including, but not restricted to, E-Mail spam detection, web pages content filtering, and automatic message routing. In this paper an overview of King Abdulaziz City for Science and Technology (KACST) Arabic Text Classification Project will be illustrated along with some preliminary results. This project will contribute to the better understanding and elaboration of Arabic text classification techniques
FSMJ: Feature Selection with Maximum Jensen-Shannon Divergence for Text Categorization
In this paper, we present a new wrapper feature selection approach based on
Jensen-Shannon (JS) divergence, termed feature selection with maximum
JS-divergence (FSMJ), for text categorization. Unlike most existing feature
selection approaches, the proposed FSMJ approach is based on real-valued
features which provide more information for discrimination than binary-valued
features used in conventional approaches. We show that the FSMJ is a greedy
approach and the JS-divergence monotonically increases when more features are
selected. We conduct several experiments on real-life data sets, compared with
the state-of-the-art feature selection approaches for text categorization. The
superior performance of the proposed FSMJ approach demonstrates its
effectiveness and further indicates its wide potential applications on data
mining.Comment: 8 pages, 6 figures, World Congress on Intelligent Control and
Automation, 201
- …