83,053 research outputs found
Toward Optimal Feature Selection in Naive Bayes for Text Categorization
Automated feature selection is important for text categorization to reduce
the feature size and to speed up the learning process of classifiers. In this
paper, we present a novel and efficient feature selection framework based on
the Information Theory, which aims to rank the features with their
discriminative capacity for classification. We first revisit two information
measures: Kullback-Leibler divergence and Jeffreys divergence for binary
hypothesis testing, and analyze their asymptotic properties relating to type I
and type II errors of a Bayesian classifier. We then introduce a new divergence
measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure
multi-distribution divergence for multi-class classification. Based on the
JMH-divergence, we develop two efficient feature selection methods, termed
maximum discrimination () and methods, for text categorization.
The promising results of extensive experiments demonstrate the effectiveness of
the proposed approaches.Comment: This paper has been submitted to the IEEE Trans. Knowledge and Data
Engineering. 14 pages, 5 figure
EEF: Exponentially Embedded Families with Class-Specific Features for Classification
In this letter, we present a novel exponentially embedded families (EEF)
based classification method, in which the probability density function (PDF) on
raw data is estimated from the PDF on features. With the PDF construction, we
show that class-specific features can be used in the proposed classification
method, instead of a common feature subset for all classes as used in
conventional approaches. We apply the proposed EEF classifier for text
categorization as a case study and derive an optimal Bayesian classification
rule with class-specific feature selection based on the Information Gain (IG)
score. The promising performance on real-life data sets demonstrates the
effectiveness of the proposed approach and indicates its wide potential
applications.Comment: 9 pages, 3 figures, to be published in IEEE Signal Processing Letter.
IEEE Signal Processing Letter, 201
Recommended from our members
Hierarchical classification for multiple, distributed web databases
The proliferation of online information resources increases the importance of effective and efficient distributed searching. Our research aims to provide an alternative hierarchical categorization and search capability based on a Bayesian network learning algorithm. Our proposed approach, which is grounded on automatic textual analysis of subject content of online web databases, attempts to address the database selection problem by first classifying web databases into a hierarchy of topic categories. The experimental results reported demonstrate that such a classification approach not only effectively reduces the class search space, but also helps to significantly improve the accuracy of classification performance
Assessing similarity of feature selection techniques in high-dimensional domains
Recent research efforts attempt to combine multiple feature selection techniques instead of using a single one. However, this combination is often made on an “ad hoc” basis, depending on the specific problem at hand, without considering the degree of diversity/similarity of the involved methods. Moreover, though it is recognized that different techniques may return quite dissimilar outputs, especially in high dimensional/small sample size domains, few direct comparisons exist that quantify these differences and their implications on classification performance. This paper aims to provide a contribution in this direction by proposing a general methodology for assessing the similarity between the outputs of different feature selection methods in high dimensional classification problems. Using as benchmark the genomics domain, an empirical study has been conducted to compare some of the most popular feature selection methods, and useful insight has been obtained about their pattern of agreement
Multi-Instance Multi-Label Learning
In this paper, we propose the MIML (Multi-Instance Multi-Label learning)
framework where an example is described by multiple instances and associated
with multiple class labels. Compared to traditional learning frameworks, the
MIML framework is more convenient and natural for representing complicated
objects which have multiple semantic meanings. To learn from MIML examples, we
propose the MimlBoost and MimlSvm algorithms based on a simple degeneration
strategy, and experiments show that solving problems involving complicated
objects with multiple semantic meanings in the MIML framework can lead to good
performance. Considering that the degeneration process may lose information, we
propose the D-MimlSvm algorithm which tackles MIML problems directly in a
regularization framework. Moreover, we show that even when we do not have
access to the real objects and thus cannot capture more information from real
objects by using the MIML representation, MIML is still useful. We propose the
InsDif and SubCod algorithms. InsDif works by transforming single-instances
into the MIML representation for learning, while SubCod works by transforming
single-label examples into the MIML representation for learning. Experiments
show that in some tasks they are able to achieve better performance than
learning the single-instances or single-label examples directly.Comment: 64 pages, 10 figures; Artificial Intelligence, 201
- …