Toward Optimal Feature Selection in Naive Bayes for Text Categorization
Automated feature selection is important for text categorization to reduce
the feature size and to speed up the learning process of classifiers. In this
paper, we present a novel and efficient feature selection framework based on
the Information Theory, which aims to rank the features with their
discriminative capacity for classification. We first revisit two information
measures: Kullback-Leibler divergence and Jeffreys divergence for binary
hypothesis testing, and analyze their asymptotic properties relating to type I
and type II errors of a Bayesian classifier. We then introduce a new divergence
measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure
multi-distribution divergence for multi-class classification. Based on the
JMH-divergence, we develop two efficient feature selection methods, termed
maximum discrimination () and methods, for text categorization.
The promising results of extensive experiments demonstrate the effectiveness of
the proposed approaches.
Comment: This paper has been submitted to the IEEE Trans. Knowledge and Data
Engineering. 14 pages, 5 figures
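As a rough illustration of the two information measures the abstract above revisits, the following sketch computes the Kullback-Leibler divergence and the Jeffreys divergence (the symmetrised sum of the two KL directions) for discrete distributions. The function names and the example distributions are assumptions made for illustration; this is not the paper's JMH-divergence or its feature selection procedure.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys_divergence(p, q):
    """Jeffreys divergence: KL(p || q) + KL(q || p), symmetric in p and q."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical class-conditional distributions of one feature under two classes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

print(kl_divergence(p, q), kl_divergence(q, p), jeffreys_divergence(p, q))
```

KL divergence is asymmetric, so the two printed KL values differ, while the Jeffreys divergence gives the same value whichever distribution is passed first.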
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
Chi-square-based scoring function for categorization of MEDLINE citations
Objectives: Text categorization has been used in biomedical informatics for
identifying documents containing relevant topics of interest. We developed a
simple method that uses a chi-square-based scoring function to determine the
likelihood of MEDLINE citations containing genetically relevant topics. Methods: Our
procedure requires construction of a genetic and a nongenetic domain document
corpus. We used MeSH descriptors assigned to MEDLINE citations for this
categorization task. We compared frequencies of MeSH descriptors between two
corpora applying chi-square test. A MeSH descriptor was considered to be a
positive indicator if its relative observed frequency in the genetic domain
corpus was greater than its relative observed frequency in the nongenetic
domain corpus. The output of the proposed method is a list of scores for all
the citations, with the highest score given to those citations containing MeSH
descriptors typical for the genetic domain. Results: Validation was done on a
set of 734 manually annotated MEDLINE citations. It achieved predictive
accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method
by comparing it to three machine learning algorithms (support vector machines,
decision trees, naïve Bayes). Although the differences were not statistically
significant, the results showed that our chi-square scoring performs as well
as the compared machine learning algorithms. Conclusions: We suggest that the
chi-square scoring is an effective solution to help categorize MEDLINE
citations. The algorithm is implemented in the BITOLA literature-based
discovery support system as a preprocessor for gene symbol disambiguation
process.
Comment: 34 pages, 2 figures
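The descriptor comparison described in the abstract above can be sketched as a Pearson chi-square statistic on a 2x2 table (descriptor present or absent, genetic or nongenetic corpus). This is a minimal sketch under stated assumptions: the function names are hypothetical, and giving the statistic a negative sign for descriptors that are relatively more frequent in the nongenetic corpus is an assumption, since the abstract does not say how per-descriptor values are combined into citation scores.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table

                         genetic   nongenetic
        has descriptor      a          b
        lacks descriptor    c          d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence either way
    return n * (a * d - b * c) ** 2 / denom

def signed_descriptor_score(in_gen, n_gen, in_non, n_non):
    """Chi-square value, signed by which corpus uses the descriptor more often.

    Positive when the relative frequency in the genetic corpus is greater
    (the "positive indicator" rule from the abstract); the negative sign
    for the opposite case is an assumption of this sketch.
    """
    chi2 = chi_square_2x2(in_gen, in_non, n_gen - in_gen, n_non - in_non)
    return chi2 if in_gen / n_gen > in_non / n_non else -chi2

# Hypothetical counts: descriptor in 30 of 100 genetic and 10 of 100 nongenetic citations.
print(signed_descriptor_score(30, 100, 10, 100))
```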
On the adequacy of current empirical evaluations of formal models of categorization
Categorization is one of the fundamental building blocks of cognition, and the study of categorization is notable for the extent to which formal modeling has been a central and influential component of research. However, the field has seen a proliferation of noncomplementary models with little consensus on the relative adequacy of these accounts. Progress in assessing the relative adequacy of formal categorization models has, to date, been limited because (a) formal model comparisons are narrow in the number of models and phenomena considered and (b) models do not often clearly define their explanatory scope. Progress is further hampered by the practice of fitting models with arbitrarily variable parameters to each data set independently. Reviewing examples of good practice in the literature, we conclude that model comparisons are most fruitful when relative adequacy is assessed by comparing well-defined models on the basis of the number and proportion of irreversible, ordinal, penetrable successes (principles of minimal flexibility, breadth, good-enough precision, maximal simplicity, and psychological focus)
An Intelligent System For Arabic Text Categorization
Text categorization (classification) is the process of assigning documents to a predefined set of categories based on their content. This paper presents an intelligent Arabic text categorization system built on machine learning algorithms. Several stemming and feature selection algorithms are tried, documents are represented using several term weighting schemes, and the k-nearest neighbor and Rocchio classifiers are used for the classification step. Experiments on a self-collected corpus show that the suggested hybrid of statistical and light stemmers is the most suitable stemming algorithm for Arabic. The results also show that a hybrid of document frequency and information gain is the preferable feature selection criterion and that normalized tf-idf is the best weighting scheme. Finally, the Rocchio classifier outperforms the k-nearest neighbor classifier. The experimental results indicate that the proposed model is efficient and gives a generalization accuracy of about 98%.
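The combination of normalized tf-idf weighting and a Rocchio classifier mentioned in the abstract above can be sketched as follows. This is a simplified sketch, not the paper's system: it uses the centroid-only form of Rocchio (no negative-example term), a smoothed idf formula chosen for illustration, no stemming or feature selection, and English placeholder tokens. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_idf(docs):
    """Smoothed inverse document frequency from tokenised training documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(doc, idf):
    """L2-normalised tf-idf vector (as a dict); unseen terms are ignored."""
    tf = Counter(t for t in doc if t in idf)
    vec = {t: f * idf[t] for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def fit_rocchio(docs, labels):
    """One centroid per class: the mean of that class's document vectors."""
    idf = fit_idf(docs)
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, w in vectorize(doc, idf).items():
            sums[label][t] += w
    centroids = {c: {t: w / counts[c] for t, w in s.items()}
                 for c, s in sums.items()}
    return idf, centroids

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(doc, idf, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = vectorize(doc, idf)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Toy corpus with placeholder tokens.
docs = [["goal", "match", "team"], ["team", "score", "goal"],
        ["stock", "market", "price"], ["price", "trade", "market"]]
labels = ["sport", "sport", "finance", "finance"]
idf, centroids = fit_rocchio(docs, labels)
print(predict(["goal", "team"], idf, centroids))
```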
FSMJ: Feature Selection with Maximum Jensen-Shannon Divergence for Text Categorization
In this paper, we present a new wrapper feature selection approach based on
Jensen-Shannon (JS) divergence, termed feature selection with maximum
JS-divergence (FSMJ), for text categorization. Unlike most existing feature
selection approaches, the proposed FSMJ approach is based on real-valued
features which provide more information for discrimination than binary-valued
features used in conventional approaches. We show that the FSMJ is a greedy
approach and the JS-divergence monotonically increases when more features are
selected. We conduct several experiments on real-life data sets, compared with
the state-of-the-art feature selection approaches for text categorization. The
superior performance of the proposed FSMJ approach demonstrates its
effectiveness and further indicates its wide potential applications on data
mining.
Comment: 8 pages, 6 figures, World Congress on Intelligent Control and
Automation, 201
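For reference, the Jensen-Shannon divergence named in the abstract above can be sketched for two discrete distributions with equal weights. This is only the basic two-distribution definition; the greedy FSMJ selection procedure and any multi-class weighting used in the paper are not reproduced here, and the function names are hypothetical.

```python
import math

def _kl(p, q):
    """KL(p || q); terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence with equal weights, in nats.

    m is the midpoint mixture, so m[i] > 0 wherever p[i] or q[i] is
    positive and both KL terms are always finite.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

# Hypothetical feature distributions under two classes.
print(js_divergence([0.7, 0.2, 0.1], [0.1, 0.3, 0.6]))
```

Unlike the KL divergence, this quantity is symmetric and bounded above by ln 2, which it reaches when the two distributions have disjoint support.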
Brain image clustering by wavelet energy and CBSSO optimization algorithm
The diagnosis of brain abnormalities is important for saving social and hospital resources. Wavelet energy is known as an effective feature detection technique that is efficient across different applications. This paper suggests a new method based on wavelet energy to automatically classify magnetic resonance imaging (MRI) brain images into two groups (normal and abnormal), using support vector machine (SVM) classification with chaotic binary shark smell optimization (CBSSO) to optimize the SVM weights.
The results of the suggested CBSSO-based KSVM compare favorably with several other methods in terms of sensitivity and authenticity. The proposed CAD system can additionally be used to categorize images with various pathological conditions, types, and illness modes.