53,271 research outputs found
FSMJ: Feature Selection with Maximum Jensen-Shannon Divergence for Text Categorization
In this paper, we present a new wrapper feature selection approach based on
Jensen-Shannon (JS) divergence, termed feature selection with maximum
JS-divergence (FSMJ), for text categorization. Unlike most existing feature
selection approaches, the proposed FSMJ approach is based on real-valued
features which provide more information for discrimination than binary-valued
features used in conventional approaches. We show that the FSMJ is a greedy
approach and the JS-divergence monotonically increases when more features are
selected. We conduct several experiments on real-life data sets, compared with
the state-of-the-art feature selection approaches for text categorization. The
superior performance of the proposed FSMJ approach demonstrates its
effectiveness and further indicates its wide potential applications on data
mining.Comment: 8 pages, 6 figures, World Congress on Intelligent Control and
Automation, 201
Toward Optimal Feature Selection in Naive Bayes for Text Categorization
Automated feature selection is important for text categorization to reduce
the feature size and to speed up the learning process of classifiers. In this
paper, we present a novel and efficient feature selection framework based on
the Information Theory, which aims to rank the features with their
discriminative capacity for classification. We first revisit two information
measures: Kullback-Leibler divergence and Jeffreys divergence for binary
hypothesis testing, and analyze their asymptotic properties relating to type I
and type II errors of a Bayesian classifier. We then introduce a new divergence
measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure
multi-distribution divergence for multi-class classification. Based on the
JMH-divergence, we develop two efficient feature selection methods, termed
maximum discrimination () and methods, for text categorization.
The promising results of extensive experiments demonstrate the effectiveness of
the proposed approaches.Comment: This paper has been submitted to the IEEE Trans. Knowledge and Data
Engineering. 14 pages, 5 figure
KACST Arabic Text Classification Project: Overview and Preliminary Results
Electronically formatted Arabic free-texts can be found in abundance these days on the World Wide Web, often linked to commercial enterprises and/or government organizations. Vast tracts of knowledge and relations lie hidden within these texts, knowledge that can be exploited once the correct intelligent tools have been identified and applied. For example, text mining may help with text classification and categorization. Text classification aims to automatically assign text to a predefined category based on identifiable linguistic features. Such a process has different useful applications including, but not restricted to, E-Mail spam detection, web pages content filtering, and automatic message routing. In this paper an overview of King Abdulaziz City for Science and Technology (KACST) Arabic Text Classification Project will be illustrated along with some preliminary results. This project will contribute to the better understanding and elaboration of Arabic text classification techniques
Automatic categorization of diverse experimental information in the bioscience literature
Background:
Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and thus, is usually time consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental datatypes in the biocuration process at WormBase for the past two years and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reducing time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature such as C. elegans and D. melanogaster to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
Results:
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genomics Informatics (MGI). It is being used in the curation work flow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Conclusions:
Our methods are applicable to a variety of data types with training set containing several hundreds to a few thousand documents. It is completely automatic and, thus can be readily incorporated to different workflow at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort
The Role of Text Pre-processing in Sentiment Analysis
It is challenging to understand the latest trends and summarise the state or general opinions about products due to the big diversity and size of social media data, and this creates the need of automated and real time opinion extraction and mining. Mining online opinion is a form of sentiment analysis that is treated as a difficult text classification task. In this paper, we explore the role of text pre-processing in sentiment analysis, and report on experimental results that demonstrate that with appropriate feature selection and representation, sentiment analysis accuracies using support vector machines (SVM) in this area may be significantly improved. The level of accuracy achieved is shown to be comparable to the ones achieved in topic categorisation although sentiment analysis is considered to be a much harder problem in the literature
Recommended from our members
Predicting Category Intuitiveness With the Rational Model, the Simplicity Model, and the Generalized Context Model
Naïve observers typically perceive some groupings for a set of stimuli as more intuitive than others. The problem of predicting category intuitiveness has been historically considered the remit of models of unsupervised categorization. In contrast, this article develops a measure of category intuitiveness from one of the most widely supported models of supervised categorization, the generalized context model (GCM). Considering different category assignments for a set of instances, the authors asked how well the GCM can predict the classification of each instance on the basis of all the other instances. The category assignment that results in the smallest prediction error is interpreted as the most intuitive for the GCM—the authors refer to this way of applying the GCM as “unsupervised GCM.” The authors systematically compared predictions of category intuitiveness from the unsupervised GCM and two models of unsupervised categorization: the simplicity model and the rational model. The unsupervised GCM compared favorably with the simplicity model and the rational model. This success of the unsupervised GCM illustrates that the distinction between supervised and unsupervised categorization may need to be reconsidered. However, no model emerged as clearly superior, indicating that there is more work to be done in understanding and modeling category intuitiveness
- …