227,775 research outputs found
Feature Selection for Text and Image Data Using Differential Evolution with SVM and NaĂŻve Bayes Classifiers
Classification problems are increasing in various important applications such as text categorization, images, medical imaging diagnosis and bimolecular analysis etc. due to large amount of attribute set. Feature extraction methods in case of large dataset play an important role to reduce the irrelevant feature and thereby increases the performance of classifier algorithm. There exist various methods based on machine learning for text and image classification. These approaches are utilized for dimensionality reduction which aims to filter less informative and outlier data. Therefore, these approaches provide compact representation and computationally better tractable accuracy. At the same time, these methods can be challenging if the search space is doubled multiple time. To optimize such challenges, a hybrid approach is suggested in this paper. The proposed approach uses differential evolution (DE) for feature selection with naïve bayes (NB) and support vector machine (SVM) classifiers to enhance the performance of selected classifier. The results are verified using text and image data which reflects improved accuracy compared with other conventional techniques. A 25 benchmark datasets (UCI) from different domains are considered to test the proposed algorithms. A comparative study between proposed hybrid classification algorithms are presented in this work. Finally, the experimental result shows that the differential evolution with NB classifier outperforms and produces better estimation of probability terms. The proposed technique in terms of computational time is also feasible
Unionization method for changing opinion in sentiment classification using machine learning
Sentiment classification aims to determine whether an opinionated text expresses a positive, negative or neutral opinion. Most existing sentiment classification approaches have focused on supervised text classification techniques. One critical problem of sentiment classification is that a text collection may contain tens or hundreds of thousands of features, i.e. high dimensionality, which can be solved by dimension reduction approach. Nonetheless, although feature selection as a dimension reduction method can reduce feature space to provide a reduced feature subset, the size of the subset commonly requires further reduction. In this research, a novel dimension reduction approach called feature unionization is proposed to construct a more reduced feature subset. This approach works based on the combination of several features to create a more informative single feature. Another challenge of sentiment classification is the handling of concept drift problem in the learning step. Usersâ opinions are changed due to evolution of target entities over time. However, the existing sentiment classification approaches do not consider the evolution of usersâ opinions. They assume that instances are independent, identically distributed and generated from a stationary distribution, even though they are generated from a stream distribution. In this study, a stream sentiment classification method is proposed to deal with changing opinion and imbalanced data distribution using ensemble learning and instance selection methods. In relation to the concept drift problem, another important issue is the handling of feature drift in the sentiment classification. To handle feature drift, relevant features need to be detected to update classifiers. Since proposed feature unionization method is very effective to construct more relevant features, it is further used to handle feature drift. Thus, a method to deal with concept and feature drifts for stream sentiment classification was proposed. The effectiveness of the feature unionization method was compared with the feature selection method over fourteen publicly available datasets in sentiment classification domain using three typical classifiers. The experimental results showed the proposed approach is more effective than current feature selection approaches. In addition, the experimental results showed the effectiveness of the proposed stream sentiment classification method in comparison to static sentiment classification. The experiments conducted on four datasets, have successfully shown that the proposed algorithm achieved better results and proving the effectiveness of the proposed method
Feature Ranking for Text Classifiers
Feature selection based on feature ranking has received much
attention by researchers in the field of text classification. The
major reasons are their scalability, ease of use, and fast computation. %,
However, compared to the search-based feature selection methods such
as wrappers and filters, they suffer from poor performance. This is
linked to their major deficiencies, including: (i) feature ranking
is problem-dependent; (ii) they ignore term dependencies, including
redundancies and correlation; and (iii) they usually fail in
unbalanced data.
While using feature ranking methods for dimensionality reduction, we
should be aware of these drawbacks, which arise from the function of
feature ranking methods. In this thesis, a set of solutions is
proposed to handle the drawbacks of feature ranking and boost their
performance. First, an evaluation framework called feature
meta-ranking is proposed to evaluate ranking measures. The framework
is based on a newly proposed Differential Filter Level Performance
(DFLP) measure. It was proved that, in ideal cases, the performance
of text classifier is a monotonic, non-decreasing function of the
number of features. Then we theoretically and empirically validate
the effectiveness of DFLP as a meta-ranking measure to evaluate and
compare feature ranking methods. The meta-ranking framework is also
examined by a stopword extraction problem. We use the framework to
select appropriate feature ranking measure for building
domain-specific stoplists. The proposed framework is evaluated by
SVM and Rocchio text classifiers on six benchmark data. The
meta-ranking method suggests that in searching for a proper feature
ranking measure, the backward feature ranking is as important as the
forward one.
Second, we show that the destructive effect of term redundancy gets
worse as we decrease the feature ranking threshold. It implies that
for aggressive feature selection, an effective redundancy reduction
should be performed as well as feature ranking. An algorithm based
on extracting term dependency links using an information theoretic
inclusion index is proposed to detect and handle term dependencies.
The dependency links are visualized by a tree structure called a
term dependency tree. By grouping the nodes of the tree into two
categories, including hub and link nodes, a heuristic algorithm is
proposed to handle the term dependencies by merging or removing the
link nodes. The proposed method of redundancy reduction is evaluated
by SVM and Rocchio classifiers for four benchmark data sets.
According to the results, redundancy reduction is more effective on
weak classifiers since they are more sensitive to term redundancies.
It also suggests that in those feature ranking methods which compact
the information in a small number of features, aggressive feature
selection is not recommended.
Finally, to deal with class imbalance in feature level using ranking
methods, a local feature ranking scheme called reverse
discrimination approach is proposed. The proposed method is applied
to a highly unbalanced social network discovery problem. In this
case study, the problem of learning a social network is translated
into a text classification problem using newly proposed actor and
relationship modeling. Since social networks are usually sparse
structures, the corresponding text classifiers become highly
unbalanced. Experimental assessment of the reverse discrimination
approach validates the effectiveness of the local feature ranking
method to improve the classifier performance when dealing with
unbalanced data. The application itself suggests a new approach to
learn social structures from textual data
Embedding Feature Selection for Large-scale Hierarchical Classification
Large-scale Hierarchical Classification (HC) involves datasets consisting of
thousands of classes and millions of training instances with high-dimensional
features posing several big data challenges. Feature selection that aims to
select the subset of discriminant features is an effective strategy to deal
with large-scale HC problem. It speeds up the training process, reduces the
prediction time and minimizes the memory requirements by compressing the total
size of learned model weight vectors. Majority of the studies have also shown
feature selection to be competent and successful in improving the
classification accuracy by removing irrelevant features. In this work, we
investigate various filter-based feature selection methods for dimensionality
reduction to solve the large-scale HC problem. Our experimental evaluation on
text and image datasets with varying distribution of features, classes and
instances shows upto 3x order of speed-up on massive datasets and upto 45% less
memory requirements for storing the weight vectors of learned model without any
significant loss (improvement for some datasets) in the classification
accuracy. Source Code: https://cs.gmu.edu/~mlbio/featureselection.Comment: IEEE International Conference on Big Data (IEEE BigData 2016
FSMJ: Feature Selection with Maximum Jensen-Shannon Divergence for Text Categorization
In this paper, we present a new wrapper feature selection approach based on
Jensen-Shannon (JS) divergence, termed feature selection with maximum
JS-divergence (FSMJ), for text categorization. Unlike most existing feature
selection approaches, the proposed FSMJ approach is based on real-valued
features which provide more information for discrimination than binary-valued
features used in conventional approaches. We show that the FSMJ is a greedy
approach and the JS-divergence monotonically increases when more features are
selected. We conduct several experiments on real-life data sets, compared with
the state-of-the-art feature selection approaches for text categorization. The
superior performance of the proposed FSMJ approach demonstrates its
effectiveness and further indicates its wide potential applications on data
mining.Comment: 8 pages, 6 figures, World Congress on Intelligent Control and
Automation, 201
Assessing similarity of feature selection techniques in high-dimensional domains
Recent research efforts attempt to combine multiple feature selection techniques instead of using a single one. However, this combination is often made on an âad hocâ basis, depending on the specific problem at hand, without considering the degree of diversity/similarity of the involved methods. Moreover, though it is recognized that different techniques may return quite dissimilar outputs, especially in high dimensional/small sample size domains, few direct comparisons exist that quantify these differences and their implications on classification performance. This paper aims to provide a contribution in this direction by proposing a general methodology for assessing the similarity between the outputs of different feature selection methods in high dimensional classification problems. Using as benchmark the genomics domain, an empirical study has been conducted to compare some of the most popular feature selection methods, and useful insight has been obtained about their pattern of agreement
Automatic categorization of diverse experimental information in the bioscience literature
Background:
Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and thus, is usually time consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental datatypes in the biocuration process at WormBase for the past two years and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reducing time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature such as C. elegans and D. melanogaster to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
Results:
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genomics Informatics (MGI). It is being used in the curation work flow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Conclusions:
Our methods are applicable to a variety of data types with training set containing several hundreds to a few thousand documents. It is completely automatic and, thus can be readily incorporated to different workflow at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort
CEAI: CCM based Email Authorship Identification Model
In this paper we present a model for email authorship identification (EAI) by
employing a Cluster-based Classification (CCM) technique. Traditionally,
stylometric features have been successfully employed in various authorship
analysis tasks; we extend the traditional feature-set to include some more
interesting and effective features for email authorship identification (e.g.
the last punctuation mark used in an email, the tendency of an author to use
capitalization at the start of an email, or the punctuation after a greeting or
farewell). We also included Info Gain feature selection based content features.
It is observed that the use of such features in the authorship identification
process has a positive impact on the accuracy of the authorship identification
task. We performed experiments to justify our arguments and compared the
results with other base line models. Experimental results reveal that the
proposed CCM-based email authorship identification model, along with the
proposed feature set, outperforms the state-of-the-art support vector machine
(SVM)-based models, as well as the models proposed by Iqbal et al. [1, 2]. The
proposed model attains an accuracy rate of 94% for 10 authors, 89% for 25
authors, and 81% for 50 authors, respectively on Enron dataset, while 89.5%
accuracy has been achieved on authors' constructed real email dataset. The
results on Enron dataset have been achieved on quite a large number of authors
as compared to the models proposed by Iqbal et al. [1, 2]
- âŠ