Weighted Heuristic Ensemble of Filters
Feature selection has become increasingly important in data mining in recent years due to the rapid increase in the dimensionality of big data. However, the reliability and consistency of feature selection methods (filters) vary considerably across datasets, and no single filter performs consistently well under all conditions. Feature selection ensembles have therefore been investigated recently to provide more reliable and effective results than any individual filter, but all existing feature selection ensembles treat the feature selection methods equally, regardless of their performance. In this paper, we present a novel framework for weighted feature selection ensembles, proposing a systematic way of assigning different weights to the feature selection methods (filters). We also investigate how to determine an appropriate weight for each filter in an ensemble. Experiments on ten benchmark datasets show that, although theoretically and intuitively adding more weight to 'good' filters should lead to better results, in practice the outcome is very uncertain. This assumption was found to be correct for some examples in our experiments; in other situations, however, filters that had been assumed to perform well performed badly, leading to even worse results. Adding weight to filters may therefore achieve little in terms of accuracy, while increasing complexity and time consumption and clearly decreasing stability.
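As an illustration of the weighting idea described above (not code from the paper), a minimal sketch of a weighted ensemble of filter scores; the function name, the min-max normalisation, and the toy score vectors are all hypothetical choices:

```python
import numpy as np

def weighted_ensemble_ranking(filter_scores, weights):
    """Combine per-filter feature scores into one ranking.

    filter_scores: list of 1-D arrays, one score vector per filter
                   (higher = more relevant), all of equal length.
    weights:       one weight per filter.
    Returns feature indices sorted from most to least relevant.
    """
    # Normalise each filter's scores to [0, 1] so that the weights,
    # not the raw score magnitudes, control each filter's influence.
    combined = np.zeros_like(np.asarray(filter_scores[0], dtype=float))
    for scores, w in zip(filter_scores, weights):
        s = np.asarray(scores, dtype=float)
        rng = s.max() - s.min()
        if rng > 0:
            s = (s - s.min()) / rng
        combined += w * s
    return np.argsort(combined)[::-1]

# Two hypothetical filters scoring four features; the second filter
# is trusted twice as much as the first.
ranking = weighted_ensemble_ranking(
    [np.array([0.1, 0.9, 0.5, 0.2]),
     np.array([0.8, 0.3, 0.6, 0.1])],
    weights=[1.0, 2.0],
)
```

With equal weights the two filters would disagree on the top feature; doubling the second filter's weight lets it dominate the combined ranking, which is exactly the effect whose benefit the paper finds to be uncertain in practice.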
Tackling Ant Colony Optimization Meta-Heuristic as Search Method in Feature Subset Selection Based on Correlation or Consistency Measures
This paper introduces the use of an ant colony optimization (ACO) algorithm, called Ant System, as a search method in two well-known feature subset selection methods based on correlation or consistency measures: CFS (Correlation-based Feature Selection) and CNS (Consistency-based Feature Selection). ACO guides the search using a heuristic evaluator. Empirical results on twelve real-world classification problems are reported. Statistical tests reveal that InfoGain is a very suitable heuristic for the CFS and CNS feature subset selection methods with ACO acting as search method, and that it is the significantly better heuristic across a range of classifiers. For most of the problems, the results achieved by ACO-based feature subset selection with the suitable heuristic evaluator are better than those obtained with CFS or CNS combined with Best First search.
MICYT TIN2007-68084-C02-02, MICYT TIN2011-28956-C02-02, Junta de Andalucía P11-TIC-752
Data Cleansing Meets Feature Selection: A Supervised Machine Learning Approach
This paper presents a novel procedure for applying, in a sequential way, two data preparation techniques of a different nature: data cleansing and feature selection. For the former we experimented with a partial removal of outliers via the inter-quartile range, whereas for the latter we selected relevant attributes with two widespread feature subset selectors, CFS (Correlation-based Feature Selection) and CNS (Consistency-based Feature Selection), which are founded on correlation and consistency measures, respectively. Empirical results are outlined on seven difficult binary and multi-class data sets, that is, data sets with a test error rate of at least 10% in accuracy terms with the C4.5 or 1-nearest-neighbour classifiers and no prior data pre-processing. Non-parametric statistical tests assert that combining the two aforementioned data preparation strategies, using a correlation measure for feature selection with the C4.5 algorithm, is significantly better, measured by ROC, than the single application of the data cleansing approach. Last but not least, a weak and not very powerful learner like PART achieved promising results with the new proposal based on a consistency measure and is able to compete with the best configuration of C4.5. To sum up, under the new approach and in terms of ROC, the PART classifier with a consistency metric behaves slightly better than C4.5 with a correlation measure.
MICYT TIN2007-68084-C02-02, MICYT TIN2011-28956-C02-02, Junta de Andalucía P11-TIC-752
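The inter-quartile-range cleansing step mentioned above follows a standard recipe: flag values outside [Q1 − k·IQR, Q3 + k·IQR]. A minimal sketch, with the function name, k = 1.5 default, and toy data as illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def iqr_inlier_mask(x, k=1.5):
    """Boolean mask: True for values inside the Tukey fences
    [Q1 - k*IQR, Q3 + k*IQR] (kept), False for flagged outliers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x >= q1 - k * iqr) & (x <= q3 + k * iqr)

# The extreme value 100.0 falls outside the fences and is removed.
x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 100.0])
kept = x[iqr_inlier_mask(x)]
```

A "partial" removal, as in the paper, would apply such a mask only to selected attributes or only to a fraction of the flagged records rather than dropping every flagged value.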
A Framework for Consistency Based Feature Selection
Feature selection is an effective technique for reducing the dimensionality of features in many applications where datasets involve hundreds or thousands of features. The objective of feature selection is to find an optimal subset of relevant features such that the feature size is reduced and the understandability of a learning process is improved without significantly decreasing the overall accuracy and applicability. This thesis focuses on the consistency measure, under which a feature subset is inconsistent if there exist at least two instances with the same feature values but different class labels. The thesis introduces a new consistency-based algorithm, Automatic Hybrid Search (AHS), and reviews several existing feature selection algorithms (ES, PS and HS) which are based on the consistency rate. We conclude this work with an empirical study providing a comparative analysis of the different search algorithms.
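The consistency rate that these algorithms optimise can be computed directly: group instances by their (projected) feature values and count, in each group, every instance outside the majority class. A small sketch under that standard definition (the function name and toy data are illustrative):

```python
from collections import Counter, defaultdict

def inconsistency_rate(X, y):
    """Inconsistency rate of the data projected onto a feature subset:
    within each group of instances sharing identical feature values,
    every instance outside the group's majority class is inconsistent;
    the rate is the total of such instances over the dataset size."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[tuple(row)].append(label)
    inconsistent = sum(len(labels) - max(Counter(labels).values())
                       for labels in groups.values())
    return inconsistent / len(y)

# Two instances share the feature values (0, 1) but disagree on the
# class, so 1 of the 4 instances is inconsistent.
rate = inconsistency_rate([[0, 1], [0, 1], [1, 0], [1, 0]], [0, 1, 1, 1])
```

A search algorithm such as the ES, PS, HS or AHS variants mentioned above would evaluate candidate subsets by projecting X onto them and preferring subsets whose rate stays below a threshold.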
Feature Selection Approaches In Antibody Display
Molecular diagnostics tools provide specific data of high dimensionality, owing to the many factors analyzed in one experiment, and with few records, owing to the high cost of the experiments. This study addresses the problem of dimensionality in melanoma patient antibody display data by applying data mining feature selection techniques. The article describes feature selection ranking and subset selection approaches and analyzes the performance of various methods, evaluating the selected feature subsets with the classification algorithms C4.5, Random Forest, SVM and Naïve Bayes, which have to differentiate between cancer patient data and healthy donor data. The feature selection methods include correlation-based, consistency-based and wrapper subset selection algorithms, as well as single-feature ranking based on statistical measures, information evaluation, the prediction potential of rules, and SVM feature evaluation.
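The two families contrasted above differ in what gets scored: ranking methods score each feature independently, while subset selection methods evaluate whole subsets and can therefore catch features that are only useful in combination. A minimal sketch of that distinction, with hypothetical scorers and an exhaustive search used purely for illustration:

```python
from itertools import combinations

def rank_features(score, n_features):
    """Ranking approach: score each feature independently and sort."""
    return sorted(range(n_features), key=score, reverse=True)

def best_subset(evaluate, n_features, max_size=2):
    """Subset-selection approach: search over candidate subsets with a
    subset-level evaluator (exhaustively here, for illustration only;
    real selectors use heuristic search)."""
    best, best_value = (), float('-inf')
    for k in range(1, max_size + 1):
        for subset in combinations(range(n_features), k):
            value = evaluate(subset)
            if value > best_value:
                best, best_value = subset, value
    return best

# Toy evaluators: feature 1 has the best individual score, but
# features 0 and 2 are only useful together -- a ranker misses that.
individual = [0.2, 0.9, 0.5]
ranking = rank_features(lambda i: individual[i], 3)
subset = best_subset(lambda s: 1.0 if set(s) == {0, 2} else 0.1, 3)
```

In a study like the one above, `evaluate` would be a correlation, consistency, or wrapper (classifier accuracy) measure, and `score` one of the single-feature criteria.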
Bayesian variable selection with shrinking and diffusing priors
We consider a Bayesian approach to variable selection in the presence of high dimensional covariates based on a hierarchical model that places prior distributions on the regression coefficients as well as on the model space. We adopt the well-known spike and slab Gaussian priors with a distinct feature: the prior variances depend on the sample size, through which appropriate shrinkage can be achieved. We show the strong selection consistency of the proposed method in the sense that the posterior probability of the true model converges to one even when the number of covariates grows nearly exponentially with the sample size. This is arguably the strongest selection consistency result available in the Bayesian variable selection literature; yet the proposed method can be carried out through posterior sampling with a simple Gibbs sampler. Furthermore, we argue that the proposed method is asymptotically similar to model selection with the L0 penalty. We also demonstrate through empirical work the fine performance of the proposed approach relative to some state-of-the-art alternatives.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1207 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
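To make the sample-size-dependent spike-and-slab prior concrete, a toy sketch of a two-component Gaussian mixture prior on one regression coefficient; the 1/n spike variance, the fixed slab variance, and the inclusion probability q are illustrative choices, not the paper's actual scaling:

```python
import math

def normal_pdf(x, var):
    """Density of a zero-mean Gaussian with the given variance."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def spike_slab_pdf(beta, n, q=0.1):
    """Density of a two-component Gaussian spike-and-slab prior on a
    regression coefficient. The variances depend on the sample size n
    only through the spike here -- an illustrative assumption."""
    tau0_sq = 1.0 / n   # spike: concentrates at zero as n grows
    tau1_sq = 4.0       # slab: stays diffuse
    return (1 - q) * normal_pdf(beta, tau0_sq) + q * normal_pdf(beta, tau1_sq)
```

As n grows the spike narrows, so small coefficients are shrunk ever harder toward zero while genuinely large coefficients still receive appreciable mass from the slab; this is the mechanism behind the shrinkage-driven selection consistency described in the abstract.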