Conditional-Entropy Metrics for Feature Selection
Institute for Communicating and Collaborative Systems
We examine the task of feature selection, which is a method of forming simplified
descriptions of complex data for use in probabilistic classifiers. Feature selection typically
requires a numerical measure or metric of the desirability of a given set of features.
The thesis considers a number of existing metrics, with particular attention to
those based on entropy and other quantities derived from information theory. A useful
new perspective on feature selection is provided by the concepts of partitioning and
encoding of data by a feature set. The ideas of partitioning and encoding, together
with the theoretical shortcomings of existing metrics, motivate a new class of feature
selection metrics based on conditional entropy. The simplest of the new metrics is
referred to as expected partition entropy or EPE.
The performance of the new and existing metrics is compared in experiments with
a simplified form of part-of-speech tagging and with classification of Reuters news
stories by topic. In order to conduct the experiments, a new class of accelerated feature
selection search algorithms is introduced; a member of this class is found to provide
significantly increased speed with minimal loss in performance, as measured by feature
selection metrics and accuracy on test data. The comparative performance of existing
metrics is also analysed, giving rise to a new general conjecture regarding the wrapper
class of metrics. Each wrapper is inherently tied to a specific type of classifier. The
experimental results support the idea that a wrapper selects feature sets which perform
well in conjunction with its own particular classifier, but this good performance cannot
be expected to carry over to other types of model.
The new metrics introduced in this thesis prove to have substantial advantages over
a representative selection of other feature selection mechanisms: mutual information,
frequency-based cutoff, the Koller-Sahami information loss measure, and two different
types of wrapper method. Feature selection using the new metrics easily outperforms
other filter-based methods such as mutual information; additionally, our approach attains
comparable performance to a wrapper method, but at a fraction of the computational
expense. Finally, members of the new class of metrics succeed in a case where
the Koller-Sahami metric fails to provide a meaningful criterion for feature selection.
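
The abstract above does not define expected partition entropy, so the following is only a minimal sketch of the general idea it describes: a feature set partitions the data by its joint feature values, and the class entropy within each partition cell is averaged, weighted by cell size. This is the standard empirical conditional entropy H(class | features), not necessarily the thesis's exact EPE formula. The function name `conditional_entropy` and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(rows, labels, feature_idx):
    """Empirical H(label | selected features), in bits.

    rows: sequence of tuples of discrete feature values.
    labels: sequence of class labels, aligned with rows.
    feature_idx: indices of the features in the candidate set.
    """
    # Partition the data: one cell per distinct joint value of the features.
    cells = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in feature_idx)
        cells[key][label] += 1

    n = len(labels)
    total = 0.0
    for counts in cells.values():
        size = sum(counts.values())
        # Class entropy inside this cell.
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        # Weight by the fraction of the data that falls in the cell.
        total += (size / n) * h
    return total


# Toy data (hypothetical): feature 0 determines the label, feature 1 does not.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["a", "a", "b", "b"]

h_feature0 = conditional_entropy(rows, labels, [0])
h_feature1 = conditional_entropy(rows, labels, [1])
```

Under a metric of this kind, a lower value indicates a feature set whose partition leaves less uncertainty about the class, so feature 0 would be preferred over feature 1 in the toy data.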
FEPI-MB: identifying SNPs-disease association using a Markov Blanket-based approach
BACKGROUND: The interactions among genetic factors related to diseases are called epistasis. With the availability of genotyped data from genome-wide association studies, it is now possible to computationally unravel epistasis related to the susceptibility to common complex human diseases such as asthma, diabetes, and hypertension. However, the difficulty of detecting epistatic interactions arises from the large number of genetic factors and the enormous number of possible combinations of those factors. Most computational methods for detecting epistatic interactions are predictor-based and cannot find true causal factors. Moreover, they are both time-consuming and sample-consuming. RESULTS: We propose a new and fast Markov Blanket-based method, FEPI-MB (Fast EPistatic Interactions detection using Markov Blanket), for detecting epistatic interactions. The Markov Blanket is a minimal set of variables that can completely shield the target variable from all other variables. Learning of Markov blankets can be used to detect epistatic interactions by a heuristic search for a minimal set of SNPs that may cause the disease. Experimental results on both simulated data sets and a real data set demonstrate that FEPI-MB significantly outperforms other existing methods and is capable of finding SNPs that have a strong association with common diseases. CONCLUSIONS: The FEPI-MB algorithm outperforms other computational methods for detection of epistatic interactions in terms of both power and sample efficiency. Moreover, compared to other Markov Blanket learning methods, FEPI-MB is more time-efficient and achieves better performance.
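
To make the Markov blanket definition in the abstract above concrete, here is a small sketch under a strong simplifying assumption: the causal structure is already known as a directed acyclic graph. In that setting (and assuming faithfulness), the Markov blanket of a target is its parents, its children, and the other parents of its children. This is not the FEPI-MB algorithm, which the abstract says learns the blanket from genotype data by heuristic search. The function `markov_blanket` and the node names are hypothetical.

```python
def markov_blanket(parents, target):
    """Markov blanket of `target` in a known DAG.

    parents: dict mapping every node to the set of its parent nodes.
    Returns the parents, children, and co-parents (spouses) of the target.
    """
    # Children are the nodes that list the target among their parents.
    children = {node for node, ps in parents.items() if target in ps}

    # Spouses are the other parents of those children.
    spouses = set()
    for child in children:
        spouses |= parents[child]

    blanket = set(parents.get(target, set())) | children | spouses
    blanket.discard(target)  # the target is never part of its own blanket
    return blanket


# Hypothetical toy graph: two SNPs influence the disease, and the disease
# together with an environmental factor influences an observed phenotype.
# snp3 only acts through snp1, so the blanket shields the disease from it.
toy_graph = {
    "snp3": set(),
    "snp1": {"snp3"},
    "snp2": set(),
    "env": set(),
    "disease": {"snp1", "snp2"},
    "phenotype": {"disease", "env"},
}

blanket = markov_blanket(toy_graph, "disease")
```

In this toy graph the blanket contains `snp1`, `snp2`, `phenotype`, and `env`, while `snp3` is excluded because its influence on the disease is fully mediated by `snp1`.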
Digging into acceptor splice site prediction: an iterative feature selection approach
Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction.
We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights into the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature.
The results described in this paper contribute both to the domain of gene prediction and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.