Feature selection for splice site prediction: A new method using EDA-based feature ranking
BACKGROUND: The identification of relevant biological features in large and complex datasets is an important step towards gaining insight into the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain comparable or even better solutions using a restricted subset of features, and faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data. RESULTS: In this paper we present a novel method for feature subset selection applied to splice site prediction, based on estimation of distribution algorithms, a generalization of genetic algorithms. From the distribution estimated by the algorithm, a feature ranking is derived. This ranking is then used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing. CONCLUSION: We show that this technique is more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as they normally do), this method provides a dynamic view of the feature selection process, like traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features.
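The ranking scheme described above can be illustrated with a univariate estimation of distribution algorithm (UMDA-style): per-feature inclusion probabilities are re-estimated from the fittest subsets each generation, and the final marginals serve as the feature ranking. This is a minimal sketch of the general idea, not the paper's implementation; the function name, the toy fitness function, and all parameter values are assumptions.

```python
import random

def umda_feature_ranking(n_features, fitness, pop_size=50, n_select=25,
                         n_gens=20, seed=0):
    """Rank features with a univariate EDA (UMDA-style sketch).

    fitness maps a 0/1 feature mask (tuple) to a score; higher is better.
    Returns (ranking, marginals): feature indices sorted by estimated
    marginal inclusion probability, and the probabilities themselves.
    """
    rng = random.Random(seed)
    probs = [0.5] * n_features          # initial inclusion probabilities
    for _ in range(n_gens):
        # sample a population of feature masks from the current marginals
        pop = [tuple(1 if rng.random() < p else 0 for p in probs)
               for _ in range(pop_size)]
        # keep the fittest masks and re-estimate the marginals from them
        elite = sorted(pop, key=fitness, reverse=True)[:n_select]
        probs = [sum(ind[j] for ind in elite) / n_select
                 for j in range(n_features)]
    ranking = sorted(range(n_features), key=lambda j: probs[j], reverse=True)
    return ranking, probs

# toy fitness: features 0 and 1 are informative, the rest only add cost
toy_fitness = lambda mask: 2 * (mask[0] + mask[1]) - sum(mask[2:])
ranking, marginals = umda_feature_ranking(6, toy_fitness)
```

In the paper's setting, the fitness would instead be the cross-validated performance of a splice site classifier trained on the masked feature subset, and the lowest-ranked features would be the first candidates for iterative discarding.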
Robust microbial markers for non-invasive inflammatory bowel disease identification
Inflammatory Bowel Disease (IBD) is an umbrella term for a group of inflammatory diseases of the human gastrointestinal tract, including Crohn's Disease (CD) and ulcerative colitis (UC). Changes to the intestinal microbiome, the community of micro-organisms that resides in the human gut, have been shown to contribute to the pathogenesis of IBD. IBD diagnosis is often delayed due to its non-specific symptoms (e.g. abdominal pain), and an invasive colonoscopy is required for confirmation. Delayed diagnosis is linked to poor growth in children and worse treatment outcomes. Microbial communities are extremely complex, and feature selection algorithms are often applied to identify key bacterial groups that drive disease. Ensemble Feature Selection (EFS), which aggregates the outputs of several selector runs, has been shown to improve the robustness of feature selection algorithms. The robustness of a feature selector is defined as the variation of the feature selector's output caused by small changes to the dataset. Typical feature selection algorithms can be used to help build simpler, faster, and easier-to-understand models, but suffer from poor robustness. Having confidence in the output of a feature selection algorithm is key for enabling knowledge discovery from complex biological datasets. In this work we apply a two-step filter and an EFS process to generate robust feature subsets that can non-invasively predict IBD subtypes from high-resolution microbiome data. The predictive power of the robust feature subsets is the highest reported in the literature to date. Furthermore, we identify five biologically plausible bacterial species that have not previously been implicated in IBD aetiology.
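The EFS idea of aggregating selector outputs, and the notion of robustness as stability of the output under perturbations of the data, can be sketched as follows. The aggregation rule (mean rank position) and the stability measure (mean pairwise Jaccard index) are common choices in the EFS literature, not necessarily the exact ones used in this work.

```python
def aggregate_rankings(rankings):
    """Consensus ordering by mean rank position (lower = better).

    rankings: list of orderings of feature indices, best first, one per
    selector run (e.g. one per bootstrap sample of the dataset).
    """
    n = len(rankings[0])
    mean_rank = [0.0] * n
    for r in rankings:
        for pos, feat in enumerate(r):
            mean_rank[feat] += pos / len(rankings)
    return sorted(range(n), key=lambda f: mean_rank[f])

def jaccard_stability(subsets):
    """Mean pairwise Jaccard index of selected-feature sets (1.0 means the
    selector returned an identical subset on every perturbed dataset)."""
    pairs = [(a, b) for i, a in enumerate(subsets) for b in subsets[i + 1:]]
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# three selector runs that mostly agree on features 0 and 1
runs = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 1, 3, 2]]
consensus = aggregate_rankings(runs)
stability = jaccard_stability([{0, 1}, {0, 1}, {0, 2}])
```

Individual runs disagree on the exact order, but the consensus ranking recovers the shared signal; the stability score quantifies how much the selected subsets vary across runs.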
Hybrid Correlation and Causal Feature Selection for Ensemble Classifiers
The PC and TPDA algorithms are robust and well-known prototype algorithms incorporating constraint-based approaches for causal discovery. However, neither algorithm scales to high-dimensional data, that is, more than a few hundred features. This chapter presents hybrid correlation and causal feature selection for ensemble classifiers to address this problem. Redundant features are first removed by correlation-based feature selection, and irrelevant features are then eliminated by causal feature selection. The number of eliminated features, accuracy, area under the receiver operating characteristic curve (AUC) and false negative rate (FNR) of the proposed algorithms are compared with correlation-based feature selection algorithms (FCBF and CFS) and causal feature selection algorithms (PC, TPDA, GS, IAMB).
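The first stage of the two-step scheme, correlation-based redundancy removal, can be sketched as a greedy threshold filter. The threshold value and the greedy scan order below are illustrative assumptions; the second, causal stage (PC, TPDA, GS or IAMB on the surviving features) is what makes the hybrid tractable and is not reproduced here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def remove_redundant(X, threshold=0.9):
    """Stage 1: greedily drop any feature whose absolute correlation with
    an already-kept feature exceeds the threshold. X is row-major data."""
    kept = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        if all(abs(pearson(col, [row[k] for row in X])) < threshold
               for k in kept):
            kept.append(j)
    return kept

# feature 1 is an exact multiple of feature 0, so it is redundant
X = [[1.0, 2.0, 5.0], [2.0, 4.0, 1.0], [3.0, 6.0, 4.0], [4.0, 8.0, 2.0]]
survivors = remove_redundant(X)
```

Shrinking the feature set this way is what lets a constraint-based causal algorithm, which is exponential in the worst case, run on data that would otherwise have too many features.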
Improving Feature Selection Techniques for Machine Learning
As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications. We proposed a hybrid feature selection framework based on genetic algorithms (GAs) that employs a target learning algorithm to evaluate features (a wrapper method); we call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. The experiments on genomic data demonstrate that ours is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm. A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. We proposed a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and distribute most differently among all classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV generates equal or better performance than the others in many cases.
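The stated intuition behind RSFV, that informative features distribute most differently among the classes, can be sketched as a per-class frequency-variance score. The exact RSFV formula is not given in the abstract, so the function below implements only the frequency-variance half of the idea; its name, signature, and formula are illustrative assumptions, not the published measure.

```python
def frequency_variance(feature_counts, class_sizes):
    """Variance of a feature's per-class relative document frequency.
    High values mean the feature distributes very differently among the
    classes, which is the intuition RSFV builds on. (Illustrative
    assumption, not the published RSFV measure.)"""
    freqs = [fc / cs for fc, cs in zip(feature_counts, class_sizes)]
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs) / len(freqs)

# a term in 9 of 10 class-A documents but only 1 of 10 class-B documents
discriminative = frequency_variance([9, 1], [10, 10])
# a term appearing equally often in both classes carries no class signal
uninformative = frequency_variance([5, 5], [10, 10])
```

Because such a score needs only per-class counts, it stays cheap on the large, multi-label feature spaces of text categorization where wrapper methods like HGFS become impractical.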