    Stability of feature selection algorithms: a study on high-dimensional spaces

    With the proliferation of extremely high-dimensional data, feature selection algorithms have become indispensable components of the learning process. Strangely, despite extensive work on the stability of learning algorithms, the stability of feature selection algorithms has been relatively neglected. This study is an attempt to fill that gap by quantifying the sensitivity of feature selection algorithms to variations in the training set. We assess the stability of feature selection algorithms based on the stability of the feature preferences that they express in the form of weights or scores, ranks, or a selected feature subset. We examine a number of measures to quantify the stability of feature preferences and propose an empirical way to estimate them. We perform a series of experiments with several feature selection algorithms on a set of proteomics datasets. The experiments allow us to explore the merits of each stability measure and create stability profiles of the feature selection algorithms. Finally, we show how stability profiles can support the choice of a feature selection algorithm.
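As an illustration of what a subset-based stability measure can look like, the sketch below averages the pairwise Jaccard (Tanimoto) similarity of the feature subsets selected on different resamples of the training set. The choice of Jaccard similarity and the function names `jaccard` and `subset_stability` are assumptions made for this example; the abstract does not state which measures the authors use.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature subsets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def subset_stability(subsets):
    """Average pairwise Jaccard similarity over a list of selected subsets.

    `subsets` holds one selected feature subset per resample of the
    training data (hypothetical input format for this sketch).
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value of 1.0 means every resample selected the same features; values near 0 mean the selections barely overlap.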

    Statistical model for reproducibility in ranking-based feature selection

    The stability of feature subset selection algorithms has become crucial in real-world problems due to the need for consistent experimental results across different replicates. Specifically, in this paper, we analyze the reproducibility of ranking-based feature subset selection algorithms. When applied to data, this family of algorithms builds an ordering of variables in terms of a measure of relevance. In order to quantify the reproducibility of ranking-based feature subset selection algorithms, we propose a model that takes into account all the different sized subsets of top-ranked features. The model is fitted to data through the minimization of an error function related to the expected values of Kuncheva’s consistency index for those subsets. Once it is fitted, the model provides practical information about the feature subset selection algorithm analyzed, such as a measure of its expected reproducibility or its estimated area under the receiver operating characteristic curve regarding the identification of relevant features. We test our model empirically using both synthetic and a wide range of real data. The results show that our proposal can be used to analyze feature subset selection algorithms based on rankings in terms of their reproducibility and their performance
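For reference, Kuncheva's consistency index for two subsets of equal size k drawn from n features, with r features in common, is (r·n − k²) / (k·(n − k)). The sketch below computes only this index; it does not reproduce the model proposed in the paper, and the function name `kuncheva_index` is hypothetical.

```python
def kuncheva_index(a, b, n):
    """Kuncheva's consistency index for two equal-sized feature subsets.

    a, b: iterables of selected feature identifiers (same size k)
    n:    total number of features, with 0 < k < n
    """
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("subsets must have the same size")
    if not 0 < k < n:
        raise ValueError("subset size k must satisfy 0 < k < n")
    r = len(a & b)  # number of features shared by both subsets
    return (r * n - k * k) / (k * (n - k))
```

The index is 1 for identical subsets and is corrected for chance, so two subsets that overlap only as much as random selection would predict score about 0.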

    An embedded two-layer feature selection approach for microarray data analysis

    Feature selection is an important technique for application problems with a large number of variables and limited training samples, such as image processing, combinatorial chemistry, and microarray analysis. Commonly employed feature selection strategies can be divided into filter and wrapper methods. In this study, we propose an embedded two-layer feature selection approach that combines the advantages of filter and wrapper algorithms while avoiding their drawbacks. The hybrid algorithm, called GAEF (Genetic Algorithm with Embedded Filter), divides the feature selection process into two stages. In the first stage, a Genetic Algorithm (GA) is employed to pre-select features, while in the second stage a filter selector is used to further identify a small feature subset for accurate sample classification. Three benchmark microarray datasets are used to evaluate the proposed algorithm. The experimental results suggest that this embedded two-layer feature selection strategy is able to improve the stability of the selection results as well as the sample classification accuracy.

    A Survey of Feature Selection Strategies for DNA Microarray Classification

    Classification tasks in bioinformatics, such as predicting or diagnosing disease at an early stage from DNA microarray data, are difficult and challenging. DNA microarray data are characterized by a large number of features and small sample sizes, so classification confronts the "curse of dimensionality": the computational cost is high and biomarkers are difficult to discover. Feature selection algorithms can reduce the dimensionality by finding the significant features without degrading the performance of classification tasks. Feature selection also helps decrease computational time by removing irrelevant and redundant features from the data. The study aims to briefly survey popular feature selection methods for classifying DNA microarray data, such as filter, wrapper, embedded, and hybrid approaches. Furthermore, this study describes the steps of the feature selection process used to accomplish classification tasks and their relationships to other components such as datasets, cross-validation, and classifier algorithms. In the case study, we apply four different feature selection methods to two DNA microarray datasets and evaluate and discuss their performance in terms of classification accuracy, stability, and the size of the selected feature subset. Keywords: brief survey; DNA microarray data; feature selection; filter methods; wrapper methods; embedded methods; hybrid methods. DOI: 10.7176/CEIS/14-2-01 Publication date: March 31st 202

    Heuristic ensembles of filters for accurate and reliable feature selection

    Feature selection has become increasingly important in data mining in recent years. However, the accuracy and stability of feature selection methods vary considerably when used individually, and no rule exists to indicate which one should be used for a particular dataset. Thus, an ensemble method that combines the outputs of several individual feature selection methods appears to be a promising approach to address the issue and hence is investigated in this research. This research aims to develop an effective ensemble that can improve the accuracy and stability of feature selection. We propose a novel heuristic ensemble of filters (HEF). It combines two types of filters, subset filters and ranking filters, with a heuristic consensus algorithm in order to utilise the strength of each type. The ensemble is tested on ten benchmark datasets and its performance is evaluated by two stability measures and three classifiers. The experimental results demonstrate that HEF improves the stability and accuracy of the selected features and in most cases outperforms the other ensemble algorithms, individual filters and the full feature set. The research on the HEF algorithm is extended in several dimensions, including more filter members, three novel schemes of mean rank aggregation with partial lists, and three novel schemes for a weighted heuristic ensemble of filters. However, the experimental results demonstrate that adding weight to filters in HEF does not achieve the expected improvement in accuracy, but increases time and space complexity, and clearly decreases stability. Therefore, the core ensemble algorithm (HEF) is demonstrated to be not just simpler but also more reliable and consistent than the later, more complicated and weighted ensembles. In addition, we investigated how to use data in feature selection, using ALL or PART of it. Systematic experiments with thirty-five synthetic and benchmark real-world datasets were carried out
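To make the idea of combining ranking filters concrete, here is a minimal sketch of plain mean rank aggregation over complete rankings. It is a generic illustration, not the HEF consensus algorithm or the partial-list schemes described in the abstract; the function name `mean_rank_aggregate` and the input format (each ranking lists the same features, best first) are assumptions.

```python
def mean_rank_aggregate(rankings, k):
    """Return the k features with the lowest mean rank across rankings.

    rankings: list of lists, each ordering the same features best first
              (hypothetical input format; partial lists are not handled)
    k:        number of features to keep
    """
    features = set(rankings[0])
    if any(set(r) != features for r in rankings):
        raise ValueError("all rankings must cover the same features")
    # Position in each list serves as the rank (0 = best).
    mean_rank = {
        f: sum(r.index(f) for r in rankings) / len(rankings)
        for f in features
    }
    # Break ties by feature name so the result is deterministic.
    return sorted(features, key=lambda f: (mean_rank[f], f))[:k]
```

Averaging ranks treats every filter equally; a weighted variant would multiply each filter's rank by a weight before averaging.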