The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures
Motivation: Biomarker discovery from high-dimensional data is a crucial
problem with enormous applications in biology and medicine. It is also
extremely challenging from a statistical viewpoint, but surprisingly few
studies have investigated the relative strengths and weaknesses of the plethora
of existing feature selection methods. Methods: We compare 32 feature selection
methods on 4 public gene expression datasets for breast cancer prognosis, in
terms of predictive performance, stability and functional interpretability of
the signatures they produce. Results: We observe that the feature selection
method has a significant influence on the accuracy, stability and
interpretability of signatures. Simple filter methods generally outperform more
complex embedded or wrapper methods, and ensemble feature selection generally
has no positive effect. Overall, a simple Student's t-test seems to
provide the best results. Availability: Code and data are publicly available at
http://cbio.ensmp.fr/~ahaury/
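The simple t-test filter the abstract recommends can be sketched in plain Python; this is an illustrative sketch (using a Welch-style statistic and hypothetical function names), not the authors' released code:

```python
from statistics import mean, variance

def t_scores(X, y):
    """Welch-style |t| statistic per feature between two classes (y in {0, 1}).

    X is a list of rows (samples), y a list of binary labels.
    """
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        # Standard error under unequal variances (Welch).
        se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        scores.append(abs(mean(a) - mean(b)) / se if se > 0 else 0.0)
    return scores

def select_top_k(X, y, k):
    """Return the indices of the k features with the largest |t| scores."""
    s = t_scores(X, y)
    return sorted(range(len(s)), key=lambda j: -s[j])[:k]
```

A filter of this kind ranks each feature independently of the classifier, which is what makes it cheap and, per the study's findings, surprisingly competitive.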
An Incremental Construction of Deep Neuro Fuzzy System for Continual Learning of Non-stationary Data Streams
Existing fuzzy neural networks (FNNs) are mostly developed under a shallow
network configuration and have lower generalization power than deep
structures. This paper
proposes a novel self-organizing deep FNN, namely DEVFNN. Fuzzy rules can be
automatically extracted from data streams or removed if they play limited role
during their lifespan. The structure of the network can be deepened on demand
by stacking additional layers using a drift detection method which not only
detects the covariate drift, variations of input space, but also accurately
identifies the real drift, dynamic changes of both feature space and target
space. DEVFNN is developed under the stacked generalization principle via the
feature augmentation concept where a recently developed algorithm, namely
gClass, drives the hidden layer. It is equipped with an automatic feature
selection method which controls the activation and deactivation of input attributes
to induce varying subsets of input features. A deep network simplification
procedure is put forward using the concept of hidden layer merging to prevent
uncontrollable growth of the input-space dimensionality due to the nature of
the feature augmentation approach in building a deep network structure. DEVFNN
works in a sample-wise fashion and is compatible with data stream
applications. The efficacy of DEVFNN has been thoroughly evaluated using seven
datasets with non-stationary properties under the prequential test-then-train
protocol. It has been compared with four popular continual learning algorithms
and its shallow counterpart where DEVFNN demonstrates improvement of
classification accuracy. Moreover, it is also shown that the concept drift
detection method is an effective tool to control the depth of network structure
while the hidden layer merging scenario is capable of simplifying the network
complexity of a deep network with negligible compromise of generalization
performance.
Comment: This paper has been published in IEEE Transactions on Fuzzy Systems.
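The prequential test-then-train protocol used for evaluation can be sketched as follows; since DEVFNN itself is not reproduced here, a minimal online nearest-class-mean learner (an assumption for illustration, not the paper's model) stands in for the classifier:

```python
def prequential_accuracy(model_predict, model_update, stream):
    """Prequential test-then-train: every sample is first used to test the
    current model, then to update it, so each prediction is on unseen data."""
    correct = 0
    for x, y in stream:
        if model_predict(x) == y:
            correct += 1
        model_update(x, y)
    return correct / len(stream)

class OnlineNearestMean:
    """Toy incremental classifier: predict the class with the nearest running mean."""
    def __init__(self):
        self.sums, self.counts = {}, {}

    def predict(self, x):
        if not self.counts:
            return None  # no class seen yet
        def dist(c):
            mu = [s / self.counts[c] for s in self.sums[c]]
            return sum((a - b) ** 2 for a, b in zip(x, mu))
        return min(self.counts, key=dist)

    def update(self, x, y):
        if y not in self.counts:
            self.sums[y], self.counts[y] = list(x), 1
        else:
            self.sums[y] = [s + a for s, a in zip(self.sums[y], x)]
            self.counts[y] += 1
```

The key property of the protocol, independent of the learner plugged in, is that accuracy is accumulated strictly before each training update, which suits non-stationary streams.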
Ensemble machine learning approach for electronic nose signal processing
Electronic nose (e-nose) systems have been reported to be used in many areas as rapid, low-cost, and non-invasive instruments. Especially in meat production and processing, the e-nose system is a powerful tool for processing volatile compounds as a unique ‘fingerprint’. The ability of the pattern recognition algorithm to analyze e-nose signals is the key to the success of the e-nose system in many applications. On the other hand, ensemble methods have been reported to achieve favorable performance on various data sets. This research proposes an ensemble learning approach for e-nose signal processing, especially in beef quality assessment. Ensemble methods are used not only for the learning algorithms but also for sensor array optimization. For sensor array optimization, three filter-based feature selection algorithms (FSAs), namely reliefF, chi-square, and Gini index, are used to build an ensemble FSA. The ensemble FSA is developed to deal with the different or unstable outputs of a single FSA on homogeneous e-nose data sets in beef quality monitoring. Moreover, ensemble learning algorithms are employed for multi-class classification and regression tasks. Random forest and AdaBoost are used, representing bagging and boosting algorithms, respectively. The results are also compared with support vector machine and decision tree as single learners. According to the experimental results, our ensemble approach shows good performance and generalization in e-nose signal processing. The optimized sensor combination based on the filter-based FSAs shows stable results in both classification and regression tasks. Furthermore, AdaBoost, as a boosting algorithm, produces the best predictions despite using a smaller number of sensors.
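One common way to combine the outputs of several FSAs into an ensemble ranking, as described above, is mean-rank (Borda) aggregation. The sketch below assumes each FSA returns a full ranking of the same feature indices, best first; the aggregation rule is an illustration, not necessarily the one used in the paper:

```python
def borda_aggregate(rankings):
    """Merge several feature rankings (lists of feature indices, best first)
    into one ranking by mean position (Borda count).

    Assumes every ranking is a permutation of the same feature indices.
    """
    pos = {j: 0 for j in rankings[0]}
    for ranking in rankings:
        for p, j in enumerate(ranking):
            pos[j] += p  # accumulate each feature's position
    # Features with the smallest total position come first.
    return sorted(pos, key=lambda j: pos[j])
```

Rank-based aggregation has the advantage that the individual FSA scores (relief weights, chi-square statistics, Gini importances) need not be on comparable scales.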
Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains
Selecting a subset of relevant features is crucial to the analysis of high-dimensional datasets coming from a number of application domains, such as biomedical data, document and image analysis. Since no single selection algorithm seems to be capable of ensuring optimal results in terms of both predictive performance and stability (i.e. robustness to changes in the input data), researchers have increasingly explored the effectiveness of "ensemble" approaches involving the combination of different selectors. While interesting proposals have been reported in the literature, most of them have so far been evaluated in a limited number of settings (e.g. with data from a single domain and in conjunction with specific selection approaches), leaving unanswered important questions about the large-scale applicability and utility of ensemble feature selection. To contribute to the field, this work presents an empirical study which encompasses different kinds of selection algorithms (filters and embedded methods, univariate and multivariate techniques) and different application domains. Specifically, we consider 18 classification tasks with heterogeneous characteristics (in terms of number of classes and instances-to-features ratio) and experimentally evaluate, for feature subsets of different cardinalities, the extent to which an ensemble approach turns out to be more robust than a single selector, thus providing useful insight for both researchers and practitioners.
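Stability, in the sense used above (robustness to changes in the input data), is commonly quantified as the average pairwise similarity of the feature subsets selected across resampled versions of the data. A minimal sketch using Jaccard similarity follows; Jaccard is one of several indices used in the stability literature and is an assumption here, not necessarily the paper's choice:

```python
def jaccard(a, b):
    """Jaccard similarity of two feature subsets: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def selection_stability(subsets):
    """Average pairwise Jaccard similarity over all selected subsets.

    1.0 means the selector always picks the same features; values near 0
    mean the selection changes drastically with the input data.
    """
    n = len(subsets)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(jaccard(subsets[i], subsets[j]) for i, j in pairs) / len(pairs)
```

Comparing this score for a single selector versus an ensemble of selectors, at matched subset cardinalities, is exactly the kind of experiment the study describes.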
Trimming Stability Selection increases variable selection robustness
Contamination can severely distort an estimator unless the estimation
procedure is suitably robust. This is a well-known issue and has been addressed
in robust statistics; however, the relation between contamination and distorted
variable selection has rarely been considered in the literature. As for variable
selection, many methods for sparse model selection have been proposed,
including the Stability Selection which is a meta-algorithm based on some
variable selection algorithm in order to immunize against particular data
configurations. We introduce the variable selection breakdown point, which
quantifies the number of cases (resp. cells) that have to be contaminated
so that no relevant variable is detected. We show that particular outlier
configurations can completely mislead model selection and argue why even
cell-wise robust methods cannot fix this problem. We combine the variable
selection breakdown point with resampling, resulting in the Stability Selection
breakdown point that quantifies the robustness of Stability Selection. We
propose a trimmed Stability Selection which only aggregates the models with the
lowest in-sample losses so that, heuristically, models computed on heavily
contaminated resamples should be trimmed away. An extensive simulation study
with non-robust regression and classification algorithms as well as with Sparse
Least Trimmed Squares reveals both the potential of our approach to boost the
model selection robustness as well as the fragility of variable selection using
non-robust algorithms, even for an extremely small cell-wise contamination
rate.
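The trimming idea, aggregating only over the resample models with the lowest in-sample losses, can be sketched as follows; the function name, interface, and the per-feature frequency output are assumptions for illustration, a simplification of the paper's trimmed Stability Selection:

```python
def trimmed_selection_frequencies(results, trim_frac):
    """Trimmed aggregation over resamples.

    results:   list of (in_sample_loss, selected_feature_set), one per resample.
    trim_frac: fraction of resamples with the highest losses to discard,
               on the heuristic that heavily contaminated resamples yield
               large in-sample losses.

    Returns per-feature selection frequencies over the kept resamples.
    """
    ranked = sorted(results, key=lambda r: r[0])  # lowest loss first
    n_keep = max(1, int(len(ranked) * (1 - trim_frac)))
    kept = ranked[:n_keep]
    freq = {}
    for _, selected in kept:
        for j in selected:
            freq[j] = freq.get(j, 0) + 1
    return {j: count / n_keep for j, count in freq.items()}
```

Features whose frequency exceeds a threshold would then be selected, as in standard Stability Selection; trimming simply removes the votes of resamples that the loss heuristic flags as contaminated.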