16,436 research outputs found
The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures
Motivation: Biomarker discovery from high-dimensional data is a crucial
problem with enormous applications in biology and medicine. It is also
extremely challenging from a statistical viewpoint, but surprisingly few
studies have investigated the relative strengths and weaknesses of the plethora
of existing feature selection methods. Methods: We compare 32 feature selection
methods on 4 public gene expression datasets for breast cancer prognosis, in
terms of predictive performance, stability and functional interpretability of
the signatures they produce. Results: We observe that the feature selection
method has a significant influence on the accuracy, stability and
interpretability of signatures. Simple filter methods generally outperform more
complex embedded or wrapper methods, and ensemble feature selection has
generally no positive effect. Overall a simple Student's t-test seems to
provide the best results. Availability: Code and data are publicly available at
http://cbio.ensmp.fr/~ahaury/
Evolutionary intelligent agents for e-commerce: Generic preference detection with feature analysis
Product recommendation and preference tracking systems have been adopted extensively in e-commerce businesses. However, the heterogeneity of product attributes results in undesired impediment for an efficient yet personalized e-commerce product brokering. Amid the assortment of product attributes, there are some intrinsic generic attributes having significant relation to a customer’s generic preference. This paper proposes a novel approach in the detection of generic product attributes through feature analysis. The objective is to provide an insight to the understanding of customers’ generic preference. Furthermore, a genetic algorithm is used to find the suitable feature weight set, hence reducing the rate of misclassification. A prototype has been implemented and the experimental results are promising
A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics
The combination of multiple classifiers using ensemble methods is
increasingly important for making progress in a variety of difficult prediction
problems. We present a comparative analysis of several ensemble methods through
two case studies in genomics, namely the prediction of genetic interactions and
protein functions, to demonstrate their efficacy on real-world datasets and
draw useful conclusions about their behavior. These methods include simple
aggregation, meta-learning, cluster-based meta-learning, and ensemble selection
using heterogeneous classifiers trained on resampled data to improve the
diversity of their predictions. We present a detailed analysis of these methods
across 4 genomics datasets and find the best of these methods offer
statistically significant improvements over the state of the art in their
respective domains. In addition, we establish a novel connection between
ensemble selection and meta-learning, demonstrating how both of these disparate
methods establish a balance between ensemble diversity and performance.Comment: 10 pages, 3 figures, 8 tables, to appear in Proceedings of the 2013
International Conference on Data Minin
Temporal Feature Selection with Symbolic Regression
Building and discovering useful features when constructing machine learning models is the central task for the machine learning practitioner. Good features are useful not only in increasing the predictive power of a model but also in illuminating the underlying drivers of a target variable. In this research we propose a novel feature learning technique in which Symbolic regression is endowed with a ``Range Terminal\u27\u27 that allows it to explore functions of the aggregate of variables over time. We test the Range Terminal on a synthetic data set and a real world data in which we predict seasonal greenness using satellite derived temperature and snow data over a portion of the Arctic. On the synthetic data set we find Symbolic regression with the Range Terminal outperforms standard Symbolic regression and Lasso regression. On the Arctic data set we find it outperforms standard Symbolic regression, fails to beat the Lasso regression, but finds useful features describing the interaction between Land Surface Temperature, Snow, and seasonal vegetative growth in the Arctic
Unsupervised Graph-based Rank Aggregation for Improved Retrieval
This paper presents a robust and comprehensive graph-based rank aggregation
approach, used to combine results of isolated ranker models in retrieval tasks.
The method follows an unsupervised scheme, which is independent of how the
isolated ranks are formulated. Our approach is able to combine arbitrary
models, defined in terms of different ranking criteria, such as those based on
textual, image or hybrid content representations.
We reformulate the ad-hoc retrieval problem as a document retrieval based on
fusion graphs, which we propose as a new unified representation model capable
of merging multiple ranks and expressing inter-relationships of retrieval
results automatically. By doing so, we claim that the retrieval system can
benefit from learning the manifold structure of datasets, thus leading to more
effective results. Another contribution is that our graph-based aggregation
formulation, unlike existing approaches, allows for encapsulating contextual
information encoded from multiple ranks, which can be directly used for
ranking, without further computations and post-processing steps over the
graphs. Based on the graphs, a novel similarity retrieval score is formulated
using an efficient computation of minimum common subgraphs. Finally, another
benefit over existing approaches is the absence of hyperparameters.
A comprehensive experimental evaluation was conducted considering diverse
well-known public datasets, composed of textual, image, and multimodal
documents. Performed experiments demonstrate that our method reaches top
performance, yielding better effectiveness scores than state-of-the-art
baseline methods and promoting large gains over the rankers being fused, thus
demonstrating the successful capability of the proposal in representing queries
based on a unified graph-based model of rank fusions
Evolving Spatially Aggregated Features from Satellite Imagery for Regional Modeling
Satellite imagery and remote sensing provide explanatory variables at
relatively high resolutions for modeling geospatial phenomena, yet regional
summaries are often desirable for analysis and actionable insight. In this
paper, we propose a novel method of inducing spatial aggregations as a
component of the machine learning process, yielding regional model features
whose construction is driven by model prediction performance rather than prior
assumptions. Our results demonstrate that Genetic Programming is particularly
well suited to this type of feature construction because it can automatically
synthesize appropriate aggregations, as well as better incorporate them into
predictive models compared to other regression methods we tested. In our
experiments we consider a specific problem instance and real-world dataset
relevant to predicting snow properties in high-mountain Asia
A test problem for visual investigation of high-dimensional multi-objective search
An inherent problem in multiobjective optimization is that the visual observation of solution vectors with four or more objectives is infeasible, which brings major difficulties for algorithmic design, examination, and development. This paper presents a test problem, called the Rectangle problem, to aid the visual investigation of high-dimensional multiobjective search. Key features of the Rectangle problem are that the Pareto optimal solutions 1) lie in a rectangle in the two-variable decision space and 2) are similar (in the sense of Euclidean geometry) to their images in the four-dimensional objective space. In this case, it is easy to examine the behavior of objective vectors in terms of both convergence and diversity, by observing their proximity to the optimal rectangle and their distribution in the rectangle, respectively, in the decision space. Fifteen algorithms are investigated. Underperformance of Pareto-based algorithms as well as most state-of-the-art many-objective algorithms indicates that the proposed problem not only is a good tool to help visually understand the behavior of multiobjective search in a high-dimensional objective space but also can be used as a challenging benchmark function to test algorithms' ability in balancing the convergence and diversity of solutions
- …