Heuristic Search over a Ranking for Feature Selection
In this work, we propose a new feature selection technique that allows the wrapper approach to be used for finding a well-suited feature set for distinguishing experiment classes in high-dimensional data sets. Our method is based on the relevance and redundancy idea, in the sense that a ranked feature is chosen only if additional information is gained by adding it. This heuristic leads to considerably better accuracy than the full feature set and other representative feature selection algorithms on twelve well-known data sets, coupled with notable dimensionality reduction.
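As an illustration of the greedy step described in this abstract, the following is a minimal sketch of a wrapper search over a feature ranking. The helper name `rank_then_wrap`, the mutual-information ranking and the naive Bayes wrapper are illustrative assumptions, not the authors' implementation.

```python
# Sketch: rank features, then keep a ranked feature only if adding it improves
# the cross-validated accuracy of the wrapped classifier.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def rank_then_wrap(X, y, estimator=None, cv=5):
    estimator = estimator or GaussianNB()
    ranking = np.argsort(mutual_info_classif(X, y))[::-1]   # most relevant first
    selected, best_score = [], 0.0
    for f in ranking:
        candidate = selected + [f]
        score = cross_val_score(estimator, X[:, candidate], y, cv=cv).mean()
        if score > best_score:        # keep the feature only if it adds information
            selected, best_score = candidate, score
    return selected, best_score
```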
Digging into acceptor splice site prediction: an iterative feature selection approach
Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction.
We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights into the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature.
The results described in this paper contribute both to the domain of gene prediction and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.
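One plausible reading of the AG-scanning feature is a count of AG dinucleotides upstream of the candidate acceptor site; the sketch below computes such a count. The function name, window size and exact definition are assumptions for illustration, not the paper's specification.

```python
# Hedged sketch of an AG-scanning style feature: count AG dinucleotides in the
# intronic window upstream of a candidate acceptor site.
def ag_scanning_feature(sequence: str, acceptor_pos: int, window: int = 100) -> int:
    """Number of AG dinucleotides in the `window` bases preceding the candidate site."""
    upstream = sequence[max(0, acceptor_pos - window):acceptor_pos].upper()
    return sum(1 for i in range(len(upstream) - 1) if upstream[i:i + 2] == "AG")

# Example: a candidate acceptor at position 20 of a toy intron fragment.
print(ag_scanning_feature("ttagccagtttagccttcag" + "AG", 20, window=20))
```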
Pragmatic Ontology Evolution: Reconciling User Requirements and Application Performance
Increasingly, organizations are adopting ontologies to describe their large catalogues of items. These ontologies need to evolve regularly in response to changes in the domain and the emergence of new requirements. An important step of this process is the selection of candidate concepts to include in the new version of the ontology. This operation needs to take into account a variety of factors and in particular reconcile user requirements and application performance. Current ontology evolution methods focus either on ranking concepts according to their relevance or on preserving compatibility with existing applications. However, they do not take into consideration the impact of the ontology evolution process on the performance of computational tasks – e.g., in this work we focus on instance tagging, similarity computation, generation of recommendations, and data clustering. In this paper, we propose the Pragmatic Ontology Evolution (POE) framework, a novel approach for selecting from a group of candidates a set of concepts able to produce a new version of a given ontology that i) is consistent with a set of user requirements (e.g., maximum number of concepts in the ontology), ii) is parametrised with respect to a number of dimensions (e.g., topological considerations), and iii) effectively supports relevant computational tasks. Our approach also supports users in navigating the space of possible solutions by showing how certain choices, such as limiting the number of concepts or privileging trendy concepts rather than historical ones, would reflect on the application performance. An evaluation of POE on the real-world scenario of the evolving Springer Nature taxonomy for editorial classification yielded excellent results, demonstrating a significant improvement over alternative approaches.
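To make the selection step concrete, here is a minimal sketch, not the POE implementation, of choosing candidate concepts under a user requirement on ontology size while weighting several dimensions (relevance, topological fit, estimated impact on a downstream task). The weights, field names and example concepts are illustrative assumptions.

```python
# Sketch: score candidate concepts on several dimensions and keep the top ones
# subject to a maximum-size requirement.
def select_concepts(candidates, max_concepts, weights=(0.4, 0.3, 0.3)):
    """candidates: dicts with 'name', 'relevance', 'topology', 'task_impact' in [0, 1]."""
    w_rel, w_top, w_task = weights
    scored = sorted(
        candidates,
        key=lambda c: w_rel * c["relevance"] + w_top * c["topology"] + w_task * c["task_impact"],
        reverse=True,
    )
    return [c["name"] for c in scored[:max_concepts]]

candidates = [
    {"name": "Deep Learning", "relevance": 0.9, "topology": 0.7, "task_impact": 0.8},
    {"name": "Expert Systems", "relevance": 0.4, "topology": 0.9, "task_impact": 0.3},
]
print(select_concepts(candidates, max_concepts=1))
```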
Large-scale Nonlinear Variable Selection via Kernel Random Features
We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions, not being a priori limited to additive models. This is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typical poor scaling properties of kernel methods by mapping the inputs into a relatively low-dimensional space of random features. The algorithm discovers the variables relevant for the regression task together with learning the prediction model through learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method on a set of large-scale synthetic and real datasets.
Preceding rule induction with instance reduction methods
A new prepruning technique for rule induction is presented which applies instance reduction before rule induction. An empirical evaluation records the predictive accuracy and size of rule-sets generated from 24 datasets from the UCI Machine Learning Repository. Three instance reduction algorithms (Edited Nearest Neighbour, AllKnn and DROP5) are compared. Each one is used to reduce the size of the training set, prior to inducing a set of rules using Clark and Boswell's modification of CN2. A hybrid instance reduction algorithm (comprised of AllKnn and DROP5) is also tested. For most of the datasets, pruning the training set using ENN, AllKnn or the hybrid significantly reduces the number of rules generated by CN2, without adversely affecting the predictive performance. The hybrid achieves the highest average predictive accuracy.
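For reference, the Edited Nearest Neighbour (ENN) step can be sketched as below: an instance is dropped when the majority vote of its k nearest neighbours disagrees with its own label. The helper name is hypothetical, and the rule inducer applied afterwards (CN2 in the paper) is not shown.

```python
# Minimal ENN sketch: prune training instances misclassified by their neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_filter(X, y, k=3):
    """X: numpy feature matrix, y: integer-encoded class labels."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbour
    _, idx = nn.kneighbors(X)
    keep = []
    for i, neighbours in enumerate(idx):
        votes = y[neighbours[1:]]                      # exclude the point itself
        if np.bincount(votes).argmax() == y[i]:
            keep.append(i)
    return X[keep], y[keep]
```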
The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures
Motivation: Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. Methods: We compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. Results: We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Simple filter methods generally outperform more complex embedded or wrapper methods, and ensemble feature selection has generally no positive effect. Overall, a simple Student's t-test seems to provide the best results. Availability: Code and data are publicly available at http://cbio.ensmp.fr/~ahaury/
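The simple filter that the study found competitive can be sketched as follows: score each gene with a two-sample Student's t statistic between the two prognosis classes and keep the top-ranked genes as the signature. The function name and signature size are illustrative assumptions.

```python
# Sketch of a t-test filter for gene signatures.
import numpy as np
from scipy.stats import ttest_ind

def t_test_signature(X, y, n_genes=100):
    """X: samples x genes expression matrix (numpy), y: binary class labels (numpy)."""
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=True)
    ranking = np.argsort(-np.abs(t))        # largest |t| first
    return ranking[:n_genes]
```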
Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data
Background: Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection, and therefore a number of feature selection procedures have been developed. Regularisation approaches extend the SVM to a feature selection method in a flexible way using penalty functions such as LASSO, SCAD and Elastic Net. We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone. Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which in comparison to a fixed grid search finds a global optimal solution more rapidly and more precisely.

Results: Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of the median number of features selected than Elastic Net SVM, and often predicted better than Elastic Net in terms of misclassification error. Finally, we applied the penalization methods described above to four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in both sparse and non-sparse situations.

Conclusions: The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. We were the first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions for the optimization of tuning parameters. The penalized SVM classification algorithms, as well as fixed grid and interval search for finding appropriate tuning parameters, were implemented in our freely available R package 'penalizedSVM'. We conclude that the Elastic SCAD SVM is a flexible and robust tool for classification and feature selection tasks in high-dimensional data such as microarray data sets.
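As a reference point, the Elastic SCAD penalty combines the standard SCAD penalty (Fan and Li, with a > 2, typically a = 3.7) with a ridge term, applied coordinate-wise to the weight vector. The sketch below shows only the penalty itself under that reading; the tuning parameters lam1 and lam2 would be chosen by the interval search described in the paper, and the function names are illustrative.

```python
import numpy as np

def scad(beta, lam, a=3.7):
    """Coordinate-wise SCAD penalty."""
    b = np.abs(beta)
    small = lam * b
    mid = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))
    large = lam ** 2 * (a + 1) / 2
    return np.where(b <= lam, small, np.where(b <= a * lam, mid, large))

def elastic_scad(beta, lam1, lam2, a=3.7):
    """SCAD plus ridge penalty on the weight vector."""
    return np.sum(scad(beta, lam1, a)) + lam2 * np.sum(beta ** 2)
```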
What Shall I Do Next? Intention Mining for Flexible Process Enactment
Besides the benefits of flexible processes, practical implementations of process-aware information systems have also revealed difficulties encountered by process participants during enactment. Several support and guidance solutions based on process mining have been proposed, but they lack a suitable semantics for human reasoning and decision making, as they mainly rely on low-level activities. Applying design science, we created FlexPAISSeer, an intention-mining-oriented approach, with its component artifacts: 1) IntentMiner, which discovers the intentional model of the executable process in an unsupervised manner; 2) IntentRecommender, which generates recommendations as intentions and confidence factors, based on the mined intentional process model and probabilistic calculus. The artifacts were evaluated in a case study with a Netherlands software company, using a Childcare system that allows flexible data-driven process enactment.
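A minimal sketch, hypothetical and not FlexPAISSeer itself, of how a mined intentional model can drive "what shall I do next?" suggestions: past traces yield intention transition probabilities, and those probabilities become the confidence factors attached to recommended next intentions. The intention labels are invented for illustration.

```python
from collections import Counter

def mine_transitions(traces):
    """traces: lists of intention labels observed in past enactments."""
    counts = {}
    for trace in traces:
        for cur, nxt in zip(trace, trace[1:]):
            counts.setdefault(cur, Counter())[nxt] += 1
    return {cur: {n: c / sum(cnt.values()) for n, c in cnt.items()}
            for cur, cnt in counts.items()}

def recommend(model, current_intention, top=2):
    options = model.get(current_intention, {})
    return sorted(options.items(), key=lambda kv: kv[1], reverse=True)[:top]

model = mine_transitions([["register child", "plan placement", "confirm contract"],
                          ["register child", "plan placement", "adjust schedule"]])
print(recommend(model, "plan placement"))   # next intentions with confidence factors
```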