Search CORE

469 research outputs found

SOAP: Efficient Feature Selection of Numeric Attributes

Author: C. A. R. Hoare
G. Pagallo
H. Almuallim
J. Quinlan
R. Kohavi
R. Setiono
Publication date: 01/01/2002

Attribute selection techniques for supervised learning, used in the preprocessing phase to emphasize the most relevant attributes, make classification models simpler and easier to understand. Depending on the method applied (its starting point, search organization, evaluation strategy, and stopping criterion), there is an added cost to the classification algorithm, which is normally compensated, to a greater or lesser extent, by the attribute reduction in the classification model. The algorithm SOAP (Selection of Attributes by Projection) has some interesting characteristics: a lower computational cost (O(mn log n) for m attributes and n examples in the data set) than other typical algorithms, owing to the absence of distance and statistical calculations, and no need for transformation. The performance of SOAP is analysed in two ways: percentage of reduction and classification. SOAP has been compared to CFS [6] and ReliefF [11]. The results are generated by C4.5 and 1NN before and after the application of the algorithms.
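
The abstract does not spell out the projection criterion, so the following is only a minimal sketch under an assumption: each attribute is scored by sorting the examples along that attribute and counting how often the class label changes between neighbours, and attributes with fewer changes are ranked first. The function names `count_label_changes` and `rank_attributes` are hypothetical. One sort per attribute gives the O(mn log n) cost quoted above.

```python
def count_label_changes(values, labels):
    """Sort examples by one attribute and count class changes along that axis.

    Assumption: ties between equal attribute values are not treated specially.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    return sum(1 for a, b in zip(order, order[1:]) if labels[a] != labels[b])


def rank_attributes(rows, labels):
    """Return attribute indices ordered from fewest to most label changes."""
    n_attrs = len(rows[0])
    changes = [
        count_label_changes([row[j] for row in rows], labels)
        for j in range(n_attrs)
    ]
    return sorted(range(n_attrs), key=lambda j: changes[j])


# Toy data (illustrative only): attribute 0 separates the classes cleanly,
# attribute 1 does not.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
labels = ["a", "a", "b", "b"]
ranking = rank_attributes(rows, labels)
```

No distances or statistical tests are computed in this sketch, only sorting and comparisons of neighbouring labels, which matches the cost argument in the abstract.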

CiteSeerX

Crossref

idUS. Depósito de Investigación Universidad de Sevilla

Shaping electron wave functions in a carbon nanotube with a parallel magnetic field

Author: Andrew J. O. Whitehouse
Bhargava N.
Conti-Ramsden G.
Cruz J. A.
David A. Copland
Efron B.
Fenson L.
Frankenburg W. K.
Hall M. A.
Hu X.
James G. Scott
Katie L. McMahon
Kohavi R.
Kotthoff L.
Lebarton E. S.
Liu S.
Martyn Symons
Quinlan J. R.
Rebecca Armstrong
Semel W.
Squires J.
Straker L.
Wendy L. Arnott
Publication date: 01/01/2018

A magnetic field, through its vector potential, usually causes measurable changes in the electron wave function only in the direction transverse to the field. Here we demonstrate experimentally and theoretically that in carbon nanotube quantum dots, which combine cylindrical topology with a bipartite hexagonal lattice, a magnetic field along the nanotube axis also affects the longitudinal profile of the electronic states. With the high magnetic fields (up to 17 T) in our experiment, the wave functions can be tuned all the way from a "half-wave resonator" shape, with nodes at both ends, to a "quarter-wave resonator" shape, with an antinode at one end. This in turn causes a distinct dependence of the conductance on the magnetic field. Our results demonstrate a new strategy for the control of wave functions using magnetic fields in quantum systems with nontrivial lattice and topology.

arXiv.org e-Print Archive

University of Regensburg Publication Server

Crossref

Queensland University of Technology ePrints Archive

University of Queensland eSpace

FigShare

The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

Author: A Ivshina
Anne-Claire Haury
C Ambroise
C Fan
C Lai
C Sotiriou
F Reyal
G Abraham
H Zou
I Guyon
J Bi
J Mairal
J Wang
Jean-Philippe Vert
JPA Ioannidis
L Ein-Dor
M Dai
Muy-Teck Teh
N Meinshausen
P Wirapati
Pierre Gestraud
R Kohavi
R Shen
R Simon
R Tibshirani
RA Irizarry
S Michiels
T Abeel
T Barrett
T Iwamoto
W Shi
Y Benjamini
Y Pawitan
Y Wang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 23/06/2011

Motivation: Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. Methods: We compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. Results: We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Simple filter methods generally outperform more complex embedded or wrapper methods, and ensemble feature selection has generally no positive effect. Overall a simple Student's t-test seems to provide the best results. Availability: Code and data are publicly available at http://cbio.ensmp.fr/~ahaury/
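
As an illustration of the kind of simple filter the abstract refers to, the following is a minimal sketch of ranking features by the absolute value of a pooled-variance Student's t statistic between two classes. It is not the authors' code; the function names `student_t` and `top_k_features` and the toy data are hypothetical, and the sketch assumes exactly two classes with non-zero pooled variance for every feature.

```python
import math
from statistics import mean, variance


def student_t(x, y):
    """Two-sample Student's t statistic with pooled (equal) variance."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def top_k_features(rows, labels, k):
    """Return the indices of the k features with the largest |t|."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = []
    for j in range(len(rows[0])):
        x = [r[j] for r, c in zip(rows, labels) if c == classes[0]]
        y = [r[j] for r, c in zip(rows, labels) if c == classes[1]]
        scores.append(abs(student_t(x, y)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy expression matrix (illustrative only): rows are samples, columns features.
rows = [[1.0, 10.0], [2.0, 12.0], [1.5, 30.0], [2.5, 32.0]]
labels = [0, 0, 1, 1]
selected = top_k_features(rows, labels, k=1)
```

Each feature is scored independently of the others, which is what distinguishes a filter of this kind from the embedded and wrapper methods mentioned in the abstract.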

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

HAL Descartes

HAL-MINES ParisTech

Assisted Diagnosis of Parkinsonism Based on the Striatal Morphology

Author: Diego Castillo-Barnes
Fermín Segovia
Francisco J. Martínez-Murcia
Greenberg D.
Hochberg Y.
Javier Ramírez
Juan M. Górriz
Kohavi R.
Mammone N.
Sáez G.
Theodoridis S.
Towey D. J.
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 01/01/2019

Parkinsonism is a clinical syndrome characterized by the progressive loss of striatal dopamine. Its diagnosis is usually corroborated by neuroimaging data such as DaTSCAN neuroimages, which allow the possible dopamine deficiency to be visualized. During the last decade, a number of computer systems have been proposed to automatically analyze DaTSCAN neuroimages, eliminating the subjectivity inherent to the visual examination of the data. In this work, we propose a computer system based on machine learning to separate Parkinsonian patients and control subjects using the size and shape of the striatal region, modeled from DaTSCAN data. First, an algorithm based on adaptive thresholding is used to parcel the striatum. This region is then divided into two according to the brain hemisphere division and characterized with 152 measures, extracted from the volume and its three possible 2-dimensional projections. Afterwards, the Bhattacharyya distance is used to discard the least discriminative measures and, finally, the neuroimage category is estimated by means of a Support Vector Machine classifier. This method was evaluated using a dataset with 189 DaTSCAN neuroimages, obtaining an accuracy rate over 94%. This rate outperforms those obtained by previous approaches that use the intensity of each striatal voxel as a feature. This work was supported by the MINECO/FEDER under the TEC2015-64718-R project, the Ministry of Economy, Innovation, Science and Employment of the Junta de Andalucía under the P11-TIC-7103 Excellence Project and the Vicerectorate of Research and Knowledge Transfer of the University of Granada.
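
The abstract does not say how the Bhattacharyya distance is computed, so the following is a minimal sketch under an assumption: each measure is modeled as a univariate Gaussian within each class, for which the distance has the closed form D_B = (1/4)(mu1 - mu2)^2 / (s1^2 + s2^2) + (1/2) ln((s1^2 + s2^2) / (2 s1 s2)). The function names `bhattacharyya_gaussian` and `keep_most_discriminative` and the toy data are hypothetical, and the classifier step is not shown.

```python
import math
from statistics import mean, variance


def bhattacharyya_gaussian(x, y):
    """Bhattacharyya distance between two samples, assuming each is Gaussian."""
    m1, m2 = mean(x), mean(y)
    v1, v2 = variance(x), variance(y)  # assumed non-zero
    separation = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    shape = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return separation + shape


def keep_most_discriminative(rows, labels, k):
    """Return indices of the k measures with the largest class distance."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = []
    for j in range(len(rows[0])):
        x = [r[j] for r, c in zip(rows, labels) if c == classes[0]]
        y = [r[j] for r, c in zip(rows, labels) if c == classes[1]]
        scores.append(bhattacharyya_gaussian(x, y))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy measures (illustrative only): column 1 separates the two groups,
# column 0 does not.
rows = [[1.0, 0.0], [3.0, 2.0], [1.0, 10.0], [3.0, 12.0]]
labels = ["control", "control", "patient", "patient"]
kept = keep_most_discriminative(rows, labels, k=1)
```

The measures that survive this ranking would then be passed to the classifier; discarding the lowest-scoring ones corresponds to the "discard the least discriminative measures" step in the abstract.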

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional Universidad de Granada

The identification of informative genes from multiple datasets with increasing complexity

Author: AH Fielding
Allan Tucker
BC Haynes
C Zhang
D Grossman
D Heckerman
D Madigan
DM Chickering
DR Rhodes
E Segal
G Schwarz
H Ma
J Bockhorst
J Pearl
J Su
JB Tobler
JM Peña
KK Tomczak
KP Murphy
M Miron
M Stone
N Friedman
Peter AC 't Hoen
R Jelier
R Kohavi
R Mac Nally
RA Irizarry
S Iezzi
S Yahya Anvar
SS Shen-Orr
TI Lee
TVan den Bulcke
W Lam
WL Buntine
X Xu
Y Cao
Y Lai
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Background: In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selection of a Bayesian classifier with an appropriate level of complexity by evaluation of predictive performance on independent data sets; (3) comparing the different gene selections and the influence of increasing the model complexity; (4) functional analysis of the informative genes. Results: In this paper, we identify the most appropriate model complexity using cross-validation and independent test set validation for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and select the most informative. We also show that these models can explain the myogenesis-related genes (genes of interest) significantly better than others (P < 0.004) since the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets with increasing complexity whilst additionally modelling the interaction between genes. Conclusions: We show that Bayesian networks derived from simpler controlled systems have better performance than those trained on datasets from more complex biological systems. Further, we show that highly predictive and consistent genes, from the pool of differentially expressed genes, across independent datasets are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Leiden University Scholarly Publications

Brunel University Research Archive

Automatic Feature Extraction for Classifying Audio Data

Author: F. Takens
G. Guo
G. Loy
Ingo Mierswa
J. H. Holland
J. Koza
J. W. Cooley
Katharina Morik
R. Kohavi
T. Bäck
Publication venue: 'Springer Science and Business Media LLC'

Crossref

An information-theoretic framework for semantic-multimedia retrieval

Author: Amir A.
Argillander J.
Barnard K.
Berger A.
Chang S.-F.
Duygulu P.
Feng S. L.
Jeon J.
Joachims T.
João Magalhães
Kohavi R.
Lavrenko V.
McCallum A.
Nigam K.
Rocchio J.
Snoek C. G. M. v.
Stefan Rüger
Westerveld T.
Yang Y.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/11/2010

This article is set in the context of searching text and image repositories by keyword. We develop a unified probabilistic framework for text, image, and combined text and image retrieval that is based on the detection of keywords (concepts) using automated image annotation technology. Our framework is deeply rooted in information theory and lends itself to use with other media types. We estimate a statistical model in a multimodal feature space for each possible query keyword. The key element of our framework is to identify feature space transformations that make them comparable in complexity and density. We select the optimal multimodal feature space with a minimum description length criterion from a set of candidate feature spaces that are computed with the average-mutual-information criterion for the text part and hierarchical expectation maximization for the visual part of the data. We evaluate our approach in three retrieval experiments (only text retrieval, only image retrieval, and text combined with image retrieval), verify the framework’s low computational complexity, and compare with existing state-of-the-art ad-hoc models

Crossref

Open Research Online (The Open University)

Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data

Author: Axel Benner
C Chang
D Jones
DB Allison
E Dimitriadou
F Markowetz
G Fung
Grischa Toedt
H Froehlich
H Zou
HH Zhang
I Guyon
I Inza
J Fan
J Quackenbush
JC Hsu
JD Hoheisel
JD Storey
L Wang
LJ van't Veer
M Greiner
M Johannes
MJ van de Vijver
N Becker
Natalia Becker
Peter Lichter
PS Bradley
Q Liu
R Kohavi
R Tibshirani
T Hastie
V Vapnik
W Gu
X Li
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net. We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone. Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which, in comparison to a fixed grid search, finds a globally optimal solution more rapidly and more precisely. Results: Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often predicted better than Elastic Net in terms of misclassification error. Finally, we applied the penalization methods described above on four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations. Conclusions: The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. We were first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions on the optimization of tuning parameters. The penalized SVM classification algorithms as well as fixed grid and interval search for finding appropriate tuning parameters were implemented in our freely available R package 'penalizedSVM'. We conclude that the Elastic SCAD SVM is a flexible and robust tool for classification and feature selection tasks for high-dimensional data such as microarray data sets.
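
To make the "combination of SCAD and ridge penalties" concrete, the following is a minimal sketch of the penalty value only, not the 'penalizedSVM' implementation and not an SVM solver. It uses the standard piecewise SCAD penalty with shape parameter a (commonly 3.7, and required to be greater than 2) and adds a squared-norm term. The function names, the parameter names `lam1` and `lam2`, and the exact scaling of the ridge term are assumptions.

```python
def scad_penalty(w, lam, a=3.7):
    """Standard SCAD penalty for a single coefficient w (requires a > 2)."""
    t = abs(w)
    if t <= lam:
        # Small coefficients: behaves like the LASSO penalty.
        return lam * t
    if t <= a * lam:
        # Transition zone: quadratic interpolation between the two regimes.
        return -(t * t - 2.0 * a * lam * t + lam * lam) / (2.0 * (a - 1.0))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return (a + 1.0) * lam * lam / 2.0


def elastic_scad_penalty(weights, lam1, lam2, a=3.7):
    """SCAD penalty plus a ridge term lam2 * ||w||^2 (scaling is an assumption)."""
    return sum(scad_penalty(w, lam1, a) + lam2 * w * w for w in weights)


# Illustrative values only.
small = scad_penalty(0.5, lam=1.0)    # LASSO-like region
large = scad_penalty(10.0, lam=1.0)   # flat region
total = elastic_scad_penalty([0.5, 10.0], lam1=1.0, lam2=0.1)
```

With `lam2 = 0` the sketch reduces to plain SCAD; a positive `lam2` keeps penalizing large coefficients, which is one way to read the abstract's claim that the combination avoids the sparsity limitations of SCAD alone on non-sparse data.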

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central