MaLeS: A Framework for Automatic Tuning of Automated Theorem Provers

Author: A Riazanov
Daniel Kühlwein
F Hutter
G Sutcliffe
I Guyon
Josef Urban
L Xu
S Schulz
T Tammet
TE Oliphant
VN Vapnik
Publication venue: Springer Science and Business Media LLC

Publikationer från Linköpings universitet

On reliable discovery of molecular signatures

Author: A Heorl
B Efron
C Cortes
D Singh
F Li
I Guyon
J Bogaerts
J Schäfer
Jesper Tegnér
Johan Björkegren
JP Ioannidis
L Devroye
L Ein-Dor
LJ van't Veer
M Campo Dell'Orto
ME Tipping
R Nilsson
Roland Nilsson
S Michiels
S Mika
TM Frayling
TR Golub
U Alon
VN Vapnik
Y Benjamini
Y Wang
Y Yu
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract. Background: Molecular signatures are sets of genes, proteins, genetic variants or other variables that can be used as markers for a particular phenotype. Reliable signature discovery methods could yield valuable insight into cell biology and mechanisms of human disease. However, it is currently not clear how to control error rates such as the false discovery rate (FDR) in signature discovery. Moreover, signatures for cancer gene expression have been shown to be unstable, that is, difficult to replicate in independent studies, casting doubts on their reliability. Results: We demonstrate that with modern prediction methods, signatures that yield accurate predictions may still have a high FDR. Further, we show that even signatures with low FDR may fail to replicate in independent studies due to limited statistical power. Thus, neither stability nor predictive accuracy is relevant when FDR control is the primary goal. We therefore develop a general statistical hypothesis testing framework that for the first time provides FDR control for signature discovery. Our method is demonstrated to be correct in simulation studies. When applied to five cancer data sets, the method was able to discover molecular signatures with 5% FDR in three cases, while two data sets yielded no significant findings. Conclusion: Our approach enables reliable discovery of molecular signatures from genome-wide data with current sample sizes. The statistical framework developed herein is potentially applicable to a wide range of prediction problems in bioinformatics.

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data

Author: A Ben-Dor
AJ Rai
Alexander Miron
C Ambroise
C Cortes
C Furlanello
CM Perou
EF Petricoin III
EP Diamandis
H Zhang
Hon-chiu E Leung
I Guyon
James D Iglehart
Jun S Liu
L Breiman
L Li
LJ van't Veer
Lyndsay N Harris
MD Hulett
Q Shi
Qian Shi
R Collobert
RO Duda
S Gruvberger
S Mukherjee
T Sorlie
TR Golub
TS Furey
VN Vapnik
Wing H Wong
X Zhang
Xin Lu
Xiu-qin Xu
XQ Xu
Xuegong Zhang
Y Barash
Y Yasui
Z Kou
Z Wu
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data. RESULTS: We developed a recursive support vector machine (R-SVM) algorithm to select important genes/biomarkers for the classification of noisy data. We compared its performance to a similar, state-of-the-art method (SVM recursive feature elimination or SVM-RFE), paying special attention to the ability to recover the true informative genes/biomarkers and the robustness to outliers in the data. Simulation experiments show that a 5% to 20% improvement over SVM-RFE can be achieved with regard to these properties. The SVM-based methods are also compared with a conventional univariate method, and their respective strengths and weaknesses are discussed. R-SVM was applied to two sets of SELDI-TOF-MS proteomics data, one from a human breast cancer study and the other from a study on rat liver cirrhosis. Important biomarkers found by the algorithm were validated by follow-up biological experiments. CONCLUSION: The proposed R-SVM method is suitable for analyzing noisy high-throughput proteomics and microarray data, and it outperforms SVM-RFE in robustness to noise and in the ability to recover informative features. The multivariate SVM-based method outperforms the univariate method in classification performance, but univariate methods can reveal more of the differentially expressed features, especially when there are correlations between the features.

Approaches to working in high-dimensional data spaces: gene expression microarrays

Author: A Dupuy
A Statnikov
AK Jain
B Efron
BJ Frey
C Lai
CF Aliferis
D J Miller
D Miller
DB Allison
DF Ransohoff
EP Xing
GV Trunk
I Guyon
J Novovicova
J Wang
JA Swets
JD Storey
KA Shedden
KY Yeung
L Ein-Dor
MW Graham
R Clarke
RO Duda
S Ramaswamy
T Lange
TR Golub
VN Vapnik
Y Wang
Z Wang
Publication venue: Nature Publishing Group

This review provides a focused summary of the implications of high-dimensional data spaces produced by gene expression microarrays for building better models of cancer diagnosis, prognosis, and therapeutics. We identify the unique challenges posed by high dimensionality to highlight methodological problems and discuss recent methods in predictive classification, unsupervised subclass discovery, and marker identification.

Detecting multivariate differentially expressed genes

Author: A Gretton
A Subramanian
A Szabo
D Chickering
D Maglott
DB Allison
G Casella
G Fritz
H Kitano
H Ogata
I Guyon
J Pearl
J Schäfer
JD Storey
JE Gunton
Jesper Tegnér
JM Peña
Johan Björkegren
José M Peña
L Ein-Dor
LJ van't Veer
M Ashburner
N Friedman
P Proks
R Kohavi
R Nilsson
RF Hashimoto
RL Berger
Roland Nilsson
S Kropf
S Roy
VN Vapnik
WH Press
Y Benjamini
Y Hochberg
Y Lu
Y Xiao
Publication venue: BioMed Central
Publication date: 01/01/2007

Abstract. Background: Gene expression is governed by complex networks, and differences in expression patterns between distinct biological conditions may therefore be complex and multivariate in nature. Yet, current statistical methods for detecting differential expression merely consider the univariate difference in expression level of each gene in isolation, thus potentially neglecting many genes of biological importance. Results: We have developed a novel algorithm for detecting multivariate expression patterns, named Recursive Independence Test (RIT). This algorithm generalizes differential expression testing to more complex expression patterns, while still including genes found by the univariate approach. We prove that RIT is consistent and controls error rates for small sample sizes. Simulation studies confirm that RIT offers more power than univariate differential expression analysis when multivariate effects are present. We apply RIT to gene expression data sets from diabetes and cancer studies, revealing several putative disease genes that were not detected by univariate differential expression analysis. Conclusion: The proposed RIT algorithm increases the power of gene expression analysis by considering multivariate effects while retaining error rate control, and may be useful when conventional differential expression tests yield few findings.

Public Library of Science (PLOS)

Expanding the Understanding of Biases in Development of Clinical-Grade Molecular Signatures: A Case Study in Acute Respiratory Viral Infections

Author: A Rangarajan
A Statnikov
AK Zaas
Alexander Statnikov
AM Glas
C Ambroise
CF Aliferis
Constantin F. Aliferis
EE Ntzani
ER DeLong
F Azuaje
FJ Gonzalez
GG Jackson
I Guyon
I Tsamardinos
J Pearl
JA Sparano
JT Leek
Jörn-Hendrik Weitkamp
KA Baggerly
Lauren McVoy
LM Cope
Nikita I. Lytkin
O Ramilo
R Kohavi
R Simon
RA Irizarry
RL Somorjai
TW Anderson
UM Braga-Neto
Vladimir Brusic
VN Vapnik
WE Johnson
Y Benjamini
Z Liu
Publication venue: Public Library of Science
Publication date: 01/06/2011

The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease, on an individual patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data-analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between standards for exploratory versus clinically successful molecular signatures entails a thorough understanding of possible biases in the data analysis phase and developing strategies to avoid them. Using a recently introduced data-analytic protocol as a case study, we provide an in-depth examination of the poorly studied biases of the data-analytic protocols related to signature multiplicity, biomarker redundancy, data preprocessing, and validation of signature reproducibility. The methodology and results presented in this work are aimed at expanding the understanding of these data-analytic biases that affect development of clinically robust molecular signatures. Several recommendations follow from the current study. First, all molecular signatures of a phenotype should be extracted to the extent possible, in order to provide comprehensive and accurate grounds for understanding disease pathogenesis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Third, data preprocessing procedures should be designed so as not to bias biomarker selection. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.

Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction

Author: A Statnikov
AC Tan
C Bishop
C Lai
D Geman
DG Beer
I Guyon
I Inza
J Jin
J Weston
LJ van 't Veer
Mark A Kon
MH Asyali
P Baldi
Ping Shi
Qifu Zhu
R Blanco
R Kohavi
S Hanshall
S Ma
S Yoon
SL Pomeroy
Surajit Ray
TM Cover
TR Golub
TS Furey
V Vinaya
VN Vapnik
X Zhang
Y Saeys
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers. Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.
Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.

CiteSeerX

Boston University Institutional Repository (OpenBU)

Public Library of Science (PLOS)

Enlighten

Improving Cancer Classification Accuracy Using Gene Pairs

Author: AC Tan
B Ripley
D Geman
D Singh
DW Parsons
GJ Gordon
I Guyon
J Kang
J Lapointe
J Liu
J Quinlan
Jaewoo Kang
JB Welsh
Jinseung Lee
Joel S. Bader
Km Lin
M Dettling
M Hall
M Mramor
MA Shipp
MF Rogers
P Chopra
Pankaj Chopra
R Gentleman
R Tibshirani
RO Stuart
S Dudoit
S Pomeroy
S Ramaswamy
S Yoon
Sunwon Lee
TR Golub
U Alon
VN Vapnik
XJ Zhou
Publication venue: Public Library of Science
Publication date: 01/01/2010

Recent studies suggest that the deregulation of pathways, rather than individual genes, may be critical in triggering carcinogenesis. The pathway deregulation is often caused by the simultaneous deregulation of more than one gene in the pathway. This suggests that robust gene pair combinations may exploit the underlying bio-molecular reactions that are relevant to the pathway deregulation and thus could provide better biomarkers for cancer than individual genes. To validate this hypothesis, we used gene pair combinations, called doublets, as input to the cancer classification algorithms instead of the original expression values, and we showed that the classification accuracy was consistently improved across different datasets and classification algorithms. We validated the proposed approach using nine cancer datasets and five classification algorithms: Prediction Analysis for Microarrays (PAM), C4.5 Decision Trees (DT), Naive Bayesian (NB), Support Vector Machine (SVM), and k-Nearest Neighbor (k-NN).

CiteSeerX

Application of two machine learning algorithms to genetic association studies in the presence of covariates

Author: A Bureau
AA Motsinger-Reif
Andrea S Foulkes
AS Foulkes
B Dasarathy
Bareng AS Nonyane
BAS Nonyane
C Bishop
C Tan
D Ge
D Lunn
E Atkinson
E Taioli
I Guyon
IE George
J Cohen
JH Friedman
JM Robins
L Breiman
L Cupples
M Groenendijk
MA Hernan
MJ van der Laan
MR Segal
NJS Christenfield
RV Shohet
S Kang
SJ Tannenbaum
SR Cole
T Hastie
TJ Costello
TM Huang
VN Vapnik
Publication venue: BioMed Central
Publication date: 01/01/2008

BACKGROUND: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms in the presence of covariates, however, is not well characterized. METHODS AND RESULTS: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided. CONCLUSION: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.

ScholarWorks@UMass Amherst

LSHTM Research Online