44 research outputs found

Stable Feature Selection for Biomarker Discovery

Author: He Zengyou
Yu Weichuan
Publication date: 01/01/2010

Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered, and only recently has this issue received increasing attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) providing an overview of this new yet fast-growing topic for convenient reference; (2) categorizing existing methods under an expandable framework for future research and development.
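One common way to quantify stability under sampling variation is to rerun a selector on bootstrap resamples and average the pairwise Jaccard similarity of the selected feature subsets. The sketch below illustrates that idea only; it is not taken from the article, and the selector `top_k_by_mean_difference`, the function names, and all parameters are assumptions made for illustration.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature-index sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def top_k_by_mean_difference(samples, labels, k):
    """Hypothetical selector: rank features by |mean(class 1) - mean(class 0)|."""
    n_features = len(samples[0])
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    scores = []
    for j in range(n_features):
        mean_pos = sum(s[j] for s in pos) / len(pos)
        mean_neg = sum(s[j] for s in neg) / len(neg)
        scores.append(abs(mean_pos - mean_neg))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def selection_stability(samples, labels, k, n_resamples=20, seed=0):
    """Average pairwise Jaccard similarity of subsets chosen on bootstrap resamples.

    n_resamples must be at least 2 so that at least one pair exists.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    subsets = []
    for _ in range(n_resamples):
        # Resample within each class so neither class can become empty.
        idx = [rng.choice(pos_idx) for _ in pos_idx]
        idx += [rng.choice(neg_idx) for _ in neg_idx]
        resampled = [samples[i] for i in idx]
        resampled_labels = [labels[i] for i in idx]
        subsets.append(top_k_by_mean_difference(resampled, resampled_labels, k))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value near 1 means the same features are chosen regardless of which samples were drawn; a value near 0 means the selection is driven largely by sampling noise.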

arXiv.org e-Print Archive

CiteSeerX

Hong Kong University of Science and Technology Institutional Repository

Differential analysis of biological networks

Author: Montana Giovanni
Ruan Da
Young Alastair
Publication date: 16/06/2015

In cancer research, the comparison of gene expression or DNA methylation networks inferred from healthy controls and patients can lead to the discovery of biological pathways associated with the disease. As a cancer progresses, its signalling and control networks are subject to some degree of localised re-wiring. Being able to detect disrupted interaction patterns induced by the presence or progression of the disease can lead to the discovery of novel molecular diagnostic and prognostic signatures. Currently there is a lack of scalable statistical procedures for two-network comparisons aimed at detecting localised topological differences. We propose the dGHD algorithm, a methodology for detecting differential interaction patterns in two-network comparisons. The algorithm relies on a statistic, the Generalised Hamming Distance (GHD), for assessing the degree of topological difference between networks and evaluating its statistical significance. dGHD builds on a non-parametric permutation testing framework but achieves computational efficiency through an asymptotic normal approximation. We show that the GHD is able to detect more subtle topological differences than a standard Hamming distance between networks. As a result, the dGHD algorithm achieves high performance in simulation studies, as measured by sensitivity and specificity. An application to the problem of detecting differential DNA co-methylation subnetworks associated with ovarian cancer demonstrates the potential benefits of the proposed methodology for discovering network-derived biomarkers associated with a trait of interest.
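The permutation-testing idea behind a two-network comparison can be sketched as follows. This is a simplified illustration and not the dGHD algorithm itself: the standard Hamming distance stands in for the GHD (whose definition is not given in the abstract), the p-value is obtained by brute-force node-label permutation rather than the asymptotic normal approximation, and the null hypothesis used here (no correspondence between the node labels of the two networks) is an assumption of the sketch. All function names are hypothetical.

```python
import random


def hamming_distance(a, b):
    """Fraction of node pairs (i < j) whose edge status differs between a and b."""
    n = len(a)
    differing = sum(
        1 for i in range(n) for j in range(i + 1, n) if a[i][j] != b[i][j]
    )
    return differing / (n * (n - 1) / 2)


def relabel(adj, perm):
    """Return the adjacency matrix with node i renamed to perm[i]."""
    n = len(adj)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[perm[i]][perm[j]] = adj[i][j]
    return out


def permutation_p_value(a, b, n_permutations=1000, seed=0):
    """Left-tailed permutation p-value for the distance between networks a and b.

    Counts how often a random relabelling of b is at least as close to a as
    the observed labelling. A small p-value suggests the two networks are
    more similar than expected if their node labels were unrelated.
    """
    n = len(a)
    rng = random.Random(seed)
    observed = hamming_distance(a, b)
    hits = 0
    for _ in range(n_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        if hamming_distance(a, relabel(b, perm)) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)
```

Brute-force permutation costs one distance evaluation per shuffle, which is the expense that a closed-form approximation to the null distribution avoids.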

arXiv.org e-Print Archive

Springer - Publisher Connector

Gains in Power from Structured Two-Sample Tests of Means on Graphs

Author: Dudoit Sandrine
Jacob Laurent
Neuvial Pierre
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2010

We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation, or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of smooth distribution shift on the graph. We also investigate the identification of non-homogeneous subgraphs of a given large graph, which poses both computational and multiple testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast cancer gene expression data analyzed in the context of KEGG pathways.

arXiv.org e-Print Archive

Collection Of Biostatistics Research Archive

Comparison of genetic association strategies in the presence of rare alleles

Author: Alain Empain
AP Morris
BS Li
C Dering
François Van Lishout
Jestinah M Mahachie John
K Van Steen
Kristel Van Steen
Lizzy De Lobel
ML Calle
NM Laird
R Tibshirani
S Dudoit
S Horvath
S Nacu
T Cattaert
Tom Cattaert
YS Aulchenko
Publication venue: BioMed Central
Publication date: 01/01/2011

In the quest for the missing heritability of most complex diseases, rare variants have received increased attention. Advances in large-scale sequencing have led to a shift from the common disease/common variant hypothesis to the common disease/rare variant hypothesis, or have at least reopened the debate about the relevance and importance of rare variants for gene discoveries. The investigation of modeling and testing approaches to identify significant disease/rare variant associations is in full motion. New methods to better deal with parameter estimation instabilities, convergence problems, or multiple testing corrections in the presence of rare variants or effect modifiers of rare variants are in their infancy. Using a recently developed semiparametric strategy to detect causal variants, we investigate the performance of the model-based multifactor dimensionality reduction (MB-MDR) technique in terms of power and family-wise error rate (FWER) control in the presence of rare variants, using population-based and family-based data (FAM-MDR). We compare family-based results obtained from MB-MDR analyses to screening findings from a quantitative trait pedigree-based association test (PBAT). Population-based data were further examined using penalized regression models. We restrict attention to all available single-nucleotide polymorphisms on chromosome 4 and consider Q1 as the outcome of interest. The considered family-based methods identified marker C4S4935 in the VEGFC gene with estimated power not exceeding 0.35 (FAM-MDR), when FWER was kept under control. The considered population-based methods gave rise to highly inflated FWERs (up to 90% for PBAT screening).

Lirias

Crossref

Springer - Publisher Connector

PubMed Central

Open Repository and Bibliography - Liège

Efficient network-guided multi-locus association mapping with graph cuts

Author: Azencott Chloé-Agathe
Borgwardt Karsten M.
Grimm Dominik
Kawahara Yoshinobu
Sugiyama Mahito
Publication date: 01/01/2013

As an increasing number of genome-wide association studies reveal the limitations of attempting to explain phenotypic heritability by single genetic loci, there is growing interest in associating complex phenotypes with sets of genetic loci. While several methods for multi-locus mapping have been proposed, it is often unclear how to relate the detected loci to the growing knowledge about gene pathways and networks. The few methods that take biological pathways or networks into account are either restricted to investigating a limited number of predetermined sets of loci, or do not scale to genome-wide settings. We present SConES, a new efficient method to discover sets of genetic loci that are maximally associated with a phenotype, while being connected in an underlying network. Our approach is based on a minimum cut reformulation of the problem of selecting features under sparsity and connectivity constraints that can be solved exactly and rapidly. SConES outperforms state-of-the-art competitors in terms of runtime, scales to hundreds of thousands of genetic loci, and exhibits higher power in detecting causal SNPs in simulation studies than existing methods. On flowering time phenotypes and genotypes from Arabidopsis thaliana, SConES detects loci that enable accurate phenotype prediction and that are supported by the literature. Matlab code for SConES is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/scones/ Comment: 20 pages, 6 figures, accepted at ISMB (International Conference on Intelligent Systems for Molecular Biology) 201

arXiv.org e-Print Archive

PubMed Central

MPG.PuRe

Classification and biomarker identification using gene network modules and support vector machines

Author: A Djebbari
A Spira
BS Srinivasan
D Kai-Bo
D Reiss
D Zhu
F Li
H Pang
I Guyon
I Inza
L Kari
Larry Manevitz
Louise C Showe
M Nebozhyn
M Yousef
Malik Yousef
Michael K Showe
Mohamed Ketany
PvS Eugene
R Bonneau
R Kohavi
RJ Critchley-Thorne
S Nacu
T Ideker
T Li
W Pan
X Yang
X Zhang
X-w Chen
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Classification using microarray datasets is usually based on a small number of samples for which tens of thousands of gene expression measurements have been obtained. The selection of the genes most significant to the classification problem is a challenging issue in high-dimensional data analysis and interpretation. A previous study with SVM-RCE (Recursive Cluster Elimination) suggested that classification based on groups of correlated genes sometimes exhibits better performance than classification using single genes. Large databases of gene interaction networks provide an important resource for the analysis of genetic phenomena and for classification studies using interacting genes. We now demonstrate that an algorithm which integrates network information with recursive feature elimination based on SVM exhibits good performance and improves the biological interpretability of the results. We refer to the method as SVM with Recursive Network Elimination (SVM-RNE). Results: Initially, one thousand genes selected by t-test from a training set are filtered so that only genes that map to a gene network database remain. The Gene Expression Network Analysis Tool (GXNA) is applied to the remaining genes to form n clusters of genes that are highly connected in the network. Linear SVM is used to classify the samples using these clusters, and a weight is assigned to each cluster based on its importance to the classification. The least informative clusters are removed while retaining the remainder for the next classification step. This process is repeated until an optimal classification is obtained. Conclusion: More than 90% accuracy can be obtained in classification of selected microarray datasets by integrating the interaction network information with the gene expression information from the microarrays. The Matlab version of SVM-RNE can be downloaded from http://web.macam.ac.il/~myousef
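The elimination loop described in the Results can be sketched generically. This is an assumption-laden outline, not the published SVM-RNE code: the callables `weigh` and `evaluate` are hypothetical placeholders for the linear-SVM cluster weighting and the accuracy estimate, and `drop_fraction` and `min_clusters` are illustrative parameters that the abstract does not specify.

```python
def recursive_cluster_elimination(clusters, weigh, evaluate,
                                  drop_fraction=0.25, min_clusters=1):
    """Repeatedly drop the lowest-weighted clusters and keep the best-scoring set.

    clusters: dict mapping cluster name -> list of gene identifiers.
    weigh(clusters): returns dict cluster name -> importance weight
        (placeholder for weights derived from a linear SVM).
    evaluate(clusters): returns an accuracy estimate for that set of clusters.
    """
    active = dict(clusters)
    best_set, best_acc = dict(active), evaluate(active)
    while len(active) > min_clusters:
        weights = weigh(active)
        # Drop at least one cluster per round, but never go below min_clusters.
        n_drop = max(1, int(len(active) * drop_fraction))
        n_drop = min(n_drop, len(active) - min_clusters)
        for name in sorted(active, key=lambda c: weights[c])[:n_drop]:
            del active[name]
        acc = evaluate(active)
        if acc >= best_acc:  # on ties, prefer the smaller cluster set
            best_set, best_acc = dict(active), acc
    return best_set, best_acc
```

Passing the weighting and evaluation steps in as functions keeps the loop independent of any particular classifier or cross-validation scheme.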

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Identification of differentially expressed subnetworks based on multivariate ANOVA

Author: A Gursoy
A Subramanian
D Ucar
EJ Edelman
HY Chang
HY Chuang
J Dai
JC Tse
JD Han
K Fellenberg
L Cabusora
L Tian
MA Newton
MT Dittrich
O Keskin
P Pavlidis
S Nacu
SB Kim
T Ideker
T Park
Taesung Park
Taeyoung Hwang
VG Tusher
X Lu
Z Guo
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Since high-throughput protein-protein interaction (PPI) data has recently become available for humans, there has been a growing interest in combining PPI data with other genome-wide data. In particular, the identification of phenotype-related PPI subnetworks using gene expression data has been of great concern. Successful integration for the identification of significant subnetworks requires the use of a search algorithm with a proper scoring method. Here we propose a multivariate analysis of variance (MANOVA)-based scoring method with a greedy search for identifying differentially expressed PPI subnetworks. Results: Given the MANOVA-based scoring method, we performed a greedy search to identify the subnetworks with the maximum scores in the PPI network. Our approach was successfully applied to human microarray datasets. Each identified subnetwork was annotated with the Gene Ontology (GO) term, resulting in the phenotype-related functional pathway or complex. We also compared these results with those of other scoring methods such as t-statistic- and mutual information-based scoring methods. The MANOVA-based method produced subnetworks with a larger number of proteins than the other methods. Furthermore, the subnetworks identified by the MANOVA-based method tended to consist of highly correlated proteins. Conclusion: This article proposes a MANOVA-based scoring method to combine PPI data with expression data using a greedy search. This method is recommended for the highly sensitive detection of large subnetworks.
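A greedy subnetwork search with a pluggable scoring function might look like the sketch below. It is not the authors' implementation: the stand-in `aggregate_score` (sum of per-node scores divided by the square root of the subnetwork size) replaces the MANOVA statistic, which would require the expression matrix and a statistics library, and the function names, the `max_size` cap, and the stopping rule are assumptions made for illustration.

```python
import math


def aggregate_score(nodes, node_scores):
    """Stand-in subnetwork score: sum of per-node scores divided by sqrt(size)."""
    return sum(node_scores[v] for v in nodes) / math.sqrt(len(nodes))


def greedy_subnetwork(graph, node_scores, seed, score=aggregate_score, max_size=20):
    """Grow a connected subnetwork from `seed`, adding the best neighbour each step.

    graph: dict mapping node -> set of neighbouring nodes (undirected).
    Stops when no neighbouring node improves the score or max_size is reached.
    """
    current = {seed}
    best = score(current, node_scores)
    while len(current) < max_size:
        # Candidate nodes are those adjacent to the current subnetwork.
        frontier = set().union(*(graph[v] for v in current)) - current
        candidates = [(score(current | {v}, node_scores), v) for v in sorted(frontier)]
        if not candidates:
            break
        top_score, top_node = max(candidates)
        if top_score <= best:
            break
        current.add(top_node)
        best = top_score
    return current, best
```

Because only neighbours of the current node set are ever added, the result is connected by construction; running the search from every seed node and keeping the top-scoring results is one way to cover the whole network.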

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Robust de novo pathway enrichment with KeyPathwayMiner 5

Author: Alcaraz Nicolas
Baumbach Jan
Dissing-Hansen Martin
Ditzel Henrik
List Markus
Mollenhauer Jan
Rehmsmeier Marc
Tan Qihua

University of Southern Denmark Research Output

High Accordance in Prognosis Prediction of Colorectal Cancer across Independent Datasets by Multi-Gene Module Expression Profiles

Author: A Barrier
A Hamosh
A Subramanian
A Torkamani
A Zipin-Roitman
AL Barabasi
B Huang
B Zhang
C Stark
D Dong
E Lee
ED Pleasance
H Ge
HY Chuang
I Ulitsky
J Su
JB O'Connell
JS Ross
Ju-Seog Lee
K Kawada
KS Garman
L Cabusora
L Ein-Dor
L Tian
Linfu Bai
MEJ Newman
P D'Haeseleer
P Dao
P Pavlidis
R Salazar
R Tibshirani
Rui Wang
S Kaiser
S Michiels
S Nacu
SL Carter
T Ideker
TK Jenssen
TSK Prasad
Wenting Li
YH Lin
YX Wang
Z Mi
Zhangming Yan
Zhirong Sun
Publication venue: Public Library of Science
Publication date: 16/03/2012

A considerable portion of patients with colorectal cancer have a high risk of disease recurrence after surgery. These patients can be identified by analyzing the expression profiles of signature genes in tumors. However, there is no consensus on which genes should be used, and the performance of a specific set of signature genes varies greatly across datasets, impeding their implementation in routine clinical application. Instead of using individual genes, here we identified functional multi-gene modules with significant expression changes between recurrent and recurrence-free tumors, and used them as signatures for predicting colorectal cancer recurrence in multiple datasets that were collected independently and profiled on different microarray platforms. The multi-gene modules we identified have a significant enrichment of known genes and biological processes relevant to cancer development, including genes from the chemokine pathway. Most strikingly, they showed a significant enrichment of somatic mutations found in colorectal cancer. These results confirmed the functional relevance of these modules for colorectal cancer development. Further, these functional modules from different datasets overlapped significantly. Finally, we demonstrated that, by leveraging the above information about these modules, our module-based classifier avoided arbitrarily fitting the classifier function and screening the signatures using the training data, and achieved more consistency in prognosis prediction across three independent datasets, which holds even when using very small training sets of tumors.

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

FigShare