Merged consensus clustering to assess and improve class discovery with microarray data

Author: A Thalamuthu
AK Jain
Andrew P Jarman
AP Jarman
E Levine
G Kerr
GC Tseng
I Frades
IM Gana Dresen
J Douglas Armstrong
J Gollub
J MacQueen
JC Dunn
JHH Do
L Kaufman
M Seiler
MJ van der Laan
MV Halkidi
R Suzuki
R Tibshirani
RL Camp
S Dudoit
S Monti
SA Greenberg
ST Milagre
SYY Kim
T Ian Simpson
T Shimogori
TR Golub
UC Sharma
YFF Leung
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract

Background: One of the most commonly performed tasks when analysing high-throughput gene expression data is to use clustering methods to classify the data into groups. A large number of clustering methods are available, but it is often unclear which method is best suited to the data and how to quantify the quality of the classifications produced.

Results: Here we describe an R package containing methods to analyse the consistency of clustering results from any number of different clustering methods using resampling statistics. These methods allow the identification of the best supported clusters and additionally rank cluster members by their fidelity within the cluster. These metrics allow us to compare the performance of different clustering algorithms under different experimental conditions and to select those that produce the most reliable clustering structures. We show the application of this method to simulated data, canonical gene expression experiments and our own novel analysis of genes involved in the specification of the peripheral nervous system in the fruit fly, Drosophila melanogaster.

Conclusions: Our package enables users to apply the merged consensus clustering methodology conveniently within the R programming environment, providing both analysis and graphical display functions for exploring clustering approaches. It extends the basic principle of consensus clustering by allowing the merging of results between different methods to provide an averaged clustering robustness. We show that this extension is useful in correcting for the tendency of clustering algorithms to treat outliers differently within datasets. The R package, clusterCons, is freely available at CRAN and SourceForge under the GNU public licence.
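The resampling and merging idea described in this abstract can be illustrated with a minimal sketch. This is not the clusterCons implementation or its API: the function names, the toy k-means and single-linkage routines, the 80% subsampling fraction, and the definitions of robustness and fidelity used here (mean within-cluster consensus) are all assumptions made for illustration, written in Python rather than R.

```python
import random
from itertools import combinations

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_labels(points, k, rng, iters=20):
    """Toy Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def single_linkage_labels(points, k, rng=None):
    """Toy agglomerative clustering (single linkage), cut at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(sq_dist(points[i], points[j])
                               for i in clusters[ab[0]] for j in clusters[ab[1]]),
        )
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def consensus_matrix(points, cluster_fn, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of resamples in which each co-sampled pair shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), int(frac * n))  # subsample without replacement
        labels = cluster_fn([points[i] for i in idx], k, rng)
        for (i, li), (j, lj) in combinations(zip(idx, labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if li == lj:
                together[i][j] += 1
                together[j][i] += 1
    return [[1.0 if i == j else
             (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
             for j in range(n)] for i in range(n)]

def merge_consensus(matrices):
    """Element-wise average of consensus matrices from different methods."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
            for i in range(n)]

def cluster_robustness(matrix, members):
    """Mean consensus over all pairs inside a proposed cluster (assumed definition)."""
    pairs = list(combinations(members, 2))
    return sum(matrix[i][j] for i, j in pairs) / len(pairs)

def member_fidelity(matrix, members):
    """Mean consensus of each member with the rest of its cluster (assumed definition)."""
    return {i: sum(matrix[i][j] for j in members if j != i) / (len(members) - 1)
            for i in members}

# Two well-separated toy groups of four points each (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
merged = merge_consensus([consensus_matrix(points, fn, k=2)
                          for fn in (kmeans_labels, single_linkage_labels)])
robustness = cluster_robustness(merged, [0, 1, 2, 3])
fidelity = member_fidelity(merged, [0, 1, 2, 3])
```

Averaging the per-method matrices means a pair of items scores highly only when several algorithms agree on it across resamples, which is one way to read the "averaged clustering robustness" the abstract mentions.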

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Edinburgh Research Explorer

Very Important Pool (VIP) genes – an application for microarray-based molecular signatures

Author: A Ben-Dor
A Bhattacharjee
A Butte
A Rosenwald
AK Jain
AL Bluma
B Liu
C Ambroise
C Ding
C Lai
D Singh
DG Beer
DJ Lockhart
EJ Yeoh
EK Tang
GJ Gordon
H Hackl
HH Zhang
Hong Fang
Huixiao Hong
IM Gana Dresen
InfoMetrix
J Dopazoa
J Gould
J Quackenbush
JG Zhang
JJ Chen
KE Lee
L Brehelin
L Breiman
L Ein-Dor
L Li
L Shi
L Wang
Leming Shi
LF Wessels
LJ van 't Veer
M Dettling
M Schena
MA Shipp
R Diaz-Uriarte
R Simon
Roger Perkins
S Dudoit
S Michiels
S Mukherjee
S Wold
SE Jarvis
SJ Raudys
SL Pomeroy
U Alon
U Lutz
VN Vapnik
W Jiang
Weida Tong
WJ Fu
X Chen
Y Peng
Y Wang
Z Su
Zhenqiang Su
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract

Background: Advances in DNA microarray technology portend that molecular signatures derived from microarrays will eventually be used in clinical environments and personalized medicine. Derivation of biomarkers is a large step beyond hypothesis generation and imposes considerably more stringency for accuracy in identifying informative gene subsets to differentiate phenotypes. The inherent nature of microarray data, with few samples and replicates compared to the large number of genes, requires identifying informative genes prior to classifier construction. However, improving the ability to identify differentiating genes remains a challenge in bioinformatics.

Results: A new hybrid gene selection approach was investigated and tested with nine publicly available microarray datasets. The new method identifies a Very Important Pool (VIP) of genes from the broad patterns of gene expression data. The method uses a bagging sampling principle, where the re-sampled arrays are used to identify the most informative genes. Frequency of selection is used in a repetitive process to identify the VIP genes. The putative informative genes are selected using two methods, the t-statistic and discriminatory analysis. With the t-statistic, the informative genes are identified based on p-values. In the discriminatory analysis, disjoint Principal Component Analyses (PCAs) are conducted for each class of samples, and genes with high discrimination power (DP) are identified. The VIP gene selection approach was compared with the p-value ranking approach. The genes identified by the VIP method but not by the p-value ranking approach are also related to the disease investigated. More importantly, these genes are part of the pathways derived from the common genes shared by both the VIP and p-value ranking methods. Moreover, the binary classifiers built from these genes are statistically equivalent to those built from the top 50 p-value ranked genes in distinguishing different types of samples.

Conclusion: The VIP gene selection approach could identify additional subsets of informative genes that would not always be selected by the p-value ranking method. These genes are likely to be additional true positives since they are a part of pathways identified by the p-value ranking method and are expected to be related to the relevant biology. Therefore, these additional genes derived from the VIP method potentially provide valuable biological insights.
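The bagging and selection-frequency step described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' method: it ranks genes by the absolute Welch t-statistic instead of p-values, omits the PCA-based discriminatory analysis entirely, and the function names, the `top_m` and `min_freq` parameters, and the synthetic data are all assumptions.

```python
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    # A degenerate resample (zero spread in both groups) is treated as uninformative.
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def vip_genes(class_a, class_b, n_bags=200, top_m=2, min_freq=0.8, seed=0):
    """Return (indices of frequently selected genes, selection frequency per gene).

    class_a and class_b are lists of arrays; each array is a list holding
    one expression value per gene.
    """
    rng = random.Random(seed)
    n_genes = len(class_a[0])
    counts = [0] * n_genes
    for _ in range(n_bags):
        # Bootstrap whole arrays (samples) with replacement within each class.
        bag_a = rng.choices(class_a, k=len(class_a))
        bag_b = rng.choices(class_b, k=len(class_b))
        scores = [abs(welch_t([s[g] for s in bag_a], [s[g] for s in bag_b]))
                  for g in range(n_genes)]
        top = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:top_m]
        for g in top:
            counts[g] += 1
    freq = [c / n_bags for c in counts]
    return [g for g in range(n_genes) if freq[g] >= min_freq], freq

# Synthetic data: 6 genes, 8 arrays per class; only genes 0 and 1 are shifted.
data_rng = random.Random(1)
class_a = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(8)]
class_b = [[data_rng.gauss(6 if g < 2 else 0, 1) for g in range(6)] for _ in range(8)]
vip, freq = vip_genes(class_a, class_b)
```

Counting how often a gene lands in the top set across bags, rather than ranking once on the full data, is what lets a gene that is informative but not top-ranked on any single pass still enter the pool.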

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central