
Dartmouth Digital Commons (Dartmouth College)

Robust Detection of Hierarchical Communities from Escherichia coli Gene Expression Data

Author: A Beyer
AL Barabási
BH Good
BW Kernighan
CO Daub
D Duewer
D Marbach
DFT Veiga
E Bonnet
E Ravasz
E Segal
EH Davidson
F Luo
G Balázsi
G Getz
G Palla
H Zare
HW Ma
J Chen
J Duch
J Hubble
J Lemke
J Reichardt
JJ Faith
JN Weinstein
K Baggerly
Kevin E. Bassler
KY Yeung
M Blatt
M Riley
MB Eisen
MEJ Newman
MF Traxler
MM Barker
N Friedman
O Alter
PD Karp
Q Lu
R Guimerà
RA Irizarry
S Fortunato
S Gama-Castro
S Raychaudhuri
S Tavazoie
Santiago Treviño
Satoru Miyano
SB Seidman
SP Borgatti
TF Cooper
Tim F. Cooper
TS Gardner
U Brandes
UN Raghavan
X Wen
Y Benjamini
Y Sun
Yudong Sun
Z Shi
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 11/01/2012

Determining the functional structure of biological networks is a central goal of systems biology. One approach is to analyze gene expression data to infer a network of gene interactions on the basis of their correlated responses to environmental and genetic perturbations. The inferred network can then be analyzed to identify functional communities. However, commonly used algorithms can yield unreliable results due to experimental noise, algorithmic stochasticity, and the influence of arbitrarily chosen parameter values. Furthermore, the results obtained typically provide only a simplistic view of the network partitioned into disjoint communities and provide no information about the relationships between communities. Here, we present methods to robustly detect coregulated and functionally enriched gene communities and demonstrate their application and validity for Escherichia coli gene expression data. Applying a recently developed community detection algorithm to the network of interactions identified with the context likelihood of relatedness (CLR) method, we show that a hierarchy of network communities can be identified. These communities are significantly enriched for gene ontology (GO) terms, consistent with their representing biologically meaningful groups. Further, analysis of the most significantly enriched communities identified several candidate new regulatory interactions. The robustness of our methods is demonstrated by showing that a core set of functional communities is reliably found when artificial noise, modeling experimental noise, is added to the data. We find that noise mainly acts conservatively, increasing the relatedness required for a network link to be reliably assigned and decreasing the size of the core communities, rather than causing association of genes into new communities.
Comment: Due to appear in PLoS Computational Biology. Supplementary Figure S1 was not uploaded but is available by contacting the author. 27 pages, 5 figures, 15 supplementary files.
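The CLR scoring step mentioned in the abstract above can be illustrated with a small example. This is a minimal sketch, not the authors' pipeline: it assumes a precomputed symmetric relatedness matrix (CLR as commonly described uses mutual information between expression profiles), and the function name `clr_scores` is hypothetical. For each gene pair, the relatedness value is converted to a z-score against each gene's own background distribution, negative z-scores are clipped to zero, and the two are combined as the square root of the sum of squares.

```python
import math
from statistics import mean, pstdev


def clr_scores(relatedness):
    """Return a CLR-style score matrix from a symmetric relatedness matrix.

    Assumption: `relatedness[i][j]` is a precomputed similarity (for example
    mutual information) between genes i and j; the diagonal is ignored.
    """
    n = len(relatedness)
    # Background mean and standard deviation for each gene, excluding itself.
    background = []
    for i in range(n):
        row = [relatedness[i][j] for j in range(n) if j != i]
        background.append((mean(row), pstdev(row)))

    def z(i, j):
        mu, sd = background[i]
        if sd == 0:
            return 0.0
        # Clip at zero: values below the gene's background carry no evidence.
        return max(0.0, (relatedness[i][j] - mu) / sd)

    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = math.sqrt(z(i, j) ** 2 + z(j, i) ** 2)
            scores[i][j] = scores[j][i] = s
    return scores
```

A network would then be obtained by keeping only the pairs whose score exceeds a chosen threshold; that threshold is exactly the kind of parameter whose influence the abstract warns about.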

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)

FigShare

Mining gene expression data by interpreting principal components

Author: Hart Christopher E
King Brandon W
Mortazavi Ali
Roden Joseph C
Trout Diane
Wold Barbara J
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis. RESULTS: We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information theoretic metrics. To enable easy use of our methods, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets, and highlight sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, samples, etc.). Statistically significant associations for highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation. CONCLUSION: We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. 
It has proven especially valuable in instances where there are many diverse conditions (tens to hundreds of different tissues or cell types), a situation in which many clustering and ordering algorithms become problematic. This approach also shows promise in other topic domains such as multi-spectral imaging datasets.

Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data

Author: A Ben-Dor
A Rosenwald
AA Alizadeh
C Ambroise
CH Ooi
D Faraggi
E Wang
H Zhang
J Khan
JM Deutsch
LJ van't-Veer
M Bittner
M West
MA Shipp
MD Radmacher
O Alter
R Simon
R Tibshirani
S Dudoit
S Kim
S Ramaswamy
TH Bo
TR Golub
Publication venue: Nature Publishing Group

Collection Of Biostatistics Research Archive

Clustering and Classification Methods for Gene Expression Data Analysis

Author: Garrett-Mayer Elizabeth
Parmigiani Giovanni
Publication venue: Collection of Biostatistics Research Archive
Publication date: 23/12/2004

Efficient use of the large data sets generated by gene expression microarray experiments requires computerized data analysis approaches. In this chapter we briefly describe and illustrate two broad families of commonly used data analysis methods: class discovery and class prediction methods. A wide range of alternative approaches for clustering and classification of gene expression data are available. While differences in efficiency do exist, none of the well-established approaches is uniformly superior to the others. Choosing an approach requires consideration of the goals of the analysis, the background knowledge, and the specific experimental constraints. The quality of an algorithm is important, but is not in itself a guarantee of the quality of a specific data analysis. Uncertainty, sensitivity analysis and, in the case of classifiers, external validation or cross-validation should be used to support the legitimacy of results of microarray data analyses.

A cDNA Microarray Gene Expression Data Classifier for Clinical Diagnostics Based on Graph Theory

Author: Benso Alfredo
Di Carlo Stefano
Politano Gianfranco Michele Maria
Publication venue: IEEE Computer Society
Publication date: 01/01/2011

Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine clinical diagnostics is still a challenge. Current practices in the classification of microarray data show two main limitations: the reliability of the training data sets used to build the classifiers, and the classifiers' performance, especially when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms usually produce a high rate of false positives that, in real diagnostic applications, is unacceptable. To address this problem, this paper presents a new cDNA microarray data classification algorithm that is based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression data organized in an innovative data structure based on graphs, where vertices correspond to genes and edges to gene expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental performance comparison between the proposed classifier and several state-of-the-art classification algorithms.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Transcriptomic and proteomic analyses of Desulfovibrio vulgaris biofilms: carbon and energy flow contribute to the distinct biofilm growth state.

Author: Arkin Adam P
Clark Melinda E
Fields Matthew W
He Zhili
Joachimiak Marcin P
Keasling Jay D
Mukhopadhyay Aindrila
Redding Alyssa M
Zhou Jizhong Z
Publication venue: eScholarship, University of California
Publication date: 01/01/2012

Background: Desulfovibrio vulgaris Hildenborough is a sulfate-reducing bacterium (SRB) that is intensively studied in the context of metal corrosion and heavy-metal bioremediation, and SRB populations are commonly observed in pipe and subsurface environments as surface-associated populations. In order to elucidate physiological changes associated with biofilm growth at both the transcript and protein level, transcriptomic and proteomic analyses were done on mature biofilm cells and compared to both batch and reactor planktonic populations. The biofilms were cultivated with lactate and sulfate in a continuously fed biofilm reactor, and compared to both batch and reactor planktonic populations. Results: The functional genomic analysis demonstrated that biofilm cells were different compared to planktonic cells, and the majority of altered abundances for genes and proteins were annotated as hypothetical (unknown function), energy conservation, amino acid metabolism, and signal transduction. Genes and proteins that showed similar trends in detected levels were particularly involved in energy conservation such as increases in an annotated ech hydrogenase, formate dehydrogenase, pyruvate:ferredoxin oxidoreductase, and rnf oxidoreductase, and the biofilm cells had elevated formate dehydrogenase activity. Several other hydrogenases and formate dehydrogenases also showed an increased protein level, while decreased transcript and protein levels were observed for putative coo hydrogenase as well as a lactate permease and hyp hydrogenases for biofilm cells. Genes annotated for amino acid synthesis and nitrogen utilization were also predominant changers within the biofilm state. Ribosomal transcripts and proteins were notably decreased within the biofilm cells compared to exponential-phase cells but were not as low as levels observed in planktonic, stationary-phase cells. Several putative, extracellular proteins (DVU1012, 1545) were also detected in the extracellular fraction from biofilm cells. Conclusions: Even though both the planktonic and biofilm cells were oxidizing lactate and reducing sulfate, the biofilm cells were physiologically distinct compared to planktonic growth states due to altered abundances of genes/proteins involved in carbon/energy flow and extracellular structures. In addition, average expression values for multiple rRNA transcripts and respiratory activity measurements indicated that biofilm cells were metabolically more similar to exponential-phase cells although biofilm cells are structured differently. The characterization of physiological advantages and constraints of the biofilm growth state for sulfate-reducing bacteria will provide insight into bioremediation applications as well as microbially-induced metal corrosion.

eScholarship - University of California

Comprehensive evaluation of matrix factorization methods for the analysis of DNA microarray gene expression data

Author: A Hubert
A Hyvarinen
AL Edwards
BS Everitt
D Dueck
DD Lee
DL Davies
EL Lehmann
HC Romesburg
HJ Chung
Hwa Jeong Seo
J Bezdek
J Dunn
Je-Gun Joung
JP Brunet
Ju Han Kim
KY Yeung
M Halkidi
Mi Hyeon Kim
N Jardine
P Paatero
P Pauca
PJ Rousseeuw
PO Hoyer
Q Qi
R Fisher
R Schachtner
R Sharan
R Tibshirani
RR Sokal
S Bicciato
S Jaccard
S Ma
SL Pomeroy
SZ Li
TR Golub
VR Iyer
W Xu
WM Rand
Y Gao
Y Tan
Y Wang
Y Xu
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Clustering-based methods for gene-expression analysis have been shown to be useful in biomedical applications such as cancer subtype discovery. Among them, matrix factorization (MF) is advantageous for clustering gene expression patterns from DNA microarray experiments, as it efficiently reduces the dimension of gene expression data. Although several MF methods have been proposed for clustering gene expression patterns, a systematic evaluation has not been reported yet. Results: Here we evaluated the clustering performance of orthogonal and non-orthogonal MFs by a total of nine measurements for performance in four gene expression datasets and one well-known dataset for clustering. Specifically, we employed a non-orthogonal MF algorithm, BSNMF (Bi-directional Sparse Non-negative Matrix Factorization), that applies bi-directional sparseness constraints superimposed on non-negative constraints, comprising a few dominantly co-expressed genes and samples together. Non-orthogonal MFs tended to show better clustering-quality and prediction-accuracy indices than orthogonal MFs as well as a traditional method, K-means. Moreover, BSNMF showed improved performance in these measurements. Non-orthogonal MFs including BSNMF also showed good performance in the functional enrichment test using Gene Ontology terms and biological pathways. Conclusions: In conclusion, the clustering performance of orthogonal and non-orthogonal MFs was appropriately evaluated for clustering microarray data by comprehensive measurements. This study showed that non-orthogonal MFs have better performance than orthogonal MFs and K-means for clustering microarray data.
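To make the idea of a non-orthogonal matrix factorization concrete, here is a minimal sketch of plain non-negative matrix factorization using the standard multiplicative update rules for squared error. It is not BSNMF: the bi-directional sparseness constraints described in the abstract are omitted, and all names (`nmf`, `sample_clusters`, `frobenius_error`) are hypothetical. The sketch assumes a genes-by-samples matrix `v` with non-negative entries and assigns each sample to the factor with the largest coefficient.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor v (n x m, non-negative) into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so the multiplicative updates can move.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


def sample_clusters(h):
    """Assign each sample (column of h) to its dominant factor."""
    return [max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(h[0]))]
```

Because the factors are constrained only to be non-negative, not orthogonal, each one can be read as an additive expression pattern, which is one common argument for using such factorizations in clustering.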

Principal component tests: applied to temporal gene expression data

Author: Fang Hong-Bin
Song Jiuzhou
Zhang Wensheng
Publication venue: BioMed Central
Publication date: 01/01/2009

Clustering analysis is a common statistical tool for knowledge discovery. It is mainly conducted when a project is still in the exploratory phase, without any a priori hypotheses. However, statistical significance testing between the clusters can be meaningful in helping researchers to assess whether the classification results from a clustering algorithm need to be improved, even after the cluster number has been determined by a well-established criterion. This is important when we want to identify highly specific patterns through classification. We propose to use a principal component (PC) test, which is an implementation of an exact F statistic for the measures at multiple endpoints based on elliptical distribution theory, to assess the statistical significance between clusters. A challenge in the implementation is the choice of the number (q) of principal components to be considered, which can severely influence the statistical power of the method. We optimized the determination via validation according to a permutation test based on the clustering to be evaluated. The method was applied to a public dataset in classifying genes according to their temporal gene expression profiles. The results demonstrated that PC testing was useful for determining the optimal number of clusters.
https://doi.org/10.1186/1471-2105-10-S1-S2
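The permutation-based validation mentioned in the abstract can be illustrated in general terms. The following is a minimal sketch of a generic label-permutation test between two clusters, not the authors' exact F statistic or their procedure for choosing q. It assumes each gene has already been reduced to a single score (for example, its projection on one principal component), and the function name `permutation_p_value` is hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(cluster_a, cluster_b, n_perm=999, seed=0):
    """Estimate how often a random relabelling separates the clusters
    at least as well as the observed labelling does.

    Assumption: the test statistic is the absolute difference of cluster
    means of a one-dimensional score; other statistics could be swapped in.
    """
    rng = random.Random(seed)
    observed = abs(_mean(cluster_a) - _mean(cluster_b))
    pooled = list(cluster_a) + list(cluster_b)
    n_a = len(cluster_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of cluster labels
        stat = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value suggests the two clusters differ more than random groupings of the same genes would, whereas a large one suggests the partition may need to be revisited.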