Using Pre-existing Microarray Datasets to Increase Experimental Power: Application to Insulin Resistance
Although they have become a widely used experimental technique for identifying differentially expressed (DE) genes, DNA microarrays are notorious for generating noisy data. A common strategy for mitigating the effects of noise is to perform many experimental replicates. This approach is often costly and sometimes impossible given limited resources; thus, analytical methods are needed which increase accuracy at no additional cost. One inexpensive source of microarray replicates comes from prior work: to date, data from hundreds of thousands of microarray experiments are in the public domain. Although these data assay a wide range of conditions, they cannot be used directly to inform any particular experiment and are thus ignored by most DE gene methods. We present the SVD Augmented Gene expression Analysis Tool (SAGAT), a mathematically principled, data-driven approach for identifying DE genes. SAGAT increases the power of a microarray experiment by using observed coexpression relationships from publicly available microarray datasets to reduce uncertainty in individual genes' expression measurements. We tested the method on three well-replicated human microarray datasets and demonstrate that use of SAGAT increased effective sample sizes by as many as 2.72 arrays. We applied SAGAT to unpublished data from a microarray study investigating transcriptional responses to insulin resistance, resulting in a 50% increase in the number of significant genes detected. We evaluated 11 (58%) of these genes experimentally using qPCR, confirming the directions of expression change for all 11 and statistical significance for three. Use of SAGAT revealed coherent biological changes in three pathways: inflammation, differentiation, and fatty acid synthesis, furthering our molecular understanding of a type 2 diabetes risk factor. 
We envision SAGAT as a means to maximize the potential for biological discovery from subtle transcriptional responses, and we provide it as a freely available software package that is immediately applicable to any human microarray study
From microarray to biology: an integrated experimental, statistical and in silico analysis of how the extracellular matrix modulates the phenotype of cancer cells
A statistically robust and biologically based approach for analysis of microarray data is described that integrates independent biological knowledge and data with a global F-test for finding genes of interest that minimizes the need for replicates when used for hypothesis generation. First, each microarray is normalized to its noise level around zero. The microarray dataset is then globally adjusted by robust linear regression. Second, genes of interest that capture significant responses to experimental conditions are selected by finding those that express significantly higher variance than those expressing only technical variability. Clustering expression data and identifying expression-independent properties of genes of interest, including upstream transcriptional regulatory elements (TREs), ontologies and networks or pathways, organizes the data into a biologically meaningful system. We demonstrate that when the number of genes of interest is inconveniently large, identifying a subset of "beacon genes" representing the largest changes will identify pathways or networks altered by biological manipulation. The entire dataset is then used to complete the picture outlined by the "beacon genes." This allows construction of a structured model of a system that can generate biologically testable hypotheses. We illustrate this approach by comparing cells cultured on plastic or an extracellular matrix, which organizes a dataset of over 2,000 genes of interest from a genome-wide scan of transcription. The resulting model was confirmed by comparing the predicted pattern of TREs with experimental determination of active transcription factors
The Reproducibility of Lists of Differentially Expressed Genes in Microarray Studies
Reproducibility is a fundamental requirement in scientific experiments and clinical contexts. Recent publications raise concerns about the reliability of microarray technology because of the apparent lack of agreement between lists of differentially expressed genes (DEGs). In this study we demonstrate that (1) such discordance may stem from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion, the lists become much more reproducible, especially when fewer genes are selected; and (3) the instability of short DEG lists based on P cutoffs is an expected mathematical consequence of the high variability of the t-values. We recommend the use of FC ranking plus a non-stringent P cutoff as a baseline practice in order to generate more reproducible DEG lists. The FC criterion enhances reproducibility while the P criterion balances sensitivity and specificity
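The recommended baseline (rank by FC, filter by a non-stringent P cutoff) can be sketched as follows. This is a minimal illustration, not the authors' code: the `GeneResult` record, the `select_degs` helper, the 0.05 cutoff and the example values are all assumptions made for the sketch, and the P values are taken as already computed by whatever per-gene test is in use.

```python
from typing import NamedTuple

class GeneResult(NamedTuple):
    gene: str
    log2_fc: float   # log2 fold change between the two conditions
    p_value: float   # from any per-gene test (hypothetical values below)

def select_degs(results, p_cutoff=0.05, top_n=2):
    """Drop genes failing a lenient P cutoff, then rank the rest by |log2 FC|."""
    passing = [r for r in results if r.p_value < p_cutoff]
    ranked = sorted(passing, key=lambda r: abs(r.log2_fc), reverse=True)
    return [r.gene for r in ranked[:top_n]]

# Hypothetical per-gene summaries, for illustration only.
results = [
    GeneResult("geneA", 2.1, 0.030),
    GeneResult("geneB", -3.0, 0.040),
    GeneResult("geneC", 0.2, 0.0001),  # very significant, but tiny change
    GeneResult("geneD", 4.0, 0.200),   # large change, fails the P filter
]
top = select_degs(results)
print(top)
```

Here the P cutoff acts only as a filter; the ordering of the final list is driven entirely by fold change, which is the property the abstract credits with improving reproducibility.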
PathVar: analysis of gene and protein expression variance in cellular pathways using microarray data
Summary: Finding significant differences between the expression levels of genes or proteins across diverse biological conditions is one of the primary goals in the analysis of functional genomics data. However, existing methods for identifying differentially expressed genes or sets of genes by comparing measures of the average expression across predefined sample groups do not detect differential variance in the expression levels across genes in cellular pathways. Since corresponding pathway deregulations occur frequently in microarray gene or protein expression data, we present a new dedicated web application, PathVar, to analyze these data sources. The software ranks pathway-representing gene/protein sets in terms of the differences of the variance in the within-pathway expression levels across different biological conditions. Apart from identifying new pathway deregulation patterns, the tool exploits these patterns by combining different machine learning methods to find clusters of similar samples and build sample classification models
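The idea of ranking pathways by differential within-pathway variance can be sketched as below. This is an assumed, simplified score and not PathVar's actual statistic: for each sample it takes the variance across the pathway's genes, averages that per condition, and ranks pathways by the absolute difference. The function names, the two-condition restriction and the toy data are hypothetical.

```python
from statistics import mean, pvariance

def variance_gap(expr, pathway, groups):
    """Absolute difference, between two conditions, of the mean
    within-pathway variance (variance across the pathway's genes,
    computed separately in each sample)."""
    per_sample = [
        pvariance([expr[gene][s] for gene in pathway])
        for s in range(len(groups))
    ]
    cond_a, cond_b = sorted(set(groups))  # assumes exactly two conditions
    mean_a = mean([v for v, g in zip(per_sample, groups) if g == cond_a])
    mean_b = mean([v for v, g in zip(per_sample, groups) if g == cond_b])
    return abs(mean_a - mean_b)

def rank_pathways(expr, pathways, groups):
    """Return pathway names sorted by decreasing variance gap."""
    scores = {name: variance_gap(expr, genes, groups)
              for name, genes in pathways.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: gene -> expression value per sample (hypothetical).
expr = {
    "g1": [1.0, 1.0, 0.0, 0.0],
    "g2": [1.0, 1.0, 2.0, 2.0],
    "g3": [1.0, 1.0, 4.0, 4.0],
}
groups = ["ctrl", "ctrl", "case", "case"]
pathways = {"P2": ["g1", "g2"], "P1": ["g1", "g2", "g3"]}
ranking = rank_pathways(expr, pathways, groups)
print(ranking)
```

Note that in the toy data the mean expression of pathway P1 shifts only slightly between conditions while its spread changes a lot, which is the kind of pattern a mean-based comparison would tend to miss.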
Coupled Two-Way Clustering Analysis of Gene Microarray Data
We present a novel coupled two-way clustering approach to gene microarray
data analysis. The main idea is to identify subsets of the genes and samples,
such that when one of these is used to cluster the other, stable and
significant partitions emerge. The search for such subsets is a computationally
complex task: we present an algorithm, based on iterative clustering, which
performs such a search. This analysis is especially suitable for gene
microarray data, where the contributions of a variety of biological mechanisms
to the gene expression levels are entangled in a large body of experimental
data. The method was applied to two gene microarray data sets, on colon cancer
and leukemia. By identifying relevant subsets of the data and focusing on them
we were able to discover partitions and correlations that were masked and
hidden when the full dataset was used in the analysis. Some of these partitions
have clear biological interpretation; others can serve to identify possible
directions for future research
The steady-state transcriptome of the four major life-cycle stages of Trypanosoma cruzi
Background: Chronic chagasic cardiomyopathy is a debilitating and frequently fatal outcome of human infection with the protozoan parasite, Trypanosoma cruzi. Microarray analysis of gene expression during the T. cruzi life-cycle could be a valuable means of identifying drug and vaccine targets based on their appropriate expression patterns, but results from previous microarray studies in T. cruzi and related kinetoplastid parasites have suggested that the transcript abundances of most genes in these organisms do not vary significantly between life-cycle stages.
Results: In this study, we used whole genome, oligonucleotide microarrays to globally determine the extent to which T. cruzi regulates mRNA relative abundances over the course of its complete life-cycle. In contrast to previous microarray studies in kinetoplastids, we observed that relative transcript abundances for over 50% of the genes detected on the T. cruzi microarrays were significantly regulated during the T. cruzi life-cycle. The significant regulation of 25 of these genes was confirmed by quantitative reverse-transcriptase PCR (qRT-PCR). The T. cruzi transcriptome also mirrored published protein expression data for several functional groups. Among the differentially regulated genes were members of paralog clusters, nearly 10% of which showed divergent expression patterns between cluster members.
Conclusion: Taken together, these data support the conclusion that transcript abundance is an important level of gene expression regulation in T. cruzi. Thus, microarray analysis is a valuable screening tool for identifying stage-regulated T. cruzi genes and metabolic pathways.
Exploiting the full power of temporal gene expression profiling through a new statistical test: Application to the analysis of muscular dystrophy data
Background: The identification of biologically interesting genes in a temporal expression profiling
dataset is challenging and complicated by high levels of experimental noise. Most statistical methods
used in the literature do not fully exploit the temporal ordering in the dataset and are not suited
to the case where temporal profiles are measured for a number of different biological conditions.
We present a statistical test that makes explicit use of the temporal order in the data by fitting
polynomial functions to the temporal profile of each gene and for each biological condition. A
Hotelling T2-statistic is derived to detect the genes for which the parameters of these polynomials
are significantly different from each other.
Results: We validate the temporal Hotelling T2-test on muscular gene expression data from four
mouse strains which were profiled at different ages: dystrophin-, beta-sarcoglycan- and
gamma-sarcoglycan-deficient mice, and wild-type mice. The first three are animal models for different
muscular dystrophies. Extensive biological validation shows that the method is capable of finding
genes with temporal profiles significantly different across the four strains, as well as identifying
potential biomarkers for each form of the disease. The added value of the temporal test compared
to an identical test which does not make use of temporal ordering is demonstrated via a simulation
study, and through confirmation of the expression profiles from selected genes by quantitative PCR
experiments. The proposed method maximises the detection of the biologically interesting genes,
whilst minimising false detections.
Conclusion: The temporal Hotelling T2-test is capable of finding relatively small and robust sets
of genes that display different temporal profiles between the conditions of interest. The test is
simple, it can be used on gene expression data generated from any experimental design and for any
number of conditions, and it allows fast interpretation of the temporal behaviour of genes. The R
code is available from V.V. The microarray data have been submitted to GEO under series
GSE1574 and GSE3523
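For orientation, the textbook two-sample Hotelling T2 statistic is shown below. This is the standard form only, not the derivation used in the paper: in the temporal test the compared vectors would presumably be the fitted polynomial coefficients for each condition, and the paper's handling of more than two conditions is not reproduced here.

```latex
% Standard two-sample Hotelling T^2 (textbook form, not the paper's derivation).
% \bar{x}_1, \bar{x}_2: mean vectors of the two groups (sizes n_1, n_2)
% S_1, S_2: per-group sample covariance matrices; S_p: pooled covariance
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,
      (\bar{x}_1 - \bar{x}_2)^{\top} S_p^{-1} (\bar{x}_1 - \bar{x}_2),
\qquad
S_p = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}
```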
Exploring matrix factorization techniques for significant genes identification of Alzheimer’s disease microarray gene expression data
Background: The wide use of high-throughput DNA microarray technology provides an increasingly detailed view of the human transcriptome, from hundreds to thousands of genes. Although biomedical researchers typically design microarray experiments to explore specific biological contexts, the relationships between genes are hard to identify because the data are complex, noisy and high-dimensional, and analyses are often hindered by low statistical power. The main challenge now is to extract valuable biological information from the colossal amount of data to gain insight into biological processes and the mechanisms of human disease. Overcoming this challenge requires mathematical and computational methods that are versatile enough to capture the underlying biological features and simple enough to be applied efficiently to large datasets.
Methods: Unsupervised machine learning approaches provide new and efficient analysis of gene expression profiles. In our study, two unsupervised knowledge-based matrix factorization methods, independent component analysis (ICA) and nonnegative matrix factorization (NMF), are integrated to identify significant genes and related pathways in a microarray gene expression dataset of Alzheimer’s disease. The advantage of these two approaches is that they can be used as biclustering methods, by which genes and conditions are clustered simultaneously. Furthermore, they can group genes into different categories for identifying related diagnostic pathways and regulatory networks. The difference between the two methods is that ICA assumes statistical independence of the expression modes, while NMF needs positivity constraints to generate localized gene expression profiles.
Results: In our work, we applied FastICA and non-smooth NMF, respectively, to DNA microarray gene expression data of Alzheimer’s disease. The simulation results show that both methods can clearly separate severe AD samples from control samples, and the biological analysis of the identified significant genes and their related pathways demonstrated that these genes play a prominent role in AD and relate the activation patterns to AD phenotypes. The combination of these two methods is validated as efficient.
Conclusions: Unsupervised matrix factorization methods provide efficient tools to analyze high-throughput microarray datasets. Because different unsupervised approaches explore correlations in the high-dimensional data space and identify relevant subspaces based on different hypotheses, integrating these methods to explore the underlying biological information in a microarray dataset is an efficient approach. By combining the significant genes identified by both ICA and NMF, the biological analysis proves highly efficient for elucidating the molecular taxonomy of Alzheimer’s disease and enables better experimental design to further identify potential pathways and therapeutic targets of AD.
Bayesian meta-analysis for identifying periodically expressed genes in fission yeast cell cycle
The effort to identify genes with periodic expression during the cell cycle
from genome-wide microarray time series data has been ongoing for a decade.
However, the lack of rigorous modeling of periodic expression as well as the
lack of a comprehensive model for integrating information across genes and
experiments has impaired the effort for the accurate identification of
periodically expressed genes. To address the problem, we introduce a Bayesian
model to integrate multiple independent microarray data sets from three recent
genome-wide cell cycle studies on fission yeast. A hierarchical model was used
for data integration. In order to facilitate an efficient Monte Carlo sampling
from the joint posterior distribution, we develop a novel Metropolis--Hastings
group move. A surprising finding from our integrated analysis is that more than
40% of the genes in fission yeast are significantly periodically expressed,
greatly enhancing the reported 10--15% of the genes in the current literature.
It calls for a reconsideration of the periodically expressed gene detection
problem. Comment: Published at http://dx.doi.org/10.1214/09-AOAS300 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)