A Statistical Framework for the Analysis of Microarray Probe-Level Data
Microarrays are an example of the powerful high-throughput genomics tools that are revolutionizing the measurement of biological systems. In this and other technologies, a number of critical steps are required to convert the raw measures into the data relied upon by biologists and clinicians. These data manipulations, referred to as preprocessing, have enormous influence on the quality of the ultimate measurements and studies that rely upon them. Many researchers have previously demonstrated that the use of modern statistical methodology can substantially improve accuracy and precision of gene expression measurements, relative to ad-hoc procedures introduced by designers and manufacturers of the technology. However, further substantial improvements are possible. Microarrays are now being used to measure diverse genomic endpoints including yeast mutant representations, the presence of SNPs, presence of deletions/insertions, and protein binding sites by chromatin immunoprecipitation (known as ChIP-chip). In each case, the genomic units of measurement are relatively short DNA molecules referred to as probes. Without appropriate understanding of the bias and variance of these measurements, biological inferences based upon probe analysis will be compromised. Standard operating procedure for microarray researchers is to use preprocessed data as the starting point for the statistical analyses that produce reported results. This has prevented many researchers from carefully considering their choice of preprocessing methodology. Furthermore, the fact that the preprocessing step greatly affects the stochastic properties of the final statistical summaries is ignored. In this paper we propose a statistical framework that permits the integration of preprocessing into the standard statistical analysis flow of microarray data. We demonstrate its usefulness by applying the idea in three different applications of the technology.
A statistical framework for the analysis of microarray probe-level data
In microarray technology, a number of critical steps are required to convert
the raw measurements into the data relied upon by biologists and clinicians.
These data manipulations, referred to as preprocessing, influence the quality
of the ultimate measurements and studies that rely upon them. Standard
operating procedure for microarray researchers is to use preprocessed data as
the starting point for the statistical analyses that produce reported results.
This has prevented many researchers from carefully considering their choice of
preprocessing methodology. Furthermore, the fact that the preprocessing step
affects the stochastic properties of the final statistical summaries is often
ignored. In this paper we propose a statistical framework that permits the
integration of preprocessing into the standard statistical analysis flow of
microarray data. This general framework is relevant in many microarray
platforms and motivates targeted analysis methods for specific applications. We
demonstrate its usefulness by applying the idea in three different applications
of the technology.
Comment: Published at http://dx.doi.org/10.1214/07-AOAS116 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
Improving the statistical detection of regulated genes from microarray data using intensity-based variance estimation
BACKGROUND: Gene microarray technology provides the ability to study the regulation of thousands of genes simultaneously, but its potential is limited without an estimate of the statistical significance of the observed changes in gene expression. Due to the large number of genes being tested and the comparatively small number of array replicates (e.g., N = 3), standard statistical methods such as the Student's t-test fail to produce reliable results. Two other statistical approaches commonly used to improve significance estimates are a penalized t-test and a Z-test using intensity-dependent variance estimates. RESULTS: The performance of these approaches is compared using a dataset of 23 replicates, and a new implementation of the Z-test is introduced that pools together variance estimates of genes with similar minimum intensity. Significance estimates based on 3 replicate arrays are calculated using each statistical technique, and their accuracy is evaluated by comparing them to a reliable estimate based on the remaining 20 replicates. The reproducibility of each test statistic is evaluated by applying it to multiple, independent sets of 3 replicate arrays. Two implementations of a Z-test using intensity-dependent variance produce more reproducible results than two implementations of a penalized t-test. Furthermore, the minimum intensity-based Z-statistic demonstrates higher accuracy and higher or equal precision than all other statistical techniques tested. CONCLUSION: An intensity-based variance estimation technique provides one simple, effective approach that can improve p-value estimates for differentially regulated genes derived from replicated microarray datasets. Implementations of the Z-test algorithms are available at
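The pooling idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sliding-window pooling rule, and the default `window` size are all assumptions made for illustration. Each gene's standard deviation is replaced by a value pooled over genes of similar minimum intensity, and the replicate mean is then tested against zero with a normal (Z) reference distribution.

```python
import math
import statistics

def pooled_sd_by_intensity(min_intensity, sds, window=50):
    """Pool each gene's SD with those of genes of similar minimum intensity.

    The sliding window over intensity rank is an illustrative assumption,
    not the pooling rule used in the paper.
    """
    order = sorted(range(len(sds)), key=lambda i: min_intensity[i])
    pooled = [0.0] * len(sds)
    half = window // 2
    for rank, gene in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        # Average the neighbours' variances, then return to the SD scale.
        var = statistics.fmean(sds[order[j]] ** 2 for j in range(lo, hi))
        pooled[gene] = math.sqrt(var)
    return pooled

def intensity_z_test(log_ratios, min_intensity, window=50):
    """Return a (z, two-sided p) pair per gene.

    log_ratios: one list of replicate log ratios per gene (equal lengths).
    min_intensity: one minimum-intensity value per gene.
    """
    n = len(log_ratios[0])
    means = [statistics.fmean(r) for r in log_ratios]
    sds = [statistics.stdev(r) for r in log_ratios]
    pooled = pooled_sd_by_intensity(min_intensity, sds, window)
    results = []
    for mean, sd in zip(means, pooled):
        z = mean / (sd / math.sqrt(n))
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        results.append((z, p))
    return results
```

Because the variance is borrowed from many genes, the estimate is far more stable than a per-gene standard deviation computed from three replicates, which is the motivation the abstract gives for preferring a Z-test over a plain t-test.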
Global gene expression profiling of healthy human brain and its application in studying neurological disorders
The human brain is the most complex structure known to mankind and one of the greatest challenges in modern biology is to understand how it is built and organized. The power of the brain arises from its variety of cells and structures, and ultimately where and when different genes are switched on and off throughout the brain tissue. In other words, brain function depends on the precise regulation of gene expression in its sub-anatomical structures. However, our understanding of the complexity and dynamics of the transcriptome of the human brain is still incomplete. To fill this need, we designed a gene expression model that accurately defines the consistent blueprint of the brain transcriptome, thereby identifying the core brain-specific transcriptional processes conserved across individuals. Functionally characterizing this model would provide profound insights into the transcriptional landscape, biological pathways and the expression distribution of neurotransmitter systems.
In this dissertation we developed an expression model by capturing the similarly expressed gene patterns across congruently annotated brain structures in six individual brains, using data from the Allen Brain Atlas (ABA). We found that 84% of genes are expressed in at least one of the 190 brain structures. By employing hierarchical clustering we were able to show that distinct structures of a bigger brain region can cluster together while still retaining their expression identity. Further, weighted correlation network analysis identified 19 robust modules of coexpressing genes in the brain that demonstrated a wide range of functional associations. Since signatures of local phenomena can be masked by larger signatures, we performed local analysis on each distinct brain structure. Pathway and gene ontology enrichment analysis on these structures showed striking enrichment for brain-region-specific processes. We also mapped the structural distribution of the gene expression profiles of genes associated with major neurotransmission systems in the human brain. Finally, we postulated that healthy brain tissue gene expression could be used to predict potential genes involved in a neurological disorder in the absence of data from diseased tissues. To this end, we developed a supervised classification model, which achieved an accuracy of 84% and an AUC (Area Under the Curve) of 0.81 from ROC plots, for predicting autism-implicated genes using the healthy expression model as the baseline. This study represents the first use of healthy brain gene expression to predict genes implicated in autism, and this generic methodology can be applied to predict genes involved in other neurological disorders.
Genome-wide analyses using bead-based microarrays
Microarrays are now an established tool for biological research and have a wide range of applications. In this thesis I investigate the BeadArray microarray technology developed by Illumina. The design of this technology is unique and gives rise to many computational and statistical challenges. However, I show how knowledge from other microarray technologies can be used to our advantage.
I describe the beadarray software package, which is now used by researchers around the world. The development of this software was motivated by the fact that Illumina's software (BeadStudio) gives a summarised view of Illumina data and does not give users any control over several processing steps that were found to be crucial for other microarray technologies. A main feature of beadarray is the ability to access raw data. The advantages of such data include the ability to perform more detailed quality assessment and greater control over the analysis at all stages. The analysis of a control experiment shows that the processing steps used in BeadStudio can be improved. In particular, utilising variances calculated from the raw data can increase the ability to detect genes which have different expression levels between samples, a common goal for microarray studies. The data from the control experiment are made available for other researchers to use and validate their own analysis methods.
One issue discovered during the analysis of the control experiment was that only half of the intended genes could be reliably measured due to problems in the design of the probes targeting particular genes. By considering a large set of publicly available Illumina arrays, I show how such unreliable measurements can affect the analysis of Illumina data. I also show how potential problems can be identified in advance of an experiment and incorporated into an analysis pipeline.
A human lung tumor microenvironment interactome identifies clinically relevant cell-type cross-talk.
BACKGROUND: Tumors comprise a complex microenvironment of interacting malignant and stromal cell types. Much of our understanding of the tumor microenvironment comes from in vitro studies isolating the interactions between malignant cells and a single stromal cell type, often along a single pathway. RESULT: To develop a deeper understanding of the interactions between cells within human lung tumors, we perform RNA-seq profiling of flow-sorted malignant cells, endothelial cells, immune cells, fibroblasts, and bulk cells from freshly resected human primary non-small-cell lung tumors. We map the cell-specific differential expression of prognostically associated secreted factors and cell surface genes, and computationally reconstruct cross-talk between these cell types to generate a novel resource called the Lung Tumor Microenvironment Interactome (LTMI). Using this resource, we identify and validate a prognostically unfavorable influence of Gremlin-1 production by fibroblasts on proliferation of malignant lung adenocarcinoma cells. We also find a prognostically favorable association between infiltration of mast cells and less aggressive tumor cell behavior. CONCLUSION: These results illustrate the utility of the LTMI as a resource for generating hypotheses concerning tumor-microenvironment interactions that may have prognostic and therapeutic relevance.
Analysis of High-dimensional and Left-censored Data with Applications in Lipidomics and Genomics
Recently, new kinds of high-throughput measurement techniques have emerged, enabling biological research to focus on fundamental building blocks of living organisms such as genes, proteins, and lipids. Alongside this new type of data, referred to as omics data, modern data analysis techniques have been developed. Much of this research focuses on finding biomarkers for detecting abnormalities in a person's health status and on learning unobservable network structures that represent functional associations in biological regulatory systems. Omics data have certain specific qualities, such as left-censored observations due to the limitations of the measurement instruments, missing data, non-normal observations and very large dimensionality, and the interest often lies in the connections between the large number of variables.
There are two major aims in this thesis. The first is to provide efficient methodology for dealing with various types of missing or censored omics data that can be used for visualisation and biomarker discovery based on, for example, regularised regression techniques. A maximum-likelihood-based covariance estimation method for data with censored values is developed and the algorithms are described in detail. The second major aim is to develop novel approaches for detecting interactions displaying functional associations from large-scale observations. For more complicated data connections, a technique based on partial least squares regression is investigated. The technique is applied to network construction as well as to differential network analyses, both on multiply imputed censored data and on next-generation sequencing count data.
New measurement technologies have made it possible to build a more comprehensive understanding of the molecular-level processes of living organisms. The so-called omics technologies, such as genomics, proteomics, and lipidomics, can produce vast amounts of measurement data on the expression or concentration levels of individual genes, proteins, and lipids with unprecedented accuracy. At the same time, the need to develop new analysis methods has grown. Of particular interest have been the identification of biomarkers that predict the risk or prognosis of certain diseases and the reconstruction of biological networks.
Omics datasets have several special properties that limit the direct and efficient application of conventional methods. The most important of these are left-censored and missing observations and the large number of observed variables. The first aim of this dissertation is to provide tailored analysis methods for the visualisation of incomplete omics datasets and for model selection, using, for example, regularised regression models. We also describe a maximum likelihood estimator of the covariance matrix that is suitable for censored data. The second aim is to develop new methods for examining the association structures of omics datasets. For examining, visualising, and comparing more complex structures, we present several variations of an algorithm based on partial least squares, which can be used to reconstruct association networks for both multiply imputed censored data and count data.
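To make the notion of a likelihood for left-censored data concrete, the following is a minimal univariate sketch. It is a simplification chosen for illustration and is not the covariance estimator developed in the thesis: it assumes a single normally distributed variable, a known detection limit, and a crude grid search in place of a proper optimiser. All function names are hypothetical. Observed values contribute the normal density, while each censored value contributes the probability of falling below the limit.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution at x."""
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2)))

def censored_loglik(observed, n_censored, limit, mu, sigma):
    """Log-likelihood with n_censored values known only to lie below limit."""
    ll = sum(normal_logpdf(x, mu, sigma) for x in observed)
    if n_censored:
        # Each censored observation contributes P(X < limit).
        ll += n_censored * math.log(normal_cdf(limit, mu, sigma))
    return ll

def fit_by_grid(observed, n_censored, limit, mus, sigmas):
    """Pick the (mu, sigma) grid point with the highest log-likelihood."""
    return max(
        ((mu, sigma) for mu in mus for sigma in sigmas),
        key=lambda p: censored_loglik(observed, n_censored, limit, p[0], p[1]),
    )
```

Ignoring the censored values and averaging only the observed ones overestimates the mean; the censored term pulls the estimate downward, which is the kind of bias a censoring-aware likelihood is meant to correct.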
Multi-scale approaches for the statistical analysis of microarray data (with an application to 3D vesicle tracking)
Recent developments in experimental methods for gene data analysis, called microarrays, make it possible to interrogate changes in the expression of a vast number of genes in cell or tissue cultures, and thus to explore disease conditions in depth. As part of an ongoing program of research in the Guy A. Rutter (G.A.R.) laboratory, Department of Biochemistry, University of Bristol, UK, with support from the Wellcome Trust, we study the impact of established and potentially new methods for the statistical analysis of gene expression data.
Defining the human endothelial transcriptome
Thesis (S.M.)--Harvard-MIT Division of Health Sciences and Technology, 2005. Includes bibliographical references (leaves 91-100).
Advances in microarray technology facilitate the study of biological systems at a genome-wide level. Meaningful analysis of these transcriptional profiling studies, however, demands the concomitant development of novel computational techniques that take into account the size and complexity of the data. We have devised statistical algorithms that use replicate microarrays to define a genome-wide expression profile of a given cell type and to determine a list of genes that are significantly differentially expressed between experimental conditions. Applying these algorithms to the study of cultured human umbilical vein endothelial cells (HUVEC), we have found approximately 54% of all genes to be expressed at a detectable level in HUVEC under basal conditions. The set of most highly expressed genes is enriched in nucleic acid binding proteins, cytoskeletal proteins and isomerases as well as certain known markers of endothelium, and the complete list of genes can be found at ... We have also studied the effect of a 4-hour exposure of HUVEC to 10 U/mL of IL-1, and detected 491 upregulated and 259 downregulated statistically significant genes, including several chemokines and cytokines, as well as members of the TNFAIP3 family, the KLF family and the Notch pathway. Applying these rigorous statistical techniques to genome-wide expression datasets underscores known patterns of endothelial inflammatory gene regulation and unveils new pathways as well. Finally, we performed a direct comparison of direct-labeled microarrays with amplified RNA microarrays for an initial assessment of the effect of the additional noise of amplification on the outputs of the statistical algorithms.
These techniques can be applied to additional genome-wide profiling studies of endothelium and other cell types to refine our understanding of transcriptomes and the gene regulatory network governing cellular function and pathophysiology.
by Sripriya Natarajan. S.M.
Investigating the molecular mechanisms of the metabolic syndrome
This thesis aims to highlight molecular mechanisms that have been altered by prenatal undernutrition and may be involved in the metabolic syndrome. Two separate studies were conducted, both using a rat model developed through manipulation of the maternal diet to provoke the key features of the metabolic syndrome in adult offspring. Microarray technology was used to detect changes in gene expression in target tissues between offspring of control (normally fed, AD) and undernourished (UN) mothers to obtain a broader picture of the cellular functions and genetic pathways that may be implicated in the metabolic syndrome.
The first study compared gene expression differences in liver, skeletal muscle, and white adipose tissue between 55 day old male offspring of AD and UN mothers. No significant changes were found in muscle or adipose tissue; however, the differences in the liver suggested the UN animals had been metabolically programmed to favour fat as an energy source.
To investigate whether DNA methylation might be responsible for the observed transcriptional changes, pooled liver samples from the first study were used with the McrBC restriction enzyme assay to determine full, partial, incomplete, or no methylation between AD and UN. Two differentially expressed genes (Zfand2a and Mapk4) showed methylation changes.
The same liver samples were hybridised to a miRNA array. Two miRNAs showed a nearly 2-fold upregulation in the UN livers. Both were found to be either directly or indirectly associated with the metabolic syndrome. MiR-335 has been shown to be upregulated in the livers of obese/diabetic mice. By association with miR-27a, miR-451 might be involved in aspects of lipid metabolism in adipose tissue.
A second study used microarrays to analyse the liver tissues of day 170 female offspring of the same rat model with additional insults (neonatal leptin treatment and post-weaning high-fat (HF) diet). Leptin has been shown to reverse the programming effects of the restricted maternal diet and this study aimed to highlight mechanisms that could be involved in this reversal. The results revealed the importance of the interaction between treatments. Significant gene expression changes were only present when two or more treatments were combined. This study revealed significantly differentially expressed genes involved in immune function, regulation of the circadian rhythm, and metabolism.
These findings provide a number of interesting genes and pathways for further studies and also highlight the need to conduct a thorough study in multiple tissues at different time-points to pinpoint the window of developmental plasticity.
University of Auckland Liggins Institute