613 research outputs found

    limma powers differential expression analyses for RNA-sequencing and microarray studies

    limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

    Simultaneous inference: When should hypothesis testing problems be combined?

    Modern statisticians are often presented with hundreds or thousands of hypothesis testing problems to evaluate at the same time, generated from new scientific technologies such as microarrays, medical and satellite imaging devices, or flow cytometry counters. The relevant statistical literature tends to begin with the tacit assumption that a single combined analysis, for instance, a False Discovery Rate assessment, should be applied to the entire set of problems at hand. This can be a dangerous assumption, as the examples in the paper show, leading to overly conservative or overly liberal conclusions within any particular subclass of the cases. A simple Bayesian theory yields a succinct description of the effects of separation or combination on false discovery rate analyses. The theory allows efficient testing within small subclasses, and has applications to "enrichment," the detection of multi-case effects. Comment: Published at http://dx.doi.org/10.1214/07-AOAS141 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
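The contrast between combined and separate false discovery rate analyses can be illustrated with a minimal sketch. It assumes the standard Benjamini-Hochberg step-up procedure as the FDR method and uses made-up p-values for a small enriched class and a larger, mostly null class. Neither the data nor the function name comes from the paper, and the paper's Bayesian theory is not implemented here.

```python
def benjamini_hochberg(pvalues, q=0.1):
    """Return the sorted indices rejected by the BH step-up procedure at level q."""
    m = len(pvalues)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Track the largest rank whose p-value falls under the BH line q * rank / m.
        if pvalues[idx] <= q * rank / m:
            k = rank
    return sorted(order[:k])


# Hypothetical p-values: a small class rich in effects, a large class of nulls.
class_a = [0.001, 0.004, 0.012, 0.03]
class_b = [0.08, 0.15, 0.21, 0.27, 0.33, 0.39, 0.45, 0.51,
           0.57, 0.63, 0.69, 0.75, 0.81, 0.87, 0.93, 0.99]

separate_a = benjamini_hochberg(class_a, q=0.1)
combined = benjamini_hochberg(class_a + class_b, q=0.1)

print("rejections in class A, analysed separately:", len(separate_a))
print("rejections in class A, analysed with class B:", len(combined))
```

In this toy setting the combined analysis rejects fewer class A hypotheses than the separate one, because the many null cases in class B lower the rejection threshold for every case. This is the conservative side of the effect the abstract describes.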

    Borrowing information across genes and experiments for improved error variance estimation in microarray data analysis and statistical inferences for gene expression heterosis

    The advancement in microarray technology enables the simultaneous measurement of expression levels of thousands of genes. However, due to the relatively high cost of making a replicate in a microarray experiment, the number of replicates in a single experiment is typically small. This results in the small n, large p problem for statistical inferences, where there are gene expression measurements for many genes, but only a few biological replicates (or observations) for each gene. In this dissertation, we develop statistical models and methods for microarray data to borrow information across genes and/or even across experiments to improve statistical inferences for specific biological questions. In Chapter 2, we develop statistical methods to improve the estimation of gene expression error variances. Good estimation of error variances is crucial for detecting differentially expressed genes (genes that differ in mean expression level across treatments or conditions of interest). Since the sample size available for each gene is often low, the usual unbiased estimator of the error variance can be unreliable. Shrinkage methods, including empirical Bayes approaches that borrow information across genes to produce more stable estimates, have been developed in recent years. Because the same microarray platform is often used for at least several experiments to study similar biological systems, there is an opportunity to improve variance estimation further by borrowing information not only across genes but also across experiments. We propose a lognormal model for error variances that involves random gene effects and random experiment effects. Based on the model, we develop an empirical Bayes estimator of the error variance for each combination of gene and experiment and call this estimator BAGE because information is Borrowed Across Genes and Experiments. A permutation strategy is used to make inference about the differential expression status of each gene. 
Simulation studies with data generated from different probability models and real microarray data show that our method outperforms existing approaches. In Chapter 3, we develop statistical methods to improve the estimation and testing of gene expression heterosis. Heterosis, also known as the hybrid vigor, refers to the superior phenotype of the hybrid offspring relative to its two inbred parents. Though the heterosis phenomenon has been extensively utilized in agriculture for over a century, the molecular basis is still unknown. In an effort to understand the basic mechanisms responsible for the phenotypic heterosis at the molecular level, researchers have begun to compare expression levels of thousands of genes in the parental inbred lines and their offspring to find genes that exhibit gene expression heterosis. In our study, we focus on three types of gene expression heterosis: high-parent heterosis, low-parent heterosis and mid-parent heterosis. Currently, the sample average method is the most commonly used method for estimation and testing of gene expression heterosis. However, the sample average estimators underestimate high-parent heterosis and low-parent heterosis, which consequently leads to loss of power in hypothesis testing. Though the sample average estimator for mid-parent heterosis is unbiased, with only a few replicates in a typical microarray experiment, estimation is highly variable. To improve the estimation and testing of all three types of gene expression heterosis, we develop a hierarchical model, which permits information sharing across genes. Based on the model, we derive empirical Bayes estimators, and test gene expression heterosis using posterior probabilities. The effectiveness of our approach is demonstrated through simulations based on two real heterosis microarray experiments as well as hypothetical probability models that violate our model assumptions. 
Chapter 4 presents statistical analysis of a soil-based carbon sequestration experiment. Driven by global climate change due to the increasing level of atmospheric carbon dioxide, researchers have proposed a soil-based carbon sequestration approach. A soil-based carbon sequestration approach reduces carbon dioxide emission from crop residues after harvesting and sequesters more carbon into the land as a soil nutrient. Previous research has reported significant differences across species in their rates of residue decomposition and the amount of carbon dioxide emission. Because the biomass composition varies across maize genotypes, we hypothesize that there are also differences among genotypes within the maize species in their rates of biomass decomposition and abilities of carbon sequestration. We designed and performed a longitudinal experiment to measure the amount of carbon dioxide flux from crop stover samples of 14 maize varieties. Flux observations for more than 150 days were collected. We modeled the logarithm of carbon dioxide flux as a linear function of genotype, day, and genotype-by-day interaction effects as well as several other important fixed and random factors. The analysis results show significant differences among maize varieties with respect to the accumulated carbon dioxide flux from crop residues as well as flux pattern over time. We also investigate relationships of carbon dioxide emission and several potentially influential chemical compounds in the maize residue biomass composition. These results suggest the potential for development of carbon capturing crops through bioengineering or hybrid methods.
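The idea of borrowing information across genes to stabilise error variance estimates, discussed for Chapter 2 of this dissertation, can be sketched as follows. This is not the BAGE estimator. It is a simpler shrinkage rule of the kind used in moderated t-statistics, where each gene's sample variance is averaged with a common prior variance, weighted by degrees of freedom. The prior degrees of freedom (`prior_df`), the use of the mean sample variance as the prior, and the gene names and values are all assumptions made for illustration.

```python
from statistics import mean, variance


def shrink_variances(expr_by_gene, prior_df=4.0):
    """Shrink per-gene sample variances toward their common mean.

    expr_by_gene maps a gene name to its replicate measurements.
    prior_df is an assumed weight for the shared prior; it is fixed by hand
    here rather than estimated from the data.
    """
    sample_vars = {gene: variance(vals) for gene, vals in expr_by_gene.items()}
    prior_var = mean(list(sample_vars.values()))
    shrunk = {}
    for gene, vals in expr_by_gene.items():
        d = len(vals) - 1  # residual degrees of freedom for this gene
        shrunk[gene] = (prior_df * prior_var + d * sample_vars[gene]) / (prior_df + d)
    return sample_vars, shrunk


# Hypothetical log-expression values, three replicates per gene.
data = {
    "gene_a": [5.0, 5.1, 4.9],
    "gene_b": [7.0, 9.0, 8.0],
    "gene_c": [3.0, 3.5, 4.0],
}
sample_vars, shrunk = shrink_variances(data)
for gene in data:
    print(gene, round(sample_vars[gene], 4), round(shrunk[gene], 4))
```

With only two residual degrees of freedom per gene, the prior dominates and the very small and very large sample variances are both pulled toward the middle, which is what makes the resulting test statistics more stable.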

    Differential expression and detection of transcripts in sweetpotato (Ipomoea batatas (L.) Lam.) using cDNA microarrays

    Microarray protocols were developed for sweetpotato (Ipomoea batatas (L.) Lam.) and then used to study issues of importance in sweetpotato physiology and production. The effect of replication number and image analysis software was compared with results obtained by quantitative real-time PCR. The results indicated that reliable results could be obtained using six replicates and UCSF Spot image analysis software. These methodologies were employed to elucidate aspects of sweetpotato development, physiology and response to virus infection. Storage root formation is the most economically important process in sweetpotato development. Gene expression levels were compared between fibrous and storage roots of the cultivar Jewel. Sucrose synthase, ADP-glucose pyrophosphorylase, and fructokinase were up-regulated in storage roots, while hexokinase was not differentially expressed. A variety of transcription factors were differentially expressed as well as several auxin-related genes. The orange flesh color of sweetpotato is due to β-carotene stored in chromoplasts of root cells. β-carotene is important because of its role in human health. To elucidate biosynthesis and storage of β-carotene in sweetpotato roots, microarray analysis was used to investigate genes differentially expressed between ‘White Jewel’ and ‘Jewel’ storage roots. β-carotene content calculated for ‘Jewel’ and ‘White Jewel’ were 20.66 mg/100 g fresh weight (FW) and 1.68 mg/100 g FW, respectively. Isopentenyl diphosphate isomerase was down-regulated in ‘White Jewel’, but three other genes in the β-carotene biosynthetic pathway were not differentially expressed. Several genes associated with chloroplasts were differentially expressed, indicating probable differences in chromoplast development of ‘White Jewel’ and ‘Jewel’. 
Sweet potato virus disease (SPVD) is caused by the co-infection of plants with a potyvirus, Sweet potato feathery mottle virus (SPFMV), and a crinivirus, Sweet potato chlorotic stunt virus (SPCSV). Expression analysis revealed that the number of differentially expressed genes in plants infected with SPFMV alone and SPCSV alone compared to virus-tested plants was only three and 14, respectively. In contrast, more than 200 genes from various functional categories were differentially expressed between virus-tested and SPVD-affected plants. Microarray analysis has proved to be a useful tool to study important aspects of sweetpotato physiology and production.

    Extent, impact, and mitigation of batch effects in tumor biomarker studies using tissue microarrays

    Tissue microarrays (TMAs) have been used in thousands of cancer biomarker studies. To what extent batch effects, measurement error in biomarker levels between slides, affect TMA-based studies has not been assessed systematically. We evaluated 20 protein biomarkers on 14 TMAs with prospectively collected tumor tissue from 1,448 primary prostate cancers. In half of the biomarkers, more than 10% of biomarker variance was attributable to between-TMA differences (range, 1–48%). We implemented different methods to mitigate batch effects (R package batchtma), tested in plasmode simulation. Biomarker levels were more similar between mitigation approaches compared to uncorrected values. For some biomarkers, associations with clinical features changed substantially after addressing batch effects. Batch effects and resulting bias are not an error of an individual study but an inherent feature of TMA-based protein biomarker studies. They always need to be considered during study design and addressed analytically in studies using more than one TMA.
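As a rough illustration of what "variance attributable to between-TMA differences" can mean, the sketch below computes the share of the total sum of squares explained by batch means. This crude ratio is an assumption chosen for simplicity. It is not the estimator used in the study and it does not use the batchtma package; the batch labels and values are hypothetical.

```python
from statistics import mean


def between_batch_fraction(values_by_batch):
    """Fraction of the total sum of squares explained by differences in batch means."""
    all_values = [x for vals in values_by_batch.values() for x in vals]
    grand_mean = mean(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    # Each batch contributes its size times the squared distance of its mean
    # from the grand mean.
    ss_between = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2
        for vals in values_by_batch.values()
    )
    return ss_between / ss_total


# Hypothetical biomarker levels measured on two TMAs.
levels = {
    "tma_1": [1.0, 2.0, 3.0],
    "tma_2": [2.0, 3.0, 4.0],
}
print(round(between_batch_fraction(levels), 3))
```

A ratio near zero suggests the batches have similar means, while a large ratio indicates that much of the spread in biomarker levels tracks the slide rather than the tumor.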

    Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

    Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. 
Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.

    Population Genetics And Mixed Stock Analysis Of Chum Salmon (Oncorhynchus Keta) With Molecular Genetics

    Thesis (Ph.D.) University of Alaska Fairbanks, 2012. Chum salmon (Oncorhynchus keta) are important for subsistence and commercial harvest in Alaska. Variability of returns to western Alaskan drainages that caused economic hardship for stakeholders has led to speculation about reasons, which may include both anthropogenic and environmental causes in the marine environment. Mixed stock analysis (MSA) compares genetic information from an individual caught at sea to a reference baseline of genotypes to assign it to its population of origin. Application of genetic baselines requires several complex steps that can introduce bias. The bias may reduce the accuracy of MSA and result in overly-optimistic evaluations of baselines. Moreover, some applications that minimize bias cannot use informative haploid mitochondrial variation. Costs of baseline development are species-specific and difficult to predict. Finally, because populations of western Alaskan chum salmon demonstrate weak genetic divergence, samples from mixtures cannot be accurately assigned to a population of origin. The chapters of this thesis address three challenges. The first chapter describes technical aspects of genetic marker development. The second chapter describes a method to evaluate the precision and accuracy of a genetic baseline that accepts any type of data and reduces bias that may have been introduced during baseline development. This chapter also includes a method that places a cost on baseline development by predicting the number of markers needed to achieve a given accuracy. The final chapter explores the reasons for the weak genetic structure of western Alaskan chum salmon populations. The results of those analyses and both geological and archaeological data suggest that recent environmental and geological processes may be involved. The methods and analyses in this thesis can be applied to any species and may be particularly useful for other western Alaskan species.