
    Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments

    A two-groups mixed-effects model for the comparison of (normalized) microarray data from two treatment groups is considered. Most competing parametric methods that have appeared in the literature are obtained as special cases or by minor modification of the proposed model. Approximate maximum likelihood fitting is accomplished via a fast and scalable algorithm, which we call LEMMA (Laplace approximated EM Microarray Analysis). The posterior odds of treatment × gene interactions, derived from the model, involve shrinkage estimates of both the interactions and of the gene-specific error variances. Genes are classified as being associated with treatment based on the posterior odds and the local false discovery rate (f.d.r.) with a fixed cutoff. Our model-based approach also allows one to declare the non-null status of a gene by controlling the false discovery rate (FDR). It is shown in a detailed simulation study that the approach outperforms well-known competitors. We also apply the proposed methodology to two previously analyzed microarray examples. Extensions of the proposed method to paired treatments and multiple treatments are also discussed. Comment: Published at http://dx.doi.org/10.1214/10-STS339 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).
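
    As a point of reference for the classification rule described above, the local false discovery rate can be written in a generic two-groups form; the notation below is assumed for illustration and is not taken from the paper.

```latex
% Generic two-groups mixture for a per-gene summary statistic z_g
% (\pi_0 = prior null probability, f_0 = null density, f_1 = non-null density):
f(z) = \pi_0 f_0(z) + (1 - \pi_0)\, f_1(z), \qquad
\mathrm{fdr}(z) = \frac{\pi_0 f_0(z)}{f(z)}
```

    Under this formulation a gene would be declared associated with treatment when fdr(z_g) falls below a fixed cutoff, and the corresponding posterior odds of a non-null gene are (1 - fdr(z_g)) / fdr(z_g).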

    Bayesian testing of many hypotheses × many genes: A study of sleep apnea

    Substantial statistical research has recently been devoted to the analysis of large-scale microarray experiments which provide a measure of the simultaneous expression of thousands of genes in a particular condition. A typical goal is the comparison of gene expression between two conditions (e.g., diseased vs. nondiseased) to detect genes which show differential expression. Classical hypothesis testing procedures have been applied to this problem and more recent work has employed sophisticated models that allow for the sharing of information across genes. However, many recent gene expression studies have an experimental design with several conditions that requires an even more involved hypothesis testing approach. In this paper, we use a hierarchical Bayesian model to address the situation where there are many hypotheses that must be simultaneously tested for each gene. In addition to having many hypotheses within each gene, our analysis also addresses the more typical multiple comparison issue of testing many genes simultaneously. We illustrate our approach with an application to a study of genes involved in obstructive sleep apnea in humans. Comment: Published at http://dx.doi.org/10.1214/09-AOAS241 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
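
    To make the gene-level multiplicity control concrete, here is a minimal sketch of a standard Bayesian FDR rule based on posterior null probabilities; the function name and toy inputs are assumptions for illustration, and this is not the paper's hierarchical model.

```python
import numpy as np

def bayesian_fdr_calls(post_null, alpha=0.05):
    """Call genes non-null while the running mean of the sorted posterior
    null probabilities (an estimate of the Bayesian FDR) stays below alpha."""
    post_null = np.asarray(post_null, dtype=float)
    order = np.argsort(post_null)                    # most likely non-null genes first
    running_fdr = np.cumsum(post_null[order]) / np.arange(1, post_null.size + 1)
    n_called = int(np.searchsorted(running_fdr, alpha, side="right"))
    called = np.zeros(post_null.size, dtype=bool)
    called[order[:n_called]] = True
    return called

# toy usage: posterior null probabilities for six genes
print(bayesian_fdr_calls([0.01, 0.02, 0.40, 0.90, 0.03, 0.70], alpha=0.05))
```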

    Model-based clustering with data correction for removing artifacts in gene expression data

    The NIH Library of Integrated Network-based Cellular Signatures (LINCS) contains gene expression data from over a million experiments, using Luminex Bead technology. Only 500 colors are used to measure the expression levels of the 1,000 landmark genes, and the data for the resulting pairs of genes are deconvolved. The raw data are sometimes inadequate for reliable deconvolution, leading to artifacts in the final processed data. These include the expression levels of paired genes being flipped or given the same value, and clusters of values that are not at the true expression level. We propose a new method called model-based clustering with data correction (MCDC) that is able to identify and correct these three kinds of artifacts simultaneously. We show that MCDC improves the resulting gene expression data in terms of agreement with external baselines, as well as improving results from subsequent analysis. Comment: 28 pages.
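
    As a rough illustration of the flipped-pair artifact described above, the toy check below flags samples in which swapping a deconvolved gene pair moves both values closer to their genes' medians; this is a hypothetical heuristic for exposition only, not the MCDC model.

```python
import numpy as np

def flag_possible_flips(x, y):
    """Heuristic check for flipped values in a deconvolved gene pair: a sample
    is flagged when swapping (x, y) moves both values closer to their genes'
    medians. Toy illustration only, not the MCDC clustering model."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = np.median(x), np.median(y)
    as_is   = np.abs(x - mx) + np.abs(y - my)
    swapped = np.abs(y - mx) + np.abs(x - my)
    return swapped < as_is

x = [5.1, 5.0, 9.2, 5.2]   # gene A; the third sample sits at gene B's level
y = [9.0, 9.1, 5.0, 9.3]   # gene B
print(flag_possible_flips(x, y))   # -> [False False  True False]
```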

    Sequential stopping for high-throughput experiments

    In high-throughput experiments, the sample size is typically chosen informally. Most formal sample-size calculations depend critically on prior knowledge. We propose a sequential strategy that, by updating knowledge when new data are available, depends less critically on prior assumptions. Experiments are stopped or continued based on the potential benefit of obtaining additional data. The underlying decision-theoretic framework guarantees that the design proceeds in a coherent fashion. We propose intuitively appealing, easy-to-implement utility functions. As in most sequential design problems, an exact solution is prohibitive. We propose a simulation-based approximation that uses decision boundaries. We apply the method to RNA-seq, microarray, and reverse-phase protein array studies and show its potential advantages. The approach has been added to the Bioconductor package gaga.
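
    A minimal sketch of the stop/continue comparison is given below, assuming stand-in posterior null probabilities and a crudely simulated "expected new discoveries" benefit; the names and the utility are illustrative assumptions and do not reproduce the decision boundaries implemented in gaga.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_new_discoveries(post_null, shrink_low=0.3, n_sim=200):
    """Crude forward simulation: estimate how many borderline genes (posterior
    null probability in (0.05, 0.5)) might cross the 0.05 call threshold if one
    more batch of data sharpened their posteriors. Purely illustrative."""
    borderline = post_null[(post_null > 0.05) & (post_null < 0.5)]
    gains = [np.sum(borderline * rng.uniform(shrink_low, 1.0, borderline.size) <= 0.05)
             for _ in range(n_sim)]
    return float(np.mean(gains))

post_null = rng.uniform(0, 1, size=5000)          # stand-in posterior null probabilities
cost_per_batch, value_per_discovery = 50.0, 1.0   # assumed utility weights
benefit = value_per_discovery * expected_new_discoveries(post_null)
print("continue sampling" if benefit > cost_per_batch else "stop")
```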

    Modeling and analysis of RNA-seq data: a review from a statistical perspective

    Background: Since the invention of next-generation RNA sequencing (RNA-seq) technologies, they have become a powerful tool for studying the presence and quantity of RNA molecules in biological samples and have revolutionized transcriptomic studies. The analysis of RNA-seq data at four different levels (samples, genes, transcripts, and exons) involves multiple statistical and computational questions, some of which remain challenging to date. Results: We review RNA-seq analysis tools at the sample, gene, transcript, and exon levels from a statistical perspective. We also highlight the biological and statistical questions of greatest practical importance. Conclusion: The development of statistical and computational methods for analyzing RNA-seq data has made significant advances in the past decade. However, methods developed to answer the same biological question often rely on diverse statistical models and exhibit different performance under different scenarios. This review discusses and compares multiple commonly used statistical models regarding their assumptions, in the hope of helping users select appropriate methods as needed, as well as assisting developers with future method development.
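
    As one concrete example of the count models such reviews compare, the sketch below fits a negative binomial GLM with a library-size offset to hypothetical counts for a single gene in two conditions; the counts, library sizes, and fixed dispersion value are assumptions made for illustration.

```python
import numpy as np
import statsmodels.api as sm

counts  = np.array([512, 480, 601, 950, 1021, 880])        # hypothetical counts for one gene
group   = np.array([0, 0, 0, 1, 1, 1])                     # two conditions
libsize = np.array([1.0e6, 0.9e6, 1.1e6, 1.0e6, 1.2e6, 0.95e6])

X = sm.add_constant(group)                                  # intercept + condition effect
model = sm.GLM(counts, X,
               family=sm.families.NegativeBinomial(alpha=0.1),  # fixed dispersion for illustration
               offset=np.log(libsize))                          # library-size offset
fit = model.fit()
print(fit.params)    # intercept and log fold change
print(fit.pvalues)   # Wald p-values
```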

    Finding Groups in Gene Expression Data

    The vast potential of the genomic insight offered by microarray technologies has led to their widespread use since they were introduced a decade ago. Application areas include gene function discovery, disease diagnosis, and inferring regulatory networks. Microarray experiments enable large-scale, high-throughput investigations of gene activity and have thus provided the data analyst with a distinctive, high-dimensional field of study. Many questions in this field relate to finding subgroups of data profiles which are very similar. A popular type of exploratory tool for finding subgroups is cluster analysis, and many different flavors of algorithms have been used and indeed tailored for microarray data. Cluster analysis, however, implies a partitioning of the entire data set, and this does not always match the objective. Sometimes pattern discovery or bump hunting tools are more appropriate. This paper reviews these various tools for finding interesting subgroups.
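
    A minimal clustering sketch in this spirit is shown below, applying k-means to a hypothetical gene-by-sample expression matrix with two planted profile groups; the data and the choice of two clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical expression matrix: 300 genes x 20 samples with two planted profiles.
up_in_first_half  = np.hstack([rng.normal(2, 1, (150, 10)), rng.normal(0, 1, (150, 10))])
up_in_second_half = np.hstack([rng.normal(0, 1, (150, 10)), rng.normal(2, 1, (150, 10))])
expr = np.vstack([up_in_first_half, up_in_second_half])

# k-means on gene profiles; in practice the number of clusters and the clustering
# model (e.g. mixture models) are themselves analysis choices.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expr)
print(np.bincount(labels))   # roughly 150 genes per cluster
```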

    Supervised normalization of microarrays

    Motivation: A major challenge in utilizing microarray technologies to measure nucleic acid abundances is ‘normalization’, the goal of which is to separate biologically meaningful signal from other confounding sources of signal, often due to unavoidable technical factors. It is intuitively clear that true biological signal and confounding factors need to be simultaneously considered when performing normalization. However, the most popular normalization approaches do not utilize what is known about the study, both in terms of the biological variables of interest and the known technical factors in the study, such as batch or array processing date.
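
    The idea of normalizing with both the biological variables and the known technical factors in the model can be sketched as follows: fit them jointly and subtract only the fitted technical component. This is a simplified illustration under assumed toy data, not the published method's implementation.

```python
import numpy as np

def remove_batch_keep_biology(y, biology, batch):
    """Fit expression = biology + batch jointly by least squares and subtract
    only the fitted batch component, so the biological signal is protected.
    Simplified illustration of supervised normalization."""
    biology = np.asarray(biology, float).reshape(len(y), -1)
    batch_dummies = np.eye(int(np.max(batch)) + 1)[np.asarray(batch)]
    X = np.column_stack([np.ones(len(y)), biology, batch_dummies[:, 1:]])  # drop one batch level
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    batch_effect = X[:, 1 + biology.shape[1]:] @ coef[1 + biology.shape[1]:]
    return y - batch_effect

# toy: biology adds +1 to the second half of samples, batch 1 adds +3 to its samples
rng = np.random.default_rng(2)
biology = np.repeat([0, 1], 6)
batch = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
y = 0.5 + 1.0 * biology + 3.0 * (batch == 1) + rng.normal(0, 0.1, 12)
print(remove_batch_keep_biology(y, biology, batch).round(2))
```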

    Multiple Testing Procedures for the Analysis of Microarray Data

    We reviewed the literature on various multiple testing techniques, especially those addressing microarray analyses and small sample sizes, and reanalyzed data from Yuen et al. (Physiological Genomics, 2006), which compared the effect of HgCl2 and ischemia/reperfusion injuries on rat kidney tissues. Our analysis uses only 22 rats, with small numbers of rats in each treatment group, and 9,501 genes under study. We used empirical Bayes (EB) and permutation testing (implemented in Bioconductor) in an effort to identify differentially expressed genes. EB identified a large number of genes as differentially expressed, including both previously identified and newly identified genes. The newly identified genes appear to have biological functions similar to those previously identified. We also recognized power differences between EB and permutation tests, possibly due to nonnormality of the data but also because permutation tests do not make use of all available information in the data.
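
    For reference, a self-contained toy version of the permutation-plus-FDR side of such a comparison is sketched below on hypothetical data; the empirical Bayes moderation itself is not implemented here, and all names and inputs are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_pvalues(expr, groups, n_perm=1000):
    """Two-sample permutation test per gene on the difference in group means."""
    groups = np.asarray(groups)
    obs = expr[:, groups == 1].mean(axis=1) - expr[:, groups == 0].mean(axis=1)
    exceed = np.zeros(expr.shape[0])
    for _ in range(n_perm):
        perm = rng.permutation(groups)
        diff = expr[:, perm == 1].mean(axis=1) - expr[:, perm == 0].mean(axis=1)
        exceed += np.abs(diff) >= np.abs(obs)
    return (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(p, alpha=0.05):
    """Return a boolean mask of rejections at FDR level alpha."""
    p = np.asarray(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, len(p) + 1) / len(p)
    below = p[order] <= thresh
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject

# hypothetical data: 200 genes x 10 arrays, first 20 genes truly shifted
expr = rng.normal(size=(200, 10))
expr[:20, 5:] += 2.0
groups = np.array([0] * 5 + [1] * 5)
print(benjamini_hochberg(permutation_pvalues(expr, groups)).sum())
```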