591 research outputs found

    High-dimensional hierarchical models and massively parallel computing

    This work expounds a computationally expedient strategy for the fully Bayesian treatment of high-dimensional hierarchical models. Most steps in a Markov chain Monte Carlo routine for such models are either conditionally independent draws or low-dimensional draws based on summary statistics of parameters at higher levels of the hierarchy. We construct both sets of steps using parallelized algorithms designed to take advantage of the immense parallel computing power of general-purpose graphics processing units while avoiding the severe memory transfer bottleneck. We apply our strategy to RNA-sequencing (RNA-seq) data analysis, a multiple-testing, low-sample-size scenario where hierarchical models provide a way to borrow information across genes. Our approach is solidly tractable, and it performs well under several metrics of estimation, posterior inference, and gene detection. Best-case-scenario empirical Bayes counterparts perform equally well, lending support to existing empirical Bayes approaches in RNA-seq. Finally, we attempt to improve the robustness of estimation and inference of our RNA-seq model using alternate hierarchical distributions.

    Statistical Methods for Normalization and Analysis of High-Throughput Genomic Data

    High-throughput genomic datasets obtained from microarray or sequencing studies have revolutionized the field of molecular biology over the last decade. The complexity of these new technologies also challenges statisticians to separate biologically relevant information from technical noise. Two methods are introduced that address important issues with the normalization of array comparative genomic hybridization (aCGH) microarrays and the analysis of RNA sequencing (RNA-Seq) studies. Many studies investigating copy number aberrations at the DNA level for cancer and genetic studies use comparative genomic hybridization (CGH) on oligo arrays. However, aCGH data often suffer from low signal-to-noise ratios, resulting in poor resolution of fine features. Bilke et al. showed that the commonly used running average noise reduction strategy performs poorly when errors are dominated by systematic components. A method called pcaCGH is proposed that significantly reduces noise using a non-parametric regression on technical covariates of probes to estimate systematic bias. Then a robust principal components analysis (PCA) estimates any remaining systematic bias not explained by the technical covariates used in the preceding regression. The proposed algorithm is demonstrated on two CGH datasets measuring the NCI-60 cell lines utilizing NimbleGen and Agilent microarrays. The method achieves a nominal error variance reduction of 60%-65% as well as a 2-fold increase in signal-to-noise ratio on average, resulting in more detailed copy number estimates. Furthermore, correlations of signal intensity ratios of NimbleGen and Agilent arrays are increased by 40% on average, indicating a significant improvement in agreement between the technologies. A second algorithm called gamSeq is introduced to test for differential gene expression in RNA sequencing studies. Limitations of existing methods are outlined and the proposed algorithm is compared to these existing algorithms. 
Simulation studies and real data are used to show that gamSeq improves upon existing methods with regard to type I error control while maintaining similar or better power for a range of sample sizes for RNA-Seq studies. Furthermore, the proposed method is applied to detect differential 3' UTR usage.

    Statistical power analysis for single-cell RNA-sequencing

    RNA-sequencing (RNA-seq) is an established method to quantify levels of gene expression genome-wide. The recent development of single-cell RNA sequencing (scRNA-seq) protocols opens up the possibility to systematically characterize cell transcriptomes and their underlying developmental and regulatory mechanisms. Since the first publication on single-cell transcriptomics a decade ago, hundreds of scRNA-seq datasets from a variety of sources have been released, profiling gene expression of sorted cells, tumors, whole dissociated organs and even complete organisms. Currently, it is also the main tool to systematically characterize human cells within the Human Cell Atlas Project. Given its wide applicability and increasing popularity, many experimental protocols and computational analysis approaches exist for scRNA-seq. However, the technology remains experimentally and computationally challenging. Firstly, single cells contain only minute mRNA amounts that need to be reliably captured and amplified for accurate quantification by sequencing. Importantly, the Polymerase Chain Reaction (PCR) is commonly used for amplification, which might introduce biases and increase technical variation. Secondly, once the sequencing results are obtained, finding the best computational processing pipeline can be a struggle. A number of comparison studies have already been conducted, especially for bulk RNA-seq, but usually they deal with only one aspect of the workflow. Furthermore, the extent to which the conclusions and recommendations of these studies can be transferred to scRNA-seq is unknown. Related to the processing of RNA-sequencing, we investigate the effect of PCR amplification on differential expression analysis. We find that computational removal of duplicates has either a negligible or a negative impact on specificity and sensitivity of differential expression analysis, and we therefore recommend not removing read duplicates by mapping position. 
In contrast, if duplicates are identified using unique molecular identifiers (UMIs) tagging RNA molecules, both specificity and sensitivity improve. The first integral step of any scRNA-seq experiment is the preparation of sequencing libraries from the cells. We conducted an independent benchmarking study of popular library preparation protocols in terms of detection sensitivity, accuracy and precision using the same mouse embryonic stem cells and exogenous mRNA spike-ins. We recapitulate our previous finding that technical variance is markedly decreased when using UMIs to remove duplicates. In order to assign a monetary value to the detected amounts of technical variance, we developed a simulation framework that enabled us to compare the power to detect differentially expressed (DE) genes across the scRNA-seq library preparation protocols. Our experiences during this comparison study led to the development of the sequencing data processing in zUMIs and the simulation framework and power analysis in powsimR. zUMIs is a pipeline for processing scRNA-seq data with flexible choices regarding UMI and cell barcode design. In addition, we showed with powsimR simulations that the inclusion of intronic reads for gene expression quantification increases the power to detect DE genes, and we added it as a unique feature to zUMIs. In powsimR, we present our simulation framework extending choices concerning data analysis, enabling researchers to assess experimental design and analysis plans of RNA-seq in terms of statistical power. Lastly, we conducted a systematic evaluation of scRNA-seq experimental and analytical pipelines. We found that choices made concerning normalisation and library preparation protocols have the biggest impact on the validity of scRNA-seq DE analysis. Choosing a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the cell sample size. 
Taken together, we have established and applied a simulation framework that allowed us to benchmark experimental and computational scRNA-seq protocols and hence inform the experimental design and method choices of this important technology.

    Statistical and Computational Methods for Differential Expression Analysis in High-throughput Gene Expression Data

    In this dissertation, we develop novel statistical and computational methods for differential expression analysis in high-throughput gene expression data. In the first part, we develop statistical models for differential expression with a variety of study designs. In project one, we present an efficient algorithm for the detection of differential expression and splicing of genes in RNA-Seq data. Our approach considers three cases for each gene: no differential expression, differential expression without differential splicing, and differential splicing. We use a Poisson regression framework to model the read counts and a hierarchical likelihood ratio test for model selection. In project two, we present a non-parametric approach for the joint detection of differential expression and splicing of genes by introducing a new statistic named gene-level differential score and using a permutation test to assess the statistical significance. The method can be applied to a variety of experimental designs, including those with two (unpaired or paired) or multiple biological conditions, and those with quantitative or survival outcomes. In project three, we model the single-cell gene expression data using a two-part mixed model, which not only adequately accounts for the distinct features of single-cell expression data, including extra zero expression values, high variability and clustered design, but also provides the flexibility of adjusting for covariates. Compared with existing methods, our approach achieves improved power for detecting differentially expressed genes. In the second part, we propose novel methods to improve the computational efficiency of resampling-based test methods in genomics. In project four, we present a fast algorithm for evaluating small p-values from permutation tests based on the cross-entropy method. 
In chapter five, we develop an efficient algorithm for estimating small p-values in parametric bootstrap tests using the improved cross-entropy method to approximate the optimal proposal density and the Hamiltonian Monte Carlo method to efficiently sample from the optimal proposal density. These methods together address a critical challenge for resampling-based tests in genomics, since an enormous number of resamples is needed for estimating very small p-values. Simulations and applications to real data demonstrate that our methods achieve significant gains in computational efficiency compared with existing methods. PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/135864/1/shyboy_1.pd

    Statistical Methods for the Analysis of Epigenomic Data

    Epigenomics, the study of the human genome and its interactions with proteins and other cellular elements, has become of significant interest in the past decade. Several landmark studies have shown that these interactions regulate essential cellular processes (gene transcription, gene silencing, etc.) and are associated with multiple complex disorders such as cancer and cardiovascular disease. Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) is one of several techniques used to (1) detect protein-DNA interaction sites, (2) classify differential epigenomic activity across conditions, and (3) characterize subpopulations of single cells in heterogeneous samples. In this dissertation, we present statistical methods to tackle problems (1-3) in contexts where protein-DNA interaction sites expand across broad genomic domains. First, we present a statistical model that integrates data from multiple epigenomic assays and detects protein-DNA interaction sites in consensus across multiple replicates. We introduce a class of zero-inflated mixed-effects hidden Markov models (HMMs) to account for the excess of observed zeros, the latent sample-specific differences, and the local dependency of sequencing read counts. By integrating multiple samples into a statistical model tailored for broad epigenomic marks, our model shows high sensitivity and specificity in both simulated and real datasets. Second, we present an efficient framework for the detection and classification of regions exhibiting differential epigenomic activity in multi-sample multi-condition designs. The presented model utilizes a finite mixture model embedded into an HMM to classify patterns of broad and short differential epigenomic activity across conditions. We utilize a fast rejection-controlled EM algorithm that makes our implementation among the fastest algorithms available, while showing improvement in performance in data from broad epigenomic marks. 
Lastly, we analyze data from single-cell ChIP-seq assays and present a statistical model that allows the simultaneous clustering and characterization of single-cell subpopulations. The presented framework is robust to the often observed sparsity in single-cell epigenomic data and accounts for the local dependency of counts. We introduce a scheme for initializing the EM algorithm and for identifying the number of single-cell subpopulations in the data, a common task in current single-cell epigenomic algorithms. Doctor of Philosoph

    The design and statistical analysis of single-cell RNA-sequencing experiments

    Next-generation DNA- and RNA-sequencing (RNA-seq) technologies have expanded rapidly in both throughput and accuracy within the last decade. The momentum continues as emerging techniques become increasingly capable of profiling molecular content at the level of individual cells. One goal of this research is to put forward best practices in the design of single-cell RNA-sequencing (scRNA-seq) experiments, specifically as they relate to choices regarding the trade-off between sequencing depth and sample size. In addition to general guidelines, an interactive tool is presented to aid researchers in making experiment-specific decisions that are informed by real data and practical constraints. Further, a new approach to the modeling and testing of differential gene expression in scRNA-seq data is proposed, which notably incorporates salient features (e.g. highly zero-inflated expression values) of single-cell transcription that are otherwise obscured at the tissue level. As single-cell technologies offer an unprecedented window into cell-to-cell heterogeneity and its biological consequences, it is essential that suitable approaches are adopted for both the design and analysis of these experiments.