Model-based approaches for the detection of biologically active genomic regions from next generation sequencing data

Abstract

Next Generation Sequencing (NGS) technologies are quickly gaining popularity in biomedical research. A popular application of NGS is to detect potential gene regulatory elements that are captured or enriched by certain experimental procedures, for example, Chromatin Immunoprecipitation (ChIP-seq), DNase hypersensitive site mapping (DNase-seq), and Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), among others. While ChIP-seq can be used to identify protein-DNA interaction sites, both DNase-seq and FAIRE-seq can be used to identify open chromatin regions, which are more likely to contain elements involved in gene expression regulation. We collectively refer to these types of sequencing data as DAE-seq, where DAE stands for DNA After Enrichment. DAE-seq data can provide important insight into gene regulation, which is crucial to understanding the molecular mechanism of phenotypic outcomes, such as complex diseases. Here we address several practical issues facing biomedical researchers in the analysis of DAE-seq data through the development of several new and relevant statistical methods. We first introduce a three-component mixture regression model to discover "enriched regions", i.e., the genomic regions with more DAE-seq signal than expected in relation to background regions. We demonstrate its practical utility and accuracy in detecting regions of active regulatory elements across a wide range of commonly used DAE-seq datasets and experimental conditions. We then develop a novel Autoregressive Hidden Markov Model (AR-HMM) to account for often-ignored spatial dependence in DAE-seq data, and demonstrate that accounting for such dependence leads to increased performance in identifying biologically active genomic regions in both simulated and real datasets. We also introduce an efficient and novel variable selection procedure in the context of Hidden Markov Models when the means of the emission distributions of each state are modelled with covariates.
We study the asymptotic properties of the proposed variable selection procedure and apply this approach to simulated and real DAE-seq data. Lastly, we introduce a new method for the joint analysis of total and allele-specific read counts from DAE-seq data and RNA-seq data. In all, we develop several statistical procedures for the analysis of DAE-seq data that are highly relevant to biomedical researchers and have broader applicability to other problems in statistics.

Doctor of Philosophy
