803 research outputs found

    A hidden Markov random field-based Bayesian method for the detection of long-range chromosomal interactions in Hi-C data

    Motivation: Advances in chromosome conformation capture and next-generation sequencing technologies are enabling genome-wide investigation of dynamic chromatin interactions. For example, Hi-C experiments generate genome-wide contact frequencies between pairs of loci by sequencing DNA segments ligated from loci in close spatial proximity. One essential task in such studies is peak calling, that is, detecting non-random interactions between loci from the two-dimensional contact frequency matrix. Successful fulfillment of this task has many important implications, including identifying long-range interactions that assist in interpreting a sizable fraction of the results from genome-wide association studies. The task, distinguishing biologically meaningful chromatin interactions from massive numbers of random interactions, poses great challenges both statistically and computationally. Model-based methods to address this challenge are still lacking. In particular, no statistical model exists that takes the underlying dependency structure into consideration. Results: In this paper, we propose a hidden Markov random field (HMRF) based Bayesian method to rigorously model interaction probabilities in the two-dimensional space based on the contact frequency matrix. By borrowing information from neighboring loci pairs, our method demonstrates superior reproducibility and statistical power in both simulation studies and real data analysis.
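
To illustrate what "borrowing information from neighboring loci pairs" can mean in practice, here is a minimal sketch, not the authors' implementation. It assumes a two-state hidden field (background vs. peak) with an Ising-type neighbor prior, Poisson emissions with fixed rates, and iterated conditional modes (ICM) in place of full Bayesian inference. The function name `icm_peak_calls` and the parameters `lam_bg`, `lam_peak`, and `beta` are hypothetical choices for illustration only.

```python
import numpy as np


def icm_peak_calls(counts, lam_bg=2.0, lam_peak=8.0, beta=0.8, n_sweeps=5):
    """Hypothetical helper: label each contact-matrix cell 0 (background) or 1 (peak).

    Assumptions (not taken from the abstract): Poisson emissions with fixed
    rates, a 4-neighbor Ising-type prior with strength `beta`, and ICM updates.
    """
    counts = np.asarray(counts, dtype=float)
    # Poisson log-likelihood per state; the log(k!) term is shared by both
    # states, so it is dropped because only the comparison matters.
    loglik = np.stack([
        counts * np.log(lam_bg) - lam_bg,
        counts * np.log(lam_peak) - lam_peak,
    ])
    # Start from the independent call that ignores neighbors.
    labels = np.argmax(loglik, axis=0)
    n_rows, n_cols = counts.shape
    for _ in range(n_sweeps):
        for i in range(n_rows):
            for j in range(n_cols):
                neighbors = [
                    labels[a, b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < n_rows and 0 <= b < n_cols
                ]
                n_peak = int(sum(neighbors))
                n_bg = len(neighbors) - n_peak
                # Each state is rewarded by beta for every agreeing neighbor.
                score_bg = loglik[0, i, j] + beta * n_bg
                score_peak = loglik[1, i, j] + beta * n_peak
                labels[i, j] = int(score_peak > score_bg)
    return labels
```

With `beta=0` the sketch reduces to an independent per-cell call. With `beta>0`, a borderline cell surrounded by confident peaks can be pulled into the peak state, which is the intuition behind modeling spatial dependency.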

    ZipHiC: a novel Bayesian framework to identify enriched interactions and experimental biases in Hi-C data

    Motivation: Several computational and statistical methods have been developed to analyse data generated through 3C-based methods, especially Hi-C. Most of the existing methods do not account for dependency in Hi-C data. Results: Here, we present ZipHiC, a novel statistical method to explore Hi-C data focusing on the detection of enriched contacts. ZipHiC implements a Bayesian method based on a hidden Markov random field (HMRF) model and Approximate Bayesian Computation (ABC) to detect interactions in two-dimensional space based on a Hi-C contact frequency matrix. ZipHiC uses data on the sources of biases related to the contact frequency matrix, allows borrowing information from neighbours using the Potts model, and improves computation speed by using ABC. In addition to outperforming existing tools on both simulated and real data, our model also provides insights into different sources of biases that affect Hi-C data. We show that some datasets display higher biases from DNA accessibility or Transposable Elements content. Furthermore, our analysis in D. melanogaster showed that approximately half of the detected significant interactions connect promoters with other parts of the genome, indicating a functional biological role. Finally, we found that the micro-C datasets display higher biases from DNA accessibility compared to a similar Hi-C experiment, but this can be corrected by ZipHiC.

    FastHiC: a fast and accurate algorithm to detect long-range chromosomal interactions from Hi-C data

    Motivation: How chromatin folds in three-dimensional (3D) space is closely related to transcription regulation. As powerful tools to study such 3D chromatin conformation, the recently developed Hi-C technologies enable a genome-wide measurement of pair-wise chromatin interaction. However, methods for the detection of biologically meaningful chromatin interactions, i.e. peak calling, from Hi-C data are still under development. In our previous work, we developed a novel hidden Markov random field (HMRF) based Bayesian method which, by explicitly modeling the non-negligible spatial dependency among adjacent pairs of loci that manifests in high-resolution Hi-C data, achieves substantially improved robustness and enhanced statistical power in peak calling. Although superior, both methodologically and in performance, to peak callers that ignore spatial dependency, our previous Bayesian framework suffers from heavy computational costs due to the intensive computation incurred by modeling the correlated peak status of neighboring loci pairs and inferring the hidden dependency structure.

    Bioinformatics Tools for Exploring Regulatory Mechanisms

    Gene expression is the fundamental initial step in the flow of genetic information in biological systems and it is controlled by multiple precisely coordinated regulatory mechanisms, such as structural and epigenetic regulations. Dysregulation of gene expression plays important roles in the development of a broad range of diseases. Modern high-throughput technologies provide unprecedented opportunities to investigate these diverse regulatory mechanisms on a genome-wide scale. Here we develop several methods to analyze these omics profiles. First, Hi-C experiments generate genome-wide contact frequencies between pairs of loci by sequencing DNA segments ligated from loci in close spatial proximity. To detect biologically meaningful interactions between loci, we propose a hidden Markov random field (HMRF) based Bayesian method to rigorously model interaction probabilities in the two-dimensional space based on the contact frequency matrix. By borrowing information from neighboring loci pairs, our method demonstrates superior reproducibility and statistical power in both simulation studies and real data analysis. Second, DNA methylation is a key epigenetic mark involved in both normal development and disease progression. To facilitate joint analysis of methylation data from multiple platforms with varying resolution, we propose a penalized functional regression model to impute missing methylation data. By incorporating functional predictors, our model utilizes information from non-local probes to improve imputation quality. We compared the performance of our functional model to linear regression and the best single probe surrogate in real data and via simulations, and our method showed higher imputation accuracy. The simulated association study further demonstrated that our method substantially improves the statistical power to identify trait-associated methylation loci in epigenome-wide association studies (EWAS).
Finally, we applied an integrative analysis to characterize molecular systems associated with hepatocellular carcinoma (HCC). Dysregulation of inflammation-related genes plays a pivotal role in the development of HCC. We performed array-based analyses to comprehensively investigate the contributions of DNA methylation and somatic copy number aberration (SCNA) to the aberrant expression of inflammation-related genes in 30 HCCs and paired non-tumor tissues. The results were validated in public datasets and an additional sample set of 47 paired HCCs and non-tumor tissues. We found that DNA methylation and SCNA together contributed to less than 30% of the aberrant expression of inflammation-related genes, suggesting that other molecular mechanisms might play a major role in the dysregulation in HCCs.

    Posterior inference of Hi-C contact frequency through sampling

    Hi-C is one of the most widely used approaches to study three-dimensional genome conformations. Contacts captured by a Hi-C experiment are represented in a contact frequency matrix. Due to the limited sequencing depth and other factors, Hi-C contact frequency matrices are only approximations of the true interaction frequencies and are further reported without any quantification of uncertainty. Hence, downstream analyses based on Hi-C contact maps (e.g., TAD and loop annotation) are themselves point estimations. Here, we present the Hi-C interaction frequency sampler (HiCSampler) that reliably infers the posterior distribution of the interaction frequency for a given Hi-C contact map by exploiting dependencies between neighboring loci. Posterior predictive checks demonstrate that HiCSampler can infer highly predictive chromosomal interaction frequency. Summary statistics calculated by HiCSampler provide a measurement of the uncertainty for Hi-C experiments, and samples inferred by HiCSampler are ready for use by most downstream analysis tools off the shelf and permit uncertainty measurements in these analyses without modifications

    Statistical methods to improve the analysis of biological data: Benchmarking phenotypes, protein function prediction, and spatial modelling of gene expression

    Data collected in biological experiments comes in all shapes and sizes, including DNA and protein sequences, mRNA counts, spatial interactions, protein annotations, phenotypic images and so on. In order to make sense of this myriad of data, novel statistical methods are needed to not only model the biological data, but also to assess the accuracy of predictions. In this thesis, I present three research studies that perform statistical analysis in the benchmarking, assessment and modelling of genetic data, demonstrating diversity of bioinformatics research. The approach taken here is to tailor statistical methods for specific data types. To provide quality benchmark data for phenotypic image processing and assessment, a Generalized Linear Mixed effects model was used to compare the performance of different groups of people (lay people recruited through Amazon Mechanical Turk versus experts) in their efficacy to highlight key elements in phenotypic images collected from corn fields. The analyzed images were then used as ground-truth for the training and testing of automated methods. We concluded that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at a low cost and high quality, especially in the context of high throughput plant phenotyping. To assess the quality of computational protein function predictions, the third Critical Assessment of Functional Annotation (CAFA) was launched to evaluate predictions in the form of a community challenge. Each protein is associated with multiple functions represented by Gene Ontology terms (labels). These ontological terms form a hierarchical structure, and the frequency of each term is not distributed uniformly among different proteins. Precision-recall based assessment metrics were not enough to account for the non-uniform prior distribution of this multi-label problem, so semantic-distance based methods were developed for better model assessment. 
We concluded that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than expectations set by baseline methods, it leaves considerable room and need for improvement. The CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation databases, computational function prediction, and our ability to manage big data in the era of large experimental screens. To model the spatial dependency of gene expression on the 3D structure of the genome, a Poisson Hierarchical Markov Random Field model (PhiMRF) was developed for gene expression data that accounts for the pairwise spatial interaction from Hi-C experiments. The quantitative expression of genes on human chromosomes 1, 4, 5, 6, 8, 9, 12, 19, 20, 21 and X all showed meaningful positive intra-chromosomal spatial dependency. Moreover, the spatial dependency is much stronger than the dependency based on linear gene neighborhoods, suggesting that 3D chromosome structures such as chromatin loops and Topologically Associating Domains (TADs) are indeed strongly correlated with gene expression levels. The results both confirm and quantify the spatial correlation in gene expression. In addition, PhiMRF improves upon the stochastic modelling of gene expression that is currently widely used in differential expression analyses. PhiMRF is available at https://github.com/ashleyzhou972/PhiMRF as an R package.

    Analysis methods for studying the 3D architecture of the genome

    HiView: an integrative genome browser to leverage Hi-C results for the interpretation of GWAS variants

    Background: Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with complex traits and diseases. However, most of them are located in the non-protein coding regions, and therefore it is challenging to hypothesize the functions of these non-coding GWAS variants. Recent large efforts such as the ENCODE and Roadmap Epigenomics projects have predicted a large number of regulatory elements. However, the target genes of these regulatory elements remain largely unknown. Chromatin conformation capture based technologies such as Hi-C can directly measure the chromatin interactions and have generated an increasingly comprehensive catalog of the interactome between the distal regulatory elements and their potential target genes. Leveraging such information revealed by Hi-C holds the promise of elucidating the functions of genetic variants in human diseases. Results: In this work, we present HiView, the first integrative genome browser to leverage Hi-C results for the interpretation of GWAS variants. HiView is able to display Hi-C data and statistical evidence for chromatin interactions in genomic regions surrounding any given GWAS variant, enabling straightforward visualization and interpretation. Conclusions: We believe that as the first GWAS variants-centered Hi-C genome browser, HiView is a useful tool guiding post-GWAS functional genomics studies. HiView is freely accessible at: http://www.unc.edu/~yunmli/HiView