2,377 research outputs found

    Higher order methylation features for clustering and prediction in epigenomic studies

    Get PDF
    Motivation: DNA methylation is an intensely studied epigenetic mark, yet its functional role is incompletely understood. Attempts to quantitatively associate average DNA methylation to gene expression yield poor correlations outside of the well-understood methylation-switch at CpG islands. Results: Here we use probabilistic machine learning to extract higher order features associated with the methylation profile across a defined region. These features quantitate precisely notions of shape of a methylation profile, capturing spatial correlations in DNA methylation across genomic regions. Using these higher order features across promoter-proximal regions, we are able to construct a powerful machine learning predictor of gene expression, significantly improving upon the predictive power of average DNA methylation levels. Furthermore, we can use higher order features to cluster promoter-proximal regions, showing that five major patterns of methylation occur at promoters across different cell lines, and we provide evidence that methylation beyond CpG islands may be related to regulation of gene expression. Our results support previous reports of a functional role of spatial correlations in methylation patterns, and provide a mean to quantitate such features for downstream analyses. Availability: https://github.com/andreaskapou/BPRMethComment: 12 pages, 5 figure

    Graph embedding and unsupervised learning predict genomic sub-compartments from HiC chromatin interaction data.

    Get PDF
    Chromatin interaction studies can reveal how the genome is organized into spatially confined sub-compartments in the nucleus. However, accurately identifying sub-compartments from chromatin interaction data remains a challenge in computational biology. Here, we present Sub-Compartment Identifier (SCI), an algorithm that uses graph embedding followed by unsupervised learning to predict sub-compartments using Hi-C chromatin interaction data. We find that the network topological centrality and clustering performance of SCI sub-compartment predictions are superior to those of hidden Markov model (HMM) sub-compartment predictions. Moreover, using orthogonal Chromatin Interaction Analysis by in-situ Paired-End Tag Sequencing (ChIA-PET) data, we confirmed that SCI sub-compartment prediction outperforms HMM. We show that SCI-predicted sub-compartments have distinct epigenetic marks, transcriptional activities, and transcription factor enrichment. Moreover, we present a deep neural network to predict sub-compartments using epigenome, replication timing, and sequence data. Our neural network predicts more accurate sub-compartment predictions when SCI-determined sub-compartments are used as labels for training

    Detection of Epigenomic Network Community Oncomarkers

    Get PDF
    In this paper we propose network methodology to infer prognostic cancer biomarkers based on the epigenetic pattern DNA methylation. Epigenetic processes such as DNA methylation reflect environmental risk factors, and are increasingly recognised for their fundamental role in diseases such as cancer. DNA methylation is a gene-regulatory pattern, and hence provides a means by which to assess genomic regulatory interactions. Network models are a natural way to represent and analyse groups of such interactions. The utility of network models also increases as the quantity of data and number of variables increase, making them increasingly relevant to large-scale genomic studies. We propose methodology to infer prognostic genomic networks from a DNA methylation-based measure of genomic interaction and association. We then show how to identify prognostic biomarkers from such networks, which we term `network community oncomarkers'. We illustrate the power of our proposed methodology in the context of a large publicly available breast cancer dataset

    Discovering Cooperative Relationships of Chromatin Modifications in Human T Cells Based on a Proposed Closeness Measure

    Get PDF
    BACKGROUND: Eukaryotic transcription is accompanied by combinatorial chromatin modifications that serve as functional epigenetic markers. Composition of chromatin modifications specifies histone codes that regulate the associated gene. Discovering novel chromatin regulatory relationships are of general interest. METHODOLOGY/PRINCIPAL FINDINGS: Based on the premise that the interaction of chromatin modifications is hypothesized to influence CpG methylation, we present a closeness measure to characterize the regulatory interactions of epigenomic features. The closeness measure is applied to genome-wide CpG methylation and histone modification datasets in human CD4+T cells to select a subset of potential features. To uncover epigenomic and genomic patterns, CpG loci are clustered into nine modules associated with distinct chromatin and genomic signatures based on terms of biological function. We then performed Bayesian network inference to uncover inherent regulatory relationships from the feature selected closeness measure profile and all nine module-specific profiles respectively. The global and module-specific network exhibits topological proximity and modularity. We found that the regulatory patterns of chromatin modifications differ significantly across modules and that distinct patterns are related to specific transcriptional levels and biological function. DNA methylation and genomic features are found to have little regulatory function. The regulatory relationships were partly validated by literature reviews. We also used partial correlation analysis in other cells to verify novel regulatory relationships. CONCLUSIONS/SIGNIFICANCE: The interactions among chromatin modifications and genomic elements characterized by a closeness measure help elucidate cooperative patterns of chromatin modification in transcriptional regulation and help decipher complex histone codes

    Computational solutions for addressing heterogeneity in DNA methylation data

    Get PDF
    DNA methylation, a reversible epigenetic modification, has been implicated with various bi- ological processes including gene regulation. Due to the multitude of datasets available, it is a premier candidate for computational tool development, especially for investigating hetero- geneity within and across samples. We differentiate between three levels of heterogeneity in DNA methylation data: between-group, between-sample, and within-sample heterogeneity. Here, we separately address these three levels and present new computational approaches to quantify and systematically investigate heterogeneity. Epigenome-wide association studies relate a DNA methylation aberration to a phenotype and therefore address between-group heterogeneity. To facilitate such studies, which necessar- ily include data processing, exploratory data analysis, and differential analysis of DNA methy- lation, we extended the R-package RnBeads. We implemented novel methods for calculating the epigenetic age of individuals, novel imputation methods, and differential variability analysis. A use-case of the new features is presented using samples from Ewing sarcoma patients. As an important driver of epigenetic differences between phenotypes, we systematically investigated associations between donor genotypes and DNA methylation states in methylation quantitative trait loci (methQTL). To that end, we developed a novel computational framework –MAGAR– for determining statistically significant associations between genetic and epigenetic variations. We applied the new pipeline to samples obtained from sorted blood cells and complex bowel tissues of healthy individuals and found that tissue-specific and common methQTLs have dis- tinct genomic locations and biological properties. To investigate cell-type-specific DNA methylation profiles, which are the main drivers of within-group heterogeneity, computational deconvolution methods can be used to dissect DNA methylation patterns into latent methylation components. Deconvolution methods require pro- files of high technical quality and the identified components need to be biologically interpreted. We developed a computational pipeline to perform deconvolution of complex DNA methyla- tion data, which implements crucial data processing steps and facilitates result interpretation. We applied the protocol to lung adenocarcinoma samples and found indications of tumor in- filtration by immune cells and associations of the detected components with patient survival. Within-sample heterogeneity (WSH), i.e., heterogeneous DNA methylation patterns at a ge- nomic locus within a biological sample, is often neglected in epigenomic studies. We present the first systematic benchmark of scores quantifying WSH genome-wide using simulated and experimental data. Additionally, we created two novel scores that quantify DNA methyla- tion heterogeneity at single CpG resolution with improved robustness toward technical biases. WSH scores describe different types of WSH in simulated data, quantify differential hetero- geneity, and serve as a reliable estimator of tumor purity. Due to the broad availability of DNA methylation data, the levels of heterogeneity in DNA methylation data can be comprehensively investigated. We contribute novel computational frameworks for analyzing DNA methylation data with respect to different levels of hetero- geneity. We envision that this toolbox will be indispensible for understanding the functional implications of DNA methylation patterns in health and disease.DNA Methylierung ist eine reversible, epigenetische Modifikation, die mit verschiedenen biologischen Prozessen wie beispielsweise der Genregulation in Verbindung steht. Eine Vielzahl von DNA Methylierungsdatensätzen bildet die perfekte Grundlage zur Entwicklung von Softwareanwendungen, insbesondere um Heterogenität innerhalb und zwischen Proben zu beschreiben. Wir unterscheiden drei Ebenen von Heterogenität in DNA Methylierungsdaten: zwischen Gruppen, zwischen Proben und innerhalb einer Probe. Hier betrachten wir die drei Ebenen von Heterogenität in DNA Methylierungsdaten unabhängig voneinander und präsentieren neue Ansätze um die Heterogenität zu beschreiben und zu quantifizieren. Epigenomweite Assoziationsstudien verknüpfen eine DNA Methylierungsveränderung mit einem Phänotypen und beschreiben Heterogenität zwischen Gruppen. Um solche Studien, welche Datenprozessierung, sowie exploratorische und differentielle Datenanalyse beinhalten, zu vereinfachen haben wir die R-basierte Softwareanwendung RnBeads erweitert. Die Erweiterungen beinhalten neue Methoden, um das epigenetische Alter vorherzusagen, neue Schätzungsmethoden für fehlende Datenpunkte und eine differentielle Variabilitätsanalyse. Die Analyse von Ewing-Sarkom Patientendaten wurde als Anwendungsbeispiel für die neu entwickelten Methoden gewählt. Wir untersuchten Assoziationen zwischen Genotypen und DNA Methylierung von einzelnen CpGs, um sogenannte methylation quantitative trait loci (methQTL) zu definieren. Diese stellen einen wichtiger Faktor dar, der epigenetische Unterschiede zwischen Gruppen induziert. Hierzu entwickelten wir ein neues Softwarepaket (MAGAR), um statistisch signifikante Assoziationen zwischen genetischer und epigenetischer Variation zu identifizieren. Wir wendeten diese Pipeline auf Blutzelltypen und komplexe Biopsien von gesunden Individuen an und konnten gemeinsame und gewebespezifische methQTLs in verschiedenen Bereichen des Genoms lokalisieren, die mit unterschiedlichen biologischen Eigenschaften verknüpft sind. Die Hauptursache für Heterogenität innerhalb einer Gruppe sind zelltypspezifische DNA Methylierungsmuster. Um diese genauer zu untersuchen kann Dekonvolutionssoftware die DNA Methylierungsmatrix in unabhängige Variationskomponenten zerlegen. Dekonvolutionsmethoden auf Basis von DNA Methylierung benötigen technisch hochwertige Profile und die identifizierten Komponenten müssen biologisch interpretiert werden. In dieser Arbeit entwickelten wir eine computerbasierte Pipeline zur Durchführung von Dekonvolutionsexperimenten, welche die Datenprozessierung und Interpretation der Resultate beinhaltet. Wir wendeten das entwickelte Protokoll auf Lungenadenokarzinome an und fanden Anzeichen für eine Tumorinfiltration durch Immunzellen, sowie Verbindungen zum Überleben der Patienten. Heterogenität innerhalb einer Probe (within-sample heterogeneity, WSH), d.h. heterogene Methylierungsmuster innerhalb einer Probe an einer genomischen Position, wird in epigenomischen Studien meist vernachlässigt. Wir präsentieren den ersten Vergleich verschiedener, genomweiter WSH Maße auf simulierten und experimentellen Daten. Zusätzlich entwickelten wir zwei neue Maße um WSH für einzelne CpGs zu berechnen, welche eine verbesserte Robustheit gegenüber technischen Faktoren aufweisen. WSH Maße beschreiben verschiedene Arten von WSH, quantifizieren differentielle Heterogenität und sagen Tumorreinheit vorher. Aufgrund der breiten Verfügbarkeit von DNA Methylierungsdaten können die Ebenen der Heterogenität ganzheitlich beschrieben werden. In dieser Arbeit präsentieren wir neue Softwarelösungen zur Analyse von DNA Methylierungsdaten in Bezug auf die verschiedenen Ebenen der Heterogenität. Wir sind davon überzeugt, dass die vorgestellten Softwarewerkzeuge unverzichtbar für das Verständnis von DNA Methylierung im kranken und gesunden Stadium sein werden

    A Review of Integrative Imputation for Multi-Omics Datasets

    Get PDF
    Multi-omics studies, which explore the interactions between multiple types of biological factors, have significant advantages over single-omics analysis for their ability to provide a more holistic view of biological processes, uncover the causal and functional mechanisms for complex diseases, and facilitate new discoveries in precision medicine. However, omics datasets often contain missing values, and in multi-omics study designs it is common for individuals to be represented for some omics layers but not all. Since most statistical analyses cannot be applied directly to the incomplete datasets, imputation is typically performed to infer the missing values. Integrative imputation techniques which make use of the correlations and shared information among multi-omics datasets are expected to outperform approaches that rely on single-omics information alone, resulting in more accurate results for the subsequent downstream analyses. In this review, we provide an overview of the currently available imputation methods for handling missing values in bioinformatics data with an emphasis on multi-omics imputation. In addition, we also provide a perspective on how deep learning methods might be developed for the integrative imputation of multi-omics datasets
    • …
    corecore