4,724 research outputs found

    A Feature Selection Algorithm to Compute Gene Centric Methylation from Probe Level Methylation Data

    DNA methylation is an important epigenetic event that affects gene expression during development and in various diseases such as cancer. Understanding the mechanism of action of DNA methylation is important for downstream analysis. In the Illumina Infinium HumanMethylation 450K array, there are tens of probes associated with each gene. Given the methylation intensities of all these probes, it is necessary to determine which of them are most representative of the gene centric methylation level. In this study, we developed a feature selection algorithm based on sequential forward selection that utilized different classification methods to compute gene centric DNA methylation using probe level DNA methylation data. We compared our algorithm to other feature selection algorithms such as support vector machines with recursive feature elimination, genetic algorithms and ReliefF. We evaluated all methods based on the predictive power of selected probes on their mRNA expression levels and found that a K-Nearest Neighbors classification using the sequential forward selection algorithm performed better than other algorithms based on all metrics. We also observed that transcriptional activities of certain genes were more sensitive to DNA methylation changes than transcriptional activities of other genes. Our algorithm was able to predict the expression of those genes with high accuracy using only DNA methylation data. Our results also showed that those DNA methylation-sensitive genes were enriched in Gene Ontology terms related to the regulation of various biological processes.
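    As a rough illustration of the idea in this abstract, the sketch below pairs sequential forward selection with a K-Nearest Neighbors classifier. It is not the authors' implementation: the binary high/low expression labels, the leave-one-out accuracy criterion, the stopping rule, the function names and the toy beta values are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, features, k=3):
    """Leave-one-out accuracy of k-NN restricted to the given probe columns."""
    sub = [[row[j] for j in features] for row in X]
    hits = 0
    for i in range(len(sub)):
        train_X = sub[:i] + sub[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += int(knn_predict(train_X, train_y, sub[i], k) == y[i])
    return hits / len(sub)


def sequential_forward_selection(X, y, max_features, k=3):
    """Greedily add the probe that most improves accuracy; stop when none does."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        scored = [(loo_accuracy(X, y, selected + [j], k), j) for j in remaining]
        # Ties are broken in favour of the lower probe index (assumption).
        score, j = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score


# Hypothetical toy data: rows are samples, columns are three probes of one gene.
# Only the middle probe tracks the (assumed) high/low expression label.
X = [
    [0.50, 0.10, 0.42], [0.48, 0.15, 0.61], [0.52, 0.12, 0.38], [0.49, 0.18, 0.57],
    [0.51, 0.85, 0.40], [0.47, 0.80, 0.63], [0.53, 0.90, 0.36], [0.50, 0.88, 0.59],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = high expression, 0 = low expression

selected, score = sequential_forward_selection(X, y, max_features=2)
print("selected probe columns:", selected, "leave-one-out accuracy:", score)
```

    The greedy loop stops as soon as adding another probe no longer improves the score, which keeps the selected probe set small; a real analysis would use proper cross-validation and the study's own evaluation metrics.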

    Modeling dependency structures in 450k DNA methylation data

    Motivation: DNA methylation has been shown to be spatially dependent across chromosomes. Previous studies have focused on the influence of genomic context on the dependency structure, while not considering differences in dependency structure between individuals. Results: We modeled spatial dependency with a flexible framework to quantify the dependency structure, focusing on inter-individual differences by exploring the association between dependency parameters and technical and biological variables. The model was applied to a subset of the Finnish Twin Cohort study (N = 1611 individuals). The estimates of the dependency parameters varied considerably across individuals, but were generally consistent across chromosomes within individuals. The variation in dependency parameters was associated with bisulfite conversion plate, zygosity, sex and age. The age differences presumably reflect accumulated environmental exposures and/or accumulated small methylation differences caused by stochastic mitotic events, establishing recognizable, individual patterns more strongly seen in older individuals.

    Detection of Epigenomic Network Community Oncomarkers

    In this paper we propose network methodology to infer prognostic cancer biomarkers based on the epigenetic pattern DNA methylation. Epigenetic processes such as DNA methylation reflect environmental risk factors, and are increasingly recognised for their fundamental role in diseases such as cancer. DNA methylation is a gene-regulatory pattern, and hence provides a means by which to assess genomic regulatory interactions. Network models are a natural way to represent and analyse groups of such interactions. The utility of network models also increases as the quantity of data and number of variables increase, making them increasingly relevant to large-scale genomic studies. We propose methodology to infer prognostic genomic networks from a DNA methylation-based measure of genomic interaction and association. We then show how to identify prognostic biomarkers from such networks, which we term 'network community oncomarkers'. We illustrate the power of our proposed methodology in the context of a large publicly available breast cancer dataset.

    Identification of active regulatory regions from DNA methylation data

    We have recently shown that transcription factor binding leads to defined reduction in DNA methylation, allowing for the identification of active regulatory regions from high-resolution methylomes. Here, we present MethylSeekR, a computational tool to accurately identify such footprints from bisulfite-sequencing data. Applying our method to a large number of published human methylomes, we demonstrate its broad applicability and generalize our previous findings from a neuronal differentiation system to many cell types and tissues. MethylSeekR is available as an R package at www.bioconductor.or

    Editorial: Computational Methods for Analysis of DNA Methylation Data

    DNA methylation is among the most studied epigenetic modifications in eukaryotes. The interest in DNA methylation stems from its role in development, as well as its well-established association with phenotypic changes. Particularly, there is strong evidence that methylation pattern alterations in mammals are linked to developmental disorders and cancer (Kulis and Esteller, 2010). Owing to its potential as a prognostic marker for preventive medicine, in recent years, the analysis of DNA methylation data has garnered interest in many different contexts of computational biology (Bock, 2012). As typically happens with omic data, processing, analyzing and interpreting large-scale DNA methylation datasets requires computational methods and software tools that address multiple challenges. In the present Research Topic, we collected papers that tackle different aspects of computational approaches for the analysis of DNA methylation data. These manuscripts address novel computational solutions for copy number variation detection, cell-type deconvolution and methylation pattern imputation, while others discuss interpretations of well-established computational techniques. Over the last 10 years, DNA methylation profiles have been successfully exploited to develop biomarkers of age, also referred to as epigenetic clocks (Bell et al., 2019). Epigenetic clocks accurately estimate both chronological and biological age from methylation levels. DNA methylation age and, most importantly, its deviation from chronological age have been shown to be associated with a variety of health issues. More recently, a second generation of epigenetic clocks has emerged. The new generation of clocks incorporates not only methylation profiles but also environmental variants, such as smoking and alcohol consumption, and they outperform the first generation in mortality prediction and prognosis of certain diseases. In our collection, the review by Chen et al. 
compares the first and second generation of epigenetic clocks that predict cancer risk and discusses pathways known to exhibit altered methylation in aging tissues and cancer. Differentially methylated regions (DMRs), that is genomic regions that show significant differences in methylation levels across distinct biological and/or medical conditions (e.g., normal vs. disease), have been reported to be implicated in a variety of disorders (Rakyan et al., 2011). As a result, identifying DMRs is one of the most critical and fundamental challenges in deciphering disease mechanisms at the molecular level. Although DNA methylation patterns remain stable during normal somatic cell growth, alterations in genomic methylation may be caused by genetic alterations, or vice versa. However, standard DMR analysis often ignores whether methylation alterations should be viewed as a cause or an effect. Rhamani et al. discuss the effect of model directionality, i.e. whether the condition of interest (phenotype) may be affected by methylation or whether it may affect methylation, in differential methylation analyses at the cell-type level. They show that correctly accounting for model directionality has a significant impact on the ability to identify cell type specific differential methylation. Different cell types exhibit DMRs at many genomic regions and such rich information can be exploited to infer underlying cell type proportions using deconvolution techniques. DNA methylation-based cell mixture deconvolution approaches can be classified into two main categories: reference-based and reference-free. While the latter are more broadly applicable, as they do not rely on the availability of methylation profiles from each of the purified cell types that compose a tissue of interest, they are also less precise. Reference-based approaches use DMRs specific to cell types (reference library) to determine the underlying cellular composition within a DNA methylation sample. 
The quality of the reference library has a big impact on the accuracy of reference-based approaches. Bell-Glenn et al. present RESET, a framework for reference library selection for deconvolution algorithms exploiting a modified version of the Dispersion Separability Criteria score, for the inference of the best DMRs composing the library, contributing to de facto standards (Koestler et al., 2016). In short, RESET does not require researchers to identify a priori the size of the reference library (number of DMRs), nor to rely on costly associated purified cells’ mDNA profiles. Within a cellular population, the methylation patterns of different cell types and at specific genomic locations are indicative of cellular heterogeneity. Alterations of such heterogeneity are predictive of development as well as prognostic markers of diseases. Computational methods that exploit heterogeneity in methylation patterns are typically constrained by partially observed patterns due to the nature of shotgun sequencing, which frequently generates limited coverage for downstream analysis. One possible solution to overcome such limitations is offered by Chang et al. presenting BSImp, a probabilistic based imputation method that uses local information to impute partially observed methylation patterns. They show that using this approach they are able to recover heterogeneity estimates at 15% more regions with moderate sequencing depths. This should therefore improve our ability to study how methylation heterogeneity is associated with disease. Finally, recent studies have shown how the associations between Copy Number Variations (CNVs) and methylation alterations offer a richer and hence more informative picture of the samples under study, in particular for tumor data characterized by large scale genomic rearrangements (Sun et al., 2018). Consequently, recent technological and methodological developments have enabled the possibility to measure CNVs from DNA methylation data. 
The main advantage of DNA methylation based CNV approaches is that they offer the possibility to integrate both genomic (copy number) and epigenomic (methylation) information. Mariani et al. propose MethylMasteR, an R software package that integrates DNA methylation-based CNV calling routines, facilitating standardization, comparison and customization of CNV analyses. This package, built into the Docker architecture to seamlessly manage dependencies, includes four of the most commonly used routines for this integrated analysis, ChAMP (Morris et al., 2014), SeSAMe (Zhou et al., 2018), Epicopy (Cho et al., 2019), plus a custom version of cnAnalysis450k (Knoll et al., 2017), overall enabling analysis of comparative results. All the topics in this issue, although limited to specific aspects of DNA methylation analysis, highlight the importance of research in this field and the associated computational challenges, and illustrate the significant impact that this type of data will likely have on preventive medicine.

    Computational solutions for addressing heterogeneity in DNA methylation data

    DNA methylation, a reversible epigenetic modification, has been implicated in various biological processes including gene regulation. Due to the multitude of datasets available, it is a premier candidate for computational tool development, especially for investigating heterogeneity within and across samples. We differentiate between three levels of heterogeneity in DNA methylation data: between-group, between-sample, and within-sample heterogeneity. Here, we separately address these three levels and present new computational approaches to quantify and systematically investigate heterogeneity. Epigenome-wide association studies relate a DNA methylation aberration to a phenotype and therefore address between-group heterogeneity. To facilitate such studies, which necessarily include data processing, exploratory data analysis, and differential analysis of DNA methylation, we extended the R-package RnBeads. We implemented novel methods for calculating the epigenetic age of individuals, novel imputation methods, and differential variability analysis. A use-case of the new features is presented using samples from Ewing sarcoma patients. As an important driver of epigenetic differences between phenotypes, we systematically investigated associations between donor genotypes and DNA methylation states in methylation quantitative trait loci (methQTL). To that end, we developed a novel computational framework, MAGAR, for determining statistically significant associations between genetic and epigenetic variations. We applied the new pipeline to samples obtained from sorted blood cells and complex bowel tissues of healthy individuals and found that tissue-specific and common methQTLs have distinct genomic locations and biological properties. 
To investigate cell-type-specific DNA methylation profiles, which are the main drivers of within-group heterogeneity, computational deconvolution methods can be used to dissect DNA methylation patterns into latent methylation components. Deconvolution methods require profiles of high technical quality and the identified components need to be biologically interpreted. We developed a computational pipeline to perform deconvolution of complex DNA methylation data, which implements crucial data processing steps and facilitates result interpretation. We applied the protocol to lung adenocarcinoma samples and found indications of tumor infiltration by immune cells and associations of the detected components with patient survival. Within-sample heterogeneity (WSH), i.e., heterogeneous DNA methylation patterns at a genomic locus within a biological sample, is often neglected in epigenomic studies. We present the first systematic benchmark of scores quantifying WSH genome-wide using simulated and experimental data. Additionally, we created two novel scores that quantify DNA methylation heterogeneity at single CpG resolution with improved robustness toward technical biases. WSH scores describe different types of WSH in simulated data, quantify differential heterogeneity, and serve as a reliable estimator of tumor purity. Due to the broad availability of DNA methylation data, the levels of heterogeneity in DNA methylation data can be comprehensively investigated. We contribute novel computational frameworks for analyzing DNA methylation data with respect to different levels of heterogeneity. We envision that this toolbox will be indispensable for understanding the functional implications of DNA methylation patterns in health and disease.

    Spectral Learning of Binomial HMMs for DNA Methylation Data

    We consider learning parameters of Binomial Hidden Markov Models, which may be used to model DNA methylation data. The standard algorithm for the problem is EM, which is computationally expensive for sequences of the scale of the mammalian genome. Recently developed spectral algorithms can learn parameters of latent variable models via tensor decomposition, and are highly efficient for large data. However, these methods have only been applied to categorical HMMs, and the main challenge is how to extend them to Binomial HMMs while still retaining computational efficiency. We address this challenge by introducing a new feature-map based approach that exploits specific properties of Binomial HMMs. We provide theoretical performance guarantees for our algorithm and evaluate it on real DNA methylation data.
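    To make the model in this abstract concrete, the sketch below evaluates a Binomial HMM with the standard scaled forward recursion. It does not implement the spectral, feature-map based learning algorithm the abstract describes. It assumes each observation is a pair (coverage, methylated reads) at one site and that each hidden state has its own methylation probability; the function names and all parameter values are hypothetical.

```python
import math


def binom_pmf(m, n, p):
    """Probability of m methylated reads out of n reads at methylation rate p."""
    return math.comb(n, m) * p ** m * (1 - p) ** (n - m)


def forward_loglik(obs, init, trans, meth_prob):
    """Scaled forward recursion for a Binomial HMM.

    obs       : list of (coverage n, methylated reads m), one pair per site
    init      : initial state distribution
    trans     : trans[g][h] = P(next state h | current state g)
    meth_prob : per-state methylation probability (Binomial success rate)
    Returns the log-likelihood and the filtered posterior at the last site.
    """
    n_states = len(init)
    loglik = 0.0
    alpha = None
    for t, (n, m) in enumerate(obs):
        emit = [binom_pmf(m, n, p) for p in meth_prob]
        if t == 0:
            alpha = [init[h] * emit[h] for h in range(n_states)]
        else:
            alpha = [
                emit[h] * sum(alpha[g] * trans[g][h] for g in range(n_states))
                for h in range(n_states)
            ]
        norm = sum(alpha)                    # P(obs_t | obs_1..t-1)
        loglik += math.log(norm)
        alpha = [a / norm for a in alpha]    # rescale to avoid underflow
    return loglik, alpha


# Hypothetical two-state example: state 0 = unmethylated, state 1 = methylated.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
meth_prob = [0.1, 0.9]
obs = [(10, 9), (8, 8), (12, 11)]            # (coverage, methylated reads)

loglik, posterior = forward_loglik(obs, init, trans, meth_prob)
print("log-likelihood:", loglik, "final state posterior:", posterior)
```

    EM repeatedly runs this kind of forward pass (together with a backward pass) over the whole sequence, which is the cost the abstract refers to for genome-scale data.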