591 research outputs found

    Multi-omics analysis of early molecular mechanisms of type 1 diabetes

    Type 1 diabetes (T1D) is a complicated autoimmune disease with largely unknown disease mechanisms. The diagnosis is preceded by a long asymptomatic period of autoimmune activity in the insulin-producing pancreatic islets. Currently the only clinical markers used for T1D prediction are islet autoantibodies, which are a sign of already-broken immune tolerance. The focus of this dissertation is on the early asymptomatic period preceding seroconversion to islet autoantibody positivity. The genetic risk of type 1 diabetes has been thoroughly mapped in genome-wide association studies, but environmental factors and molecular mechanisms that mediate the risk are less well understood. According to the hygiene hypothesis, the risk of immune-mediated disorders is increased by the lack of exposure to pathogens in modern environments. Within a study on the hygiene hypothesis, we compared umbilical cord blood gene expression patterns between children born in environments with contrasting standards of living and type 1 diabetes incidences (Finland, Russia, and Estonia). The differentially expressed genes were associated with innate immunity and immune maturation. Our results suggest that the environment influences immune system development already in utero. Furthermore, we analyzed genome-wide DNA methylation and gene expression profiles in samples collected prospectively from Finnish children and newborn infants at risk of type 1 diabetes. Bisulfite sequencing analysis did not show any association of neonatal DNA methylation with later progression to T1D. However, an antiviral type I interferon response in early childhood was found to be a risk factor for T1D. This transcriptomic signature was detectable in the peripheral blood already before islet autoantibodies, and the main observations were confirmed in an independent German study. These results contributed to the hypothesis that virus infections might play a role in T1D.
Additionally, this dissertation contributed to transcriptomic and epigenomic data analysis workflows. Simple probe-level analysis of exon array data was shown to improve the reproducibility, specificity, and sensitivity of detected differential exon inclusion events. Type 1 error rate was markedly reduced by permutation-based significance assessment of differential methylation in bisulfite sequencing studies.
Multi-omics analysis of the early molecular mechanisms of type 1 diabetes (abstract translated from Finnish). Type 1 diabetes (T1D) is an autoimmune disease whose underlying mechanisms are poorly understood. Diagnosis is preceded by a long asymptomatic period during which an autoimmune reaction against the insulin-producing beta cells progresses in the pancreatic islets. This dissertation focuses on the early asymptomatic period of T1D that precedes seroconversion to autoantibody positivity. The genetic risk factors of type 1 diabetes have been thoroughly mapped in genome-wide association studies, but less is known about environmental risk factors and the molecular mechanisms that mediate the risk. According to the hygiene hypothesis, low exposure to pathogens increases the risk of immune system disorders. In a substudy related to the hygiene hypothesis, we compared cord blood gene expression profiles of children born in environments that differ in hygiene and in T1D incidence (Finland, Russia, and Estonia). The differentially expressed genes were related to innate immunity and immune system maturation. Based on these results, the environment may affect immune system development already during pregnancy. Genome-wide DNA methylation and gene expression were analyzed in samples collected in a large Finnish follow-up study from children and newborns in the T1D risk group. Based on bisulfite sequencing analysis, there was no association between neonatal DNA methylation and T1D developing during childhood.
Instead, an antiviral type I interferon response detectable at the RNA level in early childhood was found to be a risk factor for T1D. This observation was made in peripheral blood already before the appearance of islet autoantibodies, and the main findings were confirmed in a German study. These results strengthened the hypothesis that viruses may influence the onset of T1D. Alongside the T1D research, this dissertation developed analysis methods suited to transcriptomics and epigenomics. Probe-level analysis of exon microarrays was found to improve reproducibility, sensitivity, and specificity in mapping alternative splicing. Permutation-based analysis of statistical significance reduced type 1 error in the analysis of bisulfite sequencing data.
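The permutation-based significance assessment mentioned in this abstract can be illustrated with a minimal sketch. It is not the dissertation's workflow: the function name `permutation_p_value`, the choice of test statistic (absolute difference in mean methylation at a single CpG), and the add-one correction are all assumptions made for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in mean methylation.

    group_a, group_b: per-sample methylation levels (0..1) at one CpG.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is equivalent to shuffling group labels.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        stat = abs(sum(perm_a) / n_a - sum(perm_b) / len(perm_b))
        if stat >= observed:
            hits += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    cases = [0.90, 0.92, 0.88, 0.91, 0.90, 0.93]      # made-up values
    controls = [0.10, 0.12, 0.09, 0.11, 0.10, 0.08]   # made-up values
    print(permutation_p_value(cases, controls))
```

Because the null distribution is built from the data themselves, the test makes no parametric assumption about how methylation levels are distributed, which is one common motivation for using permutations to control the type 1 error rate.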

    Computational solutions for addressing heterogeneity in DNA methylation data

    DNA methylation, a reversible epigenetic modification, has been implicated in various biological processes including gene regulation. Due to the multitude of datasets available, it is a premier candidate for computational tool development, especially for investigating heterogeneity within and across samples. We differentiate between three levels of heterogeneity in DNA methylation data: between-group, between-sample, and within-sample heterogeneity. Here, we separately address these three levels and present new computational approaches to quantify and systematically investigate heterogeneity. Epigenome-wide association studies relate a DNA methylation aberration to a phenotype and therefore address between-group heterogeneity. To facilitate such studies, which necessarily include data processing, exploratory data analysis, and differential analysis of DNA methylation, we extended the R-package RnBeads. We implemented novel methods for calculating the epigenetic age of individuals, novel imputation methods, and differential variability analysis. A use-case of the new features is presented using samples from Ewing sarcoma patients. As an important driver of epigenetic differences between phenotypes, we systematically investigated associations between donor genotypes and DNA methylation states in methylation quantitative trait loci (methQTL). To that end, we developed a novel computational framework, MAGAR, for determining statistically significant associations between genetic and epigenetic variations. We applied the new pipeline to samples obtained from sorted blood cells and complex bowel tissues of healthy individuals and found that tissue-specific and common methQTLs have distinct genomic locations and biological properties.
To investigate cell-type-specific DNA methylation profiles, which are the main drivers of within-group heterogeneity, computational deconvolution methods can be used to dissect DNA methylation patterns into latent methylation components. Deconvolution methods require profiles of high technical quality and the identified components need to be biologically interpreted. We developed a computational pipeline to perform deconvolution of complex DNA methylation data, which implements crucial data processing steps and facilitates result interpretation. We applied the protocol to lung adenocarcinoma samples and found indications of tumor infiltration by immune cells and associations of the detected components with patient survival. Within-sample heterogeneity (WSH), i.e., heterogeneous DNA methylation patterns at a genomic locus within a biological sample, is often neglected in epigenomic studies. We present the first systematic benchmark of scores quantifying WSH genome-wide using simulated and experimental data. Additionally, we created two novel scores that quantify DNA methylation heterogeneity at single CpG resolution with improved robustness toward technical biases. WSH scores describe different types of WSH in simulated data, quantify differential heterogeneity, and serve as a reliable estimator of tumor purity. Due to the broad availability of DNA methylation data, the levels of heterogeneity in DNA methylation data can be comprehensively investigated. We contribute novel computational frameworks for analyzing DNA methylation data with respect to different levels of heterogeneity. We envision that this toolbox will be indispensable for understanding the functional implications of DNA methylation patterns in health and disease.
(Abstract translated from German.) DNA methylation is a reversible epigenetic modification that is associated with various biological processes such as gene regulation. The large number of DNA methylation datasets provides an ideal basis for developing software tools, in particular for describing heterogeneity within and between samples. We distinguish three levels of heterogeneity in DNA methylation data: between groups, between samples, and within a sample. Here we consider the three levels independently of one another and present new approaches to describe and quantify heterogeneity. Epigenome-wide association studies link a DNA methylation change to a phenotype and describe heterogeneity between groups. To simplify such studies, which involve data processing as well as exploratory and differential data analysis, we extended the R-based software RnBeads. The extensions include new methods for predicting epigenetic age, new estimation methods for missing data points, and a differential variability analysis. The analysis of data from Ewing sarcoma patients was chosen as a use case for the newly developed methods. We investigated associations between genotypes and DNA methylation of individual CpGs in order to define so-called methylation quantitative trait loci (methQTL). These are an important factor that induces epigenetic differences between groups. To this end, we developed a new software package (MAGAR) to identify statistically significant associations between genetic and epigenetic variation. We applied this pipeline to blood cell types and complex biopsies from healthy individuals and were able to locate common and tissue-specific methQTLs in different regions of the genome that are linked to distinct biological properties. The main cause of heterogeneity within a group is cell-type-specific DNA methylation patterns.
To examine these more closely, deconvolution software can decompose the DNA methylation matrix into independent components of variation. Deconvolution methods based on DNA methylation require profiles of high technical quality, and the identified components must be interpreted biologically. In this work we developed a computational pipeline for performing deconvolution experiments that covers data processing and interpretation of the results. We applied the protocol to lung adenocarcinomas and found indications of tumor infiltration by immune cells, as well as associations with patient survival. Heterogeneity within a sample (within-sample heterogeneity, WSH), i.e., heterogeneous methylation patterns within a sample at a genomic position, is usually neglected in epigenomic studies. We present the first comparison of different genome-wide WSH scores on simulated and experimental data. In addition, we developed two new scores for computing WSH at individual CpGs that show improved robustness to technical factors. WSH scores describe different types of WSH, quantify differential heterogeneity, and predict tumor purity. Owing to the broad availability of DNA methylation data, the levels of heterogeneity can be described comprehensively. In this work we present new software solutions for analyzing DNA methylation data with respect to the different levels of heterogeneity. We are convinced that the presented software tools will be indispensable for understanding DNA methylation in health and disease.
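To make the idea of within-sample heterogeneity concrete, the sketch below computes a normalized Shannon entropy over read-level methylation patterns in a window of CpGs. Entropy over read patterns is one general kind of WSH score; the sketch does not claim to reproduce the scores benchmarked or the two novel scores developed in the thesis. The function name `methylation_entropy` and the string encoding of reads ("1" methylated, "0" unmethylated) are assumptions.

```python
import math
from collections import Counter


def methylation_entropy(read_patterns):
    """Normalized Shannon entropy of read-level methylation patterns.

    read_patterns: list of equal-length strings, one per sequencing read,
    where '1' marks a methylated CpG and '0' an unmethylated CpG.
    Returns a value in [0, 1]: 0 when all reads share one pattern,
    1 when all 2**n patterns over n CpGs are equally frequent.
    """
    total = len(read_patterns)
    n_cpgs = len(read_patterns[0])
    counts = Counter(read_patterns)
    entropy_bits = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    # Dividing by the number of CpGs rescales the maximum (n bits) to 1.
    return entropy_bits / n_cpgs


if __name__ == "__main__":
    homogeneous = ["1111"] * 8
    mixed = ["1111"] * 4 + ["0000"] * 4
    print(methylation_entropy(homogeneous), methylation_entropy(mixed))
```

Note that the `mixed` example has an average methylation level of 50% but only two distinct patterns, so a pattern-based score separates it from a locus where every read carries a different pattern. That distinction is what average methylation alone cannot capture.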

    Metilação diferencial de DNA no envelhecimento: exploração in silico utilizando dados de elevado rendimento

    The emergence of high-throughput methodologies after the conclusion of the Human Genome Project has brought genome-wide and epigenome-wide studies to the forefront of current biological and biomedical research. The focus on genetic mutations as the primary cause of certain disorders is no longer as dominant as before, since it has been demonstrated that epigenetic mechanisms are involved in cellular programming and gene regulation, providing adaptive variants of a given gene to a changing environment, with an association to cellular differentiation. Research in the DNA methylation field has already revealed essential facts, such as the existence of methylation in CpG islands and in alternative contexts that influence gene expression in a tissue-specific manner. The influence of lifestyle choices on aging processes has also been related to methylome variations. In the case of cancer, the cooperation of epigenetic and genetic information is essential to understand the progress of cancer development as well as the silencing of key regulatory genes. An overall hypomethylation in the cancer genome leads to oncogene activation, whereas hypermethylation in specific regions is associated with silencing of tumour suppressor genes. For that reason, the search for new therapeutic approaches to cancer and aging is a current concern of the scientific community working in the epigenomic field. In order to contribute to the study of mammalian epigenomes across the lifespan, this research used datasets from public databases to investigate DNA methylation across aged individuals and to extract tissue-specific markers related to healthy aging. The results were validated using samples, available at iBiMED, from healthy individuals with good or bad cognitive performance.
In both situations the genes ELOVL2 (cg16867657) and FHL2 (cg06639320) were identified as good markers of age.
(Abstract translated from Portuguese.) The emergence of high-throughput sequencing methodologies after the conclusion of the Human Genome Project was a fundamental advance for biological and biomedical research in genomics. Although genetic mutations were for decades the main focus as the cause of certain disorders, it has now been shown that epigenetic mechanisms are involved in cellular programming and gene regulation, providing adaptive variants of the same gene for a given environment and also being directly associated with cellular differentiation. Scientific progress in the field of DNA methylation has revealed essential facts in molecular biology, such as the existence of methylation in CpG islands and in alternative contexts that influence gene expression in different human tissues. In addition, the influence of lifestyle on the aging process has been shown to be related to the state of the epigenome, namely to variations in the human methylome. In the case of cancer, the cooperation of genetic and epigenetic factors is essential for understanding how this disease develops in the human body, notably through the silencing of essential regulatory genes. Global hypomethylation of the cancer genome generally leads to oncogene activation, whereas localized hypermethylation is associated with the silencing of tumour suppressor genes. For these reasons, the development of new therapies for cancer or aging has become a topic of interest for the scientific community in epigenomics.
To develop these topics and improve the detection of global variation in the human epigenome, this research used data on healthy individuals from public databases to extract markers of differential methylation in various tissues over the course of healthy aging. The project was validated using samples, available at iBiMED, from healthy individuals and from individuals with good or bad cognitive performance. In both situations the genes ELOVL2 (cg16867657) and FHL2 (cg06639320) were identified as good markers of the individuals' age. Master's degree in Biotechnology

    Transforming cancer molecular diagnostics: Molecular subgrouping of medulloblastoma via low-depth whole genome bisulfite sequencing

    INTRODUCTION: International consensus recognises four molecular subgroups of medulloblastoma, each with distinct molecular features and clinical outcomes. Assigning molecular subgroup is typically achieved via the Illumina DNA methylation microarray. Given the rapidly-expanding WGS capacity in healthcare institutions, there is an unmet need to develop platform-independent, sequence-based subgrouping assays. Whole genome bisulfite sequencing (WGBS) enables the assessment of genome-wide methylation status at single-base resolution. To date, its routine application for subgroup assignment has been limited due to high economic cost and sample input requirements, and currently no optimised pipeline exists that is tailored for handling samples sequenced at low-pass (i.e., 1-10x depth). METHODOLOGY: Two datasets were utilised: 36 newly-sequenced low-depth (10x) and 42 publicly available high-depth (30x) WGBS medulloblastoma and cerebellar samples, all with matched DNA methylation microarray data. We applied imputation to low-pass WGBS data, assessed inter-platform correlation and identified molecular subgroups by directly integrating WGBS sample data with preexisting array-trained models. We developed machine learning WGBS-based classifiers and compared performance against microarray. We optimised reference-free aneuploidy detection with low-pass WGBS and assessed concordance with microarray-derived aneuploidy calls. RESULTS: We optimised a pipeline for processing and analysis of low-pass WGBS data, suitable for routine molecular subgrouping and aneuploidy assessment. Using down-sampling, we showed that subgroup assignment remains robust at low depths and identified additional regions of differential methylation that are not assessed by methylation microarray. WGBS data can be integrated into existing array-trained models with high assignment probabilities, and WGBS-derived classifier performance measures exceeded microarray-derived classifiers.
CONCLUSION: We describe a platform-independent WGBS assay for molecular subgrouping of medulloblastoma. It performs equivalently to array-based methods at increasingly comparable cost (currently ~$396 vs ~$584) and provides proof-of-concept for routine clinical adoption using standard WGS technology. Finally, the full methylome enabled elucidation of additional biological heterogeneity that has hitherto been inaccessible.
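The down-sampling step described in this abstract can be sketched as random thinning of per-CpG read counts. This is an illustration under stated assumptions, not the authors' pipeline: the function name `downsample_counts`, the per-read independent sampling scheme, and the example counts are all hypothetical.

```python
import random


def downsample_counts(methylated, total, keep_fraction, rng):
    """Thin the reads covering one CpG, keeping each with a fixed probability.

    methylated: number of reads supporting methylation at the CpG.
    total: total number of reads covering the CpG.
    keep_fraction: probability of retaining each read (e.g. 1/3 to go from
        roughly 30x to roughly 10x coverage in expectation).
    Returns (kept_methylated, kept_total).
    """
    unmethylated = total - methylated
    kept_meth = sum(rng.random() < keep_fraction for _ in range(methylated))
    kept_unmeth = sum(rng.random() < keep_fraction for _ in range(unmethylated))
    return kept_meth, kept_meth + kept_unmeth


if __name__ == "__main__":
    rng = random.Random(42)
    # Made-up example: a CpG covered by 30 reads, 24 of them methylated.
    kept_meth, kept_total = downsample_counts(24, 30, 1 / 3, rng)
    level = kept_meth / kept_total if kept_total else None  # None = CpG lost
    print(kept_meth, kept_total, level)
```

At low depth some CpGs end up with zero retained reads, which is the kind of missingness that imputation is then used to address.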

    DNA Methylation Methods for Donor Age Prediction Using Touch DNA

    The International Labor Organization (ILO) estimated that over 30 million individuals fall victim to human trafficking each year, of which 50% are children below the age of 16. In 2012, the ILO reported there to be 168 million child laborers worldwide, with many trafficked into hazardous conditions to manufacture consumer products that are sold in developed countries. This is a modern form of slavery with poor working conditions, no access to education, and low wages. The hidden nature of this crime, however, makes it extremely difficult to identify and locate victims of forced child labor, and thus challenging to eradicate. Children exploited in textile factories typically handle fabrics with bare hands, causing them to shed epithelial cells that contain DNA onto items they are manufacturing. It has been established that touch DNA can be isolated from a variety of substrates, and such DNA has the potential to be used to estimate the chronological age of an individual who handled the fabric. DNA methylation is an epigenetic modification that adds a methyl group to the nitrogenous base cytosine and can be involved in the regulation of gene expression. Previous research has determined that children have differentially methylated sites in their DNA that can be used as markers to estimate chronological age. To establish that current procedures could identify DNA from child laborers, touch samples were collected on sterile gauze swatches from sixty-seven volunteers aged 0-65 years, following IRB approval. Total DNA was isolated from the gauze using the DNA Investigator Kit and bisulfite converted using the Qiagen EpiTect BC Kit. Samples were quantified using the Qubit® 3 Fluorometer. In addition, some samples were quantified using the Human Quantifiler Kit. Custom primers and TaqMan Probes were designed for several age-associated methylation sites.
Two different methylation qPCR kits were attempted for this assay: the EpiTect MethyLight + ROX Kit and the Methylamp MS-qPCR Fast Kit. Both qPCR assays were unsuccessful at quantifying DNA methylation from touch samples due to the low quantity of original DNA (average 0.092 ng/μl). This study makes it clear that touch DNA is extremely difficult to collect in quantities large enough for downstream analysis. There is an apparent need for improved touch DNA collection methods. In addition, increased sensitivity of methylation quantification could contribute to optimizing this methodology for future use in chronological age estimation and, subsequently, in identifying manufacturers that are exploiting child laborers.

    Methods and Mechanisms of DNA Methylation in Development and Disease

    DNA methylation is a mechanism for long-term transcriptional regulation and is required for normal cellular differentiation. Failure to properly establish or maintain DNA methylation patterns leads to cell dysfunction and diseases such as cancer and neurological disorders. The goal of this thesis is to understand the role of DNA methylation in oncological cellular transformation and in normal development. To achieve this goal, I have developed a novel method for mapping genome-wide DNA methylation patterns and have applied the method to gonadectomy-induced adrenocortical neoplasms and to maturing motor neurons. The novel method, called Laser Capture Microdissected-Reduced Representation Bisulfite Sequencing (LCM-RRBS), accurately and reproducibly profiles genome-wide methylation of DNA extracted from microdissected fresh frozen or formalin-fixed paraffin-embedded tissue samples. Using this method, I find that significant DNA methylation changes, associated with attendant expression changes, occur in transformed adrenocortical cells. My work has also uncovered significant DNA methylation configuration in maturing motor neurons associated with dramatic expression changes. I show that demethylated regions are enriched for known neuron-specific transcription factor binding sites and that genetic disruption of the active demethylation machinery significantly inhibits motor neuron differentiation and maturation. Together, these experiments demonstrate that DNA methylation plays a role in the transformation of normal cells to cancer cells and that DNA methylation is critical to proper motor neuron formation. I conclude that aberrant DNA methylation controls gene expression in gonadectomy-induced adrenocortical neoplasms and that neuron-specific transcription factors could recruit demethylating enzymes to regions that lose DNA methylation in motor neurons upon maturation

    Imputation Aided Methylation Analysis

    Genome-wide DNA methylation analysis is of broad interest to medical research because of its central role in human development and disease. However, generating high-quality methylomes on a large scale is particularly expensive due to technical issues inherent to DNA treatment with bisulfite, requiring deeper than usual sequencing. In silico methodologies, such as imputation, can be used to address this limitation and improve the coverage and quality of data produced in these experiments. Imputation is a statistical technique where missing values are substituted with computed values. The process involves leveraging information from reference data to calculate probable values for missing data points. In this thesis, imputation is explored for its potential to increase the value of methylation datasets sequenced at different depths:
    1. First, a new R package, Methylation Analysis ToolkiT (MATT), was developed to deal with large numbers of WGBS datasets in a computationally- and memory-efficient manner.
    2. Second, the performance of DNA methylation-specific and generic imputation tools was assessed by down-sampling high-quality (100x) WGBS datasets to determine the extent to which missing data can be recovered and the accuracy of imputed values.
    3. Third, to overcome shortfalls within existing tools, a novel imputation tool was developed, termed Global IMputation of cpg MEthylation (GIMMEcpg). The default implementation of GIMMEcpg is based on Model Stacking and outperforms existing tools in accuracy and speed.
    4. Lastly, to demonstrate its potential, GIMMEcpg was used to impute ten shallow (17x) WGBS datasets from healthy volunteers of the Personal Genome Project UK with high accuracy.
Moreover, the extent of missing and low-quality data, as well as the reproducibility and accuracy of methylation datasets, were explored for different data types (Microarrays, Reduced Representation Bisulfite Sequencing (RRBS), Whole Genome Bisulfite Sequencing (WGBS), EM-Seq and Nanopore sequencing)
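As a minimal illustration of what imputing a missing CpG methylation value can look like, the sketch below fills gaps by distance-weighted interpolation between the nearest covered neighbouring CpGs. This is a deliberately simple baseline chosen for illustration; it is not GIMMEcpg's Model Stacking approach, and the function name `impute_from_neighbours` and the example coordinates are assumptions.

```python
def impute_from_neighbours(positions, values):
    """Fill missing methylation values from the nearest covered CpGs.

    positions: sorted genomic coordinates of CpGs on one chromosome.
    values: methylation levels (0..1) aligned with positions; None = missing.
    Missing values flanked by two covered CpGs get a distance-weighted
    average; missing values at the edges copy the single nearest value.
    """
    imputed = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None and right is None:
            continue  # nothing to borrow from; leave as missing
        if left is None:
            imputed[i] = values[right]
        elif right is None:
            imputed[i] = values[left]
        else:
            d_left = positions[i] - positions[left]
            d_right = positions[right] - positions[i]
            # The closer neighbour receives the larger weight.
            w_left = d_right / (d_left + d_right)
            imputed[i] = w_left * values[left] + (1 - w_left) * values[right]
    return imputed


if __name__ == "__main__":
    # Made-up coordinates and methylation levels.
    print(impute_from_neighbours([100, 150, 200, 900], [0.2, None, 0.8, None]))
```

A baseline like this relies only on local correlation between adjacent CpGs. Comparing imputed values against held-out values from down-sampled high-depth data, as the abstract describes, is one way to judge whether a more elaborate model actually improves on it.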

    Inter-individual variation of the human epigenome & applications

    Genome-wide association studies (GWAS) have led to the discovery of genetic variants influencing human phenotypes in health and disease. However, almost two decades later, most human traits still cannot be accurately predicted from common genetic variants. Moreover, genetic variants discovered via GWAS mostly map to the non-coding genome and have historically resisted interpretation via mechanistic models. Alternatively, the epigenome lies at the crossroads between genetics and the environment. Thus, there is great excitement towards the mapping of epigenetic inter-individual variation since its study may link environmental factors to human traits that remain unexplained by genetic variants. For instance, the environmental component of the epigenome may serve as a source of biomarkers for accurate, robust and interpretable phenotypic prediction on low-heritability traits that cannot be attained by classical genetic-based models. Additionally, its research may provide mechanisms of action for genetic associations at non-coding regions that mediate their effect via the epigenome. The aim of this thesis was to explore epigenetic inter-individual variation and to mitigate some of the methodological limitations faced towards its future valorisation. Chapter 1 is dedicated to the scope and aims of the thesis. It begins by describing historical milestones and basic concepts in human genetics, statistical genetics, the heritability problem and polygenic risk scores. It then moves towards epigenetics, covering the several dimensions it encompasses. It subsequently focuses on DNA methylation with topics like mitotic stability, epigenetic reprogramming, X-inactivation or imprinting. This is followed by concepts from epigenetic epidemiology such as epigenome-wide association studies (EWAS), epigenetic clocks, Mendelian randomization, methylation risk scores and methylation quantitative trait loci (mQTL).
The chapter ends by introducing the aims of the thesis. Chapter 2 focuses on stochastic epigenetic inter-individual variation resulting from processes occurring post-twinning, during embryonic development and early life. Specifically, it describes the discovery and characterisation of hundreds of variably methylated CpGs in the blood of healthy adolescent monozygotic (MZ) twins showing equivalent variation among co-twins and unrelated individuals (evCpGs) that could not be explained only by measurement error on the DNA methylation microarray. DNA methylation levels at evCpGs were shown to be stable short-term but susceptible to aging and epigenetic drift in the long-term. The identified sites were significantly enriched at the clustered protocadherin loci, known for stochastic methylation in neurons in the context of embryonic neurodevelopment. Critically, evCpGs were capable of clustering technical and longitudinal replicates while differentiating young MZ twins. Thus, discovered evCpGs can be considered as a first prototype towards a universal epigenetic fingerprint, relevant in the discrimination of MZ twins for forensic purposes, currently impossible with standard DNA profiling. Besides, DNA methylation microarrays are the preferred technology for EWAS and mQTL mapping studies. However, their probe design inherently assumes that the assayed genomic DNA is identical to the reference genome, leading to genetic artifacts whenever this assumption is not fulfilled. Building upon the previous experience analysing microarray data, Chapter 3 covers the development and benchmarking of UMtools, an R-package for the quantification and qualification of genetic artifacts on DNA methylation microarrays based on the unprocessed fluorescence intensity signals. These tools were used to assemble an atlas on genetic artifacts encountered on DNA methylation microarrays, including interactions between artifacts or with X-inactivation, imprinting and tissue-specific regulation.
Additionally, to distinguish artifacts from genuine epigenetic variation, a co-methylation-based approach was proposed. Overall, this study revealed that genetic artifacts continue to filter through into the reported literature since current methodologies to address them have overlooked this challenge. Furthermore, EWAS, mQTL and allele-specific methylation (ASM) mapping studies have all been employed to map epigenetic variation but require matching phenotypic/genotypic data and can only map specific components of epigenetic inter-individual variation. Inspired by the previously proposed co-methylation strategy, Chapter 4 describes a novel method to simultaneously map inter-haplotype, inter-cell and inter-individual variation without these requirements. Specifically, binomial likelihood function-based bootstrap hypothesis test for co-methylation within reads (Binokulars) is a randomization test that can identify jointly regulated CpGs (JRCs) from pooled whole genome bisulfite sequencing (WGBS) data by solely relying on joint DNA methylation information available in reads spanning multiple CpGs. Binokulars was tested on pooled WGBS data in whole blood, sperm and combined, and benchmarked against EWAS and ASM. Our comparisons revealed that Binokulars can integrate a wide range of epigenetic phenomena under the same umbrella since it simultaneously discovered regions associated with imprinting, cell type- and tissue-specific regulation, mQTL, ageing or even unknown epigenetic processes. Finally, we verified examples of mQTL and polymorphic imprinting by employing another novel tool, JRC_sorter, to classify regions based on epigenotype models and non-pooled WGBS data in cord blood.
In the future, we envision how this cost-effective approach can be applied on larger pools to simultaneously highlight regions of interest in the methylome, a highly relevant task in the light of the post-GWAS era. Moving towards future applications of epigenetic inter-individual variation, Chapters 5 and 6 are dedicated to solving some of the methodological issues faced in translational epigenomics. Firstly, due to its simplicity and well-known properties, linear regression is the starting point methodology when performing prediction of a continuous outcome given a set of predictors. However, linear regression is incompatible with missing data, a common phenomenon and a huge threat to the integrity of data analysis in empirical sciences, including (epi)genomics. Chapter 5 describes the development of combinatorial linear models (cmb-lm), an imputation-free, CPU/RAM-efficient and privacy-preserving statistical method for linear regression prediction on datasets with missing values. Cmb-lm provide prediction errors that take into account the pattern of missing values in the incomplete data, even at extreme missingness. As a proof-of-concept, we tested cmb-lm in the context of epigenetic ageing clocks, one of the most popular applications of epigenetic inter-individual variation. Overall, cmb-lm offer a simple and flexible methodology with a wide range of applications that can provide a smooth transition towards the valorisation of linear models in the real world, where missing data is almost inevitable. Beyond microarrays, due to its high accuracy, reliability and sample multiplexing capabilities, massively parallel sequencing (MPS) is currently the preferred methodology to translate prediction models for traits of interest into practice. At the same time, tobacco smoking is a frequent habit sustained by more than 1.3 billion people in 2020 and a leading (and preventable) health risk factor in the modern world.
Predicting smoking habits from a persistent biomarker, such as DNA methylation, is not only relevant to account for self-reporting bias in public health and personalized medicine studies, but may also allow broadening forensic DNA phenotyping. Previously, a model to predict whether someone is a current, former, or never smoker had been published, based solely on 13 CpGs out of the hundreds of thousands included in the DNA methylation microarray. However, a matching lab tool with lower marker throughput and higher accuracy and sensitivity, needed to translate the model into practice, was missing. Chapter 6 describes the development of an MPS assay and data analysis pipeline to quantify DNA methylation at these 13 smoking-associated biomarkers for the prediction of smoking status. Though our systematic evaluation on DNA standards of known methylation levels revealed marker-specific amplification bias, our novel tool was still able to provide highly accurate and reproducible DNA methylation quantification and smoking habit prediction. Overall, our MPS assay allows the technological transfer of DNA methylation microarray findings and models to practical settings, one step closer towards future applications. Finally, Chapter 7 provides a general discussion of the results and topics covered across Chapters 2-6. It begins by summarizing the main findings across the thesis, including proposals for follow-up studies. It then covers technical limitations pertaining to bisulfite conversion and DNA methylation microarrays, but also more general considerations such as restricted data access. This chapter ends by covering the outlook of this PhD thesis, including topics such as bisulfite-free methods, third-generation sequencing, single-cell methylomics, multi-omics and systems biology.

    On the Analysis of DNA Methylation

    Recent genome-wide studies lend support to the idea that patterns of DNA methylation are in some way related, either causally or as a readout, to cell-type-specific protein binding. We lay the groundwork for a framework to test whether the pattern of DNA methylation levels in a cell combined with protein binding models is sufficient to completely describe the location of the component of proteins binding to its genome in an assayed context. Only one method, whole-genome bisulfite sequencing (WGBS), is available to study DNA methylation genome-wide at such high resolution; however, its accuracy has not been determined on the scale of individual binding locations. We address this with a two-fold approach. First, we developed methylCRF, an alternative high-resolution, whole-genome assay that combines an enrichment-based and a restriction-enzyme-based assay of methylation. While both assays are considered inferior to WGBS, using two distinct assays has the advantage that each in part cancels out the biases of the other. Additionally, this method is up to 15 times lower in cost than WGBS. By formulating the estimation of methylation from the two methods as a structured prediction problem using a conditional random field, this work also addresses the general problem of incorporating data of varying quality (a common characteristic of biological data) for the purpose of prediction. We show that methylCRF is concordant with WGBS within the range of two WGBS methylomes. Due to the lower cost, we were able to analyze methylation at high resolution across more cell types than previously possible, and we estimate that 28% of CpGs, in regions comprising 11% of the genome, show variable methylation and are enriched in regulatory regions.
Secondly, we show that WGBS has inherent resolution limitations in a read-count-dependent manner and that the identification of unmethylated regions is highly affected by GC bias in the underlying protocol, suggesting that simple estimation procedures may not be sufficient for high-resolution analysis. To address this, we propose a novel approach to DNA methylation analysis that uses change-point detection instead of estimating methylation levels directly. However, we show that current change-point detection methods are not robust to methylation signal; we therefore explore how to extend current non-parametric methods to simultaneously find change-points as well as characteristic methylation levels. We believe this framework may have the power to examine the connection between changes in methylation and transcription factor binding in the context of cell-type-specific behaviors.
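To make the idea of jointly finding a change-point and the characteristic methylation level on each side concrete, the following is a minimal sketch. It uses a plain least-squares single split, which is swapped in here for illustration and is not one of the non-parametric methods the abstract refers to; the function name and the example values are hypothetical.

```python
def best_single_change_point(levels):
    """Find one change-point in a sequence of methylation levels.

    Returns (index, left_mean, right_mean), where `index` is the position
    of the first value in the right-hand segment. The split is chosen to
    minimize the within-segment sum of squared errors (an assumption of
    this sketch, not the method proposed in the abstract).
    """
    n = len(levels)
    if n < 2:
        raise ValueError("need at least two values to place a change-point")
    best = None
    for k in range(1, n):
        left, right = levels[:k], levels[k:]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((x - left_mean) ** 2 for x in left)
        sse += sum((x - right_mean) ** 2 for x in right)
        if best is None or sse < best[0]:
            best = (sse, k, left_mean, right_mean)
    return best[1], best[2], best[3]


# Hypothetical per-CpG methylation levels: a methylated stretch
# followed by an unmethylated one.
example_levels = [0.9, 0.85, 0.95, 0.9, 0.1, 0.05, 0.15, 0.1]
index, left_level, right_level = best_single_change_point(example_levels)
```

The two segment means play the role of the characteristic methylation levels; a practical method would also need to handle multiple change-points and read-count-dependent noise.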