1,719 research outputs found

    Practical application of a Bayesian network approach to poultry epigenetics and stress

    This work was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 812777. We also greatly appreciate funding from the Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning (FORMAS) grants #2018-01074 and #2017-00946 to CG-B. FP appreciates funding from São Paulo Research Foundation (FAPESP, Brazil) projects #2016/20440-3 and #2018/13600-0. Background: Relationships among genetic or epigenetic features can be explored by learning probabilistic networks and unravelling the dependencies among a set of given genetic/epigenetic features. Bayesian networks (BNs) consist of nodes that represent the variables and arcs that represent the probabilistic relationships between the variables. However, practical guidance on how to make choices among the wide array of possibilities in Bayesian network analysis is limited. Our study aimed to apply a BN approach, while clearly laying out our analysis choices as an example for future researchers, in order to provide further insights into the relationships among epigenetic features and a stressful condition in chickens (Gallus gallus). Results: Chickens raised under control conditions (n = 22) and chickens exposed to a social isolation protocol (n = 24) were used to identify differentially methylated regions (DMRs). A total of 60 DMRs were selected by a threshold, after bioinformatic pre-processing and analysis. The treatment was included as a binary variable (control = 0; stress = 1). Thereafter, a BN approach was applied: initially, a pre-filtering test was used for identifying pairs of features that must not be included in the process of learning the structure of the network; then, the average probability values for each arc of being part of the network were calculated; and finally, the arcs that were part of the consensus network were selected.
The structure of the BN consisted of 47 out of 61 features (60 DMRs and the stressful condition), displaying 43 functional relationships. The stress condition was connected to two DMRs, one of them playing a role in tight and adhesive intracellular junctions in organs such as ovary, intestine, and brain. Conclusions: We clearly explain our steps in making each analysis choice, from discrete BN models to final generation of a consensus network from multiple model averaging searches. The epigenetic BN unravelled functional relationships among the DMRs, as well as epigenetic features in close association with the stressful condition the chickens were exposed to. The DMRs interacting with the stress condition could be further explored in future studies as possible biomarkers of stress in poultry species.
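
The consensus step described in this abstract (scoring how often each arc appears across repeated structure searches, then keeping the arcs above a cut-off) can be illustrated with a minimal sketch. The arc names, the three example searches, and the 0.5 threshold are hypothetical and are not taken from the study; this is an illustration of the general idea, not the authors' implementation.

```python
from collections import Counter


def consensus_arcs(networks, threshold=0.5):
    """Return arcs whose frequency across learned networks exceeds threshold.

    networks: list of sets of (parent, child) tuples, one set per search.
    The result maps each retained arc to its strength (fraction of searches).
    """
    if not networks:
        return {}
    counts = Counter(arc for net in networks for arc in net)
    n = len(networks)
    # Arc strength = fraction of searches in which the arc was present.
    strengths = {arc: c / n for arc, c in counts.items()}
    return {arc: s for arc, s in strengths.items() if s > threshold}


# Hypothetical output of three structure searches (illustrative names only).
searches = [
    {("stress", "DMR_1"), ("DMR_1", "DMR_2")},
    {("stress", "DMR_1"), ("DMR_2", "DMR_3")},
    {("stress", "DMR_1"), ("DMR_1", "DMR_2")},
]
kept = consensus_arcs(searches, threshold=0.5)
```

With these made-up inputs, the arc found in only one of the three searches falls below the cut-off and is dropped, while the other two arcs are retained with their strengths.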

    Exploring Patterns of Epigenetic Information With Data Mining Techniques

    [Abstract] Data mining, a part of the Knowledge Discovery in Databases (KDD) process, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Analyses of epigenetic data have evolved towards genome-wide and high-throughput approaches, thus generating great amounts of data for which data mining is essential. Part of these data may contain patterns of epigenetic information that are mitotically and/or meiotically heritable and that determine gene expression, cellular differentiation, and cellular fate. Epigenetic lesions and genetic mutations are acquired by individuals during their life and accumulate with ageing. Both defects, either together or individually, can result in loss of control over cell growth and thus in cancer development. Data mining techniques could then be used to extract such patterns. This work reviews some of the most important applications of data mining to epigenetics. Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo; 209RT-0366. Galicia. Consellería de Economía e Industria; 10SIN105004PR. Instituto de Salud Carlos III; RD07/0067/000

    Set Based Association Testing in High Dimensional Genomic Studies

    The last decade has ushered in an era of high dimensional, high volume data. In particular with the biotechnological revolution of the era, high-dimensional genomic studies of various designs have provided investigators with the tools to study thousands or even millions of genomic features simultaneously. These studies have shed new light on the underlying mechanisms of complex diseases. The accumulated knowledge of these complex relationships between genes has led scientists to formalize pathways and graphical networks that visually and succinctly give descriptions of the geometry of these relationships. With such knowledge, it has become possible to develop procedures for statistical inference, not just at the individual gene level, but at the more meaningful gene-set level. The focus of this thesis is the development of new statistical procedures for such gene-set analysis. After presenting an overview in the introduction, we give a comprehensive review of the literature relevant to the developments in this thesis in Chapter 2. In Chapter 3, we develop a Bayesian procedure that incorporates information contained in a gene graphical network, viewed as a directed graph, into the construction of prior distributions and we use the derived posterior distributions to construct statistical tests at the gene-set level. Our procedure extends the work of Pan (2006) and Wei and Pan (2008) which did not use the direction as information in the graphical network, but rather used undirected graphs and assumed a mixture model for the distribution to generate the posterior distribution of the mixing parameters via the use of a Markov random field. We demonstrate the gain in statistical power of our procedure over Pan and Wei's in an application to detect differentially expressed genes and gene-sets by analyzing a data set that compares favorable risk and poor risk defined by cytogenetics in adults with acute myeloid leukemia (AML).
To enhance comprehension of the vast and complex information in high-dimensional data from genomic studies, it is sometimes useful and desirable to have a procedure that relates such data to specific endpoints. In this regard, association tests are highly desirable. In Chapter 4, we propose a procedure which we label 'Projection onto Orthogonal Space Testing (POST)' as a flexible method for testing association of gene sets and pathways with specific phenotypic endpoints while adjusting for other factors and variables as needed. In a simulation study, we demonstrate that POST has better operating characteristics than other methods recently developed to address the same objective. Thus we feel that POST not only helps to better understand treatment responses, but also prioritizes pathways for further study. We expect that POST will be especially valuable in clinical studies where cohorts with moderate to large sample sizes have rich high-dimensional data. Another new procedure for association testing, which we label 'Locus Based Integrated Testing (LOCIT)', and an extension of the procedure, LOCITO, are introduced in Chapter 5. LOCIT is designed to test association of multiple forms of genomic data within a locus with an endpoint of interest in genomic studies. Given different forms of genomic data such as SNP genotypes, gene expression, and methylation levels, LOCIT performs one test per locus, taking several features at the locus into consideration. To illustrate the efficacy of LOCIT, we apply the procedure to a set consisting of SNP genotypes and gene profiling in an AML cohort to identify loci/genes that are associated with clinical outcomes. In Chapter 6, we summarize our development of gene-set level association tests and outline future directions of our research in this area.

    Methods for Epigenetic Analyses from Long-Read Sequencing Data

    Epigenetics, particularly the study of DNA methylation, is a cornerstone field for our understanding of human development and disease. DNA methylation has been included in the "hallmarks of cancer" due to its important function as a biomarker and its contribution to carcinogenesis and cancer cell plasticity. Long-read sequencing technologies, such as the Oxford Nanopore Technologies platform, have evolved the study of structural variations, while at the same time allowing direct measurement of DNA methylation on the same reads. With this, new avenues of analysis have opened up, such as long-range allele-specific methylation analysis, methylation analysis on structural variations, or relating nearby epigenetic modalities on the same read to one another. Basecalling and methylation calling of Nanopore reads are computationally expensive tasks which require complex machine learning architectures. Read-level methylation calls require different approaches to data management and analysis than ones developed for methylation frequencies measured from short-read technologies or array data. Because read- and genome-associated DNA methylation calls are 2-dimensional and include methylation caller uncertainties, they are much more costly to store than 1-dimensional methylation frequencies. Methods for storage, retrieval, and analysis of such data therefore require careful consideration. Downstream analysis tasks, such as methylation segmentation or differential methylation calling, have the potential of benefiting from read information and allow uncertainty propagation. These avenues had not been considered in existing tools. In my work, I explored the potential of long-read DNA methylation analysis and tackled some of the challenges of data management and downstream analysis using state of the art software architecture and machine learning methods.
I defined a storage standard for reference-anchored and read-assigned DNA methylation calls, including methylation calling uncertainties and read annotations such as haplotype or sample information. This storage container is defined as a schema for the Hierarchical Data Format version 5 (HDF5), includes an index for rapid access to genomic coordinates, and is optimized for parallel computing with even load balancing. It further includes a Python API for creation, modification, and data access, including convenience functions for the extraction of important quality statistics via a command line interface. Furthermore, I developed software solutions for the segmentation and differential methylation testing of DNA methylation calls from Nanopore sequencing. This implementation takes advantage of the performance benefits provided by my high-performance storage container. It includes a Bayesian methylome segmentation algorithm which allows for the consensus instance segmentation of multiple sample and/or haplotype assigned DNA methylation profiles, while considering methylation calling uncertainties. Based on this segmentation, the software can then perform differential methylation testing and provides a large number of options for statistical testing and multiple testing correction. I benchmarked all tools on both simulated and publicly available real data, and show the performance benefits compared to previously existing and concurrently developed solutions. Next, I applied the methods to a cancer study on a chromothriptic cancer sample from a patient with Sonic Hedgehog Medulloblastoma. I here report regulatory genomic regions differentially methylated before and after treatment, allele-specific methylation in the tumor, as well as methylation on chromothriptic structures. Finally, I developed specialized methylation callers for the combined DNA methylation profiling of CpG, GpC, and context-free adenine methylation.
These callers can be used to measure chromatin accessibility in a NOMe-seq-like setup, showing the potential of long-read sequencing for the profiling of transcription factor co-binding. In conclusion, this thesis presents and subsequently benchmarks new algorithmic and infrastructural solutions for the analysis of DNA methylation data from long-read sequencing.

    The Reasonable Effectiveness of Randomness in Scalable and Integrative Gene Regulatory Network Inference and Beyond

    Gene regulation is orchestrated by a vast number of molecules, including transcription factors and co-factors, chromatin regulators, as well as epigenetic mechanisms, and it has been shown that transcriptional misregulation, e.g., caused by mutations in regulatory sequences, is responsible for a plethora of diseases, including cancer, developmental or neurological disorders. As a consequence, decoding the architecture of gene regulatory networks has become one of the most important tasks in modern (computational) biology. However, to advance our understanding of the mechanisms involved in the transcriptional apparatus, we need scalable approaches that can deal with the increasing number of large-scale, high-resolution, biological datasets. In particular, such approaches need to be capable of efficiently integrating and exploiting the biological and technological heterogeneity of such datasets in order to best infer the underlying, highly dynamic regulatory networks, often in the absence of sufficient ground truth data for model training or testing. With respect to scalability, randomized approaches have proven to be a promising alternative to deterministic methods in computational biology. As an example, one of the top performing algorithms in a community challenge on gene regulatory network inference from transcriptomic data is based on a random forest regression model. In this concise survey, we aim to highlight how randomized methods may serve as a highly valuable tool, in particular, with increasing amounts of large-scale, biological experiments and datasets being collected. Given the complexity and interdisciplinary nature of the gene regulatory network inference problem, we hope our survey may be helpful to both computational and biological scientists.
It is our aim to provide a starting point for a dialogue about the concepts, benefits, and caveats of the toolbox of randomized methods, since unravelling the intricate web of highly dynamic, regulatory events will be one fundamental step in understanding the mechanisms of life and eventually developing efficient therapies to treat and cure diseases.

    Statistical methods to detect allele specific expression, alterations of allele specific expression and differential expression

    The advent of next-generation sequencing (NGS) technology has facilitated the recent development of RNA sequencing (RNA-seq), which is a novel mapping and quantifying method for transcriptomes. By RNA-seq, one can measure the expression of different features such as gene expression, allelic expression, and intragenic expression in the form of read counts. These features have provided new opportunities to study and interpret the molecular intricacy and variations that are potentially associated with the occurrence of specific diseases. Therefore, there has been an emerging interest in statistical methods to analyze RNA-seq data from different perspectives. In this dissertation, we focus on three important challenges: identifying allele specific expression (ASE) on the gene level and single nucleotide polymorphism (SNP) level simultaneously, detecting ASE regions in the control group and regions of ASE alterations in the case group simultaneously, and detecting genes whose expression levels are significantly different across treatment groups (DE genes). In Chapter 2, we propose a method to test ASE of a gene as a whole and variation in ASE within a gene across exons separately and simultaneously. A generalized linear mixed model is employed to incorporate variations due to genes, SNPs, and biological replicates. To improve reliability of statistical inferences, we assign priors on each effect in the model so that information is shared across genes in the entire genome. We utilize the Bayes factor to test the hypothesis of ASE for each gene and variations across SNPs within a gene. We compare the proposed method to competing approaches through simulation studies that mimicked the real datasets. The proposed method exhibits improved control of the false discovery rate and improved power over existing methods when SNP variation and biological variation are present.
In addition, the proposed method maintains low computational requirements that allow for whole genome analysis. As an example of real data analysis, we apply the proposed method to four tissue types in a bovine study to de novo detect ASE genes in the bovine genome, and uncover intriguing predictions of regulatory ASEs across gene exons and across tissue types. In Chapter 3, we propose a new and powerful algorithm for detecting ASE regions in a healthy control group and regions of ASE alterations in a disease/case group compared to the control. Specifically, we develop a bivariate Bayesian hidden Markov model (HMM) and an expectation-maximization inferential procedure. The proposed algorithm gains advantages over existing methods by addressing their limitations and by recognizing the complexity of biology. First, the bivariate Bayesian HMM detects ASEs for different mRNA isoforms due to alternative splicing and RNA variants. Second, it models spatial correlations among genomic observations, unlike existing methods that often assume independence. Third, the bivariate HMM draws inferences simultaneously for control and case samples, which maximizes the utilization of available information in data. Real data analysis and simulation studies that mimic real data sets are shown to illustrate the improved performance and practical utility of the proposed method. In Chapter 4, we present a new method to detect DE genes in any sequencing experiment. The read counts for different treatment groups are modelled by two Negative Binomial distributions which may have different means but share the same dispersion parameter. We propose a mixture prior model for the dispersion parameters with a point mass at zero and a lognormal distribution. The mixture model allows shrinkage across genes within each of the two mixture components, thus preventing the overcorrection resulting from shrinkage across all genes.
The simulation studies demonstrate that the proposed method yields better dispersion estimation and FDR control, and a higher accuracy in gene ranking. In addition, the proposed method exhibits robustness to the misspecification of the bimodal distribution for the dispersion parameters, and is thus flexible and can be easily generalized.
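
The count model described for Chapter 4 of this abstract (two Negative Binomial distributions with different means but one shared dispersion) can be written down compactly. The sketch below assumes the common mean-dispersion parameterization, Var(Y) = mu + phi * mu^2, which the abstract does not specify; the function names are hypothetical, and the mixture prior on the dispersion is not modelled here.

```python
import math


def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a Negative Binomial distribution.

    Assumed parameterization (not stated in the abstract):
    mean mu > 0 and dispersion phi > 0, so that Var(Y) = mu + phi * mu**2.
    The phi = 0 (Poisson) limit is not handled by this sketch.
    """
    r = 1.0 / phi  # "size" parameter of the Negative Binomial
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def two_group_loglik(counts_a, counts_b, mu_a, mu_b, phi):
    """Joint log-likelihood: separate group means, one shared dispersion."""
    return (
        sum(nb_logpmf(y, mu_a, phi) for y in counts_a)
        + sum(nb_logpmf(y, mu_b, phi) for y in counts_b)
    )
```

Sharing `phi` between the two groups means a gene contributes a single dispersion parameter regardless of how its group means differ, which is the structure the abstract describes.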

    Stochastic spatial modelling of DNA methylation patterns and moment-based parameter estimation

    In the first part of this thesis, we introduce and analyze spatial stochastic models for DNA methylation, an epigenetic mark with an important role in development. The underlying mechanisms controlling methylation are only partly understood. Several mechanistic models of enzyme activities responsible for methylation have been proposed. Here, we extend existing hidden Markov models (HMMs) for DNA methylation by describing the occurrence of spatial methylation patterns with stochastic automata networks. We perform numerical analysis of the HMMs applied to (non-)hairpin bisulfite sequencing KO data and accurately predict the wild-type data from these results. We find evidence that the activities of Dnmt3a/b responsible for de novo methylation depend on the left but not on the right CpG neighbors. The second part focuses on parameter estimation in chemical reaction networks (CRNs). We propose a generalized method of moments (GMM) approach for inferring the parameters of CRNs based on a sophisticated matching of the statistical moments of the stochastic model and the sample moments of population snapshot data. The proposed parameter estimation method exploits recently developed moment-based approximations and provides estimators with desirable statistical properties when many samples are available. The GMM provides accurate and fast estimations of unknown parameters of CRNs. The accuracy increases and the variance decreases when higher-order moments are considered.
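
The moment-matching idea behind the generalized method of moments (GMM) mentioned in this abstract can be shown on a toy case. The sketch below assumes a simple birth-death reaction network whose stationary copy number is Poisson with mean lam (production rate divided by degradation rate), matches the first two moments with an identity weighting matrix, and minimizes by plain grid search. The model, the weighting, the search strategy, and the function name are illustrative assumptions, not the thesis's method.

```python
def gmm_estimate(samples, grid_size=20001):
    """Estimate the Poisson mean lam by matching two moments (identity weights).

    Model moments under the assumed Poisson stationary distribution:
    E[X] = lam and E[X^2] = lam + lam^2.
    """
    n = len(samples)
    m1 = sum(samples) / n                  # first sample moment
    m2 = sum(x * x for x in samples) / n   # second sample moment
    upper = 2.0 * max(m1, 1.0)             # search interval (0, upper]
    best_lam, best_obj = None, float("inf")
    for i in range(1, grid_size):
        lam = upper * i / (grid_size - 1)
        g1 = m1 - lam                      # first moment condition
        g2 = m2 - (lam + lam * lam)        # second moment condition
        obj = g1 * g1 + g2 * g2            # GMM objective with W = identity
        if obj < best_obj:
            best_lam, best_obj = lam, obj
    return best_lam


# Hypothetical snapshot data whose sample moments match lam = 4 exactly.
estimate = gmm_estimate([2, 6])
```

Because there are two moment conditions and one parameter, the problem is over-identified; a full GMM would typically replace the identity matrix with an estimated weighting matrix.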

    A Bayesian Approach for Analysis of Whole-Genome Bisulfite Sequencing Data Identifies Disease-Associated Changes in DNA Methylation

    DNA methylation is a key epigenetic modification involved in gene regulation whose contribution to disease susceptibility remains to be fully understood. Here, we present a novel Bayesian smoothing approach (called ABBA) to detect differentially methylated regions (DMRs) from whole-genome bisulfite sequencing (WGBS). We also show how this approach can be leveraged to identify disease-associated changes in DNA methylation, suggesting mechanisms through which these alterations might affect disease. From a data modeling perspective, ABBA has the distinctive feature of automatically adapting to different correlation structures in CpG methylation levels across the genome while taking into account the distance between CpG sites as a covariate. Our simulation study shows that ABBA has greater power to detect DMRs than existing methods, providing an accurate identification of DMRs in the large majority of simulated cases. To empirically demonstrate the method’s efficacy in generating biological hypotheses, we performed WGBS of primary macrophages derived from an experimental rat system of glomerulonephritis and used ABBA to identify >1000 disease-associated DMRs. Investigation of these DMRs revealed differential DNA methylation localized to a 600 bp region in the promoter of the Ifitm3 gene. This was confirmed by ChIP-seq and RNA-seq analyses, showing differential transcription factor binding at the Ifitm3 promoter by JunD (an established determinant of glomerulonephritis), and a consistent change in Ifitm3 expression. Our ABBA analysis allowed us to propose a new role for Ifitm3 in the pathogenesis of glomerulonephritis via a mechanism involving promoter hypermethylation that is associated with Ifitm3 repression in the rat strain susceptible to glomerulonephritis.