
    Combining DNA Methylation with Deep Learning Improves Sensitivity and Accuracy of Eukaryotic Genome Annotation

    Thesis (Ph.D.) - Indiana University, School of Informatics, Computing, and Engineering, 2020
    The genome assembly process has significantly decreased in computational complexity since the advent of third-generation long-read technologies. However, genome annotations still require significant manual effort from scientists to produce the trustworthy annotations required for most bioinformatic analyses. Current methods for automatic eukaryotic annotation rely on sequence homology, structure, or repeat detection, and each method requires a separate tool, making the workflow for a final product a complex ensemble. Beyond the nucleotide sequence, one important component of genetic architecture is the presence of epigenetic marks, including DNA methylation. However, no automatic annotation tools currently use this valuable information. As methylation data becomes more widely available from nanopore sequencing technology, tools that take advantage of patterns in this data will be in demand. The goal of this dissertation was to improve the annotation process by developing and training a recurrent neural network (RNN) on trusted annotations to recognize multiple classes of elements from both the reference sequence and DNA methylation. We found that our proposed tool, RNNotate, detected fewer coding elements than GlimmerHMM and Augustus, but those predictions were more often correct. When predicting transposable elements, RNNotate was more accurate than both RepeatMasker and RepeatScout. Additionally, we found that RNNotate was significantly less sensitive when trained and run without DNA methylation, validating our hypothesis. To the best of our knowledge, we are not only the first group to use recurrent neural networks for eukaryotic genome annotation, but we also innovated in the data space by utilizing DNA methylation patterns for prediction.

    Mapping the Landscape of Mutation Rate Heterogeneity in the Human Genome: Approaches and Applications

    All heritable genetic variation is ultimately the result of mutations that have occurred in the past. Understanding the processes which determine the rate and spectra of new mutations is therefore fundamentally important in efforts to characterize the genetic basis of heritable disease, infer the timing and extent of past demographic events (e.g., population expansion, migration), or identify signals of natural selection. This dissertation aims to describe patterns of mutation rate heterogeneity in detail, identify factors contributing to this heterogeneity, and develop methods and tools to harness such knowledge for more effective and efficient analysis of whole-genome sequencing data. In Chapters 2 and 3, we catalog granular patterns of germline mutation rate heterogeneity throughout the human genome by analyzing extremely rare variants ascertained from large-scale whole-genome sequencing datasets. In Chapter 2, we describe how mutation rates are influenced by local sequence context and various features of the genomic landscape (e.g., histone marks, recombination rate, replication timing), providing detailed insight into the determinants of single-nucleotide mutation rate variation. We show that these estimates reflect genuine patterns of variation among de novo mutations, with broad potential for improving our understanding of the biology of underlying mutation processes and the consequences for human health and evolution. These estimated rates are publicly available at http://mutation.sph.umich.edu/. In Chapter 3, we introduce a novel statistical model to elucidate the variation in rate and spectra of multinucleotide mutations throughout the genome. We catalog two major classes of multinucleotide mutations: those resulting from error-prone translesion synthesis, and those resulting from repair of double-strand breaks. 
    In addition, we identify specific hotspots for these unique mutation classes and describe the genomic features associated with their spatial variation. We show how these multinucleotide mutation processes, along with sample demography and mutation rate heterogeneity, contribute to the overall patterns of clustered variation throughout the genome, promoting a more holistic approach to interpreting the source of these patterns. In Chapter 4, we develop Helmsman, a computationally efficient software tool to infer mutational signatures in large samples of cancer genomes. By incorporating parallelization routines and efficient programming techniques, Helmsman performs this task up to 300 times faster and with a memory footprint 100 times smaller than existing mutation signature analysis software. Moreover, Helmsman is the only such program capable of directly analyzing arbitrarily large datasets. The Helmsman software can be accessed at https://github.com/carjed/helmsman. Finally, in Chapter 5, we present a new method for quality control in large-scale whole-genome sequencing datasets, using a combination of dimensionality reduction algorithms and unsupervised anomaly detection techniques. Just as the mutation spectrum can be used to infer the presence of underlying mechanisms, we show that the spectrum of rare variation is a powerful and informative indicator of sample sequencing quality. Analyzing three large-scale datasets, we demonstrate that our method is capable of identifying samples affected by a variety of technical artifacts that would otherwise go undetected by standard ad hoc filtering criteria. We have implemented this method in a software package, Doomsayer, available at https://github.com/carjed/doomsayer.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/147537/1/jedidiah_1.pd
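
The idea that a sample's rare-variant spectrum can signal sequencing quality can be illustrated with a small sketch. This is not Doomsayer's implementation: the abstract describes dimensionality reduction plus unsupervised anomaly detection, whereas the sketch below swaps in a much simpler rule (L1 distance from the cohort's median spectrum, with a fixed cutoff). The function names, the six-category spectrum, and the cutoff value `0.2` are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import median

# Six collapsed single-nucleotide substitution classes (an illustrative choice).
SUBSTITUTION_TYPES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def spectrum(variants):
    """Return the fraction of a sample's variants in each substitution class."""
    counts = Counter(variants)
    total = sum(counts[t] for t in SUBSTITUTION_TYPES)
    if total == 0:
        raise ValueError("sample has no variants in the known classes")
    return [counts[t] / total for t in SUBSTITUTION_TYPES]


def flag_outliers(samples, max_distance=0.2):
    """Flag samples whose spectrum is far from the cohort median spectrum.

    `samples` maps a sample name to a list of substitution labels.
    `max_distance` is a hypothetical cutoff, not a value from the source.
    """
    spectra = {name: spectrum(v) for name, v in samples.items()}
    # Per-class median across the cohort serves as a robust reference spectrum.
    center = [
        median(s[i] for s in spectra.values())
        for i in range(len(SUBSTITUTION_TYPES))
    ]
    flagged = []
    for name, s in spectra.items():
        # L1 distance between this sample's spectrum and the reference.
        distance = sum(abs(x - c) for x, c in zip(s, center))
        if distance > max_distance:
            flagged.append(name)
    return sorted(flagged)
```

A sample with an excess of one substitution class (for example, a C>A excess) would sit far from the cohort median and be flagged, while samples that differ only by small count fluctuations would not.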

    Genome-wide analysis of DNA methylation topology to understand cell fate

    DNA methylation is an epigenetic modification associated with gene regulation. It has been studied extensively in the context of small regulatory regions. Yet little is known about large domains characterized by fuzzy methylation patterns, termed Partially Methylated Domains (PMDs). The present thesis comprises PMD analyses in various contexts and provides several new aspects to study DNA methylation. First, a comprehensive analysis of PMDs across a large cohort of WGBS samples was performed to identify structural and functional features associated with PMDs. A newly developed approach, ChromH3M, was proposed for the analysis and integration of a large spectrum of WGBS data sets. Second, PMDs were found to be indicators of the cellular proliferation history, and segmented loss of DNA methylation in PMDs supports the sequential linear differentiation model of memory T-cells. Third, assessment of genome-wide methylation changes in PMDs of Multiple Sclerosis-discordant monozygotic co-twins did not show significant differences, but local changes (DMRs) were identified. Taken together, the outcomes of the presented studies shed light on a so far neglected aspect of DNA methylation, namely PMDs, in different contexts: lineage specialization, differentiation, replication, disease, chromatin organization and gene expression.
    DNA methylation is an epigenetic modification functionally linked to gene regulation. It has already been studied extensively in the context of small regulatory regions. However, not much is yet known about large domains, which were first described in WGBS data. They are termed partially methylated domains (PMDs) and are characterized by the presence of variable methylation patterns. The present work comprises PMD analyses in different contexts and provides several new aspects for the study of DNA methylation. First, a comprehensive analysis of PMDs in a large cohort of WGBS samples was performed to identify structural and functional features associated with PMDs. A newly developed approach, ChromH3M, was applied for the analysis and integration of a large cohort of WGBS data sets. Second, PMDs were found to be indicators of cellular proliferation history, and the observed gradual loss of global DNA methylation during the differentiation of memory T cells supports the hypothesis of sequential linear differentiation. Third, the assessment of genome-wide methylation changes in PMDs of multiple sclerosis-discordant monozygotic twins showed no significant differences, although local changes (DMRs) were identified. Overall, the results of the presented studies shed light on a hitherto rather neglected aspect of DNA methylation, namely PMDs, in various contexts: cell lineage commitment, cell differentiation, replication, disease, chromatin organization, and the regulation of gene expression.

    Graph embedding and unsupervised learning predict genomic sub-compartments from HiC chromatin interaction data.

    Chromatin interaction studies can reveal how the genome is organized into spatially confined sub-compartments in the nucleus. However, accurately identifying sub-compartments from chromatin interaction data remains a challenge in computational biology. Here, we present Sub-Compartment Identifier (SCI), an algorithm that uses graph embedding followed by unsupervised learning to predict sub-compartments using Hi-C chromatin interaction data. We find that the network topological centrality and clustering performance of SCI sub-compartment predictions are superior to those of hidden Markov model (HMM) sub-compartment predictions. Moreover, using orthogonal Chromatin Interaction Analysis by in-situ Paired-End Tag Sequencing (ChIA-PET) data, we confirmed that SCI sub-compartment prediction outperforms HMM. We show that SCI-predicted sub-compartments have distinct epigenetic marks, transcriptional activities, and transcription factor enrichment. Moreover, we present a deep neural network to predict sub-compartments using epigenome, replication timing, and sequence data. Our neural network predicts sub-compartments more accurately when SCI-determined sub-compartments are used as training labels.

    Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine

    High-throughput DNA sequencing is revolutionizing the study of cancer and enabling the measurement of the somatic mutations that drive cancer development. However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations. Here, we review computational approaches to identify somatic mutations in cancer genome sequences and to distinguish the driver mutations that are responsible for cancer from random passenger mutations. First, we describe approaches to detect somatic mutations from high-throughput DNA sequencing data, particularly for tumor samples that comprise heterogeneous populations of cells. Next, we review computational approaches that aim to predict driver mutations according to their frequency of occurrence in a cohort of samples, or according to their predicted functional impact on protein sequence or structure. Finally, we review techniques to identify recurrent combinations of somatic mutations, including approaches that examine mutations in known pathways or protein-interaction networks, as well as de novo approaches that identify combinations of mutations according to statistical patterns of mutual exclusivity. These techniques, coupled with advances in high-throughput DNA sequencing, are enabling precision medicine approaches to the diagnosis and treatment of cancer.
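
The statistical notion of mutual exclusivity mentioned above can be made concrete with a minimal sketch. It assumes one simple formulation, a one-sided hypergeometric test on a pair of genes, and is not the method of any specific tool covered by the review. The function name and its arguments are hypothetical.

```python
from math import comb


def exclusivity_pvalue(mutated_a, mutated_b, n_samples):
    """Probability of seeing this few (or fewer) co-mutated samples by chance.

    `mutated_a` and `mutated_b` are sets of sample IDs carrying a mutation
    in gene A and gene B; `n_samples` is the cohort size. Under the null
    hypothesis that the two genes are mutated independently, the overlap
    follows a hypergeometric distribution. A small lower-tail probability
    suggests the genes are mutated in a mutually exclusive pattern.
    """
    k_a, k_b = len(mutated_a), len(mutated_b)
    overlap = len(mutated_a & mutated_b)
    total = comb(n_samples, k_b)
    # Sum P(X = x) for x = 0..overlap: choose x of B's samples from A's
    # mutated samples and the remaining k_b - x from the other samples.
    tail = sum(
        comb(k_a, x) * comb(n_samples - k_a, k_b - x)
        for x in range(overlap + 1)
    )
    return tail / total
```

For a cohort of 10 samples in which gene A is mutated in five samples and gene B in the other five, the overlap is zero and the function returns 1/252, the chance of such perfect separation under independence. Testing many gene pairs or larger gene sets in practice would additionally require multiple-testing correction, which this sketch omits.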