9 research outputs found
Combining DNA Methylation with Deep Learning Improves Sensitivity and Accuracy of Eukaryotic Genome Annotation
Thesis (Ph.D.) - Indiana University, School of Informatics, Computing, and Engineering, 2020The genome assembly process has significantly decreased in computational complexity since the advent of third-generation long-read technologies. However, genome annotations still require significant manual effort from scientists to produce trust-worthy annotations required for most bioinformatic analyses. Current methods for automatic eukaryotic annotation rely on sequence homology, structure, or repeat detection, and each method requires a separate tool, making the workflow for a final product a complex ensemble. Beyond the nucleotide sequence, one important component of genetic architecture is the presence of epigenetic marks, including DNA methylation. However, no automatic annotation tools currently use this valuable information. As methylation data becomes more widely available from nanopore sequencing technology, tools that take advantage of patterns in this data will be in demand. The goal of this dissertation was to improve the annotation process by developing and training a recurrent neural network (RNN) on trusted annotations to recognize multiple classes of elements from both the reference sequence and DNA methylation. We found that our proposed tool, RNNotate, detected fewer coding elements than GlimmerHMM and Augustus, but those predictions were more often correct. When predicting transposable elements, RNNotate was more accurate than both Repeat-Masker and RepeatScout. Additionally, we found that RNNotate was significantly less sensitive when trained and run without DNA methylation, validating our hypothesis. To our best knowledge, we are not only the first group to use recurrent neural networks for eukaryotic genome annotation, but we also innovated in the data space by utilizing DNA methylation patterns for prediction
Recommended from our members
Containerization on Petascale HPC Clusters
Containerization technologies provide a mechanism to encapsulate applications and many of their dependencies, facilitating software portability and reproducibility on HPC systems. However, in order to access many of the architectural features that enable HPC system performance, compatibility between certain components of the container and host are required, resulting in a trade-off between portability and performance. In this work, we discuss our early experiences running three state-of-the-art containerization technologies on the petascale Frontera system. We present how we build the containers to ensure performance and security and their performance at scale.We ran microbenchmarks at a scale of 4,096 nodes and demonstrate the near-native performance and minimal memory overheads by the containerized environments at 70,000 processes on 1,296 nodes with a scientific application MILC - a quantum chromodynamics code.UT Austin-Portugal Program, a collaboration between the Portuguese Foundation of Science and Technology and the University of Texas at Austin, award UTA18-001217Texas Advanced Computing Center (TACC
Comparing DNA replication programs reveals large timing shifts at centromeres of endocycling cells in maize roots.
Plant cells undergo two types of cell cycles-the mitotic cycle in which DNA replication is coupled to mitosis, and the endocycle in which DNA replication occurs in the absence of cell division. To investigate DNA replication programs in these two types of cell cycles, we pulse labeled intact root tips of maize (Zea mays) with 5-ethynyl-2'-deoxyuridine (EdU) and used flow sorting of nuclei to examine DNA replication timing (RT) during the transition from a mitotic cycle to an endocycle. Comparison of the sequence-based RT profiles showed that most regions of the maize genome replicate at the same time during S phase in mitotic and endocycling cells, despite the need to replicate twice as much DNA in the endocycle and the fact that endocycling is typically associated with cell differentiation. However, regions collectively corresponding to 2% of the genome displayed significant changes in timing between the two types of cell cycles. The majority of these regions are small with a median size of 135 kb, shift to a later RT in the endocycle, and are enriched for genes expressed in the root tip. We found larger regions that shifted RT in centromeres of seven of the ten maize chromosomes. These regions covered the majority of the previously defined functional centromere, which ranged between 1 and 2 Mb in size in the reference genome. They replicate mainly during mid S phase in mitotic cells but primarily in late S phase of the endocycle. In contrast, the immediately adjacent pericentromere sequences are primarily late replicating in both cell cycles. Analysis of CENH3 enrichment levels in 8C vs 2C nuclei suggested that there is only a partial replacement of CENH3 nucleosomes after endocycle replication is complete. The shift to later replication of centromeres and possible reduction in CENH3 enrichment after endocycle replication is consistent with a hypothesis that centromeres are inactivated when their function is no longer needed
Subtle Perturbations of the Maize Methylome Reveal Genes and Transposons Silenced by Chromomethylase or RNA-Directed DNA Methylation Pathways
DNA methylation is a chromatin modification that can provide epigenetic regulation of gene and transposon expression. Plants utilize several pathways to establish and maintain DNA methylation in specific sequence contexts. The chromomethylase (CMT) genes maintain CHG (where H = A, C or T) methylation. The RNA-directed DNA methylation (RdDM) pathway is important for CHH methylation. Transcriptome analysis was performed in a collection of Zea mays lines carrying mutant alleles for CMT or RdDM-associated genes. While the majority of the transcriptome was not affected, we identified sets of genes and transposon families sensitive to context-specific decreases in DNA methylation in mutant lines. Many of the genes that are up-regulated in CMT mutant lines have high levels of CHG methylation, while genes that are differentially expressed in RdDM mutants are enriched for having nearby mCHH islands, implicating context-specific DNA methylation in the regulation of expression for a small number of genes. Many genes regulated by CMTs exhibit natural variation for DNA methylation and transcript abundance in a panel of diverse inbred lines. Transposon families with differential expression in the mutant genotypes show few defining features, though several families up-regulated in RdDM mutants show enriched expression in endosperm tissue, highlighting the potential importance for this pathway during reproduction. Taken together, our findings suggest that while the number of genes and transposon families whose expression is reproducibly affected by mild perturbations in context-specific methylation is small, there are distinct patterns for loci impacted by RdDM and CMT mutants
Repliscan: a tool for classifying replication timing regions
Abstract Background Replication timing experiments that use label incorporation and high throughput sequencing produce peaked data similar to ChIP-Seq experiments. However, the differences in experimental design, coverage density, and possible results make traditional ChIP-Seq analysis methods inappropriate for use with replication timing. Results To accurately detect and classify regions of replication across the genome, we present Repliscan. Repliscan robustly normalizes, automatically removes outlying and uninformative data points, and classifies Repli-seq signals into discrete combinations of replication signatures. The quality control steps and self-fitting methods make Repliscan generally applicable and more robust than previous methods that classify regions based on thresholds. Conclusions Repliscan is simple and effective to use on organisms with different genome sizes. Even with analysis window sizes as small as 1 kilobase, reliable profiles can be generated with as little as 2.4x coverage