216 research outputs found

    A machine learning based framework to identify and classify long terminal repeat retrotransposons

    Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-LEARNER, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a particular type of TE characterized by long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: REPEATMASKER, CENSOR and LTRDIGEST. In contrast to these methods, TE-LEARNER is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance while learning models and making predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-LEARNER's predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.

    Bioinformatics Applications Based On Machine Learning

    The great advances in information technology (IT) have implications for many sectors, such as bioinformatics, and have considerably increased their possibilities. This book presents a collection of 11 original research papers, all of them related to the application of IT-related techniques within the bioinformatics sector: from new applications created from the adaptation and application of existing techniques to the creation of new methodologies to solve existing problems.

    Dynamic genome evolution in a model fern

    The large size and complexity of most fern genomes have hampered efforts to elucidate fundamental aspects of fern biology and land plant evolution through genome-enabled research. Here we present a chromosomal genome assembly and associated methylome, transcriptome and metabolome analyses for the model fern species Ceratopteris richardii. The assembly reveals a history of remarkably dynamic genome evolution including rapid changes in genome content and structure following the most recent whole-genome duplication approximately 60 million years ago. These changes include massive gene loss, rampant tandem duplications and multiple horizontal gene transfers from bacteria, contributing to the diversification of defence-related gene families. The insertion of transposable elements into introns has led to the large size of the Ceratopteris genome and to exceptionally long genes relative to other plants. Gene family analyses indicate that genes directing seed development were co-opted from those controlling the development of fern sporangia, providing insights into seed plant evolution. Our findings and annotated genome assembly extend the utility of Ceratopteris as a model for investigating and teaching plant biology.

    Combining DNA Methylation with Deep Learning Improves Sensitivity and Accuracy of Eukaryotic Genome Annotation

    Thesis (Ph.D.) - Indiana University, School of Informatics, Computing, and Engineering, 2020. The genome assembly process has significantly decreased in computational complexity since the advent of third-generation long-read technologies. However, genome annotations still require significant manual effort from scientists to produce the trustworthy annotations required for most bioinformatic analyses. Current methods for automatic eukaryotic annotation rely on sequence homology, structure, or repeat detection, and each method requires a separate tool, making the workflow for a final product a complex ensemble. Beyond the nucleotide sequence, one important component of genetic architecture is the presence of epigenetic marks, including DNA methylation. However, no automatic annotation tools currently use this valuable information. As methylation data becomes more widely available from nanopore sequencing technology, tools that take advantage of patterns in this data will be in demand. The goal of this dissertation was to improve the annotation process by developing and training a recurrent neural network (RNN) on trusted annotations to recognize multiple classes of elements from both the reference sequence and DNA methylation. We found that our proposed tool, RNNotate, detected fewer coding elements than GlimmerHMM and Augustus, but those predictions were more often correct. When predicting transposable elements, RNNotate was more accurate than both RepeatMasker and RepeatScout. Additionally, we found that RNNotate was significantly less sensitive when trained and run without DNA methylation, validating our hypothesis. To the best of our knowledge, we are not only the first group to use recurrent neural networks for eukaryotic genome annotation, but we also innovated in the data space by utilizing DNA methylation patterns for prediction.

    Assessing the Gene Content of the Megagenome: Sugar Pine (Pinus lambertiana).

    Sugar pine (Pinus lambertiana Douglas) is within the subgenus Strobus with an estimated genome size of 31 Gbp. Transcriptomic resources are of particular interest in conifers due to the challenges presented in their megagenomes for gene identification. In this study, we present the first comprehensive survey of the P. lambertiana transcriptome through deep sequencing of a variety of tissue types to generate more than 2.5 billion short reads. Third-generation long reads generated through PacBio Iso-Seq have been included for the first time in conifers to combat the challenges associated with de novo transcriptome assembly. A technology comparison is provided here to contribute to the otherwise scarce comparisons of second- and third-generation transcriptome sequencing approaches in plant species. In addition, the transcriptome reference was essential for gene model identification and quality assessment in the parallel project responsible for sequencing and assembly of the entire genome. In this study, the transcriptomic data were also used to address questions surrounding lineage-specific Dicer-like proteins in conifers. These proteins play a role in the control of transposable element proliferation and the related genome expansion in conifers.

    Non-coding regulatory elements: potential roles in disease and the case of epilepsy

    Non-coding DNA (ncDNA) refers to the portion of the genome that does not code for proteins and accounts for the greatest physical proportion of the human genome. ncDNA includes sequences that are transcribed into RNA molecules, such as ribosomal RNAs (rRNAs), microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and un-transcribed sequences that have regulatory functions, including gene promoters and enhancers. Variation in non-coding regions of the genome has an established role in human disease, with growing evidence from many areas, including several cancers, Parkinson's disease and autism. Here, we review the features and functions of the regulatory elements that are present in the non-coding genome and the role that these regions have in human disease. We then review the existing research in epilepsy and emphasise the potential value of further exploring non-coding regulatory elements in epilepsy. In addition, we outline the most widely used techniques for recognising regulatory elements throughout the genome, current methodologies for investigating variation and the main challenges associated with research in the field of non-coding DNA.

    Transposable element annotation in non‐model species ‐ the benefits of species‐specific repeat libraries using semi‐automated EDTA and DeepTE de novo pipelines

    Transposable elements (TEs) are significant genomic components which can be detected either through sequence homology against existing databases or de novo, with the latter potentially reducing the risk of underestimating TE abundance. Here, we describe the semi-automated generation of a de novo TE library using the newly developed EDTA pipeline and DeepTE classifier in a non-model teleost (Corydoras fulleri). Using both genomic and transcriptomic data, we assess this de novo pipeline’s performance across four TE-based metrics: (i) abundance, (ii) composition, (iii) fragmentation and (iv) age distributions. We then compare the results to those found when using a curated teleost library (Danio rerio). We identify quantitative differences in these metrics and highlight how TE library choice can have major impacts on TE-based estimates in non-model species.

    ExplorePipolin: a pipeline for identification and exploration of pipolins, novel mobile genetic elements widespread among bacteria

    Master's thesis in Bioinformatics and Computational Biology. Pipolins constitute a new group of self-synthesizing or self-replicating mobile genetic elements (MGEs), encoding their own replicative DNA polymerase B. These elements have been found to be mostly integrated into the genomes of bacteria from diverse phyla and are also present as circular plasmids in mitochondria. Since only a small number of pipolins have been identified and described so far, their origin and role remain unknown, and there is little evidence of their horizontal transfer. Bioinformatics software capable of automatic identification and analysis of pipolins from bacterial genomes might help advance the accumulation of knowledge about these mobile genetic elements. Therefore, the main goal of the current project was to design and implement a pilot version of a pipeline for the identification and analysis of pipolins from Escherichia coli genomes. The pipeline should be flexible enough to be easily extended to other bacteria in the future. As a sub-goal, it was decided to perform a detailed analysis of pipolins of E. coli strains and isolates available from the NCBI database and from the Spanish E. coli Reference Laboratory (LREC) collection.

    On the molecular basis of mammalian totipotency

    The transient capacity to autonomously form and organize all of the embryonic and extra-embryonic tissues involved in the development of a complete organism is termed totipotency. In mammals, totipotency is a feature restricted to the earliest cells of the pre-implantation embryo, which harbor this unique capacity during the first 1-3 cell cycles, depending on the species. However, our understanding of the regulatory mechanisms responsible for the establishment, maintenance and termination of such a highly plastic regulatory state remains limited. Mammalian totipotency occurs concomitantly with a set of highly intermingled biological processes such as global chromatin remodeling, an unusual set of metabolic characteristics and the de-repression of the vast majority of transposable elements, and it is unclear whether these processes act to sustain it. Following a general overview of these processes, in this dissertation I present my contributions to a body of work on an in vitro model system for mammalian totipotency, which exhibits certain molecular features of the in vivo totipotent state. Afterwards, in the second part of this thesis, I present the transcriptional analyses that I have conducted with the aim of understanding the role of transposable element transcription during pre-implantation development. Overall, this work describes a set of phenomena that arise in totipotent cells in vivo and in totipotent-like cells in vitro and explores how recapitulating certain molecular features of totipotent cells in pluripotent cells induces a totipotent-like state in vitro.