402 research outputs found

    A call for benchmarking transposable element annotation methods

    DNA derived from transposable elements (TEs) constitutes large parts of the genomes of complex eukaryotes, with major impacts not only on genomic research but also on how organisms evolve and function. Although a variety of methods and tools have been developed to detect and annotate TEs, there are as yet no standard benchmarks, that is, no standard way to measure or compare their accuracy. This lack of accuracy assessment calls into question conclusions from a wide range of research that depends explicitly or implicitly on TE annotation. In the absence of standard benchmarks, toolmakers are impeded in improving their tools, annotators cannot properly assess which tools might best suit their needs, and downstream researchers cannot judge how accuracy limitations might impact their studies. We therefore propose that the TE research community create and adopt standard TE annotation benchmarks, and we call for other researchers to join the authors in making this long-overdue effort a success.

    Paired-End Mappability of Transposable Elements in the Human Genome

    Though transposable elements make up around half of the human genome, the repetitive nature of their sequences makes it difficult to align conventional sequencing reads to them accurately. However, with recent advances in sequencing technology, such as increased read length and paired-end libraries, these repetitive regions are becoming easier to align to. This study investigates the mappability of transposable elements with 50 bp, 76 bp and 100 bp paired-end read libraries. At those read lengths, and allowing for 3 mismatches during alignment, over 68%, 85%, and 88% of all transposable elements in the RepeatMasker database are uniquely mappable, respectively, suggesting that accurate locus-specific mapping of older transposable elements is well within reach.

    High-throughput sequencing data and the impact of plant gene annotation quality

    The use of draft genomes of different species and the re-sequencing of accessions and populations are now common tools for plant biology research. De novo assembled draft genomes make it possible to identify pivotal divergence points in the plant lineage and provide an opportunity to investigate the genomic basis and timing of biological innovations by inferring orthologs between species. Furthermore, re-sequencing facilitates the mapping and subsequent molecular characterization of causative loci for traits including plant stress tolerance or development. In both cases, high-quality gene annotation (the identification of protein-coding regions, gene promoters, and 5' and 3' untranslated regions) is critical for the investigation of gene function. Annotations are constantly improving, but automated gene annotations still require manual curation and experimental validation. This is particularly important for genes with large introns, genes located in regions rich in transposable elements or repeats, large gene families, and segmentally duplicated genes. In this opinion paper we highlight the impact of annotation quality on evolutionary analyses, genome-wide association studies, and the identification of orthologous genes in plants. Furthermore, we predict that incorporating the accurate information from manual curation into databases will dramatically improve the performance of automated gene predictors.

    DNAscan2: a versatile, scalable, and user-friendly analysis pipeline for human next-generation sequencing data

    SUMMARY: The current widespread adoption of next-generation sequencing (NGS) in all branches of basic research and clinical genetics means that users with highly variable informatics skills, computing facilities and application purposes need to process, analyse, and interpret NGS data. In this landscape, versatility, scalability, and user-friendliness are key characteristics of NGS analysis software. We developed DNAscan2, a highly flexible, end-to-end pipeline for the analysis of NGS data, which (i) can be used for the detection of multiple variant types, including SNVs, small indels, transposable elements, short tandem repeats, and other large structural variants; (ii) covers all standard steps of NGS analysis, from quality control of raw data and genome alignment to variant calling, annotation, and generation of reports for the interpretation and prioritization of results; (iii) is highly adaptable, as it can be deployed and run via either a graphical user interface for non-bioinformaticians or a command-line tool for personal computer usage; (iv) is scalable, as it can be executed in parallel as a Snakemake workflow; and (v) is computationally efficient, minimizing RAM and CPU time requirements. AVAILABILITY AND IMPLEMENTATION: DNAscan2 is implemented in Python3 and is available at https://github.com/KHP-Informatics/DNAscanv2

    Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline

    BACKGROUND: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations. RESULTS: We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species. CONCLUSIONS: The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA
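
    The performance metrics named in this abstract can be illustrated with a minimal sketch. It assumes the standard confusion-matrix definitions and assumes that counts are tallied per base pair against a curated annotation; the abstract itself does not spell out the formulas, so the function name `annotation_metrics` and the example counts below are hypothetical.

```python
def annotation_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumption: tp/fp/tn/fn are numbers of base pairs that a tool labels
    as TE or non-TE, compared with a curated reference annotation.
    """
    def ratio(numerator, denominator):
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),           # true TE bases recovered
        "specificity": ratio(tn, tn + fp),           # non-TE bases left unlabelled
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),             # labelled bases that are real TE
        "fdr": ratio(fp, tp + fp),                   # equals 1 - precision
        "f1": ratio(2 * tp, 2 * tp + fp + fn),       # harmonic mean of precision and sensitivity
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in annotation_metrics(tp=80, fp=20, tn=880, fn=20).items():
        print(f"{name}: {value:.3f}")
```

    Reporting several metrics together matters because TE and non-TE sequence are rarely balanced in a genome: a tool can score high accuracy or specificity on a TE-poor genome while still missing most elements, which sensitivity and F1 would reveal.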

    K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes

    The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
    Author affiliations: Bruno Contreras-Moreira, Guy Naamati, Carlos García Girón, James E. Allen, and Paul Flicek: European Bioinformatics Institute, European Molecular Biology Laboratory, United Kingdom. Carla Valeria Filippi: Instituto Nacional de Tecnología Agropecuaria (INTA), Instituto de Agrobiotecnología y Biología Molecular (IABIMO), Argentina; Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina; European Bioinformatics Institute, European Molecular Biology Laboratory, United Kingdom.
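
    The first step of the two-step approach in this entry, calling repeats by k-mer counting, can be sketched with a toy example. This is not Red's actual algorithm and is not taken from the plant-scripts repository: the function name `kmer_mask` and the parameters `k` and `min_count` are hypothetical, and the sketch simply soft-masks (lowercases) every base covered by a k-mer that occurs at least `min_count` times in the sequence.

```python
from collections import Counter


def kmer_mask(seq, k=4, min_count=3):
    """Soft-mask bases covered by frequently occurring k-mers.

    Toy illustration only: real repeat detectors use much longer k-mers,
    whole-genome counts, and more elaborate models for calling repeats.
    """
    seq = seq.upper()
    n_kmers = max(len(seq) - k + 1, 0)

    # Pass 1: count every overlapping k-mer in the sequence.
    counts = Counter(seq[i:i + k] for i in range(n_kmers))

    # Pass 2: flag all positions covered by a k-mer seen >= min_count times.
    masked = [False] * len(seq)
    for i in range(n_kmers):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True

    # Lowercase marks masked (putatively repetitive) bases.
    return "".join(base.lower() if flag else base
                   for base, flag in zip(seq, masked))


if __name__ == "__main__":
    print(kmer_mask("ACGTACGTACGTTTGCA", k=4, min_count=3))
```

    Counting k-mers needs only linear passes over the sequence, which is consistent with the speed advantage over homology searches that the abstract reports; the masked segments would then be compared against a curated library in the second step to assign them a class.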

    T-lex3: An accurate tool to genotype and estimate population frequencies of transposable elements using the latest short-read whole genome sequencing data

    Motivation: Transposable elements (TEs) constitute a significant proportion of the majority of genomes sequenced to date. TEs are responsible for a considerable fraction of the genetic variation within and among species. Accurate genotyping of TEs in genomes is therefore crucial for a complete identification of the genetic differences among individuals, populations and species. Results: In this work, we present a new version of T-lex, a computational pipeline that accurately genotypes and estimates the population frequencies of reference TE insertions using short-read high-throughput sequencing data. In this new version, we have re-designed the T-lex algorithm to integrate the BWA-MEM short-read aligner, which is one of the most accurate short-read mappers and can be run on longer short reads (e.g. reads >150 bp). We have added new filtering steps to increase the accuracy of the genotyping, and new parameters that allow the user to control both the minimum and maximum number of reads, and the minimum number of strains, required to genotype a TE insertion. We also show for the first time that T-lex3 provides accurate TE calls in a plant genome. Availability and implementation: To test the accuracy of T-lex3, we called 1630 individual TE insertions in Drosophila melanogaster, 1600 individual TE insertions in humans, and 3067 individual TE insertions in the rice genome. We showed that this new version of T-lex is a broadly applicable and accurate tool for genotyping and estimating TE frequencies in organisms with different genome sizes and different TE contents. T-lex3 is available at GitHub: https://github.com/GonzalezLab/T-lex3

    In Silico Approach For Characterization And Comparison Of Repeats In The Genomes Of Oil And Date Palms

    Transposable elements (TEs) are mobile genetic elements present in almost all eukaryotic genomes. Because of their typical patterns of repetition, their discovery and characterization demand analysis by various bioinformatics software. Probably as a result of the need for such complex analysis, many publicly available genomes do not yet have these elements annotated. In this study, a de novo and homology-based identification of TEs and microsatellites was performed using genomic data from 3 palm species: Elaeis oleifera (American oil palm, v.1, Embrapa, unpublished; v.8, Malaysian Palm Oil Board [MPOB], public), Elaeis guineensis (African oil palm, v. 5, MPOB, public), and Phoenix dactylifera (date palm). The estimated total coverage of TEs was 50.96% (523 572 kb) and 42.31% (593 463 kb), 39.41% (605 015 kb), and 33.67% (187 361 kb), respectively. A total of 155 726 microsatellite loci were identified in the genomes of oil and date palms. This is the first detailed description of repeats in the genomes of oil and date palms. A relatively high diversity and abundance of TEs were found in the genomes, opening a range of further opportunities for applied research in these genera. The most notable opportunity is the development of molecular markers (mainly simple sequence repeats), which may be immediately applied in breeding programs of those species to support the selection of superior genotypes and to enhance knowledge of the genetic structure of the breeding and natural populations.
    Funding: Coordination for the Improvement of Higher Education Personnel (CAPES) Foundation within the Ministry of Education in Brazil, via the Graduate Program in Plant Biotechnology, Federal University of Lavras (UFLA).