8 research outputs found

    RGAAT: A Reference-based Genome Assembly and Annotation Tool for New Genomes and Upgrade of Known Genomes

    The rapid development of high-throughput sequencing technologies has dramatically reduced the cost and time required for de novo genome sequencing and genome resequencing projects, and new genome sequences are released every week. The resulting plethora of updated genome assemblies creates a need for version-dependent annotation files and other compatible public datasets for downstream analysis. To handle these tasks efficiently, we developed the reference-based genome assembly and annotation tool (RGAAT), a flexible toolkit for resequencing-based consensus building and annotation update. On the four DNA-seq datasets tested in this study, RGAAT detects sequence variants with precision, specificity, and sensitivity comparable to GATK, and with higher precision and specificity than Freebayes and SAMtools. RGAAT can also identify sequence variants based on cross-cultivar or cross-version genomic alignments. Unlike GATK and SAMtools/BCFtools, RGAAT builds the consensus sequence by taking into account the true allele frequency. Finally, RGAAT generates a coordinate conversion file between the reference and query genomes using sequence variants and supports annotation file transfer. Compared to the rapid annotation transfer tool (RATT), RGAAT displays better performance characteristics for annotation transfer between different genome assemblies, strains, and species. In addition, RGAAT can be used for genome modification, genome comparison, and coordinate conversion. RGAAT is available at https://sourceforge.net/projects/rgaat/ and https://github.com/wushyer/RGAAT_v2 at no cost. Keywords: Variant identification, Genome assembly, Genome annotation, Genome comparison

    Complete Sequence and Analysis of Coconut Palm (Cocos nucifera) Mitochondrial Genome

    Coconut (Cocos nucifera L.), a member of the palm family (Arecaceae), is one of the most economically important crops in the tropics, serving as an important source of food, drink, fuel, medicine, and construction material. Here we report an assembly of the coconut (C. nucifera, Oman local Tall cultivar) mitochondrial (mt) genome based on next-generation sequencing data. This genome, 678,653 bp in length and 45.5% in GC content, encodes 72 proteins, 9 pseudogenes, 23 tRNAs, and 3 ribosomal RNAs. Within the assembly, we find that the chloroplast (cp) derived regions account for 5.07% of the total assembly length, including 13 proteins, 2 pseudogenes, and 11 tRNAs. The mt genome has a relatively large fraction of repeat content (17.26%), including both forward (tandem) and inverted (palindromic) repeats. Sequence variation analysis shows that the Ti/Tv ratio of the mt genome is lower than that of the nuclear genome and than the neutral expectation. By combining public RNA-Seq data for coconut, we identify 734 RNA editing sites supported by at least two datasets. In summary, our data provide the second complete mt genome sequence in the family Arecaceae, essential for further investigations on mitochondrial biology of seed plants.
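    The Ti/Tv ratio mentioned in the abstract above is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other substitutions). The following is only a minimal sketch of that arithmetic, not the authors' pipeline; the function name `ti_tv_ratio` and the toy SNP list are hypothetical and chosen for illustration.

```python
# Minimal sketch: transition/transversion (Ti/Tv) ratio from (ref, alt) SNP pairs.
# The function name and the example data are illustrative assumptions,
# not taken from the paper.

# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) bases."""
    ti = 0
    tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution, skip
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


# Toy input: two transitions (A>G, C>T) and two transversions (A>C, G>T).
example_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(example_snps))
```

    For reference, if all twelve possible single-base substitutions were equally likely, four would be transitions and eight transversions, giving a ratio of 0.5; whether that is the baseline the authors mean by "neutral expectation" is not stated in the abstract.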

    Circular display of C. nucifera mt genome.

    We display (from outside to inside): physical map scaled in kb; coding sequences transcribed in the clockwise and counterclockwise directions (nad in red; cob, matR and mttB in green; cox in blue; atp in purple; ccm in orange; rpl in yellow; rps in dark red; rRNA in dark green; tRNA in dark blue; orf in dark purple; and others in black); chloroplast-derived regions (green); repeats (forward repeats in green, palindromic repeats in red and tandem repeats in blue); RNA editing sites (synonymous in green and non-synonymous in red); gene conservation scores (black); percentage of properly paired HiSeq mate-pair (MP) reads with insert sizes of 5 kb and 8 kb (blue); and the four regions (thick lines indicate IRs and thin lines indicate LSC and SSC). * indicates pseudogenes.