13 research outputs found

    Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies

    Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos taurus genome assemblies built from the same data but with significant improvements in the later assembly. The same annotation software was used for annotating both sequences. While some annotation differences are expected even between high-quality assemblies such as these, we found that a staggering 40% of the genes (>9,500) varied significantly between assemblies, due in part to the availability of new gene evidence but primarily to genome mis-assembly events and local sequence variations. For instance, although the later assembly is generally superior, 660 protein coding genes in the earlier assembly are entirely missing from the later genome's annotation, and approximately 3,600 (15%) of the genes have complex structural differences between the two assemblies. In addition, 12–20% of the predicted proteins in both assemblies have relatively large sequence differences when compared to their RefSeq models, and 6–15% of bovine dbSNP records are unrecoverable in the two assemblies. Our findings highlight the consequences of genome assembly quality on gene and SNP annotation and argue for continued improvements in any draft genome sequence. We also found that tracking a gene between different assemblies of the same genome is surprisingly difficult, due to the numerous changes, both small and large, that occur in some genes. As a side benefit, our analyses helped us identify many specific loci for improvement in the Bos taurus genome assembly.

    Splign: algorithms for computing spliced alignments with identification of paralogs

    Background: The computation of accurate alignments of cDNA sequences against a genome is at the foundation of modern genome annotation pipelines. Several factors, such as the presence of paralogs, small exons, non-consensus splice signals, sequencing errors and polymorphic sites, pose recognized difficulties to existing spliced alignment algorithms. Results: We describe a set of algorithms behind a tool called Splign for computing cDNA-to-genome alignments. The algorithms include a high-performance preliminary alignment, a compartment identification based on a formally defined model of adjacent duplicated regions, and a refined sequence alignment. In a series of tests, Splign has produced more accurate results than other tools commonly used to compute spliced alignments, in a reasonable amount of time. Conclusion: Splign's ability to deal with various issues complicating the spliced alignment problem makes it a helpful tool in eukaryotic genome annotation processes and alternative splicing studies. Its performance is sufficient to align the largest currently available pools of cDNA data, such as the human EST set, on a moderate-sized computing cluster in a matter of hours. The duplication identification (compartmentization) algorithm can be used independently in other areas such as the study of pseudogenes. Reviewers: This article was reviewed by Steven Salzberg, Arcady Mushegian and Andrey Mironov (nominated by Mikhail Gelfand).

    Mapping rates of annotated gene sequences between the UMD2 and UMD3 assemblies.

    Plotted values represent the numbers of genes in one annotation that have coverage x or larger in the other genome, where coverage refers to the proportion of the transcript covered. All: all alignments of a transcript are used to compute coverage; best: only the best alignment is used. For example, the red line shows that just over 23,000 genes from UMD2 have at least 50% of their sequence (coverage 0.5) aligned to a corresponding gene in UMD3. The total number of annotated protein-coding genes is 23,221 for UMD2, and 21,342 for UMD3.
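
    The "all" versus "best" distinction in this caption can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the half-open interval representation, and the choice of "best" as the single alignment covering the most transcript bases are all assumptions made for illustration, since the caption does not define how the best alignment is selected.

```python
from typing import Dict, List, Tuple

# Assumption: an aligned block is a half-open interval [start, end)
# in transcript coordinates; an alignment is a list of such blocks.
Interval = Tuple[int, int]
Alignment = List[Interval]


def merged_length(intervals: List[Interval]) -> int:
    """Number of transcript bases covered by the union of the intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the running block: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coverage_all(alignments: List[Alignment], transcript_length: int) -> float:
    """'All' mode: pool the blocks of every alignment of the transcript."""
    pooled = [block for aln in alignments for block in aln]
    return merged_length(pooled) / transcript_length


def coverage_best(alignments: List[Alignment], transcript_length: int) -> float:
    """'Best' mode (assumed definition): the alignment covering the most bases."""
    if not alignments:
        return 0.0
    return max(merged_length(aln) for aln in alignments) / transcript_length


def genes_at_or_above(coverages: Dict[str, float], x: float) -> int:
    """One point on the plotted curve: genes with coverage x or larger."""
    return sum(1 for c in coverages.values() if c >= x)


# Hypothetical transcript of 1,000 bases split across two alignments.
split_gene = [[(0, 600)], [(600, 1000)]]
all_cov = coverage_all(split_gene, 1000)
best_cov = coverage_best(split_gene, 1000)
```

    With these made-up numbers, a transcript whose two halves align to different locations is fully covered in "all" mode but only 60% covered in "best" mode, which is why the two curves in such a plot can diverge for fragmented genes.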

    An assembly error at the AQR gene locus in UMD3 creates a significantly altered protein.

    (A) An incorrectly inverted contig (red) moves two exons (exons 14 and 15) to the wrong strand, causing the annotation software to miss them. Instead, it used low-quality alignments on the wrong strand, creating frameshifts and multiple premature stop codons in the predicted protein sequence. (B) Sequence alignment of the predicted AQR protein (conceptual translation) and its RefSeq model.

    Examples showing how the same gene annotated on two assemblies completely fails to overlap.

    (A) The alignment of the RefSeq DNA sequence for INTS8 spans the entire gene on UMD2, but is truncated on UMD3. The figure shows how the INTS8 sequence aligns to two distinct locations on UMD3, a longer, primary alignment containing exons 1–16 and a shorter one containing exons 17–27. The annotation system chose the shorter alignment (on the left) for the UMD3 annotation, which is thus disjoint from the primary alignment of the UMD2 annotation of INTS8. (B) The gene ENTPD6 is fragmented in both assemblies, and different segments were used by the annotation software in each case. Again, the primary alignment on UMD3 and the local annotation are distinct. (C) The gene ZNF813 has multiple matches on UMD3, but the best match and the annotated gene are disjoint.