13 research outputs found

Genes and genomes, an imperfect world: comparison of gene annotations of two Bos taurus draft assemblies

Author: Florea Liliana
Salzberg Steven L
Souvorov Alexander
Publication venue: BioMed Central
Publication date: 01/01/2010
Crossref

PubMed Central

Splign: algorithms for computing spliced alignments with identification of paralogs

Author: Kapustin Yuri
Lipman David
Souvorov Alexander
Tatusova Tatiana
Publication venue: BioMed Central
Publication date: 01/01/2008
Crossref

Springer - Publisher Connector

PubMed Central

Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies

Author: Alexander Souvorov
AV Zimin
DA Wheeler
DL Wheeler
E Pennisi
IH Consortium
J Wang
JC Venter
K Eilbeck
K Liolios
L Florea
L Florea
Liliana Florea
M Clamp
M Nowrousian
M Pertea
MC Schatz
Najib M. El-Sayed
R Li
R Li
RA Gibbs
SF Altschul
SF Altschul
SL Salzberg
Steven L. Salzberg
TD Wu
Theodore S. Kalbfleisch
WJ Kent
WR Pearson
Publication venue: Public Library of Science
Publication date: 22/06/2011
Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos taurus genome assemblies built from the same data but with significant improvements in the later assembly. The same annotation software was used for annotating both sequences. While some annotation differences are expected even between high-quality assemblies such as these, we found that a staggering 40% of the genes (>9,500) varied significantly between assemblies, due in part to the availability of new gene evidence but primarily to genome mis-assembly events and local sequence variations. For instance, although the later assembly is generally superior, 660 protein coding genes in the earlier assembly are entirely missing from the later genome's annotation, and approximately 3,600 (15%) of the genes have complex structural differences between the two assemblies. In addition, 12–20% of the predicted proteins in both assemblies have relatively large sequence differences when compared to their RefSeq models, and 6–15% of bovine dbSNP records are unrecoverable in the two assemblies. Our findings highlight the consequences of genome assembly quality on gene and SNP annotation and argue for continued improvements in any draft genome sequence. We also found that tracking a gene between different assemblies of the same genome is surprisingly difficult, due to the numerous changes, both small and large, that occur in some genes. As a side benefit, our analyses helped us identify many specific loci for improvement in the Bos taurus genome assembly.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Splign: algorithms for computing spliced alignments with identification of paralogs

Author: Kapustin Yuri
Lipman David
Souvorov Alexander
Tatusova Tatiana
Publication venue: Springer Science and Business Media LLC
Publication date: 01/05/2008
Background: The computation of accurate alignments of cDNA sequences against a genome is at the foundation of modern genome annotation pipelines. Several factors such as presence of paralogs, small exons, non-consensus splice signals, sequencing errors and polymorphic sites pose recognized difficulties to existing spliced alignment algorithms. Results: We describe a set of algorithms behind a tool called Splign for computing cDNA-to-Genome alignments. The algorithms include a high-performance preliminary alignment, a compartment identification based on a formally defined model of adjacent duplicated regions, and a refined sequence alignment. In a series of tests, Splign has produced more accurate results than other tools commonly used to compute spliced alignments, in a reasonable amount of time. Conclusion: Splign's ability to deal with various issues complicating the spliced alignment problem makes it a helpful tool in eukaryotic genome annotation processes and alternative splicing studies. Its performance is enough to align the largest currently available pools of cDNA data such as the human EST set on a moderate-sized computing cluster in a matter of hours. The duplications identification (compartmentization) algorithm can be used independently in other areas such as the study of pseudogenes. Reviewers: This article was reviewed by Steven Salzberg, Arcady Mushegian and Andrey Mironov (nominated by Mikhail Gelfand).

Directory of Open Access Journals

Mapping rates of annotated gene sequences between the UMD2 and UMD3 assemblies.

Author: Alexander Souvorov (349352)
Liliana Florea (242468)
Steven L. Salzberg (70641)
Theodore S. Kalbfleisch (186117)
Plotted values represent the numbers of genes in one annotation that have coverage x or larger in the other genome, where coverage refers to the proportion of the transcript covered. All: all alignments of a transcript are used to compute coverage; best: only the best alignment is used. For example, the red line shows that just over 23,000 genes from UMD2 have at least 50% of their sequence (coverage 0.5) aligned to a corresponding gene in UMD3. The total number of annotated protein-coding genes is 23,221 for UMD2, and 21,342 for UMD3.
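The "coverage x or larger" counts described in this caption amount to a cumulative tally over per-gene coverage values. The following is a minimal sketch of that tally, not the authors' code; the function name `genes_at_or_above` and the toy coverage values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def genes_at_or_above(coverages, thresholds):
    """For each threshold x, count genes whose coverage is >= x.

    `coverages` holds one value in [0, 1] per gene (the proportion of
    the transcript covered); `thresholds` are the x values to evaluate.
    """
    ordered = sorted(coverages)
    total = len(ordered)
    # bisect_left returns how many values are strictly below x,
    # so the remainder is the number of genes with coverage >= x.
    return [total - bisect_left(ordered, x) for x in thresholds]


# Toy input (hypothetical values, not data from the paper).
toy_coverages = [1.0, 0.95, 0.5, 0.49, 0.0]
print(genes_at_or_above(toy_coverages, [0.0, 0.5, 1.0]))  # [5, 3, 1]
```

Under the caption's "all" versus "best" distinction, the same tally would presumably be run twice, once on coverage computed from all alignments of a transcript and once on coverage from the best alignment only.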

FigShare

An assembly error at the AQR gene locus in UMD3 creates a significantly altered protein.

Author: Alexander Souvorov (349352)
Liliana Florea (242468)
Steven L. Salzberg (70641)
Theodore S. Kalbfleisch (186117)
(A) An incorrectly inverted contig (red) moves two exons (exons 14 and 15) to the wrong strand, causing the annotation software to miss them. Instead, it used low-quality alignments on the wrong strand, creating frameshifts in the predicted protein sequence that contained multiple premature stop codons. (B) Sequence alignment of the predicted AQR protein (conceptual translation) and its RefSeq model.

FigShare

Examples showing how the same gene annotated on two assemblies completely fails to overlap.

Author: Alexander Souvorov (349352)
Liliana Florea (242468)
Steven L. Salzberg (70641)
Theodore S. Kalbfleisch (186117)
A) The alignment of the RefSeq DNA sequence for INTS8 spans the entire gene on UMD2, but is truncated on UMD3. The figure shows how the INTS8 sequence aligns to two distinct locations on UMD3, a longer, primary alignment containing exons 1–16 and a shorter one containing exons 17–27. The annotation system chose the shorter alignment (on the left) for the UMD3 annotation, which is thus disjoint from the primary alignment of the UMD2 annotation of INTS8. B) The gene ENTPD6 is fragmented in both assemblies, and different segments were used by the annotation software in each case. Again, the primary alignment on UMD3 and the local annotation are distinct. C) The gene ZNF813 has multiple matches on UMD3, but the best match and the annotated gene are disjoint.

FigShare

Best matches for UMD2 transcripts in the UMD3 annotation determined based on compatibility of exon-intron structures.

Author: Alexander Souvorov (349352)
Liliana Florea (242468)
Steven L. Salzberg (70641)
Theodore S. Kalbfleisch (186117)
Comparisons allow a margin of error V = 20 at exon and intron boundaries.
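A boundary comparison with a tolerance of this kind can be sketched as follows. This is an illustrative assumption rather than the authors' definition of compatibility: it supposes that both transcripts are given as exon (start, end) pairs in a shared coordinate system and that structures match only if they have the same number of exons. The function name `boundaries_match` is hypothetical, and the caption does not state the units of V.

```python
V = 20  # margin from the caption; its units are not stated there


def boundaries_match(exons_a, exons_b, margin=V):
    """Return True if two exon lists agree within `margin` at every boundary.

    Each argument is a list of (start, end) pairs in a shared coordinate
    system (an assumption made for this sketch). Intron boundaries are
    implied by the exon ends and the following exon starts, so checking
    exon boundaries covers them as well.
    """
    if len(exons_a) != len(exons_b):
        return False
    return all(
        abs(start_a - start_b) <= margin and abs(end_a - end_b) <= margin
        for (start_a, end_a), (start_b, end_b) in zip(exons_a, exons_b)
    )


# Toy example (hypothetical coordinates): every boundary differs by <= 20.
print(boundaries_match([(100, 200), (300, 400)], [(110, 195), (320, 400)]))
```

A stricter or looser notion of compatibility (for example, tolerating missing terminal exons) would need a different rule; nothing in the caption settles that choice.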

FigShare

Molecular epidemiology and antimicrobial resistance of Clostridioides difficile detected in chicken, soil and human samples from Zimbabwe

Author: Abdel-Glil
al Saif
Alexander Mellmann
Alvarez-Perez
Bandelj
Barbara Gärtner
Becker
Berger
Berger
Berger
Bletz
Cairns
Center for Disease Dynamics
Cheng
Clifford Simango
Collins
Davies
Davies
Djebbar
Du
Fabian K. Berger
Freeman
Furuya-Kanamori
Färber
Gerding
Goorhuis
Goorhuis
Harvey
He
Hensgens
Hussain
Indra
Janssen
Knight
Kotila
Kullin
Kullin
Lutz von Müller
Magistrali
Markus Bischoff
Matsuki
Moono
Natarajan
Onwueme
Oyaro
Plants-Paris
Razmyar
Richards
Samie
Schneeberg
Seugendo
Simango
Simango
Souvorov
Sören L. Becker
Tenover
Tickler
Varshney
Weber
Weese
Zamani
Publication venue: Elsevier BV

Crossref