Search CORE

8 research outputs found

Improving Genome Assembly

Author: Ustun Cevat
Publication date: 01/01/2005

We present a reliable, easy to implement algorithm to generate a set of highly reliable overlaps based on identifying repeat k-mers. Our method is coverage independent. Whereas traditionally reads have been trimmed to have expected error rates of 2%, we find our error correction allows extending usable sequence in reads to 16% trimming. We use a version of the Phrap assembly program that uses only overlaps computed by the UMD overlapper, called PhrapUMD. We integrate the UMD algorithms with Baylor's ATLAS assembler applied to Rattus norvegicus. Starting with the same data as the Nov. 2002 ATLAS assembly, we compare our results to 4.5 Mbp of rat sequence in 21 BACs that have been finished. We find that after extension and error correction, (i) the reads are 30% longer than reads trimmed to 2%; (ii) the average error rate across the extended reads is about 3 in 10,000 bases, with 88% of the extended reads matching finished sequence exactly across their entire length; and (iii) PhrapUMD with these reads and our reliable overlaps produces a draft assembly of the rat which has no misassemblies and increases the coverage of finished sequence from 92.2% to 95.7%, while simultaneously reducing the base error rate for quality 20 or higher bases from 1.50 to 0.87 errors per 10,000

CiteSeerX

Digital Repository at the University of Maryland

Improving Phrap-Based Assembly of the Rat Using “Reliable” Overlaps

Author: AL Delcher
Aleksey V. Zimin
B Ewing
Brian R. Hunt
Cevat Ustun
EW Myers
GG Sutton
James R. White
James Yorke
JC Mullikin
M Roberts
Michael Roberts
Neil Hall
P Green
P Havlak
Paul Havlak
S Aparicio
S Batzoglou
S Schwartz
SL Salzberg
Wayne Hayes
X Huang
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2008

The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. We present a novel algorithm to generate a set of “reliable” overlaps based on identifying repeat k-mers. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. We call this version PhrapUMD. Integrating PhrapUMD and our “reliable-overlap” algorithm with the Baylor College of Medicine assembler, Atlas, we assemble the BACs from the Rattus norvegicus genome project. Starting with the same data as the Nov. 2002 Atlas assembly, we compare our results and the Atlas assembly to the 4.3 Mb of rat sequence in the 21 BACs that have been finished. Our version of the draft assembly of the 21 BACs increases the coverage of finished sequence from 93.4% to 96.3%, while simultaneously reducing the base error rate from 4.5 to 1.1 errors per 10,000 bases. There are a number of ways of assessing the relative merits of assemblies when the finished sequence is available. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better. The UMD Overlapper with options for reliable overlaps is available from the authors at http://www.genome.umd.edu. We also provide the changes to the Phrap source code enabling it to use only the reliable overlaps
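The "seven times better" figure can be checked from the numbers in the abstract. The sketch below assumes that "sequence missed" means 100% minus the coverage of finished sequence, and that the error rate is taken in errors per 10,000 bases; the function name `quality_score` is hypothetical and only illustrates the stated proportionality.

```python
def quality_score(errors_per_10k: float, coverage_pct: float) -> float:
    """Quality taken as the inverse of (error rate x sequence missed).

    Assumption: "sequence missed" is 100% minus the coverage of
    finished sequence reported in the abstract.
    """
    missed_pct = 100.0 - coverage_pct
    return 1.0 / (errors_per_10k * missed_pct)


# Figures quoted in the abstract for the 21 finished BACs.
atlas = quality_score(errors_per_10k=4.5, coverage_pct=93.4)
umd_atlas = quality_score(errors_per_10k=1.1, coverage_pct=96.3)

# Ratio of the two scores: (4.5 * 6.6) / (1.1 * 3.7)
improvement = umd_atlas / atlas
print(f"Improvement factor: {improvement:.1f}")
```

Under these assumptions the ratio is (4.5 × 6.6) / (1.1 × 3.7) ≈ 7.3, consistent with the abstract's rounded claim of seven.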

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Caltech Authors

Improving Phrap-based assembly of the rat using "reliable" overlaps.

Author: Havlak Paul
Hayes Wayne
Hunt Brian R
Roberts Michael
Ustun Cevat
White James R
Yorke James
Zimin Aleksey V
Publication venue: eScholarship, University of California
Publication date: 01/03/2008

The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. We present a novel algorithm to generate a set of "reliable" overlaps based on identifying repeat k-mers. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. We call this version PhrapUMD. Integrating PhrapUMD and our "reliable-overlap" algorithm with the Baylor College of Medicine assembler, Atlas, we assemble the BACs from the Rattus norvegicus genome project. Starting with the same data as the Nov. 2002 Atlas assembly, we compare our results and the Atlas assembly to the 4.3 Mb of rat sequence in the 21 BACs that have been finished. Our version of the draft assembly of the 21 BACs increases the coverage of finished sequence from 93.4% to 96.3%, while simultaneously reducing the base error rate from 4.5 to 1.1 errors per 10,000 bases. There are a number of ways of assessing the relative merits of assemblies when the finished sequence is available. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better. The UMD Overlapper with options for reliable overlaps is available from the authors at http://www.genome.umd.edu. We also provide the changes to the Phrap source code enabling it to use only the reliable overlaps

eScholarship - University of California

Two alignments of assemblies to the finished sequence of BAC GMEZ.

Author: Aleksey V. Zimin (242456)
Brian R. Hunt (378013)
Cevat Ustun (378014)
James R. White (159367)
James Yorke (378015)
Michael Roberts (312937)
Paul Havlak (303963)
Wayne Hayes (378011)

The original Atlas assembly created a single scaffold. The UMD+Atlas assembly of GMEZ assembled a 26 Kb section from the middle of the bigger scaffold into a separate Scaffold 1. Note that the large scaffold gap in Scaffold 2 is estimated correctly. This UMD+Atlas assembly used reliable overlaps. This was the BAC that gave UMD+Atlas the most trouble and the only case where the UMD+Atlas assembly had two scaffolds.

The Francis Crick Institute

Two alignments of assemblies to the finished sequence of BAC GQQD.

Author: Aleksey V. Zimin (242456)
Brian R. Hunt (378013)
Cevat Ustun (378014)
James R. White (159367)
James Yorke (378015)
Michael Roberts (312937)
Paul Havlak (303963)
Wayne Hayes (378011)

The original Atlas assembly created two scaffolds covering only 73.2% of the finished sequence. Note the misplaced 20 Kb segment in the Atlas assembly. The UMD+Atlas assembly of GQQD correctly places the 20 Kb section that was originally misplaced and creates a single scaffold of the BAC covering 93.3% of the finished sequence. This UMD+Atlas assembly used reliable overlaps. This was the BAC that gave Atlas the most trouble.

The Francis Crick Institute

Comparison of the three assemblies for the subset of the 21 BACs from the Rat genome.

Author: Aleksey V. Zimin (242456)
Brian R. Hunt (378013)
Cevat Ustun (378014)
James R. White (159367)
James Yorke (378015)
Michael Roberts (312937)
Paul Havlak (303963)
Wayne Hayes (378011)

The “original Atlas with UMD Plausible” and “original Atlas with UMD reliable” assembly results were obtained by replacing Phrap with PhrapUMD, using UMD plausible and reliable overlaps respectively. The best assembly (the bottom line) uses PhrapUMD and UMD reliable overlaps with the 2-pass approach described in the “Methods” section (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0001836#s2). It has almost 3% more sequence matching finished sequence than the original Atlas with Phrap, at less than 1/4 the original base error rate.

The Francis Crick Institute

Illustration of the technique that identifies reliable overlaps: (a) a scenario where a genome contains two copies of a repeat region R.

Author: Aleksey V. Zimin (242456)
Brian R. Hunt (378013)
Cevat Ustun (378014)
James R. White (159367)
James Yorke (378015)
Michael Roberts (312937)
Paul Havlak (303963)
Wayne Hayes (378011)

The correct positions of reads A, B, C and D are shown. (b) A “fork” in the overlaps. (c) A scenario where reads A and D have the same sequencing error at the same base.

The Francis Crick Institute