    Improving Genome Assembly

    We present a reliable, easy-to-implement algorithm that generates a set of highly reliable overlaps by identifying repeat k-mers. Our method is coverage independent. Whereas reads have traditionally been trimmed to an expected error rate of 2%, we find that our error correction allows the usable sequence in reads to be extended to 16% trimming. We use a version of the Phrap assembly program, called PhrapUMD, that uses only overlaps computed by the UMD overlapper. We integrate the UMD algorithms with Baylor's ATLAS assembler applied to Rattus norvegicus. Starting with the same data as the Nov. 2002 ATLAS assembly, we compare our results to 4.5 Mbp of rat sequence in 21 BACs that have been finished. We find that after extension and error correction, (i) the reads are 30% longer than reads trimmed to 2%; (ii) the average error rate across the extended reads is about 3 in 10,000 bases, with 88% of the extended reads matching finished sequence exactly across their entire length; and (iii) PhrapUMD with these reads and our reliable overlaps produces a draft assembly of the rat which has no misassemblies and increases the coverage of finished sequence from 92.2% to 95.7%, while simultaneously reducing the base error rate for quality 20 or higher bases from 1.50 to 0.87 errors per 10,000.
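
    The abstract does not spell out how repeat k-mers are identified, so the following is only a rough sketch of the general idea and not the authors' algorithm. It assumes a k-mer counts as a "repeat" when it appears in at least `min_reads` distinct reads, and that an overlap is trusted only if it contains at least one non-repeat k-mer. The function names, the fixed cutoff, and the toy reads are all hypothetical; in particular, a fixed cutoff is not coverage independent, unlike the method described above.

```python
from collections import Counter


def kmers(seq, k):
    """Return all length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def repeat_kmers(reads, k, min_reads):
    """Collect k-mers seen in at least min_reads distinct reads.

    min_reads is a hypothetical fixed cutoff chosen for illustration only.
    """
    counts = Counter()
    for read in reads:
        # Use a set so each read contributes at most once per k-mer.
        counts.update(set(kmers(read, k)))
    return {kmer for kmer, n in counts.items() if n >= min_reads}


def overlap_is_reliable(overlap_seq, repeats, k):
    """Assumed rule: an overlap is reliable if some k-mer in it is not a repeat.

    Overlaps shorter than k contain no k-mers and are reported as unreliable.
    """
    return any(kmer not in repeats for kmer in kmers(overlap_seq, k))


# Toy reads: "CCCC" occurs in three reads, mimicking a repeat region.
reads = ["AAAACCCC", "CCCCGGGG", "CCCCTTTT", "TTTTAAAA"]
repeats = repeat_kmers(reads, k=4, min_reads=3)

in_repeat = overlap_is_reliable("CCCC", repeats, k=4)   # lies entirely in the repeat
outside = overlap_is_reliable("TTTT", repeats, k=4)     # seen in only two reads
```

    In this toy input the overlap that lies entirely inside the repeated sequence is rejected, while the overlap seen in only two reads is kept.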

    Improving Phrap-Based Assembly of the Rat Using “Reliable” Overlaps

    The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. We present a novel algorithm to generate a set of “reliable” overlaps based on identifying repeat k-mers. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. We call this version PhrapUMD. Integrating PhrapUMD and our “reliable-overlap” algorithm with the Baylor College of Medicine assembler, Atlas, we assemble the BACs from the Rattus norvegicus genome project. Starting with the same data as the Nov. 2002 Atlas assembly, we compare our results and the Atlas assembly to the 4.3 Mb of rat sequence in the 21 BACs that have been finished. Our version of the draft assembly of the 21 BACs increases the coverage of finished sequence from 93.4% to 96.3%, while simultaneously reducing the base error rate from 4.5 to 1.1 errors per 10,000 bases. There are a number of ways of assessing the relative merits of assemblies when the finished sequence is available. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better. The UMD Overlapper with options for reliable overlaps is available from the authors at http://www.genome.umd.edu. We also provide the changes to the Phrap source code enabling it to use only the reliable overlaps.
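
    The "seven times better" figure can be checked from the numbers quoted in the abstract. The short calculation below is a sketch that assumes "sequence missed" means 100% minus the coverage of finished sequence; the function name `quality_score` is introduced here for illustration only.

```python
def quality_score(coverage_pct, errors_per_10k):
    """Inverse of (error rate x sequence missed), per the abstract's definition.

    Assumes sequence missed = 100 - coverage of finished sequence (in percent).
    """
    missed_pct = 100.0 - coverage_pct
    return 1.0 / (errors_per_10k * missed_pct)


# Figures quoted in the abstract for the 21 finished BACs.
atlas = quality_score(coverage_pct=93.4, errors_per_10k=4.5)      # product 4.5 * 6.6
umd_atlas = quality_score(coverage_pct=96.3, errors_per_10k=1.1)  # product 1.1 * 3.7

improvement = umd_atlas / atlas
print(f"Quality ratio: {improvement:.1f}")
```

    Under that assumption the ratio is (4.5 × 6.6) / (1.1 × 3.7) = 29.7 / 4.07, or roughly 7.3, consistent with the abstract's claim.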

    Two alignments of assemblies to the finished sequence of BAC GMEZ.

    The original Atlas assembly created a single scaffold. The UMD+Atlas assembly of GMEZ assembled a 26 Kb section from the middle of the bigger scaffold into a separate Scaffold 1. Note that the large scaffold gap in Scaffold 2 is estimated correctly. This UMD+Atlas assembly used reliable overlaps. This was the BAC that gave UMD+Atlas the most trouble and the only case where the UMD+Atlas assembly had two scaffolds.

    Illustration of the technique that identifies reliable overlaps: (a) a scenario where a genome contains two copies of a repeat region R.

    The correct positions of reads A, B, C and D are shown. (b) A “fork” in the overlaps. (c) A scenario where reads A and D have the same sequencing error at the same base.

    Two alignments of assemblies to the finished sequence of BAC GQQD.

    The original Atlas assembly created two scaffolds covering only 73.2% of the finished sequence. Note the misplaced 20 Kb segment in the Atlas assembly. The UMD+Atlas assembly of GQQD correctly places the 20 Kb section originally misplaced and creates a single scaffold of the BAC covering 93.3% of the finished sequence. This UMD+Atlas assembly used reliable overlaps. This was the BAC that gave Atlas the most trouble.

    Comparison of the three assemblies for the subset of the 21 BACs from the Rat genome.

    The “original Atlas with UMD Plausible” and “original Atlas with UMD reliable” assembly results were obtained by replacing Phrap with PhrapUMD, using UMD plausible and reliable overlaps respectively. The best assembly (the bottom line) uses PhrapUMD and UMD reliable overlaps with the 2-pass approach described in the “Methods” section (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0001836#s2). It has almost 3% more sequence matching finished sequence than the original Atlas with Phrap, at less than 1/4 the original base error rate.