2 research outputs found

    A preprocessor for shotgun assembly of large genomes

    The whole-genome shotgun (WGS) assembly technique has been remarkably successful in efforts to determine the sequence of bases that make up a genome. WGS assembly begins with a large collection of short fragments that have been selected at random from a genome. The sequence of bases at each end of the fragment is determined, albeit imprecisely, resulting in a sequence of letters called a “read”. Each letter in a read is assigned a quality value, which estimates the probability that a sequencing error occurred in determining that letter. Reads are typically cut off after about 500 letters, where sequencing errors become endemic. We report on a set of procedures that (1) corrects most of the sequencing errors, (2) changes quality values accordingly, and (3) produces a list of “overlaps”, i.e., pairs of reads that plausibly come from overlapping parts of the genome. Our procedures, which we call collectively the “UMD Overlapper”, can be run iteratively and as a preprocessor for other assemblers. We tested the UMD Overlapper on Celera’s Drosophila reads. When we replaced Celera’s overlap procedures in the front end of their assembler, it was able to produce a significantly improved genome.
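
The two ideas in this abstract, per-letter quality values and a list of overlapping read pairs, can be illustrated with a minimal sketch. It is not the UMD Overlapper's method, which the abstract does not describe. The sketch assumes the common Phred convention for quality values (Q = -10 log10 p) and detects only exact suffix-prefix matches, whereas real reads contain errors and need tolerant matching. The function names and the `min_len` parameter are hypothetical.

```python
from itertools import permutations


def phred_to_error_prob(q: float) -> float:
    """Convert a quality value to an error probability.

    Assumes the Phred convention Q = -10 * log10(p); the abstract only
    says that a quality value estimates the error probability.
    """
    return 10 ** (-q / 10)


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that exactly equals
    a prefix of `b`, or 0 if no such match has at least `min_len` letters."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def find_overlaps(reads, min_len=3):
    """List (i, j, k) for every ordered pair of distinct reads where the
    end of read i exactly matches the start of read j over k letters.

    This all-pairs scan is quadratic in the number of reads and is meant
    only as an illustration on toy input.
    """
    overlaps = []
    for (i, a), (j, b) in permutations(enumerate(reads), 2):
        k = suffix_prefix_overlap(a, b, min_len)
        if k:
            overlaps.append((i, j, k))
    return overlaps


if __name__ == "__main__":
    toy_reads = ["ACGTTGCA", "TGCAGGTC", "GGTCAAAT"]
    print(find_overlaps(toy_reads))
    print(phred_to_error_prob(20))
```

On the toy input, read 0 ends with the four letters that start read 1, and read 1 ends with the four letters that start read 2, so those two pairs are reported as overlaps.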

    A Whole-Genome Assembly of Drosophila

    We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly’s sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community. The primary obstacle to determining the sequence of a very large genome is that, with current technology, one can directly determine the sequence of at most a thousan