A space-efficient construction of the Burrows Wheeler transform for genomic data

Authors: Brian P. Walenz, Ross A. Lippert, Clark M. Mobarry
Publication date: 01/01/2005

Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures which support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming more important. One such data structure, the compressed suffix array (CSA), based on the Burrows-Wheeler transform, has been shown to require memory which is nearly equal to the memory requirements of the original database, while supporting common sorts of query problems time-efficiently. However, building a CSA from a sequence in efficient space and time is challenging. In 2002, the first space-efficient CSA construction algorithm was presented. That implementation used $(1 + 2\log_2|\Sigma|)(1 + \epsilon)$ bits per character (where $\epsilon$ is a small fraction). The construction algorithm ran in as much as twice that space, in $O(|\Sigma|\, n \log n)$ time. We have created an implementation which can also achieve these asymptotic bounds, but for small alphabets, only uses $\frac{1}{2}(1 + |\Sigma|)(1 + \epsilon)$ bits per character, a factor of 2 less space for nucleotide alphabets. We present time and space results for the CSA construction and querying of our implementation on publicly available genome data which demonstrate the practicality of this approach.
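To make the idea of exact matching over a Burrows-Wheeler transform concrete, here is a minimal Python sketch. It is not the space-efficient construction the abstract describes: it builds the transform naively by sorting all rotations and counts pattern occurrences with plain backward search. The function names `bwt` and `count_occurrences` and the `$` sentinel are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Naive BWT: sort every rotation of text + sentinel, keep the last column.

    Assumption: the sentinel sorts before every character of the text
    (true for '$' against letters). Quadratic space; illustration only.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact occurrences of pattern via backward search on the BWT."""
    counts = Counter(bwt_str)
    # first_row[c] = number of characters in the text strictly smaller than c
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    def occ(c, i):
        # Occurrences of c in bwt_str[:i]; a real index would precompute this.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences(bwt("GATTACAGATTACA"), "ATTA")` counts the occurrences of `ATTA` without scanning the original text. A practical compressed suffix array replaces the linear-time `occ` scan with a compact rank structure, which is where the bits-per-character budgets quoted in the abstract come from.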

CiteSeerX

Whole-genome shotgun assembly and comparison of human genome assemblies

We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304–1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860–921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.

Crossref

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

PubMed Central

MPG.PuRe