SST: An algorithm for searching sequence databases in time proportional to the logarithm of the database size.

Eldar Giladi; James Ze Wang; Michael G. Walker; Wayne Volkmuth

Abstract

We have developed an algorithm, called SST (Sequence Search Tree), that searches a database of DNA sequences for near-exact matches in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fixed-length fragments called "windows", using multiple offsets. Each window is mapped into a vector of dimension 4^k that contains the frequency of occurrence of its component k-tuples, with k a parameter typically in the range 4 to 6. We then create a tree-structured index of the windows in vector space, using tree-structured vector quantization (TSVQ). We identify the nearest neighbors of a query sequence by partitioning the query into windows and searching the tree-structured index for nearest-neighbor windows in the database. This yields an O(log n) complexity for the search. SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as assembling shotgun sequenc..
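The windowing and vectorization steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `windows` and `ktuple_vector`, the choice of non-overlapping windows per offset, the use of raw counts rather than normalized frequencies, and the skipping of non-ACGT symbols are all assumptions made for illustration. The TSVQ index itself is not shown.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; other symbols (e.g. N) are skipped


def windows(seq, width, offsets):
    """Cut seq into fixed-length windows, once per starting offset.

    Assumption: for a given offset the windows do not overlap; different
    offsets yield shifted copies of the same partition.
    """
    out = []
    for off in offsets:
        for start in range(off, len(seq) - width + 1, width):
            out.append((start, seq[start:start + width]))
    return out


def ktuple_vector(window, k):
    """Map a window to a vector of length 4**k holding k-tuple counts."""
    # Position of each k-tuple in the vector, in lexicographic order.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    vec = [0] * (4 ** k)
    for i in range(len(window) - k + 1):
        j = index.get(window[i:i + k])
        if j is not None:  # ignore k-tuples containing non-ACGT symbols
            vec[j] += 1
    return vec


# Hypothetical toy input; real windows would be much longer and k would be 4 to 6.
wins = windows("ACGTACGTAC", width=4, offsets=[0, 2])
vecs = [ktuple_vector(w, k=2) for _, w in wins]
```

Each resulting vector is a point in 4^k-dimensional space, which is the representation the abstract says the tree-structured index is built over. With k = 2 the vectors have 16 entries; at k = 6 they have 4096.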

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.43.55...

Last time updated on 22/10/2014

CiteSeerX

oai:CiteSeerX.psu:10.1.1.78.44...

Last time updated on 22/10/2014