SST: An algorithm for searching sequence databases in time proportional to the logarithm of the database size.
Abstract
We have developed an algorithm, called SST (Sequence Search Tree), that searches a database of DNA sequences for near-exact matches in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fixed-length fragments called "windows", using multiple offsets. Each window is mapped into a vector of dimension 4^k that contains the frequency of occurrence of its component k-tuples, with k a parameter typically in the range 4-6. We then create a tree-structured index of the windows in vector space, using tree-structured vector quantization (TSVQ). We identify the nearest neighbors of a query sequence by partitioning the query into windows and searching the tree-structured index for nearest-neighbor windows in the database. This yields an O(log n) complexity for the search. SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as assembling shotgun sequenc..
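The pipeline described in the abstract (windowing, k-tuple frequency vectors, a tree-structured index, and a greedy descent for the query) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names are hypothetical, squared Euclidean distance stands in for whatever metric the paper uses, raw counts stand in for frequencies, and a simple recursive 2-means split stands in for TSVQ codebook training. The descent visits one child per level, so the search cost is logarithmic in the number of windows only when the tree is reasonably balanced, and the result is an approximate nearest neighbor.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_index(k):
    """Map each of the 4**k possible k-tuples to a vector coordinate."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}

def windows(seq, width, offsets):
    """Yield (start, fragment) for non-overlapping windows at each offset."""
    for off in offsets:
        for start in range(off, len(seq) - width + 1, width):
            yield start, seq[start:start + width]

def kmer_vector(window, k, index):
    """Count occurrences of every k-tuple in the window (length 4**k)."""
    vec = [0] * len(index)
    for i in range(len(window) - k + 1):
        j = index.get(window[i:i + k])
        if j is not None:  # skip k-tuples containing ambiguous bases such as N
            vec[j] += 1
    return vec

def sqdist(a, b):
    """Squared Euclidean distance (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(items):
    """Coordinate-wise mean of the vectors in a list of (label, vector)."""
    vecs = [v for _, v in items]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def split(items, c0, c1):
    """Assign each item to the nearer of two centroids (ties go left)."""
    left = [it for it in items if sqdist(it[1], c0) <= sqdist(it[1], c1)]
    right = [it for it in items if sqdist(it[1], c0) > sqdist(it[1], c1)]
    return left, right

def build_tree(items, leaf_size=2, iters=5):
    """Recursively split a non-empty list of (label, vector) pairs.

    A few rounds of 2-means stand in for TSVQ training here.
    """
    if len(items) <= leaf_size:
        return {"leaf": items}
    # Seed with the first vector and the vector farthest from it.
    c0 = items[0][1]
    c1 = max(items, key=lambda it: sqdist(it[1], c0))[1]
    for _ in range(iters):
        left, right = split(items, c0, c1)
        if not left or not right:
            return {"leaf": items}
        c0, c1 = centroid(left), centroid(right)
    # Final assignment uses the same centroids that are stored in the node,
    # so a query identical to an indexed vector follows that vector's path.
    left, right = split(items, c0, c1)
    if not left or not right:
        return {"leaf": items}
    return {"centroids": (c0, c1),
            "children": (build_tree(left, leaf_size, iters),
                         build_tree(right, leaf_size, iters))}

def search(tree, query_vec):
    """Greedy descent to one leaf, then a linear scan of that leaf."""
    while "leaf" not in tree:
        c0, c1 = tree["centroids"]
        go_right = sqdist(query_vec, c0) > sqdist(query_vec, c1)
        tree = tree["children"][1 if go_right else 0]
    return min(tree["leaf"], key=lambda it: sqdist(it[1], query_vec))
```

To use the sketch, one would build `(label, vector)` pairs from every window of every database sequence, pass them to `build_tree`, and then call `search` once per query window. A query longer than one window would need its per-window hits combined into a per-sequence score; that step is not described in the abstract and is left out here.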