An implementation of dynamic fully compressed suffix trees
Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa for the degree of Master in Informatics Engineering.
This dissertation studies and implements a dynamic fully compressed suffix tree.
Suffix trees are important data structures in stringology and provide optimal solutions
for myriads of problems. Suffix trees are used in bioinformatics to index large
volumes of data. For most applications, suffix trees need to be efficient in size and
functionality. Until recently they were very large: a suffix tree for the 700-megabyte
human genome occupies 40 gigabytes.
The compressed suffix tree requires less space, and the recent static fully compressed
suffix tree requires even less: it achieves optimal compressed space.
However, since it is static, it is not suitable for dynamic environments. Chan et
al. [3] proposed the first dynamic compressed suffix tree; however, the space used for
a text of size n is O(n log σ) bits, which is far from the new static solutions. Our
goal is to implement a recent proposal by Russo, Arlindo and Navarro [22] that defines a dynamic fully compressed suffix tree and uses only nH0 + O(n log σ) bits of space.
An(other) Entropy-Bounded Compressed Suffix Tree
Suffix trees are one of the most important data structures in stringology, with myriads of applications in flourishing areas like bioinformatics. As their main problem is space usage, recent efforts have focused on compressed suffix tree representations, which obtain large space reductions in exchange
for moderate slowdowns. Such a smaller suffix tree could fit in a faster memory, outweighing by far the theoretical slowdown. We present a novel compressed suffix tree. Compared to the current compressed
suffix trees, it is the first to achieve at the same time sublogarithmic complexity for the operations and space usage that goes to zero as the entropy of the text does. Our development contains several novel ideas, such as compressing the longest common prefix information and getting rid of the
suffix tree topology entirely, expressing all the suffix tree operations using range minimum queries and a new primitive called next/previous smaller value in a sequence.
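To illustrate how suffix tree navigation can be expressed over the longest common prefix (LCP) array instead of an explicit tree topology, here is a minimal sketch. It is not the paper's data structure: the suffix array and LCP array are built naively, the range minimum and next/previous smaller value queries are plain linear scans, and all function names are hypothetical. A node is represented by its inclusive suffix array interval [l, r].

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[i] = longest common prefix of suffixes sa[i-1] and sa[i]; lcp[0] = 0."""
    def common(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k
    return [0] + [common(sa[i - 1], sa[i]) for i in range(1, len(sa))]


def psv(seq, k):
    """Previous smaller value: largest j < k with seq[j] < seq[k], else 0."""
    for j in range(k - 1, -1, -1):
        if seq[j] < seq[k]:
            return j
    return 0


def nsv(seq, k):
    """Next smaller value: smallest j > k with seq[j] < seq[k], else the last index."""
    for j in range(k + 1, len(seq)):
        if seq[j] < seq[k]:
            return j
    return len(seq) - 1


def string_depth(lcp, l, r):
    """String depth of an internal node [l, r] with l < r: a range minimum over lcp."""
    return min(lcp[l + 1:r + 1])


def parent(lcp, l, r):
    """Parent interval of node [l, r], or None for the root."""
    n = len(lcp)
    if l == 0 and r == n - 1:
        return None
    ext = lcp + [-1]  # sentinel so that index r + 1 == n is always valid
    k = l if ext[l] > ext[r + 1] else r + 1
    return psv(ext, k), nsv(ext, k) - 1


text = "banana$"  # assumes a unique terminator smaller than all other symbols
sa = suffix_array(text)
lcp = lcp_array(text, sa)
print(parent(lcp, 3, 3))  # leaf of "anana$" -> interval of the node "ana"
```

Replacing the linear scans with succinct range minimum and smaller-value structures is where the space and time bounds of such designs come from; the sketch only shows the interval arithmetic.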
On the Benefit of Merging Suffix Array Intervals for Parallel Pattern Matching
We present parallel algorithms for exact and approximate pattern matching
with suffix arrays, using a CREW-PRAM with processors. Given a static text
of length , we first show how to compute the suffix array interval of a
given pattern of length in
time for . For approximate pattern matching with differences or
mismatches, we show how to compute all occurrences of a given pattern in
time, where is the size of the alphabet
and . The workhorse of our algorithms is a data structure
for merging suffix array intervals quickly: Given the suffix array intervals
for two patterns and , we present a data structure for computing the
interval of in sequential time, or in
parallel time. All our data structures are of size bits (in addition to
the suffix array)
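For orientation, the suffix array interval of a pattern can be computed sequentially with two binary searches. The sketch below shows only that sequential baseline, not the parallel CREW-PRAM algorithms or the interval-merging data structure described above. The function names and the half-open interval convention are assumptions of this sketch.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Half-open interval [left, right) of suffixes in sa that start with pattern."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


text = "banana$"
sa = suffix_array(text)
left, right = sa_interval(text, sa, "ana")
print(sorted(sa[left:right]))  # text positions where "ana" occurs
```

Each comparison copies up to m characters, so this baseline takes O(m log n) time for a pattern of length m in a text of length n.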
Optimized Compressed Data Structures for Infinite-order Language Models
In recent years, highly compact succinct text indexes developed in bioinformatics have spread to the domain of natural language processing, in particular n-gram indexing. One line of research has been to utilize compressed suffix trees as both the text index and the language model. Compressed suffix trees have several favourable properties for compressing n-gram strings and associated satellite data while allowing for both fast access and fast computation of the language model probabilities over the text. When it comes to count-based n-gram language models, and especially low-order n-gram models, the Kneser-Ney language model has long been the de facto industry standard.
Shareghi et al. showed how to utilize a compressed suffix tree to build a highly compact index that is competitive with state-of-the-art language models in space. In addition, they showed how the index can work as a language model and allows computing modified Kneser-Ney probabilities straight from the data structure.
This thesis analyzes and extends the work of Shareghi et al. in building a modified Kneser-Ney language model based on a compressed suffix tree. We explain their solution and present three attempts to improve the approach. Out of the three experiments, one performed far worse than the original approach, but two showed minor gains in time with no real loss in space.
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space
Indexing highly repetitive texts - such as genomic databases, software
repositories and versioned text collections - has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms
(BWTs). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used O(r) space and was able to efficiently count the number of
occurrences of a pattern of length m in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
r. In this paper we close this long-standing problem, showing how to extend the
Run-Length FM-index so that it can locate the occ occurrences efficiently
within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m
+ occ), within O(r log log_w(σ + n/r)) space, for a text of length n
over an alphabet of size σ on a RAM machine with words of w =
Ω(log n) bits. Within that space, our index can also count in optimal
time, O(m). Multiplying the space by O(w / log σ), we support count and
locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which
is optimal in the packed setting and had not been obtained before in compressed
space. We also describe a structure using O(r log(n/r)) space that replaces the
text and extracts any text substring of length ℓ in almost-optimal time
O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct
access to suffix array, inverse suffix array, and longest common prefix array
cells, and extend these capabilities to full suffix tree functionality,
typically in O(log(n/r)) time per operation.
Comment: submitted version; optimal count and locate in smaller space: O(r log
log_w(n/r + σ))
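To make the measure r concrete, the following minimal sketch computes the Burrows-Wheeler Transform of a small text and counts its runs. It is only an illustration of the definition, not part of the index described above: the BWT is built from a naively sorted suffix array, the text is assumed to end with a unique terminator `$` that is smaller than every other symbol, and the function names are hypothetical.

```python
def bwt(text):
    """BWT via a naive suffix array; text must end with a unique smallest terminator."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The symbol preceding each suffix; index -1 wraps around to the terminator.
    return "".join(text[i - 1] for i in sa)


def count_runs(s):
    """Number of maximal runs of equal consecutive symbols in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


print(bwt("banana$"), count_runs(bwt("banana$")))
print(bwt("abababab$"), count_runs(bwt("abababab$")))
```

On the highly repetitive input `abababab$` the transform collapses into three runs, which is the kind of behaviour that makes r a useful compressibility measure for repetitive collections.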
- …