30,777 research outputs found

    Wavelet Trees Meet Suffix Trees

    Full text link
    We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length n over an alphabet of size σn\sigma\leq n, our method builds the wavelet tree in O(nlogσ/logn)O(n \log \sigma/ \sqrt{\log{n}}) time, improving upon the state-of-the-art algorithm by a factor of logn\sqrt{\log n}. As a consequence, given an array of n integers we can construct in O(nlogn)O(n \sqrt{\log n}) time a data structure consisting of O(n)O(n) machine words and capable of answering rank/select queries for the subranges of the array in O(logn/loglogn)O(\log n / \log \log n) time. This is a loglogn\log \log n-factor improvement in query time compared to Chan and P\u{a}tra\c{s}cu and a logn\sqrt{\log n}-factor improvement in construction time compared to Brodal et al. Next, we switch to stringological context and propose a novel notion of wavelet suffix trees. For a string w of length n, this data structure occupies O(n)O(n) words, takes O(nlogn)O(n \sqrt{\log n}) time to construct, and simultaneously captures the combinatorial structure of substrings of w while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in O(logx)O(\log |x|) time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings x and y of w count the number of suffixes of x that are lexicographically smaller than y, and for a substring x of w and an integer k, find the k-th lexicographically smallest suffix of x. We further show that wavelet suffix trees allow to compute a run-length-encoded Burrows-Wheeler transform of a substring x of w in O(slogx)O(s \log |x|) time, where s denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.Comment: 33 pages, 5 figures; preliminary version published at SODA 201

    Weighted ancestors in suffix trees

    Full text link
    The classical, ubiquitous, predecessor problem is to construct a data structure for a set of integers that supports fast predecessor queries. Its generalization to weighted trees, a.k.a. the weighted ancestor problem, has been extensively explored and successfully reduced to the predecessor problem. It is known that any solution for both problems with an input set from a polynomially bounded universe that preprocesses a weighted tree in O(n polylog(n)) space requires \Omega(loglogn) query time. Perhaps the most important and frequent application of the weighted ancestors problem is for suffix trees. It has been a long-standing open question whether the weighted ancestors problem has better bounds for suffix trees. We answer this question positively: we show that a suffix tree built for a text w[1..n] can be preprocessed using O(n) extra space, so that queries can be answered in O(1) time. Thus we improve the running times of several applications. Our improvement is based on a number of data structure tools and a periodicity-based insight into the combinatorial structure of a suffix tree.Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201

    Relative Suffix Trees

    Get PDF
    Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.Peer reviewe

    New Algorithms for Position Heaps

    Full text link
    We present several results about position heaps, a relatively new alternative to suffix trees and suffix arrays. First, we show that, if we limit the maximum length of patterns to be sought, then we can also limit the height of the heap and reduce the worst-case cost of insertions and deletions. Second, we show how to build a position heap in linear time independent of the size of the alphabet. Third, we show how to augment a position heap such that it supports access to the corresponding suffix array, and vice versa. Fourth, we introduce a variant of a position heap that can be simulated efficiently by a compressed suffix array with a linear number of extra bits

    On the suitability of suffix arrays for lempel-ziv data compression

    Get PDF
    Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used nowadays. Regarding time and memory requirements, LZ encoding is much more demanding than decoding. In order to speed up the encoding process, efficient data structures, like suffix trees, have been used. In this paper, we explore the use of suffix arrays to hold the dictionary of the LZ encoder, and propose an algorithm to search over it. We show that the resulting encoder attains roughly the same compression ratios as those based on suffix trees. However, the amount of memory required by the suffix array is fixed, and much lower than the variable amount of memory used by encoders based on suffix trees (which depends on the text to encode). We conclude that suffix arrays, when compared to suffix trees in terms of the trade-off among time, memory, and compression ratio, may be preferable in scenarios (e.g., embedded systems) where memory is at a premium and high speed is not critical

    On-line construction of position heaps

    Get PDF
    We propose a simple linear-time on-line algorithm for constructing a position heap for a string [Ehrenfeucht et al, 2011]. Our definition of position heap differs slightly from the one proposed in [Ehrenfeucht et al, 2011] in that it considers the suffixes ordered from left to right. Our construction is based on classic suffix pointers and resembles the Ukkonen's algorithm for suffix trees [Ukkonen, 1995]. Using suffix pointers, the position heap can be extended into the augmented position heap that allows for a linear-time string matching algorithm [Ehrenfeucht et al, 2011].Comment: to appear in Journal of Discrete Algorithm

    An implementation of dynamic fully compressed suffix trees

    Get PDF
    Dissertação apresentada na Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa para obtenção do grau de Mestre em Engenharia InformáticaThis dissertation studies and implements a dynamic fully compressed suffix tree. Suffix trees are important algorithms in stringology and provide optimal solutions for myriads of problems. Suffix trees are used, in bioinformatics to index large volumes of data. For most aplications suffix trees need to be efficient in size and funcionality. Until recently they were very large, suffix trees for the 700 megabyte human genome spawn 40 gigabytes of data. The compressed suffix tree requires less space and the recent static fully compressed suffix tree requires even less space, in fact it requires optimal compressed space. However since it is static it is not suitable for dynamic environments. Chan et. al.[3] proposed the first dynamic compressed suffix tree however the space used for a text of size n is O(n log )bits which is far from the new static solutions. Our goal is to implement a recent proposal by Russo, Arlindo and Navarro[22] that defines a dynamic fully compressed suffix tree and uses only nH0 +O(n log ) bits of space

    Synthesis of Suffix Trees

    Full text link
    The implications of probabilistic communication have been far-reaching and pervasive. Given the current status of flexible models, leading analysts particularly desire the simulation of sensor networks, which embodies the intuitive principles of cryptography. We propose a novel method for the evaluation of Smalltalk, which we call WAE
    corecore