46 research outputs found

    Engineering Fully-Compressed Suffix Trees

    Get PDF

    An implementation of dynamic fully compressed suffix trees

    Get PDF
    Dissertação apresentada na Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa para obtenção do grau de Mestre em Engenharia InformáticaThis dissertation studies and implements a dynamic fully compressed suffix tree. Suffix trees are important algorithms in stringology and provide optimal solutions for myriads of problems. Suffix trees are used, in bioinformatics to index large volumes of data. For most aplications suffix trees need to be efficient in size and funcionality. Until recently they were very large, suffix trees for the 700 megabyte human genome spawn 40 gigabytes of data. The compressed suffix tree requires less space and the recent static fully compressed suffix tree requires even less space, in fact it requires optimal compressed space. However since it is static it is not suitable for dynamic environments. Chan et. al.[3] proposed the first dynamic compressed suffix tree however the space used for a text of size n is O(n log )bits which is far from the new static solutions. Our goal is to implement a recent proposal by Russo, Arlindo and Navarro[22] that defines a dynamic fully compressed suffix tree and uses only nH0 +O(n log ) bits of space

    Fast Label Extraction in the CDAWG

    Full text link
    The compact directed acyclic word graph (CDAWG) of a string TT of length nn takes space proportional just to the number ee of right extensions of the maximal repeats of TT, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which ee grows significantly more slowly than nn. We reduce from O(mloglogn)O(m\log{\log{n}}) to O(m)O(m) the time needed to count the number of occurrences of a pattern of length mm, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from O(mloglogn+occ)O(m\log{\log{n}}+\mathtt{occ}) to O(m+occ)O(m+\mathtt{occ}) in the time needed to locate all the occ\mathtt{occ} occurrences of the pattern. We also reduce from O(kloglogn)O(k\log{\log{n}}) to O(k)O(k) the time needed to read the kk characters of the label of an edge of the suffix tree of TT, and we reduce from O(mloglogn)O(m\log{\log{n}}) to O(m)O(m) the time needed to compute the matching statistics between a query of length mm and TT, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.Comment: 16 pages, 1 figure. In proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv admin note: text overlap with arXiv:1705.0864

    Combined Data Structure for Previous- and Next-Smaller-Values

    Get PDF
    Let AA be a static array storing nn elements from a totally ordered set. We present a data structure of optimal size at most nlog2(3+22)+o(n)n\log_2(3+2\sqrt{2})+o(n) bits that allows us to answer the following queries on AA in constant time, without accessing AA: (1) previous smaller value queries, where given an index ii, we wish to find the first index to the left of ii where AA is strictly smaller than at ii, and (2) next smaller value queries, which search to the right of ii. As an additional bonus, our data structure also allows to answer a third kind of query: given indices i<ji<j, find the position of the minimum in A[i..j]A[i..j]. Our data structure has direct consequences for the space-efficient storage of suffix trees.Comment: to appear in Theoretical Computer Scienc

    Storage and Retrieval of Individual Genomes

    Get PDF
    A repetitive sequence collection is one where portions of a emph{base sequence} of length nn are repeated many times with small variations, forming a collection of total length NN. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies O(NlogN)O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text emph{self-indexing} reduce the space of suffix tree to O(Nlogsigma)O(N log sigma) bits, where sigmasigma is the alphabet size. In practice, the space reduction is more than 1010-fold for example on suffix tree of Human Genome. However, this reduction remains a constant factor when more sequences are added to the collection We develop a new self-index suited for the repetitive sequence collection setting. Its expected space requirement depends only on the length nn of the base sequence and the number ss of variations in its repeated copies. That is, the space reduction is no longer constant, but depends on N/nN/n. We believe the structure developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies

    Simple Algorithm to Maintain Dynamic Suffix Array for Text Indexes

    Full text link
    Dynamic suffix array is a suffix data structure that reflects various patterns in a mutable string. Dynamic suffix array is rather convenient for performing substring search queries over database indexes that are frequently modified. We are to introduce an O(nlog2n) algorithm that builds suffix array for any string and to show how to implement dynamic suffix array using this algorithm under certain constraints. We propose that this algorithm could be useful in real-life database applications

    Storage and retrieval of individual genomes

    Get PDF
    Volume: 5541A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on suffix tree of Human Genome. However, this reduction factor remains constant when more sequences are added to the collection. We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n. We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.Peer reviewe
    corecore