An implementation of dynamic fully compressed suffix trees
Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa for the degree of Master in Informatics Engineering (Mestre em Engenharia Informática).
This dissertation studies and implements a dynamic fully compressed suffix tree.
Suffix trees are important data structures in stringology and provide optimal solutions
for a myriad of problems. They are used, for example, in bioinformatics to index large
volumes of data. For most applications, suffix trees need to be efficient in both size and
functionality. Until recently they were very large: suffix trees for the 700-megabyte
human genome take up 40 gigabytes of data.
The compressed suffix tree requires less space, and the recent static fully compressed
suffix tree requires even less; in fact, it requires optimal compressed space.
However, since it is static, it is not suitable for dynamic environments. Chan et
al. [3] proposed the first dynamic compressed suffix tree; however, the space used for
a text of size n is O(n log σ) bits, which is far from the new static solutions. Our
goal is to implement a recent proposal by Russo, Arlindo and Navarro [22] that defines a dynamic fully compressed suffix tree and uses only nH0 + O(n log σ) bits of space.
Fast Label Extraction in the CDAWG
The compact directed acyclic word graph (CDAWG) of a string T of length n
takes space proportional just to the number e of right extensions of the
maximal repeats of T, and it is thus an appealing index for highly repetitive
datasets, like collections of genomes from similar species, in which e grows
significantly more slowly than n. We reduce from O(m log log n) to O(m)
the time needed to count the number of occurrences of a pattern of
length m, using an existing data structure that takes an amount of space
proportional to the size of the CDAWG. This implies a reduction from
O(m log log n + occ) to O(m + occ) in the time needed to
locate all the occ occurrences of the pattern. We also reduce from
O(k log log n) to O(k) the time needed to read the k characters of the
label of an edge of the suffix tree of T, and we reduce from
O(m log log n) to O(m) the time needed to compute the matching
statistics between a query of length m and T, using an existing
representation of the suffix tree based on the CDAWG. All such improvements
derive from extracting the label of a vertex or of an arc of the CDAWG using a
straight-line program induced by the reversed CDAWG.
Comment: 16 pages, 1 figure. In proceedings of the 24th International
Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv
admin note: text overlap with arXiv:1705.0864
Combined Data Structure for Previous- and Next-Smaller-Values
Let A be a static array storing n elements from a totally ordered set. We
present a data structure of optimal size at most n log2(3 + 2√2) + o(n)
bits that allows us to answer the following queries on A in constant time,
without accessing A: (1) previous smaller value queries, where given an index
i, we wish to find the first index to the left of i where A is strictly
smaller than at i, and (2) next smaller value queries, which search to the
right of i. As an additional bonus, our data structure also allows us to answer
a third kind of query: given indices i < j, find the position of the minimum in
A[i..j]. Our data structure has direct consequences for the space-efficient
storage of suffix trees.
Comment: to appear in Theoretical Computer Science
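The three query types defined in this abstract can be illustrated with a plain, non-succinct sketch. The code below is not the data structure from the paper: it precomputes answers with a stack in linear time, stores them in ordinary arrays, and reads the input array directly. The function names and the sentinel values (-1 and n for "no smaller value") are assumptions made for this illustration.

```python
def previous_smaller(a):
    """psv[i] = nearest index j < i with a[j] < a[i], or -1 if none exists."""
    psv = [-1] * len(a)
    stack = []  # candidate indices; their values are strictly increasing
    for i, x in enumerate(a):
        # Entries that are not strictly smaller than x cannot be the answer.
        while stack and a[stack[-1]] >= x:
            stack.pop()
        psv[i] = stack[-1] if stack else -1
        stack.append(i)
    return psv


def next_smaller(a):
    """nsv[i] = nearest index j > i with a[j] < a[i], or len(a) if none exists."""
    n = len(a)
    nsv = [n] * n
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        nsv[i] = stack[-1] if stack else n
        stack.append(i)
    return nsv


def range_min_pos(a, i, j):
    """Position of the leftmost minimum in a[i..j] (inclusive), by a naive scan."""
    return min(range(i, j + 1), key=a.__getitem__)


a = [3, 1, 4, 1, 5, 9, 2, 6]
print(previous_smaller(a))     # [-1, -1, 1, -1, 3, 4, 3, 6]
print(next_smaller(a))         # [1, 8, 3, 8, 6, 6, 8, 8]
print(range_min_pos(a, 2, 6))  # 3
```

The sketch uses a full machine word per answer, whereas the abstract's contribution is to support the same queries in constant time within a bit budget linear in n and without access to the array.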
Storage and Retrieval of Individual Genomes
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the Human Genome. However, this reduction remains a constant factor when more sequences are added to the collection.
We develop a new self-index suited for the repetitive sequence collection setting. Its expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction is no longer constant, but depends on N / n.
We believe the structure developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in sequencing technologies.
Simple Algorithm to Maintain Dynamic Suffix Array for Text Indexes
A dynamic suffix array is a suffix data structure that reflects various patterns in a mutable string. It is convenient for performing substring search queries over database indexes that are frequently modified. We introduce an O(nlog2n) algorithm that builds a suffix array for any string and show how to implement a dynamic suffix array using this algorithm under certain constraints. We propose that this algorithm could be useful in real-life database applications.
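As a point of reference for what a suffix array is and how it supports substring search, here is a minimal sketch. It uses prefix doubling with one comparison sort per round, a standard construction that costs O(n log² n) time; the abstract does not say which construction its authors use, so this should not be read as their algorithm. The function names are assumptions, and the sketch is static: it does not handle updates other than by rebuilding.

```python
def build_suffix_array(s):
    """Return the starting positions of the suffixes of s in sorted order."""
    n = len(s)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in s]  # rank by first character
    k = 1
    while True:
        # Sort suffixes by their first 2k characters, using ranks of the
        # first k characters and of the k characters that follow them.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            return sa
        k *= 2


def find_occurrences(s, sa, pattern):
    """Return all start positions of pattern in s, using binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Rebuilding the array after every edit is the naive way to keep it current; a dynamic suffix array aims to avoid that cost.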
Storage and retrieval of individual genomes
Volume: 5541
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the Human Genome. However, this reduction factor remains constant when more sequences are added to the collection. We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n. We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in sequencing technologies.
Peer reviewed
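The gap between O(N log N) and O(N log σ) bits can be made concrete with a back-of-the-envelope calculation. The sketch below takes the constants hidden by the O-notation to be 1 and uses hypothetical values (N of three billion symbols, σ = 4 for a DNA alphabet); both are assumptions for illustration and not figures from the abstract.

```python
import math


def pointer_based_bits(N):
    """N log N bits: one ceil(log2 N)-bit pointer per text position."""
    return N * math.ceil(math.log2(N))


def self_index_bits(N, sigma):
    """N log sigma bits: one ceil(log2 sigma)-bit code per text position."""
    return N * math.ceil(math.log2(sigma))


N = 3_000_000_000  # hypothetical collection length
sigma = 4          # hypothetical alphabet size (A, C, G, T)

ratio = pointer_based_bits(N) / self_index_bits(N, sigma)
print(ratio)  # 16.0
```

Under these assumptions the ratio is log N / log σ regardless of how many sequences are added, which is the constant-factor limitation the abstract describes; a space bound that depends on n and s instead lets the saving grow with N / n.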