
46 research outputs found

Engineering Fully-Compressed Suffix Trees

Author: Ocker Christian
Publication date: 15/12/2016

An implementation of dynamic fully compressed suffix trees

Author: Figueiredo Miguel Filipe da Silva
Publication venue: Faculdade de Ciências e Tecnologia
Publication date: 01/01/2010

Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa to obtain the degree of Master in Informatics Engineering.

This dissertation studies and implements a dynamic fully compressed suffix tree. Suffix trees are important data structures in stringology and provide optimal solutions for a myriad of problems. They are used, for example, in bioinformatics to index large volumes of data. For most applications, suffix trees need to be efficient in size and functionality. Until recently they were very large: the suffix tree for the 700-megabyte human genome occupies 40 gigabytes. The compressed suffix tree requires less space, and the recent static fully compressed suffix tree requires even less; in fact, it requires optimal compressed space. However, since it is static, it is not suitable for dynamic environments. Chan et al. [3] proposed the first dynamic compressed suffix tree; however, the space used for a text of size n is O(n log ) bits, which is far from the new static solutions. Our goal is to implement a recent proposal by Russo, Arlindo and Navarro [22] that defines a dynamic fully compressed suffix tree and uses only nH0 + O(n log ) bits of space.

Repositório da Universidade Nova de Lisboa

Fast Label Extraction in the CDAWG

Author: A Blumer
D Belazzougui
D Gusfield
J Sirén
L Gasieniec
LS Russo
M Crochemore
M Raffinot
MA Bender
O Berkman
T Gagie
V Mäkinen
Publication date: 26/09/2017

The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m\log\log n)$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from $O(m\log\log n+\mathtt{occ})$ to $O(m+\mathtt{occ})$ in the time needed to locate all the $\mathtt{occ}$ occurrences of the pattern. We also reduce from $O(k\log\log n)$ to $O(k)$ the time needed to read the $k$ characters of the label of an edge of the suffix tree of $T$, and we reduce from $O(m\log\log n)$ to $O(m)$ the time needed to compute the matching statistics between a query of length $m$ and $T$, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.

Comment: 16 pages, 1 figure. In proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv admin note: text overlap with arXiv:1705.0864

arXiv.org e-Print Archive

Crossref

Combined Data Structure for Previous- and Next-Smaller-Values

Author: Fischer Johannes
Publication date: 02/02/2011

Let $A$ be a static array storing $n$ elements from a totally ordered set. We present a data structure of optimal size at most $n\log_2(3+2\sqrt{2})+o(n)$ bits that allows us to answer the following queries on $A$ in constant time, without accessing $A$: (1) previous smaller value queries, where given an index $i$, we wish to find the first index to the left of $i$ where $A$ is strictly smaller than at $i$, and (2) next smaller value queries, which search to the right of $i$. As an additional bonus, our data structure also allows us to answer a third kind of query: given indices $i<j$, find the position of the minimum in $A[i..j]$. Our data structure has direct consequences for the space-efficient storage of suffix trees.

Comment: to appear in Theoretical Computer Science
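To make the two query types in this abstract concrete, the following is a minimal Python sketch that precomputes every previous-smaller-value and next-smaller-value answer with a monotonic stack. It is not the succinct structure the abstract describes: it stores two plain integer tables rather than the compact bit encoding, and the function names `previous_smaller` and `next_smaller`, as well as the sentinels `-1` and `len(a)` for "no such index", are assumptions made for illustration.

```python
def previous_smaller(a):
    """psv[i] = nearest index j < i with a[j] < a[i], or -1 if none exists."""
    psv, stack = [], []  # stack holds indices whose values are strictly increasing
    for i, x in enumerate(a):
        # Discard candidates that are not strictly smaller than a[i].
        while stack and a[stack[-1]] >= x:
            stack.pop()
        psv.append(stack[-1] if stack else -1)
        stack.append(i)
    return psv


def next_smaller(a):
    """nsv[i] = nearest index j > i with a[j] < a[i], or len(a) if none exists."""
    n = len(a)
    nsv, stack = [n] * n, []
    for i in range(n - 1, -1, -1):
        # Same idea as above, scanning from the right.
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        nsv[i] = stack[-1] if stack else n
        stack.append(i)
    return nsv
```

Each index is pushed and popped at most once, so both tables are built in linear time; a lookup is then a single array access, at the cost of far more space than the bound stated in the abstract.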

arXiv.org e-Print Archive

Elsevier - Publisher Connector

Storage and Retrieval of Highly Repetitive Sequence Collections

Author: Mäkinen Veli
Navarro Gonzalo
Sirén Jouni
Välimäki Niko
Publication date: 01/01/2009

Peer reviewed

CiteSeerX

Helsingin yliopiston digitaalinen arkisto

Storage and Retrieval of Individual Genomes

Author: Navarro Gonzalo
Publication venue: Dagstuhl Seminar Proceedings. 08261 - Structure-Based Compression of Complex Massive Data
Publication date: 01/01/2008

A repetitive sequence collection is one where portions of a base sequence of length $n$ are repeated many times with small variations, forming a collection of total length $N$. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies $O(N \log N)$ bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to $O(N \log \sigma)$ bits, where $\sigma$ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the human genome. However, this reduction remains a constant factor when more sequences are added to the collection. We develop a new self-index suited for the repetitive sequence collection setting. Its expected space requirement depends only on the length $n$ of the base sequence and the number $s$ of variations in its repeated copies. That is, the space reduction is no longer constant, but depends on $N/n$. We believe the structure developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.

Dagstuhl Research Online Publication Server

Simple Algorithm to Maintain Dynamic Suffix Array for Text Indexes

Author: Ajtkulov P.
Urbanovich D.
Publication venue: St. Petersburg University Press
Publication date: 01/01/2011

A dynamic suffix array is a suffix data structure that reflects various patterns in a mutable string. It is convenient for performing substring search queries over database indexes that are frequently modified. We introduce an O(nlog2n) algorithm that builds the suffix array for any string, and show how to implement a dynamic suffix array using this algorithm under certain constraints. We suggest that this algorithm could be useful in real-life database applications.
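The abstract does not describe its construction algorithm, and its stated bound is ambiguous as extracted. As a point of reference only, the sketch below builds a static suffix array with the classic prefix-doubling technique, which runs in O(n log² n) time when each round uses a comparison sort. This is a swapped-in textbook method, not necessarily the authors' algorithm, and it does not cover the dynamic updates the abstract mentions. The function name `suffix_array` is an assumption.

```python
def suffix_array(s):
    """Return the start indices of the suffixes of s in lexicographic order."""
    n = len(s)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in s]  # initial rank: order by first character
    k = 1
    while True:
        # Order suffixes by their first 2k characters, using the ranks
        # of the two length-k halves; -1 marks "past the end of s".
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            # Equal keys share a rank; a differing key starts a new rank.
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            break
        k *= 2
    return sa
```

For example, `suffix_array("banana")` orders the suffixes as `a`, `ana`, `anana`, `banana`, `na`, `nana`. The doubling loop runs at most about log₂ n rounds, each dominated by one sort.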

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Udmurt State University: UdNOEB / Удмуртский государственный университет: Удмуртская научно-образовательная электронная библиотека (УдНОЭБ)

Storage and retrieval of individual genomes

Author: D. Gusfield
E. Pennisi
G. Manzini
G.M. Church
H. Kaplan
J. Fischer
J. Sirén
K. Sadakane
L. Russo
N. Hall
P. Ferragina
R. Grossi
U. Manber
V. Mäkinen
Publication venue: Springer
Publication date: 01/01/2009

Volume: 5541

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the human genome. However, this reduction factor remains constant when more sequences are added to the collection. We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n. We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.

Peer reviewed

CiteSeerX

Crossref

Helsingin yliopiston digitaalinen arkisto