Search CORE

91 research outputs found

Random Access to Grammar Compressed Strings

Author: Bille Philip
Landau Gad M.
Raman Rajeev
Sadakane Kunihiko
Satti Srinivasa Rao
Weimann Oren
Publication venue
Publication date: 01/01/2011
Field of study

Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let

S

be a string of length

N

compressed into a context-free grammar

\mathcal{S}

of size

n

. We present two representations of

\mathcal{S}

achieving

O(\log N)

random access time, and either

O(n\cdot \alpha_k(n))

construction time and space on the pointer machine model, or

O(n)

construction time and space on the RAM. Here,

\alpha_k(n)

is the inverse of the

k^{th}

row of Ackermann's function. Our representations also efficiently support decompression of any substring in

S

: we can decompress any substring of length

m

in the same complexity as a single random access query and additional

O(m)

time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern

P

with at most

k

errors in time

O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)

, where

occ

is the number of occurrences of

P

S

. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.Comment: Preliminary version in SODA 201

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

Leicester Research Archive

Rank, select and access in grammar-compressed strings

Author: Belazzougui Djamal
Puglisi Simon J.
Tabei Yasuo
Publication venue
Publication date: 14/08/2014
Field of study

Given a string

S

of length

N

on a fixed alphabet of

\sigma

symbols, a grammar compressor produces a context-free grammar

G

of size

n

that generates

S

and only

S

. In this paper we describe data structures to support the following operations on a grammar-compressed string: \mbox{rank}_c(S,i) (return the number of occurrences of symbol

c

before position

i

S

); \mbox{select}_c(S,i) (return the position of the

i

th occurrence of

c

S

); and \mbox{access}(S,i,j) (return substring

S[i,j]

). For rank and select we describe data structures of size

O(n\sigma\log N)

bits that support the two operations in

O(\log N)

time. We propose another structure that uses

O(n\sigma\log (N/n)(\log N)^{1+\epsilon})

bits and that supports the two queries in

O(\log N/\log\log N)

, where

\epsilon>0

is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires

O(n\log N)

bits of space and

O(\log N+m/\log_\sigma N)

time to extract

m=j-i+1

consecutive symbols from

S

. Alternatively, we can achieve

O(\log N/\log\log N+m/\log_\sigma N)

query time using

O(n\log (N/n)(\log N)^{1+\epsilon})

bits of space. This matches a lower bound stated by Verbin and Yu for strings where

N

is polynomially related to

n

.Comment: 16 page

arXiv.org e-Print Archive

CiteSeerX

Searching and Indexing Genomic Databases via Kernelization

Author: Gagie Travis
Puglisi Simon J.
Publication venue
Publication date: 04/12/2014
Field of study

The rapid advance of DNA sequencing technologies has yielded databases of thousands of genomes. To search and index these databases effectively, it is important that we take advantage of the similarity between those genomes. Several authors have recently suggested searching or indexing only one reference genome and the parts of the other genomes where they differ. In this paper we survey the twenty-year history of this idea and discuss its relation to kernelization in parameterized complexity

arXiv.org e-Print Archive

Frontiers - Publisher Connector

Finger Search in Grammar-Compressed Strings

Author: Bille Philip
Christiansen Anders Roy
Cording Patrick Hagge
Gørtz Inge Li
Publication venue
Publication date: 01/01/2016
Field of study

Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. Given a grammar, the random access problem is to compactly represent the grammar while supporting random access, that is, given a position in the original uncompressed string report the character at that position. In this paper we study the random access problem with the finger search property, that is, the time for a random access query should depend on the distance between a specified index

f

, called the \emph{finger}, and the query index

i

. We consider both a static variant, where we first place a finger and subsequently access indices near the finger efficiently, and a dynamic variant where also moving the finger such that the time depends on the distance moved is supported. Let

n

be the size the grammar, and let

N

be the size of the string. For the static variant we give a linear space representation that supports placing the finger in

O(\log N)

time and subsequently accessing in

O(\log D)

time, where

D

is the distance between the finger and the accessed index. For the dynamic variant we give a linear space representation that supports placing the finger in

O(\log N)

time and accessing and moving the finger in

O(\log D + \log \log N)

time. Compared to the best linear space solution to random access, we improve a

O(\log N)

query bound to

O(\log D)

for the static variant and to

O(\log D + \log \log N)

for the dynamic variant, while maintaining linear space. As an application of our results we obtain an improved solution to the longest common extension problem in grammar compressed strings. To obtain our results, we introduce several new techniques of independent interest, including a novel van Emde Boas style decomposition of grammars

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Online Research Database In Technology

Algorithms and data structures for grammar-compressed strings

Author: Cording Patrick Hagge
Publication venue: Technical University of Denmark
Publication date: 01/01/2015
Field of study

Online Research Database In Technology