
    Succinct Representations of Dynamic Strings

    The rank and select operations over a string of length $n$ from an alphabet of size $\sigma$ have been used widely in the design of succinct data structures. In many applications, the string itself needs to be maintained dynamically, allowing characters of the string to be inserted and deleted. Under the word RAM model with word size $w=\Omega(\lg n)$, we design a succinct representation of dynamic strings using $nH_0 + o(n)\lg\sigma + O(w)$ bits to support rank, select, insert and delete in $O(\frac{\lg n}{\lg\lg n}(\frac{\lg \sigma}{\lg\lg n}+1))$ time. When the alphabet size is small, i.e. when $\sigma = O(\mathrm{polylog}(n))$, including the case in which the string is a bit vector, these operations are supported in $O(\frac{\lg n}{\lg\lg n})$ time. Our data structures are more efficient than previous results on the same problem, and we have applied them to improve results on the design and construction of space-efficient text indexes.
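    For concreteness, the semantics of rank and select can be sketched naively as below; the paper's contribution is answering these queries (plus insert and delete) within compressed space and $O(\frac{\lg n}{\lg\lg n})$-type time, whereas this sketch takes linear time and the function names are illustrative, not the paper's API.

```python
def rank(s: str, c: str, i: int) -> int:
    """Number of occurrences of character c in the prefix s[0:i]."""
    return s[:i].count(c)

def select(s: str, c: str, j: int) -> int:
    """0-based position of the j-th (1-based) occurrence of c in s, or -1."""
    count = 0
    for pos, ch in enumerate(s):
        if ch == c:
            count += 1
            if count == j:
                return pos
    return -1

s = "abracadabra"
print(rank(s, "a", 5))    # 'a' occurs twice in "abrac" -> 2
print(select(s, "a", 3))  # the third 'a' sits at index 5
```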

    Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation

    Given a static reference string $R$ and a source string $S$, a relative compression of $S$ with respect to $R$ is an encoding of $S$ as a sequence of references to substrings of $R$. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as genomes and web-data. We initiate the study of relative compression in a dynamic setting where the compressed source string $S$ is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near optimal bounds for these problems. Plugging in our new results we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
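    The encoding itself can be illustrated with a naive greedy parse: repeatedly take the longest prefix of the remaining source that occurs in the reference. This is a $O(|S|\,|R|)$ sketch of the model only, not the paper's dynamic data structure, and it assumes every character of $S$ occurs in $R$.

```python
def relative_compress(R: str, S: str):
    """Greedy left-to-right parse of S into maximal substrings of R.
    Returns a list of (position in R, length) phrases.
    Assumes every character of S occurs somewhere in R."""
    phrases = []
    i = 0
    while i < len(S):
        # extend the current phrase while it still occurs in R
        j = i + 1
        while j <= len(S) and S[i:j] in R:
            j += 1
        j -= 1
        phrases.append((R.index(S[i:j]), j - i))
        i = j
    return phrases

def decompress(R: str, phrases) -> str:
    return "".join(R[p:p + l] for p, l in phrases)

R = "banana"
S = "nanaban"
enc = relative_compress(R, S)
print(enc)                       # [(2, 4), (0, 3)]
print(decompress(R, enc) == S)   # True
```

The dynamic problem studied in the paper is then to keep such a phrase sequence compact while $S$ undergoes insertions and deletions, with fast random access into the uncompressed $S$.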

    Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

    Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying). Comment: 14 pages; in Transactions of the Association for Computational Linguistics (TACL) 201

    More Haste, Less Waste: Lowering the Redundancy in Fully Indexable Dictionaries

    We consider the problem of representing, in a compressed format, a bit-vector $S$ of $m$ bits with $n$ 1s, supporting the following operations, where $b \in \{0, 1\}$: $rank_b(S,i)$ returns the number of occurrences of bit $b$ in the prefix $S[1..i]$; $select_b(S,i)$ returns the position of the $i$th occurrence of bit $b$ in $S$. Such a data structure is called a \emph{fully indexable dictionary (FID)} [Raman et al., 2007], and is at least as powerful as predecessor data structures. Our focus is on space-efficient FIDs on the \textsc{ram} model with word size $\Theta(\lg m)$ and constant time for all operations, so that the time cost is independent of the input size. Given the bitstring $S$ to be encoded, having length $m$ and containing $n$ ones, the minimal amount of information that needs to be stored is $B(n,m) = \lceil \log \binom{m}{n} \rceil$. The state of the art in building a FID for $S$ is given in [Patrascu, 2008], using $B(m,n) + O(m / ((\log m / t)^t)) + O(m^{3/4})$ bits to support the operations in $O(t)$ time. Here, we propose a parametric data structure exhibiting a time/space trade-off such that, for any real constants $0 < \delta \le 1/2$, $0 < \eps \le 1/2$, and integer $s > 0$, it uses $B(n,m) + O(n^{1+\delta} + n (\frac{m}{n^s})^\eps)$ bits and performs all the operations in time $O(s\delta^{-1} + \eps^{-1})$. The improvement is twofold: our redundancy can be lowered parametrically and, fixing $s = O(1)$, we get a constant-time FID whose space is $B(n,m) + O(m^\eps/\mathrm{poly}(n))$ bits, for sufficiently large $m$. This is a significant improvement compared to the previous bounds for the general case.
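    The information-theoretic baseline $B(n,m) = \lceil \log \binom{m}{n} \rceil$ that all these space bounds are measured against is easy to compute directly; the sketch below does so and shows how far it falls below the $m$ bits of the plain vector for sparse inputs (the function name `B` mirrors the abstract's notation).

```python
from math import ceil, comb, log2

def B(n: int, m: int) -> int:
    """Information-theoretic minimum, ceil(log2(binomial(m, n))), in bits."""
    if n == 0:
        return 0  # log2(1) = 0: an empty set needs no information
    return ceil(log2(comb(m, n)))

# A bit-vector of m = 1000 bits with n = 50 ones:
print(B(50, 1000))  # far fewer than the 1000 bits of the plain vector
```

The FID's total space is then this $B(n,m)$ plus the redundancy term, which is exactly what the paper parametrically lowers.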