The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

Authors: Grossi Roberto; Ottaviano Giuseppe
Publication date: 01/01/2012

An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of compressed indexed sequences of strings: representing indexed sequences of strings in nearly-optimal compressed space, in both the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
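
To make the supported operations concrete, the following is a minimal, uncompressed Python sketch of the interface such an indexed sequence might expose. It is not the Wavelet Trie: every query here scans a plain list, so none of the space or time bounds apply. The class name `NaiveIndexedSequence` and its method names are hypothetical, chosen only for illustration.

```python
from typing import List


class NaiveIndexedSequence:
    """Uncompressed stand-in for an indexed sequence of strings.

    Assumption: this only illustrates the query interface. Each query
    scans the list, unlike the compressed structure in the abstract.
    """

    def __init__(self, strings: List[str]) -> None:
        self._seq = list(strings)

    def access(self, pos: int) -> str:
        """Random access: return the string stored at position pos."""
        return self._seq[pos]

    def rank(self, s: str, pos: int) -> int:
        """Count exact occurrences of s among the first pos strings."""
        return sum(1 for x in self._seq[:pos] if x == s)

    def select(self, s: str, idx: int) -> int:
        """Position of the idx-th (0-based) occurrence of s, or -1 if absent."""
        seen = 0
        for i, x in enumerate(self._seq):
            if x == s:
                if seen == idx:
                    return i
                seen += 1
        return -1

    def rank_prefix(self, prefix: str, pos: int) -> int:
        """Count strings starting with prefix among the first pos strings."""
        return sum(1 for x in self._seq[:pos] if x.startswith(prefix))

    def append(self, s: str) -> None:
        """Dynamic setting: add a string, possibly one never seen before."""
        self._seq.append(s)
```

For example, on the sequence `["/home", "/var/log", "/home", "/var/tmp"]`, `rank("/home", 3)` counts two exact matches and `rank_prefix("/var", 4)` counts two strings sharing the prefix; appending a previously unseen string corresponds to growing the alphabet.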

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Crucial and bicrucial permutations with respect to arithmetic monotone patterns

Authors: Avgustinovich Sergey; Kitaev Sergey; Valyuzhenich Alexandr
Publication date: 09/10/2012

A pattern $\tau$ is a permutation, and an arithmetic occurrence of $\tau$ in (another) permutation $\pi=\pi_1\pi_2\ldots\pi_n$ is a subsequence $\pi_{i_1}\pi_{i_2}\ldots\pi_{i_m}$ of $\pi$ that is order isomorphic to $\tau$, where the numbers $i_1<i_2<\ldots<i_m$ form an arithmetic progression. A permutation is $(k,\ell)$-crucial if it arithmetically avoids the patterns $12\ldots k$ and $\ell(\ell-1)\ldots 1$ but its extension to the right by any element does not arithmetically avoid these patterns. A $(k,\ell)$-crucial permutation that cannot be extended to the left without creating an arithmetic occurrence of $12\ldots k$ or $\ell(\ell-1)\ldots 1$ is called $(k,\ell)$-bicrucial. In this paper we prove that arbitrarily long $(k,\ell)$-crucial and $(k,\ell)$-bicrucial permutations exist for any $k,\ell\geq 3$. Moreover, we show that the minimal length of a $(k,\ell)$-crucial permutation is $\max(k,\ell)(\min(k,\ell)-1)$, while the minimal length of a $(k,\ell)$-bicrucial permutation is at most $2\max(k,\ell)(\min(k,\ell)-1)$, again for $k,\ell\geq 3$.
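
The definition of an arithmetic occurrence can be checked by brute force. The sketch below is an illustration only, not a method from the paper; the function names `order_pattern`, `has_arithmetic_occurrence` and `avoids_both` are hypothetical. It enumerates every start index and every common difference and compares the relative order of the selected entries with the pattern.

```python
from typing import Sequence, Tuple


def order_pattern(values: Sequence[int]) -> Tuple[int, ...]:
    """Map distinct values to their relative ranks, e.g. (5, 2, 9) -> (2, 1, 3)."""
    ranks = {v: r for r, v in enumerate(sorted(values), start=1)}
    return tuple(ranks[v] for v in values)


def has_arithmetic_occurrence(perm: Sequence[int], pattern: Sequence[int]) -> bool:
    """True if some entries of perm, taken at indices forming an arithmetic
    progression, are order isomorphic to pattern."""
    m, n = len(pattern), len(perm)
    target = tuple(pattern)
    if m == 1:
        # A single entry always matches the one-element pattern.
        return n >= 1
    for start in range(n):
        step = 1
        # Try every common difference that keeps the last index inside perm.
        while start + (m - 1) * step < n:
            picked = [perm[start + j * step] for j in range(m)]
            if order_pattern(picked) == target:
                return True
            step += 1
    return False


def avoids_both(perm: Sequence[int], k: int, ell: int) -> bool:
    """True if perm arithmetically avoids both 12...k and ell(ell-1)...1."""
    increasing = tuple(range(1, k + 1))
    decreasing = tuple(range(ell, 0, -1))
    return not (
        has_arithmetic_occurrence(perm, increasing)
        or has_arithmetic_occurrence(perm, decreasing)
    )
```

Note that `[1, 3, 2, 4]` contains the pattern 123 in the classical sense (entries 1, 3, 4) but not arithmetically, because the indices 0, 1, 3 do not form an arithmetic progression; `[1, 4, 2, 5, 3]` does contain it arithmetically at indices 0, 2, 4.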

arXiv.org e-Print Archive

University of Strathclyde Institutional Repository

Cell-Probe Bounds for Online Edit Distance and Other Pattern Matching Problems

Authors: Clifford Raphael; Jalsenius Markus; Sach Benjamin
Publication date: 24/07/2014

We give cell-probe bounds for the computation of edit distance, Hamming distance, convolution and longest common subsequence in a stream. In this model, a fixed string of $n$ symbols is given and one $\delta$-bit symbol arrives at a time in a stream. After each symbol arrives, the distance between the fixed string and a suffix of most recent symbols of the stream is reported. The cell-probe model is perhaps the strongest model of computation for showing data structure lower bounds, subsuming in particular the popular word-RAM model.
* We first give an $\Omega((\delta \log n)/(w+\log\log n))$ lower bound for the time to give each output for both online Hamming distance and convolution, where $w$ is the word size. This bound relies on a new encoding scheme and for the first time holds even when $w$ is as small as a single bit.
* We then consider the online edit distance and longest common subsequence problems in the bit-probe model ($w=1$) with a constant sized input alphabet. We give a lower bound of $\Omega(\sqrt{\log n}/(\log\log n)^{3/2})$ which applies for both problems. This second set of results relies both on our new encoding scheme as well as a carefully constructed hard distribution.
* Finally, for the online edit distance problem we show that there is an $O((\log n)^2/w)$ upper bound in the cell-probe model. This bound gives a contrast to our new lower bound and also establishes an exponential gap between the known cell-probe and RAM model complexities.
Comment: 32 pages, 4 figures
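
The streaming setting for Hamming distance can be sketched naively as follows. This is only an illustration of the problem statement, not of the bounds: it spends time linear in $n$ per output. The function name `online_hamming` is hypothetical, and reporting `None` until $n$ symbols have arrived is an assumption of this sketch.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Optional, Sequence


def online_hamming(
    fixed: Sequence[int], stream: Iterable[int]
) -> Iterator[Optional[int]]:
    """After each arriving symbol, yield the Hamming distance between the
    fixed string and the latest len(fixed) symbols of the stream.

    Assumption: None is yielded while fewer than len(fixed) symbols have
    arrived, since the distance is not yet defined.
    """
    n = len(fixed)
    # The deque keeps only the n most recent symbols of the stream.
    window: Deque[int] = deque(maxlen=n)
    for symbol in stream:
        window.append(symbol)
        if len(window) < n:
            yield None
        else:
            # Naive recomputation: compare position by position.
            yield sum(1 for a, b in zip(fixed, window) if a != b)
```

With the fixed string `[1, 0, 1]` and the stream `1, 0, 1, 1, 0`, the third output compares the fixed string with the window `[1, 0, 1]`, and each later output compares it with the window shifted by one symbol.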

arXiv.org e-Print Archive

Explore Bristol Research

File Updates Under Random/Arbitrary Insertions And Deletions

Authors: Cadambe Viveck; Jaggi Sidharth; Médard Muriel; Schwartz Moshe; Wang Qiwen
Publication date: 27/02/2015

A client/encoder edits a file, as modeled by an insertion-deletion (InDel) process. An old copy of the file is stored remotely at a data-centre/decoder, and is also available to the client. We consider the problem of throughput- and computationally-efficient communication from the client to the data-centre, to enable the server to update its copy to the newly edited file. We study two models for the source files/edit patterns: the random pre-edit sequence left-to-right random InDel (RPES-LtRRID) process, and the arbitrary pre-edit sequence arbitrary InDel (APES-AID) process. In both models, we consider the regime in which the number of insertions/deletions is a small (but constant) fraction of the original file. For both models we prove information-theoretic lower bounds on the best possible compression rates that enable file updates. Conversely, our compression algorithms use dynamic programming (DP) and entropy coding, and achieve rates that are approximately optimal. Comment: The paper is an extended version of our paper to appear at ITW 201

arXiv.org e-Print Archive

DSpace@MIT

Crossref

Tree Contractions and Evolutionary Trees

Author: Kao Ming-Yang
Publication date: 26/01/2001

An evolutionary tree is a rooted tree where each internal vertex has at least two children and where the leaves are labeled with distinct symbols representing species. Evolutionary trees are useful for modeling the evolutionary history of species. An agreement subtree of two evolutionary trees is an evolutionary tree which is also a topological subtree of the two given trees. We give an algorithm to determine the largest possible number of leaves in any agreement subtree of two trees $T_1$ and $T_2$ with $n$ leaves each. If the maximum degree $d$ of these trees is bounded by a constant, the time complexity is $O(n \log^2 n)$ and is within a $\log n$ factor of optimal. For general $d$, this algorithm runs in $O(n d^2 \log d \log^2 n)$ time or alternatively in $O(n d \sqrt{d} \log^3 n)$ time.

arXiv.org e-Print Archive

CiteSeerX

Edit Distance: Sketching, Streaming and Document Exchange

Authors: Belazzougui Djamal; Zhang Qin
Publication date: 14/07/2016

We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space. Comment: Full version of an article to be presented at the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016).
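
The "answer if the edit distance is at most $K$, otherwise report an error" contract can be illustrated with a standard banded dynamic program that holds both strings in memory. This is neither the document exchange protocol nor the sketching algorithm from the abstract; it only shows the thresholded output. The function name `bounded_edit_distance` is hypothetical, and `None` stands in for "error".

```python
from typing import Optional


def bounded_edit_distance(x: str, y: str, k: int) -> Optional[int]:
    """Return the edit distance of x and y if it is at most k, else None.

    Assumption: None plays the role of the "error" output. Only cells
    within k of the main diagonal are evaluated, because any alignment
    of cost at most k never strays further than that.
    """
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None
    big = k + 1  # cap: any value above k just means "too far"
    # prev[j] is the capped distance between x[:i-1] and y[:j].
    prev = [j if j <= k else big for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [big] * (m + 1)
        if i <= k:
            cur[0] = i  # delete the first i symbols of x
        lo, hi = max(1, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            best = min(
                prev[j - 1] + cost,  # match or substitute
                prev[j] + 1,         # delete x[i-1]
                cur[j - 1] + 1,      # insert y[j-1]
            )
            cur[j] = min(best, big)
        prev = cur
    return prev[m] if prev[m] <= k else None
```

For instance, `bounded_edit_distance("kitten", "sitting", 3)` returns the distance, while the same call with a threshold of 2 returns `None`.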

arXiv.org e-Print Archive

Crossref