A Faster Implementation of Online Run-Length Burrows-Wheeler Transform
Run-length encoding of Burrows-Wheeler transformed strings, yielding the
run-length BWT (RLBWT), is a powerful tool for processing highly repetitive
strings. We propose a new algorithm for online RLBWT construction working in
run-compressed space, which runs in O(n lg r) time and O(r lg n) bits of space,
where n is the length of the input string received so far and r is the number
of runs in the BWT of the reversed string. We improve the state-of-the-art
algorithm for online RLBWT in terms of empirical construction time: adopting a
dynamic list for maintaining a total order, we can replace rank queries in a
dynamic wavelet tree on a run-length compressed string with direct comparisons
of labels in the dynamic list. Empirical results on various benchmarks show the
efficiency of our algorithm, especially for highly repetitive strings.
Comment: In Proc. IWOCA201
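The extension step behind online BWT construction can be sketched in isolation. The following is a minimal illustration, not the paper's run-compressed data structure: it maintains the BWT of the reversed input in a plain Python list, so each extension costs linear time here instead of the run-compressed bounds, and the class and method names are made up for this sketch.

```python
class OnlineBWT:
    """Online BWT of the *reversed* input, extended one symbol at a time.

    Illustrative only: a real implementation stores L run-length compressed
    and answers rank queries with a dynamic structure.
    """

    def __init__(self):
        self.L = ["$"]   # BWT of the (empty) reversed text plus sentinel
        self.p = 0       # current position of the sentinel in L

    def extend(self, c):
        """Append c to the input text (= prepend c to the reversed text)."""
        self.L[self.p] = c                     # suffix T$ is now preceded by c
        smaller = sum(1 for x in self.L if x < c)
        rank = self.L[: self.p + 1].count(c)   # occurrences of c up to p, inclusive
        self.p = smaller + rank                # row of the new full suffix c.T$
        self.L.insert(self.p, "$")

    def bwt(self):
        return "".join(self.L)

b = OnlineBWT()
for ch in "abracadabra"[::-1]:  # feed reversed, so L becomes BWT of "abracadabra$"
    b.extend(ch)
# b.bwt() == "ard$rcaaaabb"
```

Replacing the naive `smaller` and `rank` computations with queries on a run-length compressed dynamic structure is exactly where the paper's label-comparison trick applies.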
Cross-Document Pattern Matching
We study a new variant of the string matching problem called cross-document
string matching, which is the problem of indexing a collection of documents to
support an efficient search for a pattern in a selected document, where the
pattern itself is a substring of another document. Several variants of this
problem are considered, and efficient linear-space solutions are proposed with
query time bounds that either do not depend at all on the pattern size or
depend on it in a very limited way (doubly logarithmic). As a side result, we
propose an improved solution to the weighted level ancestor problem.
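To make the query semantics concrete, here is a naive baseline (function and parameter names are illustrative, not from the paper): the pattern is specified only as a range of another document, and the point of the paper's indexes is to avoid the pattern-length-dependent scan this baseline performs.

```python
def cross_doc_occurrences(docs, i, j, a, b):
    """Count occurrences in docs[i] of the pattern docs[j][a:b].

    Naive O(|docs[i]| * |pattern|) scan; the paper answers such queries
    with time (nearly) independent of the pattern length.
    """
    pattern = docs[j][a:b]          # pattern = substring of another document
    text, hits, k = docs[i], 0, 0
    while True:
        k = text.find(pattern, k)
        if k == -1:
            return hits
        hits += 1
        k += 1                      # advance by one to count overlaps too

docs = ["abracadabra", "cadabra club"]
cross_doc_occurrences(docs, 0, 1, 0, 7)   # occurrences of "cadabra" in docs[0]
```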
On Online Labeling with Polynomially Many Labels
In the online labeling problem with parameters n and m we are presented with
a sequence of n keys from a totally ordered universe U and must assign each
arriving key a label from the label set {1,2,...,m} so that the order of labels
(strictly) respects the ordering on U. As new keys arrive it may be necessary
to change the labels of some items; such changes may be done at any time at
unit cost for each change. The goal is to minimize the total cost. An
alternative formulation of this problem is the file maintenance problem, in
which the items, instead of being labeled, are maintained in sorted order in an
array of length m, and we pay unit cost for moving an item.
For the case m = cn for constant c > 1, there are known algorithms that use at
most O(n log^2 n) relabelings in total [Itai, Konheim, Rodeh, 1981], and it
was shown recently that this is asymptotically optimal [Bulánek, Koucký,
Saks, 2012]. For the case of m = Θ(n^C) for C > 1, algorithms are known that
use O(n log n) relabelings. A matching lower bound was claimed in [Dietz,
Seiferas, Zhang, 2004]. That proof involved two distinct steps: a lower bound
for a problem they call prefix bucketing, and a reduction from prefix bucketing
to online labeling. The reduction seems to be incorrect, leaving a (seemingly
significant) gap in the proof. In this paper we close the gap by presenting a
correct reduction to prefix bucketing. Furthermore, we give a simplified and
improved analysis of the prefix bucketing lower bound. This improvement allows
us to extend the lower bounds for online labeling to the case where the number
m of labels is superpolynomial in n. In particular, for superpolynomial m we
get an asymptotically optimal lower bound of Ω((n log n) / (log log m - log
log n)).
Comment: 15 pages. Presented at European Symposium on Algorithms 201
Blame Trees
We consider the problem of merging individual text documents, motivated by the
single-file merge algorithms of document-based version control systems.
Abstracting away the merging of conflicting edits to an external
conflict-resolution function (possibly implemented by a human), we consider the
efficient identification of conflicting regions. We show how to implement a
tree-based document representation that can quickly answer a query inspired by
the "blame" query of some version control systems. A "blame" query associates
every line of a document with the revision in which it was last edited; our
tree uses this idea to quickly identify conflicting edits. We show how to
perform a merge operation in time proportional to the sum of the logarithms of
the shared regions of the documents, plus the cost of conflict resolution. Our
data structure is functional and therefore confluently persistent, allowing
arbitrary version DAGs as in real version-control systems. Our results rely on
concurrent traversal of two trees with short-circuiting when shared subtrees
are encountered.
Funding: United States. Defense Advanced Research Projects Agency (Clean-Slate
Design of Resilient, Adaptive, Secure Hosts (CRASH) program, BAA10-70); United
States. Defense Advanced Research Projects Agency (contract #N66001-10-2-4088,
Bridging the Security Gap with Decentralized Information Flow Control); Danish
National Research Foundation (Center for Massive Data Algorithmics (MADALGO))
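A toy line-level version conveys the blame idea, though it is nothing like the paper's confluently persistent tree: it assumes edits happen in place (same line count in all three versions), and the `resolve` callback stands in for the external conflict-resolution function.

```python
def merge(base, ours, theirs, resolve):
    """Three-way merge of equal-length line lists against a common ancestor.

    A line changed in only one branch wins automatically; a line changed
    differently in both branches is a true conflict, deferred to `resolve`.
    """
    out = []
    for b, o, t in zip(base, ours, theirs):
        if o == b:            # only "theirs" touched this line (or nobody did)
            out.append(t)
        elif t == b:          # only "ours" touched this line
            out.append(o)
        elif o == t:          # both branches made the identical edit
            out.append(o)
        else:                 # true conflict: external resolution decides
            out.append(resolve(b, o, t))
    return out

base   = ["a", "b", "c"]
ours   = ["a", "B", "c"]
theirs = ["a", "b", "C"]
merge(base, ours, theirs, lambda b, o, t: o + "|" + t)  # -> ["a", "B", "C"]
```

The paper's contribution is doing the equivalent comparison over tree-structured documents, short-circuiting whole shared subtrees instead of touching every line.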
Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing
This paper presents a general technique for optimally transforming any
dynamic data structure that operates on atomic and indivisible keys by
constant-time comparisons, into a data structure that handles unbounded-length
keys whose comparison cost is not a constant. Examples of these keys are
strings, multi-dimensional points, multiple-precision numbers, multi-key data
(e.g. records), XML paths, URL addresses, etc. The technique is more general
than previous work in that no particular exploitation of the underlying
structure of the keys is required. The only requirement is that the insertion
of a key must identify its predecessor or its successor.
Using the proposed technique, an online suffix tree can be constructed in
O(log n) worst-case time per input symbol (as opposed to the amortized
O(log n) time per symbol achieved by previously known algorithms). To our
knowledge, our algorithm is the first that achieves O(log n) worst-case time
per input symbol. Searching for a pattern of length m in the resulting suffix
tree takes O(m + log n + occ) time, where occ is the number of occurrences of
the pattern. The paper also describes further applications and shows how to
obtain alternative methods for suffix sorting, dynamic lowest common
ancestors, and order maintenance.
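The stated requirement, that an insertion identifies the new key's predecessor or successor, can be illustrated with a small sorted set of string keys (a sketch under simplifying assumptions; `StringSet` and its fields are made-up names). It caches each key's longest common prefix with its predecessor, the kind of information the transformation exploits so that expensive character-by-character key comparisons need not be repeated from scratch.

```python
import bisect

class StringSet:
    """Keys kept sorted; each key caches its longest common prefix (LCP)
    with its predecessor. Comparing unbounded-length keys costs up to
    their common-prefix length, so caching it lets later comparisons
    against a key skip the prefix it is known to share with a neighbor."""

    def __init__(self):
        self.keys = []
        self.lcps = []   # lcps[i] = LCP length of keys[i] with keys[i-1] (0 for i == 0)

    @staticmethod
    def _lcp(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    def insert(self, key):
        """Insert key; return its (predecessor, successor), since the
        technique requires insertions to identify one of the two."""
        i = bisect.bisect_left(self.keys, key)
        pred = self.keys[i - 1] if i > 0 else None
        succ = self.keys[i] if i < len(self.keys) else None
        self.keys.insert(i, key)
        self.lcps.insert(i, self._lcp(pred, key) if pred is not None else 0)
        if succ is not None:                 # successor's predecessor changed
            self.lcps[i + 1] = self._lcp(key, succ)
        return pred, succ
```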