### The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

An indexed sequence of strings is a data structure for storing a string
sequence that supports random access, searching, range counting and analytics
operations, both for exact matches and prefix search. String sequences lie at
the core of column-oriented databases, log processing, and other storage and
query tasks. In these applications each string can appear several times and the
order of the strings in the sequence is relevant. The prefix structure of the
strings is relevant as well: common prefixes are sought in strings to extract
interesting features from the sequence. Moreover, space-efficiency is highly
desirable as it translates directly into higher performance, since more data
can fit in fast memory.
We introduce and study the problem of compressed indexed sequences of strings:
representing indexed sequences of strings in nearly-optimal compressed space,
both in the static and dynamic settings, while preserving provably good
performance for the supported operations.
We present a new data structure for this problem, the Wavelet Trie, which
combines the classical Patricia Trie with the Wavelet Tree, a succinct data
structure for storing a compressed sequence. The resulting Wavelet Trie
smoothly adapts to a sequence of strings that changes over time. It improves on
the state-of-the-art compressed data structures by supporting a dynamic
alphabet (i.e., the set of distinct strings) and prefix queries, both crucial
requirements in the aforementioned applications, and on traditional indexes by
reducing space occupancy to close to the entropy of the sequence.
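
To make the query interface concrete, here is a minimal, uncompressed Python sketch of the operations an indexed sequence of strings is described as supporting: random access, counting exact and prefix matches in a range, searching, and updates. It is not the Wavelet Trie: it keeps the strings in a plain list and answers every query by a linear scan. The class and method names (`NaiveIndexedSequence`, `access`, `rank`, `rank_prefix`, `select`, `insert`, `delete`) are assumptions chosen for illustration.

```python
class NaiveIndexedSequence:
    """Uncompressed reference model of an indexed sequence of strings.

    Illustrative only: every query is a linear scan over a Python list,
    whereas a compressed index would answer it without scanning.
    """

    def __init__(self, strings=()):
        self._items = list(strings)

    def __len__(self):
        return len(self._items)

    def access(self, pos):
        """Return the string stored at position pos."""
        return self._items[pos]

    def rank(self, s, pos):
        """Count occurrences of s among the first pos strings."""
        return sum(1 for item in self._items[:pos] if item == s)

    def rank_prefix(self, prefix, pos):
        """Count strings starting with prefix among the first pos strings."""
        return sum(1 for item in self._items[:pos] if item.startswith(prefix))

    def select(self, s, idx):
        """Return the position of the idx-th (0-based) occurrence of s."""
        seen = 0
        for pos, item in enumerate(self._items):
            if item == s:
                if seen == idx:
                    return pos
                seen += 1
        raise ValueError("fewer than idx + 1 occurrences of the string")

    def insert(self, s, pos):
        """Insert s so that it ends up at position pos (dynamic setting)."""
        self._items.insert(pos, s)

    def delete(self, pos):
        """Remove the string at position pos (dynamic setting)."""
        del self._items[pos]
```

A range count over positions `[lo, hi)` follows from two rank calls, for example `seq.rank_prefix(p, hi) - seq.rank_prefix(p, lo)`.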

### On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree

Exact pattern matching in labeled graphs is the problem of searching for paths
in a graph $G=(V,E)$ that spell the same string as the pattern $P[1..m]$. This
basic problem can be found at the heart of more complex operations on variation
graphs in computational biology, of query operations in graph databases, and of
analysis operations in heterogeneous networks, where the nodes of some paths
must match a sequence of labels or types. We describe a simple conditional
lower bound stating that, for any constant $\epsilon>0$, an $O(|E|^{1 - \epsilon} \,
m)$-time or an $O(|E| \, m^{1 - \epsilon})$-time algorithm for exact pattern
matching on graphs, with node labels and patterns drawn from a binary alphabet,
cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is
false. The result holds even if restricted to undirected graphs of maximum
degree three or directed acyclic graphs of maximum sum of indegree and
outdegree three. Although a conditional lower bound of this kind can be somehow
derived from previous results (Backurs and Indyk, FOCS'16), we give a direct
reduction from SETH for dissemination purposes, as the result might interest
researchers from several areas, such as computational biology, graph databases,
and graph mining, as mentioned before. Indeed, as approximate pattern matching
on graphs can be solved in $O(|E|\,m)$ time, exact and approximate matching are
thus equally hard (quadratic time) on graphs under the SETH assumption. In
comparison, the same problems restricted to strings have linear-time and
quadratic-time solutions, respectively, where the latter have a matching
SETH lower bound on computing the edit distance of two strings (Backurs and
Indyk, STOC'15).

Comment: Using Lemma 12 and Lemma 13 might be enough to prove Lemma 14.
However, the proof of Lemma 14 is correct if you assume that the graph used
in the reduction is a DAG. Hence, since the problem is already quadratic for
a DAG and a binary alphabet, it has to be quadratic also for a general graph
and a binary alphabet.
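
For intuition about the quadratic upper bound that the lower bound matches, here is a minimal Python sketch of a straightforward dynamic program for exact matching on a node-labeled directed graph. It tracks, for each pattern prefix, the set of nodes at which a matching path can end, and runs in $O(m\,(|V| + |E|))$ time. It assumes that a path may revisit nodes and that every node carries a single character. The function name `has_matching_path` and the input encoding (a label dictionary plus an edge list) are illustrative assumptions, not taken from the paper.

```python
def has_matching_path(labels, edges, pattern):
    """Return True if some path in the graph spells `pattern`.

    labels:  dict mapping each node to its single-character label
    edges:   iterable of directed edges (u, v)
    pattern: the string to match

    Assumption: paths may revisit nodes.
    """
    if not pattern:
        return True

    # Adjacency lists for the directed edges.
    successors = {node: [] for node in labels}
    for u, v in edges:
        successors[u].append(v)

    # Nodes at which a path spelling pattern[:1] can end.
    current = {node for node, ch in labels.items() if ch == pattern[0]}

    # Extend the matched prefix by one character per round.
    # Each round scans every edge at most once.
    for ch in pattern[1:]:
        following = set()
        for u in current:
            for v in successors[u]:
                if labels[v] == ch:
                    following.add(v)
        current = following
        if not current:
            return False
    return bool(current)
```

For an undirected graph, each edge can be supplied in both directions.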

### Round-Hashing for Data Storage: Distributed Servers and External-Memory Tables

This paper proposes round-hashing, which is suitable for data storage on distributed servers and for implementing external-memory tables in which each lookup retrieves at most one block of external memory, using a stash. For data storage, round-hashing resembles consistent hashing in that it avoids a full rehashing of the keys when new servers are added. Experiments show that requests are served at least ten times faster than with the state of the art. In distributed data storage, this guarantees better throughput for serving requests and, moreover, greatly reduces the time needed to decide which data should move to new servers, as rescanning data is much faster.
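
The abstract compares round-hashing to consistent hashing. As background, here is a minimal Python sketch of a classical consistent-hashing ring, which is the baseline technique and not round-hashing itself. It illustrates the shared property: when a server is added, the only keys that change owner are those that move to the new server. The class name `ConsistentHashRing`, the use of MD5 for point placement, and the `replicas` parameter are illustrative assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 64-bit point on the ring (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Classical consistent hashing with virtual nodes (illustrative sketch)."""

    def __init__(self, servers=(), replicas=64):
        self._replicas = replicas
        self._points = []  # sorted list of (ring position, server name)
        for server in servers:
            self.add_server(server)

    def add_server(self, server: str) -> None:
        """Place `replicas` virtual nodes for the server on the ring."""
        for r in range(self._replicas):
            bisect.insort(self._points, (_hash(f"{server}#{r}"), server))

    def lookup(self, key: str) -> str:
        """Return the server owning the first ring point at or after the key."""
        if not self._points:
            raise LookupError("no servers on the ring")
        idx = bisect.bisect_left(self._points, (_hash(key), ""))
        if idx == len(self._points):
            idx = 0  # wrap around the ring
        return self._points[idx][1]
```

Comparing `lookup` results before and after an `add_server` call shows which keys would have to be migrated.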

### String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

The Burrows-Wheeler transform (BWT) is an invertible text transformation that,
given a text $T$ of length $n$, permutes its symbols according to the
lexicographic order of suffixes of $T$. BWT is one of the most heavily studied
algorithms in data compression with numerous applications in indexing, sequence
analysis, and bioinformatics. Its construction is a bottleneck in many
scenarios, and settling the complexity of this task is one of the most
important unsolved problems in sequence analysis that has remained open for 25
years. Given a binary string of length $n$, occupying $O(n/\log n)$ machine
words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009)
runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui,
STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size
dependency in the time complexity, but they still require $\Omega(n)$ time.
In this paper, we propose the first algorithm that breaks the $O(n)$-time
barrier for BWT construction. Given a binary string of length $n$, our
procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and
$O(n/\log n)$ space. We complement this result with a conditional lower bound
proving that any further progress in the time complexity of BWT construction
would yield faster algorithms for the very well studied problem of counting
inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time
solution by Chan and P\v{a}tra\c{s}cu (SODA 2010). Our algorithm is based on a
novel concept of string synchronizing sets, which is of independent interest.
As one of the applications, we show that this technique lets us design a data
structure of the optimal size $O(n/\log n)$ that answers Longest Common
Extension queries (LCE queries) in $O(1)$ time and, furthermore, can be
deterministically constructed in the optimal $O(n/\log n)$ time.

Comment: Full version of a paper accepted to STOC 201
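
To illustrate what the transform computes, here is a minimal Python sketch that builds the BWT by explicitly sorting suffixes and inverts it again. This is the naive textbook construction and is far slower than the algorithms discussed in the abstract. It assumes a sentinel character `$` that does not occur in the text and sorts before every text symbol. The function names `bwt` and `inverse_bwt` are illustrative.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all suffixes of text + sentinel, then take the
    symbol preceding each suffix in sorted order."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # Suffix array by direct comparison of suffixes (quadratic or worse).
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # For suffix 0 the preceding symbol wraps around to the sentinel.
    return "".join(t[i - 1] for i in suffix_array)


def inverse_bwt(last, sentinel="$"):
    """Invert the BWT using a stable sort of the last column."""
    # order[j] is the row whose last symbol is the j-th symbol of the
    # sorted first column; following it walks the text left to right.
    order = sorted(range(len(last)), key=lambda i: last[i])
    row = order[0]  # row 0 starts with the sentinel
    out = []
    for _ in range(len(last) - 1):
        out.append(last[order[row]])  # first symbol of the current row
        row = order[row]
    return "".join(out)
```

For instance, `bwt("banana")` yields `"annb$aa"`, and `inverse_bwt("annb$aa")` recovers `"banana"`.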

### Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing

This paper presents a general technique for optimally transforming any
dynamic data structure that operates on atomic and indivisible keys by
constant-time comparisons, into a data structure that handles unbounded-length
keys whose comparison cost is not a constant. Examples of these keys are
strings, multi-dimensional points, multiple-precision numbers, multi-key data
(e.g., records), XML paths, URL addresses, etc. The technique is more general
than what has been done in previous work as no particular exploitation of the
underlying structure of the keys is required. The only requirement is that the
insertion of a key must identify its predecessor or its successor.
Using the proposed technique, an online suffix tree can be constructed in
worst-case time $O(\log n)$ per input symbol (as opposed to amortized
$O(\log n)$ time per symbol, achieved by previously known algorithms). To our
knowledge, our algorithm is the first that achieves $O(\log n)$ worst-case time
per input symbol. Searching for a pattern of length $m$ in the resulting suffix
tree takes $O(\min(m\log |\Sigma|, m + \log n) + tocc)$ time, where $tocc$ is
the number of occurrences of the pattern. The paper also describes more
applications and shows how to obtain alternative methods for dealing with
suffix sorting, dynamic lowest common ancestors and order maintenance.

### Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

AMS subject classifications. 68W05, 68Q25, 68P05, 68P10, 68P30
DOI. 10.1137/S0097539702402354

The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\lg_{|\Sigma|} n)$, which is significant when $\Sigma$ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(occ)$ for listing the $occ$ pattern occurrences.

We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $O(m/\lg_{|\Sigma|} n + \lg^{\epsilon}_{|\Sigma|} n)$ search time in the worst case, for any constant $0 < \epsilon \le 1$, using at most $(\epsilon^{-1} + O(1))\, n \lg |\Sigma|$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30–40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(occ \lg^{\epsilon}_{|\Sigma|} n)$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m/\lg_{|\Sigma|} n + occ)$ time, which is optimal.
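
As a point of comparison for the uncompressed indexes mentioned in the abstract, here is a minimal Python sketch of a plain suffix array with binary-search pattern matching. It is the classical baseline, not the compressed index of the paper: the array is built by naive suffix sorting, and each query takes $O(m \lg n + occ)$ character comparisons rather than the $O(m + \lg n)$ bound, which needs extra machinery. The function names `build_suffix_array` and `find_occurrences` are illustrative.

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes in sorted order.

    Naive construction by direct suffix comparison, for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted starting positions of pattern in text.

    Two binary searches locate the contiguous block of suffixes that
    start with the pattern; each probe compares at most m characters.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])
```

The suffix array stores one integer per text symbol, which is the $\Omega(n \lg n)$-bit overhead that the compressed representation is designed to avoid.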
