1,262 research outputs found
The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space
An indexed sequence of strings is a data structure for storing a string
sequence that supports random access, searching, range counting and analytics
operations, both for exact matches and prefix search. String sequences lie at
the core of column-oriented databases, log processing, and other storage and
query tasks. In these applications each string can appear several times and the
order of the strings in the sequence is relevant. The prefix structure of the
strings is relevant as well: common prefixes are sought in strings to extract
interesting features from the sequence. Moreover, space-efficiency is highly
desirable as it translates directly into higher performance, since more data
can fit in fast memory.
We introduce and study the problem of compressed indexed sequence of strings,
representing indexed sequences of strings in nearly-optimal compressed space,
both in the static and dynamic settings, while preserving provably good
performance for the supported operations.
We present a new data structure for this problem, the Wavelet Trie, which
combines the classical Patricia Trie with the Wavelet Tree, a succinct data
structure for storing a compressed sequence. The resulting Wavelet Trie
smoothly adapts to a sequence of strings that changes over time. It improves on
the state-of-the-art compressed data structures by supporting a dynamic
alphabet (i.e. the set of distinct strings) and prefix queries, both crucial
requirements in the aforementioned applications, and on traditional indexes by
reducing space occupancy to close to the entropy of the sequence
Recognizing Partial Cubes in Quadratic Time
We show how to test whether a graph with n vertices and m edges is a partial
cube, and if so how to find a distance-preserving embedding of the graph into a
hypercube, in the near-optimal time bound O(n^2), improving previous O(nm)-time
solutions.Comment: 25 pages, five figures. This version significantly expands previous
versions, including a new report on an implementation of the algorithm and
experiments with i
New Algorithms for Position Heaps
We present several results about position heaps, a relatively new alternative
to suffix trees and suffix arrays. First, we show that, if we limit the maximum
length of patterns to be sought, then we can also limit the height of the heap
and reduce the worst-case cost of insertions and deletions. Second, we show how
to build a position heap in linear time independent of the size of the
alphabet. Third, we show how to augment a position heap such that it supports
access to the corresponding suffix array, and vice versa. Fourth, we introduce
a variant of a position heap that can be simulated efficiently by a compressed
suffix array with a linear number of extra bits
Online Sorting via Searching and Selection
In this paper, we present a framework based on a simple data structure and
parameterized algorithms for the problems of finding items in an unsorted list
of linearly ordered items based on their rank (selection) or value (search). As
a side-effect of answering these online selection and search queries, we
progressively sort the list. Our algorithms are based on Hoare's Quickselect,
and are parameterized based on the pivot selection method.
For example, if we choose the pivot as the last item in a subinterval, our
framework yields algorithms that will answer q<=n unique selection and/or
search queries in a total of O(n log q) average time. After q=\Omega(n) queries
the list is sorted. Each repeated selection query takes constant time, and each
repeated search query takes O(log n) time. The two query types can be
interleaved freely. By plugging different pivot selection methods into our
framework, these results can, for example, become randomized expected time or
deterministic worst-case time. Our methods are easy to implement, and we show
they perform well in practice
Towards Verifying Nonlinear Integer Arithmetic
We eliminate a key roadblock to efficient verification of nonlinear integer
arithmetic using CDCL SAT solvers, by showing how to construct short resolution
proofs for many properties of the most widely used multiplier circuits. Such
short proofs were conjectured not to exist. More precisely, we give n^{O(1)}
size regular resolution proofs for arbitrary degree 2 identities on array,
diagonal, and Booth multipliers and quasipolynomial- n^{O(\log n)} size proofs
for these identities on Wallace tree multipliers.Comment: Expanded and simplified with improved result
Document Retrieval on Repetitive Collections
Document retrieval aims at finding the most important documents where a
pattern appears in a collection of strings. Traditional pattern-matching
techniques yield brute-force document retrieval solutions, which has motivated
the research on tailored indexes that offer near-optimal performance. However,
an experimental study establishing which alternatives are actually better than
brute force, and which perform best depending on the collection
characteristics, has not been carried out. In this paper we address this
shortcoming by exploring the relationship between the nature of the underlying
collection and the performance of current methods. Via extensive experiments we
show that established solutions are often beaten in practice by brute-force
alternatives. We also design new methods that offer superior time/space
trade-offs, particularly on repetitive collections.Comment: Accepted to ESA 2014. Implementation and experiments at
http://www.cs.helsinki.fi/group/suds/rlcsa
- …
