String Indexing with Compressed Patterns
Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
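To make the notion of an LZ77-compressed pattern concrete, the following is a minimal sketch of a greedy LZ77 parse and its decoder. It assumes the textbook triple format (offset, length, next character) and a quadratic-time search; the function names are illustrative, and this is not the data structure proposed in the paper, which processes such phrases without decompressing them.

```python
def lz77_factorize(s):
    """Greedy LZ77 parse of s into (offset, length, next_char) triples.

    Naive O(n^2)-style search, for illustration only. Copies may overlap
    the position being encoded (self-referential phrases).
    """
    phrases = []
    n = len(s)
    i = 0
    while i < n:
        best_len, best_off = 0, 0
        for j in range(i):
            length = 0
            # Stop one character early so a literal next_char always exists.
            while i + length < n - 1 and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        phrases.append((best_off, best_len, s[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the string from the triples produced by lz77_factorize."""
    out = []
    for offset, length, ch in phrases:
        start = len(out) - offset
        # Copy one character at a time so overlapping copies work.
        for k in range(length):
            out.append(out[start + k])
        out.append(ch)
    return "".join(out)


pattern = "abababababa"
phrases = lz77_factorize(pattern)
print(phrases)               # [(0, 0, 'a'), (0, 0, 'b'), (2, 8, 'a')]
print(lz77_decode(phrases))  # abababababa
```

A pattern of length 11 is described here by 3 phrases, which is the sense in which query time can be measured in the compressed size rather than the plain length.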
Partial fillup and search time in LC tries
Andersson and Nilsson introduced in 1993 a level-compressed trie (in short:
LC trie) in which a full subtree of a node is compressed to a single node of
degree being the size of the subtree. Recent experimental results indicated a
'dramatic improvement' when full subtrees are replaced by partially filled
subtrees. In this paper, we provide a theoretical justification of these
experimental results showing, among others, a rather moderate improvement of
the search time over the original LC tries. For such an analysis, we assume
that n strings are generated independently by a binary memoryless source with p
denoting the probability of emitting a 1. We first prove that the so-called
alpha-fillup level (i.e., the largest level in a trie with alpha fraction of
nodes present at this level) is concentrated on two values with high
probability. We give these values explicitly up to O(1), and observe that the
value of alpha (strictly between 0 and 1) does not affect the leading term.
This result directly yields the typical depth (search time) in the alpha-LC
tries with p not equal to 1/2, which turns out to be C loglog n for an
explicitly given constant C (depending on p but not on alpha). This should be
compared with recently found typical depth in the original LC tries which is C'
loglog n for a larger constant C'. The search time in alpha-LC tries is thus
smaller but of the same order as in the original LC tries.
Comment: 13 pages
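The alpha-fillup level defined above can be illustrated with a small sketch. It assumes a simplified model in which a node is counted as present at level k whenever at least one input string has that length-k prefix; the paper's trie may treat leaves differently, and the function name is hypothetical.

```python
def alpha_fillup_level(strings, alpha):
    """Largest level k at which at least an alpha fraction of the 2**k
    possible binary nodes is present (simplified prefix-counting model)."""
    max_len = min(len(s) for s in strings)
    level = 0
    for k in range(1, max_len + 1):
        present = len({s[:k] for s in strings})
        # The fraction present / 2**k never increases with k,
        # so the first failing level ends the search.
        if present < alpha * 2 ** k:
            break
        level = k
    return level


keys = ["000", "001", "010", "011", "100"]
print(alpha_fillup_level(keys, 1.0))   # 1  (level 2 has only 3 of 4 nodes)
print(alpha_fillup_level(keys, 0.75))  # 2
print(alpha_fillup_level(keys, 0.5))   # 3
```

With alpha = 1 this reduces to the ordinary fillup level used by the original LC trie; smaller alpha lets a partially filled subtree be compressed as well.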
On Locating Paths in Compressed Tries
In this paper, we consider the problem of compressing a trie while supporting
the powerful \emph{locate} queries: to return the pre-order identifiers of all
nodes reached by a path labeled with a given query pattern. Our result builds
on top of the XBWT tree transform of Ferragina et al. [FOCS 2005] and
generalizes the \emph{r-index} locate machinery of Gagie et al. [SODA 2018,
JACM 2020] based on the run-length encoded Burrows-Wheeler transform (BWT). Our
first contribution is to propose a suitable generalization of the run-length
BWT to tries. We show that this natural generalization enjoys several of the
useful properties of its counterpart on strings: in particular, the transform
natively supports counting occurrences of a query pattern on the trie's paths
and its size captures the trie's repetitiveness and lower-bounds a natural
notion of trie entropy. Our main contribution is a much deeper insight into the
combinatorial structure of this object. In detail, we show that a data
structure of bits, where is the number of nodes,
allows locating the occurrences of a pattern of length in
nearly-optimal time, where is the alphabet's
size. Our solution consists in sampling nodes that can be used as
"anchor points" during the locate process. Once the pre-order identifier of
the first pattern occurrence (in co-lexicographic order) has been obtained, we
show that a constant number of constant-time jumps between those anchor points
leads to the identifier of the next pattern occurrence, thus enabling locating
in optimal time per occurrence.
Comment: Improved toehold lemma running time; added more detailed proofs that
take care of all border cases in the locate strategy; postprint version to
appear in SODA 202
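For readers unfamiliar with the string case that this work generalizes, here is a minimal sketch of the Burrows-Wheeler transform and its run-length encoding. It assumes a sentinel character `$` that is smaller than every text character and builds the transform by sorting rotations, which is far too slow for real inputs; the trie generalization (XBWT) and the locate machinery from the paper are not shown.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (illustrative only)."""
    assert sentinel not in text
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse s into (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(ch, count) for ch, count in runs]


transformed = bwt("banana")
print(transformed)                     # annb$aa
print(run_length_encode(transformed))  # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
```

The number of runs (5 here) is the repetitiveness measure that run-length BWT indexes are sized by on strings.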
A fast approach for overcomplete sparse decomposition based on smoothed L0 norm
In this paper, a fast algorithm for overcomplete sparse decomposition, called
SL0, is proposed. The algorithm is essentially a method for obtaining sparse
solutions of underdetermined systems of linear equations, and its applications
include underdetermined Sparse Component Analysis (SCA), atomic decomposition
on overcomplete dictionaries, compressed sensing, and decoding real field
codes. Contrary to previous methods, which usually solve this problem by
minimizing the L1 norm using Linear Programming (LP) techniques, our algorithm
tries to directly minimize the L0 norm. It is experimentally shown that the
proposed algorithm is about two to three orders of magnitude faster than the
state-of-the-art interior-point LP solvers, while providing the same (or
better) accuracy.
Comment: Accepted in IEEE Transactions on Signal Processing. For MATLAB codes,
see (http://ee.sharif.ir/~SLzero). File replaced, because Fig. 5 was missing
erroneously.
On the Benefit of Merging Suffix Array Intervals for Parallel Pattern Matching
We present parallel algorithms for exact and approximate pattern matching
with suffix arrays, using a CREW-PRAM with processors. Given a static text
of length , we first show how to compute the suffix array interval of a
given pattern of length in
time for . For approximate pattern matching with differences or
mismatches, we show how to compute all occurrences of a given pattern in
time, where is the size of the alphabet
and . The workhorse of our algorithms is a data structure
for merging suffix array intervals quickly: Given the suffix array intervals
for two patterns and , we present a data structure for computing the
interval of in sequential time, or in
parallel time. All our data structures are of size bits (in addition to
the suffix array).
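As background for the term "suffix array interval", the sketch below computes it sequentially with two binary searches. It assumes a naive suffix array construction by sorting suffixes and uses illustrative function names; it is the plain sequential baseline, not the parallel algorithm or the interval-merging data structure described in the abstract.

```python
def build_suffix_array(text):
    """Suffix array by direct sorting of suffixes (illustrative only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Half-open interval [left, right) of suffix array entries whose
    suffixes start with pattern; left == right means no occurrence."""
    m = len(pattern)

    # First entry whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # First entry whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


text = "banana"
sa = build_suffix_array(text)
left, right = sa_interval(text, sa, "ana")
print(sa)                       # [5, 3, 1, 0, 4, 2]
print((left, right))            # (1, 3)
print(sorted(sa[left:right]))   # [1, 3]
```

Every occurrence of the pattern corresponds to one entry inside the interval, so the interval width equals the number of occurrences.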
String Indexing for Patterns with Wildcards
We consider the problem of indexing a string of length to report the
occurrences of a query pattern containing characters and wildcards.
Let be the number of occurrences of in , and the size of
the alphabet. We obtain the following results.
- A linear space index with query time .
This significantly improves the previously best known linear space index by Lam
et al. [ISAAC 2007], which requires query time in the worst case.
- An index with query time using space , where is the maximum number of wildcards allowed in the pattern.
This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length
gaps in the pattern. Our results are obtained using a novel combination of
well-known and new techniques, which could be of independent interest.
Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries
We present the first thorough practical study of the Lempel-Ziv-78 and the
Lempel-Ziv-Welch computation based on trie data structures. With a careful
selection of trie representations we can beat well-tuned popular trie data
structures like Judy, m-Bonsai or Cedar.
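To show why a trie is the central structure in this computation, here is a minimal sketch of the LZ78 factorization with the dictionary kept as a trie. It assumes a simple list-of-dictionaries trie and illustrative function names; it does not reflect the tuned trie representations evaluated in the paper.

```python
def lz78_factorize(text):
    """LZ78 parse: each phrase is (index of an earlier phrase, new char).

    Phrase index 0 is the empty phrase (the trie root). children[v] maps a
    character to the child of trie node v; node k spells the k-th phrase.
    """
    children = [{}]
    phrases = []
    node = 0
    for ch in text:
        nxt = children[node].get(ch)
        if nxt is not None:
            node = nxt  # keep extending the current match
        else:
            children[node][ch] = len(children)  # new trie node = new phrase
            children.append({})
            phrases.append((node, ch))
            node = 0
    if node != 0:
        # The text ended inside an existing phrase; emit it without a new char.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the text from the (reference, char) pairs."""
    table = [""]
    for ref, ch in phrases:
        table.append(table[ref] + ch)
    return "".join(table[1:])


phrases = lz78_factorize("abababab")
print(phrases)               # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, '')]
print(lz78_decode(phrases))  # abababab
```

Each input character costs one child lookup in the trie, which is why the choice of trie representation dominates the running time and memory of the factorization.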
c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches
Given a dynamic set of strings of total length whose characters
are drawn from an alphabet of size , a keyword dictionary is a data
structure built on that provides locate, prefix search, and update
operations on . Under the assumption that
characters fit into a single machine word , we propose a keyword dictionary
that represents in bits of space,
supporting all operations in expected time on an
input string of length in the word RAM model. This data structure is
backed by an exhaustive practical evaluation, highlighting the practical
usefulness of the proposed data structure, especially for prefix searches - one
of the most elementary keyword dictionary operations.
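The three keyword dictionary operations named above (locate, prefix search, update) can be sketched with a plain pointer-based trie. This assumes one Python object per node and hypothetical method names; it illustrates the interface only and has none of the compact, word-packed layout that c-trie++ is about.

```python
class Trie:
    """Minimal dynamic keyword dictionary (illustrative, not space-efficient)."""

    def __init__(self):
        self.children = {}
        self.is_key = False

    def insert(self, key):
        """Update operation: add key to the dictionary."""
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.is_key = True

    def _descend(self, s):
        """Follow the path labeled s; return its end node or None."""
        node = self
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def locate(self, key):
        """True if key itself is stored in the dictionary."""
        node = self._descend(key)
        return node is not None and node.is_key

    def prefix_search(self, prefix):
        """All stored keys that start with prefix, in sorted order."""
        start = self._descend(prefix)
        if start is None:
            return []
        results = []
        stack = [(start, prefix)]
        while stack:
            node, spelled = stack.pop()
            if node.is_key:
                results.append(spelled)
            for ch, child in node.children.items():
                stack.append((child, spelled + ch))
        return sorted(results)


d = Trie()
for word in ["tree", "trie", "trip", "tea"]:
    d.insert(word)
print(d.locate("trie"))        # True
print(d.locate("tri"))         # False
print(d.prefix_search("tri"))  # ['trie', 'trip']
```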
- …