2,286 research outputs found
String Indexing for Patterns with Wildcards
We consider the problem of indexing a string of length to report the
occurrences of a query pattern containing characters and wildcards.
Let be the number of occurrences of in , and the size of
the alphabet. We obtain the following results.
- A linear space index with query time .
This significantly improves the previously best known linear space index by Lam
et al. [ISAAC 2007], which requires query time in the worst case.
- An index with query time using space , where is the maximum number of wildcards allowed in the pattern.
This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length
gaps in the pattern. Our results are obtained using a novel combination of
well-known and new techniques, which could be of independent interest
String Matching with Variable Length Gaps
We consider string matching with variable length gaps. Given a string and
a pattern consisting of strings separated by variable length gaps
(arbitrary strings of length in a specified range), the problem is to find all
ending positions of substrings in that match . This problem is a basic
primitive in computational biology applications. Let and be the lengths
of and , respectively, and let be the number of strings in . We
present a new algorithm achieving time and space , where is the sum of the lower bounds of the lengths of the gaps in
and is the total number of occurrences of the strings in
within . Compared to the previous results this bound essentially achieves
the best known time and space complexities simultaneously. Consequently, our
algorithm obtains the best known bounds for almost all combinations of ,
, , , and . Our algorithm is surprisingly simple and
straightforward to implement. We also present algorithms for finding and
encoding the positions of all strings in for every match of the pattern.Comment: draft of full version, extended abstract at SPIRE 201
Dictionary Matching with One Gap
The dictionary matching with gaps problem is to preprocess a dictionary
of gapped patterns over alphabet , where each
gapped pattern is a sequence of subpatterns separated by bounded
sequences of don't cares. Then, given a query text of length over
alphabet , the goal is to output all locations in in which a
pattern , , ends. There is a renewed current interest
in the gapped matching problem stemming from cyber security. In this paper we
solve the problem where all patterns in the dictionary have one gap with at
least and at most don't cares, where and are
given parameters. Specifically, we show that the dictionary matching with a
single gap problem can be solved in either time and
space, and query time , where is the number
of patterns found, or preprocessing time and space: , and query
time , where is the number of patterns found.
As far as we know, this is the best solution for this setting of the problem,
where many overlaps may exist in the dictionary.Comment: A preliminary version was published at CPM 201
Mind the Gap: Essentially Optimal Algorithms for Online Dictionary Matching with One Gap
We examine the complexity of the online Dictionary Matching with One Gap Problem (DMOG) which is the following. Preprocess a dictionary D of d patterns, where each pattern contains a special gap symbol that can match any string, so that given a text that arrives online, a character at a time, we can report all of the patterns from D that are suffixes of the text that has arrived so far, before the next character arrives. In more general versions the gap symbols are associated with bounds determining the possible lengths of matching strings. Online DMOG captures the difficulty in a bottleneck procedure for cyber-security, as many digital signatures of viruses manifest themselves as patterns with a single gap.
In this paper, we demonstrate that the difficulty in obtaining efficient solutions for the DMOG problem, even in the offline setting, can be traced back to the infamous 3SUM conjecture. We show a conditional lower bound of Omega(delta(G_D)+op) time per text character, where G_D is a bipartite graph that captures the structure of D, delta(G_D) is the degeneracy of this graph, and op is the output size. Moreover, we show a conditional lower bound in terms of the magnitude of gaps for the bounded case, thereby showing that some known offline upper bounds are essentially optimal.
We also provide matching upper-bounds (up to sub-polynomial factors), in terms of the degeneracy, for the online DMOG problem. In particular, we introduce algorithms whose time cost depends linearly on delta(G_D). Our algorithms make use of graph orientations, together with some additional techniques. These algorithms are of practical interest since although delta(G_D) can be as large as sqrt(d), and even larger if G_D is a multi-graph, it is typically a very small constant in practice. Finally, when delta(G_D) is large we are able to obtain even more efficient solutions
Fast Indexes for Gapped Pattern Matching
We describe indexes for searching large data sets for variable-length-gapped
(VLG) patterns. VLG patterns are composed of two or more subpatterns, between
each adjacent pair of which is a gap-constraint specifying upper and lower
bounds on the distance allowed between subpatterns. VLG patterns have numerous
applications in computational biology (motif search), information retrieval
(e.g., for language models, snippet generation, machine translation) and
capture a useful subclass of the regular expressions commonly used in practice
for searching source code. Our best approach provides search speeds several
times faster than prior art across a broad range of patterns and texts.Comment: This research is supported by Academy of Finland through grant 319454
and has received funding from the European Union's Horizon 2020 research and
innovation programme under the Marie Sklodowska-Curie Actions
H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Gapped Indexing for Consecutive Occurrences
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns P? and P? and a gap range [?, ?] we can quickly find the consecutive occurrences of P? and P? with distance in [?, ?], i.e., pairs of subsequent occurrences with distance within the range. We present data structures that use O?(n) space and query time O?(|P?|+|P?|+n^{2/3}) for existence and counting and O?(|P?|+|P?|+n^{2/3}occ^{1/3}) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using O?(n) space must use ??(|P?| + |P?| + ?n) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem
- …