19,267 research outputs found
Dictionary Matching with One Gap
The dictionary matching with gaps problem is to preprocess a dictionary
of gapped patterns over alphabet , where each
gapped pattern is a sequence of subpatterns separated by bounded
sequences of don't cares. Then, given a query text of length over
alphabet , the goal is to output all locations in in which a
pattern , , ends. There is a renewed current interest
in the gapped matching problem stemming from cyber security. In this paper we
solve the problem where all patterns in the dictionary have one gap with at
least and at most don't cares, where and are
given parameters. Specifically, we show that the dictionary matching with a
single gap problem can be solved in either time and
space, and query time , where is the number
of patterns found, or preprocessing time and space: , and query
time , where is the number of patterns found.
As far as we know, this is the best solution for this setting of the problem,
where many overlaps may exist in the dictionary.Comment: A preliminary version was published at CPM 201
String Indexing for Patterns with Wildcards
We consider the problem of indexing a string of length to report the
occurrences of a query pattern containing characters and wildcards.
Let be the number of occurrences of in , and the size of
the alphabet. We obtain the following results.
- A linear space index with query time .
This significantly improves the previously best known linear space index by Lam
et al. [ISAAC 2007], which requires query time in the worst case.
- An index with query time using space , where is the maximum number of wildcards allowed in the pattern.
This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length
gaps in the pattern. Our results are obtained using a novel combination of
well-known and new techniques, which could be of independent interest
String Matching with Variable Length Gaps
We consider string matching with variable length gaps. Given a string and
a pattern consisting of strings separated by variable length gaps
(arbitrary strings of length in a specified range), the problem is to find all
ending positions of substrings in that match . This problem is a basic
primitive in computational biology applications. Let and be the lengths
of and , respectively, and let be the number of strings in . We
present a new algorithm achieving time and space , where is the sum of the lower bounds of the lengths of the gaps in
and is the total number of occurrences of the strings in
within . Compared to the previous results this bound essentially achieves
the best known time and space complexities simultaneously. Consequently, our
algorithm obtains the best known bounds for almost all combinations of ,
, , , and . Our algorithm is surprisingly simple and
straightforward to implement. We also present algorithms for finding and
encoding the positions of all strings in for every match of the pattern.Comment: draft of full version, extended abstract at SPIRE 201
Palindromic Decompositions with Gaps and Errors
Identifying palindromes in sequences has been an interesting line of research
in combinatorics on words and also in computational biology, after the
discovery of the relation of palindromes in the DNA sequence with the HIV
virus. Efficient algorithms for the factorization of sequences into palindromes
and maximal palindromes have been devised in recent years. We extend these
studies by allowing gaps in decompositions and errors in palindromes, and also
imposing a lower bound to the length of acceptable palindromes.
We first present an algorithm for obtaining a palindromic decomposition of a
string of length n with the minimal total gap length in time O(n log n * g) and
space O(n g), where g is the number of allowed gaps in the decomposition. We
then consider a decomposition of the string in maximal \delta-palindromes (i.e.
palindromes with \delta errors under the edit or Hamming distance) and g
allowed gaps. We present an algorithm to obtain such a decomposition with the
minimal total gap length in time O(n (g + \delta)) and space O(n g).Comment: accepted to CSR 201
Opaque Service Virtualisation: A Practical Tool for Emulating Endpoint Systems
Large enterprise software systems make many complex interactions with other
services in their environment. Developing and testing for production-like
conditions is therefore a very challenging task. Current approaches include
emulation of dependent services using either explicit modelling or
record-and-replay approaches. Models require deep knowledge of the target
services while record-and-replay is limited in accuracy. Both face
developmental and scaling issues. We present a new technique that improves the
accuracy of record-and-replay approaches, without requiring prior knowledge of
the service protocols. The approach uses Multiple Sequence Alignment to derive
message prototypes from recorded system interactions and a scheme to match
incoming request messages against prototypes to generate response messages. We
use a modified Needleman-Wunsch algorithm for distance calculation during
message matching. Our approach has shown greater than 99% accuracy for four
evaluated enterprise system messaging protocols. The approach has been
successfully integrated into the CA Service Virtualization commercial product
to complement its existing techniques.Comment: In Proceedings of the 38th International Conference on Software
Engineering Companion (pp. 202-211). arXiv admin note: text overlap with
arXiv:1510.0142
Mind the Gap: Essentially Optimal Algorithms for Online Dictionary Matching with One Gap
We examine the complexity of the online Dictionary Matching with One Gap Problem (DMOG) which is the following. Preprocess a dictionary D of d patterns, where each pattern contains a special gap symbol that can match any string, so that given a text that arrives online, a character at a time, we can report all of the patterns from D that are suffixes of the text that has arrived so far, before the next character arrives. In more general versions the gap symbols are associated with bounds determining the possible lengths of matching strings. Online DMOG captures the difficulty in a bottleneck procedure for cyber-security, as many digital signatures of viruses manifest themselves as patterns with a single gap.
In this paper, we demonstrate that the difficulty in obtaining efficient solutions for the DMOG problem, even in the offline setting, can be traced back to the infamous 3SUM conjecture. We show a conditional lower bound of Omega(delta(G_D)+op) time per text character, where G_D is a bipartite graph that captures the structure of D, delta(G_D) is the degeneracy of this graph, and op is the output size. Moreover, we show a conditional lower bound in terms of the magnitude of gaps for the bounded case, thereby showing that some known offline upper bounds are essentially optimal.
We also provide matching upper-bounds (up to sub-polynomial factors), in terms of the degeneracy, for the online DMOG problem. In particular, we introduce algorithms whose time cost depends linearly on delta(G_D). Our algorithms make use of graph orientations, together with some additional techniques. These algorithms are of practical interest since although delta(G_D) can be as large as sqrt(d), and even larger if G_D is a multi-graph, it is typically a very small constant in practice. Finally, when delta(G_D) is large we are able to obtain even more efficient solutions
- …