
    Dictionary Matching with One Gap

    The dictionary matching with gaps problem is to preprocess a dictionary $D$ of $d$ gapped patterns $P_1,\ldots,P_d$ over alphabet $\Sigma$, where each gapped pattern $P_i$ is a sequence of subpatterns separated by bounded sequences of don't cares. Then, given a query text $T$ of length $n$ over alphabet $\Sigma$, the goal is to output all locations in $T$ in which a pattern $P_i\in D$, $1\leq i\leq d$, ends. There is a renewed current interest in the gapped matching problem stemming from cyber security. In this paper we solve the problem where all patterns in the dictionary have one gap with at least $\alpha$ and at most $\beta$ don't cares, where $\alpha$ and $\beta$ are given parameters. Specifically, we show that the dictionary matching with a single gap problem can be solved in either $O(d\log d + |D|)$ time and $O(d\log^{\varepsilon} d + |D|)$ space, and query time $O(n(\beta -\alpha )\log\log d \log ^2 \min \{ d, \log |D| \} + occ)$, where $occ$ is the number of patterns found, or preprocessing time and space: $O(d^2 + |D|)$, and query time $O(n(\beta -\alpha ) + occ)$, where $occ$ is the number of patterns found. As far as we know, this is the best solution for this setting of the problem, where many overlaps may exist in the dictionary. Comment: A preliminary version was published at CPM 201
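To make the problem statement concrete, here is a minimal brute-force sketch of dictionary matching with one gap. It is a naive baseline, not the algorithm from the paper, and it makes no attempt to reach the stated bounds. The names `find_all` and `match_one_gap`, and the representation of each pattern as a `(left, right)` pair of subpatterns, are assumptions made for illustration.

```python
from typing import List, Tuple


def find_all(text: str, sub: str) -> List[int]:
    """Return every start index of sub in text (overlapping hits included)."""
    starts = []
    i = text.find(sub)
    while i != -1:
        starts.append(i)
        i = text.find(sub, i + 1)
    return starts


def match_one_gap(
    text: str, dictionary: List[Tuple[str, str]], alpha: int, beta: int
) -> List[Tuple[int, int]]:
    """Report (end_position, pattern_index) for every gapped pattern that ends
    in text, where the gap between the two subpatterns spans between alpha and
    beta characters (inclusive). Hypothetical helper, brute force only."""
    results = []
    for idx, (left, right) in enumerate(dictionary):
        # Exclusive end positions of every occurrence of the first subpattern.
        left_ends = {s + len(left) for s in find_all(text, left)}
        for s in find_all(text, right):
            # The gap length is s - e for some left occurrence ending at e.
            if any((s - g) in left_ends for g in range(alpha, beta + 1)):
                results.append((s + len(right) - 1, idx))
    return sorted(results)
```

For example, with the text `"abxxcd"` and the single pattern `("ab", "cd")`, a gap range of 1 to 3 reports a match ending at index 5, while a range of 3 to 4 reports nothing because the actual gap has length 2.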

    String Indexing for Patterns with Wildcards

    We consider the problem of indexing a string $t$ of length $n$ to report the occurrences of a query pattern $p$ containing $m$ characters and $j$ wildcards. Let $occ$ be the number of occurrences of $p$ in $t$, and $\sigma$ the size of the alphabet. We obtain the following results.
    - A linear space index with query time $O(m+\sigma^j \log \log n + occ)$. This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time $\Theta(jn)$ in the worst case.
    - An index with query time $O(m+j+occ)$ using space $O(\sigma^{k^2} n \log^k \log n)$, where $k$ is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
    - A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
    We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
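As a point of reference for what the indexes above must report, the following sketch scans the text directly instead of building an index. It only illustrates the query semantics and is far from the bounds in the abstract; the function name and the choice of `*` as the wildcard symbol are assumptions.

```python
from typing import List


def match_with_wildcards(text: str, pattern: str, wildcard: str = "*") -> List[int]:
    """Return every start index where pattern occurs in text, treating each
    wildcard symbol as matching any single character. Naive scan, no index."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(p == wildcard or p == c for p, c in zip(pattern, text[i:i + m]))
    ]
```

Each wildcard matches exactly one character here; variable length gaps, which the abstract mentions as a generalization, are not covered by this sketch.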

    String Matching with Variable Length Gaps

    We consider string matching with variable length gaps. Given a string $T$ and a pattern $P$ consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in $T$ that match $P$. This problem is a basic primitive in computational biology applications. Let $m$ and $n$ be the lengths of $P$ and $T$, respectively, and let $k$ be the number of strings in $P$. We present a new algorithm achieving time $O(n\log k + m +\alpha)$ and space $O(m + A)$, where $A$ is the sum of the lower bounds of the lengths of the gaps in $P$ and $\alpha$ is the total number of occurrences of the strings in $P$ within $T$. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of $m$, $n$, $k$, $A$, and $\alpha$. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in $P$ for every match of the pattern. Comment: draft of full version, extended abstract at SPIRE 201
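The problem definition above can be illustrated with a small sketch that propagates reachable end positions from one string of the pattern to the next. This is a simple baseline under assumed conventions, not the algorithm the abstract describes: the pattern is given as a list `strings` plus a list `gaps` of inclusive `(lo, hi)` length ranges, and both names are hypothetical.

```python
from typing import List, Tuple


def occurrences(text: str, sub: str) -> List[int]:
    """Return every start index of sub in text (overlapping hits included)."""
    starts = []
    i = text.find(sub)
    while i != -1:
        starts.append(i)
        i = text.find(sub, i + 1)
    return starts


def match_vlg(text: str, strings: List[str], gaps: List[Tuple[int, int]]) -> List[int]:
    """Return the ending positions in text of substrings matching the pattern
    strings[0] gap[0] strings[1] ... where gaps[i] = (lo, hi) bounds the length
    of the arbitrary string between strings[i] and strings[i + 1]."""
    assert len(gaps) == len(strings) - 1
    # Exclusive end positions where the pattern prefix seen so far can end.
    ends = {s + len(strings[0]) for s in occurrences(text, strings[0])}
    for sub, (lo, hi) in zip(strings[1:], gaps):
        nxt = set()
        for s in occurrences(text, sub):
            # Accept this occurrence if some earlier end leaves a legal gap.
            if any((s - g) in ends for g in range(lo, hi + 1)):
                nxt.add(s + len(sub))
        ends = nxt
    return sorted(e - 1 for e in ends)
```

With the text `"axxbxc"`, strings `["a", "b", "c"]`, and gaps `[(1, 2), (0, 1)]`, the sketch reports a single match ending at index 5.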

    Palindromic Decompositions with Gaps and Errors

    Identifying palindromes in sequences has been an interesting line of research in combinatorics on words and also in computational biology, after the discovery of the relation of palindromes in the DNA sequence with the HIV virus. Efficient algorithms for the factorization of sequences into palindromes and maximal palindromes have been devised in recent years. We extend these studies by allowing gaps in decompositions and errors in palindromes, and also imposing a lower bound on the length of acceptable palindromes. We first present an algorithm for obtaining a palindromic decomposition of a string of length n with the minimal total gap length in time O(n log n * g) and space O(n g), where g is the number of allowed gaps in the decomposition. We then consider a decomposition of the string in maximal $\delta$-palindromes (i.e. palindromes with $\delta$ errors under the edit or Hamming distance) and g allowed gaps. We present an algorithm to obtain such a decomposition with the minimal total gap length in time O(n (g + $\delta$)) and space O(n g). Comment: accepted to CSR 201
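The first problem above (exact palindromes, at most g gaps, minimal total gap length) can be sketched with a straightforward dynamic program. This is a slow baseline for illustration, not the algorithm from the paper. It assumes that a gap is a maximal run of characters not covered by any palindrome of the decomposition, and the names `min_gap_decomposition`, `max_gaps`, and `min_len` are hypothetical.

```python
import math
from typing import Optional


def min_gap_decomposition(text: str, max_gaps: int, min_len: int = 1) -> Optional[int]:
    """Return the minimal total gap length over decompositions of text into
    palindromes of length >= min_len and at most max_gaps gaps, or None if no
    such decomposition exists. Simple cubic-time baseline."""
    n = len(text)
    # is_pal[j][i] is True when text[j:i] is a palindrome.
    is_pal = [[False] * (n + 1) for _ in range(n + 1)]
    for j in range(n, -1, -1):
        for i in range(j, n + 1):
            is_pal[j][i] = (i - j <= 1) or (
                text[j] == text[i - 1] and is_pal[j + 1][i - 1]
            )
    inf = math.inf
    # pal[i][k]: best cost for the prefix of length i ending with a palindrome
    # (or empty), using k gaps. gap[i][k]: same, but ending inside a gap.
    pal = [[inf] * (max_gaps + 1) for _ in range(n + 1)]
    gap = [[inf] * (max_gaps + 1) for _ in range(n + 1)]
    pal[0][0] = 0
    for i in range(1, n + 1):
        for k in range(max_gaps + 1):
            best = gap[i - 1][k] + 1  # extend the current gap
            if k > 0:
                best = min(best, pal[i - 1][k - 1] + 1)  # open a new gap
            gap[i][k] = best
            for j in range(i):
                if i - j >= min_len and is_pal[j][i]:
                    pal[i][k] = min(pal[i][k], pal[j][k], gap[j][k])
    answer = min(min(pal[n]), min(gap[n]))
    return None if answer == inf else answer
```

For instance, `"abcaba"` with palindromes of length at least 3 needs one gap of length 3 (covering `"abc"`), while `"abacdc"` decomposes into `"aba"` and `"cdc"` with no gap at all.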

    Opaque Service Virtualisation: A Practical Tool for Emulating Endpoint Systems

    Large enterprise software systems make many complex interactions with other services in their environment. Developing and testing for production-like conditions is therefore a very challenging task. Current approaches include emulation of dependent services using either explicit modelling or record-and-replay approaches. Models require deep knowledge of the target services while record-and-replay is limited in accuracy. Both face developmental and scaling issues. We present a new technique that improves the accuracy of record-and-replay approaches, without requiring prior knowledge of the service protocols. The approach uses Multiple Sequence Alignment to derive message prototypes from recorded system interactions and a scheme to match incoming request messages against prototypes to generate response messages. We use a modified Needleman-Wunsch algorithm for distance calculation during message matching. Our approach has shown greater than 99% accuracy for four evaluated enterprise system messaging protocols. The approach has been successfully integrated into the CA Service Virtualization commercial product to complement its existing techniques. Comment: In Proceedings of the 38th International Conference on Software Engineering Companion (pp. 202-211). arXiv admin note: text overlap with arXiv:1510.0142
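The abstract does not say how its Needleman-Wunsch variant is modified, so the sketch below shows only the textbook global alignment score as background. The scoring values (`match`, `mismatch`, `gap`) and the function name are assumptions for illustration, not the parameters used by the authors.

```python
def needleman_wunsch_score(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1
) -> int:
    """Return the optimal global alignment score of a and b under a linear
    gap penalty, using the standard Needleman-Wunsch recurrence with two rows."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]
```

A higher score means the two messages align more closely; a distance for message matching could be derived from it, for example by normalizing against message length, though that step is outside what the abstract specifies.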

    Mind the Gap: Essentially Optimal Algorithms for Online Dictionary Matching with One Gap

    We examine the complexity of the online Dictionary Matching with One Gap Problem (DMOG) which is the following. Preprocess a dictionary D of d patterns, where each pattern contains a special gap symbol that can match any string, so that given a text that arrives online, a character at a time, we can report all of the patterns from D that are suffixes of the text that has arrived so far, before the next character arrives. In more general versions the gap symbols are associated with bounds determining the possible lengths of matching strings. Online DMOG captures the difficulty in a bottleneck procedure for cyber-security, as many digital signatures of viruses manifest themselves as patterns with a single gap. In this paper, we demonstrate that the difficulty in obtaining efficient solutions for the DMOG problem, even in the offline setting, can be traced back to the infamous 3SUM conjecture. We show a conditional lower bound of Omega(delta(G_D)+op) time per text character, where G_D is a bipartite graph that captures the structure of D, delta(G_D) is the degeneracy of this graph, and op is the output size. Moreover, we show a conditional lower bound in terms of the magnitude of gaps for the bounded case, thereby showing that some known offline upper bounds are essentially optimal. We also provide matching upper-bounds (up to sub-polynomial factors), in terms of the degeneracy, for the online DMOG problem. In particular, we introduce algorithms whose time cost depends linearly on delta(G_D). Our algorithms make use of graph orientations, together with some additional techniques. These algorithms are of practical interest since although delta(G_D) can be as large as sqrt(d), and even larger if G_D is a multi-graph, it is typically a very small constant in practice. Finally, when delta(G_D) is large we are able to obtain even more efficient solutions
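Since the bounds above are stated in terms of the degeneracy delta(G_D), a short sketch of how degeneracy can be computed may help. The abstract says only that G_D is a bipartite graph capturing the structure of D; the construction used here (one vertex per distinct first or second subpattern, one edge per pattern) is an assumption, as are the names `pattern_graph` and `degeneracy`. Parallel edges are collapsed, so the multi-graph case is not modelled.

```python
from typing import Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def pattern_graph(dictionary: Iterable[Tuple[str, str]]) -> List[Edge]:
    """Assumed construction: link each pattern's first subpattern (left side)
    to its second subpattern (right side)."""
    return [(("L", left), ("R", right)) for left, right in dictionary]


def degeneracy(edges: Iterable[Edge]) -> int:
    """Return the degeneracy of a simple graph without self-loops by
    repeatedly removing a vertex of minimum remaining degree."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best = 0
    while adj:
        v = min(adj, key=lambda x: len(adj[x]))
        best = max(best, len(adj[v]))
        # Remove v and detach it from its remaining neighbours.
        for w in adj.pop(v):
            adj[w].discard(v)
    return best
```

A dictionary in which one first subpattern is shared by many patterns forms a star and has degeneracy 1, whereas four patterns pairing two first subpatterns with two second subpatterns in every combination form a 4-cycle with degeneracy 2.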