427 research outputs found
The k-mismatch problem revisited
We revisit the complexity of one of the most basic problems in pattern
matching. In the k-mismatch problem we must compute the Hamming distance
between a pattern of length m and every m-length substring of a text of length
n, as long as that Hamming distance is at most k. Where the Hamming distance is
greater than k at some alignment of the pattern and text, we simply output
"No".
We study this problem in both the standard offline setting and also as a
streaming problem. In the streaming k-mismatch problem the text arrives one
symbol at a time and we must give an output before processing any future
symbols. Our main results are as follows:
1) Our first result is a deterministic time offline algorithm for k-mismatch on a text of length n. This is a
factor of k improvement over the fastest previous result of this form from SODA
2000 by Amihood Amir et al.
2) We then give a randomised and online algorithm which runs in the same time
complexity but requires only space in total.
3) Next we give a randomised -approximation algorithm for the
streaming k-mismatch problem which uses
space and runs in worst-case time per
arriving symbol.
4) Finally we combine our new results to derive a randomised
space algorithm for the streaming k-mismatch problem
which runs in worst-case time per
arriving symbol. This improves the best previous space complexity for streaming
k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We
also improve the time complexity of this previous result by an even greater
factor to match the fastest known offline algorithm (up to logarithmic
factors)
MatchPy: A Pattern Matching Library
Pattern matching is a powerful tool for symbolic computations, based on the
well-defined theory of term rewriting systems. Application domains include
algebraic expressions, abstract syntax trees, and XML and JSON data.
Unfortunately, no lightweight implementation of pattern matching as general and
flexible as Mathematica exists for Python Mathics,MacroPy,patterns,PyPatt.
Therefore, we created the open source module MatchPy which offers similar
pattern matching functionality in Python using a novel algorithm which finds
matches for large pattern sets more efficiently by exploiting similarities
between patterns.Comment: arXiv admin note: substantial text overlap with arXiv:1710.0007
String Indexing for Patterns with Wildcards
We consider the problem of indexing a string of length to report the
occurrences of a query pattern containing characters and wildcards.
Let be the number of occurrences of in , and the size of
the alphabet. We obtain the following results.
- A linear space index with query time .
This significantly improves the previously best known linear space index by Lam
et al. [ISAAC 2007], which requires query time in the worst case.
- An index with query time using space , where is the maximum number of wildcards allowed in the pattern.
This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length
gaps in the pattern. Our results are obtained using a novel combination of
well-known and new techniques, which could be of independent interest
String Matching with Variable Length Gaps
We consider string matching with variable length gaps. Given a string and
a pattern consisting of strings separated by variable length gaps
(arbitrary strings of length in a specified range), the problem is to find all
ending positions of substrings in that match . This problem is a basic
primitive in computational biology applications. Let and be the lengths
of and , respectively, and let be the number of strings in . We
present a new algorithm achieving time and space , where is the sum of the lower bounds of the lengths of the gaps in
and is the total number of occurrences of the strings in
within . Compared to the previous results this bound essentially achieves
the best known time and space complexities simultaneously. Consequently, our
algorithm obtains the best known bounds for almost all combinations of ,
, , , and . Our algorithm is surprisingly simple and
straightforward to implement. We also present algorithms for finding and
encoding the positions of all strings in for every match of the pattern.Comment: draft of full version, extended abstract at SPIRE 201
Efficient Pattern Matching in Python
Pattern matching is a powerful tool for symbolic computations. Applications
include term rewriting systems, as well as the manipulation of symbolic
expressions, abstract syntax trees, and XML and JSON data. It also allows for
an intuitive description of algorithms in the form of rewrite rules. We present
the open source Python module MatchPy, which offers functionality and
expressiveness similar to the pattern matching in Mathematica. In particular,
it includes syntactic pattern matching, as well as matching for commutative
and/or associative functions, sequence variables, and matching with
constraints. MatchPy uses new and improved algorithms to efficiently find
matches for large pattern sets by exploiting similarities between patterns. The
performance of MatchPy is investigated on several real-world problems
Fast Indexes for Gapped Pattern Matching
We describe indexes for searching large data sets for variable-length-gapped
(VLG) patterns. VLG patterns are composed of two or more subpatterns, between
each adjacent pair of which is a gap-constraint specifying upper and lower
bounds on the distance allowed between subpatterns. VLG patterns have numerous
applications in computational biology (motif search), information retrieval
(e.g., for language models, snippet generation, machine translation) and
capture a useful subclass of the regular expressions commonly used in practice
for searching source code. Our best approach provides search speeds several
times faster than prior art across a broad range of patterns and texts.Comment: This research is supported by Academy of Finland through grant 319454
and has received funding from the European Union's Horizon 2020 research and
innovation programme under the Marie Sklodowska-Curie Actions
H2020-MSCA-RISE-2015 BIRDS GA No. 69094
- …