Search CORE

427 research outputs found

The k-mismatch problem revisited

Author: Clifford Raphaël
Fontaine Allyx
Porat Ely
Sach Benjamin
Starikovskaya Tatiana
Publication venue
Publication date: 27/08/2015
Field of study

We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic

O(n k^2\log{k} / m+n \text{polylog} m)

time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only

O(k^2\text{polylog} {m})

space in total. 3) Next we give a randomised

(1+\epsilon)

-approximation algorithm for the streaming k-mismatch problem which uses

O(k^2\text{polylog} m / \epsilon^2)

space and runs in

O(\text{polylog} m / \epsilon^2)

worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised

O(k^2\text{polylog} {m})

space algorithm for the streaming k-mismatch problem which runs in

O(\sqrt{k}\log{k} + \text{polylog} {m})

worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors)

arXiv.org e-Print Archive

Crossref

Explore Bristol Research

MatchPy: A Pattern Matching Library

Author: Barthels Henrik
Bientinesi Paolo
Krebber Manuel
Publication venue: 'SciPy'
Publication date: 01/01/2017
Field of study

Pattern matching is a powerful tool for symbolic computations, based on the well-defined theory of term rewriting systems. Application domains include algebraic expressions, abstract syntax trees, and XML and JSON data. Unfortunately, no lightweight implementation of pattern matching as general and flexible as Mathematica exists for Python Mathics,MacroPy,patterns,PyPatt. Therefore, we created the open source module MatchPy which offers similar pattern matching functionality in Python using a novel algorithm which finds matches for large pattern sets more efficiently by exploiting similarities between patterns.Comment: arXiv admin note: substantial text overlap with arXiv:1710.0007

arXiv.org e-Print Archive

Crossref

String Indexing for Patterns with Wildcards

Author: A. Tam
B. Chazelle
D. Harel
D. Tsur
G. Chen
G. Landau
G. Landau
G. Navarro
H.L. Chan
K. Hofmann
L.P. Coelho
M. Lewenstein
M. Maas
M.L. Fredman
P. Bille
P. Bille
P. Clifford
T.-W. Lam
Z. Galil
Publication venue
Publication date: 01/01/2012
Field of study

We consider the problem of indexing a string

t

of length

n

to report the occurrences of a query pattern

p

containing

m

characters and

j

wildcards. Let

occ

be the number of occurrences of

p

t

, and

\sigma

the size of the alphabet. We obtain the following results. - A linear space index with query time

O(m+\sigma^j \log \log n + occ)

. This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time

\Theta(jn)

in the worst case. - An index with query time

O(m+j+occ)

using space

O(\sigma^{k^2} n \log^k \log n)

, where

k

is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time. - A time-space trade-off, generalizing the index by Cole et al. [STOC 2004]. We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

String Matching with Variable Length Gaps

Author: Aho
Crochemore
David Kofoed Wind
Fredriksson
Hjalte Wedel Vildhøj
Hofmann
Inge Li Gørtz
Knuth
Morgante
Myers
Myers
Myers
Navarro
Navarro
Philip Bille
Thompson
Publication venue
Publication date: 01/01/2010
Field of study

We consider string matching with variable length gaps. Given a string

T

and a pattern

P

consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in

T

that match

P

. This problem is a basic primitive in computational biology applications. Let

m

and

n

be the lengths of

P

and

T

, respectively, and let

k

be the number of strings in

P

. We present a new algorithm achieving time

O(n\log k + m +\alpha)

and space

O(m + A)

, where

A

is the sum of the lower bounds of the lengths of the gaps in

P

and

\alpha

is the total number of occurrences of the strings in

P

within

T

. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of

m

n

k

A

, and

\alpha

. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in

P

for every match of the pattern.Comment: draft of full version, extended abstract at SPIRE 201

arXiv.org e-Print Archive

CiteSeerX

Elsevier - Publisher Connector

Crossref

Online Research Database In Technology

Efficient Pattern Matching in Python

Author: Bachmair L.
Bachmair L.
Behnel S.
Clausen M.
Kirchner H.
Klop J. W.
Kounalis E.
Nedjah N.
Rivoal F.
Robie J.
Publication venue
Publication date: 29/09/2017
Field of study

Pattern matching is a powerful tool for symbolic computations. Applications include term rewriting systems, as well as the manipulation of symbolic expressions, abstract syntax trees, and XML and JSON data. It also allows for an intuitive description of algorithms in the form of rewrite rules. We present the open source Python module MatchPy, which offers functionality and expressiveness similar to the pattern matching in Mathematica. In particular, it includes syntactic pattern matching, as well as matching for commutative and/or associative functions, sequence variables, and matching with constraints. MatchPy uses new and improved algorithms to efficiently find matches for large pattern sets by exploiting similarities between patterns. The performance of MatchPy is investigated on several real-world problems

arXiv.org e-Print Archive

Crossref

Fast Indexes for Gapped Pattern Matching

Author: D Knuth
G Navarro
J Bader
K Fredriksson
M Crochemore
M Lewenstein
M Morgante
P Bille
P Bille
Philip Bille
R Saikkonen
SP Pissis
T Crawford
T Haapasalo
U Manber
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 28/02/2020
Field of study

We describe indexes for searching large data sets for variable-length-gapped (VLG) patterns. VLG patterns are composed of two or more subpatterns, between each adjacent pair of which is a gap-constraint specifying upper and lower bounds on the distance allowed between subpatterns. VLG patterns have numerous applications in computational biology (motif search), information retrieval (e.g., for language models, snippet generation, machine translation) and capture a useful subclass of the regular expressions commonly used in practice for searching source code. Our best approach provides search speeds several times faster than prior art across a broad range of patterns and texts.Comment: This research is supported by Academy of Finland through grant 319454 and has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

arXiv.org e-Print Archive

Crossref