
Optimal-Hash Exact String Matching Algorithms

Author: Lecroq Thierry
Publication date: 10/03/2023

String matching is the problem of finding all the occurrences of a pattern in a text. We propose improved versions of the fast family of string matching algorithms based on hashing q-grams. The improvement consists of considering minimal values q such that each q-gram of the pattern has a unique hash value. The new algorithms are faster than the algorithms of the HASH family for short patterns on large alphabets. Comment: 14 pages
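The selection of q described in this abstract can be illustrated with a small sketch. It follows one reading of the abstract: pick the smallest q for which the q-grams at all positions of the pattern hash to pairwise distinct values. The shift-and-add hash, the table size of 256, the `max_q` cap, and the function names are assumptions made for illustration; they are not taken from the paper.

```python
def qgram_hash(s, start, q, table_size=256):
    """Shift-and-add hash of s[start:start+q].

    Illustrative hash only (an assumption), not the paper's function.
    """
    h = 0
    for ch in s[start:start + q]:
        h = ((h << 1) + ord(ch)) % table_size
    return h


def minimal_unique_q(pattern, max_q=8, table_size=256):
    """Return the smallest q (up to max_q) such that the q-grams at all
    positions of pattern have pairwise distinct hash values, or None."""
    for q in range(1, min(max_q, len(pattern)) + 1):
        hashes = [
            qgram_hash(pattern, i, q, table_size)
            for i in range(len(pattern) - q + 1)
        ]
        # All positions must map to different hash values.
        if len(set(hashes)) == len(hashes):
            return q
    return None


if __name__ == "__main__":
    for p in ("abcd", "abab", "aaaa"):
        print(p, minimal_unique_q(p))
```

Under these assumptions a pattern without repeated characters needs only q = 1, while a repetitive pattern such as "abab" needs a longer q before every position hashes uniquely.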

arXiv.org e-Print Archive

A new family and structure for Commentz-Walter-style multiple-keyword pattern matching algorithms

Author: Watson B.W.
Publication date: 01/01/2003

In this paper, I present a new family of Commentz-Walter-style multiple-keyword string pattern matching algorithms. The algorithms share a common algorithmic skeleton, which is significantly optimized when compared to the original Commentz-Walter skeleton and subsequently derived improvements. The new skeleton is derived via correctness-preserving stepwise algorithmic improvements, in the Eindhoven style of programming.

Repository TU/e

Pure OAI Repository

Generalised Pattern Matching Revisited

Author: Dudek Bartłomiej
Gawrychowski Paweł
Starikovskaya Tatiana
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)
Publication date: 01/01/2020

In the problem of $\texttt{Generalised Pattern Matching}$ ($\texttt{GPM}$) [STOC'94, Muthukrishnan and Palem], we are given a text $T$ of length $n$ over an alphabet $\Sigma_T$, a pattern $P$ of length $m$ over an alphabet $\Sigma_P$, and a matching relationship $\subseteq \Sigma_T \times \Sigma_P$, and must return all substrings of $T$ that match $P$ (reporting) or the number of mismatches between each substring of $T$ of length $m$ and $P$ (counting). In this work, we improve over all previously known algorithms for this problem for various parameters describing the input instance:

* $\mathcal{D}$ being the maximum number of characters that match a fixed character,
* $\mathcal{S}$ being the number of pairs of matching characters,
* $\mathcal{I}$ being the total number of disjoint intervals of characters that match the $m$ characters of the pattern $P$.

At the heart of our new deterministic upper bounds for $\mathcal{D}$ and $\mathcal{S}$ lies a faster construction of superimposed codes, which solves an open problem posed in [FOCS'97, Indyk] and can be of independent interest. To conclude, we demonstrate first lower bounds for $\texttt{GPM}$. We start by showing that any deterministic or Monte Carlo algorithm for $\texttt{GPM}$ must use $\Omega(\mathcal{S})$ time, and then proceed to show higher lower bounds for combinatorial algorithms. These bounds show that our algorithms are almost optimal, unless a radically new approach is developed.
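To make the reporting and counting variants of the problem concrete, here is a naive baseline sketch that checks every alignment directly. It runs in time proportional to the text length times the pattern length and is not one of the algorithms from the paper. The representation of the matching relationship as a Python set of (text character, pattern character) pairs and the function names are assumptions made for illustration.

```python
def gpm_count(text, pattern, relation):
    """Counting variant: for each length-m substring of text, return the
    number of positions whose characters are not related to the pattern.

    relation is a set of (text_char, pattern_char) pairs that match.
    """
    n, m = len(text), len(pattern)
    return [
        sum((text[i + j], pattern[j]) not in relation for j in range(m))
        for i in range(n - m + 1)
    ]


def gpm_report(text, pattern, relation):
    """Reporting variant: start positions of substrings with zero mismatches."""
    return [i for i, k in enumerate(gpm_count(text, pattern, relation)) if k == 0]


if __name__ == "__main__":
    # Hypothetical instance: 'a' and 'b' match 'x', and 'c' matches 'y'.
    relation = {("a", "x"), ("b", "x"), ("c", "y")}
    print(gpm_count("abca", "xy", relation))   # mismatches per alignment
    print(gpm_report("abca", "xy", relation))  # matching start positions
```

In this representation the parameter $\mathcal{S}$ from the abstract corresponds to the number of pairs in `relation`.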

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Dagstuhl Research Online Publication Server

Recommended from our members

A constraint based structure description language for Biosequences

Author: Eidhammer I
Gilbert D
Grindhaug SH
Jonassen J
Ratnayake R
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2001

Brunel University Research Archive