8 research outputs found

String Matching with Variable Length Gaps

Author: Aho
Crochemore
David Kofoed Wind
Fredriksson
Hjalte Wedel Vildhøj
Hofmann
Inge Li Gørtz
Knuth
Morgante
Myers
Myers
Myers
Navarro
Navarro
Philip Bille
Thompson
Publication date: 01/01/2010

We consider string matching with variable length gaps. Given a string $T$ and a pattern $P$ consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in $T$ that match $P$. This problem is a basic primitive in computational biology applications. Let $m$ and $n$ be the lengths of $P$ and $T$, respectively, and let $k$ be the number of strings in $P$. We present a new algorithm achieving time $O(n\log k + m + \alpha)$ and space $O(m + A)$, where $A$ is the sum of the lower bounds of the lengths of the gaps in $P$ and $\alpha$ is the total number of occurrences of the strings in $P$ within $T$. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of $m$, $n$, $k$, $A$, and $\alpha$. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in $P$ for every match of the pattern.

Comment: draft of full version, extended abstract at SPIRE 201
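
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm from the paper and does not achieve the stated bounds; it only illustrates what "all ending positions of substrings in $T$ that match $P$" means. The function name `vlg_end_positions` and the representation of the pattern as a list of strings plus a list of `(lo, hi)` gap ranges are assumptions made for illustration.

```python
def vlg_end_positions(text, strings, gaps):
    """Brute-force variable-length-gap matching (illustrative only).

    strings: the k non-empty strings of the pattern, in order.
    gaps:    k-1 pairs (lo, hi); gap i sits between strings[i] and
             strings[i+1] and may be any string of length lo..hi.
    Returns the sorted end positions (exclusive indices into text)
    of all substrings of text that match the whole pattern.
    """
    if len(gaps) != len(strings) - 1:
        raise ValueError("need exactly one gap between adjacent strings")

    # End positions of every occurrence of the first string.
    first = strings[0]
    ends = {i + len(first)
            for i in range(len(text) - len(first) + 1)
            if text.startswith(first, i)}

    # Extend each partial match by one gap and one string at a time.
    for s, (lo, hi) in zip(strings[1:], gaps):
        next_ends = set()
        for e in ends:
            for g in range(lo, hi + 1):
                start = e + g
                if text.startswith(s, start):
                    next_ends.add(start + len(s))
        ends = next_ends

    return sorted(ends)
```

For example, with the pattern "ab", a gap of length 1 to 5, then "cd", the text "abxxcdxcd" has two matches, ending at exclusive positions 6 and 9.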

arXiv.org e-Print Archive

CiteSeerX

Elsevier - Publisher Connector

Crossref

Online Research Database In Technology

Dictionary Matching with One Gap

Author: A. Amir
A. Amir
A. Amir
A. Amir
A.V. Aho
E. Ukkonen
E.M. McCreight
G. Kucherov
G. Myers
G. Myers
G. Navarro
G. Navarro
G.S. Brodal
J.C. Naa
K. Fredriksson
M. Morgante
M. Zhang
M.S. Rahman
P. Bille
T. Haapasalo
Publication date: 01/01/2014

The dictionary matching with gaps problem is to preprocess a dictionary $D$ of $d$ gapped patterns $P_1,\ldots,P_d$ over alphabet $\Sigma$, where each gapped pattern $P_i$ is a sequence of subpatterns separated by bounded sequences of don't cares. Then, given a query text $T$ of length $n$ over alphabet $\Sigma$, the goal is to output all locations in $T$ in which a pattern $P_i\in D$, $1\leq i\leq d$, ends. There is a renewed current interest in the gapped matching problem stemming from cyber security. In this paper we solve the problem where all patterns in the dictionary have one gap with at least $\alpha$ and at most $\beta$ don't cares, where $\alpha$ and $\beta$ are given parameters. Specifically, we show that the dictionary matching with a single gap problem can be solved in either $O(d\log d + |D|)$ time and $O(d\log^{\varepsilon} d + |D|)$ space, and query time $O(n(\beta -\alpha )\log\log d \log ^2 \min \{ d, \log |D| \} + occ)$, where $occ$ is the number of patterns found, or preprocessing time and space $O(d^2 + |D|)$, and query time $O(n(\beta -\alpha ) + occ)$, where $occ$ is the number of patterns found. As far as we know, this is the best solution for this setting of the problem, where many overlaps may exist in the dictionary.

Comment: A preliminary version was published at CPM 201
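
As a rough illustration of the query semantics only, the following Python sketch reports every (pattern index, end position) pair by brute force. It does no preprocessing and is unrelated to the data structures in the paper. The function name `one_gap_dictionary_matches` and the encoding of each pattern as a `(left, right)` pair of non-empty subpatterns are hypothetical.

```python
def one_gap_dictionary_matches(text, dictionary, alpha, beta):
    """Brute-force dictionary matching with a single gap (illustrative only).

    dictionary: list of (left, right) subpattern pairs; every pattern
                is left, then alpha..beta don't-care characters, then right.
    Returns sorted (pattern_index, end) pairs, where end is the
    exclusive index in text at which that pattern's match ends.
    """
    hits = set()
    for idx, (left, right) in enumerate(dictionary):
        for i in range(len(text) - len(left) + 1):
            if not text.startswith(left, i):
                continue
            # Try every allowed gap length after this occurrence of left.
            for g in range(alpha, beta + 1):
                j = i + len(left) + g
                if text.startswith(right, j):
                    hits.add((idx, j + len(right)))
    return sorted(hits)
```

With the dictionary `[("ab", "cd"), ("b", "d")]`, $\alpha = 1$ and $\beta = 2$, both patterns end at exclusive position 5 in the text "abxcd".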

arXiv.org e-Print Archive

Crossref

String Indexing for Patterns with Wildcards

Author: A. Tam
B. Chazelle
D. Harel
D. Tsur
G. Chen
G. Landau
G. Landau
G. Navarro
H.L. Chan
K. Hofmann
L.P. Coelho
M. Lewenstein
M. Maas
M.L. Fredman
P. Bille
P. Bille
P. Clifford
T.-W. Lam
Z. Galil
Publication date: 01/01/2012

We consider the problem of indexing a string $t$ of length $n$ to report the occurrences of a query pattern $p$ containing $m$ characters and $j$ wildcards. Let $occ$ be the number of occurrences of $p$ in $t$, and $\sigma$ the size of the alphabet. We obtain the following results.
- A linear space index with query time $O(m+\sigma^j \log \log n + occ)$. This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time $\Theta(jn)$ in the worst case.
- An index with query time $O(m+j+occ)$ using space $O(\sigma^{k^2} n \log^k \log n)$, where $k$ is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
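
For reference, the query being indexed can be stated as a plain scan over the text. The Python sketch below is this baseline, not any of the indexes from the paper: it builds no index and takes time proportional to the text length times the pattern length. The function name and the choice of `*` as the wildcard symbol are assumptions for illustration.

```python
def wildcard_occurrences(t, p, wildcard="*"):
    """Report all start positions where p occurs in t (illustrative only).

    Each wildcard symbol in p matches exactly one arbitrary character
    of t; every other character of p must match exactly.
    """
    length = len(p)
    return [
        i
        for i in range(len(t) - length + 1)
        if all(pc == wildcard or pc == t[i + k] for k, pc in enumerate(p))
    ]
```

For example, the pattern "ab*" (two characters and one wildcard) occurs in "abcabd" at positions 0 and 3, while "a*c" occurs only at position 0.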

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

Fast Indexes for Gapped Pattern Matching

Author: D Knuth
G Navarro
J Bader
K Fredriksson
M Crochemore
M Lewenstein
M Morgante
P Bille
P Bille
Philip Bille
R Saikkonen
SP Pissis
T Crawford
T Haapasalo
U Manber
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 28/02/2020

We describe indexes for searching large data sets for variable-length-gapped (VLG) patterns. VLG patterns are composed of two or more subpatterns, between each adjacent pair of which is a gap-constraint specifying upper and lower bounds on the distance allowed between subpatterns. VLG patterns have numerous applications in computational biology (motif search), information retrieval (e.g., for language models, snippet generation, machine translation) and capture a useful subclass of the regular expressions commonly used in practice for searching source code. Our best approach provides search speeds several times faster than prior art across a broad range of patterns and texts.

Comment: This research is supported by Academy of Finland through grant 319454 and has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
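
The remark that VLG patterns form a subclass of regular expressions can be illustrated with a small translation into Python's standard `re` syntax. This is only a sketch of the correspondence, not the indexing approach described in the abstract; the helper name `vlg_to_regex` and the input representation are hypothetical.

```python
import re


def vlg_to_regex(subpatterns, gaps):
    """Translate a VLG pattern into a regular expression string.

    subpatterns: literal strings, in order.
    gaps:        one (lo, hi) pair per adjacent pair of subpatterns,
                 giving the allowed number of characters between them.
    """
    if len(gaps) != len(subpatterns) - 1:
        raise ValueError("need exactly one gap between adjacent subpatterns")
    parts = [re.escape(subpatterns[0])]
    for (lo, hi), sub in zip(gaps, subpatterns[1:]):
        parts.append(".{%d,%d}" % (lo, hi))  # gap of lo..hi arbitrary characters
        parts.append(re.escape(sub))         # next literal subpattern
    return "".join(parts)


# DOTALL lets a gap span newline characters as well.
pattern = re.compile(vlg_to_regex(["GAT", "ACA"], [(2, 4)]), re.DOTALL)
match = pattern.search("TTGATCCACAGG")
```

Here the generated expression is `GAT.{2,4}ACA`, and the search finds a match spanning positions 2 to 10 (end exclusive) of the example text.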

arXiv.org e-Print Archive

Crossref

Research on Pattern Matching with Wildcards and Length Constraints: Methods and Completeness

Author: Hu Xuegang
Wang Haiping
Xiang Taining
Publication venue: 'IntechOpen'
Publication date: 28/11/2012

IntechOpen

Bioinformatics

Publication venue: 'IntechOpen'
Publication date: 20/04/2021

This book is divided into research areas relevant to bioinformatics, such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics, and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.

Directory of Open Access Books (DOAB)