Fast Indexes for Gapped Pattern Matching

Author: D Knuth
G Navarro
J Bader
K Fredriksson
M Crochemore
M Lewenstein
M Morgante
P Bille
Philip Bille
R Saikkonen
SP Pissis
T Crawford
T Haapasalo
U Manber
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 28/02/2020

We describe indexes for searching large data sets for variable-length-gapped (VLG) patterns. VLG patterns are composed of two or more subpatterns, between each adjacent pair of which is a gap-constraint specifying upper and lower bounds on the distance allowed between subpatterns. VLG patterns have numerous applications in computational biology (motif search), information retrieval (e.g., for language models, snippet generation, machine translation) and capture a useful subclass of the regular expressions commonly used in practice for searching source code. Our best approach provides search speeds several times faster than prior art across a broad range of patterns and texts. Comment: This research is supported by Academy of Finland through grant 319454 and has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
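To make the definition of a VLG pattern concrete, here is a minimal brute-force sketch in Python. It is not the indexing approach of the paper; it only illustrates what counts as a match. The function names `find_all` and `vlg_matches` are hypothetical, and the sketch assumes that the gap is measured as the number of characters between the end of one subpattern and the start of the next.

```python
def find_all(text, pat):
    """Return every start index of pat in text (overlapping hits included)."""
    hits, i = [], text.find(pat)
    while i != -1:
        hits.append(i)
        i = text.find(pat, i + 1)
    return hits


def vlg_matches(text, subpatterns, gaps):
    """Brute-force VLG matching (illustrative only, no index is built).

    subpatterns: list of two or more strings.
    gaps: list of (lo, hi) pairs, one per adjacent pair of subpatterns.
          Assumption: the gap is the number of characters between the end
          of one subpattern and the start of the next.
    Returns a list of tuples holding the start position of each subpattern.
    """
    assert len(gaps) == len(subpatterns) - 1
    occ = [find_all(text, p) for p in subpatterns]
    results = [(s,) for s in occ[0]]
    for idx, (lo, hi) in enumerate(gaps, start=1):
        prev_len = len(subpatterns[idx - 1])
        extended = []
        for chain in results:
            prev_end = chain[-1] + prev_len  # first position after the previous subpattern
            for s in occ[idx]:
                if lo <= s - prev_end <= hi:
                    extended.append(chain + (s,))
        results = extended
    return results
```

For example, `vlg_matches("abXXcdXab_cd", ["ab", "cd"], [(1, 3)])` pairs each `ab` with any `cd` that starts between 1 and 3 characters after it. An index, as studied in the paper, would avoid rescanning the text for every query.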

arXiv.org e-Print Archive

Crossref

String Indexing for Top-$k$ Close Consecutive Occurrences

Author: Bille Philip
Gørtz Inge Li
Pedersen Max Rishøj
Rotenberg Eva
Steiner Teresa Anna
Publication date: 29/09/2020

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give two time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\epsilon\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$, and our second result achieves linear space and query time $O(m+k^{1+\epsilon})$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees. Comment: Fixed typos, minor change
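The problem statement above can be illustrated with a naive baseline in Python. This is only a sketch of what a query should return, not either of the paper's data structures: it scans the whole string on every query, so its running time depends on the length of S rather than on the pattern length and k. The function name `top_k_close_consecutive` is hypothetical.

```python
import heapq


def top_k_close_consecutive(S, P, k):
    """Naive baseline: report the k consecutive occurrences of P in S
    with the smallest distance j - i.

    A consecutive occurrence is a pair (i, j), i < j, of start positions
    of P with no other occurrence of P starting between them.
    """
    # Collect all start positions of P in increasing order.
    positions = []
    i = S.find(P)
    while i != -1:
        positions.append(i)
        i = S.find(P, i + 1)

    # Neighbouring positions in the sorted list are exactly the
    # consecutive occurrences.
    pairs = zip(positions, positions[1:])

    # Keep the k pairs of minimal distance.
    return heapq.nsmallest(k, pairs, key=lambda p: p[1] - p[0])
```

For instance, in `"abracadabra_abra"` the pattern `"abra"` starts at positions 0, 7, and 12, giving the consecutive occurrences (0, 7) and (7, 12) with distances 7 and 5.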

arXiv.org e-Print Archive

Online Research Database In Technology