Fast Indexes for Gapped Pattern Matching
We describe indexes for searching large data sets for variable-length-gapped
(VLG) patterns. VLG patterns are composed of two or more subpatterns, between
each adjacent pair of which is a gap-constraint specifying upper and lower
bounds on the distance allowed between subpatterns. VLG patterns have numerous
applications in computational biology (motif search), information retrieval
(e.g., for language models, snippet generation, machine translation) and
capture a useful subclass of the regular expressions commonly used in practice
for searching source code. Our best approach provides search speeds several
times faster than prior art across a broad range of patterns and texts.
Comment: This research is supported by the Academy of Finland through grant 319454
and has received funding from the European Union's Horizon 2020 research and
innovation programme under the Marie Sklodowska-Curie Actions
H2020-MSCA-RISE-2015 BIRDS GA No. 69094
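To make the definition of a VLG pattern concrete, the following is a minimal brute-force sketch in Python. It is not the indexed approach the abstract describes. The function names `find_all` and `vlg_matches` are hypothetical, and the sketch assumes that a gap is measured as the number of characters between the end of one subpattern and the start of the next.

```python
from typing import List, Sequence, Tuple


def find_all(text: str, sub: str) -> List[int]:
    """Return every start position of `sub` in `text` (overlaps allowed)."""
    positions = []
    i = text.find(sub)
    while i != -1:
        positions.append(i)
        i = text.find(sub, i + 1)
    return positions


def vlg_matches(
    text: str,
    subpatterns: Sequence[str],
    gaps: Sequence[Tuple[int, int]],
) -> List[Tuple[int, ...]]:
    """Brute-force VLG matching (illustrative only, no index is built).

    `gaps[i]` is an assumed (lower, upper) bound on the number of characters
    between the end of subpatterns[i] and the start of subpatterns[i + 1].
    Returns one tuple of start positions per match.
    """
    assert len(gaps) == len(subpatterns) - 1
    occurrences = [find_all(text, s) for s in subpatterns]
    results = [(p,) for p in occurrences[0]]
    for idx in range(1, len(subpatterns)):
        lower, upper = gaps[idx - 1]
        prev_len = len(subpatterns[idx - 1])
        extended = []
        for partial in results:
            prev_end = partial[-1] + prev_len
            for q in occurrences[idx]:
                # Keep q only if the gap satisfies the constraint.
                if lower <= q - prev_end <= upper:
                    extended.append(partial + (q,))
        results = extended
    return results


if __name__ == "__main__":
    # "abc", then 1 to 2 arbitrary characters, then "def"
    print(vlg_matches("abcxxdefabcydef", ["abc", "def"], [(1, 2)]))
```

This sketch scans the text once per subpattern and then joins the occurrence lists, so its cost grows with the number of occurrences. It is meant only to illustrate the matching semantics.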
String Indexing for Top-$k$ Close Consecutive Occurrences
The classic string indexing problem is to preprocess a string $S$ into a
compact data structure that supports efficient subsequent pattern matching
queries, that is, given a pattern string $P$, report all occurrences of $P$
within $S$. In this paper, we study a basic and natural extension of string
indexing called the string indexing for top-$k$ close consecutive occurrences
problem (SITCCO). Here, a consecutive occurrence is a pair $(i, j)$, $i < j$,
such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence
of $P$ between $i$ and $j$, and their distance is defined as $j - i$. Given a
pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive
occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly
represent $S$ while supporting queries in time close to the length of $P$ and $k$.
We give two time-space trade-offs for the problem. Let $n$ be the length of
$S$, $m$ the length of $P$, and $\epsilon \in (0, 1]$. Our first result achieves
$O(n \log n)$ space and optimal query time of $O(m + k)$, and our second result
achieves linear space and query time $O(m + k^{1+\epsilon})$. Along the way, we
develop several techniques of independent interest, including a new translation
of the problem into a line segment intersection problem and a new recursive
clustering technique for trees.
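As an illustration of the query semantics only, the following Python sketch answers a top-k close consecutive occurrences query by scanning the text directly. It is a naive baseline and not the data structures the abstract describes. The function name `top_k_close_consecutive` is hypothetical, and ties in distance are assumed to be broken by the smaller starting position.

```python
import heapq
from typing import List, Tuple


def top_k_close_consecutive(s: str, p: str, k: int) -> List[Tuple[int, int]]:
    """Naive baseline: report the k consecutive occurrences of p in s
    with the smallest distance, by scanning s (no index is built)."""
    # Collect every start position of p in s, in increasing order.
    occ = []
    i = s.find(p)
    while i != -1:
        occ.append(i)
        i = s.find(p, i + 1)
    # Adjacent positions in the sorted list are exactly the consecutive
    # occurrences: no other occurrence of p starts between them.
    pairs = [(j - i, i, j) for i, j in zip(occ, occ[1:])]
    # Pick the k pairs of minimal distance (ties broken by position).
    return [(i, j) for _, i, j in heapq.nsmallest(k, pairs)]


if __name__ == "__main__":
    print(top_k_close_consecutive("abaXabaabaYYYaba", "aba", 2))
```

The scan takes time proportional to the length of the text for every query, which is the cost an index for this problem is designed to avoid.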