2 research outputs found

    Fast Indexes for Gapped Pattern Matching

    Full text link
    We describe indexes for searching large data sets for variable-length-gapped (VLG) patterns. VLG patterns are composed of two or more subpatterns, between each adjacent pair of which is a gap-constraint specifying upper and lower bounds on the distance allowed between subpatterns. VLG patterns have numerous applications in computational biology (motif search), information retrieval (e.g., for language models, snippet generation, machine translation) and capture a useful subclass of the regular expressions commonly used in practice for searching source code. Our best approach provides search speeds several times faster than prior art across a broad range of patterns and texts.Comment: This research is supported by Academy of Finland through grant 319454 and has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

    String Indexing for Top-kk Close Consecutive Occurrences

    Full text link
    The classic string indexing problem is to preprocess a string SS into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string PP, report all occurrences of PP within SS. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-kk close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair (i,j)(i,j), i<ji < j, such that PP occurs at positions ii and jj in SS and there is no occurrence of PP between ii and jj, and their distance is defined as jij-i. Given a pattern PP and a parameter kk, the goal is to report the top-kk consecutive occurrences of PP in SS of minimal distance. The challenge is to compactly represent SS while supporting queries in time close to length of PP and kk. We give two time-space trade-offs for the problem. Let nn be the length of SS, mm the length of PP, and ϵ(0,1]\epsilon\in(0,1]. Our first result achieves O(nlogn)O(n\log n) space and optimal query time of O(m+k)O(m+k), and our second result achieves linear space and query time O(m+k1+ϵ)O(m+k^{1+\epsilon}). Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.Comment: Fixed typos, minor change
    corecore