research

String Matching with Variable Length Gaps

Abstract

We consider string matching with variable length gaps. Given a string TT and a pattern PP consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in TT that match PP. This problem is a basic primitive in computational biology applications. Let mm and nn be the lengths of PP and TT, respectively, and let kk be the number of strings in PP. We present a new algorithm achieving time O(nlogk+m+α)O(n\log k + m +\alpha) and space O(m+A)O(m + A), where AA is the sum of the lower bounds of the lengths of the gaps in PP and α\alpha is the total number of occurrences of the strings in PP within TT. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of mm, nn, kk, AA, and α\alpha. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in PP for every match of the pattern.Comment: draft of full version, extended abstract at SPIRE 201

    Similar works