61 research outputs found
Faster Approximate Pattern Matching: A Unified Approach
Approximate pattern matching is a natural and well-studied problem on strings: Given a text T, a pattern P, and a threshold k, find (the starting positions of) all substrings of T that are at distance at most k from P. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of T that have at most k mismatches with P, while under the edit distance, we search for substrings of T that can be transformed to P with at most k edits. Exact occurrences of P in T have a very simple structure: If we assume for simplicity that |T| ≤ 3|P|/2 and trim T so that P occurs both as a prefix and as a suffix of T, then both P and T are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to k mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are O(k²) k-mismatch occurrences of P in T, or both P and T are at Hamming distance O(k) from strings with a common period O(|P|/k). We tighten this characterization by showing that there are O(k) k-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of k-edit occurrences by O(k²) in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where T and P are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18]).
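To make the two occurrence notions above concrete, here is a small brute-force sketch (the function names are ad hoc, and this is only an illustration of the definitions, not the paper's framework):

```python
# Brute-force reference for the two occurrence notions defined above.
# Purely illustrative; the paper's structural results and algorithms are far more involved.

def hamming_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Starting positions i such that text[i:i+len(pattern)] has at most k mismatches."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if sum(a != b for a, b in zip(text[i:i + m], pattern)) <= k]

def edit_distance_at_most(s: str, p: str, k: int) -> bool:
    """Textbook O(|s|*|p|) edit-distance DP; True iff the distance is at most k."""
    prev = list(range(len(p) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(p, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1] <= k

def edit_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Starting positions of substrings of text within edit distance k of pattern."""
    m, out = len(pattern), []
    for i in range(len(text)):
        # a substring within k edits of the pattern has length in [m-k, m+k]
        for j in range(i + max(m - k, 0), min(len(text), i + m + k) + 1):
            if edit_distance_at_most(text[i:j], pattern, k):
                out.append(i)
                break
    return out

print(hamming_occurrences("abcabcaxc", "abc", 1))  # -> [0, 3, 6]
print(edit_occurrences("abcabxabc", "abc", 1))
```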
Faster Pattern Matching under Edit Distance
We consider the approximate pattern matching problem under the edit distance. Given a text of length n, a pattern of length m, and a threshold k, the task is to find the starting positions of all substrings of the text that can be transformed to the pattern with at most k edits. More than 20 years ago, Cole and Hariharan [SODA'98, J. Comput.'02] gave an O(n + k^4·n/m)-time algorithm for this classic problem, and this runtime has not been improved since. Here, we present an algorithm that runs in time O(n + k^{3.5}√(log m log k)·n/m), thus breaking through this long-standing barrier. In the case where n^{1/4+ε} ≤ k ≤ n^{2/5−ε} for some arbitrarily small positive constant ε, our algorithm improves over the state-of-the-art by polynomial factors: it is polynomially faster than both the algorithm of Cole and Hariharan and the classic O(kn)-time algorithm of Landau and Vishkin [STOC'86, J. Algorithms'89]. We observe that the bottleneck case of the alternative O(n + k^4·n/m)-time algorithm of Charalampopoulos, Kociumaka, and Wellnitz [FOCS'20] is when the text and the pattern are (almost) periodic. Our new algorithm reduces this case to a new dynamic problem (Dynamic Puzzle Matching), which we solve by building on tools developed by Tiskin [SODA'10, Algorithmica'15] for the so-called seaweed monoid of permutation matrices. Our algorithm relies only on a small set of primitive operations on strings and thus also applies to the fully-compressed setting (where text and pattern are given as straight-line programs) and to the dynamic setting (where we maintain a collection of strings under creation, splitting, and concatenation), improving over the state of the art.
Faster pattern matching under edit distance: a reduction to dynamic puzzle matching and the Seaweed Monoid of permutation matrices
We consider the approximate pattern matching problem under the edit distance. Given a text T of length n, a pattern P of length m, and a threshold k, the task is to find the starting positions of all substrings of T that can be transformed to P with at most k edits. More than 20 years ago, Cole and Hariharan [SODA’98, J. Comput.’02] gave an O(n + k^4·n/m)-time algorithm for this classic problem, and this runtime has not been improved since.
Here, we present an algorithm that runs in time O(n + k^{3.5}√(log m log k) · n/m), thus breaking through this longstanding barrier. In the case where n^{1/4+ε} ≤ k ≤ n^{2/5−ε} for some arbitrarily small positive constant ε, our algorithm improves over the state-of-the-art by polynomial factors: it is polynomially faster than both the algorithm of Cole and Hariharan and the classic O(kn)-time algorithm of Landau and Vishkin [STOC'86, J. Algorithms'89].
We observe that the bottleneck case of the alternative O(n + k^4·n/m)-time algorithm of Charalampopoulos, Kociumaka, and Wellnitz [FOCS'20] is when the text and the pattern are (almost) periodic. Our new algorithm reduces this case to a new Dynamic Puzzle Matching problem, which we solve by building on tools developed by Tiskin [SODA'10, Algorithmica'15] for the so-called seaweed monoid of permutation matrices. Our algorithm relies only on a small set of primitive operations on strings and thus also applies to the fully-compressed setting (where text and pattern are given as straight-line programs) and to the dynamic setting (where we maintain a collection of strings under creation, splitting, and concatenation), improving over the state of the art.
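For reference, the notion of a k-edit occurrence used in the two entries above can be pinned down with the textbook approximate-matching dynamic program (commonly attributed to Sellers). The sketch below runs in O(nm) time, so it is far slower than the algorithms discussed, and the function name is ad hoc:

```python
# Textbook dynamic program for approximate string matching: row 0 is kept at 0
# so a match may start anywhere in the text. It reports ending positions of
# substrings within edit distance k of the pattern.

def k_edit_occurrence_ends(text: str, pattern: str, k: int) -> list[int]:
    """Ending positions j such that some substring of text ending at j
    can be transformed into pattern with at most k edits."""
    m = len(pattern)
    # dp[i] = minimal edit distance between pattern[:i] and some suffix of text[:j]
    dp = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, 1):
        prev_diag, dp[0] = dp[0], 0  # substrings may start anywhere: row 0 is free
        for i in range(1, m + 1):
            prev_diag, dp[i] = dp[i], min(
                dp[i] + 1,                          # delete a text character
                dp[i - 1] + 1,                      # insert a pattern character
                prev_diag + (pattern[i - 1] != c))  # match or substitute
        if dp[m] <= k:
            ends.append(j)
    return ends

# Example: occurrences of "abcd" with at most 1 edit inside "xxabdxxabcdxx".
print(k_edit_occurrence_ends("xxabdxxabcdxx", "abcd", 1))
```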
Wavelet Trees Meet Suffix Trees
We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length n over an alphabet of size σ, our method builds the wavelet tree in O(n log σ/√(log n)) time, improving upon the state-of-the-art algorithm by a factor of √(log n). As a consequence, given an array of n integers we can construct in O(n√(log n)) time a data structure consisting of O(n) machine words and capable of answering rank/select queries for the subranges of the array in O(log n/log log n) time. This improves the query time compared to Chan and Pătraşcu and the construction time compared to Brodal et al. Next, we switch to stringological context and propose a novel notion of wavelet suffix trees. For a string w of length n, this data structure occupies O(n) words, takes O(n√(log n)) time to construct, and simultaneously captures the combinatorial structure of substrings of w while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in O(log |x|) time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings x and y of w count the number of suffixes of x that are lexicographically smaller than y, and for a substring x of w and an integer k, find the k-th lexicographically smallest suffix of x. We further show that wavelet suffix trees allow us to compute a run-length-encoded Burrows-Wheeler transform of a substring x of w in O(s log |x|) time, where s denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.
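As a point of reference for the rank/select queries mentioned above, the sketch below implements a plain, unoptimized wavelet tree over an integer array with the naive construction; it is not the paper's faster construction, and the wavelet suffix tree is not reproduced here.

```python
# A plain wavelet tree over an integer array (naive construction), assuming a
# non-empty array at the root. Only the rank query is shown.

class WaveletTree:
    def __init__(self, data, lo=None, hi=None):
        self.lo = min(data) if lo is None else lo
        self.hi = max(data) if hi is None else hi
        self.left = self.right = None
        if self.lo == self.hi or not data:
            return  # leaf (single symbol) or empty subtree
        mid = (self.lo + self.hi) // 2
        # prefix_zeros[i] = how many of the first i values are routed to the left child
        self.prefix_zeros = [0]
        for x in data:
            self.prefix_zeros.append(self.prefix_zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in data if x <= mid], self.lo, mid)
        self.right = WaveletTree([x for x in data if x > mid], mid + 1, self.hi)

    def rank(self, value, i):
        """Number of occurrences of `value` among the first i array entries."""
        if i == 0 or not (self.lo <= value <= self.hi):
            return 0
        if self.lo == self.hi:
            return i
        zeros = self.prefix_zeros[i]
        if value <= (self.lo + self.hi) // 2:
            return self.left.rank(value, zeros)
        return self.right.rank(value, i - zeros)

# Usage: count occurrences of 5 in the prefix a[0:7].
a = [3, 5, 1, 5, 9, 2, 5, 7]
wt = WaveletTree(a)
print(wt.rank(5, 7))  # -> 3
```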
Normal, Abby Normal, Prefix Normal
A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. This class of words is important in the context of binary jumbled pattern matching. In this paper we present results about the number of prefix normal words of length n, giving upper and lower bounds on this quantity. We introduce efficient algorithms for testing the prefix normal property and a "mechanical algorithm" for computing prefix normal forms. We also include games which can be played with prefix normal words. In these games Alice wishes to stay normal but Bob wants to drive her "abnormal" -- we discuss which parameter settings allow Alice to succeed.
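A direct quadratic-time check of the defining property, included only to make the definition concrete (the paper's testing algorithms are more efficient):

```python
# A binary word w is prefix normal iff, for every length l, no length-l factor
# contains more 1s than the length-l prefix. This is a direct, quadratic test.

def is_prefix_normal(w: str) -> bool:
    n = len(w)
    prefix_ones = [0] * (n + 1)
    for i, c in enumerate(w, 1):
        prefix_ones[i] = prefix_ones[i - 1] + (c == '1')
    for length in range(1, n + 1):
        max_ones = max(prefix_ones[i + length] - prefix_ones[i]
                       for i in range(n - length + 1))
        if max_ones > prefix_ones[length]:
            return False
    return True

print(is_prefix_normal("110101"))  # True
print(is_prefix_normal("101101"))  # False: factor "11" has two 1s, but the prefix "10" has only one
```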
On Maximal Unbordered Factors
Given a string of length n, its maximal unbordered factor is the longest factor which does not have a border. In this work we investigate the relationship between n and the length of the maximal unbordered factor of a string of length n. We prove a lower bound on the expected length of the maximal unbordered factor of a string of length n over an alphabet of size σ (for sufficiently large values of n). As an application of this result, we propose a new algorithm for computing the maximal unbordered factor of a string.
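For illustration, a simple quadratic-time routine that computes a maximal unbordered factor straight from the definition via the standard border array (KMP failure function); this is not the paper's algorithm:

```python
# A border of a string is a nonempty proper prefix that is also a suffix; a
# factor is unbordered if it has none. The routine below returns a longest
# unbordered factor by brute force over all starting positions.

def border_array(s: str) -> list[int]:
    """b[l] = length of the longest proper border of s[:l] (KMP failure function)."""
    b = [0] * (len(s) + 1)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k]
        if s[i] == s[k]:
            k += 1
        b[i + 1] = k
    return b

def maximal_unbordered_factor(w: str) -> str:
    best = ""
    for i in range(len(w)):
        b = border_array(w[i:])
        for length in range(len(w) - i, 0, -1):
            if b[length] == 0:           # w[i:i+length] has no border
                if length > len(best):
                    best = w[i:i + length]
                break                    # shorter factors from i cannot beat this one
    return best

print(maximal_unbordered_factor("abaab"))  # -> "baa"; "abaab" itself has border "ab"
```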
Indexing weighted sequences: Neat and efficient
In a weighted sequence, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example in molecular biology, where they are known under the name of Position Weight Matrices. Given a probability threshold 1/z, we say that a string P of length m occurs in a weighted sequence X at position i if the product of probabilities of the letters of P at positions i, …, i+m−1 in X is at least 1/z. In this article, we consider an indexing variant of the problem, in which we are to pre-process a weighted sequence to answer multiple pattern matching queries. We present an O(nz)-time construction of an O(nz)-sized index for a weighted sequence of length n that answers pattern matching queries in the optimal O(m+Occ) time, where Occ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of ⌊z⌋ strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We thus improve the most efficient previously known index by Amir et al. (Theor. Comput. Sci., 2008), which has size and construction time O(nz² log z), preserving optimal query time. On the way we develop a new, more straightforward index for the so-called property matching problem. We provide an open-source implementation of our data structure and present experimental results using both synthetic and real data. Our construction also allows us to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. at EDBT 2016 and an improvement of the space complexity of their general index. We also present applications of our index.
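The occurrence condition above is easy to check directly. The sketch below models a weighted sequence as a list of letter-to-probability maps (a toy representation chosen for illustration, not the paper's index) and tests the condition verbatim:

```python
# Pattern P occurs at position i of weighted sequence X with threshold 1/z if
# the product of the per-position probabilities of P's letters is at least 1/z.

def occurs_at(X, P: str, i: int, z: float) -> bool:
    prob = 1.0
    for j, letter in enumerate(P):
        prob *= X[i + j].get(letter, 0.0)
        if prob < 1.0 / z:      # early exit: the product can only decrease
            return False
    return True

def occurrences(X, P: str, z: float) -> list[int]:
    return [i for i in range(len(X) - len(P) + 1) if occurs_at(X, P, i, z)]

# A toy weighted sequence of length 4 over {a, b}:
X = [{'a': 0.9, 'b': 0.1},
     {'a': 0.5, 'b': 0.5},
     {'a': 0.2, 'b': 0.8},
     {'a': 1.0}]
print(occurrences(X, "ab", 4))  # positions where Pr["ab"] >= 1/4, here [0, 1]
```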
Searching of gapped repeats and subrepetitions in a word
A gapped repeat is a factor of the form uvu, where u and v are nonempty words. The period of the gapped repeat is defined as |uv|. The gapped repeat is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its period. The gapped repeat is called α-gapped if its period is not greater than α|u|. A δ-subrepetition is a factor whose exponent is less than 2 but not less than 1+δ (the exponent of a factor is the quotient of its length and its minimal period). The δ-subrepetition is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its minimal period. We reveal a close relation between maximal gapped repeats and maximal subrepetitions. Moreover, we prove upper bounds on the number of maximal α-gapped repeats and on the number of maximal δ-subrepetitions in a word of length n. Using the obtained upper bounds, we propose algorithms for finding all maximal α-gapped repeats and all maximal δ-subrepetitions in a word of length n. For the algorithm finding all maximal α-gapped repeats we bound the time complexity both for the case of constant alphabet size and for the general case. For finding all maximal δ-subrepetitions we propose two algorithms: for the first we again bound the time complexity for constant and general alphabet sizes, and for the second we bound the expected time complexity.
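For illustration, a brute-force enumeration of maximal α-gapped repeats taken straight from the definitions above (far slower than the algorithms proposed in the paper; the function name is ad hoc):

```python
# Enumerate maximal alpha-gapped repeats uvu: left copy u at position i, right
# copy u at position i + p, gap v nonempty (so |u| < p), period p <= alpha*|u|,
# and no extension possible on either side.

def maximal_alpha_gapped_repeats(w: str, alpha: float) -> list[tuple[int, int, int]]:
    n = len(w)
    result = []  # triples (i, p, l): copies w[i:i+l] and w[i+p:i+p+l], period p
    for p in range(2, n):                 # period |uv| >= 2 since u and v are nonempty
        for i in range(n - p):
            if w[i] != w[i + p]:
                continue
            if i > 0 and w[i - 1] == w[i + p - 1]:
                continue                  # extendable to the left -> not maximal
            l = 0
            while i + p + l < n and w[i + l] == w[i + p + l]:
                l += 1                    # extend the arms as far as they match
            if 0 < l < p and p <= alpha * l:
                result.append((i, p, l))
    return result

print(maximal_alpha_gapped_repeats("abaabab", 3))  # e.g. u = "ab" at positions 0 and 5
```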
- …