k-Approximate Quasiperiodicity under Hamming and Edit Distance
Quasiperiodicity in strings was introduced almost 30 years ago as an extension of string periodicity. The basic notions of quasiperiodicity are cover and seed. A cover of a text T is a string whose occurrences in T cover all positions of T. A seed of text T is a cover of a superstring of T. In various applications exact quasiperiodicity is still not sufficient due to the presence of errors. We consider approximate notions of quasiperiodicity, for which we allow approximate occurrences in T with a small Hamming, Levenshtein or weighted edit distance.
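The definition of a cover, and its approximate variant under Hamming distance, can be illustrated with a brute-force check. This is only a minimal sketch for small inputs, not one of the algorithms from the paper; the function names (`hamming`, `covered_positions`, `is_k_cover`) and the example strings are assumptions made here for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def covered_positions(text: str, cand: str, k: int = 0) -> list[bool]:
    """Mark every position of text lying inside some occurrence of cand
    with at most k mismatches (k = 0 means exact occurrences)."""
    m = len(cand)
    covered = [False] * len(text)
    for i in range(len(text) - m + 1):
        if hamming(text[i:i + m], cand) <= k:
            for j in range(i, i + m):
                covered[j] = True
    return covered


def is_k_cover(text: str, cand: str, k: int = 0) -> bool:
    """True if the k-approximate occurrences of cand cover all of text."""
    if not cand or len(cand) > len(text):
        return False
    return all(covered_positions(text, cand, k))


# "aba" is an exact cover of "abaababaaba".
print(is_k_cover("abaababaaba", "aba"))
# In "abaabcaba" the factor "abc" is within Hamming distance 1 of "aba",
# so "aba" covers the text only when one mismatch is allowed.
print(is_k_cover("abaabcaba", "aba", 0), is_k_cover("abaabcaba", "aba", 1))
```

The check runs in time proportional to |T| times |C| per candidate, which is exactly the kind of cost the dedicated algorithms aim to avoid on large data.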
In previous work, Sim et al. (2002) and Christodoulakis et al. (2005) showed that computing approximate covers and seeds, respectively, under weighted edit distance is NP-hard. They therefore considered restricted approximate covers and seeds, which need to be factors of the original string T, and presented polynomial-time algorithms for computing them. Further algorithms, considering approximate occurrences with Hamming distance bounded by k, were given in several contributions by Guth et al. They also studied relaxed approximate quasiperiods that do not need to cover all positions of T.
For large data, the exponents in the polynomial time complexity play a crucial role. We present more efficient algorithms for computing restricted approximate covers and seeds. In particular, we improve upon the complexities of many of the aforementioned algorithms, also for relaxed quasiperiods. Our solutions are especially efficient if the number (or total cost) of allowed errors is bounded. We also show NP-hardness of computing non-restricted approximate covers and seeds under Hamming distance.
Approximate covers were studied in three recent contributions at CPM over the last three years. However, these works consider a different definition of an approximate cover of T, that is, the shortest exact cover of a string T' with the smallest Hamming distance from T.
Linear Time Construction of Cover Suffix Tree and Applications
The Cover Suffix Tree (CST) of a string T is the suffix tree of T with
additional explicit nodes corresponding to halves of square substrings of T.
In the CST an explicit node corresponding to a substring C of T is
annotated with two numbers: the number of non-overlapping consecutive
occurrences of C and the total number of positions in T that are covered by
occurrences of C in T. Kociumaka et al. (Algorithmica, 2015) have shown how
to compute the CST of a length-n string in O(n log n) time. We show how to
compute the CST in O(n) time assuming that T is over an integer alphabet.
Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown
that knowing the CST of a length-n string T, one can compute a linear-sized
representation of all seeds of T as well as all shortest α-partial
covers and seeds in T for a given α in O(n) time. Thus our result
implies linear-time algorithms computing these notions of quasiperiodicity. The
resulting algorithm computing seeds is substantially different from the
previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020).
Kociumaka et al. (Algorithmica, 2015) proposed an O(n log n)-time algorithm
for computing a shortest α-partial cover for each α = 1, ..., n;
we improve this complexity to O(n).
Our results are based on a new characterization of consecutive overlapping
occurrences of a substring S of T in terms of the set of runs (see Kolpakov
and Kucherov, FOCS 1999) in T. This new insight also leads to an O(n)-sized
index for reporting overlapping consecutive occurrences of a given pattern P
of length m in O(m + output) time, where output is the number of
occurrences reported. In comparison, a general index for reporting bounded-gap
consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016)
uses O(n log n) space.
Comment: Accepted to ESA 2023. Abstract abridged to satisfy arXiv requirements.
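To make the two node annotations concrete, the sketch below computes them naively for a single substring, together with the overlapping consecutive occurrence pairs that the index reports. It is an illustration only, not the linear-time construction from the paper. The function names are hypothetical, and the reading of "non-overlapping consecutive occurrences" as pairs of consecutive occurrences whose starting positions differ by at least the substring length is an assumption.

```python
def occurrences(text: str, pat: str) -> list[int]:
    """Starting positions of all (possibly overlapping) occurrences of pat."""
    res = []
    i = text.find(pat)
    while i != -1:
        res.append(i)
        i = text.find(pat, i + 1)
    return res


def naive_annotation(text: str, sub: str):
    """Return (non_overlapping, covered, overlapping_pairs) for sub in text.

    Assumed reading: two consecutive occurrences at positions a < b
    overlap when b - a < len(sub); otherwise they do not overlap.
    """
    occ = occurrences(text, sub)
    m = len(sub)
    gaps = [b - a for a, b in zip(occ, occ[1:])]
    # Consecutive pairs that do not overlap.
    non_overlapping = sum(g >= m for g in gaps)
    # Each occurrence contributes min(gap, m) fresh positions; the last
    # occurrence contributes all m of its positions.
    covered = sum(min(g, m) for g in gaps) + (m if occ else 0)
    # Consecutive pairs that do overlap.
    overlapping_pairs = [(a, b) for a, b in zip(occ, occ[1:]) if b - a < m]
    return non_overlapping, covered, overlapping_pairs


# "aba" occurs in "abaababaaba" at positions 0, 3, 5 and 8.
print(naive_annotation("abaababaaba", "aba"))
```

For this input the occurrences at 3 and 5 are the only overlapping consecutive pair, and all 11 positions are covered, so "aba" is a cover of the text.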
LIPIcs, Volume 274, ESA 2023, Complete Volume