The Number of Repetitions in 2D-Strings
The notions of periodicity and repetitions in strings, and hence those of
runs and squares, naturally extend to two-dimensional strings. We consider two
types of repetitions in 2D-strings: 2D-runs and quartics (quartics are a
2D-version of squares in standard strings). Amir et al. introduced 2D-runs,
showed that there are $O(n^3)$ of them in an $n \times n$ 2D-string and
presented a simple construction giving a lower bound of $\Omega(n^2)$ for their
number (TCS 2020). We make a significant step towards closing the gap between
these bounds by showing that the number of 2D-runs in an $n \times n$ 2D-string
is $O(n^2 \log^2 n)$. In particular, our bound implies that the
$O(n^2 \log n + \mathrm{output})$ run-time of the algorithm of Amir et al. for computing
2D-runs is also $O(n^2 \log^2 n)$. We expect this result to allow for
exploiting 2D-runs algorithmically in the area of 2D pattern matching.
A quartic is a 2D-string composed of $2 \times 2$ identical blocks
(2D-strings) that was introduced by Apostolico and Brimkov (TCS 2000), where by
quartics they meant only primitively rooted quartics, i.e. built of a primitive
block. Here our notion of quartics is more general and analogous to that of
squares in 1D-strings. Apostolico and Brimkov showed that there are
$O(n^2 \log^2 n)$ occurrences of primitively rooted quartics in an
$n \times n$ 2D-string and that this bound is attainable. Consequently the number of
distinct primitively rooted quartics is $O(n^2 \log^2 n)$. Here, we prove that
the number of distinct general quartics is also $O(n^2 \log^2 n)$. This extends
the rich combinatorial study of the number of distinct squares in a 1D-string,
that was initiated by Fraenkel and Simpson (J. Comb. Theory A 1998), to two
dimensions.
Finally, we show some algorithmic applications of 2D-runs. (Abstract
shortened due to arXiv requirements.)
Comment: To appear in the ESA 2020 proceedings
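To make the notion of a quartic concrete, the following is a minimal brute-force sketch in Python. It assumes that a 2D-string is given as a non-empty list of equal-length rows and that a quartic is a 2 x 2 arrangement of copies of one block (not necessarily primitive). The function names is_quartic and distinct_quartics are illustrative only and do not come from the paper; the enumeration is far slower than the bounds discussed in the abstract.

```python
def is_quartic(grid):
    """Return True if grid (a sequence of equal-length rows) is a 2 x 2
    arrangement of copies of a single block."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    # Both dimensions must be positive and even to split into 2 x 2 blocks.
    if rows == 0 or cols == 0 or rows % 2 or cols % 2:
        return False
    h, w = rows // 2, cols // 2
    # Every cell must agree with the corresponding cell of the top-left block.
    return all(
        grid[r][c] == grid[r % h][c % w]
        for r in range(rows)
        for c in range(cols)
    )


def distinct_quartics(grid):
    """Brute-force set of distinct quartics occurring as subrectangles of grid."""
    n_rows, n_cols = len(grid), len(grid[0])
    found = set()
    for top in range(n_rows):
        for left in range(n_cols):
            # Only even heights and widths can form a quartic.
            for height in range(2, n_rows - top + 1, 2):
                for width in range(2, n_cols - left + 1, 2):
                    sub = tuple(
                        row[left:left + width]
                        for row in grid[top:top + height]
                    )
                    if is_quartic(sub):
                        found.add(sub)
    return found
```

For example, the 4 x 4 grid with rows "abab", "cdcd", "abab", "cdcd" is a quartic whose block has rows "ab" and "cd".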
Hardness of Detecting Abelian and Additive Square Factors in Strings
We prove 3SUM-hardness (no strongly subquadratic-time algorithm, assuming the
3SUM conjecture) of several problems related to finding Abelian square and
additive square factors in a string. In particular, we conclude conditional
optimality of the state-of-the-art algorithms for finding such factors.
Overall, we show 3SUM-hardness of (a) detecting an Abelian square factor of
an odd half-length, (b) computing centers of all Abelian square factors, (c)
detecting an additive square factor in a length-$n$ string of integers of
magnitude $n^{O(1)}$, and (d) a problem of computing a double 3-term
arithmetic progression (i.e., finding indices $i \ne j$ such that
$(x_i + x_j)/2 = x_{(i+j)/2}$) in a sequence of integers $x_1, \dots, x_n$ of
magnitude $n^{O(1)}$.
Problem (d) is essentially a convolution version of the AVERAGE problem that
was proposed in a manuscript of Erickson. We obtain a conditional lower bound
for it with the aid of techniques recently developed by Dudek et al. [STOC
2020]. Problem (d) immediately reduces to problem (c) and is a step in
reductions to problems (a) and (b). In conditional lower bounds for problems
(a) and (b) we apply an encoding of Amir et al. [ICALP 2014] and extend it
using several string gadgets that include arbitrarily long Abelian-square-free
strings.
Our reductions also imply conditional lower bounds for detecting Abelian
squares in strings over a constant-sized alphabet. We also show a subquadratic
upper bound in this case, applying a result of Chan and Lewenstein [STOC 2015].
Comment: Accepted to ESA 202
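The problems named in this abstract can be illustrated with brute-force checkers. The sketch below is in Python and relies on the standard definitions, which are assumptions here since the abstract does not spell them out: an Abelian square is a string xy where y is a permutation of x, an additive square is an integer string xy with |x| = |y| and equal sums, and a double 3-term arithmetic progression is a pair of distinct indices i, j with i + j even and x_i + x_j = 2 x_{(i+j)/2}. All function names are illustrative. These checkers take quadratic time or worse and are unrelated to the reductions in the paper.

```python
from collections import Counter


def has_abelian_square(s):
    """Brute force: is some factor xy of s such that y is a permutation of x?"""
    n = len(s)
    for start in range(n):
        for half in range(1, (n - start) // 2 + 1):
            left = s[start:start + half]
            right = s[start + half:start + 2 * half]
            if Counter(left) == Counter(right):
                return True
    return False


def has_additive_square(nums):
    """Brute force: is some factor xy of nums such that |x| = |y| and sum(x) = sum(y)?"""
    prefix = [0]
    for value in nums:
        prefix.append(prefix[-1] + value)
    n = len(nums)
    for start in range(n):
        for half in range(1, (n - start) // 2 + 1):
            mid, end = start + half, start + 2 * half
            if prefix[mid] - prefix[start] == prefix[end] - prefix[mid]:
                return True
    return False


def has_double_3ap(xs):
    """Brute force: are there indices i < j with i + j even and
    xs[i] + xs[j] == 2 * xs[(i + j) // 2]? (0-based indices.)"""
    n = len(xs)
    for i in range(n):
        # Stepping by 2 keeps i + j even, so the midpoint is an index.
        for j in range(i + 2, n, 2):
            if xs[i] + xs[j] == 2 * xs[(i + j) // 2]:
                return True
    return False
```

For instance, "abcacb" is itself an Abelian square ("abc" followed by "acb"), and [1, 4, 2, 3] is an additive square since 1 + 4 = 2 + 3.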
Elastic-Degenerate String Matching with 1 Error
An elastic-degenerate (ED) string is a sequence of $n$ finite sets of strings of
total length $N$, introduced to represent a set of related DNA sequences, also
known as a pangenome. The ED string matching (EDSM) problem consists in
reporting all occurrences of a pattern of length $m$ in an ED text. This
problem has recently received some attention from the combinatorial pattern
matching community, culminating in an
$\tilde{O}(nm^{\omega-1}) + O(N)$-time algorithm [Bernardini
et al., SIAM J. Comput. 2022], where $\omega$ denotes the matrix multiplication
exponent and the $\tilde{O}(\cdot)$ notation suppresses polylog
factors. In the $k$-EDSM problem, the approximate version of EDSM, we are asked
to report all pattern occurrences with at most $k$ errors. $k$-EDSM can be
solved in $O(k^2 mG + kN)$ time, under edit distance, or
$O(kmG + kN)$ time, under Hamming distance, where $G$ denotes the total
number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020].
Unfortunately, $G$ is only bounded by $N$, and so even for $k = 1$, the existing
algorithms run in $\Omega(mN)$ time in the worst case. In this paper we show
that $1$-EDSM can be solved in $O((nm^2 + N) \log m)$ or
$O(nm^3 + N)$ time under edit distance. For the decision version, we
present a faster $O(nm^2 \sqrt{\log m} + N \log\log m)$-time algorithm.
We also show that $1$-EDSM can be solved in $O(nm^2 + N \log m)$ time
under Hamming distance. Our algorithms for edit distance rely on non-trivial
reductions from $1$-EDSM to special instances of classic computational geometry
problems (2d rectangle stabbing or 2d range emptiness), which we show how to
solve efficiently. In order to obtain an even faster algorithm for Hamming
distance, we rely on employing and adapting the $k$-errata trees for indexing
with errors [Cole et al., STOC 2004].
Comment: This is an extended version of a paper accepted at LATIN 202
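As an illustration of the exact (error-free) EDSM problem described above, the following Python sketch decides whether a pattern occurs in an ED text. It assumes that the ED text is a list of sets of strings (the empty string is allowed inside a set) and that an occurrence is a substring of some string obtained by picking one string per set and concatenating. The name ed_contains is illustrative; this is a naive baseline, not any of the algorithms cited in the abstract, and it does not handle errors.

```python
def ed_contains(ed_text, pattern):
    """Naive decision version of exact EDSM.

    ed_text: list of sets of strings; pattern: non-empty string.
    Returns True if pattern is a substring of some concatenation that
    takes exactly one string from each set, in order.
    """
    m = len(pattern)
    if m == 0:
        return True
    # active holds lengths j (0 < j < m) such that pattern[:j] matches a
    # suffix of some choice of strings ending at the current set boundary.
    active = set()
    for segment in ed_text:
        new_active = set()
        for s in segment:
            # Occurrence lying entirely inside one string of the set.
            if pattern in s:
                return True
            # Try to extend partial matches carried over from earlier sets.
            for j in active:
                rest = pattern[j:]
                if len(s) >= len(rest):
                    if s.startswith(rest):
                        return True
                elif rest.startswith(s):
                    # s is consumed entirely (j is unchanged when s is empty).
                    new_active.add(j + len(s))
            # Start new partial matches at suffixes of s.
            for t in range(1, min(len(s), m - 1) + 1):
                if s[-t:] == pattern[:t]:
                    new_active.add(t)
        active = new_active
    return False
```

For the ED text [{"ab"}, {"c", ""}, {"d"}], whose language is {"abcd", "abd"}, the patterns "abd" and "bcd" occur, whereas "acd" does not.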
Linear-Time Computation of Cyclic Roots and Cyclic Covers of a String
Cyclic versions of covers and roots of a string are considered in this paper. A prefix V of a string S is a cyclic root of S if S is a concatenation of cyclic rotations of V. A prefix V of S is a cyclic cover of S if the occurrences of the cyclic rotations of V cover all positions of S. We present $O(n)$-time algorithms computing all cyclic roots (using number-theoretic tools) and all cyclic covers (using tools related to seeds) of a length-$n$ string over an integer alphabet. Our results improve upon the $O(n \log\log n)$ and $O(n \log n)$ time complexities of recent algorithms of Grossi et al. (WALCOM 2023) for the respective problems and provide novel approaches to the problems. As a by-product, we obtain an optimal data structure for Internal Circular Pattern Matching queries that generalize Internal Pattern Matching and Cyclic Equivalence queries of Kociumaka et al. (SODA 2015).
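The two definitions above can be checked directly by brute force. The Python sketch below follows them literally and returns the lengths of all prefixes that are cyclic roots or cyclic covers; the function names are illustrative, and the running time is polynomial but far from the linear-time algorithms the abstract announces.

```python
def is_rotation(u, v):
    """True if u is a cyclic rotation of v."""
    return len(u) == len(v) and u in v + v


def cyclic_roots(s):
    """Lengths of prefixes V of s such that s is a concatenation of rotations of V."""
    n = len(s)
    result = []
    for length in range(1, n + 1):
        if n % length:
            continue  # blocks of this length cannot tile s
        v = s[:length]
        if all(is_rotation(s[i:i + length], v) for i in range(0, n, length)):
            result.append(length)
    return result


def cyclic_covers(s):
    """Lengths of prefixes V of s such that occurrences of rotations of V
    cover every position of s."""
    n = len(s)
    result = []
    for length in range(1, n + 1):
        v = s[:length]
        covered = [False] * n
        for i in range(n - length + 1):
            if is_rotation(s[i:i + length], v):
                for p in range(i, i + length):
                    covered[p] = True
        if all(covered):
            result.append(length)
    return result
```

For S = "abba", the prefix "ab" is a cyclic root (S = "ab" + "ba"), and "abb" is a cyclic cover but not a cyclic root, since 3 does not divide 4.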
Internal Quasiperiod Queries
Internal pattern matching requires one to answer queries about factors of a
given string. Many results are known on answering internal period queries,
asking for the periods of a given factor. In this paper we investigate (for the
first time) internal queries asking for covers (also known as quasiperiods) of
a given factor. We propose a data structure that answers such queries in
$O(\log n \log\log n)$ time for the shortest cover and in
$O(\log n (\log\log n)^2)$ time for a representation of all the covers, after
$O(n \log n)$ time and space preprocessing.
Comment: To appear in the SPIRE 2020 proceedings
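To clarify what such a query returns, here is a naive Python sketch that recomputes the covers of a factor from scratch. It assumes the standard definition, which the abstract does not restate: a string C is a cover of S if every position of S lies within some occurrence of C in S. The names cover_lengths and shortest_cover_of_factor and the half-open factor convention text[i:j] are illustrative choices; the data structure from the paper answers these queries far faster after preprocessing.

```python
def cover_lengths(s):
    """Lengths of all covers of s, found by scanning occurrences of each prefix."""
    n = len(s)
    result = []
    for length in range(1, n + 1):
        candidate = s[:length]
        covered_up_to = 0  # positions [0, covered_up_to) are covered so far
        pos = s.find(candidate)
        # Accept the next occurrence only if it leaves no uncovered gap.
        while pos != -1 and pos <= covered_up_to:
            covered_up_to = pos + length
            pos = s.find(candidate, pos + 1)
        if covered_up_to == n:
            result.append(length)
    return result


def shortest_cover_of_factor(text, i, j):
    """Length of the shortest cover of the non-empty factor text[i:j]."""
    return cover_lengths(text[i:j])[0]
```

For example, the factor "abaaba" of "xabaabay" has covers of lengths 3 ("aba") and 6 (the factor itself).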
Approximate Circular Pattern Matching
We investigate the complexity of approximate circular pattern matching (CPM, in short) under the Hamming and edit distance. Under each of these two basic metrics, we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a positive integer threshold $k$, and we are to report all starting positions (called occurrences) of fragments of $T$ that are at distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to check if there is any such occurrence. All previous results for approximate CPM were either average-case upper bounds or heuristics, with the exception of the work of Charalampopoulos et al. [CKP+, JCSS'21], who considered only the Hamming distance. For the reporting version of the approximate CPM problem, under the Hamming distance we improve upon the main algorithm of [CKP+, JCSS'21] from $O(n + (n/m) k^4)$ to $O(n + (n/m) k^3 \log\log k)$ time; for the edit distance, we give an $O(nk^2)$-time algorithm. Notably, for the decision versions and wide parameter ranges, we give algorithms whose complexities are almost identical to the state-of-the-art for standard (i.e., non-circular) approximate pattern matching: For the decision version of the approximate CPM problem under the Hamming distance, we obtain an $O(n + (n/m) k^2 \log k / \log\log k)$-time algorithm, which works in $O(n)$ time whenever $k = O(\sqrt{m \log\log m / \log m})$. In comparison, the fastest algorithm for the standard counterpart of the problem, by Chan et al. [CGKKP, STOC'20], runs in $O(n)$ time only for $k = O(\sqrt{m})$. We achieve this result via a reduction to a geometric problem by building on ideas from [CKP+, JCSS'21] and Charalampopoulos et al. [CKW, FOCS'20]. For the decision version of the approximate CPM problem under the edit distance, the $O(nk \log^3 k)$ runtime of our algorithm nearly matches the $O(nk)$ runtime of the Landau-Vishkin algorithm [LV, J. Algorithms'89] for approximate pattern matching under edit distance; the latter algorithm remains the fastest known for $k = \Omega(m^{2/5})$.
As a stepping stone, we propose an $O(nk \log^3 k)$-time algorithm for solving the Longest Prefix $k'$-Approximate Match problem, proposed by Landau et al. [LMS, SICOMP'98], for all $k' \in \{1, \ldots, k\}$. Our algorithm is based on Tiskin's theory of seaweeds [Tiskin, Math. Comput. Sci.'08], with recent advancements (see Charalampopoulos et al. [CKW, FOCS'22]), and on exploiting the seaweeds' relation to Monge matrices. In contrast, we obtain a conditional lower bound that suggests a polynomial separation between approximate CPM under the Hamming distance over the binary alphabet and its non-circular counterpart. We also show that a strongly subquadratic-time algorithm for the decision version of approximate CPM under edit distance would refute the Strong Exponential Time Hypothesis.
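The reporting version of approximate CPM under the Hamming distance, as defined at the start of this abstract, can be stated as a short brute-force procedure. The Python sketch below follows the definition directly and takes roughly O(n m^2) time; it is meant only to pin down the problem and is not one of the algorithms described in the abstract. The function names are illustrative.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def approx_cpm_hamming(text, pattern, k):
    """Starting positions i (0-based) such that text[i:i+m] is within Hamming
    distance k of some cyclic rotation of pattern, where m = len(pattern)."""
    n, m = len(text), len(pattern)
    # All distinct cyclic rotations of the pattern.
    rotations = {pattern[r:] + pattern[:r] for r in range(m)}
    return [
        i
        for i in range(n - m + 1)
        if any(hamming(text[i:i + m], rot) <= k for rot in rotations)
    ]
```

For the text "aaab" and pattern "ab", the only exact circular occurrence starts at position 2, while with k = 1 every window of length 2 qualifies. The decision version simply asks whether the returned list is non-empty.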