9 research outputs found
Longest common substring made fully dynamic
Given two strings S and T, each of length at most n, the longest common substring (LCS) problem is to find a longest substring common to S and T. This is a classical problem in computer science with an O(n)-time solution. In the fully dynamic setting, edit operations are allowed in either of the two strings, and the problem is to find an LCS after each edit. We present the first solution to this problem requiring sublinear time in n per edit operation. In particular, we show how to find an LCS after each edit operation in Õ(n2/3) time, after Õ(n)-time and space preprocessing. 1 This line of research has been recently initiated in a somewhat restricted dynamic variant by Amir et al. [SPIRE 2017]. More specifically, they presented an Õ(n)-sized data structure that returns an LCS of the two strings after a single edit operation (that is reverted afterwards) in Õ(1) time. At CPM 2018, three papers (Abedin et al., Funakoshi et al., and Urabe et al.) studied analogously restricted dynamic variants of problems on strings. We show that the techniques we develop can be applied to obtain fully dynamic algorithms for all of these variants. The only previously known sublinear-time dynamic algorithms for problems on strings were for maintaining a dynamic collection of strings for comparison queries and for pattern matching, with the most recent advances made by Gawrychowski et al. [SODA 2018] and by Clifford et al. [STACS 2018]. As an intermediate problem we consider computing the solution for a string with a given set of k edits, which leads us, in particular, to answering internal queries on a string. The input to such a query is specified by a substring (or substrings) of a given string. Data structures for answering internal string queries that were proposed by Kociumaka et al. [SODA 2015] and by Gagie et al. [CCCG 2013] are used, along with new ones, based on ingredients such as the suffix tree, heavy-path decomposition, orthogonal range queries, difference covers, and string periodicity
Dynamic and Internal Longest Common Substring
Given two strings S and T, each of length at most n, the longest common substring (LCS) problem is to find a longest substring common to S and T. This is a classical problem in computer science with an O(n) -time solution. In the fully dynamic setting, edit operations are allowed in either of the two strings, and the problem is to find an LCS after each edit. We present the first solution to the fully dynamic LCS problem requiring sublinear time in n per edit operation. In particular, we show how to find an LCS after each edit operation in O~ (n2 / 3) time, after O~ (n) -time and space preprocessing. This line of research has been recently initiated in a somewhat restricted dynamic variant by Amir et al. [SPIRE 2017]. More specifically, the authors presented an O~ (n) -sized data structure that returns an LCS of the two strings after a single edit operation (that is reverted afterwards) in O~ (1) time. At CPM 2018, three papers (Abedin et al., Funakoshi et al., and Urabe et al.) studied analogously restricted dynamic variants of problems on strings; specifically, computing the longest palindrome and the Lyndon factorization of a string after a single edit operation. We develop dynamic sublinear-time algorithms for both of these problems as well. We also consider internal LCS queries, that is, queries in which we are to return an LCS of a pair of substrings of S and T. We show that answering such queries is hard in general and propose efficient data structures for several restricted cases
Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications
An elastic-degenerate (ED) string T is a sequence of n sets T[1], . . ., T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as L(T) = {S1 · · · Sn : Si ∈ T[i] for all i ∈ [1, n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T1 and T2 of lengths n1 and n2, cardinalities m1 and m2, and sizes N1 and N2, respectively, we show the following: There is no O((N1N2)1−ϵ)-time algorithm, thus no O ((N1m2 + N2m1)1−ϵ)-time algorithm and no O ((N1n2 + N2n1)1−ϵ)-time algorithm, for any constant ϵ > 0, for EDSI even when T1 and T2 are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false. There is no combinatorial O((N1 + N2)1.2−ϵf(n1, n2))-time algorithm, for any constant ϵ > 0 and any function f, for EDSI even when T1 and T2 are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false. An O(N1 log N1 log n1 + N2 log N2 log n2)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T1 and T2 are given in a compact representation, we show that the problem is NP-complete. An O(N1m2 + N2m1)-time algorithm for EDSI. An Õ(N1ω−1n2 + N2ω−1n1)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size. We also show that the techniques we develop have applications outside of ED string comparison
Elastic-Degenerate String Matching with 1 Error
An elastic-degenerate string is a sequence of finite sets of strings of
total length , introduced to represent a set of related DNA sequences, also
known as a pangenome. The ED string matching (EDSM) problem consists in
reporting all occurrences of a pattern of length in an ED text. This
problem has recently received some attention by the combinatorial pattern
matching community, culminating in an
-time algorithm [Bernardini
et al., SIAM J. Comput. 2022], where denotes the matrix multiplication
exponent and the notation suppresses polylog
factors. In the -EDSM problem, the approximate version of EDSM, we are asked
to report all pattern occurrences with at most errors. -EDSM can be
solved in time, under edit distance, or
time, under Hamming distance, where denotes the total
number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020].
Unfortunately, is only bounded by , and so even for , the existing
algorithms run in time in the worst case. In this paper we show
that -EDSM can be solved in or
time under edit distance. For the decision version, we
present a faster -time algorithm.
We also show that -EDSM can be solved in time
under Hamming distance. Our algorithms for edit distance rely on non-trivial
reductions from -EDSM to special instances of classic computational geometry
problems (2d rectangle stabbing or 2d range emptiness), which we show how to
solve efficiently. In order to obtain an even faster algorithm for Hamming
distance, we rely on employing and adapting the -errata trees for indexing
with errors [Cole et al., STOC 2004].Comment: This is an extended version of a paper accepted at LATIN 202
Substring Complexity in Sublinear Space
Shannon's entropy is a definitive lower bound for statistical compression.
Unfortunately, no such clear measure exists for the compressibility of
repetitive strings. Thus, ad-hoc measures are employed to estimate the
repetitiveness of strings, e.g., the size of the Lempel-Ziv parse or the
number of equal-letter runs of the Burrows-Wheeler transform. A more recent
one is the size of a smallest string attractor. Unfortunately, Kempa
and Prezza [STOC 2018] showed that computing is NP-hard. Kociumaka et
al. [LATIN 2020] considered a new measure that is based on the function
counting the cardinalities of the sets of substrings of each length of ,
also known as the substring complexity. This new measure is defined as and lower bounds all the measures previously
considered. In particular, always holds and can be
computed in time using working space. Kociumaka et
al. showed that if is given, one can construct an -sized representation of supporting efficient direct
access and efficient pattern matching queries on . Given that for highly
compressible strings, is significantly smaller than , it is natural
to pose the following question: Can we compute efficiently using
sublinear working space?
It is straightforward to show that any algorithm computing using
space requires time through a reduction
from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present
the following results: an -time and
-space algorithm to compute , for any ; and
an -time and -space algorithm to
compute , for any
Faster algorithms for longest common substring
In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an (n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an (n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in (n log σ/log n) space and read in (n log σ/log n) time. We show that, in this model, we can compute an LCS in time (n log σ / √{log n}), which is sublinear in n if σ = 2^{o(√{log n})} (in particular, if σ = (1)), using optimal space (n log σ/log n).
We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in (n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in (n log^k n) time for k = (1) [J. Comput. Biol. 2016]. We show an (n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using (n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors. </p
Faster Algorithms for Longest Common Substring
International audienceIn the classic longest common substring (LCS) problem, we are given two strings S and T , each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an O(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an O(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in O(n log σ/ log n) space and read in O(n log σ/ log n) time. We show that, in this model, we can compute an LCS in time O(n log σ/ √ log n), which is sublinear in n if σ = 2 o(√ log n) (in particular, if σ = O(1)), using optimal space O(n log σ/ log n). We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in O(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in O(n log k n) time for k = O(1) [J. Comput. Biol. 2016]. We show an O(n log k−1/2 n)-time algorithm, for any constant k > 0 and irrespective of the alphabet size, using O(n) space as the previous approaches. We thus notably break through the well-known n log k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors
Efficient Data Structures for Text Processing Applications
This thesis is devoted to designing and analyzing efficient text indexing data structures and associated algorithms for processing text data. The general problem is to preprocess a given text or a collection of texts into a space-efficient index to quickly answer various queries on this data. Basic queries such as counting/reporting a given pattern\u27s occurrences as substrings of the original text are useful in modeling critical bioinformatics applications. This line of research has witnessed many breakthroughs, such as the suffix trees, suffix arrays, FM-index, etc. In this work, we revisit the following problems: 1. The Heaviest Induced Ancestors problem 2. Range Longest Common Prefix problem 3. Range Shortest Unique Substrings problem 4. Non-Overlapping Indexing problem For the first problem, we present two new space-time trade-offs that improve the space, query time, or both of the existing solutions by roughly a logarithmic factor. For the second problem, our solution takes linear space, which improves the previous result by a logarithmic factor. The techniques developed are then extended to obtain an efficient solution for our third problem, which is newly formulated. Finally, we present a new framework that yields efficient solutions for the last problem in both cache-aware and cache-oblivious models