14 research outputs found
Longest Common Prefixes with -Errors and Applications
Although real-world text datasets, such as DNA sequences, are far from being
uniformly random, average-case string searching algorithms perform
significantly better than worst-case ones in most applications of interest. In
this paper, we study the problem of computing the longest prefix of each suffix
of a given string of length over a constant-sized alphabet that occurs
elsewhere in the string with -errors. This problem has already been studied
under the Hamming distance model. Our first result is an improvement upon the
state-of-the-art average-case time complexity for non-constant and using
only linear space under the Hamming distance model. Notably, we show that our
technique can be extended to the edit distance model with the same time and
space complexities. Specifically, our algorithms run in time on average using space. We show that our
technique is applicable to several algorithmic problems in computational
biology and elsewhere.
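The problem statement above can be illustrated with a brute-force baseline. This is only a sketch under assumptions: the error bound is called `k` here (the symbol is missing in the listing), only the Hamming distance model is covered, and the function name `lcp_k_mismatches` is hypothetical. It takes quadratic time or worse and is not the average-case algorithm of the paper.

```python
def lcp_k_mismatches(text: str, k: int) -> list[int]:
    """For each start i, return the length of the longest prefix of text[i:]
    that also occurs at some other start j != i with at most k mismatches."""
    n = len(text)
    best = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mismatches = 0
            length = 0
            # Extend the two aligned substrings while the mismatch budget holds.
            while i + length < n and j + length < n:
                if text[i + length] != text[j + length]:
                    if mismatches == k:
                        break
                    mismatches += 1
                length += 1
            best[i] = max(best[i], length)
    return best
```

With `k = 0` this reduces to the exact longest repeated prefix of each suffix.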
Efficient Computation of Sequence Mappability
Sequence mappability is an important task in genome re-sequencing. In the
-mappability problem, for a given sequence of length , our goal
is to compute a table whose th entry is the number of indices such
that length- substrings of starting at positions and have at
most mismatches. Previous works on this problem focused on heuristic
approaches to compute a rough approximation of the result or on the case of
. We present several efficient algorithms for the general case of the
problem. Our main result is an algorithm that works in time and space for
. It requires a careful adaptation of the technique of Cole
et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also
show -time algorithms to compute all results for a fixed
and all or a fixed and all . Finally we show
that the -mappability problem cannot be solved in strongly subquadratic
time for unless the Strong Exponential Time Hypothesis
fails.
Comment: Accepted to SPIRE 201
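A naive way to compute such a mappability table is sketched below. The parameter names `m` (substring length) and `k` (mismatch bound) and the function name `mappability` are assumptions, since the symbols are missing in the listing. The sketch also assumes that a position is not counted against itself. It compares every pair of windows directly and is not one of the efficient algorithms described in the abstract.

```python
def mappability(text: str, m: int, k: int) -> list[int]:
    """table[i] = number of starts j != i such that the length-m substrings
    starting at i and j differ in at most k positions."""
    starts = range(len(text) - m + 1)
    table = []
    for i in starts:
        count = 0
        for j in starts:
            if j == i:
                continue  # assumption: a window is not matched with itself
            mismatches = sum(a != b for a, b in zip(text[i:i + m], text[j:j + m]))
            if mismatches <= k:
                count += 1
        table.append(count)
    return table
```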
Longest property-preserved common factor
In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider two fundamental string properties: square-free factors and periodic factors under two different settings, one per property. In the first setting, we are given a string x and we are asked to construct a data structure over x answering the following type of on-line queries: given string y, find a longest square-free factor common to x and y. In the second setting, we are given k strings and an integer 1 < k′ ≤ k and we are asked to find a longest periodic factor common to at least k′ strings. We present linear-time solutions for both settings. We anticipate that our paradigm can be extended to other string properties.
Linear-Time Algorithm for Long LCF with k Mismatches
In the Longest Common Factor with k Mismatches (LCF_k) problem, we are given two strings X and Y of total length n, and we are asked to find a pair of maximal-length factors, one of X and the other of Y, such that their Hamming distance is at most k. Thankachan et al. [Thankachan et al. 2016] show that this problem can be solved in O(n log^k n) time and O(n) space for constant k. We consider the LCF_k(l) problem in which we assume that the sought factors have length at least l. We use difference covers to reduce the LCF_k(l) problem with l=Omega(log^{2k+2}n) to a task involving m=O(n/log^{k+1}n) synchronized factors. The latter can be solved in O(m log^{k+1}m) time, which results in a linear-time algorithm for LCF_k(l) with l=Omega(log^{2k+2}n). In general, our solution to the LCF_k(l) problem for arbitrary l takes O(n + n log^{k+1} n/sqrt{l}) time.
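For orientation, the LCF_k problem itself admits a simple quadratic-time baseline: fix an alignment (diagonal) of X against Y and slide a window that contains at most k mismatches. The sketch below is that baseline only, not the difference-cover technique of the paper; the function name `lcf_k_mismatches` and the returned tuple layout are hypothetical, and k is assumed to be non-negative.

```python
def lcf_k_mismatches(x: str, y: str, k: int) -> tuple[int, int, int]:
    """Return (length, start_in_x, start_in_y) of a longest pair of
    equal-length factors of x and y at Hamming distance at most k."""
    best = (0, 0, 0)
    for shift in range(-(len(y) - 1), len(x)):
        # On this diagonal, x[i] is aligned with y[i - shift].
        lo = max(0, shift)
        hi = min(len(x), len(y) + shift)
        left = lo
        mismatches = 0
        for right in range(lo, hi):
            if x[right] != y[right - shift]:
                mismatches += 1
            # Shrink the window from the left until it is within budget.
            while mismatches > k:
                if x[left] != y[left - shift]:
                    mismatches -= 1
                left += 1
            length = right - left + 1
            if length > best[0]:
                best = (length, left, left - shift)
    return best
```

Each diagonal is scanned once with two pointers, so the total time is proportional to the product of the two string lengths.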
Longest common substring made fully dynamic
Given two strings S and T, each of length at most n, the longest common substring (LCS) problem is to find a longest substring common to S and T. This is a classical problem in computer science with an O(n)-time solution. In the fully dynamic setting, edit operations are allowed in either of the two strings, and the problem is to find an LCS after each edit. We present the first solution to this problem requiring sublinear time in n per edit operation. In particular, we show how to find an LCS after each edit operation in Õ(n^{2/3}) time, after Õ(n)-time and space preprocessing. This line of research has been recently initiated in a somewhat restricted dynamic variant by Amir et al. [SPIRE 2017]. More specifically, they presented an Õ(n)-sized data structure that returns an LCS of the two strings after a single edit operation (that is reverted afterwards) in Õ(1) time. At CPM 2018, three papers (Abedin et al., Funakoshi et al., and Urabe et al.) studied analogously restricted dynamic variants of problems on strings. We show that the techniques we develop can be applied to obtain fully dynamic algorithms for all of these variants. The only previously known sublinear-time dynamic algorithms for problems on strings were for maintaining a dynamic collection of strings for comparison queries and for pattern matching, with the most recent advances made by Gawrychowski et al. [SODA 2018] and by Clifford et al. [STACS 2018]. As an intermediate problem we consider computing the solution for a string with a given set of k edits, which leads us, in particular, to answering internal queries on a string. The input to such a query is specified by a substring (or substrings) of a given string. Data structures for answering internal string queries that were proposed by Kociumaka et al. [SODA 2015] and by Gagie et al. [CCCG 2013] are used, along with new ones, based on ingredients such as the suffix tree, heavy-path decomposition, orthogonal range queries, difference covers, and string periodicity.
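As a point of comparison for the dynamic setting, the static LCS problem can be solved by a textbook dynamic program, and the trivial dynamic baseline is to rerun it after every edit. The sketch below shows that dynamic program; it runs in time proportional to |S|·|T|, so it is neither the O(n)-time suffix-tree solution nor the sublinear-per-edit method of the paper. The function name is hypothetical.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Return one longest substring common to s and t (empty if none)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Length of the common suffix of s[:i] and t[:j].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return s[best_end - best_len:best_end]
```

For example, substituting one letter of `t` and calling the function again gives the LCS after that edit, at the cost of a full recomputation.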
Longest Property-Preserved Common Factor
In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider three fundamental string properties: square-free factors, periodic factors, and palindromic factors under three different settings, one per property. In the first setting, we are given a string x and we are asked to construct a data structure over x answering the following type of on-line queries: given string y, find a longest square-free factor common to x and y. In the second setting, we are given k strings and an integer 1 < k′ ≤ k and we are asked to find a longest periodic factor common to at least k′ strings. In the third setting, we are given two strings and we are asked to find a longest palindromic factor common to the two strings. We present linear-time solutions for all settings. We anticipate that our paradigm can be extended to other string properties or settings.
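The third setting (a longest palindromic factor common to two strings) can be made concrete with a naive sketch. It expands around every center of x to find the maximal palindrome there, then shrinks it symmetrically until the factor also occurs in y. This is a brute-force illustration, not the linear-time solution of the paper, and the function name is hypothetical.

```python
def longest_common_palindromic_factor(x: str, y: str) -> str:
    """Return one longest palindrome that is a factor of both x and y."""
    best = ""
    n = len(x)
    for center in range(2 * n - 1):
        # Even values of center sit on a letter, odd values between two letters.
        left = center // 2
        right = left + center % 2
        while left >= 0 and right < n and x[left] == x[right]:
            left -= 1
            right += 1
        left += 1
        right -= 1  # x[left:right + 1] is now the maximal palindrome here
        # Every symmetric shrink is still a palindrome; stop at the first
        # one that occurs in y, or when it cannot beat the current best.
        while right - left + 1 > len(best):
            if x[left:right + 1] in y:
                best = x[left:right + 1]
                break
            left += 1
            right -= 1
    return best
```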
Faster algorithms for longest common substring
In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an O(n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an O(n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in O(n log σ/log n) space and read in O(n log σ/log n) time. We show that, in this model, we can compute an LCS in time O(n log σ / √(log n)), which is sublinear in n if σ = 2^{o(√(log n))} (in particular, if σ = O(1)), using optimal space O(n log σ/log n).
We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in O(n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in O(n log^k n) time for k = O(1) [J. Comput. Biol. 2016]. We show an O(n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using O(n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors.