Computing the antiperiod(s) of a string
A string S[1, n] is a power (or repetition or tandem repeat) of order k and period n/k if it can be decomposed into k consecutive identical blocks of length n/k. Powers and periods are fundamental structures in the study of strings, and algorithms to compute them efficiently have been widely studied. Recently, Fici et al. (Proc. ICALP 2016) introduced an antipower of order k to be a string composed of k distinct blocks of the same length, n/k, called the antiperiod. An arbitrary string will have antiperiod t if it is a prefix of an antipower with antiperiod t. In this paper, we describe an efficient algorithm for computing the smallest antiperiod of a string S of length n in O(n) time. We also describe an algorithm to compute all the antiperiods of S that runs in O(n log n) time. © Hayam Alamro, Golnaz Badkobeh, Djamal Belazzougui, Costas S. Iliopoulos, and Simon J. Puglisi. Peer reviewed
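As a minimal illustration of the antiperiod definition (not the O(n) or O(n log n) algorithms of the paper), the brute-force sketch below tests each candidate t by cutting the string into length-t blocks. It assumes that only the complete blocks must be pairwise distinct and that a trailing partial block can always be extended to a fresh block; the function names are hypothetical.

```python
def has_antiperiod(s: str, t: int) -> bool:
    """Return True if the complete length-t blocks of s are pairwise distinct.

    Assumption: a trailing partial block is ignored, since it can be
    extended (e.g. with a new letter) to differ from every other block.
    """
    blocks = [s[i:i + t] for i in range(0, len(s) - t + 1, t)]
    return len(set(blocks)) == len(blocks)


def all_antiperiods(s: str) -> list[int]:
    """Brute force: every t in 1..len(s) that is an antiperiod of s."""
    return [t for t in range(1, len(s) + 1) if has_antiperiod(s, t)]


def smallest_antiperiod(s: str) -> int:
    """Smallest antiperiod of s (0 for the empty string)."""
    candidates = all_antiperiods(s)
    return candidates[0] if candidates else 0


print(smallest_antiperiod("abab"))  # 3: blocks of length 1 or 2 repeat
print(all_antiperiods("abab"))      # [3, 4]
```

This check takes quadratic time overall in the worst case, so it serves only to make the definition concrete.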
A characterization of the squares in a Fibonacci string
A (finite) Fibonacci string F_n is defined as follows: F_0 = b, F_1 = a; for every integer n ≥ 2, F_n = F_{n-1}F_{n-2}. For n ≥ 1, the length of F_n is denoted by |F_n|. The infinite Fibonacci string F is the string which contains every F_n, n ≥ 1, as a prefix. Apart from their general theoretical importance, Fibonacci strings are often cited as worst-case examples for algorithms which compute all the repetitions or all the "Abelian squares" in a given string. In this paper we provide a characterization of all the squares in F, hence in every prefix F_n; this characterization naturally gives rise to a Θ(|F_n|) algorithm which specifies all the squares of F_n in an appropriate encoding. This encoding is made possible by the fact that the squares of F_n occur consecutively, in "runs", the number of which is O(|F_n|). By contrast, the known general algorithms for the computation of the repetitions in an arbitrary string require Θ(|F_n| log |F_n|) time (and produce Θ(|F_n| log |F_n|) outputs) when applied to a Fibonacci string F_n.
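The recurrence above can be made concrete with a short sketch. It builds F_n from F_0 = b and F_1 = a and counts square occurrences by brute force; this is only an illustration of the objects involved, not the characterization or the encoding described in the abstract, and the function names are hypothetical.

```python
def fibonacci_string(n: int) -> str:
    """Return F_n, where F_0 = 'b', F_1 = 'a', F_n = F_{n-1} F_{n-2}."""
    prev, cur = "b", "a"  # F_0, F_1
    if n == 0:
        return prev
    for _ in range(2, n + 1):
        prev, cur = cur, cur + prev
    return cur


def count_square_occurrences(s: str) -> int:
    """Count pairs (i, p) with s[i:i+p] == s[i+p:i+2p], by brute force."""
    n = len(s)
    return sum(
        1
        for p in range(1, n // 2 + 1)
        for i in range(n - 2 * p + 1)
        if s[i:i + p] == s[i + p:i + 2 * p]
    )


print(fibonacci_string(5))                            # abaababa
print(count_square_occurrences(fibonacci_string(5)))  # 4
```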
The covers of a circular Fibonacci string
Fibonacci strings turn out to constitute worst cases for a number of computer algorithms which find generic patterns in strings. Examples of such patterns are repetitions, Abelian squares, and "covers". In particular, we characterize in this paper the covers of a circular Fibonacci string C(F_k) and show that they are Θ(|F_k|^2) in number. We show also that, by making use of an appropriate encoding, these covers can be reported in Θ(|F_k|) time. By contrast, the fastest known algorithm for computing the covers of an arbitrary circular string of length n requires time O(n log n).
A linear algorithm for computing all the squares of a Fibonacci string
A (finite) Fibonacci string F_n is defined as follows: F_0 = b, F_1 = a; for every integer n ≥ 2, F_n = F_{n-1}F_{n-2}. For n ≥ 1, the length of F_n is denoted by f_n = |F_n|, while it is convenient to define […]. The infinite Fibonacci string F is the string which contains every F_n, n ≥ 1, as a prefix. Apart from their general theoretical importance, Fibonacci strings are often cited as worst-case examples for algorithms which compute all the repetitions or all the "Abelian squares" in a given string. In this paper we provide a characterization of all the squares in F, hence in every prefix F_n; this characterization naturally gives rise to a Θ(f_n) algorithm which specifies all the squares of F_n in an appropriate encoding. This encoding is made possible by the fact that the squares of F_n occur consecutively, in "runs", the number of which is O(f_n). By contrast, the known general algorithms for the computation of the repetitions in an arbitrary string require Θ(f_n log f_n) time (and produce Θ(f_n log f_n) outputs) when applied to a Fibonacci string F_n.
On Quasiperiodic Morphisms
Weakly and strongly quasiperiodic morphisms are tools introduced to study
quasiperiodic words. Formally, they map, respectively, at least one or every
non-quasiperiodic word to a quasiperiodic word. Considering them both on finite
and infinite words, we get four families of morphisms, and we study the
relations between them. We provide algorithms to decide whether a morphism is strongly
quasiperiodic on finite words or on infinite words. Comment: 12 pages
Efficient Seeds Computation Revisited
The notion of the cover is a generalization of a period of a string, and
there are linear-time algorithms for finding the shortest cover. The seed is a
more complicated generalization of periodicity: it is a cover of a superstring
of a given string, and the shortest seed problem is of much higher algorithmic
difficulty. The problem is not well understood, and no linear-time algorithm is
known. In the paper we give linear-time algorithms for some of its versions:
computing the shortest left-seed array, the longest left-seed array, and checking for
seeds of a given length. The algorithm for the last problem is used to compute
the seed array of a string (i.e., the shortest seeds for all the prefixes of
the string) in […] time. We also describe a simpler alternative algorithm
for efficiently computing the shortest seeds. As a by-product we obtain an
[…]-time algorithm checking if the shortest seed has length at
least […] and finding the corresponding seed. We also correct some important
details missing in the previously known shortest-seed algorithm (Iliopoulos et
al., 1996). Comment: 14 pages, accepted to CPM 201
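To make the notion of a cover concrete, here is a brute-force sketch (not one of the linear-time algorithms mentioned in the abstract, and it does not handle seeds). It assumes the usual definition: w covers s when every position of s lies inside some occurrence of w in s. The function names are hypothetical.

```python
def is_cover(w: str, s: str) -> bool:
    """Return True if occurrences of w in s cover every position of s."""
    if not w or len(w) > len(s):
        return False
    covered_up_to = 0  # invariant: s[:covered_up_to] is already covered
    for i in range(len(s) - len(w) + 1):
        if i > covered_up_to:
            return False  # position covered_up_to can no longer be covered
        if s.startswith(w, i):
            covered_up_to = i + len(w)
    return covered_up_to == len(s)


def shortest_cover(s: str) -> str:
    """Try prefixes of s in increasing length; s always covers itself."""
    for length in range(1, len(s) + 1):
        if is_cover(s[:length], s):
            return s[:length]
    return s


print(shortest_cover("ababa"))  # aba
print(shortest_cover("abc"))    # abc
```

Only prefixes are tried because a cover must occur at position 0 of the string.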
Computing the minimum k-Cover of a string
We study the minimum k-cover problem. For a given string x of length n and an integer k, the minimum k-cover is the minimum set of k-substrings that covers x. We show that the on-line algorithm that has been proposed by Iliopoulos and Smyth [IS92] is not correct. We prove that the problem is in fact NP-hard. Furthermore, we propose two greedy algorithms that are implemented and tested on different kinds of data.
IUPACpal: efficient identification of inverted repeats in IUPAC-encoded DNA sequences
Background: An inverted repeat is a DNA sequence followed downstream by its reverse complement, potentially with a gap in the centre. Inverted repeats are found in both prokaryotic and eukaryotic genomes and they have been linked with countless possible functions. Many international consortia provide a comprehensive description of common genetic variation making alternative sequence representations, such as IUPAC encoding, necessary for leveraging the full potential of such broad variation datasets. Results: We present IUPACpal, an exact tool for efficient identification of inverted repeats in IUPAC-encoded DNA sequences allowing also for potential mismatches and gaps in the inverted repeats. Conclusion: Within the parameters that were tested, our experimental results show that IUPACpal compares favourably to a similar application packaged with EMBOSS. We show that IUPACpal identifies many previously unidentified inverted repeats when compared with EMBOSS, and that this is also performed with orders of magnitude improved speed.
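The definition of an inverted repeat given in the Background can be sketched as follows. This is not IUPACpal's method: the sketch assumes plain A/C/G/T input (no IUPAC ambiguity codes), exact matching with no mismatches, and a fixed arm length and gap. The function names and parameters are hypothetical.

```python
# Complement table for the four unambiguous DNA letters only (assumption).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq: str, arm: int, gap: int) -> list[int]:
    """Start positions i where seq[i:i+arm] is followed, after `gap`
    bases, by its exact reverse complement."""
    hits = []
    for i in range(len(seq) - (2 * arm + gap) + 1):
        left = seq[i:i + arm]
        right = seq[i + arm + gap:i + 2 * arm + gap]
        if right == reverse_complement(left):
            hits.append(i)
    return hits


print(find_inverted_repeats("GAATTC", arm=3, gap=0))  # [0]
```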
On the maximal number of cubic subwords in a string
We investigate the problem of the maximum number of cubic subwords (of the
form www) in a given word. We also consider square subwords (of the form
ww). The problem of the maximum number of squares in a word is not well
understood. Several new results related to this problem are produced in the
paper. We consider two simple problems related to the maximum number of
subwords which are squares or which are highly repetitive; then we provide a
nontrivial estimation for the number of cubes. We show that the maximum number
of squares xx such that x is not a primitive word (nonprimitive squares) in
a word of length n is exactly […], and the
maximum number of subwords of the form x^k, for k ≥ 3, is exactly […].
In particular, the maximum number of cubes in a word is not greater than
[…] either. Using very technical properties of occurrences of cubes, we improve
this bound significantly. We show that the maximum number of cubes in a word of
length n is between […] and […]. (In particular, we improve the
lower bound from the conference version of the paper.) Comment: 14 pages
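For small words, the quantity studied here can be checked by brute force. The sketch below collects the distinct cubic subwords www of a word; it assumes that distinct subwords (not occurrences) are counted, and it is only an illustration, not a technique from the paper. The function name is hypothetical.

```python
def distinct_cubes(word: str) -> set[str]:
    """Return the set of distinct subwords of the form www (w non-empty)."""
    n = len(word)
    cubes = set()
    for p in range(1, n // 3 + 1):       # p = length of the root w
        for i in range(n - 3 * p + 1):   # i = start of the candidate cube
            root = word[i:i + p]
            if word[i + p:i + 2 * p] == root and word[i + 2 * p:i + 3 * p] == root:
                cubes.add(word[i:i + 3 * p])
    return cubes


print(sorted(distinct_cubes("aaaaaa")))  # ['aaa', 'aaaaaa']
print(len(distinct_cubes("abcabcabc")))  # 1
```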
Efficient computation of sequence mappability
Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices j ≠ i such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of k = 1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for k = O(1), works in O(n) space and, with high probability, in O(n · min{m^k, log^k n}) time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop O(n^2)-time algorithms to compute all (k, m)-mappability tables for a fixed m and all k ∈ {0, …, m} or a fixed k and all m ∈ {k, …, n}. Finally, we show that, for k, m = Θ(log n), the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.
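The problem statement above translates directly into a naive reference implementation. The sketch below is a quadratic-time baseline, useful only for checking small inputs; it is not any of the algorithms from the paper, it uses 0-based positions, and the function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def mappability_table(text: str, k: int, m: int) -> list[int]:
    """Entry i counts the positions j != i whose length-m substring
    differs from the one starting at i in at most k positions."""
    starts = range(len(text) - m + 1)
    table = []
    for i in starts:
        count = sum(
            1
            for j in starts
            if j != i and hamming(text[i:i + m], text[j:j + m]) <= k
        )
        table.append(count)
    return table


print(mappability_table("abab", k=0, m=2))   # [1, 0, 1]
print(mappability_table("abcab", k=1, m=2))  # [1, 0, 0, 1]
```

Each entry compares one window against all others, so the total cost is O(n^2 · m).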