
On the maximal sum of exponents of runs in a string

Author: D. Gusfield
F. Franek
J. Berstel
J. Simpson
M. Crochemore
M. Giraud
M. Lothaire
R.M. Kolpakov
S.J. Puglisi
W. Rytter
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 25/03/2010

A run is an inclusion-maximal occurrence in a string (as a subinterval) of a repetition $v$ with a period $p$ such that $2p \le |v|$. The exponent of a run is defined as $|v|/p$ and is $\ge 2$. We show new bounds on the maximal sum of exponents of runs in a string of length $n$. Our upper bound of $4.1n$ is better than the best previously known proven bound of $5.6n$ by Crochemore & Ilie (2008). The lower bound of $2.035n$, obtained using a family of binary words, contradicts the conjecture of Kolpakov & Kucherov (1999) that the maximal sum of exponents of runs in a string of length $n$ is smaller than $2n$.

Comment: 7 pages, 1 figure
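The definitions of run and exponent above can be made concrete with a small brute-force sketch. It is only an illustration, not the method of the paper: the function names `runs` and `sum_of_exponents` are hypothetical, and the quadratic scan is chosen for clarity rather than efficiency.

```python
from fractions import Fraction


def runs(w):
    """Return the runs of w as sorted (start, end_exclusive, period) triples.

    Brute force: for each candidate period p, scan maximal stretches where
    w[k] == w[k + p], and keep those of length at least 2p.
    """
    n = len(w)
    smallest_period = {}  # (start, end) -> smallest period seen for that interval
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            j = i
            while j + p < n and w[j] == w[j + p]:
                j += 1
            start, end = i, j + p  # w[start:end] has period p and cannot be extended
            if end - start >= 2 * p:
                key = (start, end)
                if key not in smallest_period or p < smallest_period[key]:
                    smallest_period[key] = p
            i = j + 1
    return sorted((s, e, p) for (s, e), p in smallest_period.items())


def sum_of_exponents(w):
    """Sum of |v|/p over all runs of w, as an exact fraction."""
    return sum((Fraction(e - s, p) for s, e, p in runs(w)), Fraction(0))


print(runs("aabaab"))              # three runs: "aa", "aabaab", "aa"
print(sum_of_exponents("aabaab"))  # each has exponent 2
```

Using `Fraction` keeps exponents exact, which matters when comparing a sum against a bound such as $2n$.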

Elsevier - Publisher Connector

Searching of gapped repeats and subrepetitions in a word

Author: D. Gusfield
G. Brodal
J. Storer
M. Crochemore
P. Emde Boas van
R. Kolpakov
T. Kociumaka
Z. Galil
Publication date: 29/09/2013

A gapped repeat is a factor of the form $uvu$ where $u$ and $v$ are nonempty words. The period of the gapped repeat is defined as $|u|+|v|$. The gapped repeat is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its period. The gapped repeat is called $\alpha$-gapped if its period is not greater than $\alpha |v|$. A $\delta$-subrepetition is a factor whose exponent is less than 2 but not less than $1+\delta$ (the exponent of a factor is the quotient of its length and its minimal period). The $\delta$-subrepetition is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its minimal period. We reveal a close relation between maximal gapped repeats and maximal subrepetitions. Moreover, we show that in a word of length $n$ the number of maximal $\alpha$-gapped repeats is bounded by $O(\alpha^2 n)$ and the number of maximal $\delta$-subrepetitions is bounded by $O(n/\delta^2)$. Using the obtained upper bounds, we propose algorithms for finding all maximal $\alpha$-gapped repeats and all maximal $\delta$-subrepetitions in a word of length $n$. The algorithm for finding all maximal $\alpha$-gapped repeats has $O(\alpha^2 n)$ time complexity for the case of constant alphabet size and $O(n\log n + \alpha^2 n)$ time complexity for the general case. For finding all maximal $\delta$-subrepetitions we propose two algorithms. The first algorithm has $O(\frac{n\log\log n}{\delta^2})$ time complexity for the case of constant alphabet size and $O(n\log n + \frac{n\log\log n}{\delta^2})$ time complexity for the general case. The second algorithm has $O(n\log n + \frac{n}{\delta^2}\log\frac{1}{\delta})$ expected time complexity.
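As a rough illustration of the exponent and $\delta$-subrepetition definitions, the sketch below computes the minimal period of a nonempty word with the standard border (prefix-function) table and tests the condition $1+\delta \le \text{exponent} < 2$. The function names are hypothetical, and this checks a single factor only; it is not one of the algorithms proposed in the abstract.

```python
from fractions import Fraction


def minimal_period(w):
    """Smallest p >= 1 such that w[i] == w[i + p] for all valid i (w nonempty)."""
    n = len(w)
    border = [0] * n  # border[i] = length of the longest proper border of w[:i + 1]
    k = 0
    for i in range(1, n):
        while k > 0 and w[i] != w[k]:
            k = border[k - 1]
        if w[i] == w[k]:
            k += 1
        border[i] = k
    return n - border[-1]


def exponent(w):
    """Length divided by minimal period, as an exact fraction."""
    return Fraction(len(w), minimal_period(w))


def is_delta_subrepetition(w, delta):
    """True if 1 + delta <= exponent(w) < 2."""
    return 1 + delta <= exponent(w) < 2


# "abcab" reads as uvu with u = "ab", v = "c"; its period |u| + |v| is 3.
print(minimal_period("abcab"), exponent("abcab"))
print(is_delta_subrepetition("abcab", Fraction(1, 2)))
```

The example word shows the link between the two notions in the simplest case: a gapped repeat $uvu$ whose period is also its minimal period is a factor of exponent strictly between 1 and 2.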

Understanding maximal repetitions in strings

Author: Crochemore Maxime
Ilie Lucian
Publication date: 01/01/2008

The cornerstone of any algorithm computing all repetitions in a string of length n in O(n) time is the fact that the number of runs (or maximal repetitions) is O(n). We give a simple proof of this result. As a consequence of our approach, the stronger result concerning the linearity of the sum of exponents of all runs follows easily.

Dagstuhl Research Online Publication Server

HAL Descartes

On the maximal number of cubic subwords in a string

Author: A. Apostolico
A. Thue
A.S. Fraenkel
C.S. Iliopoulos
D. Damanik
L. Ilie
M. Crochemore
M. Giraud
M. Lothaire
M.G. Main
N.J. Fine
P. Baturo
R.M. Kolpakov
S.J. Puglisi
W. Rytter
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

We investigate the problem of the maximum number of cubic subwords (of the form $www$) in a given word. We also consider square subwords (of the form $ww$). The problem of the maximum number of squares in a word is not well understood. Several new results related to this problem are produced in the paper. We consider two simple problems related to the maximum number of subwords which are squares or which are highly repetitive; then we provide a nontrivial estimation for the number of cubes. We show that the maximum number of squares $xx$ such that $x$ is not a primitive word (nonprimitive squares) in a word of length $n$ is exactly $\lfloor \frac{n}{2}\rfloor - 1$, and the maximum number of subwords of the form $x^k$, for $k\ge 3$, is exactly $n-2$. In particular, the maximum number of cubes in a word is not greater than $n-2$ either. Using very technical properties of occurrences of cubes, we improve this bound significantly. We show that the maximum number of cubes in a word of length $n$ is between $(1/2)n$ and $(4/5)n$. (In particular, we improve the lower bound from the conference version of the paper.)

Comment: 14 pages
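The two exact values quoted above can be sanity-checked on small inputs with a brute-force counter. The sketch below assumes that distinct subwords are counted (not occurrences); the function names are hypothetical, and the enumeration is far from the techniques used in the paper.

```python
def distinct_powers(w, k_min):
    """Distinct factors of w of the form x^k with k >= k_min and x nonempty."""
    n = len(w)
    found = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            f = w[i:j]
            m = len(f)
            # p <= m // k_min guarantees that the repetition count m // p is >= k_min
            for p in range(1, m // k_min + 1):
                if m % p == 0 and f == f[:p] * (m // p):
                    found.add(f)
                    break
    return found


def is_primitive(x):
    """A nonempty x is primitive iff it occurs in xx only as a prefix and a suffix."""
    return (x + x).find(x, 1) == len(x)


def nonprimitive_squares(w):
    """Distinct squares xx occurring in w such that x is not primitive."""
    n = len(w)
    found = set()
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            x = w[i:i + half]
            if w[i + half:i + 2 * half] == x and not is_primitive(x):
                found.add(x + x)
    return found


w = "a" * 10
print(len(nonprimitive_squares(w)), len(w) // 2 - 1)
print(len(distinct_powers(w, 3)), len(w) - 2)
```

On the unary word $a^n$ both counts meet the stated values, since its nonprimitive squares are $a^{2k}$ for $2 \le k \le \lfloor n/2 \rfloor$ and its powers of exponent at least 3 are $a^3, \dots, a^n$.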

Fewest repetitions in infinite binary words

Author: Badkobeh Golnaz
Crochemore Maxime
Publication date: 26/08/2011

A square is the concatenation of a nonempty word with itself. A word has period p if its letters at distance p match. The exponent of a nonempty word is the quotient of its length over its smallest period. In this article we give a proof of the fact that there exists an infinite binary word which contains finitely many squares and simultaneously avoids words of exponent larger than 7/3. Our infinite word contains 12 squares, which is the smallest possible number of squares to get the property, and 2 factors of exponent 7/3. These are the only factors of exponent larger than 2. The value 7/3 introduces what we call the finite-repetition threshold of the binary alphabet. We conjecture it is 7/4 for the ternary alphabet, like its repetitive threshold.

EDP Sciences OAI-PMH repository (1.2.0)

Numérisation de Documents Anciens Mathématiques

Efficient Seeds Computation Revisited

Author: A. Apostolico
C.S. Iliopoulos
D. Breslauer
G.S. Brodal
J. Fischer
K. Sadakane
M. Crochemore
O. Berkman
Y. Li
Publication date: 01/01/2011

The notion of the cover is a generalization of a period of a string, and there are linear time algorithms for finding the shortest cover. The seed is a more complicated generalization of periodicity: it is a cover of a superstring of a given string, and the shortest seed problem is of much higher algorithmic difficulty. The problem is not well understood, and no linear time algorithm is known. In the paper we give linear time algorithms for some of its versions: computing the shortest left-seed array, the longest left-seed array, and checking for seeds of a given length. The algorithm for the last problem is used to compute the seed array of a string (i.e., the shortest seeds for all the prefixes of the string) in $O(n^2)$ time. We also describe a simpler alternative algorithm computing efficiently the shortest seeds. As a by-product we obtain an $O(n\log{(n/m)})$ time algorithm checking if the shortest seed has length at least $m$ and finding the corresponding seed. We also correct some important details missing in the previously known shortest-seed algorithm (Iliopoulos et al., 1996).

Comment: 14 pages, accepted to CPM 201
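To make the notion of a cover concrete, here is a minimal sketch that tests whether a word covers a string and finds the shortest cover by trying prefixes in increasing length. It is a quadratic illustration under the usual definition (every position of the string lies inside some occurrence of the cover), not one of the linear time algorithms mentioned above, and it does not handle seeds. The function names are hypothetical.

```python
def occurrences(c, s):
    """Start positions of all occurrences of c in s, in increasing order."""
    return [i for i in range(len(s) - len(c) + 1) if s.startswith(c, i)]


def is_cover(c, s):
    """True if every position of s lies inside some occurrence of c."""
    if not c or len(c) > len(s):
        return False
    reach = 0  # positions [0, reach) are covered so far
    for i in occurrences(c, s):
        if i > reach:
            return False  # gap between consecutive occurrences
        reach = i + len(c)
    return reach == len(s)


def shortest_cover(s):
    """Shortest prefix of s that covers s (s itself always qualifies); None if s is empty."""
    for m in range(1, len(s) + 1):
        if is_cover(s[:m], s):
            return s[:m]
    return None


print(shortest_cover("ababa"))  # occurrences of the cover may overlap
```

Only prefixes are tried because a cover must occur at position 0. In the example the minimal period of the string is 2 while its shortest cover has length 3, which shows that the two notions differ.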

On-line construction of position heaps

Author: A. Blumer
A. Ehrenfeucht
D. Gusfield
E. Coffman
E. Fredkin
E. Ukkonen
J.I. Munro
M. Crochemore
T. Cormen
Publication date: 01/01/2011

We propose a simple linear-time on-line algorithm for constructing a position heap for a string [Ehrenfeucht et al., 2011]. Our definition of position heap differs slightly from the one proposed in [Ehrenfeucht et al., 2011] in that it considers the suffixes ordered from left to right. Our construction is based on classic suffix pointers and resembles Ukkonen's algorithm for suffix trees [Ukkonen, 1995]. Using suffix pointers, the position heap can be extended into the augmented position heap that allows for a linear-time string matching algorithm [Ehrenfeucht et al., 2011].

Comment: to appear in Journal of Discrete Algorithms

Fast Label Extraction in the CDAWG

Author: A Blumer
D Belazzougui
D Gusfield
J Sirén
L Gasieniec
LS Russo
M Crochemore
M Raffinot
MA Bender
O Berkman
T Gagie
V Mäkinen
Publication date: 26/09/2017

The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m\log{\log{n}})$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from $O(m\log{\log{n}}+\mathtt{occ})$ to $O(m+\mathtt{occ})$ in the time needed to locate all the $\mathtt{occ}$ occurrences of the pattern. We also reduce from $O(k\log{\log{n}})$ to $O(k)$ the time needed to read the $k$ characters of the label of an edge of the suffix tree of $T$, and we reduce from $O(m\log{\log{n}})$ to $O(m)$ the time needed to compute the matching statistics between a query of length $m$ and $T$, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.

Comment: 16 pages, 1 figure. In proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv admin note: text overlap with arXiv:1705.0864

Minimal Forbidden Factors of Circular Words

Author: AJ Pinho
C Barton
D Belazzougui
F Mignosi
G Fici
M Béal
M Crochemore
S Chairungsee
Publication date: 01/01/2017

Minimal forbidden factors are a useful tool for investigating properties of words and languages. Two factorial languages are distinct if and only if they have different (antifactorial) sets of minimal forbidden factors. There exist algorithms for computing the minimal forbidden factors of a word, as well as of a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an algorithm that, given the trie recognizing a finite antifactorial language $M$, computes a DFA recognizing the language whose set of minimal forbidden factors is $M$. In the same paper, they showed that the obtained DFA is minimal if the input trie recognizes the minimal forbidden factors of a single word. We generalize this result to the case of a circular word. We discuss several combinatorial properties of the minimal forbidden factors of a circular word. As a byproduct, we obtain a formal definition of the factor automaton of a circular word. Finally, we investigate the case of minimal forbidden factors of the circular Fibonacci words.

Comment: To appear in Theoretical Computer Science
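For readers unfamiliar with the notion, the sketch below enumerates the minimal forbidden factors of an ordinary (linear) finite word by brute force, under the usual definition: a word is a minimal forbidden factor if it is not a factor of the given word but all of its proper factors are. The circular case studied in the abstract is not covered, the alphabet is passed explicitly as an assumption, and the function names are hypothetical.

```python
def factors(w):
    """Set of all factors (contiguous substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def minimal_forbidden_factors(w, alphabet):
    """Minimal forbidden factors of the linear word w over the given alphabet."""
    fac = factors(w)
    result = set()
    # Letters of the alphabet that never occur are forbidden words of length 1.
    for a in alphabet:
        if a not in fac:
            result.add(a)
    # Longer candidates: v = x + b where the prefix x and the suffix v[1:] are
    # factors of w but v itself is not. Every proper factor of v lies inside
    # x or v[1:], so it is a factor of w as well.
    for x in fac:
        if not x:
            continue
        for b in alphabet:
            v = x + b
            if v not in fac and v[1:] in fac:
                result.add(v)
    return result


print(sorted(minimal_forbidden_factors("ab", "ab")))
```

For the word "ab" over the alphabet {a, b} this yields the three words aa, ba and bb, each of which is absent from "ab" while its letters are present.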