12 research outputs found
Linear-time Computation of Minimal Absent Words Using Suffix Array
An absent word of a word y of length n is a word that does not occur in y. It
is a minimal absent word if all its proper factors occur in y. Minimal absent
words have been computed in genomes of organisms from all domains of life;
their computation provides a fast alternative for measuring approximation in
sequence comparison. There exists an O(n)-time and O(n)-space algorithm for
computing all minimal absent words on a fixed-sized alphabet based on the
construction of suffix automata (Crochemore et al., 1998). No implementation of
this algorithm is publicly available. There also exists an O(n^2)-time and
O(n)-space algorithm for the same problem based on the construction of suffix
arrays (Pinho et al., 2009). An implementation of this algorithm was also
provided by the authors and is currently the fastest available. In this
article, we bridge this unpleasant gap by presenting an O(n)-time and
O(n)-space algorithm for computing all minimal absent words based on the
construction of suffix arrays. Experimental results using real and synthetic
data show that the respective implementation outperforms the one by Pinho et
al
Space-efficient detection of unusual words
Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
bits, where is the length of the string and
is the size of the alphabet. The size of the stack is except for very
large values of . We further improve the algorithm by removing its time
dependency on , by reporting only a subset of the maximal repeats and
of the minimal rare words of the string, and by detecting and scoring candidate
under-represented strings that in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.Comment: arXiv admin note: text overlap with arXiv:1502.0637
On the Structure of Bispecial Sturmian Words
A balanced word is one in which any two factors of the same length contain
the same number of each letter of the alphabet up to one. Finite binary
balanced words are called Sturmian words. A Sturmian word is bispecial if it
can be extended to the left and to the right with both letters remaining a
Sturmian word. There is a deep relation between bispecial Sturmian words and
Christoffel words, that are the digital approximations of Euclidean segments in
the plane. In 1997, J. Berstel and A. de Luca proved that \emph{palindromic}
bispecial Sturmian words are precisely the maximal internal factors of
\emph{primitive} Christoffel words. We extend this result by showing that
bispecial Sturmian words are precisely the maximal internal factors of
\emph{all} Christoffel words. Our characterization allows us to give an
enumerative formula for bispecial Sturmian words. We also investigate the
minimal forbidden words for the language of Sturmian words.Comment: arXiv admin note: substantial text overlap with arXiv:1204.167
A framework for space-efficient string kernels
String kernels are typically used to compare genome-scale sequences whose
length makes alignment impractical, yet their computation is based on data
structures that are either space-inefficient, or incur large slowdowns. We show
that a number of exact string kernels, like the -mer kernel, the substrings
kernels, a number of length-weighted kernels, the minimal absent words kernel,
and kernels with Markovian corrections, can all be computed in time and
in bits of space in addition to the input, using just a
data structure on the Burrows-Wheeler transform of the
input strings, which takes time per element in its output. The same
bounds hold for a number of measures of compositional complexity based on
multiple value of , like the -mer profile and the -th order empirical
entropy, and for calibrating the value of using the data
Minimal Forbidden Factors of Circular Words
Minimal forbidden factors are a useful tool for investigating properties of
words and languages. Two factorial languages are distinct if and only if they
have different (antifactorial) sets of minimal forbidden factors. There exist
algorithms for computing the minimal forbidden factors of a word, as well as of
a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an
algorithm that, given the trie recognizing a finite antifactorial language ,
computes a DFA recognizing the language whose set of minimal forbidden factors
is . In the same paper, they showed that the obtained DFA is minimal if the
input trie recognizes the minimal forbidden factors of a single word. We
generalize this result to the case of a circular word. We discuss several
combinatorial properties of the minimal forbidden factors of a circular word.
As a byproduct, we obtain a formal definition of the factor automaton of a
circular word. Finally, we investigate the case of minimal forbidden factors of
the circular Fibonacci words.Comment: To appear in Theoretical Computer Scienc
Searching Page-Images of Early Music Scanned with OMR: A Scalable Solution Using Minimal Absent Words
We define three retrieval tasks requiring efficient search of the musical content of a collection of ~32k page images of 16th-century music to find: duplicates; pages with the same musical content; pages of related music. The images are subjected to Optical Music Recognition (OMR), introducing inevitable errors. We encode pages as strings of diatonic pitch intervals, ignoring rests, to reduce the effect of such errors. We extract indices comprising lists of two kinds of âwordâ. Approximate matching is done by counting the number of common words between a query page and those in the collection. The two word-types are (a) normal ngrams and (b) minimal absent words (MAWs). The latter have three important properties for our purpose: they can be built and searched in linear time, the number of MAWs generated tends to be smaller, and they preserve the structure and order of the text, obviating the need for expensive sorting operations. We show that retrieval performance of MAWs is comparable with ngrams, but with a marked speed improvement. We also show the effect of word length on retrieval. Our results suggest that an index of MAWs of mixed length provides a good method for these tasks which is scalable to larger collections
Internal Shortest Absent Word Queries in Constant Time and Linear Space
International audienceGiven a string T of length n over an alphabet ÎŁ â {1, 2,. .. , n O(1) } of size Ï, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over ÎŁ that is absent in the fragment T [i] âą âą âą T [j] of T. We present an O(n)-space data structure that answers such queries in constant time and can be constructed in O(n log Ï n) time
Using minimal absent words to build phylogeny
International audienc