27 research outputs found
A Grammar Compression Algorithm based on Induced Suffix Sorting
We introduce GCIS, a grammar compression algorithm based on the induced
suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution
builds on the factorization performed by SAIS during suffix sorting. We
construct a context-free grammar on the input string which can be further
reduced into a shorter string by substituting each substring by its
correspondent factor. The resulting grammar is encoded by exploring some
redundancies, such as common prefixes between suffix rules, which are sorted
according to SAIS framework. When compared to well-known compression tools such
as Re-Pair and 7-zip, our algorithm is competitive and very effective at
handling repetitive string regarding compression ratio, compression and
decompression running time
Linear-time Computation of Minimal Absent Words Using Suffix Array
An absent word of a word y of length n is a word that does not occur in y. It
is a minimal absent word if all its proper factors occur in y. Minimal absent
words have been computed in genomes of organisms from all domains of life;
their computation provides a fast alternative for measuring approximation in
sequence comparison. There exists an O(n)-time and O(n)-space algorithm for
computing all minimal absent words on a fixed-sized alphabet based on the
construction of suffix automata (Crochemore et al., 1998). No implementation of
this algorithm is publicly available. There also exists an O(n^2)-time and
O(n)-space algorithm for the same problem based on the construction of suffix
arrays (Pinho et al., 2009). An implementation of this algorithm was also
provided by the authors and is currently the fastest available. In this
article, we bridge this unpleasant gap by presenting an O(n)-time and
O(n)-space algorithm for computing all minimal absent words based on the
construction of suffix arrays. Experimental results using real and synthetic
data show that the respective implementation outperforms the one by Pinho et
al
Faster algorithms for 1-mappability of a sequence
In the k-mappability problem, we are given a string x of length n and
integers m and k, and we are asked to count, for each length-m factor y of x,
the number of other factors of length m of x that are at Hamming distance at
most k from y. We focus here on the version of the problem where k = 1. The
fastest known algorithm for k = 1 requires time O(mn log n/ log log n) and
space O(n). We present two algorithms that require worst-case time O(mn) and
O(n log^2 n), respectively, and space O(n), thus greatly improving the state of
the art. Moreover, we present an algorithm that requires average-case time and
space O(n) for integer alphabets if m = {\Omega}(log n/ log {\sigma}), where
{\sigma} is the alphabet size
External Batched Range Minimum Queries and LCP Construction
This work deals with I/O-optimal suffix array (SA) and longest common prefix (LCP) array construction in external memory. For this purpose, the I/O-optimale DC3 algorithm is enhanced by LCP construction and adapted accordingly to the external memory model. In this context we present a method to construct the required range minimum queries (RMQs) efficiently in external memory. The core of this work is a description and implementation of the resulting external DC3-LCP algorithm using the Stxxl - the C++ Standard Template Library for Extra Large Data Sets. Experimental results based on realistic, real-world instances rounds off this work
Longest Common Prefixes with -Errors and Applications
Although real-world text datasets, such as DNA sequences, are far from being
uniformly random, average-case string searching algorithms perform
significantly better than worst-case ones in most applications of interest. In
this paper, we study the problem of computing the longest prefix of each suffix
of a given string of length over a constant-sized alphabet that occurs
elsewhere in the string with -errors. This problem has already been studied
under the Hamming distance model. Our first result is an improvement upon the
state-of-the-art average-case time complexity for non-constant and using
only linear space under the Hamming distance model. Notably, we show that our
technique can be extended to the edit distance model with the same time and
space complexities. Specifically, our algorithms run in time on average using space. We show that our
technique is applicable to several algorithmic problems in computational
biology and elsewhere
Inducing the Lyndon Array
In this paper we propose a variant of the induced suffix sorting algorithm by Nong (TOIS, 2013) that computes simultaneously the Lyndon array and the suffix array of a text in O(n) time using O(n) words of working space, where n is the length of the text and is the alphabet size. Our result improves the previous best space requirement for linear time computation of the Lyndon array. In fact, all the known linear algorithms for Lyndon array computation use suffix sorting as a preprocessing step and use O(n) words of working space in addition to the Lyndon array and suffix array. Experimental results with real and synthetic datasets show that our algorithm is not only space-efficient but also fast in practice
Inducing Suffix and LCP Arrays in External Memory
We consider full text index construction in external memory (EM). Our first contribution is an inducing algorithm for suffix arrays in external memory, which utilizes an efficient EM priority queue and runs in sorting complexity. Practical tests show that this algorithm outperforms the previous best EM suffix sorter [Dementiev et al., JEA 2008] by a factor of about two in time and I/O-volume. Our second contribution is to augment the first algorithm to also construct the array of longest common prefixes (LCPs). This yields the first EM construction algorithm for LCP arrays. The overhead in time and I/O volume for this extended algorithm over plain suffix array construction is roughly two. Our algorithms scale far beyond problem sizes previously considered in the literature (text size of 80 GiB using only 4 GiB of RAM in our experiments).