239 research outputs found
On Bijective Variants of the Burrows-Wheeler Transform
The sort transform (ST) is a modification of the Burrows-Wheeler transform
(BWT). Both transformations map an arbitrary word of length n to a pair
consisting of a word of length n and an index between 1 and n. The BWT sorts
all rotation conjugates of the input word, whereas the ST of order k only uses
the first k letters for sorting all such conjugates. If two conjugates start
with the same prefix of length k, then the indices of the rotations are used
for tie-breaking. Both transforms output the sequence of the last letters of
the sorted list and the index of the input within the sorted list. In this
paper, we discuss a bijective variant of the BWT (due to Scott), proving its
correctness and relations to other results due to Gessel and Reutenauer (1993)
and Crochemore, Desarmenien, and Perrin (2005). Further, we present a novel
bijective variant of the ST.Comment: 15 pages, presented at the Prague Stringology Conference 2009 (PSC
2009
Space Efficient Construction of Lyndon Arrays in Linear Time
Given a string S of length n, its Lyndon array identifies for each suffix S[i..n] the next lexicographically smaller suffix S[j..n], i.e. the minimal index j > i with S[i..n] ? S[j..n]. Apart from its plain (n log? n)-bit array representation, the Lyndon array can also be encoded as a succinct parentheses sequence that requires only 2n bits of space. While linear time construction algorithms for both representations exist, it has previously been unknown if the same time bound can be achieved with less than ?(n lg n) bits of additional working space. We show that, in fact, o(n) additional bits are sufficient to compute the succinct 2n-bit version of the Lyndon array in linear time. For the plain (n log? n)-bit version, we only need ?(1) additional words to achieve linear time. Our space efficient construction algorithm makes the Lyndon array more accessible as a fundamental data structure in applications like full-text indexing
Longest Lyndon Substring After Edit
The longest Lyndon substring of a string T is the longest substring of T which is a Lyndon word. LLS(T) denotes the length of the longest Lyndon substring of a string T. In this paper, we consider computing LLS(T\u27) where T\u27 is an edited string formed from T. After O(n) time and space preprocessing, our algorithm returns LLS(T\u27) in O(log n) time for any single character edit. We also consider a version of the problem with block edits, i.e., a substring of T is replaced by a given string of length l. After O(n) time and space preprocessing, our algorithm returns LLS(T\u27) in O(l log sigma + log n) time for any block edit where sigma is the number of distinct characters in T. We can modify our algorithm so as to output all the longest Lyndon substrings of T\u27 for both problems
Computation of the suffix array, burrows-wheeler transform and FM-index in V-order
V-order is a total order on strings that determines an instance of Unique Maximal Factorization Families (UMFFs), a generalization of Lyndon words. The fundamental V-comparison of strings can be done in linear time and constant space. V-order has been proposed as an alternative to lexicographic order (lexorder) in the computation of suffix arrays and in the suffix-sorting induced by the Burrows-Wheeler transform (BWT). In line with the recent interest in the connection between suffix arrays and Lyndon factorization, in this paper we obtain similar results for the V-order factorization. Indeed, we show that the results describing the connection between suffix arrays and Lyndon factorization are matched by analogous V-order processing. We also describe a methodology for efficiently computing the FM-Index in V-order, as well as V-order substring pattern matching using backward search
Repeat-Free Codes
In this paper we consider the problem of encoding data into repeat-free
sequences in which sequences are imposed to contain any -tuple at most once
(for predefined ). First, the capacity and redundancy of the repeat-free
constraint are calculated. Then, an efficient algorithm, which uses a single
bit of redundancy, is presented to encode length- sequences for . This algorithm is then improved to support any value of of the form
, for , while its redundancy is . We also calculate the
capacity of repeat-free sequences when combined with local constraints which
are given by a constrained system, and the capacity of multi-dimensional
repeat-free codes.Comment: 18 page
Compressibility-Aware Quantum Algorithms on Strings
Sublinear time quantum algorithms have been established for many fundamental
problems on strings. This work demonstrates that new, faster quantum algorithms
can be designed when the string is highly compressible. We focus on two popular
and theoretically significant compression algorithms -- the Lempel-Ziv77
algorithm (LZ77) and the Run-length-encoded Burrows-Wheeler Transform (RL-BWT),
and obtain the results below.
We first provide a quantum algorithm running in time
for finding the LZ77 factorization of an input string with
factors. Combined with multiple existing results, this yields an
time quantum algorithm for finding the RL-BWT encoding
with BWT runs. Note that . We complement these
results with lower bounds proving that our algorithms are optimal (up to
polylog factors).
Next, we study the problem of compressed indexing, where we provide a
time quantum algorithm for constructing a recently
designed space structure with equivalent capabilities as the
suffix tree. This data structure is then applied to numerous problems to obtain
sublinear time quantum algorithms when the input is highly compressible. For
example, we show that the longest common substring of two strings of total
length can be computed in time, where is the
number of factors in the LZ77 factorization of their concatenation. This beats
the best known time quantum algorithm when is
sufficiently small
On baier's sort of maximal Lyndon substrings
We describe and analyze in terms of Lyndon words an elementary sort of maximal Lyndon factors of a string and prove formally its correctness. Since the sort is based on the first phase of Baier’s algorithm for sorting of the suffixes of a string, we refer to it as Baier’s sort
- …