Space-efficient construction of compressed suffix trees
We show how to build several data structures of central importance to string processing by taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let n be the text length and σ be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in O(n log σ) time using just o(n log σ) bits of working space on top of the input re-writable BWT. Using these algorithms as building blocks, for any parameter 00. This improves the previous most space-efficient algorithms, which worked in O(n) bits and O(n log n) time. We also consider the problem of merging BWTs of string collections, and provide a solution running in O(n log σ) time and using just o(n log σ) bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as n bits on top of a packed representation of the input/output and processes data as fast as 2.92 megabases per second.
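For readers unfamiliar with the input this abstract assumes, the following is a minimal sketch of the Burrows-Wheeler transform and its inversion. It is not the space-efficient construction described above: it sorts rotations naively, and the function names and the `$` sentinel (assumed absent from the text and smaller than every text symbol) are illustrative assumptions.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumption: `sentinel` does not occur in `text` and compares smaller
    than every symbol of `text`.
    """
    s = text + sentinel
    # Rotation starting at i is s[i:] + s[:i]; its last symbol is s[i - 1].
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return "".join(s[i - 1] for i in order)


def inverse_bwt(last):
    """Invert the BWT with the LF-mapping (last column to first column)."""
    n = len(last)
    # A stable sort of positions by symbol gives the first column; the k-th
    # occurrence of a symbol in the last column is its k-th occurrence there.
    first_order = sorted(range(n), key=lambda i: (last[i], i))
    lf = [0] * n
    for first_row, last_row in enumerate(first_order):
        lf[last_row] = first_row
    # Row 0 is the rotation that starts with the sentinel, so last[0] is the
    # final symbol of the text. Walk backwards through the text from there.
    out = []
    row = 0
    for _ in range(n - 1):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))
```

For example, `bwt("banana")` gives `"annb$aa"`, and `inverse_bwt` recovers `"banana"` from it. The sketch takes quadratic time or worse and is meant only to fix notation.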
Internal Pattern Matching Queries in a Text and Applications
We consider several types of internal queries: questions about subwords of a
text. As the main tool we develop an optimal data structure for the problem
called here internal pattern matching. This data structure provides
constant-time answers to queries about occurrences of one subword in
another subword of a given text, assuming that ,
which allows for a constant-space representation of all occurrences. This
problem can be viewed as a natural extension of the well-studied pattern
matching problem. The data structure has linear size and admits a linear-time
construction algorithm.
Using the solution to the internal pattern matching problem, we obtain very
efficient data structures answering queries about: primitivity of subwords,
periods of subwords, general substring compression, and cyclic equivalence of
two subwords. All these results improve upon the best previously known
counterparts. The linear construction time of our data structure also allows us to
improve the algorithm for finding -subrepetitions in a text (a more
general version of maximal repetitions, also called runs). For any fixed
we obtain the first linear-time algorithm, which matches the linear
time complexity of the algorithm computing runs. Our data structure has already
been used as a part of the efficient solutions for subword suffix rank &
selection, as well as substring compression using Burrows-Wheeler transform
composed with run-length encoding.
Comment: 31 pages, 9 figures; accepted to SODA 201
Lyndon Array Construction during Burrows-Wheeler Inversion
In this paper we present an algorithm to compute the Lyndon array of a string
of length as a byproduct of the inversion of the Burrows-Wheeler
transform of . Our algorithm runs in linear time using only a stack in
addition to the data structures used for Burrows-Wheeler inversion. We compare
our algorithm with two other linear-time algorithms for Lyndon array
construction and show that computing the Burrows-Wheeler transform and then
constructing the Lyndon array is competitive compared to the known approaches.
We also propose a new balanced parenthesis representation for the Lyndon array
that uses bits of space and supports constant time access. This
representation can be built in linear time using words of space, or in
time using asymptotically the same space as
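As a point of reference, the Lyndon array can be sketched with a stack via the known characterization that the longest Lyndon word starting at position i ends just before the next lexicographically smaller suffix. The sketch below is an illustration under that characterization, not the BWT-inversion algorithm of the abstract: it ranks suffixes by naive sorting, and the function name is an assumption.

```python
def lyndon_array(s):
    """Return lam where lam[i] is the length of the longest Lyndon word
    starting at position i of s.

    Uses the next-smaller-suffix characterization: lam[i] = j - i, where j is
    the first position after i whose suffix is smaller than suffix i (or
    len(s) if there is none). Suffix ranks come from a naive sort here.
    """
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: s[i:])):
        rank[i] = r
    lam = [0] * n
    stack = []  # positions to the right of i, with ranks increasing to the top
    for i in range(n - 1, -1, -1):
        # Discard suffixes larger than suffix i: they cannot be the next
        # smaller suffix for i or for any position to the left of i.
        while stack and rank[stack[-1]] > rank[i]:
            stack.pop()
        lam[i] = (stack[-1] if stack else n) - i
        stack.append(i)
    return lam
```

For example, `lyndon_array("banana")` returns `[1, 2, 1, 2, 1, 1]`: the longest Lyndon words starting at positions 1 and 3 are both "an".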
Sorting suffixes of a text via its Lyndon Factorization
Sorting the suffixes of a text plays a fundamental role in
text algorithms. It is used, for instance, in the construction of the
Burrows-Wheeler transform and the suffix array, which are widely used in several fields
of computer science. For this reason, much recent research has been
devoted to finding new strategies to obtain effective methods for such a
sorting. In this paper we introduce a new methodology in which an important
role is played by the Lyndon factorization, so that the local suffixes inside
factors detected by this factorization keep their mutual order when extended to
the suffixes of the whole word. This property suggests a versatile technique
that can easily be adapted to different implementation scenarios.
Comment: Submitted to the Prague Stringology Conference 2013 (PSC 2013)
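The Lyndon factorization that this abstract relies on can be computed in linear time with Duval's algorithm. The following is a minimal sketch of that standard algorithm, not of the suffix-sorting method proposed in the paper; the function name is an assumption.

```python
def lyndon_factorization(s):
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words."""
    n = len(s)
    factors = []
    i = 0
    while i < n:
        # s[i:j] is a power of a Lyndon word of length j - k, possibly
        # followed by a proper prefix of that word.
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i      # s[i:j+1] is itself a Lyndon word; restart comparison
            else:
                k += 1     # still repeating the current Lyndon word
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

For example, `lyndon_factorization("banana")` returns `["b", "an", "an", "a"]`. Within the factor "an" starting at position 1, the local suffixes "an" and "n" are ordered the same way as the global suffixes "anana" and "nana", which is the kind of order preservation the abstract describes.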
On the combinatorics of suffix arrays
We prove several combinatorial properties of suffix arrays, including a
characterization of suffix arrays through a bijection with a certain
well-defined class of permutations. Our approach is based on the
characterization of Burrows-Wheeler arrays given in [1], that we apply by
reducing suffix sorting to cyclic shift sorting through the use of an
additional sentinel symbol. We show that the characterization of suffix arrays
for a special case of binary alphabet given in [2] easily follows from our
characterization. Based on our results, we also provide simple proofs for the
enumeration results for suffix arrays, obtained in [3]. Our approach to
characterizing suffix arrays is the first that exploits their relationship with
Burrows-Wheeler permutations.
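The reduction from suffix sorting to cyclic shift sorting mentioned above can be illustrated with a short sketch. It assumes a sentinel symbol (here `"\0"`, an illustrative choice) that does not occur in the text and is smaller than every text symbol; the function name is also an assumption, and the naive sort is for illustration only.

```python
def suffix_array_via_rotations(text, sentinel="\0"):
    """Suffix array of `text`, obtained by sorting cyclic shifts of text + sentinel.

    Assumption: `sentinel` is absent from `text` and smaller than all of its
    symbols. Then comparing two rotations is decided no later than the first
    sentinel, so the rotation order equals the suffix order.
    """
    s = text + sentinel
    n = len(s)
    rotation_order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # Drop the rotation that starts at the sentinel itself.
    return [i for i in rotation_order if i < len(text)]
```

For example, `suffix_array_via_rotations("banana")` returns `[5, 3, 1, 0, 4, 2]`, the same order as sorting the suffixes of "banana" directly.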
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
Motivation
The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for
compression and indexing of text data, but the cost of computing the BWT of
very large string collections has prevented these techniques from being widely
applied to the large sets of sequences often encountered as the outcome of DNA
sequencing experiments. In previous work, we presented a novel algorithm that
allows the BWT of human genome scale data to be computed on very moderate
hardware, thus enabling us to investigate the BWT as a tool for the compression
of such datasets.
Results
We first used simulated reads to explore the relationship between the level
of compression and the error rate, the length of the reads and the level of
sampling of the underlying genome, and to compare choices of second-stage
compression algorithm.
We demonstrate that compression may be greatly improved by a particular
reordering of the sequences in the collection and give a novel 'implicit
sorting' strategy that enables these benefits to be realised without the
overhead of sorting the reads. With these techniques, a 45x coverage of real
human genome sequence data compresses losslessly to under 0.5 bits per base,
allowing the 135.3 Gbp of sequence to fit into only 8.2 Gbytes of space (trimming
a small proportion of low-quality bases from the reads improves the compression
still further).
This is more than 4 times smaller than the size achieved by a standard
BWT-based compressor (bzip2) on the untrimmed reads, but an important further
advantage of our approach is that it facilitates the building of compressed
full text indexes such as the FM-index on large-scale DNA sequence collections.
Comment: Version here is as submitted to Bioinformatics and is the same as the
previously archived version. This submission registers the fact that the
advance access version is now available at
http://bioinformatics.oxfordjournals.org/content/early/2012/05/02/bioinformatics.bts173.abstract
. Bioinformatics should be considered the original place of publication of
this article; please cite accordingly.
Parallel Wavelet Tree Construction
We present parallel algorithms for wavelet tree construction with
polylogarithmic depth, improving upon the linear depth of the recent parallel
algorithms by Fuentes-Sepulveda et al. We experimentally show on a 40-core
machine with two-way hyper-threading that we outperform the existing parallel
algorithms by 1.3--5.6x and achieve up to 27x speedup over the sequential
algorithm on a variety of real-world and artificial inputs. Our algorithms show
good scalability with increasing thread count, input size and alphabet size. We
also discuss extensions to variants of the standard wavelet tree.
Comment: This is a longer version of the paper that appears in the Proceedings
of the IEEE Data Compression Conference, 201
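To make the object being constructed concrete, here is a minimal sequential sketch of a pointer-based wavelet tree with a rank query. It is not the parallel, polylogarithmic-depth construction of the abstract; the class name, the plain-list bit vectors and the linear-time bit counting are simplifying assumptions.

```python
class WaveletTree:
    """Sequential wavelet tree over a sequence (illustrative sketch only)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.bits = []
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = len(self.alphabet) // 2
            lower = set(self.alphabet[:mid])
            # Bit 0 sends a symbol to the left child, bit 1 to the right child.
            self.bits = [0 if c in lower else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in lower],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in lower],
                                     self.alphabet[mid:])

    def rank(self, symbol, i):
        """Number of occurrences of `symbol` among the first i elements."""
        if symbol not in self.alphabet:
            return 0
        if self.left is None:
            return i  # leaf: every element stored here equals `symbol`
        # A real implementation would answer this count in constant time
        # with a succinct rank structure; here the bits are simply summed.
        ones = sum(self.bits[:i])
        if symbol in self.alphabet[:len(self.alphabet) // 2]:
            return self.left.rank(symbol, i - ones)
        return self.right.rank(symbol, ones)
```

For example, with `wt = WaveletTree("abracadabra")`, `wt.rank("a", 11)` is 5 and `wt.rank("b", 4)` is 1. Each level of the tree depends only on the level above it, which is the structure that parallel construction algorithms exploit.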