Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
The longest common prefix (LCP) array is a versatile auxiliary data structure
in indexed string matching. It can be used to speed up searching using the
suffix array (SA) and provides an implicit representation of the topology of an
underlying suffix tree. The LCP array of a string of length n can be
represented as an array of n words, or, in the presence of the SA, as
a bit vector of 2n bits plus asymptotically negligible support data
structures. External memory construction algorithms for the LCP array have been
proposed, but those proposed so far have a space requirement of O(n) words
(i.e. O(n log n) bits) in external memory. This space requirement is in some
practical cases prohibitively expensive. We present an external memory
algorithm for constructing the 2n bit version of the LCP array which uses
O(n log σ) bits of additional space in external memory when given a
(compressed) BWT with alphabet size σ and a sampled inverse suffix array
at sampling rate O(log n). This is often a significant space gain in
practice, where σ is usually much smaller than n or even constant. We
also consider the case of computing succinct LCP arrays for circular strings.
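To make the bit-vector PLCP representation mentioned above concrete (2n bits for a string of length n), here is a small in-memory sketch, not the paper's external-memory algorithm: it computes the permuted LCP array with a Kasai-style pass and marks the strictly increasing values PLCP[i] + 2i in a 2n-bit vector. All function names are illustrative.

```python
# In-memory sketch of the succinct permuted LCP array (PLCP),
# illustrating the 2n-bit representation; the paper's external-memory
# construction is not reproduced here.

def suffix_array(s):
    # Naive O(n^2 log n) construction, fine for an illustration.
    return sorted(range(len(s)), key=lambda i: s[i:])

def plcp_bitvector(s):
    """Return (SA, PLCP, bits) where bits encodes PLCP in 2n bits.

    Since PLCP[i] >= PLCP[i-1] - 1, the values PLCP[i] + 2i are
    strictly increasing and below 2n, so marking each with a 1-bit
    stores the whole array in 2n bits (plus rank/select support)."""
    n = len(s)
    sa = suffix_array(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai-style pass: LCP values in text order, amortized linear.
    plcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]   # suffix preceding s[i:] in SA order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            plcp[i] = h
            h = max(h - 1, 0)
        else:
            plcp[i] = h = 0
    bits = [0] * (2 * n)
    for i in range(n):
        bits[plcp[i] + 2 * i] = 1
    return sa, plcp, bits

sa, plcp, bits = plcp_bitvector("banana")
# plcp == [0, 3, 2, 1, 0, 0]; PLCP[i] = select1(bits, i + 1) - 2 * i
```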
Compressed Spaced Suffix Arrays
Spaced seeds are important tools for similarity search in bioinformatics, and
using several seeds together often significantly improves their performance.
With existing approaches, however, for each seed we keep a separate linear-size
data structure, either a hash table or a spaced suffix array (SSA). In this
paper we show how to compress SSAs relative to normal suffix arrays (SAs) and
still support fast random access to them. We first prove a theoretical upper
bound on the space needed to store an SSA when we already have the SA. We then
present experiments indicating that our approach works even better in practice.
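As a hypothetical illustration of what a spaced suffix array is (the seed pattern "110" and the function names below are made up for the example, not taken from the paper): suffixes are ordered by the subsequence of characters selected by cycling a binary seed, where '1' marks a match position.

```python
# Toy spaced suffix array (SSA): suffixes are compared only at the
# '1' positions of a repeating binary seed. Seed "110" is illustrative.

def spaced_key(s, i, seed):
    """Subsequence of s[i:] selected by cycling through the seed."""
    return ''.join(s[k] for k in range(i, len(s))
                   if seed[(k - i) % len(seed)] == '1')

def spaced_suffix_array(s, seed):
    return sorted(range(len(s)), key=lambda i: spaced_key(s, i, seed))

s = "banana"
ssa = spaced_suffix_array(s, "110")
sa = sorted(range(len(s)), key=lambda i: s[i:])   # normal SA
```

On this toy input the SSA happens to equal the SA; it is this kind of similarity between the two permutations that relative compression can exploit.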
RLZAP: Relative Lempel-Ziv with Adaptive Pointers
Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of
genomes from individuals of the same species when fast random access is
desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a
reference genome is selected and then the other genomes are greedily parsed
into phrases exactly matching substrings of the reference. Deorowicz and
Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with
a mismatch character usually gives better compression because many of the
differences between individuals' genomes are single-nucleotide substitutions.
Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers
and run-length compressing them usually gives even better compression. In this
paper we generalize Ferrada et al.'s idea to handle well also short insertions,
deletions and multi-character substitutions. We show experimentally that our
generalization achieves better compression than Ferrada et al.'s implementation
with comparable random-access times.
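A toy greedy parser in the style described above, where each phrase is a longest exact match against the reference followed by one explicit mismatch character; this naive quadratic scan is only a sketch, not any of the cited implementations.

```python
# Toy relative Lempel-Ziv parse with explicit mismatch characters
# (RLZ in the Deorowicz-Grabowski style sketched above). The naive
# O(n^2) longest-match scan is for illustration only.

def rlz_parse(reference, target):
    """Greedy parse into phrases (ref_pos, length, mismatch_char)."""
    phrases = []
    i = 0
    while i < len(target):
        best_pos = best_len = 0
        for j in range(len(reference)):   # naive longest-match scan
            l = 0
            while (j + l < len(reference) and i + l < len(target)
                   and reference[j + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        i += best_len
        mismatch = target[i] if i < len(target) else ''
        phrases.append((best_pos, best_len, mismatch))
        i += 1
    return phrases

def rlz_decode(reference, phrases):
    return ''.join(reference[p:p + l] + c for p, l, c in phrases)

ref = "ACGTACGTAC"
tgt = "ACGTTCGTAC"   # one substitution relative to ref
phrases = rlz_parse(ref, tgt)
# phrases == [(0, 4, 'T'), (1, 5, '')]
```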
Parallel Wavelet Tree Construction
We present parallel algorithms for wavelet tree construction with
polylogarithmic depth, improving upon the linear depth of the recent parallel
algorithms by Fuentes-Sepulveda et al. We experimentally show on a 40-core
machine with two-way hyper-threading that we outperform the existing parallel
algorithms by 1.3--5.6x and achieve up to 27x speedup over the sequential
algorithm on a variety of real-world and artificial inputs. Our algorithms show
good scalability with increasing thread count, input size and alphabet size. We
also discuss extensions to variants of the standard wavelet tree.
Comment: This is a longer version of the paper that appears in the Proceedings of the IEEE Data Compression Conference, 201
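For context, here is a plain sequential wavelet tree with rank support, the structure the parallel algorithms above construct, sketched with Python lists instead of bit vectors; the class and method names are illustrative.

```python
# Sequential wavelet tree with rank support, sketched with plain
# Python lists instead of succinct bit vectors.

class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None          # leaf: a single symbol
            return
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        # One bit per symbol: 0 = left half of alphabet, 1 = right half.
        self.bits = [0 if c in left_syms else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in left_syms],
                                 self.alphabet[mid:])

    def rank(self, c, i):
        """Number of occurrences of c in seq[:i]."""
        if self.bits is None:
            return i
        mid = len(self.alphabet) // 2
        b = 0 if c in self.alphabet[:mid] else 1
        j = sum(1 for x in self.bits[:i] if x == b)   # rank in this node
        child = self.left if b == 0 else self.right
        return child.rank(c, j)

wt = WaveletTree("abracadabra")
# wt.rank('a', 11) == 5
```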
Enhanced molecular dynamics performance with a programmable graphics processor
Design considerations for molecular dynamics algorithms capable of taking
advantage of the computational power of a graphics processing unit (GPU) are
described. Accommodating the constraints of scalable streaming-multiprocessor
hardware necessitates a reformulation of the underlying algorithm. Performance
measurements demonstrate the considerable benefit and cost-effectiveness of
such an approach, which produces a factor of 2.5 speed improvement over
previous work for the case of the soft-sphere potential.
Comment: 20 pages (v2: minor additions and changes; v3: corrected typos)
Efficient Construction of the BWT for Repetitive Text Using String Compression
Funding Information: Diego Díaz-Domínguez: Academy of Finland Grant 323233; Gonzalo Navarro: ANID Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile. Publisher Copyright: © Diego Díaz-Domínguez and Gonzalo Navarro; licensed under Creative Commons License CC-BY 4.0.
We present a new semi-external algorithm that builds the Burrows-Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce the computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space, but it also speeds up the required computations. Our experiments show important savings in both space and computation time when the text is repetitive. On average, we are 3.7x faster than the baseline compressed approach, while maintaining a similar memory consumption. These results make our method stand out as the only one (to our knowledge) that can build the BCR BWT of a collection of 25 human genomes (75 GB) in about 7.3 hours, and using only 27 GB of working memory.
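A naive illustration of the object being built, not of the BCR/ISS algorithm itself: the BWT of a string collection, with one distinct terminator per string, computed by materializing and sorting all suffixes, exactly the cost the semi-external algorithm avoids on massive inputs.

```python
# Naive BWT of a string collection with per-string terminators.
# Real BCR construction never materializes all suffixes like this.

def bwt_collection(strings):
    """Terminators compare smaller than all letters, ordered by
    string index; symbols are (0, k) terminators or (1, ch) letters."""
    seqs = [[(1, c) for c in s] + [(0, k)] for k, s in enumerate(strings)]
    text = [sym for seq in seqs for sym in seq]
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # naive suffix sort
    last = [text[(i - 1) % n] for i in sa]          # preceding symbols
    return ''.join(c if tag else '$' for tag, c in last)

bwt = bwt_collection(["aba", "ab"])
# bwt == "abb$$aa"
```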
Lightweight BWT and LCP merging via the gap algorithm
Recently, Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014] have proposed a simple and elegant algorithm to merge the Burrows-Wheeler transforms of a collection of strings. In this paper we show that their algorithm can be improved so that, in addition to the BWTs, it also merges the Longest Common Prefix (LCP) arrays. Because of its small memory footprint, this new algorithm can be used for the final merge of BWT and LCP arrays computed by a faster but memory-intensive construction algorithm.
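A simplified in-memory sketch of the Holt and McMillan BWT merge that the gap algorithm builds on (BWTs only, without the LCP merging this paper adds): an interleave vector Z is refined pass by pass, each pass ordering the merged rotations one character deeper, until it stabilizes.

```python
# Simplified Holt-McMillan-style merge of two BWTs. Assumes each BWT
# is of a single '$'-terminated string, with '$' ties resolved in
# favor of the first BWT (as if its terminator were smaller).

def merge_bwts(bwt0, bwt1):
    bwts = (bwt0, bwt1)
    z = [0] * len(bwt0) + [1] * len(bwt1)   # initial interleave guess
    while True:
        buckets = {}
        k = [0, 0]                           # cursors into each BWT
        for b in z:                          # stable pass over Z
            c = bwts[b][k[b]]
            k[b] += 1
            buckets.setdefault(c, []).append(b)
        # Concatenate buckets in character order ('$' sorts first).
        nz = [b for c in sorted(buckets) for b in buckets[c]]
        if nz == z:                          # fixpoint: Z is correct
            break
        z = nz
    k = [0, 0]
    out = []
    for b in z:                              # interleave the BWTs by Z
        out.append(bwts[b][k[b]])
        k[b] += 1
    return ''.join(out)

merged = merge_bwts("b$a", "b$")   # BWTs of "ab$" and "b$"
# merged == "bb$a$"
```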
Compressed Spaced Suffix Arrays
As a first step in designing relatively-compressed data structures---i.e., such that storing an instance for one dataset helps us store instances for similar datasets---we consider how to compress spaced suffix arrays relative to normal suffix arrays and still support fast access to them. This problem is of practical interest when performing similarity search with spaced seeds because using several seeds in parallel significantly improves their performance, but with existing approaches we keep a separate linear-space hash table or spaced suffix array for each seed. We first prove a theoretical upper bound on the space needed to store a spaced suffix array when we already have the suffix array. We then present experiments indicating that our approach works even better in practice.