952 research outputs found
An Elegant Algorithm for the Construction of Suffix Arrays
The suffix array is a data structure that finds numerous applications in
string processing problems for both linguistic texts and biological data. It
has been introduced as a memory efficient alternative for suffix trees. The
suffix array consists of the sorted suffixes of a string. There are several
linear time suffix array construction algorithms (SACAs) known in the
literature. However, one of the fastest algorithms in practice has a worst case
run time of . The problem of designing practically and theoretically
efficient techniques remains open. In this paper we present an elegant
algorithm for suffix array construction which takes linear time with high
probability; the probability is on the space of all possible inputs. Our
algorithm is one of the simplest of the known SACAs and it opens up a new
dimension of suffix array construction that has not been explored until now.
Our algorithm is easily parallelizable. We offer parallel implementations on
various parallel models of computing. We prove a lemma on the -mers of a
random string which might find independent applications. We also present
another algorithm that utilizes the above algorithm. This algorithm is called
RadixSA and has a worst case run time of . RadixSA introduces an
idea that may find independent applications as a speedup technique for other
SACAs. An empirical comparison of RadixSA with other algorithms on various
datasets reveals that our algorithm is one of the fastest algorithms to date.
The C++ source code is freely available at
http://www.engr.uconn.edu/~man09004/radixSA.zi
Linear pattern matching on sparse suffix trees
Packing several characters into one computer word is a simple and natural way
to compress the representation of a string and to speed up its processing.
Exploiting this idea, we propose an index for a packed string, based on a {\em
sparse suffix tree} \cite{KU-96} with appropriately defined suffix links.
Assuming, under the standard unit-cost RAM model, that a word can store up to
characters ( the alphabet size), our index takes
space, i.e. the same space as the packed string itself.
The resulting pattern matching algorithm runs in time ,
where is the length of the pattern, is the actual number of characters
stored in a word and is the number of pattern occurrences
LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations
LRM-Trees are an elegant way to partition a sequence of values into sorted
consecutive blocks, and to express the relative position of the first element
of each block within a previous block. They were used to encode ordinal trees
and to index integer arrays in order to support range minimum queries on them.
We describe how they yield many other convenient results in a variety of areas,
from data structures to algorithms: some compressed succinct indices for range
minimum queries; a new adaptive sorting algorithm; and a compressed succinct
data structure for permutations supporting direct and indirect application in
time all the shortest as the permutation is compressible.Comment: 13 pages, 1 figur
Speeding-up -gram mining on grammar-based compressed texts
We present an efficient algorithm for calculating -gram frequencies on
strings represented in compressed form, namely, as a straight line program
(SLP). Given an SLP of size that represents string , the
algorithm computes the occurrence frequencies of all -grams in , by
reducing the problem to the weighted -gram frequencies problem on a
trie-like structure of size , where
is a quantity that represents the amount of
redundancy that the SLP captures with respect to -grams. The reduced problem
can be solved in linear time. Since , the running time of our
algorithm is , improving our
previous algorithm when
A Faster Implementation of Online Run-Length Burrows-Wheeler Transform
Run-length encoding Burrows-Wheeler Transformed strings, resulting in
Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive
strings. We propose a new algorithm for online RLBWT working in run-compressed
space, which runs in time and bits of space, where
is the length of input string received so far and is the number of runs
in the BWT of the reversed . We improve the state-of-the-art algorithm for
online RLBWT in terms of empirical construction time. Adopting the dynamic list
for maintaining a total order, we can replace rank queries in a dynamic wavelet
tree on a run-length compressed string by the direct comparison of labels in a
dynamic list. The empirical result for various benchmarks show the efficiency
of our algorithm, especially for highly repetitive strings.Comment: In Proc. IWOCA201
Lightweight BWT and LCP merging via the gap algorithm
Recently, Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014] have proposed a simple and elegant algorithm to merge the Burrows-Wheeler transforms of a collection of strings. In this paper we show that their algorithm can be improved so that, in addition to the BWTs, it also merges the Longest Common Prefix (LCP) arrays. Because of its small memory footprint this new algorithm can be used for the final merge of BWT and LCP arrays computed by a faster but memory intensive construction algorithm
Faster Compact On-Line Lempel-Ziv Factorization
We present a new on-line algorithm for computing the Lempel-Ziv factorization
of a string that runs in time and uses only bits
of working space, where is the length of the string and is the
size of the alphabet. This is a notable improvement compared to the performance
of previous on-line algorithms using the same order of working space but
running in either time (Okanohara & Sadakane 2009) or
time (Starikovskaya 2012). The key to our new algorithm is in the
utilization of an elegant but less popular index structure called Directed
Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an
opportunistic variant of our algorithm, which, given the run length encoding of
size of a string of length , computes the Lempel-Ziv factorization
on-line, in time
and bits of space, which is faster and more space efficient when
the string is run-length compressible
Ψ-RA: a parallel sparse index for genomic read alignment
Background
Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than does direct sequencing. Among many strategies proposed for this alignment process, indexing the reference genome and short read searching over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to catch the massive short reads produced by the next generation high-throughput DNA sequencing technology.
Results
We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new short read aligner named Ψ-RA (PSI-RA: parallel sparse index read aligner). The motivation in using SSAs is the ability to trade memory against time. It is possible to fine tune the space consumption of the index based on the available memory of the machine and the minimum length of the arriving pattern queries. Although SSAs have been studied before for exact matching of short reads, an elegant way of approximate matching capability was missing. We provide this by defining the rightmost mismatch criteria that prioritize the errors towards the end of the reads, where errors are more probable. Ψ-RA supports any number of mismatches in aligning reads. We give comparisons with some of the well-known short read aligners, and show that indexing a genome with SSA is a good alternative to the Burrows-Wheeler transform or seed-based solutions.
Conclusions
Ψ-RA is expected to serve as a valuable tool in the alignment of short reads generated by the next generation high-throughput sequencing technology. Ψ-RA is very fast in exact matching and also supports rightmost approximate matching. The SSA structure that Ψ-RA is built on naturally incorporates the modern multicore architecture and thus further speed-up can be gained. All the information, including the source code of Ψ-RA, can be downloaded at: http://www.busillis.com/o_kulekci/PSIRA.zip webcite
Indexing large genome collections on a PC
Motivation: The availability of thousands of invidual genomes of one species
should boost rapid progress in personalized medicine or understanding of the
interaction between genotype and phenotype, to name a few applications. A key
operation useful in such analyses is aligning sequencing reads against a
collection of genomes, which is costly with the use of existing algorithms due
to their large memory requirements.
Results: We present MuGI, Multiple Genome Index, which reports all
occurrences of a given pattern, in exact and approximate matching model,
against a collection of thousand(s) genomes. Its unique feature is the small
index size fitting in a standard computer with 16--32\,GB, or even 8\,GB, of
RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is
also fast. For example, the exact matching queries are handled in average time
of 39\,s and with up to 3 mismatches in 373\,s on the test PC with
the index size of 13.4\,GB. For a smaller index, occupying 7.4\,GB in memory,
the respective times grow to 76\,s and 917\,s.
Availability: Software and Suuplementary material:
\url{http://sun.aei.polsl.pl/mugi}
- …