13 research outputs found
An Elegant Algorithm for the Construction of Suffix Arrays
The suffix array is a data structure that finds numerous applications in
string processing problems for both linguistic texts and biological data. It
has been introduced as a memory efficient alternative for suffix trees. The
suffix array consists of the sorted suffixes of a string. There are several
linear time suffix array construction algorithms (SACAs) known in the
literature. However, one of the fastest algorithms in practice has a worst case
run time of . The problem of designing practically and theoretically
efficient techniques remains open. In this paper we present an elegant
algorithm for suffix array construction which takes linear time with high
probability; the probability is on the space of all possible inputs. Our
algorithm is one of the simplest of the known SACAs and it opens up a new
dimension of suffix array construction that has not been explored until now.
Our algorithm is easily parallelizable. We offer parallel implementations on
various parallel models of computing. We prove a lemma on the -mers of a
random string which might find independent applications. We also present
another algorithm that utilizes the above algorithm. This algorithm is called
RadixSA and has a worst case run time of . RadixSA introduces an
idea that may find independent applications as a speedup technique for other
SACAs. An empirical comparison of RadixSA with other algorithms on various
datasets reveals that our algorithm is one of the fastest algorithms to date.
The C++ source code is freely available at
http://www.engr.uconn.edu/~man09004/radixSA.zi
Fast parallel skew and prefix‐doubling suffix array construction on the GPU
Suffix arrays are fundamental full-text index data structures of importance to a broad spectrum of applications in such fields as bioinformatics, Burrows-Wheeler Transform (BWT)-based lossless data compression, and information retrieval. In this work, we propose and implement two massively parallel approaches on the GPU based on two classes of suffix array construction algorithms. The first, parallel skew, makes algorithmic improvements to the previous work of Deo and Keely to achieve a speedup of 1.45x over their work. The second, a hybrid skew and prefix-doubling implementation, is the first of its kind on the GPU and achieves a speedup of 2.3–4.4x over Osipov’s prefix-doubling and 2.4–7.9x over our skew implementation on large datasets. Our implementations rely on two efficient parallel primitives, a merge and a segmented sort. We theoretically analyze the two formulations of suffix array construction algorithms and show performance comparisons on a large variety of practical inputs. We conclude that, with the novel use of our efficient segmented sort, prefix-doubling is more competitive than skew on the GPU. We also demonstrate the effectiveness of our methods in our implementations of the Burrows-Wheeler transform and in a parallel FM-index for pattern searching. This is the submitted version of the paper
An Incomplex Algorithm for Fast Suffix Array Construction
Schürmann K-B, Stoye J. An Incomplex Algorithm for Fast Suffix Array Construction. In: Proc. of ALENEX/ANALCO 2005. 2005: 77-85
An incomplex algorithm for fast suffix array construction
Schürmann K-B, Stoye J. An incomplex algorithm for fast suffix array construction. SOFTWARE-PRACTICE & EXPERIENCE. 2007;37(3):309-329.The suffix array of a string is a permutation of all starting positions of the string's suffixes that are lexicographically sorted. We present a practical algorithm for suffix array construction that consists of two easy-to-implement components. First it sorts the suffixes with respect to a fixed length prefix; then it refines each bucket of suffixes sharing the same prefix using the order of already sorted suffixes. Other suffix array construction algorithms follow more complex strategies. Moreover, we achieve a very fast construction for common strings as well as for worst case strings by enhancing our algorithm with further techniques. Copyright (C) 2006 John Wiley & Sons, Ltd
Data Structures for Efficient String Algorithms
This thesis deals with data structures that are mostly useful in the area of string matching and string mining. Our main result is an O(n)-time preprocessing scheme for an array of n numbers such that subsequent queries asking for the position of a minimum element in a specified interval can be answered in constant time (so-called RMQs for Range Minimum Queries). The space for this data structure is 2n+o(n) bits, which is shown to be asymptotically optimal in a general setting. This improves all previous results on this problem. The main techniques for deriving this result rely on combinatorial properties of arrays and so-called Cartesian Trees. For compressible input arrays we show that further space can be saved, while not affecting the time bounds. For the two-dimensional variant of the RMQ-problem we give a preprocessing scheme with quasi-optimal time bounds, but with an asymptotic increase in space consumption of a factor of log(n).
It is well known that algorithms for answering RMQs in constant time are useful for many different algorithmic tasks (e.g., the computation of lowest common ancestors in trees); in the second part of this thesis we give several new applications of the RMQ-problem. We show that our preprocessing scheme for RMQ (and a variant thereof) leads to improvements in the space- and time-consumption of the Enhanced Suffix Array, a collection of arrays that can be used for many tasks in pattern matching. In particular, we will see that in conjunction with the suffix- and LCP-array 2n+o(n) bits of additional space (coming from our RMQ-scheme) are sufficient to find all occ occurrences of a (usually short) pattern of length m in a (usually long) text of length n in O(m*s+occ) time, where s denotes the size of the alphabet. This is certainly optimal if the size of the alphabet is constant; for non-constant alphabets we can improve this to O(m*log(s)+occ) locating time, replacing our original scheme with a data structure of size approximately 2.54n bits. Again by using RMQs, we then show how to solve frequency-related string mining tasks in optimal time. In a final chapter we propose a space- and time-optimal algorithm for computing suffix arrays on texts that are logically divided into words, if one is just interested in finding all word-aligned occurrences of a pattern.
Apart from the theoretical improvements made in this thesis, most of our algorithms are also of practical value; we underline this fact by empirical tests and comparisons on real-word problem instances. In most cases our algorithms outperform previous approaches by all means
Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools
This dissertation focuses on two fundamental sorting problems: string sorting
and suffix sorting. The first part considers parallel string sorting on
shared-memory multi-core machines, the second part external memory suffix
sorting using the induced sorting principle, and the third part distributed
external memory suffix sorting with a new distributed algorithmic big data
framework named Thrill.Comment: 396 pages, dissertation, Karlsruher Instituts f\"ur Technologie
(2018). arXiv admin note: text overlap with arXiv:1101.3448 by other author
Algorithm Engineering for fundamental Sorting and Graph Problems
Fundamental Algorithms build a basis knowledge for every computer science undergraduate or a professional programmer. It is a set of basic techniques one can find in any (good) coursebook on algorithms and data structures. In this thesis we try to close the gap between theoretically worst-case optimal classical algorithms and the real-world circumstances one face under the assumptions imposed by the data size, limited main memory or available parallelism