767 research outputs found
Combined Data Structure for Previous- and Next-Smaller-Values
Let be a static array storing elements from a totally ordered set. We
present a data structure of optimal size at most
bits that allows us to answer the following queries on in constant time,
without accessing : (1) previous smaller value queries, where given an index
, we wish to find the first index to the left of where is strictly
smaller than at , and (2) next smaller value queries, which search to the
right of . As an additional bonus, our data structure also allows to answer
a third kind of query: given indices , find the position of the minimum in
. Our data structure has direct consequences for the space-efficient
storage of suffix trees.Comment: to appear in Theoretical Computer Scienc
Random Access to Grammar Compressed Strings
Grammar based compression, where one replaces a long string by a small
context-free grammar that generates the string, is a simple and powerful
paradigm that captures many popular compression schemes. In this paper, we
present a novel grammar representation that allows efficient random access to
any character or substring without decompressing the string.
Let be a string of length compressed into a context-free grammar
of size . We present two representations of
achieving random access time, and either
construction time and space on the pointer machine model, or
construction time and space on the RAM. Here, is the inverse of
the row of Ackermann's function. Our representations also efficiently
support decompression of any substring in : we can decompress any substring
of length in the same complexity as a single random access query and
additional time. Combining these results with fast algorithms for
uncompressed approximate string matching leads to several efficient algorithms
for approximate string matching on grammar-compressed strings without
decompression. For instance, we can find all approximate occurrences of a
pattern with at most errors in time , where is the number of occurrences of in . Finally, we
generalize our results to navigation and other operations on grammar-compressed
ordered trees.
All of the above bounds significantly improve the currently best known
results. To achieve these bounds, we introduce several new techniques and data
structures of independent interest, including a predecessor data structure, two
"biased" weighted ancestor data structures, and a compact representation of
heavy paths in grammars.Comment: Preliminary version in SODA 201
On the Complexity of the (Approximate) Nearest Colored Node Problem
Given a graph G=(V,E) where each vertex is assigned a color from the set C={c_1, c_2, .., c_sigma}. In the (approximate) nearest colored node problem, we want to query, given v in V and c in C, for the (approximate) distance dist^(v, c) from v to the nearest node of color c. For any integer 1 <= k <= log n, we present a Color Distance Oracle (also often referred to as Vertex-label Distance Oracle) of stretch 4k-5 using space O(kn sigma^{1/k}) and query time O(log{k}). This improves the query time from O(k) to O(log{k}) over the best known Color Distance Oracle by Chechik [Chechik, 2012].
We then prove a lower bound in the cell probe model showing that even for unweighted undirected paths any static data structure that uses space S requires at least Omega (log (log{sigma} / log(S/n)+log log{n})) query time to give a distance estimate of stretch O(polylog(n)). This implies for the important case when sigma = Theta(n^{epsilon}) for some constant 0 < epsilon < 1, that our Color Distance Oracle has asymptotically optimal query time in regard to k, and that recent Color Distance Oracles for trees [Tsur, 2018] and planar graphs [Mozes and Skop, 2018] achieve asymptotically optimal query time in regard to n.
We also investigate the setting where the data structure additionally has to support color-reassignments. We present the first Color Distance Oracle that achieves query times matching our lower bound from the static setting for large stretch yielding an exponential improvement over the best known query time [Chechik, 2014]. Finally, we give new conditional lower bounds proving the hardness of answering queries if edge insertions and deletion are allowed that strictly improve over recent bounds in time and generality
GPU accelerating distributed succinct de Bruijn graph construction
The research and methods in the field of computational biology have grown in the last decades, thanks to the availability of biological data. One of the applications in computational biology is genome sequencing or sequence alignment, a method to arrange sequences of, for example, DNA or RNA, to determine regions of similarity between these sequences. Sequence alignment applications include public health purposes, such as monitoring antimicrobial resistance.
Demand for fast sequence alignment has led to the usage of data structures, such as the de Bruijn graph, to store a large amount of information efficiently. De Bruijn graphs are currently one of the top data structures used in indexing genome sequences, and different methods to represent them have been explored. One of these methods is the BOSS data structure, a special case of Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph.
As genomes can take a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark and Graphic Processing Units (GPUs) in genome data processing.
This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can provide up to 8 times speedups to specific operations, but for some other operations has severe limitations that limit its processing power in terms of succinct de Bruijn graph index construction
Space Efficient Encodings for Bit-strings, Range queries and Related Problems
학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2016. 2. Srinivasa Rao Satti.In this thesis, we design and implement various space efficient data structures. Most of these structures use spaces close to the information-theoretic lower bound while supporting the queries efficiently.
In particular, this thesis is concerned with the data structures for four problems: (i) supporting \rank{} and \select{} queries on compressed bit strings, (ii) nearest larger neighbor problem, (iii) simultaneous encodings for range and next/previous larger/smaller value queries, and (iv) range \topk{} queries on two-dimensional arrays.
We first consider practical implementations of \emph{compressed} bitvectors, which support \rank{} and \select{} operations on a given bit-string, while storing the bit-string in compressed form~\cite{DBLP:conf/dcc/JoJORS14}. Our approach relies on \emph{variable-to-fixed} encodings
of the bit-string, an approach that has not yet been considered systematically for practical encodings of bitvectors. We show that this approach leads to fast practical implementations with low \emph{redundancy} (i.e., the space used by the bitvector in addition to the compressed representation of the bit-string), and is a flexible and promising solution to the problem of supporting
\rank{} and \select{} on moderately compressible bit-strings, such as those encountered in real-world applications.
Next, we propose space-efficient data structures for the nearest larger neighbor problem~\cite{IWOCA2014,walcom-JoRS15}. Given a sequence of elements from a total order, and a position in the sequence, the nearest larger neighbor (\NLV{}) query returns the position of the
element which is closest to the query position, and is larger than the element at the query position. The
problem of finding all nearest larger neighbors has attracted interest due to its applications for
parenthesis matching and in computational geometry~\cite{AsanoBK09,AsanoK13,BerkmanSV93}.
We consider a data structure version of this problem, which is to preprocess a given sequence of elements to construct a data structure that can answer \NLN{} queries efficiently. For one-dimensional arrays, we give time-space tradeoffs for the problem on \textit{indexing model}. For two-dimensional arrays, we give an optimal encoding with constant query on \textit{encoding model}.
We also propose space-efficient encodings which support various range queries, and previous and next smaller/larger value queries~\cite{cocoonJS15}. Given a sequence of elements from a total order, we obtain a -bit encoding that supports all these queries where is the length
of input array. For the case when we need to support all these queries in constant time, we give an encoding that takes bits. This improves the -bit encoding obtained by encoding the colored -Min and -Max heaps proposed by Fischer~\cite{Fischer11}.
We extend the original DFUDS~\cite{BDMRRR05} encoding of the colored -Min and -Max heap that supports the queries in constant time. Then, we combine the extended DFUDS of -Min heap and -Max heap using the Min-Max encoding of Gawrychowski and Nicholson~\cite{Gawry14}
with some modifications. We also obtain encodings that take lesser space and support a subset of these queries.
Finally, we consider the various encodings that support range \topk{} queries on a two-dimensional array containing elements from a total order. For an array, we first propose an optimal encoding for answering one-sided \topk{} queries, whose query range is restricted to , for . Next, we propose an encoding for the general \topk{} queries that takes
bits. This generalizes the \topk{} encoding of Gawrychowski and Nicholson~\cite{Gawry14}.Chapter 1 Introduction 1
1.1 Computational model 2
1.1.1 Encoding and indexing models 2
1.2 Contribution of the thesis 3
1.3 Organization of the thesis 5
Chapter 2 Preliminaries 7
Chapter 3 Compressed bit vectors based on variable-to-fixed encodings 10
3.1 Introduction 10
3.2 Bit-vectors using V2F coding 14
3.3 V2F compression algorithms for bit-strings 16
3.3.1 Tunstall code 16
3.3.2 Enumerative codes 19
3.3.3 LZW algorithm 23
3.3.4 Empirical evaluation of the compressors 23
3.4 Practical implementation of bitvectors based on V2F compression. 26
3.4.1 Testing Methodology 29
3.4.2 Results of Empirical Evaluation 33
3.5 Future works 35
Chapter 4 Space Efficient Data Structures for Nearest Larger Neighbor 39
4.1 Introduction 39
4.2 Indexing NLV queries on 1D arrays 43
4.3 Encoding NLN queries on2D binary arrays 44
4.4 Encoding NLN queries for general 2D arrays 50
4.4.1 2D NLN in the encoding model–distinct case 50
4.4.2 2D NLN in the encoding model–general case 53
4.5 Open problems 63
Chapter 5 Simultaneous encodings for range and next/previous larger/smaller value queries 64
5.1 Introduction 64
5.2 Preliminaries 67
5.2.1 2d-Min heap 69
5.2.2 Encoding range min-max queries 72
5.3 Extended DFUDS for colored 2d-Min heap 75
5.4 Encoding colored 2d-Min and 2d-Max heaps 80
5.4.1 Combined data structure for DCMin(A) and DCMax(A) 82
5.4.2 Encoding colored 2d-Min and 2d-Max heaps using less space 88
5.5 Open problems 89
Chapter 6 Encoding Two-dimensional range Top-k queries 90
6.1 Introduction 90
6.2 Encoding one-sided range Top-k queries on 2D array 92
6.3 Encoding general range Top-k queries on 2D array 95
6.4 Open problems 99
Chapter 7 Conculsion 100
Bibliography 103
요약 112Docto
Relative Suffix Trees
Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.Peer reviewe
- …