767 research outputs found

    Combined Data Structure for Previous- and Next-Smaller-Values

    Get PDF
    Let AA be a static array storing nn elements from a totally ordered set. We present a data structure of optimal size at most nlog2(3+22)+o(n)n\log_2(3+2\sqrt{2})+o(n) bits that allows us to answer the following queries on AA in constant time, without accessing AA: (1) previous smaller value queries, where given an index ii, we wish to find the first index to the left of ii where AA is strictly smaller than at ii, and (2) next smaller value queries, which search to the right of ii. As an additional bonus, our data structure also allows to answer a third kind of query: given indices i<ji<j, find the position of the minimum in A[i..j]A[i..j]. Our data structure has direct consequences for the space-efficient storage of suffix trees.Comment: to appear in Theoretical Computer Scienc

    Random Access to Grammar Compressed Strings

    Full text link
    Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let SS be a string of length NN compressed into a context-free grammar S\mathcal{S} of size nn. We present two representations of S\mathcal{S} achieving O(logN)O(\log N) random access time, and either O(nαk(n))O(n\cdot \alpha_k(n)) construction time and space on the pointer machine model, or O(n)O(n) construction time and space on the RAM. Here, αk(n)\alpha_k(n) is the inverse of the kthk^{th} row of Ackermann's function. Our representations also efficiently support decompression of any substring in SS: we can decompress any substring of length mm in the same complexity as a single random access query and additional O(m)O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern PP with at most kk errors in time O(n(min{Pk,k4+P}+logN)+occ)O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ), where occocc is the number of occurrences of PP in SS. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.Comment: Preliminary version in SODA 201

    On the Complexity of the (Approximate) Nearest Colored Node Problem

    Get PDF
    Given a graph G=(V,E) where each vertex is assigned a color from the set C={c_1, c_2, .., c_sigma}. In the (approximate) nearest colored node problem, we want to query, given v in V and c in C, for the (approximate) distance dist^(v, c) from v to the nearest node of color c. For any integer 1 <= k <= log n, we present a Color Distance Oracle (also often referred to as Vertex-label Distance Oracle) of stretch 4k-5 using space O(kn sigma^{1/k}) and query time O(log{k}). This improves the query time from O(k) to O(log{k}) over the best known Color Distance Oracle by Chechik [Chechik, 2012]. We then prove a lower bound in the cell probe model showing that even for unweighted undirected paths any static data structure that uses space S requires at least Omega (log (log{sigma} / log(S/n)+log log{n})) query time to give a distance estimate of stretch O(polylog(n)). This implies for the important case when sigma = Theta(n^{epsilon}) for some constant 0 < epsilon < 1, that our Color Distance Oracle has asymptotically optimal query time in regard to k, and that recent Color Distance Oracles for trees [Tsur, 2018] and planar graphs [Mozes and Skop, 2018] achieve asymptotically optimal query time in regard to n. We also investigate the setting where the data structure additionally has to support color-reassignments. We present the first Color Distance Oracle that achieves query times matching our lower bound from the static setting for large stretch yielding an exponential improvement over the best known query time [Chechik, 2014]. Finally, we give new conditional lower bounds proving the hardness of answering queries if edge insertions and deletion are allowed that strictly improve over recent bounds in time and generality

    Data Structures for Categorical Path Counting Queries

    Get PDF

    GPU accelerating distributed succinct de Bruijn graph construction

    Get PDF
    The research and methods in the field of computational biology have grown in the last decades, thanks to the availability of biological data. One of the applications in computational biology is genome sequencing or sequence alignment, a method to arrange sequences of, for example, DNA or RNA, to determine regions of similarity between these sequences. Sequence alignment applications include public health purposes, such as monitoring antimicrobial resistance. Demand for fast sequence alignment has led to the usage of data structures, such as the de Bruijn graph, to store a large amount of information efficiently. De Bruijn graphs are currently one of the top data structures used in indexing genome sequences, and different methods to represent them have been explored. One of these methods is the BOSS data structure, a special case of Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph. As genomes can take a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark and Graphic Processing Units (GPUs) in genome data processing. This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can provide up to 8 times speedups to specific operations, but for some other operations has severe limitations that limit its processing power in terms of succinct de Bruijn graph index construction

    Space Efficient Encodings for Bit-strings, Range queries and Related Problems

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2016. 2. Srinivasa Rao Satti.In this thesis, we design and implement various space efficient data structures. Most of these structures use spaces close to the information-theoretic lower bound while supporting the queries efficiently. In particular, this thesis is concerned with the data structures for four problems: (i) supporting \rank{} and \select{} queries on compressed bit strings, (ii) nearest larger neighbor problem, (iii) simultaneous encodings for range and next/previous larger/smaller value queries, and (iv) range \topk{} queries on two-dimensional arrays. We first consider practical implementations of \emph{compressed} bitvectors, which support \rank{} and \select{} operations on a given bit-string, while storing the bit-string in compressed form~\cite{DBLP:conf/dcc/JoJORS14}. Our approach relies on \emph{variable-to-fixed} encodings of the bit-string, an approach that has not yet been considered systematically for practical encodings of bitvectors. We show that this approach leads to fast practical implementations with low \emph{redundancy} (i.e., the space used by the bitvector in addition to the compressed representation of the bit-string), and is a flexible and promising solution to the problem of supporting \rank{} and \select{} on moderately compressible bit-strings, such as those encountered in real-world applications. Next, we propose space-efficient data structures for the nearest larger neighbor problem~\cite{IWOCA2014,walcom-JoRS15}. Given a sequence of nn elements from a total order, and a position in the sequence, the nearest larger neighbor (\NLV{}) query returns the position of the element which is closest to the query position, and is larger than the element at the query position. The problem of finding all nearest larger neighbors has attracted interest due to its applications for parenthesis matching and in computational geometry~\cite{AsanoBK09,AsanoK13,BerkmanSV93}. We consider a data structure version of this problem, which is to preprocess a given sequence of elements to construct a data structure that can answer \NLN{} queries efficiently. For one-dimensional arrays, we give time-space tradeoffs for the problem on \textit{indexing model}. For two-dimensional arrays, we give an optimal encoding with constant query on \textit{encoding model}. We also propose space-efficient encodings which support various range queries, and previous and next smaller/larger value queries~\cite{cocoonJS15}. Given a sequence of nn elements from a total order, we obtain a 4.088n+o(n)4.088n + o(n)-bit encoding that supports all these queries where nn is the length of input array. For the case when we need to support all these queries in constant time, we give an encoding that takes 4.585n+o(n)4.585n + o(n) bits. This improves the 5.08n+o(n)5.08n+o(n)-bit encoding obtained by encoding the colored 2d2d-Min and 2d2d-Max heaps proposed by Fischer~\cite{Fischer11}. We extend the original DFUDS~\cite{BDMRRR05} encoding of the colored 2d2d-Min and 2d2d-Max heap that supports the queries in constant time. Then, we combine the extended DFUDS of 2d2d-Min heap and 2d2d-Max heap using the Min-Max encoding of Gawrychowski and Nicholson~\cite{Gawry14} with some modifications. We also obtain encodings that take lesser space and support a subset of these queries. Finally, we consider the various encodings that support range \topk{} queries on a two-dimensional array containing elements from a total order. For an m×nm \times n array, we first propose an optimal encoding for answering one-sided \topk{} queries, whose query range is restricted to [1m][1a][1 \dots m][1 \dots a], for 1an1 \le a \le n. Next, we propose an encoding for the general \topk{} queries that takes m2lg((k+1)nn)+mlgm+o(n)m^2\lg{{(k+1)n \choose n}} + m\lg{m}+o(n) bits. This generalizes the \topk{} encoding of Gawrychowski and Nicholson~\cite{Gawry14}.Chapter 1 Introduction 1 1.1 Computational model 2 1.1.1 Encoding and indexing models 2 1.2 Contribution of the thesis 3 1.3 Organization of the thesis 5 Chapter 2 Preliminaries 7 Chapter 3 Compressed bit vectors based on variable-to-fixed encodings 10 3.1 Introduction 10 3.2 Bit-vectors using V2F coding 14 3.3 V2F compression algorithms for bit-strings 16 3.3.1 Tunstall code 16 3.3.2 Enumerative codes 19 3.3.3 LZW algorithm 23 3.3.4 Empirical evaluation of the compressors 23 3.4 Practical implementation of bitvectors based on V2F compression. 26 3.4.1 Testing Methodology 29 3.4.2 Results of Empirical Evaluation 33 3.5 Future works 35 Chapter 4 Space Efficient Data Structures for Nearest Larger Neighbor 39 4.1 Introduction 39 4.2 Indexing NLV queries on 1D arrays 43 4.3 Encoding NLN queries on2D binary arrays 44 4.4 Encoding NLN queries for general 2D arrays 50 4.4.1 2D NLN in the encoding model–distinct case 50 4.4.2 2D NLN in the encoding model–general case 53 4.5 Open problems 63 Chapter 5 Simultaneous encodings for range and next/previous larger/smaller value queries 64 5.1 Introduction 64 5.2 Preliminaries 67 5.2.1 2d-Min heap 69 5.2.2 Encoding range min-max queries 72 5.3 Extended DFUDS for colored 2d-Min heap 75 5.4 Encoding colored 2d-Min and 2d-Max heaps 80 5.4.1 Combined data structure for DCMin(A) and DCMax(A) 82 5.4.2 Encoding colored 2d-Min and 2d-Max heaps using less space 88 5.5 Open problems 89 Chapter 6 Encoding Two-dimensional range Top-k queries 90 6.1 Introduction 90 6.2 Encoding one-sided range Top-k queries on 2D array 92 6.3 Encoding general range Top-k queries on 2D array 95 6.4 Open problems 99 Chapter 7 Conculsion 100 Bibliography 103 요약 112Docto

    Relative Suffix Trees

    Get PDF
    Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.Peer reviewe
    corecore