Space Efficient Encodings for Bit-strings, Range queries and Related Problems
Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2016. Advisor: Srinivasa Rao Satti. In this thesis, we design and implement various space-efficient data structures. Most of these structures use space close to the information-theoretic lower bound while supporting the queries efficiently.
In particular, this thesis is concerned with the data structures for four problems: (i) supporting \rank{} and \select{} queries on compressed bit strings, (ii) nearest larger neighbor problem, (iii) simultaneous encodings for range and next/previous larger/smaller value queries, and (iv) range \topk{} queries on two-dimensional arrays.
We first consider practical implementations of \emph{compressed} bitvectors, which support \rank{} and \select{} operations on a given bit-string, while storing the bit-string in compressed form~\cite{DBLP:conf/dcc/JoJORS14}. Our approach relies on \emph{variable-to-fixed} encodings
of the bit-string, an approach that has not yet been considered systematically for practical encodings of bitvectors. We show that this approach leads to fast practical implementations with low \emph{redundancy} (i.e., the space used by the bitvector in addition to the compressed representation of the bit-string), and is a flexible and promising solution to the problem of supporting
\rank{} and \select{} on moderately compressible bit-strings, such as those encountered in real-world applications.
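To make the rank/select semantics concrete, here is a minimal Python sketch over a plain bit-string; it illustrates only the two operations, not the compressed V2F structure developed in the thesis (the function names are illustrative):

```python
# Minimal illustration of rank/select on a plain bit-string: rank1(i) counts
# 1s in bits[0:i]; select1(k) finds the position of the k-th 1 (1-based).
# A teaching sketch only -- not the compressed V2F bitvector from the thesis.

def build_rank(bits, block=8):
    """Cumulative popcount at each block boundary (the classic rank index)."""
    ranks = [0]
    for start in range(0, len(bits), block):
        ranks.append(ranks[-1] + sum(bits[start:start + block]))
    return ranks

def rank1(bits, ranks, i, block=8):
    """Number of 1 bits in bits[0:i]: block lookup plus an in-block scan."""
    b = i // block
    return ranks[b] + sum(bits[b * block:i])

def select1(bits, k):
    """Position of the k-th 1 bit (k >= 1), by linear scan for simplicity."""
    for pos, bit in enumerate(bits):
        k -= bit
        if bit and k == 0:
            return pos
    raise ValueError("fewer than k ones in the bit-string")

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1]
ranks = build_rank(bits)
print(rank1(bits, ranks, 4))  # 3 ones among the first four bits
print(select1(bits, 4))       # the 4th one sits at index 6
```

The block index keeps rank queries fast without storing per-position counts; compressed bitvectors aim for the same interface with far less redundancy.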
Next, we propose space-efficient data structures for the nearest larger neighbor problem~\cite{IWOCA2014,walcom-JoRS15}. Given a sequence of elements from a total order, and a position in the sequence, the nearest larger neighbor (\NLN{}) query returns the position of the
element which is closest to the query position and is larger than the element at the query position. The
problem of finding all nearest larger neighbors has attracted interest due to its applications for
parenthesis matching and in computational geometry~\cite{AsanoBK09,AsanoK13,BerkmanSV93}.
We consider a data structure version of this problem: preprocess a given sequence of elements to construct a data structure that can answer \NLN{} queries efficiently. For one-dimensional arrays, we give time-space tradeoffs for the problem in the \textit{indexing model}. For two-dimensional arrays, we give an optimal encoding with constant query time in the \textit{encoding model}.
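A naive reference implementation of the one-dimensional query clarifies what the data structures above must answer (a linear-time sketch, not the thesis's succinct solution):

```python
def nearest_larger(A, q):
    """Position of the element nearest to q (in distance) that is strictly
    larger than A[q]; ties broken to the left; None if no such element.
    A naive O(n)-time query -- the thesis answers this from succinct space."""
    n = len(A)
    for d in range(1, n):
        if q - d >= 0 and A[q - d] > A[q]:
            return q - d
        if q + d < n and A[q + d] > A[q]:
            return q + d
    return None

A = [3, 1, 4, 1, 5, 9, 2, 6]
print(nearest_larger(A, 2))  # A[2] = 4, and the nearest larger value is A[4] = 5
print(nearest_larger(A, 5))  # A[5] = 9 is the maximum, so None
```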
We also propose space-efficient encodings which support various range queries, and previous and next smaller/larger value queries~\cite{cocoonJS15}. Given a sequence of elements from a total order, we obtain an encoding that supports all these queries. For the case when we need to support all these queries in constant time, we give an encoding whose space improves on that obtained by encoding the colored 2d-Min and 2d-Max heaps proposed by Fischer~\cite{Fischer11}.
We extend the original DFUDS~\cite{BDMRRR05} encoding of the colored 2d-Min and 2d-Max heaps to support the queries in constant time. Then, we combine the extended DFUDS of the 2d-Min heap and the 2d-Max heap using the Min-Max encoding of Gawrychowski and Nicholson~\cite{Gawry14}
with some modifications. We also obtain encodings that take less space and support a subset of these queries.
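The previous/next smaller value queries (PSV/NSV) underlying these encodings can be computed for all positions with the classic stack pass, sketched here as a plain-Python reference (not the succinct encoding itself):

```python
def prev_smaller(A):
    """PSV[i]: index of the nearest j < i with A[j] < A[i], or -1 if none.
    One left-to-right pass with a stack, O(n) total."""
    stack, psv = [], []
    for i, x in enumerate(A):
        while stack and A[stack[-1]] >= x:
            stack.pop()
        psv.append(stack[-1] if stack else -1)
        stack.append(i)
    return psv

def next_smaller(A):
    """NSV[i]: index of the nearest j > i with A[j] < A[i], or len(A) if none.
    Computed by running prev_smaller over the reversed array."""
    n = len(A)
    rev = prev_smaller(A[::-1])
    return [n - 1 - r if r != -1 else n for r in rev[::-1]]

A = [3, 1, 4, 1, 5]
print(prev_smaller(A))  # [-1, -1, 1, -1, 3]
print(next_smaller(A))  # [1, 5, 3, 5, 5]
```

The stack's contents at any moment are exactly the left-to-right minima seen so far, which is the structure the 2d-Min heap captures.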
Finally, we consider various encodings that support range \topk{} queries on a two-dimensional array containing elements from a total order. We first propose an optimal encoding for answering one-sided \topk{} queries, whose query range is restricted along one side of the array. Next, we propose an encoding for the general \topk{} queries. This generalizes the \topk{} encoding of Gawrychowski and Nicholson~\cite{Gawry14}.

Chapter 1 Introduction
1.1 Computational model
1.1.1 Encoding and indexing models
1.2 Contribution of the thesis
1.3 Organization of the thesis
Chapter 2 Preliminaries
Chapter 3 Compressed bit vectors based on variable-to-fixed encodings
3.1 Introduction
3.2 Bit-vectors using V2F coding
3.3 V2F compression algorithms for bit-strings
3.3.1 Tunstall code
3.3.2 Enumerative codes
3.3.3 LZW algorithm
3.3.4 Empirical evaluation of the compressors
3.4 Practical implementation of bitvectors based on V2F compression
3.4.1 Testing Methodology
3.4.2 Results of Empirical Evaluation
3.5 Future work
Chapter 4 Space Efficient Data Structures for Nearest Larger Neighbor
4.1 Introduction
4.2 Indexing NLV queries on 1D arrays
4.3 Encoding NLN queries on 2D binary arrays
4.4 Encoding NLN queries for general 2D arrays
4.4.1 2D NLN in the encoding model: distinct case
4.4.2 2D NLN in the encoding model: general case
4.5 Open problems
Chapter 5 Simultaneous encodings for range and next/previous larger/smaller value queries
5.1 Introduction
5.2 Preliminaries
5.2.1 2d-Min heap
5.2.2 Encoding range min-max queries
5.3 Extended DFUDS for colored 2d-Min heap
5.4 Encoding colored 2d-Min and 2d-Max heaps
5.4.1 Combined data structure for DCMin(A) and DCMax(A)
5.4.2 Encoding colored 2d-Min and 2d-Max heaps using less space
5.5 Open problems
Chapter 6 Encoding Two-dimensional range Top-k queries
6.1 Introduction
6.2 Encoding one-sided range Top-k queries on 2D array
6.3 Encoding general range Top-k queries on 2D array
6.4 Open problems
Chapter 7 Conclusion
Bibliography
Abstract (in Korean)
Optimal Encodings for Range Min-Max and Top-k
In this paper we consider various encoding problems for range queries on arrays. In these problems, the goal is that the encoding occupies the information-theoretic minimum space required to answer a particular set of range queries. Given an array A, a range top-k query on an arbitrary range A[i, j] asks us to return the ordered set of indices of the k largest elements in A[i, j]. We present optimal encodings for range top-k queries, as well as for a new problem which we call range min-max, in which the goal is to return the indices of both the minimum and maximum elements in a range.
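As a reference for the query semantics, a range min-max query can be answered naively by a linear scan (this sketch is only a stand-in for the succinct encodings the paper constructs):

```python
def range_min_max(A, i, j):
    """Indices of the minimum and maximum of A[i..j] (inclusive), by linear
    scan -- a reference implementation of the query the encodings answer."""
    idx = range(i, j + 1)
    return min(idx, key=lambda t: A[t]), max(idx, key=lambda t: A[t])

A = [2, 7, 1, 8, 2, 8]
print(range_min_max(A, 1, 4))  # (2, 3): A[2] = 1 is the min, A[3] = 8 the max
```

Note the encoding never needs the values of A, only enough order information to recover the argmin/argmax positions, which is why sub-array-size space is possible.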
Asymptotically Optimal Encodings of Range Data Structures for Selection and Top-k Queries
Given an array A[1, n] of elements with a total order, we consider the problem of building a
data structure that solves two queries: (a) selection queries receive a range [i, j] and an integer
k and return the position of the kth largest element in A[i, j]; (b) top-k queries receive [i, j] and
k and return the positions of the k largest elements in A[i, j]. These problems can be solved in
optimal time, O(1 + lg k/ lg lg n) and O(k), respectively, using linear-space data structures.
We provide the first study of the encoding data structures for the above problems, where A
cannot be accessed at query time. Several applications are interested in the relative order of the
entries of A, and their positions, rather than their actual values, and thus we do not need to keep A
at query time. In those cases, encodings save storage space: we first show that any encoding
answering such queries requires n lg k − O(n + k lg k) bits of space; then, we design encodings
using O(n lg k) bits, that is, asymptotically optimal up to constant factors, while preserving
optimal query time.
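The two query types can be stated precisely with a short reference implementation (a naive heap-based version, not the paper's encoding; note that selection is just the last entry of a top-k answer):

```python
import heapq

def range_topk(A, i, j, k):
    """Positions of the k largest elements of A[i..j] (inclusive), largest
    first -- a naive O((j - i) log k) reference for the query semantics."""
    return [p for _, p in heapq.nlargest(k, ((A[p], p) for p in range(i, j + 1)))]

def range_select(A, i, j, k):
    """Position of the k-th largest element in A[i..j] (a selection query)."""
    return range_topk(A, i, j, k)[-1]

A = [5, 1, 9, 3, 7, 2]
print(range_topk(A, 1, 5, 2))    # [2, 4]: A[2] = 9 and A[4] = 7
print(range_select(A, 1, 5, 2))  # 4: the 2nd largest in A[1..5] is A[4] = 7
```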
POPE: Partial Order Preserving Encoding
Recently there has been much interest in performing search queries over
encrypted data to enable functionality while protecting sensitive data. One
particularly efficient mechanism for executing such queries is order-preserving
encryption/encoding (OPE) which results in ciphertexts that preserve the
relative order of the underlying plaintexts thus allowing range and comparison
queries to be performed directly on ciphertexts. In this paper, we propose an
alternative approach to range queries over encrypted data that is optimized to
support insert-heavy workloads as are common in "big data" applications while
still maintaining search functionality and achieving stronger security.
Specifically, we propose a new primitive called partial order preserving
encoding (POPE) that achieves ideal OPE security with frequency hiding and also
leaves a sizable fraction of the data pairwise incomparable. Using only O(1)
persistent and O(n^ε) non-persistent client storage for 0 < ε < 1, our POPE
scheme provides extremely fast batch insertion consisting of a single round,
and efficient search with O(1) amortized cost for up to O(n^(1-ε)) search
queries. This improved security and performance makes our scheme better
suited for today's insert-heavy databases.

Comment: Appears in ACM CCS 2016 Proceedings.
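The order-preserving property that makes ciphertext-side range queries possible can be shown with a toy monotone encoding (this illustrates the OPE property only; it is not the POPE construction and provides no real security, since the whole mapping is built up front):

```python
import random

def ope_table(domain, seed=42, max_gap=1000):
    """Toy order-preserving encoding: each plaintext gets a ciphertext
    strictly larger than its predecessor's, so plaintext order and
    ciphertext order coincide. NOT the POPE scheme -- a property demo only."""
    rng = random.Random(seed)
    table, code = {}, 0
    for x in sorted(domain):
        code += rng.randint(1, max_gap)  # strictly positive gap keeps order
        table[x] = code
    return table

enc = ope_table(range(10))
# Any comparison on ciphertexts gives the same answer as on plaintexts,
# which is exactly what lets a server run range queries without decrypting.
print(all(enc[a] < enc[b] for a in range(9) for b in range(a + 1, 10)))
```

POPE weakens exactly this property: it leaves many pairs incomparable until queries force an order, which is where the stronger security comes from.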
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
Motivation
The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for
compression and indexing of text data, but the cost of computing the BWT of
very large string collections has prevented these techniques from being widely
applied to the large sets of sequences often encountered as the outcome of DNA
sequencing experiments. In previous work, we presented a novel algorithm that
allows the BWT of human genome scale data to be computed on very moderate
hardware, thus enabling us to investigate the BWT as a tool for the compression
of such datasets.
Results
We first used simulated reads to explore the relationship between the level
of compression and the error rate, the length of the reads and the level of
sampling of the underlying genome and compare choices of second-stage
compression algorithm.
We demonstrate that compression may be greatly improved by a particular
reordering of the sequences in the collection and give a novel `implicit
sorting' strategy that enables these benefits to be realised without the
overhead of sorting the reads. With these techniques, a 45x coverage of real
human genome sequence data compresses losslessly to under 0.5 bits per base,
allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming
a small proportion of low-quality bases from the reads improves the compression
still further).
This is more than 4 times smaller than the size achieved by a standard
BWT-based compressor (bzip2) on the untrimmed reads, but an important further
advantage of our approach is that it facilitates the building of compressed
full text indexes such as the FM-index on large-scale DNA sequence collections.

Comment: Version here is as submitted to Bioinformatics and is the same as the previously archived version. This submission registers the fact that the advance access version is now available at http://bioinformatics.oxfordjournals.org/content/early/2012/05/02/bioinformatics.bts173.abstract. Bioinformatics should be considered as the original place of publication of this article; please cite accordingly.
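The transform itself is simple to state; here is the textbook rotation-sort version (the paper's contribution is precisely avoiding this quadratic construction at genome scale):

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations: an O(n^2 log n)
    teaching version. Materializing all rotations is impossible for
    genome-scale collections, which is the problem the paper addresses."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("banana"))  # annb$aa -- runs of equal characters aid compression
```

The output tends to group equal characters into runs, which is why a second-stage compressor (or a reordering of the input reads, as the paper shows) pays off so well.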
CubiST++: Evaluating Ad-Hoc CUBE Queries Using Statistics Trees
We report on a new, efficient encoding for the data cube, which results in a drastic speed-up of OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes. We are focusing on a class of queries called cube queries, which return aggregated values rather than sets of tuples. Our approach, termed CubiST++ (Cubing with Statistics Trees Plus Families), represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the view lattice to compute and materialize new views from existing views in some heuristic fashion. Instead, CubiST++ encodes all possible aggregate views in the leaves of a new data structure called statistics tree (ST) during a one-time scan of the detailed data. In order to optimize the queries involving constraints on hierarchy levels of the underlying dimensions, we select and materialize a family of candidate trees, which represent superviews over the different hierarchical levels of the dimensions. Given a query, our query evaluation algorithm selects the smallest tree in the family, which can provide the answer. Extensive evaluations of our prototype implementation have demonstrated its superior run-time performance and scalability when compared with existing MOLAP and ROLAP systems
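The idea of encoding all aggregate views during one scan of the detailed data can be sketched with a dictionary-based toy (a stand-in for the statistics-tree structure, not CubiST++ itself; the dimension and measure names are illustrative):

```python
from itertools import combinations

def cube_aggregates(rows, dims, measure):
    """Precompute SUM views for every subset of dimensions in a single scan
    of the detailed data -- the data-cube idea of materializing all
    aggregate views up front, in toy dictionary form."""
    views = {s: {} for r in range(len(dims) + 1) for s in combinations(dims, r)}
    for row in rows:
        for subset, view in views.items():
            key = tuple(row[d] for d in subset)
            view[key] = view.get(key, 0) + row[measure]
    return views

rows = [{"region": "E", "year": 2020, "sales": 10},
        {"region": "E", "year": 2021, "sales": 5},
        {"region": "W", "year": 2020, "sales": 7}]
views = cube_aggregates(rows, ("region", "year"), "sales")
print(views[("region",)][("E",)])  # 15: sales aggregated over all years
print(views[()][()])               # 22: the grand total
```

After the one-time scan, any cube query over these dimensions is a lookup, with no need to recompute from the base tuples.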
Decoding billions of integers per second through vectorization
In many important applications -- such as search engines and relational
database systems -- data is stored in the form of arrays of integers. Encoding
and, most importantly, decoding of these arrays consumes considerable CPU time.
Therefore, substantial effort has been made to reduce costs associated with
compression and decompression. In particular, researchers have exploited the
superscalar nature of modern processors and SIMD instructions. Nevertheless, we
introduce a novel vectorized scheme called SIMD-BP128 that improves over
previously proposed vectorized approaches. It is nearly twice as fast as the
previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the
same time, SIMD-BP128 saves up to 2 bits per integer. For even better
compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has
a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while
being two times faster during decoding.

Comment: For software, see https://github.com/lemire/FastPFor; for data, see http://boytsov.info/datasets/clueweb09gap
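The core idea behind these schemes, delta-encoding a sorted list and binary-packing the small gaps into a fixed bit width, can be shown in scalar form (SIMD-BP128 applies the same packing to 128 integers at a time in SIMD registers; this sketch is not that implementation):

```python
def delta_encode(sorted_ints):
    """Store the first value, then the gaps -- gaps are small and pack tightly."""
    return [sorted_ints[0]] + [b - a for a, b in zip(sorted_ints, sorted_ints[1:])]

def pack(values, width):
    """Binary-pack each value into `width` bits of one big integer
    (scalar bit packing, the building block of schemes like SIMD-BP128)."""
    out = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << width), "value does not fit in the bit width"
        out |= v << (i * width)
    return out

def unpack(packed, width, n):
    """Recover n values of `width` bits each from the packed integer."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(n)]

docs = [3, 7, 8, 15, 16]   # e.g. a sorted posting list of document ids
gaps = delta_encode(docs)  # [3, 4, 1, 7, 1] -- every gap fits in 3 bits
assert unpack(pack(gaps, 3), 3, len(gaps)) == gaps
```

Five 3-bit gaps occupy 15 bits instead of five full machine words, and the decode loop is branch-free arithmetic, which is what makes the vectorized versions so fast.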
RadioGalaxyNET: Dataset and Novel Computer Vision Algorithms for the Detection of Extended Radio Galaxies and Infrared Hosts
Creating radio galaxy catalogues from next-generation deep surveys requires
automated identification of associated components of extended sources and their
corresponding infrared hosts. In this paper, we introduce RadioGalaxyNET, a
multimodal dataset, and a suite of novel computer vision algorithms designed to
automate the detection and localization of multi-component extended radio
galaxies and their corresponding infrared hosts. The dataset comprises 4,155
instances of galaxies in 2,800 images with both radio and infrared channels.
Each instance provides information about the extended radio galaxy class, its
corresponding bounding box encompassing all components, the pixel-level
segmentation mask, and the keypoint position of its corresponding infrared host
galaxy. RadioGalaxyNET is the first dataset to include images from the highly
sensitive Australian Square Kilometre Array Pathfinder (ASKAP) radio telescope,
corresponding infrared images, and instance-level annotations for galaxy
detection. We benchmark several object detection algorithms on the dataset and
propose a novel multimodal approach to simultaneously detect radio galaxies and
the positions of infrared hosts.

Comment: Accepted for publication in PASA. The paper has 17 pages, 6 figures, 5 tables.
Prime Number-Based Hierarchical Data Labeling Scheme for Relational Databases
Hierarchical data structures are an important aspect of many computer science fields including data mining, terrain modeling, and image analysis. A good representation of such data accurately captures the parent-child and ancestor-descendant relationships between nodes. There exist a number of different ways to capture and manage hierarchical data while preserving such relationships. For instance, one may use a custom system designed for a specific kind of hierarchy. Object-oriented databases may also be used to model hierarchical data. Relational database systems, on the other hand, add the additional benefits of mature mathematical theory, reliable implementations, superior functionality, and scalability. Relational databases were not originally designed with hierarchical data management in mind. As a result, abstract hierarchy information cannot be natively stored in database relations. Database labeling schemes resolve this issue by labeling all nodes in a way that reveals their relationships. Labels usually encode the node's position in a hierarchy as a number or a string that can be stored, indexed, searched, and retrieved from a database. Many different labeling schemes have been developed in the past. All of them may be classified into three broad categories: recursive expansion, materialized path, and nested sets. Each model has its strengths and weaknesses, and each model implementation attempts to reduce the number of weaknesses inherent to the respective model. One of the most prominent implementations of the materialized path model uses the unique characteristics of prime numbers for its labeling purposes. However, the performance and space utilization of this prime number labeling scheme could be significantly improved. This research introduces a new scheme called reusable prime number labeling (rPNL) that reduces the effects of the mentioned weaknesses. The advantages of the proposed scheme are discussed in detail, proven mathematically, and confirmed experimentally.
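The basic prime-number labeling idea the abstract builds on is easy to demonstrate: give each node its parent's label times a fresh prime, and ancestor tests become divisibility checks (a sketch of the base scheme, not the improved rPNL variant the research proposes):

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

def assign_labels(children, root):
    """Materialized-path labeling with primes: a node's label is its parent's
    label times a fresh prime, so label(a) divides label(b) exactly when a is
    an ancestor of b (or a == b). Base prime scheme only, not rPNL."""
    labels, fresh = {root: 1}, iter(PRIMES)
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            labels[child] = labels[node] * next(fresh)
            stack.append(child)
    return labels

def is_ancestor(labels, a, b):
    """True iff a is an ancestor of b (ancestor-or-self), via divisibility."""
    return labels[b] % labels[a] == 0

children = {"root": ["a", "b"], "a": ["a1", "a2"]}
labels = assign_labels(children, "root")
print(is_ancestor(labels, "root", "a1"))  # True
print(is_ancestor(labels, "b", "a1"))     # False
```

Since labels are plain integers, they can be stored and indexed in an ordinary relational column; the weakness rPNL targets is that these products grow quickly, wasting space, which reusing primes mitigates.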