Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm and the
recently discovered direction-optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.
Comment: arXiv admin note: text overlap with arXiv:1104.451
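As a point of reference for the abstract above, the following is a minimal single-node sketch of level-synchronous top-down BFS over a CSR graph. It is not the chapter's distributed implementation; the names `row_ptr`, `col_idx`, and `bfs_levels` are illustrative assumptions, and no 1D or 2D decomposition is modelled.

```python
def bfs_levels(row_ptr, col_idx, source):
    """Level-synchronous top-down BFS on a graph stored in CSR form.

    row_ptr[u]:row_ptr[u + 1] delimits the neighbours of u inside col_idx.
    Returns the BFS level of every vertex, or -1 if it is unreachable.
    """
    n = len(row_ptr) - 1
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # Top-down step: every frontier vertex scans its outgoing edges.
        for u in frontier:
            for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        # The frontier swap is the per-level synchronization point.
        frontier = next_frontier
    return level


# Hypothetical undirected graph: 0-1, 0-2, 1-3, 2-3, 3-4; vertex 5 is isolated.
ROW_PTR = [0, 2, 4, 6, 9, 10, 10]
COL_IDX = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
LEVELS = bfs_levels(ROW_PTR, COL_IDX, 0)
```

In a distributed setting the frontier swap would become a communication step; this sketch only shows the per-level structure.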
Efficient Binning for Bitmap Indices on High-Cardinality Attributes
Bitmap indexing is a common technique for indexing high-dimensional data in data warehouses and scientific applications. Though efficient for low-cardinality attributes, query processing can be rather costly for high-cardinality attributes due to the large storage requirements of the bitmap indices. Binning is a common technique for reducing the storage costs of bitmap indices. This technique partitions the attribute values into a number of ranges, called bins, and uses bitmap vectors to represent bins (attribute ranges) rather than distinct values. Although binning may reduce storage costs, it may increase the access costs of queries that do not fall on exact bin boundaries (edge bins). For such queries, the original data values associated with edge bins must be accessed in order to check them against the query constraints. In this paper we study the problem of finding optimal locations for the bin boundaries in order to minimize these access costs subject to storage constraints. We propose a dynamic programming algorithm for optimal partitioning of attribute values into bins that takes into account query access patterns as well as data distribution statistics. Mathematical analysis and experiments on real-life data sets show that the optimal partitioning achieved by this algorithm can lead to a significant improvement in the access costs of bitmap indexing systems for high-cardinality attributes.
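The edge-bin cost described in this abstract can be illustrated with a small sketch. It assumes half-open bins `[boundaries[i], boundaries[i + 1])`, values below the last boundary, and Python integers as bit vectors; the function names are hypothetical, and the paper's dynamic programming algorithm for choosing boundaries is not reproduced here.

```python
from bisect import bisect_right


def build_binned_index(values, boundaries):
    """Build one bitmap per bin; bit r of a bitmap is set if row r falls in that bin."""
    bitmaps = [0] * (len(boundaries) - 1)
    for row, value in enumerate(values):
        bin_id = bisect_right(boundaries, value) - 1
        bitmaps[bin_id] |= 1 << row
    return bitmaps


def range_query(values, boundaries, bitmaps, lo, hi):
    """Return (rows with lo <= value < hi, number of raw values read)."""
    hits = 0
    checks = 0
    for bin_id, bitmap in enumerate(bitmaps):
        bin_lo, bin_hi = boundaries[bin_id], boundaries[bin_id + 1]
        if bin_hi <= lo or bin_lo >= hi:
            continue  # bin does not overlap the query range
        if lo <= bin_lo and bin_hi <= hi:
            hits |= bitmap  # bin lies fully inside: answered from the bitmap alone
            continue
        # Edge bin: each row in it is a candidate whose raw value must be read.
        row, bits = 0, bitmap
        while bits:
            if bits & 1:
                checks += 1
                if lo <= values[row] < hi:
                    hits |= 1 << row
            bits >>= 1
            row += 1
    rows = [r for r in range(len(values)) if (hits >> r) & 1]
    return rows, checks


VALUES = [3, 17, 25, 8, 42, 30, 11]
BOUNDARIES = [0, 10, 20, 30, 50]
BITMAPS = build_binned_index(VALUES, BOUNDARIES)
ALIGNED = range_query(VALUES, BOUNDARIES, BITMAPS, 10, 30)    # falls on bin boundaries
UNALIGNED = range_query(VALUES, BOUNDARIES, BITMAPS, 5, 27)   # touches two edge bins
```

The aligned query is answered without reading any raw values, while the unaligned one reads every row in its two edge bins. That read count is the access cost the boundary placement aims to minimize.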
Query Workload-Aware Index Structures for Range Searches in 1D, 2D, and High-Dimensional Spaces
Most current database management systems are optimized for single query execution.
Yet, often, queries come as part of a query workload. Therefore, there is a need
for index structures that can take into consideration the existence of multiple queries in a
query workload and efficiently produce accurate results for the entire query workload.
These index structures should be scalable to handle large amounts of data as well as
large query workloads.
The main objective of this dissertation is to create and design scalable index structures
that are optimized for range query workloads. Range queries are an important
type of queries with wide-ranging applications. There are no existing index structures
that are optimized for efficient execution of range query workloads. There are
also unique challenges that need to be addressed for range queries in 1D, 2D, and
high-dimensional spaces. In this work, I introduce novel cost models, index selection
algorithms, and storage mechanisms that can tackle these challenges and efficiently
process a given range query workload in 1D, 2D, and high-dimensional spaces. In particular,
I introduce the index structures, HCS (for 1D spaces), cSHB (for 2D spaces),
and PSLSH (for high-dimensional spaces) that are designed specifically to efficiently
handle range query workloads and the unique challenges arising from their respective
spaces. I experimentally show the effectiveness of the above proposed index structures
by comparing with state-of-the-art techniques.
Dissertation/Thesis
Doctoral Dissertation Computer Science 201
Set Representation for Rule Generation Algorithms
Association rule mining has become one of the most widely used pattern discovery methods in Knowledge Discovery in Databases (KDD). One of its sub-tasks is representing itemsets in memory. The representation of an itemset depends largely on the type of data structure used to store it, and the mining computation itself drives the memory and time requirements of the itemsets. As the dimensionality and volume of datasets increase, mining them becomes difficult because the itemsets cannot all be held in main memory. Because the representation of an itemset greatly affects the efficiency of association rule mining, a compact, compressed representation is needed. In this paper, a set representation is introduced that is more memory- and cost-efficient. A bitmap representation takes one byte per element, whereas the set representation uses one bit. The set representation is incorporated into the Apriori algorithm and is also tested with different rule generation algorithms. The complexities of these rule generation algorithms using the set representation are compared in terms of memory and execution time.
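The one-bit-per-element idea can be sketched as follows. This is an illustration under assumed names (`encode`, `support`, `ITEM_INDEX`), not the paper's data structure: each itemset becomes an integer mask, and the support counting used by Apriori-style algorithms reduces to a bitwise subset test.

```python
def encode(itemset, item_index):
    """Pack an itemset into an integer with one bit per distinct item."""
    mask = 0
    for item in itemset:
        mask |= 1 << item_index[item]
    return mask


def support(candidate_mask, transaction_masks):
    """Count the transactions that contain every item of the candidate."""
    return sum(1 for t in transaction_masks if (t & candidate_mask) == candidate_mask)


# Hypothetical transaction database.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"milk", "bread", "eggs"},
]
ITEMS = sorted(set().union(*TRANSACTIONS))           # bread, butter, eggs, milk
ITEM_INDEX = {item: i for i, item in enumerate(ITEMS)}
MASKS = [encode(t, ITEM_INDEX) for t in TRANSACTIONS]
BREAD_MILK = encode({"bread", "milk"}, ITEM_INDEX)
```

With four distinct items, each transaction fits in four bits, and a candidate's support is computed with one AND and one comparison per transaction.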
An Analysis of netCDF-FastBit Integration and Primitive Spatial-Temporal Operations
A process allowing for the intuitive use of SQL queries on dense multidimensional data stored in Network Common Data Format (netCDF) files is developed using advanced bitmap indexing provided by the FastBit bitmap indexing tool. A method for netCDF data extraction and FastBit index creation is presented, and a geospatial range and pseudo-KNN search based on the haversine function is implemented via SQL. A two-step filtering algorithm is shown to greatly enhance the speed of these geospatial queries, allowing for extremely efficient processing of the netCDF data in bitmap-indexed form.
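The abstract does not say what the two filtering steps are. The sketch below assumes a common pattern: a cheap latitude/longitude bounding-box prefilter (the kind of range condition an index can answer), followed by an exact haversine check on the surviving candidates. It is plain Python rather than SQL or FastBit, all names are hypothetical, and longitude wraparound at the antimeridian is not handled.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def range_search(points, lat, lon, radius_km):
    """points: iterable of (name, lat, lon). Returns (names within radius, candidate count)."""
    ang = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(ang)
    cos_lat = math.cos(math.radians(lat))
    if ang < math.pi / 2 and math.sin(ang) < cos_lat:
        dlon = math.degrees(math.asin(math.sin(ang) / cos_lat))
    else:
        dlon = 180.0  # the circle reaches a pole, so longitude cannot be pruned
    # Step 1 (assumed): coarse bounding-box filter using simple range conditions.
    candidates = [p for p in points if abs(p[1] - lat) <= dlat and abs(p[2] - lon) <= dlon]
    # Step 2 (assumed): exact haversine test on the remaining candidates only.
    hits = [name for name, plat, plon in candidates
            if haversine_km(lat, lon, plat, plon) <= radius_km]
    return hits, len(candidates)


# Hypothetical points; "c" passes the box test but fails the exact distance test.
POINTS = [("a", 0.0, 0.5), ("b", 0.0, 2.0), ("c", 0.9, 0.9)]
HITS, N_CANDIDATES = range_search(POINTS, 0.0, 0.0, 120.0)
```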
Re-Pair Compression of Inverted Lists
Compression of inverted lists with methods that support fast intersection
operations is an active research topic. Most compression schemes rely on
encoding differences between consecutive positions with techniques that favor
small numbers. In this paper we explore a completely different alternative: We
use Re-Pair compression of those differences. While Re-Pair by itself offers
fast decompression at arbitrary positions in main and secondary memory, we
introduce variants that in addition speed up the operations required for
inverted list intersection. We compare the resulting data structures with
several recent proposals under various list intersection algorithms, to
conclude that our Re-Pair variants offer an interesting time/space tradeoff for
this problem, yet further improvements are required for it to improve upon the
state of the art.
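For context, the conventional scheme this abstract contrasts itself with, encoding differences between consecutive positions with a code that favors small numbers, can be sketched as gap encoding followed by a variable-byte code. This is the baseline idea only, not Re-Pair; the function names and the byte layout (low 7 bits first, high bit marking the final byte) are assumptions.

```python
def to_gaps(postings):
    """Replace a strictly increasing posting list by differences between neighbours."""
    gaps, prev = [], 0
    for p in postings:
        gaps.append(p - prev)
        prev = p
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: prefix sums restore the original positions."""
    postings, total = [], 0
    for g in gaps:
        total += g
        postings.append(total)
    return postings


def vbyte_encode(numbers):
    """Variable-byte code: 7 payload bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


POSTINGS = [3, 7, 8, 150, 1000]
GAPS = to_gaps(POSTINGS)
ENCODED = vbyte_encode(GAPS)
```

Small gaps cost one byte each, which is why such schemes favor dense lists. Reaching an arbitrary position requires decoding and summing the preceding gaps.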
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094