3,114 research outputs found
Incremental elasticity for array databases
Relational databases benefit significantly from elasticity, whereby they execute on a set of changing hardware resources provisioned to match their storage and processing requirements. Such flexibility is especially attractive for scientific databases because their users often have a no-overwrite storage model, in which they delete data only when their available space is exhausted. This results in a database that is regularly growing and expanding its hardware proportionally. Also, scientific databases frequently store their data as multidimensional arrays optimized for spatial querying. This brings about several novel challenges in clustered, skew-aware data placement on an elastic shared-nothing database. In this work, we design and implement elasticity for an array database. We address this challenge on two fronts: determining when to expand a database cluster and how to partition the data within it. In both steps we propose incremental approaches, affecting a minimum set of data and nodes, while maintaining high performance. We introduce an algorithm for gradually augmenting an array database's hardware using a closed-loop control system. After the cluster adds nodes, we optimize data placement for n-dimensional arrays. Many of our elastic partitioners incrementally reorganize an array, redistributing data only to new nodes. By combining these two tools, the scientific database efficiently and seamlessly manages its monotonically increasing hardware resources.Intel Corporation (Science and Technology Center for Big Data
Distributed Caching for Processing Raw Arrays
As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format---without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer is determined. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority - by as much as two orders of magnitude - of the proposed framework over existing techniques in terms of cache overhead and workload execution time
Evaluating the Impact of Data Placement to Spark and SciDB with an Earth Science Use Case
We investigate the impact of data placement for two Big Data technologies, Spark and SciDB, with a use case from Earth Science where data arrays are multidimensional. Simultaneously, this investigation provides an opportunity to evaluate the performance of the technologies involved. Two datastores, HDFS and Cassandra, are used with Spark for our comparison. It is found that Spark with Cassandra performs better than with HDFS, but SciDB performs better yet than Spark with either datastore. The investigation also underscores the value of having data aligned for the most common analysis scenarios in advance on a shared nothing architecture. Otherwise, repartitioning needs to be carried out on the fly, degrading overall performance
Distributed Caching for Complex Querying of Raw Arrays
As applications continue to generate multi-dimensional data at exponentially
increasing rates, fast analytics to extract meaningful results is becoming
extremely important. The database community has developed array databases that
alleviate this problem through a series of techniques. In-situ mechanisms
provide direct access to raw data in the original format---without loading and
partitioning. Parallel processing scales to the largest datasets. In-memory
caching reduces latency when the same data are accessed across a workload of
queries. However, we are not aware of any work on distributed caching of
multi-dimensional raw arrays. In this paper, we introduce a distributed
framework for cost-based caching of multi-dimensional arrays in native format.
Given a set of files that contain portions of an array and an online query
workload, the framework computes an effective caching plan in two stages.
First, the plan identifies the cells to be cached locally from each of the
input files by continuously refining an evolving R-tree index. In the second
stage, an optimal assignment of cells to nodes that collocates dependent cells
in order to minimize the overall data transfer is determined. We design cache
eviction and placement heuristic algorithms that consider the historical query
workload. A thorough experimental evaluation over two real datasets in three
file formats confirms the superiority -- by as much as two orders of magnitude
-- of the proposed framework over existing techniques in terms of cache
overhead and workload execution time
- …