Parallel In-Memory Evaluation of Spatial Joins
The spatial join is a popular operation in spatial database systems and its
evaluation is a well-studied problem. As main memories become bigger and faster
and commodity hardware supports parallel processing, there is a need to revamp
classic join algorithms which have been designed for I/O-bound processing. In
view of this, we study the in-memory and parallel evaluation of spatial joins,
by re-designing a classic partitioning-based algorithm to consider alternative
approaches for space partitioning. Our study shows that, compared to a
straightforward implementation of the algorithm, our tuning can improve
performance significantly. We also show how to select appropriate partitioning
parameters based on data statistics, in order to tune the algorithm for the
given join inputs. Our parallel implementation scales gracefully with the
number of threads, reducing the cost of the join to at most one second even for
join inputs with tens of millions of rectangles.
Comment: Extended version of the SIGSPATIAL'19 paper under the same title
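To make the partitioning-based approach described in this abstract concrete, here is a minimal sketch of a grid-partitioned spatial join over rectangles. It is not the authors' tuned algorithm: the function names, the uniform square grid, the `cell` parameter, and the nested-loop join inside each tile are all assumptions chosen for brevity. Duplicate results, which arise because a rectangle is replicated to every tile it overlaps, are avoided with the well-known reference-point test.

```python
from collections import defaultdict
from itertools import product

# A rectangle is a tuple (xmin, ymin, xmax, ymax). All names are illustrative.

def tiles_for(rect, cell):
    """Return the (col, row) grid tiles that a rectangle overlaps."""
    xmin, ymin, xmax, ymax = rect
    cols = range(int(xmin // cell), int(xmax // cell) + 1)
    rows = range(int(ymin // cell), int(ymax // cell) + 1)
    return product(cols, rows)

def intersects(a, b):
    """True if two closed rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def grid_join(R, S, cell):
    """Return index pairs (i, j) such that R[i] intersects S[j]."""
    part_r, part_s = defaultdict(list), defaultdict(list)
    for i, r in enumerate(R):
        for t in tiles_for(r, cell):
            part_r[t].append(i)
    for j, s in enumerate(S):
        for t in tiles_for(s, cell):
            part_s[t].append(j)

    results = []
    # Tiles are independent of each other, so this loop could be
    # distributed over threads or processes.
    for t, r_ids in part_r.items():
        for i in r_ids:
            for j in part_s.get(t, ()):
                r, s = R[i], S[j]
                if not intersects(r, s):
                    continue
                # Reference point: lower-left corner of the intersection.
                # Report the pair only in the tile that contains it.
                px, py = max(r[0], s[0]), max(r[1], s[1])
                if (int(px // cell), int(py // cell)) == t:
                    results.append((i, j))
    return results

R = [(0, 0, 3, 3), (5, 5, 6, 6)]
S = [(2, 2, 7, 7), (10, 10, 11, 11)]
print(sorted(grid_join(R, S, cell=2)))
```

The choice of `cell` trades replication against per-tile work, which is the kind of parameter the abstract says can be selected from data statistics.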
Thinking spatial
The systems community in both academia and industry has had tremendous success in building widely used general-purpose systems for various types of data and applications. Examples include database systems, big data systems, data streaming systems, and machine learning systems. The vast majority of these systems are ill-equipped to support spatial data. The main reason is that system builders mostly think of spatial data as just one more type of data. Spatial support is treated as an afterthought that can be provided via on-top functions or spatial cartridges added to already built systems. This article advocates that spatial data and applications need to be natively supported in special-purpose systems, where spatial data is treated as a first-class citizen and spatial operations are built inside the engine rather than on top of it. System builders should consider spatial data while building their systems. The article gives examples of five categories of systems, namely, database systems, big data systems, machine learning systems, recommender systems, and social network systems, that would benefit tremendously, in terms of both accuracy and performance, from treating spatial data as an integral part of the system engine.
Two-layer Space-oriented Partitioning for Non-point Data
Non-point spatial objects (e.g., polygons, linestrings, etc.) are ubiquitous.
We study the problem of indexing non-point objects in memory for range queries
and spatial intersection joins. We propose a secondary partitioning technique
for space-oriented partitioning indices (e.g., grids), which improves their
performance significantly, by avoiding the generation and elimination of
duplicate results. Our approach is easy to implement and can be used by any
space-partitioning index to significantly reduce the cost of range queries and
intersection joins. In addition, the secondary partitions can be processed
independently, which makes our method appropriate for distributed and parallel
indexing. Experiments on real datasets confirm the advantage of our approach
against alternative duplicate elimination techniques and data-oriented
state-of-the-art spatial indices. We also show that our partitioning technique,
paired with optimized partition-to-partition join algorithms, typically reduces
the cost of spatial joins by around 50%.
Comment: To appear in the IEEE Transactions on Knowledge and Data Engineering
R*-Grove: Balanced Spatial Partitioning for Large-scale Datasets
The rapid growth of big spatial data urged the research community to develop
several big spatial data systems. Regardless of their architecture, one of the
fundamental requirements of all these systems is to spatially partition the
data efficiently across machines. The core challenges of big spatial
partitioning are building high spatial quality partitions while simultaneously
taking advantage of distributed processing models by providing load-balanced
partitions. Previous work on big spatial partitioning reuses existing
index search trees as-is, e.g., the R-tree family, STR, Kd-tree, and Quad-tree,
by building a temporary tree for a sample of the input and using its leaf nodes
as partition boundaries. However, we show in this paper that none of those
techniques has addressed the mentioned challenges completely. This paper
proposes a novel partitioning method, termed R*-Grove, which can partition very
large spatial datasets into high quality partitions with excellent load balance
and block utilization. This appealing property allows R*-Grove to outperform
existing techniques in spatial query processing. R*-Grove can be easily
integrated into any big data platform, such as Apache Spark or Apache Hadoop.
Our experiments show that R*-Grove outperforms the existing partitioning
techniques for big spatial data systems. With all the proposed work publicly
available as open source, we envision that R*-Grove will be adopted by the
community to better serve big spatial data research.
Comment: 29 pages, to be published in Frontiers in Big Data
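The sample-based baseline that this abstract attributes to previous work (build a temporary structure on a sample and use its leaves as partition boundaries) can be sketched as follows. This is not R*-Grove itself: it is an STR-style cut over sampled points, and the function names, the nearest-box assignment rule, and the `n_parts`, `sample_size` and `seed` parameters are assumptions made for illustration.

```python
import math
import random

def str_boxes(sample, n_parts):
    """Cut sampled points into about n_parts boxes, STR style:
    vertical slabs by x, then horizontal cuts by y inside each slab."""
    slabs = math.ceil(math.sqrt(n_parts))
    by_x = sorted(sample)
    per_slab = math.ceil(len(by_x) / slabs)
    boxes = []
    for s in range(0, len(by_x), per_slab):
        slab = sorted(by_x[s:s + per_slab], key=lambda p: p[1])
        per_box = math.ceil(len(slab) / slabs)
        for b in range(0, len(slab), per_box):
            chunk = slab[b:b + per_box]
            xs = [p[0] for p in chunk]
            ys = [p[1] for p in chunk]
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

def nearest_box(point, boxes):
    """Index of the box closest to the point (distance 0 if inside)."""
    x, y = point
    def dist2(box):
        dx = max(box[0] - x, 0.0, x - box[2])
        dy = max(box[1] - y, 0.0, y - box[3])
        return dx * dx + dy * dy
    return min(range(len(boxes)), key=lambda i: dist2(boxes[i]))

def partition(points, n_parts, sample_size, seed=0):
    """Derive boundaries from a random sample, then assign every point."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    boxes = str_boxes(sample, n_parts)
    loads = [0] * len(boxes)
    for p in points:
        loads[nearest_box(p, boxes)] += 1
    return boxes, loads

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(1000)]
boxes, loads = partition(points, n_parts=9, sample_size=100)
print(len(boxes), loads)
```

The spread of values in `loads` gives a rough feel for the load-balance concern the abstract raises: boundaries fitted to a sample do not guarantee equal-sized partitions on the full input.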
Cost estimation of spatial join in SpatialHadoop
Spatial join is an important operation in geo-spatial applications, since it is frequently used for data analysis involving geographical information. Much effort has been devoted in past decades to providing efficient algorithms for spatial join, and this becomes particularly important as the amount of spatial data to be processed increases. In recent years, the MapReduce approach has become a de facto standard for processing large amounts of data (big data), and some attempts have been made to extend existing frameworks to process spatial data. In this context, several different MapReduce implementations of spatial join have been defined, which mainly differ in the use of a spatial index and in the way this index is built and used. In general, none of these algorithms can be considered better than the others, but the choice might depend on the characteristics of the involved datasets. The aim of this work is to analyse them in depth and define a cost model for ranking them based on the characteristics of the dataset at hand (i.e., selectivity or spatial properties). This cost model has been extensively tested on a set of synthetic datasets in order to demonstrate its effectiveness.
A cost model for spatial join operations in SpatialHadoop
Spatial join is an important operation in geo-spatial applications, since it is frequently used for data analysis involving geographical information. Much effort has been devoted in past decades to providing efficient algorithms for spatial join, and this is particularly important as the amount of spatial data to be processed increases. In recent years, the MapReduce approach has become a de facto standard for processing large amounts of data (big data), and some attempts have been made to extend existing frameworks to process spatial data. In this context, SpatialHadoop is an extension of Apache Hadoop that includes native support for spatial data, in terms of spatial data types, operations and indexes. In particular, it provides five different variants of spatial join, which mainly differ in the use of a spatial index and in the way this index is built and used. In general, none of these algorithms can be considered better than the others, but the choice might depend on the characteristics of the involved datasets. The aim of this work is to analyze the characteristics of these algorithms in depth and to define a cost model for them based on some dataset characteristics (i.e., selectivity or spatial properties). The main goal of the proposed cost model is to rank the spatial join implementations by defining a partial order among them using a dominance relation. This cost model has been extensively tested on a set of synthetic datasets in order to demonstrate its effectiveness.
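The idea of ranking implementations through a dominance relation can be illustrated with a generic Pareto-style sketch. The abstract does not specify how the paper defines dominance, so the definition below (no worse on every cost component, strictly better on at least one) is an assumption, and the algorithm names and cost figures are hypothetical placeholders rather than results from the paper.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and
    strictly better somewhere (assumed definition of dominance)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(costs):
    """Names of the candidates that no other candidate dominates."""
    return sorted(
        name for name, c in costs.items()
        if not any(dominates(other_c, c)
                   for other, other_c in costs.items() if other != name)
    )

# Hypothetical estimated costs per algorithm, e.g. (I/O cost, CPU cost).
costs = {
    "alg_a": (3, 5),
    "alg_b": (4, 6),   # dominated by alg_a
    "alg_c": (2, 9),   # incomparable with alg_a
}
print(non_dominated(costs))
```

Because `alg_a` and `alg_c` are each better on one component, neither dominates the other, which is why such a relation yields a partial order rather than a total ranking.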