229,591 research outputs found
MPI-Vector-IO: Parallel I/O and Partitioning for Geospatial Vector Data
In recent times, geospatial datasets are growing in terms of size, complexity and heterogeneity. High performance systems are needed to analyze such data to produce actionable insights in an efficient manner. For polygonal a.k.a vector datasets, operations such as I/O, data partitioning, communication, and load balancing becomes challenging in a cluster environment. In this work, we present MPI-Vector-IO 1 , a parallel I/O library that we have designed using MPI-IO specifically for partitioning and reading irregular vector data formats such as Well Known Text. It makes MPI aware of spatial data, spatial primitives and provides support for spatial data types embedded within collective computation and communication using MPI message-passing library. These abstractions along with parallel I/O support are useful for parallel Geographic Information System (GIS) application development on HPC platforms
Answering Spatial Multiple-Set Intersection Queries Using 2-3 Cuckoo Hash-Filters
We show how to answer spatial multiple-set intersection queries in O(n(log
w)/w + kt) expected time, where n is the total size of the t sets involved in
the query, w is the number of bits in a memory word, k is the output size, and
c is any fixed constant. This improves the asymptotic performance over previous
solutions and is based on an interesting data structure, known as 2-3 cuckoo
hash-filters. Our results apply in the word-RAM model (or practical RAM model),
which allows for constant-time bit-parallel operations, such as bitwise AND,
OR, NOT, and MSB (most-significant 1-bit), as exist in modern CPUs and GPUs.
Our solutions apply to any multiple-set intersection queries in spatial data
sets that can be reduced to one-dimensional range queries, such as spatial join
queries for one-dimensional points or sets of points stored along space-filling
curves, which are used in GIS applications.Comment: Full version of paper from 2017 ACM SIGSPATIAL International
Conference on Advances in Geographic Information System
Preliminary Evaluation of MapReduce for High-Performance Climate Data Analysis
MapReduce is an approach to high-performance analytics that may be useful to data intensive problems in climate research. It offers an analysis paradigm that uses clusters of computers and combines distributed storage of large data sets with parallel computation. We are particularly interested in the potential of MapReduce to speed up basic operations common to a wide range of analyses. In order to evaluate this potential, we are prototyping a series of canonical MapReduce operations over a test suite of observational and climate simulation datasets. Our initial focus has been on averaging operations over arbitrary spatial and temporal extents within Modern Era Retrospective- Analysis for Research and Applications (MERRA) data. Preliminary results suggest this approach can improve efficiencies within data intensive analytic workflows
Load-balanced Range Query Workload Partitioning for Compressed Spatial Hierarchical Bitmap (cSHB) Indexes
abstract: The spatial databases are used to store geometric objects such as points, lines, polygons. Querying such complex spatial objects becomes a challenging task. Index structures are used to improve the lookup performance of the stored objects in the databases, but traditional index structures cannot perform well in case of spatial databases. A significant amount of research is made to ingest, index and query the spatial objects based on different types of spatial queries, such as range, nearest neighbor, and join queries. Compressed Spatial Bitmap Index (cSHB) structure is one such example of indexing and querying approach that supports spatial range query workloads (set of queries). cSHB indexes and many other approaches lack parallel computation. The massive amount of spatial data requires a lot of computation and traditional methods are insufficient to address these issues. Other existing parallel processing approaches lack in load-balancing of parallel tasks which leads to resource overloading bottlenecks.
In this thesis, I propose novel spatial partitioning techniques, Max Containment Clustering and Max Containment Clustering with Separation, to create load-balanced partitions of a range query workload. Each partition takes a similar amount of time to process the spatial queries and reduces the response latency by minimizing the disk access cost and optimizing the bitmap operations. The partitions created are processed in parallel using cSHB indexes. The proposed techniques utilize the block-based organization of bitmaps in the cSHB index and improve the performance of the cSHB index for processing a range query workload.Dissertation/ThesisMasters Thesis Computer Science 201
Enhancing SpatialHadoop with Closest Pair Queries
Given two datasets P and Q, the K Closest Pair Query (KCPQ) finds the K closest pairs of objects from P ×Q. It is an operation widely adopted by many spatial and GIS applications. As a combination of the K Nearest Neighbor (KNN) and the spatial join queries, KCPQ is an expensive operation. Given the increasing volume of spatial data, it is difficult to perform a KCPQ on a centralized machine efficiently. For this reason, this paper addresses the problem of computing the KCPQ on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports spatial operations efficiently, and proposes a novel algorithm in SpatialHadoop to perform efficient parallel KCPQ on large-scale spatial datasets. We have evaluated the performance of the algorithm in several situations with big synthetic and real-world datasets. The experiments have demonstrated the efficiency and scalability of our proposal
Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains
We introduce a class of scalable Bayesian hierarchical models for the
analysis of massive geostatistical datasets. The underlying idea combines ideas
on high-dimensional geostatistics by partitioning the spatial domain and
modeling the regions in the partition using a sparsity-inducing directed
acyclic graph (DAG). We extend the model over the DAG to a well-defined spatial
process, which we call the Meshed Gaussian Process (MGP). A major contribution
is the development of a MGPs on tessellated domains, accompanied by a Gibbs
sampler for the efficient recovery of spatial random effects. In particular,
the cubic MGP (Q-MGP) can harness high-performance computing resources by
executing all large-scale operations in parallel within the Gibbs sampler,
improving mixing and computing time compared to sequential updating schemes.
Unlike some existing models for large spatial data, a Q-MGP facilitates massive
caching of expensive matrix operations, making it particularly apt in dealing
with spatiotemporal remote-sensing data. We compare Q-MGPs with large synthetic
and real world data against state-of-the-art methods. We also illustrate using
Normalized Difference Vegetation Index (NDVI) data from the Serengeti park
region to recover latent multivariate spatiotemporal random effects at millions
of locations. The source code is available at https://github.com/mkln/meshgp
Distance Range Queries in SpatialHadoop
Efficient processing of Distance Range Queries (DRQs) is of great importance in spatial databases due to the wide area of applications. This type of spatial query is characterized by a distance range over one or two datasets. The most representative and known DRQs are the ε Distance Range Query (εDRQ) and the ε Distance Range Join Query (εDRJQ). Given the increasing volume of spatial data, it is difficult to perform a DRQ on a centralized machine efficiently. Moreover, the εDRJQ is an expensive spatial operation, since it can be considered a combination of the εDR and the spatial join queries. For this reason, this paper addresses the problem of computing DRQs on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports spatial operations efficiently, and proposes new algorithms in SpatialHadoop to perform efficient parallel DRQs on large-scale spatial datasets. We have evaluated the performance of the proposed algorithms in several situations with big synthetic and real-world datasets. The experiments have demonstrated the efficiency and scalability of our proposal
Spatial Join with R-Tree on Graphics Processing Units
Spatial operations such as spatial join combine two objects on spatial predicates. It is different from relational join because objects have multi dimensions and spatial join consumes large execution time. Recently, many researches tried to find methods to improve the execution time. Parallel spatial join is one method to improve the execution time. Comparison between objects can be done in parallel. Spatial datasets are large. R-Tree data structure can improve the performance of spatial join.In this paper, a parallel spatial join on Graphic processor unit (GPU) is introduced. The capacity of GPU which has many processors to accelerate the computation is exploited. The experiment is carried out to compare the spatial join between a sequential implementation with C language on CPU and a parallel implementation with CUDA C language on GPU. The result shows that the spatial join on GPU is faster than on a conventional processor
- …