23,057 research outputs found

    Partitioning space for range queries

    It is shown that, given a set S of n points in R^3, one can always find three planes that form an eight-partition of S, that is, a partition where at most n/8 points of S lie in each of the eight open regions. This theorem is used to define a data structure, called an octant tree, for representing any point set in R^3. An octant tree for n points occupies O(n) space and can be constructed in polynomial time. With this data structure and its refinements, efficient solutions to various range query problems in 2 and 3 dimensions can be obtained, including (1) half-space queries: find all points of S that lie to one side of any given plane; (2) polyhedron queries: find all points that lie inside (outside) any given polyhedron; and (3) circular queries in R^2: for a planar set S, find all points that lie inside (outside) any given circle. The retrieval time for all these queries is T(n) = O(n^a + m), where a = 0.8988 (or 0.8471 in case (3)) and m is the size of the output. This performance is the best currently known for linear-space data structures which can be deterministically constructed in polynomial time.
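
The eight-partition condition in the abstract above can be checked directly from its definition. The following is a minimal sketch, not the paper's construction: it only verifies whether three given planes form an eight-partition and does not find such planes or build an octant tree. The function names and the plane encoding `(a, b, c, d)` for `a*x + b*y + c*z = d` are assumptions made for illustration.

```python
from collections import Counter

def side(plane, point):
    """Signed value of a*x + b*y + c*z - d; its sign tells which side the point is on."""
    a, b, c, d = plane
    x, y, z = point
    return a * x + b * y + c * z - d

def region_counts(points, planes):
    """Count the points in each of the eight open regions cut out by three planes.

    Points lying exactly on a plane belong to no open region and are skipped.
    """
    counts = Counter()
    for p in points:
        values = [side(pl, p) for pl in planes]
        if any(v == 0 for v in values):
            continue
        counts[tuple(v > 0 for v in values)] += 1
    return counts

def is_eight_partition(points, planes):
    """True if no open region holds more than n/8 of the n points."""
    limit = len(points) / 8
    return all(c <= limit for c in region_counts(points, planes).values())

# Eight cube corners, split by the three coordinate planes.
corners = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
coordinate_planes = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)]
shifted_planes = [(1, 0, 0, 2), (0, 1, 0, 2), (0, 0, 1, 2)]
```

With the coordinate planes, each open octant holds exactly one corner, so the check succeeds; with the shifted planes all eight corners fall into a single region and it fails.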

    Two-layer Space-oriented Partitioning for Non-point Data

    Non-point spatial objects (e.g., polygons, linestrings, etc.) are ubiquitous. We study the problem of indexing non-point objects in memory for range queries and spatial intersection joins. We propose a secondary partitioning technique for space-oriented partitioning indices (e.g., grids), which improves their performance significantly, by avoiding the generation and elimination of duplicate results. Our approach is easy to implement and can be used by any space-partitioning index to significantly reduce the cost of range queries and intersection joins. In addition, the secondary partitions can be processed independently, which makes our method appropriate for distributed and parallel indexing. Experiments on real datasets confirm the advantage of our approach against alternative duplicate elimination techniques and data-oriented state-of-the-art spatial indices. We also show that our partitioning technique, paired with optimized partition-to-partition join algorithms, typically reduces the cost of spatial joins by around 50%. Comment: To appear in the IEEE Transactions on Knowledge and Data Engineering.
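
One way to picture a secondary partitioning that avoids duplicates is sketched below. It is a simplified illustration under stated assumptions, not the authors' implementation: the class name `TwoLayerGrid`, the helper `tile_range`, and the choice to key each tile's secondary partitions by "does the rectangle begin in this tile along x / along y" are all hypothetical. A rectangle is replicated into every tile it overlaps, and a window query skips any secondary partition whose objects would already be reported from an earlier tile.

```python
from collections import defaultdict

def tile_range(lo, hi, cell, ncells):
    """First and last tile index along one axis overlapped by [lo, hi] (clamped to the grid)."""
    first = min(max(int(lo // cell), 0), ncells - 1)
    last = min(max(int(hi // cell), 0), ncells - 1)
    return first, last

class TwoLayerGrid:
    def __init__(self, cell, ncells):
        self.cell, self.ncells = cell, ncells
        # tile (ix, iy) -> secondary partition (begins_in_x, begins_in_y) -> [(id, rect)]
        self.tiles = defaultdict(lambda: defaultdict(list))

    def insert(self, oid, rect):
        xl, yl, xu, yu = rect
        x0, x1 = tile_range(xl, xu, self.cell, self.ncells)
        y0, y1 = tile_range(yl, yu, self.cell, self.ncells)
        for ix in range(x0, x1 + 1):
            for iy in range(y0, y1 + 1):
                # Record whether this is the object's first tile along each axis.
                self.tiles[(ix, iy)][(ix == x0, iy == y0)].append((oid, rect))

    def query(self, window):
        wxl, wyl, wxu, wyu = window
        x0, x1 = tile_range(wxl, wxu, self.cell, self.ncells)
        y0, y1 = tile_range(wyl, wyu, self.cell, self.ncells)
        out = []
        for ix in range(x0, x1 + 1):
            for iy in range(y0, y1 + 1):
                tile = self.tiles.get((ix, iy))
                if tile is None:
                    continue
                for (begins_x, begins_y), objs in tile.items():
                    # An object that began in an earlier tile is only relevant
                    # if the window itself begins in this tile on that axis.
                    if (not begins_x and ix != x0) or (not begins_y and iy != y0):
                        continue
                    for oid, (xl, yl, xu, yu) in objs:
                        if xl <= wxu and xu >= wxl and yl <= wyu and yu >= wyl:
                            out.append(oid)
        return out

grid = TwoLayerGrid(cell=10, ncells=4)
grid.insert("a", (5, 5, 25, 15))     # replicated into six tiles
grid.insert("b", (32, 32, 38, 38))   # lives in a single tile
```

In this sketch each qualifying object is reported from exactly one tile (the first tile, along each axis, shared by the object and the window), so no duplicate-elimination pass is needed after the scan.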

    Optimal column layout for hybrid workloads

    Data-intensive analytical applications need to support both efficient reads and writes. However, a data layout that suits an update-heavy workload is usually not well suited to a read-mostly one, and vice versa. Modern analytical data systems rely on columnar layouts and employ delta stores to inject new data and updates. We show that for hybrid workloads we can achieve close to one order of magnitude better performance by tailoring the column layout design to the data and query workload. Our approach navigates the possible design space of the physical layout: it organizes each column’s data by determining the number of partitions, their corresponding sizes and ranges, and the amount of buffer space and how it is allocated. We frame these design decisions as an optimization problem that, given workload knowledge and performance requirements, provides an optimal physical layout for the workload at hand. To evaluate this work, we build an in-memory storage engine, Casper, and we show that it outperforms state-of-the-art data layouts of analytical systems for hybrid workloads. Casper delivers up to 2.32x higher throughput for update-intensive workloads and up to 2.14x higher throughput for hybrid workloads. We further show how to make data layout decisions robust to workload variation by carefully selecting the input of the optimization. Published version: http://www.vldb.org/pvldb/vol12/p2393-athanassoulis.pdf

    Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster

    The widespread use of GPS-enabled smartphones along with the popularity of micro-blogging and social networking applications, e.g., Twitter and Facebook, has resulted in the generation of huge streams of geo-tagged textual data. Many applications require real-time processing of these streams. For example, location-based e-coupon and ad-targeting systems enable advertisers to register millions of ads to millions of users. The number of users is typically very high and they are continuously moving, and the ads change frequently as well. Hence sending the right ad to the matching users is very challenging. Existing streaming systems are either centralized or are not spatial-keyword aware, and cannot efficiently support the processing of rapidly arriving spatial-keyword data streams. This paper presents Tornado, a distributed spatial-keyword stream processing system. Tornado features routing units to fairly distribute the workload, and furthermore, co-locate the data objects and the corresponding queries at the same processing units. The routing units use the Augmented-Grid, a novel structure that is equipped with an efficient search algorithm for distributing the data objects and queries. Tornado uses evaluators to process the data objects against the queries. The routing units minimize the redundant communication by not sending data updates for processing when these updates do not match any query. By applying dynamically evaluated cost formulae that continuously represent the processing overhead at each evaluator, Tornado is adaptive to changes in the workload. Extensive experimental evaluation using spatio-textual range queries over real Twitter data indicates that Tornado outperforms the non-spatio-textually aware approaches by up to two orders of magnitude in terms of the overall system throughput.

    Multidimensional Range Queries on Modern Hardware

    Range queries over multidimensional data are an important part of database workloads in many applications. Their execution may be accelerated by using multidimensional index structures (MDIS), such as kd-trees or R-trees. As for most index structures, the usefulness of this approach depends on the selectivity of the queries, and conventional wisdom has held that a simple scan beats MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom is largely based on evaluations that are almost two decades old, performed on data being held on disks, applying IO-optimized data structures, and using single-core systems. The question is whether this rule of thumb still holds when multidimensional range queries (MDRQ) are performed on modern architectures with large main memories holding all data, multi-core CPUs and data-parallel instruction sets. In this paper, we study whether and by how much modern hardware influences the performance ratio between index structures and scans for MDRQ. To this end, we conservatively adapted three popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit features of modern servers and compared their performance to different flavors of parallel scans using multiple (synthetic and real-world) analytical workloads over multiple (synthetic and real-world) datasets of varying size, dimensionality, and skew. We find that all approaches benefit considerably from using main memory and parallelization, yet to varying degrees. Our evaluation indicates that, on current machines, scanning should be favored over parallel versions of classical MDIS even for very selective queries.
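
For readers unfamiliar with scan-based MDRQ, the structure of a partitioned scan can be sketched as follows. This is an illustrative assumption, not one of the scan variants evaluated in the paper: the names `scan_chunk` and `parallel_scan` and the default of four workers are hypothetical, and Python threads are used only to show how the work is split (they do not give the multi-core or SIMD speedups the abstract refers to).

```python
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(chunk, lower, upper):
    """Return the points of the chunk that lie inside the box [lower, upper] on every dimension."""
    return [p for p in chunk
            if all(lo <= v <= hi for v, lo, hi in zip(p, lower, upper))]

def parallel_scan(points, lower, upper, workers=4):
    """Split the data into contiguous chunks, scan each chunk independently, and merge."""
    size = max(1, -(-len(points) // workers))   # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(scan_chunk, chunks,
                         [lower] * len(chunks), [upper] * len(chunks))
        # pool.map preserves chunk order, so results keep the input order.
        return [p for part in parts for p in part]

points = [(x, y) for x in range(10) for y in range(10)]
hits = parallel_scan(points, lower=(2, 3), upper=(4, 5))
```

Because every chunk is scanned independently and no index has to be built or maintained, this kind of scan parallelizes trivially, which is the property the comparison against MDIS relies on.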