Simple Multi-Pass Streaming Algorithms for Skyline Points and Extreme Points
In this paper, we present simple randomized multi-pass streaming algorithms for the fundamental computational geometry problems of finding the skyline (maximal) points and the extreme points of the convex hull. For the skyline problem, one of our algorithms occupies O(h) space and performs O(log n) passes, where h is the number of skyline points. This improves the space bound of the currently best known result, by Das Sarma, Lall, Nanongkai, and Xu [VLDB'09], by a logarithmic factor. For the extreme points problem, we present the first non-trivial result for any constant dimension d greater than two: an O(h log^{O(1)} n)-space, O(log^d n)-pass algorithm, where h is the number of extreme points. Finally, we argue why randomization seems unavoidable for these problems by proving lower bounds on the performance of deterministic algorithms for the related problem of finding maximal elements in a poset.
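For intuition about the pass/space trade-off, here is a minimal sketch of the folklore 2D baseline (an illustration of mine, not the paper's algorithm): each pass over the stream extracts one more skyline point, so it uses O(h) space but h + 1 passes, which is precisely the pass count the paper's randomized O(log n)-pass algorithms improve.

```python
def skyline_multipass(make_stream):
    """Folklore 2D baseline: O(h) space, one skyline point found per pass.
    make_stream() must return a fresh iterator over the same point stream."""
    skyline = []
    while True:
        best = None
        for p in make_stream():  # one full pass over the stream
            # Skip points dominated by (or equal to) an already-found skyline point.
            if any(s[0] >= p[0] and s[1] >= p[1] for s in skyline):
                continue
            # The lexicographic maximum of the remaining points is maximal.
            if best is None or (p[0], p[1]) > (best[0], best[1]):
                best = p
        if best is None:          # no undominated points left
            return skyline
        skyline.append(best)

# Example: (2, 2) is dominated by (3, 3); the other four points are skyline points.
print(skyline_multipass(lambda: iter([(1, 5), (3, 3), (2, 4), (5, 1), (2, 2)])))
```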
I/O-Efficient Planar Range Skyline and Attrition Priority Queues
In the planar range skyline reporting problem, we store a set P of n 2D
points in a structure such that, given a query rectangle Q = [a_1, a_2] x [b_1,
b_2], the maxima (a.k.a. skyline) of P \cap Q can be reported efficiently. The
query is 3-sided if an edge of Q is grounded, giving rise to two variants:
top-open (b_2 = \infty) and left-open (a_1 = -\infty) queries.
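To make the query semantics concrete, the following naive in-memory routine (an illustration of mine, not one of the paper's structures) answers a top-open query Q = [a_1, a_2] x [b_1, \infty) by filtering and a right-to-left sweep:

```python
def top_open_skyline(points, a1, a2, b1):
    """Report the maxima (skyline) of P intersected with [a1, a2] x [b1, +inf).
    Naive O(n log n) reference semantics; the paper's structures answer the
    same query I/O-efficiently."""
    inside = [(x, y) for (x, y) in points if a1 <= x <= a2 and y >= b1]
    inside.sort(key=lambda p: (-p[0], -p[1]))   # scan right to left
    skyline, best_y = [], float("-inf")
    for x, y in inside:
        if y > best_y:       # nothing to its right is higher => maximal
            skyline.append((x, y))
            best_y = y
    return skyline
```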
All our results are in external memory under the O(n/B) space budget, for
both the static and dynamic settings:
* For static P, we give structures that answer top-open queries in O(log_B n + k/B), O(log log_B U + k/B), and O(1 + k/B) I/Os when the universe is R^2, a U x U grid, and a rank-space grid [O(n)]^2, respectively (where k is the number of reported points). The query complexity is optimal in all cases.
* We show that the left-open case is harder: any linear-size structure must incur \Omega((n/B)^\epsilon + k/B) I/Os per query. In fact, this case is as difficult as general 4-sided queries, for which we give a static structure with the optimal query cost O((n/B)^\epsilon + k/B).
* We give a dynamic structure that supports top-open queries in O(log_{2B^\epsilon}(n/B) + k/B^{1-\epsilon}) I/Os and updates in O(log_{2B^\epsilon}(n/B)) I/Os, for any \epsilon satisfying 0 \le \epsilon \le 1. This leads to a dynamic structure for 4-sided queries with optimal query cost O((n/B)^\epsilon + k/B) and amortized update cost O(log(n/B)).
As a contribution of independent interest, we propose an I/O-efficient version of a fundamental structure, the priority queue with attrition (PQA). Our PQA supports FindMin, DeleteMin, and InsertAndAttrite in O(1) worst-case I/Os and O(1/B) amortized I/Os per operation.
We also add a new CatenateAndAttrite operation that catenates two PQAs in O(1) worst-case and O(1/B) amortized I/Os. This operation is a non-trivial extension to the classic PQA of Sundar, even in internal memory.
Comment: Appeared at PODS 2013, New York; 19 pages, 10 figures.
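To make the attrition semantics concrete, here is a minimal internal-memory PQA in the spirit of Sundar's structure (a sketch of mine; the >= attrition convention is an assumption, and none of the I/O-efficiency or CatenateAndAttrite machinery is attempted). Because InsertAndAttrite deletes every stored item at least as large as the new one, the contents stay increasing in insertion order, so a monotone deque suffices:

```python
from collections import deque

class PQASketch:
    """Internal-memory priority queue with attrition.
    Invariant: the deque is strictly increasing from front to back,
    so the front is always the minimum."""

    def __init__(self):
        self._q = deque()

    def find_min(self):
        return self._q[0] if self._q else None

    def delete_min(self):
        return self._q.popleft() if self._q else None

    def insert_and_attrite(self, x):
        # Attrition: drop every stored item >= x (assumed convention).
        while self._q and self._q[-1] >= x:
            self._q.pop()
        self._q.append(x)
```

Each element is pushed and popped at most once, so all three operations take O(1) amortized time; the paper's version achieves O(1) worst-case and O(1/B) amortized I/Os and additionally supports catenation.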
High Throughput Heavy Hitter Aggregation
Heavy hitters are data items that occur at high frequency in a data set. They are among the most important items for an organization to summarize and understand during analytical processing. In data sets with sufficient skew, the number of heavy hitters can be relatively small. We take advantage of this small footprint to compute aggregate functions for the heavy hitters in fast cache memory. We design cache-resident, shared-nothing structures that hold only the most frequent elements from the table. Our approach works in three phases. It first samples and picks heavy hitter candidates. It then builds a hash table and computes the exact aggregates of these candidates. Finally, if necessary, a validation step identifies the true heavy hitters from among the candidates based on the query specification. We identify trade-offs between hash table capacity and performance: capacity determines how many candidates can be aggregated. We optimize performance through perfect hashing and SIMD instructions, using SIMD in novel ways to minimize cache accesses, beyond simple vectorized operations. We use bucketized and cuckoo hash tables to increase capacity and to adapt to different datasets and query constraints. Our method is an order of magnitude faster than in-memory aggregation over a complete set of items when those items cannot be cache resident. Even for item sets that are cache resident, our SIMD techniques enable significant performance improvements over previous work.
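A compact sketch of the three-phase flow (my own Python, with a COUNT aggregate and plain dictionaries standing in for the paper's cache-resident, SIMD-friendly perfect/bucketized/cuckoo hash tables; sample_rate and threshold are illustrative parameters, not from the paper):

```python
import random
from collections import Counter

def heavy_hitter_counts(stream, sample_rate=0.01, threshold=0.001):
    """Three phases: sample candidates, aggregate candidates exactly,
    then validate against the query's frequency threshold."""
    data = list(stream)

    # Phase 1: sample, and keep items frequent within the sample as candidates.
    sample = [x for x in data if random.random() < sample_rate]
    cutoff = max(1, threshold * len(sample))
    candidates = {k for k, c in Counter(sample).items() if c >= cutoff}

    # Phase 2: exact aggregation, but only for candidate keys.
    counts = dict.fromkeys(candidates, 0)
    for x in data:
        if x in counts:
            counts[x] += 1

    # Phase 3: validation -- discard sampling false positives.
    return {k: c for k, c in counts.items() if c >= threshold * len(data)}
```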
The 31st International Very Large Data Bases Conference
A report on the 31st international Very Large Data Bases conference, which has taken place.