5,566 research outputs found

    Range Queries on Uncertain Data

    Full text link
    Given a set PP of nn uncertain points on the real line, each represented by its one-dimensional probability density function, we consider the problem of building data structures on PP to answer range queries of the following three types for any query interval II: (1) top-11 query: find the point in PP that lies in II with the highest probability, (2) top-kk query: given any integer k≤nk\leq n as part of the query, return the kk points in PP that lie in II with the highest probabilities, and (3) threshold query: given any threshold τ\tau as part of the query, return all points of PP that lie in II with probabilities at least τ\tau. We present data structures for these range queries with linear or nearly linear space and efficient query time.Comment: 26 pages. A preliminary version of this paper appeared in ISAAC 2014. In this full version, we also present solutions to the most general case of the problem (i.e., the histogram bounded case), which were left as open problems in the preliminary versio

    Partitioning space for range queries

    Get PDF
    It is shown that, given a set S of n points in R3, one can always find three planes that form an eight-partition of S, that is, a partition where at most n/8 points of S lie in each of the eight open regions. This theorem is used to define a data structure, called an octant tree, for representing any point set in R3. An octant tree for n points occupies O(n) space and can be constructed in polynomial time. With this data structure and its refinements, efficient solutions to various range query problems in 2 and 3 dimensions can be obtained, including (1) half-space queries: find all points of S that lie to one side of any given plane; (2) polyhedron queries: find all points that lie inside (outside) any given polyhedron; and (3) circular queries in R2: for a planar set S, find all points that lie inside (outside) any given circle. The retrieval time for all these queries is T(n)=O(na + m) where a= 0.8988 (or 0.8471 in case (3)) and m is the size of the output. This performance is the best currently known for linear-space data structures which can be deterministically constructed in polynomial time

    Multidimensional Range Queries on Modern Hardware

    Full text link
    Range queries over multidimensional data are an important part of database workloads in many applications. Their execution may be accelerated by using multidimensional index structures (MDIS), such as kd-trees or R-trees. As for most index structures, the usefulness of this approach depends on the selectivity of the queries, and common wisdom told that a simple scan beats MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom is largely based on evaluations that are almost two decades old, performed on data being held on disks, applying IO-optimized data structures, and using single-core systems. The question is whether this rule of thumb still holds when multidimensional range queries (MDRQ) are performed on modern architectures with large main memories holding all data, multi-core CPUs and data-parallel instruction sets. In this paper, we study the question whether and how much modern hardware influences the performance ratio between index structures and scans for MDRQ. To this end, we conservatively adapted three popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit features of modern servers and compared their performance to different flavors of parallel scans using multiple (synthetic and real-world) analytical workloads over multiple (synthetic and real-world) datasets of varying size, dimensionality, and skew. We find that all approaches benefit considerably from using main memory and parallelization, yet to varying degrees. Our evaluation indicates that, on current machines, scanning should be favored over parallel versions of classical MDIS even for very selective queries

    Structure-Aware Sampling: Flexible and Accurate Summarization

    Full text link
    In processing large quantities of data, a fundamental problem is to obtain a summary which supports approximate query answering. Random sampling yields flexible summaries which naturally support subset-sum queries with unbiased estimators and well-understood confidence bounds. Classic sample-based summaries, however, are designed for arbitrary subset queries and are oblivious to the structure in the set of keys. The particular structure, such as hierarchy, order, or product space (multi-dimensional), makes range queries much more relevant for most analysis of the data. Dedicated summarization algorithms for range-sum queries have also been extensively studied. They can outperform existing sampling schemes in terms of accuracy on range queries per summary size. Their accuracy, however, rapidly degrades when, as is often the case, the query spans multiple ranges. They are also less flexible - being targeted for range sum queries alone - and are often quite costly to build and use. In this paper we propose and evaluate variance optimal sampling schemes that are structure-aware. These summaries improve over the accuracy of existing structure-oblivious sampling schemes on range queries while retaining the benefits of sample-based summaries: flexible summaries, with high accuracy on both range queries and arbitrary subset queries

    Approximate Geometric MST Range Queries

    Get PDF
    Range searching is a widely-used method in computational geometry for efficiently accessing local regions of a large data set. Typically, range searching involves either counting or reporting the points lying within a given query region, but it is often desirable to compute statistics that better describe the structure of the point set lying within the region, not just the count. In this paper we consider the geometric minimum spanning tree (MST) problem in the context of range searching where approximation is allowed. We are given a set P of n points in R^d. The objective is to preprocess P so that given an admissible query region Q, it is possible to efficiently approximate the weight of the minimum spanning tree of the subset of P lying within Q. There are two natural sources of approximation error, first by treating Q as a fuzzy object and second by approximating the MST weight itself. To model this, we assume that we are given two positive real approximation parameters eps_q and eps_w. Following the typical practice in approximate range searching, the range is expressed as two shapes Q^- and Q^+, where Q^- is contained in Q which is contained in Q^+, and their boundaries are separated by a distance of at least eps_q diam(Q). Points within Q^- must be included and points external to Q^+ cannot be included. A weight W is a valid answer to the query if there exist subsets P\u27 and P\u27\u27 of P, such that Q^- is contained in P\u27 which is contained in P\u27\u27 which is contained in Q^+ and wt(MST(P\u27)) <= W <= (1+eps_w) wt(MST(P\u27\u27)). In this paper, we present an efficient data structure for answering such queries. Our approach uses simple data structures based on quadtrees, and it can be applied whenever Q^- and Q^+ are compact sets of constant combinatorial complexity. It uses space O(n), and it answers queries in time O(log n + 1/(eps_q eps_w)^{d + O(1)}). The O(1) term is a small constant independent of dimension, and the hidden constant factor in the overall running time depends on d, but not on eps_q or eps_w. Preprocessing requires knowledge of eps_w, but not eps_q
    • …
    corecore