Range Queries on Uncertain Data
Given a set P of n uncertain points on the real line, each represented by
its one-dimensional probability density function, we consider the problem of
building data structures on P to answer range queries of the following three
types for any query interval I: (1) top-1 query: find the point in P that
lies in I with the highest probability, (2) top-k query: given any integer k
as part of the query, return the k points in P that lie in I
with the highest probabilities, and (3) threshold query: given any threshold τ
as part of the query, return all points of P that lie in I with
probabilities at least τ. We present data structures for these range
queries with linear or nearly linear space and efficient query time.
Comment: 26 pages. A preliminary version of this paper appeared in ISAAC 2014.
In this full version, we also present solutions to the most general case of
the problem (i.e., the histogram bounded case), which were left as open
problems in the preliminary version.
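To make the three query types concrete, here is a minimal brute-force sketch. It is not the data structure from the paper: it scans every point per query, and it assumes each uncertain point has a uniform density on an interval [lo, hi], which is a simplification of the histogram case the abstract mentions. The names `UniformPoint`, `top_1`, `top_k`, and `threshold` are hypothetical.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class UniformPoint:
    """Hypothetical uncertain point with a uniform pdf on [lo, hi], lo < hi."""
    name: str
    lo: float
    hi: float

    def prob_in(self, a: float, b: float) -> float:
        # Probability mass of the uniform pdf that falls inside [a, b].
        overlap = min(self.hi, b) - max(self.lo, a)
        return max(0.0, overlap) / (self.hi - self.lo)


def top_1(points, a, b):
    # The point most likely to lie in the query interval [a, b].
    return max(points, key=lambda p: p.prob_in(a, b))


def top_k(points, a, b, k):
    # The k points with the highest probabilities of lying in [a, b].
    return heapq.nlargest(k, points, key=lambda p: p.prob_in(a, b))


def threshold(points, a, b, tau):
    # All points whose probability of lying in [a, b] is at least tau.
    return [p for p in points if p.prob_in(a, b) >= tau]
```

Each query here costs time linear in the number of points (plus heap overhead for top-k), which is the baseline that the linear-space structures in the abstract aim to beat.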
Partitioning space for range queries
It is shown that, given a set S of n points in R^3, one can always find three planes that form an eight-partition of S, that is, a partition where at most n/8 points of S lie in each of the eight open regions. This theorem is used to define a data structure, called an octant tree, for representing any point set in R^3. An octant tree for n points occupies O(n) space and can be constructed in polynomial time. With this data structure and its refinements, efficient solutions to various range query problems in 2 and 3 dimensions can be obtained, including (1) half-space queries: find all points of S that lie to one side of any given plane; (2) polyhedron queries: find all points that lie inside (outside) any given polyhedron; and (3) circular queries in R^2: for a planar set S, find all points that lie inside (outside) any given circle. The retrieval time for all these queries is T(n) = O(n^a + m), where a = 0.8988 (or 0.8471 in case (3)) and m is the size of the output. This performance is the best currently known for linear-space data structures which can be deterministically constructed in polynomial time.
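The eight-partition condition can be checked directly for a candidate triple of planes. The sketch below only verifies the condition; it does not find the planes and is not the octant-tree construction. It assumes each plane is given as a hypothetical tuple (a, b, c, d) meaning a·x + b·y + c·z = d, and it uses exact comparison with zero, so it suits integer or otherwise exact coordinates.

```python
from collections import Counter


def octant_counts(points, planes):
    """Count the points in each open region cut out by the given planes.

    A point lying exactly on any plane belongs to no open region and is skipped.
    """
    counts = Counter()
    for x, y, z in points:
        signs = []
        for a, b, c, d in planes:
            side = a * x + b * y + c * z - d
            if side == 0:
                break  # on a plane: not inside any open region
            signs.append(side > 0)
        else:
            counts[tuple(signs)] += 1
    return counts


def is_eight_partition(points, planes):
    # Every open region may hold at most n/8 of the n points.
    n = len(points)
    return all(8 * c <= n for c in octant_counts(points, planes).values())
```

For example, the eight points (±1, ±1, ±1) are eight-partitioned by the three coordinate planes, since each open octant then holds exactly one point.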
Multidimensional Range Queries on Modern Hardware
Range queries over multidimensional data are an important part of database
workloads in many applications. Their execution may be accelerated by using
multidimensional index structures (MDIS), such as kd-trees or R-trees. As for
most index structures, the usefulness of this approach depends on the
selectivity of the queries, and conventional wisdom has held that a simple scan beats
MDIS for queries accessing more than 15%-20% of a dataset. However, this wisdom
is largely based on evaluations that are almost two decades old, performed on
data being held on disks, applying IO-optimized data structures, and using
single-core systems. The question is whether this rule of thumb still holds
when multidimensional range queries (MDRQ) are performed on modern
architectures with large main memories holding all data, multi-core CPUs and
data-parallel instruction sets. In this paper, we study the question of whether
and how much modern hardware influences the performance ratio between index
structures and scans for MDRQ. To this end, we conservatively adapted three
popular MDIS, namely the R*-tree, the kd-tree, and the VA-file, to exploit
features of modern servers and compared their performance to different flavors
of parallel scans using multiple (synthetic and real-world) analytical
workloads over multiple (synthetic and real-world) datasets of varying size,
dimensionality, and skew. We find that all approaches benefit considerably from
using main memory and parallelization, yet to varying degrees. Our evaluation
indicates that, on current machines, scanning should be favored over parallel
versions of classical MDIS even for very selective queries.
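For reference, the scan baseline the abstract compares against is conceptually very simple. The following is a single-threaded sketch in plain Python, not the parallel or data-parallel scans evaluated in the paper; `scan_range_query` and `selectivity` are hypothetical names, and a query is assumed to be an axis-aligned box given by per-dimension inclusive bounds.

```python
def scan_range_query(rows, lower, upper):
    """Return every row that lies inside the box [lower, upper] in all dimensions."""
    return [
        row
        for row in rows
        if all(lo <= value <= hi for value, lo, hi in zip(row, lower, upper))
    ]


def selectivity(rows, lower, upper):
    # Fraction of the dataset that a query touches (0.0 for an empty dataset).
    if not rows:
        return 0.0
    return len(scan_range_query(rows, lower, upper)) / len(rows)
```

The scan always reads every row, so its cost is independent of selectivity; an index only pays off when it lets the query skip enough of the data to outweigh its traversal overhead.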
Structure-Aware Sampling: Flexible and Accurate Summarization
In processing large quantities of data, a fundamental problem is to obtain a
summary which supports approximate query answering. Random sampling yields
flexible summaries which naturally support subset-sum queries with unbiased
estimators and well-understood confidence bounds.
Classic sample-based summaries, however, are designed for arbitrary subset
queries and are oblivious to the structure in the set of keys. The particular
structure, such as hierarchy, order, or product space (multi-dimensional),
makes range queries much more relevant for most analysis of the data.
Dedicated summarization algorithms for range-sum queries have also been
extensively studied. They can outperform existing sampling schemes in terms of
accuracy on range queries per summary size. Their accuracy, however, rapidly
degrades when, as is often the case, the query spans multiple ranges. They are
also less flexible - being targeted for range sum queries alone - and are often
quite costly to build and use.
In this paper we propose and evaluate variance optimal sampling schemes that
are structure-aware. These summaries improve over the accuracy of existing
structure-oblivious sampling schemes on range queries while retaining the
benefits of sample-based summaries: flexible summaries, with high accuracy on
both range queries and arbitrary subset queries.
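The unbiased subset-sum estimation that sample-based summaries provide can be illustrated with a Horvitz-Thompson estimator. The sketch below uses plain independent (Poisson) sampling, which is structure-oblivious; it is not the variance-optimal, structure-aware scheme the paper proposes. The function names and the dict-based representation are assumptions made for illustration.

```python
import random


def poisson_sample(weights, probs, rng):
    """Include each key independently with probability probs[key].

    Sampled keys store the adjusted weight w / p, which makes the sum of
    adjusted weights an unbiased estimate of the true sum.
    """
    sample = {}
    for key, w in weights.items():
        p = probs[key]
        if rng.random() < p:
            sample[key] = w / p
    return sample


def estimate_subset_sum(sample, in_subset):
    # Horvitz-Thompson estimate: add adjusted weights of sampled keys in the subset.
    return sum(adj for key, adj in sample.items() if in_subset(key))
```

A range query is just a subset query with a predicate such as `lambda key: 2 <= key <= 3`, which is why one sample can serve both range and arbitrary subset queries.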
Approximate Geometric MST Range Queries
Range searching is a widely-used method in computational geometry for efficiently accessing local regions of a large data set. Typically, range searching involves either counting or reporting the points lying within a given query region, but it is often desirable to compute statistics that better describe the structure of the point set lying within the region, not just the count.
In this paper we consider the geometric minimum spanning tree (MST) problem in the context of range searching where approximation is allowed. We are given a set P of n points in R^d. The objective is to preprocess P so that given an admissible query region Q, it is possible to efficiently approximate the weight of the minimum spanning tree of the subset of P lying within Q. There are two natural sources of approximation error, first by treating Q as a fuzzy object and second by approximating the MST weight itself. To model this, we assume that we are given two positive real approximation parameters eps_q and eps_w. Following the typical practice in approximate range searching, the range is expressed as two shapes Q^- and Q^+, where Q^- is contained in Q which is contained in Q^+, and their boundaries are separated by a distance of at least eps_q diam(Q). Points within Q^- must be included and points external to Q^+ cannot be included. A weight W is a valid answer to the query if there exist subsets P' and P'' of P, such that Q^- is contained in P' which is contained in P'' which is contained in Q^+ and wt(MST(P')) <= W <= (1+eps_w) wt(MST(P'')).
In this paper, we present an efficient data structure for answering such queries. Our approach uses simple data structures based on quadtrees, and it can be applied whenever Q^- and Q^+ are compact sets of constant combinatorial complexity. It uses space O(n), and it answers queries in time O(log n + 1/(eps_q eps_w)^{d + O(1)}). The O(1) term is a small constant independent of dimension, and the hidden constant factor in the overall running time depends on d, but not on eps_q or eps_w. Preprocessing requires knowledge of eps_w, but not eps_q.
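The quantity being approximated can be pinned down with an exact brute-force baseline. The sketch below is not the quadtree-based structure from the paper: it assumes the query region Q is an axis-aligned box (a hypothetical simplification), filters the points inside it, and runs a quadratic-time Prim's algorithm with Euclidean edge weights. The names `mst_weight` and `range_mst_weight` are illustrative only.

```python
import math


def mst_weight(points):
    """Exact Euclidean MST weight via Prim's algorithm in O(m^2) time."""
    m = len(points)
    if m < 2:
        return 0.0
    best = [math.inf] * m  # cheapest known edge linking each point to the tree
    best[0] = 0.0
    in_tree = [False] * m
    total = 0.0
    for _ in range(m):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(m) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(m):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def range_mst_weight(points, lower, upper):
    # Keep only the points inside the box [lower, upper], then take their MST.
    inside = [
        p for p in points
        if all(lo <= c <= hi for c, lo, hi in zip(p, lower, upper))
    ]
    return mst_weight(inside)
```

This baseline takes time quadratic in the number of points inside the range for every query, in contrast with the query time stated in the abstract, which depends on n only through the log n term.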