Approximating the Distribution of the Median and other Robust Estimators on Uncertain Data
Robust estimators, like the median of a point set, are important for data
analysis in the presence of outliers. We study robust estimators for
locationally uncertain points with discrete distributions. That is, each point
in a data set has a discrete probability distribution describing its location.
The probabilistic nature of uncertain data makes it challenging to compute such
estimators, since the true value of the estimator is now described by a
distribution rather than a single point. We show how to construct and estimate
the distribution of the median of a point set. Building the approximate support
of the distribution takes near-linear time, and assigning probability to that
support takes quadratic time. We also develop a general approximation technique
for distributions of robust estimators with respect to ranges with bounded VC
dimension. This includes the geometric median for high dimensions and the
Siegel estimator for linear regression. Comment: Full version of a paper to appear at SoCG 201
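To make the notion of "the distribution of the median" concrete, here is a minimal brute-force sketch for one-dimensional points. It is not the near-linear construction from the abstract: it enumerates every realization, so it is exponential in the number of points and only useful for tiny inputs. The function name `median_distribution` and the input layout (a list of `(location, probability)` pairs per uncertain point) are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
from statistics import median_low


def median_distribution(uncertain_points):
    """Exact distribution of the median by enumerating all realizations.

    uncertain_points: one list per uncertain point, each holding
    (location, probability) pairs whose probabilities sum to 1.
    Hypothetical helper, exponential in len(uncertain_points).
    """
    dist = defaultdict(float)
    for realization in product(*uncertain_points):
        prob = 1.0
        for _, p in realization:
            prob *= p  # points are assumed independent
        locations = [loc for loc, _ in realization]
        # median_low always returns one of the realized locations
        dist[median_low(locations)] += prob
    return dict(dist)


points = [
    [(0.0, 0.5), (10.0, 0.5)],  # either far left or far right
    [(1.0, 1.0)],               # a certain point
    [(2.0, 0.5), (3.0, 0.5)],
]
dist = median_distribution(points)
```

Here the median is 1.0 with probability 0.5 and 2.0 or 3.0 with probability 0.25 each, which shows why the estimator is a distribution rather than a single value.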
Convex Hulls under Uncertainty
We study the convex-hull problem in a probabilistic setting, motivated by the
need to handle data uncertainty inherent in many applications, including sensor
databases, location-based services and computer vision. In our framework, the
uncertainty of each input site is described by a probability distribution over
a finite number of possible locations including a \emph{null} location to
account for non-existence of the point. Our results include both exact and
approximation algorithms for computing the probability of a query point lying
inside the convex hull of the input, time-space tradeoffs for the membership
queries, a connection between Tukey depth and membership queries, as well as a
new notion of \some-hull that may be a useful representation of uncertain
hulls.
Uncertain Curve Simplification
We study the problem of polygonal curve simplification under uncertainty,
where instead of a sequence of exact points, each uncertain point is
represented by a region, which contains the (unknown) true location of the
vertex. The regions we consider are disks, line segments, convex polygons, and
discrete sets of points. We are interested in finding the shortest subsequence
of uncertain points such that no matter what the true location of each
uncertain point is, the resulting polygonal curve is a valid simplification of
the original polygonal curve under the Hausdorff or the Fr\'echet distance. For
both these distance measures, we present polynomial-time algorithms for this
problem. Comment: 25 pages, 5 figures
On the expected diameter, width, and complexity of a stochastic convex-hull
We investigate several computational problems related to the stochastic
convex hull (SCH). Given a stochastic dataset consisting of points in
each of which has an existence probability, a SCH refers to the
convex hull of a realization of the dataset, i.e., a random sample including
each point with its existence probability. We are interested in computing
certain expected statistics of a SCH, including diameter, width, and
combinatorial complexity. For diameter, we establish the first deterministic
1.633-approximation algorithm with a time complexity polynomial in both and
. For width, two approximation algorithms are provided: a deterministic
-approximation running in time, and a fully
polynomial-time randomized approximation scheme (FPRAS). For combinatorial
complexity, we propose an exact -time algorithm. Our solutions exploit
many geometric insights in Euclidean space, some of which might be of
independent interest.
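As a reference point for the expected-diameter statistic, the following sketch computes it exactly by enumerating all realizations. This is a baseline under stated assumptions, not the approximation algorithm from the abstract: it takes time exponential in the number of points, treats a realization with fewer than two points as having diameter 0, and uses the fact that the diameter of a convex hull equals the largest pairwise distance among its points. The name `expected_diameter` is hypothetical.

```python
import math
from itertools import combinations, product


def expected_diameter(points, probs):
    """Expected diameter of the convex hull of a random realization.

    points: list of coordinate tuples; probs: existence probability
    of each point (points are assumed independent).
    Brute force over all 2^n realizations; illustration only.
    """
    total = 0.0
    for mask in product([0, 1], repeat=len(points)):
        prob = 1.0
        for present, p in zip(mask, probs):
            prob *= p if present else 1.0 - p
        chosen = [pt for pt, present in zip(points, mask) if present]
        # Hull diameter = farthest pair; 0 if fewer than two points exist.
        diam = max(
            (math.dist(a, b) for a, b in combinations(chosen, 2)),
            default=0.0,
        )
        total += prob * diam
    return total


two_points = expected_diameter([(0, 0), (1, 0)], [0.5, 0.5])
certain = expected_diameter([(0, 0), (1, 0), (2, 0)], [1.0, 1.0, 1.0])
```

With two points at distance 1, each present with probability 0.5, the only realization with a nonzero diameter has probability 0.25, so the expected diameter is 0.25.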
From Proximity to Utility: A Voronoi Partition of Pareto Optima
We present an extension of Voronoi diagrams where when considering which site
a client is going to use, in addition to the site distances, other site
attributes are also considered (for example, prices or weights). A cell in this
diagram is then the locus of all clients that consider the same set of sites to
be relevant. In particular, the precise site a client might use from this
candidate set depends on parameters that might change between usages, and the
candidate set lists all of the relevant sites. The resulting diagram is
significantly more expressive than Voronoi diagrams, but naturally has the
drawback that its complexity, even in the plane, might be quite high.
Nevertheless, we show that if the attributes of the sites are drawn from the
same distribution (note that the locations are fixed), then the expected
complexity of the candidate diagram is near linear.
To this end, we derive several new technical results, which are of
independent interest. In particular, we provide a high-probability,
asymptotically optimal bound on the number of Pareto optima points in a point
set uniformly sampled from the -dimensional hypercube. To do so we revisit
the classical backward analysis technique, both simplifying and improving
relevant results in order to achieve the high-probability bounds.
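The quantity bounded above, the number of Pareto optima in a uniform sample from a hypercube, is easy to measure empirically. The sketch below is only a quadratic-time illustration and says nothing about the paper's analysis. It assumes that larger coordinates are better; the names `pareto_optima`, `dim`, and `n`, and the sample size, are choices made for this example.

```python
import random


def dominates(a, b):
    """True if a is at least as large as b everywhere and larger somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(
        x > y for x, y in zip(a, b)
    )


def pareto_optima(points):
    """Return the points that no other point dominates (O(n^2) check)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


small = pareto_optima([(1, 1), (2, 0), (0, 2), (0, 0), (1, 0)])

# Uniform sample from the unit hypercube; dim and n are arbitrary here.
rng = random.Random(0)
dim, n = 3, 500
sample = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
front = pareto_optima(sample)
```

In the small example, `(0, 0)` and `(1, 0)` are dominated by `(1, 1)`, so three optima remain. Repeating the random experiment for growing `n` gives a feel for how slowly the number of optima grows.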
Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently
In this paper we define the notion
of a probabilistic neighborhood in spatial data: Let a set of points in
, a query point , a distance metric \dist,
and a monotonically decreasing function be
given. Then a point belongs to the probabilistic neighborhood of with respect to with probability f(\dist(p,q)). We envision
applications in facility location, sensor networks, and other scenarios where a
connection between two entities becomes less likely with increasing distance. A
straightforward query algorithm would determine a probabilistic neighborhood in
time by probing each point in .
To answer the query in sublinear time for the planar case, we augment a
quadtree suitably and design a corresponding query algorithm. Our theoretical
analysis shows that -- for certain distributions of planar -- our algorithm
answers a query in time with high probability
(whp). This matches up to a logarithmic factor the cost induced by
quadtree-based algorithms for deterministic queries and is asymptotically
faster than the straightforward approach whenever .
As practical proofs of concept we use two applications, one in the Euclidean
and one in the hyperbolic plane. In particular, our results yield the first
generator for random hyperbolic graphs with arbitrary temperatures in
subquadratic time. Moreover, our experimental data show the usefulness of our
algorithm even if the point distribution is unknown or not uniform: The running
time savings over the pairwise probing approach constitute at least one order
of magnitude already for a modest number of points and queries. Comment: The final publication is available at Springer via
http://dx.doi.org/10.1007/978-3-319-44543-4_3
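The straightforward query mentioned in the abstract, which probes every point once, can be sketched as follows. This is the linear-time baseline, not the quadtree-based algorithm. The function name `probabilistic_neighborhood`, the Euclidean default for `dist`, and the example decay function `exp(-d)` are assumptions for illustration.

```python
import math
import random


def probabilistic_neighborhood(points, q, f, dist=math.dist, rng=random):
    """Include each point p independently with probability f(dist(p, q)).

    f must map a distance to a probability in [0, 1] and should be
    monotonically decreasing. One probe per point, so linear time.
    """
    return [p for p in points if rng.random() < f(dist(p, q))]


points = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)]
query = (0.0, 0.0)

# A step function makes the outcome deterministic: radius-1 range query.
within_unit = probabilistic_neighborhood(
    points, query, lambda d: 1.0 if d <= 1.0 else 0.0
)

# A smooth decay (assumed example): connection gets less likely with distance.
sampled = probabilistic_neighborhood(
    points, query, lambda d: math.exp(-d), rng=random.Random(1)
)
```

With the step function the query degenerates to an ordinary range query, which is a convenient sanity check before switching to a genuinely probabilistic `f`.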
Non-zero probability of nearest neighbor searching
Nearest neighbor (NN) searching is a challenging problem in data management and has been widely studied in data mining, pattern recognition, and computational geometry. The goal of NN searching is to efficiently report the data item nearest to a given query object. Most studies assume that both the data and the query are precise; however, in real applications of NN searching, such as tracking and location services, GIS, and data mining, both may be imprecise. In this situation, a natural way to handle the issue is to report the data that have a nonzero probability of being the nearest neighbor of a given query (called nonzero nearest neighbors). Formally, let P be a set of n uncertain points modeled by some regions. We first consider the following variation of the NN searching problem under uncertainty. If both the query and the data are uncertain points modeled by distinct unit segments parallel to the x-axis, we propose an efficient algorithm that reports nonzero nearest neighbors under the Manhattan metric in O(n^2 α(n^2)) preprocessing and O(log n + k) query time, where α(.) is the extremely slowly growing functional inverse of Ackermann's function. Finally, for segments of arbitrary length parallel to the x-axis, we propose an approximation algorithm that reports nonzero nearest neighbors with maximum error L in O(n^2 α(n^2)) preprocessing and O(log n + k) query time, where L is the length of the query.