Range Queries on Uncertain Data
Given a set P of n uncertain points on the real line, each represented by
its one-dimensional probability density function, we consider the problem of
building data structures on P to answer range queries of the following three
types for any query interval I: (1) top-1 query: find the point in P that
lies in I with the highest probability; (2) top-k query: given any integer
k as part of the query, return the k points in P that lie in I
with the highest probabilities; and (3) threshold query: given any threshold
τ as part of the query, return all points of P that lie in I with
probabilities at least τ. We present data structures for these range
queries with linear or nearly linear space and efficient query time.

Comment: 26 pages. A preliminary version of this paper appeared in ISAAC 2014.
In this full version, we also present solutions to the most general case of
the problem (i.e., the histogram-bounded case), which were left as open
problems in the preliminary version.
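The three query types can be made concrete with a brute-force baseline (illustrative only, not the paper's data structure; the discrete-histogram pdf representation and all names are assumptions). Each uncertain point's probability of lying in the query interval is summed from its pdf, and the point maximizing that probability answers the top-1 query:

```python
def top1_range_query(points, a, b):
    """Brute-force top-1 range query over uncertain points.

    Each point is given as a list of (value, mass) pairs approximating its
    one-dimensional pdf as a discrete histogram (masses sum to 1). Returns
    the index of the point most likely to lie in [a, b], with that
    probability. Runs in O(total pdf size); the paper's structures answer
    such queries far faster after preprocessing.
    """
    best, best_prob = None, -1.0
    for idx, pdf in enumerate(points):
        # Probability that this point's realized location falls in [a, b].
        prob = sum(mass for value, mass in pdf if a <= value <= b)
        if prob > best_prob:
            best, best_prob = idx, prob
    return best, best_prob
```

Top-k and threshold queries follow the same pattern: sort the per-point probabilities and keep the k largest, or keep all of them that reach τ.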
Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently
In this paper we define the notion
of a probabilistic neighborhood in spatial data: Let a set P of n points in
R^d, a query point q, a distance metric dist,
and a monotonically decreasing function f : R+ → [0, 1] be
given. Then a point p ∈ P belongs to the probabilistic neighborhood of q with respect to f with probability f(dist(p, q)). We envision
applications in facility location, sensor networks, and other scenarios where a
connection between two entities becomes less likely with increasing distance. A
straightforward query algorithm would determine a probabilistic neighborhood in
Θ(n) time by probing each point in P.
To answer the query in sublinear time for the planar case, we augment a
quadtree suitably and design a corresponding query algorithm. Our theoretical
analysis shows that, for certain distributions of the planar point set P, our algorithm
answers a query in O((|N| + √n) log n) time with high probability
(whp), where |N| is the size of the returned neighborhood. This matches, up to a logarithmic factor, the cost induced by
quadtree-based algorithms for deterministic queries and is asymptotically
faster than the straightforward approach whenever |N| ∈ o(n).
As practical proofs of concept we use two applications, one in the Euclidean
and one in the hyperbolic plane. In particular, our results yield the first
generator for random hyperbolic graphs with arbitrary temperatures in
subquadratic time. Moreover, our experimental data show the usefulness of our
algorithm even if the point distribution is unknown or not uniform: The running
time savings over the pairwise probing approach constitute at least one order
of magnitude already for a modest number of points and queries.

Comment: The final publication is available at Springer via
http://dx.doi.org/10.1007/978-3-319-44543-4_3
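The straightforward Θ(n) probing algorithm described above can be sketched as follows (illustrative; the function names and the choice of Euclidean distance are assumptions, and the paper's quadtree-based sublinear method is not shown):

```python
import math
import random

def probabilistic_neighborhood(points, q, f, rng=random.random):
    """Naive probabilistic-neighborhood query: include each point p in the
    neighborhood of q independently with probability f(dist(p, q)).

    points: iterable of (x, y) tuples in the plane.
    f: monotonically decreasing function from distance to [0, 1].
    rng: source of uniform [0, 1) samples, injectable for testing.
    Runs in Theta(n) per query, since every point is probed.
    """
    result = []
    for p in points:
        d = math.hypot(p[0] - q[0], p[1] - q[1])  # Euclidean distance
        if rng() < f(d):
            result.append(p)
    return result
```

With a step function f this degenerates to an ordinary deterministic range query, which is convenient for checking the implementation.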
Similarity search and mining in uncertain spatial and spatio-temporal databases
Current trends in technology, such as smartphones, general mobile devices, stationary sensors and satellites, together with a new user mentality of voluntarily sharing information through this technology, produce a huge flood of geo-spatial and geo-spatio-temporal data. This data flood offers tremendous potential for discovering new and possibly useful knowledge. However, such data are inherently uncertain: measurements are imprecise due to the physical limitations of the devices, and some form of interpolation is needed between discrete time instances. From a complementary perspective, the data are often subjected to reduction in order to lower communication cost, bandwidth utilization and storage requirements, thereby eliminating some of the known/recorded values. These issues introduce the notion of uncertainty in the context of spatio-temporal data management, an aspect raising an imminent need for scalable and flexible data management. The main scope of this thesis is to develop effective and efficient techniques for similarity search and data mining in uncertain spatial and spatio-temporal data. In a plethora of research fields and industrial applications, these techniques can substantially improve decision making, minimize risk and unearth valuable insights that would otherwise remain hidden. The challenge of effectiveness in uncertain data is to correctly determine the set of possible results, each associated with the correct probability of being a result, in order to give the user confidence in the returned results. The complementary challenge of efficiency is to compute these results and the corresponding probabilities in an efficient manner, allowing for reasonable querying and mining times even for large uncertain databases.
The paradigm used to master both challenges is to identify a small set of equivalence classes of possible worlds, such that members of the same class can be treated as equivalent in the context of a given query predicate or data mining task. In the scope of this work, this paradigm is formally defined and applied to the most prominent classes of spatial queries on uncertain data, including range queries, k-nearest-neighbor queries, ranking queries and reverse k-nearest-neighbor queries. For this purpose, new spatial and probabilistic pruning approaches are developed to further speed up query processing. Furthermore, the proposed paradigm makes it possible to develop the first efficient solution to the problem of frequent co-location mining on uncertain data. Special emphasis is placed on the temporal aspect of applications using modern data-collection technologies. While the aforementioned techniques work well for single points in time, the prediction of query results over time remains a challenge. This thesis fills this gap by modeling an uncertain spatio-temporal object as a stochastic process, and by applying the above paradigm to efficiently query, index and mine historical spatio-temporal data.

Modern technologies, e.g. satellite technology and the technology in smartphones, generate a flood of spatial geo-data. In addition, a societal trend can be observed of voluntarily making these generated data available on publicly accessible platforms. This data flood has immense potential for discovering new and useful knowledge. These data are, however, fundamentally uncertain spatial data. The uncertainty stems from several aspects. On the one hand, measurements are inherently subject to inaccuracies; on the other hand, interpolation is required between discrete measurement times, which introduces additional uncertainty.
Furthermore, the data are often deliberately reduced in order to save storage space and transfer volume, whereby further information is lost. This uncertainty creates an immediate need for scalable and flexible methods for managing and analyzing such data. Within the scope of this thesis, effective and efficient techniques for similarity search and data mining on uncertain spatial and uncertain spatio-temporal data are developed. These techniques deliver valuable knowledge that can be used for decision making in various fields of research as well as in industrial applications. Developing these techniques poses two challenges. On the one hand, the developed techniques must be effective in order to return correct results together with the correct probabilities of those results. On the other hand, the developed techniques must be efficient in order to deliver results in acceptable time even for very large databases. The dissertation introduces a new paradigm that masters both challenges. This paradigm identifies possible database worlds that are equivalent with respect to a given query predicate. It is formally defined and applied to the most relevant spatial query types in order to develop efficient solutions, including range queries, k-nearest-neighbor queries, ranking queries and reverse k-nearest-neighbor queries. Spatial and probabilistic pruning criteria are developed to rule out insignificant results early. In addition, the first efficient solution to the problem of spatial co-location mining on uncertain data is presented. A particular focus of this work lies on the temporal aspect of modern geo-data.
While the above techniques of this work perform very well for single points in time, the effective and efficient management of uncertain spatio-temporal data is still a largely unsolved problem. This dissertation solves this problem by modeling uncertain spatio-temporal data as stochastic processes. The paradigm above can then be applied to these stochastic processes in order to efficiently query, index and mine uncertain spatio-temporal data.
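The possible-worlds semantics underlying the equivalence-class paradigm can be illustrated with a brute-force sketch (not from the thesis; the names and the existential 1-D model are assumptions). It enumerates all 2^n worlds to obtain the distribution of the number of range-query results, which is exactly the exponential blow-up that grouping worlds into equivalence classes is designed to avoid:

```python
from itertools import product

def range_count_distribution(points, a, b):
    """Distribution of the number of points falling in [a, b], over all
    possible worlds of existentially uncertain 1-D points.

    points: list of (value, existence_probability) pairs, independent.
    Returns {count: probability}. Exponential in len(points); worlds that
    agree on the count form one equivalence class for this query.
    """
    dist = {}
    for world in product([0, 1], repeat=len(points)):
        prob, count = 1.0, 0
        for exists, (x, p) in zip(world, points):
            prob *= p if exists else (1.0 - p)
            if exists and a <= x <= b:
                count += 1
        dist[count] = dist.get(count, 0.0) + prob
    return dist
```

For a range query with a count threshold, only the per-world count matters, so all worlds with the same count can be aggregated, as in the example above.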
Δ-Kernel Coresets for Stochastic Points
With the dramatic growth in the number of application domains that generate probabilistic, noisy and uncertain data, there has been an increasing interest in designing algorithms for geometric and combinatorial optimization problems over such data. In this paper, we initiate the study of constructing Δ-kernel coresets for uncertain points. We consider uncertainty in the existential model, where each point's location is fixed but the point occurs only with a certain probability, and the locational model, where each point has a probability distribution describing its location. An Δ-kernel coreset approximates the width of a point set in any direction. We consider approximating the expected width (an Δ-EXP-KERNEL), as well as the probability distribution of the width (an (Δ, τ)-QUANT-KERNEL), for any direction. We show that there exists a set of O(Δ^{-(d-1)/2}) deterministic points which approximates the expected width under both the existential and the locational model, and we provide efficient algorithms for constructing such coresets. We show, however, that it is not always possible to find a subset of the original uncertain points which provides such an approximation; if the existential probability of each point is lower-bounded by a constant, though, an Δ-EXP-KERNEL is still possible. We also provide efficient algorithms for constructing an (Δ, τ)-QUANT-KERNEL coreset in nearly linear time. Our techniques utilize or connect to several important notions in probability and geometry, such as Kolmogorov distances, VC uniform convergence and Tukey depth, and may be useful in other geometric optimization problems in stochastic settings. Finally, combining our methods with known techniques, we show a few applications to approximating the extent of uncertain functions, maintaining extent measures for stochastic moving points, and some shape-fitting problems under uncertainty.
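For intuition, the expected width in the existential model can be computed exactly by enumerating possible worlds (a brute-force sketch, not the paper's coreset construction; the names and the restriction to one dimension are assumptions):

```python
from itertools import product

def expected_width_1d(points):
    """Expected width (max - min of realized points) of existentially
    uncertain 1-D points, by exhaustive enumeration of possible worlds.

    points: list of (x, existence_probability) pairs, independent.
    Worlds with fewer than two realized points contribute width 0.
    Exponential in len(points); an eps-EXP-KERNEL approximates this
    quantity with a small set of deterministic points instead.
    """
    exp_w = 0.0
    for world in product([0, 1], repeat=len(points)):
        prob, xs = 1.0, []
        for exists, (x, p) in zip(world, points):
            prob *= p if exists else (1.0 - p)
            if exists:
                xs.append(x)
        if len(xs) >= 2:
            exp_w += prob * (max(xs) - min(xs))
    return exp_w
```

In d dimensions the same quantity is defined per direction via projections, which is what an Δ-kernel coreset must approximate simultaneously for all directions.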
Decision trees for uncertain data
Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by a single value but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median), we discover that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account its probability density function (pdf)) is utilized. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments have been conducted which show that the resulting classifiers are more accurate than those using value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU-demanding than on certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency. © 2006 IEEE.
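One way to handle pdf-valued attributes during tree construction is to split each tuple's probability mass fractionally between the branches rather than routing the whole tuple one way (a minimal sketch under assumed discrete pdfs; the names are illustrative and the paper's pruning techniques are omitted):

```python
def split_mass(tuples, threshold):
    """Fractional split of uncertain tuples at a numeric test.

    tuples: list of (pdf, label), where pdf is a list of (value, mass)
    pairs over the tested attribute (masses sum to 1).
    Returns (left, right); each side holds (sub_pdf, label, weight)
    triples, where weight is the probability mass routed to that branch.
    Entropy at each child can then be computed over these weights.
    """
    left, right = [], []
    for pdf, label in tuples:
        lo = [(v, m) for v, m in pdf if v <= threshold]
        hi = [(v, m) for v, m in pdf if v > threshold]
        if lo:
            left.append((lo, label, sum(m for _, m in lo)))
        if hi:
            right.append((hi, label, sum(m for _, m in hi)))
    return left, right
```

A tuple whose pdf straddles the threshold thus appears in both children with reduced weight, which is the source of the extra CPU cost the abstract mentions.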