
    Range Queries on Uncertain Data

    Given a set $P$ of $n$ uncertain points on the real line, each represented by its one-dimensional probability density function, we consider the problem of building data structures on $P$ to answer range queries of the following three types for any query interval $I$: (1) top-$1$ query: find the point in $P$ that lies in $I$ with the highest probability; (2) top-$k$ query: given any integer $k \leq n$ as part of the query, return the $k$ points in $P$ that lie in $I$ with the highest probabilities; and (3) threshold query: given any threshold $\tau$ as part of the query, return all points of $P$ that lie in $I$ with probability at least $\tau$. We present data structures for these range queries with linear or nearly linear space and efficient query time.

    Comment: 26 pages. A preliminary version of this paper appeared in ISAAC 2014. In this full version, we also present solutions to the most general case of the problem (i.e., the histogram-bounded case), which were left as open problems in the preliminary version.
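    The three query types can be made concrete with a small brute-force baseline. The sketch below is purely illustrative and is not the paper's data structure: it models each uncertain point as a discrete distribution over locations (a histogram-style stand-in for the pdf) and answers each query in $O(n)$ time, which is the cost the proposed structures improve on; all names are hypothetical.

        from typing import Dict, List, Tuple

        UncertainPoint = Dict[float, float]  # location -> probability mass

        def prob_in_interval(p: UncertainPoint, lo: float, hi: float) -> float:
            # Probability that the uncertain point falls inside the query interval.
            return sum(mass for loc, mass in p.items() if lo <= loc <= hi)

        def top_k(points: List[UncertainPoint], lo: float, hi: float,
                  k: int) -> List[Tuple[int, float]]:
            # Top-k query (top-1 is the special case k = 1).
            probs = [(i, prob_in_interval(p, lo, hi)) for i, p in enumerate(points)]
            return sorted(probs, key=lambda t: t[1], reverse=True)[:k]

        def threshold_query(points: List[UncertainPoint], lo: float, hi: float,
                            tau: float) -> List[int]:
            # Threshold query: indices of points in [lo, hi] with probability >= tau.
            return [i for i, p in enumerate(points)
                    if prob_in_interval(p, lo, hi) >= tau]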

    Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently

    In this paper we define the notion of a probabilistic neighborhood in spatial data: Let a set $P$ of $n$ points in $\mathbb{R}^d$, a query point $q \in \mathbb{R}^d$, a distance metric $\operatorname{dist}$, and a monotonically decreasing function $f : \mathbb{R}^+ \rightarrow [0,1]$ be given. Then a point $p \in P$ belongs to the probabilistic neighborhood $N(q, f)$ of $q$ with respect to $f$ with probability $f(\operatorname{dist}(p,q))$. We envision applications in facility location, sensor networks, and other scenarios where a connection between two entities becomes less likely with increasing distance. A straightforward query algorithm would determine a probabilistic neighborhood in $\Theta(n \cdot d)$ time by probing each point in $P$. To answer the query in sublinear time for the planar case, we augment a quadtree suitably and design a corresponding query algorithm. Our theoretical analysis shows that, for certain distributions of planar $P$, our algorithm answers a query in $O((|N(q,f)| + \sqrt{n}) \log n)$ time with high probability (whp). This matches up to a logarithmic factor the cost induced by quadtree-based algorithms for deterministic queries and is asymptotically faster than the straightforward approach whenever $|N(q,f)| \in o(n / \log n)$. As practical proofs of concept we use two applications, one in the Euclidean and one in the hyperbolic plane. In particular, our results yield the first generator for random hyperbolic graphs with arbitrary temperatures in subquadratic time. Moreover, our experimental data show the usefulness of our algorithm even if the point distribution is unknown or non-uniform: the running time savings over the pairwise probing approach constitute at least one order of magnitude already for a modest number of points and queries.

    Comment: The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-44543-4_3
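    For reference, here is a minimal sketch of the straightforward $\Theta(n \cdot d)$ probing algorithm mentioned in the abstract; the quadtree-based sublinear algorithm is the paper's contribution and is not reproduced here. The function name and the example decay function are illustrative assumptions.

        import math
        import random
        from typing import Callable, List, Sequence

        def probabilistic_neighborhood(points: List[Sequence[float]],
                                       q: Sequence[float],
                                       f: Callable[[float], float]) -> List[int]:
            # Probe every point once: p joins N(q, f) with probability f(dist(p, q)).
            # Uses the Euclidean metric here; any other metric works the same way.
            def dist(a, b):
                return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            return [i for i, p in enumerate(points)
                    if random.random() < f(dist(p, q))]

        # Example: connection probability decays exponentially with distance.
        # neighbors = probabilistic_neighborhood([(1.0, 2.0), (3.0, 4.0)],
        #                                        (0.0, 0.0), lambda d: math.exp(-d))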

    Similarity search and mining in uncertain spatial and spatio-temporal databases

    Current trends in technology, such as smart phones, general mobile devices, stationary sensors and satellites, together with a new user mentality of voluntarily sharing information through this technology, produce a huge flood of geo-spatial and geo-spatio-temporal data. This data flood provides a tremendous potential for discovering new and possibly useful knowledge. Measurements are imprecise due to the physical limitations of the devices, and some form of interpolation is needed between discrete time instances. From a complementary perspective, the data is often subjected to a reduction that eliminates some of the known/recorded values, in order to lower communication and bandwidth utilization along with storage requirements. These issues introduce the notion of uncertainty in the context of spatio-temporal data management, an aspect raising an imminent need for scalable and flexible data management. The main scope of this thesis is to develop effective and efficient techniques for similarity search and data mining in uncertain spatial and spatio-temporal data. In a plethora of research fields and industrial applications, these techniques can substantially improve decision making, minimize risk and unearth valuable insights that would otherwise remain hidden. The challenge of effectiveness in uncertain data is to correctly determine the set of possible results, each associated with the correct probability of being a result, in order to give a user confidence about the returned results. The complementary challenge of efficiency is to compute these results and the corresponding probabilities efficiently, allowing for reasonable querying and mining times even for large uncertain databases. The paradigm used to master both challenges is to identify a small set of equivalence classes of possible worlds, such that members of the same class can be treated as equivalent in the context of a given query predicate or data mining task. In the scope of this work, this paradigm is formally defined and applied to the most prominent classes of spatial queries on uncertain data, including range queries, k-nearest-neighbor queries, ranking queries and reverse k-nearest-neighbor queries. For this purpose, new spatial and probabilistic pruning approaches are developed to further speed up query processing. Furthermore, the proposed paradigm enables the first efficient solution for the problem of frequent co-location mining on uncertain data. Special emphasis is placed on the temporal aspect of applications using modern data collection technologies. While the aforementioned techniques work well for single points in time, the prediction of query results over time remains a challenge. This thesis fills this gap by modeling an uncertain spatio-temporal object as a stochastic process, and by applying the above paradigm to efficiently query, index and mine historical spatio-temporal data.
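    As a hedged illustration of the possible-worlds paradigm (a generic sketch, not the thesis's specific pruning machinery): for a range query over $n$ independent uncertain objects, all $2^n$ possible worlds in which exactly $j$ objects satisfy the predicate are equivalent, so the $n+1$ class probabilities (a Poisson binomial distribution) can be computed by dynamic programming in $O(n^2)$ time instead of enumerating worlds.

        from typing import List

        def count_distribution(event_probs: List[float]) -> List[float]:
            # dp[j] = probability that exactly j of the independent events occur.
            dp = [1.0]
            for p in event_probs:
                nxt = [0.0] * (len(dp) + 1)
                for j, mass in enumerate(dp):
                    nxt[j] += mass * (1.0 - p)   # world class: event does not occur
                    nxt[j + 1] += mass * p       # world class: event occurs
                dp = nxt
            return dp

        def prob_at_least(event_probs: List[float], k: int) -> float:
            # P(at least k uncertain objects satisfy the query predicate), given
            # each object's marginal probability of satisfying it.
            return sum(count_distribution(event_probs)[k:])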

    Δ-Kernel Coresets for Stochastic Points

    With the dramatic growth in the number of application domains that generate probabilistic, noisy and uncertain data, there has been an increasing interest in designing algorithms for geometric or combinatorial optimization problems over such data. In this paper, we initiate the study of constructing Δ-kernel coresets for uncertain points. We consider uncertainty in the existential model, where each point's location is fixed but the point occurs only with a certain probability, and the locational model, where each point has a probability distribution describing its location. An Δ-kernel coreset approximates the width of a point set in any direction. We consider approximating the expected width (an Δ-EXP-KERNEL), as well as the probability distribution on the width (an (Δ, τ)-QUANT-KERNEL), for any direction. We show that there exists a set of $O(\varepsilon^{-(d-1)/2})$ deterministic points which approximate the expected width under the existential and locational models, and we provide efficient algorithms for constructing such coresets. We show, however, that it is not always possible to find a subset of the original uncertain points which provides such an approximation. However, if the existential probability of each point is lower-bounded by a constant, an Δ-EXP-KERNEL is still possible. We also provide efficient algorithms for constructing an (Δ, τ)-QUANT-KERNEL coreset in nearly linear time. Our techniques utilize or connect to several important notions in probability and geometry, such as Kolmogorov distances, VC uniform convergence and Tukey depth, and may be useful in other geometric optimization problems in stochastic settings. Finally, combining with known techniques, we show a few applications: approximating the extent of uncertain functions, maintaining extent measures for stochastic moving points, and some shape-fitting problems under uncertainty.
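    The quantity an Δ-EXP-KERNEL approximates can be stated concretely. The following Monte Carlo sketch estimates the expected directional width under the existential model; coreset construction itself is not shown, and the sampling approach is an assumption made purely for illustration.

        import random
        from typing import List, Sequence, Tuple

        def expected_width(points: List[Tuple[Sequence[float], float]],
                           u: Sequence[float],
                           samples: int = 10_000) -> float:
            # points: (location, existence probability) pairs; u: a unit direction.
            # Averages the width of random realizations; an empty or singleton
            # realization contributes width 0.
            total = 0.0
            for _ in range(samples):
                proj = [sum(x * y for x, y in zip(loc, u))
                        for loc, prob in points if random.random() < prob]
                if len(proj) >= 2:
                    total += max(proj) - min(proj)
            return total / samples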

    Decision trees for uncertain data

    Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value of a data item is often represented not by a single value, but by multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives (such as mean and median), we discover that the accuracy of a decision tree classifier can be much improved if the "complete information" of a data item (taking into account its probability density function (pdf)) is utilized. We extend classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments show that the resulting classifiers are more accurate than those using value averages. Since processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree construction on uncertain data is more CPU-demanding than on certain data. To tackle this problem, we propose a series of pruning techniques that can greatly improve construction efficiency. © 2006 IEEE.
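    The "complete information" idea can be sketched for a numeric split: each tuple's attribute is a discrete pdf, so a candidate split sends probability mass, rather than whole tuples, to each side, and the split is scored on the resulting fractional class counts. This is a minimal illustration under assumed representations, not the paper's full algorithm or its pruning techniques.

        import math
        from typing import Dict, List, Tuple

        UncertainTuple = Tuple[Dict[float, float], str]  # (pdf over values, class)

        def entropy(counts: Dict[str, float]) -> float:
            total = sum(counts.values())
            if total == 0.0:
                return 0.0
            return -sum((c / total) * math.log2(c / total)
                        for c in counts.values() if c > 0)

        def split_entropy(data: List[UncertainTuple], split: float) -> float:
            # Expected entropy of the test "attribute <= split": every tuple
            # contributes fractional probability mass to both branches.
            left: Dict[str, float] = {}
            right: Dict[str, float] = {}
            for pdf, label in data:
                mass_left = sum(m for v, m in pdf.items() if v <= split)
                left[label] = left.get(label, 0.0) + mass_left
                right[label] = right.get(label, 0.0) + (1.0 - mass_left)
            n_left, n_right = sum(left.values()), sum(right.values())
            n = n_left + n_right
            return (n_left / n) * entropy(left) + (n_right / n) * entropy(right)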

    Event detection in high throughput social media


    Algorithms for Imprecise Trajectories
