1,495 research outputs found
Querying Probabilistic Neighborhoods in Spatial Data Sets Efficiently
In this paper we define the notion
of a probabilistic neighborhood in spatial data: Let a set of points in
, a query point , a distance metric \dist,
and a monotonically decreasing function be
given. Then a point belongs to the probabilistic neighborhood of with respect to with probability f(\dist(p,q)). We envision
applications in facility location, sensor networks, and other scenarios where a
connection between two entities becomes less likely with increasing distance. A
straightforward query algorithm would determine a probabilistic neighborhood in
time by probing each point in .
To answer the query in sublinear time for the planar case, we augment a
quadtree suitably and design a corresponding query algorithm. Our theoretical
analysis shows that -- for certain distributions of planar -- our algorithm
answers a query in time with high probability
(whp). This matches up to a logarithmic factor the cost induced by
quadtree-based algorithms for deterministic queries and is asymptotically
faster than the straightforward approach whenever .
As practical proofs of concept we use two applications, one in the Euclidean
and one in the hyperbolic plane. In particular, our results yield the first
generator for random hyperbolic graphs with arbitrary temperatures in
subquadratic time. Moreover, our experimental data show the usefulness of our
algorithm even if the point distribution is unknown or not uniform: The running
time savings over the pairwise probing approach constitute at least one order
of magnitude already for a modest number of points and queries.Comment: The final publication is available at Springer via
http://dx.doi.org/10.1007/978-3-319-44543-4_3
Conditional Reliability in Uncertain Graphs
Network reliability is a well-studied problem that requires to measure the
probability that a target node is reachable from a source node in a
probabilistic (or uncertain) graph, i.e., a graph where every edge is assigned
a probability of existence. Many approaches and problem variants have been
considered in the literature, all assuming that edge-existence probabilities
are fixed. Nevertheless, in real-world graphs, edge probabilities typically
depend on external conditions. In metabolic networks a protein can be converted
into another protein with some probability depending on the presence of certain
enzymes. In social influence networks the probability that a tweet of some user
will be re-tweeted by her followers depends on whether the tweet contains
specific hashtags. In transportation networks the probability that a network
segment will work properly or not might depend on external conditions such as
weather or time of the day. In this paper we overcome this limitation and focus
on conditional reliability, that is assessing reliability when edge-existence
probabilities depend on a set of conditions. In particular, we study the
problem of determining the k conditions that maximize the reliability between
two nodes. We deeply characterize our problem and show that, even employing
polynomial-time reliability-estimation methods, it is NP-hard, does not admit
any PTAS, and the underlying objective function is non-submodular. We then
devise a practical method that targets both accuracy and efficiency. We also
study natural generalizations of the problem with multiple source and target
nodes. An extensive empirical evaluation on several large, real-life graphs
demonstrates effectiveness and scalability of the proposed methods.Comment: 14 pages, 13 figure
Geographic Gossip: Efficient Averaging for Sensor Networks
Gossip algorithms for distributed computation are attractive due to their
simplicity, distributed nature, and robustness in noisy and uncertain
environments. However, using standard gossip algorithms can lead to a
significant waste in energy by repeatedly recirculating redundant information.
For realistic sensor network model topologies like grids and random geometric
graphs, the inefficiency of gossip schemes is related to the slow mixing times
of random walks on the communication graph. We propose and analyze an
alternative gossiping scheme that exploits geographic information. By utilizing
geographic routing combined with a simple resampling method, we demonstrate
substantial gains over previously proposed gossip protocols. For regular graphs
such as the ring or grid, our algorithm improves standard gossip by factors of
and respectively. For the more challenging case of random
geometric graphs, our algorithm computes the true average to accuracy
using radio
transmissions, which yields a factor improvement over
standard gossip algorithms. We illustrate these theoretical results with
experimental comparisons between our algorithm and standard methods as applied
to various classes of random fields.Comment: To appear, IEEE Transactions on Signal Processin
Subgraph Pattern Matching over Uncertain Graphs with Identity Linkage Uncertainty
There is a growing need for methods which can capture uncertainties and
answer queries over graph-structured data. Two common types of uncertainty are
uncertainty over the attribute values of nodes and uncertainty over the
existence of edges. In this paper, we combine those with identity uncertainty.
Identity uncertainty represents uncertainty over the mapping from objects
mentioned in the data, or references, to the underlying real-world entities. We
propose the notion of a probabilistic entity graph (PEG), a probabilistic graph
model that defines a distribution over possible graphs at the entity level. The
model takes into account node attribute uncertainty, edge existence
uncertainty, and identity uncertainty, and thus enables us to systematically
reason about all three types of uncertainties in a uniform manner. We introduce
a general framework for constructing a PEG given uncertain data at the
reference level and develop highly efficient algorithms to answer subgraph
pattern matching queries in this setting. Our algorithms are based on two novel
ideas: context-aware path indexing and reduction by join-candidates, which
drastically reduce the query search space. A comprehensive experimental
evaluation shows that our approach outperforms baseline implementations by
orders of magnitude
Searching and mining in enriched geo-spatial data
The emergence of new data collection mechanisms in geo-spatial applications paired with a heightened tendency of users to volunteer information provides an ever-increasing flow of data of high volume, complex nature, and often associated with inherent uncertainty. Such mechanisms include crowdsourcing, automated knowledge inference, tracking, and social media data repositories. Such data bearing additional information from multiple sources like probability
distributions, text or numerical attributes, social context, or multimedia content can be called multi-enriched. Searching and mining this abundance of information holds many challenges, if all of the data's potential is to be released.
This thesis addresses several major issues arising in that field, namely path queries using multi-enriched data, trend mining in social media data, and handling uncertainty in geo-spatial data. In all cases, the developed methods have made significant contributions and have appeared in or were
accepted into various renowned international peer-reviewed venues.
A common use of geo-spatial data is path queries in road networks where traditional methods optimise results based on absolute and ofttimes singular metrics, i.e., finding the shortest paths based on distance or the best trade-off between distance and travel time. Integrating additional aspects like qualitative or social data by enriching the data model with knowledge derived from sources as mentioned above allows for queries that can be issued to fit a broader scope of needs or preferences.
This thesis presents two implementations of incorporating multi-enriched data into road networks. In one case, a range of qualitative data sources is evaluated to gain knowledge about user preferences which is subsequently matched with locations represented in a road network and integrated into its
components. Several methods are presented for highly customisable path queries that incorporate a wide spectrum of data.
In a second case, a framework is described for resource distribution with reappearance in road networks to serve one or more clients, resulting in paths that provide maximum gain based on a probabilistic evaluation of available resources. Applications for this include finding parking spots.
Social media trends are an emerging research area giving insight in user sentiment and important topics. Such trends consist of bursts of messages concerning a certain topic within a time frame, significantly deviating from the average appearance frequency of the same topic. By investigating the dissemination of such trends in space and time, this thesis presents methods to classify trend archetypes to predict future dissemination of a trend.
Processing and querying uncertain data is particularly demanding given the additional knowledge required to yield results with probabilistic guarantees. Since such knowledge is not always available and queries are not easily scaled to larger datasets due to the #P-complete nature of the problem, many existing approaches reduce the data to a deterministic representation of its underlying model to eliminate uncertainty. However, data uncertainty can also provide valuable insight into the nature of the data that cannot be represented in a deterministic manner.
This thesis presents techniques for clustering uncertain data as well as query processing, that take the additional information from uncertainty models into account while preserving scalability using a sampling-based approach, while previous approaches could only provide one of the two. The given solutions enable the application of various existing clustering techniques or query types to a
framework that manages the uncertainty.Das Erscheinen neuer Methoden zur Datenerhebung in räumlichen Applikationen gepaart mit einer erhöhten Bereitschaft der Nutzer, Daten über sich preiszugeben, generiert einen stetig steigenden Fluss von Daten in großer Menge, komplexer Natur, und oft gepaart mit inhärenter Unsicherheit. Beispiele für solche Mechanismen sind Crowdsourcing, automatisierte Wissensinferenz, Tracking, und Daten aus sozialen Medien. Derartige Daten, angereichert mit mit zusätzlichen Informationen aus verschiedenen Quellen wie Wahrscheinlichkeitsverteilungen, Text- oder numerische Attribute, sozialem Kontext, oder Multimediainhalten, werden als multi-enriched bezeichnet. Suche und Datamining in dieser weiten Datenmenge hält viele Herausforderungen bereit, wenn das gesamte Potenzial der Daten genutzt werden soll.
Diese Arbeit geht auf mehrere große Fragestellungen in diesem Feld ein, insbesondere Pfadanfragen in multi-enriched Daten, Trend-mining in Daten aus sozialen Netzwerken, und die Beherrschung von Unsicherheit in räumlichen Daten. In all diesen Fällen haben die entwickelten Methoden signifikante Forschungsbeiträge geleistet und wurden veröffentlicht oder angenommen zu diversen renommierten internationalen, von Experten begutachteten Konferenzen und Journals.
Ein gängiges Anwendungsgebiet räumlicher Daten sind Pfadanfragen in Straßennetzwerken, wo traditionelle Methoden die Resultate anhand absoluter und oft auch singulärer Maße optimieren, d.h., der kürzeste Pfad in Bezug auf die Distanz oder der beste Kompromiss zwischen Distanz und Reisezeit. Durch die Integration zusätzlicher Aspekte wie qualitativer Daten oder Daten aus sozialen Netzwerken als Anreicherung des Datenmodells mit aus diesen Quellen abgeleitetem Wissen werden Anfragen möglich, die ein breiteres Spektrum an Anforderungen oder Präferenzen erfüllen.
Diese Arbeit präsentiert zwei Ansätze, solche multi-enriched Daten in Straßennetze einzufügen. Zum einen wird eine Reihe qualitativer Datenquellen ausgewertet, um Wissen über Nutzerpräferenzen zu generieren, welches darauf mit Örtlichkeiten im Straßennetz abgeglichen und in das Netz integriert wird. Diverse Methoden werden präsentiert, die stark personalisierbare Pfadanfragen ermöglichen, die ein weites Spektrum an Daten mit einbeziehen.
Im zweiten Fall wird ein Framework präsentiert, das eine Ressourcenverteilung im Straßennetzwerk modelliert, bei der einmal verbrauchte Ressourcen erneut auftauchen können. Resultierende Pfade ergeben einen maximalen Ertrag basieren auf einer probabilistischen Evaluation der verfügbaren Ressourcen. Eine Anwendung ist die Suche nach Parkplätzen.
Trends in sozialen Medien sind ein entstehendes Forscchungsgebiet, das Einblicke in Benutzerverhalten und wichtige Themen zulässt. Solche Trends bestehen aus großen Mengen an Nachrichten zu einem bestimmten Thema innerhalb eines Zeitfensters, so dass die Auftrittsfrequenz signifikant über den durchschnittlichen Level liegt. Durch die Untersuchung der Fortpflanzung solcher Trends in Raum und Zeit präsentiert diese Arbeit Methoden,
um Trends nach Archetypen zu klassifizieren und ihren zukünftigen Weg vorherzusagen.
Die Anfragebearbeitung und Datamining in unsicheren Daten ist besonders herausfordernd, insbesondere im Hinblick auf das notwendige Zusatzwissen, um Resultate mit probabilistischen Garantien zu erzielen. Solches Wissen ist nicht immer verfügbar und Anfragen lassen sich aufgrund der \P-Vollständigkeit des Problems nicht ohne Weiteres auf größere Datensätze skalieren. Dennoch kann Datenunsicherheit wertvollen Einblick in die Struktur der Daten liefern, der mit deterministischen Methoden nicht erreichbar wäre. Diese Arbeit präsentiert Techniken zum Clustering unsicherer Daten sowie zur Anfragebearbeitung, die die Zusatzinformation aus dem Unsicherheitsmodell in Betracht ziehen, jedoch gleichzeitig die Skalierbarkeit des Ansatzes auf große Datenmengen sicherstellen
- …