Querying Probabilistic Ontologies with SPARQL
In recent years, much effort in Semantic Web research has gone into specifying
knowledge as precisely as possible. However, optimizing for precision
alone is not sufficient. The handling of uncertain or incomplete information is
becoming increasingly important, and it promises to significantly improve the quality
of query answering in Semantic Web applications. My plan is to develop a framework
that extends the rich semantics offered by ontologies with probabilistic information,
stores this in a probabilistic database and provides query answering with the help of
query rewriting. In this proposal I describe how these three aspects can be combined.
In particular, I focus on how uncertainty is incorporated into the ABox and how
it is handled by the database and the rewriter during query answering.
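The extended-ABox idea can be illustrated with a small sketch: under a tuple-independent semantics (an assumption made here for concreteness; the proposal leaves the exact probabilistic model open), a rewritten conjunctive query multiplies the probabilities of the facts it joins. All names and numbers below are invented for illustration.

```python
# Hypothetical sketch of query answering over a tuple-independent
# probabilistic ABox. Facts, constants, and probabilities are invented
# for illustration; they are not from the proposal.

abox = {
    ("alice", "worksFor", "acme"): 0.9,
    ("acme", "locatedIn", "berlin"): 0.7,
    ("bob", "worksFor", "acme"): 0.4,
}

def prob_conjunction(triples):
    """P(all triples hold), assuming the facts are independent."""
    p = 1.0
    for t in triples:
        p *= abox.get(t, 0.0)  # absent facts have probability 0
    return p

# Rewritten query: "does alice work for an organization located in berlin?"
p = prob_conjunction([
    ("alice", "worksFor", "acme"),
    ("acme", "locatedIn", "berlin"),
])
print(round(p, 2))  # 0.63
```

A real rewriter would also have to handle correlated facts and multiple derivations of the same answer, which is where the probabilistic database's lineage machinery comes in.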
IMPrECISE: Good-is-good-enough data integration
IMPrECISE is an XQuery module that adds probabilistic XML functionality to an existing XML DBMS, in our case MonetDB/XQuery. We demonstrate the probabilistic XML and data integration functionality of IMPrECISE. The prototype is configurable with domain knowledge such that the amount of uncertainty arising during data integration is reduced to an acceptable level, thus achieving "good is good enough" data integration with minimal human effort.
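The "good is good enough" idea can be sketched outside XQuery (this is not the IMPrECISE API): when two sources disagree, the integration result is a set of weighted alternatives, and a domain rule prunes the implausible ones. Field names, values, and the uniform weighting below are assumptions.

```python
# Illustrative sketch (not the IMPrECISE XQuery module): two sources that
# disagree yield weighted alternatives, and a domain rule prunes the
# implausible one. Field names and uniform weights are assumptions.

record_a = {"name": "J. Smith", "age": 34}
record_b = {"name": "John Smith", "age": 340}  # likely a data-entry error

def integrate(values, plausible):
    """Keep plausible alternatives and spread probability uniformly."""
    kept = sorted(set(v for v in values if plausible(v)))
    return [(v, 1.0 / len(kept)) for v in kept] if kept else []

ages = integrate([record_a["age"], record_b["age"]],
                 plausible=lambda a: 0 <= a <= 120)
print(ages)  # [(34, 1.0)] -- the domain rule removed the uncertainty
```

Without the domain rule, both ages would survive as equally weighted alternatives; with it, the uncertainty collapses and no human intervention is needed.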
10 Years of Probabilistic Querying – What Next?
Over the past decade, the two research areas of probabilistic databases and probabilistic programming have intensively studied the problem of making structured probabilistic inference scalable, but — so far — both areas have developed almost independently of one another. While probabilistic databases have focused on describing tractable query classes based on the structure of query plans and data lineage, probabilistic programming has contributed sophisticated inference techniques based on knowledge compilation and lifted (first-order) inference. Both fields have developed their own variants of — both exact and approximate — top-k algorithms for query evaluation, and both investigate query optimization techniques known from SQL, Datalog, and Prolog, all of which calls for a more intensive study of the commonalities and for an integration of the two fields. Moreover, we believe that natural-language processing and information extraction will remain a driving factor, and in fact a longstanding challenge, for developing expressive representation models that can be combined with structured probabilistic inference — for decades to come.
Exploiting prior knowledge and latent variable representations for the statistical modeling and probabilistic querying of large knowledge graphs
Large knowledge graphs increasingly add great value to various applications that require machines to recognize and understand queries and their semantics, as in search or question answering systems. These applications include Google search, Bing search, and IBM's Watson, but also smart mobile assistants such as Apple's Siri, Google Now, or Microsoft's Cortana. Popular knowledge graphs like DBpedia, YAGO, or Freebase store a broad range of facts about the world, to a large extent derived from Wikipedia, currently the biggest web encyclopedia. In addition to these freely accessible open knowledge graphs, commercial ones have also evolved, including the well-known Google Knowledge Graph and Microsoft's Satori. Since the incompleteness and veracity of knowledge graphs are known problems, the statistical modeling of knowledge graphs has gained increasing attention in recent years. Some of the leading approaches are based on latent variable models, which show both excellent predictive performance and scalability. Latent variable models learn embedding representations of domain entities and relations (representation learning). From these embeddings, priors for every possible fact in the knowledge graph are generated, which can be exploited for data cleansing, completion, or as prior knowledge to support triple extraction from unstructured textual data, as successfully demonstrated by Google's Knowledge Vault project. However, large knowledge graphs impose constraints on the complexity of the latent embeddings learned by these models. For graphs with millions of entities and thousands of relation-types, latent variable models are required to exploit low-dimensional embeddings for entities and relation-types to remain tractable. The work described in this thesis extends the application of latent variable models for large knowledge graphs in three important dimensions.
First, it is shown how the integration of ontological constraints on the domain and range of relation-types enables latent variable models to exploit latent embeddings of reduced complexity for modeling large knowledge graphs. The integration of this prior knowledge into the models leads to a substantial increase in both predictive performance and scalability, with improvements of up to 77% in link-prediction tasks. Since manually designed domain and range constraints can be absent or fuzzy, we also propose and study an alternative approach based on a local closed-world assumption, which derives domain and range constraints from observed data without the need for prior knowledge extracted from the curated schema of the knowledge graph. We show that such an approach leads to similarly significant improvements in modeling quality. Further, we demonstrate that these two types of domain and range constraints are of general value to latent variable models by integrating and evaluating them on the current state of the art in latent variable models, represented by RESCAL, Translational Embedding, and the neural network approach used by the recently proposed Google Knowledge Vault system.
In the second part of the thesis it is shown that the three approaches just mentioned all perform well but do not share many commonalities in the way they model knowledge graphs. These differences can be exploited in ensemble solutions which improve the predictive performance even further.
The third part of the thesis concerns the efficient querying of the statistically modeled knowledge graphs. This thesis interprets statistically modeled knowledge graphs as probabilistic databases, where the latent variable models define a probability distribution over triples. From this perspective, link prediction is equivalent to querying ground triples, which is a standard functionality of the latent variable models. For more complex querying that involves, e.g.,
joins and projections, the theory of probabilistic databases provides evaluation rules. In this thesis it is shown how the intrinsic features of latent variable models can be combined with the theory of probabilistic databases to realize efficient probabilistic querying of the modeled graphs.
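The probabilistic-database view of a latent variable model can be sketched as follows: a RESCAL-style bilinear score, passed through a sigmoid, yields a triple probability, and a conjunctive query combines such probabilities. The embeddings, dimensions, and the naive independence assumption below are toy stand-ins, not the thesis's trained models or its evaluation rules.

```python
import math
import random

# Minimal RESCAL-style sketch: entities get embedding vectors, a relation
# gets a matrix, and a bilinear score mapped through a sigmoid gives a
# triple probability. Values are random toy data, not a trained model.

random.seed(0)
d = 4
entities = {e: [random.gauss(0, 1) for _ in range(d)] for e in "abc"}
relation = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

def triple_prob(s, o):
    """P((s, r, o)) = sigmoid(e_s^T R e_o)."""
    score = sum(entities[s][i] * relation[i][j] * entities[o][j]
                for i in range(d) for j in range(d))
    return 1.0 / (1.0 + math.exp(-score))

# Link prediction = querying a ground triple:
p_ab = triple_prob("a", "b")

# A simple conjunctive query under an independence assumption:
# P((a, r, b) AND (b, r, c)).
p_join = p_ab * triple_prob("b", "c")
print(0.0 < p_join <= p_ab < 1.0)  # True
```

A full treatment would apply the probabilistic-database evaluation rules for joins and projections instead of blanket independence, which is exactly the combination the thesis develops.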
Learning Credal Sum-Product Networks
Probabilistic representations, such as Bayesian and Markov networks, are
fundamental to much of statistical machine learning. Learning such
probabilistic representations directly from data is therefore a deep challenge,
the main computational bottleneck being intractable inference. Tractable
learning is a powerful new paradigm that attempts to learn distributions that
support efficient probabilistic querying. By leveraging local structure,
representations such as sum-product networks (SPNs) can capture high tree-width
models with many hidden layers, essentially a deep architecture, while still
allowing a range of probabilistic queries to be answered in time polynomial
in the network size. While the progress is impressive, numerous data sources
are incomplete, and in the presence of missing data, structure learning methods
nonetheless revert to single distributions without characterizing the loss in
confidence. In recent work, credal sum-product networks, an imprecise extension
of sum-product networks, were proposed to capture this robustness angle. In
this work, we are interested in how such representations can be learnt and thus
study how the computational machinery underlying tractable learning and
inference can be generalized to imprecise probabilities.
Comment: Accepted to AKBC 202
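A minimal example of why such querying is tractable: in a (precise) sum-product network, a single bottom-up pass answers a marginal query, with marginalized leaves set to 1. The two-variable structure and weights below are illustrative, not a learned network; the credal generalization would replace the fixed sum weights with sets of weights.

```python
# Toy sum-product network over two binary variables X1, X2, evaluated
# bottom-up; marginal queries set marginalized leaves to 1. Structure
# and weights are illustrative assumptions, not a learned SPN.

def leaf(p_true, value):
    """Bernoulli leaf; value=None means 'marginalized out'."""
    if value is None:
        return 1.0
    return p_true if value == 1 else 1.0 - p_true

def spn(x1, x2):
    # Two product nodes (mixture components) under one sum node.
    prod1 = leaf(0.8, x1) * leaf(0.3, x2)
    prod2 = leaf(0.2, x1) * leaf(0.9, x2)
    return 0.6 * prod1 + 0.4 * prod2

p_x1 = spn(1, None)                  # P(X1 = 1) in one linear pass
total = spn(1, None) + spn(0, None)  # should sum to 1
print(round(p_x1, 2), round(total, 2))  # 0.56 1.0
```

The cost of the marginal query is one evaluation per node, i.e. linear in the network size, which is the tractability property the credal extension aims to preserve under imprecision.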
Predictive Querying for Autoregressive Neural Sequence Models
In reasoning about sequential events it is natural to pose probabilistic
queries such as "when will event A occur next" or "what is the probability of A
occurring before B", with applications in areas such as user modeling,
medicine, and finance. However, with machine learning shifting towards neural
autoregressive models such as RNNs and transformers, probabilistic querying has
been largely restricted to simple cases such as next-event prediction. This is
in part due to the fact that future querying involves marginalization over
large path spaces, which is not straightforward to do efficiently in such
models. In this paper we introduce a general typology for predictive queries in
neural autoregressive sequence models and show that such queries can be
systematically represented by sets of elementary building blocks. We leverage
this typology to develop new query estimation methods based on beam search,
importance sampling, and hybrids. Across four large-scale sequence datasets
from different application domains, as well as for the GPT-2 language model, we
demonstrate the ability to make query answering tractable for arbitrary queries
in exponentially-large predictive path-spaces, and find clear differences in
cost-accuracy tradeoffs between search and sampling methods.
Comment: Oral presentation at the International Conference on Neural Information Processing Systems (NeurIPS 2022).
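The flavor of such a query can be sketched with plain Monte Carlo: sample paths from an autoregressive model and count those in which A precedes B. The fixed per-step categorical distribution below is a stand-in assumption for the paper's RNN/transformer models, chosen so that P(A before B) = 0.2 / (0.2 + 0.1) is known in closed form.

```python
import random

# Illustrative sketch: estimating P(event A occurs before event B) by
# sampling paths. The per-step categorical distribution is a toy
# stand-in for a neural autoregressive model (RNN/transformer).

random.seed(0)
VOCAB = ["A", "B", "other"]
PROBS = [0.2, 0.1, 0.7]

def sample_path(max_len=50):
    return [random.choices(VOCAB, PROBS)[0] for _ in range(max_len)]

def a_before_b(path):
    for tok in path:
        if tok == "A":
            return True
        if tok == "B":
            return False
    return False  # neither event occurred within the horizon

n = 10_000
est = sum(a_before_b(sample_path()) for _ in range(n)) / n
print(abs(est - 2/3) < 0.05)  # estimate is close to 0.2 / (0.2 + 0.1)
```

Naive sampling like this wastes effort on paths where neither event occurs soon, which is why the paper develops beam-search and importance-sampling estimators over the exponentially large path space.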
Scalable Probabilistic Similarity Ranking in Uncertain Databases (Technical Report)
This paper introduces a scalable approach for probabilistic top-k similarity
ranking on uncertain vector data. Each uncertain object is represented by a set
of vector instances that are assumed to be mutually exclusive. The objective is
to rank the uncertain data according to their distance to a reference object.
We propose a framework that incrementally computes, for each object instance and
ranking position, the probability of the object falling at that ranking
position. The resulting rank probability distribution can serve as input for
several state-of-the-art probabilistic ranking models. Existing approaches
compute this probability distribution by applying a dynamic programming
approach of quadratic complexity. In this paper we show, both theoretically and
experimentally, that our framework reduces this to linear-time complexity
while keeping the same memory requirements, facilitated by incremental accessing
of the uncertain vector instances in increasing order of their distance to the
reference object. Furthermore, we show how the output of our method can be used
to apply probabilistic top-k ranking for the objects, according to different
state-of-the-art definitions. We conduct an experimental evaluation on
synthetic and real data, which demonstrates the efficiency of our approach.
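The quadratic dynamic program that the paper improves upon can be sketched as follows: if each other object is closer to the reference than the instance at hand with some independent probability, the instance's rank distribution is a Poisson-binomial distribution. The probabilities below are illustrative, not from the paper's data.

```python
# Sketch of the quadratic dynamic-programming baseline: given
# independent probabilities p_i that each other object lies closer to
# the reference object, P(exactly k others are closer) -- i.e. the rank
# distribution -- is Poisson-binomial. Probabilities are illustrative.

def rank_distribution(closer_probs):
    """Return [P(rank 0), P(rank 1), ...] via dynamic programming."""
    dist = [1.0]
    for p in closer_probs:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1.0 - p)   # this object is farther
            new[k + 1] += q * p       # this object is closer
        dist = new
    return dist

dist = rank_distribution([0.5, 0.2])
print([round(x, 2) for x in dist])  # [0.4, 0.5, 0.1]
```

Each of the n objects updates a distribution of length up to n, giving the quadratic cost; the paper's contribution is reducing this to linear time by processing instances in increasing distance order.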
Probabilistic Inductive Querying Using ProbLog
We study how probabilistic reasoning and inductive querying can be combined within ProbLog, a recent probabilistic extension of Prolog. ProbLog can be regarded as a database system that supports both probabilistic and inductive reasoning through a variety of querying mechanisms. After a short introduction to ProbLog, we provide a survey of the different types of inductive queries that ProbLog supports, and show how it can be applied to the mining of large biological networks.
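ProbLog's distribution semantics can be sketched by brute-force enumeration of possible worlds: each probabilistic fact is independently present or absent, and the success probability of a query is the total weight of the worlds in which it holds. The graph and query below are toy assumptions in the spirit of ProbLog's probabilistic-path examples; ProbLog itself avoids enumeration by compiling queries into compact representations such as BDDs.

```python
from itertools import product

# Sketch of success-probability computation under the distribution
# semantics, by enumerating possible worlds. The probabilistic edges
# and the query ("is there a path a -> c?") are toy assumptions.

facts = {("a", "b"): 0.8, ("b", "c"): 0.6, ("a", "c"): 0.5}

def has_path(edges, src, dst):
    if src == dst:
        return True
    return any(u == src and has_path(edges - {(u, v)}, v, dst)
               for (u, v) in edges)

items = list(facts.items())
success = 0.0
for world in product([True, False], repeat=len(items)):
    weight = 1.0
    edges = set()
    for (edge, prob), present in zip(items, world):
        weight *= prob if present else 1.0 - prob
        if present:
            edges.add(edge)
    if has_path(edges, "a", "c"):
        success += weight
print(round(success, 2))  # 0.74
```

The answer matches the closed form P(ac) + P(not ac) * P(ab) * P(bc) = 0.5 + 0.5 * 0.48 = 0.74; enumeration is exponential in the number of facts, which is why knowledge compilation matters at the scale of biological networks.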