Coping with Incomplete Data: Recent Advances
Handling incomplete data in a correct manner is a notoriously hard problem in databases. Theoretical approaches rely on the computationally hard notion of certain answers, while practical solutions rely on ad hoc query evaluation techniques based on three-valued logic. Can we find a middle ground, and produce correct answers efficiently? The paper surveys results of the last few years motivated by this question. We re-examine the notion of certainty itself, and show that it is much more varied than previously thought. We identify cases when certain answers can be computed efficiently and, short of that, provide deterministic and probabilistic approximation schemes for them. We look at the role of three-valued logic as used in SQL query evaluation, and discuss the correctness of the choice, as well as the necessity of such a logic for producing query answers.
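The tension between SQL's three-valued evaluation and certainty can be sketched in a few lines (a minimal illustration, not taken from the paper; Kleene truth tables, with None standing for SQL's unknown):

```python
# Kleene three-valued logic as used by SQL for comparisons with NULL.
# Truth values: True, False, None (unknown).

def tv_and(a, b):
    # False dominates; otherwise any unknown makes the result unknown.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def tv_not(a):
    return None if a is None else (not a)

def tv_eq(x, y):
    # Any comparison involving NULL evaluates to unknown.
    if x is None or y is None:
        return None
    return x == y

# SQL's WHERE keeps a row only if the condition is True (not unknown).
rows = [(1, 10), (2, None), (3, 10)]
selected = [r for r in rows if tv_eq(r[1], 10) is True]
# Row (2, None) is dropped even though some completion of the NULL
# would satisfy the condition; for queries with negation, the same
# evaluation strategy can return rows that are not certain answers.
```

The sketch shows why three-valued evaluation and certain answers can diverge: dropping unknown rows is harmless for this positive selection, but composing it with NOT is where correctness becomes subtle.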
Approximate Assertional Reasoning Over Expressive Ontologies
In this thesis, approximate reasoning methods for scalable assertional reasoning are provided whose computational properties can be established in a well-understood way, namely in terms of soundness and completeness, and whose quality can be analyzed in terms of statistical measurements, namely recall and precision. The basic idea of these approximate reasoning methods is to speed up reasoning by trading off the quality of reasoning results against increased speed.
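The statistical quality measures mentioned above can be made concrete with a toy sketch (all answer sets are illustrative, not from the thesis): the answers of an approximate reasoner are compared against those of a sound-and-complete one.

```python
# Comparing an approximate reasoner against a sound-and-complete one on
# an instance-retrieval task. All answer sets are illustrative.
exact_answers = {"a", "b", "c", "d"}   # sound-and-complete reasoner
approx_answers = {"a", "b", "e"}       # faster, approximate reasoner

true_positives = len(exact_answers & approx_answers)  # 2
precision = true_positives / len(approx_answers)      # statistical soundness: 2/3
recall = true_positives / len(exact_answers)          # statistical completeness: 1/2
```

A sound method has precision 1 by construction; a complete one has recall 1. Approximate methods give up these guarantees and report the two ratios instead.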
Exploiting prior knowledge and latent variable representations for the statistical modeling and probabilistic querying of large knowledge graphs
Large knowledge graphs increasingly add great value to various applications that require machines to recognize and understand queries and their semantics, as in search or question answering systems. These applications include Google search, Bing search, IBM's Watson, but also smart mobile assistants such as Apple's Siri, Google Now or Microsoft's Cortana. Popular knowledge graphs like DBpedia, YAGO or Freebase store a broad range of facts about the world, to a large extent derived from Wikipedia, currently the biggest web encyclopedia. In addition to these freely accessible open knowledge graphs, commercial ones have also evolved, including the well-known Google Knowledge Graph or Microsoft's Satori. Since incompleteness and veracity of knowledge graphs are known problems, the statistical modeling of knowledge graphs has increasingly gained attention in recent years. Some of the leading approaches are based on latent variable models which show both excellent predictive performance and scalability. Latent variable models learn embedding representations of domain entities and relations (representation learning). From these embeddings, priors for every possible fact in the knowledge graph are generated which can be exploited for data cleansing, completion or as prior knowledge to support triple extraction from unstructured textual data, as successfully demonstrated by Google's Knowledge-Vault project. However, large knowledge graphs impose constraints on the complexity of the latent embeddings learned by these models. For graphs with millions of entities and thousands of relation-types, latent variable models are required to exploit low-dimensional embeddings for entities and relation-types to be tractable when applied to these graphs. The work described in this thesis extends the application of latent variable models for large knowledge graphs in three important dimensions.
First, it is shown how the integration of ontological constraints on the domain and range of relation-types enables latent variable models to exploit latent embeddings of reduced complexity for modeling large knowledge graphs. The integration of this prior knowledge into the models leads to a substantial increase both in predictive performance and scalability, with improvements of up to 77% in link-prediction tasks. Since manually designed domain and range constraints can be absent or fuzzy, we also propose and study an alternative approach based on a local closed-world assumption, which derives domain and range constraints from observed data without the need of prior knowledge extracted from the curated schema of the knowledge graph. We show that such an approach also leads to similar significant improvements in modeling quality. Further, we demonstrate that these two types of domain and range constraints are of general value to latent variable models by integrating and evaluating them on the current state of the art of latent variable models represented by RESCAL, Translational Embedding, and the neural network approach used by the recently proposed Google Knowledge Vault system. In the second part of the thesis it is shown that the three approaches just mentioned all perform well, but do not share many commonalities in the way they model knowledge graphs. These differences can be exploited in ensemble solutions which improve the predictive performance even further. The third part of the thesis concerns the efficient querying of the statistically modeled knowledge graphs. This thesis interprets statistically modeled knowledge graphs as probabilistic databases, where the latent variable models define a probability distribution for triples. From this perspective, link-prediction is equivalent to querying ground triples, which is a standard functionality of the latent variable models. For more complex querying that involves, e.g., joins and projections, the theory on probabilistic databases provides evaluation rules. In this thesis it is shown how the intrinsic features of latent variable models can be combined with the theory of probabilistic databases to realize efficient probabilistic querying of the modeled graphs.
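This probabilistic-database view can be sketched in a few lines (a minimal illustration under a tuple-independence assumption; the DistMult-style scoring function and all embedding values below are illustrative, not the thesis's trained models):

```python
import math

# Toy DistMult-style scoring (one family of latent variable models).
# Entities: 0..3, relations: 0 = lives_in, 1 = located_in; rank 2.
E = {0: (0.2, 0.9), 1: (0.7, 0.1), 2: (0.4, 0.5), 3: (0.8, 0.3)}
R = {0: (1.0, 0.5), 1: (0.6, 1.2)}

def triple_prob(s, r, o):
    # score(s, r, o) = sum_k e_s[k] * w_r[k] * e_o[k]; a sigmoid turns
    # the score into a triple probability for the database view.
    score = sum(es * wr * eo for es, wr, eo in zip(E[s], R[r], E[o]))
    return 1.0 / (1.0 + math.exp(-score))

# Probabilistic-database evaluation of the conjunctive query
#   Q(x) :- lives_in(x, c), located_in(c, country)
# under tuple independence: a join multiplies triple probabilities,
# and alternative derivations of the same answer combine by noisy-or.
def query_prob(x, cities, country):
    miss = 1.0
    for c in cities:
        p = triple_prob(x, 0, c) * triple_prob(c, 1, country)
        miss *= 1.0 - p
    return 1.0 - miss

p = query_prob(x=0, cities=[1, 2], country=3)  # probability that Q(0) holds
```

The independence and noisy-or combination rules are the standard extensional evaluation of probabilistic databases; the sketch only illustrates how a triple-scoring model plugs into them.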
Oblivious Bounds on the Probability of Boolean Functions
This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e., when the new probabilities are chosen independently of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.
Comment: 34 pages, 14 figures, supersedes: http://arxiv.org/abs/1105.281
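The dissociation idea can be illustrated on a tiny DNF (an illustrative sketch; exact probabilities are computed by brute-force enumeration, and the two bound choices follow the upper-bound and model-based patterns described above):

```python
from itertools import product

def brute_prob(formula, probs):
    # Exact probability of a Boolean function by enumerating all
    # assignments -- the #P-hard computation the bounds avoid.
    names = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        a = dict(zip(names, bits))
        weight = 1.0
        for n in names:
            weight *= probs[n] if a[n] else 1.0 - probs[n]
        if formula(a):
            total += weight
    return total

# f = (x1 AND y) OR (x2 AND y): the variable y occurs twice.
f = lambda a: (a["x1"] and a["y"]) or (a["x2"] and a["y"])
exact = brute_prob(f, {"x1": 0.6, "x2": 0.7, "y": 0.5})

# Dissociation replaces the two occurrences of y with fresh independent
# copies y1, y2. Giving each copy y's original probability yields an
# upper bound; setting one copy to 0 drops a disjunct and recovers a
# model-based lower bound.
g = lambda a: (a["x1"] and a["y1"]) or (a["x2"] and a["y2"])
upper = brute_prob(g, {"x1": 0.6, "x2": 0.7, "y1": 0.5, "y2": 0.5})
lower = brute_prob(g, {"x1": 0.6, "x2": 0.7, "y1": 0.5, "y2": 0.0})
# lower <= exact <= upper: here 0.3 <= 0.44 <= 0.545
```

The dissociated formula g is read-once, so in a real system its probability would be computed in linear time rather than by enumeration; enumeration is used here only to make the bound check self-contained.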
Algorithms for Provisioning Queries and Analytics
Provisioning is a technique for avoiding repeated expensive computations in what-if analysis. Given a query, an analyst formulates hypotheticals, each retaining some of the tuples of a database instance, possibly overlapping, and she wishes to answer the query under scenarios, where a scenario is defined by a subset of the hypotheticals that are "turned on". We say that a query admits compact provisioning if, given any database instance and any hypotheticals, one can create a sketch of size polynomial in the number of hypotheticals that can then be used to answer the query under any of the possible scenarios without accessing the original instance.
In this paper, we focus on provisioning complex queries that combine relational algebra (the logical component), grouping, and statistics/analytics (the numerical component). We first show that queries that compute quantiles or linear regression (as well as simpler queries that compute count and sum/average of positive values) can be compactly provisioned to provide (multiplicative) approximate answers to an arbitrary precision. In contrast, exact provisioning for each of these statistics requires the sketch size to be exponential in the number of hypotheticals. We then establish that for any complex query whose logical component is a positive relational algebra query, as long as the numerical component can be compactly provisioned, the complex query itself can be compactly provisioned. On the other hand, introducing negation or recursion in the logical component again requires the sketch size to be exponential in the number of hypotheticals. While our positive results use algorithms that do not access the original instance after a scenario is known, we prove our lower bounds even for the case when, knowing the scenario, limited access to the instance is allowed.
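As a toy illustration of the provisioning setup (not the paper's construction, whose approximate sketches are far smaller), a COUNT query can be provisioned by storing, per hypothetical, the ids of its qualifying tuples:

```python
# Toy provisioning of a COUNT query: the sketch stores, for each
# hypothetical, the ids of retained tuples satisfying the query
# predicate, so any scenario (a subset of "turned-on" hypotheticals,
# possibly overlapping) is answered by a set union -- without
# re-reading the database instance. All data below is illustrative.

def build_sketch(instance, hypotheticals, predicate):
    # hypotheticals: one set of retained tuple ids per hypothetical
    return [{t for t in h if predicate(instance[t])} for h in hypotheticals]

def answer_count(sketch, scenario):
    on = set().union(*(sketch[i] for i in scenario)) if scenario else set()
    return len(on)

instance = {0: 5, 1: -2, 2: 7, 3: 7}
hyps = [{0, 1}, {1, 2}, {2, 3}]          # overlapping hypotheticals
sketch = build_sketch(instance, hyps, lambda v: v > 0)

answer_count(sketch, [0, 1])  # counts tuples {0, 2} -> 2
```

Note the caveat: this sketch is polynomial in the instance size, not only in the number of hypotheticals, which is exactly why the paper's compact (approximate) constructions are needed for the statistics it studies.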