Infinite Probabilistic Databases
Probabilistic databases (PDBs) are used to model uncertainty in data in a quantitative way. In the standard formal framework, PDBs are finite probability spaces over relational database instances. It has been argued convincingly that this is not compatible with an open-world semantics (Ceylan et al., KR 2016) and with application scenarios that are modeled by continuous probability distributions (Dalvi et al., CACM 2009).
We recently introduced a model of PDBs as infinite probability spaces that addresses these issues (Grohe and Lindner, PODS 2019). While that work was mainly concerned with countably infinite probability spaces, our focus here is on uncountable spaces. Such an extension is necessary to model typical continuous probability distributions that appear in many applications. However, an extension beyond countable probability spaces raises nontrivial foundational issues concerned with the measurability of events and queries and ultimately with the question whether queries have a well-defined semantics.
It turns out that so-called finite point processes are the appropriate model from probability theory for dealing with probabilistic databases. This model allows us to construct suitable (uncountable) probability spaces of database instances in a systematic way. Our main technical results are measurability statements for relational algebra queries, aggregate queries, and Datalog queries.
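To make the continuous setting concrete, here is a minimal Python sketch (the relation, attributes, and distribution are illustrative, not taken from the paper): instances contain a normally distributed measurement value, so the space of possible instances is uncountable, and query probabilities can only be estimated by sampling.

    # Illustrative sketch: a probabilistic database whose instances contain
    # continuously distributed values, so the space of instances is uncountable.
    # Relation Temperature(city, reading); readings are normally distributed.
    import random

    CITIES = {"Berlin": (10.0, 3.0), "Madrid": (18.0, 4.0)}  # city -> (mean, stddev)

    def sample_instance():
        """Draw one database instance: a finite set of Temperature facts."""
        instance = set()
        for city, (mean, stddev) in CITIES.items():
            if random.random() < 0.9:                 # the fact may be absent
                reading = random.gauss(mean, stddev)  # continuous attribute value
                instance.add(("Temperature", city, reading))
        return instance

    # Monte-Carlo estimate of a query probability: "some reading exceeds 20"
    samples = [sample_instance() for _ in range(10_000)]
    prob = sum(any(r > 20 for (_, _, r) in inst) for inst in samples) / len(samples)
    print(f"P(some reading > 20) ~ {prob:.3f}")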
Infinite Probabilistic Databases
Probabilistic databases (PDBs) model uncertainty in data in a quantitative
way. In the established formal framework, probabilistic (relational) databases
are finite probability spaces over relational database instances. This
finiteness can clash with intuitive query behavior (Ceylan et al., KR 2016),
and with application scenarios that are better modeled by continuous
probability distributions (Dalvi et al., CACM 2009).
We formally introduced infinite PDBs in (Grohe and Lindner, PODS 2019) with a
primary focus on countably infinite spaces. However, an extension beyond
countable probability spaces raises nontrivial foundational issues concerned
with the measurability of events and queries and ultimately with the question
whether queries have a well-defined semantics.
We argue that finite point processes are an appropriate model from
probability theory for dealing with general probabilistic databases. This
allows us to construct suitable (uncountable) probability spaces of database
instances in a systematic way. Our main technical results are measurability
statements for relational algebra queries as well as aggregate queries and
Datalog queries.
Comment: This is the full version of the paper "Infinite Probabilistic
Databases" presented at ICDT 2020 (arXiv:1904.06766).
Tuple-Independent Representations of Infinite Probabilistic Databases
Probabilistic databases (PDBs) are probability spaces over database
instances. They provide a framework for handling uncertainty in databases, as
occurs due to data integration, noisy data, data from unreliable sources or
randomized processes. Most of the existing theory literature investigated
finite, tuple-independent PDBs (TI-PDBs) where the occurrences of tuples are
independent events. Only recently, Grohe and Lindner (PODS '19) introduced
independence assumptions for PDBs beyond the finite domain assumption. In the
finite setting, a major argument for discussing the theoretical properties of TI-PDBs
is that they can be used to represent any finite PDB via views. This is no
longer the case once the number of tuples is countably infinite. In this paper,
we systematically study the representability of infinite PDBs in terms of
TI-PDBs and the related block-independent disjoint PDBs.
The central question is which infinite PDBs are representable as first-order
views over tuple-independent PDBs. We give a necessary condition for the
representability of PDBs and provide a sufficient criterion for
representability in terms of the probability distribution of a PDB. With
various examples, we explore the limits of our criteria. We show that
conditioning on first-order properties yields no additional power in terms of
expressivity. Finally, we discuss the relation between purely logical and
arithmetic reasons for (non-)representability.
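As a concrete reminder of what tuple-independence means, the following minimal sketch (relation and probabilities are hypothetical) computes the probability of a specific world of a finite TI-PDB: each tuple is an independent event, so a world's probability is the product of the marginals of its present facts and the complements for its absent ones.

    # Hypothetical finite tuple-independent PDB: each fact carries an
    # independent marginal probability of being present.
    from itertools import chain, combinations

    facts = {                      # fact -> marginal probability
        ("R", "a"): 0.3,
        ("R", "b"): 0.5,
        ("S", "a", "b"): 0.8,
    }

    def world_probability(world):
        """Probability of a concrete instance under tuple-independence:
        product of p for present facts and (1 - p) for absent ones."""
        p = 1.0
        for fact, prob in facts.items():
            p *= prob if fact in world else (1.0 - prob)
        return p

    # Sanity check: the probabilities of all 2^n worlds sum to 1.
    all_worlds = chain.from_iterable(
        combinations(facts, k) for k in range(len(facts) + 1))
    print(sum(world_probability(set(w)) for w in all_worlds))   # ~1.0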
Probabilistic Data with Continuous Distributions
Statistical models of real world data typically involve continuous
probability distributions such as normal, Laplace, or exponential
distributions. Such distributions are supported by many probabilistic modelling
formalisms, including probabilistic database systems. Yet, the traditional
theoretical framework of probabilistic databases focusses entirely on finite
probabilistic databases.
Only recently, we set out to develop the mathematical theory of infinite
probabilistic databases. The present paper is an exposition of two recent
papers which are cornerstones of this theory. In (Grohe, Lindner; ICDT 2020) we
propose a very general framework for probabilistic databases, possibly
involving continuous probability distributions, and show that queries have a
well-defined semantics in this framework. In (Grohe, Kaminski, Katoen, Lindner;
PODS 2020) we extend the declarative probabilistic programming language
Generative Datalog, proposed by (Bárány et al., 2017) for discrete
probability distributions, to continuous probability distributions and show
that such programs yield generative models of continuous probabilistic
databases.
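The flavor of such a generative model can be sketched in Python (this is not the Generative Datalog syntax of the cited papers; relation names and distributions are made up for illustration): a program derives facts whose attribute values are drawn from continuous distributions, and each run yields one instance of a continuous probabilistic database.

    # Illustrative generative process in the spirit of a Datalog program
    # with continuous distributions: derived facts carry sampled values.
    import random

    def generate_instance():
        sensors = [("s1", 20.0), ("s2", 23.5)]          # Sensor(id, true_value)
        instance = {("Sensor", sid, val) for sid, val in sensors}
        # Informal rule with a distribution term in the head (notation
        # illustrative): Reading(sid, Normal[val, 1.0]) :- Sensor(sid, val).
        for sid, val in sensors:
            noisy = random.gauss(val, 1.0)              # continuous sample
            instance.add(("Reading", sid, noisy))
        return instance

    print(generate_instance())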
The Dichotomy of Evaluating Homomorphism-Closed Queries on Probabilistic Graphs
We study the problem of probabilistic query evaluation on probabilistic
graphs, namely, tuple-independent probabilistic databases on signatures of
arity two. Our focus is the class of queries that is closed under
homomorphisms, or equivalently, the infinite unions of conjunctive queries. Our
main result states that all unbounded queries from this class are #P-hard for
probabilistic query evaluation. As bounded queries from this class are
equivalent to a union of conjunctive queries, they are already classified by
the dichotomy of Dalvi and Suciu (2012). Hence, our result and theirs imply a
complete data complexity dichotomy, between polynomial time and #P-hardness,
for evaluating infinite unions of conjunctive queries over probabilistic
graphs. This dichotomy covers in particular all fragments of infinite unions of
conjunctive queries such as negation-free (disjunctive) Datalog, regular path
queries, and a large class of ontology-mediated queries on arity-two
signatures. Our result is shown by reducing from counting the valuations of
positive partitioned 2-DNF formulae for some queries, or from the
source-to-target reliability problem in an undirected graph for other queries,
depending on properties of minimal models. The presented dichotomy result
applies to even a special case of probabilistic query evaluation called
generalized model counting, where fact probabilities must be 0, 0.5, or 1.
Comment: 30 pages. Journal version of the ICDT'20 paper
https://drops.dagstuhl.de/opus/volltexte/2020/11939/. Submitted to LMCS. The
previous version (version 2) was the same as the ICDT'20 paper with some
minor formatting tweaks and 7 extra pages of technical appendix.
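To make the reduction source concrete, the sketch below (variables and formula chosen for illustration) counts the satisfying valuations of a positive partitioned 2-DNF formula by brute force; counting such valuations is the canonical #P-hard problem behind hardness proofs for probabilistic query evaluation.

    # Brute-force counting of satisfying valuations of a positive partitioned
    # 2-DNF formula: clauses x_i AND y_j with variables split into X and Y.
    from itertools import product

    X = ["x1", "x2"]
    Y = ["y1", "y2"]
    clauses = [("x1", "y1"), ("x1", "y2"), ("x2", "y2")]   # illustrative formula

    count = 0
    for bits in product([False, True], repeat=len(X) + len(Y)):
        val = dict(zip(X + Y, bits))
        if any(val[x] and val[y] for x, y in clauses):     # DNF: some clause true
            count += 1
    print(count)   # satisfying valuations out of 2**(|X| + |Y|)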
Duplicate Detection in Probabilistic Data
Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain (i.e., non-probabilistic) source data, relational or XML. There is no work so far on the integration of uncertain, especially probabilistic, source data. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, to increase the efficiency of the duplicate detection process, we introduce search space reduction methods adapted to probabilistic data.
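A minimal sketch of the idea, with made-up attributes, similarity measure, and threshold (not the paper's actual algorithm): probabilistic records are compared by how much their value distributions overlap, and a cheap blocking key prunes the quadratic comparison space.

    # Hypothetical sketch: duplicate detection over probabilistic records.
    # Each record maps an attribute to a distribution over possible values.
    records = {
        "r1": {"name": {"John Smith": 0.7, "Jon Smith": 0.3}, "city": {"Bonn": 1.0}},
        "r2": {"name": {"John Smith": 0.9, "J. Smith": 0.1}, "city": {"Bonn": 1.0}},
        "r3": {"name": {"Mary Jones": 1.0}, "city": {"Koeln": 1.0}},
    }

    def overlap(dist_a, dist_b):
        """Probability that two independent draws agree (a simple similarity)."""
        return sum(p * dist_b.get(v, 0.0) for v, p in dist_a.items())

    def blocking_key(rec):
        """Search space reduction: only compare records whose most likely
        name starts with the same letter."""
        return max(rec["name"], key=rec["name"].get)[0]

    pairs = [(a, b) for a in records for b in records
             if a < b and blocking_key(records[a]) == blocking_key(records[b])]
    for a, b in pairs:
        sim = overlap(records[a]["name"], records[b]["name"])
        print(a, b, "duplicate?", sim > 0.5, f"(sim={sim:.2f})")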
Characterizing the Sample Complexity of Private Learners
In 2008, Kasiviswanathan et al. defined private learning as a combination of
PAC learning and differential privacy. Informally, a private learner is applied
to a collection of labeled individual information and outputs a hypothesis
while preserving the privacy of each individual. Kasiviswanathan et al. gave a
generic construction of private learners for (finite) concept classes, with
sample complexity logarithmic in the size of the concept class. This sample
complexity is higher than what is needed for non-private learners, hence
leaving open the possibility that the sample complexity of private learning may
be sometimes significantly higher than that of non-private learning.
We give a combinatorial characterization of the sample size sufficient and
necessary to privately learn a class of concepts. This characterization is
analogous to the well-known characterization of the sample complexity of
non-private learning in terms of the VC dimension of the concept class. We
introduce the notion of probabilistic representation of a concept class, and
our new complexity measure RepDim corresponds to the size of the smallest
probabilistic representation of the concept class.
We show that any private learning algorithm for a concept class C with sample
complexity m implies RepDim(C)=O(m), and that there exists a private learning
algorithm with sample complexity m=O(RepDim(C)). We further demonstrate that a
similar characterization holds for the database size needed for privately
computing a large class of optimization problems and also for the well-studied
problem of private data release.
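Restating the characterization from the abstract in symbols (the notation for the private sample complexity is ours):

    \[
      m_{\text{priv}}(C) \;=\; \Theta(\mathrm{RepDim}(C)),
    \]

since any private learner for C with sample complexity m yields \(\mathrm{RepDim}(C) = O(m)\), and conversely there is a private learner with sample complexity \(O(\mathrm{RepDim}(C))\).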