Integrity Constraints Revisited: From Exact to Approximate Implication
Integrity constraints such as functional dependencies (FD) and multi-valued
dependencies (MVD) are fundamental in database schema design. Likewise,
probabilistic conditional independences (CI) are crucial for reasoning about
multivariate probability distributions. The implication problem studies whether
a set of constraints (antecedents) implies another constraint (consequent), and
has been investigated in both the database and the AI literature, under the
assumption that all constraints hold exactly. However, many applications today
consider constraints that hold only approximately. In this paper we define an
approximate implication as a linear inequality between the degree of
satisfaction of the antecedents and consequent, and we study the relaxation
problem: when does an exact implication relax to an approximate implication? We
use information theory to define the degree of satisfaction, and prove several
results. First, we show that any implication from a set of data dependencies
(MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most
quadratic in the number of variables; when the consequent is an FD, the factor
can be reduced to 1. Second, we prove that there exists an implication between
CIs that does not admit any relaxation; however, we prove that every
implication between CIs relaxes "in the limit". Finally, we show that the
implication problem for differential constraints in market basket analysis also
admits a relaxation with a factor equal to 1. Our results recover, and
sometimes extend, several previously known results about the implication
problem: implication of MVDs can be checked by considering only 2-tuple
relations, and the implication of differential constraints for frequent item
sets can be checked by considering only databases containing a single
transaction.
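As a concrete illustration of the information-theoretic degree of satisfaction described above, the following sketch (not taken from the paper; the relation and attribute names are made up) measures how far a relation is from satisfying an FD or an MVD under the empirical distribution over its tuples: the FD X -> Y is measured by the conditional entropy H(Y|X), and the MVD X ->> Y (with Z the remaining attributes) by the conditional mutual information I(Y;Z|X); both are zero exactly when the constraint holds exactly.
```python
# Sketch (not from the paper): degree of satisfaction of an FD or MVD,
# computed from the uniform distribution over the tuples of a relation.
from collections import Counter
from math import log2

def entropy(rows, attrs):
    """Entropy (bits) of the projection onto `attrs` of the relation `rows`."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def fd_violation(rows, X, Y):
    """H(Y|X) = H(XY) - H(X): zero iff the FD X -> Y holds exactly."""
    return entropy(rows, X + Y) - entropy(rows, X)

def mvd_violation(rows, X, Y, Z):
    """I(Y;Z|X) = H(XY) + H(XZ) - H(XYZ) - H(X): zero iff X ->> Y holds."""
    return (entropy(rows, X + Y) + entropy(rows, X + Z)
            - entropy(rows, X + Y + Z) - entropy(rows, X))

# Tiny made-up example: the FD A -> B fails on one tuple, so H(B|A) > 0.
R = [{"A": 1, "B": 1, "C": 1},
     {"A": 1, "B": 1, "C": 2},
     {"A": 1, "B": 2, "C": 1},
     {"A": 2, "B": 3, "C": 1}]
print(fd_violation(R, ["A"], ["B"]))          # > 0: FD holds only approximately
print(mvd_violation(R, ["A"], ["B"], ["C"]))  # > 0: degree of MVD violation
```
An approximate implication with factor lambda, in this reading, bounds the consequent's measure by lambda times the sum of the antecedents' measures.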
Approximate Implication for Probabilistic Graphical Models
The graphical structure of Probabilistic Graphical Models (PGMs) represents
the conditional independence (CI) relations that hold in the modeled
distribution. Every separator in the graph represents a conditional
independence relation in the distribution, making separators the vehicle through
which new conditional independencies are inferred and verified. The notion of
separation in graphs depends on whether the graph is directed (i.e., a Bayesian
Network), or undirected (i.e., a Markov Network).
The premise of all current systems of inference for deriving CIs in PGMs is
that the set of CIs used for the construction of the PGM holds exactly. In
practice, algorithms for extracting the structure of PGMs from data discover
approximate CIs that do not hold exactly in the distribution. In this paper, we
ask how the error in this set propagates to the inferred CIs read off the
graphical structure. More precisely, what guarantee can we provide on the
inferred CI when the set of CIs that entailed it holds only approximately? It
has recently been shown that in the general case, no such guarantee can be
provided.
In this work, we prove new negative and positive results concerning this
problem. We prove that separators in undirected PGMs do not necessarily
represent approximate CIs. That is, no guarantee can be provided for CIs
inferred from the structure of undirected graphs. We prove that such a
guarantee exists for the set of CIs inferred in directed graphical models,
making the d-separation algorithm a sound and complete system for inferring
approximate CIs. We also establish improved approximation guarantees for
independence relations derived from marginal and saturated CIs.
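For readers who want to experiment with the directed case, here is a minimal, self-contained sketch (not the paper's implementation) of the standard d-separation test via the moralized ancestral graph, assuming the DAG is given as a map from each node to its list of parents.
```python
# Illustrative sketch: d-separation in a DAG via the moralized ancestral graph.
from collections import deque

def d_separated(parents, X, Y, Z):
    """True iff X and Y are d-separated by Z in the DAG `parents`."""
    X, Y, Z = set(X), set(Y), set(Z)

    # 1. Restrict to the ancestral closure of X | Y | Z.
    keep, stack = set(), list(X | Y | Z)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, []))

    # 2. Moralize: undirected skeleton plus edges between co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, []) if p in keep]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])

    # 3. Delete the conditioning set and test reachability from X to Y.
    queue, seen = deque(X - Z), set(X - Z)
    while queue:
        v = queue.popleft()
        if v in Y:
            return False          # an active path exists
        for u in adj[v] - Z:
            if u not in seen:
                seen.add(u); queue.append(u)
    return True

# Collider A -> C <- B: A and B are d-separated by {} but not by {C}.
dag = {"A": [], "B": [], "C": ["A", "B"]}
print(d_separated(dag, {"A"}, {"B"}, set()))   # True
print(d_separated(dag, {"A"}, {"B"}, {"C"}))   # False
```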
On the Enumeration of all Minimal Triangulations
We present an algorithm that enumerates all the minimal triangulations of a
graph in incremental polynomial time. Consequently, we get an algorithm for
enumerating all the proper tree decompositions, in incremental polynomial time,
where "proper" means that the tree decomposition cannot be improved by removing
or splitting a bag.
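For background, the sketch below shows the classic elimination-ordering (fill-in) procedure that produces a triangulation, i.e., a chordal supergraph. It does not guarantee minimality and is not the enumeration algorithm itself; it only illustrates the objects being enumerated. The graph and ordering are made up.
```python
# Background sketch: triangulate a graph by eliminating vertices in a given
# order, returning the fill edges that were added.
def triangulate(adj, order):
    """`adj` maps each vertex to the set of its neighbours."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    fill = set()
    for v in order:
        nbrs = list(adj[v])
        # Make the neighbourhood of v a clique, recording new (fill) edges.
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in adj[a]:
                    adj[a].add(b); adj[b].add(a)
                    fill.add(frozenset((a, b)))
        # Eliminate v.
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
    return fill

# A 4-cycle 1-2-3-4-1 needs one chord; which chord depends on the order.
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(triangulate(cycle, [1, 2, 3, 4]))   # {frozenset({2, 4})}
```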
Quantifying the Loss of Acyclic Join Dependencies
Acyclic schemes possess known benefits for database design, speeding up
queries, and reducing space requirements. An acyclic join dependency (AJD) is
lossless with respect to a universal relation if joining the projections
associated with the schema results in the original universal relation. An
intuitive and standard measure of loss entailed by an AJD is the number of
redundant tuples generated by the acyclic join. Recent work has shown that the
loss of an AJD can also be characterized by an information-theoretic measure.
Motivated by the problem of automatically fitting an acyclic schema to a
universal relation, we investigate the connection between these two
characterizations of loss. We first show that the loss of an AJD is captured
using the notion of KL-divergence. We then show that the KL-divergence can be
used to bound the number of redundant tuples. We prove a deterministic lower
bound on the percentage of redundant tuples. For an upper bound, we propose a
random database model, and establish a high probability bound on the percentage
of redundant tuples, which coincides with the lower bound for large databases.
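The following sketch illustrates the two quantities being related, restricted for simplicity to a two-component decomposition of R[XYZ] into pi_XY(R) and pi_XZ(R) (the general acyclic case in the paper is more involved). It counts the redundant tuples generated by the join and computes the information-theoretic loss I(Y;Z|X) of the empirical distribution, which equals the KL-divergence between the joint distribution and the distribution induced by the decomposition; attribute names are made up.
```python
# Sketch (not the paper's code): redundant tuples vs. information loss for a
# two-component join decomposition.
from collections import Counter, defaultdict
from math import log2

def entropy(rows, attrs):
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_loss(rows, X, Y, Z):
    """I(Y;Z|X) under the uniform distribution over the tuples of R."""
    return (entropy(rows, X + Y) + entropy(rows, X + Z)
            - entropy(rows, X + Y + Z) - entropy(rows, X))

def redundant_tuples(rows, X, Y, Z):
    """|pi_XY(R) JOIN pi_XZ(R)| - |R|, the count of spurious tuples."""
    R = {tuple(r[a] for a in X + Y + Z) for r in rows}
    ys, zs = defaultdict(set), defaultdict(set)
    for r in rows:
        x = tuple(r[a] for a in X)
        ys[x].add(tuple(r[a] for a in Y))
        zs[x].add(tuple(r[a] for a in Z))
    return sum(len(ys[x]) * len(zs[x]) for x in ys) - len(R)

R = [{"X": 1, "Y": "a", "Z": 10},
     {"X": 1, "Y": "b", "Z": 20},
     {"X": 2, "Y": "c", "Z": 30}]
print(redundant_tuples(R, ["X"], ["Y"], ["Z"]))  # 2 spurious tuples
print(info_loss(R, ["X"], ["Y"], ["Z"]))         # > 0: the AJD is lossy
```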
Integrity Constraints Revisited: From Exact to Approximate Implication
Integrity constraints such as functional dependencies (FD) and multi-valued
dependencies (MVD) are fundamental in database schema design. Likewise,
probabilistic conditional independences (CI) are crucial for reasoning about
multivariate probability distributions. The implication problem studies whether
a set of constraints (antecedents) implies another constraint (consequent), and
has been investigated in both the database and the AI literature, under the
assumption that all constraints hold exactly. However, many applications today
consider constraints that hold only approximately. In this paper we define an
approximate implication as a linear inequality between the degree of
satisfaction of the antecedents and consequent, and we study the relaxation
problem: when does an exact implication relax to an approximate implication? We
use information theory to define the degree of satisfaction, and prove several
results. First, we show that any implication from a set of data dependencies
(MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most
quadratic in the number of variables; when the consequent is an FD, the factor
can be reduced to 1. Second, we prove that there exists an implication between
CIs that does not admit any relaxation; however, we prove that every
implication between CIs relaxes "in the limit". Then, we show that the
implication problem for differential constraints in market basket analysis also
admits a relaxation with a factor equal to 1. Finally, we show how some of the
results in the paper can be derived using the I-measure theory, which relates
information-theoretic measures to set theory. Our results recover, and
sometimes extend, previously known results about the implication problem: the
implication of MVDs and FDs can be checked by considering only 2-tuple
relations.
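As a small numeric illustration of a relaxation with factor 1 (an example chosen here, not taken from the paper): the contraction implication {I(X;Y|Z)=0, I(X;Z)=0} => I(X;YZ)=0 relaxes exactly, because the chain rule I(X;YZ) = I(X;Z) + I(X;Y|Z) holds for every distribution, so the consequent's measure never exceeds the sum of the antecedents' measures even when they are only approximately zero. The sketch below checks this identity on a random discrete distribution.
```python
# Numeric check of the chain rule underlying a factor-1 relaxation.
import itertools, random
from math import log2

def random_joint(k=2):
    """Random joint distribution over three k-ary variables X, Y, Z."""
    w = {xyz: random.random() for xyz in itertools.product(range(k), repeat=3)}
    s = sum(w.values())
    return {xyz: p / s for xyz, p in w.items()}

def H(p, axes):
    """Entropy of the marginal on the given coordinate positions."""
    m = {}
    for xyz, pr in p.items():
        key = tuple(xyz[i] for i in axes)
        m[key] = m.get(key, 0.0) + pr
    return -sum(pr * log2(pr) for pr in m.values() if pr > 0)

p = random_joint()
X, Y, Z = 0, 1, 2
I_X_YZ  = H(p, [X]) + H(p, [Y, Z]) - H(p, [X, Y, Z])
I_X_Z   = H(p, [X]) + H(p, [Z]) - H(p, [X, Z])
I_XY_gZ = H(p, [X, Z]) + H(p, [Y, Z]) - H(p, [X, Y, Z]) - H(p, [Z])
print(abs(I_X_YZ - (I_X_Z + I_XY_gZ)) < 1e-9)   # chain rule: always True
```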
Approximate Inference of Outcomes in Probabilistic Elections
We study the complexity of estimating the probability of an outcome in an election over probabilistic votes. The focus is on voting rules expressed as positional scoring rules, and two models of probabilistic voters: the uniform distribution over the completions of a partial voting profile (consisting of a partial ordering of the candidates by each voter), and the Repeated Insertion Model (RIM) over the candidates, including the special case of the Mallows distribution. Past research has established that, while exact inference of the probability of winning is computationally hard (#P-hard), an additive polynomial-time approximation (additive FPRAS) is attained by sampling and averaging. There is often, though, a need for multiplicative approximation guarantees that are crucial for important measures such as conditional probabilities. Unfortunately, a multiplicative approximation of the probability of winning cannot be efficient (under conventional complexity assumptions) since it is already NP-complete to determine whether this probability is nonzero. In contrast, we devise multiplicative polynomial-time approximations (multiplicative FPRAS) for the probability of the complement event, namely, losing the election.
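The "sampling and averaging" additive approximation mentioned above can be illustrated with a deliberately simplified voter model in which every voter draws a uniformly random ranking (a stand-in for the uniform-completion and RIM models of the paper, whose samplers are more involved); the candidates and scoring rule below are made up.
```python
# Monte Carlo sketch: additive approximation of a winning probability by
# averaging an indicator over sampled vote profiles.
import random

CANDIDATES = ["a", "b", "c"]
BORDA = [2, 1, 0]                      # positional scoring rule

def sample_profile(n_voters):
    """One random profile: each voter is an independent random ranking."""
    return [random.sample(CANDIDATES, len(CANDIDATES)) for _ in range(n_voters)]

def winners(profile):
    """Set of co-winners under the positional scoring rule."""
    score = {c: 0 for c in CANDIDATES}
    for ranking in profile:
        for pos, c in enumerate(ranking):
            score[c] += BORDA[pos]
    best = max(score.values())
    return {c for c, s in score.items() if s == best}

def estimate_win_prob(candidate, n_voters, n_samples=20_000):
    """Averaging the indicator over i.i.d. samples gives an additive estimate."""
    hits = sum(candidate in winners(sample_profile(n_voters))
               for _ in range(n_samples))
    return hits / n_samples

print(estimate_win_prob("a", n_voters=5))   # by symmetry, a bit above 1/3
```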
Probabilistic Inference Over Repeated Insertion Models
Distributions over rankings are used to model user preferences in various settings, including political elections and electronic commerce. The Repeated Insertion Model (RIM) gives rise to various known probability distributions over rankings, in particular to the popular Mallows model. However, probabilistic inference on RIM is computationally challenging, and provably intractable in the general case. In this paper we propose an algorithm for computing the marginal probability of an arbitrary partially ordered set over RIM. We analyze the complexity of the algorithm in terms of properties of the model and the partial order, captured by a novel measure termed the "cover width." We also conduct an experimental study of the algorithm over serial and parallelized implementations. Building upon the relationship between inference with rank distributions and counting linear extensions, we investigate the inference problem when restricted to partial orders that lend themselves to efficient counting of their linear extensions.
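As a small illustration of how RIM generates rankings, the following hedged sketch samples from the Mallows model via repeated insertion: item i of the reference ranking is inserted at position j in {1,...,i} with probability proportional to phi^(i-j), so phi = 1 gives the uniform distribution and small phi concentrates mass near the reference ranking. Parameter names are illustrative.
```python
# Sketch: sampling a ranking from Mallows(reference, phi) by repeated insertion.
import random

def sample_mallows(reference, phi):
    ranking = []
    for i, item in enumerate(reference, start=1):
        # Position j in {1,...,i} gets weight phi**(i-j).
        weights = [phi ** (i - j) for j in range(1, i + 1)]
        j = random.choices(range(1, i + 1), weights=weights)[0]
        ranking.insert(j - 1, item)
    return ranking

print(sample_mallows(["a", "b", "c", "d"], phi=0.3))  # usually close to a,b,c,d
```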