398 research outputs found
A novel Boolean kernels family for categorical data
Kernel based classifiers, such as SVM, are considered state-of-the-art algorithms and are widely used on many classification tasks. However, this kind of methods are hardly interpretable and for this reason they are often considered as black-box models. In this paper, we propose a new family of Boolean kernels for categorical data where features correspond to propositional formulas applied to the input variables. The idea is to create human-readable features to ease the extraction of interpretation rules directly from the embedding space. Experiments on artificial and benchmark datasets show the effectiveness of the proposed family of kernels with respect to established ones, such as RBF, in terms of classification accuracy
Achieving New Upper Bounds for the Hypergraph Duality Problem through Logic
The hypergraph duality problem DUAL is defined as follows: given two simple
hypergraphs and , decide whether
consists precisely of all minimal transversals of (in which case
we say that is the dual of ). This problem is
equivalent to deciding whether two given non-redundant monotone DNFs are dual.
It is known that non-DUAL, the complementary problem to DUAL, is in
, where
denotes the complexity class of all problems that after a nondeterministic
guess of bits can be decided (checked) within complexity class
. It was conjectured that non-DUAL is in . In this paper we prove this conjecture and actually
place the non-DUAL problem into the complexity class which is a subclass of . We here refer to the logtime-uniform version of
, which corresponds to , i.e., first order
logic augmented by counting quantifiers. We achieve the latter bound in two
steps. First, based on existing problem decomposition methods, we develop a new
nondeterministic algorithm for non-DUAL that requires to guess
bits. We then proceed by a logical analysis of this algorithm, allowing us to
formulate its deterministic part in . From this result, by
the well known inclusion , it follows
that DUAL belongs also to . Finally, by exploiting
the principles on which the proposed nondeterministic algorithm is based, we
devise a deterministic algorithm that, given two hypergraphs and
, computes in quadratic logspace a transversal of
missing in .Comment: Restructured the presentation in order to be the extended version of
a paper that will shortly appear in SIAM Journal on Computin
Queries with Guarded Negation (full version)
A well-established and fundamental insight in database theory is that
negation (also known as complementation) tends to make queries difficult to
process and difficult to reason about. Many basic problems are decidable and
admit practical algorithms in the case of unions of conjunctive queries, but
become difficult or even undecidable when queries are allowed to contain
negation. Inspired by recent results in finite model theory, we consider a
restricted form of negation, guarded negation. We introduce a fragment of SQL,
called GN-SQL, as well as a fragment of Datalog with stratified negation,
called GN-Datalog, that allow only guarded negation, and we show that these
query languages are computationally well behaved, in terms of testing query
containment, query evaluation, open-world query answering, and boundedness.
GN-SQL and GN-Datalog subsume a number of well known query languages and
constraint languages, such as unions of conjunctive queries, monadic Datalog,
and frontier-guarded tgds. In addition, an analysis of standard benchmark
workloads shows that most usage of negation in SQL in practice is guarded
negation
Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover
Stochastic Boolean Function Evaluation is the problem of determining the
value of a given Boolean function f on an unknown input x, when each bit of x_i
of x can only be determined by paying an associated cost c_i. The assumption is
that x is drawn from a given product distribution, and the goal is to minimize
the expected cost. This problem has been studied in Operations Research, where
it is known as "sequential testing" of Boolean functions. It has also been
studied in learning theory in the context of learning with attribute costs. We
consider the general problem of developing approximation algorithms for
Stochastic Boolean Function Evaluation. We give a 3-approximation algorithm for
evaluating Boolean linear threshold formulas. We also present an approximation
algorithm for evaluating CDNF formulas (and decision trees) achieving a factor
of O(log kd), where k is the number of terms in the DNF formula, and d is the
number of clauses in the CNF formula. In addition, we present approximation
algorithms for simultaneous evaluation of linear threshold functions, and for
ranking of linear functions.
Our function evaluation algorithms are based on reductions to the Stochastic
Submodular Set Cover (SSSC) problem. This problem was introduced by Golovin and
Krause. They presented an approximation algorithm for the problem, called
Adaptive Greedy. Our main technical contribution is a new approximation
algorithm for the SSSC problem, which we call Adaptive Dual Greedy. It is an
extension of the Dual Greedy algorithm for Submodular Set Cover due to Fujito,
which is a generalization of Hochbaum's algorithm for the classical Set Cover
Problem. We also give a new bound on the approximation achieved by the Adaptive
Greedy algorithm of Golovin and Krause
Separation of Test-Free Propositional Dynamic Logics over Context-Free Languages
For a class L of languages let PDL[L] be an extension of Propositional
Dynamic Logic which allows programs to be in a language of L rather than just
to be regular. If L contains a non-regular language, PDL[L] can express
non-regular properties, in contrast to pure PDL.
For regular, visibly pushdown and deterministic context-free languages, the
separation of the respective PDLs can be proven by automata-theoretic
techniques. However, these techniques introduce non-determinism on the automata
side. As non-determinism is also the difference between DCFL and CFL, these
techniques seem to be inappropriate to separate PDL[DCFL] from PDL[CFL].
Nevertheless, this separation is shown but for programs without test operators.Comment: In Proceedings GandALF 2011, arXiv:1106.081
FPTAS for Counting Monotone CNF
A monotone CNF formula is a Boolean formula in conjunctive normal form where
each variable appears positively. We design a deterministic fully
polynomial-time approximation scheme (FPTAS) for counting the number of
satisfying assignments for a given monotone CNF formula when each variable
appears in at most clauses. Equivalently, this is also an FPTAS for
counting set covers where each set contains at most elements. If we allow
variables to appear in a maximum of clauses (or sets to contain
elements), it is NP-hard to approximate it. Thus, this gives a complete
understanding of the approximability of counting for monotone CNF formulas. It
is also an important step towards a complete characterization of the
approximability for all bounded degree Boolean #CSP problems. In addition, we
study the hypergraph matching problem, which arises naturally towards a
complete classification of bounded degree Boolean #CSP problems, and show an
FPTAS for counting 3D matchings of hypergraphs with maximum degree .
Our main technique is correlation decay, a powerful tool to design
deterministic FPTAS for counting problems defined by local constraints among a
number of variables. All previous uses of this design technique fall into two
categories: each constraint involves at most two variables, such as independent
set, coloring, and spin systems in general; or each variable appears in at most
two constraints, such as matching, edge cover, and holant problem in general.
The CNF problems studied here have more complicated structures than these
problems and require new design and proof techniques. As it turns out, the
technique we developed for the CNF problem also works for the hypergraph
matching problem. We believe that it may also find applications in other CSP or
more general counting problems.Comment: 24 pages, 2 figures. version 1=>2: minor edits, highlighted the
picture of set cover/packing, and an implication of our previous result in 3D
matchin
Oblivious Bounds on the Probability of Boolean Functions
This paper develops upper and lower bounds for the probability of Boolean
functions by treating multiple occurrences of variables as independent and
assigning them new individual probabilities. We call this approach dissociation
and give an exact characterization of optimal oblivious bounds, i.e. when the
new probabilities are chosen independent of the probabilities of all other
variables. Our motivation comes from the weighted model counting problem (or,
equivalently, the problem of computing the probability of a Boolean function),
which is #P-hard in general. By performing several dissociations, one can
transform a Boolean formula whose probability is difficult to compute, into one
whose probability is easy to compute, and which is guaranteed to provide an
upper or lower bound on the probability of the original formula by choosing
appropriate probabilities for the dissociated variables. Our new bounds shed
light on the connection between previous relaxation-based and model-based
approximations and unify them as concrete choices in a larger design space. We
also show how our theory allows a standard relational database management
system (DBMS) to both upper and lower bound hard probabilistic queries in
guaranteed polynomial time.Comment: 34 pages, 14 figures, supersedes: http://arxiv.org/abs/1105.281
Faster Query Answering in Probabilistic Databases using Read-Once Functions
A boolean expression is in read-once form if each of its variables appears
exactly once. When the variables denote independent events in a probability
space, the probability of the event denoted by the whole expression in
read-once form can be computed in polynomial time (whereas the general problem
for arbitrary expressions is #P-complete). Known approaches to checking
read-once property seem to require putting these expressions in disjunctive
normal form. In this paper, we tell a better story for a large subclass of
boolean event expressions: those that are generated by conjunctive queries
without self-joins and on tuple-independent probabilistic databases. We first
show that given a tuple-independent representation and the provenance graph of
an SPJ query plan without self-joins, we can, without using the DNF of a result
event expression, efficiently compute its co-occurrence graph. From this, the
read-once form can already, if it exists, be computed efficiently using
existing techniques. Our second and key contribution is a complete, efficient,
and simple to implement algorithm for computing the read-once forms (whenever
they exist) directly, using a new concept, that of co-table graph, which can be
significantly smaller than the co-occurrence graph.Comment: Accepted in ICDT 201
Model Counting of Query Expressions: Limitations of Propositional Methods
Query evaluation in tuple-independent probabilistic databases is the problem
of computing the probability of an answer to a query given independent
probabilities of the individual tuples in a database instance. There are two
main approaches to this problem: (1) in `grounded inference' one first obtains
the lineage for the query and database instance as a Boolean formula, then
performs weighted model counting on the lineage (i.e., computes the probability
of the lineage given probabilities of its independent Boolean variables); (2)
in methods known as `lifted inference' or `extensional query evaluation', one
exploits the high-level structure of the query as a first-order formula.
Although it is widely believed that lifted inference is strictly more powerful
than grounded inference on the lineage alone, no formal separation has
previously been shown for query evaluation. In this paper we show such a formal
separation for the first time.
We exhibit a class of queries for which model counting can be done in
polynomial time using extensional query evaluation, whereas the algorithms used
in state-of-the-art exact model counters on their lineages provably require
exponential time. Our lower bounds on the running times of these exact model
counters follow from new exponential size lower bounds on the kinds of d-DNNF
representations of the lineages that these model counters (either explicitly or
implicitly) produce. Though some of these queries have been studied before, no
non-trivial lower bounds on the sizes of these representations for these
queries were previously known.Comment: To appear in International Conference on Database Theory (ICDT) 201
- …