398 research outputs found

    A novel Boolean kernels family for categorical data

    Get PDF
    Kernel based classifiers, such as SVM, are considered state-of-the-art algorithms and are widely used on many classification tasks. However, this kind of methods are hardly interpretable and for this reason they are often considered as black-box models. In this paper, we propose a new family of Boolean kernels for categorical data where features correspond to propositional formulas applied to the input variables. The idea is to create human-readable features to ease the extraction of interpretation rules directly from the embedding space. Experiments on artificial and benchmark datasets show the effectiveness of the proposed family of kernels with respect to established ones, such as RBF, in terms of classification accuracy

    Achieving New Upper Bounds for the Hypergraph Duality Problem through Logic

    Get PDF
    The hypergraph duality problem DUAL is defined as follows: given two simple hypergraphs G\mathcal{G} and H\mathcal{H}, decide whether H\mathcal{H} consists precisely of all minimal transversals of G\mathcal{G} (in which case we say that G\mathcal{G} is the dual of H\mathcal{H}). This problem is equivalent to deciding whether two given non-redundant monotone DNFs are dual. It is known that non-DUAL, the complementary problem to DUAL, is in GC(log2n,PTIME)\mathrm{GC}(\log^2 n,\mathrm{PTIME}), where GC(f(n),C)\mathrm{GC}(f(n),\mathcal{C}) denotes the complexity class of all problems that after a nondeterministic guess of O(f(n))O(f(n)) bits can be decided (checked) within complexity class C\mathcal{C}. It was conjectured that non-DUAL is in GC(log2n,LOGSPACE)\mathrm{GC}(\log^2 n,\mathrm{LOGSPACE}). In this paper we prove this conjecture and actually place the non-DUAL problem into the complexity class GC(log2n,TC0)\mathrm{GC}(\log^2 n,\mathrm{TC}^0) which is a subclass of GC(log2n,LOGSPACE)\mathrm{GC}(\log^2 n,\mathrm{LOGSPACE}). We here refer to the logtime-uniform version of TC0\mathrm{TC}^0, which corresponds to FO(COUNT)\mathrm{FO(COUNT)}, i.e., first order logic augmented by counting quantifiers. We achieve the latter bound in two steps. First, based on existing problem decomposition methods, we develop a new nondeterministic algorithm for non-DUAL that requires to guess O(log2n)O(\log^2 n) bits. We then proceed by a logical analysis of this algorithm, allowing us to formulate its deterministic part in FO(COUNT)\mathrm{FO(COUNT)}. From this result, by the well known inclusion TC0LOGSPACE\mathrm{TC}^0\subseteq\mathrm{LOGSPACE}, it follows that DUAL belongs also to DSPACE[log2n]\mathrm{DSPACE}[\log^2 n]. Finally, by exploiting the principles on which the proposed nondeterministic algorithm is based, we devise a deterministic algorithm that, given two hypergraphs G\mathcal{G} and H\mathcal{H}, computes in quadratic logspace a transversal of G\mathcal{G} missing in H\mathcal{H}.Comment: Restructured the presentation in order to be the extended version of a paper that will shortly appear in SIAM Journal on Computin

    Queries with Guarded Negation (full version)

    Full text link
    A well-established and fundamental insight in database theory is that negation (also known as complementation) tends to make queries difficult to process and difficult to reason about. Many basic problems are decidable and admit practical algorithms in the case of unions of conjunctive queries, but become difficult or even undecidable when queries are allowed to contain negation. Inspired by recent results in finite model theory, we consider a restricted form of negation, guarded negation. We introduce a fragment of SQL, called GN-SQL, as well as a fragment of Datalog with stratified negation, called GN-Datalog, that allow only guarded negation, and we show that these query languages are computationally well behaved, in terms of testing query containment, query evaluation, open-world query answering, and boundedness. GN-SQL and GN-Datalog subsume a number of well known query languages and constraint languages, such as unions of conjunctive queries, monadic Datalog, and frontier-guarded tgds. In addition, an analysis of standard benchmark workloads shows that most usage of negation in SQL in practice is guarded negation

    Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover

    Full text link
    Stochastic Boolean Function Evaluation is the problem of determining the value of a given Boolean function f on an unknown input x, when each bit of x_i of x can only be determined by paying an associated cost c_i. The assumption is that x is drawn from a given product distribution, and the goal is to minimize the expected cost. This problem has been studied in Operations Research, where it is known as "sequential testing" of Boolean functions. It has also been studied in learning theory in the context of learning with attribute costs. We consider the general problem of developing approximation algorithms for Stochastic Boolean Function Evaluation. We give a 3-approximation algorithm for evaluating Boolean linear threshold formulas. We also present an approximation algorithm for evaluating CDNF formulas (and decision trees) achieving a factor of O(log kd), where k is the number of terms in the DNF formula, and d is the number of clauses in the CNF formula. In addition, we present approximation algorithms for simultaneous evaluation of linear threshold functions, and for ranking of linear functions. Our function evaluation algorithms are based on reductions to the Stochastic Submodular Set Cover (SSSC) problem. This problem was introduced by Golovin and Krause. They presented an approximation algorithm for the problem, called Adaptive Greedy. Our main technical contribution is a new approximation algorithm for the SSSC problem, which we call Adaptive Dual Greedy. It is an extension of the Dual Greedy algorithm for Submodular Set Cover due to Fujito, which is a generalization of Hochbaum's algorithm for the classical Set Cover Problem. We also give a new bound on the approximation achieved by the Adaptive Greedy algorithm of Golovin and Krause

    Separation of Test-Free Propositional Dynamic Logics over Context-Free Languages

    Full text link
    For a class L of languages let PDL[L] be an extension of Propositional Dynamic Logic which allows programs to be in a language of L rather than just to be regular. If L contains a non-regular language, PDL[L] can express non-regular properties, in contrast to pure PDL. For regular, visibly pushdown and deterministic context-free languages, the separation of the respective PDLs can be proven by automata-theoretic techniques. However, these techniques introduce non-determinism on the automata side. As non-determinism is also the difference between DCFL and CFL, these techniques seem to be inappropriate to separate PDL[DCFL] from PDL[CFL]. Nevertheless, this separation is shown but for programs without test operators.Comment: In Proceedings GandALF 2011, arXiv:1106.081

    FPTAS for Counting Monotone CNF

    Full text link
    A monotone CNF formula is a Boolean formula in conjunctive normal form where each variable appears positively. We design a deterministic fully polynomial-time approximation scheme (FPTAS) for counting the number of satisfying assignments for a given monotone CNF formula when each variable appears in at most 55 clauses. Equivalently, this is also an FPTAS for counting set covers where each set contains at most 55 elements. If we allow variables to appear in a maximum of 66 clauses (or sets to contain 66 elements), it is NP-hard to approximate it. Thus, this gives a complete understanding of the approximability of counting for monotone CNF formulas. It is also an important step towards a complete characterization of the approximability for all bounded degree Boolean #CSP problems. In addition, we study the hypergraph matching problem, which arises naturally towards a complete classification of bounded degree Boolean #CSP problems, and show an FPTAS for counting 3D matchings of hypergraphs with maximum degree 44. Our main technique is correlation decay, a powerful tool to design deterministic FPTAS for counting problems defined by local constraints among a number of variables. All previous uses of this design technique fall into two categories: each constraint involves at most two variables, such as independent set, coloring, and spin systems in general; or each variable appears in at most two constraints, such as matching, edge cover, and holant problem in general. The CNF problems studied here have more complicated structures than these problems and require new design and proof techniques. As it turns out, the technique we developed for the CNF problem also works for the hypergraph matching problem. We believe that it may also find applications in other CSP or more general counting problems.Comment: 24 pages, 2 figures. version 1=>2: minor edits, highlighted the picture of set cover/packing, and an implication of our previous result in 3D matchin

    Oblivious Bounds on the Probability of Boolean Functions

    Full text link
    This paper develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute, into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. We also show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.Comment: 34 pages, 14 figures, supersedes: http://arxiv.org/abs/1105.281

    Faster Query Answering in Probabilistic Databases using Read-Once Functions

    Full text link
    A boolean expression is in read-once form if each of its variables appears exactly once. When the variables denote independent events in a probability space, the probability of the event denoted by the whole expression in read-once form can be computed in polynomial time (whereas the general problem for arbitrary expressions is #P-complete). Known approaches to checking read-once property seem to require putting these expressions in disjunctive normal form. In this paper, we tell a better story for a large subclass of boolean event expressions: those that are generated by conjunctive queries without self-joins and on tuple-independent probabilistic databases. We first show that given a tuple-independent representation and the provenance graph of an SPJ query plan without self-joins, we can, without using the DNF of a result event expression, efficiently compute its co-occurrence graph. From this, the read-once form can already, if it exists, be computed efficiently using existing techniques. Our second and key contribution is a complete, efficient, and simple to implement algorithm for computing the read-once forms (whenever they exist) directly, using a new concept, that of co-table graph, which can be significantly smaller than the co-occurrence graph.Comment: Accepted in ICDT 201

    Model Counting of Query Expressions: Limitations of Propositional Methods

    Full text link
    Query evaluation in tuple-independent probabilistic databases is the problem of computing the probability of an answer to a query given independent probabilities of the individual tuples in a database instance. There are two main approaches to this problem: (1) in `grounded inference' one first obtains the lineage for the query and database instance as a Boolean formula, then performs weighted model counting on the lineage (i.e., computes the probability of the lineage given probabilities of its independent Boolean variables); (2) in methods known as `lifted inference' or `extensional query evaluation', one exploits the high-level structure of the query as a first-order formula. Although it is widely believed that lifted inference is strictly more powerful than grounded inference on the lineage alone, no formal separation has previously been shown for query evaluation. In this paper we show such a formal separation for the first time. We exhibit a class of queries for which model counting can be done in polynomial time using extensional query evaluation, whereas the algorithms used in state-of-the-art exact model counters on their lineages provably require exponential time. Our lower bounds on the running times of these exact model counters follow from new exponential size lower bounds on the kinds of d-DNNF representations of the lineages that these model counters (either explicitly or implicitly) produce. Though some of these queries have been studied before, no non-trivial lower bounds on the sizes of these representations for these queries were previously known.Comment: To appear in International Conference on Database Theory (ICDT) 201
    corecore