1,608 research outputs found

    A polynomial-time algorithm to approximately count contingency tables when the number of rows is constant.

    We consider the problem of counting the number of contingency tables with given row and column sums. This problem is known to be #P-complete, even when there are only two rows (Random Structures Algorithms 10(4) (1997) 487). In this paper we present the first fully polynomial randomized approximation scheme for counting contingency tables when the number of rows is constant. A novel feature of our algorithm is that it is a hybrid of an exact counting technique with an approximation algorithm, giving two distinct phases. In the first, the columns are partitioned into “small” and “large”. We show that the number of contingency tables can be expressed as the weighted sum of a polynomial number of new instances of the problem, where each instance consists of some new row sums and the original large column sums. In the second phase, we show how to approximately count contingency tables when all the column sums are large. In this case, we show that the solution lies in approximating the volume of a single convex body, a problem which is known to be solvable in polynomial time (J. ACM 38 (1) (1991) 1).
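To make the counting problem concrete, here is a minimal sketch of an exact counter for very small instances. It is a plain memoised enumeration, not the approximation scheme from the abstract, and its running time grows quickly with the margins. The function names `compositions` and `count_tables` are hypothetical and chosen only for this illustration.

```python
from functools import lru_cache


def compositions(total, caps):
    """Yield tuples x with 0 <= x[i] <= caps[i] and sum(x) == total."""
    if not caps:
        if total == 0:
            yield ()
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in compositions(total - first, caps[1:]):
            yield (first,) + rest


def count_tables(row_sums, col_sums):
    """Exactly count non-negative integer matrices with the given margins."""
    if sum(row_sums) != sum(col_sums):
        return 0  # no table can have inconsistent totals
    col_sums = tuple(col_sums)

    @lru_cache(maxsize=None)
    def go(remaining, j):
        # remaining: row sums still to be placed; j: index of the next column
        if j == len(col_sums):
            return 1 if not any(remaining) else 0
        total = 0
        for column in compositions(col_sums[j], remaining):
            nxt = tuple(r - x for r, x in zip(remaining, column))
            total += go(nxt, j + 1)
        return total

    return go(tuple(row_sums), 0)
```

The memoisation is keyed on the remaining row sums, so the number of distinct states stays manageable only when there are few rows and the sums are small.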

    Enumerating contingency tables via random permanents

    Given m positive integers R=(r_i), n positive integers C=(c_j) such that sum r_i = sum c_j = N, and mn non-negative weights W=(w_{ij}), we consider the total weight T=T(R, C; W) of non-negative integer matrices (contingency tables) D=(d_{ij}) with row sums r_i and column sums c_j, where the weight of D equals prod w_{ij}^{d_{ij}}. We present a randomized algorithm of complexity polynomial in N which computes a number T'=T'(R,C; W) such that T' < T < alpha(R, C) T', where alpha(R,C) = min{prod r_i! r_i^{-r_i}, prod c_j! c_j^{-c_j}} N^N/N!. In many cases, ln T' provides an asymptotically accurate estimate of ln T. The idea of the algorithm is to express T as the expectation of the permanent of an N x N random matrix with exponentially distributed entries and approximate the expectation by the integral T' of an efficiently computable log-concave function on R^{mn}. Applications to counting integer flows in graphs are also discussed. Comment: 19 pages, bounds are sharpened, references are added.
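Written out in LaTeX, the quantities defined in this abstract read as follows. The set notation \mathcal{D}(R,C) for the tables with row sums R and column sums C is an assumption introduced here for convenience; it does not appear in the abstract.

```latex
\[
T(R,C;W) \;=\; \sum_{D \in \mathcal{D}(R,C)} \; \prod_{i=1}^{m} \prod_{j=1}^{n} w_{ij}^{d_{ij}},
\qquad
T' \;<\; T \;<\; \alpha(R,C)\, T',
\]
\[
\alpha(R,C) \;=\; \min\Bigl\{ \prod_{i=1}^{m} \frac{r_i!}{r_i^{r_i}},\;
                              \prod_{j=1}^{n} \frac{c_j!}{c_j^{c_j}} \Bigr\}\, \frac{N^N}{N!}.
\]
```

When every weight w_{ij} equals 1, each table contributes 1 to the sum, so T is simply the number of contingency tables with the given margins.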

    Counting magic squares in quasi-polynomial time

    We present a randomized algorithm, which, given positive integers n and t and a real number 0 < epsilon < 1, computes the number Sigma(n, t) of n x n non-negative integer matrices (magic squares) with the row and column sums equal to t within relative error epsilon. The computational complexity of the algorithm is polynomial in 1/epsilon and quasi-polynomial in N=nt, that is, of the order N^{log N}. A simplified version of the algorithm works in time polynomial in 1/epsilon and N and estimates Sigma(n,t) within a factor of N^{log N}. This simplified version has been implemented. We present results of the implementation, state some conjectures, and discuss possible generalizations. Comment: 30 pages.

    Sequential importance sampling for multiway tables

    We describe an algorithm for the sequential sampling of entries in multiway contingency tables with given constraints. The algorithm can be used for computations in exact conditional inference. To justify the algorithm, a theory relates sampling values at each step to properties of the associated toric ideal using computational commutative algebra. In particular, the property of interval cell counts at each step is related to exponents on lead indeterminates of a lexicographic Gröbner basis. Also, the approximation of integer programming by linear programming for sampling is related to initial terms of a toric ideal. We apply the algorithm to examples of contingency tables which appear in the social and medical sciences. The numerical results demonstrate that the theory is applicable and that the algorithm performs well. Comment: Published at http://dx.doi.org/10.1214/009053605000000822 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
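The idea of sampling cells one at a time can be sketched for the much simpler two-way case. The code below is a simplified illustration under stated assumptions, not the multiway algorithm of the abstract: it fills a table column by column, draws each cell uniformly from its feasible integer interval, and uses the product of interval sizes as an importance weight whose average estimates the number of tables. The function name `sis_estimate`, the uniform proposal, and the default sample size are all choices made for this sketch.

```python
import random


def sis_estimate(row_sums, col_sums, samples=20000, seed=0):
    """Estimate the number of two-way tables with the given margins."""
    if sum(row_sums) != sum(col_sums):
        raise ValueError("row and column sums must have the same total")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        remaining = list(row_sums)   # row sums not yet used up
        weight = 1.0                 # inverse probability of the sampled table
        for c in col_sums:
            left = c                 # part of this column still to place
            below = sum(remaining)   # capacity of the current and later rows
            for i in range(len(remaining)):
                below -= remaining[i]            # capacity of later rows only
                lo = max(0, left - below)        # must leave a fillable rest
                hi = min(remaining[i], left)     # cannot exceed either margin
                value = rng.randint(lo, hi)      # uniform on the interval
                weight *= hi - lo + 1
                remaining[i] -= value
                left -= value
        total += weight
    return total / samples
```

In this two-way setting every integer in the interval `[lo, hi]` can be completed to a valid table, so each table is reached with probability equal to one over its weight and the estimator is unbiased. For multiway tables that interval property can fail, which is the issue the abstract addresses with Gröbner bases.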

    Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

    This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and Frequent Sets. Comment: See http://www.jair.org/ for any accompanying files.
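For orientation, the two counting tasks named in this abstract can be sketched with direct counting, the baseline the ADtree is compared against. This is not an ADtree: each call scans every record, which is exactly the cost the ADtree is designed to avoid. The function names and the toy dataset are made up for this illustration.

```python
from collections import Counter


def contingency_table(records, attributes):
    """Count records for each observed combination of attribute values."""
    # Counter stores only combinations that occur, so zero counts use no memory.
    return Counter(tuple(record[a] for a in attributes) for record in records)


def count_matching(records, query):
    """Count records matching a conjunctive query {attribute: value, ...}."""
    return sum(
        all(record[a] == v for a, v in query.items()) for record in records
    )


# Made-up toy dataset, for illustration only.
RECORDS = [
    {"smoker": "yes", "sex": "f", "ill": "no"},
    {"smoker": "yes", "sex": "m", "ill": "yes"},
    {"smoker": "no", "sex": "f", "ill": "no"},
    {"smoker": "yes", "sex": "f", "ill": "yes"},
]
```

Calling `contingency_table(RECORDS, ("smoker", "sex"))` yields three non-zero cells, and the unobserved combination `("no", "m")` is simply absent from the result.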

    Efficient Algorithms for Privately Releasing Marginals via Convex Relaxations

    Consider a database of $n$ people, each represented by a bit-string of length $d$ corresponding to the setting of $d$ binary attributes. A $k$-way marginal query is specified by a subset $S$ of $k$ attributes, and an $|S|$-dimensional binary vector $\beta$ specifying their values. The result for this query is a count of the number of people in the database whose attribute vector restricted to $S$ agrees with $\beta$. Privately releasing approximate answers to a set of $k$-way marginal queries is one of the most important and well-motivated problems in differential privacy. Information theoretically, the error complexity of marginal queries is well-understood: the per-query additive error is known to be at least $\Omega(\min\{\sqrt{n},d^{\frac{k}{2}}\})$ and at most $\tilde{O}(\min\{\sqrt{n} d^{1/4},d^{\frac{k}{2}}\})$. However, no polynomial time algorithm with error complexity as low as the information theoretic upper bound is known for small $n$. In this work we present a polynomial time algorithm that, for any distribution on marginal queries, achieves average error at most $\tilde{O}(\sqrt{n} d^{\frac{\lceil k/2 \rceil}{4}})$. This error bound is as good as the best known information theoretic upper bounds for $k=2$. This bound is an improvement over previous work on efficiently releasing marginals when $k$ is small and when error $o(n)$ is desirable. Using private boosting we are also able to give nearly matching worst-case error bounds. Our algorithms are based on the geometric techniques of Nikolov, Talwar, and Zhang. The main new ingredients are convex relaxations and careful use of the Frank-Wolfe algorithm for constrained convex minimization. To design our relaxations, we rely on the Grothendieck inequality from functional analysis.
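The definition of a k-way marginal query can be made concrete with a short sketch. It computes the exact, non-private answer only and includes no privacy mechanism, so it illustrates the query being approximated rather than the algorithm of the abstract. The function name `marginal_count` and the representation of rows as tuples of bits are assumptions made here.

```python
def marginal_count(database, subset, beta):
    """Count rows whose bits at the positions in `subset` equal `beta`.

    database: iterable of bit tuples of length d (one per person)
    subset:   the k attribute indices forming S
    beta:     the k required bit values, in the same order as `subset`
    """
    if len(subset) != len(beta):
        raise ValueError("subset and beta must have the same length")
    return sum(
        all(row[i] == b for i, b in zip(subset, beta)) for row in database
    )
```

For example, with the rows `(1, 0, 1)`, `(1, 1, 1)` and `(0, 0, 1)`, the 2-way query with `subset=(0, 2)` and `beta=(1, 1)` counts the first two rows.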