A polynomial-time algorithm to approximately count contingency tables when the number of rows is constant.
We consider the problem of counting the number of contingency tables with given row and column sums. This problem is known to be #P-complete, even when there are only two rows (Random Structures Algorithms 10(4) (1997) 487). In this paper we present the first fully polynomial randomized approximation scheme for counting contingency tables when the number of rows is constant. A novel feature of our algorithm is that it is a hybrid of an exact counting technique with an approximation algorithm, giving two distinct phases. In the first, the columns are partitioned into "small" and "large". We show that the number of contingency tables can be expressed as the weighted sum of a polynomial number of new instances of the problem, where each instance consists of some new row sums and the original large column sums. In the second phase, we show how to approximately count contingency tables when all the column sums are large. In this case, we show that the solution lies in approximating the volume of a single convex body, a problem which is known to be solvable in polynomial time (J. ACM 38 (1) (1991) 1).
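The object being counted is easy to pin down with a tiny exact procedure. The sketch below is a plain dynamic program over columns, nothing like the paper's two-phase FPRAS (it is exponential in general, which is exactly why the FPRAS matters), but it is useful for checking small instances; the function name is my own.

```python
from functools import lru_cache

def count_tables(rows, cols):
    """Exactly count non-negative integer matrices with the given row and
    column sums, by dynamic programming over columns.  Brute force for
    small instances only -- not the paper's approximation scheme."""
    if sum(rows) != sum(cols):
        return 0
    cols = tuple(cols)

    def splits(rem, c):
        # all ways to split a column sum c among rows with remaining sums rem
        if not rem:
            if c == 0:
                yield ()
            return
        for v in range(min(rem[0], c) + 1):
            for rest in splits(rem[1:], c - v):
                yield (v,) + rest

    @lru_cache(maxsize=None)
    def go(rem, j):
        if j == len(cols):
            return 1 if all(r == 0 for r in rem) else 0
        return sum(go(tuple(r - v for r, v in zip(rem, col)), j + 1)
                   for col in splits(rem, cols[j]))

    return go(tuple(rows), 0)
```

For example, with row sums (2, 2) and column sums (2, 2) there are exactly 3 tables, determined entirely by the top-left entry.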
Enumerating contingency tables via random permanents
Given m positive integers R=(r_i), n positive integers C=(c_j) such that sum
r_i = sum c_j =N, and mn non-negative weights W=(w_{ij}), we consider the total
weight T=T(R, C; W) of non-negative integer matrices (contingency tables)
D=(d_{ij}) with the row sums r_i, column sums c_j, and the weight of D equal to
prod w_{ij}^{d_{ij}}. We present a randomized algorithm, with complexity polynomial in N, which computes a number T' = T'(R, C; W) such that T' < T < alpha(R, C) T', where alpha(R, C) = min{prod r_i! r_i^{-r_i}, prod c_j! c_j^{-c_j}} N^N / N!.
In many cases, ln T' provides an asymptotically accurate estimate of ln T. The
idea of the algorithm is to express T as the expectation of the permanent of an
N x N random matrix with exponentially distributed entries and approximate the
expectation by the integral T' of an efficiently computable log-concave
function on R^{mn}. Applications to counting integer flows in graphs are also discussed.
Comment: 19 pages; bounds are sharpened, references are added.
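The total weight T(R, C; W) defined above can be checked directly on toy instances by enumerating every table. The sketch below does exactly that; it is a correctness oracle for tiny margins, not the randomized permanent-based algorithm, and the function name is my own.

```python
import itertools

def weighted_table_count(rows, cols, W):
    """T(R, C; W) = sum over contingency tables D with the given margins of
    prod_{i,j} w_{ij}^{d_{ij}}, computed by brute-force enumeration.
    Feasible only for very small instances."""
    m, n = len(rows), len(cols)
    total = 0.0
    # each entry d_{ij} is bounded by both its row and column sum
    ranges = [range(min(rows[i], cols[j]) + 1)
              for i in range(m) for j in range(n)]
    for flat in itertools.product(*ranges):
        D = [flat[i * n:(i + 1) * n] for i in range(m)]
        if all(sum(D[i]) == rows[i] for i in range(m)) and \
           all(sum(D[i][j] for i in range(m)) == cols[j] for j in range(n)):
            w = 1.0
            for i in range(m):
                for j in range(n):
                    w *= W[i][j] ** D[i][j]
            total += w
    return total
```

With all weights w_{ij} = 1, T reduces to the plain number of contingency tables.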
Counting magic squares in quasi-polynomial time
We present a randomized algorithm, which, given positive integers n and t and
a real number 0< epsilon <1, computes the number Sigma(n, t) of n x n
non-negative integer matrices (magic squares) with the row and column sums
equal to t within relative error epsilon. The computational complexity of the
algorithm is polynomial in 1/epsilon and quasi-polynomial in N=nt, that is, of
the order N^{log N}. A simplified version of the algorithm works in time
polynomial in 1/epsilon and N and estimates Sigma(n,t) within a factor of
N^{log N}. This simplified version has been implemented. We present results of
the implementation, state some conjectures, and discuss possible generalizations.
Comment: 30 pages.
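Sigma(n, t) can be computed directly for tiny parameters by enumerating the first n-1 rows and letting the column sums force the last row. This brute force, sketched below under my own naming, is only feasible far below the quasi-polynomial regime the paper targets, but it is useful for sanity checks.

```python
from itertools import product

def magic_square_count(n, t):
    """Sigma(n, t): number of n x n non-negative integer matrices with every
    row and column sum equal to t.  Brute force: enumerate the first n-1
    rows; the last row is forced by the column sums (and automatically has
    row sum n*t - (n-1)*t = t), so only its non-negativity must be checked."""
    def rows_with_sum(length, s):
        if length == 1:
            yield (s,)
            return
        for v in range(s + 1):
            for rest in rows_with_sum(length - 1, s - v):
                yield (v,) + rest

    all_rows = list(rows_with_sum(n, t))
    count = 0
    for choice in product(all_rows, repeat=n - 1):
        forced = [t - sum(r[j] for r in choice) for j in range(n)]
        if all(v >= 0 for v in forced):
            count += 1
    return count
```

For example Sigma(3, 1) = 6 (the 3 x 3 permutation matrices) and Sigma(3, 2) = 21.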
Sequential importance sampling for multiway tables
We describe an algorithm for the sequential sampling of entries in multiway
contingency tables with given constraints. The algorithm can be used for
computations in exact conditional inference. To justify the algorithm, a theory
relates sampling values at each step to properties of the associated toric
ideal using computational commutative algebra. In particular, the property of
interval cell counts at each step is related to exponents on lead
indeterminates of a lexicographic Gr\"{o}bner basis. Also, the approximation of
integer programming by linear programming for sampling is related to initial
terms of a toric ideal. We apply the algorithm to examples of contingency
tables which appear in the social and medical sciences. The numerical results
demonstrate that the theory is applicable and that the algorithm performs well.
Comment: Published at http://dx.doi.org/10.1214/009053605000000822 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
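For the special case of two-way tables, sequential importance sampling has a compact form due to Chen, Diaconis, Holmes and Liu: fill the table column by column, draw each entry uniformly from its feasible interval, and average the products of interval sizes, which gives an unbiased estimate of the number of tables. The sketch below implements that two-way scheme only; the paper's contribution is the multiway extension and its Groebner-basis justification, which this code does not attempt.

```python
import random

def sis_count_tables(rows, cols, samples=5000, seed=1):
    """Unbiased SIS estimate of the number of two-way contingency tables
    with the given margins (Chen-Diaconis-Holmes-Liu scheme)."""
    if sum(rows) != sum(cols):
        return 0.0
    rng = random.Random(seed)
    m = len(rows)
    total = 0.0
    for _ in range(samples):
        rem = list(rows)              # remaining row sums
        weight = 1.0
        for c in cols:
            c_rem = c                 # remaining column sum
            for i in range(m):
                below = sum(rem[i + 1:])
                lo = max(0, c_rem - below)   # leave enough for rows below
                hi = min(rem[i], c_rem)
                weight *= hi - lo + 1        # size of the feasible interval
                v = rng.randint(lo, hi)
                rem[i] -= v
                c_rem -= v
        total += weight
    return total / samples
```

Because every valid table is reachable with positive probability and its sampling probability is the reciprocal of its weight, the average weight is an unbiased estimator of the table count.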
Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets
This paper introduces new algorithms and data structures for quick counting
for machine learning datasets. We focus on the counting task of constructing
contingency tables, but our approach is also applicable to counting the number
of records in a dataset that match conjunctive queries. Subject to certain
assumptions, the costs of these operations can be shown to be independent of
the number of records in the dataset and loglinear in the number of non-zero
entries in the contingency table. We provide a very sparse data structure, the
ADtree, to minimize memory use. We provide analytical worst-case bounds for
this structure for several models of data distribution. We empirically
demonstrate that tractably-sized data structures can be produced for large
real-world datasets by (a) using a sparse tree structure that never allocates
memory for counts of zero, (b) never allocating memory for counts that can be
deduced from other counts, and (c) not bothering to expand the tree fully near
its leaves. We show how the ADtree can be used to accelerate Bayes net
structure finding algorithms, rule learning algorithms, and feature selection
algorithms, and we provide a number of empirical results comparing ADtree
methods against traditional direct counting approaches. We also discuss the
possible uses of ADtrees in other machine learning methods, and discuss the
merits of ADtrees in comparison with alternative representations such as
kd-trees, R-trees and Frequent Sets.
Comment: See http://www.jair.org/ for any accompanying files.
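The core trick of point (a) above, never allocating memory for zero counts, can be illustrated with a much simpler structure than the ADtree: a dictionary of sparse contingency tables keyed by attribute subsets. The sketch below (names my own) answers conjunctive count queries in the same spirit, but omits the ADtree's further savings such as deducing counts from other counts.

```python
from collections import Counter
from itertools import combinations

def build_marginal_cache(records, max_way=2):
    """Precompute sparse contingency tables for every attribute subset of
    size up to max_way.  Zero cells are simply absent from each Counter --
    the 'never store zero counts' idea, without the ADtree's extra tricks."""
    n_attrs = len(records[0])
    cache = {}
    for k in range(1, max_way + 1):
        for attrs in combinations(range(n_attrs), k):
            cache[attrs] = Counter(tuple(rec[a] for a in attrs)
                                   for rec in records)
    return cache

def query(cache, attrs, values):
    """Count records matching a conjunctive query,
    e.g. attrs=(0, 2), values=(1, 0) means attr0 == 1 AND attr2 == 0."""
    return cache[tuple(attrs)].get(tuple(values), 0)
```

After the one-time build, each cached query is answered in time independent of the number of records, which is the property the paper's analysis makes precise for the ADtree.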
Efficient Algorithms for Privately Releasing Marginals via Convex Relaxations
Consider a database of people, each represented by a bit-string of length d corresponding to the settings of d binary attributes. A k-way marginal query is specified by a subset S of k attributes, and a k-dimensional binary vector v specifying their values. The result for this query is a count of the number of people in the database whose attribute vector restricted to S agrees with v.
Privately releasing approximate answers to a set of k-way marginal queries is one of the most important and well-motivated problems in differential privacy. Information theoretically, the error complexity of marginal queries is well understood, with known lower and upper bounds on the per-query additive error. However, no polynomial-time algorithm with error complexity as low as the information-theoretic upper bound is known in general. In this work we present a polynomial-time algorithm that, for any distribution on marginal queries, achieves average error matching the best known information-theoretic upper bounds in this setting. This bound is an improvement over previous work on efficiently releasing marginals when k is small and when small error is desirable. Using private boosting we are also able to give nearly matching worst-case error bounds.
Our algorithms are based on the geometric techniques of Nikolov, Talwar, and
Zhang. The main new ingredients are convex relaxations and careful use of the
Frank-Wolfe algorithm for constrained convex minimization. To design our
relaxations, we rely on the Grothendieck inequality from functional analysis …
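As a point of reference for what the paper improves on, the simplest private release of a fixed list of marginal queries is the basic Laplace mechanism: answer each exact count and add Laplace noise scaled to the list's L1 sensitivity. The sketch below (function names my own) is that baseline only; its error grows linearly with the number of queries, which is precisely the regime the convex-relaxation machinery is designed to beat. It is not the algorithm from the paper.

```python
import random

def marginal_count(db, attrs, values):
    """Exact k-way marginal: records whose bits at `attrs` equal `values`."""
    return sum(all(rec[a] == v for a, v in zip(attrs, values)) for rec in db)

def private_marginals(db, queries, epsilon, seed=0):
    """Answer a fixed list of marginal queries with epsilon-differential
    privacy via the Laplace mechanism.  Changing one record moves each
    count by at most 1, so the L1 sensitivity of the whole answer vector
    is len(queries); each answer gets Laplace(len(queries)/epsilon) noise,
    sampled here as the difference of two exponentials."""
    rng = random.Random(seed)
    scale = len(queries) / epsilon
    answers = []
    for attrs, values in queries:
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        answers.append(marginal_count(db, attrs, values) + noise)
    return answers
```

With many queries the noise scale len(queries)/epsilon becomes the bottleneck; the paper's algorithm achieves far smaller average error for large families of k-way marginals.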