Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting
of the k elements that are smallest according to a given hash function h. With
this sample we can estimate the relative size f=|Y|/|X| of any subset Y as
|S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard
similarity f=|A intersect B|/|A union B| between sets A and B. Given the
bottom-k samples from A and B, we construct the bottom-k sample of their union
as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is
estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k.
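To make the estimator concrete, here is a minimal sketch in Python; the 64-bit hash and the helper names bottom_k and jaccard_estimate are illustrative assumptions, not part of the paper (in particular, this hash is not 2-independent in the formal sense).

```python
import hashlib

def h(x):
    """Illustrative hash: map an element to a 64-bit integer (stand-in for the hash function h)."""
    return int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")

def bottom_k(X, k):
    """S_k(X): the k elements of X that are smallest under h."""
    return set(sorted(X, key=h)[:k])

def jaccard_estimate(A, B, k):
    """Estimate |A intersect B| / |A union B| from the two bottom-k samples alone."""
    SA, SB = bottom_k(A, k), bottom_k(B, k)
    S_union = bottom_k(SA | SB, k)        # S_k(A union B) = S_k(S_k(A) union S_k(B))
    return len(S_union & SA & SB) / k

# Example: two overlapping sets of integers.
A = set(range(0, 8000))
B = set(range(4000, 12000))
print(jaccard_estimate(A, B, k=256))      # true Jaccard is 4000/12000, about 0.33
```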
We show here that even if the hash function is only 2-independent, the
expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a
constant factor of the expected relative error with truly random hashing.
For comparison, consider the classic k×min-wise approach, which uses k
independent hash functions h_1,...,h_k and stores the smallest element under each
hash function. For k×min-wise there is at least a constant bias with constant
independence, and this bias is not reduced by taking larger k. Recently, Feigenblat et al.
showed that bottom-k circumvents the bias if the hash function is 8-independent
and k is sufficiently large. We get down to 2-independence for any k. Our
result is based on a simple union bound, transferring generic concentration
bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger
probability error bounds with higher independence.
For weighted sets, we consider priority sampling which adapts efficiently to
the concrete input weights, e.g., benefiting strongly from heavy-tailed input.
This time, the analysis is much more involved, but again we show that generic
concentration bounds can be applied.
Comment: A short version appeared at STOC'1
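For the weighted case, a minimal sketch of priority sampling in its standard form (keep the k highest priorities w_i/u_i, let tau be the next priority, estimate each sampled weight as max(w_i, tau)); the function names are mine, and an ordinary RNG stands in for the limited-independence hashing analyzed in the paper.

```python
import random

def priority_sample(weights, k):
    """Priority sample of size k from {key: weight}.
    Returns (sample, tau) where sample maps each kept key to its weight estimate."""
    # Priority of item i: q_i = w_i / u_i with u_i uniform in (0, 1].
    priorities = {key: w / random.uniform(1e-12, 1.0) for key, w in weights.items()}
    order = sorted(priorities, key=priorities.get, reverse=True)
    tau = priorities[order[k]] if len(order) > k else 0.0   # (k+1)-st largest priority
    # Each sampled item gets the unbiased weight estimate max(w_i, tau).
    return {key: max(weights[key], tau) for key in order[:k]}, tau

def subset_sum_estimate(sample, subset):
    """Unbiased estimate of the total weight of `subset`."""
    return sum(est for key, est in sample.items() if key in subset)

# Heavy-tailed example: a few very large weights dominate the total.
weights = {i: (1000.0 if i < 10 else 1.0) for i in range(10000)}
sample, tau = priority_sample(weights, k=500)
print(subset_sum_estimate(sample, set(range(5000))))   # true value: 10*1000 + 4990 = 14990
```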
Nearly optimal solutions for the Chow Parameters Problem and low-weight approximation of halfspaces
The \emph{Chow parameters} of a Boolean function
are its degree-0 and degree-1 Fourier coefficients. It has been known
since 1961 (Chow, Tannenbaum) that the (exact values of the) Chow parameters of
any linear threshold function f uniquely specify f within the space of all
Boolean functions, but until recently (O'Donnell and Servedio) nothing was
known about efficient algorithms for \emph{reconstructing} f (exactly or
approximately) from exact or approximate values of its Chow parameters. We
refer to this reconstruction problem as the \emph{Chow Parameters Problem.}
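For intuition, a brute-force sketch (not from the paper) that computes the Chow parameters of a small Boolean function by enumerating {-1,1}^n; chow_parameters and the majority example are illustrative names.

```python
from itertools import product

def chow_parameters(f, n):
    """Chow parameters of f: {-1,1}^n -> {-1,1}: the degree-0 coefficient E[f(x)]
    and the degree-1 coefficients E[f(x) * x_i], computed here by exact enumeration."""
    points = list(product([-1, 1], repeat=n))
    deg0 = sum(f(x) for x in points) / len(points)
    deg1 = [sum(f(x) * x[i] for x in points) / len(points) for i in range(n)]
    return deg0, deg1

# Example: majority (a linear threshold function) on n = 5 variables.
maj = lambda x: 1 if sum(x) > 0 else -1
print(chow_parameters(maj, 5))
```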
Our main result is a new algorithm for the Chow Parameters Problem which,
given (sufficiently accurate approximations to) the Chow parameters of any
linear threshold function f, runs in time \tilde{O}(n^2)\cdot
(1/\eps)^{O(\log^2(1/\eps))} and with high probability outputs a
representation of an LTF that is \eps-close to f. The only previous
algorithm (O'Donnell and Servedio) had running time \poly(n) \cdot
2^{2^{\tilde{O}(1/\eps^2)}}.
As a byproduct of our approach, we show that for any linear threshold
function f over \{-1,1\}^n, there is a linear threshold function f' which
is \eps-close to f and has all weights that are integers at most \sqrt{n}
\cdot (1/\eps)^{O(\log^2(1/\eps))}. This significantly improves the best
previous result of Diakonikolas and Servedio, which gave a \poly(n) \cdot
2^{\tilde{O}(1/\eps^{2/3})} weight bound, and is close to the known lower
bound of (1/\eps)^{\Omega(\log \log (1/\eps))} (Goldberg,
Servedio). Our techniques also yield improved algorithms for related problems
in learning theory.
Learning the Structure and Parameters of Large-Population Graphical Games from Behavioral Data
We consider learning, from strictly behavioral data, the structure and
parameters of linear influence games (LIGs), a class of parametric graphical
games introduced by Irfan and Ortiz (2014). LIGs facilitate causal strategic
inference (CSI): Making inferences from causal interventions on stable behavior
in strategic settings. Applications include the identification of the most
influential individuals in large (social) networks. Such tasks can also support
policy-making analysis. Motivated by the computational work on LIGs, we cast
the learning problem as maximum-likelihood estimation (MLE) of a generative
model defined by pure-strategy Nash equilibria (PSNE). Our simple formulation
uncovers the fundamental interplay between goodness-of-fit and model
complexity: good models capture equilibrium behavior within the data while
controlling the true number of equilibria, including those unobserved. We
provide a generalization bound establishing the sample complexity for MLE in
our framework. We propose several algorithms including convex loss minimization
(CLM) and sigmoidal approximations. We prove that the number of exact PSNE in
LIGs is small, with high probability; thus, CLM is sound. We illustrate our
approach on synthetic data and real-world U.S. congressional voting records. We
briefly discuss our learning framework's generality and potential applicability
to general graphical games.
Comment: Journal of Machine Learning Research (accepted, pending
publication). Last conference version: submitted March 30, 2012 to UAI 2012.
First conference version, entitled Learning Influence Games, initially
submitted on June 1, 2010 to NIPS 201
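As a rough illustration of the PSNE condition underlying the generative model, here is a sketch assuming the standard LIG form in which player i best-responds to the sign of sum_j W[i,j]*x[j] - b[i]; all names are mine, and the brute-force enumeration is only feasible for tiny games.

```python
import numpy as np

def is_psne(x, W, b):
    """Check whether the joint action x in {-1,+1}^n is a pure-strategy Nash
    equilibrium of a linear influence game with influence matrix W (zero diagonal)
    and threshold vector b: every player's action must agree (weakly) with the
    sign of its local field sum_j W[i, j] * x[j] - b[i]."""
    field = W @ x - b
    return bool(np.all(x * field >= 0))

def enumerate_psne(W, b):
    """Brute-force enumeration of all PSNE (only for small n)."""
    n = len(b)
    psne = []
    for bits in range(2 ** n):
        x = np.array([1 if (bits >> i) & 1 else -1 for i in range(n)])
        if is_psne(x, W, b):
            psne.append(x)
    return psne

# Tiny example: three players with mutually reinforcing influences.
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.zeros(3)
print(enumerate_psne(W, b))   # the all-(+1) and all-(-1) profiles
```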
Some nonasymptotic results on resampling in high dimension, I: Confidence regions, II: Multiple tests
We study generalized bootstrap confidence regions for the mean of a random
vector whose coordinates have an unknown dependency structure. The random
vector is assumed to be either Gaussian or to have a symmetric and bounded
distribution. The dimensionality of the vector can possibly be much larger than
the number of observations and we focus on a nonasymptotic control of the
confidence level, following ideas inspired by recent results in learning
theory. We consider two approaches, the first based on a concentration
principle (valid for a large class of resampling weights) and the second on a
resampled quantile, specifically using Rademacher weights. Several intermediate
results established in the approach based on concentration principles are of
interest in their own right. We also discuss the question of accuracy when
using Monte Carlo approximations of the resampled quantities.
Comment: Published at http://dx.doi.org/10.1214/08-AOS667 and
http://dx.doi.org/10.1214/08-AOS668 in the Annals of Statistics
(http://www.imstat.org/aos/) by the Institute of Mathematical Statistics
(http://www.imstat.org).
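As a loose illustration of the resampled-quantile idea with Rademacher weights: a sketch under my own simplifications, with the paper's normalization constants and exact pivot omitted and all names assumed, not taken from the paper.

```python
import numpy as np

def rademacher_quantile_threshold(Y, alpha=0.05, B=1000, rng=None):
    """Monte Carlo (1 - alpha)-quantile of the Rademacher-symmetrized sup-statistic
    sup_k | (1/n) * sum_i eps_i * (Y[i,k] - mean_k) |, used as a data-driven
    threshold for a sup-norm region around the empirical mean.
    (Normalization constants from the paper are deliberately omitted.)"""
    rng = rng or np.random.default_rng(0)
    n, K = Y.shape
    centered = Y - Y.mean(axis=0)
    stats = np.empty(B)
    for b in range(B):
        eps = rng.choice([-1.0, 1.0], size=n)         # Rademacher resampling weights
        stats[b] = np.abs(eps @ centered / n).max()   # sup over the K coordinates
    return np.quantile(stats, 1 - alpha)

# High-dimensional toy example: K = 500 dependent coordinates, only n = 50 observations.
rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 1))
Y = Z + 0.5 * rng.normal(size=(50, 500))              # strong common factor -> dependence
print("sup-norm radius (up to normalization):", round(rademacher_quantile_threshold(Y), 3))
```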
Stream Sampling for Frequency Cap Statistics
Unaggregated data, in streamed or distributed form, is prevalent and comes
from diverse application domains which include interactions of users with web
services and IP traffic. Data elements have {\em keys} (cookies, users,
queries) and elements with different keys interleave. Analytics on such data
typically utilizes statistics stated in terms of the frequencies of keys. The
two most common statistics are {\em distinct}, which is the number of active
keys in a specified segment, and {\em sum}, which is the sum of the frequencies
of keys in the segment. Both are special cases of {\em cap} statistics, defined
as the sum of frequencies {\em capped} by a parameter, which are popular in
online advertising platforms. Aggregation by key, however, is costly, requiring
state proportional to the number of distinct keys, and therefore we are
interested in estimating these statistics or more generally, sampling the data,
without aggregation. We present a sampling framework for unaggregated data that
uses a single pass (for streams) or two passes (for distributed data) and state
proportional to the desired sample size. Our design provides the first
effective solution for general frequency cap statistics. Our capped
samples provide estimates with tight statistical guarantees for cap statistics
(over a range of cap values) and nonnegative unbiased estimates of {\em any} monotone
non-decreasing frequency statistics. An added benefit of our unified design is
facilitating {\em multi-objective samples}, which provide estimates with
statistical guarantees for a specified set of different statistics, using a
single, smaller sample.
Comment: 21 pages, 4 figures, preliminary version will appear in KDD 201
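To make the cap statistic concrete, a small sketch that computes it exactly by aggregating per key, which is precisely the costly step the paper's sampling framework avoids; T is just my name for the cap parameter and the helper is illustrative.

```python
from collections import Counter

def cap_statistic(elements, segment, T):
    """Exact T-capped statistic of `segment`: sum over keys in the segment of
    min(frequency, T). T = 1 gives the distinct count; a very large T gives the plain sum."""
    freq = Counter(key for key in elements if key in segment)   # aggregation by key
    return sum(min(count, T) for count in freq.values())

# Toy stream of keyed elements; keys interleave, as in the abstract.
stream = ["u1", "u2", "u1", "u3", "u1", "u2", "u4", "u1"]
segment = {"u1", "u2", "u3"}
print(cap_statistic(stream, segment, T=1))        # distinct keys in segment: 3
print(cap_statistic(stream, segment, T=10**9))    # sum of frequencies: 7
print(cap_statistic(stream, segment, T=2))        # capped at 2 per key: 2 + 2 + 1 = 5
```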