Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting
of the k elements that are smallest according to a given hash function h. With
this sample we can estimate the relative size f=|Y|/|X| of any subset Y as
|S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard
similarity f=|A intersect B|/|A union B| between sets A and B. Given the
bottom-k samples from A and B, we construct the bottom-k sample of their union
as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is
estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k.
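To make the estimator concrete, here is a minimal sketch in Python; the 64-bit hash and the helper names bottom_k and jaccard_estimate are illustrative assumptions, not part of the paper (in particular, this hash is not 2-independent in the formal sense).

```python
import hashlib

def h(x):
    """Illustrative hash: map an element to a 64-bit integer (stand-in for the hash function h)."""
    return int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")

def bottom_k(X, k):
    """S_k(X): the k elements of X that are smallest under h."""
    return set(sorted(X, key=h)[:k])

def jaccard_estimate(A, B, k):
    """Estimate |A intersect B| / |A union B| from the two bottom-k samples alone."""
    SA, SB = bottom_k(A, k), bottom_k(B, k)
    S_union = bottom_k(SA | SB, k)        # S_k(A union B) = S_k(S_k(A) union S_k(B))
    return len(S_union & SA & SB) / k

# Example: two overlapping sets of integers.
A = set(range(0, 8000))
B = set(range(4000, 12000))
print(jaccard_estimate(A, B, k=256))      # true Jaccard is 4000/12000, about 0.33
```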
We show here that even if the hash function is only 2-independent, the
expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a
constant factor of the expected relative error with truly random hashing.
For comparison, consider the classic k×min-wise approach, which uses k
independent hash functions h_1,...,h_k and stores the smallest element under each
hash function. For k×min-wise there is at least a constant bias with constant
independence, and this bias is not reduced by taking larger k. Recently, Feigenblat et al.
showed that bottom-k circumvents the bias if the hash function is 8-independent
and k is sufficiently large. We get down to 2-independence for any k. Our
result is based on a simple union bound, transferring generic concentration
bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger
probability error bounds with higher independence.
For weighted sets, we consider priority sampling which adapts efficiently to
the concrete input weights, e.g., benefiting strongly from heavy-tailed input.
This time, the analysis is much more involved, but again we show that generic
concentration bounds can be applied.
Comment: A short version appeared at STOC'1
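For the weighted case, a minimal sketch of priority sampling in its standard form (keep the k highest priorities w_i/u_i, let tau be the next priority, estimate each sampled weight as max(w_i, tau)); the function names are mine, and an ordinary RNG stands in for the limited-independence hashing analyzed in the paper.

```python
import random

def priority_sample(weights, k):
    """Priority sample of size k from {key: weight}.
    Returns (sample, tau) where sample maps each kept key to its weight estimate."""
    # Priority of item i: q_i = w_i / u_i with u_i uniform in (0, 1].
    priorities = {key: w / random.uniform(1e-12, 1.0) for key, w in weights.items()}
    order = sorted(priorities, key=priorities.get, reverse=True)
    tau = priorities[order[k]] if len(order) > k else 0.0   # (k+1)-st largest priority
    # Each sampled item gets the unbiased weight estimate max(w_i, tau).
    return {key: max(weights[key], tau) for key in order[:k]}, tau

def subset_sum_estimate(sample, subset):
    """Unbiased estimate of the total weight of `subset`."""
    return sum(est for key, est in sample.items() if key in subset)

# Heavy-tailed example: a few very large weights dominate the total.
weights = {i: (1000.0 if i < 10 else 1.0) for i in range(10000)}
sample, tau = priority_sample(weights, k=500)
print(subset_sum_estimate(sample, set(range(5000))))   # true value: 10*1000 + 4990 = 14990
```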
Nearly optimal solutions for the Chow Parameters Problem and low-weight approximation of halfspaces
The \emph{Chow parameters} of a Boolean function
are its degree-0 and degree-1 Fourier coefficients. It has been known
since 1961 (Chow, Tannenbaum) that the (exact values of the) Chow parameters of
any linear threshold function f uniquely specify f within the space of all
Boolean functions, but until recently (O'Donnell and Servedio) nothing was
known about efficient algorithms for \emph{reconstructing} f (exactly or
approximately) from exact or approximate values of its Chow parameters. We
refer to this reconstruction problem as the \emph{Chow Parameters Problem.}
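For intuition, a brute-force sketch (not from the paper) that computes the Chow parameters of a small Boolean function by enumerating {-1,1}^n; chow_parameters and the majority example are illustrative names.

```python
from itertools import product

def chow_parameters(f, n):
    """Chow parameters of f: {-1,1}^n -> {-1,1}: the degree-0 coefficient E[f(x)]
    and the degree-1 coefficients E[f(x) * x_i], computed here by exact enumeration."""
    points = list(product([-1, 1], repeat=n))
    deg0 = sum(f(x) for x in points) / len(points)
    deg1 = [sum(f(x) * x[i] for x in points) / len(points) for i in range(n)]
    return deg0, deg1

# Example: majority (a linear threshold function) on n = 5 variables.
maj = lambda x: 1 if sum(x) > 0 else -1
print(chow_parameters(maj, 5))
```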
Our main result is a new algorithm for the Chow Parameters Problem which,
given (sufficiently accurate approximations to) the Chow parameters of any
linear threshold function f, runs in time \tilde{O}(n^2)\cdot
(1/\eps)^{O(\log^2(1/\eps))} and with high probability outputs a
representation of an LTF that is \eps-close to f. The only previous
algorithm (O'Donnell and Servedio) had running time \poly(n) \cdot
2^{2^{\tilde{O}(1/\eps^2)}}.
As a byproduct of our approach, we show that for any linear threshold
function f over \{-1,1\}^n, there is a linear threshold function f' which
is \eps-close to f and has all weights that are integers at most \sqrt{n}
\cdot (1/\eps)^{O(\log^2(1/\eps))}. This significantly improves the best
previous result of Diakonikolas and Servedio, which gave a \poly(n) \cdot
2^{\tilde{O}(1/\eps^{2/3})} weight bound, and is close to the known lower
bound of (1/\eps)^{\Omega(\log \log (1/\eps))} (Goldberg,
Servedio). Our techniques also yield improved algorithms for related problems
in learning theory.
Learning the Structure and Parameters of Large-Population Graphical Games from Behavioral Data
We consider learning, from strictly behavioral data, the structure and
parameters of linear influence games (LIGs), a class of parametric graphical
games introduced by Irfan and Ortiz (2014). LIGs facilitate causal strategic
inference (CSI): Making inferences from causal interventions on stable behavior
in strategic settings. Applications include the identification of the most
influential individuals in large (social) networks. Such tasks can also support
policy-making analysis. Motivated by the computational work on LIGs, we cast
the learning problem as maximum-likelihood estimation (MLE) of a generative
model defined by pure-strategy Nash equilibria (PSNE). Our simple formulation
uncovers the fundamental interplay between goodness-of-fit and model
complexity: good models capture equilibrium behavior within the data while
controlling the true number of equilibria, including those unobserved. We
provide a generalization bound establishing the sample complexity for MLE in
our framework. We propose several algorithms including convex loss minimization
(CLM) and sigmoidal approximations. We prove that the number of exact PSNE in
LIGs is small, with high probability; thus, CLM is sound. We illustrate our
approach on synthetic data and real-world U.S. congressional voting records. We
briefly discuss our learning framework's generality and potential applicability
to general graphical games.
Comment: Journal of Machine Learning Research (accepted, pending
publication). Last conference version: submitted March 30, 2012 to UAI 2012.
First conference version, entitled Learning Influence Games, initially
submitted on June 1, 2010 to NIPS 201
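As a rough illustration of the PSNE condition underlying the generative model, here is a sketch assuming the standard LIG form in which player i best-responds to the sign of sum_j W[i,j]*x[j] - b[i]; all names are mine, and the brute-force enumeration is only feasible for tiny games.

```python
import numpy as np

def is_psne(x, W, b):
    """Check whether the joint action x in {-1,+1}^n is a pure-strategy Nash
    equilibrium of a linear influence game with influence matrix W (zero diagonal)
    and threshold vector b: every player's action must agree (weakly) with the
    sign of its local field sum_j W[i, j] * x[j] - b[i]."""
    field = W @ x - b
    return bool(np.all(x * field >= 0))

def enumerate_psne(W, b):
    """Brute-force enumeration of all PSNE (only for small n)."""
    n = len(b)
    psne = []
    for bits in range(2 ** n):
        x = np.array([1 if (bits >> i) & 1 else -1 for i in range(n)])
        if is_psne(x, W, b):
            psne.append(x)
    return psne

# Tiny example: three players with mutually reinforcing influences.
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
b = np.zeros(3)
print(enumerate_psne(W, b))   # the all-(+1) and all-(-1) profiles
```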
Some nonasymptotic results on resampling in high dimension, I: Confidence regions, II: Multiple tests
We study generalized bootstrap confidence regions for the mean of a random
vector whose coordinates have an unknown dependency structure. The random
vector is assumed to be either Gaussian or to have a symmetric and bounded
distribution. The dimensionality of the vector can possibly be much larger than
the number of observations and we focus on a nonasymptotic control of the
confidence level, following ideas inspired by recent results in learning
theory. We consider two approaches, the first based on a concentration
principle (valid for a large class of resampling weights) and the second on a
resampled quantile, specifically using Rademacher weights. Several intermediate
results established in the approach based on concentration principles are of
interest in their own right. We also discuss the question of accuracy when
using Monte Carlo approximations of the resampled quantities.
Comment: Published at http://dx.doi.org/10.1214/08-AOS667 and
http://dx.doi.org/10.1214/08-AOS668 in the Annals of Statistics
(http://www.imstat.org/aos/) by the Institute of Mathematical Statistics
(http://www.imstat.org).
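As a loose illustration of the resampled-quantile idea with Rademacher weights: a sketch under my own simplifications, with the paper's normalization constants and exact pivot omitted and all names assumed, not taken from the paper.

```python
import numpy as np

def rademacher_quantile_threshold(Y, alpha=0.05, B=1000, rng=None):
    """Monte Carlo (1 - alpha)-quantile of the Rademacher-symmetrized sup-statistic
    sup_k | (1/n) * sum_i eps_i * (Y[i,k] - mean_k) |, used as a data-driven
    threshold for a sup-norm region around the empirical mean.
    (Normalization constants from the paper are deliberately omitted.)"""
    rng = rng or np.random.default_rng(0)
    n, K = Y.shape
    centered = Y - Y.mean(axis=0)
    stats = np.empty(B)
    for b in range(B):
        eps = rng.choice([-1.0, 1.0], size=n)         # Rademacher resampling weights
        stats[b] = np.abs(eps @ centered / n).max()   # sup over the K coordinates
    return np.quantile(stats, 1 - alpha)

# High-dimensional toy example: K = 500 dependent coordinates, only n = 50 observations.
rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 1))
Y = Z + 0.5 * rng.normal(size=(50, 500))              # strong common factor -> dependence
print("sup-norm radius (up to normalization):", round(rademacher_quantile_threshold(Y), 3))
```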
Stream Sampling for Frequency Cap Statistics
Unaggregated data, in streamed or distributed form, is prevalent and comes
from diverse application domains which include interactions of users with web
services and IP traffic. Data elements have {\em keys} (cookies, users,
queries) and elements with different keys interleave. Analytics on such data
typically utilizes statistics stated in terms of the frequencies of keys. The
two most common statistics are {\em distinct}, which is the number of active
keys in a specified segment, and {\em sum}, which is the sum of the frequencies
of keys in the segment. Both are special cases of {\em cap} statistics, defined
as the sum of frequencies {\em capped} by a parameter, which are popular in
online advertising platforms. Aggregation by key, however, is costly, requiring
state proportional to the number of distinct keys, and therefore we are
interested in estimating these statistics or more generally, sampling the data,
without aggregation. We present a sampling framework for unaggregated data that
uses a single pass (for streams) or two passes (for distributed data) and state
proportional to the desired sample size. Our design provides the first
effective solution for general frequency cap statistics. Our capped
samples provide estimates with tight statistical guarantees for cap statistics
(over a range of cap values) and nonnegative unbiased estimates of {\em any} monotone
non-decreasing frequency statistics. An added benefit of our unified design is
facilitating {\em multi-objective samples}, which provide estimates with
statistical guarantees for a specified set of different statistics, using a
single, smaller sample.
Comment: 21 pages, 4 figures, preliminary version will appear in KDD 201
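To make the cap statistic concrete, a small sketch that computes it exactly by aggregating per key, which is precisely the costly step the paper's sampling framework avoids; T is just my name for the cap parameter and the helper is illustrative.

```python
from collections import Counter

def cap_statistic(elements, segment, T):
    """Exact T-capped statistic of `segment`: sum over keys in the segment of
    min(frequency, T). T = 1 gives the distinct count; a very large T gives the plain sum."""
    freq = Counter(key for key in elements if key in segment)   # aggregation by key
    return sum(min(count, T) for count in freq.values())

# Toy stream of keyed elements; keys interleave, as in the abstract.
stream = ["u1", "u2", "u1", "u3", "u1", "u2", "u4", "u1"]
segment = {"u1", "u2", "u3"}
print(cap_statistic(stream, segment, T=1))        # distinct keys in segment: 3
print(cap_statistic(stream, segment, T=10**9))    # sum of frequencies: 7
print(cap_statistic(stream, segment, T=2))        # capped at 2 per key: 2 + 2 + 1 = 5
```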