68 research outputs found

Stochastic majorisation: exploding some myths

Author: Dubhashi D.
Ranjan D.
Publication venue: Max-Planck-Institut für Informatik
Publication date: 01/01/1994

The analysis of many randomised algorithms involves random variables that are not independent, and hence many of the standard tools from classical probability theory that would be useful in the analysis, such as the Chernoff-Hoeffding bounds, are rendered inapplicable. However, in many instances, the random variables involved are nevertheless negatively related in the intuitive sense that when one of the variables is "large", another is likely to be "small". (This notion is made precise and analysed in [1].) In such situations, one is tempted to conjecture that these variables are in some sense stochastically dominated by a set of independent random variables with the same marginals. One thereby hopes to salvage tools such as the Chernoff-Hoeffding bound for analyses involving the dependent set of variables as well. The analysis in [6, 7, 8] seems to strongly hint in this direction. In this note, we explode myths of this kind, and argue that stochastic majorisation in conjunction with an independent set of variables is actually a much less useful notion than it might have appeared.

MPG.PuRe

Hölder-type inequalities and their applications to concentration and correlation bounds

Author: Pelekis Christos
Ramon Jan
Wang Yuyi
Publication date: 23/11/2015

Let $Y_v$, $v\in V$, be $[0,1]$-valued random variables having a dependency graph $G=(V,E)$. We show that

\[
\mathbb{E}\left[\prod_{v\in V} Y_{v} \right] \leq \prod_{v\in V} \left\{ \mathbb{E}\left[Y_v^{\frac{\chi_b}{b}}\right] \right\}^{\frac{b}{\chi_b}},
\]

where $\chi_b$ is the $b$-fold chromatic number of $G$. This inequality may be seen as a dependency-graph analogue of a generalised Hölder inequality, due to Helmut Finner. Additionally, we provide applications of Hölder-type inequalities to concentration and correlation bounds for sums of weakly dependent random variables. Comment: 15 pages
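
To get a feel for the inequality stated in this abstract, it may help to specialise it to $b=1$, where the $b$-fold chromatic number reduces to the ordinary chromatic number $\chi(G)$. The two extreme cases below are an illustrative sketch and are not taken from the paper itself; they assume the usual definition of a dependency graph (families of variables indexed by vertex sets with no edges between them are independent).

```latex
% Special case b = 1, so that \chi_1 = \chi(G):
\[
\mathbb{E}\Bigl[\prod_{v\in V} Y_v\Bigr]
  \;\le\; \prod_{v\in V} \Bigl(\mathbb{E}\bigl[Y_v^{\chi(G)}\bigr]\Bigr)^{1/\chi(G)} .
\]
% (i) G has no edges: the Y_v are mutually independent and \chi(G) = 1, so
\[
\mathbb{E}\Bigl[\prod_{v\in V} Y_v\Bigr] \;\le\; \prod_{v\in V} \mathbb{E}[Y_v],
\]
% which holds with equality under independence.
% (ii) G is the complete graph on n = |V| vertices: \chi(G) = n, so
\[
\mathbb{E}\Bigl[\prod_{v\in V} Y_v\Bigr]
  \;\le\; \prod_{v\in V} \bigl(\mathbb{E}[Y_v^{n}]\bigr)^{1/n},
\]
% which is the generalised H\"older inequality with all n exponents equal to n.
```

Sparser dependency graphs have smaller chromatic numbers, so the exponent moves from the Hölder end of this range towards the independent end.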

arXiv.org e-Print Archive

Lirias

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Fast Similarity Sketching

Author: Dahlgaard Søren
Knudsen Mathias Bæk Tejs
Thorup Mikkel
Publication date: 01/01/2017

We consider the Similarity Sketching problem: Given a universe $[u]= \{0,\ldots,u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that similarity is preserved. More precisely: Given sets $A,B\subseteq [u]$, define $X_i=[S(A)[i]= S(B)[i]]$ and $X=\sum_{i\in [t]}X_i$. We want to have $E[X]=t\cdot J(A,B)$, where $J(A,B)=|A\cap B|/|A\cup B|$, and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for $X$. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called sketches.

The seminal $t\times$MinHash algorithm uses $t$ random hash functions $h_1,\ldots, h_t$, and stores $\left(\min_{a\in A}h_1(a),\ldots, \min_{a\in A}h_t(a)\right)$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH), which creates a sketch of size $t$ in $O(t + |A|)$ time, but with the drawback that possibly some of the $t$ entries are "empty" when $|A| = O(t)$. One could argue that sketching is not necessary in this case; however, the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these "densification" schemes fail to provide good concentration bounds exactly in the case $|A| = O(t)$, where they are needed. (continued...)
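
The classic MinHash baseline described in this abstract can be sketched in a few lines of Python. This is a minimal illustration and not the paper's new sketch: the function names, the seed, and the linear hash family `h(x) = (a*x + b) mod p` are assumptions made for the example. Linear hashing is only 2-independent, so the identity `E[X] = t * J(A, B)` holds only approximately here, whereas the abstract assumes truly random hash functions.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus of the illustrative hash family


def make_hash_functions(t, seed=0):
    """Return t pairs (a, b), each describing h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(MERSENNE_PRIME))
        for _ in range(t)
    ]


def minhash_sketch(elements, hash_params):
    """Entry i is the minimum of h_i over the set; costs O(t * |A|) time."""
    if not elements:
        raise ValueError("the sketch of an empty set is not defined here")
    return [
        min((a * x + b) % MERSENNE_PRIME for x in elements)
        for a, b in hash_params
    ]


def estimate_jaccard(sketch_a, sketch_b):
    """Return X / t, where X counts coordinates on which the sketches agree."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


params = make_hash_functions(t=256, seed=42)
A = set(range(0, 1000))
B = set(range(500, 1500))
estimate = estimate_jaccard(minhash_sketch(A, params), minhash_sketch(B, params))
true_jaccard = len(A & B) / len(A | B)  # 500 / 1500
print(f"estimate={estimate:.3f} true={true_jaccard:.3f}")
```

The nested loop over `t` hash functions and `|A|` elements is exactly the `O(t * |A|)` cost that the abstract names as the main drawback of MinHash.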

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

Upper Tail Estimates with Combinatorial Proofs

Author: Holenstein Thomas
Hązła Jan
Publication date: 01/01/2015

We study generalisations of a simple, combinatorial proof of a Chernoff bound similar to the one by Impagliazzo and Kabanets (RANDOM, 2010). In particular, we prove a randomized version of the hitting property of expander random walks and apply it to obtain a concentration bound for expander random walks which is essentially optimal for small deviations and a large number of steps. At the same time, we present a simpler proof that still yields a "right" bound settling a question asked by Impagliazzo and Kabanets. Next, we obtain a simple upper tail bound for polynomials with input variables in $[0, 1]$ which are not necessarily independent, but obey a certain condition inspired by Impagliazzo and Kabanets. The resulting bound is used by Holenstein and Sinha (FOCS, 2012) in the proof of a lower bound for the number of calls in a black-box construction of a pseudorandom generator from a one-way function. We then show that the same technique yields the upper tail bound for the number of copies of a fixed graph in an Erdős-Rényi random graph, matching the one given by Janson, Oleszkiewicz and Ruciński (Israel J. Math, 2002). Comment: Full version of the paper from STACS 201

arXiv.org e-Print Archive

CiteSeerX

Dagstuhl Research Online Publication Server

Weighted Polynomial Approximations: Limits for Learning and Pseudorandomness

Author: Bun Mark
Steinke Thomas
Publication date: 08/12/2014

Polynomial approximations to boolean functions have led to many positive results in computer science. In particular, polynomial approximations to the sign function underlie algorithms for agnostically learning halfspaces, as well as pseudorandom generators for halfspaces. In this work, we investigate the limits of these techniques by proving inapproximability results for the sign function.

Firstly, the polynomial regression algorithm of Kalai et al. (SIAM J. Comput. 2008) shows that halfspaces can be learned with respect to log-concave distributions on $\mathbb{R}^n$ in the challenging agnostic learning model. The power of this algorithm relies on the fact that under log-concave distributions, halfspaces can be approximated arbitrarily well by low-degree polynomials. We ask whether this technique can be extended beyond log-concave distributions, and establish a negative result. We show that polynomials of any degree cannot approximate the sign function to within arbitrarily low error for a large class of non-log-concave distributions on the real line, including those with densities proportional to $\exp(-|x|^{0.99})$.

Secondly, we investigate the derandomization of Chernoff-type concentration inequalities. Chernoff-type tail bounds on sums of independent random variables have pervasive applications in theoretical computer science. Schmidt et al. (SIAM J. Discrete Math. 1995) showed that these inequalities can be established for sums of random variables with only $O(\log(1/\delta))$-wise independence, for a tail probability of $\delta$. We show that their results are tight up to constant factors. These results rely on techniques from weighted approximation theory, which studies how well functions on the real line can be approximated by polynomials under various distributions. We believe that these techniques will have further applications in other areas of computer science. Comment: 22 pages
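
For readers unfamiliar with limited independence, the following sketch shows one standard way to produce $k$-wise independent values: evaluate a uniformly random polynomial of degree below $k$ over a prime field. It is not taken from the paper. The names `independence_for_tail`, `make_kwise_hash` and `evaluate_polynomial` are hypothetical, and the constant 1 in front of $\log(1/\delta)$ is an assumption, since the abstract only states $O(\log(1/\delta))$.

```python
import math
import random

PRIME = (1 << 61) - 1  # field size; inputs are assumed to lie in [0, PRIME)


def independence_for_tail(delta):
    """Illustrative choice k = ceil(log2(1/delta)); the constant is assumed."""
    return max(2, math.ceil(math.log2(1.0 / delta)))


def evaluate_polynomial(coeffs, x, modulus=PRIME):
    """Horner evaluation of coeffs[0] + coeffs[1]*x + ... modulo `modulus`."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % modulus
    return acc


def make_kwise_hash(k, seed=0):
    """Random polynomial of degree < k: values at distinct points of the
    field are k-wise independent and uniform on [0, PRIME)."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(PRIME) for _ in range(k)]
    return lambda x: evaluate_polynomial(coeffs, x)


k = independence_for_tail(2 ** -10)
h = make_kwise_hash(k, seed=3)
print(k, [h(x) % 100 for x in range(5)])
```

Each evaluation costs `O(k)` field operations, which is one reason the amount of independence needed for a given tail probability matters in practice.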

arXiv.org e-Print Archive

CiteSeerX

Dagstuhl Research Online Publication Server

Fast and Powerful Hashing using Tabulation

Author: Thorup Mikkel
Publication date: 01/01/2016

Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees.

Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of $c$ characters and we have precomputed character tables $h_1,\ldots,h_c$ mapping characters to random hash values. A key $x=(x_1,\ldots,x_c)$ is hashed to $h_1[x_1] \oplus h_2[x_2] \oplus \cdots \oplus h_c[x_c]$. This scheme is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing.

Next we consider twisted tabulation where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data-structures.

Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. While these tabulation schemes are all easy to implement and use, their analysis is not.
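
Simple tabulation as described in this abstract is short enough to sketch directly. The block below is a minimal illustration, assuming 32-bit keys split into $c = 4$ characters of 8 bits each and 32-bit hash values; these widths, the seed, and the function names are choices made for the example, not parameters fixed by the survey.

```python
import random

CHAR_BITS = 8   # assumed character width
NUM_CHARS = 4   # c: a 32-bit key is viewed as 4 characters
HASH_BITS = 32  # assumed width of the hash values


def make_tables(seed=0):
    """One table of random hash values per character position."""
    rng = random.Random(seed)
    return [
        [rng.getrandbits(HASH_BITS) for _ in range(1 << CHAR_BITS)]
        for _ in range(NUM_CHARS)
    ]


def simple_tabulation(key, tables):
    """XOR together the table entries selected by the key's characters."""
    h = 0
    for i, table in enumerate(tables):
        char = (key >> (i * CHAR_BITS)) & ((1 << CHAR_BITS) - 1)
        h ^= table[char]
    return h


tables = make_tables(seed=7)
print(hex(simple_tabulation(0xDEADBEEF, tables)))

# Four keys whose two low characters form a "rectangle" {1, 3} x {2, 4}.
keys = [0x0102, 0x0104, 0x0302, 0x0304]
xor_of_hashes = 0
for key in keys:
    xor_of_hashes ^= simple_tabulation(key, tables)
print(xor_of_hashes)  # 0: every table entry is used an even number of times
```

The final check illustrates why simple tabulation is not 4-independent: for such a rectangle of keys, any three hash values determine the fourth, whatever the table contents are.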

arXiv.org e-Print Archive

Copenhagen University Research Information System

Dagstuhl Research Online Publication Server