68 research outputs found
Stochastic majorisation: exploding some myths
The analysis of many randomised algorithms involves random variables that are not independent, and hence many of the standard tools from classical probability theory that would be useful in the analysis, such as the Chernoff--Hoeffding bounds, are rendered inapplicable. However, in many instances, the random variables involved are nevertheless {\em negatively related\/} in the intuitive sense that when one of the variables is ``large'', another is likely to be ``small''. (This notion is made precise and analysed in [1].) In such situations, one is tempted to conjecture that these variables are in some sense {\em stochastically dominated\/} by a set of {\em independent\/} random variables with the same marginals. Thereby, one hopes to salvage tools such as the Chernoff--Hoeffding bound for analyses involving the dependent set of variables as well. The analysis in [6, 7, 8] seems to strongly hint in this direction. In this note, we explode myths of this kind, and argue that stochastic majorisation in conjunction with an independent set of variables is actually a much less useful notion than it might have appeared.
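For orientation, one textbook form of the bound that the abstract above refers to is Hoeffding's inequality. This statement is a standard fact added for reference and is not taken from the note itself; the symbols $X_i$, $n$ and $t$ are introduced here only for illustration. For independent random variables $X_1,\dots,X_n$ taking values in $[0,1]$, with $X=\sum_{i=1}^{n} X_i$ and any $t>0$,

\[
  \Pr\bigl[X - \mathbb{E}[X] \ge t\bigr] \;\le\; \exp\!\left(-\frac{2t^{2}}{n}\right).
\]

The independence assumption is exactly what fails in the dependent settings the abstract discusses.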
H\"older-type inequalities and their applications to concentration and correlation bounds
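Let $Y_v$, $v\in V$, be $[0,1]$-valued random variables having a dependency
graph $G=(V,E)$. We show that $\mathbb{E}\left[\prod_{v\in V} Y_v\right] \leq \prod_{v\in V} \left\{\mathbb{E}\left[Y_v^{\chi_b/b}\right]\right\}^{b/\chi_b}$, where $\chi_b$ is the $b$-fold chromatic number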
of . This inequality may be seen as a dependency-graph analogue of a
generalised H\"older inequality, due to Helmut Finner. Additionally, we provide
applications of H\"older-type inequalities to concentration and correlation
bounds for sums of weakly dependent random variables. Comment: 15 pages
Fast Similarity Sketching
We consider the Similarity Sketching problem: Given a universe $[u]=\{0,\ldots,u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that similarity is preserved. More
precisely: Given sets $A,B\subseteq [u]$, define $X_i=[S(A)[i]=S(B)[i]]$ and
$X=\sum_{i\in [t]} X_i$. We want to have $E[X]=t\cdot J(A,B)$, where
$J(A,B)=|A\cap B|/|A\cup B|$, and furthermore to have strong concentration
guarantees (i.e. Chernoff-style bounds) for $X$. This is a fundamental problem
which has found numerous applications in data mining, large-scale
classification, computer vision, similarity search, etc. via the classic
MinHash algorithm. The vectors $S(A)$ are also called sketches.
The seminal $t\times$MinHash algorithm uses $t$ random hash functions
$h_1,\ldots,h_t$, and stores $\left(\min_{a\in A} h_1(a),\ldots,\min_{a\in A} h_t(a)\right)$ as the sketch of $A$. The main drawback of MinHash is,
however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar
properties and faster running time has been the subject of several papers.
Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH),
which creates a sketch of size $t$ in $O(t+|A|)$ time, but with the drawback
that possibly some of the entries are "empty" when $|A|=O(t)$. One could
argue that sketching is not necessary in this case; however, the desire in most
applications is to have one sketching procedure that works for sets of all
sizes. Therefore, filling out these empty entries is the subject of several
follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these
"densification" schemes fail to provide good concentration bounds exactly in
the case $|A|=O(t)$, where they are needed. (continued...
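The classic MinHash baseline described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the faster scheme the paper proposes: the helper names (`_hash`, `minhash_sketch`, `estimate_jaccard`) are hypothetical, and seeded BLAKE2b stands in for the truly random hash functions assumed in the analysis.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Assumed stand-in for the i-th random hash function: seeded BLAKE2b, 64-bit output."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, t: int):
    """Return the size-t sketch: entry i is the minimum of hash i over the set.

    Takes time proportional to t * len(items), which is the drawback
    the abstract points out. The set must be non-empty.
    """
    items = list(items)
    if not items:
        raise ValueError("MinHash sketch of an empty set is undefined")
    return [min(_hash(i, a) for a in items) for i in range(t)]


def estimate_jaccard(sketch_a, sketch_b) -> float:
    """Fraction of agreeing coordinates, i.e. X / t in the abstract's notation."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    a = {str(i) for i in range(100)}
    b = {str(i) for i in range(50, 150)}
    # True Jaccard similarity is 50 / 150; the estimate should land nearby.
    print(estimate_jaccard(minhash_sketch(a, 256), minhash_sketch(b, 256)))
```

Each coordinate agrees with probability equal to the Jaccard similarity (up to hash collisions), so the fraction of agreeing coordinates estimates it; the cost is one full pass over the set per coordinate.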
Upper Tail Estimates with Combinatorial Proofs
We study generalisations of a simple, combinatorial proof of a Chernoff bound
similar to the one by Impagliazzo and Kabanets (RANDOM, 2010).
In particular, we prove a randomized version of the hitting property of
expander random walks and apply it to obtain a concentration bound for expander
random walks which is essentially optimal for small deviations and a large
number of steps. At the same time, we present a simpler proof that still yields
a "right" bound settling a question asked by Impagliazzo and Kabanets.
Next, we obtain a simple upper tail bound for polynomials with input
variables in $[0,1]$ which are not necessarily independent, but obey a certain
condition inspired by Impagliazzo and Kabanets. The resulting bound is used by
Holenstein and Sinha (FOCS, 2012) in the proof of a lower bound for the number
of calls in a black-box construction of a pseudorandom generator from a one-way
function.
We then show that the same technique yields the upper tail bound for the
number of copies of a fixed graph in an Erd\H{o}s-R\'enyi random graph,
matching the one given by Janson, Oleszkiewicz and Ruci\'nski (Israel J. Math,
2002). Comment: Full version of the paper from STACS 201
Weighted Polynomial Approximations: Limits for Learning and Pseudorandomness
Polynomial approximations to boolean functions have led to many positive
results in computer science. In particular, polynomial approximations to the
sign function underlie algorithms for agnostically learning halfspaces, as well
as pseudorandom generators for halfspaces. In this work, we investigate the
limits of these techniques by proving inapproximability results for the sign
function.
Firstly, the polynomial regression algorithm of Kalai et al. (SIAM J. Comput.
2008) shows that halfspaces can be learned with respect to log-concave
distributions on $\mathbb{R}^n$ in the challenging agnostic learning model. The
power of this algorithm relies on the fact that under log-concave
distributions, halfspaces can be approximated arbitrarily well by low-degree
polynomials. We ask whether this technique can be extended beyond log-concave
distributions, and establish a negative result. We show that polynomials of any
degree cannot approximate the sign function to within arbitrarily low error for
a large class of non-log-concave distributions on the real line, including
those with densities proportional to $\exp(-|x|^{0.99})$.
Secondly, we investigate the derandomization of Chernoff-type concentration
inequalities. Chernoff-type tail bounds on sums of independent random variables
have pervasive applications in theoretical computer science. Schmidt et al.
(SIAM J. Discrete Math. 1995) showed that these inequalities can be established
for sums of random variables with only $O(\log(1/\delta))$-wise independence,
for a tail probability of $\delta$. We show that their results are tight up to
constant factors.
These results rely on techniques from weighted approximation theory, which
studies how well functions on the real line can be approximated by polynomials
under various distributions. We believe that these techniques will have further
applications in other areas of computer science. Comment: 22 pages
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of $c$ characters and we have precomputed character tables
$h_1,\ldots,h_c$ mapping characters to random hash values. A key
$x=(x_1,\ldots,x_c)$ is hashed to $h_1[x_1]\oplus h_2[x_2]\oplus\cdots\oplus h_c[x_c]$. This scheme is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
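Simple tabulation as described above can be sketched as follows. This is an illustrative sketch only: the function name `make_simple_tabulation`, the parameters `c`, `char_bits`, `out_bits` and `seed`, and the use of Python's `random.Random` to fill the tables are all assumptions made here, not details from the survey.

```python
import random


def make_simple_tabulation(c: int = 4, char_bits: int = 8, out_bits: int = 32, seed: int = 0):
    """Build a simple tabulation hash for keys of c characters, char_bits bits each.

    Returns (h, tables). tables[i] is the character table for position i,
    filled with pseudo-random out_bits-bit values (an assumed stand-in for
    truly random table entries).
    """
    rng = random.Random(seed)
    table_size = 1 << char_bits
    tables = [
        [rng.getrandbits(out_bits) for _ in range(table_size)]
        for _ in range(c)
    ]
    mask = table_size - 1
    key_limit = 1 << (c * char_bits)

    def h(x: int) -> int:
        if not 0 <= x < key_limit:
            raise ValueError("key does not fit in c characters")
        out = 0
        for i in range(c):
            # Extract the i-th character of the key and XOR in its table entry.
            char = (x >> (i * char_bits)) & mask
            out ^= tables[i][char]
        return out

    return h, tables


if __name__ == "__main__":
    h, _ = make_simple_tabulation()
    print(hex(h(0xDEADBEEF)))
```

With 8-bit characters, each table has 256 entries, so four tables for 32-bit keys are small enough to stay in cache, which is the source of the speed the abstract mentions.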
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not
- …