68 research outputs found

    Stochastic majorisation: exploding some myths

    No full text
    The analysis of many randomised algorithms involves random variables that are not independent, and hence many of the standard tools from classical probability theory that would be useful in the analysis, such as the Chernoff--Hoeffding bounds are rendered inapplicable. However, in many instances, the random variables involved are, nevertheless {\em negatively related\/} in the intuitive sense that when one of the variables is ``large'', another is likely to be ``small''. (this notion is made precise and analysed in [1].) In such situations, one is tempted to conjecture that these variables are in some sense {\em stochastically dominated\/} by a set of {\em independent\/} random variables with the same marginals. Thereby, one hopes to salvage tools such as the Chernoff--Hoeffding bound also for analysis involving the dependent set of variables. The analysis in [6, 7, 8] seems to strongly hint in this direction. In this note, we explode myths of this kind, and argue that stochastic majorisation in conjunction with an independent set of variables is actually much less useful a notion than it might have appeared

    H\"older-type inequalities and their applications to concentration and correlation bounds

    Get PDF
    Let Yv,vV,Y_v, v\in V, be [0,1][0,1]-valued random variables having a dependency graph G=(V,E)G=(V,E). We show that E[vVYv]vV{E[Yvχbb]}bχb, \mathbb{E}\left[\prod_{v\in V} Y_{v} \right] \leq \prod_{v\in V} \left\{ \mathbb{E}\left[Y_v^{\frac{\chi_b}{b}}\right] \right\}^{\frac{b}{\chi_b}}, where χb\chi_b is the bb-fold chromatic number of GG. This inequality may be seen as a dependency-graph analogue of a generalised H\"older inequality, due to Helmut Finner. Additionally, we provide applications of H\"older-type inequalities to concentration and correlation bounds for sums of weakly dependent random variables.Comment: 15 page

    Fast Similarity Sketching

    Full text link
    We consider the Similarity Sketching problem: Given a universe [u]={0,,u1}[u]= \{0,\ldots,u-1\} we want a random function SS mapping subsets A[u]A\subseteq [u] into vectors S(A)S(A) of size tt, such that similarity is preserved. More precisely: Given sets A,B[u]A,B\subseteq [u], define Xi=[S(A)[i]=S(B)[i]]X_i=[S(A)[i]= S(B)[i]] and X=i[t]XiX=\sum_{i\in [t]}X_i. We want to have E[X]=tJ(A,B)E[X]=t\cdot J(A,B), where J(A,B)=AB/ABJ(A,B)=|A\cap B|/|A\cup B| and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for XX. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors S(A)S(A) are also called sketches. The seminal t×t\timesMinHash algorithm uses tt random hash functions h1,,hth_1,\ldots, h_t, and stores (minaAh1(A),,minaAht(A))\left(\min_{a\in A}h_1(A),\ldots, \min_{a\in A}h_t(A)\right) as the sketch of AA. The main drawback of MinHash is, however, its O(tA)O(t\cdot |A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH), which creates a sketch of size tt in O(t+A)O(t + |A|) time, but with the drawback that possibly some of the tt entries are "empty" when A=O(t)|A| = O(t). One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these "densification" schemes fail to provide good concentration bounds exactly in the case A=O(t)|A| = O(t), where they are needed. (continued...

    Upper Tail Estimates with Combinatorial Proofs

    Get PDF
    We study generalisations of a simple, combinatorial proof of a Chernoff bound similar to the one by Impagliazzo and Kabanets (RANDOM, 2010). In particular, we prove a randomized version of the hitting property of expander random walks and apply it to obtain a concentration bound for expander random walks which is essentially optimal for small deviations and a large number of steps. At the same time, we present a simpler proof that still yields a "right" bound settling a question asked by Impagliazzo and Kabanets. Next, we obtain a simple upper tail bound for polynomials with input variables in [0,1][0, 1] which are not necessarily independent, but obey a certain condition inspired by Impagliazzo and Kabanets. The resulting bound is used by Holenstein and Sinha (FOCS, 2012) in the proof of a lower bound for the number of calls in a black-box construction of a pseudorandom generator from a one-way function. We then show that the same technique yields the upper tail bound for the number of copies of a fixed graph in an Erd\H{o}s-R\'enyi random graph, matching the one given by Janson, Oleszkiewicz and Ruci\'nski (Israel J. Math, 2002).Comment: Full version of the paper from STACS 201

    Weighted Polynomial Approximations: Limits for Learning and Pseudorandomness

    Get PDF
    Polynomial approximations to boolean functions have led to many positive results in computer science. In particular, polynomial approximations to the sign function underly algorithms for agnostically learning halfspaces, as well as pseudorandom generators for halfspaces. In this work, we investigate the limits of these techniques by proving inapproximability results for the sign function. Firstly, the polynomial regression algorithm of Kalai et al. (SIAM J. Comput. 2008) shows that halfspaces can be learned with respect to log-concave distributions on Rn\mathbb{R}^n in the challenging agnostic learning model. The power of this algorithm relies on the fact that under log-concave distributions, halfspaces can be approximated arbitrarily well by low-degree polynomials. We ask whether this technique can be extended beyond log-concave distributions, and establish a negative result. We show that polynomials of any degree cannot approximate the sign function to within arbitrarily low error for a large class of non-log-concave distributions on the real line, including those with densities proportional to exp(x0.99)\exp(-|x|^{0.99}). Secondly, we investigate the derandomization of Chernoff-type concentration inequalities. Chernoff-type tail bounds on sums of independent random variables have pervasive applications in theoretical computer science. Schmidt et al. (SIAM J. Discrete Math. 1995) showed that these inequalities can be established for sums of random variables with only O(log(1/δ))O(\log(1/\delta))-wise independence, for a tail probability of δ\delta. We show that their results are tight up to constant factors. These results rely on techniques from weighted approximation theory, which studies how well functions on the real line can be approximated by polynomials under various distributions. We believe that these techniques will have further applications in other areas of computer science.Comment: 22 page

    Fast and Powerful Hashing using Tabulation

    Get PDF
    Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees. Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of cc characters and we have precomputed character tables h1,...,hch_1,...,h_c mapping characters to random hash values. A key x=(x1,...,xc)x=(x_1,...,x_c) is hashed to h1[x1]h2[x2].....hc[xc]h_1[x_1] \oplus h_2[x_2].....\oplus h_c[x_c]. This schemes is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing. Next we consider twisted tabulation where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data-structures. Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. While these tabulation schemes are all easy to implement and use, their analysis is not
    corecore