    Exponential Space Improvement for minwise Based Algorithms

    In this paper we introduce a general framework that exponentially improves the space, the degree of independence, and the time needed by min-wise based algorithms. In SODA 2011, we introduced an exponential time improvement for min-wise based algorithms by defining and constructing an almost k-min-wise independent family of hash functions. Here we develop an alternative approach that achieves both exponential time and exponential space improvement. The new approach relaxes the need for approximately min-wise hash functions, and hence gets around the Omega(log(1/epsilon)) independence lower bound in [Patrascu 2010]. This is done by defining and constructing a d-k-min-wise independent family of hash functions. Informally, under this definition, all subsets of size d of any fixed set X have an equal probability to have hash values among the minimal k values in X, where the probability is over the random choice of hash function from the family. This property measures the randomness of the family, since a truly random function clearly satisfies the definition for d=k=|X|. Surprisingly, for most cases only 8-wise independence is needed for the additional improvement. Moreover, as the degree of independence is a small constant, our function can be implemented efficiently. We define, and give a time- and space-efficient construction of, an approximately d-k-min-wise independent family of hash functions for the case d=2, as this is sufficient for the additional exponential improvement. We discuss how this construction can be used to improve many min-wise based algorithms. To our knowledge, such definitions for hash functions have not been studied before, and no construction was previously given. As an example, we show how to apply it to similarity and rarity estimation over data streams. Other min-wise based algorithms can be adjusted in the same way.
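
    To make the setting concrete, the following is a minimal sketch of similarity estimation from the k minimal hash values of each set, using a hash function drawn from an 8-wise independent family. It is not the construction of the paper: the degree-7 random polynomial over a prime field, the field size P, and the names make_poly_hash, bottom_k and estimate_jaccard are all assumptions chosen for illustration.

```python
import random

# Field size for the polynomial hash; this Mersenne prime is an assumption.
P = (1 << 61) - 1


def make_poly_hash(degree, rng):
    """Random polynomial of the given degree over GF(P).

    With degree + 1 uniformly random coefficients, the family is
    (degree + 1)-wise independent on integer inputs in [0, P).
    """
    coeffs = [rng.randrange(P) for _ in range(degree + 1)]

    def h(x):
        acc = 0
        for c in coeffs:  # Horner evaluation modulo P
            acc = (acc * x + c) % P
        return acc

    return h


def bottom_k(items, h, k):
    """Return the k smallest distinct hash values of the items."""
    return sorted({h(x) for x in items})[:k]


def estimate_jaccard(a, b, h, k):
    """Estimate |a & b| / |a | b| from the two bottom-k sketches."""
    sa, sb = bottom_k(a, h, k), bottom_k(b, h, k)
    union_k = sorted(set(sa) | set(sb))[:k]  # bottom-k of the union
    if not union_k:
        return 0.0
    shared = set(sa) & set(sb)
    hits = sum(1 for v in union_k if v in shared)
    return hits / len(union_k)
```

    Under these assumptions, make_poly_hash(7, random.Random(0)) yields one hash function, and estimate_jaccard(a, b, h, 256) compares two sets of non-negative integers below P. Only the two k-element sketches are needed for the comparison, which is what makes such estimators attractive over data streams.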

    Pairwise Independent Random Walks Can Be Slightly Unbounded

    A family of problems that has been studied in the context of various streaming algorithms consists of generalizations of the fact that the expected maximum distance of a 4-wise independent random walk on a line over n steps is O(sqrt{n}). For small values of k, there exist k-wise independent random walks that can be stored in much less space than storing n random bits, so these properties are often useful for lowering space bounds. In this paper, we show that for all of these examples, 4-wise independence is required, by demonstrating a pairwise independent random walk with steps uniform in +/- 1 and expected maximum distance Omega(sqrt{n} lg n) from the origin. We also show that this bound is tight for the first and second moments, i.e. the expected maximum square distance of a 2-wise independent random walk is always O(n lg^2 n). Also, for any even k >= 4, we show that the kth moment of the maximum distance of any k-wise independent random walk is O(n^{k/2}). The previous two results generalize to random walks tracking insertion-only streams, and provide higher moment bounds than currently known. We also prove a generalization of Kolmogorov's maximal inequality by showing an asymptotically equivalent statement that requires only 4-wise independent random variables with bounded second moments, which also generalizes a result of Blasiok.
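
    For reference, the classical form of Kolmogorov's maximal inequality and the bounds stated above can be written as follows. The notation (steps X_i, partial sums S_t, threshold lambda) is an assumption introduced here for readability; the classical inequality assumes fully independent, zero-mean X_i with finite variance.

```latex
% Classical Kolmogorov maximal inequality (fully independent, zero-mean X_i):
\[
  \Pr\Bigl[\max_{1 \le t \le n} |S_t| \ge \lambda\Bigr]
    \le \frac{\operatorname{Var}(S_n)}{\lambda^{2}},
  \qquad S_t = \sum_{i=1}^{t} X_i .
\]

% Bounds from the abstract, for walks with steps uniform in \{-1, +1\}:
\begin{align*}
  \mathbb{E}\Bigl[\max_{1 \le t \le n} |S_t|\Bigr]
    &= \Omega\bigl(\sqrt{n}\,\lg n\bigr)
    && \text{for some pairwise independent walk,} \\
  \mathbb{E}\Bigl[\max_{1 \le t \le n} S_t^{2}\Bigr]
    &= O\bigl(n \lg^{2} n\bigr)
    && \text{for every pairwise independent walk,} \\
  \mathbb{E}\Bigl[\max_{1 \le t \le n} |S_t|^{k}\Bigr]
    &= O\bigl(n^{k/2}\bigr)
    && \text{for every $k$-wise independent walk, even } k \ge 4 .
\end{align*}
```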

    The chaining lemma and its application

    We present a new information-theoretic result which we call the Chaining Lemma. It considers a so-called “chain” of random variables, defined by a source distribution X(0) with high min-entropy and a number (say, t in total) of arbitrary functions (T1,…, Tt) which are applied in succession to that source to generate the chain (Formula presented). Intuitively, the Chaining Lemma guarantees that, if the chain is not too long, then either (i) the entire chain is “highly random”, in that every variable has high min-entropy; or (ii) it is possible to find a point j (1 ≤ j ≤ t) in the chain such that, conditioned on the end of the chain, i.e. (Formula presented), the preceding part (Formula presented) remains highly random. We think this is an interesting information-theoretic result which is intuitive but nevertheless requires rigorous case analysis to prove. We believe that the above lemma will find applications in cryptography. As an example, we show an application of the lemma to protecting essentially any cryptographic scheme against memory tampering attacks. We allow several tampering requests, and the tampering functions can be arbitrary; however, they must be chosen from a bounded-size set of functions that is fixed a priori.
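
    As a small illustration of the quantities involved, the sketch below computes the min-entropy of a finite source distribution and of each variable obtained by applying functions to it in succession. It only tracks unconditional min-entropy, whereas case (ii) of the lemma concerns min-entropy conditioned on the end of the chain; the function names and the dictionary representation of a distribution are assumptions made for this example.

```python
import math
from collections import Counter


def min_entropy(dist):
    """Min-entropy in bits: -log2 of the largest outcome probability.

    dist maps each outcome to its probability.
    """
    return -math.log2(max(dist.values()))


def push_forward(dist, f):
    """Distribution of f(X) when X is distributed according to dist."""
    out = Counter()
    for x, p in dist.items():
        out[f(x)] += p
    return dict(out)


def chain_min_entropies(source, functions):
    """Min-entropy of the source and of every later variable in the chain.

    The i-th function is applied to the output of the previous one,
    starting from the source distribution.
    """
    entropies = [min_entropy(source)]
    current = source
    for f in functions:
        current = push_forward(current, f)
        entropies.append(min_entropy(current))
    return entropies
```

    For instance, starting from a uniform source over 16 values and applying a function that merges outcomes in pairs removes one bit of min-entropy at that step. Applying a deterministic function never increases min-entropy, so the values along the chain are non-increasing.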
