2 research outputs found

    b-Bit Minwise Hashing

    This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, social networks and computational advertising. By only storing the lowest b bits of each (minwise) hashed value (e.g., b=1 or 2), one can gain substantial advantages in terms of computational efficiency and storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is interested in resemblance > 0.5.
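
The idea in the abstract above can be illustrated with a minimal Python sketch. It is not the paper's exact estimator: the paper's correction constants depend on the set sizes relative to the universe, and this sketch assumes the sparse-data simplification in which both constants are taken to be 2^-b. The function names (`_hash64`, `b_bit_signature`, `estimate_resemblance`) are hypothetical, and salted BLAKE2b hashes stand in for random permutations.

```python
import hashlib


def _hash64(element, seed):
    """64-bit hash of `element`; each seed acts as a stand-in for one random permutation."""
    data = repr(element).encode("utf-8")
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def b_bit_signature(items, k, b):
    """Keep only the lowest b bits of each of the k minwise hash values.

    `items` must be non-empty.
    """
    mask = (1 << b) - 1
    return [min(_hash64(x, seed) for x in items) & mask for seed in range(k)]


def estimate_resemblance(set_a, set_b, k=1000, b=2):
    """Estimate the resemblance |A ∩ B| / |A ∪ B| from b-bit signatures.

    Assumption: sparse data, so two unequal minima agree on their lowest
    b bits with probability about 2^-b, and that chance agreement is
    subtracted out below.
    """
    sig_a = b_bit_signature(set_a, k, b)
    sig_b = b_bit_signature(set_b, k, b)
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    p_hat = matches / k
    c = 2.0 ** -b  # chance-collision rate under the sparse-data assumption
    return (p_hat - c) / (1.0 - c)
```

For example, `estimate_resemblance(set(range(150)), set(range(50, 200)))` should land near the true resemblance of 0.5, with sampling noise that shrinks as k grows. Each signature entry occupies b bits instead of 64, which is the source of the storage savings the abstract describes; the price is that a smaller b needs a larger k for the same accuracy.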

    On the sample size of k-restricted min-wise independent permutations and other k-wise distributions

    An explicit study of min-wise independent permutation families, together with their variants (k-restricted, approximate, etc.), was initiated by Broder et al. [4]. In this paper, we give a lower bound for the size of a k-restricted min-wise independent permutation family. A family F of permutations on [0,n−1] = {0, 1, ..., n−1} is said to be k-restricted min-wise independent if for any subset X ⊆ [0,n−1] with |X| ≤ k and any x ∈ X, Pr[min{π(X)} = π(x)] = 1/|X|, when π is randomly chosen from F according to a probability distribution D on the family F. For the minimum size of a family of k-restricted min-wise independent permutations, upper bounds of O(n^k) for any fixed k have been shown for uniform and biased probability distributions on F. We show that if a family F of permutations on [0,n−1] is k-restricted min-wise independent, then |F| ≥ m(n−1, k−1), where $m(n, d) = \sum_{i=0}^{d/2} \binom{n}{i}$ if d is even, and $m(n, d) = \sum_{i=0}^{(d-1)/2} \binom{n}{i} + \binom{n-1}{(d-1)/2}$ otherwise. The lower bound for the size of F still holds when we allow an arbitrary probability distribution on F. Our proof technique is based on linear algebra methods, and can be regarded as a generalization of the result by Alon, Babai, and Itai [1], i.e., if random variables X1, X2, ..., Xn: Ω → {0, 1} are k-wise independent and Pr[Xi = 1] = pi is neither 0 nor 1, then |Ω| ≥ m(n, k). By applying our proof technique, we also derive lower bounds for the sample size of the related notions, e.g., k-wise symmetrically independent distributions, k-rankwise independent permutation families, etc.
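
To give a feel for the size of the lower bound in the abstract above, here is a small Python sketch that evaluates m(n, d), read as the sum of the binomial coefficients C(n, i) for i from 0 to d/2 when d is even, and the same sum up to (d−1)/2 plus C(n−1, (d−1)/2) when d is odd. The function names `m` and `min_family_size` are hypothetical, and the formula is taken as stated in the abstract rather than verified against the full paper.

```python
from math import comb


def m(n, d):
    """Evaluate m(n, d) as described in the abstract (assumed reading)."""
    if d % 2 == 0:
        return sum(comb(n, i) for i in range(d // 2 + 1))
    half = (d - 1) // 2
    return sum(comb(n, i) for i in range(half + 1)) + comb(n - 1, half)


def min_family_size(n, k):
    """Lower bound |F| >= m(n - 1, k - 1) for a k-restricted min-wise
    independent family of permutations on [0, n-1]."""
    return m(n - 1, k - 1)
```

For instance, `min_family_size(5, 3)` evaluates m(4, 2) = C(4,0) + C(4,1) = 5, so under this reading any 3-restricted min-wise independent family on five elements needs at least five permutations.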