Approximately Minwise Independence with Twisted Tabulation

Abstract

A random hash function $h$ is $\varepsilon$-minwise if for any set $S$, $|S|=n$, and element $x\in S$, $\Pr[h(x)=\min h(S)]=(1\pm\varepsilon)/n$. Minwise hash functions with low bias $\varepsilon$ have widespread applications within similarity estimation. Hashing from a universe $[u]$, the twisted tabulation hashing of Pătraşcu and Thorup [SODA'13] makes $c=O(1)$ lookups in tables of size $u^{1/c}$. Twisted tabulation was invented to get good concentration for hashing-based sampling. Here we show that twisted tabulation yields $\tilde O(1/u^{1/c})$-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS'79], $\tilde O(1/u^{1/c})$-minwise hashing requires $\Omega(\log u)$-independence [Indyk SODA'99]. Pătraşcu and Thorup [STOC'11] had shown that simple tabulation, using the same space and number of lookups, yields $\tilde O(1/n^{1/c})$-minwise independence, which is good for large sets but useless for small sets. Our analysis uses some of the same methods, but is much cleaner, bypassing a complicated induction argument.

Comment: To appear in Proceedings of SWAT 201
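To make the definition concrete, the following is a minimal Python sketch that estimates $\Pr[h(x)=\min h(S)]$ empirically for a small set. The abstract does not spell out the construction, so the hash function here follows our reading of the general scheme in the cited SODA'13 paper: characters $1,\dots,c-1$ are hashed by simple tabulation to produce hash bits plus one extra "twister" character, which is XORed into character $0$ before the final lookup. The names `make_twisted_tabulation` and `min_frequency`, the 8-bit characters, and the 32-bit output are illustrative assumptions, not taken from the paper.

```python
import random


def make_twisted_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
    """Return a hash function on keys of c characters of char_bits bits each.

    Hypothetical helper: tables are filled from a seeded PRNG for illustration.
    Bits of the key above c * char_bits are ignored.
    """
    rng = random.Random(seed)
    size = 1 << char_bits
    mask = size - 1
    # Table for character 0: plain out_bits-bit hash values.
    t0 = [rng.getrandbits(out_bits) for _ in range(size)]
    # Tables for characters 1..c-1: each entry holds out_bits hash bits
    # plus char_bits "twister" bits in its low positions.
    rest = [[rng.getrandbits(out_bits + char_bits) for _ in range(size)]
            for _ in range(c - 1)]

    def h(x):
        acc = 0
        y = x >> char_bits              # characters 1..c-1
        for table in rest:
            acc ^= table[y & mask]
            y >>= char_bits
        twister = acc & mask            # low bits twist character 0
        return (acc >> char_bits) ^ t0[(x & mask) ^ twister]

    return h


def min_frequency(keys, x, trials, c=4, char_bits=8):
    """Fraction of independently seeded hash functions under which x is the
    minimum of keys (ties, which are rare with 32-bit values, go to the key
    listed first)."""
    hits = 0
    for seed in range(trials):
        h = make_twisted_tabulation(c=c, char_bits=char_bits, seed=seed)
        if min(keys, key=h) == x:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    S = [3, 260, 517, 1000, 4097, 9999, 20000, 40000, 65535, 77]
    # For an epsilon-minwise family this should be close to 1/len(S).
    print(min_frequency(S, 517, trials=2000, c=2, char_bits=8))
```

With $n=|S|=10$, an $\varepsilon$-minwise family should give a frequency near $1/n = 0.1$; the sampling error of the estimate at a few thousand trials is far larger than the bias $\varepsilon$ the paper bounds, so this only illustrates the definition and does not measure the bound.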
