
166 research outputs found

b-Bit Minwise Hashing

Authors: Konig Arnd Christian; Li Ping
Publication date: 17/10/2009

This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, social networks and computational advertising. By only storing the lowest b bits of each (minwise) hashed value (e.g., b=1 or 2), one can gain substantial advantages in terms of computational efficiency and storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is interested in resemblance > 0.5.
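The idea of keeping only the lowest b bits can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the abstract: salted BLAKE2b (the hypothetical helper `hash64`) stands in for random permutations, and the estimator uses the sparse-data approximation P_b ≈ 2^-b + (1 - 2^-b) R, which ignores the set-size correction terms of the paper's exact unbiased estimator. The set contents and the values of `k` and `b` are illustrative only.

```python
import hashlib

def hash64(x: int, j: int) -> int:
    """j-th 64-bit hash of x; salted BLAKE2b is an assumed stand-in for a random permutation."""
    digest = hashlib.blake2b(
        x.to_bytes(8, "little"),
        digest_size=8,
        salt=j.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")

def bbit_signature(items, k: int, b: int):
    """Keep only the lowest b bits of each of the k minwise hash values."""
    mask = (1 << b) - 1
    return [min(hash64(x, j) for x in items) & mask for j in range(k)]

def estimate_resemblance(sig1, sig2, b: int) -> float:
    """Sparse-data approximation: P_b ~ 2^-b + (1 - 2^-b) * R, solved for R."""
    p_hat = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
    c = 2.0 ** -b
    return (p_hat - c) / (1.0 - c)

set_a = set(range(0, 300))
set_b = set(range(100, 400))
true_r = len(set_a & set_b) / len(set_a | set_b)  # 200 / 400 = 0.5
k, b = 512, 1
est_r = estimate_resemblance(
    bbit_signature(set_a, k, b), bbit_signature(set_b, k, b), b
)
print(f"true R = {true_r:.3f}, b-bit estimate = {est_r:.3f}")
```

With b=1 each signature entry costs a single bit, so the accuracy lost per sample has to be bought back with a larger `k`; the abstract's storage comparison concerns exactly this trade-off.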

arXiv.org e-Print Archive

CiteSeerX

Hashing Algorithms for Large-Scale Learning

Authors: Konig Arnd Christian; Li Ping; Moore Joshua; Shrivastava Anshumali
Publication date: 01/01/2011

In this paper, we first demonstrate that b-bit minwise hashing, whose estimators are positive definite kernels, can be naturally integrated with learning algorithms such as SVM and logistic regression. We adopt a simple scheme to transform the nonlinear (resemblance) kernel into a linear (inner product) kernel; hence large-scale problems can be solved extremely efficiently. Our method provides a simple, effective solution to large-scale learning in massive and extremely high-dimensional datasets, especially when data do not fit in memory. We then compare b-bit minwise hashing with the Vowpal Wabbit (VW) algorithm (which is related to the Count-Min (CM) sketch). Interestingly, VW has the same variances as random projections. Our theoretical and empirical comparisons illustrate that usually b-bit minwise hashing is significantly more accurate (at the same storage) than VW (and random projections) in binary data. Furthermore, b-bit minwise hashing can be combined with VW to achieve further improvements in terms of training speed, especially when b is large.
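The abstract does not spell out the transform from the resemblance kernel to an inner product. One way to realize it, assumed here for illustration, is to expand each of the k b-bit values into a one-hot block of length 2^b and concatenate the blocks, so that the inner product of two expanded vectors counts the positions where the signatures agree. The helper names and the example signatures below are hypothetical.

```python
def expand_one_hot(signature, b: int):
    """Map k b-bit values to a binary vector of length k * 2**b.

    The vector is returned sparsely, as the indices that hold a 1 (exactly k of them).
    """
    width = 1 << b
    return [j * width + value for j, value in enumerate(signature)]

def inner_product(indices1, indices2) -> int:
    """Inner product of two binary vectors given by their nonzero indices."""
    return len(set(indices1) & set(indices2))

sig_x = [3, 0, 2, 1]  # hypothetical 2-bit signatures with k = 4
sig_y = [3, 1, 2, 0]
b = 2
matches = sum(u == v for u, v in zip(sig_x, sig_y))
dot = inner_product(expand_one_hot(sig_x, b), expand_one_hot(sig_y, b))
print(matches, dot)  # prints "2 2"
```

Because the expanded vectors are sparse and binary, they could be fed directly to a linear SVM or logistic regression solver, which is the setting the abstract describes.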

arXiv.org e-Print Archive

CiteSeerX

In Defense of MinHash Over SimHash

Authors: Li Ping; Shrivastava Anshumali
Publication date: 16/07/2014

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity ($\mathcal{R}$), while the collision probability of SimHash is a function of cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to $\mathcal{S}$, by using a general inequality $\mathcal{S}^2\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our worst case analysis can show that MinHash significantly outperforms SimHash in high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often $\mathcal{R}\geq \frac{\mathcal{S}}{z-\mathcal{S}}$ holds where $z$ is only slightly larger than 2 (e.g., $z\leq 2.1$). Our restricted worst case analysis by assuming $\frac{\mathcal{S}}{z-\mathcal{S}}\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}$ shows that MinHash indeed significantly outperforms SimHash even in low similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.
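The general inequality between resemblance and cosine similarity can be checked numerically on binary data. The sketch below is only an illustration: it represents binary vectors as Python sets, and the universe size, set sizes and seed are arbitrary assumptions. For sets with intersection size a and sizes f1, f2, it uses R = a / (f1 + f2 - a) and S = a / sqrt(f1 * f2).

```python
import math
import random

def resemblance(a: set, b: set) -> float:
    """Jaccard resemblance R of two non-empty sets."""
    return len(a & b) / len(a | b)

def cosine(a: set, b: set) -> float:
    """Cosine similarity S of the binary indicator vectors of two non-empty sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))

rng = random.Random(0)
universe = range(1000)
violations = 0
for _ in range(1000):
    a = set(rng.sample(universe, rng.randint(1, 200)))
    b = set(rng.sample(universe, rng.randint(1, 200)))
    r, s = resemblance(a, b), cosine(a, b)
    # Small tolerance guards against floating-point rounding at the boundary.
    if not (s * s - 1e-12 <= r <= s / (2 - s) + 1e-12):
        violations += 1
print("violations:", violations)  # prints "violations: 0"
```

The upper bound is tight when the two sets have equal size: for {1, 2, 3, 4} and {3, 4, 5, 6}, R = 1/3 and S / (2 - S) = 0.5 / 1.5 = 1/3.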

arXiv.org e-Print Archive

CiteSeerX

Approximately Minwise Independence with Twisted Tabulation

Authors: A. Broder; A.Z. Broder; E. Cohen; M. Datar; M. Pǎtraşcu; R.E. Fan; Y. Bachrach
Publication date: 01/01/2014

A random hash function $h$ is $\varepsilon$-minwise if for any set $S$, $|S|=n$, and element $x\in S$, $\Pr[h(x)=\min h(S)]=(1\pm\varepsilon)/n$. Minwise hash functions with low bias $\varepsilon$ have widespread applications within similarity estimation. Hashing from a universe $[u]$, the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA'13] makes $c=O(1)$ lookups in tables of size $u^{1/c}$. Twisted tabulation was invented to get good concentration for hashing based sampling. Here we show that twisted tabulation yields $\tilde O(1/u^{1/c})$-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS'79], $\tilde O(1/u^{1/c})$-minwise hashing requires $\Omega(\log u)$-independence [Indyk SODA'99]. Pǎtraşcu and Thorup [STOC'11] had shown that simple tabulation, using the same space and lookups, yields $\tilde O(1/n^{1/c})$-minwise independence, which is good for large sets, but useless for small sets. Our analysis uses some of the same methods, but is much cleaner, bypassing a complicated induction argument.

Comment: To appear in Proceedings of SWAT 201
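The definition of minwise bias can be probed empirically. The sketch below implements simple tabulation, the plainer scheme mentioned in the abstract, and not twisted tabulation itself; it then estimates Pr[h(x) = min h(S)] over freshly drawn hash functions and compares it with 1/n. The universe size (16-bit keys, c = 2 tables of size 256), the set S, the trial count and the seeds are all illustrative assumptions, and the result is a rough sanity check rather than a measurement of the bounds stated above.

```python
import random

def make_simple_tabulation(rng: random.Random, c: int = 2,
                           char_bits: int = 8, out_bits: int = 32):
    """Simple tabulation: XOR of c independent random table lookups, one per key character."""
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x: int) -> int:
        value = 0
        for i in range(c):
            value ^= tables[i][(x >> (i * char_bits)) & mask]
        return value

    return h

def minwise_frequency(keys, target, trials: int, seed: int = 0) -> float:
    """Fraction of freshly drawn hash functions under which `target` gets the minimum hash."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_simple_tabulation(rng)
        if min(keys, key=h) == target:
            hits += 1
    return hits / trials

keys = random.Random(1).sample(range(1 << 16), 8)  # hypothetical set S with n = 8
freq = minwise_frequency(keys, keys[0], trials=1000)
print(f"empirical Pr = {freq:.3f}, 1/n = {1 / len(keys):.3f}")
```

A hash function with bias ε would keep the empirical frequency within a factor (1 ± ε) of 1/n, up to sampling noise from the finite number of trials.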

arXiv.org e-Print Archive

CiteSeerX

Crossref

Copenhagen University Research Information System