Improved Densification of One Permutation Hashing

Authors: Li Ping; Shrivastava Anshumali
Publication date: 18/06/2014

The existing work on densification of one permutation hashing reduces the query processing cost of the $(K,L)$-parameterized Locality Sensitive Hashing (LSH) algorithm with minwise hashing from $O(dKL)$ to merely $O(d + KL)$, where $d$ is the number of nonzeros of the data vector, $K$ is the number of hashes in each hash table, and $L$ is the number of hash tables. While that is a substantial improvement, our analysis reveals that the existing densification scheme is sub-optimal. In particular, there is not enough randomness in that procedure, which hurts its accuracy on very sparse datasets. In this paper, we provide a new densification procedure that is provably better than the existing scheme. The improvement is more significant for very sparse datasets, which are common on the web. The improved technique has the same $O(d + KL)$ query processing cost, making it strictly preferable to the existing procedure. Experimental evaluations on public datasets, in the task of hashing-based near neighbor search, support our theoretical findings.
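As a rough illustration of what densification does, the sketch below builds a one permutation hashing sketch and then fills empty bins by borrowing from the nearest non-empty bin. The abstract only says that the new procedure adds randomness; the per-bin left/right coin flip used here is an assumption about how that randomness enters, and the names `one_permutation_hash`, `densify`, and the offset constant `c` are hypothetical. The naive scan for a donor bin makes no attempt to match the $O(d + KL)$ bound.

```python
import random

def one_permutation_hash(items, k, universe, seed=0):
    """Permute the universe once, split it into k equal bins, and keep the
    smallest within-bin offset per bin (None marks an empty bin)."""
    rng = random.Random(seed)
    perm = list(range(universe))
    rng.shuffle(perm)
    bin_size = universe // k  # assumption: k divides universe evenly
    sketch = [None] * k
    for x in items:
        b, offset = divmod(perm[x], bin_size)
        if sketch[b] is None or offset < sketch[b]:
            sketch[b] = offset
    return sketch

def densify(sketch, bin_size, seed=1):
    """Fill each empty bin from the nearest non-empty bin to its left or right
    (circularly). The direction is a random bit that depends only on the bin
    index and the seed, so every set sketched with the same seed uses the
    same directions."""
    k = len(sketch)
    if all(v is None for v in sketch):
        raise ValueError("cannot densify an all-empty sketch")
    rng = random.Random(seed)
    directions = [rng.choice((-1, 1)) for _ in range(k)]
    c = bin_size + 1  # hypothetical offset: encodes how far the value travelled
    dense = list(sketch)
    for j in range(k):
        if sketch[j] is None:
            step, dist = directions[j], 1
            while sketch[(j + step * dist) % k] is None:
                dist += 1
            dense[j] = sketch[(j + step * dist) % k] + dist * c
    return dense

def match_fraction(sa, sb):
    """Fraction of positions where two dense sketches agree."""
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Under these assumptions, `match_fraction` applied to two densified sketches built with the same seeds serves as an estimate of the resemblance between the two sets.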

arXiv.org e-Print Archive

CiteSeerX

Fast Similarity Sketching

Authors: Dahlgaard Søren; Knudsen Mathias Bæk Tejs; Thorup Mikkel
Publication date: 01/01/2017

We consider the Similarity Sketching problem: given a universe $[u] = \{0,\ldots,u-1\}$, we want a random function $S$ mapping subsets $A \subseteq [u]$ into vectors $S(A)$ of size $t$, such that similarity is preserved. More precisely: given sets $A, B \subseteq [u]$, define $X_i = [S(A)[i] = S(B)[i]]$ and $X = \sum_{i\in[t]} X_i$. We want $E[X] = t \cdot J(A,B)$, where $J(A,B) = |A\cap B|/|A\cup B|$, and furthermore strong concentration guarantees (i.e., Chernoff-style bounds) for $X$. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called sketches. The seminal $t\times$MinHash algorithm uses $t$ random hash functions $h_1,\ldots,h_t$ and stores $\left(\min_{a\in A} h_1(a),\ldots,\min_{a\in A} h_t(a)\right)$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot|A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH), which creates a sketch of size $t$ in $O(t + |A|)$ time, but with the drawback that possibly some of the $t$ entries are "empty" when $|A| = O(t)$. One could argue that sketching is not necessary in this case; however, the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these "densification" schemes fail to provide good concentration bounds exactly in the case $|A| = O(t)$, where they are needed. (continued...
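The $t\times$MinHash construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's fast sketch: the function names are hypothetical, and seeded linear hash functions modulo a prime stand in for the truly random hash functions $h_1,\ldots,h_t$ (an assumption that weakens the guarantee $E[X] = t\cdot J(A,B)$). Input sets are assumed to be non-empty sets of non-negative integers smaller than the prime.

```python
import random

def minhash_sketch(items, t, seed=0, prime=(1 << 61) - 1):
    """t x MinHash: for each of t hash functions, store the minimum hash
    value over the set. Costs O(t * |A|) time, the drawback noted above."""
    rng = random.Random(seed)
    # Assumption: h_i(x) = (a_i * x + b_i) mod prime approximates a random hash.
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(t)]
    return [min((a * x + b) % prime for x in items) for a, b in params]

def estimate_jaccard(sa, sb):
    """X / t, where X counts the coordinates on which the sketches agree."""
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

Both sketches must be built with the same `t` and `seed` for the coordinate-wise comparison to be meaningful.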

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

In Defense of MinHash Over SimHash

Authors: Li Ping; Shrivastava Anshumali
Publication date: 16/07/2014

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search. The collision probability of MinHash is a function of resemblance similarity ($\mathcal{R}$), while the collision probability of SimHash is a function of cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to $\mathcal{S}$, by using a general inequality $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our worst-case analysis shows that MinHash significantly outperforms SimHash in the high-similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, $\mathcal{R} \geq \frac{\mathcal{S}}{z-\mathcal{S}}$ often holds, where $z$ is only slightly larger than 2 (e.g., $z \leq 2.1$). Our restricted worst-case analysis, assuming $\frac{\mathcal{S}}{z-\mathcal{S}} \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$, shows that MinHash indeed significantly outperforms SimHash even in the low-similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.
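The inequality $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$ can be checked numerically for binary data. The sketch below represents binary vectors as sets of their nonzero coordinates, so that resemblance is $|A\cap B|/|A\cup B|$ and cosine similarity is $|A\cap B|/\sqrt{|A||B|}$; the function names, the random-set generator, and the floating-point tolerance are assumptions made for illustration.

```python
import math
import random

def resemblance(a, b):
    """R: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

def cosine(a, b):
    """S for binary vectors: |A & B| / sqrt(|A| * |B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def check_bounds(trials=1000, universe=50, seed=0):
    """Return True if S^2 <= R <= S / (2 - S) holds on random non-empty sets."""
    rng = random.Random(seed)
    eps = 1e-12  # tolerance for floating-point rounding
    for _ in range(trials):
        a = set(rng.sample(range(universe), rng.randint(1, universe)))
        b = set(rng.sample(range(universe), rng.randint(1, universe)))
        r, s = resemblance(a, b), cosine(a, b)
        if not (s * s - eps <= r <= s / (2 - s) + eps):
            return False
    return True
```

A random check of this kind is only a sanity test; it does not replace the proof referred to in the abstract.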

arXiv.org e-Print Archive

CiteSeerX

Practical and Optimal LSH for Angular Distance

Authors: Alexandr Andoni; Ilya Razenshteyn; Ludwig Schmidt; Piotr Indyk; Thijs Laarhoven (TU Eindhoven)
Publication date: 01/01/2015

We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [Andoni, Indyk, Nguyen, Razenshteyn 2014], [Andoni, Razenshteyn 2015]), our algorithm is also practical, improving upon the well-studied hyperplane LSH [Charikar, 2002] in practice. We also introduce a multiprobe version of this algorithm, and conduct experimental evaluation on real and synthetic data sets. We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions.

Comment: 22 pages, an extended abstract is to appear in the proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS 2015)
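For context, the hyperplane LSH baseline [Charikar, 2002] mentioned in the abstract can be sketched as follows. This illustrates only the baseline, not the new LSH family the paper introduces, and all function names are hypothetical. Each hash bit is the sign of a projection onto a random Gaussian direction, and two vectors at angle $\theta$ disagree on a bit with probability $\theta/\pi$.

```python
import math
import random

def random_planes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian normal vectors, one per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def hyperplane_hash(v, planes):
    """One bit per plane: True when v lies on its non-negative side."""
    return tuple(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0 for p in planes)

def estimate_angle(u, v, planes):
    """Estimate the angle between u and v as pi times the fraction of
    differing bits, using Pr[bits differ] = theta / pi."""
    hu, hv = hyperplane_hash(u, planes), hyperplane_hash(v, planes)
    disagree = sum(a != b for a, b in zip(hu, hv)) / len(planes)
    return math.pi * disagree
```

More bits give a tighter angle estimate at the cost of longer hash evaluation, which is the kind of time-versus-quality trade-off the abstract's lower bound is concerned with.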

arXiv.org e-Print Archive

Hashing for Similarity Search: A Survey

Authors: Ji Jianqiu; Shen Heng Tao; Song Jingkuan; Wang Jingdong
Publication date: 13/08/2014

Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently much effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.

arXiv.org e-Print Archive

CiteSeerX