39 research outputs found

    Improved Densification of One Permutation Hashing

    Full text link
    The existing work on densification of one permutation hashing reduces the query processing cost of the (K,L)(K,L)-parameterized Locality Sensitive Hashing (LSH) algorithm with minwise hashing, from O(dKL)O(dKL) to merely O(d+KL)O(d + KL), where dd is the number of nonzeros of the data vector, KK is the number of hashes in each hash table, and LL is the number of hash tables. While that is a substantial improvement, our analysis reveals that the existing densification scheme is sub-optimal. In particular, there is no enough randomness in that procedure, which affects its accuracy on very sparse datasets. In this paper, we provide a new densification procedure which is provably better than the existing scheme. This improvement is more significant for very sparse datasets which are common over the web. The improved technique has the same cost of O(d+KL)O(d + KL) for query processing, thereby making it strictly preferable over the existing procedure. Experimental evaluations on public datasets, in the task of hashing based near neighbor search, support our theoretical findings

    Fast Similarity Sketching

    Full text link
    We consider the Similarity Sketching problem: Given a universe [u]={0,,u1}[u]= \{0,\ldots,u-1\} we want a random function SS mapping subsets A[u]A\subseteq [u] into vectors S(A)S(A) of size tt, such that similarity is preserved. More precisely: Given sets A,B[u]A,B\subseteq [u], define Xi=[S(A)[i]=S(B)[i]]X_i=[S(A)[i]= S(B)[i]] and X=i[t]XiX=\sum_{i\in [t]}X_i. We want to have E[X]=tJ(A,B)E[X]=t\cdot J(A,B), where J(A,B)=AB/ABJ(A,B)=|A\cap B|/|A\cup B| and furthermore to have strong concentration guarantees (i.e. Chernoff-style bounds) for XX. This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors S(A)S(A) are also called sketches. The seminal t×t\timesMinHash algorithm uses tt random hash functions h1,,hth_1,\ldots, h_t, and stores (minaAh1(A),,minaAht(A))\left(\min_{a\in A}h_1(A),\ldots, \min_{a\in A}h_t(A)\right) as the sketch of AA. The main drawback of MinHash is, however, its O(tA)O(t\cdot |A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. Addressing this, Li et al. [NIPS'12] introduced one permutation hashing (OPH), which creates a sketch of size tt in O(t+A)O(t + |A|) time, but with the drawback that possibly some of the tt entries are "empty" when A=O(t)|A| = O(t). One could argue that sketching is not necessary in this case, however the desire in most applications is to have one sketching procedure that works for sets of all sizes. Therefore, filling out these empty entries is the subject of several follow-up papers initiated by Shrivastava and Li [ICML'14]. However, these "densification" schemes fail to provide good concentration bounds exactly in the case A=O(t)|A| = O(t), where they are needed. (continued...

    In Defense of MinHash Over SimHash

    Full text link
    MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity (R\mathcal{R}), while the collision probability of SimHash is a function of cosine similarity (S\mathcal{S}). To provide a common basis for comparison, we evaluate retrieval results in terms of S\mathcal{S} for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S\mathcal{S}, by using a general inequality S2RS2S\mathcal{S}^2\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}}. Our worst case analysis can show that MinHash significantly outperforms SimHash in high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often RSzS\mathcal{R}\geq \frac{\mathcal{S}}{z-\mathcal{S}} holds where zz is only slightly larger than 2 (e.g., z2.1z\leq 2.1). Our restricted worst case analysis by assuming SzSRS2S\frac{\mathcal{S}}{z-\mathcal{S}}\leq \mathcal{R}\leq \frac{\mathcal{S}}{2-\mathcal{S}} shows that MinHash indeed significantly outperforms SimHash even in low similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse

    Practical and Optimal LSH for Angular Distance

    Get PDF
    We show the existence of a Locality-Sensitive Hashing (LSH) family for the angular distance that yields an approximate Near Neighbor Search algorithm with the asymptotically optimal running time exponent. Unlike earlier algorithms with this property (e.g., Spherical LSH [Andoni, Indyk, Nguyen, Razenshteyn 2014], [Andoni, Razenshteyn 2015]), our algorithm is also practical, improving upon the well-studied hyperplane LSH [Charikar, 2002] in practice. We also introduce a multiprobe version of this algorithm, and conduct experimental evaluation on real and synthetic data sets. We complement the above positive results with a fine-grained lower bound for the quality of any LSH family for angular distance. Our lower bound implies that the above LSH family exhibits a trade-off between evaluation time and quality that is close to optimal for a natural class of LSH functions.Comment: 22 pages, an extended abstract is to appear in the proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS 2015

    Hashing for Similarity Search: A Survey

    Full text link
    Similarity search (nearest neighbor search) is a problem of pursuing the data items whose distances to a query item are the smallest from a large database. Various methods have been developed to address this problem, and recently a lot of efforts have been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work locality sensitive hashing. We divide the hashing algorithms two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution and learning to hash, which learns hash functions according the data distribution, and review them from various aspects, including hash function design and distance measure and search scheme in the hash coding space
    corecore