In Defense of MinHash Over SimHash
MinHash and SimHash are the two widely adopted Locality Sensitive Hashing
(LSH) algorithms for large-scale data processing applications. Deciding which
LSH to use for a particular problem at hand is an important question, which has
no clear answer in the existing literature. In this study, we provide a
theoretical answer (validated by experiments) that MinHash virtually always
outperforms SimHash when the data are binary, as is common in practical
applications such as search.
The collision probability of MinHash is a function of resemblance similarity
($\mathcal{R}$), while the collision probability of SimHash is a function of
cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we
evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and
SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH
with respect to $\mathcal{S}$, by using a general inequality
$\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our worst case
analysis shows that MinHash significantly outperforms SimHash in the high
similarity region. Interestingly, our intensive experiments reveal that MinHash
is also substantially better than SimHash even in datasets where most of the
data points are not too similar to each other. This is partly because, in
practical data, $\mathcal{R} \geq \frac{\mathcal{S}}{z-\mathcal{S}}$ often holds, where
$z$ is only slightly larger than 2 (e.g., $z \leq 2.1$). Our restricted worst case
analysis, which assumes
$\frac{\mathcal{S}}{z-\mathcal{S}} \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$,
shows that MinHash indeed significantly outperforms SimHash even in the low
similarity region.
We believe the results in this paper will provide valuable guidelines for
search in practice, especially when the data are sparse.
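As an illustration of the relationship between the two similarities for binary data, the sketch below computes resemblance R and cosine similarity S exactly for two sets and checks the inequality S^2 <= R <= S/(2-S). The function names are hypothetical and chosen for this example only; the code is a numerical sanity check under the assumption of non-empty sets, not the paper's analysis.

```python
import math


def resemblance(a: set, b: set) -> float:
    """Resemblance (Jaccard) similarity R of two non-empty sets."""
    return len(a & b) / len(a | b)


def cosine(a: set, b: set) -> float:
    """Cosine similarity S of the binary indicator vectors of two non-empty sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def bounds_hold(a: set, b: set, eps: float = 1e-12) -> bool:
    """Check S^2 <= R <= S / (2 - S), with a small tolerance for rounding."""
    r, s = resemblance(a, b), cosine(a, b)
    return s * s <= r + eps and r <= s / (2 - s) + eps


if __name__ == "__main__":
    x, y = {1, 2, 3, 4}, {3, 4, 5}
    print(resemblance(x, y), cosine(x, y), bounds_hold(x, y))
```

The upper bound is tight when the two sets have equal size, which is one way to see why R and S carry similar information on sparse binary data.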
DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication
Metrics for set similarity are a core aspect of several data mining tasks. To
remove duplicate results in a Web search, for example, a common approach looks
at the Jaccard index between all pairs of pages. In social network analysis, a
much-celebrated metric is the Adamic-Adar index, widely used to compare node
neighborhood sets in the important problem of predicting links. However, with
the increasing amount of data to be processed, calculating the exact similarity
between all pairs can be intractable. The challenge of working at this scale
has motivated research into efficient estimators for set similarity metrics.
The two most popular estimators, MinHash and SimHash, are indeed used in
applications such as document deduplication and recommender systems where large
volumes of data need to be processed. Given the importance of these tasks, the
demand for advancing estimators is evident. We propose DotHash, an unbiased
estimator for the intersection size of two sets. DotHash can be used to
estimate the Jaccard index and, to the best of our knowledge, is the first
method that can also estimate the Adamic-Adar index and a family of related
metrics. We formally define this family of metrics, provide theoretical bounds
on the probability of estimate errors, and analyze its empirical performance.
Our experimental results indicate that DotHash is more accurate than the other
estimators in link prediction and detecting duplicate documents with the same
complexity and similar comparison time.
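For reference, the exact quantities that such estimators approximate can be written down directly. The sketch below shows how a Jaccard index follows from an intersection size (exact or estimated) and how the exact Adamic-Adar index is computed on a small adjacency map. It is not DotHash itself; the function names and the toy graph are assumptions made for illustration.

```python
import math


def jaccard_from_intersection(size_a: int, size_b: int, intersection: float) -> float:
    """Jaccard index from the two set sizes and an intersection size.

    The intersection may be exact or an estimate produced by a sketch.
    """
    union = size_a + size_b - intersection
    return intersection / union if union > 0 else 0.0


def adamic_adar(neighbors: dict, x, y) -> float:
    """Exact Adamic-Adar index: sum of 1 / log(degree) over common neighbors.

    Assumes a simple undirected graph, so every common neighbor of two
    distinct nodes has degree at least 2 and the logarithm is positive.
    """
    common = neighbors[x] & neighbors[y]
    return sum(1.0 / math.log(len(neighbors[z])) for z in common)


if __name__ == "__main__":
    graph = {
        "a": {"c", "d"},
        "b": {"c", "d"},
        "c": {"a", "b", "e"},
        "d": {"a", "b"},
        "e": {"c"},
    }
    print(jaccard_from_intersection(4, 3, 2))
    print(adamic_adar(graph, "a", "b"))
```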
Parallel Index-Based Structural Graph Clustering and Its Approximation
SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely
used graph clustering algorithm. For large graphs, however, sequential SCAN
variants are prohibitively slow, and parallel SCAN variants do not effectively
share work among queries with different SCAN parameter settings. Since users of
SCAN often explore many parameter settings to find good clusterings, it is
worthwhile to precompute an index that speeds up queries.
This paper presents a practical and provably efficient parallel index-based
SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel
algorithm improves upon the asymptotic work of the sequential algorithm by
using integer sorting. It is also highly parallel, achieving logarithmic span
(parallel time) for both index construction and clustering queries.
Furthermore, we apply locality-sensitive hashing (LSH) to design a novel
approximate SCAN algorithm and prove guarantees for its clustering behavior.
We present an experimental evaluation of our algorithms on large real-world
graphs. On a 48-core machine with two-way hyper-threading, our parallel index
construction achieves 50--151$\times$ speedup over the construction of
GS*-Index. In fact, even on a single thread, our index construction algorithm
is faster than GS*-Index. Our parallel index query implementation achieves
5--32$\times$ speedup over GS*-Index queries across a range of SCAN parameter
values, and our implementation is always faster than ppSCAN, a state-of-the-art
parallel SCAN algorithm. Moreover, our experiments show that applying LSH
results in faster index construction while maintaining good clustering quality.
Improving the Sensitivity of MinHash Through Hash-Value Analysis
MinHash sketching is an important algorithm for efficient document retrieval and bioinformatics. We show that the value of the matching MinHash codes conveys additional information about the Jaccard similarity of two sets S and T, over and above the fact that the MinHash codes agree. This observation holds the potential to increase the sensitivity of MinHash-based retrieval systems. We analyze the expected Jaccard similarity of two sets as a function of observing a matching MinHash value a under a reasonable prior distribution on intersection set sizes, and present a practical approach to using MinHash values to improve the sensitivity of traditional Jaccard similarity estimation, based on the Kolmogorov-Smirnov statistical test for sample distributions. Experiments over a wide range of hash function counts and set similarities show a small but consistent improvement over chance at predicting over/under-estimation, yielding an average accuracy of 61% over the range of experiments.
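The traditional estimator that this abstract takes as its baseline can be sketched as follows: each of k hash functions maps a set to its minimum hash value, and the Jaccard similarity is estimated as the fraction of positions where two signatures agree. This shows only the standard estimator, not the paper's hash-value analysis. The linear hash family, the prime modulus, and all function names are assumptions for illustration, and items are assumed to be non-negative integers smaller than the modulus.

```python
import random

_PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (an assumption)


def make_hashes(k: int, seed: int = 0) -> list:
    """Draw k random linear hash functions x -> (a * x + b) mod _PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME)) for _ in range(k)]


def signature(items: set, hashes: list) -> list:
    """MinHash signature: the minimum hash value of the set under each function."""
    return [min((a * x + b) % _PRIME for x in items) for a, b in hashes]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


if __name__ == "__main__":
    hashes = make_hashes(256, seed=42)
    s = signature(set(range(0, 100)), hashes)
    t = signature(set(range(50, 150)), hashes)
    print(estimate_jaccard(s, t))  # true Jaccard similarity is 50 / 150
```

Because each linear map is a bijection modulo the prime, disjoint sets never share a minimum, so the estimate for disjoint sets is exactly zero in this sketch.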
Building K-Anonymous User Cohorts with Consecutive Consistent Weighted Sampling (CCWS)
To retrieve personalized campaigns and creatives while protecting user
privacy, digital advertising is shifting from member-based identity to
cohort-based identity. Under such an identity regime, an accurate and efficient
cohort building algorithm is desired to group users with similar
characteristics. In this paper, we propose a scalable $K$-anonymous cohort
building algorithm called consecutive consistent weighted sampling
(CCWS). The proposed method combines the spirit of the ($p$-powered) consistent
weighted sampling and hierarchical clustering, so that the $K$-anonymity is
ensured by enforcing a lower bound on the size of cohorts. Evaluations on a
LinkedIn dataset consisting of M users and ads campaigns demonstrate that
CCWS achieves substantial improvements over several hashing-based methods
including sign random projections (SignRP), minwise hashing (MinHash), as well
as the vanilla CWS.