A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

Author: Broder A.
Durand Marianne
Flajolet Philippe
Li Ping
Shrivastava Anshumali
Publication venue: Association for Computing Machinery (ACM)
Publication date: 22/05/2019
Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating the Jaccard similarity of sets and has been used successfully in many applications, such as similarity search and large-scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). MinHash can be applied to static sets as well as to streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite. Unfortunately, b-bit MinHash and Odd Sketch fail to deal with streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (each register consists of fewer than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and the experimental results show that MaxLogHash is about 5 times more memory efficient than MinHash with the same accuracy and computational cost for estimating high similarities.
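
As a rough illustration of the baseline this abstract compares against, the following sketches plain MinHash over a stream of elements. It is not MaxLogHash: the abstract does not give the register update rule or the estimator, so neither is reproduced here. The class name `StreamingMinHash`, the default of 256 registers, and the use of seeded BLAKE2b as the hash family are all assumptions made for the example.

```python
import hashlib


def _hash01(seed: int, item: str) -> float:
    """Map (seed, item) to a pseudo-uniform value in [0, 1).

    Seeded BLAKE2b stands in for a family of independent hash functions
    (an assumption of this sketch, not something the abstract specifies).
    """
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


class StreamingMinHash:
    """Plain MinHash sketch that is updated one element at a time."""

    def __init__(self, k: int = 256) -> None:
        self.k = k
        # Register j holds the smallest hash value seen under hash function j.
        # 1.0 is larger than any hash value, so it marks "nothing seen yet".
        self.registers = [1.0] * k

    def add(self, item: str) -> None:
        """Process one stream element; duplicates leave the sketch unchanged."""
        for j in range(self.k):
            h = _hash01(j, item)
            if h < self.registers[j]:
                self.registers[j] = h

    def jaccard(self, other: "StreamingMinHash") -> float:
        """Estimate Jaccard similarity as the fraction of matching registers."""
        matches = sum(a == b for a, b in zip(self.registers, other.registers))
        return matches / self.k


# Two streams whose underlying sets overlap heavily (true Jaccard = 95/105).
sketch_a = StreamingMinHash()
sketch_b = StreamingMinHash()
for i in range(100):
    sketch_a.add(str(i))
for i in range(5, 105):
    sketch_b.add(str(i))
print(f"estimated Jaccard: {sketch_a.jaccard(sketch_b):.3f}")
```

Each register here is a full-width value, which is the memory cost that compressed variants aim to reduce; the sketch only shows why MinHash itself needs no knowledge of the set's cardinality in advance.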

arXiv.org e-Print Archive

Crossref

Pb-Hash: Partitioned b-bit Hashing

Author: Li Ping
Zhao Weijie
Publication date: 28/06/2023
Many hashing algorithms including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS) generate integers of $B$ bits. With $k$ hashes for each data vector, the storage would be $B\times k$ bits; and when used for large-scale learning, the model size would be $2^B\times k$, which can be expensive. A standard strategy is to use only the lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of hashes. In this study, we propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b\times m = B$. Correspondingly, the model size becomes $m\times 2^b \times k$, which can be substantially smaller than the original $2^B\times k$. Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop. In other words, using $m$ chunks of $B/m$ bits would not be as accurate as directly using $B$ bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for (e.g.,) $m=2\sim 4$. In some regions, Pb-Hash still works well even for $m$ much larger than 4. We expect Pb-Hash would be a good addition to the family of hashing methods/applications and benefit industrial practitioners. We verify the effectiveness of Pb-Hash in machine learning tasks, for linear SVM models as well as deep learning models. Since the hashed data are essentially categorical (ID) features, we follow the standard practice of using embedding tables for each hash. With Pb-Hash, we need to design an effective strategy to combine $m$ embeddings. Our study provides an empirical evaluation on four pooling schemes: concatenation, max pooling, mean pooling, and product pooling. There is no definite answer which pooling would be always better and we leave that for future study.
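
The partitioning and model-size arithmetic in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `split_hash`, `model_size`, and `pool` are hypothetical, the chunks are taken from the lowest bits upward (the abstract does not fix an order), and "product pooling" is assumed here to mean an element-wise product.

```python
import math


def split_hash(h: int, B: int, m: int) -> list[int]:
    """Partition a B-bit hash value into m chunks of b = B/m bits each.

    Chunk 0 holds the lowest b bits (an ordering chosen for this sketch).
    """
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B, so that b * m == B")
    b = B // m
    mask = (1 << b) - 1
    return [(h >> (i * b)) & mask for i in range(m)]


def model_size(B: int, k: int, m: int = 1) -> int:
    """Number of table entries: m * 2^b * k with b = B/m (2^B * k when m = 1)."""
    b = B // m
    return m * (2 ** b) * k


def pool(embeddings: list[list[float]], scheme: str) -> list[float]:
    """Combine the m chunk embeddings of one hash using the named scheme."""
    if scheme == "concat":
        return [x for emb in embeddings for x in emb]
    columns = list(zip(*embeddings))  # group values by embedding dimension
    if scheme == "max":
        return [max(col) for col in columns]
    if scheme == "mean":
        return [sum(col) / len(col) for col in columns]
    if scheme == "product":
        return [math.prod(col) for col in columns]  # element-wise (assumed)
    raise ValueError(f"unknown pooling scheme: {scheme}")


# Example values chosen for illustration: B = 16 bits, k = 100 hashes.
print(split_hash(0xABCD, B=16, m=4))   # four 4-bit chunks, lowest bits first
print(model_size(B=16, k=100, m=1))    # 2^16 * 100 entries
print(model_size(B=16, k=100, m=4))    # 4 * 2^4 * 100 entries
```

With these example values, moving from $m=1$ to $m=4$ shrinks the table from $2^{16}\times 100$ to $4\times 2^{4}\times 100$ entries, which is the trade against accuracy that the abstract describes.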

arXiv.org e-Print Archive