89 research outputs found
A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
Estimating set similarity and detecting highly similar sets are fundamental
problems in areas such as databases, machine learning, and information
retrieval. MinHash is a well-known technique for approximating Jaccard
similarity of sets and has been successfully used for many applications such as
similarity search and large scale learning. Its two compressed versions, b-bit
MinHash and Odd Sketch, can significantly reduce the memory usage of the
original MinHash method, especially for estimating high similarities (i.e.,
similarities around 1). Although MinHash can be applied to static sets as well
as streaming sets, of which elements are given in a streaming fashion and
cardinality is unknown or even infinite, unfortunately, b-bit MinHash and Odd
Sketch fail to deal with streaming data. To solve this problem, we design a
memory efficient sketch method, MaxLogHash, to accurately estimate Jaccard
similarities in streaming sets. Compared to MinHash, our method uses smaller
sized registers (each register consists of less than 7 bits) to build a compact
sketch for each set. We also provide a simple yet accurate estimator for
inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive
formulas for bounding the estimation error and determine the smallest necessary
memory usage (i.e., the number of registers used for a MaxLogHash sketch) for
the desired accuracy. We conduct experiments on a variety of datasets, and
experimental results show that our method MaxLogHash is about 5 times more
memory efficient than MinHash with the same accuracy and computational cost for
estimating high similarities
Pb-Hash: Partitioned b-bit Hashing
Many hashing algorithms including minwise hashing (MinHash), one permutation
hashing (OPH), and consistent weighted sampling (CWS) generate integers of
bits. With hashes for each data vector, the storage would be
bits; and when used for large-scale learning, the model size would be
, which can be expensive. A standard strategy is to use only the
lowest bits out of the bits and somewhat increase , the number of
hashes. In this study, we propose to re-use the hashes by partitioning the
bits into chunks, e.g., . Correspondingly, the model size
becomes , which can be substantially smaller than the
original .
Our theoretical analysis reveals that by partitioning the hash values into
chunks, the accuracy would drop. In other words, using chunks of
bits would not be as accurate as directly using bits. This is due to the
correlation from re-using the same hash. On the other hand, our analysis also
shows that the accuracy would not drop much for (e.g.,) . In some
regions, Pb-Hash still works well even for much larger than 4. We expect
Pb-Hash would be a good addition to the family of hashing methods/applications
and benefit industrial practitioners.
We verify the effectiveness of Pb-Hash in machine learning tasks, for linear
SVM models as well as deep learning models. Since the hashed data are
essentially categorical (ID) features, we follow the standard practice of using
embedding tables for each hash. With Pb-Hash, we need to design an effective
strategy to combine embeddings. Our study provides an empirical evaluation
on four pooling schemes: concatenation, max pooling, mean pooling, and product
pooling. There is no definite answer which pooling would be always better and
we leave that for future study
- …