A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
Estimating set similarity and detecting highly similar sets are fundamental
problems in areas such as databases, machine learning, and information
retrieval. MinHash is a well-known technique for approximating Jaccard
similarity of sets and has been successfully used for many applications such as
similarity search and large scale learning. Its two compressed versions, b-bit
MinHash and Odd Sketch, can significantly reduce the memory usage of the
original MinHash method, especially for estimating high similarities (i.e.,
similarities around 1). Although MinHash can be applied to static sets as well
as to streaming sets, whose elements arrive in a streaming fashion and whose
cardinality is unknown or even infinite, b-bit MinHash and Odd
Sketch cannot handle streaming data. To solve this problem, we design a
memory efficient sketch method, MaxLogHash, to accurately estimate Jaccard
similarities in streaming sets. Compared to MinHash, our method uses smaller
sized registers (each register consists of less than 7 bits) to build a compact
sketch for each set. We also provide a simple yet accurate estimator for
inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive
formulas for bounding the estimation error and determine the smallest necessary
memory usage (i.e., the number of registers used for a MaxLogHash sketch) for
the desired accuracy. We conduct experiments on a variety of datasets, and
experimental results show that our method MaxLogHash is about 5 times more
memory efficient than MinHash with the same accuracy and computational cost for
estimating high similarities.
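For orientation, the baseline that the abstract above compares against can be sketched as follows. This is a minimal illustration of plain MinHash over a stream, not the MaxLogHash method itself; the class name `MinHashSketch`, the helper `_hash`, and the choice of salted BLAKE2b as the hash family are assumptions made for this example.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Hypothetical helper: a 64-bit hash of `item`, salted by `seed`."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


class MinHashSketch:
    """Plain MinHash: k registers, each holding the minimum hash seen so far."""

    def __init__(self, k: int = 128) -> None:
        self.k = k
        # 2**64 is larger than any 64-bit hash, so it acts as "empty".
        self.registers = [2**64] * k

    def add(self, item: str) -> None:
        # Elements can be fed one at a time, so the set may be a stream.
        for j in range(self.k):
            h = _hash(j, item)
            if h < self.registers[j]:
                self.registers[j] = h

    def jaccard(self, other: "MinHashSketch") -> float:
        # The fraction of matching registers estimates the Jaccard similarity.
        matches = sum(a == b for a, b in zip(self.registers, other.registers))
        return matches / self.k


if __name__ == "__main__":
    a, b = MinHashSketch(), MinHashSketch()
    for i in range(0, 300):
        a.add(str(i))
    for i in range(100, 400):
        b.add(str(i))
    # True Jaccard similarity is 200 / 400 = 0.5; the estimate should be near it.
    print(a.jaccard(b))
```

Each register here stores a full 64-bit hash value. The abstract's claim is that MaxLogHash reaches comparable accuracy with registers of fewer than 7 bits.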
Maximally Consistent Sampling and the Jaccard Index of Probability Distributions
We introduce simple, efficient algorithms for computing a MinHash of a
probability distribution, suitable for both sparse and dense data, with
equivalent running times to the state of the art for both cases. The collision
probability of these algorithms is a new measure of the similarity of positive
vectors which we investigate in detail. We describe the sense in which this
collision probability is optimal for any Locality Sensitive Hash based on
sampling. We argue that this similarity measure is more useful for probability
distributions than the similarity pursued by other algorithms for weighted
MinHash, and is the natural generalization of the Jaccard index.
Comment: To appear in ICDMW 201
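One standard way to draw consistent samples from a vector of positive weights is an "exponential race": each element gets an exponential arrival time whose rate is its weight, and the earliest arrival wins. The sketch below illustrates that idea only and is not claimed to be the authors' exact algorithm; the names `race_sketch`, `collision_rate`, and `_uniform` are hypothetical.

```python
import hashlib
import math


def _uniform(seed: int, key: str) -> float:
    """Hypothetical helper: a hash-derived pseudo-uniform value in (0, 1]."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return (int.from_bytes(digest, "little") + 1) / (2**64 + 1)


def race_sketch(weights: dict, k: int = 128) -> list:
    """For each of k seeds, return the key with the earliest arrival time."""
    sketch = []
    for j in range(k):
        best_key, best_time = None, math.inf
        for key, w in weights.items():
            if w <= 0:
                continue  # zero-weight elements never win the race
            # Exponential arrival time with rate w, driven by a shared hash.
            t = -math.log(_uniform(j, key)) / w
            if t < best_time:
                best_key, best_time = key, t
        sketch.append(best_key)
    return sketch


def collision_rate(a: list, b: list) -> float:
    """Fraction of positions where two sketches picked the same key."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    q = {"a": 0.2, "b": 0.3, "c": 0.5}
    print(collision_rate(race_sketch(p), race_sketch(q)))
```

Because every arrival time is divided by the element's weight, multiplying all weights by the same constant leaves the winner unchanged. The sketch therefore depends only on the normalized distribution, which fits the abstract's framing of a similarity between probability distributions.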
Consistent Weighted Sampling Made Fast, Small, and Easy
Document sketching using Jaccard similarity has been an effective
technique in reducing near-duplicates in Web page and image search results, and
has also proven useful in file system synchronization, compression and learning
applications.
Min-wise sampling can be used to derive an unbiased estimator for Jaccard
similarity and taking a few hundred independent consistent samples leads to
compact sketches which provide good estimates of pairwise-similarity.
Subsequent works extended this technique to weighted sets and show how to
produce samples with only a constant number of hash evaluations for any
element, independent of its weight. Another improvement by Li et al. shows how
to speed up sketch computations by computing many (near-)independent samples in
one shot. Unfortunately this latter improvement works only for the unweighted
case.
In this paper we give a simple, fast and accurate procedure which reduces
weighted sets to unweighted sets with small impact on the Jaccard similarity.
This leads to compact sketches consisting of many (near-)independent weighted
samples which can be computed with just a small constant number of hash
function evaluations per weighted element. The size of the produced unweighted
set is furthermore a tunable parameter which enables us to run the unweighted
scheme of Li et al. in the regime where it is most efficient. Even when the
sets involved are unweighted, our approach gives a simple solution to the
densification problem that other works attempted to address.
Unlike previously known schemes, ours does not result in an unbiased
estimator. However, we prove that the bias introduced by our reduction is
negligible and that the standard deviation is comparable to the unweighted
case. We also empirically evaluate our scheme and show that it gives
significant gains in computational efficiency, without any measurable loss in
accuracy.
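To make the idea of reducing weighted sets to unweighted sets concrete, here is a deliberately naive version that works only for non-negative integer weights: an element of weight w is replaced by w distinct copies. This is not the reduction proposed in the abstract above, which controls the size of the output set and handles general weights; the function names are assumptions for illustration.

```python
def weighted_jaccard(x: dict, y: dict) -> float:
    """Weighted Jaccard: sum of element-wise minima over sum of maxima."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    den = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return num / den if den else 1.0


def expand(x: dict) -> set:
    """Naive reduction: an element of integer weight w becomes w distinct copies."""
    return {(key, j) for key, w in x.items() for j in range(w)}


def jaccard(a: set, b: set) -> float:
    """Plain (unweighted) Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    x = {"a": 3, "b": 1}
    y = {"a": 1, "b": 2, "c": 1}
    # Both lines print the same value: minima sum to 2, maxima sum to 6.
    print(weighted_jaccard(x, y))
    print(jaccard(expand(x), expand(y)))
```

The expanded set grows linearly with the weights, which is the cost a tunable-size reduction avoids.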
Engineering a Simplified 0-Bit Consistent Weighted Sampling
The Min-Hashing approach to sketching has become an important tool in data
analysis, information retrieval, and classification. To apply it to real-valued
datasets, the ICWS algorithm has become a seminal approach that is widely used,
and provides state-of-the-art performance for this problem space. However, ICWS
suffers a computational burden as the sketch size K increases. We develop a new
Simplified approach to the ICWS algorithm, that enables us to obtain over 20x
speedups compared to the standard algorithm. The veracity of our approach is
demonstrated empirically on multiple datasets and scenarios, showing that our
new Simplified CWS obtains the same quality of results while being an order of
magnitude faster.
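For readers unfamiliar with ICWS, the following is a minimal, unoptimized sketch of the standard algorithm as it is commonly described (per-element Gamma(2, 1) and uniform draws, then an argmin). It is not the Simplified CWS variant from the abstract above. The function names and the per-(sample, key) seeding scheme are assumptions for this example.

```python
import math
import random


def icws_sketch(weights: dict, k: int = 256) -> list:
    """Return k samples (key, t) from a dict of positive real weights."""
    sketch = []
    for j in range(k):
        best, best_a = None, math.inf
        for key, w in weights.items():
            if w <= 0:
                continue
            # Seeding per (sample index, key) makes the draws consistent:
            # the same key gets the same randomness in every input vector.
            rng = random.Random(f"{j}:{key}")
            r = rng.gammavariate(2.0, 1.0)
            c = rng.gammavariate(2.0, 1.0)
            beta = rng.random()
            t = math.floor(math.log(w) / r + beta)
            y = math.exp(r * (t - beta))
            a = c / (y * math.exp(r))
            if a < best_a:
                best, best_a = (key, t), a
        sketch.append(best)
    return sketch


def estimate(s1: list, s2: list, zero_bit: bool = False) -> float:
    """Collision rate; with zero_bit=True only the selected key is compared."""
    if zero_bit:
        hits = sum(p[0] == q[0] for p, q in zip(s1, s2))
    else:
        hits = sum(p == q for p, q in zip(s1, s2))
    return hits / len(s1)


if __name__ == "__main__":
    x = {"a": 2.0, "b": 1.0}
    y = {"a": 1.0, "b": 1.0, "c": 1.0}
    sx, sy = icws_sketch(x), icws_sketch(y)
    # The weighted Jaccard similarity of x and y is 2 / 4 = 0.5.
    print(estimate(sx, sy), estimate(sx, sy, zero_bit=True))
```

The inner loop runs once per element for each of the K samples, so the cost grows linearly with the sketch size. That growth is the computational burden the abstract refers to.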
Streaming histogram sketching for rapid microbiome analytics
Background: The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time.
Results: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s.
Conclusions: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space.