469 research outputs found
In Defense of MinHash Over SimHash
MinHash and SimHash are the two widely adopted Locality Sensitive Hashing
(LSH) algorithms for large-scale data processing applications. Deciding which
LSH to use for a particular problem at hand is an important question, which has
no clear answer in the existing literature. In this study, we provide a
theoretical answer (validated by experiments) that MinHash virtually always
outperforms SimHash when the data are binary, as common in practice such as
search.
The collision probability of MinHash is a function of resemblance similarity
(), while the collision probability of SimHash is a function of
cosine similarity (). To provide a common basis for comparison, we
evaluate retrieval results in terms of for both MinHash and
SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH
with respect to , by using a general inequality . Our worst case analysis can
show that MinHash significantly outperforms SimHash in high similarity region.
Interestingly, our intensive experiments reveal that MinHash is also
substantially better than SimHash even in datasets where most of the data
points are not too similar to each other. This is partly because, in practical
data, often holds where
is only slightly larger than 2 (e.g., ). Our restricted worst case
analysis by assuming shows that MinHash indeed significantly
outperforms SimHash even in low similarity region.
We believe the results in this paper will provide valuable guidelines for
search in practice, especially when the data are sparse
HDIdx: High-Dimensional Indexing for Efficient Approximate Nearest Neighbor Search
Fast Nearest Neighbor (NN) search is a fundamental challenge in large-scale
data processing and analytics, particularly for analyzing multimedia contents
which are often of high dimensionality. Instead of using exact NN search,
extensive research efforts have been focusing on approximate NN search
algorithms. In this work, we present "HDIdx", an efficient high-dimensional
indexing library for fast approximate NN search, which is open-source and
written in Python. It offers a family of state-of-the-art algorithms that
convert input high-dimensional vectors into compact binary codes, making them
very efficient and scalable for NN search with very low space complexity
Scalable and Sustainable Deep Learning via Randomized Hashing
Current deep learning architectures are growing larger in order to learn from
complex datasets. These architectures require giant matrix multiplication
operations to train millions of parameters. Conversely, there is another
growing trend to bring deep learning to low-power, embedded devices. The matrix
operations, associated with both training and testing of deep networks, are
very expensive from a computational and energy standpoint. We present a novel
hashing based technique to drastically reduce the amount of computation needed
to train and test deep networks. Our approach combines recent ideas from
adaptive dropouts and randomized hashing for maximum inner product search to
select the nodes with the highest activation efficiently. Our new algorithm for
deep learning reduces the overall computational cost of forward and
back-propagation by operating on significantly fewer (sparse) nodes. As a
consequence, our algorithm uses only 5% of the total multiplications, while
keeping on average within 1% of the accuracy of the original model. A unique
property of the proposed hashing based back-propagation is that the updates are
always sparse. Due to the sparse gradient updates, our algorithm is ideally
suited for asynchronous and parallel training leading to near linear speedup
with increasing number of cores. We demonstrate the scalability and
sustainability (energy efficiency) of our proposed algorithm via rigorous
experimental evaluations on several real datasets
Structured Multi-Hashing for Model Compression
Despite the success of deep neural networks (DNNs), state-of-the-art models
are too large to deploy on low-resource devices or common server configurations
in which multiple models are held in memory. Model compression methods address
this limitation by reducing the memory footprint, latency, or energy
consumption of a model with minimal impact on accuracy. We focus on the task of
reducing the number of learnable variables in the model. In this work we
combine ideas from weight hashing and dimensionality reductions resulting in a
simple and powerful structured multi-hashing method based on matrix products
that allows direct control of model size of any deep network and is trained
end-to-end. We demonstrate the strength of our approach by compressing models
from the ResNet, EfficientNet, and MobileNet architecture families. Our method
allows us to drastically decrease the number of variables while maintaining
high accuracy. For instance, by applying our approach to EfficentNet-B4 (16M
parameters) we reduce it to to the size of B0 (5M parameters), while gaining
over 3% in accuracy over B0 baseline. On the commonly used benchmark CIFAR10 we
reduce the ResNet32 model by 75% with no loss in quality, and are able to do a
10x compression while still achieving above 90% accuracy.Comment: Elad and Yair contributed equally to the paper. They jointly proposed
the idea of structured-multi-hashing. Elad: Wrote most of the code and ran
most of the experiments Yair: Main contributor to the manuscript Hao: Coding
and experiments Yerlan: Coding and experiments Miguel: advised Yerlan about
optimization and model compression Mark:experiments Andrew: experiment
- …