b-Bit Minwise Hashing
This paper establishes the theoretical framework of b-bit minwise hashing.
The original minwise hashing method has become a standard technique for
estimating set similarity (e.g., resemblance) with applications in information
retrieval, data management, social networks and computational advertising.
By only storing the lowest b bits of each (minwise) hashed value (e.g., b=1
or 2), one can gain substantial advantages in terms of computational efficiency
and storage space. We prove the basic theoretical results and provide an
unbiased estimator of the resemblance for any b. We demonstrate that, even in
the least favorable scenario, using b=1 may reduce the storage space at least
by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is
interested in resemblance > 0.5.
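As an illustration, here is a minimal Python sketch of b-bit signatures and the unbiased estimator described above (the seeded linear hash is a cheap stand-in for a true random permutation, and names such as bbit_signature are our illustrative choices, not the paper's code):

import random

def bbit_signature(s, seeds, b, D):
    # Lowest b bits of the minwise hash of set s (a subset of {0,...,D-1}),
    # one value per simulated permutation.
    mask = (1 << b) - 1
    sig = []
    for seed in seeds:
        rng = random.Random(seed)
        a, c = rng.randrange(1, D), rng.randrange(D)
        sig.append(min((a * x + c) % D for x in s) & mask)
    return sig

def estimate_resemblance(sig1, sig2, b, f1, f2, D):
    # Unbiased estimator R_hat = (P_hat - C1) / (1 - C2), where f1, f2 are
    # the two set sizes and D is the size of the universe.
    r1, r2 = f1 / D, f2 / D
    A1 = r1 * (1 - r1) ** (2**b - 1) / (1 - (1 - r1) ** (2**b))
    A2 = r2 * (1 - r2) ** (2**b - 1) / (1 - (1 - r2) ** (2**b))
    C1 = (A1 * r2 + A2 * r1) / (r1 + r2)
    C2 = (A1 * r1 + A2 * r2) / (r1 + r2)
    P_hat = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
    return (P_hat - C1) / (1 - C2)

Each stored hash costs only b bits instead of 64, which, after accounting for the extra hashes b-bit estimation needs, is the source of the storage savings quoted above.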
Hashing Algorithms for Large-Scale Learning
In this paper, we first demonstrate that b-bit minwise hashing, whose
estimators are positive definite kernels, can be naturally integrated with
learning algorithms such as SVM and logistic regression. We adopt a simple
scheme to transform the nonlinear (resemblance) kernel into a linear (inner
product) kernel, so that large-scale problems can be solved extremely
efficiently. Our method provides a simple, effective solution to large-scale
learning in massive and extremely high-dimensional datasets, especially when
data do not fit in memory.
We then compare b-bit minwise hashing with the Vowpal Wabbit (VW) algorithm
(which is related to the Count-Min (CM) sketch). Interestingly, VW has the same
variances as random projections. Our theoretical and empirical comparisons
illustrate that b-bit minwise hashing is usually significantly more accurate
(at the same storage) than VW (and random projections) on binary data.
Furthermore, b-bit minwise hashing can be combined with VW to achieve further
improvements in terms of training speed, especially when b is large.
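A minimal sketch of the linearization step, assuming k b-bit hashes per example have already been computed (the expand name and the normalization are our choices): each b-bit value is one-hot encoded over its 2**b possibilities, so the inner product of two expanded vectors is exactly the fraction of matching hashes, which any linear learner can then exploit.

import numpy as np

def expand(signature, b):
    # One-hot encode each of the k b-bit hash values; the result is a sparse
    # binary vector of dimension k * 2**b (dense here for simplicity).
    k, width = len(signature), 1 << b
    v = np.zeros(k * width)
    for i, h in enumerate(signature):
        v[i * width + h] = 1.0
    return v / np.sqrt(k)  # normalize so the inner product is the match rate

# The expanded vectors can be fed to any linear solver, e.g.:
#   X = np.vstack([expand(sig, b) for sig in signatures])
#   sklearn.svm.LinearSVC().fit(X, labels)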
Improved Densification of One Permutation Hashing
The existing work on densification of one permutation hashing reduces the
query processing cost of the (K, L)-parameterized Locality Sensitive Hashing
(LSH) algorithm with minwise hashing from O(dKL) to merely O(d + KL),
where d is the number of nonzeros of the data vector, K is the number of
hashes in each hash table, and L is the number of hash tables. While that is
a substantial improvement, our analysis reveals that the existing densification
scheme is sub-optimal. In particular, there is not enough randomness in that
procedure, which hurts its accuracy on very sparse datasets.
In this paper, we provide a new densification procedure which is provably
better than the existing scheme. This improvement is more significant for very
sparse datasets, which are common over the web. The improved technique has the
same cost of O(d + KL) for query processing, thereby making it strictly
preferable over the existing procedure. Experimental evaluations on public
datasets, in the task of hashing-based near neighbor search, support our
theoretical findings.
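A simplified sketch of the overall idea (our illustration only: the actual schemes also add bin-dependent offsets so that values borrowed by different bins remain distinguishable): one hashing pass fills K bins with minima, and each empty bin borrows from the nearest non-empty bin in a direction chosen by an independent random bit, which is the extra randomness this line of work is about.

import random

def oph_densified(items, K, D, seed=0):
    # One permutation hashing with randomized densification (sketch).
    # Assumes items is a non-empty subset of {0, ..., D-1}.
    rng = random.Random(seed)
    a, c = rng.randrange(1, D), rng.randrange(D)  # stand-in for one permutation
    bins = [None] * K
    for x in items:
        h = (a * x + c) % D
        j = h * K // D                            # which of the K bins h falls in
        if bins[j] is None or h < bins[j]:
            bins[j] = h                           # single pass: minimum per bin
    out = list(bins)
    for j in range(K):
        if out[j] is None:                        # densify an empty bin
            step = 1 if rng.randrange(2) else -1  # independent random direction
            t = j + step
            while bins[t % K] is None:            # nearest non-empty bin, circular
                t += step
            out[j] = bins[t % K]
    return out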
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with
HPC), a similarity search system for ultra-high dimensional datasets on a
single machine that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging an LSH-style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count-based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset using brute-force (O(n^2) distance computations) would require at least
20 teraflops. We provide CPU and GPU implementations of FLASH for replicability
of our results.
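Stripped of all HPC engineering, the indexing idea can be sketched in a few lines (purely illustrative; hash_fns stands for any LSH family, e.g. densified one permutation hashes, and the reservoir size is an assumed parameter): each bucket keeps only a bounded reservoir sample of the items hashing into it, and queries are answered by counting collisions across the tables rather than by computing similarities.

import random
from collections import Counter

RESERVOIR = 32                                    # illustrative bucket capacity

def insert(tables, hash_fns, item_id, item, rng):
    # Setup elsewhere: tables = [dict() for _ in hash_fns]; rng = random.Random(0)
    for table, h in zip(tables, hash_fns):        # one hash table per function
        entry = table.setdefault(h(item), [0, []])
        entry[0] += 1                             # how many items hit this bucket
        n, res = entry
        if len(res) < RESERVOIR:
            res.append(item_id)
        elif rng.randrange(n) < RESERVOIR:        # classic reservoir sampling
            res[rng.randrange(RESERVOIR)] = item_id

def query(tables, hash_fns, q, k):
    votes = Counter()
    for table, h in zip(tables, hash_fns):
        votes.update(table.get(h(q), [0, []])[1])
    return [i for i, _ in votes.most_common(k)]   # count-based ranking, no distances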
Coding for Random Projections
The method of random projections has become very popular for large-scale
applications in statistical learning, information retrieval, bio-informatics,
and other areas. Using a well-designed coding scheme for the projected
data, which determines the number of bits needed for each projected value and
how to allocate these bits, can significantly improve the effectiveness of the
algorithm, in storage cost as well as computational speed. In this paper, we
study a number of simple coding schemes, focusing on the task of similarity
estimation and on an application to training linear classifiers. We demonstrate
that uniform quantization outperforms the influential standard method
(Datar et al., 2004). Indeed, we argue that in many cases coding with just a
small number of bits suffices. Furthermore, we also develop a non-uniform 2-bit
coding scheme that generally performs well in practice, as confirmed by our
experiments on training linear support vector machines (SVM).
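A minimal sketch of uniformly quantized random projections (the window width w, the 2-bit budget, and all names are our illustrative assumptions, not the paper's prescriptions):

import numpy as np

def quantized_projections(X, k, w, n_bits=2, seed=0):
    # Project the rows of X with a Gaussian random matrix, then quantize each
    # projected value uniformly with window width w, clipped to 2**n_bits levels.
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k))
    Z = X @ R                                     # standard random projections
    levels = 1 << n_bits
    codes = np.floor(Z / w).astype(int)           # uniform quantization
    return np.clip(codes, -(levels // 2), levels // 2 - 1)

The fraction of coordinates on which two codes agree grows with the similarity of the underlying vectors, so the short codes can stand in for the full projected values, both for similarity estimation and, after a one-hot expansion like the one used for b-bit hashing, for training linear classifiers.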