Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)
We generated a dataset of 200 GB with 10^9 features, to test our recent b-bit
minwise hashing algorithms for training very large-scale logistic regression
and SVM. The results confirm our prior work that, compared with the VW hashing
algorithm (which has the same variance as random projections), b-bit minwise
hashing is substantially more accurate at the same storage. For example, with
merely 30 hashed values per data point, b-bit minwise hashing can achieve
accuracies similar to those of VW with 2^14 hashed values per data point.
We demonstrate that the preprocessing cost of b-bit minwise hashing is
roughly on the same order of magnitude as the data loading time. Furthermore,
by using a GPU, the preprocessing cost can be reduced to a small fraction of
the data loading time.
Minwise hashing has been widely used in industry, at least in the context of
search. One reason for its popularity is that one can efficiently simulate
permutations with, e.g., universal hashing. In other words, there is no need to
store the permutation matrix. In this paper, we empirically verify this
practice, by demonstrating that even using the simplest 2-universal hashing
does not degrade the learning performance.
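
As a rough illustration of the idea in this abstract, the sketch below computes a b-bit minwise signature for one data point, using a 2-universal hash of the form (a*x + c) mod p in place of a stored permutation. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice of prime, and the representation of a data point as a non-empty set of integer feature indices are all hypothetical.

```python
import random

# Assumption: a Mersenne prime larger than any feature index (the abstract
# mentions 10^9 features). Because a != 0, the map is injective on [0, PRIME).
PRIME = (1 << 61) - 1


def make_2u_hashes(k, seed=0):
    """Draw k (a, c) pairs, each defining h(x) = (a*x + c) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def bbit_minwise(feature_ids, hashes, b):
    """Return k values: the lowest b bits of the minimum hash under each function.

    feature_ids must be a non-empty collection of non-negative integers.
    """
    mask = (1 << b) - 1
    signature = []
    for a, c in hashes:
        # The minimum hashed value stands in for the first element under a
        # random permutation of the feature indices.
        min_hash = min((a * x + c) % PRIME for x in feature_ids)
        signature.append(min_hash & mask)  # keep only the lowest b bits
    return signature
```

With k = 30 and b = 8, for example, `bbit_minwise({1, 5, 9, 1000}, make_2u_hashes(30), b=8)` returns 30 integers in the range 0 to 255, which is the kind of compact per-point representation the abstract refers to.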
b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions
In this paper, we study several critical issues which must be tackled before
one can apply b-bit minwise hashing to the volumes of data often used
in industrial applications, especially in the context of search.
1. (b-bit) Minwise hashing requires an expensive preprocessing step that
computes k (e.g., 500) minimal values after applying the corresponding
permutations for each data vector. We developed a parallelization scheme using
GPUs and observed that the preprocessing time can be reduced by a factor of
20-80 and becomes substantially smaller than the data loading time.
2. One major advantage of b-bit minwise hashing is that it can substantially
reduce the amount of memory required for batch learning. However, as online
algorithms become increasingly popular for large-scale learning in the context
of search, it is not clear if b-bit minwise yields significant improvements for
them. This paper demonstrates that b-bit minwise hashing provides an
effective data size/dimension reduction scheme and hence it can dramatically
reduce the data loading time for each epoch of the online training process.
This is significant because online learning often requires many (e.g., 10 to
100) epochs to reach a sufficient accuracy.
3. Another critical issue is that for very large data sets it becomes
impossible to store a (fully) random permutation matrix, due to its space
requirements. Our paper is the first study to demonstrate that b-bit minwise
hashing implemented using simple hash functions, e.g., the 2-universal (2U) and
4-universal (4U) hash families, can produce learning results very similar to
those obtained with fully random permutations. Experiments on datasets of up to
200GB are presented.
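
The size reduction mentioned in item 2 can be made concrete with a short sketch. The abstract does not describe how hashed values are fed to a linear learner; the code below assumes the common scheme in which each of the k b-bit values is expanded into a one-hot block of length 2^b and the blocks are concatenated, so that a data point has exactly k nonzero entries in a vector of dimension k * 2^b. The function names are hypothetical.

```python
def expand_signature(signature, b):
    """Map k b-bit values to the indices of the k nonzero entries.

    Assumed scheme: value v at position j becomes index j * 2^b + v in a
    sparse binary vector of dimension len(signature) * 2^b.
    """
    width = 1 << b
    return [j * width + v for j, v in enumerate(signature)]


def storage_bits(k, b):
    """Bits needed to store one hashed data point: k values of b bits each."""
    return k * b
```

For instance, `expand_signature([3, 0, 255], b=8)` gives `[3, 256, 767]`, and with k = 500 and b = 8 each data point occupies `storage_bits(500, 8)` = 4000 bits, or 500 bytes, regardless of how many original features it had. This fixed, small footprint is what would shorten the per-epoch loading time in online training.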