14 research outputs found
Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)
Split-Merge MCMC (Monte Carlo Markov Chain) is one of the essential and
popular variants of MCMC for problems when an MCMC state consists of an unknown
number of components. It is well known that state-of-the-art methods for
split-merge MCMC do not scale well. Strategies for rapid mixing requires smart
and informative proposals to reduce the rejection rate. However, all known
smart proposals involve expensive operations to suggest informative
transitions. As a result, the cost of each iteration is prohibitive for massive
scale datasets. It is further known that uninformative but computationally
efficient proposals, such as random split-merge, leads to extremely slow
convergence. This tradeoff between mixing time and per update cost seems hard
to get around.
In this paper, we show a sweet spot. We leverage some unique properties of
weighted MinHash, which is a popular LSH, to design a novel class of
split-merge proposals which are significantly more informative than random
sampling but at the same time efficient to compute. Overall, we obtain a
superior tradeoff between convergence and per update cost. As a direct
consequence, our proposals are around 6X faster than the state-of-the-art
sampling methods on two large real datasets KDDCUP and PubMed with several
millions of entities and thousands of clusters
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for
\textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging a LSH style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset, using brute-force (), will require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results