666 research outputs found
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for
\textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging a LSH style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset, using brute-force (), will require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results
HDIdx: High-Dimensional Indexing for Efficient Approximate Nearest Neighbor Search
Fast Nearest Neighbor (NN) search is a fundamental challenge in large-scale
data processing and analytics, particularly for analyzing multimedia contents
which are often of high dimensionality. Instead of using exact NN search,
extensive research efforts have been focusing on approximate NN search
algorithms. In this work, we present "HDIdx", an efficient high-dimensional
indexing library for fast approximate NN search, which is open-source and
written in Python. It offers a family of state-of-the-art algorithms that
convert input high-dimensional vectors into compact binary codes, making them
very efficient and scalable for NN search with very low space complexity
- …