Hashing-Based-Estimators for Kernel Density in High Dimensions
Given a set of points $P \subset \mathbb{R}^d$ and a kernel function $k(x, y)$, the Kernel
Density Estimate at a point $x \in \mathbb{R}^d$ is defined as
$\mathrm{KDE}_P(x) = \frac{1}{|P|} \sum_{p \in P} k(x, p)$. We study the problem
of designing a data structure that given a data set and a kernel function,
returns *approximations to the kernel density* of a query point in *sublinear
time*. We introduce a class of unbiased estimators for kernel density
implemented through locality-sensitive hashing, and give general theorems
bounding the variance of such estimators. These estimators give rise to
efficient data structures for estimating the kernel density in high dimensions
for a variety of commonly used kernels. Our work is the first to provide
data-structures with theoretical guarantees that improve upon simple random
sampling in high dimensions.
Comment: A preliminary version of this paper appeared in FOCS 2017.
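The core idea of a hashing-based estimator can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses signed random projections (whose collision probability has the known closed form $(1 - \theta/\pi)^K$) as the locality-sensitive family, and the function and variable names (`hbe_estimate`, `srp_codes`, etc.) are ours. Sampling a uniform point from the query's hash bucket and dividing the kernel value by the collision probability makes each single-table estimate unbiased.

```python
import numpy as np

def srp_codes(points, planes):
    """K-bit signed-random-projection hash code for each row of `points`."""
    bits = (points @ planes.T) >= 0
    return [tuple(row) for row in bits]

def srp_collision_prob(x, y, K):
    """Pr[K-bit SRP codes of x and y collide] = (1 - theta/pi)^K."""
    cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return (1.0 - theta / np.pi) ** K

def hbe_estimate(data, query, kernel, K=4, reps=200, seed=0):
    """Average of `reps` independent unbiased hashing-based estimates."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    total = 0.0
    for _ in range(reps):
        planes = rng.standard_normal((K, d))
        codes = srp_codes(data, planes)
        qcode = srp_codes(query[None, :], planes)[0]
        bucket = [i for i, c in enumerate(codes) if c == qcode]
        if bucket:
            i = int(rng.choice(bucket))
            p = srp_collision_prob(data[i], query, K)
            # importance-sampling correction: each term has expectation
            # equal to the true kernel density
            total += (len(bucket) / n) * kernel(data[i], query) / p
    return total / reps
```

Averaging many such estimates drives the variance down; the paper's contribution is bounding that variance for common kernels, which this toy sketch does not attempt.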
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for
\textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search
system for ultra-high dimensional datasets on a single machine that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging an LSH-style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count-based estimation, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URLs, click-through prediction, and social
networks. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset, using brute-force computation, would require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results.
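One of the principled techniques the abstract names, reservoir sampling, bounds the size of each hash bucket while keeping its contents a uniform sample of everything hashed into it. The sketch below shows only that primitive (Vitter's Algorithm R); FLASH's actual per-bucket data structures and parallelization are more involved.

```python
import random

class Reservoir:
    """Fixed-size uniform random sample of a stream (Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, item):
        """Offer one stream element; after `seen` offers, each element
        is in the reservoir with probability capacity / seen."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item
```

Capping buckets this way is what lets an LSH index over millions of dimensions stay memory-bounded without biasing the candidates it returns.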
Coding for Random Projections
The method of random projections has become very popular for large-scale
applications in statistical learning, information retrieval, bio-informatics
and other applications. Using a well-designed coding scheme for the projected
data, which determines the number of bits needed for each projected value and
how to allocate these bits, can significantly improve the effectiveness of the
algorithm, in storage cost as well as computational speed. In this paper, we
study a number of simple coding schemes, focusing on the task of similarity
estimation and on an application to training linear classifiers. We demonstrate
that uniform quantization outperforms the influential standard method
(Datar et al., 2004). Indeed, we argue that in many cases coding with just a
small number of bits suffices. Furthermore, we also develop a non-uniform 2-bit
coding scheme that generally performs well in practice, as confirmed by our
experiments on training linear support vector machines (SVMs).
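The uniform-quantization idea can be sketched concretely. The snippet below is an illustration under our own assumptions, not the paper's tuned scheme: Gaussian random projections are bucketed into $2^b$ uniform bins, and the fraction of projections on which two codes agree serves as a cheap similarity proxy. The bin width, bit depth, and function names are illustrative choices.

```python
import numpy as np

def quantized_projections(X, n_proj=256, bin_width=1.0, n_bits=2, seed=7):
    """Gaussian random projections followed by uniform quantization.

    Each projected value is mapped to one of 2**n_bits bins of width
    `bin_width`; values beyond the outer bins are clipped.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], n_proj))
    Z = X @ R
    levels = 2 ** n_bits
    codes = np.floor(Z / bin_width) + levels // 2
    return np.clip(codes, 0, levels - 1).astype(np.int8)

def code_agreement(cx, cy):
    """Fraction of projections on which two codes agree; more similar
    inputs agree on more projections."""
    return float(np.mean(cx == cy))
```

With 2 bits per projection the stored sketch is 16x smaller than the raw 32-bit projected values, which is the storage/speed trade-off the abstract is about.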
Multi-Resolution Hashing for Fast Pairwise Summations
A basic computational primitive in the analysis of massive datasets is
summing simple functions over a large number of objects. Modern applications
pose an additional challenge in that such functions often depend on a parameter
vector (query) that is unknown a priori. Given a set of points $P \subset \mathbb{R}^d$
and a pairwise function $w(x, y)$, we study the problem of designing a data structure
that enables sublinear-time approximation of the summation
$Z_w(q) = \sum_{p \in P} w(p, q)$
for any query $q$. By combining ideas from Harmonic Analysis (partitions of unity
and approximation theory) with Hashing-Based-Estimators [Charikar, Siminelakis
FOCS'17], we provide a general framework for designing such data structures
through hashing that reaches far beyond what previous techniques allowed.
A key design principle is a collection of hashing schemes whose collision
probabilities together closely track the target function. This leads to a
data structure that approximates the summation using a sub-linear number of samples from each
hash family. Using this new framework along with Distance Sensitive Hashing
[Aumuller, Christiani, Pagh, Silvestri PODS'18], we show that such a collection
can be constructed and evaluated efficiently for any log-convex function
$f(\langle x, y \rangle)$ of the inner product on the unit sphere.
Our method leads to data structures with sub-linear query time that
significantly improve upon random sampling and can be used for Kernel Density
or Partition Function Estimation. We provide extensions of our result from the
sphere to $\mathbb{R}^d$ and from scalar functions to vector functions.
Comment: 39 pages, 3 figures.
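The single-family hashing-based estimator that this framework generalizes can be written out as a short derivation. The notation below is ours, a sketch of the mechanism rather than the paper's multi-resolution construction:

```latex
% Hash the data set P (|P| = n) with h drawn from a family H whose
% collision probability p(x, q) = Pr_h[h(x) = h(q)] is known in closed
% form. Let B(q) be the query's bucket and sample y uniformly from B(q).
% The estimator
\[
  Z(q) \;=\; \frac{|B(q)|}{n} \cdot \frac{w(y, q)}{p(y, q)}
\]
% is unbiased:
\[
  \mathbb{E}[Z(q)]
  \;=\; \frac{1}{n} \sum_{x \in P} \Pr[h(x) = h(q)] \cdot \frac{w(x, q)}{p(x, q)}
  \;=\; \frac{1}{n} \sum_{x \in P} w(x, q).
\]
% The variance is governed by how well p tracks w, which is why the
% multi-resolution construction seeks collision probabilities adapted
% to the target function at several scales.
```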
DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search
Kernel Density Estimation (KDE) is a nonparametric method for estimating the
shape of a density function, given a set of samples from the distribution.
Recently, locality-sensitive hashing, originally proposed as a tool for nearest
neighbor search, has been shown to enable fast KDE data structures. However,
these approaches do not take advantage of the many other advances that have
been made in nearest neighbor search algorithms. We present an
algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN)
where we apply Approximate Nearest Neighbor (ANN) algorithms as a black box
subroutine to compute an unbiased KDE. The idea is to find points that have a
large contribution to the KDE using ANN, compute their contribution exactly,
and approximate the remainder with Random Sampling (RS). We present a
theoretical argument that supports the idea that an ANN subroutine can speed up
the evaluation. Furthermore, we provide a C++ implementation with a Python
interface that can make use of an arbitrary ANN implementation as a subroutine
for KDE evaluation. We show empirically that our implementation outperforms
state-of-the-art implementations on all the high-dimensional datasets we
considered, and matches the performance of RS in cases where the ANN yields no
gains in performance.
Comment: 24 pages, 1 figure. Submitted for review.
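The split the abstract describes, exact kernel evaluation on near neighbors plus random sampling for the tail, can be sketched briefly. This is an illustration under our own assumptions, not the paper's C++ implementation: brute-force distances stand in for the ANN black box, the Gaussian-style kernel is an example choice, and all names are ours.

```python
import numpy as np

def deann_kde(data, query, bandwidth=1.0, k=20, m=100, seed=0):
    """KDE estimate in the spirit of DEANN: evaluate the kernel exactly
    on the k nearest neighbors and estimate the remaining contribution
    by uniform random sampling."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Brute-force distances stand in for the paper's ANN subroutine.
    dists = np.linalg.norm(data - query, axis=1)
    kern = lambda d: np.exp(-(d / bandwidth) ** 2)  # example kernel
    near = np.argsort(dists)[:k]
    rest = np.setdiff1d(np.arange(n), near)
    exact = kern(dists[near]).sum()
    if len(rest) == 0:
        return exact / n
    sample = rng.choice(rest, size=min(m, len(rest)), replace=False)
    # Scale the sample mean up to the size of the unsampled remainder.
    approx = len(rest) * kern(dists[sample]).mean()
    return (exact + approx) / n
```

The design rationale is variance reduction: the few points with large kernel values are handled exactly, so random sampling only has to estimate the flat, low-variance tail.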