Hashing-Based-Estimators for Kernel Density in High Dimensions
Given a set of points $P \subset \mathbb{R}^d$ and a kernel $k$, the Kernel
Density Estimate at a point $x \in \mathbb{R}^d$ is defined as
$\mathrm{KDE}_P(x) = \frac{1}{|P|} \sum_{y \in P} k(x, y)$. We study the problem
of designing a data structure that, given a data set and a kernel function,
returns *approximations to the kernel density* of a query point in *sublinear
time*. We introduce a class of unbiased estimators for kernel density
implemented through locality-sensitive hashing, and give general theorems
bounding the variance of such estimators. These estimators give rise to
efficient data structures for estimating the kernel density in high dimensions
for a variety of commonly used kernels. Our work is the first to provide
data-structures with theoretical guarantees that improve upon simple random
sampling in high dimensions.
Comment: A preliminary version of this paper appeared in FOCS 2017.
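The estimator family is easy to see in miniature. Below is a hedged sketch (not the paper's exact construction): hash the data with SimHash, whose collision probability $p(q, y) = (1 - \theta(q,y)/\pi)^K$ is known in closed form, then estimate the density from the query's bucket, reweighting each sampled collision by $1/p$ to cancel the bias. The kernel and all parameters here are assumptions chosen so the probabilities are exactly computable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 5000, 16, 4                 # K random hyperplanes = K hash bits

# Data and query on the unit sphere (an assumption, so that the SimHash
# collision probability has a closed form).
P = rng.normal(size=(n, d))
P /= np.linalg.norm(P, axis=1, keepdims=True)
q = rng.normal(size=d)
q /= np.linalg.norm(q)

def angle(a, B):
    return np.arccos(np.clip(B @ a, -1.0, 1.0))

def kernel(a, B):
    # Illustrative angular kernel; any kernel correlated with collisions works.
    return np.exp(-angle(a, B))

def hbe_estimate():
    """One unbiased hashing-based estimate of KDE_P(q) = mean_y k(q, y)."""
    planes = rng.normal(size=(K, d))
    keys = (P @ planes.T) > 0                      # n x K sign bits
    qkey = (planes @ q) > 0
    bucket = np.flatnonzero((keys == qkey).all(axis=1))
    if bucket.size == 0:
        return 0.0
    y = P[rng.choice(bucket)]                      # uniform point from q's bucket
    p = (1.0 - angle(q, y[None])[0] / np.pi) ** K  # exact SimHash collision prob
    # Reweighting by |bucket| / (n * p(q, y)) cancels the hashing bias exactly.
    return bucket.size * kernel(q, y[None])[0] / (n * p)

print("HBE estimate:", np.mean([hbe_estimate() for _ in range(400)]))
print("exact KDE   :", kernel(q, P).mean())
```

The estimator is unbiased for any LSH family with known collision probabilities; its variance, which the paper's theorems bound, depends on how closely $p(q, y)$ tracks $k(q, y)$. Plain random sampling is the degenerate case of a single all-colliding bucket.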
Angle Tree: Nearest Neighbor Search in High Dimensions with Low Intrinsic Dimensionality
We propose an extension of tree-based space-partitioning indexing structures
for data with low intrinsic dimensionality embedded in a high dimensional
space. We call this extension an Angle Tree. Our extension can be applied to
both classical kd-trees as well as the more recent rp-trees. The key idea of
our approach is to store the angle (the "dihedral angle") between the data
region (which is a low dimensional manifold) and the random hyperplane that
splits the region (the "splitter"). We show that the dihedral angle can be used
to obtain a tight lower bound on the distance between the query point and any
point on the opposite side of the splitter. This in turn can be used to
efficiently prune the search space. We introduce a novel randomized strategy to
efficiently calculate the dihedral angle with a high degree of accuracy.
Experiments and analysis on real and synthetic data sets show that the Angle
Tree is the most efficient known indexing structure for nearest neighbor
queries in terms of preprocessing and space usage while achieving high accuracy
and fast search time.
Comment: To be submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.
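For context, here is a minimal kd-tree nearest-neighbour search with the baseline pruning rule the Angle Tree tightens: a far subtree is visited only if the query's perpendicular distance to the splitter is below the best distance found so far. This is an illustrative baseline, not the paper's structure; the comment marks where the stored dihedral angle would produce the tighter bound.

```python
import numpy as np

class Node:
    def __init__(self, pts, depth=0, leaf_size=8):
        self.pts, self.left, self.right = pts, None, None
        if len(pts) > leaf_size:
            self.axis = depth % pts.shape[1]        # cycle through coordinates
            order = pts[:, self.axis].argsort()
            mid = len(pts) // 2
            self.split = pts[order[mid], self.axis]
            self.left = Node(pts[order[:mid]], depth + 1, leaf_size)
            self.right = Node(pts[order[mid:]], depth + 1, leaf_size)

def nearest(node, q, best=(np.inf, None)):
    if node.left is None:                           # leaf: scan points
        d = np.linalg.norm(node.pts - q, axis=1)
        i = d.argmin()
        return (d[i], node.pts[i]) if d[i] < best[0] else best
    near, far = ((node.left, node.right) if q[node.axis] <= node.split
                 else (node.right, node.left))
    best = nearest(near, q, best)
    # Baseline bound: perpendicular distance to the splitting hyperplane.
    # An Angle Tree stores the dihedral angle between the data manifold and
    # the splitter and derives a tighter lower bound here, pruning more.
    if abs(q[node.axis] - node.split) < best[0]:
        best = nearest(far, q, best)
    return best

pts = np.random.default_rng(1).normal(size=(1000, 5))
dist, point = nearest(Node(pts), np.zeros(5))
print(dist)
```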
Sparser Johnson-Lindenstrauss Transforms
We give two different and simple constructions for dimensionality reduction
in $\ell_2$ via linear mappings that are sparse: only an
$O(\varepsilon)$-fraction of entries in each column of our embedding matrices
are non-zero to achieve distortion $1 + \varepsilon$ with high probability, while
still achieving the asymptotically optimal number of rows. These are the first
constructions to provide subconstant sparsity for all values of parameters,
improving upon previous works of Achlioptas (JCSS 2003) and Dasgupta, Kumar,
and Sarl\'{o}s (STOC 2010). Such distributions can be used to speed up
applications where dimensionality reduction is used.
Comment: v6: journal version, minor changes, added Remark 23; v5: modified
abstract, fixed typos, added open problem section; v4: simplified section 4
by giving 1 analysis that covers both constructions; v3: proof of Theorem 25
in v2 was written incorrectly, now fixed; v2: Added another construction
achieving same upper bound, and added proof of near-tight lower bound for DKS
scheme.
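A minimal sketch of such a sparse embedding, under assumed constants: each column of the $m \times n$ matrix gets $s = O(\varepsilon^{-1} \log(1/\delta))$ nonzero entries of value $\pm 1/\sqrt{s}$ in distinct random rows, against $m = O(\varepsilon^{-2} \log(1/\delta))$ rows, so an $O(\varepsilon)$-fraction of each column is nonzero. The paper's exact distribution and constants differ; this only illustrates the shape of the construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_jl(n, eps, delta):
    """Sparse JL matrix: s nonzeros per column out of m rows, s/m = O(eps).
    The leading constants below are illustrative, not the paper's."""
    m = int(np.ceil(4 * np.log(1 / delta) / eps**2))
    s = max(1, int(np.ceil(2 * np.log(1 / delta) / eps)))
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=min(s, m), replace=False)  # distinct rows
        A[rows, j] = rng.choice([-1.0, 1.0], size=len(rows)) / np.sqrt(s)
    return A

n, eps, delta = 2000, 0.2, 0.01
A = sparse_jl(n, eps, delta)
x = rng.normal(size=n)
# Distortion of a single vector: should be on the order of eps.
print(abs(np.linalg.norm(A @ x) / np.linalg.norm(x) - 1.0))
```

Applying the map costs time proportional to the $sn$ nonzeros rather than $mn$, an $O(\varepsilon)$ fraction of the dense cost, which is where the speed-up for applications comes from.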
An Octree-Based Approach towards Efficient Variational Range Data Fusion
Volume-based reconstruction is usually expensive both in terms of memory
consumption and runtime. Especially for sparse geometric structures, volumetric
representations produce a huge computational overhead. We present an efficient
way to fuse range data via a variational Octree-based minimization approach by
taking the actual range data geometry into account. We transform the data into
Octree-based truncated signed distance fields and show how the optimization can
be conducted on the newly created structures. The main challenge is to uphold
speed and a low memory footprint without sacrificing the solutions' accuracy
during optimization. We explain how to dynamically adjust the optimizer's
geometric structure via joining/splitting of Octree nodes and how to define the
operators. We evaluate on various datasets and outline the suitability in terms
of performance and geometric accuracy.
Comment: BMVC 201
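For a concrete sense of the data being optimized, here is a hedged sketch of plain weighted TSDF fusion, the primitive underlying such methods. A flat dict keyed by voxel coordinates stands in for the paper's Octree: only voxels inside the truncation band are stored, which is the sparsity an Octree exploits hierarchically. The variational minimization itself, and all constants, are outside this sketch.

```python
import numpy as np

TRUNC = 0.1    # truncation band around the surface, in world units
VOXEL = 0.02   # voxel edge length

# Sparse TSDF: voxel index -> (tsdf value, accumulated weight). Memory
# scales with surface area rather than volume.
tsdf = {}

def integrate(points, sensor_origin):
    """Fuse one range scan (a point cloud) into the TSDF by weighted averaging."""
    for p in points:
        ray = p - sensor_origin
        dist = np.linalg.norm(ray)
        ray /= dist
        # March through the truncation band around the measured surface point.
        for t in np.arange(dist - TRUNC, dist + TRUNC, VOXEL):
            v = tuple(np.floor((sensor_origin + t * ray) / VOXEL).astype(int))
            sd = np.clip(dist - t, -TRUNC, TRUNC)    # signed distance estimate
            val, w = tsdf.get(v, (0.0, 0.0))
            tsdf[v] = ((val * w + sd) / (w + 1.0), w + 1.0)  # running average

# Toy usage: two noisy scans of a plane at z = 0.5.
rng = np.random.default_rng(0)
for _ in range(2):
    xy = rng.uniform(0, 1, size=(500, 2))
    scan = np.c_[xy, 0.5 + rng.normal(0, 0.005, 500)]
    integrate(scan, sensor_origin=np.array([0.5, 0.5, -1.0]))
print(len(tsdf), "voxels stored")
```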
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration.
Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track).
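To make the task concrete, here is the naive baseline such samplers improve upon (the names and toy database are assumptions for illustration): rejection sampling draws patterns exactly proportional to a quality measure under arbitrary constraints, which gives the flexibility and accuracy the abstract asks for, but its acceptance rate collapses on realistic data. Flexics keeps the guarantees while avoiding that collapse via the SAT-sampling machinery the abstract cites.

```python
import random

random.seed(0)

# Toy transaction database (an assumption for illustration).
transactions = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1, 3, 4}, {2, 4}]
items = sorted(set().union(*transactions))

def frequency(itemset):
    """Quality measure: relative support in the database."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def satisfies(itemset):
    """Constraint: anything pluggable, e.g. a minimum pattern size."""
    return len(itemset) >= 2

def rejection_sample(max_tries=100_000):
    """One pattern drawn proportional to frequency among patterns that
    satisfy the constraint: propose a uniformly random itemset, then
    accept it with probability equal to its frequency."""
    for _ in range(max_tries):
        candidate = frozenset(i for i in items if random.random() < 0.5)
        if candidate and satisfies(candidate) \
                and random.random() < frequency(candidate):
            return candidate
    raise RuntimeError("acceptance rate too low -- the issue Flexics avoids")

print(rejection_sample())
```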
Multi-Resolution Hashing for Fast Pairwise Summations
A basic computational primitive in the analysis of massive datasets is
summing simple functions over a large number of objects. Modern applications
pose an additional challenge in that such functions often depend on a parameter
vector (query) that is unknown a priori. Given a set of points $X \subset \mathbb{R}^d$ and a pairwise function $w : \mathbb{R}^d \times \mathbb{R}^d \to [0, 1]$, we study the problem of designing a data-structure
that enables sublinear-time approximation of the summation $Z_w(y) = \frac{1}{|X|} \sum_{x \in X} w(x, y)$
for any query $y \in \mathbb{R}^d$. By combining ideas from Harmonic Analysis (partitions of unity
and approximation theory) with Hashing-Based-Estimators [Charikar, Siminelakis
FOCS'17], we provide a general framework for designing such data structures
through hashing that reaches far beyond what previous techniques allowed.
A key design principle is a collection of $T \ge 1$ hashing schemes with
collision probabilities $p_1, \ldots, p_T$ such that $\sup_{t \in [T]} p_t(x, y) = \Theta(\sqrt{w(x, y)})$. This leads to a data-structure
that approximates $Z_w(y)$ using a sub-linear number of samples from each
hash family. Using this new framework along with Distance Sensitive Hashing
[Aumuller, Christiani, Pagh, Silvestri PODS'18], we show that such a collection
can be constructed and evaluated efficiently for any log-convex function
$w(x, y) = e^{\phi(\langle x, y \rangle)}$ of the inner product on the unit sphere
$x, y \in \mathcal{S}^{d-1}$.
Our method leads to data structures with sub-linear query time that
significantly improve upon random sampling and can be used for Kernel Density
or Partition Function Estimation. We provide extensions of our result from the
sphere to $\mathbb{R}^d$ and from scalar functions to vector functions.
Comment: 39 pages, 3 figures.
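The $\sqrt{w}$ design principle can be motivated with a small assumed experiment (plain importance sampling, not the paper's oblivious hashing): estimate $Z_w$ by keeping each term with probability $p(x)$ and averaging $w(x)/p(x)$. With a fixed expected sample budget, choosing $p \propto \sqrt{w}$ gives markedly smaller variance than uniform sampling once $w$ spans many orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Heavy-tailed weights: a few near-1 contributions among many tiny ones.
w = 10.0 ** rng.uniform(-8, 0, size=n)
Z = w.mean()

def estimate(p):
    """Unbiased: include x with probability p(x), contribute w(x)/p(x)."""
    keep = rng.random(n) < p
    return (w[keep] / p[keep]).sum() / n

def rel_std(p, trials=50):
    return np.std([estimate(p) for _ in range(trials)]) / Z

budget = 2000                                   # expected number of samples
p_uniform = np.full(n, budget / n)
p_sqrt = np.minimum(1.0, np.sqrt(w) / np.sqrt(w).sum() * budget)
print("uniform  rel. std:", rel_std(p_uniform))
print("sqrt(w)  rel. std:", rel_std(p_sqrt))
```

The catch, and the reason hashing is needed, is that the query is unknown a priori, so $p$ cannot be tuned per query in advance; the collection of hash families is what realizes a $\sqrt{w}$-like sampling profile obliviously.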
A Fair and Memory/Time-efficient Hashmap
There is a large amount of work constructing hashmaps to minimize the number
of collisions. However, to the best of our knowledge no known hashing technique
guarantees group fairness among different groups of items. We are given a set
$S$ of $n$ tuples in $\mathbb{R}^d$, for a constant dimension $d$, and a set of
$k$ groups $\{g_1, \ldots, g_k\}$ such that every
tuple belongs to a unique group. We formally define the fair hashing problem,
introducing the notions of single fairness
($\Pr[h(x) = h(z) \mid g(x) = g_i,\, x \neq z]$ for every group $g_i$), pairwise fairness
($\Pr[h(x) = h(z) \mid g(x) = g_i,\, g(z) = g_j,\, x \neq z]$ for every pair $g_i, g_j$), and the
well-known collision probability ($\Pr[h(x) = h(z) \mid x \neq z]$). The goal is to
construct a hashmap such that the collision probability, the single fairness,
and the pairwise fairness are close to $1/m$, where $m$ is the number of
buckets in the hashmap.
We propose two families of algorithms to design fair hashmaps. First, we
focus on hashmaps with optimum memory consumption, minimizing the unfairness. We
model the input tuples as points in $\mathbb{R}^d$, and the goal is to find a
vector $v$ such that the projection of $S$ onto $v$ creates an ordering that is
convenient to split to create a fair hashmap. For each projection, we design
efficient algorithms that find near-optimum partitions of exactly (or at most)
$m$ buckets. Second, we focus on hashmaps with optimum fairness
($0$-unfairness), minimizing the memory consumption. We make the important
observation that the fair hashmap problem is reduced to the necklace splitting
problem. By carefully implementing algorithms for solving the necklace
splitting problem, we propose faster algorithms constructing hashmaps with
$0$-unfairness using … boundary points when … and … boundary points for …
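A hedged sketch of the first family's central step, with all identifiers assumed: project the tuples onto a direction $v$, sort, cut the ordering into $m$ contiguous equal buckets, and then check how far the collision probability, overall and restricted to each group, lands from the $1/m$ target. Choosing $v$ and the near-optimum cut points is the paper's actual contribution; this shows only the projection-and-split skeleton and the evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 3000, 4, 10
points = rng.normal(size=(n, d))
groups = rng.integers(0, 3, size=n)            # 3 groups, assumed labels

def hashmap_from_projection(points, v, m):
    """Bucket tuples by sorting their projections onto v and cutting the
    ordering into m equal-size contiguous ranges."""
    order = np.argsort(points @ v)
    buckets = np.empty(len(points), dtype=int)
    buckets[order] = np.arange(len(points)) * m // len(points)
    return buckets

def collision_prob(buckets):
    """Pr[h(x) = h(z) | x != z] over a uniform random distinct pair."""
    counts = np.bincount(buckets, minlength=m)
    pairs = (counts * (counts - 1)).sum()
    return pairs / (len(buckets) * (len(buckets) - 1))

v = rng.normal(size=d)
v /= np.linalg.norm(v)
buckets = hashmap_from_projection(points, v, m)
print("target 1/m     :", 1 / m)
print("collision prob :", collision_prob(buckets))
for g in np.unique(groups):                    # within-group collision rate
    print(f"group {g}      :", collision_prob(buckets[groups == g]))
```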