Hashing-Based-Estimators for Kernel Density in High Dimensions
Given a set of points $P \subset \mathbb{R}^d$ and a kernel $k$, the Kernel
Density Estimate at a point $x \in \mathbb{R}^d$ is defined as
$\mathrm{KDE}_P(x) = \frac{1}{|P|} \sum_{y \in P} k(x, y)$. We study the problem
of designing a data structure that, given a data set and a kernel function,
returns *approximations to the kernel density* of a query point in *sublinear
time*. We introduce a class of unbiased estimators for kernel density
implemented through locality-sensitive hashing, and give general theorems
bounding the variance of such estimators. These estimators give rise to
efficient data structures for estimating the kernel density in high dimensions
for a variety of commonly used kernels. Our work is the first to provide
data-structures with theoretical guarantees that improve upon simple random
sampling in high dimensions.
Comment: A preliminary version of this paper appeared in FOCS 2017.
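The estimator family is easy to see in miniature. Below is a hedged sketch (not the paper's exact construction): hash the data with SimHash, whose collision probability $p(q, y) = (1 - \theta(q,y)/\pi)^K$ is known in closed form, then estimate the density from the query's bucket, reweighting each sampled collision by $1/p$ to cancel the bias. The kernel and all parameters here are assumptions chosen so the probabilities are exactly computable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 5000, 16, 4                 # K random hyperplanes = K hash bits

# Data and query on the unit sphere (an assumption, so that the SimHash
# collision probability has a closed form).
P = rng.normal(size=(n, d))
P /= np.linalg.norm(P, axis=1, keepdims=True)
q = rng.normal(size=d)
q /= np.linalg.norm(q)

def angle(a, B):
    return np.arccos(np.clip(B @ a, -1.0, 1.0))

def kernel(a, B):
    # Illustrative angular kernel; any kernel correlated with collisions works.
    return np.exp(-angle(a, B))

def hbe_estimate():
    """One unbiased hashing-based estimate of KDE_P(q) = mean_y k(q, y)."""
    planes = rng.normal(size=(K, d))
    keys = (P @ planes.T) > 0                      # n x K sign bits
    qkey = (planes @ q) > 0
    bucket = np.flatnonzero((keys == qkey).all(axis=1))
    if bucket.size == 0:
        return 0.0
    y = P[rng.choice(bucket)]                      # uniform point from q's bucket
    p = (1.0 - angle(q, y[None])[0] / np.pi) ** K  # exact SimHash collision prob
    # Reweighting by |bucket| / (n * p(q, y)) cancels the hashing bias exactly.
    return bucket.size * kernel(q, y[None])[0] / (n * p)

print("HBE estimate:", np.mean([hbe_estimate() for _ in range(400)]))
print("exact KDE   :", kernel(q, P).mean())
```

The estimator is unbiased for any LSH family with known collision probabilities; its variance, which the paper's theorems bound, depends on how closely $p(q, y)$ tracks $k(q, y)$. Plain random sampling is the degenerate case of a single all-colliding bucket.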
Angle Tree: Nearest Neighbor Search in High Dimensions with Low Intrinsic Dimensionality
We propose an extension of tree-based space-partitioning indexing structures
for data with low intrinsic dimensionality embedded in a high dimensional
space. We call this extension an Angle Tree. Our extension can be applied to
both classical kd-trees as well as the more recent rp-trees. The key idea of
our approach is to store the angle (the "dihedral angle") between the data
region (which is a low dimensional manifold) and the random hyperplane that
splits the region (the "splitter"). We show that the dihedral angle can be used
to obtain a tight lower bound on the distance between the query point and any
point on the opposite side of the splitter. This in turn can be used to
efficiently prune the search space. We introduce a novel randomized strategy to
efficiently calculate the dihedral angle with a high degree of accuracy.
Experiments and analysis on real and synthetic data sets show that the Angle
Tree is the most efficient known indexing structure for nearest neighbor
queries in terms of preprocessing and space usage while achieving high accuracy
and fast search time.
Comment: To be submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.
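For context, here is a minimal kd-tree nearest-neighbour search with the baseline pruning rule the Angle Tree tightens: a far subtree is visited only if the query's perpendicular distance to the splitter is below the best distance found so far. This is an illustrative baseline, not the paper's structure; the comment marks where the stored dihedral angle would produce the tighter bound.

```python
import numpy as np

class Node:
    def __init__(self, pts, depth=0, leaf_size=8):
        self.pts, self.left, self.right = pts, None, None
        if len(pts) > leaf_size:
            self.axis = depth % pts.shape[1]        # cycle through coordinates
            order = pts[:, self.axis].argsort()
            mid = len(pts) // 2
            self.split = pts[order[mid], self.axis]
            self.left = Node(pts[order[:mid]], depth + 1, leaf_size)
            self.right = Node(pts[order[mid:]], depth + 1, leaf_size)

def nearest(node, q, best=(np.inf, None)):
    if node.left is None:                           # leaf: scan points
        d = np.linalg.norm(node.pts - q, axis=1)
        i = d.argmin()
        return (d[i], node.pts[i]) if d[i] < best[0] else best
    near, far = ((node.left, node.right) if q[node.axis] <= node.split
                 else (node.right, node.left))
    best = nearest(near, q, best)
    # Baseline bound: perpendicular distance to the splitting hyperplane.
    # An Angle Tree stores the dihedral angle between the data manifold and
    # the splitter and derives a tighter lower bound here, pruning more.
    if abs(q[node.axis] - node.split) < best[0]:
        best = nearest(far, q, best)
    return best

pts = np.random.default_rng(1).normal(size=(1000, 5))
dist, point = nearest(Node(pts), np.zeros(5))
print(dist)
```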
Sparser Johnson-Lindenstrauss Transforms
We give two different and simple constructions for dimensionality reduction
in $\ell_2$ via linear mappings that are sparse: only an
$O(\varepsilon)$-fraction of entries in each column of our embedding matrices
are non-zero to achieve distortion $1 + \varepsilon$ with high probability, while
still achieving the asymptotically optimal number of rows. These are the first
constructions to provide subconstant sparsity for all values of parameters,
improving upon previous works of Achlioptas (JCSS 2003) and Dasgupta, Kumar,
and Sarl\'{o}s (STOC 2010). Such distributions can be used to speed up
applications where dimensionality reduction is used.
Comment: v6: journal version, minor changes, added Remark 23; v5: modified
abstract, fixed typos, added open problem section; v4: simplified section 4
by giving 1 analysis that covers both constructions; v3: proof of Theorem 25
in v2 was written incorrectly, now fixed; v2: Added another construction
achieving same upper bound, and added proof of near-tight lower bound for DKS
scheme.
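A minimal sketch of such a sparse embedding, under assumed constants: each column of the $m \times n$ matrix gets $s = O(\varepsilon^{-1} \log(1/\delta))$ nonzero entries of value $\pm 1/\sqrt{s}$ in distinct random rows, against $m = O(\varepsilon^{-2} \log(1/\delta))$ rows, so an $O(\varepsilon)$-fraction of each column is nonzero. The paper's exact distribution and constants differ; this only illustrates the shape of the construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_jl(n, eps, delta):
    """Sparse JL matrix: s nonzeros per column out of m rows, s/m = O(eps).
    The leading constants below are illustrative, not the paper's."""
    m = int(np.ceil(4 * np.log(1 / delta) / eps**2))
    s = max(1, int(np.ceil(2 * np.log(1 / delta) / eps)))
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=min(s, m), replace=False)  # distinct rows
        A[rows, j] = rng.choice([-1.0, 1.0], size=len(rows)) / np.sqrt(s)
    return A

n, eps, delta = 2000, 0.2, 0.01
A = sparse_jl(n, eps, delta)
x = rng.normal(size=n)
# Distortion of a single vector: should be on the order of eps.
print(abs(np.linalg.norm(A @ x) / np.linalg.norm(x) - 1.0))
```

Applying the map costs time proportional to the $sn$ nonzeros rather than $mn$, an $O(\varepsilon)$ fraction of the dense cost, which is where the speed-up for applications comes from.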
An Octree-Based Approach towards Efficient Variational Range Data Fusion
Volume-based reconstruction is usually expensive both in terms of memory
consumption and runtime. Especially for sparse geometric structures, volumetric
representations produce a huge computational overhead. We present an efficient
way to fuse range data via a variational Octree-based minimization approach by
taking the actual range data geometry into account. We transform the data into
Octree-based truncated signed distance fields and show how the optimization can
be conducted on the newly created structures. The main challenge is to uphold
speed and a low memory footprint without sacrificing the solutions' accuracy
during optimization. We explain how to dynamically adjust the optimizer's
geometric structure via joining/splitting of Octree nodes and how to define the
operators. We evaluate on various datasets and outline the suitability in terms
of performance and geometric accuracy.
Comment: BMVC 201
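For a concrete sense of the data being optimized, here is a hedged sketch of plain weighted TSDF fusion, the primitive underlying such methods. A flat dict keyed by voxel coordinates stands in for the paper's Octree: only voxels inside the truncation band are stored, which is the sparsity an Octree exploits hierarchically. The variational minimization itself, and all constants, are outside this sketch.

```python
import numpy as np

TRUNC = 0.1    # truncation band around the surface, in world units
VOXEL = 0.02   # voxel edge length

# Sparse TSDF: voxel index -> (tsdf value, accumulated weight). Memory
# scales with surface area rather than volume.
tsdf = {}

def integrate(points, sensor_origin):
    """Fuse one range scan (a point cloud) into the TSDF by weighted averaging."""
    for p in points:
        ray = p - sensor_origin
        dist = np.linalg.norm(ray)
        ray /= dist
        # March through the truncation band around the measured surface point.
        for t in np.arange(dist - TRUNC, dist + TRUNC, VOXEL):
            v = tuple(np.floor((sensor_origin + t * ray) / VOXEL).astype(int))
            sd = np.clip(dist - t, -TRUNC, TRUNC)    # signed distance estimate
            val, w = tsdf.get(v, (0.0, 0.0))
            tsdf[v] = ((val * w + sd) / (w + 1.0), w + 1.0)  # running average

# Toy usage: two noisy scans of a plane at z = 0.5.
rng = np.random.default_rng(0)
for _ in range(2):
    xy = rng.uniform(0, 1, size=(500, 2))
    scan = np.c_[xy, 0.5 + rng.normal(0, 0.005, 500)]
    integrate(scan, sensor_origin=np.array([0.5, 0.5, -1.0]))
print(len(tsdf), "voxels stored")
```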
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration.
Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track).
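To make the task concrete, here is the naive baseline such samplers improve upon (the names and toy database are assumptions for illustration): rejection sampling draws patterns exactly proportional to a quality measure under arbitrary constraints, which gives the flexibility and accuracy the abstract asks for, but its acceptance rate collapses on realistic data. Flexics keeps the guarantees while avoiding that collapse via the SAT-sampling machinery the abstract cites.

```python
import random

random.seed(0)

# Toy transaction database (an assumption for illustration).
transactions = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1, 3, 4}, {2, 4}]
items = sorted(set().union(*transactions))

def frequency(itemset):
    """Quality measure: relative support in the database."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def satisfies(itemset):
    """Constraint: anything pluggable, e.g. a minimum pattern size."""
    return len(itemset) >= 2

def rejection_sample(max_tries=100_000):
    """One pattern drawn proportional to frequency among patterns that
    satisfy the constraint: propose a uniformly random itemset, then
    accept it with probability equal to its frequency."""
    for _ in range(max_tries):
        candidate = frozenset(i for i in items if random.random() < 0.5)
        if candidate and satisfies(candidate) \
                and random.random() < frequency(candidate):
            return candidate
    raise RuntimeError("acceptance rate too low -- the issue Flexics avoids")

print(rejection_sample())
```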
Multi-Resolution Hashing for Fast Pairwise Summations
A basic computational primitive in the analysis of massive datasets is
summing simple functions over a large number of objects. Modern applications
pose an additional challenge in that such functions often depend on a parameter
vector (query) that is unknown a priori. Given a set of points $X \subset \mathbb{R}^d$ and a pairwise function $w : \mathbb{R}^d \times \mathbb{R}^d \to [0, 1]$, we study the problem of designing a data-structure
that enables sublinear-time approximation of the summation $Z_w(y) = \frac{1}{|X|} \sum_{x \in X} w(x, y)$
for any query $y \in \mathbb{R}^d$. By combining ideas from Harmonic Analysis (partitions of unity
and approximation theory) with Hashing-Based-Estimators [Charikar, Siminelakis
FOCS'17], we provide a general framework for designing such data structures
through hashing that reaches far beyond what previous techniques allowed.
A key design principle is a collection of $T \ge 1$ hashing schemes with
collision probabilities $p_1, \ldots, p_T$ such that $\sup_{t \in [T]} p_t(x, y) = \Theta(\sqrt{w(x, y)})$. This leads to a data-structure
that approximates $Z_w(y)$ using a sub-linear number of samples from each
hash family. Using this new framework along with Distance Sensitive Hashing
[Aumuller, Christiani, Pagh, Silvestri PODS'18], we show that such a collection
can be constructed and evaluated efficiently for any log-convex function
$w(x, y) = e^{\phi(\langle x, y \rangle)}$ of the inner product on the unit sphere
$x, y \in \mathcal{S}^{d-1}$.
Our method leads to data structures with sub-linear query time that
significantly improve upon random sampling and can be used for Kernel Density
or Partition Function Estimation. We provide extensions of our result from the
sphere to $\mathbb{R}^d$ and from scalar functions to vector functions.
Comment: 39 pages, 3 figures.
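The $\sqrt{w}$ design principle can be motivated with a small assumed experiment (plain importance sampling, not the paper's oblivious hashing): estimate $Z_w$ by keeping each term with probability $p(x)$ and averaging $w(x)/p(x)$. With a fixed expected sample budget, choosing $p \propto \sqrt{w}$ gives markedly smaller variance than uniform sampling once $w$ spans many orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Heavy-tailed weights: a few near-1 contributions among many tiny ones.
w = 10.0 ** rng.uniform(-8, 0, size=n)
Z = w.mean()

def estimate(p):
    """Unbiased: include x with probability p(x), contribute w(x)/p(x)."""
    keep = rng.random(n) < p
    return (w[keep] / p[keep]).sum() / n

def rel_std(p, trials=50):
    return np.std([estimate(p) for _ in range(trials)]) / Z

budget = 2000                                   # expected number of samples
p_uniform = np.full(n, budget / n)
p_sqrt = np.minimum(1.0, np.sqrt(w) / np.sqrt(w).sum() * budget)
print("uniform  rel. std:", rel_std(p_uniform))
print("sqrt(w)  rel. std:", rel_std(p_sqrt))
```

The catch, and the reason hashing is needed, is that the query is unknown a priori, so $p$ cannot be tuned per query in advance; the collection of hash families is what realizes a $\sqrt{w}$-like sampling profile obliviously.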
A Fair and Memory/Time-efficient Hashmap
There is a large amount of work constructing hashmaps to minimize the number
of collisions. However, to the best of our knowledge no known hashing technique
guarantees group fairness among different groups of items. We are given a set
$S$ of $n$ tuples in $\mathbb{R}^d$, for a constant dimension $d$, and a set of
$k$ groups $\{g_1, \ldots, g_k\}$ such that every
tuple belongs to a unique group. We formally define the fair hashing problem,
introducing the notions of single fairness
($\Pr[h(x) = h(z) \mid g(x) = g_i,\, x \neq z]$ for every group $g_i$), pairwise fairness
($\Pr[h(x) = h(z) \mid g(x) = g_i,\, g(z) = g_j,\, x \neq z]$ for every pair $g_i, g_j$), and the
well-known collision probability ($\Pr[h(x) = h(z) \mid x \neq z]$). The goal is to
construct a hashmap such that the collision probability, the single fairness,
and the pairwise fairness are close to $1/m$, where $m$ is the number of
buckets in the hashmap.
We propose two families of algorithms to design fair hashmaps. First, we
focus on hashmaps with optimum memory consumption, minimizing the unfairness. We
model the input tuples as points in $\mathbb{R}^d$, and the goal is to find a
vector $v$ such that the projection of $S$ onto $v$ creates an ordering that is
convenient to split to create a fair hashmap. For each projection, we design
efficient algorithms that find near-optimum partitions of exactly (or at most)
$m$ buckets. Second, we focus on hashmaps with optimum fairness
($0$-unfairness), minimizing the memory consumption. We make the important
observation that the fair hashmap problem is reduced to the necklace splitting
problem. By carefully implementing algorithms for solving the necklace
splitting problem, we propose faster algorithms constructing hashmaps with
$0$-unfairness using … boundary points when … and … boundary points for …
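A hedged sketch of the first family's central step, with all identifiers assumed: project the tuples onto a direction $v$, sort, cut the ordering into $m$ contiguous equal buckets, and then check how far the collision probability, overall and restricted to each group, lands from the $1/m$ target. Choosing $v$ and the near-optimum cut points is the paper's actual contribution; this shows only the projection-and-split skeleton and the evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 3000, 4, 10
points = rng.normal(size=(n, d))
groups = rng.integers(0, 3, size=n)            # 3 groups, assumed labels

def hashmap_from_projection(points, v, m):
    """Bucket tuples by sorting their projections onto v and cutting the
    ordering into m equal-size contiguous ranges."""
    order = np.argsort(points @ v)
    buckets = np.empty(len(points), dtype=int)
    buckets[order] = np.arange(len(points)) * m // len(points)
    return buckets

def collision_prob(buckets):
    """Pr[h(x) = h(z) | x != z] over a uniform random distinct pair."""
    counts = np.bincount(buckets, minlength=m)
    pairs = (counts * (counts - 1)).sum()
    return pairs / (len(buckets) * (len(buckets) - 1))

v = rng.normal(size=d)
v /= np.linalg.norm(v)
buckets = hashmap_from_projection(points, v, m)
print("target 1/m     :", 1 / m)
print("collision prob :", collision_prob(buckets))
for g in np.unique(groups):                    # within-group collision rate
    print(f"group {g}      :", collision_prob(buckets[groups == g]))
```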