2,884 research outputs found

    Hashing-Based-Estimators for Kernel Density in High Dimensions

    Given a set of points $P\subset \mathbb{R}^{d}$ and a kernel $k$, the Kernel Density Estimate at a point $x\in\mathbb{R}^{d}$ is defined as $\mathrm{KDE}_{P}(x)=\frac{1}{|P|}\sum_{y\in P} k(x,y)$. We study the problem of designing a data structure that, given a data set $P$ and a kernel function, returns *approximations to the kernel density* of a query point in *sublinear time*. We introduce a class of unbiased estimators for kernel density implemented through locality-sensitive hashing, and give general theorems bounding the variance of such estimators. These estimators give rise to efficient data structures for estimating the kernel density in high dimensions for a variety of commonly used kernels. Our work is the first to provide data structures with theoretical guarantees that improve upon simple random sampling in high dimensions. Comment: A preliminary version of this paper appeared in FOCS 2017.
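
    To make the estimator concrete, here is a minimal Python sketch of the hashing-based-estimator idea using a single random-hyperplane (SimHash) bit per table, whose collision probability has the closed form $1-\theta(x,y)/\pi$. The helper names (`simhash_tables`, `hbe_kde`) and the one-bit family are illustrative assumptions; the paper uses stronger LSH families and a variance analysis that this toy version does not reproduce.

```python
import numpy as np

def simhash_tables(X, n_tables, rng):
    """Hash every point once per table with a random-hyperplane (SimHash) bit."""
    planes = rng.standard_normal((n_tables, X.shape[1]))
    bits = (X @ planes.T) >= 0                 # (n, n_tables) one-bit bucket ids
    return planes, bits

def hbe_kde(X, q, k, n_tables=200, seed=0):
    """Unbiased hashing-based estimate of KDE_X(q) = (1/n) sum_y k(q, y).

    Per table: draw a uniform point y from the query's bucket and return
    (|bucket|/n) * k(q, y) / p(q, y); this is unbiased because y lands in
    the query's bucket with probability exactly p(q, y).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    planes, bits = simhash_tables(X, n_tables, rng)
    q_bits = (planes @ q) >= 0
    est = []
    for t in range(n_tables):
        bucket = np.flatnonzero(bits[:, t] == q_bits[t])
        if bucket.size == 0:
            est.append(0.0)                    # empty bucket contributes 0, still unbiased
            continue
        y = X[rng.choice(bucket)]              # uniform point from the query's bucket
        cos = np.clip(q @ y / (np.linalg.norm(q) * np.linalg.norm(y)), -1.0, 1.0)
        p = 1.0 - np.arccos(cos) / np.pi       # SimHash collision probability
        est.append((bucket.size / n) * k(q, y) / p)
    return float(np.mean(est))

# Usage: Gaussian kernel on unit-norm data
rng = np.random.default_rng(1)
X = rng.standard_normal((5000, 32))
X /= np.linalg.norm(X, axis=1, keepdims=True)
gauss = lambda a, b: np.exp(-np.linalg.norm(a - b) ** 2)
print(hbe_kde(X, X[0], gauss))
```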

    Angle Tree: Nearest Neighbor Search in High Dimensions with Low Intrinsic Dimensionality

    We propose an extension of tree-based space-partitioning indexing structures for data with low intrinsic dimensionality embedded in a high dimensional space. We call this extension an Angle Tree. Our extension can be applied to both classical kd-trees as well as the more recent rp-trees. The key idea of our approach is to store the angle (the "dihedral angle") between the data region (which is a low dimensional manifold) and the random hyperplane that splits the region (the "splitter"). We show that the dihedral angle can be used to obtain a tight lower bound on the distance between the query point and any point on the opposite side of the splitter. This in turn can be used to efficiently prune the search space. We introduce a novel randomized strategy to efficiently calculate the dihedral angle with a high degree of accuracy. Experiments and analysis on real and synthetic data sets show that the Angle Tree is the most efficient known indexing structure for nearest neighbor queries in terms of preprocessing and space usage while achieving high accuracy and fast search time. Comment: To be submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence.
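
    As an illustration of how the stored angle could drive pruning, here is a minimal sketch assuming a lower bound of the form margin/sin(theta), which is what the two-dimensional picture of a manifold crossing the splitter suggests (theta = pi/2 recovers the standard kd-tree bound); the function name and the exact form of the bound are assumptions, and the paper's randomized angle-estimation procedure is not reproduced here.

```python
import math

def prune_opposite_side(margin: float, dihedral_angle: float, best_dist: float) -> bool:
    """Return True if the subtree on the far side of the splitter can be skipped.

    margin         -- |signed distance| from the query to the splitting hyperplane
    dihedral_angle -- stored angle (radians) between the data manifold and the splitter
    best_dist      -- distance to the best neighbor found so far

    Any point on the opposite side is assumed to be at distance at least
    margin / sin(dihedral_angle) from the query, which is never weaker than
    the classical bound of margin alone.
    """
    sin_theta = math.sin(dihedral_angle)
    if sin_theta <= 0.0:          # degenerate angle: fall back to the classical bound
        return margin > best_dist
    return margin / sin_theta > best_dist
```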

    Sparser Johnson-Lindenstrauss Transforms

    We give two different and simple constructions for dimensionality reduction in $\ell_2$ via linear mappings that are sparse: only an $O(\varepsilon)$-fraction of entries in each column of our embedding matrices are non-zero to achieve distortion $1+\varepsilon$ with high probability, while still achieving the asymptotically optimal number of rows. These are the first constructions to provide subconstant sparsity for all values of parameters, improving upon previous works of Achlioptas (JCSS 2003) and Dasgupta, Kumar, and Sarlós (STOC 2010). Such distributions can be used to speed up applications where $\ell_2$ dimensionality reduction is used. Comment: v6: journal version, minor changes, added Remark 23; v5: modified abstract, fixed typos, added open problem section; v4: simplified section 4 by giving 1 analysis that covers both constructions; v3: proof of Theorem 25 in v2 was written incorrectly, now fixed; v2: Added another construction achieving same upper bound, and added proof of near-tight lower bound for DKS scheme.
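
    A minimal sketch of a sparse embedding matrix in this spirit: each column carries exactly s nonzero entries of magnitude 1/sqrt(s) with independent random signs, so applying the map touches only s rows per input coordinate. The fully random row choices and the name `sparse_jl_matrix` are assumptions for readability; the actual constructions are defined via limited-independence hash functions.

```python
import numpy as np

def sparse_jl_matrix(m: int, n: int, s: int, seed: int = 0) -> np.ndarray:
    """m x n embedding matrix with exactly s nonzeros per column.

    Each column gets s distinct rows chosen uniformly at random, filled
    with independent +-1/sqrt(s) signs, so each column has unit norm and
    E||Ax||^2 = ||x||^2.
    """
    rng = np.random.default_rng(seed)
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        A[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return A

# Usage: embed 1000-dimensional vectors into 256 dimensions with s = 8
# nonzeros per column; the ratio of norms should be close to 1.
A = sparse_jl_matrix(m=256, n=1000, s=8)
x = np.random.default_rng(1).standard_normal(1000)
print(np.linalg.norm(A @ x) / np.linalg.norm(x))
```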

    An Octree-Based Approach towards Efficient Variational Range Data Fusion

    Volume-based reconstruction is usually expensive both in terms of memory consumption and runtime. Especially for sparse geometric structures, volumetric representations produce a huge computational overhead. We present an efficient way to fuse range data via a variational Octree-based minimization approach by taking the actual range data geometry into account. We transform the data into Octree-based truncated signed distance fields and show how the optimization can be conducted on the newly created structures. The main challenge is to maintain speed and a low memory footprint without sacrificing the solution's accuracy during optimization. We explain how to dynamically adjust the optimizer's geometric structure via joining/splitting of Octree nodes and how to define the corresponding operators. We evaluate on various datasets and demonstrate the suitability of the approach in terms of performance and geometric accuracy. Comment: BMVC 201
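
    As a rough illustration of the dynamic structure, here is a sketch of an Octree TSDF node supporting split (refine) and join (coarsen); the field names and the inherit-on-split / average-on-join rules are assumptions for the demo, not the operators defined in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OctreeTSDFNode:
    """Octree node holding one truncated signed distance sample."""
    center: tuple            # (x, y, z) center of the node's cube
    size: float              # edge length of the cube
    tsdf: float = 0.0        # truncated signed distance value
    weight: float = 0.0      # integration weight of the sample
    children: Optional[List["OctreeTSDFNode"]] = None

    def split(self) -> None:
        """Refine: replace this leaf with 8 children inheriting its value."""
        if self.children is not None:
            return
        h = self.size / 4.0
        self.children = [
            OctreeTSDFNode((self.center[0] + dx * h,
                            self.center[1] + dy * h,
                            self.center[2] + dz * h),
                           self.size / 2.0, self.tsdf, self.weight)
            for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)
        ]

    def join(self) -> None:
        """Coarsen: collapse 8 leaf children into an averaged parent."""
        if self.children is None or any(c.children for c in self.children):
            return
        self.tsdf = sum(c.tsdf for c in self.children) / 8.0
        self.weight = sum(c.weight for c in self.children) / 8.0
        self.children = None
```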

    Flexible constrained sampling with guarantees for pattern mining

    Pattern sampling has been proposed as a potential solution to the infamous pattern explosion. Instead of enumerating all patterns that satisfy the constraints, individual patterns are sampled proportional to a given quality measure. Several sampling algorithms have been proposed, but each of them has its limitations when it comes to 1) flexibility in terms of quality measures and constraints that can be used, and/or 2) guarantees with respect to sampling accuracy. We therefore present Flexics, the first flexible pattern sampler that supports a broad class of quality measures and constraints, while providing strong guarantees regarding sampling accuracy. To achieve this, we leverage the perspective on pattern mining as a constraint satisfaction problem and build upon the latest advances in sampling solutions in SAT as well as existing pattern mining algorithms. Furthermore, the proposed algorithm is applicable to a variety of pattern languages, which allows us to introduce and tackle the novel task of sampling sets of patterns. We introduce and empirically evaluate two variants of Flexics: 1) a generic variant that addresses the well-known itemset sampling task and the novel pattern set sampling task as well as a wide range of expressive constraints within these tasks, and 2) a specialized variant that exploits existing frequent itemset techniques to achieve substantial speed-ups. Experiments show that Flexics is both accurate and efficient, making it a useful tool for pattern-based data exploration. Comment: Accepted for publication in the Data Mining & Knowledge Discovery journal (ECML/PKDD 2017 journal track).
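
    To fix intuition for the itemset-sampling task (not for Flexics itself), here is a toy rejection sampler that draws itemsets with probability proportional to their support; Flexics instead imposes random XOR (hashing) constraints inside a solver to obtain its accuracy guarantees, which this sketch makes no attempt at.

```python
import itertools
import random

def sample_itemset(transactions, min_size=1, max_size=3, seed=0):
    """Draw an itemset with probability proportional to its support.

    Rejection sampling: pick a candidate uniformly, then accept it with
    probability support/|transactions|, so the accepted distribution is
    proportional to support. Only practical for tiny item universes.
    """
    rng = random.Random(seed)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset(c)
                  for k in range(min_size, max_size + 1)
                  for c in itertools.combinations(items, k)]

    def support(s):
        return sum(1 for t in transactions if s <= t)

    max_support = len(transactions)
    while True:
        s = rng.choice(candidates)
        if rng.random() * max_support < support(s):
            return s

# Usage on a tiny transaction database
db = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
print(sample_itemset(db))
```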

    Multi-Resolution Hashing for Fast Pairwise Summations

    A basic computational primitive in the analysis of massive datasets is summing simple functions over a large number of objects. Modern applications pose an additional challenge in that such functions often depend on a parameter vector $y$ (query) that is unknown a priori. Given a set of points $X\subset \mathbb{R}^{d}$ and a pairwise function $w:\mathbb{R}^{d}\times \mathbb{R}^{d}\to [0,1]$, we study the problem of designing a data structure that enables sublinear-time approximation of the summation $Z_{w}(y)=\frac{1}{|X|}\sum_{x\in X}w(x,y)$ for any query $y\in \mathbb{R}^{d}$. By combining ideas from Harmonic Analysis (partitions of unity and approximation theory) with Hashing-Based-Estimators [Charikar, Siminelakis FOCS'17], we provide a general framework for designing such data structures through hashing that reaches far beyond what previous techniques allowed. A key design principle is a collection of $T\geq 1$ hashing schemes with collision probabilities $p_{1},\ldots,p_{T}$ such that $\sup_{t\in [T]}\{p_{t}(x,y)\} = \Theta(\sqrt{w(x,y)})$. This leads to a data structure that approximates $Z_{w}(y)$ using a sub-linear number of samples from each hash family. Using this new framework along with Distance Sensitive Hashing [Aumuller, Christiani, Pagh, Silvestri PODS'18], we show that such a collection can be constructed and evaluated efficiently for any log-convex function $w(x,y)=e^{\phi(\langle x,y\rangle)}$ of the inner product on the unit sphere $x,y\in \mathcal{S}^{d-1}$. Our method leads to data structures with sub-linear query time that significantly improve upon random sampling and can be used for Kernel Density or Partition Function Estimation. We provide extensions of our result from the sphere to $\mathbb{R}^{d}$ and from scalar functions to vector functions. Comment: 39 pages, 3 figures.
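
    A small numerical illustration of the stated design principle $\sup_{t\in[T]}\{p_{t}(x,y)\}=\Theta(\sqrt{w(x,y)})$: a toy family of $T$ collision-probability curves, geometrically spaced in resolution, whose pointwise maximum tracks $\sqrt{w}$ up to constants. The functional forms below are assumptions chosen for the demo, not a construction from the paper.

```python
import numpy as np

def collision_prob_family(T):
    """Toy family of T collision-probability curves on s = <x, y> in [-1, 1].

    Scheme t is 'tuned' to resolution levels[t]: it collides with
    probability ~ sqrt(levels[t]) when w(s) is near levels[t], and the
    probability decays linearly in w below that level.
    """
    levels = np.exp(np.linspace(np.log(1e-4), 0.0, T))  # geometric levels of w

    def p_t(t, s, w):
        return np.sqrt(levels[t]) * np.minimum(1.0, w(s) / levels[t])

    return levels, p_t

w = lambda s: np.exp(4.0 * (s - 1.0))       # a log-convex kernel of <x, y>
levels, p_t = collision_prob_family(T=8)
s = np.linspace(-1.0, 1.0, 5)
sup_p = np.max([p_t(t, s, w) for t in range(len(levels))], axis=0)
print(np.sqrt(w(s)) / sup_p)                # ratios stay bounded (Theta(1))
```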

    A Fair and Memory/Time-efficient Hashmap

    There is a large amount of work constructing hashmaps to minimize the number of collisions. However, to the best of our knowledge, no known hashing technique guarantees group fairness among different groups of items. We are given a set $P$ of $n$ tuples in $\mathbb{R}^d$, for a constant dimension $d$, and a set of groups $\mathcal{G}=\{\mathbf{g}_1,\ldots,\mathbf{g}_k\}$ such that every tuple belongs to a unique group. We formally define the fair hashing problem, introducing the notions of single fairness ($\Pr[h(p)=h(x)\mid p\in \mathbf{g}_i, x\in P]$ for every $i=1,\ldots,k$), pairwise fairness ($\Pr[h(p)=h(q)\mid p,q\in \mathbf{g}_i]$ for every $i=1,\ldots,k$), and the well-known collision probability ($\Pr[h(p)=h(q)\mid p,q\in P]$). The goal is to construct a hashmap such that the collision probability, the single fairness, and the pairwise fairness are close to $1/m$, where $m$ is the number of buckets in the hashmap. We propose two families of algorithms to design fair hashmaps. First, we focus on hashmaps with optimum memory consumption, minimizing the unfairness. We model the input tuples as points in $\mathbb{R}^d$, and the goal is to find the vector $w$ such that the projection of $P$ onto $w$ creates an ordering that is convenient to split to create a fair hashmap. For each projection we design efficient algorithms that find near-optimum partitions of exactly (or at most) $m$ buckets. Second, we focus on hashmaps with optimum fairness ($0$-unfairness), minimizing the memory consumption. We make the important observation that the fair hashmap problem reduces to the necklace splitting problem. By carefully implementing algorithms for solving the necklace splitting problem, we propose faster algorithms constructing hashmaps with $0$-unfairness using $2(m-1)$ boundary points when $k=2$ and $k(m-1)(4+\log_2 (3mn))$ boundary points for $k>2$.
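
    A minimal sketch of the projection-based family: project the tuples onto a direction $w$, cut the resulting order into $m$ contiguous buckets, and compare the pairwise fairness of each group against the ideal $1/m$. The equal-size cut and the helper names are placeholders; the paper searches for near-optimum cut points (and directions) that minimize unfairness.

```python
from itertools import combinations

import numpy as np

def projection_hashmap(points, w, m):
    """Bucket points by equal-size splits of their ordering along w."""
    order = np.argsort(points @ w)              # ordering induced by the projection
    buckets = np.empty(len(points), dtype=int)
    for b, chunk in enumerate(np.array_split(order, m)):
        buckets[chunk] = b
    return buckets

def pairwise_fairness(buckets, groups, g):
    """Empirical Pr[h(p) = h(q) | p, q in group g]; ideally close to 1/m."""
    idx = np.flatnonzero(groups == g)
    pairs = list(combinations(idx, 2))
    return sum(buckets[i] == buckets[j] for i, j in pairs) / len(pairs)

# Usage on synthetic data with k = 2 groups and m = 8 buckets
rng = np.random.default_rng(0)
pts = rng.standard_normal((200, 2))
grp = rng.integers(0, 2, size=200)
h = projection_hashmap(pts, w=np.array([1.0, 0.3]), m=8)
print([round(pairwise_fairness(h, grp, g), 3) for g in (0, 1)])  # compare to 1/8
```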