
Efficient Document Indexing Using Pivot Tree

Author: Piwowarski Benjamin
Singh Gaurav
Publication date: 01/05/2016

We present a novel method for efficiently searching for the top-k neighbors of documents represented in a high-dimensional term space, based on cosine similarity. Documents are mostly stored in a bag-of-words tf-idf representation. One of the most widely used ways of computing the similarity between a pair of documents is the cosine similarity between their vector representations. However, cosine similarity is not a metric distance measure because it does not satisfy the triangle inequality, so most metric search methods cannot be applied directly. We propose an efficient method for indexing documents using a pivot tree that leads to efficient retrieval. We also study the relation between precision and efficiency for the proposed method and compare it with a state-of-the-art method in the area of document search based on inner product.

Comment: 6 pages, 2 figures
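The claim that cosine similarity does not obey the triangle inequality can be checked with a small counterexample. The sketch below assumes the common convention of turning similarity into a dissimilarity via `1 - cosine_similarity`; the abstract does not say which conversion the authors use, and the three vectors are illustrative values chosen here, not data from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Assumed dissimilarity: 1 - cosine similarity (a convention, not the paper's definition)."""
    return 1.0 - cosine_similarity(u, v)

# Three illustrative 2-D "documents": y lies halfway between x and z.
x = (1.0, 0.0)
y = (1.0, 1.0)
z = (0.0, 1.0)

d_xz = cosine_distance(x, z)  # x and z are orthogonal
d_xy = cosine_distance(x, y)
d_yz = cosine_distance(y, z)

# A metric would require d(x, z) <= d(x, y) + d(y, z).
violates_triangle = d_xz > d_xy + d_yz
print(d_xz, d_xy + d_yz, violates_triangle)
```

Because going through `y` is "shorter" than going directly from `x` to `z`, pruning rules that rely on the triangle inequality can discard true neighbors, which is why metric indexes cannot be applied to this dissimilarity unchanged.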

arXiv.org e-Print Archive

HAL Descartes

Hal-Diderot

A well-separated pairs decomposition algorithm for k-d trees implemented on multi-core architectures

Author: Akl S G
Bentley J L
Blelloch G E
Callahan P B
Cormen T
Har-Peled S
Hoecker A
Ivan D Reid
Knuth D E
McCool M
Moore A
Moore A W
Omohundro S M
Peter R Hobson
Raul H C Lopes
Samet H
Vaidya P M
Publication venue: 'IOP Publishing'
Publication date: 11/06/2014

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Variations of k-d trees represent a fundamental data structure used in Computational Geometry, with numerous applications in science. Examples include particle track fitting in the software of the LHC experiments and simulations of N-body systems in the study of the dynamics of interacting galaxies, particle beam physics, and molecular dynamics in biochemistry. The many-body tree methods devised by Barnes and Hut in the 1980s and the Fast Multipole Method introduced in 1987 by Greengard and Rokhlin use variants of k-d trees to reduce the upper bound on computation time from O(n^2) to O(n log n) and even O(n). We present an algorithm that uses the principle of well-separated pairs decomposition to always produce compressed trees in O(n log n) work. We present and evaluate parallel implementations of the algorithm that can take advantage of multi-core architectures.

The Science and Technology Facilities Council, UK
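The notion of a well-separated pair can be illustrated with a minimal sketch. It assumes the usual textbook definition (two point sets are s-well-separated if they fit in two balls of equal radius r whose gap is at least s·r) and uses a conservative test based on the ball around each set's bounding box. The function names and the separation parameter `s` are hypothetical, and this is not the algorithm or the implementation from the paper.

```python
import math

def enclosing_ball(points):
    """Ball around the axis-aligned bounding box of the points (not the minimal ball)."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    center = [(l + h) / 2 for l, h in zip(lo, hi)]
    radius = math.dist(lo, hi) / 2  # half the box diagonal
    return center, radius

def well_separated(a, b, s):
    """Sufficient (conservative) test that point sets a and b are s-well-separated."""
    center_a, radius_a = enclosing_ball(a)
    center_b, radius_b = enclosing_ball(b)
    r = max(radius_a, radius_b)                   # use one common radius for both balls
    gap = math.dist(center_a, center_b) - 2 * r   # distance between the two balls
    return gap >= s * r

far_pair = well_separated([(0, 0), (1, 1)], [(10, 0), (11, 1)], s=2)
near_pair = well_separated([(0, 0), (1, 1)], [(2, 0), (3, 1)], s=2)
print(far_pair, near_pair)
```

In a decomposition, a test of this kind is what lets one interaction between two tree nodes stand in for all point-to-point interactions between them.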

Crossref

Brunel University Research Archive

Dynamic Manipulation of Spatial Weights Using Web Services

Author: Luc Anselin
Myunghwa Hwang
Sergio J. Rey

Spatial analytical tools are mostly provided in a desktop environment, which tends to restrict user access to the tools. This project intends to exploit up-to-date web technologies to extend user accessibility to spatial analytic tools. The first step is to develop web services for widely used spatial analyses, such as spatial weights manipulation, and to provide an easy-to-use web-based user interface to the services. Users can create, transform, and convert spatial weights for their data sets in web browsers without installing any specialized software.

Research Papers in Economics

Tight Lower Bounds for Data-Dependent Locality-Sensitive Hashing

Author: Andoni Alexandr
Razenshteyn Ilya
Publication date: 01/01/2015

We prove a tight lower bound for the exponent $\rho$ for data-dependent Locality-Sensitive Hashing schemes, recently used to design efficient solutions for the $c$-approximate nearest neighbor search. In particular, our lower bound matches the bound of $\rho\le \frac{1}{2c-1}+o(1)$ for the $\ell_1$ space, obtained via the recent algorithm from [Andoni-Razenshteyn, STOC'15]. In recent years it emerged that data-dependent hashing is strictly superior to the classical Locality-Sensitive Hashing, when the hash function is data-independent. In the latter setting, the best exponent has been already known: for the $\ell_1$ space, the tight bound is $\rho=1/c$, with the upper bound from [Indyk-Motwani, STOC'98] and the matching lower bound from [O'Donnell-Wu-Zhou, ITCS'11]. We prove that, even if the hashing is data-dependent, it must hold that $\rho\ge \frac{1}{2c-1}-o(1)$. To prove the result, we need to formalize the exact notion of data-dependent hashing that also captures the complexity of the hash functions (in addition to their collision properties). Without restricting such complexity, we would allow for obviously infeasible solutions such as the Voronoi diagram of a dataset. To preclude such solutions, we require our hash functions to be succinct. This condition is satisfied by all the known algorithmic results.

Comment: 16 pages, no figures
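To make the gap between the two exponents concrete, the following sketch evaluates the two $\ell_1$ bounds quoted above, $\rho = 1/c$ for data-independent hashing and $\rho = \frac{1}{2c-1}$ for data-dependent hashing, at a few approximation factors. It drops the $o(1)$ terms, and the function names and the chosen values of $c$ are illustrative assumptions.

```python
def rho_data_independent(c):
    """Tight exponent for classical (data-independent) LSH in l_1: 1/c."""
    return 1.0 / c

def rho_data_dependent(c):
    """Tight exponent for data-dependent LSH in l_1, ignoring o(1) terms: 1/(2c - 1)."""
    return 1.0 / (2.0 * c - 1.0)

# Illustrative approximation factors; any c > 1 gives a strictly smaller data-dependent exponent.
exponents = {
    c: (rho_data_independent(c), rho_data_dependent(c))
    for c in (1.5, 2.0, 4.0)
}

for c, (classic, dependent) in exponents.items():
    print(f"c = {c}: classical rho = {classic:.3f}, data-dependent rho = {dependent:.3f}")
```

Since LSH-based query time scales roughly as $n^{\rho}$, a smaller exponent means a polynomially faster query on a dataset of $n$ points.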

arXiv.org e-Print Archive

CiteSeerX

Dagstuhl Research Online Publication Server