Direct Neighbor Search
In this paper we study a novel query type, called the direct neighbor query. Two objects in a dataset are direct neighbors (DNs) if a window selection may exclusively retrieve these two objects. Given a source object, a DN search computes all of its direct neighbors in the dataset. The DNs define a new type of affinity that differs from existing formulations (e.g., nearest neighbors, nearest surrounders, and reverse nearest neighbors) and finds application in domains where user interests are expressed in the form of windows, i.e., multi-attribute range selections. Drawing on key properties of the DN relationship, we develop an I/O optimal processing algorithm for data indexed with a spatial access method. In addition to plain DN search, we also study its K-DN and all-DN variants. The former relaxes the DN condition – two objects are K-DNs if a window query may retrieve them and only up to K − 1 other objects – whereas the all-DN variant computes the DNs of every object in the dataset. Using real, large-scale data
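The DN definition above can be illustrated with a brute-force check. This is a minimal sketch, not the paper's I/O optimal algorithm: it assumes that objects are points and that windows are closed axis-aligned rectangles, in which case two points are DNs exactly when their minimum bounding box contains no other point. The function names `is_direct_neighbor` and `direct_neighbors` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def is_direct_neighbor(a: Point, b: Point, others: Sequence[Point]) -> bool:
    """Return True if some window retrieves a and b but none of `others`.

    Assumption: points and closed axis-aligned windows. Every window that
    contains a and b also contains their minimum bounding box, so it is
    enough to test that box.
    """
    lo = [min(x, y) for x, y in zip(a, b)]
    hi = [max(x, y) for x, y in zip(a, b)]
    return not any(
        all(l <= c <= h for l, c, h in zip(lo, p, hi)) for p in others
    )


def direct_neighbors(source: int, points: Sequence[Point]) -> List[int]:
    """Brute-force DN search: indices of all direct neighbors of points[source]."""
    result = []
    for i, candidate in enumerate(points):
        if i == source:
            continue
        # Every point other than the pair itself may block the window.
        others = [p for j, p in enumerate(points) if j not in (source, i)]
        if is_direct_neighbor(points[source], candidate, others):
            result.append(i)
    return result
```

The sketch costs cubic time in the number of points in the worst case, so it is only useful for checking small examples by hand.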
Fast Exact Search in Hamming Space with Multi-Index Hashing
There is growing interest in representing image data and feature descriptors
using compact binary codes for fast near neighbor search. Although binary codes
are motivated by their use as direct indices (addresses) into a hash table,
codes longer than 32 bits are not being used as such, as it was thought to be
ineffective. We introduce a rigorous way to build multiple hash tables on
binary code substrings that enables exact k-nearest neighbor search in Hamming
space. The approach is storage efficient and straightforward to implement.
Theoretical analysis shows that the algorithm exhibits sub-linear run-time
behavior for uniformly distributed codes. Empirical results show dramatic
speedups over a linear scan baseline for datasets of up to one billion codes of
64, 128, or 256 bits
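The idea of hashing on binary code substrings can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it handles the fixed-radius case (all codes within Hamming distance r of the query) rather than the k-nearest-neighbor search described above, it stores codes as Python integers, and the class name `MultiIndexHash` is hypothetical. It relies on the pigeonhole argument that if two codes differ in at most r bits and each is split into m disjoint substrings, at least one pair of substrings differs in at most floor(r / m) bits.

```python
from collections import defaultdict
from itertools import combinations


class MultiIndexHash:
    """One hash table per code substring (illustrative sketch)."""

    def __init__(self, codes, bits, m):
        assert bits % m == 0, "this sketch assumes equal-length substrings"
        self.codes = list(codes)
        self.m = m
        self.sub_bits = bits // m
        self.mask = (1 << self.sub_bits) - 1
        self.tables = [defaultdict(list) for _ in range(m)]
        for idx, code in enumerate(self.codes):
            for j in range(m):
                self.tables[j][self._sub(code, j)].append(idx)

    def _sub(self, code, j):
        # Extract the j-th substring, counting from the least significant bits.
        return (code >> (j * self.sub_bits)) & self.mask

    def search(self, query, r):
        """Return sorted indices of all codes within Hamming distance r."""
        sub_r = r // self.m  # per-substring search radius (pigeonhole bound)
        candidates = set()
        for j in range(self.m):
            q_sub = self._sub(query, j)
            # Probe every bucket whose key is within sub_r bit flips of q_sub.
            for k in range(sub_r + 1):
                for flips in combinations(range(self.sub_bits), k):
                    probe = q_sub
                    for bit in flips:
                        probe ^= 1 << bit
                    candidates.update(self.tables[j].get(probe, ()))
        # Verify candidates against the full-length Hamming distance.
        return sorted(
            i for i in candidates
            if bin(self.codes[i] ^ query).count("1") <= r
        )
```

Because candidates are verified against the full code, the result is exact; the substring tables only reduce how many codes need to be checked.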
Efficient Classification for Metric Data
Recent advances in large-margin classification of data residing in general
metric spaces (rather than Hilbert spaces) enable classification under various
natural metrics, such as string edit and earthmover distance. A general
framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004]
left open the questions of computational efficiency and of providing direct
bounds on generalization error.
We design a new algorithm for classification in general metric spaces, whose
runtime and accuracy depend on the doubling dimension of the data points, and
can thus achieve superior classification performance in many common scenarios.
The algorithmic core of our approach is an approximate (rather than exact)
solution to the classical problems of Lipschitz extension and of Nearest
Neighbor Search. The algorithm's generalization performance is guaranteed via
the fat-shattering dimension of Lipschitz classifiers, and we present
experimental evidence of its superiority to some common kernel methods. As a
by-product, we offer a new perspective on the nearest neighbor classifier,
which yields significantly sharper risk asymptotics than the classic analysis
of Cover and Hart [IEEE Trans. Info. Theory, 1967].
Comment: This is the full version of an extended abstract that appeared in
Proceedings of the 23rd COLT, 201
Relative NN-Descent: A Fast Index Construction for Graph-Based Approximate Nearest Neighbor Search
Approximate Nearest Neighbor Search (ANNS) is the task of finding the
database vector that is closest to a given query vector. Graph-based ANNS is
the family of methods with the best balance of accuracy and speed for
million-scale datasets. However, graph-based methods have the disadvantage of
long index construction time. Recently, many researchers have improved the
tradeoff between accuracy and speed during a search. However, there is little
research on accelerating index construction. We propose a fast graph
construction algorithm, Relative NN-Descent (RNN-Descent). RNN-Descent combines
NN-Descent, an algorithm for constructing approximate K-nearest neighbor graphs
(K-NN graphs), and RNG Strategy, an algorithm for selecting edges effective for
search. This algorithm allows the direct construction of graph-based indexes
without ANNS. Experimental results demonstrate that the proposed method has
the fastest index construction speed, while its search performance is
comparable to existing state-of-the-art methods such as NSG. For example, in
experiments on the GIST1M dataset, construction with the proposed method is
2x faster than with NSG. Additionally, it is even faster than construction
with NN-Descent.
Comment: Accepted by ACMMM 202
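The edge selection step mentioned above can be sketched as follows. This is a hedged illustration of the relative-neighborhood-style pruning rule commonly used in graph-based indexes, not the RNN-Descent algorithm itself: a candidate is kept as a neighbor of a node only if no already selected neighbor is closer to the candidate than the node is. The function name `rng_select` and the `max_degree` parameter are assumptions made for the example.

```python
import math


def rng_select(node, candidates, max_degree):
    """Select out-neighbors of `node` with an RNG-style pruning rule.

    Candidates are visited in order of increasing distance to `node`.
    A candidate c is rejected when some selected neighbor s satisfies
    dist(c, s) < dist(node, c), since c can then be reached through s.
    """
    selected = []
    for c in sorted(candidates, key=lambda p: math.dist(node, p)):
        if len(selected) >= max_degree:
            break
        if all(math.dist(c, s) >= math.dist(node, c) for s in selected):
            selected.append(c)
    return selected
```

For instance, with the node at the origin and candidates (1, 0), (2, 0), and (0, 1), the point (2, 0) is pruned because the selected neighbor (1, 0) is closer to it than the origin is.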
Bolt: Accelerated Data Mining with Fast Vector Compression
Vectors of data are at the heart of machine learning and data mining.
Recently, vector quantization methods have shown great promise in reducing both
the time and space costs of operating on vectors. We introduce a vector
quantization algorithm that can compress vectors over 12x faster than existing
techniques while also accelerating approximate vector operations such as
distance and dot product computations by up to 10x. Because it can encode over
2GB of vectors per second, it makes vector quantization cheap enough to employ
in many more circumstances. For example, using our technique to compute
approximate dot products in a nested loop can multiply matrices faster than a
state-of-the-art BLAS implementation, even when our algorithm must first
compress the matrices.
In addition to showing the above speedups, we demonstrate that our approach
can accelerate nearest neighbor search and maximum inner product search by over
100x compared to floating point operations and up to 10x compared to other
vector quantization methods. Our approximate Euclidean distance and dot product
computations are not only faster than those of related algorithms with slower
encodings, but also faster than Hamming distance computations, which have
direct hardware support on the tested platforms. We also assess the errors of
our algorithm's approximate distances and dot products, and find that it is
competitive with existing, slower vector quantization algorithms.
Comment: Research track paper at KDD 201
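How vector quantization speeds up approximate distance computations can be sketched with generic product quantization. This is not Bolt's algorithm: the sketch assumes that per-subspace codebooks are already given (in practice they are learned, for example with k-means), and the helper names `encode`, `distance_tables`, and `approx_sq_dist` are hypothetical. A vector is stored as one centroid index per subspace, and a query's distance to any stored vector is then approximated by summing precomputed table entries.

```python
def split(vec, m):
    """Split a vector into m contiguous subvectors of equal length."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(vec, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    subs = split(vec, len(codebooks))
    return [
        min(range(len(cb)), key=lambda k: sq_dist(sub, cb[k]))
        for sub, cb in zip(subs, codebooks)
    ]


def distance_tables(query, codebooks):
    """Per-subspace squared distances from the query to every centroid."""
    subs = split(query, len(codebooks))
    return [[sq_dist(sub, c) for c in cb] for sub, cb in zip(subs, codebooks)]


def approx_sq_dist(code, tables):
    """Approximate squared distance: one table lookup per subspace."""
    return sum(table[k] for table, k in zip(tables, code))
```

The tables are built once per query, after which each stored vector costs only as many lookups and additions as there are subspaces. The returned value equals the exact squared distance between the query and the reconstructed (quantized) vector.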
Phytoplankton Hotspot Prediction With an Unsupervised Spatial Community Model
Many interesting natural phenomena are sparsely distributed and discrete.
Locating the hotspots of such sparsely distributed phenomena is often difficult
because their density gradient is likely to be very noisy. We present a novel
approach to this search problem, where we model the co-occurrence relations
between a robot's observations with a Bayesian nonparametric topic model. This
approach makes it possible to produce a robust estimate of the spatial
distribution of the target, even in the absence of direct target observations.
We apply the proposed approach to the problem of finding the spatial locations
of the hotspots of a specific phytoplankton taxon in the ocean. We use
classified image data from Imaging FlowCytobot (IFCB), which automatically
measures individual microscopic cells and colonies of cells. Given these
individual taxon-specific observations, we learn a phytoplankton community
model that characterizes the co-occurrence relations between taxa. We present
experiments with simulated robot missions drawn from real observation data
collected during a research cruise traversing the US Atlantic coast. Our
results show that the proposed approach outperforms nearest neighbor and
k-means based methods for predicting the spatial distribution of hotspots from
in-situ observations.
Comment: To appear in ICRA 2017, Singapor
Context Sensitive Search String Composition Algorithm using User Intention to Handle Ambiguous Keywords
Finding the required URL among the first few result pages of a search engine is still a challenging task. This may require a number of reformulations of the search string, adversely affecting the user's search time. Query ambiguity and polysemy are major reasons for not obtaining relevant results in the top few result pages. Efficient query composition and data organization are necessary for getting effective results. The context of the information need and the user intent may improve the autocomplete feature of existing search engines. This research proposes a Funnel Mesh-5 algorithm (FM5) to construct a search string that takes into account the context of the information need and the user intention, in three main steps: 1) predict user intention from user profiles and past searches via a weighted mesh structure; 2) resolve ambiguity and polysemy of search strings using context and user intention; 3) generate a personalized, disambiguated search string by query expansion encompassing the user intention and the predicted query. Experimental results for the proposed approach and a comparison with direct use of a search engine are presented. A comparison of the FM5 algorithm with the K Nearest Neighbor algorithm for user intention identification is also presented. The proposed system provides better precision of search results for ambiguous search strings, with improved identification of the user intention. Results are presented for an English-language dataset as well as a Marathi (an Indian language) dataset of ambiguous search strings.