HDIdx: High-Dimensional Indexing for Efficient Approximate Nearest Neighbor Search
Fast Nearest Neighbor (NN) search is a fundamental challenge in large-scale
data processing and analytics, particularly for analyzing multimedia content,
which is often of high dimensionality. Rather than exact NN search,
extensive research efforts have focused on approximate NN search
algorithms. In this work, we present "HDIdx", an efficient high-dimensional
indexing library for fast approximate NN search, which is open-source and
written in Python. It offers a family of state-of-the-art algorithms that
convert input high-dimensional vectors into compact binary codes, making them
very efficient and scalable for NN search with very low space complexity.
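To make the idea of compact binary codes concrete, here is a minimal sketch in Python with NumPy. It is not HDIdx's API and not necessarily one of the algorithms the library ships: it assumes a generic sign-of-random-projection hash, and the function names (`fit_hyperplanes`, `encode`, `hamming_search`) are hypothetical names chosen for illustration.

```python
import numpy as np


def fit_hyperplanes(dim, n_bits, seed=0):
    """Draw one random hyperplane per output bit (illustrative, not HDIdx)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim, n_bits))


def encode(vectors, planes):
    """Map float vectors to packed binary codes: one bit per hyperplane side."""
    bits = (vectors @ planes) > 0          # boolean array, shape (n, n_bits)
    return np.packbits(bits, axis=1)       # uint8 array, shape (n, n_bits / 8)


def hamming_search(query_code, db_codes, top_k=5):
    """Rank database codes by Hamming distance to a single query code."""
    xor = np.bitwise_xor(db_codes, query_code)
    dists = np.unpackbits(xor, axis=1).sum(axis=1)
    order = np.argsort(dists, kind="stable")[:top_k]
    return order, dists[order]


# Toy data: 1000 vectors of dimension 128, compressed to 64 bits (8 bytes) each.
rng = np.random.default_rng(1)
database = rng.standard_normal((1000, 128)).astype(np.float32)
planes = fit_hyperplanes(dim=128, n_bits=64)
codes = encode(database, planes)

# Query with the first database vector; it should match itself at distance 0.
ids, dists = hamming_search(encode(database[:1], planes)[0], codes)
```

Under these assumptions each 128-dimensional float32 vector (512 bytes) shrinks to 8 bytes, and the search touches only the packed codes, which is the source of the low space complexity the abstract describes.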
Towards a Scalable Dynamic Spatial Database System
With the rise of GPS-enabled smartphones and other similar mobile devices,
massive amounts of location data are available. However, no scalable solutions
for soft real-time spatial queries on large sets of moving objects have yet
emerged. In this paper, we explore and measure the limits of current algorithms
and implementations across different application scenarios. Finally, we
propose a novel distributed architecture to solve the scalability issues.
Comment: (2012
Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic
processing tool in various contexts. However, its computational cost can be
prohibitively high when the data size and the number of clusters are large. It is
well known that the processing bottleneck of k-means lies in the operation of
finding the closest centroid in each iteration. In this paper, a novel solution
to the scalability issue of k-means is presented. In the proposal, k-means
is supported by an approximate k-nearest neighbor graph. In each k-means
iteration, each data sample is compared only to the clusters in which its nearest
neighbors reside. Since the number of nearest neighbors considered is much
smaller than k, the processing cost of this step becomes minor and independent of
k. The processing bottleneck is therefore overcome. Most interestingly, the
k-nearest neighbor graph is itself constructed by iteratively calling the fast
k-means. Compared with existing fast k-means variants, the proposed
algorithm achieves a speed-up of hundreds to thousands of times while maintaining high
clustering quality. When tested on 10 million 512-dimensional data points, it
takes only 5.2 hours to produce 1 million clusters. In contrast, traditional
k-means would take 3 years to complete clustering at the same scale.
- …
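The graph-restricted assignment step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the neighbor graph here is built by exact brute force purely for the toy example (the paper instead builds an approximate graph by repeatedly calling the fast k-means itself), and all function names are hypothetical.

```python
import numpy as np


def knn_graph_bruteforce(data, n_neighbors):
    """Exact kNN graph by brute force; a stand-in for the approximate graph."""
    sq = (data ** 2).sum(axis=1)
    d2 = sq[:, None] - 2.0 * data @ data.T + sq[None, :]
    np.fill_diagonal(d2, np.inf)           # a sample is not its own neighbor
    return np.argsort(d2, axis=1)[:, :n_neighbors]


def graph_restricted_assign(data, centroids, labels, knn):
    """Compare each sample only to clusters holding its neighbors (or itself)."""
    new_labels = labels.copy()
    for i in range(len(data)):
        candidates = np.unique(np.append(labels[knn[i]], labels[i]))
        d2 = ((centroids[candidates] - data[i]) ** 2).sum(axis=1)
        new_labels[i] = candidates[np.argmin(d2)]
    return new_labels


def update_centroids(data, labels, centroids):
    """Recompute each non-empty cluster's mean; empty clusters keep their centroid."""
    new = centroids.copy()
    for c in range(len(centroids)):
        members = data[labels == c]
        if len(members):
            new[c] = members.mean(axis=0)
    return new


def distortion(data, centroids, labels):
    """Sum of squared distances from each sample to its assigned centroid."""
    return float(((data - centroids[labels]) ** 2).sum())


# Toy run: 300 samples, 10 clusters, 5 neighbors per sample.
rng = np.random.default_rng(0)
data = rng.standard_normal((300, 8))
k = 10
knn = knn_graph_bruteforce(data, n_neighbors=5)
labels = rng.integers(0, k, size=len(data))
centroids = update_centroids(data, labels, np.zeros((k, data.shape[1])))

history = [distortion(data, centroids, labels)]
for _ in range(5):
    labels = graph_restricted_assign(data, centroids, labels, knn)
    centroids = update_centroids(data, labels, centroids)
    history.append(distortion(data, centroids, labels))
```

Each sample is compared to at most `n_neighbors + 1` candidate centroids, so the per-sample cost of the assignment step depends on the neighborhood size rather than on k. Because a sample's current cluster is always among its candidates, the distortion recorded in `history` does not increase from one iteration to the next in this sketch.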