Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic
processing tool in various contexts. However, its computational cost can be
prohibitively high when the data size and the number of clusters are large. It
is well known that the processing bottleneck of k-means lies in the operation
of seeking the closest centroid in each iteration. In this paper, a novel
solution to the scalability issue of k-means is presented. In the proposal,
k-means is supported by an approximate k-nearest-neighbor graph. In each
k-means iteration, a data sample is compared only to the clusters in which its
nearest neighbors reside. Since the number of nearest neighbors considered is
far smaller than k, the processing cost of this step becomes minor and
independent of k. The processing bottleneck is therefore overcome. Most
interestingly, the k-nearest-neighbor graph is itself constructed by
iteratively calling the fast k-means. Compared with existing fast k-means
variants, the proposed algorithm achieves speed-ups of hundreds to thousands
of times while maintaining high clustering quality. When tested on 10 million
512-dimensional samples, it takes only 5.2 hours to produce 1 million
clusters; clustering at the same scale would take about 3 years with
traditional k-means.
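The restricted assignment step described above can be sketched as follows. This is an illustrative reading of the idea, not the authors' implementation: each sample's candidate clusters are limited to its own current cluster plus the clusters of its approximate nearest neighbors, so the per-sample cost depends on the neighbor count rather than on k.

```python
import numpy as np

def assign_step(X, centroids, assign, knn):
    """One k-means assignment step in which each sample is compared only
    to the clusters that its graph neighbours currently belong to,
    instead of to all k centroids (illustrative sketch)."""
    new_assign = assign.copy()
    for i in range(len(X)):
        # Candidate clusters: the sample's own cluster plus the clusters
        # of its approximate nearest neighbours.
        cand = np.fromiter({assign[i]} | {assign[j] for j in knn[i]},
                           dtype=int)
        d = np.linalg.norm(centroids[cand] - X[i], axis=1)
        new_assign[i] = cand[np.argmin(d)]
    return new_assign
```

With a neighbor list of fixed size, the assignment cost per sample is constant in k, which is where the claimed speed-up over exhaustive centroid search comes from.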
Permutation and Grouping Methods for Sharpening Gaussian Process Approximations
Vecchia's approximate likelihood for Gaussian process parameters depends on
how the observations are ordered, which can be viewed as a deficiency because
the exact likelihood is permutation-invariant. This article takes the
alternative standpoint that the ordering of the observations can be tuned to
sharpen the approximations. Advantageously chosen orderings can drastically
improve the approximations, and in fact, completely random orderings often
produce far more accurate approximations than default coordinate-based
orderings do. In addition to the permutation results, automatic methods for
grouping calculations of components of the approximation are introduced, having
the result of simultaneously improving the quality of the approximation and
reducing its computational burden. In common settings, reordering combined with
grouping reduces Kullback-Leibler divergence from the target model by a factor
of 80 and computation time by a factor of 2 compared to ungrouped
approximations with default ordering. The claims are supported by theory and
numerical results with comparisons to other approximations, including tapered
covariances and stochastic partial differential equation approximations.
Computational details are provided, including efficiently finding the orderings
and ordered nearest neighbors, and profiling out linear mean parameters and
using the approximations for prediction and conditional simulation. An
application to space-time satellite data is presented.
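Vecchia's approximation, on which the permutation and grouping results above are built, replaces the exact Gaussian log-likelihood with a sum of conditional densities, each conditioning only on a few previously ordered nearest neighbors. A minimal ungrouped sketch (the helper `cov_fn` and its signature are assumptions for illustration):

```python
import numpy as np
from scipy.stats import norm

def vecchia_loglik(y, locs, cov_fn, order, m):
    """Vecchia-type approximate Gaussian log-likelihood: observation i is
    conditioned on at most its m nearest *previously ordered* neighbours.
    `cov_fn(a, b)` is assumed to return the cross-covariance matrix
    between two sets of locations (illustrative sketch)."""
    y, locs = y[order], locs[order]
    ll = 0.0
    for i in range(len(y)):
        prev = np.arange(i)
        if i > 0:
            # nearest previous neighbours under the chosen ordering
            d = np.linalg.norm(locs[prev] - locs[i], axis=1)
            prev = prev[np.argsort(d)[:m]]
        if len(prev) == 0:
            mu, var = 0.0, cov_fn(locs[:1], locs[:1])[0, 0]
        else:
            K = cov_fn(locs[prev], locs[prev])
            k = cov_fn(locs[prev], locs[i:i + 1])[:, 0]
            w = np.linalg.solve(K, k)
            mu = w @ y[prev]
            var = cov_fn(locs[i:i + 1], locs[i:i + 1])[0, 0] - w @ k
        ll += norm.logpdf(y[i], loc=mu, scale=np.sqrt(var))
    return ll
```

Because the exact likelihood factors by the chain rule, conditioning each term on all previous observations (m = n - 1) recovers the exact value; the approximation truncates each conditioning set, and the ordering determines which neighbors are available to condition on.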
Maximum Inner-Product Search using Tree Data-structures
The problem of {\em efficiently} finding the best match for a query in a
given set with respect to the Euclidean distance or the cosine similarity has
been extensively studied in the literature. However, a closely related problem of
efficiently finding the best match with respect to the inner product has never
been explored in the general setting to the best of our knowledge. In this
paper we consider this general problem and contrast it with the existing
best-match algorithms. First, we propose a general branch-and-bound algorithm
using a tree data structure. Subsequently, we present a dual-tree algorithm for
the case where there are multiple queries. Finally, we present a new data
structure for increasing the efficiency of the dual-tree algorithm. These
branch-and-bound algorithms involve novel bounds suited for the purpose of
best-matching with inner products. We evaluate our proposed algorithms on a
variety of data sets from various applications, and exhibit up to five orders
of magnitude improvement in query time over the naive search technique.
Comment: Under submission in KDD 201
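One standard way to realize a branch-and-bound search of this kind, sketched below under assumptions rather than taken from the paper, is a ball tree with the bound max over p in a ball of q·p ≤ q·c + ||q||·r, where c and r are the ball's center and radius; a subtree is pruned whenever its bound cannot beat the best inner product found so far.

```python
import numpy as np

def build_ball_tree(points, idx=None, leaf_size=8):
    """Tiny ball tree for illustration (not the paper's exact structure)."""
    if idx is None:
        idx = np.arange(len(points))
    center = points[idx].mean(axis=0)
    radius = np.max(np.linalg.norm(points[idx] - center, axis=1))
    node = {"center": center, "radius": radius, "idx": idx}
    if len(idx) > leaf_size:
        # split on the dimension of largest spread
        dim = np.argmax(points[idx].std(axis=0))
        order = idx[np.argsort(points[idx][:, dim])]
        half = len(order) // 2
        node["children"] = [build_ball_tree(points, order[:half], leaf_size),
                            build_ball_tree(points, order[half:], leaf_size)]
    return node

def mips(q, points, node, best=(-np.inf, -1)):
    """Branch-and-bound maximum inner-product search using the bound
    q.p <= q.center + ||q|| * radius for every p inside the ball."""
    bound = q @ node["center"] + np.linalg.norm(q) * node["radius"]
    if bound <= best[0]:
        return best  # prune: nothing in this ball can beat `best`
    if "children" not in node:
        for i in node["idx"]:
            ip = q @ points[i]
            if ip > best[0]:
                best = (ip, i)
        return best
    for child in node["children"]:
        best = mips(q, points, child, best)
    return best
```

The bound follows from Cauchy-Schwarz: q·p = q·c + q·(p - c) ≤ q·c + ||q||·||p - c||. Unlike nearest-neighbor bounds, it does not require the data or the query to be normalized, which is what makes the inner-product setting distinct from Euclidean or cosine best-match search.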