Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic
processing tool in various contexts. However, its computational cost can be
prohibitively high when the data size and the number of clusters are large. It
is well known that the processing bottleneck of k-means lies in the operation
of seeking the closest centroid in each iteration. In this paper, a novel
solution to the scalability issue of k-means is presented. In the proposal,
k-means is supported by an approximate k-nearest-neighbor graph. In each
k-means iteration, a data sample is compared only to the clusters in which its
nearest neighbors reside. Since the number of nearest neighbors considered is
far smaller than k, the processing cost of this step becomes minor and
independent of k. The processing bottleneck is therefore overcome. Most
interestingly, the k-nearest-neighbor graph is itself constructed by
iteratively calling the fast k-means. Compared with existing fast k-means
variants, the proposed algorithm achieves speed-ups of hundreds to thousands
of times while maintaining high clustering quality. When tested on 10 million
512-dimensional samples, it takes only 5.2 hours to produce 1 million
clusters; clustering at the same scale would take about 3 years with
traditional k-means.
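The restricted assignment step described above can be sketched as follows. This is an illustrative reading of the idea, not the authors' implementation: each sample's candidate clusters are limited to its own current cluster plus the clusters of its approximate nearest neighbors, so the per-sample cost depends on the neighbor count rather than on k.

```python
import numpy as np

def assign_step(X, centroids, assign, knn):
    """One k-means assignment step in which each sample is compared only
    to the clusters that its graph neighbours currently belong to,
    instead of to all k centroids (illustrative sketch)."""
    new_assign = assign.copy()
    for i in range(len(X)):
        # Candidate clusters: the sample's own cluster plus the clusters
        # of its approximate nearest neighbours.
        cand = np.fromiter({assign[i]} | {assign[j] for j in knn[i]},
                           dtype=int)
        d = np.linalg.norm(centroids[cand] - X[i], axis=1)
        new_assign[i] = cand[np.argmin(d)]
    return new_assign
```

With a neighbor list of fixed size, the assignment cost per sample is constant in k, which is where the claimed speed-up over exhaustive centroid search comes from.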
Permutation and Grouping Methods for Sharpening Gaussian Process Approximations
Vecchia's approximate likelihood for Gaussian process parameters depends on
how the observations are ordered, which can be viewed as a deficiency because
the exact likelihood is permutation-invariant. This article takes the
alternative standpoint that the ordering of the observations can be tuned to
sharpen the approximations. Advantageously chosen orderings can drastically
improve the approximations, and in fact, completely random orderings often
produce far more accurate approximations than default coordinate-based
orderings do. In addition to the permutation results, automatic methods for
grouping calculations of components of the approximation are introduced, having
the result of simultaneously improving the quality of the approximation and
reducing its computational burden. In common settings, reordering combined with
grouping reduces Kullback-Leibler divergence from the target model by a factor
of 80 and computation time by a factor of 2 compared to ungrouped
approximations with default ordering. The claims are supported by theory and
numerical results with comparisons to other approximations, including tapered
covariances and stochastic partial differential equation approximations.
Computational details are provided, including efficiently finding the orderings
and ordered nearest neighbors, and profiling out linear mean parameters and
using the approximations for prediction and conditional simulation. An
application to space-time satellite data is presented.
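Vecchia's approximation, on which the permutation and grouping results above are built, replaces the exact Gaussian log-likelihood with a sum of conditional densities, each conditioning only on a few previously ordered nearest neighbors. A minimal ungrouped sketch (the helper `cov_fn` and its signature are assumptions for illustration):

```python
import numpy as np
from scipy.stats import norm

def vecchia_loglik(y, locs, cov_fn, order, m):
    """Vecchia-type approximate Gaussian log-likelihood: observation i is
    conditioned on at most its m nearest *previously ordered* neighbours.
    `cov_fn(a, b)` is assumed to return the cross-covariance matrix
    between two sets of locations (illustrative sketch)."""
    y, locs = y[order], locs[order]
    ll = 0.0
    for i in range(len(y)):
        prev = np.arange(i)
        if i > 0:
            # nearest previous neighbours under the chosen ordering
            d = np.linalg.norm(locs[prev] - locs[i], axis=1)
            prev = prev[np.argsort(d)[:m]]
        if len(prev) == 0:
            mu, var = 0.0, cov_fn(locs[:1], locs[:1])[0, 0]
        else:
            K = cov_fn(locs[prev], locs[prev])
            k = cov_fn(locs[prev], locs[i:i + 1])[:, 0]
            w = np.linalg.solve(K, k)
            mu = w @ y[prev]
            var = cov_fn(locs[i:i + 1], locs[i:i + 1])[0, 0] - w @ k
        ll += norm.logpdf(y[i], loc=mu, scale=np.sqrt(var))
    return ll
```

Because the exact likelihood factors by the chain rule, conditioning each term on all previous observations (m = n - 1) recovers the exact value; the approximation truncates each conditioning set, and the ordering determines which neighbors are available to condition on.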
Maximum Inner-Product Search using Tree Data-structures
The problem of {\em efficiently} finding the best match for a query in a
given set with respect to the Euclidean distance or the cosine similarity has
been extensively studied in the literature. However, a closely related problem of
efficiently finding the best match with respect to the inner product has never
been explored in the general setting to the best of our knowledge. In this
paper we consider this general problem and contrast it with the existing
best-match algorithms. First, we propose a general branch-and-bound algorithm
using a tree data structure. Subsequently, we present a dual-tree algorithm for
the case where there are multiple queries. Finally, we present a new data
structure for increasing the efficiency of the dual-tree algorithm. These
branch-and-bound algorithms involve novel bounds suited for the purpose of
best-matching with inner products. We evaluate our proposed algorithms on a
variety of data sets from various applications, and exhibit up to five orders
of magnitude improvement in query time over the naive search technique.
Comment: Under submission in KDD 201
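One standard way to realize a branch-and-bound search of this kind, sketched below under assumptions rather than taken from the paper, is a ball tree with the bound max over p in a ball of q·p ≤ q·c + ||q||·r, where c and r are the ball's center and radius; a subtree is pruned whenever its bound cannot beat the best inner product found so far.

```python
import numpy as np

def build_ball_tree(points, idx=None, leaf_size=8):
    """Tiny ball tree for illustration (not the paper's exact structure)."""
    if idx is None:
        idx = np.arange(len(points))
    center = points[idx].mean(axis=0)
    radius = np.max(np.linalg.norm(points[idx] - center, axis=1))
    node = {"center": center, "radius": radius, "idx": idx}
    if len(idx) > leaf_size:
        # split on the dimension of largest spread
        dim = np.argmax(points[idx].std(axis=0))
        order = idx[np.argsort(points[idx][:, dim])]
        half = len(order) // 2
        node["children"] = [build_ball_tree(points, order[:half], leaf_size),
                            build_ball_tree(points, order[half:], leaf_size)]
    return node

def mips(q, points, node, best=(-np.inf, -1)):
    """Branch-and-bound maximum inner-product search using the bound
    q.p <= q.center + ||q|| * radius for every p inside the ball."""
    bound = q @ node["center"] + np.linalg.norm(q) * node["radius"]
    if bound <= best[0]:
        return best  # prune: nothing in this ball can beat `best`
    if "children" not in node:
        for i in node["idx"]:
            ip = q @ points[i]
            if ip > best[0]:
                best = (ip, i)
        return best
    for child in node["children"]:
        best = mips(q, points, child, best)
    return best
```

The bound follows from Cauchy-Schwarz: q·p = q·c + q·(p - c) ≤ q·c + ||q||·||p - c||. Unlike nearest-neighbor bounds, it does not require the data or the query to be normalized, which is what makes the inner-product setting distinct from Euclidean or cosine best-match search.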