Fast Approximate Algorithms for k-NN Search and k-NN Graph Construction
Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2015. Advisor: Sang-goo Lee. Finding k-nearest neighbors (k-NN) is an essential part of recommender systems, information retrieval, and many data mining and machine learning algorithms. However, there are two main problems in finding k-nearest neighbors: 1) existing approaches require a huge amount of time when the number of objects or dimensions scales up; 2) k-NN computation methods do not show consistent performance across different search tasks and types of data. In this dissertation, we present fast and versatile algorithms for finding k-nearest neighbors that address these problems. The main contributions are summarized as follows. First, we present an efficient and scalable algorithm for constructing an approximate k-NN graph by filtering out node pairs whose large-value dimensions do not match at all. Second, we present a fast collaborative filtering algorithm that utilizes the k-NN graph; its main idea is to reverse the process of finding k-nearest neighbors in item-based collaborative filtering. Last, we propose a fast approximate algorithm for k-NN search that selects query-specific signatures from a signature pool to pick high-quality k-NN candidates. Experimental results show that the proposed algorithms guarantee a high level of accuracy while being much faster than competing algorithms across different types of search tasks and datasets.
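The first contribution above, greedy filtering, prunes node pairs whose largest-value dimensions never overlap. A minimal pure-Python sketch of that idea follows; the `prefix` size, the inverted index, and the cosine scoring here are illustrative choices, not the dissertation's exact scheme:

```python
from itertools import combinations
import math

def knn_graph_greedy_filtering(vectors, k=2, prefix=2):
    """Approximate k-NN graph sketch: only vector pairs that share at
    least one of their `prefix` largest-value dimensions are compared
    (a hypothetical simplification of greedy filtering)."""
    # For each vector, the indices of its `prefix` largest dimensions.
    prefixes = [set(sorted(range(len(v)), key=lambda d: -v[d])[:prefix])
                for v in vectors]
    # Inverted index: dimension -> vectors whose prefix contains it.
    index = {}
    for i, p in enumerate(prefixes):
        for d in p:
            index.setdefault(d, []).append(i)
    # Candidate pairs come only from shared prefix dimensions.
    candidates = {pair for ids in index.values()
                  for pair in combinations(ids, 2)}
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0
    # Score the surviving candidates and keep the top k per node.
    graph = {i: [] for i in range(len(vectors))}
    for i, j in candidates:
        s = cos(vectors[i], vectors[j])
        graph[i].append((s, j))
        graph[j].append((s, i))
    return {i: [j for _, j in sorted(nbrs, reverse=True)[:k]]
            for i, nbrs in graph.items()}
```

The point of the filter is that the cosine loop runs only over candidate pairs, not over all O(n²) pairs.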
Efficient Computation of K-Nearest Neighbor Graphs for Large High-Dimensional Data Sets on GPU Clusters
The k-Nearest Neighbor Graph (k-NNG) and the related k-Nearest Neighbor (k-NN) methods have a wide variety of applications in areas such as bioinformatics, machine learning, data mining, clustering analysis, and pattern recognition. Our application of interest is manifold embedding. Due to the large dimensionality of the input data (~15k), spatial-subdivision-based techniques such as OBBs, k-d trees, and BSP trees are not viable. The only alternative is brute-force search, which has two distinct parts. The first finds distances between individual vectors in the corpus based on a predefined metric. Given the distance matrix, the second step selects the k nearest neighbors for each member of the query data set.
This thesis presents the development and implementation of a distributed exact k-Nearest Neighbor Graph (k-NNG) construction method. The proposed method uses Graphics Processing Units (GPUs) and exploits multiple levels of parallelism for distributed computational systems using GPUs. It is scalable to different cluster sizes, with each compute node in the cluster containing multiple GPUs. The distance computation is formulated as a basic matrix multiplication and reduction operation, using the optimized CUBLAS matrix multiplication library. Various distance metrics such as Euclidean, cosine, and Pearson are supported. For k-NNG construction, two different methods are presented. The first is based on an approach called batch index sorting, which builds the k-NNG with three sorting operations using the optimized radix sort implementation in the Thrust library for GPU. The second is an efficient implementation, using the latest GPU functionalities, of a variant of the quickselect algorithm. Overall, the batch-index-sorting-based k-NNG method is approximately 13x faster than a distributed MATLAB implementation. The quickselect algorithm itself has a 5x speedup over state-of-the-art GPU methods. This has enabled k-NNG construction on a data set containing 20 million image vectors, each with dimension 15,000, as part of a manifold embedding technique for analyzing the conformations of biomolecules.
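The two-step brute force described above can be sketched in plain Python. The thesis performs step one as a CUBLAS GEMM and step two on the GPU; here both steps are naive loops, and `knn_bruteforce` is a name invented for illustration:

```python
def knn_bruteforce(corpus, queries, k):
    """Two-step brute-force k-NN sketch. Step 1: squared Euclidean
    distances via the identity ||q - c||^2 = ||q||^2 + ||c||^2 - 2 q.c,
    so the dominant cost is the inner-product term, i.e. a matrix
    multiplication. Step 2: select the k smallest per query."""
    # Corpus norms are computed once and reused for every query.
    c_norms = [sum(x * x for x in c) for c in corpus]
    result = []
    for q in queries:
        q_norm = sum(x * x for x in q)
        dists = [q_norm + cn - 2 * sum(a * b for a, b in zip(q, c))
                 for c, cn in zip(corpus, c_norms)]
        # Selection step: indices of the k smallest distances.
        result.append(sorted(range(len(corpus)), key=dists.__getitem__)[:k])
    return result
```

On a GPU, the inner-product term for all query/corpus pairs is a single matrix product, which is why the formulation pays off.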
Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost can be prohibitively high when the data size and the cluster number are large. It is well known that the processing bottleneck of k-means lies in seeking the closest centroid in each iteration. In this paper, a novel solution to the scalability issue of k-means is presented. In the proposal, k-means is supported by an approximate k-nearest-neighbor graph: in each k-means iteration, a data sample is compared only to the clusters in which its nearest neighbors reside. Since the number of nearest neighbors considered is much smaller than k, the cost of this step becomes minor and independent of k, and the processing bottleneck is therefore overcome. Most interestingly, the k-nearest-neighbor graph is constructed by iteratively calling the fast k-means itself. Compared with existing fast k-means variants, the proposed algorithm achieves a hundreds-to-thousands-fold speed-up while maintaining high clustering quality. Tested on 10 million 512-dimensional vectors, it takes only 5.2 hours to produce 1 million clusters; fulfilling the same scale of clustering with traditional k-means would take roughly 3 years.
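The graph-assisted assignment step can be sketched as follows. This is a hypothetical simplification (one assignment pass; the candidate set is the sample's own cluster plus the clusters of its graph neighbors), not the paper's full algorithm:

```python
def graph_assisted_assignment(data, centroids, assign, knn_graph):
    """One assignment pass of the graph-assisted idea: each sample is
    compared only to the centroids of the clusters its graph neighbors
    currently belong to, instead of to all k centroids."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    new_assign = []
    for i, x in enumerate(data):
        # Candidate clusters: own cluster plus neighbors' clusters.
        cands = {assign[i]} | {assign[j] for j in knn_graph[i]}
        new_assign.append(min(cands, key=lambda c: d2(x, centroids[c])))
    return new_assign
```

Because `cands` has only a handful of entries, the per-sample cost no longer scales with the number of clusters k.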
Unsupervised robust nonparametric learning of hidden community properties
We consider learning of fundamental properties of communities in large noisy
networks, in the prototypical situation where the nodes or users are split into
two classes according to a binary property, e.g., according to their opinions
or preferences on a topic. For learning these properties, we propose a
nonparametric, unsupervised, and scalable graph scan procedure that is, in
addition, robust against a class of powerful adversaries. In our setup, one of
the communities can fall under the influence of a knowledgeable adversarial
leader, who knows the full network structure, has unlimited computational
resources and can completely foresee our planned actions on the network. We
prove strong consistency of our results in this setup with minimal assumptions.
In particular, the learning procedure estimates the baseline activity of normal
users asymptotically correctly with probability 1; the only assumption being
the existence of a single implicit community of asymptotically negligible
logarithmic size. We provide experiments on real and synthetic data to illustrate the performance of our method, including examples with adversaries.
An Efficient Index for Visual Search in Appearance-based SLAM
Vector-quantization can be a computationally expensive step in visual
bag-of-words (BoW) search when the vocabulary is large. A BoW-based appearance
SLAM needs to tackle this problem for an efficient real-time operation. We
propose an effective method to speed up the vector-quantization process in
BoW-based visual SLAM. To this end, we employ a graph-based nearest neighbor search (GNNS) algorithm, and experimentally show that it can outperform the state of the art. The graph-based search structure used in GNNS can be efficiently integrated into the BoW model and the SLAM framework. The graph-based index, which is a k-NN graph, is built over the vocabulary words and can be extracted from the BoW vocabulary construction procedure by adding one iteration to the k-means clustering, which adds only a small extra cost. Moreover, exploiting the fact that images acquired for appearance-based SLAM are sequential, the GNNS search can be initiated judiciously, which increases the speedup of the quantization process considerably.
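A minimal sketch of the greedy descent at the heart of GNNS, assuming a single start node and squared Euclidean distance (the published algorithm uses multiple random restarts and an expansion parameter, omitted here):

```python
def gnns_quantize(query, vocab, graph, start):
    """Greedy GNNS descent (simplified, one restart): from `start`,
    repeatedly hop to whichever graph neighbor is closer to the query,
    stopping at a local minimum. For sequential SLAM images, `start`
    can be the word matched by the previous frame, which is one way to
    'initiate the search judiciously'."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    cur = start
    while True:
        best = min(graph[cur], key=lambda n: d2(query, vocab[n]),
                   default=cur)
        if d2(query, vocab[best]) >= d2(query, vocab[cur]):
            return cur  # no neighbor improves: local minimum reached
        cur = best
```

With a warm start near the answer, the descent touches only a few vocabulary words instead of scanning the whole vocabulary.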
Fast k-NNG construction with GPU-based quick multi-select
In this paper we describe a new brute-force algorithm for building the k-Nearest Neighbor Graph (k-NNG). The k-NNG algorithm has many applications in areas such as machine learning, bioinformatics, and clustering analysis. While there are very efficient algorithms for low-dimensional data, for high-dimensional data brute-force search is the best algorithm. There are two main parts to the algorithm: the first is finding the distances between the input vectors, which may be formulated as a matrix multiplication problem; the second is the selection of the k NNs for each of the query vectors. For the second part, we describe a novel graphics processing unit (GPU)-based multi-select algorithm based on quicksort. Our optimization makes clever use of warp voting functions available on the latest GPUs along with user-controlled cache. Benchmarks show significant improvement over state-of-the-art implementations of k-NN search on GPUs.
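The selection step can be illustrated with a CPU-side quickselect sketch; the warp-voting and cache optimizations are GPU-specific and omitted here:

```python
import random

def quickselect_k_smallest(dists, k):
    """Return the k smallest values in sorted order. Partition around a
    random pivot and recurse only into the side containing the k-th
    element, so most of the input is never fully sorted."""
    if k >= len(dists):
        return sorted(dists)
    pivot = random.choice(dists)
    lo = [d for d in dists if d < pivot]
    eq = [d for d in dists if d == pivot]
    hi = [d for d in dists if d > pivot]
    if k <= len(lo):
        return quickselect_k_smallest(lo, k)
    if k <= len(lo) + len(eq):
        return sorted(lo) + eq[:k - len(lo)]
    return sorted(lo) + eq + quickselect_k_smallest(hi, k - len(lo) - len(eq))
```

The expected cost is linear in the input size, versus O(n log n) for sorting every distance list in full.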