5,880 research outputs found
Fast Approximate Algorithms for k-NN Search and k-NN Graph Construction
Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2015. Advisor: Sang-goo Lee. Finding k-nearest neighbors (k-NN) is an essential part of recommender systems, information retrieval, and many data mining and machine learning algorithms. However, there are two main problems in finding k-nearest neighbors: 1) existing approaches require a huge amount of time when the number of objects or dimensions scales up; 2) existing k-NN computation methods do not show consistent performance across different search tasks and types of data. In this dissertation, we present fast and versatile algorithms for finding k-nearest neighbors that cope with these problems. The main contributions are summarized as follows. First, we present an efficient and scalable algorithm for constructing an approximate k-NN graph that filters out node pairs whose large-value dimensions do not match at all. Second, we present a fast collaborative filtering algorithm that utilizes a k-NN graph; the main idea of this approach is to reverse the process of finding k-nearest neighbors in item-based collaborative filtering. Last, we propose a fast approximate algorithm for k-NN search that selects query-specific signatures from a signature pool to pick high-quality k-NN candidates. The experimental results show that the proposed algorithms guarantee a high level of accuracy while also being much faster than the other algorithms across different types of search tasks and datasets.
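The filtering idea behind the first contribution (comparing only object pairs whose largest-value dimensions overlap) can be sketched as follows. This is an illustrative reading of the abstract, not the dissertation's exact Greedy Filtering algorithm; the function name, the prefix size, and the cosine ranking of candidates are assumptions.

```python
import numpy as np

def greedy_filtering_knn_graph(X, k=2, prefix=2):
    """Approximate k-NN graph sketch: only vector pairs that share at
    least one of their `prefix` largest-magnitude dimensions are ever
    compared; all other pairs are filtered out without computation."""
    n = X.shape[0]
    # Inverted index: dimension -> objects whose prefix contains it.
    buckets = {}
    for i, x in enumerate(X):
        for d in np.argsort(-np.abs(x))[:prefix]:
            buckets.setdefault(int(d), []).append(i)
    # Candidate pairs come only from shared buckets.
    cand = [set() for _ in range(n)]
    for members in buckets.values():
        for i in members:
            cand[i].update(m for m in members if m != i)
    # Rank each object's surviving candidates by cosine similarity.
    graph = []
    for i in range(n):
        ranked = sorted(
            cand[i],
            key=lambda j: -(X[i] @ X[j])
            / (np.linalg.norm(X[i]) * np.linalg.norm(X[j])))
        graph.append(ranked[:k])
    return graph
```

Objects with no overlapping prefix dimension never become candidates of each other, which is how the quadratic number of comparisons is avoided.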
Contents:
Chapter 1 Introduction
  1.1 Motivation and Challenges
    1.1.1 Fast Approximation
    1.1.2 Versatility
  1.2 Our Solutions
    1.2.1 Greedy Filtering
    1.2.2 Signature Selection LSH
    1.2.3 Reversed CF
  1.3 Contributions
  1.4 Outline
Chapter 2 Background and Related Work
  2.1 k-NN Search
    2.1.1 Locality Sensitive Hashing
    2.1.2 LSH-based k-NN Search
  2.2 k-NN Graph Construction
    2.2.1 LSH-based Approach
    2.2.2 Clustering-based Approach
    2.2.3 Heuristic-based Approach
    2.2.4 Similarity Join
  2.3 Summary
Chapter 3 Fast Approximate k-NN Graph Construction
  3.1 Introduction
  3.2 Problem Formulation
  3.3 Constructing a k-Nearest Neighbor Graph
    3.3.1 Greedy Filtering
    3.3.2 Prefix Selection Scheme
    3.3.3 Optimization
  3.4 Theoretical Analysis
    3.4.1 Preliminaries
    3.4.2 Graph Construction Time
    3.4.3 Graph Accuracy
  3.5 Experiments
    3.5.1 Experimental Setup
    3.5.2 Performance Comparison
  3.6 Summary
Chapter 4 Fast Collaborative Filtering
  4.1 Introduction
  4.2 Related Work
  4.3 Fast Collaborative Filtering
    4.3.1 Nearest Neighbor Graph Construction
    4.3.2 Fast Recommendation Algorithm
  4.4 Experiments
    4.4.1 Experimental Setup
    4.4.2 Overall Comparison
    4.4.3 Effects of Parameter Changes
  4.5 Summary
Chapter 5 Fast Approximate k-NN Search
  5.1 Introduction
  5.2 Signature Selection LSH
    5.2.1 Data-dependent LSH
    5.2.2 Signature Pool Generation
    5.2.3 Signature Selection
    5.2.4 Optimization Techniques
  5.3 S2LSH for Graph Construction
    5.3.1 Feature Selection
    5.3.2 Signature Selection
    5.3.3 Optimization Techniques
  5.4 Theoretical Analysis
  5.5 Experiments
    5.5.1 Experimental Setup
    5.5.2 Experimental Results
    5.5.3 Performance Analysis
  5.6 Summary
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
A Harmonic Extension Approach for Collaborative Ranking
We present a new perspective on graph-based methods for collaborative ranking
for recommender systems. Unlike user-based or item-based methods that compute a
weighted average of ratings given by the nearest neighbors, or low-rank
approximation methods using convex optimization and the nuclear norm, we
formulate matrix completion as a series of semi-supervised learning problems,
and propagate the known ratings to the missing ones on the user-user or
item-item graph globally. The semi-supervised learning problems are expressed
as Laplace-Beltrami equations on a manifold, namely harmonic extension, and
can be discretized by a point integral method. We show that our approach does
not impose a low-rank Euclidean subspace on the data points, but instead
minimizes the dimension of the underlying manifold. Our method, named LDM (low
dimensional manifold), turns out to be particularly effective at generating
rankings of items, showing decent computational efficiency and robust ranking
quality compared to state-of-the-art methods.
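The harmonic extension step described above (propagating known ratings to missing ones over a similarity graph) can be sketched in its generic discrete-graph form, assuming a given similarity matrix; the paper's point integral discretization on a manifold is not reproduced here.

```python
import numpy as np

def harmonic_extension(W, known):
    """Propagate known values to unknown nodes so that the result is
    harmonic: each unknown value equals the weighted mean of its
    neighbors' values. W is a symmetric similarity matrix; `known`
    maps node index -> fixed value."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W               # combinatorial graph Laplacian
    lab = sorted(known)                          # labeled nodes
    unl = [i for i in range(n) if i not in known]  # unlabeled nodes
    f = np.zeros(n)
    f[lab] = [known[i] for i in lab]
    # First-order optimality of min_f f^T L f with f fixed on `lab`:
    # solve L_uu f_u = -L_ul f_l for the unlabeled block.
    f[unl] = np.linalg.solve(L[np.ix_(unl, unl)], -L[np.ix_(unl, lab)] @ f[lab])
    return f
```

On a three-node path graph with the endpoints rated 1 and 3, the middle node receives the neighbor average 2, which is the defining property of a harmonic function on the graph.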
Large Scale Spectral Clustering Using Approximate Commute Time Embedding
Spectral clustering is a novel clustering method which can detect complex
shapes of data clusters. However, it requires the eigendecomposition of the
graph Laplacian matrix, which scales cubically with the number of data points
and is thus not suitable for large-scale systems. Recently, many methods have
been proposed to accelerate spectral clustering. These approximate methods
usually involve sampling techniques, through which much of the information in
the original data may be lost. In this work, we propose a fast and accurate
spectral clustering approach using an approximate commute time embedding, which
is similar to the spectral embedding. The method requires neither any
sampling technique nor computing any eigenvector at all. Instead, it uses random
projection and a linear-time solver to find the approximate embedding. The
experiments on several synthetic and real datasets show that the proposed
approach has better clustering quality and is faster than the state-of-the-art
approximate spectral clustering methods.
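The quantity this method approximates can be made concrete. The exact commute time between two nodes comes from the Laplacian pseudoinverse, c(i, j) = vol(G) * (L+_ii + L+_jj - 2 L+_ij), and computing that pseudoinverse is the cubic-cost step the paper's random projection and linear-time solver avoid. A minimal exact computation on a tiny graph, shown only for reference:

```python
import numpy as np

def commute_times(W):
    """Exact pairwise commute times of the random walk on a weighted
    graph, via the Laplacian pseudoinverse. O(n^3): this is the cost
    the approximate embedding is designed to sidestep."""
    L = np.diag(W.sum(axis=1)) - W      # combinatorial graph Laplacian
    Lp = np.linalg.pinv(L)              # Moore-Penrose pseudoinverse
    vol = W.sum()                       # graph volume (sum of degrees)
    d = np.diag(Lp)
    # c(i, j) = vol * (Lp_ii + Lp_jj - 2 Lp_ij), for all pairs at once
    return vol * (d[:, None] + d[None, :] - 2 * Lp)
```

Commute time equals graph volume times effective resistance, so on a unit-weight path graph 0-1-2 the adjacent pair gets 4 * 1 = 4 and the endpoints get 4 * 2 = 8; embedding nodes so that squared Euclidean distances reproduce these values is exactly the commute time embedding.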
Graph Convolutional Matrix Completion
We consider matrix completion for recommender systems from the point of view
of link prediction on graphs. Interaction data such as movie ratings can be
represented by a bipartite user-item graph with labeled edges denoting observed
ratings. Building on recent progress in deep learning on graph-structured data,
we propose a graph auto-encoder framework based on differentiable message
passing on the bipartite interaction graph. Our model shows competitive
performance on standard collaborative filtering benchmarks. In settings where
complementary feature information or structured data such as a social network
is available, our framework outperforms recent state-of-the-art methods.
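One message-passing step of the kind the abstract describes (rating-specific transforms aggregated over the bipartite user-item graph) might look roughly as follows. The toy sizes, random parameters, mean accumulation, and ReLU are illustrative assumptions; the actual model's normalization scheme, dense output layer, and bilinear decoder are omitted.

```python
import numpy as np

# Toy bipartite rating graph; all sizes and values are hypothetical.
rng = np.random.default_rng(0)
n_users, n_items, n_levels, dim = 3, 4, 2, 5
R = np.array([[1, 0, 2, 0],
              [0, 2, 0, 1],
              [1, 1, 0, 0]])                # 0 = unobserved, else rating level
H_item = rng.normal(size=(n_items, dim))    # initial item representations
W = rng.normal(size=(n_levels, dim, dim))   # one transform per rating level

# One encoder step: each user aggregates messages from its rated items,
# where each message is transformed by the weight matrix of that edge's
# rating level, then a nonlinearity is applied.
H_user = np.zeros((n_users, dim))
for u in range(n_users):
    items = np.nonzero(R[u])[0]
    msgs = [W[R[u, i] - 1] @ H_item[i] for i in items]
    H_user[u] = np.maximum(np.mean(msgs, axis=0), 0.0)  # mean-accumulate + ReLU
```

Because each rating level has its own transform, the aggregation is sensitive to *how* a user rated an item, not just *whether* an edge exists, which is what distinguishes this from plain graph convolution.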
Customer purchase behavior prediction in E-commerce: a conceptual framework and research agenda
Digital retailers are handling an increasing number of online transactions from their consumers, a consequence of the convenience of buying goods via E-commerce platforms. These interactions form complex behavioral patterns which can be analyzed through predictive analytics to help businesses understand consumer needs. Despite this abundance of big data and of tools to analyze it, a systematic review of the literature is missing. This paper therefore presents a systematic literature review of recent research on customer purchase prediction in the E-commerce context. The main contributions are a novel analytical framework and a research agenda for the field. The framework identifies three main prediction tasks in the reviewed work, namely customer intents, buying sessions, and purchase decisions, together with the predictive methodologies employed for each, analyzed from three perspectives. Finally, the research agenda lays out major open issues for further research in the field of online purchase behavior prediction.
Low-Rank Matrices on Graphs: Generalized Recovery & Applications
Many real world datasets subsume a linear or non-linear low-rank structure in
a very low-dimensional space. Unfortunately, one often has very little or no
information about the geometry of the space, resulting in a highly
under-determined recovery problem. Under certain circumstances,
state-of-the-art algorithms provide an exact recovery for linear low-rank
structures, but at the expense of poorly scalable algorithms which use the
nuclear norm. However, the case of non-linear structures remains unresolved. We
revisit the problem of low-rank recovery from a totally different perspective,
involving graphs which encode pairwise similarity between the data samples and
features. Surprisingly, our analysis confirms that it is possible to recover
many approximate linear and non-linear low-rank structures, with recovery
guarantees, using a set of highly scalable and efficient algorithms. We call
such data matrices \textit{Low-Rank matrices on graphs} and show that many real
world datasets satisfy this assumption approximately due to underlying
stationarity. Our detailed theoretical and experimental analysis unveils the
power of the simple yet novel recovery framework \textit{Fast Robust PCA
on Graphs}.
Effect of Neighborhood Approximation on Downstream Analytics
Nearest neighbor search algorithms have been successful in finding practically useful solutions to computationally difficult problems. In the nearest neighbor search problem, the brute force approach is often more efficient than other algorithms for high-dimensional spaces. A special case exists for objects represented as sparse vectors, where algorithms take advantage of the fact that an object has a zero value for most features. In general, since exact nearest neighbor search methods suffer from the "curse of dimensionality," many practitioners use approximate nearest neighbor search algorithms when faced with high dimensionality or large datasets. To a reasonable degree, it is known that relying on approximate nearest neighbors leads to some error in the solutions to the underlying data mining problems the neighbors are used to solve. However, no one has attempted to quantify this error or provide practitioners with guidance in choosing appropriate search methods for their task. In this thesis, we conduct several experiments on recommender systems with the goal of finding the degree to which approximate nearest neighbor algorithms are subject to these types of error propagation problems. Additionally, we provide persuasive evidence on the trade-off between search performance and analytics effectiveness. Our experimental evaluation demonstrates that a state-of-the-art approximate nearest neighbor search method (L2KNNGApprox) is not an effective solution in most cases. When tuned to achieve high search recall (80% or higher), it provides fairly competitive recommendation performance compared to an efficient exact search method but offers no advantage in terms of efficiency (0.1x to 1.5x speedup). Low search recall (below 60%) leads to poor recommendation performance. Finally, medium recall values (60% to 80%) lead to reasonable recommendation performance but are hard to achieve and offer only a modest gain in efficiency (1.5x to 2.3x speedup).
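The search-recall metric central to this evaluation can be computed against a brute-force baseline as follows. The sketch below stands in a crude random-projection search as the approximate method purely to make the metric concrete; it is not L2KNNGApprox or any method from the thesis.

```python
import numpy as np

def knn_recall(X, Q, k=5, proj_dim=4, seed=0):
    """Search recall of an approximate k-NN method against exact
    brute-force k-NN, averaged over queries. The approximate method
    here is simply exact search in a random low-dimensional
    projection (illustrative stand-in only)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(X.shape[1], proj_dim)) / np.sqrt(proj_dim)
    # Exact neighbors: brute-force squared L2 distances in full space.
    exact = np.argsort(((X[None] - Q[:, None]) ** 2).sum(-1), axis=1)[:, :k]
    # Approximate neighbors: brute-force search in the projected space.
    Xp, Qp = X @ P, Q @ P
    approx = np.argsort(((Xp[None] - Qp[:, None]) ** 2).sum(-1), axis=1)[:, :k]
    # Recall = fraction of true neighbors the approximate method found.
    hits = [len(set(e) & set(a)) for e, a in zip(exact, approx)]
    return sum(hits) / (k * len(Q))
```

Sweeping a quality parameter of the approximate method (here, `proj_dim`) and plotting recall against downstream recommendation accuracy is exactly the kind of trade-off curve the thesis examines.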