653 research outputs found

TPA: Fast, Scalable, and Accurate Method for Approximate Random Walk with Restart on Billion Scale Graphs

Author: Jung Jinhong
Kang U
Yoon Minji
Publication date: 03/12/2017

Given a large graph, how can we determine similarity between nodes in a fast and accurate way? Random walk with restart (RWR) is a popular measure for this purpose and has been exploited in numerous data mining applications including ranking, anomaly detection, link prediction, and community detection. However, previous methods for computing exact RWR require prohibitive storage sizes and computational costs, and alternative methods which avoid such costs by computing approximate RWR have limited accuracy. In this paper, we propose TPA, a fast, scalable, and highly accurate method for computing approximate RWR on large graphs. TPA exploits two important properties of RWR: 1) nodes close to a seed node are likely to be revisited in following steps due to the block-wise structure of many real-world graphs, and 2) RWR scores of nodes which reside far from the seed node are proportional to their PageRank scores. Based on these two properties, TPA divides the approximate RWR problem into two subproblems called neighbor approximation and stranger approximation. In the neighbor approximation, TPA estimates RWR scores of nodes close to the seed based on the scores from the first few steps from the seed. In the stranger approximation, TPA estimates RWR scores for nodes far from the seed using their PageRank. The stranger and neighbor approximations are conducted in the preprocessing phase and the online phase, respectively. Through extensive experiments, we show that TPA requires up to 3.5x less time with up to 40x less memory space than other state-of-the-art methods for the preprocessing phase. In the online phase, TPA computes approximate RWR up to 30x faster than existing methods while maintaining high accuracy. Comment: 12 pages, 10 figures
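To make the role of the "early steps from the seed" concrete, here is a minimal sketch of RWR written as a sum over walk steps. It is not TPA itself: it simply truncates the series after a fixed number of terms, whereas the abstract says TPA estimates the remaining part through its neighbor and stranger approximations. The function name `rwr_partial_sum`, the dense NumPy adjacency matrix, and the restart value 0.15 are assumptions chosen for illustration.

```python
import numpy as np

def rwr_partial_sum(adj, seed, restart=0.15, steps=50):
    """Sum the first `steps` terms of the RWR series for a single seed node.

    Term i is restart * (1 - restart)**i * (P^T)**i * q, where P is the
    row-normalized adjacency matrix and q is the one-hot seed vector.
    Assumes every node has at least one outgoing edge (hypothetical helper).
    """
    adj = np.asarray(adj, dtype=float)
    out_deg = adj.sum(axis=1, keepdims=True)
    if np.any(out_deg == 0):
        raise ValueError("this sketch assumes no dangling nodes")
    trans_t = (adj / out_deg).T          # column-stochastic transition matrix

    term = np.zeros(adj.shape[0])
    term[seed] = restart                 # i = 0 term: restart * q
    scores = term.copy()
    for _ in range(1, steps):
        term = (1.0 - restart) * (trans_t @ term)   # next term of the series
        scores += term
    return scores
```

Because each term carries total mass `restart * (1 - restart)**i`, the first `steps` terms account for `1 - (1 - restart)**steps` of the total score mass. With `restart=0.15`, five steps already cover a little over half of it, which gives some intuition for why the early steps matter most for nodes near the seed.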

arXiv.org e-Print Archive

Crossref

SNU Open Repository and Archive

Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis

Author: Mahoney Michael W.
Publication date: 01/01/2012

Database theory and database practice are typically the domain of computer scientists who adopt what may be termed an algorithmic perspective on their data. This perspective is very different from the more statistical perspective adopted by statisticians, scientific computers, machine learners, and others who work on what may be broadly termed statistical data analysis. In this article, I will address fundamental aspects of this algorithmic-statistical disconnect, with an eye to bridging the gap between these two very different approaches. A concept that lies at the heart of this disconnect is that of statistical regularization, a notion that has to do with how robust the output of an algorithm is to the noise properties of the input data. Although it is nearly completely absent from computer science, which historically has taken the input data as given and modeled algorithms discretely, regularization in one form or another is central to nearly every application domain that applies algorithms to noisy data. By using several case studies, I will illustrate, both theoretically and empirically, the nonobvious fact that approximate computation, in and of itself, can implicitly lead to statistical regularization. This and other recent work suggests that, by exploiting in a more principled way the statistical properties implicit in worst-case algorithms, one can in many cases satisfy the bicriteria of having algorithms that are scalable to very large-scale databases and that also have good inferential or predictive properties. Comment: To appear in the Proceedings of the 2012 ACM Symposium on Principles of Database Systems (PODS 2012)

arXiv.org e-Print Archive

CiteSeerX

Random Walk-Based Large-Scale Graph Mining Exploiting Real-World Graph Characteristics (실세계 그래프 특징을 활용한 랜덤 워크 기반 대규모 그래프 마이닝)

Author: 정진홍
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/02/2020

Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: 강유 (U Kang).

Numerous real-world relationships are represented as graphs, such as social networks, hyperlink networks, and protein interaction networks. Analyzing those networks is important for understanding real-life phenomena. Among various graph analysis techniques, random walk has been widely used in many applications with satisfactory results. However, many real-world graphs are large and complicated, with diverse labels. Traditional random walk based methods incur heavy computational cost and disregard those labels when performing random walks; thus, their use has been limited in such large and complicated graphs. In this thesis, I handle the technical challenges of mining large real-world graphs based on random walk. Real-world graphs have distinct structural properties which become a basis for improving the performance of random walk in terms of speed and quality. Based upon this idea, I develop fast, scalable, and exact methods for node ranking using random walk in large-scale plain networks. I also design accurate models using random walks for node ranking and relational reasoning in labeled graphs such as signed networks and knowledge bases. Through extensive experiments on various real-world graphs, I demonstrate the effectiveness of the methods and models proposed in this thesis. The proposed methods process up to 100 times larger graphs, and require up to 130 times less memory with up to 9 times faster speed compared to other existing methods, successfully scaling to billion-scale graphs. Also, the proposed models substantially improve predictive performance on a variety of tasks in labeled graphs such as signed networks and knowledge bases, including sign prediction, link prediction, anomaly detection, and relational reasoning.

Contents:
Chapter 1 Overview
  1.1 Motivation
  1.2 Research Statement
    1.2.1 Research Goals and Importance
    1.2.2 Technical Challenges
    1.2.3 Main Approaches
    1.2.4 Contributions
    1.2.5 Overall Impact
  1.3 Thesis Organization
Chapter 2 Background
  2.1 Definitions
    2.1.1 Notations on Graphs
    2.1.2 Random Walk with Restart
  2.2 Related Works
    2.2.1 Previous Methods for RWR in Plain Graphs
    2.2.2 Ranking Models in Signed Networks
    2.2.3 Relational Reasoning Models in Edge-labeled Graphs
Chapter 3 Fast and Scalable Ranking in Large-scale Plain Graphs
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Iterative Methods for RWR
    3.2.2 Preprocessing Methods for RWR
  3.3 Proposed Method
    3.3.1 Overview
    3.3.2 BePI-B: Exploiting Graph Characteristics for Node Reordering and Block Elimination
    3.3.3 BePI-B: Incorporating an Iterative Method into Block Elimination
    3.3.4 BePI-S: Sparsifying the Schur Complement
    3.3.5 BePI: Preconditioning a Linear System for the Iterative Method
  3.4 Theoretical Results
    3.4.1 Time Complexity
    3.4.2 Space Complexity
    3.4.3 Accuracy Bound
    3.4.4 Lemmas and Proofs
  3.5 Experiments
    3.5.1 Experimental Settings
    3.5.2 Preprocessing Cost
    3.5.3 Query Cost
    3.5.4 Scalability
    3.5.5 Effects of Sparse Schur Complement and Preconditioning
    3.5.6 Effects of the Hub Selection Ratio
    3.5.7 Accuracy
    3.5.8 Comparison with the State-of-the-Art Method
  3.6 Summary
Chapter 4 Personalized Ranking in Signed Graphs
  4.1 Introduction
  4.2 Problem Definition
  4.3 Proposed Method
    4.3.1 Signed Random Walk with Restart Model
    4.3.2 SRWR-Iter: Iterative Algorithm for Signed Random Walk with Restart
    4.3.3 SRWR-Pre: Preprocessing Algorithm for Signed Random Walk with Restart
  4.4 Experiments
    4.4.1 Experimental Settings
    4.4.2 Link Prediction Task
    4.4.3 User Preference Preservation Task
    4.4.4 Troll Identification Task
    4.4.5 Sign Prediction Task
    4.4.6 Effectiveness of Balance Attenuation Factors
    4.4.7 Performance of SRWR-Pre
  4.5 Summary
Chapter 5 Relational Reasoning in Edge-labeled Graphs
  5.1 Introduction
  5.2 Preliminary
  5.3 Proposed Method
    5.3.1 Label Transition Observation
    5.3.2 Learning Label Transition Probabilities
    5.3.3 Multi-Labeled Random Walk with Restart
    5.3.4 Formulation for MuRWR
    5.3.5 Algorithm for MuRWR
  5.4 Theoretical Results
    5.4.1 Lemma for Solution of Label Transition Probabilities and Convexity
    5.4.2 Lemma for Recursive Equation of MuRWR Score Matrix
    5.4.3 Lemma for Spectral Radius in Convergence Theorem
    5.4.4 Lemma for Complexity Analysis
  5.5 Experiment
    5.5.1 Experimental Settings
    5.5.2 Relation Inference Task
    5.5.3 Effects of Label Weights in MuRWR
    5.5.4 Effects of Restart Probability in MuRWR
    5.5.5 Convergence of MuRWR
  5.6 Summary
Chapter 6 Future Works
  6.1 Fast and Accurate Pseudoinverse Computation
  6.2 Fast and Scalable Signed Network Generation
  6.3 Disk-based Algorithms for Random Walk
Chapter 7 Conclusion
References
Appendix
  A.1 Hub-and-Spoke Reordering Method
  A.2 Time Complexity of Sparse Matrix Multiplication
  A.3 Details of Preconditioned GMRES
  A.4 Detailed Description of Evaluation Metrics
    A.4.1 Link Prediction
    A.4.2 Troll Identification
  A.5 Discussion on Relative Trustworthiness of SRWR
Abstract in Korean
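As background for the iterative and preprocessing methods named in the Chapter 3 outline, the RWR score vector can be written in three equivalent forms. The notation is an assumption for illustration and not necessarily the thesis's own: $c \in (0,1)$ is the restart probability, $\tilde{A}$ is the row-normalized adjacency matrix, $q$ is the one-hot seed vector, and $r$ is the RWR score vector.

```latex
\begin{align}
r &= (1-c)\,\tilde{A}^{\top} r + c\,q \\
\Longleftrightarrow\quad \bigl(I - (1-c)\,\tilde{A}^{\top}\bigr)\, r &= c\,q \\
\Longleftrightarrow\quad r &= c \sum_{i=0}^{\infty} (1-c)^{i} \bigl(\tilde{A}^{\top}\bigr)^{i} q
\end{align}
```

Iterative methods repeatedly apply the first form. Methods built on block elimination and preconditioning typically work with the linear system in the second form. The series in the third form converges because $(1-c)\,\tilde{A}^{\top}$ has spectral radius below 1.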

SNU Open Repository and Archive

Reducing Seed Noise in Personalized PageRank

Author: Sapino Maria Luisa
Selç
Shengyu Huang
Xinsheng Li
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

Institutional Research Information System University of Turin

Efficient Node Proximity and Node Significance Computations in Graphs

Publication date: 01/01/2017

Node proximity measures are commonly used to quantify how nearby or otherwise related two or more nodes in a graph are. Node significance measures are mainly used to determine how important nodes are in a graph. Both kinds of measures have been highly effective in many prediction tasks and applications. Despite their effectiveness, however, they have several shortcomings. The first is scalability: they are costly to compute on large graphs. The second is low accuracy when the significance of a node and its degree in the graph are not related. The third is reduced effectiveness when information about the graph is uncertain: for an uncertain graph, computing ranking scores over all possible worlds requires exponential computation cost. In this thesis, I first introduce Locality-sensitive, Re-use promoting, approximate Personalized PageRank (LR-PPR), an approximate personalized PageRank method that computes node rankings from the locality information around the seeds, without computing over the entire graph, and reuses the precomputed locality information for different locality combinations. For the identification of locality information, I present Impact Neighborhood Indexing (INI), which finds impact neighborhoods by propagating nodes' fingerprints over the network. For the accuracy challenge, I introduce the Degree Decoupled PageRank (D2PR) technique to improve the effectiveness of PageRank-based knowledge discovery, especially considering the significance of neighbors and the degree of a given node. To tackle the uncertainty challenge, I introduce Uncertain Personalized PageRank (UPPR), which approximately computes personalized PageRank values under uncertainty in edge existence, and Interval Personalized PageRank with Integration (IPPR-I) and Interval Personalized PageRank with Mean (IPPR-M), which compute ranking scores when the uncertainty lies in edge weights given as interval values. Dissertation/Thesis. Doctoral Dissertation Computer Science 201

ASU Digital Repository

Fast Computation Techniques for Personalized PageRank on Large Graphs (큰 그래프 상에서의 개인화된 페이지 랭크에 대한 빠른 계산 기법)

Author: 박성찬
Publication venue: Seoul National University Graduate School (서울대학교 대학원)
Publication date: 01/08/2020

Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: 이상구 (Sang-goo Lee).

Computation of Personalized PageRank (PPR) in graphs is an important function that is widely utilized in myriad application domains such as search, recommendation, and knowledge discovery. Because the computation of PPR is an expensive process, a good number of innovative and efficient algorithms for computing PPR have been developed. However, efficient computation of PPR within very large graphs with millions of nodes or more is still an open problem. Moreover, previously proposed algorithms cannot handle updates efficiently, which severely limits their ability to handle dynamic graphs. In this paper, we present a fast converging algorithm that guarantees high and controlled precision. We improve the convergence rate of the traditional Power Iteration method by adopting successive over-relaxation and initial guess revision, a vector reuse strategy. The proposed method vastly improves on the traditional Power Iteration in terms of convergence rate and computation time, while retaining its simplicity and strictness. Since it can reuse previously computed vectors when refreshing PPR vectors, its update performance is also greatly enhanced. Also, since the algorithm halts as soon as it reaches a given error threshold, we can flexibly control the trade-off between accuracy and time, a feature lacking in both sampling-based approximation methods and fully exact (matrix-inversion-based) methods. Experiments show that the proposed algorithm is at least 20 times faster than the Power Iteration and outperforms other state-of-the-art algorithms.

Contents:
1 Introduction
2 Preliminaries: Personalized PageRank
  2.1 Random Walk, PageRank, and Personalized PageRank
    2.1.1 Basics on Random Walk
    2.1.2 PageRank
    2.1.3 Personalized PageRank
  2.2 Characteristics of Personalized PageRank
  2.3 Applications of Personalized PageRank
  2.4 Previous Work on Personalized PageRank Computation
    2.4.1 Basic Algorithms
    2.4.2 Enhanced Power Iteration
    2.4.3 Bookmark Coloring Algorithm
    2.4.4 Dynamic Programming
    2.4.5 Monte-Carlo Sampling
    2.4.6 Enhanced Direct Solving
  2.5 Summary
3 Personalized PageRank Computation with Initial Guess Revision
  3.1 Initial Guess Revision and Relaxation
  3.2 Finding Optimal Weight of Successive Over Relaxation for PPR
  3.3 Initial Guess Construction Algorithm for Personalized PageRank
4 Fully Personalized PageRank Algorithm with Initial Guess Revision
  4.1 FPPR with IGR
  4.2 Optimization
  4.3 Experiments
5 Personalized PageRank Query Processing with Initial Guess Revision
  5.1 PPR Query Processing with IGR
  5.2 Optimization
  5.3 Experiments
6 Conclusion
Bibliography
Appendix
Abstract (In Korean)
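The two ideas the abstract highlights, stopping at a given error threshold and reusing a previously computed vector, can be illustrated with plain Power Iteration. The sketch below is not the thesis's algorithm: it omits successive over-relaxation and the thesis's initial guess construction, and simply accepts an optional starting vector. The names `row_normalize` and `ppr_power_iteration`, the `init` parameter, and the default restart probability are assumptions for illustration.

```python
import numpy as np

def row_normalize(adj):
    """Return the row-stochastic transition matrix of a dense adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    if np.any(deg == 0):
        raise ValueError("this sketch assumes every node has an outgoing edge")
    return adj / deg

def ppr_power_iteration(trans, seed_vec, restart=0.15, tol=1e-10,
                        init=None, max_iter=10_000):
    """Power Iteration for PPR with an L1 stopping threshold.

    `init` optionally supplies a starting vector (for example, a PPR vector
    computed before a graph update). Returns (scores, iterations_used).
    """
    r = seed_vec.copy() if init is None else init.copy()
    for k in range(1, max_iter + 1):
        r_next = (1.0 - restart) * (trans.T @ r) + restart * seed_vec
        if np.abs(r_next - r).sum() < tol:   # halt once the update is small
            return r_next, k
        r = r_next
    return r, max_iter
```

Raising `tol` trades accuracy for time, which is the kind of control the abstract describes. Passing an old PPR vector as `init` after a small graph change starts the iteration closer to the new solution than the seed vector would, although how many iterations this saves depends on the graph and on the size of the change.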

SNU Open Repository and Archive