Efficient PartialPairs SimRank search on large graphs

Abstract

The assessment of node-to-node similarities based on graph topology arises in a myriad of applications, e.g., web search. SimRank is a notable measure of this type, with the intuition that “two nodes are similar if their in-neighbors are similar”. While most existing work retrieving SimRank only considers all-pairs SimRank s(⋆, ⋆) and single-source SimRank s(⋆, j) (scores between every node and query j), there are appealing applications for partial-pairs SimRank, e.g., similarity join. Given two node subsets A and B in a graph, partial-pairs SimRank assessment aims to retrieve only {s(a, b)}∀a∈A,∀b∈B. However, the best-known solution [17] is not self-contained since it hinges on the premise that the SimRank scores with node-pairs in an h-go cover set must be given beforehand. This paper focuses on efficient assessment of partial-pairs SimRank in a self-contained manner. (1) We devise a novel “seed germination” model that computes partial-pairs Sim- Rank in O(k|E|min{|A|, |B|}) time and O(|E|+k|V |) memory for k iterations on a graph of |V | nodes and |E| edges. (2) We further eliminate unnecessary edge access to improve the time of partial-pairs SimRank to O(mmin{|A|, |B|}), where m ≤ min{k|E|, 2k}, and is the maximum degree. (3) We show that our partial-pairs SimRank model also can handle the computations of all-pairs and single-source Sim- Ranks, as well as partial-pairs SimRank* (a related notion of SimRank). (4) We empirically verify that our algorithms are (a) 38x faster than the best-known competitors, and (b) memory-efficient, allowing scores to be assessed accurately on graphs with tens of millions of links

    Similar works