219 research outputs found

    Towards efficient SimRank computation on large networks

    SimRank has been a powerful model for assessing the similarity of pairs of vertices in a graph. It is based on the concept that two vertices are similar if they are referenced by similar vertices. Due to its self-referentiality, fast SimRank computation on large graphs poses significant challenges. The state-of-the-art work [17] exploits partial sums memoization for computing SimRank in O(Kmn) time on a graph with n vertices and m edges, where K is the number of iterations. Partial sums memoization reduces repeated calculations by caching part of the similarity summations for later reuse. However, we observe that computations among different partial sums may themselves be redundant. Besides, for a desired accuracy ϵ, the existing SimRank model requires K = ⌈log_C ϵ⌉ iterations [17], where C is a damping factor. Nevertheless, such a geometric rate of convergence is slow in practice if a high accuracy is desirable. In this paper, we address these gaps. (1) We propose an adaptive clustering strategy to eliminate partial sums redundancy (i.e., duplicate computations occurring in partial sums), and devise an efficient algorithm for speeding up the computation of SimRank to O(Kdn^2) time, where d is typically much smaller than the average in-degree of a graph. (2) We also present a new notion of SimRank that is based on a differential equation and can be represented as an exponential sum of transition matrices, as opposed to the geometric sum of the conventional counterpart. This leads to a further speedup in the convergence rate of SimRank iterations. (3) Using real and synthetic data, we empirically verify that our approach of partial sums sharing outperforms the best known algorithm by up to one order of magnitude, and that our revised notion of SimRank further achieves a 5X speedup on large graphs while also fairly preserving the relative order of original SimRank scores.
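For orientation, the conventional iterative SimRank model that this abstract takes as its baseline can be sketched as below. This is a naive dense-matrix illustration, not the partial-sums sharing algorithm or the differential variant proposed in the paper. The function name `simrank`, the default damping factor 0.6, and the adjacency-matrix input are assumptions made for the example; the iteration count follows the K = ⌈log_C ϵ⌉ rule quoted in the abstract.

```python
import math
import numpy as np

def simrank(adj, C=0.6, eps=1e-3):
    """Naive iterative SimRank (illustrative sketch).

    adj[i][j] = 1 means there is an edge i -> j.
    Returns the similarity matrix and the number of iterations used.
    """
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    in_deg = A.sum(axis=0)
    # W[i, a] = 1/|I(a)| if i -> a; columns of nodes without in-neighbours stay zero.
    W = np.divide(A, in_deg, out=np.zeros_like(A), where=in_deg > 0)
    # Iteration count for accuracy eps, as stated in the abstract.
    K = math.ceil(math.log(eps) / math.log(C))
    S = np.eye(n)
    for _ in range(K):
        # s(a, b) = C / (|I(a)| |I(b)|) * sum of s(i, j) over in-neighbour pairs
        S = C * (W.T @ S @ W)
        np.fill_diagonal(S, 1.0)  # every vertex is maximally similar to itself
    return S, K
```

On a three-node graph where node 0 points to both node 1 and node 2, the two targets share their only in-neighbour, so their score equals the damping factor C.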

    Gauging Correct Relative Rankings For Similarity Search

    © 2015 ACM. One of the important tasks in link analysis is to quantify the similarity between two objects based on hyperlink structure. SimRank is an attractive similarity measure of this type. Existing work mainly focuses on absolute SimRank scores, and often harnesses an iterative paradigm to compute them. While these iterative scores converge to the exact ones as the number of iterations increases, it is still notoriously difficult to determine how well the relative orders of these iterative scores are preserved at a given iteration. In this paper, we propose efficient ranking criteria that can secure correct relative orders of node-pairs with respect to SimRank scores when they are computed in an iterative fashion. Moreover, we show the superiority of our criteria in harvesting top-K SimRank scores and bucket orders from a full ranking list. Finally, empirical studies verify the usefulness of our techniques for SimRank top-K ranking and bucket ordering.

    Exact Single-Source SimRank Computation on Large Graphs

    SimRank is a popular measurement for evaluating the node-to-node similarities based on the graph topology. In recent years, single-source and top-k SimRank queries have received increasing attention due to their applications in web mining, social network analysis, and spam detection. However, a fundamental obstacle in studying SimRank has been the lack of ground truths. The only exact algorithm, Power Method, is computationally infeasible on graphs with more than 10^6 nodes. Consequently, no existing work has evaluated the actual trade-offs between query time and accuracy on large real-world graphs. In this paper, we present ExactSim, the first algorithm that computes the exact single-source and top-k SimRank results on large graphs. With high probability, this algorithm produces ground truths with a rigorous theoretical guarantee. We conduct extensive experiments on real-world datasets to demonstrate the efficiency of ExactSim. The results show that ExactSim provides the ground truth for any single-source SimRank query with a precision up to 7 decimal places within a reasonable query time. Comment: ACM SIGMOD 202

    High quality graph-based similarity search

    SimRank is an influential link-based similarity measure that has been used in many fields of Web search and sociometry. The best-of-breed method by Kusumoto et al., however, does not always deliver high-quality results, since it fails to accurately obtain its diagonal correction matrix D. Besides, SimRank is also limited by an unwanted "connectivity trait": increasing the number of paths between nodes a and b often incurs a decrease in score s(a,b). The best-known solution, SimRank++, cannot resolve this problem, since a revised score will be zero if a and b have no common in-neighbors. In this paper, we consider high-quality similarity search. Our scheme, SR#, is efficient and semantically meaningful: (1) We first formulate the exact D, and devise a "varied-D" method to accurately compute SimRank in linear memory. Moreover, by grouping computation, we also reduce the time from quadratic to linear in the number of iterations. (2) We design a "kernel-based" model to improve the quality of SimRank, and circumvent the "connectivity trait" issue. (3) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument: "if D is replaced by a scaled identity matrix, top-K rankings will not be affected much". The experiments confirm that SR# can accurately extract high-quality scores, and is much faster than the state-of-the-art competitors.

    ASAP: towards accurate, stable and accelerative penetrating-rank estimation on large graphs

    Pervasive web applications increasingly require a measure of similarity among objects. Penetrating-Rank (P-Rank) has been one of the promising link-based similarity metrics as it provides a comprehensive way of jointly encoding both incoming and outgoing links into computation for emerging applications. In this paper, we investigate the P-Rank efficiency problem, which encompasses its accuracy, stability and computational time. (1) We provide an accuracy estimate for iteratively computing P-Rank. A symmetric problem is to find the iteration number K needed for achieving a given accuracy ε. (2) We also analyze the stability of P-Rank, by showing that small choices of the damping factors would make P-Rank more stable and well-conditioned. (3) For undirected graphs, we also explicitly characterize the P-Rank solution in terms of matrices. This results in a novel non-iterative algorithm, termed ASAP, for efficiently computing P-Rank, which improves the CPU time from O(n^4) to O(n^3). Using real and synthetic data, we empirically verify the effectiveness and efficiency of our approaches.
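To make the "both incoming and outgoing links" idea concrete, here is a rough sketch of the plain iterative P-Rank recurrence, not the non-iterative ASAP algorithm the abstract describes. It assumes the commonly cited form of P-Rank, in which an in-link term and an out-link term are mixed by a weight; the names `p_rank`, `lam`, `c_in`, `c_out` and `iters`, and all default values, are chosen for this example and do not come from the abstract.

```python
import numpy as np

def _col_normalise(M):
    """Divide each column by its sum; all-zero columns stay zero."""
    s = M.sum(axis=0)
    return np.divide(M, s, out=np.zeros_like(M), where=s > 0)

def p_rank(adj, c_in=0.6, c_out=0.6, lam=0.5, iters=10):
    """Naive iterative P-Rank (illustrative sketch).

    adj[i][j] = 1 means there is an edge i -> j.
    lam weights the in-link term against the out-link term (assumed form).
    """
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    W_in = _col_normalise(A)     # W_in[i, a]  = 1/|I(a)| if i -> a
    W_out = _col_normalise(A.T)  # W_out[j, a] = 1/|O(a)| if a -> j
    S = np.eye(n)
    for _ in range(iters):
        in_part = c_in * (W_in.T @ S @ W_in)      # similarity via in-neighbours
        out_part = c_out * (W_out.T @ S @ W_out)  # similarity via out-neighbours
        S = lam * in_part + (1.0 - lam) * out_part
        np.fill_diagonal(S, 1.0)  # each vertex is fully similar to itself
    return S
```

Under this assumed form, setting `lam=1.0` drops the out-link term and the recurrence reduces to ordinary SimRank.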

    PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs

    SimRank is a classic measure of the similarities of nodes in a graph. Given a node u in graph G = (V, E), a single-source SimRank query returns the SimRank similarities s(u, v) between node u and each node v ∈ V. This type of query has numerous applications in web search and social network analysis, such as link prediction, web mining, and spam detection. Existing methods for single-source SimRank queries, however, incur query cost at least linear in the number of nodes n, which renders them inapplicable for real-time and interactive analysis. This paper proposes PRSim, an algorithm that exploits the structure of graphs to efficiently answer single-source SimRank queries. PRSim uses an index of size O(m), where m is the number of edges in the graph, and guarantees a query time that depends on the reverse PageRank distribution of the input graph. In particular, we prove that PRSim runs in sub-linear time if the degree distribution of the input graph follows the power-law distribution, a property possessed by many real-world graphs. Based on the theoretical analysis, we show that the empirical query time of all existing SimRank algorithms also depends on the reverse PageRank distribution of the graph. Finally, we present the first experimental study that evaluates the absolute errors of various SimRank algorithms on large graphs, and we show that PRSim outperforms the state of the art in terms of query time, accuracy, index size, and scalability. Comment: ACM SIGMOD 201

    On the efficiency of estimating penetrating rank on large graphs

    P-Rank (Penetrating Rank) has been suggested as a useful measure of structural similarity that takes account of both incoming and outgoing edges in ubiquitous networks. Existing work often utilizes memoization to compute P-Rank similarity in an iterative fashion, which requires cubic time in the worst case. Besides, previous methods mainly focus on the deterministic computation of P-Rank, but lack a probabilistic framework that scales well for large graphs. In this paper, we propose two efficient algorithms for computing P-Rank on large graphs. The first observation is that a large body of objects in a real graph usually share similar neighborhood structures. By merging such objects with an explicit low-rank factorization, we devise a deterministic algorithm to compute P-Rank in quadratic time. The second observation is that by converting the iterative form of P-Rank into a matrix power series form, we can leverage the random sampling approach to probabilistically compute P-Rank in linear time with provable accuracy guarantees. The empirical results on both real and synthetic datasets show that our approaches achieve high time efficiency with controlled error and outperform the baseline algorithms by at least one order of magnitude.

    Flow-based Influence Graph Visual Summarization

    Visually mining a large influence graph is appealing yet challenging. People are amazed by pictures of newscasting graphs on Twitter and engaged by hidden citation networks in academia, yet are often troubled by the poor readability of the underlying visualization. Existing summarization methods enhance the graph visualization with blocked views, but have an adverse effect on the latent influence structure. How can we visually summarize a large graph to maximize influence flows? In particular, how can we illustrate the impact of an individual node through the summarization? Can we maintain the appealing graph metaphor while preserving both the overall influence pattern and fine readability? To answer these questions, we first formally define the influence graph summarization problem. Second, we propose an end-to-end framework to solve the new problem. Our method can not only highlight the flow-based influence patterns in the visual summarization, but also inherently support rich graph attributes. Last, we present a theoretical analysis and report our experimental results. Both lines of evidence demonstrate that our framework can effectively approximate the proposed influence graph summarization objective while outperforming previous methods in a typical scenario of visually mining academic citation networks. Comment: to appear in IEEE International Conference on Data Mining (ICDM), Shen Zhen, China, December 201