137 research outputs found

    Personalized PageRank with Node-dependent Restart

    Personalized PageRank is an algorithm for classifying the importance of web pages on a user-dependent basis. We introduce two generalizations of Personalized PageRank with node-dependent restart. The first generalization is based on the proportion of visits to nodes before the restart, whereas the second is based on the probability of the node visited just before the restart. In the original case of constant restart probability, the two measures coincide. We discuss interesting particular cases of restart probabilities and restart distributions. We show that both generalizations of Personalized PageRank have an elegant expression connecting the so-called direct and reverse Personalized PageRanks, which yields a symmetry property of these Personalized PageRanks.
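
    The two measures described in this abstract can be illustrated with a small numerical sketch. This is one reading of the abstract, not the paper's own formulation: the names `node_dependent_ppr`, `P`, `alpha`, and `v` are hypothetical, and `alpha[i]` is assumed to be the probability of continuing (not restarting) at node `i`.

```python
import numpy as np

def node_dependent_ppr(P, alpha, v):
    """Sketch of two PPR variants with node-dependent restart.

    P     : (n, n) row-stochastic transition matrix (assumed input).
    alpha : alpha[i] is the assumed probability of continuing the walk
            at node i; the walk restarts there with probability 1 - alpha[i].
    v     : restart distribution over the n nodes.
    """
    P = np.asarray(P, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    v = np.asarray(v, dtype=float)
    n = P.shape[0]
    # Expected number of visits to each node between two restarts:
    # visits = v (I - diag(alpha) P)^{-1}, solved as a linear system.
    M = np.eye(n) - alpha[:, None] * P
    visits = np.linalg.solve(M.T, v)
    # Measure 1: proportion of visits to each node before the restart.
    occupation = visits / visits.sum()
    # Measure 2: probability that the restart happens from each node.
    restart_location = visits * (1.0 - alpha)
    return occupation, restart_location

P = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
v = [1.0, 0.0, 0.0]
occ, loc = node_dependent_ppr(P, [0.85, 0.85, 0.85], v)
print(np.allclose(occ, loc))  # constant restart: the two measures agree
```

    With a constant `alpha` the two vectors coincide, matching the statement in the abstract; with node-dependent values they generally differ.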

    Fast Distributed PageRank Computation

    Over the last decade, PageRank has gained importance in a wide range of applications and domains, ever since it first proved to be effective in determining node importance in large graphs (and was a pioneering idea behind Google's search engine). In distributed computing alone, the PageRank vector, or more generally random-walk-based quantities, has been used for several different applications, including determining important nodes, load balancing, search, and identifying connectivity structures. Surprisingly, however, there has been little work towards designing provably efficient fully distributed algorithms for computing PageRank. The difficulty is that traditional iterative methods in the style of matrix-vector multiplication may not always adapt well to the distributed setting, owing to communication bandwidth restrictions and convergence rates. In this paper, we present fast random-walk-based distributed algorithms for computing PageRanks in general graphs and prove strong bounds on the round complexity. We first present a distributed algorithm that takes $O(\log n/\epsilon)$ rounds with high probability on any graph (directed or undirected), where $n$ is the network size and $\epsilon$ is the reset probability used in the PageRank computation (typically $\epsilon$ is a fixed constant). We then present a faster algorithm that takes $O(\sqrt{\log n}/\epsilon)$ rounds in undirected graphs. Both of the above algorithms are scalable, as each node sends only a small ($\operatorname{polylog} n$) number of bits over each edge per round. To the best of our knowledge, these are the first fully distributed algorithms for computing the PageRank vector with provably efficient running time. Comment: 14 pages
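
    The random-walk view of PageRank that this abstract builds on can be sketched with a plain Monte Carlo estimate. The code below is a centralized, single-machine simulation under stated assumptions, not the distributed algorithm of the paper: the function name, the adjacency-list input, the uniform reset, and the requirement that every node has at least one out-neighbor are all assumptions made for illustration.

```python
import random

def monte_carlo_pagerank(out_neighbors, eps, walks_per_node, seed=0):
    """Estimate PageRank (uniform reset) by counting random-walk visits.

    out_neighbors  : list where out_neighbors[u] lists the successors of u;
                     every node is assumed to have at least one successor.
    eps            : reset probability; each walk stops with this
                     probability after every visit.
    walks_per_node : number of walks launched from each node.
    """
    rng = random.Random(seed)
    n = len(out_neighbors)
    visits = [0] * n
    for start in range(n):
        for _ in range(walks_per_node):
            node = start
            while True:
                visits[node] += 1
                if rng.random() < eps:
                    break  # the walk is terminated (reset)
                node = rng.choice(out_neighbors[node])
    # Each walk makes 1/eps visits in expectation, so this scaling
    # turns visit counts into estimates that sum to about 1.
    scale = eps / (n * walks_per_node)
    return [count * scale for count in visits]

graph = [[1, 2], [2], [0], [2]]  # node 3 has no incoming edges
estimate = monte_carlo_pagerank(graph, eps=0.2, walks_per_node=500)
print(estimate)
```

    In this toy graph, node 3 is visited only at the start of its own walks, so its estimate equals `eps / n`, the reset mass every node receives.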

    Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation

    A fundamental problem arising in many applications in Web science and social network analysis is, given an arbitrary approximation factor $c > 1$, to output a set $S$ of nodes that with high probability contains all nodes of PageRank at least $\Delta$, and no node of PageRank smaller than $\Delta/c$. We call this problem SignificantPageRanks. We develop a nearly optimal, local algorithm for the problem with runtime complexity $\tilde{O}(n/\Delta)$ on networks with $n$ nodes. We show that any algorithm for solving this problem must have a runtime of $\Omega(n/\Delta)$, rendering our algorithm optimal up to logarithmic factors. Our algorithm comes with two main technical contributions. The first is a multi-scale sampling scheme for a basic matrix problem that could be of interest on its own. In the abstract matrix problem it is assumed that one can access an unknown right-stochastic matrix by querying its rows, where the cost of a query and the accuracy of the answers depend on a precision parameter $\epsilon$. At a cost proportional to $1/\epsilon$, the query will return a list of $O(1/\epsilon)$ entries and their indices that provide an $\epsilon$-precision approximation of the row. Our task is to find a set that contains all columns whose sum is at least $\Delta$, and omits any column whose sum is less than $\Delta/c$. Our multi-scale sampling scheme solves this problem with cost $\tilde{O}(n/\Delta)$, while traditional sampling algorithms would take time $\Theta((n/\Delta)^2)$. Our second main technical contribution is a new local algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed in \cite{JehW03,AndersenCL06} and is highly efficient, particularly for networks with large in-degrees or out-degrees. Together with our multi-scale sampling scheme, we are able to optimally solve the SignificantPageRanks problem. Comment: Accepted to Internet Mathematics journal for publication. An extended abstract of this paper appeared in WAW 2012 under the title "A Sublinear Time Algorithm for PageRank Computations".
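
    The acceptance condition of the SignificantPageRanks problem can be made concrete with a small checker. This is only a sketch of the problem statement, not of the sublinear algorithm: the function name and the dictionary of exact PageRank values are hypothetical, and nodes with PageRank between $\Delta/c$ and $\Delta$ may be included or left out.

```python
def is_valid_significant_set(pagerank, S, delta, c):
    """Check whether S is an acceptable answer for threshold delta, factor c.

    pagerank : dict mapping node -> exact PageRank value (assumed known
               here purely for checking purposes).
    S        : candidate set of nodes.
    """
    S = set(S)
    # Every node with PageRank at least delta must be reported.
    must_include = {v for v, p in pagerank.items() if p >= delta}
    # No node with PageRank below delta / c may be reported.
    must_exclude = {v for v, p in pagerank.items() if p < delta / c}
    return must_include <= S and not (S & must_exclude)

scores = {"a": 4.0, "b": 2.5, "c": 1.2, "d": 0.3}
print(is_valid_significant_set(scores, {"a", "b"}, delta=2.0, c=2.0))       # True
print(is_valid_significant_set(scores, {"a", "b", "c"}, delta=2.0, c=2.0))  # True
print(is_valid_significant_set(scores, {"a", "d"}, delta=2.0, c=2.0))       # False
```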

    Asymptotic analysis for personalized Web search

    Personalized PageRank is used in Web search as an importance measure for Web documents. The goal of this paper is to characterize the tail behavior of the PageRank distribution in the Web and other complex networks characterized by power laws. To this end, we model the PageRank as a solution of a stochastic equation $R \stackrel{d}{=} \sum_{i=1}^{N} A_i R_i + B$, where the $R_i$ are distributed as $R$. This equation is inspired by the original definition of the PageRank. In particular, $N$ models the number of incoming links of a page, and $B$ stands for the user preference. Assuming that $N$ or $B$ is heavy-tailed, we employ the theory of regular variation to obtain the asymptotic behavior of $R$ under quite general assumptions on the random variables involved. Our theoretical predictions show good agreement with experimental data.

    PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs

    SimRank is a classic measure of the similarities of nodes in a graph. Given a node $u$ in a graph $G = (V, E)$, a single-source SimRank query returns the SimRank similarities $s(u, v)$ between node $u$ and each node $v \in V$. This type of query has numerous applications in web search and social network analysis, such as link prediction, web mining, and spam detection. Existing methods for single-source SimRank queries, however, incur a query cost at least linear in the number of nodes $n$, which renders them inapplicable for real-time and interactive analysis. This paper proposes PRSim, an algorithm that exploits the structure of graphs to efficiently answer single-source SimRank queries. PRSim uses an index of size $O(m)$, where $m$ is the number of edges in the graph, and guarantees a query time that depends on the reverse PageRank distribution of the input graph. In particular, we prove that PRSim runs in sublinear time if the degree distribution of the input graph follows a power-law distribution, a property possessed by many real-world graphs. Based on the theoretical analysis, we show that the empirical query time of all existing SimRank algorithms also depends on the reverse PageRank distribution of the graph. Finally, we present the first experimental study that evaluates the absolute errors of various SimRank algorithms on large graphs, and we show that PRSim outperforms the state of the art in terms of query time, accuracy, index size, and scalability. Comment: ACM SIGMOD 201

    Improved Distortion and Spam Resistance for PageRank

    For a directed graph $G = (V, E)$, a ranking function, such as PageRank, provides a way of mapping elements of $V$ to non-negative real numbers so that nodes can be ordered. Brin and Page argued that the stationary distribution, $R(G)$, of a random walk on $G$ is an effective ranking function for queries on an idealized web graph. However, $R(G)$ is not defined for all $G$, and in particular, it is not defined for the real web graph. Thus, they introduced PageRank to approximate $R(G)$ for graphs $G$ with ergodic random walks while being defined on all graphs. PageRank is defined as a random walk on a graph, where with probability $(1-\epsilon)$ a random out-edge is traversed, and with reset probability $\epsilon$ the random walk instead restarts at a node selected using a reset vector $\hat{r}$. Originally, $\hat{r}$ was taken to be uniform on the nodes, and we call this version UPR. In this paper, we introduce graph-theoretic notions of quality for ranking functions, specifically distortion and spam resistance. We show that UPR has high distortion and low spam resistance, and we show how to select an $\hat{r}$ that yields low distortion and high spam resistance. Comment: 36 pages
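
    The definition of PageRank with a reset probability and a reset vector can be sketched as a short power iteration. This is a generic illustration of that definition, not the reset-vector selection proposed in the paper; the function name, the adjacency-list input, the tolerance, and the handling of nodes without out-edges (their mass is redistributed according to the reset vector) are assumptions.

```python
import numpy as np

def pagerank_with_reset(out_neighbors, eps, reset, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank with reset probability eps.

    out_neighbors : list where out_neighbors[u] lists the successors of u.
    reset         : reset vector (a probability distribution over nodes).
    """
    r = np.asarray(reset, dtype=float)
    p = r.copy()
    for _ in range(max_iter):
        nxt = eps * r  # mass that restarts according to the reset vector
        for u, nbrs in enumerate(out_neighbors):
            if nbrs:
                # With probability 1 - eps, follow a uniformly random out-edge.
                share = (1.0 - eps) * p[u] / len(nbrs)
                for w in nbrs:
                    nxt[w] += share
            else:
                # Assumption: a node without out-edges always resets.
                nxt += (1.0 - eps) * p[u] * r
        if np.abs(nxt - p).sum() < tol:
            return nxt
        p = nxt
    return p

cycle = [[1], [2], [0]]  # directed 3-cycle
upr = pagerank_with_reset(cycle, eps=0.15, reset=[1/3, 1/3, 1/3])
biased = pagerank_with_reset(cycle, eps=0.15, reset=[1.0, 0.0, 0.0])
print(upr, biased)
```

    On the symmetric cycle the uniform reset vector gives every node the same score, while concentrating the reset vector on node 0 ranks the nodes by their distance from it along the cycle.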

    Identifying Diabetes-Related Important Protein Targets with few Interacting Partners with the PageRank Algorithm

    Diabetes is a growing concern for developed nations worldwide. New genomic, metagenomic and gene-technologic approaches may yield considerable results in the next several years in its early diagnosis, or in advances in therapy and management. In this work, we highlight some human proteins that may serve as new targets in early diagnosis and therapy. Using the PageRank algorithm, a very successful mathematical tool for network analysis that formed the basis of the early successes of Google(TM), Inc., we analyse the human protein–protein interaction network obtained from the IntAct database. The novelty of our approach is that the new protein targets suggested do not have many interacting partners (so they are not hubs or super-hubs), and their inhibition or promotion will therefore probably not have serious side effects. We have identified numerous possible protein targets for diabetes therapy and/or management: some of these have been well known for a long time (these validate our method), some appeared in the literature in the last 12 months (these show the cutting edge of the algorithm), and the remainder are not yet known to be connected with diabetes and represent completely new hits of the method.