Personalized PageRank with Node-dependent Restart
Personalized PageRank is an algorithm that ranks the importance of web pages
on a user-dependent basis. We introduce two generalizations of Personalized
PageRank with node-dependent restart. The first generalization is based on the
proportion of visits to nodes before the restart, whereas the second is based
on the probability of the node visited just before the restart. In the original
case of a constant restart probability, the two measures coincide. We discuss
interesting particular cases of restart probabilities and restart
distributions. We show that both generalizations of Personalized PageRank admit
an elegant expression connecting the so-called direct and reverse Personalized
PageRanks, which yields a symmetry property of these Personalized PageRanks.
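The first generalization above (the proportion of visits to each node before restart) can be sketched as a power iteration over a walk whose restart probability varies per node. The graph, restart probabilities, and preference vector below are illustrative assumptions, not taken from the paper; the sketch also assumes every node has at least one out-edge.

```python
def node_dependent_ppr(adj, r, s, iters=200):
    """Stationary distribution of a walk with node-dependent restart.

    adj : dict node -> list of out-neighbours (assumed non-empty)
    r   : dict node -> restart probability at that node
    s   : dict node -> restart (preference) distribution, sums to 1
    """
    nodes = list(adj)
    pi = {u: 1.0 / len(nodes) for u in nodes}      # start uniform
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            # with probability r[u], restart to a node drawn from s
            for v in nodes:
                nxt[v] += pi[u] * r[u] * s[v]
            # otherwise follow a uniformly random out-edge
            for v in adj[u]:
                nxt[v] += pi[u] * (1.0 - r[u]) / len(adj[u])
        pi = nxt
    return pi

# Tiny 3-node cycle; a constant restart probability recovers classic
# Personalized PageRank, matching the coincidence noted in the abstract.
adj = {0: [1], 1: [2], 2: [0]}
r = {0: 0.15, 1: 0.15, 2: 0.15}
s = {0: 1.0, 1: 0.0, 2: 0.0}                       # always restart at node 0
pi = node_dependent_ppr(adj, r, s)
```

Making some `r[u]` larger than others shifts occupation mass toward the restart distribution whenever the walk passes through those nodes, which is exactly what distinguishes the two generalized measures.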
Fast Distributed PageRank Computation
Over the last decade, PageRank has gained importance in a wide range of
applications and domains, ever since it first proved to be effective in
determining node importance in large graphs (and was a pioneering idea behind
Google's search engine). In distributed computing alone, PageRank vector, or
more generally random walk based quantities have been used for several
different applications ranging from determining important nodes, load
balancing, search, and identifying connectivity structures. Surprisingly,
however, there has been little work towards designing provably efficient
fully-distributed algorithms for computing PageRank. The difficulty is that
traditional matrix-vector multiplication style iterative methods may not always
adapt well to the distributed setting owing to communication bandwidth
restrictions and convergence rates.
In this paper, we present fast random walk-based distributed algorithms for
computing PageRanks in general graphs and prove strong bounds on the round
complexity. We first present a distributed algorithm that takes O\big(\log
n/\eps\big) rounds with high probability on any graph (directed or
undirected), where n is the network size and \eps is the reset probability
used in the PageRank computation (typically \eps is a fixed constant). We
then present a faster algorithm that takes O\big(\sqrt{\log n}/\eps\big)
rounds in undirected graphs. Both of the above algorithms are scalable, as each
node sends only a small (\polylog n) number of bits over each edge per round.
To the best of our knowledge, these are the first fully distributed algorithms
for computing the PageRank vector with provably efficient running time.
Comment: 14 pages
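The random-walk idea behind such algorithms can be illustrated on a single machine: launch short walks from every node, terminate each step with the reset probability \eps, and estimate PageRank from normalized visit counts. This is a minimal sequential sketch, not the distributed protocol itself; the constants are illustrative.

```python
import random

def monte_carlo_pagerank(adj, eps=0.15, walks_per_node=100, seed=0):
    """Estimate the PageRank vector from random-walk visit counts.

    Each walk terminates at every step with probability eps (or when it
    hits a dangling node), so its expected length is about 1/eps.
    """
    rng = random.Random(seed)
    visits = {u: 0 for u in adj}
    total = 0
    for start in adj:
        for _ in range(walks_per_node):
            u = start
            while True:
                visits[u] += 1
                total += 1
                if rng.random() < eps or not adj[u]:
                    break                      # walk resets here
                u = rng.choice(adj[u])
    # normalised visit counts approximate the PageRank vector
    return {u: visits[u] / total for u in visits}
```

In the distributed setting the same walks are advanced in parallel across the network, which is why the round complexity is governed by the walk length 1/\eps rather than by matrix-vector iterations.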
Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation
A fundamental problem arising in many applications in Web science and social
network analysis is, given an arbitrary approximation factor c > 1, to output a
set of nodes that with high probability contains all nodes of PageRank at
least a threshold \Delta, and no node of PageRank smaller than \Delta/c. We
call this problem {\sc SignificantPageRanks}. We develop a nearly optimal,
local algorithm for the problem with runtime complexity \tilde{O}(n/\Delta) on
networks with n nodes. We show that any algorithm for solving this problem
must have runtime of \Omega(n/\Delta), rendering our algorithm optimal up
to logarithmic factors.
Our algorithm comes with two main technical contributions. The first is a
multi-scale sampling scheme for a basic matrix problem that could be of
interest on its own. In the abstract matrix problem it is assumed that one can
access an unknown {\em right-stochastic matrix} by querying its rows, where the
cost of a query and the accuracy of the answers depend on a precision parameter
\epsilon. At a cost proportional to 1/\epsilon, the query returns a
list of O(1/\epsilon) entries and their indices that provide an
\epsilon-precision approximation of the row. Our task is to find a set that
contains all columns whose sum is at least the threshold \Delta, and omits any
column whose sum is significantly smaller. Our multi-scale sampling scheme
solves this problem at a cost well below that of traditional sampling
algorithms.
Our second main technical contribution is a new local algorithm for
approximating personalized PageRank, which is more robust than the earlier ones
developed in \cite{JehW03,AndersenCL06} and is highly efficient, particularly
for networks with large in-degrees or out-degrees. Together with our
multi-scale sampling scheme we are able to optimally solve the
{\sc SignificantPageRanks} problem.
Comment: Accepted to Internet Mathematics journal for publication. An extended
abstract of this paper appeared in WAW 2012 under the title "A Sublinear Time
Algorithm for PageRank Computations".
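For context, the earlier local personalized-PageRank approach that this paper improves on can be sketched with the classic forward-push scheme of Andersen et al.: maintain settled mass `p` and residual mass `r`, and repeatedly push residual from any node whose residual per out-edge exceeds a tolerance. The parameters below are illustrative, and this is the baseline technique, not the paper's more robust variant.

```python
def forward_push_ppr(adj, source, alpha=0.15, rmax=1e-5):
    """Approximate personalized PageRank from `source` by local pushes.

    p : settled PageRank mass     r : residual mass still to be pushed
    Invariant: exact PPR = p plus the PPR contribution of the residual.
    """
    p = {u: 0.0 for u in adj}
    r = {u: 0.0 for u in adj}
    r[source] = 1.0
    while True:
        # pick any node with enough residual per out-edge to push
        u = next((w for w in adj if adj[w] and r[w] > rmax * len(adj[w])), None)
        if u is None:
            break
        p[u] += alpha * r[u]                       # settle alpha fraction
        share = (1.0 - alpha) * r[u] / len(adj[u])
        r[u] = 0.0
        for v in adj[u]:                           # spread the rest
            r[v] += share
    return p
```

The cost is local: only nodes near the source ever accumulate enough residual to be pushed, which is what makes sublinear-time guarantees possible on suitable graphs.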
Asymptotic analysis for personalized Web search
Personalized PageRank is used in Web search as an importance measure for Web documents. The goal of this paper is to characterize the tail behavior of the PageRank distribution in the Web and other complex networks characterized by power laws. To this end, we model the PageRank as a solution of the stochastic equation R \stackrel{d}{=} \sum_{i=1}^{N} A_i R_i + B, where the R_i's are distributed as R. This equation is inspired by the original definition of the PageRank. In particular, N models the number of incoming links of a page, and B stands for the user preference. Assuming that N or B are heavy-tailed, we employ the theory of regular variation to obtain the asymptotic behavior of R under quite general assumptions on the involved random variables. Our theoretical predictions show good agreement with experimental data.
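A stochastic fixed-point equation of this form, R distributed as \sum_{i=1}^{N} A_i R_i + B, can be sampled by unrolling the recursion to a finite depth. The specific distributions of N, A, and B below are assumptions chosen for the sketch (with E[N] \cdot A < 1 so the recursion contracts), not the paper's model of the Web graph.

```python
import random

def sample_R(rng, depth=12):
    """One draw of R from the fixed-point equation, truncated at `depth`.

    Assumed model: N uniform on {0,1,2,3}, A = 0.85/3 constant,
    B = 0.15 constant; then E[N]*A = 0.425 < 1, so truncation error
    at depth 12 is negligible.
    """
    if depth == 0:
        return 0.0
    N = rng.choice([0, 1, 2, 3])        # in-degree of the page (assumed)
    A = 0.85 / 3.0                      # weight per in-link (assumed)
    B = 0.15                            # user-preference term (assumed)
    return B + sum(A * sample_R(rng, depth - 1) for _ in range(N))

rng = random.Random(1)
draws = [sample_R(rng) for _ in range(2000)]
mean_R = sum(draws) / len(draws)        # analytic mean is B/(1 - E[N]*A)
```

With these constants the analytic mean is 0.15/0.575, about 0.261; heavy tails appear precisely when N or B is heavy-tailed, which this light-tailed toy model deliberately avoids.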
PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs
{\it SimRank} is a classic measure of the similarities of nodes in a graph.
Given a node u in a graph G = (V, E), a {\em single-source SimRank query}
returns the SimRank similarities s(u, v) between node u and each node v \in V.
This type of query has numerous applications in web search and social
network analysis, such as link prediction, web mining, and spam detection.
Existing methods for single-source SimRank queries, however, incur query cost
at least linear in the number of nodes n, which renders them inapplicable for
real-time and interactive analysis.
This paper proposes \prsim, an algorithm that exploits the structure of
graphs to efficiently answer single-source SimRank queries. \prsim uses an
index of size O(m), where m is the number of edges in the graph, and
guarantees a query time that depends on the {\em reverse PageRank} distribution
of the input graph. In particular, we prove that \prsim runs in sub-linear time
if the degree distribution of the input graph follows the power-law
distribution, a property possessed by many real-world graphs. Based on the
theoretical analysis, we show that the empirical query time of all existing
SimRank algorithms also depends on the reverse PageRank distribution of the
graph. Finally, we present the first experimental study that evaluates the
absolute errors of various SimRank algorithms on large graphs, and we show that
\prsim outperforms the state of the art in terms of query time, accuracy, index
size, and scalability.
Comment: ACM SIGMOD 2019
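SimRank has a well-known random-walk characterization that explains the connection to reverse PageRank: s(u, v) is the expected value of c^\tau, where \tau is the first step at which two independent walks from u and v, each moving along in-edges, meet. The Monte Carlo sketch below illustrates that definition; it is not \prsim, and the graph and constants are illustrative.

```python
import random

def simrank_mc(in_adj, u, v, c=0.6, walks=2000, max_len=10, seed=0):
    """Estimate SimRank s(u, v) as E[c^tau] over paired reverse walks.

    in_adj : dict node -> list of in-neighbours
    tau    : first step at which the two walks land on the same node
    """
    if u == v:
        return 1.0                       # s(u, u) = 1 by definition
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        x, y = u, v
        for step in range(1, max_len + 1):
            if not in_adj[x] or not in_adj[y]:
                break                    # a walk has no in-edge to follow
            x = rng.choice(in_adj[x])
            y = rng.choice(in_adj[y])
            if x == y:
                total += c ** step       # walks met: contribute c^tau
                break
    return total / walks
```

Because both walks move along in-edges, nodes that are heavy under the reverse PageRank distribution are where walk pairs tend to meet, which is the intuition behind index sizes and query times that depend on that distribution.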
Improved Distortion and Spam Resistance for PageRank
For a directed graph G = (V, E), a ranking function, such as PageRank,
provides a way of mapping elements of V to non-negative real numbers so that
nodes can be ordered. Brin and Page argued that the stationary distribution,
\pi, of a random walk on G is an effective ranking function for queries on
an idealized web graph. However, \pi is not defined for all G, and in
particular, it is not defined for the real web graph. Thus, they introduced
PageRank to approximate \pi for graphs with ergodic random walks while
being defined on all graphs.
PageRank is defined as a random walk on a graph, where with probability
1 - \eps, a random out-edge is traversed, and with \emph{reset
probability} \eps the random walk instead restarts at a node selected
using a \emph{reset vector} v. Originally, v was taken to be
uniform on the nodes, and we call this version UPR.
In this paper, we introduce graph-theoretic notions of quality for ranking
functions, specifically \emph{distortion} and \emph{spam resistance}. We show
that UPR has high distortion and low spam resistance, and we show how to select
a reset vector that yields low distortion and high spam resistance.
Comment: 36 pages
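PageRank with a general reset vector, as defined in the abstract above, can be sketched as a power iteration: each step keeps the reset mass \eps \cdot v and spreads the remaining 1 - \eps mass along random out-edges. The graph and constants below are illustrative; dangling nodes are handled by spreading their mass everywhere, one common convention the abstract does not specify.

```python
def pagerank(adj, v, eps=0.15, iters=100):
    """Power iteration for PageRank with an arbitrary reset vector v.

    adj : dict node -> list of out-neighbours
    v   : dict node -> reset probability mass, sums to 1
    """
    pi = dict(v)                               # start from the reset vector
    for _ in range(iters):
        nxt = {u: eps * v[u] for u in adj}     # reset mass: eps * v
        for u in adj:
            # dangling nodes spread mass uniformly (assumed convention)
            out = adj[u] if adj[u] else list(adj)
            for w in out:
                nxt[w] += (1.0 - eps) * pi[u] / len(out)
        pi = nxt
    return pi

adj = {0: [1, 2], 1: [0], 2: [0]}
upr = pagerank(adj, {u: 1 / 3 for u in adj})   # uniform reset vector: UPR
```

Choosing a non-uniform v is exactly the lever the paper analyzes: concentrating reset mass differently changes both how faithfully the ranking reflects the graph (distortion) and how much rank a link farm can capture (spam resistance).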
Identifying Diabetes-Related Important Protein Targets with few Interacting Partners with the PageRank Algorithm
Diabetes is a growing concern for developed nations worldwide. New genomic, metagenomic and gene-technological approaches may yield considerable results in the next several years in its early diagnosis, or in advances in therapy and management. In this work, we highlight some human proteins that may serve as new targets in early diagnosis and therapy. With the help of a very successful mathematical tool for network analysis that formed the basis of the early successes of Google(TM), Inc., we analyse the human protein-protein interaction network gained from the IntAct database with a mathematical algorithm. The novelty of our approach is that the new protein targets suggested do not have many interacting partners (so they are not hubs or super-hubs), and hence their inhibition or promotion will probably not have serious side effects. We have identified numerous possible protein targets for diabetes therapy and/or management; some of these have been well known for a long time (these validate our method), some appeared in the literature in the last 12 months (these show the cutting edge of the algorithm), and the remainder are not yet known to be connected with diabetes, representing completely new hits of the method.
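The selection idea described above, rank proteins by PageRank but keep only those with few interacting partners, can be sketched as follows. The toy interaction network, degree cutoff, and result count are illustrative assumptions, not IntAct data or the paper's actual parameters.

```python
def low_degree_high_rank(edges, eps=0.15, iters=100, max_degree=2, top=3):
    """Rank proteins by PageRank, then drop hubs (high-degree nodes).

    edges : list of (protein, protein) undirected interaction pairs
    Returns up to `top` highly ranked proteins with degree <= max_degree.
    """
    adj = {}
    for a, b in edges:                          # build undirected adjacency
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    pi = {u: 1.0 / len(adj) for u in adj}
    for _ in range(iters):                      # standard PageRank iteration
        nxt = {u: eps / len(adj) for u in adj}
        for u in adj:
            for w in adj[u]:
                nxt[w] += (1.0 - eps) * pi[u] / len(adj[u])
        pi = nxt
    ranked = sorted(pi, key=pi.get, reverse=True)
    # keep highly ranked proteins that are NOT hubs
    return [u for u in ranked if len(adj[u]) <= max_degree][:top]
```

Filtering by degree after ranking is what separates this approach from plain hub detection: a protein can score highly because it sits next to important neighbours, even with only one or two interaction partners.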