Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation
A fundamental problem arising in many applications in Web science and social
network analysis is, given an arbitrary approximation factor $c > 1$, to output a
set of nodes that with high probability contains all nodes of PageRank at
least $\Delta$, and no node of PageRank smaller than $\Delta/c$. We call this
problem {\sc SignificantPageRanks}. We develop a nearly optimal, local
algorithm for the problem with runtime complexity $\tilde{O}(n/\Delta)$ on
networks with $n$ nodes. We show that any algorithm for solving this problem
must have runtime of $\Omega(n/\Delta)$, rendering our algorithm optimal up
to logarithmic factors.
Our algorithm comes with two main technical contributions. The first is a
multi-scale sampling scheme for a basic matrix problem that could be of
interest in its own right. In the abstract matrix problem it is assumed that one can
access an unknown {\em right-stochastic matrix} by querying its rows, where the
cost of a query and the accuracy of the answers depend on a precision parameter
$\epsilon$. At a cost proportional to $1/\epsilon$, the query will return a
list of $O(1/\epsilon)$ entries and their indices that provide an
$\epsilon$-precision approximation of the row. Our task is to find a set that
contains all columns whose sum is at least $\Delta$, and omits any column whose
sum is less than $\Delta/c$. Our multi-scale sampling scheme solves this
problem with cost $\tilde{O}(n/\Delta)$, while traditional sampling algorithms
would take time $\Omega(n/\Delta^2)$.
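As an illustration of the query model, the following Python sketch implements the naive single-precision baseline that multi-scale sampling improves on; the query_row oracle, the sampling budget, and the acceptance threshold are assumptions made for illustration, not the paper's interface or analysis.

    import random
    from collections import defaultdict

    def naive_column_sums(query_row, n, delta, num_samples):
        # Naive baseline: sample rows uniformly at a single fixed precision.
        # query_row(i, eps) is an assumed oracle returning (column, value)
        # pairs that approximate row i to precision eps, at cost ~ 1/eps.
        sums = defaultdict(float)
        for _ in range(num_samples):
            i = random.randrange(n)
            for col, val in query_row(i, delta):
                sums[col] += val
        # Rescale: each row is sampled with expected frequency num_samples / n.
        return {col: s * n / num_samples for col, s in sums.items()}

    def significant_columns(estimates, delta, c):
        # Accept columns whose estimate clears a threshold placed between
        # delta / c and delta, so that sufficiently accurate estimates keep
        # every column with true sum >= delta and drop those below delta / c.
        return {col for col, s in estimates.items() if s >= delta * (1 + 1 / c) / 2}

Roughly speaking, the multi-scale scheme avoids paying for high precision on every sampled row at once by spreading its queries over a range of precision levels.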
Our second main technical contribution is a new local algorithm for
approximating personalized PageRank, which is more robust than the earlier ones
developed in \cite{JehW03,AndersenCL06} and is highly efficient particularly
for networks with large in-degrees or out-degrees. Together with our multi-scale
sampling scheme, we are able to optimally solve the {\sc SignificantPageRanks}
problem.

Comment: Accepted to Internet Mathematics journal for publication. An extended abstract of this paper appeared in WAW 2012 under the title "A Sublinear Time Algorithm for PageRank Computations".
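For context, here is a minimal Python sketch of the classical forward local-push scheme from the cited line of work \cite{JehW03,AndersenCL06}, not the more robust variant contributed by the paper; the adjacency-list representation and the push threshold are illustrative assumptions.

    from collections import defaultdict

    def approximate_ppr(graph, source, alpha=0.15, eps=1e-6):
        # graph[u] is the list of u's out-neighbors. p accumulates a lower
        # bound on the personalized PageRank vector of `source`; r holds
        # probability mass that has not been pushed yet.
        p = defaultdict(float)
        r = defaultdict(float)
        r[source] = 1.0
        queue = [source]
        while queue:
            u = queue.pop()
            deg = len(graph.get(u, []))
            if r[u] < eps * max(deg, 1):
                continue                      # stale queue entry
            mass, r[u] = r[u], 0.0
            p[u] += alpha * mass
            if deg == 0:
                continue                      # dangling node: this sketch drops its mass
            share = (1 - alpha) * mass / deg
            for v in graph[u]:
                r[v] += share
                if r[v] >= eps * max(len(graph.get(v, [])), 1):
                    queue.append(v)
        return dict(p)

The run is local: only nodes whose residual exceeds eps times their outdegree are ever expanded, so the cost depends on the neighborhood of `source` rather than on the whole graph.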
Bidirectional PageRank Estimation: From Average-Case to Worst-Case
We present a new algorithm for estimating the Personalized PageRank (PPR)
between a source and target node on undirected graphs, with sublinear
running-time guarantees over the worst-case choice of source and target nodes.
Our work builds on a recent line of work on bidirectional estimators for PPR,
which obtained sublinear running-time guarantees but in an average-case sense,
for a uniformly random choice of target node. Crucially, we show how the
reversibility of random walks on undirected networks can be exploited to
convert average-case to worst-case guarantees. While past bidirectional methods
combine forward random walks with reverse local pushes, our algorithm combines
forward local pushes with reverse random walks. We also discuss how to modify
our methods to estimate random-walk probabilities for any length distribution,
thereby obtaining fast algorithms for estimating general graph diffusions,
including the heat kernel, on undirected networks.

Comment: Workshop on Algorithms and Models for the Web-Graph (WAW) 2015
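As a concrete sketch of the bidirectional idea, here is the earlier reverse-push-plus-forward-walks combination that this work builds on (the paper itself inverts the two directions), in Python, assuming an undirected graph given as adjacency lists with no isolated nodes.

    import random
    from collections import defaultdict

    def reverse_push(graph, target, alpha, r_max):
        # Local push toward `target`; on an undirected graph the forward and
        # reverse neighborhoods coincide. Maintains, for every node s,
        #   ppr(s -> target) = p[s] + sum_v ppr(s -> v) * r[v].
        p, r = defaultdict(float), defaultdict(float)
        r[target] = 1.0
        queue = [target]
        while queue:
            v = queue.pop()
            if r[v] < r_max:
                continue
            mass, r[v] = r[v], 0.0
            p[v] += alpha * mass
            for u in graph[v]:
                r[u] += (1 - alpha) * mass / len(graph[u])
                if r[u] >= r_max:
                    queue.append(u)
        return p, r

    def bidirectional_ppr(graph, source, target, alpha=0.15, r_max=1e-4, walks=10000):
        p, r = reverse_push(graph, target, alpha, r_max)
        total = 0.0
        for _ in range(walks):
            v = source
            while random.random() > alpha:    # walk length is Geometric(alpha)
                v = random.choice(graph[v])
            total += r[v]                     # endpoint is a sample from ppr(source -> .)
        return p[source] + total / walks

The combination reduces variance: every walk contributes at most r_max to the estimate, so far fewer walks are needed than in a purely Monte Carlo approach.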
Quick Detection of High-degree Entities in Large Directed Networks
In this paper, we address the problem of quick detection of high-degree
entities in large online social networks. The practical importance of this problem
is attested by the large number of companies that continuously collect and update
statistics about popular entities, usually using the degree of an entity as an
approximation of its popularity. We suggest a simple, efficient, and
easy-to-implement two-stage randomized algorithm that provides highly accurate
solutions to this problem. For instance, our algorithm needs only one thousand
API requests to find the top-100 most followed users on Twitter, a
network with approximately a billion registered users, with more than 90%
precision. Our algorithm significantly outperforms existing methods and serves
many different purposes, such as finding the most popular users or the most
popular interest groups in social networks. An important contribution of this
work is the analysis of the proposed algorithm using Extreme Value Theory -- a
branch of probability that studies extreme events and properties of largest
order statistics in random samples. Using this theory, we derive an accurate
prediction for the algorithm's performance and show that the number of API
requests for finding the top-k most popular entities is sublinear in the number
of entities. Moreover, we formally show that the high variability among the
entities, expressed through heavy-tailed distributions, is the reason for the
algorithm's efficiency. We quantify this phenomenon in a rigorous mathematical
way.
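A minimal two-stage sketch in this spirit follows; random_user, followees_of, and degree_of are hypothetical API wrappers (uniform user sampling, followee lists, exact follower counts), and the 700/300 budget split is illustrative rather than the tuning analyzed in the paper.

    from collections import Counter

    def top_k_popular(random_user, followees_of, degree_of, k, n1=700, n2=300):
        # Stage 1: an entity followed by a sizable fraction of the network
        # shows up in many sampled followee lists, so raw appearance counts
        # already rank the candidates (n1 API requests).
        counts = Counter()
        for _ in range(n1):
            counts.update(followees_of(random_user()))
        candidates = [u for u, _ in counts.most_common(n2)]
        # Stage 2: spend the remaining budget on exact degree lookups
        # (n2 API requests) and report the top k.
        exact = {u: degree_of(u) for u in candidates}
        return sorted(exact, key=exact.get, reverse=True)[:k]

With n1 + n2 = 1000 this matches the request budget quoted above; a heavy-tailed degree distribution is what lets a few hundred samples separate the top entities from the rest.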
Sublinear algorithms for local graph centrality estimation
We study the complexity of local graph centrality estimation, with the goal
of approximating the centrality score of a given target node while exploring
only a sublinear number of nodes/arcs of the graph and performing a sublinear
number of elementary operations. We develop a technique, which we apply to the
PageRank and Heat Kernel centralities, for building a low-variance score
estimator through a local exploration of the graph. We obtain an algorithm
that, given any node in any graph of $m$ arcs, with probability $1-\delta$
computes a multiplicative $(1 \pm \epsilon)$-approximation of its score by
examining only $\tilde{O}(\min(m^{2/3}\Delta^{1/3}d^{-2/3},\, m^{4/5}d^{-3/5}))$
nodes/arcs, where $\Delta$ and $d$ are respectively the maximum and average
outdegree of the graph (omitting $\mathrm{poly}(\epsilon^{-1})$ and
$\mathrm{polylog}(\delta^{-1})$ factors for readability). A similar bound holds
for computational complexity. We also prove a lower bound of
$\Omega(\min(m^{1/2}\Delta^{1/2}d^{-1/2},\, m^{2/3}d^{-1/3}))$ for both query
complexity and computational complexity. Moreover, our technique yields a
sublinear query complexity algorithm for the graph access model of
[Brautbar et al., 2010], widely used in social network mining; we show this
algorithm is optimal up to a sublogarithmic factor. These are the first
algorithms yielding worst-case sublinear bounds for general directed graphs and
any choice of the target node.

Comment: 29 pages, 1 figure
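For contrast, the naive Monte Carlo baseline that such low-variance estimators are designed to beat can be sketched in a few lines of Python; the dangling-node convention and the walk budget are assumptions for illustration.

    import random

    def mc_pagerank_score(nodes, graph, target, alpha=0.15, walks=100000):
        # The endpoint of an alpha-terminated random walk from a uniform
        # start node is a sample from the PageRank distribution, so the hit
        # frequency at `target` estimates its score.
        hits = 0
        for _ in range(walks):
            v = random.choice(nodes)
            while random.random() > alpha:
                out = graph.get(v, [])
                # Dangling nodes teleport to a uniform node (one common convention).
                v = random.choice(out) if out else random.choice(nodes)
            hits += (v == target)
        return hits / walks

For a score of order $1/n$ this needs on the order of $n$ walks just to see the target once, which is why plain Monte Carlo is not sublinear and why low-variance local estimation matters.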
Fast Local Computation Algorithms
For input $x$, let $F(x)$ denote the set of outputs that are the "legal"
answers for a computational problem $F$. Suppose $x$ and members of $F(x)$ are
so large that there is not time to read them in their entirety. We propose a
model of {\em local computation algorithms} which, for a given input $x$,
support queries by a user to values of specified locations $y_i$ in a legal
output $y \in F(x)$. When more than one legal output $y$ exists for a given
$x$, the local computation algorithm should output in a way that is consistent
with at least one such $y$. Local computation algorithms are intended to
distill the common features of several concepts that have appeared in various
algorithmic subfields, including local distributed computation, local
algorithms, locally decodable codes, and local reconstruction.
We develop a technique, based on known constructions of small sample spaces
of $k$-wise independent random variables and Beck's analysis in his algorithmic
approach to the Lov{\'{a}}sz Local Lemma, which under certain conditions can be
applied to construct local computation algorithms that run in {\em
polylogarithmic} time and space. We apply this technique to maximal independent
set computations, scheduling radio network broadcasts, hypergraph coloring, and
satisfying $k$-SAT formulas.

Comment: A preliminary version of this paper appeared in ICS 2011, pp. 223-23
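To make the model concrete, here is a minimal local computation algorithm for maximal independent set in Python, using the folklore random-priority construction rather than the paper's polylogarithmic technique; answers to different queries are consistent with one global legal output because they are all derived from the same shared priorities.

    import random

    def make_mis_lca(graph, seed=0):
        # graph maps each node to its neighbor list. A node joins the MIS
        # iff none of its higher-priority (lower-value) neighbors does, i.e.
        # we locally simulate the greedy MIS under a fixed random order.
        rng = random.Random(seed)
        priority = {v: rng.random() for v in graph}
        cache = {}

        def in_mis(v):
            if v not in cache:
                cache[v] = all(not in_mis(u) for u in graph[v]
                               if priority[u] < priority[v])
            return cache[v]

        return in_mis

For example, make_mis_lca({0: [1], 1: [0, 2], 2: [1]})(1) answers a single membership query without ever materializing the full independent set.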