129,834 research outputs found
Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation
A fundamental problem arising in many applications in Web science and social
network analysis is, given an arbitrary approximation factor $c > 1$, to output a
set of nodes that with high probability contains all nodes of PageRank at
least $\Delta$, and no node of PageRank smaller than $\Delta/c$. We call this
problem {\sc SignificantPageRanks}. We develop a nearly optimal, local
algorithm for the problem with runtime complexity $\tilde{O}(n/\Delta)$ on
networks with $n$ nodes. We show that any algorithm for solving this problem
must have runtime of $\Omega(n/\Delta)$, rendering our algorithm optimal up
to logarithmic factors.
Our algorithm comes with two main technical contributions. The first is a
multi-scale sampling scheme for a basic matrix problem that could be of
interest in its own right. In the abstract matrix problem it is assumed that one can
access an unknown {\em right-stochastic matrix} by querying its rows, where the
cost of a query and the accuracy of the answers depend on a precision parameter
$\epsilon$. At a cost proportional to $1/\epsilon$, the query returns a
list of $O(1/\epsilon)$ entries and their indices that provide an
$\epsilon$-precision approximation of the row. Our task is to find a set that
contains all columns whose sum is at least a given threshold $\Delta$, and omits any column whose
sum is less than $\Delta/c$. Our multi-scale sampling scheme solves this
problem at a significantly lower cost than traditional sampling algorithms.
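As a concrete illustration of the row-query model, the following minimal Python sketch implements a naive single-precision baseline: every row is queried at one fixed precision and the reported entries are accumulated into approximate column sums. The `query_row` helper, the fixed precision `eps`, and the example thresholds `delta` and `c` are illustrative assumptions; the multi-scale scheme described above varies the precision across queries, which this baseline does not attempt.

```python
import numpy as np

def query_row(A, i, eps):
    """Simulate an eps-precision row query: return (column, value) pairs for
    entries of row i that are at least eps; smaller entries go unreported.
    The cost of such a query is modeled as proportional to 1/eps."""
    row = A[i]
    cols = np.nonzero(row >= eps)[0]
    return list(zip(cols, row[cols]))

def significant_columns_naive(A, delta, c, eps):
    """Naive single-scale baseline: query every row at the same precision eps,
    accumulate approximate column sums, and report columns whose estimate
    clears the lower end of the (delta / c, delta) gap."""
    n, m = A.shape
    col_sums = np.zeros(m)
    for i in range(n):
        for j, v in query_row(A, i, eps):
            col_sums[j] += v
    return {j for j in range(m) if col_sums[j] >= delta / c}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((200, 40))
    A /= A.sum(axis=1, keepdims=True)   # make the matrix right-stochastic
    print(sorted(significant_columns_naive(A, delta=6.0, c=2.0, eps=1e-3)))
```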
Our second main technical contribution is a new local algorithm for
approximating personalized PageRank, which is more robust than the earlier ones
developed in \cite{JehW03,AndersenCL06} and is highly efficient, particularly
for networks with large in-degrees or out-degrees. Together with our multi-scale
sampling scheme, we are able to optimally solve the {\sc SignificantPageRanks}
problem.

Comment: Accepted to Internet Mathematics journal for publication. An extended
abstract of this paper appeared in WAW 2012 under the title "A Sublinear Time
Algorithm for PageRank Computations".
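For readers unfamiliar with local PageRank approximation, the sketch below shows a classic push-style estimator of personalized PageRank in the spirit of the \cite{AndersenCL06} line of work; it is not the more robust local algorithm introduced in this paper. The graph representation, teleport probability `alpha`, tolerance `eps`, and dangling-node handling are illustrative assumptions.

```python
from collections import defaultdict

def approximate_ppr(out_neighbors, seed, alpha=0.15, eps=1e-4):
    """Push-style local approximation of personalized PageRank from `seed`.
    out_neighbors: dict mapping node -> list of out-neighbors.
    Residual mass below eps * out-degree is never pushed, so only a small
    neighborhood of the seed is ever touched."""
    p = defaultdict(float)   # current PageRank estimate
    r = defaultdict(float)   # residual mass still to be distributed
    r[seed] = 1.0
    while True:
        # nodes whose residual is large relative to their out-degree
        active = [u for u, ru in r.items()
                  if ru >= eps * max(len(out_neighbors.get(u, [])), 1)]
        if not active:
            break
        for u in active:
            push, r[u] = r[u], 0.0
            p[u] += alpha * push
            nbrs = out_neighbors.get(u, [])
            if nbrs:
                share = (1 - alpha) * push / len(nbrs)
                for v in nbrs:
                    r[v] += share
            else:
                r[seed] += (1 - alpha) * push   # dangling node: return mass to seed
    return dict(p)

if __name__ == "__main__":
    g = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
    print(approximate_ppr(g, seed=0, eps=1e-5))
```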
Respondent-Driven Sampling: An Assessment of Current Methodology
Respondent-Driven Sampling (RDS) employs a variant of a link-tracing network
sampling strategy to collect data from hard-to-reach populations. By tracing
the links in the underlying social network, the process exploits the social
structure to expand the sample and reduce its dependence on the initial
(convenience) sample.
The primary goal of RDS is typically to estimate population averages in the
hard-to-reach population. The current estimates make strong assumptions in
order to treat the data as a probability sample. In particular, we evaluate
three critical sensitivities of the estimators: to bias induced by the initial
sample, to uncontrollable features of respondent behavior, and to the
without-replacement structure of sampling.
This paper sounds a cautionary note for the users of RDS. While current RDS
methodology is powerful and clever, the favorable statistical properties
claimed for the current estimates are shown to be heavily dependent on often
unrealistic assumptions.

Comment: 35 pages, 29 figures, under review
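To make the recruitment mechanism and the standard estimator concrete, the following toy sketch simulates idealized coupon-based referrals on a synthetic network and computes an inverse-degree-weighted (Volz-Heckathorn-type) estimate of a population mean. The network model, seed choice, coupon count, and outcome variable are illustrative assumptions, not part of the paper's analysis.

```python
import random
import networkx as nx

def rds_sample(G, seeds, coupons=3, target=200, rng=random):
    """Simulate idealized RDS recruitment: each recruit passes up to `coupons`
    coupons to randomly chosen neighbors not yet in the sample."""
    sample, wave = list(seeds), list(seeds)
    in_sample = set(seeds)
    while wave and len(sample) < target:
        next_wave = []
        for u in wave:
            candidates = [v for v in G.neighbors(u) if v not in in_sample]
            for v in rng.sample(candidates, min(coupons, len(candidates))):
                in_sample.add(v)
                sample.append(v)
                next_wave.append(v)
                if len(sample) >= target:
                    return sample
        wave = next_wave
    return sample

def vh_estimate(G, sample, outcome):
    """Inverse-degree-weighted (Volz-Heckathorn-type) estimate of the mean
    of `outcome` in the hidden population."""
    weights = [1.0 / G.degree(u) for u in sample]
    return sum(w * outcome[u] for w, u in zip(weights, sample)) / sum(weights)

if __name__ == "__main__":
    G = nx.barabasi_albert_graph(5000, 3, seed=1)
    outcome = {u: int(G.degree(u) > 6) for u in G}   # toy trait, degree-correlated
    sample = rds_sample(G, seeds=[0, 1, 2], target=300)
    print("naive sample mean:", sum(outcome[u] for u in sample) / len(sample))
    print("VH-type estimate:", vh_estimate(G, sample, outcome))
    print("true mean:", sum(outcome.values()) / len(outcome))
```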
Quick Detection of High-degree Entities in Large Directed Networks
In this paper, we address the problem of quick detection of high-degree
entities in large online social networks. The practical importance of this problem
is attested by the large number of companies that continuously collect and update
statistics about popular entities, usually using the degree of an entity as an
approximation of its popularity. We suggest a simple, efficient, and
easy-to-implement two-stage randomized algorithm that provides highly accurate
solutions to this problem. For instance, our algorithm needs only one thousand
API requests to find, with more than 90% precision, the top-100 most followed
users in Twitter, a network with approximately a billion registered users.
Our algorithm significantly outperforms existing methods and serves
many different purposes, such as finding the most popular users or the most
popular interest groups in social networks. An important contribution of this
work is the analysis of the proposed algorithm using Extreme Value Theory -- a
branch of probability that studies extreme events and properties of largest
order statistics in random samples. Using this theory, we derive an accurate
prediction for the algorithm's performance and show that the number of API
requests for finding the top-k most popular entities is sublinear in the number
of entities. Moreover, we formally show that the high variability among the
entities, expressed through heavy-tailed distributions, is the reason for the
algorithm's efficiency. We quantify this phenomenon in a rigorous mathematical
way.
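A minimal sketch of a two-stage sample-and-count strategy of the kind described: stage one samples random users and tallies the accounts they follow, stage two spends the remaining budget verifying the in-degrees of the most frequently seen candidates. The synthetic graph, the budget split (`n1`, `n2`), and the helper names are illustrative assumptions rather than the authors' exact procedure.

```python
import random
from collections import Counter
import networkx as nx

def top_k_by_indegree(G, k=10, n1=500, n2=50, rng=random):
    """Two-stage sketch: (1) sample n1 random nodes and count how often each
    node appears in their followee lists; (2) spend n2 'API requests' looking
    up the true in-degree of the most frequently seen candidates."""
    nodes = list(G.nodes())
    counts = Counter()
    for u in rng.sample(nodes, n1):                 # stage 1: count followees
        counts.update(G.successors(u))
    candidates = [v for v, _ in counts.most_common(n2)]
    verified = {v: G.in_degree(v) for v in candidates}   # stage 2: verify
    return sorted(verified, key=verified.get, reverse=True)[:k]

if __name__ == "__main__":
    G = nx.scale_free_graph(20000, seed=2)          # heavy-tailed directed toy graph
    found = top_k_by_indegree(G, k=10)
    true_top = sorted(G.nodes(), key=G.in_degree, reverse=True)[:10]
    print("precision@10:", len(set(found) & set(true_top)) / 10)
```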
Social Bootstrapping: How Pinterest and Last.fm Social Communities Benefit by Borrowing Links from Facebook
How does one develop a new online community that is highly engaging to each
user and promotes social interaction? A number of websites offer friend-finding
features that help users bootstrap social networks on the website by copying
links from an established network like Facebook or Twitter. This paper
quantifies the extent to which such social bootstrapping is effective in
enhancing the social experience of the website. First, we develop a stylised
analytical model that suggests that copying tends to produce a giant connected
component (i.e., a connected community) quickly and preserves properties such
as reciprocity and clustering, up to a linear multiplicative factor. Second, we
use data from two websites, Pinterest and Last.fm, to empirically compare the
subgraph of links copied from Facebook to links created natively. We find that
the copied subgraph has a giant component, higher reciprocity and clustering,
and confirm that the copied connections see higher social interactions.
However, the need for copying diminishes as users become more active and
influential. Such users tend to create links natively on the website, to users
who are more similar to them than their Facebook friends. Our findings give new
insights into understanding how bootstrapping from established social networks
can help engage new users by enhancing social interactivity.

Comment: Proc. 23rd International World Wide Web Conference (WWW), 2014
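The comparison described above can be mimicked on synthetic data: treat a random subset of edges as the "copied" links and compare reciprocity, clustering, and the relative size of the largest weakly connected component against the full graph. The graph generator and the 30% copying fraction below are illustrative assumptions; the paper's measurements use real Pinterest and Last.fm data.

```python
import random
import networkx as nx

def edge_subgraph_stats(G, edges):
    """Reciprocity, average clustering, and relative size of the largest
    weakly connected component for the subgraph built from `edges`."""
    H = nx.DiGraph()
    H.add_edges_from(edges)
    giant = max(nx.weakly_connected_components(H), key=len)
    return {
        "reciprocity": nx.reciprocity(H),
        "clustering": nx.average_clustering(H.to_undirected()),
        "giant_frac": len(giant) / max(H.number_of_nodes(), 1),
    }

if __name__ == "__main__":
    random.seed(3)
    G = nx.DiGraph(nx.scale_free_graph(3000, seed=3))       # drop parallel edges
    G.remove_edges_from(list(nx.selfloop_edges(G)))
    edges = list(G.edges())
    copied = random.sample(edges, int(0.3 * len(edges)))    # stand-in for copied links
    print("full graph:", edge_subgraph_stats(G, edges))
    print("copied subgraph:", edge_subgraph_stats(G, copied))
```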
Constraining the Parameters of High-Dimensional Models with Active Learning
Constraining the parameters of physical models with many parameters is a
widespread problem in fields like particle physics and astronomy. The
generation of data to explore this parameter space often requires large amounts
of computational resources. The commonly used solution of reducing the number
of relevant physical parameters hampers the generality of the results. In this
paper we show that this problem can be alleviated by the use of active
learning. We illustrate this with examples from high energy physics, a field
where simulations are often expensive and parameter spaces are
high-dimensional. We show that the active learning techniques
query-by-committee and query-by-dropout-committee allow for the identification
of model points in interesting regions of high-dimensional parameter spaces
(e.g. around decision boundaries). This makes it possible to constrain model
parameters more efficiently than is currently done with the most common
sampling algorithms and to train better performing machine learning models on
the same amount of data. Code implementing the experiments in this paper can be
found on GitHub.
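As a rough illustration of query-by-committee, the sketch below trains a small committee of classifiers on the points labeled so far and queries the pool point with the highest vote entropy, so that new (simulated) evaluations concentrate near the decision boundary. The toy oracle, the low-dimensional parameter box, and the random-forest committee are illustrative assumptions and do not reproduce the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oracle(X):
    """Toy stand-in for an expensive simulation: label is whether a point
    lies inside the unit sphere in parameter space."""
    return (np.linalg.norm(X, axis=1) < 1.0).astype(int)

def vote_entropy(committee, X):
    """Committee disagreement, measured as the entropy of the vote
    distribution at each candidate point."""
    votes = np.stack([m.predict(X) for m in committee])   # (members, points)
    p1 = votes.mean(axis=0)
    p = np.clip(np.stack([1 - p1, p1]), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=0)

rng = np.random.default_rng(0)
dim, n_pool, n_init, n_queries, members = 3, 5000, 40, 100, 5
pool = rng.uniform(-1.5, 1.5, size=(n_pool, dim))
labeled_idx = list(rng.choice(n_pool, n_init, replace=False))

for _ in range(n_queries):
    X, y = pool[labeled_idx], oracle(pool[labeled_idx])
    committee = [RandomForestClassifier(n_estimators=25, random_state=s).fit(X, y)
                 for s in range(members)]
    scores = vote_entropy(committee, pool)
    scores[labeled_idx] = -np.inf          # never re-query a labeled point
    labeled_idx.append(int(np.argmax(scores)))

print("mean |x| of actively queried points (boundary is at 1.0):",
      float(np.linalg.norm(pool[labeled_idx[n_init:]], axis=1).mean()))
```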