129,834 research outputs found
Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation
A fundamental problem arising in many applications in Web science and social
network analysis is, given an arbitrary approximation factor $c > 1$, to output a
set of nodes that with high probability contains all nodes of PageRank at
least $\Delta$, and no node of PageRank smaller than $\Delta/c$. We call this
problem {\sc SignificantPageRanks}. We develop a nearly optimal, local
algorithm for the problem with runtime complexity $\tilde{O}(n/\Delta)$ on
networks with $n$ nodes. We show that any algorithm for solving this problem
must have runtime of $\Omega(n/\Delta)$, rendering our algorithm optimal up
to logarithmic factors.
Our algorithm comes with two main technical contributions. The first is a
multi-scale sampling scheme for a basic matrix problem that could be of
interest in its own right. In the abstract matrix problem it is assumed that one can
access an unknown {\em right-stochastic matrix} by querying its rows, where the
cost of a query and the accuracy of the answers depend on a precision parameter
$\epsilon$. At a cost proportional to $1/\epsilon$, the query returns a
list of $O(1/\epsilon)$ entries and their indices that provide an
$\epsilon$-precision approximation of the row. Our task is to find a set that
contains all columns whose sum is at least a given threshold $\Delta$, and omits any column whose
sum is less than $\Delta/c$. Our multi-scale sampling scheme solves this
problem at a significantly lower cost than traditional sampling algorithms.
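As a concrete illustration of the row-query model, the following minimal Python sketch implements a naive single-precision baseline: every row is queried at one fixed precision and the reported entries are accumulated into approximate column sums. The `query_row` helper, the fixed precision `eps`, and the example thresholds `delta` and `c` are illustrative assumptions; the multi-scale scheme described above varies the precision across queries, which this baseline does not attempt.

```python
import numpy as np

def query_row(A, i, eps):
    """Simulate an eps-precision row query: return (column, value) pairs for
    entries of row i that are at least eps; smaller entries go unreported.
    The cost of such a query is modeled as proportional to 1/eps."""
    row = A[i]
    cols = np.nonzero(row >= eps)[0]
    return list(zip(cols, row[cols]))

def significant_columns_naive(A, delta, c, eps):
    """Naive single-scale baseline: query every row at the same precision eps,
    accumulate approximate column sums, and report columns whose estimate
    clears the lower end of the (delta / c, delta) gap."""
    n, m = A.shape
    col_sums = np.zeros(m)
    for i in range(n):
        for j, v in query_row(A, i, eps):
            col_sums[j] += v
    return {j for j in range(m) if col_sums[j] >= delta / c}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((200, 40))
    A /= A.sum(axis=1, keepdims=True)   # make the matrix right-stochastic
    print(sorted(significant_columns_naive(A, delta=6.0, c=2.0, eps=1e-3)))
```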
Our second main technical contribution is a new local algorithm for
approximating personalized PageRank, which is more robust than the earlier ones
developed in \cite{JehW03,AndersenCL06} and is highly efficient, particularly
for networks with large in-degrees or out-degrees. Together with our multi-scale
sampling scheme, we are able to optimally solve the {\sc SignificantPageRanks}
problem.

Comment: Accepted to Internet Mathematics journal for publication. An extended
abstract of this paper appeared in WAW 2012 under the title "A Sublinear Time
Algorithm for PageRank Computations".
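For readers unfamiliar with local PageRank approximation, the sketch below shows a classic push-style estimator of personalized PageRank in the spirit of the \cite{AndersenCL06} line of work; it is not the more robust local algorithm introduced in this paper. The graph representation, teleport probability `alpha`, tolerance `eps`, and dangling-node handling are illustrative assumptions.

```python
from collections import defaultdict

def approximate_ppr(out_neighbors, seed, alpha=0.15, eps=1e-4):
    """Push-style local approximation of personalized PageRank from `seed`.
    out_neighbors: dict mapping node -> list of out-neighbors.
    Residual mass below eps * out-degree is never pushed, so only a small
    neighborhood of the seed is ever touched."""
    p = defaultdict(float)   # current PageRank estimate
    r = defaultdict(float)   # residual mass still to be distributed
    r[seed] = 1.0
    while True:
        # nodes whose residual is large relative to their out-degree
        active = [u for u, ru in r.items()
                  if ru >= eps * max(len(out_neighbors.get(u, [])), 1)]
        if not active:
            break
        for u in active:
            push, r[u] = r[u], 0.0
            p[u] += alpha * push
            nbrs = out_neighbors.get(u, [])
            if nbrs:
                share = (1 - alpha) * push / len(nbrs)
                for v in nbrs:
                    r[v] += share
            else:
                r[seed] += (1 - alpha) * push   # dangling node: return mass to seed
    return dict(p)

if __name__ == "__main__":
    g = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
    print(approximate_ppr(g, seed=0, eps=1e-5))
```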
Respondent-Driven Sampling: An Assessment of Current Methodology
Respondent-Driven Sampling (RDS) employs a variant of a link-tracing network
sampling strategy to collect data from hard-to-reach populations. By tracing
the links in the underlying social network, the process exploits the social
structure to expand the sample and reduce its dependence on the initial
(convenience) sample.
The primary goal of RDS is typically to estimate population averages in the
hard-to-reach population. The current estimates make strong assumptions in
order to treat the data as a probability sample. In particular, we evaluate
three critical sensitivities of the estimators: to bias induced by the initial
sample, to uncontrollable features of respondent behavior, and to the
without-replacement structure of sampling.
This paper sounds a cautionary note for the users of RDS. While current RDS
methodology is powerful and clever, the favorable statistical properties
claimed for the current estimates are shown to be heavily dependent on often
unrealistic assumptions.

Comment: 35 pages, 29 figures, under review
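To make the recruitment mechanism and the standard estimator concrete, the following toy sketch simulates idealized coupon-based referrals on a synthetic network and computes an inverse-degree-weighted (Volz-Heckathorn-type) estimate of a population mean. The network model, seed choice, coupon count, and outcome variable are illustrative assumptions, not part of the paper's analysis.

```python
import random
import networkx as nx

def rds_sample(G, seeds, coupons=3, target=200, rng=random):
    """Simulate idealized RDS recruitment: each recruit passes up to `coupons`
    coupons to randomly chosen neighbors not yet in the sample."""
    sample, wave = list(seeds), list(seeds)
    in_sample = set(seeds)
    while wave and len(sample) < target:
        next_wave = []
        for u in wave:
            candidates = [v for v in G.neighbors(u) if v not in in_sample]
            for v in rng.sample(candidates, min(coupons, len(candidates))):
                in_sample.add(v)
                sample.append(v)
                next_wave.append(v)
                if len(sample) >= target:
                    return sample
        wave = next_wave
    return sample

def vh_estimate(G, sample, outcome):
    """Inverse-degree-weighted (Volz-Heckathorn-type) estimate of the mean
    of `outcome` in the hidden population."""
    weights = [1.0 / G.degree(u) for u in sample]
    return sum(w * outcome[u] for w, u in zip(weights, sample)) / sum(weights)

if __name__ == "__main__":
    G = nx.barabasi_albert_graph(5000, 3, seed=1)
    outcome = {u: int(G.degree(u) > 6) for u in G}   # toy trait, degree-correlated
    sample = rds_sample(G, seeds=[0, 1, 2], target=300)
    print("naive sample mean:", sum(outcome[u] for u in sample) / len(sample))
    print("VH-type estimate:", vh_estimate(G, sample, outcome))
    print("true mean:", sum(outcome.values()) / len(outcome))
```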
Quick Detection of High-degree Entities in Large Directed Networks
In this paper, we address the problem of quick detection of high-degree
entities in large online social networks. The practical importance of this problem
is attested by the large number of companies that continuously collect and update
statistics about popular entities, usually using the degree of an entity as an
approximation of its popularity. We suggest a simple, efficient, and
easy-to-implement two-stage randomized algorithm that provides highly accurate
solutions to this problem. For instance, our algorithm needs only one thousand
API requests to find, with more than 90% precision, the top-100 most followed
users in Twitter, a network with approximately a billion registered users.
Our algorithm significantly outperforms existing methods and serves
many different purposes, such as finding the most popular users or the most
popular interest groups in social networks. An important contribution of this
work is the analysis of the proposed algorithm using Extreme Value Theory -- a
branch of probability that studies extreme events and properties of largest
order statistics in random samples. Using this theory, we derive an accurate
prediction for the algorithm's performance and show that the number of API
requests for finding the top-k most popular entities is sublinear in the number
of entities. Moreover, we formally show that the high variability among the
entities, expressed through heavy-tailed distributions, is the reason for the
algorithm's efficiency. We quantify this phenomenon in a rigorous mathematical
way.
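A minimal sketch of a two-stage sample-and-count strategy of the kind described: stage one samples random users and tallies the accounts they follow, stage two spends the remaining budget verifying the in-degrees of the most frequently seen candidates. The synthetic graph, the budget split (`n1`, `n2`), and the helper names are illustrative assumptions rather than the authors' exact procedure.

```python
import random
from collections import Counter
import networkx as nx

def top_k_by_indegree(G, k=10, n1=500, n2=50, rng=random):
    """Two-stage sketch: (1) sample n1 random nodes and count how often each
    node appears in their followee lists; (2) spend n2 'API requests' looking
    up the true in-degree of the most frequently seen candidates."""
    nodes = list(G.nodes())
    counts = Counter()
    for u in rng.sample(nodes, n1):                 # stage 1: count followees
        counts.update(G.successors(u))
    candidates = [v for v, _ in counts.most_common(n2)]
    verified = {v: G.in_degree(v) for v in candidates}   # stage 2: verify
    return sorted(verified, key=verified.get, reverse=True)[:k]

if __name__ == "__main__":
    G = nx.scale_free_graph(20000, seed=2)          # heavy-tailed directed toy graph
    found = top_k_by_indegree(G, k=10)
    true_top = sorted(G.nodes(), key=G.in_degree, reverse=True)[:10]
    print("precision@10:", len(set(found) & set(true_top)) / 10)
```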
Social Bootstrapping: How Pinterest and Last.fm Social Communities Benefit by Borrowing Links from Facebook
How does one develop a new online community that is highly engaging to each
user and promotes social interaction? A number of websites offer friend-finding
features that help users bootstrap social networks on the website by copying
links from an established network like Facebook or Twitter. This paper
quantifies the extent to which such social bootstrapping is effective in
enhancing the social experience of the website. First, we develop a stylised
analytical model that suggests that copying tends to produce a giant connected
component (i.e., a connected community) quickly and preserves properties such
as reciprocity and clustering, up to a linear multiplicative factor. Second, we
use data from two websites, Pinterest and Last.fm, to empirically compare the
subgraph of links copied from Facebook to links created natively. We find that
the copied subgraph has a giant component, higher reciprocity and clustering,
and confirm that the copied connections see higher social interactions.
However, the need for copying diminishes as users become more active and
influential. Such users tend to create links natively on the website, to users
who are more similar to them than their Facebook friends. Our findings give new
insights into understanding how bootstrapping from established social networks
can help engage new users by enhancing social interactivity.

Comment: Proc. 23rd International World Wide Web Conference (WWW), 2014
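The comparison described above can be mimicked on synthetic data: treat a random subset of edges as the "copied" links and compare reciprocity, clustering, and the relative size of the largest weakly connected component against the full graph. The graph generator and the 30% copying fraction below are illustrative assumptions; the paper's measurements use real Pinterest and Last.fm data.

```python
import random
import networkx as nx

def edge_subgraph_stats(G, edges):
    """Reciprocity, average clustering, and relative size of the largest
    weakly connected component for the subgraph built from `edges`."""
    H = nx.DiGraph()
    H.add_edges_from(edges)
    giant = max(nx.weakly_connected_components(H), key=len)
    return {
        "reciprocity": nx.reciprocity(H),
        "clustering": nx.average_clustering(H.to_undirected()),
        "giant_frac": len(giant) / max(H.number_of_nodes(), 1),
    }

if __name__ == "__main__":
    random.seed(3)
    G = nx.DiGraph(nx.scale_free_graph(3000, seed=3))       # drop parallel edges
    G.remove_edges_from(list(nx.selfloop_edges(G)))
    edges = list(G.edges())
    copied = random.sample(edges, int(0.3 * len(edges)))    # stand-in for copied links
    print("full graph:", edge_subgraph_stats(G, edges))
    print("copied subgraph:", edge_subgraph_stats(G, copied))
```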
Constraining the Parameters of High-Dimensional Models with Active Learning
Constraining the parameters of physical models with many parameters is a
widespread problem in fields like particle physics and astronomy. The
generation of data to explore this parameter space often requires large amounts
of computational resources. The commonly used solution of reducing the number
of relevant physical parameters hampers the generality of the results. In this
paper we show that this problem can be alleviated by the use of active
learning. We illustrate this with examples from high energy physics, a field
where simulations are often expensive and parameter spaces are
high-dimensional. We show that the active learning techniques
query-by-committee and query-by-dropout-committee allow for the identification
of model points in interesting regions of high-dimensional parameter spaces
(e.g. around decision boundaries). This makes it possible to constrain model
parameters more efficiently than is currently done with the most common
sampling algorithms and to train better performing machine learning models on
the same amount of data. Code implementing the experiments in this paper can be
found on GitHub.
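As a rough illustration of query-by-committee, the sketch below trains a small committee of classifiers on the points labeled so far and queries the pool point with the highest vote entropy, so that new (simulated) evaluations concentrate near the decision boundary. The toy oracle, the low-dimensional parameter box, and the random-forest committee are illustrative assumptions and do not reproduce the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oracle(X):
    """Toy stand-in for an expensive simulation: label is whether a point
    lies inside the unit sphere in parameter space."""
    return (np.linalg.norm(X, axis=1) < 1.0).astype(int)

def vote_entropy(committee, X):
    """Committee disagreement, measured as the entropy of the vote
    distribution at each candidate point."""
    votes = np.stack([m.predict(X) for m in committee])   # (members, points)
    p1 = votes.mean(axis=0)
    p = np.clip(np.stack([1 - p1, p1]), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=0)

rng = np.random.default_rng(0)
dim, n_pool, n_init, n_queries, members = 3, 5000, 40, 100, 5
pool = rng.uniform(-1.5, 1.5, size=(n_pool, dim))
labeled_idx = list(rng.choice(n_pool, n_init, replace=False))

for _ in range(n_queries):
    X, y = pool[labeled_idx], oracle(pool[labeled_idx])
    committee = [RandomForestClassifier(n_estimators=25, random_state=s).fit(X, y)
                 for s in range(members)]
    scores = vote_entropy(committee, pool)
    scores[labeled_idx] = -np.inf          # never re-query a labeled point
    labeled_idx.append(int(np.argmax(scores)))

print("mean |x| of actively queried points (boundary is at 1.0):",
      float(np.linalg.norm(pool[labeled_idx[n_init:]], axis=1).mean()))
```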