2 research outputs found
Sampling Online Social Networks via Heterogeneous Statistics
Most sampling techniques for online social networks (OSNs) are based on a
particular sampling method on a single graph, which is referred to as a
statistic. However, various sampling methods on different graphs could
possibly be used in the same OSN, and they may lead to different sampling
efficiencies, i.e., asymptotic variances. To utilize multiple statistics for
accurate measurements, we formulate a mixture sampling problem, through which
we construct a mixture unbiased estimator which minimizes asymptotic variance.
Given fixed sampling budgets for different statistics, we derive the optimal
weights to combine the individual estimators; given fixed total budget, we show
that a greedy allocation towards the most efficient statistics is optimal. In
practice, the sampling efficiencies of statistics can be quite different for
various targets and are unknown before sampling. To solve this problem, we
design a two-stage framework which adaptively spends a partial budget to test
different statistics and allocates the remaining budget to the inferred best
statistics. We show that our two-stage framework is a generalization of 1)
randomly choosing a statistic and 2) evenly allocating the total budget among
all available statistics, and that our adaptive algorithm achieves higher
efficiency than these benchmark strategies in both theory and experiments.
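The two-stage idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each statistic yields independent draws (so the variance of a sample mean is the sample variance divided by the sample size, rather than an asymptotic variance of a random walk), it combines the per-statistic means with standard inverse-variance weights, and the weights use estimated rather than true variances. The names combine_estimates, two_stage_sample, pilot_fraction, and the Gaussian samplers are hypothetical and chosen only for illustration.

```python
import random
import statistics


def combine_estimates(means, variances):
    """Combine independent unbiased estimates with inverse-variance weights."""
    # A tiny floor avoids division by zero when a variance estimate is 0.
    inv = [1.0 / max(v, 1e-12) for v in variances]
    total = sum(inv)
    weights = [w / total for w in inv]
    estimate = sum(w * m for w, m in zip(weights, means))
    return estimate, weights


def two_stage_sample(samplers, total_budget, pilot_fraction=0.2, seed=0):
    """Spend a pilot budget on every sampler, then the rest on the best one.

    Assumption: each sampler is a callable taking a random.Random instance
    and returning one independent draw with the same expected value.
    """
    rng = random.Random(seed)
    k = len(samplers)
    # Stage 1: split the pilot budget evenly (at least 2 draws per sampler).
    pilot_per = max(2, int(total_budget * pilot_fraction) // k)
    draws = [[s(rng) for _ in range(pilot_per)] for s in samplers]
    pilot_vars = [statistics.variance(d) for d in draws]
    # Stage 2: give the remaining budget to the lowest-variance sampler.
    best = min(range(k), key=lambda i: pilot_vars[i])
    remaining = max(0, total_budget - pilot_per * k)
    draws[best].extend(samplers[best](rng) for _ in range(remaining))
    # Combine all per-sampler means, weighting by estimated precision.
    means = [statistics.mean(d) for d in draws]
    var_of_means = [statistics.variance(d) / len(d) for d in draws]
    estimate, _ = combine_estimates(means, var_of_means)
    return estimate, best, [len(d) for d in draws]


# Toy usage: two hypothetical samplers with the same mean, different noise.
samplers = [lambda rng: rng.gauss(5.0, 1.0), lambda rng: rng.gauss(5.0, 10.0)]
estimate, best, counts = two_stage_sample(samplers, total_budget=1000)
```

Setting pilot_fraction to 1.0 in this sketch spends the whole budget evenly across the samplers, which corresponds to the even-allocation benchmark mentioned above.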
Do we really need to catch them all? A new User-guided Social Media Crawling method
With the growing use of popular social media services like Facebook and
Twitter, it is challenging to collect all content from the networks without
access to the core infrastructure or paying for it. Thus, if all content cannot
be collected one must consider which data are of most importance. In this work
we present a novel User-guided Social Media Crawling method (USMC) that is able
to collect data from social media, utilizing the wisdom of the crowd to decide
the order in which user generated content should be collected to cover as many
user interactions as possible. USMC is validated by crawling 160 public
Facebook pages, containing content from 368 million users including 1.3 billion
interactions, and it is compared with two other crawling methods. The results
show that it is possible to cover approximately 75% of the interactions on a
Facebook page by sampling just 20% of its posts, and at the same time reduce
the crawling time by 53%. In addition, the social network constructed from the
20% sample contains more than 75% of the users and edges of the social
network created from all posts, and it has a similar degree distribution.
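The coverage figure reported in this abstract (share of interactions captured by a fraction of the posts) can be made concrete with a small sketch. This is not the USMC method: the abstract does not specify how the crawl order is decided, so the sketch assumes an idealized ordering in which posts are ranked by their known interaction counts. The function name interaction_coverage and the toy counts are hypothetical and are not data from the paper.

```python
def interaction_coverage(interactions_per_post, post_fraction):
    """Share of all interactions covered by the top-ranked fraction of posts.

    Assumption: posts are crawled in descending order of interaction count,
    an idealized stand-in for whatever ordering a real crawler would use.
    """
    ranked = sorted(interactions_per_post, reverse=True)
    n_keep = round(len(ranked) * post_fraction)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n_keep]) / total


# Toy data: 10 posts with a skewed interaction distribution (made up).
toy_counts = [40, 30, 10, 8, 5, 3, 2, 1, 1, 0]
coverage = interaction_coverage(toy_counts, 0.2)  # top 2 of 10 posts
```

When interactions are heavily concentrated on a few posts, as in the toy list, a small fraction of posts accounts for most of the interactions, which is the effect the abstract relies on.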