Characterizing Directed and Undirected Networks via Multidimensional Walks with Jumps
Estimating distributions of node characteristics (labels) such as number of
connections or citizenship of users in a social network via edge and node
sampling is a vital part of the study of complex networks. Due to its low cost,
sampling via a random walk (RW) has been proposed as an attractive solution to
this task. Most RW methods assume either that the network is undirected or that
walkers can traverse edges regardless of their direction. Some RW methods have
been designed for directed networks where edges coming into a node are not
directly observable. In this work, we propose Directed Unbiased Frontier
Sampling (DUFS), a sampling method based on a large number of coordinated
walkers, each starting from a node chosen uniformly at random. It is applicable
to directed networks with invisible incoming edges because it constructs, in
real-time, an undirected graph consistent with the walkers' trajectories, and
because it uses random jumps that prevent walkers from being trapped. DUFS
generalizes previous RW methods and is suited to both undirected networks and
directed networks regardless of in-edge visibility. We also propose an
improved estimator of node label distributions that combines information from
the initial walker locations with subsequent RW observations. We evaluate DUFS,
compare it to other RW methods, investigate the impact of its parameters on
estimation accuracy and provide practical guidelines for choosing them. In
estimating out-degree distributions, DUFS yields significantly better estimates
of the head of the distribution than other methods, while matching or exceeding
estimation accuracy of the tail. Last, we show that DUFS outperforms uniform
node sampling when estimating distributions of node labels of the top 10%
largest degree nodes, even when sampling a node uniformly has the same cost as
RW steps.
Comment: 35 pages, submitted to ACM Transactions on Knowledge Discovery from
Data (TKDD)
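As a rough illustration of the random-walk estimation idea that this abstract builds on (not DUFS itself), the sketch below runs a single simple random walk on a small undirected graph and corrects its degree bias by weighting each visit by 1/degree. The function name `rw_degree_distribution` and the toy graph are assumptions made for this example.

```python
import random
from collections import defaultdict


def rw_degree_distribution(adj, steps, seed=0):
    """Estimate the degree distribution from one simple random walk.

    adj maps each node to a list of its neighbors (undirected graph).
    """
    rng = random.Random(seed)
    node = rng.choice(sorted(adj))          # arbitrary starting node
    weight_by_degree = defaultdict(float)
    for _ in range(steps):
        node = rng.choice(adj[node])        # move to a uniformly chosen neighbor
        degree = len(adj[node])
        # The walk visits nodes proportionally to their degree, so each
        # visit is down-weighted by 1/degree to undo that bias.
        weight_by_degree[degree] += 1.0 / degree
    total = sum(weight_by_degree.values())
    return {d: w / total for d, w in sorted(weight_by_degree.items())}


# Toy graph (an assumption for this example): triangle 0-1-2 plus node 3 attached to 0.
toy_graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
estimate = rw_degree_distribution(toy_graph, steps=50_000, seed=1)
print(estimate)  # true distribution: degree 1 -> 0.25, degree 2 -> 0.50, degree 3 -> 0.25
```

A single walker like this cannot leave its connected component, which is the limitation that random jumps and multiple walkers are meant to address.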
Estimating Subgraph Frequencies with or without Attributes from Egocentrically Sampled Data
In this paper we show how to efficiently produce unbiased estimates of
subgraph frequencies from a probability sample of egocentric networks (i.e.,
focal nodes, their neighbors, and the induced subgraphs of ties among their
neighbors). A key feature of our proposed method that differentiates it from
prior methods is the use of egocentric data. Because of this, our method is
suitable for estimation in large unknown graphs, is easily parallelizable,
handles privacy sensitive network data (e.g. egonets with no neighbor labels),
and supports counting of large subgraphs (e.g. maximal clique of size 205 in
Section 6) by building on top of existing exact subgraph counting algorithms
that may not support sampling. It gracefully handles a variety of sampling
designs such as uniform or weighted independence or random walk sampling. Our
method can be used for subgraphs that are: (i) undirected or directed; (ii)
induced or non-induced; (iii) maximal or non-maximal; and (iv) potentially
annotated with attributes. We compare our estimators on a variety of real-world
graphs and sampling methods and provide suggestions for their use. Simulation
shows that our method outperforms the state-of-the-art approach for relative
subgraph frequencies by up to an order of magnitude for the same sample size.
Finally, we apply our methodology to a rare sample of Facebook users across the
social graph to estimate and interpret the clique size distribution and gender
composition of cliques.
Comment: This article generalizes the methods we introduced in arXiv:1308.3297
to count any directed or undirected subgraph that is containable within an
egonet
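To make the egocentric idea concrete, here is a minimal sketch for the simplest case only: counting triangles from a uniform sample of egos. It is not the estimator from the paper. Every triangle appears in the egonets of its three corner nodes, so the sample total is scaled by N/n and divided by 3. The function names and the toy graph are assumptions for this example.

```python
from itertools import combinations


def triangles_at_ego(adj, ego):
    """Count triangles that contain `ego`, using only its egonet.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    neighbors = sorted(adj[ego])
    # A triangle through the ego is a pair of its neighbors that are tied.
    return sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])


def estimate_triangles(adj, sampled_egos, num_nodes):
    """Scale-up estimate of the total triangle count from uniformly sampled egos."""
    sample_total = sum(triangles_at_ego(adj, ego) for ego in sampled_egos)
    # Each triangle is seen from 3 egos, hence the division by 3.
    return (num_nodes / len(sampled_egos)) * sample_total / 3


# Toy graph (an assumption for this example): triangle 0-1-2 plus node 3 attached to 0.
toy_graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
census_estimate = estimate_triangles(toy_graph, [0, 1, 2, 3], num_nodes=4)
print(census_estimate)  # sampling every node recovers the exact count
```

With a partial sample the value varies from draw to draw; only its expectation over uniform samples equals the true count.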
Practical Characterization of Large Networks Using Neighborhood Information
Characterizing large online social networks (OSNs) through node querying is a
challenging task. OSNs often impose severe constraints on the query rate, hence
limiting the sample size to a small fraction of the total network. Various
ad-hoc subgraph sampling methods have been proposed, but many of them give
biased estimates and offer no theoretical guarantees on accuracy. In this work, we
focus on developing sampling methods for OSNs where querying a node also
reveals partial structural information about its neighbors. Our methods are
optimized for NoSQL graph databases (if the database can be accessed directly),
or use the Web APIs available on most major OSNs for graph sampling. We show
that our sampling method provably converges to an unbiased estimate, and that
it is more accurate than current state-of-the-art
methods. We characterize metrics such as node label density estimation and edge
label density estimation, two of the most fundamental network characteristics
from which other network characteristics can be derived. We evaluate our
methods on-the-fly over several live networks using their native APIs. Our
simulation studies over a variety of offline datasets show that by including
neighborhood information, our method drastically (4-fold) reduces the number of
samples required to achieve the same estimation accuracy as state-of-the-art
methods.
A Measurement Framework for Directed Networks
Partially observed network data collected by link-tracing-based sampling
methods are often studied to infer the characteristics of a large complex
network. However, little attention has been paid to sampling from directed
networks such as the WWW and peer-to-peer networks. In this paper, we propose a
novel two-step (sampling/estimation) framework to measure nodal characteristics
which can be defined by an average target function in an arbitrary directed
network. To this end, we propose a personalized PageRank-based algorithm to
visit and sample nodes. This algorithm only uses already visited nodes as local
information without any prior knowledge about the latent structure of the
network. Moreover, we introduce a new estimator based on approximate
importance sampling to estimate average target functions. The proposed
estimator uses the calculated PageRank value of each sampled node as an
approximation for the exact visiting probability. To the best of our knowledge,
this is the first study on correcting the bias of a sampling method by
re-weighting of measured values that considers the effect of approximation of
visiting probabilities. Comprehensive theoretical and empirical analyses of the
estimator demonstrate that it is asymptotically unbiased even in situations
where the stationary distribution of PageRank is poorly approximated.
Comment: 10 pages, 6 figures
A Fast Sampling Method of Exploring Graphlet Degrees of Large Directed and Undirected Graphs
Exploring small connected and induced subgraph patterns (CIS patterns, or
graphlets) has recently attracted considerable attention. Despite recent
efforts to compute the number of times a specific graphlet appears in a
large graph (i.e., the total number of CISes isomorphic to the graphlet),
little attention has been paid to characterizing a node's graphlet degree,
i.e., the number of CISes isomorphic to the graphlet that include the node,
which is an important metric for analyzing complex networks such as social and
biological networks. Similar to global graphlet counting, it is challenging to
compute node graphlet degrees for a large graph due to the combinatorial nature
of the problem. Unfortunately, previous methods of computing global graphlet
counts are not suited to solve this problem. In this paper we propose sampling
methods to estimate node graphlet degrees for undirected and directed graphs,
and analyze the error of our estimates. To the best of our knowledge, we are
the first to study this problem and give a fast, scalable solution. We conduct
experiments on a variety of real-world datasets that demonstrate that our
methods accurately and efficiently estimate node graphlet degrees for graphs
with millions of edges.
Estimating group properties in online social networks with a classifier
We consider the problem of obtaining unbiased estimates of group properties
in social networks when using a classifier for node labels. Inference for this
problem is complicated by two factors: the network is not known and must be
crawled, and even high-performance classifiers provide biased estimates of
group proportions. We propose and evaluate AdjustedWalk for addressing this
problem. This is a three-step procedure that entails: 1) walking the graph
starting from an arbitrary node; 2) learning a classifier on the nodes in the
walk; and 3) applying a post-hoc adjustment to classification labels. The walk
step provides the information necessary to make inferences over the nodes and
edges, while the adjustment step corrects for classifier bias in estimating
group proportions. This process provides de-biased estimates at the cost of
additional variance. We evaluate AdjustedWalk on four tasks: the proportion of
nodes belonging to a minority group, the proportion of the minority group among
high degree nodes, the proportion of within-group edges, and Coleman's
homophily index. Simulated and empirical graphs show that this procedure
performs well compared to optimal baselines in a variety of circumstances,
while indicating that variance increases can be large for low-recall
classifiers.
Comment: 19 pages, 6 figures, 1 table
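The abstract does not spell out the post-hoc adjustment. One common form of such a correction, shown here as an assumption and possibly different from what AdjustedWalk uses, inverts the classifier's error rates: if the classifier flags a fraction q of nodes as minority, with true-positive rate `tpr` and false-positive rate `fpr`, then the true proportion p satisfies q = p * tpr + (1 - p) * fpr. The function name `adjust_proportion` is hypothetical.

```python
def adjust_proportion(observed, tpr, fpr):
    """Correct a classifier-based proportion for known error rates.

    observed: fraction of nodes the classifier labels as the minority group
    tpr: true-positive rate (recall) of the classifier
    fpr: false-positive rate of the classifier
    Solves observed = p * tpr + (1 - p) * fpr for p. This is an assumed,
    textbook-style correction, not necessarily the paper's adjustment.
    """
    if tpr <= fpr:
        raise ValueError("classifier must be better than chance (tpr > fpr)")
    corrected = (observed - fpr) / (tpr - fpr)
    # Sampling noise can push the raw value slightly outside [0, 1].
    return min(1.0, max(0.0, corrected))


# A true minority share of 0.20 seen through a classifier with
# tpr = 0.8 and fpr = 0.1 is observed as 0.2 * 0.8 + 0.8 * 0.1 = 0.24.
adjusted = adjust_proportion(0.24, tpr=0.8, fpr=0.1)
print(adjusted)
```

Because the correction divides by `tpr - fpr`, any noise in the observed fraction is amplified when that gap is small, which is consistent with the abstract's remark that variance increases can be large for low-recall classifiers.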
Design of Efficient Sampling Methods on Hybrid Social-Affiliation Networks
Graph sampling via crawling has become increasingly popular and important for
measuring various characteristics of large-scale complex networks.
While powerful, it is known to be challenging when the graph is loosely
connected or disconnected, which slows down the convergence of random walks and
can cause poor estimation accuracy.
In this work, we observe that the graph under study, called the target graph,
usually does not exist in isolation. In many situations, the target graph is
related to an auxiliary graph and an affiliation graph, and the target graph
becomes well connected when we view it from the perspective of these three
graphs together, which we call a hybrid social-affiliation graph in this paper.
When directly sampling the target graph is difficult or inefficient, we can
indirectly sample it efficiently with the assistance of the other two graphs.
We design three sampling methods on such a hybrid social-affiliation network.
Experiments conducted on both synthetic and real datasets demonstrate the
effectiveness of our proposed methods.
Comment: 11 pages, 13 figures, technical report
Mining Top-k Sequential Patterns in Database Graphs: A New Challenging Problem and a Sampling-based Approach
In many real world networks, a vertex is usually associated with a
transaction database that comprehensively describes the behaviour of the
vertex. A typical example is the social network, where the behaviour of every
user is depicted by a transaction database that stores their daily posted
content. A transaction database is a set of transactions, where a transaction
is a set of items. Every path of the network is a sequence of vertices that
induces multiple sequences of transactions. The sequences of transactions
induced by all of the paths in the network form an extremely large sequence
database. Finding frequent sequential patterns from such a sequence database
discovers interesting subsequences that frequently appear in many paths of the
network. However, it is a challenging task, since the sequence database induced
by a database graph is too large to be explicitly induced and stored. In this
paper, we propose the novel notion of database graph, which naturally models a
wide spectrum of real world networks by associating each vertex with a
transaction database. Our goal is to find the top-k frequent sequential
patterns in the sequence database induced from a database graph. We prove that
this problem is #P-hard. To tackle this problem, we propose an efficient
two-step sampling algorithm that approximates the top-k frequent sequential
patterns with a provable quality guarantee. Extensive experimental results on
synthetic and real-world data sets demonstrate the effectiveness and efficiency
of our method.
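The sketch below illustrates why the induced sequence database grows so quickly. It assumes one reading of the definition in this abstract: a path induces one transaction sequence for every way of picking a single transaction from each vertex along the path. The names `induced_sequences` and `toy_db` and the data are hypothetical.

```python
from itertools import product


def induced_sequences(vertex_db, path):
    """List the transaction sequences induced by a path.

    vertex_db maps each vertex to its transaction database, given as a list
    of transactions (each transaction is a set of items).
    Assumed reading: pick one transaction per vertex, in path order.
    """
    return list(product(*(vertex_db[v] for v in path)))


# Toy database graph (an assumption): vertex "a" has 2 transactions, "b" has 3.
toy_db = {
    "a": [{"x", "y"}, {"z"}],
    "b": [{"x"}, {"y"}, {"z"}],
}
sequences = induced_sequences(toy_db, ["a", "b"])
print(len(sequences))  # product of the database sizes along the path
```

Under this reading, the number of sequences per path is the product of the transaction database sizes along it, so explicitly materializing the sequence database is impractical for long paths or large per-vertex databases.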
Sampling Online Social Networks by Random Walk with Indirect Jumps
Random walk-based sampling methods are gaining popularity and importance in
characterizing large networks. While powerful, they suffer from the slow mixing
problem when the graph is loosely connected, which results in poor estimation
accuracy. Random walk with jumps (RWwJ) can address the slow mixing problem but
it is inapplicable if the graph does not support uniform vertex sampling (UNI).
In this work, we develop methods that can efficiently sample a graph without
requiring UNI while still enjoying benefits similar to those of RWwJ. We observe
that many graphs under study, called target graphs, do not exist in isolation.
In many situations, a target graph is related to an auxiliary graph and a
bipartite graph, and they together form a better connected two-layered
network structure. This new viewpoint brings extra benefits to graph sampling:
if directly sampling a target graph is difficult, we can sample it indirectly
with the assistance of the other two graphs. We propose a series of new graph
sampling techniques by exploiting such a two-layered network structure to
estimate target graph characteristics. Experiments conducted on both synthetic
and real-world networks demonstrate the effectiveness and usefulness of these
new techniques.
Comment: 14 pages, 17 figures, extended version
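For context, here is a minimal sketch of a plain random walk with jumps, the baseline that this abstract sets out to replace, not the indirect-jump method it proposes. It assumes a common formulation: from a node of degree d, the walker jumps to a uniformly chosen node with probability w / (w + d) and otherwise moves to a uniformly chosen neighbor, so nodes are visited proportionally to d + w. The function name, the jump weight `w`, and the toy graph are assumptions for this example.

```python
import random
from collections import defaultdict


def rwj_degree_distribution(adj, steps, w=1.0, seed=0):
    """Estimate the degree distribution with a random walk with jumps.

    adj maps each node to a list of its neighbors (undirected graph).
    Requires uniform vertex sampling (UNI) for the jump step.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    node = rng.choice(nodes)
    weight_by_degree = defaultdict(float)
    for _ in range(steps):
        degree = len(adj[node])
        if rng.random() < w / (w + degree):
            node = rng.choice(nodes)        # jump: needs UNI
        else:
            node = rng.choice(adj[node])    # ordinary random-walk step
        degree = len(adj[node])
        # Visits are proportional to degree + w, so re-weight by the inverse.
        weight_by_degree[degree] += 1.0 / (degree + w)
    total = sum(weight_by_degree.values())
    return {d: x / total for d, x in sorted(weight_by_degree.items())}


# Disconnected toy graph (an assumption): triangle 0-1-2 and a separate edge 3-4.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
estimate = rwj_degree_distribution(toy_graph, steps=100_000, w=1.0, seed=1)
print(estimate)  # true distribution: degree 1 -> 0.4, degree 2 -> 0.6
```

The jump step is what lets the walker cross between components, and it is also the step that depends on UNI being available.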
Sampling-based Estimation of In-degree Distribution with Applications to Directed Complex Networks
The focus of this work is on estimation of the in-degree distribution in
directed networks from sampling network nodes or edges. A number of sampling
schemes are considered, including random sampling with and without replacement,
and several approaches based on random walks with possible jumps. When sampling
nodes, it is assumed that only the out-edges of that node are visible, that is,
the in-degree of that node is not observed. The suggested estimation of the
in-degree distribution is based on two approaches. The inversion approach
exploits the relation between the original and sample in-degree distributions,
and can estimate the bulk of the in-degree distribution, but not the tail of
the distribution. The tail of the in-degree distribution is estimated through
an asymptotic approach, which itself has two versions: one assuming a power-law
tail and the other for a tail of general form. The two estimation approaches
are examined on synthetic and real networks, with good performance results,
especially striking for the asymptotic approach.
Comment: 30 pages, 6 figures