
    Characterizing Directed and Undirected Networks via Multidimensional Walks with Jumps

    Estimating distributions of node characteristics (labels) such as number of connections or citizenship of users in a social network via edge and node sampling is a vital part of the study of complex networks. Due to its low cost, sampling via a random walk (RW) has been proposed as an attractive solution to this task. Most RW methods assume either that the network is undirected or that walkers can traverse edges regardless of their direction. Some RW methods have been designed for directed networks where edges coming into a node are not directly observable. In this work, we propose Directed Unbiased Frontier Sampling (DUFS), a sampling method based on a large number of coordinated walkers, each starting from a node chosen uniformly at random. It is applicable to directed networks with invisible incoming edges because it constructs, in real time, an undirected graph consistent with the walkers' trajectories, and because it uses random jumps, which prevent walkers from being trapped. DUFS generalizes previous RW methods and is suited to undirected networks and to directed networks regardless of in-edge visibility. We also propose an improved estimator of node label distributions that combines information from the initial walker locations with subsequent RW observations. We evaluate DUFS, compare it to other RW methods, investigate the impact of its parameters on estimation accuracy and provide practical guidelines for choosing them. In estimating out-degree distributions, DUFS yields significantly better estimates of the head of the distribution than other methods, while matching or exceeding estimation accuracy of the tail. Last, we show that DUFS outperforms uniform node sampling when estimating distributions of node labels of the top 10% largest degree nodes, even when sampling a node uniformly has the same cost as RW steps. Comment: 35 pages, submitted to ACM Transactions on Knowledge Discovery from Data (TKDD)
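The abstract above builds on plain random-walk sampling, which can be sketched as follows. This is not DUFS itself (no coordinated walkers, no jumps, no directed edges). It is a minimal illustration of the standard fact that a simple random walk on a connected undirected graph visits nodes in proportion to their degree, so each visit is reweighted by 1/degree when estimating a label distribution. The function names, the adjacency-dict representation, and the toy graph are all assumptions made for this example.

```python
import random
from collections import defaultdict


def random_walk(adj, start, steps, rng):
    """Simple random walk on an undirected graph given as {node: [neighbors]}.

    Returns the list of visited nodes (length == steps), including the start.
    """
    node = start
    visited = []
    for _ in range(steps):
        visited.append(node)
        node = rng.choice(adj[node])  # move to a uniformly chosen neighbor
    return visited


def reweighted_label_distribution(visited, adj, label):
    """Estimate the fraction of nodes carrying each label.

    Each visit is weighted by 1/degree to undo the degree bias of the walk,
    then the weights are normalized so the estimates sum to 1.
    """
    weights = defaultdict(float)
    for v in visited:
        weights[label[v]] += 1.0 / len(adj[v])
    total = sum(weights.values())
    return {lab: w / total for lab, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical toy graph: a path b - a - c, with one "x" node and two "y" nodes.
    adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    label = {"a": "x", "b": "y", "c": "y"}
    walk = random_walk(adj, "a", 1000, random.Random(0))
    print(reweighted_label_distribution(walk, adj, label))
```

Without the 1/degree weights, the high-degree node "a" would be overrepresented, since the walk returns to it on every second step.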

    Estimating Subgraph Frequencies with or without Attributes from Egocentrically Sampled Data

    In this paper we show how to efficiently produce unbiased estimates of subgraph frequencies from a probability sample of egocentric networks (i.e., focal nodes, their neighbors, and the induced subgraphs of ties among their neighbors). A key feature of our proposed method that differentiates it from prior methods is the use of egocentric data. Because of this, our method is suitable for estimation in large unknown graphs, is easily parallelizable, handles privacy-sensitive network data (e.g. egonets with no neighbor labels), and supports counting of large subgraphs (e.g. maximal clique of size 205 in Section 6) by building on top of existing exact subgraph counting algorithms that may not support sampling. It gracefully handles a variety of sampling designs such as uniform or weighted independence or random walk sampling. Our method can be used for subgraphs that are: (i) undirected or directed; (ii) induced or non-induced; (iii) maximal or non-maximal; and (iv) potentially annotated with attributes. We compare our estimators on a variety of real-world graphs and sampling methods and provide suggestions for their use. Simulation shows that our method outperforms the state-of-the-art approach for relative subgraph frequencies by up to an order of magnitude for the same sample size. Finally, we apply our methodology to a rare sample of Facebook users across the social graph to estimate and interpret the clique size distribution and gender composition of cliques. Comment: This article generalizes the methods we introduced in arxiv:1308.3297 to count any directed or undirected subgraph that is containable within an egonet

    Practical Characterization of Large Networks Using Neighborhood Information

    Characterizing large online social networks (OSNs) through node querying is a challenging task. OSNs often impose severe constraints on the query rate, hence limiting the sample size to a small fraction of the total network. Various ad-hoc subgraph sampling methods have been proposed, but many of them give biased estimates and offer no theoretical basis for their accuracy. In this work, we focus on developing sampling methods for OSNs where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases (if the database can be accessed directly), or use the Web APIs available on most major OSNs for graph sampling. We show that our sampling method has provable convergence guarantees on being an unbiased estimator, and that it is more accurate than current state-of-the-art methods. We characterize metrics such as node label density estimation and edge label density estimation, two of the most fundamental network characteristics from which other network characteristics can be derived. We evaluate our methods on-the-fly over several live networks using their native APIs. Our simulation studies over a variety of offline datasets show that by including neighborhood information, our method drastically (4-fold) reduces the number of samples required to achieve the same estimation accuracy as state-of-the-art methods.

    A Measurement Framework for Directed Networks

    Partially observed network data collected by link-tracing sampling methods is often studied to obtain the characteristics of a large complex network. However, little attention has been paid to sampling from directed networks such as WWW and Peer-to-Peer networks. In this paper, we propose a novel two-step (sampling/estimation) framework to measure nodal characteristics which can be defined by an average target function in an arbitrary directed network. To this end, we propose a personalized PageRank-based algorithm to visit and sample nodes. This algorithm only uses already visited nodes as local information without any prior knowledge about the latent structure of the network. Moreover, we introduce a new estimator based on approximate importance sampling to estimate average target functions. The proposed estimator uses the calculated PageRank value of each sampled node as an approximation for the exact visiting probability. To the best of our knowledge, this is the first study on correcting the bias of a sampling method by re-weighting of measured values that considers the effect of approximation of visiting probabilities. Comprehensive theoretical and empirical analyses of the estimator demonstrate that it is asymptotically unbiased even in situations where the stationary distribution of PageRank is poorly approximated. Comment: 10 pages, 6 figures

    A Fast Sampling Method of Exploring Graphlet Degrees of Large Directed and Undirected Graphs

    Exploring small connected and induced subgraph patterns (CIS patterns, or graphlets) has recently attracted considerable attention. Despite recent efforts on computing the number of times a specific graphlet appears in a large graph (i.e., the total number of CISes isomorphic to the graphlet), little attention has been paid to characterizing a node's graphlet degree, i.e., the number of CISes isomorphic to the graphlet that include the node, which is an important metric for analyzing complex networks such as social and biological networks. Similar to global graphlet counting, it is challenging to compute node graphlet degrees for a large graph due to the combinatorial nature of the problem. Unfortunately, previous methods of computing global graphlet counts are not suited to solve this problem. In this paper we propose sampling methods to estimate node graphlet degrees for undirected and directed graphs, and analyze the error of our estimates. To the best of our knowledge, we are the first to study this problem and give a fast scalable solution. We conduct experiments on a variety of real-world datasets that demonstrate that our methods accurately and efficiently estimate node graphlet degrees for graphs with millions of edges.

    Estimating group properties in online social networks with a classifier

    We consider the problem of obtaining unbiased estimates of group properties in social networks when using a classifier for node labels. Inference for this problem is complicated by two factors: the network is not known and must be crawled, and even high-performance classifiers provide biased estimates of group proportions. We propose and evaluate AdjustedWalk for addressing this problem. This is a three-step procedure which entails: 1) walking the graph starting from an arbitrary node; 2) learning a classifier on the nodes in the walk; and 3) applying a post-hoc adjustment to classification labels. The walk step provides the information necessary to make inferences over the nodes and edges, while the adjustment step corrects for classifier bias in estimating group proportions. This process provides de-biased estimates at the cost of additional variance. We evaluate AdjustedWalk on four tasks: the proportion of nodes belonging to a minority group, the proportion of the minority group among high degree nodes, the proportion of within-group edges, and Coleman's homophily index. Simulated and empirical graphs show that this procedure performs well compared to optimal baselines in a variety of circumstances, while indicating that variance increases can be large for low-recall classifiers. Comment: 19 pages, 6 figures, 1 table
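The abstract above does not spell out its post-hoc adjustment, so the sketch below uses a swapped-in, widely known technique: the "adjusted count" correction, which inverts the classifier's true- and false-positive rates. It may differ from what AdjustedWalk actually does. The function name and the example rates are assumptions made for illustration.

```python
def adjusted_proportion(observed_rate, tpr, fpr):
    """Correct a classifier-based group proportion for classifier bias.

    observed_rate: fraction of sampled nodes the classifier labels positive
    tpr: true-positive rate (recall), estimated on labeled data
    fpr: false-positive rate, estimated on labeled data

    If the true proportion is p, the expected observed rate is
    p * tpr + (1 - p) * fpr; solving for p gives the formula below.
    """
    if tpr <= fpr:
        raise ValueError("classifier must satisfy tpr > fpr")
    p = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clip to a valid proportion


if __name__ == "__main__":
    # Hypothetical numbers: 30% predicted positive, recall 0.8, FPR 0.1.
    print(adjusted_proportion(0.30, tpr=0.8, fpr=0.1))
```

Dividing by `tpr - fpr` amplifies sampling noise when that gap is small, which is consistent with the abstract's remark that variance can grow large for low-recall classifiers.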

    Design of Efficient Sampling Methods on Hybrid Social-Affiliation Networks

    Graph sampling via crawling has become increasingly popular and important for measuring various characteristics of large scale complex networks. While powerful, it is known to be challenging when the graph is loosely connected or disconnected, which slows down the convergence of random walks and can cause poor estimation accuracy. In this work, we observe that the graph under study, called the target graph, usually does not exist in isolation. In many situations, the target graph is related to an auxiliary graph and an affiliation graph, and the target graph becomes well connected when we view it from the perspective of these three graphs together, which we call a hybrid social-affiliation graph in this paper. When directly sampling the target graph is difficult or inefficient, we can indirectly sample it efficiently with the assistance of the other two graphs. We design three sampling methods on such a hybrid social-affiliation network. Experiments conducted on both synthetic and real datasets demonstrate the effectiveness of our proposed methods. Comment: 11 pages, 13 figures, technique report

    Mining Top-k Sequential Patterns in Database Graphs: A New Challenging Problem and a Sampling-based Approach

    In many real world networks, a vertex is usually associated with a transaction database that comprehensively describes the behaviour of the vertex. A typical example is the social network, where the behaviour of every user is depicted by a transaction database that stores their daily posted content. A transaction database is a set of transactions, where a transaction is a set of items. Every path of the network is a sequence of vertices that induces multiple sequences of transactions. The sequences of transactions induced by all of the paths in the network form an extremely large sequence database. Finding frequent sequential patterns from such a sequence database discovers interesting subsequences that frequently appear in many paths of the network. However, it is a challenging task, since the sequence database induced by a database graph is too large to be explicitly induced and stored. In this paper, we propose the novel notion of database graph, which naturally models a wide spectrum of real world networks by associating each vertex with a transaction database. Our goal is to find the top-k frequent sequential patterns in the sequence database induced from a database graph. We prove that this problem is #P-hard. To tackle this problem, we propose an efficient two-step sampling algorithm that approximates the top-k frequent sequential patterns with a provable quality guarantee. Extensive experimental results on synthetic and real-world data sets demonstrate the effectiveness and efficiency of our method.

    Sampling Online Social Networks by Random Walk with Indirect Jumps

    Random walk-based sampling methods are gaining popularity and importance in characterizing large networks. While powerful, they suffer from the slow mixing problem when the graph is loosely connected, which results in poor estimation accuracy. Random walk with jumps (RWwJ) can address the slow mixing problem but it is inapplicable if the graph does not support uniform vertex sampling (UNI). In this work, we develop methods that can efficiently sample a graph without the necessity of UNI but still enjoy benefits similar to those of RWwJ. We observe that many graphs under study, called target graphs, do not exist in isolation. In many situations, a target graph is related to an auxiliary graph and a bipartite graph, and they together form a better connected two-layered network structure. This new viewpoint brings extra benefits to graph sampling: if directly sampling a target graph is difficult, we can sample it indirectly with the assistance of the other two graphs. We propose a series of new graph sampling techniques by exploiting such a two-layered network structure to estimate target graph characteristics. Experiments conducted on both synthetic and real-world networks demonstrate the effectiveness and usefulness of these new techniques. Comment: 14 pages, 17 figures, extended version

    Sampling-based Estimation of In-degree Distribution with Applications to Directed Complex Networks

    The focus of this work is on estimation of the in-degree distribution in directed networks from sampling network nodes or edges. A number of sampling schemes are considered, including random sampling with and without replacement, and several approaches based on random walks with possible jumps. When a node is sampled, it is assumed that only its out-edges are visible, that is, its in-degree is not observed. The suggested estimation of the in-degree distribution is based on two approaches. The inversion approach exploits the relation between the original and sample in-degree distributions, and can estimate the bulk of the in-degree distribution, but not the tail of the distribution. The tail of the in-degree distribution is estimated through an asymptotic approach, which itself has two versions: one assuming a power-law tail and the other for a tail of general form. The two estimation approaches are examined on synthetic and real networks, with good performance results, especially striking for the asymptotic approach. Comment: 30 pages, 6 figures