    Why Do Cascade Sizes Follow a Power-Law?

    We introduce a random directed acyclic graph and use it to model the information diffusion network. Subsequently, we analyze the cascade generation model (CGM) introduced by Leskovec et al. [19]. Until now, only empirical studies of this model have been done. In this paper, we present the first theoretical proof that the sizes of cascades generated by the CGM follow a power-law distribution, which is consistent with multiple empirical analyses of large social networks. We compared the assumptions of our model with the Twitter social network and tested the goodness of approximation. Comment: 8 pages, 7 figures, accepted to WWW 201

    A survey of statistical network models

    Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, along with a host of more specialized professional network communities, has intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics. Comment: 96 pages, 14 figures, 333 references

    Efficient Exact and Approximate Algorithms for Computing Betweenness Centrality in Directed Graphs

    Graphs are an important tool to model data in different domains, including social networks, bioinformatics and the world wide web. Most of the networks formed in these domains are directed graphs, where all the edges have a direction and they are not symmetric. Betweenness centrality is an important index widely used to analyze networks. In this paper, first, given a directed network $G$ and a vertex $r \in V(G)$, we propose a new exact algorithm to compute the betweenness score of $r$. Our algorithm pre-computes a set $\mathcal{RV}(r)$, which is used to prune a huge amount of computations that do not contribute to the betweenness score of $r$. The time complexity of our exact algorithm depends on $|\mathcal{RV}(r)|$ and it is respectively $\Theta(|\mathcal{RV}(r)|\cdot|E(G)|)$ and $\Theta(|\mathcal{RV}(r)|\cdot|E(G)|+|\mathcal{RV}(r)|\cdot|V(G)|\log |V(G)|)$ for unweighted graphs and weighted graphs with positive weights. $|\mathcal{RV}(r)|$ is bounded from above by $|V(G)|-1$ and in most cases, it is a small constant. Then, for the cases where $\mathcal{RV}(r)$ is large, we present a simple randomized algorithm that samples from $\mathcal{RV}(r)$ and performs computations for only the sampled elements. We show that this algorithm provides an $(\epsilon,\delta)$-approximation of the betweenness score of $r$. Finally, we perform extensive experiments over several real-world datasets from different domains for several randomly chosen vertices as well as for the vertices with the highest betweenness scores. Our experiments reveal that in most cases, our algorithm significantly outperforms the most efficient existing randomized algorithms, in terms of both running time and accuracy. Our experiments also show that our proposed algorithm computes betweenness scores of all vertices in the sets of sizes 5, 10 and 15 much faster and more accurately than the most efficient existing algorithms. Comment: arXiv admin note: text overlap with arXiv:1704.0735
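
To make the quantity being computed concrete, here is a minimal baseline sketch, not the pruned algorithm from the abstract: it computes the betweenness score of a single vertex `r` in an unweighted directed graph by running a Brandes-style breadth-first search and dependency accumulation from every source. The function name `betweenness_of` and the adjacency-dictionary input format are assumptions made for illustration. The score is taken over ordered pairs of endpoints and is left unnormalized.

```python
from collections import deque


def betweenness_of(graph, r):
    """Unnormalized betweenness of vertex r in an unweighted directed graph.

    graph: dict mapping each vertex to an iterable of its out-neighbours
           (hypothetical input format chosen for this sketch).
    Baseline only: one BFS per source, i.e. O(|V| * |E|) overall.
    """
    total = 0.0
    for s in graph:
        if s == r:
            continue  # paths starting at r do not count towards r's score
        dist = {s: 0}    # BFS distance from s
        sigma = {s: 1}   # number of shortest s-v paths
        preds = {}       # predecessors of v on shortest paths from s
        order = []       # vertices in non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest vertices back to s.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds.get(w, ()):
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
        total += delta.get(r, 0.0)
    return total


if __name__ == "__main__":
    path = {"a": ["b"], "b": ["c"], "c": []}
    print(betweenness_of(path, "b"))
```

A pruning approach like the one the abstract describes would restrict the outer loop to a smaller set of sources; this sketch iterates over all of them.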

    A Primer on Causality in Data Science

    Many questions in Data Science are fundamentally causal in that our objective is to learn the effect of some exposure, randomized or not, on an outcome of interest. Even studies that are seemingly non-causal, such as those with the goal of prediction or prevalence estimation, have causal elements, including differential censoring or measurement. As a result, we, as Data Scientists, need to consider the underlying causal mechanisms that gave rise to the data, rather than simply the pattern or association observed in those data. In this work, we review the 'Causal Roadmap' of Petersen and van der Laan (2014) to provide an introduction to some key concepts in causal inference. Similar to other causal frameworks, the steps of the Roadmap include clearly stating the scientific question, defining the causal model, translating the scientific question into a causal parameter, assessing the assumptions needed to express the causal parameter as a statistical estimand, implementing statistical estimators including parametric and semi-parametric methods, and interpreting our findings. We believe that using such a framework in Data Science will help to ensure that our statistical analyses are guided by the scientific question driving our research, while avoiding over-interpreting our results. We focus on the effect of an exposure occurring at a single time point and highlight the use of targeted maximum likelihood estimation (TMLE) with Super Learner. Comment: 26 pages (with references); 4 figures
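
As an illustration of the step that expresses a causal parameter as a statistical estimand, consider a binary exposure at a single time point. The notation here is assumed for this sketch and does not come from the abstract: $W$ denotes baseline covariates, $A$ the exposure, $Y$ the outcome, and $Y_a$ the counterfactual outcome under exposure level $a$. Assuming no unmeasured confounding given $W$ and positivity, the average treatment effect $\mathbb{E}[Y_1 - Y_0]$ equals the standard G-computation estimand:

```latex
\Psi(P_0) \;=\; \mathbb{E}_{W}\big[\, \mathbb{E}(Y \mid A = 1, W) \;-\; \mathbb{E}(Y \mid A = 0, W) \,\big]
```

Estimators such as TMLE target a quantity of this form; the inner conditional expectations are what a learner such as Super Learner would be used to estimate.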

    Regression and Singular Value Decomposition in Dynamic Graphs

    Most real-world graphs are dynamic, i.e., they change over time. However, while problems such as regression and Singular Value Decomposition (SVD) have been studied for static graphs, they have not yet been investigated for dynamic graphs. In this paper, we introduce, motivate and study regression and SVD over dynamic graphs. First, we present the notion of an update-efficient matrix embedding, which defines the conditions sufficient for a matrix embedding to be used for the dynamic graph regression problem (under the $l_2$ norm). We prove that given an $n \times m$ update-efficient matrix embedding (e.g., adjacency matrix), after an update operation in the graph, the optimal solution of the graph regression problem for the revised graph can be computed in $O(nm)$ time. We also study dynamic graph regression under least absolute deviation. Then, we characterize a class of matrix embeddings that can be used to efficiently update the SVD of a dynamic graph. For the adjacency matrix and the Laplacian matrix, we study those graph update operations for which the SVD (and low rank approximation) can be updated efficiently.