Why Do Cascade Sizes Follow a Power-Law?
We introduce a random directed acyclic graph and use it to model the
information diffusion network. Subsequently, we analyze the cascade generation
model (CGM) introduced by Leskovec et al. [19]. Until now only empirical
studies of this model were done. In this paper, we present the first
theoretical proof that the sizes of cascades generated by the CGM follow the
power-law distribution, which is consistent with multiple empirical analyses of
the large social networks. We compared the assumptions of our model with the
Twitter social network and tested the goodness of approximation.
Comment: 8 pages, 7 figures, accepted to WWW 201
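As a rough illustration of what checking a power-law claim on observed cascade sizes can involve, the sketch below estimates the exponent with the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)). This is not the procedure from the paper. The function name `powerlaw_alpha_mle` and the `x_min` cutoff are assumptions made for this example, and applying the continuous formula to integer cascade sizes is only an approximation.

```python
import math


def powerlaw_alpha_mle(sizes, x_min=1.0):
    """Continuous MLE of the power-law exponent for values >= x_min.

    Hypothetical helper for illustration; cascade sizes are integers,
    so the continuous estimator is only an approximation here.
    """
    # Keep only the tail that the power law is assumed to describe.
    tail = [x for x in sizes if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0.0:
        raise ValueError("need at least one observation strictly above x_min")
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Toy cascade sizes, made up for the example.
    print(powerlaw_alpha_mle([1, 1, 2, 2, 3, 5, 8, 40], x_min=1.0))
```

A goodness-of-fit check would still be needed on top of the point estimate; the estimator alone does not show that the data follow a power law.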
A survey of statistical network models
Networks are ubiquitous in science and have become a focal point for
discussion in everyday life. Formal statistical models for the analysis of
network data have emerged as a major topic of interest in diverse areas of
study, and most of these involve a form of graphical representation.
Probability models on graphs date back to 1959. Along with empirical studies in
social psychology and sociology from the 1960s, these early works generated an
active network community and a substantial literature in the 1970s. This effort
moved into the statistical literature in the late 1970s and 1980s, and the past
decade has seen a burgeoning network literature in statistical physics and
computer science. The growth of the World Wide Web and the emergence of online
networking communities such as Facebook, MySpace, and LinkedIn, and a host of
more specialized professional network communities has intensified interest in
the study of networks and network data. Our goal in this review is to provide
the reader with an entry point to this burgeoning literature. We begin with an
overview of the historical development of statistical network modeling and then
we introduce a number of examples that have been studied in the network
literature. Our subsequent discussion focuses on a number of prominent static
and dynamic network models and their interconnections. We emphasize formal
model descriptions, and pay special attention to the interpretation of
parameters and their estimation. We end with a description of some open
problems and challenges for machine learning and statistics.
Comment: 96 pages, 14 figures, 333 reference
Efficient Exact and Approximate Algorithms for Computing Betweenness Centrality in Directed Graphs
Graphs are an important tool to model data in different domains, including
social networks, bioinformatics and the world wide web. Most of the networks
formed in these domains are directed graphs, where all the edges have a
direction and they are not symmetric. Betweenness centrality is an important
index widely used to analyze networks. In this paper, given a directed
network and a vertex in it, we first propose a new exact algorithm to
compute the betweenness score of that vertex. Our algorithm pre-computes a set
that is used to prune a large amount of computation that does
not contribute to the betweenness score of the vertex. The time complexity of our exact
algorithm depends on the size of this set, with separate bounds
for unweighted graphs and for weighted graphs with positive weights.
The size of the set is bounded from above and, in most cases, it
is a small constant. Then, for the cases where the set is large, we
present a simple randomized algorithm that samples from the set and
performs computations for only the sampled elements. We show that this
algorithm provides an approximation of the betweenness
score of the vertex. Finally, we perform extensive experiments over several real-world
datasets from different domains for several randomly chosen vertices as well as
for the vertices with the highest betweenness scores. Our experiments reveal
that in most cases, our algorithm significantly outperforms the most efficient
existing randomized algorithms, in terms of both running time and accuracy. Our
experiments also show that our proposed algorithm computes betweenness scores
of all vertices in the sets of sizes 5, 10 and 15, much faster and more
accurately than the most efficient existing algorithms.
Comment: arXiv admin note: text overlap with arXiv:1704.0735
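For readers unfamiliar with the quantity being computed, the following is a minimal sketch of the betweenness score of a single vertex in a directed, unweighted graph. It uses the well-known Brandes-style dependency accumulation from every source, not the pruning or sampling algorithms proposed in the paper. The function name `betweenness_of_vertex` and the adjacency-dictionary input format are assumptions made for this example.

```python
from collections import deque


def betweenness_of_vertex(adj, r):
    """Betweenness score of vertex r in a directed, unweighted graph.

    adj maps each vertex to a list of its out-neighbours (hypothetical
    input format). Runs one BFS per source, so it does none of the
    pruning described in the abstract.
    """
    score = 0.0
    for s in adj:
        if s == r:
            continue
        # BFS from s, counting shortest paths (sigma) and predecessors.
        dist = {s: 0}
        sigma = {s: 1}
        preds = {s: []}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        if r not in dist:
            continue  # r is unreachable from s and lies on no path from s
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
        score += delta[r]
    return score


if __name__ == "__main__":
    diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(betweenness_of_vertex(diamond, "b"))
```

In the diamond graph, half of the shortest paths from `a` to `d` pass through `b`, which is the only pair contributing to its score.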
A Primer on Causality in Data Science
Many questions in Data Science are fundamentally causal in that our objective
is to learn the effect of some exposure, randomized or not, on an outcome
of interest. Even studies that are seemingly non-causal, such as those with the
goal of prediction or prevalence estimation, have causal elements, including
differential censoring or measurement. As a result, we, as Data Scientists,
need to consider the underlying causal mechanisms that gave rise to the data,
rather than simply the pattern or association observed in those data. In this
work, we review the 'Causal Roadmap' of Petersen and van der Laan (2014) to
provide an introduction to some key concepts in causal inference. Similar to
other causal frameworks, the steps of the Roadmap include clearly stating the
scientific question, defining the causal model, translating the scientific
question into a causal parameter, assessing the assumptions needed to express
the causal parameter as a statistical estimand, implementing statistical
estimators including parametric and semi-parametric methods, and interpreting
our findings. We believe that using such a framework in Data Science will
help to ensure that our statistical analyses are guided by the scientific
question driving our research, while avoiding over-interpreting our results. We
focus on the effect of an exposure occurring at a single time point and
highlight the use of targeted maximum likelihood estimation (TMLE) with Super
Learner.
Comment: 26 pages (with references); 4 figure
Regression and Singular Value Decomposition in Dynamic Graphs
Most real-world graphs are dynamic, i.e., they change over time.
However, while problems such as regression and Singular Value Decomposition
(SVD) have been studied for static graphs, they have not yet been
investigated for dynamic graphs. In this paper, we introduce,
motivate and study regression and SVD over dynamic graphs. First, we present
the notion of an update-efficient matrix embedding, which defines the
conditions sufficient for a matrix embedding to be used for the dynamic graph
regression problem (under norm). We prove that given an
update-efficient matrix embedding (e.g., adjacency matrix), after an update
operation in the graph, the optimal solution of the graph regression problem
for the revised graph can be computed in time. We also study dynamic
graph regression under least absolute deviation. Then, we characterize a class
of matrix embeddings that can be used to efficiently update SVD of a dynamic
graph. For the adjacency matrix and the Laplacian matrix, we study those graph update
operations for which SVD (and low rank approximation) can be updated
efficiently
- …