Embedded Topics in the Stochastic Block Model
Communication networks such as emails or social networks are now ubiquitous
and their analysis has become a strategic field. In many applications, the goal
is to automatically extract relevant information by looking at the nodes and
their connections. Unfortunately, most of the existing methods focus on
analysing the presence or absence of edges and textual data is often discarded.
However, all communication networks actually come with textual data on the
edges. In order to take into account this specificity, we consider in this
paper networks for which two nodes are linked if and only if they share textual
data. We introduce ETSBM, a deep latent variable model that handles embedded
topics, to simultaneously cluster the nodes while modelling the topics used
between the different clusters. ETSBM extends both
the stochastic block model (SBM) and the embedded topic model (ETM) which are
core models for studying networks and corpora, respectively. The inference is
done using a variational-Bayes expectation-maximisation algorithm combined with
a stochastic gradient descent. The methodology is evaluated on synthetic data
and on a real-world dataset.
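As a rough illustration of the stochastic block model that ETSBM builds on, the sketch below samples an undirected graph in which the probability of an edge depends only on the clusters of its two endpoints. It is not the ETSBM model itself (there is no text, topic component, or inference step), and the names `sample_sbm`, `block_sizes`, and `connect_prob` are hypothetical, chosen for this example.

```python
import random


def sample_sbm(block_sizes, connect_prob, seed=0):
    """Sample an undirected graph from a plain stochastic block model.

    block_sizes  -- number of nodes in each cluster (hypothetical parameter)
    connect_prob -- symmetric K x K matrix; entry [q][r] is the probability
                    of an edge between a node of cluster q and one of cluster r
    Returns (labels, edges): the cluster label of each node and a list of
    edges (i, j) with i < j.
    """
    rng = random.Random(seed)
    # Node i belongs to cluster labels[i]; nodes are numbered cluster by cluster.
    labels = [k for k, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Edge presence is a Bernoulli draw governed only by the two clusters.
            if rng.random() < connect_prob[labels[i]][labels[j]]:
                edges.append((i, j))
    return labels, edges


# Two clusters of five nodes: dense within clusters, sparse between them.
labels, edges = sample_sbm([5, 5], [[0.9, 0.1], [0.1, 0.9]])
```

Clustering methods in this family work in the opposite direction: given the observed edges, they estimate `labels` and `connect_prob`.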
A survey of statistical network models
Networks are ubiquitous in science and have become a focal point for
discussion in everyday life. Formal statistical models for the analysis of
network data have emerged as a major topic of interest in diverse areas of
study, and most of these involve a form of graphical representation.
Probability models on graphs date back to 1959. Along with empirical studies in
social psychology and sociology from the 1960s, these early works generated an
active network community and a substantial literature in the 1970s. This effort
moved into the statistical literature in the late 1970s and 1980s, and the past
decade has seen a burgeoning network literature in statistical physics and
computer science. The growth of the World Wide Web and the emergence of online
networking communities such as Facebook, MySpace, and LinkedIn, and a host of
more specialized professional network communities has intensified interest in
the study of networks and network data. Our goal in this review is to provide
the reader with an entry point to this burgeoning literature. We begin with an
overview of the historical development of statistical network modeling and then
we introduce a number of examples that have been studied in the network
literature. Our subsequent discussion focuses on a number of prominent static
and dynamic network models and their interconnections. We emphasize formal
model descriptions, and pay special attention to the interpretation of
parameters and their estimation. We end with a description of some open
problems and challenges for machine learning and statistics.
Comment: 96 pages, 14 figures, 333 reference
Transforming Graph Representations for Statistical Relational Learning
Relational data representations have become an increasingly important topic
due to the recent proliferation of network datasets (e.g., social, biological,
information networks) and a corresponding increase in the application of
statistical relational learning (SRL) algorithms to these domains. In this
article, we examine a range of representation issues for graph-based relational
data. Since the choice of relational data representation for the nodes, links,
and features can dramatically affect the capabilities of SRL algorithms, we
survey approaches and opportunities for relational representation
transformation designed to improve the performance of these algorithms. This
leads us to introduce an intuitive taxonomy for data representation
transformations in relational domains that incorporates link transformation and
node transformation as symmetric representation tasks. In particular, the
transformation tasks for both nodes and links include (i) predicting their
existence, (ii) predicting their label or type, (iii) estimating their weight
or importance, and (iv) systematically constructing their relevant features. We
motivate our taxonomy through detailed examples and use it to survey and
compare competing approaches for each of these tasks. We also discuss general
conditions for transforming links, nodes, and features. Finally, we highlight
challenges that remain to be addressed.
Interactions in Information Spread
Since the development of writing 5000 years ago, human-generated data has been
produced at an ever-increasing pace. Classical archival methods aimed at easing
information retrieval. Nowadays, archiving is not enough anymore. The amount of
data that gets generated daily is beyond human comprehension, and calls for
new information retrieval strategies. Instead of referencing every single data
piece as in traditional archival techniques, a more relevant approach consists
in understanding the overall ideas conveyed in data flows. To spot such general
tendencies, a precise comprehension of the underlying data generation
mechanisms is required. In the rich literature tackling this problem, the
question of information interaction remains nearly unexplored. First, we
investigate the frequency of such interactions. Building on recent advances
made in Stochastic Block Modelling, we explore the role of interactions in
several social networks. We find that interactions are rare in these datasets.
Then, we ask how interactions evolve over time. Earlier data pieces should
not have an everlasting influence on later data generation mechanisms. We
model this using dynamic network inference advances. We conclude that
interactions are brief. Finally, we design a framework that jointly models rare
and brief interactions based on Dirichlet-Hawkes Processes. We argue that this
new class of models fits brief and sparse interaction modelling. We conduct a
large-scale application on Reddit and find that interactions play a minor role
in this dataset. From a broader perspective, our work results in a collection
of highly flexible models and in a rethinking of core concepts of machine
learning. Consequently, we open a range of novel perspectives both in terms of
real-world applications and in terms of technical contributions to machine
learning.
Comment: PhD thesis defended on 2022/09/1
Multivariate Spatiotemporal Hawkes Processes and Network Reconstruction
There is often latent network structure in spatial and temporal data and the
tools of network analysis can yield fascinating insights into such data. In
this paper, we develop a nonparametric method for network reconstruction from
spatiotemporal data sets using multivariate Hawkes processes. In contrast to
prior work on network reconstruction with point-process models, which has often
focused on exclusively temporal information, our approach uses both temporal
and spatial information and does not assume a specific parametric form of
network dynamics. This leads to an effective way of recovering an underlying
network. We illustrate our approach using both synthetic networks and networks
constructed from real-world data sets (a location-based social media network, a
narrative of crime events, and violent gang crimes). Our results demonstrate
that, in comparison to using only temporal data, our spatiotemporal approach
yields improved network reconstruction, providing a basis for meaningful
subsequent analysis --- such as community structure and motif analysis --- of
the reconstructed networks.
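To make the link between Hawkes processes and networks concrete, the sketch below evaluates the conditional intensity of one node in a multivariate Hawkes process. It deliberately simplifies the setting described above: it is purely temporal and assumes an exponential triggering kernel, whereas the paper's method is nonparametric and also uses spatial information. The function name `intensity` and the parameters `mu`, `K`, and `beta` are assumptions made for this example.

```python
import math


def intensity(u, t, events, mu, K, beta):
    """Conditional intensity of node u at time t (temporal-only sketch).

    events -- list of (time, node) pairs observed so far
    mu     -- background rate of each node
    K      -- K[v][u] is the expected number of events on node u triggered
              by a single event on node v (the weighted network to recover)
    beta   -- decay rate of the assumed exponential kernel
    """
    # Each past event on node v raises the rate on node u by K[v][u],
    # spread over time according to the kernel beta * exp(-beta * dt).
    excitation = sum(
        K[v][u] * beta * math.exp(-beta * (t - s))
        for s, v in events
        if s < t
    )
    return mu[u] + excitation


# One event on node 0 at time 0; node 0 excites node 1 but not itself.
mu = [0.2, 0.1]
K = [[0.0, 0.5], [0.0, 0.0]]
events = [(0.0, 0)]
rate_node_1 = intensity(1, 1.0, events, mu, K, beta=1.0)
```

In this picture, network reconstruction amounts to estimating the matrix `K` from the observed events: a nonzero `K[v][u]` is read as a directed edge from node `v` to node `u`.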
The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015
In this paper we retrace the recent history of statistics by analyzing all
the papers published in five prestigious statistical journals since 1970,
namely: Annals of Statistics, Biometrika, Journal of the American Statistical
Association, Journal of the Royal Statistical Society, series B and Statistical
Science. The aim is to construct a kind of "taxonomy" of the statistical papers
by organizing and by clustering them in main themes. In this sense being
identified in a cluster means being important enough to be uncluttered in the
vast and interconnected world of statistical research. Since the main
statistical research topics are naturally born, evolve, or die over time, we will
also develop a dynamic clustering strategy, where a group in a time period is
allowed to migrate or to merge into different groups in the following one.
Results show that statistics is a very dynamic and evolving science, stimulated
by the rise of new research questions and types of data.