Dirichlet-Survival Process: Scalable Inference of Topic-Dependent Diffusion Networks
Information spread on networks can be efficiently modeled by considering
three features: documents' content, time of publication relative to other
publications, and position of the spreader in the network. Most previous work
models at most two of these jointly or relies on heavily parametric approaches.
Building on the recent Dirichlet-point process literature, we introduce the
Houston (Hidden Online User-Topic Network) model, which jointly considers all
three features in a non-parametric unsupervised framework. It infers dynamic,
topic-dependent diffusion networks in a continuous-time setting, along with
the topics themselves. Its input is an unlabeled stream of triplets of the form
(time of publication, information's content, spreading entity). Online
inference is conducted using a sequential Monte-Carlo algorithm that scales
linearly with the size of the dataset. Our approach yields substantial
improvements over existing baselines on both cluster recovery and subnetwork
inference tasks.
Multivariate Powered Dirichlet Hawkes Process
The publication time of a document carries relevant information about its
semantic content. The Dirichlet-Hawkes process has been proposed to jointly
model textual information and publication dynamics. This approach has been used
with success in several recent works, and extended to tackle specific
challenging problems, typically short texts or entangled publication
dynamics. However, the prior in its current form does not allow for complex
publication dynamics. In particular, inferred topics are independent of each
other: a publication about finance is assumed to have no influence on
publications about politics, for instance.
In this work, we develop the Multivariate Powered Dirichlet-Hawkes Process
(MPDHP), which relaxes this assumption. Publications about various topics can
now influence each other. We detail and overcome the technical challenges that
arise from considering interacting topics. We conduct a systematic evaluation
of MPDHP on a range of synthetic datasets to define its application domain and
limitations. Finally, we develop a use case of the MPDHP on Reddit data. At the
end of this article, the interested reader will know how and when to use MPDHP,
and when not to.
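The idea of topics influencing each other can be illustrated with a plain multivariate Hawkes intensity. The following is a minimal sketch, not the MPDHP itself: it omits the Dirichlet prior and the inference procedure entirely, and it assumes an exponential kernel. The function name, the influence matrix `alpha`, and the decay rate `beta` are hypothetical names chosen for illustration; none of them come from the abstract.

```python
import math


def multivariate_hawkes_intensity(t, events, mu, alpha, beta):
    """Return the intensity of every topic at time t.

    events : list of (time, topic) pairs observed so far
    mu     : mu[k] is the base rate of topic k
    alpha  : alpha[j][k] is the influence of a past event of topic j on topic k
    beta   : decay rate of the (assumed) exponential kernel
    """
    rates = list(mu)
    for t_i, j in events:
        if t_i < t:  # only past events can excite the process
            decay = math.exp(-beta * (t - t_i))
            for k in range(len(mu)):
                rates[k] += alpha[j][k] * decay
    return rates


# Two topics: 0 = finance, 1 = politics (illustrative labels only).
events = [(0.0, 0), (1.0, 1)]
mu = [0.1, 0.1]

# Diagonal matrix: topics are independent, each one only excites itself.
independent = [[0.5, 0.0], [0.0, 0.5]]
# Off-diagonal entry: finance events also raise the politics intensity.
interacting = [[0.5, 0.3], [0.0, 0.5]]

rates_independent = multivariate_hawkes_intensity(2.0, events, mu, independent, beta=1.0)
rates_interacting = multivariate_hawkes_intensity(2.0, events, mu, interacting, beta=1.0)
```

With the diagonal matrix, a finance event leaves the politics intensity unchanged. With the off-diagonal entry, the politics intensity at the same instant is strictly higher, which is the kind of cross-topic influence the abstract describes.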
Interactions in Information Spread
Since the development of writing 5000 years ago, human-generated data has been
produced at an ever-increasing pace. Classical archival methods aimed at easing
information retrieval. Today, archiving is no longer enough. The amount of
data generated daily is beyond human comprehension and calls for
new information retrieval strategies. Instead of referencing every single data
piece as in traditional archival techniques, a more relevant approach is to
understand the overall ideas conveyed in data flows. To spot such general
tendencies, a precise comprehension of the underlying data generation
mechanisms is required. In the rich literature tackling this problem, the
question of information interaction remains nearly unexplored. First, we
investigate the frequency of such interactions. Building on recent advances
made in Stochastic Block Modelling, we explore the role of interactions in
several social networks. We find that interactions are rare in these datasets.
Then, we ask how interactions evolve over time. Earlier data pieces should
not have an everlasting influence on later data generation mechanisms. We
model this using dynamic network inference advances. We conclude that
interactions are brief. Finally, we design a framework that jointly models rare
and brief interactions based on Dirichlet-Hawkes Processes. We argue that this
new class of models fits brief and sparse interaction modelling. We conduct a
large-scale application on Reddit and find that interactions play a minor role
in this dataset. From a broader perspective, our work results in a collection
of highly flexible models and in a rethinking of core concepts of machine
learning. Consequently, we open a range of novel perspectives both in terms of
real-world applications and in terms of technical contributions to machine
learning.
Comment: PhD thesis defended on 2022/09/1
Properties of Reddit News Topical Interactions
Most models of information diffusion online rely on the assumption that
pieces of information spread independently from each other. However, several
works pointed out the necessity of investigating the role of interactions in
real-world processes, and highlighted possible difficulties in doing so:
interactions are sparse and brief. In response, recent work has developed
models to account for interactions in underlying publication dynamics. In this
article, we propose to extend and apply one such model to determine whether
interactions between news headlines on Reddit play a significant role in their
underlying publication mechanisms. After conducting an in-depth case study on
100,000 news headlines from 2019, we recover conclusions about interactions
that agree with the state of the art and conclude that they play a minor role
in this dataset.
Comment: Published at the conference Complex Networks and their Applications
Modeling the Dynamics of Online Learning Activity
People are increasingly relying on the Web and social media to find solutions to their problems in a wide range of domains. In this online setting, closely related problems often lead to the same characteristic learning pattern, in which people sharing these problems visit related pieces of information, perform almost identical queries or, more generally, take a series of similar actions. In this paper, we introduce a novel modeling framework for clustering continuous-time grouped streaming data, the hierarchical Dirichlet Hawkes process (HDHP), which allows us to automatically uncover a wide variety of learning patterns from detailed traces of learning activity. Our model allows for efficient inference, scaling to millions of actions taken by thousands of users. Experiments on real data gathered from Stack Overflow reveal that our framework can recover meaningful learning patterns in terms of both content and temporal dynamics, as well as accurately track users' interests and goals over time.
Online Learning for Mixture of Multivariate Hawkes Processes
Online learning of Hawkes processes has received increasing attention in the
last couple of years, especially for modeling networks of actors. However,
these works typically model only one of the following: the rich interactions
between events, the latent clusters of actors, or the network structure
between actors. We propose to model the latent structure of the network of
actors as well as their rich interactions across events, in real-world medical
and financial settings. Experimental results on both synthetic and real-world
data showcase the efficacy of our approach.
Comment: 12 pages, 6 figures, 3 tables