51 research outputs found
Variational Inference for Sparse Gaussian Process Modulated Hawkes Process
The Hawkes process (HP) has been widely applied to modeling self-exciting events including neuron spikes, earthquakes and tweets. To avoid designing a parametric triggering kernel and to be able to quantify the prediction confidence, the non-parametric Bayesian HP has been proposed. However, the inference of such models suffers from poor scalability or slow convergence. In this paper, we aim to solve both problems. Specifically, first, we propose a new non-parametric Bayesian HP in which the triggering kernel is modeled as a squared sparse Gaussian process. Then, we propose a novel variational inference schema for model optimization. We employ the branching structure of the HP so that maximization of the evidence lower bound (ELBO) is tractable by the expectation-maximization algorithm. We propose a tighter ELBO which improves the fitting performance. Further, we accelerate the novel variational inference schema to linear time complexity by leveraging the stationarity of the triggering kernel. Unlike prior acceleration methods, ours enjoys higher efficiency. Finally, we use synthetic data and two large social media datasets to evaluate our method. We show that our approach outperforms state-of-the-art non-parametric frequentist and Bayesian methods. We validate the efficiency of our accelerated variational inference schema and the practical utility of our tighter ELBO for model selection. We observe that the tighter ELBO outperforms the common one in model selection
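As a point of reference for the self-exciting structure this abstract describes, the conditional intensity of a Hawkes process is a background rate plus the sum of a triggering kernel evaluated at the time elapsed since each past event. The sketch below is a minimal illustration and not the paper's method: it assumes a parametric exponential kernel (the paper instead learns the kernel non-parametrically), and the names `exp_kernel`, `hawkes_intensity`, `mu`, `alpha` and `beta` are hypothetical.

```python
import math

def exp_kernel(dt, alpha=0.5, beta=1.0):
    """Assumed exponential triggering kernel: alpha * beta * exp(-beta * dt)."""
    return alpha * beta * math.exp(-beta * dt) if dt > 0 else 0.0

def hawkes_intensity(t, history, mu=0.2, kernel=exp_kernel):
    """Conditional intensity: background rate plus contributions of past events."""
    return mu + sum(kernel(t - ti) for ti in history if ti < t)

events = [1.0, 1.5, 4.0]  # illustrative event times, not real data
print(hawkes_intensity(0.5, events))  # no past events yet, so only the background rate
print(hawkes_intensity(1.6, events))  # raised by the events at 1.0 and 1.5
```

Under this assumed kernel, `alpha` is the integral of the kernel, that is, the expected number of events directly triggered by a single event.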
Will This Video Go Viral? Explaining and Predicting the Popularity of Youtube Videos
What makes content go viral? Which videos become popular, and why do others not? Such questions have elicited significant attention from both researchers and industry, particularly in the context of online media. A range of models have been recently proposed to explain and predict popularity; however, there is a short supply of practical tools, accessible to regular users, that leverage these theoretical results. HIPie -- an interactive visualization system -- is created to fill this gap, by enabling users to reason about the virality and the popularity of online videos. It retrieves the metadata and the past popularity series of YouTube videos, employs the Hawkes Intensity Process, a state-of-the-art online popularity model, to explain and predict video popularity, and presents videos comparatively in a series of interactive plots. This system will help both content consumers and content producers in a range of data-driven inquiries, such as comparatively analyzing videos and channels, explaining and predicting future popularity, identifying viral videos, and estimating the response to online promotion
Adaptively selecting occupations to detect skill shortages from online job ads
Labour demand and skill shortages have historically been difficult to assess given the high costs of conducting representative surveys and the inherent delays of these indicators. This is particularly consequential for fast developing skills and occupations, such as those relating to Data Science and Analytics (DSA). This paper develops a data-driven solution to detecting skill shortages from online job advertisements (ads) data. We first propose a method to generate sets of highly similar skills based on a set of seed skills from job ads. This provides researchers with a novel method to adaptively select occupations based on granular skills data. Next, we apply this adaptive skills similarity technique to a dataset of over 6.7 million Australian job ads in order to identify occupations with the highest proportions of DSA skills. This uncovers 306,577 DSA job ads across 23 occupational classes from 2012-2019. Finally, we propose five variables for detecting skill shortages from online job ads: (1) posting frequency; (2) salary levels; (3) education requirements; (4) experience demands; and (5) job ad posting predictability. This contributes further evidence to the goal of detecting skills shortages in real-time. In conducting this analysis, we also find strong evidence of skills shortages in Australia for highly technical DSA skills and occupations. These results provide insights to Data Science researchers, educators, and policy-makers from other advanced economies about the types of skills that should be cultivated to meet growing DSA labour demands in the future
Arterial incident duration prediction using a bi-level framework of extreme gradient-tree boosting
Predicting traffic incident duration is a major challenge for many traffic centres around the world. Most research studies focus on predicting the incident duration on motorways rather than arterial roads, because of the high network complexity and lack of data for the latter. In this paper we propose a bi-level framework for predicting the accident duration on arterial road networks in Sydney, based on the operational incident clearance target of less than 45 minutes. Using incident baseline information, we first deploy a classification method using various ensemble tree models in order to predict whether a new incident will be cleared in less than 45 minutes. If the incident is classified as short-term, various regression models are then developed for predicting the actual incident duration in minutes by incorporating various traffic flow features. After outlier removal and intensive model hyper-parameter tuning through randomized search and cross-validation, we show that the extreme gradient boost approach outperformed all other models, including gradient-boosted decision trees, by almost 53%. Finally, we perform a feature importance evaluation for incident duration prediction and show that the best prediction results are obtained when leveraging the real-time traffic flow in road sections in the vicinity of the reported accident location
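The two-stage flow described in this abstract (classify first, then regress only for short-term incidents) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the function names, the `lanes_blocked` feature and the toy stand-in models are all hypothetical, and real models would be trained ensemble-tree classifiers and regressors.

```python
THRESHOLD_MIN = 45  # clearance target stated in the abstract

def predict_duration(incident, classify_short, regress_minutes):
    """Level 1: short vs. long; level 2: regression only for short incidents."""
    if classify_short(incident):
        return {"short_term": True, "minutes": regress_minutes(incident)}
    return {"short_term": False, "minutes": None}

# Hypothetical stand-ins for trained models (e.g. gradient-boosted trees).
def toy_classifier(incident):
    return incident["lanes_blocked"] <= 1

def toy_regressor(incident):
    # Capped below the threshold so the output agrees with the short-term class.
    return min(THRESHOLD_MIN - 1, 10 + 15 * incident["lanes_blocked"])

print(predict_duration({"lanes_blocked": 1}, toy_classifier, toy_regressor))
print(predict_duration({"lanes_blocked": 3}, toy_classifier, toy_regressor))
```

Passing the two models in as callables keeps the routing logic independent of whichever classifier and regressor are eventually chosen.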
Birdspotter: A Tool for Analyzing and Labeling Twitter Users
The impact of online social media on societal events and institutions is
profound, and with the rapid increases in user uptake, we are just starting to
understand its ramifications. Social scientists and practitioners who model
online discourse as a proxy for real-world behavior often curate large social
media datasets. A lack of available tooling aimed at non-data science experts
frequently leaves this data (and the insights it holds) underutilized. Here, we
propose birdspotter, a tool to analyze and label Twitter users, and
birdspotter.ml, an exploratory visualizer for the computed metrics.
birdspotter provides an end-to-end analysis pipeline, from the processing of
pre-collected Twitter data, to general-purpose labeling of users, and
estimating their social influence, within a few lines of code. The package
features tutorials and detailed documentation. We also illustrate how to train
birdspotter into a fully-fledged bot detector that achieves better than
state-of-the-art performances without making any Twitter API online calls, and
we showcase its usage in an exploratory analysis of a topical COVID-19 dataset
Quantile Propagation for Wasserstein-Approximate Gaussian Processes
We develop a new approximate Bayesian inference method for Gaussian process models with factorized non-Gaussian likelihoods. Our method---dubbed Quantile Propagation (QP)---is similar to expectation propagation (EP) but minimizes the L_2 Wasserstein distance rather than the Kullback-Leibler (KL) divergence. We consider the case where likelihood factors are approximated by a Gaussian form. We show that QP matches quantile functions rather than moments as in EP and has the same mean update but a smaller variance update than EP, thereby alleviating the over-estimation of the posterior variance exhibited by EP. Crucially, QP has the same favorable locality property as EP, and thereby admits an efficient algorithm. Experiments on classification and Poisson regression tasks demonstrate that QP outperforms both EP and variational Bayes
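To make the contrast between the two discrepancy measures named in this abstract concrete, the sketch below evaluates both in closed form for univariate Gaussians. It is only an illustration of the measures themselves, not of the QP or EP algorithms, which match a Gaussian to a non-Gaussian tilted distribution; the function names `w2_squared` and `kl_gauss` are hypothetical.

```python
import math

def w2_squared(m1, s1, m2, s2):
    """Squared L2 Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) for univariate Gaussians."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

# Same pair of Gaussians under both measures (illustrative values).
print(w2_squared(0.0, 1.0, 3.0, 2.0))
print(kl_gauss(0.0, 1.0, 3.0, 2.0))
```

Note that the Wasserstein distance is symmetric in its two arguments while the KL divergence is not, which is one reason the two criteria can lead to different variance updates.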
Evently: Modeling and Analyzing Reshare Cascades with Hawkes Processes
Modeling online discourse dynamics is a core activity in understanding the
spread of information, both offline and online, and emergent online behavior.
There is currently a disconnect between the practitioners of online social
media analysis -- usually social, political and communication scientists -- and
the accessibility of tools capable of examining online discussions of users.
Here we present evently, a tool for modeling online reshare cascades, and
particularly retweet cascades, using self-exciting processes. It provides a
comprehensive set of functionalities for processing raw data from Twitter
public APIs, modeling the temporal dynamics of processed retweet cascades and
characterizing online users with a wide range of diffusion measures. This tool
is designed for researchers with a wide range of computer expertise, and it
includes tutorials and detailed documentation. We illustrate the usage of
evently with an end-to-end analysis of online user behavior on a topical
dataset relating to COVID-19. We show that, by characterizing users solely
based on how their content spreads online, we can disentangle influential users
and online bots
Traffic congestion anomaly detection and prediction using deep learning
Congestion prediction represents a major priority for traffic management
centres around the world to ensure timely incident response handling. The
increasing amounts of generated traffic data have been used to train machine
learning predictors for traffic; however, this is a challenging task due to
inter-dependencies of traffic flow both in time and space. Recently, deep
learning techniques have shown significant prediction improvements over
traditional models; however, open questions remain around their applicability,
accuracy and parameter tuning. This paper makes two contributions:
1) applying an outlier detection and anomaly adjustment method based on incoming
and historical data streams, and 2) proposing an advanced deep learning
framework for simultaneously predicting the traffic flow, speed and occupancy
on a large number of monitoring stations along a highly circulated motorway in
Sydney, Australia, including exit and entry loop count stations, and over
varying training and prediction time horizons. The spatial and temporal
features extracted from the 36.34 million data points are used in various deep
learning architectures that exploit their spatial structure (convolutional
neural networks), their temporal dynamics (recurrent neural networks), or
both through a hybrid spatio-temporal modelling (CNN-LSTM). We show that our
deep learning models consistently outperform traditional methods, and we
conduct a comparative analysis of the optimal time horizon of historical data
required to predict traffic flow at different time points in the future.
Lastly, we show that the anomaly adjustment method brings significant
improvements to using deep learning in both time and space