51 research outputs found

    Variational Inference for Sparse Gaussian Process Modulated Hawkes Process

    Full text link
    The Hawkes process (HP) has been widely applied to modeling self-exciting events including neuron spikes, earthquakes and tweets. To avoid designing parametric triggering kernel and to be able to quantify the prediction confidence, the non-parametric Bayesian HP has been proposed. However, the inference of such models suffers from unscalability or slow convergence. In this paper, we aim to solve both problems. Specifically, first, we propose a new non-parametric Bayesian HP in which the triggering kernel is modeled as a squared sparse Gaussian process. Then, we propose a novel variational inference schema for model optimization. We employ the branching structure of the HP so that maximization of evidence lower bound (ELBO) is tractable by the expectation-maximization algorithm. We propose a tighter ELBO which improves the fitting performance. Further, we accelerate the novel variational inference schema to linear time complexity by leveraging the stationarity of the triggering kernel. Different from prior acceleration methods, ours enjoys higher efficiency. Finally, we exploit synthetic data and two large social media datasets to evaluate our method. We show that our approach outperforms state-of-the-art non-parametric frequentist and Bayesian methods. We validate the efficiency of our accelerated variational inference schema and practical utility of our tighter ELBO for model selection. We observe that the tighter ELBO exceeds the common one in model selection

    Will This Video Go Viral? Explaining and Predicting the Popularity of Youtube Videos

    Full text link
    What makes content go viral? Which videos become popular and why others don't? Such questions have elicited significant attention from both researchers and industry, particularly in the context of online media. A range of models have been recently proposed to explain and predict popularity; however, there is a short supply of practical tools, accessible for regular users, that leverage these theoretical results. HIPie -- an interactive visualization system -- is created to fill this gap, by enabling users to reason about the virality and the popularity of online videos. It retrieves the metadata and the past popularity series of Youtube videos, it employs Hawkes Intensity Process, a state-of-the-art online popularity model for explaining and predicting video popularity, and it presents videos comparatively in a series of interactive plots. This system will help both content consumers and content producers in a range of data-driven inquiries, such as to comparatively analyze videos and channels, to explain and predict future popularity, to identify viral videos, and to estimate response to online promotion

    Adaptively selecting occupations to detect skill shortages from online job ads

    Full text link
    Labour demand and skill shortages have historically been difficult to assess given the high costs of conducting representative surveys and the inherent delays of these indicators. This is particularly consequential for fast developing skills and occupations, such as those relating to Data Science and Analytics (DSA). This paper develops a data-driven solution to detecting skill shortages from online job advertisements (ads) data. We first propose a method to generate sets of highly similar skills based on a set of seed skills from job ads. This provides researchers with a novel method to adaptively select occupations based on granular skills data. Next, we apply this adaptive skills similarity technique to a dataset of over 6.7 million Australian job ads in order to identify occupations with the highest proportions of DSA skills. This uncovers 306,577 DSA job ads across 23 occupational classes from 2012-2019. Finally, we propose five variables for detecting skill shortages from online job ads: (1) posting frequency; (2) salary levels; (3) education requirements; (4) experience demands; and (5) job ad posting predictability. This contributes further evidence to the goal of detecting skills shortages in real-time. In conducting this analysis, we also find strong evidence of skills shortages in Australia for highly technical DSA skills and occupations. These results provide insights to Data Science researchers, educators, and policy-makers from other advanced economies about the types of skills that should be cultivated to meet growing DSA labour demands in the future

    Arterial incident duration prediction using a bi-level framework of extreme gradient-tree boosting

    Full text link
    Predicting traffic incident duration is a major challenge for many traffic centres around the world. Most research studies focus on predicting the incident duration on motorways rather than arterial roads, due to a high network complexity and lack of data. In this paper we propose a bi-level framework for predicting the accident duration on arterial road networks in Sydney, based on operational requirements of incident clearance target which is less than 45 minutes. Using incident baseline information, we first deploy a classification method using various ensemble tree models in order to predict whether a new incident will be cleared in less than 45min or not. If the incident was classified as short-term, then various regression models are developed for predicting the actual incident duration in minutes by incorporating various traffic flow features. After outlier removal and intensive model hyper-parameter tuning through randomized search and cross-validation, we show that the extreme gradient boost approach outperformed all models, including the gradient-boosted decision-trees by almost 53%. Finally, we perform a feature importance evaluation for incident duration prediction and show that the best prediction results are obtained when leveraging the real-time traffic flow in vicinity road sections to the reported accident location

    Birdspotter: A Tool for Analyzing and Labeling Twitter Users

    Full text link
    The impact of online social media on societal events and institutions is profound; and with the rapid increases in user uptake, we are just starting to understand its ramifications. Social scientists and practitioners who model online discourse as a proxy for real-world behavior, often curate large social media datasets. A lack of available tooling aimed at non-data science experts frequently leaves this data (and the insights it holds) underutilized. Here, we propose birdspotter -- a tool to analyze and label Twitter users --, and birdspotter.ml -- an exploratory visualizer for the computed metrics. birdspotter provides an end-to-end analysis pipeline, from the processing of pre-collected Twitter data, to general-purpose labeling of users, and estimating their social influence, within a few lines of code. The package features tutorials and detailed documentation. We also illustrate how to train birdspotter into a fully-fledged bot detector that achieves better than state-of-the-art performances without making any Twitter API online calls, and we showcase its usage in an exploratory analysis of a topical COVID-19 dataset

    Quantile Propagation for Wasserstein-Approximate Gaussian Processes

    Full text link
    We develop a new approximate Bayesian inference method for Gaussian process models with factorized non-Gaussian likelihoods. Our method---dubbed Quantile Propagation (QP)---is similar to expectation propagation (EP) but minimizes the L_2 Wasserstein distance rather than the Kullback-Leibler (KL) divergence. We consider the case where likelihood factors are approximated by a Gaussian form. We show that QP matches quantile functions rather than moments as in EP and has the same mean update but a smaller variance update than EP, thereby alleviating the over-estimation of the posterior variance exhibited by EP. Crucially, QP has the same favorable locality property as EP, and thereby admits an efficient algorithm. Experiments on classification and Poisson regression tasks demonstrate that QP outperforms both EP and variational Bayes

    Evently: Modeling and Analyzing Reshare Cascades with Hawkes Processes

    Full text link
    Modeling online discourse dynamics is a core activity in understanding the spread of information, both offline and online, and emergent online behavior. There is currently a disconnect between the practitioners of online social media analysis -- usually social, political and communication scientists -- and the accessibility to tools capable of examining online discussions of users. Here we present evently, a tool for modeling online reshare cascades, and particularly retweet cascades, using self-exciting processes. It provides a comprehensive set of functionalities for processing raw data from Twitter public APIs, modeling the temporal dynamics of processed retweet cascades and characterizing online users with a wide range of diffusion measures. This tool is designed for researchers with a wide range of computer expertise, and it includes tutorials and detailed documentation. We illustrate the usage of evently with an end-to-end analysis of online user behavior on a topical dataset relating to COVID-19. We show that, by characterizing users solely based on how their content spreads online, we can disentangle influential users and online bots

    Traffic congestion anomaly detection and prediction using deep learning

    Full text link
    Congestion prediction represents a major priority for traffic management centres around the world to ensure timely incident response handling. The increasing amounts of generated traffic data have been used to train machine learning predictors for traffic, however, this is a challenging task due to inter-dependencies of traffic flow both in time and space. Recently, deep learning techniques have shown significant prediction improvements over traditional models, however, open questions remain around their applicability, accuracy and parameter tuning. This paper brings two contributions in terms of: 1) applying an outlier detection an anomaly adjustment method based on incoming and historical data streams, and 2) proposing an advanced deep learning framework for simultaneously predicting the traffic flow, speed and occupancy on a large number of monitoring stations along a highly circulated motorway in Sydney, Australia, including exit and entry loop count stations, and over varying training and prediction time horizons. The spatial and temporal features extracted from the 36.34 million data points are used in various deep learning architectures that exploit their spatial structure (convolutional neuronal networks), their temporal dynamics (recurrent neuronal networks), or both through a hybrid spatio-temporal modelling (CNN-LSTM). We show that our deep learning models consistently outperform traditional methods, and we conduct a comparative analysis of the optimal time horizon of historical data required to predict traffic flow at different time points in the future. Lastly, we prove that the anomaly adjustment method brings significant improvements to using deep learning in both time and space
    • …
    corecore