
789 research outputs found

Exploring Time-Sensitive Variational Bayesian Inference LDA for Social Media Data

Author: A Fang
A Guolo
DM Blei
L AlSumait
M Braun
TL Griffiths
WX Zhao
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

There is considerable interest among both researchers and the mass public in understanding the topics of discussion on social media as they occur over time. Scholars have thoroughly analysed sampling-based topic modelling approaches for various text corpora including social media; however, another LDA topic modelling implementation, Variational Bayesian (VB), has not been well studied, despite its known efficiency and its adaptability to the volume and dynamics of social media data. In this paper, we examine the performance of the VB-based topic modelling approach for producing coherent topics, and further, we extend the VB approach by proposing a novel time-sensitive Variational Bayesian implementation, denoted as TVB. Our newly proposed TVB approach incorporates time so as to increase the quality of the generated topics. Using a Twitter dataset covering 8 events, our empirical results show that the coherence of the topics in our TVB model is improved by the integration of time. In particular, through a user study, we find that our TVB approach generates less mixed topics than state-of-the-art topic modelling approaches. Moreover, our proposed TVB approach can more accurately estimate topical trends, making it particularly suitable to assist end-users in tracking emerging topics on social media.

Crossref

Enlighten

Optimal client recommendation for market makers in illiquid financial products

Author: DD Lee
DJC MacKay
DM Blei
EJ Elton
F Pedregosa
G Shani
GE Batista
I Kim
KS Jones
L Bolelli
M Avellaneda
M Hoffman
MI Jordan
S Robertson
Y Amihud
Publication date: 27/04/2017

The process of liquidity provision in financial markets can result in prolonged exposure to illiquid instruments for market makers. In this case, where a proprietary position is not desired, proactively targeting the right client who is likely to be interested can be an effective means to offset this position, rather than relying on commensurate interest arising through natural demand. In this paper, we consider the inference of a client profile for the purpose of corporate bond recommendation, based on typical recorded information available to the market maker. Given a historical record of corporate bond transactions and bond meta-data, we use a topic-modelling analogy to develop a probabilistic technique for compiling a curated list of client recommendations for a particular bond that needs to be traded, ranked by probability of interest. We show that a model based on Latent Dirichlet Allocation offers promising performance to deliver relevant recommendations for sales traders. Comment: 12 pages, 3 figures, 1 table.

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

What2Cite: Unveiling Topics and Citations Dependencies for Scientific Literature Exploration and Recommendation

Author: DM Blei
L Di Caro
M Schuster
N Nagwani
NJ van Eck
SPD Shotton
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

Crossref

Institutional Research Information System University of Turin

Exploratory topic modeling with distributional semantics

Author: A Treisman
DA Keim
DM Blei
J Risch
L Barth
M Bostock
S Fortunato
S Lohmann
S Palmer
Y Bengio
Publication date: 16/07/2015

As we continue to collect and store textual data in a multitude of domains, we are regularly confronted with material whose largely unknown thematic structure we want to uncover. With unsupervised, exploratory analysis, no prior knowledge about the content is required and highly open-ended tasks can be supported. In the past few years, probabilistic topic modeling has emerged as a popular approach to this problem. Nevertheless, the representation of the latent topics as aggregations of semi-coherent terms limits their interpretability and level of detail. This paper presents an alternative approach to topic modeling that maps topics as a network for exploration, based on distributional semantics using learned word vectors. From the granular level of terms and their semantic similarity relations, global topic structures emerge as clustered regions and gradients of concepts. Moreover, the paper discusses the visual interactive representation of the topic map, which plays an important role in supporting its exploration. Comment: Conference: The Fourteenth International Symposium on Intelligent Data Analysis (IDA 2015)

arXiv.org e-Print Archive

Crossref

Temporal Cross-Media Retrieval with Soft-Smoothing

Author: Andrew Galen
Benevenuto Fabricio
Blei David M.
He Kaiming
Herbrich Ralf
Hu D.
Ngiam Jiquan
Srivastava Nitish
Uricchio Tiberio
Wang L.
Yan F.
Zhan M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 10/10/2018

Multimedia information has strong temporal correlations that shape the way modalities co-occur over time. In this paper we study the dynamic nature of multimedia and social-media information, where the temporal dimension emerges as a strong source of evidence for learning the temporal correlations across visual and textual modalities. So far, cross-media retrieval models have explored the correlations between different modalities (e.g. text and image) to learn a common subspace, in which semantically similar instances lie in the same neighbourhood. Building on such knowledge, we propose a novel temporal cross-media neural architecture that departs from standard cross-media methods by explicitly accounting for the temporal dimension through temporal subspace learning. The model is softly constrained with temporal and inter-modality constraints that guide the new subspace learning task by favouring temporal correlations between semantically similar and temporally close instances. Experiments on three distinct datasets show that accounting for time turns out to be important for cross-media retrieval. Namely, the proposed method outperforms a set of baselines on the task of temporal cross-media retrieval, demonstrating its effectiveness for performing temporal subspace learning. Comment: To appear in ACM MM 201

arXiv.org e-Print Archive

Crossref

Semantic patterns for sentiment analysis of Twitter

Author: C. Lin
D.M. Blei
G.W. Milligan
L. Wittgenstein
M. Thelwall
P.D. Turney
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Most existing approaches to Twitter sentiment analysis assume that sentiment is explicitly expressed through affective words. Nevertheless, sentiment is often implicitly expressed via latent semantic relations, patterns and dependencies among words in tweets. In this paper, we propose a novel approach that automatically captures patterns of words of similar contextual semantics and sentiment in tweets. Unlike previous work on sentiment pattern extraction, our proposed approach does not rely on external and fixed sets of syntactical templates/patterns, nor does it require deep analyses of the syntactic structure of sentences in tweets. We evaluate our approach with tweet- and entity-level sentiment analysis tasks by using the extracted semantic patterns as classification features in both tasks. We use 9 Twitter datasets in our evaluation and compare the performance of our patterns against 6 state-of-the-art baselines. Results show that our patterns consistently outperform all other baselines on all datasets by 2.19% at the tweet level and 7.5% at the entity level in average F-measure.

Crossref

Open Research Online (The Open University)

An efficient and principled method for detecting communities in networks

Author: A. Gyenge
B. W. Kernighan
Brian Ball
Brian Karrer
C. Ding
D. E. Knuth
D. M. Blei
E. M. Airoldi
H. Zhang
J. Parkinnen
K. Henderson
L. A. Adamic
L. Backstrom
M. E. J. Newman
M. Girolami
T. Hofmann
W. W. Zachary
Publication venue: 'American Physical Society (APS)'
Publication date: 18/04/2011

A fundamental problem in the analysis of network data is the detection of network communities, groups of densely interconnected nodes, which may be overlapping or disjoint. Here we describe a method for finding overlapping communities based on a principled statistical approach using generative network models. We show how the method can be implemented using a fast, closed-form expectation-maximization algorithm that allows us to analyze networks of millions of nodes in reasonable running times. We test the method both on real-world networks and on synthetic benchmarks and find that it gives results competitive with previous methods. We also show that the same approach can be used to extract nonoverlapping community divisions via a relaxation method, and demonstrate that the algorithm is competitively fast and accurate for the nonoverlapping problem. Comment: 14 pages, 5 figures, 1 table.

arXiv.org e-Print Archive

Crossref

Infinite factorization of multiple non-parametric views

Author: A. Gelman
A. Klami
A. Rodriguez
A. Vinokourov
Arto Klami
C. Archambeau
C. Rasmussen
D. Blackwell
D. Blei
D. Cohn
D. Lee
D. M. Blei
D. M. Roy
G. Englebienne
I. Rivals
I. S. Dhillon
Janne Sinkkonen
K. Barnard
M. Welling
Mark Girolami
N. Friedman
N. L. Johnson
R. M. Neal
S. Becker
S. Rogers
Samuel Kaski
Simon Rogers
T. Hofmann
Y. W. Teh
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Combined analysis of multiple data sources is of increasing practical interest, in particular for distinguishing shared and source-specific aspects. We extend this rationale of classical canonical correlation analysis into a flexible, generative and non-parametric clustering setting, by introducing a novel non-parametric hierarchical mixture model. The lower level of the model describes each source with a flexible non-parametric mixture, and the top level combines these to describe commonalities of the sources. The lower-level clusters arise from hierarchical Dirichlet Processes, inducing an infinite-dimensional contingency table between the views. The commonalities between the sources are modeled by an infinite block model of the contingency table, interpretable as non-negative factorization of infinite matrices, or as a prior for infinite contingency tables. With Gaussian mixture components plugged in for continuous measurements, the model is applied to two views of genes, mRNA expression and abundance of the produced proteins, to expose groups of genes that are co-regulated in either or both of the views. Cluster analysis of co-expression is a standard simple way of screening for co-regulation, and the two-view analysis extends the approach to distinguishing between pre- and post-translational regulation.

CUED - Cambridge University Engineering Department

'What is this corpus about?': Using topic modelling to explore a specialised corpus

Author: Akira Murakami
Anthony L.
Blei D.
Bondi M.
Brett M.R.
Dominik Vajn
Meeks E.
Paul Thompson
Ponweiser M.
R Core Team
Rayson P.
Rhody L.M.
Scott M.
Sinclair J.
Susan Hunston
Wood S.
Publication venue: 'Edinburgh University Press'
Publication date: 01/08/2017

This paper introduces topic modelling, a machine learning technique that automatically identifies 'topics' in a given corpus. The paper illustrates its use in the exploration of a corpus of academic English. It first offers an intuitive explanation of the underlying mechanism of topic modelling and describes the procedure for building a model, including the decisions involved in the model-building process. The paper then explores the model. A topic in topic models is characterised by a set of co-occurring words, and we will demonstrate that such topics bring us rich insights into the nature of a corpus. As exemplary tasks, this paper identifies the prominent topics in different parts of papers, investigates the chronological change of a journal, and reveals different types of papers in the journal. The paper further compares topic modelling to two more traditional techniques in corpus linguistics, semantic annotation and keywords analysis, and highlights the strengths of topic modelling. We believe that topic modelling is particularly useful in the initial exploration of a corpus.

CLoK

Crossref

University of Birmingham Research Portal