469 research outputs found

Distance Dependent Chinese Restaurant Processes

Author: Blei David M.
Frazier Peter I.
Publication date: 01/01/2010

We develop the distance dependent Chinese restaurant process (CRP), a flexible class of distributions over partitions that allows for non-exchangeability. This class can be used to model many kinds of dependencies between data in infinite clustering models, including dependencies across time or space. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both observed and mixture settings. We study its performance with three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data. We also show that its alternative formulation of the traditional CRP leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation.
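The abstract does not spell out the generative process, so the following is only a minimal sketch under a commonly used formulation: each customer links to another customer with weight given by a decay function of their distance, or to itself with weight alpha, and clusters are the connected components of the resulting links. The function name `sample_ddcrp_partition`, its parameters, and the decay function in the usage note are illustrative assumptions, not taken from the listed paper.

```python
import random


def sample_ddcrp_partition(distances, alpha, decay, rng=None):
    """Draw one partition from a distance dependent CRP prior (sketch).

    distances[i][j]: distance from customer i to customer j (assumed input).
    alpha: weight of a self-link, which starts a new cluster.
    decay: hypothetical function mapping a distance to a weight >= 0.
    """
    rng = rng or random.Random()
    n = len(distances)

    links = []
    for i in range(n):
        # Weight alpha for a self-link, decay(d_ij) for a link to customer j.
        weights = [alpha if j == i else decay(distances[i][j]) for j in range(n)]
        links.append(rng.choices(range(n), weights=weights, k=1)[0])

    # Clusters are the connected components of the undirected link graph,
    # found here with a small union-find structure.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in enumerate(links):
        parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    labels = [roots.setdefault(find(i), len(roots)) for i in range(n)]
    return links, labels
```

As a usage example, `sample_ddcrp_partition(d, 1.0, lambda x: 2.0 ** -x)` with `d[i][j] = abs(i - j)` favors links between neighboring positions in a sequence; a decay that always returns 0 forces every customer into its own cluster.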

arXiv.org e-Print Archive

CiteSeerX

Inferring Networks of Substitutable and Complementary Products

Author: Bennett J.
Blei D.
Blei D. M.
Brody S.
Chang J.
Ganu G.
Mas-Colell A.
Moghaddam S.
Reyes A.
Titov I.
Vu D.
Publication date: 29/06/2015

In a modern recommender system, it is important to understand how products relate to each other. For example, while a user is looking for mobile phones, it might make sense to recommend other phones, but once they buy a phone, we might instead want to recommend batteries, cases, or chargers. These two types of recommendations are referred to as substitutes and complements: substitutes are products that can be purchased instead of each other, while complements are products that can be purchased in addition to each other. Here we develop a method to infer networks of substitutable and complementary products. We formulate this as a supervised link prediction task, where we learn the semantics of substitutes and complements from data associated with products. The primary source of data we use is the text of product reviews, though our method also makes use of features such as ratings, specifications, prices, and brands. Methodologically, we build topic models that are trained to automatically discover topics from text that are successful at predicting and explaining such relationships. Experimentally, we evaluate our system on the Amazon product catalog, a large dataset consisting of 9 million products, 237 million links, and 144 million reviews. Comment: 12 pages, 6 figures

arXiv.org e-Print Archive

CiteSeerX

Crossref

Nested Hierarchical Dirichlet Processes

Author: Blei David M.
Jordan Michael I.
Paisley John
Wang Chong
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 02/05/2014

We develop a nested hierarchical Dirichlet process (nHDP) for hierarchical topic modeling. The nHDP is a generalization of the nested Chinese restaurant process (nCRP) that allows each word to follow its own path to a topic node according to a document-specific distribution on a shared tree. This alleviates the rigid, single-path formulation of the nCRP, allowing a document to more easily express thematic borrowings as a random effect. We derive a stochastic variational inference algorithm for the model, in addition to a greedy subtree selection method for each document, which allows for efficient inference using massive collections of text documents. We demonstrate our algorithm on 1.8 million documents from The New York Times and 3.3 million documents from Wikipedia. Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, Special Issue on Bayesian Nonparametrics

arXiv.org e-Print Archive

Princeton University Open Access Repository

CiteSeerX

Crossref

Optimal client recommendation for market makers in illiquid financial products

Author: DD Lee
DJC MacKay
DM Blei
EJ Elton
F Pedregosa
G Shani
GE Batista
I Kim
KS Jones
L Bolelli
M Avellaneda
M Hoffman
MI Jordan
S Robertson
Y Amihud
Publication date: 27/04/2017

The process of liquidity provision in financial markets can result in prolonged exposure to illiquid instruments for market makers. In this case, where a proprietary position is not desired, proactively targeting the right client who is likely to be interested can be an effective means to offset this position, rather than relying on commensurate interest arising through natural demand. In this paper, we consider the inference of a client profile for the purpose of corporate bond recommendation, based on typical recorded information available to the market maker. Given a historical record of corporate bond transactions and bond meta-data, we use a topic-modelling analogy to develop a probabilistic technique for compiling a curated list of client recommendations for a particular bond that needs to be traded, ranked by probability of interest. We show that a model based on Latent Dirichlet Allocation offers promising performance to deliver relevant recommendations for sales traders. Comment: 12 pages, 3 figures, 1 table

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Distance Dependent Infinite Latent Feature Models

Author: Blei David M.
Frazier Peter I.
Gershman Samuel J.
Publication date: 01/01/2011

Latent feature models are widely used to decompose data into a small number of components. Bayesian nonparametric variants of these models, which use the Indian buffet process (IBP) as a prior over latent features, allow the number of features to be determined from the data. We present a generalization of the IBP, the distance dependent Indian buffet process (dd-IBP), for modeling non-exchangeable data. It relies on distances defined between data points, biasing nearby data to share more features. The choice of distance measure allows for many kinds of dependencies, including temporal and spatial. Further, the original IBP is a special case of the dd-IBP. In this paper, we develop the dd-IBP and theoretically characterize its feature-sharing properties. We derive a Markov chain Monte Carlo sampler for a linear Gaussian model with a dd-IBP prior and study its performance on several non-exchangeable data sets. Comment: 28 pages, 9 figures
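For orientation, the standard (exchangeable) IBP that the abstract names as a special case can be sketched with its usual restaurant metaphor: customer i takes each previously sampled dish k with probability m_k / i, where m_k counts earlier customers who took it, and then samples a Poisson(alpha / i) number of new dishes. This is not the dd-IBP itself, whose distance-based construction the abstract does not detail; the function names below are illustrative assumptions.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        count += 1
        product *= rng.random()
        if product <= threshold:
            return count - 1


def sample_ibp(num_customers, alpha, rng=None):
    """Draw a binary feature matrix from the standard IBP prior (sketch)."""
    rng = rng or random.Random()
    dish_counts = []  # dish_counts[k] = customers so far who took dish k
    rows = []
    for i in range(1, num_customers + 1):
        # Existing dishes: take dish k with probability m_k / i.
        row = [1 if rng.random() < m / i else 0 for m in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        # New dishes: a Poisson(alpha / i) number, all taken by customer i.
        num_new = sample_poisson(alpha / i, rng)
        row.extend([1] * num_new)
        dish_counts.extend([1] * num_new)
        rows.append(row)
    # Pad earlier rows with zeros so every row has one column per dish.
    width = len(dish_counts)
    return [row + [0] * (width - len(row)) for row in rows]
```

Each row of the returned matrix is one data point and each column one latent feature; larger values of `alpha` tend to produce more columns.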

arXiv.org e-Print Archive

CiteSeerX

Princeton University Open Access Repository

Crossref

Contexts of diffusion: Adoption of research synthesis in Social Work and Women's Studies

Author: A. Abbott
A. Asuncion
A. Stirling
D.M. Blei
Evidence-Based Medicine Working Group
I. Chalmers
I. Rafols
J.T. Klein
J.W. Anastas
L. Toews
L.V. Hedges
M. Herie
M.J. Boxer
V. Batagelj
Publication date: 01/01/2014

Texts reveal the subjects of interest in research fields, and the values, beliefs, and practices of researchers. In this study, texts are examined through bibliometric mapping and topic modeling to provide a bird's-eye view of the social dynamics associated with the diffusion of research synthesis methods in the contexts of Social Work and Women's Studies. Research synthesis texts are especially revealing because the methods, which include meta-analysis and systematic review, are reliant on the availability of past research and data, sometimes idealized as objective, egalitarian approaches to research evaluation, fundamentally tied to past research practices, and performed with the goal of informing future research and practice. This study highlights the co-influence of past and subsequent research within research fields; illustrates dynamics of the diffusion process; and provides insight into the cultural contexts of research in Social Work and Women's Studies. This study suggests the potential to further develop bibliometric mapping and topic modeling techniques to inform research problem selection and resource allocation. Comment: To appear in proceedings of the 2014 International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction (SBP2014)

arXiv.org e-Print Archive

Crossref

Infinite factorization of multiple non-parametric views

Author: A. Gelman
A. Klami
A. Rodriguez
A. Vinokourov
Arto Klami
C. Archambeau
C. Rasmussen
D. Blackwell
D. Blei
D. Cohn
D. Lee
D. M. Blei
D. M. Roy
G. Englebienne
I. Rivals
I. S. Dhillon
Janne Sinkkonen
K. Barnard
M. Welling
Mark Girolami
N. Friedman
N. L. Johnson
R. M. Neal
S. Becker
S. Rogers
Samuel Kaski
Simon Rogers
T. Hofmann
Y. W. Teh
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Combined analysis of multiple data sources is of growing applied interest, in particular for distinguishing shared and source-specific aspects. We extend this rationale of classical canonical correlation analysis into a flexible, generative and non-parametric clustering setting, by introducing a novel non-parametric hierarchical mixture model. The lower level of the model describes each source with a flexible non-parametric mixture, and the top level combines these to describe commonalities of the sources. The lower-level clusters arise from hierarchical Dirichlet Processes, inducing an infinite-dimensional contingency table between the views. The commonalities between the sources are modeled by an infinite block model of the contingency table, interpretable as non-negative factorization of infinite matrices, or as a prior for infinite contingency tables. With Gaussian mixture components plugged in for continuous measurements, the model is applied to two views of genes, mRNA expression and abundance of the produced proteins, to expose groups of genes that are co-regulated in either or both of the views. Cluster analysis of co-expression is a standard simple way of screening for co-regulation, and the two-view analysis extends the approach to distinguishing between pre- and post-translational regulation.

CUED - Cambridge University Engineering Department

Basic tasks of sentiment analysis

Author: DM Blei
E Cambria
G Murray
G Qiu
GE Hinton
GW Taylor
H Tang
I Chaturvedi
L Oneto
R Collobert
R Ortega
S Branavan
S Poria
S Rill
T Wang
X Ding
Y Hu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 17/10/2017

Subjectivity detection is the task of identifying objective and subjective sentences. Objective sentences are those which do not exhibit any sentiment. It is therefore desirable for a sentiment analysis engine to find and separate the objective sentences before further analysis, e.g., polarity detection. In subjective sentences, opinions can often be expressed on one or multiple topics. Aspect extraction is a subtask of sentiment analysis that consists of identifying opinion targets in opinionated text, i.e., detecting the specific aspects of a product or service the opinion holder is either praising or complaining about.

arXiv.org e-Print Archive

Crossref

Overcoming data scarcity of Twitter: using tweets as bootstrap with application to autism-related topic content analysis

Author: Agarwal A.
Autism
Blei D.
Bollen J.
Chang J.
Danial J. T.
Harrington J. W.
Harshavardhan A.
Higashida N.
Himelboim I.
Hutchings C.
Hviid A.
Ishwaran H.
Jacobson J. W.
Jashinsky J.
Jiang L.
Paul M. J.
Robinson B.
Russell M. A.
Scanfeld D.
Teh Y. W.
Trembath D.
Verma S.
Warren Z.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015

Notwithstanding recent work which has demonstrated the potential of using Twitter messages for content-specific data mining and analysis, the depth of such analysis is inherently limited by the scarcity of data imposed by the 140-character tweet limit. In this paper we describe a novel approach for targeted knowledge exploration which uses tweet content analysis as a preliminary step. This step is used to bootstrap more sophisticated data collection from directly related but much richer content sources. In particular, we demonstrate that valuable information can be collected by following URLs included in tweets. We automatically extract content from the corresponding web pages and, treating each web page as a document linked to the original tweet, show how a temporal topic model based on a hierarchical Dirichlet process can be used to track the evolution of a complex topic structure of a Twitter community. Using autism-related tweets, we demonstrate that our method is capable of capturing a much more meaningful picture of information exchange than user-chosen hashtags. Comment: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 201

arXiv.org e-Print Archive

Deakin Research Online

Crossref