855 research outputs found

Mixed membership stochastic blockmodels

Author: Airoldi Edoardo M
Blei David M
Fienberg Stephen E
Xing Eric P
Publication date: 30/05/2007

Observations consisting of measurements on relationships for pairs of objects arise in many settings, such as protein interaction and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilistic models can be delicate because the simple exchangeability assumptions underlying many boilerplate models no longer hold. In this paper, we describe a latent variable model of such data called the mixed membership stochastic blockmodel. This model extends blockmodels for relational data to ones which capture mixed membership latent relational structure, thus providing an object-specific low-dimensional representation. We develop a general variational inference algorithm for fast approximate posterior inference. We explore applications to social and protein interaction networks. Comment: 46 pages, 14 figures, 3 tables
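
As a rough illustration of the kind of model this abstract describes, the Python sketch below samples a small directed network from a mixed membership blockmodel: each node gets its own membership vector over blocks, and each ordered pair picks a sender block and a receiver block before drawing a link. The function name `sample_mmsb`, the parameter names, and the use of the standard library only are our own assumptions for illustration; the sketch covers sampling only, not the variational inference algorithm, and is not the authors' code.

```python
import random


def sample_mmsb(n_nodes, alpha, block_matrix, seed=None):
    """Sample memberships and a directed adjacency matrix (illustrative only).

    alpha        -- Dirichlet parameters, one per latent block (hypothetical name)
    block_matrix -- block_matrix[g][h] is the link probability from block g to h
    """
    rng = random.Random(seed)
    blocks = list(range(len(alpha)))

    # One mixed-membership vector per node: a Dirichlet draw, obtained by
    # normalising independent gamma variates.
    memberships = []
    for _ in range(n_nodes):
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        memberships.append([g / total for g in draws])

    adjacency = [[0] * n_nodes for _ in range(n_nodes)]
    for p in range(n_nodes):
        for q in range(n_nodes):
            if p == q:
                continue  # no self-links in this sketch
            # Each ordered pair draws its own sender and receiver block.
            z_send = rng.choices(blocks, weights=memberships[p])[0]
            z_recv = rng.choices(blocks, weights=memberships[q])[0]
            if rng.random() < block_matrix[z_send][z_recv]:
                adjacency[p][q] = 1
    return memberships, adjacency


memberships, adjacency = sample_mmsb(
    n_nodes=5, alpha=[0.5, 0.5], block_matrix=[[0.9, 0.1], [0.1, 0.9]], seed=1
)
```

Because the block indicators are redrawn for every pair, a node can behave as a member of different blocks in different interactions, which is the "mixed membership" idea the abstract refers to.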

arXiv.org e-Print Archive

CiteSeerX

Feature LDA: a supervised topic model for automatic detection of Web API documentations from the Web

Author: C. Pedrinaci
D.M. Blei
E. Erosheva
N. Steinmetz
T. Pilioura
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Web APIs have gained increasing popularity in recent Web service technology development owing to the simplicity of their technology stack and the proliferation of mashups. However, efficiently discovering Web APIs and the relevant documentation on the Web is still a challenging task even with the best resources available on the Web. In this paper we cast the problem of detecting Web API documentation as a text classification problem of classifying a given Web page as Web API associated or not. We propose a supervised generative topic model called feature latent Dirichlet allocation (feaLDA) which offers a generic probabilistic framework for automatic detection of Web APIs. feaLDA not only captures the correspondence between data and the associated class labels, but also provides a mechanism for incorporating side information such as labelled features automatically learned from data that can effectively help improve classification performance. Extensive experiments on our Web API documentation dataset show that the feaLDA model outperforms three strong supervised baselines, namely naive Bayes, support vector machines, and the maximum entropy model, by over 3% in classification accuracy. In addition, feaLDA also gives superior performance when compared against other existing supervised topic models.

CiteSeerX

Crossref

Open Research Online

Basic tasks of sentiment analysis

Author: DM Blei
E Cambria
G Murray
G Qiu
GE Hinton
GW Taylor
H Tang
I Chaturvedi
L Oneto
R Collobert
R Ortega
S Branavan
S Poria
S Rill
T Wang
X Ding
Y Hu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 17/10/2017

Subjectivity detection is the task of identifying objective and subjective sentences. Objective sentences are those which do not exhibit any sentiment. It is therefore desirable for a sentiment analysis engine to find and separate the objective sentences for further analysis, e.g., polarity detection. In subjective sentences, opinions can often be expressed on one or multiple topics. Aspect extraction is a subtask of sentiment analysis that consists in identifying opinion targets in opinionated text, i.e., in detecting the specific aspects of a product or service the opinion holder is either praising or complaining about.

arXiv.org e-Print Archive

Crossref

Location Dependent Dirichlet Processes

Author: A Oliva
A Rodríguez
C Bishop
C Williams
CE Rasmussen
D Blei
D Dunson
E Sudderth
F Zhu
H Ishwaran
J Duan
J Griffin
J Paisley
J Sethuraman
J Shi
L Ren
N Foti
P Orbanz
R Unnikrishnan
S Kumar
T Ferguson
X Sun
YW Teh
Publication date: 02/07/2017

Dirichlet processes (DP) are widely applied in Bayesian nonparametric modeling. However, in their basic form they do not directly integrate dependency information among data arising from space and time. In this paper, we propose location dependent Dirichlet processes (LDDP) which incorporate nonparametric Gaussian processes in the DP modeling framework to model such dependencies. We develop the LDDP in the context of mixture modeling, and develop a mean field variational inference algorithm for this mixture model. The effectiveness of the proposed modeling framework is shown on an image segmentation task.

arXiv.org e-Print Archive

Crossref

Statistical Mechanics of the Chinese Restaurant Process: lack of self-averaging, anomalous finite-size effects and condensation

Author: A. E. Scheidegger
Bruno Bassetti
D. M. Blei
G. K. Zipf
Ginestra Bianconi
H. A. Simon
H. Yamato
J. Pitman
Marco Cosentino Lagomarsino
Mina Zarei
S. N. Dorogovtsev
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2009

The Pitman-Yor, or Chinese Restaurant Process, is a stochastic process that generates distributions following a power-law with exponents lower than two, as found in numerous physical, biological, technological and social systems. We discuss its rich behavior with the tools and viewpoint of statistical mechanics. We show that this process invariably gives rise to a condensation, i.e. a distribution dominated by a finite number of classes. We also evaluate thoroughly the finite-size effects, finding that the lack of stationary state and self-averaging of the process creates realization-dependent cutoffs and behavior of the distributions with no equivalent in other statistical mechanical models. Comment: 5 pages, 1 figure
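
To make the process named in this abstract concrete, here is a minimal Python sketch of the standard two-parameter (Pitman-Yor) Chinese restaurant seating rule: with n customers seated at K tables, the next customer joins table k with weight (n_k - discount) and opens a new table with weight (concentration + discount * K). The function name `pitman_yor_crp` and the parameter names `discount` and `concentration` are our own labels, not the paper's notation, and the sketch does not reproduce the paper's statistical-mechanics analysis.

```python
import random


def pitman_yor_crp(n_customers, discount, concentration, seed=None):
    """Return the list of table sizes after seating n_customers (illustrative).

    Assumes 0 <= discount < 1 and concentration > -discount.
    """
    if not (0 <= discount < 1) or concentration <= -discount:
        raise ValueError("need 0 <= discount < 1 and concentration > -discount")
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for n in range(n_customers):  # n customers are already seated
        # Weight of opening a new table; all weights sum to n + concentration.
        new_weight = concentration + discount * len(tables)
        u = rng.random() * (n + concentration)
        if not tables or u < new_weight:
            tables.append(1)
            continue
        u -= new_weight
        for k, size in enumerate(tables):
            u -= size - discount  # weight of joining existing table k
            if u < 0:
                tables[k] += 1
                break
        else:
            # Guard against floating-point round-off: use the last table.
            tables[-1] += 1
    return tables


sizes = pitman_yor_crp(n_customers=1000, discount=0.5, concentration=1.0, seed=0)
```

Sorting `sizes` in decreasing order and plotting them on log-log axes is one simple way to inspect the heavy-tailed table-size distribution for different realizations (different seeds).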

arXiv.org e-Print Archive

Crossref

AIR Universita degli studi di Milano

How Many Topics? Stability Analysis for Topic Models

Author: C. Lin
D. Greene
D.D. Lee
D.M. Blei
E. Levine
H.W. Kuhn
J.P. Brunet
L.N. Hutchins
M. Kendall
P. Jaccard
R. Fagin
S. Ben-David
T. Lange
W. Webber
Publication date: 01/01/2014

Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process. Comment: Improve readability of plots. Add minor clarification

arXiv.org e-Print Archive

Crossref

Research Repository UCD

Irish Universities

Probabilistic Clustering of Time-Evolving Distance Data

Author: AK Jain
AY Ng
C Leslie
CP Robert
D Blei
DD Lee
DM Blei
Gunnar Rätsch
H Saigo
J Pitman
Julia E. Vogt
M Bilodeau
Marius Kloft
MB Eisen
MS Srivastava
P McCullagh
RM Neal
S Sonnenburg
Sandhya Prabhakaran
SN MacEachern
Stefan Stark
Sudhir S. Raman
SVN Vishwanathan
TS Ferguson
TW Anderson
Volker Roth
WJ Ewens
Publication date: 01/01/2015

We present a novel probabilistic clustering model for objects that are represented via pairwise distances and observed at different time points. The proposed method utilizes the information given by adjacent time points to find the underlying cluster structure and obtain a smooth cluster evolution. This approach allows the number of objects and clusters to differ at every time point, and no identification of the objects' identities is needed. Further, the model does not require the number of clusters to be specified in advance; it is instead determined automatically using a Dirichlet process prior. We validate our model on synthetic data, showing that the proposed method is more accurate than state-of-the-art clustering methods. Finally, we use our dynamic clustering model to analyze and illustrate the evolution of brain cancer patients over time.

arXiv.org e-Print Archive

Crossref

edoc

An efficient and principled method for detecting communities in networks

Author: A. Gyenge
B. W. Kernighan
Brian Ball
Brian Karrer
C. Ding
D. E. Knuth
D. M. Blei
E. M. Airoldi
H. Zhang
J. Parkinnen
K. Henderson
L. A. Adamic
L. Backstrom
M. E. J. Newman
M. Girolami
T. Hofmann
W. W. Zachary
Publication venue: 'American Physical Society (APS)'
Publication date: 18/04/2011

A fundamental problem in the analysis of network data is the detection of network communities, groups of densely interconnected nodes, which may be overlapping or disjoint. Here we describe a method for finding overlapping communities based on a principled statistical approach using generative network models. We show how the method can be implemented using a fast, closed-form expectation-maximization algorithm that allows us to analyze networks of millions of nodes in reasonable running times. We test the method both on real-world networks and on synthetic benchmarks and find that it gives results competitive with previous methods. We also show that the same approach can be used to extract nonoverlapping community divisions via a relaxation method, and demonstrate that the algorithm is competitively fast and accurate for the nonoverlapping problem. Comment: 14 pages, 5 figures, 1 table

arXiv.org e-Print Archive

Crossref