Scalable Text and Link Analysis with Mixed-Topic Link Models
Many data sets contain rich information about objects, as well as pairwise
relations between them. For instance, in networks of websites, scientific
papers, and other documents, each node has content consisting of a collection
of words, as well as hyperlinks or citations to other nodes. In order to
perform inference on such data sets, and make predictions and recommendations,
it is useful to have models that are able to capture the processes which
generate the text at each node and the links between them. In this paper, we
combine classic ideas in topic modeling with a variant of the mixed-membership
block model recently developed in the statistical physics community. The
resulting model has the advantage that its parameters, including the mixture of
topics of each document and the resulting overlapping communities, can be
inferred with a simple and scalable expectation-maximization algorithm. We test
our model on three data sets, performing unsupervised topic classification and
link prediction. For both tasks, our model outperforms several existing
state-of-the-art methods, achieving higher accuracy with significantly less
computation, analyzing a data set with 1.3 million words and 44 thousand links
in a few minutes.
Comment: 11 pages, 4 figures
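The expectation-maximization idea mentioned in the abstract above can be illustrated on the text side alone. The sketch below is a plain PLSA-style EM loop, not the mixed-topic link model itself: it ignores links entirely, and the function name `plsa_em`, its parameters, and the dense responsibility array are assumptions made for illustration. It assumes every document contains at least one word.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """PLSA-style EM on a (documents x words) count matrix (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # theta[d, k] = P(topic k | document d); beta[k, w] = P(word w | topic k)
    theta = rng.dirichlet(np.ones(n_topics), size=n_docs)
    beta = rng.dirichlet(np.ones(n_words), size=n_topics)
    log_liks = []
    for _ in range(n_iters):
        # Mixture probability P(w | d), floored to avoid division by zero.
        mix = np.maximum(theta @ beta, 1e-300)
        log_liks.append(float(np.sum(counts * np.log(mix))))
        # E-step: responsibility of topic k for word w in document d.
        q = theta[:, None, :] * beta.T[None, :, :] / mix[:, :, None]
        weighted = counts[:, :, None] * q  # shape (docs, words, topics)
        # M-step: re-estimate both distributions from expected counts.
        theta = weighted.sum(axis=1)
        theta /= theta.sum(axis=1, keepdims=True)
        beta = weighted.sum(axis=0).T
        beta /= beta.sum(axis=1, keepdims=True)
    return theta, beta, log_liks
```

The dense (documents x words x topics) array keeps the sketch short but would not scale to a large corpus; a scalable version would iterate only over nonzero counts.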
Online Tensor Methods for Learning Latent Variable Models
We introduce an online tensor decomposition based approach for two latent
variable modeling problems, namely (1) community detection, in which we learn
the latent communities that the social actors in social networks belong to, and
(2) topic modeling, in which we infer hidden topics of text articles. We
consider decomposition of moment tensors using stochastic gradient descent. We
conduct optimization of multilinear operations in SGD and avoid directly
forming the tensors, to save computational and storage costs. We present
optimized algorithms on two platforms. Our GPU-based implementation exploits the
parallelism of SIMD architectures to allow for maximum speed-up by a careful
optimization of storage and data transfer, whereas our CPU-based implementation
uses efficient sparse matrix computations and is suitable for large sparse
datasets. For the community detection problem, we demonstrate accuracy and
computational efficiency on Facebook, Yelp and DBLP datasets, and for the topic
modeling problem, we also demonstrate good performance on the New York Times
dataset. We compare our results with state-of-the-art algorithms such as the
variational method, and report gains in accuracy and a speed-up of several
orders of magnitude in execution time.
Comment: JMLR 201
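The point about avoiding explicit tensor formation can be shown with a small sketch. For a raw empirical third moment T = (1/n) sum_i x_i ⊗ x_i ⊗ x_i, the multilinear product T(I, v, v) equals (1/n) sum_i x_i (x_i · v)^2, which needs only matrix-vector products. The function names below are hypothetical, and the raw moment is an assumption made for illustration; the paper's whitened and adjusted moments are not reproduced here.

```python
import numpy as np

def third_moment_apply(X, v):
    """Compute T(I, v, v) from the data rows of X without building T."""
    proj = X @ v                      # x_i . v for every sample, shape (n,)
    return X.T @ (proj ** 2) / X.shape[0]

def third_moment_explicit(X):
    """Build the full d x d x d moment tensor (for checking only)."""
    return np.einsum("ni,nj,nk->ijk", X, X, X) / X.shape[0]
```

The implicit version costs O(nd) time and O(d) extra memory, whereas the explicit tensor needs O(d^3) storage, which is why the explicit form is only practical for small d.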