44,249 research outputs found
A New Approach to Speeding Up Topic Modeling
Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic
modeling paradigm, and recently finds many applications in computer vision and
computational biology. In this paper, we propose a fast and accurate batch
algorithm, active belief propagation (ABP), for training LDA. Usually batch LDA
algorithms require repeated scanning of the entire corpus and searching the
complete topic space. To process massive corpora having a large number of
topics, the training iteration of batch LDA algorithms is often inefficient and
time-consuming. To accelerate the training speed, ABP actively scans the subset
of corpus and searches the subset of topic space for topic modeling, therefore
saves enormous training time in each iteration. To ensure accuracy, ABP selects
only those documents and topics that contribute to the largest residuals within
the residual belief propagation (RBP) framework. On four real-world corpora,
ABP performs around to times faster than state-of-the-art batch LDA
algorithms with a comparable topic modeling accuracy.Comment: 14 pages, 12 figure
Computing Web-scale Topic Models using an Asynchronous Parameter Server
Topic models such as Latent Dirichlet Allocation (LDA) have been widely used
in information retrieval for tasks ranging from smoothing and feedback methods
to tools for exploratory search and discovery. However, classical methods for
inferring topic models do not scale up to the massive size of today's publicly
available Web-scale data sets. The state-of-the-art approaches rely on custom
strategies, implementations and hardware to facilitate their asynchronous,
communication-intensive workloads.
We present APS-LDA, which integrates state-of-the-art topic modeling with
cluster computing frameworks such as Spark using a novel asynchronous parameter
server. Advantages of this integration include convenient usage of existing
data processing pipelines and eliminating the need for disk writes as data can
be kept in memory from start to finish. Our goal is not to outperform highly
customized implementations, but to propose a general high-performance topic
modeling framework that can easily be used in today's data processing
pipelines. We compare APS-LDA to the existing Spark LDA implementations and
show that our system can, on a 480-core cluster, process up to 135 times more
data and 10 times more topics without sacrificing model quality.Comment: To appear in SIGIR 201
- …