
17 research outputs found

A Scalable Asynchronous Distributed Algorithm for Topic Modeling

Author: Asuncion A.
Cormen T. H.
Gonzalez J. E.
Snyder P.
Yan F.
Publication date: 16/12/2014

Learning meaningful topic models from massive document collections that contain millions of documents and billions of tokens is challenging for two reasons. First, one needs to deal with a large number of topics (typically on the order of thousands). Second, one needs a scalable and efficient way of distributing the computation across multiple machines. In this paper we present a novel algorithm, F+Nomad LDA, which simultaneously tackles both problems. To handle a large number of topics we use an appropriately modified Fenwick tree. This data structure allows us to sample from a multinomial distribution over $T$ items in $O(\log T)$ time. Moreover, when topic counts change, the data structure can be updated in $O(\log T)$ time. To distribute the computation across multiple processors we present a novel asynchronous framework inspired by the Nomad algorithm of \cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperforms the state of the art on massive problems that involve millions of documents, billions of words, and thousands of topics.
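The logarithmic sampling and update costs mentioned in the abstract above can be illustrated with a plain Fenwick (binary indexed) tree over unnormalized topic weights. This is only a minimal sketch: the paper describes an "appropriately modified" Fenwick tree whose details are not given here, and the class name `FenwickSampler` and its methods are hypothetical names chosen for illustration.

```python
import random


class FenwickSampler:
    """Plain Fenwick tree over T non-negative weights (illustrative only).

    Assumption: this is the textbook structure, not the modified tree
    used by F+Nomad LDA.
    """

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0] * (self.n + 1)  # 1-based internal array
        for i, w in enumerate(weights):
            self.add(i, w)
        # Largest power of two not exceeding n, used by the descent in find().
        self.top = 1
        while self.top * 2 <= self.n:
            self.top *= 2

    def add(self, i, delta):
        """Add delta to the weight of item i (0-based) in O(log T)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        """Sum of all weights, computed in O(log T)."""
        s, i = 0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def find(self, u):
        """Return the 0-based item k with prefix(k) <= u < prefix(k + 1)."""
        pos, step = 0, self.top
        while step > 0:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= u:
                pos = nxt
                u -= self.tree[nxt]
            step //= 2
        # Clamp guards against u == total() caused by float rounding.
        return min(pos, self.n - 1)

    def sample(self, rng=random):
        """Draw an item with probability proportional to its weight."""
        return self.find(rng.random() * self.total())


sampler = FenwickSampler([1, 0, 3, 2])  # four topics, unnormalized counts
topic = sampler.sample()                # O(log T) draw
sampler.add(topic, -1)                  # O(log T) update after a count change
```

Both `add` and `find` touch at most one tree node per bit of T, which is where the logarithmic bound comes from.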

arXiv.org e-Print Archive

CiteSeerX

Crossref

Computing Web-scale Topic Models using an Asynchronous Parameter Server

Author: Asuncion A.
Hofmann T.
Yu H.-F.
Zaharia M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2017

Topic models such as Latent Dirichlet Allocation (LDA) have been widely used in information retrieval for tasks ranging from smoothing and feedback methods to tools for exploratory search and discovery. However, classical methods for inferring topic models do not scale up to the massive size of today's publicly available Web-scale data sets. The state-of-the-art approaches rely on custom strategies, implementations and hardware to facilitate their asynchronous, communication-intensive workloads. We present APS-LDA, which integrates state-of-the-art topic modeling with cluster computing frameworks such as Spark using a novel asynchronous parameter server. Advantages of this integration include convenient usage of existing data processing pipelines and eliminating the need for disk writes as data can be kept in memory from start to finish. Our goal is not to outperform highly customized implementations, but to propose a general high-performance topic modeling framework that can easily be used in today's data processing pipelines. We compare APS-LDA to the existing Spark LDA implementations and show that our system can, on a 480-core cluster, process up to 135 times more data and 10 times more topics without sacrificing model quality. Comment: To appear in SIGIR 201

arXiv.org e-Print Archive

Crossref

International Migration, Integration and Social Cohesion online publications

Dynamic Parameter Allocation in Parameter Servers

Author: Gemulla Rainer
Markl Volker
Renz-Wieland Alexander
Zeuch Steffen
Publication venue: 'VLDB Endowment'
Publication date: 01/01/2020

To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management, a key concern in distributed training, but can induce severe communication overhead. To reduce communication overhead, distributed machine learning algorithms use techniques to increase parameter access locality (PAL), achieving up to linear speed-ups. We found, however, that existing parameter servers provide only limited support for PAL techniques and therefore prevent efficient training. In this paper, we explore whether and to what extent PAL techniques can be supported, and whether such support is beneficial. We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such a parameter server called Lapse, and experimentally compare its performance to existing parameter servers across a number of machine learning tasks. We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers.
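The idea of dynamic parameter allocation described above can be sketched with a toy, single-process model in which each parameter has an owning node and can be relocated to the node that is about to use it. This is an illustration under stated assumptions, not the Lapse implementation: the class `ToyParameterServer`, its method names, and the access counters are all hypothetical, and no real networking is involved.

```python
class ToyParameterServer:
    """Toy model of a parameter server with relocatable keys (hypothetical)."""

    def __init__(self, num_nodes, keys):
        # Assumed static starting point: keys are spread round-robin over nodes.
        self.owner = {k: i % num_nodes for i, k in enumerate(keys)}
        self.values = {k: 0.0 for k in keys}
        self.local_accesses = 0
        self.remote_accesses = 0

    def _touch(self, node, key):
        # An access is local only if the calling node currently owns the key.
        if self.owner[key] == node:
            self.local_accesses += 1
        else:
            self.remote_accesses += 1

    def pull(self, node, key):
        """Read a parameter value on behalf of `node`."""
        self._touch(node, key)
        return self.values[key]

    def push(self, node, key, delta):
        """Apply an additive update on behalf of `node`."""
        self._touch(node, key)
        self.values[key] += delta

    def localize(self, node, key):
        """Relocate `key` so that later accesses from `node` are local."""
        self.owner[key] = node


ps = ToyParameterServer(num_nodes=2, keys=["a", "b"])
ps.push(1, "a", 1.0)   # "a" starts on node 0, so this access is remote
ps.localize(1, "a")    # move "a" to node 1 before a burst of accesses
ps.push(1, "a", 1.0)   # now local
```

In this toy model the benefit shows up only as a shift from the remote counter to the local one; a real system would also have to account for the cost of the relocation itself.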

arXiv.org e-Print Archive

MAnnheim DOCument Server

Enabling Efficient and Scalable Service Search in IoT with Topic Modelling: an evaluation

Author: Razzaque Mohammad Abdur
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 05/04/2021

Teesside University's Research Repository