Amortized Analysis on Asynchronous Gradient Descent
Gradient descent is an important class of iterative algorithms for minimizing
convex functions. Classically, gradient descent has been a sequential and
synchronous process. Distributed and asynchronous variants of gradient descent
have been studied since the 1980s, and they have been experiencing a resurgence
due to demand from large-scale machine learning problems running on multi-core
processors.
We provide a version of asynchronous gradient descent (AGD) in which
communication between cores is minimal and for which there is little
synchronization overhead. We also propose a new timing model for its analysis.
With this model, we give the first amortized analysis of AGD on convex
functions. The amortization allows for bad updates (updates that increase the
value of the convex function); in contrast, most prior work makes the strong
assumption that every update must be significantly improving.
Typically, the step sizes used in AGD are smaller than those used in its
synchronous counterpart. We provide a method to determine the step sizes in AGD
based on the Hessian entries for the convex function. In certain circumstances,
the resulting step sizes are a constant fraction of those used in the
corresponding synchronous algorithm, enabling the overall performance of AGD to
improve linearly with the number of cores.
We give two applications of our amortized analysis.
Comment: 40 pages
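The Hessian-based step sizes mentioned above can be pictured with a loose sketch. This is an illustration, not the paper's algorithm: the quadratic objective, the random coordinate selection standing in for asynchronous updates, and the step rule 1/A_ii are all assumptions made here.

```python
import numpy as np

# Illustrative sketch (not the paper's AGD): coordinate-wise gradient
# descent on a convex quadratic f(x) = 0.5 x^T A x - b^T x, with
# per-coordinate step sizes 1/A_ii taken from the Hessian diagonal.
def coordinate_gd(A, b, iters=1000):
    x = np.zeros(len(b))
    steps = 1.0 / np.diag(A)           # Hessian-based step sizes
    for _ in range(iters):
        grad = A @ x - b               # gradient of the quadratic
        j = np.random.randint(len(b))  # random coordinate, a crude
        x[j] -= steps[j] * grad[j]     # stand-in for an async update
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = coordinate_gd(A, b)   # x approaches the minimizer of f
```

For a quadratic, the step 1/A_jj makes each update an exact minimization along coordinate j, which is why such Hessian-derived step sizes can match a constant fraction of the synchronous ones.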
A Scalable Asynchronous Distributed Algorithm for Topic Modeling
Learning meaningful topic models from massive document collections that
contain millions of documents and billions of tokens is challenging for two
reasons. First, one needs to deal with a large number of topics (typically
on the order of thousands). Second, one needs a scalable and efficient way of
distributing the computation across multiple machines. In this paper we present
a novel algorithm, F+Nomad LDA, which simultaneously tackles both problems.
To handle a large number of topics we use an appropriately modified
Fenwick tree. This data structure allows us to sample from a multinomial
distribution over items in logarithmic time. Moreover, when topic counts
change, the data structure can be updated in logarithmic time. To
distribute the computation across multiple processors, we present a novel
asynchronous framework inspired by the Nomad algorithm of
\cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperforms
the state of the art on massive problems involving millions of documents,
billions of words, and thousands of topics.
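The Fenwick-tree idea described above can be sketched as follows. This uses a plain Fenwick (binary indexed) tree, not the paper's modified variant, and the class name and interface are illustrative assumptions.

```python
import random

# Sketch: a Fenwick tree over unnormalized topic counts supports both
# drawing from the induced multinomial and updating a single count in
# O(log K) time, where K is the number of items (topics).
class FenwickSampler:
    def __init__(self, k):
        self.n = k
        self.tree = [0.0] * (k + 1)      # 1-based Fenwick array

    def add(self, i, delta):             # counts[i] += delta, O(log K)
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix(self, i):                 # sum of counts[0..i-1], O(log K)
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def sample(self):                    # draw i with prob counts[i]/total
        u = random.random() * self.prefix(self.n)
        pos, mask = 0, 1
        while mask * 2 <= self.n:
            mask *= 2
        while mask:                      # binary search down the tree
            if pos + mask <= self.n and self.tree[pos + mask] < u:
                pos += mask
                u -= self.tree[pos]
            mask >>= 1
        return pos                       # 0-based item index

sampler = FenwickSampler(3)
sampler.add(0, 1.0)
sampler.add(2, 3.0)
# sampler.sample() returns 0 or 2, with item 2 three times as likely
```

Both operations touch only O(log K) tree cells, which is what makes per-token Gibbs sampling over thousands of topics tractable.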
GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding
Learning continuous representations of nodes has recently attracted growing
interest in both academia and industry, owing to their simplicity and
effectiveness in a variety of applications. Most existing node embedding
algorithms and systems can process networks with hundreds of thousands or a
few million nodes. However, scaling them to networks with tens of millions or
even hundreds of millions of nodes remains a challenging problem. In this
paper, we propose GraphVite, a high-performance CPU-GPU hybrid system for
training node embeddings, by co-optimizing the algorithm and the system. On
the CPU end, augmented edge samples are generated in parallel by online
random walks on the network, and serve as the training data. On the GPU end,
a novel parallel negative sampling method is proposed
to leverage multiple GPUs to train node embeddings simultaneously, without much
data transfer and synchronization. Moreover, an efficient collaboration
strategy is proposed to further reduce the synchronization cost between CPUs
and GPUs. Experiments on multiple real-world networks show that GraphVite is
highly efficient: it takes only about one minute for a network with 1 million
nodes and 5 million edges on a single machine with 4 GPUs, and around 20
hours for a network with 66 million nodes and 1.8 billion edges. Compared to
the current fastest system, GraphVite is about 50 times faster without any
sacrifice in performance.
Comment: accepted at WWW 2019
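The CPU-side stage of augmented edge sampling can be pictured with a short sketch. This is illustrative only: the walk length, window size, and pairing scheme are assumptions, not GraphVite's actual parameters.

```python
import random

# Illustrative sketch of CPU-side augmented edge sampling: random walks
# over an adjacency list emit node pairs that co-occur within a small
# window along a walk; the pairs serve as positive training samples.
# (walk_len, window, and the pairing scheme are assumed here.)
def walk_pairs(adj, walk_len=5, window=2):
    pairs = []
    for start in adj:                     # one walk per start node
        walk, node = [start], start
        for _ in range(walk_len - 1):
            if not adj[node]:             # dead end: stop this walk
                break
            node = random.choice(adj[node])
            walk.append(node)
        for i, u in enumerate(walk):      # pair u with the next
            for v in walk[i + 1 : i + 1 + window]:  # `window` nodes
                pairs.append((u, v))
    return pairs

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
samples = walk_pairs(adj)
```

Because each walk is generated online and independently, sampling of this kind parallelizes naturally across CPU threads while the GPUs consume the resulting pairs.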