Amortized Analysis on Asynchronous Gradient Descent
Gradient descent is an important class of iterative algorithms for minimizing
convex functions. Classically, gradient descent has been a sequential and
synchronous process. Distributed and asynchronous variants of gradient descent
have been studied since the 1980s, and they have been experiencing a resurgence
due to demand from large-scale machine learning problems running on multi-core
processors.
We provide a version of asynchronous gradient descent (AGD) in which
communication between cores is minimal and for which there is little
synchronization overhead. We also propose a new timing model for its analysis.
With this model, we give the first amortized analysis of AGD on convex
functions. The amortization allows for bad updates (updates that increase the
value of the convex function); in contrast, most prior work makes the strong
assumption that every update must be significantly improving.
Typically, the step sizes used in AGD are smaller than those used in its
synchronous counterpart. We provide a method to determine the step sizes in AGD
based on the Hessian entries for the convex function. In certain circumstances,
the resulting step sizes are a constant fraction of those used in the
corresponding synchronous algorithm, enabling the overall performance of AGD to
improve linearly with the number of cores.
We give two applications of our amortized analysis.
Comment: 40 pages
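The Hessian-based step sizes mentioned above can be pictured with a loose sketch. This is an illustration, not the paper's algorithm: the quadratic objective, the random coordinate selection standing in for asynchronous updates, and the step rule 1/A_ii are all assumptions made here.

```python
import numpy as np

# Illustrative sketch (not the paper's AGD): coordinate-wise gradient
# descent on a convex quadratic f(x) = 0.5 x^T A x - b^T x, with
# per-coordinate step sizes 1/A_ii taken from the Hessian diagonal.
def coordinate_gd(A, b, iters=1000):
    x = np.zeros(len(b))
    steps = 1.0 / np.diag(A)           # Hessian-based step sizes
    for _ in range(iters):
        grad = A @ x - b               # gradient of the quadratic
        j = np.random.randint(len(b))  # random coordinate, a crude
        x[j] -= steps[j] * grad[j]     # stand-in for an async update
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = coordinate_gd(A, b)   # x approaches the minimizer of f
```

For a quadratic, the step 1/A_jj makes each update an exact minimization along coordinate j, which is why such Hessian-derived step sizes can match a constant fraction of the synchronous ones.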
A Scalable Asynchronous Distributed Algorithm for Topic Modeling
Learning meaningful topic models from massive document collections that
contain millions of documents and billions of tokens is challenging for two
reasons. First, one needs to deal with a large number of topics (typically
on the order of thousands). Second, one needs a scalable and efficient way of
distributing the computation across multiple machines. In this paper we present
a novel algorithm, F+Nomad LDA, which simultaneously tackles both problems.
To handle a large number of topics we use an appropriately modified
Fenwick tree. This data structure allows us to sample from a multinomial
distribution over items in logarithmic time. Moreover, when topic counts
change, the data structure can be updated in logarithmic time. To
distribute the computation across multiple processors, we present a novel
asynchronous framework inspired by the Nomad algorithm of
\cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperforms
the state of the art on massive problems involving millions of documents,
billions of words, and thousands of topics.
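The Fenwick-tree idea described above can be sketched as follows. This uses a plain Fenwick (binary indexed) tree, not the paper's modified variant, and the class name and interface are illustrative assumptions.

```python
import random

# Sketch: a Fenwick tree over unnormalized topic counts supports both
# drawing from the induced multinomial and updating a single count in
# O(log K) time, where K is the number of items (topics).
class FenwickSampler:
    def __init__(self, k):
        self.n = k
        self.tree = [0.0] * (k + 1)      # 1-based Fenwick array

    def add(self, i, delta):             # counts[i] += delta, O(log K)
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix(self, i):                 # sum of counts[0..i-1], O(log K)
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def sample(self):                    # draw i with prob counts[i]/total
        u = random.random() * self.prefix(self.n)
        pos, mask = 0, 1
        while mask * 2 <= self.n:
            mask *= 2
        while mask:                      # binary search down the tree
            if pos + mask <= self.n and self.tree[pos + mask] < u:
                pos += mask
                u -= self.tree[pos]
            mask >>= 1
        return pos                       # 0-based item index

sampler = FenwickSampler(3)
sampler.add(0, 1.0)
sampler.add(2, 3.0)
# sampler.sample() returns 0 or 2, with item 2 three times as likely
```

Both operations touch only O(log K) tree cells, which is what makes per-token Gibbs sampling over thousands of topics tractable.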
GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding
Learning continuous representations of nodes has recently attracted growing
interest in both academia and industry, owing to their simplicity and
effectiveness in a variety of applications. Most existing node embedding
algorithms and systems can process networks with hundreds of thousands or a
few million nodes. However, scaling them to networks with tens of millions or
even hundreds of millions of nodes remains a challenging problem. In this
paper, we propose GraphVite, a high-performance CPU-GPU hybrid system for
training node embeddings, by co-optimizing the algorithm and the system. On
the CPU end, augmented edge samples are generated in parallel by online
random walks on the network, and serve as the training data. On the GPU end,
a novel parallel negative sampling method is proposed
to leverage multiple GPUs to train node embeddings simultaneously, without much
data transfer and synchronization. Moreover, an efficient collaboration
strategy is proposed to further reduce the synchronization cost between CPUs
and GPUs. Experiments on multiple real-world networks show that GraphVite is
highly efficient: it takes only about one minute for a network with 1 million
nodes and 5 million edges on a single machine with 4 GPUs, and around 20
hours for a network with 66 million nodes and 1.8 billion edges. Compared to
the current fastest system, GraphVite is about 50 times faster without any
sacrifice in performance.
Comment: accepted at WWW 2019
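The CPU-side stage of augmented edge sampling can be pictured with a short sketch. This is illustrative only: the walk length, window size, and pairing scheme are assumptions, not GraphVite's actual parameters.

```python
import random

# Illustrative sketch of CPU-side augmented edge sampling: random walks
# over an adjacency list emit node pairs that co-occur within a small
# window along a walk; the pairs serve as positive training samples.
# (walk_len, window, and the pairing scheme are assumed here.)
def walk_pairs(adj, walk_len=5, window=2):
    pairs = []
    for start in adj:                     # one walk per start node
        walk, node = [start], start
        for _ in range(walk_len - 1):
            if not adj[node]:             # dead end: stop this walk
                break
            node = random.choice(adj[node])
            walk.append(node)
        for i, u in enumerate(walk):      # pair u with the next
            for v in walk[i + 1 : i + 1 + window]:  # `window` nodes
                pairs.append((u, v))
    return pairs

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
samples = walk_pairs(adj)
```

Because each walk is generated online and independently, sampling of this kind parallelizes naturally across CPU threads while the GPUs consume the resulting pairs.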