Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes.
This, coupled with the new application workloads brought forward by Deep
Learning frameworks like Caffe and Microsoft CNTK, poses additional design
constraints due to the very large GPU-buffer messages communicated during the
training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK.
Comment: 8 pages, 3 figures
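To make the pipelined chain idea concrete, below is a minimal host-memory
sketch in Python with mpi4py. It illustrates the general technique rather
than the MVAPICH2-GDR implementation; the chunk size and the use of NumPy
buffers in place of GPU buffers are simplifying assumptions.

    # A minimal sketch of a pipelined chain (ring) broadcast with mpi4py.
    # Not the MVAPICH2-GDR implementation: host (NumPy) buffers stand in
    # for GPU buffers, and the chunk size is an arbitrary assumption.
    from mpi4py import MPI
    import numpy as np

    def chain_bcast(buf, comm, chunk_bytes=1 << 20):
        """Broadcast `buf` from rank 0 along the chain 0 -> 1 -> ... -> P-1."""
        rank, size = comm.Get_rank(), comm.Get_size()
        flat = buf.view(np.uint8)          # treat the buffer as raw bytes
        for lo in range(0, flat.size, chunk_bytes):
            chunk = flat[lo:lo + chunk_bytes]
            if rank > 0:                   # receive chunk from predecessor
                comm.Recv(chunk, source=rank - 1)
            if rank < size - 1:            # forward chunk to successor
                comm.Send(chunk, dest=rank + 1)

    comm = MPI.COMM_WORLD
    n = 8 << 20                            # ~32 MB of float32 payload
    if comm.Get_rank() == 0:
        data = np.arange(n, dtype=np.float32)   # root holds the payload
    else:
        data = np.empty(n, dtype=np.float32)    # others receive into this
    chain_bcast(data, comm)

Chunking is what creates the pipeline: while rank r forwards chunk i to rank
r+1, rank r-1 can already be sending chunk i+1 to rank r, so for large
messages every link of the chain stays busy at once.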
BigDL: A Distributed Deep Learning Framework for Big Data
This paper presents BigDL (a distributed deep learning framework for Apache
Spark), which has been used by a variety of users in the industry for building
deep learning applications on production big data platforms. It allows deep
learning applications to run on Apache Hadoop/Spark clusters so that they can
directly process production data and serve as part of the end-to-end data
analysis pipeline for deployment and management. Unlike existing deep learning
frameworks, BigDL implements distributed, data parallel training directly on
top of the functional compute model (with copy-on-write and coarse-grained
operations) of Spark. We also share real-world experience and "war stories" of
users that have adopted BigDL to address their challenges (i.e., how to easily
build end-to-end data analysis and deep learning pipelines for their production
data).
Comment: In ACM Symposium on Cloud Computing (SoCC) 2019
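For intuition, the data-parallel pattern that BigDL layers on Spark can be
sketched directly against Spark's RDD primitives in PySpark. This is not
BigDL's API; the linear model, squared loss, and hyperparameters are
illustrative assumptions.

    # A minimal sketch of synchronous data-parallel SGD expressed in Spark's
    # functional model, in the spirit of BigDL's design. NOT BigDL's API:
    # the toy linear model and squared loss are illustrative assumptions.
    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="data-parallel-sgd-sketch")

    # Toy dataset: (features, target) pairs for a linear model y = w . x.
    rows = [(np.random.randn(10), float(np.random.randn()))
            for _ in range(10_000)]
    data = sc.parallelize(rows, numSlices=8).cache()

    w = np.zeros(10)
    lr = 0.01
    for step in range(20):
        w_bc = sc.broadcast(w)         # ship the current weights to workers

        def grad_partition(it):
            g, n = np.zeros(10), 0
            for x, y in it:
                err = float(x @ w_bc.value) - y   # squared-loss residual
                g += err * x
                n += 1
            yield g, n

        # Each partition emits a local gradient; reduce sums them through
        # the driver, which then applies the update and re-broadcasts.
        g_sum, n_sum = (data.mapPartitions(grad_partition)
                            .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1])))
        w = w - lr * g_sum / n_sum

The coarse-grained mapPartitions/reduce/broadcast steps here are the kind of
functional Spark operations the abstract refers to; BigDL's contribution is
making full neural-network training efficient within this model.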
Distributed Graph Neural Network Training: A Survey
Graph neural networks (GNNs) are a type of deep learning model that is
trained on graphs and has been successfully applied in various domains.
Despite their effectiveness, it remains challenging for GNNs to scale
efficiently to large graphs. As a remedy, distributed computing has become a
promising solution for training large-scale GNNs, since it can provide
abundant computing resources. However, the dependencies induced by the graph
structure make high-efficiency distributed GNN training difficult to achieve,
as it suffers from massive communication and workload imbalance. In recent
years, many efforts have been devoted to distributed GNN training, and an
array of training algorithms and systems have been proposed. Yet, there is a
lack of systematic review of the optimization techniques for the distributed
execution of GNN training. In this survey, we analyze three major challenges
in distributed GNN training: massive feature communication, loss of model
accuracy, and workload imbalance. We then introduce a new taxonomy of the
optimization techniques that address these challenges, classifying existing
techniques into four categories: GNN data partition, GNN batch generation,
GNN execution model, and GNN communication protocol. We carefully discuss the
techniques in each category. Finally, we summarize existing distributed GNN
systems for multi-GPU, GPU-cluster, and CPU-cluster settings, respectively,
and discuss future directions for distributed GNN training.
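As a small illustration of the first category in this taxonomy (GNN data
partition) and of the challenges it must balance, the following
self-contained Python sketch hash-partitions a toy graph and measures
cross-partition feature communication and workload imbalance. The random
graph and the modulo partitioner are assumptions for illustration, not
methods from the survey.

    # Edge-cut hash partitioning of a toy graph, with the two costs that
    # distributed GNN training must balance: cross-partition communication
    # and workload imbalance. Graph and partitioner are illustrative only.
    import random
    from collections import Counter

    random.seed(0)
    num_nodes, num_parts = 1000, 4
    edges = [(random.randrange(num_nodes), random.randrange(num_nodes))
             for _ in range(5000)]

    part = {v: v % num_parts for v in range(num_nodes)}  # modulo partition

    # Cross-partition edges: for each one, a node's features must be
    # fetched from a remote partition during neighbor aggregation.
    cut = sum(1 for u, v in edges if part[u] != part[v])

    # Workload imbalance: busiest partition vs. the average, counting
    # destination endpoints as a proxy for aggregation work.
    load = Counter(part[v] for _, v in edges)
    imbalance = max(load.values()) / (sum(load.values()) / num_parts)

    print(f"cut edges: {cut}/{len(edges)}  imbalance: {imbalance:.2f}")

A smarter partitioner (e.g., a METIS-style min-cut) lowers the cut and hence
the feature communication, but often at the price of higher imbalance, which
is the trade-off the surveyed techniques navigate.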
Ray: A Distributed Execution Engine for the Machine Learning Ecosystem
In recent years, growing data volumes and more sophisticated computational procedures have greatly increased the demand for computational power. Machine learning and artificial intelligence applications, for example, are notorious for their computational requirements. At the same time, Moore's law is ending and processor speeds are stalling. As a result, distributed computing has become ubiquitous. While the cloud makes distributed hardware infrastructure widely accessible and therefore offers the potential of horizontal scale, developing distributed algorithms and applications remains surprisingly hard. This is due to the inherent complexity of concurrent algorithms, the engineering challenges that arise when communicating between many machines, the requirements such as fault tolerance and straggler mitigation that arise at large scale, and the lack of a general-purpose distributed execution engine that can support a wide variety of applications.

In this thesis, we study the requirements for a general-purpose distributed computation model and present a solution that is easy to use yet expressive and resilient to faults. At its core, our model takes familiar concepts from serial programming, namely functions and classes, and generalizes them to the distributed world, thereby unifying stateless and stateful distributed computation. This model not only supports many machine learning workloads like training or serving, but is also a good fit for cross-cutting machine learning applications like reinforcement learning and for data processing applications like streaming or graph processing. We implement this computational model as an open-source system called Ray, which matches or exceeds the performance of specialized systems in many application domains, while also offering horizontal scalability and strong fault tolerance properties.
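The unification of functions and classes described above is visible directly
in Ray's public API: a function decorated with @ray.remote becomes a
stateless task, and a decorated class becomes a stateful actor. A minimal
sketch (the toy workload is illustrative):

    # Tasks (stateless functions) and actors (stateful classes) in Ray.
    import ray

    ray.init()

    @ray.remote
    def square(x):                 # a stateless remote task
        return x * x

    @ray.remote
    class Counter:                 # a stateful actor
        def __init__(self):
            self.total = 0
        def add(self, v):
            self.total += v
            return self.total

    # Tasks run in parallel across the cluster; calls return futures.
    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))        # [0, 1, 4, 9]

    # Actor method calls execute serially against the actor's state.
    c = Counter.remote()
    print(ray.get([c.add.remote(v) for v in ray.get(futures)]))  # [0, 1, 5, 14]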