Search CORE

5,853 research outputs found

Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster

Author: Bellver Míriam
Campos Camunez Victor
Giró Nieto Xavier
Sastre Francesc
Torres Viñals Jordi
Yagües Maurici
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017
Field of study

Deep learning algorithms base their success on building high learning capacity models with millions of parameters that are tuned in a data-driven fashion. These models are trained by processing millions of examples, so that the development of more accurate algorithms is usually limited by the throughput of the computing devices on which they are trained. In this work, we explore how the training of a state-of-the-art neural network for computer vision can be parallelized on a distributed GPU cluster. The effect of distributing the training process is addressed from two different points of view. First, the scalability of the task and its performance in the distributed setting are analyzed. Second, the impact of distributed training methods on the final accuracy of the models is studied.This work is partially supported by the Spanish Ministry of Economy and Competitivity under contract TIN2012-34557, by the BSC-CNS Severo Ochoa program (SEV-2011-00067), by the SGR programmes (2014-SGR-1051 and 2014-SGR-1421) of the Catalan Government and by the framework of the project BigGraph TEC2013-43935-R, funded by the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF). We also would like to thank the technical support team at the Barcelona Supercomputing center (BSC) especially to Carlos Tripiana.Peer ReviewedPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Scaling Deep Learning on GPU and Knights Landing clusters

Author: Coates Adam
Gupta Suyog
Le Quoc V
Recht Benjamin
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2017
Field of study

The speed of deep neural networks training has become a big bottleneck of deep learning research and development. For example, training GoogleNet by ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training process, the current deep learning systems heavily rely on the hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithm aspect, current distributed machine learning systems are mainly designed for cloud systems. These methods are asynchronous because of the slow network and high fault-tolerance requirement on cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. Original EASGD used round-robin method for communication and updating. The communication is ordered by the machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster \textcolor{black}{than} their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, resp.) in all the comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use some system-algorithm codesign techniques to scale up the algorithms. By reducing the percentage of communication from 87% to 14%, our Sync EASGD achieves 5.3x speedup over original EASGD on the same platform. We get 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Author: Chu Xiaowen
Shi Shaohuai
Wang Qiang
Publication venue
Publication date: 20/08/2018
Field of study

Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. In training deep neural networks (DNNs), there are many standard processes or algorithms, such as convolution and stochastic gradient descent (SGD), but the running performance of different frameworks might be different even running the same deep model on the same GPU hardware. In this study, we evaluate the running performance of four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node environments. We first build performance models of standard processes in training DNNs with SGD, and then we benchmark the running performance of these frameworks with three popular convolutional neural networks (i.e., AlexNet, GoogleNet and ResNet-50), after that, we analyze what factors that result in the performance gap among these four frameworks. Through both analytical and experimental analysis, we identify bottlenecks and overheads which could be further optimized. The main contribution is that the proposed performance models and the analysis provide further optimization directions in both algorithmic design and system configuration.Comment: Published at DataCom'201

arXiv.org e-Print Archive

Crossref