4,559 research outputs found
Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs
Deep learning frameworks have been widely deployed on GPU servers for deep
learning applications in both academia and industry. In training deep neural
networks (DNNs), there are many standard processes or algorithms, such as
convolution and stochastic gradient descent (SGD), but the running performance
of different frameworks might be different even running the same deep model on
the same GPU hardware. In this study, we evaluate the running performance of
four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI,
CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node
environments. We first build performance models of standard processes in
training DNNs with SGD, and then we benchmark the running performance of these
frameworks with three popular convolutional neural networks (i.e., AlexNet,
GoogleNet and ResNet-50), after that, we analyze what factors that result in
the performance gap among these four frameworks. Through both analytical and
experimental analysis, we identify bottlenecks and overheads which could be
further optimized. The main contribution is that the proposed performance
models and the analysis provide further optimization directions in both
algorithmic design and system configuration.Comment: Published at DataCom'201
Distributed Training Large-Scale Deep Architectures
Scale of data and scale of computation infrastructures together enable the
current deep learning renaissance. However, training large-scale deep
architectures demands both algorithmic improvement and careful system
configuration. In this paper, we focus on employing the system approach to
speed up large-scale training. Via lessons learned from our routine
benchmarking effort, we first identify bottlenecks and overheads that hinter
data parallelism. We then devise guidelines that help practitioners to
configure an effective system and fine-tune parameters to achieve desired
speedup. Specifically, we develop a procedure for setting minibatch size and
choosing computation algorithms. We also derive lemmas for determining the
quantity of key components such as the number of GPUs and parameter servers.
Experiments and examples show that these guidelines help effectively speed up
large-scale deep learning training
Deep Learning in the Automotive Industry: Applications and Tools
Deep Learning refers to a set of machine learning techniques that utilize
neural networks with many hidden layers for tasks, such as image
classification, speech recognition, language understanding. Deep learning has
been proven to be very effective in these domains and is pervasively used by
many Internet services. In this paper, we describe different automotive uses
cases for deep learning in particular in the domain of computer vision. We
surveys the current state-of-the-art in libraries, tools and infrastructures
(e.\,g.\ GPUs and clouds) for implementing, training and deploying deep neural
networks. We particularly focus on convolutional neural networks and computer
vision use cases, such as the visual inspection process in manufacturing plants
and the analysis of social media data. To train neural networks, curated and
labeled datasets are essential. In particular, both the availability and scope
of such datasets is typically very limited. A main contribution of this paper
is the creation of an automotive dataset, that allows us to learn and
automatically recognize different vehicle properties. We describe an end-to-end
deep learning application utilizing a mobile app for data collection and
process support, and an Amazon-based cloud backend for storage and training.
For training we evaluate the use of cloud and on-premises infrastructures
(including multiple GPUs) in conjunction with different neural network
architectures and frameworks. We assess both the training times as well as the
accuracy of the classifier. Finally, we demonstrate the effectiveness of the
trained classifier in a real world setting during manufacturing process.Comment: 10 page
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (6 on Top500). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduces latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-revie
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep LearningInference in Function as a Service Environments
Deep neural networks (DNNs) have become core computation components within
low latency Function as a Service (FaaS) prediction pipelines: including image
recognition, object detection, natural language processing, speech synthesis,
and personalized recommendation pipelines. Cloud computing, as the de-facto
backbone of modern computing infrastructure for both enterprise and consumer
applications, has to be able to handle user-defined pipelines of diverse DNN
inference workloads while maintaining isolation and latency guarantees, and
minimizing resource waste. The current solution for guaranteeing isolation
within FaaS is suboptimal -- suffering from "cold start" latency. A major cause
of such inefficiency is the need to move large amount of model data within and
across servers. We propose TrIMS as a novel solution to address these issues.
Our proposed solution consists of a persistent model store across the GPU, CPU,
local storage, and cloud storage hierarchy, an efficient resource management
layer that provides isolation, and a succinct set of application APIs and
container technologies for easy and transparent integration with FaaS, Deep
Learning (DL) frameworks, and user code. We demonstrate our solution by
interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24x
speedup in latency for image classification models and up to 210x speedup for
large models. We achieve up to 8x system throughput improvement.Comment: In Proceedings CLOUD 201
- …