Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training
Most work in the deep learning systems community has focused on faster
inference, but arriving at a trained model requires lengthy experiments.
Accelerating training lets developers iterate faster and come up with better
models. DNN training is often seen as a compute-bound problem, best done in a
single large compute node with many GPUs. As DNNs grow larger, however, training
must be distributed. Distributed deep neural network (DDNN) training constitutes
an important workload on the cloud. Larger DNN models and faster compute
engines shift the training performance bottleneck from computation to
communication. Our experiments show that existing DNN training frameworks do not
scale in a typical cloud environment due to insufficient bandwidth and
inefficient parameter server (PS) software stacks. We propose PBox, a scalable
central PS hardware design that balances compute and communication resources, and
PHub, a high-performance PS software design that provides an optimized network
stack and a streamlined gradient processing pipeline, enabling common PS setups
to fully utilize PBox. We show that in a typical
cloud environment, PBox can achieve up to 3.8x speedup over state-of-the-art
designs when training ImageNet. We discuss future directions of integrating
PBox with programmable switches for in-network aggregation during training,
leveraging the datacenter network topology to reduce bandwidth usage and
localize data movement.
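As a concrete illustration of the aggregation step a central PS performs each
iteration, here is a minimal Python sketch; the flat-vector gradients and
function names are illustrative assumptions, not PHub's or PBox's actual
interfaces.

```python
import numpy as np

# Minimal sketch of a central parameter server's per-iteration work:
# gather gradients from workers, average them, and apply an SGD update.
# Real designs like PHub overlap this with an optimized network stack;
# this shows only the logical data flow.

def ps_step(params, worker_grads, lr):
    """Aggregate one round of worker gradients and update the parameters."""
    avg_grad = np.mean(worker_grads, axis=0)   # aggregate across workers
    return params - lr * avg_grad              # SGD update on the shared model

# Hypothetical setup: 4 workers, a flat 10-parameter model.
params = np.zeros(10)
grads = [np.random.randn(10) for _ in range(4)]
params = ps_step(params, grads, lr=0.1)
```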
Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training
Modern deep neural network (DNN) training jobs use complex and heterogeneous
software/hardware stacks. The efficacy of software-level optimizations can vary
significantly when used in different deployment configurations. It is onerous
and error-prone for ML practitioners and system developers to implement each
optimization separately, and determine which ones will improve performance in
their own configurations. Unfortunately, existing profiling tools do not aim to
answer predictive questions such as "How will optimization X affect the
performance of my model?". We address this critical limitation, and proposes a
new profiling tool, Daydream, to help programmers efficiently explore the
efficacy of DNN optimizations. Daydream models DNN execution with a
fine-grained dependency graph based on low-level traces collected by CUPTI, and
predicts runtime by simulating execution based on the dependency graph.
Daydream maps the low-level traces using DNN domain-specific knowledge, and
introduces a set of graph-transformation primitives that can easily model a
wide variety of optimizations. We show that Daydream is able to model most
mainstream DNN optimization techniques and accurately predict the efficacy of
optimizations that will result in significant performance improvements.
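A toy version of this simulation approach might look as follows; the task
graph, durations, and the final transformation are made-up examples, not
Daydream's actual trace format.

```python
from collections import defaultdict

# Toy dependency-graph simulator in the spirit of Daydream: each task has a
# duration, a device, and dependencies; tasks on the same device serialize.
# (Daydream builds its graph from CUPTI traces; this graph is hand-written.)
tasks = {
    "load":   {"dur": 2.0, "dev": "cpu", "deps": []},
    "fwd":    {"dur": 5.0, "dev": "gpu", "deps": ["load"]},
    "bwd":    {"dur": 9.0, "dev": "gpu", "deps": ["fwd"]},
    "allred": {"dur": 6.0, "dev": "net", "deps": ["bwd"]},
}

def simulate(tasks):
    """Predict total runtime by replaying tasks in dependency order."""
    finish, dev_free = {}, defaultdict(float)
    pending = dict(tasks)
    while pending:
        for name, t in list(pending.items()):
            if all(dep in finish for dep in t["deps"]):   # all inputs ready
                start = max([dev_free[t["dev"]]] + [finish[d] for d in t["deps"]])
                finish[name] = start + t["dur"]
                dev_free[t["dev"]] = finish[name]
                del pending[name]
    return max(finish.values())

baseline = simulate(tasks)
# Graph-transformation primitive: model a hypothetical 2x-faster backward
# kernel by shrinking its duration, then re-simulate to predict the benefit.
tasks["bwd"]["dur"] /= 2
print(baseline, "->", simulate(tasks))
```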
TBD: Benchmarking and Analyzing Deep Neural Network Training
The recent popularity of deep neural networks (DNNs) has generated a lot of
research interest in performing DNN-related computation efficiently. However,
the primary focus is usually very narrow and limited to (i) inference, i.e.,
how to efficiently execute already-trained models, and (ii) image classification
networks as the primary benchmark for evaluation.
Our primary goal in this work is to break this myopic view by (i) proposing a
new benchmark for DNN training, called TBD (TBD is short for Training Benchmark
for DNNs), that uses a representative set of DNN models that cover a wide range
of machine learning applications: image classification, machine translation,
speech recognition, object detection, adversarial networks, reinforcement
learning; and (ii) performing an extensive performance analysis of training
these different applications on three major deep learning frameworks
(TensorFlow, MXNet, CNTK) across different hardware configurations (single-GPU,
multi-GPU, and multi-machine). TBD currently covers six major application
domains and eight different state-of-the-art models.
We present a new toolchain for performance analysis for these models that
combines the targeted usage of existing performance analysis tools, careful
selection of new and existing metrics and methodologies to analyze the results,
and utilization of domain-specific characteristics of DNN training. We also
build a new set of memory profiling tools for all three major frameworks;
these much-needed tools finally shed light on precisely how much memory is
consumed by different data structures (weights, activations, gradients,
workspace) in DNN training. Using our tools and methodologies, we make
several important observations and recommendations on where future research
and optimization of DNN training should focus.
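As a flavor of this kind of accounting, the short PyTorch sketch below
attributes memory to weights and gradients; activations and workspace, which
such tools also track, require deeper framework instrumentation and are
omitted here.

```python
import torch
import torch.nn as nn

# Sketch of per-category memory accounting: sum the bytes held by weights
# and by gradients. (Activation and workspace accounting needs runtime hooks.)
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
loss = model(torch.randn(32, 1024)).sum()
loss.backward()                              # populate .grad on each parameter

weight_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
grad_bytes = sum(p.grad.numel() * p.grad.element_size()
                 for p in model.parameters() if p.grad is not None)
print(f"weights: {weight_bytes / 2**20:.1f} MiB, "
      f"gradients: {grad_bytes / 2**20:.1f} MiB")
```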
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
PipeDream is a Deep Neural Network (DNN) training system for GPUs that
parallelizes computation by pipelining execution across multiple machines. Its
pipeline parallel computing model avoids the slowdowns faced by data-parallel
training when large models and/or limited network bandwidth induce high
communication-to-computation ratios. PipeDream reduces communication by up to
95% for large DNNs relative to data-parallel training, and allows perfect
overlap of communication and computation. PipeDream keeps all available GPUs
productive by systematically partitioning DNN layers among them to balance work
and minimize communication, versions model parameters for backward pass
correctness, and schedules the forward and backward passes of different inputs
in round-robin fashion to optimize "time to target accuracy". Experiments with
five different DNNs on two different clusters show that PipeDream is up to 5x
faster in time-to-accuracy than data-parallel training.
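The work-balancing intuition behind the partitioner can be sketched with a
simple greedy split; PipeDream's actual planner is an optimizer that also
models communication and stage replication, so this is illustrative only.

```python
# Given per-layer compute times, cut the layer list into contiguous stages so
# that the slowest stage -- the pipeline's throughput bottleneck -- stays small.

def split_layers(times, n_stages):
    """Greedily cut the layer list into contiguous stages of ~equal work."""
    target = sum(times) / n_stages
    stages, cur = [], []
    for t in times:
        cur.append(t)
        if sum(cur) >= target and len(stages) < n_stages - 1:
            stages.append(cur)
            cur = []
    stages.append(cur)
    return stages

layer_times = [4, 2, 7, 3, 5, 6, 1, 4]        # hypothetical ms per layer
for i, s in enumerate(split_layers(layer_times, n_stages=3)):
    print(f"stage {i}: layer times {s}, total {sum(s)} ms")
```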
Analyzing and Mitigating Data Stalls in DNN Training
Training Deep Neural Networks (DNNs) is resource-intensive and
time-consuming. While prior research has explored many different ways of
reducing DNN training time, the impact of the input data pipeline, i.e., fetching
raw data items from storage and performing data pre-processing in memory, has
been relatively unexplored. This paper makes the following contributions: (1)
We present the first comprehensive analysis of how the input data pipeline
affects the training time of widely used computer vision and audio DNNs that
typically involve complex data preprocessing. We analyze
nine different models across three tasks and four datasets while varying
factors such as the amount of memory, number of CPU threads, storage device,
and GPU generation on servers that are part of a large production cluster at
Microsoft. We find that in many cases, DNN training time is dominated by data
stall time: time spent waiting for data to be fetched and preprocessed. (2) We
build a tool, DS-Analyzer, to precisely measure data stalls using a differential
technique, and perform predictive what-if analysis on data stalls. (3) Finally,
based on the insights from our analysis, we design and implement three simple
but effective techniques in a data-loading library, CoorDL, to mitigate data
stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware
configurations show that when PyTorch uses CoorDL instead of the
state-of-the-art DALI data loading library, DNN training time is reduced
significantly (by as much as 5x on a single server).
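The differential idea can be sketched as follows: time the same training loop
with the real input pipeline and again with a cached in-memory batch, and
attribute the difference to data stalls. The slow loader below merely
simulates storage and preprocessing latency; it is a stand-in, not
DS-Analyzer's instrumentation.

```python
import time
import torch
import torch.nn as nn

def run_steps(model, batches, steps=50):
    """Time a fixed number of training steps over the given batch source."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    start = time.perf_counter()
    for _, (x, y) in zip(range(steps), batches):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    return time.perf_counter() - start

def cached_batches(x, y):
    while True:                 # replay one in-memory batch: no fetch, no prep
        yield x, y

def slow_batches(x, y, fetch_s=0.005):
    while True:                 # stand-in for storage fetch + preprocessing
        time.sleep(fetch_s)
        yield x, y

model = nn.Linear(256, 10)
x, y = torch.randn(64, 256), torch.randint(0, 10, (64,))
t_real = run_steps(model, slow_batches(x, y))
t_cached = run_steps(model, cached_batches(x, y))
print(f"estimated data stall time: {max(0.0, t_real - t_cached):.3f} s")
```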
The Non-IID Data Quagmire of Decentralized Machine Learning
Many large-scale machine learning (ML) applications need to perform
decentralized learning over datasets generated at different devices and
locations. Such datasets pose a significant challenge to decentralized learning
because their different contexts result in significant data distribution skew
across devices/locations. In this paper, we take a step toward better
understanding this challenge by presenting a detailed experimental study of
decentralized DNN training on a common type of data skew: skewed distribution
of data labels across devices/locations. Our study shows that: (i) skewed data
labels are a fundamental and pervasive problem for decentralized learning,
causing significant accuracy loss across many ML applications, DNN models,
training datasets, and decentralized learning algorithms; (ii) the problem is
particularly challenging for DNN models with batch normalization; and (iii) the
degree of data skew is a key determinant of the difficulty of the problem.
Based on these findings, we present SkewScout, a system-level approach that
adapts the communication frequency of decentralized learning algorithms to the
(skew-induced) accuracy loss between data partitions. We also show that group
normalization can recover much of the accuracy loss of batch normalization.
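One common way to act on that last finding in PyTorch is to swap batch
normalization layers for group normalization, as in the sketch below; the
recursive replacement helper and the choice of 8 groups are our own
illustrative assumptions, not the paper's code.

```python
import torch.nn as nn

def batchnorm_to_groupnorm(module: nn.Module, groups: int = 8) -> None:
    """Recursively replace every BatchNorm2d layer with GroupNorm."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(groups, child.num_features))
        else:
            batchnorm_to_groupnorm(child, groups)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
batchnorm_to_groupnorm(model)
print(model)   # the BatchNorm2d layer is now a GroupNorm layer
```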
Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have
been increasingly deployed to train deep learning models. These accelerators
exhibit heterogeneous performance behavior across model architectures. Existing
schedulers for clusters of accelerators, which are used to arbitrate these
expensive training resources across many users, have shown how to optimize for
various multi-job, multi-user objectives, like fairness and makespan.
Unfortunately, existing schedulers largely do not consider performance
heterogeneity. In this paper, we propose Gavel, a heterogeneity-aware scheduler
that systematically generalizes a wide range of existing scheduling policies.
Gavel expresses these policies as optimization problems, making it easy to
optimize for objectives in a heterogeneity-aware way, while also being
cognizant of performance optimizations like space sharing. Gavel then uses a
round-based scheduling mechanism to ensure jobs receive their ideal allocation
given the target scheduling policy. Gavel's heterogeneity-aware policies allow
a heterogeneous cluster to sustain higher input load, and improve end
objectives such as average job completion time and makespan by up to 3.5x
compared to heterogeneity-agnostic policies.
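To show what "expressing a policy as an optimization problem" can look like,
here is a small max-min throughput LP over a hypothetical two-job,
two-accelerator-type cluster; the throughput numbers and the exact LP form
are illustrative, not Gavel's formulation.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical throughput matrix: T[j, k] = samples/sec of job j on GPU type k.
T = np.array([[100.0, 40.0],    # job 0 benefits greatly from the faster GPU
              [ 30.0, 25.0]])   # job 1 is relatively insensitive
n_gpus = np.array([1.0, 1.0])   # one GPU of each type
J, K = T.shape

# Variables: x[j*K + k] = fraction of time job j runs on type k; the last
# variable is t, the minimum job throughput. Maximize t, i.e., minimize -t.
c = np.zeros(J * K + 1)
c[-1] = -1.0
A_ub, b_ub = [], []
for j in range(J):              # t <= sum_k T[j, k] * x[j, k]
    row = np.zeros(J * K + 1)
    row[j * K:(j + 1) * K] = -T[j]
    row[-1] = 1.0
    A_ub.append(row); b_ub.append(0.0)
for k in range(K):              # don't oversubscribe each accelerator type
    row = np.zeros(J * K + 1)
    for j in range(J):
        row[j * K + k] = 1.0
    A_ub.append(row); b_ub.append(n_gpus[k])
for j in range(J):              # each job runs at most 100% of the time
    row = np.zeros(J * K + 1)
    row[j * K:(j + 1) * K] = 1.0
    A_ub.append(row); b_ub.append(1.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, 1)] * (J * K) + [(0, None)])
print("allocation:\n", res.x[:-1].reshape(J, K))
print("max-min throughput:", -res.fun)
```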
Memory-Efficient Pipeline-Parallel DNN Training
Many state-of-the-art ML results have been obtained by scaling up the number
of parameters in existing models. However, parameters and activations for such
large models often do not fit in the memory of a single accelerator device;
this means that it is necessary to distribute training of large models over
multiple accelerators. In this work, we propose PipeDream-2BW, a system that
supports memory-efficient pipeline parallelism. PipeDream-2BW uses a novel
pipelining and weight gradient coalescing strategy, combined with the double
buffering of weights, to ensure high throughput, low memory footprint, and
weight update semantics similar to data parallelism. In addition, PipeDream-2BW
automatically partitions the model over the available hardware resources, while
respecting hardware constraints such as memory capacities of accelerators and
interconnect topologies. PipeDream-2BW can accelerate the training of large GPT
and BERT language models by up to 20x with similar final model accuracy.
Comment: Accepted to ICML 2021.
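A stripped-down sketch of the double-buffering and gradient-coalescing idea
follows; the class shape and update rule are our own illustration under
stated assumptions, not PipeDream-2BW's implementation.

```python
import numpy as np

class DoubleBufferedWeights:
    """Two weight versions: 'current' for new microbatches, 'old' for in-flight ones."""
    def __init__(self, dim, lr=0.1):
        self.buf = [np.zeros(dim), np.zeros(dim)]
        self.cur = 0                      # index of the current version
        self.lr = lr
        self.grad_acc = np.zeros(dim)     # coalesced gradient across microbatches

    def read(self, started_before_update: bool) -> np.ndarray:
        """In-flight microbatches keep reading the version they started with."""
        return self.buf[1 - self.cur] if started_before_update else self.buf[self.cur]

    def accumulate(self, grad: np.ndarray) -> None:
        self.grad_acc += grad             # gradient coalescing: one buffer, one update

    def flush(self) -> None:
        """Apply the coalesced update, producing a new current version."""
        new = self.buf[self.cur] - self.lr * self.grad_acc
        self.cur = 1 - self.cur
        self.buf[self.cur] = new
        self.grad_acc = np.zeros_like(self.grad_acc)

w = DoubleBufferedWeights(dim=4)
w.accumulate(np.ones(4)); w.flush()       # one batch's coalesced update
print(w.read(False), w.read(True))        # current vs. previous version
```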
Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
Distributed deep neural network (DDNN) training constitutes an increasingly
important workload that frequently runs in the cloud. Larger DNN models and
faster compute engines are shifting DDNN training bottlenecks from computation
to communication. This paper characterizes DDNN training to precisely pinpoint
these bottlenecks. We found that timely training requires high performance
parameter servers (PSs) with optimized network stacks and gradient processing
pipelines, as well as server and network hardware with balanced computation and
communication resources. We therefore propose PHub, a high performance
multi-tenant, rack-scale PS design. PHub co-designs the PS software and
hardware to accelerate rack-level and hierarchical cross-rack parameter
exchange, with an API compatible with many DDNN training frameworks. PHub
provides a performance improvement of up to 2.7x compared to state-of-the-art
distributed training techniques for cloud-based ImageNet workloads, with 25%
better throughput per dollar.
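The hierarchical exchange that a rack-scale PS accelerates can be sketched as
two aggregation levels; the topology and helper function below are
hypothetical.

```python
import numpy as np

def hierarchical_aggregate(racks):
    """racks: list of racks, each a list of per-worker gradient vectors."""
    rack_sums = [np.sum(r, axis=0) for r in racks]     # intra-rack aggregation
    total = np.sum(rack_sums, axis=0)                  # cross-rack aggregation
    n_workers = sum(len(r) for r in racks)
    return total / n_workers                           # global mean gradient

racks = [[np.random.randn(8) for _ in range(4)] for _ in range(3)]  # 3 racks x 4 workers
g = hierarchical_aggregate(racks)
# Cross-rack traffic: 3 gradient-sized messages instead of 12 in a flat,
# rack-oblivious design.
```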
Efficient Algorithms for Device Placement of DNN Graph Operators
Modern machine learning workloads use large models with complex structures that
are very expensive to execute. The devices that execute these models are
becoming increasingly heterogeneous as we see a flourishing of domain-specific
accelerators offered alongside CPUs. These
trends necessitate distributing the workload across multiple devices. Recent
work has shown that significant gains can be obtained with model parallelism,
i.e., partitioning a neural network's computational graph onto multiple devices.
In particular, this form of parallelism assumes a pipeline of devices, which is
fed a stream of samples and yields high throughput for training and inference
of DNNs. However, for such settings (large models and multiple heterogeneous
devices), we require automated algorithms and toolchains that can partition the
ML workload across devices. In this paper, we identify and isolate the
structured optimization problem at the core of device placement of DNN
operators, for both inference and training, especially in modern pipelined
settings. We then provide algorithms that solve this problem to optimality. We
demonstrate the applicability and efficiency of our approaches using several
contemporary DNN computation graphs.
Comment: Accepted to NeurIPS 2020.
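One core problem instance in this space can be stated and solved exactly in a
few lines: split a chain of operators into contiguous pipeline stages so the
slowest stage is as fast as possible. The dynamic program below conveys the
flavor only; the paper's algorithms handle richer settings (heterogeneous
devices, memory limits, training).

```python
from functools import lru_cache

def best_partition(times, n_stages):
    """Minimal bottleneck stage time over contiguous partitions of the chain."""
    prefix = [0]
    for t in times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def solve(i, s):
        """Minimal bottleneck for operators i.. with s stages left."""
        if s == 1:
            return prefix[len(times)] - prefix[i]      # one stage takes the rest
        best = float("inf")
        for j in range(i + 1, len(times) - s + 2):     # first stage = ops i..j-1
            best = min(best, max(prefix[j] - prefix[i], solve(j, s - 1)))
        return best

    return solve(0, n_stages)

print(best_partition([4, 2, 7, 3, 5, 6, 1, 4], n_stages=3))  # -> 13
```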