Daydream: Accurately Estimating the Efficacy of Optimizations for DNN
  Training by Zhu, Hongyu et al.
Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training
Hongyu Zhu†, Amar Phanishayee⋆, Gennady Pekhimenko†
†University of Toronto & Vector Institute ⋆Microsoft Research
Abstract
Modern deep neural network (DNN) training jobs use com-
plex and heterogeneous software/hardware stacks. The effi-
cacy of software-level optimizations can vary significantly
when used in different deployment configurations. It is oner-
ous and error-prone for ML practitioners and system develop-
ers to implement each optimization separately, and determine
which ones will improve performance in their own configu-
rations. Unfortunately, existing profiling tools do not aim to
answer predictive questions such as "How will optimization
X affect the performance of my model?". We address this
critical limitation and propose a new profiling tool, Daydream,
to help programmers efficiently explore the efficacy of
DNN optimizations. Daydream models DNN execution with
a fine-grained dependency graph based on low-level traces
collected by CUPTI [49], and predicts runtime by simulat-
ing execution based on the dependency graph. Daydream
maps the low-level traces using DNN domain-specific knowl-
edge, and introduces a set of graph-transformation primitives
that can easily model a wide variety of optimizations. We
show that Daydream is able to model most mainstream DNN
optimization techniques, and accurately predict the efficacy
of optimizations that will result in significant performance
improvements.
1 Introduction
Recent years have witnessed the co-evolution of deep neu-
ral network (DNN) algorithms and the underlying hardware
and software design. ML researchers have developed many
important models [21, 27, 28, 71] at a rapid pace, creating
a huge demand for computation power [67]. To meet the
demand for fast DNN computation, computer architects re-
spond with new, AI-optimized GPUs (e.g., NVidia Turing
architecture [55]) and various domain-specific hardware ac-
celerators from FPGAs (e.g., Microsoft Catapult [63]) to
ASICs (e.g., Google TPU [34], Amazon Inferentia [68]). However,
these accelerators might not be effective in improving
performance without proper software optimizations across
the full systems stack [81]. As a result, systems researchers
have proposed many optimizations, targeting different bot-
tlenecks across the system stack – for example, improving
memory utilization [30, 66], better overlapping of communi-
cation with computation [26,31,80], and increasing communi-
cation efficiency [17]. Moreover, researchers have also devel-
oped workload-centric optimizations to exploit the stochastic
nature of DNN computation. For example, precision reduc-
tion [19, 24, 42] aims to reduce runtime as well as memory
consumption, and gradient compression [40, 41] aims at re-
ducing the communication overhead in distributed training.
Despite these advances, the benefits of many proposed op-
timizations cannot be fully exploited due to two main reasons.
First, the efficacy of many proposed performance optimiza-
tions can drastically change when applied to different ML
models and deployment configurations. The hardware deploy-
ments that practitioners use might be completely different
from the hardware configurations used by optimization and
model inventors. Differences in DNN models, accelerator
type, compute capabilities, available memory, networking ca-
pabilities, and software library versions can all shift the major
runtime bottlenecks. Second, it is onerous for programmers
to implement and evaluate various optimizations to identify
the ones that actually work for their models. As a result, it is
common for users to ask what-if questions such as:
Why did my DNN training workload run slowly? Will opti-
mization X improve the performance of my model? Does GPU
memory capacity limit the performance of my model? Would
upgrading to a faster network improve training throughput?
How will my workload scale with the number of GPUs?
The central focus of this paper is to answer the following
general question for DNN training workloads: Given a model
and a deployment scenario, how can we efficiently explore
the efficacy of potential solutions? Systems researchers have
tried to explore the impact of different potential performance
bottlenecks (e.g., CPU, network, IO) in many non-ML con-
texts [6,18,43,58,59,72]. The basic approaches to explore the
what-if questions are similar: decompose the workloads into
atomic tasks, profile runtime statistics for each task, model the
what-if question, and use simulation to estimate performance.
These systems typically address what-if questions of the form:
"How does runtime change if a task T is N times (or even
infinitely) faster?" [18, 59]. Such questions can be simply
modeled by shrinking task runtime. While this basic approach
seems sufficient to address the central question above for
ML workloads, the diversity of DNN optimizations intro-
duces three key requirements unique to these workloads, thus
motivating the need for a novel solution.
First, we need to track dependencies at a kernel-level
abstraction, i.e., one GPU kernel corresponds to one task (the
smallest unit of execution in the dependency graph). Such
fine-grained abstraction is necessary because optimizations
that improve hardware utilization typically target individual
compute kernels (e.g., mixed precision [42]). Meanwhile, ac-
curate performance estimation has to consider both CPU and
GPU runtime. Certain optimizations, e.g., kernel fusion, re-
quire potentially removing existing CPU and GPU tasks from
the dependency graph. Existing tools do not provide such
dependency tracking. It is therefore important to track kernel-
level dependencies among concurrently executing tasks.
Second, we need to map tasks to DNN layers. In con-
trast to prior works that explore what-if questions in non-
ML contexts, predicting the performance of DNN optimiza-
tions requires domain knowledge about DNNs to properly
model them. For example, MetaFlow [33] and TASO [32]
fuse DNN layers. Modeling them requires a mapping from
tasks to specific DNN layers. However, collecting kernel-level
traces on accelerators requires generic vendor-provided tools
(e.g., NVProf [48], CUPTI [49]), which have no application
specific knowledge. We therefore need to have the ability to
map low-level tasks to DNN layers.
Third, we need the ability to easily model diverse DNN
optimizations. Modeling a DNN optimization might involve
not just scaling or shrinking task durations, but also compli-
cated transformations to the dependency graph. For exam-
ple, TicTac [26] reschedules communication tasks, BlueCon-
nect [17] replaces the communication primitives to utilize
parallel network channels, and the optimization proposed by
Jung et al. [35] restructures the GPU kernel implementations.
Manually manipulating the kernel-level dependency graph
could be extremely intricate and error-prone. The system
should enable users to flexibly and effectively model such
diverse optimizations with minimal effort.
We introduce Daydream, a new system that fulfills all three
requirements described above, and achieves our goal of an-
swering potential what-if questions for DNN workloads. Con-
structing dependencies among potentially thousands of low-
level tasks is not an easy problem: tasks can be spread across
multiple execution threads (including both CPU threads and
GPU streams), thus even for simple DNN workloads, this re-
sults in thousands of tasks to be tracked. The intricacy comes
from identifying dependencies across threads. We make a key
observation about DNN training workloads: despite the large
number of tasks that need to be tracked, the number of concur-
rently executing threads is surprisingly quite limited. Based
on this observation, Daydream constructs the low-level depen-
dency graph, which provides a realistic model of overlapping
among CPU, GPU, and communication runtimes in a DNN
training workload. It uses a synchronization-free approach
to map GPU tasks onto appropriate higher-level DNN layer
abstractions. We also introduce a set of graph-transformation
rules, allowing programmers to effectively model various per-
formance optimizations. After modeling the optimization,
Daydream simulates the execution based on the new depen-
dency graph to predict the overall runtime. In our evaluation,
we show that Daydream is able to distinguish effective DNN
optimizations from those that will bring limited improvements
by accurately predicting their performance speedups.
In summary, we make the following key contributions:
• We make the observation that fine-grained tasks in DNN
training workloads are highly sequential. This greatly
simplifies dependency graph construction, over thou-
sands of tasks, as we only need to identify a limited
number of inter-thread dependencies.
• Daydream introduces the abstraction of a kernel-
granularity dependency graph that contains mappings
back to DNN specific abstractions (layers), by collect-
ing profiling data, instrumenting DNN frameworks, and
exploiting information from vendor-provided tools like
CUPTI. Daydream also provides primitives to mutate
the dependency graph in the form of simple graph trans-
formations. Taken together this enables programmers
to both (i) model a diverse set of popular optimizations
spanning kernel- and layer-level enhancements by using
simple graph-transformation primitives, and (ii) estimate
the efficacy of optimizations by simulating execution
time based on optimization-induced graph mutations.
• We extensively evaluate Daydream, with five different
optimizations on five DNN models across three distinct
applications. We show that Daydream can effectively de-
tect which optimizations provide improvements and also
accurately predict their magnitude for different DNN
models and deployments. For example, we estimate that
using mixed precision will improve the iteration time
of training the BERT-Large model by 17.2% (with <3% error),
while the kernel fusion technique can improve it by
38.7% (with <7% error). We can also accurately predict
performance in distributed training with different num-
ber of workers and variable network bandwidth, based
on runtime profiles collected from a single-GPU setting.
2 Background
2.1 DNN Training Basics
DNN training is an iterative algorithm, in which one itera-
tion consists of three phases: (i) forward, (ii) backward, and
Optimization Goal | Strategy | Technique Examples
Improving Hardware Utilization in Single-Worker Setting | Increasing Mini-batch Size by Reducing Memory Footprints | vDNN [66], Gist [30], Chen et al. [15]
Improving Hardware Utilization in Single-Worker Setting | Reducing Precision | Micikevicius et al. [42], Gupta et al. [24], Das et al. [19]
Improving Hardware Utilization in Single-Worker Setting | Fusing Kernels/Layers | FusedAdam [?], MetaFlow [33], Ashari et al. [11], TASO [32]
Improving Hardware Utilization in Single-Worker Setting | Improving Low-level Kernel Implementation | Restructuring Batchnorm [35], Tensor Comprehensions [70], Kjolstad et al. [37], TVM [14]
Lowering Communication Overhead in Distributed Training | Reducing Communication Workloads | Deep Gradient Compression [40], AdaComm [74], Parallax [36], TernGrad [75], QSGD [9]
Lowering Communication Overhead in Distributed Training | Improving Communication Efficiency/Overlap | Wait-free Backprop [80], P3 [31], BlueConnect [17], TicTac [26], BytePS [61], Xue et al. [77]
Table 1: Representative optimizations for DNN training. We show how we can accurately estimate the performance of optimiza-
tions (shown in italics) in Section 6, and can effectively model many other optimizations (shown in bold) in Section 5.
(iii) weight update. The forward phase takes training data sam-
ples as input and produces output based on current weights
(or parameters). The error between the forward output and
the input data labels is fed to the backward phase, which computes
the gradients of the error with respect to the weights.
The weight update phase then uses the gradients to update
weights accordingly. In each iteration, the input data samples
are randomly selected [12], forming a mini-batch of input.
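The three phases can be illustrated with a toy example. The following is a minimal sketch, assuming a one-variable linear model trained with mean squared error; the function name `train`, the learning rate, and the synthetic data are illustrative choices and do not come from the paper.

```python
import random


def train(samples, lr=0.1, batch_size=4, iterations=2000, seed=0):
    """Mini-batch SGD on y = w*x + b, written to expose the three phases."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        # Randomly select the input samples that form one mini-batch.
        batch = rng.sample(samples, batch_size)
        # (i) Forward: produce outputs from the current weights.
        preds = [w * x + b for x, _ in batch]
        errs = [p - y for p, (_, y) in zip(preds, batch)]
        # (ii) Backward: gradients of the mean squared error w.r.t. w and b.
        grad_w = sum(2 * e * x for e, (x, _) in zip(errs, batch)) / batch_size
        grad_b = sum(2 * e for e in errs) / batch_size
        # (iii) Weight update: apply the gradients.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic, noise-free data drawn from y = 3x + 1 (an assumption for the demo).
samples = [(x / 4, 3 * (x / 4) + 1) for x in range(-4, 5)]
w, b = train(samples)
```

In a real DNN each phase expands into many CPU functions and GPU kernels per layer, which is the granularity Daydream works at.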
2.2 DNN Training Optimizations
Modern DNNs have millions of parameters [25], resulting
in training times of days or even weeks [38]. To improve
DNN training performance, researchers have proposed var-
ious strategies focusing on different optimization goals. To
understand the potential what-if questions and how to design
a system to answer them, we study a list of software-level
techniques that speed up DNN training from top systems and
ML conferences in recent years. Table 1 shows our summary.
Exploiting computation power of hardware accelera-
tors. ML programmers often use large mini-batches, within
the memory budget, for better hardware utilization and faster
convergence. This motivates strategies that reduce the memory
footprint of DNN training and hence enable training
with larger mini-batch sizes [15, 30, 66]. Researchers have
also proposed some generic strategies to increase hardware
utilization, including precision reduction [19, 24, 42], ker-
nel/layer fusion [11, 32, 33], and improving low-level kernel
implementation [14, 35, 37, 70]. Meanwhile, libraries such as
cuDNN [16], cuBLAS [45], MKL [73], Eigen [2], and NCCL [46]
are also constantly evolving to provide operations and primi-
tives that can better utilize underlying hardware.
Scalable distributed training. Data parallelism [12] is a
simple and effective strategy to improve training performance.
Using multiple accelerators significantly reduces DNN train-
ing time to hours or even minutes [44]. This success is mainly
based on the techniques that guarantee model convergence
under extremely large mini-batch size [8, 23, 78]. One of
the major performance bottlenecks for distributed training is
communication, which can be optimized by compressing traf-
fic [40, 41, 74, 75], increasing network utilization [17, 77], or
increasing the overlap between communication and computa-
tion [26,31,80]. Exploring the efficacy of these optimizations
without prediction requires a multi-machine cluster. Our pro-
posed design, Daydream, avoids the potential cost of cluster
setup (i.e. extra machines, accelerators, high-speed communi-
cation), by predicting distributed training performance with
profiles collected from a single-worker environment.
2.3 Profiling Tools for DNNs
As the full ML system stack is constantly evolving, profiling
tools play a key role in helping programmers identify the per-
formance bottlenecks under different system configurations.
Hardware profiling tools. Modern DNN training heav-
ily relies on hardware accelerators such as GPUs [55] and
TPUs [34]. To help programmers develop highly efficient
applications, hardware vendors provide profiling tools that
can expose hardware performance counters. For example,
NVProf [48] provides programmers with information includ-
ing start/end time, core utilization, memory throughput, cache
miss rate, along with hundreds of other hardware counters
for every GPU kernel. CUPTI [49] enables programmers to
extract and manipulate these counters at runtime. Nsight [47]
aims to provide details on the state of more fine-grained coun-
ters for recent GPU architectures [55]. Our proposed system,
Daydream, relies on CUPTI to collect low-level traces for
further analysis.
Framework built-in tools. For more intuitive profiling
results, it is often desirable for a profiler to show runtime
statistics for framework operations, or even DNN layers.
DNN frameworks have built-in tools to achieve this goal
by correlating the hardware counters with runtime informa-
tion collected in frameworks. TensorFlow [4], coupled with
the Cloud TPU Tool [22], can provide an execution timeline
and runtime statistics for each TensorFlow operation. Simi-
larly, other mainstream frameworks (e.g., MXNet [13] and
PyTorch [60]) provide built-in tools that can extract per-layer
[Figure 1: NVProf timeline example of training ResNet-50. Timeline rows:
CPU thread #1, CPU thread #2, the default GPU stream, CUDA memory copies.]
or per-operation runtime from both the CPU and the GPU.
The framework built-in tools render intuitive results for pro-
grammers, but omit important details (for example, the CPU
runtime). We show in our work that such information is cru-
cial in building an accurate runtime predictor.
3 Key Ideas
In this section we highlight the key ideas and observations
behind the Daydream design.
Constructing kernel-granularity dependency graph.
The neural network topology is a natural graph structure in
which nodes are DNN operators or layers. Most mainstream
DNN frameworks [13, 60] provide built-in tools to record
the layer-level runtime profile. The layer-level abstraction
is intuitive for programmers to understand the "where time
goes" question, but hides important information about the
parallel execution of the CPU functions, GPU kernels, and
memory transfers. This information is crucial for accurate per-
formance predictions. For example, optimizations that reduce
numerical precision will change the duration of GPU kernels
while the CPU runtime remains unchanged, and optimizations
like vDNN [66] will inject CUDA memory copies, without
changing the duration of GPU kernels. It is extremely hard
to predict how duration of each layer changes when applying
these optimizations if lacking low-level details about CPU
and GPU runtime. To accommodate optimizations that target
fine granularity tasks (such as GPU kernels), our proposed
system, Daydream, models the training workloads
using a kernel-level dependency graph (i.e., each GPU ker-
nel has one corresponding task in the graph), incorporating
detailed traces of CPU, GPU and communication runtime.
With a large number of kernel-level tasks that are spread
across several threads and CUDA streams, the complexity of
constructing the dependency graph comes mainly from identi-
fying the inter-thread dependencies [72]. Existing tools do not
provide such dependency tracking. We make the following
key observations about the DNN training workloads to over-
come this general challenge of dependency tracking in concur-
rent systems. First, for the implementations in the mainstream
frameworks [13, 60], once a mini-batch has been prepared
by data loading threads, only one or two CPU threads are
involved in the control flow of computation (our approach can be
generalized to frameworks that use more concurrent CPU threads).
Second, there is a very limited number of concurrent GPU kernels.
Such serialization of GPU kernels is due to two main reasons: (i)
GPU kernels in the modern cuDNN library achieve high GPU
core utilization; (ii) ML frameworks usually invoke only one
CUDA stream. Figure 1 shows the NVProf profiles of one
training iteration of ResNet-50. There are two CPU threads
involved, but no CPU tasks run concurrently. The high serial-
ization of low-level traces is not a unique phenomenon for just
convolutional networks. We observe a similar phenomenon
in most DNN training workloads.
Based on these insights, Daydream constructs the kernel-
level dependency graph in three major steps. First, Daydream
uses CUPTI to extract traces of all GPU kernels, CUDA mem-
ory copies, and CUDA APIs. Second, Daydream captures the
dependencies between CPU and GPU tasks, caused by CUDA
synchronizations and GPU kernel launches. Third, when pre-
dicting performance for distributed training, Daydream adds
communication tasks to the dependency graph.
Synchronization-free task-to-layer mapping. In dis-
tributed training, mainstream frameworks implement the wait-
free backpropagation strategy [80] to overlap communication
with computation. This strategy immediately transfers gra-
dients once they are computed by corresponding backward
layers. To properly add dependencies related to communica-
tion tasks, we need the task-to-layer mapping to know when
the computation of each layer ends. Meanwhile, accurately
modeling DNN optimizations by changing the graph poten-
tially requires this task-to-layer mapping to determine which
tasks are involved and how to change them.
Unfortunately, vendor-provided tools like CUPTI do not
have the required knowledge about these applications and
building such a mapping requires extra DNN framework in-
strumentation. A naïve approach to achieve this mapping is
to compare the start and stop timestamps of GPU kernels
and DNN layers. This requires additional CUDA synchro-
nization calls for each layer since GPU kernels are launched
asynchronously. However, such synchronizations might sig-
nificantly alter the execution runtime by adding additional
dependencies from GPU to CPU tasks. Hence, we design a
synchronization-free procedure to achieve this mapping by
instrumenting timestamps for each layer in the frameworks,
and utilizing the correlations between CPU and GPU tasks.
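The idea can be sketched as follows. This is a minimal illustration, assuming that a GPU kernel belongs to the layer whose CPU-side interval contains the kernel's launch call, and that launches and kernels are joined through the correlation ID. The record layouts and the name `map_kernels_to_layers` are hypothetical and are not Daydream's actual data structures.

```python
from bisect import bisect_right


def map_kernels_to_layers(layer_spans, launches, kernel_corr_ids):
    """Map each GPU kernel to a layer without any extra synchronization.

    layer_spans:     [(layer_name, cpu_start, cpu_end)], non-overlapping,
                     taken from timestamps instrumented in the framework
    launches:        [(correlation_id, cpu_timestamp)] for CUDA launch APIs
    kernel_corr_ids: correlation IDs of the traced GPU kernels
    Returns {correlation_id: layer_name or None}.
    """
    spans = sorted(layer_spans, key=lambda s: s[1])
    starts = [s[1] for s in spans]
    launch_layer = {}
    for corr_id, ts in launches:
        i = bisect_right(starts, ts) - 1          # last span starting at or before ts
        if i >= 0 and ts <= spans[i][2]:
            launch_layer[corr_id] = spans[i][0]
    # A kernel inherits the layer of the launch that shares its correlation ID.
    return {cid: launch_layer.get(cid) for cid in kernel_corr_ids}


layer_spans = [("conv1", 0.0, 10.0), ("relu1", 10.5, 14.0)]
launches = [(1, 2.0), (2, 5.0), (3, 11.0), (4, 20.0)]
mapping = map_kernels_to_layers(layer_spans, launches, [1, 2, 3, 4])
```

Only CPU-side timestamps are compared here, so the GPU kernels may run much later than the layer's CPU interval without affecting the result.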
Representing complex optimizations with simple
graph-transformation primitives. As shown in Table 1,
DNN optimizations target a wide range of performance bot-
tlenecks with various approaches. Unlike prior dependency
graph analysis in non-ML contexts [18, 58, 59], where users
can model most what-if questions by simply shrinking and
scaling task runtime, accurately modeling DNN optimizations
with the low-level dependency graph might require compli-
cated changes to the dependency graph. Manually changing
the kernel-level graph to model optimizations could be both
complicated and error-prone, and programmers might simply
opt to implement the optimizations directly instead.
To address this problem, we propose a small set of graph-
transformation primitives, so that popular optimization tech-
niques can be effectively represented as a combination of
these primitives. These primitives include (i) task inser-
tion/removal, (ii) task selection and update, and (iii) changing
the policy for scheduling tasks. The proposed primitives are
simple yet powerful enough to represent many different opti-
mizations as we will show in Section 5. They play a key role
in realizing our goal of efficiently exploring what-if questions.
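A rough sketch of what such primitives could look like is given here. The class `DependencyGraph`, its method names, and the choice to reconnect parents to children on removal are assumptions made for illustration, not Daydream's interface. The scheduling-policy primitive is left out; it would be a function that picks the next task from the ready set during simulation.

```python
class DependencyGraph:
    """Toy task graph: tasks are dicts, edges are (parent, child) name pairs."""

    def __init__(self):
        self.tasks = {}      # name -> {"kind", "layer", "duration"}
        self.edges = set()   # {(parent_name, child_name)}

    def insert(self, name, kind, layer, duration, parents=(), children=()):
        """Task insertion: add a task and wire it between existing tasks."""
        self.tasks[name] = {"kind": kind, "layer": layer, "duration": duration}
        self.edges |= {(p, name) for p in parents}
        self.edges |= {(name, c) for c in children}

    def remove(self, name):
        """Task removal: drop a task and reconnect its parents to its children."""
        parents = [p for p, c in self.edges if c == name]
        children = [c for p, c in self.edges if p == name]
        self.edges = {(p, c) for p, c in self.edges if name not in (p, c)}
        self.edges |= {(p, c) for p in parents for c in children}
        del self.tasks[name]

    def select(self, predicate):
        """Task selection: names of all tasks that satisfy the predicate."""
        return [n for n, t in self.tasks.items() if predicate(t)]

    def scale(self, names, factor):
        """Task update: multiply the duration of the selected tasks."""
        for n in names:
            self.tasks[n]["duration"] *= factor


g = DependencyGraph()
g.insert("bwd_fc", "gpu", "fc", 4.0)
g.insert("wu_fc", "gpu", "fc", 1.0, parents=["bwd_fc"])
g.insert("allreduce_fc", "comm", "fc", 6.0, parents=["bwd_fc"], children=["wu_fc"])

# What if network bandwidth doubles? Halve every communication task.
g.scale(g.select(lambda t: t["kind"] == "comm"), 0.5)
```

The last two lines mirror the bandwidth what-if question: the whole model of the change is a selection followed by an update.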
In summary, Daydream introduces the abstraction of a
kernel-granularity dependency graph that contains mappings
back to DNN specific abstractions (layers). It tracks depen-
dencies by collecting profiling data as well as instrument-
ing DNN frameworks. Daydream also provides primitives
to mutate the dependency graph in the form of simple graph
transformations. Altogether this enables programmers to both
(i) model a diverse set of popular optimizations spanning
kernel- and layer-level enhancements by using simple graph-
transformation primitives, and (ii) estimate the efficacy of opti-
mizations by simulating execution time based on optimization-
induced graph mutations.
4 Design
We describe Daydream’s design with an emphasis on how to
construct Daydream’s proposed graph abstraction: the kernel-
granularity dependency graph with mappings back to DNN
layers. We also describe the primitives for mutating this graph
to model different optimizations and how Daydream uses the
graph to estimate the efficacy of various DNN optimizations.
4.1 Overview of Daydream
Figure 2 shows the workflow of performance prediction in
Daydream. It consists of the following four phases:
Phase 1: Trace collection. Constructing a kernel-level de-
pendency graph requires low-level details for all tasks. These
details are voluminous, differ across ML frameworks,
and can be obtained by profiling a baseline workload. Day-
dream collects low-level profiling data using CUPTI [49], a
tool which provides details for all CPU/GPU tasks includ-
ing name, start time, duration, CUDA stream ID, thread ID,
etc. We manually augment three popular frameworks (Caffe,
MXNet, PyTorch) for use with CUPTI and modify the layer
modules of these frameworks to collect timestamps of each
layer, which will be used for task-to-layer mapping, described
in Section 4.3. Through our instrumentation, we also collect
the necessary information (e.g., size of gradients) to construct
the dependency graph of distributed training via a profile
collected in a single worker setting.
Phase 2: Dependency graph construction. Daydream
constructs the dependency graph with details of tasks pro-
vided by the first phase. A dependency could be induced by
domain knowledge (e.g., a GPU task triggers a communica-
tion task), or by hardware/software implementation (e.g., a
cudaLaunchKernel API triggers the corresponding GPU task).
Based on our analysis, we identify five different types of de-
pendencies (described in Section 4.2.2), which are sufficient
for Daydream to accurately simulate baseline execution.
Phase 3: Graph transformation. To estimate the efficacy
of a given optimization, Daydream models the optimization
by transforming the dependency graph. Daydream provides a
set of primitives to represent these transformations. We design
these primitives in a way such that they are succinct (easy to
use), flexible (able to depict a wide range of optimizations),
and accurate (being able to achieve high prediction accuracy).
Algorithm 1: Daydream's Simulation Algorithm
Input:  Dependency graph G(V, E)
Output: The start time of each task u ∈ V
 1  F ← ∅                      // initialize the frontier task set
 2  P ← {0}                    // initialize thread progress
 3  foreach task u ∈ V do
 4      u.ref ← |{u's parents}|
 5      if u.ref = 0 then
 6          F ← F ∪ {u}
 7  end
 8  while F ≠ ∅ do
 9      u ← schedule(F)        // pick a task to exec.
10      t ← u.ExecutionThread
11      F ← F − {u}
12      u.start ← max(P[t], u.start)
13      P[t] ← u.start + u.duration + u.gap
14      foreach c ∈ u.children do
15          c.ref ← c.ref − 1
16          c.start ← max(c.start, u.start + u.duration + u.gap)
17          if c.ref = 0 then
18              F ← F ∪ {c}
19          end
20      end
21  end
Phase 4: Runtime simulation. Daydream simulates the
execution of optimizations to predict runtime based on the
dependency graph. Algorithm 1 shows the simulation process,
which traverses the dependency graph and puts tasks into
execution threads. In each iteration, Daydream picks one
task from the execution frontier (i.e. tasks that are ready to
execute), dispatches it to its corresponding execution thread,
and updates the thread progress. The simulation determines
the start time of each task and records the total execution time.
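Algorithm 1 translates almost line for line into a short program. The sketch below is one possible rendering, not Daydream's implementation: the `Task` fields follow the algorithm, while the default scheduling policy (earliest ready task first) and the definition of total time as the latest `start + duration` are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(eq=False)
class Task:
    name: str
    thread: str          # CPU thread, GPU stream, or communication channel
    duration: float
    gap: float = 0.0     # untraced time that follows the task on its thread
    start: float = 0.0   # filled in by simulate()
    parents: list = field(default_factory=list, repr=False)
    children: list = field(default_factory=list, repr=False)


def add_dependency(parent, child):
    parent.children.append(child)
    child.parents.append(parent)


def simulate(tasks, schedule=None):
    """Assign a start time to every task and return the total execution time."""
    schedule = schedule or (lambda frontier: min(frontier, key=lambda t: t.start))
    for t in tasks:
        t.start = 0.0                                  # allow repeated runs
    progress = defaultdict(float)                      # P: per-thread progress
    ref = {t: len(t.parents) for t in tasks}           # u.ref
    frontier = [t for t in tasks if not t.parents]     # F
    total = 0.0
    while frontier:
        u = schedule(frontier)
        frontier.remove(u)
        u.start = max(progress[u.thread], u.start)
        finish = u.start + u.duration + u.gap
        progress[u.thread] = finish
        total = max(total, u.start + u.duration)
        for c in u.children:
            ref[c] -= 1
            c.start = max(c.start, finish)
            if ref[c] == 0:
                frontier.append(c)
    return total


# Two CPU launches feeding two kernels on one GPU stream.
l0 = Task("launch K0", "cpu", 1.0, gap=0.5)
l1 = Task("launch K1", "cpu", 1.0)
k0 = Task("K0", "gpu", 5.0)
k1 = Task("K1", "gpu", 2.0)
for parent, child in [(l0, l1), (l0, k0), (l1, k1), (k0, k1)]:
    add_dependency(parent, child)

baseline = simulate([l0, l1, k0, k1])
k0.duration = 2.5                      # what if K0 ran twice as fast?
what_if = simulate([l0, l1, k0, k1])
```

Changing a duration and simulating again is the simplest what-if query; more involved optimizations change the graph itself before the second run.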
4.2 Dependency Graph Construction
Constructing the dependency graph amounts to determining
the node (task) set and the edge (dependency) set.
4.2.1 Task
Daydream’s kernel-level dependency graph contains the fol-
lowing four types of tasks:
GPU tasks. Each GPU task in the graph corresponds to one
GPU kernel. Daydream also views CUDA memory copies as
[Figure 2: An example showing Daydream's overall workflow for predicting
runtime assuming network bandwidth doubles. Each panel shows CPU, GPU, and
network timelines. (a) Constructing the dependency graph based on CUPTI
traces (the black arrows represent task dependencies). (b) Mapping each task
to DNN layers (shown in different colors in the figure), then inserting
communication tasks (allReduce) based on mapping and instrumentation.
(c) Predicting "what if network bandwidth is 2×" by shrinking allReduce
duration by 2× and simulating the new dependency graph.]
GPU tasks, because each memory copy is associated with a
specific CUDA stream, and therefore has dependencies with
other GPU kernels. The runtime of all these tasks can be
collected using CUPTI.
CPU tasks. To model the concurrency and dependencies
between CPU runtime and the GPU runtime, Daydream gen-
erates CPU tasks based on CPU traces collected by CUPTI.
One of the limitations of CUPTI is that it can only expose
CUDA-related traces. Instead of adding massive instrumen-
tation to the framework, Daydream captures the non-CUDA
runtime by recording the lengths of gaps between consecutive
CPU tasks (shown in line 13 of Algorithm 1).
Data loading tasks. One data loading task corresponds to
loading one mini-batch from disk/flash to CPU memory. We
include data loading tasks for completeness, even though data
loading in most DNN training workloads is not a performance
bottleneck. In Daydream’s implementation, we treat all data
loading tasks as CPU tasks.
Communication tasks. A communication task corre-
sponds to one communication primitive, e.g., a push/pull
operation in parameter-server based frameworks [39], or an
all-reduce operation in decentralized frameworks. When pre-
dicting distributed training performance, Daydream automat-
ically adds communication tasks to the dependency graph
based on a single-worker profile. We notice that in PyTorch,
gradients from multiple layers can be grouped and sent with
a single allReduce primitive [3]. Thus, properly adding com-
munication tasks to a PyTorch profile requires additional in-
strumentation to extract knowledge about gradients grouping.
Given the types of tasks in the graph, Daydream collects
and maintains the following information for each task, which
is later used in what-if analysis and simulation:
ExecutionThread. Depending on the type of a task, its
execution thread can be one of the following: (i) a CPU process,
(ii) a GPU stream, or (iii) a communication channel.
A data loading task is executed in a CPU process. A CPU
process has a process ID, a GPU stream has a stream ID, and
a communication channel could be send/receive when using
parameter server primitives, or a unified one when using col-
lective primitives. This field is used in line 10 of Algorithm 1.
Duration. This field specifies how long a task takes to
execute. The duration of a CPU/GPU task is collected by
CUPTI. The runtime of data loading tasks is measured by
injecting timestamps to the framework. Daydream aims to
predict distributed training performance based on profiling in
a single-GPU configuration. Hence we calculate the duration
of all communication tasks based on the size of gradients, the
communication type (push/pull/all-reduce), and the network
bandwidth. These numbers can be obtained based on knowl-
edge of the DNN model and framework implementation.
Gap. The duration of low-level CUDA APIs (e.g.,
cudaMalloc) might be only tens of microseconds, which is of
the same magnitude as the runtime of their non-CUDA equiva-
lent C functions (e.g., malloc), or the runtime of the call stack
from Python front-end to C back-end. NVidia-provided tools
cannot expose non-CUDA traces, but these traces are indispensable to
simulation accuracy. The non-CUDA CPU runtime is usually
not a target for optimization in DNN models, hence, we do
not need to define and measure corresponding tasks. Instead,
for each CPU task in our current definition, we measure the
gap between its end and the start of the next task in the same
execution thread, and simulate these gaps in Algorithm 1.
Layer. This field refers to which DNN layer a task belongs
to, which is necessary information for programmers to trans-
form the graph and model optimizations. Daydream uses a
synchronization-free approach to map a task to DNN layers.
We will describe the details of this approach in Section 4.3.
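Two of the fields above are derived rather than read directly from a trace: Duration for communication tasks and Gap for CPU tasks. The sketch below shows one way to compute both. The cost model is an assumption (plain size over bandwidth for push/pull, and the standard ring all-reduce traffic factor of 2(N-1)/N); the paper does not state its formula, and the function names and record layout are hypothetical.

```python
def communication_duration(grad_bytes, kind, bandwidth, num_workers=2):
    """Estimated seconds for one communication primitive (assumed cost model)."""
    if kind in ("push", "pull"):
        traffic = grad_bytes
    elif kind == "all-reduce":
        # Assumed ring all-reduce: each worker moves 2*(N-1)/N of the gradient bytes.
        traffic = 2 * (num_workers - 1) / num_workers * grad_bytes
    else:
        raise ValueError(f"unknown communication type: {kind}")
    return traffic / bandwidth


def attach_gaps(cpu_tasks):
    """Set task['gap'] to the idle time before the next task on the same thread.

    cpu_tasks: list of dicts with 'thread', 'start', and 'duration' keys.
    """
    by_thread = {}
    for task in cpu_tasks:
        by_thread.setdefault(task["thread"], []).append(task)
    for tasks in by_thread.values():
        tasks.sort(key=lambda t: t["start"])
        for cur, nxt in zip(tasks, tasks[1:]):
            cur["gap"] = max(0, nxt["start"] - (cur["start"] + cur["duration"]))
        tasks[-1]["gap"] = 0          # nothing follows the last task
    return cpu_tasks


# 100 MB of gradients over a 10 Gb/s (1.25e9 bytes/s) link.
push_time = communication_duration(100e6, "push", 1.25e9)
ring_time = communication_duration(100e6, "all-reduce", 1.25e9, num_workers=4)

cpu_tasks = attach_gaps([
    {"thread": 1, "start": 0, "duration": 10},
    {"thread": 1, "start": 15, "duration": 5},
    {"thread": 2, "start": 3, "duration": 4},
])
```

Because the gap is stored on the task that precedes it, the simulator can replay untraced CPU time without creating extra tasks for it.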
4.2.2 Dependency
Based on our discussion in Section 3, we identify the follow-
ing five types of dependencies for accurate simulations.
Sequential order of CPU tasks in the same thread. CPU
tasks in the same thread are serialized. The order that CPU
tasks are executed in is determined by the framework and does
not change in two separate executions. We add a dependency
between each two consecutive CPU tasks in the same thread.
Sequential order of GPU tasks in the same CUDA
stream. GPU kernels belonging to the same CUDA stream
are executed sequentially. Similar to CPU tasks, the order
of GPU tasks in the same stream does not change between
executions. Hence, two consecutive GPU tasks in the same
CUDA stream have a dependency between them.
Correlation from CUDA APIs to GPU kernels. Each
GPU kernel or CUDA memory copy has a correspond-
ing CPU-sided CUDA API (cudaLaunch, cudaMemcpy, or
cudaMemcpyAsync) that triggers the GPU task. CUPTI pro-
vides a correlation ID for every CUDA API and GPU kernel.
A GPU kernel is dependent on a CUDA API if they share the
same correlation ID.
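The first three dependency types can be derived mechanically from a trace. The sketch below assumes a hypothetical `Task` record in which `lane` is the thread ID for CPU tasks and the stream ID for GPU tasks; it does not use the CUPTI API itself, only the correlation IDs that CUPTI is said to provide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    name: str
    kind: str   # "cpu" or "gpu"
    lane: int   # thread id for CPU tasks, stream id for GPU tasks
    start: float
    correlation_id: Optional[int] = None


def build_edges(tasks):
    """Return a set of (predecessor, successor) name pairs."""
    edges = set()
    lanes = defaultdict(list)
    for t in tasks:
        lanes[(t.kind, t.lane)].append(t)
    # Types 1 and 2: consecutive tasks in the same thread or stream.
    for lane_tasks in lanes.values():
        lane_tasks.sort(key=lambda t: t.start)
        for a, b in zip(lane_tasks, lane_tasks[1:]):
            edges.add((a.name, b.name))
    # Type 3: a CUDA API call precedes the GPU task sharing its correlation ID.
    api_by_id = {t.correlation_id: t for t in tasks
                 if t.kind == "cpu" and t.correlation_id is not None}
    for t in tasks:
        if t.kind == "gpu" and t.correlation_id in api_by_id:
            edges.add((api_by_id[t.correlation_id].name, t.name))
    return edges


tasks = [
    Task("launch_k0", "cpu", 0, 0.0, correlation_id=1),
    Task("launch_k1", "cpu", 0, 10.0, correlation_id=2),
    Task("k0", "gpu", 7, 5.0, correlation_id=1),
    Task("k1", "gpu", 7, 30.0, correlation_id=2),
]
edges = build_edges(tasks)
```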
CUDA Synchronization. A CUDA synchronization API
(e.g., cudaDeviceSynchronize) is invoked on CPU, and
returns after GPU kernels (or CUDA memory copies) that are
launched before this synchronization complete. A CUDA
synchronization therefore generates a dependency from a GPU
task to a CPU task. Similar to CUDA synchronizations, even
though a cudaMemcpyAsyncDtoH call returns before a
memory copy completes, we found it still blocks the CPU until all
previous GPU kernels on the same stream are completed.
Figure 3: The mapping of GPU kernels to a layer. CUPTI
provides correlations between CUDA launches and GPU kernels.
(Diagram labels: CPU timeline with Launch K0, Launch K1,
Launch K2; GPU timeline with Kernel K0, Kernel K1, Kernel K2;
CL: duration of layer L measured on CPU; GPU kernels mapped
to layer L.)
Communication. Mainstream frameworks including Py-
Torch and MXNet implement the wait-free backpropagation
strategy [80] to schedule gradient communication. Here, a
communication primitive is launched as soon as the weight
gradients are ready, thus overlapping communication with
the backward phases of subsequent layers. Hence, we need to
know the runtime of DNN layers (not just kernels) to deter-
mine which tasks trigger communication.
4.3 Mapping Tasks to Layers
The task-to-layer mapping enables Daydream to construct
the dependency graph for distributed training, and provides
necessary domain knowledge for Daydream to model DNN
optimizations. Figure 3 shows how Daydream determines
which tasks belong to a certain layer. Let L be the forward
phase of a DNN layer. Daydream collects the CPU and GPU
runtime information using CUPTI [49], as well as timestamps
before and after the forward, backward, and weight update
phases for each layer. The start and end timestamps of L
will determine the CPU runtime of L (denoted by CL). To
determine the GPU runtime of L, Daydream gathers all CUDA
launch calls invoked during CL. With CUPTI providing the
correlations between CUDA launch calls and corresponding
GPU kernels, Daydream can identify all the GPU kernels
launched during CL, and map these kernels to L. This process
can also be applied to the backward or weight update phases
of any layers, and can be further generalized to any code
region of interest in the framework or user-level programs.
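A minimal sketch of this mapping, assuming the trace has already been reduced to three hypothetical inputs: per-layer CPU time windows, launch calls with their correlation IDs, and GPU kernels keyed by correlation ID. None of these names come from Daydream or CUPTI.

```python
def map_kernels_to_layers(layer_windows, launches, kernels):
    """Map GPU kernel names to layer phases.

    layer_windows: {layer: (cpu_start, cpu_end)} from framework timestamps.
    launches: list of (correlation_id, cpu_timestamp) for CUDA launch calls.
    kernels: {correlation_id: kernel_name} for GPU kernels.
    """
    mapping = {layer: [] for layer in layer_windows}
    for corr_id, ts in launches:
        for layer, (start, end) in layer_windows.items():
            # A launch belongs to layer L if it was issued during C_L.
            if start <= ts < end:
                if corr_id in kernels:
                    mapping[layer].append(kernels[corr_id])
                break
    return mapping


layer_windows = {"conv1.forward": (0.0, 50.0), "relu1.forward": (50.0, 80.0)}
launches = [(1, 10.0), (2, 30.0), (3, 60.0)]
kernels = {1: "K0", 2: "K1", 3: "K2"}
layer_of = map_kernels_to_layers(layer_windows, launches, kernels)
```

Only CPU-side launch timestamps are compared against the layer window; the GPU-side execution time of a kernel never enters the decision, which is why no synchronization is needed.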
4.4 Graph Transformation
What-if analysis by transforming the graph and simulating
the execution requires input about the optimizations from
programmers. Daydream provides a set of primitives for pro-
grammers to model DNN optimizations by modifying the
graph. Like most what-if analysis in non-ML contexts, modeling
DNN optimizations requires potentially shrinking or scaling
the duration of tasks (the shrink/scale primitives). We
carefully study common DNN optimization techniques and
identify the following primitives (besides the shrink/scale
primitives), which are sufficient for programmers to describe
those optimizations.
Figure 4: Insert/Remove a (a) CPU task; (b) GPU task.
Insert/Remove a task. Inserting a task into an execution
thread simply involves appending a node to a linked list.
Figure 4 shows how this process works. When inserting a
GPU task, we need to insert the corresponding CPU tasks
that launch it. Which CPU tasks to insert and their duration
depend on the framework implementation, and can be inferred
based on collected traces.
Select. This operation allows users to select tasks of interest
for further operations. One potentially useful selection crite-
rion is select-by-layer, as many optimizations are depicted
based on DNN layers. Another potentially useful criterion is
to select by keywords in task names, based on knowledge of
the software library (e.g., cuDNN [50]). For example, kernels
with keywords such as elementwise or PointwiseApply
in the names are element-wise arithmetic operations. These
kernels are typically not compute-bound, and could be much
shorter than their corresponding CUDA launch calls. Simi-
larly, kernels with sgemm string in names are compute-bound
matrix-multiplications.
Schedule. The schedule function picks one task from a
set of frontier tasks that are ready to execute (line 9 in Algo-
rithm 1). By default, it picks the task with the earliest start.
Programmers can override this function and implement any
custom scheduling policy, which is useful to model
optimizations that increase computation-communication overlap.
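To illustrate how an overridable schedule function changes the predicted runtime, here is a simplified stand-in for the simulation loop. Algorithm 1 is not reproduced in this section, so the `Node` record, the `simulate` function, and the toy task graph are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    lane: str        # execution thread, e.g. "gpu" or "net"
    duration: float
    deps: list = field(default_factory=list)  # names of predecessor tasks


def earliest_start(frontier, ready_time):
    """Default policy: pick the frontier task that can start first."""
    return min(frontier, key=lambda n: ready_time[n.name])


def simulate(nodes, schedule=earliest_start):
    """Return (makespan, execution order) for a dependency graph."""
    by_name = {n.name: n for n in nodes}
    finish, lane_free, order = {}, {}, []
    pending = set(by_name)
    while pending:
        frontier = [by_name[x] for x in sorted(pending)
                    if all(d in finish for d in by_name[x].deps)]
        if not frontier:
            raise ValueError("dependency cycle detected")
        ready = {n.name: max([lane_free.get(n.lane, 0.0)] +
                             [finish[d] for d in n.deps]) for n in frontier}
        task = schedule(frontier, ready)
        finish[task.name] = ready[task.name] + task.duration
        lane_free[task.lane] = finish[task.name]
        order.append(task.name)
        pending.remove(task.name)
    return max(finish.values()), order


# Hypothetical priorities: layer 1's gradients are needed first by the next forward pass.
PRIORITY = {"comm1": 0, "comm2": 1}


def prefer_early_layers(frontier, ready_time):
    return min(frontier,
               key=lambda n: (PRIORITY.get(n.name, 0), ready_time[n.name]))


nodes = [
    Node("bwd2", "gpu", 10),
    Node("bwd1", "gpu", 10, ["bwd2"]),
    Node("comm2", "net", 30, ["bwd2"]),
    Node("comm1", "net", 5, ["bwd1"]),
    Node("fwd1", "gpu", 30, ["bwd1", "comm1"]),
    Node("fwd2", "gpu", 10, ["fwd1", "comm2"]),
]
default_time, default_order = simulate(nodes)
custom_time, custom_order = simulate(nodes, schedule=prefer_early_layers)
```

In this toy graph the default policy sends `comm2` first and delays the next forward pass, while the custom policy lets `fwd1` overlap with `comm2`, which shortens the simulated iteration.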
5 Modeling Optimizations
To demonstrate that Daydream is able to estimate the perfor-
mance of the most common optimizations in DNN training,
we select ten techniques from Table 1 with different opti-
mization goals. We show that we can easily model these
optimizations using the primitives Daydream provides (see footnote 2).
5.1 Optimizations for Evaluation
We select the following five important optimizations to evalu-
ate Daydream’s prediction accuracy. We use implementations
from the authors of these optimizations in cases where they
were not readily available.
Automatic Mixed Precision (AMP). We aim to predict
the efficacy of the AMP optimization [42], implemented
using NVidia’s Apex package [51]. We expect that AMP will
improve memory-bound GPU kernels by 2× because the
number of transferred bits is halved. With Tensor Cores in the
Volta and Turing architectures, AMP empirically yields up to
3× speedup on the most compute-intensive workloads [57].
To predict AMP performance, we simply select all the
compute-intensive (e.g., sgemm, conv) kernels and
memory-bound (e.g., elementwise, batchnorm, RELU) kernels, and
shrink their duration by 3× and 2× respectively.
Footnote 2: We show pseudo code for AMP in this section. Refer to
the appendix for the pseudo code of all examples shown in Section 5.
Algorithm 2: What_If_AMP
Input: Dependency graph G(V, E)
Output: A modified graph G(V, E) to model AMP
GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
foreach u ∈ GPUTasks do
    if "sgemm" in u.Name or "scudnn" in u.Name then
        u.duration ← u.duration / 3
    else
        u.duration ← u.duration / 2
    end
end
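Algorithm 2 translates almost line for line into Python. The sketch below assumes a hypothetical `Task` record with `name`, `on_gpu`, and `duration` fields; the default speedup factors are the 3× and 2× estimates stated in the text, exposed as parameters so they can be varied.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    on_gpu: bool
    duration: float  # microseconds


def what_if_amp(tasks, compute_speedup=3.0, memory_speedup=2.0):
    """Shrink GPU task durations in place, mirroring Algorithm 2."""
    for task in tasks:
        if not task.on_gpu:
            continue  # CPU tasks (e.g. CUDA launch calls) are left unchanged
        if "sgemm" in task.name or "scudnn" in task.name:
            task.duration /= compute_speedup
        else:
            task.duration /= memory_speedup
    return tasks


amp_tasks = what_if_amp([
    Task("volta_sgemm_128x64_nn", on_gpu=True, duration=300.0),
    Task("elementwise_kernel", on_gpu=True, duration=40.0),
    Task("cudaLaunch", on_gpu=False, duration=10.0),
])
```

The kernel names in the example are illustrative strings, not entries from a real trace.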
FusedAdam Optimizer. We use the FusedAdam opti-
mizer [?] implemented in NVidia’s Apex package [51] as
an example for the kernel fusion optimization. This optimizer
fuses all kernels in one weight update phase into one uni-
fied kernel. It is applicable to the models that use the Adam
optimizer (e.g., GNMT, BERT). Daydream uses the kernel-to-
layer mapping to identify the CPU/GPU tasks that belong to a
weight update phase. We remove all these tasks, then insert
a new GPU task whose duration is roughly estimated by the
sum of all removed compute-intensive kernels.
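The remove-then-insert transformation can be sketched as below. The `Task` fields and the `launch_duration` parameter are hypothetical, and for simplicity the fused kernel's duration is the sum of all removed GPU kernels, whereas the text sums only the compute-intensive ones.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    kind: str    # "cpu" or "gpu"
    phase: str   # e.g. "forward", "backward", "weight_update"
    duration: float


def what_if_fused_adam(tasks, launch_duration):
    """Replace all weight-update tasks with one launch and one fused kernel."""
    kept = [t for t in tasks if t.phase != "weight_update"]
    removed_gpu = [t for t in tasks
                   if t.phase == "weight_update" and t.kind == "gpu"]
    if not removed_gpu:
        return kept
    # Assumption: the fused kernel takes as long as the removed kernels combined.
    fused = Task("fused_adam_kernel", "gpu", "weight_update",
                 sum(t.duration for t in removed_gpu))
    launch = Task("cudaLaunch(fused_adam_kernel)", "cpu", "weight_update",
                  launch_duration)
    return kept + [launch, fused]


before = [
    Task("conv_bwd", "gpu", "backward", 50.0),
    Task("cudaLaunch(mul)", "cpu", "weight_update", 8.0),
    Task("mul", "gpu", "weight_update", 2.0),
    Task("cudaLaunch(add)", "cpu", "weight_update", 8.0),
    Task("add", "gpu", "weight_update", 3.0),
]
after = what_if_fused_adam(before, launch_duration=8.0)
```

In this example the CPU launch time in the weight update phase drops from 16 to 8 while the GPU time stays at 5, matching the intuition that fusion mainly removes launch overhead.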
Reconstructing Batchnorm. Recently Jung et al. [35] pro-
posed a technique that optimizes non-convolutional layers in
state-of-the-art CNNs. It first splits each batch normalization
layer into two sub-layers, then fuses the first sub-layer with
the previous convolutional layer, and the second sub-layer
with the following activation and convolutional layers. We
remove the affected activation kernels when estimating per-
formance, since they are memory-bound kernels now fused
with compute-intensive convolutional kernels. For the batch
normalization layers, we estimate that the GPU kernels will
be improved by 2× since this optimization halves the amount
of input data that these layers load from GPU memory.
Distributed Training. Using Daydream we can accurately
predict distributed training performance with the profile based
on the single-GPU environment. We evaluate Daydream’s
prediction based on PyTorch, which uses collective communication
primitives from the NCCL library [46]. PyTorch groups
gradients from multiple layers into buckets before transfer-
ring them. Hence, to predict distributed training performance,
we need to insert one allReduce task for every bucket. The
dependencies of the inserted tasks are determined based on
the layer-to-bucket mapping (which requires additional instru-
mentation to the PyTorch framework).
Priority-Based Parameter Propagation (P3). P3 [31] is
a technique that optimizes communication overhead by slic-
ing and prioritizing. We evaluate Daydream’s prediction of
P3 based on MXNet, which uses the parameter-server mech-
anism [39]. In order to model parameter slicing, we insert
multiple push and pull tasks between the backward and
the forward GPU tasks for each layer. The duration of each
push/pull task is calculated from the slice size and the network
bandwidth. To model the priority scheduling, we override the
schedule function with a priority queue.
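The duration calculation for the inserted push/pull tasks can be sketched as follows. The function name, the task naming scheme, and the example sizes are hypothetical, and the model assumes a transfer takes exactly size divided by bandwidth, ignoring latency and protocol overhead.

```python
import math


def slice_transfer_tasks(layer, param_bytes, slice_bytes, bandwidth_gbps):
    """Return (name, duration_ms) push and pull tasks for one layer's parameters."""
    bytes_per_ms = bandwidth_gbps * 1e9 / 8 / 1e3
    n_slices = math.ceil(param_bytes / slice_bytes)
    tasks = []
    for i in range(n_slices):
        # The last slice may be smaller than the others.
        size = min(slice_bytes, param_bytes - i * slice_bytes)
        duration_ms = size / bytes_per_ms
        tasks.append((f"push:{layer}:{i}", duration_ms))
        tasks.append((f"pull:{layer}:{i}", duration_ms))
    return tasks


p3_tasks = slice_transfer_tasks("fc6", param_bytes=10_000_000,
                                slice_bytes=4_000_000, bandwidth_gbps=8)
```

With 8 Gbps (one million bytes per millisecond), a 10 MB layer split into 4 MB slices yields three push/pull pairs lasting 4 ms, 4 ms, and 2 ms.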
5.2 Modeling Additional Optimizations
In addition to the above optimizations, we show that Day-
dream is capable of modeling an additional set of diverse
DNN optimizations.
BlueConnect. BlueConnect [17] optimizes communica-
tion by decomposing the allReduce primitives into a series
of reduce-scatter and all-gather primitives. These primitives
run concurrently as they use parallel communication chan-
nels. To predict the performance of BlueConnect, instead of
inserting regular allReduce or push/pull tasks, we need to
insert reduce-scatter and all-gather tasks, and assign them
to corresponding network channels (the duration can be esti-
mated according to formulas shown in [56]).
MetaFlow. MetaFlow [33] is a layer-fusion technique to
optimize DNN training by fusing DNN layers to simplify the
DNN topology. We select the GPU kernels of substituted
layers, remove them, and insert GPU kernels of new layers
to predict the performance of MetaFlow in Daydream. The
new layers are mostly existing layers with different dimen-
sions; their GPU kernel durations can be inferred by profiling.
vDNN. Virtualized DNN [66] reduces GPU memory con-
sumption by temporarily offloading intermediate data from
GPU memory to CPU memory. The offloaded data needs
to be prefetched back to GPU to perform execution, which
causes potential performance overhead due to PCIe traffic or
late prefetching. To predict the performance overhead using
Daydream, we only need to insert additional CUDA mem-
ory copies, and override the schedule function to implement
a custom prefetching policy.
Gist. Gist [30] reduces GPU memory consumption by stor-
ing encoded intermediate data and decoding before the data
is used. The encoding and decoding introduces performance
overhead. We insert extra encoding and decoding GPU
kernels (along with cudaLaunchKernel calls in CPU) to es-
timate the performance overhead in Daydream. The duration
of the inserted encoding/decoding kernels can be estimated
using existing element-wise kernels.
Deep Gradient Compression (DGC). DGC [40] is a tech-
nique that reduces communication overhead by compressing
the gradients. To estimate performance, we: (i) scale the
duration of communication; (ii) insert the GPU tasks of
compression and decompression. The duration of inserted
GPU tasks can be estimated according to the compression
rate and duration of existing element-wise GPU kernels.
Application            Model               Dataset
Image Classification   VGG19 [69]          ImageNet [20]
                       DenseNet-121 [29]
                       ResNet-50 [28]
Machine Translation    GNMT [76]           WMT16 [5]
Language Modeling      BERT [21]           SQuAD [64]
Table 2: The models and datasets we use in this paper.
Figure 5: AMP – comparing baseline (FP32), ground truth
with mixed precision, and predictions by Daydream. (Bar chart
over BERT_Base, BERT_Large, Seq2Seq, and ResNet-50; axes:
iteration time (ms) and prediction error; series: Baseline,
Ground Truth, Prediction, Error.)
6 Evaluation
6.1 Methodology
We implement Daydream based on three mainstream DNN
frameworks: PyTorch [60], MXNet [13], and Caffe [1]. We
add CUPTI [49] support to each framework to obtain traces
of CUDA APIs and GPU kernels. We also add instrumen-
tation to the frameworks to acquire layer-wise timestamps
for the kernel-to-layer mapping process, and communication
information such as the size of each allReduce call and their
dependencies with other layer-wise computation.
Infrastructure. We evaluate Daydream’s runtime predic-
tion on a cluster of four machines. Each machine contains one
AMD EPYC 7601 16-core processor [10], and four 2080Ti
GPUs [54] with 11GB GDDR6 memory each, connected
through PCIe 3.0 [7]. Our experiments are based on Ubuntu
16.04, CUDA v10.0 [52], cuDNN v7.4.1 [53], and NCCL
v2.4.2 [46]. Our software implementation is based on Py-
Torch v1.0, MXNet v1.1, and Caffe v1.0.
Models. Table 2 shows the DNN models and datasets we
use to evaluate Daydream. We select five DNN models from
three different applications, covering a diverse set of DNN
models. For the BERT model, we evaluate both "base" and
"large" versions. The difference between these versions is that
the "base" version contains 12 "Transformer blocks" (the main
layer type in BERT) whereas the "large" version contains 24.
6.2 Automatic Mixed Precision (AMP)
We evaluate Daydream’s prediction accuracy of AMP [42],
which is implemented in NVidia’s Apex package [51] based
on the PyTorch framework. Figure 5 shows the performance
of using AMP and the corresponding performance prediction
given by Daydream. Our predictions have errors below 13%
for all the models we evaluate.
Figure 6: Runtime breakdown of the baseline (FP32) and
mixed precision (FP16). (Bars of iteration time (ms) for
ResNet-50, GNMT, BERT_BASE, and BERT_LARGE, each in FP32
and FP16; components: CPU+GPU, CPU-only, GPU-only.)
Figure 7: FusedAdam - comparing baseline (FP32), ground
truth with FusedAdam, and predictions by Daydream. (Bar chart
over BERT_Base, BERT_Large, and Seq2Seq; axes: iteration
time (ms) and prediction error; series: Baseline, Ground
Truth, Prediction, Error.)
Our experiments show that using AMP brings speedups
generally less than 2× – much less than the theoretical boost
of using AMP for individual kernels (e.g., 3×). To understand
how AMP improves performance, we break down the overall
runtime into the following three components:
CPU-only runtime. This component refers to the runtime
when the CPU is busy, but the GPU is not executing any ker-
nels. It is straightforward to calculate this runtime by simply
subtracting all GPU kernel runtime from the total runtime.
GPU-only runtime. This component refers to the runtime
when the CPU is waiting for the GPU kernels to complete.
It includes not only the duration of CUDA synchronization
APIs, but also the cudaMemcpyAsync calls of all the device-
to-host CUDA memory copies.
CPU+GPU parallel runtime. This component refers to
the runtime when both CPU and GPU are busy. We calculate
this part of runtime by deducting the CPU-only and GPU-only
parts from the total runtime.
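The three definitions above amount to simple arithmetic over four aggregate measurements. The sketch below uses hypothetical parameter names and made-up example numbers (in milliseconds); it only restates the definitions and is not taken from Daydream's code.

```python
def runtime_breakdown(total, gpu_kernel_time, sync_time, d2h_memcpy_async_time):
    """Split an iteration's runtime into the three components defined above."""
    # CPU-only: total runtime minus all GPU kernel runtime.
    cpu_only = total - gpu_kernel_time
    # GPU-only: time the CPU spends blocked in synchronization APIs and
    # in cudaMemcpyAsync calls for device-to-host copies.
    gpu_only = sync_time + d2h_memcpy_async_time
    # CPU+GPU parallel: whatever remains.
    parallel = total - cpu_only - gpu_only
    return {"cpu_only": cpu_only, "gpu_only": gpu_only,
            "cpu_gpu_parallel": parallel}


breakdown = runtime_breakdown(total=400.0, gpu_kernel_time=300.0,
                              sync_time=120.0, d2h_memcpy_async_time=30.0)
```

Substituting the first definition into the third shows that the parallel component equals the GPU kernel time minus the GPU-only time, so shortening GPU kernels shrinks the GPU-only and parallel parts but leaves the CPU-side work untouched.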
Figure 6 shows the runtime breakdown of the models we
evaluated. CPU runtime generally becomes the new perfor-
mance bottleneck in the models that incur limited speedups
(e.g., BERT_LARGE). When applying AMP, the CPU bottleneck
increases, because the GPU runtime becomes shorter
and part of the CPU+GPU parallel runtime is shifted to the
CPU-only runtime. The overall runtime improvement comes
mostly from the reduction of GPU-only runtime while CPU
runtime barely changes. This demonstrates the necessity of
the kernel-level abstraction when predicting performance.
6.3 FusedAdam Optimizer
We apply the FusedAdam optimization to the BERT and
GNMT models as they use the Adam optimizer. Figure 7
shows the performance of using the FusedAdam optimizer.
Our predictions are within 13% of the ground truth runtime.
Figure 8: The error between Daydream’s runtime predictions
and the baseline with synchronization before each allReduce
under various system configurations. Panels: (a) runtime
predictions for ResNet-50; (b) GNMT; (c) BERT_BASE;
(d) BERT_LARGE. Each panel plots iteration time (ms) and
prediction error against the system configuration (# of
machines x # of GPUs per machine, bandwidth), from 1x1 to
4x2 at 10, 20, and 40 Gbps; series: Ground Truth,
Prediction, Error.
There are two reasons why the FusedAdam optimizer
substantially improves the performance of BERT models.
First, unlike most DNN training workloads, the weight update
phase is a significant proportion of a BERT model’s
iteration runtime (around 30% for BERT_BASE and 45% for
BERT_LARGE). Second, the weight update phase consists of
a large number of element-wise GPU kernels (2633 for BERT_BASE,
5164 for BERT_LARGE). Thus, the CUDA launch calls on the
CPU become the main bottleneck. The FusedAdam optimizer
eliminates almost all CPU kernel launch overhead in
the weight update phase by fusing all GPU kernels into one
single GPU kernel. Compared to BERT models, the GNMT
model spends less than 10% of its iteration time on the weight
update phase, which explains its lower speedup.
6.4 Reconstructing Batchnorm
We evaluate our performance prediction for the optimization
of reconstructing batch normalization [35] based on the Caffe
implementation of DenseNet-121 [29]. Using Daydream, we
predict that reconstructing batchnorm will yield a moderate
performance improvement of 12.7% compared to the baseline.
This suggests that reconstructing batchnorm in our configuration
is less promising than the paper claims (17.5% speedup).
We verify this conclusion by testing the ground truth
implementation of reconstructing batchnorm, and find that this
optimization yields an even lower speedup of 7%.
Figure 9: Comparison of all individual reduction runtimes
in one training iteration of GNMT. Baseline: runtime measured
in regular training; Sync: runtime measured with an
additional CUDA synchronization before each reduction;
Optimal: runtime measured when executing exclusively;
Theoretical: runtime calculated using the formula [56].
(Y-axis: reduction time (ms); x-axis: all reduction calls in
one GNMT iteration.)
We notice that there are two main reasons for the difference
between our prediction and the ground truth. First, the ground
truth uses a completely new implementation of the batchnorm
layers, and it is hard to precisely predict the runtime of newly
implemented kernels. Second, the ground truth implementa-
tion introduces new CUDA memory copies and allocations,
which add performance overhead. Obtaining a very precise
estimate would require us to understand not just the high-level
idea from the paper, but also the detailed implementation of
the user-level programs and the Caffe framework.
6.5 Distributed Training
Next we evaluate distributed training using PyTorch with the
NCCL [46] library. Figure 8 shows the comparisons between
runtimes predicted by Daydream and the measured ground
truth runtimes, for each DNN model under different system
configurations. We evaluate the prediction accuracy for Ether-
net and InfiniBand connecting multi-machine systems under
different network bandwidths (10, 20, 40 Gbps). In most of
the configurations, Daydream predicts distributed runtime
with at most 10% prediction error, with a few exceptions for
the 20Gbps and 40Gbps configurations.
The prediction errors of the overall iteration times are
mainly due to inaccurate estimates of individual NCCL prim-
itives. Figure 9 shows the comparisons of NCCL allReduce
calls between the ground truths and predictions. The ground
truths are on average 34% higher than the theoretical values.
An NCCL primitive is both a communication primitive and
a GPU kernel, suggesting that it could be bottlenecked by two
types of hardware resources: (i) the network bandwidth, and
(ii) GPU resources (e.g., memory bandwidth, streaming
multiprocessors). Figure 9 shows that the predicted values are very
close to the runtimes measured when running NCCL primitives
exclusively. This suggests that the ground-truth runs are slower
because the NCCL primitives compete for GPU resources with other
GPU kernels. Based on this insight, we try to reduce this interference
by adding CUDA synchronizations before invoking NCCL
primitives. As shown in Figure 9, adding synchronizations
improves the NCCL primitives by 22.8% on average when
compared to the baseline.
Figure 10: Daydream’s prediction for how the P3 optimization
will help under different network bandwidths. Panels:
(a) ResNet-50; (b) VGG-19. (Axes: iteration time (ms) versus
network bandwidth (Gbps); series: Baseline, Ground Truth,
Prediction.)
We also verify the impact on the overall iteration time when
adding synchronizations before NCCL primitives. We run the
experiments on all the configurations shown in Figure 8. We
find that this simple approach does not lead to performance
degradation in any configuration. Instead, it could bring an
improvement of up to 22%.
6.6 Priority-Based Parameter Propagation
We evaluate Daydream’s prediction accuracy of applying
Priority-Based Parameter Propagation (P3) to VGG-19 and
ResNet-50. To reproduce the performance speedups of P3,
we use a cluster of four machines with one P4000 GPU per
machine (which is consistent with the evaluation setup of the
P3 paper [31]). We use MXNet v1.1, and have one worker
process and one parameter server process on each machine.
Figure 10 shows the iteration time of the baseline, ground
truth, and prediction using Daydream under different band-
widths. Our prediction faithfully reflects the trend of P3
speedups when the network bandwidth increases. The predic-
tion error is at most 16.2% among all the configurations we
tested, and lower in most of the configurations.
We overestimate the speedup of P3, especially when train-
ing VGG-19 with a 15 or 20 Gbps network bandwidth. The
reason is similar to our previous insight about NCCL prim-
itives: when bandwidth is higher, a communication task is
increasingly bottlenecked by non-network resources. In the
case of MXNet, this overhead could be caused by the server
processes, or the control flow of the worker processes.
7 Discussion
In this section, we discuss the adaptability, potential exten-
sions, and some limitations of Daydream.
7.1 Why Not Simply Run the Optimizations?
The main problem many ML developers face is that not all
optimizations are readily available on all platforms. In fact,
we are only able to evaluate the prediction accuracy of op-
timizations with the implementations already available (see
Table 1); for the remaining ones, we highlight the flexibility
of Daydream by showing that they can be represented suc-
cinctly. Most newly proposed optimizations do not have open-
source implementations on all DNN frameworks available
right away; it would be unreasonable to expect researchers to
open-source their implementations and port their optimiza-
tions on all platforms. Therefore, analyzing if these optimiza-
tions can help in a deployment setting, using Daydream, can
still precede the programming effort to port the optimiza-
tions. Furthermore, Daydream’s profiling can be performed
just once, and using that profile on a given platform, one can
answer questions for many different optimizations.
7.2 Adaptability of Daydream
Daydream requires support from hardware profilers. The cur-
rent implementation of Daydream utilizes GPU-based pro-
filers, and it relies on CUPTI to provide: (i) CPU and GPU
traces and (ii) information about which CPU call triggered the
launch of a specific GPU kernel. Adapting our design to other
architectures (e.g., TPUs), would require hardware vendor
profilers to provide similar traces for this new hardware.
Daydream can also be easily adapted to other ML frameworks
(e.g., MXNet and TensorFlow). We built Daydream
based on PyTorch, and then post-process the dumped traces to
make predictions. The post-processing scripts are framework-
independent. To add framework instrumentation, we need to:
(i) add CUPTI (or similar tool) support, (ii) insert per-layer
timestamps, and (iii) gather the gradient-to-bucket mappings
for injecting the communication primitives to the dependency
graph (required for PyTorch). Such instrumentation is rela-
tively light-weight and can be easily adapted to other main-
stream frameworks such as TensorFlow [4] and MXNet [13].
7.3 Training Accuracy Prediction
In addition to improving iteration time, some optimizations
may also affect training accuracy (e.g., AMP [42], DGC [40]);
predicting the impact of optimizations on accuracy is currently
outside of Daydream’s scope. We leave this interesting and
challenging problem for future work.
7.4 Kernel Runtime Prediction
Estimating the effect of optimizations that alter existing GPU
kernels or introduce new ones requires predicting the runtime
of new/changed GPU kernels. When estimating performance
of AMP, our estimate of the speedup for kernels that run in half
precision was based on findings/observations from NVIDIA [42].
Applying this generalization to all kernels (in contrast to
identifying how each kernel in isolation is affected by AMP) still
leads to the low prediction errors we observe in Figure 5.
However, optimizations such as DGC [40], Reconstructing
Batchnorm [35], and Gist [30] introduce newly-implemented
kernels to the runtime. Accurately predicting runtime for new
kernels is a challenging problem. Daydream estimates the
overall runtime based on existing kernel implementations,
or using guidelines from studies that highlight quantitative
improvements for the proposed kernels. But if the estimated
runtimes for such new kernels are inaccurate, it may lead to
relatively high prediction error (Section 6.4). How much a
kernel’s runtime estimation error contributes to the overall
prediction error depends on the training workload itself.
While Daydream cannot predict individual kernel runtime,
it provides a high-level structure for kernel developers to
estimate the overall performance. Developers can profile their
individual kernels, and then input the profiling results into
Daydream to accurately estimate the overall runtime. This
approach saves the engineering effort of porting the kernel
implementation into the DNN frameworks.
7.5 Concurrent Kernels
Existing GPU profilers such as CUPTI usually serialize GPU
kernel execution, removing all concurrency, making our per-
formance estimation somewhat conservative. Despite this, we
observe that the runtime for models with concurrent execu-
tion (e.g., GNMT) can still be predicted with high accuracy
(§ 6.2). This is because the majority of computation time goes
to fully connected layers (including embedding layers), which
have no concurrent kernels executed in parallel with them. We
leave a complete solution for concurrent kernels, requiring
better support from profiling tools, as a part of future work.
8 Related Work
To help programmers understand the performance of the hard-
ware accelerators and develop highly efficient applications,
hardware vendors provide profiling tools (e.g., NVProf [48],
Nsight [47], vTune [65]) that can reveal low-level perfor-
mance counters (e.g., cache hit rate, memory speed, clock
rate). These tools are usually designed with general applica-
tions in mind, and expose hundreds of low-level performance
counters. The fundamental limitation of all these tools is that
they do not utilize application-specific knowledge.
The new generation of profiling tools feature the
application-aware property, enabling them to deliver domain-
specific (e.g., ML-specific) insights about performance to
programmers. The Cloud TPU Tool [22] is an example of
such a profiling tool. It correlates low-level TPU metrics with
the DNN structure, and shows the performance for each DNN
layer. Similarly, MXNet [13] and PyTorch [60] also have
their own built-in profiling tools. These domain-specific tools
can highlight performance hotspots, but are less efficient in
finding optimization opportunities. In contrast, Daydream is
not only application-aware, but also optimization-aware, en-
abling Daydream to quantitatively estimate the efficacy of
different optimizations without fully implementing them.
Prior works have tried to explore what-if questions in other
contexts by using low-level traces. Curtsinger et al. proposed a
causal profiler (COZ [18]) to identify potentially unknown
optimization opportunities by running performance simulation
with certain functions being virtually sped up. Unlike Daydream,
COZ does not require dependencies among functions
because it does not consider the cases where functions can be
added or deleted (which is the case for many ML optimizations).
Pourghassemi et al. use the idea of COZ to analyze
the performance for web browser applications [62]. For data
analytic frameworks, such as Spark [79], Ousterhout et al. use
dependency analysis to understand the overhead caused by
I/O, network, and stragglers [58,59]. Daydream is designed to
address a more diversified set of what-if questions, and hence
requires more powerful modeling.
Prior works address what-if questions of the form "What
if we can speed up task T by N times (or infinity)?", but they
do not study whether existing optimizations can deliver this
speedup. In the ML context, given an optimization, accurately
predicting the performance of individual tasks in the depen-
dency graph, is still an open problem. It requires additional
knowledge about the kernel implementation and the architecture
design. Currently Daydream cannot automatically
estimate the runtime of new GPU kernels. However, as we
show in Section 6, even with rough estimates of per-kernel
duration based on domain knowledge and reasonable assump-
tions, we can still achieve high overall prediction accuracy.
9 Conclusion
The efficacy of DNN optimizations can vary largely across
different DNN models and deployments. Daydream is a new
profiler to effectively explore the efficacy of a diverse set of
DNN optimizations. Daydream achieves this goal by using
three key ideas: (i) constructing a kernel-level dependency
graph by utilizing vendor-provided profiling tools, while track-
ing dependencies among concurrently executing tasks; (ii)
mapping low-level traces to DNN layers in a synchronization-
free manner; (iii) introducing a set of rules for programmers
to effectively describe and model different optimizations. Our
evaluation shows that using Daydream, we can effectively
model (i.e., predict the runtime of) the most common DNN optimizations,
and accurately identify both optimizations that result in
significant performance improvements as well as those that
provide limited benefits or even slowdowns.
Acknowledgement
Daydream is part of Project Fiddle at MSR. We thank the
MSR Lab LT, especially Ricardo Bianchini and Donald Koss-
mann, for their enthusiastic and unwavering support of Project
Fiddle. We also thank our shepherd, Swaminathan Sundarara-
man, the anonymous ATC reviewers, Brian Hirano, James
Gleeson, Geoffrey Yu, Xiaodan (Serina) Tan, Jorgen Thelin,
Shivaram Venkataraman, and Deepak Narayanan, for their
constructive feedback during the development of this work.
This work was also supported in part by the NSERC Discov-
ery grant, the Canada Foundation for Innovation JELF grant,
the Connaught Fund, and Huawei grants.
References
[1] caffe: Convolutional architecture for fast feature embed-
ding.
[2] Eigen: A C++ linear algebra library. http://eigen.
tuxfamily.org/index.php?title=Main_Page.
[3] PyTorch Documentation. https://pytorch.org/
docs/stable/index.html, 2019.
[4] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene
Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado,
Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghe-
mawat, Ian Goodfellow, Andrew Harp, Geoffrey Irv-
ing, Michael Isard, Yangqing Jia, Rafal Jozefowicz,
Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan-
delion Mané, Rajat Monga, Sherry Moore, Derek Mur-
ray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit
Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vin-
cent Vanhoucke, Vijay Vasudevan, Fernanda Viégas,
Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin
Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow:
Large-Scale Machine Learning on Heterogeneous Sys-
tems, 2015. Software available from tensorflow.org.
[5] ACL. Shared Task: Machine Translation
of News. http://www.statmt.org/wmt16/
translation-task.html, 2016.
[6] Marcos K Aguilera, Jeffrey C Mogul, Janet L Wiener,
Patrick Reynolds, and Athicha Muthitacharoen. Perfor-
mance debugging for distributed systems of black boxes.
In ACM SIGOPS Operating Systems Review, volume 37,
pages 74–89. ACM, 2003.
[7] Jasmin Ajanovic. PCI Express*(PCIe*) 3.0 Accelerator
Features. Intel Corporation, 10, 2008.
[8] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Ex-
tremely large minibatch SGD: training resnet-50 on im-
agenet in 15 minutes. arXiv preprint arXiv:1711.04325,
2017.
[9] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka,
and Milan Vojnovic. QSGD: Communication-efficient
SGD via gradient quantization and encoding. In Ad-
vances in Neural Information Processing Systems, pages
1709–1720, 2017.
[10] AMD. AMD EPYC™ 7601.
https://www.amd.com/en/products/cpu/amd-epyc-7601, 2019.
[11] Arash Ashari, Shirish Tatikonda, Matthias Boehm,
Berthold Reinwald, Keith Campbell, John Keenleyside,
and P Sadayappan. On optimizing machine learning
workloads via kernel fusion. In ACM SIGPLAN Notices,
volume 50, pages 173–182. ACM, 2015.
[12] Léon Bottou. Large-scale machine learning with
stochastic gradient descent. In Proceedings of COMP-
STAT’2010, pages 177–186. Springer, 2010.
[13] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang,
Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang,
and Zheng Zhang. MXNet: A Flexible and Efficient Ma-
chine Learning Library for Heterogeneous Distributed
Systems. CoRR, abs/1512.01274, 2015.
[14] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen
Shen, Eddie Q Yan, Leyuan Wang, Yuwei Hu, Luis Ceze,
Carlos Guestrin, and Arvind Krishnamurthy. TVM:
end-to-end optimization stack for deep learning. arXiv
preprint arXiv:1802.04799, pages 1–15, 2018.
[15] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos
Guestrin. Training deep nets with sublinear memory
cost. arXiv preprint arXiv:1604.06174, 2016.
[16] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch,
Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan
Shelhamer. cuDNN: Efficient primitives for deep learn-
ing. arXiv preprint arXiv:1410.0759, 2014.
[17] Minsik Cho, Ulrich Finkler, Mauricio Serrano, David
Kung, and Hillery Hunter. BlueConnect: Decomposing
all-reduce for deep learning on heterogeneous network
hierarchy. IBM Journal of Research and Development,
63(6):1–1, 2019.
[18] Charlie Curtsinger and Emery D Berger. Coz: finding
code that counts with causal profiling. In Proceedings
of the 25th Symposium on Operating Systems Principles,
pages 184–197. ACM, 2015.
[19] Dipankar Das, Naveen Mellempudi, Dheevatsa Mudi-
gere, Dhiraj Kalamkar, Sasikanth Avancha, Kunal Baner-
jee, Srinivas Sridharan, Karthik Vaidyanathan, Bharat
Kaul, Evangelos Georganas, et al. Mixed precision train-
ing of convolutional neural networks using integer oper-
ations. arXiv preprint arXiv:1802.00930, 2018.
[20] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer
vision and pattern recognition, pages 248–255. Ieee,
2009.
[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv
preprint arXiv:1810.04805, 2018.
[22] Google. Cloud TPU Tools. https://cloud.google.
com/tpu/docs/cloud-tpu-tools, 2018.
[23] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noord-
huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tul-
loch, Yangqing Jia, and Kaiming He. Accurate, large
minibatch sgd: Training imagenet in 1 hour. arXiv
preprint arXiv:1706.02677, 2017.
[24] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan,
and Pritish Narayanan. Deep learning with limited nu-
merical precision. In International Conference on Ma-
chine Learning, pages 1737–1746, 2015.
[25] Song Han, Huizi Mao, and William J. Dally. Deep
compression: Compressing deep neural network with
pruning, trained quantization and huffman coding. In-
ternational Conference on Learning Representations
(ICLR 2016), 2016.
[26] Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and
Roy H Campbell. TicTac: Accelerating distributed deep
learning with communication scheduling. arXiv preprint
arXiv:1803.03288, 2018.
[27] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross
Girshick. Mask r-cnn. In Proceedings of the IEEE in-
ternational conference on computer vision, pages 2961–
2969, 2017.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. Deep residual learning for image recognition.
CoRR, abs/1512.03385, 2015.
[29] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and
Kilian Q Weinberger. Densely connected convolutional
networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4700–
4708, 2017.
[30] Animesh Jain, Amar Phanishayee, Jason Mars, Lingjia
Tang, and Gennady Pekhimenko. Gist: Efficient data
encoding for deep neural network training. In Proceedings
of the 45th Annual International Symposium on
Computer Architecture, ISCA 2018, pages 776–789, 2018.
[31] Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra
Fedorova, and Gennady Pekhimenko. Priority-based pa-
rameter propagation for distributed DNN training. arXiv
preprint arXiv:1905.03960, 2019.
[32] Zhihao Jia, Oded Padon, James Thomas, Todd Warsza-
wski, Matei Zaharia, and Alex Aiken. TASO: optimizing
deep learning computation with automatic generation
of graph substitutions. In Proceedings of the 27th ACM
Symposium on Operating Systems Principles, pages 47–
62. ACM, 2019.
[33] Zhihao Jia, James Thomas, Todd Warszawski, Mingyu
Gao, Matei Zaharia, and Alex Aiken. Optimizing DNN
computation with relaxed graph substitutions. In Proc.
Conference on Systems and Machine Learning, SysML,
volume 19, 2019.
[34] Norman P. Jouppi, Cliff Young, Nishant Patil, David
Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah
Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick
Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark,
Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean,
Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Got-
tipati, William Gulland, Robert Hagmann, C. Richard
Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt,
Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander
Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch,
Naveen Kumar, Steve Lacy, James Laudon, James Law,
Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke,
Alan Lundin, Gordon MacKean, Adriana Maggiore,
Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi
Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie,
Mark Omernick, Narayana Penukonda, Andy Phelps,
Jonathan Ross, Matt Ross, Amir Salek, Emad Samadi-
ani, Chris Severn, Gregory Sizikov, Matthew Snelham,
Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan,
Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle,
Vijay Vasudevan, Richard Walter, Walter Wang, Eric
Wilcox, and Doe Hyun Yoon. In-datacenter performance
analysis of a tensor processing unit. In Proceedings of
the 44th Annual International Symposium on Computer
Architecture, ISCA 2017, pages 1–12, New York, NY,
USA, 2017. ACM.
[35] Wonkyung Jung, Daejin Jung, Sunjung Lee, Wonjong
Rhee, Jung Ho Ahn, et al. Restructuring batch nor-
malization to accelerate CNN training. arXiv preprint
arXiv:1807.01702, 2018.
[36] Soojeong Kim, Gyeong-In Yu, Hojin Park, Sungwoo
Cho, Eunji Jeong, Hyeonmin Ha, Sanha Lee, Joo Seong
Jeong, and Byung-Gon Chun. Parallax: Sparsity-aware
Data Parallel Training of Deep Neural Networks. In
Proceedings of the Fourteenth EuroSys Conference 2019,
page 43. ACM, 2019.
[37] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David
Lugato, and Saman Amarasinghe. The tensor algebra
compiler. Proceedings of the ACM on Programming
Languages, 1(OOPSLA):77, 2017.
[38] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet classification with deep convolutional neural
networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.
[39] Mu Li, David G Andersen, Jun Woo Park, Alexander J
Smola, Amr Ahmed, Vanja Josifovski, James Long, Eu-
gene J Shekita, and Bor-Yiing Su. Scaling distributed
machine learning with the parameter server. In 11th
USENIX Symposium on Operating Systems Design
and Implementation (OSDI 14), pages 583–598, 2014.
[40] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and
William J Dally. Deep gradient compression: Reducing
the communication bandwidth for distributed training.
arXiv preprint arXiv:1712.01887, 2017.
[41] Qu Lu, Wantao Liu, Jizhong Han, and Jinrong Guo.
Multi-stage Gradient Compression: Overcoming the
Communication Bottleneck in Distributed Deep Learn-
ing. In International Conference on Neural Information
Processing, pages 107–119. Springer, 2018.
[42] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gre-
gory Diamos, Erich Elsen, David Garcia, Boris Ginsburg,
Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh,
et al. Mixed precision training. arXiv preprint
arXiv:1710.03740, 2017.
[43] Barton P Miller and Cui-Qing Yang. IPS: An Interac-
tive and Automatic Performance Measurement Tool for
Parallel and Distributed Programs. In ICDCS, pages
482–489, 1987.
[44] MLPerf. MLPerf Training Results v0.6. https://
mlperf.org/training-results-0-6, 2019.
[45] NVIDIA. CUDA implementation of the standard ba-
sic linear algebra subroutines (BLAS). http://docs.
nvidia.com/cuda/cublas/index.html.
[46] NVIDIA. NVIDIA Collective Communications Library
(NCCL). https://developer.nvidia.com/nccl.
[47] NVIDIA. NVIDIA Nsight. https://developer.
nvidia.com/tools-overview.
[48] NVIDIA. NVIDIA Profiler. docs.nvidia.com/cuda/
profiler-users-guide/index.html.
[49] NVIDIA. The CUDA Profiling Tools Interface
(CUPTI). https://docs.nvidia.com/cuda/cupti/
index.html.
[50] NVIDIA. cudnn library developer guide v6.0. 2017.
[51] NVIDIA. A PyTorch Extension: Tools for easy mixed
precision and distributed training in Pytorch. https:
//github.com/NVIDIA/apex, 2018.
[52] NVIDIA. Cuda toolkit documentation v10.0. https:
//docs.nvidia.com/cuda/, 2018.
[53] NVIDIA. cudnn library developer guide v 7.4.1. 2018.
[54] NVIDIA. GEFORCE® RTX 2080 Ti.
https://www.nvidia.com/en-us/geforce/
graphics-cards/rtx-2080-ti, 2018.
[55] NVIDIA. NVIDIA Turing GPU architec-
ture. https://www.nvidia.com/content/dam/
en-zz/Solutions/design-visualization/
technologies/turing-architecture/
NVIDIA-Turing-Architecture-Whitepaper.pdf,
2018.
[56] NVIDIA. Performance reported by NCCL tests.
https://github.com/NVIDIA/nccl-tests/blob/
master/doc/PERFORMANCE.md, 2018.
[57] NVIDIA. Training With Mixed Preci-
sion: Deep Learning SDK Documentation.
https://docs.nvidia.com/deeplearning/sdk/
mixed-precision-training/index.html, 2019.
[58] Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy,
and Scott Shenker. Monotasks: Architecting for per-
formance clarity in data analytics frameworks. In Pro-
ceedings of the 26th Symposium on Operating Systems
Principles, pages 184–200. ACM, 2017.
[59] Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott
Shenker, and Byung-Gon Chun. Making sense of
performance in data analytics frameworks. In 12th USENIX
Symposium on Networked Systems Design and
Implementation (NSDI 15), pages 293–307, 2015.
[60] Adam Paszke, Sam Gross, Soumith Chintala, Gregory
Chanan, Edward Yang, Zachary DeVito, Zeming Lin,
Alban Desmaison, Luca Antiga, and Adam Lerer. Auto-
matic differentiation in PyTorch. 2017.
[61] Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao,
Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo.
A generic communication scheduler for distributed DNN
training acceleration. In Proceedings of the 27th ACM
Symposium on Operating Systems Principles, pages 16–
29. ACM, 2019.
[62] Behnam Pourghassemi, Ardalan Amiri Sani, and Aparna
Chandramowlishwaran. What-If Analysis of Page Load
Time in Web Browsers Using Causal Profiling. Pro-
ceedings of the ACM on Measurement and Analysis of
Computing Systems, 3(2):27, 2019.
[63] Andrew Putnam, Adrian M Caulfield, Eric S Chung,
Derek Chiou, Kypros Constantinides, John Demme,
Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth
Gopal, Jan Gray, et al. A reconfigurable fabric for accel-
erating large-scale datacenter services. ACM SIGARCH
Computer Architecture News, 42(3):13–24, 2014.
[64] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev,
and Percy Liang. Squad: 100,000+ questions for
machine comprehension of text. arXiv preprint
arXiv:1606.05250, 2016.
[65] James Reinders. VTune performance analyzer essentials.
Intel Press, 2005.
[66] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Ar-
slan Zulfiqar, and Stephen W. Keckler. vdnn: Virtualized
deep neural networks for scalable, memory-efficient neu-
ral network design. In The 49th Annual IEEE/ACM Inter-
national Symposium on Microarchitecture, MICRO-49,
pages 18:1–18:13, Piscataway, NJ, USA, 2016. IEEE
Press.
[67] Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren
Etzioni. Green AI. arXiv preprint arXiv:1907.10597,
2019.
[68] Amazon Web Services. AWS Inferentia. https://aws.
amazon.com/machine-learning/inferentia.
[69] Karen Simonyan and Andrew Zisserman. Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.
[70] Nicolas Vasilache, Oleksandr Zinenko, Theodoros
Theodoridis, Priya Goyal, Zachary DeVito, William S
Moses, Sven Verdoolaege, Andrew Adams, and Albert
Cohen. Tensor comprehensions: Framework-agnostic
high-performance machine learning abstractions. arXiv
preprint arXiv:1802.04730, 2018.
[71] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser,
and Illia Polosukhin. Attention is all you need. In Ad-
vances in neural information processing systems, pages
5998–6008, 2017.
[72] Christoph von Praun, Rajesh Bordawekar, and Calin
Cascaval. Modeling optimistic concurrency using quan-
titative dependence analysis. In Proceedings of the 13th
ACM SIGPLAN Symposium on Principles and practice
of parallel programming, pages 185–196. ACM, 2008.
[73] Endong Wang, Qing Zhang, Bo Shen, Guangyong
Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. Intel
math kernel library. In High-Performance Computing
on the Intel® Xeon Phi™, pages 167–188. Springer,
2014.
[74] Jianyu Wang and Gauri Joshi. Adaptive communication
strategies to achieve the best error-runtime trade-off in
local-update SGD. arXiv preprint arXiv:1810.08313,
2018.
[75] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan
Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradi-
ents to reduce communication in distributed deep learn-
ing. In Advances in neural information processing sys-
tems, pages 1509–1519, 2017.
[76] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.
Le, Mohammad Norouzi, Wolfgang Macherey, Maxim
Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff
Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu,
Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku
Kudo, Hideto Kazawa, Keith Stevens, George Kurian,
Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Ja-
son Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado,
Macduff Hughes, and Jeffrey Dean. Google’s neural
machine translation system: Bridging the gap between
human and machine translation. CoRR, abs/1609.08144,
2016.
[77] Jilong Xue, Youshan Miao, Cheng Chen, Ming Wu, Lin-
tao Zhang, and Lidong Zhou. Fast Distributed Deep
Learning over RDMA. In Proceedings of the Fourteenth
EuroSys Conference 2019, page 44. ACM, 2019.
[78] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel,
and Kurt Keutzer. Imagenet training in minutes. In
Proceedings of the 47th International Conference on
Parallel Processing, page 1. ACM, 2018.
[79] Matei Zaharia, Mosharaf Chowdhury, Michael J
Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster
computing with working sets. HotCloud, 10(10-10):95,
2010.
[80] Hao Zhang, Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong
Ho, Xiaodan Liang, Zhiting Hu, Jinliang Wei, Pengtao
Xie, and Eric P Xing. Poseidon: An efficient communi-
cation architecture for distributed deep learning on GPU
clusters. In 2017 USENIX Annual Technical Conference
(USENIX ATC 17), pages 181–193, 2017.
[81] Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew
Pelegris, Anand Jayarajan, Amar Phanishayee, Bianca
Schroeder, and Gennady Pekhimenko. Benchmarking
and analyzing deep neural network training. In 2018
IEEE International Symposium on Workload Character-
ization (IISWC), pages 88–100. IEEE, 2018.
Appendices
A Modeling Optimizations
Due to space limitations, we are not able to include all
details in our main sections. The most important omitted details
concern how to use Daydream's graph-transformation primitives
to model various optimizations. In this appendix we provide
these details.
As shown in Table 1, there is a wide range of DNN
optimizations, which affect the training runtime in various ways.
One such effect is that the duration of tasks in the dependency
graph will scale or shrink. For example, using AMP will shrink the
duration of GPU kernels. Using Daydream, this effect is
easy to model with the help of the Select operator to pick
the tasks of interest.
DNN optimizations might alter the network topology (e.g.,
kernel fusion [11], MetaFlow [33], TASO [32]), introduce new
operators (e.g., Gist [30], vDNN [66], Deep Gradient
Compression [40]), or restructure the communication scheme
(e.g., P3 [31], BlueConnect [17]). These optimizations will
eventually alter the low-level dependency graph, adding or
removing GPU kernels and communication primitives.
Daydream provides Insert/Remove operators for programmers
to model these transformations. Programmers need to locate
where tasks are inserted/removed with the help of the Select
operator. As we will show later, how to locate these tasks varies
across different optimizations, but is generally not complicated.
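To make the role of the Select, Insert, and Remove operators concrete, the following is a minimal Python sketch of a toy dependency graph. It is not Daydream's implementation: the class names, the use of task names as keys, and the choice to reconnect predecessors to successors on removal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Task:
    name: str
    duration: float      # hypothetical unit: microseconds
    on_gpu: bool = True


class Graph:
    """Toy dependency graph: task name -> set of successor names."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Task] = {}
        self.succ: Dict[str, Set[str]] = {}

    def add(self, task: Task) -> None:
        self.tasks[task.name] = task
        self.succ.setdefault(task.name, set())

    def add_dependency(self, src: str, dst: str) -> None:
        self.succ[src].add(dst)

    def select(self, pred: Callable[[Task], bool]) -> List[Task]:
        """Return all tasks for which the predicate holds."""
        return [t for t in self.tasks.values() if pred(t)]

    def insert(self, src: str, new: Task, dst: str) -> None:
        """Place `new` on the edge src -> dst."""
        self.add(new)
        self.succ[src].discard(dst)
        self.succ[src].add(new.name)
        self.succ[new.name].add(dst)

    def remove(self, name: str) -> None:
        """Drop a task and reconnect its predecessors to its successors."""
        outs = self.succ.pop(name)
        preds = [p for p, s in self.succ.items() if name in s]
        for p in preds:
            self.succ[p].discard(name)
            self.succ[p] |= outs
        del self.tasks[name]


g = Graph()
for n in ("conv", "relu", "pool"):
    g.add(Task(n, 10.0))
g.add_dependency("conv", "relu")
g.add_dependency("relu", "pool")
g.remove("relu")                                 # conv -> pool
g.insert("conv", Task("encode", 2.0), "pool")    # conv -> encode -> pool
gpu_tasks = g.select(lambda t: t.on_gpu)
```

In this sketch, the predicate passed to `select` plays the role of the function pointer (e.g., IsOnGPU) used in the pseudo code of this appendix.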
Rescheduling tasks is another transformation that needs to
be supported in Daydream. This transformation does not change
the dependency graph topology or the task duration. Instead,
it manipulates the execution order of the tasks, and aims at
higher parallelism among the tasks. One example of such a
transformation is the prioritization scheme in P3 [31].
Modeling this scheme only involves overriding the Schedule
function in the simulation process (Algorithm 1). Programmers might
need to attach additional attributes to the tasks to implement
a custom scheduling policy. Among the optimizations we show
below, modeling P3 [31] and vDNN [66] requires overriding
the Schedule function.
A.1 Automatic Mixed Precision (AMP)
To model AMP, we shrink the duration of GPU kernels by
2×. If TensorCore is available on the GPU, compute-intensive
kernels such as sgemm are expected to speed up by 3× [57].
We show the pseudo code in Algorithm 3.
Algorithm 3: What_If_AMP
Input: Dependency graph: G(V,E)
Output: An updated graph G(V,E) to model AMP
GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
foreach u ∈ GPUTasks do
    if "sgemm" in u.Name or "scudnn" in u.Name then
        u.duration ← u.duration/3
    else
        u.duration ← u.duration/2
    end
end
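Algorithm 3 can be written as a short Python sketch. The Task record and the kernel names in the example trace are hypothetical; only the substring tests and the 3×/2× factors are taken from the pseudo code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    duration: float      # hypothetical unit: microseconds
    on_gpu: bool = True


def what_if_amp(tasks: List[Task]) -> None:
    """Scale GPU kernel durations in place, mirroring Algorithm 3."""
    for u in tasks:
        if not u.on_gpu:
            continue                      # CPU tasks are left untouched
        if "sgemm" in u.name or "scudnn" in u.name:
            u.duration /= 3               # compute-intensive kernels
        else:
            u.duration /= 2               # all other GPU kernels


# Illustrative trace; the kernel names are made up for this example.
trace = [
    Task("volta_sgemm_128x64_nn", 300.0),
    Task("elementwise_kernel", 40.0),
    Task("cudaLaunchKernel", 5.0, on_gpu=False),
]
what_if_amp(trace)
amp_durations = [t.duration for t in trace]
```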
A.2 Fused Adam Optimizer
The Fused Adam optimizer fuses all the kernels in the weight
update phase. To model this optimizer, we remove all but one
kernel in the weight update phase, and set the duration
of the remaining kernel to the sum of the durations of all
fused kernels. We show the pseudo code in Algorithm 4.
Algorithm 4: What_If_Fused_Adam
Input: Dependency graph: G(V,E)
Output: An updated graph G(V,E) to model the Fused_Adam optimizer
GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
WUTasks ← GPUTasks.Select(funcPtr(IsWeightUpdate))
WUSum ← 0
foreach u ∈ WUTasks do
    WUSum ← WUSum + u.duration
end
First ← True
foreach u ∈ WUTasks do
    if First then
        u.duration ← WUSum
        First ← False
    else
        G.Remove(u)
    end
end
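A minimal Python sketch of Algorithm 4 follows. It operates on a flat list of tasks rather than on a dependency graph, and the `phase` tag is a hypothetical stand-in for the IsWeightUpdate predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    duration: float
    phase: str   # hypothetical tag: "forward", "backward", or "weight_update"


def what_if_fused_adam(tasks: List[Task]) -> List[Task]:
    """Keep one weight-update kernel and give it the summed duration."""
    wu = [t for t in tasks if t.phase == "weight_update"]
    if not wu:
        return list(tasks)
    kept = wu[0]
    kept.duration = sum(t.duration for t in wu)
    # Remove every other weight-update kernel from the trace.
    return [t for t in tasks if t.phase != "weight_update" or t is kept]


trace = [
    Task("bwd", 50.0, "backward"),
    Task("adam_mul", 3.0, "weight_update"),
    Task("adam_add", 4.0, "weight_update"),
    Task("adam_sqrt", 5.0, "weight_update"),
]
fused = what_if_fused_adam(trace)
```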
A.3 Restructuring Batchnorm
Restructuring Batchnorm [35] improves the performance
of training CNNs by splitting batch normalization layers and
fusing memory-intensive kernels with compute-intensive
kernels. We show the pseudo code of using Daydream to model
this optimization in Algorithm 5.
Algorithm 5: What_If_Restructuring_Batchnorm
Input: Dependency graph: G(V,E)
Output: An updated graph G(V,E) to model Restructuring_Batchnorm
GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
foreach u ∈ GPUTasks do
    if u.layer is ReLU then
        G.Remove(u)
    end
    if u.layer is Batchnorm then
        u.duration ← u.duration/2
    end
end
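The same transformation can be sketched in Python over a flat task list. The layer labels and durations below are illustrative assumptions; the layer label on each task stands in for Daydream's kernel-to-layer mapping.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    layer: str           # layer type this kernel was mapped to
    duration: float
    on_gpu: bool = True


def what_if_restructuring_batchnorm(tasks: List[Task]) -> List[Task]:
    """Drop ReLU kernels and halve Batchnorm kernels, as in Algorithm 5."""
    out: List[Task] = []
    for u in tasks:
        if u.on_gpu and u.layer == "ReLU":
            continue                      # modeled as fused away
        if u.on_gpu and u.layer == "Batchnorm":
            u.duration /= 2
        out.append(u)
    return out


trace = [
    Task("conv_kernel", "Conv", 100.0),
    Task("bn_kernel", "Batchnorm", 40.0),
    Task("relu_kernel", "ReLU", 10.0),
]
restructured = what_if_restructuring_batchnorm(trace)
```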
A.4 Distributed Training
We show how to use Daydream to model distributed training
in PyTorch’s decentralized architecture with the NCCL
backend, based on the runtime measured on a single GPU. When
invoking NCCL all-reduce primitives, PyTorch groups small
gradient tensors together to better utilize the bandwidth. Such
grouping information can be collected by instrumenting the
PyTorch framework. In our code example, we use layer_bucket_id
to represent the mapping from layers to communication
buckets. Each bucket corresponds to one communication call. We
show the pseudo code in Algorithm 6.
Algorithm 6: What_If_Distributed_Training
Input: Dependency graph: G(V,E), Gradient Grouping: layer_bucket_id[]
Output: An updated graph G(V,E) to model distributed training
GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
Bucket_Task ← []
WU ← the earliest node in the weight update phase
foreach b ∈ [1..#_of_bucket] do
    AllReduceTask ← newNode("AllReduce", ...)
    AllReduceTask.size ← 0
    G.AddDependencies(AllReduceTask → WU)
    Bucket_Task[b] ← AllReduceTask
end
foreach u ∈ GPUTasks do
    if u is FF_layer then
        bucket_id ← layer_bucket_id[u]
        T ← Bucket_Task[bucket_id]
        T.size ← T.size + u.gradient_size
        G.AddDependencies(u → T)
    end
end
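The bucket bookkeeping of Algorithm 6 can be sketched in Python as follows. This is an illustration under several assumptions: the mapping is keyed by layer name rather than by task, tasks whose layer is absent from the mapping are skipped, and edges are returned as name pairs instead of being added to a graph object.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Task:
    name: str
    layer: str
    gradient_size: int   # gradients attributed to this task (e.g., in MB)


def what_if_distributed(
    tasks: List[Task],
    layer_bucket_id: Dict[str, int],
    weight_update: str = "weight_update",
) -> Tuple[Dict[int, int], List[Tuple[str, str]]]:
    """Return per-bucket all-reduce sizes and the new dependency edges."""
    sizes = {b: 0 for b in sorted(set(layer_bucket_id.values()))}
    # Every all-reduce must finish before the weight update starts.
    edges = [(f"AllReduce_{b}", weight_update) for b in sizes]
    for u in tasks:
        b = layer_bucket_id.get(u.layer)
        if b is None:
            continue     # task contributes no gradients to any bucket
        sizes[b] += u.gradient_size
        edges.append((u.name, f"AllReduce_{b}"))
    return sizes, edges


tasks = [
    Task("k_conv1", "conv1", 4),
    Task("k_conv2", "conv2", 2),
    Task("k_fc", "fc", 8),
]
sizes, edges = what_if_distributed(tasks, {"conv1": 0, "conv2": 0, "fc": 1})
```

The resulting bucket sizes would then be used to estimate the duration of each all-reduce task, which this sketch leaves out.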
A.5 Priority-based Parameter Propagation (P3)
P3 [31] splits each gradient tensor into small slices and
reschedules the communication based on the order in which
gradient tensors are generated. We show how to model
P3 based on MXNet’s parameter server architecture (with
push/pull communication primitives). To model P3 with Day-
dream, we insert parallel push/pull primitives for each gradi-
ent slice, tag each slice with priority based on the generation
order, and override the Schedule function to model the prior-
itization scheme.
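The two ingredients described above, slicing a gradient tensor and a priority-aware scheduling function, can be sketched in Python. The field names and the `progress` dictionary (the time at which each execution thread becomes free, called P in Algorithm 7) are assumptions for illustration; the tie-breaking rule follows the Schedule function of Algorithm 7.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    name: str
    thread: str          # execution thread, e.g. "comm.send"
    start: float         # earliest time at which dependencies are met
    priority: int = 0    # larger value wins a tie
    is_comm: bool = False


def slice_sizes(gradient_size: int, slice_size: int) -> List[int]:
    """Split a gradient tensor into slices of at most slice_size."""
    sizes = []
    g = gradient_size
    while g > 0:
        sizes.append(min(g, slice_size))
        g -= slice_size
    return sizes


def schedule(queue: List[Task], progress: Dict[str, float]) -> Task:
    """Pick the task that can start earliest; among push/pull tasks
    that can start at the same time, prefer the higher priority."""
    earliest = queue[0]
    time = max(progress[earliest.thread], earliest.start)
    for task in queue:
        this_time = max(progress[task.thread], task.start)
        if this_time < time:
            time, earliest = this_time, task
        elif (this_time == time and task.is_comm and earliest.is_comm
              and task.priority > earliest.priority):
            earliest = task
    return earliest


queue = [
    Task("push_a", "comm.send", 0.0, priority=-3, is_comm=True),
    Task("push_b", "comm.send", 0.0, priority=-1, is_comm=True),
    Task("bwd_kernel", "gpu", 5.0),
]
chosen = schedule(queue, {"comm.send": 2.0, "gpu": 0.0})
slices = slice_sizes(10, 4)
```

Here both push tasks can start at time 2.0 once the send thread is free, so the priority alone decides which slice is sent first.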
A.6 BlueConnect
BlueConnect [17] optimizes the bandwidth usage by
decomposing the synchronous all-reduce operations into a series
of reduce-scatter and all-gather operations. The
decomposition helps better utilize the heterogeneous intra-node and
inter-node bandwidths. The decomposition of all-reduce
operations is based on a factorization of the number of GPUs.
We show the pseudo code in Algorithm 8.
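The decomposition can be sketched in Python on a linear sequence of task names. The naming convention (tasks starting with "AllReduce") and the check that the factors multiply to the number of GPUs are assumptions of this sketch; the order of the inserted operations follows Algorithm 8.

```python
import math
from typing import List, Tuple


def blueconnect_chain(factors: List[int]) -> List[Tuple[str, int]]:
    """Reduce-scatter over p1..pk, then all-gather over pk..p1."""
    chain = [("reduce_scatter", p) for p in factors]
    chain += [("all_gather", p) for p in reversed(factors)]
    return chain


def what_if_blueconnect(
    tasks: List[str], factors: List[int], num_gpus: int
) -> List[str]:
    """Replace every all-reduce task by its decomposed chain."""
    if math.prod(factors) != num_gpus:
        raise ValueError("factors must multiply to the number of GPUs")
    out: List[str] = []
    for name in tasks:
        if name.startswith("AllReduce"):
            out += [f"{op}_{p}" for op, p in blueconnect_chain(factors)]
        else:
            out.append(name)
    return out


# 8 GPUs factorized as 4 x 2 (e.g., 4 GPUs per node, 2 nodes).
modeled = what_if_blueconnect(
    ["bwd", "AllReduce_0", "weight_update"], [4, 2], 8
)
```

As with the other inserted tasks in this appendix, the duration of each reduce-scatter and all-gather node still has to be estimated separately.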
Algorithm 7: What_If_P3
Input: Dependency graph: G(V,E), slice_size
Output: An updated graph G(V,E) to model P3
// Select GPU tasks in BP and FF
GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
foreach u ∈ GPUTasks do
    v ← u’s corresponding BP layer
    g ← |u.layer’s gradients|
    while g > 0 do
        s ← min(g, slice_size)
        push ← newNode("push v.layer", s, ...)
        pull ← newNode("pull v.layer", s, ...)
        push.priority ← −(distance to output)
        push.ExecutionThread ← comm.send
        if this slice is stored on the first server then
            pull.ExecutionThread ← comm.send
        else
            pull.ExecutionThread ← comm.receive
        end
        G.AddDependencies(u → push → pull → v)
        g ← g − slice_size
    end
end
Function Schedule(TaskQueue: Q):
    earliest ← Q.first()
    thread ← earliest.ExecutionThread
    time ← max(P[thread], earliest.start)
    foreach task ∈ Q do
        this_thread ← task.ExecutionThread
        this_time ← max(P[this_thread], task.start)
        if this_time < time then
            time ← this_time
            earliest ← task
        end
        if this_time = time ∧ task is push/pull ∧ earliest is push/pull ∧ task.priority > earliest.priority then
            earliest ← task
        end
    end
    return earliest
End Function
Algorithm 8: What_If_BlueConnect
Input: Dependency graph of distributed training: G(V,E), decomposition factorization: p1 p2 ... pk
Output: An updated graph G(V,E) to model BlueConnect
ReduceTasks ← {G.Select(funcPtr(IsAllReduce))}
foreach u ∈ ReduceTasks do
    s ← u.prevNodes
    t ← u.postNodes
    G.Remove(u)
    foreach i ← 1..k do
        RSNode ← new(Reduce_Scatter_Node(pi))
        G.Insert(s, RSNode, t)
        s ← RSNode
    end
    foreach i ← k..1 do
        AGNode ← new(All_Gather_Node(pi))
        G.Insert(s, AGNode, t)
        s ← AGNode
    end
end
A.7 MetaFlow
MetaFlow [33] is a relaxed graph substitution optimizer. It
simplifies the layer representation of a DNN topology by
using operations like enlarging convolution kernel
dimensions and layer fusion. The policy to transform the layer-wise
topology is determined by a backtracking search algorithm.
Daydream does not provide extra support to automatically
determine the policy, as doing so would duplicate MetaFlow’s work.
A transformation policy of MetaFlow will eventually
remove existing layers or scale their dimensions. Given a policy,
Daydream can estimate its performance by modeling
layer-wise removal/scaling operations, with the help of the layer
mapping (described in Section 4.3). We show Daydream’s pseudo
code for these two operations in Algorithm 9.
MetaFlow’s search algorithm uses a cost model to evaluate
the performance of a given policy. Daydream can be used as
a more precise cost model for the search algorithm.
A.8 Virtualized DNN (vDNN)
Virtualized DNN [66] optimizes the memory footprint in CNN
training by offloading feature maps from GPU memory to
CPU memory. To model vDNN with Daydream, we only need
to insert the corresponding cudaMemcpy calls, and implement
the prefetching strategy by overriding the Schedule
function. The custom Schedule function delays the execution of
the prefetching operation. We demonstrate how to model the
vDNNconv policy, which only offloads the feature maps of all
convolutional layers. We tag each layer with an ID (a higher
ID means the layer is closer to the output layer), and use the
findPrefetchLayer function defined in the original vDNN
paper [66]. We show the pseudo code in Algorithm 10.
Algorithm 9: What_If_MetaFlow
Input: Dependency graph: G(V,E)
Output: An updated graph G(V,E) to model MetaFlow
Function Remove_layer(Dependency Graph: G(V, E), Layer: l):
    GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
    foreach u ∈ GPUTasks do
        if u.layer is l then
            G.Remove(u)
        end
    end
End Function
Function Scale_layer(Dependency Graph: G(V, E), Layer: l, Scale: s):
    GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
    foreach u ∈ GPUTasks do
        if u.layer is l then
            u.duration ← u.duration × s
        end
    end
End Function
Algorithm 10: What_If_vDNN
Input: Dependency graph: G(V,E)
Output: An updated graph G(V,E) to model vDNN
GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
ID2PrefetchTask ← {}
foreach u ∈ GPUTasks do
    if u.layer is not CONV_FF then
        continue
    end
    v ← u’s corresponding BP layer
    // Offload the feature map to CPU memory and free it on the GPU
    t1 ← newCPUNode("cudaMemcpyLaunch", ...)
    t2 ← newGPUNode("cudaMemcpyD2H", ...)
    t3 ← newCPUNode("cudaFree_vDNN", ...)
    // Re-allocate and prefetch the feature map before the backward pass
    t4 ← newCPUNode("cudaMalloc_vDNN", ...)
    ID2PrefetchTask[u.ID] ← t4
    t5 ← newCPUNode("cudaMemcpyLaunch", ...)
    t6 ← newGPUNode("cudaMemcpyH2D", ...)
    G.AddDependencies(u → t1 → t2 → t3 → t4 → t5 → t6 → v)
end
Function Schedule(TaskQueue: Q):
    GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
    next ← Q.last()
    if next.layer is BP then
        l ← findPrefetchLayer(next.ID)
        if l ≠ −1 then
            return ID2PrefetchTask[l]
        end
    end
    return next
End Function
A.9 Gist
Gist [30] is a technique that optimizes the memory footprint
when training CNNs. It reduces the memory consumption of
the intermediate feature maps by adding encoding/decoding
operations to the training iterations. Gist provides both
lossless and lossy compression strategies. We can use Daydream
to estimate the performance overhead of Gist by inserting
the encoding/decoding kernels. When estimating the lossless
compression, we need to insert GPU kernels that are either
element-wise kernels (including clamping, pooling-mapping,
bit-wise kernels, etc.), or cuSPARSE kernels. When
estimating the lossy compression, we need to additionally insert
the GPU kernels that perform the Delayed Precision Reduction
(DPR) scheme.
Note that estimating the duration of these kernels is crucial
to the prediction accuracy. The duration of these kernels can
be either inferred based on existing kernels, or profiled
separately (the latter is outside of Daydream’s focus and should be
resolved using other techniques). We show the pseudo code
in Algorithm 11.
Algorithm 11: What_If_Gist
Input: Dependency graph: G(V,E)
Output: An updated graph G(V,E) to model Gist
GPUTasks ← {G.Select(funcPtr(IsOnGPU))}
foreach u ∈ GPUTasks do
    v ← u.postNode
    w ← v.postNode
    if u.layer is RELU_FF ∧ v.layer is POOL_FF ∧ w.layer is CONV_FF then
        SSDC ← newNode(...)
        G.Insert(v, SSDC, w)
    end
    if u.layer is RELU_FF ∧ v.layer is POOL_FF then
        Binarize ← newNode(...)
        G.Insert(v, Binarize, w)
    end
end
if LOSSY_COMPRESSION then
    foreach u ∈ GPUTasks do
        if u is not RELU then
            v ← u.postNode
            DPR ← newNode(...)
            G.Insert(u, DPR, v)
        end
    end
end
// Add decode kernels to the backward pass
...
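As a small illustration of the insertion pattern in Algorithm 11, the Python sketch below handles only the Binarize case on a linear forward-pass trace. The layer labels, the "GIST_ENCODE" tag, and the encode duration are assumptions; in practice that duration would have to be inferred or profiled as discussed above. Summing durations as the overhead assumes the kernels run back to back on one stream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    layer: str
    duration: float      # hypothetical unit: microseconds


def what_if_gist_binarize(tasks: List[Task], encode_us: float) -> List[Task]:
    """Insert a Binarize kernel after each ReLU -> Pool pair."""
    out: List[Task] = []
    for i, t in enumerate(tasks):
        out.append(t)
        follows_relu = i > 0 and tasks[i - 1].layer == "RELU_FF"
        has_successor = i + 1 < len(tasks)
        if follows_relu and t.layer == "POOL_FF" and has_successor:
            out.append(Task("Binarize", "GIST_ENCODE", encode_us))
    return out


trace = [
    Task("conv1", "CONV_FF", 120.0),
    Task("relu1", "RELU_FF", 10.0),
    Task("pool1", "POOL_FF", 15.0),
    Task("conv2", "CONV_FF", 90.0),
]
with_gist = what_if_gist_binarize(trace, encode_us=4.0)
overhead = sum(t.duration for t in with_gist) - sum(t.duration for t in trace)
```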
A.10 Deep Gradient Compression (DGC)
DGC [40] reduces communication overhead by compressing
the gradients before transmission and decompressing the
gradients before the weight update phase. When using Daydream
to estimate the performance overhead of DGC, we need to
insert the compression/decompression kernels before/after
the communication primitives. Similar to Gist, the prediction
accuracy mainly depends on the estimated duration of the
inserted kernels. We show the pseudo code in Algorithm 12.
Algorithm 12: What_If_DGC
Input: Dependency graph: G(V,E)
Output: An updated graph G(V,E) to model Deep Gradient Compression
ReduceTasks ← {G.Select(funcPtr(IsAllReduce))}
foreach r ∈ ReduceTasks do
    s ← r.prevNodes()
    t ← r.postNodes()
    // Initialize compression kernels
    quantize_op ← newNode(...)
    sparse_op ← newNode(...)
    ...
    G.Insert(s, quantize_op, r)
    G.Insert(quantize_op, sparse_op, r)
    ...
    // Initialize decompression kernels
    d_kernels ← ...
    G.Insert(r, d_kernels, t)
end
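The insertion order of Algorithm 12 can be sketched in Python on a linear task list. The kernel durations are placeholder parameters, the "AllReduce" name prefix is an assumed convention, and the sketch only inserts kernels; it does not adjust the duration of the all-reduce itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str
    duration: float      # hypothetical unit: microseconds


def what_if_dgc(
    tasks: List[Task], compress_us: float, decompress_us: float
) -> List[Task]:
    """Wrap every all-reduce with compression and decompression kernels."""
    out: List[Task] = []
    for t in tasks:
        if t.name.startswith("AllReduce"):
            out.append(Task("quantize_op", compress_us))
            out.append(Task("sparse_op", compress_us))
            out.append(t)
            out.append(Task("decompress", decompress_us))
        else:
            out.append(t)
    return out


trace = [Task("bwd", 50.0), Task("AllReduce_0", 30.0), Task("weight_update", 5.0)]
with_dgc = what_if_dgc(trace, compress_us=2.0, decompress_us=3.0)
```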
