Automatic Model Parallelism for Deep Neural Networks with Compiler and
  Hardware Support by Tavarageri, Sanket et al.
Sanket Tavarageri
Intel Labs
sanket.tavarageri@intel.com
Srinivas Sridharan
Intel Labs
srinivas.sridharan@intel.com
Bharat Kaul
Intel Labs
bharat.kaul@intel.com
ABSTRACT
Deep neural networks (DNNs) have been enormously
successful in tasks that were hitherto in the human-only
realm, such as image recognition and language translation.
Owing to their success, DNNs are being explored for use
in ever more sophisticated tasks. One way DNNs are made
to scale to such complex undertakings is by increasing
their size: deeper and wider networks can model the
additional complexity well. Such large models are trained
using model parallelism on multiple compute devices, such
as multi-GPU and multi-node systems.
In this paper, we develop a compiler-driven approach to
achieve model parallelism. We model the computation and
communication costs of a dataflow graph that embodies the
neural network training process and then partition the graph
using heuristics so that the communication between compute
devices is minimal and the load is well balanced. We also
propose hardware scheduling assistants that help the
compiler fine-tune the distribution of work at runtime.
1 INTRODUCTION
As deep neural networks (DNNs) grow in size, they
necessitate the use of multiple compute devices (e.g.,
multiple GPUs) for their training. Splitting the neural
network model across multiple compute devices during
training is termed model parallelism. Achieving high
performance with model parallelism is an important and
difficult problem. We have developed a comprehensive
solution to automatically obtain efficient model parallelism
through compiler analyses and novel hardware support.
We develop a solution based on a compiler and hardware
scheduling assistants for realizing model parallelism while
training deep neural networks (DNNs). Figure 1 shows an
overview of the system. The DNN compiler maps the dataflow
graph produced by the DNN model to multiple compute devices
so that the graph executes efficiently. The hardware
scheduling assistants are programmed by the compiler to
optimally migrate computations between devices at runtime so
that the resources of the system are effectively utilized.
Through analytical modeling of computation and
communication costs, the DNN compiler partitions the neural
network and maps it to multiple compute devices so as to
achieve minimal communication and optimal load balance.
However, to account for imprecision in the analytical cost
model and for dynamic changes in the execution environment,
it enlists the help of the hardware as follows. The compiler
encodes simple rules that let the hardware scheduling
assistants move parts of the neural network between devices
and thus adapt dynamically for high performance.
The techniques presented in this paper will dramatically
increase the performance of training of deep learning models
on Intel architectures. The hardware-software synergy that
this solution brings about will be effective in achieving
better scaling than compiler-only approaches, as the system
will be able to adapt to changing execution environments and
continually fine-tune the model parallelism for performance.
This hardware-software co-design is superior to other
middleware-based runtime techniques because the rules are
encoded in hardware, which eliminates the runtime overheads.
In Section 2, we describe the compiler techniques, and
the hardware scheduling assistants are detailed in Section 3.
The related work is discussed in Section 4, while concluding
remarks are presented in Section 5.
2 COMPILER DIRECTED MODEL
PARALLELISM
In deep neural network frameworks such as TensorFlow,
the computation is represented as a dataflow graph. The
nodes of the graph represent computations, while the edges
capture the input and output data. To execute the dataflow
graph on multiple devices, such as in a multi-GPU or a
multi-node environment, we first need to identify which
parts of the graph will run on which device, and then
insert communication primitives between any two nodes that
are connected by an edge but are mapped to different
devices.
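The edges that need communication primitives are simply those whose endpoints land on different devices. The following is a minimal sketch of that check, assuming a toy graph; the node names, the device names `gpu0`/`gpu1`, and the helper `cross_device_edges` are hypothetical and not part of any framework API.

```python
# Hypothetical toy dataflow graph: each edge is a (producer, consumer) pair.
edges = [("conv1", "relu1"), ("relu1", "conv2"), ("conv2", "loss")]

# Hypothetical placement of every node on a device.
placement = {"conv1": "gpu0", "relu1": "gpu0", "conv2": "gpu1", "loss": "gpu1"}


def cross_device_edges(edges, placement):
    """Return the edges whose two endpoints are mapped to different devices."""
    return [(u, v) for (u, v) in edges if placement[u] != placement[v]]


# Only these edges would need a send/receive pair inserted.
cut = cross_device_edges(edges, placement)
```

In this example only the edge from `relu1` to `conv2` crosses a device boundary, so it is the only place where a send/receive pair would be required.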
In this work, we develop an approach to partitioning the
dataflow graph. The subgraphs that the partition induces
will be executed on different devices. The goals of the parti-
tioning algorithm will be to 1) reduce the volume of data to
be communicated between subgraphs, and 2) achieve a good
load balance by creating subgraphs of roughly equal size.
arXiv:1906.08168v1 [cs.DC] 11 Jun 2019
[Diagram: DNN model; DNN compiler; compute devices 1 through n,
each with its own scheduling assistant 1 through n]
Figure 1: The overview of the system
The various phases of the overall approach are as follows.
(1) Selection of computationally expensive and relocatable
nodes: We profile the workload to discover computation-
ally expensive nodes. Further, among the computationally
expensive nodes, only the stateless nodes are considered for
further analysis (an example of a stateful node would be a
variable which is used to save a model’s parameters).
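The selection phase can be pictured as a filter over profiling data. The sketch below is illustrative only: the paper does not say how "computationally expensive" is decided, so the time threshold, the profile dictionary, and the function name `select_candidates` are assumptions made for this example.

```python
def select_candidates(profile, stateful, cost_threshold):
    """Keep nodes that are expensive (by profiled time) and stateless.

    profile: node name -> measured execution time in seconds (assumed input).
    stateful: set of node names that hold state, e.g. variables.
    cost_threshold: assumed cut-off for calling a node expensive.
    """
    return sorted(
        name
        for name, seconds in profile.items()
        if seconds >= cost_threshold and name not in stateful
    )


# Hypothetical profile of four nodes.
profile = {"matmul": 4.0e-3, "conv": 9.0e-3, "weights": 5.0e-3, "reshape": 1.0e-5}
stateful = {"weights"}

candidates = select_candidates(profile, stateful, cost_threshold=1.0e-3)
```

Here `weights` is expensive but stateful and `reshape` is stateless but cheap, so only `conv` and `matmul` remain for cost modeling.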
(2) Analytical cost modeling: Analytical cost modeling of the
dataflow graph is performed which assigns computation,
and communication costs to nodes, and edges of the graph
respectively. Here, cost modeling of only the selected com-
putationally expensive nodes is carried out. The costs thus
assigned form the basis of subsequent graph partitioning.
The dataflow graph $G$ consists of vertices/nodes $V$ and edges
$E$: $G = (V, E)$. The nodes indicate computations, and the
edges encode the data and control dependencies between
nodes. Let there be $k$ devices available, $D_1, D_2, \ldots, D_k$,
with potentially varying computational capabilities. A node
$v_i$ mapped to a device $D_j$ is assigned a computation cost of
$c^{D_j}_{v_i}$. The cost $c^{D_j}_{v_i}$ denotes the number of time units
it takes to execute $v_i$ on $D_j$. The cost modeling is based on
the number of operations an operator entails and the number
of operations that a given device can perform in a unit of time.
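Read literally, this cost model divides an operator's operation count by the device's throughput. The sketch below shows that arithmetic under this reading; the operation counts, the device rates, and the function name `computation_cost` are made-up illustrative values, not measurements.

```python
def computation_cost(num_ops, ops_per_time_unit):
    """Time units needed to run an operator with num_ops operations on a
    device that performs ops_per_time_unit operations per time unit."""
    return num_ops / ops_per_time_unit


# Hypothetical operation counts per node and throughput per device.
node_ops = {"v1": 8.0e9, "v2": 2.0e9}
device_rate = {"D1": 4.0e9, "D2": 1.0e9}

# cost[(node, device)] plays the role of the per-node, per-device cost.
cost = {
    (v, d): computation_cost(ops, rate)
    for v, ops in node_ops.items()
    for d, rate in device_rate.items()
}
```

Because the devices may differ in capability, the same node receives a different cost on each device: `v1` costs 2 time units on `D1` but 8 on `D2` in this example.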
The edges of the graph are assigned numerical weights equal
to the volume of data that they carry. The edges denote
data and control dependencies between nodes. A control
dependency edge is given the weight 0, while a data edge is
weighted by the number of bytes of data it carries. In terms
of notation, the weight $d_{v_i v_j}$ is assigned to the edge
connecting node $v_i$ and node $v_j$.
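As an illustration of the edge weighting, the sketch below assumes that a data edge carries a single dense tensor whose size in bytes is the product of its dimensions times the element size. That assumption, and the helper name `edge_weight`, are ours rather than the paper's.

```python
from math import prod


def edge_weight(shape, bytes_per_element, is_control=False):
    """Weight of an edge: 0 for a control dependency, otherwise the number
    of bytes in the (assumed dense) tensor that the edge carries."""
    if is_control:
        return 0
    return prod(shape) * bytes_per_element


# A 32 x 1024 tensor of 4-byte floats carried on a data edge.
w_data = edge_weight((32, 1024), 4)

# A control dependency carries no data.
w_ctrl = edge_weight((), 0, is_control=True)
```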
(3) Initial partitioning of the dataflow graph: We use one
of the following strategies to create initial partitions: 1) block
partitioning, 2) random partitioning. The heuristic described
next subsequently improves the partitions with respect to the
optimization criteria, namely communication minimization and
load balance.
Block partitioning: If $C$ is the total computation cost, and
$k$ is the number of devices, then the nodes are assigned to
devices in such a way that each partition gets nodes worth
a total of approximately $C/k$. To do so, the dataflow graph is
topologically sorted, and a list of the nodes in sorted order
is created. Then, the list is divided up into $k$ partitions in
a block fashion so that the nodes in each partition have an
aggregate cost of approximately $C/k$. The $k$ partitions are
mapped to $k$ devices.
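One possible way to realize block partitioning is sketched below. It assumes a single device-independent cost per node and closes a block as soon as its accumulated cost reaches the per-device target; both are simplifications chosen for this example, and the function name `block_partition` is hypothetical. The topological order comes from Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter


def block_partition(order, cost, k):
    """Split a topologically sorted node list into k contiguous blocks whose
    aggregate cost is close to total_cost / k."""
    target = sum(cost[n] for n in order) / k
    parts = [[] for _ in range(k)]
    idx, acc = 0, 0.0
    for n in order:
        # Start the next block once the current one has reached its target;
        # the last block stays open to absorb any remaining nodes.
        if acc >= target and idx < k - 1:
            idx, acc = idx + 1, 0.0
        parts[idx].append(n)
        acc += cost[n]
    return parts


# Hypothetical chain a -> b -> c -> d, given as node -> list of predecessors.
preds = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
order = list(TopologicalSorter(preds).static_order())

cost = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 2.0}
parts = block_partition(order, cost, k=2)
```

With a total cost of 8 and two devices, the target is 4 per block, so the first block takes `a` and `b` and the second takes `c` and `d`.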
Random partitioning: The nodes are randomly assigned to
devices.
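Random partitioning needs very little code. The sketch below assumes a uniform choice among devices and a fixed seed for reproducibility; neither detail is specified in the text, and `random_partition` is a hypothetical helper name.

```python
import random


def random_partition(nodes, k, seed=0):
    """Assign each node to one of k devices uniformly at random.

    A seeded generator is used so that the initial partition is repeatable.
    """
    rng = random.Random(seed)
    return {n: rng.randrange(k) for n in nodes}


assignment = random_partition(["a", "b", "c", "d"], k=3, seed=1)
```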
(4) Iterative repartitioning: We adapt the Kernighan and Lin
formulation of the communication cost [4] and Karypis and
Kumar’s greedy refinement approach [3] to the context of
dataflow graphs, which are directed graphs. (The Kernighan
and Lin formulation is applicable only to undirected graphs.
Unlike Karypis and Kumar’s approach, where load balance is
a secondary goal, our formulation can treat it as a primary
goal, which allows us to achieve model parallelism completely
automatically.)
The communication cost of a node $n_i$ mapped to device
$D_p$ is calculated as follows. The incoming edges into $n_i$ are
considered. Let $I_{n_i}$ be the sum of weights of edges emanating
from nodes that are mapped to the same device as $n_i$ and
end in $n_i$. Let $E_{n_i}$ be the sum of weights of edges originating
from nodes mapped to a device different from that of $n_i$. The
difference between $E_{n_i}$ and $I_{n_i}$ is the communication cost
associated with $n_i$.
$$D^{D_p}_{n_i} = E^{D_p}_{n_i} - I^{D_p}_{n_i}$$
It is observed that if $E^{D_p}_{n_i}$ is 0, then $D^{D_p}_{n_i}$ is a negative
value, assuming $I^{D_p}_{n_i}$ is non-zero. In this case all of $n_i$’s
communication is internal to the device. On the other hand, if
$I^{D_p}_{n_i}$ is 0, then $D^{D_p}_{n_i}$ is a positive quantity, assuming
$E^{D_p}_{n_i}$ is non-zero. In this instance, all of $n_i$’s communicating
partners are located on other devices.
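The communication cost defined above (external incoming weight minus internal incoming weight) can be computed directly from the edge weights and the current placement. The sketch below is a minimal illustration with hypothetical data structures: `edges` maps a (source, destination) pair to its weight, and `placement` maps a node to its device.

```python
def communication_cost(node, device, edges, placement):
    """Communication cost of `node` if it were placed on `device`.

    Only incoming edges of `node` are considered. Edges from producers on
    `device` count as internal; all others count as external.
    """
    internal = 0
    external = 0
    for (src, dst), weight in edges.items():
        if dst != node:
            continue
        if placement[src] == device:
            internal += weight
        else:
            external += weight
    return external - internal


# Node c has two producers: a (on D1, 10 bytes) and b (on D2, 4 bytes).
edges = {("a", "c"): 10, ("b", "c"): 4}
placement = {"a": "D1", "b": "D2", "c": "D1"}

d_on_d1 = communication_cost("c", "D1", edges, placement)  # 4 - 10
d_on_d2 = communication_cost("c", "D2", edges, placement)  # 10 - 4
```

Since the cost is lower on `D1` (-6) than on `D2` (6), keeping `c` next to its heaviest producer is the better choice from the communication standpoint.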
We would like to minimize the sum of $D_{n_i}$ over all nodes as
much as possible to achieve minimal communication, subject
to the constraint that a certain load balance requirement
among devices is maintained. We define the load balance
constraint as the share of the computational cost of a device
being within a threshold of the ideal share of the
computational cost. That is,
$$\forall D_i \in \{D_1, \ldots, D_k\}: \left| C_{D_i} - \frac{C}{k} \right| \le \epsilon$$
where $C_{D_i}$ is the computation cost of the nodes mapped to
device $D_i$, $C$ is the total computation cost, and $\epsilon$ is a
parameter.
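The load balance constraint can be checked in a few lines. The sketch below reads "within a threshold of the ideal share" as a two-sided bound on each device's deviation from the ideal share; that reading, the function name `is_balanced`, and the numbers are assumptions for illustration.

```python
def is_balanced(device_cost, epsilon):
    """True if every device's computation cost is within epsilon of the
    ideal share, i.e. total cost divided by the number of devices."""
    ideal = sum(device_cost.values()) / len(device_cost)
    return all(abs(c - ideal) <= epsilon for c in device_cost.values())


# Hypothetical per-device computation costs; the ideal share is 10.
loads = {"D1": 10.0, "D2": 12.0, "D3": 8.0}

balanced_loose = is_balanced(loads, epsilon=2.0)
balanced_tight = is_balanced(loads, epsilon=1.0)
```

The same placement satisfies the constraint with a threshold of 2 but violates it with a threshold of 1, which shows how the parameter trades load balance against freedom to reduce communication.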
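We move a node $n_i$ from device $D_q$ to device $D_r$ with the
minimum $D_{n_i}$ if the following condition is met:
$$\left( D^{D_r}_{n_i} < D^{D_q}_{n_i} \right) \wedge \left( (C_{D_r} + c^{D_r}_{n_i}) - \frac{C}{k} \le \epsilon \right) \wedge \left( \frac{C}{k} - (C_{D_q} - c^{D_q}_{n_i}) \le \epsilon \right)$$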
The first part of the condition makes sure that the
communication cost on the new device $D_r$ is smaller than the
communication cost on the original device $D_q$. The second
part of the formula states that the computation share of the
device $D_r$ receiving the new node does not exceed the ideal
computational share by more than the threshold $\epsilon$. The
third part of the conjunction asserts that the computational
share of the device $D_q$ losing the node does not drop more
than $\epsilon$ below the ideal share.
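The three-part move condition translates into a small predicate. The sketch below is one way to write it, with hypothetical argument names: `d_new` and `d_old` are the node's communication costs on the destination and source devices, `load_dst` and `load_src` are the devices' current computation costs, and `cost_on_dst` and `cost_on_src` are the node's computation costs on each device.

```python
def may_move(d_new, d_old, load_dst, load_src,
             cost_on_dst, cost_on_src, total_cost, k, epsilon):
    """Check the communication-driven move condition for one node."""
    ideal = total_cost / k
    # 1) Communication cost must be strictly lower on the destination.
    lowers_comm = d_new < d_old
    # 2) Destination must not exceed the ideal share by more than epsilon.
    dst_ok = (load_dst + cost_on_dst) - ideal <= epsilon
    # 3) Source must not fall more than epsilon below the ideal share.
    src_ok = ideal - (load_src - cost_on_src) <= epsilon
    return lowers_comm and dst_ok and src_ok


# Two devices, total cost 20, so the ideal share is 10.
ok_small = may_move(d_new=-3, d_old=5, load_dst=8.0, load_src=12.0,
                    cost_on_dst=2.0, cost_on_src=2.0,
                    total_cost=20.0, k=2, epsilon=1.0)

# Moving a heavier node would overload the destination (13 > 10 + 1).
ok_large = may_move(d_new=-3, d_old=5, load_dst=8.0, load_src=12.0,
                    cost_on_dst=5.0, cost_on_src=5.0,
                    total_cost=20.0, k=2, epsilon=1.0)
```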
In addition to communication minimization, a further goal is
to improve the load balance of the system. To this end, a
node $n_i$ is moved from device $D_q$ to device $D_r$ if 1) device
$D_q$’s consequent computational share remains above the
ideal share, and 2) device $D_r$’s share remains below the ideal
share:
$$\left( (C_{D_r} + c^{D_r}_{n_i}) < \frac{C}{k} \right) \wedge \left( (C_{D_q} - c^{D_q}_{n_i}) > \frac{C}{k} \right)$$
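The balance-driven move rule can be sketched in the same style. The predicate below uses hypothetical argument names and checks only the two load conditions stated in the text, without considering communication cost.

```python
def improves_balance(load_dst, load_src, cost_on_dst, cost_on_src,
                     total_cost, k):
    """True if moving the node keeps the destination strictly below the
    ideal share and the source strictly above it."""
    ideal = total_cost / k
    return (load_dst + cost_on_dst) < ideal and (load_src - cost_on_src) > ideal


# Two devices, total cost 30, so the ideal share is 15.
move_light = improves_balance(load_dst=10.0, load_src=20.0,
                              cost_on_dst=2.0, cost_on_src=2.0,
                              total_cost=30.0, k=2)

# A node of cost 6 would push the destination to 16, past the ideal share.
move_heavy = improves_balance(load_dst=10.0, load_src=20.0,
                              cost_on_dst=6.0, cost_on_src=6.0,
                              total_cost=30.0, k=2)
```

Because both inequalities are strict, a move is accepted only while the source is still overloaded and the destination still underloaded afterwards, so such moves cannot overshoot the ideal share.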
3 HARDWARE SUPPORT: SCHEDULING
ASSISTANTS
Owing to the imprecision of the analytical model and the
possible interference of co-located applications,
compiler-directed model parallelism may not be able to
achieve optimal performance. Therefore, we augment
compiler-orchestrated model parallelism with dynamic
adaptation by hardware scheduling assistants. The scheduling
assistants are programmed by the compiler with a set of rules
that dictate how the nodes are migrated among compute devices.
The nodes in the dataflow graph will be annotated with
the following tags depending on the bottleneck that the
operations in the nodes face:
• compute-bound
• memory-bound
• network-bound
The scheduling assistant observes the compute, memory,
and network activity on a device, and migrates the nodes
depending on their tags as follows:
• When a device $D_i$’s compute utilization exceeds a certain
threshold $\theta$ (say, 95%), it selects one of the
compute-bound nodes mapped to it and places it in the
compute out-box. Another device whose compute utilization
falls below a certain threshold $\gamma$ (say, 50%) may acquire
the node thus placed in the compute out-box of $D_i$.
• Correspondingly, if the compute utilization of $D_i$ falls
below $\gamma$, then $D_i$ picks a node placed in another device’s
compute out-box.
Similar rules are formed with respect to memory bandwidth
utilization and network utilization. The compiler’s
designation of nodes as compute-bound, memory-bound, or
network-bound enables the scheduling assistants of the
system to swap nodes of the dataflow graph so as to maximize
their collective utilization of the various resources.
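The compute out-box rules can be mimicked in software to see how a node would migrate. The sketch below is a simplified simulation, not a hardware design: the `Device` class and the `offer` and `acquire` functions are hypothetical, utilization values are supplied by hand rather than measured, and only the compute rule is modeled. The thresholds use the example values from the text.

```python
from dataclasses import dataclass, field

THETA = 0.95  # utilization above which a device offers a node
GAMMA = 0.50  # utilization below which a device may acquire a node


@dataclass
class Device:
    name: str
    compute_util: float
    nodes: dict  # node name -> tag, e.g. "compute-bound"
    compute_outbox: list = field(default_factory=list)


def offer(device):
    """If the device is overloaded, move one compute-bound node to its out-box."""
    if device.compute_util <= THETA:
        return None
    node = next((n for n, tag in device.nodes.items()
                 if tag == "compute-bound"), None)
    if node is not None:
        del device.nodes[node]
        device.compute_outbox.append(node)
    return node


def acquire(device, others):
    """If the device is underused, take a node from another device's out-box."""
    if device.compute_util >= GAMMA:
        return None
    for other in others:
        if other.compute_outbox:
            node = other.compute_outbox.pop(0)
            device.nodes[node] = "compute-bound"
            return node
    return None


busy = Device("D1", 0.97, {"conv": "compute-bound", "embed": "memory-bound"})
idle = Device("D2", 0.30, {})

offered = offer(busy)
taken = acquire(idle, [busy])
```

In this run the busy device offers `conv`, the idle device picks it up, and the memory-bound node `embed` stays where it is because the compute rule only considers compute-bound nodes.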
4 RELATED WORK
Kernighan and Lin [4] and Karypis and Kumar [3] develop graph
partitioning techniques that can be used to obtain model
parallelism for training of deep neural networks. Kernighan
and Lin [4] propose a formulation for modeling the
communication cost that can be used as the basis for dataflow
graph partitioning for multi-device execution. Karypis and
Kumar [3] develop a greedy refinement approach that partitions
the graph by first coarsening and then uncoarsening it.
Hardware schedulers have been explored mainly with the goal
of reducing the overheads of job scheduling by an operating
system. We discuss some representative works. Dodiu and
Gaitan [1] design a hardware scheduler engine to reduce task
switching time, targeted at Real Time Operating Systems
(RTOSs). Gupta et al. [2] devise a hardware scheduler that
implements the Pfair scheduling algorithm, which allows
processes to make proportionate progress in a
multi-processor system.
Our deep learning compiler partitions the dataflow graph
like the approaches mentioned above, with some key
differences. Our techniques are applicable to directed graphs
(and dataflow graphs are directed), whereas the Kernighan and
Lin formulation is applicable only to undirected graphs.
Unlike Karypis and Kumar’s approach, where communication is
the primary goal and load balance is the secondary goal, our
formulation can treat load balance as a primary goal as well,
which allows us to achieve model parallelism completely
automatically.
The hardware scheduling assistant developed in this paper
is intended to perform load balancing of work after being
programmed by the deep learning compiler. In contrast, prior
hardware schedulers are designed to assist scheduling of
processes by an Operating System, which is a completely
distinct problem.
5 CONCLUSION
We presented a compiler technology and a hardware
architecture to automatically achieve model parallelism during
the training of deep neural networks. As the network size
grows, the model will no longer fit in the memory of a single
GPU or a single CPU. Therefore, it becomes imperative that
model parallelism be used to split the model across the
memories of multiple compute devices while training. The
compiler-directed partitioning of the dataflow graph maps
the computation to multiple compute devices, and the
hardware scheduling assistants dynamically adjust the mapping
at runtime to maintain high load balance and low
communication.
REFERENCES
[1] Eugen Dodiu and Vasile Gheorghita Gaitan. 2012. Custom designed CPU
architecture based on a hardware scheduler and independent pipeline
registers - Concept and theory of operation. In 2012 IEEE International
Conference on Electro/Information Technology. IEEE, 1–5.
[2] Nikhil Gupta, Suman K Mandal, Javier Malave, Ayan Mandal, and Rabi N
Mahapatra. 2010. A hardware scheduler for real time multiprocessor
system on chip. In 2010 23rd International Conference on VLSI Design.
IEEE, 264–269.
[3] George Karypis and Vipin Kumar. 1998. Multilevel k-way partitioning
scheme for irregular graphs. Journal of Parallel and Distributed
Computing 48, 1 (1998), 96–129.
[4] Brian W Kernighan and Shen Lin. 1970. An efficient heuristic procedure
for partitioning graphs. The Bell System Technical Journal 49, 2 (1970),
291–307.
