A Computational-Graph Partitioning Method for Training
  Memory-Constrained DNNs by Qararyah, Fareed et al.
Fareed Qararyah
Koç University, Turkey
fqararyah18@ku.edu.tr
Mohamed Wahib
National Institute of Advanced
Industrial Science and Technology,
Japan
mohamed.aia@aist.go.jp
Doğa Dikbayır
Michigan State University, USA
dikbayir@msu.edu
Mehmet Esat Belviranli
Colorado School of Mines, USA
belviranli@mines.edu
Didem Unat
Koç University, Turkey
dunat@ku.edu.tr
Abstract
We propose ParDNN, an automatic, generic, and non-intrusive
partitioning strategy for large DNN models that do not fit
into a single device’s memory. ParDNN decides a placement of
the operations of a DNN’s underlying computational graph across
multiple devices so that the devices’ memory constraints
are met and the training time is minimized. ParDNN is
completely independent of the deep learning aspects of a
DNN and requires no modification at either the model level or
the systems-level implementation of operation kernels.
It partitions DNNs having billions of parameters and hundreds
of thousands of operations in seconds to a few minutes.
Our experiments with TensorFlow on 16 GPUs demonstrate
efficient training of 5 very large models while achieving
superlinear scaling for both the batch size and training
throughput. In comparison to related work (Mesh-TensorFlow and
Gradient Checkpointing), ParDNN either outperforms or
qualitatively improves upon them.
Keywords DNN, graph partitioning, model parallelism
1 Introduction
DNN models have doubled in size roughly every 2.4 years [16]
and this growth is expected to continue in the coming years [2,
50]. Larger models, deeper or wider or both, produce results
with higher accuracy on more complex tasks. However, they
come at a high memory cost required to store the parameters
and the intermediate results for both training and inference
[53]. For example, in computer vision, considering Wide
Residual Network [70], a widened variant of the well-known
Resnet [19], widening the model 8 times increases the num-
ber of parameters ∼ 60 times [70] leading to a substantial
increase in the memory requirements. The same trend shows
up in the NLP field, where deep-stacked LSTMs [66] or attention
layers [61] often give more accurate results compared to
shallower models, but these newer models push the number
of parameters up to O(10B) [50, 55].
Dierent approaches have been proposed to tackle the
issue of training very large models on multiple devices. One
approach is to work on the model level, where the model is
partitioned across multiple devices through model, pipeline,
channel parallelism, or combinations of them [12, 15, 21, 27,
28, 43, 55]. Even though these methods are successful to
some extent, they suer from either: (a) being not generic
as they target a specic class of DNNs, (b) introduce non-
negligible memory overhead to maintain the statistical ef-
ciency, or (c) can incur a high implementation cost and
necessitate detailed understanding of the DNN model for an
accurate cost model. Another approach works at the systems
level by partitioning the computational graph that represents
the operations in a neural network model and distributes it
over multiple devices. However, the method proposed in [64]
has a restricted applicability because it relies on a descriptive
language to specify computations and cannot describe all
the operations used in DL. Others propose a reinforcement
learning-based approach, which is impractical in many cases
due to substantial resource and time requirements [41, 42].
We adopt the system-level approach and propose a generic,
efficient, and non-intrusive partitioning strategy (ParDNN)
that avoids the drawbacks of the related work. ParDNN
directly works on the computational graph representation
of the neural network adopted by the most popular
general-purpose DL frameworks such as TensorFlow [1] and MXNet
[9]. Operating on the graph level has three main benefits.
First, it provides a fine-grained view of the model, which
gives more parallelization options and allows better load
balancing and resource utilization. Second, it isolates our
strategy from the details of the learning process, which
provides more generality and guarantees unaffected statistical
efficiency [43] of the model. Third, working at the level of
the graph enables us to leverage decades of work on graph
partitioning and static scheduling (as will be discussed later).
ParDNN’s strategy is composed of two main steps. First,
we cluster the operation-nodes of the computational graph
into K partitions, where K represents the number of
available devices. The objective of this step is to reduce the
end-to-end runtime by assigning the operations to the
partitions such that the computational loads are balanced and the
communication is minimized. In the second step, we check
whether the memory constraints are met in each partition.
If they are not, we reassign some operations to different
partitions such that the reassignment perturbs the placement
generated by the first step as little as possible while
meeting the memory constraints.
arXiv:2008.08636v1 [cs.DC] 19 Aug 2020
Most existing graph partitioning libraries are designed to
handle undirected graphs. State-of-the-art graph partitioning
tools, such as the Scotch static mapper [45, 46] and the MinCut
optimizer, result in a 2 to 10 times slowdown when applied to
the directed graphs of DL models [41, 42]. Our algorithm outline
is inspired by the principle of the multilevel approach used in
graph partitioning [30], but the design and algorithmic details
of ParDNN include a mix of variants of static scheduling
heuristics [31] that are adapted to reduce the time complexity,
and novel techniques to address some shortcomings in
the existing ones [40, 49]. Our contributions are:
• We propose a novel computational graph partitioning method
that enables training models with large memory consump-
tion on a set of devices with limited memory.
• We conduct extensive experiments with large DNNs to
demonstrate ParDNN’s efficiency. In comparison to related
work: (a) ParDNN’s performance is comparable to
that of Mesh-TensorFlow, a state-of-the-art distributed
training framework [54], while having the qualitative
advantages of automating the partitioning and not requiring
a model rewrite. (b) It generally outperforms redundant
recomputation methods (Gradient Checkpointing [47]). (c)
It outperforms out-of-core methods (CUDA Unified Memory).
• For models that do not fit into a single GPU’s memory,
ParDNN enables training models having up to 5.1 billion
parameters using only 4 GPUs. For models that barely
fit into a single device’s memory, it allows more efficient
training by superlinearly scaling the batch size and, in
many cases, the training throughput.
• ParDNN’s overhead is negligible. For a graph having
hundreds of thousands of nodes representing DNNs with
billions of parameters, it takes ∼2 minutes to find a partition
for 16 GPUs, while training these models takes days or
even weeks.
• To the best of our knowledge, ParDNN is the first of its
type that permits the training of models that do not fit
into a single device’s memory while being generic due to
(a) having zero dependency on, and requiring no knowledge
about, the DL aspects of the models, and (b) not requiring
any modifications of the model or operation kernels.
2 Background
Many DL frameworks model a computation as a directed
graph [1, 5, 9]. TensorFlow uses a stateful directed graph to
represent the computational flow of operations. It extends
the classical dataflow graph model to allow maintaining and
updating the persistent state of some special nodes, branching,
and loop control. In a TensorFlow graph G = (V, E), each
node n ∈ V represents the instantiation of an operation (e.g.,
matrix multiplication or convolution) and it has zero or more
inputs and zero or more outputs. Each edge e ∈ E represents
a dependency between its incident nodes. Normal edges
represent dataflow between the nodes, while special edges, e.g.,
control dependencies, are used to enforce happens-before
relationships with no data flowing along them [1].
Figure 1. ParDNN Overview
[Figure 1 labels: Deep Learning Model (Python, C/C++, Java); Computational Graph; Offline Profiling; Cost Model (Compute Times, Communication Times (estimation), Memory Consumption Estimations); Scheduler Emulator; ParDNN; Mapping (NodeID to DeviceID); Deep Learning Distributed Execution Engine; Host and Devices.]
Graph partitioning is, in general, defined as splitting the
graph G = (V, E) into K disjoint subsets [6]. The constrained
version of graph partitioning aims at partitioning in such
a way that the sums of the vertex weights in each set are
as equal as possible, and the sum of the weights of edges
crossing between sets is minimized [30]. An extension of
general graph partitioning that aims to assign a set of
communicating tasks to processors is called static mapping [6].
Static mapping does not consider the logical and temporal
dependencies of the tasks; it is assumed that all the tasks
coexist simultaneously throughout the program execution.
Finding a spatial and temporal assignment of the set of
nodes in a task graph G = (V, E) onto a set of processors
resulting in the fastest possible execution, while respecting
the precedence constraints expressed by all e ∈ E, is referred
to as the task scheduling problem [56]. The schedule length, or
makespan, is the completion time (Ct) of the last node in G,
assuming that the graph execution starts at time 0. The goal
is to minimize Ct_max, where Ct_max = max_{n ∈ V} Ct(n). Finding
an optimal schedule or static mapping is NP-hard [6, 56].
3 ParDNN: A DNN Partitioning Strategy
ParDNN oers a practical, non-intrusive, and generic method
to partition a DNN on a set of processing elements (PE).
e main objective of ParDNN is to minimize Ctmax , the
makespan of the computational graph, while satisfying the
memory capacity constraints of the target processing ele-
ments. It is important to mention that ParDNN does not
have a runtime component. All the steps of ParDNN are
done ahead of time. Aer running ParDNN once, the result-
ing partitioning can be used as long as the model parameters
that aect the memory consumption do not change.
Table 1. Terminology used in this work
Term             Description
G = (V, E)       Computational graph with vertex set V and edge set E
CP               Critical path of a graph
Ct_max           Makespan of G, schedule length
PE, pe           Set of processing elements, a processing element
K                Number of processing elements (e.g., # of GPUs)
comp(n)          Weight of a node n (compute time)
mem(n)           Memory consumption of the outputs of a node n
comm(e)          Cost of an edge e (communication time)
sc               Secondary cluster, which is a node or a path
comm(sc)         Total communication cost incurred by all edges that have one end in sc
tl(n)            Node top level: length of the costliest path between the source node of the graph and the node n, excluding the node n. The length of a path p is the sum of the computation costs of the nodes on the path and the communication costs of its edges: ∑_{n∈p} comp(n) + ∑_{e∈p} comm(e)
bl(n)            Node bottom level: length of the costliest path between n and the sink node, including the node n
w_lvl(n)         Node weighted level: tl(n) + bl(n)
span(sc)         Time between the expected finish time of the last parent of the first node in sc and the expected starting time of the first child of the last node in that path. Last and first here mean topologically.
potential(sc)    Sum of the weights of all nodes that can be executed within span(sc)
st(n)            Starting time of node n: the time when n is assigned to a pe to execute
ft(n)            Finish time of node n: the time when pe is done executing n
M_cons(pe, t)    Memory consumed by the processing element pe at time t
M_pot(n, t)      Memory potential of a node n at time t: the sum of the memory occupied by the outputs of n’s direct ancestors that are executed before t and for which n is the last direct descendant in its pe, plus n’s memory consumption if st(n) ≤ t ≤ ft(n)
Figure 1 shows the overall process. ParDNN takes a
computational directed acyclic graph as input and annotates
this graph with computation, communication, and memory
consumption information gathered using offline profiling.
ParDNN splits the graph into parts to be mapped to
processing elements. ParDNN outputs the mapping information to
be used by the execution engine of the DL framework (e.g.,
TensorFlow).
Our algorithm is divided into two major steps. Step-1
aims to obtain a partitioning that has a minimal makespan.
Step-1 is further divided into three stages. Stage-I, graph
slicing, splits the graph into K disjoint primary and S
disjoint secondary clusters. This splitting enables working at
a coarser level in the upcoming stages. Stage-II, mapping,
merges these S secondary clusters into the K primary
clusters by first merging the clusters that have no parallelism
gain, then merging the remaining secondary clusters using a
novel load balancing algorithm. The final stage of Step-1 is a
refinement of the mapping through path swapping and node
switching. In Step-2, the result from Step-1 is validated
against the memory constraints of the given devices. If the
constraints are satisfied, the partition will be the final output.
Otherwise, the partition is refined until the memory
consumption of a processing element pe at any time
t ∈ [0, Ct_max] is less than or equal to pe’s memory capacity.
Next, we explain the details of each step. Table 1
summarizes the terms and notations used in the explanations.
Algorithm 1 Graph Slicing
In: K, Graph G ▷ number of devices, DNN graph
Out: pri_clusters[ ], sec_clusters[ ] ▷ initially empty
1: j ← 1
2: w_lvls ← compute_weighted_levels(G)
3: while G ≠ ∅ do
4:   heaviest_path ← find_heaviest_path(G, w_lvls)
5:   if j ≤ K then
6:     pri_clusters[j] ← heaviest_path
7:     w_lvls ← compute_weighted_levels(G)
8:   else
9:     sec_clusters[j − K] ← heaviest_path
10:   end if
11:   G ← G − {heaviest_path}
12:   j ← j + 1
13: end while
3.1 Step-1: Partitioning To Minimize Makespan
Before presenting the details of this step, it is important to
point out its distinction from both static task scheduling and
static mapping. Unlike scheduling algorithms, we do not
specify an order of task execution; rather, we focus on
spatially allocating tasks to a set of processors while addressing
the locality-parallelism trade-off. The order-of-execution
decision is left to the runtime dynamic scheduler, e.g., the
TensorFlow scheduler. Unlike static mapping, ParDNN considers
the logical and temporal dependencies between the tasks.
The size, |V|, of a DNN’s computational graph is usually
in the order of hundreds of thousands and is projected to
grow to millions of operation-nodes [50]. To have a scalable
Step-1, we follow the concept of the multilevel method [6], where
we group vertices together and deal with groups of vertices
rather than individuals. This reduces the problem size and
allows our heuristics to be applied within a reasonable time.
Step-1 is designed in three main stages.
3.1.1 Graph slicing
This stage groups the nodes of the graph into disjoint clusters.
It iteratively finds the critical path (CP) in the graph and
removes the CP’s nodes and their incident edges from the graph
by marking them as visited so that they are not explored in
the following iterations. This is repeated K times, resulting in
K primary clusters, which are the initial partitions assigned
to different processing elements. Hence, the terms primary
cluster and pe are going to be used interchangeably. After
finding those primary clusters, if there are leftover nodes,
we group them into secondary clusters. A secondary cluster,
which is a linear cluster [56], is either a single node or a
path. All the secondary clusters are identified and tagged
until there is no node left in the graph that is not part of any
cluster. Figure 2(b) shows an example.
Algorithm 1 shows the pseudo-code of graph slicing,
which takes K and graph G as inputs and outputs primary
and secondary clusters. Line 2 computes the weighted level
(w_lvl(n)) for all nodes in the graph. The heaviest path (Line
4) is the CP when the w_lvl(n) values are recalculated. Finding the
heaviest path is done by traversing the graph using the
computed w_lvls as priorities until reaching a dead-end. After
forming a CP, it is added to the primary clusters and its nodes
and edges are removed from the graph (Line 11). Unlike
linear clustering [31], we obtain only K CPs, then we
stop recalculating w_lvl(n) for the secondary clusters since
computing weighted levels is expensive. When weighted
levels are not recalculated, find_heaviest_path may not return
a CP; rather, it returns a path of heavy cost. This aims at
capturing dependent and heavily-communicating nodes in
the same cluster to increase locality. If a path cannot be
obtained, it returns a single node.
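The slicing stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary-based graph encoding, and the tie-breaking rule (earlier topological position wins) are assumptions made for illustration. The sketch also refreshes the weighted levels after a primary path has been removed, which is one reading of Line 7 of Algorithm 1.

```python
from collections import defaultdict


def weighted_levels(order, succ, pred, comp, comm, alive):
    """Return w_lvl(n) = tl(n) + bl(n) for every node still in the graph."""
    tl, bl = {}, {}
    for n in order:  # forward pass: costliest path from a source, excluding n
        if n in alive:
            tl[n] = max((tl[p] + comp[p] + comm[p, n]
                         for p in pred[n] if p in alive), default=0)
    for n in reversed(order):  # backward pass: costliest path to a sink, including n
        if n in alive:
            bl[n] = comp[n] + max((comm[n, s] + bl[s]
                                   for s in succ[n] if s in alive), default=0)
    return {n: tl[n] + bl[n] for n in tl}


def slice_graph(order, comp, comm, k):
    """order: nodes in topological order; comm: {(u, v): cost} for every edge."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in comm:
        succ[u].append(v)
        pred[v].append(u)
    pos = {n: i for i, n in enumerate(order)}
    alive = set(order)
    w_lvl = weighted_levels(order, succ, pred, comp, comm, alive)
    primary, secondary = [], []

    def heaviest(candidates):
        # Highest weighted level first; earlier topological position breaks ties.
        return min(candidates, key=lambda n: (-w_lvl[n], pos[n]))

    while alive:
        node = heaviest(alive)
        path = [node]
        while True:  # follow the heaviest remaining successor until a dead end
            nxt = [s for s in succ[node] if s in alive]
            if not nxt:
                break
            node = heaviest(nxt)
            path.append(node)
        alive.difference_update(path)
        if len(primary) < k:
            primary.append(path)
            # Levels are refreshed only while primary clusters are being formed.
            w_lvl = weighted_levels(order, succ, pred, comp, comm, alive)
        else:
            secondary.append(path)
    return primary, secondary


# Hypothetical six-node graph used only to exercise the sketch.
comp = {"a": 1, "b": 4, "c": 1, "f": 1, "d": 2, "e": 1}
comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1,
        ("c", "f"): 1, ("f", "d"): 1, ("d", "e"): 1}
primary, secondary = slice_graph(["a", "b", "c", "f", "d", "e"], comp, comm, k=1)
```

With K = 1 the critical path a, b, d, e becomes the only primary cluster and the leftover path c, f becomes a secondary cluster.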
3.1.2 Mapping
is stage aaches the secondary clusters to the primaries
with the objective of balancing the load among partitions
and reducing communication. First, initial merging is ap-
plied to some of the secondary clusters for which executing
them in parallel is not advantageous. For example, in Fig-
ure 2(c), the cluster {J } is merged with a primary because
the total amount of communication (comm) incurred by the
nodes in cluster {J } can not be covered by its potential ({J }).
Intuitively, the potential of a cluster measures how much
parallel work is in the cluster at the time of its execution and
whether or not that work is sucient to totally hide its com-
munication. In other words comm({J }) − potential ({J }) > 0,
hence there is no gain from assigning it to a distinct pe . Such
a cluster is merged with the primary cluster with which it
communicates the most.
Second, we apply a level-aware load balancing technique
at which we merge the secondary clusters that are not merged
by the initial merging. is process is referred to as clus-
ter mapping in the scheduling literature. ere are some
heuristics such as wrap cluster merging [67], list scheduling
based cluster assignment [52], and Guided Load Balancing
(GLB) [49]. In a comprehensive evaluation of scheduling and
cluster merging algorithms in [62], GLB is shown to produce
the best result. However, GLB assumes that its preceding
clustering step has eliminated the largest communication
delays. As a result, the communication delays are not con-
sidered for cluster mapping [49]. Ignoring communication
cost results in a low-quality mapping when the graph be-
comes very large. Even if each inter-cluster communication
is small, the cumulative eect becomes considerable. In addi-
tion, the load balancing is global rather than time dependent
(temporal). is issue is demonstrated in Figure 2(d) and (e).
Ignoring the temporal aspect of load balancing causes GLB to
make locally sub-optimal decision for cluster {D}. It assigns
it to the less loaded pe , yet that pe has more work within the
span({D}). is assignment paern is not suitable especially
for TensorFlow graphs with frequent forks and joins, where
the local and the global loads become more uncorrelated.
We propose a novel time-ecient heuristic called Level-
Aware Load Balancing (LALB). LALB considers both commu-
nication minimization and the temporal load balancing. We
Algorithm 2 Mapping
InOut: pri clusters[ ] . primary clusters
In: sec clusters[ ] . secondary clusters
1: for sc ∈ sec clusters do
2: if comm(sc ) − potential (sc ) > 0 then
3: tarдet ← nd most comm(sc , pr i clusters )
4: tarдet ← tarдet + {sc }
5: sec clusters ← sec clusters − {sc }
6: end if
7: end for
8: comps[ ]← ϕ , comms[ ]← ϕ
9: while sec clusters 6= ϕ do
10: sc ← remove next secondary(sec clusters )
11: comps ← calc work at span(span(sc ), pr i clusters )
12: comms ← calc comms with(sc , pr i clusters )
13: tarдet ← nd minimal(comps, comms )
14: tarдet ← tarдet + {sc }
15: end while
temporally balance the loads by considering the workload of
every pe within span(sc), where sc is the secondary cluster
that is going to be merged with one of the primary clusters.
sc is mapped to a pe that has the minimal computational load
within the span(sc), and minimizes the incurred communica-
tion with the other processing elements. Equation (1) shows
the selection criteria. In case of ties, we assign sc to the pe
which has the highest communication value with it.
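$$\min_{pe \in PE} \Bigg( \sum_{\substack{n \in pe, \\ tl(n) \in span(sc)}} comp(n) + \sum_{\substack{(n,u) \in E, \\ n \in \{PE\} - pe, \\ u \in sc}} comm(n,u) + \sum_{\substack{(u,n) \in E, \\ n \in \{PE\} - pe, \\ u \in sc}} comm(u,n) \Bigg) \tag{1}$$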
Algorithm 2 shows the two mapping procedures. Lines 1-7
perform the initial merging of the secondary clusters into the
primaries. The while loop (Line 9) applies our novel load balancing.
Most of the time is spent calculating the work within the span
of the target secondary cluster sc in each of the primary
clusters (Line 11). We model this part as a problem of frequent
range queries with updates. More specifically, we find the
sum of the weights of the nodes whose levels fall in the span,
and upon merging, the weights of those levels are updated.
We use binary indexed trees [13] as the data structure, where
the tree nodes store the weights per level. This data
structure allows logarithmic-time range summation and value updates.
Line 12 calculates the cost of communication between the
secondary cluster sc and each of the primary clusters. Line 13
applies the selection criterion defined in Eqn (1) to select
the best primary cluster to merge sc with.
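The range-query formulation can be sketched as follows. This is an illustrative Python sketch rather than the authors' code: the class name LevelLoad, the function pick_primary, and the assumption that levels have been discretized into integer indices are all hypothetical. Each primary cluster keeps one binary indexed tree of per-level weights, and the selection follows Eqn (1), with ties going to the primary that communicates most with sc.

```python
class LevelLoad:
    """Binary indexed tree holding the total node weight per (integer) level."""

    def __init__(self, num_levels):
        self.tree = [0.0] * (num_levels + 1)

    def add(self, level, weight):
        """Add weight to one level in O(log num_levels)."""
        i = level + 1
        while i < len(self.tree):
            self.tree[i] += weight
            i += i & -i

    def prefix(self, level):
        """Sum of the weights of levels 0..level (0 when level is -1)."""
        i, total = level + 1, 0.0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def range_sum(self, lo, hi):
        """Work stored in levels lo..hi, both inclusive."""
        return self.prefix(hi) - self.prefix(lo - 1)


def pick_primary(loads, span, comm_with):
    """Index of the primary minimizing work-in-span plus cross-device communication.

    comm_with[p] is the communication between sc and primary p; merging sc into
    p leaves the communication with every other primary to be paid.
    """
    lo, hi = span
    total_comm = sum(comm_with)

    def key(p):
        cost = loads[p].range_sum(lo, hi) + (total_comm - comm_with[p])
        return (cost, -comm_with[p])  # tie: prefer the heaviest communicator

    return min(range(len(loads)), key=key)


# Hypothetical two-primary example with eight levels.
loads = [LevelLoad(8), LevelLoad(8)]
loads[0].add(2, 5.0)
loads[0].add(6, 1.0)
loads[1].add(0, 1.0)
loads[1].add(3, 2.0)
target = pick_primary(loads, span=(2, 4), comm_with=[1.0, 0.0])
loads[target].add(3, 4.0)  # merging sc updates the chosen primary's levels
```

In this example primary 0 is globally less communication-heavy to merge with, but it already holds 5 units of work inside the span, so primary 1 is chosen despite the extra unit of communication.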
3.1.3 Renement
is stage renes the partitioning with two renement poli-
cies. e rst is responsible for coarse-grained renement
at the secondary cluster level and the second does the ne-
grained renement at the node level. e rst policy searches
for a secondary cluster sc for which there is another sec-
ondary cluster sc ′ within its span that when swapped with
sc , it results in beer quality partitioning. e beer quality
comes from either enhancing the load balancing or reducing
the total communication, or both.
Figure 2. In the computational graph, vertex and edge weights indicate computation and communication costs, respectively. (a) Original
computational graph; the source node (A) and sink node (L) are added by ParDNN. (b) The slicing stage when there are two pe(s). Clusters
are found in the following order: {A, B, E, G, I, K, L}, {C, F}, {J}, {D}, {H}. The first two are primary clusters; the other three are secondaries.
(c) Cluster {J} is merged into a primary cluster in initial merging since it has a communication of 5 units that cannot be hidden by the work
within its span. (d) and (e) show the LALB (ours) and GLB load balancing algorithms, respectively, after initial merging. The makespan of the
LALB output is 13 units, while that of GLB is 15; thus LALB results in a 15% performance gain.
[Figure 2, panels (a) to (e): the same graph in each panel, with node computation weights B 2, C 2, D 1, E 1, F 2, G 1, H 1, I 3, J 1, K 3; A and L are the added source and sink. The assignment of edge weights to edges is not recoverable from the extracted text.]
The second policy handles a general issue with CP-based
heuristics that is discussed in [40]. This issue arises on the
communication edges among the processing elements. When
there are many costly communicating operations in G, some
of them may fall outside the CP. If their effect is large enough,
they will create heavier CPs in the partitioned graph. Note
that the CP of the graph after partitioning is probably
different from the original CP. For example, in Figure 2(a) the CP
is initially {A, B, E, G, I, K, L}, but after partitioning, the CP
becomes {A, C, F, G, I, K, L} as in Figure 2(d). This is because
the communication between the nodes in the same cluster
is assumed to be zero. We repeatedly find the CP in the
partitioned graph, as in Algorithm 1. Then we check the edges
in that path that connect two different primaries. If moving
a node incident to any of these edges to another primary
shortens the CP, we switch that node’s primary. This process
could be repeated as long as the CP can be shortened, but
since w_lvls needs to be recalculated each time, we choose
to do it at most K times.
3.2 Step-2: Validating Memory Constraints
Similar to Step-1, we handle the memory constraints
statically ahead of time for two reasons: (a) to avoid any runtime
overhead and (b) to reduce the chance of conflicting with
other runtime optimizations. Our approach is implemented
separately from the memory management module of a DL
framework, and could be seamlessly used with any dynamic
optimization policies provided by the framework. Step-2 is
further divided into three stages.
3.2.1 Scheduler Emulator
To address the memory consumption statically, temporal
modeling of the allocation and deallocation patterns is
required. Such modeling necessitates knowledge about
scheduling in the DL framework to estimate when an operation
is going to start and finish execution and, consequently, when
the memory allocated for the operation inputs is released
and when new memory is allocated to hold the operation
outputs. To estimate those values, we emulate the
TensorFlow scheduler. The TensorFlow scheduler maintains a ready
queue that initially contains nodes with no ancestors. Each
node has an in-degree representing the number of nodes it
depends on. The nodes are executed in FIFO order. Once a node is
executed, the in-degrees of its children are decremented by
one. Any node whose in-degree reaches zero is pushed
to the queue. Using the per-node running times and
communication sizes collected from profiling, we emulate this
behaviour to get the expected start and end times of the
operations under a certain partitioning.
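The queue discipline described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not TensorFlow's scheduler or the authors' emulator: it uses a single FIFO ready queue, lets each device run one operation at a time, and charges a communication delay only on edges that cross devices. The function name emulate and the input encoding are hypothetical.

```python
from collections import defaultdict, deque


def emulate(nodes, comm, comp, placement):
    """Return (st, ft): estimated start and finish times for every node.

    nodes: iterable of node ids; comm: {(u, v): delay}; comp: {n: run time};
    placement: {n: device id}.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in comm:
        succ[u].append(v)
        pred[v].append(u)
    indeg = {n: len(pred[n]) for n in nodes}
    ready = deque(n for n in nodes if indeg[n] == 0)  # nodes with no ancestors
    free_at = defaultdict(float)  # time at which each device becomes idle
    st, ft = {}, {}
    while ready:
        n = ready.popleft()  # FIFO order
        # Inputs arrive when each parent finishes, plus a transfer delay
        # if the parent was placed on a different device.
        arrival = max((ft[p] + (comm[p, n] if placement[p] != placement[n] else 0)
                       for p in pred[n]), default=0)
        st[n] = max(arrival, free_at[placement[n]])
        ft[n] = st[n] + comp[n]
        free_at[placement[n]] = ft[n]
        for child in succ[n]:
            indeg[child] -= 1
            if indeg[child] == 0:
                ready.append(child)
    return st, ft


# Hypothetical diamond-shaped graph placed on two devices.
comp = {"a": 1, "b": 2, "c": 3, "d": 1}
comm = {("a", "b"): 1, ("a", "c"): 2, ("b", "d"): 1, ("c", "d"): 1}
placement = {"a": 0, "b": 0, "c": 1, "d": 0}
st, ft = emulate(["a", "b", "c", "d"], comm, comp, placement)
makespan = max(ft.values())
```

Here node c waits for the cross-device transfer from a before it can start, and d waits for c's output to come back, so the estimated makespan is 8 time units.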
3.2.2 Tracking Memory Consumption
In TensorFlow, from the memory consumption perspective,
operation-nodes broadly fall into three main categories. First,
operations whose data survives across iterations [1],
which we refer to as residual nodes (res_ns). Second,
special operations that mutate the referenced tensor of the
first type, which we refer to as reference nodes (ref_ns). Those
operations do not reserve any additional memory. However,
they are co-located with the variables that they are mutating
and must be moved together with their referred variable
nodes. Third, operations that require additional memory
proportional to their output size, which we call normal
nodes (nor_ns). Memory for the output of these nodes is
allocated upon scheduling and released once all their direct
descendants are executed. This third type covers most
TensorFlow operations, such as matmul, conv, and add. There
is also temporary memory allocated for an operation’s local
variables. This memory is released immediately once the
operation has executed.
One might think that profiling solely peak memory
footprints would be sufficient to predict the memory overflows.
However, to handle an overflow, nodes have to be moved
between partitions and that in turn changes the schedule
and the memory consumption at a certain time. Our cost
model takes this dynamic behavior into account and models
the interplay between the scheduler and memory usage.
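$$M_{cons}(pe, t) = \sum_{\substack{n \in pe, \\ n \in res\_ns}} mem(n) + \sum_{\substack{n \in pe, \\ n \in nor\_ns, \\ st(n) \le t \le ft(n)}} mem(n) + \sum_{\substack{n \notin (pe \cap res\_ns), \\ ft(n) \le t, \\ \exists (n,u) \in E :\, st(u) \ge t,\ u \in pe}} mem(n) \tag{2}$$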
Eqn (2) denes the memory consumption of a pe at time t
as Mcons (pe, t ). e rst term is the memory consumption of
the res ns assigned to that pe . e second term indicates the
memory consumption of the normal nodes that have started
on that pe at ≤ t and are being executed at t . e third
indicates the nodes that have descendants assigned to that
pe and the descendants’ expected starting time is ≥ t , and
those nodes have nished at ≤ t at any processing element
except pe , or are non-residual that have nished at ≤ t on
that pe .
The overall memory consumption needs to be estimated
for each node (|V| time points) because the change in
memory consumption is triggered by node executions. Once a
node starts executing, new memory space needs to be
allocated and that may cause an overflow. Estimating memory
consumption is done by visiting all the nodes in the graph
in the order of their estimated starting times, which are
obtained from the scheduler emulator, and keeping track of
the accumulated memory consumption. In the same pass,
the memory potential values of the nodes (M_pot in Table 1)
are obtained. A node’s memory consumption is added to the
cumulative value once it is visited, and subtracted after its
last descendant is visited unless it is a res_ns.
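The single pass described above can be sketched as follows. This Python sketch is a simplification for illustration and not the authors' cost model: it charges each output only to the device that produced it, so the cross-device copies covered by the third term of Eqn (2) are ignored, and it does not compute M_pot. The function name peak_memory and the input encoding are assumptions.

```python
from collections import defaultdict


def peak_memory(order, edges, placement, mem, residual):
    """Estimate the peak memory per device.

    order: nodes sorted by emulated start time; edges: iterable of (u, v);
    placement: {n: device id}; mem: {n: output size}; residual: set of res_ns.
    """
    pred = defaultdict(list)
    pending = defaultdict(int)  # direct descendants not yet visited
    for u, v in edges:
        pred[v].append(u)
        pending[u] += 1
    current, peak = defaultdict(int), defaultdict(int)
    for n in order:
        pe = placement[n]
        current[pe] += mem[n]  # output is allocated when the node is visited
        peak[pe] = max(peak[pe], current[pe])
        for p in pred[n]:
            pending[p] -= 1
            # A non-residual output is released once its last direct
            # descendant has been visited; residual outputs are never freed.
            if pending[p] == 0 and p not in residual:
                current[placement[p]] -= mem[p]
    return dict(peak)


# Hypothetical four-node chain on one device; "a" plays the role of a variable.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
mem = {"a": 4, "b": 3, "c": 2, "d": 1}
placement = {n: 0 for n in mem}
with_variable = peak_memory(["a", "b", "c", "d"], edges, placement, mem, {"a"})
without_variable = peak_memory(["a", "b", "c", "d"], edges, placement, mem, set())
```

Treating a as residual keeps its 4 units allocated for the whole iteration, which raises the peak from 7 to 9 units in this example.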
3.2.3 Addressing Overow
Aer estimating the memory consumption, we traverse the
graph starting from the sink node and keep the nodes in a
heap data structure, namely nodes heap. When the memory
consumed exceeds the limit, we deal with the overow as a
0-1 min-knapsack problem [11]. e min-knapsack problem
is formulated as follows; given N pairs of positive integers
(c j , aj ) and a positive integer O , nd x1, x2, …, xN so as to:
$$\text{minimize} \ \sum_{j=1}^{N} c_j x_j \quad \text{s.t.} \quad \sum_{j=1}^{N} a_j x_j \ge O \ \ \text{and} \ \ x_j \in \{0, 1\} \tag{3}$$
In our case, O represents the amount of memory overflow,
and a_j represents M_pot(n, t). For the cost c_j, we use the
sum of the node computation cost and the communication
with its direct ancestors and descendants located on the same
pe, as shown in Eqn (4), which defines move_cost.
$$move\_cost(n) = comp(n) + \sum_{\substack{u :\, (u,n) \in G, \\ u \in pe(n)}} comm(u,n) + \sum_{\substack{v :\, (n,v) \in G, \\ v \in pe(n)}} comm(n,v) \tag{4}$$
Table 2. Complexity of Each Step of ParDNN
Step / Stage                                Complexity
Step-1: Partitioning to Minimize Makespan
  Graph Slicing (inc. sorting)              O(K(|V| + |E|))
  Mapping                                   O(|V| log |V|)
  Refinement                                O(K(|V| + |E|))
Step-2: Satisfying Memory Constraints
  TensorFlow Scheduler Emulator             O(|V| + |E|)
  Tracking Memory Consumption               O(|V|)
  Addressing Overflow                       O(|V|^2)
Overall complexity of ParDNN                O(|V|^2)
The idea behind move_cost is that when a node is moved
from one pe to another, it incurs a computational load
imbalance proportional to its weight and extra communication
proportional to its communication with the nodes assigned
to the same pe. Our goal is to find a set of operation-nodes
such that the sum of their memory consumption potentials
at the overflow time is ≥ the overflow while their total
movement cost is minimized. The movement criterion is to pick the
node with the lowest move_cost/M_pot(n, t). In other words,
we choose the node that alleviates the overflow while
incurring the least amount of communication and computation
imbalance.
The nodes_heap is a min heap in which the move_cost is the
ordering key. To avoid choosing a node that has a low
movement criterion but a high move_cost, each node for which
M_pot(n, t) > overflow is inserted in another heap in which the
sorting key is move_cost. When selecting, the top node is
removed from both heaps, the one with the least move_cost
is chosen, and the other is returned to its heap. The selected
node is moved to another pe if the target pe has sufficient
memory to accommodate that node’s memory potential.
Otherwise, the node is not considered again and another node is
picked from the heap. The algorithm terminates when either
the overflow is completely eliminated or we run out of nodes
without addressing it.
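A greedy two-heap selection of this kind might look like the following Python sketch. It is an interpretation rather than the authors' implementation: it assumes the first heap is ordered by the movement criterion move_cost/M_pot(n, t) and the second by move_cost, it treats the candidates' costs and memory potentials as fixed inputs, and it models a single target pe with a given amount of free memory. The name relieve_overflow and the input encoding are hypothetical.

```python
import heapq


def relieve_overflow(candidates, overflow, free_on_target):
    """Pick nodes to move until the overflow is covered.

    candidates: {node: (move_cost, m_pot)} with m_pot > 0.
    Returns the list of moved nodes, or None if the overflow cannot be removed.
    """
    by_ratio = [(cost / m_pot, n) for n, (cost, m_pot) in candidates.items()]
    # Nodes that could remove the whole overflow alone compete on raw cost.
    by_cost = [(cost, n) for n, (cost, m_pot) in candidates.items()
               if m_pot > overflow]
    heapq.heapify(by_ratio)
    heapq.heapify(by_cost)
    considered, moved = set(), []

    def top(heap):
        # Lazily drop entries that were already selected through the other heap.
        while heap and heap[0][1] in considered:
            heapq.heappop(heap)
        return heap[0][1] if heap else None

    while overflow > 0:
        a, b = top(by_ratio), top(by_cost)
        if a is None:
            return None  # ran out of nodes without addressing the overflow
        pick = a if b is None or candidates[a][0] <= candidates[b][0] else b
        considered.add(pick)  # a rejected node is not considered again
        m_pot = candidates[pick][1]
        if m_pot <= free_on_target:
            moved.append(pick)
            overflow -= m_pot
            free_on_target -= m_pot
    return moved


# Hypothetical candidates: "w" has the best ratio but a high absolute cost.
candidates = {"w": (10, 100), "z": (2, 5)}
chosen = relieve_overflow(candidates, overflow=4, free_on_target=50)
```

In this example w wins on the ratio (0.1 against 0.4), yet z alone covers the 4-unit overflow at a fifth of the cost, so the second heap causes z to be selected.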
3.3 Time Complexity of ParDNN
Table 2 summarizes the time complexity of each step of
ParDNN. Detailed explanations for the time complexities of
each step can be found in Appendix A. The reported
complexities of each step are relaxed ones, and for some stages
tighter bounds may be derived with amortized analysis.
Splitting the partitioning strategy into a set of simple, yet efficient,
sub-stages permits lowering the complexity. The nodes are
grouped into clusters in the first step; then, for most of the
later stages, ParDNN works at the cluster rather than the node
granularity, which considerably reduces the instance size it
deals with. In practice, running ParDNN on the DNN
models listed in Table 3 takes up to 2 minutes on a typical laptop
processor, namely an Intel i7-7600u CPU @ 2.80GHz.
Considering that the training time of those models is in the order
of days or even weeks, ParDNN offers an extremely
lightweight and practical approach to partition the computational
graphs of DNNs.
4 Implementation
Our algorithm takes as input the device count, the devices' memory capacities,
the interconnection bandwidth and latency between them, the model computational
graph, profiling data, and operations metadata. The profiling data contain
execution time measurements and the size of the output of each operation-node.
The operation metadata contain the operation types (Section 3.2.2). TensorFlow
standard APIs provide the profiling information, including per-node time,
memory consumption, and communication sizes at the granularity of graph nodes
for regular as well as user-defined operators.
To estimate the memory consumption, we implemented an emulator of TensorFlow's
scheduler described in [1]. It is important to note that if ParDNN is intended
to be used with another DL framework, another emulator can be written to
emulate its scheduler, if needed, without modifying our partitioning algorithm.
When handling memory constraints there is a trade-off between overhead and
accuracy: static handling prioritizes overhead reduction over accuracy, while
dynamic handling targets the opposite. For efficiency and maintainability
reasons, we adopt the static approach. To compensate for ignoring the exact
details of memory management optimizations and allocation, such as
fragmentation and temporary memory for local variables, we spare 10% of the
device memory and constrain ourselves to the remaining 90%. This threshold was
sufficient to successfully run all our experiments without going OOM.
Nevertheless, this ratio might need to be tuned, and it is the only parameter of
ParDNN that needs tuning.
As shown in Figure 1, the output of our algorithm is a single file containing
the operation placement as key-value pairs. Each key is an operation-node name
and the value is the device on which the operation should be allocated. To
control the placement at operation-node granularity, the TensorFlow back-end
reads the node-to-device assignment from the placement file generated by our
algorithm.
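To make the key-value output concrete, the sketch below reads such a placement into a dictionary. The on-disk layout is not specified in this section, so the format used here (one whitespace-separated "node-name device-index" pair per line) and the names `parse_placement` and `nodes_per_device` are assumptions made purely for illustration.

```python
def parse_placement(text):
    """Parse 'node_name device_index' lines into a dict (assumed format)."""
    placement = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'name device'")
        name, device = parts
        placement[name] = int(device)
    return placement


def nodes_per_device(placement):
    """Count how many operation-nodes were assigned to each device."""
    counts = {}
    for device in placement.values():
        counts[device] = counts.get(device, 0) + 1
    return counts


sample = "conv1/kernel 0\nconv1/Conv2D 0\nfc/MatMul 1\n"
placement = parse_placement(sample)
counts = nodes_per_device(placement)
```

A per-device count like this is a quick sanity check that no device was left empty before the placement is handed to the framework.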
ParDNN on multiple nodes: Although ParDNN could be designed to partition a DNN
across multiple nodes, in this work we assume a single node where the
processing elements are identical GPUs connected to a common host. This is
because the number of GPUs per node has been steadily increasing over time. For
instance, systems with 16 or more GPUs per node are in production (e.g. NVIDIA
DGX SuperPOD). As suggested by many state-of-the-art works [21, 50, 55], we
argue that a hybrid approach of data parallelism across compute nodes and
ParDNN inside the compute node is a practical choice. This approach benefits
from the efficiency and non-invasiveness of our method in tackling the memory
capacity issue at the node level, while also harnessing the weak scaling
properties of data parallelism across the nodes.
5 Results
This section is organized into three parts. The first part compares the
performance of ParDNN against related work: explicit model parallelism,
redundant recompute, and an out-of-core method. The second part evaluates the
scaling of ParDNN, and the last part performs overhead and fidelity analysis of
ParDNN. Key findings of each part are as follows:

Table 3. Specifications of Models and Datasets. (C)HSD: (Character) Hidden
State Dimension, SL: Sequence Length, ED: Embedding Dimensions, RU: Residual
Units, WF: Widening Factor, MD: Model Dimension, FS: Filter Size, P_SZ: Patch
Size.

Model / Dataset                      Acronym      #Layers  HSD    SL     #Para. (10^9)  #Graph Nodes
RNN for Word-Level Language [58] /   Word-RNN     8        2048   28     0.34           10578
Tiny Shakespeare [29]                Word-RNN-2   8        4096   25     1.28           10578

                                                  #Layers  CHSD   ED
Character-Aware Neural Language      Char-CRN     8        2048   15     0.23           22748
Models [32] / Penn Treebank          Char-CRN-2   32       2048   15     1.09           86663
(PTB) [38]

                                                  #Layers  #RU    WF
Wide Residual Net. [70] /            WRN          610      101    14     1.91           187742
CIFAR100 [34]                        WRN-2        304      50     28     3.77           79742

                                                  #Layers  HSD    MD
Transformer [61] / IWSLT'16          TRN          24       5120   2048   1.97           80550
German–English corpus [8]            TRN-2        48       8192   2048   5.1            160518

                                                  HSD      FS     P_SZ
Eidetic 3D LSTM [65] /               E3D          320      5      4      0.95           55756
Moving MNIST digits [57]             E3D-2        512      5      8      2.4            55756
• Comparison with Related Work: (i) ParDNN achieves similar performance to the
distributed tensor computation framework Mesh-TensorFlow [54] but provides
much higher user productivity. (ii) ParDNN outperforms Gradient Checkpointing
[7] combined with data parallelism in many cases, yielding up to 2.8x speedup.
More importantly, ParDNN enables training models where applying Gradient
Checkpointing results in out of memory (OOM) even with a batch size of 1.
(iii) ParDNN outperforms CUDA Unified Memory for all configurations and GPU
counts.
• Scaling: (i) For the same number of GPUs, ParDNN enables the use of more than
9x batch size over the maximum possible with data parallelism on average.
(ii) Superlinear speedup in most models and configurations is observed going
from one GPU to 16 GPUs.
• Overhead and Fidelity: (i) Empirical overhead of ParDNN is no more than 2
minutes for the largest model over 16 GPUs. (ii) Replacing any of ParDNN's
steps with other heuristics or using alternative approaches results in a
significant drop in performance or a huge increase in overhead, hence
demonstrating and justifying the design choices and efficiency of ParDNN's
algorithmic steps.
5.1 Environment, Models, and Datasets
We conducted all our experiments on an NVIDIA DGX-2 with 16 Tesla V100 SXM3
32GB GPUs connected via NVSwitch. The throughput measurements are conducted
over the interval between the 100th and the 150th training iterations to get
stable results. We use TensorFlow 1.14 and CUDA 10.0.
We experimented with five large models representing four main tracks of DL
applications: image classification, translation, video prediction, and
language modeling. All models and datasets used in experiments are listed in
Table 3, and
detailed in Appendix A. We focus our analysis on the performance of ParDNN
rather than on accuracy, since ParDNN has no effect on the learning aspect of
the model: ParDNN alters neither the model nor its hyper-parameters.

[Figure 3: three bar-chart panels. Panel titles: "ParDNN speedup over
Mesh-TensorFlow (Transformer Model)", with x-axis "Model:Batch (m:b) Split
Dimensions of Mesh-TensorFlow" on 4 and 8 GPUs; "ParDNN Speedup over Gradient
Checkpointing + Data Parallelism", with several configurations marked "OOM
(Gradient Checkpointing)"; "ParDNN Speedup over CUDA Unified Memory".]
Figure 3. (a) ParDNN speedup over Mesh-TensorFlow, (b) ParDNN speedup over
gradient checkpointing combined with data parallelism to run on multiple GPUs,
(c) ParDNN speedup over CUDA Unified Memory (UM) using larger models. X-axis:
Number of GPUs (Batch Size).
5.2 Comparison with Related Work
We compare ParDNN to three different state-of-the-art approaches used to
circumvent the memory limitation when training DNNs. We compare with (i)
Mesh-TensorFlow [54] for explicit model parallelism, (ii) gradient
checkpointing [7] in combination with data parallelism for redundant
recompute, and (iii) CUDA Unified Memory for out-of-core computing. Although
there exist other graph-based solutions, we cannot directly compare with them,
either because we are not aware of any open source implementation [41] or
because the implementation is available for MXNet only [64]. It is worth
mentioning, however, that ParDNN takes no more than 2 minutes for the largest
configuration we tested, in comparison to tens of hours reported by the other
graph-based methods, in addition to ParDNN working on models 2.3x as large as
what the other methods experimented with [41].
5.2.1 Mesh-TensorFlow
Mesh-TensorFlow [54], an extension to TensorFlow, was proposed to overcome the
memory limitations of a single device and permits specifying a general class
of distributed tensor computations. We compare the performance of ParDNN with
Mesh-TensorFlow using the Transformer model, which the original authors used
to demonstrate the scaling [54]. Figure 3(a) shows the speedup of ParDNN over
Mesh-TensorFlow using 4 and 8 GPUs. We report all permutations [60] possible
with the maximum trainable batch size for Mesh-TensorFlow. ParDNN is on par
with Mesh-TensorFlow. However, unlike Mesh-TensorFlow: (a) ParDNN requires no
knowledge about the DNN structure from the user, while with Mesh-TensorFlow it
is the responsibility of the user to rewrite the model using Mesh-TensorFlow
syntax. (b) ParDNN entirely automates the partitioning, while with
Mesh-TensorFlow users have to manually specify the tensor dimensions to be
split across a multi-dimensional processor mesh, and finding the best
assignment is an NP-hard problem. (c) Mesh-TensorFlow has a non-negligible
pre-run overhead, which doubles when doubling the number of GPUs, reaching
∼1 hour for the 8-GPU assignment.
5.2.2 Redundant Recompute: Gradient Checkpoint
Gradient checkpointing [10] enables DNN training with a sublinear memory cost
(O(√N)) when training an N-layer network by recomputing the activations during
backpropagation, instead of holding the forward pass results. In our
comparison, we use a TensorFlow-based open-source implementation [7].
Figure 3(b) shows the speedup of ParDNN over gradient checkpointing when
combined with data parallelism to run on multiple GPUs. For ParDNN and
checkpointing, we used the common largest possible batch sizes. ParDNN
outperforms gradient checkpointing in most cases. In a few cases, checkpointing
is better than ParDNN; this happens mainly when the degree of parallelism
inherent in the graph is not sufficient to utilize all the GPUs. More
importantly, however, ParDNN is qualitatively superior to gradient
checkpointing since it enables the training of models where checkpointing
fails to make them fit in device memory, even when using a batch size of one.
For example, Figure 3(b) shows several configurations where gradient
checkpointing goes out-of-memory at the batch size of one. Moreover, the
overhead of gradient checkpointing can be up to 5 hours [7].
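The O(√N) memory cost can be illustrated with a simplified cost model, which is an assumption of this sketch and not the accounting used by the implementation in [7]: if one activation is kept every `segment` layers, the peak holds roughly N/segment checkpoints plus the `segment` activations of the block currently being recomputed. The function names are hypothetical.

```python
import math


def peak_activations(n_layers, segment):
    """Stored checkpoints plus the activations of one recomputed segment."""
    return math.ceil(n_layers / segment) + segment


def best_segment(n_layers):
    """Segment length that minimizes the peak under this simplified model."""
    return min(range(1, n_layers + 1),
               key=lambda s: peak_activations(n_layers, s))


n = 100
s = best_segment(n)            # 10, i.e. sqrt(100)
peak = peak_activations(n, s)  # 20 activations instead of 100
```

Under this model the minimum sits at a segment length of √N, which gives a peak of about 2√N activations.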
5.2.3 Out-of-core: CUDA Unified Memory
Figure 3(c) shows the speedup of ParDNN over CUDA Unified Memory (UM). UM, to
the authors' knowledge, is the only out-of-core solution that has an available
TensorFlow implementation. In all cases, ParDNN throughput improves going from
4 to 16 GPUs while increasing the batch size. UM performance in this case
degrades when increasing the batch size due to the page faulting penalty [3].
5.3 Scaling Studies
We experimented with models under two main use-cases of ParDNN. The first is
model instances that fit into a single device memory only with very small
batch sizes. Small here is relative to the numbers used by the DL community
and reported in the literature. In such a case, ParDNN provides a qualitative
advantage over data parallelism (DP), which splits the input over different
GPUs that hold the replicas of the model. The second use-case is model
instances that do not fit into a single GPU memory even with small batch
sizes. These are larger variants of each model in Figure 4(a).

[Figure 4(a): table of maximum batch sizes and increase over ideal DP.]

              Batch Size Scaling (#GPUs)        Increase Over Ideal DP (#GPUs)
Model         1     2     4     8     16        1     2     4     8     16
Word-RNN      16    512   1024  2048  2048      1x    16x   16x   16x   8x
Char-CRN      8     256   512   1024  2048      1x    16x   16x   16x   16x
WRN           1     4     16    32    64        1x    2x    4x    4x    4x
TRN           1     32    64    128   256       1x    16x   16x   16x   16x
E3D           1     4     16    32    64        1x    2x    4x    4x    4x
Word-RNN-2    -     -     32    1024  2048      -     -     1x    16x   16x
Char-CRN-2    -     -     128   512   1024      -     -     1x    2x    2x
WRN-2         -     -     4     16    32        -     -     1x    2x    2x
TRN-2         -     -     2     16    32        -     -     1x    4x    4x
E3D-2         -     -     8     16    32        -     -     1x    1x    1x

[Figure 4(b): bar chart "ParDNN Speedup over Single GPU" for Word-RNN,
Char-CRN, WRN, TRN, and E3D. Figure 4(c): bar chart "ParDNN Throughput",
y-axis throughput (item/second), for WRN-2, TRN-2, E3D-2, Word-RNN-2, and
Char-CRN-2.]
Figure 4. (a) Maximum batch sizes (bsz) made possible by ParDNN. Bsz on a
single GPU is the maximum that could fit without triggering OOM. The table
also shows the multiplier by which ParDNN could increase the bsz over ideal
data parallelism (DP). For use-cases-1, DP is assumed to be applied on top of
a single-GPU reference point. For use-cases-2, ParDNN enables ≥ 4-GPU
assignment and DP is assumed to be applied on top of a 4-GPU reference point.
(b) ParDNN speedup over a single GPU using 2, 4, 8 and 16 GPUs. (c) Throughput
and scaling up to 16 GPUs with larger models. In (b) and (c), batch size is
shown in parentheses.
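The multiplier over ideal DP described in the Figure 4 caption can be computed as below. The sketch assumes that ideal DP scales the reference batch size linearly with the GPU count; the function name and argument names are hypothetical, and the numbers are batch sizes reported in Figure 4(a).

```python
def increase_over_ideal_dp(bsz, gpus, ref_bsz, ref_gpus=1):
    """Ratio of the achieved batch size to the batch size of ideal DP.

    Ideal DP is assumed to replicate the reference configuration, so its
    batch size grows linearly: ref_bsz * gpus / ref_gpus.
    """
    ideal_dp_bsz = ref_bsz * gpus / ref_gpus
    return bsz / ideal_dp_bsz


# Use-case 1: Word-RNN, batch size 16 on one GPU and 2048 on 16 GPUs.
word_rnn_16 = increase_over_ideal_dp(2048, 16, ref_bsz=16)
# Use-case 2: TRN-2, batch size 2 on 4 GPUs and 16 on 8 GPUs.
trn2_8 = increase_over_ideal_dp(16, 8, ref_bsz=2, ref_gpus=4)
```

Here `word_rnn_16` evaluates to 8 and `trn2_8` to 4, i.e. the 8x and 4x multipliers for those configurations.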
5.3.1 Batch Size Scaling
Training with large batch sizes offers more parallelism and drastically
reduces the overall training time. Authors in [17] proposed a method to scale
batch sizes, which reduced the training of RESNET-50 on ImageNet to one hour.
Another work harnessed very large batch sizes to reduce BERT training time
from 3 days to 76 mins [69]. ParDNN enables superlinear scaling of the batch
sizes while increasing the number of GPUs. Figure 4(a) shows the batch size
scaling for all of our experiments. We could increase the batch size by up to
256x for use-cases-1 and 64x for use-cases-2. This gives ParDNN a qualitative
advantage even for models that fit into a single GPU, since ParDNN enables
training with much larger batch sizes than what can be achieved with DP.
ParDNN achieves superlinear scaling of the batch size firstly because with
ParDNN, the parameters are not replicated but distributed. A large fraction of
the memory consumed by the large models is used to store the parameters and
variables that survive through iterations. For instance, for the 1.91 billion
parameter WRN, TensorFlow allocates around 8GB for those variables. Using
ParDNN these parameters are distributed, but with DP they need to be
replicated. Secondly, for some operations, the memory consumption does not
scale linearly with the batch size. For example, in Word-RNN and Char-CRN, the
outputs of matrix multiplication operations have the largest memory
consumption ratio. When doubling the batch size, the memory consumption by
matrix multiplication results increases by only ∼25%. This is because the
batch size might be the inner dimension for many of these multiplications:
when multiplying a matrix of dimensions a × batch size by another of
dimensions batch size × b, the result has the dimensions a × b regardless of
the batch size. So the memory allocated to store the output of that operation
does not increase, and this effect propagates to its descendants that will
take its output as their input.
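Both effects can be checked with simple arithmetic. The sketch below is a deliberately simplified model that counts only parameter memory against the 32GB of each V100 and uses the roughly 8GB of WRN variables quoted above; the function names and the example shapes are hypothetical.

```python
def activation_budget_gb(num_gpus, gpu_mem_gb, param_gb, replicated):
    """Total memory left for activations once parameters are stored.

    replicated=True models DP (every GPU holds a full copy);
    replicated=False models distributing the parameters once.
    """
    param_total = param_gb * num_gpus if replicated else param_gb
    return num_gpus * gpu_mem_gb - param_total


def matmul_output_elems(lhs_shape, rhs_shape):
    """Element count of the product of an (a x k) and a (k x b) matrix."""
    rows, inner = lhs_shape
    inner_rhs, cols = rhs_shape
    if inner != inner_rhs:
        raise ValueError("inner dimensions do not match")
    return rows * cols


dp_budget = activation_budget_gb(4, 32, 8, replicated=True)
pardnn_budget = activation_budget_gb(4, 32, 8, replicated=False)

# The batch size is the inner dimension: doubling it leaves the output size
# unchanged.
small_batch = matmul_output_elems((512, 16), (16, 256))
large_batch = matmul_output_elems((512, 32), (32, 256))
```

On 4 GPUs this model leaves 96GB for activations under DP and 120GB when the parameters are distributed, and `small_batch` equals `large_batch`.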
5.3.2 GPU Count Scaling
Figure 4(b) and (c) show the speedup over a single GPU for small models and
the throughput scaling of ParDNN for large models, respectively, using the
batch sizes in Figure 4. In Figure 4(b), ParDNN shows a substantial
improvement on 2 GPUs and superlinear speedups up to 4 GPUs for all the
models. The sharp performance increase happens because when a model fits into
single GPU memory with a small batch size, the resources are extremely
underutilized. Pushing larger batches, while doubling the number of GPUs,
improves the device utilization considerably. After 4 GPUs, the batch size
could only be doubled when doubling the GPU count. The performance behavior
beyond this point depends on the inherent DoP (degree of parallelism) in the
graph and the CCR (the ratio of total communication cost to total computation
cost) of the graph [44, 56]. Both Word-RNN and Char-CRN have large DoP; as a
result, they continue to give superlinear speedups up to 8 and 16 GPUs,
respectively. TRN exhibits modest improvement beyond 4 GPUs even though its
graph has higher DoP in comparison to WRN. This is because it has a larger CCR
(e.g. on 4-GPU configurations the CCR of TRN is 1.58 compared to 0.59 of WRN),
hence a considerable amount of time is spent on communication. E3D scales
better beyond 4 GPUs in comparison to WRN due to having higher DoP. However,
the speedup, compared to a single GPU, is less because E3D's main operation is
3D convolution, which heavily utilizes the GPU even with small batch sizes.
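The CCR definition used above (total communication cost divided by total computation cost) can be written out as follows. This sketch assumes that per-node compute times and per-edge transfer times are already available from profiling and that every edge is counted; the dictionary layout and the function name `ccr` are hypothetical.

```python
def ccr(compute_cost, comm_cost):
    """Communication-to-computation ratio of a weighted task graph.

    compute_cost: dict mapping node -> execution time
    comm_cost:    dict mapping (src, dst) edge -> transfer time
    """
    total_compute = sum(compute_cost.values())
    if total_compute == 0:
        raise ValueError("graph has no computation cost")
    return sum(comm_cost.values()) / total_compute


compute = {"a": 2.0, "b": 3.0, "c": 5.0}
comm = {("a", "b"): 4.0, ("b", "c"): 1.0}
ratio = ccr(compute, comm)  # 5.0 / 10.0
```

A ratio above 1, as reported for TRN, means that transferring tensors costs more in total than computing them, which limits the benefit of adding devices.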
In Figure 4(c), going from 4 to 8 GPUs enables much larger batches in all
cases but two. This in turn enhances the resource utilization and results in
substantial throughput improvements. Char-CRN-2 scales perfectly up to 16 GPUs
due to its high DoP. Word-RNN-2 and WRN-2 scale modestly from 8 to 16 because
the batch size of 8 is sufficient to saturate the GPUs for Word-RNN-2. In the
case of WRN-2, the modest scaling is due to the low DoP.
5.4 Overhead and Fidelity of ParDNN
5.4.1 Overhead of ParDNN
ParDNN has a negligible overhead thanks to the low complexity of each step.
The longest partitioning time among all the combinations of batch sizes, GPUs
and model configurations used in this work was 117 secs, in the case of
partitioning TRN-2 over 16 GPUs. The minimum time of 18 secs was taken to
partition Word-RNN over 2 GPUs. Even though handling the memory overflow takes
most of the overall partitioning time, the time taken to handle memory
overflow is much lower than the theoretical upper bound. This is because
Step-2 of ParDNN depends on how many nodes need to be moved between clusters
to address the overflow, which is much less than |V| in practice. The average
ratio of the nodes moved in all our experiments is 8%.

[Figure 5: two bar-chart panels. Panel titles: "Partitioning: ParDNN vs.
Alternatives" (RR, ParDNN w/o refinement, and ParDNN for Word-RNN, Char-CRN,
WRN, TRN, and E3D) and "Normalized Makespan of ParDNN over Linear Clustering"
(K=2, K=4, K=8, K=16).]
Figure 5. (a) ParDNN's speedup over Round Robin (RR) and ParDNN without
refinement. The values are normalized over RR. A four-GPU configuration is
used. (b) Makespan of ParDNN over that of linear clustering (lower is better).
K is the number of partitions.
5.4.2 Analysis of ParDNN Algorithmic Steps
To analyze the impact of the slicing-mapping-refinement stages of Step-1, we
replace Step-1 with a naive approach, which simply distributes the graph-nodes
to the devices in a round-robin fashion (RR), where the nodes of the graph are
iterated in their topological order. Figure 5(a) shows the performance
improvement by ParDNN over RR. In addition, the figure also shows ParDNN's
performance without the refinement. Compared to RR results, ParDNN doubles the
training throughput on average. Applying refinement has a non-negligible
effect, contributing to 5-25% improvement.
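The round-robin baseline is simple enough to sketch in full. The code below orders the nodes with Kahn's algorithm and assigns them to devices cyclically; the specific topological order used in the experiments is not stated here, so the tie-breaking of this sketch and the function names are assumptions.

```python
from collections import deque


def topological_order(nodes, edges):
    """Kahn's algorithm; ties follow the order of `nodes` and `edges`."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


def round_robin_placement(nodes, edges, num_devices):
    """Assign the i-th node in topological order to device i mod num_devices."""
    order = topological_order(nodes, edges)
    return {node: i % num_devices for i, node in enumerate(order)}


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
rr = round_robin_placement(nodes, edges, num_devices=2)
```

Because this assignment ignores both node weights and edge weights, dependent nodes frequently land on different devices, which is consistent with RR being the weakest configuration in Figure 5(a).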
5.4.3 ParDNN vs Linear Clustering
ParDNN is not a standalone scheduling algorithm. It leaves the order of
execution decision to the dynamic scheduler. However, it can still serve as an
efficient phase in static scheduling. To show this feature and the advantage
of our multi-staged approach over a high-complexity single heuristic, we
compare ParDNN with linear clustering (LC). To do a fair comparison, we
implemented LC with GLB and Earliest Estimated Time First (EST First) [62] as
a task ordering heuristic, since this combination gave the best results. For
ParDNN, we used EST First as well to derive the execution order of tasks and
omitted the memory constraints (Step 2). Figure 5(b) shows that in all the
experiments ParDNN outperforms or is on par with LC. In particular, ParDNN
produces much better results than LC when the degree of parallelism is high,
as in Char-CRN and Word-RNN. Another advantage comes from the overhead. For
the largest graph, WRN with ∼190K nodes, it took ParDNN 36 secs while LC took
about 4.5 hours.
6 Related Work
Systems-level approaches: Mirhoseini et al. proposed a reinforcement
learning-based method to place dataflow graphs on multiple devices [41, 42].
This approach suffers from significant time and resource consumption. The
proposed policy was trained for hours using 16 workers to produce placements
for models having less than 100K operations. A more efficient approach was
proposed by Wang et al. in [64]. However, it requires a description language
to specify computations and cannot describe all the operations used in DL.
Moreover, it partitions all operators and tensors across all workers,
resulting in poor resource utilization.
DL-level approaches: Explicit model parallelism, where each worker is
responsible for a subset of the layers, suffers from two major limitations:
requiring complex cost models on a case-by-case basis and leaving the
partitioning burden to the programmer [43]. Pipeline parallelism provides good
resource utilization, yet some implementations require a single layer to fit
in a single device [22], which may not be the case for models with 3D inputs
[39]. In others, extra memory overhead proportional to the size of the model
weights is necessary to address the statistical efficiency issue, i.e. an
issue that prevents model convergence [43]. In [12, 27, 33, 55], non-generic
techniques were proposed to parallelize specific types of DL models, some
focusing on CNNs while others rely on Transformer in their optimizations.
Virtualization and Recomputation: These methods relax the memory requirements.
vDNN [51] is a memory manager that virtualizes GPU memory in DNN training. ooc
cuDNN [25] extends cuDNN and applies cuDNN-compatible operators even when a
layer exceeds GPU memory capacity by swapping at the granularity of individual
tensor dimensions. Gradient checkpointing [10] reduces the memory needed to
store the intermediate outputs and gradients at the cost of doubling the
forward pass computational cost [10, 26]. PoocH [24] and Capuchin [47] propose
a hybrid approach that selects either recomputing or swapping for certain
layers to reduce the performance overhead based on profiling data.
Graph partitioning: To deal with a directed graph, existing graph partitioning
libraries convert every directed edge to an undirected one, even though this
conversion loses crucial information [4]. For this reason, the Scotch static
mapper [45, 46] and MinCut optimizer result in a 2 to 10 times slowdown when
applied on graphs of DL models [41, 42]. In [20], new techniques are proposed
to deal with directed graphs, and [44] built on top of those techniques for a
clustering-based scheduler. They aim at producing acyclic partitioning, where
if there is a cut edge from partition a to b and another from b to a, the
partition is considered cyclic and is not acceptable. Since the graphs
produced by TensorFlow are full of fork-joins, applying their technique to our
DNN models results in unbalanced partitions.
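The acyclicity condition stated above can be checked directly. The sketch below implements only the pairwise condition as worded in this section (cut edges in both directions between two partitions); it is an illustration with hypothetical names, not the procedure of [20].

```python
def is_acyclic_partition(edges, part):
    """Return False if two partitions have cut edges in both directions.

    edges: iterable of (src, dst) node pairs
    part:  dict mapping node -> partition id
    """
    cut = {(part[u], part[v]) for u, v in edges if part[u] != part[v]}
    return not any((b, a) in cut for a, b in cut)


# A fork-join: s forks into x and y, which join again at t.
fork_join = [("s", "x"), ("s", "y"), ("x", "t"), ("y", "t")]

# Putting one branch on another partition creates edges 0 -> 1 and 1 -> 0.
balanced = is_acyclic_partition(fork_join, {"s": 0, "x": 0, "y": 1, "t": 0})
# Keeping both branches with the fork satisfies the constraint.
lopsided = is_acyclic_partition(fork_join, {"s": 0, "x": 0, "y": 0, "t": 1})
```

In this example the split that separates the two parallel branches is rejected, while the split that keeps three of the four nodes together is accepted, which hints at why fork-join-heavy graphs end up with unbalanced partitions under this constraint.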
Static graph scheduling: Plenty of sophisticated and high-quality algorithms
were proposed [18, 23, 35, 36, 68] in this area. The vast majority of these
algorithms were developed in the 1990s to handle small-sized graphs, and they
were later evaluated using instances having up to 3000 nodes [14, 18, 37, 62,
63]. A recent evaluation on large graphs shows that they either do not scale
due to their high time complexity, or produce low-quality allocations due to
their inability to capture the global structure of the graph [44].
7 Conclusion
ParDNN presents a lightweight approach to partition computational graphs of
very large DNN models. It permits the training of models that do not fit into
a single device memory. The experiments on five large DNNs and comparisons
with related work demonstrate its high efficiency and superlinear scaling of
batch size and training throughput.
Acknowledgement
Authors from Koç University are supported by the Turkish Science and
Technology Research Centre Grant No: 118E801. This work was partially
supported by JST-CREST under Grant Number JPMJCR19F5. The research presented
in this paper has benefited from the Experimental Infrastructure for
Exploration of Exascale Computing (eX3), which is financially supported by the
Research Council of Norway under contract 270053.
References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu
Devin, et al. 2016. Tensorflow: Large-scale machine learning on
heterogeneous distributed systems. arXiv preprint arXiv:1603.04467
(2016).
[2] Zahangir Alom, Tarek M. Taha, Chris Yakopcic, Stefan Westberg, Vasit
Sagan, Mst Shamima Nasrin, Mahmudul Hasan, Brian C. Van Essen,
Abdul A. S. Awwal, and Vijayan K. Asari. 2019. A State-of-the-Art
Survey on Deep Learning Theory and Architectures. Electronics 8, 3
(2019), 292. https://doi.org/10.3390/electronics8030292
[3] Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi
Lu, and Dhabaleswar K Panda. 2018. OC-DNN: Exploiting Advanced
Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-
Core DNN Training. In 2018 IEEE 25th International Conference on
High Performance Computing (HiPC). IEEE, 143–152.
[4] David A Bader, Henning Meyerhenke, Peter Sanders, and Dorothea
Wagner. 2013. Graph partitioning and graph clustering. Vol. 588. Amer-
ican Mathematical Society Providence, RI.
[5] James Bergstra, Frédéric Bastien, Olivier Breuleux, Pascal Lamblin,
Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David
Warde-Farley, Ian Goodfellow, Arnaud Bergeron, et al. 2011. Theano:
Deep learning on gpus with python. In NIPS 2011, BigLearning Work-
shop, Granada, Spain, Vol. 3. Citeseer, 1–48.
[6] Charles-Edmond Bichot and Patrick Siarry. 2011. Graph partitioning.
Wiley Online Library.
[7] Yaroslav Bulatov. 2018. gradient-checkpointing.
https://github.com/cybertronai/gradient-checkpointing.
[8] Mauro Cettolo, Niehues Jan, Stüker Sebastian, Luisa Bentivogli,
Roldano Cattoni, and Marcello Federico. 2016. The IWSLT 2016 evalu-
ation campaign. In International Workshop on Spoken Language Trans-
lation.
[9] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang,
Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015.
Mxnet: A flexible and efficient machine learning library for heteroge-
neous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
[10] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin.
2016. Training Deep Nets with Sublinear Memory Cost. ArXiv
abs/1604.06174 (2016).
[11] János Csirik. 1991. Heuristics for the 0-1 min-knapsack problem. Acta
Cybernetica 10, 1-2 (1991), 15–20.
[12] Nikoli Dryden et al. 2019. Channel and Filter Parallelism for Large-
Scale CNN Training. In Proceedings of the International Conference for
High Performance Computing, Networking, Storage, and Analysis (SC
’19). Article 46, 13 pages.
[13] Peter M Fenwick. 1994. A new data structure for cumulative frequency
tables. Software: Practice and experience 24, 3 (1994), 327–336.
[14] Apostolos Gerasoulis and Tao Yang. 1992. A comparison of clustering
heuristics for scheduling directed acyclic graphs on multiprocessors.
journal of parallel and distributed computing 16, 4 (1992), 276–291.
[15] Amir Gholami, Ariful Azad, Peter Jin, Kurt Keutzer, and Aydin Buluc.
2017. Integrated model, batch and domain parallelism in training
neural networks. arXiv preprint arXiv:1712.04432 (2017).
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep
learning. MIT press.
[17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming
He. 2017. Accurate, large minibatch sgd: Training imagenet in 1 hour.
arXiv preprint arXiv:1706.02677 (2017).
[18] Kun He, Xiaozhu Meng, Zhizhou Pan, Ling Yuan, and Pan Zhou. 2018.
A novel task-duplication based clustering algorithm for heterogeneous
computing environments. IEEE Transactions on Parallel and Distributed
Systems 30, 1 (2018), 2–14.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition. 770–778.
[20] Julien Herrmann, Jonathan Kho, Bora Uçar, Kamer Kaya, and Ümit V
Çatalyürek. 2017. Acyclic partitioning of large directed acyclic graphs.
In 2017 17th IEEE/ACM international symposium on cluster, cloud and
grid computing (CCGRID). IEEE, 371–380.
[21] Yanping Huang et al. 2018. GPipe: Efficient Training of Giant Neural
Networks using Pipeline Parallelism. CoRR abs/1811.06965 (2018).
arXiv:1811.06965
[22] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao
Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui
Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using
pipeline parallelism. In Advances in Neural Information Processing
Systems. 103–112.
[23] Jing-Jang Hwang, Yuan-Chieh Chow, Frank D Anger, and Chung-Yee
Lee. 1989. Scheduling precedence graphs in systems with interproces-
sor communication times. SIAM J. Comput. 18, 2 (1989), 244–257.
[24] Yuki Ito, Haruki Imai, Tung D. Le, Yasushi Negishi, Kiyokuni
Kawachiya, Ryo Matsumiya, and Toshio Endo. 2019. Profiling based
out-of-core hybrid method for large neural networks: poster. ArXiv
abs/1907.05013 (2019).
[25] Yuki Ito, Ryo Matsumiya, and Toshio Endo. 2017. ooc cuDNN: Ac-
commodating convolutional neural networks over GPU memory
capacity. In 2017 IEEE International Conference on Big Data, Big-
Data 2017, Boston, MA, USA, December 11-14, 2017. 183–192.
https://doi.org/10.1109/BigData.2017.8257926
[26] Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter
Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Break-
ing the Memory Wall with Optimal Tensor Rematerialization. In
Proceedings of Machine Learning and Systems 2020. 497–511.
[27] Zhihao Jia, Sina Lin, Charles R Qi, and Alex Aiken. 2018. Exploring
hidden dimensions in parallelizing convolutional neural networks.
arXiv preprint arXiv:1802.04924 (2018).
[28] Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond Data and
Model Parallelism for Deep Neural Networks. CoRR abs/1807.05358
(2018). arXiv:1807.05358 http://arxiv.org/abs/1807.05358
[29] Andrej Karpathy. 2015. tinyshakespeare.
https://github.com/karpathy/char-rnn/tree/master/data/tinyshakespeare
[30] George Karypis and Vipin Kumar. 1995. Multilevel graph partitioning
schemes. In ICPP (3). 113–122.
[31] Sung J Kim. 1988. A general approach to multiprocessor scheduling.
(1988).
[32] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016.
Character-aware neural language models. In Thirtieth AAAI Conference
on Artificial Intelligence.
[33] Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional
neural networks (2014). arXiv preprint arXiv:1404.5997 (2014).
[34] Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers
of features from tiny images. (2009).
[35] Yu-Kwong Kwok and Ishfaq Ahmad. 1995. Bubble scheduling: A
quasi dynamic algorithm for static allocation of tasks to parallel ar-
chitectures. In Proceedings. Seventh IEEE Symposium on Parallel and
Distributed Processing. IEEE, 36–43.
[36] Yu-Kwong Kwok and Ishfaq Ahmad. 1996. Dynamic critical-path
scheduling: An effective technique for allocating task graphs to
multiprocessors. IEEE transactions on parallel and distributed systems 7, 5
(1996), 506–521.
[37] Jing-Chiou Liou and Michael A Palis. 1997. A comparison of gen-
eral approaches to multiprocessor scheduling. In Proceedings 11th
International Parallel Processing Symposium. IEEE, 152–156.
[38] Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert
MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger.
1994. The Penn Treebank: annotating predicate argument structure.
In Proceedings of the workshop on Human Language Technology. Asso-
ciation for Computational Linguistics, 114–119.
[39] Amrita Mathuriya et al. 2018. CosmoFlow: Using Deep Learning
to Learn the Universe at Scale. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage, and
Analysis (SC ’18). IEEE Press, Piscataway, NJ, USA, Article 65, 11 pages.
[40] CL McCreary, MA Cleveland, and AA Khan. 1996. The problem with
critical path scheduling algorithms. Master’s Thesis, Department of
Computer Science and Engineering Auburn University, USA (1996).
[41] Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Quoc V
Le, and Jeff Dean. 2018. A hierarchical model for device placement.
(2018).
[42] Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus
Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy
Bengio, and Jeff Dean. 2017. Device Placement Optimization with
Reinforcement Learning. In Proceedings of the 34th International Conference
on Machine Learning - Volume 70 (ICML’17). JMLR.org, 2430–2439.
[43] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri,
Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei
Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN
training. In Proceedings of the 27th ACM Symposium on Operating
Systems Principles. 1–15.
[44] M Yusuf Özkaya, Anne Benoit, Bora Uçar, Julien Herrmann, and
Ümit V Çatalyürek. 2019. A scalable clustering-based task scheduler
for homogeneous processors using DAG partitioning. In 2019 IEEE
International Parallel and Distributed Processing Symposium (IPDPS).
IEEE, 155–165.
[45] François Pellegrini. 2009. Distillating knowledge about Scotch. In
Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für
Informatik.
[46] François Pellegrini and Jean Roman. 1996. Scotch: A software package
for static mapping by dual recursive bipartitioning of process and
architecture graphs. In International Conference on High-Performance
Computing and Networking. Springer, 493–498.
[47] Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian
Xiong, Fan Yang, and Xuehai Qian. 2020. Capuchin: Tensor-Based
GPU Memory Management for Deep Learning. In Proceedings of the
Twenty-Fifth International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS ’20).
Association for Computing Machinery, New York, NY, USA, 891–905.
https://doi.org/10.1145/3373376.3378505
[48] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and
Ilya Sutskever. 2019. Language Models are Unsupervised Multitask
Learners. OpenAI Technical Report (2019).
[49] Andrei Radulescu and Arjan JC Van Gemund. 1998. GLB: A low-
cost scheduling algorithm for distributed-memory architectures. In
Proceedings. Fifth International Conference on High Performance
Computing (Cat. No. 98EX238). IEEE, 294–301.
[50] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He.
2019. ZeRO: Memory Optimization Towards Training A Trillion
Parameter Models. ArXiv abs/1910.02054 (2019).
[51] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and
Stephen W Keckler. 2016. vDNN: Virtualized deep neural networks for
scalable, memory-efficient neural network design. In The 49th Annual
IEEE/ACM International Symposium on Microarchitecture. IEEE Press,
18.
[52] Vivek Sarkar. 1988. Partitioning and scheduling parallel programs for
execution on multiprocessors. (1988).
[53] Taro Sekiyama, Takashi Imamichi, Haruki Imai, and Rudy Raymond.
2018. Profile-guided memory optimization for deep neural networks.
arXiv preprint arXiv:1804.10001 (2018).
[54] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish
Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee,
Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman.
2018. Mesh-TensorFlow: Deep Learning for Supercomputers. In Pro-
ceedings of the 32nd International Conference on Neural Information
Processing Systems (NIPS’18). Curran Associates Inc., Red Hook, NY,
USA, 10435–10444.
[55] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley,
Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training
multi-billion parameter language models using gpu model parallelism.
arXiv preprint arXiv:1909.08053 (2019).
[56] Oliver Sinnen. 2007. Task scheduling for parallel systems. Vol. 60. John
Wiley & Sons.
[57] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. 2015.
Unsupervised learning of video representations using lstms. In Inter-
national conference on machine learning. 843–852.
[58] Jack Jackson Sung Kim. 2017. Multi-layer Recurrent Neural Networks
(LSTM, RNN) for word-level language models in Python using
TensorFlow. https://github.com/hunkim/word-rnn-tensorflow
[59] Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011.
Generating text with recurrent neural networks. In Proceedings of the 28th
international conference on machine learning (ICML-11). 1017–1024.
[60] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet,
Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal
Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob
Uszkoreit. 2018. Tensor2Tensor for Neural Machine Translation. CoRR
abs/1803.07416 (2018). http://arxiv.org/abs/1803.07416
[61] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in neural information processing
systems. 5998–6008.
[62] Huijun Wang and Oliver Sinnen. 2018. List-scheduling versus cluster-
scheduling. IEEE Transactions on Parallel and Distributed Systems 29,
8 (2018), 1736–1749.
[63] Jian Wang, Xinke Lv, and Xiao Chen. 2016. Comparative analysis
of list scheduling algorithms on homogeneous multi-processors. In
2016 8th IEEE International Conference on Communication Software and
Networks (ICCSN). IEEE, 708–713.
[64] Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting
very large models using automatic dataflow graph partitioning. In
Proceedings of the Fourteenth EuroSys Conference 2019. 1–17.
[65] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long,
and Li Fei-Fei. 2018. Eidetic 3d lstm: A model for video prediction and
beyond. (2018).
[66] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad
Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao,
Klaus Macherey, et al. 2016. Google’s neural machine translation
system: Bridging the gap between human and machine translation.
arXiv preprint arXiv:1609.08144 (2016).
[67] Tao Yang. 1993. Scheduling and code generation for parallel architectures.
Ph.D. Dissertation. Citeseer.
[68] Tao Yang and Apostolos Gerasoulis. 1994. DSC: Scheduling parallel
tasks on an unbounded number of processors. IEEE Transactions on
Parallel and Distributed Systems 5, 9 (1994), 951–967.
[69] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Sri-
nadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and
Cho-Jui Hsieh. 2019. Large batch optimization for deep learning:
Training bert in 76 minutes. In International Conference on Learning
Representations.
[70] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual net-
works. arXiv preprint arXiv:1605.07146 (2016).
A Appendix
A.1 Time Complexity of Each Step of ParDNN
Complexity of Graph Slicing: The most expensive part of
Algorithm 1 is computing the weighted levels of all nodes.
This operation performs a variant of topological sorting and
has a time complexity of O(|V| + |E|) [56]. It is done K times,
resulting in an overall complexity of O(K(|V| + |E|)), as opposed
to linear clustering, which would cost O(|V|(|E| + |V|)) [62].
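To illustrate why one weighted-level pass is linear in the graph size, the following is a minimal sketch based on Kahn's topological sort. It assumes that the weighted level of a node is the length of the longest weighted path from any source to that node, counting the compute cost of predecessors and the communication cost of edges; the function name weighted_levels and its arguments are hypothetical and are not taken from the ParDNN implementation.

```python
from collections import deque


def weighted_levels(num_nodes, edges, node_cost):
    """Longest weighted path from any source to each node (assumed definition).

    edges is a list of (u, v, comm_cost) tuples describing a DAG.
    Every node and every edge is visited once, so the pass is O(|V| + |E|).
    """
    successors = [[] for _ in range(num_nodes)]
    in_degree = [0] * num_nodes
    for u, v, comm_cost in edges:
        successors[u].append((v, comm_cost))
        in_degree[v] += 1

    level = [0] * num_nodes
    queue = deque(n for n in range(num_nodes) if in_degree[n] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        finish = level[u] + node_cost[u]
        for v, comm_cost in successors[u]:
            # Keep the longest path seen so far into v.
            level[v] = max(level[v], finish + comm_cost)
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)

    if visited != num_nodes:
        raise ValueError("graph contains a cycle")
    return level


# Small diamond-shaped graph: 0 -> {1, 2} -> 3.
example_edges = [(0, 1, 1), (0, 2, 0), (1, 3, 2), (2, 3, 0)]
example_levels = weighted_levels(4, example_edges, [2, 3, 1, 4])
```

Repeating such a pass K times gives the O(K(|V| + |E|)) bound stated above.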
Complexity of Mapping: Since the clusters are disjoint
paths or singular nodes, and have no common nodes (a node
exists in only one cluster), the total number of update
operations is bounded by |V|. The number of range-summation
queries is bounded by the number of paths, which is
again bounded by |V|. The cost of either operation
is logarithmic in the number of levels. The number of levels
is ≤ |V|, so we end up with O(|V| log |V|). Before starting
LALB, we sort the clusters by their weights (the heaviest
clusters first, due to their importance in balancing the loads);
this has an upper bound of O(|V| log |V|), since the number
of clusters is upper-bounded by the number of nodes. Hence,
the overall complexity of the mapping stage is O(|V| log |V|).
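The logarithmic cost of the update and range-summation operations is what a binary indexed (Fenwick) tree provides. The sketch below is a generic version of that data structure; indexing it by level and storing a per-level workload is an assumption made for illustration, and the class name FenwickTree is hypothetical.

```python
class FenwickTree:
    """Binary indexed tree over positions 0 .. size - 1.

    Point updates and range sums both take O(log size) time.
    """

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)  # internal 1-based storage

    def add(self, index, delta):
        """Add delta to the value stored at position index."""
        i = index + 1
        while i <= self.size:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, index):
        """Sum of positions 0 .. index inclusive (0 if index is -1)."""
        i = index + 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def range_sum(self, low, high):
        """Sum of positions low .. high inclusive."""
        if low > high:
            return 0
        return self.prefix_sum(high) - self.prefix_sum(low - 1)


# Hypothetical usage: workload already assigned to one device, per level.
load_per_level = FenwickTree(8)
load_per_level.add(2, 5)
load_per_level.add(3, 4)
load_per_level.add(6, 1)
load_in_span = load_per_level.range_sum(2, 3)
```

With at most |V| updates and |V| queries, each costing O(log |V|), the O(|V| log |V|) bound follows.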
Complexity of Refinement: For swapping, we initially
sort the clusters by tl(n) of their source nodes, so that the
clusters within the span of a certain cluster can be found using binary
search. Since the number of disjoint clusters is bounded by
the number of nodes, this process has a complexity of
O(|V| log |V|). Once two clusters are swapped, they are
marked and not considered again, leading to at most |V|
cluster (and hence node) swaps. With each swap, the binary
indexed trees are updated to reflect the new workloads.
Since each update takes O(log |V|), the overall complexity is
O(|V| log |V|). The node-level refinement is repeated K
times, and each time we recalculate the weighted levels and
the CP. Upon node switching, we update the trees. The
overall time complexity is O(K(|V| + |E|)).
Complexity of Scheduler Emulator: The scheduler emulator
estimates the starting time st(n) and finishing time
ft(n) of the nodes in the graph. The emulator has a time
complexity of O(|V| + |E|).
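As a rough illustration of how st(n) and ft(n) can be estimated in a single linear pass, the sketch below walks the graph in topological order. It is a simplification under stated assumptions: each device runs one operation at a time, operations are dispatched in the order they become ready, and communication cost is charged only when producer and consumer sit on different devices. The function emulate_schedule and its parameters are hypothetical and do not reproduce the paper's emulator.

```python
from collections import deque


def emulate_schedule(num_nodes, edges, node_cost, placement, num_devices):
    """Estimate start and finish times for a fixed placement in O(|V| + |E|).

    edges is a list of (u, v, comm_cost); placement[n] is the device of n.
    """
    successors = [[] for _ in range(num_nodes)]
    in_degree = [0] * num_nodes
    for u, v, comm_cost in edges:
        successors[u].append((v, comm_cost))
        in_degree[v] += 1

    inputs_ready = [0] * num_nodes    # time when all inputs have arrived
    device_free = [0] * num_devices   # time when each device becomes idle
    st = [0] * num_nodes
    ft = [0] * num_nodes

    queue = deque(n for n in range(num_nodes) if in_degree[n] == 0)
    while queue:
        u = queue.popleft()
        device = placement[u]
        st[u] = max(inputs_ready[u], device_free[device])
        ft[u] = st[u] + node_cost[u]
        device_free[device] = ft[u]
        for v, comm_cost in successors[u]:
            # Transfers are free when both ends share a device (assumption).
            transfer = comm_cost if placement[v] != device else 0
            inputs_ready[v] = max(inputs_ready[v], ft[u] + transfer)
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)
    return st, ft


# Diamond graph with node 2 placed on a second device.
demo_edges = [(0, 1, 1), (0, 2, 2), (1, 3, 2), (2, 3, 1)]
demo_st, demo_ft = emulate_schedule(4, demo_edges, [2, 3, 1, 4], [0, 0, 1, 0], 2)
```

In this example the last node cannot start until the result from the second device has arrived, which is how cross-device communication shows up in the estimated makespan.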
Complexity of Tracking Memory Consumption: Tracking
the memory consumption requires O(|V|) time, since it
is done in one pass over the graph nodes while keeping the
cumulative values and calculating the potentials.
Complexity of Addressing Overflow: We solve the
knapsack problem greedily, as the complexity of the
dynamic-programming-based solution is impractical. When an overflow is detected,
we pick the nodes from the heaps in logarithmic time. Any
node that is moved to another partition is guaranteed not to
be moved again, since it is moved only if the destination pe
can accommodate it, meaning that it can neither cause nor
solve an overflow on that pe. As a result, there is no repetition,
and a node can enter or exit the heap only once, resulting
in O(|V| log |V|). When a node is moved, the new potentials
and memory consumption need to be recalculated (O(|V|)).
This may happen at most |V| times. Overall, the complexity is
O(|V|^2).
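The heap-based greedy selection can be sketched as follows. This is a deliberately simplified illustration, not the paper's procedure: it treats each node's memory as a static footprint, keys the heaps by that footprint alone, and always tries the device with the most free memory as the destination, whereas the method described above recomputes potentials and memory consumption after each move. The name relieve_overflow and its arguments are hypothetical.

```python
import heapq


def relieve_overflow(node_mem, placement, capacity):
    """Greedily move nodes off devices whose memory use exceeds capacity.

    A node is moved only if the destination can hold it, so a move never
    creates a new overflow and no node is moved twice.
    """
    num_devices = len(capacity)
    placement = list(placement)
    used = [0] * num_devices
    heaps = [[] for _ in range(num_devices)]
    for node, device in enumerate(placement):
        used[device] += node_mem[node]
        # Negate the key because heapq is a min-heap.
        heapq.heappush(heaps[device], (-node_mem[node], node))

    for device in range(num_devices):
        while used[device] > capacity[device] and heaps[device]:
            neg_mem, node = heapq.heappop(heaps[device])
            mem = -neg_mem
            others = [d for d in range(num_devices) if d != device]
            if not others:
                break
            dest = max(others, key=lambda d: capacity[d] - used[d])
            if capacity[dest] - used[dest] < mem:
                continue  # this node fits nowhere; try the next largest
            placement[node] = dest
            used[device] -= mem
            used[dest] += mem
    return placement, used


new_placement, new_used = relieve_overflow([4, 3, 2, 1], [0, 0, 0, 1], [5, 6])
```

Each node leaves a heap at most once, which matches the O(|V| log |V|) term above; the quadratic bound comes from the recalculation step that this sketch omits.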
A.2 Models and Datasets
From language modeling, we use Word-RNN, a multi-layer
Recurrent Neural Network for word-level language modeling
inspired by character-level modeling [59], and
Character-Aware Neural Language Models (Char-CRN) [32]. Both
models can be enlarged by increasing the number of layers or
the hidden state size. The Penn Treebank text corpus [38]
is used to train Char-CRN, while Word-RNN is trained using
Tiny Shakespeare [29].
From computer vision, we experiment with WRN [70],
which is a widened version of the residual network model.
In WRN, the width of the convolutional layers can be
configured. The model size grows quadratically when widened.
WRN achieves better accuracy when the model is widened [70].
WRN is trained using CIFAR [34], as it is the dataset used
by the original authors.
TRN (Transformer) [61] is a widely used model that has had
a significant influence on the design of SoTA Transformer-based
models in the NLP domain, such as GPT-2 [48] and
Megatron-LM [55]. Transformer can be enlarged by increasing
the number of layers, which deepens the model, and by
widening the inner-layer dimensionality. Deeper [22] and
wider [61] configurations of Transformer are shown to give
higher accuracy. We trained Transformer using the IWSLT 2016
German–English parallel corpus.
E3D is Eidetic 3D LSTM [65] for video prediction. This
model achieves state-of-the-art performance in future frame
prediction. E3D is closely related to convolutional recurrent
networks, where the dimensions of memory states are
increased, and 3D-Convs are adopted as the basic operators
for state transitions. E3D can be enlarged by increasing
the number of hidden state channels on the memory
dimensions. We trained E3D-LSTM using the Moving MNIST
dataset.
