Exploring Hidden Dimensions in Parallelizing Convolutional Neural
  Networks by Jia, Zhihao et al.
Zhihao Jia 1 Sina Lin 2 Charles R. Qi 1 Alex Aiken 1
Abstract
The past few years have witnessed growth in the
computational requirements for training deep con-
volutional neural networks. Current approaches
parallelize training onto multiple devices by ap-
plying a single parallelization strategy (e.g., data
or model parallelism) to all layers in a network.
Although easy to reason about, these approaches
result in suboptimal runtime performance in large-
scale distributed training, since different layers
in a network may prefer different parallelization
strategies. In this paper, we propose layer-wise
parallelism that allows each layer in a network
to use an individual parallelization strategy. We
jointly optimize how each layer is parallelized by
solving a graph search problem. Our evaluation
shows that layer-wise parallelism outperforms
state-of-the-art approaches by increasing train-
ing throughput, reducing communication costs,
achieving better scalability to multiple GPUs,
while maintaining original network accuracy.
1. Introduction
Convolutional neural networks (CNNs) have proven to be
general and effective across many tasks including image
classification (Krizhevsky et al., 2012; Szegedy et al., 2016),
face recognition (Lawrence et al., 1997), text classifica-
tion (Wang et al., 2012), and game playing (Silver et al.,
2016). Their success has resulted in growth in the compu-
tational requirements to train today’s CNNs, which takes
days or even weeks on modern processors (Zeiler & Fergus,
2014; Simonyan & Zisserman, 2014; Szegedy et al., 2016).
Previous work has investigated parallelization techniques
to accelerate training. The most common approach is data
parallelism (Krizhevsky et al., 2012; Simonyan & Zisser-
man, 2014) that keeps a replica of an entire network on
each device and assigns a subset of the training data to
1Stanford University 2Microsoft. Correspondence to: Zhihao
Jia <zhihao@cs.stanford.edu>.
Proceedings of the 35th International Conference on Machine
Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018
by the author(s).
each device. Another common approach is model paral-
lelism (Mirhoseini et al., 2017; Kim et al., 2017) that divides
the network parameters into disjoint subsets and trains each
subset on a dedicated device. Both approaches apply a sin-
gle parallelization strategy (i.e., data or model parallelism)
to all layers in a CNN. However, within a CNN, differ-
ent layers may prefer different parallelization strategies for
achieving optimal runtime performance. For example, a
densely-connected layer with millions of parameters prefers
model parallelism to reduce communication cost for syn-
chronizing parameters, while a convolutional layer typically
prefers data parallelism to eliminate data transfers from the
previous layer. In addition, some layers may prefer more
sophisticated parallelization strategies such as parallelizing
in a mixture of multiple data dimensions (see Section 3).
Because of the differing characteristics of different layers in
a network, applying a single parallelization strategy to all
layers usually results in suboptimal runtime performance.
In this paper, we propose layer-wise parallelism, which en-
ables each layer in a network to use an individual paralleliza-
tion strategy. Layer-wise parallelism performs the same
computation for each layer as it is defined in the original
network and therefore maintains the same network accuracy
by design. Compared to existing parallelization approaches,
our approach defines a more comprehensive search space
of parallelization strategies, which includes data and model
parallelism as two special cases. Our goal is to find the par-
allelization strategies for individual layers to jointly achieve
the best possible runtime performance while maintaining
the original network accuracy. To formalize the problem,
we introduce parallelization configurations that define the
search space for parallelizing a layer across multiple devices.
We propose a cost model that quantitatively evaluates the
runtime performance of different parallelization strategies. The
cost model considers both the computation power of each
device and the communication bandwidth between devices.
With the cost model, we convert the original problem of
choosing parallelization configurations for individual layers
to a graph search problem and develop an efficient algorithm
to find a globally optimal strategy under the cost model.
We evaluate the runtime performance of layer-wise paral-
lelism with AlexNet (Krizhevsky et al., 2012), VGG-16 (Si-
monyan & Zisserman, 2014), and Inception-v3 (Szegedy
et al., 2016) on the ILSVRC 2012 image classification
dataset. For distributed training on 16 P100 GPUs (on 4
nodes), layer-wise parallelism is 1.4-2.2× faster than state-
of-the-art parallelization strategies. Note that the speedup is
achieved without sacrificing network accuracy, since layer-
wise parallelism trains the same network as data and model
parallelism and uses more efficient parallelization strate-
gies to achieve better runtime performance. In addition,
layer-wise parallelism reduces communication costs by 1.3-
23.0× compared to data and model parallelism. Finally, we
show that layer-wise parallelism achieves better scalability
than other parallelization strategies. Scaling the training
of Inception-v3 from 1 GPU to 16 GPUs, layer-wise paral-
lelism obtains 15.5× speedup, while other parallelization
strategies achieve at most 11.2× speedup.
To summarize, our contributions are:
• We propose layer-wise parallelism, which allows differ-
ent layers in a network to use individual parallelization
configurations.
• We define the search space of possible parallelization
configurations for a layer and present a cost model to
quantitatively evaluate the runtime performance of training
a network. Based on the cost model, we develop
an efficient algorithm to jointly find a globally optimal
parallelization strategy.
• We provide an implementation that supports layer-wise
parallelism and show that layer-wise parallelism can
increase training throughput by 1.4-2.2× and reduce
communication costs by 1.3-23.0× over state-of-the-
art approaches while improving scalability.
2. Related Work
Data and model parallelism have been widely used by
existing deep learning frameworks (e.g., TensorFlow (Abadi
et al., 2016), Caffe2 (https://caffe2.ai), and PyTorch
(https://pytorch.org)) to parallelize training.
Data parallelism (Krizhevsky et al., 2012) keeps a copy
of an entire network on each device, which is inefficient
for layers with large numbers of network parameters and
becomes a scalability bottleneck in large-scale distributed
training. Model parallelism (Dean et al., 2012) divides
network parameters into disjoint subsets and trains each
subset on a dedicated device. This reduces communication
costs for synchronizing network parameters but exposes
limited parallelism.
Krizhevsky (2014) introduces “one weird trick” (OWT) that
uses data parallelism for convolutional and pooling layers
and switches to model parallelism for fully-connected layers
to accelerate training. This achieves better runtime performance
than data and model parallelism but is still suboptimal.
In this paper, we use OWT parallelism as a baseline in
the experiments and show that layer-wise parallelism can
further reduce communication costs and improve training
performance compared to OWT parallelism.
System optimizations. A number of system-level optimiza-
tions have been proposed to accelerate large scale training.
Goyal et al. (2017) uses a three-step allreduce operation to
optimize communication across devices and aggressively
overlaps gradient synchronization with back propagation.
Zhang et al. (2017) introduces a hybrid communication
scheme to reduce communication costs for gradient synchro-
nization. All these systems are based on data parallelism
and are limited in runtime performance by communication
costs.
Network parameter reduction. Han et al. (2015) presents
an iterative weight pruning method that repeatedly retrains
the network while removing weak connections. Alvarez
& Salzmann (2016) proposes a network that learns the re-
dundant parameters in each layer and iteratively eliminates
the redundant parameters. These approaches improve run-
time performance by significantly reducing the number of
parameters in a neural network, which results in a modi-
fied network and may decrease the network accuracy (as
reported in these papers). By contrast, in this paper, we in-
troduce a new approach that accelerates distributed training
while maintaining the original network accuracy.
3. Hidden Dimensions in Parallelizing a Layer
Data parallelism parallelizes training by partitioning a train-
ing dataset in the sample dimension. However, other dimen-
sions can also be used to parallelize a layer. For example,
in standard CNNs for 2D images, data is commonly orga-
nized as 4-dimensional tensors (i.e., sample, height, width,
and channel). The sample dimension includes an index for
each image in a training dataset. The height and width di-
mensions specify a position in an image. For a particular
position, the channel dimension indexes different neurons
for that position.
In principle, any combination of these dimensions can be
used to parallelize a layer, and we should consider both se-
lecting the dimensions to parallelize training and the degree
of parallelism in each dimension. Exploring these additional
dimensions has the following advantages.
First, parallelizing a layer in other dimensions can reduce
execution time. Figure 1 shows the time to process a 2D
convolutional layer on 4 GPUs using parallelism in differ-
ent dimensions. For this layer, data parallelism achieves
suboptimal performance.
Figure 1. Execution time for parallelizing a convolutional layer
(Conv8 in VGG-16 (Simonyan & Zisserman, 2014)) on 4 GPUs
by using different dimensions. Parallelizing a layer in other
dimensions preserves the same output as parallelizing it in the
sample dimension. To achieve this, different GPUs may share some
common input data for parallelizations in the height, width, and
channel dimensions.
[Figure 2 panels: (a) Parallelism in the sample dimension, with a
transfer size of 408 MB on each of GPU 0 and GPU 1; (b) Parallelism
in the channel dimension, with a transfer size of 32 MB on each of
GPU 0 and GPU 1.]
Figure 2. Different ways to parallelize the first fully-connected
layer of VGG-16. Rectangles with solid lines indicate tensors
managed by the local GPU, while rectangles with dotted lines are
tensors managed by a remote GPU. The shadow rectangles indicate
data transfers in each step.
Second, exploring parallelism in other dimensions can reduce
communication costs. Figure 2 shows an example of
parallelizing a fully-connected layer on two GPUs in differ-
ent dimensions. In data parallelism (Figure 2a), each GPU
synchronizes the gradients of the entire fully-connected
layer (shown as the shadow rectangles) in every step. An
alternative approach (Figure 2b) parallelizes in the channel
dimension, which eliminates parameter synchronization, as
different GPUs train disjoint subsets of the parameters, but
introduces additional data transfers for input tensors (shown
as the shadow rectangles). For this particular case, using par-
allelism in the channel dimension reduces communication
costs by 12×.
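The trade-off described above can be sketched with a deliberately simplified volume count. The function names, the 4-byte element size, and the layer sizes below are illustrative assumptions, and the model ignores transfers in the backward pass, so it is not meant to reproduce the numbers in Figure 2.

```python
BYTES_PER_FLOAT = 4  # assumption: 32-bit parameters and activations

def sample_parallel_bytes(in_features, out_features):
    """Gradient bytes each GPU synchronizes per step when every GPU
    holds a full replica of the fully-connected layer."""
    return in_features * out_features * BYTES_PER_FLOAT

def channel_parallel_bytes(remote_samples, in_features):
    """Input-activation bytes each GPU fetches per step when the layer is
    split by output channel and part of the input batch lives remotely."""
    return remote_samples * in_features * BYTES_PER_FLOAT

# Hypothetical layer: 1000 inputs, 500 outputs, 50 samples held by the peer GPU.
sync = sample_parallel_bytes(1000, 500)
fetch = channel_parallel_bytes(50, 1000)
print(sync, fetch, sync / fetch)
```

Under this model the ratio is simply out_features / remote_samples, which is why layers with many parameters and modest batch sizes tend to favor the channel dimension.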
Third, the degree of parallelism (number of parallel devices)
is another dimension that affects runtime performance. Dif-
ferent layers have different execution time and communica-
tion costs and may prefer different degrees of parallelism.
Figure 3 shows the runtime performance of processing two
layers in Inception-v3 with different degrees of parallelism.
The convolutional layer performs best on 16 GPUs, while
the fully-connected layer performs best on 4 GPUs.
[Figure 3 plots: run time per step (milliseconds), split into
computation and communication, on 2, 4, 8, and 16 GPUs for the
convolutional layer (third layer) and the fully-connected layer
(last layer).]
Figure 3. Computation and communication time to process the
third layer and the last layer of Inception-v3 using data parallelism.
Table 1. Parallelizable dimensions for different layers. The length
dimension specifies a position in a 1D image.
Layer                     Parallelizable dimensions
Fully-connected           {sample, channel}
1D convolution/pooling    {sample, channel, length}
2D convolution/pooling    {sample, channel, height, width}
3D convolution/pooling    {sample, channel, height, width, depth}
4. Problem Definition
We define the parallelization problem with two graphs. The
first is a device graph that models all available hardware
devices and the connections between them. The second is
a computation graph that defines the neural network to be
mapped onto the device graph.
In a device graph D, each node di is a device (e.g., a CPU
or a GPU), and each edge (di, dj) is a connection between
di and dj with communication bandwidth b(di, dj). In a
computation graph G, each node li ∈ G is a layer in the
neural network, and each edge (li, lj) ∈ G is a tensor that is
an output of layer li and an input of layer lj.
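As a minimal sketch of the two graphs, the following uses plain dataclasses. The class names, field names, device names, and the bandwidth value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DeviceGraph:
    devices: list[str]
    # bandwidth[(di, dj)] is b(di, dj) in bytes per second
    bandwidth: dict[tuple[str, str], float] = field(default_factory=dict)

    def connect(self, a: str, b: str, bytes_per_second: float) -> None:
        """Add a symmetric connection between two devices."""
        self.bandwidth[(a, b)] = bytes_per_second
        self.bandwidth[(b, a)] = bytes_per_second

@dataclass
class ComputationGraph:
    layers: list[str]
    tensors: list[tuple[str, str]]  # (producer layer, consumer layer)

    def inputs_of(self, layer: str) -> list[str]:
        """Layers whose output tensors feed the given layer."""
        return [src for src, dst in self.tensors if dst == layer]

D = DeviceGraph(devices=["gpu0", "gpu1"])
D.connect("gpu0", "gpu1", 20e9)  # illustrative bandwidth, not a measured value

G = ComputationGraph(layers=["conv1", "pool1", "fc1"],
                     tensors=[("conv1", "pool1"), ("pool1", "fc1")])
print(G.inputs_of("fc1"))
```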
We now define the parallelization of a layer. To parallelize
a layer across multiple devices, we assume that different
devices can process the layer in parallel without any depen-
dencies. This requires different devices to compute disjoint
subsets of a layer’s output tensor. Therefore, we describe the
parallelization of a layer by defining how its output tensor
is partitioned.
For a layer li, we define its parallelizable dimensions Pi as
the set of all divisible dimensions in its output tensor. Pi
includes every dimension along which the layer li can be parallelized. Table 1
shows the parallelizable dimensions for different layers.
A parallelization configuration ci of a layer li defines how
li is parallelized across different devices. For each par-
allelizable dimension in Pi, ci includes a positive integer
that describes the degree of parallelism in that dimension.
For a configuration ci, the product of the integers over all
dimensions is the total degree of parallelism for li. We as-
sume equal partitioning in each parallelizable dimension,
which provides well-balanced workload among multiple de-
vices. Figure 4 demonstrates some possible configurations
for parallelizing a 2D convolutional layer over four devices.
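The configuration space for one layer can be enumerated with a few lines of code. This is a sketch under the assumption that the total degree of parallelism must equal the number of devices; the layer-type keys and the function name are hypothetical. Configurations that use fewer devices (such as those with a smaller total degree in Table 5) could be added by calling the function with smaller degrees.

```python
from itertools import product
from math import prod

# Parallelizable dimensions per layer type, following Table 1.
PARALLELIZABLE = {
    "fully_connected": ("sample", "channel"),
    "conv1d": ("sample", "channel", "length"),
    "conv2d": ("sample", "channel", "height", "width"),
    "conv3d": ("sample", "channel", "height", "width", "depth"),
}

def enumerate_configs(layer_type, total_degree):
    """All assignments of a positive integer to each parallelizable
    dimension whose product equals total_degree."""
    dims = PARALLELIZABLE[layer_type]
    divisors = [d for d in range(1, total_degree + 1) if total_degree % d == 0]
    return [dict(zip(dims, degrees))
            for degrees in product(divisors, repeat=len(dims))
            if prod(degrees) == total_degree]

conv_configs = enumerate_configs("conv2d", 4)
fc_configs = enumerate_configs("fully_connected", 4)
print(len(conv_configs), len(fc_configs))
```

For a 2D convolution on four devices this yields ten configurations, including the four shown in Figure 4.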
[Figure 4 panels, each drawn over the width, height, and channel
dimensions of one training sample: (a) (n=1, c=1, h=1, w=4);
(b) (n=1, c=1, h=4, w=1); (c) (n=1, c=4, h=1, w=1);
(d) (n=1, c=1, h=2, w=2).]
Figure 4. Example configurations to parallelize a 2D convolutional
layer in a single dimension or combinations of multiple dimensions.
The figure shows how each training sample is partitioned.
Parallelizing a layer in any configuration produces the same
output. This guarantees that all configurations parallelize
training on the original network and therefore maintains the
original network accuracy.
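The claim that every configuration produces the same output can be checked on a toy fully-connected layer. The sketch below uses pure Python lists with made-up integer values, and "devices" are simulated by separate function calls.

```python
def fc_forward(x, w):
    """Dense layer without bias: x is [samples][in], w is [in][out]."""
    return [[sum(xi * w[i][o] for i, xi in enumerate(row))
             for o in range(len(w[0]))]
            for row in x]

x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 samples, 2 input channels
w = [[1, 0, 2, 1], [0, 1, 1, 3]]       # 2 inputs, 4 output channels

full = fc_forward(x, w)

# Sample dimension, degree 2: each device gets half the samples and all of w.
by_sample = fc_forward(x[:2], w) + fc_forward(x[2:], w)

# Channel dimension, degree 2: each device gets all samples and half of the
# output channels (columns of w); results are concatenated per sample.
left = fc_forward(x, [row[:2] for row in w])
right = fc_forward(x, [row[2:] for row in w])
by_channel = [a + b for a, b in zip(left, right)]

print(by_sample == full, by_channel == full)
```

Note that the channel partition needs the full input on every device, which is the source of the extra input transfers discussed in Section 3.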
A parallelization strategy S includes a configuration ci for
each layer li ∈ G. Let t(G,D,S) denote the per-iteration
execution time to parallelize the computation graph G on
the device graph D by using strategy S. Our goal is to find a
parallelization strategy S such that the per-iteration training
time t(G,D,S) is minimized.
5. Method
5.1. Cost Model
We introduce a cost model to quantitatively evaluate the
runtime performance of different parallelization strategies and
use a dynamic-programming-based graph search algorithm
to find an optimal parallelization strategy under our cost
model. The cost model depends on the following assumptions:
1. For a layer li ∈ G, the time to process li is predictable
with low variance and is largely independent of the
contents of the input data.
2. For each connection (di, dj) between device di and
dj with bandwidth b, transferring a tensor of size s
from di to dj takes s/b time (i.e., the communication
bandwidth can be fully utilized).
3. The runtime system has negligible overhead. A device
begins processing a layer as soon as its input tensors
are available and the device has finished previous tasks.
Most layers in CNNs are based on dense matrix operations,
whose execution time satisfies the first assumption. In addi-
tion, the experiments show that our implementation satisfies
the second and third assumptions well enough to obtain
significant runtime performance improvements.
We define three cost functions on computation graphs:
1. For each layer li and its parallelization configuration
ci, tc(li, ci) is the time to process the layer li under
configuration ci. This includes both the forward and
back propagation time and is estimated by processing
the layer under that configuration multiple times on the
device and measuring the average execution time.
2. For each tensor e = (li, lj), tx(e, ci, cj) estimates the
time to transfer the input tensors to the target devices,
using the size of the data to be moved and the known
communication bandwidth.
3. For each layer li and its parallelization configuration ci,
ts(li, ci) is the time to synchronize the parameters in
layer li after back propagation. To complete parameter
synchronization, each device that holds a copy of the
parameters for layer li transfers its local gradients to a
parameter server that stores the up-to-date parameters
for layer li. After receiving the gradients for layer li,
the parameter server applies the gradients to the pa-
rameters and transfers the updated parameters back to
the device. In this process, the communication time
is much longer than the execution time to update pa-
rameters, therefore we use the communication time to
approximate the parameter synchronization time.
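The three cost functions can be sketched as follows. The function names are hypothetical. The transfer estimate follows the s/b assumption stated earlier; the synchronization estimate additionally assumes a single uncontended link to the parameter server and counts one gradient upload plus one parameter download, which is a simplification.

```python
import time

def measure_tc(run_layer, repeats=10):
    """Average wall-clock time of run_layer(), a callable that performs the
    forward and backward pass of one layer under one configuration."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_layer()
    return (time.perf_counter() - start) / repeats

def estimate_tx(num_bytes, bandwidth):
    """Transfer time for a tensor of num_bytes over a link with the given
    bandwidth (bytes per second), assuming full utilization."""
    return num_bytes / bandwidth

def estimate_ts(param_bytes, bandwidth):
    """Synchronization time approximated by communication only:
    gradients to the parameter server, updated parameters back."""
    return 2 * param_bytes / bandwidth

print(estimate_tx(8_000, 4_000), estimate_ts(8_000, 4_000))
```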
Using the three cost functions above, we define
$$
t_o(G, D, S) = \sum_{l_i \in G} \{ t_c(l_i, c_i) + t_s(l_i, c_i) \} + \sum_{e = (l_i, l_j) \in G} t_x(e, c_i, c_j) \tag{1}
$$
to(G,D,S) estimates the per-step execution time for par-
allelization strategy S, which includes forward processing,
back propagation, and parameter synchronization.
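A small worked instance of Equation 1 may help. All cost values below are invented for illustration (arbitrary time units), and the two configuration labels "data" and "model" stand in for the full configuration space.

```python
def per_step_cost(strategy, tc, ts, tx, edges):
    """Equation 1: sum of t_c + t_s over layers plus t_x over edges."""
    layer_cost = sum(tc[l][c] + ts[l][c] for l, c in strategy.items())
    transfer_cost = sum(tx[(u, v)][(strategy[u], strategy[v])] for u, v in edges)
    return layer_cost + transfer_cost

# Hypothetical cost tables for a two-layer network conv -> fc.
tc = {"conv": {"data": 10, "model": 14}, "fc": {"data": 4, "model": 4}}
ts = {"conv": {"data": 2, "model": 0}, "fc": {"data": 20, "model": 0}}
tx = {("conv", "fc"): {("data", "data"): 0, ("data", "model"): 3,
                       ("model", "data"): 3, ("model", "model"): 5}}
edges = [("conv", "fc")]

all_data = per_step_cost({"conv": "data", "fc": "data"}, tc, ts, tx, edges)
all_model = per_step_cost({"conv": "model", "fc": "model"}, tc, ts, tx, edges)
mixed = per_step_cost({"conv": "data", "fc": "model"}, tc, ts, tx, edges)
print(all_data, all_model, mixed)  # 36 23 19
```

With these made-up numbers, the mixed strategy is cheaper than either uniform strategy, which is the situation layer-wise parallelism is designed to exploit.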
5.2. Graph Search
Equation 1 expresses the problem of finding an optimal par-
allelization strategy as a graph search problem: our goal is to
find a strategy S so that the overall runtime cost to(G,D,S)
is minimized.
[Figure 5 diagrams: (a) Node elimination: node lj, with in-edge e1
from li and out-edge e2 to lk, is replaced by a single edge e′ from
li to lk. (b) Edge elimination: two edges e1 and e2 from li to lj
are replaced by a single edge e′.]
Figure 5. Node and edge elimination on a computation graph.
Since layer-wise parallelism allows each layer to use an indi-
vidual configuration, the number of potential parallelization
strategies is exponential in the number of layers in a com-
putation graph, which makes it impractical to enumerate
all strategies for large CNNs. However, the CNNs we have
seen in practice exhibit strong locality: each layer is only
connected to a few layers with similar depths in a computa-
tion graph. Based on this observation, we use the following
two graph reductions to iteratively simplify a computation
graph while preserving optimal parallelization strategies.
Node elimination. If a computation graph G includes a
node lj with a single in-edge e1 = (li, lj) and a single
out-edge e2 = (lj , lk), a node elimination removes node
lj and the two edges e1 and e2 from G, inserts a new edge
e′ = (li, lk) back into G, and returns the modified graph
(Figure 5a). We define tx(e′, ·, ·) in a way that preserves
optimal parallelization strategies (see Theorem 1).
$$
t_x(e', c_i, c_k) = \min_{c_j} \{ t_c(l_j, c_j) + t_s(l_j, c_j) + t_x(e_1, c_i, c_j) + t_x(e_2, c_j, c_k) \} \tag{2}
$$
Intuitively, we use dynamic programming to compute an
optimal configuration cj for node lj for every possible com-
bination of ci and ck and use the cost functions associated
with lj to define tx(e′, ci, ck).
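Equation 2 can be read as a min-plus product of the two edge tables with the node cost added in. The sketch below stores tables as nested lists indexed by configuration number; the function name and the numeric values are illustrative assumptions.

```python
def eliminate_node(node_cost, tx1, tx2):
    """Equation 2 for one eliminated node l_j.

    node_cost[cj] = t_c(l_j, cj) + t_s(l_j, cj)
    tx1[ci][cj]   = t_x(e1, ci, cj)
    tx2[cj][ck]   = t_x(e2, cj, ck)
    Returns (tx_new, best): tx_new[ci][ck] is the cost on the new edge and
    best[ci][ck] is the configuration cj that attains it.
    """
    n_i, n_j, n_k = len(tx1), len(node_cost), len(tx2[0])
    tx_new = [[0] * n_k for _ in range(n_i)]
    best = [[0] * n_k for _ in range(n_i)]
    for ci in range(n_i):
        for ck in range(n_k):
            costs = [node_cost[cj] + tx1[ci][cj] + tx2[cj][ck]
                     for cj in range(n_j)]
            best[ci][ck] = min(range(n_j), key=costs.__getitem__)
            tx_new[ci][ck] = costs[best[ci][ck]]
    return tx_new, best

tx_new, best = eliminate_node(
    node_cost=[5, 3],
    tx1=[[0, 4], [4, 0]],
    tx2=[[0, 2], [2, 0]],
)
print(tx_new)  # [[5, 7], [5, 3]]
print(best)    # [[0, 0], [1, 1]]
```

The three nested loops over ci, ck, and cj make the cubic cost per elimination visible. Keeping the `best` table is what allows the eliminated node's configuration to be recovered later.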
Theorem 1. Assume G′ = NodeElimination(G) and lj is the
eliminated node. If So′ is an optimal parallelization strategy
for G′, then So = So′ + cj is an optimal parallelization
strategy for G, where cj minimizes Equation 2.
Edge elimination. If a computation graph G includes two
edges with the same source and destination nodes (i.e.,
e1 = (li, lj) and e2 = (li, lj)), an edge elimination removes
e1 and e2 from G and inserts a new edge e′ = (li, lj)
into G (Figure 5b). We define tx(e′, ·, ·) using tx(e1, ·, ·)
and tx(e2, ·, ·).
$$
t_x(e', c_i, c_j) = t_x(e_1, c_i, c_j) + t_x(e_2, c_i, c_j) \tag{3}
$$
Theorem 2. Assume G′ = EdgeElimination(G). If So′ is
an optimal parallelization strategy for G′, then So = So′ is
an optimal parallelization strategy for G.
Algorithm 1 Finding Optimal Parallelization Strategy S.
 1: Input: A computation graph G, a device graph D, and
    precomputed cost functions (i.e., tc(·), ts(·), and tx(·))
 2: Output: A parallelization strategy S minimizing to(G,D,S)
 3:
 4: G(0) = G
 5: m = 0
 6: while true do
 7:     G(m+1) = NODEELIMINATION(G(m))
 8:     G(m+2) = EDGEELIMINATION(G(m+1))
 9:     if G(m+2) = G(m) then
10:         break
11:     end if
12:     m = m + 2
13: end while
14: Find the optimal strategy S(m) for G(m) by enumerating
    all possible candidate strategies
15: for i = m-1 to 0 do
16:     if G(i+1) = NODEELIMINATION(G(i)) then
17:         ▷ Assume lj is the node eliminated from G(i)
18:         Find cj that minimizes Equation 2
19:         S(i) = S(i+1) + cj
20:     else
21:         S(i) = S(i+1)
22:     end if
23: end for
24: return S(0)
We formally define node and edge eliminations and prove
Theorems 1 and 2 in the appendix (footnote 4). The two theorems show
that given an optimal parallelization strategy for the modi-
fied graph, we can easily construct an optimal strategy for
the original graph.
Algorithm 1 shows pseudocode using node and edge elim-
inations as subroutines to find an optimal parallelization
strategy under our cost model. The algorithm first itera-
tively uses node and edge eliminations to simplify an input
computation graph until neither elimination can be applied
(lines 4-13). Figure 6 demonstrates how node and edge elim-
inations are performed on an Inception module (Szegedy
et al., 2016).
After the elimination phase, the algorithm enumerates all po-
tential strategies for the final graph G(m) and chooses S(m)
that minimizes to(G(m),D,S(m)) (line 14). After deciding
the configuration for each node in G(m), we then decide
the configurations for the eliminated nodes by iteratively
undoing the eliminations in reverse order (lines 15-23).
Theorems 1 and 2 guarantee that S(i) is an optimal strategy for
Footnote 4: An extended version of this paper with proofs is available at
https://arxiv.org/abs/1802.04924.
[Figure 6 diagrams of an Inception module between two Concat nodes:
(a) Initial graph; (b) Node elimination; (c) Edge elimination;
(d) Node elimination; (e) Edge elimination.]
Figure 6. Iteratively performing node/edge eliminations on an Inception module.
Table 2. Time complexity of Algorithm 1. C is the maximum
number of potential configurations for a layer. N and E are the
number of nodes and edges in G, respectively. K is the number of
nodes in the final graph after node and edge eliminations.
Step                                               Time Complexity
Performing node and edge eliminations              O(EC^3)
Finding the optimal strategy for the final graph   O(KC^K)
Undoing node and edge eliminations                 O(EC)
Overall                                            O(EC^3 + KC^K)
Table 3. Execution time for finding the optimal parallelization strat-
egy for 4 GPUs. Note that the number of nodes in the final graph
(i.e., K) is equal to 2 for all networks. For LeNet-5 and AlexNet,
the two algorithms find the same optimal strategies.
Network           # Layers   Baseline      Our Algorithm
LeNet-5           6          5.6 seconds   0.01 seconds
AlexNet           11         2.1 hours     0.02 seconds
VGG-16            21         > 24 hours    0.1 seconds
Inception-v3      102        > 24 hours    0.4 seconds
Time complexity              O(EC^N)       O(EC^3)
G(i) (0 ≤ i ≤ m). Finally, S(0) is an optimal parallelization
strategy for the original graph G.
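The overall procedure can be sketched end to end on a small graph and checked against exhaustive search. This is an illustrative reimplementation, not the authors' code: it assumes every layer has the same number of configurations, uses random integer costs in place of measured ones, and all function and variable names are hypothetical.

```python
import itertools
import random

def total_cost(node_cost, edges, strategy):
    """Equation 1: per-layer costs (t_c + t_s) plus per-edge costs t_x."""
    layers = sum(costs[strategy[n]] for n, costs in node_cost.items())
    transfers = sum(tab[strategy[u], strategy[v]] for u, v, tab in edges)
    return layers + transfers

def enumerate_best(node_cost, edges, num_cfg):
    """Exhaustive search; used on the final graph and as a reference."""
    names = list(node_cost)
    candidates = (dict(zip(names, combo)) for combo in
                  itertools.product(range(num_cfg), repeat=len(names)))
    return min(candidates, key=lambda s: total_cost(node_cost, edges, s))

def find_strategy(node_cost, edges, num_cfg):
    node_cost, edges, undo = dict(node_cost), list(edges), []
    cfgs = range(num_cfg)
    changed = True
    while changed:
        changed = False
        # Node elimination (Equation 2).
        for j in list(node_cost):
            ins = [e for e in edges if e[1] == j]
            outs = [e for e in edges if e[0] == j]
            if len(ins) != 1 or len(outs) != 1:
                continue
            (i, _, t1), (_, k, t2) = ins[0], outs[0]
            new_tab, argmin = {}, {}
            for ci, ck in itertools.product(cfgs, cfgs):
                cost = lambda cj: node_cost[j][cj] + t1[ci, cj] + t2[cj, ck]
                argmin[ci, ck] = min(cfgs, key=cost)
                new_tab[ci, ck] = cost(argmin[ci, ck])
            edges = [e for e in edges if e is not ins[0] and e is not outs[0]]
            edges.append((i, k, new_tab))
            del node_cost[j]
            undo.append((j, i, k, argmin))
            changed = True
        # Edge elimination (Equation 3): sum tables of parallel edges.
        merged = {}
        for u, v, tab in edges:
            if (u, v) in merged:
                merged[u, v] = {key: merged[u, v][key] + tab[key] for key in tab}
                changed = True
            else:
                merged[u, v] = tab
        edges = [(u, v, tab) for (u, v), tab in merged.items()]
    # Enumerate strategies for the final graph, then undo the eliminations.
    strategy = enumerate_best(node_cost, edges, num_cfg)
    for j, i, k, argmin in reversed(undo):
        strategy[j] = argmin[strategy[i], strategy[k]]
    return strategy

# Diamond-shaped graph a -> {b, c} -> d -> e with random integer costs.
rng = random.Random(0)
NUM_CFG = 3
names = ["a", "b", "c", "d", "e"]
node_cost = {n: [rng.randint(1, 20) for _ in range(NUM_CFG)] for n in names}

def random_table():
    return {(x, y): rng.randint(0, 20)
            for x in range(NUM_CFG) for y in range(NUM_CFG)}

edges = [(u, v, random_table()) for u, v in
         [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d"), ("d", "e")]]

found = find_strategy(node_cost, edges, NUM_CFG)
reference = enumerate_best(node_cost, edges, NUM_CFG)
print(total_cost(node_cost, edges, found), total_cost(node_cost, edges, reference))
```

On this graph the eliminations leave only nodes a and e, so the final enumeration covers two nodes, and the cost of the recovered strategy matches the exhaustive optimum.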
Time complexity. Table 2 shows the time complexity of Al-
gorithm 1. Performing a node or edge elimination requires
computing Equation 2 or 3 for the inserted edge, which takes
O(C^3) and O(C^2) time, respectively. The total number of
node and edge eliminations is smaller than E, since an
elimination reduces the number of edges in the graph by one.
Therefore, the time complexity for performing and undoing
node and edge eliminations is O(EC^3) and O(EC),
respectively. The algorithm enumerates all possible strategies for
the final graph G(m), which takes O(KC^K) time. The
algorithm works efficiently on a wide range of real-world CNNs
including AlexNet (Krizhevsky et al., 2012), VGG (Si-
monyan & Zisserman, 2014), Inception-v3 (Szegedy et al.,
2016), and ResNet (He et al., 2016), all of which are reduced
to a final graph with only 2 nodes (i.e., K = 2).
We compare Algorithm 1 with a baseline that uses
depth-first search to find an optimal strategy for
the original graph G. Table 3 compares the time complex-
ity and actual execution time of the two algorithms. Our
algorithm achieves lower time complexity and reduces the
execution time by orders of magnitude over the baseline.
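As a rough illustration of the gap between the two complexities, assume (hypothetically) C = 10 configurations per layer and a chain-shaped network with N = 21 layers and E = 20 edges that reduces to K = 2 nodes. These values are chosen for arithmetic convenience and are not taken from the experiments.

```latex
\begin{align*}
\text{Exhaustive search:}\quad & C^{N} = 10^{21} \text{ candidate strategies} \\
\text{Algorithm 1:}\quad & E\,C^{3} + K\,C^{K} = 20 \cdot 10^{3} + 2 \cdot 10^{2} = 20{,}200 \text{ basic steps}
\end{align*}
```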
6. Experiments
We found that it is non-trivial to parallelize a layer in the
height, width, and channel dimensions in existing frame-
works (e.g., TensorFlow, PyTorch, and Caffe2), and none
provides an interface for controlling parallelization at the
granularity of individual layers. Therefore, we implemented
our framework in Legion (Bauer et al., 2012), a high-
performance parallel runtime for distributed heterogeneous
architectures, and use cuDNN (Chetlur et al., 2014) and
cuBLAS (cub, 2016) as the underlying libraries to process
neural network layers. The following Legion features signif-
icantly simplify our implementation. First, Legion supports
high-dimensional partitioning that allows us to parallelize
any layer in any combination of the dimensions. Second,
Legion permits control of parallelization at the granularity
of each layer. Third, Legion allows fine-grain control over
the placement of tasks and data. Fourth, the underlying
implementation of Legion automatically and systematically
overlaps communication with computation and optimizes
the path and pipelining of data movement across the ma-
chine (Treichler et al., 2014; Jia et al., 2018; 2017).
Benchmarks. We evaluate our approach on three estab-
lished CNNs. AlexNet (Krizhevsky et al., 2012) is the
winner of the ILSVRC-2012 image classification compe-
tition. VGG-16 (Simonyan & Zisserman, 2014) improves
network accuracy by pushing the depth of the network to
16 weighted layers. Inception-v3 (Szegedy et al., 2016) is a
102-layer deep CNN that uses carefully designed Inception
modules to increase the number of layers while maintaining
a reasonable computational budget.
Datasets. We evaluate the runtime performance of all three
CNNs on the ImageNet-1K dataset (Deng et al., 2009) that
consists of 1.2 million images from 1,000 categories.
Baselines. We compare the following parallelization strate-
gies in the experiments.
1. Data parallelism is the most common parallelization
strategy for large-scale training (Goyal et al., 2017;
Abadi et al., 2016). In data parallelism, each device
has a copy of the entire network and processes a subset
of the training dataset.
2. Model parallelism. We use a model parallelism ap-
proach (Krizhevsky, 2014) as a baseline, which dis-
tributes the network parameters in each layer equally
to all devices, providing good load balancing.
3. OWT parallelism is designed to reduce communica-
tion costs by using data parallelism for convolutional
and pooling layers and switching to model parallelism
for densely-connected layers.
4. Layer-wise parallelism. Given a computation graph
and a set of available devices, we run the algorithm
described in Section 5 to find a parallelization strategy
minimizing Equation 1.
Experimental setup. All experiments were performed on
a GPU cluster with 4 compute nodes, each of which is
equipped with two Intel 10-core E5-2600 CPUs, 256 GB of main
memory, and four NVIDIA Tesla P100 GPUs. GPUs on the
same node are connected by NVLink, and nodes are connected
over 100 Gb/s EDR InfiniBand. We use synchronous
training and a per-GPU batch size of 32 for all experiments.
To rule out implementation differences, we ran data paral-
lelism experiments in TensorFlow r1.7, PyTorch v0.3, and
our implementation and compared the runtime performance.
Our Legion-based framework achieves the same or better
runtime performance on all three CNNs compared to Ten-
sorFlow and PyTorch, and therefore we report the data par-
allelism performance achieved by our framework.
6.1. Runtime Performance
We compare the training throughput and communication
cost among different parallelization strategies. Figure 7
shows the training throughputs with different CNNs and
different sets of available devices. Both model and data
parallelism scale well on a single node but show limited
scalability in distributed training, where the inter-node com-
munication limits runtime performance. OWT parallelism
achieves improved runtime performance by switching to
model parallelism for densely-connected layers to reduce
communication costs. In all experiments, layer-wise par-
allelism consistently outperforms the other strategies and
increases the training throughput by up to 2.2×, 1.5×, and
1.4× for AlexNet, VGG-16, and Inception-v3, respectively.
In addition, layer-wise parallelism achieves better scalability
than the other strategies. Scaling the training of the three
CNNs from 1 GPU to 16 GPUs (on 4 nodes), layer-wise
Table 4. Relative difference between estimated execution time
to(G, C) and actual execution time t(G, C).
                      (to(G, C) − t(G, C)) / t(G, C)
Available Devices     AlexNet   VGG-16   Inception-v3
1 GPU (1 node)        1%        0%       1%
2 GPUs (1 node)       4%        3%       5%
4 GPUs (1 node)       -5%       2%       5%
8 GPUs (2 nodes)      2%        6%       9%
16 GPUs (4 nodes)     -1%       7%       6%
Table 5. An optimal parallelization strategy under the cost model
for parallelizing VGG-16 on 4 GPUs on a single compute node.
Layers                Parallelization Configuration
2 x Conv + Pooling    {n=4, h=1, w=1, c=1}
2 x Conv + Pooling    {n=4, h=1, w=1, c=1}
3 x Conv + Pooling    {n=4, h=1, w=1, c=1}
3 x Conv + Pooling    {n=4, h=1, w=1, c=1}
3 x Conv + Pooling    {n=1, h=2, w=2, c=1}
Fully-connected       {n=1, c=4}
Fully-connected       {n=1, c=4}
Fully-connected       {n=1, c=2}
Softmax               {n=1, c=1}
parallelism achieves 12.2×, 14.8×, and 15.5× speedup for
AlexNet, VGG-16, and Inception-v3, respectively, while
the best other strategy achieves 6.1×, 10.2×, and 11.2×
speedup. Moreover, Figure 7 shows that layer-wise
parallelism can help bridge the runtime performance gap
between the ideal training throughputs in linear scale (the
red lines) and the actual training throughputs achieved by
current parallelization strategies. This shows that layer-wise
parallelism is more efficient for large-scale training.
Communication cost is another important performance met-
ric in large-scale training. Figure 8 compares the communi-
cation costs of different strategies. OWT parallelism elimi-
nates gradient synchronization for densely-connected layers
and reduces overall communication costs by 1.1-23.0×
compared to data and model parallelism. In addition, layer-wise
parallelism outperforms OWT parallelism by further
reducing communication overhead by 1.2-2.5×.
6.2. Cost Model
We compare the estimated execution time $t_o(G, D, C)$ projected
by our cost model (see Equation 1) with the measured
per-step execution time $t(G, D, C)$ in the experiments. The
results are shown in Table 4. In all experiments, the relative
difference between the estimated and the real execution
time is within 10%, showing that the cost model can reliably
predict a CNN’s per-step execution time given the set of
available devices and the connections between them.
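As a small illustration of the metric reported in Table 4, the following sketch computes the signed relative difference between an estimated and a measured per-step time. The function name and the timing values are hypothetical placeholders chosen for illustration; they are not measurements from the experiments.

```python
def relative_difference(estimated: float, measured: float) -> float:
    """Return (t_o - t) / t, the signed relative error of the estimate."""
    if measured <= 0:
        raise ValueError("measured time must be positive")
    return (estimated - measured) / measured


# Hypothetical per-step times in seconds (illustrative only).
estimated_s = 0.105
measured_s = 0.100

diff = relative_difference(estimated_s, measured_s)
print(f"relative difference: {diff:+.1%}")
```

A positive value means the cost model overestimates the step time, and a negative value means it underestimates it.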
6.3. Analysis of Optimal Parallelization Strategies
We analyze the optimal parallelization strategies under our
cost model and find several similarities among them.
[Figure 7: three panels (AlexNet, VGG-16, Inception-v3). x-axis: number of GPUs with the number of compute nodes in parentheses: 1 GPU (1), 2 GPUs (1), 4 GPUs (1), 8 GPUs (2), 16 GPUs (4). y-axis: training throughput (images per second). Legend: Model Parallelism, Data Parallelism, OWT Parallelism, Layer-wise Parallelism.]
Figure 7. Training throughput (i.e., number of images processed per second) with different parallelization strategies (higher is better).
Numbers in parentheses are the number of compute nodes used in the experiments. The red lines show the training throughput in linear
scale (ideal case).
[Figure 8: three panels (AlexNet, VGG-16, Inception-v3). x-axis: number of GPUs with the number of compute nodes in parentheses: 1 GPU (1), 2 GPUs (1), 4 GPUs (1), 8 GPUs (2), 16 GPUs (4). y-axis: total data transferred per step (GB). Legend: Model Parallelism, Data Parallelism, OWT Parallelism, Layer-wise Parallelism.]
Figure 8. Communication cost (i.e., data transferred in each step) with different parallelization strategies (lower is better).
First, for the beginning layers of a CNN with large
height/width dimensions and a small channel dimension,
an optimal strategy usually uses data parallelism on all avail-
able devices, since the communication costs for synchro-
nizing gradients are much smaller than the communication
costs for moving tensors between layers.
Second, deeper layers in a CNN tend to have smaller
height/width dimensions and a larger channel dimension.
As a result, the costs for moving tensors between different
layers decrease, while the costs for synchronizing param-
eters increase. An optimal strategy adaptively reduces the
number of devices for these layers to reduce communication
costs to synchronize parameters and opportunistically uses
parallelism in the height/width dimensions to achieve better
runtime performance.
Finally, for densely-connected layers, an optimal strategy
eventually switches to model parallelism on a small number
of devices, because synchronizing gradients and transferring
tensors are both much more expensive than the execution
time for densely-connected layers. This reduces the com-
munication costs for synchronizing parameters and moving
tensors at the cost of only using a subset of available devices.
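The trade-off behind these three observations can be made concrete with a little arithmetic. The sketch below counts parameters (which drive gradient synchronization) and per-sample output activations (which drive tensor movement between layers) for three layers of the standard VGG-16 architecture. The layer names, the 224 x 224 input resolution, and the helper functions are assumptions made for illustration, and the counts are only a rough proxy for the communication terms of the cost model.

```python
def conv_sizes(k, c_in, c_out, h_out, w_out):
    """Return (parameter count, output activations per sample) for a k x k convolution."""
    params = k * k * c_in * c_out + c_out  # weights plus biases
    activations = h_out * w_out * c_out
    return params, activations


def fc_sizes(n_in, n_out):
    """Return (parameter count, output activations per sample) for a dense layer."""
    params = n_in * n_out + n_out  # weights plus biases
    return params, n_out


# Standard VGG-16 shapes, assuming a 224 x 224 RGB input.
layers = {
    "conv1_1": conv_sizes(3, 3, 64, 224, 224),   # first convolution
    "conv5_3": conv_sizes(3, 512, 512, 14, 14),  # last convolution
    "fc6": fc_sizes(7 * 7 * 512, 4096),          # first densely-connected layer
}

for name, (params, acts) in layers.items():
    print(f"{name}: {params:,} parameters, {acts:,} output activations per sample")
```

The first convolution has few parameters and very large activations, which favors data parallelism. The first densely-connected layer has about one hundred million parameters and only a few thousand activations per sample, which makes gradient synchronization the dominant cost.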
Table 5 shows an optimal strategy under the cost model
for parallelizing VGG-16 on 4 GPUs. This strategy first
uses parallelism in the sample dimension for the beginning
convolutional and pooling layers and then uses parallelism
in both the height and width dimensions to accelerate the
last three convolutional layers. For the fully-connected
layers, it uses parallelism in the channel dimension to reduce
communication costs and adaptively decreases the degrees
of parallelism.
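The strategy in Table 5 can also be written down as plain data. The sketch below encodes each layer group with its parallelization configuration and computes the number of partitions as the product of the per-dimension degrees. Treating that product as the number of devices used by a layer is an assumption of this sketch, as are the variable and function names.

```python
from math import prod

# Strategy from Table 5 (VGG-16 on 4 GPUs); values are degrees of parallelism
# in the sample (n), height (h), width (w), and channel (c) dimensions.
strategy = [
    ("2 x Conv + Pooling", {"n": 4, "h": 1, "w": 1, "c": 1}),
    ("2 x Conv + Pooling", {"n": 4, "h": 1, "w": 1, "c": 1}),
    ("3 x Conv + Pooling", {"n": 4, "h": 1, "w": 1, "c": 1}),
    ("3 x Conv + Pooling", {"n": 4, "h": 1, "w": 1, "c": 1}),
    ("3 x Conv + Pooling", {"n": 1, "h": 2, "w": 2, "c": 1}),
    ("Fully-connected", {"n": 1, "c": 4}),
    ("Fully-connected", {"n": 1, "c": 4}),
    ("Fully-connected", {"n": 1, "c": 2}),
    ("Softmax", {"n": 1, "c": 1}),
]


def devices_used(config):
    """Number of partitions of a layer: the product of its per-dimension degrees."""
    return prod(config.values())


for name, config in strategy:
    print(f"{name:20s} {config} -> {devices_used(config)} partition(s)")
```

Under this reading, the convolutional layers and the first two fully-connected layers use all 4 GPUs, while the last fully-connected layer and the softmax layer use 2 and 1, respectively.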
7. Conclusion
We have introduced layer-wise parallelism, which allows
each layer in a CNN to use an individual parallelization
configuration. We propose a cost model that quantitatively
evaluates the runtime performance of different strategies and
use a dynamic programming based graph search algorithm
to find a globally optimal strategy under the cost model. Our
experiments show that layer-wise parallelism significantly
outperforms state-of-the-art strategies for CNNs by increas-
ing training throughput, reducing communication costs, and
achieving better scalability on larger numbers of devices.
Acknowledgements
This research was supported by NSF grant CCF-1160904
and the Exascale Computing Project (17-SC-20-SC), a col-
laborative effort of the U.S. Department of Energy Office of
Science and the National Nuclear Security Administration.
References
Dense Linear Algebra on GPUs. https://developer.nvidia.com/cublas, 2016.
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean,
J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur,
M., Levenberg, J., Monga, R., Moore, S., Murray, D. G.,
Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke,
M., Yu, Y., and Zheng, X. Tensorflow: A system for
large-scale machine learning. In Proceedings of the 12th
USENIX Conference on Operating Systems Design and
Implementation, OSDI, 2016.
Alvarez, J. M. and Salzmann, M. Learning the number of
neurons in deep networks. In Proceedings of the 29th
International Conference on Neural Information Process-
ing Systems, NIPS, 2016.
Bauer, M., Treichler, S., Slaughter, E., and Aiken, A. Le-
gion: Expressing locality and independence with logical
regions. In Proceedings of the International Conference
on High Performance Computing, Networking, Storage
and Analysis, SC, 2012.
Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran,
J., Catanzaro, B., and Shelhamer, E. cudnn: Efficient
primitives for deep learning. CoRR, abs/1410.0759, 2014.
URL http://arxiv.org/abs/1410.0759.
Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M.,
Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker,
P., Yang, K., and Ng, A. Y. Large scale distributed deep
networks. In Proceedings of the International Conference
on Neural Information Processing Systems, NIPS, 2012.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. ImageNet: A large-scale hierarchical image database.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, CVPR, 2009.
Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P.,
Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and
He, K. Accurate, large minibatch SGD: training ima-
genet in 1 hour. CoRR, abs/1706.02677, 2017. URL
http://arxiv.org/abs/1706.02677.
Han, S., Pool, J., Tran, J., and Dally, W. J. Learning both
weights and connections for efficient neural networks.
In Proceedings of the 28th International Conference on
Neural Information Processing Systems, NIPS, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-
ing for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
CVPR, 2016.
Jia, Z., Kwon, Y., Shipman, G., McCormick, P., Erez, M.,
and Aiken, A. A distributed multi-gpu system for fast
graph processing. PVLDB, 11(3), 2017.
Jia, Z., Treichler, S., Shipman, G., McCormick, P., and
Aiken, A. Isometry: A path-based distributed data trans-
fer system. In Proceedings of the International Confer-
ence on Supercomputing, ICS, 2018.
Kim, J., Park, Y., Kim, G., and Hwang, S. J. SplitNet:
Learning to semantically split deep networks for parame-
ter reduction and model parallelization. In Proceedings of
the 34th International Conference on Machine Learning,
ICML, 2017.
Krizhevsky, A. One weird trick for parallelizing convo-
lutional neural networks. CoRR, abs/1404.5997, 2014.
URL http://arxiv.org/abs/1404.5997.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet
classification with deep convolutional neural networks.
In Proceedings of the 25th International Conference on
Neural Information Processing Systems, NIPS, 2012.
Lawrence, S., Giles, C. L., Tsoi, A. C., and Back, A. D. Face
recognition: A convolutional neural-network approach.
IEEE transactions on neural networks, 1997.
Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen,
R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and
Dean, J. Device placement optimization with reinforce-
ment learning. In Proceedings of the 34th International
Conference on Machine Learning, ICML, 2017.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
Van Den Driessche, G., Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V., Lanctot, M., et al. Mastering the
game of go with deep neural networks and tree search.
Nature, 529:484–489, 2016.
Simonyan, K. and Zisserman, A. Very deep convolu-
tional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,
Z. Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, CVPR, 2016.
Treichler, S., Bauer, M., and Aiken, A. Realm: An event-
based low-level runtime for distributed memory architec-
tures. In Proceedings of the International Conference on
Parallel Architecture and Compilation Techniques, PACT,
2014.
Wang, T., Wu, D. J., Coates, A., and Ng, A. Y. End-to-
end text recognition with convolutional neural networks.
In Proceedings of the 21st International Conference on
Pattern Recognition, ICPR, 2012.
Zeiler, M. D. and Fergus, R. Visualizing and understanding
convolutional networks. In Proceedings of the European
Conference on Computer Vision, ECCV, 2014.
Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu,
Z., Wei, J., Xie, P., and Xing, E. P. Poseidon: An efficient
communication architecture for distributed deep learning
on GPU clusters. In 2017 USENIX Annual Technical
Conference, ATC, 2017.
A. Node and Edge Eliminations
We define node and edge eliminations in Algorithm 2.
Algorithm 2 Node and edge eliminations.
function NODEELIMINATION(G)
    if there exists a node l_j with a single in-edge e_1 = (l_i, l_j)
       and a single out-edge e_2 = (l_j, l_k) then
        e' = (l_i, l_k)
        G' = G − l_j − e_1 − e_2 + e'
        return G'
    else
        return G
    end if
end function

function EDGEELIMINATION(G)
    if there exist two edges e_1 = (l_i, l_j) and e_2 = (l_i, l_j) then
        e' = (l_i, l_j)
        G' = G − e_1 − e_2 + e'
        return G'
    else
        return G
    end if
end function
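The two eliminations in Algorithm 2 can be sketched in Python as follows. The graph representation (a node list plus an edge list in which parallel edges are allowed), the function names, and the `simplify` driver that applies the eliminations until nothing changes are assumptions of this sketch. It covers only the graph rewriting, not the accompanying updates of the transfer-cost functions for the new edge e'.

```python
from collections import Counter


def node_elimination(nodes, edges):
    """Remove one node with exactly one in-edge and one out-edge, if any exists."""
    for lj in nodes:
        ins = [e for e in edges if e[1] == lj]
        outs = [e for e in edges if e[0] == lj]
        if len(ins) == 1 and len(outs) == 1:
            li, lk = ins[0][0], outs[0][1]
            new_nodes = [n for n in nodes if n != lj]
            # Drop e1 and e2 (the only edges touching lj) and add e' = (li, lk).
            new_edges = [e for e in edges if lj not in e] + [(li, lk)]
            return new_nodes, new_edges
    return nodes, edges


def edge_elimination(nodes, edges):
    """Merge one pair of parallel edges (same source and destination), if any exists."""
    for e, count in Counter(edges).items():
        if count >= 2:
            new_edges = list(edges)
            new_edges.remove(e)  # remove e1
            new_edges.remove(e)  # remove e2
            new_edges.append(e)  # add the merged edge e'
            return nodes, new_edges
    return nodes, edges


def simplify(nodes, edges):
    """Apply node and edge eliminations repeatedly until neither applies."""
    while True:
        new_nodes, new_edges = node_elimination(nodes, edges)
        if len(new_nodes) == len(nodes):
            new_nodes, new_edges = edge_elimination(nodes, edges)
            if len(new_edges) == len(edges):
                return nodes, edges
        nodes, edges = new_nodes, new_edges


# A diamond-shaped graph: a -> b -> d and a -> c -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]
final_nodes, final_edges = simplify(nodes, edges)
print(final_nodes, final_edges)
```

On the diamond graph, two node eliminations remove `b` and `c` and leave two parallel edges from `a` to `d`, which one edge elimination then merges into a single edge.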
Theorem 3. Assume $G' = \mathrm{NodeElimination}(G)$ and $l_j$ is the
eliminated layer. If $S_o'$ is an optimal strategy for $G'$, then
$S_o = S_o' + \hat{c}_j$ is an optimal strategy for $G$, where
$$\hat{c}_j = \arg\min_{c_j} \{ t_c(l_j, c_j) + t_s(l_j, c_j) + t_x(e_1, c_i, c_j) + t_x(e_2, c_j, c_k) \} \tag{4}$$
Proof. It suffices to prove that $t_o(G, S_1) \ge t_o(G, S_o)$ for
any other strategy $S_1$. We assume layer $l_i$ has parallelization
configuration $c_{i1} \in S_1$. We claim that
$$\begin{aligned}
t_o(G, S_1) &\ge t_o(G', S_1) && (5) \\
            &\ge t_o(G', S_o') && (6) \\
            &= t_o(G, S_o) && (7)
\end{aligned}$$
To prove (5), note that the difference between $t_o(G, S_1)$ and
$t_o(G', S_1)$ is
$$\begin{aligned}
& t_o(G, S_1) - t_o(G', S_1) \\
={} & t_c(l_j, c_{j1}) + t_s(l_j, c_{j1}) + t_x(e_1, c_{i1}, c_{j1}) + t_x(e_2, c_{j1}, c_{k1}) - t_x(e', c_{i1}, c_{k1})
\end{aligned} \tag{8}$$
because all other layers except $l_j$ use the same configurations
in $t_o(G, S_1)$ and $t_o(G', S_1)$, and therefore all cost
functions unrelated to $l_j$ drop out in the subtraction. The
remaining parts are $l_j$, $e_1$, and $e_2$, which no longer exist in
$G'$ after node elimination, and $e'$, which is added to $G'$. Recall
that $t_x(e', \cdot, \cdot)$ is defined as follows.
$$t_x(e', c_i, c_k) = \min_{c_j} \{ t_c(l_j, c_j) + t_s(l_j, c_j) + t_x(e_1, c_i, c_j) + t_x(e_2, c_j, c_k) \} \tag{9}$$
Since the minimum in (9) ranges over all configurations $c_j$,
including $c_{j1}$, the right-hand side of (8) is nonnegative.
Combining (8) and (9), we have $t_o(G, S_1) \ge t_o(G', S_1)$.
To prove (6), simply observe that the inequality must hold
because $S_o'$ is assumed to be an optimal strategy for $G'$.
To prove (7), the difference between $t_o(G', S_o')$ and
$t_o(G, S_o)$ is
$$\begin{aligned}
& t_o(G, S_o) - t_o(G', S_o') \\
={} & t_c(l_j, \hat{c}_j) + t_s(l_j, \hat{c}_j) + t_x(e_1, c_i, \hat{c}_j) + t_x(e_2, \hat{c}_j, c_k) - t_x(e', c_i, c_k)
\end{aligned} \tag{10}$$
This is because $S_o = S_o' + \hat{c}_j$, and therefore all cost
functions unrelated to $l_j$ drop out. By (4), $\hat{c}_j$ attains the
minimum in (9), so plugging (4) into (10) makes the right-hand
side zero, which proves (7).
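Theorem 3 can be checked numerically on a three-layer chain $l_i \to l_j \to l_k$. The sketch below assumes that the total cost of a strategy is the sum of per-layer costs ($t_c + t_s$) and per-edge transfer costs ($t_x$), uses random integer costs as hypothetical stand-ins for the real cost functions, and compares a brute-force optimum on $G$ with the optimum on the reduced graph $G'$ whose edge cost follows (9). All names and the number of candidate configurations are illustrative.

```python
import itertools
import random

random.seed(0)
configs = range(3)  # hypothetical: 3 candidate configurations per layer

# Hypothetical integer costs: layer_cost[l][c] stands for t_c(l, c) + t_s(l, c).
layer_cost = {l: [random.randint(1, 100) for _ in configs] for l in "ijk"}
tx_e1 = [[random.randint(1, 100) for _ in configs] for _ in configs]  # t_x(e1, c_i, c_j)
tx_e2 = [[random.randint(1, 100) for _ in configs] for _ in configs]  # t_x(e2, c_j, c_k)


def cost_G(ci, cj, ck):
    """Total cost of a strategy on the original chain G."""
    return (layer_cost["i"][ci] + layer_cost["j"][cj] + layer_cost["k"][ck]
            + tx_e1[ci][cj] + tx_e2[cj][ck])


def eliminated_cost(ci, cj, ck):
    """Cost terms that involve the eliminated layer l_j."""
    return layer_cost["j"][cj] + tx_e1[ci][cj] + tx_e2[cj][ck]


def best_cj(ci, ck):
    """Equation (4): the configuration of l_j minimizing the eliminated terms."""
    return min(configs, key=lambda cj: eliminated_cost(ci, cj, ck))


def cost_Gp(ci, ck):
    """Total cost on the reduced graph G', with t_x(e', ci, ck) as in Equation (9)."""
    tx_eprime = eliminated_cost(ci, best_cj(ci, ck), ck)
    return layer_cost["i"][ci] + layer_cost["k"][ck] + tx_eprime


opt_G = min(itertools.product(configs, repeat=3), key=lambda s: cost_G(*s))
opt_Gp = min(itertools.product(configs, repeat=2), key=lambda s: cost_Gp(*s))

# Recover a strategy for G from the optimal strategy for G' by adding c_j-hat.
ci, ck = opt_Gp
recovered = (ci, best_cj(ci, ck), ck)
print(cost_G(*opt_G), cost_Gp(*opt_Gp), cost_G(*recovered))
```

The three printed costs coincide: the optimum on the reduced graph equals the optimum on the original graph, and extending the reduced strategy with $\hat{c}_j$ attains it.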
Theorem 4. Assume $G' = \mathrm{EdgeElimination}(G)$. If $S_o'$ is
an optimal strategy for $G'$, then $S_o = S_o'$ is an optimal
strategy for $G$.
Proof. The proof follows the same sequence of steps as the proof of
Theorem 3, but the justification of each step is different.
To prove (5) for Theorem 4, the difference between
$t_o(G, S_1)$ and $t_o(G', S_1)$ is
$$\begin{aligned}
& t_o(G, S_1) - t_o(G', S_1) \\
={} & t_x(e_1, c_{i1}, c_{j1}) + t_x(e_2, c_{i1}, c_{j1}) - t_x(e', c_{i1}, c_{j1})
\end{aligned} \tag{11}$$
Recall that $t_x(e', \cdot, \cdot)$ is defined as follows:
$$t_x(e', c_i, c_j) = t_x(e_1, c_i, c_j) + t_x(e_2, c_i, c_j) \tag{12}$$
Combining (11) and (12), we have $t_o(G, S_1) = t_o(G', S_1)$.
For (6), the inequality holds because $S_o'$ is an optimal strategy
for $G'$.
For (7), the difference between $t_o(G', S_o')$ and $t_o(G, S_o)$ is
$$\begin{aligned}
& t_o(G, S_o) - t_o(G', S_o') \\
={} & t_x(e_1, c_i, c_j) + t_x(e_2, c_i, c_j) - t_x(e', c_i, c_j) \\
={} & 0
\end{aligned} \tag{13}$$
