Effective Approaches to Batch Parallelization for Dynamic Neural Network
  Architectures by Suarez, Joseph & Zhu, Clare
Joseph Suarez*
joseph15@stanford.edu
Clare Zhu*
clarezhu@stanford.edu
Abstract
We present a simple dynamic batching approach appli-
cable to a large class of dynamic architectures that consis-
tently yields speedups of over 10x. We provide performance
bounds when the architecture is not known a priori and a
stronger bound in the special case where the architecture is
a predetermined balanced tree. We evaluate our approach
on Johnson et al.’s recent visual question answering (VQA)
result on their CLEVR dataset, Inferring and Executing Programs
(IEP). We also evaluate on sparsely gated mixture of
experts layers and achieve speedups of up to 1000x over the
naive implementation.
1. Introduction
The problem we address is neither VQA nor optimiza-
tion of a single architecture. Our motivation is to accel-
erate a large class of dynamic architectures such that they
become computationally comparable to their static counter-
parts. This cause is not motivated only by the recent suc-
cesses of dynamic architectures, but by their numerous de-
sirable properties that make them likely to retain and in-
crease in importance in the future, particularly their ability
to explicitly modularize knowledge.
We specifically explore Johnson et al.’s recent work[8]
in greatest detail because it serves as a useful testbed for
multiple approaches to dynamic batching. Their execution
engine’s modules (see Background or original work) not
only yield dramatic accuracy gains over all strong baselines,
but are also a prime example of explicit modularization of
knowledge. We view this as a key advantage of dynamic
architectures directly comparable to facets of human intelli-
gence. Our work successfully enables efficient paralleliza-
tion over minibatches in a large class of architectures de-
spite the fact that a new network is assembled for each ex-
ample.
* Authors contributed equally
1.1. Related Work
Previous notable dynamic graph results include neural
module networks [1], which form the basis of the execution
engine of Johnson et al. in their CLEVR [7] IEP result.
The difference is that the latter’s architecture is built on
generic, minimally-engineered neural network blocks that
are more likely to generalize to a wider class of problems
than the original neural module networks approach, which
uses a heavily-engineered question parser and custom per-
module architectures. Whereas improvement upon neural
module networks constitutes improvement upon a single ar-
chitecture, improvement on the CLEVR architecture is gen-
eralizable to a wide class of models under a minimal set of
assumptions (see Discussion).
Additional dynamic graph results include neural Turing
machines [3] [4] and memory networks [14] [13], which
both provide auxiliary queryable memory for read/write use
during inference. While such architectures are applicable in
problems requiring long-term memory, visual question an-
swering places more focus on short term memory. Like the
IEP result, these works tend towards higher level reason-
ing. However, they are perhaps less directly comparable
than approaches that explicitly attempt to build generaliz-
able program structures, such as neural program interpreters
[11]. The main difference is that the IEP result assembles
programs that are defined in their entirety before being ex-
ecuted, thus additional dynamic batching optimizations are
possible. Note that a subset of our results are applicable in
both cases.
1.2. Background
Much of our work is built atop the recently published
CLEVR dataset and subsequent IEP result. We briefly out-
line these for convenience.
1.2.1 CLEVR
CLEVR is a VQA dataset comprising 70K images and
700K question/answer/program triplets. Images are synthetic
but high-quality 3D renders of geometric objects
with varying shapes, sizes, colors, and textures. The standard
VQA task is given by (question, image) → (answer).
CLEVR differs from standard VQA datasets in that it includes
programs, which are functional representations of the questions.
CLEVR therefore allows VQA to be split into
two intermediate tasks, as in the IEP result: (question) →
(program) and (program, image) → (answer).
arXiv:1707.02402v1 [cs.CV] 8 Jul 2017
One might argue that intermediate programs are unreal-
istic, as one is unlikely to have program annotations in large,
realistic tasks. From the CLEVR result, it seems likely that
one could collect a small number of annotations on realis-
tic datasets and use these to initialize the program generator.
This is similar to the transfer learning experiment in the IEP
result. However, performance did degrade compared to the
original task; additional work is required to close the gap.
1.2.2 Visual Reasoning Programs
The IEP result consists of a program generator and execu-
tion engine. The program generator is a 2-layer word-level
question encoder LSTM [6] and 2-layer word (function)-
level program decoder LSTM. We focus on the execution
engine, as it is the dynamic portion of the architecture and
the source of the majority of computation time.
The program generator predicts a sequence of functions
over the function vocabulary with a standard argmax. As
the arity (number of arguments) of each function is prede-
termined, there exists a unique mapping from the predicted
vector of functions to a program tree. This is assembled
via a depth-first search. Each function is itself a neural net-
work, with the exception of a special SCENE token, which
instead outputs ResNet-101 features [5] taken from an inter-
mediate layer. This program tree is then directly executed,
and the outputs are passed through a small classifier net-
work (one convolutional and two fully connected layers) to
yield a softmax confidence distribution over answers, which
is then optimized as normal via backpropagation over the
cross-entropy loss.
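The mapping from a predicted function sequence to a program tree can be sketched as follows. This is a minimal illustration, not the IEP implementation: it assumes the sequence is in prefix order (each function followed by its arguments), and the function names and the `ARITY` table are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical arity table; the real function vocabulary is larger.
ARITY: Dict[str, int] = {"scene": 0, "filter_red": 1, "count": 1, "intersect": 2}


@dataclass
class Node:
    fn: str
    children: List["Node"] = field(default_factory=list)


def build_tree(tokens: List[str]) -> Node:
    """Depth-first assembly of a program tree from a prefix-ordered sequence."""
    pos = 0

    def parse() -> Node:
        nonlocal pos
        fn = tokens[pos]
        pos += 1
        # The predetermined arity fixes how many subtrees follow this function.
        return Node(fn, [parse() for _ in range(ARITY[fn])])

    root = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after a complete program")
    return root


# count(intersect(filter_red(scene), scene))
tree = build_tree(["count", "intersect", "filter_red", "scene", "scene"])
```

Because every arity is fixed in advance, the recursion consumes exactly one valid tree from the sequence, which is why the mapping is unique.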
2. Methods
In the IEP result, programs must be executed sequen-
tially with an explicit loop over the examples in each mini-
batch. As a result, unlike static networks, the computation
time of the forward pass scales linearly with the batch size.
We present two variants of topological sort that remedy this
issue.
To clarify the ongoing notation, programs have max
length s and function vocabulary size p. The batch size is
denoted by b and the max program tree depth by d.
Standard topological sort. First, consider a naive topological
sort. Each program tree is sorted via an infix depth-first
search. This results in a queue ordering such that each
node can be executed sequentially; no node is executed before
all of its dependencies. While this operation runs in time
linear in the number of nodes (i.e. O(bs)), it is fast compared
to expensive neural network operations and can be
multithreaded extremely efficiently, thus we ignore this factor
in our computations.
Figure 1. Example of program tree labeling scheme. Nodes are
labeled and aggregated based upon their dependencies.
We now have a flat representation of each program,
which can be viewed as a grid of size b × s. Instead of
executing each program independently, we loop only over
the s positions and execute one full column of size b at a time.
Each node in a column may correspond to a different element
of the function vocabulary. However, as there are only p
distinct functions, for b > p we need make at most p expensive
neural network calls per column instead of b. This results in
O(ps) neural network calls.
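The column-wise grouping can be sketched in a few lines. This is an illustrative sketch only: the helper `column_batches` and the function names in `grid` are hypothetical, and the bookkeeping that routes tensors between steps is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def column_batches(grid: List[List[Optional[str]]]) -> List[Dict[str, List[int]]]:
    """For each column of the b x s grid, group example indices by function.

    grid[i][t] is the function executed by example i at step t, or None
    where a program shorter than s has been padded.
    """
    s = max(len(row) for row in grid)
    batches = []
    for t in range(s):
        groups: Dict[str, List[int]] = defaultdict(list)
        for i, row in enumerate(grid):
            if t < len(row) and row[t] is not None:
                groups[row[t]].append(i)
        batches.append(dict(groups))
    return batches


# Three flattened programs, already in execution order.
grid = [
    ["scene", "filter_red", "count"],
    ["scene", "filter_red", "exist"],
    ["scene", "filter_cube", "count"],
]
batches = column_batches(grid)
# One batched network call per (column, function) group.
num_calls = sum(len(groups) for groups in batches)
```

Here the naive loop would make 9 calls (one per node), while grouping needs 5. Each column contributes at most p groups, which is where the O(ps) bound comes from.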
Improved topological sort. In the improved variant of
topological sort, we take this sorting operation one step further.
Instead of flattening programs, we label each
node by its maximum distance from the root node. Nodes
with the same label are pooled. Each pool is executed at
once in O(p) neural network calls, for a total of O(pd) calls.
In the case where program trees are balanced (important in
the design of future datasets), this yields O(p log_2 s) execution.
The program trees used in CLEVR are, unfortunately,
highly imbalanced, thus this approach results in only a 10-25
percent speedup over standard topological sort. Note
that, as d is the maximum depth across all programs in a
minibatch, d ≥ log_2 s.
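The depth-pooling idea can be illustrated with a short sketch. It assumes trees are given as nested `(function_name, [children])` tuples; that representation, the helper `depth_schedule`, and the example programs are hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def depth_schedule(programs: list) -> List[Tuple[int, str, List[int]]]:
    """Pool nodes across a minibatch by distance from the root, then by function.

    Returns (depth, function, example_indices) entries, deepest first, so that
    every node's inputs are computed before the node itself.
    """
    pools: Dict[int, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for i, root in enumerate(programs):
        stack = [(root, 0)]
        while stack:
            (fn, children), depth = stack.pop()
            pools[depth][fn].append(i)
            stack.extend((child, depth + 1) for child in children)
    return [
        (depth, fn, idx)
        for depth in sorted(pools, reverse=True)
        for fn, idx in sorted(pools[depth].items())
    ]


scene = ("scene", [])
programs = [
    ("count", [("filter_red", [scene])]),
    ("exist", [("filter_red", [scene])]),
    ("count", [("intersect", [("filter_red", [scene]), scene])]),
]
schedule = depth_schedule(programs)
```

The three programs contain 11 nodes in total but need only 7 batched calls. A child always has a larger label than its parent, so running pools from the deepest label upward respects every dependency.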
3. Results
3.1. IEP
We evaluate performance gains with improved topologi-
cal sort vs. our implementation of the original IEP architec-
ture. Relevant portions of program construction/execution
code are shared appropriately: our experiments are robust
to any unintended inefficiency in our implementation of the
original architecture.
Using all memory in a single Nvidia GTX 1080Ti, we
achieve 5.5X faster inference and a 2X faster backward
pass. It is currently unclear why gains do not transfer better
to the backward pass in the PyTorch backend, as the
expected gains are symmetric. However, this is not a fair
comparison, as over half of computation time is spent in in-
efficient CPU Python graph sorting code. Furthermore, this
CPU code is embarrassingly parallel and should be written
in multithreaded C++.
Thus, a fairer comparison is to measure neural cell exe-
cution time, in which case we achieve over 14X gains (see
Fig. 2). There is a small amount of additional data stack-
ing code omitted from this computation because it can likely
also be optimized and is not directly comparable to the orig-
inal architecture; unfairly including this, gains are still well
over 10X.
More importantly, scaling is linear with batch size. Doubling
GPU memory yields 2X performance. Those familiar
with minibatch parallelism may object that this performance
gain usually drops off after a certain point (minibatch
size > 1000 in our experience). However, this is not likely to
be an issue in our case, as there is an additional factor of the
program function vocabulary size (40). With equal distribution
of execution over functions, a minibatch size of 1000 per
cell corresponds to an overall minibatch of size 40000, which
would require approximately 500 GB of GPU memory. While
increasing the program vocabulary does incur a linear decrease
in performance, it causes an equivalent increase in
maximum minibatch size before incurring diminishing returns.
Furthermore, it is possible to maintain such gains in the
case of multiple GPUs (e.g. large batch size split over many
devices) by assigning a different cell function to each GPU.
Goyal et al. recently demonstrated that this sort of data parallel
scaling can remain practical even at extremely large
batch sizes by scaling the learning rate correspondingly [2].
3.2. Sparsely Gated Mixture of Experts
As a second test, we evaluate performance on the
sparsely gated mixture of experts (MOE) layer as in [12],
where we maintain a set of n fully connected expert net-
works of which k are active for each of the b examples in
each minibatch, and k ≪ n. Note that unlike the IEP example
above, each MOE layer must fully terminate before
the next layer can begin execution: the architecture is not
known a priori. We therefore apply a degenerate case of
our standard topological sort: we batch computation across
all currently known modules. For the MOE layer, this cor-
responds to k experts in each of b examples for a total of kb
known modules.
The naive implementation loops over and executes the
kb experts independently. Our implementation makes only
n expert calls. We evaluate vs. the naive implementation
using 256-dimensional data. Each expert is a neural network
with one hidden layer. We tested many network sizes,
but found that the speedup factor is largely independent of
network size. We therefore only present results with 256
hidden units in Fig. 3; this is the largest experiment that fit
on a single 11GB card.
Figure 2. Log scaled visualization of efficiency gains obtained
from our improved topological sort. Vanilla denotes the implementation
in Johnson et al. This makes clearer the near-linear
gains in speed as the minibatch size approaches 1000.
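The difference between kb independent calls and n batched calls can be sketched as below. This is a toy illustration under stated assumptions: experts are plain Python callables on lists of floats rather than neural networks, gating weights are omitted, and `moe_forward`, `make_expert`, and `calls` are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Expert = Callable[[List[float]], List[float]]


def moe_forward(
    inputs: Sequence[float],
    assignments: Sequence[Sequence[int]],
    experts: Sequence[Expert],
) -> List[Dict[int, float]]:
    """Run each expert once on all of the examples routed to it."""
    routed: Dict[int, List[int]] = defaultdict(list)
    for i, chosen in enumerate(assignments):
        for e in chosen:
            routed[e].append(i)

    outputs: List[Dict[int, float]] = [dict() for _ in inputs]
    for e, idx in routed.items():
        batch = [inputs[i] for i in idx]  # the data stacking step
        results = experts[e](batch)       # one batched call per expert
        for i, y in zip(idx, results):
            outputs[i][e] = y
    return outputs


calls = [0, 0, 0]  # per-expert call counter, for illustration only


def make_expert(e: int, scale: float) -> Expert:
    def expert(batch: List[float]) -> List[float]:
        calls[e] += 1
        return [scale * x for x in batch]
    return expert


experts = [make_expert(e, s) for e, s in enumerate([1.0, 10.0, 100.0])]
inputs = [1.0, 2.0, 3.0, 4.0]                    # b = 4 examples
assignments = [[0, 1], [1, 2], [0, 2], [0, 1]]   # k = 2 experts per example
outputs = moe_forward(inputs, assignments, experts)
```

With b = 4 and k = 2, the naive loop would make kb = 8 expert calls, whereas each of the n = 3 experts is called exactly once here.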
Our metric for performance is cell execution speed (i.e.
total time spent executing experts) rather than forward pass
speed. This is, as in IEP, to avoid unfair comparison to large
swaths of unoptimized Python CPU code. Note that there is
an additional data stacking operation in our approach omit-
ted from these calculations, as in the IEP result, because
including it is not an equal comparison to the vanilla ap-
proach. Also, this stacking operation is almost certainly re-
quired for any approach in the distributed case.
Notice that performance quickly decays as the number
of experts approaches the minibatch size. This should be
expected, as parallelization is impossible if each expert re-
ceives one example. It is difficult to predict the largest prac-
tical speedup achievable, as we do not possess the compu-
tational resources to run production-scale results. However,
we can extrapolate from the results above. Furthermore, we
assume that our results will scale linearly with many GPUs.
This is reasonable both from our suggested parallelization
scheme (see IEP) and the results of the similar scheme uti-
lized in the original MOE work.
Consider the case where each of n experts has h hidden
units, and our data is d dimensional. This yields 2hnd parameters.
If we want each expert to receive m examples
and we use k experts per example, then we must store
nm(2d + h)/k floating point activations. Thus the ratio of memory
required between the activations and the parameters is
m(2d + h)/(2khd). Note that this does not depend on the number of
experts.
Figure 3. Cell execution speedup time with dynamic batching in
a sparsely gated mixture of experts layer. Speedup factor is com-
puted by averaging over only 5 runs for efficiency. Each expert is
a 256-256-#experts fully connected network.
As a practical example, with h = d = 2048, k = 100,
and m = 1 million, this ratio is 7.3. Thus even outra-
geously large networks require a reasonable fraction of the
total memory in order to use extremely large batch sizes.
With 10k experts, we would expect 50-80X performance
with this configuration. Furthermore, we could increase k
by up to a factor of 10 if we desired without significant loss
of absolute performance and obtain an 8-10X efficiency gain.
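The ratio quoted in this example can be checked with a few lines of arithmetic. The helper `moe_memory` is a hypothetical name; it simply evaluates the parameter count 2hnd and the activation count nm(2d + h)/k for the stated configuration.

```python
def moe_memory(n: int, h: int, d: int, m: int, k: int):
    """Return (parameters, activations, activations / parameters)."""
    params = 2 * h * n * d
    activations = n * m * (2 * d + h) / k
    return params, activations, activations / params


# Configuration from the text: h = d = 2048, k = 100, m = 1 million, 10k experts.
params, activations, ratio = moe_memory(n=10_000, h=2048, d=2048, m=1_000_000, k=100)
print(round(ratio, 1))  # 7.3

# The ratio is unchanged for a different number of experts.
_, _, ratio_small = moe_memory(n=10, h=2048, d=2048, m=1_000_000, k=100)
```

Since n appears in both the parameter and activation counts, it cancels in the ratio, which is why `ratio_small` matches `ratio`.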
4. Discussion
The recent independent results of automatic batching in
DyNet [10] and the TensorFlow Fold library [9] are closest
to our work. However, the DyNet batching optimizer relies
on lazy execution to optimally organize data on the fly, and
TensorFlow Fold similarly operates over compiled graphs.
Lazy execution is not present in alternative frameworks
such as PyTorch by design, as it decouples implementation
from execution, which is often an undesirable quality
during implementation. Without lazy execution, automatic
batching is not possible, but variants of our approach are
still viable. Our standard and improved topological sorting
approaches can be viewed as a set of manual batching op-
timizations applicable to a large class of dynamic architec-
tures without requiring lazy execution or prior compilation.
With regard to technical implementation, we mark the
dependencies of each cell as in DyNet. However, we batch
by dependency depth as in TensorFlow Fold in order to min-
imize extra CPU code. This facilitates changing between
approaches. While our experiments are not directly compa-
rable to the benchmarks vs. TensorFlow Fold in the DyNet
result, our speedup curves maintain the linear gains of Ten-
sorFlow Fold at large batch sizes.
Our improved topological sort makes the following hard
assumptions in order to achieve O(p log_2 s), where complexity
is measured by the number of calls to expensive (e.g.
neural network) functions:
1. There exists a set of p expensive modules.
2. The architecture, composed of modules, can be executed
with batch size b such that b ≫ p.
3. The architecture is a balanced tree with structure and
module arities known a priori.
In the case where the third assumption fails because the
architecture (and arities) are known ahead of time but it is
a general DAG, our improved topological sort is still appli-
cable, but with complexity O(pd) where d is the maximum
dependency path length. This has the same complexity as
the standard topological sort but with an equivalent or more
favorable constant factor which can be fairly significant, de-
pending on the number of concurrent branches in the graph.
In the case where the architecture is a generic graph but
is not specifically known ahead of time, it is no longer pos-
sible to use our improved topological sort. However, as the
next module is always known in any architecture, it is still
possible to apply our standard topological sort approach and
achieve O(pd) by aggregating the computations of all cur-
rent modules over p, as done in the MOE example above.
In the presence of cycles, d becomes the maximum length
of an unrolled graph. This is always limited in practice to
avoid infinite cycles.
In general, our approach is applicable whenever signifi-
cant module reuse is present among examples. While it may
be possible to improve upon our approach in select cases by
searching over the set of known modules to optimize the or-
der of dependency execution, this would require additional
CPU code that may not be possible to fully optimize.
5. Conclusion
We demonstrate the effectiveness of our dynamic
batching method on IEP and MOE (our codebase is
available at https://github.com/jsuarez5341/Efficient-Dynamic-Batching),
achieving over 14X and up to 1000X faster neural cell execution,
respectively. In each case, we characterize the trend of
improvements as batch size varies: returns increase and then
remain linear up to extremely large batch sizes. We define the
class of problems for which our improved topological sort
is applicable as well as the class where it is not but standard
topological sort is still feasible; in both cases, we provide
complexity bounds as a function of neural network calls.
The breadth of architectures in which at least one variant of
our approach is applicable implies that a large class of dy-
namic architectures can be trained and executed as quickly
and efficiently as their static counterparts.
References
[1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural
module networks. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2016.
[2] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis,
L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He.
Accurate, large minibatch sgd: Training imagenet in 1 hour.
arXiv preprint arXiv:1706.02677, 2017.
[3] A. Graves, G. Wayne, and I. Danihelka. Neural turing ma-
chines. arXiv preprint arXiv:1410.5401, 2014.
[4] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka,
A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette,
T. Ramalho, J. Agapiou, et al. Hybrid computing using
a neural network with dynamic external memory. Nature,
538(7626):471–476, 2016.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2016.
[6] S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural Computation, 9(8):1735–1780, 1997.
[7] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L.
Zitnick, and R. B. Girshick. CLEVR: A diagnostic dataset
for compositional language and elementary visual reasoning.
CoRR, abs/1612.06890, 2016.
[8] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman,
F. Li, C. L. Zitnick, and R. B. Girshick. Inferring and execut-
ing programs for visual reasoning. CoRR, abs/1705.03633,
2017.
[9] M. Looks, M. Herreshoff, D. Hutchins, and P. Norvig. Deep
learning with dynamic computation graphs. arXiv preprint
arXiv:1702.02181, 2017.
[10] G. Neubig, Y. Goldberg, and C. Dyer. On-the-fly opera-
tion batching in dynamic computation graphs. arXiv preprint
arXiv:1705.07860, 2017.
[11] S. Reed and N. De Freitas. Neural programmer-interpreters.
arXiv preprint arXiv:1511.06279, 2015.
[12] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le,
G. Hinton, and J. Dean. Outrageously large neural networks:
The sparsely-gated mixture-of-experts layer. arXiv preprint
arXiv:1701.06538, 2017.
[13] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end
memory networks. In C. Cortes, N. D. Lawrence, D. D.
Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neu-
ral Information Processing Systems 28, pages 2440–2448.
Curran Associates, Inc., 2015.
[14] J. Weston, S. Chopra, and A. Bordes. Memory networks.
arXiv preprint arXiv:1410.3916, 2014.
