Optimal DNN Primitive Selection with Partitioned Boolean Quadratic
  Programming by Anderson, Andrew & Gregg, David
Optimal DNN Primitive Selection with
Partitioned Booleanadratic Programming
Andrew Anderson
School of Computer Science and Statistics
Trinity College Dublin
Dublin, Ireland
andersan@cs.tcd.ie
David Gregg
School of Computer Science and Statistics
Trinity College Dublin
Dublin, Ireland
dgregg@cs.tcd.ie
Abstract
Deep Neural Networks (DNNs) require very large amounts of com-
putation, and many dierent algorithms have been proposed to
implement their most expensive layers, each of which has a large
number of variants with dierent trade-os of parallelism, locality,
memory footprint, and execution time. In addition, specic algo-
rithms operate much more eciently on specialized data layouts.
We state the problem of optimal primitive selection in the pres-
ence of data layout transformations, and show that it is NP-hard by
demonstrating an embedding in the Partitioned Booleanadratic
Assignment problem (PBQP). We propose an analytic solution via
a PBQP solver, and evaluate our approach experimentally by opti-
mizing several popular DNNs using a library of more than 70 DNN
primitives, on an embedded platform and a general purpose plat-
form. We show experimentally that signicant gains are possible
versus the state of the art vendor libraries by using a principled an-
alytic solution to the problem of primitive selection in the presence
of data layout transformations.
Keywords Deep Neural Networks, Primitive Selection
1 Motivation
Deep neural networks are among the most successful techniques
for processing image, video, sound and other data arising from real-
world sensors. DNNs require very large amounts of computation
which challenge the resource of all but the most powerful machines.
However, DNNs are most useful when deployed in real-world
embedded devices, which navigate and interact with their surround-
ings. In these embedded environments, limits on baery capacity,
memory and processing power are signicant constraints.
DNNs consist of a directed graph of “layers” that receive raw
input data, and output a transformation or classication of the data.
Several dierent types of layers are used to implement DNNs, such
as activation layers, pooling layers, convolution layers, and fully-
connected layers. In the best-known and best-performing DNNs,
a great majority of the execution time is spent in the convolution
layers.
ere are many ways that each layer can be implemented. For
example, a common approach to implementing convolution is to re-
structure the input data and call a matrix multiplication routine [8].
Other researchers have used carefully hand-tuned C or assembly
routines or specialized domain-specic algorithms (such as Wino-
grad or FFT convolution) with low asymptotic time complexity.
preprint, 2018
2018. is is the author’s version of the work. It is posted here for your personal use.
Not for redistribution. e denitive Version of Record was published in Proceedings
of preprint, arxiv.org.
Given the large space of approaches, it can be dicult to guess
which primitive routine might be best used to instantiate any given
layer from a DNN. With no further information, a simple solution
is to simply pick a single primitive for all layers. However, we show
that no single primitive yields best performance for all layers.
On the contrary, some algorithms perform well across a range
of inputs, whereas others can be quite inecient on average, but
perform extremely well in particular cases. An additional compli-
cation is that input data layouts can have a large impact on the
performance of particular primitives.
However, the outputs of one layer become the inputs of others.
us data layout decisions cannot be made in isolation, because
the selection of the output layout of a producer layer determines
the input layouts of all connected consumer layers.
Contributions In this paper we address the problem of per-layer
primitive selection for deep neural networks to optimize a global
goal function for the whole network. Specically, we make the
following contributions:
• We demonstrate empirically that dierent DNN primitives
can provide very dierent levels of performance for the
same layer.
• We formulate the selection of implementations for convolu-
tional layers in the presence of data layout transformations
as a Partitioned Booleanadratic Programming (PBQP)
problem.
• We demonstrate the eectiveness of our technique using a
large library of more than 70 DNN primitives operating on
a variety of data layouts with an o-the-shelf PBQP solver.
• We demonstrate signicant speedups from our approach
on desktop and embedded CPUs on several popular deep
neural networks.
2 Background
A deep neural network (DNN) consists of a directed graph of layers
that receive, process, and output data. Input data enters the graph
through an input layer. Starting from the input layer(s), each layer
of the graph is executed in topological order. Data ows between
layers along directed edges, which determine the order of execution,
similar to data dependences in a basic block.
e directed graph of layers may be cyclic or acyclic. One well-
known class of acyclic feedforward DNNs are Convolutional Neural
Networks (CNNs). CNNs normally accept a large matrix or tensor
input, such as an RGB image. e input layer of the CNN processes
the input tensor, and produces one or more output tensors on its
output edge(s). ese outputs trigger the execution of subsequent
layers.
e output of the CNN is commonly a classication, that is, a
weighted distribution of categories — such as dogs or helicopters —
ar
X
iv
:1
71
0.
01
07
9v
2 
 [c
s.P
F]
  2
 N
ov
 20
18
preprint, arxiv.org, 2018 A. Anderson and D. Gregg
that the CNN predicts for the input. Once training is complete, a
CNN is stateless; the output is purely a function of the most recent
input and the trained, xed internal weights.
e layers within a DNN consist of standard mathematical oper-
ators such as convolution, activation, pooling, and full-connected
layers. ese standard operators can be used to build a very large
variety of deep learning models. A DNN can be implemented using
a library of primitive routines, where each primitive implements
one of each of the types of layer in the DNN. A single input to a
layer is oen a large four dimensional tensor. erefore the amount
of computation performed by each primitive is typically large.
2.1 Convolution Layers
e most computationally-intensive layer in many DNNs is the
convolution layer. e convolution layer is based on the idea of
convolving a 2D input of dimensions H ×W with a convolution
kernel of k×k . However, as shown in Figure 1, in DNN convolution
both the input and kernels have C separate channels. Rather than
convolving with just a single kernel, the C-channel input is con-
volved withM separateC-channel kernels. us, DNN convolution
operates on a 3D input tensor, a 4D tensor of kernels, and produces
a 3D tensor as output. e total number of operations in a simple
(non-strided) DNN convolution is O(H ×W ×C × k2 ×M).
Figure 1. DNN convolution of a 3D input tensor composed of C
input feature maps, each of size H ×W , with a 4D kernel composed
ofM multichannel lters, each with C channels and a k × k kernel,
resulting inM output feature maps, each of size H ×W .
2.2 Data Layouts
Each primitive takes inputs in a given data layout, performs the
operations of one or more layers, and produces output in a given
layout. Some DNN libraries produce all inputs and outputs in a
canonical layout, such as the NCHW data layout used by Cae [8].
However, forcing each layer to use the same layout removes the
opportunity to customize layouts to particular operations (such as
convolution with large K but small C) or for cross-layer optimiza-
tions that exploit specialized layouts.
A simple way to nd the cost of implementing any layer with
any primitive is to execute the primitive using sample inputs for the
layer. e cost of execution of most DNN layers depends primarily
on the dimensions of the input rather than on the actual input
values. Since the execution time tends to remain stable regardless
of the particular input values, it is possible to measure with some
consistency the execution time of a given primitive implementing
a given layer. However, given a set of execution times for each of
the layers in a network with dierent primitives, it is not obvious
which primitives to select to implement each layer to maximize
overall performance.
In particular, when input data is converted to a special repre-
sentation, such as a frequency domain representation, it is oen
much more ecient to keep the data in that representation for as
long as possible, rather than to convert back and forth between
data representations for each layer in the network.
3 Primitive Selection
In this section we consider the problem of selecting a primitive
to implement each layer to yield the lowest cost instantiation of a
given DNN. By lowest cost we mean that the sum of the execution
times of each of the layers of the DNN is minimized. is problem
may seem no more than selecting the fastest implementation of
each layer, but in fact it is much more dicult.
e parameters of a convolutional layer on which the runtime
chiey depends are informally well understood — the work to
be done grows with the size and number of input feature maps,
and the size and number of lters, but decreases as the stride of
the convolution increases. We can model a convolutional scenario
formally as a 6-tuple {C,H ,W ,δ ,K ,M}, respectively, the number
of input feature maps, the height of an input feature map, the width
of an input feature map, the stride of the convolution, the radix of
the convolutional lters, and the number of output feature maps.
Note that our formulation does not consider minibatching. Mini-
batching can trivially be incorporated by adding a seventh parame-
ter to encode the batch size. However, our application context is
highly latency sensitive, so our formulation, in practice, considers
only a minibatch size of 1.
Input data is provided to a convolutional layer in a 3-dimensional
tensor of sizeC×H×W . In the abstract, any layout (i.e. permutation
of the order of these dimensions) of the tensor is valid. However,
each primitive operator deals with inputs and outputs in a specic
data layout.
We model primitive operators with a 3-tuple {Lin , P ,Lout }, re-
spectively, the input layout, primitive identier, and output layout,
where the layouts are a permutation of {C,H ,W }. A directed edge
from a layer instantiated with primitive A to another layer instanti-
ated with primitive B is legal i Lout (A) = Lin (B). us a primitive
assignment also implies a specic layout assignment to the input
and output edges of a DNN layer.
Two incompatible primitives cannot be connected, regardless of
the optimality of such an arrangement. For example, a particular
primitive operator that performs convolution might operate on
tensors of 16-bit xed point data. Another might operate on 32-bit
oating point. If the output data of one primitive were provided as
input to the other, garbage would result.
We combine dierent incompatible primitives using a legaliza-
tion phase. e legalization phase inserts additional data layout
conversion layers to bisect illegal edges, and legalize an assignment.
e legalizer can then select one or more data layout transformation
primitives to implement the conversion layers.
A problemwith selecting data layout transformation primitives is
that the number of supported data layouts may be large. ere may
not be a separate conversion primitive connecting every pair of data
layouts. is may result in a chain of data layout transformations
being required to convert from one layout to another.
Optimal DNN Primitive Selection with PBQP preprint, arxiv.org, 2018
An additional complication that arises from inserting data lay-
out transformations is that the transformations themselves take
time to execute. e legalization pass may allow two incompatible
primitives to work together, but the cost of the data layout trans-
formations may be so high that the selection is no longer optimal.
Once the cost of data layout transformations is included, the op-
timal selection might instead be another selection with fewer or
cheaper data layout conversions. If the cost of the data layouts is
considered only aer selection, the solution may be sub-optimal.
Clearly, the problem is more complex than simply picking the
fastest legal primitive for each DNN layer. In fact, as we show in
the next section, it is NP-hard to nd the least-cost assignment of
primitives to layers in the presence of data layout transformations.
3.1 Computing Costs
We provide a two-stage solution to the primitive selection prob-
lem. In the rst stage we compute the cost of converting between
each of the supported data layouts. Note that the set of data trans-
formation routines between the various pairs of data layouts is
not normally complete. Instead we have a limited set of direct
data layout transformation routines between various pairs of data
layouts.
Where there is no direct routine to tranform from data layoutA to
data layout B, it may be possible to build a chain of transformations
between the layouts. For example, if there exist transformations
A → D, D → X and X → B, then it is possible to convert from
layout A to B via that chain A→ D → X → B.
Considering the set of data layouts supported by a DNN library
as nodes in a graph, we can construct a data-layout transformation
(DT) graph. e direct data layout transformation routines can be
considered directed edges of the DT graph. If there is a directed
path between any pair of nodes in the graph, then a chain of data
layout transformations can be constructed to convert from one
layout to the other. e full set of possible direct and indirect data
layout transformations is then given by the transitive closure of
the DT graph.
However, the transitive closure tells us only which pairs of data
layouts can be linked by a chain of data layout transformations.
We also need to know the cost of every DT graph path. us, we
must measure the execution time of the data layout transformation
routines on tensors of the size that appear at their inputs in the
DNN, or else provide a heuristic cost for paths.
In the current paper we measure the execution time of transfor-
mations ahead of time, but simple heuristics might be almost as
eective.
To nd the least-cost chain of data layout transformations be-
tween a pair of data layouts we need to nd the shortest path
between the corresponding pair of DT graph nodes. Rather than
computing the shortest path between each pair of nodes each time
we need it, we instead compute the all-pairs shortest path for the
DT graph ahead of time. Where no path exists between a pair of
nodes, the cost of the data layout transformation is innite. is
gives us the (transitive) data layout transformation costs for the
transfer of data between layers of the original DNN graph.
In addition to data layout transformation costs, we also com-
pute the cost of implementing each of the primitive functions that
implement layers of the DNN graph. Each layer might be imple-
mented by one of many dierent primitives. For example, in our
DNN library, we provide implementations of six major families of
convolution algorithm. Each algorithm has many variants with
distinct performance characteristics, resulting in over 70 dierent
primitive routines that implement DNN convolution.
To estimate the cost of a specic assignment of a primitive to a
DNN layer, we prole the execution time of the primitive operating
on tensors of the size used in the layer. Note that the dimensions of
all inputs to DNN layers are known statically, and the control ow
in most layers is sensitive only to the size of inputs, not the specic
values of the input data. erefore, statically-measured execution
times on random input of the appropriate size give a very good
estimate of the actual execution time.
3.2 e Optimization Problem
Given a DNN graph G, we compute the instance cost (execution
time) for all convolutional scenarios S ∈ G under every available
primitive P0...Pn to form the product space S × P . Since each edge
cost is determined by the pair of assignments of primitives to nodes
in each of the nodes linked by the edge, we must then compute all
paths D in the DT graph implied by each assignment, I , of the nodes
of G, drawn from points in this space, and compute the instance
cost (execution time) for these also. e task before us is then to
nd the point in the product space S × P × D of instantiations ofG
which minimizes the total cost.
Constructed in this way, we can map the problem of layer selec-
tion in the presence of data layout transformations to a well-known
existing optimization problem — PBQP.
3.3 Partitioned Booleanadratic Programming
e partitioned Boolean quadratic programming (PBQP) problem is
an assignment problem that can be visualized as a graph. For each
node there are several possible assignments, each with a known
cost. In addition, there is a second set of costs associated with edges
in the graph. e cost of edges is a table indexed by the pair of
assignments of the two nodes linked by the edge. PBQP has been
used to model a number of problems in compiler optimization such
as register allocation for architectures with irregular instruction
sets [14], and instruction selection on DAGs [6].
Given a set of layers and a set of primitives, we aempt to assign
the layers to primitives to minimize the total cost. e cost of
assigning a layer to a primitive is the execution time of the layer
implemented with that primitive. Figure 2a shows a simple graph
representation of a PBQP instance.
Each layer can be implemented by any one of three dierent
primitives: A, B or C . Each of the three primitives uses a dierent
algorithm, and therefore has a dierent cost to implement the layer.
Given the cost of implementing the nodes with each of the three
primitives in Figure 2a, the optimal primitive selection for each
node is clear. e lowest cost selection for each of the three layers
is B, C and B respectively.
In addition to the cost table for each individual layer, the PBQP
formulation allows a second cost to be specied for edges. In our
simple example, an additional cost matrix can be associated with
each edge in the graph. e cost matrix for an edge represents
the costs associated with all pairs of selections for the two nodes
connected by the edge.
Figure 2b shows a cost table for the edge that transitions from
conv1 and conv2. For example, if primitive A is selected for conv1
and A is also selected for conv2, then the edge cost of transitioning
between conv1 and conv2 is zero. is zero cost is the result of
preprint, arxiv.org, 2018 A. Anderson and D. Gregg
conv1
conv2
conv3
(8, 6, 10)
(17, 19, 14)
(20, 17, 22)
B
C
B
total cost: 37
(a) Node costs only
conv1
conv2
conv3
(8, 6, 10)
(17, 19, 14)
(20, 17, 22)
0, 2, 4
4, 0, 5
2, 1, 0(
(
C
C
0, 3, 5
6, 0, 5
1, 5, 0(
(
a      b      c
a
b
c
A
total cost: 45
(b) Node and edge costs
Figure 2. Example of simple linear PBQP problem. Optimal assign-
ments are indicated by the leer le of the node.
using the same data layout for both conv1 and conv2. In contrast,
in this example we assume that primitives A, B and C operate on
dierent data layouts, and there is therefore a data layout trans-
formation cost which arises when we transition between dierent
primitives in connected layers.
In Figure 2b primitive B has the lowest cost for layer conv1. How-
ever, when we consider data layout transformation costs, primitive
B is no longer the optimal selection for layer conv1. For layer
conv2, primitive C is so much faster than the other two choices.
However, the cost of transitioning from B at layer conv1 and C
at layer conv2 is so high that B is not the optimal selection for
layer conv1. e edge cost is large because the two primitives use
dierent data layouts, and therefore a data layout transformation
must be inserted between the two primitives. e additional cost
that arises from pairs of assignment choices on the graph edges
makes PBQP dierent from other similar problems, such as the
quadratic assignment problem [10].
convolve 
1x1
convolve 
1x1
convolve 
1x1
maxpool 
3x3
concat
input
convolve 
3x3
convolve 
5x5
convolve 
1x1
Figure 3. Inception Module
In the simple example in Figure 2 it is feasible to solve the selec-
tion problem relatively simply even in the presence of edge costs.
is is partly because the example is small, and partly because
the graph in the example is purely linear. However, data layout
transformation costs on graph edges are much more problematic in
DAG-shaped DNNs. Figure 3 shows an example of a DAG sub-graph
from the GoogleNet DNN [16].
Where a layer has multiple direct successors and/or predeces-
sors, the same data layout may not be optimal for all. Choosing
one layout over another may limit the choice of primitives that
can be used in successor/predecessor layers, or may require the
insertion of expensive data layout transformations. PBQP allows
us to model these complicated DAG costs within an abstract opti-
mization problem, and to nd solutions with an o-the-shelf PBQP
solver.
4 DNN Convolution Algorithms
A wide variety of convolution algorithms have been proposed over
many years, each with their own strengths and weaknesses [4].
Researchers in signal processing oen distinguish between direct
and indirect convolution methods.
Direct methods compute the convolution as a simple sum of
products, as is found in a standard textbook denition of convo-
lution. On the other hand so-called fast convolution algorithms
reduce the computational complexity by transforming the input to
another form before performing the convolution. In this section
we describe several dierent classes of convolution algorithms that
are commonly used to implement DNNs.
•e direct-loop family of convolution algorithms perform multi-
channel multi-kernel convolution using a simple six-deep loop nest.
ere are many variants of this loop nest with dierent reorderings,
tilings, and schedules to improve execution time, vectorization, and
spatial and temporal locality of data access.
e sum-of-single-channels algorithm is a member of the direct-
loop family of methods. It uses the loop orderingM ×C ×H ×W ×
K × K .
•e im2 family of convolution algorithms are variants of the well-
known im2col approach [8]. ese convolutions rst construct a
Toeplitz matrix from the input image, and convolve this with the
kernel using a single call to the BLAS GEMM routine.
•e kn2 family of low-memory GEMM-based convolution algo-
rithms are presented by Vasudevan et al. [3, 18]. is family of
approaches does not construct a Toeplitz matrix, and instead com-
putes convolution as the sum of several matrix multiplications. We
use variants of the kn2 family that compute the sum of GEMMs
as an accumulation and achieve good execution times with low
additional memory.
•e Winograd family of methods use the Winograd algorithm
for convolution with a theoretically optimal number of multipli-
cations [4]. Unlike the direct-loop, im2 and kn2 approaches, all
of which perform the same number of oating point operations,
winograd is a “fast” algorithm by the signal processing denition,
which greatly reduces the number of operations. We implemented
the Winograd algorithm for scenarios with K = 3 and K = 5.
•e t family of methods perform FFT convolution via the con-
volution theorem, by rst computing the Fourier transform of the
input image and the kernel, applying a pointwise multiplication,
and then computing the inverse Fourier transform of the resulting
matrix to produce the output. Our t implementations compute
2D convolution as a sum of 1D FFT convolutions, which requires
less space than 2D FFT convolution at the cost of more operations.
Table 1 gives a brief overview of some of the strengths and weak-
neses of these major approaches. In principle it should be possible
to optimize the simple loop nest to achieve high performance, but
in practice such loop nests are more oen very slow. e Wino-
grad algorithm can be very fast for K = 3 and K = 5 convolutions,
but equally there are cases where the im2 and kn2 approaches are
faster. We have found it very dicult to predict the cases where
Optimal DNN Primitive Selection with PBQP preprint, arxiv.org, 2018
Table 1. Strengths and weaknesses of dierent convolution algo-
rithms commonly used in DNN implementations
Algorithm Time Memory Strided Bad cases
direct loop - - ++ ++ Non-strided
im2 + - - ++ Large image
kn2 + + - - Few channels
Winograd ++ - - Unpredictable
FFT - + Small kernel
Winograd will be faster without measuring the performance of the
layers. e kn2 approach is fast and requires lile memory, but it
cannot be used to eciently implement strided convolutions. On
the other hand, our FFT algorithm is only sometimes faster than
other approaches, but in those rare cases it can give signicant
speedups.
Real-World Solutions To demonstrate our formulation in prac-
tice, Figure 4 shows two primitive selections made by the formula-
tion. Figure 4 shows the convolution layers of AlexNet [11]. e
gure shows the selection of primitives that our approach makes
on ARM Cortex A-57 and Intel Core i5-4570 processors.
conv1
conv2
conv3
conv4
conv5
Intel
conv1
conv2
conv3
conv4
conv5
ARM
im2row
A B I K
im2row
A BT I K
winograd 
3X3 5X5 VF8
winograd
3 5 VF4
winograd
2X2 3X3 VF4
winograd
2 3 VF4
winograd
2 3 VF4
winograd
2X2 3X3 VF8
winograd
2X2 3X3
winograd
2X2 3X3 VF8
Figure 4. PBQP selections for multithreaded execution of
AlexNet on ARM Cortex A-57 and Intel Core i5-4570.
e primitive selections on both processors are interesting in
both their similarities and dierences. Recall that we select from
a library of around 70 primitive functions which implement DNN
convolution, and therefore the range of possible selections is large.
e conv1 layer of AlexNet consists of a K = 11,δ = 4 convolu-
tion. e layers selected for both ARM and Intel processors are im2
layers, with a row-oriented layout. e only dierence between the
two is that the layer selected for ARM passes the kernel matrix to
the GEMM matrix multiplication call as a transposed matrix. ere
is no good way to select between these very similar primitives
except by proling the code to see which is faster on a particular
target architecture.
e remaining layer selections are all instances of the Wino-
grad convolution algorithm for both target architectures. However,
there is a striking dierence in the selections for the ARM and Intel
processors. e selection for the Intel processor consists entirely
of primitives that implement two-dimensional Winograd convolu-
tion. is approach requires signicant memory, but minimizes the
number of executed operations.
In contrast, a majority of the Winograd selections for the ARM
processor are one-dimensional Winograd convolutions. One di-
mensional Winograd convolutions require more oating point op-
erations than their 2D counterparts, because the 2D operation must
be constructed from the sum of a number of 1D Winograd convolu-
tions. However, the 1D algorithm requires less memory. Given that
the ARM processor has much smaller caches than the Intel proces-
sor, it seems that the lower memory requirements of 1D Winograd
makes it faster on the ARM processor. As a result, of the four
Winograd primtives that are selected to implement convolution
layers on the ARM processor, three are 1D Winograd convolution
algorithms.
Note also that the optimizer selects the 8-way vectorized imple-
mentations of Winograd on the Intel processor, which has Intel’s
8-way FP32 AVX2 vector extensions, while on the ARM proces-
sor, which has ARM’s 4-way FP32 NEON vector extensions, it also
selects the appropriate architectural variant.
A strength of our approach is that we can easily capture these ne
architectural dierences with layerwise proling, while keeping
the optimizer free from platform-specic special cases. Layerwise
proling need only be run once per hardware platform per DNN
model. e resulting cost tables are tiny compared to the weight
data required for most DNN models, making it feasible to produce
these cost tables before deployment, and ship them with the trained
model to maximise inference performance in situ.
5 Experimental Evaluation
To validate our primitive selection approach, we performed whole-
network benchmarking with several popular CNN architectures on
multiple hardware platforms. We also ran these CNN architectures
using some state-of-the-art soware frameworks, to demonstrate
the benet of our global approach to optimization of DNN inference.
is section presents the results of our evaluation, with discussion.
5.1 Experimental Setup
We evaluated our approach on an Intel Haswell CPU, specically the
Intel Core i5-4570, and on an ARM Cortex A-57 CPU. e Cortex
A-57 is used in NVIDIA’s Tegra X1 platform for embedded and
automotive development.
On the ARM platform, we used the popular BVLC Cae frame-
work [9], which is accelerated by the high-performance OpenBLAS
library under the hood. On the Intel platform, there is a vendor
library available with an optimizing code generator which targets
DNN inference: MKL-DNN [2]. e BVLC Cae framework also
runs on this platform.
We used Cae version 1.0 on all platforms, and on the Intel plat-
form, we used MKL-DNN version 0.10. We used OpenBLAS release
0.2.20 on all platforms. Both Cae and our own primitive library
use OpenBLAS internally to perform the matrix multiplications
at the heart of DNN convolution, so using the same underlying
OpenBLAS library ensures we are competing on an equal footing.
We kept all external soware dependencies in sync between each
platform. e C++ compiler we used was GCC stable release 7.2.
All benchmark executables were compiled using -march=native
-std=c++14 -O3. e NVIDIA Tegra X1 board we used for the
Cortex-A57 benchmarking was running NVIDIA JetPack 3.1.
preprint, arxiv.org, 2018 A. Anderson and D. Gregg
5.2 Methodology
For performance evaluation of our approach, we used three popular
network architectures: AlexNet [11], VGG [15], and GoogleNet [16].
Each of these network architectures has a public model corre-
sponding to the publication made available by the authors, or via
the BVLC Cae Model Zoo, a public repository of trained DNN
architectures. We used these public versions of the network archi-
tectures for our experimentation.
Note that Cae models other than VGG-D and VGG-E have not
been released by the authors [15], so we cannot present Cae re-
sults for these models. However, we present results obtained from
a faithful reconstruction of the models by hand, exactly follow-
ing [15].
Starting from the published network models, we extracted all
convolutional scenarios in the graph, performed the proling to
gather cost data, and constructed the PBQP query for the minimum
cost instantiation. We mapped the solution to code with a simple
code generator which emied calls to primitive operations in our
library.
Note that we modelled convolutional layers only in the DNN
graph for optimization. All other layers were represented in our
formulation as dummy nodes, accepting any input and output
layouts, and having zero cost. Since convolution accounts for a very
large proportion of the execution time of a DNN, we hoped that this
simplifying assumption would reduce the size of our optimization
queries, while not damaging the quality of the solution.
We used the same code generator with the example code which
ships with Intel’s MKL-DNN library to obtain executable instances
of the models using MKL-DNN1. For Cae, we executed published
network models directly using caffe time. Every whole-network
benchmark was run for ve iterations, and the mean execution
time for one forward pass was used to calculate speedups.
We performed separate single-threaded and multi-threaded cost
modelling, and solved and built independently two versions of every
benchmark program. e multi-threaded benchmarks were run
using all cores available on the machine. Both of the test systems
had four CPU cores. On our graphs, all bars represent a speedup
over a common baseline. For this baseline, all convolutions in the
network are performed using the textbook sum-of-single-channels
algorithm, with single-threaded execution.
5.3 Data Layouts Used
Published algorithms forWinograd convolution, im2 and kn2 family
convolutions, and direct convolutions tend to use one of three main
data layouts for inputs and outputs: C × H ×W , H ×C ×W , and
H ×W × C . e laer two layouts correspond to transpositions
or blockings of the simple data layout used by direct convolutions
(C ×H ×W ). Note that the same data layout is not always used for
both input and output by a primitive operation.
All layers and all primitives used to instantiate them in every
network in our benchmarking operate on 32-bit single-precision
oating point data.
5.4 Optimization Overheads
We used the PBQP solver of Scholz et al. [7] to solve our optimiza-
tion queries. Solving the PBQP optimization query took less than
1Intel’s state-of-the-art MKL-DNN framework [2] contains an optimizing JIT compiler
for DNN convolution. e framework natively targets Intel vector extensions, and
underpins projects such as Intel Cae [1].
one second for each of the networks we experimented with, using
our Intel experimental machine. In each case, the solver reported
that the optimal solution was found.
5.5 Interpretation of Results
Our presentation of results from whole-network benchmarking
labels results with the name of the family of convolution algorithms
which was used to produce the result. For each family of algorithms,
we construct the test network by picking the fastest variant of that
family to replace the sum-of-single-channels algorithm for each
individual convolution in the network, if the replacement is, in fact,
faster than sum-of-single-channels for that convolutional scenario
in benchmarking. From this we obtain bars for each family of
convolution algorithms from Section 4: direct, im2, kn2, winograd,
t.
We also tested the simple strategy outlined in Section 2.2 which
eliminates all data layout tranformations by choosing a canonical
layout for tensors, and keeping all data in this format. is is the
local optimal bar on the graphs. For our experiments, we used the
default Cae layout, C × H ×W , as the canonical layout.
Absolute timings for the subset of networks which run on both
experimental architectures are presented in Tables 2 and 3.
5.6 Experimental Results on Core i5-4570
Figures 5 and 6 present the results of our whole-network evaluation
on the Intel Haswell platform. e solution found by the PBQP
formulation for single threaded execution of a DNN model is com-
petitive with the optimized vendor library, even outperforming it
in some cases, such as on the VGG-B, VGG-C, and VGG-E DNN
models. However, it is in multithreaded execution where the PBQP
approach really shines. e PBQP solver nds excellent solutions
for all DNN models, outperforming the vendor library by as much
as a factor of 2x in the case of the VGG-E model (Figure 6).
5.7 Experimental Results on Cortex-A57
Figure 7 presents the results of our whole-network evaluation on
the ARM Cortex-A57 platform. Note that the VGG-B, VGG-C, and
VGG-E DNN models are too large to t on this platform. However
we were able to reliably execute the AlexNet and GoogleNet DNN
models.
Although less pronounced than on Intel Haswell, we still see a
very signicant speedup (up to 7x) from our approach versus Cae
on the Cortex-A57 (Figure 7b).
5.8 Experimental Trends
Our results strongly support the position that there is no one con-
volution algorithm which excels in every scenario. For example,
Winograd convolution is supremely eective in VGG family DNN
models, because they are entirely composed of K = 3 convolu-
tion layers. However, Winograd by itself is among the slowest
approaches for the AlexNet and GoogleNet models (Figure 6).
We also see an experimental validation of the importance of
modelling the cost of data layout transformations. Recall that our
strategy for non-PBQP bars is to replace the SUM2D algorithm only if
the replacement has a smaller execution time. Figure 7a shows that
for GoogleNet, this strategy leads to a net slowdown for the family of
direct loop-based convolutions — even though at all points the faster
loop was chosen, the legalizing data layout transformations are so
Optimal DNN Primitive Selection with PBQP preprint, arxiv.org, 2018
 0
 2
 4
 6
 8
 10
 12
AlexNet VGG-B VGG-C VGG-E GoogleNet
Sp
ee
du
p 
vs
 s
um
2d
Whole Network Benchmarking (x86_64)
direct
im2
kn2
Winograd
FFT
Local Optimal (CHW)
PBQP
mkldnn
caffe
Figure 5. Single-readed Comparison of approaches with PBQP selection on Intel Haswell
 0
 5
 10
 15
 20
 25
AlexNet VGG-B VGG-C VGG-E GoogleNet
Sp
ee
du
p 
vs
 s
um
2d
Whole Network Benchmarking (x86_64) (multithreaded)
direct
im2
kn2
Winograd
FFT
Local Optimal (CHW)
PBQP
mkldnn
caffe
Figure 6. Multi-readed Comparison of approaches with PBQP selection on Intel Haswell
Table 2. Single inference time on Intel Core i5-4570 (ms) with
(S)ingle-threaded and (M)ulti-threaded execution.
Network SUM2D L.OPT PBQP CAFFE
(S) AlexNet 711.75 231.75 100 419.565
(S) GoogleNet 1401 465.25 249 1267.07
(M) AlexNet 712.25 186 44.25 286.518
(M) GoogleNet 1400.25 261.5 123.5 919.196
expensive that the baseline SUM2D instantiation of the network is
faster.
e eect is even larger considering the case of Winograd convo-
lution for AlexNet on Intel Haswell (Figure 6). Recall from Figure 4
that the PBQP optimal selection on this platform uses a Winograd
variant for 4 of the 5 convolutions in AlexNet. Yet simply selecting
the fastest Winograd variant ignoring data layout transformation
costs yields an instantiation that performs only marginally beer
than the baseline SUM2D network instantiation.
Table 3. Single inference time on ARM Cortex A-57 (ms) with
(S)ingle-threaded and (M)ulti-threaded execution.
Network SUM2D L.OPT PBQP CAFFE
(S) AlexNet 2369.5 744.25 461 2341.09
(S) GoogleNet 4544.75 1695.25 1025 5782.4
(M) AlexNet 2432.5 639.25 294 1342.62
(M) GoogleNet 4509.75 919.25 547.5 3707.91
6 Discussion
Our experimental results demonstrate that our PBQP formulation
is extremely eective at selecting the optimal primitives to imple-
ment a DNN. In this section we briey consider some alternative
strategies to the same problem, and discuss their strengths and
weaknesses as compared with our approach.
As we have shown in Section 3 the primitive selection problem
can be transformed into the NP-hard PBQP problem. However, it is
worth noting that the primitive selection problem is NP-hard only
in the presence of multiple data layouts for tensors in the network.
preprint, arxiv.org, 2018 A. Anderson and D. Gregg
 0
 1
 2
 3
 4
 5
 6
AlexNet GoogleNet
Sp
ee
du
p 
vs
 s
um
2d
Whole Network Benchmarking (aarch64)
direct
im2
kn2
Winograd
FFT
Local Optimal (CHW)
PBQP
armcl
caffe
(a) Single-readed Comparison of approaches with
PBQP selection on ARM Cortex-A57
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
AlexNet GoogleNet
Sp
ee
du
p 
vs
 s
um
2d
Whole Network Benchmarking (aarch64) (multithreaded)
direct
im2
kn2
Winograd
FFT
Local Optimal (CHW)
PBQP
armcl
caffe
(b) Multi-readed Comparison of approaches with
PBQP selection on ARM Cortex-A57
Figure 7. Whole-network benchmarking results on ARM Cortex-A57
If a xed canonical data layout is selected for all layers, the problem
becomes much easier.
If the NP-hardness of primitive selection can be eliminated by
keeping all data in a canonical layout, then would it not be easier
to simply use canonical data types rather than solve a dicult
selection problem? Our experiments show that this approach can
yield high performance in some situations (see VGG-C in Figure 5),
but it is always outperformed by the optimal selection. In some
cases, the gap is very wide (see AlexNet in Figure 5).
Our experimental data shows that even simple variations on
input data layout can have a signicant impact on data locality, and
therefore upon the execution time of DNN primitives. By solving
the primitive selection problem using our PBQP formulation, we
can nd an optimal solution that takes data layout transformation
costs into account.
Heuristics A similar question arises around other heuristics that
might be used for primitive selection. In other words, is our PBQP
formulation a more sophisticated solution than is needed to solve
the primitive selection problems in our experiments?
ere are simpler heuristics (some of which we employ in bench-
marking, such as locally optimal method selection) that still provide
good results for the experiments in Section 5.
However, the problem with heuristic solutions is not that they
perform poorly on the experimental data that was used during their
development. Simple survivorship bias guarantees that heuristics
that do poorly on test data are quickly dropped by those developing
them.
e problem with heuristics is that one is never sure how they
will do on unseen data that arises in practice. We therefore argue
that it is worthwhile using a more sophisticated approach to the
problem.
7 Related Work
PBQP is an extension of the quadratic assignment problem (QAP)
[10], one of the fundamental combinatorial optimization problems.
QAP is NP-hard, as is PBQP, but heuristics have been identied
that can oen solve or nd approximate solutions for practical
instances of PBQP quickly [6]. PBQP has been used to solve several
problems in compiler optimization such as register allocation for
architectures with irregular instruction sets [14], and code selection
on SSA graphs [6].
Lae [17] is a domain-specic language, compiler and runtime
for specifying DNNs. Lae provides abstractions such as neurons,
ensembles and connections that allow a programmer to specify the
elements of a DNN. A sophisticated compiler generates optimized
code, and paern-matches for code regions that can be replaced by
calls to optimized libraries such as GEMM. An advantage of Lae is
that it can automatically fuse successive layers such as convolution
and activation. In contrast, we use a library of highly-optimized
routines, and focus on selecting the best among them rather than
aempting to generate the best code from a high-level specication.
Moskewicz et al. [12] proposed Boda, a program generator for
deep neural networks. Boda generates large numbers of variants of
loop nests to implement DNN layers, and uses auto-tuning to select
from among them. Our approach has two signicant advantages
when compared with auto-tuning. First, our approach requires
much less time than auto-tuning. Our proling step requires that
we time each of our methods for each layer in the DNN, whereas
auto-tuning typically requires large numbers of measurements in
context. Second, insofar as the proled times of each routine are
accurate, and the PBQP instance can be solved in reasonable time,
our approach will provide an optimal selection of layers. A viable
future approach might be to use code generators and auto-tuners
to generate the code and data layouts for given layers and use our
approach to combine these code segments to implement an entire
DNN.
GPU vendors and research groups have developed libraries of
optimized routines to implement each layer of DNNs. Perhaps the
fastest library for NVIDIA GPUs is cuDNN [13]. e cuDNN library
provides several implementations of DNN convolution.
A function is provided to query the library at runtime to heuris-
tically select the method to implement a given layer given the input
Optimal DNN Primitive Selection with PBQP preprint, arxiv.org, 2018
parameters. In contrast, we solve for the global optimal layer se-
lection, taking into account data layout transformation costs. We
also measure prole layer execution times rather than relying on
estimated values. e cuDNN approach is equivalent to applying
our local optimal heuristic without xing the data layout.
TensorFlow XLA is a domain-specic compiler for linear algebra
computations which has been applied to DNNs [5]. Compared to
simply invoking primitive functions from a high-level deep learn-
ing framework, XLA allows a stand-alone C/C++ code base to be
generated to implement a DNN. XLA can eliminate intermediate
storage buers and fuse adjacent layers.
e XLA ahead-of-time compiler is arguably similar to our own
program generation framework which uses C++ templates to avoid
many of the costs of cross-layer transitions and to give the C++ com-
piler the best opportunity for code optimization. Our layer selection
approach is well-suited to systems such as XLA that generate DNN
code ahead of time.
8 Conclusion & Future Work
Using a PBQP solver in conjunction with per-layer proling to
create a cost model is an extremely eective tactic for DNN im-
plementation, even under simplifying assumptions where only
convolutional layers are modelled in the PBQP formulation. We
have demonstrated signicant gains in DNN inference performance
on both a high performance and embedded processors.
Future Work Modelling our problem as an instance of PBQP
gives us a great deal of extensibility. For example, given some
convolution routines which leverage sparsity in the kernel (for
example routines based on a sparse GEMM), our approach can
be used to decide whether a dense or a sparse implementation
(and moreover, which sparse implementation) will be faster for any
given convolutional layer, with the addition of a kernel sparsity
ratio parameter to the formulation.
Our approach can enable the construction of DNNs using convo-
lution routines from dierent libraries, if at least one edge in the DT
graph connects a convolution from library A to one from library B.
Investigation of the performance of these ensembles is an exciting
prospect for future work.
Our formulation, as mentioned in Section 3, does not currently
consider minibatch parallelism, but this can be encoded with an-
other integer parameter to the model (the minibatch size). is
would enable our optimization approach to select either parallel
GEMM or minibatch parallelism on a per-layer basis.
Acknowledgment
is work was supported by Science Foundation Ireland grant
12/IA/1381. is project has received funding from the European
Union’s Horizon 2020 research and innovation programme under
grant agreement No 732204 (Bonseyes). is work is supported by
the Swiss State Secretariat for Education, Research and Innovation
(SERI) under contract number 16.0159. e opinions expressed and
arguments employed herein do not necessarily reect the ocial
views of these funding bodies. is work was supported in part by
Science Foundation Ireland grant 13/RC/2094 to Lero — the Irish
Soware Research Centre (www.lero.ie).
References
[1] [n. d.]. Intel(R) Distribution of Cae. hps://github.com/intel/cae. ([n. d.]).
Accessed: 2017-09-13.
[2] [n. d.]. Intel(R) Math Kernel Library for Deep Neural Networks. hps://01.org/
mkl-dnn. ([n. d.]). Accessed: 2017-09-13.
[3] Andrew Anderson, Aravind Vasudevan, Cormac Keane, and David Gregg. 2017.
Low-memory GEMM-based convolution algorithms for deep neural networks.
CoRR arXiv:1704.04428 (2017). hps://arxiv.org/abs/1709.03395
[4] Richard E. Blahut. 2010. Fast Algorithms for Signal Processing. Cambridge
University Press, Cambridge, UK.
[5] Je Dean. 2016. TensorFlow w/XLA: TensorFlow, Compiled. (December 2016).
hps://autodi-workshop.github.io/ Presentation at Autodi Workshop, co-
located with NIPS 2016.
[6] Erik Eckstein, Oliver Ko¨nig, and Bernhard Scholz. 2003. Code Instruction Selection
Based on SSA-Graphs. Springer Berlin Heidelberg, Berlin, Heidelberg, 49–65.
hps://doi.org/10.1007/978-3-540-39920-9 5
[7] Lang Hames and Bernhard Scholz. 2006. Nearly Optimal Register Allocation
with PBQP. Springer Berlin Heidelberg, Berlin, Heidelberg, 346–361. hps:
//doi.org/10.1007/11860990 21
[8] Yangqing Jia. 2014. Learning Semantic Image Representations at a Large Scale.
Ph.D. Dissertation. EECS Department, University of California, Berkeley. hp:
//www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-93.html
[9] Yangqing Jia, Evan Shelhamer, Je Donahue, Sergey Karayev, Jonathan Long,
Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Cae: Convolu-
tional architecture for fast feature embedding. In Proceedings of the 22nd ACM
international conference on Multimedia. ACM, 675–678.
[10] Tjalling C. Koopmans and Martin Beckmann. 1957. Assignment Problems and
the Location of Economic Activities. Econometrica 25, 1 (1957), 53–76.
[11] Alex Krizhevsky, Ilya Sutskever, and Georey E. Hinton. 2012. ImageNet Classi-
cation with Deep Convolutional Neural Networks. In 26th Annual Conference
on Neural Information Processing Systems 2012. 1106–1114.
[12] Mahew W. Moskewicz, Ali Jannesari, and Kurt Keutzer. 2017. Boda: A Holistic
Approach for Implementing Neural Network Computations. In Proceedings of
the Computing Frontiers Conference (CF’17). ACM, New York, NY, USA, 53–62.
hps://doi.org/10.1145/3075564.3077382
[13] NVIDIA Inc 2017. cuDNN Developer Guide (du-06702-001 v07 ed.). NVIDIA Inc.
Accessed: 2017-09-13.
[14] Bernhard Scholz and Erik Eckstein. 2002. Register Allocation for Irregular
Architectures. In Proceedings of the Joint Conference on Languages, Compilers
and Tools for Embedded Systems: Soware and Compilers for Embedded Systems
(LCTES/SCOPES ’02). ACM, New York, NY, USA, 139–148. hps://doi.org/10.
1145/513829.513854
[15] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional
Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
hp://arxiv.org/abs/1409.1556
[16] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Sco E. Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabi-
novich. 2015. Going deeper with convolutions. In IEEE Conference on Computer
Vision and Paern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. 1–9.
hps://doi.org/10.1109/CVPR.2015.7298594
[17] Leonard Truong, Rajkishore Barik, Ehsan Totoni, Hai Liu, Chick Markley,
Armando Fox, and Tatiana Shpeisman. 2016. Lae: A Language, Compiler,
and Runtime for Elegant and Ecient Deep Neural Networks. In Proceed-
ings of the 37th ACM SIGPLAN Conference on Programming Language Design
and Implementation (PLDI ’16). ACM, New York, NY, USA, 209–223. hps:
//doi.org/10.1145/2908080.2908105
[18] Aravind Vasudevan, Andrew Anderson, and David Gregg. 2017. Parallel Multi
Channel convolution using General Matrix Multiplication. In 28th IEEE Inter-
national Conference on Application-specic Systems, Architectures and Processors,
ASAP 2017, Seale, WA, USA, July 10-12, 2017. 19–24. hps://doi.org/10.1109/
ASAP.2017.7995254
