Automated Design Space Exploration for optimised Deployment of DNN on
  Arm Cortex-A CPUs by de Prado, Miguel et al.
Miguel de Prado ∗†
miguel.deprado@he-arch.ch
Andrew Mundy ‡
Andrew.Mundy@Arm.com
Rabia Saeed ∗
Rabia.Saeed@he-arc.ch
Maurizio Denna §
maurizio.denna@nviso.ch
Nuria Pazos ∗
Nuria.Pazos@he-arc.ch
Luca Benini †
lbenini@iis.ee.ethz.ch
∗ He-Arc Ingenierie, HES-SO †Integrated System Lab, ETH Zurich ‡Arm Ltd. §Nviso
Abstract—The spread of deep learning on embedded devices
has prompted the development of numerous methods to optimise
the deployment of deep neural networks (DNN). Works have
mainly focused on: i) efficient DNN architectures, ii) network
optimisation techniques such as pruning and quantisation, iii)
optimised algorithms to speed up the execution of the most
computationally intensive layers, and iv) dedicated hardware to
accelerate the data flow and computation. However, there is a
lack of research on the combination of these methods as the space
of approaches becomes too large to test and obtain a globally
optimised solution, which leads to suboptimal deployment in
terms of latency, accuracy, and memory.
In this work, we first detail and analyse the methods to
improve the deployment of DNNs across the different levels of
software optimisation. Building on this knowledge, we present
an automated exploration framework to ease the deployment of
DNNs for industrial applications by automatically exploring the
design space and learning an optimised solution that speeds up
the performance and reduces the memory on embedded CPU
platforms. The framework relies on a Reinforcement-Learning-based
search that, combined with a deep learning inference
framework, enables the deployment of DNN implementations to
obtain empirical measurements on embedded AI applications.
Thus, we present a set of results for state-of-the-art DNNs on
a range of Arm Cortex-A CPU platforms achieving up to 4x
improvement in performance and over 2x reduction in memory
with negligible loss in accuracy with respect to the BLAS floating-
point implementation.
I. INTRODUCTION
Artificial intelligence (AI) is rapidly growing and will soon
become ubiquitous in our daily life. In particular, deep learning
has grown quickly in the last years, achieving remarkable
results in computer vision [1] and speech recognition [2].
Adoption of deep learning by major industrial players, e.g.,
Google [3], Tesla [4], is already a reality, and its numerous
applications are to bring on a new technological revolution.
Deep Neural Networks (DNN) are capable of learning ab-
stract features by stacking many layers in parallel and in depth,
which turns them into complex architectures. Training of
CNNs has drawn significant attention in the last years towards
building more and more competitive and accurate architectures
[5] and surpassing human capabilities, e.g., ImageNet com-
petition [6]. More recently, the focus has shifted towards the
deployment of such DNNs on resource-constrained devices. In
contrast to cloud environments, edge devices are often severely
constrained in the computing power, memory, and energy
budget available to a given application. These
constraints hamper deployment of deep learning solutions to
edge devices and require innovation in the design of deep
learning systems (or neural network architectures), and in the
software which executes them.
Numerous research works have focused on optimizing the
deployment of DNN through the development of i) efficient
DNN architectures such as MobileNets [7], SqueezeNet [8],
including hardware-aware neural architecture search (NAS)
[9], [10], ii) optimisation techniques such as pruning and
quantisation [11], [12], [13], [14], iii) optimised algorithms
to speed up the execution of the most computationally intensive layers,
e.g., general matrix multiplication (GEMM) [15] or Winograd
[16] and, iv) dedicated hardware to accelerate the data flow
and parallelise the computation [17], [18], [19].
There is, however, a lack of research on cross-level opti-
misation and design space exploration (DSE) for a complete
end-to-end solution [20]. The space of approaches for DNN
deployment becomes too large to fully explore and obtain an
optimal implementation as each layer of the neural network
may be executed following a different optimisation technique,
optimised algorithm, or even in a different processor, result-
ing in a different performance, memory footprint or power
consumption. The complexity of exploring and combining the
wide variety of design options usually results in a sub-optimal
solution [21].
The objective of this work is to ease the deployment of
pre-trained DNNs for industrial applications by automatically
exploring the design space and finding an optimised solution
to speed up the performance and reduce the memory on
embedded CPU platforms. To that end, we employ LPDNN
[22], a deep learning framework that enables the deployment
and inference of DNN implementations. We focus on software
optimisation for the deployment on Arm CPU cores as these
represent the majority of processors on mobile and IoT devices
and have extensive support for DNN inference [23]. Our work
is complementary to DNN architecture design to optimise the
deployment further and could also be applied to dedicated
hardware. Our contributions are the following:
arXiv:2006.05181v1 [cs.LG] 9 Jun 2020
• We analyse methods to improve the deployment of DNNs
across different levels of software optimisation and in-
troduce the range of techniques provided by LPDNN to
optimise DNN inference.
• We present QS-DNN, an automatic exploration frame-
work based on Reinforcement Learning (RL), that, com-
bined with LPDNN, finds an optimised combination of
design options that speeds up DNN inference and reduces
memory for a target platform.
• We present a set of results for state-of-the-art DNNs on
a wide range of Arm Cortex-A CPU platforms that cover
the current spectrum of deployment on mobile devices.
The paper is organized as follows: In Section II, we present
the background of the optimization for the deployment of
DNNs. Section III describes the deep learning inference frame-
work. In Section IV, we address the design space problem
and introduce the Reinforcement-Learning-based approach. In
Section V, we introduce the RL-based search engine and
the methodology of the experiments. Section VI presents the
results and discussion.
II. BACKGROUND: DEPLOYMENT OPTIMISATION OF
DEEP NEURAL NETWORKS
Given the constraints imposed on edge devices, namely,
relatively limited compute performance, small memory capac-
ities, and thermal and power consumption restrictions, there
are several goals for which one may choose to optimise.
For example, one might decide to sacrifice neural network
inference latency to reduce overall power consumption, or
to stay within a more limited memory capacity. Depending
on the goal of the optimisation, neural networks present a
range of software optimisation opportunities. We divide these
opportunities into several broad categories as shown in Fig. 1:
A. Network Design
We define network design optimisation to be the set of
techniques that tailor the structure of a network before, or
during training, to improve the latency or cost of network
inference. Examples of this are MobileNet-V1/V2 [7], [24],
SqueezeNet [8], and ShuffleNet [25], which were manually
shaped thanks to the expertise of the authors. A set of
newer works introduced neural architecture search (NAS)
as a technique to reduce the high-level human knowledge
needed for the conception of such architectures. Examples
of this are MNASnet [26], FbNet [10], and Lemonade [27]
which used hardware-aware NAS via reinforcement learning,
evolutionary algorithms or gradient-based methods to discover
neural network structures with both good accuracy and high
performance.
Distillation is another technique where a neural network
teacher can transfer its learned knowledge to student networks.
Students are constructed to present lower computational com-
plexity and memory cost than their teachers. The student
imitates the teacher over a training dataset and obtains high
accuracy while reducing the complexity of the neural network.
Works implementing this technique are [28], [29], [30].
Fig. 1: Optimisation categories for the deployment of DNNs
at different levels of the software stack.
B. Network Optimisation
In contrast to neural network design, network optimisation
techniques take an existing network and modify the way it
is represented, for example by exploiting lower-precision data
types (quantisation), the inherent sparsity of weights and acti-
vations (pruning) or fusion of linear operations (layer fusion).
Techniques in this category may require a neural network to
be retrained to deal with a loss of accuracy, or to enforce
a particular structure upon sparse data. However, significant
reductions in memory consumption can be achieved.
1) Pruning: Explicit representation of sparsity can result
in less computational work being wasted on needless com-
putations (such as multiplication by zero). There are two
categories of sparsity: unstructured and structured. The former
can achieve a higher degree of sparsity over the tensor but
involves a high overhead to decode the sparse format at infer-
ence time, making this approach more suitable for specialised
accelerators [13], [31], [32]. The latter exploits the inherent
structure of the data. One example of this is forcing a reduction
of feature maps within a layer resulting in a dense but smaller
tensor [33], [34], [35].
2) Quantisation: Quantisation is a compression method
that reduces the storage cost of a variable by employing
reduced-numerical precision. This improves the arithmetic
intensity of neural network inference by increasing the amount
of computational work which can be performed for a given
amount of memory traffic. Some of the first works addressing
hardware-oriented quantisation of DNNs were Ristretto [36]
and [37], which analysed the effect of quantising weights,
biases, and activations for each layer. DNN quantisation gained
much attention and evolved quickly towards: i) reducing the
bitwidth down to binary networks [38], [39] and, ii) techniques
to find a suitable trade-off between compression and accuracy
where autoML methods represent the SoA for mixed-precision
inference [40], [41], [42]. We refer the reader to extensive anal-
yses for efficient quantisation deployment, which are provided
by [43], [11].
All the mentioned works provide quantisation methods or
tools that involve training or fine-tuning the DNNs to push the
limits of quantisation as well as a large training dataset. There
are, however, several works that provide tools for post-training
(direct) quantisation achieving 8-bit [44], [45], [46] or even 4-
bit [47], [48] inference with minimal loss in accuracy, making
them very attractive for any user to deploy DNNs efficiently
on embedded devices.
3) Layer fusion: Layer fusion can improve the memory
traffic of a DNN as several linear operations in a neural
network graph can be fused into a single one – avoiding repeat-
edly writing and rereading the same area of a tensor. In general
terms, the fusion of two consecutive layers approximately
halves the memory traffic associated with the combination.
Examples of this are merging the batch normalisation and
scale layer or the activation and concatenation layer into the
previous convolution or fully connected layer. Further memory
optimisations can be achieved by different layers sharing the
same memory space if there is no dependency between them,
e.g., in-place computation or network-memory pool.
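As an illustration of static layer fusion, the sketch below folds per-channel batch normalisation parameters into the weights and bias of a preceding fully connected layer. It is a minimal example under our own assumptions: the function names, the list-based tensor representation, and the `eps` value are hypothetical and are not taken from LPDNN.

```python
import math

def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding linear layer.

    weights holds one list of weights per output channel; all other
    arguments hold one value per output channel (illustrative layout).
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, bt, mu, v in zip(weights, bias, gamma, beta, mean, var):
        s = g / math.sqrt(v + eps)            # per-channel scaling factor
        folded_w.append([w * s for w in w_c])
        folded_b.append((b_c - mu) * s + bt)
    return folded_w, folded_b

def linear(x, weights, bias):
    """Fully connected layer: one dot product per output channel."""
    return [sum(w * xi for w, xi in zip(w_c, x)) + b
            for w_c, b in zip(weights, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalisation applied channel by channel."""
    return [g * (yi - mu) / math.sqrt(v + eps) + bt
            for yi, g, bt, mu, v in zip(y, gamma, beta, mean, var)]
```

After folding, a single call to `linear` with the folded parameters yields the same output as `linear` followed by `batchnorm`, so the intermediate tensor is never written to memory.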
C. Algorithm Optimisation
Once a network has been designed (and possibly optimised
through application of quantisation, sparsity, or fusion), execu-
tion of the network can be optimised through modification of
the way in which layers of the network are implemented. We
mainly focus on the performance optimization of convolutions
since these comprise the lion’s share of the computational work
contained in a neural network. There are several ways in which
convolution can be performed: direct convolution; one of
several approaches that exploit a General Matrix-Matrix
Multiplication (GEMM) call; or one of many fast-convolution
methods like Winograd convolution. Each of these
methods has its own trade-offs.
1) Direct: A naive approach to implementing convolution
on CPU is to directly implement the six-nested for loop
which describes convolution. Although direct convolution
incurs no memory overhead, it is rarely used since it is
difficult to express the algorithm in a way that extracts much
performance from CPU architectures [49]. Implementations
can be improved by keeping some of the weights, inputs,
or outputs resident in registers – especially for a small number
of parameters [50] – and reordering the layout and loops to
optimise data reuse [51].
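The six-nested loop mentioned above can be sketched as follows. This is a naive reference version for illustration only, assuming stride 1, no padding, and nested Python lists indexed as `inp[c][y][x]` and `weights[m][c][ky][kx]`; these conventions are our own choice.

```python
def direct_conv2d(inp, weights):
    """Naive direct convolution (stride 1, no padding).

    inp[c][y][x] is the input tensor, weights[m][c][ky][kx] the filters.
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    M, K = len(weights), len(weights[0][0])
    OH, OW = H - K + 1, W - K + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(M)]
    for m in range(M):                      # output channels
        for oy in range(OH):                # output rows
            for ox in range(OW):            # output columns
                for c in range(C):          # input channels
                    for ky in range(K):     # kernel rows
                        for kx in range(K): # kernel columns
                            out[m][oy][ox] += (
                                inp[c][oy + ky][ox + kx] * weights[m][c][ky][kx]
                            )
    return out
```

No buffer beyond the input, weights, and output is allocated, which is the "no memory overhead" property noted in the text.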
2) GEMM-based: Use of GEMM is attractive to accelerate
convolutions since there exists a wide range of fast implementations
provided by highly optimised BLAS libraries such as
OpenBLAS [52] or BLIS [53] capable of exploiting the SIMD
instructions of the Armv8-A architecture. The prototypical
approach to constructing a GEMM-backed convolution is to
use the im2col or im2row algorithms to construct a “patch”
matrix, which can be multiplied by a matrix representing the
convolution weights to form the final output matrix. It should
be noted, however, that while the amount of work performed
by the GEMM is equivalent to direct convolution, the memory
footprint is $k^2$ times larger (where $k$ is the size of the kernel).
This significant memory overhead has led to research into
more memory-efficient GEMM-backed convolutions. Examples
are the kn2row technique [15] and the indirect GEMM
approach [49]. Neither of these techniques reduces
the arithmetic cost of performing a convolution, although they
do avoid the cost of rearranging the data into im2col or
im2row form.
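To make the memory overhead concrete, the following sketch builds an im2row-style patch matrix for a single-channel input and performs the convolution as a matrix-vector product. It is an illustrative simplification (one channel, one filter, stride 1, no padding), and the helper names are hypothetical.

```python
def im2row(inp, k):
    """Build the patch matrix: one row of k*k values per output pixel."""
    H, W = len(inp), len(inp[0])
    rows = []
    for oy in range(H - k + 1):
        for ox in range(W - k + 1):
            rows.append([inp[oy + ky][ox + kx]
                         for ky in range(k) for kx in range(k)])
    return rows

def gemm_conv2d(inp, kernel):
    """Convolution expressed as patch matrix times flattened kernel."""
    k = len(kernel)
    patches = im2row(inp, k)
    flat = [w for row in kernel for w in row]
    out_w = len(inp[0]) - k + 1
    dots = [sum(p * w for p, w in zip(patch, flat)) for patch in patches]
    # Reshape the flat result back into a 2-D output map.
    return [dots[i:i + out_w] for i in range(0, len(dots), out_w)]
```

Each input pixel is copied into up to $k^2$ rows of the patch matrix, which is where the $k^2$ growth in footprint comes from.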
3) Winograd: Winograd Convolution [54] can help to
address the problem of the high arithmetic cost of convo-
lution. These algorithms help to reduce the overall compute
complexity of convolution by transforming the convolution
into another domain where the number of required strong
operations (such as multiplication) is reduced at the expense
of an increase in the number of weak operations (such as
addition). Implementations of these algorithms are well suited
to low power embedded systems, as the resources and power
budget are very limited. By contrast, they have a higher cost
in memory consumption and accuracy [16].
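The trade of multiplications for additions can be seen in the smallest one-dimensional case, F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. The sketch below uses the standard F(2,3) transform; it is a textbook illustration and not the kernel used by any of the libraries discussed here.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over four inputs d, 4 multiplies."""
    # Filter transform (can be precomputed once per filter).
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by the four element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference: the same two outputs with six multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]
```

The division by two in the filter transform is one source of the numerical error mentioned above when the same idea is applied to larger tiles or lower precision.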
D. Primitive design
Finally, the lowest level of software instantiating a neural
network can be optimised to make better use of the hardware
upon which it is executed. All of the algorithms described
in the previous sections feature at least one loop, which will
be executed many thousands of times during neural network
inference. Ensuring that this innermost loop is implemented
as well as possible is vital to achieving good overall perfor-
mance. Optimisations at this level of abstraction can vary from
changing the layout of data in memory, through changing how
vectorised instructions are used to process the layer, to writing
assembly implementations of the kernels to extract maximum
performance from specific processors.
1) Data layout: Ensuring that operands are laid out to
achieve proper use of the processor cache hierarchy, and easy
exploitation of vectorized execution may produce significant
improvements in performance [16]. Indeed, several works,
including [21], [55], apply several algorithms to find an
optimised selection of data layout for each layer of a DNN.
2) Vectorisation: It is crucial that the vector (Single In-
struction Multiple Data – SIMD) instructions provided by
the Instruction Set Architecture (ISA) are used to make the
most of processor throughput. Examples of works leveraging
vectorisation for the optimisation of convolutions on GPU or
CPU are [56], [53], [16].
3) Assembly code: Modern compilers, while good at ensur-
ing general-purpose code can be compiled into fairly efficient
assembly, have some drawbacks. Writing assembly code by
hand can allow the programmer to perform optimisations
missed by the compiler and allows for a much greater degree
of control of the final binary.
E. Discussion
The vast majority of the works presented above focused on
specific optimisations for DNNs without taking into account
the trade-offs at different levels of the software stack. We draw
inspiration from Anderson et al. [21] who use PBQP to opti-
mize inference time by selecting suitable backends. However,
they only profile the latency for convolutional layers without
addressing other layer types or optimisations at network level,
e.g., quantisation.
In this work, we provide a broader picture and show the
various steps of the optimisation for the deployment of DNNs
on CPU. Furthermore, we present an automatic exploration
framework, based on Reinforcement Learning, that searches
through different design options and analyses several DNNs on
a range of embedded platforms while trading off metrics like
latency, memory, or accuracy. Thereby, we can find a solution
that, for instance, can answer the following questions: What
DNN shall I use? What are the best optimisation techniques
that I can follow? How can I obtain a fast implementation
under certain memory or accuracy constraints?
III. DEEP LEARNING INFERENCE FRAMEWORK
We form part of a European collaboration to bring deep
learning methods to any party who would like to take up
deep learning solutions in an industrial environment [57]. In
this context, a deep learning framework (LPDNN) has been
developed [22] to produce efficient and tunable code that
enables and maximizes the portability among platforms. In
this work, we introduce the range of techniques provided
by LPDNN to optimise the deployment of DNNs. Besides,
we address the integration of the LPDNN into our search
environment to tightly couple empirical measurements of a
heterogeneous platform to a learning-based search.
A. Architecture
One of the main goals of LPDNN is the portability and
flexibility of AI applications across platforms. LPDNN’s core
comprises a set of CPU dependency-free functions which can
be complemented by platform-specific acceleration libraries,
such as Arm Compute Library [58] and cuDNN [59], to
generate an optimised implementation for the system. LPDNN
contains a modular and hierarchical architecture that supports
multiple libraries and optimisations at the same time. This
flexibility allows us to experiment with optimised algorithms
for a particular layer or blocks to execute each layer with the
most suitable implementation according to the network archi-
tecture, target platform, and desired accuracy and performance
specification.
B. Optimisations
We follow the structure given in Section II, and show the
software optimisations that LPDNN contains at various levels:
1) Network optimisation: LPDNN provides efficient in-
ference with integer arithmetic as it supports post-training
quantisation for both weights and activations. Weights can
be directly quantised to 8-bit integer while the activations
require a validation set to determine their dynamic range. The
range is then used to calculate the scale and offset for both
symmetric and asymmetric quantisation methods. The scale
value can be further tuned to reduce the loss of information
by minimising the KL divergence between the quantised and
original distribution [46]. LPDNN supports both per-layer and
per-channel quantisation. However, because the acceleration
libraries lack support for per-channel quantisation, we focus
on, and show results for, the former only.
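A minimal sketch of how a scale and an offset can be derived from a measured dynamic range is given below, for both the symmetric and the asymmetric scheme. The formulas are the commonly used ones for 8-bit post-training quantisation; the function names and the convention of storing the offset as an integer zero point are our assumptions and may differ from LPDNN’s internal representation. The range is assumed to be non-degenerate.

```python
def quant_params(rmin, rmax, symmetric, bits=8):
    """Return (scale, zero_point) for a real-valued range [rmin, rmax]."""
    if symmetric:
        qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits
        return max(abs(rmin), abs(rmax)) / qmax, 0    # offset fixed at zero
    qmin, qmax = 0, 2 ** bits - 1                     # unsigned range 0..255
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)       # range must contain 0
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(qmin - rmin / scale)
    return scale, zero_point

def quantise(x, scale, zero_point, qmin, qmax):
    """Map a real value to the nearest representable integer, clamped."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def dequantise(q, scale, zero_point):
    """Map an integer back to its real-valued approximation."""
    return (q - zero_point) * scale
```

For a range of [-1, 3], the asymmetric scheme spends all 256 levels on the observed range, whereas the symmetric scheme covers [-3, 3] and leaves part of the integer range unused.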
Several other optimisations are performed in LPDNN:
• Static layer fusion: Fusion of linear operations to reduce
the neural network graph at build time. LPDNN supports
the fusion of the Bnorm and scale layers into the previous
convolution or fully connected layer.
• Runtime layer fusion: The execution of two or more
layers is performed in a single pass with a significant re-
duction in memory traffic. LPDNN supports the fusion of
the activation and concatenation layers into the previous
convolution.
• In-place computation: Layers such as activation, reshape,
or flatten may store the output result directly in the
memory allocated for the input, which halves the memory
allocations for a layer.
• Memory pool: Layers whose execution does not overlap
and that have no data dependencies share the same
memory, which – due to the sequential nature of most
DNNs – notably reduces overall memory footprint.
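The effect of a memory pool can be approximated with a very simple model. The sketch below assumes a strictly sequential network in which layer i reads tensor i and writes tensor i+1, so that only two tensors are live at any time; the tensor sizes are made-up numbers and the model ignores branches, weights, and alignment.

```python
def footprint_no_sharing(tensor_sizes):
    """Every activation tensor gets its own allocation."""
    return sum(tensor_sizes)

def footprint_with_pool(tensor_sizes):
    """Only the input and output of the running layer must coexist."""
    return max(a + b for a, b in zip(tensor_sizes, tensor_sizes[1:]))

# Hypothetical activation sizes (in KiB) for a five-tensor network.
sizes = [150, 800, 800, 200, 10]
saving = footprint_no_sharing(sizes) - footprint_with_pool(sizes)
```

In this toy case the pool needs 1600 KiB instead of 1960 KiB; deeper networks benefit more, since the pooled footprint depends on the largest pair of adjacent tensors rather than on the depth.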
2) Algorithm optimisation: LPDNN integrates a set of
acceleration libraries, including OpenBLAS, BLIS, NNPACK,
and ArmCL, that provide optimised algorithms for the execution
of DNNs on CPUs. LPDNN leverages the algorithms
provided by the libraries and may execute each layer of the
network with a different algorithm. Further, LPDNN also uses
a lower-level interface of the Arm Compute Library that we
refer to as LPDNN-Arm library. It supports both floating-point
32-bit (FP32) and integer 8-bit (INT8) operations and provides
special optimisations for the following layers:
• Standard convolution: LPDNN integrates an FP32 fast
convolution implementation that relies on Winograd
for 3x3, 5x5 and linear kernels (originally from [16]),
and a vectorised GEMM implementation for all kernels,
including the common 1x1, for both FP32 and INT8.
• Depthwise convolution: Despite containing relatively
little computational work, depthwise convolution is
challenging to implement efficiently due to its memory-bound
nature. LPDNN exploits all the reuse presented by the
algorithm by carefully mapping it to both the SIMD
instructions and the cache for both FP32 and INT8.
• Others: LPDNN also optimises pooling, element-wise,
and activation layers by providing vectorisation for both
FP32 and INT8 implementations.
Fig. 2: Conversion penalties. 3-layer network where the
arrows express the incompatibility penalty between implementations.
The agent is able to avoid local minima, e.g., the red path,
which contains the fastest intermediate implementation, and
selects the blue path instead, which is the fastest overall.
3) Primitive design: In Section II, we noted three different
elements of primitive design which can be combined to build
optimised kernels to implement neural network algorithms.
These were data layout, vectorisation, and assembly code.
The first two of these are neatly tied together: the order in
which data is stored suggests the vectorisation approach taken
and vice versa. For example, when implementing a vectorised
convolution one may decide to operate on several channels of
data simultaneously, in which case storing channels contigu-
ously facilitates easier use of the data. We have determined
empirically that, on CPUs, it is often better (both easier and
more performant) to write kernels which operate on multiple
channels simultaneously [16, § 2.1]. Consequently, the majority
of optimised kernels in LPDNN operate on NHWC-ordered
data (where N stands for the number of batches, H for the
height of the tensor, W for the width and C for the number of
channels). However, it can still be beneficial to support data in
other formats (as shown in Section VI) – for example, when
the channel count is low it is better to make use of data-reuse
across the plane of a convolution.
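The difference between the two layouts reduces to how a logical index (n, c, h, w) maps to a linear memory offset. The sketch below shows both mappings for a dense tensor without padding; the function names are ours.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear offset in NCHW order: width is the fastest-moving index."""
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    """Linear offset in NHWC order: channel is the fastest-moving index."""
    return ((n * H + h) * W + w) * C + c
```

In NHWC, the channels of one pixel are adjacent in memory, so a SIMD load fetches several channels at once. In NCHW, neighbouring channels are H·W elements apart, while neighbouring pixels of the same channel are adjacent, which favours reuse across the plane.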
We noted above that it can often be worth hand-writing
assembly code implementations for key kernels, rather than
relying on the compiler. There are a few reasons for this,
largely stemming from wanting finer-grain control over reg-
ister allocation, instruction selection and scheduling than is
possible from use of compiler intrinsics. LPDNN integrates
several hand-optimised vendor kernels covering algorithms
such as GEMM and depthwise convolution.
IV. LEARNING-BASED SEARCH ENGINE
In this section, we address the design space problem for
optimising the deployment of DNNs and we propose Reinforcement
Learning as a solution.
A. Problem formulation
Given a DNN, each layer of the neural network may follow
a different optimisation technique, or be executed by different
acceleration libraries which, in turn, might provide several
algorithms, data types or layouts. The space of approaches for
DNN deployment becomes too large to test exhaustively and
obtain an optimal implementation. The problem is not as trivial
as to benchmark all possible implementations individually
and select the most suitable for each layer to make up the
optimal network implementation. Each implementation may
follow a different optimisation strategy, have a different layout,
data type, or even be executed in a different processor which
might not correspond to those from the previous and following
layers. Therefore, incompatibilities arise and conversion or
data-copy layers are needed, which incur extra penalties, see
Fig. 2.
The number of combinations within a network, which is
the design space to explore, grows exponentially with the
number of layers, $N_L$, with the number of different
implementations per layer, $N_I$, as the base. Hence, the design
space size for a network would be $N_I^{N_L}$ in the worst case. This is
a non-trivial problem and, therefore, a careful search must be
carried out to select the right set of deployment options that,
combined and accounting for the conversion penalties, yields the
most suitable implementation for a given goal, e.g., latency,
accuracy or memory.
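The toy example below illustrates both points: the size of the design space and why choosing the fastest implementation per layer can be sub-optimal once conversion penalties are included. The latencies, the two implementations "A" and "B", and the uniform penalty applied whenever consecutive layers use different implementations are all hypothetical simplifications.

```python
from itertools import product

def design_space_size(n_impl, n_layers):
    """Worst-case number of network implementations, N_I ** N_L."""
    return n_impl ** n_layers

def network_latency(path, layer_cost, penalty):
    """Sum of layer latencies plus a penalty per implementation switch."""
    total = sum(layer_cost[i][impl] for i, impl in enumerate(path))
    total += sum(penalty for a, b in zip(path, path[1:]) if a != b)
    return total

def greedy_per_layer(layer_cost):
    """Pick the fastest implementation of each layer in isolation."""
    return tuple(min(costs, key=costs.get) for costs in layer_cost)

def exhaustive_best(layer_cost, penalty):
    """Enumerate the whole space; feasible only for tiny networks."""
    impls = list(layer_cost[0])
    return min(product(impls, repeat=len(layer_cost)),
               key=lambda p: network_latency(p, layer_cost, penalty))

# Hypothetical per-layer latencies in ms for a 3-layer network.
layer_cost = [{"A": 1.0, "B": 1.2}, {"A": 1.0, "B": 0.8}, {"A": 1.0, "B": 1.3}]
```

With a penalty of 0.5 ms, the per-layer choice (A, B, A) costs 3.8 ms, whereas staying on A throughout costs 3.0 ms. Enumeration is trivial for these 8 combinations, but with, say, 5 implementations and 50 layers the worst case is 5^50, roughly 8.9e34 candidates.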
B. Reinforcement Learning Approach
Reinforcement Learning (RL) lends itself well to exploring
large design spaces due to its sample-based approach
and far-sighted accumulative reward [60], [61]. Consider the
network space exploration as a Markov Decision Process
(MDP) containing an agent. We are interested in learning a
function that optimises the agent’s behavior or policy $\pi(a_t \mid s_t)$,
i.e., mapping from state $s_t$ to actions $a_t$, without modeling
the environment and only relying on the reward function.
Q-learning [62] fits this description well as it is a model-free and
value-based implementation, having the policy implicit in the
value function. The action-value function $q_\pi$ is the expected
return $G_t$ in a state $s_t$ taking an action $a_t$:
$$q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s, a_t = a \right] \tag{1}$$
The objective of Q-learning is to maximize the total reward
$R_T = \sum_{t=0}^{\infty} \gamma^t r_t$, where $r_t$ is an individual reward and $\gamma$
is the discount factor for successive states. Besides, Q-learning
is an off-policy implementation, that is, it may follow
a behavior policy A while targeting a greedy policy B.
Following Bellman’s equation, we can iteratively update the
action-value function ($Q = q_\pi$) as follows:
$$Q(s_t, a_t) = (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) \right] \tag{2}$$
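The update in Eq. (2) translates into a few lines of code. The sketch below stores Q-values in a dictionary keyed by (state, action); the default values of `alpha` and `gamma` are the ones reported later in the paper, while the function signature and the tabular representation are our assumptions.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, next_actions,
             alpha=0.05, gamma=0.9):
    """Apply one tabular Q-learning update, Eq. (2), and return the new value."""
    # Best value reachable from the next state; 0.0 if it is terminal.
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] = ((1 - alpha) * Q[(state, action)]
                          + alpha * (reward + gamma * best_next))
    return Q[(state, action)]

# Q-table with every unseen (state, action) pair initialised to 0.
Q = defaultdict(float)
```

For instance, with Q(s1, x) = -2.0 and Q(s1, y) = -1.0, taking an action from s0 with reward -0.5 moves Q(s0, a) from 0 to 0.05 · (-0.5 + 0.9 · (-1.0)) = -0.07.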
C. Search Engine
We consider an agent whose aim is to learn the optimal path
among a large but finite set of states S, i.e., layer representations,
employing a set of actions A, i.e., layer implementations.
RL suits the specifications of the problem that we address
in this work well. Latency, accuracy or memory provide a clear
reward function given by the environment that we aim to
explore: a Deep Neural Network.
| State parameter | Definition |
|---|---|
| Layer type | Any layer, e.g., convolution, pooling |
| Layer depth | Position of the layer in the network |
| Acceleration library | Name of the library |
| Algorithm | Routine type |
| Algorithm config | Sub-routine or lowering method |
| Data type | Any type, e.g., FP32, FP16, INT8 |
| Data layout | Any layout, e.g., NCHW, NHWC |
| Target hardware core | CPU, GPU, FPGA |

Table I: State Space. Parameters define the execution implementation
of a layer on a target platform.
1) State Space: The agent samples sequentially a new set
of implementations for the network, layer by layer. The state
space is defined as a tuple of the parameters that specify the
execution implementation of a layer on a target platform, see
Table I. All implementations are defined by an algorithm,
its configuration format, a data type, a layout and a HW
processor. The agent chooses one implementation from the set
of acceleration libraries given the current layer type. Based on
the action, the agent moves to another state and the process is
repeated until the end of the network.
2) Exploration Strategy: Similar to Baker et al. [63], we
have implemented an ε-greedy strategy [64] which trades off
between exploitation and exploration. The agent starts mainly
exploring the design space (random actions) to sample the
diverse possibilities (ε = 1). We slowly decrease ε over the
episodes for the agent to select the best actions and finally
learn an optimal path: full exploitation (ε = 0). In addition, we
have added experience replay [65], a technique that reuses
past experiences, to help the action-value function converge
faster. After each episode, a batch of past experiences is
sampled and presented to the agent. We have set the experience
replay buffer size to 128 following [63].
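The two ingredients of the exploration strategy can be sketched as follows: an ε-greedy action selector and a fixed-capacity replay buffer. The buffer capacity of 128 comes from the text; the class and function names, and the use of a bounded `deque` that discards the oldest experience first, are our assumptions.

```python
import random
from collections import deque

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best one.

    q_values maps each available action to its current Q-value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # exploration
    return max(q_values, key=q_values.get)      # exploitation

class ReplayBuffer:
    """Keeps the most recent transitions for later re-use."""

    def __init__(self, capacity=128):
        self.buffer = deque(maxlen=capacity)    # oldest entries drop out

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        """Return up to batch_size distinct past transitions."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```

With ε = 1 the selector ignores the Q-values entirely, and with ε = 0 it always returns the action with the highest Q-value.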
3) Reward Function: As our main goal is to optimise
the deployment of pre-trained DNNs on Cortex-A CPUs, we
primarily focus on the latency as our main optimisation goal
while setting the accuracy loss and memory consumption as
hard constraints. The objective is to maximize the total reward,
which in this case means minimizing the inference time. Although
we initially used the total network inference time as the only
reward signal, we have added rewards at each step for better
convergence (Reward Shaping [66]). Hence, each state receives as
reward its own layer inference time with the sign reversed, e.g.,
0.01 ms ⇒ -0.01 ms. Thanks to the Q-learning update rule,
each layer also receives Q-knowledge from the best following
state. Therefore, the agent is able to combine both sources of
knowledge, look ahead and avoid local minima due to penalties
introduced by incompatibility between layers, see Fig. 2.
V. AUTOMATED DSE FOR DEPLOYMENT OF DNNS
We name our automated DSE framework for the deployment
of DNNs QS-DNN (Q-based Search). The aim of QS-DNN
is to automatically optimise the inference of any DNN
on an embedded system. The process is composed of three
phases: a) inference of the DNN on the embedded system
to obtain empirical measurements: latency and memory, b)
automatic RL-based search to explore the design space and
learn optimised solutions and, c) inference of the learnt
solutions to obtain accuracy measurements. We have separated
the phases (Fig. 3) to avoid inferring on the embedded system
each possible solution of the search space, which would
significantly slow down the process. Finally, we obtain the
Pareto optimal solution based on latency optimisation with
accuracy and memory as constraints.
Fig. 3: Architecture of QS-DNN. Complete flow: Inference
on an embedded system on the left, RL-based learning on the right.
A. Metrics Collection (1/3)
We employ LPDNN, the deep learning inference framework
described in Section III, to obtain real measurements, although
the search could also be applied to any other framework. We
employ LPDNN’s acceleration libraries which provide imple-
mentations that may leverage the optimisations at network,
algorithm and primitive level. The objective of this phase is
to measure the costs of all possible graph nodes, i.e., layer
implementations, and all possible edges, i.e., the compatibility
conversions inserted between each node, to build a look-up
inference table for the search engine.
Thus, the inference controller goes over each acceleration
library and benchmarks each implementation¹, one at a time,
in all the layers where the library provides that
implementation. Therefore, we only need to infer the whole
network on the embedded platform as many times as there are
different global implementations. In each inference, the execution
time and memory consumption of each layer are measured.
Once all the implementations have been benchmarked, we
profile the compatibility conversions for data type and layout
transformation, as well as for data transfers between different
processors if needed.
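As a rough illustration of the look-up table built in this phase, the following Python sketch stores a mean cost per (layer, implementation) node and a penalty per conversion edge. The layer names, implementation names and all numbers are hypothetical placeholders, not LPDNN identifiers or measurements.

```python
from statistics import mean

# Hypothetical latency samples in ms: (layer, implementation) -> one value per image.
# The text averages over 20 images; three are used here for brevity.
raw_samples = {
    ("conv1", "gemm_fp32"): [4.1, 4.0, 4.2],
    ("conv1", "winograd_fp32"): [2.0, 2.1, 1.9],
    ("conv2", "gemm_fp32"): [3.0, 3.1, 2.9],
    ("conv2", "gemm_int8"): [2.4, 2.6, 2.5],
}

# Hypothetical edge costs in ms: penalty for converting between two implementations.
conversion_cost = {
    ("winograd_fp32", "gemm_int8"): 0.3,  # e.g. a data-type conversion
    ("gemm_int8", "gemm_fp32"): 0.2,
}


def build_node_table(samples):
    """Collapse the per-image samples into one mean cost per graph node."""
    return {key: mean(values) for key, values in samples.items()}


def edge_cost(prev_impl, next_impl):
    """Cost of the conversion between two consecutive layers (0 if compatible)."""
    return conversion_cost.get((prev_impl, next_impl), 0.0)


node_table = build_node_table(raw_samples)
```

With such a table, the search never has to touch the embedded platform again: any candidate network can be costed by summing node and edge entries.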
B. Search Engine (2/3)
The search space and the conditions of the search can be
defined for each network. They specify the behaviour of the
agent: number of episodes for each ε, learning rate, discount
factor and replay buffer size. We have set the learning rate
to 0.05 and the discount factor to 0.9 to give slightly more
importance to short-term rewards. Once the metrics collection
phase has finished, the Q-learning-based search begins and
proceeds as shown in Algorithm 1.
First, ε is retrieved from the specifications, as well as the
number of episodes for that ε. In all experiments, 50% of
the total episodes correspond to full exploration and 5% to
each other value of ε from 0.9 to 0.1. By these means, the
agent obtains enough knowledge from the environment before
starting exploitation, see Fig. 4.
Algorithm 1 QS-DNN - Search
ε ← new ε
while Learned Episodes < Episodes(ε) do
    Reset Path
    while Layer ≠ End Layer() do
        if Generate Random < ε then
            Action ← Q-values(Random)
        else
            Action ← Q-values(Max)
        Layer ← Next Layer
    Check for Incompatibility
    Compute Inference Time
    Experience Replay & Update (eq. 2)
Fig. 4: RL search over 1000 episodes, where the first 500
episodes are full exploration. From there on, ε is decreased by
0.1 towards exploitation after every 50 episodes.
¹ Each implementation is inferred for 20 images and the mean
is calculated.
For each episode, the agent sequentially samples a new
set of implementations based on the ε-greedy strategy. Once the
network’s configuration is set, the engine automatically looks
for incompatibilities between layers. The total network
inference time is then computed by looking up each implementation
in the inference table and summing the values of all layers.
If an incompatibility is found between two layers, the
corresponding time penalty is added to the inference time of the
latter layer. Finally, the action-value function is updated with
the current reward, and the experience is stored for replay. When
the number of episodes for a given ε has been met, ε is decreased
towards the exploitation phase. By the end of the search, the
engine outputs the best-performing configuration and the
learning curve that the agent has followed, see Fig. 4.
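The search loop described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the QS-DNN implementation: the cost table, the conversion penalty, the state encoding (layer, previous implementation), the per-layer reward and the replay batch size are all hypothetical, and the update is the standard Q-learning rule, assumed here to correspond to eq. 2.

```python
import random

# Hypothetical look-up table: latency in ms per layer and implementation.
NODE_COST = [
    {"gemm_fp32": 4.0, "winograd_fp32": 2.0, "gemm_int8": 3.0},
    {"gemm_fp32": 3.0, "winograd_fp32": 1.5, "gemm_int8": 2.0},
    {"gemm_fp32": 2.0, "gemm_int8": 1.2},
]
CONVERSION_PENALTY = 0.5  # hypothetical cost in ms of a data-type conversion


def epsilon_schedule(total_episodes):
    """50% of the episodes at epsilon = 1.0, then 5% for each of 0.9 ... 0.1."""
    schedule = [(1.0, total_episodes // 2)]
    schedule += [(k / 10, total_episodes // 20) for k in range(9, 0, -1)]
    return schedule


def step_cost(layer, prev_impl, impl):
    """Look up the layer cost; charge a data-type change to the latter layer."""
    cost = NODE_COST[layer][impl]
    if prev_impl is not None and prev_impl.split("_")[1] != impl.split("_")[1]:
        cost += CONVERSION_PENALTY
    return cost


def q_update(q, transition, alpha, gamma):
    """Standard Q-learning update; the reward is the negative latency."""
    state, action, reward, next_state = transition
    future = max(q[next_state].values()) if next_state is not None else 0.0
    q[state][action] += alpha * (reward + gamma * future - q[state][action])


def search(schedule, alpha=0.05, gamma=0.9, replay_size=128, batch=16, seed=0):
    rng = random.Random(seed)
    q = {}  # one state per (layer, implementation chosen for the previous layer)
    for layer, impls in enumerate(NODE_COST):
        previous = [None] if layer == 0 else list(NODE_COST[layer - 1])
        for prev in previous:
            q[(layer, prev)] = {impl: 0.0 for impl in impls}
    replay, best_time, best_path = [], float("inf"), None
    for epsilon, episodes in schedule:
        for _ in range(episodes):
            prev, path, total = None, [], 0.0
            for layer in range(len(NODE_COST)):
                actions = q[(layer, prev)]
                if rng.random() < epsilon:
                    impl = rng.choice(list(actions))      # explore
                else:
                    impl = max(actions, key=actions.get)  # exploit
                cost = step_cost(layer, prev, impl)
                last = layer + 1 == len(NODE_COST)
                nxt = None if last else (layer + 1, impl)
                replay.append(((layer, prev), impl, -cost, nxt))
                path.append(impl)
                total += cost
                prev = impl
            del replay[:-replay_size]  # keep a bounded replay buffer
            for transition in rng.sample(replay, min(batch, len(replay))):
                q_update(q, transition, alpha, gamma)
            if total < best_time:
                best_time, best_path = total, path
    return best_time, best_path


best_time, best_path = search(epsilon_schedule(400))
```

In this toy table the cheapest path mixes precisions: two FP32 Winograd layers followed by an INT8 layer, even after paying the conversion penalty.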
C. Inference of Learnt Solutions (3/3)
A drop in accuracy may be caused by quantised or
fast-convolution methods. Thus, we perform the accuracy
measurements after the RL-based search has been performed and
benchmark the learnt solutions against a validation dataset.
We are only interested in the best-performing networks and
hence only benchmark those solutions that are up to 25%
slower than the fastest learnt solution. In this way, we speed up
the process and obtain Pareto optimal solutions with a strong
focus on latency optimisation, having accuracy and memory
as thresholds, e.g., accuracy drop <1% or memory reduction
>2x.
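The selection step of this phase can be sketched as a simple filter. The candidate names, latencies and accuracies below are hypothetical, and for brevity the sketch carries an accuracy for every candidate, whereas the text only measures accuracy for the shortlisted ones.

```python
def shortlist(solutions, slack=0.25):
    """Keep the solutions that are at most `slack` (25%) slower than the fastest."""
    fastest = min(s["latency_ms"] for s in solutions)
    return [s for s in solutions if s["latency_ms"] <= fastest * (1 + slack)]


def pick(solutions, ref_accuracy, max_drop=1.0):
    """Fastest solution whose accuracy drop stays under `max_drop` points, if any."""
    ok = [s for s in solutions if ref_accuracy - s["accuracy"] < max_drop]
    return min(ok, key=lambda s: s["latency_ms"]) if ok else None


# Hypothetical learnt solutions (latency in ms, top-1 accuracy in %).
candidates = [
    {"name": "A", "latency_ms": 10.0, "accuracy": 55.0},
    {"name": "B", "latency_ms": 11.5, "accuracy": 57.6},
    {"name": "C", "latency_ms": 14.0, "accuracy": 58.0},
]

chosen = pick(shortlist(candidates), ref_accuracy=58.0)
```

Here solution C is never validated because it is more than 25% slower than A, and B is chosen because A exceeds the accuracy threshold.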
VI. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we show the optimisation results for the
deployment of state-of-the-art DNNs on a range of Arm
Cortex-A CPUs. First, we introduce the set of networks and
platforms that we have used to validate our experiments.
Next, we show the importance of the different individual
software optimisations currently available in LPDNN (see
Section III-B). Finally, we present the results from applying
the automated design space exploration, introduced in Section
V, to optimise deployment of DNNs on Arm Cortex-A CPUs.
A. Experimental setup and platforms
The pre-trained networks that we have selected are
part of the Imagenet contest (image classification task), as it
represents a challenging dataset where optimisations can have
a significant effect on metrics such as latency, accuracy, and
memory. We evaluate several representative network
topologies for resource-constrained devices that allow us to show
the range of optimisations and trade-offs on the target
platforms: small networks such as Squeezenet and
MobilenetV3-small, slightly more complex networks like
MobilenetV2, and reasonably large networks such as
MobilenetV3-large and Resnet50. Although we have focused on an
image classification task to demonstrate our design, the
experiments could also be applied to any other deep learning task.
The range of Arm Cortex-A (Armv8 64-bit) CPU platforms
that we have chosen covers the current spectrum of deployment
on mobile devices. We divide the experiment into two parts:
i) For each of the techniques discussed in Section III-B, we
present benchmark results on the RaspberryPi 4 (Cortex-A72
at 1.5GHz) as a reference platform to show each optimisation.
ii) For the automated DSE discussed in Section V, we show
experiments on the RaspberryPi 4, the Nvidia Jetson Nano (CPU
only, Cortex-A57 at 1.43GHz) and the RaspberryPi 3b+
(Cortex-A53 at 1.4GHz) to validate the design and optimisations
on LITTLE and big cores:
• Cortex-A53 is a low-power, highly efficient core [67].
Power efficiency is achieved by executing instructions
in order [68]; hence, this core is highly sensitive to
instruction order and operand data availability.
• Cortex-A57 is a higher-performance, and consequently
higher-power, out-of-order core [69].
• Cortex-A72 is an update to the Cortex-A57 with
improved floating-point and memory performance [70].
All inferences are performed identically: using a single
thread and calculating the average of twenty inferences after an
initial (discarded) warm-up run. The boards were fitted with a
heatsink, and the CPU frequency governors were overridden
to achieve the maximum possible performance. To ensure that
the platform does not overheat and trigger thermal throttling,
we have monitored the platforms and sampled the OS registers
each second.
Fig. 5: Quantisation optimisation. Speedup of INT8 over
FP32 for Squeezenet-V1’s convolutions (the higher, the better).
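The measurement protocol of this subsection (one discarded warm-up run, then the mean of twenty timed runs) could be sketched in Python as below. The workload is a stand-in function, since the actual inference call depends on the framework being used.

```python
import time


def benchmark(run_inference, repeats=20):
    """Mean wall-clock time in seconds over `repeats` runs, after one warm-up run."""
    run_inference()  # warm-up run, timing discarded
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_inference()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


# Stand-in workload (hypothetical); a real run would invoke the network here.
calls = []


def fake_inference():
    calls.append(sum(i * i for i in range(1000)))


mean_latency = benchmark(fake_inference)
```

Single-thread pinning, governor settings and temperature monitoring are platform-specific and are left out of this sketch.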
B. Optimisation results
In this section, we demonstrate the improvement in perfor-
mance due to the software optimisations explained in Section
III-B.
1) Network Optimisation: We show three different optimi-
sations at this level:
Quantisation: We compare the performance of DNN layers
when they are deployed employing INT8 arithmetic instead of
the baseline FP32 operations. Although INT8 variables can
be packed into 32-bit operations and, therefore, achieve a
theoretical 4x arithmetic-intensity improvement, the real uplift
in performance may be smaller. For instance, Fig. 5 shows
the INT8 performance uplift for Squeezenet’s convolutions, where
the highest layer improvement goes no higher than 1.7x and
the overall network speedup (including all layers) is 1.24x.
These results are roughly in line with [71], where speedups
of 1.13x and 1.2x are obtained by using INT8 on a small and
big core of the Snapdragon835, respectively. The relatively low
improvement with respect to FP32 is due to several factors:
i) the need to perform several instructions to multiply and
accumulate four 8-bit integer terms into a single 32-bit integer
result², ii) the need to requantise the activations back to INT8
(scale and offset) after each layer and, iii) the existence of very
performant FP32 primitives thanks to intense optimisation
over the years.
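To make factors i) and ii) concrete, the following sketch performs an INT8 dot product with a wide accumulator and then requantises the result. It assumes a simple affine scheme (scale and zero point) chosen for illustration; the exact scheme used by the acceleration libraries is not specified here.

```python
def quantise(values, scale, zero_point=0):
    """Affine quantisation to INT8: q = round(x / scale) + zero_point, clamped."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]


def int8_dot(qa, qb, zp_a=0, zp_b=0):
    """Multiply INT8 terms and accumulate into a wider (32-bit style) integer."""
    return sum((a - zp_a) * (b - zp_b) for a, b in zip(qa, qb))


def requantise(acc, scale_a, scale_b, out_scale, out_zp=0):
    """Rescale the accumulator back to INT8 (scale and offset) for the next layer."""
    real = acc * scale_a * scale_b
    return max(-128, min(127, round(real / out_scale) + out_zp))


SCALE = 1 / 64  # hypothetical scale shared by inputs, weights and outputs
activations = [0.5, -1.0, 0.25, 1.0]
weights = [1.0, 0.5, -0.5, 0.25]

q_act = quantise(activations, SCALE)
q_w = quantise(weights, SCALE)
acc = int8_dot(q_act, q_w)                     # integer accumulator
q_out = requantise(acc, SCALE, SCALE, SCALE)   # INT8 output
```

The extra work sits in `requantise`, which runs after every layer and has no counterpart in an FP32 pipeline.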
Layer fusion: Fig. 6 shows the performance improvement
obtained by static and runtime layer fusions. Fig. 6a depicts
the static fusion of Bnorm and scale layers into previous
convolutions for Mobilenet-V2, obtaining a 25% improvement
in latency. Such a significant improvement is mostly due to the
lack of optimisation of both fused layers, as this graph
reduction is a common approach in many inference frameworks.
Fig. 6b exhibits the fusion of activation and concatenation
layers into preceding convolutions for Squeezenet, achieving
an improvement of 15% in latency due to the reduction of
memory accesses, for only a modest increase (< 1%) in the time
spent performing convolution.
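The static fusion of Bnorm and scale layers can be illustrated with the usual per-channel folding of the normalisation statistics into the convolution weights and bias. The sketch below uses a 1x1 convolution at a single pixel and hypothetical values; it shows the general technique, not LPDNN's code.

```python
import math


def conv1x1(x, weights, bias):
    """1x1 convolution at a single pixel: one dot product per output channel."""
    return [sum(w * xi for w, xi in zip(w_c, x)) + b for w_c, b in zip(weights, bias)]


def bn_scale(y, mean, var, gamma, beta, eps=1e-5):
    """Batch normalisation followed by a scale/shift layer, per channel."""
    return [(yc - m) / math.sqrt(v + eps) * g + bt
            for yc, m, v, g, bt in zip(y, mean, var, gamma, beta)]


def fold_bn_scale(weights, bias, mean, var, gamma, beta, eps=1e-5):
    """Fold the per-channel statistics and scale/shift into weights and bias."""
    fused_w, fused_b = [], []
    for w_c, b_c, m, v, g, bt in zip(weights, bias, mean, var, gamma, beta):
        k = g / math.sqrt(v + eps)
        fused_w.append([w * k for w in w_c])
        fused_b.append((b_c - m) * k + bt)
    return fused_w, fused_b


# Hypothetical 2-input, 2-output layer.
weights, bias = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]
mean, var = [0.3, -0.1], [4.0, 0.25]
gamma, beta = [1.5, 0.8], [0.05, -0.3]
x = [1.0, 2.0]

unfused = bn_scale(conv1x1(x, weights, bias), mean, var, gamma, beta)
fused = conv1x1(x, *fold_bn_scale(weights, bias, mean, var, gamma, beta))
```

Because the folding happens once, ahead of time, the two extra layers disappear from the runtime graph entirely.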
² Newer versions of the Arm ISA include the DOT and MMLA
instructions to mitigate this [72], [73].
Fig. 6: Layer fusion optimisation. Portion of overall execution
time (0 to 100%) for the unfused and fused networks, normalised
to non-fused performance (the lower, the better).
(a) Static fusion: Bnorm and Scale layers into previous
convolutions for Mobilenet-V2. Categories: fuseable scale,
fuseable Bnorm, other, DWise convolution, convolution.
(b) Runtime fusion: Activation and Concat layers into previous
convolutions for Squeezenet-V1. Categories: fuseable concat.,
fuseable activation, other, convolution.
                      Squeezenet-V1   Mobilenet-V2   Resnet50
                      (48MiB)         (96MiB)        (425MiB)
Weights / %           9.7             14.1           23.1
Activations / %       60.6            70.2           23.1
Other / %             29.7            15.7           53.8
Act. (Original) / %   100.0           100.0          100.0
Act. (In-place) / %   70.4            62.2           83.9
Act. (Mem pool) / %   20.0            11.9           13.0
Table II: Memory optimisation. Upper part: total memory
allocated for the weights, activations and other (code, struct,
etc.). Lower part: memory allocated for the activations
(normalised to activation (original)).
Memory optimisation: Table II displays the total dynamic
memory allocated over the execution of various DNNs in
terms of weights, activations, and other (structures, buffers,
or code). We can observe that, while the activations account
for a small portion in Resnet50, they consume most of the
memory allocated in Squeezenet and Mobilenet-V2. Thus,
we show the reduction in memory achieved in the allocation
of the activations by the in-place and memory pool techniques.
The in-place technique achieves a noticeable reduction,
especially on the ReLU layers, which varies from 16.1%
(Resnet50) to 37.2% (Mobilenet-V2). Further, the memory pool
technique accomplishes a remarkable memory reduction that
goes from 80% (Squeezenet) to 88.1% (Mobilenet-V2) thanks
to the reuse of memory across layers. Further memory
optimisations can be achieved by applying quantisation, which
would yield a 4x reduction in the memory allocated for the
weights and the remaining activations.
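The effect of the in-place and memory pool techniques on activation memory can be approximated with a small model. The sketch assumes a purely sequential network (no branches or skip connections) with hypothetical buffer sizes, where a pool only needs to hold the input and output of the layer currently executing.

```python
# Hypothetical linear chain: (layer name, output size in KiB, can run in place).
layers = [
    ("conv1", 400, False),
    ("relu1", 400, True),
    ("pool1", 100, False),
    ("conv2", 200, False),
    ("relu2", 200, True),
]
INPUT_KIB = 150


def naive_total(layers, input_size):
    """Every layer allocates its own output buffer."""
    return input_size + sum(size for _, size, _ in layers)


def in_place_total(layers, input_size):
    """In-place layers (e.g. ReLU) overwrite their input instead of allocating."""
    return input_size + sum(size for _, size, in_place in layers if not in_place)


def pool_peak(layers, input_size):
    """Peak pool size when buffers are reused as soon as a layer has consumed them."""
    peak, live_input = 0, input_size
    for _, size, in_place in layers:
        needed = live_input if in_place else live_input + size
        peak = max(peak, needed)
        live_input = size
    return peak


naive = naive_total(layers, INPUT_KIB)
in_place = in_place_total(layers, INPUT_KIB)
pooled = pool_peak(layers, INPUT_KIB)
```

Real networks with branches need liveness analysis over the whole graph, so the pool peak is in general larger than this chain model suggests.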
2) Algorithm Optimisation: We present the comparison
between the GEMM and Winograd algorithms, employing the
LPDNN-Arm library, as these two represent the most common
algorithms for convolutions in CPU deployment. Fig. 7 illustrates
the speedup achieved by Winograd-FP32 over GEMM-FP32
for Squeezenet’s 3x3 convolutions (Winograd is not available for
k=1x1 in LPDNN). We can observe that Winograd clearly
outperforms GEMM-FP32 in all convolutions, accomplishing
an uplift of up to 2.5x. Likewise, Winograd achieves a speedup
of up to 3.9x for Resnet50, especially in the first convolutions,
which are the most computing-intensive (Fig. 14 in Appendix).
Further, we have included GEMM-INT8 performance in
both figures to demonstrate that early design decisions may
lead to sub-optimal solutions, e.g., selecting the use of
quantisation at network level without considering the underlying
algorithm level. In this case, selecting uniform quantisation,
e.g., INT8 data type for the whole network, discards the Winograd
algorithm as, to date, it is only available in FP32. Thus, a
homogeneous INT8 network would underperform against a
mixed-precision network that includes Winograd.
Fig. 7: Algorithm optimisation. Speedup of Winograd over
GEMM-FP32 for Squeezenet-V1 (the higher, the better).
Fig. 8: Primitive optimisation. Layout performance
comparison between NCHW and NHWC for Squeezenet’s most
computational-intensive layers (the lower, the better).
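The source of Winograd's advantage is a reduction in multiplications. As a minimal one-dimensional illustration (the 3x3 convolutions discussed above use the two-dimensional analogue), the F(2,3) variant produces two outputs of a 3-tap filter with four multiplications instead of six. The input values below are arbitrary.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter computed directly: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """The same two outputs with the Winograd F(2,3) scheme: 4 multiplications."""
    # Filter transform; it can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Element-wise products on the transformed input.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


data = [1.0, 2.0, 3.0, 4.0]
taps = [0.5, 1.0, 0.25]
reference = direct_f23(data, taps)
fast = winograd_f23(data, taps)
```

The transforms involve divisions and extra additions in floating point, which is consistent with the text's remark that Winograd is only available in FP32 in the library.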
3) Primitive Optimisation: We show the influence of data
layout for vectorised methods employing the GEMM
algorithm. Fig. 8 presents the most computational-intensive
convolutions of Squeezenet (kernel=3x3). We can observe that
NCHW performs slightly better in the first layers, which might
be due to a lower number of channels in this stage of the
network and, therefore, lower reuse of data under the NHWC
layout. Nonetheless, NHWC broadly outperforms NCHW on
3x3 kernels, and also on 1x1 kernels, which, having a lower
degree of data reuse across the plane of the convolution,
perform notably better under the NHWC layout. Overall,
NHWC achieves a reduction in latency of 8% throughout
the network.
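The difference between the two layouts comes down to which dimension is contiguous in memory. The following sketch gives the flat offset of element (n, c, h, w) under each layout, using the standard row-major definitions; the tensor dimensions are arbitrary.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index in NCHW order: width is the fastest-varying dimension."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat index in NHWC order: channel is the fastest-varying dimension."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 2, 2
# Stride between two consecutive channels of the same pixel.
channel_stride_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
channel_stride_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)
```

Under NHWC all channels of a pixel sit next to each other, so a vectorised kernel that reduces over channels reads contiguous data; with very few channels that contiguous run is short, which may explain the behaviour of the first layers.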
Another substantial improvement is the optimisation of the
first convolution, which is typically one of the most
time-consuming layers. Input data is generally organised in
NCHW order while, by contrast, many performant convolution
primitives prefer an NHWC layout (as we saw previously).
However, converting from NCHW to NHWC (particularly
where only three channels of data are concerned) to match
the preferred input layout of the first convolution is relatively
costly and hence undesirable. We can avoid this by taking
input data in NCHW and producing NHWC directly, avoiding
additional rearrangement. This is achieved simply by providing
an im2row routine specialised for NCHW-order data and a
low channel count.
Fig. 9: Primitive optimisation. Layout optimisation for the
1st convolution by avoiding layout rearrangements thanks to
an im2row specialised routine (the lower, the better).
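A simplified version of such a routine is sketched below: it reads a flat NCHW input and emits one row per output pixel with the channel as the innermost dimension, so the layout change is absorbed by the gather itself. Stride 1, no padding and a batch of one are assumptions made to keep the sketch short; it is not the library's routine.

```python
def im2row_nchw(x, C, H, W, k):
    """Build im2row rows from a flat NCHW input (batch of 1, stride 1, no padding).

    Each row holds the k x k patch of one output pixel, ordered (kh, kw, c),
    i.e. channel-innermost as an NHWC kernel expects.
    """
    rows = []
    for oh in range(H - k + 1):
        for ow in range(W - k + 1):
            row = []
            for kh in range(k):
                for kw in range(k):
                    for c in range(C):
                        row.append(x[(c * H + oh + kh) * W + ow + kw])
            rows.append(row)
    return rows


# Two 2x2 channels stored in NCHW order.
nchw_input = [1, 2, 3, 4, 10, 20, 30, 40]
rows_1x1 = im2row_nchw(nchw_input, C=2, H=2, W=2, k=1)
```

Multiplying these rows by a weight matrix of shape (k*k*C, output channels) yields the output directly in NHWC order, with no separate transposition pass.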
C. Automated Design Space Exploration (DSE)
We aim to show the trade-offs between latency, accuracy,
and memory footprint when all the different levels of software
optimisation shown in Section VI-B are available and may be
applied, e.g., employing either quantised or fast-convolution
methods may have a substantial impact on all three metrics. We
illustrate the DSE and the optimisations achieved by QS-DNN
by providing two Pareto fronts: latency-accuracy and
latency-memory, where we present the achievable performance
based on different degrees of accuracy and memory.
To show the optimisation improvements, we display the
following points of interest on the graphs:
• Ref-FP32: We take LPDNN coupled with the Openblas
library as the reference implementation since it is a
well-known and standard library for industrial deployment.
Ref-FP32 employs GEMM-based methods under the NCHW
layout for convolutions and fully connected layers and
direct methods for any other layer.
• Opt-FP32: Solution found by QS-DNN when only FP32
implementations are allowed, i.e., no quantised methods.
• INT8: Fully INT8 deployment implementation.
• Pareto front: Set of points that are not dominated by any
other implementation based on the two given objectives.
For the sake of fairness, static layer fusion (graph reduc-
tion) and memory pool optimisations are always ON for all
implementations.
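The Pareto front used in the following graphs can be computed with a straightforward dominance check. The sketch below handles the latency-accuracy case, where lower latency and higher accuracy are better; the points are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (latency, accuracy) points, sorted by latency.

    A point is dominated if another point is at least as fast and at least
    as accurate, and is not the same point.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] >= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (latency in ms, top-1 accuracy in %) measurements.
points = [(10, 55.0), (12, 57.5), (15, 57.0), (20, 58.0), (9, 50.0)]
front = pareto_front(points)
```

The point (15, 57.0) is dropped because (12, 57.5) is both faster and more accurate. For the latency-memory front the second comparison would be flipped, since lower memory is better.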
1) Latency-Accuracy: Fig. 10 illustrates the most inter-
esting points found by QS-DNN while performing a DSE
for the range of networks on the Jetson Nano. Dotted lines
represent the loss in accuracy (1%, 2.25%, 5% and 10%) with
respect to the Ref-FP32 implementation while braces expose
the performance gains.
Generally, we can observe that the DSE provided by
QS-DNN for Opt-FP32 clearly outperforms Ref-FP32, from 1.7x
(Resnet50) to 3.87x (MobilenetV3-small), with no loss in
accuracy. This significant improvement is mainly
due to the selection of Winograd convolutions for Squeezenet
and Resnet50 and of optimised depthwise convolutions for
Mobilenets. Besides, techniques such as runtime layer fusion
and the optimisation of layouts through the network have a
meaningful impact on performance.
Fig. 10: Latency vs Accuracy trade-off. Automated DSE on
Jetson Nano (Cortex-A57) showing the most interesting points.
Solid line represents the Pareto front. X-axis is log2 based.
Fig. 11: Latency-Memory trade-off. Automated DSE on
Jetson Nano (Cortex-A57) showing the most interesting points.
Solid line represents the Pareto front. X- and Y-axes are log2
based.
By allowing QS-DNN to select quantised methods, further
improvement can be achieved with only a small drop in
accuracy: up to 1.05x and 1.2x increase in performance with
under 1% and 3% drop in accuracy for Squeezenet and
Resnet50, respectively. Mobilenets, on the other hand, are
significantly more sensitive to quantisation and their accuracy
drops drastically when quantised methods are employed. This
sensitivity to quantisation is largely due to depthwise
convolutions, as each channel represents an independent kernel
and may have a very different range. As our acceleration libraries
only support layer quantisation, i.e., one range or scale per
layer, the use of quantised depthwise layers clearly hurts the
accuracy (see the fully INT8 solution in Fig. 10). Thanks to the
DSE given by QS-DNN, we can find mixed-precision solutions
that bring up to 1.10x and 1.13x increase in performance
with a modest drop in accuracy of around 3% and 8% for
MobilenetV3-small and -large, respectively.
Fig. 12: Latency-Accuracy trade-off. Pareto front comparison
between the RPI4, Jetson Nano and Hikey960. X-axis is
log2 based.
Fig. 13: Latency-Memory trade-off. Pareto front comparison
between the RPI4, Jetson Nano and Hikey960. X- and Y-axes
are log2 based.
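The sensitivity of depthwise layers to per-layer quantisation can be reproduced with a toy example. The sketch below quantises two hypothetical depthwise kernels with very different ranges, once with a single scale for the layer and once with one scale per channel; the per-channel variant is shown only for comparison and is not something the text says the libraries support.

```python
def quantise_dequantise(values, scale):
    """Symmetric INT8 round trip: quantise with `scale`, then map back to real."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def max_abs_error(values, scale):
    """Largest absolute error introduced by the round trip."""
    restored = quantise_dequantise(values, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Hypothetical depthwise weights: one small kernel per channel.
channels = {
    "ch0": [12.0, -8.0, 10.0],
    "ch1": [0.04, -0.03, 0.02],
}

# Per-layer: a single scale, dictated by the largest magnitude in the layer.
layer_scale = max(abs(v) for ws in channels.values() for v in ws) / 127
# Per-channel: one scale per kernel.
channel_scale = {c: max(abs(v) for v in ws) / 127 for c, ws in channels.items()}

ch1_per_layer = quantise_dequantise(channels["ch1"], layer_scale)
```

With the layer-wide scale, every weight of the small-range channel rounds to zero, so that channel's kernel is effectively erased.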
If we compare the Pareto fronts of the networks on the
different platforms, we can see in Fig. 12 that there is not a
significant difference in latency or accuracy between the RPI4
and the Jetson Nano while the RPI3, containing a “LITTLE”
Arm core, performs around 2x slower. Interestingly, no fully
INT8 solution forms part of the Pareto front on the RPI4
and Jetson Nano platforms. This indicates that fully quantised
networks may not always be optimal in terms of latency due
to a lack of support for certain primitives and the need for
requantisation after each layer.
2) Latency-Memory: As we described earlier, the mem-
ory pool optimisation for the activations is always ON and
achieves over 80% reduction in memory allocation. Hence, we
can only achieve further reductions of memory by reducing
the size of the weights, e.g., through quantisation. Fig. 11
shows the most interesting points found by QS-DNN while
performing a DSE for the range of networks on the Jetson
Nano. Dotted lines represent the memory consumption for the
weights (75%, 50%, and 25%) with respect to the Ref-FP32
implementation while braces expose the performance gains.
Building from the Latency-Accuracy graph, we see that the
Latency-Memory Pareto fronts are convex rather than linear.
While FP32 layers are preferred in the Latency-Accuracy DSE
due to their higher precision, in this case quantised methods
are favoured due to their lower memory footprint, with the
fully INT8 solution being the lowest point of the Pareto front
(25%) for all networks. From the lowest point, the Pareto front
rises and turns towards lower latency as some better-performing
FP32 algorithms are picked, e.g., Winograd, which increases the
memory up to 0.58% and 0.65% for Squeezenet and Resnet50,
respectively.
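The 25% floor of the front follows directly from the storage size of the data types. As a simplified model, the sketch below computes the weight memory of a mixed-precision network relative to an all-FP32 one, counting 4 bytes per FP32 weight and 1 byte per INT8 weight. The layer sizes are hypothetical, and any extra storage that a specific algorithm may need for transformed weights is not modelled.

```python
BYTES_PER_WEIGHT = {"fp32": 4, "int8": 1}


def weight_memory_fraction(layers):
    """Weight memory of a mixed-precision network relative to all-FP32 storage.

    `layers` is a list of (number of weights, data type) pairs.
    """
    used = sum(n * BYTES_PER_WEIGHT[dtype] for n, dtype in layers)
    reference = sum(n * BYTES_PER_WEIGHT["fp32"] for n, _ in layers)
    return used / reference


fully_int8 = weight_memory_fraction([(1000, "int8"), (3000, "int8")])
mixed = weight_memory_fraction([(1000, "fp32"), (3000, "int8")])
```

Moving even one layer back to FP32 lifts the network off the 25% floor, which is why the front rises as faster FP32 algorithms are selected.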
Distinctly, Mobilenets’ Pareto fronts rise less and tend to
remain close to 25% as they do not contain any Winograd
implementation and quantised methods are more efficient.
Overall, if we combine the memory optimisations of
both the weights and the activations, the total memory reduction
can go up to 1.9x and 2.3x for small networks like Squeezenet
and MobilenetV3-small and up to 1.6x and 2.5x for larger
networks such as Resnet50 and MobilenetV3-large.
From the comparison between the Pareto fronts on the
different platforms (Fig. 13), we can infer that the RPI4
performs slightly better than the Jetson Nano. This may
indicate, given the similar FP32 performance, that quantised
methods perform better on the A72 core than on its A57
counterpart. Interestingly, the Pareto front of Mobilenets on
the RPI3 consists of one point, the INT8 solution. This can be
explained by looking at the Latency-Accuracy Pareto front of
Mobilenets, where INT8 solutions are the fastest ones (despite
the poor accuracy), which denotes the efficiency of quantised
methods for this network topology on the A53 core.
3) Discussion: Thanks to the automatic DSE provided
by QS-DNN, we can quickly analyse several DNNs on a
range of embedded platforms and find a suitable solution for
a given problem. Thus, analysing the previous experiments,
we can observe that the recent MobilenetV3s are far more
efficient than Squeezenet and Resnet50, e.g.,
MobilenetV3-small performs 3x faster than Squeezenet (on the
Arm Cortex-A57) and is over 8% more accurate while having the
same memory footprint. Likewise, we could select the most
suitable platform for a given problem based on latency, memory
or energy constraints.
VII. CONCLUSION AND FUTURE WORK
We have analysed the methods to improve the deployment
of DNNs across the different levels of software optimisation
and introduced the range of techniques provided by LPDNN
to optimise DNN inference. Building on this knowledge, we
have shown that single optimisation methods may lead to
sub-optimal deployment for end-to-end solutions in terms of
latency, accuracy, and memory footprint.
Therefore, we have introduced an automated exploration
framework that relies on a Reinforcement-Learning-based
search which, combined with LPDNN, enables the deployment
of DNN implementations to obtain empirical measurements.
Thus, we are able to learn an optimised solution for a given
task by automatically exploring the deployment design options
on the target embedded device. To validate the design, we
have presented a set of results for state-of-the-art DNNs on
a range of Arm Cortex-A CPU platforms, achieving up to
4x improvement in performance and over 2x reduction in
memory with negligible loss in accuracy with respect to the
BLAS floating-point implementation.
We aim to extend this work to micro-controllers, where the
resources are very limited and careful design is required.
Further, we envision extending this work to runtime
adaptation of the AI solution by having an online search that
continuously improves the latency and memory consumption
based on the environment state.
VIII. ACKNOWLEDGEMENT
This project has received funding from the European
Union’s Horizon 2020 research and innovation programme
under grant agreement No. 732204 (Bonseyes). This work is
supported by the Swiss State Secretariat for Education, Re-
search and Innovation (SERI) under contract number 16.0159.
The opinions expressed and arguments employed herein do not
necessarily reflect the official views of these funding bodies.
REFERENCES
[1] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based
face detection,” IEEE Transactions on pattern analysis and machine
intelligence, vol. 20, no. 1, pp. 23–38, 1998.
[2] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with
deep recurrent neural networks,” in 2013 IEEE international conference
on acoustics, speech and signal processing. IEEE, 2013, pp. 6645–
6649.
[3] “Google ai,” 2020. [Online]. Available: https://ai.google/
[4] “Tesla.” [Online]. Available:
https://www.forbes.com/sites/bernardmarr/2018/01/08/the-amazing-ways-tesla-is-using-artificial-intelligence-and-big-data
[5] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521,
no. 7553, pp. 436–444, 2015.
[6] “Imagenet.” [Online]. Available: http://www.image-net.org.
[7] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convo-
lutional neural networks for mobile vision applications,” arXiv preprint
arXiv:1704.04861, 2017.
[8] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally,
and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer
parameters and <0.5 mb model size,” arXiv preprint arXiv:1602.07360,
2016.
[9] T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze,
and H. Adam, “Netadapt: Platform-aware neural network adaptation for
mobile applications,” in Proceedings of the European Conference on
Computer Vision (ECCV), 2018, pp. 285–300.
[10] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda,
Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet design
via differentiable neural architecture search,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2019, pp.
10734–10742.
[11] E. Wang, J. J. Davis, R. Zhao, H.-C. Ng, X. Niu, W. Luk, P. Y.
Cheung, and G. A. Constantinides, “Deep neural network approximation
for custom hardware: Where we’ve been, where we’re going,” ACM
Computing Surveys (CSUR), vol. 52, no. 2, pp. 1–39, 2019.
[12] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for
model compression and acceleration on mobile devices,” in Proceedings
of the European Conference on Computer Vision (ECCV), 2018, pp.
784–800.
[13] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing
deep neural networks with pruning, trained quantization and huffman
coding,” arXiv preprint arXiv:1510.00149, 2015.
[14] P. Stock, A. Joulin, R. Gribonval, B. Graham, and H. Jégou, “And the
bit goes down: Revisiting the quantization of neural networks,” arXiv
preprint arXiv:1907.05686, 2019.
[15] A. Anderson, A. Vasudevan, C. Keane, and D. Gregg, “Low-memory
gemm-based convolution algorithms for deep neural networks,” arXiv
preprint arXiv:1709.03395, 2017.
[16] P. Maji, A. Mundy, G. Dasika, J. Beu, M. Mattina, and R. Mullins,
“Efficient winograd or cook-toom convolution kernel implementation
on widely used mobile cpus,” arXiv preprint arXiv:1903.01521, 2019.
[17] K.-S. Oh and K. Jung, “Gpu implementation of neural networks,” Pattern
Recognition, vol. 37, no. 6, pp. 1311–1314, 2004.
[18] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
fpga-based accelerator design for deep convolutional neural networks,”
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, 2015, pp. 161–170.
[19] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “Yodann: An ultra-
low power convolutional neural network accelerator based on binary
weights,” in 2016 IEEE Computer Society Annual Symposium on VLSI
(ISVLSI). IEEE, 2016, pp. 236–241.
[20] M. de Prado, J. Su, R. Dahyot, R. Saeed, L. Keller, and N. Vallez, “Ai
pipeline-bringing ai to you. end-to-end integration of data, algorithms
and deployment tools,” arXiv preprint arXiv:1901.05049, 2019.
[21] A. Anderson and D. Gregg, “Optimal dnn primitive selection with
partitioned boolean quadratic programming,” in Proceedings of the 2018
International Symposium on Code Generation and Optimization, 2018,
pp. 340–351.
[22] M. de Prado, M. Denna, L. Benini, and N. Pazos, “Quenn: Quantization
engine for low-power neural networks,” in Proceedings of the 15th ACM
International Conference on Computing Frontiers, 2018, pp. 36–44.
[23] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan,
K. Hazelwood, E. Isaac, Y. Jia, B. Jia et al., “Machine learning at
facebook: Understanding inference at the edge,” in 2019 IEEE Interna-
tional Symposium on High Performance Computer Architecture (HPCA).
IEEE, 2019, pp. 331–344.
[24] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings
of the IEEE conference on computer vision and pattern recognition,
2018, pp. 4510–4520.
[25] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely effi-
cient convolutional neural network for mobile devices,” in Proceedings
of the IEEE conference on computer vision and pattern recognition,
2018, pp. 6848–6856.
[26] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard,
and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for
mobile,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2019, pp. 2820–2828.
[27] T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multi-objective
neural architecture search via lamarckian evolution,” arXiv preprint
arXiv:1804.09081, 2018.
[28] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural
network,” arXiv preprint arXiv:1503.02531, 2015.
[29] N. Frosst and G. Hinton, “Distilling a neural network into a soft decision
tree,” arXiv preprint arXiv:1711.09784, 2017.
[30] E. J. Crowley, G. Gray, and A. J. Storkey, “Moonshine: Distilling with
cheap convolutions,” in Advances in Neural Information Processing
Systems, 2018, pp. 2888–2898.
[31] A. Elafrou, V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and
N. Koziris, “Sparsex: A library for high-performance sparse matrix-
vector multiplication on multicore platforms,” ACM Transactions on
Mathematical Software (TOMS), vol. 44, no. 3, pp. 1–32, 2018.
[32] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong
Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra
et al., “Can fpgas beat gpus in accelerating next-generation deep
neural networks?” in Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, 2017, pp. 5–14.
[33] I. Fedorov, R. P. Adams, M. Mattina, and P. Whatmough, “Sparse: Sparse
architecture search for cnns on resource-constrained microcontrollers,”
in Advances in Neural Information Processing Systems, 2019, pp. 4978–
4990.
[34] E. J. Crowley, J. Turner, A. Storkey, and M. O’Boyle, “Pruning neural
networks: is it time to nip it in the bud?” 2018.
[35] S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep
convolutional neural networks,” ACM Journal on Emerging Technologies
in Computing Systems (JETC), vol. 13, no. 3, pp. 1–18, 2017.
[36] P. Gysel, “Ristretto: Hardware-oriented approximation of convolutional
neural networks,” arXiv preprint arXiv:1605.06402, 2016.
[37] D. Lin, S. Talathi, and S. Annapureddy, “Fixed point quantization of
deep convolutional networks,” in International Conference on Machine
Learning, 2016, pp. 2849–2858.
[38] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net:
Imagenet classification using binary convolutional neural networks,” in
European Conference on Computer Vision. Springer, 2016, pp. 525–
542.
[39] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan,
and K. Gopalakrishnan, “Pact: Parameterized clipping activation for
quantized neural networks,” arXiv preprint arXiv:1805.06085, 2018.
[40] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, “Haq: Hardware-aware
automated quantization with mixed precision,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2019, pp.
8612–8620.
[41] Z. Dong, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer, “Hawq:
Hessian aware quantization of neural networks with mixed-precision,” in
Proceedings of the IEEE International Conference on Computer Vision,
2019, pp. 293–302.
[42] B. Wu, Y. Wang, P. Zhang, Y. Tian, P. Vajda, and K. Keutzer, “Mixed
precision quantization of convnets via differentiable neural architecture
search,” arXiv preprint arXiv:1812.00090, 2018.
[43] R. Krishnamoorthi, “Quantizing deep convolutional networks for effi-
cient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
[44] R. Zhao, Y. Hu, J. Dotzel, C. De Sa, and Z. Zhang, “Improving neural
network quantization without retraining using outlier channel splitting,”
in International Conference on Machine Learning, 2019, pp. 7543–7552.
[45] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling, “Data-
free quantization through weight equalization and bias correction,” in
Proceedings of the IEEE International Conference on Computer Vision,
2019, pp. 1325–1334.
[46] “Caffe-int8-convert-tools,” 2020. [Online]. Available: https://github.
com/BUG1989/caffe-int8-convert-tools
[47] R. Banner, Y. Nahshan, and D. Soudry, “Post training 4-bit quantization
of convolutional networks for rapid-deployment,” in Advances in Neural
Information Processing Systems, 2019, pp. 7948–7956.
[48] M. Nagel, R. A. Amjad, M. van Baalen, C. Louizos, and T. Blankevoort,
“Up or down? adaptive rounding for post-training quantization,” arXiv
preprint arXiv:2004.10568, 2020.
[49] M. Dukhan, “The indirect convolution algorithm,” 2019.
[50] “Fft vs direct convolution,” 2019. [Online]. Available:
https://ccrma.stanford.edu/~jos/ReviewFourier/FFT_Convolution_vs_Direct.html
[51] J. Zhang, F. Franchetti, and T. M. Low, “High performance zero-memory
overhead direct convolutions,” arXiv preprint arXiv:1809.10170, 2018.
[52] “OpenBLAS,” 2020. [Online]. Available: https://www.openblas.net/
[53] F. G. Van Zee and R. A. Van De Geijn, “BLIS: A framework for rapidly
instantiating BLAS functionality,” ACM Transactions on Mathematical
Software (TOMS), vol. 41, no. 3, pp. 1–33, 2015.
[54] A. Lavin and S. Gray, “Fast algorithms for convolutional neural
networks,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 4013–4021.
[55] M. de Prado, N. Pazos, and L. Benini, “Learning to infer: RL-based
search for DNN primitive selection on heterogeneous embedded systems,”
in 2019 Design, Automation & Test in Europe Conference & Exhibition
(DATE). IEEE, 2019, pp. 1409–1414.
[56] S. Rovder, J. Cano, and M. O’Boyle, “Optimising convolutional neural
networks inference on low-powered GPUs,” in 12th International
Workshop on Programmability and Architectures for Heterogeneous
Multicores (MULTIPROG), 2019.
[57] T. Llewellynn, M. M. Fernández-Carrobles, O. Deniz, S. Fricker,
A. Storkey, N. Pazos, G. Velikic, K. Leufgen, R. Dahyot, S. Koller
et al., “Bonseyes: platform for open development of systems of artificial
intelligence,” in Proceedings of the Computing Frontiers Conference,
2017, pp. 299–304.
[58] “ArmCL,” 2020. [Online]. Available:
https://developer.arm.com/technologies/compute-library
[59] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro,
and E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv
preprint arXiv:1410.0759, 2014.
[60] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint
arXiv:1701.07274, 2017.
[61] R. S. Sutton, A. G. Barto et al., Introduction to reinforcement learning.
MIT press Cambridge, 1998, vol. 135.
[62] C. J. C. H. Watkins, “Learning from delayed rewards,” 1989.
[63] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural
network architectures using reinforcement learning,” arXiv preprint
arXiv:1611.02167, 2016.
[64] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski
et al., “Human-level control through deep reinforcement learning,”
Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[65] L.-J. Lin, “Self-improving reactive agents based on reinforcement
learning, planning and teaching,” Machine Learning, vol. 8, no. 3-4,
pp. 293–321, 1992.
[66] E. Wiewiora, Reward Shaping. Boston, MA: Springer US, 2010,
pp. 863–865. [Online]. Available:
https://doi.org/10.1007/978-0-387-30164-8_731
[67] Arm Ltd. Cortex-A53. [Online]. Available:
https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a53
[68] A. Lal Shimpi. (2012, Oct.) ARM’s Cortex A57 and A53. [Online]. Available:
https://www.anandtech.com/show/6420/arms-cortex-a57-and-cortex-a53-the-first-64bit-armv8-cpu-cores
[69] Arm Ltd. Cortex-A57. [Online]. Available:
https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a57
[70] A. Frumusanu. (2015, Apr.) ARM Reveals Cortex-A72 Architecture
Details. [Online]. Available:
https://www.anandtech.com/show/9184/arm-reveals-cortex-a72-architecture-details
[71] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam,
and D. Kalenichenko, “Quantization and training of neural networks
for efficient integer-arithmetic-only inference,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018,
pp. 2704–2713.
[72] J. Andrews. (2017, Dec.) Exploring the Arm dot product
instructions. [Online]. Available:
https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/exploring-the-arm-dot-product-instructions
[73] N. Stephens. (2019, Sep.) Developments in the Arm A-Profile
Architecture: Armv8.6-A. [Online]. Available:
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a
APPENDIX
Fig. 14: Algorithm optimisation. Speedup of Winograd over
GEMM-FP32 for Resnet50’ (the higher, the better).
