A Simple Method to Reduce Off-chip Memory Accesses on Convolutional
  Neural Networks by Kim, Doyun et al.
Anonymous Authors1
Abstract
For convolutional neural networks, we propose a simple algorithm
that reduces off-chip memory accesses by maximally utilizing the
on-chip memory of a neural processing unit. In particular, the
algorithm provides an effective way to process a module that
consists of multiple branches and a merge layer. For Inception-V3
on Samsung's NPU in Exynos, our evaluation shows that the proposed
algorithm reduces the number of off-chip memory accesses for
feature-maps to 1/50, and accordingly achieves a 97.59% reduction
in the amount of feature-map data transferred from/to off-chip
memory.
1. Introduction
Recent achievements in image processing tasks such as im-
age recognition, object detection, and scene segmentation
have been coupled with the application of deep convolu-
tional networks (Szegedy et al., 2015; Ren et al., 2015;
Long et al., 2015). As the need for more complex networks
increases, we face several implementation issues, namely real-time
processing, limited power budget, and memory bandwidth. To resolve
these issues, various approaches have been investigated for both
cloud and mobile applications: low-precision (Courbariaux et al.,
2014; 2015; Hubara et al., 2016; Gupta et al., 2015; Gysel et al.,
2016; Judd et al., 2015; Lin et al., 2016; Kim et al., 2018),
network compression (Han et al., 2015; 2016b), and small network
design (Iandola et al., 2016; Howard et al., 2017).
Another remarkable trend is to execute deep convolutional
networks on mobile platforms, which is becoming important because
of concerns about response time, dependency on an internet
connection, privacy, and security. Many companies and research
groups have recently been developing notable hardware accelerators
called neural processing units (NPUs)
1Anonymous Institution, Anonymous City, Anonymous Region,
Anonymous Country. Correspondence to: Anonymous Author
<anon.email@domain.com>.
Preliminary work. Under review by the International Conference
on Machine Learning (ICML). Do not distribute.
(Song et al.; Zhang et al., 2016; ARM-ML-processor). They have
tried to develop energy-efficient NPUs based on novel techniques
such as exploiting network sparsity for high utilization of
multiply/accumulate (MAC) units or quantizing networks to reduce
the power of MAC units. Operating CNNs on an NPU requires a large
number of memory accesses to read and write weight and feature-map
data. In (Han et al., 2016a), it was shown that the total energy is
dominated by the required memory accesses if there is no data
reuse, and that the energy cost of an on-chip memory access (SRAM
in (Han et al., 2016a)) is 128 times lower than that of an off-chip
memory access (DRAM in (Han et al., 2016a)). However, since NPUs
for mobile platforms have a limited amount of on-chip memory, it is
not easy to maintain NPU efficiency for most applications.
Therefore, we reasoned that reducing off-chip memory accesses by
utilizing on-chip memory maximally can be one of the most powerful
ways to increase the efficiency of NPUs.
To achieve high utilization of on-chip memory, we needed to know
which sequences of operations an NPU can use to process a
convolution layer and to choose the most efficient one among them.
We focused on a sequence of operations that fetches weights in
minimal increments and executes convolution with those weights over
the feature-maps of all input channels. Such a sequence makes the
realization of convolution simple because it removes the need to
consider how to place weights in on-chip memory. Figure 1 shows the
required memory size of weights and feature-maps according to the
layer index, for Inception-V3 and ResNet-50. The size of weights
increases as the layer index increases, whereas the size of
feature-maps decreases. From these patterns, we reasoned that it
would be beneficial to keep feature-maps fully in on-chip memory
and to manage them efficiently whenever they are small enough to
fit. Moreover, in recent architectures of convolutional neural
networks, the concept of a module or block has been introduced to
computer vision applications for higher representational power. The
important observation is that modules or blocks are used repeatedly
in a network just after feature-maps are scaled down to reduce
computational cost, as shown in Figure 1. Depending on
arXiv:1901.09614v1 [cs.NE] 28 Jan 2019
[Figure 1 plot: two panels, Inception-V3 and ResNet-50. Series: weight size [Kbyte] and feature-map size [Kbyte] (= input + output feature-maps), axis 0 to 2500. Inception-V3 groups: Inception a1 to a3, Reduction a, Inception b1 to b4, Reduction c, Inception c1 to c2. ResNet-50 groups: 3 Bottleneck (56x56), 4 Bottleneck (28x28), 6 Bottleneck (14x14), 3 Bottleneck (7x7).]
Figure 1. Sizes of weights and feature-maps according to indexes
of layers for Inception-V3 and ResNet-50.
the size of feature-maps and the characteristics of modules, we
were therefore able to find a simple way to minimize off-chip
memory accesses by efficiently managing on-chip memory for modules.
In this paper, we propose a simple method to support
energy-efficient and real-time processing on NPUs through the
reduction of off-chip memory accesses. First, the algorithm detects
certain types of modules or blocks by interpreting the graph of the
whole network. Then, several regions of on-chip memory are assigned
so that feature-maps can be reused while a module or block is
processed. Moreover, in order to utilize on-chip memory maximally,
we also propose a branch-reordering algorithm and two
branch-processing algorithms. By combining the proposed algorithms,
we can effectively cut down off-chip memory accesses for
convolutional neural networks such as Inception-V3 and ResNet. The
rest of the paper is organized as follows. Section 2 describes the
module we define. Then, in Section 3, we introduce the proposed
algorithms to reduce off-chip memory accesses. Section 4 shows
evaluation results for a representative network, Inception-V3
(Szegedy et al., 2015; 2016). Finally, Section 5 concludes the
paper.
2. Definition of Module
After Network-in-Network was proposed in (Lin et al., 2013) to
increase the representational power of neural networks, the concept
of a module or block has become popular in convolutional neural
networks for computer vision applications, such as the Inception
network (Szegedy et al., 2015; 2016), ResNet (He et al., 2016),
MobileNet V2 (Sandler et al., 2018), SqueezeNet (Iandola et al.,
2016), ShuffleNet (Ma
[Figure 2 diagram: left, a concatenation based module (Module Start, Pool and Conv layers on branches b=1 to b=4, Concat, Module End); right, an element-wise summation based module (Module Start, Conv layers on branches b=1 and b=2, E-sum, Module End). Each layer is labeled with its b, k, and di(k) values.]
Figure 2. Illustration of various modules satisfying the four
conditions in Section 2. Here, b and k denote the indexes of a
branch and a layer in the module, and di(k) is the maximum depth of
module at the kth layer.
et al., 2018), and MnasNet (Tan et al., 2018). In this section,
we define a module and specify which of the modules defined for
various networks can be used by the algorithms we propose.
In general, a module used in convolutional neural networks is a
directed acyclic graph (DAG), i.e., a finite directed graph without
directed cycles. That is, it consists of finitely many layers and
edges, with each edge directed from one layer to another, such that
there is no way to start at any layer A and follow a
consistently-directed sequence of edges that eventually loops back
to A. Moreover, a convolutional neural network that includes many
modules can itself be viewed as a large DAG containing multiple
smaller DAGs.
In this paper, we consider only DAGs satisfying the following
conditions: 1) the module has multiple branches, 2) the module
includes at least one merge layer, and 3) each merge layer is
either a concatenation or an element-wise summation. In addition,
4) we do not cover a large module formed by a long skip-connection
or containing many layers, a structure frequently used in neural
networks to extract multi-scale features. This is because there
seems to be no efficient way to utilize on-chip memory for such
large modules. By focusing only on modules satisfying these four
conditions, we could reach a sub-optimal but simple solution to
reduce off-chip memory accesses, even though the proposed
algorithms do not cover all kinds of neural networks.
Figure 2 shows examples of modules satisfying the required
conditions and introduces some indices used to explain the
algorithms proposed in this paper, where b denotes the index of a
branch in a module, k is the index of a layer in a branch, and
di(k) denotes the maximum depth of modules for the kth layer. In
other words, the kth layer belongs to a module of depth di(k).
With these parameters,
we explain two types of modules, each having a different
kind of merge layer as follows:
Concatenation based module: this type of module includes the k-th
layer on the b-th branch, and a concatenation layer at the end of
the module, as shown in the left plot of Figure 2. Sometimes the
module also includes a sub-module, i.e., a module within a module,
as in the Inception-C type. Representative neural networks with
concatenation based modules are several versions of the Inception
network (Szegedy et al., 2015; 2016) and SqueezeNet (Iandola et
al., 2016). In Figure 2, Conv and Pool denote a convolutional layer
and a pooling layer, respectively.
Element-wise adder based module: this type of module
has two branches, multiple layers, and an element-wise
adder at the end. The representative neural networks with
the element-wise adder based module are ResNet (He et al.,
2016) and MobileNet V2 (Sandler et al., 2018).
3. Algorithm for Effective Module Processing
The flowchart in Figure 3 shows the overall process of compiler
optimization, including the algorithm proposed for module
processing, where the compiler interprets graphs and makes a policy
for effective execution of a neural network on NPUs. The neural
network source code is a file containing the graph information of a
neural network, for example in prototxt or TFLite format. The parse
unit extracts network parameters such as kernel size, stride, pad,
layer type, feature-map size, and module parameters. With the
extracted parameters, the proposed algorithm runs in the
optimization unit, which includes four phases: module detection,
micro-instruction generation, branch reordering, and branch
processing (I/II). Here, module detection finds the modules in a
neural network that satisfy the required conditions, and
micro-instruction generation finds the best sequence of the
micro-instructions defined in the NPU for each module. Branch
reordering decides a new processing order for the branches in a
module based on a proposed criterion, and branch processing I/II
decides an effective policy. Then, the memory allocation unit
allocates on-chip memory resources based on the policy produced by
the optimization unit. Finally, the neural network object code is
generated as the output of the object code generation unit.
Since our main proposal is the optimization unit in Figure 3, we
explain it in detail. Algorithm 1 shows the whole procedure of the
optimization unit, which has six steps: module detection, operator
sequence generation, branch reordering, MIFM (Module's Input
Feature Map) memory size calculation, occupied MOFM (Module's
Output Feature Map) memory size calculation, and branch processing.
Here, a network graph Ω is used as the input of the optimization
unit,
[Figure 3 diagram: NEURAL NETWORK SOURCE CODE -> PARSE UNIT -> OPTIMIZATION UNIT (PROPOSED ALGORITHM: MODULE DETECTION, MICRO-INSTRUCTION GENERATION, BRANCH REORDERING, BRANCH PROCESSING I/II) -> MEMORY ALLOCATION UNIT -> OBJECT CODE GENERATION UNIT -> NEURAL NETWORK OBJECT CODE.]
Figure 3. Overall flowchart of the compiler optimization including
the proposed algorithm.
and there are two types of outputs: the best sequence of
micro-instructions, and the resource allocation information
in on-chip memory per module. Each step of Algorithm 1
is described in detail in Algorithm 2 through Algorithm 6.
First of all, the proposed algorithm has to detect all possible
valid modules within a network. Different kinds of modules
are designed for well-known neural network models, but we
focus only on those modules which are suitable for efficient
use of on-chip memory. Algorithm 2 describes how to detect
modules that meet the conditions given in the previous section.
First, we exclude long skip-connections, since it is extremely
difficult to manage the modules they form within on-chip memory
efficiently. Then, the algorithm detects modules in the modified
graph, from which all long skip-connections have been removed.
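The two passes described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the compiler's actual implementation: the function name `detect_modules`, the graph encoding (layer names in topological order plus an edge list), and the rule used in place of SearchStartLayer (take the latest layer from which every input of the merge layer can be reached) are all hypothetical. Which edges count as long skip-connections is left to the caller.

```python
from collections import defaultdict


def detect_modules(layers, edges, merge_layers, long_skips=frozenset()):
    """Return (start, merge) pairs for each merge layer in the pruned graph.

    layers: layer names in topological order (assumed).
    edges: iterable of (src, dst) pairs.
    merge_layers: names of concatenation / element-wise summation layers.
    long_skips: edges the caller treats as long skip-connections.
    """
    # Pass 1: build the effective edge set by dropping long skip-connections.
    preds = defaultdict(list)
    for src, dst in edges:
        if (src, dst) not in long_skips:
            preds[dst].append(src)

    # Ancestors of every layer, filled in topological order.
    ancestors = {}
    for layer in layers:
        found = set()
        for p in preds[layer]:
            found |= ancestors[p] | {p}
        ancestors[layer] = found

    # Pass 2: for each merge layer, search for a start layer.
    position = {name: i for i, name in enumerate(layers)}
    modules = []
    for layer in layers:
        if layer in merge_layers and len(preds[layer]) > 1:
            # Layers from which every input of the merge layer can be reached
            # (an input counts as reachable from itself).
            common = set.intersection(
                *(ancestors[p] | {p} for p in preds[layer])
            )
            if common:
                # Assumed rule: the module starts at the latest such layer.
                start = max(common, key=position.__getitem__)
                modules.append((start, layer))
    return modules


# Hypothetical two-branch module followed by a long skip-connection.
layers = ["in", "a1", "a2", "b1", "cat", "out"]
edges = [("in", "a1"), ("a1", "a2"), ("in", "b1"),
         ("a2", "cat"), ("b1", "cat"), ("cat", "out"), ("in", "out")]
print(detect_modules(layers, edges, {"cat", "out"}, long_skips={("in", "out")}))
```

With the long skip-connection removed, only the module ending at `cat` is reported; without the removal, the sketch would also report a large module ending at `out`, which is the case condition 4) is meant to exclude.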
Figure 4 shows the results of applying the module detection
algorithm to Inception-V3. Here we can see that 11 modules are
detected by Algorithm 2. After detecting modules, the optimization
unit generates a sequence of micro-instructions for all layers of
each module. Since this step is very hardware-dependent, we do not
explain its details in this paper. In any case, the optimization
unit should find the best sequence of micro-instructions so that
the layers of each module are processed efficiently on the specific
hardware.
As the third step, we execute a re-ordering process of
Algorithm 1 Optimization unit
Input: Network graph (Ω).
% [Step-1] module detection through Ω
ModuleDet(Ω)
for m = 1 to len(modules) do
    % [Step-2] generate a sequence of micro-instructions for layers in the module
    Ops(m) = GenSeqOps(m)
    % [Step-3] branch re-ordering
    BrReordering(m)
    for b = 1 to len(branches of module) do
        for k = 1 to len(layers of branch) do
            % [Step-4] calculation of memory sizes of MIFM
            CalSizeMifmMem(k)
        end for
        % [Step-5] calculation of the occupied memory size within MOFM according to b
        CalSizeMofmOccMem(b)
        % [Step-6] branch operation based on Ops
        BrProcess(b, I) or BrProcess(b, II)
    end for
end for
Algorithm 2 Module detection, ModuleDet(Ω)
Input: Network graph (Ω) with a layer set (Ψ) and an edge set (Θ).
for ν in Θ do
    if ν is not a long skip-connection then
        ν ∈ Θe.
    else
        νe = Disconnect(ν)
        νe ∈ Θe, where νe is originated from off-chip memory.
    end if
end for
Input: Effective network graph (Ωe) with a layer set (Ψ) and an effective edge set (Θe).
for ψ in Ψ of Ωe do
    if ψ == merge layer then
        σ = SearchStartLayer(ψ)
        ω = ExtractModuleParam(σ, ψ)
    end if
end for
[Figure 4 diagram: Inception-V3 network graph. Legend: CONV, POOL, CONCAT, FC, DETECTED MODULE.]
Figure 4. Illustration of results of the module detection algorithm
in Algorithm 2.
branches for all modules through Algorithm 3. First, we calculate
the required memory size of each branch within a module. Here, the
required memory size of a layer is calculated in the function
CalcSizeReqMem by accumulating the sizes of IFM, OFM, internal
working memories, and weights, excluding MIFM and MOFM. The
required memory size of a branch is the largest of the memory sizes
required by the layers on that branch. Finally, we change the
processing order of the branches to descending order of the memory
size required by each branch. Figure 5 shows an example of a module
with four branches. Assuming sizereq(1) < sizereq(3) < sizereq(2)
< sizereq(4), BrReordering changes the order of branches to Br(4),
Br(2), Br(3), and Br(1), as in the right plot of Figure 5. Through
BrReordering, we can allocate more of the available on-chip memory
to the branch that requires a larger size of memory.
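As a minimal sketch of this re-ordering step, the Python snippet below takes the per-layer required sizes of each branch, reduces them with a maximum, and sorts the branch indexes in descending order. The function name `reorder_branches` and the sizes in KB are hypothetical; they are chosen only so that the maximum of each branch falls on the layer highlighted in Figure 5 and so that sizereq(1) < sizereq(3) < sizereq(2) < sizereq(4).

```python
def reorder_branches(size_req):
    """size_req maps a branch index to the list of per-layer required sizes
    (excluding MIFM and MOFM). Returns branch indexes, largest branch first."""
    # Required size of a branch = largest requirement among its layers.
    branch_size = {b: max(per_layer) for b, per_layer in size_req.items()}
    # Process the most memory-hungry branch first.
    return sorted(branch_size, key=branch_size.get, reverse=True)


# Hypothetical sizes in KB for a module with four branches.
size_req = {
    1: [40],
    2: [60, 120, 80],
    3: [30, 50, 70, 90],
    4: [200, 10],
}
print(reorder_branches(size_req))  # [4, 2, 3, 1]
```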
At the next step, we need to calculate the size of MIFM
and to allocate an offset in order to share a memory region
without any collision during module processing. As shown
Algorithm 3 Re-ordering of branches, BrReordering(m)
Input: module parameters ω.
for b = 1 to len(branches of module) do
    for k = 1 to len(layers of branch) do
        % calculating a size of the required memory except MIFM and MOFM for each layer
        sizereq(b,k) = CalcSizeReqMem(b, k)
    end for
    sizereq(b) = max(sizereq(b,:))
end for
% Sorting the indexes of branches in descending order
Sort(sizereq(:))
[Figure 5 diagram: a module with four branches before and after BrReordering. The layer needing the maximum required memory within each branch is highlighted: sizereq(1) = sizereq(1,1), sizereq(2) = sizereq(2,2), sizereq(3) = sizereq(3,4), sizereq(4) = sizereq(4,1). Assumption: sizereq(1) < sizereq(3) < sizereq(2) < sizereq(4).]
Figure 5. Illustration of the branch re-ordering algorithm in Algo-
rithm 3
in Algorithm 4, sizes and offsets are calculated at the kth layer
for all possible depths, di(k), in the module. In Figure 6, the
red-colored box serves as an example of the MIFM calculation
algorithm; it marks the second branch, which has three layers in
the module. At the 1st layer, only sizemifm(1,1) is considered
because the layer is included only in the 1st-depth module. Because
sizemifm(1,1) is exactly the same as the size used by the previous
branch, the memory region is handed over directly from the previous
branch. For the 2nd and the 3rd layers, which are within the
2nd-depth module, we also need to consider the second shared memory
region given by sizemifm(2,2) and offsetmifm(2,2). After operating
the 3rd layer, we can release this second shared memory region
because it is not needed by the following branches.
Algorithm 4 Calculation of MIFM size for the kth layer, CalSizeMifmMem(k)
Input: module parameters ω, effective depth of MIFM for the kth layer in module di(k).
for i = 1 to di(k) do
    % calculating MIFM size for the module with ω
    sizemifm(k,i) = calcMifmMemSize(k, i)
    offsetmifm(k,i) = calcMifmMemOffset(k, i)
end for
[Figure 6 diagram: on-chip memory layout produced by CalSizeMifmMem. 1st depth MIFM: sizemifm(1,1) (= sizemifm(2,1) = sizemifm(3,1)) at offsetmifm(2,1). 2nd depth MIFM: sizemifm(2,2) (= sizemifm(3,2)) at offsetmifm(2,2).]
Figure 6. Illustration of the MIFM calculation algorithm in Algo-
rithm 4
[Figure 7 diagram: CalSizeMofmOccMem(b). The MOFM region is split into occupied and unoccupied parts, with sizeoccu(3)/offsetoccu(3) and sizeoccu(4)/offsetoccu(4) marked, next to the MIFM region sizemifm(k,1).]
Figure 7. Illustration of the occupied MOFM calculation algorithm
in Algorithm 5
As the 5th step, we calculate the occupied memory size within MOFM
at the bth branch using Algorithm 5. It is obtained simply by
accumulating the output sizes of the last layers of all previous
branches. That is, sizeoccu(b) is the size of the region of the
MOFM that is already occupied when the current branch is processed,
as shown in Figure 7. It is important to calculate the size of the
occupied region exactly because that region (the green-colored
region in Figure 7) in on-chip memory cannot be utilized for the
current branch operations.
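The accumulation can be illustrated with a short Python sketch. The function name `occupied_mofm` is hypothetical, and the offset rule is an assumption made only for this example: branch outputs are taken to be packed contiguously from the start of the MOFM region, so the write offset of a branch equals the size already occupied. The paper leaves calcMofmMemOffset unspecified.

```python
def occupied_mofm(last_ofm_sizes):
    """last_ofm_sizes[i] is the OFM size of the last layer on the (i+1)-th
    branch, in processing order. Returns one (size_occu, offset) per branch."""
    result = []
    occupied = 0
    for size in last_ofm_sizes:
        # Before this branch runs, earlier branches already hold `occupied`.
        # Assumption: outputs are packed contiguously, so offset == occupied.
        result.append((occupied, occupied))
        occupied += size
    return result


# Hypothetical last-layer OFM sizes in KB for three branches.
print(occupied_mofm([64, 96, 32]))  # [(0, 0), (64, 64), (160, 160)]
```

The first branch sees an empty MOFM region, and each later branch loses the space taken by the outputs of the branches before it, which is why the available memory shrinks as the branch index grows.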
The branch processing is conducted at the last step as shown
Algorithm 5 Calculation of the occupied memory size of MOFM at the bth branch, CalSizeMofmOccMem(b)
Input: module parameters ω.
sizeoccu(b) = 0
for i = 1 to b − 1 do
    sizeoccu(b) += calcMofmMemSize(i)
    offsetoccu(b) = calcMofmMemOffset(i)
end for
[Figure 8 diagram: on-chip memory layout during BrProcess(b) for a three-layer branch, showing MIFM, MOFM, sizemifm(1,1), sizeoccu(2), and sizeavail, and MOFM-to-MIFM forwarding between the (m)th and (m+1)th modules. Legend: occupied MOFM (utilized region); occupied MIFM/MOFM (no utilization); occupied MIFM (utilized region); OFM forwarding of the kth layer within a branch, fwd(k) = true; W: memory region for the partial weights of the kth layer; WM: memory region for the partial working memory of the kth layer; IFM: memory region for the IFM of the kth layer; OFM: memory region for the OFM of the kth layer.]
Figure 8. Illustration of the branch processing algorithm in Algo-
rithm 6
in Algorithm 6. We propose two types of branch processing: (I) a
default algorithm considering both MIFM and MOFM, and (II) an
optional algorithm considering only MIFM. The optimization unit
adaptively chooses between (I) and (II) according to the required
memory size of a module. Branch processing (I) supports an
operation in which the MOFM of the former module is directly
forwarded to the MIFM of the latter module (i.e., memory is shared
between consecutive modules), but less memory (sizeavail in Figure
8) is available for branch operations because regions must stay
occupied for both MIFM and MOFM in the module. On the other hand,
branch processing (II) is an optional algorithm applicable when the
required memory size sizereq(b) for the bth branch is larger than
the available on-chip memory size sizeavail(b,k). It helps because
the additional memory region of sizeoccu(b) becomes usable for
branch processing, and because even more memory becomes available
by considering sizeOFM.Partial instead of sizeOFM.Full at the last
layer of the branch. That is, we give up the shared memory region
of MOFM in order to increase sizeavail, and benefit only from
sharing MIFM on the branch. The detailed operations are given in
Algorithm 6, where fwd(k) is a flag indicating OFM forwarding of
the kth layer. Figure 8 shows an example of branch processing (I),
where we consider the operation of the second branch, which
includes three layers within a module. At the 1st layer,
sizemifm(1,1) is the shared memory for MIFM in the module, and
sizeoccu(2) corresponds to the region occupied by the OFM of the
last layer in the 1st branch. That is, we
Algorithm 6 Branch processing, BrProcess(b, opt)
Input: Branch index b, and MIFM depth di(k), and pre-occupied MOFM size sizeoccu(b)
fwd(0) = false;
% estimating the forwarding status of OFM for the kth layer on the branch
for k = 1 to len(layers of branch) do
    % calculating the available memory size
    if opt = 'II' then
        sizeoccu(b) = 0
    end if
    sizeavail(b,k) = sizemem - \sum_{m=1}^{di(k)} sizemifm(k,m) - sizeoccu(b)
    if fwd(k−1) then
        sizereq(b,k) = sizeIFM.Full + sizeOFM.Full + sizeWM.Partial + sizeW.Partial
        if sizereq(b,k) ≤ sizeavail(b,k) then
            fwd(k) = true;
        else
            fwd(k) = false;
        end if
    else
        sizereq(b,k) = sizeIFM.Partial + sizeOFM.Full + sizeWM.Partial + sizeW.Partial
        if sizereq(b,k) ≤ sizeavail(b,k) then
            fwd(k) = true;
        else
            fwd(k) = false;
        end if
    end if
    if (opt = 'II') and (k == len(layers of branch)) then
        fwd(k) = false;
        % sizereq(b,k) can be calculated with sizeOFM.Partial, not sizeOFM.Full.
    end if
end for
can use only sizeavail, which is defined in Figure 8, for the 1st
layer. If the total size including OFM, working memory (WM), and
weights (W) is equal to or less than sizeavail, the OFM can be
directly forwarded to the IFM of the 2nd layer, as shown in Figure
8. At the 3rd layer, following the same sequence, the IFM is the
region inherited from the OFM of the second layer, and the OFM is
stored in the green-colored region within the MOFM region. By using
branch processing (I), therefore, we can eliminate off-chip memory
accesses for feature-maps during the processing within a module.
The algorithm also removes the off-chip accesses between
consecutive modules by forwarding the MOFM of the former module to
the MIFM of the latter module, as shown in the second plot of
Figure 8.
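The forwarding decision of Algorithm 6 can be sketched in Python as below. This is an illustration under stated assumptions rather than the NPU compiler's code: the `LayerSizes` container, the function name `branch_forwarding`, and all sizes in KB are hypothetical, and the MIFM sizes over all depths are assumed to be already summed into one field per layer.

```python
from dataclasses import dataclass


@dataclass
class LayerSizes:
    ifm_full: int
    ifm_partial: int
    ofm_full: int
    wm_partial: int
    w_partial: int
    mifm_total: int  # sum of sizemifm(k, m) over depths m = 1..di(k)


def branch_forwarding(layers, size_mem, size_occu, opt="I"):
    """Return the fwd(k) flags of one branch, following Algorithm 6."""
    if opt == "II":
        size_occu = 0  # option II gives up the shared MOFM region
    fwd = []
    prev_fwd = False  # fwd(0) = false
    for k, layer in enumerate(layers):
        size_avail = size_mem - layer.mifm_total - size_occu
        # A forwarded OFM stays on chip as the full IFM of this layer.
        ifm = layer.ifm_full if prev_fwd else layer.ifm_partial
        size_req = ifm + layer.ofm_full + layer.wm_partial + layer.w_partial
        can_forward = size_req <= size_avail
        if opt == "II" and k == len(layers) - 1:
            can_forward = False  # the last layer is not forwarded under II
        fwd.append(can_forward)
        prev_fwd = can_forward
    return fwd


# Hypothetical three-layer branch, 1,024 KB of on-chip memory.
branch = [
    LayerSizes(ifm_full=200, ifm_partial=16, ofm_full=300,
               wm_partial=16, w_partial=32, mifm_total=200),
    LayerSizes(ifm_full=300, ifm_partial=16, ofm_full=400,
               wm_partial=16, w_partial=32, mifm_total=200),
    LayerSizes(ifm_full=400, ifm_partial=16, ofm_full=100,
               wm_partial=16, w_partial=32, mifm_total=200),
]
print(branch_forwarding(branch, size_mem=1024, size_occu=100, opt="I"))
print(branch_forwarding(branch, size_mem=1024, size_occu=100, opt="II"))
```

In this made-up example, option (I) cannot forward the second layer because 100 KB of MOFM is already occupied, whereas option (II) frees that region and forwards it, at the cost of not forwarding the last layer of the branch.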
[Figure 9 chart: Feature Map + Weight (Read + Write) [kByte] per module, axis 0 to 8000. Series: Weight, Feature Map of naïve algorithm, Feature Map of proposed algorithm. Marked totals: total amount of off-chip memory accesses in the naïve algorithm and in the proposed algorithm.]
Figure 9. Comparison of the total data sizes transferred to and from off-chip memory for the Naive and Proposed algorithms, where the total data size is
summed over both weights and feature-maps.
4. Evaluation
To evaluate the proposed algorithm, we selected a representative
CNN model, Inception-V3 (Szegedy et al., 2016), with an input image
of 299 × 299. The target neural processor and its features are as
follows: 1) Samsung's NPU (Song et al.) in Exynos has 1024
multiply/accumulate (MAC) units on 16 MAAs (multiply/accumulate
arrays), and 1,024 Kbyte of on-chip memory, which holds IFMs, OFMs,
weights, and temporary WMs. The NPU also exploits three kinds of
parallelism: first, IFMs are divided along the channel dimension
and fetched in four chunks; second, OFMs in the form of a 4x4 patch
in an MAA are computed in parallel; lastly, a weight kernel is
copied to 16 kernels for parallel operation on 16 MAAs. 2) Not all
weight kernels of a layer need to be in on-chip memory. The NPU can
produce 16 OFM channels in parallel if the partial weights for 16
MAAs are in on-chip memory. The partial weights can be read and
written in a double-buffering manner to effectively hide memory
access time. 3) We assume that all weights and feature-maps are
8-bit quantized in the evaluation, even though the NPU supports
other precisions.
Figure 9 shows the total amount of data transferred through
off-chip memory accesses for two algorithms, Naive and Proposed.
Here, the Naive algorithm is assumed to access off-chip memory to
read and write feature-maps for every layer. The results show that
the feature-map traffic to off-chip memory almost disappears, while
the weight traffic remains the same. Therefore, the proposed
algorithm reduces the total
[Figure 10 chart: Sizereq(b,k) and Sizeavail(b,k) in kB (axis 0 to 1024) for modules incep-a1, incep-a2, incep-a3, red-a, incep-b1, incep-b2, incep-b3, incep-b4, red-b, incep-c1, incep-c2.]
Figure 10. Illustration of the relationship between sizereq(b,k) and
sizeavail(b,k) through the proposed algorithm at the kth layer on
the bth branch for 11 modules in Inception-V3.
amount of data with off-chip memory access by about half.
With the proposed algorithm, only reduction-a and Inception-b1
among the 11 modules in Inception-V3 still have a small amount of
feature-map traffic to off-chip memory. The reason is as follows.
In the reduction-a module, we applied BrProcess(b, II), which needs
to access off-chip memory three times at the ends of branches
because sizereq did not fit within sizeavail for the module, as
shown in Figure 10. This is why off-chip accesses occur in
reduction-a. As a consequence of the former module, reduction-a,
there is then one off-chip memory access at the start of the
Inception-b1 module. Figure 10 shows the important information
related to sizereq(b,k) and sizeavail(b,k) in our proposed
algorithm. Firstly, branches in each module are re-ordered
by sizereq(b,k). And sizeavail(b,k) also decreases as the in-
Amounts of off-chip memory accesses are in KB; numbers of off-chip memory accesses are for FM only.

              W [KB]    Naive [KB]          Proposed [KB]       Ratio [%]           Naive FM accesses   Proposed FM accesses
Module                  FM        Overall   FM        Overall   Overall   FM        Read      Write     Read      Write
inception-a1  249       2308.5    2557.5    0         249       9.74      0.00      8         8         0         0
inception-a2  270       2835      3105      0         270       8.70      0.00      8         8         0         0
inception-a3  277.5     3078      3355.5    0         277.5     8.27      0.00      8         8         0         0
reduction-a   1125      1798.5    2923.5    300       1425      48.74     16.68     5         5         0         3
inception-b1  1264      2700      3964      300       1564      39.46     11.11     11        11        1         0
inception-b2  1648      2850      4498      0         1648      36.64     0.00      11        11        0         0
inception-b3  1648      2850      4498      0         1648      36.64     0.00      11        11        0         0
inception-b4  2088      3000      5088      0         2088      41.04     0.00      11        11        0         0
reduction-b   1656      1580      3236      0         1656      51.17     0.00      7         7         0         0
inception-c1  4920      808       5728      0         4920      85.89     0.00      10        10        0         0
inception-c2  5928      1096      7024      0         5928      84.40     0.00      10        10        0         0
total         21073.5   24904     45977.5   600       21673.5   47.14     2.41      100       100       1         3
Table 1. Summary about the amount and the number of the off-chip accesses for 11 modules in Inception-V3, where FM and W mean a
feature-map and a weight, respectively.
dex of branch in the module increases because sizeoccu(b)
increases. Moreover, we can see the relationship between
sizereq(b,k) and sizeavail(b,k), which is the criterion used to
select BrProcess(b, I) or BrProcess(b, II).
Consequently, Table 1 summarizes the overall amount of off-chip
accesses for every module in Inception-V3; with the proposed
algorithm, the overall amount falls to 47.14 % of the naive amount
(= 21673.5 / 45977.5). The table also reports the amount and the
number of off-chip accesses for feature-maps only. The proposed
algorithm achieves a reduction of 97.59 % in the amount of
feature-map data, and reduces the number of feature-map accesses to
1/50 (= 4/200) over all modules.
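The headline figures follow directly from the totals in Table 1, as the short Python check below shows. The variable names are illustrative only; the numbers are the "total" row of the table.

```python
from fractions import Fraction

# Totals from Table 1 (KB).
naive_total_kb = 45977.5      # weights + feature-maps, naive
proposed_total_kb = 21673.5   # weights + feature-maps, proposed
naive_fm_kb = 24904.0         # feature-maps only, naive
proposed_fm_kb = 600.0        # feature-maps only, proposed

# Remaining traffic as a percentage of the naive algorithm.
overall_ratio = 100 * proposed_total_kb / naive_total_kb
fm_ratio = 100 * proposed_fm_kb / naive_fm_kb
fm_reduction = 100 - fm_ratio

# Feature-map access counts: (read + write) proposed over naive.
access_ratio = Fraction(1 + 3, 100 + 100)

print(round(overall_ratio, 2))  # 47.14
print(round(fm_ratio, 2))       # 2.41
print(round(fm_reduction, 2))   # 97.59
print(access_ratio)             # 1/50
```

Note that 47.14 % is the share of traffic that remains, so the overall reduction is about 52.86 %, while the 97.59 % figure is a reduction and applies to feature-map data only.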
5. Conclusion
In this paper, we proposed a simple method for energy-efficient
and real-time processing on NPUs through the reduction of off-chip
memory accesses. To achieve this, we focused on the modules used in
convolutional neural networks that have multiple branches and a
merge layer. The key ideas of the algorithm are module detection
that ignores long skip-connections, branch re-ordering to utilize
the available memory maximally, assignment of MIFM and MOFM regions
that are shared between modules, and branch processing. For
Inception-V3 on Samsung's NPU in Exynos, we showed that the
proposed algorithm achieved a 97.59 % reduction in the amount of
feature-map data requiring off-chip access, and reduced the number
of off-chip accesses for feature-maps to 1/50. Finally, we think
the proposed algorithm can be a powerful solution to increase the
efficiency of NPUs when processing various convolutional neural
networks.
References
ARM-ML-processor. industry-leading perfor-
mance and efficiency for inference at the
edge. https://developer.arm.com/
products/processors/machine-learning/
arm-ml-processor.
Courbariaux, M., Bengio, Y., and David, J.-P. Training deep
neural networks with low precision multiplications. arXiv
preprint arXiv:1412.7024, 2014.
Courbariaux, M., Bengio, Y., and David, J.-P. Binarycon-
nect: Training deep neural networks with binary weights
during propagations. In Advances in Neural Information
Processing Systems, pp. 3123–3131, 2015.
Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan,
P. Deep learning with limited numerical precision. In
Proceedings of the 32nd International Conference on
Machine Learning (ICML-15), pp. 1737–1746, 2015.
Gysel, P., Motamedi, M., and Ghiasi, S. Hardware-oriented
approximation of convolutional neural networks. arXiv
preprint arXiv:1604.03168, 2016.
Han, S., Mao, H., and Dally, W. J. Deep compres-
sion: Compressing deep neural networks with pruning,
trained quantization and huffman coding. arXiv preprint
arXiv:1510.00149, 2015.
Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz,
M. A., and Dally, W. J. Eie: efficient inference engine on
compressed deep neural network. In Computer Architec-
ture (ISCA), 2016 ACM/IEEE 43rd Annual International
Symposium on, pp. 243–254. IEEE, 2016a.
Han, S., Pool, J., Narang, S., Mao, H., Tang, S., Elsen, E.,
Catanzaro, B., Tran, J., and Dally, W. J. Dsd: Regu-
larizing deep neural networks with dense-sparse-dense
training flow. arXiv preprint arXiv:1607.04381, 2016b.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-
ing for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition,
pp. 770–778, 2016.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,
W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets:
Efficient convolutional neural networks for mobile vision
applications. arXiv preprint arXiv:1704.04861, 2017.
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and
Bengio, Y. Quantized neural networks: Training neu-
ral networks with low precision weights and activations.
arXiv preprint arXiv:1609.07061, 2016.
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K.,
Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level
accuracy with 50x fewer parameters and <0.5MB model
size. arXiv preprint arXiv:1602.07360, 2016.
Judd, P., Albericio, J., Hetherington, T., Aamodt, T., Jerger,
N. E., Urtasun, R., and Moshovos, A. Reduced-precision
strategies for bounded memory in deep neural nets. arXiv
preprint arXiv:1511.05236, 2015.
Kim, D., Yim, H. Y., Ha, S., Lee, C., and Kang, I. Convo-
lutional neural network quantization using generalized
gamma distribution. arXiv preprint arXiv:1810.13329,
2018.
Lin, D., Talathi, S., and Annapureddy, S. Fixed point quan-
tization of deep convolutional networks. In International
Conference on Machine Learning, pp. 2849–2858, 2016.
Lin, M., Chen, Q., and Yan, S. Network in network. arXiv
preprint arXiv:1312.4400, 2013.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional
networks for semantic segmentation. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 3431–3440, 2015.
Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2:
Practical guidelines for efficient cnn architecture design.
arXiv preprint arXiv:1807.11164, 2018.
Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn:
Towards real-time object detection with region proposal
networks. In Advances in neural information processing
systems, pp. 91–99, 2015.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. Mobilenetv2: Inverted residuals and linear
bottlenecks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4510–
4520, 2018.
Song, J., Cho, Y., Park, J.-S., Jang, J.-W., Lee, S., Song,
J.-H., Lee, J.-G., and Kang, I. An 11.5tops/w 1024-mac
butterfly structure dual-core sparsity-aware neural pro-
cessing unit in 8nm flagship mobile soc. In accepted for
2019 IEEE International Solid-State Circuits Conference
(ISSCC).
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich,
A. Going deeper with convolutions. In Proceedings
of the IEEE conference on computer vision and pattern
recognition, pp. 1–9, 2015.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,
Z. Rethinking the inception architecture for computer vi-
sion. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 2818–2826, 2016.
Tan, M., Chen, B., Pang, R., Vasudevan, V., and Le, Q. V.
Mnasnet: Platform-aware neural architecture search for
mobile. arXiv preprint arXiv:1807.11626, 2018.
Zhang, S., Du, Z., Zhang, L., Lan, H., Liu, S., Li, L., Guo,
Q., Chen, T., and Chen, Y. Cambricon-x: An acceler-
ator for sparse neural networks. In The 49th Annual
IEEE/ACM International Symposium on Microarchitec-
ture, pp. 20. IEEE Press, 2016.
