Improving Memory Utilization in Convolutional Neural Network
  Accelerators by Jokic, Petar et al.
This article has been accepted for publication in IEEE Embedded Systems Letters, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/LES.2020.3009924, 
IEEE Embedded Systems Letters 
© 2020 IEEE.  Personal use of this material is permitted.  Permission from IEEE must be obtained for all other uses, in any current or future media, including 
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any 
copyrighted component of this work in other works. 
  
Abstract—While the accuracy of convolutional neural networks 
has improved vastly through larger and deeper network
architectures, the memory footprint for storing their parameters
and activations has grown as well. This trend especially
challenges power- and resource-limited accelerator designs, which
are often restricted to storing all network data in on-chip memory to
avoid interfacing energy-hungry external memories. Maximizing
the network size that fits on a given accelerator thus requires
maximizing its memory utilization. While the traditionally used
ping-pong buffering technique maps subsequent activation layers to
disjoint memory regions, we propose a mapping method that
allows these regions to overlap and thus utilizes the memory more
efficiently. This work presents the mathematical model to compute
the maximum activations memory overlap and thus the lower bound
of on-chip memory needed to perform layer-by-layer processing of
convolutional neural networks on memory-limited accelerators.
Our experiments with various real-world object detector networks 
show that the proposed mapping technique can decrease the 
activations memory by up to 32.9%, reducing the overall memory 
for the entire network by up to 23.9% compared to traditional ping-
pong buffering. For higher resolution de-noising networks, we 
achieve activation memory savings of 48.8%. Additionally, we 
implement a face detector network on an FPGA-based camera to 
validate these memory savings on a complete end-to-end system.  
 
Index Terms— Convolutional neural networks, hardware 
accelerator, memory requirements, lower bound 
 
I. INTRODUCTION 
Convolutional neural networks (CNNs) are the key
components of today’s state-of-the-art object detectors and
classifiers in computer vision. Performing inference with a
CNN is a highly data-intensive task in which input activations,
starting with an input image, are convolved with kernels
consisting of learned weights, summed with bias parameters
and fed through an activation function. The layered structure of
CNNs allows them to be processed sequentially, layer by layer.
This is beneficial in terms of memory requirements because at
most two subsequent activation layers, as opposed to all of them,
have to be stored at any point in time: the inputs from the
preceding layer are needed for the convolution with the kernels,
while the results of these computations (output activations)
must be buffered to serve as inputs for processing the
following layer. Because consecutive layers write their results
alternately into one of two activation memory sections, this
pattern is called ping-pong processing. The network parameters
(weights and biases) are reused at every inference of the
network and should thus be kept in local memory to avoid
costly data reloading from external memory. To store all
network data on-chip for layer-wise CNN processing, the
memory must be large enough to hold the constant
parameters and the largest pair of successive input and output
activations, as shown in Fig. 1 (a). With the traditionally used
ping-pong buffering technique, these activations are
alternately mapped to disjoint memory regions, such that
the worst-case pair of activations amounts to the maximum sum
of any two subsequent layers [1]. For CNN accelerators on
resource-limited platforms, like field-programmable gate arrays
(FPGA) [2, 3], this constraint largely limits the maximum
network size that can be processed. On-chip static random-
access memories (SRAM) dominate the area of today’s CNN
accelerator designs (e.g. around 1 mm² for 1 MB in 22 nm
technology) [4]. Reducing the memory size therefore directly
reduces the chip area and thus the chip cost. Additionally, large
memories increase the static power consumption due to
leakage, and their energy per access also grows with size,
surpassing the energy consumed by processing the fetched
data by a factor of more than 25x [5, 6]. Thus, it is
essential to tailor the on-chip memory size to the targeted
networks’ needs.

Petar Jokic is with the Swiss Federal Institute of Technology, ETH Zurich,
8092 Zurich, Switzerland and CSEM SA, 8005 Zurich, Switzerland (email:
petar.jokic@csem.ch).
Stephane Emery is with CSEM SA, 8005 Zurich, Switzerland (email: 
stephane.emery@csem.ch). 
Luca Benini is with the Swiss Federal Institute of Technology, ETH Zurich, 
8092 Zurich, Switzerland and the University of Bologna, 40126 Bologna, Italy 
(email: lbenini@iis.ee.ethz.ch). 
Petar Jokic, Member, IEEE, Stephane Emery, Member, IEEE, and Luca Benini, Fellow, IEEE 
Fig. 1.  Memory allocation of the traditional (a) and the proposed (b)
activations mapping approach, visualizing the introduced overlap. Both
panels stack application data (reserved), parameter data (constant) and
the activation data of layers (n) and (n+1); in (b) the two activation
regions overlap, which yields the memory savings.

This work presents a CNN memory mapping method that
allows activation regions of subsequent layers to overlap, as
shown in Fig. 1 (b), and thus utilizes the memory more efficiently
than the traditionally used ping-pong buffering technique. It
is based on a mathematical model for computing the maximum
overlap and thus the lower bound of on-chip memory needed to
perform layer-wise processing of convolutional neural networks.
This is especially attractive for newer networks where the
memory is dominated by activations. The resulting memory size
can be used to determine the minimum memory requirements for
a new accelerator design or to optimize a given network to
efficiently utilize the memory resources of an existing
accelerator. Our experiments show activation memory savings of up to
32.9% for real-world object detector CNNs and up to 48.8% for
high-resolution de-noising CNNs when compared to traditional
ping-pong buffering.
II. IMPROVING CNN MEMORY UTILIZATION 
The traditional ping-pong mapping of activations ensures 
that the outputs of a layer do not overwrite any of its input 
activations, because they might still be needed for pending 
computations. But allocating two separate regions for this 
reason is too pessimistic, unnecessarily restricting the allowed 
network size, keeping the memory utilization low and thus the 
power consumption as well as cost high. In the following 
sections, we show how the data access pattern of CNNs can be 
exploited to improve the memory utilization of accelerators. 
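As a point of reference, the ping-pong baseline described above can be sketched in a few lines of Python. The function name and the example layer sizes are illustrative assumptions, not values from the evaluated networks; the sketch only encodes the rule from [1] that the activations buffer must hold the largest sum of two consecutive layers.

```python
from typing import Sequence


def ping_pong_activation_words(layer_sizes: Sequence[int]) -> int:
    """Activation words needed by ping-pong buffering (hypothetical helper).

    layer_sizes[i] is the number of activation words of layer i, with
    layer_sizes[0] being the network input.
    """
    if len(layer_sizes) < 2:
        raise ValueError("need at least an input and one output layer")
    # Input and output of a layer live in disjoint regions, so the buffer
    # must fit the largest pair of consecutive layers.
    return max(a + b for a, b in zip(layer_sizes, layer_sizes[1:]))


# Illustrative sizes only (not taken from any evaluated network).
sizes = [1200, 4800, 4800, 1200, 300]
print(ping_pong_activation_words(sizes))  # 9600: the two 4800-word layers
```

In this toy case the buffer is sized by the two back-to-back 4800-word layers, even though every other layer pair would fit in far less memory.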
A. CNN data access pattern 
To determine the memory requirements for computing an 
entire CNN inference, we need to understand the data access 
pattern of this process. Fig. 2 visualizes the computational 
structure of a CNN layer, convolving an input feature map in
(of size W_in ∙ H_in ∙ C_in) with a weight kernel k (C_out kernels of
size K_x ∙ K_y ∙ C_in), producing an output feature map out (of size
W_out ∙ H_out ∙ C_out). To convolve the whole input feature map,
input activations are accessed in a sliding window operation,
moving the kernel-sized window across the x/y plane. This
operation can be represented with the six nested loops shown in
Fig. 3. The window moves in strides of S_x and S_y in x- and y-
direction, respectively. Inputs can be padded with P_x and P_y
zero-pixels on each side in x- and y-direction.
Input data are stored in memory in the depth-first order: all C_in
input channels of an input pixel are followed by all entries
of the neighboring input pixels in the x-direction. At the end of 
a row, the following rows in the y-direction are appended. This 
simplifies the addressing scheme and keeps the number of 
cycles between data reuse low by following the window pattern. 
Minimizing this so-called reuse distance [7] is important as it 
allows a specific memory entry to be overwritten as soon as 
possible, freeing space for new data. Because output feature 
maps will serve as inputs in the next layer, they must have the 
same memory order as the input feature maps. 
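The access pattern and memory order described in this subsection can be made concrete with a small runnable sketch. It is a plain-Python rendering of the six-loop structure with zero padding and flat, depth-first ordered buffers; the function names and the flat kernel layout are assumptions made for this illustration and are not taken from the authors' implementation.

```python
def depth_first_address(x, y, c, width, channels):
    """Flat address of element (x, y, c): channels first, then x, then y."""
    return (y * width + x) * channels + c


def conv_layer(inp, kern, w_in, h_in, c_in, c_out, k_x, k_y,
               s_x=1, s_y=1, p_x=0, p_y=0):
    """Six-loop convolution on flat, depth-first ordered lists.

    inp  : list of w_in * h_in * c_in values
    kern : list indexed as ((co * k_y + ky) * k_x + kx) * c_in + ci
           (kernel layout assumed for this sketch)
    """
    w_out = (w_in + 2 * p_x - k_x) // s_x + 1
    h_out = (h_in + 2 * p_y - k_y) // s_y + 1
    out = [0] * (w_out * h_out * c_out)
    for y_out in range(h_out):                  # window moves down last
        for x_out in range(w_out):              # window moves along x first
            for co in range(c_out):             # all outputs of one window
                acc = 0                         # accumulation reset
                for ky in range(k_y):
                    for kx in range(k_x):
                        for ci in range(c_in):
                            x = x_out * s_x - p_x + kx
                            y = y_out * s_y - p_y + ky
                            # Positions outside the input are zero padding.
                            if 0 <= x < w_in and 0 <= y < h_in:
                                a = depth_first_address(x, y, ci, w_in, c_in)
                                w = ((co * k_y + ky) * k_x + kx) * c_in + ci
                                acc += inp[a] * kern[w]
                out[depth_first_address(x_out, y_out, co, w_out, c_out)] = acc
    return out, w_out, h_out


# 3x3 single-channel input, one 2x2 kernel of ones, no padding, stride 1.
out, w_out, h_out = conv_layer(list(range(1, 10)), [1] * 4, 3, 3, 1, 1, 2, 2)
print(out)  # [12, 16, 24, 28]
```

Because outputs are written in the same depth-first order in which inputs are consumed, the output buffer can directly serve as the input of the next layer.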
B. Model for optimized memory utilization 
To optimize the memory utilization in CNN processors, we
propose a memory mapping method that allows the activation
memory regions of consecutive layers to overlap. Fig. 1
shows a simplified memory map that compares the traditional
memory allocation (a) with our proposed approach (b). While
(a) keeps each layer’s activations in separate, disjoint
regions, (b) allows the activation regions to partially
overlap, resulting in large memory savings.
If two subsequent layers have overlapping memory regions,
the allocation method must prevent output activations from
overwriting data of the preceding layer that is still needed for
pending computations. This constraint can be met by
(negatively) offsetting the output write pointer in such a way
that it never reaches the input read pointer during the
computation of any layer in the network. This concept is
depicted in Fig. 4. At the beginning of each layer computation
(here denoted as t=0), the distance between the input activations
read pointer p_r and the output activations write pointer p_w is
set to an optimized offset. As the sliding window for computing
the convolution operation moves on, pointer p_w writes results
and increments in the direction of p_r. Pointer p_r moves accordingly
on the input activations region, away from p_w. The underlying
idea of this memory mapping is the locality of the convolution
operation: the activations for each window position are only
read from a small, connected region of the input layer, which
itself is slid over the inputs in a continuous fashion. Because the
corresponding memory data are ordered in the same way as they
appear in this sliding operation, most parts of the processed
input data will never be used again and can thus be overwritten
by resulting output activations.

Fig. 2.  Visualization of the 2D convolution operation in a CNN layer with
relevant dimensions of the input and output activations as well as the kernel.

for y_out in 0 to H_out(i)
  for x_out in 0 to W_out(i)
    for c_out in 0 to C_out(i)
      for k_y in 0 to K_y(i)
        for k_x in 0 to K_x(i)
          for c_in in 0 to C_in(i)
            y = y_out ∙ S_y(i) − P_y(i)
            x = x_out ∙ S_x(i) − P_x(i)
            out(y_out, x_out, c_out) +=
                in(y + k_y, x + k_x, c_in) ∙ k(k_y, k_x, c_in, c_out)
Fig. 3.  Computation loops of a CNN layer i (omitting the accumulation reset,
the input padding and the activation function at the end of each c_out loop).

Fig. 4.  Simplified memory map with current pointer positions at two different
points in time (t=0 and t=X). The left figures show the current position of the
sliding window while the right figures visualize the memory content (including
window data): the activations of layer (n) that are still to be read, the
input activations that are no longer needed, the activations of layer (n+1)
and other (reserved) memory regions.

The maximum overlap of the two activation regions is found
by mathematically describing the pointer positions and
optimizing their relative offset distance at the beginning of each layer
such that the total memory is minimal while constraining the
write pointer to be smaller than the read pointer, avoiding any
overwriting of still needed data. To meet this constraint, both
pointer positions must be known for every point in time. They
can be calculated from their starting points and velocities,
derived from the network characteristics. We model the pointer
positions (addresses) as a function of time, assuming one
multiply-accumulate (MAC) operation per clock cycle (t) and
that activation data is stored in the depth-first order described
above. The sliding window follows this pattern and only moves
on once all outputs for a certain window position are computed.
Equation (1) represents the velocity of pointer p_w, which
advances one position per calculation of a single kernel
convolution (or C_in ∙ K_x ∙ K_y MAC operations). The address only
increments once the full kernel is computed, which can be
mathematically represented by rounding down the integral of
its speed over time as shown in (2). The formula for p_r takes
the padding of the input layer, stride width, and the behavior
during sliding window movements into account. It is sufficient
to look at the lowest address of the input window (which would
collide with p_w at the earliest), simplifying the formula of its
average velocity to (3). In the resulting p_r formula (4), the y-
direction stride is implemented with rounding operations,
causing the pointer to skip some rows when moving in the y-
direction. The minimum memory M_i required for mapping
the activations of two subsequent CNN layers to the memory
can then be determined from (2) and (4), as shown in (5). It is
given by the sum of the input activations space A_i and the
minimum offset difference (p_r0 − p_w0) for which p_r(t) is
larger than p_w(t) throughout the entire computation of a layer i
(during the interval T_i). From (5), the lower bound of
activations memory required for computing the entire CNN, M,
can be derived by finding the minimum memory size that
supports all layers of the network, as shown in (6).
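As a worked illustration of how the write pointer of (1) and (2) advances, consider a hypothetical layer with C_in = 16 input channels and 3x3 kernels. These numbers are chosen for the example only and do not correspond to a specific evaluated network; p_{w0} denotes the write pointer's start address.

```latex
\begin{align*}
v_{p_w} &= \frac{1}{C_{in} \cdot K_x \cdot K_y}
         = \frac{1}{16 \cdot 3 \cdot 3} = \frac{1}{144}, \\
p_w(t)  &= \left\lfloor v_{p_w} \cdot t \right\rfloor + p_{w0}, \\
p_w(1000) &= \left\lfloor \tfrac{1000}{144} \right\rfloor + p_{w0} = 6 + p_{w0}.
\end{align*}
```

Under the one-MAC-per-cycle assumption, the write pointer thus rests on each address for 144 cycles, and after 1000 cycles six complete output values have been written.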
The worst-case scenario for memories in layer-wise CNN 
accelerators is a network with two maximum-sized layers back-
to-back, requiring an activations buffer of twice the maximum 
layer size when using the traditional ping-pong buffering. For 
the same scenario, our method can reduce the memory needs by 
almost 50% if each input pixel creates equally many output 
pixels, keeping pointers at a constant short offset. This 
represents the theoretical upper savings limit of the proposed 
technique. Our model assumes one datum per memory word but 
can be easily transferred to multiple data entries per word by 
linearly scaling down the speed of each pointer accordingly. 
Note that for residual layers, A_i must additionally include
activations of identity connections in parallel to the convolutions.
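The overlap idea can also be checked numerically with a brute-force sketch that walks over all window positions instead of evaluating the closed-form pointer equations. This is not the authors' formulation: the class and function names are hypothetical, one datum per memory word is assumed, outputs are assumed to be packed sequentially, and all outputs of a window are conservatively treated as written while that window's inputs are still needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    w_in: int
    h_in: int
    c_in: int
    c_out: int
    k_x: int
    k_y: int
    s_x: int = 1
    s_y: int = 1
    p_x: int = 0
    p_y: int = 0

    @property
    def w_out(self) -> int:
        return (self.w_in + 2 * self.p_x - self.k_x) // self.s_x + 1

    @property
    def h_out(self) -> int:
        return (self.h_in + 2 * self.p_y - self.k_y) // self.s_y + 1

    @property
    def a_in(self) -> int:
        return self.w_in * self.h_in * self.c_in

    @property
    def a_out(self) -> int:
        return self.w_out * self.h_out * self.c_out


def min_pointer_offset(layer: ConvLayer) -> int:
    """Smallest p_r0 - p_w0 such that no output overwrites a needed input."""
    lows = []  # lowest in-bounds input address touched by each window
    for y_out in range(layer.h_out):
        y_top = max(0, y_out * layer.s_y - layer.p_y)
        for x_out in range(layer.w_out):
            x_left = max(0, x_out * layer.s_x - layer.p_x)
            lows.append((y_top * layer.w_in + x_left) * layer.c_in)
    # Lowest address needed by this or any later window (suffix minimum).
    for j in range(len(lows) - 2, -1, -1):
        lows[j] = min(lows[j], lows[j + 1])
    # All c_out outputs of window j must land below that address.
    return max((j + 1) * layer.c_out - low for j, low in enumerate(lows))


def overlapped_words(layer: ConvLayer) -> int:
    """Activation words for one layer with overlapping regions."""
    return layer.a_in + min_pointer_offset(layer)


def ping_pong_words(layer: ConvLayer) -> int:
    """Activation words for one layer with disjoint regions."""
    return layer.a_in + layer.a_out


def network_words(layers) -> int:
    """Buffer size that supports every layer (maximum over the layers)."""
    return max(overlapped_words(layer) for layer in layers)


layer = ConvLayer(w_in=8, h_in=8, c_in=4, c_out=4, k_x=3, k_y=3, p_x=1, p_y=1)
print(ping_pong_words(layer), overlapped_words(layer))  # 512 296
```

For this small 3x3 layer the required offset is roughly one input row, so the buffer shrinks from 512 to 296 words. For a 1x1 layer with equal input and output size the offset drops to a single pixel's worth of outputs, which approaches the 50% limit discussed above.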
\[ v_{p_w} = \frac{1}{C_{in} \cdot K_x \cdot K_y} \tag{1} \]

\[ p_w(t) = \left\lfloor v_{p_w} \cdot t \right\rfloor + p_{w0} \tag{2} \]

\[ v_{p_r} =
\underbrace{\frac{S_x \cdot C_{in}}{C_{out} \cdot (C_{in} \cdot K_x \cdot K_y)}}_{x\text{-speed}}
+ \underbrace{\frac{(S_y - 1) \cdot C_{in} \cdot W_{in}}{\left(\left\lfloor \frac{2 \cdot P_x + W_{in} - K_x}{S_x} \right\rfloor + 1\right) \cdot C_{out} \cdot (C_{in} \cdot K_x \cdot K_y)}}_{y\text{-speed}}
\tag{3} \]

\[ p_r(t) = \max\Bigg( 0,\;
\underbrace{\left\lfloor \frac{1}{C_{out} \cdot (C_{in} \cdot K_x \cdot K_y)} \cdot t \right\rfloor \cdot (S_x \cdot C_{in})}_{x\ \text{position}}
+ \underbrace{\left\lfloor \frac{1}{\left(\left\lfloor \frac{2 \cdot P_x + W_{in} - K_x}{S_x} \right\rfloor + 1\right) \cdot C_{out} \cdot (C_{in} \cdot K_x \cdot K_y)} \cdot t \right\rfloor \cdot \left((S_y - 1) \cdot C_{in} \cdot W_{in}\right)}_{y\ \text{position}}
- \underbrace{P_y \cdot W_{in} \cdot C_{in}}_{\text{top padding}}
- \underbrace{\left\lceil \frac{1}{\left(\left\lfloor \frac{2 \cdot P_x + W_{in} - K_x}{S_x} \right\rfloor + 1\right) \cdot C_{out} \cdot (C_{in} \cdot K_x \cdot K_y)} \cdot t \right\rceil \cdot \max\left(0,\; \left(\left(\left\lfloor \frac{2 \cdot P_x + W_{in} - K_x}{S_x} \right\rfloor + 1\right) \cdot \ldots - \ldots\right) \cdot (\ldots \cdot \ldots)\right)}_{\text{side padding offset}}
\Bigg) + p_{r0} \tag{4} \]

\[ M_i = \min_{\substack{t \in T_i \\ p_r(t) > p_w(t)}} \left( A_i + (p_{r0} - p_{w0}) \right), \qquad
T_i = \left[ 0,\; \left(\left\lfloor \frac{2 \cdot P_x + W_{in} - K_x}{S_x} \right\rfloor + 1\right) \cdot \left(\left\lfloor \frac{2 \cdot P_y + H_{in} - K_y}{S_y} \right\rfloor + 1\right) \cdot C_{out} \cdot (C_{in} \cdot K_x \cdot K_y) - 1 \right] \tag{5} \]

\[ M = \max_{i \in \mathcal{L}} (M_i) \tag{6} \]

III. EXPERIMENTS AND RESULTS 
We evaluate the memory savings of the presented method on
four real-world CNNs: the 9-layer DLIB face detector [8], the
12-layer YOLO Lite [9], the 20-layer DMCNN-VD 3x3 [10] and the
12-layer MobileNetv2 [11]. The first three have an input
resolution of 640x640, while the input of MobileNetv2 is
224x224x3. Table I presents the memory savings of our
proposed method compared to traditional ping-pong buffering.
Our approach saves between 19.6% and 48.8% of
activations memory in the evaluated networks, achieving total
memory savings (including parameters) of 6.2% to 48.2%. The
lowest overall savings are found in MobileNetv2, where
parameters dominate the memory due to the deep architecture
and the small image size. It must be noted that we compute
MobileNetv2 in a strictly layer-wise manner, while [11]
suggests that operations of some intermediate layers could be
concatenated without buffering the respective layers entirely.
Many recent networks feature small kernels and larger images,
increasing the dominance of activations in memory and thus the
memory savings. This can be seen in the 20-layer DMCNN-VD
with small 3x3 kernels, yielding 48.2% total memory savings.
We note that our technique still offers significant (32.9%)
activation memory savings for smaller networks, such as DLIB.
To validate the memory savings in a real application, we 
employ our method for implementing a DLIB face detector 
CNN [8] on a configurable FastEye camera [12]. The camera hosts a
1-megapixel image sensor and a Xilinx XC7K325T FPGA. We
extend the existing data path, which implements the sensor readout
and a USB interface, with a simple CNN processing state
machine and a block memory (BRAM) consisting of 36 kbit
blocks for parameters and activations. The image from the
sensor is cropped to 640x640 pixels and stored in the
BRAM. Triggered by the image, the state machine processes
the network layer-wise as described in Fig. 3. Without any
processing optimizations, the maximum CNN inference rate is 
0.5 frames per second at 100 MHz clock frequency. Weights 
and activations are quantized to 16 bits. The output of the last 
CNN layer gets transferred via USB to a computer for post-
processing of the resulting bounding boxes. Two different 
system configurations are implemented, a) with the standard 
ping-pong mapping, and b) with our proposed memory 
mapping technique, differing only in the BRAM size and the 
address generation. Both FPGA implementations successfully 
perform on-camera face detection on acquired images. Table II 
states the utilization report of the initial FPGA firmware and the 
two CNN-extended versions. Comparing the resources added to 
the initial camera firmware, the proposed memory mapping (b) 
shows memory savings of 23% and power savings of 20% with 
respect to the standard memory allocation (a). The number of
used flip-flops (FF), look-up tables (LUT) and signal processors
(DSP) in the FPGA remains almost constant. The measured memory
savings confirm the theoretical value (23.9%), differing by only
0.9 percentage points, which is due to the limited block
granularity of the memory macro.
 
TABLE I
RESULTS OF EVALUATED NETWORKS
Network name          Parameters   Activations [# words]    Mem. savings: activations only
                      [# words]    Standard    This work    (total network)
DLIB face det. [8]    229.8k       614.4k      412.2k       32.9% (23.9%)
YOLO Lite [9]         443.0k       16.4M       13.1M        19.9% (18.7%)
MobileNetv2 [11]      3.3M         1.5M        1.2M         19.6% (6.2%)
DMCNN-VD [10]         668.2k       53.7M       27.5M        48.8% (48.2%)
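The two savings columns of Table I follow from the word counts in the same row. The short sketch below recomputes them for two rows; the function name is hypothetical, and because the table entries are rounded, the recomputed percentages only approximate the printed ones.

```python
def savings_percent(params, act_standard, act_proposed):
    """Return (activation savings, total network savings) in percent."""
    saved = act_standard - act_proposed
    activation_only = 100.0 * saved / act_standard
    # Parameters are identical in both mappings, so they only enlarge
    # the baseline against which the saved words are measured.
    total_network = 100.0 * saved / (params + act_standard)
    return activation_only, total_network


# Word counts as listed in Table I (rounded values).
dlib = savings_percent(229.8e3, 614.4e3, 412.2e3)
dmcnn = savings_percent(668.2e3, 53.7e6, 27.5e6)
print("DLIB:     %.1f%% activations, %.1f%% total" % dlib)
print("DMCNN-VD: %.1f%% activations, %.1f%% total" % dmcnn)
```

The comparison also shows why the total savings stay close to the activation savings only when activations dominate the memory, as in DMCNN-VD.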
 
TABLE II
FPGA UTILIZATION REPORT AND POWER MEASUREMENT OF THE CAMERA
Configuration      LUT           FF            DSP       BRAM        Power
Cam only           28.6k (14%)   82.4k (20%)   8 (1%)    12 (3%)     12.61 W
a) Cam + CNN       52.9k (26%)   98.3k (24%)   13 (2%)   428 (96%)   13.20 W
b) Cam + CNN       52.5k (26%)   98.3k (24%)   13 (2%)   332 (75%)   13.08 W
Savings: b vs. a   2%            0%            0%        23%         20%
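The savings row of Table II compares only the resources added on top of the initial camera firmware. A minimal sketch of that arithmetic, using a hypothetical helper name and the BRAM and power entries of the table, could look as follows.

```python
def added_savings_percent(base, standard, proposed):
    """Savings of configuration b) versus a), relative to what a) adds to base."""
    return 100.0 * (standard - proposed) / (standard - base)


# BRAM blocks and power (in W) taken from Table II.
bram = added_savings_percent(12, 428, 332)
power = added_savings_percent(12.61, 13.20, 13.08)
print("BRAM savings:  %.1f%%" % bram)
print("Power savings: %.1f%%" % power)
```

Measured against the full design instead of the added resources, the same differences would appear much smaller, because the camera firmware dominates the power consumption.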
IV. RELATED WORK 
Stoutchinin et al. [7] present an optimal model search
approach that outputs optimized CNN loop order, tiling and
buffer size parameters to reduce access to external memories.
They achieve memory bandwidth reductions of up to 14x
compared to previous implementations. Yang et al. [6] propose an
analytical approach to model data locality in CNNs to find the
optimal blocking strategy that maximizes the energy efficiency
of an accelerator. Other works like [1] focus on optimizing data
movements between internal and external memory while using
traditional ping-pong buffering for on-chip memory. These
approaches either do not consider cases with all activations
stored on-chip or base their models on the inefficient ping-pong
buffering. In contrast, we provide a more efficient activations
mapping that can be used on any platform and only requires the
adaptation of the addressing scheme. While we focus on standard
convolutions, networks with separable convolutions [11] allow
intermediate layers to be stored only partially, reducing the
memory footprint of intermediate layers with many channels.
V. CONCLUSION 
This work presented the mathematical model of the lower
memory bound for buffering activations in layer-wise
convolutional neural network accelerators using overlapping activation
regions. We show that the mapping method derived from this
model can utilize the memory more efficiently than the standard
ping-pong buffering method. This allows reducing the required
on-chip memory size of new accelerator designs or mapping
larger networks to existing resource-limited implementations.
Experimental results on real-world CNNs show that the 
activations memory space can be reduced by up to 48.8%, and 
the overall network memory needs by up to 48.2%. 
REFERENCES 
[1]  K. Siu, D. M. Stuart, M. Mahmoud, and A. Moshovos, “Memory 
requirements for convolutional neural network hardware accelerators,” 
in Proc. IISWC, Raleigh, NC, USA, 2018.  
[2]  S. Moini, B. Alizadeh, M. Emad, and R. Ebrahimpour, “A Resource-
limited hardware accelerator for convolutional neural networks in 
embedded vision applications,” IEEE Trans. Circuits Syst. II Exp. 
Briefs, vol. 64, no. 10, pp. 1217-1221, Oct. 2017.  
[3]  A. A. Gilan, M. Emad, and B. Alizadeh, “FPGA-based implementation 
of a real-time object recognition system using convolutional neural 
network,” IEEE Trans. Circuits Syst. II Exp. Briefs, pp. 1-1, 2019.  
[4]  E. Karl et al., “A 4.6 GHz 162 Mb SRAM design in 22 nm tri-gate 
CMOS technology with integrated read and write assist circuitry,” 
IEEE J. of Solid-State Circuits, vol. 48, no. 1, pp. 150-158, Jan. 2013.  
[5]  M. Horowitz, “1.1 Computing's energy problem (and what we can do 
about it),” in Proc. ISSCC, San Francisco, CA, USA, 2014.  
[6]  X. Yang et al., “A systematic approach to blocking convolutional 
neural networks,” arXiv:1606.04209 [cs.DC], 2016.  
[7]  A. Stoutchinin, F. Conti, and L. Benini, “Optimally scheduling CNN 
convolutions for efficient memory access,” arXiv:1902.01492 [cs.NE], 
2019.  
[8]  D. King, “DLIB CNN face detector,” 2018. [Online]. Available: 
https://github.com/davisking/dlib-models 
[9]  J. Pedoeem and R. Huang, “YOLO-LITE: a real-time object detection 
algorithm optimized for non-GPU computers,” arXiv:1811.05588 
[cs.CV], 2018.  
[10] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, “GraphPIM: 
enabling instruction-level PIM offloading in graph computing 
frameworks,” in Proc. HPCA, Austin, TX, USA, 2017.  
[11] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, 
“Inverted Residuals and Linear Bottlenecks: Mobile Networks for 
Classification, Detection and Segmentation,” arXiv:1801.04381 
[cs.CV], 2019.  
[12] P. Jokic et al., “FastEye-A 1 MP high-speed camera with multiple ROI 
running at up to 64'000 fps,” CSEM Scientific and Technical Report, 
2019, [Online]. Available: https://www.csem.ch/Doc.aspx?id=49356 
 
