Convolutional Neural Networks (CNNs) have proven effective at extracting information from data, especially in the field of computer vision. Their computational complexity calls for hardware acceleration. The challenge in designing hardware accelerators for CNNs is to provide sustained throughput with low power consumption, for which FPGAs have attracted the community's attention. In CNNs, pooling layers are introduced to reduce the spatial dimensions of the model. This work explores the influence of modifying pooling layers in some state-of-the-art CNNs, namely AlexNet and SqueezeNet. The objective is to optimize hardware resource utilization without a negative impact on inference accuracy.
INTRODUCTION
The evolution of Convolutional Neural Networks (CNNs) has displaced traditional image processing algorithms in classification, detection, and segmentation tasks because of their superior performance. Their higher precision comes with higher memory and computational requirements, which has inspired the development of application-tailored accelerators. With shorter development time and lower cost than ASICs, FPGAs have become prominent platforms for accelerating CNNs, achieving better energy efficiency than CPUs or GPUs. In practice, the power-saving objective translates into strategies that minimize data movement, including model and data size reduction as well as data reuse optimization.
Pooling layers are normally inserted in CNN topologies to reduce their size and number of MAC operations. Pooling layer parameters vary on a per-model and per-layer basis, forcing accelerator architectures to be reconfigurable. To cope with the need for reconfigurability arising from the diverse structure of CNN layers, Network-on-Chip (NoC) based accelerators built from 2D systolic meshes of Processing Elements (PEs) linked by routing blocks are commonly used [3]. To reduce external memory accesses and improve data reuse, PEs include local buffering elements (P_B) for both fmaps and filters [4]. This paper proposes an adaptable implementation of a pooling block for FPGAs.
MAPPING CNNS ON FPGAS
A CNN model can be decomposed into a stack of heterogeneous layers. The exploration of CNN design, both in terms of layer description and organization, has resulted in several distinct models that aim to increase inference accuracy while running on embedded devices, such as those presented in Table 1. To perform an inference, neurons are arranged in 2D maps (named feature maps or fmaps) of dimensions H×W packed in blocks of C channels. These are convolved with N filters of size C×K_x×K_y applied with a stride S and border padding of size P. The calculation of an output fmap can be expressed as follows:
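A common way to write this computation with the symbols introduced in this section is sketched below. The index names n, c, x, y, i, and j are assumptions made for illustration, with 0 ≤ n < N, and the input I is taken to be already zero-padded by P on each border.

```latex
O[n][x][y] = B[n] + \sum_{c=0}^{C-1} \sum_{i=0}^{K_x-1} \sum_{j=0}^{K_y-1}
             I[c][S \cdot x + i][S \cdot y + j] \; W[n][c][i][j]
```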
Where I: input fmaps, W: filters or weights, O: output fmaps, and B: bias term. Activation layers are typically included after convolutions to apply a nonlinear function f(·), such as ReLU, to the fmaps.
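The convolution and activation described above can be sketched in plain Python as a functional reference. This is a minimal sketch, not the accelerator's implementation: the function name `conv_layer`, the nested-list data layout, and the choice to let K_x run along rows are assumptions made for illustration.

```python
def conv_layer(inp, weights, bias, stride=1, pad=0):
    """Direct convolution of C input fmaps with N filters, followed by ReLU.

    inp:     C x H x W nested lists (input fmaps)
    weights: N x C x Kx x Ky nested lists (filters)
    bias:    list of N bias terms
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    N, Kx, Ky = len(weights), len(weights[0][0]), len(weights[0][0][0])

    # Zero-pad every channel by `pad` on each border.
    Hp, Wp = H + 2 * pad, W + 2 * pad
    padded = [[[0] * Wp for _ in range(Hp)] for _ in range(C)]
    for c in range(C):
        for r in range(H):
            for q in range(W):
                padded[c][r + pad][q + pad] = inp[c][r][q]

    out_h = (Hp - Kx) // stride + 1
    out_w = (Wp - Ky) // stride + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(N)]
    for n in range(N):
        for x in range(out_h):
            for y in range(out_w):
                acc = bias[n]
                # Multiply-accumulate over channels and the Kx x Ky window.
                for c in range(C):
                    for i in range(Kx):
                        for j in range(Ky):
                            acc += (padded[c][stride * x + i][stride * y + j]
                                    * weights[n][c][i][j])
                out[n][x][y] = max(acc, 0)  # ReLU activation
    return out


fmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # C=1, H=W=3
filt = [[[[1, 0], [0, 1]]]]                  # N=1, C=1, Kx=Ky=2
print(conv_layer(fmap, filt, [0]))           # [[[6, 8], [12, 14]]]
```

The six nested loops make the MAC count per layer explicit: N·C·K_x·K_y operations for every output position.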
Figure 1: Network-on-Chip mapping and pooling halos
Pooling layers downsample the input fmap dimensions by a factor of S_P by applying filters on windows of size K_P×K_P. Convolutional layers dominate model calculations, accounting for 90% of them. For this reason, recent work on mapping CNNs to RTL focuses on optimizing their execution by exploiting either inter-layer [5] or intra-layer [6] parallelism.
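As a functional reference for the pooling operation, the following sketch downsamples a single fmap with a K_P×K_P window and stride S_P. The function name `pool2d`, the absence of padding, and the floor-based output size are assumptions made for illustration; some frameworks round the output size up instead.

```python
def pool2d(fmap, k, s, mode="MAX"):
    """Downsample one 2D fmap (list of rows) with a k x k window and stride s."""
    h, w = len(fmap), len(fmap[0])
    out_h = (h - k) // s + 1
    out_w = (w - k) // s + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Gather the k x k window whose top-left corner is (s*r, s*c).
            window = [fmap[s * r + i][s * c + j]
                      for i in range(k) for j in range(k)]
            if mode == "MAX":
                row.append(max(window))
            elif mode == "AVG":
                row.append(sum(window) / len(window))
            else:
                raise ValueError("mode must be 'MAX' or 'AVG'")
        out.append(row)
    return out


fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(pool2d(fmap, k=2, s=2, mode="MAX"))  # [[6, 8], [14, 16]]
print(pool2d(fmap, k=2, s=2, mode="AVG"))  # [[3.5, 5.5], [11.5, 13.5]]
```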
HARDWARE IMPLEMENTATION OF THE POOLING BLOCK
The total number of elements in a CNN layer rarely fits entirely within the limited FPGA hardware resources for fully simultaneous computation; this is solved by partitioning the CNN model. To reduce the required buffer allocation, CNN accelerator designs benefit from merging convolution and pooling layers, thus avoiding intermediate fmap storage. The pooling block must be able to dynamically adjust its behavior according to the NoC mapping, pooling kernel size K_P, stride S_P, and type T_P. To the best of our knowledge, the pooling parameter values of current CNNs are narrowed down to a few combinations: K_P=[2,3], S_P=2, and T_P=[MAX, AVG]. Depending on the accelerator configuration, on-the-fly pooling calculation can demand halo data buffering, which is the case depicted in Fig. 1. Fig. 2 details the RTL model of a pooling block compatible with the parameters in Table 1, which can be further simplified if the range of allowed K_P values is reduced from [2, 3] to [2]. The required depth of the halo storage blocks (halo_DEPTH) is proportional to the inter-layer max(N·H) in the case of Halo-C, and to max(N) for Halo-R. Depending on the fmap bit resolution and the features of the platform memory blocks, the required number of buffers per halo is:
Where halo_P is the number of halo buffers that need to be read in parallel, determined by the NoC mapping strategy. For the CNN models described in Table 1, on the Zynq104 FPGA the amount of data to store makes LUTRAM the better choice for Halo-R buffers and BRAM for Halo-C, resulting in the hardware utilization shown in Table 2.
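One plausible way to estimate such a buffer count is sketched below. It is an illustrative assumption, not necessarily the expression used in this work: each memory block is assumed to be used in a fixed width × depth configuration, and the function name `halo_buffers`, its parameters, and the block geometry in the example are hypothetical.

```python
import math


def halo_buffers(halo_depth, fmap_bits, block_width, block_depth, halo_p=1):
    """Estimate the memory blocks needed for one halo (illustrative assumption).

    halo_depth:  number of fmap words to store (halo_DEPTH)
    fmap_bits:   bit resolution of one fmap word
    block_width: word width of one memory block, in bits
    block_depth: number of words one memory block can hold
    halo_p:      halo buffers that must be read in parallel (halo_P)
    """
    width_blocks = math.ceil(fmap_bits / block_width)   # blocks side by side
    depth_blocks = math.ceil(halo_depth / block_depth)  # blocks cascaded
    return halo_p * width_blocks * depth_blocks


# Hypothetical numbers: 3000 16-bit words, blocks of 1024 words x 18 bits,
# two buffers read in parallel.
print(halo_buffers(3000, 16, 18, 1024, halo_p=2))  # 6
```

Under this assumption, a shallow halo can fit in a small distributed memory while a deep one quickly consumes several dedicated memory blocks.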
CONCLUSION
A dynamically configurable pooling hardware module is presented that enables on-the-fly fmap down-sampling, thus reducing data movement, which in turn lowers power consumption.
State-of-the-art CNN models are retrained with pooling kernel K_P=2, with a positive impact on accuracy (Table 3), leading to further simplification of the pooling block.
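One effect of restricting K_P to 2 with S_P=2 can be illustrated in one dimension: the pooling windows no longer overlap, so no window spans two partitions when the tile size is even. The sketch below lists the windows that straddle a tile boundary. The function name `straddling_windows`, the one-dimensional partitioning, and the tile size are assumptions made for illustration and may not match the NoC mapping used in this work.

```python
def straddling_windows(length, tile, k, s):
    """Return start indices of 1D pooling windows that span two tiles.

    length: number of fmap elements along the partitioned axis
    tile:   number of elements mapped to each partition
    k, s:   pooling kernel size and stride
    """
    starts = range(0, length - k + 1, s)
    # A window straddles when its first and last elements fall in different tiles.
    return [st for st in starts if st // tile != (st + k - 1) // tile]


print(straddling_windows(length=8, tile=4, k=3, s=2))  # [2]
print(straddling_windows(length=8, tile=4, k=2, s=2))  # []
```

In this example, only the K_P=3 configuration produces a window that needs data from a neighbouring tile.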
