ABSTRACT
This paper presents a low-area and energy-efficient hardware accelerator for convolutional neural networks (CNNs). Based on a multiply-accumulate-based architecture, three design techniques are proposed to reduce the hardware cost of the convolutional computations. First, to reduce the computational cost of convolutions, an adaptive bit-width reduction combined with near-zero skipping is proposed based on the differential input method (DIM). The DIM-based design technique reduces the operation bit-width by 62.5% and improves the activation sparsity by 17.0% with negligible CNN accuracy degradation. Second, it has been found that adopting a bi-directional filtering window in a CNN accelerator can considerably reduce the energy for data movement, because far fewer memory accesses are needed. Third, to expedite the bi-directional filtering operations, we also propose a bi-directional first-input-first-output (bi-FIFO). Laid out in the manner of an SRAM bit-cell, the proposed bi-FIFO facilitates fast data re-distribution with area and energy efficiency. To verify the effectiveness of the proposed techniques, an AlexNet accelerator has been designed. The numerical results show that the proposed adaptive bit-width reduction scheme achieves area and energy savings of 34.6% and 58.2%, respectively. The bi-FIFO-based accelerator also improves the processing time by 32.8%.
I. INTRODUCTION
Recently, convolutional neural networks (CNNs) have achieved human-like performance in speech recognition and computer vision applications. In order to take advantage of the superior CNN performance in various applications, low-cost hardware implementation of the CNN algorithm is the fundamental requirement for the future intelligent platform. Many recent research efforts have been focused on low-area, high-throughput, and energy efficient CNN accelerator design [1] - [14] . Those works can be classified into the following categories: i) energy-efficient data flow design [1] - [3] , ii) reconfigurable and/or adaptive hardware parallelization [4] - [6] , and iii) co-design of CNN models and hardware [7] - [10] .
For the energy-efficient data flow of the CNN accelerator, since the external DRAM access consumes three orders of magnitude more energy than an add operation with same data bandwidth (32-bit) [8] , data movement should be minimized using a subdivided memory hierarchy. To minimize the data movement from DRAMs and large-scale embedded memories, the state-of-the-art CNN accelerators adopt local registers in processing elements [2] . As the reconfigurable and/or adaptive hardware parallelization approaches, adaptive network mapping strategy with various CNN shapes has been proposed to offer general purpose deep learning accelerator [6] . Typically, due to the inextricable correlation between data reusing methods and parallel architectures, many previous works [1] - [6] build data reusing method on their own architecture (that is optimally designed for the dedicated CNN).
For the co-design of CNN models and hardware, reducing the computation size (bit-width) or the number of computations is one of the main issues to address. In [8] - [14], by carefully quantizing the activations and/or weights, the computation bit-width is reduced while minimizing accuracy degradation and hardware cost. In [8] - [10], the activation sparsity, which indicates redundant data, is also exploited to skip a significant amount of storage and computation. Although those approaches can be stepping stones toward low-cost implementation of CNN algorithms, the hardware cost of existing CNN accelerators is still much larger than that of typical mobile processors, and further optimization of the CNN accelerator is still required.
In this work, the following three design approaches are proposed to move one step closer to a mobile-oriented CNN accelerator: i) data bit-width reduction and its collaborative design techniques, ii) bi-directional filtering window movement, and iii) circuit-level register file optimization. Contrary to the previous CNN model and hardware co-design approaches, the proposed data bit-width reduction technique not only reduces the hardware footprint but also adaptively deactivates unnecessary computations. In the proposed bit-width reduction approach, by combining the differential input method (DIM) [18] and the adaptive bit-width reduction (ABWR) scheme, the computation size can be effectively reduced. Thanks to the inherent low-frequency components of image data, the combination of DIM and near-zero skipping maximizes the activation sparsity, so more computations can be skipped in the CNN accelerator. For the energy-efficient data flow, it has been found that adopting a bi-directional filtering window in the CNN accelerator can considerably reduce the number of embedded memory accesses by allowing many more data reuses. To the best of our knowledge, the direction of the filtering window has not been studied in convolutional layer (CONV) design. As the third approach, in order to expedite the bi-directional filtering, a bi-directional first-input-first-output (bi-FIFO) is proposed. Based on an SRAM-like layout, the proposed bi-FIFO based CNN accelerator can offer much faster processing time with negligible area overhead.
The rest of the paper is organized as follows. In Section II, the previous CNN model and hardware co-design approaches are reviewed as background. The DIM and its collaborative schemes are proposed in Section III. Section IV presents the bi-directional filtering window and the bi-directional FIFO. The experimental results for an AlexNet [19] accelerator are presented in Section V. Finally, Section VI concludes this work.
II. THE CONVENTIONAL CNN MODELS AND HARDWARE CO-DESIGN APPROACHES
In this section, the previous CNN models and hardware co-design approaches are discussed. Considering the co-design approaches, the subsections are divided as the computation bit-width reduction, and the reduced number of operations and model size [23] .
A. COMPUTATION BIT-WIDTH REDUCTION
In the data processing of CNN, since the delicate computation precision of each convolution operation does not directly affect the model accuracy, the conventional co-design approaches try to reduce the computation bit-width while minimizing accuracy loss. The previous CNN models and hardware co-design approaches are summarized in Table 1 . In [24] , using the log 2 -based quantization levels, the computation bit-width is reduced while alleviating the accuracy loss. To further reduce the accuracy loss, incremental network quantization [25] iteratively re-trains the weight. As a different quantization approach, [26] adopts weight sharing based on the k-means algorithm. Alternatively, as shown in the dynamic fixed-point approach of Table 1 , an adaptive decimal point that is varied depending on the desired dynamic range is introduced in [10] and [14] . Since the activation function, such as ReLU, induces a quite different dynamic range, the dynamic fixed-point can effectively reduce the quantization error. On the other hand, by using the binarized activation and/or weight, the computation bit-width is aggressively reduced in [27] - [29] . In those approaches, due to the binary constraint, relatively poor accuracies on ImageNet are reported.
B. REDUCTION OF THE NUMBER OF COMPUTATION
Considering the activation sparsity, the computational cost of the CNN accelerator can also be reduced. As shown in Table 1, since the activation function, ReLU, induces a significant amount of zero-valued activations, the redundant data can be compressed or skipped during CNN processing. For example, as shown in Table 1 and [2], a simple run-length compression can be used to reduce the memory bandwidth. Similarly, the redundant read operations and convolutional computations can be skipped as shown in [8] - [10]. By pruning the low-valued activations, further computational cost can be saved [9]. However, since pruning itself cannot directly reduce the hardware cost, various co-design approaches have been developed to efficiently support pruned CNN models. Those approaches include different storage formats for sparse matrix-vector multiplication [30], a hardware-friendly pruning method [26], and network architectures [31].
III. THE PROPOSED COMPUTATION BIT-WIDTH REDUCTION AND COMPUTATION SKIPPING
A. DIFFERENTIAL INPUT METHOD (DIM)
As mentioned in Section II, the data bit-width primarily determines the overall cost of a CNN accelerator. We propose a data bit-width reduction technique aided by the differential input method (DIM) [18], which can be effectively adopted in 2D convolutional filtering operations, to reduce the hardware cost of the CNN accelerator. When the high-dimensional convolution of a CNN is split into 1-D convolutions, a pixel of the output feature map (oFMAP) can be expressed as
$$Y_j = \sum_{k=0}^{N-1} W_k X_{j-k} \qquad (1)$$
Here, W, X, and Y denote a filter weight, a pixel of the input feature map (iFMAP), and a pixel of the oFMAP, respectively. N and j indicate the total number of filter weights and the pixel index of a feature map, respectively. The previous output (Y_{j-1}) can be expressed as
$$Y_{j-1} = \sum_{k=0}^{N-1} W_k X_{j-k-1} \qquad (2)$$
Subtracting (2) from (1) and moving Y_{j-1} to the right-hand side gives
$$Y_j = \sum_{k=0}^{N-1} W_k \left( X_{j-k} - X_{j-k-1} \right) + Y_{j-1} \qquad (3)$$
In (3), (X_{j-k} - X_{j-k-1}) and Y_{j-1} indicate the difference between two adjacent pixels of the iFMAP and the previous pixel of the oFMAP, respectively. Compared to an implementation based on the original form in (1), if all the differences (X_{j-k} - X_{j-k-1}) are bounded within a relatively narrow range, the CNN accelerator can be implemented with a reduced data bit-width.
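The equivalence between the direct form (1) and the differential form (3) can be checked numerically. The following is a minimal sketch, assuming zero-valued samples outside the input range (so that the first output starts from a previous output of zero); the function names are illustrative and not taken from the paper.

```python
def conv_direct(w, x, j):
    """Direct form: Y_j = sum_k W_k * X_{j-k}, with out-of-range samples as zero."""
    return sum(wk * (x[j - k] if 0 <= j - k < len(x) else 0)
               for k, wk in enumerate(w))


def conv_differential(w, x):
    """Differential form: Y_j = sum_k W_k * (X_{j-k} - X_{j-k-1}) + Y_{j-1}."""
    def sample(i):
        # Assumption: samples before the start of the row are zero.
        return x[i] if 0 <= i < len(x) else 0

    y_prev = 0  # Y_{-1} is zero under the zero-padding assumption
    outputs = []
    for j in range(len(x)):
        diff_sum = sum(wk * (sample(j - k) - sample(j - k - 1))
                       for k, wk in enumerate(w))
        y_prev = diff_sum + y_prev
        outputs.append(y_prev)
    return outputs


weights = [1, 2, 3]
row = [10, 11, 13, 12, 12]
direct = [conv_direct(weights, row, j) for j in range(len(row))]
differential = conv_differential(weights, row)
print(direct == differential)
```

Apart from the first sample, the multiplier inputs in the differential form are the small adjacent differences (1, 2, -1, 0 for this row) rather than the raw pixel values, which is the property the bit-width reduction relies on.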
To verify the effectiveness of the DIM for the CNN accelerator, DIM has been applied to two popular CNN models. The first one is Convnet [20], which consists of three CONV layers, three POOL layers, and two FC layers for CIFAR-10 [21] dataset classification. This model classifies the 32 × 32 × 3 pixel CIFAR-10 images into ten classes of objects. The filter weights of this model are pre-trained in software, and Convnet reaches an accuracy of 77.0% for 10,000 test images. The other CNN model is AlexNet. AlexNet classifies the ImageNet [22] dataset, which has 227 × 227 × 3 pixel color images with one thousand classes of objects. AlexNet consists of five CONV layers, three POOL layers, and three FC layers. With pre-trained weights, the AlexNet model reaches 56.6% top-1 accuracy and 79.6% top-5 accuracy for 50,000 validation images. The numerical data of Convnet on CIFAR-10 and AlexNet on ImageNet are presented in Fig. 1. Fig. 1(a) and Fig. 1(b) show the distributions of the data (X_{j-k} in (1)) multiplied by the filter weights in Convnet and AlexNet, respectively. After applying DIM, the distributions of the differences (X_{j-k} - X_{j-k-1} in (3)) multiplied by the weights become narrower than the original distributions, as shown in Fig. 1. The narrower distribution means that most of the data can be represented with a smaller bit-width due to their small magnitude. In Fig. 2, to examine the impact of the bit-width reduction of each layer on the accuracy, the bit-width of one layer is changed while the bit-widths of the other layers are fixed at 16 bits. From those accuracy results, the bit-widths of all layers, where the bit-width configuration satisfies 99% of the relative accuracy of 32-bit floating point computation, are also specified in Fig. 2. The better accuracy obtained with DIM is due to the narrower dynamic range of the data, where more bits can be allocated to the fractional part to improve CNN accuracy.
To reach 99% of the relative accuracy of 32-bit floating point computation, a data bit-width of 9 bits is needed in the case of Convnet, as shown in Fig. 2(a). After applying DIM, the bit-width of the iFMAP is reduced by only 1 bit (Fig. 2(b)). In the AlexNet case, 10 bits are needed with simple truncation (Fig. 2(c)) for 99% of the relative accuracy of 32-bit floating point computation [19], while the bit-width of the iFMAP is reduced by 2 bits with DIM (Fig. 2(d)). The input images in ImageNet (227 × 227) and the intermediate feature maps of AlexNet (31 × 31 and 15 × 15) are larger than those of Convnet, which has a smaller input image (32 × 32) and smaller intermediate feature maps (15 × 15 and 7 × 7). Since a smaller image has lower resolution, the differences between adjacent data (X_{j-k} - X_{j-k-1} in (3)) are larger (Fig. 1), and thus DIM is less effective for Convnet. Since most state-of-the-art visual devices support high-resolution images, DIM should be valid and efficient in future CNN accelerators.
B. THE PROPOSED DIM WITH ADAPTIVE BIT-WIDTH REDUCTION
In addition to the DIM approach, in order to further reduce the operation bit-width of the CNN accelerator with minor accuracy loss, the adaptive bit-width reduction (ABWR) scheme is also proposed. Fig. 3 shows the conceptual illustration of the conventional dynamic fixed-point [14] and the proposed adaptive bit-width reduction scheme. In the conventional approach, depending on the distribution of the magnitude of the integer part, the optimal computation bit-widths are decided as illustrated in Fig. 3(a). Since the distributions of the magnitude of the integer part are quite different in each layer, the integer part and fractional part of the computation in each layer have dedicated bit-widths. On the other hand, in the proposed adaptive bit-width reduction scheme, by applying DIM, the following two major advantages can be obtained: i) quantization errors occur rarely, and ii) the quantization error is controlled pixel by pixel. As shown in the left illustration of Fig. 3(b), when the DIM is applied to reduce the computation bit-width, the following two cases can occur. CASE 1: when an input shows consecutive identical sign bits (in two's complement representation), the bit-width can be reduced without any accuracy loss. CASE 2: when the sign bit and the next bit are different, the fractional bits can be partially discarded to maintain the same data bit-width at the cost of quantization error. As shown in Fig. 1, the simulations with Convnet and AlexNet show that, since DIM already makes the data distribution narrow, redundant sign bits are frequently observed, so the operation bit-width can be efficiently reduced without seriously sacrificing the accuracy. In addition, as shown in the right illustration of Fig. 3(b), since the proposed DIM with ABWR scheme finds the optimal bit-width for each pixel, the quantization error can be reduced more adaptively than in the conventional approach (where the optimal bit-width is decided for each layer).
Fig. 4(a) and (b) also show the accuracy results of DIM and ABWR when they are applied simultaneously to Convnet and AlexNet, respectively. In the simulation, like the results shown in Fig. 2, the bit-width of one layer is changed while the bit-widths of the other layers are fixed at 16 bits. The bit-widths of all layers, where the configuration satisfies 99% of the relative accuracy of 32-bit floating point computation, are also specified in Fig. 4. When the DIM is combined with ABWR, the accuracy is significantly improved. Since the CNN includes a significant amount of near-zero values in the intermediate results [9], the accuracy loss can be reduced, resulting in a 10-bit reduction (6-bit data bit-width) in both the Convnet and AlexNet implementations. The proposed bit-width reduction scheme (DIM+ABWR) is also compared with the previous dynamic fixed-point (decimal point) approaches [11] - [14], and the results are presented in Table 2. In this comparison, the binary neural networks [27] - [29] and non-linear quantization [24], which sacrifice more than 5% and 3% of classification accuracy, respectively, are not included. As shown in Table 2, the proposed bit-width reduction scheme achieves a relatively smaller bit-width with a smaller accuracy loss in both the CIFAR-10 and ImageNet benchmarks. Since the conventional dynamic fixed-point approaches process the raw input data, which has a wider dynamic range than the data in the proposed approach, the reducible bit-width is fundamentally limited [11] - [14]. For this reason, as shown in Table 2, the required bit-width in [14] is larger than that of the proposed approach even though the filter weights are re-trained to minimize the accuracy loss. Our experimental results show that the bit-width of the iFMAP can be reduced to 6 bits with 0.1% and 0.4% top-1 accuracy loss for Convnet and AlexNet, respectively.
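The role of redundant sign bits can be illustrated by measuring the smallest two's complement width that holds a value. The sketch below is only an illustration: it treats the fixed-point data as plain signed integers, and the sample row is hypothetical rather than taken from the paper's datasets.

```python
def bits_needed(value):
    """Smallest two's complement width (sign bit included) that can hold value."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


def max_bits(values):
    """Width required to represent every value in the sequence without loss."""
    return max(bits_needed(v) for v in values)


# Hypothetical smooth row of input pixels and its adjacent differences.
row = [120, 121, 121, 122, 122, 123]
diffs = [b - a for a, b in zip(row, row[1:])]

print(max_bits(row))    # width needed for the raw pixels
print(max_bits(diffs))  # width needed for the adjacent differences
```

A value whose `bits_needed` result does not exceed the target width corresponds to CASE 1 (only redundant sign bits are dropped); a larger value corresponds to CASE 2, where low-order bits must be discarded.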
C. THE PROPOSED DIM WITH NEAR ZERO SKIPPING
In addition to the computation bit-width reduction effect, the DIM can also be used to increase the activation sparsity. Fig. 5 shows the comparison between the conventional input activation pruning (Cnvlutin) [9] and the proposed DIM based near-zero skipping (nZeroSkip) method. As shown in Fig. 5, when the pruning technique is applied to CNN acceleration, the near-zero valued input activations are set to zero when their magnitudes are below a pre-specified, per-layer threshold. Here, the input activations that become zero save computation energy through both omitted multiplications and omitted memory accesses. Unlike the conventional pruning, the proposed approach discards the relatively small input differences, which correspond to indistinct features or less informative data. As shown in Fig. 5(b) and Table 3, the combination of the DIM and pruning (near-zero skipping) shows a distinctly higher activation sparsity for the unfiltered first layer. In this example (Fig. 5), since the blurred background behind the bird consists of low frequency components, the proposed DIM based near-zero skipping can effectively reduce the redundant data. The dedicated threshold and the corresponding skip rate in each layer are compared in Table 3. For the numerical results, the simulations are carried out using AlexNet with 5,000 validation images from the ImageNet dataset. Here, the validation images are sampled uniformly across all classes. To find the optimal threshold for each layer, the top-1 accuracy loss is equally constrained to 0.28%. Due to the large contribution of the first layer, the total skip rate of the proposed approach increases by 17% compared to the conventional pruning method. As shown in Fig. 1, since the DIM effectively reduces the dynamic range of the input activations, especially for the first layer, the proportion of near-zero values increases significantly in the first layer.
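The difference between pruning raw activations and skipping near-zero differences can be sketched on a single row of data. This is a simplified illustration, assuming integer inputs, one shared threshold, and a zero value before the first pixel; the row below is hypothetical and the resulting rates are not the figures reported in Table 3.

```python
def skip_rate_raw(row, threshold):
    """Fraction of inputs skipped when raw magnitudes below threshold are pruned."""
    return sum(abs(v) < threshold for v in row) / len(row)


def skip_rate_diff(row, threshold):
    """Fraction of inputs skipped when adjacent differences below threshold are skipped."""
    prev = 0  # assumption: the value before the first pixel is zero
    skipped = 0
    for v in row:
        if abs(v - prev) < threshold:
            skipped += 1
        prev = v  # the previous raw input is kept, as a DFF would hold it
    return skipped / len(row)


# Hypothetical row: a bright flat region, a darker flat region, then near-zero values.
row = [120, 121, 121, 122, 122, 123, 60, 61, 61, 0, 0, 1]
print(skip_rate_raw(row, threshold=2))   # only the near-zero pixels are skipped
print(skip_rate_diff(row, threshold=2))  # flat regions are skipped as well
```

In flat (low frequency) regions the raw values are large but their differences are small, so the differential variant skips many more inputs than magnitude-based pruning on this kind of data.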
D. SIMULATION RESULTS OF THE PROPOSED DIM BASED BIT-WIDTH REDUCTION SCHEME
To further examine the impact of the DIM on the accuracy for images with high frequency components, additional experiments are performed with AlexNet. First, the 50,000 validation images are sorted by the portion of high frequency components in each image, which is obtained from the summation of the 2D-FFT results. Among the sorted 50,000 validation images, the proposed bit-width reduction scheme is applied to the 10,000 lowest frequency images (images ranked 1∼10,000) and the 10,000 highest frequency images (images ranked 40,001∼50,000), and the results are shown in Fig. 6. As presented in Fig. 6, the AlexNet accuracies fluctuate for the images with different frequency components. Here, to prevent overflow, the predetermined maximum and minimum boundaries are set to {6'b 011111} and {6'b 100000}, respectively. The results in Fig. 6 indicate that, since the important image information is maintained even after the DIM based near-zero skipping is applied (as shown in Fig. 5), no accuracy drop caused by image frequency is observed in the proposed bit-width reduction scheme.
E. IMPLEMENTATION OF DIM WITH ADAPTIVE BIT-WIDTH REDUCTION AND NEAR ZERO SKIPPING
The detailed implementations of DIM with ABWR and near-zero skipping are shown in Fig. 7. As shown in the figure, first, the differential input is generated using a DFF and a subtractor, and the difference is compared to the threshold and the reference value, which are pre-determined for each layer through analysis. When the difference is smaller than the threshold, which indicates that the differential input is close to zero, the comparator selects zero as the output of the adaptive difference input generator (ADIG). Similar to [2], the positions of the zeros in the FIFOs are saved in 12-bit zero buffers and used to turn off the multiplications with zero. When the difference is smaller than the reference value, the comparator selects the CASE-1 data representation (left illustration of Fig. 3(b)). Otherwise, the comparator selects the CASE-2 data representation (Fig. 3(b)). When the difference is too large for the CASE-2 representation, the predetermined maximum or minimum boundary is selected to prevent overflow. In Fig. 7, ±(2^{D+N} − 1) denotes the boundaries of the differences. Here, D denotes the bit-width of the reference value and N denotes the number of shifts in CASE-2. For example, when the difference is larger than 2^{6+2} − 1 (D = 6 and N = 2), the difference is replaced by signed {6'b 011111}. After multiplying the difference with the filter weight, the decimal point aligner rearranges the decimal point of the shifted data and accumulates the intermediate results. The PE hardware (Fig. 7) that supports the proposed DIM with ABWR and nZeroSkip has been implemented using a 65nm CMOS standard cell library. The conventional PE with simple truncation [3] is also implemented for comparison. As shown in Fig. 8, compared to the conventional PE, the PE with the proposed schemes (DIM with ABWR and nZeroSkip) shows 30.7% and 47.4% reductions in area and power consumption, respectively.
These results are mainly due to the smaller multipliers and FIFOs with reduced bit-width, and the deactivated unnecessary power with nZeroSkip. The power consumption results shown in Fig. 8 are obtained with PrimeTime-PX simulations using 5,000 test vectors at 200 MHz operation frequency. In the PE, the hardware modules for supporting the proposed bit-width reduction techniques and nZeroSkip (including adaptive differential input generator, decimal point aligner, and zero buffer in Fig. 7 ) take only 9.2 % of the whole PE array area. More hardware implementation results will be presented in Section V. 
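The selection logic of the difference generator described in this subsection can be sketched in software as follows. This is a behavioural approximation under stated assumptions, not the circuit of Fig. 7: values are plain signed integers, the output is a D-bit two's complement code (so D = 6 saturates at 6'b 011111 = 31 and 6'b 100000 = -32), CASE 2 uses an arithmetic right shift by N, and the function names are hypothetical. The boundary is stated in the text as ±(2^{D+N} − 1); the sketch instead saturates whenever the shifted value no longer fits the D-bit code.

```python
def adig(prev, cur, threshold, d_bits=6, n_shift=2):
    """Return (code, shift, skip) for one input pixel given the previous pixel."""
    diff = cur - prev
    lo = -(1 << (d_bits - 1))        # e.g. -32 for a 6-bit code
    hi = (1 << (d_bits - 1)) - 1     # e.g. +31 for a 6-bit code
    if abs(diff) < threshold:
        return 0, 0, True            # near-zero: the multiplication is skipped
    if lo <= diff <= hi:
        return diff, 0, False        # CASE 1: only redundant sign bits are dropped
    shifted = diff >> n_shift        # CASE 2: discard n_shift low-order bits
    shifted = max(lo, min(hi, shifted))  # saturate to the boundary on overflow
    return shifted, n_shift, False


def reconstruct(code, shift):
    """Value seen after the decimal point aligner undoes the shift."""
    return code << shift


print(adig(100, 101, threshold=2))   # near-zero difference
print(adig(100, 110, threshold=2))   # CASE 1
print(adig(0, 101, threshold=2))     # CASE 2 with a small quantization error
print(adig(0, 1000, threshold=2))    # saturated
```

In this model only CASE 2 and saturation introduce error, and the error of CASE 2 is bounded by the discarded low-order bits.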
F. POSSIBLE OTHER APPLICATIONS OF THE DIM BASED BIT-WIDTH REDUCTION
Previous subsections show that the proposed bit-width reduction approach can be efficiently used for reducing the computational cost while minimizing accuracy loss. In order to investigate whether the proposed technique will be efficient in other deep neural network (DNN) applications than CNNs, this subsection presents the case study of when the proposed bit-width reduction technique is applied to recurrent neural network (RNN) [36] . For the RNN model, long short-term memory (LSTM) based RNN, which is dedicated for the speech recognition, is selected. The basic computation unit for the LSTM based RNN is presented in Fig. 9 . As shown in this figure, the LSTM layer consists of three types of gates (input gate, forget gate, and output gate). At time t, the states of three gates are denoted as I t , F t , O t , respectively. Also, G t and H t indicate the intermediate state and hidden state, respectively. Each notation can be represented as the following equations [15] :
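As a point of reference, a widely used form of the LSTM gate equations is sketched below. The weight subscripts, the cell state C_t, and the omission of peephole connections are assumptions made for illustration, so this form may differ in detail from (4)-(7) as given in [15].

```latex
\begin{aligned}
I_t &= \sigma\left(W_{xi} X_t + W_{hi} H_{t-1} + b_i\right) \\
F_t &= \sigma\left(W_{xf} X_t + W_{hf} H_{t-1} + b_f\right) \\
O_t &= \sigma\left(W_{xo} X_t + W_{ho} H_{t-1} + b_o\right) \\
G_t &= \tanh\left(W_{xg} X_t + W_{hg} H_{t-1} + b_g\right) \\
C_t &= F_t \odot C_{t-1} + I_t \odot G_t \\
H_t &= O_t \odot \tanh\left(C_t\right)
\end{aligned}
```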
Here, X , W , and b denote the input data, weight, and bias respectively. The proposed DIM based bit-width reduction schemes are applied to the term X t in (4)-(7). As shown in Fig. 10 , the raw audio data ( Fig. 10(a) ) of the Texas Instruments and Massachusetts Institute of Technology (TIMIT) corpus dataset [38] is converted to the mel-frequency cepstral coefficients (MFCCs) (Fig. 10(b) ). As shown in Fig. 10(b) , 40 MFCCs in each frame are preprocessed and transferred to the computation units of the LSTM layer. Since the values of coefficients change gradually as the frames change, the input data (coefficient) has similarity between the adjacent frames. By utilizing this property, the proposed bit-width reduction scheme can be effectively used to implement the LSTM-based RNN accelerator. The proposed bit-width reduction scheme has been applied to RNN model and the simulation results are presented in Fig. 11 . In the simulations, for three integer bit-width cases (allocating 2, 3 and 4 integer bits among total bit-width), the bit-width of LSTM layer is changed while those of other layers are fixed as 16 bit. As shown in Fig. 11 , while the simple truncation can reduce bit-width by 8-bit, the combination of DIM and ABWR can reduce the bit-width by 6-bit with less than 1% of relative accuracy loss. These results show that the proposed bit-width reduction scheme (DIM+ ABWR) is applicable to different types of DNNs for reducing the hardware cost of DNN accelerator.
IV. THE PROPOSED BI-DIRECTIONAL FILTERING WINDOW MOVEMENT AND BI-DIRECTIONAL FIFO
A. THE PROPOSED BI-DIRECTIONAL FILTERING
As shown in Fig. 12(a) , in the convolution computations, the filtering window is first located on the top-left of iFMAP, and the corresponding input feature and weight are loaded in the local FIFOs. Then, the MAC operations are performed to generate the fraction of output feature map. When the convolution operations of loaded data are completed, the filtering window slides to the right-side according to the number of strides. At this time, the overlapped input features ( Fig. 12(a) ) between the adjacent filtering windows are reused to reduce the SRAM based global buffer (image and weight buffers in Fig. 7 ) accesses. As shown in Fig. 7 , by re-arranging the reusable FIFO data (using circular FIFO: feeding FIFO outputs to the FIFO inputs) to adjust the order of the stored data considering input and weight feature pairs, the reusable input feature and weight are remained in FIFO. For the newly required input data (Fig. 12(b) ), the FIFOs discard the unnecessary data and get the new data through global buffer access. In the conventional approach, since the uni-directional monotonic sliding is repeated for each row, the data reuses are only available in the same row.
In the CNN convolutional layer design mentioned above, for the energy efficient accelerator design, further reducing the number of the global buffer accesses is the fundamental requirement. Here, an interesting observation is that when the direction of filtering window is changed as bi-directional, the re-useable data considerably increases, thus leading to the reduction of the number of memory accesses. Fig. 12(b) shows the comparison between the uni-directional and bi-directional filtering window movements. As shown in the case of bi-directional approach of Fig. 12(a) , when the filtering window moves to the next row, the input feature map is continuously overlapped. Since considerable data switching in filtering window occurs in the CONV layer, the unnecessary data movement can be removed with the proposed bi-directional filtering method.
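The effect of the window direction on buffer accesses can be estimated with a small model. The sketch below is an assumption-laden simplification: the local storage is taken to hold exactly one filtering window, a pixel counts as a global buffer load whenever it is not in the previous window, and the uni-directional case reuses nothing across rows, as described above. The function names and the square feature map are illustrative only.

```python
def window(row, col, f):
    """Set of pixel coordinates covered by an f x f window at (row, col)."""
    return {(r, c) for r in range(row, row + f) for c in range(col, col + f)}


def count_loads(size, f, stride, bidirectional):
    """Count pixels fetched from the global buffer for one size x size iFMAP."""
    starts = list(range(0, size - f + 1, stride))
    loads = 0
    prev = set()
    for i, row in enumerate(starts):
        # Bi-directional: every other row is traversed right to left.
        cols = starts[::-1] if (bidirectional and i % 2 == 1) else starts
        for col in cols:
            cur = window(row, col, f)
            if not bidirectional and col == cols[0]:
                prev = set()  # uni-directional: no reuse when a new row starts
            loads += len(cur - prev)
            prev = cur
    return loads


uni = count_loads(size=7, f=3, stride=1, bidirectional=False)
bi = count_loads(size=7, f=3, stride=1, bidirectional=True)
print(uni, bi)
```

In this model each row change saves F × (F − S) loads in the bi-directional case, because the window moves down by one stride over the same columns instead of restarting from the row edge.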
In CNN accelerator, as shown above, in addition to SRAM based global buffer, a large number of register file (RF) based FIFOs for storing reusable image and weight data are needed to minimize the energy cost of the convolution operations [2] . Here, the typical FIFO size is below 1kB per PE and the area cost per byte depends on the size and type of memory [2] . In case of the small size FIFO (<10kB), since the SRAM based FIFO requires the large footprint of the SRAM peripheral circuits, the flip-flop based FIFO is normally preferred [2] . For this reason, we consider the flip-flop based FIFO memory for data-reuse in the MAC based CNN accelerator. In order to investigate the available data re-using sequences, the two types of FIFOs are considered as shown in Fig. 12(b) . When filtering window slides to left or right directions (first two rows in Fig. 12(b) ) with a stride of N S , new N S pixels of input feature are updated and previous N S pixels of input feature are discarded in a FIFO. Using the uni-directional FIFO (length: N F ), 2N F − N S and N S shift operations are needed to reshape the input data for the right and left filter shifting, respectively. On the other hand, the bi-directional FIFO requires only N S shift operations, regardless of the filter sliding direction. In case of the downward filter sliding (third row in Fig. 12(b) ), the data re-distribution is needed in PE array. As shown in Fig. 12(b) , to match the input data and weight pair for the current filtering window, N F × N S of shift operations are needed in the PE array.
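The shift counts quoted above for left and right window movement can be mirrored by a behavioural model. The sketch below uses a Python deque as a stand-in for the bi-directional FIFO and simply restates the uni-directional counts (2N_F − N_S and N_S) from the text; the class and function names are hypothetical and the model says nothing about the circuit implementation.

```python
from collections import deque


class BiFifo:
    """Behavioural model of a FIFO that can shift in either direction."""

    def __init__(self, data):
        self.cells = deque(data)
        self.shifts = 0

    def slide_right(self, new_pixels):
        """Window moves right: drop the leftmost pixel, append a new one on the right."""
        for value in new_pixels:
            self.cells.popleft()
            self.cells.append(value)
            self.shifts += 1

    def slide_left(self, new_pixels):
        """Window moves left: drop the rightmost pixel, insert a new one on the left."""
        for value in new_pixels:
            self.cells.pop()
            self.cells.appendleft(value)
            self.shifts += 1


def uni_fifo_shifts(n_f, n_s, direction):
    """Shift count for a uni-directional FIFO of length n_f, as stated in the text."""
    return 2 * n_f - n_s if direction == "right" else n_s


fifo = BiFifo([1, 2, 3, 4, 5])
fifo.slide_right([6, 7])   # stride of 2 to the right
print(list(fifo.cells), fifo.shifts)
print(uni_fifo_shifts(n_f=5, n_s=2, direction="right"))
```

With a stride of N_S, the model needs N_S shifts in either direction, which is the property attributed to the bi-directional FIFO.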
As shown in Fig. 13, the number of reusable data in a CONV layer with bi-directional filtering can be derived as:
$$R_{i.iFMAP} = R_{i.slide} \times T_{i.slide} \qquad (8)$$
In this equation, R_{i.iFMAP} denotes the number of reusable data in an iFMAP of the i-th CONV layer. R_{i.slide} is the number of reusable data when the filtering window slides downward once. T_{i.slide} is the number of downward slides of the filtering window in an iFMAP. By multiplying R_{i.slide} by T_{i.slide}, R_{i.iFMAP} can be calculated. The parameters I_i, F_i, and S_i denote the sizes of the iFMAP, the filter, and the stride, respectively. The portion of the reusable data in an iFMAP is presented in Table 4. For a larger value of F_i and a smaller value of S_i, the data reuse rate is higher, because more data overlap as the filter moves down. For example, the second layer of AlexNet, which uses a 5x5 filter, shows a much higher reuse rate than the third one, which has a 3x3 filter. Since the 2D convolution operations are repeated for the C_i channels of the iFMAP and the M_i filters, the total number of reusable data in a layer can be obtained by multiplying (8) by C_i and M_i. Moreover, the total number of reusable data (R_{total,allLayer}) in the CONV layers can be obtained by accumulating the number of reusable data in each layer:
$$R_{total,allLayer} = \sum_{i=1}^{L} C_i \, M_i \, R_{i.iFMAP} \qquad (9)$$
Here, L is the number of CONV layer, C i is the number of iFMAP channels, and M i is the number of filters per CONV layer.
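The accumulation over layers can be written as a short calculation. In the sketch below, the expressions for the two factors are assumptions rather than the paper's definitions: the reusable data per downward slide is taken as F × (F − S) (the rows shared by two vertically adjacent windows), and the number of downward slides as (I − F) / S for an unpadded square iFMAP. The layer tuples are hypothetical and do not reproduce Table 4.

```python
def reusable_per_ifmap(i_size, f_size, stride):
    """R_iFMAP = R_slide * T_slide under the assumptions stated in the lead-in."""
    r_slide = f_size * (f_size - stride)    # assumed overlap per downward slide
    t_slide = (i_size - f_size) // stride   # assumed number of downward slides
    return r_slide * t_slide


def reusable_total(layers):
    """Sum C_i * M_i * R_iFMAP over layers given as (I, F, S, C, M) tuples."""
    return sum(c * m * reusable_per_ifmap(i, f, s)
               for (i, f, s, c, m) in layers)


# Hypothetical two-layer example: (iFMAP size, filter size, stride, channels, filters).
layers = [(7, 3, 1, 2, 3), (9, 5, 2, 1, 4)]
print(reusable_per_ifmap(7, 3, 1))
print(reusable_total(layers))
```

Under these assumptions a larger filter or a smaller stride increases the per-slide overlap, which matches the trend described for Table 4.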
B. THE PROPOSED BI-DIRECTIONAL FIFO Fig. 14(a) shows the conventional flip-flop based bi-directional FIFO (bi-FIFO). As shown in the figure, the additional MUX to enable the bi-directional shift operation induces large area overhead. Although the dynamic latch such as clocked CMOS (C2MOS) and true single phase clock (TSPC) latches ( Fig. 14(b) and Fig. 14(c) , respectively) can be used to reduce the area cost, the static latch ( Fig. 14(d) ) is preferred to ensure data retention and operation stability. In order to alleviate the area overhead of bi-FIFO while ensuring the data retention, we propose a bi-directional static latch. As shown in Fig. 12(c) , an SRAM like cross-coupled inverter and two clocked inverters compose the proposed bi-directional static latch. To reduce the contention of data transfer, the clocked inverters (TNVs) are sized up to 2, and the weak (stacked) inverters are used for the cross-coupled inverter (INVs). This configuration not only increases the operation stability, but also enables the bit-cell manner layout. As shown in Fig. 15 , the power sources (supply voltage and ground) and clock signals (LCK, RCK, and their complementary signals) are placed at the outside, resulting in an area efficient layout. In addition, all physical layers can be drawn in litho-friendly manner (i.e., uniform poly orientation and easy metal routing). Fig. 15 also shows the basic operation principle of the proposed bi-directional latch based FIFO. Similar to the conventional DFF based FIFO, the complementary two latches form a flip-flop. In this rightshifting example that uses only LCK, the left and right latches transfer the input data in low-level and high-level sensitives, respectively. By activating only RCK, the shifting direction can be easily changed in the proposed bi-FIFO. 
For the comparison and evaluation of the proposed bi-FIFO, the conventional bi-directional FIFOs (bi-FIFO) have been designed with SRAM (with head/tail pointers) using 65nm CMOS technology. The FIFO depth and bitwidth are selected as 12 and 16 bit, respectively, and the comparison results are presented in Table 5 . Due to the large footprint of the peripheral circuits, the SRAM (dual-port register file) based bi-FIFO shows 2.24 times larger area and 1.79 times larger power consumption compared to the flipflop based bi-FIFO. Here, the power consumption has been measured with PrimeTime-PX simulations at 200 MHz operation frequency. In the SRAM based bi-FIFO, the header/tail pointer occupies 3.8% area and 7.0% power. Compared to the conventional flip-flop based design, the proposed bi-FIFO shows 33% smaller area and 31% reduced power consumption.
In order to find the optimal memory size (depth), the hardware costs of bi-FIFOs with various depths have also been examined. As shown in Fig. 16, when the FIFO depth is larger than 32, the SRAM based local memory has a smaller hardware cost than the flip-flop based approaches. However, since a large local (PE) memory also increases the PE size, only small local memories can be used considering the total number of PEs needed in a CNN accelerator chip. For this reason, the proposed bi-FIFO is more effective when used for small local memories (depth less than 32).
V. OVERALL CNN ACCELERATOR ARCHITECTURE AND NUMERICAL RESULTS
In order to verify the effectiveness of the proposed design techniques, a CNN accelerator has been implemented and circuit simulations have been performed using 65 nm CMOS technology. For the comparisons of energy dissipation at the architectural and circuit levels, PrimeTime-PX and HSPICE are used, respectively, at the TYPICAL corner (1.2 V, 25 °C) with a 200 MHz clock (clock period = 5 ns).
A. ARCHITECTURE OF CNN ACCELERATOR
To efficiently handle the varied shapes of the filters and feature maps with a fixed hardware structure, the optimal configuration of the AlexNet accelerator is considered with the configuration parameters shown in Fig. 17(a). Considering the PE area and the different filter sizes across all layers, the FIFO depth has been decided as 12. We also assume that 1) the computation for each filter is parallelized as much as possible, and 2) the accelerator serially processes all the filter channels (C in the table of Fig. 17(a)). With those premises, the total number of PE can be expressed as
Here, $N_{\text{SPE}}$ and $N_{\text{PE}}$ denote the total number of PEs in a sub-array and in the entire array, respectively. $N_{\text{M-PE}}$ is the total number of memory in a PE (M-PE) array. To maximize the PE utilization in each layer, the total number of M-PE arrays ($N_{\text{M-PE}}$) has been decided as the greatest possible divisor of each M (the total number of filters in each CONV layer) which is less than $N_{\text{M-PE}}$. Since small values of $N_{\text{PE}}$ and $N_{\text{M-PE}}$ increase the CNN processing time, the PE utilization is computed while sweeping $N_{\text{PE}}$ and $N_{\text{M-PE}}$. As shown in Fig. 17, when $N_{\text{PE}}$ and $N_{\text{M-PE}}$ are selected as 24 and 4, respectively, a PE utilization of 95% is achieved.
Based on this configuration, the CNN accelerator has been designed as shown in Fig. 18. In this architecture, to increase the data reusability while reducing the accumulation complexity of the intermediate results (partial sums), the data flow has been decided as follows. First, as shown in Fig. 18, the reconfigurable PE groups in each M-PE array are determined according to the filter size. The 3D filter weights are loaded from the 4 × 0.5 KB weight buffers to each M-PE array. To obtain the output image for a 3D filter, the intermediate results, which are computed in the PEs using the 3D filter weights and inputs, should be accumulated within an M-PE array. This means that each M-PE array takes a different 3D filter and generates the output image of that filter. This data flow allows a constant data bandwidth between the PSUM buffer (4 × 11.87 KB) and the M-PE arrays even if the CONV layer changes. For the input images, to expedite the buffer initialization, the image buffers are divided into 4 × 4.94 KB SRAM banks. Here, in order to reduce the number of image buffer accesses, all the M-PE arrays receive the same input image data from an SRAM bank. The comparison results of the proposed CNN accelerator are discussed in the following section.
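The divisor-based choice of the number of M-PE arrays described above can be sketched as follows. This is an illustrative reading of the selection rule, not the authors' exact procedure: the function name, the inclusive upper bound `limit`, and the notion of "filter passes" (M divided by the number of M-PE arrays, since each array handles a different 3D filter) are assumptions. The filter counts are those of the five AlexNet CONV layers.

```python
from typing import Dict

# Number of filters M in each AlexNet CONV layer.
ALEXNET_FILTERS: Dict[str, int] = {
    "CONV1": 96, "CONV2": 256, "CONV3": 384, "CONV4": 384, "CONV5": 256,
}


def greatest_divisor_at_most(m: int, limit: int) -> int:
    """Largest divisor of m that does not exceed limit (inclusive bound assumed)."""
    if m < 1 or limit < 1:
        raise ValueError("m and limit must be positive")
    for candidate in range(min(m, limit), 0, -1):
        if m % candidate == 0:
            return candidate
    return 1  # not reached: 1 divides every positive m


# With an upper bound of 4 arrays, every layer can use all 4 arrays,
# so no M-PE array is left idle in any CONV layer.
arrays_per_layer = {
    layer: greatest_divisor_at_most(m, 4) for layer, m in ALEXNET_FILTERS.items()
}

# Assumed interpretation: filters are processed in groups of one per array.
passes_per_layer = {
    layer: m // arrays_per_layer[layer] for layer, m in ALEXNET_FILTERS.items()
}
```

Because 4 divides 96, 256, and 384 evenly, the same array count serves all five CONV layers, which is consistent with the selected value of 4 M-PE arrays.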
B. COMPARISON RESULTS
As mentioned in Section IV, the bi-directional filtering window and the bi-directional FIFO (bi-FIFO) facilitate efficient data reuse by reducing the data re-distribution time.
To verify the effectiveness of the proposed bi-FIFO, the PE area and the overall processing time are compared with those of the conventional FIFO based approaches. For comparison purposes, the bit-width of all PE components is set to 16 bits. As shown in Fig. 19, thanks to the SRAM-like layout, the proposed FIFO shows 9.4% and 33.8% smaller area than the conventional uni-directional FIFO and bi-directional FIFO, respectively. The entire PE using the proposed FIFO is 20.3% smaller than the conventional bi-FIFO based design. Fig. 19 also shows the proportion of each PE component. Since the multiplier and the FIFO occupy the main portion of the PE area, the adaptive bit-width reduction (ABWR) technique can effectively reduce the area of the CNN accelerator. Using ABWR, the bit-width of the iFMAP in the multiplier and the FIFO can be reduced to 6 bits. Fig. 20 presents the comparison results of the AlexNet accelerator using the proposed design approaches. The numerical results show that the combination of the proposed FIFO and the adaptive bit-width reduction scheme achieves 34.6% accelerator area savings. In addition, with the reduced hardware footprint (ABWR), the shorter processing time (bi-FIFO), and the skipped unnecessary computations (nZeroSkip), 58.2% normalized energy savings are achieved by our approach. Thanks to the efficient data re-distribution with the bi-FIFO, the proposed accelerator requires 32.8% fewer computation cycles over all the CONV layers.
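Two of the figures above can be cross-checked with simple arithmetic. The sketch below is only a worked calculation with hypothetical helper names: it confirms that reducing the iFMAP bit-width from 16 to 6 bits is a 62.5% reduction, and it converts the 32.8% cycle reduction into an equivalent speedup factor, a derived number that is not quoted in the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


def speedup_from_reduction(cycle_reduction: float) -> float:
    """Speedup factor implied by a fractional reduction in cycle count."""
    return 1.0 / (1.0 - cycle_reduction)


# 16-bit baseline reduced to 6-bit iFMAP operands by ABWR.
bitwidth_reduction = relative_reduction(16, 6)

# 32.8% fewer computation cycles over the CONV layers.
implied_speedup = speedup_from_reduction(0.328)

print(f"bit-width reduction: {bitwidth_reduction:.1%}")
print(f"implied speedup    : {implied_speedup:.2f}x")
```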
VI. CONCLUSION
In this paper, we present three design techniques to reduce the hardware cost of the convolutional computations. First, to reduce both the data bit-width and the number of computations, DIM based adaptive bit-width reduction (ABWR) and near-zero skipping (nZeroSkip) are proposed. In addition, to optimize the power-consuming data movement in the CNN accelerator, an energy-efficient bi-directional filtering is proposed. By reusing the overlapped data during downward filter sliding, considerable memory access and data movement energy can be saved. As a circuit-level approach, a bi-directional first-input-first-output (bi-FIFO) is also proposed. Based on an SRAM bit-cell style layout, the proposed bi-FIFO effectively facilitates data re-distribution with a compact area. Since the proposed schemes do not change the overall architecture, they can be directly applied to other state-of-the-art architectures. To verify the effectiveness of the proposed techniques, an AlexNet accelerator has been designed. The numerical results show that the proposed techniques achieve 34.6% area savings and 58.2% energy savings. The proposed bi-FIFO based accelerator also shows a 32.8% shorter processing time.
