Abstract. In this paper, we propose a reconfigurable architecture that supports various convolutional neural networks (CNNs) such as GoogLeNet and AlexNet. The proposed architecture comprises 24 parallel processing engines (PEs) for convolution of image data, each containing 9×4 multiplier-accumulator (MAC) units. By combining PEs, the architecture supports convolution at several operand bit widths, namely 8bit×8bit, 16bit×8bit, and 16bit×16bit, as well as several kernel sizes, namely 1×1, 3×3, 5×5, and 7×7. The architecture is synthesized in TSMC 65-nm technology and achieves a peak of 1105.9 GOPS at 640 MHz and 1 V, with a power consumption of 193 mW. Compared with existing AlexNet accelerator architectures, the proposed architecture improves computational efficiency by 20% to 27.4%.
INTRODUCTION
Deep learning shows excellent performance in applications such as computer vision and voice processing. Convolutional neural networks achieve unprecedented accuracy in tasks such as object recognition and detection [1, 2]. However, CNN-based applications face two important issues: their computational complexity is much higher than that of traditional methods, and transferring parameter data requires high storage bandwidth.
Improving CNN performance requires overcoming a number of computation-related issues. The traditional way to accelerate CNNs is to use GPUs, which are capable of high-speed matrix multiplication, or various types of field-programmable gate arrays (FPGAs). Some scholars point out that one way to keep improving processor performance and energy cost is to develop domain-specific processors [3, 4]. Some researchers have proposed CNN accelerators based on application-specific integrated circuits (ASICs), and many companies are developing dedicated artificial intelligence hardware that focuses on performance [5, 6]. Developing an architecture that supports parallel processing across multiple PE units and increases the efficiency of CNN operations is therefore of great significance to the practical application of deep learning.
This paper presents a multi-mode CNN accelerator that includes 24 reconfigurable processing engines (PEs). Each PE supports 9×4 multiply-accumulate (MAC) operations in parallel. By combining PEs, the accelerator supports different bit widths for data and weights as well as different convolution sizes. Because convolution accounts for more than 90% of CNN operations, this paper focuses on the design of the convolutional layer. Compared with existing designs, the proposed accelerator improves computational efficiency by 20% to 27.4%.
DESIGN OF PE ARRAY STRUCTURE
As shown in Figure 1, the PE array consists of 24 PE units, each handling a 3×3 window. The number of MAC operations a PE unit supports depends on the mode: 9×4 MAC operations in the 8bit×8bit mode, 9×2 MAC operations in the 16bit×8bit mode, and 9 MAC operations in the 16bit×16bit mode.
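The per-mode MAC counts above can be tabulated with a short sketch. The per-PE numbers and the PE count come from the text; the dictionary layout and the function name `array_macs_per_cycle` are illustrative assumptions, and the sketch assumes every PE is active in the same mode.

```python
# MAC operations per PE in each operand-width mode, as stated in the text.
MACS_PER_PE = {
    "8bx8b": 9 * 4,
    "16bx8b": 9 * 2,
    "16bx16b": 9 * 1,
}
NUM_PES = 24  # PE units in the array (from the text)


def array_macs_per_cycle(mode, num_pes=NUM_PES):
    """Parallel MACs across the array, assuming all PEs run in `mode`."""
    return MACS_PER_PE[mode] * num_pes


for mode in MACS_PER_PE:
    print(mode, MACS_PER_PE[mode], array_macs_per_cycle(mode))
```

Each doubling of one operand width halves the number of parallel MACs, which is consistent with wider products being assembled from the same 8-bit hardware.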
The PE array architecture supports concurrent convolution operations for 1×1, 3×3, 5×5, and 7×7 convolutional templates.
FIG. 1. Structure of PE Array
Single Multi-Mode PE Structure.
As shown in Figure 2, a PE contains 36 (4×9) Booth encoding units; 16 Wallace treeXX units (Wallace trees 00~03, 10~13, 20~23, and 30~33); 4 Wallace tree_0 units, 2 Wallace tree_1 units, and 1 Wallace tree_2 unit; and 4 CPA_0 units, 2 CPA_1 units, and 1 CPA_2 unit, where CPA denotes a carry-propagate adder. The PE processes data in three modes: 8bit×8bit, 16bit×8bit, and 16bit×16bit. The 36 8-bit image/feature values are input to the Booth encoding units as multiplicands, and the 36 parameter values are input as multipliers. Each Booth encoding unit outputs four values, d0 to d3. The Booth encoding units are divided into groups of nine. Within group X, the d0 outputs of all nine units are connected to the inputs of Wallace treeX0, the nine d1 outputs are connected to Wallace treeX1, and so on. All Wallace treeXX inputs are therefore bit-aligned, which reduces the number of columns in each Wallace treeXX array and saves circuit area. The outputs of Wallace treeX0~X3 are the inputs of the corresponding Wallace tree_0 unit, whose output feeds a CPA_0 unit. The results of the four CPA_0 units are output as the 8bit×8bit results. The outputs of the Wallace tree_0 units also feed the Wallace tree_1 units, whose outputs feed the CPA_1 units; the outputs of CPA_1_0 and CPA_1_1 are the 16bit×8bit results. The outputs of the Wallace tree_1 units also feed Wallace tree_2, whose output passes through CPA_2 to give the 16bit×16bit result. The outputs of CPA_0, CPA_1, and CPA_2 are truncated to 16 bits and the remaining bits are discarded, so each 3×3 convolution result is held as 16 bits.
FIG. 2. Single multi-mode PE Structure
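The grouping of Booth outputs by position can be illustrated in software. The sketch below assumes radix-4 Booth recoding, which is not stated in the text but would give four partial products for an 8-bit multiplier; the function names are hypothetical. It models values only: carry-save arithmetic and the 16-bit truncation are not modeled.

```python
def booth_radix4_digits(w, bits=8):
    """Radix-4 Booth digits (each in -2..2) of a signed `bits`-bit multiplier."""
    assert bits % 2 == 0 and -(1 << (bits - 1)) <= w < (1 << (bits - 1))
    u = w & ((1 << bits) - 1)  # two's-complement bit pattern of w

    def bit(i):
        return 0 if i < 0 else (u >> i) & 1

    # Digit k recodes the overlapping bit triple (b[2k+1], b[2k], b[2k-1]).
    return [bit(2 * k - 1) + bit(2 * k) - 2 * bit(2 * k + 1)
            for k in range(bits // 2)]


def pe_dot9(pixels, weights):
    """Nine-tap dot product, reduced position by position as in the text."""
    assert len(pixels) == len(weights) == 9
    digits = [booth_radix4_digits(w) for w in weights]
    # Column k plays the role of Wallace treeXk: it sums the d_k partial
    # products of all nine taps, which share the same bit alignment.
    column_sums = [sum(digits[t][k] * pixels[t] for t in range(9))
                   for k in range(4)]
    # Final merge (Wallace tree_0 and CPA_0 in the text): column k has weight 4**k.
    return sum(s * 4 ** k for k, s in enumerate(column_sums))


pixels = [12, -7, 100, 0, 55, -128, 127, 3, -1]
weights = [-128, 127, -1, 0, 5, -7, 64, -64, 3]
assert pe_dot9(pixels, weights) == sum(p * w for p, w in zip(pixels, weights))
```

Summing same-position partial products first means each first-level tree sees inputs with identical alignment, which is the property the text credits with reducing the number of tree columns.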

Advances in Intelligent Systems Research, volume 147
Combination of PE Arrays.
As shown in Figure 3, a 5×5 convolution occupies three PEs. Each PE contributes the outputs of its four Wallace tree_0 units in Figure 3, eight signals in total (a carry and a sum from each unit). The carry0_0 and sum0_0 outputs of the three PEs together form the input of Wallace treeA_0, the three carry0_1 and three sum0_1 outputs form the input of Wallace treeA_1, and so on. The 8 Wallace treeA outputs are the inputs of 4 CPA_A units, and the outputs of the 4 CPA_A units are the 5×5 convolution results in the 8bit×8bit mode. The 8 Wallace treeA outputs are also divided into 2 groups, each of which is the input of a Wallace treeB. The Wallace treeB outputs are passed to two CPA_B units, whose two outputs are the 5×5 convolution results in the 16bit×8bit mode. The outputs of the two Wallace treeB units are also the input of Wallace treeC, after which CPA_C produces the 5×5 convolution result in the 16bit×16bit mode. Figure 4 shows a 7×7 convolution with 6 PEs. It is composed of two 5×5 convolution structures and is organized similarly to the 5×5 case. It uses the three types of Wallace trees and three types of CPAs to output 7×7 convolution results in the 8bit×8bit, 16bit×8bit, and 16bit×16bit modes, respectively.
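The PE counts quoted above (three PEs for 5×5, six for 7×7) match what one obtains by dividing the number of kernel taps by the nine taps a PE covers and rounding up. The sketch below works under that assumption; the function names are hypothetical, and the number of kernels the 24-PE array can host at once is an inference rather than a figure from the text. The 1×1 case is left out because the text does not say how it maps onto PEs.

```python
NUM_PES = 24      # PEs in the array (from the text)
TAPS_PER_PE = 9   # one PE covers a 3x3 window


def pes_needed(kernel_size):
    """PEs occupied by one k x k kernel: ceil(k*k / 9), using integer math."""
    taps = kernel_size * kernel_size
    return (taps + TAPS_PER_PE - 1) // TAPS_PER_PE


def concurrent_kernels(kernel_size, num_pes=NUM_PES):
    """Kernels of this size that fit in the array at once (an inference)."""
    return num_pes // pes_needed(kernel_size)


def tap_utilization(kernel_size):
    """Fraction of the occupied PEs' tap positions that carry real weights."""
    used = kernel_size * kernel_size
    return used / (pes_needed(kernel_size) * TAPS_PER_PE)


for k in (3, 5, 7):
    print(k, pes_needed(k), concurrent_kernels(k), round(tap_utilization(k), 3))
```

Under this model a 5×5 kernel uses 25 of 27 available tap positions and a 7×7 kernel uses 49 of 54, so larger kernels leave a small fraction of multipliers idle.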
EXPERIMENTAL RESULTS
Using TSMC 65-nm CMOS technology, we synthesized the proposed PE architecture and achieved a core clock frequency of 640 MHz. The PE array occupies an area of 1.23 mm² and consumes 193.3 mW. The peak performance reaches 1105.9 GOPS, the average performance reaches 981.6 GOPS, and the computational efficiency reaches 88.8%. The results are shown in Table 1. Compared with existing AlexNet accelerator architectures, this architecture improves computational efficiency by 20% to 27.4%.
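The reported peak and efficiency figures can be cross-checked from the architectural parameters. The sketch below assumes that one MAC counts as two operations (a multiply and an add) and that computational efficiency means average performance divided by peak performance; neither definition is spelled out in the text, and the variable names are illustrative.

```python
NUM_PES = 24        # from the text
MACS_PER_PE = 36    # 9x4 MACs per PE in 8bit x 8bit mode (from the text)
OPS_PER_MAC = 2     # assumption: one multiply plus one add
CLOCK_HZ = 640e6    # core clock (from the text)

# Peak throughput in GOPS with every MAC busy on every cycle.
peak_gops = NUM_PES * MACS_PER_PE * OPS_PER_MAC * CLOCK_HZ / 1e9

# Reported figures from the text.
REPORTED_PEAK_GOPS = 1105.9
REPORTED_AVG_GOPS = 981.6

# Assumed definition: efficiency = average / peak.
efficiency = REPORTED_AVG_GOPS / REPORTED_PEAK_GOPS

print(f"peak = {peak_gops:.2f} GOPS, efficiency = {efficiency:.1%}")
```

Under these assumptions the computed peak agrees with the reported 1105.9 GOPS to the stated precision, and the ratio of average to peak rounds to the reported 88.8%.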
SUMMARY
This paper presents a multi-mode, high-performance CNN accelerator architecture that supports various convolutional neural networks such as GoogLeNet and AlexNet. The architecture supports 1×1, 3×3, 5×5, and 7×7 convolutions in three bit-width modes (8bit×8bit, 16bit×8bit, and 16bit×16bit). It includes 24 PEs, each with 9×4 MAC units. The design is synthesized in TSMC 65-nm technology, operates at 1 V with a 640 MHz core clock, consumes 193.3 mW, and has an area of 1.23 mm². Compared with existing ASICs of similar architecture for AlexNet, the computational efficiency is increased by 20% to 27.4%.
