This paper proposes an energy-efficient reconfigurable architecture for deep neural networks (EERA-DNN) with hybrid bit-width weights and a logarithmic multiplier. To speed up computation and achieve high energy efficiency, we first propose an efficient network compression method based on a hybrid bit-width weight scheme, which reduces the memory storage of LeNet, AlexNet and EESEN by 7x-8x with negligible accuracy loss. Then, we propose an approximate unfolded logarithmic multiplier to process multiplication operations efficiently. Compared with the state-of-the-art architectures EIE and Thinker, this work achieves over 1.8x and 2.7x better energy efficiency, respectively.
Introduction
Deep neural networks have evolved into the state-of-the-art technique for computer vision [1] [2], speech recognition [3], automatic translation, advertisement recommendation, and so on [4] [5]. Because the large number of synaptic weights incurs intensive computation and memory accesses, efficiently processing large-scale neural networks with limited resources remains a challenging problem.
To overcome the challenge caused by the overwhelming number of neurons and synapses, Denil et al. demonstrated that learning models contain substantial parameter redundancy and that some of these parameters can be pruned directly [6]. Han et al. [7] proposed a pruning technique that shrinks the number of synaptic weights by about 10x with negligible accuracy loss. Reagen et al. explored the minimum number of bits required for each data path while preserving model accuracy within an established negligible error bound [8]. Building on sparse and reduced networks, several accelerators for compressed deep neural networks have been proposed [9] [10].
Interestingly, dramatically reducing the number of synapses and using a traditional sparse matrix calculation scheme do not necessarily improve the performance and energy efficiency of existing accelerators. The computational intensity of a neural network lies mainly in its large number of multiplication operations. Although a variety of neural network accelerators have been proposed [8] [9] [10] [11] [12], most of this work focuses on memory access and data path design and does not go deep into the underlying multiplication unit.
In this paper, we propose an energy-efficient reconfigurable engine for hybrid bit-width DNNs. We first propose an efficient network compression method based on a hybrid bit-width weight scheme. On top of that, an energy-efficient approximate unfolded logarithmic multiplier is adopted to further reduce the computation power consumption. The rest of this paper is organized as follows. Section 2 presents the hybrid bit-width weights scheme, the approximate unfolded logarithmic multiplier and the DNN accelerator architecture. Section 3 presents several case studies and the implementation results. Section 4 concludes the paper.
Energy-Efficient Reconfigurable Architecture for DNNs with Hybrid Bit-Width and Logarithmic Multiplier
Hybrid Bit-width weights scheme
The weight magnitude in a neural network reflects the importance of the synapses. In terms of distribution, the closer to zero, the more densely the weights are distributed. Figure 1 shows the weight distribution of a typical DNN, LeNet.
Fig. 1. The weight distribution of LeNet (axis label: weight range)
For the weight grading shown in Table 1, three pieces of information characterize a weight as compactly as possible. The first is the leading "1" position, which requires only 4 bits for a weight represented as a 16-bit binary number. Since the level of a weight is set according to its leading "1" index, knowing the leading "1" index is enough to determine the level and the effective length of the weight. The second is the sign bit. The third is the significant bits, which characterize the magnitude of the weight. Figure 2 shows an example of reducing the complexity of weights. The storage of the weights is divided into two parts: the mixed bit stream and the leading "1" index. The leading "1" index indicates both the number of significant bits of a weight and the position of its leading "1".
Figure 3 shows the principle of weight storage and parsing. In the upper part, $D_{ij}$ denotes the j-th significant bit of the i-th number. For example, $D_{62}$ represents the second significant bit of the sixth number, and $D_{63}$ represents the sign bit of the sixth number. The significant bits and the leading "1" index of each weight are shown in the same color. The weight parsing process is as follows. The decoder reads a leading "1" index, for example 0111. Looking up Table 1, the most significant bit position is 7 and the corresponding level is 2, so the decoding result is $D_{63}$ 000 0000 1 $D_{62}$ $D_{61}$ $D_{60}$ 0000. For the leading "1" index 1110, the most significant bit position is 14 and the corresponding level is 6, so the decoding result is $D_{75}$ 0 1 $D_{74}$ $D_{73}$ $D_{72}$ $D_{71}$ $D_{70}$ 0000 0000. The third, fourth and fifth leading "1" indices are 0000, which indicates that these three weights are all 0, so the decoding result for each is 0000 0000 0000 0000. In addition, for a weight whose value is 0, only its leading "1" index needs to be stored.
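The parsing step can be illustrated with a minimal Python sketch. It is an assumption-laden model, not the paper's resolver: the function name `parse_weight` and the parameter `n_sig` are hypothetical, bit positions are assumed to be counted from 1 (index p puts the leading "1" in bit p-1, with the least significant bit as bit 0), and the number of significant bits is passed in explicitly because Table 1, which defines it per level, is not reproduced here. Under these assumptions the sketch reproduces the layout of the 1110 example; it does not attempt to cover the 0111 example, whose printed bit string places the leading "1" one position higher.

```python
def parse_weight(lead_idx: int, sign: int, sig_bits: int, n_sig: int) -> int:
    """Rebuild a 16-bit weight word from its compressed fields (illustrative sketch).

    lead_idx : 4-bit leading "1" index; 0 marks a zero weight
    sign     : sign bit, placed in bit 15
    sig_bits : stored significant bits, packed into an integer
    n_sig    : number of stored significant bits (assumed to come from Table 1)
    """
    if not 0 <= lead_idx <= 15:
        raise ValueError("leading-1 index must fit in 4 bits")
    if lead_idx == 0:
        # A zero weight stores only its index, so nothing else is read.
        return 0
    msb = lead_idx - 1                 # assumption: positions are counted from 1
    if not 0 <= n_sig <= msb:
        raise ValueError("significant bits must fit below the leading 1")
    word = (sign & 1) << 15            # sign bit
    word |= 1 << msb                   # implicit leading "1"
    word |= (sig_bits & ((1 << n_sig) - 1)) << (msb - n_sig)  # bits just below it
    return word


# Index 1110 with sign 1 and five significant bits 10110:
# sign | 0 | 1 | 10110 | 0000 0000
print(format(parse_weight(0b1110, 1, 0b10110, 5), "016b"))  # prints 1011011000000000
```

Taking `n_sig` as an argument keeps the sketch independent of the exact level boundaries; a real resolver would derive it from the index through the Table 1 lookup.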
In this paper, we use the 4-bit leading "1" index to characterize the highest nonzero position, the level of the weight and so on. Figure 4 shows the hardware implementation for decoding the leading "1" index. The leading "1" indices of the weights are pre-processed off-line with Huffman encoding [13] and dynamically decoded on-chip, as shown in Figure 4. The selector determines whether the input data enters LUT1 or LUT2: LUT1 is selected when the first six digits of the input codeword (CW) are not all 1; otherwise LUT2 is selected. Finally, the MUX outputs the code length (CL), which is returned to the shift register, and the symbol (the position of the first bit of the weight).
Fig. 3. Weights storage scheme with hybrid bit-width (panels: the significant bits for weights; the leading "1" index, shown as 0011 1111 0000 0000 0000 0111 1110 1111; the RESOLVER; the weight values after parsing)
Fig. 4. Decoder for the leading "1" index of weights
Approximate unfolded logarithmic multiplier
The iterative logarithmic multiplier scheme was proposed in [14]. However, it is not suitable for DNNs for two reasons: first, iterative computing limits the data processing throughput; second, iterative computing seriously disrupts the pipelining of DNN scheduling. We therefore propose an approximate unfolded logarithmic multiplier.
The binary representation of a number N can be written as (1):
$$N = 2^{k}\left(1 + \sum_{i=j}^{k-1} 2^{i-k} Z_i\right) = 2^{k}(1 + x) \tag{1}$$
IEICE Electronics Express, Vol.*, No.*, 1-11
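In (1), k is the characteristic number, i.e. the place of the most significant bit with the value '1', $Z_i$ is the bit value at the i-th position, x is the mantissa (fraction), and j depends on the number precision. The exact expression for the multiplication can be written as:
$$P_{true} = N_1 \cdot N_2 = 2^{k_1}(1 + x_1) \cdot 2^{k_2}(1 + x_2) = 2^{k_1 + k_2}(1 + x_1 + x_2) + 2^{k_1 + k_2} x_1 x_2 \tag{2}$$
To avoid the approximation error, the following relation, derived from (1), must be taken into account:
$$x \cdot 2^{k} = N - 2^{k} \tag{3}$$
Combining (2) with (3), we get:
$$P_{true} = 2^{k_1 + k_2} + (N_1 - 2^{k_1})\,2^{k_2} + (N_2 - 2^{k_2})\,2^{k_1} + (N_1 - 2^{k_1})(N_2 - 2^{k_2}) \tag{4}$$
Let
$$P_{approx}^{(0)} = 2^{k_1 + k_2} + (N_1 - 2^{k_1})\,2^{k_2} + (N_2 - 2^{k_2})\,2^{k_1} \tag{5}$$
be the first-order approximation. We get:
$$P_{true} = P_{approx}^{(0)} + (N_1 - 2^{k_1})(N_2 - 2^{k_2}) \tag{6}$$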
The two multiplicands in (6) can be obtained simply by removing the leading 1 from the numbers $N_1$ and $N_2$. Therefore, the multiplication procedure can easily be repeated with these new multiplicands. Note that the first term of the first-order approximation can be seen as a binary number whose $(k_1 + k_2)$-th bit is 1 and whose other bits are zero; the second term is the first multiplier with its leading 1 removed, shifted left by $k_2$ bits; the third term is the second multiplier with its leading 1 removed, shifted left by $k_1$ bits.
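As a rough illustration of how this approximation behaves, the following Python sketch computes the three terms described above and then repeats the procedure on the two residues. The function name `log_multiply` and the `iterations` parameter are hypothetical, and the sketch models only the arithmetic on non-negative integers; it says nothing about the shifter, adder or encoder structure of the hardware.

```python
def log_multiply(n1: int, n2: int, iterations: int = 1) -> int:
    """Approximate n1 * n2 using only shifts and additions (illustrative sketch)."""
    if n1 < 0 or n2 < 0:
        raise ValueError("this sketch handles non-negative operands only")
    product = 0
    for _ in range(iterations):
        if n1 == 0 or n2 == 0:
            break  # a zero residue means the remaining error term is zero
        k1 = n1.bit_length() - 1   # position of the leading 1 in n1
        k2 = n2.bit_length() - 1   # position of the leading 1 in n2
        r1 = n1 - (1 << k1)        # n1 with its leading 1 removed
        r2 = n2 - (1 << k2)        # n2 with its leading 1 removed
        # 2^(k1+k2) + r1 * 2^k2 + r2 * 2^k1
        product += (1 << (k1 + k2)) + (r1 << k2) + (r2 << k1)
        n1, n2 = r1, r2            # the error term r1 * r2 feeds the next pass
    return product


print(log_multiply(13, 11, iterations=1))  # 128
print(log_multiply(13, 11, iterations=2))  # 142
print(log_multiply(13, 11, iterations=3))  # 143, the exact product
```

For 13 x 11 = 143, one pass gives 128 (the error is 5 x 3 = 15), a second pass adds the approximation of 5 x 3 to reach 142, and a third pass recovers the exact product. The approximation never exceeds the true product, because the neglected error term is a product of non-negative residues.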
Considering the fault tolerance of neural networks, 3 iterations of the logarithmic multiplier are enough to maintain network accuracy [15]. An approximate unfolded logarithmic multiplier is therefore designed, as shown in Figure 5. It is composed of three stages. Each stage generates the multiplication result of the current iteration and the input of the next stage (the third stage only generates the multiplication result). The results of the three stages are finally merged into the final result. To improve throughput, a group of pipeline registers is inserted between stage 2 and stage 3. Each stage includes two 16-bit shifters (SH), a 16-bit adder, a 16-bit S-A module and an optional priority encoder (Pri-Enc). The Pri-Enc module replaces the complex leading one detector (LOD) module of the original design [14] to achieve a shorter delay without affecting the multiplier accuracy. The highest nonzero bit of the last two terms can be at most the $(k_1 + k_2 - 1)$-th bit, and the effective bit overlap region is less than 16 bits, so a 16-bit adder is sufficient. In the first iteration, $k_1^{-1}$ and $k_2^{-1}$ in the figure represent $(16 - k_1)$ and $(16 - k_2)$. They are used as control signals for the SH shifters, which shift the two multipliers so that they are aligned to the leftmost significant bit. The two shifted numbers are summed by the adder. Since the result is affected by the previous shift operation, it needs to be readjusted by the S-A module, which is composed of several sets of shift operation arrays. In the second and third iterations, the Pri-Enc module obtains the new $k_1^{-1}$ and $k_2^{-1}$, while the functionality of the other modules is the same as in the first stage.
2.3 DNN accelerator with hybrid bit-width and logarithmic multiplier
The top-level architecture of an EERA-DNN prototype system is shown in Figure 6.
It consists of an ARM7TDMI used as system controller, a 32KB scratch-pad memory (SPM) used as system buffer, two EERA-DNN accelerators for accelerating DNN networks, and other modules, including an Interrupt Controller (IntCtl), a Direct Memory Access Controller (DMAC), and an External Memory Interface (EMI).
As shown in Figure 6, each EERA-DNN accelerator consists of a controller, a sigmoid function module, a Huffman decoder, 16 processing elements (PEs), a Conf SRAM and a Dest SRAM. The EERA-DNN accelerator can directly access both the input data and the weights via DMA. The leading "1" indices of the weights are stored in the Conf SRAM. The Huffman decoder decodes the weights in the Conf SRAM and then transmits them to the control module, which distributes the weights to the PEs. The leading "1" index of the input vector is obtained through an LOD module [14] and is distributed to the PEs along with the input vector for the multiply-accumulate operations.
Each PE consists of the approximate unfolded logarithmic multiplier proposed above, eight 16-bit data input/output registers and multiplexers, a 2KB weight SRAM, a data input FIFO, and a resolver module, as denoted in Figure 3, for parsing the weights. A PE can be configured to perform multiplication operations, or addition operations using the adder inside the proposed multiplier. In addition, the interconnections between PEs can be reconfigured by setting the data input/output multiplexers of each PE, so EERA-DNN can process matrix multiplications and additions of the various sizes found in different DNN layers.
To compress the network, we first adopt the typical pruning method [9] and then use the proposed hybrid bit-width scheme to compress the network further. Two CNN models were trained and tested on the MNIST and ImageNet data sets: LeNet on MNIST and AlexNet on ImageNet. An RNN is trained using EESEN [17]. We use the THCHS-30 Speech Corpus [18], including 750 sentences, as the training set. All experiments were performed with the Caffe framework [19] and EESEN [17]. The network parameters and accuracy before and after compression are shown in Table II, where HB represents the proposed hybrid bit-width scheme. Taking LeNet as an example, the original top-1 error is 0.80%. With the typical pruning method [9], the network is compressed by 2.7X and the top-1 error is 0.89%. When the proposed hybrid bit-width scheme is further applied, the network is compressed by 7.0X overall and the top-1 error is 1.10%. The compression effects on the other networks, AlexNet and EESEN, are similar. The compression schemes reduce network storage by 7x to 8x across these three networks with negligible accuracy loss.
The total size of LeNet decreases from 1720k to 246k, AlexNet from 240 MB to 30 MB, and EESEN from 17 MB to 0.23 MB, which is of great significance to the optimization of neural networks.
Table III shows the results on the Extended Cohn-Kanade database (CK+) [16] and ImageNet, respectively, with AlexNet. The experimental results show that the approximate multiplier achieves about 94.55% and 67.12% recognition accuracy, while the standard multiplier achieves 94.63% and 67.26%, when trained and tested on the two data sets respectively. Therefore, the proposed logarithmic multiplier incurs a negligible loss of recognition accuracy and is suitable for DNN applications.
The EERA-DNN prototype system is simulated at netlist level in TSMC 45nm LP process technology. Timing and power consumption are evaluated at the 1.1V, 25 °C, TT process corner. Static timing analysis with PrimeTime after placement and routing shows that the critical path is 1.25ns, so the clock frequency reaches 800MHz. The area is about 4.34 mm² after placement and routing, and the power is 120mW at a throughput of 51.2GOPS. The area and power consumption are shown in Table IV. The comparisons with other accelerators are shown in Table V. The result
