ABSTRACT This paper proposes that approximation by reducing bit precision and using inexact multipliers can save power in a digital multilayer perceptron accelerator during MNIST classification (inference) with negligible accuracy degradation. Based on the error sensitivity precomputed during training, synaptic weights with lower sensitivity are approximated. Under given bit-precision modes, the proposed algorithm determines the bit precision of every synapse to minimize power consumption for a given target accuracy. Across the network, earlier layers can be approximated more because they have lower error sensitivity. The proposed algorithm saves 57.4 percent of power while accuracy degrades by about 1.7 percent. After approximation, retraining for a few iterations can improve accuracy while maintaining power consumption. The impact of different training conditions on the approximation is also studied. Training with small quantization error (less bit precision) allows more power saving in inference. The study also shows that a sufficient number of training iterations is important for approximation in inference. A network with more layers is more sensitive to the approximation.
I. INTRODUCTION
Neural networks (NNs) have been a powerful tool for solving complex non-linear problems in domains such as computer vision [1]. A neural network is composed of multiple layers, where each layer has multiple neurons connected to neurons in the same or previous layers through synaptic weights. The primary arithmetic operation in an NN is multiplication and accumulation (MAC), because the state of a neuron is determined by the weighted sum of the outputs of neurons in the previous/same layers with the synaptic weights. In a fully connected feedforward network such as the Multilayer Perceptron (MLP), each neuron is connected to all neurons in the previous layer. As the number of layers and/or neurons in an NN increases to cover complex and large-scale problems, the number of connections also increases; therefore, the total number of MAC computations for an NN with many layers increases significantly. This is a major challenge when applying NNs to systems that require high performance within a limited power budget.
Several research efforts have investigated approximation techniques that reduce the power of arithmetic computations in neural networks with graceful accuracy degradation [2]-[4], such as reducing the number of bits in fixed-point representation or using inexact multipliers. Specifically, the impact of approximation errors on the NN algorithm requires careful analysis to minimize accuracy degradation. Moreover, an important question is how to decide, among the many MAC operations in the NN, which set of computations should be approximated to obtain maximum power saving with minimum quality degradation. This paper is an extended version of previous work [3], which introduced techniques to find a 'near optimal' approximation scheme for each synaptic weight, minimizing power under a target accuracy in a feedforward NN (in particular an MLP), based on the sensitivity precomputed during training [3]. The approximation was realized by bit-precision control coupled with replacing some of the multipliers with inexact multipliers. In the proposed approach, after training, some of the synaptic weights were selectively approximated based on the error sensitivity already computed during training (approximate synapse). As an example, MNIST [5] (a well-known handwritten digit recognition application) shows that approximating synapses with bit-precision control (software approach) and inexact multipliers (hardware approach) can reduce the power dissipation of digital MLP hardware by about 57.4 percent while accuracy degradation is about 1.7 percent.
In this paper, the digital MLP system is designed for on-chip training, whereas previous work considered off-chip training [2]-[4]. Therefore, higher bit precision should be considered, and the impact of different training conditions with hardware constraints (bit precision during training, number of training iterations, and number of layers of the MLP) on the effectiveness of approximation during inference should be studied. Although the results from two different training conditions can be similar (similar inference accuracy without approximation), the amount of power saved by approximation differs with the training conditions. Further, we observe that re-training with approximated synapses can improve the accuracy without additional power overhead.
The main contribution of this paper is an approximation strategy for MLP (i.e., what to approximate); the well-known approximation techniques, such as precision control or inexact multipliers (i.e., how to realize the approximation), are based on prior works as mentioned in the paper. Further, this paper emphasizes how training conditions, such as bit precision during training, number of training iterations, and MLP structure, affect the power saving achieved during inference by these well-known approximation schemes.
The rest of paper is organized as follows: Section II presents the prior work; Section III analyzes the power consumption of digital MLP system; Section IV presents our proposed algorithm and power-quality results; Section V analyzes the impacts of training; Section VI concludes the paper.
II. PREVIOUS WORK
A. MULTILAYER PERCEPTRON
A multilayer perceptron is a simple feedforward network (no recurrent connections) with multiple layers [6]. All equations and notation in this paper follow [6]. Figure 1 illustrates an MLP with multiple layers. The state of each neuron is computed as the weighted sum of the outputs of the neurons in the previous layer with the synaptic weights:

$$x_j(k) = \sum_i w_{ji}(k-1)\,\varphi\big(x_i(k-1)\big) \qquad (1)$$

where $x_j(k)$ is the state of the $j$th neuron in layer $k$, $w_{ji}(k-1)$ is the synaptic weight connecting the $i$th neuron in layer $k-1$ to the $j$th neuron in layer $k$, and $\varphi(\cdot)$ is the non-linear activation function, such as the sigmoid or hyperbolic tangent function [6], that generates the output of the neuron.
For a given input, the output of the MLP is determined by feedforwarding from the input layer to the output layer through all synaptic weights (inference). In other words, the operation of the MLP is determined by the synaptic weights obtained by training. In training, all weights are randomly initialized and then updated using training data and their desired outputs (supervised training). Based on the error ($E(n)$) between the MLP output and the desired output, and on its back-propagation, the error sensitivity of the network output to each weight (gradient: $\partial E/\partial w_{ji}$) is computed to change the synaptic weights. Figure 1 (b) illustrates that the gradient is computed from the output layer back to the input layer to minimize the error at the output layer. According to the error sensitivity (gradient), each weight at iteration $n+1$ is updated as follows:

$$w_{ji}(n+1) = w_{ji}(n) - \eta\,\frac{\partial E(n)}{\partial w_{ji}(n)} \qquad (2)$$

where $\eta$ is the learning rate and $E(n)$ is the error at iteration $n$. According to (2), synaptic weights are updated based on the gradient and the learning rate until the output error is less than a predetermined threshold. Although MLP inference with given pre-trained weights needs only the feedforward phase (1) to compute the state of the neurons at the output layer, training requires both the feedforward phase (1) and the feedback phase (2), and these two phases are iterated until the error at the output layer is low; thus it demands a huge number of computations. According to both (1) and (2), multiplication and accumulation are the main arithmetic operations. Therefore, the multiplier-and-accumulator is the main processing element (PE) of MLP hardware.
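The two phases referred to as (1) and (2) can be illustrated with a minimal software sketch. It is only an illustration under stated assumptions, not the hardware datapath: the function names `forward` and `sgd_update`, the nested-list weight layout, and the choice of `tanh` as the activation are ours.

```python
import math

def forward(weights, inputs, phi=math.tanh):
    """Feedforward phase: every neuron state is a sum of multiply-accumulate (MAC) terms.

    weights[k][j][i] connects neuron i of one layer to neuron j of the next layer.
    `inputs` are treated as the outputs of the input layer (an assumption).
    """
    outputs = list(inputs)
    states_per_layer = []
    for layer in weights:
        # One MAC chain per neuron: weighted sum of the previous layer's outputs.
        states = [sum(w * o for w, o in zip(row, outputs)) for row in layer]
        states_per_layer.append(states)
        outputs = [phi(s) for s in states]  # activation generates each neuron's output
    return states_per_layer, outputs

def sgd_update(weight, gradient, eta):
    """Feedback phase for one weight: step against the error gradient, scaled by eta."""
    return weight - eta * gradient
```

Both functions consist only of multiplications and accumulations (plus the activation), which is why the MAC unit dominates the processing element.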
B. HARDWARE IMPLEMENTATION OF MLP
Although most previous MLP hardware studies used analog circuits because of their simple multiplier design, analog hardware is less effective as an accelerator in a digital system due to the overhead of analog-to-digital and digital-to-analog conversion [7].
Most digital MLP implementations were based on field programmable gate arrays (FPGAs), relied on off-chip learning, and solved simple problems such as XOR with a small number of neurons (fewer than 500) [8]-[11]. The design presented in [11] considered on-chip training; however, it is limited to only about 500 neurons and uses a large number of MAC units (288), making it power hungry. Even though a fully digital NN system with eight processing engines (FPGA) and an external memory controller (ARM core) has been implemented in [11], it also does not provide any power saving methods.
The most recent digital MLP hardware [12] uses multipliers based on 8- or 12-bit synaptic weights with a pre-computed alphabet set multiplier (ASM). To reduce the additional power overhead of the ASM, the authors constrained the number of ASMs; in other words, they quantized the multiplicand by rounding up or down. After initial training without constraints (no quantization), re-training is performed to minimize the quantization level until the target accuracy is reached. However, the energy-quality trade-off in both training and inference has not been studied before.
C. ERROR ANALYSIS OF MLP
Many studies have mathematically analyzed the error sensitivity of the MLP (network output) to small perturbations (uniformly or normally distributed) at the inputs and/or the weights [13], [14], [15]. An algorithm that propagates layer sensitivity to the next layer, from the input layer to the output layer, has been proposed [13]. However, the impact of error during training on inference was not explained. In [14], the sensitivity of a neural network among different sets of weight matrices is discussed.
More interestingly, the effects of analog noise (small perturbation) in synaptic weights during training on the fault tolerance and performance of MLP are studied in [15]. This work provided mathematical fundamentals and simulation results showing that a small perturbation of the synaptic weights during training can improve the fault tolerance of the MLP at inference under errors in the synaptic weights or inputs, and can accelerate training, i.e., the cost function of the MLP converges faster (fewer iterations). But too much error significantly increases training time and cannot guarantee the accuracy of the target operation. The small perturbation, however, is limited to a uniform distribution; therefore this error distribution is not practical for the quantization error caused by limited bit precision. The probability distribution of the quantization error due to limited bit precision is studied in Section III. The impact of bit precision on the number of iterations for XOR training (small perturbation) is also discussed in [10], showing results similar to [15].
D. APPROXIMATE COMPUTING
Recently, approximate computing by reducing bit precision [2], [3], [16] or using inexact multipliers [3], [4], [17] in neural networks has been proposed. The closest work is [2], where some LSBs of selected neurons' states are forced to zero to reduce the power dissipation of the MAC. The drawback of approximate neurons compared to approximate synapses is discussed in Section IV.
The initial version of this paper, presented in [3], introduced the concept of approximate synapses (Section IV). The major extension presented in this article is understanding the role of training conditions in approximate synapses, and the role of re-training. As we show later, training plays a major role in improving the effectiveness of approximate synapses (Section V).
An inexact multiplier, which simplifies the design by allowing bit errors on some LSBs, is designed with an iterative logarithmic multiplier having a computation error of less than 1 percent. Based on the probabilistic logic minimization algorithm, different types of inexact multipliers were designed and applied to a simple MLP network [17]. An inexact multiplier can reduce both power consumption (2.67x) and area overhead (1.46x). In [17], only a small MLP network was studied, and all neurons and weights are approximated.
Although previous work applied approximate computing to MLP inference to save power, there was no study of the impact of approximation on both inference and training. Moreover, the relationship between approximated training and approximated inference was not studied before.
III. POWER ANALYSIS OF FULL SYSTEM OF DIGITAL MLP
In this section, we briefly explain the full digital MLP system introduced in [3]. It is synthesized in 130 nm CMOS, and its power profiles are simulated for different bit precisions and types of MAC units [3].
A. ANALYSIS ON DYNAMIC POWER OF MAC
In MLP (1), we already noted that the multiplier-and-accumulator is the primitive processing element in digital MLP hardware. To compare the power consumption of accurate and inexact MACs at different bit precisions, these units are synthesized for 32-bit fixed point in 130 nm CMOS to operate at 200 MHz. For non-forced bits, an input activity of 0.3 is assumed for dynamic power analysis [18]. Although some papers have presented MLP hardware with 8-bit synaptic weights [12], [19], these papers do not consider on-chip training. As we explain later, for on-chip training, digital MLP hardware needs higher precision because of the gradient. Moreover, as the optimal bit precision for both inference and training strongly depends on the benchmark or network structure, we assume 32-bit fixed point for both synaptic weights and states as a baseline to avoid loss of generality. Figure 2 compares the power dissipation of both types of multiplier at different bit precisions. As shown in Figure 2, reducing the bit width from 32 to 8 bits saves 53 percent of the power. The error of the multiplier at a given precision is also simulated by multiplying two randomly generated operands and comparing the result with the 32-bit reference. Figure 3 (a) shows the cumulative error distribution for 16-bit fixed point ($Q_{1,7,8}$: 1 sign bit, 7 integer bits, and 8 fractional bits). The probability of quantization error increases as the bit width decreases from 32 bits, and decreasing to 12 or 8 bits produces quite large errors due to overflow.
In the same manner, the cumulative probability of the error due to the inexact multiplier with 20-bit recovery (the 20 MSBs have no error) is simulated for each bit precision (Figure 3(b)). As there is no error in the 20 MSBs, more than 95 percent of errors are smaller than $10^{-2}$. Interestingly, the additional error due to the inexact multiplier decreases at reduced bit precision, since some of the LSBs are forced to zero. Therefore, errors from the inexact multiplier are hidden, which allows more aggressive power saving with very small additional error. These error distributions (Figure 3) are used in the MLP simulation to study the impact of approximation on the MLP under a given precision.
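The multiplier error experiment described above (quantize two operands, multiply, compare with a 32-bit reference) can be sketched in software as follows. This is a rough model under our own assumptions: values are truncated toward negative infinity, out-of-range values saturate rather than wrap, and the helper names `quantize` and `product_error` are hypothetical.

```python
import math

def quantize(value, int_bits=7, frac_bits=8):
    """Truncate `value` to a signed fixed-point format with 1 sign bit,
    `int_bits` integer bits and `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    raw = math.floor(value * scale)               # drop bits below 2^-frac_bits
    max_raw = (1 << (int_bits + frac_bits)) - 1
    min_raw = -(1 << (int_bits + frac_bits))
    raw = max(min_raw, min(max_raw, raw))         # saturate on overflow (assumption)
    return raw / scale

def product_error(a, b, frac_bits):
    """Absolute error of a reduced-precision product against a 32-bit (Q1.7.24) reference."""
    reference = quantize(a, 7, 24) * quantize(b, 7, 24)
    approx = quantize(quantize(a, 7, frac_bits) * quantize(b, 7, frac_bits), 7, frac_bits)
    return abs(reference - approx)
```

Evaluating `product_error` over many random operand pairs and sorting the results would give an empirical cumulative error distribution of the kind plotted in Figure 3(a).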
For the rest of this paper, the amount of power saving by approximation is computed using this simulation result ( Figure 2 ).
B. FULL SYSTEM OF DIGITAL MLP
The full system of digital MLP is implemented in 130 nm CMOS based on the system architecture shown in In other words, MNIST [5] with a three-layered MLP (784-144-10) can be operated (inference) at 328,000 frames (28 by 28 pixels) per second. To maintain maximum system throughput (full utilization), the number of PEs and SRAMs and their operating frequencies are obtained under the given memory specification (Wide I/O2 [20]). In the full MLP system, the critical path lies in the accurate MAC units (4.8 ns). The PE design, approximation controller, and DRAM-SRAM interface controller are designed in Verilog, synthesized using Synopsys Design Compiler [21], and placed and routed using Cadence SoC-Encounter [22]. The detailed design of the digital MLP system is explained in [3].
IV. APPROXIMATE SYNAPSE SELECTION ALGORITHM
In this section, we briefly explain the idea of approximating synapses and its benefit compared to approximating neurons (Section IV-A). Based on this idea, the algorithm that selects synapses and assigns bit precision to minimize power under a target accuracy [3] is illustrated (Section IV-B). The proposed algorithm is compared with other approximation approaches for a single PE or the full system in terms of the power and accuracy trade-off (Section IV-C). We also analyze the amount of approximation for each layer in three-layer and four-layer MLPs (Section IV-D).
A. DESIGN METHODOLOGY
1) APPROXIMATE SYNAPSE
The main idea of the approximation is to select synapses that have less impact on the result at the output layer. It is similar to [2], but we approximate synapses instead of neurons [3]. The error sensitivities of the weights ($\partial E/\partial w_{ji}$ in (2)) are already computed during training with the back-propagation algorithm. Based on the pre-computed error sensitivities sorted in ascending order, synapses with lower error sensitivity are selected, whereas neurons are selected in [2]. Figure 5 compares the idea of selecting neurons with that of selecting synapses: selecting synapses can avoid a synapse with high error sensitivity, while selecting neurons may include one because it selects the neuron with low average error sensitivity.

FIGURE 3. The cumulative probability of (a) quantization error due to precision control and (b) the additional error induced by using an inexact multiplier with error correction of 20 MSBs (x-axis is logarithmic scale).

In conclusion, selecting synapses is a much finer-grained approach, which is effective in reducing the power consumption of the digital MLP hardware under a given accuracy degradation.
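The difference between the two selection policies can be made concrete with a small sketch. Only the averages 0.9 and 0.93 and the outlier sensitivity 3.0 come from the Figure 5 example; the remaining per-synapse values, the dictionary layout, and both function names are hypothetical and chosen only so that the averages match.

```python
# Error sensitivities of the synapses feeding two neurons (illustrative values).
SENSITIVITY = {
    "A": [3.0, 0.2, 0.1, 0.3],      # average 0.9, but contains the largest synapse (3.0)
    "B": [0.9, 0.95, 0.92, 0.95],   # average 0.93
}

def select_by_neuron(sens, n_neurons):
    """Neuron-level policy: approximate every synapse of the neurons with the lowest average."""
    ranked = sorted(sens, key=lambda name: sum(sens[name]) / len(sens[name]))
    return [(name, i) for name in ranked[:n_neurons] for i in range(len(sens[name]))]

def select_by_synapse(sens, n_synapses):
    """Synapse-level policy: approximate the individually least sensitive synapses."""
    flat = sorted((s, name, i) for name, row in sens.items() for i, s in enumerate(row))
    return [(name, i) for _, name, i in flat[:n_synapses]]
```

With these values, `select_by_neuron(SENSITIVITY, 1)` picks all four synapses of neuron A, including the one with sensitivity 3.0, whereas `select_by_synapse(SENSITIVITY, 4)` approximates the same number of synapses without touching it.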
2) APPROXIMATION BY PRECISION CONTROL
We define $m_{prec}$ as the portion of approximated synaptic weights of entire synaptic weights. Then the total power consumption ($P_{tot}$) of all PEs becomes
where $N$ is the total number of synaptic weights, $n_k$ is the number of synaptic weights at the $k$th bit precision, $P_k$ is the power consumption of a PE at the $k$th bit precision, and $P_{m_{prec}}$ is the power of a PE at 32-bit precision (without approximation). To simplify the analysis, let us assume that we have two bit-precision modes ($m_{prec} = 2$: 32 and 12 bits). To guarantee the power constraint ($P_{target}$), the ratio of 12-bit precision $r_{12bit} = n_{12bit}/N$ should follow:
For the first equation of (4), right-hand side represents total power related to N synapses.
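A worked form of the two-mode power constraint, consistent with the definitions of $N$, $n_k$, $P_k$, $P_{target}$, and $r_{12bit}$, is sketched below. It is a reconstruction under two assumptions and may differ in form from (3) and (4): total PE power is taken as the count-weighted sum of per-precision PE power, and $P_{target}$ is taken as a bound on that total. $P_{32bit}$ and $P_{12bit}$ are shorthand introduced here for $P_k$ at the two modes.

```latex
% Assumed total power: count-weighted sum over the precision modes
P_{tot} = \sum_{k=1}^{m_{prec}} n_k P_k, \qquad \sum_{k=1}^{m_{prec}} n_k = N

% Two modes (32 and 12 bit), with n_{32bit} = N - n_{12bit}:
P_{tot} = n_{12bit} P_{12bit} + (N - n_{12bit}) P_{32bit} \le P_{target}

% Dividing by N and solving for r_{12bit} = n_{12bit}/N (using P_{32bit} > P_{12bit}):
r_{12bit} \ge \frac{P_{32bit} - P_{target}/N}{P_{32bit} - P_{12bit}}
```

Under this reading, a tighter power target raises the minimum share of synapses that must run at 12-bit precision, and a larger power gap between the two modes lowers it.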
As an example benchmark, MNIST, a handwritten digit recognition task [5], is selected. For MNIST, a three-layer MLP is used; the numbers of neurons in the layers are 784, 144, and 10. Based on Equation (4), the percentages of approximation for different power targets and the corresponding normalized accuracy rates are simulated (Figure 6, red dashed line). The normalized accuracy rate (recognition rate) is relative to the accuracy obtained with 64-bit floating point using MATLAB. According to Figure 6, about 50 percent of the total power can be saved within 6 percent accuracy degradation. In this paper, approximation by reducing bit precision is called the software approach.
3) APPROXIMATION BY INEXACT MULTIPLIER
For more aggressive power saving, some of the multipliers in the MLP hardware are replaced with inexact multipliers, which can save power by 51 percent on average ( Figure 2 ) and its additional error is hidden since affected LSBs are forced to zero by precision control (Figure 3(b) ).
Since the number of inexact multipliers is fixed during hardware design (hardware approach), some accurate synapses need to be computed by inexact multipliers for full utilization of the processing elements. Equation (5) shows the condition under the power constraint ($P_{target}$) with $n_{prec}$ bit-precision types and the ratio of inexact PEs ($g = 0.4$ as an example), similar to (4):

$$+ \max(g - x, 0) \cdot P_{app(n_{prec})} + \min(x, g) \cdot P_{app(m)}, \qquad (5)$$

where $r_k$ is the ratio of the $k$th bit precision among all, $x = \sum_{k=1}^{m} r_k$ represents the ratio of bit precisions that are entirely/partially fed into inexact PEs, and $P_{acc(k)}$ ($P_{app(k)}$) is the power consumption of an accurate (approximate) PE operating at the $k$th bit precision. Similar to Section IV-A2, two bit precisions are assumed (32 and 12 bit). Figure 6 shows that applying inexact PEs can achieve higher accuracy for the same power consumption.

FIGURE 5. Diagram illustrating the advantage of approximating synapses over approximating neurons on a given trained feedforward NN. In selecting neurons, all synapses connected to neuron 'A' are selected since 'A' has an average error sensitivity of 0.9 while neuron 'B' has 0.93. This ignores the synapse with error sensitivity 3.0 in 'A', which is the largest among all synapses.
B. GREEDY ALGORITHM TO CHOOSE NEAR-OPTIMAL BIT PRECISION FOR LOW-POWER DESIGN
Under a given finite set of bit precisions, a fine-grained greedy algorithm (Algorithm 1) was introduced in [3] to determine the near-optimal bit precision for all synaptic weights, minimizing power under the target accuracy. Starting from the lowest precision mode (lowest accuracy and lowest power consumption), it generates two different bit-precision ratio sets ($R_{trial1}$, $R_{trial2}$) to improve accuracy. After a feedforward pass with the two candidates, it selects the set with the higher score (the weighted sum of the normalized quality increase and the negative of the normalized power increase). It iterates until the target accuracy is met. A detailed explanation of Algorithm 1 is given in [3]. After $R_{prec}$ is decided, the type of MAC (accurate or inexact) is assigned to all synaptic weights based on the sorted error sensitivity (Figure 7(a)). Figure 7 (b) shows the trace of Algorithm 1 with three bit-precision modes (32, 16, and 8 bit) and a target accuracy of 0.9. As a result, a recognition rate of 0.9017 is achieved with 56 percent power saving compared to the result with all 32-bit computation using 100 percent accurate PEs (recognition rate of 0.9053). Although four different bit precisions (32, 16, 12, and 8 bit) are also tried, the result is similar to the case with three bit precisions (Figure 7(c)). However, two bit-precision modes are not enough to save power for the MLP with the MNIST application.

Algorithm 1 (excerpt)
    ...; D_test; P_est)
    score1 ← compute.score(Q_trial1; P_trial1; Q; P)
    score2 ← compute.score(Q_trial2; P_trial2; Q; P)
    if score1 > score2 then
        [R_prec; Q; P] ← update(R_trial1; Q_trial1; P_trial1)
    else
        [R_prec; Q; P] ← update(R_trial2; Q_trial2; P_trial2)
    ...

Figure 8 compares the proposed algorithm with the software approach (only precision control), the hardware approach (only inexact multipliers), and approximate neurons with precision control (AxNN [2] with three-level bit precisions: 32, 16, and 8 bits) for a single PE.
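As an illustration of the greedy search of Algorithm 1 (Section IV-B), the sketch below starts with every synapse at the lowest precision and repeatedly promotes a slice of synapses to the next higher mode, keeping the candidate with the best quality/power score. It is not the algorithm of [3]: how the two trial sets are generated, the normalization of the score, the step size, and the toy accuracy and power models are all assumptions made for this example.

```python
def greedy_precision(n_modes, evaluate, power_of, target_q, units=10, alpha=0.5):
    """Greedy search over precision ratios, ordered from highest to lowest precision.

    evaluate(ratios) -> accuracy and power_of(ratios) -> power are caller-supplied.
    Ratios are tracked in integer `units` to avoid floating-point drift.
    """
    alloc = [0] * (n_modes - 1) + [units]          # start: everything at lowest precision
    to_ratio = lambda a: [u / units for u in a]
    q, p = evaluate(to_ratio(alloc)), power_of(to_ratio(alloc))
    p_ref = power_of([1.0] + [0.0] * (n_modes - 1))  # all synapses at highest precision
    while q < target_q:
        trials = []
        for src in range(n_modes - 1, 0, -1):      # promote one unit from mode src to src-1
            if alloc[src] > 0 and len(trials) < 2:
                trial = list(alloc)
                trial[src] -= 1
                trial[src - 1] += 1
                trials.append(trial)
        if not trials:
            break                                  # already at full precision everywhere
        best = None
        for trial in trials:
            q_t, p_t = evaluate(to_ratio(trial)), power_of(to_ratio(trial))
            # Reward quality gain, penalize power increase (both normalized).
            score = alpha * (q_t - q) / target_q - (1 - alpha) * (p_t - p) / p_ref
            if best is None or score > best[0]:
                best = (score, trial, q_t, p_t)
        _, alloc, q, p = best
    return to_ratio(alloc), q, p

def toy_accuracy(ratios):
    """Hypothetical stand-in for a feedforward pass on test data."""
    r32, r16, r8 = ratios
    return 0.98 - 0.01 * r16 - 0.10 * r8

def toy_power(ratios):
    """Hypothetical normalized PE power for the 32/16/8-bit modes."""
    r32, r16, r8 = ratios
    return 1.00 * r32 + 0.60 * r16 + 0.47 * r8
```

With the toy models, `greedy_precision(3, toy_accuracy, toy_power, target_q=0.95)` stops at a mix of 80 percent 16-bit and 20 percent 8-bit synapses. In a real flow, `evaluate` would run inference on the test set with the candidate precisions, which is the expensive step of each iteration.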
For proposed algorithm and AxNN [2] , 5 percent accuracy degradation is allowed.
For [2], 80 percent of neurons are selected as approximate and their precision is reduced to 8 bit (the rest use 16 bit) by using their algorithm, which we reproduced as far as possible. The proposed greedy algorithm effectively reduces power consumption (66 percent) with negligible quality degradation (-5 percent) on the given digital MLP hardware. Table 1 compares different digital MLP systems: system1, the baseline system with all accurate PEs at 32-bit fixed point; system2, 50 percent of 16-bit computation and the rest with 8 bits (software approach) but all accurate PEs; system3, 100 percent inexact PEs at 32-bit fixed point (hardware approach only); and the proposed system with 40 percent approximate PEs (60 percent accurate PEs) and bit precisions obtained by Algorithm 1 targeting 0.96 accuracy.

FIGURE 8. Comparison of the proposed algorithm to just doing software (reduced bit-precisions) or hardware (approximate PEs) approach. Proposed approach couples both reduced bit-precision and approximate PE. Note that baseline accuracy (without any approximation) is improved to 0.98 from 0.91 [3] and power consumption of additional hardware (Section IV-F) for proposed algorithm is also considered.
The power consumption of PEs in the proposed system is 0.90 W while that of other components is same. The PE power is reduced by 63.56 percent compared to the system 1. However, due to the noticeable SRAM power, the system power is reduced by 41 percent compared to system1. This amount of power saving is achieved by sacrificing only 1.45 percent of recognition accuracy.
Currently, SRAM takes 51 percent of the block area and 58 percent of the power consumption. The SRAM capacity is set to twice the DRAM page size (2 × 2 KB) to fully utilize the burst mode of the DRAM. As the SRAM size is twice the page size, half of the SRAM can be filled from DRAM while the other half pushes data to the PEs. To reduce SRAM capacity, we may extend our approximation, for example by reducing bit precision as preprocessing before storing the data in SRAM. Alternatively, synaptic weight compression [23], [24] can be applied to reduce the required memory size.
D. LAYER-BY-LAYER DISTRIBUTION OF APPROXIMATE SYNAPSES
Algorithm 1 sorts all synaptic weights of the neural network regardless of layer. Therefore, each layer has different percentages of approximation under a given network-wide bit-precision mix (e.g., 32 bits: 0 percent, 16 bits: 20 percent, 8 bits: 80 percent for the entire network). Figure 9 shows the percentage of each bit precision in each layer to understand which layer allows more approximation. For example, in the MLP with three layers, the first synaptic connections (between the first and second layers) have 80 percent at 8 bits and 20 percent at 16 bits, while the second synaptic connections (between the second and last layers) have 36 percent at 8 bits and 64 percent at 16 bits. In other words, synaptic connections between earlier layers can be approximated more. The same observation is made when the number of layers in the MLP is increased (see the result for the four-layer MLP in Figure 9). It is also observed in [19]. The analysis suggests that future work may consider using a different sensitivity for each layer, i.e., layer-wise approximation.
E. PROPOSED ALGORITHM IN COMPLEX NETWORK
In this section, we apply the proposed algorithm to more complex networks, a recurrent neural network (RNN) for human activity recognition [29] and an MLP for CIFAR-10 [28], to validate the algorithm on complex networks. The MLP network for CIFAR-10 is 3,072-1,024-128-10, which is bigger than the MLP for MNIST.
Although the numbers of hidden layers and of neurons in the hidden layer of the RNN are low (about 100 in this paper), its recurrent connection can be transformed into T fully connected layers after time unfolding (inference based on T steps of history); in other words, it is a very deep T-layered network (we set T to 50). Three different human activity recognition video datasets (KTH [25], UCFG [26], and USC [27]) are trained using backpropagation. An RNN with a single hidden layer is used.
After training the network, the proposed algorithm is applied with three bit-precision modes (16, 12, and 8 bit). Figure 10 shows that the proposed algorithm can save 14 to 36 percent of power while accuracy degrades by less than 5 percent compared to the baseline. The baseline for each benchmark is shown in Figure 10. We can see that assigning a different bit precision to each synaptic weight based on the gradient can save more power while maintaining the baseline accuracy, even considering the additional power overhead of the proposed algorithm (Section IV-F).

TABLE 1. The system analysis in terms of power, area, and recognition accuracy.

This shows that the proposed algorithm can also be applied to complex networks trained using backpropagation, such as a big MLP or an RNN (a very deep network), and that it can save power with little accuracy degradation.
F. OVERHEAD OF PROPOSED ALGORITHM IN ON-CHIP LEVEL
For an on-chip implementation of the proposed algorithm, two additional hardware modules are required. First, we need to sort the weights based on their gradients and group them for the given bit-precision set. After grouping, we need to force some LSBs of each weight to zero and assign the MAC unit type (accurate or inexact).
1) GROUPING SYNAPTIC WEIGHTS
The most important module for the hardware implementation of the proposed algorithm is the sorting engine, whose input is the gradient stream from memory after training. The design of the sorting engine needs to consider the trade-off between power and performance. A simple and low-power approach is a bubble sorter using a single 32-bit fixed-point comparator. A 32-bit comparator operating at 800 MHz in 130 nm CMOS requires 675 µm² of area and 74.5 µW of power, incurring negligible overhead. However, the worst-case complexity of bubble sort is $O(n^2)$; therefore, the latency for sorting all gradients of MNIST (784-144-10) is 16.4 seconds with an 800 MHz bubble sorter. Hence, the worst-case latency is significantly high. The latency can be improved using a more complex parallel sorting engine, such as a radix sort engine [30]. We can estimate the area/power overhead of a radix sorter in a 130 nm process following prior work [30]. For example, a 250 MHz radix sorter in 130 nm requires 2.6 mm² of area and 730 mW of power, incurring significant overhead. However, as the complexity of radix sort is linear in the number of elements, the total latency for sorting all gradients of MNIST (784-144-10) is estimated as 0.2 ms.
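The bubble sort latency quoted above can be checked with a back-of-the-envelope calculation. The sketch assumes one comparison per clock cycle and n² comparisons in the worst case; counting one bias weight per neuron is a further assumption on our part, made because it lands on the quoted 16.4 s after rounding.

```python
def bubble_sort_latency_s(n_elements, clock_hz):
    """Worst-case latency with a single comparator: about n^2 comparisons, one per cycle."""
    return n_elements ** 2 / clock_hz

# Gradients to sort for the 784-144-10 MLP.
N_WEIGHTS = 784 * 144 + 144 * 10                  # synaptic weights only
N_WITH_BIAS = (784 + 1) * 144 + (144 + 1) * 10    # assuming one bias weight per neuron

latency = bubble_sort_latency_s(N_WITH_BIAS, 800e6)   # roughly 16.4 seconds
```

Without the bias terms the same estimate gives about 16.3 s, so either way the quadratic cost puts the single-comparator sorter in the range of tens of seconds.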
Considering the latency-power trade-off associated with the sorting engine, we propose an alternative approach for the hardware implementation. We observe that the proposed algorithm only needs to group the weights based on the gradients; it does not necessarily require sorting the weights. Therefore, we propose to perform on-chip binning of the weights rather than sorting. After training, based on a given number of bins (e.g., 256), we compute the bin centers and store them in SRAM (32 bit × 256 = 1 KB) to generate a histogram of the gradients. A gradient from external memory is compared with the 256 bin centers in parallel using 256 32-bit comparators, and it triggers one of 256 16-bit counters to build the histogram. After all gradients are compared, the histogram is transformed into the cumulative distribution function (CDF) of the gradients. Based on the precision ratios, we can find the set of bins for each precision mode. This approach is illustrated in Figure 11. A binning approach may not generate exact percentages; however, the error introduced by binning is negligible. For example, in the MNIST (784-144-10) application, if we want 20 and 80 percent of the weights in 16 and 8 bit, respectively, the binning assigns 19.03 percent of the weights to 16 bit and 80.97 percent to 8 bit. The proposed approach requires additional on-chip memory to generate and store the CDF, but has linear complexity ($O(n)$). Considering 800 MHz SRAM, the latency of the proposed approach for the MNIST application is 0.2 ms, which is negligible compared to the training latency (2,000 epochs: 17.9 ms). The additional hardware (256 comparators, 1 KB SRAM, 256 16-bit counters) requires 0.45 mm² of area (3 percent overhead) and 43 mW of power (2 percent overhead). The energy consumption of the algorithm (8.6 µJ) is also negligible compared to the training energy (74.47 mJ for 2,000 epochs).
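The binning idea can be sketched in software as a histogram followed by a cumulative scan. This is only a functional model under our assumptions: bins are uniform between the smallest and largest gradient, the hardware's parallel comparison against bin centers is replaced by an index computation, and the names `bin_gradients` and `threshold_bin` are hypothetical.

```python
def bin_gradients(grads, n_bins=256):
    """Single pass over the gradients: count how many fall into each uniform bin."""
    lo, hi = min(grads), max(grads)
    width = (hi - lo) / n_bins or 1.0             # guard against identical gradients
    counts = [0] * n_bins
    for g in grads:
        idx = min(int((g - lo) / width), n_bins - 1)
        counts[idx] += 1                          # one counter per bin, as in the hardware
    return lo, width, counts

def threshold_bin(counts, ratio):
    """Walk the CDF and return the first bin whose cumulative share reaches `ratio`,
    together with the share actually obtained."""
    total = sum(counts)
    running = 0
    for idx, count in enumerate(counts):
        running += count
        if running / total >= ratio:
            return idx, running / total
    return len(counts) - 1, 1.0
```

Because a whole bin is assigned to one precision mode, the obtained share can only be as fine as the bin populations allow, which is why a requested 80/20 split comes out as a nearby value such as the 80.97/19.03 split reported above.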
2) CONTROLLING APPROXIMATION DURING THE INFERENCE
After grouping, the last ($\log_2(N) + 1$) bits of each synaptic weight are assigned to represent the approximation flag: $\log_2(N)$ bits to represent the $N$ bit-precision modes and 1 bit to mark whether the inexact MAC is used (the result of Algorithm 1). In our design (3 bit-precision modes: 32, 16, and 8 bit), the last 3 bits of each synaptic weight are assigned to the approximation flag. Therefore, there is no additional overhead for storing the 'approximation information'. Based on the approximation flag, the approximation controller in each PE checks the last 3 LSBs of a synaptic weight, forces some of the LSBs to zero using 32 1-bit 2-to-1 MUXs, and delivers the synaptic weight to the accurate or inexact MAC unit (32-bit 2-to-1 MUX). Due to its simplicity, its area (2 percent) and power (4 percent) overheads are negligible compared to the 32-bit fixed-point MAC units.
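The flag handling can be modeled with a few bit operations. The bit layout used here (two mode bits above one inexact bit) and the mode encoding in `MODE_BITS` are assumptions for illustration; only the 3-bit flag width and the LSB forcing follow the description above.

```python
MODE_BITS = {0: 32, 1: 16, 2: 8}   # hypothetical encoding of the three precision modes
WORD_MASK = 0xFFFFFFFF
FLAG_MASK = 0b111

def pack(weight_word, mode, inexact):
    """Overwrite the 3 LSBs of a 32-bit weight word with the approximation flag."""
    return (weight_word & WORD_MASK & ~FLAG_MASK) | (mode << 1) | int(inexact)

def unpack(word):
    """Model of the approximation controller: read the flag, force LSBs to zero,
    and report which MAC type should receive the weight."""
    mode = (word >> 1) & 0b11
    inexact = bool(word & 1)
    keep = MODE_BITS[mode]                         # number of MSBs that stay active
    mask = (WORD_MASK << (32 - keep)) & WORD_MASK
    mask &= ~FLAG_MASK                             # flag bits never reach the MAC as data
    return word & mask, keep, inexact
```

For example, `unpack(pack(0x12345678, 1, True))` yields the data word `0x12340000` routed to an inexact MAC at 16-bit precision. Reusing the LSBs keeps the memory footprint unchanged, at the cost of three bits of precision in the 32-bit mode.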
Total hardware overhead for proposed algorithm is illustrated in Table 2 . Compared to base architecture, hardware overhead is 7 percent in area and 3 percent in power.
V. INTERACTION BETWEEN TRAINING CONDITIONS AND APPROXIMATION DURING INFERENCE
In Section IV, we trained the network using 64 bits floating point without any approximation (off-chip training) using MATLAB. By storing look-up table for both non-linear activation function and its derivative, digital MLP system illustrated in Section III can train the network as well (on-chip training). As approximation is based on the result of training (error sensitivity), Section V studies the impact of training conditions (bit precision, max. number of iterations, and number of layers in network) on the approximated inference.
A. DIFFERENT BIT PRECISION DURING THE TRAINING
1) ANALYSIS ON THE ITERATIONS AND ACCURACY OF TRAINING
Instead of 64-bit floating point using MATLAB on a PC (off-chip training), fixed point with different bit precisions (on-chip training) is used for training MNIST [5] and CNAE-9 [31] (classifying businesses into nine categories based on their text descriptions). As we explained in Section II-C, error during training can change not only the MLP accuracy but also the number of iterations. Since the number of iterations during training is critical for both on-chip and off-chip training, the impact of quantization on the number of iterations should be studied. Figure 12 shows the number of iterations and the accuracy of MNIST and CNAE-9 for different training bit precisions. In this experiment, four different bit precisions were tried for training: $Q_{1,7,24}$ (32-bit), $Q_{1,7,20}$ (28-bit), $Q_{1,7,16}$ (24-bit), and $Q_{1,7,12}$ (20-bit). For inference (classification), 32-bit fixed point is used in all cases. Training is stopped when the root mean square error computed at the output layer falls below the threshold for each benchmark. The maximum number of iterations is limited to 10,000.
According to Figure 12, higher bit precision is required for training than for inference; we need at least 24 bits for training while we can use 16 bits for inference. The main reason for the high bit precision in training is the gradient [32]. For example, during the training of MNIST with 64-bit floating point, the minimum magnitude (resolution) of the gradient ranges over $2^{-20} \sim 1^{-20}$. As explained in Section II-C, a small perturbation due to limited precision may reduce the number of training iterations [15]; this is observed for CNAE-9, where 24-bit and 28-bit precision need fewer iterations than 32 bits. For 20 bits, however, the number of iterations increases dramatically as the amount of error increases. For MNIST, the statement from [15] that a small perturbation during training decreases the number of iterations does not hold. First of all, the small perturbation in [15] is uniformly distributed, while the quantization error for 24 bits is always positive since the LSBs are forced to zero, and its maximum value is around $1.5 \times 10^{-5} \approx 2^{-17} + \cdots + 2^{-21}$ (since the last 3 LSBs are the approximation flag). As the quantization error is always positive, the impact of the accumulated error differs from that of accumulated uniformly distributed error, whose terms cancel each other out in the accumulation [15]. Moreover, the quantization error exists in every operand (states, weights, and gradients), while the small perturbation exists only in the weights [15].
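The bound on the quantization error and its one-sided sign can be checked with a few lines of integer arithmetic. The sketch assumes that 24-bit training corresponds to forcing the low 8 fractional bits of a 32-bit word with 24 fractional bits to zero, and that the 3 flag LSBs carry no numeric value; the constant and function names are illustrative only.

```python
FRAC_BITS = 24          # fractional bits of the 32-bit format
FLAG_BITS = 3           # LSBs reserved for the approximation flag
DROPPED_FRAC_BITS = 8   # fractional bits lost when going from 24 to 16

# Value-carrying bits that are forced to zero: raw bit positions 3..7,
# i.e. the weights 2^-21 ... 2^-17.
error_mask = ((1 << DROPPED_FRAC_BITS) - 1) & ~((1 << FLAG_BITS) - 1)
max_error = error_mask / (1 << FRAC_BITS)   # about 1.48e-5

def truncation_error(raw_word):
    """Original value minus truncated value, in real units (never negative)."""
    value_bits = raw_word & ~((1 << FLAG_BITS) - 1)
    truncated = raw_word & ~((1 << DROPPED_FRAC_BITS) - 1)
    return (value_bits - truncated) / (1 << FRAC_BITS)
```

Since clearing LSBs of a two's-complement word can only decrease its value, `truncation_error` is non-negative for positive and negative weights alike. The errors therefore accumulate in one direction instead of cancelling, unlike a zero-mean uniform perturbation.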
This emphasizes that selecting an appropriate bit precision is critical for training quality (accuracy) and for energy efficiency in training (power consumption and training iterations). Moreover, the impact of different bit precisions during training on the approximation in inference should be studied to maximize power saving (Section V-A3).
2) COMPARING APPROXIMATE NEURONS AND SYNAPSES FOR DIFFERENT TRAINING BIT PRECISION
Approximate synapses [3] and approximate neurons [2] are compared for different bit precisions in training (Figure 13). For each training bit precision, error sensitivities are pre-computed and applied to the proposed algorithm to approximate neurons or synapses. In both approximation approaches, the number of approximated synapses is the same. The accuracy degradation budget is set to 8 percent; that is, 8 percent accuracy degradation from inference with 32-bit fixed-point precision ($Q_{1,7,24}$) is allowed for each training condition. First of all, CNAE-9 is simple enough to run with 8-bit fixed point for either synapses or neurons; therefore, there is no difference between approximate neurons and synapses. In contrast, for MNIST, the approximate synapse approach can assign 8-bit precision to more synapses under all training conditions. Therefore, approximate synapses can save more power than approximate neurons under all training conditions. Interestingly, for MNIST, the approximate synapse approach achieves a higher 8-bit precision ratio when MNIST is trained with low precision. This means that training with fewer bits (with small quantization error) can make the MLP more robust to small perturbations during inference, as we explained in Section II-C; more synapses can be approximated. For approximate neurons, however, this trend is not observable. Training with lower bit precision leaves less room to approximate neurons during inference. As the error due to an approximate neuron is not a small perturbation (it may approximate some synapses that are highly sensitive to the error at the output), an MLP trained with fewer bits cannot overcome the error from the approximate neuron (more accuracy degradation). The relationship between bit precision in training and approximation in inference is covered in the next section.
3) IMPACT OF DIFFERENT BIT PRECISIONS IN TRAINING ON POWER SAVING IN INFERENCE
For the rest of the paper, we use MNIST as the example to study the relationship between training conditions and approximation in inference, since CNAE-9 can run with even 8-bit precision.
MNIST trained with 24 bits and MNIST trained with 32 bits are compared in detail (Figure 14). In the proposed algorithm, the target accuracy is swept from 1.0 to 0.8. MNIST trained with 28 bits shows exactly the same results as MNIST trained with 32 bits. From this simulation, we can see that MNIST inference does not need 32 bits; 16-bit synapses (20 percent) and 8-bit synapses (80 percent) show less than 1 percent accuracy degradation for both the 24-bit-trained and the 32-bit-trained MLP. If the target accuracy is 0.90, the 32-bit-trained MLP needs 16 bits for 10 percent of the synapses, while the 24-bit-trained MLP can operate with 8 bits for all synapses, which is similar to the result from [15]. Therefore, the MLP trained with 24 bits can approximate more synaptic weights with low precision (more power saving) than the MLP trained with 32 bits.
For MNIST, however, 24-bit operation increases the number of iterations during training by about 60 percent (Figure 12(a)). Therefore, depending on the use case of the application (training once or training frequently), the designer should choose the bit precision in view of the trade-off between training time and power saving during inference.
Since the bit precision for training changes not only the training quality but also the number of iterations (Figure 12), its energy overhead should be compared with the energy saving during inference. The latency of one training iteration is computed from the throughput of 38.4 GOPs/s and the number of operations $N_{Ops}$, where $N_{Ops}$ for training a fully connected layer from $M$ neurons to $N$ neurons is defined as $3 \times M \times N$ [33]. Table 3 compares two modes from Figure 14: (1) 32-bit training with 10 percent 16-bit and 90 percent 8-bit inference, and (2) 24-bit training with all 8-bit inference, both achieving 0.94 accuracy. By reducing the bit precision during training, 24-bit training requires more energy (+37.4 mJ). During inference, however, it can save 0.19 μJ. In other words, if this system performs MNIST inference more than about 200,000 times, 24-bit training is beneficial; otherwise, 32-bit training is beneficial in terms of energy.
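The energy trade-off above can be reproduced with simple arithmetic. The sketch assumes that latency equals the operation count divided by the throughput, and reads the per-inference saving as 0.19 μJ, which is the value that yields the quoted break-even of about 200,000 inferences. Function names are illustrative.

```python
THROUGHPUT_OPS_PER_S = 38.4e9  # 38.4 GOPs/s

def training_ops(layer_sizes):
    """N_Ops per training iteration: 3 * M * N for each fully connected layer."""
    return sum(3 * m * n for m, n in zip(layer_sizes, layer_sizes[1:]))

def training_latency_s(layer_sizes, iterations):
    """Assumed model: latency = iterations * N_Ops / throughput."""
    return iterations * training_ops(layer_sizes) / THROUGHPUT_OPS_PER_S

def break_even_inferences(extra_training_j, saving_per_inference_j):
    """Number of inferences after which the extra training energy is repaid."""
    return extra_training_j / saving_per_inference_j

ops = training_ops([784, 144, 10])                          # 343,008 operations
latency_ms = 1e3 * training_latency_s([784, 144, 10], 2000)  # about 17.9 ms
n_break_even = break_even_inferences(37.4e-3, 0.19e-6)       # about 197,000
```

Under these assumptions the 784-144-10 network needs roughly 17.9 ms for 2,000 iterations, in line with the training latency quoted in Section IV-F1, and the extra 37.4 mJ of training energy is repaid after roughly 197,000 inferences.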
In conclusion, approximating synapses can save more power than approximating neurons during inference under a given accuracy degradation. Moreover, the power saving with approximated synapses increases when the bit precision during training is reduced carefully. However, training with low bit precision may increase the number of iterations. Therefore, the bit precision for training should be chosen according to the use case.
B. DIFFERENT NUMBER OF ITERATIONS DURING THE TRAINING
One of the difficulties during training is selecting a maximum number of iterations as it strongly depends on training data set, randomly initialized weights, and learning rate.
In this section, we analyze the impact of the maximum number of training iterations on inference with approximate synapses. For MNIST, the maximum number of iterations is varied from 300 to 50,000. In all cases, a three-layer MLP network (784-144-10) is trained with 32 bits. As the number of iterations for MNIST training with 32-bit fixed point is about 2,000 on average (Figure 12(a)), training with 300 to 1,000 iterations shows lower accuracy (0.9460 and 0.9651) than training with up to 3,000 iterations (0.9793). Figure 15(a) shows the relationship between accuracy and the maximum number of iterations in training. Beyond 3,000 iterations, there is no improvement in accuracy (fluctuation of less than 0.5 percent).
Since the baseline accuracies (inference with 32-bit fixed point without any approximation) differ between cases, Figure 15 compares the power saving against the accuracy normalized to the baseline accuracy. Training with fewer iterations (300 or 1,000) shows more accuracy degradation for the same power saving. In other words, under a given accuracy degradation budget, increasing the number of iterations during training can save more power during inference by allowing more approximation of the weights. For example, assume that the maximum allowable accuracy degradation is 5 percent, and consider two training conditions, one with 3,000 iterations and the other with 300 iterations. Inference based on training with 3,000 iterations can use 16 bits for 10 percent of the synapses and 8 bits for 90 percent of the synapses. With 300 iterations, on the other hand, 16 bits are used for 20 percent of the synapses and 8 bits for 80 percent. For the example in Figure 15, we did not observe much difference in power saving beyond 3,000 training iterations.
In summary, although training with fewer iterations could save training energy with negligible accuracy degradation, a less-trained network is highly sensitive to approximation during inference. In other words, training with a sufficient number of iterations provides more potential for power saving using approximate synapses during inference. This is an important observation, especially for applications where training is performed rarely but inference is performed frequently. In such cases, a longer training time can lead to more power saving during inference.
C. DIFFERENT MLP STRUCTURE (DIFFERENT NUMBER OF LAYERS)
The number of hidden layers and the number of neurons in those layers are important design parameters for an MLP (or any deep learning network). In general, a network with more layers is believed to be able to realize more complex input-output relations [34], [35]. To understand the impact of the MLP structure, for example the number of layers, on power saving for MLP inference, MNIST is trained with a three-layer MLP (784-144-10) and a four-layer MLP (784-144-64-10). In both cases, the maximum number of iterations is 3,000 and 32 bits are used for training. Inference is performed with the same MLP structure used in training. As both cases show similar accuracy for inference with 32 bits (∼0.977), we can say that the four-layer MLP is not the optimal MLP structure for MNIST; in other words, it has a redundant layer. Figure 16 shows that the three-layer MLP can save more power than the four-layer MLP for all three approximations. For example, assume that the target accuracy is 0.95. The three-layer MLP can use 16 bits for 10 percent of the synapses and 8 bits for 90 percent, while the four-layer MLP uses 16-bit precision for 20 percent of the synapses and 8-bit precision for 80 percent. According to [36], an MLP structure with more layers is more sensitive to weight error if the number of synapses per neuron is high enough. Unlike a Convolutional Neural Network [37], which has local neighbor connections, an MLP with many layers could be more sensitive to weight error, which results in less power saving during inference. Training with fewer layers (the optimal network structure) is critical not only to training time (fewer computations) but also to power saving during inference.
For approximation in a deep network (a network composed of many layers), more advanced approximation techniques are required to compensate for the error accumulated through the layers.
First, to avoid values shrinking due to approximation, normalization at each layer is required [38]. Moreover, stochastic rounding [38] (rounding up or down randomly) yields both positive and negative quantization errors in the network. In this paper, for simplicity of hardware design, such advanced approximating units are not considered. In addition, as we already explained in Section IV-D, layer-wise approximation (approximating more neurons in earlier layers) could be considered to improve the accuracy while maintaining the same power saving.
D. SUMMARY OF RELATIONSHIP BETWEEN TRAINING CONDITIONS AND POWER SAVING
Figure 17 illustrates the average power consumption for single PE and its normalized accuracy for different approximate approaches (proposed algorithm, software approach, and hardware approach) and different training conditions. For hardware approach, the ratio of in-exact multipliers increases from 10 to 100 percent. For all different training conditions, proposed algorithm shows the best power saving, and hardware only (in-exact multiplication) approach shows the least power saving for target accuracy.
E. RETRAINING WITH APPROXIMATE SYNAPSES
In this section, we discuss retraining the network after approximation to improve the accuracy while maintaining the power consumption during on-chip inference. Figure 18 shows the flow of retraining. After initial training, the error sensitivity is computed by gradient descent. Under the target accuracy (or accuracy degradation margin), the bit-precision rate (e.g., [rate for 32 bit, 16 bit, 8 bit] = [0%, 20%, 80%]) is determined by the proposed algorithm from Section IV. After the synapses are approximated using the bit-precision rate, the neural network is retrained on a small training data set for a few iterations. During retraining, each approximated synapse keeps its bit precision. In other words, we allow quantization error during retraining. Although retraining requires additional latency and energy, it is performed once before on-chip inference, whereas inference is repeated many times. For 50 epochs of MNIST training, the energy consumption is 1.69 mJ, which is equivalent to 263 inferences (263 × 6.41 μJ). Therefore, we believe power saving during inference while yielding high accuracy is desirable, even at the expense of increased training energy. Figure 19 shows the accuracy improvement from retraining. We constrained the maximum number of retraining iterations to 50, which is very small compared to the number of iterations in initial training (∼3,000). The base accuracy (before retraining) is determined with the 20 percent 16-bit and 80 percent 8-bit approximation. Note that the power consumption of the MLP hardware is constant since the bit precision of each synapse is maintained. When initial training runs for 3,000 iterations, the base accuracy is 0.9676 and the power consumption is 4.686 W. After 15 retraining iterations, the accuracy improves to 0.9731. The impact of retraining is much more significant when initial training runs for fewer iterations. After initial training with 1,000 iterations, the base accuracy is 0.9497, but it can be improved to 0.9643.
Note that the inference accuracy can also be improved by allowing less approximation during inference instead of retraining. However, that increases power dissipation. For example, consider the case of initial training with 3,000 iterations. The improved accuracy of 0.9731 obtained with retraining can also be achieved without retraining at 5.007 W (+7 percent) power consumption during inference. Similarly, in the case of training with 1,000 iterations, we need about 5.328 W (+13 percent) power for inference without retraining to achieve the accuracy level of 0.9643. In summary, retraining after approximating the synapses with a small number of iterations (50) is an effective method to improve the accuracy without any power penalty. In particular, if Algorithm 1 cannot guarantee both the quality constraint and the power constraint, retraining should be considered.
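The key constraint of the retraining step, namely that each synapse keeps its assigned bit precision while the weights are updated, can be sketched as follows. This is a toy illustration and not the flow of Figure 18: it uses a single linear neuron with a squared-error gradient step, models precision as a per-weight number of fractional bits, and truncates toward negative infinity. All names and parameters are assumptions.

```python
import math

def quantize(w, frac_bits):
    """Truncate w to a fixed number of fractional bits (assumed rounding rule)."""
    scale = 1 << frac_bits
    return math.floor(w * scale) / scale

def retrain(weights, frac_bits, samples, lr=0.05, max_iters=50):
    """Gradient steps on a linear neuron; each weight is re-quantized to its
    own fixed precision after every update, so the precision map never changes."""
    w = [quantize(wi, fb) for wi, fb in zip(weights, frac_bits)]
    for _ in range(max_iters):
        for x, target in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            err = y - target
            w = [quantize(wi - lr * err * xi, fb)
                 for wi, xi, fb in zip(w, x, frac_bits)]
    return w
```

Because `frac_bits` is fixed throughout, the hardware cost of each synapse is unchanged by retraining; only the stored values move, which mirrors why accuracy can improve at constant power.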
F. COMPARISON PROPOSED ALGORITHM WITH UNIFORM BIT SELECTION
In [12], the network is initially trained without any approximation. After initial training, the network is retrained with weight restriction (approximation). Since all synaptic weights are encoded with the same bit precision, we refer to this approach as 'Uniform Bit Selection' in this section. According to [12], the accuracy of MNIST with an 8-bit MLP is about 0.9745. To compare the proposed algorithm with uniform bit selection, the uniform bit selection approach is applied to our digital MLP full system (100 percent accurate MAC with 8-bit precision). Figure 20 compares the two approaches in terms of normalized power and MNIST accuracy. With the proposed algorithm, we obtain an MNIST accuracy of 0.9644 with 64 percent power saving. After the retraining explained in Section V-E, we obtain 0.9731 without additional power consumption. Although the accuracy of our proposed algorithm is slightly lower (−0.0087) than that of uniform bit selection, it can save more power (−11 percent).
Since MAC is replaced with alphabet set multiplier with limited number of alphabets [12] to save dynamic power consumption, coupling bit precision control based on error sensitivity with ASM remains as future work.
FIGURE 18. Retraining flow. After approximate synapse using proposed algorithm, bit-precision for each synapse is maintained. Therefore, accuracy is improved while power consumption is constant.
VI. CONCLUSION
This paper discussed the concept of the approximate synapse to reduce the power dissipation of a feedforward network, namely an MLP, during inference. The paper presented a methodology to reduce power dissipation by selectively approximating synapses through the coupling of bit-precision control (SW) and inexact MAC (HW), referred to as 'approximate synapses'. The approximated synapses are selected based on the gradient (the error sensitivity of the synaptic weights) precomputed during the training phase. The power analysis shows that the error-sensitivity-driven choice of approximate synapses, coupled with reduced-precision and inexact-arithmetic approximation techniques, allows appreciable power saving in digital MLP hardware with less accuracy degradation than other approximations (SW only or HW only). Moreover, we observed that training conditions play an important role in power saving during inference with approximate computing. For example, training with reduced bit precision or with more iterations provides more opportunities for power saving with approximate synapses during inference. However, using the optimal number of layers in the MLP is critical; too many layers increase the sensitivity to small errors, thereby reducing the effectiveness of approximation. Further, it is shown that retraining after the selection of approximate synapses can improve the accuracy; hence, more aggressive approximation becomes possible for a target accuracy. In conclusion, the approximate synapse is observed to be an effective technique for reducing power dissipation during inference; however, its effectiveness depends on the training conditions.
