The widespread application of artificial neural networks has prompted researchers to experiment with field-programmable gate array and customized ASIC designs to speed up their computation. These implementation efforts have generally focused on weight multiplication and signal summation operations, and less on the activation functions used in these applications. Yet efficient hardware implementations of nonlinear activation functions such as exponential linear units (ELU), scaled ELU (SELU), and hyperbolic tangent (tanh) are central to designing effective neural network accelerators, since these functions are resource-intensive. In this paper, we explore efficient hardware implementations of activation functions using purely combinational circuits, with a focus on two widely used nonlinear activation functions, SELU and tanh. Our experiments demonstrate that neural networks are generally insensitive to the precision of the activation function. The results also show that the proposed combinational circuit-based approach is very efficient in terms of speed and area, with negligible accuracy loss on the MNIST, CIFAR-10, and IMAGENET benchmarks. Synopsys Design Compiler synthesis results show that the circuit designs for tanh and SELU reduce area by factors of 3.13 to 7.69 and 4.45 to 8.45, respectively, compared with the look-up table/memory-based implementations, and can operate at 5.14 GHz and 4.52 GHz, respectively, using the 28-nm SVT library. The implementation is available at: https://github.com/ThomasMrY/ActivationFunctionDemo.
Index Terms: Activation functions, artificial neural networks (ANNs), exponential linear units (ELUs), hyperbolic tangent (tanh), scaled ELUs (SELUs).
I. INTRODUCTION
Artificial neural networks (ANNs) are deployed in a wide range of applications, such as image recognition, speech recognition, and natural language processing. Speeding up neural network inference and reducing power consumption have become essential to enable ANN adoption in edge devices where low power and low latency are required. Current CPUs and GPUs are ill-suited for this class of devices, leading many researchers to pursue custom field-programmable gate array (FPGA) or ASIC accelerators.
ANNs consist of neurons, which sum incoming signals and apply an activation function, and connections, which amplify or inhibit passing signals. When the neuron's activation function is nonlinear, the two-layer neural network becomes a universal function approximator [1]. Various nonlinear equations, such as sigmoid, logistic, tanh, rectified linear unit (ReLU), and scaled exponential linear unit (SELU) [2], have been used to implement activation functions. Basterretxea et al. [3] show that nonlinear activation functions affect the learning and generalization capabilities of ANNs. The rationale for focusing on the efficient implementation of exponential functions is twofold: 1) exponential functions are used in several activation functions, such as exponential linear unit (ELU), SELU, tanh, and sigmoid and 2) the ELU [4] and SELU [5] functions have been shown: a) to significantly decrease training time; b) to push mean activations closer to zero; c) to not require batch normalization; and d) to alleviate the vanishing gradient problem. For example, the SELU activation function provides lower and upper bounds on the gradient variance and removes the vanishing/exploding gradient problem. Therefore, we expect wider adoption of these activation functions in the future, along with more attempts to reduce their hardware area, latency, and power consumption.
However, a straightforward implementation of the aforementioned nonlinear activation functions in hardware is very expensive because most of these equations require exponentiation and division [6]. Most accelerators do not implement an ISA [7]-[9] but rather create modules individually, thereby preventing designers from amortizing the costs of physical activation functions. Thus, besides pushing for the efficient execution of the matrix multiplication operations, special attention should also be paid to the other components of the ANN acceleration hardware, including the activation function. Each neuron in the hidden and output layers needs an activation function, so small implementation inefficiencies in the activation function can quickly add up. In fact, to achieve a significant speedup, hardware accelerators possess thousands or more processing elements. Hence, the number of hardware activation function components can be significant, and efforts to optimize activation function circuits could dramatically decrease ANN area and power requirements [10]. For example, if the tanh function is implemented using a 10-bit output and 1000 data points, storing the function values requires a 10-Kb memory structure. Having hundreds of these modules in a design would require multiple megabits of storage. Indeed, Li et al. [11] compared 8-bit neurons with ReLU and tanh/sigmoid activation functions. They show that replacing ReLU with tanh increases the neuron area by 20% and neuron energy by 36%.
In general, nonlinear functions like tanh cannot be effectively approximated using only combinational logic. However, deep neural networks can tolerate low-precision operations and therefore lend themselves to such approximations. Using purely combinational logic provides low latency with a small area overhead compared with conventional ROM-based approaches. We illustrate this point using the tanh and SELU functions. Their implementations are generalized and open-sourced.
In this paper, we explore the design space tradeoffs of neural network activation function circuits. In particular, we focus on the efficient implementation of activation functions using purely combinational logic for higher clocking speed and smaller area overhead. The rest of this paper is organized as follows. Previous work is reviewed in Section II. In Sections III and IV, we present a detailed implementation of the SELU and tanh functions. Section V summarizes the experimental results and Section VI concludes this paper.
0278-0070 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. RELATED WORK
Various approaches have been proposed for implementing activation functions in hardware. Generally, these methods fall into two categories: 1) piecewise approximations and 2) look-up table (LUT)-based approaches [12]. To keep the review concise, we consider the six most commonly used approaches. On the whole, high-fidelity approximations tend to use more resources and have higher latencies, while low-fidelity implementations incur approximation losses but are faster and require fewer hardware resources. In Fig. 1, we plot the approximation of the $e^x$ curve with methods 1, 2, 4, and 5. Method 3 (CORDIC algorithm [13]) and method 6 (optimized LUT-based method) are omitted as they require too many resources to be directly implemented in hardware.
A. Storing Function Values in LUTs
LUTs are the most commonly used method to implement activation functions in hardware. The function values are divided into equal subranges and each subrange is approximated by a value stored in an LUT. For LUT implementations, raising precision requires increasing the sampling rate, adding more storage, and increasing latency.
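The function-value LUT can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `build_value_lut` and `lut_lookup`, the 16-entry table, and sampling at the left edge of each subrange are choices made here, not details taken from the paper.

```python
import math

def build_value_lut(f, lo, hi, n_entries):
    """Sample f at the left edge of n_entries equal subranges of [lo, hi)."""
    step = (hi - lo) / n_entries
    return [f(lo + i * step) for i in range(n_entries)], step

def lut_lookup(table, step, lo, x):
    """Return the stored value for the subrange containing x (index clamped)."""
    idx = int((x - lo) / step)
    idx = max(0, min(len(table) - 1, idx))
    return table[idx]

# Hypothetical configuration: tanh on [0, 2) with 16 entries (step 0.125).
table, step = build_value_lut(math.tanh, 0.0, 2.0, 16)
approx = lut_lookup(table, step, 0.0, 0.70)   # falls in the subrange starting at 0.625
```

Raising `n_entries` narrows each subrange and lowers the error, at the cost of a proportionally larger table, which is the tradeoff described above.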
B. Storing Parameters in LUTs
Instead of storing the function values directly, this method keeps the function slope and intercept in the LUT. The function value can then be calculated using the following formula, where k is the slope and b is the intercept:
$$y = kx + b$$
Storing function values in LUTs (see Section II-A) is the special case of this approach with k = 0.
This method leads to a small improvement in accuracy, but it also has to store more data and needs an adder and a multiplier to calculate the function values.
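A minimal Python sketch of the slope/intercept variant follows. The helper names and the use of the chord through the two segment endpoints to obtain k and b are assumptions made for illustration; the paper does not state how the parameters are fitted.

```python
import math

def build_kb_lut(f, lo, hi, n_entries):
    """Store one (slope, intercept) pair per equal-width subrange of [lo, hi)."""
    step = (hi - lo) / n_entries
    params = []
    for i in range(n_entries):
        x0, x1 = lo + i * step, lo + (i + 1) * step
        k = (f(x1) - f(x0)) / step     # slope of the chord over the subrange
        b = f(x0) - k * x0             # intercept so the line passes through (x0, f(x0))
        params.append((k, b))
    return params, step

def kb_lookup(params, step, lo, x):
    """Fetch (k, b) for the subrange containing x, then evaluate k*x + b."""
    idx = max(0, min(len(params) - 1, int((x - lo) / step)))
    k, b = params[idx]
    return k * x + b                   # one multiply and one add per evaluation

params, step = build_kb_lut(math.tanh, 0.0, 2.0, 16)
approx = kb_lookup(params, step, 0.0, 0.70)
```

Each entry now holds two numbers instead of one, and every evaluation needs a multiplier and an adder after the table read.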
C. CORDIC Algorithm
The third method is the CORDIC algorithm. It uses shift, addition, and subtraction operations to approximate the nonlinear activation function. The CORDIC algorithm requires less area than storing the parameters in LUTs, but it needs more clock cycles and hardware modules to compute the activation function. While the algorithm achieves higher approximation accuracy, its increased latency may make it unsuitable for deployment in low-latency edge devices.
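To make the iteration count concrete, here is a floating-point sketch of hyperbolic CORDIC in rotation mode, which yields tanh as the ratio of the two accumulated components. It is an illustration, not the design from [13]: the function name, the default of 16 iterations, and the final division are assumptions, and a hardware version would use fixed-point shifts and a small table of atanh constants.

```python
import math

def cordic_tanh(theta, n_iter=16):
    """Hyperbolic CORDIC, rotation mode; converges for |theta| below about 1.1."""
    x, y, z = 1.0, 0.0, theta
    # The hyperbolic variant only converges if iterations 4, 13, 40, ... are repeated.
    schedule = []
    for i in range(1, n_iter + 1):
        schedule.append(i)
        if i in (4, 13, 40):
            schedule.append(i)
    for i in schedule:
        d = 1.0 if z >= 0 else -1.0
        t = 2.0 ** -i                        # a right shift by i bits in fixed point
        x, y = x + d * y * t, y + d * x * t  # shift-and-add update
        z -= d * math.atanh(t)               # constant that would come from a small table
    return y / x                             # the CORDIC gain cancels in the ratio

value = cordic_tanh(0.5)
```

One pass through the loop corresponds to one shift-and-add stage, so the accuracy gained from more iterations translates directly into extra cycles or extra pipeline stages.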
D. Taylor Series Expansion
The Taylor series expansion can be used to approximate a nonlinear activation function to any precision. The expansion of a function f around a point a is of the form
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}\,(x - a)^{n}$$
This method requires multipliers and several clock cycles to perform the calculation.
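As a small illustration of why multipliers and several cycles are needed, the sketch below evaluates a truncated Taylor polynomial of $e^x$ around 0 with Horner's rule. The function name and the choice of $e^x$ as the example are assumptions made here.

```python
import math

def exp_taylor(x, order):
    """Evaluate sum_{n=0}^{order} x**n / n! with Horner's rule."""
    result = 1.0
    for n in range(order, 0, -1):
        # One multiply per term; the division by n would be a constant multiply in hardware.
        result = 1.0 + x * result / n
    return result

second_order = exp_taylor(-1.0, 2)   # 1 - 1 + 1/2
eighth_order = exp_taylor(-1.0, 8)
```

Each additional term adds one multiply-accumulate step, so precision is bought with latency or with extra multipliers.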
E. Approximation Formula
The method introduced in [14] uses the following formula to approximate the exponential function:
Based on this formulation, one can calculate the sigmoid function as
whereas the tanh function can be calculated as
The method requires four cycles to approximate the sigmoid function. The authors designed a structure to calculate the expression $2^{-1.5x}$, which takes two cycles. An addition and a division are also performed and take one cycle each. For the tanh approximation, an additional clock cycle is required. The implementation of this approximation formula uses fewer resources than the CORDIC approach, but its latency is still high.
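One form consistent with the $2^{-1.5x}$ term mentioned above can be sketched as follows. It rests on an assumption that is not confirmed by the text: that the exponent $\log_2 e \approx 1.44$ is rounded to 1.5 to simplify the hardware. The sigmoid definition and the identity relating tanh to sigmoid are standard.

```latex
e^{-x} = 2^{-x \log_2 e} \approx 2^{-1.44x} \approx 2^{-1.5x},
\qquad
\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \approx \frac{1}{1 + 2^{-1.5x}},
\qquad
\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1 .
```

Under this reading, the power of two, the addition, and the division account for the four sigmoid cycles, and the final scaling and subtraction for tanh would account for the extra cycle.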
F. Optimized LUT-Based Method
This approach is an optimized LUT-based method combined with a Taylor series expansion. The tanh function is expanded up to the fifth order:
$$\tanh(x) \approx x - \frac{x^{3}}{3} + \frac{2x^{5}}{15}$$
When $(x^3/3) - (2x^5/15) \le 0.02$, one can use the approximation tanh(x) ≈ x. Solving the inequality gives x ≤ 0.39; in addition, tanh(2.90) ≈ 1. Therefore, only the values in the [0.39, 2.90] range need to be stored.
In all, LUT-based methods need storage/memory and an extra pipeline stage for the memory access. All these methods, except the one that stores function values in LUTs, either require relatively complex calculations/logic or several clock cycles to minimize the approximation error. According to our experiments in Section V, ANNs are generally insensitive to activation function precision. This is a key insight that allows us to simplify the approximation method without sacrificing the system accuracy. In the following sections, we analyze the activation functions and present our proposed combinational circuit-based implementation method.
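The two range limits of the optimized LUT method can be checked numerically. The sketch below only re-evaluates the numbers quoted in the text (0.02, 0.39, and 2.90); the function name is a label chosen here.

```python
import math

def linear_error_terms(x):
    """Cubic and quintic Taylor terms of tanh: approximately x - tanh(x) for small x."""
    return x**3 / 3 - 2 * x**5 / 15

lower, upper = 0.39, 2.90

below_tolerance = linear_error_terms(lower) <= 0.02   # tanh(x) ~ x is acceptable up to here
actual_gap = lower - math.tanh(lower)                 # true error of the linear approximation
saturation_gap = 1.0 - math.tanh(upper)               # distance from 1 at the upper limit
```

Inputs below the lower limit are passed through unchanged and inputs above the upper limit are replaced by 1, so the table only has to cover the interval in between.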
III. ACTIVATION FUNCTIONS EXPLORATION
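In this section, we discuss the nonlinear activation functions realized using our proposed design approach. We define the sigmoid and tanh functions as
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$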
Compared to the sigmoid function, the tanh function passes through zero and can be approximated as y = x around zero. As a result, when the absolute value of the input is small enough, one can perform the matrix operation directly, which makes the training process relatively easy. In principle, sigmoid and tanh have similar expressive ability, but in practice, sigmoid is equivalent to an activation function with a bias. It still needs the real bias term to offset its influence, which can affect the optimization. Therefore, the tanh function is used more often. Furthermore, it converges faster than the sigmoid function. The tanh function has been shown experimentally to outperform the sigmoid function. There are two reasons for this. First, the output of the tanh function is centered around 0, producing both positive and negative outputs, whereas the sigmoid output is not, which introduces a systematic bias. Second, when the output of the neuron is restricted to [−1, 1], the activation is more likely to be close to 0, so the neurons are generally less saturated with tanh than with sigmoid, allowing gradients to propagate better and speeding up learning [15].
Klambauer et al. [5] introduced the SELU function and analytically proved that neuron activations converge toward zero mean and unit variance. This allows networks with SELU activations to train deeper models, speed up learning, and use stronger regularizers without sacrificing accuracy. This is the main motivation behind focusing this paper on the efficient implementation of exponential functions.
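We define ELU and SELU as
$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases} \qquad \mathrm{SELU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases} \tag{8}$$
Here, λ = 1.0507 and α = 1.6733.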
IV. ACTIVATION FUNCTION IMPLEMENTATION
A. Implementation of the tanh Function
In this section, we introduce our method for implementing the tanh activation function using exclusively combinational circuits. We consider only the intervals where the function changes significantly.
1) Properties of the tanh Function: tanh is an odd function, meaning that it is symmetric about the origin. To approximate it, we only need to consider the positive half of the function. As it converges to 1, we approximate its value in the range [0, 2] for the targeted precision in this paper.
We divide the activation function range evenly into $2^k$ segments with a step of $1/2^k$. The approximation error depends on k, which controls the sampling density. The larger k is, the lower the approximation error, but the more complex the implementation. During training, the exact tanh function is used to calculate the gradients, since the approximate function is nondifferentiable. The approximated function is used for the forward pass.
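The split between an approximate forward pass and an exact gradient can be sketched in plain Python. This is a minimal illustration rather than the authors' PyTorch code: the function names, the default k = 3, and holding the value at the left edge of each step are assumptions.

```python
import math

def ladder_tanh(x, k=3):
    """Piecewise-constant tanh: hold the left-edge value of each 1/2**k-wide step on [0, 2)."""
    step = 1.0 / 2**k
    mag = abs(x)
    if mag >= 2.0:
        level = 1.0                                   # saturated region
    else:
        level = math.tanh(math.floor(mag / step) * step)
    return math.copysign(level, x)                    # odd symmetry

def tanh_forward_backward(x, k=3):
    """Return (approximate forward value, exact derivative used for the gradient)."""
    return ladder_tanh(x, k), 1.0 - math.tanh(x) ** 2

y, grad = tanh_forward_backward(0.7)
```

The ladder function has zero slope almost everywhere, so using its own derivative would stop learning; substituting the derivative of the exact tanh keeps the gradient informative.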
2) Encoding the Value of the Activation Function: After selecting a sampling rate, we choose the bit-widths of the integer and fractional parts of the output value. The integer part is either 0 or 1. For the illustrative case, to keep the combinational logic simple, we choose 7 bits to encode the output value of the activation function: 1 bit for the integer part and 6 bits for the fractional part.
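The 1 + 6 bit output encoding can be sketched as an unsigned fixed-point conversion. The helper names and the round-to-nearest rule are assumptions; the text fixes only the bit-widths.

```python
import math

FRAC_BITS = 6    # fractional bits of the output
OUT_BITS = 7     # 1 integer bit + 6 fractional bits

def encode_output(value):
    """Quantize a value in [0, 2) to an unsigned 7-bit fixed-point code."""
    code = round(value * 2**FRAC_BITS)
    return max(0, min(2**OUT_BITS - 1, code))

def decode_output(code):
    """Map a 7-bit code back to its real value."""
    return code / 2**FRAC_BITS

code = encode_output(math.tanh(0.625))   # sample value for one input step
bits = format(code, "07b")               # integer bit first, then six fractional bits
```

With six fractional bits the representable values are spaced 1/64 apart, so rounding alone contributes at most 1/128 of error per sample.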
3) Generating the Karnaugh Map for the tanh Function: Boolean functions can be expressed in their canonical form: by listing the input values on the left side of the truth table and the output values on the right side, we get a Karnaugh map. Fig. 2 shows the Karnaugh map for one of the tanh activation bits. By analyzing the map, one can derive the needed circuit for implementing the bit. We repeat this procedure for all the bits of the tanh activation function.
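The truth table behind one output bit, and the unsimplified sum of products that a Karnaugh map would then reduce, can be generated programmatically. The sketch assumes a 4-bit input code covering [0, 2) in steps of 0.125 and a 7-bit rounded output; these widths, the sampling point, and the helper names are illustrative and may differ from the tables in the paper.

```python
import math

IN_BITS, FRAC_BITS = 4, 6           # assumed input and fractional output widths
STEP = 2.0 / 2**IN_BITS             # 16 input codes cover [0, 2)

def output_code(in_code):
    """7-bit output code for a given 4-bit input code."""
    return round(math.tanh(in_code * STEP) * 2**FRAC_BITS)

def minterms(bit):
    """Input codes whose output has a 1 in position `bit` (0 = least significant)."""
    return [c for c in range(2**IN_BITS) if (output_code(c) >> bit) & 1]

def canonical_sop(bit):
    """Unsimplified sum of products: one AND term per truth-table row holding a 1."""
    terms = []
    for c in minterms(bit):
        lits = [f"X{i}" if (c >> i) & 1 else f"~X{i}" for i in reversed(range(IN_BITS))]
        terms.append("(" + " & ".join(lits) + ")")
    return " | ".join(terms) or "0"

def eval_bit(bit, in_code):
    """Evaluate the sum of products for one output bit at one input code."""
    return int(in_code in minterms(bit))

expression = canonical_sop(1)
```

Grouping adjacent 1's on the Karnaugh map merges several of these AND terms into shorter products, which is where the area saving over the direct form comes from.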
A direct implementation will have a circuit for every cell with the value 1, and a multiple-input OR gate choosing one of these circuit outputs. We can simplify the logic expression from the Karnaugh map by combining some of the adjacent 1's in the table cells. This reduces the number of individual circuits. As an illustration, here is the expression of one bit of the output value
$X_i$ refers to the (i + 1)th bit of the input value, and $Y_1$ refers to the second bit of the output value.
4) Combinational Logic for the tanh Function: Finally, we can implement the logic expression in an RTL language to obtain the logic circuits. As for the negative part of the function, since tanh is an odd function, we can pass the sign bit to the output directly. If we use g(x) to represent the ladder function on [−2, 2], the approximated tanh activation function can be written as follows:
$$\tanh(x) \approx \begin{cases} 1, & x \ge 2 \\ g(x), & -2 < x < 2 \\ -1, & x \le -2 \end{cases}$$
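The sign handling can be modeled in software by running only the magnitude through the table logic and letting the sign bypass it. In this sketch `g_code` stands in for the combinational logic; its widths (4-bit input, 6 fractional output bits) and the helper names are assumptions.

```python
import math

IN_BITS, FRAC_BITS = 4, 6
STEP = 2.0 / 2**IN_BITS

def g_code(mag_code):
    """Magnitude path: 4-bit input code to 7-bit output code (stand-in for the logic)."""
    return round(math.tanh(mag_code * STEP) * 2**FRAC_BITS)

def approx_tanh(x):
    negative = x < 0
    mag = abs(x)
    if mag >= 2.0:
        out = 1.0                                   # saturated region
    else:
        out = g_code(int(mag / STEP)) / 2**FRAC_BITS
    return -out if negative else out                # sign bit bypasses the magnitude logic

sample = approx_tanh(0.7)
```

Because the sign never enters the magnitude logic, the truth table needs only the positive half of the function.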
5) Simulation and Validation:
Once we have the RTL module, we need to simulate it to check the logic expression and make sure it approximates the desired function. Next, we analyze the time delay of the combinational circuit and check whether the activation function lies on the critical path of the design. After functional and timing testing, if there exist any race conditions or hazards, we change the Karnaugh map to remove them.
After simplifying the logic expression, we obtain the final expressions of the tanh function as illustrated in Table I.
B. Implementation of the SELU Activation Function
We demonstrate in this section the implementation of the SELU function using only combinational circuits.
1) Properties of the SELU Activation Function: From (8), the positive part of the SELU function is linear, so we only need to approximate the negative part. Since $e^{-3.875} \approx 0.0208 \approx 0$, if the input value is less than −3.875, the output value is −α, α being a static predetermined parameter. We then divide the interval [−3.875, 0] evenly into k segments.
2) Encoding the Value of the Activation Function: We encode the input value with 5 bits. To maintain precision, we encode the output value with 8 bits: 1 bit for the integer part and 7 bits for the fractional part. In this way, it can be represented as SELU_8_5.
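A SELU_8_5-style table for the negative branch can be sketched as follows. The sketch assumes the standard SELU definition with the constants quoted earlier, a 5-bit code that steps through [−3.875, 0] in increments of 0.125, and round-to-nearest on the output magnitude; the names are hypothetical.

```python
import math

LAMBDA, ALPHA = 1.0507, 1.6733      # constants quoted in the text
IN_BITS, FRAC_BITS = 5, 7
STEP = 3.875 / (2**IN_BITS - 1)     # codes 0..31 span magnitudes 0 to 3.875

def selu(x):
    """Standard SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    return LAMBDA * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))

def selu_neg_code(in_code):
    """8-bit magnitude code of SELU evaluated at x = -in_code * STEP."""
    return round(-selu(-in_code * STEP) * 2**FRAC_BITS)

table = [selu_neg_code(c) for c in range(2**IN_BITS)]
```

All 32 magnitudes fit in 8 bits, so the negative branch reduces to a 5-input, 8-output truth table, while the positive branch remains a scaling by λ.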
3) Generating the Karnaugh Map for the SELU Function: We construct the truth table and Karnaugh map in a similar fashion as described in Section IV-A3. On the Karnaugh map, we circle groups of adjacent 1's to obtain the simplest logical expression free of race conditions and hazards. In total, we arrive at 31 logical expressions. Here, we show an illustration using one bit of the output value
$X_i$ refers to the (i + 1)th bit of the input value and $Y_1$ refers to the second bit of the output value. In Fig. 3, each color block refers to a product, and the logic expression is the sum of all the products. The blocks that have the same color refer to the same product.
4) Combinational Logic of the SELU Function:
The approximation of SELU using purely combinational logic is shown in Table II. The table shows the final complete logic expressions. We define the SELU function using the formulation shown in (13). It is worth noting that we only define it on the (−3.875, 0) interval, as the function is linear for x ≥ 0 and constant for x ≤ −3.875.
5) Simulation and Validation:
The purpose of the simulation is the same as in the case of the tanh activation function. As more variables may lead to race conditions and hazards more easily, all the possible combinations should be simulated.
More accurate approximations can be achieved by increasing the number of bits for inputs and outputs. Increasing the number of bits in the input breaks the function into more linear segments, whereas a larger number of bits in the output representation boosts its precision. In all, using higher bit-widths improves the approximation accuracy but also leads to more complex circuits.
V. EXPERIMENT
LUT-based designs are the most common implementation of activation functions. Therefore, in our comparative study, we implement the tanh and SELU functions using LUTs as the baseline designs. The function values storage-based implementation is denoted as ROM_y and the parameters storage-based one as ROM_k_b. Their construction uses LUTs and follows the procedures described in Section II. Their comparison with the proposed combinational circuit-based approach is done in terms of approximation error, power, area, and network accuracy.
The evaluation is conducted in a two-step, software-hardware approach. First, we evaluate the approximation method in software using PyTorch to verify the neural network accuracy. It is worth noting that this procedure may take multiple iterations to find an appropriate bit-width. Second, for a selected bit-width, the full neural network is implemented in hardware. We then perform circuit-level analysis on the RTL code and deploy it on the FPGA board for further system-level validation.
A. Approximation Error
The average errors (AEs) for the three different methods (the proposed combinational circuit-based approach, ROM_y, and ROM_k_b) are shown in Table III. The errors for the tanh and SELU functions are bounded to the ranges −2 < x < 2 and −3.875 < x < 0, respectively. The AE is calculated using the following formula:
$$\mathrm{AE} = \frac{1}{N} \sum_{i=1}^{N} \left| P_i - A_i \right|$$
P is the function value, A is the approximate value, and N represents the number of sample points. Since piecewise linear approximation is used in our proposed method, the AE is the same as in the function values storage approach (ROM_y) and larger than when the parameters are stored (ROM_k_b). The parameters storage approach uses more resources for this slight accuracy improvement (see Section V-B).
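The average error can be estimated in software before any hardware is built. The sketch below assumes the AE is the mean absolute difference over evenly spaced sample points and uses a left-edge ladder approximation of tanh with a step of 0.125; both are illustrative choices, and the function names are hypothetical.

```python
import math

def average_error(exact, approx, lo, hi, n_samples=1000):
    """Mean absolute difference over n_samples evenly spaced points inside (lo, hi)."""
    width = (hi - lo) / n_samples
    xs = [lo + (i + 0.5) * width for i in range(n_samples)]
    return sum(abs(exact(x) - approx(x)) for x in xs) / n_samples

def ladder_tanh(x, step=0.125):
    """Piecewise-constant tanh that holds the left-edge value of each step."""
    return math.copysign(math.tanh(math.floor(abs(x) / step) * step), x)

ae = average_error(math.tanh, ladder_tanh, -2.0, 2.0)
```

Rerunning the estimate with a smaller step shows how quickly the error falls as input bits are added, which helps when choosing a bit-width.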
B. Resources Analysis
To obtain more accurate results, we synthesized the different designs using a 28-nm SVT library. The absolute time delays for the tanh and SELU functions are 0.1947 ns and 0.221 ns, respectively. This means that the combinational logic can operate at maximum frequencies of 5.14 GHz (tanh) and 4.52 GHz (SELU). The area overheads of the tanh and SELU implementations are shown in Table III.
The proposed combinational circuit-based implementation of the tanh function saves 68.1% and 87.0% in area compared with the function values storage approach (ROM_y) and the parameters storage method (ROM_k_b), respectively. For the SELU function implementation, the area savings are 77.53% and 88.17% over ROM_y and ROM_k_b, respectively. The three methods are deployed on the Xilinx VC7V2000T FPGA board. The power consumption results are reported in Fig. 4.
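The reported delays and area savings can be converted into maximum frequencies and area ratios with two one-line formulas. The function names are chosen here; the input numbers are those quoted above.

```python
def max_frequency_ghz(delay_ns):
    """Maximum clock frequency in GHz for a combinational delay given in ns."""
    return 1.0 / delay_ns

def area_ratio(saving_percent):
    """Baseline area divided by proposed area for a given percentage saving."""
    return 1.0 / (1.0 - saving_percent / 100.0)

tanh_ghz = max_frequency_ghz(0.1947)    # about 5.14 GHz
selu_ghz = max_frequency_ghz(0.221)     # about 4.52 GHz
# tanh vs ROM_y, tanh vs ROM_k_b, SELU vs ROM_y, SELU vs ROM_k_b
ratios = [area_ratio(s) for s in (68.1, 87.0, 77.53, 88.17)]
```

The four ratios come out at roughly 3.13, 7.69, 4.45, and 8.45, which matches the area factors quoted in the abstract.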
One clock cycle is needed to get the function value from the LUT for both the ROM_y and ROM_k_b approaches. For the ROM_k_b, two clock cycles are needed for the linear function computation. On the other hand, the proposed method is purely combinational.
C. Inference Accuracy
In this section, we focus on the network accuracy. We use PyTorch with bit-wise operations to approximate the activation functions. We train the neural networks using the original, full-precision activation function implementations. Then, in the validation phase, we replace the activation functions with their approximation function circuits. We make no attempt to retrain the network after changing the activation function, since retraining may remove the accuracy losses incurred by quantizing the activation functions.
We test the proposed method with LeNet on the MNIST dataset, VGG-16 on CIFAR-10, and AlexNet on the IMAGENET dataset. On MNIST, the experimental results show an accuracy loss of 0.05% with tanh_7_4 and an increase of 0.37% with SELU_8_5, compared with the original network. In the CIFAR-10 experiments, the accuracy differences relative to the original network are −1.96% and −0.69% for tanh_7_4 and SELU_8_5, respectively. On IMAGENET, the accuracy losses are 7.83% on top-1 and 8.42% on top-5 under tanh_7_4, while there are gains of 0.368% on top-1 and 1.278% on top-5 for SELU_8_5. The results of the comparative study of the exact implementation and the proposed approximation method are summarized in Table IV. In several cases (SELU_8_5 on MNIST and IMAGENET), the inference accuracy increases when the quantized activation is applied. The overall effect of the quantization precision on the inference accuracy follows the pattern observed in other studies [16].
VI. CONCLUSION
In this paper, we proposed an efficient approximation scheme for activation functions using purely combinational logic, which takes only one clock cycle. We showed its implementation and performance on two widely used activation functions, tanh and SELU. We conducted a comparative study of the proposed method against other widely used methods, namely storage-based approaches. Based on the average approximation errors, our method has the best performance-to-circuit-complexity ratio. Activation quantization has little effect on network accuracy. The hardware implementation of the proposed activation functions was synthesized using the 28-nm SVT library to further validate the efficiency of the proposed approach in terms of area and timing delay. Area reductions of 68.1% and 87.0% for the tanh function, and 77.53% and 88.17% for the SELU function, are recorded when compared with the two baseline LUT-based activation function implementations (ROM_y and ROM_k_b, respectively).
