Design space exploration of neural network activation function circuits by Yang, Tao et al.
Boston University
OpenBU http://open.bu.edu
Electrical and Computer Engineering BU Open Access Articles
2019-10
Design space exploration of neural
network activation function circuits
Version Accepted manuscript
Citation (published version): Tao Yang, Yadong Wei, Zhijun Tu, Haolun Zeng, Michel A Kinsy,
Nanning Zheng, Pengju Ren. 2019. "Design Space Exploration of
Neural Network Activation Function Circuits." IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, Volume
38, Issue 10, pp. 1974 - 1978.
https://doi.org/10.1109/tcad.2018.2871198
https://hdl.handle.net/2144/39100
Boston University
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2018.2871198, IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems
0278-0070 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Design Space Exploration of Neural Network Activation
Function Circuits
Tao Yang, Yadong Wei, Zhijun Tu, Haolun Zeng, Michel A. Kinsy, Member IEEE,
Nanning Zheng, Fellow IEEE, and Pengju Ren, Member IEEE
Abstract—The widespread application of artificial neural networks has
prompted researchers to experiment with FPGA and customized ASIC
designs to speed up their computation. These implementation efforts
have generally focused on weight multiplication and signal summation
operations, and less on activation functions used in these applications.
Yet, efficient hardware implementations of nonlinear activation functions
like Exponential Linear Units (ELU), Scaled Exponential Linear Units
(SELU), and Hyperbolic Tangent (tanh), are central to designing effec-
tive neural network accelerators, since these functions require lots of
resources. In this paper, we explore efficient hardware implementations
of activation functions using purely combinational circuits, with a focus on
two widely used nonlinear activation functions, i.e., SELU and tanh. Our
experiments demonstrate that neural networks are generally insensitive
to the precision of the activation function. The results also prove that the
proposed combinational circuit based approach is very efficient in terms
of speed and area, with negligible accuracy loss on the MNIST, CIFAR-10,
and IMAGENET benchmarks. Synopsys Design Compiler synthesis
results show that circuit designs for tanh and SELU can save between
3.13× and 7.69× (tanh) and between 4.45× and 8.45× (SELU) in area
compared to the LUT/memory based implementations, and can operate at
5.14GHz and 4.52GHz using
the 28nm SVT library, respectively. The implementation is available at:
https://github.com/ThomasMrY/ActivationFunctionDemo.
Index Terms—Artificial Neural Networks; Activation Functions; Ex-
ponential Linear Units (ELU), Scaled Exponential Linear Units (SELU),
Hyperbolic Tangent (tanh).
I. INTRODUCTION
Artificial neural networks (ANN) are deployed in a wide range
of applications, such as image recognition, speech recognition, and
natural language processing. Speeding up neural network inference
and reducing power consumption have become essential in order to
enable ANN adoption in edge devices where low-power and low-
latency are required. Current CPUs and GPUs are ill-suited for this
class of devices, leading many researchers to pursue custom FPGA
or ASIC accelerators.
ANNs consist of neurons, which sum incoming signals and apply
an activation function, and connections, which amplify or inhibit
passing signals. When the neuron’s activation function is nonlinear,
the two-layer neural network becomes a universal function approx-
imator [1]. Various nonlinear equations, such as sigmoid, logistic,
tanh, Rectified linear unit (ReLU), Scaled Exponential Linear Unit
(SELU), etc. [2] have been used to implement activation functions.
Researchers in [3] show that nonlinear activation functions affect the
learning and generalization capabilities of ANNs.
The rationale for focusing on the efficient implementation of
exponential functions is twofold: (a) exponential functions are used in
several activation functions, such as ELU, SELU, tanh, and sigmoid,
and (b) the ELU [4] and SELU [5] functions have been shown (i)
to significantly decrease training time, (ii) to push mean activations
closer to zero, (iii) to not require batch normalization, and (iv) to
alleviate the vanishing gradient problem. For example, the SELU
activation function provides lower and upper bounds on the gradient
variance and removes the vanishing/exploding gradient problem.
Therefore, we expect a wider adoption of these activation functions
in the future and attempts to reduce their hardware area, latency, and
power consumption.
However, straightforward implementation of the aforementioned
nonlinear activation functions in hardware is very expensive because
most of these equations require exponentiation and division [6]. Most
accelerators do not implement an ISA [7]–[9] but rather create
modules individually, therefore preventing designers from amortizing
the costs of physical activation functions. Thus, besides pushing for
the efficient execution of the matrix multiplication operations, special
attention should also be paid to the other components of the ANN
acceleration hardware. This holds true for the activation function.
Each neuron in the hidden and output layers needs an activation func-
tion. Therefore, small implementation inefficiencies in the activation
function can quickly add up. In fact, to achieve a significant speedup,
hardware accelerators possess thousands or more processing elements
(PEs). Hence, the number of hardware activation function components
can be significant, and efforts to optimize activation function circuits
could dramatically decrease ANN area and power requirements [10].
For example, if the tanh function is implemented using a 10-bit output
and 1000 data points, the storage of the function values will require
a 10Kb memory structure. Having hundreds of these modules in a
design would require multiple megabits of storage. Indeed, in [11],
the authors compare 8-bit neurons with ReLU and tanh/sigmoid activation
functions. They show that replacing the ReLU with tanh increases
the neuron area by 20% and neuron energy by 36%.
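The storage estimate in the paragraph above is simple arithmetic, sketched below. The function name and the module count of 300 are hypothetical, chosen only to stand in for "hundreds of these modules".

```python
def lut_storage_bits(num_points: int, bits_per_value: int) -> int:
    """Bits needed to store one sampled activation function in a LUT."""
    return num_points * bits_per_value


# One tanh LUT with 1000 data points and a 10-bit output: 10,000 bits (about 10 Kb).
single_lut_bits = lut_storage_bits(1000, 10)

# Hypothetical design with 300 such modules (assumed count).
total_bits = 300 * single_lut_bits

print(single_lut_bits, total_bits)
```

With these assumed numbers the total is 3,000,000 bits, which is in line with the "multiple megabits" stated in the text.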
In general, nonlinear functions like tanh cannot be effectively
approximated using only combinational logic. However, deep neural
networks can tolerate low precision operations, therefore lending
themselves to such approximations. Using purely combinational logic
has the benefits of providing low latency with small area overhead
compared to conventional ROM-based approaches. We illustrate this
point using the tanh and SELU functions. Their implementations are
generalized and open-sourced.
In this work, we explore the design space trade-offs of neural
network activation function circuits. In particular, we focus on the
efficient implementation of activation functions using purely combi-
national logic for higher clocking speed and smaller area overhead.
The rest of this paper is organized as follows: previous works
are introduced in Section II, in Section III and IV we present a
detailed implementation of the SELU and tanh functions, Section V
summarizes the experimental results and Section VI concludes this
paper.
II. RELATED WORK
Various approaches have been proposed for implementing activa-
tion functions in hardware. Generally, these methods fall into two
categories: piecewise approximations and look-up table (LUT) based
approaches [12]. In this work, we consider the six most commonly
used approaches to make the review concise. On the whole, high-
fidelity approximations tend to use more resources and have higher
latencies, while low-fidelity implementations incur approximation
losses but are faster and require fewer hardware resources. In Figure 1, we plot the approximation of the $e^x$ curve with methods 1, 2, 4, and 5 (labeled A, B, D, and E in the figure). Method 3 (CORDIC algorithm [13]) and method 6 (optimized LUT-based method) are omitted, as they require too many resources to be directly implemented in hardware.
Fig. 1. Approximation of the $e^x$ curve using different methods over the range [0, 2]: (a) approximated curves; (b) approximation errors. Method E has a low approximation error, causing the respective curves to overlap.
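A comparison of this kind can be sketched in a few lines of Python. This is a minimal illustration of the four plotted methods on [0, 2], not the authors' code: the segment count (8), the choice of the left endpoint as the stored value in method A, the chord fit in method B, the expansion point 0 and order 3 in method D, and all function names are assumptions.

```python
import math

LO, HI = 0.0, 2.0  # range of the e^x curve considered here


def seg_index(x, n):
    """Index of the equal-width sub-range of [LO, HI] containing x."""
    i = int((x - LO) / (HI - LO) * n)
    return min(max(i, 0), n - 1)


def method_a(x, n=8):
    """LUT stores one function value per sub-range (left endpoint assumed)."""
    step = (HI - LO) / n
    return math.exp(LO + seg_index(x, n) * step)


def method_b(x, n=8):
    """LUT stores slope k and intercept b per sub-range: y = k*x + b."""
    step = (HI - LO) / n
    x0 = LO + seg_index(x, n) * step
    k = (math.exp(x0 + step) - math.exp(x0)) / step  # chord slope (assumed fit)
    b = math.exp(x0) - k * x0
    return k * x + b


def method_d(x, order=3):
    """Taylor series of e^x around 0, truncated at the given order."""
    return sum(x ** j / math.factorial(j) for j in range(order + 1))


def method_e(x):
    """Approximation formula e^x ~ 2^(1.44x)."""
    return 2.0 ** (1.44 * x)


def max_abs_error(f, samples=201):
    """Largest absolute deviation from math.exp on a uniform grid."""
    xs = [LO + (HI - LO) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - math.exp(x)) for x in xs)


for name, f in [("A", method_a), ("B", method_b), ("D", method_d), ("E", method_e)]:
    print(name, max_abs_error(f))
```

Under these assumptions, method E stays very close to the exact curve, which agrees with the remark in the caption of Fig. 1.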
A. Storing function values in LUTs
Look-up tables (LUTs) are the most commonly used method to implement activation functions in hardware. The function values are divided into equal sub-ranges and each sub-range is approximated by a value stored in a LUT. For LUT implementations, raising precision requires increasing the sampling rate, which adds storage and increases latency.
B. Storing parameters in LUTs
Instead of storing the function values directly, this method keeps the function slope and the function intercept in the LUT. The function value can then be calculated using the following formula, where k is the slope and b is the function intercept. This approach is a general form of storing function values in LUTs with k = 0 (cf. II-A).
$$y = kx + b \tag{1}$$
This method leads to a small improvement in accuracy, but it also has to store more data and uses an adder and a multiplier to calculate the function values.
C. CORDIC algorithm
The third method is the CORDIC algorithm. It uses shift, addition, and subtraction operations to approximate the nonlinear activation function. The CORDIC algorithm requires less area than storing the parameters in LUTs, but more clock cycles and hardware modules are required to compute the activation function. While the algorithm achieves higher approximation accuracy, its increase in latency may not be suitable for deployment in low-latency edge devices.
D. Taylor series expansion
The Taylor series expansion can be used to approximate a nonlinear activation function to any precision. The expansion formula is of the form:
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n \tag{2}$$
This method does require multipliers and several clock cycles to perform the calculation.
E. Approximation formula
The method introduced in [14] uses the following formula to approximate the exponential function:
$$e^x \approx Ex(x) \approx 2^{1.44x} \tag{3}$$
Based on this formulation, one can calculate the sigmoid function as:
$$\mathrm{Sigmoid}(x) \approx \frac{1}{1 + 2^{-1.44x}} \approx \frac{1}{1 + 2^{-1.5x}} \tag{4}$$
whereas the tanh function can be calculated as:
$$\tanh(x) = 1 - 2\,\mathrm{Sigmoid}(-2x) \tag{5}$$
The method requires four cycles to approximate the sigmoid function. The authors designed a structure to calculate the expression $2^{-1.5x}$, which takes two cycles. An addition and a division are also performed and take one cycle each. For the tanh function approximation, an additional clock cycle is required. The implementation of this approximation formula uses fewer resources than the CORDIC approach, but its latency is still high.
F. Optimized LUT-based method
This approach is an optimized LUT-based method combined with a Taylor series expansion. The equation is expanded up to the fifth order:
$$\tanh(x) \approx x - \frac{x^3}{3} + \frac{2x^5}{15} \tag{6}$$
When $\frac{x^3}{3} - \frac{2x^5}{15} \le 0.02$, one can use the approximation $\tanh(x) \approx x$. Solving the inequality gives a threshold of about x = 0.39; in addition, $\tanh(2.90) \approx 1$. Only the values in the [0.39, 2.90] range need to be stored.
In all, LUT-based methods need storage/memory and an extra pipeline stage for the memory access. All these methods, except the one that stores function values in LUTs, either require relatively complex calculations/logic or several clock cycles to minimize the approximation error. According to our experiments in Section V, ANNs are generally insensitive to activation function precision. This is a key insight that allows us to simplify the approximation method without sacrificing the system accuracy. In the following sections, we analyze the activation functions and present our proposed combinational circuit based implementation method.
III. ACTIVATION FUNCTIONS EXPLORATION
In this section, we discuss the nonlinear activation functions realized using our proposed design approach. We define the sigmoid and tanh functions as:
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{7}$$
Compared to the sigmoid function, the tanh function passes through zero and can be approximated as y = x around zero. As a result, when the absolute value of the input is small enough, one can perform the matrix operation directly; therefore, the training process is relatively easy. In principle, sigmoid and tanh have similar expressive ability, but in practice, sigmoid is equivalent to an activation function with a bias. It still needs the real bias term to offset its influence, which can affect the optimization. Therefore, the tanh function is used more often. Furthermore, it converges faster than the sigmoid function.
The tanh function has been shown experimentally to outperform the sigmoid function. There are two reasons for this. First, the output of the tanh function is normalized around 0, producing both positive and negative outputs. The sigmoid is not, introducing a systematic bias. Second, when the output of the neuron is restricted to [−1, 1], the activation is more likely to be close to 0, so the neurons are generally
less saturated with tanh than with sigmoid, allowing gradients to
better propagate and speeding up learning [15].
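The relation between sigmoid and tanh used by the approximation formula of Section II-E (Eqs. 4 and 5) can be checked numerically. The following is a minimal floating-point sketch with hypothetical function names; it is not the multi-cycle hardware structure of [14], only a check of how closely the formulas track the exact tanh.

```python
import math


def sigmoid_approx(x):
    """Eq. (4): sigmoid(x) ~ 1 / (1 + 2^(-1.5x))."""
    return 1.0 / (1.0 + 2.0 ** (-1.5 * x))


def tanh_approx(x):
    """Eq. (5): tanh(x) = 1 - 2 * Sigmoid(-2x), using the approximate sigmoid."""
    return 1.0 - 2.0 * sigmoid_approx(-2.0 * x)


# Largest deviation from the exact tanh on a uniform grid over [-4, 4].
grid = [i / 50.0 for i in range(-200, 201)]
worst = max(abs(tanh_approx(x) - math.tanh(x)) for x in grid)
print(worst)
```

Because $2^{-1.5x}$ replaces $e^{-x}$, the result is a slightly rescaled tanh; the sketch preserves the odd symmetry of the function and its zero crossing.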
In [5], the authors introduced the SELU function and analytically proved that neuron activations converge towards zero mean and unit variance. This allows networks with SELU activations to train deeper models, speed up learning, and use stronger regularizers without sacrificing accuracy. This is the main motivation behind focusing our work on the efficient implementation of exponential functions.
TABLE I
TANH_7_4 IMPLEMENTATION
Input: X3 X2 X1 X0
AND plane
p1  {X3}                  p2  {X2, X0}
p3  {X2, X1}              p4  {X2, X1, X0}
p5  {X2, X1, X0}          p6  {X3, X2}
p7  {X3, X1, X0}          p8  {X3, X1, X0}
p9  {X3, X2, X1, X0}      p10 {X3, X2, X0}
p11 {X3, X2, X1}          p12 {X3, X1, X0}
p13 {X2, X1, X0}          p14 {X3, X2, X1, X0}
p15 {X1, X0}              p16 {X2, X1}
p17 {X3, X2, X1, X0}      p18 {X3, X1, X0}
p19 {X3, X2, X1}
We define ELU and SELU as:
$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha e^{x} - \alpha, & \text{else} \end{cases} \tag{8}$$
$$\mathrm{SELU}(x) = \lambda\,\mathrm{ELU}(x) \tag{9}$$
Here λ = 1.0507 and α = 1.6733.
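As a floating-point reference for Eqs. (8) and (9), the two functions can be written directly in Python. This is a plain software sketch with hypothetical function names; it serves only as the exact baseline that a quantized circuit approximates.

```python
import math

LAMBDA = 1.0507  # lambda from the text
ALPHA = 1.6733   # alpha from the text


def elu(x, alpha=ALPHA):
    """Eq. (8): identity for x > 0, alpha*e^x - alpha otherwise."""
    return x if x > 0 else alpha * math.exp(x) - alpha


def selu(x):
    """Eq. (9): SELU(x) = lambda * ELU(x)."""
    return LAMBDA * elu(x)


for x in (-4.0, -1.0, 0.0, 1.0):
    print(x, selu(x))
```

For strongly negative inputs the output saturates near −λα, which is why only a bounded negative interval needs to be approximated.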
IV. ACTIVATION FUNCTION IMPLEMENTATION
A. Implementation of the tanh function
In this section, we introduce our method for implementing the tanh activation function using exclusively combinational circuits. We consider only the intervals where the function changes significantly.
1) Properties of the tanh function: Tanh is an odd function, meaning that it is symmetric with respect to the origin. In order to approximate it, we only need to observe the positive half of the function. As it converges to 1, we approximate its value in the range [0, 2] for the targeted precision in this work.
We divide the activation function range into $2^k$ segments evenly with the step $1/2^k$. The approximation error depends on the number k, which controls the sampling density. The larger k is, the lower the approximation error, but the more complex the implementation.
During training, the exact tanh function is used to calculate the gradients, since the approximate function is non-differentiable. The approximated function is used for the forward pass.
2) Encoding the value of the activation function: After selecting a sampling rate, we choose the bit-widths of the integer and fractional parts of the output value. The integer part is either 0 or 1. For the illustrative case, in order to reduce the complexity of the combinational logic, we choose 7 bits to encode the output value of the activation function: 1 bit for the integer part and 6 bits for the fractional part.
3) Generating the Karnaugh map for the tanh function: Boolean functions can be expressed in their canonical form: by listing the input values on the left side of the truth table and the output values on the right side, we get a Karnaugh map. Figure 2 shows the Karnaugh map for one of the tanh activation bits. By analyzing the map, one can derive the needed circuit for implementing the bit. We repeat this procedure for all the bits of the tanh activation function.
Fig. 2. Karnaugh map of one of the output bits of the tanh activation function (Y1) with a 4-bit input and a 7-bit output (tanh_7_4).
A direct implementation will have a circuit for every cell with the value 1, and a multiple-input OR gate choosing one of these circuit outputs. We can simplify the logic expression from the Karnaugh map by combining some of the adjacent 1's in the table cells. This reduces the number of individual circuits. As an illustration, here is the expression of one bit of the output value:
$$Y_1 = X_2X_1 + X_2X_0 + X_3X_2X_1X_0 + X_1X_0 \tag{10}$$
$X_i$ refers to the (i+1)-th bit of the input value, and $Y_1$ refers to the second bit of the output value.
4) Combinational logic for the tanh function: Finally, we can implement the logic expression using an RTL language to get the logic circuits. As for the negative part of the function, since tanh is an odd function, we can deliver the sign bit to the output directly. If we use g(x) to represent the ladder function on [−2, 2], the approximated activation function tanh can be written as follows:
$$\tanh(x) = \begin{cases} 1, & x \ge 2 \\ g(x), & 2 \ge x > -2 \\ -1, & -2 \ge x \end{cases} \tag{11}$$
5) Simulation and validation: Once we have the RTL module, we need to simulate it to check the logic expression and make sure it approximates the desired function. Next, we analyze the time delay of the combinational circuit and check whether the activation function lies on the critical path of the design. After functional and timing testing, if there exist any race conditions or hazards, we change the Karnaugh map to remove them.
After simplifying the logic expression, we obtain the final expressions of the tanh function as illustrated in Table I.
TABLE I (continued)
TANH_7_4 IMPLEMENTATION
Output: Y6 Y5 Y4 Y3 Y2 Y1 Y0
OR plane
Y6 {p3, p2, p1}
Y5 {p5, p4, p1}
Y4 {p8, p7, p6, p3, p2}
Y3 {p14, p13, p12, p11, p10, p9, p7}
Y2 {p15, p11, p10, p5}
Y1 {p17, p16, p15, p2}
Y0 {p19, p18, p11, p8, p4}
B. Implementation of the SELU activation function
We demonstrate in this section the implementation of the SELU function using only combinational circuits.
1) Properties of the SELU activation function: From Equation 8, the positive part of the SELU function is linear, so we only need to approximate the negative part. Considering $e^{-3.875} \approx 0.0208 \approx 0$, if the input value is less than −3.875, the output value is −α, α being a static predetermined parameter. We then divide the interval [−3.875, 0] into k segments evenly.
2) Encoding the value of the activation function: We encode the input value with 5 bits. To maintain precision, we encode the output value into 8 bits: 1 bit for the integer part and 7 bits for the fractional part. In this way, it can be represented as SELU_8_5.
3) Generating the Karnaugh map for the SELU function: We can construct the truth table and Karnaugh map in a similar fashion as described in Section IV.A.3). From the Karnaugh map, we draw the Karnaugh circles to get the simplest logical expression without race conditions and hazards. In total, we arrive at 31 logical expressions. Here we show an illustration using one bit of the output value:
$$Y_3 = X_4X_3X_2 + X_4X_3X_1 + X_3X_1X_0 + X_2X_1X_0 + X_3X_2X_1 + X_4X_3X_2X_1 + X_4X_3X_1X_0 + X_4X_3X_1X_0 + X_4X_3X_2X_1 \tag{12}$$
$X_i$ refers to the (i+1)-th bit of the input value, and $Y_3$ refers to the fourth bit of the output value.
In Figure 3, each color block refers to a product, and the logic expression is the sum of all the products. The blocks that have the same color refer to the same product.
Fig. 3. Karnaugh map of one of the output bits of the SELU activation function (Y3) with a 5-bit input and an 8-bit output (SELU_8_5).
TABLE II
SELU_8_5 IMPLEMENTATION
Input: X4 X3 X2 X1 X0
AND plane
p1  {X3}                      p2  {X4}
p3  {X4, X3}                  p4  {X2, X1, X0}
p5  {X3, X2, X1}              p6  {X4, X1, X0}
p7  {X4, X3, X1}              p8  {X4, X3, X2}
p9  {X4, X2, X0}              p10 {X4, X2, X1}
p11 {X4, X3, X2}              p12 {X4, X3, X1}
p13 {X2, X1, X0}              p14 {X3, X2, X1}
p15 {X3, X1, X0}              p16 {X4, X3, X1}
p17 {X3, X2, X1}              p18 {X3, X2, X1, X0}
p19 {X, X, X, X}              p20 {X, X, X, X}
p21 {X4, X3, X2, X0}          p22 {X4, X2, X1, X0}
p23 {X4, X3, X1, X0}          p24 {X4, X3, X2, X1}
p25 {X4, X3, X2, X1}          p26 {X4, X3, X1, X0}
p27 {X4, X3, X1, X0}          p28 {X4, X3, X2, X1}
p29 {X4, X3, X2, X0}          p30 {X4, X3, X1, X0}
p31 {X4, X2, X1, X0}          p32 {X4, X3, X2, X0}
p33 {X3, X2, X1, X0}          p34 {X4, X3, X2, X1}
p35 {X4, X3, X1, X0}          p36 {X4, X2, X1, X0}
p37 {X4, X3, X2, X1}          p38 {X4, X3, X2, X1}
p39 {X4, X3, X2, X0}          p40 {X4, X3, X1, X0}
p41 {X4, X2, X1, X0}          p42 {X4, X3, X2, X1}
p43 {X3, X2, X1, X0}          p44 {X3, X2, X1, X0}
p45 {X3, X2, X1, X0}          p46 {X4, X3, X2, X1}
p47 {X4, X3, X1, X0}          p48 {X4, X3, X2, X0}
p49 {X3, X2, X1, X0}          p50 {X3, X2, X1, X0}
p51 {X4, X3, X2, X1, X0}      p52 {X4, X3, X2, X1, X0}
p53 {X4, X3, X2, X1, X0}      p54 {X4, X3, X2, X1, X0}
p55 {X4, X3, X2, X1, X0}      p56 {X4, X3, X2, X1, X0}
4) Combinational logic of the SELU function: The approximation of SELU using purely combinational logic is shown in Table II. The table shows the final complete logic expressions. We define the SELU function using the formulation shown in Equation 13. It is worth noting that we only define f(x) on the (−3.875, 0) interval, as the function is linear for x ≥ 0 and constant for x ≤ −3.875.
$$\mathrm{SELU}(x) = \lambda \begin{cases} x, & x \ge 0 \\ f(x), & 0 \ge x > -3.875 \\ -\alpha, & -3.875 \ge x \end{cases} \tag{13}$$
5) Simulation and validation: The purpose of the simulation is the
same as in the case of the tanh activation function. As more variables
may lead to race conditions and hazards more easily, all the possible
combinations should be simulated.
More accurate approximations can be achieved by increasing the
number of bits for inputs and outputs. Increasing the number of bits in
the input helps break the function into more linear segments. Whereas,
a larger number of bits in the output representation boosts its precision.
In all, using higher bit-widths improves the approximation accuracy but
also leads to more complex circuits.
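The flow of Section IV (sample the function, encode each sample as a fixed-point code, then derive a sum of products per output bit) can be sketched in software as follows. This is an illustration under stated assumptions, not the authors' generator: the 0.125 input step, round-to-nearest encoding, encoding the magnitude of the negative SELU branch with λ applied outside, and all function names are assumed. The merging routine keeps all prime implicants rather than selecting a minimal cover, so the resulting expressions need not match Eq. (10), Eq. (12), or Tables I and II.

```python
import math

ALPHA = 1.6733  # SELU parameter given in Section III


def truth_table(func, in_bits, step, frac_bits):
    """Sample func at i*step and round to an unsigned fixed-point code."""
    scale = 1 << frac_bits
    return [int(round(func(i * step) * scale)) for i in range(1 << in_bits)]


def prime_implicants(minterms, n_bits):
    """Repeatedly merge products that differ in exactly one input bit.

    A product is a (mask, value) pair: it is 1 when (x & mask) == value.
    Merging two such products mirrors circling adjacent 1-cells on a
    Karnaugh map.
    """
    current = {((1 << n_bits) - 1, m) for m in minterms}
    primes = set()
    while current:
        merged, used = set(), set()
        for mask, value in current:
            for bit in range(n_bits):
                b = 1 << bit
                partner = (mask, value ^ b)
                if mask & b and partner in current:
                    merged.add((mask & ~b, value & ~b))
                    used.update({(mask, value), partner})
        primes |= current - used  # products that could not be merged further
        current = merged
    return primes


def evaluate(products, x):
    """Two-level AND-OR evaluation of a sum of products for input code x."""
    return int(any((x & mask) == value for mask, value in products))


# Assumed grids: tanh on [0, 2) and the magnitude of the negative SELU
# branch, alpha*(1 - e^(-m)), for m on [0, 3.875].
tanh_7_4 = truth_table(math.tanh, in_bits=4, step=0.125, frac_bits=6)
selu_8_5 = truth_table(lambda m: ALPHA * (1.0 - math.exp(-m)),
                       in_bits=5, step=0.125, frac_bits=7)

# Sum of products for output bit Y1 of the assumed tanh table.
y1_minterms = [i for i, code in enumerate(tanh_7_4) if (code >> 1) & 1]
y1_products = prime_implicants(y1_minterms, n_bits=4)
print(sorted(y1_products))
```

Exhaustively comparing `evaluate` against the truth table for every input code plays the role of the simulation step: with 4 or 5 input bits, all 16 or 32 combinations can be checked directly.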
V. EXPERIMENT
Look-up table based designs are the most common implementation of
activation functions. Therefore, in our comparative study, for the baseline
designs, we implement the tanh and SELU functions using look-up tables.
The function values storage based implementation is denoted as ROM_y,
and the parameters storage based one as ROM_k_b. Their construction
uses LUTs and follows the procedures described in Section II. Their
comparison with the proposed combinational circuit based approach is
done in terms of approximation error, power, area, and network accuracy.
The evaluation is conducted in a two-step, software-hardware approach.
First, we evaluate the approximation method in software using PyTorch to
verify the neural network accuracy. It is worth noting that the procedure
may run multiple iterations to find out an appropriate bit-width. Second,
for a selected bit-width, the full neural network is implemented in
hardware. We then perform circuit-level analysis on the RTL code and
deploy it on the FPGA board for further system-level validation.
A. Approximation Error
The average errors for the three different methods (the proposed combinational circuit based approach, ROM_y, and ROM_k_b) are shown in Table III. The errors for the tanh and SELU functions are bounded to the ranges −2 < x < 2 and −3.875 < x < 0, respectively. The average error is calculated using the following formula:
$$\mathrm{AverageError} = \frac{\sum |P - A|}{|P|\,N} \times 100\% \tag{14}$$
P is the function value, A is the approximate value, and N represents the number of sample points. Since piecewise linear approximation is used in our proposed method, the average error is the same as in the function values storage approach (ROM_y) and larger than when the parameters are stored (ROM_k_b). The parameters storage approach does use more resources for this slight accuracy improvement (cf. V-B).
TABLE II (continued)
SELU_8_5 IMPLEMENTATION
Output: Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0
OR plane
Y7 {p4, p2, p1}
Y6 {p19, p18, p5, p2}
Y5 {p20, p8, p7, p6}
Y4 {p53, p52, p51, p24, p23, p22, p21, p10, p9, p3}
Y3 {p28, p27, p26, p25, p15, p14, p13, p12, p11}
Y2 {p55, p54, p34, p32, p31, p30, p29, p18, p17, p16}
Y1 {p45, p44, p43, p42, p41, p40, p39, p38, p37, p36, p35, p34, p27, p23, p22, p19}
Y0 {p56, p50, p49, p48, p47, p47, p46, p44}
TABLE III
COMPARISON OF AE (AVERAGE ERROR) AND AREA
Activation function   tanh_7_4               SELU_8_5
Index                 AE      Area (µm²)     AE      Area (µm²)
Our method            4.19%   97.65          2.22%   137.59
ROM_y (1)             4.19%   306.12         2.22%   612.24
ROM_k_b               0.52%   751.44         0.17%   1162.93
(1) A LUT with 4-bit inputs and 7-bit outputs (1 bit for the integer) can be implemented with a 16 × 6-bit ROM, which is estimated to take half of the area of a 32 × 6-bit ROM. The ROM in Memory Compiler has at least 32 entries. A LUT with 5-bit inputs and 9-bit outputs (1 bit for the integer) can be implemented with a 32 × 8-bit ROM.
B. Resources Analysis
To get more accurate results, we synthesized the different designs using a 28nm SVT library. The absolute time delays for the tanh and SELU functions are 0.1947ns and 0.221ns. This means that the combinational logic can operate at the maximum frequencies of 5.14GHz (tanh) and 4.52GHz (SELU). The area overheads of the tanh and SELU implementations are shown in Table III.
The proposed combinational circuit based implementation of the tanh function saves 68.1% and 87.0% in area compared to the function values storage approach (ROM_y) and the parameters storage method (ROM_k_b), respectively. For the SELU function implementation, the area savings are 77.53% and 88.17% over ROM_y and ROM_k_b, respectively. The three methods are deployed on the Xilinx VC7V2000T FPGA board. The power consumption results are reported in Figure 4.
Fig. 4. Power consumption (mW) of the three methods (our method, ROM_y, and ROM_k_b) for tanh and SELU. The energy consumption of the proposed method is lower compared to the alternative LUT-based approaches.
One clock cycle is needed to get the function value from the LUT for both the ROM_y and ROM_k_b approaches. For the ROM_k_b, two clock cycles are needed for the linear function computation. On the other hand, the proposed method is purely combinational.
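The figures reported in this section can be recomputed from the numbers in Table III and the synthesis delays. The sketch below uses hypothetical function names. The average error is interpreted here as the mean relative error over the sample points, with points where P = 0 skipped; that reading of Eq. (14) is an assumption.

```python
def average_error(exact, approx):
    """Mean of |P - A| / |P| over the sample points, in percent."""
    pairs = [(p, a) for p, a in zip(exact, approx) if p != 0]
    return 100.0 * sum(abs(p - a) / abs(p) for p, a in pairs) / len(pairs)


def area_saving(ours_um2, baseline_um2):
    """Area saved relative to a baseline, in percent."""
    return 100.0 * (1.0 - ours_um2 / baseline_um2)


def max_frequency_ghz(delay_ns):
    """Maximum operating frequency for a given combinational delay."""
    return 1.0 / delay_ns


# Area values (um^2) from Table III.
print(area_saving(97.65, 306.12), area_saving(97.65, 751.44))       # tanh_7_4
print(area_saving(137.59, 612.24), area_saving(137.59, 1162.93))    # SELU_8_5

# Delays (ns) from the synthesis results.
print(max_frequency_ghz(0.1947), max_frequency_ghz(0.221))
```

The same area ratios, expressed as factors (for example 306.12 / 97.65 for tanh against ROM_y), give the 3.13× to 8.45× range quoted in the abstract.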
C. Inference Accuracy
In this section, we focus on the network accuracy. We use PyTorch
with bit-wise operations to approximate the activation functions. We train
the neural networks using the original, full-precision activation function
implementations. Then, in the validation phase, we replace the activation
functions with their approximation function circuits. We make no attempt
to retrain the network after changing the activation function, since such an
attempt may remove accuracy losses incurred by quantizing the activation
functions.
TABLE IV
PERFORMANCE OF OUR METHOD ON ANN COMPARED WITH ORIGINAL DESIGN
                   MNIST     CIFAR-10   ImageNet (Top-1/Top-5)
tanh (original)    96.15%    87.17%     42.39% / 67.61%
tanh_5_4           −0.2%     −5.07%     −8.16% / −8.94%
tanh_7_4           −0.05%    −1.96%     −7.83% / −8.42%
tanh_7_6           +0.04%    −0.29%     −7.23% / −8.0%
SELU (original)    97.67%    86.79%     39.260% / 63.342%
SELU_5_4           +0.04%    −4.15%     −0.122% / +0.458%
SELU_7_4           +0.01%    −4.47%     −0.004% / +0.866%
SELU_8_5           +0.37%    −0.69%     +0.368% / +1.278%
We test the proposed method with LeNet on the MNIST dataset,
VGG-16 on the CIFAR-10 dataset, and AlexNet on the IMAGENET dataset. The
experimental results show an accuracy loss of 0.05% and an increase of
0.37% compared to the original network on MNIST using tanh_7_4 and
SELU_8_5, respectively. In the case of the CIFAR-10 experiments, we get
accuracy losses of 1.96% and 0.69% for tanh_7_4 and SELU_8_5.
For the experiments on IMAGENET, the accuracy losses are 7.83%
on top-1 and 8.42% on top-5 under tanh_7_4, while there are gains
of 0.368% on top-1 and 1.278% on top-5 for SELU_8_5. The
results of the comparative study of the exact implementation and the
proposed approximation method are summarized in Table IV. In several
configurations, such as SELU_8_5 on MNIST and IMAGENET, the network
inference accuracy increases when the quantization method is applied.
The overall effect of the quantization precision on the inference accuracy
follows the pattern observed in other studies [16].
VI. CONCLUSION
In this work, we propose an efficient approximation scheme for
activation functions using purely combinational logic, which takes only
one clock cycle. We show its implementation and performance on
two widely used activation functions, i.e., tanh and SELU. We conduct
a comparative study of the proposed method with other widely used
methods, i.e., storage based approaches. Based on the average approximation
errors, our method has the best performance to circuit complexity
ratio. Activation quantization has little effect on network accuracy. The
hardware implementation of the proposed activation functions is realized
using the 28nm SVT library to further validate the efficiency of the
proposed approach in terms of area and timing delay. Area reductions
of 68.1% and 87.0% for the tanh function, and 77.53% and 88.17% for
the SELU function are recorded when compared with the two baseline
LUT-based activation function implementations (ROM_y and ROM_k_b).
VII. ACKNOWLEDGEMENTS
This work was supported in part by the National Science and Tech-
nology Major Project of China No.2018ZX01028-101-001, National Key
Research and Development Plan No.2016YFB0200202 and National
Natural Science Foundation of China No.61773307.
REFERENCES
[1] G. Cybenko, “Approximation by superpositions of a sigmoidal function,”
Approximation Theory & Its Applications, vol. 9, no. 3, pp. 17–28, 1993.
[2] C. W. Lin and J. S. Wang, “A digital circuit design of hyperbolic tangent
sigmoid function for neural networks,” in IEEE International Symposium
on Circuits and Systems, 2008, pp. 856–859.
[3] K. Basterretxea, J. M. Tarela, I. Del Campo, and G. Bosque, “An
experimental study on nonlinear function computation for neural/fuzzy
hardware design,” IEEE Transactions on Neural Networks, vol. 18, no. 1,
pp. 266–83, 2007.
[4] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep
network learning by exponential linear units (elus),” 2015.
[5] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-
Normalizing Neural Networks,” 2017.
[6] S. Vassiliadis, M. Zhang, and J. G. Delgado-Frias, “Elementary function
generators for neural-network emulators,” IEEE Transactions on Neural
Networks, vol. 11, no. 6, pp. 1438–1449, 2002.
[7] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo,
S. Yao, Y. Wang, H. Yang, and W. B. J. Dally, “Ese: Efficient speech
recognition engine with sparse lstm on fpga,” in Proceedings of the
2017 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays. ACM, 2017, pp. 75–84.
[8] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, S. Zheng, T. Lu, J. Gu,
L. Liu, and S. Wei, “A high energy efficient reconfigurable hybrid
neural network processor for deep learning applications,” IEEE Journal
of Solid-State Circuits, vol. 53, no. 4, pp. 968–982, 2018.
[9] S. Yin, S. Tang, X. Lin, P. Ouyang, F. Tu, L. Liu, and S. Wei, “A
high throughput acceleration for hybrid neural networks with efficient
resource management on fpga,” IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. PP, no. 99, pp. 1–14,
2018.
[10] M. T. Tommiska, “Efficient digital implementation of the sigmoid
function for reprogrammable logic,” IEE Proceedings - Computers and
Digital Techniques, vol. 150, no. 6, pp. 403–411, 2003.
[11] J. Li, Z. Yuan, Z. Li, C. Ding, A. Ren, Q. Qiu, J. Draper, and Y. Wang,
“Hardware-driven nonlinear activation for stochastic computing based
deep convolutional neural networks,” 2017.
[12] M. Zhang, S. Vassiliadis, and J. G. Delgado-Frias, “Sigmoid generators
for neural computing using piecewise approximations,” IEEE Transac-
tions on Computers, vol. 45, no. 9, pp. 1045–1049, 1996.
[13] M. Qian, “Application of cordic algorithm to neural networks vlsi
design,” in Computational Engineering in Systems Applications, IMACS
Multiconference on, vol. 1. IEEE, 2006, pp. 504–508.
[14] S. Gomar, M. Mirhassani, and M. Ahmadi, “Precise digital implemen-
tations of hyperbolic tanh and sigmoid function,” in Signals, Systems
and Computers, 2016 50th Asilomar Conference on. IEEE, 2016, pp.
1586–1589.
[15] Y. A. LeCun, L. Bottou, G. B. Orr, and K. R. Muller, “Efficient back-
prop,” Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol.
7700 LECTU, pp. 9–48, 2012.
[16] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio,
“Quantized neural networks: Training neural networks with low preci-
sion weights and activations,” 2016.
