Abstract-Approximate operators have been developed to overcome the performance limitations of the original accurate arithmetic operators. They trade the output quality of the operator against its energy consumption, area or delay. To benefit from this trade-off, the logic structure of the original accurate operator is modified. When integrating approximate operators in a complex system, numerous simulations of the application are required to ensure that the application requirements are fulfilled despite the induced approximations. Because the hardware implementation of approximate operators is not always available in the early phases of application prototyping, lengthy software simulation of their complex bit-level structure is used. This paper proposes a fast simulator for approximate operators built from the output values of the original accurate operator. The error due to the approximation is modeled by a stochastic process whose features are learned from the errors of the approximate operator. The proposed simulator is compared to the bit-accurate logic-level simulation and to a simulator of approximate operators built on the input values of the operator. Experiments on 10-bit operators show that the proposed method is up to 63× faster than a bit-accurate logic-level simulation.
I. INTRODUCTION
The current challenge when designing algorithms for embedded platforms is to embed ever more intricate algorithms on over-constrained platforms, whether the constraints bear on the energy consumption or on the memory footprint.
To overcome this paradox, approximate computing has been proposed [6], [9], taking advantage of the inherent error resilience of a majority of the algorithms to embed. Algorithms in the image processing, video processing and data mining fields, as well as recursive algorithms, are error-resilient by nature, as explained in [4] for video coding algorithms. The output quality of these algorithms is traded off against their implementation cost using approximations, such that the real-time and power consumption constraints of embedded systems are met. Among the numerous approximation techniques proposed, approximate operators are under consideration in this paper.
To overcome the current limitations in terms of power consumption and/or area of exact arithmetic operators, their Boolean function can be modified. This functional approximation induces errors in exchange for a reduced logic complexity and/or shorter critical paths, as presented in [5]. Approximate operators such as the Almost Correct Adder (ACA) [10] or the Approximate Array Multiplier (AAM) [8] are considered in this paper. To use an approximate operator in an application, a design space exploration is carried out to find the best approximation configuration with respect to the constraints of the platform, and to ensure that the application quality metric is met despite the induced approximations. For example, for an image processing application, the application quality metric can be the Structural Similarity Index [11]. To explore the whole design space, set the different parameters of the approximate operators and find which operations should be approximated, a fast simulation of approximate operators is needed. However, the simulation of approximate operators is intricate since their hardware implementation is more complex than their accurate implementation. To study the impact of the induced approximations on the application, techniques such as interval or affine arithmetic [3] cannot be used since they do not monitor the impact of the approximation on the application quality metric. Moreover, they do not take the frequency of error occurrence into account.
To study the impact of approximate operators in an application, simulation is therefore required. To explore the design space of an application, the approximate operator is simulated within the application, which is described in C code. The different approximation configurations are simulated with a large set of input data to compute the quality metric of the application. Currently, simulating approximate operators is complex since the operation is not implemented directly with a single native instruction of the host computer. Three alternatives can be considered for simulation. The simulation can be done in a Hardware Description Language (HDL) at the logic level, as in [7]. A hardware-in-the-loop approach can be considered by implementing the operator in a Field-Programmable Gate Array (FPGA). Nevertheless, these two alternatives require the implementation of a complex interface with the C code or the computer. The simplest solution is to simulate approximate operators with an equivalent C code reproducing the internal logic structure of the operator. This software simulation is very long since, to be bit-accurate, it must be carried out at the logic level (Bit-Accurate Logic-Level, or BALL, simulation). Although they are prohibitively long, numerous simulations, over the set of inputs and the different approximation configurations, are necessary. Without the ability to test the impact of such operators in an application, their use in real embedded systems is compromised. Thus, a fast simulation of these operators is strongly needed to quickly evaluate their impact on the application quality.
In this paper, we propose a new approach to quickly evaluate the application quality metric from simulation. To accelerate simulation compared to a BALL one, the error $e$ due to approximate operators is modeled by a pseudo-random variable (p.r.v.) $\tilde{e}$. The error $e$ depends on the values handled by the operators. Thus, the output set is decomposed into subspaces and a p.r.v. $\tilde{e}_i$ is associated with each subspace. The characteristics of each p.r.v. $\tilde{e}_i$ are stored in a table $T_{err}$.
Compared to our previous approach (FnF$_i$) presented in [1], in the proposed Fast and Fuzzy simulator (FnF$_o$), the output set is decomposed instead of the input set. Consequently, the size of the table storing the statistical parameters of $\tilde{e}_i$ is dramatically decreased in the case of an addition or subtraction operator. The p.r.v. $\tilde{e}$ combines a Bernoulli-distributed p.r.v. to control the error frequency and a uniform p.r.v. to control the error amplitude. Moreover, the effect of the number of subspaces on the result of the simulation is analyzed.
For a simple 16-bit approximate adder, a simulation speed-up of up to 5.5× is demonstrated. For a more sophisticated 10-bit approximate multiplier, a simulation speed-up of up to 63× is demonstrated while inducing an acceptable loss of quality on the simulation output.
The paper is organized as follows: Section II details the proposed FnF$_o$ simulation. The efficiency of the FnF$_o$ simulation in terms of time savings and quality, as well as a comparison with the FnF$_i$ simulation, are presented in Section III. Finally, Section V concludes the article.
II. PROPOSED TECHNIQUE

A. Modeling the approximate operator error by a stochastic process
In this part, the proposed method to accelerate the quality evaluation process is presented. Let $\hat{\diamond}$ be an approximate operator whose input operands $x \in I_x = [\underline{x}; \overline{x}]$ and $y \in I_y = [\underline{y}; \overline{y}]$ are coded on $N_x$ and $N_y$ bits respectively. $\underline{x}$ and $\overline{x}$ represent the minimum and maximum values of $x$ (likewise for $y$). The accurate original operator is $\diamond$. The output of the accurate operator is $z = x \diamond y$ and is coded on $N_z$ bits. The set of all the possible output values for the accurate operator is $O$. For an addition or subtraction, the output set $O$ is composed of $2^{\max(N_x, N_y)+1}$ values, and for a multiplier $O$ is composed of $2^{N_x+N_y}$ values.
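Let $e(x, y)$ be the error at the output of the approximate operator whose inputs are $x$ and $y$. With $\hat{\diamond}$ denoting the approximate operator and $\diamond$ its accurate counterpart, $e(x, y)$ is expressed as:

$$e(x, y) = (x \,\hat{\diamond}\, y) - (x \diamond y) \tag{1}$$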
In the case of fixed-point arithmetic, the error generated by the finite word-length has been widely studied to derive mathematical models, as presented in [2]. The error caused by the fixed-point format can be considered as a uniform random variable. The frequency of error occurrence is equal to 1 since the error is always present, but the error amplitude is small compared to the amplitude of the original signal. Contrary to fixed-point errors, the error $e$ due to approximate operators can have a high amplitude, but does not always occur. Consequently, when using approximate operators, the frequency of error occurrence must be low so as not to degrade the output quality of the application too much. Thus, defining a mathematical model for approximate operators is still an open issue.
The straightforward approach to evaluate the degradation of the application quality is to simulate the application with the approximations on a set of representative data. To capture the effects of approximate operators on the application quality, the internal logic structure of the operator must be simulated. Even if this simulation is carried out in C, the logic-level simulation drastically slows down the application simulation process. A table storing the error $e$ and addressed by $x$ and $y$ can be considered to reproduce the exact behavior of the approximate operator. Nevertheless, this table is composed of $2^{N_x+N_y}$ elements. The amount of memory needed to store this table is prohibitively large even for small values of $N_x$ and $N_y$.
In the proposed approach, the error due to the approximation is modeled by a stochastic process whose features are determined during an operator characterization phase. The proposed method aims to replace the approximate operator $\hat{\diamond}$ by the accurate version $\diamond$ of the operator plus an error $\tilde{e}$ with the same statistical characteristics as the error $e$ generated by the approximate operator, as:

$$x \,\hat{\diamond}\, y \approx (x \diamond y) + \tilde{e} \tag{2}$$
B. Pseudo-random variables $\tilde{e}_i$ to model the error $e$
Considering a single error $\tilde{e}$ to model the error over the entire set $I_x \times I_y$ leads to a coarse model of the approximate operator error $e$. To model the approximate operator error more finely, the proposed method decomposes the output set $O$ into $2^{N_z-F}$ subspaces $O_i$, where $F$ is the fuzziness degree, i.e. the number of LSBs of $z$ that are ignored when selecting a subspace, and associates a p.r.v. $\tilde{e}_i$ with each subspace. The statistical characteristics of the p.r.v. $\tilde{e}_i$ are stored in a table $T_{err}$, and the $N_z - F$ MSBs of the variable $z$ are used to address $T_{err}$. Rather than indexing this table with the inputs, as in the FnF$_i$ simulator, the proposed approach uses the output values to index the table $T_{err}$.
Numerous input combinations do not generate an error at the output of the approximate operator. Let $f_i$ be the frequency of error occurrence in the subspace $O_i$. The error $e$ is equal to 0 with a probability of $1 - f_i$. To model the error committed in the subspace $O_i$, the p.r.v. $\tilde{e}_i$ is generated with Equation 3:

$$\tilde{e}_i = p_i \cdot (a_i \cdot u_i + b_i) \tag{3}$$
In Equation 3, $u_i$ represents a uniform random variable and $p_i$ a random Boolean variable whose distribution follows a Bernoulli law. The random variable $p_i$ is equal to 1 with a probability $f_i$ and to 0 with a probability $1 - f_i$. The random variable $p_i$ is obtained by thresholding the random variable $u_i$, uniformly distributed in the interval $[0, 1]$, as presented in Equation 4. The variables $(a_i, b_i)$ are the coefficients of the affine form used to compute an error value with the right amplitude.

$$p_i = \begin{cases} 1 & \text{if } u_i < f_i \\ 0 & \text{otherwise} \end{cases} \tag{4}$$
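As an illustration of this error model, the following Python sketch draws one sample of the modeled error. It assumes the error takes the form $\tilde{e}_i = p_i (a_i u_i + b_i)$, with $p_i = 1$ when $u_i < f_i$; the strict inequality, the function name `draw_error` and the numerical values are assumptions made for this example only.

```python
import random


def draw_error(a_i, b_i, f_i, u_i):
    """Return one sample of the modeled error for a subspace.

    u_i is a uniform sample in [0, 1). An error is emitted only when
    u_i falls below the error frequency f_i (Bernoulli gate p_i).
    """
    p_i = 1 if u_i < f_i else 0      # Bernoulli variable derived from u_i
    return p_i * (a_i * u_i + b_i)   # affine amplitude, gated by p_i


# Empirical check: the fraction of non-zero draws approaches f_i.
# The parameters (4.0, 2.0, 0.2) are illustrative, not taken from the paper.
rng = random.Random(0)
samples = [draw_error(4.0, 2.0, 0.2, rng.random()) for _ in range(10_000)]
error_rate = sum(s != 0 for s in samples) / len(samples)
print(f"observed error frequency: {error_rate:.3f}")
```

Under this assumed form, the same $u_i$ both gates the error and sets its amplitude, so the emitted amplitudes span $[b_i, a_i f_i + b_i)$ rather than $[b_i, a_i + b_i)$; the coefficients would have to be chosen with this in mind.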
During the approximate operator error characterization phase that builds the proposed simulator, the characteristics of the error $e_i$ generated by the approximate operator are extracted for each subspace $O_i$. For each $O_i$ in $O$, the error values $e_i$ are computed for the input combinations $(x, y)$ such that $z = x \diamond y \in O_i$. Then, from these error values in $O_i$, the error amplitude represented by $(a_i, b_i)$ and the threshold $f_i$ used to generate an error with the same frequency of error occurrence are computed and stored in the table $T_{err}$.
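The characterization phase can be sketched in Python as an exhaustive sweep over the inputs. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_approx_add` is a hypothetical 4-bit adder invented for the example, the index table convention (0 meaning "no error") follows the algorithm described in this paper, and the rule deriving $(a_i, b_i)$ from the minimum and maximum observed error is an assumption, since the paper does not spell it out.

```python
def exact_add(x, y):
    return x + y


def toy_approx_add(x, y):
    """Hypothetical 4-bit adder that drops the carry between its two 2-bit halves."""
    low = (x & 3) + (y & 3)
    high = (x >> 2) + (y >> 2)
    return (high << 2) | (low & 3)


def characterize(approx_op, exact_op, n_x, n_y, n_z, fuzz):
    """Build (t_idx, t_err) by exhaustively testing all input pairs."""
    n_sub = 1 << (n_z - fuzz)              # number of output subspaces
    total = [0] * n_sub                    # input pairs landing in each subspace
    n_err = [0] * n_sub                    # erroneous pairs in each subspace
    e_min = [0] * n_sub
    e_max = [0] * n_sub
    for x in range(1 << n_x):
        for y in range(1 << n_y):
            z = exact_op(x, y)
            i = z >> fuzz                  # subspace index: MSBs of the exact output
            e = approx_op(x, y) - z
            total[i] += 1
            if e != 0:
                if n_err[i] == 0:
                    e_min[i] = e_max[i] = e
                else:
                    e_min[i] = min(e_min[i], e)
                    e_max[i] = max(e_max[i], e)
                n_err[i] += 1
    t_idx = [0] * n_sub                    # 0 means "no error in this subspace"
    t_err = [None]                         # slot 0 is a placeholder, never read
    for i in range(n_sub):
        if n_err[i] > 0:
            t_idx[i] = len(t_err)
            # Assumed rule: a_i spans the observed range, b_i is its lower bound.
            t_err.append((e_max[i] - e_min[i], e_min[i], n_err[i] / total[i]))
    return t_idx, t_err


t_idx, t_err = characterize(toy_approx_add, exact_add, 4, 4, 5, 2)
print(t_idx)
print(t_err[1:])
```

For this toy adder the error is always -4 when it occurs, so every stored entry has a zero amplitude range and only the frequency differs between subspaces.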
C. Algorithm to simulate $x \,\hat{\diamond}\, y$
The algorithm to simulate $x \,\hat{\diamond}\, y$ is built with two precomputed tables, $T_{idx}$ and $T_{err}$. $T_{idx}$, of size $2^{N_z-F}$, is used to know whether an error may occur in $O_i$. The table $T_{err}$ stores the statistical characteristics $(a_i, b_i, f_i)$ of the different errors $\tilde{e}_i$.
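As a rough order of magnitude, consider a 16-bit adder ($N_x = N_y = 16$, hence $N_z = 17$) with a fuzziness degree $F = 8$; both values are chosen here purely as an example. The table reproducing the exact error for every input pair and the index table $T_{idx}$ then compare as:

$$
\underbrace{2^{N_x+N_y}}_{\text{table addressed by } (x,\,y)} = 2^{32} \approx 4.3 \times 10^{9} \text{ entries}
\qquad \text{versus} \qquad
\underbrace{2^{N_z-F}}_{T_{idx}} = 2^{17-8} = 512 \text{ entries}.
$$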
The computation of $x \,\hat{\diamond}\, y$ is detailed in Figure 1. First, the exact value $z = x \diamond y$ is computed. Then, the $N_z - F$ MSBs of $z$ are extracted, leading to the value $z_0$, which addresses the table $T_{idx}$. The value $z_0$ indicates which subspace $O_i$ the output value $z$ belongs to.
The value $T_{idx}[z_0]$ indicates whether $x \,\hat{\diamond}\, y$ may generate an error. If $T_{idx}[z_0]$ is equal to zero, no error is generated and the already computed exact result $z$ is used, thus avoiding any supplementary processing. Otherwise, $T_{idx}[z_0]$ gives the index $k$ used to address the second table $T_{err}$ and to retrieve the parameters $a_i$, $b_i$ and $f_i$ of the p.r.v. $\tilde{e}_i$. The error $\tilde{e}_i$ is finally generated from Equation 3.
To generate the uniform random variable $u_i$ used to compute the error $\tilde{e}_i$, the LSBs of the accurate output value $z$ are considered. As described in [12], where Widrow derived that the LSBs of a signal can be considered as a white additive random noise uncorrelated with the input signal, the LSBs of $z$ can be considered as a uniform random variable. The LSBs of $z$ are then XORed with a constant $K$ to scramble them. The result is the uniform random variable $u_i$.
The C code developed to implement the proposed approach has been optimized to spend as few cycles as possible when simulating operands that do not generate any error.
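The lookup-and-draw procedure of this subsection can be sketched in Python as follows (the authors' implementation is in C and is not reproduced here). Several elements are assumptions made for illustration: the hand-written tables `T_IDX` and `T_ERR`, the scrambling constant `SCRAMBLE_K`, the choice of taking the $F$ LSBs of $z$ and normalizing them to $[0, 1)$, and the form of the error, $p_i (a_i u_i + b_i)$ with $p_i = 1$ when $u_i < f_i$.

```python
SCRAMBLE_K = 0b10  # hypothetical scrambling constant; must fit in F bits

# Hand-written example tables for a 5-bit output and F = 2 (assumed values).
T_IDX = [0, 1, 1, 1, 1, 1, 1, 1]   # 0 means "no error in this subspace"
T_ERR = [None, (0, -4, 0.375)]     # (a_i, b_i, f_i); slot 0 is never read


def add(x, y):
    return x + y


def fnf_o_op(x, y, exact_op, t_idx, t_err, fuzz):
    """Simulate an approximate operator from the exact result plus a modeled error."""
    z = exact_op(x, y)                       # 1) exact result
    k = t_idx[z >> fuzz]                     # 2) MSBs of z select the subspace
    if k == 0:
        return z                             # fast path: no error possible here
    a_i, b_i, f_i = t_err[k]                 # 3) statistical parameters
    lsbs = z & ((1 << fuzz) - 1)             # 4) LSBs of z act as a random source
    u_i = (lsbs ^ SCRAMBLE_K) / (1 << fuzz)  #    scrambled, normalized to [0, 1)
    if u_i >= f_i:
        return z                             # Bernoulli gate: no error this time
    return z + round(a_i * u_i + b_i)        # 5) add the modeled error


print(fnf_o_op(0, 1, add, T_IDX, T_ERR, 2))
print(fnf_o_op(2, 3, add, T_IDX, T_ERR, 2))
print(fnf_o_op(3, 3, add, T_IDX, T_ERR, 2))
```

The early return on `k == 0` mirrors the fast path described above: operands whose exact result falls in an error-free subspace cost one shift and one table read beyond the native operation.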
III. EXPERIMENTAL RESULTS
The proposed approach allows a fast simulation of approximate operators to carry out the design space exploration of an application with approximate operators. To demonstrate the savings on the simulation time, the BALL simulation time is compared to the FnF$_o$ and FnF$_i$ simulation times for two operators and different input operand word-lengths. Then, the impact of the fuzziness degree $F$ on the simulation output quality is presented. Finally, the overhead in terms of computation time and memory footprint due to the approximate operator characterization phase is quantified for FnF$_o$ and FnF$_i$.
A. Simulation time savings with the FnF simulation
To demonstrate the simulation time savings offered by the FnF$_o$ simulator, the simulation time of FnF$_o$ is compared with the BALL and FnF$_i$ simulation times. The results have been obtained on an Intel i7-6700 processor with 32 GB of memory for two operators, the ACA and the AAM.
The simulation time for different input operand bit-widths for the ACA operator is presented in Figure 2. The internal logic structure of the ACA is simple to reproduce in C code since the approximation simply consists in cutting the carry-chain length of an addition. The simulation time gains are moderate due to the simplicity of the logic structure of this adder and are up to 5.5× compared to the BALL simulation. The FnF$_o$ simulator is faster than the FnF$_i$ simulator since the tables to store are much smaller, and fewer operations are needed when simulating an operation that generates no error.
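To give an idea of what a bit-level software model of such an adder looks like, here is a Python sketch of an adder whose carry chain is cut. It is a simplified model in the spirit of the ACA, where each sum bit only sees a window of `chain_len` operand bits; the exact structure of the adder in [10] may differ, and the function name and parameters are assumptions of this example.

```python
def limited_carry_add(x, y, n_bits, chain_len):
    """Bit-by-bit model of an adder whose carry chain is cut to chain_len bits.

    Sum bit i is computed from operand bits [i - chain_len + 1, i] only,
    with a zero carry-in at the bottom of that window.
    """
    result = 0
    carry_out = 0
    for i in range(n_bits):
        lo = max(0, i - chain_len + 1)        # lowest bit of the window
        width = i - lo + 1                    # number of bits in the window
        mask = (1 << width) - 1
        window_sum = ((x >> lo) & mask) + ((y >> lo) & mask)
        result |= ((window_sum >> (width - 1)) & 1) << i
        if i == n_bits - 1:                   # carry out of the top window
            carry_out = (window_sum >> width) & 1
    return result | (carry_out << n_bits)


# With a full-length chain the model is exact; with a short chain,
# long carry propagations are lost.
print(limited_carry_add(5, 2, 4, 2), 5 + 2)
print(limited_carry_add(7, 1, 4, 2), 7 + 1)
```

Even this simple model needs a loop and several shifts and masks per output bit, whereas the exact addition is a single native instruction, which is the overhead that table-based simulation aims to avoid.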
The FnF$_i$ simulator was particularly useful for the design space exploration of an algorithm with more complex operators, such as the AAM. Indeed, for the simulation of a 16-bit AAM, FnF$_i$ was 44× faster than the BALL simulation. Nevertheless, the behavior of FnF$_o$ with a multiplier is more complex, as presented in Figure 3. With the AAM, the FnF$_o$ simulation time gains increase up to 63× on a 10-bit AAM, then decrease to 37× on a 16-bit AAM. FnF$_i$ offers higher gains for multipliers whose input bit-width is greater than or equal to 11, but in both cases, the simulation time savings on a 16-bit multiplier are considerable for the design space exploration of an application.
B. Impact of F on the quality of the simulation
The lower $F$ is, the more accurate the simulation, the bigger the tables to store and the slower the simulation. $F$ thus sets a trade-off between the memory space available for the simulator, the accuracy needed to study the impact of the approximate operator on the application, and the simulation time needed to carry out the design space exploration. The simulation time depends on the number of errors generated by the original approximate operator and on $F$, which impacts the size of the tables and consequently the number of cache misses.
To study the impact of the fuzziness degree on the simulation output quality, the relative error between the normalized root mean squared error (NRMSE) of the approximation (computed with the BALL simulation) and that of the FnF$_o$ and FnF$_i$ simulations is presented in Figure 4 for two operators: the 8-bit ACA with a carry chain length cut at 4 and the 16-bit ACA with a carry chain length cut at 8. The relative error between both NRMSE values is called $\delta_{NRMSE}$ and is expressed in percent. For the 8-bit ACA, $\delta_{NRMSE}$ stays under 10% if $F$ is lower than or equal to half of the input bit-width. On the 16-bit ACA, the margin is bigger. Indeed, $\delta_{NRMSE}$ stays under 10% until $F$ is equal to 12. The supplementary error due to the proposed model is acceptable for a number of fuzzy bits $F$ between 50% and 75% of the total operator word-length. As shown in the next section, this leads to small tables for our approach.
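The metric $\delta_{NRMSE}$ can be computed along the following lines. This Python sketch is illustrative only: the paper does not state how the NRMSE is normalized, so normalizing by the range of the exact outputs is an assumption, as are the function names and the sample data.

```python
import math


def nrmse(reference, observed):
    """Root mean squared error, normalized by the range of the reference (assumed)."""
    mse = sum((r - o) ** 2 for r, o in zip(reference, observed)) / len(reference)
    return math.sqrt(mse) / (max(reference) - min(reference))


def delta_nrmse(exact, ball, fnf):
    """Relative gap, in percent, between the NRMSE seen by BALL and by FnF."""
    ref = nrmse(exact, ball)   # NRMSE of the true approximate operator
    est = nrmse(exact, fnf)    # NRMSE of the fast statistical model
    return 100.0 * abs(est - ref) / ref


# Made-up output traces: exact operator, BALL simulation, two FnF runs.
exact = [0, 10, 20, 30]
ball = [0, 10, 16, 30]
fnf_close = [0, 6, 20, 30]   # different error positions, same error energy
fnf_far = [0, 10, 12, 30]    # error twice as large

print(delta_nrmse(exact, ball, fnf_close))
print(delta_nrmse(exact, ball, fnf_far))
```

The first comparison illustrates why a statistical model can be sufficient: the FnF output does not need to err on the same samples as the real operator, only to produce the same aggregate error.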
C. Approximate operator characterization phase
Compared to the BALL simulation, a pre-processing phase is required to build the tables $T_{err}$ and $T_{idx}$ for each operator. The approximate operator error $e_i$ is characterized and the statistical characteristics of the p.r.v. $\tilde{e}_i$ are computed and stored in $T_{idx}$ and $T_{err}$. Table I provides the memory footprint needed to store the tables $T_{err}$ and $T_{idx}$ for an average value of $F$, i.e. $F = N_z/2$, as well as the time needed to build the tables. The characterization step is done once, off-line, for each new operator, and the tables used in this paper are available online to avoid this step. To reduce the overhead of this step, Monte-Carlo simulations can be used. The FnF$_o$ simulator takes longer to build by construction because the output values produced by consecutive input combinations do not necessarily belong to the same subspace, which induces poor locality in terms of memory accesses. The results are presented for an exhaustive test of all the values of $x$ and $y$. Nevertheless, for operators with a high operand word-length, an exhaustive test is not mandatory.
IV. ACKNOWLEDGMENTS
This project has received funding from the French Agence Nationale de la Recherche under grant ANR-15-CE25-0015 (ARTEFaCT project).
V. CONCLUSION
This paper presents a fast simulator for the design space exploration of an application with approximate operators. Built on the output values of the original exact operator, the proposed FnF$_o$ simulator is always faster than the FnF$_i$ simulator for the approximate adder ACA, and for the approximate multiplier AAM up to an input bit-width of 10. For the adder, the size of the tables on which the simulator is based is significantly lower than with FnF$_i$. This paper proposes a comparison to help the approximate algorithm designer choose the best simulator depending on the considered operator. As future work, a method to characterize high word-length operators with Monte-Carlo simulations will be developed to avoid exhaustively testing all the operand values.
978-1-5386-4881-0/18/$31.00 ©2018 IEEE
