A wide variety of error-tolerant applications support the use of approximate circuits that achieve power savings by introducing small errors. This paper proposes a fast, novel algorithm for the design of such circuits. The goal is to maximize power savings under a fixed error budget, using an analytical expression to optimally select the number of bits to be approximated. This algorithm outperforms uniform approximation schemes by over 30% in power savings, with negligible computational overhead.
INTRODUCTION
Approximate computing has emerged as a new and promising paradigm [1, 2] for low power design. This approach uses circuits that deliberately introduce errors to reduce power dissipation, leveraging the inherent error resilience of certain applications to produce good enough results within a specific error margin. Such applications include (but are not limited to) signal processing, multimedia, data mining and other non-safety-critical domains [3, 4] .
Approximate computing has been explored using circuit level design [5] [6] [7] [8] as well as gate-level synthesis [9] . At higher levels of design, [10] and [11] perform various high level synthesis transformations on abstract syntax trees and directed acyclic graphs (DAGs) representing a circuit, respectively, with consideration of approximate components. These methods use coarse-grained decisions to choose among a few approximation options in conjunction with other high level decisions such as scheduling and binding.
We propose SABER, an optimization framework at the register transfer level that is solely focused on the tradeoff between approximation and power, but deployed at a much finer granularity than [10] and [11]. More specifically, our optimization can continuously decide how many bits are to be approximated for each arithmetic operation in a design.
The input to our optimization framework is the dataflow graph of the circuit, represented as a DAG whose nodes correspond to arithmetic units that can potentially be approximated, and whose edges indicate the connections between these units. Our formulation maximizes the number of approximation bits in a circuit (which translates to power/area minimization) so that it uses minimal resources under a specified error budget. This work demonstrates results on fixed-point integer arithmetic operations. For convenience, in our exposition we assume operands to be integers since fractional operands can easily be scaled to integers and back again through simple shift operations. To the best of our knowledge, this is the first work on design optimization through analytical methods considering approximation at bit-level granularity. We develop an error model and a fast heuristic such that the computing cost is very low and highly scalable. This low cost provides the potential for our technique to be frequently called in the inner loop of a high level design-space exploration and optimization such as [10]. Our result serves as an initial solution for further refinement by gate-level synthesis methods [9].
For a fair comparison, instead of comparing against a no-approximation scheme (against which large improvements are easy to show), we compare our approach with methodologies where uniform approximation is used to approximate the circuit [5, 12], and demonstrate that our approach can outperform such methodologies by over 30% in power savings, for similar error specifications. The contributions of this paper are summarized as follows:
• Precharacterization: We perform gate level characterization of the error variance of multi-bit adders as a function of the number of approximated bits, starting from the least significant bit (LSB). This step is a one-time effort for a library of approximate gates.
• Error Formulation: We propose a computationally efficient and accurate framework for expressing the error variance at the output of a DAG as a function of the number of approximate LSBs within each of its nodes, and model it as a nonlinear expression.
• Design Optimization: We formulate an optimization problem to maximize the total approximation in a circuit, constrained to an error budget. Since this optimization is an integer non-linear programming problem, we propose a heuristic to solve this NP-Hard problem. We generate an accurate starting point, followed by a fast approach to obtain the final solution in a simple, analytical form.
Through our optimization routine, we determine precisely if and how each node of a DAG should be approximated and optimize circuit performance under error specifications.
ERROR CHARACTERIZATION
The key ingredient of any methodology based on approximate design is an accurate quantification of the error injected into a computation by the approximation scheme. We use the variance of this error as the error metric to be constrained within a user-specified budget. Here we obtain an analytical expression of this error variance as a function of the total approximation in a circuit.
Let us consider a circuit representing an arithmetic operation with two $N$-bit operands, $X$ and $Y$, producing an output, $Z$. An approximate implementation of the hardware unit yields the benefit of using fewer resources [1] than its exact counterpart. Typically some of the LSBs can be allowed to be erroneous, as this introduces a limited level of approximation. Hence the hardware connected to $y$ LSBs, for example, is approximate, while that connected to the $(N - y)$ most significant bits (MSBs) is accurate. Clearly, the higher the value of $y$, the greater is the power saving due to the imprecise hardware, although the error is also higher. We use the parameter, $y$, referred to as the number of approximate LSBs, to quantify the amount of approximation.
We present an approach for characterizing the error variance of a DAG whose nodes are candidates for approximation. We begin by obtaining the error variance of an adder as a function of the number of approximate LSBs, y, in the adder. Using this function, we show how we can compute the error variance of any DAG whose nodes are approximate adders. The results for the adder DAG can be generalized to DAGs whose nodes contain adders, subtractors, multipliers, and dividers since the fundamental element of these operations is an adder [13] , with shifters being implemented by appropriately routing the outputs of one DAG node to the inputs of others.
Error Precharacterization for an Adder
We consider transistor-level approximation where an $N$-bit approximate adder is implemented as an array of accurate and approximate full adders (FAs). If the error due to $y$ approximate LSBs is $e$, then $e$ can range from $-(2^y - 1)$ to $(2^y - 1)$, and its exact value depends on the inputs. Typically inputs are assumed to be uniformly distributed random variables [5, 11]. Hence $e$, being a function of these inputs, can be assumed to be a random variable as well. Let $p_x$ be the probability of $e$ to be $x$, where $x \in [-(2^y - 1), (2^y - 1)]$ is an integer owing to $y$ being an integer. The error means are negligible compared to the variance [5]. Hence we are concerned with the variance, $\sigma_e^2(y)$, given by:

$$\sigma_e^2(y) = \sum_{x=-(2^y-1)}^{2^y-1} x^2 \, p_x \qquad (1)$$
Due to the $x^2$ term in Eq. (1), $\sigma_e^2(y)$ clearly depends on $y$. If $x$ is uniformly distributed between $-(2^y - 1)$ and $(2^y - 1)$, $p_x = 1/(2^{y+1} - 1), \forall x$, and $\sigma_e^2(y) = (2^{y+1} - 1)^2/12$. We also evaluate $\sigma_e^2(y)$ for normally distributed $x$ in the later part of this section (Fig. 1). In fact, the exponential dependence on $y$ holds for most practical error distribution functions (not just uniform or normal) for $y \le N/2$, $N$ being the word-length of the adder. Additionally, using the fact that $\sigma_e^2(y)$ should be zero for $y = 0$ (no approximation implies zero error), the variance of $e$ is formulated empirically as:

$$\sigma_e^2(y) = a\,(2^{by} - 1) \qquad (2)$$
where $a$ and $b$ are constants, obtained by fitting the error variance for different $y$, through Monte Carlo simulations. In this paper we consider the specific transistor-level approximate FAs from [5] and the Lower-part-Or Adder (LOA) from [14] for our analysis. For a particular type of $N$-bit approximate adder with $y$ approximate LSBs ($N = 10$ considered here), each simulation proceeds by uniformly sampling two inputs, $X$ and $Y$, from $[0, 2^N - 1]$, to produce an approximate result, $Z$, and hence the corresponding error, $e$, can be calculated. Since $N$ is relatively small, we obtain the variance of $e$ for a particular $y$ by exhaustive simulations. This procedure is repeated for $y = 0, \cdots, N - 1$, to obtain $a$ and $b$ in Eq. (2) through regression analysis.
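The characterization loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the flow used in the paper: the adder is a simplified lower-part-OR adder with no carry from the approximate part into the exact part (a simplification of the LOA in [14]), the word length is reduced to 6 bits so that exhaustive enumeration stays fast, the fitted model is assumed to have the form $a(2^{by} - 1)$, and the helper names `simplified_loa_add`, `error_variance`, and `fit_exponential_model` are hypothetical.

```python
def simplified_loa_add(x, y, approx_bits):
    """Add x and y with the lowest `approx_bits` bits OR-ed instead of added.

    Assumption: no carry is passed from the OR-ed lower part to the exact
    upper part, which simplifies the LOA of [14].
    """
    mask = (1 << approx_bits) - 1
    lower = (x | y) & mask                                   # approximate part
    upper = ((x >> approx_bits) + (y >> approx_bits)) << approx_bits  # exact part
    return upper + lower


def error_variance(word_length, approx_bits):
    """Enumerate every input pair and return the variance of the error."""
    total, total_sq, count = 0, 0, 0
    for x in range(1 << word_length):
        for y in range(1 << word_length):
            e = simplified_loa_add(x, y, approx_bits) - (x + y)
            total += e
            total_sq += e * e
            count += 1
    mean = total / count
    return total_sq / count - mean * mean


def fit_exponential_model(ys, variances):
    """Least-squares fit of variance ~ a * (2**(b*y) - 1), scanning b on a grid."""
    best = None
    for step in range(50, 401):              # b from 0.50 to 4.00 in steps of 0.01
        b = step / 100.0
        basis = [2.0 ** (b * y) - 1.0 for y in ys]
        a = sum(f * v for f, v in zip(basis, variances)) / sum(f * f for f in basis)
        residual = sum((a * f - v) ** 2 for f, v in zip(basis, variances))
        if best is None or residual < best[0]:
            best = (residual, a, b)
    return best[1], best[2]


N = 6                                        # reduced word length (assumption)
ys = list(range(N))
variances = [error_variance(N, y) for y in ys]
a, b = fit_exponential_model(ys, variances)
print(a, b)
```

For this simplified adder the error is the negated bitwise AND of the low-order input bits, so its variance is $(4^y - 1)/16$ and the scan recovers $a = 1/16$, $b = 2$. Adders from a real library would need their own gate-level model in place of `simplified_loa_add`.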
The results are summarized in Table 1. The first column lists the type of adder studied in this work, followed by the respective values of $a$ and $b$, defined in Eq. (2), in the next two columns, respectively. The fourth and fifth columns list the adjusted $R^2$ values which refer to the goodness of the fit ($R^2 = 1$ indicates that the fitted model explains all variability) and the root mean square error (RMSE) values of the fitted curve, respectively. Both the quantities indicate that the model is a good fit for the actual data.
The simulations shown above assumed the two inputs, $X$ and $Y$, of the adder node to be statistically independent. In a general scenario, the two inputs of an adder node within a DAG may have some correlation. Furthermore their distribution may not be uniform, as assumed in the above experiment. To observe the effect of a different input distribution that is correlated, we perform 5000 Monte Carlo simulations on 10-bit adders implemented using the FAs from Table 1, first with two independent 10-bit Gaussian inputs, and then with two correlated Gaussian inputs ($\rho = 0.5$), altering the number of approximate LSBs, $y$. We compare $\sigma_e^2(y)$ for both cases with that obtained through our model, for different values of $y$, as shown in Fig. 1. In spite of the correlation among the inputs, which are also from a different distribution than what was used for precharacterization, variances of the error generated by the adder using $a$ and $b$ from Table 1 show an excellent match with those obtained from Monte Carlo simulations. Intuitively, this effect arises because the change in the distribution and correlation is more likely to affect the higher order bits, which are not approximated, and the distribution of the lower order bits is close to uniform regardless of correlation and for any reasonable distribution. Hence we consider the generated adder errors as independent random variables to compute the total error variance of a DAG (in Eq. (3)).
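The robustness check described above can be imitated with a short Monte Carlo sketch. The assumptions are ours: the approximate adder is the same simplified lower-part-OR adder (no carry from the approximate part), the Gaussian inputs are centered at mid-range with a standard deviation of one eighth of the range, and the names `lower_or_error`, `gaussian_input_pair`, and `sampled_error_variance` are hypothetical. The reference value $(4^y - 1)/16$ is the variance of this simplified adder when its low-order input bits are uniform and independent.

```python
import math
import random


def lower_or_error(x, y, approx_bits):
    """Error of an adder whose lowest `approx_bits` bits are OR-ed (no carry)."""
    mask = (1 << approx_bits) - 1
    return ((x | y) & mask) - ((x & mask) + (y & mask))


def gaussian_input_pair(rng, rho, bits=10):
    """Draw two `bits`-wide integers from Gaussians with correlation `rho`."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    w = rho * z1 + math.sqrt(1.0 - rho * rho) * z2     # corr(z1, w) = rho
    top = (1 << bits) - 1
    mid, spread = 1 << (bits - 1), 1 << (bits - 3)     # mean 512, std 128 for 10 bits
    to_int = lambda z: min(top, max(0, int(round(mid + spread * z))))
    return to_int(z1), to_int(w)


def sampled_error_variance(approx_bits, rho, samples=5000, seed=0):
    """Sample variance of the adder error over `samples` random input pairs."""
    rng = random.Random(seed)
    errors = [lower_or_error(*gaussian_input_pair(rng, rho), approx_bits)
              for _ in range(samples)]
    mean = sum(errors) / samples
    return sum((e - mean) ** 2 for e in errors) / samples


for rho in (0.0, 0.5):
    print(rho, sampled_error_variance(4, rho), (4 ** 4 - 1) / 16)
```

Because only the low-order bits enter the error, the sampled variance should stay close to the uniform-input reference for both correlation settings, which is the intuition given in the text.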
Error Computation of a DAG
Let us consider a DAG consisting of adders and multipliers as shown in Fig. 2(a) with multiple primary inputs (PIs) and outputs (POs). A pair of adder and multiplier from Fig. 2(a) is highlighted in Fig. 2(b) to depict the implementation of the multiplier by add and shift operations. Overall, the DAG has $T$ nodes, each representing an adder, and each edge is associated with a shift operation, denoted by <<. Each node, $n_i$, is indexed by the subscript, $i \in [1, T]$, and the fanout of $n_i$ is represented by $F_i$. Each of the $F_i$ fanout edges of $n_i$ is associated with a weight resulting from a shift operation of $s_{ij}$ bits, such that $j$ ranges from 1 to $F_i$. This nomenclature is depicted in Fig. 2(b).
Let the number of approximate LSBs in $n_i$ be represented by $y_i$, hence the generated error variance in $n_i$ is $\sigma_e^2(y_i)$, and is obtained by substituting $y = y_i$ in Eq. (2). The error generation among different approximate operations can be assumed to be independent for all practical purposes. However, error propagation exhibits structural correlation since the approximation in $n_i$ not only affects its immediate fanouts, but also those in its fanout cone, through the edges (and the associated weights) connecting $n_i$ to a PO via the transitive fanouts. We use the error sensitivity, $\lambda_i$, of $n_i$, to a PO, to capture the structural correlations within the DAG. For a single PO, if an error, $e$, at $n_i$ results in an error, $E_i$, at the PO, then $\lambda_i = E_i/e$ is computed by a depth-first search of the DAG [11]. The total error variance, $\sigma_t^2$, which is a commonly used error metric [8] in the DAG, is obtained as:
where $\lambda_i$ is alternatively called the value of the node, $n_i$. When there are multiple POs in a DAG, to minimize error on all the POs, we simply add a dummy (but accurate) adder node with all the POs. This node is not a part of the design but conceptually indicates the summation of error variances of all nodes to compute the total error variance, $\sigma_t^2$, for a multioutput circuit. The addition of this dummy node is a simple device that enables the depth-first traversal of the DAG to compute the values of the real nodes.
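The depth-first computation of node values, including the dummy output node, might look like the sketch below. It assumes that an edge with a shift of $s$ bits scales a propagated error by $2^s$ and that the sensitivity of a node is the sum, over its fanout edges, of the edge gain times the sensitivity of the successor. The graph, the node names, and the function `node_sensitivities` are hypothetical.

```python
def node_sensitivities(fanout, output_node):
    """Return the gain from an error at each node to `output_node`.

    `fanout[n]` is a list of (successor, shift_bits) edges. Assumption: an
    edge with a shift of s bits scales a propagated error by 2**s.
    """
    memo = {output_node: 1}

    def gain(node):
        # Memoized depth-first search: each node is expanded only once.
        if node not in memo:
            memo[node] = sum((1 << shift) * gain(succ)
                             for succ, shift in fanout.get(node, []))
        return memo[node]

    return {node: gain(node) for node in fanout}


# Two primary outputs (n2, n3) feed an accurate dummy node, "sum".
fanout = {
    "n0": [("n1", 2)],
    "n1": [("n2", 0), ("n3", 1)],
    "n2": [("sum", 0)],
    "n3": [("sum", 0)],
    "sum": [],
}
sensitivities = node_sensitivities(fanout, "sum")
print(sensitivities)
```

Memoization keeps the traversal linear in the number of edges, even when many paths share the same fanout cone.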
OPTIMIZATION THROUGH SABER
In this section, we outline the SABER algorithm, which yields the number of approximate LSBs in each node of the DAG, maximizing the total power and area savings, while satisfying a specified error budget.
We first explain our proposed optimization problem, which is NP-Hard, and obtain a feasible solution by relaxing some of the constraints, to make it tractable. Next we propose a heuristic to solve the original problem, using the solution of the relaxed problem.
The Optimization Problem
Let us consider a DAG with $T$ adder nodes. The power savings increase with increasing levels of approximation in the DAG, and all components of the power savings (dynamic and leakage) are proportional to the number of approximate bits, i.e., the number of approximate FAs. For adders, the proportionality of power savings to the number of approximate FAs has been empirically observed in previous work ([5] and [12]). Even for a multiplier node, the linear proportionality holds true, because as discussed in Sec. 2.2, we decompose it into its constituent FAs, and hence the power savings are linear in the number of FAs in the decomposed graph. Therefore, the total number of approximate LSBs in the DAG, $\sum_{i=1}^{T} y_i$, is a good surrogate objective function that captures the essential trend of power savings, which we aim to maximize. The total error variance, $\sigma_t^2$, accumulated as a result of this approximation is given by Eq. (3). If the specified error variance budget is $m$, then $\sigma_t^2$ must be less than $m$. We thus formulate the optimization problem as:

$$\max_{y_1, \ldots, y_T} \; \sum_{i=1}^{T} y_i \quad \text{subject to} \quad \sigma_t^2 \le m, \quad y_i \in \mathbb{Z}^+, \; i = 1, \ldots, T \qquad (4)$$
where $\mathbb{Z}^+$ represents the set of non-negative integers. The constraint, $y_i \in \mathbb{Z}^+$, arises because the number of approximate LSBs in a node cannot be negative or fractional. A feasible solution to the problem, which satisfies the error budget, always exists: it is the zero approximation solution. However, generating the optimal solution is NP-Hard since (4) is an integer non-linear problem. Hence we relax the optimization problem, to make it tractable, and obtain a feasible solution. For this, we first remove the constraint, $y_i \in \mathbb{Z}^+$, in (4), and then convert the inequality constraint into an equality, since the optimal solution for the new maximization problem will lie on the constraint surface. We obtain the solution through Theorem 1.
Theorem 1 If the relaxed optimization from (4) is,
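with $\lambda_i$ being the error sensitivity used in Eq. (3), then the solution, $\tilde{y}_i$, is obtained as:

$$\tilde{y}_i = Y - \frac{1}{b} \log_2 \lambda_i \qquad (6)$$

where $Y$ is defined in Eq. (7).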
Proof: We rewrite the optimization problem from (5) as:
Hence using Eq. (9), we obtain y1 as:
where
We rewrite the objective, $S = \sum_{i=1}^{T} y_i$, by substituting $y_1$ from Eq. (10) in Eq. (8). To maximize $S$, we set $\partial S / \partial y_i = 0$.
Substituting 2 by $i$ in Eq. (11) and simplifying, we obtain:
We obtain the result by substituting $\theta_1$ in Eq. (12). $\Box$
Next we impose the constraint that the $y_i$ be integers. Since the $\tilde{y}_i$ from Eq. (6) are not guaranteed to be integers, we use Lemma 1 to obtain a feasible solution.
Lemma 1 A feasible solution of (4) is given by $\lfloor \tilde{y}_i \rfloor$, where $\lfloor \cdot \rfloor$ represents the floor function, and $\tilde{y}_i$ is the optimal solution of the relaxed problem, (5), and is defined in Eq. (6).
Proof: The left hand side (LHS) of the constraint in (4) is a monotonically increasing function of the state variables. Since $\lfloor \tilde{y}_i \rfloor \le \tilde{y}_i$, if $\tilde{y}_i$ is a feasible solution (i.e., ensures the LHS to be less than $m$), then so is $\lfloor \tilde{y}_i \rfloor$. $\Box$
Since the solution from Lemma 1 may be suboptimal, or even negative, we propose heuristics to address these issues.
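The relaxed solution and the floor step of Lemma 1 can be checked numerically with the sketch below. It rests on two assumptions that are ours: the per-node variance follows the exponential model $a(2^{by} - 1)$, and the total variance is the value-weighted sum $\sum_i \lambda_i \, a(2^{b y_i} - 1)$. Under those assumptions a Lagrange-multiplier argument gives $\tilde{y}_i = Y - \frac{1}{b}\log_2 \lambda_i$ with $Y = \frac{1}{b}\log_2\big((m/a + \sum_j \lambda_j)/T\big)$. The node values, the constants, and the function names are hypothetical.

```python
import math


def error_variance(values, ys, a, b):
    """Assumed constraint function: sum_i value_i * a * (2**(b*y_i) - 1)."""
    return sum(v * a * (2.0 ** (b * y) - 1.0) for v, y in zip(values, ys))


def relaxed_solution(values, budget, a, b):
    """Real-valued maximizer of sum(y_i) on the constraint surface (assumed model)."""
    T = len(values)
    Y = math.log2((budget / a + sum(values)) / T) / b
    return [Y - math.log2(v) / b for v in values]


values = [1.0, 2.0, 4.0, 8.0]            # hypothetical node values
a, b, budget = 0.0625, 2.0, 1000.0       # hypothetical model constants and budget

y_relaxed = relaxed_solution(values, budget, a, b)
y_floor = [math.floor(y) for y in y_relaxed]      # Lemma 1: floor stays feasible

print(y_relaxed, error_variance(values, y_relaxed, a, b))
print(y_floor, error_variance(values, y_floor, a, b))
```

The relaxed point meets the budget with equality, while the floored point falls strictly inside it, which is the slack that the heuristics in the next subsection try to recover.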
Heuristics to Solve the Original Problem
We attempt to obtain the number of approximate LSBs, $y_i$, in node, $n_i$, of the DAG in Fig. 2, by pushing the $\lfloor \tilde{y}_i \rfloor$ values towards the constraint surface while ensuring non-negativity.
For this, we first define $X = \lfloor Y \rfloor$ (with $Y$ defined in Eq. (7)), so that each node now has $X - \frac{1}{b} \log_2 \lambda_i$ approximate LSBs. This expression arises out of Eq. (6), where we apply the floor function only to a part of the solution, $\tilde{y}_i$. Next we use Theorem 2 to obtain a parameter, $K$, denoting the number of nodes to which we add one more approximate LSB while satisfying the error constraints, thus further increasing the objective function in (4) while keeping the solution feasible.
Theorem 2 If each node, $n_i$, has $(X - \frac{1}{b} \log_2 \lambda_i)$ approximate LSBs, where $X = \lfloor Y \rfloor$, and $Y$ is defined in Eq. (7), then the number of first $K$ nodes to which one more approximate LSB can be added to satisfy the error constraint, $m$, is given by:
where $T$ is the total number of adder nodes, and $K < T$.
Proof: Comparing the impact of adding one more approximate LSB to two nodes, $n_1$ and $n_2$, with values, $\lambda_1$ and $\lambda_2$, respectively, where $\lambda_1 < \lambda_2$, we observe that $n_1$ introduces lower error in the DAG compared to $n_2$. In other words, for the same increase in approximation, the total error incorporated is lower if we start increasing the number of approximate LSBs in the nodes in the order of their increasing values. Hence we renumber the nodes in increasing order of the values in Theorem 2, so that $\hat{y}_i = \lfloor X - \frac{1}{b} \log_2 \lambda_i + 1 \rfloor$ for the first $K$ nodes, while for the rest, $\hat{y}_i = \lfloor X - \frac{1}{b} \log_2 \lambda_i \rfloor$.
The new error variance with the increased number of approximate LSBs should satisfy the error constraint, m, in (5), such that,
Expanding the right hand side of Eq. (15), we obtain:
Hence $K$ is obtained by simplifying the above equation. Additionally, $K < T$, since starting with $\lfloor \tilde{y}_i \rfloor$, we can never increase all the $\lfloor \tilde{y}_i \rfloor$ values by one and remain in the feasible region, because such an increment will lead to $\lceil \tilde{y}_i \rceil$ ($\lceil \cdot \rceil$ being the ceiling function), and if it were in the feasible region, Theorem 1 would have chosen that solution over $\tilde{y}_i$. $\Box$
This analysis holds true even if some of the $\hat{y}_i$ values are negative. However, since a negative $\hat{y}_i$ does not have any physical meaning, we need to reassign these values to zero, which in turn may violate the error constraint in (4). To address this non-trivial issue we propose Algorithm 1 to obtain non-negative approximate LSBs while satisfying the error budget, $m$. Algorithm 1 first resets the negative $\hat{y}_i$ values to zero, and then, by decreasing the other $\hat{y}_i$ values, computes the error to be compensated for. This compensation is performed by reducing $\hat{y}_i$ of the nodes corresponding to the highest $\lambda_i$ value as much as possible, before proceeding to the next with the second highest value, and so on. The complexity of our algorithm is dominated by the sorting of nodes in terms of their values, so that the overall complexity is $O(T \log T)$.
Algorithm 1 Algorithm to ensure non-negativity of $\hat{y}_i$.
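The heuristic and the non-negativity repair described in this subsection can be sketched as below. Several things here are assumptions rather than the paper's procedure: the variance model is the value-weighted exponential form $\sum_i \lambda_i \, a(2^{b y_i} - 1)$, the number of incremented nodes is found by a direct budget check instead of the closed-form expression for $K$, and the repair loop follows only the prose description of Algorithm 1. The function names and example values are hypothetical.

```python
import math


def error_variance(values, ys, a, b):
    """Assumed constraint function: sum_i value_i * a * (2**(b*y_i) - 1)."""
    return sum(v * a * (2.0 ** (b * y) - 1.0) for v, y in zip(values, ys))


def integer_allocation(values, budget, a, b):
    """Integer, non-negative approximate-LSB counts that respect the budget."""
    T = len(values)
    Y = math.log2((budget / a + sum(values)) / T) / b
    X = math.floor(Y)
    ys = [math.floor(X - math.log2(v) / b) for v in values]

    # Step 1: add one LSB to nodes in increasing order of value while feasible
    # (direct budget check, used here in place of the closed form for K).
    order = sorted(range(T), key=lambda i: values[i])
    for i in order:
        ys[i] += 1
        if error_variance(values, ys, a, b) > budget:
            ys[i] -= 1
            break

    # Step 2: negative counts have no physical meaning, so reset them to zero.
    ys = [max(0, y) for y in ys]

    # Step 3: compensate for the reset by reducing the highest-value nodes first.
    for i in reversed(order):
        while ys[i] > 0 and error_variance(values, ys, a, b) > budget:
            ys[i] -= 1
    return ys


print(integer_allocation([1.0, 2.0, 4.0, 8.0], 1000.0, 0.0625, 2.0))
print(integer_allocation([1.0, 1000000.0], 10.0, 0.0625, 2.0))
```

This sketch recomputes the full variance at every step for clarity, so it does not reach the $O(T \log T)$ bound quoted above; an incremental update of the running variance would be needed for that.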
EXPERIMENTAL RESULTS
We implement SABER in MATLAB R2015b on a 64-bit Ubuntu server with a 3GHz Intel ® Core™2 Duo CPU E8400 processor. We consider the appx5 approximate adder which uses transistor-level approximation [5] for our analysis, and consider two examples to demonstrate SABER.
Optimization Results on an Example DAG
First we demonstrate the results of using SABER through an example structure, DAG10, with ten nodes, as shown in Fig. 3. Each node, $n_i$, can be represented by a 20-bit adder, where the approximation is introduced by replacing $y_i$ LSB FAs with approximate FAs. A dummy node has been added as explained in Sec. 2.2, to obtain the error sensitivity (i.e., the value) of the other nodes to the output. Each $\lambda_i$ is obtained by finding the number of paths from $n_i$ through a depth-first search of the DAG considering the edge weights, and is depicted in Fig. 3.
We demonstrate our results for three error variance budgets, m = 1K, 10K, 100K, corresponding to the allowable error variance at the output of the dummy node. For comparison purposes, we consider the commonly-used uniform approximation case (e.g., in [5, 12] ), when the number of approximate bits in each node is identical.
Using SABER, we compute the number of approximate LSBs in each of the ten nodes, whose distributions among the nodes are depicted in Fig. 4 for the three error variance budgets, $m$. Since the uniform approximation results in the same number of approximate LSBs in each node, it is depicted by the stem plot, with a single stem of height ten, the rest being zero, in the same figure. Due to the dependence of the number of approximate LSBs in a node on its value through Eq. (6), the distributions in Fig. 4 are also determined by the distribution of values. As $m$ increases, more approximation is possible, and this is seen by the rightward horizontal shift in the bar charts. The target, $m$, and the actual error variance, $\sigma_t^2$, arising out of the three approximate configurations, are listed in the first two columns of Table 2. The difference between them arises due to the relaxation of the original problem in (4), and application of the proposed heuristics to make our solution feasible. However, this difference is less than 8% for all three cases, indicating the effectiveness of our methodology. The total number of approximate LSBs by SABER and the uniform approximation case are listed in the next two columns of Table 2, indicating that SABER clearly outperforms the uniform approximation for DAG10, from which power savings are calculated and listed in the last two columns. The proportionality constant between the number of approximate FAs and percentage power savings is 0.5%, as evident from the last four columns of the table, since each adder node is 20-bit, and 10 such adder nodes in the DAG lead to a total of $1/200 \times 100 = 0.5\%$ approximate FAs for each approximate LSB within the DAG. Since we use the appx5 version of FA for approximation, the power savings for the entire DAG is also scaled by the same factor (0.5%) as appx5 has negligible power consumption compared to the exact FA [5].
Hence approximating k LSB FAs in the DAG leads to 0.5k% power savings, which is accurate to a first order to indicate the advantage of using SABER to maximize power savings over the uniform approximation, without performing logic synthesis.
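The first-order power estimate and the uniform baseline can be written as a short sketch. The power formula follows the arithmetic in the text (one approximate FA out of 10 x 20 = 200 corresponds to 0.5%). The uniform baseline assumes the value-weighted exponential variance model $\sum_i \lambda_i \, a(2^{by} - 1)$, which is our assumption, and the function names and example values are hypothetical.

```python
def power_savings_percent(total_approx_bits, num_nodes, word_length):
    """First-order estimate: each approximate FA saves 100/(nodes*bits) percent."""
    return 100.0 * total_approx_bits / (num_nodes * word_length)


def uniform_approximation(values, budget, a, b, word_length):
    """Largest identical y for all nodes that keeps the assumed variance in budget."""
    y = 0
    while y < word_length:
        trial = sum(v * a * (2.0 ** (b * (y + 1)) - 1.0) for v in values)
        if trial > budget:
            break
        y += 1
    return y


# One approximate LSB in a DAG of ten 20-bit adders.
print(power_savings_percent(1, 10, 20))

# Hypothetical four-node example with the same constants for every node.
y_uniform = uniform_approximation([1.0, 2.0, 4.0, 8.0], 1000.0, 0.0625, 2.0, 20)
print(y_uniform, power_savings_percent(4 * y_uniform, 4, 20))
```

The estimate ignores any residual power of the approximate FA, in line with the statement that appx5 has negligible power consumption compared to the exact FA.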
Optimization Results on FIR Filters
We evaluate our algorithm on a real-world example, by checking the sound quality of filtered signals from an approximate finite impulse response (FIR) filter. The results of filtering by approximate filters designed through SABER have been summarized within a compressed folder and uploaded to http://conservancy.umn.edu/handle/11299/185544. The signals under study comprise 150K samples of eight different genres of audio clips [15] sampled at their prespecified frequency of 22.05 kHz [16] and mixed with a high frequency noise. We constrain the signal to noise ratio (SNR) degradation between an exact filter and an approximate filter to be 50dB, to ensure comfortable loudness and clarity. The normalized pass band and stop band frequencies of the FIR filter are 0.50 and 0.65, respectively, and the minimum-order filter that MATLAB generated had order 33.
The filter coefficients have been scaled by 1024 to facilitate integer arithmetic. All adders have a word length of 20 bits, and the multiplications are implemented by array multipliers with add and shift operations. Since the coefficients are symmetric in an order-$N$ FIR filter, we can reuse multipliers [17], resulting in only $\lceil N/2 \rceil$ multipliers and $N$ adders, as shown in Fig. 5. In our order-33 FIR filter, the first 16 coefficients could be implemented simply by 30 adders (and shifters) based on their binary decomposition. Additionally the filter requires 33 adders. Hence the resulting DAG for the optimization problem in (4) has $T = 63$ nodes. To formulate the optimization problem, we need the error variance budget for each audio clip. Since the different genres of music are differently sensitive to approximation in the FIR filter, we select the respective error budgets from the tradeoff plot of SNR degradation versus error variance for each clip as depicted in Fig. 6. We obtain this plot by sweeping the error variance, $m$, to first obtain different configurations of approximate filters using SABER. We then filter the noisy signal using each such filter and compute the SNR degradation from the accurately filtered signal. Using this plot, we can select the target error variance, $m$, for various target SNR degradation values, an example of which is shown for 50dB by the line in Fig. 6. This plot can be generated very quickly since SABER takes less than a second to generate one configuration of the approximate filter.
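The coefficient scaling and the add-and-shift decomposition of the constant multipliers can be illustrated as follows. This is a sketch under assumptions: it uses a plain binary decomposition, in which a constant with $p$ nonzero bits needs $p - 1$ adders, and the two coefficient values and all function names are hypothetical rather than taken from the order-33 filter.

```python
def scale_coefficients(coeffs, scale=1024):
    """Scale real-valued coefficients to integers (1024 = 2**10, as in the text)."""
    return [int(round(c * scale)) for c in coeffs]


def shift_add_terms(constant):
    """Positions of the set bits of |constant|: x*|constant| = sum(x << s)."""
    magnitude = abs(constant)
    return [s for s in range(magnitude.bit_length()) if (magnitude >> s) & 1]


def adders_for_constant(constant):
    """A constant with p nonzero bits needs p - 1 adders (none for 0 or 2**k)."""
    return max(0, len(shift_add_terms(constant)) - 1)


def multiply_by_shift_add(x, constant):
    """Multiply x by an integer constant using only shifts and additions."""
    product = sum(x << s for s in shift_add_terms(constant))
    return -product if constant < 0 else product


coeffs = [0.25, -0.0107]                      # hypothetical filter taps
scaled = scale_coefficients(coeffs)
print(scaled, [adders_for_constant(c) for c in scaled])
print(multiply_by_shift_add(7, scaled[1]))
```

Summing `adders_for_constant` over the distinct coefficients gives the number of adder nodes contributed by the multipliers, to which the adders of the filter's accumulation chain are added to obtain $T$.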
For demonstration purposes, we select three values of $m$ ($m$ = 100K, 200K, 400K) from Fig. 6, to obtain three different approximate filters. The number of approximate LSBs in each node of the FIR filter for each $m$ is then obtained by SABER, whose distribution among the 63 adder nodes is depicted in Fig. 7. The stem plots depicting the corresponding uniform approximation case are also provided in the same figures as reference.
Figure 7: Distribution of the number of approximate LSBs over the 63 nodes of the FIR filter. Panel titles: error budget, m = 200K; error budget, m = 400K.
The power requirement of the approximate filters as a fraction of that of the accurate filter is listed in Table 3 . The second column denotes the values corresponding to the filter obtained by SABER with m = 100K, and by uniform approximation in the first two rows, respectively. The last row shows the percentage improvement of our method over the uniform approximation case for this m. Clearly, our algorithm not only achieves over 27% power savings over the exact implementation, but also outperforms the uniform approximation case by over 30%. These numbers increase as m is increased as seen from the last two columns of the table. Obtaining the solution through SABER can thus lead to significant power savings over the existing methodologies. The resulting SNR degradations for the eight audio clips while using the three filters are summarized in Table 4 . The audio clips are listed in the first row, and the SNR degradations for the error variance budgets, m= 100K, 200K, 400K, are listed in the next three rows, respectively. The values are all around 50dB for all m, indicating that the user experience is not compromised in spite of the approximations in the filter, which can be verified by playing the audio clips from http://conservancy.umn.edu/handle/11299/185544. For each clip, the site contains the noisy version, the exact filtered version, and filtered versions corresponding to the three error variance budgets, m, respectively.
CONCLUSION
We have proposed a bit-level optimization framework to design approximate circuits under specified error budgets, built upon an analytical expression for the number of approximate LSBs for each computational unit. The runtimes to obtain an approximate configuration of a DAG are shown to be very small due to the closed-form solution, and the framework outperforms conventional uniform approximation methods by over 30% in power savings.
