Abstract-Relaxing the constraint of 100% accuracy in a datapath can provide the freedom to create designs with better performance or energy efficiency. This paper develops probabilistic models that enable us to explore these trade-offs for key arithmetic primitives. We show that, because specific input patterns are required to cause timing violations and these patterns arise rarely, a lower expected error can be attained by allowing some timing violations to occur instead of reducing the precision of a circuit to meet a target latency. Experiments show that a mean reduction of 5.6× to 36.7× in error expectation and an improvement of 7.2 dB to 19.7 dB in signal-to-noise ratio can be obtained for practical applications.
I. INTRODUCTION
The continuous scaling of CMOS technology places progressively more stringent requirements on circuit performance. Recently we have seen that tuning a circuit to meet a desired accuracy criterion can help to meet these demands, because in general it is rarely possible to create a completely error-free design: any fixed- or floating-point representation introduces quantisation errors. Since a simplified number representation can lead to performance benefits, there has been significant recent research into optimising the precision throughout a datapath to meet a specific accuracy requirement [1].
However, the choice of precision is not the only source of error in a datapath, and parallel streams of research have explored alternative methods to trade accuracy for performance or energy efficiency. Ernst et al. introduced a voltage overscaling technique which monitored the error rate using additional circuitry when overscaling the supply voltage [2]. This work demonstrated that greater power efficiency could be obtained because errors occurred rarely, which allowed the supply voltage to be overscaled. Similarly, a non-uniform voltage scaling technique was discussed in [3], which analysed the link between the probability of output correctness and energy saving. While the voltage scaling literature takes advantage of the fact that only specific input patterns cause timing errors, research on imprecise architectures takes this one step further by exploiting the same fact to design a simplified circuit. This includes a simplified multiplier unit [4], whilst the authors in [5] discussed the link between accuracy and clock frequency for approximation circuits which employed a simplified datapath to mimic and speculate the original logic functions. Alternatively, an adder architecture was described in [6] which provided the flexibility to trade accuracy for energy, either by utilising only part of the adder or by enabling additional circuitry to correct errors when high accuracy is required.
In this paper, we attempt to bring these two strands of research together by evaluating the probabilistic behavior of basic arithmetic primitives with different datapath precisions when operating beyond the deterministic region. Although pipelining could be used to increase operating frequency, this technique is typically employed to meet throughput constraints and does not reduce circuit latency. Therefore, under a latency constraint, we use truncation as the more appropriate baseline against which to compare our proposed alternative. We initially present probabilistic models of the errors generated in this process, and subsequently we test this design methodology on real DSP examples. Both our models and experimental results demonstrate that performance benefits can be achieved in comparison to the traditional situation where the target latency is limited by the choice of precision.
The contributions of this paper are:
• the first combination of datapath function with overclocking in a holistic framework to trade accuracy for latency,
• probabilistic models of errors resulting from overclocking arithmetic primitives under different frequencies,
• analytical and empirical results which demonstrate that allowing rare timing violations to occur results in less error than truncating a datapath to meet timing.
II. RIPPLE CARRY ADDER

A. Generation of Overclocking Error
Adders serve as a key building block for arithmetic operations, and other major arithmetic operators such as multipliers can be implemented using adders. Since the ripple carry adder (RCA) is the most widely used of the many adder structures, particularly in FPGA technology, we exemplify our approach with the analysis of an RCA. In this work, we analyse the errors originating from two scenarios. The first is a traditional circuit design approach where operations occur without timing violations. To this end, the word-length of the input signal is truncated in order to meet the timing requirement. This process results in truncation or roundoff error. In our proposed new scenario, circuits are implemented with greater word-length, but are clocked beyond the safe region so that timing violations sometimes occur. This process generates "overclocking error".
An $n$-bit RCA is composed of $n$ serially connected full adders (FAs), as shown in Fig. 1. Typically the critical path delay of the RCA ($\mu_{RCA}$) is determined by the longest carry propagation. We assume that the carry propagation delay ($\mu_i$, $0 \le i < n$) of each FA is a constant value $\mu$, and hence $\mu_{RCA} = n\mu$. For an $n$-bit RCA, if the sampling period $T_S$ is greater than $\mu_{RCA}$, correct results will be sampled. If, however, $T_S < \mu_{RCA}$, intermediate results will be sampled, potentially generating overclocking error. For a given $T_S$, the maximum length of error-free carry propagation, $b$, is described by (1), where $f_S$ denotes the sampling frequency.
However, since the length of an actual carry chain during execution is dependent upon input patterns, in general, the worst case may occur rarely. As a result, in order to determine when this timing constraint is not met, and the size of the error in this case, we make use of standard results which decide carry generation, propagation and annihilation, as well as corresponding summation results [7] :
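• If $a_i = b_i = 1$, a carry chain is generated at bit $i$, and $s_i = c_{i-1}$;
• If $a_i \neq b_i$, the carry propagates for this chain at bit $i$, and $s_i = 0$;
• If $a_i = b_i$, the current carry chain annihilates at bit $i$, and $s_i = 1$.
Here $a_i$ and $b_i$ denote bit $i$ of the two adder inputs $A$ and $B$, $s_i$ denotes the corresponding sum bit, and $c_{i-1}$ denotes the carry into bit $i$. We then model the errors assuming that all bits in $A$ and $B$ are mutually independent and uniformly distributed. However, we relax this assumption in Section IV, where the predictions are verified using real data.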
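The three rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the function names `classify_bits` and `carry_chains` are hypothetical, and the chain length is assumed to count every bit from the generating bit up to and including the annihilating bit (with bit $n$ standing for the carry-out).

```python
def classify_bits(a: int, b: int, n: int) -> list:
    """Label each bit position of an n-bit addition a + b.

    'G': a_i = b_i = 1, a carry chain is generated
    'P': a_i != b_i, an incoming carry propagates
    'A': a_i = b_i = 0, an incoming carry is annihilated and none is generated
    """
    labels = []
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai and bi:
            labels.append("G")
        elif ai != bi:
            labels.append("P")
        else:
            labels.append("A")
    return labels


def carry_chains(a: int, b: int, n: int) -> list:
    """Return (d, k) for every carry chain: generated at bit d, spanning k bits.

    The chain ends at the first bit above d where a_i = b_i, or at bit n
    (the carry-out) if it propagates through every remaining bit.
    """
    chains = []
    for d in range(n):
        if (a >> d) & 1 and (b >> d) & 1:
            end = d + 1
            while end < n and ((a >> end) & 1) != ((b >> end) & 1):
                end += 1
            chains.append((d, end - d + 1))
    return chains


print(classify_bits(0b0111, 0b0001, 4))  # ['G', 'P', 'P', 'A']
print(carry_chains(0b0111, 0b0001, 4))   # [(0, 4)]
```

In the example, 7 + 1 generates a carry at bit 0 that propagates through bits 1 and 2 and is annihilated at bit 3, which is the long-chain pattern that an overclocked adder may fail to resolve in time.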
1) Absolute Value of Overclocking Error:
For an $n$-bit RCA, let $C_{d,k}$ denote the carry chain generated at bit $d$ with a length of $k$ bits. For a given sampling frequency $f_S$, the maximum length of error-free carry propagation, $b$, is determined through (1). The presence of overclocking error requires $k > b$. Since the length of a carry chain is limited by the word-length $n$, the parameters $d$ and $k$ are bounded by (2) and (3):
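For $C_{d,k}$, the carry chain generated at bit $d$ with a length of $k$ bits, correct results will be generated from bit $d$ to bit $d + b - 1$, where $b$ is the maximum length of error-free carry propagation. Hence the absolute value of the error seen at the output, normalized to the MSB ($2^n$), is given by (4), where $\hat{s}_i$ and $s_i$ denote the actual and error-free output of bit $i$ respectively.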
We can calculate $s_i$ and $\hat{s}_i$ using the rules from the previous discussion. In the error-free case, the carry will propagate from bit $d$ to bit $d + k - 1$, and we know that $s_{d+b} = s_{d+b+1} = \cdots = s_{d+k-2} = 0$ for carry propagation, and $s_{d+k-1} = 1$ for carry annihilation. However, when a timing violation occurs, the carry will not propagate through all these bits; instead, $\hat{s}_{d+b} = \hat{s}_{d+b+1} = \cdots = \hat{s}_{d+k-2} = 1$ and $\hat{s}_{d+k-1} = 0$. Note that we assume all internal bits are set to 0 initially. Substituting these values into (4) yields (5), from which, interestingly, we see that the value of the overclocking error has no dependence on the length of the carry chain $k$.
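The independence from chain length can be checked with a short worked calculation. The sketch below is an assumption-laden illustration rather than a reproduction of (4) and (5): it takes a chain generated at bit `d` with length `k`, assumes only the first `b` bits of the chain settle before sampling, and fills in the remaining sum bits exactly as described in the paragraph above. The function name and the unnormalised result are choices made here for clarity.

```python
def overclocking_error(d: int, k: int, b: int) -> int:
    """Unnormalised |sampled - exact| for one carry chain.

    d: bit where the chain is generated
    k: chain length in bits (the chain ends at bit d + k - 1)
    b: assumed maximum number of chain bits that settle before sampling
    """
    if k <= b:
        return 0  # the whole chain settles in time, so there is no error
    sampled = 0
    # Bits d+b .. d+k-2 should read 0 (carry propagated) but still read 1.
    for i in range(d + b, d + k - 1):
        sampled |= 1 << i
    # Bit d+k-1 should read 1 (carry annihilated) but still reads 0.
    exact = 1 << (d + k - 1)
    return abs(sampled - exact)


# The error depends on d and b only, not on the chain length k.
for k in range(4, 9):
    print(k, overclocking_error(d=1, k=k, b=3))  # always 16 = 2**(1 + 3)
```

Under these assumptions the magnitude is $2^{d+b}$ for every $k > b$, or $2^{d+b-n}$ once normalised to the MSB weight $2^n$.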
2) Probability of Overclocking Error: The carry chain $C_{d,k}$ occurs when a carry is generated at bit $d$, annihilated at bit $d + k - 1$, and propagated in between. Consequently, its probability is given by (6). Under the assumption that the inputs $A$ and $B$ are mutually independent and uniformly distributed, we have $P(a_i = b_i = 1) = 1/4$ and $P(a_i \neq b_i) = P(a_i = b_i) = 1/2$, so the probability of $C_{d,k}$ can be obtained by (7). Note that (7) takes into account that carry annihilation always occurs when $d + k - 1 = n$.
3) Expectation of Overclocking Error: The expectation of overclocking error can be expressed by (8). Using the error magnitude and the carry chain probability from (5) and (7) respectively, it can be obtained by (9).
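As a sanity check on the model, the expectation can be evaluated numerically by summing probability times error over all admissible chains. The equations (2), (3), (5), (7) and (9) are not reproduced in this text, so everything below is a reconstruction from the prose and should be read as an assumption: chains start at $0 \le d \le n - b$, have length $b < k \le n + 1 - d$, occur with probability $(1/4)(1/2)^{k-2}(1/2)$ (without the final factor when the chain reaches the carry-out), and each contributes a normalised error of $2^{d+b-n}$. The closed form $2^{-b} - 2^{-(n+1)}$ is what this reconstruction yields, not a quoted result.

```python
from fractions import Fraction


def chain_probability(d: int, k: int, n: int) -> Fraction:
    """P(chain generated at bit d with length k) for independent uniform bits."""
    generate = Fraction(1, 4)
    propagate = Fraction(1, 2) ** (k - 2)
    if d + k - 1 == n:
        # The chain runs into the carry-out, so annihilation is certain.
        return generate * propagate
    return generate * propagate * Fraction(1, 2)


def expected_overclocking_error(n: int, b: int) -> Fraction:
    """Sum probability * normalised error over all chains longer than b (b >= 1)."""
    total = Fraction(0)
    for d in range(0, n - b + 1):
        for k in range(b + 1, n + 1 - d + 1):
            total += chain_probability(d, k, n) * Fraction(2 ** (d + b), 2 ** n)
    return total


n, b = 8, 3
print(expected_overclocking_error(n, b))                # 63/512
print(Fraction(1, 2 ** b) - Fraction(1, 2 ** (n + 1)))  # 63/512
```

Exact rational arithmetic is used so that the enumerated sum and the closed form can be compared without floating-point tolerance.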
B. Probabilistic Model of Truncation Error
If the input signal of a circuit is $n$ bits, truncation error occurs when the input signal is truncated from $n$ bits to a shorter word-length of $m$ bits. Under this premise, the mean value of the truncated bits at the signal input is given by (10).
Since we assume there are two mutually independent inputs to the RCA, the overall expectation of truncation error for the RCA is given by (11).
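The mean truncation error can be illustrated by brute force. The sketch below assumes a uniformly distributed unsigned $n$-bit input whose lowest $n - m$ bits are discarded, and normalises the result to $2^n$ so that it is on the same scale as the overclocking error; the symbol $m$, the normalisation and the function name are assumptions made for this example, since (10) and (11) are not reproduced in this text.

```python
from fractions import Fraction


def mean_truncation_error(n: int, m: int) -> Fraction:
    """Average value of the discarded LSBs of a uniform n-bit input, over 2**n."""
    dropped = n - m
    total = sum(x % (1 << dropped) for x in range(1 << n))
    # Divide by the number of inputs (2**n) and by the normaliser (2**n).
    return Fraction(total, (1 << n) * (1 << n))


n, m = 8, 5
per_input = mean_truncation_error(n, m)
print(per_input)      # 7/512, i.e. 2**-(m+1) - 2**-(n+1)
print(2 * per_input)  # 7/256, two independent truncated inputs feeding the adder
```

Because the two adder inputs are assumed independent, the expected error at the adder output under this sketch is simply twice the per-input value.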
C. Comparison between Two Scenarios
In the traditional scenario, for a given sampling frequency, the word-length of the RCA must be truncated to $b - 1$ bits, where $b$ is the maximum length of error-free carry propagation from (1), in order to meet the required timing constraint. The error expectation of the traditional scenario is then given by (12).
In the new scenario, we allow overclocking errors, so we set the word-length of the RCA to be equal to the input word-length $n$, giving (13) according to (9).
Comparing Eq. (13) and Eq. (12), we have (14). This equation indicates that by allowing timing violations, the overall error expectation of the RCA outputs drops by a factor of 2 in comparison to the traditional scenario. This provides the first hint that our approach is useful in practice.
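The factor of 2 can be made concrete with a short worked derivation. Equations (12) to (14) are not reproduced in this text, so the expressions below are a reconstruction under stated assumptions: independent, uniformly distributed inputs; errors normalised to $2^n$; a traditional design truncated to $b - 1$ bits; and an overclocked design that keeps all $n$ bits. The symbols $E_{trad}$ and $E_{new}$ are labels introduced here for the two expectations.

```latex
\begin{align}
E_{trad} &= 2 \cdot \tfrac{1}{2} \sum_{i=b}^{n} 2^{-i}
          = 2^{-(b-1)} - 2^{-n}, \\
E_{new}  &= \sum_{d=0}^{n-b} 2^{-(b+1)} \cdot 2^{\,d+b-n}
          = 2^{-b} - 2^{-(n+1)}, \\
\frac{E_{trad}}{E_{new}}
         &= \frac{2\left(2^{-b} - 2^{-(n+1)}\right)}{2^{-b} - 2^{-(n+1)}} = 2 .
\end{align}
```

In the second line, $2^{-(b+1)}$ is the probability that a carry generated at bit $d$ propagates further than $b$ bits, and $2^{d+b-n}$ is the normalised error this causes. For example, with $n = 8$ and $b = 3$ these expressions give $E_{trad} = 63/256$ and $E_{new} = 63/512$.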
III. CONSTANT COEFFICIENT MULTIPLIER
Another key primitive of arithmetic operations, the constant coefficient multiplier (CCM), can be implemented using RCAs and shifters. For example, the operation $Y = 9X$ is equivalent to $Y = X + 8X = X + (X \ll 3)$, which can be built using one RCA and one shifter. We will focus on this single-RCA, single-shifter structure in the rest of this paper, since complex structures consisting of multiple RCAs and multiple shifters can be built from this baseline structure.
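The shift-and-add decomposition can be written directly in software. This is a minimal behavioural sketch of the single-adder, single-shifter structure, not a hardware description; the function name `ccm_shift_add` and its `shift` parameter are hypothetical, and Python integers stand in for two's complement words of unbounded width.

```python
def ccm_shift_add(x: int, shift: int = 3) -> int:
    """Multiply x by the constant (2**shift + 1) using one shift and one add."""
    shifted = x << shift   # the "shifted signal": zeros padded after the LSB
    return x + shifted     # a single addition, as performed by the RCA


print(ccm_shift_add(5))    # 45, i.e. 9 * 5
print(ccm_shift_add(-7))   # -63, i.e. 9 * (-7)
```

Other coefficients of the form $2^s + 1$ follow by changing `shift`; coefficients with more non-zero bits would need further adders and shifters, as the paragraph above notes.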
In this CCM structure, let the two inputs of the RCA be denoted by $A_S$ and $A_O$ respectively, both of which are two's complement numbers. $A_S$ denotes the "shifted signal", with zeros padded after the LSB, while $A_O$ denotes the "original signal", with MSB sign extension. For an $n$-bit input signal, it should be noted that an $n$-bit RCA is sufficient for this operation, because no carry will be generated or propagated when adding with zeros, as shown in Fig. 2.
A. Probabilistic Model of Overclocking Error
1) Absolute Value of Overclocking Error:
For a CCM, the absolute value of overclocking error of carry chain is increased by a factor of 2 due to shifting, compared to RCA. Therefore in CCM can be modified from (5) to give (15).
2) Probability of Overclocking Error: Due to the dependencies in a CCM, carry generation requires = − = 1, propagation and annihilation of a carry chain is best considered separately for four types of carry chain generated at bit . We label these by 1 to 4 in Fig. 2 , defined by the end region of the carry chain. For 1, we have: If all bits of input signal are mutually independent, then the probability of carry propagation and annihilation for 1 and 2 is 1/2, and the probability of carry generation is 1/4. If we substitute this into (6), we obtain (16).
For carry annihilation of 3, −1 = −1 , which is always true. Thus the probability of 3 is given by (17).
4 represents carry chain annihilates over −1 , therefore carry propagation requires −1 ≠ −1 . This means 4 never occurs in a CCM. Altogether, for a CCM is given by (18).
3) Expectation of Overclocking Error: For a CCM, since the carry chain will not propagate over −1 , the upper bound of parameter and should be modified from (2) and (3) to give (19) and (20).
Finally, by substituting (18) and (15) with modified bounds of and into (8), we obtain the expectation of overclocking error for a CCM to be given by (21).
B. Probabilistic Model of Truncation Error
In order to meet the target frequency, we assume the input signal is truncated before entering the CCM. The expectation of truncation error at the output of the CCM is then related to that at its input by (22), where $c$ denotes the coefficient of the CCM.
IV. CASE STUDY: FIR FILTER
A. Experimental Setup and Model Verification
The benefits of the proposed methodology are demonstrated by using an FIR filter, as shown in Fig. 3 . The results are obtained through timing simulations. In order to achieve the desired latency between input and output, the word-length of the input signal is truncated through the quantizer (Q) in the traditional scenario. However, in our proposed new scenario, the operating frequency is over-scaled while maintaining the original input word-length. We explore the best trade-off between latency and error based on these two scenarios. Two types of input signal are employed in our experiments: 8-bit data sampled from a uniform distribution, which we refer to as "uniform independent inputs", and "real inputs", which denote 8-bit pixel values of several 512 × 512 images. 
B. Expectation of Error
We first assess the accuracy of our proposed models of error. The amount of input signal truncation for the traditional scenario varies from 1 bit to 7 bits. Truncating the input allows the circuit to operate at a higher frequency than the rated frequency, up to 2.75×, which corresponds to the maximum frequency required for a 1-bit input word-length according to our experiments. Results in Fig. 4 demonstrate that our models for both overclocking error and truncation error match well with the uniform case. However, we observe a small deviation when using real inputs, since the real data does not exactly follow the uniform distribution or the independence assumption. In Fig. 4 the real inputs are the results for the "Lena" benchmark image. Table I summarizes the experimental results obtained with uniform independent inputs together with 4 benchmark images. For all input types, error expectation is reduced in the new scenario, as predicted by our model, with the geometric mean of the reduction varying from 5.6× to 36.7×. In addition, we see that in practice, larger differences in error expectation are achieved. This is because for real data, long carry chains typically occur with even smaller probabilities.
C. Signal-to-Noise Ratio
Fig. 5 demonstrates SNR for uniform independent inputs and the "Tiffany" image with increasing frequencies. For the former input type, higher SNR is obtained in the traditional scenario when the operating frequency is increased to between 1.1× and 1.6× of the rated frequency, while the new scenario outperforms it when the operating frequency is increased to over 1.8× of the rated frequency. This is because SNR is inversely proportional to the square of the error. In the traditional scenario, a small frequency increase corresponds to limited truncation of LSBs, which in turn leads to small noise power. In the new scenario, overclocking error is generated in the MSBs, therefore large noise power is expected at the early stage of overclocking. However, as the operating frequency is further increased, the corresponding amount of truncation rises, resulting in smaller SNR compared to the new scenario.
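The dependence of SNR on the square of the error can be illustrated with a short calculation. The text does not state the exact SNR definition used in the experiments, so the sketch below assumes the conventional one: the ratio of reference signal power to error power, expressed in decibels. The function name `snr_db` and the sample values are illustrative only.

```python
import math


def snr_db(reference, observed) -> float:
    """SNR in dB: 10 * log10(signal power / noise power), assumed definition."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - o) ** 2 for r, o in zip(reference, observed))
    if noise_power == 0:
        return math.inf  # identical outputs: no noise at all
    return 10 * math.log10(signal_power / noise_power)


reference = [10, 10]
small_error = [9, 11]   # every sample off by 1
large_error = [8, 12]   # every sample off by 2

print(snr_db(reference, small_error))  # 20 dB (up to floating-point rounding)
print(snr_db(reference, small_error) - snr_db(reference, large_error))  # about 6.02 dB
```

Doubling the error magnitude quadruples the noise power and costs roughly 6 dB, which is why a few large MSB errors can initially weigh more on SNR than many small LSB truncation errors.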
For real inputs, SNR can be higher for all frequencies under the new scenario, since long carry chains occur rarely. This information could be used to achieve better performance in the new scenario. For example, suppose a user only requires an SNR of 30 dB or less. In this case, one could operate at a frequency of 2.1× the rated frequency under the new scenario but only 1.5× under the traditional scenario. Table II presents the differences in SNR (dB) obtained for the two input types. Similar to error expectation, we also see that the improvement in SNR is higher for real image data than for uniform data.
V. CONCLUSION
This paper has explored the probabilistic behavior of key arithmetic primitives when allowing timing violations. We have developed models for errors generated due to both overclocking and truncation of inputs. These models indicate that it may be preferable to allow timing violations to occur, under the knowledge that they will only occur rarely. We support this hypothesis with experiments that demonstrate that a geometric mean reduction in error expectation of 5.6× to 36.7× and a maximum improvement in SNR of 7.2 dB to 19.7 dB can be achieved in real applications over the conventional scenario. This information can in turn be used to achieve better performance for a given budget of error or SNR.
