Abstract: Embedded applications often have stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantization error into the design. In this paper, we demonstrate that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantization errors. Through the use of analytical models and empirical results on a Xilinx Virtex-6 FPGA, we show that a geometric mean reduction of 67.9% to 98.8% in error expectation or a geometric mean improvement of 3.1% to 27.6% in operating frequency can be obtained using this alternative design methodology.
I. INTRODUCTION
FPGA-based accelerators have demonstrated significant performance gains over software designs across a range of applications [1] , [2] . However, one of the major factors that limits the performance of these accelerators is that they typically run at much lower clock frequencies than general purpose processors (GPPs) or GPUs. While it is unlikely that FPGA-based accelerators will ever be run at the same clock frequency as GPPs or GPUs, timing analysis tools typically recommend that a user should run their implementation at a very conservative clock frequency in order to avoid the possibility of timing violations. This substantially limits the potential performance of the device.
The standard techniques to boost the operating frequency of a datapath are either to heavily pipeline the design or to reduce the precision used. While pipelining may boost the maximum frequency, it will not tend to reduce the latency of the circuit. As a result, this method is not applicable to many embedded applications, which typically have strict latency requirements, or to datapaths containing feedback where C-slow retiming is inappropriate. Reducing the datapath precision will reduce the latency of the accelerator at the cost of introducing quantization error into the design. Because FPGAs are free to employ customized variable representations, finding the minimum precision necessary to satisfy a design specification, such as the maximum tolerable error, has been studied extensively within the FPGA community [3].
However, the choice of precision is not the only source of error in a datapath. Recently, we have seen a growth of research that explores the potential power or performance benefits that can be obtained when operating circuits beyond the deterministic region. This topic is expected to be of growing importance due to the increasingly stringent timing and power requirements, design complexity, and the environmental and process variations that all accompany the continuous scaling of process technologies [4]. As pointed out by the International Technology Roadmap for Semiconductors (ITRS07) [5], while future technologies may suffer from much poorer timing performance, extra benefits in manufacturing, test and power consumption can be obtained if the tight requirement of absolute correctness is relaxed for devices and interconnect.
Research in this area has typically focused on relaxing the design constraints and the safety margins that are conventionally used. A series of work named "Better Than Worst-Case Design" introduced a universal structure with cores (which operate with high performance) and checkers (which check and recover the system from timing errors) [6]. As an exemplary design, the Razor project [7] scaled the supply voltage and clock frequency beyond the most conservative value, while monitoring the error rate by utilizing a self-checking circuit. This work demonstrated that the benefits brought by removing the safety margin outweigh the cost of monitoring and recovering from errors. Related work involves operating circuits slightly slower than the critical path delay with dedicated checker circuits to ensure that timing errors will not occur [8], developing timing analysis tools that decide the optimum operating frequencies in the non-deterministic region due to process variation [9], or dynamically voltage scaling an FPGA upon detection of timing errors to prevent them occurring in the future [10].
Alternatively, there is research focusing on designing "probabilistic circuits" which trade accuracy for performance, power and silicon area improvements by using techniques such as voltage overscaling and imprecise architectures. For example, Palem et al. [11] described a technique for the ripple carry adder that employed different voltage regions for different bits along a carry chain.
That is, a higher voltage would be applied for computations generating the most significant bits, and vice versa. However, the ability to implement non-uniform voltage scaling is limited in practical situations. For the second approach, Lu et al. proposed a simplified datapath that can be employed to mimic and speculate the original logic functions [12]. Similarly, Gupta et al. developed approximate adders at the transistor level and compared the energy efficiency of their proposed architectures with truncation of the input word-length of conventional structures [13]. Both articles are based on the observation that errors only occur with specific input patterns. However, the link between the probability of output correctness and energy saving is not analysed. In addition, these techniques cannot be directly applied to FPGAs.
In this work, we attempt to bring the strands of research of arithmetic precision determination and overclocking together. We evaluate the probabilistic behavior of basic arithmetic primitives with different datapath precisions, when operating beyond the deterministic region. We suggest that for certain applications it is beneficial to move away from the traditional model of creating a conservative design that is guaranteed to avoid timing violations. Instead, it may be preferable to create a design in which timing violations may occur, under the knowledge that they are unlikely to occur frequently because they require specific input patterns to generate errors. To support this hypothesis, we initially present probabilistic models of errors generated in this process for basic arithmetic operators: the ripple carry adder (RCA) and constant coefficient multiplier (CCM). We follow this with experimental data from a Xilinx Virtex-6 FPGA, across a range of benchmark circuits and applications. We show that not only does this approach allow us to reduce the need for the conservative timing margin, more importantly, our models and experimental results demonstrate that performance benefits can be achieved in comparison to the traditional situation where target latency is limited by choice of precision. The main contributions of this paper are:
• Detailed descriptions of how to create probabilistic models for overclocking and truncation errors for basic arithmetic primitives,
• Analytical and empirical results from FPGA implementation that demonstrate that allowing rare timing violations to occur results in less error than truncating a datapath to meet timing.
The rest of the paper is organized as follows: we first present theoretical probabilistic error models for RCA and CCM in Section II and Section III, respectively. This is followed by the description of a practical experimental setup on the Xilinx Virtex-6 FPGA in Section IV, and the demonstration of the benefits of our proposed approach in Section V, before drawing conclusions in Section VI.
II. RIPPLE CARRY ADDER
A. Adder Structures in FPGAs
Adders serve as a key building block for arithmetic operations. Generally speaking, the ripple carry adder (RCA) is the most straightforward and widely used adder structure. As such, the philosophy of our approach is first exemplified with the analysis of an RCA. We later describe how this methodology can be extended to other arithmetic operators in Section III by discussing the CCM that is commonly used in DSP applications and numerical algorithms.
Typically the maximum frequency of an RCA is determined by the longest carry propagation. Consequently, modern FPGAs offer built-in architectures for very fast ripple carry addition. For instance, the Altera Cyclone series uses fast tables [14] while the Xilinx Virtex series employs dedicated multiplexers and encoders for the fast carry logic [15]. Figure 1 illustrates the structure of an $n$-bit RCA, which is composed of serially connected full adders (FAs) and utilizes the internal fast carry logic of the Virtex-6 FPGA.
While the fast carry logic reduces the time of each individual carry-propagation delay, the overall carry-propagation delay will eventually overwhelm the delay of sum generation of each LUT with increasing operand word-lengths. For our initial analysis, we assume that the carry propagation delay of each FA is a constant value $\mu$, which is a combination of logic delay and routing delay, and hence the critical path delay of the RCA is $\mu_{RCA} = n\mu$, as shown in Figure 1. For an $n$-bit RCA, it follows that if the sampling period $T_S$ is greater than $\mu_{RCA}$, correct results will be sampled. If, however, $T_S < \mu_{RCA}$, intermediate results will be sampled, potentially generating errors.
In the following sections, we consider two methods that would allow the circuit to run at a frequency higher than $1/\mu_{RCA}$, the reciprocal of the critical path delay. The first is a traditional circuit design approach where operations occur without timing violations. To this end, the operand word-length is truncated in order to meet the timing requirement. This process results in truncation or roundoff error. In our proposed new scenario, circuits are implemented with greater word-length, but are clocked beyond the safe region so that timing violations sometimes occur. This process generates "overclocking error".
B. Probabilistic Model of Truncation Error
For ease of discussion, we assume that the input to our circuit is a fixed point number scaled to lie in the range [−1, 1). For our initial analysis, we assume every bit of each input is uniformly and independently generated. However, this assumption will be relaxed in Section V where the predictions are verified using real image data. The errors at the output are evaluated in terms of the absolute value and the probability of their occurring. These two metrics are combined as the error expectation.
If the input signal of a circuit is $k$ bits, truncation error occurs when the input signal is truncated from $k$ bits to $n$ bits. Under this premise, the mean value of the truncated bits at the signal input ($E_{Tin}$) is given by (1).
Since we assume there are two mutually independent inputs to the RCA, the overall expectation of truncation error for the RCA is given by (2).
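As a sanity check of the truncation model, the following sketch enumerates every pair of inputs exhaustively and compares the mean truncation error of a two-input adder against a closed form. It rests on assumptions that are not taken from the paper: inputs are treated as unsigned $k$-bit fractions in [0, 1) rather than two's complement values in [−1, 1), the symbols `k` and `n` stand for the input and truncated word-lengths, and the closed form $2^{-n} - 2^{-k}$ is derived for this simplified setting rather than quoted from (1) or (2). All function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def truncation_error(x: int, k: int, n: int) -> Fraction:
    """Value lost when a k-bit unsigned fraction keeps only its top n bits."""
    dropped = x & ((1 << (k - n)) - 1)   # the k-n least significant bits
    return Fraction(dropped, 1 << k)     # scale so the full range is [0, 1)


def mean_adder_truncation_error(k: int, n: int) -> Fraction:
    """Exact mean error of an adder whose two inputs are both truncated."""
    total = sum(
        truncation_error(a, k, n) + truncation_error(b, k, n)
        for a, b in product(range(1 << k), repeat=2)
    )
    return total / (1 << (2 * k))


def closed_form(k: int, n: int) -> Fraction:
    """Assumed closed form for uniform, independent input bits."""
    return Fraction(1, 1 << n) - Fraction(1, 1 << k)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, mean_adder_truncation_error(6, n), closed_form(6, n))
```

Exact rational arithmetic is used so that the enumeration and the closed form can be compared for equality rather than within a tolerance.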
C. Probabilistic Model of Overclocking Error
1) Generation of Overclocking Error: For a given $f_S$, the maximum length of error-free carry propagation, $b$, is described by (3), where $f_S$ denotes the sampling frequency.
However, since the length of an actual carry chain during execution depends upon the input patterns, the worst case generally occurs rarely. To determine when this timing constraint is not met and the size of the error in this case, we expand standard results [16] to the following statements, which examine carry generation, propagation and annihilation, as well as the corresponding summation result $S_i$ of a single bit $i$, according to the relationship between its input patterns $A_i$ and $B_i$:
• If $A_i = B_i = 1$, a new carry chain is generated at bit $i$, and $S_i = C_{i-1}$;
• If $A_i \neq B_i$, the carry propagates for this carry chain at bit $i$, and $S_i = 0$;
• If $A_i = B_i$, the current carry chain annihilates at bit $i$, and $S_i = 1$.
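The three statements above can be turned into a small classifier that labels each bit position and measures carry-chain lengths for a given pair of operands. This is a minimal sketch, assuming unsigned operands and bit 0 as the LSB; the function names are hypothetical. A chain's length counts the generating bit, the propagating bits and the bit where it ends, and a chain that runs past the top bit is taken to end at the carry-out.

```python
def classify_bit(ai: int, bi: int) -> str:
    """Label one bit position from its two input bits."""
    if ai == 1 and bi == 1:
        return "generate"     # also ends any chain arriving from below
    if ai != bi:
        return "propagate"    # passes an incoming carry upwards
    return "annihilate"       # both zero: an incoming chain stops here


def longest_carry_chain(a: int, b: int, n: int) -> int:
    """Length of the longest carry chain in an n-bit addition (0 if none)."""
    kinds = [classify_bit((a >> i) & 1, (b >> i) & 1) for i in range(n)]
    longest = 0
    for t, kind in enumerate(kinds):
        if kind != "generate":
            continue
        end = t + 1
        # Walk upwards while the carry keeps propagating.
        while end < n and kinds[end] == "propagate":
            end += 1
        # 'end' is the bit where the chain stops (index n means the carry-out).
        longest = max(longest, end - t + 1)
    return longest


if __name__ == "__main__":
    print(longest_carry_chain(0b0111, 0b0001, 4))
```

Only operands whose longest chain exceeds the error-free propagation length can produce an overclocking error, which is why such errors depend on specific input patterns.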
2) Absolute Value of Overclocking Error:
For an $n$-bit RCA, let $C_{tm}$ denote the carry chain generated at bit $t$ with a length of $m$ bits. For a certain $f_S$, the maximum length of error-free carry propagation, $b$, is determined through (3). The presence of overclocking error requires $m > b$. Since the length of the carry chain cannot be greater than $n$, parameters $t$ and $m$ are bounded by (4) and (5):
For $C_{tm}$, correct results will be generated from bit $t$ to bit $t+b-1$. Hence the absolute value of the error seen at the output, normalized to the MSB ($2^n$), is given by (6), where $\hat{S}_i$ and $S_i$ denote the actual and error-free output of bit $i$ respectively.
$S_i$ and $\hat{S}_i$ can be determined using the statements in Section II-C1. In the error-free case, the carry will propagate from bit $t$ to bit $t+m-1$, and we will obtain $S_{t+b} = S_{t+b+1} = \cdots = S_{t+m-2} = 0$ for carry propagation, and $S_{t+m-1} = 1$ for carry annihilation. However, when a timing violation occurs, the carry will not propagate through all these bits. Substituting these values into (6) yields (7). Interestingly, the value of the overclocking error has no dependence on the length of the carry chain $m$.
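The observation that the error does not depend on the chain length can be reproduced with a simple behavioural model of an overclocked adder. The sketch below is an assumption-laden simplification, not the paper's method: operands are unsigned, a carry is only seen by sum bits fewer than `reach` positions above the bit that generated it (`reach` plays the role of the maximum error-free carry length), and any bit further up samples a stale carry of 0. Writing `t` for the generating bit and `m` for the chain length (notation assumed here), the un-normalised error in this model comes out as `2**(t + reach)` for every `m > reach`.

```python
def overclocked_add(a: int, b: int, n: int, reach: int) -> int:
    """n-bit ripple-carry add under a simplified timing-violation model.

    Returns an (n+1)-bit value whose top bit is the carry-out.
    """
    mask = (1 << n) - 1
    a, b = a & mask, b & mask
    result = 0
    for i in range(n + 1):
        carry_in = 0
        # Look back through at most reach-1 lower bits for the carry's origin.
        for j in range(i - 1, max(i - reach, -1), -1):
            aj, bj = (a >> j) & 1, (b >> j) & 1
            if aj & bj:
                carry_in = 1      # generator found inside the window
                break
            if aj == bj:
                break             # both zero: no carry can pass this bit
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry_in) << i
    return result


def single_chain_operands(t: int, m: int) -> tuple[int, int]:
    """Operands with one carry chain: generated at bit t, ending at bit t+m-1."""
    propagate = sum(1 << i for i in range(t + 1, t + m - 1))
    return (1 << t) | propagate, 1 << t


if __name__ == "__main__":
    for m in range(4, 11):
        a, b = single_chain_operands(1, m)
        print(m, (a + b) - overclocked_add(a, b, 12, 3))
```

With `reach` larger than `n` the look-back window covers the whole adder and the function returns the exact sum, which corresponds to operating within the safe region.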
3) Probability of Overclocking Error: The carry chain $C_{tm}$ occurs when there is a carry generated at bit $t$, a carry annihilated at bit $t+m-1$ and the carry propagates in between. Consequently, its probability $P_{tm}$ is given by (8).
Under the assumption that $A_i$ and $B_i$ are mutually independent and uniformly distributed, we have $P(A_i = B_i = 1) = 1/4$ for carry generation, and a probability of 1/2 for each of carry propagation and carry annihilation, so $P_{tm}$ can be obtained by (9). Note that (9) takes into account that carry annihilation always occurs when $t+m-1 = n$.
4) Expectation of Overclocking Error: Expectation of overclocking error can be expressed by (10) .
Using $|e_{tm}|$ and $P_{tm}$ from (7) and (9) respectively, the expectation of overclocking error $E_O$ can be obtained by (11).
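The expectation can be evaluated numerically by summing probability times magnitude over every carry chain that is too long to complete. The sketch below does this with exact fractions. Several choices are assumptions made for illustration and are not quoted from (4) to (11): bits are indexed from 0, bit `n` is the carry-out, a chain generated at bit `t` with length `m` has probability `2**-(m+1)` (or `2**-m` when it ends at the carry-out), its error is `2**(t+b)` normalised by `2**n`, and the bounds are `0 <= t <= n-b` and `b < m <= n+1-t`. Under these assumptions the sum collapses to `2**-b - 2**-(n+1)`.

```python
from fractions import Fraction


def chain_probability(t: int, m: int, n: int) -> Fraction:
    """Probability of a chain generated at bit t with length m."""
    if t + m - 1 == n:
        # The chain runs into the carry-out, which always ends it.
        return Fraction(1, 2 ** m)
    # Generate (1/4), m-2 propagating bits (1/2 each), annihilate (1/2).
    return Fraction(1, 2 ** (m + 1))


def chain_error(t: int, b: int, n: int) -> Fraction:
    """Error magnitude normalised to 2**n; note that it does not depend on m."""
    return Fraction(2 ** (t + b), 2 ** n)


def overclocking_expectation(n: int, b: int) -> Fraction:
    """Sum of probability times error over every chain longer than b."""
    total = Fraction(0)
    for t in range(0, n - b + 1):
        for m in range(b + 1, n - t + 2):
            total += chain_probability(t, m, n) * chain_error(t, b, n)
    return total


if __name__ == "__main__":
    for b in range(1, 10):
        print(b, overclocking_expectation(8, b))
```

When `b` exceeds `n` both loops are empty and the expectation is zero, matching the intuition that no timing violation can occur once the clock period covers the full carry chain.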
D. Comparison between Two Scenarios
In the traditional scenario, the word-length of the RCA must be truncated, using $n = b - 1$ bits, in order to meet a given $f_S$. The error expectation is then given by (12).
Overclocking errors are allowed to happen in the second scenario, therefore the word-length of the RCA is set to be equal to the input word-length, that is, $n = k$. Hence we obtain (13) according to (11).
Comparing (13) and (12), we have (14). This equation indicates that by allowing timing violations, the overall error expectation of the RCA outputs drops by a factor of 2 in comparison to the traditional scenario. This provides the first hint that our approach is useful in practice.
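The factor of 2 can be checked with a short worked derivation. The closed forms used here are assumptions: they follow from a simplified model with unsigned operands, bits indexed from 0 and both errors normalised to the full-scale input, namely $E_T = 2^{-n} - 2^{-k}$ for truncation to $n$ bits and $E_O = 2^{-b} - 2^{-n-1}$ for overclocking an $n$-bit adder. They are not quoted from (12) and (13), whose exact normalisation may differ.

```latex
\begin{align*}
E_T\big|_{n=b-1} &= 2^{-(b-1)} - 2^{-k} = 2\left(2^{-b} - 2^{-k-1}\right),\\
E_O\big|_{n=k}   &= 2^{-b} - 2^{-k-1},\\
\frac{E_O}{E_T}  &= \frac{1}{2}.
\end{align*}
```

Under these assumptions the ratio is exactly one half for every $b \le k$, independent of the input word-length.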
III. CONSTANT COEFFICIENT MULTIPLIER
As another key primitive of arithmetic operations, a CCM can be implemented using RCAs and shifters. For example, the operation $y = 9x$ is equivalent to $y = x + 8x = x + (x \ll 3)$, which can be built using one RCA and one shifter. We first focus on a single RCA and single shifter structure. We describe how more complex structures consisting of multiple RCAs and multiple shifters can be built in accordance with this baseline structure in Section III-C.
In this CCM structure, let the two inputs of the RCA be denoted by $A_S$ and $A_O$ respectively, which are both two's complement numbers. $A_S$ denotes the "shifted signal", with zeros padded after the LSB, while $A_O$ denotes the "original signal" with MSB sign extension. For an $n$-bit input signal, it should be noted that an $n$-bit RCA is sufficient for this operation, because no carry will be generated or propagated when adding with zeros, as shown in Figure 2.
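A minimal sketch of the baseline single-adder, single-shifter structure, written with the hypothetical names `x` for the input and `shift` for the number of shifted bits. It also checks the property that makes a narrower adder sufficient: the lowest `shift` bits of the result are just the lowest bits of the input, because they are added to padded zeros.

```python
def ccm_shift_add(x: int, shift: int) -> int:
    """Multiply x by (1 + 2**shift) using one shift and one addition."""
    return x + (x << shift)


def low_bits_pass_through(x: int, shift: int) -> bool:
    """True if the lowest `shift` output bits equal the input's lowest bits."""
    mask = (1 << shift) - 1
    return (ccm_shift_add(x, shift) & mask) == (x & mask)


if __name__ == "__main__":
    print(ccm_shift_add(5, 3))    # 5 * 9
    print(ccm_shift_add(-7, 3))   # -7 * 9
    print(all(low_bits_pass_through(x, 3) for x in range(-128, 128)))
```

Python integers behave like arbitrarily wide two's complement values under shifting and masking, so the same check works for negative inputs.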
A. Probabilistic Model of Truncation Error
Let $E_{Tin}$ and $E_{Tout}$ denote the expectation of truncation error at the input and output of the CCM respectively. We then have (15), where $c$ denotes the coefficient value of the CCM, and $E_{Tin}$ can be obtained according to (2).
B. Probabilistic Model of Overclocking Error 1) Absolute Value of Overclocking Error:
The absolute value of the overclocking error of carry chain $C_{tm}$ is increased by a factor of $2^s$ due to shifting, compared to the RCA, where $s$ denotes the number of shifted bits. Hence $|e_{tm}|$ in the CCM can be modified from (7) to give (16).
2) Probability of Overclocking Error: Due to the dependencies in a CCM, carry generation requires $A_t = A_{t-s} = 1$ (bit $t$ of the original signal and bit $t$ of the shifted signal are both 1), and the propagation and annihilation of a carry chain are best considered separately for four types of carry chain generated at bit $t$. We label these types 1 to 4 in Figure 2, defined by the end region of the carry chain. For type 1, we have:
For the first two types of carry chain, types 1 and 2, the probability of carry propagation and annihilation is 1/2 and the probability of carry generation is 1/4, under the premise that all bits of the input signal are mutually independent. Therefore (17) can be obtained by substituting this into (8).
For carry annihilation of type 3, $A_{n-1} = A_{n-1}$, which is always true. Thus the probability of type 3 is given by (18).
Type 4 represents a carry chain that annihilates beyond bit $n-1$, therefore carry propagation requires $A_{n-1} \neq A_{n-1}$. This means type 4 never occurs in a CCM.
Altogether, $P_{tm}$ for a CCM is given by (19).
3) Expectation of Overclocking Error:
Since the carry chain of a CCM will not propagate beyond bit $n-1$, the upper bounds of parameters $t$ and $m$ should be modified from (4) and (5) to give (20) and (21).
Finally, by substituting (19) and (16) with the modified bounds of $t$ and $m$ into (10), we obtain the expectation of overclocking error for a CCM, given by (22).
C. CCM with Multiple RCAs and Shifters
In the case where a CCM is composed of two shifters and one RCA, such as the operation $y = 20x = (x \ll 2) + (x \ll 4)$, let the shifted bits be denoted as $s_1$ and $s_2$ respectively. Hence the equivalent $s$ in (22) can be obtained through (23).
For operations such as $y = 37x = (x \ll 5) + (x \ll 2) + x$, the CCM can be built using a tree structure. Each root node is the baseline CCM and the errors are propagated through an adder tree, of which the error can be determined based on our previous RCA model.
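To illustrate how a constant coefficient maps onto shifted terms, the sketch below lists the set bits of a positive coefficient and rebuilds the product from shifted copies of the input. The function names are hypothetical, and a plain binary decomposition is only one possible choice; a real design might prefer a signed-digit recoding to reduce the number of adders.

```python
def shift_terms(coefficient: int) -> list[int]:
    """Bit positions set in a positive coefficient, i.e. the shifts to sum."""
    return [i for i in range(coefficient.bit_length()) if (coefficient >> i) & 1]


def multiply_by_shifts(x: int, coefficient: int) -> int:
    """Constant multiplication expressed as a sum of shifted copies of x."""
    return sum(x << s for s in shift_terms(coefficient))


if __name__ == "__main__":
    print(shift_terms(20), shift_terms(37))
    print(multiply_by_shifts(-3, 37))
```

A coefficient with two set bits needs one adder, and each additional set bit adds one more adder to the tree.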
IV. TEST PLATFORM
In our experiments, we compare two design perspectives. In the first scenario, the word-length of the input signal is truncated before propagating through the datapath in order to meet a given latency. In our proposed overclocking scenario, the circuit is overclocked while keeping the original operand word-length. The benefits of the proposed methodology are demonstrated over a set of DSP example designs, which are implemented on the Xilinx ML605 board with a Virtex-6 FPGA XC6VLX240T-1FFG1156.
A. Experimental Setup
We initially build up a test framework on an FPGA. The general architecture is depicted in Figure 3 . The main body of the test framework consists of the circuit under test (CUT), the test frequency generator and the control logic, as shown in the dotted box in Figure 3 . The I/Os of the CUT are registered by the launch registers (LRs) and the sample registers (SRs), which are all triggered by the test clock. Input test vectors are stored in the on-chip memory during initialization. The results are sampled using Xilinx ChipScope. Finally, we perform an offline comparison of the output of the original circuit at the rated frequency with the output of the overclocked as well as the truncated designs using the same input vectors.
The test frequency generator is implemented using two cascaded mixed-mode clock managers (MMCMs), created using Xilinx Core Generator [17] . Besides the outputs, the corresponding input vectors and memory addresses are also recorded into the comparator, as can be seen in Figure 3 , in order to ensure that the recorded errors arise from overclocking the CUT rather than the surrounding circuitry when high test frequencies are applied.
B. Benchmark Circuits
Three types of DSP designs are tested: digital filters (FIR, IIR and Butterworth), a Sobel edge detector and a direct implementation of a Discrete Cosine Transformation (DCT). The filter parameters are generated using the MATLAB filter design toolbox, and they are normalized to integers for implementation. Table I summarizes the operating frequency of each implemented design in Xilinx ISE 14.1 when the word-length of the input signal is 8 bits. The input data are generated from two sources. One is called "uniform independent inputs", which are randomly sampled from a uniform distribution of 8-bit numbers. The other is referred to as "real inputs", which denote 8-bit pixel values of the 512×512 Lena image.
C. Exploring the Conservative Timing Margin
Generally, the operating frequency provided by EDA tools tends to be conservative to ensure the correct functionality under a wide range of operating environments and workloads. In a practical situation, this may result in a large gap between the predicted frequency and the actual frequency under which the correct operation is maintained [18] .
For example, the predicted frequencies and the actual frequencies of a 5th-order FIR filter using different word-lengths are depicted in Figure 4. The "actual" maximum frequencies are computed by increasing the operating frequency from the rated value until errors are observed at the output; the maximum operating frequency with correct output is recorded for the current word-length. As can be seen in Figure 4, according to our experiments the circuit can operate without errors at a much higher frequency in practice than predicted. A maximum speed differential of 3.2× is obtained when the input signal is 5-bit. In our experiments in Section V, the conservative timing margin is removed in the traditional scenario for a fairer comparison to the overclocking scenario. To do this, for each truncated word-length, we select the maximum frequency at which we see no overclocking error on the FPGA board in our lab. For example, in Figure 4, the operating frequencies of the design when the word-lengths are truncated to 8, 5 and 2 bits are 400MHz, 450MHz and 500MHz respectively. Figure 4 also demonstrates that truncating the circuit allows it to operate at a higher frequency than the full-precision implementation. However, a non-uniform period change can be observed for both results. For instance, the maximum operating frequency stays almost constant when the operand word-length reduces from 8 to 6 or from 5 to 3, in both the experimental results and those of the timing analyzer. This will cause a slight deviation from our analytical model, which assumes the single-bit carry propagation delay to be a constant value, as discussed in (12) with the expression $n = b - 1$. This deviation will be influenced by many factors, including how the architecture has been packed onto LUTs and CLBs and process variation causing non-uniform interconnection delays [19]. However, we shall see that our model remains close to the true empirical results in Section V.
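The search for the "actual" maximum frequency described above can be sketched as a simple upward sweep. Everything here is hypothetical scaffolding: `has_errors` stands in for whatever mechanism programs the clock generator, runs the test vectors and compares the sampled outputs with the reference, and the step size and upper limit are arbitrary parameters. The sketch also assumes the rated frequency itself is error-free.

```python
from typing import Callable


def max_error_free_frequency(
    rated_mhz: float,
    step_mhz: float,
    limit_mhz: float,
    has_errors: Callable[[float], bool],
) -> float:
    """Raise the clock from the rated value until output errors appear."""
    best = rated_mhz
    freq = rated_mhz
    while freq <= limit_mhz and not has_errors(freq):
        best = freq            # last frequency seen with fully correct output
        freq += step_mhz
    return best


if __name__ == "__main__":
    # Stand-in for a hardware run: pretend errors start above 412 MHz.
    print(max_error_free_frequency(300, 5, 600, lambda f: f > 412))
```

Repeating the sweep for each truncated word-length yields one measured frequency per word-length, which is the kind of data plotted against the tool prediction in Figure 4.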
D. Evaluation Metric of Outputs
The results are evaluated in terms of mean relative error (MRE), which represents the percentage of error at the outputs. MRE is given by (24), where $E_{error}$ and $E_{out}$ refer to the mean value of the error and of the correct output respectively.
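A minimal sketch of the metric as described in the text: the mean error divided by the mean correct output, expressed as a percentage. Taking absolute values of both the error and the reference output is an assumption made here, and the function name is hypothetical.

```python
from typing import Sequence


def mean_relative_error(observed: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute error over mean absolute correct output, in percent."""
    if len(observed) != len(reference) or len(reference) == 0:
        raise ValueError("observed and reference must be non-empty and equal in length")
    count = len(reference)
    mean_error = sum(abs(o - r) for o, r in zip(observed, reference)) / count
    mean_output = sum(abs(r) for r in reference) / count
    if mean_output == 0:
        raise ValueError("mean reference output is zero; MRE is undefined")
    return 100.0 * mean_error / mean_output


if __name__ == "__main__":
    print(mean_relative_error([10, 20, 30], [10, 22, 28]))
```

Normalising by the mean output rather than per sample avoids dividing by individual outputs that happen to be zero.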
E. Computing Model Parameters
The accuracy of our proposed models is examined with practical results on the Virtex-6 FPGA. We first determine the model parameters. There are two types of parameters in the models of overclocking error. The first is based on the circuit architecture: for example, the word-length of RCAs and CCMs ($n$), the shifted bits of the shifters in a CCM ($s$), and the word-length of the input signal ($k$). These are determined through static analysis. The second depends on timing information, such as the single-bit carry propagation delay $\mu$. To remain consistent with the assumption made in the models that $\mu$ is a fixed value, it is obtained from the actual FPGA measurement results.
Initially the maximum error-free frequency $f_0$ is applied. In this case we have (25), where $d$ is a constant value which denotes the interconnection delay. The frequency is then increased such that (26) is obtained. This process repeats until the maximum frequency $f_{n-1}$ is applied in (27). Based on these frequency values, $\mu$ can be determined.
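One way the per-bit delay could be extracted from such a sequence of frequencies is a straight-line fit of clock period against the number of carry stages that still complete. The sketch below assumes a relation of the form `1/f_i = d + (n - i) * mu`, with `d` a fixed interconnection delay and `mu` the per-bit carry delay; this form is a guess at the structure of (25) to (27), which are not reproduced in the text, and the function name is hypothetical.

```python
def fit_carry_delay(frequencies_hz: list[float], n: int) -> tuple[float, float]:
    """Least-squares fit of 1/f_i = d + (n - i) * mu; returns (mu, d)."""
    if len(frequencies_hz) < 2:
        raise ValueError("at least two frequency measurements are needed")
    lengths = [n - i for i in range(len(frequencies_hz))]
    periods = [1.0 / f for f in frequencies_hz]
    count = len(periods)
    mean_len = sum(lengths) / count
    mean_per = sum(periods) / count
    sxx = sum((l - mean_len) ** 2 for l in lengths)
    sxy = sum((l - mean_len) * (p - mean_per) for l, p in zip(lengths, periods))
    mu = sxy / sxx                      # slope: delay added per carry stage
    d = mean_per - mu * mean_len        # intercept: constant delay
    return mu, d


if __name__ == "__main__":
    # Synthetic measurements generated from known values, for illustration only.
    true_mu, true_d = 0.1e-9, 1.0e-9
    freqs = [1.0 / (true_d + (8 - i) * true_mu) for i in range(8)]
    print(fit_carry_delay(freqs, 8))
```

A fit over all measurements averages out the non-uniform per-bit delays noted in Section IV-C, at the cost of hiding them.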
V. RESULTS AND DISCUSSION

A. Case study: FIR filter
We first assess the accuracy of our proposed models of error. The modeled values of both overclocking error and truncation error of the FIR filter are presented in Figure 5 (dotted lines), as well as the actual measurements on the FPGA (solid lines) with two types of input data. The results demonstrate that our models match well with the practical results obtained using the uniform independent inputs. According to Figure 5, output errors are reduced in the overclocking scenario for both input types in comparison to the traditional scenario, as expected by our models. In addition, we see that using real data, a more significant reduction of MRE is achieved, and that no errors are observed when the frequency is initially increased. This is because for real data, long carry chains are typically generated with even smaller probabilities, and the longest carry chain rarely occurs. The output images of the FIR filter for both scenarios with increasing frequencies are presented in Figure 6, from which we can clearly see the differences between the errors generated in these two scenarios. In the overclocking scenario, we observe errors in the MSBs for certain input patterns. This leads to "salt and pepper noise", as shown on the images in the top row of Figure 6. In the traditional scenario, truncation causes an overall degradation of the whole image, as can be seen in the bottom row of Figure 6. Furthermore, it is difficult to recover from the latter type of error, since it is generated due to precision loss.
Figure 6. Output images of the FIR filter for both overclocking scenario (top row) and traditional scenario (bottom row) under various operating frequencies. Top row: (a) 425MHz, n=8, no errors observed; (b) 430MHz, n=8, SNR=47.15dB; (c) 480MHz, n=8, SNR=24.1dB; (d) 520MHz, n=8, SNR=10.86dB. Bottom row: (e) 425MHz, n=7, SNR=26.06dB; (f) 430MHz, n=5, SNR=24.85dB; (g) 480MHz, n=2, SNR=6.57dB; (h) 520MHz, n=1, SNR=3.95dB.
B. Potential Benefits in Circuit Design
Our results could be of interest to a circuit designer in two ways. Typically, either the designer will want to create a circuit that can run at a given frequency with the minimum possible MRE, or the algorithm designer will wish to run as fast as possible whilst maintaining a specific error tolerance. In the first case, the experimental results for all five example designs on the FPGA are summarized in Table II in terms of the relative reduction of MRE as given in (28), where $MRE_{trad}$ and $MRE_{ovc}$ denote the values obtained in the traditional scenario and in the overclocking scenario, respectively.
$$\frac{MRE_{trad} - MRE_{ovc}}{MRE_{trad}} \times 100\% \qquad (28)$$
In this table, the frequency is normalized to the maximum error-free frequency for each design when the input signal is 8-bit. The N/A in Table II refers to the situations where a certain frequency simply cannot be achieved using the traditional scenario. It can be seen that a significant reduction of MRE can be achieved using the proposed overclocking scenario, and the geometric mean reduction varies from 67.9% to 95.4% using uniform input data. Even larger differences in MRE can be observed when testing with real image data for each design, ranging from 83.6% to 98.8%, as expected given the results shown in Figure 5. Table III illustrates the frequency speedups for each design when the specified error tolerance varies from 0.05% to 50%. For all designs, we see that the overclocking scenario still outperforms the traditional scenario for each MRE budget in terms of operating frequency. Likewise, the frequency speedup is higher for real image inputs than for uniform inputs. Geometric mean frequency speedups of 3.1% to 21.8% are achieved using uniform data, and 5.3% to 27.6% using real image data.
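The two summary statistics used in this comparison, the relative reduction of MRE and its geometric mean across designs, can be sketched as follows. The function names are hypothetical and the numbers in the usage example are made up for illustration; they are not values from Table II.

```python
from statistics import geometric_mean


def relative_mre_reduction(mre_traditional: float, mre_overclocked: float) -> float:
    """Percentage by which overclocking lowers MRE relative to truncation."""
    return 100.0 * (mre_traditional - mre_overclocked) / mre_traditional


def geometric_mean_reduction(pairs: list[tuple[float, float]]) -> float:
    """Geometric mean of the per-design reductions.

    Each pair is (MRE in the traditional scenario, MRE when overclocked).
    Every reduction must be positive for the geometric mean to be defined.
    """
    return geometric_mean([relative_mre_reduction(t, o) for t, o in pairs])


if __name__ == "__main__":
    illustrative = [(4.0, 2.0), (10.0, 2.8)]   # made-up MRE pairs
    print(geometric_mean_reduction(illustrative))
```

`statistics.geometric_mean` requires Python 3.8 or later and rejects non-positive inputs, so a design where overclocking gives no reduction would need to be handled separately.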
VI. CONCLUSION
This paper has explored the probabilistic behavior of key arithmetic primitives in an FPGA when operating beyond the conservative region. We have developed models for errors generated due to both overclocking and truncation of inputs. These models indicate that it may be preferable to allow timing violations to occur, under the knowledge that they will only occur rarely. We support this hypothesis with empirical results on a Virtex-6 FPGA, which demonstrate that a geometric mean reduction of 67.9% to 98.8% in mean relative error, or a geometric mean improvement in operating frequency of 3.1% to 27.6%, can be achieved in real applications over the conventional scenario. In the future, we wish to expand our methodology by incorporating silicon area as a third evaluation metric, and to analyze the tradeoffs using alternative architectures.
