High Performance Reliable Variable Latency Carry Select Addition by Du, Kai
RICE UNIVERSITY 
High Performance Reliable 
Variable Latency Carry Select Addition 
by 
Kai Du
A THESIS SUBMITTED 
IN PARTIAL FULFILLMENT OF THE 
REQUIREMENTS FOR THE DEGREE 
Master of Science 
THESIS COMMITTEE:
Peter Varman, Chairman
Professor of Electrical and Computer Engineering
Rice University
Associate Professor of Electrical and Computer Engineering
University of Pittsburgh
Lin Zhong
Associate Professor of Electrical and Computer Engineering
Kevin Kelly
Associate Professor of Electrical and Computer Engineering
Rice University
Ray Simar
Professor in the Practice of Electrical and Computer Engineering
Rice University
HOUSTON, TEXAS 
NOVEMBER, 2011 
Abstract 
High Performance Reliable 
Variable Latency Carry Select Addition 
by 
Kai Du
This thesis describes the design and the optimization of a low overhead, high performance 
variable latency carry select adder. Previous researchers believed that the traditional adder 
has reached the theoretical speed bound. However, a considerable portion of hardware 
resources of the traditional adder is only used in the worst case. Based on this observation, 
variable latency adders have been proposed to improve on the theoretical limit, but such 
adders incur significant area overhead. By combining previous variable latency adders 
with carry select addition, this work describes a novel variable latency carry select adder. 
Applying carry select addition in the variable latency adder design significantly reduces 
the area overhead and increases its performance. This variable latency adder is faster and 
smaller than previous variable latency adders. Furthermore, this variable latency adder can 
be optimized to be faster and smaller than the fastest adder generated by the Synopsys
DesignWare building block IP.
Acknowledgements 
I express my sincere gratitude to my advisors, Prof. Kartik Mohanram and Prof. Peter Varman,
for all their guidance, patience and support throughout this thesis. I am also grateful
to Prof. Lin Zhong, Prof. Kevin Kelly and Prof. Ray Simar for their time and valuable
comments on this thesis.
I am also grateful to my friends at Rice, Xuebei Yang, Mihir Choudhury, Masoud Rostami
and Ahmed ElNably, for their time and help.
Finally, I thank my family for their love and support.
Contents
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
2 Background
3 Speculative carry select addition (SCSA)
3.1 Operation of SCSA
3.2 Error rate analysis
3.3 Error magnitude analysis
4 SCSA-based speculative adder
4.1 Window adder design
4.2 Implementation of SCSA
4.3 Time and space complexity of SCSA
5 Variable latency carry selection adder (VLCSA)
5.1 Error detection design
5.2 Error recovery design
5.3 Operation of VLCSA
6 Modified VLCSA (VLCSA 2)
6.1 Motivation
6.2 Profiling practical inputs
6.3 Approximating practical inputs
6.4 Key idea of Modified VLCSA
6.5 Modified speculative addition
6.6 Modified error detection
6.7 Operation of VLCSA 2
7 Results
7.1 Simulation setup
7.2 Error model validation
7.3 Error rates for 2's complement Gaussian inputs
7.4 Comparison with existing variable latency adders
7.4.1 Speculative addition in VLCSA 1 vs speculative addition in VLSA
7.4.2 VLCSA 1 vs VLSA
7.5 Comparison with DesignWare adder
7.5.1 Speculative addition in VLCSA 1 vs DesignWare adder
7.5.2 VLCSA 1 vs DesignWare adder
7.5.3 VLCSA 2 vs DesignWare adder
8 Conclusion
List of Figures
3.1 Dot graph for addition. A dot represents an input bit.
3.2 Input bits grouped into windows.
3.3 Dot graph to illustrate the operation of SWA.
3.4 Error when $G_{k-1:0} = 1$ in the $i$th window and $P_{k-1:0} = 1$ in the $(i+1)$th window.
3.5 Predicted error rates for different adder widths (n) and window sizes.
3.6 Example to illustrate low error magnitude.
4.1 The speculative adder consists of $\lceil n/k \rceil$ $k$-bit small adders.
4.2 Window adder implementation in speculative adder. Carry-select adder structure is employed to increase performance.
4.3 Tree structure for calculating the group P/G signals of Kogge-Stone adder.
5.1 Error detection implementation using 2-input AND and OR gates.
5.2 Area-efficient implementation for error recovery using intermediate results from the speculative adder.
5.3 Variable latency adder implementation, similar to [17].
6.1 Example of statistics of carry chain lengths for unsigned random inputs. The adder size is 32 bits.
6.2 Examples of statistics of carry chain lengths from a cryptographic workload [6].
6.3 Example of statistics of carry chain lengths for 2's complement random inputs. The adder size is 32 bits.
6.4 Example of statistics of carry chain lengths for unsigned Gaussian inputs. The adder size is 32 bits.
6.5 Example of statistics of carry chain lengths for 2's complement Gaussian inputs. The adder size is 32 bits.
6.6 Window adder implementation in SCSA 2.
6.7 Modified error detection implementation using 2-input AND and OR gates.
6.8 VLCSA 2 implementation.
7.1 Comparison of analytical error model for SCSA and simulation results for different adder widths (n).
7.2 Comparison of delay of speculative adders and Kogge-Stone adder.
7.3 Comparison of area of speculative adders and Kogge-Stone adder.
7.4 Comparison of delay of variable latency adders and Kogge-Stone adder.
7.5 Comparison of area of variable latency adders and Kogge-Stone adder.
7.6 Comparison of delay of speculative addition in VLCSA 1 and DesignWare adder.
7.7 Comparison of area of speculative addition in VLCSA 1 and DesignWare adder.
7.8 Comparison of delay of VLCSA 1 and DesignWare adder.
7.9 Comparison of area of VLCSA 1 and DesignWare adder.
7.10 Comparison of delay of VLCSA 2 and DesignWare adder.
7.11 Comparison of area of VLCSA 2 and DesignWare adder.
List of Tables
7.1 Experimental and nominal error rates in VLCSA 1 for 2's complement Gaussian inputs. $\mu = 0$, $\sigma = 2^{32}$.
7.2 Experimental and nominal error rates in VLCSA 2 for 2's complement Gaussian inputs. $\mu = 0$, $\sigma = 2^{32}$.
7.3 Parameters of SCSA and the speculative adder in [17] for an error rate of 0.01%, according to analytical error models and simulation results.
7.4 Parameters of SCSA and VLCSA 1 for the error rates of 0.01% and 0.25%, according to analytical error models and simulation results.
7.5 Parameters of VLCSA 2 for the error rates of 0.01% and 0.25%, according to simulation results. $\mu = 0$, $\sigma = 2^{32}$.
Chapter 1 
Introduction 
Addition, one of the most frequently used arithmetic operations, is employed to build ad-
vanced operations such as multiplication and division. Theoretical research has found that 
the lower bound on the critical path delay of the adder has complexity O(log n ), where 
n is the adder width. The design of high performance adders has been extensively stud-
ied [10] [15], and several adders have achieved logarithmic delays. Whereas theoretical 
bounds indicate that no traditional adder can achieve sub-logarithmic delay, it has been 
shown that speculative adders can achieve sub-logarithmic delays by neglecting rare input 
patterns that exercise the critical paths [2, 11, 13]. Furthermore, by augmenting speculative 
adders with error detection and recovery, one can construct reliable variable-latency adders 
whose average performance is very close to speculative adders [3, 6, 12, 17]. 
Speculative adders are built upon the observation that the critical path is rarely ac-
tivated in traditional adders. In traditional adders, each output depends on all previous 
(lower or equal significance) bits. In particular, the most significant output depends on all 
the n bits, where n is the adder width. In contrast, in speculative adders [2, 6, 11, 13, 17], 
each output only depends on the previous k bits rather than all previous bits, where k is 
much smaller than n. However, the cumulative error grows linearly with the adder width 
since each speculative output can independently be in error. Moreover, the calculation of 
each speculative output requires an individual k-bit adder; hence, such designs also incur
large area overhead and large fanout at the primary inputs. Techniques such as effective
sharing [17] can mitigate but not eliminate the fanout and area problems. Although the
speculative adder in [18] can mitigate the area problem, it incurs a fairly high error rate that
limits its application. For applications where errors cannot be tolerated, a reliable vari-
able latency adder can be built upon the speculative adder by adding error detection and 
recovery [3, 6, 12, 17]. For the vast majority of input combinations, the speculative adder 
produces correct results; when error detection flags an error, error recovery provides correct 
results in one or more extra cycles. Ideally, the average performance of the variable latency 
adder should be similar to the speculative one. However, existing variable latency adders 
have several drawbacks. When error detection indicates no error, the actual delay is the 
longer of the speculative adder and error detection. The delay of error detection is always 
longer than the speculative adder [6] [17]. Hence, the benefit of speculation is limited by 
the delay of error detection [3] [12]. Besides, the circuitry for error detection and recovery 
incurs nontrivial area overhead. Finally, variable latency adders are mostly restricted to
random inputs [3, 12, 17].
This thesis first describes a novel function speculation technique, called speculative 
carry select addition (SCSA). The key idea is to segment the chain of propagate signals in 
addition into blocks of the same size. Specifically, the input bits of addends are segmented 
into blocks, and the carry bits between blocks are selectively truncated to 0. SCSA is less 
susceptible to errors, since it is only applied for blocks instead of individual outputs. A 
single individual adder is required to compute all outputs of a block instead of each output, 
which mitigates the area overhead problem. An analytical model to determine the error rate 
of SCSA is formulated, and the accurate relation between the block size and output error 
is developed. A high performance speculative adder design is presented for low error rates 
(e.g. 0.01% and 0.25%). 
Secondly, this thesis describes a reliable variable latency adder design that augments 
the speculative adder with error detection and recovery. The speculative adder produces 
correct results in a single cycle in most cases, and error recovery provides correct results in 
an extra cycle in worst cases. The performance of the variable latency adder is close to that 
of the speculative adder. This approach has two advantages. First, the critical path delay of 
the error detection block is lower or comparable to that of the speculative adder. Second, 
the error detection and recovery circuitry incurs low area overhead by using intermediate 
results from the speculative adder. 
Finally, previous variable latency and speculative adders are mainly designed for
unsigned random inputs, so this thesis proposes modified variable latency and speculative
adders suitable for both random and Gaussian inputs. With the modified speculative adder
and error detection block, the variable latency adder still achieves high performance when
2's complement Gaussian inputs are present. This shows that the variable latency adder design
is feasible for practical applications.
Simulations using 10 million unsigned random inputs are used to validate the analytical 
error model, and analytical and simulation results match well. Simulation results indicate 
that for an error rate of 0.01% (0.25%), SCSA-based speculative addition is 10% faster
than the DesignWare adder with up to 43% (56%) area reduction. Simulation results also
suggest that on average, variable latency addition using SCSA-based speculative adders is
about 10% faster than the DesignWare adder with area requirements of -19% to 16% (-16%
to 29%) for unsigned random (2's complement Gaussian) inputs.
This thesis is organized as follows. Chapter 2 presents the background of the specu-
lative adder and reliable variable latency adder. Chapter 3 introduces the SCSA and the 
corresponding error analysis. Chapter 4 describes the SCSA-based speculative adder de-
sign. Chapter 5 proposes the reliable variable latency adder design using the SCSA-based 
speculative adder with error detection and recovery, called variable latency carry select 
adder (VLCSA). Chapter 6 presents a modified reliable variable latency adder design suitable
for both unsigned uniform and 2's complement Gaussian inputs. Chapter 7 validates
the above models and designs. Chapter 8 concludes the thesis.
Chapter 2 
Background 
Due to the importance of addition, various adders have been proposed for achieving high 
performance and low power [10] [15], such as ripple carry adder, carry select adder, carry 
skip adder, look-ahead adder, and parallel prefix adder. There is an interesting observation 
regarding adders and indeed many other designs: The critical path is rarely activated. The 
actual paths in typical cases are much shorter than the critical one. This observation indi-
cates that the traditional worst-case design methodology may require large design margin. 
Speculative adders have achieved significantly higher performance by neglecting rare input 
patterns that exercise the critical paths [2, 11, 13]. Furthermore, error-free variable latency 
adders can be constructed from speculative adders by adding error detection and recovery, 
and can achieve average performance comparable to speculative ones [3,6, 12, 17]. 
Speculative or variable latency adders fall into two main categories. The first category 
detects the input patterns that violate the timing constraint and removes these errors at design
time. Telescopic units [4] fall in this category. However, the synthesis of an exact function
that covers all input patterns that violate the timing constraint is expensive in practice. It 
has been shown that this problem is NP-complete [16], which limits the application of this 
technique to large circuits. The second category is called function speculation, wherein the 
original logic function is replaced by an approximate logic function. In the asynchronous 
domain, [14] first proposed a speculative variable latency adder. In the synchronous domain,
it has been suggested that the complete logic function be replaced by a simplified
logic function that provides correct results most of the time [2, 11, 13]. However, the tech-
niques in [2, 11, 13] have no error correction capability and may also suffer from large 
area and large fanout at the primary inputs. Recently, [17] proposed an error-free variable 
latency adder design wherein the speculative addition is similar to [2, 11, 13]. [6] studied 
an extension of [17] for the inputs extracted from practical benchmarks, which incurs ad-
ditional area overhead. Both [17] and [6] have the same area and fanout problems noted 
above. Furthermore, in [17] [6], the critical path delay of error detection is always longer 
than that of the speculative adder. The approach in [17] was generalized in [3], wherein 
an automatic synthesis technique that transforms a combinational design to a two-stage 
variable-latency design was described. This was extended in [12] to multi-stage function 
speculation. The design of speculation is strictly limited by error detection in [3] [12]. 
Besides, [3] [12] are both restricted to random inputs. Finally, although the speculative
adder design proposed in [18] can mitigate the area problem, it exhibits a fairly high error 
rate that limits its application. 
Other speculative or variable latency designs for low energy operation
have also been reported. The Razor technique [7] dynamically adjusts the supply voltage by
detecting and correcting errors. A similar energy saving technique [8] has been proposed
for signal processing applications. Signal processing applications can be error-tolerant,
and do not require error detection and recovery. In addition, a non-uniform voltage scaling
approach, called probabilistic arithmetic [5], was proposed to adjust the voltage for each bit
position in the adder. This technique can achieve greater energy savings due to the special
treatment of each bit position.
In this thesis, we describe a novel function speculative addition, and the corresponding 
speculative adder and reliable variable latency adder. The closest approaches to our work
are [18] [12]. Although the speculative adder design proposed in [18] can mitigate the area
problem, it exhibits a fairly high error rate that limits its application. The error rate is
estimated by running the simulation for a 32-bit adder, which is not scalable for large adders.
The relation between speculation and error also remains unclear. In contrast, we present
an analytical error model for SCSA, by which the accurate relation between speculation 
and error is formulated. In [12], a combinational adder is transformed into a multi-cycle 
variable-latency one, which requires multi-cycle timing analysis. The design is assumed to 
work for unsigned random inputs. Besides, it is difficult to incorporate this technique into
the traditional EDA flow due to complicated multi-cycle timing constraints. In contrast, the
SCSA-based variable latency design is suitable for both unsigned random and 2's complement
Gaussian inputs, and is a simple deterministic design with one or two cycles of latency for
addition.
Chapter 3 
Speculative carry select addition (SCSA) 
Speculative carry select addition (SCSA) is motivated by an observation about the carry chain
in addition. Throughout this chapter, we employ unsigned binary addition to
illustrate SCSA. The inputs are assumed to be uniformly distributed and are called random inputs.
The addition is shown as the dot graph in Figure 3.1, where dots indicate input bits.
Figure 3.1: Dot graph for addition. A dot represents an input bit. 
We represent the two input numbers as $A$ and $B$. The $i$th least significant bits of $A$ and $B$ are
denoted $a_i$ and $b_i$, respectively. We define the propagate and generate (P/G) signals
at the $i$th bit position as:

$$P_i = a_i \oplus b_i \tag{3.1}$$

$$G_i = a_i b_i \tag{3.2}$$

The sum bit, $s_i$, and carry-out (carry) bit, $c_i$, at the $i$th bit position are rewritten as:

$$s_i = P_i \oplus c_{i-1} \tag{3.3}$$

$$c_i = G_i + P_i c_{i-1} \tag{3.4}$$

If $P_i = 1$, then $c_i = c_{i-1}$, which indicates that changing the value of $c_{i-1}$ directly changes
the value of $c_i$. This situation is defined as $c_i$ depends on $c_{i-1}$, written as $c_i \to c_{i-1}$. All
other situations are defined as $c_i$ does not depend on $c_{i-1}$, written as $c_i \nrightarrow c_{i-1}$. Let us
consider how $c_i$ depends on $c_{i-k}$, $0 < k \le i$. If $\exists\, P_j = 0$, $i - k + 1 \le j \le i$, then $c_i \nrightarrow c_{i-k}$.
In other words, $c_i \to c_{i-k}$ iff $\forall\, P_j = 1$, $i - k + 1 \le j \le i$. The number of consecutive
propagate signals $P_j$ with value 1 is called the carry chain length. The probability that $P_j = 1$
is 1/2, so the probability of a $k$-bit carry chain is $1/2^k$. This implies that $c_i$ may be locally
approximated using several consecutive input bits. The average longest carry chain length
in an $n$-bit addition has been extensively studied [10] [15]. Although there is no closed-form
solution, it is widely recognized that the carry chain length in the $n$-bit addition is
$O(\log n)$ for unsigned uniform inputs [10] [15]. This interesting fact suggests that it is
possible to quickly and accurately estimate the output bit using only several consecutive
input bits.
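The definitions above can be exercised with a short Python sketch. It is only an illustration under stated assumptions: the function names (`pg_signals`, `ripple_add`, `longest_carry_chain`) are hypothetical and not part of the thesis, the carry into bit 0 is taken to be 0, and the propagate signal is the XOR form used in (3.1).

```python
def pg_signals(a: int, b: int, n: int):
    """Per-bit propagate (XOR) and generate (AND) signals, LSB first."""
    p = [((a ^ b) >> i) & 1 for i in range(n)]
    g = [((a & b) >> i) & 1 for i in range(n)]
    return p, g


def ripple_add(a: int, b: int, n: int) -> int:
    """Add two n-bit numbers bit by bit; bit n of the result is the final carry."""
    p, g = pg_signals(a, b, n)
    carry = 0  # assumed carry into bit 0
    total = 0
    for i in range(n):
        total |= (p[i] ^ carry) << i       # s_i = P_i xor c_{i-1}
        carry = g[i] | (p[i] & carry)      # c_i = G_i + P_i c_{i-1}
    return total | (carry << n)


def longest_carry_chain(a: int, b: int, n: int) -> int:
    """Length of the longest run of consecutive propagate signals equal to 1."""
    p, _ = pg_signals(a, b, n)
    best = run = 0
    for bit in p:
        run = run + 1 if bit else 0
        best = max(best, run)
    return best
```

For instance, `ripple_add(0b1011, 0b0110, 4)` reproduces the ordinary sum 11 + 6, and `longest_carry_chain` reports the longest run of positions through which a carry could propagate for that input pair.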
Figure 3.2: Input bits grouped into windows.
3.1 Operation of SCSA
Long carry chains rarely occur in addition for unsigned random inputs. In other words,
by grouping input bits into blocks as shown in Figure 3.2, the carry chain length can be
made comparable to the block size with high probability. In SCSA, input bits are divided
into blocks of the same size, as shown in Figure 3.2. A block, called a window, includes
several consecutive input bits. The SCSA operation is shown in Figure 3.3. The adder
width is $n$. The window size is $k$. The total number of windows is $m = \lceil n/k \rceil$. The carry-out
bit of the $i$th window is called $C^i_{out}$, $0 \le i < m$. The carry-out bit of a window is speculated
using only the $k$ input bits of the window. Combining one speculative carry-in bit with the $k$ input
bits of the window, the $k$ speculative sum bits of the window are computed. Any bit position in
the window is affected by at least the previous $k$ bit positions. As argued earlier, the probability
that an output bit depends on more than $k$ previous bit positions is less than $1/2^k$. However,
the relation between the window size and error remains unclear. An analytical error model
for SCSA is presented next and provides critical guidance for the SCSA-based adder design.
Figure 3.3: Dot graph to illustrate the operation of SWA.
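The operation described above can be sketched behaviourally in Python. This is a minimal model, not the hardware design: `scsa_add` is a hypothetical name, $k$ is assumed to divide $n$, the carry into the first window is assumed to be 0, and bit $n$ of the returned value holds the speculative carry-out of the last window.

```python
def scsa_add(a: int, b: int, n: int, k: int) -> int:
    """Behavioural sketch of speculative carry select addition.

    Each window adds its own k input bits plus a speculative carry-in,
    which is the previous window's carry-out computed with carry-in 0
    (its group generate signal).
    """
    assert n % k == 0, "sketch assumes k divides n"
    mask = (1 << k) - 1
    result = 0
    spec_carry = 0  # assumed carry into the first window
    for i in range(n // k):
        wa = (a >> (i * k)) & mask
        wb = (b >> (i * k)) & mask
        result |= ((wa + wb + spec_carry) & mask) << (i * k)
        spec_carry = (wa + wb) >> k  # carry-out speculated from this window only
    return result | (spec_carry << n)


# A case where speculation fails: window 0 generates a carry and
# window 1 propagates it, so the carry into window 2 is missed.
speculative = scsa_add(0x0F8, 0x008, 12, 4)  # evaluates to 0
exact = 0x0F8 + 0x008                        # evaluates to 0x100
```

In this example the two results differ only in the lowest bit of the third window, which is the kind of mismatch that the error analysis in the next section quantifies.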
3.2 Error rate analysis 
We start from when an error occurs in SCSA. We observe that an error occurs if a window 
produces a group generate signal with value 1 and the next window produces a group 
propagate signal with value 1, as shown in Figure 3.4. 
Figure 3.4: Error when $G_{k-1:0} = 1$ in the $i$th window and $P_{k-1:0} = 1$ in the $(i+1)$th window.
We state this event in a rigorous way. The adder width is $n$. The window size is $k$. The
total number of windows is $m = \lceil n/k \rceil$. The group P/G signals at the $j$th bit position of the
$i$th window are stated as:

$$G^i_{j:0} = G^i_j + P^i_j G^i_{j-1} + \dots + G^i_0 \prod_{l=1}^{j} P^i_l, \tag{3.5}$$

$$P^i_{j:0} = \prod_{l=0}^{j} P^i_l, \tag{3.6}$$

where $P^i_l$ and $G^i_l$ are the P/G signals at the $l$th bit position of the $i$th window. The group
P/G signals of the $i$th window are defined as $P^i_{k-1:0}$ and $G^i_{k-1:0}$, $0 \le i < m$. The carry-out
bit of the $i$th window, $C^i_{out}$, is written as:

$$C^i_{out} = G^i_{k-1:0} + P^i_{k-1:0} C^{i-1}_{out}, \quad 1 \le i < m \tag{3.7}$$

In SCSA, $C^{i-1}_{out}$ is truncated to 0, and $C^i_{out}$ is approximated as $C^{i*}_{out}$:

$$C^{i*}_{out} = G^i_{k-1:0} \tag{3.8}$$
As shown in Figure 3.4, the $i$th window has a group generate signal with value 1, $G^i_{k-1:0} = 1$.
It is straightforward to see that:

$$C^i_{out} = 1$$

$P^{i+1}_{k-1:0} = 1$ indicates that $C^i_{out}$ passes through the $(i+1)$th window, which also implies
$G^{i+1}_{k-1:0} = 0$. The carry-out bit of the $(i+1)$th window, $C^{i+1}_{out}$, is approximated using (3.8) as:

$$C^{(i+1)*}_{out} = G^{i+1}_{k-1:0} = 0$$

In contrast, in traditional addition, the correct carry-out bit of the $(i+1)$th window, $C^{i+1}_{out}$, is
written as:

$$C^{i+1}_{out} = G^{i+1}_{k-1:0} + P^{i+1}_{k-1:0} C^i_{out} = C^i_{out} = 1$$

Thus, $C^{i+1}_{out} \ne C^{(i+1)*}_{out}$. SCSA incorrectly speculates the result if $P^{i+1}_{k-1:0} G^i_{k-1:0} = 1$.
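The error condition can be checked against a direct comparison of true and speculative window carries. The sketch below is an illustration only: the helper names are hypothetical, $k$ is assumed to divide $n$, the carry into the first window is assumed to be 0, and an "actual" error is taken to mean that some window's speculative carry-out differs from its true carry-out.

```python
def window_group_pg(a: int, b: int, n: int, k: int):
    """Group (P, G) signals of each k-bit window, least significant window first."""
    mask = (1 << k) - 1
    signals = []
    for i in range(n // k):
        wa = (a >> (i * k)) & mask
        wb = (b >> (i * k)) & mask
        group_p = (wa ^ wb) == mask      # every bit position propagates
        group_g = ((wa + wb) >> k) == 1  # window carries out with carry-in 0
        signals.append((group_p, group_g))
    return signals


def error_predicted(a: int, b: int, n: int, k: int) -> bool:
    """True if some window i has G = 1 while window i + 1 has P = 1."""
    pg = window_group_pg(a, b, n, k)
    return any(pg[i][1] and pg[i + 1][0] for i in range(len(pg) - 1))


def error_actual(a: int, b: int, n: int, k: int) -> bool:
    """True if any speculative carry-out (3.8) differs from the true one (3.7)."""
    carry = False  # assumed carry into the first window
    for group_p, group_g in window_group_pg(a, b, n, k):
        true_carry = group_g or (group_p and carry)
        if true_carry != group_g:
            return True
        carry = true_carry
    return False
```

Under these assumptions the two predicates agree on every input pair, which can be confirmed exhaustively for small sizes such as $n = 6$, $k = 2$.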
The probability of the above event is calculated as follows. Since group P/G signals
from two different windows are fully independent, the error probability,
$P(P^{i+1}_{k-1:0} G^i_{k-1:0} = 1)$, is written as:

$$P(P^{i+1}_{k-1:0} G^i_{k-1:0} = 1) = P(P^{i+1}_{k-1:0} = 1)\, P(G^i_{k-1:0} = 1) \tag{3.9}$$

Besides, the probabilities that the group P/G signals of a window equal 1 can be derived as:

$$P(P^{i+1}_{k-1:0} = 1) = P\Big(\prod_{j=0}^{k-1} P^{i+1}_j = 1\Big) = \prod_{j=0}^{k-1} P(P^{i+1}_j = 1) = (1/2)^k, \tag{3.10}$$

$$\begin{aligned}
P(G^i_{k-1:0} = 1) &= P\Big(G^i_{k-1} + P^i_{k-1} G^i_{k-2} + \dots + G^i_0 \prod_{j=1}^{k-1} P^i_j = 1\Big) \\
&= P(G^i_{k-1} = 1) + P(P^i_{k-1} G^i_{k-2} = 1) + \dots + P\Big(G^i_0 \prod_{j=1}^{k-1} P^i_j = 1\Big) \\
&= (1/2)[1 - (1/2)^k],
\end{aligned} \tag{3.11}$$

where $G^i_{k-1}$, $P^i_{k-1} G^i_{k-2}$, ..., $G^i_0 \prod_{j=1}^{k-1} P^i_j$ are mutually exclusive. Based on (3.10) and (3.11),
(3.9) is computed as:

$$P(P^{i+1}_{k-1:0} G^i_{k-1:0} = 1) = P(P^{i+1}_{k-1:0} = 1)\, P(G^i_{k-1:0} = 1) = (1/2)^{k+1}[1 - (1/2)^k] \tag{3.12}$$
The total error probability for SCSA is approximated by summing up the probabilities of
these events for all windows. The approximate total error probability is stated as:

$$P_{err} \approx \sum_{0 \le i < \lceil n/k \rceil - 1} P(P^{i+1}_{k-1:0} G^i_{k-1:0} = 1) = \sum_{0 \le i < \lceil n/k \rceil - 1} (1/2)^{k+1}[1 - (1/2)^k] \tag{3.13}$$

(3.13) describes the relation between the window size and error rate for unsigned random
inputs.
[Plot: error rate (0 to 1) versus window size (bits, 4 to 18) for n = 64, 128, 256 and 512.]
Figure 3.5: Predicted error rates for different adder widths (n) and window sizes. 
Based on (3.13), predicted error rates for different adder widths and window sizes are
plotted in Figure 3.5. Figure 3.5 suggests that the error rate rapidly decreases as the window
size increases. The error rate becomes negligible if the window size is large enough. For
example, if $n = 256$ and $k = 16$, then $P_{err} \approx 0.01\%$. In other words, a 256-bit adder is replaced
with sixteen 16-bit adders for an error rate of 0.01%. The predicted error rate is critical for
guiding the SCSA-based speculative adder design.
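The analytical model in (3.13) is easy to evaluate numerically. The following Python sketch does so; the function name `scsa_error_rate` is hypothetical, and the model is the approximation for unsigned uniformly distributed inputs stated above.

```python
def scsa_error_rate(n: int, k: int) -> float:
    """Approximate SCSA error probability from (3.13) for unsigned random inputs."""
    m = (n + k - 1) // k                          # number of windows, ceil(n / k)
    per_pair = 0.5 ** (k + 1) * (1 - 0.5 ** k)    # (3.12) for one pair of adjacent windows
    return (m - 1) * per_pair                     # one term per pair of adjacent windows


if __name__ == "__main__":
    for k in (8, 12, 16):
        print(f"n=256, k={k}: {100 * scsa_error_rate(256, k):.4f}%")
```

For $n = 256$ and $k = 16$ the function evaluates to roughly $1.14 \times 10^{-4}$, in line with the 0.01% figure quoted in the text.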
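[Figure: the operands are 01111111 11111111 and 00000000 00000000. With the carry-in truncated to "0", the speculative sum bits of the left window are S* = 01111111; with the actual carry-in "1", the correct sum bits are S = 10000000.]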
Figure 3.6: Example to illustrate low error magnitude. 
3.3 Error magnitude analysis 
Error magnitude is defined as the ratio of the error to the correct result. For example,
if the correct result is 11001 and the speculative result is 10001, the error is 11001 - 10001 =
01000, so the error magnitude is 01000/11001 = 0.32. It is preferable to have a small error
magnitude when the speculation is incorrect. For some error-tolerant applications, the
speculative result with a small error magnitude may still be acceptable.
In SCSA, the error magnitude is low when an error occurs. An example is shown in
Figure 3.6. The carry-in bit of the right window is truncated to 0. Then the sum bits of
the left window are speculated as 01111111. However, the actual carry-in bit for the right
window is 1, and the correct sum bits are 10000000. The error magnitude is $1/2^7$, which is
quite small. This error affects all outputs of the left window rather than an individual output,
which amortizes the effect of the error. In contrast, if only an individual output is incorrect,
the error magnitude can be as large as the significance of the most significant bit in addition,
as in the speculative addition in [17]. For example, if the correct result is 11111111 and
the speculated result is 01111111, the error magnitude is $2^7/(2^8 - 1) = 50.2\%$, which is
quite large.
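The three numerical examples above follow directly from the definition of error magnitude. A minimal Python sketch (the function name `error_magnitude` is hypothetical) that reproduces them:

```python
def error_magnitude(correct: int, speculative: int) -> float:
    """Ratio of the absolute error to the correct result."""
    return abs(correct - speculative) / correct


# Values taken from the examples in the text.
first = error_magnitude(0b11001, 0b10001)           # 01000 / 11001
window = error_magnitude(0b10000000, 0b01111111)    # 1 / 2**7
single_bit = error_magnitude(0b11111111, 0b01111111)  # 2**7 / (2**8 - 1)
```

The contrast between `window` and `single_bit` is the point of the section: an error spread across a whole window is far smaller, relative to the correct value, than an error confined to the most significant output bit.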
Chapter 4 
SCSA-based speculative adder 
A high performance, low area overhead speculative adder, called the SCSA-based speculative
adder, can be built upon SCSA. We refer to this adder design as SCSA for simplicity. This
design may be used for applications where errors are tolerable, such as data mining,
machine learning, cryptography and signal processing. In these scenarios, an incorrect
result generated by the speculative addition may still lead to the correct final result.
Figure 4.1: The speculative adder consists of $\lceil n/k \rceil$ $k$-bit small adders.
Ideally, the $n$-bit adder is segmented into several small identical window adders. However,
since $n \bmod k = 0$ does not always hold, one of the window adders may be smaller than
the others. Specifically, this window adder has $n - k(\lceil n/k \rceil - 1)$ bits. Similar to the optimization
of the carry select adder design, this adder is placed as the 1st window adder to reduce
the delay of the speculative adder. All other window adders have $k$ bits. In summary, the
$n$-bit adder is almost equally segmented into $\lceil n/k \rceil$ window adders, as shown in
Figure 4.1.
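The segmentation rule can be written out as a small Python helper. The name `window_widths` is hypothetical; the list is ordered from the 1st (least significant) window upward, following the placement described above.

```python
def window_widths(n: int, k: int) -> list[int]:
    """Widths of the window adders, 1st (least significant) window first."""
    m = (n + k - 1) // k        # number of windows, ceil(n / k)
    first = n - k * (m - 1)     # the possibly smaller 1st window
    return [first] + [k] * (m - 1)


# Example: a 64-bit adder with 14-bit windows.
widths = window_widths(64, 14)
```

When $k$ divides $n$, the formula for the 1st window reduces to $k$, so all windows have the same width.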
Figure 4.2: Window adder implementation in speculative adder. Carry-select adder structure
is employed to increase performance.
4.1 Window adder design 
Next we discuss the design of the window adder. The speculative result of the window
adder is computed using the input bits of the window adder and the carry-in bit from the
previous window adder. Since the window adder works in the same way as a traditional adder,
it can be implemented using any traditional adder. Assume all primary inputs
arrive at the same time. For the window adder, the input bits then arrive earlier than the carry-in bit
provided by the previous window adder. Thus we employ the carry-select adder structure to
increase performance. As shown in Figure 4.2, the window adder consists of two small
adders, adder$_0$ and adder$_1$. The two small adders can be implemented using any traditional
adder, such as the Kogge-Stone or Brent-Kung adder. The Kogge-Stone adder is considered
one of the fastest traditional adder designs [10]. We can employ Kogge-Stone to
further increase the performance of the speculative adder.
Figure 4.3: Tree structure for calculating the group P/G signals of Kogge-Stone adder.
For the $i$th bit position of the adder, we have:

$$c_i = G_{i-1:0} + P_{i-1:0}\, c_0 \tag{4.1}$$

$$s_i = P_i \oplus c_i \tag{4.2}$$

where $c_i$ is the carry bit at the $i$th bit position, $G_{i-1:0}$ and $P_{i-1:0}$ are the group P/G signals
at the $i$th bit position, $c_0$ is the carry bit at the 0th bit position, $s_i$ is the sum bit at the $i$th bit
position, and $P_i$ is the propagate signal at the $i$th bit position. For example, the tree structure
for calculating the group P/G signals of the Kogge-Stone adder is shown in Figure 4.3. After
calculating the group P/G signals, the sum bit is computed.
4.2 Implementation of SCSA 
Next we describe how to implement SCSA. The group propagate and generate (P/G) signals of the
$i$th window adder are computed. The speculative carry-in bit of the $i$th window adder is
the group G signal of the $(i-1)$th window adder, $G^{i-1}_{k-1:0}$. Then the $j$th sum bit of the $i$th
window, $s^{i*}_j$, is estimated as:

$$s^{i*}_j = P^i_j \oplus [G^i_{j-1:0} + P^i_{j-1:0}\, C^{(i-1)*}_{out}] \tag{4.3}$$

$$\phantom{s^{i*}_j} = P^i_j \oplus [G^i_{j-1:0} + P^i_{j-1:0}\, G^{i-1}_{k-1:0}], \quad 0 \le j < k \tag{4.4}$$

where $C^{(i-1)*}_{out} = G^{i-1}_{k-1:0}$. We employ a carry-select structure to compute the two cases in which
$G^{i-1}_{k-1:0}$ is 1 and 0 before $C^{(i-1)*}_{out}$ is ready:

$$s^{i*}_{j,1} = P^i_j \oplus [G^i_{j-1:0} + P^i_{j-1:0}], \tag{4.5}$$

$$s^{i*}_{j,0} = P^i_j \oplus G^i_{j-1:0}, \quad 0 \le j < k. \tag{4.6}$$

One of the results above is selected and output as the speculative sum bit when $C^{(i-1)*}_{out}$ is
ready, denoted as $C^{(i-1)*}$ in Figure 4.2.
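Equations (4.5) and (4.6) can be illustrated with a bit-level Python sketch of one window adder. Several things here are assumptions made for illustration: the names `window_sums_both` and `select_sum` are hypothetical, the group signals are accumulated serially rather than with a Kogge-Stone tree, and the group signals over the empty range below bit 0 are taken as $G = 0$ and $P = 1$.

```python
def window_sums_both(wa: int, wb: int, k: int):
    """Return (sum for carry-in 0, sum for carry-in 1, group generate) of a k-bit window."""
    p = [((wa ^ wb) >> j) & 1 for j in range(k)]
    g = [((wa & wb) >> j) & 1 for j in range(k)]
    sum0 = sum1 = 0
    group_g, group_p = 0, 1  # assumed group signals for the empty range below bit 0
    for j in range(k):
        sum0 |= (p[j] ^ group_g) << j               # (4.6): carry-in assumed 0
        sum1 |= (p[j] ^ (group_g | group_p)) << j   # (4.5): carry-in assumed 1
        group_g = g[j] | (p[j] & group_g)           # extend group generate to j:0
        group_p = p[j] & group_p                    # extend group propagate to j:0
    return sum0, sum1, group_g


def select_sum(sum0: int, sum1: int, spec_carry_in: int) -> int:
    """Multiplexer: pick the precomputed sum once the speculative carry-in is known."""
    return sum1 if spec_carry_in else sum0
```

Both candidate sums depend only on the window's own inputs, so they can be computed before the speculative carry-in arrives; the returned group generate signal is what the next window would use as its speculative carry-in.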
4.3 Time and space complexity of SCSA 
We first discuss the time complexity of SCSA. Assume we implement the small adders in the
window adder using Kogge-Stone. The critical path of SCSA consists of a traditional
adder and a multiplexer. Thus, the critical path of SCSA is equivalent to that of a $k$-bit
Kogge-Stone adder plus a multiplexer. The complexity of the critical path delay of SCSA
is $O(\log k)$. In contrast, the critical path delay of an $n$-bit traditional adder has at least
complexity $O(\log n)$. Hence, the speculative adder can be significantly faster than the
traditional adder.
Next we estimate the space complexity of SCSA. At each of the $O(\log k)$ steps in a window adder,
there are at most $k$ computations of intermediate group P/G signals. The total number of window
adders is $\lceil n/k \rceil$. Thus, the space complexity of an $n$-bit speculative adder is $O(\lceil n/k \rceil k \log k)$.
In contrast, traditional parallel prefix adders generally have larger area than the speculative
adder. For example, the space complexity of an $n$-bit Kogge-Stone adder is $O(n \log n)$.
Finally, it is worth comparing SCSA with existing speculative adders. The adder in [17], one of the best state-of-the-art speculative adders, has delay complexity $O(\log l)$ and space complexity $O(n \log l)$, where $l$ is the speculative carry chain length, which is similar to the window size in SCSA. The window size k of SCSA is smaller than the speculative carry chain length $l$ of the design in [17] for achieving similar error rates, as will be shown in Table 7.3. Our experiments will also validate that SCSA has smaller area than the speculative adder in [17] with similar performance and error rate. In summary, SCSA can be faster and smaller than the counterpart in [17].
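To make the asymptotic comparison concrete, the sketch below instantiates only the leading terms of the complexity expressions in this section. It is a rough illustration under stated assumptions: the function name `scsa_costs` is hypothetical, logarithms are taken base 2 and rounded up, and constant factors, multiplexers, and wiring are ignored, so the numbers are not area or delay estimates.

```python
import math


def scsa_costs(n, k):
    """Leading-term step and node counts for SCSA and an n-bit Kogge-Stone adder."""
    windows = math.ceil(n / k)
    window_levels = math.ceil(math.log2(k))   # prefix steps inside one window
    full_levels = math.ceil(math.log2(n))     # prefix steps of the n-bit adder
    return {
        "windows": windows,
        "scsa_levels": window_levels,
        "scsa_nodes": windows * k * window_levels,
        "ks_levels": full_levels,
        "ks_nodes": n * full_levels,
    }
```

For n = 512 and k = 17 this gives 31 windows, 5 prefix steps per window against 9 for the full-width adder, and 2635 against 4608 prefix nodes.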
Chapter 5 
Variable latency carry selection adder 
(VLCSA) 
For applications in which errors can be tolerated, the speculative adder can increase performance by introducing a certain level of inaccuracy. However, many applications are not error-tolerant: an incorrect result generated by speculation leads to an incorrect final result. For applications where errors cannot be tolerated, a reliable variable latency adder, called the variable latency carry selection adder (VLCSA), can be built upon the SCSA-based speculative adder by adding error detection and recovery. Error detection flags if speculation is incorrect. Error recovery produces the correct result when error detection flags an error. VLCSA works with one or two cycles of latency for addition, which is similar to [17]. Ideally, the average performance of VLCSA is close to that of the speculative adder when the error rate is low. In this chapter, unsigned random inputs are still assumed.
5.1 Error detection design 
We first develop error detection in VLCSA. Error detection flags if speculation is incorrect. In other words, error detection flags when the speculative carry-in bit of a window adder is incorrect. The error detection block in VLCSA is designed to overestimate the error rate. There are two reasons: (1) Since the exact carry-in bit of the window is a global signal, it incurs nontrivial latency to wait for and check this carry-in bit. Conceptually, the advantage of speculation is lost if error detection requires the exact carry-in bit of the window adder. (2) It is crucial that all errors are detected in VLCSA. Although some correct results are flagged as incorrect, overestimating the error helps guarantee that no error is missed. Besides, the rate of such false alarms can be kept low.
Figure 5.1: Error detection implementation using 2-input AND and OR gates.
In Chapter 3, we discussed the analytical error model for SCSA. This error model accurately describes the event that the speculative carry-in bit of a window adder is incorrect. Thus, we employ this model to implement error detection. The error detection signal, ERR, is stated as:
$$ERR = \sum_{0 \le i < \lceil n/k \rceil - 1} P^{i+1}_{k-1:0}\, G^{i}_{k-1:0} \tag{5.1}$$
where the sum denotes a logical OR, $P^{i+1}_{k-1:0}$ is the group propagate signal of the (i+1)th window adder, and $G^{i}_{k-1:0}$ is the group generate signal of the ith window adder. ERR flags an error if $\exists i$, $0 \le i < \lceil n/k \rceil - 1$, such that $P^{i+1}_{k-1:0} = 1$ and $G^{i}_{k-1:0} = 1$. In other words, ERR flags an error if a $G^{i}_{k-1:0}$ of value 1 affects the carry-in bit of the (i+2)th window adder.
The group propagate and generate signals of the window adders are computed during the speculative addition. Reusing these intermediate values from the speculative adder greatly simplifies the error detection design. Assuming the group propagate and generate signals are obtained from the speculative adder, we only need to combine them. The error detection block is implemented using 2-input AND and OR gates, as shown in Figure 5.1.
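The error detection signal described above, an OR over adjacent windows of the group propagate of window i+1 ANDed with the group generate of window i, can be sketched as follows. This is a behavioral model under stated assumptions, not the gate-level design: the names `window_pg` and `err_flag` are hypothetical, windows are aligned from the least significant bit, and the most significant window may be shorter when k does not divide n.

```python
def window_pg(a, b, n, k):
    """Return (P, G) group signals for each window, least significant first."""
    signals = []
    for lo in range(0, n, k):
        width = min(k, n - lo)             # assumption: top window may be shorter
        wmask = (1 << width) - 1
        aw = (a >> lo) & wmask
        bw = (b >> lo) & wmask
        p = int((aw ^ bw) == wmask)        # every bit of the window propagates
        g = (aw + bw) >> width             # carry-out of the window with carry-in 0
        signals.append((p, g))
    return signals


def err_flag(a, b, n, k):
    """ERR = OR over i of (P of window i+1) AND (G of window i)."""
    pg = window_pg(a, b, n, k)
    terms = [pg[i + 1][0] & pg[i][1] for i in range(len(pg) - 1)]
    # Hardware would reduce these terms with a tree of 2-input OR gates;
    # any() yields the same logical value.
    return int(any(terms))
```

With n = 6 and k = 2, the inputs a = 15 and b = 1 give the window signals (P, G) = (0, 1), (1, 0), (0, 0), so the term for i = 0 is 1 and ERR is raised. For a = 3, b = 1, n = 4, and k = 2, no window propagates and ERR stays 0.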
Next we discuss the time complexity of the error detection block. It takes $\log k$ steps to generate the group P/G signals of the window adder. As shown in Figure 5.1, it takes an additional $\log \lceil n/k \rceil$ steps to produce ERR. Therefore, the critical path delay of error detection has complexity $O(\log \lceil n/k \rceil + \log k)$. In particular, we observe that $\log \lceil n/k \rceil$ is quite small in practice. For example, when n = 512 and k = 17, $\log \lceil n/k \rceil \approx 5$. On the other hand, it takes a constant number of steps to compute the sum bits after generating the group P/G signals in the speculative adder. In VLCSA, error detection has a critical path delay comparable to or even shorter than that of the speculative addition. When error detection flags no error, the actual critical path delay is the longer of the speculative addition and error detection delays, so the benefit of speculation is maintained.
Finally, the space complexity of the error detection block is estimated as follows. At each step, there are at most $\lceil n/k \rceil$ computations, and there are $\log \lceil n/k \rceil$ steps in the error detection block. Thus, the space complexity of the error detection block is $O(\lceil n/k \rceil \log \lceil n/k \rceil)$. This is similar to that of SCSA, but the computation only uses simple gates and requires low area overhead.
Figure 5.2: Area-efficient implementation for error recovery using intermediate results from the speculative adder.
5.2 Error recovery design 
Error recovery produces the correct result when error detection flags an error. The simplest solution is to employ a traditional adder to compute the result when the speculation is incorrect. But this incurs a long delay penalty and a large area overhead, which can be larger than the area of SCSA itself. Alternatively, we employ the intermediate results from SCSA to simplify the error recovery design.
Figure 5.2 shows an area-efficient implementation for error recovery using intermediate results from the speculative adder. The upper part of Figure 5.2 is SCSA. The lower part of Figure 5.2 is a $\lceil n/k \rceil$-bit prefix adder. This prefix adder takes the group propagate and generate (P/G) signals of the window adders as input and computes the correct carry-out bits for all window adders. SCSA has already computed the group P/G signals at each bit position of the window adder. Thus, the correct sum bits of all window adders are computed using the outputs of this prefix adder.
Then we discuss the time complexity of the error recovery block. For this $\lceil n/k \rceil$-bit prefix adder, there are $\log \lceil n/k \rceil$ steps of computation. Thus the time complexity of this prefix adder is $O(\log \lceil n/k \rceil)$. The critical path of the error recovery block runs through the speculative adder and the prefix adder. There are $\log k$ steps for computing the intermediate results in SCSA. Thus, the critical path delay of the error recovery block has complexity $O(\log k + \log \lceil n/k \rceil)$. Since the error recovery block introduces an extra prefix adder, it incurs a nontrivial delay penalty. If the clock cycle is chosen to be slightly longer than the delays of SCSA and the error detection block, we observe that the delay of error recovery can be shorter than two cycles.
Finally, we discuss the space complexity of the error recovery block. There are at most $\lceil n/k \rceil$ computations in each step of the $\lceil n/k \rceil$-bit prefix adder. The total number of steps is $\log \lceil n/k \rceil$. Thus, the space complexity of this prefix adder is $O(\lceil n/k \rceil \log \lceil n/k \rceil)$. The computation in this prefix adder may use complex gates and requires nontrivial area overhead. The major area overhead of VLCSA comes from the error recovery block. We can employ synthesis tools to reduce the area overhead.
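The recovery idea, a prefix computation over the per-window group P/G signals that yields the exact carry into every window, can be sketched in Python as below. This is an illustration under assumptions rather than the synthesized circuit: all function names are hypothetical, a Kogge-Stone style doubling scheme is used for the prefix step, windows are aligned from the least significant bit, and the final window sums are recomputed with integer addition instead of being selected from the precomputed carry-select results.

```python
def window_pg(a, b, n, k):
    """Return (P, G) group signals for each window, least significant first."""
    signals = []
    for lo in range(0, n, k):
        width = min(k, n - lo)             # assumption: top window may be shorter
        wmask = (1 << width) - 1
        aw, bw = (a >> lo) & wmask, (b >> lo) & wmask
        signals.append((int((aw ^ bw) == wmask), (aw + bw) >> width))
    return signals


def prefix_carry_outs(pg):
    """Kogge-Stone style prefix over (P, G) pairs: exact carry-out of each window."""
    p = [x[0] for x in pg]
    g = [x[1] for x in pg]
    d = 1
    while d < len(pg):                     # about log2(number of windows) steps
        new_p, new_g = p[:], g[:]
        for i in range(d, len(pg)):
            new_g[i] = g[i] | (p[i] & g[i - d])
            new_p[i] = p[i] & p[i - d]
        p, g = new_p, new_g
        d *= 2
    return g


def recover_sum(a, b, n, k):
    """Exact n-bit sum rebuilt from window sums and prefix-computed carries."""
    carry_out = prefix_carry_outs(window_pg(a, b, n, k))
    result = 0
    for idx, lo in enumerate(range(0, n, k)):
        width = min(k, n - lo)
        wmask = (1 << width) - 1
        aw, bw = (a >> lo) & wmask, (b >> lo) & wmask
        carry_in = carry_out[idx - 1] if idx > 0 else 0
        result |= ((aw + bw + carry_in) & wmask) << lo
    return result
```

For a = 15, b = 1, n = 6, and k = 2, the prefix step restores the carry that speculation dropped, and `recover_sum` returns 16.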
5.3 Operation of VLCSA 
The implementation of VLCSA is shown in Figure 5.3, which is similar to [17]. When 
inputs A and B are ready, SCSA computes the speculative result, S*. If error detection flags 
no error, ERR = 0, the output signal VALID flags that the speculative result is correct. 
Then the speculative result S* is selected and output as the final result of VLCSA. If error 
detection flags an error, ERR= 1, the output signal STALL indicates that the speculative 
result is incorrect. VLCSA stalls an extra cycle and waits for the correct result generated 
by the error recovery block. The error recovery block uses the intermediate results from 
SCSA and re-computes the correct result. Then the result from error recovery, SREC, is 
selected and output as the final result of VLCSA. 
Figure 5.3: Variable latency adder implementation, similar to [17]
Next we discuss the timing issues in the design of VLCSA. The delay of the error detection block is designed to be similar to that of SCSA. Thus, we know whether the speculative result is correct by the time it is ready, and the speculative result can be selected and output without additional delay. The clock cycle, $T_{clk}$, is slightly longer than the delays of SCSA and the error detection block. The speculative result and the error detection signal, ERR, are computed in a single cycle. The error recovery block produces the correct result within two cycles. If ERR flags no error, the speculative result is correct. Otherwise, error recovery produces the correct result in an additional cycle. The effective cycle of VLCSA, $T_{ave}$, is stated as:
$$T_{ave} = (1 - P_{err})\, T_{clk} + 2 P_{err}\, T_{clk} = (1 + P_{err})\, T_{clk} \tag{5.2}$$
where $P_{err}$ is the error rate of SCSA. If $P_{err}$ is quite small, such as 0.01%, then $T_{ave} \approx T_{clk}$: the average latency of the variable latency adder is only slightly longer than the delays of the speculative adder and the error detection block. This indicates that the average performance of VLCSA is close to that of SCSA.
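The effect of the error rate on the effective cycle can be checked numerically. The sketch below assumes, as the text describes, that an addition takes one clock cycle when no error is flagged and two clock cycles when error recovery is needed; the function name `average_latency` is hypothetical.

```python
def average_latency(t_clk, p_err):
    """Expected latency: one cycle without a flagged error, two cycles with one."""
    return (1.0 - p_err) * t_clk + p_err * 2.0 * t_clk
```

With a normalized clock cycle of 1 and an error rate of 0.01%, the expected latency is 1.0001 cycles. An error rate of 25% would instead give 1.25 cycles, which shows how quickly a high error rate erodes the benefit of speculation.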
Chapter 6 
Modified VLCSA (VLCSA 2) 
6.1 Motivation 
In previous chapters, speculative and variable latency adders were assumed to work mainly on unsigned random inputs, for which SCSA and VLCSA work well. That discussion provides valuable insight for guiding the design of speculative and variable latency adders. However, it is also crucial to evaluate the performance of the variable latency adder for other inputs, because the input distribution largely affects the performance of speculative and variable latency adders.
The key idea of speculative addition is to exploit the carry chain distribution of unsigned random inputs. An example of statistics of carry chain lengths for unsigned random inputs is shown in Figure 6.1. The adder size is 32 bits. We run $10^6$ simulations with unsigned random inputs to gather records. We observe that the portion of carry chains rapidly decreases as the carry chain length increases. There is a gap between unsigned random inputs and practical inputs: the distribution of carry chain lengths in practical applications may be quite different from that of unsigned random inputs. It is therefore meaningful to evaluate the carry chain distribution of practical inputs.
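A carry chain profile of this kind can be gathered with a short simulation. The sketch below is one possible way to do it and is not the profiling code used for the figures: the function names are hypothetical, and it assumes that a carry chain starts at a bit position where both operand bits are 1 (generate) and extends through the consecutive higher positions where exactly one operand bit is 1 (propagate).

```python
import random
from collections import Counter


def carry_chain_lengths(a, b, n):
    """Lengths of all carry chains in the n-bit addition a + b."""
    lengths = []
    i = 0
    while i < n:
        if (a >> i) & (b >> i) & 1:                     # generate bit starts a chain
            j = i + 1
            while j < n and ((a >> j) ^ (b >> j)) & 1:  # propagate bits extend it
                j += 1
            lengths.append(j - i)
            i = j
        else:
            i += 1
    return lengths


def chain_histogram(n, trials, seed=0):
    """Fraction of chains of each length over uniformly random unsigned inputs."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        a = rng.getrandbits(n)
        b = rng.getrandbits(n)
        counts.update(carry_chain_lengths(a, b, n))
    total = sum(counts.values())
    return {length: c / total for length, c in sorted(counts.items())}
```

Calling `chain_histogram(32, 2000)` returns a dictionary whose fractions fall off quickly with the chain length, in line with the observation that long chains are rare for unsigned random inputs.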
Figure 6.1: Example of statistics of carry chain lengths for unsigned random inputs. The adder size is 32 bits.
6.2 Profiling practical inputs 
We start from the difference between unsigned random inputs and practical inputs. There are two important observations for practical inputs [6] [9]: (1) The 2's complement representation is widely used in practice. It is convenient to implement subtraction with addition using the 2's complement representation. (2) Small numbers appear more frequently than large ones, so computations between two small numbers occur with high frequency. Thus, practical inputs can be quite different from unsigned random inputs.
In [6] [9], the distributions of carry chain lengths are extracted from practical workloads. There are nontrivial portions of carry chains with very long lengths. One example from [6] is employed here. The authors calculated the statistics of carry chain lengths in addition from a cryptographic workload, including RSA encryption/decryption (RSA), Elliptic curve ElGamal encryption/decryption over prime fields (ECELGP), Diffie-Hellman key exchange (DH), and Elliptic curve digital signature algorithm over prime fields (ECDSP).
Figure 6.2: Examples of statistics of carry chain lengths from a cryptographic workload [6].
The statistics of carry chain lengths for these benchmarks are shown in Figure 6.2. We find that very long carry chains frequently occur in these benchmarks, and carry chains mainly concentrate in two ranges. This is quite different from the distribution for unsigned random inputs. The speculative addition designed for unsigned random inputs is unable to handle those very long carry chains, and thus the error rate significantly increases. This largely hurts the performance of speculative and variable latency addition.
6.3 Approximating practical inputs 
Practical inputs can be quite different from each other, so it is impractical to capture the characteristics of all practical inputs. We aim to employ a mathematical distribution to approximately profile practical inputs similar to those in [6]. In particular, we observe that Gaussian inputs seem able to reflect the characteristics of practical workloads such as [6]. There are two reasons: (1) Small numbers occur more frequently than large numbers. (2) It is straightforward to introduce the 2's complement representation for Gaussian inputs. We will examine this observation using several examples. The adder size is 32 bits. We run $10^6$ simulations for different inputs to gather records.
Figure 6.3: Example of statistics of carry chain lengths for 2's complement random inputs. The adder size is 32 bits.
We first compare unsigned and 2's complement random inputs by generating the statistics of carry chain lengths. An example of statistics of carry chain lengths for 2's complement random inputs is shown in Figure 6.3. We observe that carry chains still concentrate in the range of short carry chains. As the carry chain length increases, the portion of carry chains rapidly decreases, which is similar to the behavior for unsigned random inputs.
Figure 6.4: Example of statistics of carry chain lengths for unsigned Gaussian inputs. The adder size is 32 bits.
Then we produce the statistics of carry chain lengths for unsigned Gaussian inputs. An example of statistics of carry chain lengths for unsigned Gaussian inputs is shown in Figure 6.4. We see that carry chains still concentrate in the range of short carry chains. As the carry chain length increases, the portion of carry chains rapidly decreases, which is still similar to the behavior for unsigned random inputs.
Next we profile the statistics of carry chain lengths for 2's complement Gaussian inputs. This case combines both the 2's complement representation and Gaussian inputs. An example of statistics of carry chain lengths for 2's complement Gaussian inputs is shown in Figure 6.5. We see that carry chains concentrate in two separate ranges. For 2's complement Gaussian inputs, a nontrivial portion of carry chains is as long as the adder size. Although the frequency of long carry chains in Figure 6.5 seems higher than that in Figure 6.2, the distribution in Figure 6.5 is quite close to that in Figure 6.2. In other words, 2's complement Gaussian inputs capture the characteristics of the inputs in [6]. We propose a modified speculative addition and the corresponding variable latency addition for 2's complement Gaussian inputs below.
Figure 6.5: Example of statistics of carry chain lengths for 2's complement Gaussian inputs. The adder size is 32 bits.
6.4 Key idea of Modified VLCSA 
As shown in Figure 6.2 and Figure 6.5, these long carry chains significantly increase the error rate of speculative addition in VLCSA and make VLCSA even slower than the traditional adder. Therefore, it is necessary to modify the speculative addition and the variable latency addition for practical inputs. In this thesis, we target the design for 2's complement Gaussian inputs. The speculative adder design, SCSA, is called SCSA 1, and the variable latency adder design, VLCSA, is called VLCSA 1. The modified speculative adder design is called SCSA 2, and the modified variable latency adder design is called VLCSA 2.
The key idea of VLCSA 2 is to correctly speculate results when long carry chains like those in Figure 6.5 occur. We design VLCSA 2 based on the following observation: long carry chains are usually triggered by the addition of a small positive and a small negative number, and affect the most significant bit position. The error detection in VLCSA 1 flags these long carry chains as errors since they are much longer than the window size. If we can detect and remove these errors, the error rate will decrease to a low level. In other words, we aim to correctly speculate results and flag no errors when such long carry chains occur. Gaussian inputs in the 2's complement representation are assumed for the design.
Figure 6.6: Window adder implementation in SCSA 2.
6.5 Modified speculative addition 
We first describe the implementation of the modified speculative addition, SCSA 2. The modified window adder design is shown in Figure 6.6. Compared with the window adder in Figure 4.2, an additional speculative result, $S^{i*,1}$, is calculated in Figure 6.6. This speculative result is selected by one of the speculative carry-out bits of the previous window adder (the one computed for a carry-in of 1), which is discarded in SCSA 1. By combining the two speculative results, the speculative adder can correctly calculate the result. In other words, $S^{i*,1}$ is correct when the very long carry chains occur. However, how to select the speculative result still remains a challenge. We will handle this issue in the design of error detection.
Then we discuss the time complexity of SCSA 2. The complexity of the critical path of SCSA 1 is $O(\log k)$. In SCSA 2, we add another speculative result, $S^{i*,1}$. $S^{i*,1}$ has complexity $O(\log k)$, the same as that of $S^{i*,0}$. This indicates that no extra delay is added.
Finally, we estimate the space complexity of SCSA 2. Compared with SCSA 1, the major cost of SCSA 2 is an extra 2-input multiplexer in each window adder. The total number of window adders is $\lceil n/k \rceil$. Therefore, the area overhead of SCSA 2 has complexity $O(\lceil n/k \rceil)$. The space complexity of an n-bit SCSA 2 is still $O(\lceil n/k \rceil k \log k)$.
Figure 6.7: Modified error detection implementation using 2-input AND and OR gates.
6.6 Modified error detection 
Since we aim to detect those long carry chains, a new error detection signal is introduced. We define an additional error detection signal, $ERR_1$, for detecting long carry chains. The error detection signal for detecting short carry chains in VLCSA 1 is denoted $ERR_0$. Note that an error should be detected by both $ERR_1$ and $ERR_0$. $ERR_1$ in VLCSA 2 is stated as:
$$ERR_1 = \sum_{0 \le i < \lceil n/k \rceil - 1} \overline{P}^{i+1}_{k-1:0}\, P^{i}_{k-1:0}$$
where the sum denotes a logical OR, $P^{i}_{k-1:0}$ is the group propagate signal of the ith window, and $\overline{P}^{i+1}_{k-1:0}$ is the complement of the group propagate signal of the (i+1)th window.
The key idea of this error detection is inspired by the discussion in [6]. Let us see why long carry chains can be flagged by $ERR_1$ and errors can be detected by $ERR_0$ and $ERR_1$:
(1) $ERR_0 = 0$. This means that $\forall i$, $0 \le i < \lceil n/k \rceil - 1$, $P^{i+1}_{k-1:0} G^{i}_{k-1:0} = 0$. Only short carry chains occur in this case. The speculative result is correct and is the same as the one in VLCSA 1. It is not necessary to check $ERR_1$.
(2) $ERR_0 = 1$, $ERR_1 = 0$. Here $\exists i$, $0 \le i < \lceil n/k \rceil - 1$, $P^{i+1}_{k-1:0} G^{i}_{k-1:0} = 1$, and $\forall i$, $0 \le i < \lceil n/k \rceil - 1$, $\overline{P}^{i+1}_{k-1:0} P^{i}_{k-1:0} = 0$. $ERR_0$ flags an error. It indicates that there may be carry chains longer than the window size. The speculative result in VLCSA 1 is incorrect. Besides, $ERR_1 = 0$ implies that a long carry chain is generated at a bit position and propagates to the most significant bit (MSB) position. The new speculative result in VLCSA 2 provides the correct result. In other words, if $ERR_0 = 1$ and $ERR_1 = 0$, long carry chains occur but do not incur actual errors.
(3) $ERR_0 = 1$, $ERR_1 = 1$. Here $\exists i$, $0 \le i < \lceil n/k \rceil - 1$, $P^{i+1}_{k-1:0} G^{i}_{k-1:0} = 1$, and $\exists i$, $0 \le i < \lceil n/k \rceil - 1$, $\overline{P}^{i+1}_{k-1:0} P^{i}_{k-1:0} = 1$. $ERR_0$ flags an error. It indicates that there are carry chains longer than the window size. Besides, $ERR_1 = 1$ implies that such an error occurs because a carry chain starts at a bit position and ends before reaching the MSB position. Both speculative results in VLCSA 2 are incorrect. In this case, error recovery is activated and produces the correct result.
Next we discuss the time complexity of the error detection block. It takes $\log k$ steps to generate the group propagate signals of the window adder. As shown in Figure 6.7, it takes an additional $\log \lceil n/k \rceil$ steps to produce $ERR_1$. Therefore, the critical path delay of error detection has complexity $O(\log \lceil n/k \rceil + \log k)$. In VLCSA 2, the error detection block has a critical path delay comparable to or even shorter than that of the speculative addition. When error detection flags no error, the actual critical path delay is the longer of the speculative addition and error detection block delays, so the benefit of speculation is maintained.
Then the space complexity of the error detection block is estimated as follows. At each step, there are at most $\lceil n/k \rceil$ computations, and there are $\log \lceil n/k \rceil$ steps in the error detection block. Thus, the space complexity of the error detection block is $O(\lceil n/k \rceil \log \lceil n/k \rceil)$. This is similar to that of $ERR_0$.
Figure 6.8: VLCSA 2 implementation.
6.7 Operation of VLCSA 2
Based on the discussion, VLCSA 2 is shown in Figure 6.8. There are two speculative results, $S^{*,0}$ and $S^{*,1}$. $S^{*,0}$ ($S^{*,1}$) is the speculative result when the speculative carry-in bit is 0 (1). There are two error detection signals, $ERR_0$ and $ERR_1$. $ERR_0$ flags if there is any carry chain longer than the window size. $ERR_1$ indicates whether such a long carry chain reaches the MSB position ($ERR_1 = 0$) or ends before it ($ERR_1 = 1$). If $ERR_0 = 1$ and $ERR_1 = 1$, the speculative addition is incorrect. Then error recovery provides the correct result, in the same way as in VLCSA 1.
In general, VLCSA 2 works in a similar way to VLCSA 1. The error detection signals, $ERR_0$ and $ERR_1$, are used to select speculative results and flag errors: (1) $ERR_0 = 0$. The output VALID is 1 and STALL is 0. This indicates the speculative result is correct. The speculative result $S^{*,0}$ is selected and output. (2) $ERR_0 = 1$, $ERR_1 = 0$. The output VALID is 1 and STALL is 0. This indicates the speculative result is also correct. The speculative result $S^{*,1}$ is selected and output. (3) $ERR_0 = 1$, $ERR_1 = 1$. The output VALID is 0 and STALL is 1. This implies that the speculative addition is incorrect. The variable latency adder stalls for an additional cycle, and error recovery provides the correct result, $S^{REC}$.
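The three cases above can be exercised with a behavioral model. The Python sketch below is an illustration under stated assumptions, not the thesis design: the function name `vlcsa2_add` is hypothetical; the second speculative result is read as using, for each window, the carry-out of the previous window computed for a carry-in of 1; windows are aligned from the least significant bit with a possibly shorter top window; window 0 receives a carry-in of 0; and error recovery is modeled by an exact integer addition.

```python
def vlcsa2_add(a, b, n, k):
    """Behavioral model of VLCSA 2; returns (sum mod 2**n, cycles)."""
    windows = []
    for lo in range(0, n, k):
        width = min(k, n - lo)                 # assumption: top window may be shorter
        wmask = (1 << width) - 1
        aw = (a >> lo) & wmask
        bw = (b >> lo) & wmask
        p = int((aw ^ bw) == wmask)            # group propagate
        g = (aw + bw) >> width                 # group generate
        windows.append((lo, wmask, aw, bw, p, g))

    s0 = s1 = 0
    cin0 = cin1 = 0                            # assumption: window 0 gets carry-in 0
    for lo, wmask, aw, bw, p, g in windows:
        s0 |= ((aw + bw + cin0) & wmask) << lo
        s1 |= ((aw + bw + cin1) & wmask) << lo
        cin0 = g                               # carry-out assuming carry-in 0
        cin1 = g | p                           # carry-out assuming carry-in 1

    pairs = list(zip(windows, windows[1:]))
    err0 = any(nxt[4] & cur[5] for cur, nxt in pairs)        # P(i+1) and G(i)
    err1 = any((1 - nxt[4]) & cur[4] for cur, nxt in pairs)  # not P(i+1) and P(i)

    if not err0:
        return s0, 1                           # VALID: first speculative result
    if not err1:
        return s1, 1                           # VALID: second speculative result
    return (a + b) & ((1 << n) - 1), 2         # STALL: result from error recovery
```

For the 32-bit addition of 5 and -3 with k = 4, the first error signal is raised by the long chain, the second stays 0 because the chain reaches the MSB, and the model returns the exact sum 2 in one cycle. For a = 15, b = 1, n = 6, and k = 2, the chain ends before the MSB, both signals are 1, and the exact sum 16 takes two cycles.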
Next we discuss the timing issue in the design of VLCSA 2. The delay of the error detection block is designed to be similar to that of the speculative adder. Thus, we know whether the speculative result is correct by the time it is ready, and the speculative result can be output without additional delay. The clock cycle, $T_{clk}$, is slightly longer than the delays of the speculative adder and the error detection block, $T_{clk} > \max(T_{S^{*,0}}, T_{S^{*,1}}, T_{ERR})$. The speculative results and the error detection signal, ERR, are computed in a single cycle. The error recovery block produces the correct result within two cycles, $T_{REC} < 2T_{clk}$. If ERR flags no error, the speculative result is correct. Otherwise, error recovery produces the correct result in an additional cycle. The effective cycle of the design, $T_{ave}$, is stated in the same way as that of VLCSA 1:
$$T_{ave} = (1 - P_{err})\, T_{clk} + 2 P_{err}\, T_{clk} = (1 + P_{err})\, T_{clk} \tag{6.1}$$
where $P_{err}$ is the error rate of the speculative addition. Ideally, if the error rate is tiny, the average performance of VLCSA 2 is also close to that of VLCSA 1.
Compared with VLCSA 1, we add a new speculative result and a new error detection signal in VLCSA 2. This causes extra delay penalty and area overhead. Besides, there is no analytical error rate model for 2's complement Gaussian inputs. For similar adder settings, the error rate for 2's complement Gaussian inputs may be higher than that for unsigned random inputs. We will employ experimental results for 2's complement Gaussian inputs to profile the error rate.
Chapter 7 
Results 
7.1 Simulation setup 
We have implemented C++ programs that take the adder width n and the window size k, and generate Verilog files for the SCSA-based speculative adder, VLCSA 1, and VLCSA 2. Circuits are synthesized with Synopsys Design Compiler using a common standard library for UMC 65 nm CMOS technology.
We first compare the delay and area of the SCSA-based speculative adder, VLCSA 1, and VLCSA 2 with the Kogge-Stone adder and the variable latency adder in [17]. Furthermore, we compare the delay and area of the SCSA-based speculative adder, VLCSA 1, and VLCSA 2 with the adder generated by the DesignWare building block IP [1], called the DesignWare adder.
7.2 Error model validation 
We verify the analytical error model by comparing it with simulation results. The simulation results are obtained by running Monte Carlo simulations with 10 million unsigned random inputs for different adder widths and window sizes. In Figure 7.1, the solid lines are generated using the analytical error model for different adder widths, and the marked points are simulation results. The analytical and experimental results fit quite well for different adder widths and window sizes. Thus, the analytical model accurately predicts simulation results.
Figure 7.1: Comparison of analytical error model for SCSA and simulation results for different adder widths (n). (Axes: window size in bits vs. error rate; curves for n = 64, 128, 256, 512.)
7.3 Error rates for 2's complement Gaussian inputs 
As discussed in Chapter 6, there is no analytical error model for 2's complement Gaussian inputs. We estimate the error rate of speculative addition by running Monte Carlo simulations. For the Gaussian distribution, the mean is $\mu = 0$, and the standard deviation is $\sigma = 2^{32}$.
We first discuss the speculative addition in VLCSA 1. An error occurs if the speculative result is different from that of the traditional addition. We also report the nominal error rate based on error detection, ERR. Simulation results are generated by running Monte Carlo simulations with 1 million 2's complement Gaussian inputs, as shown in Table 7.1. We observe that the error rate is clearly larger than that for unsigned random inputs. This implies that VLCSA 1 becomes much slower for 2's complement Gaussian inputs: in one out of every four computations in VLCSA 1, error recovery has to produce the correct result and incurs an extra delay penalty.

adder width    window size    P_err (Monte Carlo)    P_err (ERR = 1)
64             14             25.01%                 25.01%
128            15             25.01%                 25.01%
256            16             25.01%                 25.01%
512            17             25.01%                 25.01%

Table 7.1: Experimental and nominal error rates in VLCSA 1 for 2's complement Gaussian inputs. $\mu = 0$, $\sigma = 2^{32}$.
adder width    window size    P_err (Monte Carlo)    P_err (ERR0 = 1, ERR1 = 1)
64             14             0.01%                  0.01%
128            15             0.01%                  0.01%
256            16             0.01%                  0.01%
512            17             0.01%                  0.01%

Table 7.2: Experimental and nominal error rates in VLCSA 2 for 2's complement Gaussian inputs. $\mu = 0$, $\sigma = 2^{32}$.
Then we discuss the speculative addition in VLCSA 2. There are two speculative results in VLCSA 2. If either speculative result is the same as the traditional result, the speculation is correct. We also report the nominal error rate based on the error detection signals, $ERR_0$ and $ERR_1$. The simulation results are also generated by running Monte Carlo simulations with 1 million Gaussian inputs, as shown in Table 7.2. We observe that the error rate is clearly smaller than that of VLCSA 1 and is similar to that for unsigned random inputs. In other words, the error rate effectively reduces from 25.01% in VLCSA 1 to 0.01% in VLCSA 2 for the same 2's complement Gaussian inputs. This implies that VLCSA 2 can successfully handle 2's complement Gaussian inputs by introducing an additional speculative result and error detection signal. The performance of VLCSA 2 can be close to that of VLCSA 1.
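A Monte Carlo estimate of the nominal error rates can be sketched as follows. This is a simplified illustration and not the simulation code behind the tables: the function names are hypothetical, the Gaussian operands are obtained by rounding `random.gauss` samples and encoding them in n-bit 2's complement, windows are aligned from the least significant bit, and only the error detection signals are evaluated (the speculative sums are not compared against the exact sum).

```python
import random


def error_flags(a, b, n, k):
    """Return (ERR0, ERR1) from the per-window group P/G signals."""
    pg = []
    for lo in range(0, n, k):
        width = min(k, n - lo)             # assumption: top window may be shorter
        wmask = (1 << width) - 1
        aw = (a >> lo) & wmask
        bw = (b >> lo) & wmask
        pg.append((int((aw ^ bw) == wmask), (aw + bw) >> width))
    err0 = any(pg[i + 1][0] & pg[i][1] for i in range(len(pg) - 1))
    err1 = any((1 - pg[i + 1][0]) & pg[i][0] for i in range(len(pg) - 1))
    return int(err0), int(err1)


def nominal_error_rates(n, k, sigma, trials, seed=0):
    """Estimate how often VLCSA 1 and VLCSA 2 would invoke error recovery."""
    rng = random.Random(seed)
    mask = (1 << n) - 1
    stalls1 = stalls2 = 0
    for _ in range(trials):
        a = int(round(rng.gauss(0.0, sigma))) & mask   # 2's complement encoding
        b = int(round(rng.gauss(0.0, sigma))) & mask
        err0, err1 = error_flags(a, b, n, k)
        stalls1 += err0                    # VLCSA 1 recovers whenever ERR0 = 1
        stalls2 += err0 & err1             # VLCSA 2 recovers only if both are 1
    return stalls1 / trials, stalls2 / trials
```

Calling `nominal_error_rates(64, 14, 2.0**32, 4000)` returns two fractions: the first should land near one quarter, since roughly one in four operand pairs has opposite signs and a non-negative sum, and the second should be far smaller. The exact values depend on the seed and on the sampling assumptions above.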
7.4 Comparison with existing variable latency adders 
adder width    window size    carry chain length [17]
64             14             17
128            15             18
256            16             20
512            17             21

Table 7.3: Parameters of SCSA and the speculative adder in [17] for an error rate of 0.01%, according to analytical error models and simulation results.
It is worthwhile to compare the proposed adders with the traditional adder and existing variable latency adders. The adder in [17], called the variable latency speculative adder (VLSA), is one of the best state-of-the-art variable latency adders and is employed here. VLSA is designed for unsigned random inputs. We compare the delays and areas of the Kogge-Stone adder, VLSA, and VLCSA 1 for unsigned random inputs. The parameters of the speculative adders are shown in Table 7.3 for an error rate of 0.01%. The two small adders of the window adder in VLCSA 1 are implemented as Kogge-Stone adders. In all cases, we optimize for minimal delay during synthesis.
7.4.1 Speculative addition in VLCSA 1 VS speculative addition in VLSA 
The speculative addition in VLCSA 1 is SCSA 1. As shown in Figure 7.2, for an error rate of 0.01%, the critical path delay of SCSA 1 is 18 to 38% lower than that of the Kogge-Stone adder. For the same error rate, the critical path delay of SCSA 1 is similar to that of the speculative addition in VLSA. This indicates that SCSA 1 can perform quite well.
Figure 7.2: Comparison of delay of speculative adders and Kogge-Stone adder. (Axes: adder width in bits vs. delay; bars for Kogge-Stone, speculative in VLSA, and speculative in VLCSA 1.)
Figure 7.3: Comparison of area of speculative adders and Kogge-Stone adder. (Axes: adder width in bits vs. area; bars for Kogge-Stone, speculative in VLSA, and speculative in VLCSA 1.)
As shown in Figure 7.3, for an error rate of 0.01%, the area of SCSA 1 is 15 to 38% lower
than that of the Kogge-Stone adder. For the same error rate, the area of the speculative
adder in VLSA differs from that of the Kogge-Stone adder by -20 to 8%. The area of SCSA 1 is
smaller than that of the speculative adder in VLSA at every bitwidth. This is mainly because
SCSA 1 speculates at the level of the window, whereas VLSA speculates at the level of the
individual bit position.
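The window-level speculation discussed above can be illustrated with a small behavioral
model. This is a minimal sketch, not the thesis's circuit: it assumes that the carry into
each window is taken from the window directly below it, computed as if that lower window
had a carry-in of 0. The function name `scsa_add` and its parameters are hypothetical.

```python
def scsa_add(a: int, b: int, n: int, k: int) -> int:
    """Speculative n-bit sum of a and b using k-bit windows (behavioral model).

    Assumption: the carry into window i comes from window i-1 alone,
    evaluated as if window i-1 had a carry-in of 0.
    """
    result = 0
    spec_cin = 0  # the carry into the lowest window is known to be 0
    for lo in range(0, n, k):
        width = min(k, n - lo)            # the top window may be narrower
        wmask = (1 << width) - 1
        wa = (a >> lo) & wmask
        wb = (b >> lo) & wmask
        # Carry select: both candidate sums exist; the speculated carry picks one.
        sum0 = wa + wb
        sum1 = wa + wb + 1
        chosen = sum1 if spec_cin else sum0
        result |= (chosen & wmask) << lo
        # The carry passed upward ignores this window's real carry-in.
        spec_cin = sum0 >> width
    return result
```

With 4-bit windows, `scsa_add(0x0F, 0x01, 8, 4)` matches the exact sum 0x10, while
`scsa_add(0x0FF, 0x001, 12, 4)` returns 0x000 instead of 0x100: the middle window is
all-propagate, so the carry that ripples through it is invisible to the top window.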
7.4.2 VLCSA 1 vs VLSA
We compare the delays of the Kogge-Stone adder, VLCSA 1 and VLSA. A variable latency adder
has three delays: speculative addition, error detection and error recovery. As shown in
Figure 7.4, the critical path delay of the speculative adder in VLSA is 12 to 27% shorter
than that of the Kogge-Stone adder. However, the critical path delay of its error detection
block is 4 to 8% higher than that of the speculative adder, offsetting the advantage of
speculation. The delay of the error recovery block is less than twice the longer of the
speculative addition and error detection delays. In contrast, error detection and
speculative addition in VLCSA 1 have almost the same critical path delay, which is 14 to 36%
shorter than that of the Kogge-Stone adder. The simple error detection circuitry in our
design accounts for this low latency. Overall, the critical path delay of VLCSA 1 is 6 to
19% lower than that of VLSA when speculation is correct. The critical path delay of the
error recovery block in VLCSA 1 is likewise less than twice the longer of the speculative
addition and error detection delays.
Figure 7.4: Comparison of delay of variable latency adders and Kogge-Stone adder. (Bar
chart; x-axis: adder width in bits, 64 to 512; y-axis: delay; series: Kogge-Stone, and
speculation, error detection and error recovery for each of VLSA and VLCSA 1.)
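To make the role of error detection concrete, the following sketch shows one plausible
window-level detection rule. It is an assumption for illustration, not the circuit used in
VLCSA 1: speculation is flagged as failed when some window is all-propagate while the
window below it produces a carry (computed with carry-in 0). The name
`speculation_may_fail` is hypothetical.

```python
def speculation_may_fail(a: int, b: int, n: int, k: int) -> bool:
    """Flag n-bit additions whose window-level carry speculation is wrong.

    Assumed rule: a window that is all-propagate and receives a carry from the
    window below hides that carry from the window above it.
    """
    starts = list(range(0, n, k))
    prev_cout0 = 0  # carry-out of the window below, computed with carry-in 0
    for idx, lo in enumerate(starts):
        width = min(k, n - lo)
        wmask = (1 << width) - 1
        wa = (a >> lo) & wmask
        wb = (b >> lo) & wmask
        all_propagate = (wa ^ wb) == wmask
        is_top = idx == len(starts) - 1
        # The top window has no window above it, so it cannot corrupt sum bits.
        if all_propagate and prev_cout0 and not is_top:
            return True
        prev_cout0 = (wa + wb) >> width
    return False
```

The rule needs only one AND per window and a wide OR, which is consistent with the short
detection delay reported above, though the actual gate-level design may differ.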
Figure 7.5: Comparison of area of variable latency adders and Kogge-Stone adder. (Bar chart;
x-axis: adder width in bits; y-axis: area; series: Kogge-Stone, VLSA, VLCSA 1.)
As shown in Figure 7.5, the area of VLSA is 14 to 32% larger than that of the Kogge-Stone
adder. This is due to the area overhead of the error detection and recovery blocks. In
contrast, the area overhead of VLCSA 1 over the Kogge-Stone adder ranges from -6 to 17%.
In particular, the area of VLCSA 1 is 6% smaller than that of the Kogge-Stone adder when
the bitwidth is 512. In other words, VLCSA 1 can be both smaller and faster than the
Kogge-Stone adder.
Adder width    Window size (Perr = 0.01%)    Window size (Perr = 0.25%)
64             14                            10
128            15                            11
256            16                            12
512            17                            13
Table 7.4: Parameters of SCSA and VLCSA 1 for the error rates of 0.01% and 0.25%,
according to analytical error models and simulation results.
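A rough back-of-the-envelope estimate helps explain why the window size in Table 7.4 grows
by one bit each time the adder width doubles. The sketch below is not the analytical error
model developed earlier in the thesis. It assumes uniform random inputs, carries speculated
from the adjacent lower window only, and a union bound over windows; the function name is
hypothetical.

```python
def estimated_error_rate(n: int, k: int) -> float:
    """Union-bound estimate of the mis-speculation probability (assumed model)."""
    windows = (n + k - 1) // k
    # The lowest window has an exact carry-in, and the window above it sees an
    # exact carry, so only the remaining windows can receive a wrong carry.
    exposed = max(windows - 2, 0)
    # A window passes a wrong carry upward when it is all-propagate
    # (probability 2**-k) and a carry actually enters it (about 1/2).
    return exposed * 2.0 ** -(k + 1)


# Window sizes copied from Table 7.4: width -> (k at 0.01%, k at 0.25%).
TABLE_7_4 = {64: (14, 10), 128: (15, 11), 256: (16, 12), 512: (17, 13)}

for n, (k_low, k_high) in TABLE_7_4.items():
    low = estimated_error_rate(n, k_low)
    high = estimated_error_rate(n, k_high)
    print(f"n={n}: {low:.4%} (k={k_low}), {high:.4%} (k={k_high})")
```

Under these assumptions the estimates land close to the nominal 0.01% and 0.25% for every
row: doubling the width roughly doubles the number of exposed windows, and one extra window
bit halves the per-window probability.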
7.5 Comparison with DesignWare adder
The Synopsys DesignWare building block IP is a collection of highly optimized reusable
IP blocks that can quickly provide desirable designs during synthesis [1]. In particular,
it can generate high-quality adder designs optimized for timing, area and power [19]. We
synthesize the adder generated by the DesignWare building block IP for the minimal
achievable delay and refer to it as the DesignWare adder. We also implemented a hybrid
Kogge-Stone carry-select adder and observed that the DesignWare adder is faster than the
hybrid one. Further details of the DesignWare building block IP can be found in [1].
We compare SCSA, VLCSA 1 and VLCSA 2 with the DesignWare adder. Note that SCSA here
stands for the SCSA-based speculative adder design. The parameters of SCSA are reported
in Table 7.4 for error rates of 0.01% and 0.25%. The two small adders of each window adder
are implemented using the DesignWare IP block. We also discuss the effect of different error
rates. During synthesis, we target a 10% reduction in critical path delay and zero area
overhead relative to the DesignWare adder.
7.5.1 Speculative addition in VLCSA 1 vs DesignWare adder
The speculative addition in VLCSA 1 is SCSA 1. As shown in Figure 7.6, the critical path
delays of SCSA 1 are 10% lower than those of the DesignWare adder for error rates of 0.01%
and 0.25%. In other words, if a certain level of inaccuracy is acceptable, SCSA 1 is faster
than the DesignWare adder.
Figure 7.6: Comparison of delay of speculative addition in VLCSA 1 and DesignWare adder.
(Bar chart; x-axis: adder width in bits, 64 to 512; y-axis: delay; series: DesignWare,
speculative with Perr = 0.01%, speculative with Perr = 0.25%.)
As shown in Figure 7.7, for an error rate of 0.01%, the area of SCSA 1 is up to 43% smaller
than that of the DesignWare adder as the adder width increases. For an error rate of 0.25%,
the area of SCSA 1 is 21 to 56% smaller than that of the DesignWare adder.
SCSA 1 with a lower error rate has a larger area than SCSA 1 with a higher error rate.
Thus, there is a tradeoff between the error rate and area: when an application is more
error-tolerant, a slightly higher error rate can be accepted in exchange for a clear
reduction in area. Similarly, there is a tradeoff between the error rate and delay.
7.5.2 VLCSA 1 vs DesignWare adder
Next we compare VLCSA 1 with the DesignWare adder. The parameters of the speculative
adder in VLCSA 1 also follow Table 7.4. As the delays of the speculative adder and the
error detection block are designed to be very close, we show only one of them for simplicity.
Figure 7.7: Comparison of area of speculative addition in VLCSA 1 and DesignWare adder.
(Bar chart; x-axis: adder width in bits, 64 to 512; y-axis: area; series: DesignWare,
speculative with Perr = 0.01%, speculative with Perr = 0.25%.)
The longer of the critical path delays of the speculative adder and the error detection
block is reported as the "correctly speculated" delay.
As shown in Figure 7.8, the critical path delays of VLCSA 1 are 10% lower than those
of the DesignWare adder when speculation is correct. The critical path delays of the error
recovery block are lower than twice the "correctly speculated" delays.
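The relationship between these delays can be written down as a small timing model. This is
a sketch under stated assumptions rather than the thesis's own accounting: it assumes the
clock period equals the "correctly speculated" delay and that a mis-speculated addition
completes in `error_cycles` cycles, with 2 chosen here because the recovery delay is below
twice that period. All function names and numeric values are hypothetical.

```python
def correctly_speculated_delay(t_spec: float, t_detect: float) -> float:
    """Clock period when speculation is correct: the slower of the two blocks."""
    return max(t_spec, t_detect)


def recovery_fits(t_recover: float, t_spec: float, t_detect: float,
                  cycles: int = 2) -> bool:
    """Check that error recovery completes within the given number of cycles."""
    return t_recover < cycles * correctly_speculated_delay(t_spec, t_detect)


def average_cycles(p_err: float, error_cycles: int = 2) -> float:
    """Expected cycles per addition: one if correct, error_cycles otherwise (assumed)."""
    return (1.0 - p_err) + p_err * error_cycles
```

For example, with hypothetical delays of 0.30 ns for speculation and 0.32 ns for detection,
the clock period is 0.32 ns and a 0.60 ns recovery path fits in two cycles. At an error
rate of 0.01% the expected cost under this model is 1.0001 cycles per addition.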
As shown in Figure 7.9, for an error rate of 0.25% (0.01%), VLCSA 1 has an area overhead
of -19 to 16% (-6 to 42%) over the DesignWare adder. As the adder width increases, the area
overhead of VLCSA 1 over the DesignWare adder decreases. If the error rate is 0.25% instead
of 0.01%, we save 17% of the area on average at the cost of a 0.12% increase in the average
cycle count. For example, we may choose an error rate of 0.25% to save significant area
while giving up only a little performance. This tradeoff between the error rate and area is
therefore a useful lever for reducing area.
Figure 7.8: Comparison of delay of VLCSA 1 and DesignWare adder. (Bar chart; x-axis: adder
width in bits, 64 to 512; y-axis: delay; series: DesignWare, and correctly speculated and
error recovery delays for Perr = 0.01% and Perr = 0.25%.)
Adder width    Window size (Perr = 0.01%)    Window size (Perr = 0.25%)
64             13                            9
128            13                            9
256            13                            9
512            13                            9
Table 7.5: Parameters of VLCSA 2 for the error rates of 0.01% and 0.25%, according to
simulation results. μ = 0, σ = 2^32.
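One way to see why 2's complement Gaussian inputs stress window-level speculation is to
measure how long the propagate runs of such operands are. The sketch below assumes zero
mean and a standard deviation of 2^32, samples operand pairs, and reports the longest run
of propagate bits (a XOR b = 1) in each pair. It illustrates the input distribution only
and does not reproduce the thesis's simulation setup; all names are hypothetical.

```python
import random


def to_twos_complement(x: int, n: int) -> int:
    """Encode a signed integer as an n-bit two's complement pattern."""
    return x & ((1 << n) - 1)


def longest_propagate_run(a: int, b: int, n: int) -> int:
    """Length of the longest run of bit positions where a and b differ."""
    p = (a ^ b) & ((1 << n) - 1)
    best = run = 0
    for i in range(n):
        if (p >> i) & 1:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def sample_runs(n: int, sigma: float, trials: int, seed: int = 0) -> list:
    """Longest propagate run for each of `trials` Gaussian operand pairs."""
    rng = random.Random(seed)
    runs = []
    for _ in range(trials):
        a = to_twos_complement(round(rng.gauss(0.0, sigma)), n)
        b = to_twos_complement(round(rng.gauss(0.0, sigma)), n)
        runs.append(longest_propagate_run(a, b, n))
    return runs
```

When the operands have opposite signs, the sign-extension bits of one are all 0 and those
of the other are all 1, so the upper bits form one long propagate run regardless of the
adder width. A propagate run is only a potential carry chain, but its length shows why a
window that looks at a single neighbor can mis-speculate often on such inputs.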
7.5.3 VLCSA 2 vs DesignWare adder
Finally, we compare VLCSA 2 with the DesignWare adder. The error rates of the speculative
addition in VLCSA 2 are obtained from simulation. The parameters of the speculative adder
in VLCSA 2 are reported in Table 7.5.
As shown in Figure 7.10, the critical path delays of VLCSA 2 are 10% lower than those
of the DesignWare adder when speculation is correct. If the error rate is tiny, the average
performance of VLCSA 2 is close to that of the speculative adder.
Figure 7.9: Comparison of area of VLCSA 1 and DesignWare adder. (Bar chart; x-axis: adder
width in bits, 64 to 512; y-axis: area; series: DesignWare, variable latency with
Perr = 0.01%, variable latency with Perr = 0.25%.)
As shown in Figure 7.11, for an error rate of 0.25% (0.01%), VLCSA 2 has an area overhead
of -17 to 29% (1 to 62%) over the DesignWare adder. The area overhead of VLCSA 2 is larger
than that of VLCSA 1 because of the additional circuitry for speculative addition and
error detection. As the adder width increases, the area overhead of VLCSA 2 over the
DesignWare adder decreases. Similarly, if the error rate is 0.25% instead of 0.01%, we save
20% of the area on average at the cost of a 0.12% increase in the average cycle count.
Figure 7.10: Comparison of delay of VLCSA 2 and DesignWare adder. (Bar chart; x-axis: adder
width in bits; y-axis: delay; series: DesignWare, and correctly speculated and error
recovery delays for Perr = 0.01% and Perr = 0.25%.)
Figure 7.11: Comparison of area of VLCSA 2 and DesignWare adder. (Bar chart; x-axis: adder
width in bits, 64 to 512; y-axis: area; series: DesignWare, variable latency with
Perr = 0.01%, variable latency with Perr = 0.25%.)
Chapter 8 
Conclusion 
In this thesis, we propose a novel function speculation technique, called speculative carry
select addition (SCSA). We develop an analytical error model for unsigned random inputs.
We then develop a speculative adder design based on SCSA. Simulation results show that,
for an error rate of 0.01% (0.25%), the SCSA-based speculative adder can be 10% faster and
43% (56%) smaller than the fastest adder generated by the DesignWare building block IP,
called the DesignWare adder.
Next we present a reliable variable latency adder design for unsigned random inputs that
augments the speculative adder with error detection and recovery, called variable latency
carry select adder 1 (VLCSA 1). Simulation results show that the critical path delay of
VLCSA 1 is 10% lower than that of the DesignWare adder when speculation is correct. For
an error rate of 0.25% (0.01%), VLCSA 1 has an area overhead of -19 to 16% (-6 to 42%)
over the DesignWare adder.
Furthermore, we develop a modified variable latency adder design suitable for both
unsigned random and 2's complement Gaussian inputs, called VLCSA 2. The key idea
of VLCSA 2 is to speculate results correctly even when long carry chains occur. To this
end, we redesign the speculative adder and the error detection block in VLCSA 2. Simulation
results show that the critical path delay of VLCSA 2 is still 10% lower than that of the
DesignWare adder when speculation is correct. For an error rate of 0.25% (0.01%), VLCSA 2
has an area overhead of -17 to 29% (1 to 62%) over the DesignWare adder.
In summary, simulation results suggest that the proposed speculative adder can be faster
and smaller than the DesignWare adder for very low error rates. The reliable variable
latency adder can outperform the DesignWare adder in both delay and area. In addition, the
proposed speculative and reliable variable latency adders are smaller than one of the best
speculative and reliable variable latency adders [17] for similar design settings. The
proposed reliable variable latency adder is also faster than its counterpart in [17].
In the future, we plan to generalize speculative and reliable variable latency carry select
addition to floating-point numbers and to other arithmetic operations such as multiplication
and multi-operand addition. We also plan to apply speculative and reliable variable
latency carry select addition to applications such as digital signal processing.
Bibliography 
[1] DesignWare building block IP user guide. http://www.synopsys.com/dw/dwlibdocs.php.
[2] T. Austin et al. Opportunities and challenges for better than worst-case design. In 
Proc. Asia and South Pacific Design Automation Conference, pages 2-7, 2005. 
[3] D. Bañeres et al. Variable-latency design by function speculation. In Proc. Design,
Automation and Test in Europe, pages 1704-1709, 2009.
[4] L. Benini et al. Telescopic units: A new paradigm for performance optimization of 
VLSI designs. IEEE Trans. Computer-aided Design, 17(3):220-232, 1998. 
[5] L. Chakrapani et al. Highly energy and performance efficient embedded computing 
through approximately correct arithmetic. In Proc. Intl. Conference on Compilers, 
Architectures and Synthesis for Embedded Systems, pages 187-196, 2008. 
[6] A. Cilardo. A new speculative addition architecture suitable for two's complement 
operations. In Proc. Design, Automation and Test in Europe, pages 664-669, 2009. 
[7] D. Ernst et al. Razor: a low-power pipeline based on circuit-level timing speculation. 
In Proc. Intl. Symposium on Microarchitecture, pages 7-18, 2003. 
[8] R. Hegde and N. R. Shanbhag. Soft digital signal processing. IEEE Trans. VLSI 
Systems, 9(6):813-823, 2001. 
[9] D. Kelly and J. Phillips. Arithmetic data value speculation. In Advances In Computer 
Systems Architecture, Lecture Notes in Computer Science, pages 353-366, 2005. 
[10] I. Koren. Computer Arithmetic Algorithms. A K Peters, Ltd., 2002. 
[11] T. Liu and S. Lu. Performance improvement with circuit-level speculation. In Proc. 
Intl. Symposium on Microarchitecture, pages 348-355, 2000. 
[12] Y. Liu et al. Design methodology of variable latency adders with multistage function 
speculation. In Proc. Intl. Symposium on Quality Electronic Design, pages 824-830, 
2010. 
[13] S. Lu. Speeding up processing with approximation circuits. In Computer, volume 37, 
pages 67-73, 2004. 
[14] S. M. Nowick et al. Speculative completion for the design of high-performance
asynchronous dynamic adders. In Proc. of 3rd Int. Symp. on Advanced Research in
Asynchronous Circuits and Systems, pages 210-223, 1997.
[15] B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University
Press, New York, 2000.
[16] Y.S. Su et al. An efficient mechanism for performance optimization of variable-
latency designs. In Proc. Design Automation Conference, pages 976-981, 2007. 
[17] A. K. Verma et al. Variable latency speculative addition: a new paradigm for
arithmetic circuit design. In Proc. Design, Automation and Test in Europe, pages
1250-1255, 2008.
[18] N. Zhu et al. An enhanced low-power high-speed adder for error-tolerant application. 
In Proc. 12th International Symposium on Integrated Circuits, pages 69-72, 2009. 
[19] R. Zimmermann. Datapath Synthesis for Standard-Cell Design. In Proc. Intl. Symposium
on Computer Arithmetic, pages 207-211, 2009.
