Efficient Carry Select Adder Design for FPGA Implementation  by U, Sajesh Kumar & K, Mohamed Salih K
Procedia Engineering 30 (2012) 449 – 456
1877-7058 © 2011 Published by Elsevier Ltd.
doi:10.1016/j.proeng.2012.01.884
International Conference on Communication Technology and System Design 2011 
Efficient Carry Select Adder Design for FPGA 
Implementation 
Sajesh Kumar U (a,*), Mohamed Salih K K (b)
(a) Govt. College of Engineering, Kannur, Kerala, India
(b) Govt. Engineering College Thrissur, Kerala, India
Abstract 
Designing digital adders with optimum area and speed is an important area of research in VLSI system design. This
paper discusses the efficient implementation of a parallel adder with optimized area and propagation delay for
FPGA applications. Our approach combines the carry select adder configuration with a parallel adder to implement
a fast adder. There are several ways to implement a carry select adder; we compare some of these methods and
choose the one appropriate for FPGA implementation. Two different approaches to implementing a linear carry
select adder using the Kogge Stone configuration are discussed and compared in terms of area and speed. Both
approaches are implemented on an FPGA and their performance is compared. Simulation results are used to verify
the theory.
 
 
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of ICCTSD 2011 
 
Keywords: CSA; RCA; MUX; CSAS; Kogge-Stone
 
1. Introduction
 
The design of high speed digital adders with efficient area and power is one of the important areas of
research in VLSI system design. In digital adder circuits, the speed of addition is limited by the time
required for a carry to propagate through the adder. The CSA (carry select adder) is used in many
arithmetic systems to solve the problem of carry propagation delay by independently generating multiple
carries and then selecting a carry to generate the final sum [1][2]. However, the CSA is not area efficient,
because it uses multiple pairs of adders to generate partial sums and carries for the carry inputs
Cin=0 and Cin=1; the final sum and carry are then selected by multiplexers.
 
* Sajesh Ulayil Tel.: +91-4972780227; fax: +91-40-24193067. 
E-mail address: sajesh@gcek.ac.in. 
Open access under CC BY-NC-ND license.
Another way is to express the addition as a prefix computation. Several examples of such adders have been
published and there are many efficient implementations. The Kogge and Stone [3] scheme uses the
idempotency property to limit the lateral logical fan-out at each node to unity, but at the cost of a dramatic
increase in the number of lateral wires at each level. Ladner and Fischer [4] introduced the minimum-depth
prefix graph. The longest lateral fanning wires go from a node to n/2 other nodes. Capacitive fan-out loads
become large for later levels in the graph as increasing logical fan-out combines with increasing span of
the wires. Buffering inverters have to be added appropriately to support these large loads, with a
corresponding increase in delay. Brent and Kung [5] proposed fan-out trees such that the lateral fan-out
of each node is restricted to unity, as for the Kogge-Stone graph, but without the explosion of wires.
Although this looks attractive, it increases the logical depth. Han and Carlson [6] give a good overview of
prefix addition formulations and present their own hybrid synthesis of the Ladner-Fischer and Kogge-Stone
graphs. Again this trades an increase in logical depth for a reduction in fan-out. Kowalczuk, Tudor
and Mlynek [7] achieve a similar compromise by serializing the prefix computation occurring at the
higher fan-out nodes, and Beaumont-Smith and Burgess [8] combine this idea with the Han-Carlson scheme.
All these latter papers allow the logical depth, and hence the delay, to increase in exchange for reductions in
fan-out or wire flux. Knowles [9] showed that fan-out and wire flux gains are available without increasing
the logical depth from the minimum used in the Ladner-Fischer and Kogge-Stone structures. Many
publications [10] [11] [12] compare different parallel tree adders and show the
advantages of these adders in terms of area, fan-out and wire tracks.
Many carry select adder approaches are available, but most of them use ripple carry adders [13]
to implement the adder. Behnam Amelifard et al. [14] suggested a new adder called the carry select adder
with sharing (CSAS), which is area efficient but has a larger delay. M. Alioto et al. [15] suggested
variable block sizing depending on the MUX delay. Some papers [16] [17] [18] suggested using an add-one
circuit to eliminate the second adder required by the CSA for the Cin=1 condition. Feng Liu et al. [19]
compare different parallel prefix adders in IBM's EAC adder, which is implemented using the Kogge Stone
tree. The authors showed that the Han-Carlson and Knowles configurations are the best compromise between
speed and area.
This paper illustrates the carry select adder approach with a Kogge Stone implementation to achieve
minimum delay and reduced area without increasing the fan-out or lateral wires. Section 2 compares
the RCA and the Kogge Stone adder. Section 3 discusses the two different methods used for
realizing the carry select adder using the Kogge Stone tree. Section 4 presents the FPGA implementation
of the CSA and compares it with the Kogge Stone adder in terms of area and speed.
2. Adder approaches 
2.1 Ripple carry adder 
 
The conventional ripple carry adder is implemented using the equation $C_n = G_{n-1} + P_{n-1} C_{n-1}$ (Fig. 1).
Each stage uses the carry output of the previous stage, so the delay grows considerably as the number of
bits increases. This adder uses the minimum number of logic gates but has the largest worst-case delay.
It has a regular layout and uses 5 logic gates per bit. For an n-bit adder the total number of logic gates
is 5n and the delay is 2n+2 unit gate delays. For area calculation only two-input AND, OR and XOR gates
are considered. AND and OR are modeled with unit delay and XOR with 2 units of delay.
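
The gate count and delay quoted above can be checked with a small behavioural model. The following Python sketch is only an illustration under the stated delay model (AND/OR = 1 unit, XOR = 2 units); the function names `ripple_carry_add` and `rca_delay` are hypothetical and not taken from the paper.

```python
def ripple_carry_add(a, b, cin, n):
    """Add two n-bit integers bit by bit using c[i+1] = g[i] + p[i]*c[i].

    Returns (sum, carry_out, gate_count), counting 5 two-input gates per bit.
    """
    carry = cin
    total = 0
    gates = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        p = ai ^ bi                # propagate (XOR)
        g = ai & bi                # generate (AND)
        s = p ^ carry              # sum bit (XOR)
        carry = g | (p & carry)    # next carry (AND followed by OR)
        total |= s << i
        gates += 5
    return total, carry, gates


def rca_delay(n):
    """Worst-case delay in unit gate delays (AND/OR = 1, XOR = 2)."""
    t_propagate = 2                              # every p[i] is ready after one XOR
    t_last_carry_in = t_propagate + 2 * (n - 1)  # AND + OR for each earlier stage
    return t_last_carry_in + 2                   # final XOR for the top sum bit


if __name__ == "__main__":
    print(ripple_carry_add(9, 7, 0, 4))   # 9 + 7 = 16: sum bits 0, carry out 1
    print(rca_delay(4), rca_delay(16))
```

For n = 4 the model gives 20 gates and a delay of 10 units, in line with the 5n and 2n+2 expressions.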
 
[Schematic: gate-level 4-bit ripple carry adder with inputs a0 to a3, b0 to b3 and cin, internal signals p0 to p3, g0 to g3 and c1 to c3, and outputs s0 to s3 and cout]

Fig. 1. Ripple carry adder 
 
2.2 Parallel adder 
 
In the parallel adder, the carry equations are expanded to reduce the delay at the expense of the number of
logic gates. The carry equations used for this approach follow the tree shown in Fig. 2.
c1= g0+p0Cin 
c2= (g1+p1g0) + p1p0Cin 
c3= (g2+p2g1) + p2p1c1 
c4= (g3+p3g2) + p3p2 (g1+p1g0) + p3p2 p1p0Cin 
c5= (g4+p4g3) + p4p3 (g2+p2g1) + p4p3 p2p1c1 
c6= (g5+p5g4) + p5p4 (g3+p3g2) + p5p4 p3p2c2 
c7= (g6+p6g5) + p6p5 (g4+p4g3) + p6p5 p4p3c3 
c8= (g7+p7g6) + p7p6 (g5+p5g4) + p7p6 p5p4 [(g3+p3g2) + p3p2 (g1+p1g0)] + p7p6 p5p4 p3p2 p1p0Cin 
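
As a rough check of these expansions, the prefix computation can be sketched in Python. This is a minimal behavioural sketch, not the authors' gate netlist: the helper names `gp_bits` and `kogge_stone_carries` are hypothetical, and the carry-in is applied after the group generate/propagate terms are formed, which is algebraically the same as the equations above.

```python
def gp_bits(a, b, n):
    """Per-bit generate and propagate lists (LSB first) for n-bit operands."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    return g, p


def kogge_stone_carries(g, p, cin):
    """Return [c1, ..., cn] using a Kogge Stone style prefix over (g, p)."""
    n = len(g)
    G, P = list(g), list(p)
    d = 1
    while d < n:
        # Position i combines its group with the group ending d bits below it.
        G, P = ([G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(n)],
                [P[i] & P[i - d] if i >= d else P[i] for i in range(n)])
        d *= 2
    # G[i], P[i] now cover bits 0..i, so c(i+1) = G[i] + P[i]*Cin.
    return [G[i] | (P[i] & cin) for i in range(n)]


if __name__ == "__main__":
    g, p = gp_bits(0b10110101, 0b01101011, 8)
    print(kogge_stone_carries(g, p, 1))
```

The loop runs log2(n) times for a power-of-two width, which is where the logarithmic depth of the tree comes from.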
 
[Prefix tree diagram: inputs (g7,p7), (g6,p6), (g5,p5), (g4,p4), (g3,p3), (g2,p2), (g1,p1), (g0,p0) at the bottom, combined over levels 1 to 4]

Fig. 2. Kogge Stone Tree
 
It is evident from the tree that the number of logic gates increases with the number of bits.
The computation time is proportional to log n. For this method the delay advantage is
seen only for adders with more than 8 bits. The approximate number of logic gates required for an n-bit
implementation is 3(n+1) log2(n) - n + 6 and the delay is 2 log2(n) + 3 unit gate delays.
Fig. 3 shows the comparison of the RCA and the Kogge Stone adder based on these calculations. From the
figure it is clear that the Kogge Stone adder gives a much lower delay, but at the expense of a large area.
[Plot: area (left axis) and delay in ns (right axis) versus number of bits (0 to 60); series: Area RCA, Area KS, Delay in ns RCA, Delay in ns KS]

Fig. 3. Area Delay comparison 
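
The calculated curves can be reproduced from the gate-count and delay expressions given in Section 2. The Python sketch below simply evaluates those expressions; it assumes the logarithm is base 2 (consistent with the 11-unit delay quoted later for a 16-bit Kogge Stone adder), and the function names are hypothetical.

```python
from math import log2


def rca_cost(n):
    """(gate count, delay in unit gate delays) for an n-bit ripple carry adder."""
    return 5 * n, 2 * n + 2


def ks_cost(n):
    """Approximate (gate count, delay) for an n-bit Kogge Stone adder.

    Assumes log base 2; exact only when n is a power of two.
    """
    levels = log2(n)
    return 3 * (n + 1) * levels - n + 6, 2 * levels + 3


if __name__ == "__main__":
    print("bits  RCA(area, delay)  KS(area, delay)")
    for n in (4, 8, 16, 32, 64):
        print(n, rca_cost(n), ks_cost(n))
```

At 64 bits the expressions give 320 gates and 130 units for the RCA against roughly 1112 gates and 15 units for the Kogge Stone adder, which illustrates the area versus delay trade-off.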
 
FPGA implementation results for the adders discussed above are shown below. Fig. 4(a) shows the area
and delay comparison. The results closely match expectations. Post-map delay and net delay are
shown in Fig. 4(b) to give a better comparison. Both the calculated and the implemented results show
that up to 8 bits there is not much difference between the adders. For more than 8 bits, the
RCA has the area advantage and the Kogge Stone adder has the delay advantage.
[Plot (a): number of LUTs and critical delay in ns versus number of bits (0 to 60) for RCA and KS. Plot (b): post-map delay in ns and net delay in ns versus number of bits (0 to 60) for RCA and KS]

Fig. 4.(a) Area and Delay comparison; (b) Post map delay and Net delay 
 
An efficient adder requires both small area and low delay. The carry select adder is one
such adder, placing itself between the RCA and KS adders. The main problem of the carry select adder is the
area overhead due to the generation of an extra carry and sum for each bit and the multiplexers for selecting
the final sum. Delay is also a problem, as the carry has to propagate through a number of multiplexers. Our
implementation of the carry select adder has reduced delay without much increase in area.
 
3. Carry select adder with Kogge Stone implementation 
3.1 First method 
 
In the first method the adder is implemented first by considering Cin=0, and the Cin=1 result is then
generated by using an Excess-1 adder as shown in Fig. 5. Finally the sum and carry outputs are selected by
using a MUX. The carry equations for the Cin=0 adder (4 bit) are
c1= g0 
c2= (g1+p1g0)  
c3= (g2+p2g1) + p2p1c1 
c4= (g3+p3g2) + p3p2 (g1+p1g0)  
 
[Schematic (a): 4-bit Kogge Stone adder with cin=0, producing sums s00 to s03 and carries c1 to c4. Schematic (b): carry select adder built from the Cin=0 adder, an Excess-1 adder producing s10 to s13, and a MUX controlled by cin that selects the outputs s0 to s3]

Fig. 5. (a) Kogge Stone adder with Cin=0; (b) Carry Select adder with Cin=0 adder, Excess 1 adder and MUX 
 
Four logic gates are saved in the implementation of the Cin=0 adder as compared to the Kogge Stone
adder, but seven additional gates are used for implementing the Cin=1 adder (Excess-1 adder). This clearly
indicates that implementing two adders is not a big area overhead in carry select adders. Here the
area overhead comes only in the final stage, where the final sum and carry outputs are selected using a
MUX.
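
The structure of the first method can be illustrated with a short Python model. The Cin=0 carries follow the four equations above; the Excess-1 stage is modelled here as a plain incrementer on the Cin=0 sum, which is one common way to realize such a stage and is an assumption, since the exact gate network of Fig. 5 is not reproduced. All function names are hypothetical.

```python
def add4_cin0(a, b):
    """4-bit addition with the carry-in fixed at 0 (sum bits LSB first)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    c1 = g[0]
    c2 = g[1] | (p[1] & g[0])
    c3 = (g[2] | (p[2] & g[1])) | (p[2] & p[1] & c1)
    c4 = (g[3] | (p[3] & g[2])) | (p[3] & p[2] & (g[1] | (p[1] & g[0])))
    s = [p[0], p[1] ^ c1, p[2] ^ c2, p[3] ^ c3]
    return s, c4


def excess1(s, cout):
    """Add one to the Cin=0 result, giving the Cin=1 result (assumed incrementer)."""
    t = [s[0] ^ 1,
         s[1] ^ s[0],
         s[2] ^ (s[0] & s[1]),
         s[3] ^ (s[0] & s[1] & s[2])]
    # A new carry out appears only when all four sum bits were 1.
    return t, cout | (s[0] & s[1] & s[2] & s[3])


def carry_select4(a, b, cin):
    """Select between the Cin=0 and Cin=1 results with a 2:1 MUX."""
    s0, c0 = add4_cin0(a, b)
    s1, c1 = excess1(s0, c0)
    s, cout = (s1, c1) if cin else (s0, c0)
    return sum(bit << i for i, bit in enumerate(s)), cout


if __name__ == "__main__":
    print(carry_select4(0b1011, 0b0110, 1))
```

Both candidate results are available before cin arrives, so the only cin-dependent step in this model is the final selection.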
 
3.2 Second method 
 
In the second method the adder is implemented by considering Cin=1 first, and the Cin=0 adder is then
generated by using extra logic circuits as shown in Fig. 6. Finally the sum and carry outputs are selected
by using a MUX. The carry equations for the Cin=1 adder (4 bit) are
c1= g0+p0 
c2= (g1+p1g0) + p1p0 
c3= (g2+p2g1) + p2p1c1 
c4= (g3+p3g2) + p3p2 (g1+p1g0) + p3p2 p1p0 
Three logic gates are saved in the implementation of the Cin=1 adder as compared to the Kogge Stone
adder, but five additional gates are used for implementing the Cin=0 adder. Here too the area overhead
comes only in the final stage, where the final sum and carry outputs are selected using a MUX.
 
[Schematic: 4-bit carry select adder with the Cin=1 adder (carries c1 to c4, sums s10 to s13) and the derived Cin=0 sums s00 to s03]
 
Fig. 6. Carry Select adder with Cin=1 adder, Cin=0 adder (MUX stage is not shown for reducing the complexity) 
 
3.3 Implementation 
 
Fig. 7(a) shows a 16-bit adder using 4-bit CSAs. Even though this structure can give reduced area
compared to a Kogge Stone adder, its delay is larger because the carry needs to pass through all the
multiplexers. It takes 16 units of delay to produce the final carry out, compared to 11 units for a Kogge
Stone adder. To alleviate this problem we can use a fast carry network, which is generated from
the carry output of each stage. The fast carry logic uses the Kogge Stone tree to implement the
intermediate carries.
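
The difference between the two 16-bit organizations can be sketched behaviourally in Python. In this sketch each 4-bit CSA block is modelled with integer arithmetic rather than gates, and the block generate and propagate signals are derived from the operands; these modelling choices and all function names are assumptions made for illustration only.

```python
def block_results(a, b, width=4):
    """Both candidate (sum, carry) results of one CSA block: Cin=0 and Cin=1."""
    mask = (1 << width) - 1
    r0, r1 = a + b, a + b + 1
    return (r0 & mask, r0 >> width), (r1 & mask, r1 >> width)


def csa16_rippled_select(a, b, cin):
    """Each block's MUX is driven by the carry selected in the previous block."""
    total, carry = 0, cin
    for k in range(4):
        ak, bk = (a >> 4 * k) & 0xF, (b >> 4 * k) & 0xF
        res0, res1 = block_results(ak, bk)
        s, carry = res1 if carry else res0
        total |= s << (4 * k)
    return total, carry


def csa16_fast_carry(a, b, cin):
    """Block carries come from a prefix tree over the block (G, P) signals."""
    slices = [((a >> 4 * k) & 0xF, (b >> 4 * k) & 0xF) for k in range(4)]
    results = [block_results(ak, bk) for ak, bk in slices]
    G = [res0[1] for res0, _ in results]                # block generates a carry
    P = [int((ak ^ bk) == 0xF) for ak, bk in slices]    # block propagates a carry
    d = 1
    while d < 4:
        G, P = ([G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(4)],
                [P[i] & P[i - d] if i >= d else P[i] for i in range(4)])
        d *= 2
    carries = [cin] + [G[k] | (P[k] & cin) for k in range(4)]  # cin, C4, C8, C12, C16
    total = 0
    for k, (res0, res1) in enumerate(results):
        s, _ = res1 if carries[k] else res0
        total |= s << (4 * k)
    return total, carries[4]


if __name__ == "__main__":
    print(csa16_rippled_select(0xFFFF, 0x0001, 0))
    print(csa16_fast_carry(0xFFFF, 0x0001, 0))
```

Both functions return the same result; the difference is that in the second one every MUX select is computed directly from the block signals and cin instead of waiting for the previous MUX.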
 
[Block diagram (a): four 4-bit CSA blocks with MUXes for A(3-0)/B(3-0) up to A(15-12)/B(15-12), with the carry passed from MUX to MUX; annotated delays C4 (10), C8 (12), C12 (14), C16 (16). Block diagram (b): the same four blocks with a FAST CARRY LOGIC unit fed by g(3), p(3), g(7), p(7), g(11), p(11), g(15), p(15) and cin; annotated delays C4 (7), C8 (9), C12 (10), Cout (11)]

Fig. 7.(a) 16 bit adder without fast carry network;(b) 16 bit adder with fast carry network 
 
As shown in Fig. 7(b), with this modification we can achieve the same delay as that of the Kogge Stone
adder. This configuration does not take any extra area, as the individual terms for the fast carry
logic are taken from the adder itself.
Fig. 8(a) and (b) show that the carry select adder can give much better performance than the Kogge Stone
adder for 32 or more bits. Area utilization is much reduced while keeping the same delay.
The performance of both carry select adders is similar, with the first method having slightly lower area
utilization. Carry select adders also have a repeating structure that reduces the LUT utilization for
FPGA implementation compared to a parallel tree adder.
[Plot (a): area in number of LUTs and delay in number of logic gates versus number of bits (0 to 60); series: I Area, II Area, KS Area, I Delay, II Delay, KS Delay. Plot (b): area-delay product versus number of bits (20 to 60); series: I Area*Delay, II Area*Delay, KS Area*Delay]

Fig. 8.(a) Area and Delay comparison; (b) Area-Delay product 
 
4. Carry select adder FPGA implementation 
 
Fig. 9 compares the Kogge Stone adder with carry select adders implemented using RCA and using the
Kogge Stone tree. The in-built adder in the FPGA is also included for comparison. The figure clearly shows
that for more than 16 bits, carry select adders give much better performance than the Kogge Stone adder.
For 64 or more bits, the CSA implemented with the Kogge Stone tree has lower delay than that
implemented with RCAs, even though its area utilization is slightly larger (Fig. 9(b)).
 
[Plot (a): post-map delay in ns versus number of bits (10 to 70) for CSA using RCA, CSA using KS, KS, and the in-built adder. Plot (b): area in number of LUTs versus number of bits (20 to 60) for the same four adders]

Fig. 9.(a) Delay comparison; (b) Area comparison 
 
The results show that the in-built adder, generated by simply using the addition operation available
in the programming language, performs much better than all the other adders. This is because the carry
chain of the adder is implemented using fast bus logic and the slices are configured to act as full adder
components.
5. Conclusion 
In this paper we have shown the design of a carry select adder implemented with the Kogge Stone tree
using two different approaches. Both adders match in terms of their performance and are much better
than a simple Kogge Stone adder. The new configuration is implemented on a Spartan 3E XC3S1600E
FPGA device and the performance is compared. Results show that the CSA implemented with the Kogge Stone
tree performs better in terms of delay, even though its area is somewhat larger than that of the CSA
implemented with RCA.
Acknowledgements 
The authors would like to thank Xilinx Inc. for providing their latest ISE 13.1 design suite.
References 
[1] O. J. Bedrij, “Carry-select adder,” IRE Trans. Electron. Computers, 1962, pp. 340–344.
[2] J. Sklansky, “Conditional-Sum Addition Logic,” IRE Transactions on Electronic Computers, 1960, vol. EC-9, pp. 226-231.
[3] P.M. Kogge, H.S. Stone, “A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations,” IEEE
Trans., C-22(8):786-793, Aug. 1973.
[4] R.E. Ladner, M.J. Fischer, “Parallel Prefix Computation,” JACM, 27(4):831-838, Oct. 1980.
[5] R.P. Brent, H.T. Kung, “A Regular Layout for Parallel Adders,” IEEE Trans., C-31(3):260-264, March 1982.
[6] T. Han, D.A. Carlson, “Fast Area-Efficient VLSI Adders,” 8th IEEE Symp. Computer Arithmetic, Como, Italy, May 1987, pp.
49-56.
[7] J. Kowalczuk, S. Tudor, D. Mlynek, “A New Architecture for Automatic Generation of Fast Pipelined Adders,” ESSCIRC,
Milano, Italy, Sept. 1991, pp. 101-104.
[8] A. Beaumont-Smith, N. Burgess, “A GaAs 32-bit Adder,” 13th Symp. Computer Arithmetic, Asilomar, California, June 1997, pp.
10-17.
[9] S. Knowles, “A family of adders,” Proceedings of the 14th IEEE Symposium on Computer Arithmetic, April 14-16, 1999,
Adelaide, Australia.
[10] Feng Liu et al., “A Comparative Study of Parallel Prefix Adders in FPGA Implementation of EAC,” Proceedings of the
12th Euromicro Conference on Digital System Design, 2009.
[11] Matthew M. Ziegler and Mircea R. Stan, “A Unified Design Space for Regular Parallel Prefix Adders,” Proceedings of the
Design, Automation and Test in Europe Conference and Exhibition, 2004.
[12] Konstantinos Vitoroulis and Asim J. Al-Khalili, “Performance of Parallel Prefix Adders implemented with FPGA
technology,” IEEE, 2007.
[13] Youngjoon Kim and Lee-Sup Kim, “A Low Power Carry Select Adder with Reduced Area,” IEEE, 2001.
[14] Behnam Amelifard et al., “Closing the Gap between Carry Select Adder and Ripple Carry Adder,” Proceedings of the Sixth
International Symposium on Quality Electronic Design (ISQED’05), 2005.
[15] M. Alioto et al., “A Gate Level Strategy to Design Carry Select Adders,” ISCAS 2004.
[16] Kuldeep Rawat et al., “A Low Power and Reduced Area Carry Select Adder,” IEEE, 2002.
[17] Yan Sun et al., “High-Performance Carry Select Adder Using Fast All-one Finding Logic,” Second Asia International
Conference on Modelling & Simulation, IEEE, 2008.
[18] B. Ramkumar and Harish M Kittur, “Low-Power and Area-Efficient Carry Select Adder,” IEEE Trans. on Very Large Scale
Integration Systems, 2011.
[19] Feng Liu et al., “A Comparative Study of Parallel Prefix Adders in FPGA Implementation of EAC,” 12th Euromicro
Conference on Digital System Design / Architectures, Methods and Tools, IEEE, 2009.
 
 
 
 
 
 
