Abstract-This paper presents the design of a 32-bit carry-skip adder that achieves minimum delay. The group generate and group propagate functions used in carry look-ahead logic are used to speed up multiple stages of ripple carry adders. The optimum sizes of the skip blocks are determined by taking the critical path into account. The adder is implemented in 0.25µm CMOS technology at 3.3V. Simulation results show a critical path delay of 3.4ns, which translates to a speed improvement of 18% compared to the current fastest carry-skip adder.
I. INTRODUCTION
Addition is the most basic arithmetic operation, and the adder is the most fundamental component of any digital processor. Depending on the area, delay and power requirements, several adder configurations such as ripple carry, carry look-ahead, carry-skip, and carry-select are available in the literature [1]. The ripple carry adder (RCA) is the simplest adder, but it has the longest delay because every sum output must wait for the carry-in from the previous adder cell. For an n-bit adder, it uses $O(n)$ area and has a delay of $O(n)$. The carry look-ahead adder has a delay of $O(\log n)$ and uses $O(n \log n)$ area. The carry-skip and carry-select adders, on the other hand, have $O(\sqrt{n})$ delay and use $O(n)$ area [2].
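As a minimal sketch of why the ripple carry adder has linear delay, the following bit-level model adds two n-bit numbers one cell at a time. The function name and interface are illustrative assumptions and are not taken from the paper.

```python
def ripple_carry_add(x: int, y: int, n: int, carry_in: int = 0):
    """Add two n-bit integers one full-adder cell at a time.

    Returns (sum, carry_out). The loop body runs n times because each
    cell needs the carry of the cell below it, which is the O(n) delay.
    """
    carry = carry_in
    total = 0
    for i in range(n):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        si = xi ^ yi ^ carry                     # sum bit of cell i
        carry = (xi & yi) | (carry & (xi ^ yi))  # carry into cell i + 1
        total |= si << i
    return total, carry
```

For example, `ripple_carry_add(0b1011, 0b0110, 4)` models 11 + 6 on a 4-bit adder and returns the low four bits of the result together with the carry out.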
The design of a carry-skip adder uses the generate and propagate signals given by:
$$g_i = X_i Y_i, \quad \text{and} \quad p_i = X_i \oplus Y_i \tag{1}$$
where $X_i$ and $Y_i$ are the inputs to the $i$th adder cell. The carry out is expressed as:
$$C_{i+1} = g_i + p_i C_i \tag{2}$$
Two additional signals are also used, and are given by:
$$G_{i:j} = g_j + p_j g_{j-1} + p_j p_{j-1} g_{j-2} + \cdots + p_j p_{j-1} \cdots p_{i+1} g_i \tag{3}$$
$$P_{i:j} = p_j p_{j-1} \cdots p_i \tag{4}$$
where $G_{i:j}$ and $P_{i:j}$ are the group generate and group propagate signals from the $i$th cell to the $j$th cell, respectively [1]. Then, the carry out from the whole group is given by:
$$C_{j+1} = G_{i:j} + P_{i:j} C_i \tag{5}$$
Different adder implementations have been proposed to optimize various design parameters. Kantabutra presented the design of a one-level variable block length carry-skip adder [3]. In [3], the fan-in to the carry-skip logic increases linearly towards the middle of the adder. An accelerated two-level carry-skip adder is presented in [4], where the whole adder stage is divided into a number of sections, each consisting of a number of RCA blocks of linearly increasing length. These adders reduce the delay at the cost of an increase in area and a less regular layout.
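The cell and group signals defined earlier in this section can be sketched in a few lines. This is an illustrative model only: it assumes the conventional definitions (generate as AND, propagate as XOR of the cell inputs), and the helper names `bit`, `group_signals`, and `group_carry_out` are hypothetical.

```python
def bit(value: int, i: int) -> int:
    """Return bit i of value."""
    return (value >> i) & 1

def group_signals(x: int, y: int, i: int, j: int):
    """Return (G, P) for adder cells i..j inclusive, with i <= j."""
    G, P = 0, 1
    for k in range(i, j + 1):
        g = bit(x, k) & bit(y, k)  # cell generate
        p = bit(x, k) ^ bit(y, k)  # cell propagate (XOR form, an assumption)
        G = g | (p & G)            # a carry is generated somewhere in i..k
        P = P & p                  # every cell in i..k propagates
    return G, P

def group_carry_out(x: int, y: int, i: int, j: int, c_in: int) -> int:
    """Carry out of the group: G + P * C_in."""
    G, P = group_signals(x, y, i, j)
    return G | (P & c_in)
```

The group carry out depends only on the two group signals and the incoming carry, which is what lets a skip stage bypass the ripple chain inside the group.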
Gayles proposed a structured approach for building a multilevel carry-skip adder [5]. In his scheme, an n-bit adder is divided into m blocks, with the two outer blocks implemented as 2-bit RCAs and the remaining blocks as variable-width blocks. The width doubles as we move from LSB to MSB. Gayles achieved low power dissipation for his adder, but the long carry chains make the adder slower. A highly reconfigurable single-level carry-skip adder aimed at media signal processing applications is presented in [6]. A 64-bit adder implemented in 0.35µm CMOS technology at 3.3V was shown to have a delay of 4.9ns and an energy dissipation of 181.2pJ. A 32-bit carry-skip adder using a Manchester carry chain (MCC) for its bypass path is presented in [7]. In that design, the maximum number of series-connected transistors in the carry chain is limited to 6. The transistor sizes are progressively increased by a factor of 1.5 in each block to minimize the delay. A delay of 2.2ns is claimed with 0.35µm CMOS at 3.3V. However, it is not clear whether the reported delay is the worst-case delay of the adder or only the MCC delay. Furthermore, the use of domino logic gates slows down the whole adder.
A fully static carry-skip adder presented by Chirca [8] achieved lower power dissipation and higher performance. To reduce delay and power consumption, the adder is divided into variable-sized blocks that balance the inputs to the carry chain. The main principle behind this design was to make the lower blocks work in parallel with the higher blocks. A 32-bit adder with a delay of 7 logic levels, divided into 4 blocks, was presented in [8], as shown in Fig. 1. AND-OR-INVERT (AOI) and OR-AND-INVERT (OAI) CMOS gates were used to reduce delay and power. However, the carry-select adders used in the final CS4 block significantly increased the hardware. Furthermore, the paper claims an output delay of 7 logic levels, but a closer examination reveals that the 27th bit of the sum output is available only after a delay of 9 logic levels.
Figure 1. The 32-bit adder divided into 4 blocks [8]
II. NEW DESIGN FOR 32-BIT CARRY SKIP ADDER
Our new design uses a combination of RCAs together with carry skip logic (SKIP), carry generate logic (CG), and group generate-propagate logic (PG). Both the carry generation and skip logic use AOI and OAI gates. The 32-bit adder is divided into four blocks. The width of each block is limited by the target delay T. Each block is further divided into sub-blocks. A sub-block may contain additional levels of sub-blocks in a recursive manner. The lowest-level sub-block is formed by a number of variable-width RCAs. A block diagram of the first three blocks ($A_0$, $A_1$, and $A_2$) is shown in Fig. 2. The first block $A_0$ (LSB) is a full adder by itself. The carry $C_1$ from the first block is fed into the second block $A_1$ and is also fed to the skip logic. The generate and propagate functions (p,g) are generated separately for each full adder in one unit time, where one unit time is defined as the delay of a complex CMOS gate with at most three transistors connected in series from the output node to any supply rail. Since the delay of a complex CMOS gate is quadratic in its stack height, the stack height in our design is limited to 3. This restricts the maximum number of inputs to the carry skip logic to 7. On the other hand, when the generate-propagate outputs are used for implementing group generate and group propagate outputs, a stack height of 3 in the CMOS implementation allows a 4-bit RCA.
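A functional sketch of a block-partitioned carry-skip adder may help fix the idea. It models only the arithmetic behaviour, not the AOI/OAI gate structure, the sub-block hierarchy, or the timing, and the function name `carry_skip_add` is an assumption. The example widths `[1, 3, 10, 18]` are the block sizes this section arrives at for the 32-bit adder.

```python
def carry_skip_add(x: int, y: int, widths, c_in: int = 0):
    """Add x and y with ripple blocks of the given widths (LSB block first).

    Returns (sum, carry_out). Inside a block the carry ripples; when every
    cell of the block propagates, the block carry-out is taken directly
    from the block carry-in, which models the skip path.
    """
    carry, total, lo = c_in, 0, 0
    for w in widths:
        block_in = carry
        all_propagate = 1
        for i in range(lo, lo + w):
            xi, yi = (x >> i) & 1, (y >> i) & 1
            p = xi ^ yi
            total |= (p ^ carry) << i        # sum bit uses the carry into cell i
            carry = (xi & yi) | (p & carry)  # ripple carry out of cell i
            all_propagate &= p
        if all_propagate:
            # Skip path: logically the same value as the rippled carry,
            # but in hardware it is available without waiting for the ripple.
            carry = block_in
        lo += w
    return total, carry
```

The skip path never changes the result; its benefit is purely in delay, which is why the block widths can be chosen from timing considerations alone.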
The carry generation delay from the skip logic is minimized by alternately complementing the carry outputs. To minimize the carry generation delay from the very first block, $C_1$ is generated as:
$$\overline{C_1}\,(1) = \overline{X_0 Y_0 + X_0 C_0 + Y_0 C_0} \tag{6}$$
An AOI gate implements this, and the output is available in 1 time unit (the delay is shown in parentheses).
Block $A_1$ in Fig. 2 is implemented as a k-bit RCA. For any k-bit RCA, the total number of propagate and generate (p,g) outputs is 2k. These 2k outputs, together with the carry from the previous block, are fed into the carry skip logic to generate the new carry signal. The fan-in restriction of 7 on the carry-skip logic ($2k + 1 \le 7$) therefore limits the number of bits in the RCA to 3. The carry out $C_4$ from the skip logic for block $A_1$ is implemented by an OAI gate as:
$$C_4 = \overline{\overline{g_3}\,(\overline{p_3} + \overline{g_2})(\overline{p_3} + \overline{p_2} + \overline{g_1})(\overline{p_3} + \overline{p_2} + \overline{p_1} + \overline{C_1})} \tag{7}$$
The final sum output $S_3$ from this 3-bit RCA will be available in 4 time units. The sum outputs for this RCA are generated either as:
$$S_i = p_i \oplus C_i \quad \text{or} \quad S_i = \overline{p_i \oplus \overline{C_i}} \tag{8}$$
depending on whether the carry signal is available as $C_i$ or $\overline{C_i}$.
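The alternating carry polarity in the first two blocks can be checked exhaustively with a small sketch. It assumes that the AOI gate for the first block produces the complement of the majority of $X_0$, $Y_0$, $C_0$, and that the OAI gate for $A_1$ is the De Morgan dual of the usual sum-of-products carry expression driven by complemented inputs. These forms and all function names are assumptions for illustration.

```python
from itertools import product

def c1_bar_aoi(x0: int, y0: int, c0: int) -> int:
    """AND-OR-INVERT: complement of the carry out of the first full adder."""
    return 1 - ((x0 & y0) | (x0 & c0) | (y0 & c0))

def c4_oai(g: dict, p: dict, c1_bar: int) -> int:
    """OR-AND-INVERT over complemented inputs; returns the true carry C4.

    g and p map the bit positions 1..3 to the cell generate/propagate values.
    """
    inv = lambda v: 1 - v
    s1 = inv(g[3])
    s2 = inv(p[3]) | inv(g[2])
    s3 = inv(p[3]) | inv(p[2]) | inv(g[1])
    s4 = inv(p[3]) | inv(p[2]) | inv(p[1]) | c1_bar
    return 1 - (s1 & s2 & s3 & s4)

def check_first_two_blocks() -> bool:
    """Compare the gate-level carries against plain integer addition."""
    for x, y, c0 in product(range(16), range(16), (0, 1)):
        xb = [(x >> i) & 1 for i in range(4)]
        yb = [(y >> i) & 1 for i in range(4)]
        g = {i: xb[i] & yb[i] for i in (1, 2, 3)}
        p = {i: xb[i] ^ yb[i] for i in (1, 2, 3)}
        c1_bar = c1_bar_aoi(xb[0], yb[0], c0)
        if 1 - c1_bar != (xb[0] + yb[0] + c0) >> 1:
            return False
        if c4_oai(g, p, c1_bar) != (x + y + c0) >> 4:
            return False
    return True
```

Because the first gate delivers the complemented carry and the second gate consumes it directly, no inverter is needed between the two stages.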
Now consider block $A_2$ in Fig. 2. The carry signal arrives at the input of the skip logic with a delay of 2 time units. This implies that the group generate-propagate (P,G) logic outputs feeding the skip logic must also be available in 2 time units. Hence, the inputs to the (P,G) logic must be available in 1 time unit, so they must be the propagate and generate signals of the full adders. Block $A_2$ is divided into three sub-blocks $A_{2,0}$, $A_{2,1}$ and $A_{2,2}$ (each sub-block is an RCA) due to the fan-in restriction of 7 on the skip logic. The maximum width of each RCA is limited to 4 bits due to the fan-in restrictions imposed on the (P,G) block. The width of each RCA is also limited by the target delay T of the 32-bit adder. The width W of the first RCA is given as: W = T − D, where D is the arrival delay of the carry output from the previous block. The widths of all remaining higher-order RCAs in the same block will be 1 bit less, because their carry inputs arrive one time unit later. The carry inputs $C_8$ and $C_{11}$ to RCAs $A_{2,1}$ and $A_{2,2}$ are generated using AOI logic as:
$$\overline{C_8} = \overline{G_{4:7} + P_{4:7} C_4} \tag{9}$$
$$\overline{C_{11}} = \overline{G_{8:10} + P_{8:10} G_{4:7} + P_{8:10} P_{4:7} C_4} \tag{10}$$
For a target delay of 6 time units, $A_{2,0}$ is 4 bits and $A_{2,1}$ and $A_{2,2}$ are 3 bits each, so $A_2$ is 10 bits wide and the first three blocks cover a total of 14 bits. The carry out $C_{14}$ from the skip logic is implemented using AOI logic as:
$$\overline{C_{14}} = \overline{G_{11:13} + P_{11:13} G_{8:10} + P_{11:13} P_{8:10} G_{4:7} + P_{11:13} P_{8:10} P_{4:7} C_4} \tag{11}$$
Next let us consider the final block $A_3$ of the 32-bit adder. This is also divided into three sub-blocks due to the fan-in restrictions on the skip logic. A block diagram of $A_3$ with an expanded view of sub-block 0 ($A_{3,0}$) is shown in Fig. 3. $A_{3,0}$ is further divided into RCAs. The number of inputs to the CG logic increases successively by two for each RCA and is limited to a maximum of 7 in any sub-block. Hence, the number of RCAs in any sub-block is limited either by the number of inputs to the CG block or by the number of inputs to the (P,G) block. Therefore, $A_{3,0}$ can accommodate four RCAs. The carry input to the skip logic as well as to the first RCA ($A_{3,0,0}$) arrives in 3 time units. The propagate and generate signals (p and g) from each RCA will be available with a delay of 1 time unit. This implies that we can have two levels of (P,G) logic inside the block while satisfying the time delay constraints. Hence, $A_{3,0,0}$ is 3 bits, and the widths of the remaining RCAs are 2 bits each, thereby making a total of 9 bits for $A_{3,0}$.
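The width rule W = T − D can be written as a short sketch. The function `rca_widths` and its parameters are hypothetical; it assumes that every RCA after the first in a group receives its carry one time unit later than the first, and that no RCA exceeds the 4-bit limit set by the (P,G) fan-in.

```python
def rca_widths(target_delay: int, carry_arrival: int, n_rcas: int, max_width: int = 4):
    """Widths of the RCAs in one group, lowest-order RCA first.

    target_delay  : T, in unit gate delays
    carry_arrival : D, arrival time of the carry into the first RCA
    n_rcas        : number of RCAs in the group
    max_width     : cap from the fan-in limit of the (P,G) logic
    """
    first = min(max_width, target_delay - carry_arrival)
    later = min(max_width, target_delay - (carry_arrival + 1))
    return [first] + [later] * (n_rcas - 1)
```

With T = 6, the call `rca_widths(6, 2, 3)` gives the 4, 3, 3 split quoted for $A_2$, and `rca_widths(6, 3, 4)` gives the 3, 2, 2, 2 split quoted for $A_{3,0}$. The rule is not claimed to cover sub-blocks whose RCAs all receive their carries at the same time.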
The number of RCAs in $A_{3,1}$ is limited to three. The carry input $C_{23}$ to $A_{3,1,0}$ is generated by an OAI gate and will be available in 4 time units, thereby limiting the length of the first RCA to 2 bits. The carry inputs $C_{25}$ and $C_{27}$ to the remaining RCAs in $A_{3,1}$ are also available in 4 time units. Thus the maximum width of sub-block $A_{3,1}$ is 6 bits. The carry input $C_{29}$ to $A_{3,2}$ is also generated by an OAI gate. The maximum width of sub-block $A_{3,2}$ can be calculated as 4 bits. Hence, the total width of block $A_3$ is 19 bits. Thus the block sizes are 1, 3, 10 and 19 bits starting from the lowest block. By combining the four blocks ($A_0$, $A_1$, $A_2$, and $A_3$), a 33-bit adder can be implemented. The width of sub-block $A_{3,2}$ can be shortened to 3 bits for a 32-bit adder. The final 32-bit adder with a delay of 6 time units, divided into 4 blocks, is shown in Fig. 4. The carry $C_{32}$ from the skip logic can be expressed as OAI logic, and is given by:
$$C_{32} = \overline{\overline{G_{29:31}}\,(\overline{P_{29:31}} + \overline{G_{23:28}})(\overline{P_{29:31}} + \overline{P_{23:28}} + \overline{G_{14:22}})(\overline{P_{29:31}} + \overline{P_{23:28}} + \overline{P_{14:22}} + \overline{C_{14}})} \tag{12}$$
and will be available in 4 time units.
The 7-input AOI/OAI gates use 4 transistors in a series chain if implemented in the normal manner. As the delay is quadratic in the number of transistors in a series-connected path, we decided to implement these gates as a cascade connection of a number of smaller modules as:
AOI7: (13)
OAI7: (14)
More than 30% reduction in gate delay was achieved using the cascaded implementation. Furthermore, the fanout of one of the 7-input AOI gates in our design exceeded four. To reduce the delay due to this fanout problem, a superbuffer design was used for the AOI gate. This was easily implemented in the cascaded design by progressively increasing the transistor sizes by a factor of 2.
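To illustrate what cascading a 7-input AOI gate means functionally, the sketch below compares a single-gate form with one possible two-stage split. The particular split is hypothetical and is not claimed to be the decomposition used in this design; the sketch only shows that such a split preserves the logic function while keeping every product term to at most three literals.

```python
from itertools import product

def aoi7_flat(g3, p3, g2, p2, g1, p1, c):
    """Single-gate form: four product terms, the longest with four literals."""
    return 1 - (g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & c))

def aoi7_cascaded(g3, p3, g2, p2, g1, p1, c):
    """Hypothetical two-stage form with at most three literals per term."""
    # Stage 1: small AOI over the low-order inputs (complemented output).
    t_bar = 1 - (g1 | (p1 & c))
    # Stage 2: remaining terms; the inversion of t_bar is written out
    # explicitly here and says nothing about how a real gate absorbs it.
    t = 1 - t_bar
    return 1 - (g3 | (p3 & g2) | (p3 & p2 & t))

def cascaded_matches_flat() -> bool:
    """Exhaustively compare both forms over all 128 input combinations."""
    return all(
        aoi7_flat(*bits) == aoi7_cascaded(*bits)
        for bits in product((0, 1), repeat=7)
    )
```

The sketch says nothing about delay; the 30% figure quoted in the text comes from the circuit implementation, not from this model.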
In the 32-bit adder design presented above, the carry out $C_{32}$ from the last block $A_3$ is available in 4 time units. For a target delay of 6 time units, this allows the addition of 2 more blocks, $A_4$ and $A_5$, which will extend the adder to a total of 54 bits. Table I lists the maximum adder size for a given target delay using our design procedure.
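The partition of the 32-bit adder described in this section can be tallied with a short sketch, which also recovers the bit positions at which the named carries occur. The dictionary layout and function names are assumptions, and $A_{3,2}$ is treated here as a single entry of 3 bits (the shortened width), which the text does not state explicitly.

```python
from itertools import accumulate

# RCA widths per block, lowest-order first, as quoted in the text.
PARTITION = {
    "A0": [1],
    "A1": [3],
    "A2": [4, 3, 3],
    # A3,0 = 3+2+2+2, A3,1 = 2+2+2, A3,2 shortened from 4 to 3 bits
    "A3": [3, 2, 2, 2, 2, 2, 2, 3],
}

def block_sizes(partition):
    """Total width of each block, in block order."""
    return [sum(widths) for widths in partition.values()]

def carry_indices(partition):
    """Index k of the carry C_k leaving each RCA, lowest-order first."""
    widths = [w for block in partition.values() for w in block]
    return list(accumulate(widths))
```

The cumulative sums include 1, 4, 8, 11, 14, 23, 25, 27, 29 and 32, which are the carries $C_1$, $C_4$, $C_8$, $C_{11}$, $C_{14}$, $C_{23}$, $C_{25}$, $C_{27}$, $C_{29}$ and $C_{32}$ named in the text.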
III. SIMULATION
The adder was implemented using Tanner Tools Pro 11.03. L-Edit was used to generate the layout, and T-Spice simulation was performed using the generic 0.25µm CMOS technology at 3.3V. For comparison purposes we selected two more adders: (i) the 32-bit Chirca adder [8] and (ii) the 32-bit Gayles adder [5]. To get a more realistic estimate of the delays involved, we laid out all 32-bit adders using L-Edit and performed T-Spice simulations. The simulation was carried out by feeding 5000 random vectors at a frequency of 100 MHz, and the results are shown in Table II. These results show that our 32-bit adder has the minimum delay of 3.4 ns, while the Gayles adder exhibited a maximum delay of 4.39 ns. The Chirca adder had a delay of 4.15 ns. Thus our design exhibits a speedup of 18% and 22% compared to the Chirca and Gayles adders, respectively. Our 32-bit adder was then extended to a 54-bit adder, and the corresponding simulation results are also included in Table II. Even this 54-bit adder was found to be faster than the 32-bit Gayles adder. Our 32-bit adder consumed substantially more power than the Gayles adder, but less than the Chirca adder. Overall, our 32-bit adder achieved the lowest power-delay product, as shown in Table II.
IV. CONCLUSIONS
A new 32-bit carry-skip adder that is divided into four variable-width blocks is presented in this paper. The size of each block is limited by the delay of the carry-in signal and the final target delay. An algorithm is used to calculate the maximum size of the adder satisfying the target delay. The delay of a complex CMOS gate with a maximum stack height of 3 is used as the time unit of measurement in our analysis. The adder has been implemented by generating the layout with generic 0.25µm CMOS technology. T-Spice simulations carried out at a frequency of 100 MHz at 3.3V showed a critical path delay of 3.4 ns.
Overall, our proposed adder is 18% and 22% faster than the Chirca and Gayles adders, respectively. Furthermore, a 54-bit adder implemented using our approach can operate at almost the same speed as a 32-bit Chirca or Gayles adder. Even though the power consumption of our adder is higher than that of the Gayles adder, overall we achieved the lowest power-delay product.
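The quoted speed improvements can be reproduced from the simulated delays. The sketch below assumes that "faster" is meant as the relative reduction in critical path delay; the function name is hypothetical.

```python
def delay_reduction_percent(reference_ns: float, ours_ns: float) -> float:
    """Relative reduction in delay with respect to the reference adder."""
    return 100.0 * (reference_ns - ours_ns) / reference_ns

OURS_NS = 3.4      # proposed 32-bit adder
CHIRCA_NS = 4.15   # 32-bit Chirca adder
GAYLES_NS = 4.39   # 32-bit Gayles adder

vs_chirca = delay_reduction_percent(CHIRCA_NS, OURS_NS)
vs_gayles = delay_reduction_percent(GAYLES_NS, OURS_NS)
```

Under this interpretation the reductions come out at roughly 18.1% and 22.6%, consistent with the 18% and 22% quoted in the text when the fractional part is dropped.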
