In this paper, we describe the design of radix-3 and radix-4 parallel prefix adders, which theoretically have logic depths of $\log_3 n$ and $\log_4 n$ respectively, where $n$ is the bit-width of the input signals. The main building blocks of the higher radix parallel prefix adders are identified and higher radix structures of Kogge-Stone adders are presented. We show that with the higher radix architectures the logic depth can be reduced by 50% and the cell count can be reduced by as much as 47% for 64-bit adders. Simulation results indicate that radix-4 adders can be more than 30% faster than radix-2 realizations.
Introduction
The addition of two binary numbers is one of the most important arithmetic functions in modern digital VLSI systems, taking a major part of the design effort of modern digital signal processors and general purpose microprocessors. The maximum operating speed of these processors depends largely on how fast the main computation block can process data. For a large number of applications, the speed critical computation block includes adders, either as stand-alone blocks or integrated into multiplier architectures. As a result, specialized speed optimized adder architectures are required for high performance systems.
The design of faster, smaller and more efficient adder architectures has been the focus of many research efforts and has resulted in a large number of adder architectures. Some architectures like the Carry-Skip Adder, the Conditional Sum Adder and the Carry Select Adder [] rely on a basic ripple carry adder structure that has been modified to shorten the carry propagation path. The parallel prefix adders [2] are a more general form where a network is used to pre-calculate the carry signals. Some well known parallel prefix adder architectures using different carry lookahead networks are the Sklansky Binary Tree Adder [3], the Brent-Kung Adder [4], and the Kogge-Stone Adder [5]. Although most of the above mentioned algorithms are formulated for any radix, practical implementations have generally been limited to radix-2.
0-7803-5482-6/99/$10.00 ©2000 IEEE
In this paper, we discuss the theory and the feasibility of implementation of the radix-4 and radix-3 versions of the Kogge-Stone Adder. An introduction to the parallel prefix problem is given in Section-2 of this paper. Section-3 defines standard building blocks, and introduces two new blocks, for building higher radix parallel prefix adders. The higher radix realizations of the Kogge-Stone parallel prefix architecture are examined in detail in Section-4.
Finally, Section-5 includes a summary of our results.
Parallel Prefix Problem
Most of the known adder architectures can be represented as a parallel prefix adder structure consisting of three main parts: pre-processing, carry lookahead network and post-processing.
Given the input vectors $A$ and $B$, the pre-processing part extracts two special signals, propagate ($p$) and generate ($g$), using simple logic circuits. The calculation of the sum is assigned to the post-processing step, which, like the pre-processing step, is a constant time operation. This leaves only the carry propagation (carry lookahead) problem, which is a recursive function, to be addressed. The carry propagation problem can be expressed in terms of a prefix problem where, for a set of binary inputs $x_i$, each output $y_i$ is obtained by applying an associative binary operator $\bullet$ to the inputs $x_i, x_{i-1}, \ldots, x_0$. Since the operator is associative, the operations can be grouped in any order and computed in a number of levels. To express the sub-products let us introduce the notation $y^{k}_{i:j}$, where $k$ is the level of the sub-product and $i:j$ represents the continuous range that this sub-product covers. For the carry propagation problem let us define the sub-product couple $(G, P)$ such that the level-0 couple of a single bit position $i$ is its generate and propagate pair $(g_i, p_i)$, and where the desired carry out of bit position $i$ is the generate term $G_{i:0}$, regardless of the number of levels necessary to cover the range $i:0$. Depending on the algorithm the carry propagation network will have a different structure and shape. In general the following observations can be made:
The maximum number of levels required to calculate the final carry signal is referred to as the depth of the prefix graph, and equals the number of logic levels in the network. The depth of the carry propagation network is a function of the bit-width of the input. This number relates roughly to the delay of the network.
The total number of binary associative operations within the network determines the active area required to compute the result.
Secondary effects like the number of times a sub-range is used in subsequent operations (fan-out) and the distance between operators of an operation (connection length) also contribute to the overall performance of the system.
Higher Radix Parallel Prefix Adder Architectures
The delay of a parallel prefix adder is directly proportional to the number of levels in the carry propagation network stage. The majority of contemporary adder architectures rely on carry propagation networks composed of blocks that have only two inputs. The lower bound of the number of stages required for such networks lies at $\log_2 n$, where $n$ is the bit-width of the input vectors. This lower bound can be lowered using more complex blocks that process 3 or even 4 inputs to obtain lower bounds of $\log_3 n$ and $\log_4 n$ respectively. Adders designed using these complex blocks can theoretically achieve higher processing speeds at the cost of additional area.
Building Blocks for Radix-4 Parallel Prefix Adders
The pre-processing and post-processing stages of a typical parallel prefix adder consist of simple logic gates. The pre-processing stage can be realized by a simple half adder, or by an AND gate and an OR gate. The post-processing stage is merely an XOR gate. The simple, 2-input prefix function can be mapped to standard logic operations as follows:
$$(G, P)^{k}_{i:j} = (G, P)^{k-1}_{i:m} \bullet (G, P)^{k-1}_{m-1:j} \qquad (4)$$
$$G^{k}_{i:j} = G^{k-1}_{i:m} + P^{k-1}_{i:m} \cdot G^{k-1}_{m-1:j}$$
$$P^{k}_{i:j} = P^{k-1}_{i:m} \cdot P^{k-1}_{m-1:j}$$
Here $m$ (with $i \ge m > j$) denotes the position at which the range $i:j$ is split into two sub-ranges. This function pair can be realized using an AND-OR gate and a separate AND gate sharing common inputs. We call this basic cell PP2 (parallel prefix-2) and define two additional cells, PP3 and PP4, which realize the parallel prefix function for three and four inputs respectively. The PP3 and PP4 cells realize the following logic functions for the generate output. (The propagate structure is a 3 or 4 input AND gate.)
$$G_{PP3} = G_2 + (P_2 \cdot (G_1 + P_1 \cdot G_0)) \qquad (5)$$
$$G_{PP4} = G_3 + (P_3 \cdot (G_2 + (P_2 \cdot (G_1 + P_1 \cdot G_0)))) \qquad (6)$$
The transistor level schematics for these functions are shown in Figure-[...]. Since several PP2 blocks are needed to compute the result of a single PP4 block, designs using radix-4 blocks can work faster, despite the fact that more complex basic cells are used.
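The relation between the PP2, PP3 and PP4 cells can be checked with a short behavioural sketch in Python. The function names `pp2`, `pp3`, `pp4` and `cells_agree` are illustrative and do not come from the paper; the sketch assumes the nested AND-OR form for the generate output and a plain AND of all propagate inputs, and it models logic values only, not transistor-level delay.

```python
from itertools import product


def pp2(hi, lo):
    """2-input prefix cell: hi is the more significant (G, P) pair, lo the less significant."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def pp3(x2, x1, x0):
    """3-input prefix cell (assumed form): nested AND-OR generate, 3-input AND propagate."""
    (g2, p2), (g1, p1), (g0, p0) = x2, x1, x0
    g = g2 | (p2 & (g1 | (p1 & g0)))
    p = p2 & p1 & p0
    return (g, p)


def pp4(x3, x2, x1, x0):
    """4-input prefix cell (assumed form): nested AND-OR generate, 4-input AND propagate."""
    (g3, p3), (g2, p2), (g1, p1), (g0, p0) = x3, x2, x1, x0
    g = g3 | (p3 & (g2 | (p2 & (g1 | (p1 & g0)))))
    p = p3 & p2 & p1 & p0
    return (g, p)


def cells_agree():
    """Exhaustively compare PP3/PP4 against equivalent arrangements of PP2 cells."""
    pairs = [(g, p) for g in (0, 1) for p in (0, 1)]
    for x3, x2, x1, x0 in product(pairs, repeat=4):
        # One PP4 cell versus three PP2 cells arranged in two levels.
        if pp4(x3, x2, x1, x0) != pp2(pp2(x3, x2), pp2(x1, x0)):
            return False
        # One PP3 cell versus two chained PP2 cells.
        if pp3(x2, x1, x0) != pp2(x2, pp2(x1, x0)):
            return False
    return True
```

Calling `cells_agree()` returns `True`: a single PP4 cell produces the same (G, P) pair as three PP2 cells arranged in two levels, which is why a radix-4 network needs fewer levels for the same range.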
Higher Radix Kogge-Stone Parallel Prefix Architecture
Using the newly defined prefix cells, higher radix adder structures can easily be designed. We will use a graph representation to provide a clearer view of the architecture. Figure-3 shows the main symbols used in the graphs. PP4, PP3 and PP2 cells are represented using filled symbols. The dummy cell, shown as a blank diamond, does not contain any logic (or only a buffer) and can be considered a vacant position. Among different parallel prefix adder realizations the Sklansky Binary Tree and the Kogge-Stone architectures have the least possible network depth of $\log_r n$, where $r$ is the radix and $n$ is the bit-width of the inputs. The main advantage of the Kogge-Stone architecture is the maximum fan-out, which equals the radix $r$, whereas the Sklansky Binary Tree adder has a maximum fan-out of $r^{n} - r^{n-1}$. Although the Kogge-Stone adder has a much lower fan-out, this reduction in fan-out comes at a cost of increased cell count. The Radix-2 Kogge-Stone Adder can [...] respectively.
We have run transistor level simulations to compare the relative performances of radix-4 and radix-2 realizations of 64-bit Kogge-Stone Parallel Prefix Adders. Figure-5 shows a typical simulation result for both adder architectures. It can be seen that the Radix-4 architecture is 32.5% faster (3.98 ns for Radix-4, 5.89 ns for Radix-2, in a 0.8 µm CMOS design using a 3.3 V supply voltage).
Summary and Conclusions
In this work we have presented the topologies for Radix-4 and Radix-3 parallel prefix adders, which have a theoretical carry propagation network depth of $\log_r n$; for radix-4 this halves the number of logic levels, which in theory essentially doubles the speed of these adder structures. Although more complex base cells result in slightly larger delays, we have found that on average the delays can be reduced by as much as 32.5%.
Table-1 compares five realizations of 64-bit adders: a standard Ripple Carry adder (RCA-64), a Sklansky Binary Tree adder (SK2-64) and three realizations of Kogge-Stone adders with different radices. The table lists both the total number of cells and the breakdown into individual cell categories. It is important to note that the proposed architectures not only reduce the depth of the carry propagation network by as much as 50%, but the number of cells required also decreases by as much as 47% (for the Radix-4 Kogge-Stone Adder) when compared to the Radix-2 realizations.
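The depth figures quoted above can be reproduced with a small behavioural model of a radix-$r$ Kogge-Stone carry network. This is a sketch under stated assumptions, not the authors' implementation: the names `combine` and `kogge_stone_add` are hypothetical, the propagate signal is assumed to be the XOR of the operand bits, and each higher radix cell is modelled as a chain of 2-input prefix operations, which gives the same logic values by associativity but says nothing about gate delay or cell count.

```python
def combine(hi, lo):
    """2-input prefix operator on (G, P) pairs; hi covers the more significant range."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def kogge_stone_add(a, b, n=64, radix=4):
    """Add two n-bit integers with a radix-`radix` Kogge-Stone style prefix network.

    Returns (sum modulo 2**n, carry out, number of prefix levels used).
    """
    # Pre-processing: per-bit generate and propagate (XOR form assumed here).
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    gp = list(zip(g, p))

    span = 1    # number of bits each (G, P) pair currently covers
    levels = 0
    while span < n:
        nxt = []
        for i in range(n):
            acc = gp[i]
            # Merge up to radix-1 neighbours at distances span, 2*span, ...
            # Positions near bit 0 use fewer inputs (PP3, PP2 or a dummy cell).
            for m in range(1, radix):
                j = i - m * span
                if j < 0:
                    break
                acc = combine(acc, gp[j])
            nxt.append(acc)
        gp = nxt
        span *= radix
        levels += 1

    # Post-processing: the carry into bit i is the group generate of bits i-1..0.
    carry_in = [0] + [group_g for group_g, _ in gp]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carry_in[i]) << i
    return total, carry_in[n], levels
```

For 64-bit operands the model uses 6, 4 and 3 prefix levels for radix 2, 3 and 4 respectively, so the radix-4 network has half the depth of the radix-2 network, in line with the 50% depth reduction stated in the text.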
