The increasing demand for high fidelity portable devices has placed emphasis on the development of low power, high performance systems. In next generation processors, low power design has to be incorporated into fundamental computation units such as multipliers. The characterization and optimization of such low power multipliers will aid in the comparison and choice of multiplier modules in system design. In this paper we performed a comparative analysis of the power, delay, and power delay product (PDP) optimization characteristics of four parallel digital multipliers implemented using low power 10 transistor (10T) adders and conventional CMOS adder cells. To achieve optimal power savings at smaller geometry sizes, we proposed a heuristic approach based on hybrid adder models. Multipliers realized using the Static Energy Recovery Full adder (SERF) circuit consumed considerably less power than 10T and static CMOS based multipliers for all the configurations studied. Furthermore, the difference between the power consumption of the 10 transistor based multipliers and the 28T multipliers is significant at 180 nm, but not at 70 nm. For smaller geometry sizes down to 70 nm, the multipliers implemented with 10 transistor adders retain a lower propagation delay. Carry-Save multipliers had a better PDP range than the other multipliers for all three adder sub-module designs. For optimally scaled gate widths, the SERF Wallace tree multiplier gave the best PDP among the four SERF based multipliers. This can be attributed to the fast computational capability of the Wallace tree multiplier and to the energy recovery logic of the SERF adders saving more power at deep sub-micron sizes. The proposed SERF-10T hybrid adder model multipliers consumed the least power of all the hybrid and regular models with no deterioration in performance.
Taken together, these results suggest that SERF-10T Hybrid model based multipliers are suited for ultra low power design and fast computation at smaller geometry sizes.
INTRODUCTION
The prolific growth of the semiconductor device industry has led to the development of high performance portable systems with enhanced reliability in data transmission. To maintain the portability of high performance, high fidelity applications, future system design will emphasize the incorporation of low-power modules. [1] [2] [3] [4] [5] The design of such modules will have to rely in part on reduced power consumption and dissipation in fundamental arithmetic computation units such as adders and multipliers. This underscores the need to design low power multipliers for the development of power-efficient, high-performance systems.
The selection of the most efficient architecture to implement multiplication has continually challenged DSP system designers. [6] [7] [8] The options currently available offer a wide range of tradeoffs in terms of speed, complexity and power consumption. Input sequences can be fed to the multiplier in parallel, in serial, or in a hybrid (parallel-serial) fashion. To achieve higher processing speeds, parallel multipliers are usually adopted at the expense of high area complexity. Multiple parallel multiplication algorithms (architectures) (e.g., Refs. [11] [12] [13]) have been proposed to reduce the chip area and increase the speed of the multipliers. Various techniques have been developed to reduce the power dissipation of parallel multipliers. While several of these techniques reduce power dissipation by eliminating spurious transitions, [13] [14] [15] others have focused on developing novel multiplier architectures and sign-extension techniques to reduce power dissipation and improve performance. [16] [17] [18] [19] Yet another approach is to develop low-power 3-2 counters and 4-2 compressors, which are key components in parallel multipliers. [20] [21] [22] Although each of these techniques helps reduce power dissipation, further reductions will be needed for future digital signal processing systems.
This research uses an approach that significantly reduces the power consumption and the chip area of parallel multipliers without sacrificing performance. The approach is based on low power, minimal transistor count adders, which are the determining blocks (the second stage of the algorithm) in the performance of the multiplier. The operation of a parallel multiplier can be divided into two parts: (a) formation of the partial products, and (b) summation of these products to form the final product of the multiplication. We realized the digital parallel multipliers using three different adders: the SERF adder [23], the 10T adder [24], and the conventional CMOS 28T adder [33]. The Static Energy Recovery Full adder cell (SERF) [23] was developed using only 10 transistors, and is based on the reuse of charge stored in the load capacitance during the high output to drive the control logic. In addition, the elimination of a direct path to ground reduces power consumption. The 10T adder also uses 10 transistors, but its XOR and XNOR circuits are built in pass transistor logic. Pass transistor logic partially degrades the output signal, which can be overcome by using a driver at the output. To analyze the power savings obtained by these two 10-transistor adder based multipliers, we compared our results with those of a conventional CMOS 28T based multiplier.
In this study, we investigated the power and delay performance characteristics of parallel digital multipliers using two different 10-transistor full adder circuits and a CMOS 28T adder. For the comparative study, we realized multipliers using four different algorithms: Bit-Array, Carry-Save, Wallace Tree and Baugh-Wooley. The tradeoffs between speed and power of these multipliers were compared to those of similar multipliers realized using regular static CMOS adder circuits. In Section 2, we describe the CMOS 28T adder, SERF, and 10T adder circuits used in our design. Section 3 describes the multiplier architectures. Section 4 describes the simulation methodology used. Section 5 discusses the results of the simulation study, and Section 6 presents a summary of the paper and the concluding remarks.
ADDER MODULES
Adders are the fundamental building blocks in all the multiplier modules. Hence employing fast and efficient full adders plays a key role in the performance of the entire system. In the following section we briefly describe the adder modules used in our design. 
Conventional CMOS 28 Transistor (28T) Full Adder
The 28 transistor full adder is the traditional CMOS adder circuit. [33] [34] The schematic of this adder is shown in Figure 1. This adder cell is built using an equal number of N-FET and P-FET transistors. The complementary MOS logic was realized using Eqs. (1) and (2).
The first 12 transistors of the circuit produce C_out and the remaining transistors produce the Sum output. Therefore the delay for computing C_out is added to the total propagation delay of the Sum output. The adder circuit is large and therefore consumes a large on-chip area.
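The dependence of Sum on C_out described above can be illustrated with a short behavioral sketch. It assumes the commonly used carry-first formulation, C_out = AB + C_in(A + B) and Sum = ABC_in + (A + B + C_in)·NOT(C_out); this form and the function names are assumptions for illustration and are not taken from Eqs. (1) and (2).

```python
from itertools import product


def mirror_full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, cout) for one-bit inputs, deriving Sum from NOT(cout).

    Assumed carry-first formulation (not confirmed by the source):
      cout = a*b + cin*(a + b)
      sum  = a*b*cin + (a + b + cin) * NOT(cout)
    """
    cout = (a & b) | (cin & (a | b))             # carry stage is evaluated first
    not_cout = cout ^ 1
    s = (a & b & cin) | ((a | b | cin) & not_cout)  # sum stage reuses NOT(cout)
    return s, cout


def check_all_inputs() -> bool:
    """Compare against integer addition for all eight input combinations."""
    return all(
        mirror_full_adder(a, b, c) == ((a + b + c) & 1, (a + b + c) >> 1)
        for a, b, c in product((0, 1), repeat=3)
    )
```

Because the sum expression consumes the complemented carry, the carry must settle before the sum can, which is the ordering the paragraph above attributes to the 28T cell.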
SERF Adder
The Static Energy Recovery Full Adder (SERF adder) circuit was developed using energy recovery logic and a reduced number of transistors. The schematic of the 10 transistor SERF adder is shown in Figure 2. [23] The basic idea in the SERF adder is the reuse of charge stored in the load capacitance during the high output to drive the control logic. In regular non-energy-recovery adder designs, the charge applied at logic high is drained off during logic low. In the SERF adder this reuse is achieved by using only one voltage source (VDD) in the circuit. As an added advantage, there is no path from one voltage level (VDD) to the other (GND). The elimination of the direct path to ground removes the short circuit power component of the adder module. This reduces the total energy consumed in the circuit and makes it an energy efficient design. The SERF adder is not only energy efficient but also area efficient due to its low transistor count. The main drawback of the SERF adder is the threshold voltage drop at the output for certain input combinations. A detailed comparative study of the SERF adder with other low power adders can be found in Ref. [23].
10T Adder
In the 10T adder cell, the XOR and XNOR of A and B are implemented using pass transistor logic, and an inverter is used to complement the input signal A. This implementation results in faster XOR and XNOR outputs and also ensures that the delays at the outputs of these gates are balanced. This leads to fewer spurious transitions on the SUM and C_out signals. The capacitance at the outputs of the XOR and XNOR gates is also reduced, as they are not loaded with an inverter. If the signal degradation at SUM and C_out is significant for deep sub-micron circuits, drivers can be used to reduce the degradation. The driver helps in generating outputs with equal rise and fall times. This results in better performance in terms of speed, power dissipation and driving capability. The output voltage swing will be equal to VDD if a driver is used at the output. Figure 3 gives the circuit level diagram of the 10T adder. A detailed comparative study of the 10T adder with other low power adders can be found in Ref. [24].
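As a rough behavioral illustration of how SUM and C_out can be derived from the XOR and XNOR of A and B, the sketch below assumes a multiplexer-style selection: C_in selects between XOR and XNOR for the sum, and the XOR output selects between C_in and A for the carry. The source does not spell out this selection scheme, so it should be read as an assumption, and the function names are hypothetical. The sketch models logic values only and not the voltage degradation of pass transistors.

```python
from itertools import product


def xor_xnor(a: int, b: int) -> tuple[int, int]:
    """Return (A xor B, A xnor B) for one-bit inputs."""
    x = a ^ b
    return x, x ^ 1


def mux_full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Assumed multiplexer-style full adder built on XOR/XNOR of A and B."""
    x, xn = xor_xnor(a, b)
    s = xn if cin else x       # Cin selects XNOR or XOR, giving A xor B xor Cin
    cout = cin if x else a     # A != B: carry follows Cin; A == B: carry equals A
    return s, cout


def matches_integer_addition() -> bool:
    """Check the sketch against integer addition for all input combinations."""
    return all(
        mux_full_adder(a, b, c) == ((a + b + c) & 1, (a + b + c) >> 1)
        for a, b, c in product((0, 1), repeat=3)
    )
```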
MULTIPLIER ARCHITECTURES
Multipliers are in fact complex adder arrays. Multiplication is an operation common to a large number of applications, and its complexity has led to a large amount of research directed at speeding up its execution. Multipliers can be implemented using different algorithms, and their performance characteristics vary with the algorithm used. Binary adders are an essential component in the implementation of digital multipliers. With the emergence of power as a design consideration, speed is no longer the only criterion by which implementations are judged. Designing multipliers with low power, energy efficient adders reduces the power consumption and improves the efficiency of the multipliers. In this paper we have concentrated on the design and characterization of four popular multipliers, viz. the Carry-Save multiplier, the Bit-Array multiplier, the Wallace tree multiplier and the Baugh-Wooley multiplier. To evaluate the performance of these four parallel digital multipliers, we implemented them using three adder cells (the SERF adder, the 10T adder and the static CMOS 28T adder). We further implemented each of these multipliers for operand sizes of 2, 4, and 8 bits.
Carry-Save Multiplier
Carry-Save array multipliers [10] have a very regular structure, which makes them amenable to automation. The algorithm is based on the fact that the multiplication result does not change when the output carry bits are passed diagonally downwards instead of only to the right. [10] An extra adder, known as the vector-merging adder, is added so that the final result is obtained. This is called the carry-save multiplier because the carry bits are not immediately added but are rather saved for the next addition stage. In the final stage the carries and the sums are merged in a fast carry-propagate adder stage, usually a carry-lookahead adder. Due to the additional adder there is a slight increase in the area cost. However, the structure uses only short wires to the nearest neighboring cells, and it can be easily pipelined. Another advantage is that there is only one critical path rather than the several identical critical paths found in the generic array multiplier. The general structure of a Carry-Save multiplier is shown in Figure 4. [10] The delay of this multiplier can be expressed [10] as,
where T_and is the delay of the pre-product generating AND gates, T_final is the delay of the final stage carry-lookahead adder, X is the number of partial product stages, and T_carry is the propagation delay between input and output carry. This equation is based on the assumption that the delay for sum generation is equal to that of the carry generation.
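The carry-save idea, in which carries are kept in a separate vector and only merged at the end, can be sketched behaviorally as follows. This is a word-level model for unsigned operands, assuming Python integers stand in for the bit vectors; the function name and the use of `+` for the vector-merging adder are illustrative choices, not the circuit of Figure 4.

```python
def carry_save_multiply(x: int, y: int, n: int) -> int:
    """Multiply two unsigned n-bit integers using carry-save accumulation.

    Each row adds one partial product with bitwise 3:2 compression, so no
    carry ripples inside a row; the saved carries move one column left and
    are merged only once, in the final (vector-merging) addition.
    """
    mask = (1 << (2 * n)) - 1
    s, c = 0, 0                                    # running sum and carry vectors
    for i in range(n):
        pp = (x << i) if (y >> i) & 1 else 0       # partial product row (AND gates)
        # Right-hand side uses the old s and c: sum bits, then majority carries.
        s, c = s ^ c ^ pp, ((s & c) | (s & pp) | (c & pp)) << 1
    return (s + c) & mask                          # vector-merging adder
```

The invariant behind the loop is that s + c always equals the sum of the partial products absorbed so far, which is why a single carry-propagate addition at the end is sufficient.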
Bit-Array Multiplier
Bit-Array multipliers [10] are essentially regular structures and are simple to expand. The structure is similar to the previously discussed Carry-Save multiplier but propagates the carry bits from the full adders in a different fashion. A simple diagram of a 4 × 4 multiplier is shown in Figure 5. [10] Each partial product is generated by the multiplication of the multiplicand with one multiplier bit. The partial products are shifted according to their bit orders and then added. In array multiplication we need to add as many partial products as there are multiplier bits. To perform signed multiplication, the 2's complement number system is used to represent the multiplicand and the multiplier. [1] This implies that all the adders in a particular stage should be of equal bit length. To achieve this, the sign bits of the partial products in the initial row and the sum and carry signals of each adder stage are extended. The extension is carried out until the signal width matches the width of the largest absolute value signal in that stage.
Also, the generation of X partial products requires X × Y two-input AND gates. A large area of the multiplier is devoted to the addition of N partial products, which requires (N − 1) M-bit adders. [1] [10] The shifting of the partial products for proper alignment is performed by simple routing and does not require any logic. The array structure makes it difficult to measure the propagation delay, because the circuit contains more than one critical timing path of identical length. An approximate equation for the propagation delay, shown in Eq. (4), [10] can be obtained by a detailed study of these paths.
where T_and is the delay of the pre-product generating AND gates, T_sum is the delay between the input carry and the sum bit of the full adder, Y is the width of the multiplicand, X is the width of the multiplier, and T_carry is the propagation delay between input and output carry.
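For a rough feel of how the two delay models scale, the sketch below evaluates the textbook first-order estimates for array and carry-save multipliers. The exact forms are assumptions: the array estimate is taken as [(Y − 1) + (X − 2)]·T_carry + (X − 1)·T_sum + T_and and the carry-save estimate as T_and + (X − 1)·T_carry + T_final, which may differ in detail from Eqs. (3) and (4) of this paper. The function names are hypothetical.

```python
def array_multiplier_delay(t_and: float, t_carry: float, t_sum: float,
                           x: int, y: int) -> float:
    """Assumed textbook estimate for a ripple-carry array multiplier.

    x is the multiplier width and y the multiplicand width, following the
    symbol definitions in the text above.
    """
    return ((y - 1) + (x - 2)) * t_carry + (x - 1) * t_sum + t_and


def carry_save_multiplier_delay(t_and: float, t_carry: float,
                                t_final: float, x: int) -> float:
    """Assumed textbook estimate for a carry-save multiplier.

    x is the number of partial product stages; t_final is the delay of the
    final carry-propagate (vector-merging) adder.
    """
    return t_and + (x - 1) * t_carry + t_final
```

With every component delay set to one unit and 4-bit operands, the array estimate evaluates to 9 units and the carry-save estimate to 5 units, which is consistent with the single, shorter critical path described for the Carry-Save multiplier.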
Wallace Tree Multiplier
Wallace trees [10] were first introduced in 1964 in order to design multipliers whose completion time grows as the logarithm of the number of bits to be multiplied. The Wallace tree multiplier is based on a tree structure. A 4-bit Wallace tree multiplier is shown in Figure 6. [10] The Wallace method processes the multiplication operation in three steps:
1. The bit products are formed.
2. The bit product matrix is reduced to a 2-row matrix by using carry-save adders (the Wallace tree).
3. The remaining two rows are summed using a fast carry-propagate adder to produce the product.
To better understand the procedure, we show the transformation process with an example in Figure 7. For a 4-bit operand multiplication the partial products of each stage are 4 bits wide. [10] These partial products are arranged in the form of a tree, as shown in Figure 7(a). From Figure 7(a) it is clear that only column 3 in the array has to add 4 bits. The partial products are therefore rearranged into a tree structure to visually illustrate the depth of the tree. To realize this tree with the minimum number of adders, we need Full Adders (FA) and Half Adders (HA). An FA is also known as a 3:2 compressor, because it takes 3 inputs and produces two outputs, sum (located in the same column) and carry (located in the adjacent column). [10] In the figure, an FA is denoted by a group covering 3 bits and an HA by a group covering 2 bits. To obtain a minimal implementation, we start at the densest part by introducing HAs in columns 3 and 4, as shown in Figure 7(b). The reduced tree is shown in Figure 7(c). Another round of reductions creates a tree of depth 2, which is shown in Figure 7(d). The final stage adder can be realized using any simple two-input adder. In this circuit, a total of three FAs and three HAs are used for the reduction process. The maximum delay for this multiplier is only six adder delays, with four of these being half adders. The propagation delay of the tree is of the order O(log_{3/2}(N)). [10] However, the superior performance of the Wallace tree multiplier comes at a price: a highly irregular and complex structure. This irregularity makes it difficult to lay out the multiplier in a rectangular shape, leading to wastage of chip area and power. The structure of Wallace trees is not unique, i.e., there are several ways of building a particular Wallace tree. For example, a carry generated in one column may be introduced in the next most significant column at different places, e.g., close to the tree root or close to the output.
The Module Generator builds the Wallace trees dynamically with the aim of reducing the number of carries generated in each column, therefore reducing the total tree height.
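The three steps of the Wallace method can be sketched behaviorally as a column compression loop. The sketch assumes one simple reduction order (full adders on groups of three bits, a half adder on a leftover pair in any column that holds more than two bits); since the tree is not unique, the adder counts it reports need not match those of Figure 7. The function name and return tuple are illustrative.

```python
def wallace_multiply(x: int, y: int, n: int) -> tuple[int, int, int]:
    """Return (product, full_adders, half_adders) for unsigned n-bit inputs."""
    # Step 1: form the bit products and group them by column weight i + j.
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append((x >> i) & (y >> j) & 1)
    fa = ha = 0
    # Step 2: compress until every column holds at most two bits.
    while max(len(col) for col in cols) > 2:
        nxt = [[] for _ in range(len(cols) + 1)]
        for k, bits in enumerate(cols):
            if len(bits) <= 2:
                nxt[k].extend(bits)              # nothing to compress here
                continue
            while len(bits) >= 3:                # full adder = 3:2 compressor
                a, b, c, *bits = bits
                nxt[k].append(a ^ b ^ c)                      # sum stays in column k
                nxt[k + 1].append((a & b) | (c & (a ^ b)))    # carry moves to k + 1
                fa += 1
            if len(bits) == 2:                   # half adder on a leftover pair
                a, b = bits
                nxt[k].append(a ^ b)
                nxt[k + 1].append(a & b)
                ha += 1
            else:
                nxt[k].extend(bits)              # zero or one bit passes through
        cols = nxt
    # Step 3: add the two remaining rows; integer + stands in for the fast adder.
    row0 = sum(col[0] << k for k, col in enumerate(cols) if len(col) > 0)
    row1 = sum(col[1] << k for k, col in enumerate(cols) if len(col) > 1)
    return row0 + row1, fa, ha
```

Each full adder replaces three bits with two, so the total bit count falls in every pass and the loop terminates; the weighted sum of all bits is preserved throughout, which is why the final two-row addition yields the product.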
Baugh Wooley Multiplier
The Baugh-Wooley multiplier [35] is used for 2's complement multiplication. It adjusts the partial products to maximize the regularity of the multiplication array. It moves the partial products with negative signs to the last steps and adds the negation of these partial products rather than subtracting them. This technique was developed in order to design regular multipliers suited for 2's complement numbers. A gate-level diagram of a 4-bit Baugh-Wooley multiplier is shown in Figure 8. [10] [35] The equation of the Baugh-Wooley algorithm for an N × N multiplication is given by Eq. (5),
where X and Y are N-bit operands, so their product is a 2N-bit number. Consequently, the most significant weight is 2N − 1, and the first term −2^{2N−1} is taken into account by adding a 1 in the most significant cell of the multiplier. Each of the partial products is formed with AND gates and they are all added together. The outcome is to allow identical stages of logic in the early steps of the multiplication process and to push all the irregularities to the final stage. The delay equation for the Baugh-Wooley multiplier is similar to that of the Array multiplier.
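The effect of replacing the negatively weighted partial products by their complements can be checked with a small behavioral sketch. It assumes the standard form of the Baugh-Wooley identity, in which the cross terms involving a sign bit are complemented and the constants 2^N and −2^{2N−1} are added; this form is an assumption and may be arranged differently from Eq. (5). The function name is hypothetical, and n must be at least 2.

```python
def baugh_wooley_multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-bit two's complement integers (n >= 2); signed result."""
    xb = [(x >> i) & 1 for i in range(n)]    # two's complement bits of x
    yb = [(y >> i) & 1 for i in range(n)]    # two's complement bits of y
    total = 0
    # Positive partial products among the magnitude bits.
    for i in range(n - 1):
        for j in range(n - 1):
            total += (xb[i] & yb[j]) << (i + j)
    # Product of the two sign bits carries a positive weight.
    total += (xb[n - 1] & yb[n - 1]) << (2 * n - 2)
    # Negatively weighted cross terms are added as complements instead.
    for i in range(n - 1):
        total += (1 - (xb[i] & yb[n - 1])) << (i + n - 1)
        total += (1 - (xb[n - 1] & yb[i])) << (i + n - 1)
    total += 1 << n                           # correction constant 2^N
    total += 1 << (2 * n - 1)                 # -2^(2N-1) as a 1 in the MSB column
    total &= (1 << (2 * n)) - 1               # keep the 2N-bit result
    if total >> (2 * n - 1):                  # reinterpret as a signed value
        total -= 1 << (2 * n)
    return total
```

Adding 2^{2N−1} and then truncating to 2N bits is equivalent to subtracting it modulo 2^{2N}, which corresponds to the single 1 inserted in the most significant cell.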
SIMULATION SETUP
The functionality of each of the circuits designed was verified using simulation. The schematics were implemented as layouts using the MAGIC layout editor, and the post-layout parasitics were extracted for SPICE simulations. We used Berkeley's BSIM3v3 [28] SPICE model parameters for all the device models. The simulations were run on a Red Hat Linux 9.0 host machine. Each reading was taken for 10,000 pseudo-randomly generated inputs. This covers all the possible transitions of input combinations. The interval between input pulses was set to about 12 ns to allow the output voltages to stabilize. All the multipliers were analyzed for power consumption, delay and area. To yield appropriate results, we added CMOS inverters at the input and output. For the pass transistor logic designs, the power consumption values also include the buffers used to retain the logic levels. However, while measuring the power and delays, we took only the input driver into consideration and ignored the output buffer. The power dissipated in CMOS digital circuits is given by Eq. (6) [10] [34]
where C is the load capacitance, α is the switching activity, f is the clock frequency, V_dd is the supply voltage of the system, I_sc is the short circuit current and I_leak is the leakage current of the circuit. Delays of the circuit are measured for the worst-case scenario, averaging the low (T_pl) and high (T_ph) transition delays. The delay measurements for each of the multipliers were averaged over 50 simulation runs, and the worst case delay was always taken into consideration. Monte-Carlo analysis was used for all the simulations. This also captures the process variations that occur at each technology size.
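The power and delay quantities defined above can be combined in a small helper. The sketch assumes the standard three-term form P = α·C·V_dd²·f + I_sc·V_dd + I_leak·V_dd, which is consistent with the symbols listed but is not copied from Eq. (6); the function and parameter names are hypothetical.

```python
def cmos_power(c_load: float, alpha: float, freq: float, vdd: float,
               i_sc: float, i_leak: float) -> float:
    """Total power in watts under the assumed three-term CMOS power model."""
    dynamic = alpha * c_load * vdd ** 2 * freq   # switching (dynamic) power
    short_circuit = i_sc * vdd                   # short circuit power
    leakage = i_leak * vdd                       # static leakage power
    return dynamic + short_circuit + leakage


def average_transition_delay(t_pl: float, t_ph: float) -> float:
    """Average of the low and high transition delays, as described above."""
    return (t_pl + t_ph) / 2.0
```

In this model the leakage term does not depend on switching activity or frequency, so it takes a growing share of the total as the dynamic term shrinks.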
BPTM Models

SIMULATION RESULTS
In this section, the performance of all four multipliers with varying bit sizes (2-bit, 4-bit, and 8-bit) using the SERF, 10T and CMOS adders is compared. These results were obtained from SPICE simulations with one common index for all comparisons, i.e., the design constraints were the same for all the multipliers. Though low power is the objective of our design, we also measured the delay and area of these circuits, as they are indicators of good performance.
Power
The energy consumption for all the multipliers investigated is presented in Table I for the 180 nm technology size. For all the operand sizes, the SERF adder based multipliers consumed considerably less energy than the CMOS adder based multipliers. In fact, the SERF based multiplier performed at least thirty-two percent better than any CMOS based version. The SERF based 8 × 8 Bit-Array multiplier had the greatest advantage over its CMOS counterpart, with a sixty percent improvement. The power savings of the 10T based multipliers are smaller than those of the SERF based multipliers; the 10T adder is therefore better suited to designs that already use pass transistor logic. The power consumed by the array multiplier is higher than that of the Baugh-Wooley and Wallace tree multipliers for 4-bit operands, whereas the power consumption of the Baugh-Wooley multiplier is high for 8-bit operands, due to the added carry select adder levels. Since greater numbers of adder cells are used for larger multipliers, the power savings for smaller operand sizes can be directly extrapolated to higher operand multiplier modules. An understanding of the power consumption of all the multipliers below 100 nm would help to show that these multipliers are truly designed for low power. This is shown in Figure 9, where all four multipliers were implemented in the 70 nm, 100 nm, 130 nm, and 180 nm Berkeley BPTM [28] technology models with different adder modules. The graph shows the results for each of the 4 × 4 multipliers. The multipliers built using 10 transistor adders consumed less power than those built using the conventional 28-transistor adder at all technology sizes. Interestingly, the difference between the power consumption of the 10 transistor based multipliers and the 28T multiplier is small at 70 nm, whereas at 180 nm the difference is prominent. This could be due to the high static leakage current at smaller technology nodes dominating the total power consumption of a circuit.
One probable explanation could be that the SERF and 10T based multipliers drew more leakage current than the 28T based multiplier at the 70 nm technology node. Developing hybrid models that take advantage of both adder modules might alleviate this problem to a large extent; this is discussed further in subsection 5.5.
Delay
Propagation delay is a measure of the speed of a circuit, which must be maintained even while consuming low power. Table II gives the delay characteristics of the multipliers used in our study at the 180 nm technology size. For all the multipliers, the delay for the 2 × 2 cell is almost identical because of the simplicity of the design at such a small size. At 4 × 4 and 8 × 8 bit widths, however, the differences between the adder cells are significant. For 8-bit operands, the delay of the SERF adder based multipliers is about 15-20% less, and the delay of the 10T adder based multipliers is approximately 25% less, than that of the CMOS 28T adder based multipliers.
To further analyze the propagation delay of these circuits at smaller technology nodes, we performed simulations for the 70 nm, 100 nm, 130 nm, and 180 nm technology nodes. The results in Figure 10 indicate that the multipliers implemented with 10 transistor adders retain their better delay performance even at smaller technology node sizes. Even though the timing delay of the Wallace Tree multipliers is substantially less than that of the other multipliers at the 180 nm technology node, the differences diminish at the 70 nm technology node.
Area
As most of the portable applications demand smaller silicon area, designing circuits with optimal area is an important performance criterion. Area comparison results for the simulated multipliers are presented in Table III . Area consumed by array multiplier is greatest among all the multipliers studied; the probable reason being its increasing structure (number of FA modules) at higher bit levels. Among all the multiplier configurations, the SERF adder based multiplier consumes the least area. For array multipliers, there is a 50% increase in area when CMOS 28T adders are used as opposed to SERF adders and a 46% increase in area as opposed to 10T adders.
PDP Product
To implement low power dissipation systems, we can either reduce the power consumed by the circuits or increase the computations per unit energy. These two optimizations can be realized only when the design tradeoffs between power and delay are well understood. The optimal setting for the power delay product (PDP) at a particular technology node can be obtained by varying the size of the gates (W/L ratios) and the operating voltage. To understand the best PDP zone for the four multipliers tested, we simulated seven different W/L ratios for the 70 nm technology MOSFETs. For small geometries the effects of parasitic wiring capacitance must be considered in the PDP models. [32] In this research, the following expressions [32] were used for the optimal device sizing
where W_p and W_n are the widths of the PMOS and NMOS transistors, μ_n and μ_p are the mobility of the electrons and the mobility of the holes respectively, and C_wire is the output load capacitance for the multiplier. The width of the pull-up transistors was double the width of the pull-down transistors, based on the fact that the mobility of electrons is higher than the mobility of holes. A detailed derivation of these expressions is beyond the scope of this paper and can be found in Ref. [32]. Figure 11 presents the PDP of the four multipliers with three different adder modules (SERF, 10T, CMOS 28T). The points in the bottom left corner of the graph indicate good PDP zones. For all four SERF adder based multipliers, shown in Figure 11(a), the PDP falls approximately linearly as the gate size is reduced. The ideal zone for all the multipliers in this setup corresponds to the minimal possible device size settings, which is indicated by a circle in the figure. The graph shows that the Carry-Save multiplier exploits the scaled gate widths better than the other three multipliers. However, at the nominal gate width, the Wallace tree multiplier has the best PDP value, as the area and power dissipation are reduced drastically by scaling. This gives an added advantage to the already fast Wallace tree algorithm. For the 10T adder based multipliers, shown in Figure 11(b), the optimal PDP zone shifts slightly to the right, an indicator of poorer performance compared to the SERF adder based multipliers with the same X-axis scaling. Another point to be observed is that the PDPs of the Carry-Save and Bit-Array multipliers overlap for the 10T adder based modules, suggesting that either multiplier is a suitable choice.
The graph for the 28T adder based multipliers, shown in Figure 11(c), uses different X and Y axis scales and yet follows a pattern similar to that of the 10T adder based multipliers, except that the PDP range is poorer than in the prior two cases. Another difference is that the PDP range of the Bit-Array multiplier has deviated further from that of the Carry-Save multiplier. Hence it can be interpreted that the low power adder modules chosen for building the multipliers are the key factor in the performance of the multiplier module. Also, the nominal PDP point for all the multipliers falls in the same power range for the CMOS 28T based multipliers.
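The search for a good PDP zone across gate sizes amounts to computing power times delay for each simulated W/L setting and picking the smallest value. A minimal sketch follows, assuming each simulated point is available as a (W/L ratio, power, delay) tuple; this data layout and the function names are hypothetical and do not reflect the authors' tooling.

```python
from typing import Iterable, Tuple

SweepPoint = Tuple[float, float, float]   # (W/L ratio, power in W, delay in s)


def power_delay_product(power_w: float, delay_s: float) -> float:
    """PDP in joules: average power multiplied by propagation delay."""
    return power_w * delay_s


def best_pdp_point(sweep: Iterable[SweepPoint]) -> SweepPoint:
    """Return the sweep point with the lowest power delay product."""
    return min(sweep, key=lambda p: power_delay_product(p[1], p[2]))
```

Applied to the seven W/L settings of one multiplier, `best_pdp_point` would return the setting that lies closest to the bottom left corner of the corresponding panel in Figure 11.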
Hybrid Models
We observed that the power consumed by the 10 transistor multipliers at the 70 nm technology node is only slightly lower than that of the CMOS 28T based multipliers, in contrast to the 180 nm node. As already mentioned, this could be due to the dominant leakage current at smaller technology nodes. This performance deterioration can be overcome by implementing a hybrid model, in which the 10 transistor adders are used in conjunction with CMOS 28 transistor adders in the multiplier design. An appropriately designed hybrid model would improve both performance and power. This being more of a heuristic approach, we iteratively performed numerous simulations to find the best place to insert the CMOS 28 transistor adder in a 10 transistor adder based model. Further insight into the propagation delay equations shows that the final adder stage is the most critical path in the multiplier operation. Inserting CMOS 28T adders and 10 transistor adders alternately in the final stage would reduce the performance deterioration without impacting the power consumption. The results in Figure 12 indicate that this is indeed plausible. For the hybrid models, we incorporated three different combinations of adder modules, viz. the SERF-28T, 10T-28T, and SERF-10T sets. The simulation results, shown in Figure 12, indicate that the SERF-10T hybrid model has lower power consumption than the SERF adder based model. These results are shown for 8-bit operands because the difference is clearly noticeable at higher bit widths. The SERF-10T hybrid is an ideal model for smaller geometries because the SERF adder has better delay characteristics and the 10T adder draws less static leakage current. The power savings obtained for this hybrid model range from 15-20% for each of the multipliers. The Baugh-Wooley multiplier exploited this model better than the other multipliers. The delay characteristics of these modules are presented in Figure 13.
As expected, the delay of the SERF-10T hybrid model is almost identical to that of the SERF based module. Hence this hybrid model is a very good choice when designing multipliers for low power.
CONCLUSION
In this paper, we have presented the power and speed performance characteristics of four different multipliers realized using 10T, SERF, and CMOS 28T static adders. For comparative analysis, we realized 2 × 2, 4 × 4, and 8 × 8 Carry-Save, Bit-Array, Wallace Tree, and Baugh-Wooley multipliers. In all the multiplier configurations investigated, the SERF adder based multipliers exhibited better power performance than the 10T and CMOS 28T adder based multipliers. The difference between the power consumed by the 10 transistor adder based multipliers and the CMOS 28T based multipliers is smaller at 70 nm than at the 180 nm technology size. The propagation delay of the 10T and SERF based multipliers is better than that of the CMOS 28T based multipliers down to the 70 nm technology size. Optimizing for PDP by scaling the gate sizes indicated a low PDP value for all the multipliers at the 70 nm technology size. Scaling of gate widths resulted in better performance for the SERF adder based multipliers. In general, Carry-Save multipliers displayed the best PDP range compared to the other three multipliers. We further proposed a heuristic approach to incorporate hybrid adder modules in the final stage of multiplication, which forms the critical delay path and is also a zone of high static leakage current. Placing SERF adders and 10T adders alternately in the final stage yields better performance and low power dissipation.
