Optimizing area and speed in parallel prefix circuits has long been considered important. The issue of power consumption in these circuits, however, has not been addressed. This paper presents a comparative study of different parallel prefix circuits from the point of view of the power-speed trade-off. The power consumption and the power-delay product of seven parallel prefix circuits are compared. A linear output capacitance assumption, combined with PSpice simulations, is used to investigate the power consumption in the parallel prefix circuits. The degrees of freedom studied include different parallel prefix algorithms and voltage scaling. The results show that the linear output capacitance assumption provides results that are consistent with those obtained using PSpice simulations. The study can help identify parallel prefix algorithms with desirable power consumption at a given throughput.
INTRODUCTION
The three most widely accepted metrics for measuring the quality of a circuit are its area, speed, and power consumption. Optimizing area and speed has long been considered important, but minimizing power consumption has gained prominence only recently [1, 5, 10]. One important reason for minimizing the power consumption of a circuit is the proliferation of portable electronic systems, such as laptops, mobile phones, and wireless devices, where maximizing battery life is important. Since it is desirable to minimize the size and weight of batteries in such devices while increasing the time between battery recharges, finding methods of reducing power consumption has assumed considerable importance. In this paper, we study the power-speed trade-off for prefix circuits. Prefix circuits play an important role in many applications, including the carry-look-ahead adder, ranking, packing, and radix sort [8]. Many new approaches for prefix circuits with the goal of optimizing depth (i.e., speed) and size (i.e., area) have been proposed [2, 6, 8, 9, 12]. As a result, performance in terms of speed and area has improved. The issue of power consumption in these circuits, however, has not been addressed. Therefore, our goal is to make a comparative study of different prefix circuits from the point of view of the power-speed trade-off, in order to facilitate design choices under given specifications and resource limitations. In this study, we use the power-delay product as a quality measure for the prefix circuits. The power-delay product is the product of the circuit's power consumption and propagation delay, which represents the energy consumed by the circuit per operation. In this paper, we first propose a power model for prefix circuits. Then, the analysis, combined with PSpice simulations [3], is used to investigate the power consumption in the prefix circuits considered. The simulations were carried out over a range of supply voltages (voltage scaling).
It is found that the divide-and-conquer prefix circuit, which is the fastest circuit, consumes the most power. Also, according to the PSpice simulations, the power-delay product of the LYD prefix circuit appears to be the best amongst the circuits considered, while that of the divide-and-conquer circuit is the highest. The rest of this paper is divided into five sections. Section 2 provides an overview of prefix circuits. Section 3 reviews the sources of power consumption in a CMOS circuit and presents strategies to estimate the power consumption of the circuit. Section 4 focuses on modeling the power consumption of the prefix circuits studied here. Section 5 describes the analysis of the power-speed trade-off of the prefix circuits considered. Finally, Section 6 concludes the paper.
PREFIX CIRCUITS - AN OVERVIEW
A prefix computation is the process of taking N input values $x_1, x_2, \ldots, x_N$ and producing N output values $y_1, y_2, \ldots, y_N$ such that $y_i = x_1 \bullet x_2 \bullet \cdots \bullet x_i$ for $1 \le i \le N$, where $\bullet$ is an associative binary operation. A prefix circuit with N inputs can also be viewed as a layered directed acyclic graph with N input nodes, N output nodes, and at least N-1 operation nodes. An operation node is neither an input nor an output node. Figure 1 illustrates the layout and the components of a prefix circuit. The numbers along the left-hand side of the layout give the depth (level) of the operation nodes on the right. The traditional metrics for measuring the performance of a prefix circuit include its size, depth, fan-in, and fan-out. The size of a prefix circuit, size(N), is the total number of operation nodes in the circuit. The depth of a prefix circuit, depth(N), is the length of the longest path from its input nodes to its output nodes, measured as the number of operations along the path. The circuit depth is related to its computation time. In VLSI implementation, a circuit with smaller depth is generally faster than one with greater depth when the fan-out of most nodes in the two circuits is similar [14]. A prefix circuit is depth-optimal if the circuit has the smallest depth among all possible circuits. An N-input prefix circuit is (size, depth)-optimal if size + depth = 2N - 2 [12]. Every prefix circuit has the size-depth trade-off property [6]: a reduction in circuit depth is achieved at the cost of an increase in circuit size. The fan-in of a prefix circuit is the maximum fan-in of all nodes in the circuit. The fan-out of a prefix circuit is the maximum fan-out of all nodes in the circuit. In this study, we are interested in prefix circuits with a fan-in of two, and we assume that the fan-out of the prefix circuit is a function of N. In the rest of this section, we give a brief review of the design of some prefix circuits. For a full description of these circuits, refer to [8] and [13].
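As a minimal sketch of what a prefix computation produces (independent of any particular circuit layout), the following Python function evaluates the outputs one after another. The function name `prefix` and the sample inputs are illustrative choices, not part of the original study.

```python
from operator import add, xor
from typing import Callable, List, TypeVar

T = TypeVar("T")


def prefix(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Return [x1, x1 op x2, ..., x1 op ... op xN] for an associative op."""
    out: List[T] = []
    for x in xs:
        # The first output is the first input; each later output
        # combines the previous output with the next input.
        out.append(x if not out else op(out[-1], x))
    return out


sums = prefix([1, 2, 3, 4], add)      # running sums
parities = prefix([1, 0, 1, 1], xor)  # running parities (XOR is associative)
```

Any associative operation can be substituted for `add` or `xor`; associativity is what allows the operations to be regrouped into the parallel layouts reviewed in this section.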
The Serial Prefix Circuit
The layout of the serial circuit for N inputs, denoted S(N), is illustrated in Figure 2. Clearly, both the size and the depth of this circuit are N-1. The serial prefix circuit has the smallest size amongst all prefix circuits. Moreover, the circuit is (size, depth)-optimal since the sum of its size and depth is 2N - 2. Figures 3 to 9 give illustrations of the divide-and-conquer, Ladner-Fischer ($LF_0$), Ladner-Fischer ($LF_k$), Brent-Kung, Snir, Shih-Lin, and LYD prefix circuits, respectively. Information about their size, depth, and fan-out is given in Table 1.
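The size and depth of the serial circuit can be checked with a short sketch that chains one operation node after each previous output. The helper name `serial_metrics` is hypothetical and only models the counting argument, not the circuit layout of Figure 2.

```python
from typing import Tuple


def serial_metrics(n: int) -> Tuple[int, int]:
    """Return (size, depth) of a serial prefix circuit with n >= 1 inputs."""
    level = [0] * n  # level[i] = depth at which output i becomes available
    size = 0
    for i in range(1, n):
        # Output i needs one operation node fed by output i-1 and input i.
        level[i] = level[i - 1] + 1
        size += 1
    return size, max(level)


size, depth = serial_metrics(8)
is_size_depth_optimal = (size + depth == 2 * 8 - 2)
```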
Parallel Prefix Circuits
) refers to the extra depth (above $\lceil \lg N \rceil$) used to bring about the reduction in size. The circuit size and depth depend on the value of k. Snir [12] showed that the sum of the depth and size of any prefix circuit with N inputs is bounded below by 2N - 2. He also introduced an algorithm to construct a (size, depth)-optimal prefix circuit for any N with the depth in the range
may not exist. Lakshmivarahan, Yang, and Dhall [7] were the first to introduce an algorithm for a (size, depth)-optimal parallel prefix circuit with the depth in the above range. Their design provides (size, depth)-optimal circuits with a smaller depth than hitherto known. Furthermore, for N = 9 to 12, 17 to 20, and 33, the LYD circuits are not only (size, depth)-optimal, but are also depth-optimal. Table 1 provides a comparison of the prefix circuits illustrated in the previous subsection. While the parallel prefix circuits have desirable depths, which are $O(\lg N)$, they differ widely in the number of operations performed. Only four prefix circuits (the serial, Snir, Shih-Lin, and LYD prefix circuits) are (size, depth)-optimal. The divide-and-conquer circuit and the $LF_0$ prefix circuit have the shortest depth, and the serial circuit has the smallest size.
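To make the size-depth comparison concrete, here is a hedged sketch of a divide-and-conquer prefix evaluation that also counts operation nodes and levels. The recursive structure is an assumption on our part: both halves are solved recursively, and one extra level combines the last output of the left half with every output of the right half. The name `dc_prefix` is hypothetical.

```python
from operator import xor
from typing import Callable, List, Tuple


def dc_prefix(xs: List[int], op: Callable[[int, int], int]) -> Tuple[List[int], int, int]:
    """Return (outputs, size, depth) for an assumed divide-and-conquer layout."""
    n = len(xs)
    if n == 1:
        return list(xs), 0, 0
    mid = (n + 1) // 2
    left, left_size, left_depth = dc_prefix(xs[:mid], op)
    right, right_size, right_depth = dc_prefix(xs[mid:], op)
    # Final level: one node per right-half output, all fed by left[-1].
    merged = left + [op(left[-1], r) for r in right]
    size = left_size + right_size + len(right)
    depth = max(left_depth, right_depth) + 1
    return merged, size, depth


outs, size, depth = dc_prefix([1, 0, 1, 1, 0, 0, 1, 0], xor)
exceeds_optimal_bound = size + depth > 2 * 8 - 2
```

Under this assumed structure, an 8-input circuit has depth 3 but its size plus depth exceeds 2N - 2, which is consistent with the divide-and-conquer circuit not being listed among the (size, depth)-optimal designs.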
Comparison
The size-depth trade-off applies to any prefix circuit. For example, the serial prefix circuit performs the fewest operations (i.e., has the smallest size) but has the longest depth, while the divide-and-conquer prefix circuit has the largest size but the smallest depth. Although the Shih-Lin prefix circuit and the Snir prefix circuit have similar circuit layouts, Shih-Lin's circuit has a smaller depth than Snir's circuit. All circuits have unbounded fan-out except the serial circuit, which has a constant fan-out of two. The divide-and-conquer prefix circuit and the
POWER CONSUMPTION IN CIRCUITS
In the previous section, we examined the size and depth trade-offs of different prefix circuit designs. We now examine the power consumption characteristics of these circuits. In this section, the sources of power consumption in circuits are reviewed, and strategies to estimate the power consumption of the prefix circuits are presented.
Sources of Power Consumption
Presently, CMOS (Complementary-symmetry Metal-Oxide Semiconductor) technology is the most popular technology used by the digital IC (Integrated Circuit) industry because of its low power consumption, good scalability, and speed [5, 10, 14]. In CMOS circuits, power consumption is due to the following three types of current flow [14]: (a) static power consumption due to leakage currents; (b) dynamic power consumption due to short-circuit currents; and (c) dynamic power consumption due to switching currents from repetitively charging and discharging the parasitic capacitances at the transistors' gates (Figure 10). In properly designed CMOS circuits, the major portion of the power consumption is from dynamic switching [5, 10, 14]. As a result, in this study, we focus on the dynamic component due to the repetitive charging and discharging of the capacitive loads. The average power consumption in a CMOS gate or module (e.g., an adder) due to switching can be written as [5, 14]:
$$P = \alpha \, C_L \, V_{DD}^2 \, f$$
where $\alpha$ is the switching activity, $C_L$ is the load capacitance, $V_{DD}$ is the supply voltage, and $f$ is the clock frequency. Hence, for a given circuit running at a given speed (i.e., $C_L$ and $f$ constant), power consumption is a function of the supply voltage and the switching activity. Therefore, power reduction can be achieved either by operating the circuit at a lower voltage or by choosing an architecture that reduces the switching activity of the circuit's signals.
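The dependence of switching power on supply voltage and switching activity can be illustrated with a small sketch. It assumes the standard CMOS switching-power expression (switching activity times load capacitance times the square of the supply voltage times frequency); the function name and the numeric values are illustrative and do not come from the study.

```python
def dynamic_power(alpha: float, c_load: float, v_dd: float, freq: float) -> float:
    """Average switching power, assuming P = alpha * C_L * V_DD^2 * f."""
    return alpha * c_load * v_dd ** 2 * freq


# Illustrative values only: 1 pF load, 50 MHz clock, activity 0.5.
base = dynamic_power(alpha=0.5, c_load=1e-12, v_dd=2.8, freq=50e6)
half_voltage = dynamic_power(alpha=0.5, c_load=1e-12, v_dd=1.4, freq=50e6)
half_activity = dynamic_power(alpha=0.25, c_load=1e-12, v_dd=2.8, freq=50e6)

voltage_ratio = half_voltage / base    # quadratic in V_DD
activity_ratio = half_activity / base  # linear in switching activity
```

Halving the supply voltage quarters the switching power, whereas halving the switching activity only halves it, which is why both knobs are considered in the remainder of this section.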
Effect of Voltage Scaling
Due to the quadratic relationship between the supply voltage and the power consumption, lowering the supply voltage can be an effective way to achieve dramatic power savings. However, as the supply voltage is decreased, the circuit delay generally increases, relatively independently of the logic function and style (Figure 11). Thus, reducing the supply voltage unfortunately reduces the system throughput. This loss in throughput can be recovered in some cases by applying architectural techniques to compensate for the additional delay (e.g., using parallelism and pipelining). Reference [5] shows that by changing the circuit architecture it is possible to gain significant speed improvements with only a slight increase in power, hence enabling some voltage down-scaling while maintaining the throughput.
Effect of Switching Activity
The power in CMOS circuits is dissipated when the signals in the circuit switch (i.e., change values). As a result, the amount of switching activity is an indicator of the power consumption. The manner in which the nodes in a circuit are interconnected can have a strong influence on the overall switching activity [5]. Some architectures induce extra transition activity at the operation nodes, called glitching transitions or dynamic hazards, which consume extra power. Glitching is a major problem that increases the effective switching activity, causing a circuit node to undergo several rapid transitions in a single clock cycle [5, 10]. Figure 12 illustrates an example of the glitching behavior for a chain of eight NAND gates [10], obtained using a PSpice simulation [3]. In the simulation, all bits of the first input were set to logic 'one' and all bits of the second input made a transition from logic 'zero' to 'one'. For an ideal circuit without propagation delays, the resultant outputs VOUT2, 4, 6, and 8 would stay at logic 'one' all the time. However, due to the presence of delays, these outputs switch to low temporarily. This glitching causes extra power to be consumed. Outputs VOUT1, 3, 5, and 7 do not glitch; they just have some propagation delay. Note that the degree of glitching depends on the switching pattern of the input signals [10]. To reduce glitching activity, the depth of the signal paths in the circuit should be balanced. Figure 13 gives an illustration of two different circuit architectures of a 4-input adder. We assume that all primary inputs (A, B, C, and D) arrive at time $t_0$ and that the implementation is non-pipelined. In Figure 13(a), the first adder makes one transition when computing A+B, while the second adder initially makes one transition based on C and the previous (initial) value of A+B. Once the correct value of A+B has propagated through the first adder at some later time, the second adder re-evaluates. Thus, there is a second transition at the second adder.
Similarly, there will be three transitions at the third adder. With the path-balancing approach of Figure 13(b), the first and second adders each make one transition, and the third adder makes only two transitions to produce the same output as in Figure 13(a).
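The transition counts in the two adder arrangements can be tallied with a short sketch. It assumes the worst case described above, in which an adder at depth i switches i times; the function names are hypothetical, and the count is a simplified model rather than a switch-level simulation.

```python
def chain_transitions(n_inputs: int) -> int:
    """Worst-case transitions in a linear chain of (n_inputs - 1) adders.

    The i-th adder in the chain sits at depth i and is assumed to switch i times.
    """
    return sum(range(1, n_inputs))


def tree_transitions(n_inputs: int) -> int:
    """Worst-case transitions in a balanced adder tree (n_inputs a power of two).

    Every adder at level l is assumed to switch l times.
    """
    total, level, width = 0, 1, n_inputs // 2
    while width >= 1:
        total += level * width
        level += 1
        width //= 2
    return total


chain4, tree4 = chain_transitions(4), tree_transitions(4)  # 1+2+3 vs 1+1+2
chain8, tree8 = chain_transitions(8), tree_transitions(8)
```

Under this simplified count, the chain needs 6 transitions against 4 for the balanced tree with four inputs, and the gap widens with eight inputs.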
In [5], the "total switched capacitance" of the circuit layouts in Figures 13(a) and 13(b) has been simulated using a switch-level simulator over random input patterns. The results show that the switched capacitance of the circuit layout in Figure 13(a) is larger than that of the layout in Figure 13(b) by a factor of 1.5 for a four-input addition, and 2.5 for an eight-input addition. Hence, increasing circuit depth generally increases the total switched capacitance due to glitching and thus increases power consumption [5]. As a consequence, the amount of transition activity (switching activity) for a layered and non-pipelined circuit can be a function of the depth $d$ and the number of nodes $w_i$ at each level $i$,
$$\sum_{i=1}^{d} i \cdot w_i$$
From this, it follows that a worst-case estimate for the switching activity of such a circuit can grow according to $O(d^2)$, assuming a constant number of nodes at each level. From the previous discussion and the example of Figure 13, we have seen that different circuit architectures for performing the same function can consume different amounts of power. Therefore, implementations of the various prefix circuits in an application will have different power consumption as well. However, for prefix circuits, we cannot say with certainty that the circuit with the longer depth will consume more power than one with a shorter depth. The reason is that both the depth and the number of operation nodes differ among the candidate prefix circuits. In prefix circuits, when the depth decreases, the number of operation nodes (i.e., size) generally increases, and vice versa. This is known as the size-depth trade-off [6, 8]. As a result, the switching activity in a prefix circuit depends not only on its logic depth but also on the number of operation nodes at each level. A circuit with a shorter depth and more nodes might have more switching activity than one with a longer depth and fewer nodes.
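The interplay between depth and nodes per level can be explored with a small sketch. It assumes the worst-case model in which a node at level i switches up to i times, so the activity is the sum over levels of i times the number of nodes at that level. The level widths used for the divide-and-conquer circuit (N/2 nodes at each of lg N levels) are an assumption about its layout, and the function name is hypothetical.

```python
from math import log2
from typing import List


def switching_activity(widths: List[int]) -> int:
    """Worst-case activity: a node at level i (1-based) is assumed to switch i times."""
    return sum(i * w for i, w in enumerate(widths, start=1))


n = 64
serial_widths = [1] * (n - 1)        # one node per level, depth N-1
dc_widths = [n // 2] * int(log2(n))  # assumed: N/2 nodes per level, depth lg N

serial_activity = switching_activity(serial_widths)
dc_activity = switching_activity(dc_widths)
```

For these two extremes the long serial chain accumulates more worst-case activity than the wide, shallow circuit, but circuits with other width profiles can fall anywhere in between, which is why both depth and width must be considered.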
Power Consumption and Fan-out
Besides the switching activity at an operation node, the node's fan-out also has an effect on power consumption in a VLSI circuit design [4, 14]: the larger the fan-out, the more power the circuit consumes, because more signals must be driven. For example, PSpice simulations over random input patterns show that the power consumed by a 2-input XOR gate depends linearly on its fan-out (Figure 14). Hence, fan-out should be taken into account when a power consumption estimate is made for a prefix circuit.
POWER MODELING OF PREFIX CIRCUITS
In this section, we analyze the switching activity and fan-out of each prefix circuit considered. We then use this analysis to estimate and investigate the power-speed trade-off between the various types of prefix circuits. Having seen the various sources of power consumption in general circuits, we now focus on an analytical model, under a linear output capacitance assumption, for predicting the average power consumption of a prefix circuit. As mentioned previously, the signal switching activity has a major influence on the power consumption. Therefore, the switching activity will be used as a basis to determine the power consumption of prefix circuits. Further, as mentioned in Section 3.2, the power consumption of an operation node is a linear function of its fan-out [4]. Therefore, to take into account the effect of fan-out on the output load capacitance of an operation node, we assume that the load capacitance of a node with fan-out $k$ is
$$C_0 + (k - 1)\,C'$$
where $C_0$ is the load capacitance of a node with fan-out 1, and $C'$ is the load capacitance for each additional fan-out (Figure 15). The effective circuit capacitance of a prefix circuit, $cap_{eff}(N)$, is the effective load capacitance of all nodes in the circuit. As defined here, the effective circuit capacitance depends on the input signal patterns and the effects of signal glitching. Thus, if a node output experiences two transitions due to glitching, its effective capacitance is twice its physical capacitance. Because the degree of glitching depends on the input signal patterns, we consider derivations for the worst-case scenario, in which glitching at the nodes is assumed to be the maximum possible. By scaling the effective circuit capacitance by the circuit clock frequency and $V_{DD}^2$, we arrive at our power estimate
$$cap_{eff}(N) \cdot V_{DD}^2 \cdot f$$
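The linear output capacitance assumption and the resulting power estimate can be sketched as follows. The sketch assumes that a node with fan-out k has a base capacitance for its first fan-out plus a fixed increment for each additional one, and that the power estimate is the effective capacitance scaled by the square of the supply voltage and the clock frequency. Function names, parameter names, and the numeric values are illustrative.

```python
def node_capacitance(fan_out: int, c0: float, c_extra: float) -> float:
    """Physical load capacitance: c0 for the first fan-out, c_extra per additional one."""
    if fan_out < 1:
        raise ValueError("fan_out must be at least 1")
    return c0 + (fan_out - 1) * c_extra


def power_estimate(cap_eff: float, v_dd: float, freq: float) -> float:
    """Scale the effective circuit capacitance by V_DD^2 and the clock frequency."""
    return cap_eff * v_dd ** 2 * freq


# A node with fan-out 3 that switches twice (one glitch) counts its
# physical capacitance twice in the effective capacitance.
c_phys = node_capacitance(3, c0=0.9, c_extra=0.3)
c_eff = 2 * c_phys
p_norm = power_estimate(c_eff, v_dd=2.8, freq=1.0)  # normalized frequency
```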
The capacitance evaluation for the various circuits according to our model is made in two steps. In the first step, in Section 4.1, we assume that the load capacitance of each operation node is independent of the fan-out, i.e., the load capacitance is the constant $C_0$. In the second step, we first compute the residual circuit by deleting one output of each operation node with fan-out ≥ 1. We then compute the load capacitance of the residual circuit assuming that the load capacitance of each node is $C'$, independent of the fan-out. This step is repeated $k - 1$ times, where $k$ is the fan-out of the given circuit. This step is performed in Section 4.2. The effective circuit capacitance is the sum of the values obtained in step 1 and step 2. In the following, we compute the effective circuit capacitance for the divide-and-conquer prefix circuit. The effective circuit capacitance for the other prefix circuits can be computed similarly (for details refer to [13]).
Step 1 - The Constant Output Capacitance
In this step, we assume that the physical output capacitance of each operation node is constant. 
The first part of $Kcap_{eff}(N)$ is the constant output capacitance from the two circuits with $N/2$ inputs, while the second part is the capacitance from the last level of $DC(N)$. Solving this recurrence, we get
The corresponding expressions for the other prefix circuits can be computed similarly, although they are generally more challenging because $w_i$ is not always constant (for details refer to [13]).
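One possible reading of the Step 1 recurrence for the divide-and-conquer circuit is sketched below; it is an illustration under stated assumptions, not necessarily the exact expression used in the study. We assume that N is a power of two, that the last level holds N/2 nodes, that each of those nodes switches lg N times in the worst case, and that every node has the constant capacitance `c0`. Both function names are hypothetical.

```python
from math import log2


def kcap_recursive(n: int, c0: float = 1.0) -> float:
    """Assumed recurrence: two half-size circuits plus the last level of N/2 nodes."""
    if n == 1:
        return 0.0
    return 2 * kcap_recursive(n // 2, c0) + (n // 2) * log2(n) * c0


def kcap_closed(n: int, c0: float = 1.0) -> float:
    """Closed form of the assumed recurrence: (N/4) * lg N * (lg N + 1) * c0."""
    lg = log2(n)
    return n * lg * (lg + 1) / 4 * c0


values = {n: (kcap_recursive(n), kcap_closed(n)) for n in (2, 4, 8, 64)}
```

Comparing the two functions for several powers of two is a quick way to check that a candidate closed form actually solves the recurrence it was derived from.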
Step 2 - Capacitance of Residual Circuit
We have assumed that a node with fan-out $k \geq 1$ has a physical output capacitance given as $C_0 + (k - 1)\,C'$. However, the capacitances computed in Section 4.1 for the various circuits are based on the assumption that the capacitance of each node is $C_0$, irrespective of the fan-out of the node. We still need to account for the component $(k - 1)\,C'$ for a node with fan-out $k$, $k > 1$. To get this value, we introduce the concept of the residual circuit. The residual circuit of a prefix circuit is the circuit obtained by eliminating one of the fan-outs from each operation node of the given prefix circuit. For example, Figure 16 shows the residual circuit of the divide-and-conquer prefix circuit. This residual circuit is the result of removing one of the fan-outs from each operation node of the circuit in Figure 3. We can compute the capacitance of this residual circuit, $Rcap_{eff}(N)$, by assuming a constant output capacitance ($C'$) for all operation nodes. We then construct the residual circuit of the current residual circuit by removing one fan-out from each operation node and compute its residual output capacitance. We continue accumulating the capacitances after every reduction until there are no more fan-outs to remove. Thus, the effective circuit capacitance of the prefix circuit using the linear output capacitance assumption is given by the sum of the Step 1 capacitance and the accumulated residual capacitance, $cap_{eff}(N) = Kcap_{eff}(N) + Rcap_{eff}(N)$.
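The residual-circuit procedure can be sketched on an abstract list of nodes, each described by its worst-case transition count and its fan-out. The sketch assumes a base capacitance `c0` for the first fan-out and `c_extra` for each additional one, and checks that peeling off one fan-out at a time gives the same total as applying the linear capacitance directly. The node list and function names are made up for illustration.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (worst-case transition count, fan-out)


def cap_direct(nodes: List[Node], c0: float, c_extra: float) -> float:
    """Effective capacitance using the linear capacitance of each node directly."""
    return sum(t * (c0 + (k - 1) * c_extra) for t, k in nodes)


def cap_two_step(nodes: List[Node], c0: float, c_extra: float) -> float:
    """Effective capacitance via Step 1 plus repeated residual circuits."""
    total = sum(t * c0 for t, _ in nodes)  # Step 1: constant capacitance c0
    # Step 2: remove one fan-out per node; nodes with nothing left drop out.
    residual = [(t, k - 1) for t, k in nodes if k > 1]
    while residual:
        total += sum(t * c_extra for t, _ in residual)
        residual = [(t, k - 1) for t, k in residual if k > 1]
    return total


example_nodes = [(1, 1), (2, 3), (3, 2)]
direct = cap_direct(example_nodes, c0=0.9, c_extra=0.3)
two_step = cap_two_step(example_nodes, c0=0.9, c_extra=0.3)
```

A node with fan-out k survives exactly k - 1 residual rounds, so it contributes `c_extra` that many times, which is why the two totals agree.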
The Divide-and-Conquer Parallel Prefix Circuit
From the layout of the divide-and-conquer prefix circuit in Figure 3, an operation node at level $depth(N/2)$ has the maximum fan-out, which is $(N/2) + 1$. After removing the vertical fan-outs, the residual circuit is shown in Figure 16. The operation node of the residual circuit at level $depth(N/2)$ has the maximum fan-out, which is $N/2$.
The capacitance of the residual circuit is as follows:
The first part of $Rcap_{eff}(N)$ is the residual output capacitance of the two circuits with $N/2$ inputs, while the second part is the residual output capacitance of the last node in the first residual circuit. Solving the recurrence, we get
Thus, the effective circuit capacitance for the divide-and-conquer prefix circuit is as follows.
To summarize, the divide-and-conquer prefix circuit has
Table 2 provides a comparison of the effective circuit capacitance of the prefix circuits described in Section 2. The serial prefix circuit has the largest effective circuit capacitance.
SIMULATION STUDIES
In Section 4, a power model for various prefix circuits was proposed. This section describes the PSpice circuit simulations we conducted to check whether the behavior of the prefix circuits matches the predictions of the effective circuit capacitance model. The degrees of freedom studied include different prefix circuit designs and voltage scaling. Voltage scaling is used because power consumption is a quadratic function of the supply voltage.
Theoretical Results
Figures 18, 20, and 22 give the estimated delay, power consumption, and power-delay product obtained from our theoretical model in Section 4. Figure 18 is obtained by assuming each circuit's delay to be proportional to its depth and applying the normalized delay from Figure 17 to account for the effect of the supply voltage on the delay. The power consumption is estimated using the formula of Eq. 4.1. For this study we used $C_0 = 0.9$ and $C' = 0.3$.
For example, at a supply voltage of 2.8 V, the normalized power consumed by the divide-and-conquer prefix circuit is:
The estimated power consumption of the parallel prefix circuits described in Section 2 is shown in Figure 20. According to the figure, the divide-and-conquer prefix circuit consumes the most power. Figure 22 illustrates the power-delay product. The Brent-Kung prefix circuit has the highest power-delay product, while the divide-and-conquer and the $LF_0$ prefix circuits have a power-delay product lower than that of the Brent-Kung, Snir, Shih-Lin, and LYD prefix circuits. Table 3 shows the estimated power consumption of the different prefix circuits at fixed and reduced supply voltage when $N = 64$. When the supply voltage is fixed at 2.8 V, the divide-and-conquer prefix circuit consumes more power than the other parallel prefix circuits considered. To lower power consumption by reducing the supply voltage, let us assume a fixed acceptable delay. Further, assume that delay is proportional to depth and that a delay proportional to a depth of 10 with a supply voltage of 2.8 volts is acceptable. Thus, the voltage of the Brent-Kung and Snir circuits cannot be lowered, and the delay of the serial circuit is not acceptable. The voltages of the other five prefix circuits (the divide-and-conquer, $LF_0$, $LF_1$, Shih-Lin, and LYD prefix circuits) can be dropped from 2.8 V while still achieving the acceptable delay. For example, because the delay for the divide-and-conquer prefix circuit is proportional to 6 at 2.8 V, the voltage can be dropped from 2.8 V to 1.48 V. The operating frequency can be scaled by a factor of 0.6 (the depth ratio 6/10). Thus, the normalized power consumed by the divide-and-conquer prefix circuit is:
After scaling the supply voltage, there is a power improvement in the circuits having a depth shorter than 10. Among these circuits, the $LF_0$ prefix circuit has a major reduction in power due to its shortest depth.
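The voltage-scaling arithmetic in the divide-and-conquer example can be sketched as a ratio. The sketch assumes that the effective circuit capacitance is unchanged by scaling, so that power scales with the square of the supply voltage times the operating frequency; the function name is hypothetical, and the result is a scaling factor rather than the normalized power value itself.

```python
def scaled_power_ratio(v_old: float, v_new: float, freq_factor: float) -> float:
    """Ratio of power after scaling to power before, for fixed effective capacitance."""
    return (v_new / v_old) ** 2 * freq_factor


# Values from the example: 2.8 V lowered to 1.48 V, frequency scaled by 0.6.
ratio = scaled_power_ratio(v_old=2.8, v_new=1.48, freq_factor=0.6)
```

Under these assumptions the scaled circuit would draw roughly one sixth of its original estimated power.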
Simulation Results
PSpice simulations were carried out on different parallel prefix circuits with 64 inputs, using the XOR gate as the associative binary operation. Figures 19, 21, and 23 give the delay, power consumption, and power-delay product obtained through simulation over random inputs. As expected, amongst the parallel prefix circuits considered, the divide-and-conquer prefix circuit consumes the most power. As the supply voltage is reduced, power consumption is also reduced. Also, though the delay of the divide-and-conquer prefix circuit is the least for some values of the supply voltage, it is not so for lower voltages. This may be due to its very high fan-out compared to the others. From the point of view of the power-delay product metric, the LYD prefix circuit is found to be the best across the entire range of voltage scaling. This means that the circuit provides the best trade-off between power and delay. Another result of the simulation studies is that the power-delay product of the divide-and-conquer circuit is the highest, followed by that of the $LF_0$ circuit. This is at variance with our model prediction and may be due to the fact that these circuits have a very high fan-out (see Table 1 for fan-out). In our theoretical results, we do not take into account the effect of fan-out on the delay. Also according to the simulation, with the voltage-scaling technique, the LYD prefix circuit has the least power consumption compared to the other circuits. For example, let us assume the maximum acceptable delay is 6.4 µs. From Figures 19 and 21, to achieve this delay, the supply voltages of the divide-and-conquer, $LF_0$, $LF_1$, Shih-Lin, and LYD prefix circuits can be 1.8 V, 1.78 V, 1.78 V, 2 V, and 1.8 V, respectively. The powers that the divide-and-conquer, $LF_0$, $LF_1$, Shih-Lin, and LYD prefix circuits then consume are 2.25, 1.94, 1.59, 1.64, and 1.44 W, respectively. This shows that, with an appropriately chosen supply voltage, a power reduction of about 1.6 times can be obtained without speed loss by using the LYD prefix circuit instead of the divide-and-conquer prefix circuit.
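The comparison at the fixed 6.4 µs delay can be reproduced from the power figures quoted above. The dictionary keys are shorthand labels chosen here for illustration.

```python
# Simulated power (in W) at the supply voltage that meets the 6.4 µs delay target.
powers_w = {
    "divide-and-conquer": 2.25,
    "LF0": 1.94,
    "LF1": 1.59,
    "Shih-Lin": 1.64,
    "LYD": 1.44,
}

best = min(powers_w, key=powers_w.get)
reduction = powers_w["divide-and-conquer"] / powers_w[best]
```

The ratio comes out at 1.5625, which rounds to the factor of about 1.6 quoted for the LYD circuit relative to the divide-and-conquer circuit.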
CONCLUSIONS
The power consumption and the power-delay product of seven parallel prefix circuits were compared. We have shown that the use of our effective circuit capacitance provides power estimates that are consistent with PSpice simulations. We have also shown that parallelism at a certain level, coupled with the use of a low supply voltage, can be used to reduce the power consumption of the circuit without throughput loss. The main discrepancy between the model and the simulation lies in the power-delay product metric. This may be due to the fact that the fan-out of the divide-and-conquer and the $LF_0$ prefix circuits is very high compared to the other circuits. In this analysis, we have assumed that the delay is uniquely determined by the depth of the circuit. The results of the simulation of the divide-and-conquer circuit in particular indicate that a large fan-out, in addition to contributing to more power, may also indirectly affect the delay.
