Abstract: Pipelining digital systems has been shown to provide significant performance gains over non-pipelined systems and remains a standard in microprocessor design. The desire for increased performance has driven a push for deeper pipelines, as well as the introduction of pipelining schemes such as wave-pipelining and hybrid wave-pipelining. In this paper we present a hybrid wave-pipelined parallel adder that operates at 1.79 GHz, a 42.9% performance improvement over a superpipelined adder. The simulations have been performed using a modest 0.25 µm technology. The three-stage hybrid wave-pipelined parallel adder sustains a total of 8 unrelated data waves within the pipe. Another performance benefit of the hybrid wave-pipelining scheme is the lessening of delays associated with clock skew and clock distribution.
I. INTRODUCTION
High performance data path circuits continue to be a topic of interest, especially as technologies are scaled to the nanometer regime. Adders fall under this group and have been the subject of in-depth analysis for decades. Careful optimization of adders and other data path circuits will grow in importance as methods to reduce power while maintaining or improving performance are sought. There exist numerous adder implementations, each with good attributes and some drawbacks. Examples include ripple carry adders, carry look-ahead adders, carry skip adders, and carry select adders [1], to name a few. Each of these examples has numerous variants designed to enhance performance. The advent of wide data paths requiring upwards of 64-bit additions has further intensified the need for adder optimization. As the number of input bits increases, so does the delay associated with the computation of the carries. The desire to reduce the delays associated with carry propagation has resulted in novel adder architectures and implementations [2], [3], [4]. In this paper we present a pipelined parallel adder that employs the novel hybrid wave-pipelining scheme to attain significant performance gains over conventional pipelined adders. In Section II we briefly discuss pipelining schemes and their performance related evaluations. We also analyze the data dependent delays of a select few logic gates as they relate to wave- and hybrid wave-pipelining. In Section III we present the results of a 32-bit hybrid wave-pipelined adder, and some concluding remarks appear in Section IV.
II. PIPELINING SCHEMES
Pipelining allows unrelated data waves to be overlapped in execution, leading to an increase in throughput by a factor of $n$ for an $n$-stage system. The scheme has the following shortfalls. (i) The stage with the longest delay sets the pipeline's clock frequency. This leads to underutilization of logic, since stages that complete their computations within a time shorter than the clock cycle time remain idle for some fraction of the clock cycle. (ii) Each pipeline stage operates on just one data wave, i.e. there is no overlapping of unrelated data waves within a single stage. Data must completely clear a stage and be latched at an intermediate latch before new data can be admitted into the pipe stage. (iii) The latches might require elaborate clock networks to manage skew and distribution (depending on the depth of the pipe and stage complexity).
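Shortfall (i) can be made concrete with a small sketch. The stage delays below are hypothetical values chosen for illustration and are not taken from the adders in this paper; the function names are likewise assumptions. The sketch only shows that the slowest stage fixes the clock period and that faster stages sit idle for the remainder of each cycle.

```python
def conventional_clock_period(stage_delays_ps, register_overhead_ps=0.0):
    """Clock period of a conventional pipeline: the slowest stage plus register overhead."""
    return max(stage_delays_ps) + register_overhead_ps


def idle_fractions(stage_delays_ps, register_overhead_ps=0.0):
    """Fraction of each clock cycle during which a stage has finished and sits idle."""
    period = conventional_clock_period(stage_delays_ps, register_overhead_ps)
    return [1.0 - (d + register_overhead_ps) / period for d in stage_delays_ps]


# Hypothetical three-stage pipeline (delays in picoseconds, illustrative only).
stage_delays_ps = [300.0, 450.0, 600.0]
period_ps = conventional_clock_period(stage_delays_ps)
idle = idle_fractions(stage_delays_ps)

for delay, fraction in zip(stage_delays_ps, idle):
    print(f"stage delay {delay:6.1f} ps -> idle {fraction:.0%} of a {period_ps:.0f} ps cycle")
```

In this illustrative case the first stage is idle for half of every cycle, which is the underutilization that wave-pipelining and hybrid wave-pipelining attempt to recover.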
To address some of these issues, wave-pipelining [5], [6], [7], [8] has been revived. The wave-pipelining scheme offers faster clock rates by minimizing the delay difference between the system's short and long paths. The logic path is designed to be long enough to sustain several unrelated data waves. This enables the removal of intermediate registers, which reduces clock loading. The scheme achieves a balanced pipeline by inserting buffers in the shorter data paths. A wave-pipelined system operates asynchronously locally while externally it has the appearance of a synchronous system. The speed-up achieved is similar to that of conventional pipelining, but at a higher frequency. Some of the shortfalls of wave-pipelining include (i) vulnerability to process, voltage and temperature variations [9], (ii) the possibility of overshooting the longest path's delay while padding the shorter paths, and (iii) intermediate nodes becoming difficult to observe in the absence of latches.
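The effect of padding the short paths can be sketched with a first-order timing model. The model below is an assumption made for illustration: it takes the minimum clock period to be the long-path/short-path delay difference plus a lumped overhead term (standing in for setup, hold and skew margins), and it estimates the number of waves in flight as the long-path delay divided by the clock period, rounded up. All delay values are hypothetical.

```python
import math


def wave_min_clock_period(d_max_ps, d_min_ps, overhead_ps=0.0):
    """First-order bound: the period must cover the path delay difference plus overhead."""
    if d_min_ps > d_max_ps:
        raise ValueError("short-path delay cannot exceed long-path delay")
    return (d_max_ps - d_min_ps) + overhead_ps


def waves_in_flight(d_max_ps, clock_period_ps):
    """Approximate number of unrelated data waves inside the logic at once."""
    return math.ceil(d_max_ps / clock_period_ps)


# Hypothetical logic block before and after padding the short paths with buffers.
unpadded_period = wave_min_clock_period(1000.0, 500.0, overhead_ps=50.0)  # 550 ps
padded_period = wave_min_clock_period(1000.0, 800.0, overhead_ps=50.0)    # 250 ps

print(unpadded_period, waves_in_flight(1000.0, unpadded_period))
print(padded_period, waves_in_flight(1000.0, padded_period))
```

Under these assumed numbers, padding the short path from 500 ps to 800 ps shortens the admissible clock period and doubles the number of waves the block sustains. Padding past the long-path delay, shortfall (ii) above, would lengthen the critical path instead.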
A novel pipelining scheme termed hybrid wave-pipelining, reported in [10], [11], [12], has been shown to further reduce the clock cycle time by re-introducing a few of the conventional pipeline registers and pipelining the clock to match the logic path delays. The hybrid wave-pipelining scheme minimizes the delay difference per stage and allows the stages to be long enough to sustain more than a single wave. Pipelining the clock offers a way to manage clock skew and distribution, since the clock now experiences delays similar to those of the data. Re-introducing some of the registers of conventional pipelining makes the intermediate nodes easily observable, and to some degree data dispersion can be tolerated better than in wave-pipelining because of the intermediate registers. It must be noted that the registers used in hybrid wave-pipelining do not give the scheme the fine granularity of conventional pipelining, since more than a single data wave can be sustained within a stage. The approach allows for further clock cycle time reduction since delay minimization is performed per stage. The hybrid wave-pipelining scheme's shortfalls include: (i) increased design time, which we expect can be reduced by making the design semi-custom, with regular structures such as registers placed in libraries, and (ii) susceptibility to process, temperature and voltage variations.
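A rough sketch of why per-stage delay minimization can shorten the cycle time follows. It rests on two simplifying assumptions that are not taken from [10], [11], [12]: the clock period is bounded by the largest delay difference within any one register-bounded stage plus a lumped overhead, and without intermediate registers the delay differences of the stages accumulate along the whole path. The stage delays are hypothetical.

```python
import math

# Hypothetical (long-path, short-path) delays per stage, in picoseconds.
STAGES_PS = [(900.0, 700.0), (600.0, 450.0), (750.0, 600.0)]
OVERHEAD_PS = 50.0  # assumed lumped register and skew overhead


def whole_path_period(stages, overhead_ps):
    """No intermediate registers: delay differences accumulate over the full path."""
    total_max = sum(d_max for d_max, _ in stages)
    total_min = sum(d_min for _, d_min in stages)
    return (total_max - total_min) + overhead_ps


def per_stage_period(stages, overhead_ps):
    """Intermediate registers: only the worst single-stage difference matters."""
    return max(d_max - d_min for d_max, d_min in stages) + overhead_ps


def waves_per_stage(stages, period_ps):
    """Approximate number of data waves each stage sustains at the given period."""
    return [math.ceil(d_max / period_ps) for d_max, _ in stages]


wp_period = whole_path_period(STAGES_PS, OVERHEAD_PS)
hwp_period = per_stage_period(STAGES_PS, OVERHEAD_PS)
hwp_waves = waves_per_stage(STAGES_PS, hwp_period)

print(wp_period, hwp_period, hwp_waves, sum(hwp_waves))
```

With these illustrative numbers the per-stage bound is less than half the whole-path bound, and every stage still holds more than one wave, which is the coarse granularity noted above.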
A. Pipelined Adders
In this sub-section we reference a couple of conventionally pipelined (CP) adders and highlight the significant issues the designs address. We then present the hybrid wave-pipelined (HWP) adder architecture and show how it can be manipulated to form either a conventional- or wave-pipelined (WP) adder. Pipelining improves throughput at the expense of latency; however, once the pipe is filled we can expect one data item per unit of time. The conventional pipelined adder designs that have been reported include one that uses overlapped clocks in an effort to eliminate sources of overhead [13]. This 4-bit carry propagate adder employs a series of three registers to equalize the delays in adding the four bits and has a three-cycle latency [13]. Time borrowing is performed to shorten the critical path, and the adder design has been realized in 0.35 µm technology. A 32-bit carry-select adder in 0.25 µm technology, operating at 1.67 GHz, is reported in [14]. A number of pipeline registers are introduced and, in a similar fashion to [13], several of these registers are inserted in series to equalize data arrival times at adder units. The gain in speed is achieved by clocking sub-circuits faster than would be possible with a ripple carry adder. These two conventional pipelined adder architectures achieve path delay equalization by inserting registers. Where these registers are used for delay equalization, no logic is placed between them; the output of one register connects directly to the input of the next. Figure 1 shows the register placement. Introducing several registers increases the clocking overhead and skew. The adders described in [13] and [14] may be of different types (carry propagate and carry select); however, the principles leading to the use of pipelining are the same. In this paper a parallel adder is used to present the performance gains earned through the use of hybrid wave-pipelining.
The wave-pipelining scheme of [15] eliminates the intermediate registers of conventional pipelining and thus eliminates the register overhead and associated clock skew. The hybrid wave-pipelining scheme [10], [11], [12], on the other hand, uses intermediate registers but without as fine a granularity as that of conventional pipelining.
We have fashioned our hybrid wave-pipelined adder after the wave-pipelined one reported in [15]. The equations used to describe addition are those reformulated by Brent and Kung [16].
[Figure 2: block diagram of the adder, with carry outputs labeled c0 through c30]
If only the input and output registers are left while the other two intermediate latches are removed from the block diagram of Figure 2, we would have the wave-pipelined adder architecture of [15]. The major drawback of both this hybrid wave-pipelined and the wave-pipelined adder architecture is that the fan-out of the stage that computes the terms grows by a factor of $x/2$, where $x$ denotes the number of the adder's inputs. A 32-bit adder therefore has a fan-out of 16 at this stage. For data paths wider than 32 bits, accepting increased latency in exchange for a limited fan-out becomes preferable. The hybrid wave-pipelining scheme does not require extensive data path equalization since it can rely on the intermediate registers to perform this function. Though not a necessity, padding can allow for still shorter clock cycle times. In this study the short data paths have been padded in a similar way to that of [15]. Figure 3 shows the blocks of gates that have been used to realize the computation of the carries. Shaded and blank squares/circles in the figure map directly to those of Figure 2. In the following sub-section we discuss the data dispersion issues that have led to the use of biased NAND/AND gates in place of the standard CMOS circuits.
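For readers unfamiliar with the Brent and Kung formulation [16], it is commonly stated as follows: with g_i = a_i AND b_i and p_i = a_i XOR b_i, the associative operator (g, p) o (g', p') = (g OR (p AND g'), p AND p') is applied across bit positions, the carry out of bit i is the generate term of the combined prefix, and the sum bit is p_i XOR c_(i-1). The Python sketch below illustrates this under stated assumptions: the notation is the standard textbook one rather than this paper's, the function names are hypothetical, and the recursive-doubling schedule is chosen only for brevity and is not the tree of Figure 2.

```python
def combine(high, low):
    """Brent-Kung operator: (g, p) o (g', p') = (g | (p & g'), p & p')."""
    g, p = high
    g_low, p_low = low
    return (g | (p & g_low), p & p_low)


def prefix_carries(a, b, width=32):
    """Return per-bit propagate signals and carry-outs, assuming a carry-in of 0."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    gp = list(zip(g, p))
    dist = 1
    while dist < width:
        # Each pass doubles the span of bits covered by every (G, P) pair.
        gp = [combine(gp[i], gp[i - dist]) if i >= dist else gp[i]
              for i in range(width)]
        dist *= 2
    carries = [group_g for group_g, _ in gp]  # carries[i] is the carry out of bit i
    return p, carries


def prefix_add(a, b, width=32):
    """Add two unsigned integers; returns (sum modulo 2**width, carry-out)."""
    p, carries = prefix_carries(a, b, width)
    total = 0
    carry_in = 0
    for i in range(width):
        total |= (p[i] ^ carry_in) << i
        carry_in = carries[i]
    return total, carries[width - 1]


print(prefix_add(0xFFFFFFFF, 1))
print(prefix_add(123456789, 987654321))
```

Because the operator is associative, the carries can be evaluated in a tree of logarithmic depth, which is what makes a parallel adder of this kind a natural candidate for pipelining by levels.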
B. Data Dependent Delays
One of the major design issues of wave-pipelining is the need to balance data paths. Because there are no intermediate registers, the data paths must be equalized. Data paths differ due to input vector patterns, fan-outs, wire delays, differing logic paths, and process, temperature and voltage variations. In this subsection we discuss data dispersion due to differing input patterns. Some CMOS logic gates suffer from these data dependent delays [15]. The CMOS inverter is the easiest to design and can be designed to have a near perfect response to both a logic zero and a logic one. CMOS AND and OR gates will have different delays depending on the input signals because of the parallel pull-up or pull-down network. We provide delay statistics for both the CMOS and biased AND gates in Table I. It is apparent from the displayed values that the biased AND gate has improved data dispersion. There are some CMOS logic gates, in addition to the inverter, whose delays are not heavily influenced by input data patterns. The CMOS XOR is one such logic gate, and it owes this stability to the uniformity of both its pull-up and pull-down networks. The XOR Boolean equations dictate that there be an equal number of series devices in the pull-up and pull-down paths.
III. 32-BIT PARALLEL ADDER RESULTS
Simulations of 32-bit parallel adders have been performed for conventional pipelining, wave-pipelining and hybrid wave-pipelining. We consider pass transistor XOR gate circuits for the conventional pipelined adder for fair comparison. We give the pipelined and wave-pipelined adders significant advantages over the hybrid wave-pipelined adder, particularly with regard to clock generation and distribution. The clock for the pipelined adder does not experience any delays, as would be the case if an elaborate clock tree were used. The fan-out of the different stages of the hybrid wave-pipelined adder has been managed differently, taking into account the significantly low drive capability of future technologies.
If the conventional pipelined adder is super-pipelined, its estimated clock cycle time is 800 ps. The wave-pipelined adder has a clock cycle time of 670 ps. Running the wave-pipelined adder any faster than this leads to data overrun. The hybrid wave-pipelined adder has a shorter clock cycle time than the other two adders: the cycle time using this scheme is 560 ps. These values are tabulated in Table II. The hybrid wave-pipelined adder runs 1.429 times faster than the conventional pipelined adder and 1.196 times faster than the wave-pipelined adder. Figure 4 shows the number of intermediate waves that stage one of the hybrid wave-pipelined adder can sustain. The pipelined local clock signals clocking new data in and old data out of the stage are also shown; notice that the edges of the local clocks need not be synchronized. The stage must initially be filled with data before its output is valid, so the output remains at logic "0" until the stage computes the initial values. Pipelining the clock in this manner permits each data wave to travel through the pipe stages at the same rate as its associated clock. The benefits of pipelining the clock are three-fold: (i) we eliminate the need for an elaborate clock distribution network, which might see the latches receiving the significant clock edge at differing times; (ii) should we choose to employ clock gating, there is no need for additional clock pulses to flush the pipe, since admitted data is guaranteed to propagate through all the pipe stages because local clocks will be generated; and (iii) there is potential for power savings, assuming that the clock distribution network dissipates a significant amount of power. Additional clock pulses associated with attempts to flush the pipe dissipate power, and they do not exist with this scheme.
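The speed-up figures quoted above follow directly from the three cycle times. The short sketch below reproduces the arithmetic; the dictionary keys and function names are illustrative and are not taken from the paper's tables.

```python
# Clock cycle times reported in the text, in picoseconds.
CYCLE_TIME_PS = {
    "conventional (super-pipelined)": 800.0,
    "wave-pipelined": 670.0,
    "hybrid wave-pipelined": 560.0,
}


def frequency_ghz(period_ps):
    """Convert a clock period in picoseconds to a frequency in gigahertz."""
    return 1000.0 / period_ps


def speedup(baseline_ps, improved_ps):
    """Ratio of cycle times, i.e. how many times faster the improved design clocks."""
    return baseline_ps / improved_ps


hwp_ps = CYCLE_TIME_PS["hybrid wave-pipelined"]
print(f"HWP clock frequency: {frequency_ghz(hwp_ps):.2f} GHz")
for name, period in CYCLE_TIME_PS.items():
    if period != hwp_ps:
        print(f"HWP vs {name}: {speedup(period, hwp_ps):.3f}x")
```

The 560 ps cycle time corresponds to the 1.79 GHz operating frequency, and the two ratios match the 1.429 and 1.196 factors reported for the conventional and wave-pipelined adders respectively.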
It must be noted that power evaluation is work in progress as of this writing. For this particular adder design, inverters have been used to generate and pipeline the local clocks. Simulation results show that data arrive at the registers together with the associated clock signals, as evidenced by the traces of Figure 4. Another important aspect of the hybrid wave-pipelining scheme is that the frequency can be slowed to allow the design to operate as a conventionally pipelined system. The benefit here could be in testing. Re-introducing intermediate latches also allows some internal nodes to be observable.
IV. CONCLUSION
In this paper we have presented a hybrid wave-pipelined adder and shown that it outperforms both conventional- and wave-pipelined adders, operating 1.429 and 1.196 times faster respectively. The three-stage hybrid wave-pipelined adder is capable of sustaining 8 unrelated data waves within the pipe, whereas a 3-stage CP adder would hold only 3 unrelated data waves. The hybrid wave-pipelined scheme allows data to travel with its associated clock through the different pipe stages and thus allows nodes to be observable, a difficult task to achieve in wave-pipelining. Pipelining the clock lessens clock skew (no single clock is required to drive the system's multiple registers). In addition, hybrid wave-pipelining will operate at any slower clock frequency, unlike wave-pipelining, which operates only on disjoint sets of points [7]. Future work will involve studying process, temperature and voltage variations and how they impact performance in the hybrid wave-pipelined scheme. The power savings potential of the hybrid wave-pipelining approach is under consideration. It must be noted that the scheme is not limited to adders, but is applicable to other systems or subsystems that require high speed operation.
