This paper describes a 32-bit Address Generation Unit (AGU) designed for 4GHz operation in 1.2V, 130nm technology. The AGU utilizes a 152ps dual-Vt sparse-tree adder core to achieve 20% delay reduction, 80% lower interconnect density and a low (1%) active leakage energy component. The semi-dynamic implementation enables an average energy profile similar to static CMOS, with a good sub-130nm scaling trend.
AGU Organization
Effective address computation occurs in two phases. In the first phase (clk=0), the pass-gate shifter and the static 3:2 compressors compute the 'carry-save' sum. This static result is set up at the adder inputs before the next phase begins. Final addition is a single-phase operation that occurs in the second phase (clk=1), requiring a high-performance 32-bit adder core.
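The two-phase flow above can be illustrated with a small behavioral sketch. It is a functional model only, not the circuit: the operand names (base, index, scale_shift, displacement) are assumptions made for illustration, since the text specifies only a shifter, 3:2 compressors and a final 32-bit addition.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit datapath


def compress_3to2(a, b, c):
    """Bitwise 3:2 compressor: three operands in, a carry-save pair out."""
    partial_sum = (a ^ b ^ c) & MASK32
    # Majority function gives the carry bits, which weigh one position higher.
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK32
    return partial_sum, carry


def effective_address(base, index, scale_shift, displacement):
    """Hypothetical operand set; only the shift/compress/add flow is from the text."""
    # Phase 1 (clk=0): shift, then compress three operands into two.
    shifted_index = (index << scale_shift) & MASK32
    partial_sum, carry = compress_3to2(
        base & MASK32, shifted_index, displacement & MASK32
    )
    # Phase 2 (clk=1): one carry-propagate addition in the 32-bit adder core.
    return (partial_sum + carry) & MASK32
```

The compressor has no carry propagation between bit positions, which is why it can be a static first-phase block, leaving the only carry-propagate step to the adder core.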
Contrary to the dense carry-merge tree approach of a Kogge-Stone (KS) adder [1][2], we propose a sparse-tree adder that divides the carry-merge tree into critical and non-critical sections. The purpose is to speed up the critical path by moving a portion of the carry-merge logic to a non-critical side path. Instead of generating the carry for each bit (C0, C1, ..., C30, C31), as in the KS approach, the sparse-tree adder generates every 4th carry (C3, C7, ..., C23, C27). Consequently, the critical section reduces to a pruned-down carry-merge tree that consists of a PG generator, followed by 5 stages of carry-merge logic, resulting in a worst-case evaluation path of 3N-2P-2N-2P-2N-2P (Fig. 2). The sparseness of this tree results in a 33/50% reduction in P/G fanouts per stage and a 25% reduction in maximum inter-stage interconnect length, providing a 20% delay reduction at equal energy (Fig. 3) with respect to a KS implementation in a 1.2V, 130nm technology [3]. The non-critical section of the adder consists of a 4-bit conditional sum generator that generates sums assuming an input carry of 0 and 1 (Fig. 4). The non-criticality of the sum generator permits the use of an energy-efficient ripple carry-merge scheme with 60% smaller transistor sizes compared to the critical section. Further optimizations on the first-level carry-merge circuits reduce the logic required for the dual rails of Cin=0 and Cin=1 to Gi# and Pi# respectively (Fig. 7), eliminating a stage of logic from the sum-generator path. The critical and non-critical paths converge at the 2:1 multiplexer, where the 1-in-4 carries select the appropriate conditional sum to deliver the final sum (Fig. 5). This results in a 7-stage adder core design. The 80% reduction in inter-stage interconnect density results in a compact adder layout (Fig. 9).
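As a rough functional sketch of this organization, the model below computes only the carry into each 4-bit group, forms both conditional sums per group with a ripple scheme, and selects between them with the sparse carry. It does not reproduce the stage count or the circuit style of the design; the function names and the log-depth merge used across groups are illustrative assumptions.

```python
WIDTH = 32
GROUP = 4
MASK32 = (1 << WIDTH) - 1


def bit(x, i):
    return (x >> i) & 1


def sparse_carries(a, b):
    """Critical section (model): carry into each 4-bit group, carry-in = 0."""
    g = [bit(a, i) & bit(b, i) for i in range(WIDTH)]  # per-bit generate
    p = [bit(a, i) ^ bit(b, i) for i in range(WIDTH)]  # per-bit propagate
    # Merge the bits of each group into one (generate, propagate) pair.
    pref = []
    for lo in range(0, WIDTH, GROUP):
        grp_g, grp_p = 0, 1
        for i in range(lo, lo + GROUP):
            grp_g = g[i] | (p[i] & grp_g)
            grp_p &= p[i]
        pref.append((grp_g, grp_p))
    # Log-depth prefix merge across the 8 groups (assumed merge pattern).
    dist = 1
    while dist < len(pref):
        nxt = list(pref)
        for k in range(dist, len(pref)):
            g_hi, p_hi = pref[k]
            g_lo, p_lo = pref[k - dist]
            nxt[k] = (g_hi | (p_hi & g_lo), p_hi & p_lo)
        pref = nxt
        dist *= 2
    # pref[k][0] is the carry out of group k, i.e. C(4k+3); group 0 sees 0.
    return [0] + [pref[k][0] for k in range(len(pref) - 1)]


def conditional_sum(a_nib, b_nib, cin):
    """Non-critical section (model): 4-bit ripple sum for an assumed carry-in."""
    total, carry = 0, cin
    for i in range(GROUP):
        ai, bi = bit(a_nib, i), bit(b_nib, i)
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | ((ai ^ bi) & carry)
    return total


def sparse_tree_add(a, b):
    a &= MASK32
    b &= MASK32
    result = 0
    for k, cin in enumerate(sparse_carries(a, b)):
        a_nib = (a >> (GROUP * k)) & 0xF
        b_nib = (b >> (GROUP * k)) & 0xF
        sum0 = conditional_sum(a_nib, b_nib, 0)
        sum1 = conditional_sum(a_nib, b_nib, 1)
        # 2:1 multiplexer: the 1-in-4 carry selects the conditional sum.
        result |= (sum1 if cin else sum0) << (GROUP * k)
    return result
```

Because both conditional sums depend only on the local operand bits, they can be computed in parallel with the carry tree, which is what allows the sum generators to sit off the critical path.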
Dual-Vt Semi-dynamic Design
The performance criticality of the AGU demands a dynamic adder implementation. Partitioning the carry-merge tree into critical and non-critical sections enables an energy-efficient implementation by leveraging dynamic, static and dual-Vt design techniques. To enable single-phase operation, the performance-critical sparse tree is implemented in single-rail dynamic logic using a simple clocking scheme with seamless time-borrowing at locally generated staggered clock boundaries (Fig. 8). The non-critical conditional sum generators are designed in static CMOS. Static and dynamic signals converge at the static pass-gate multiplexer, thus avoiding irrecoverable false evaluations that may occur at this interface (Fig. 5).
Variation in the arrival times of these signals can cause a multiplexer delay variation of up to 5%. To avoid precharging the static paths, the first stage of the static section is converted to a set-dominant latch. We thus leverage the low switching activity (30%) of static CMOS to reduce the average power dissipated in the adder by 15% without affecting performance. As a result, the average energy profile of the semi-dynamic design scales with data activity, as opposed to the flat profile of a conventional dynamic design (Fig. 6). The delay spread between the critical and non-critical sections of the adder can be further exploited by using low-Vt devices in the critical sparse tree and high-Vt devices on the non-critical sum-generator paths. This approach offers an additional 50% reduction in leakage energy without impacting performance (Table 1).
Scaling Performance
The sparse-tree adder design results in a low average transistor size of 3.5µm. The consequently low active leakage energy component (1%) minimizes the impact of higher leakage in future technologies. Furthermore, the decrease in inter-stage interconnect reduces the effect of increased wire delay in future technologies [4]. In a 100nm technology, where device leakage is expected to increase by 3-5x [5], we project 33% delay improvement and 50% energy reduction, with a low (4%) leakage energy component.
The 1.2V, 130nm semi-dynamic sparse-tree adder operating at 4GHz offers 20% delay reduction compared to a Kogge-Stone adder, with a low (1%) active leakage energy component. The adder provides dynamic CMOS performance and an energy profile similar to static CMOS, with good scaling trends to sub-130nm technologies. 

