Sun Microsystems, Sunnyvale, CA

This 64b SPARC processor is designed for compute-dense systems such as rack-mount and blade servers for network computing. This type of application requires high computing throughput for executing multiple threads/processes simultaneously, high memory bandwidth, a large address space, low power and low cost [1]. The optimum design solution is a dual-core processor based on the UltraSPARC II microarchitecture [2] with an on-chip 1MB L2 cache, a DDR-1 memory controller and multiprocessor bus interfaces (Fig. 3.2.1). This core provides efficient performance per watt with balanced hardware complexity: 4-issue superscalar, 9-stage pipeline and in-order execution / out-of-order completion. Typical power dissipation at 1.2GHz and 1.3V is 23W, which is substantially lower than that of other 64b counterparts. The chip is fabricated in Texas Instruments' 0.13µm CMOS process with 7 layers of Cu and a low-k dielectric. The transistor count is 80M, of which 72M are in SRAM. The 206mm² die is packaged in a 959-pin ceramic µPGA. The two major design challenges were to redesign the core circuits, last implemented in a 0.25µm/2.1V process, for a 0.13µm/1.3V process while coping with deep-submicron technology issues facing the industry today, such as negative bias temperature instability (NBTI), and to achieve low L2 cache latency including error correction.
NBTI is an aging effect that decreases PMOS current, mainly through a Vt shift over the silicon lifetime [4]. This Vt shift is strongly dependent on gate-source bias and temperature but barely dependent on drain voltage. Many circuits are modified to enhance margins for NBTI. In particular, the current-mode latch sense amps used for the L1 caches and TLBs (Fig. 3.2.2) [3] are highly impacted, with the total sense delay degrading by 42% (Fig. 3.2.3c). The cross-coupled PMOS pair, M3 and M4, which acts as low-input-impedance devices during equilibration and provides positive feedback while sensing, is unevenly affected because 0s and 1s are read at unequal rates. This causes a Vt mismatch of as much as 50mV between the PMOS pair, requiring a longer signal development time to overcome the offset. The PMOS Vt shift also attenuates the gain. To cope with NBTI, the cross-coupled PMOS pair M3 and M4 is replaced by T3 and T4, whose gates share a common bias of about 40% of Vdd during sensing. Biased in saturation, these act as low-impedance devices. As the gate bias is identical for T3 and T4, the Vt imbalance is minimized. In addition, PMOS T1 and T2 are added to speed up the low-to-high transition of the outputs. Although T1 and T2 can experience an uneven Vt shift, this is not critical because they are activated after the critical part of the amplification. These modifications improve the deteriorated sense delay by 22%, achieving a 15% speedup over voltage sense amps (Fig. 3.2.3d).
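As a rough arithmetic check on the percentages above, the sketch below composes them under one possible reading: the 42% degradation is taken relative to the fresh original sense amp, and the 22% improvement is taken relative to that aged (deteriorated) delay. Both the reading and the normalized delay unit are assumptions made for illustration; the figures referenced in the text are the authoritative data.

```python
# Normalized sense-delay arithmetic. The multiplicative composition of the
# two percentages is an ASSUMED reading of the text, not a published result.

def scaled(delay: float, percent_change: float) -> float:
    """Return delay changed by percent_change (positive = slower)."""
    return delay * (1.0 + percent_change / 100.0)

FRESH_ORIGINAL = 1.0                           # normalized delay, unaged original
AGED_ORIGINAL = scaled(FRESH_ORIGINAL, +42.0)  # 42% degradation from NBTI
AGED_MODIFIED = scaled(AGED_ORIGINAL, -22.0)   # 22% better than the aged original

# Residual penalty of the modified circuit relative to the fresh original.
RESIDUAL_PERCENT = (AGED_MODIFIED / FRESH_ORIGINAL - 1.0) * 100.0
```

Under this reading the modified sense amp at end of life is roughly 11% slower than the fresh original (1.42 × 0.78 ≈ 1.11), rather than 42% slower.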
Additional circuit modifications are required to address leakage and noise issues. The I-cache wordline detector in Fig. 3.2.4 is one example. This circuit is a 256-input OR gate, built from two levels of 16-input self-resetting dynamic OR gates, that detects a wordline transition to generate the sense amp strobe. Node n1 is a wired-NOR net that is susceptible to leakage and noise, as 16 NMOS devices are connected in parallel along a long wire. In the original circuit, n1 is precharged to Vdd-Vt for speed. However, the noise margin of this circuit is reduced by the lower supply voltage: a 100mV drop at n1 can cause the circuit to fail, as M1 turns on easily and discharges node n2. In the new circuit, n1 and n2 are both precharged to Vdd, with NMOS T1 between them. The gate of T1 is high during evaluation and low during precharge. During evaluation, T1 acts as a noise decoupler, since a voltage drop at n1 does not propagate to n2 unless it is large enough to turn on T1. In addition, T1 decouples n2 from the large capacitance on n1 during precharge, speeding up the reset path. Compared to a conventional domino gate, this circuit achieves similar speed while improving the noise margin by 35%. Wordline detection is 9% slower than in the original circuit, but the cache access time is not impacted since the extra delay is absorbed in the sense amp strobe buffering stage.
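The logical function of the wordline detector can be sketched as follows. This is a purely Boolean model of the two-level structure described above (sixteen 16-input OR gates feeding one more 16-input OR gate); it does not model precharge, self-reset, or the noise behavior of nodes n1 and n2. The function names are hypothetical.

```python
from typing import List, Sequence

GROUP_SIZE = 16                          # inputs per OR gate (from the text)
NUM_WORDLINES = GROUP_SIZE * GROUP_SIZE  # 256 inputs in total

def or_gate(inputs: Sequence[bool]) -> bool:
    """Boolean model of one 16-input dynamic OR gate."""
    return any(inputs)

def wordline_detect(wordlines: Sequence[bool]) -> bool:
    """Return True if any of the 256 wordlines is asserted."""
    if len(wordlines) != NUM_WORDLINES:
        raise ValueError(f"expected {NUM_WORDLINES} wordlines")
    # First level: one OR gate per group of 16 adjacent wordlines.
    first_level: List[bool] = [
        or_gate(wordlines[start:start + GROUP_SIZE])
        for start in range(0, NUM_WORDLINES, GROUP_SIZE)
    ]
    # Second level: a single 16-input OR gate combines the group outputs.
    return or_gate(first_level)

def one_hot(index: int) -> List[bool]:
    """Helper: 256 wordlines with only the given one asserted."""
    return [i == index for i in range(NUM_WORDLINES)]
```

For example, `wordline_detect(one_hot(200))` evaluates to `True`, while an all-low input evaluates to `False`, which corresponds to the sense amp strobe not firing.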
Since clock skew does not scale proportionally to gate delay, owing to intra-die process variation, a significant number of new hold time violations are created. These needed to be fixed with minimal physical design changes. For the most frequently used flops, new versions are created that have larger output delays while keeping the same footprint. The flop depicted in Fig. 3.2.5 uses the scan slave path for the normal output: TG1 is kept on in normal operation mode, with se=0 and thus sclk=1, adding three gate delays without increasing the flop size. Together with other types of new flops with negative hold time, 75% of the 26,300 core-level hold time violations are fixed simply by replacing the flops.
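The effect of swapping in a flop with a larger output delay can be illustrated with the standard first-order hold check, in which slack is the launch flop's clock-to-output delay plus the minimum logic delay, minus the capture flop's hold time and the clock skew. The sketch below is a minimal model of that check; all picosecond values are hypothetical and chosen only to show a violation being removed. Only the 26,300 count and the 75% figure come from the text.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class HoldPath:
    """One launch-to-capture path; all delays in picoseconds (hypothetical)."""
    clk_to_q_ps: int   # output delay of the launching flop
    min_logic_ps: int  # fastest logic delay between the flops
    hold_ps: int       # hold requirement of the capturing flop
    skew_ps: int       # capture clock arrives this much later than launch

def hold_slack_ps(path: HoldPath) -> int:
    """First-order hold slack; a negative value is a violation."""
    return path.clk_to_q_ps + path.min_logic_ps - path.hold_ps - path.skew_ps

def swap_in_slow_flop(path: HoldPath, extra_output_delay_ps: int) -> HoldPath:
    """Model replacing the launch flop with a same-footprint, slower-output one."""
    return replace(path, clk_to_q_ps=path.clk_to_q_ps + extra_output_delay_ps)

# Hypothetical failing path, and the same path after the flop swap.
# 45 ps stands in for three extra gate delays at an assumed 15 ps each.
before = HoldPath(clk_to_q_ps=60, min_logic_ps=10, hold_ps=40, skew_ps=50)
after = swap_in_slow_flop(before, extra_output_delay_ps=45)

# Counts implied by the figures quoted in the text.
TOTAL_VIOLATIONS = 26_300
FIXED_PERCENT = 75
fixed_by_swap = TOTAL_VIOLATIONS * FIXED_PERCENT // 100
remaining = TOTAL_VIOLATIONS - fixed_by_swap
```

With these assumed numbers the slack moves from -20 ps to +25 ps. The quoted percentages correspond to 19,725 violations fixed by flop replacement, leaving 6,575 to be handled by other means.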
The chip includes a 512kB four-way set-associative L2 cache per core with integrated error correction. Logic blocks, including way selection, ECC and the interfaces to the core and system bus, are optimally placed along the data flow, in the center of the cluster and in the channel region between the SRAM arrays. Compared to a conventional approach in which the logic is partitioned and placed outside the SRAM arrays, this approach reduces the number of pipeline stages needed for communication between the arrays and logic blocks by minimizing the impact of long wire delays. Together with a multi-cycle clock scheme, this allows a low four-cycle latency from the L2 cache to the core, including error correction. The L2 cache pipeline and floorplan are shown in Figs. 3.2.6 and 3.2.7. In cycle 1, addresses are dispatched from the core into each array. A full cycle is needed to send the addresses to the far end of the data arrays after multiplexing the addresses in the L2 control unit. In cycle 2, the data and tag arrays start their accesses simultaneously. In cycle 3, the address information from the tag array generates a way select signal before the data arrives at the waysel datapath. This cycle ends by registering the selected set of data in a datapath block located in the center of the cluster, where the delay from all the data arrays is balanced. Cycles 2 and 3 are designed as multi-cycle paths for speed and power, eliminating the pipeline flops between the two cycles, which minimizes flop and clock skew overhead. In cycle 4, ECC syndrome generation and correction are performed in the datapath located in the channel where the data is routed back to the core. The final data set is registered in the interface block located at the top of the L2 cluster.
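The cache organization and the four-cycle pipeline described above can be summarized in a short sketch. The capacity, associativity, core count and cycle-by-cycle activities are taken from the text; the 64-byte line size is an assumption made only to illustrate how the set count and index width would be derived, since the line size is not stated here.

```python
KB = 1024
CORES = 2                    # dual-core chip (from the text)
L2_PER_CORE_BYTES = 512 * KB # per-core L2 capacity (from the text)
WAYS = 4                     # four-way set-associative (from the text)
LINE_BYTES = 64              # ASSUMPTION: line size is not given in the text

def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> dict:
    """Derive set count and address field widths for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    for value in (sets, line_bytes):
        if value <= 0 or value & (value - 1):
            raise ValueError("sets and line size must be powers of two")
    return {
        "sets": sets,
        "index_bits": sets.bit_length() - 1,
        "offset_bits": line_bytes.bit_length() - 1,
    }

# Cycle-by-cycle summary of the L2 read pipeline, paraphrased from the text.
L2_PIPELINE = [
    (1, "dispatch addresses from the core to each array"),
    (2, "start data and tag array accesses in parallel"),
    (3, "generate way select from the tag; register the selected data"),
    (4, "ECC syndrome generation and correction; register at the interface"),
]

TOTAL_L2_BYTES = CORES * L2_PER_CORE_BYTES
L2_LATENCY_CYCLES = len(L2_PIPELINE)
GEOMETRY = cache_geometry(L2_PER_CORE_BYTES, WAYS, LINE_BYTES)
```

Two 512kB caches give the 1MB of on-chip L2 quoted earlier. Under the assumed 64-byte line, each per-core cache would have 2048 sets, an 11-bit index and a 6-bit offset; a different line size changes these values but not the four-cycle latency.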
