The IBM POWER6* microprocessor is a high-frequency (>5-GHz) microprocessor fabricated in the IBM 65-nm silicon-on-insulator (SOI) complementary metal-oxide semiconductor (CMOS) process technology. This paper describes the circuit, physical design, clocking, timing, power, and hardware characterization challenges faced in the pursuit of this industry-leading frequency. Traditional high-power, high-frequency techniques were abandoned in favor of more-power-efficient circuit design methodologies. The hardware frequency and power characterization are reviewed.
Introduction
The IBM POWER6* microprocessor core is fabricated using the IBM 65-nm silicon-on-insulator (SOI) process and provides a significant boost in frequency and performance to pSeries* systems. Core operating frequencies of more than 5 GHz have been demonstrated. The processor chip contains two cores, 8 MB of on-chip level 2 (L2) cache, a directory for a 32-MB L3 cache, two memory controllers, a GX I/O controller, and nest support circuitry for a 128-way symmetric multiprocessor (SMP). The chip shown in Figure 1 has an area of 341 mm² and contains 790 million transistors, 1,953 signal I/Os, 5,399 power and ground I/Os, and more than 4.5 km of wire. The on-chip circuits are connected via ten levels of copper wire and are powered through multiple voltage domains. The core logic, array, and I/O circuits are designed to operate at nominal voltages of 1.15, 1.3, and 1.2 V, respectively. However, the actual logic and array voltages delivered to each chip vary between 0.85 V and 1.3 V and between 1.0 V and 1.4 V, respectively, depending on the speed of the part. Chips with shorter channels typically run faster but use considerably more power because of higher leakage. In previous-generation processors, these parts would have been discarded because of excessive power dissipation, but they are now usable by operating at lowered voltages. Conversely, chips with longer channels typically run slower, so some of these parts also would not have been used in earlier-generation processors because of their low operating frequency, but they are now made usable by increasing their operating voltages.
Frequency
Various frequency/cycle-time targets were evaluated during an exploratory phase. A cycle time corresponding to 13 FO4 (fanout-of-4) inverter delays was selected based on the fastest known techniques to achieve back-to-back execution of 64-bit dependent fixed-point instructions.
©Copyright 2007 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

Performance analysis indicated a loss of ~10% IPC (instructions per cycle) if a dependent fixed-point instruction had to wait an additional cycle before executing. Figure 2 illustrates the relative frequency, IPC, and performance as a function of FO4 cycle times. The POWER6 processor frequency is more than double that of the POWER5* processor without a doubling of the pipeline depth. Table 1 compares the fixed-point and binary floating-point functional unit pipeline depths of the POWER5 and POWER6 processors [1]. Traditional high-frequency cores rely on a super-deep pipeline and/or aggressive dynamic circuits. Unfortunately, both of these techniques add significant power, because super-deep pipelines require more latches that consume power and dynamic circuits significantly increase data-switching power as well as clock load and power. The POWER6 core has uniquely achieved a low FO4 cycle time without resorting to either of these traditional design approaches. Instead, innovative logic circuit co-designs were used extensively throughout the core to achieve 13 FO4. The basic philosophy is to make each circuit do more logic work. Many circuits perform double and triple duty by implementing parallel logic functions [1] that have historically been implemented serially [2]. Logic functions have also been split into nontraditional parts (e.g., multiplexers, or muxes) to allow part of the logic to be completed earlier in a non-critical-timing part of the cycle [2].
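The frequency-versus-IPC trade-off discussed above can be illustrated with a small calculation. The sketch below assumes the usual first-order model in which relative performance is the product of relative frequency and relative IPC; the function names are illustrative and are not taken from the POWER6 design environment.

```python
def relative_performance(rel_frequency: float, rel_ipc: float) -> float:
    """First-order model (assumption): performance scales as frequency x IPC."""
    return rel_frequency * rel_ipc


def break_even_frequency_gain(ipc_loss: float) -> float:
    """Fractional frequency gain needed to offset a fractional IPC loss."""
    if not 0.0 <= ipc_loss < 1.0:
        raise ValueError("ipc_loss must be in [0, 1)")
    return 1.0 / (1.0 - ipc_loss) - 1.0


if __name__ == "__main__":
    # ~10% IPC loss if a dependent fixed-point instruction waits one extra cycle.
    gain = break_even_frequency_gain(0.10)
    print(f"frequency gain needed to break even: {gain:.1%}")
```

Under this model, a design that gives up back-to-back dependent execution would need a frequency gain of roughly 11% just to recover the lost IPC.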
Overview
The 65-nm complementary metal-oxide semiconductor (CMOS) process technology is described in Section 2. The innovative circuit styles and designs are presented in Section 3. Section 4 focuses on the clock distribution and custom circuit design methodologies. Power is the most significant concern in the design of high-frequency cores, and Section 5 describes the power-reduction methods. Section 6 focuses on laboratory (lab) characterization including array built-in self test (ABIST), logical built-in self test (LBIST), and frequency shmoo testing.
Technology
The POWER6 processor chip is fabricated using the IBM high-performance 65-nm partially depleted SOI process with 40-nm gate length n-FETs, 35-nm gate length p-FETs, and 1.05-nm gate oxides [3].
Device threshold voltages (Vt) and the polysilicon gate length (L_polySi)
Core logic circuits are implemented with three distinct voltage thresholds (VTs) for both p-FETs and n-FETs. These VTs provide different levels of device-switching performance and subthreshold leakage. Device VT optimization for power is described in Section 5. In addition, the n-FET pass device thresholds and cross-coupled inverter n-FET and p-FET thresholds of the array six-transistor (6T) cells were each independently optimized for read performance and cell stability [4]. These array devices were not available for use by logic circuits. An additional orthogonal device dimension, L_polySi, was available to further reduce subthreshold leakage power. A new CMOS process compensation mask (XR) is placed over non-performance-critical devices to provide a 2-nm channel length increase for n-FETs and a 4-nm channel length increase for p-FETs. This mask was used sparingly in the core circuits but extensively in the nest and array circuits.

Figure 2
Relative instructions per cycle (IPC), frequency, and performance.

[...] (Figure 3). These nonscaling wires are problematic because 65-nm device delays are shorter than 90-nm device delays, and if no specific actions were taken in 65 nm, the POWER6 core frequency would have been severely limited by wire delays. One primary goal of the POWER6 core was to limit the wire contribution to the total path delay to no more than 35%. This required the use of hundreds of thousands of repeaters. Repeaters break a long wire segment into two or more pieces and cut the total wire delay in a path by a factor of 2 or more. Tools were developed to automatically place repeaters in long wire segments [5]. Unfortunately, each repeater adds ~2 FO4 inverter delays, and many paths had insufficient timing slack to tolerate this repeater delay.

The design of the POWER6 core provided two primary solutions to these timing-critical wire paths. The first and more preferred solution adds staging latches along the wire whenever the logic allows delaying of signals by one or more cycles (e.g., trace and error signals). The second solution upgrades the wire from a lower layer to a higher layer (e.g., 1X to 2X). The higher-layer wires have significantly lower resistance and resistance-capacitance (RC) delay. This reduces wire delay by a factor of 4 in the case of upgrading from a 1X to a 2X layer, but even more importantly it can reduce the number of repeaters along a timing-critical wire path.
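The repeater trade-off described above can be sketched numerically. The example assumes that unrepeated wire delay grows with the square of wire length, so that cutting a wire into k equal segments divides the total wire delay by k, and that each repeater adds about 2 FO4 as stated in the text. The function names and the 16-segment search limit are hypothetical and are not part of the repeater-insertion tools cited in [5].

```python
def repeated_wire_delay(unrepeated_delay_fo4: float, segments: int,
                        repeater_delay_fo4: float = 2.0) -> float:
    """Path delay in FO4 for a wire cut into equal segments.

    Assumes quadratic wire delay: k segments of length L/k have a total
    wire delay of 1/k of the unrepeated delay, plus (k - 1) repeaters.
    """
    if segments < 1:
        raise ValueError("segments must be >= 1")
    wire_delay = unrepeated_delay_fo4 / segments
    return wire_delay + (segments - 1) * repeater_delay_fo4


def best_segment_count(unrepeated_delay_fo4: float,
                       repeater_delay_fo4: float = 2.0,
                       max_segments: int = 16) -> int:
    """Segment count (1 = no repeater) that minimizes total delay."""
    return min(
        range(1, max_segments + 1),
        key=lambda k: repeated_wire_delay(unrepeated_delay_fo4, k,
                                          repeater_delay_fo4),
    )


if __name__ == "__main__":
    for wire in (3.0, 16.0):
        k = best_segment_count(wire)
        print(wire, k, repeated_wire_delay(wire, k))
```

In this model a short wire is better left unrepeated, which is consistent with the observation that the ~2-FO4 repeater delay cannot always be tolerated and that layer upgrades or staging latches are sometimes preferable.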
Circuits
The POWER6 processor macro circuit design is accomplished using a strictly enforced methodology to ensure proper electrical operation at the required 13-FO4 frequency. Everything from latch designs to circuit styles to clocking was tightly controlled, tuned, and checked using sophisticated tools, which were applied on both custom and synthesized (random logic module [RLM]) circuits.
Local clock and latch design
The POWER6 processor employs a dual-clock system for synchronization; the two phases of a cycle are c1 and c2. This design choice gives circuit designers more granularity for dividing logic into pipeline stages. For low-power operation, some logic paths can be configured to operate in "pulsed" mode, in which the c1 clock is held active (high) and the c2 clock is pulsed at the end of the cycle [Figures 4(a) and 4(b)]. For a more detailed discussion of clock distribution, refer to the section on global clock distribution.
A single global clock is distributed from a central phase-locked loop (PLL) to each macro on the chip. Inside the macros, the global clock is buffered and shaped by a local clock buffer (LCB). Three types of local clocking are supported:
1. c2_chop "true pulse" clock.
2. Full-phase dual c1/c2 clocks.
3. c2 "pseudo-pulse" clock, in which the c1 clock is held active while the c2 clock pulses.
These clock waveforms are illustrated in Figures 4(c), 4(d), and 4(e), respectively. There are two motivations to use pulse clocks. First, they reduce ac clock power because only a single clock is switching. Second, pulse clocks allow the use of a single L2 latch (as opposed to an L1-L2 latch pair) in the datapath, thus reducing latch insertion overhead. The main drawback of pulse clocks is that they require data to be held stable longer at the latch input. In the POWER6 processor, the rising edge of the c2 clock, the falling edge of the c1 clock, and the rising edge of the chopped c2 (c2_chop) clock are all closely aligned to occur at the cycle boundary. Data is launched from the level-sensitive scan design (LSSD) slave latch on the rising edge of the c2 or c2_chop clock and is captured in the LSSD master latch on the falling edge of the c1 or c2_chop clock.

Figure 3
Self-loaded, scaled wire delay.
Figures 4(c) through 4(e) illustrate that the extra hold time equals the width of the c2_chop pulse clock. Thus, pulsed clocks become impractical for circuits in which there is little logic between adjacent stages of the pipeline.
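The hold-time penalty of pulsed clocking can be expressed as a simple fast-path check. The sketch below is only a first-order illustration: the function name, the skew term, and all picosecond values are hypothetical, and the only relationship taken from the text is that the extra hold requirement equals the width of the c2_chop pulse.

```python
def hold_slack_ps(min_path_delay_ps: float, latch_hold_ps: float,
                  pulse_width_ps: float = 0.0,
                  clock_skew_ps: float = 0.0) -> float:
    """Fast-path (hold) slack; a positive value means the path is safe.

    With a full-phase c1/c2 latch pair, pulse_width_ps is 0. With a
    c2_chop pulsed latch, data must stay stable for the extra pulse width.
    """
    required = latch_hold_ps + pulse_width_ps + clock_skew_ps
    return min_path_delay_ps - required


if __name__ == "__main__":
    # Hypothetical numbers: a short path with little logic between stages.
    print(hold_slack_ps(40.0, 10.0))                       # full-phase mode
    print(hold_slack_ps(40.0, 10.0, pulse_width_ps=50.0))  # pulsed mode
```

A path that passes in full-phase mode can fail once the pulse width is added, which is why short paths with little logic between stages are poor candidates for pulsed clocks.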
The c2_chop "true pulse" clock drives a dynamic mux latch, as shown in Figure 5. All inputs to the dynamic pull-down network, as well as the latch output, are static and work seamlessly with the static circuit families used in this processor. The latch is scanned at reduced speed with clka/c2 clocks. The LBIST sequence of N cycles starts with an "at-speed" c2 clock, followed by N c2_chop clocks.
Dual-phase c1/c2 clocks drive either a scannable master-slave latch [Figure 6 ...]. The LCB that generates the c1 and c2 clocks for the scannable LSSD master-slave latch is programmable either to generate full-phase dual c1/c2 clocks or to keep the c1 clock in the active state while pulsing the c2 clock. This leads to an overall chip power reduction, as explained above.
The dynamic mux latch was used sparingly because there is no means to recover from a fast (hold)-path fail. For this reason, the hybrid pulsed latch [Figure 6(d)] was used extensively in timing-critical custom circuits. This latch is designed to run with the c2_chop pulse clock. In this configuration, we incur an insertion delay of only a single L2 latch. In addition, there is a "safety mode" in which we can also run the latch in dual-phase c1/c2 clock mode (at reduced frequency). This safety mode was used during bring-up of early hardware. It is also the default mode used in burn-in stressing of the chips.
All LCBs are designed to operate in several clock modes that delay local clock edges under the control of scannable clock tuning bits:
1. Delay c1 falling-edge mode provides timing-critical (slow)-path relief or stresses the fast path when the latch is in full-phase dual c1/c2 mode.
2. Delay c2 rising-edge mode provides fast-path relief or stresses the timing-critical (slow) path.
3. Delay c2_chop/c2 falling-edge mode varies the width of the clock pulse, provides latch writability relief or stress, or stresses the fast path when the latch is in pulse mode.

The granularity of these LCB controls was placed at the macro level via a series of scan-only latches so that all LCBs inside a given macro would similarly stress the clock edges.

Figure 4
Clock waveforms: (a) regular dual clock; (b) pulsed mode; (c) c2_chop "true pulse" clock; (d) full-phase dual c1/c2 clocks; (e) c2 "pseudo-pulse" clock.
Circuit styles
The POWER6 processor was designed almost exclusively using static CMOS circuits. Only the static random access memory (SRAM) and register file macros were permitted to use other circuit design styles, for example, a dynamic circuit design.
As with the latches, a predefined set of circuit blocks was designed using a highly tunable cell methodology that dramatically reduced the need for a custom layout of components. The predefined set includes inverters, NAND circuits, NOR circuits, AOI circuits, OAI circuits, and other specialized topologies such as fast XORs, XNORs, and transmission gates used mainly for wide muxes in which area was a critical concern.
When the predefined circuit blocks did not meet the circuit requirements, a full custom design was performed following strict guidelines. For example, scannable register file cells and SRAM cells were carefully designed by a specialized array and register file team [4] .
Custom circuit flow
The POWER6 processor is designed using a strict design methodology, which includes regular power and ground connectivity, as well as consistent clean signal routing. Every aspect of the macro design is checked using a set of sophisticated circuit tools, which guarantees that the macro achieves functionality, electrical integrity, and timing specifications.
Most macros adhere to an 18-track vertical dataflow bit image. On the 1X (M2 and M4) vertical wiring planes, each bit consists of, from left to right, three signal tracks, followed by two ground tracks, followed by another seven signal tracks, followed by two power tracks, followed by another four signal tracks. The pattern is then stepped to the right. This bit pattern gives a total of 14 signal tracks and 4 power and ground tracks; these power tracks align with the unit and core vertical power distribution. On the 1X (M3) horizontal wiring planes, most macros adhere to a 20-track image. Each 20-track pitch consists of, from bottom to top, four signal wires, followed by two power wires, followed by eight signal wires, followed by two power wires, followed by four signal wires. The horizontal bit dimension that is used throughout the custom and random logic macro designs is 3.6 μm and the vertical power distribution grid is 4 μm. All predefined circuits (latches, LCB, book cells) were designed to this 18-track bit image. Higher-level metals were similarly engineered to keep contiguous routing and power distribution at the macro and unit boundaries.

The macro circuit design flow is initiated by placing, from left to right, 16 latches, followed by a bank of LCBs occupying 4-bit spaces, followed by 32 latches, followed by another LCB, followed by another 16 latches. This 16-4-16-16-4-16 pattern is repeated as needed along the y-dimension of the macro. After the latches and logic books are placed within a custom macro, the internal interconnects are routed via a combination of automated routing and skilled layout personnel. Generally, the timing-critical signals were routed manually before any other wires were routed. Then, automated in-house routing tools were used to finish the remaining routes.
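The track and latch patterns described above can be written down as run-length lists and checked for consistency. The sketch below only encodes the counts given in the text; the names `VERTICAL_BIT_IMAGE`, `HORIZONTAL_IMAGE`, `LATCH_ROW`, `totals`, and `width` are illustrative and do not correspond to any IBM tool or data format.

```python
from collections import Counter

# (kind, count) runs in the order given in the text.
VERTICAL_BIT_IMAGE = [("signal", 3), ("ground", 2), ("signal", 7),
                      ("power", 2), ("signal", 4)]
HORIZONTAL_IMAGE = [("signal", 4), ("power", 2), ("signal", 8),
                    ("power", 2), ("signal", 4)]
LATCH_ROW = [("latch", 16), ("lcb", 4), ("latch", 32),
             ("lcb", 4), ("latch", 16)]


def totals(pattern):
    """Sum the run lengths of each kind in a (kind, count) pattern."""
    counts = Counter()
    for kind, n in pattern:
        counts[kind] += n
    return dict(counts)


def width(pattern):
    """Total number of tracks (or bit positions) in one repeat."""
    return sum(n for _, n in pattern)


if __name__ == "__main__":
    print(width(VERTICAL_BIT_IMAGE), totals(VERTICAL_BIT_IMAGE))
    print(width(HORIZONTAL_IMAGE), totals(HORIZONTAL_IMAGE))
    print(width(LATCH_ROW), totals(LATCH_ROW))
```

The counts confirm the 18-track image with 14 signal tracks and 4 power and ground tracks, the 20-track horizontal image, and a latch row in which the 32 central latches account for the two adjacent groups of 16 in the 16-4-16-16-4-16 pattern.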
Design rule checking (DRC) and layout-versus-schematic (LVS) checking, respectively, guaranteed that the layout adhered to technology ground rules and was functionally equivalent to the schematic. Macro abstracts were generated directly from layout and provided a condensed description of macro size, wire blockages, and pin locations for unit integration.
Electrical correctness checking and tuning
The electrical correctness of the POWER6 processor was checked extensively with in-house tools (Table 2). More-detailed descriptions of the circuit-checking tools can be found in another paper in this issue [5].
The aggressive POWER6 processor frequency target was achieved through extensive circuit tuning. The IBM EinsTuner tool [5] , an EDA (electronic design automation) transistor-level device width tuner, was used throughout the design to tune device widths for minimum delay (maximum speed). Critical-path circuit cross sections were tuned prior to any macro definition. Macro schematics were tuned based on estimated internal parasitics, primary input (PI), primary output (PO), boundary timing, and load assertions. When layout was complete, a final layout-aware device-width-tuning pass was performed. During layout-aware tuning, the physical area allocated to each circuit gate or bounding box constrains the sizes of the devices within the gate. The IBM LAVA (leakage avoidance and analysis) application [5] was used to tune the device VTs; specifically, it identified gates and devices on timing-critical paths and switched those devices to lower VTs (typically regular VT to low VT). LAVA was also used to reduce leakage power; it identified gates with sufficient timing slack and switched the devices for those gates to higher VTs (typically regular VT to high VT).
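The VT-selection policy described above can be summarized in a few lines. The sketch below is not LAVA; it is a single-pass illustration that assumes each gate carries one VT label and one slack value, and it ignores the fact that a real flow would re-time the design after every VT change. The function name, the dictionary layout, and the 10-ps margin are hypothetical.

```python
def retune_vt(gates, slack_margin_ps=10.0):
    """Return a new {gate: vt} map from {gate: (vt, slack_ps)}.

    Regular-VT gates on failing paths (negative slack) move to low VT for
    speed; regular-VT gates with at least slack_margin_ps of slack move to
    high VT to cut leakage. All other gates are left unchanged.
    """
    result = {}
    for name, (vt, slack_ps) in gates.items():
        if vt == "regular" and slack_ps < 0.0:
            result[name] = "low"
        elif vt == "regular" and slack_ps >= slack_margin_ps:
            result[name] = "high"
        else:
            result[name] = vt
    return result


if __name__ == "__main__":
    example = {
        "nand_a": ("regular", -5.0),   # timing-critical
        "nor_b": ("regular", 25.0),    # ample slack
        "inv_c": ("regular", 3.0),     # too little slack to slow down
    }
    print(retune_vt(example))
```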
Integration
Global clock distribution design
A low-skew, high-frequency global clock distribution network was designed to support the high operating frequency. The basic clock tree structure is based on a proven grid-tree methodology employed in prior server processor chip designs and high-end game chips [6]. The clock output of the PLL is distributed by a clock tree consisting of inverters and shielded high-level wires to local clock sector buffers that are evenly distributed throughout the chip, as depicted in Figure 7. The sector buffer, in turn, drives a part of a large-area clock grid through a local H-tree. Low local clock skew was achieved by individual width tuning of the local H-tree on the basis of actual local clock loads. The connections from LCBs to the clock grids were made using reserved tracks in order to facilitate an incremental update without affecting other existing signal wires.
Using standard RC modeling and buffering practices, the high-frequency clock requirements would reduce the maximum wire length from one clock-buffer stage to the next, resulting in many more levels of buffering. Using more buffers generally results in higher power, more delay, and more sensitivity to process, voltage, and temperature uncertainties. To achieve the desired performance and minimum power, accurate frequency-dependent transmission-line models were used for all critical clock wires. The inductance effects permitted designers to use fewer buffers.
With the improved modeling, optimization, and design methodology, the POWER6 chip actually uses fewer inverter stages for gain than the POWER5 chip. On POWER6 chips, there are seven levels of inverter buffering between the PLL and sector buffers. The total delay from PLL output to LCB is nearly two full clock cycles. Assuming the same variability to power supply and across-chip linewidth variations (ACLV) as in previous designs, higher variability in terms of percentage of cycle time is expected. To tackle this potential issue, additional efforts were made to optimize the distribution tree from PLL to local sector buffers. Specialized tuners were employed to optimize stage-to-stage distance, buffer sizes, wire widths, and wiring structures. The end result is reduced total distribution delay and, more importantly, lower sensitivity to power supply and ACLV.
Power supply noise can vary significantly across the chip; the edge of the chip can experience high Vdd while the center experiences low Vdd. Consequently, little benefit would result from designing the Vdd noise response of the clock distribution to produce longer cycles, specifically during large Vdd droop events.
Another feature of the POWER6 processor clock design is the more balanced duty cycle of the clock [7] . Because of the transmission line nature of the high-level shield wiring used, its reflection effects may be used to correct clock duty cycle with careful design. For a certain operating frequency range, optimum wire lengths may be determined for the distribution. In addition, a duty-cycle adjustment circuit was implemented at the PLL output stage for static adjustment as required.
The POWER6 chip has five separate clock grids: one for each core, one for the nest, and one for each memory controller. Communication among circuits on different grids is generally synchronous, except for communications between the nest and the memory controllers. Since the clock grids are not joined, and because in some configurations the core and nest can run at different voltages, there are potential static clock skews across the grids. A high-resolution, high-linearity programmable-delay clock buffer was designed to alleviate static clock skews. The programmable-delay buffer used in the POWER6 processor has a total range of 40 ps and a step value of 2 ps. These delay buffers were strategically placed to allow flexible controls through service elements. The optimal delay settings can be determined empirically or from on-chip measurement circuits such as the Skitter circuit [8].
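Choosing a setting for the programmable-delay buffer amounts to quantizing a measured skew to the 2-ps step within the 40-ps range. The sketch below assumes a linear code from 0 to 20 in which each code adds one step; the actual control encoding is not described here, so the function names and the encoding are hypothetical.

```python
RANGE_PS = 40  # total adjustment range stated in the text
STEP_PS = 2    # step size stated in the text


def delay_setting(skew_ps: float) -> int:
    """Nearest delay code (assumed linear, 0..20) for a measured skew."""
    steps = round(skew_ps / STEP_PS)
    return max(0, min(RANGE_PS // STEP_PS, steps))


def residual_skew_ps(skew_ps: float) -> float:
    """Skew remaining after applying the selected delay code."""
    return skew_ps - delay_setting(skew_ps) * STEP_PS


if __name__ == "__main__":
    for measured in (13.2, 100.0, -5.0):
        print(measured, delay_setting(measured), residual_skew_ps(measured))
```

Within the adjustable range the residual static skew is bounded by half a step (1 ps); skews outside the range are clamped and cannot be fully corrected.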
Another contributor to the high-frequency clock design is the low-parasitic, high-aspect-ratio clock buffer design. Because of the high loads driven by clock buffers, macro internal wiring parasitics must be very low in order to prevent degradation in clock signals. The buffers are also designed to tightly couple to power grids in order to minimize variability when placed at different locations of the chip. The high aspect ratio allows the buffers to be placed in reserved, regularly spaced column stripes that were preallocated early in the POWER6 processor design, specifically for clock optimization. Special care was taken to minimize blockage to horizontal wiring layers.
Power planes
There are two major power planes for the two operating voltages of the chip: Vdd, with nominal voltage set at 1.15 V at a module pin, for the nest, two cores, and non-timing-critical array circuits; and Vcs, with nominal voltage set at 1.30 V, for timing-critical circuits and voltage-sensitive memory cells in all the arrays of the chip.
Custom macro design methodology
Traditionally, wire parasitics in most custom macros are estimated manually on the basis of rough placement of a small number of timing-critical components from the macro schematic during the pre-physical-design (PD) schematic design phase. The estimated wire parasitics are then manually back-annotated as electrical elements into the schematic for timing analysis. Custom macro size is estimated by summing the area of all the leaf cells in the schematic with some contingency for area increase resulting from design changes over time during the pre-PD schematic design phase. Macro detail placement and routing begin when the logic becomes stable and when macro schematic timing and size are in accordance with the cycle-time and area objectives. Manual wire parasitics estimation and placement are time consuming and can be incomplete and inaccurate. Because metal layer resistance is 3X higher in 65-nm than in 130-nm technology, via resistance is 44X higher, and back-end-of-line (BEOL) technology scaling lags behind front-end-of-line (FEOL) technology scaling, parasitic wire delay becomes a larger portion of the cycle time in 65-nm than in 130-nm technology. In 65-nm technology, post-PD-extraction base timing can differ from schematic base timing with manually estimated parasitics by as much as 30%. The large timing discrepancy can lead to an enormous amount of rework and difficulties in timing closure during the post-PD design phase. Table 3 lists the metal layer and via resistances of the two technologies.
The POWER6 processor macro design methodology was developed to be placement based during the pre-PD schematic design phase so that wire parasitics, macro size, and macro pin locations could be estimated accurately. Two new design tools, PIP (placement by instance parameters) and STEP (Steiner estimated parasitics), were developed to support the placement-based methodologies. These tools can be used to aid detail placement in custom macros during the pre-PD schematic design phase, to support automatic wire parasitic estimation and modeling in custom macros on the basis of detail placement in the macro layout, and to provide a layout with components and pins for custom macro abstract generation. On average, custom macro design effort for schematic design plus detail placement with PIP requires about 1 to 2 weeks for a small macro and 4 to 6 weeks for a large macro. Custom macros designed with PIP and STEP during the pre-PD schematic phase have a timing error within 2% to 7.5%, compared with actual timing with post-PD-extracted data. The result is a single-pass custom macro design process with very little to no timing or physical design surprise during the post-PD design phase.
During the pre-PD schematic design phase, the POWER6 processor custom macro design methodology required the schematics to be designed with detail placement. This can be accomplished with the aid of PIP to provide a more accurate placement-based estimate of pin locations, macro size, form factor, and track utilization. During the pre-PD schematic design phase, the timing methodology requires all parameterized standard cell schematics used for custom macro design to include input and output wire and via parasitic resistances. It also uses STEP to estimate wire parasitics of all nets in the macro for more accurate timing model generation. The integration methodology during the pre-PD schematic design phase requires all abstracts of the custom macros to be created from layouts with detail placement. Macro abstract pin locations are based on actual macro driver and receiver placements. Macro abstract input pin capacitance is estimated by STEP during this "bottom-up" pre-PD design phase. As the design progresses, custom macro I/O pin placements are refined by macro designers working together with a unit integrator for timing and wirability optimization. A new integration-verification methodology checks the macro layout with placed components and pins created by PIP to ensure that the post-PD custom macro layout can be routed in the unit floorplan without causing any design rule conflict between macro and unit.
POWER6 processor custom macros [9] are designed with static parameterized standard cells that are similar to those used in the POWER5 processor and the IBM System z990 system. I/O resistances are included in the parameterized standard cell schematics to improve timing accuracy. PolySi gate resistance, polySi-to-M1 via resistance, and M1 pin resistance between the gate input pin and the polySi gate are modeled as input resistance. Diffusion-to-M1 via resistance, M1 wire resistance for strapping the multiple output fingers into a single pin, and M1-to-M2 via resistance between the gate output pin and source diffusion are modeled as output resistance. Input and output resistances depend on the number of fingers of the gate. Extracted resistance data from samples of cell layouts are curve-fitted to Equation (1):
K0 and K1 are constants and are different in the input and output resistance models. N_finger and P_finger are the numbers of n-FET and p-FET fingers in the gate. Figure 8 shows the input and output resistances as a function of FET fingers for a two-input NAND gate.

Figure 9(a) illustrates a schematic with two sets of inverter instances, I6 and I9, and the placed layout created by PIP. Both I6 and I9 are array instances. Each array is 4 bits wide. I6 has a single fipBit value of 3 that directs PIP to place the array of inverter instances consecutively across to the right starting from bit position 3, as shown in the bottom row of instances of the layout in Figure 9(b). I9 has a fipBit value ending with the three-dots symbol, which instructs PIP to place the array of inverter instances across from left to right on the even bit positions starting from bit position 6, as illustrated in the top row of instances of the layout. If the leaf cell contains pins, designers have the option of using a PIP routine called generate layout pins in order to propagate the lowest-level (or leaf-cell) pins up the hierarchy to higher-level (e.g., macro I/O) pins. The layout with components and pins placed is called a placed layout. A macro abstract is then generated from the placed layout and used for unit floorplanning during the pre-PD schematic design phase.
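The bit positions produced for the two inverter arrays in the example above can be reproduced with a simple start-and-stride calculation. This is only a sketch of the resulting placement, not of PIP itself: the fipBit syntax is not reproduced here, and the `stride` parameter is a hypothetical stand-in for whatever the three-dots form encodes.

```python
from typing import List


def array_bit_positions(start_bit: int, width: int, stride: int = 1) -> List[int]:
    """Bit positions for an arrayed instance placed left to right.

    stride=1 places the instances on consecutive bits; stride=2 places
    them on every other bit (hypothetical parameter for illustration).
    """
    if width < 1 or stride < 1:
        raise ValueError("width and stride must be positive")
    return [start_bit + i * stride for i in range(width)]


if __name__ == "__main__":
    # I6: 4-bit array placed consecutively from bit position 3.
    print(array_bit_positions(3, 4))
    # I9: 4-bit array placed on even bit positions from bit position 6.
    print(array_bit_positions(6, 4, stride=2))
```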
Wire parasitic estimation with STEP and timing with parasitic VIM
STEP is used to estimate Steiner graph lengths of all the nets in the placed layout and to add parasitic models with the estimated Steiner lengths into a schematic VLSI integration model (VIM), an IBM-internal netlist format. At the beginning of the schematic design process, STEP attaches default net attributes to all signals in the schematic to represent metal layer, wire width, wire spacing, neighbor hostility, and contingency for non-ideal routing. The circuit designers can optimize these attributes on the basis of timing requirements, and STEP will use the modified attributes to update the parasitic models. During netlisting, STEP calculates the Steiner graph length for each net on the basis of the pin locations of the components in the placed layout. STEP then uses the calculated Steiner graph length, together with the optimized net attributes, to create parasitic models and stitches the models into the schematic VIM. The schematic VIM with estimated wire parasitics is called a parasitic VIM (PVIM). The PVIM can be generated in 5 to 10 minutes for a small macro with fewer than 10K transistors and in 20 to 30 minutes for a large macro with 50K transistors. The IBM EinsTLT [5], an EDA transistor-level timer, can be used to generate a macro timing rule from the PVIM. Accurate optimization of the circuits can be obtained either manually or with EinsTuner, together with the PVIM, by optimizing device type and size, component locations, pin placement, wiring layers, wire width, and wire spacing.
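The idea of deriving wire parasitics from pin locations can be sketched as follows. The example does not reproduce STEP: it substitutes a rectilinear minimum spanning tree (Prim's algorithm) for the Steiner graph, which gives an upper bound on the Steiner length, and it lumps resistance and capacitance from per-micron values. The per-micron values, the contingency factor, and all function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def manhattan(a: Point, b: Point) -> float:
    """Rectilinear distance between two pins."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def rectilinear_mst_length(pins: List[Point]) -> float:
    """Length of a rectilinear minimum spanning tree over the pins.

    Used here as a simple stand-in for a Steiner length estimate.
    """
    if len(pins) < 2:
        return 0.0
    in_tree = {0}
    best = [manhattan(pins[0], p) for p in pins]
    total = 0.0
    while len(in_tree) < len(pins):
        nxt = min((i for i in range(len(pins)) if i not in in_tree),
                  key=lambda i: best[i])
        total += best[nxt]
        in_tree.add(nxt)
        for i in range(len(pins)):
            if i not in in_tree:
                best[i] = min(best[i], manhattan(pins[nxt], pins[i]))
    return total


def lumped_parasitics(length_um: float, r_ohm_per_um: float,
                      c_ff_per_um: float,
                      contingency: float = 1.1) -> Tuple[float, float]:
    """Lumped (R in ohms, C in fF) with a margin for non-ideal routing."""
    routed = length_um * contingency
    return routed * r_ohm_per_um, routed * c_ff_per_um


if __name__ == "__main__":
    net_pins = [(0.0, 0.0), (10.0, 0.0), (10.0, 5.0)]
    length = rectilinear_mst_length(net_pins)
    print(length, lumped_parasitics(length, 2.0, 0.2))
```

In such a sketch, the layer, width, and spacing attributes mentioned above would enter through the per-micron resistance and capacitance values, and the contingency attribute through the length multiplier.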
Pre-PD macro design optimization
During the pre-PD schematic design phase, a custom macro design is iterated through five design steps until macro timing is within a certain predefined range of the target. These five design steps include the following:
1. Update schematic topology for functional changes.
2. Update schematic topology and device width for timing optimization on the basis of feedback from timing analysis and EinsTuner.
3. Update detail placement with PIP for PVIM generation with STEP.
4. Generate macro abstract from placed layout for unit integration.
5. Update macro timing rule with new PVIM for timing.
Macro routing
The custom macro routing phase begins when logic becomes stable and macro timing is within a certain predefined range of the target. Four routing techniques are used by the various POWER6 processor design groups to route their custom macros. They are (1) complete custom hand-route for optimum track utilization and with ~99% redundant vias for better yield; (2) complete custom route with custom software for results similar to those of the first routing technique; (3) a mixture of custom hand-route for timing-critical and dataflow nets and auto-route for the remaining nets with an auto-router called WRoute, created by Cadence Design Systems, Inc.; and (4) complete auto-route with the auto-router. Each routing technique has a different turnaround time and produces slightly different results in terms of track utilization, route quality, and ease of updates. However, they all produce routed layouts that meet POWER6 processor custom macro timing, checking, and yield requirements.
Power
The four primary components of power in the POWER6 core are clock-switching power, data-switching power, gate leakage, and device subthreshold leakage. All components are very sensitive to operating voltage. A 1% reduction in voltage yields approximately a 3% reduction in total power. Adequate array cell stability and array read performance dictate a higher V min (minimum operational voltage limit), which could prevent the core from achieving its power objective. For this reason, logic and array support circuits are decoupled from array cells. Array cells are operated on a separate (V cs ) voltage supply.
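The stated sensitivity (a 1% voltage reduction giving roughly a 3% power reduction) is consistent with total power scaling approximately as the cube of voltage near the operating point. The sketch below assumes that cubic relation; the exponent and the function name relative_power are assumptions made for illustration, and the model should not be trusted for large voltage excursions.

```python
def relative_power(voltage_scale, exponent=3.0):
    """Power relative to nominal when voltage is scaled by `voltage_scale`.

    The cubic exponent is an assumption chosen to match the
    3%-per-1% sensitivity quoted in the text.
    """
    return voltage_scale ** exponent


# Fractional power saved by a 1% and a 5% voltage reduction.
saving_1pct = 1.0 - relative_power(0.99)
saving_5pct = 1.0 - relative_power(0.95)
```

Under this assumption the 1% case saves just under 3% of total power, and the saving per percent of voltage shrinks slightly as the reduction grows.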
Clock-switching power
The clock-switching power is minimized by using several complementary techniques. The first technique is fine-grained clock gating supported by an LCB circuit to ''hold'' the local c1 and c2 clocks. This differs from previous designs [10], which turned off both local c1 and c2 clocks. In the POWER6 core, the c1 clock is held high so the L1 (master) latch remains open. This provides two benefits over the previous local clock-gating design: It eliminates the timing-critical half-cycle path to intercept the rising edge of the c1, and it eliminates the extraneous turning off (switching) of the c1 clock. Power is eliminated by reducing clock and signal switching; turning off clocks and signals saves power only when these signals are already off. Traditionally, clock gating is coarse grained, whereby clocks of an entire unit are turned off when the unit is not doing any useful work. The POWER6 core achieved even higher power savings through extensive fine-grained clock gating. In order to achieve fine-grained clock gating, logic is designed to determine those registers and latches that will not or do not have to change state in a future cycle and to generate corresponding local clock-gating signals. These clock-gating signals are recomputed each cycle on the basis of the current logic state. The second clock-switching power reduction technique is called latch sizing. The POWER6 core was designed with six distinct master-slave latch power levels (or sizes). In order to minimize latch power, we chose the smallest-size latch that would meet the constraint that all logic paths sourced by that latch achieve the 13-FO4 cycle-time requirement. Code was developed to identify ''overpowered'' latches and opportunities to reduce latch sizes without breaking the timing requirement.
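The latch-sizing rule can be sketched as a search for the smallest power level whose launch delay still lets every sourced path fit in the cycle. This is only an illustration of the selection rule, not the code used on the project: the per-level delays in LATCH_DELAY_FO4 are hypothetical, and a real flow would also account for the dependence of latch delay on its load.

```python
CYCLE_FO4 = 13.0

# Hypothetical launch delay, in FO4, for each of six latch power levels,
# ordered from smallest (lowest clock power) to largest.
LATCH_DELAY_FO4 = [3.0, 2.7, 2.4, 2.2, 2.0, 1.9]


def smallest_latch(downstream_fo4):
    """Return the index of the smallest latch level that meets timing.

    `downstream_fo4` lists the logic delay of every path sourced by the
    latch. Returns None when even the largest latch misses the cycle.
    """
    worst_path = max(downstream_fo4)
    for level, latch_delay in enumerate(LATCH_DELAY_FO4):
        if latch_delay + worst_path <= CYCLE_FO4:
            return level
    return None


relaxed = smallest_latch([8.0, 6.5])    # plenty of margin
critical = smallest_latch([9.0, 10.4])  # needs a larger latch
```

A latch whose current level is higher than the value returned here would be flagged as ''overpowered'' in this simplified model.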
The third technique is to modify clock frequency. Certain portions of the chip are operated at a lower frequency. The POWER6 core does not exploit this technique since the microarchitecture requires all pipeline stages to run in lockstep; however, the POWER6 processor nest operates at half the frequency of the core. Pulsed latches are the fourth clock-switching power reduction technique. Pulsed latches have several positive attributes: They reduce the latch delay overhead by eliminating one of the half latches in a conventional master-slave design (as described in the section ''Circuit styles''), and they eliminate the c1 local clock signal and its associated switching power. The c2 clock is converted from a half-cycle signal to a chopped pulse. This is necessary to eliminate the race condition or flushing of data through the latch. However, this is also the downside of the pulsed latch: It requires additional short-path padding of datapaths feeding into the latch. The final power-savings technique is the use of static circuits. The precharge and footer devices of dynamic circuits introduce a significant clock load. Clock load power is thus minimized by implementing all logic functions and all small register files in static circuits.
Data-switching power
Data-switching power is minimized primarily by logic gate sizing. Similar to the latch-sizing methodology described above, the logic device widths are sized as small as possible within the constraint that the logic paths through the device meet the 13-FO4 cycle-time requirement. Clock gating also affects data-switching power because the cone of static circuit logic downstream of a set of clock-gated latches will not switch. This is not always true of dynamic circuits because downstream data-switching power is eliminated only if the dynamic gates' precharge and evaluate clocks are gated.
Device leakage power
The technology offers three device thresholds for n-FETs and three device thresholds for p-FETs. Circuits in paths with a sufficient timing margin (>3 FO4 inverter delays) were designed with the highest-threshold (high-VT) devices. These devices have the slowest switching speed but yield the lowest subthreshold leakages. Circuits in paths near the cycle-time limit (~13 FO4) were designed with regular-threshold (regular-VT) devices. These devices switch faster than high-VT devices but at the expense of higher subthreshold leakage. Circuits in the ''ultra-timing-critical paths'' were designed with lowest-threshold (low-VT) devices. Ultra-timing-critical paths are defined to be only those paths that could not achieve the 13-FO4 cycle-time objective even after applying all known timing delay optimization techniques. Low-VT devices switch faster than regular-VT devices, but because of their high amount of subthreshold leakage, the use of a low-VT device was severely restricted; only 1% of the devices in the POWER6 processor core are low-VT devices.
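The threshold-selection policy described above can be summarized as a small decision function. The sketch assumes path delays are evaluated with regular-VT devices and treats the 3-FO4 margin and 13-FO4 cycle as the only inputs; the function name choose_vt and the fully_optimized flag are hypothetical.

```python
def choose_vt(path_delay_fo4, fully_optimized=False,
              cycle_fo4=13.0, margin_fo4=3.0):
    """Pick a device threshold class for the circuits on one path.

    `fully_optimized` marks a path that still misses the cycle after
    all other delay optimizations have been applied (an assumption
    about how such paths would be flagged).
    """
    slack = cycle_fo4 - path_delay_fo4
    if slack > margin_fo4:
        return "high-VT"       # ample margin: lowest leakage
    if slack >= 0:
        return "regular-VT"    # near the cycle-time limit
    # Failing path: low-VT only as a last resort.
    return "low-VT" if fully_optimized else "regular-VT"


choices = [choose_vt(9.0), choose_vt(12.5),
           choose_vt(13.4), choose_vt(13.4, fully_optimized=True)]
```

Restricting the low-VT branch to paths that have exhausted other options mirrors the severe restriction on low-VT usage noted in the text.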
Lab characterization
The design, technology, and product engineering teams extensively tested POWER6 processor hardware at wafer level, after packaging, and then in system-oriented environments. The development team focused initially on the correctness of the design and the ability to test the processor. The next step was to evaluate voltage, frequency, and temperature operating ranges. Gradually, this evaluation included variations in the devices of the 65-nm technology that were designed to stress the POWER6 processor circuits in ways that would identify the weakest points. The results of these stresses are often included in later design revisions, thereby improving the overall robustness (yield, voltage, frequency, temperature ranges) of the microprocessor.
The tools used to evaluate POWER6 processor-based hardware are wafer and module testers, which allow an automatic pattern test mode as well as an interactive ''engineering'' mode. The automatic mode allows gathering of large test samples for statistical evaluation. The latter interactive mode allows an engineer to manipulate voltage, frequency, clock tuning bits, and software, among others, to isolate problems. In some cases, the problems can be eliminated immediately. Occasionally, the problem becomes a limiting issue that must be fixed in the next chip release. After the wafers are diced, the resulting chips are packaged into modules for further testing and sorting.
For the physical tools to be useful, the design team employs several testing strategies for broad evaluation of the POWER6 processor, including specific features designed into the microprocessor. The test strategies include LBIST, ABIST, functional exercisers (i.e., test programs), and PLL range, jitter, and yield tests, among others. Each test strategy is affected by voltage, frequency, and temperature differently when stressing the chip. The results often have to be correlated against each other to identify the correct ones for sorting the POWER6 processor into several targeted power/performance buckets.
As standard practice, POWER6 processor-based hardware is stressed across frequency as well as below and above the nominal operating voltages. This investigation is intended to look for circuit problems in the processor and define the range in which the chip is fully functional for production. The two voltage planes of V dd and V cs add complexity and, therefore, required special tests for proper handling. Several types of such tests are listed in Table 4 .
The tests of maximum frequency, or f max , identify peak chip frequency as a function of technology speed, as well as slow paths in the POWER6 processor design. Absolute V min is designed to find the minimum voltage at which transistor-switching behavior is functional without restriction due to frequency. For example, this test can show a condition whereby the differences between VT of regular- and high-VT transistors on different power planes prevent a signal from properly switching to the full voltage rail. The speed V min is a minimum voltage test with a frequency component that normally corroborates slow paths identified in the f max tests. V diff tests stress array voltages against read and write performance, looking for the weak points in those structures. As V cs rises above V dd in the V diff -high test, read performance increases, but the array cells become more difficult to write. As V cs drops closer to and below V dd , writing is easier, but read performance slows down and the cell stability degrades (i.e., the cell can be ''disturbed'' and lose its data).
Temperature is another variable in lab testing of the POWER6 processor. On the wafer testers performing the above tests, the temperature range is limited from a low of −10°C to a high of +70°C. After packaging, POWER6 processors are further stressed in burn-in ovens to +105°C. Technology has a major impact on processor operation and performance. Any silicon device process has an allowed tolerance range for device performance, and the POWER6 processor circuits must operate across that entire range. So that the operation can be verified, the most important parameters were specifically stressed and the POWER6 chips were evaluated on that hardware. Table 5 lists key device points used to test these process changes. Beta refers to the ratio of electrical conductivity between p-FETs and n-FETs. SRAM cell pull-up (PU) refers to the two p-FETs in the feedback inverters of a 6T cell. SRAM cell pull-down (PD) refers to the two n-FETs in the feedback inverters of a 6T cell. SRAM cell passgate (PG) refers to the two (or more) n-FET pass transistors of a 6T cell. Each process split was evaluated against voltage and frequency as previously defined.
Table 5
Device sample points within the window of process variation. (Recoverable row: Decoupling capacitance (decap) low; decrease decap for switching noise.)
Along with identifying circuit problems, the characterization team manipulated various tuning bits and verified these settings across all splits to optimize the circuit yield and performance. Examples of these tuning bits include array clock-chopper pulse width and delay, local clock duty cycle and local clock delay, dynamic circuit pulse width, and others. The effect of tuning can be dramatic. Figure 10 shows, across a sample of parts, an average gain of ~500 MHz that is directly due to the tuning process.
LBIST is based on the traditional technique of LSSD, in which most latches can be scanned into or out of in order to set or read their contents. An LBIST sequence starts by scanning the chip latches to a pseudo-random, repeatable value. This is followed by applying a specific number of system clocks, commonly just one clock. Finally, the latches are scanned out and checked for correctness against a calculated result. The POWER6 processor includes a highly advanced and configurable LBIST engine. To simplify debug, the core and nest latches are broken into 71 subsections. Each subsection includes random-value generators and compression latches to facilitate rapid evaluation of results, which allows POWER6 processor LBIST to perform tens of thousands of tests per second. Masking logic in the engine allows the engineering team to restrict testing to latches within a scan subchain in order to isolate an individual failing latch. Because of the ubiquitous use of scannable master-slave latches in the POWER6 processor, LBIST covers the highest percentage of circuits in the shortest time of any of the test methods used. While LBIST in the POWER6 processor has a few weaknesses, it provides the broadest look at the microprocessor circuits of any of the available tests.
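The scan-in, clock, scan-out, and compare sequence can be modeled in software as a rough sketch. Nothing here reflects the actual POWER6 LBIST engine: the 16-bit LFSR taps are a commonly cited textbook choice, the signature polynomial 0x002D is arbitrary, and good_logic and faulty_logic are hypothetical stand-ins for the circuit under test.

```python
def lfsr_bits(seed, count, width=16, taps=(16, 14, 13, 11)):
    """Return `count` repeatable pseudo-random bits (seed must be nonzero)."""
    state = seed & ((1 << width) - 1)
    bits = []
    for _ in range(count):
        feedback = 0
        for tap in taps:
            feedback ^= (state >> (width - tap)) & 1
        state = (state >> 1) | (feedback << (width - 1))
        bits.append(state & 1)
    return bits


def compress(bits, signature=0, width=16, poly=0x002D):
    """Fold captured bits into a fixed-width signature register."""
    mask = (1 << width) - 1
    for bit in bits:
        msb = (signature >> (width - 1)) & 1
        signature = ((signature << 1) & mask) ^ bit
        if msb:
            signature ^= poly
    return signature


def good_logic(latches):
    """Toy combinational logic: each latch captures an XOR of neighbors."""
    return [latches[i] ^ latches[i - 1] for i in range(len(latches))]


def faulty_logic(latches):
    """Same logic with the value captured by latch 3 inverted."""
    captured = good_logic(latches)
    captured[3] ^= 1
    return captured


def lbist_signature(seed, logic, n_latches=32, loops=64):
    """Run `loops` scan-in / one-clock / scan-out tests, return signature."""
    stream = lfsr_bits(seed, n_latches * loops)
    signature = 0
    for loop in range(loops):
        scanned_in = stream[loop * n_latches:(loop + 1) * n_latches]
        captured = logic(scanned_in)          # one system clock
        signature = compress(captured, signature)
    return signature


golden = lbist_signature(0xACE1, good_logic)
repeatable = lbist_signature(0xACE1, good_logic) == golden
```

Because only the compressed signature is compared, many test loops can be evaluated with a single check, which is what makes a high test rate practical; the cost is that a signature mismatch alone does not say which latch failed, hence the need for masking.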
The characterization team defined 11 varieties of LBIST to incrementally cover POWER6 processor circuits. These 11 varieties are split between static (dc) and frequency-sensitive (ac) tests. The dc tests remain independent of chip frequency by only clocking the capturing latch after all latches are scanned and the downstream circuits have been evaluated. This result provides baseline data on chip functionality. The ac tests are run at a wide range of frequencies oriented toward isolating defects and design problems that limit chip speeds. At low speed, the ac test matches the dc result. At high speed, the ac test can deviate from the dc result because of failure of the circuits to evaluate in the chosen clock period, incorrect synchronization between clock domains, and improper gating of control signals, among other problems. Any systematic failures discovered are identified and fixed.
Because of the latch types used in the POWER6 processor, additional tests were needed in the ac and dc groups to check for correct operation of half-cycle latches. Similarly, special controls were added to facilitate testing of on-chip caches and array structures. These circuits can be forced into ''write-through'' behavior or read-and-write behavior. This allows for limited testing of the array read-and-write circuitry. Array structures with multiport write capability have to limit LBIST to protect against random scans causing device contention. These structures rely on extra circuitry designed to drive a single port during LBIST and vary that port throughout the test to maximize coverage. The enhancements to the engine, extra control circuits, and variety of tests created by the POWER6 processor design and lab teams allowed extensive LBIST testing and enhanced LBIST coverage beyond that of the POWER5 processor. This ability to test nearly all pipeline structures makes LBIST vital in the POWER6 processor for verifying circuit yield and performance. Figure 11 depicts an example of reduced yield in a particular LBIST that was due to a systematic problem affected by voltage. At 0.9 V, an average of 95% of the chips that were tested passed the test; this result is typical. At 1.1 V, two areas of the chip experienced higher than normal failures, resulting in yield degrading to 75% and 65%, respectively. With embedded masking logic included in the POWER6 processor LBIST engine, the failing path was isolated to the capturing latch. At that point, clock tuning bits and other stresses were used to manipulate the fail until the cause was understood. Very often, the fail may be fixable by using clock tuning bits, as described in the section ''Local clock and latch design.'' Occasionally, the failure is queued by the design team so that a fix can be included in the next chip revision.
Although LBIST evaluates the POWER6 processor execution pipeline well and has some visibility of cache circuits, it does not test the caches thoroughly. To comprehensively evaluate the cache structures, each POWER6 processor array is connected to a programmable hardware engine, which will perform ABIST. This hardware engine runs at frequencies higher than planned system operating frequencies so it can measure the maximum frequency of the array, in addition to identifying bad or weak cells and other test faults.
A typical ABIST involves a series of write and read operations. Each read cycle is intended to match a calculated result. Compression registers implemented for each cache in the POWER6 processor capture many such cycles' worth of testing as a single result, so the entire cache can be evaluated in a small, fixed number of cycles. For some caches, additional registers and address capture logic in the ABIST engine log up to five failing points in the array, and extra SRAM cells designed into those arrays can replace the failing cells. In this way, a cache that has some damage due to a physical defect or is weak because of process variation can be repaired to operate to system specifications.
Many varieties of tests are used to stress the caches. The standard test is a simple write-and-read pass through the array, with walking bit patterns to stress every circuit uniquely. Another test can write and read between all address combinations to look for sensitivity to switching patterns. ''Stability'' tests evaluate the ability of the SRAM cells to hold their state at a variety of voltages, particularly at lower voltages as transistor performance decreases.
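A walking-bit pass with a bounded fail log can be illustrated with a small software model. This is a sketch only: the FaultyArray class, its stuck-at-zero fault model, and the walking_ones function are hypothetical, and the limit of five logged fails simply mirrors the figure quoted above for some caches.

```python
class FaultyArray:
    """Toy SRAM model in which selected cells are stuck at zero."""

    def __init__(self, rows, width, stuck_at_zero=()):
        self.rows = rows
        self.width = width
        self.data = [0] * rows
        self.stuck = set(stuck_at_zero)   # set of (row, bit) cells

    def write(self, row, value):
        for stuck_row, stuck_bit in self.stuck:
            if stuck_row == row:
                value &= ~(1 << stuck_bit)
        self.data[row] = value & ((1 << self.width) - 1)

    def read(self, row):
        return self.data[row]


def walking_ones(array, max_logged=5):
    """Write and read back a single walking 1 in every bit position.

    Returns up to `max_logged` failing (row, bit) locations, in the
    order they are found.
    """
    fails = []
    for bit in range(array.width):
        pattern = 1 << bit
        for row in range(array.rows):
            array.write(row, pattern)
        for row in range(array.rows):
            if array.read(row) != pattern and len(fails) < max_logged:
                fails.append((row, bit))
    return fails


damaged = FaultyArray(rows=8, width=8, stuck_at_zero=[(2, 5), (6, 1)])
fail_log = walking_ones(damaged)
```

In this model the logged locations are exactly the cells a repair step would need to replace; an array with more failing cells than the log can hold would exceed what this simplified repair scheme could fix.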
Each POWER6 processor array contains a set of tuning bits to allow the characterization and design team to debug and stress the array. The characterization team looks for sensitivity to voltage, temperature, frequency, and the SRAM technology parameters previously identified. These sensitivities can often be mitigated or improved by using the tuning bits. Specific examples of tuning in the POWER6 processor include widening local clock pulses to increase the amount of time for cache writes, narrowing clock pulses to increase the frequency at which a cache reads, and aligning certain clock pulses differently to mitigate negative effects of raising the array voltage supply relative to the system power supply. Figure 12 shows an example of how voltage affects yield against four POWER6 processor wafers: two with the standard design (STD0 and STD2) and two with a device enhancement (EVAL3 and EVAL6). In this example, yield refers to the percentage of arrays tested that are functional versus the total number tested. The arrays are tested at three logic voltage points (V dd = 0.8 V, 0.9 V, and 1.1 V) and at six or seven array voltage points (e.g., V cs = V dd , V dd + 0.05 V, V dd + 0.10 V, and so on). As array voltage (V cs ) rises above logic voltage (V dd ), and as overall chip voltage (V dd and V cs ) increases, the yield increases, a typical response due to the increase in transistor performance. The red ellipse highlights a significant change in the voltage response at 0.8-V V dd and 1.2-V V cs , where the standard design (STD0 and STD2) fails dramatically, resulting in lower yield (<30%). The enhancement being evaluated (EVAL3 and EVAL6) improves the yield to 80% to 90%. The POWER6 processor characterization team extensively used ABIST in this manner to validate design or process changes in order to maximize performance, voltage margins, and manufacturability.
With a programmable ABIST engine, a suite of tuning bits for each array macro, and repair circuits included on the large arrays, the POWER6 processor is equipped to test the caches extensively in order to optimize the design. Most problems can be immediately improved, and the voltage and frequency operating ranges enhanced. POWER6 processor arrays currently operate over a range of voltages exceeding 0.8 V to 1.3 V, with performance exceeding 5 GHz.
While LBIST is the most complete coverage tool, ABIST specifically tests the arrays. POWER6 processor characterization relies on functional exercisers to stress the chip as would happen in a complete system. Functional exercisers are programs designed to emulate worst-case system code behavior. In addition to standard exercisers, the lab team has created software targeted to stress certain circuit and logic structures that are not well tested otherwise, or that are unique to the POWER6 processor design. Some of these software exercisers cover array ''hit-logic'' circuits that combine cache and logic in ways that LBIST and ABIST cannot effectively evaluate. The POWER6 processor also required special multithreaded mixes of specific code and broad-coverage code. This code was designed to maximize power consumption and exacerbate local heating and supply noise to affect peak frequency.
Since wafer-level testing does not allow for the POWER6 chip to access memory, initial testing was required to operate solely from the 8-MB L2 cache. The POWER6 processor implements features in the L2 cache and the memory controller to facilitate processor operation within the L2. These changes allow full system frequency evaluation in real time on the wafer probe. Being able to execute code in this way enables extensive detailed frequency, voltage, and temperature measurements very early in the design and manufacturing cycles.
The POWER6 processor also implements features to enable cycle-accurate reproduction of a given code sequence through extensive clocking and chip control logic. This mechanism is used to stop the machine across many cycles before and after a failure event in order to identify and isolate the problematic circuit. The code sequence proved effective in debugging a circuit problem related to the L1 data cache and is now commonly used for resolving logic bugs.
Correlating the performance of these test methods against each other improves the sorting effectiveness of the POWER6 processor since minimum frequency targets must be met by every chip. Figure 13 shows a peak frequency comparison of ABIST with various timing settings against a functional exerciser. Additionally, it provides another example of the improvement available by using the extensive tuning bits built into the POWER6 processor arrays. With the default setting, the ABIST maximum frequency generally stays within 200 MHz of, yet below, the exerciser. With the ability to fine-tune clock timings, the ABIST maximum frequency increases by ~200 MHz, moving it consistently above the functional exerciser peak. Being able to identify such specific frequency limitations and correct them increases the yield of chips in the manufacturing and sorting process.
With built-in hardware support for various test methods, the number of technology variations in which it works without performance loss, as well as extensive voltage, frequency, and temperature evaluation, the design of the POWER6 processor has proven to be robust and tunable.
Figure 13
POWER6 microprocessor segment look-aside buffer ABIST f max with tuning bits at 1.1 V/1.25 V.
Summary
The POWER6 chip has been fabricated in the IBM 65-nm SOI process. This process technology incorporates multiple device thresholds and ten layers of copper wiring with a low-k dielectric. The logic circuits were predominantly implemented in static CMOS circuits in order to reduce power. The POWER6 chip employs three distinct latch designs: a scannable, dynamic front-end latch that incorporates logic function into the latch; a scannable, master-slave latch that can be operated in pulsed mode to save power; and a scannable, hybrid pulsed latch that can be operated in an L2-only mode in order to minimize latch insertion delay, or in a safety mode for burn-in. The low-skew high-frequency POWER6 processor global clock distribution network was described. The POWER6 processor used a new custom macro design methodology to estimate parasitic resistances and capacitances earlier in the design flow. This methodology reduced the layout rework, extraction, and timing iterations needed to close all custom paths to a 13-FO4 cycle time.
The POWER6 processor parts have been demonstrated to operate in excess of 5 GHz and within the power constraints established for the chip. Chip power dissipation is reduced through modulation of operating voltages, fine-grained clock gating, latch and logic gate sizing, VT optimization, pulsed latches, and half-frequency operation of portions of the chip.
The POWER6 chip has been extensively tested at wafer, first-level package, and system levels. The evaluation was accomplished via LBIST, ABIST, and (real code) functional exercisers across wide voltage, frequency, and temperature ranges as well as process technology variations. These tests identified potential functionality and frequency weaknesses in array, logic, and latch circuits. The circuits identified as weak were modified in subsequent chip manufacturing releases to improve their robustness and speed.
