We describe the methodology used for the design of a set of CMOS support chips used in the IBM S/390® Parallel Enterprise Server Generations 3 and 4. The logic design is based on functional units, and the majority of the logic is implemented by standard cell elements placed and routed flat, using timing-driven techniques. Custom library elements are used wherever needed for performance reasons. Using this approach, a density has been achieved that is comparable to those of contemporary custom designs, combined with very attractive turnaround times.
Introduction
Custom design is the dominant design style for high-performance processors. It offers the advantage of full control over the size and the location of each transistor for performance tuning, but requires considerable effort to implement because of the complexity of a complete transistor-level design. This complexity creates the need to introduce additional hierarchies, usually leading to a "floorplanning" approach.
A standard cell design approach (Figure 1) makes it possible to globally apply advanced optimization algorithms, which reduce the manual effort required and improve the quality of the synthesized logic during layout. The use of basic standard cell elements reduces complexity to the extent that a complete chip design can be handled flat by layout and test generation tools, removing the need for artificial floorplan boundaries. Our approach uses a small number of custom logic macros and custom memory arrays whenever a standard cell solution is not competitive. The major part of the combinational logic, however, is implemented in standard cells.
Design entry, synthesis, and simulation are performed on the basis of functional units. There is no need to optimize logic partitioning on the basis of timing, layout, and test considerations. Flat, timing-driven placement and routing without floorplan boundaries minimizes interconnection delay in critical paths. This, coupled with in-place logic optimization, achieves a post-layout cycle time no more than 15% above the zero-net estimate.
The testing methodology we have used consists of design for test (DFT) to ensure high test coverage, and test pattern generation to enable testing, analysis, and debugging of chips in manufacturing. The key goals are fast turnaround time and high-quality testing.
Test data generation, circuit and logic design, and timing verification are performed with proprietary IBM tools [1-4]. The tools for layout optimization were
Copyright 1997 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
the service element for all chips in the system. The memory bus adapter (MBA) chips are direct-memory-access (DMA) controllers that are the interface between the asynchronous, byte-serial I/O buses and the 16-byte-wide system bus. The bus-switching network (BSN) chips hold shared level-3 caches and bus arbiters that control the concurrent access of PUs, MBAs, and system-wide memory. The storage controller (STC) chips are DRAM controllers, supporting transparent refresh, interleaving, and multibit error detection and repair. More details can be found in [5].
[Figure: Overview of design flow.]
Technology and design of custom elements

Technology
The CMOS process [6, 7] used on the chip set was developed by the IBM Microelectronics Division. The technology provides six layers of metallization: one layer for internal circuit wiring only, and four layers for wiring in a 1.8-µm wiring pitch. The last metallization layer is used primarily for wiring redistribution to the chip I/O pads. The technology parameters are shown in Table 1.
Library and chip image
The standard cell library we used provides a set of logic gates, latches, and I/O cells which fit into 3.5 million placement locations and are interconnected through horizontal and vertical wiring tracks defined by the chip image. The I/O cells can be placed anywhere among the 3.5 million legal locations. After chip placement and routing, the unused cell locations are filled with nonpersonalized gate array elements to provide an engineering change capability with metallization changes only.
Custom circuit design
The base standard cell library provides simple logic gates, but a small set of custom logic macros and custom SRAM macros was required for the special needs of the S/390 in order to improve cycle time and density. The custom implementation of the macros gives the circuit designer the freedom to use special circuit design techniques such as dynamic and double-pass circuits [8] to improve the propagation delay.
The circuit design flow (Figure 3) begins with a specification sheet defining the macro requirements. With this information, a model written in a proprietary hardware description language (HDL) is designed [9]. This HDL model defines the logic behavior and must be as compact as possible to reduce logic simulation time. The HDL model is thoroughly simulated against the specification sheet and becomes the "golden" model for the following design process. All other design sources required on the way to layout are checked against the golden model.
The first step of the schematic-driven layout is the implementation of the logic function in transistors with a schematic entry tool [10]. An iterative process based on transistor-level simulation followed by transistor modifications is necessary to meet the timing, performance, and power-consumption targets of the macro.
A Boolean equivalence checker [11] compares the transistor schematics against the golden model and gives early simulation-independent feedback of the correct implementation. An early timing model is generated for the chip-level delay calculator [4]. This early timing model is replaced later in the design process by the final timing model, based on information extracted from the circuit's layout. The device and net information in the schematics is used by a proprietary schematic-driven layout tool. Compliance of the macro layout with the technology design rules is checked with a hierarchical design rule check (DRC). The layout design style could vary, from a full shape-by-shape design to the use of circuit generators for base logic functions such as NANDs.
additional shape and text information must be added to the layout design. After this process, the custom macros can be used like big standard cell circuits, placeable in any legal location. Providing signal-pin, power-pin, and blockage information to the physical design tools allows automatic power and signal wiring at the chip level.
The custom macro layout is fed into a proprietary layout parasitic extraction (LPE) tool. The transistor geometries (width and length), as well as all parasitic elements such as diffusion capacitances and line-to-line capacitances, are then extracted from the layout. The generated netlist with parasitic elements is used for transistor-level resimulation to ensure that the function and performance are still correct. This netlist is the source for the final, most accurate timing model of the macro. After the custom macro layout is complete, a final layout versus schematic (LVS) check is performed. This check generates a layout netlist and compares it against the schematic netlist, not only checking network topology and device sizes, but also detecting net opens and shorts.
Finally a test model is generated, breaking down all transistor schematics into the primitive functions understood by test pattern generation (TPG), such as AND, NAND, NOR, OR, and XOR. This model is verified against the golden HDL model to guarantee logic equivalence between the implementations.
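As a rough illustration of what such an equivalence check establishes, the following Python sketch compares a behavioral description with a decomposition into NAND primitives by exhaustive enumeration of the inputs. The functions `golden_mux`, `gate_level_mux`, and `equivalent` are hypothetical names chosen for this example and are not part of the tools described here; enumeration is also only practical for very small functions, since its cost grows exponentially with the number of inputs.

```python
from itertools import product

def golden_mux(a, b, sel):
    """Behavioral ("golden") description: 2-to-1 multiplexer (hypothetical example)."""
    return b if sel else a

def gate_level_mux(a, b, sel):
    """The same function broken down into NAND primitives (hypothetical test model)."""
    nand = lambda x, y: not (x and y)
    n_sel = nand(sel, sel)  # inverter built from a NAND
    return nand(nand(a, n_sel), nand(b, sel))

def equivalent(f, g, n_inputs):
    """Return (True, None) if f and g agree on every input combination,
    otherwise (False, first_counterexample)."""
    for bits in product([False, True], repeat=n_inputs):
        if bool(f(*bits)) != bool(g(*bits)):
            return False, bits
    return True, None

print(equivalent(golden_mux, gate_level_mux, 3))
```

A mismatch is reported together with the first input combination on which the two descriptions differ, which is the kind of simulation-independent feedback the text refers to.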
Design entry, synthesis, and simulation
Design entry
The design system accepts design data in three forms: gate-level schematics, hardware description language (HDL) code, and finite-state machine (FSM) tables (Figure 4).
Gate-level schematics are preferred for data-flow-dominated designs and for designs that require careful manual design and optimization. Most parts of the processor and the L2 cache chip are designed at the gate level. The schematics are entered using a proprietary schematic editor that translates the schematics into gate-level netlists. Apart from macro expansion, this is a one-to-one translation; no logic optimization is performed.
HDL code and FSM tables are preferred for control-flow-dominated designs. Most parts of the support chips are HDL code or FSM table designs. HDL code is written in a proprietary hardware-description language [9]. The level of description is similar to the concurrent subset of VHDL: Boolean expressions, signal assignments, component instantiations, etc.
FSM tables are convenient because they describe finite-
Logic synthesis
The logic synthesis system, BooleDozer*, reads the HDL code and generates gate-level netlists. BooleDozer performs technology-independent optimization, technology mapping, and timing optimization to generate a netlist of minimal size that meets the delay objectives [2, 3]. Synthesis uses the same delay calculator as placement and routing, with the exception that interconnection capacitances and resistances are estimated as a function of fanout, based on statistics from placement and routing.
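The idea of estimating interconnection parasitics from fanout before layout can be sketched as follows. This is only a minimal illustration under stated assumptions: the linear model, the coefficient values, the lumped RC delay formula, and the names `estimate_net_parasitics` and `stage_delay_ps` are all hypothetical and are not taken from the delay calculator [4].

```python
def estimate_net_parasitics(fanout, cap_base_ff=5.0, cap_per_fanout_ff=12.0,
                            res_per_fanout_ohm=8.0):
    """Estimate wire capacitance (fF) and resistance (ohm) from fanout alone.

    The coefficients are made-up placeholders; in practice they would be
    fitted to statistics from earlier placement and routing runs.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    cap = cap_base_ff + cap_per_fanout_ff * fanout
    res = res_per_fanout_ohm * fanout
    return cap, res

def stage_delay_ps(drive_res_ohm, pin_cap_ff, fanout):
    """Lumped RC delay of one stage (assumed model, not the production one)."""
    wire_cap, wire_res = estimate_net_parasitics(fanout)
    load_ff = wire_cap + pin_cap_ff * fanout
    # ohm * fF = 1e-15 s = 1e-3 ps
    return (drive_res_ohm + wire_res / 2.0) * load_ff * 1e-3

for fanout in (1, 2, 4, 8):
    print(fanout, round(stage_delay_ps(1000.0, 10.0, fanout), 3))
```

After placement and routing, the estimated values would be replaced by extracted ones while the delay calculation itself stays the same.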
Because a full-chip design cannot be synthesized in one run, it must be partitioned into pieces of a few thousand synthesizable gates each. This approach has the advantage that synthesis jobs can run in parallel on multiple machines, reducing turnaround times. Typically, synthesis times range from one to ten hours of CPU time per partition, resulting in overnight turnaround.
Partitioning requires that delay objectives for the chip be broken down into delay objectives for each partition. This process, designated as slack apportionment, assigns delay objectives to partitions in such a way that if each partition meets its delay objective, the chip also meets the delay objective. The process first runs on an unoptimized design to generate initial delay objectives. The design is then resynthesized and optimized with respect to initial delay objectives, and is fed into slack apportionment again to generate improved delay objectives. This is an expensive process because it requires multiple full-chip synthesis runs, but in practice after two or three iterations the delay objectives become stable. Experiments show that slack apportionment need only be rerun after major design changes, which do not occur very often. Logic synthesis and schematic entry generate one netlist for each chip partition. The partition netlists are finally flattened into one chip netlist for flat placement and routing.
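One simple way to picture slack apportionment is a proportional scheme, sketched below in Python. The text does not specify the algorithm that is actually used, so the proportional rule, the function name `apportion_slack`, and the data layout are assumptions made purely for illustration.

```python
def apportion_slack(paths, cycle_time):
    """Split a chip-level cycle-time objective into per-partition objectives.

    paths: {path_name: {partition_name: current_delay}} for paths that
           cross several partitions (hypothetical data layout).
    Each path's objective is divided among its partitions in proportion to
    their current delays, so the per-partition objectives along a path
    always add up to the cycle time.
    """
    budgets = {}
    for name, segments in paths.items():
        total = sum(segments.values())
        if total <= 0:
            raise ValueError(f"path {name!r} has no delay to apportion")
        scale = cycle_time / total
        budgets[name] = {part: delay * scale for part, delay in segments.items()}
    return budgets

# A path that needs 10.0 time units today but must fit into 8.0.
print(apportion_slack({"p1": {"A": 4.0, "B": 6.0}}, 8.0))
```

If every partition meets its objective, each path meets the chip objective by construction. Rerunning the function on the delays of the resynthesized design corresponds to the iteration described above.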
Simulation
Extensive logic simulation at the unit, chip, and system level is performed to verify the functional correctness of the designs [12, 13]. Cycle-based simulation assumes zero delay, leaving timing verification to the delay calculator [4]. This approach nicely separates timing aspects from functional aspects and speeds up simulation considerably.
Unit-level and chip-level simulation are carried out using mostly HDL code models. This interactive mode of simulation is used primarily in the early stages of logic design to correct small design errors that are easy to detect. The bulk of simulation occurs at the system level. System simulation uses gate-level models for the processor, cache, and memory interface chips, and behavior-level models for the I/O chips. A simulation monitor initializes the storage elements (latches, memories) of the model, loads test cases into the model, and provides tracing, assertion checking, and reporting capabilities. The monitor has a full-screen interface for interactive simulation, but most of the system simulation is done in batch mode. The simulation is performed in parallel with logic design, as soon as an initial, unit-level simulated design is available.
Chip placement and routing
Timing optimization
interconnection length set to zero. Upon comparing the actual post-layout cycle time to this hypothetical zero-net cycle time using different design approaches such as floorplanning vs. flat, and timing-driven vs. connectivity-driven, we found that the approach that consistently produced the lowest interconnection delay was flat, timing-driven layout (Figure 5).
Placement
The ability to place and route complex designs flat and timing-driven is an important prerequisite for the design methodology presented here. This is made possible by quadratic optimization combined with a new quadrisection approach [14] . The approach computes net weights, derived from a concurrent timing analysis run, which are then used for the next optimization step. A description of detailed placement can be found in [15] .
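The effect of net weights in quadratic optimization can be illustrated with a one-dimensional toy example. The sketch below minimizes the weighted sum of squared two-pin net lengths by repeatedly moving each cell to the weighted average of its neighbors (a Gauss-Seidel iteration). It is a simplification made for illustration only: the quadrisection step of [14] is omitted, the weights are invented rather than derived from a timing analysis, and the function name `quadratic_place_1d` is hypothetical.

```python
def quadratic_place_1d(nets, fixed, movable, sweeps=200):
    """Minimize sum(w * (x_a - x_b)**2) over the positions of movable cells.

    nets:    list of (cell_a, cell_b, weight) two-pin connections
    fixed:   {name: position} for pads or preplaced cells
    movable: list of cell names to be placed
    """
    x = dict(fixed)
    x.update({m: 0.0 for m in movable})
    neighbors = {m: [] for m in movable}
    for a, b, w in nets:
        if a in neighbors:
            neighbors[a].append((b, w))
        if b in neighbors:
            neighbors[b].append((a, w))
    for _ in range(sweeps):
        for m in movable:
            total_w = sum(w for _, w in neighbors[m])
            if total_w > 0:
                # Optimal position of m with all other cells held in place.
                x[m] = sum(w * x[n] for n, w in neighbors[m]) / total_w
    return {m: x[m] for m in movable}

pads = {"L": 0.0, "R": 10.0}
uniform = quadratic_place_1d(
    [("L", "c1", 1.0), ("c1", "c2", 1.0), ("c2", "R", 1.0)], pads, ["c1", "c2"])
# Raising the weight of a (hypothetically) timing-critical net pulls c2 toward R.
weighted = quadratic_place_1d(
    [("L", "c1", 1.0), ("c1", "c2", 1.0), ("c2", "R", 3.0)], pads, ["c1", "c2"])
print(uniform, weighted)
```

With uniform weights the two cells spread evenly between the pads. Increasing the weight of one net shortens that net at the expense of the others, which is how timing information can steer the placement.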
In-place optimization
Logic optimization based on the actual placement is performed to further improve the cycle time. This is carried out in three steps:
1. Clock synthesis. The clock tree is not considered during placement but is instead resynthesized after placement using a zero-skew approach similar to [16, 17]. Routing information for balanced routing is created as an input to the routing step.
2. Power-level optimization. This is performed for timing optimization and power reduction. It uses one of the five power levels available for each standard-cell circuit.
3. Buffer insertion. This is performed on timing-critical paths that still exceed the cycle-time limit after power-level optimization.
The resulting decisions are always based on actual placement data, as each circuit added is assigned to a placement location. Details on timing analysis and optimization techniques that are used can be found in [18].
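The interplay between power-level optimization and buffer insertion can be pictured with a small greedy sketch. Everything in it is an assumption made for illustration: the delay model (an intrinsic delay plus load divided by power level), the greedy selection rule, and the name `optimize_power_levels` are hypothetical, and the sketch ignores both the power-reduction goal and the extra load that a stronger circuit places on its driver. The actual techniques are described in [18].

```python
MAX_LEVEL = 5  # five power levels per standard-cell circuit, as stated in the text

def gate_delay(load, level, intrinsic=0.1):
    """Assumed delay model: a higher power level drives the load faster."""
    return intrinsic + load / level

def optimize_power_levels(loads, cycle_time):
    """Raise power levels along one path until it meets cycle_time.

    loads: output load of each circuit on the path (arbitrary units).
    Returns (levels, path_delay). If the path still misses its target with
    every circuit at the highest level, it is left for buffer insertion.
    """
    levels = [1] * len(loads)

    def path_delay():
        return sum(gate_delay(l, k) for l, k in zip(loads, levels))

    while path_delay() > cycle_time:
        best, best_gain = None, 0.0
        for i, (load, level) in enumerate(zip(loads, levels)):
            if level < MAX_LEVEL:
                gain = gate_delay(load, level) - gate_delay(load, level + 1)
                if gain > best_gain:
                    best, best_gain = i, gain
        if best is None:
            break  # nothing left to upsize
        levels[best] += 1
    return levels, path_delay()

print(optimize_power_levels([1.0, 2.0, 0.5], 2.5))
```

In this example the path meets its target after two circuits have been raised by one level each, so no buffer would be needed.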
Routing
Special nets such as power buses and nets connected to I/O pads are routed first, and then congestion-driven global routing defines guide boxes for the following local routing step. The information generated during clock optimization drives the balanced routing of the clock nets. The ability to route the entire design flat removes the suboptimality introduced by the necessary pin propagation in a hierarchical approach. The routing tool supports different wire widths and separations and has a crosstalk analysis as well as removal capability [19, 20].
Results
The run times for placement and routing are very attractive considering that hardly any manual intervention is required. For example, a complete timing-driven layout for the MBA chip, consisting of two placement and optimization iterations and a routing step, can be performed in less than five days of processor time (Table 3).
[Table 3: Run times and memory for the MBA chip. Column headings: Layout step, Run times, Memory.]
Figure 6 shows the net length distribution for the 15.5 × 15.5-mm² MBA chip. About 70% of the nets are less than 0.5 mm long, and very few nets are more than 5 mm in length. Restricting the functional units to floorplan regions would introduce global nets, which are typically longer. This relatively small increase in interconnection delay is an inherent advantage of our flat, timing-driven layout approach. On the most timing-critical paths we have been able to keep the ratio of post-layout to zero-net cycle time below 1.15. The actual ratio depends on the given
[Figure 6: Net length distribution for the MBA chip.]
Boolean compare and engineering changes
To avoid any risk of introducing logic errors during in-place optimization, a Boolean equivalence checking tool [11] is used to verify the equivalence of the pre- and post-layout netlists. The design system supports late metallization-only changes by rerouting or by using gate-array circuits. This process is complicated by the fact that in-place optimizations during layout, and late functional changes by the logic designers, are carried out concurrently. We have implemented a flow to incorporate the functional changes performed on the pre-layout netlist into the post-layout netlist.
Definition of test methodology and design of test macros
All of our designs follow the level-sensitive scan design (LSSD) rules [26]. This allows race-free testing and initialization of all memory elements in the chip at any level. The implementation is always full-scan. Our main test approach is built-in self-test (BIST), in which different state machines are designed that execute the test after initialization. BIST is used to test both combinational logic (LBIST) and memory arrays
the power-on sequence in a customer's office. They can be run at system cycle speed in chip manufacturing using on-product clock generation and on-chip PLLs. A small number of special I/O circuits allow testing of all signal I/Os without contacting them at wafer test. This reduced-pin-count testing technique [29] allows the use of less expensive test vehicles in manufacturing for most of the tests. The custom library elements are checked early in the design phase for testability, and, if necessary, logic circuits are added to improve controllability and observability.
Design of test control logic together with clock generation logic
Embedding test control logic into on-product clock generation allows more accurate testing by using the system clock distribution. The test control logic is very similar to the IEEE JTAG controller [30], but in addition it has several registers to set up and control the different tests that are executed. For example, registers are used to define the length of the LBIST test sequence, or the way of clocking, or to disable certain parts of the chip. The test control logic is designed once and reused on all chips in the set. The basic LSSD design and the common test controller are embedded in the functional portion of each chip in such a way that they are virtually invisible to the functional logic designer (Figure 7).
Test structure verification (TSV)
An IBM-developed tool set, TestBench* [1], is used not only for test data generation, but also to check for design rule compliance. TestBench checks and analyzes compliance with LSSD and several other rules:
Boundary scan rules. These rules enable us to test the I/O area independently of the internal logic of the chips, and vice versa. Our implementation [31] is similar to the IEEE JTAG boundary scan design.
Self-test rules. In all self-test designs, propagation of undefined states into the signature analyzer is prohibited because it would corrupt the final signature. Another important check ensures that the self-test chain lengths are equal for all chips in the set, allowing us to reuse the same LBIST control logic on the chips.
IDDQ rules. Because all of our chips are also tested with IDDQ test patterns, it is necessary that all current can be turned off for the measurements. We devoted a separate test I/O pin to control this.
Testability analysis (TA)
The testability goal for our chips is 99.9% stuck-fault coverage and 95% delay-fault coverage. With TestBench we are able to generate the fault models as well as to analyze the problem areas. The implementation of LSSD full-scan in addition to LBIST enables TestBench to produce very high coverage almost immediately. The testability problem that we deal with is mainly redundancy removal. However, because our primary test method is LBIST, it is also very important to identify logic portions that are hard to test with random patterns. We add controllability and observability wherever possible to achieve at least 98% LBIST stuck-fault coverage.
[Figure 7: Test control logic.]
Test pattern generation
Figure 8 shows the test pattern generation (TPG) flow. After a design has passed the TSV and TA checks, TPG generates the actual chip and/or module test data to be used during manufacturing, as well as the system LBIST signatures that are checked in the machine. TPG generates the following test data:
LBIST/ABIST tests. These can be run in two different ways: controlled by a single oscillator using the on-product clock-generation logic, or controlled by dedicated tester clocks, the so-called LSSD clocks. This is important for diagnostic purposes.
Deterministic test. Additional test patterns are generated to supplement the LBIST test coverage, in order to achieve 99.9% stuck-fault coverage.
I/O test. Using the boundary-scan chain, all I/Os can be set independently to logic 1 or 0 so that these patterns are relatively easy to generate and very compact.
Table 4 shows TPG statistics for the MBA chip.
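The logic self-test principle referred to above (pseudo-random patterns applied to the logic, with the responses compacted into a signature) can be sketched in a few lines of Python. This is a toy model under stated assumptions: the 4-bit register width, the feedback polynomial x^4 + x^3 + 1, the serial signature register (a single-input relative of a multiple-input signature register), and the 4-input AND used as logic under test are all chosen for illustration and do not describe the actual LBIST hardware.

```python
def lfsr_patterns(seed=0b0001, count=15):
    """4-bit LFSR with feedback polynomial x^4 + x^3 + 1.

    Yields `count` pseudo-random patterns; with a nonzero seed the sequence
    visits all 15 nonzero states before repeating.
    """
    state = seed
    for _ in range(count):
        yield state
        feedback = ((state >> 3) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0xF

def compact(signature, response_bit):
    """One clock of a 4-bit serial signature register using the same feedback."""
    feedback = ((signature >> 3) ^ (signature >> 2) ^ response_bit) & 1
    return ((signature << 1) | feedback) & 0xF

def and4(pattern, stuck_at_0=False):
    """Hypothetical logic under test: 4-input AND, optionally with its output stuck at 0."""
    return 0 if stuck_at_0 else int(pattern == 0b1111)

def lbist_signature(stuck_at_0=False):
    """Apply all patterns and return the final signature."""
    signature = 0
    for pattern in lfsr_patterns():
        signature = compact(signature, and4(pattern, stuck_at_0))
    return signature

good = lbist_signature()
faulty = lbist_signature(stuck_at_0=True)
print(f"good={good:04b} faulty={faulty:04b} detected={good != faulty}")
```

The wide AND gate is a typical random-pattern-resistant structure: only one of the 15 patterns exposes the fault, which is why added controllability and observability help LBIST coverage. The sketch also suggests why undefined states must not reach the signature analyzer: a single unknown response bit would make the final signature unpredictable.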
Outlook
We have described a standard-cell-based VLSI design system producing results which are competitive with custom design solutions in terms of density, in a very short turnaround time. In the future, we expect the logic complexity and the number of small custom macros to increase. To verify that this complexity can be handled by our methodology, we have successfully placed and routed an experimental design consisting of 580 000 standard cells. The low percentage of long nets inherent in our design approach should minimize the impact of the higher interconnection delay expected in future, denser chip technologies. We are currently focusing on improvements in parasitic extraction and the analysis and avoidance of crosstalk. Furthermore, efforts are being put into faster system-level simulation techniques.
*Trademark or registered trademark of International Business Machines Corporation.
