This third-generation 64b 1.1GHz 4-instruction-issue SPARC™ RISC microprocessor supports 1- to 4-way high-end desktop workstations and workgroup servers, with a focus on higher system integration and cost reduction. The chip operates at 1.1GHz to 1.4GHz and dissipates 53W at 1.3V, 1.1GHz. It contains 87.5M transistors (63M in RAM cells) and is implemented in a copper 7-metal-layer 0.13µm CMOS process (Figure 20).
The processor uses the same 14-stage core pipeline as the processors described in References [1, 2, 3]. The pipeline supports the concurrent launch of up to six instructions, which can consist of 2 integer operations, 2 FP operations, 1 memory operation (load/store), and 1 control transfer instruction (CTI). Only 4 instructions per cycle can be sustained. There are 3 floating-point units (add/sub, multiply, divide). Up to 2 floating-point loads can be issued per cycle.
The on-chip level-1 caches include a 64kB 4-way data cache, a 32kB 4-way instruction cache, a 2kB 4-way data prefetch cache, and a 2kB 4-way write cache. The instruction and data caches have parity protection.
The design complements the third-generation architecture by supporting an on-chip 1MB L2 cache, a 16B DDR memory interface, and a 200MHz cache-coherent SMP system bus interface [4]. The 1MB, 4-way L2 cache has a 64B line size, holds both instructions and data, and is physically indexed and physically tagged. It implements a pseudo-random cache-line replacement policy. It is a write-back, write-allocate cache that supports the MOESI cache-coherency protocol. It operates at half the CPU frequency with 6-cycle latency and 2-cycle throughput, and has parity protection on tags and ECC protection on data.
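As a rough check on the L2 organization described above, the following sketch derives the set count and the address-bit split from the stated capacity, line size, and associativity. The function name `cache_geometry` is illustrative only, and the tag width is omitted because the physical address width is not given in the text.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Derive basic set-associative cache parameters (power-of-two sizes assumed)."""
    lines = size_bytes // line_bytes          # total cache lines
    sets = lines // ways                      # lines are grouped into sets of `ways`
    offset_bits = line_bytes.bit_length() - 1 # log2(line size)
    index_bits = sets.bit_length() - 1        # log2(number of sets)
    return {"lines": lines, "sets": sets,
            "offset_bits": offset_bits, "index_bits": index_bits}

# Figures from the text: 1MB capacity, 64B lines, 4-way set associative.
l2 = cache_geometry(1 << 20, 64, 4)
print(l2)
```

With these figures the cache has 4096 sets, so a physical address contributes 6 offset bits and 12 index bits.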
The proprietary 200MHz 128b synchronous, packet-switched, cache-coherent JBus system interface lets multiple processors communicate over a shared address and data bus. It provides 3.2GB/s maximum bus bandwidth and zero- to two-cycle arbitration latencies, and supports up to seven agents. It supports strong memory ordering, as well as 1/2 and 1/32 EnergyStar modes to reduce system power.
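The quoted peak JBus bandwidth is consistent with one 128b transfer per bus clock. That transfer rate is an assumption made for this sketch; the text states only the clock, the width, and the resulting bandwidth.

```python
JBUS_CLOCK_HZ = 200_000_000   # 200MHz bus clock (from the text)
JBUS_WIDTH_BITS = 128         # shared address/data bus width (from the text)
TRANSFERS_PER_CLOCK = 1       # assumption: single data rate

jbus_peak_bytes_per_s = JBUS_CLOCK_HZ * TRANSFERS_PER_CLOCK * JBUS_WIDTH_BITS // 8
print(jbus_peak_bytes_per_s / 1e9, "GB/s")
```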
The on-chip memory controller supports 133MHz double-data-rate (DDR1) SDRAM with a 256MB to 16GB memory space, and provides 4.2GB/s of off-chip memory bandwidth per processor. The data bus is 128b wide, with an additional 9b for ECC.
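The memory figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the text (133MHz clock, 128b data, 9b ECC, and the one-strobe-per-4b ratio quoted in the FIFO discussion); the constant names are illustrative.

```python
import math

MCLK_HZ = 133_000_000    # external memory clock (from the text)
DATA_BITS = 128          # data bus width
ECC_BITS = 9             # additional ECC bits
BITS_PER_STROBE = 4      # "1 strobe per 4b data"

transfers_per_s = 2 * MCLK_HZ                        # DDR: data moves on both clock edges
peak_bytes_per_s = transfers_per_s * DATA_BITS // 8  # ECC bits carry no payload
bus_bits = DATA_BITS + ECC_BITS                      # total signals to capture
strobes = math.ceil(bus_bits / BITS_PER_STROBE)      # one DQS per 4 bits, rounded up

print(peak_bytes_per_s / 1e9, "GB/s;", bus_bits, "bits;", strobes, "strobes")
```

The result is about 4.26GB/s, in line with the quoted 4.2GB/s, and the 137b and 35-strobe figures match those cited for the read path.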
One of the major challenges is handling three clock domains (CPU, memory, and JBus system clock) while meeting the requirements for high-speed memory access and flexibility in choosing the three clock frequencies. From the same JBus system clock reference, PLL1 generates the required 1.1GHz CPU clock (cpu_clk) and PLL2 generates the 266MHz internal memory clock (mclk2). A divide-by-two circuit generates the 133MHz external memory clock (mclk) and the 90° phase-shifted strobes required by the DDR specification for DRAM memory writes. The domains are synchronized by an "edge align detector" circuit, which provides two control signals to the CPU clock domain: jbu_sync for the JBus control unit and mclk_sync for the memory control logic (Figure 20.3.3).
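To make the relationship between the three domains concrete, the sketch below computes the frequency ratios implied by the quoted figures. It takes the frequencies exactly as stated; the 266MHz and 133MHz values may be rounded in the text, and the variable names are illustrative.

```python
from fractions import Fraction

JBUS_MHZ = 200    # system clock reference (from the text)
CPU_MHZ = 1100    # cpu_clk generated by PLL1
MCLK2_MHZ = 266   # internal memory clock generated by PLL2 (as quoted, possibly rounded)

pll1_ratio = Fraction(CPU_MHZ, JBUS_MHZ)    # cpu_clk cycles per JBus cycle
pll2_ratio = Fraction(MCLK2_MHZ, JBUS_MHZ)  # mclk2 cycles per JBus cycle
mclk_mhz = MCLK2_MHZ / 2                    # divide-by-two output

print(pll1_ratio, pll2_ratio, mclk_mhz)
```

At these frequencies, 11 CPU clock cycles span exactly 2 JBus cycles.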
During reads from the external memory, a 4-stage asynchronous FIFO, shown in Figure 20.3.4, absorbs the multiple skews between the memory transaction strobes and releases the data in the CPU clock domain [5]. This yields a simple and robust design: it eliminates the multiple DLLs that would otherwise have to be dynamically fine-tuned (35 strobes, 1 strobe per 4b of data) and the need for a synchronizer on each of the 137b, with the attendant metastability risk and higher latency. Each FIFO stage comprises a latch to hold a data value and control circuitry to regulate the transfer of values between stages. Data is shifted into stage A at every strobe signal (DQS) transition if stage A is vacant. This data is then automatically shifted towards D, to the next available vacant stage, in a self-timed manner, independent of strobe or clock. Up to four values can be stored before the receiver must start removing data to create a vacancy in the FIFO and avoid overflow. The released data (one value per receiver clock cycle) is synchronous to the receiver (CPU) clock domain.
Timing analysis uses standard slope-dependent timing models and incorporates timing-window information to identify the effective Miller factor per net. The router uses various wire classes to achieve the best speed and noise immunity with a minimum area penalty. Low-VT transistors (3%) provide a speedup of up to 15%.
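The behavior of the 4-stage read FIFO described above can be sketched as a small software model. This is a behavioral illustration only, not the circuit from Figure 20.3.4: the class and method names are hypothetical, the self-timed ripple is treated as instantaneous, and `None` marks a vacant stage.

```python
class SelfTimedFifo:
    """Behavioral sketch of a ripple FIFO; index 0 is stage A, the last index is stage D."""

    def __init__(self, depth=4):
        self.stages = [None] * depth

    def _ripple(self):
        # Move each value toward stage D into the farthest reachable vacant stage.
        # Stages nearest the output are handled first, so ordering is preserved.
        for i in range(len(self.stages) - 2, -1, -1):
            j = i
            while (j + 1 < len(self.stages)
                   and self.stages[j] is not None
                   and self.stages[j + 1] is None):
                self.stages[j + 1], self.stages[j] = self.stages[j], None
                j += 1

    def strobe(self, value):
        """Model one DQS transition; returns False if stage A is occupied (overflow)."""
        if self.stages[0] is not None:
            return False
        self.stages[0] = value
        self._ripple()
        return True

    def receive(self):
        """Model one receiver clock: take the value at stage D, or None if it is vacant."""
        value = self.stages[-1]
        self.stages[-1] = None
        self._ripple()
        return value


fifo = SelfTimedFifo()
accepted = [fifo.strobe(v) for v in ("d0", "d1", "d2", "d3", "d4")]
drained = [fifo.receive() for _ in range(4)]
print(accepted, drained)
```

In this model the fifth write is rejected because all four stages are full, which is the overflow condition the receiver must prevent by draining in time.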
The mintiming analysis uses global/local skew-per-distance estimates to minimize the number of false mintiming violations. The clock grid achieved a structural skew of 10ps, while process, voltage, and temperature variation across the chip caused an additional 70ps of skew.
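For scale, the two skew figures can be compared against the clock period. The sketch below assumes the 1.1GHz operating point and simply adds the two contributions, as the word "additional" suggests; the variable names are illustrative.

```python
CPU_HZ = 1.1e9             # nominal CPU clock (from the text)
STRUCTURAL_SKEW_PS = 10    # clock-grid structural skew
PVT_SKEW_PS = 70           # additional skew from process, voltage, temperature

period_ps = 1e12 / CPU_HZ                       # one clock cycle in picoseconds
total_skew_ps = STRUCTURAL_SKEW_PS + PVT_SKEW_PS
skew_fraction = total_skew_ps / period_ps       # share of the cycle consumed by skew

print(round(period_ps, 1), total_skew_ps, round(skew_fraction, 3))
```

Under these assumptions the combined 80ps is a little under 9% of the roughly 909ps cycle.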
The noise analysis uses an in-house tool that incorporates timing-window information to identify the real violations. Other in-house tools automated the electromigration, mintiming, and noise fixes at the CPU level.
2002 IEEE International Solid-State Circuits Conference 0-7803-7335-9 ©2002 IEEE
