Engine" LSI are presented with emphasis on practical aspects of verification and timing closure. A combination of simulation, emulation and formal verification ensured functional first silicon for system evaluation. In order to control wire delay at an early design stage, floor-plan based synthesis and wire load estimation are adopted for quick timing closure.
I. Introduction
"Emotion Engine" ("EE") is a system LSI that includes a 300MHz 128-bit 2-way superscalar RISC core, two Vector Units ("VUs"), an Image Processing Unit ("IPU") for MPEG-2 stream decoding, a 10-channel direct memory access (DMA) controller, a two-channel Rambus memory controller (RAC) and other peripheral modules [1], [2]. 13.5M transistors are integrated on a 15.02 mm x 15.04 mm die with 0.25um device technology. The chip photo and the block diagram are shown in Fig.1 and Fig.2, respectively. Not only the RISC core but also VPU0 and VPU1 have their own program code. This complexity makes verification more difficult than for a single-processor chip. These three processors, which run synchronously at 300MHz, occupy as much as 128mm2 of the die, so careful timing design and clock skew management are required. This paper focuses on verification and timing closure because they were the most crucial aspects of this development project.
II. Verification Methodology
EE integrates several processors: the 128-bit RISC core, VPU0, VPU1 and the IPU. Concurrent data transfers among them take place on a 128-bit-wide on-chip bus, either under program control or through the 10-channel DMA controller. In order to manage this complexity, three verification approaches are adopted: simulation, emulation and formal verification. The overall verification flow is shown in Fig.3.
B. Emulation
In order to speed up the verification process, a hardware emulation system is used. Linux OS is ported to the RISC core for verification purposes. For emulation of the whole "EE", an external hardware board is designed and connected to the emulation system to verify the interface protocols. The emulation speed is 614K cycles/second, roughly 40,000 times faster than RTL simulation. Emulation speed is attractive, but mapping the EE gate netlist to the emulation system often takes time, and debugging on the emulator is difficult.
0-7803-5973-9/00/$10.00 ©2000 IEEE.
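The emulation figures quoted above imply a rough RTL simulation speed, which the following sketch works out. Only the 614K cycles/second and the 40,000 times ratio come from the text; the workload of 300 million cycles (one second of operation at 300MHz) is a hypothetical example chosen for illustration, and the derived RTL simulation speed is an inference rather than a number reported here.

```python
# Figures quoted in the text.
emulation_cps = 614_000   # emulation speed in cycles per second
speedup = 40_000          # emulation is roughly this many times faster than RTL simulation

# Implied RTL simulation speed (inferred, not stated in the paper).
rtl_sim_cps = emulation_cps / speedup

# Hypothetical workload: one second of real operation at 300 MHz.
workload_cycles = 300_000_000

emulation_seconds = workload_cycles / emulation_cps
rtl_sim_seconds = workload_cycles / rtl_sim_cps

print(f"implied RTL simulation speed: {rtl_sim_cps:.2f} cycles/s")
print(f"emulation time:      {emulation_seconds / 60:.1f} minutes")
print(f"RTL simulation time: {rtl_sim_seconds / 86_400:.0f} days")
```

Under these assumptions, a workload that takes minutes on the emulator would take months in RTL simulation, which is consistent with the motivation for porting an operating system to the emulated core.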
C. Formal verification
Formal verification is applied to check the equivalence of different design descriptions, for example a handcrafted custom block (CB) circuit versus its RTL description.
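The idea of equivalence checking can be illustrated with a toy example. The sketch below compares a behavioral 1-bit full adder with a gate-level version by enumerating every input vector. It is only an illustration: the function names are hypothetical, and production equivalence checkers typically rely on symbolic techniques rather than exhaustive enumeration, which is feasible only for very small blocks. It does not represent the tool used in this project.

```python
from itertools import product

def adder_reference(a, b, cin):
    """Behavioral ("RTL-like") 1-bit full adder: returns (sum, carry_out)."""
    total = a + b + cin
    return total & 1, total >> 1

def adder_gates(a, b, cin):
    """Gate-level full adder built from XOR, AND and OR only."""
    p = a ^ b
    return p ^ cin, (a & b) | (p & cin)

def adder_gates_buggy(a, b, cin):
    """Same as adder_gates, but with a deliberately wrong carry term."""
    p = a ^ b
    return p ^ cin, (a & b) | (a & cin)

def find_mismatch(f, g, n_inputs):
    """Return the first input vector on which f and g differ, or None if they agree everywhere."""
    for bits in product((0, 1), repeat=n_inputs):
        if f(*bits) != g(*bits):
            return bits
    return None

print(find_mismatch(adder_reference, adder_gates, 3))        # no counterexample
print(find_mismatch(adder_reference, adder_gates_buggy, 3))  # a counterexample vector
```

The useful property, even in this toy form, is that a failed check returns a concrete counterexample that a designer can replay in simulation.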
The design methodology of the 300MHz VU (Vector Engine) consists of careful functional design based on the VU architecture, a consistent design hierarchy from RTL to floor-plan, estimation of wire shape for pre-layout static timing
B. Repeater insertion
Even if most interconnections are managed well, long interconnections are inevitable. Typical examples are the RISC stall signals and the bus signals, which are used at many locations across the chip. "RePertory", an automatic repeater insertion program, has been newly developed in this design project and is used effectively [4]. Repeater theory itself is well established and not new. However, we think that some improvements are still necessary in order to apply it to actual LSI design.
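As background on why repeaters help, the sketch below uses a textbook first-order RC delay model in which a long wire is split into k equal segments, each driven by an identical buffer, and searches for the segment count that minimizes delay. This is not the algorithm of the program described here. The 0.7 and 0.4 coefficients are the usual 50% delay approximations for lumped and distributed RC, and every electrical value is a hypothetical number chosen for illustration.

```python
def segmented_wire_delay(k, r_wire, c_wire, r_drv, c_in):
    """Approximate delay (s) of a wire split into k equal segments.

    Each segment is driven by one identical buffer with output resistance
    r_drv and loads the next buffer's input capacitance c_in, so k segments
    correspond to k - 1 inserted repeaters plus the original driver.
    """
    r_seg = r_wire / k
    c_seg = c_wire / k
    per_segment = 0.7 * r_drv * (c_seg + c_in) + r_seg * (0.4 * c_seg + 0.7 * c_in)
    return k * per_segment

def best_segment_count(r_wire, c_wire, r_drv, c_in, k_max=50):
    """Brute-force search for the segment count with the smallest modeled delay."""
    return min(
        range(1, k_max + 1),
        key=lambda k: segmented_wire_delay(k, r_wire, c_wire, r_drv, c_in),
    )

# Hypothetical values, not taken from the paper.
R_WIRE = 500.0    # total wire resistance, ohms
C_WIRE = 2e-12    # total wire capacitance, farads
R_DRV = 1000.0    # buffer output resistance, ohms
C_IN = 20e-15     # buffer input capacitance, farads

k_best = best_segment_count(R_WIRE, C_WIRE, R_DRV, C_IN)
unbuffered = segmented_wire_delay(1, R_WIRE, C_WIRE, R_DRV, C_IN)
buffered = segmented_wire_delay(k_best, R_WIRE, C_WIRE, R_DRV, C_IN)
print(f"best segment count: {k_best}")
print(f"delay without repeaters: {unbuffered * 1e12:.0f} ps")
print(f"delay with repeaters:    {buffered * 1e12:.0f} ps")
```

A model of this kind ignores placement constraints, module boundaries and per-net restrictions, which are the practical issues that matter when repeater insertion is applied to a real chip.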
RePertory features include:
Good controllability of the repeater locations (topological and geometrical) and of the number of repeaters for the optimum result,
Preservation of module boundaries to enable the subsequent In-Place-Optimization (IPO),
Control over the nets in which repeaters are inserted (repeater insertion can be forced or prohibited for each net).
III. Timing Closure
As design rules have shrunk to the 0.25-0.18 um level and clock frequencies have risen, controlling interconnect delay has become a significant task in LSI design projects. We think the nature of this problem lies in the fact that floor-plan, RTL/gate structure and interconnection delay have become much more "closely coupled" with each other. For example, one of the critical paths of the RISC core is the so-called "Load Path" from the data cache memory to the integer register through the integer datapath. The designers' interests at an early design stage would include:
how much the path delay is (relative to other critical paths),
how much timing would be improved if another floor-plan is applied.
We advocate that the key is quick timing analysis and
C. Clock skew management
Clock skew must be minimized in order to improve the maximum clock rate, avoid race conditions, and guarantee more timing margin for circuit designers. Many studies on the automation of clock layout and synthesis have been reported, and an ASIC design flow can utilize the automated methodology effectively because of its flexible, standard cell (SC) based layout. In contrast, almost all of the clock network of EE is designed and drawn manually, since a high percentage of the die area is occupied by custom blocks (CBs) and the floor-plan is not flexible enough for automated clock design. In those circumstances, an automated clock tuning method is developed to get accurate timing results promptly [5]. Less than 116ps overall clock skew has been achieved across the 15.02 mm x 15.04 mm die.
An integration trend for PC-CPU and game-CPU is shown in Fig.4. Within several years, a game-CPU with nearly 100M transistors is expected to emerge! In order to develop such CPUs, progress in CAD and methodology is necessary. In the verification area, for instance, much faster simulation/emulation techniques and robust formal verification will be necessary. In the backend area, "interconnection centric" tools ranging over floor-plan, logic synthesis, place & route, extraction and timing analysis will be necessary. As signal swing gets smaller and coupling capacitance among adjacent wires increases, induced noise on a signal wire will become a more serious problem. A CAD tool to estimate noise immunity will be necessary, too.
