This custom RISC microprocessor delivers approximately 230 Dhrystone MIPS at 200MHz while dissipating 0.5W from a 1.5V internal supply. The internal supply can be biased from 1.5V to 2.0V, while the external supply is always biased at 3.3V. A split supply minimizes power consumption for a given application, while maintaining compatibility with a 3.3V external environment. The die contains 2.5M transistors and is 8.24×9.12mm². It is fabricated in a 0.35µm, 2.0V, n-well, single-poly, 3-metal-layer CMOS process (Table 1). It is packaged in a 208-pin thin quad flat pack.
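As a quick check on the headline figures, the following minimal sketch derives the efficiency metrics implied by the numbers quoted above. The supply-scaling helper assumes the textbook first-order CMOS model in which dynamic power scales with the square of the supply voltage at fixed frequency and switched capacitance; that model is an assumption for illustration and is not taken from the paper.

```python
# Figures quoted in the text.
DHRYSTONE_MIPS = 230.0
CLOCK_MHZ = 200.0
POWER_W = 0.5
DIE_W_MM, DIE_H_MM = 8.24, 9.12


def derived_metrics():
    """Return efficiency figures computed directly from the quoted numbers."""
    return {
        "mips_per_watt": DHRYSTONE_MIPS / POWER_W,
        "mips_per_mhz": DHRYSTONE_MIPS / CLOCK_MHZ,
        "die_area_mm2": DIE_W_MM * DIE_H_MM,
    }


def relative_dynamic_power(v_supply, v_ref=2.0):
    """Assumed first-order model: dynamic power is proportional to V^2
    at fixed clock frequency and switched capacitance."""
    return (v_supply / v_ref) ** 2


if __name__ == "__main__":
    for name, value in derived_metrics().items():
        print(f"{name}: {value:.4g}")
    print(f"1.5V vs 2.0V dynamic power ratio: {relative_dynamic_power(1.5):.4f}")
```

Under that assumed model, running the core at 1.5V rather than at the 2.0V upper limit of the internal supply range would cut dynamic power to roughly 56% at the same clock frequency.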
To achieve 0.5W power consumption while providing for a 200MHz design, several circuit design principles are followed throughout the design. Wherever possible, stand-alone latches are avoided in favor of latches with logic embedded in them. This reduces both circuit delay and power consumption. As an example, Figure 2 shows a 4-way muxing latch used in bypassing logic. While the basic latch structure is based on dynamic design principles for speed, an additional nMOS device is placed between nodes L3 and L4 to ensure static operation during the evaluation phase of the clock. Conditional clocking is used throughout the design. The device is switched to a lower clock frequency when doing off-chip memory transactions. To limit Idoff leakage in the 0.35µm process, the lengths of all devices in several slower sections are increased.
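The idea of folding a multiplexer into the latch and gating its clock can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class name `MuxLatch4`, the one-hot select encoding, and the transparent-high clock polarity are hypothetical choices, and the model says nothing about the transistor-level circuit of Figure 2.

```python
class MuxLatch4:
    """Behavioral sketch of a level-sensitive latch with a built-in 4-way mux.

    Assumptions (not from the paper): the latch is transparent while the
    gated clock is high, and `select` is a one-hot list of four bits.
    """

    def __init__(self, initial=0):
        self.q = initial

    def step(self, clk, enable, select, inputs):
        if len(select) != 4 or len(inputs) != 4:
            raise ValueError("expected four select bits and four inputs")
        # Conditional clocking: the latch only sees the clock when enabled,
        # so a disabled latch does not toggle its internal nodes.
        gated_clk = bool(clk) and bool(enable)
        if gated_clk:
            if sum(1 for s in select if s) != 1:
                raise ValueError("select must be one-hot")
            # Mux and latch in one stage: the selected input goes straight
            # to the storage node, with no separate mux gate in front.
            self.q = inputs[[bool(s) for s in select].index(True)]
        return self.q


if __name__ == "__main__":
    latch = MuxLatch4()
    data = [5, 6, 7, 8]
    print(latch.step(1, 1, [0, 0, 1, 0], data))  # transparent: takes input 2
    print(latch.step(0, 1, [1, 0, 0, 0], data))  # clock low: holds value
    print(latch.step(1, 0, [1, 0, 0, 0], data))  # clock gated off: holds value
```

In a bypass network, the four inputs would correspond to the candidate operand sources, so the selection costs no extra gate delay beyond the latch itself.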
To reduce power consumption during non-critical times of operation, three modes of operation are supported: normal mode, idle mode, and sleep mode. In idle mode, the PLLs continue operating, but the main internal clock grid (that drives the core functional units) is stopped. To save power and minimize leakage in sleep mode, most of the internal power supply is turned off. However, the oscillators, power management unit, real-time counter, and some pad circuitry need to remain powered up. Because of this, three separate, internal, 1.5V, VDD grids are supported (two that stay on indefinitely, and one that is shut off in sleep mode). In addition, a 3.3V VDD grid is supported for I/O circuits. Custom voltage shifters are used for signals passing between voltage domains. A typical voltage shifter can be seen in Figure 3.
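The relationship between operating modes and supply grids can be summarized as a small lookup. The grid names below are hypothetical labels invented for this sketch, and treating the 3.3V I/O grid as powered in sleep mode is an assumption based on the statement that the external supply is always biased at 3.3V.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    IDLE = "idle"
    SLEEP = "sleep"


# Hypothetical grid names; the text only says there are three internal 1.5V
# grids (two always on, one switched off in sleep) plus a 3.3V I/O grid.
GRIDS = {
    "vdd_core_switched_1v5": {"volts": 1.5, "off_in_sleep": True},
    "vdd_always_on_a_1v5": {"volts": 1.5, "off_in_sleep": False},
    "vdd_always_on_b_1v5": {"volts": 1.5, "off_in_sleep": False},
    # Assumption: the I/O grid stays powered because the external supply
    # is always biased at 3.3V.
    "vdd_io_3v3": {"volts": 3.3, "off_in_sleep": False},
}


def powered_grids(mode):
    """Return the set of grid names that are powered in the given mode."""
    return {
        name
        for name, grid in GRIDS.items()
        if not (mode is Mode.SLEEP and grid["off_in_sleep"])
    }


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, sorted(powered_grids(mode)))
```

Note that idle mode leaves every grid powered; per the text, it saves power by stopping the main clock grid rather than by removing a supply.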
As can be seen in Figure 4, there are several separate clocking domains, clocking H-trees, and global clock wires on the die. The Dclk grid drives the main core and caches of the microprocessor. It is generated from an on-board PLL. The Dclk grid runs at full speed, but can be switched to the Rclk frequency during memory I/O operations. The Dclk grid does not run during idle mode. The Rclk grid is used mostly by the peripheral, pad, and non-core logic. The Rclk frequency is a divide-by-2 of the Dclk frequency and runs continuously in idle mode. While software can change the speed of the main clock grids (Dclk and Rclk), some of the peripheral units require fixed-frequency clocks regardless of the chosen Dclk and Rclk frequencies. To meet this requirement, both a 7.36MHz clock wire (SCLK) and a 48MHz clock wire (TCLK) are provided. The 7.36MHz frequency is derived from the same PLL that generates Dclk, while the 48MHz clock is driven by a second PLL, asynchronous with the first PLL. A fourth internal clocking wire (KCLK, running at 32kHz) drives various system components that need to be running during sleep. Because of the <50µA power consumption requirement during sleep, only the 32kHz oscillator and 32kHz clock grid are driven during sleep mode; moreover, their distribution and use are limited. During sleep, all other clocks are turned off.
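The clocking rules described above can be collected into one function that reports each clock's frequency for a given mode. This is a sketch only: the function name, the string mode labels, and the `memory_io` flag are hypothetical, and the assumption that SCLK and TCLK keep running in idle mode is inferred from the statement that the PLLs continue operating in that mode.

```python
SCLK_HZ = 7_360_000   # fixed 7.36MHz wire, from the same PLL as Dclk
TCLK_HZ = 48_000_000  # fixed 48MHz wire, from the second PLL
KCLK_HZ = 32_000      # 32kHz wire, quoted as "32kHz" in the text


def clock_frequencies(mode, dclk_hz, memory_io=False):
    """Return a dict of clock name -> frequency in Hz (0 means stopped).

    `mode` is one of "normal", "idle", "sleep". `dclk_hz` is the
    software-selected full-speed Dclk frequency.
    """
    if mode not in ("normal", "idle", "sleep"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "sleep":
        # Only the 32kHz oscillator and clock grid are driven during sleep.
        return {"Dclk": 0, "Rclk": 0, "SCLK": 0, "TCLK": 0, "KCLK": KCLK_HZ}
    rclk_hz = dclk_hz // 2  # Rclk is a divide-by-2 of Dclk
    if mode == "idle":
        dclk_now = 0  # main core clock grid is stopped in idle mode
    else:
        # Dclk drops to the Rclk frequency during memory I/O operations.
        dclk_now = rclk_hz if memory_io else dclk_hz
    # Assumption: the fixed-frequency wires keep running in idle mode
    # because the PLLs continue operating.
    return {"Dclk": dclk_now, "Rclk": rclk_hz, "SCLK": SCLK_HZ,
            "TCLK": TCLK_HZ, "KCLK": KCLK_HZ}


if __name__ == "__main__":
    for mode in ("normal", "idle", "sleep"):
        print(mode, clock_frequencies(mode, 200_000_000))
```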
The chip features two first-level data caches. The main first-level data cache is an 8kB, 32-way set associative, virtually-addressed cache. The other first-level data cache is a 0.5kB, 32-way set associative, virtually-addressed cache. During cache fill time, a bit, stored in the TB descriptor for the block being filled, is used to steer the data into either the 0.5kB cache or the 8kB cache. The caches are implemented as demand fill. As such, subsequent loads and stores are guaranteed to hit in only one of the two data caches. Both data caches can be read or written in a single clock cycle. The second, smaller data cache can be used to store large data structures (typically found in handwriting and voice recognition algorithms) that do not have extreme bandwidth requirements, but would normally exceed the size of the 8kB data cache and cause data already in the cache to be cast out. Thus, the 0.5kB cache allows prefetching, burst loads and stores, and write merging of large data structures without thrashing the 8kB data cache.
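The steering behavior can be illustrated with a toy software model. This is a simplified sketch, not the chip's implementation: the 32-byte line size, the fully associative organization, and the LRU replacement policy are all assumptions made to keep the model short (the text specifies 32-way set associativity and does not state a line size or replacement policy). The `use_mini` argument stands in for the steering bit held in the TB descriptor.

```python
from collections import OrderedDict

LINE_BYTES = 32  # assumption: line size is not given in the text


class TinyCache:
    """Toy fully associative LRU cache (a simplification of 32-way sets)."""

    def __init__(self, size_bytes):
        self.capacity = size_bytes // LINE_BYTES
        self.lines = OrderedDict()

    def lookup(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        return False

    def fill(self, line):
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # cast out least recently used
        self.lines[line] = True


class SplitDataCache:
    """An 8kB main cache plus a 0.5kB cache, filled on demand."""

    def __init__(self):
        self.main = TinyCache(8 * 1024)
        self.mini = TinyCache(512)

    def access(self, addr, use_mini):
        line = addr // LINE_BYTES
        if self.main.lookup(line):
            return "hit-main"
        if self.mini.lookup(line):
            return "hit-mini"
        # Demand fill: the steering bit picks exactly one cache, so a
        # line is never resident in both caches at once.
        (self.mini if use_mini else self.main).fill(line)
        return "fill-mini" if use_mini else "fill-main"


if __name__ == "__main__":
    cache = SplitDataCache()
    for addr in range(0, 8 * 1024, LINE_BYTES):        # working set
        cache.access(addr, use_mini=False)
    for addr in range(1 << 20, (1 << 20) + 64 * 1024, LINE_BYTES):
        cache.access(addr, use_mini=True)              # large streamed data
    hits = sum(cache.access(a, False) == "hit-main"
               for a in range(0, 8 * 1024, LINE_BYTES))
    print(f"working-set lines still in main cache: {hits}")
```

In this model, streaming a 64kB structure through the small cache leaves the 8kB working set untouched, which is the thrash-avoidance effect the paragraph describes.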
DIGEST OF TECHNICAL PAPERS

