. PC system using the MPC105 and PowerPC processor.
The major product goals for our design included supporting the data transfer between the PowerPC processor, memory, and PCI interfaces; providing controls for off-chip secondary cache and memory systems; assuring PowerPC reference platform specification compatibility and system-configuration flexibility; reducing silicon and packaging costs; and optimizing performance.
Architecture
By design, the MPC105 architecture optimizes data transfer between the PowerPC processor, memory, and PCI interfaces. In addition to the processor, cache, memory, and PCI control blocks, the design has six types of address buffers and five types of associated data buffers (Figure 2). Data transfers between the processor and memory, with the exception of snoop copy-backs, occur directly on the data bus, so these transactions require no data buffering; only the address buffer, labeled 1 in Figure 2, is needed. All transactions entering the MPC105 have their addresses stored in buffers, which allows the MPC105 to snoop all transactions attempting to go through the device. This is important, for example, with a posted write operation that may execute out of order with other transactions. A central control-logic block controls the buffers. It compares the address of an incoming transaction against the buffered addresses, arbitrates for the shared data bus between the processor and memory, and controls the dataflow between the processor, PCI, and memory interfaces.
Processor and secondary cache-control block.
The processor control interface maintains cache coherency by handling the processor transactions and performing snoop operations for a PCI-to-memory read or write operation. For multiprocessing and interrupt capabilities, the MPC105 supports the PowerPC 60X bus protocol. 4 This interface also provides processor bus arbitration between the processor(s) and various internal bus requests from the secondary cache and snoop control. The interface supports one level of address pipelining, address and data bus parking, and various processor bus protocols. 4 Address pipelining significantly improves data throughput by allowing the chip to decode a new address while the current data transaction finishes.
To boost memory performance, the MPC105 supports external tag and data RAMs for the secondary cache. Figure 3 illustrates a possible secondary cache implementation. This look-aside cache operates in write-through or write-back modes. The chip's secondary cache-control interface supports 256-Kbyte, 512-Kbyte, and 1-Mbyte secondary caches, and up to 4 Gbytes of cacheable space. It supports either asynchronous or synchronous SRAM with 32-byte line sizes and coherence granularity. Figure 3 shows a write-back secondary cache designed with an external tag RAM, dirty RAM, and synchronous burst SRAM as the data RAM. A secondary cache access uses the low-order address bits to search the tag RAM. If the tag RAM data matches the tag address inputs, indicating the presence of the requested data, the cache asserts the hit signal to the MPC105. The dirty signal indicates the state of the secondary cache line. If the cache line data has been modified with respect to memory, the cache asserts the dirty signal to the MPC105. Our device provides programmable cache interface timing parameters to suit various systems' requirements.
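The tag lookup described above can be sketched as follows. This is a minimal model, not the MPC105 logic: it assumes a direct-mapped organization (the text says only that the low-order address bits search the tag RAM), and the names `split_address`, `lookup`, `tag_ram`, and `dirty_ram` are hypothetical.

```python
LINE_BYTES = 32  # line size stated in the text


def split_address(addr, cache_bytes):
    """Split a physical address into (tag, index, offset).

    Assumes a direct-mapped cache; the source does not state the mapping.
    """
    num_lines = cache_bytes // LINE_BYTES
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % num_lines  # low-order bits select the tag RAM entry
    tag = addr // cache_bytes                 # remaining high-order bits
    return tag, index, offset


def lookup(tag_ram, dirty_ram, addr, cache_bytes):
    """Return (hit, dirty) for one access.

    tag_ram and dirty_ram are plain dicts keyed by line index, standing in
    for the external tag and dirty RAMs.
    """
    tag, index, _ = split_address(addr, cache_bytes)
    hit = tag_ram.get(index) == tag
    dirty = dirty_ram.get(index, False)  # state of the indexed line
    return hit, dirty
```

With a 256-Kbyte cache, addresses 0x00040020 and 0x00080020 select the same tag RAM entry but carry different tags, so at most one of them can hit at a time.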
Memory control block.
Our chip supports a wide variety of DRAM or synchronous DRAM (SDRAM) configurations. As Figure 4 shows, the memory organization has eight row-address-strobe (RAS) signals, supporting eight memory banks. Each bank is either 8 bytes wide to support a 64-bit-wide data bus or 4 bytes wide to support a 32-bit-wide data bus. The memory organization also has eight column-address-strobe (CAS) signals used for byte selection. We can build these banks using either DRAM or SIMM devices. For example, a system designed with eight SIMM slots allows 256 Mbytes of memory using 32-Mbyte, 36-bit-wide SIMM modules. Twelve address outputs provide the multiplexed row and column address, supporting memory densities from 256 Kbits to 16 Mbits. A fully populated 64-bit implementation allows 1 Gbyte of memory.
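The capacity figures above can be checked with a little arithmetic. The sketch below assumes that all twelve address outputs are used for both the row and the column address, and that every bank is fully populated; the function name `max_memory_bytes` is hypothetical.

```python
def max_memory_bytes(addr_pins=12, bank_width_bytes=8, banks=8):
    """Upper bound on addressable memory for the stated organization."""
    # Row and column addresses are multiplexed on the same pins,
    # so each bank can address 2**(2 * addr_pins) locations.
    locations_per_bank = 2 ** (2 * addr_pins)
    return locations_per_bank * bank_width_bytes * banks


GBYTE = 2 ** 30
MBYTE = 2 ** 20

full_64_bit = max_memory_bytes()                    # 64-bit data bus
full_32_bit = max_memory_bytes(bank_width_bytes=4)  # 32-bit data bus
eight_simms = 8 * 32 * MBYTE                        # eight 32-Mbyte SIMMs
```

Under these assumptions the 64-bit case works out to exactly 1 Gbyte, matching the figure in the text, and the eight-SIMM example comes to 256 Mbytes.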
The memory control interface can also support JEDEC-compliant SDRAM, which several vendors are beginning to supply. The MPC105 completely manages refresh cycles. To reduce instantaneous current consumption, the chip bank-staggers CAS before RAS (CBR) refresh cycles. It also supports self-refreshing memories in power-saving configurations, checks parity for memory reads, and generates parity for posted writes to the memory. All DRAM and SDRAM interface timing parameters are completely programmable to accommodate a wide range of system operating frequencies.
Our design can implement two banks of ROM as boot memory, with a 16-Mbyte maximum boot code space. The ROM data path must match the DRAM or SDRAM data path in bit width. The MPC105 supports burst-mode ROMs, or it can implement one bank of flash ROM as boot memory, with a maximum size of 1 Mbyte. Output-enable and write-enable signals allow writes to this flash memory. The flash ROM data-path width is 8 bits. To allow multiword reads, the MPC105 provides byte gathering from flash ROM. System software must byte-orient writes to flash ROM.
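Byte gathering from the 8-bit flash ROM can be pictured as assembling consecutive byte reads into one wider word. The sketch below is illustrative only: it assumes big-endian ordering (lowest address in the most significant byte), and `gather_doubleword` and `read_byte` are hypothetical names.

```python
def gather_doubleword(read_byte, base_addr, nbytes=8):
    """Assemble nbytes consecutive byte reads into one integer.

    read_byte is any callable that returns the byte at a given address.
    Big-endian byte order is an assumption made for this sketch.
    """
    value = 0
    for i in range(nbytes):
        value = (value << 8) | (read_byte(base_addr + i) & 0xFF)
    return value


# Usage: a bytes object stands in for the flash ROM contents.
flash = bytes(range(0x10, 0x18))
word = gather_doubleword(lambda a: flash[a], 0)
```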
PCI control block.
The PCI interface supports a 32-bit multiplexed address and data bus that operates between 20 and 33 MHz. A phase-locked loop synchronizes the processor/memory bus, which runs at either the same speed as the PCI bus or twice that speed, to the PCI bus. This clocking scheme allows the MPC105 to interface with PowerPC microprocessors operating at 20- to 66-MHz processor bus frequencies. Our device's PCI interface also provides seamless connections for running in either little- or big-endian configurations, and for interfacing between the processor's 64-bit data bus and the PCI 32-bit data bus. The PCI drivers are 3.3V PCI compliant and 5V tolerant.
Copy-back buffer.
In our device, a 32-byte copy-back buffer temporarily stores cast-out data from the secondary cache or copy-back data from the primary cache. This temporary storage allows the MPC105 to read data from the main memory to the processor and secondary cache if it encounters a cache miss before writing the cast-out data to memory. This feature allows the device to return the read data to the processor earlier than if it had to write the cast-out data to memory. The copy-back buffer also stores primary cache copy-back data if a PCI master read from system memory hits modified data in the primary cache. Data from the copy-back buffer goes to the PCI bus as soon as it enters the MPC105 and does not wait for the whole cache line to load.
PCI-to-memory buffers.
The MPC105 design has three PCI-to-system-memory-transaction buffers, one for PCI reads and two for PCI writes to memory. The two write buffers each hold up to one cache line, or 32 bytes. With two such buffers, the device can write data from PCI into one while writing the previously written data from the other into the memory. These buffers can gather data from separate PCI writes to memory if the data is for the same cache line.
Before writing the data to memory, the device snoops every PCI write against the primary and secondary caches. If a snoop hit occurs, the snoop copy-back data merges into the write buffers in all bytes not written by the PCI master. Once the processor/memory data bus becomes available and the buffer flush is the highest priority transaction, the chip writes the data into memory.
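The gathering and snoop-merge behavior of one PCI write buffer can be modeled with a per-byte valid mask. This is a behavioral sketch under assumptions, not the device's implementation: the class and method names are hypothetical, and writes that cross a line boundary are simply rejected.

```python
LINE_BYTES = 32  # one cache line, as stated in the text


class PciWriteBuffer:
    """Toy model of one 32-byte PCI-to-memory write buffer."""

    def __init__(self):
        self.line_addr = None
        self.data = bytearray(LINE_BYTES)
        self.valid = [False] * LINE_BYTES  # True where the PCI master wrote

    def gather(self, addr, payload):
        """Accept a PCI write if it falls in the buffered line."""
        offset = addr % LINE_BYTES
        line = addr - offset
        if offset + len(payload) > LINE_BYTES:
            return False  # line-crossing writes are not modeled here
        if self.line_addr is None:
            self.line_addr = line
        elif self.line_addr != line:
            return False  # different cache line: needs the other buffer
        for i, byte in enumerate(payload):
            self.data[offset + i] = byte
            self.valid[offset + i] = True
        return True

    def merge_copyback(self, line_data):
        """On a snoop hit, fill only the bytes the PCI master did not write."""
        for i in range(LINE_BYTES):
            if not self.valid[i]:
                self.data[i] = line_data[i]
                self.valid[i] = True


buf = PciWriteBuffer()
first = buf.gather(0x1004, b"\xAA\xBB")
second = buf.gather(0x1010, b"\xCC")   # same line, so it is gathered
other = buf.gather(0x2000, b"\xDD")    # different line, so it is refused
buf.merge_copyback(bytes([0x11] * LINE_BYTES))
```

After the merge, the buffer holds a full line in which the PCI master's bytes take precedence over the copy-back data, which is the property the text describes.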
The 32-byte PCI-read-from-system-memory buffer stores data from memory if there is no snoop hit for the read. Otherwise, it stores data from the secondary cache if there is a snoop hit in the secondary cache. With no snoop hit, the device fetches the requested data from memory, critical doubleword first. Then it fetches the next sequential doubleword of data, continuing until it reaches the end of the cache line or a new PCI-read-from-system-memory transaction occurs that is not in the same cache line as the original. The device forwards the data to the PCI bus as it becomes available; data does not need to wait for the whole cache line to fill.
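The fetch order for a PCI read with no snoop hit can be written out explicitly. The sketch assumes 8-byte doublewords and that the fetch stops at the end of the line without wrapping, as the text indicates; `fetch_order` is a hypothetical name.

```python
LINE_BYTES = 32
DWORD_BYTES = 8


def fetch_order(addr):
    """Doubleword addresses fetched for a read starting at addr.

    Starts at the critical (requested) doubleword and runs sequentially
    to the end of the cache line, with no wrap-around.
    """
    first = addr - addr % DWORD_BYTES
    line_end = addr - addr % LINE_BYTES + LINE_BYTES
    return list(range(first, line_end, DWORD_BYTES))
```

A read that starts in the third doubleword of a line therefore fetches only two doublewords, while a read that starts at the line boundary fetches all four.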
The MPC105 also implements a speculative read feature that, if enabled, will snoop the next sequential cache line address when the current PCI read accesses the third doubleword of the cache line. Once the snoop response is known and the read has finished, the device will fetch the data at the speculative address and load it into the buffer in anticipation of the same PCI master requesting it. In a typical system, the speculative feature will allow streaming of large amounts of data with very minimal delay on the PCI bus, enabling the system to use the full bandwidth of the PCI bus.
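The trigger condition for the speculative read can be expressed in a few lines. This sketch covers only the address calculation, under the assumption that "third doubleword" means doubleword index 2 of a 32-byte line; it does not model the snoop response or the fetch itself, and `speculative_snoop_address` is a hypothetical name.

```python
LINE_BYTES = 32
DWORD_BYTES = 8


def speculative_snoop_address(addr, enabled=True):
    """Return the next line address to snoop, or None if no snoop starts.

    A snoop starts only when the feature is enabled and the current read
    touches the third doubleword (index 2) of its cache line.
    """
    if not enabled:
        return None
    if (addr % LINE_BYTES) // DWORD_BYTES != 2:
        return None
    return addr - addr % LINE_BYTES + LINE_BYTES
```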
Processor-to-PCI buffers.
The MPC105 has three processor-to-PCI-transaction buffers, two for processor-to-PCI writes and one for reads from PCI. The two processor-to-PCI-write buffers are each 16 bytes, or one cache line total. Each posted write buffer can data-gather single-beat writes to PCI within the 16 bytes of the upper or lower half of the cache line. Data gathering is only possible for writes to PCI memory address space, not PCI I/O space, and will continue until the buffer is scheduled to flush or the processor issues a synchronization transaction. This data-gathering feature of the posted write buffers benefits high-performance graphic frame buffer operations in which software generates a whole sequence of consecutive single-beat writes. To improve performance, the two posted write buffers isolate the processor/memory bus from the slower PCI bus on processor-to-PCI writes.
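The gathering rule for the posted write buffers reduces to two checks: the write must target PCI memory space, and it must fall in the same 16-byte half line as the data already buffered. The sketch below encodes only those two checks; the flush and synchronization conditions are not modeled, and `can_gather` and the `space` strings are hypothetical.

```python
HALF_LINE_BYTES = 16  # each posted write buffer holds half a cache line


def can_gather(buffered_addr, new_addr, space):
    """True if a new single-beat write may join the buffered write.

    space is "memory" or "io"; only PCI memory space allows gathering.
    """
    if space != "memory":
        return False
    return buffered_addr // HALF_LINE_BYTES == new_addr // HALF_LINE_BYTES
```

Consecutive frame-buffer writes such as 0x80000000, 0x80000004, 0x80000008, and 0x8000000C would all satisfy this check and leave the processor bus as a single gathered transfer.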
The processor-read-from-PCI buffer is 32 bytes. The data from the PCI bus latches into the PCI read buffer. Once all requested PCI data has latched, the data goes to the processor on the processor/memory data bus. Holding the data until the device has fetched all of it allows for PCI disconnections without locking up the processor bus.
The MPC105 performance box shows an example of the performance this device can achieve.
Packaging technology
Using the C4 package technology 5 in conjunction with a ceramic ball-grid array (CBGA) substrate enabled us to implement the MPC105 in a single chip, rather than in the more common set of several chips. The C4 packaging provides 358 pin-connection sites to a die smaller than 39 mm², significantly reducing costs compared to any pad-limited wire-bond technology. The C4/CBGA combination reduces parasitic package inductance by 61 percent, from 15 nH in a standard wire-bonded quad flat pack (QFP) to 5.8 nH, thereby reducing the switching noise induced on power supply lines.
In addition, C4 technology provides 111 power and ground pins to reduce power-supply noise. Also, CBGA technology significantly reduces footprint area. A standard 304-pin QFP typically measures 43.2×43.2 mm² with a lead pitch of 0.5 mm. Using the 21×25 mm² CBGA with a 1.27-mm ball pitch reduces the circuit board footprint by 72 percent. To get similar high I/O counts, a QFP with peripheral leads would need pitches of 0.3 to 0.5 mm, so fine they are difficult to handle.
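Both packaging percentages follow directly from the quoted numbers, as this short check shows. The helper name `percent_reduction` is hypothetical; the inputs are taken from the text.

```python
def percent_reduction(before, after):
    """Reduction from before to after, as a percentage of before."""
    return 100.0 * (before - after) / before


# Package inductance: 15 nH (wire-bonded QFP) versus 5.8 nH (C4/CBGA).
inductance_cut = percent_reduction(15.0, 5.8)

# Board footprint: 43.2 x 43.2 mm QFP versus 21 x 25 mm CBGA.
footprint_cut = percent_reduction(43.2 * 43.2, 21 * 25)
```

Rounded to whole percentages, these give the 61 percent and 72 percent reductions stated above.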
The CBGA also improves circuit-board manufacturability by reducing opens and shorts and device alignment errors. The MPC105 implements the IEEE 1149.1 Boundary Scan Test Methodology for checking electrical connections between the package and the board.
Power management
Figure 5 shows the typical active-mode current measured in the 2:1 mode. The typical power at a PCI bus frequency of 33 MHz and processor/memory bus frequency of 66 MHz is 0.83W. We estimate the worst-case power dissipation, assuming a fully loaded PCI bus with 150-pF loading and 100-percent duty cycle, at less than 2W. The MPC105 implements four power-saving modes to provide various levels of power savings. The device enters doze, nap, and sleep modes when software sets the appropriate control bits in its configuration register. These power-saving modes are fully compatible with PowerPC microprocessors. 2 Users activate the fourth mode, suspend, by asserting a dedicated pin. In suspend, the device performs memory refreshes either by using the self-refresh mode for DRAMs or an external clock as the refresh time base. In suspend, the MPC105 may shut down its PLL for additional power savings. To further minimize power dissipation when the system is inactive, system software may copy the memory onto a disk before entering the suspend mode.
Physical implementation
As the photomicrograph of the MPC105 in Figure 6 shows, there are 247 I/O drivers located on the periphery of the chip. Using a fourth metal-interconnection layer, the I/O drivers connect to the C4 bumps that cover the entire chip. The PLL located at the center generates the internal clock from a system clock running at the PCI bus frequency. The internal clock can run either at the same frequency as the system clock or at twice the frequency of the input system clock. An H-tree clock distribution network minimizes the clock skew to less than 500 ps across the chip. This clocking scheme allows the device to interface with a PowerPC microprocessor operating with a processor bus frequency of 66 MHz.
Figure 6. Photomicrograph of the MPC105.
Conclusion
Fabricated using a 0.5-µm, four-level-metal CMOS technology, the 39-mm² MPC105 die contains over 250,000 devices. It operates at a maximum processor/memory bus frequency of 66 MHz with 1W typical power dissipation. By employing a 100-percent level-sensitive scan design (LSSD), the device offers 98 percent test coverage and full IEEE Std 1149.1 compliance. With this PCI bridge chip, designers can use PowerPC microprocessors to quickly design and develop systems.
The authors were all members of the MPC105 design team at the Somerset Design Center in Austin, Texas. Karl Wang was the design manager supervising the design of the central control unit, data-path control, and integration. He designed the PLL, clock distribution network and I/O drivers. Chris Bryant worked on the PCI control block. Tom Elmer designed the memory control block. Michael Garcia was a technical design manager supervising the design of the PCI and memory control block, and was involved in the logic synthesis and timing. C.S. Hui is a senior engineer responsible for defining and implementing the design-for-testability JTAG controller, clocking scheme, ROM/flash interface, power management controller, and the error-report mechanism for the device. 
MPC105 performance
Table A shows an example of the system performance the MPC105 can achieve. The table gives the performance in terms of the number of clock cycles as seen by the master, in a system operating in a 2:1 clock mode where the processor/memory bus operates at 66 MHz and the PCI bus at 33 MHz. Using 9-ns synchronous burst SRAMs, a processor burst read from secondary cache takes place with a data rate of 3-1-1-1, or a latency of three cycles to the first beat of data and a throughput of one cycle for each of the other three beats. In a pipelined burst read from secondary cache, two burst reads of data take place back to back with no wait states between the bursts. A burst write to secondary cache takes place at a burst rate of 3-1-1-1. The secondary cache built with 15-ns asynchronous SRAMs performs at 3-2-2-2 for a burst read and 4-2-2-2 for a burst write.
In a system designed with 60-ns DRAMs, the memory performance for both a burst read and burst write is 8-3-3-3. If the system memory has 66-MHz SDRAMs, the performance is 8-1-1-1 for a burst read and 5-1-1-1 for a burst write. A PCI master can read from system memory at a data transfer rate of 9-1-1-1. If the PCI-to-memory transaction hits the prefetch buffer, the data transfer rate becomes 4-1-1-1.
For a large block-data transfer such as in a direct memory access operation, the effective data transfer rate on the PCI bus is 96 Mbytes/s. A PCI master can write to system memory at a data rate of 2-1-1-1, corresponding to an effective data rate of 119 Mbytes/s. The effective data transfer rates are less than the peak PCI data transfer rate of 132 Mbytes/s because the MPC105 only transfers a maximum of 32 bytes of data per PCI transaction. A single-beat processor-to-PCI read takes 12 cycles, while a single-beat write takes five.
Table A. Key system performance assuming 66-MHz processor/memory bus, 33-MHz PCI bus, 60-ns DRAM, and 9-ns synchronous burst SRAM.
Transaction: Number of cycles
Processor-to-memory
Burst read from L2: 3-1-1-1
Pipelined burst-read from L2: 3
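The effective PCI rates quoted above can be approximated with a simple cycle count. The model below is an assumption rather than the authors' method: it treats each 32-byte transaction as eight 4-byte beats, charges the quoted first-beat latency once and one cycle for every later beat, and uses a 33-MHz clock. The function name `effective_mbytes_per_s` is hypothetical.

```python
def effective_mbytes_per_s(first_beat_cycles, pci_mhz=33,
                           line_bytes=32, beat_bytes=4):
    """Approximate data rate for one 32-byte PCI transaction."""
    beats = line_bytes // beat_bytes
    cycles = first_beat_cycles + (beats - 1)  # one cycle per later beat
    return line_bytes * pci_mhz / cycles


peak = 4 * 33                                    # one 4-byte beat every cycle
read_prefetch_hit = effective_mbytes_per_s(4)    # 4-1-1-1 pattern
posted_write = effective_mbytes_per_s(2)         # 2-1-1-1 pattern
```

Under these assumptions the prefetch-hit read comes to 96 Mbytes/s, matching the figure in the text, while the write comes to roughly 117 Mbytes/s, close to but not exactly the quoted 119 Mbytes/s. The small gap suggests the published figure counts cycles or clock frequency slightly differently from this sketch.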
