Abstract-A 2.5-V, 72-Mbit DRAM based on packet protocol has been developed using 1) a rotated hierarchical I/O architecture to reduce power noise and to minimize the chip-size penalty associated with an 8-bit prefetch architecture implemented with 16 internal banks and 144 I/O lines, 2) a delay-locked-loop circuit using a high-speed and small-swing differential clock to achieve the peak bandwidth of 2.0 GByte/s in a single chip with low noise sensitivity, and 3) a flexible column redundancy scheme to efficiently increase redundancy coverage using a shifted I/O line scheme for multibank architecture.
I. INTRODUCTION
AS THE advance of multimedia PC, high-end workstation, and server applications moves swiftly, a high throughput from the memory component is in great demand in order to achieve high resolution and deep color depth in real-time three-dimensional graphic images and multitasking operations. Several effective approaches exist for realizing high performance:
1) high-speed I/O circuit techniques for data communication to achieve high bandwidth [1], [2];
2) packet protocol communication methods to take advantage of the reduced pin counts and trace-length variations in the system as well as small granularity [3];
3) row/column pipelining protocol and multibank architecture to increase the effective bandwidth by increasing random parallelism [3], [4].
To achieve these high performance goals, however, several challenges regarding the chip size, access latency, I/O circuit design, power consumption, and optimal package design must be resolved. This paper describes a 2.5-V, 72-Mbit DRAM based on packet protocol with a 16-bank architecture achieving a peak bandwidth of 2.0 GByte/s. It uses both edges of a 500-MHz clock and an 18-bit organization while maintaining a maximum power consumption of 1.68 W (chip power of 1.43 W and average 18 I/O switching power of 0.25 W at V and C), even in the case of a four-bank interleaving "read" operation with a cycle time of 64 ns and a column cycle time of 8 ns. This chip features:
1) a rotated hierarchical I/O architecture to reduce power noise and to minimize the chip-size penalty associated with an 8-bit prefetch architecture implemented with 16 internal banks and 144 I/O lines;
2) a delay-locked loop (DLL) circuit with a high-speed, small-swing differential clock to achieve the peak bandwidth of 2.0 GByte/s in a single chip with low noise sensitivity;
3) a flexible column redundancy scheme to efficiently increase redundancy coverage using a shifted I/O line scheme for multibank architecture.
This paper is organized in the following manner. Section II describes a flipped I/O chip architecture to minimize the chip-size penalty associated with multibank architecture. Using this scheme, a chip-size reduction of 3% is achieved compared to the conventional hierarchical I/O architecture. Section III describes the overall chip operation using packet protocol for a high-speed system.
In Section IV, a power-efficient low-voltage, low-noise DLL with wide locking range is described to achieve a fast access time using a low threshold voltage and differential clocks. Next, Section V presents a flexible column redundancy scheme to efficiently increase redundancy coverage using a shifted I/O line scheme for multibank architecture. In Sections VI and VII, low-power characteristics and the experimental results are described at both the component level and the module level. Section VIII concludes this paper.
II. CHIP ARCHITECTURE
The major concerns in multibit prefetch design for high-speed data transfer and multibank architecture design are the minimization of the chip-size overhead and the power-supply noise. To cope with these challenges, a rotated hierarchical I/O architecture is proposed where column selection lines (CSL, first metal) and I/O lines (second metal) are rotated instead of the conventional stacked hierarchical decoding scheme [4]. and bonding pads is located at the bottom of the chip to efficiently communicate with the high-speed external system. To minimize the chip-size overhead and power noise, the main column decoder is located in the middle of each half of 36-Mbit memory arrays, and the local column decoding circuitry is placed at both edges of a 128-Kbit unit array for 32:1 multiplexing. Fig. 2 shows the comparison of two different chip architectures. This rotated hierarchical I/O scheme reduces the total chip-size overhead associated with 144 I/O lines for an 8-bit prefetching scheme by 3% compared to the conventional hierarchical local-I/O and global-I/O scheme [4] because I/O lines can be routed on the array rather than in conjunction areas (sub-word-line driver areas) as shown in Fig. 2(a). Also, the column power consumption of this architecture can be significantly reduced due to the minimized number of selected CSL lines for activation of 128 (or 144) I/O lines. One of the main advantages of this architecture is the reduction of power noise, since the power bus lines can be placed across the whole memory array and alternated with main I/O lines (144 in total), resulting in increased power bus width and peak array power ( ) and ground ( ) noise of less than 0.3 and 0.15 V, respectively, even for multibank interleaving operation, as shown in Fig. 3.
III. CHIP OPERATION
Despite the latency concern, packet protocol communication is very attractive for high-speed systems because it offers less skew and smaller pin counts in the system. Due to the DRAM's inherent delay between row and column operations for accessing a word line and sensing data ( ) of about 15-20 ns, packet protocol can be used without significantly degrading the access time (only the command transmitting time of a four-clock-cycle packet must be sacrificed). A total of eight command signals are assigned for overall control and are divided into two groups (three for row and five for column) to increase random parallelism by controlling the row and column independently. Each command signal carries four-clock serial data (eight bits, one for each edge of the clock) to minimize the access latency penalty. Fig. 4 shows a protocol example for a random 32-byte data transfer with three consecutive "read" operations followed by one "write" operation. In this case, an effective bandwidth of up to 94% can be achieved with a row access time ( ) of 34 ns and column latency of 16 ns. The packet data are then interpreted in a time-multiplexed manner and control the overall DRAM operation. To minimize the data transceiving time during high-speed operation, an 8-bit data prefetching scheme and an 18-bit (two bits for error correction) I/O organization are implemented. Next, a total of 144 data bits (8 bits × 18 I/O) are stored in the 18 8-bit I/O registers and then transferred to the outside bus channel within a four-clock time frame through an even/odd-type I/O buffer at both the rising and falling edges of the clock. For precise control of data at high-speed operation, the differential clock-controlled DLL circuit (locking frequency range of 250-580 MHz) and tightly controlled parasitics of the I/O circuitry ( , pF, pF, and nH) are used to minimize the skew and propagation delay. Hence, a maximum data-transfer rate of 1.0 Gbps per pin can be achieved at V and C.
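The interface figures quoted in this section can be cross-checked with simple arithmetic. The following Python sketch is only an illustration: the constant and function names are hypothetical, and it assumes that the peak bandwidth counts only the 16 non-error-correction data pins and uses 8-bit bytes.

```python
# Cross-check of the interface figures quoted in the text.
# All names are illustrative; the numeric inputs come from the paper.

CLOCK_MHZ = 500        # external clock frequency
EDGES_PER_CLOCK = 2    # data move on rising and falling edges
IO_PINS = 18           # 18-bit I/O organization
ECC_PINS = 2           # two of the 18 bits are for error correction
PREFETCH_BITS = 8      # 8-bit prefetch per I/O pin
PACKET_CLOCKS = 4      # four-clock time frame per packet


def per_pin_rate_mbps():
    """Data rate per pin in Mbit/s (both clock edges used)."""
    return CLOCK_MHZ * EDGES_PER_CLOCK


def peak_bandwidth_mbyte_per_s():
    """Peak bandwidth in MByte/s, assuming only non-ECC pins carry payload."""
    data_pins = IO_PINS - ECC_PINS
    return per_pin_rate_mbps() * data_pins // 8


def internal_io_lines():
    """Internal I/O lines needed to prefetch one packet for every pin."""
    return PREFETCH_BITS * IO_PINS


def packet_time_ns():
    """Duration of one four-clock packet in nanoseconds."""
    return PACKET_CLOCKS * 1000 / CLOCK_MHZ


def bits_per_pin_per_packet():
    """Serial bits each pin transfers during one packet."""
    return PACKET_CLOCKS * EDGES_PER_CLOCK


if __name__ == "__main__":
    print("per-pin rate (Mbps):", per_pin_rate_mbps())
    print("peak bandwidth (MByte/s):", peak_bandwidth_mbyte_per_s())
    print("internal I/O lines:", internal_io_lines())
    print("packet time (ns):", packet_time_ns())
    print("bits per pin per packet:", bits_per_pin_per_packet())
```

Under these assumptions the eight bits per pin per packet equal the prefetch depth, and the 8-ns packet time equals the column cycle time quoted in the Introduction.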
IV. LOW-VOLTAGE DLL CHARACTERISTICS
The design challenge for this DLL circuit is to reduce timing errors and to correct external clock duty cycle imperfections to less than 3% for 500-MHz operation at V while maintaining a small layout area and power consumption. In this design, the previously published DLL [5] has been improved for low-voltage margin and low noise sensitivity by using a differential clock scheme. To achieve a low-voltage margin down to 2.0 V, a low threshold voltage of about 0.4 V is used for the clock buffer, the charge-pump circuit, and the duty cycle corrector (DCC). Fig. 5 shows the DCC circuit schematics of the previous 3.3-V version and the newly revised 2.5-V version, respectively. The minimum value ( ) for the previous design is determined along the path denoted by a thick line in Fig. 5(a).
The minimum can be expressed as (1), where the transistor terms are the drain-to-source (0.3 V) and gate-to-source (0.7 V) voltages, respectively, and the input signal voltage swing equals 1.0 V. The resulting minimum is approximately 2.6 V.
For the new DCC circuit in Fig. 5(b), the folded cascode scheme is implemented, and the expression is (2), where the input swing can be reduced to 0.5 V using a differential clock scheme and the gate-to-source voltage is about 0.4 V due to the low threshold voltage at the differential input stage. Therefore, the minimum can be reduced to 1.8 V. Noise sensitivity is improved by employing differential clock signals for the input buffer and the DCC circuit. Using this scheme, the uncertainty window of the input buffers can be reduced to 25 ps, which is 0.69 times less than that of the conventional single-ended signal-detection scheme for a 100-mV input signal swing at V and C. Fig. 6 shows the circuit schematic of the DLL, in which a dual-loop control method with phase interpolation techniques [5] is used for efficient locking. The DLL characteristics obtained at V, C, and MHz reveal a data-transfer rate of 1.0 Gbps/pin, as shown in Fig. 7. The measured amplitude of the output clock signal is attenuated by a factor of 20. The peak-to-peak clock jitter (including external source clock jitter of 55 ps), current consumption, and deviation of duty ratio were measured to be 160 ps, 28 mA, and 1.8%, respectively, at V, C, and MHz. A valid data window of about 780 ps is measured at V and C with a low-parasitic chip-scale package having total chip parasitics of pF pF and nH nH.
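The bodies of expressions (1) and (2) do not survive in this text, so the following is only a numerical consistency check, not the printed equations. It assumes that the minimum supply voltage is the sum of three drain-to-source drops, one gate-to-source drop, and the input swing, and that the drain-to-source drop stays at 0.3 V in the new circuit. The coefficient of three and the symbol names are assumptions chosen because they reproduce the quoted 2.6-V and 1.8-V results.

```latex
\begin{align}
V_{\min}^{(a)} &\approx 3V_{DS} + V_{GS} + V_{\text{swing}}
               = 3(0.3) + 0.7 + 1.0 = 2.6\ \text{V} \\
V_{\min}^{(b)} &\approx 3V_{DS} + V_{GS} + V_{\text{swing}}
               = 3(0.3) + 0.4 + 0.5 = 1.8\ \text{V}
\end{align}
```

Under this reading, the 0.8-V reduction comes from a 0.3-V lower gate-to-source voltage (low threshold voltage) and a 0.5-V smaller input swing (differential clock).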
V. FLEXIBLE COLUMN REDUNDANCY SCHEME
As the number of internal banks increases, the yield associated with the redundancy flexibility becomes one of the major concerns. Hence, an area-effective repairing scheme with high flexibility is very important in this rotated I/O architecture with 16 banks and 144 internal I/O lines. Fig. 8 shows the flexible column redundancy scheme implemented in the rotated I/O architecture to minimize the chip-size overhead and achieve high repair coverage. Fig. 8(a)

for memory cell arrays and a small-swing internal I/O scheme are used to reduce power dissipation. Due to the reduced termination voltage, the size- and loading-efficient flipped I/O architecture with a small-swing I/O scheme, and an internal voltage converter, a total power-supply current of 573 mA is measured at V for multibank operations. Therefore, a total maximum chip power of 1.68 W, including the average 18 I/O switching power of 0.25 W, has been achieved when the internal 16 banks are operated in an interleaved manner with 16-ns interval commands at 1.0 Gbps per pin, V, and C.

Fig. 9 shows the shmoo plot of the device ( versus period). As can be seen in this figure, the chip operates properly over a wide range of and a frequency range of up to 1.2 Gbps at V and C. The critical ac characteristics such as data input and output (DQ) setup and hold times ( ) and data output time ( ) are shown in Fig. 10. The measured valid data windows of setup and hold times and of data output time are 780 and 655 ps, respectively, at V with a clock frequency of 500 MHz (total data window of 1.0 ns). In the analysis of factors affecting and margins, effective clock skew ( ps), I/O buffer uncertainty ( ps, ps), internal clock skew of receivers and output buffers within the whole chip ( ps), and tester offset accuracy ( ps, ps) are measured, resulting in data windows for and large enough that the data-transfer rate of 1.0 Gbps can be achieved.

Fig. 12. Microphotograph of the 72-Mbit DRAM achieving a peak bandwidth of 2.0 GByte/s.
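The power and timing figures above can be checked against each other. The Python sketch below is only illustrative: the function names are hypothetical, and the 2.5-V supply is an assumption taken from the nominal voltage in the paper's title, since the measurement voltage is not legible in this text.

```python
# Consistency check of the quoted power budget and data windows.
# ASSUMPTION: the 573-mA supply current was measured at the nominal
# 2.5-V supply; the exact measurement voltage is not given here.

SUPPLY_V = 2.5             # assumed nominal supply voltage (V)
SUPPLY_CURRENT_MA = 573    # measured total power-supply current (mA)
IO_SWITCHING_W = 0.25      # average switching power of the 18 I/Os (W)
BIT_TIME_PS = 1000         # one bit time at 1.0 Gbps per pin (ps)


def chip_power_w():
    """Chip power from supply voltage and measured current."""
    return SUPPLY_V * SUPPLY_CURRENT_MA / 1000


def total_power_w():
    """Chip power plus average I/O switching power."""
    return chip_power_w() + IO_SWITCHING_W


def window_fraction(window_ps, bit_time_ps=BIT_TIME_PS):
    """Fraction of the bit time covered by a valid data window."""
    return window_ps / bit_time_ps


if __name__ == "__main__":
    print("chip power (W):", round(chip_power_w(), 2))
    print("total power (W):", round(total_power_w(), 2))
    print("setup/hold window fraction:", window_fraction(780))
    print("data output window fraction:", window_fraction(655))
```

With the assumed 2.5-V supply, the computed chip power rounds to the quoted 1.43 W and the total to the quoted 1.68 W, which suggests the assumption is reasonable.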
VII. EXPERIMENTAL RESULTS
To achieve the system characteristics, the device is assembled in a high-speed ball grid array package ( -BGA), and a total of eight devices are mounted on a specially designed 8-layer printed circuit-board module with a characteristic impedance of 28 Ω. Fig. 11 shows the module design with eight chip-scale packages and its test results (shmoo plot) at V with a data-transfer rate of 1.0 Gbps.
VIII. CONCLUSION
In summary, a 2.5-V, 72-Mbit packet-protocol-based DRAM achieving a peak bandwidth of 2.0 GByte/s has been developed with a 0.23-μm triple-well, four-poly, two-Al metal CMOS process. An internal of 2.0 V and of 1.8 V with 0.8-V signal swing are used in the array to reduce the sensing power and I/O switching power, respectively. The total maximum chip power consumption of 1.68 W, including the average I/O switching power of 0.25 W, has been achieved when the internal 16 banks are operated in an interleaved manner with 16-ns interval commands at 2.0 GByte/s, V, and C. Fig. 12 shows a microphotograph of the 72-Mbit packet-protocol-based DRAM, and special features are briefly summarized in Table I.

B.-S. Moon was born in Cheju-do, Korea, on April
