I. INTRODUCTION
Technology advancement such as faster processors, multimedia extension algorithms, serial buses, and accelerated graphics ports is driving the need for a memory system with higher bandwidth and larger density. Double-data-rate (DDR) synchronous DRAM (SDRAM) [1], [2] is an appropriate memory solution for systems ranging from multimedia-intensive PCs, high-end workstations, and servers to embedded communications systems such as graphics, cache, and main memory. DDR SDRAM, being an evolutionary architecture from SDRAM, features the doubled data rate at both the rising and falling edges of the clock and the use of the bidirectional data strobe for accurate data fetching to and from the memory controller. Compared to other memory architectures such as Rambus DRAM, which may be considered to be a revolutionary architecture, DDR SDRAM offers the advantages of low cost due to reduced die size, package, and test overheads as well as reduced initial latency and power consumption.
As the density and the scale of integration progress aggressively toward 1-Gbit DRAM technology and beyond [3], [4], design and processing issues pertaining to the large chip size are becoming more prominent. The increased chip size and decreased feature size indicate that the on-chip processing variations will have a much more pronounced effect on chip performance. In addition, both the on-chip and off-chip skews must be carefully controlled to meet the demand for high performance, which has become synonymous with high data bandwidth. Therefore, circuit techniques and schemes that provide precise skew control and tolerance to processing variations should be employed. The challenges in the areas of circuit design, processing, and packaging technologies for gigabit-scale memories are manifold. Intrachip data flight time and skews will grow in proportion to the chip size. Although a high-speed data interface can be achieved using precise clock generators such as a delay-locked loop (DLL) [5], [6], skews due to the long data access path may cause loss of internal timing margins. The fabrication is often at the processing equipment limit, and severe device fluctuation may result in manufacturing loss. A diminished timing margin may be detrimental to pipelined architectures [7]-[9], which have been realized for high bandwidth. Also, because of the long time to market, a standard may be lacking or may change before the standardization process is complete.
Publisher Item Identifier S 0018-9200(99)08340-7.
The objective of this work is to mitigate some of the aforementioned problems in achieving high-performance devices for the future generation. This paper presents a 1-Gb DDR SDRAM featuring an outer DQ and inner control (ODIC) chip with non-ODIC package, cycle-time-adaptive wave pipelining, and a variable-stage analog DLL with a three-input phase detector, which together maintain high performance with acceptable tolerance to the stringent 0.14-µm processing conditions.
II. CHIP ARCHITECTURE
The chip architecture of the 1-Gb SDRAM is shown in Fig. 1(a). It is organized into eight banks. Each independent bank is split into two 64-Mb subarray blocks to improve noise immunity and data bus routing. The data is 16 bits, with 8 bits coming from the left-half block and the other 8 bits coming from the right-half block of the chip. The addresses and controls including the DLL internal clock are located at the chip center. This kind of pad arrangement is called outer DQ and inner control [8]. The wordline and dataline are in hierarchical structures with main and subwordlines and local and global I/O lines.
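As a quick consistency check on the organization described above, the following minimal Python sketch recomputes the total density and data width from the stated bank and subarray figures. The constant and function names are illustrative only and do not come from the design.

```python
# Organization figures taken from the text; names are illustrative.
BANKS = 8                 # independent banks
SUBARRAYS_PER_BANK = 2    # each bank is split into two subarray blocks
SUBARRAY_MBIT = 64        # density of one subarray block, in Mb
DQ_PER_HALF = 8           # data bits supplied by each half of the chip


def total_density_mbit() -> int:
    """Total memory density in Mb."""
    return BANKS * SUBARRAYS_PER_BANK * SUBARRAY_MBIT


def data_width_bits() -> int:
    """Data width: left-half bits plus right-half bits."""
    return 2 * DQ_PER_HALF


print(total_density_mbit(), "Mb,", data_width_bits(), "DQ")
```

Eight banks of two 64-Mb blocks give 1024 Mb, which matches the 1-Gb density, and the two 8-bit halves give the 16-bit data width.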
The column predecoder is shared between two adjacent banks, and the I/O sense amplifier is shared among four banks, as shown in Fig. 1(b). Twenty predecoded lines and related circuits are shared between two banks. There is no speed penalty, and the redundancy flexibility is maintained with the fuse option. Bank information controls the transfer gate as well as the precharging of global I/O lines. There is almost no penalty for the read operation because the signal is small swing. Write-time penalty due to the transfer gate is about 0.15 ns, and the penalty due to the increased loading from global I/O lines is about 0.3 ns. By using these schemes, the chip size is reduced by 0.98% and 0.62%, respectively.
Further layout improvements are made using a dual conjunction scheme, which is shown in Fig. 1(c). The conjunction is the area between the sense-amplifier (SA) block and the subwordline driver block and includes driver circuits such as the subwordline (SWL) enable driver and the sensing enable driver. In the conventional layout, the two types of drivers are placed in every conjunction. Since the SWL enable driver is biased at the boosted voltage and the other driver is biased at a lower internal voltage for the memory core, an additional design rule for well separation must be added. This accounts for 20% of the conjunction length. In the dual conjunction scheme, the two types of drivers are separated into alternating positions, as shown in the figure. This reduces not only the layout area but also the loading for the generator and the susceptibility to coupled noise during sensing operation. With the design rule removed, the driver sizes in the conjunction can be increased for improved driving capability for the core signal lines.
0018-9200/99$10.00 © 1999 IEEE
III. ODIC CHIP WITH NON-ODIC PACKAGE
One way to reduce the skew is to reduce the data flight time. On the component level, internal skews can be reduced by having an ODIC pad arrangement. However, on the module level, a non-ODIC pin arrangement is preferred to reduce skews due to differences in module trace lengths. The non-ODIC package has been the industry standard, and for backward compatibility, such a non-ODIC package may have to be chosen. To be interfaced by this package type, two cases for non-ODIC and ODIC chips may be considered, as shown in Fig. 2. Non-ODIC pads complicate the design because of the nonsymmetric and unequal data path. The ODIC design, in contrast, is simple, and skews can be easily controlled. As the name describes, the ODIC chip with non-ODIC package (OCNOP) combines a non-ODIC package and an ODIC chip using a multilayer chip scale package.
The advantages of this scheme are threefold. First, chip design becomes simple and can be optimized regardless of the package standard. Second, the chip design does not have to be revised even if the package standard changes. Third, data flight time can be small using low-resistance package interconnection. In Fig. 2, signal paths from clock input to data output are shown for two cases in which the on-chip and package interconnection lengths differ in their proportions. In this work, the package interconnection is made of copper, which is 25-µm thick and 10-µm wide. Since the RC time constant for the package interconnection (mΩ/mm, pF/mm) represents over 26 000 times improvement over that for the on-chip metal busing line (Ω/mm, pF/mm), the overall data flight time can be reduced by 24%. In addition, the thick package interconnection layer enhances the thermal characteristics of the component by reducing the active junction temperature during operation.
IV. CYCLE-TIME-ADAPTIVE WAVE PIPELINING
As the chip size and the datapath are increased, the subsequent increase in the memory latency (CL) and the need for fast column bursting access become a critical consideration in the datapath control design. A variety of pipelined architectures have been evaluated for the optimal SDRAM performance [7]-[9]. Wave pipelining with parallel registers [9] is one of the most effective ways to obtain high data throughput with only moderate CL control, as shown in Fig. 3(a). However, variations in the device diminish the valid data window.
In addition to the difference in the best and worst case data delays, the difference between the data and control signals may present even greater difficulty, causing a limitation on the cycle time. Note that there are two quite different delay paths. One is the data-acquisition path, including cell and core operation, and the other is the data latch control path, including periphery operation. The total delay can be expressed using the mean values of its components and their respective variations. The main datapath from the cell array to the data output buffer relies on analog circuit operations such as cell and bitline charge sharing, bitline sensing, and dataline sensing whose speed is sensitive to processing variation. On the other hand, control signals, which derive mostly from RC and inverter delays in the peripheral circuit, are affected less, and the disparity between the two results in deteriorated internal margin and subsequently lower device yield. Fig. 3(b) shows the dispersion of valid data and the control signal windows. For the conventional data path scheme, invalid data may be latched when the variation gets large and the valid data window narrows. Cycle-time-adaptive wave pipelining (CTAWP) can improve this in two ways. First, it allows maximum margin for the data latch signal. Second, this margin can be scaled with the cycle time. The key factor is that the data latch signal comes from the second clock rather than from the first clock. So despite a large variation, a valid data window can always be guaranteed by adjusting the clock cycle time. Fig. 4(a) is the DDR multiplexer serializing a pair of even and odd address data, requiring a multiple data latch circuit for CAS latency control. Fig. 4(b) and (c) show the timing diagrams for the conventional and CTAWP data latch controls, respectively. The path from read command clock to column selection line to valid data (CLK0 to CSL0 to D0) is a part of the data-acquisition path, so the delay may vary.
The difference between the conventional and CTAWP schemes is where the data latch signal DL comes from. In the conventional scheme, it comes from the same command clock, but in the CTAWP scheme, it comes from the next clock. Also, the CSL precharge-to-enable timing gap, which is logically guaranteed, is brought down as a timing gap labeled "G" in Fig. 4(c). Latching of the first valid data (D0) can be accomplished at the predetermined and minimal time "G" before the second data (D1) becomes valid. This guarantees a maximum margin "tCC-G" for the D0 latch, which is also adjustable with the cycle time for optimized coverage of "weak" cells. So the maximum valid data window "F" can be large compared to "E." Since the internal margin is determined by consecutive clock edges, the device becomes suitable for a wide frequency range and hence cycle-time adaptive to process variation. To provide a tuning capability for the critical internal timing margin, the CTAWP is employed in the main datapath, and the timing margin can be scaled with the master clock cycle time.
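The contrast between the two latch schemes can be illustrated with a deliberately simplified timing model in Python. This is a sketch under stated assumptions, not the circuit itself: D0 is taken to be valid from the data-path delay until D1 replaces it one cycle later, the conventional latch fires at a fixed delay from the command clock, and the CTAWP latch is modelled as firing exactly G before D1 becomes valid. All function names and numeric values are hypothetical.

```python
def conventional_latch_ok(data_delay_ns, latch_delay_ns, tcc_ns):
    """Conventional scheme: the latch time is fixed relative to the command
    clock (control path), while D0 is valid from data_delay_ns until D1
    replaces it one cycle later."""
    return data_delay_ns <= latch_delay_ns < data_delay_ns + tcc_ns


def ctawp_latch_margin(tcc_ns, gap_ns):
    """CTAWP: time between D0 becoming valid and the latch, i.e. tCC - G."""
    return tcc_ns - gap_ns


def ctawp_latch_ok(data_delay_ns, tcc_ns, gap_ns):
    """CTAWP (modelled): the latch fires gap_ns before D1 becomes valid,
    so it tracks the data-path delay instead of the control-path delay."""
    latch_time = tcc_ns + data_delay_ns - gap_ns
    return data_delay_ns <= latch_time < data_delay_ns + tcc_ns


# Hypothetical numbers: a 6-ns cycle, a fixed 11-ns conventional latch delay,
# a 1-ns gap G, and a data-path delay that varies with processing.
for delay in (8.0, 10.0, 12.0, 14.0):
    print(delay,
          conventional_latch_ok(delay, latch_delay_ns=11.0, tcc_ns=6.0),
          ctawp_latch_ok(delay, tcc_ns=6.0, gap_ns=1.0))
```

In this model the conventional latch fails once the data delay exceeds the fixed latch delay, whereas the CTAWP margin depends only on tCC and G and therefore grows when the cycle time is lengthened.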
V. VARIABLE-STAGE ANALOG DLL WITH THREE-INPUT PHASE DETECTOR
An analog DLL with variable delay stages is implemented for wide locking frequency, fast locking, and small jitter. A block diagram of the DLL is shown in Fig. 5(a). The control signal is a mixture of analog and digital types, and both the unit delay and the delay stage can be varied. For this kind of hybrid delay line, the total delay can have a coarse but fast tuning capability by varying the number of delay stages and a fine tuning capability by varying the unit cell delay.
The shift register shifts "1" from the right to the left in selecting the number of variable delay stages until the phases between external and internal clocks (CLK1 and CLK2) roughly match. Then the mode switches from digital to analog, and the charge-pump output signal (VCON) controls the selected delay lines for analog tuning of the variable resistive load to eliminate the effects from the process, voltage, and temperature variations. The advantage is fast locking with small jitter.
To minimize jitter, a compensation delay accurately matching the characteristics of the input and output buffers is imperative. The DLL path is provided with compensation delays from replica circuits for the input buffer, output buffer, output multiplexer, and output driver. The replica output driver toggles every cycle for phase comparison with the reference clock, which improves tracking of the DQ drivers' environment. In addition, further jitter optimization is possible by employing on-chip active terminations for replicating stub series terminated logic (SSTL)-interface-like loading conditions. Since the replica circuits closely match those actually used in the main path, the jitter can be kept to a very small value, typically within 20 ps. Fig. 5(b) shows the phase difference between the external clock and internal clock during the locking process. It can be seen that the locking sequence consists of fast and coarse tuning steps followed by fine analog tuning. The mode shifts from digital to analog when the phase between the internal and external clocks reverses. Good jitter, characteristic of an analog DLL, is obtained. Harmonic-free locking under 50 cycles for 66-200 MHz can be obtained, and the recovery after DLL power-down takes only about ten cycles since the locking state before power-down is stored in the register.
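The coarse-then-fine locking sequence can be sketched as a small behavioral model in Python. It is only an illustration under simplifying assumptions: the line is locked when the number of stages times the unit delay equals one clock period (replica delays are ignored), the coarse mode adds one stage per cycle until the phase reverses, and the fine mode applies a proportional correction to the unit delay. The function name, default unit delay, gain, and tolerance are all hypothetical.

```python
def lock_dll(period_ns, unit_delay_ns=0.5, max_stages=64,
             fine_gain=0.5, tol_ns=0.01, max_cycles=200):
    """Return (stages, unit_delay_ns, cycles) once the total line delay
    matches the clock period within tol_ns."""
    stages, cycles = 1, 0

    # Coarse digital mode: add one stage per cycle until the phase reverses,
    # i.e. the line delay first reaches or exceeds the clock period.
    while stages * unit_delay_ns < period_ns:
        if stages == max_stages:
            raise ValueError("period outside the coarse tuning range")
        stages += 1
        cycles += 1

    # Fine analog mode: nudge the unit delay of the selected stages.
    while cycles < max_cycles:
        error = period_ns - stages * unit_delay_ns
        if abs(error) <= tol_ns:
            return stages, unit_delay_ns, cycles
        unit_delay_ns += fine_gain * error / stages
        cycles += 1

    raise RuntimeError("did not lock within max_cycles")


stages, unit, cycles = lock_dll(period_ns=6.7)
print(stages, round(stages * unit, 3), cycles)
```

Because the coarse mode stops at the first phase reversal, the fine mode only has to remove a residual error smaller than one unit delay, which is why the hybrid line can lock quickly over a wide frequency range.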
To prevent a "false" locking with a delay greater than the clock period, a three-input phase detector shown in Fig. 6 is developed. In addition to the typical inputs (CLK1 and CLK2), this detector utilizes the third input (CLKM) from an intermediate tap of the voltage-controlled delay line (VCDL). The delay is correctly increased or decreased for harmonic-free and fast phase locking from the sequence of these three inputs.
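The decision made by the three-input detector follows from the order of rising edges after a CLK1 edge: CLK1-CLKM-CLK2 requests UP (more delay), and CLK1-CLK2-CLKM requests DN (less delay). A minimal Python sketch of that rule is given below. It assumes ideal edges, a CLK1 rising edge at t = 0, and an intermediate tap at the midpoint of the line with a total delay below two clock periods; the function name and the numeric values are hypothetical, and the locked state itself is not modelled.

```python
def three_input_decision(tap_delay_ns, total_delay_ns, period_ns):
    """Return 'UP' or 'DN' from the order of the first CLKM and CLK2 rising
    edges that follow a CLK1 rising edge at t = 0."""
    clkm_edge = tap_delay_ns % period_ns      # first CLKM edge after CLK1
    clk2_edge = total_delay_ns % period_ns    # first CLK2 edge after CLK1
    # CLK1-CLKM-CLK2: line delay is below one period, so increase it.
    # CLK1-CLK2-CLKM: line delay exceeds one period, so decrease it.
    return "UP" if clkm_edge < clk2_edge else "DN"


# Hypothetical 6-ns clock with the tap at the midpoint of the line.
print(three_input_decision(2.0, 4.0, 6.0))   # total delay 4 ns, below one period
print(three_input_decision(4.5, 9.0, 6.0))   # total delay 9 ns, above one period
```

A two-input detector sees the same CLK1-to-CLK2 phase for a 3-ns and a 9-ns line delay at this clock period; the intermediate tap is what lets the detector tell the two apart and steer away from the false lock.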
The state diagram for the three-input phase detector in Fig. 6(a) shows that there are four states and two possible sequences of input rising edges. Fig. 6(b) is a simple flip-flop circuit implementation. Fig. 6(c) and (d) represent the two cases, in which the order of CLK edges is either CLK1-CLKM-CLK2 or CLK1-CLK2-CLKM and the delay for CLK2 should be increased (via UP) or decreased (via DN), respectively, for "correct" phase locking with the total VCDL delay less than the clock cycle time.
VI. FABRICATED DEVICE
A. Measured Results
Fig. 7 shows the measured shmoo data characteristics. Fig. 7(a) shows the data output high and low speed and skews among all 16 DQs, which are terminated with the SSTL_2 interface. DQ pin-to-pin skew is 0.2 ns, and clock-to-data output delay (tAC) with the DLL turned off is 5.7 ns at V and C. In Fig. 7(b), data setup and hold times with respect to the data strobe are 0.5 and 0.2 ns, respectively, and meet the specification for 333-Mbps operation. Fig. 7(c) is the shmoo data of external voltage versus clock cycle time. The test pattern is "read" and "write" at CL of 2.5 and burst length (BL) of four.
Fig. 8(a) illustrates the CL and BL test timing used for the measurement of jitter in the data strobe (DQS) and data output. The expected data and strobe signals are compared with a test strobe signal with an amplitude of VTT 0.1 V. Minimum and maximum transition values for each DQ or DQS edge over a repeated test pattern represent the jitter. Fig. 8(b) shows the measured result at tCC ns for eight test input vectors representing active-read-precharge operation with various sets of data patterns and external voltages. Fig. 8(c) shows the active-probed (Picoprobe 34-A) and sampled (Tektronix 11801B) internal DLL clock histogram with peak-to-peak jitter of 198 ps and rms jitter of 32 ps. Further optimization of DQ power and compensation delay characteristics will be possible for improved jitter.
Fig. 9(a) shows the simulated DDR read waveforms. The SSTL_2 clock is 6 ns with a slew rate of 1 V per 1 ns and peak-to-peak amplitude of 0.7 V. The DLL is turned off, and the DQ and DQS signals are probed at the component pin. Fig. 9(b) shows the measured DDR read waveforms at CL of 2.5 and BL of 4. All 16 DQs are making transitions at the same time for the maximum simultaneously switched output (SSO) noise condition. The outputs are not terminated in this measurement. At V and C, the minimum cycle time is measured to be 6.7 ns. In a 64-bit data bus system, this corresponds to a peak bandwidth of 2.4 GB/s.
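The peak-bandwidth and per-pin data-rate figures quoted for the measured device follow directly from the clock cycle time, since DDR transfers data on both clock edges. The short Python sketch below reproduces that arithmetic for the 6.7-ns minimum cycle time on a 64-bit bus and for a 6-ns clock; the function names are illustrative only.

```python
def per_pin_rate_mbps(cycle_time_ns, transfers_per_cycle=2):
    """Per-pin data rate in Mb/s; DDR moves two bits per pin per clock."""
    return transfers_per_cycle / (cycle_time_ns * 1e-9) / 1e6


def peak_bandwidth_gb_per_s(cycle_time_ns, bus_width_bits, transfers_per_cycle=2):
    """Peak bus bandwidth in GB/s (1 GB = 1e9 bytes)."""
    transfers_per_s = transfers_per_cycle / (cycle_time_ns * 1e-9)
    return transfers_per_s * bus_width_bits / 8 / 1e9


print(round(peak_bandwidth_gb_per_s(6.7, 64), 2))  # 64-bit bus at a 6.7-ns cycle
print(round(per_pin_rate_mbps(6.0)))               # per-pin rate at a 6-ns clock
```

A 6.7-ns cycle gives roughly 2.39 GB/s on a 64-bit bus, consistent with the quoted 2.4 GB/s, and a 6-ns clock gives about 333 Mb/s per pin.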
B. Device Processing Technology
The 1-Gb DDR SDRAM is fabricated using a 0.14-µm triple-well, triple-metal CMOS process with KrF excimer laser lithography. The triple metals include a layer of tungsten for interconnection of internal circuits, a first layer of aluminum for signal lines, and a second layer of aluminum for power lines. The gate lengths are 0.22 and 0.29 µm for n-channel and p-channel transistors, respectively, in the peripheral region. The gate oxide is 6-nm thick. There are 256 memory cells connected to each bitline for a ratio of ten. The number of cells per subwordline is 512. The cell size is 0.181 µm², and the chip size is 349 mm².
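The cell and chip sizes allow a rough estimate of how much of the die is occupied by memory cells. The Python sketch below reads the cell size as 0.181 µm² and the chip size as 349 mm², and assumes exactly 2^30 cells with no redundant or spare cells; the resulting efficiency figure is a derived estimate, not a value reported in the paper, and the names are illustrative.

```python
CELLS = 2**30            # assumption: 1 Gb of cells, redundancy ignored
CELL_AREA_UM2 = 0.181    # cell size in square micrometers
CHIP_AREA_MM2 = 349.0    # chip size in square millimeters


def array_area_mm2() -> float:
    """Total area of the memory cells in mm^2 (1 um^2 = 1e-6 mm^2)."""
    return CELLS * CELL_AREA_UM2 * 1e-6


def cell_efficiency() -> float:
    """Fraction of the chip area occupied by memory cells."""
    return array_area_mm2() / CHIP_AREA_MM2


print(round(array_area_mm2(), 1), "mm^2,", round(100 * cell_efficiency(), 1), "%")
```

Under these assumptions the cells occupy about 194 mm², or a little over half of the die.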
The chip micrograph of the 1-Gb DDR SDRAM is shown in Fig. 10. The scanning electron micrographs of the cell and bitline structures in Fig. 11 show some of the critical features in the fabricated device. It is shown that the cell transistor drain contact is well aligned to the bitline with the self-aligned contact PAD technology [4]. The half-bitline pitch is 147 nm. Shallow trench isolation and cylindrical stacked cells are used.
VII. CONCLUSION
A 1-Gb DDR SDRAM featuring cycle-time-adaptive wave pipelining and a variable-stage analog DLL with a three-input phase detector has been presented. By having the internal margin determined by consecutive clock edges, the device becomes suitable for a wide and varied frequency range. The variable-stage analog DLL achieves a wide operating frequency range with fast locking and small jitter by selecting the minimum number of delay elements. Difficulties arising from the large chip size are eased by the OCNOP and by the shared column decoder and I/O sense amplifier schemes, which provide short data flight time and improved chip efficiency, respectively. The device features are summarized in Table I.
