Introduction
In the past, DRAM products required few functional or I/O configuration options because mainframe computers were the primary application. Because of the explosive growth in portable and personal computing products, today's DRAM designs must provide functionality that covers the product spectrum from PDAs to mainframes. DRAM manufacturers must supply products that can satisfy various customer requirements for operating voltage, input/output configuration, and function while maintaining high performance and low power. This paper describes a DRAM architecture with design features that provide a single-chip design with the flexibility to meet these market demands.
Copyright 1995 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
Figure 1. Micrograph of 18Mb DRAM chip.
Table fragment (mode entries): Fast page; Static column. * Metal mask select.
By the early 1980s, DRAM designers had begun to offer enhanced functions such as page mode and nibble mode [1], as well as demonstrating higher performance [2, 3]. With the move to CMOS technology, DRAMs were presented featuring wider I/O and low standby power [4].
As the DRAM market broadened in the late 1980s, limited functional and I/O configurations selectable by wire bond and/or metal mask were presented [5]. In the past, strong downward pressure on DRAM prices kindled interest in "cutting down" product designs to the bit densities of the previous generation. For example, a 16Mb DRAM chip can be cut down to a 4Mb chip that is of substantially smaller area than the maturing 4Mb technology can provide. Also, a properly architected 18Mb chip can be readily cut down to 16Mb, eliminating the cost and complexity of separate design efforts for each chip. The chip discussed here is manufactured in a 0.5-µm CMOS process, using silicided polysilicon, two levels of metal, trench storage capacitors [6], and shallow-trench isolation [7, 8]. Figure 1 is a micrograph of the 18Mb DRAM chip.
Figure 2. Yield enhancement of 18/16Mb chip during early production. Panel (a): yield of 18Mb chips only.
The chip is divided in half by the center vertical periphery area, which contains the pads, addressing, control, data steering, and I/O circuitry. This results in high-performance propagation of signals, in that the longest net is equal to the height of the chip. Mask misalignment is minimized in the chip center regions, providing reduced defect sensitivity and tighter parametric distributions [9]. The peripheral circuits are designed using predefined second-metal signal nets and power buses. Input receivers are located immediately adjacent to the I/O pad [3]. The control and address signals are buffered into the left/right center horizontal stripe, in which word/bit redundancy and column predecode circuits are segmented to independently service the eight array sub-blocks. These nets are equivalent in length to those located in the center vertical region. The widths of nets serving large gate loads were customized to eliminate timing skews. Each chip quadrant consists of a 4.5Mb array organized as 512Kb × 9, with the data lines running horizontally to the center vertical peripheral region. Each data line fans into each of eight 576Kb quadrant sub-array blocks. The center vertical stripe in the quadrant contains two sets of row predecode circuitry to service the double row of decode stripes which drive polycided word lines. This segmentation architecture optimizes the efficiency of fault-detection schemes such as parity and ECC in the computers using this chip.
The chip functions and features are described in Section 2. Circuit design is described in Section 3, with subsections for address path, array design, and data path. Section 4 describes hardware results for the 18Mb DRAM and the 16Mb and 4.5Mb cut-downs.
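The array organization quoted above can be cross-checked with simple arithmetic. The Python sketch below is only an illustration: the constant and function names are hypothetical, and it assumes binary units (1Kb = 1024 bits, 1Mb = 1024 Kb).

```python
# Illustrative capacity check; every name here is hypothetical.
KB = 1024        # bits in 1Kb (binary units assumed)
MB = 1024 * KB   # bits in 1Mb

QUADRANTS = 4
SLICES_PER_QUADRANT = 9        # each quadrant is organized as 512Kb x 9
SLICE_BITS = 512 * KB
SUB_BLOCKS_PER_QUADRANT = 8    # eight sub-array blocks per quadrant


def quadrant_bits() -> int:
    """Bits in one quadrant: nine 512Kb data I/O slices."""
    return SLICES_PER_QUADRANT * SLICE_BITS


def sub_block_bits() -> int:
    """Bits in one of the eight sub-array blocks of a quadrant."""
    return quadrant_bits() // SUB_BLOCKS_PER_QUADRANT


def chip_bits() -> int:
    """Bits on the full chip: four quadrants."""
    return QUADRANTS * quadrant_bits()


if __name__ == "__main__":
    print(quadrant_bits() / MB)    # quadrant size in Mb
    print(sub_block_bits() / KB)   # sub-block size in Kb
    print(chip_bits() / MB)        # chip size in Mb
```

Under these assumptions the three values come out as 4.5Mb per quadrant, 576Kb per sub-block, and 18Mb for the chip, matching the figures in the text.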
Chip functions and features
The functional and input/output configurations of the 18Mb DRAM are selected via four program pads, PGM0-PGM3, located in the peripheral area of the chip. These pads are clamped either to Vss or to the positive supply through long and narrow devices. The default state of the pads at the wafer level configures the chip as x18, low-power addressing (12/8) in fast page mode without write-per-bit (WPB), for low-power products. As shown in Table 1, the PGM0-PGM3 pads permit sixteen different functional and I/O configurations to be created by wire-bonding the desired pad(s) to the appropriate state. A second-level-metal (M2) mask is required to activate the x4 w/4CE and x1 options along with 3V/5V, static column mode, and WPB selection.
The initial 18Mb DRAM design features laser fuses to provide 9/8 substitution in each 4.5Mb quadrant. Chips containing quadrants with a faulty 512Kb I/O slice can thus be reconfigured as a 16Mb chip. This fault-tolerance technique increases yield [10], as shown in Figure 2. The resulting chips are then put into the early 400-mil JEDEC standard packages for 18/16Mb DRAMs.
The chip architecture also provides for the efficient cut-down of the 18Mb design to 16Mb. Toward this goal, the hierarchical chip design data are nested such that two 1Mb × 1 array slices, centered between the row decode stripes and the contiguous periphery, can be deleted from the full-chip data file. Because of predefined wiring channels in the center vertical periphery, the resultant 4Mb segments can then be stepped and remerged with the rest of the chip. With the two resulting chip designs, the 9/8 steering circuitry is replaced with self-timed refresh (STR, or sleep mode) circuits. The laser fuses are now used to program the STR frequency. The minimal time required to run design ground-rule and physical/logic checking programs then defines the 16Mb DRAM design schedule.
The schedule for functional qualification of the 16Mb cut-down is similarly reduced because the I/O and control circuitry will already have been qualified on the 18Mb chip. The 16Mb cut-down DRAM chips are put into the present 300-mil JEDEC standard packages with functional selection via the chip PGM pads.
[...] quadrant, the 18Mb DRAM can also be cut down to 4.5Mb, as shown in Figure 3. Because of package requirements, the center vertical periphery must be [...] design, a 45.3-mm² (5.5 × 8.23-mm) 4.5Mb DRAM (shrinkable to 36.8 mm²) was created in four months by a single engineer. Utilization of the PGM pads gives the functional capabilities shown in Table 2.
Chip functional design
The flexibility of this architecture to provide multiple functional and design cut-down options requires careful circuit design for optimizing power, performance, and reliability. The power-and-ground distribution is designed to reduce noise that can affect reliable chip operation. By replicating the DRAM storage trench and pass gate, a 50-pF decoupling capacitor can be constructed in 115 × 35 µm. Inclusion of the DRAM cell pass gate in the decoupling capacitor structure provides current limiting in the event of a trench-to-substrate defect. The trench dielectric reliability is more than 300 times that of the planar dielectric, based on field return data. Therefore, designing the decoupling capacitor to be structurally identical to the DRAM array provides high yield and reliability. Such capacitors are placed at each input pad and off-chip driver pad, providing 2.5 nF of local VDD decoupling in the center vertical peripheral region. Precharge of the bit lines in the p-array to VDD [9-11] provides a series-equivalent decoupling capacitance of [...]
Address path
The address receiver circuits are located immediately adjacent to the associated I/O pad, providing reduced input capacitance loads. This also eliminates coupling-induced receiver input-level sensitivities and timing skews, which result when card-signal nets are bused on-chip to distant receivers. The address pad true state is buffered and then propagated in the chip center vertical peripheral region. This address bus is then buffered into either the left or the right center horizontal region over the redundancy circuits, and further buffered into word predecode circuits in the center vertical quadrant region. The word predecode is enabled by row address interlock (RAIN), which is generated by a physical copy of a word redundancy circuit, located at the left or right chip edge. Therefore, in spite of process-parametric variations and RC net delays, the slowest redundant word select signal will always be valid before word predecode is enabled. The RAIN interlock is also propagated to the chip center vertical peripheral region to enable the address transition detect (ATD) circuit and also to enable column addressing onto the address bus.
Figure 4 shows the ATD circuit used to reliably detect column address transitions. The ATD circuit is integrated into the left/right address redrive circuits located in the center of the chip. After the row address is decoded, the input pass gate to the address redrive latch is closed, isolating the latch state from the center vertical address bus state. These two states form the inputs to the XOR gate, which detects an address transition when the address net and latch states are unequal. The address transitions are then summed by a static NOR to enable the BRSETN (bit-reset-NOT) signal, which resets and disables the column predecode circuits in response to an address transition. At this time, the address redrive latches are reconnected to the vertical address bus to update the center horizontal address bus. Integration of the ATD into the centrally located address redrive circuits and utilization of a single-state address-bus architecture result in enhanced performance with reduced area and power.
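The XOR-and-NOR detection scheme described above can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the circuit of Figure 4: the class and method names are hypothetical, the address width is arbitrary, and BRSETN is assumed to be active low (0 means reset asserted), which is inferred only from the "NOT" in its name and the NOR summing stage.

```python
from typing import List


class AtdModel:
    """Behavioral sketch of address transition detect (ATD) logic.

    Hypothetical model: one redrive latch bit per column address bit.
    """

    def __init__(self, initial_address: List[int]) -> None:
        # The redrive latches hold the last column address passed along.
        self.latch = list(initial_address)

    def brsetn(self, bus: List[int]) -> int:
        """Return BRSETN for the given bus state (assumed active low)."""
        # Per-bit XOR of bus and latch flags a changed address bit.
        transitions = [b ^ l for b, l in zip(bus, self.latch)]
        # A NOR over the XOR outputs goes low when any bit has changed.
        return 0 if any(transitions) else 1

    def update(self, bus: List[int]) -> None:
        """Reconnect the latches to the bus so they take the new address."""
        self.latch = list(bus)


if __name__ == "__main__":
    atd = AtdModel([0, 0, 1, 0])
    print(atd.brsetn([0, 0, 1, 0]))  # unchanged address
    print(atd.brsetn([0, 1, 1, 0]))  # one bit changed
    atd.update([0, 1, 1, 0])
    print(atd.brsetn([0, 1, 1, 0]))  # latches now match the bus again
```

In this model the reset is asserted only while the bus and the latches disagree, and it clears once the latches are updated, which mirrors the sequence described in the paragraph above.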
Array design
The array is segmented to reduce the types of failure that can propagate across chip I/O sub-arrays. This is especially important for the efficient operation of parity and ECC in systems using DRAM products with byte-wide or word-wide (x16, x18) output configurations. The substrate plate trench cell (SPT) [6] technology provides a 70-fF storage cell for a bit line capacitance of 250 fF. The p-array bit lines are precharged to VDD to provide high performance, high reliability, and low power [9-11]. With bit lines and word lines precharged to the same bias level, defects that cause word line-to-bit line shorts do not contribute to chip standby current. In array designs where the word line and bit line are precharged to different bias levels, e.g., half-VDD bit line precharge, these word line-to-bit line defects can affect the bit line precharge bias level, degrading array signal development and sense-latch sensitivity. Another consequence is unfixable standby-current yield loss in low-power parts [18].
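As a rough guide to what the 70-fF cell and 250-fF bit line imply, the first-order charge-sharing estimate can be sketched as follows. This is the textbook ideal-capacitor approximation, not a result from the paper: it ignores parasitics, leakage, and the reference-cell scheme, the function names are hypothetical, and 3.3 V is used only as an example precharge level.

```python
def transfer_ratio(c_cell_ff: float, c_bitline_ff: float) -> float:
    """Fraction of the cell-to-bit-line voltage difference seen on the bit line."""
    return c_cell_ff / (c_cell_ff + c_bitline_ff)


def bitline_signal(v_cell: float, v_precharge: float,
                   c_cell_ff: float = 70.0,
                   c_bitline_ff: float = 250.0) -> float:
    """Ideal bit line voltage change after the cell is connected.

    Charge conservation between two capacitors gives
    dV = (v_cell - v_precharge) * Cs / (Cs + Cbl).
    """
    return (v_cell - v_precharge) * transfer_ratio(c_cell_ff, c_bitline_ff)


if __name__ == "__main__":
    print(transfer_ratio(70.0, 250.0))   # 70 / 320
    print(bitline_signal(0.0, 3.3))      # cell at 0 V, bit line at 3.3 V
    print(bitline_signal(3.3, 3.3))      # cell already at the precharge level
```

With these capacitances the transfer ratio is 70/320, or about 0.22. In this idealized model a cell already at the precharge level produces no signal, so the two data states cannot be told apart without a reference level.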
Full-VDD bit line precharge provides higher voltage overdrive of the array-cell transfer device, resulting in faster coupling of the stored data onto the bit lines. A word line interlock system, shown in Figure 5, is designed to accurately time the development of the bit line data signal. The reference word lines (RWLs) are used as inputs to exact replicas of the array transfer device, thereby allowing bit line signal development to be accurately timed across electrically, thermally, or process-parametrically induced variations in word line RC delay and/or device VT. Accurate word line interlocking is a key aspect of proper DRAM design and is required for enhanced yield, reliability, and performance.
Because of the optimal sense-latch sensitivity that results from full-VDD bit line precharge, activation of the sense clocks by the word line interlock results in rapid and efficient amplification of the bit line signal. After amplification, while the selected array word line remains at Vss, the selected and unselected reference word lines are equalized together to 1/2 VDD for the remaining duration of the array select time. At the initiation of an array restore, the reference word lines are clamped to Vss as the array word line is boosted below Vss. This provides a controlled write-back for the reference and array cells, resulting in consistent stored voltage levels. This eliminates late-write-induced array data pattern sensitivities. Word line boost at the initiation of array restore also reduces the duty factor for boosted-voltage-induced oxide stresses, resulting in improved array reliability. Because of the high overdrive that results from VDD precharge, p-FET devices provide high-performance array restore and equalization of the selected array bit lines. The unselected arrays substantially decouple the VDD bus, reducing noise and allowing high-performance operation of the 18Mb DRAM array in the various product address-configuration options.
Data path
Because the chip is architected as 1Mb × 18, each quadrant contains nine data I/O sub-arrays, each serviced by a single primary data line (PDL). Each PDL fans into eight digital secondary sense amplifiers (DSSA) located in the quadrant array sub-blocks. The DSSA serves two local data line pairs, each of which services 128 sense latches. The DSSA, shown schematically in Figure 6, results in low-power, high-performance transfer of the sense-latch data to the PDL. The local data lines feature a p-MOS half-latch and devices to reset the local data lines to VDD upon detection of an address transition. Selection of a bit switch discharges the local data line true or complement through the sense latch, which results in the digital transfer of the sense-latch data to the lightly loaded local data line pair. The local data line state is then buffered onto the 2.5-pF PDL through the tri-state driver. From the selection of the bit switch, transfer of the sense-latch data to the PDL occurs in 3.0 ns for devices made by the nominal process at 2.9 V and 85°C. A small latch maintains the PDL data state after the PDL driver sets it, in response to the next column-address access. The PDL state does not change unless the newly addressed sense latch contains the opposite data state. This eliminates a major source of power consumption in word-wide (x16 or x18) DRAMs, the selection and reset of primary data lines during column addressing.
The nine quadrant PDLs are wired to the 9/8 quadrant steering circuits. A laser fuse disables the data I/O circuit associated with the 1Mb × 1 array slices located along the top and bottom chip edges in Figure 1. Thus, a single laser fuse reconfigures the 18Mb DRAM for 16Mb operation (which, in spite of the 9/8 array operating current overhead, still meets industry specifications for power/performance). The flexibility to "steer out" defective I/O sub-arrays is provided by a bank of eight additional laser fuses for each quadrant. This design feature is most feasible in arrays that have been carefully designed and segmented to minimize failure types that affect more than one of the nine quadrant data I/O sub-arrays. By utilizing this fault-tolerance technique [10], 16Mb product yield is obtained from nearly all good 18Mb chips. As shown in Figure 2, early program yields were improved by up to 75% by utilization of this design feature.
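The effect of 9/8 steering on one quadrant can be sketched as a simple selection of eight slices out of nine. This is a hypothetical model for illustration only: the function name and slice numbering are assumptions, the edge slice is arbitrarily given index 8, and the paper does not describe how the fuse bank actually encodes the choice.

```python
from typing import List, Optional

SLICES = 9  # data I/O sub-arrays per quadrant


def steer_9_to_8(defective: Optional[int] = None) -> List[int]:
    """Return the eight slice indices kept for 16Mb operation.

    With no defect reported, the edge slice (assumed to be index 8) is
    dropped. With one defective slice, that slice is skipped instead.
    """
    if defective is None:
        defective = SLICES - 1
    if not 0 <= defective < SLICES:
        raise ValueError("slice index out of range")
    return [s for s in range(SLICES) if s != defective]


if __name__ == "__main__":
    print(steer_9_to_8())    # fault-free quadrant
    print(steer_9_to_8(3))   # slice 3 steered out
```

A quadrant with more than one defective slice cannot be repaired this way, which is why the text stresses segmenting the array so that a failure rarely spans more than one of the nine sub-arrays.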
Additional addressing, function, and output configuration options are provided by wire-bonding of the appropriate program pads (PGM0-3). For example, wiring PGM1 to the appropriate supply at module build reconfigures the laser-steered 1Mb × 16 to 2Mb × 8 with addressing reconfigured to 12 row and 9 column bits. The ninth column address bit allows 8-to-4 decoding of the quadrant data I/O arrays. If a module organized as 4Mb × [...] addresses 9 and 10, allowing 8-to-2 decoding of the quadrant data I/O. Table 1 illustrates the ability of this design, in conjunction with the flexibility of lead-on-chip packaging technology [19, 20], to provide multiple products from a single 18Mb DRAM chip design.
Figure caption: Timing distributions observed on 16Mb DRAM chips (axis: Time (ns)).
Figure caption: Measured standby current distributions (axis: Standby current (µA)).
The flexibility of this design to provide high performance with multiple module-output configuration options required careful design of the total off-chip driver (OCD) circuit/chip/package system, including consideration of 3-V or 5-V operation. The JEDEC standard module pin-outs for 16/18Mb DRAMs organized as x1, x4, x8/9 provide two Vcc and two Vss pins. For x16/18 organization, an extra Vcc and Vss pin are added to alleviate internal module noise from switching the extra I/O loads. The eighteen OCDs are located at the pads along the right side of the center vertical peripheral region, shown in Figure 1. The OCDs share a common Vss bus which is separated at the Vss pads from the global ground bus, to reduce IR drops and transient di/dt-induced noise that can affect logic and input receiver performance.
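The wire-bond reconfiguration examples given earlier in this section (12/8 addressing for 1Mb × 16 and 12 row plus 9 column bits for 2Mb × 8) follow from keeping the total bit count fixed while halving the output width. The sketch below checks that arithmetic. It assumes a 16Mb part in binary units and a fixed 12 row bits, as in the two examples; the function names are hypothetical.

```python
TOTAL_BITS_16MB = 16 * 1024 * 1024  # binary units assumed


def address_bits(words: int) -> int:
    """Address bits needed to select one of `words` locations."""
    bits = words.bit_length() - 1
    if words <= 0 or (1 << bits) != words:
        raise ValueError("word count must be a power of two")
    return bits


def organization(width: int, row_bits: int = 12):
    """Return (words, row_bits, column_bits) for a 16Mb part of this width.

    row_bits=12 is an assumption taken from the two examples in the text.
    """
    words = TOTAL_BITS_16MB // width
    return words, row_bits, address_bits(words) - row_bits


if __name__ == "__main__":
    print(organization(16))  # 1Mb x 16
    print(organization(8))   # 2Mb x 8
```

Halving the width from x16 to x8 doubles the word count and so adds one column address bit, which is the bit the text describes as selecting four of the eight quadrant data I/O arrays.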
For 5-V operation, the final metal mask provides a Vcc bus tied to the Vcc pads and shared by the OCDs, which is separate from the global VDD power bus. Voltage regulators supply 3.3 V to the global VDD power bus, which is substantially decoupled by the array and the 50-pF trench decoupling capacitors located at each peripheral pad. The OCD, shown in Figure 7, features three stages with input slew-rate control of the individual stage inputs. A staged output drive and input slew-rate control reduce parasitic Vcc and Vss noise induced when the module outputs are switched from nominal voltage levels. However, for reliable switching of outputs that have precharged to the maximum voltage levels, series damping resistors are required to moderate di/dt-induced Vss and Vcc noise. Transistor P0 acts as a series damping resistor between Vcc and P1-3, the output pull-up stages. A damping resistor in series with the three output pull-down stages moderates parasitic noises on the separate Vss bus which is shared by the OCDs.
The alternate final metal mask, for 3-V operation, disables the voltage regulators and connects the Vcc and global VDD buses. The series damping device P0 is disabled, and the Vcc bus is wired directly to the sources of P1-3, along with 50 pF of direct local decoupling and an additional 7 nF of global decoupling provided by the unselected p-arrays. This configuration results in low noise on the power bus, which provides high performance for the 3-V products.
A data line interlock (DLINT) is used to ensure efficient access of the array data by holding the OCD in tri-state until the OCD input latch is updated with valid data. The DLINT is constructed by locating an extra primary data line (PDL) in each quadrant, near the chip edge. This PDL fans out of tri-stateable, dummy digital secondary sense amplifiers (DSSA) in each of the eight quadrant array sub-blocks. The local data line pairs closest to the chip edge are NANDed to detect a data transition, which is buffered from the selected array sub-block onto the DLINT net. Left/right DLINT nets are combined at the top/bottom of the center vertical periphery to time the top/bottom bank of nine OCDs. Since the DLINT interlock comes from the array providing the data, it will track with electrically, thermally, and process-parametrically induced variations in data path timing. This results in accurate timing of the OCD for reliable high-performance operation.
Hardware results
Hardware results for the 16Mb DRAM, 1Mb × 16, 12/8 addressed device, under worst-case operating conditions for a 150-ns cycle, are shown in Figures 8 and 9. The 18Mb chips demonstrate similar performance with approximately 9% higher active operating current. Figure 10 shows worst-case results for 3-V and 5-V chips measured with CMOS and TTL input levels. The majority of the 3-V standby current distribution for CMOS input levels, which approaches 10% of the industry specification of 200 µA, demonstrates that this design is well suited for low-power applications such as portable electronic equipment. Chips that are configured into the other 18/16Mb product options also show similarly superior results.
The 1Mb × 18 DRAM, as described above, was also cut down to provide a 256Kb × 18, 9/9 addressed device, which was manufactured for both 3-V and 5-V applications. Figures 11 and 12 illustrate the functionality to worst-case voltage, temperature, and loading for a 150-ns cycle. The active current is half of the industry specification, with a 30% faster first access time. Utilization of the PGM pads at module build provides the product configurations shown in Table 2. This part, if fabricated in the present 0.5-µm CMOS technology, would be 36.8 mm², quite possibly the smallest 4.5Mb DRAM in the world.
Conclusions
The architecture and design techniques implemented on this chip provide substantial functionality and flexibility. The lead-on-chip packaging technology [19, 20] utilizes these capabilities to provide multiple product options. The careful design and segmentation of the array permit the use of 9/8 steering in the quadrants to provide 16Mb product yield from nearly all good 18Mb chips. Forethought in the architecture of the design data has resulted in a straightforward methodology for cut-down to a 16Mb chip design for enhanced productivity. This also allows the capability to cut down the 18Mb chip to a highly functional 4.5Mb DRAM chip. The chip periphery is customized for high performance as well as high yield. Reliable functionality and insensitivity to electrically, thermally, and/or process-parametrically induced timing skews are provided by carefully designed timing interlocks. Use of the physical chip structures, such as reference word lines, ensures the precision of the interlock design. Precharge of the p-array bit lines to VDD results in reliable, high-[...]-wide to word-wide output configuration with multiple functional and power supply options, demonstrating both low power and high performance. The chip design also provides for cut-downs to 16Mb and 4.5Mb DRAM products with functions and features as above. This chip, whose properties are summarized in Table 3, demonstrates the ability to cover the broad spectrum of today's DRAM product requirements.
