The Blue Gene/L chip is a technological tour de force that embodies the system-on-a-chip concept in its entirety. This paper outlines the salient features of this 130-nm complementary metal oxide semiconductor (CMOS) technology, including the IBM unique embedded dynamic random access memory (DRAM) technology. Crucial to the execution of Blue Gene/L is the simultaneous instantiation of multiple PowerPC cores, high-performance static random access memory (SRAM), DRAM, and several other logic design blocks on a single-platform technology. The IBM embedded DRAM platform allows this seamless integration without compromising performance, reliability, or yield. We discuss the process architecture, the key parameters of the logic components used in the processor cores and other logic design blocks, the SRAM features used in the L2 cache, and the embedded DRAM that forms the L3 cache. We also discuss the evolution of embedded DRAM technology into a higher-performance space in the 90-nm and 65-nm nodes and the potential for dynamic memory to improve overall memory subsystem performance.
Introduction
System-on-a-chip (SoC) technology plays a key role in the implementation of the Blue Gene*/L supercomputer. Central to the machine is the Blue Gene/L chip shown in Figure 1. It consists of twin IBM PowerPC* 440 processors (PU0, PU1) and their local static random access memory (SRAM) caches, two floating-point units (FPU0, FPU1), five network controllers, a memory control system, and 4 MB of on-chip dynamic memory (L3). The integration of the dynamic memory on-chip requires the development of embedded dynamic random access memory (DRAM) technology [1]. With embedded DRAM, the advantages of dense, reasonably fast but very-high-bandwidth DRAM can be integrated with the logic used to build the rest of the processor chip. Since memory continues to be the single largest component of die area in most high-performance processor chips, developing denser and higher-performing memory capable of being integrated with high-performance logic is perhaps the crux of both the SoC and the memory subsystem optimization problems. Increasingly, as exemplified by Blue Gene/L, more of this memory is becoming dynamic. (Both dynamic and static memory are volatile; the term dynamic refers to the fact that a dynamic memory cell can lose the memory state, even with the power on, and must be refreshed periodically.) There are four important reasons for integrating dynamic memory on the processor chip:
Cell size: Dynamic cells tend to be significantly smaller (5× to 7×) than static cells in a given technology. Traditionally, static memory cells are composed of six transistors, including a cross-coupled pair that toggles between the two states of the

©Copyright 2005 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

The presence or absence of charge on the storage node determines the state of the memory cell. Figure 3 shows the scaling trends for both static and dynamic memories when they are embedded in bulk logic technology. Memory cell size, however, does not translate directly to die size savings. The charge stored on the node of the dynamic cell is susceptible to leakage and must be refreshed periodically. Additionally, dynamic cells lose their state during the read operation and have to be rewritten after reading. Furthermore, dynamic memory requires the access transistor gate voltages to be boosted to ensure maximum stored charge. Thus, there is additional circuit overhead that causes the area advantage of dynamic memory over static memory to be about 3× to 4×, depending on the size of the memory being considered. Larger memory blocks are more area-efficient.
Standby power: As semiconductor scaling proceeds beyond the 130-nm generation, the device off currents show an alarming increase; as shown in Figure 4 [2], the standby power density begins to approach the active power density of the chip. In the case of memory, a static memory cell with six transistors tends to have approximately 1000 times more leakage current on a per-cell basis than a dynamic cell. At the memory macro level, one has to consider the standby current of the peripheral devices and the various internally generated power-supply circuits, and the refresh current needed for the dynamic memory. When all of these are included, embedded DRAM tends to have a 6× to 8× advantage over embedded SRAM. Active power tends to be comparable and is dictated by the performance and memory bandwidth used, since the output data lines tend to be heavy loads.

Soft-error rate (SER): SER refers to the transient single-cell upsets caused by the penetration of high-energy particles, such as cosmic rays, into the silicon. The carriers generated by these particles can cause individual bits to be upset. The susceptibility of the cell to these events is exponentially dependent on the stored charge. Since embedded DRAMs store approximately 20× more charge, they tend to have

This results directly from the fact that the signal levels at the bitline for both types of memory are becoming comparable, so the smaller cell size (hence, time of flight) becomes a defining parameter for all but the smallest memory blocks. Cycle time tends to be longer for embedded DRAMs because the read is destructive and the time to write a full charge on the storage node can be long, but the use of multibank organization, as described later, offers a significantly reduced bank cycle time. Another important point to note is that aggressively scaled SRAMs exhibit new instability modes, but discussion of this point is beyond the scope of this paper [3].
From a processor design perspective, optimizing memory subsystem performance remains the biggest challenge. Processor frequency, until now an accepted performance metric, is no longer adequate; furthermore, it does not show the dramatic increases witnessed in the last several generations. Two contributors to this shift are the saturation of device-level performance brought about by the difficulties in scaling through technology alone and the dramatic increase in chip power dissipation to almost unsustainable levels. A finer granularity in the on-chip memory hierarchy, including the use of a third level of on-chip memory, is increasingly common. To be useful, however, this third level must be large and dense, with low enough latency and high bandwidth. The bandwidth criteria should be understood in terms of what can be achieved on-chip using on-chip wiring pitches, but cannot be met by off-chip memory, which is limited by input/output (I/O) wiring density and off-chip drivability and power considerations. Embedded DRAM meets these criteria for the on-chip L3 and promises to meet these criteria for increasingly demanding L2 applications in future generations.
Figure 3
Cell size trends for static and dynamic memory cells when embedded in logic technology. Node generation is as shown. Cell size scaling is expected to saturate for a variety of reasons outlined in the text.
Figure 4
Active and standby power as a function of gate length. Not shown is the gate leakage component, which becomes significant beyond the 90-nm node. Both embedded SRAMs and embedded DRAMs show high standby currents in the peripheral circuits. However, DRAM subarrays will have a significant (more than 6× to 8×) reduction in standby current compared with SRAMs. Active powers for both are dominated by data rates and are comparable. Adapted from [2] with permission.
Migrating from the commodity DRAM base to the logic base
The embedded DRAM macro performance was modest. In fact, it was comparable to commodity memory performance. This resulted from the fact that the subarray was designed to meet the standardized performance metrics required of the commodity part. The relatively poor performance of the peripheral logic devices resulted in a somewhat marginal library. This limited the widespread use of the technology. Density suffered on two counts: first, DRAM processes were optimized to print array features at very tight pitches, but random patterns were significantly looser than logic technology at the time; second, the low drive of the devices required them to be sized large enough to drive the on-chip load. As a result, the logic density lagged by more than a generation. The assumption of DRAM line learning did not apply as well as initially envisioned. The on-chip voltage variations, junction temperatures, and retention characteristics were different enough from standard DRAM that additional process sensitivity learning was required. Furthermore, a strategic redirection of focus to logic rather than commodity DRAM meant that there was no commodity DRAM learning effect on the embedded DRAM parts in any case.
On the basis of these findings, subsequent embedded DRAM technologies were developed to be integrable with high-performance logic [7]. Furthermore, the embedded DRAM design intellectual property was integrated into the IBM Blue Logic* libraries and followed the same integration and test methodologies for ease of use. Both of these factors have contributed immensely to the success of embedded DRAM at IBM and have facilitated realization of the Blue Gene/L SoC, including the integration of the PowerPC 440 cores. Nevertheless, it should be emphasized that the commodity DRAM pedigree of IBM embedded DRAM has been a crucial element of its success. By incorporating the learning of several generations of commodity DRAMs and retaining crucial elements of the process architecture, our embedded DRAM macros have avoided the pitfalls of the well-known complex DRAM pattern 
Embedded DRAM technology
Blue Gene/L currently uses 130-nm technology [8]. This is second-generation IBM embedded DRAM technology, the first being at 180 nm. This has also been extended to the 90-nm and 65-nm nodes. At 130 nm, the embedded DRAM platform allows for the integration of high-performance logic devices, embedded SRAMs with two different cell sizes, content addressable memories (CAMs), and a variety of passive devices, such as precision resistors, inductors, and capacitors, if needed.
The embedded DRAM technology, shown in Figure 5, is based on IBM deep-trench (DT) technology [9]. The details of the trench fabrication have been described elsewhere [10]. The key feature to be emphasized here is the fact that the trench process is the very first processing that the wafer undergoes. After DT processing, the wafer presents a planar surface and, for all practical purposes, the DT processing can be considered transparent to the rest of the process. This should be contrasted with other processes, such as the stacked capacitor or metal-insulator-metal capacitor (MIM cap) approach, followed by others [11, 12] in which the capacitor is created in the middle-of-the-line process (between the device and wiring levels). In the latter case, novel materials must be used, and complex three-dimensional capacitor structures must be built. These require additional planarizing steps and, with advanced back-end wiring, even a minimal degradation of planarity can result in degraded yields. Very often, the presence of these tall capacitor structures in the middle-of-the-line process requires unique first-level-metal processing that may also require changes in the electrical timing of logic circuits for designs with and without embedded DRAM. It must be pointed out that the embedded DRAM cell layout follows all of the layout rules required by the technology, and no special waivers are granted. This methodology allows the embedded DRAM processing to have minimal impact on yield. This should be contrasted with the approach followed in the dense SRAM cells, in which the SRAM array is shrunk beyond the layout rules.
Another approach to embedded DRAM is the so-called "logic-only" process, in which the gate of the device is used as the storage node. Typically, the capacitance obtainable by this approach is of the order of 2-3 fF and is usually woefully inadequate for robust DRAM operation as required in Blue Gene/L types of applications.
Figure 6
Process flow used to manufacture the Blue Gene chip.
Figure 6 shows the process flow used in the manufacture of the Blue Gene/L chip. Note that after the DT processing, the rest of the process allows for the seamless integration of the logic and SRAM devices. The pass transistor used in the array to access the deep trench is threshold-tailored for low leakage. A triple-well process is used for the embedded DRAM subarray. Typical leakage currents for the array are in the fA regime. In 130-nm technology, cobalt disilicide is formed on the gate, source, and drain areas of only the logic field-effect transistors (FETs) to reduce their resistance. The array areas are blocked from forming silicide. This is done to ensure long retention. The total leakage associated with the cell is a few fA. The median retention time is of the order of a few seconds, and cell leakage is dominated by junction leakage in this technology. This being the case, retention is sensitive to temperature, and the median retention time shows an increase of approximately 2× for every 10°C drop in temperature. The refresh criteria for the macro are designed for a retention time of 3.2 ms at application conditions (Tj = 105°C; worst-case voltage and worst-case process). The biasing conditions for the cell are shown in Figure 2(b). All required voltages are generated on-chip using the supplied Vdd of 1.5 V. The redundancy scheme using laser fuses is discussed in a later section. Figures 7(a) and 7(b) show scanning electron microscope (SEM) cross sections of the cell, and Figure 7(c) a top view. The deep trench, the buried strap that connects the node to the pass transistor, the pass transistor, and the bitline contact are clearly seen.
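The temperature sensitivity of retention quoted above (roughly a 2× increase in median retention time for every 10°C drop) can be illustrated with a short calculation. This is a minimal sketch: the function name, the strictly exponential form, and the example temperatures are assumptions made for illustration, not values or models taken from the technology documentation.

```python
def retention_scale(temp_from_c, temp_to_c, doubling_step_c=10.0):
    """Return the factor by which median retention time changes when the
    junction temperature moves from temp_from_c to temp_to_c.

    Assumes a pure exponential dependence in which retention doubles for
    every doubling_step_c degrees of cooling (the approximate rule quoted
    in the text). Real cells only follow this trend approximately.
    """
    return 2.0 ** ((temp_from_c - temp_to_c) / doubling_step_c)


# Hypothetical example: cooling from the 105 C application condition to 85 C.
print(retention_scale(105.0, 85.0))   # two 10-degree steps of cooling
# Heating has the opposite effect: retention shrinks.
print(retention_scale(85.0, 105.0))
```

Under this assumed model, two 10°C steps of cooling multiply the median retention time by four, which is why the refresh criterion is specified at the hottest application condition.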
Cost and process complexity
A major misconception in the industry is that embedded DRAM is more complex and, therefore, more expensive. While the additional processing associated with the fabrication of the capacitor and DRAM access device does mean that an embedded DRAM wafer is more expensive to process, this does not mean that an embedded-DRAM-based die is more expensive than an embedded-SRAM-based die. The discussion must be focused on cost-effectiveness. Consider the following argument:
1. Start with an SRAM-based die. 
Figure 8
Fractional cost savings obtained by replacing embedded SRAM with embedded DRAM, as a function of the fraction of the die that is replaceable embedded SRAM, for different values of embedded DRAM efficiency n, which represents the density improvement of embedded DRAM over embedded SRAM. The process complexity adder is assumed to be 15%, which is representative of current technology. Basically, if 25% of the die is replaceable SRAM, the embedded DRAM solution is cost-effective. 
where b is the fractional increase in processing cost per wafer for the embedded DRAM wafer compared with one with no embedded DRAM.
This has been plotted in Figure 8 for different values of n = 1/a. With respect to Blue Gene/L, for 130-nm technology, a die with approximately 25% of contiguous SRAM is an excellent candidate for replacement with embedded DRAM. If the Blue Gene/L chip were designed with embedded SRAM, approximately two thirds of the die would be composed of replaceable SRAM, so that the embedded DRAM solution is, in fact, 40% more cost-effective, a significant cost savings. Interestingly, this cost savings has been achieved with no significant increase in latency. This is because the smaller footprint of the DRAM cache compared with the hypothetical SRAM cache results in a shorter time-of-flight delay from the farthest bits in the macro.
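The cost argument can be sketched numerically. The model below is an assumption, not the paper's own expression: it supposes that die cost scales with die area times per-wafer processing cost, that a fraction x of the SRAM-based die is replaceable SRAM (x is a symbol introduced here for illustration), that this area shrinks by the factor a = 1/n when converted to embedded DRAM, and that wafer cost rises by the fraction b.

```python
def fractional_cost_savings(x, n, b=0.15):
    """Fractional die-cost savings from replacing embedded SRAM with eDRAM.

    x: fraction of the original die that is replaceable SRAM (assumed symbol)
    n: density improvement of embedded DRAM over embedded SRAM (n = 1/a)
    b: fractional increase in processing cost per wafer for embedded DRAM

    Assumed model: cost ~ (wafer cost) * (die area), so the new relative
    cost is (1 + b) * ((1 - x) + x / n).
    """
    relative_area = (1.0 - x) + x / n
    return 1.0 - (1.0 + b) * relative_area


def break_even_fraction(n, b=0.15):
    """SRAM fraction x at which the assumed model shows zero savings."""
    return (b / (1.0 + b)) / (1.0 - 1.0 / n)


# Two thirds of the die replaceable, n = 3.5, 15% process adder.
print(round(fractional_cost_savings(2.0 / 3.0, 3.5), 3))
# A die that is 25% replaceable SRAM is already past break-even.
print(round(break_even_fraction(3.5), 3))
```

With two thirds of the die replaceable and n between 3 and 4, this assumed model gives savings of roughly 36% to 43%, in line with the 40% figure quoted above, and it places the break-even point below the 25% SRAM fraction mentioned in the Figure 8 caption.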
Unfortunately, misguided ideas on the perceived cost adder of embedded DRAM, without a clear understanding of its value proposition, have led to the development of so-called 1T cells of dubious value and limited extendability. (Such cells are offered by several foundries without DRAM experience.) These cells lack the robustness, including noise immunity, of the IBM trench technology and do not offer the same die area savings. Additionally, as logic technologies build more complexity into the base process (such as additional device types, additional metal levels, precision resistors and capacitors, and silicon-on-insulator substrates), the cost and complexity adder for embedded DRAM continues to fall and is expected to be well below 10% at the 65-nm high-performance node.
Embedded DRAM macro design considerations
Embedding DRAM into an IBM Cu-11 application-specific integrated circuit (ASIC) design extends the on-chip memory capacity to more than 40 MB, allowing memory that has historically been off-chip to be integrated on-chip. In fact, chips with as much as 344 Mb of DRAM are currently manufactured in this technology.
With the memory on-chip, applications can leverage the high bandwidth naturally offered by a wide I/O DRAM and achieve data rates previously limited by pin count and off-chip data rates. Applications for this memory include standalone L3 cache chips for IBM pSeries* and xSeries* eServers*, network processors, and digital signal processors. The integration of embedded DRAM into ASIC designs has intensified the focus on how best to architect, design, and test a high-performance, high-density macro as complex as dynamic RAM in an ASIC logic environment. The ASIC environment itself presents many difficult challenges that have historically affected DRAMs, specifically wide voltage and temperature operating ranges and uncertainties in surrounding noise conditions. These challenges dictate a robust architecture that is noise-tolerant and can operate at high voltage for performance and low voltage for reduced power.
With the advent of embedded DRAM offerings in a logic-based ASIC technology, the performance of embedded DRAM macros has improved significantly over that in DRAM-based technologies. Subsequently, users are increasingly replacing SRAM implementations with embedded DRAM, placing additional pressure on macro performance and random cycle time. This pressure extends into test, where the use of traditional direct memory access (DMA) is costly in silicon area and wiring complexity, and it introduces uncertainty in performance-critical tests.
A more attractive solution to this test problem is the use of a built-in self-test (BIST) system that is adapted to provide all of the necessary elements required for high fault coverage on DRAM, including the calculation of a two-dimensional redundancy solution, pattern programming flexibility, at-speed testing, and test mode application for margin testing [13, 14]. The Cu-11 embedded DRAM macro has been developed around the idea of user simplicity, while including a high degree of flexibility, function, and performance [15]. For application flexibility, the embedded DRAM can grow in 1-Mb increments to provide macro sizes from a 1-Mb minimum to a 16-Mb maximum; it offers a 256-I/O width and a 292-I/O width for applications requiring parity. The wide I/O was chosen to provide maximum bandwidth; for applications not requiring the full width, bit-write control was included to facilitate masking.
Multiple embedded DRAM macros can be instantiated on an ASIC die, enabling customers to make the performance and die area tradeoff specific to their application. Figure 9 shows a high-level floorplan of the embedded DRAM. The embedded DRAM is constructed from building blocks: a 1-Mb array core, a power system for generating boosted voltage levels used by the array core, a control system for buffering and generating the array core timing signals, column redundancy for replacing defective data bits, data I/O for receiving and transmitting off-macro data, and BIST for testing the embedded DRAM macro. BIST is composed of a microprocessor-based engine, instruction memory [read-only memory (ROM) and scannable read-only memory (SROM)], a data comparator, and a redundancy allocation unit. The 1-Mb array and its support circuitry are replicated to construct the desired macro size. Each embedded DRAM macro contains a single control system, a common power system, and a BIST.
This architecture lends itself well to providing two modes of macro operation: single-bank and multibank interleave modes. The single-bank operation provides a simple SRAM replacement function, while the multibanking mode extends the macro performance by allowing concurrent operations to independent banks. Bank operation was intended to resemble an embedded SRAM, supporting simple broadside addressing with read and write control. To improve bandwidth, the user can optionally use page mode, which was carried over from conventional DRAM. This was the only mode supported in SA27E; in Cu-11, however, it was decided to support multibank operation.
For the multibank-mode configuration, each 1-Mb block of the macro acts as an independent bank that shares a common address and data bus with all other 1-Mb blocks within the macro. The number of banks within a macro is determined by the macro size. Figure 9(a) shows the floorplan, and Figure 9(b) shows a 4-Mb macro with four banks. A bank select (BS) pin is associated with each bank (1-Mb block) and controls activation and precharge of that bank. The bank address (BA) is decoded by control logic and arbitrates which bank has control of the datapath.
In multibank configuration, the macro does not employ broadside addressing. Rather, the embedded DRAM macro operates similarly to a synchronous DRAM (SDRAM) whose addressing is performed in a manner similar to a row-address-strobe/column-address-strobe (RAS/CAS). The macro select input (MSN) is treated like a master input clock, latching the state of all other input pins with each falling MSN edge. Figure 10 shows three cycles. Cycle 0 activates bank 0, cycle 1 activates bank 1 and reads or writes bank 0, and finally cycle 2 activates bank 2, reads or writes bank 1, and precharges bank 0.
Figure 10
Timing diagrams showing multibank operation over three cycles.
The MSN input can be cycled at a maximum rate of 250 MHz (4 ns, assuming a nominal 50/50 clock duty cycle). This protocol supports simultaneous activate, read and write, and precharge to three different banks. Maximizing the number of banks in a macro improves the probability of avoiding an open (or busy) bank and maintaining the pseudo-random peak bandwidth of 8 GB/s.
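The three-cycle sequence described above, and the quoted peak bandwidth, can be reproduced with a small sketch. The function names, the strict round-robin rotation across banks, and the use of the 256-bit data width for the bandwidth figure are illustrative assumptions; the real macro lets the user choose which bank to address in each cycle.

```python
def multibank_schedule(num_cycles, num_banks):
    """Return, per MSN cycle, which bank is activated, read/written, and
    precharged under an idealized round-robin rotation (an assumption).

    Each bank is activated in one cycle, accessed in the next, and
    precharged in the one after, so three banks are busy at once.
    """
    if num_banks < 3:
        raise ValueError("this idealized rotation needs at least 3 banks")
    schedule = []
    for cycle in range(num_cycles):
        ops = {"activate": cycle % num_banks}
        if cycle >= 1:
            ops["read_write"] = (cycle - 1) % num_banks
        if cycle >= 2:
            ops["precharge"] = (cycle - 2) % num_banks
        schedule.append(ops)
    return schedule


def peak_bandwidth_bytes_per_s(io_width_bits=256, clock_hz=250e6):
    """Peak data rate if one full-width access completes every cycle."""
    return io_width_bits / 8 * clock_hz


for cycle, ops in enumerate(multibank_schedule(3, 4)):
    print(cycle, ops)
print(peak_bandwidth_bytes_per_s() / 1e9, "GB/s")
```

A 256-bit access every 4-ns cycle corresponds to 32 bytes × 250 MHz = 8 GB/s, which is consistent with the peak figure in the text; sustaining it depends on each new access finding a bank that is not already open or busy.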
Test considerations
Since the bit cells used in the embedded DRAM macro are derived from their commodity DRAM predecessors, they will likely have the same type of sensitivities, which are well known from the development of commodity DRAM and require identification at test. Many of the interactions in the DRAM cell matrix are complex and only activated with certain combinations of defects and test patterns. To deliver the complex test patterns, commodity DRAMs use specialized test equipment with algorithmic pattern capability for generating the test sequences, and they employ large and fast data-capture memory with redundancy allocation hardware to identify and repair faults. Considering how to test a DRAM embedded in logic creates a dilemma: logic tester or memory tester [16-18]? The logic test platform that has been developed for past generations of ASICs without embedded DRAM can be characterized as a low-cost reduced-pin-count tester with no algorithmic pattern generation or redundancy allocation hardware; it is, therefore, unable to test DRAM without assistance. The logic test patterns implemented are automatically generated with software based on the customer's netlist. The test strategy comes down to either a two-tester solution (memory tester and logic tester) or comprehensive BIST. The two-tester approach suffers from the following issues:
There are multiple test gates with the associated increase in wafer handling.
Cumbersome requirements are placed on the customer to multiplex the macro I/O to package pins.
Part-number-specific test pattern development is required and is typically difficult to automate.
In contrast to the two-tester approach, BIST also allows rapid isolation of faults to either the logic or embedded DRAM in SoCs, allowing for fault resolution.
In the high-part-number ASIC environment, it is essential to implement a single-tester platform using BIST for memory test and automated test generation for logic test. The design goal of the BIST is to provide a test engine operable in the logic test environment of low-cost, low-pin-count testers that stimulates the control, datapaths, and array of the embedded DRAM and provides fault coverage equivalent to that traditionally supplied to discrete DRAM by high-cost memory testers. The flexibility of the BIST system, made possible through the use of an SROM, enables the test development engineers to alter the instruction memory to create new or modified test patterns or to change the sequence or number of patterns applied at each manufacturing test gate. The BIST components are shown in Figure 11. Ultimately, the BIST locates all faults in each 1-Mb array segment, calculates the two-dimensional redundancy solution required to repair these faults, and reports this solution via standard scan string methods [19]. The redundancy solution is permanently stored in a remote fuse memory (nonvolatile) programmed with a laser after testing.
A key enhancement to the BIST schemes previously used for embedded SRAM macros is the inclusion of a redundancy calculator, also referred to as redundancy allocation logic (RAL), for two-dimensional redundancy. The function of the redundancy calculator is to compare data read from the array with the data expected from the BIST engine and optimally allocate redundancy for array fails. The BIST processor calculates row and data-bit redundancy for wide-I/O embedded DRAM macros. The system is described in [20] . Each 1-Mb array contains its own redundant elements, which may not be shared with other 1-Mb arrays. For this reason, BIST calculates and stores only 1 Mb worth of a redundancy solution at a time. Calculation of the full 16-Mb repair solution would require 16 times the number of fail counters and address registers, increasing the silicon area required for the BIST to an unacceptable level.
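To make the idea of two-dimensional redundancy allocation concrete, the sketch below applies a generic textbook heuristic (must-repair analysis followed by a greedy choice) to the fails of a single array. It is not the RAL described in [20]; the function name, the data representation, and the spare counts are all assumptions made for illustration, and the greedy step does not guarantee an optimal allocation.

```python
from collections import Counter


def allocate_redundancy(fails, spare_rows, spare_cols):
    """Choose spare rows/columns (data bits) to cover failing cells.

    fails: iterable of (row, col) fail locations for ONE array segment.
    Returns (rows, cols) as sets, or None if the heuristic finds no repair.
    Generic must-repair + greedy heuristic, not IBM's RAL algorithm.
    """
    remaining = set(fails)
    rows, cols = set(), set()
    while remaining:
        row_counts = Counter(r for r, _ in remaining)
        col_counts = Counter(c for _, c in remaining)
        rows_left = spare_rows - len(rows)
        cols_left = spare_cols - len(cols)
        # Must-repair: a row with more fails than the spare columns left
        # can only be fixed by a spare row (and symmetrically for columns).
        must_rows = {r for r, k in row_counts.items() if k > cols_left}
        must_cols = {c for c, k in col_counts.items() if k > rows_left}
        if must_rows or must_cols:
            rows |= must_rows
            cols |= must_cols
        else:
            # Greedy step: repair the line that covers the most fails.
            r, r_hits = row_counts.most_common(1)[0]
            c, c_hits = col_counts.most_common(1)[0]
            if r_hits >= c_hits:
                rows.add(r)
            else:
                cols.add(c)
        if len(rows) > spare_rows or len(cols) > spare_cols:
            return None  # ran out of spares
        remaining = {(r, c) for r, c in remaining
                     if r not in rows and c not in cols}
    return rows, cols


# Hypothetical fail set: three fails on row 0 plus one isolated fail.
print(allocate_redundancy([(0, 0), (0, 1), (0, 2), (5, 7)], 1, 1))
# Three fails on a diagonal cannot be covered by one row and one column.
print(allocate_redundancy([(0, 0), (1, 1), (2, 2)], 1, 1))
```

Because the fail set here is accumulated before allocation, the same routine can be rerun on the cumulative fails after each additional test pattern, which mirrors the per-1-Mb, pattern-by-pattern calculation described in the text.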
An additional enhancement to the redundancy calculator is the capability to reload the SROM instruction memory discussed earlier. By segmenting the scan string and reloading only the SROM, neither the state of the BIST engine nor the current state of the RAL calculated redundancy solution is upset. Thus, additional patterns can continue to be applied one right after the other with the redundancy solution being calculated on the cumulative failure set from all patterns.
Figure 11
Built-in self-test (BIST) engine components.
To maximize system performance, 250-MHz timing rules are provided to describe the random access memory (RAM) core performance. Testing at speed (high-frequency or ac testing) becomes essential to ensure the high-performance timings, especially in the presence of the speed-sensitive fails commonly found in DRAMs. As the complexity of chip-level integration increases, so do the challenges for ac test. Effective ac testing requires that the stimuli be provided with sufficient speed and accuracy while meeting the constraints for cost-competitive manufacturing.
The ac test development is greatly simplified if BIST units are placed in such a way that the delays between the BIST and the RAM core boundary are minimized and predictable. An example of such a design is one in which each instance of a RAM core comes with its own BIST. While this approach may use more silicon area than others, the savings in design, test development, and test cost can be realized through the reuse of the same integrated RAM and BIST core across a wide range of applications. In some environments, the results that can be derived from a finite design and test development resource are enhanced by focusing on the development of a reusable core rather than the development of a specific customer part number. The RAM and BIST integrated core design point enables effective ac test by reducing the delay between the RAM and BIST, which minimizes the timing uncertainties introduced by process variability and typical tester hardware. Efficient ac test is realized by leveraging the self-test concept. Performing data compare and storing fail locations in the BIST reduces the demands on tester resources to input only. Eliminating the need for expensive high-speed data-capture hardware in the test head greatly simplifies requirements from both bandwidth and calibration points of view. Containing critical timings within a BIST represents another large step toward reducing tester demands when clock multiplication capability is added.
In the ASIC environment, in which the BIST has all the capability and flexibility to test the DRAM with minimal tester interaction, there still exists the need to get results from the BIST back to the tester. Upon completion of a functional pattern, there are several types of data that must be acquired from the BIST: pass/fail data, repair data needed for fusing, and bit-fail mapping to identify defective bits. Fail mapping is critical for locating defective bits for failure analysis and follow-on yield learning. In order to generate statistically significant data, large volumes of data are required, and these must be generated in an automated fashion. Providing easy access to a synchronized set of address and compare states makes it possible to create bitmaps for any BIST pattern with a minimum amount of offline processing. In this scheme, BIST design-for-test efforts have had a significant positive impact on both test overhead costs and diagnostic capabilities.
Hardware measurements on 16-Mb macros verify 3.3-ns (300-MHz) operation and 6.6-ns random access and 3.3-ns bank access at 1.5 V for high-performance applications, while also providing reasonable performance at 1.0 V. These results are shown in Figure 12, a "schmoo" pass-fail plot for random cycle compared with chip Vdd. These performance values easily support the application specification of 250 MHz at 1.5 V with high yield. Bit-fail mapping using BIST has been implemented in the manufacturing flow, generating data for failure analysis and yield learning. The general-purpose nature of the macro has enabled use across multiple product applications, including the POWER5* L3 cache, and cumulatively provides yield learning approaching that of a high-volume single standard product.
Device considerations
To ensure the best possible performance in bulk silicon technology at competitive cost, the Blue Gene/L chip was designed and fabricated in the IBM standard 0.13-µm bulk CMOS offering [21]. The basic device characteristics
Figure 12
"Schmoo" plot of random cycle time as a function of chip Vdd. A 4-ns cycle time (250-MHz clock frequency) can be supported with manufacturing margin over the entire process window and operating conditions of temperature, including retention at 115°C.
The technology specifications for off current are approximately 0.3 nA/µm for both devices, with maximum off currents of 2 nA/µm and 1 nA/µm for n-FET and p-FET at a 6σ short gate length of 70 nm. Controlling the worst-case off currents requires excellent line width control, both across chip and across wafer. Finally, field-effect transistor (FET) parasitic capacitances are optimized as low as possible, taking into account the tradeoffs involving FET current drive, off current, and junction leakages. The culmination of the above is an 18.5-ps delay-per-stage performance on a fanout-1 inverter ring oscillator as measured and modeled including all parasitics. Figure 14 shows cross-sectional images of the fully processed devices. In addition to the base FET devices described above, there is a dual-gate-oxide feature available for high-speed I/O devices. This feature is employed for the array FET and wordline system in the embedded DRAM macro. This is accomplished through a single-mask process rendering both the 22-Å and 52-Å oxides within the chip. The thick-oxide FET nominal design gate length is 0.24 µm, with an on-wafer dimension of approximately 0.21 µm. Independent threshold-voltage centering is also available on these devices, which are used with a 2.5-V (max 2.7-V) power supply. This same 52-Å oxide is employed in both the pass transistor in the array and the wordline driver circuits. The only additional device used in the embedded DRAM macro is the array device, which has its own optimized well and threshold tailoring. The on current for the array transistor is 40 µA per cell at a wordline boost of 2.5 V.
A final note on the technology device feature set: optional high- and low-$V_t$ devices are derived from the base FET through an additional mask for implant tailoring. Such FETs often find applications in arrays and high-speed logic critical paths. Typical offsets with respect to base FETs are approximately −90 mV and +70 mV for the low- and high-$V_t$ devices, respectively.
For passive devices, besides the deep-trench decoupling capacitors and the reference and electrostatic discharge (ESD) diodes, the technology has two types of optional precision resistors. One is based on an unsilicided n-type diffusion, the other on an unsilicided p-type polysilicon line; they are specified at 73 and 340 Ω/□, respectively. Such resistors find a large range of critical applications, from high-speed analog applications to front-end ESD protection. The embedded DRAM uses this feature to block silicide from the array.
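As a rough illustration of how the quoted sheet resistances translate into resistor values, the following minimal Python sketch applies the ideal relation R = R_sheet × (L / W). The function name and the 10 µm × 1 µm strip are hypothetical choices made for this example; contact and end resistance, which a real design kit would model, are ignored.

```python
# Sheet resistances quoted in the text, in ohms per square.
R_SHEET_N_DIFFUSION = 73.0   # unsilicided n-type diffusion
R_SHEET_P_POLY = 340.0       # unsilicided p-type polysilicon


def resistor_ohms(sheet_resistance: float, length_um: float, width_um: float) -> float:
    """Ideal resistance of a rectangular strip: R = R_sheet * (L / W).

    Assumption for illustration: no contact/end resistance and no width bias.
    """
    if length_um <= 0 or width_um <= 0:
        raise ValueError("length and width must be positive")
    squares = length_um / width_um  # number of squares in the current path
    return sheet_resistance * squares


if __name__ == "__main__":
    # Hypothetical 10 um x 1 um strip (10 squares); dimensions are illustrative only.
    for name, r_sheet in (("n-diffusion", R_SHEET_N_DIFFUSION),
                          ("p-polysilicon", R_SHEET_P_POLY)):
        print(f"{name}: {resistor_ohms(r_sheet, 10.0, 1.0):.0f} ohm")
```

For the same footprint, the polysilicon resistor gives roughly 4.7 times the resistance of the diffusion resistor, which is one reason a designer might choose between the two types.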
Sophisticated device models that account for topographical details and mechanical stress were developed to accurately model circuits over a wide variety of process conditions. Such models, including layout extraction capabilities, become important at the 0.13-µm node and beyond, and are key to optimizing design and closing timing on a complex SoC, such as Blue Gene/L.
As a whole, the technology delivers a versatile CMOS offering to a large variety of product applications. These range from components in high-end server chipsets to a realm of products under a common ASIC library, and finally to various custom foundry applications. Recent aggressive defect-reduction activities provide yield learning approaching that of a single high-volume standard product.
Embedded SRAM considerations
The Blue Gene/L chip was designed using 26 instances of single-port SRAM arrays. These SRAM arrays use a dense SRAM cell developed for 130-nm ASIC applications. This section discusses the design features of this SRAM cell, the methodology used to customize the design to allow the cell size to be reduced by 16% from the standard SRAM cell, and the electrical characteristics of the devices used in the cell. In general, SRAM cell layouts at a given technology node are somewhat relaxed initially. As the technology matures, however, the cell is shrunk by tightening the layout rules in the SRAM array through cell waivers. In the 130-nm node, a 16% shrinkage in the SRAM cell area was achieved using selective layout rule waivers. This shrinkage comes at the cost of a modest performance loss, as explained below.

Figure 14
SEM cross sections of the logic devices showing the shallow-trench isolation (STI), completed device with CoSi₂ self-aligned silicide, contact metallurgy (CA), and metal 1 (M1). The inset shows the spacer definition. The spacer architecture determines the placement of the source and drain extensions and, together with the polysilicon gate length, post-spacer thermal cycles, and stress generated from the isolation regions, determines the device characteristics.
Cell features
The dense SRAM cell used is a single-port six-transistor SRAM cell. The cell consists of two n-FET pass-gate devices, two n-FET pull-down devices, and two p-FET load (or pull-up) devices. The dense SRAM cell size is 2.04 µm². This is a 16% shrinkage of the standard SRAM cell (2.47 µm²) offered in the base technology. Table 1 shows a comparison of several key dimensions for both the 2.47-µm² and 2.04-µm² cells. The bitline connections to the cell are made through metal 2. The wordlines are continuous polycrystalline conductor (PC) lines but, for performance reasons, they are strapped at metal 3 (M3) to reduce the wordline resistance.
Array features
The dense SRAM arrays used in the Blue Gene/L chip range in size from 256 × 72 (words × bits) to 2048 × 78. They are all designed with wordline redundancy, 4:1 decode, and one subarray. The access time and cycle time depend on the array size, process corner, operating temperature, and operating voltage. For the nominal process, 65°C and 1.5 V, the access and cycle times for the largest and smallest arrays are shown in Table 2. Within Blue Gene/L, these single-port dense SRAM arrays are used primarily in the L3 directory and network interface circuits.
Device design considerations
For SRAM cells, three parameters are often considered when comparing cells: cell performance, cell size, and cell leakage. The dense SRAM cell was designed to achieve the maximum performance possible for the smallest cell size. Table 3 shows a comparison of the device sizes used in the standard and dense SRAM cells. From Table 3, it can be seen that to achieve the smaller cell size, the width of all of the transistors had to be reduced. However, because the cell size was reduced, the transient delay due to the resistance and capacitance of the circuit across the cell was also lowered, so some of the performance loss due to the smaller device widths is regained. Because of reliability considerations, no attempt was made to redesign the devices to achieve either a shorter channel length or a narrower channel width than what was already supported in the technology. No attempt was made to reduce the leakage characteristics of the dense SRAM cell below what was already achievable through the base technology. To achieve the optimized cell size, in addition to the previously mentioned device width reductions, aggressive layout rule shrinks [22] were required. These layout rule shrinks were applied to the following levels: active diffusion area, gate area, contacts, metal 1, and via 1. No layout rule shrinks were required for metal 2 and above. The final dense SRAM design (up to metal 1) is shown in Figure 15(a). The custom optical proximity correction (OPC) shapes for the active area, gate conductor, and metal 1 levels are shown in Figure 15(b) and the on-wafer SEM images in Figures 15(c) and 15(d).
Future considerations
The multiprocessor core hierarchical memory architecture embodied in the specialized Blue Gene/L processor is in line with the current trend for more general-purpose processors. Processor chips will continue to add multiple independent cores, and the local caches used by these cores will require ever-larger amounts of high-bandwidth embedded memory. The use of SRAM memory to realize these chips is being limited by available chip size, power constraints, and time-of-flight performance considerations. Furthermore, SRAM cell scaling is causing diminishing cell functionality margins because of reduced operating voltage and variations in device matching caused by statistical fluctuations in the number of dopant atoms contained in the device channel [23]. If embedded DRAM memory can continue to improve performance and maintain its large density advantage over SRAM, it can play a critical role in future processor systems.
The basis of embedded DRAM is a one-transistor, one-capacitor cell. It is important to be able to scale both the transistor and the capacitor in area while maintaining or improving the inherent performance of the cell. Let us examine some of the challenges we face if we are to continue scaling embedded DRAM cells and circuits.

Table footnote: These design dimensions have been adjusted to account for optical proximity correction and on-wafer printing. The final on-wafer dimensions will be as if they were designed at ground-rule minimum.
The capacitor
The heart of a DRAM cell is its capacitor. Whether the capacitor is formed as a deep trench or a stacked type, most of the challenge in scaling a DRAM to the next technology node involves achieving the same high capacitance, but in a smaller space. This leads to aggressive scaling of the aspect ratio of the fabricated capacitor structures and to reducing the equivalent thickness of the capacitor insulator through physical thinning or increasing the dielectric constant of the material. The basic operation of reading a DRAM cell involves turning on the array device and allowing the charge stored on the capacitor on one side of the device to pass onto a metal wire connected to the other side of the device. This metal line, or bitline, is shared with as many as 500 other nonselected cells. The voltage generated on the bitline is detected or sensed by circuitry at the end of the bitline. A key metric in DRAM design is the amount of voltage or signal generated when a data cell is read by dumping its stored charge onto its bitline. Since the charge initially stored in the cell capacitor equilibrates with the bitline capacitance, the voltage ultimately created is given by
$$V_{\mathrm{signal}} = V_{\mathrm{cell}} \, \frac{C_{\mathrm{cell}}}{C_{\mathrm{cell}} + C_{\mathrm{bitline}}},$$
where $V_{\mathrm{cell}}$ is the voltage initially stored in the cell capacitor, $C_{\mathrm{cell}}$ is the capacitance of the cell capacitor, and $C_{\mathrm{bitline}}$ is the capacitance of the bitline. The quotient of the capacitance terms in the preceding equation is known as the transfer ratio and is typically of the order of 0.2. As technology shrinks, the capacitance of the bitline wire typically remains relatively constant, since the wire thickness is not scaled and the capacitance reduction due to the shrinkage in the length of the line is offset by the capacitance increase due to the reduced spacing between the lines. As long as the number of cells per bitline remains constant, the capacitance of the cell capacitor must remain constant to maintain the transfer ratio and the signal available for sensing cell data correctly. Unfortunately, as technology scales, the devices that detect the bitline signal suffer from the same matching degradation as was mentioned for the SRAM cell above, and therefore require high levels of signal to detect the correct data state of the cell.

Since simple dimensional scaling causes the capacitance of a cell to decrease, maintaining a constant cell capacitance from generation to generation is possible only with continuous improvements in fabrication technology or increasing permittivity of the dielectric materials. Since new invention is not always assured, it is desirable to explore options that allow capacitance to be reduced. The obvious option is to reduce the number of bits per bitline. This reduces the capacitance on the bitline and allows a reduction of the cell capacitance while maintaining a high signal transfer ratio. This solution, unfortunately, degrades the density of the memory system by causing the fixed area overhead of the sensing circuitry at the end of the bitline to be amortized over fewer memory cells.
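The charge-sharing relation and the effect of shortening the bitline can be made concrete with a small calculation. The Python sketch below is illustrative only: the function names, the 30-fF cell, the 120-fF bitline for 512 cells, the 1.5-V stored level, and the assumption that bitline capacitance scales linearly with the number of attached cells are all hypothetical choices, picked so that the baseline reproduces the transfer ratio of about 0.2 mentioned in the text.

```python
def transfer_ratio(c_cell_ff: float, c_bitline_ff: float) -> float:
    """Transfer ratio C_cell / (C_cell + C_bitline)."""
    if c_cell_ff <= 0 or c_bitline_ff <= 0:
        raise ValueError("capacitances must be positive")
    return c_cell_ff / (c_cell_ff + c_bitline_ff)


def bitline_signal_v(v_cell: float, c_cell_ff: float, c_bitline_ff: float) -> float:
    """Voltage developed after the cell charge equilibrates with the bitline."""
    return v_cell * transfer_ratio(c_cell_ff, c_bitline_ff)


def cell_capacitance_for_ratio(ratio: float, c_bitline_ff: float) -> float:
    """Cell capacitance needed for a target ratio r: C_cell = r * C_bitline / (1 - r)."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return ratio * c_bitline_ff / (1.0 - ratio)


if __name__ == "__main__":
    # Hypothetical baseline values, not taken from the paper.
    V_CELL = 1.5                # volts stored in the cell
    C_CELL_FF = 30.0            # cell capacitance, fF
    C_BITLINE_512_FF = 120.0    # bitline capacitance with 512 cells attached, fF

    print("baseline ratio:", transfer_ratio(C_CELL_FF, C_BITLINE_512_FF))
    print("baseline signal (V):", bitline_signal_v(V_CELL, C_CELL_FF, C_BITLINE_512_FF))

    # Assumption: bitline capacitance is proportional to the number of cells.
    for cells in (512, 256, 128, 64):
        c_bl = C_BITLINE_512_FF * cells / 512
        c_needed = cell_capacitance_for_ratio(0.2, c_bl)
        print(f"{cells:4d} cells/bitline: C_bitline = {c_bl:6.1f} fF, "
              f"C_cell for ratio 0.2 = {c_needed:5.2f} fF")
```

Under these assumptions, the cell capacitance required to hold the transfer ratio constant falls in direct proportion to the bitline length, which is the lever the paragraph above describes.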
The key to enabling fewer cells per bitline is to have a very efficient design of the array sensing and control circuitry. With proper design, area-efficient designs using 64 or fewer cells per bitline are feasible.
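The density penalty of short bitlines, and why compact sensing circuitry offsets it, can be sketched with a simple one-dimensional area model. In the Python sketch below, the function name, the 0.3-µm cell pitch, and the 10-µm and 5-µm sense-circuit overheads are hypothetical values chosen for illustration; the model only considers length along the bitline direction.

```python
def array_efficiency(cells_per_bitline: int,
                     cell_pitch_um: float,
                     sense_overhead_um: float) -> float:
    """Fraction of bitline-direction length occupied by storage cells.

    Assumption for illustration: each bitline segment carries a fixed
    length of sensing/control circuitry, independent of the cell count.
    """
    if cells_per_bitline <= 0 or cell_pitch_um <= 0 or sense_overhead_um < 0:
        raise ValueError("invalid geometry")
    cell_length = cells_per_bitline * cell_pitch_um
    return cell_length / (cell_length + sense_overhead_um)


if __name__ == "__main__":
    CELL_PITCH_UM = 0.3  # hypothetical cell pitch along the bitline
    for overhead_um in (10.0, 5.0):  # hypothetical sense-circuit heights
        for cells in (512, 256, 128, 64):
            eff = array_efficiency(cells, CELL_PITCH_UM, overhead_um)
            print(f"overhead {overhead_um:4.1f} um, {cells:3d} cells/bitline: "
                  f"efficiency = {eff:.2f}")
```

In this model, the fixed overhead weighs more heavily as the bitline shortens, so halving the overhead recovers a larger share of the lost efficiency at 64 cells per bitline than at 512.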
Reducing cell and bitline capacitance has additional performance benefits. A lower-capacitance bitline is faster to switch, improving array speed. Similarly, a lower-capacitance cell charges more quickly, enabling faster write time when switching the cell to its opposite data polarity state. This is shown in Figure 16. It should be noted that reducing cell capacitance too far will limit the data retention time and increase the susceptibility of the cell to SER caused by energetic background radiation, as discussed above.
The transistor
Because CMOS logic device technology has been aggressively scaled to create the fastest possible transistor switching, we have now arrived at a point at which the channel length of the conventional DRAM array transistor is significantly longer than the minimum used in the logic devices, and is limiting the continued area scaling of the array cell.
DRAM array transistors must maintain off currents in the range of 10⁻¹⁴ A, which allows data to be retained for the several-millisecond interval required between cell refreshes. (In comparison, logic device leakage ranges from 10⁻¹² A to 10⁻⁷ A per device.) A gate voltage swing of approximately 3 V is necessary to achieve both the low off current and the high gate overdrive that enables writing a high voltage to the storage node. This high operating voltage requires the use of a gate oxide of approximately 5 nm to 6 nm compared with the 1 nm to 1.5 nm seen in the state-of-the-art logic devices of today. In addition, the low-leakage requirement necessitates careful grading of the doping profiles of the junctions used in the source and drain of the device in order to limit the electric fields that increase leakages in the junctions. As a result, such DRAM array devices have reached their minimum channel length limit. Further scaling of the cell area causes the width and available drive current of these devices to be significantly reduced. Such reduced drive current becomes a limitation in the read and write performance for the cell.
If it were possible to relax the leakage constraints for the embedded DRAM by significantly decreasing the interval between cell refreshes, the techniques of gate oxide reduction and device halo implantation used for logic device scaling could be employed to drive continued reduction of the embedded DRAM array device area. During a cell refresh cycle, the DRAM memory is unavailable to the system, and just increasing the rate of refresh for the DRAM would reduce this availability unacceptably. However, innovative circuit designs have enabled reduction of the retention interval to less than 100 µs while maintaining 99% memory availability to the system [24]. With data retention in the tens of µs range, the off-current requirements can be relaxed to several pA per cell. Such off currents are achieved today by the low-power logic FET designs using a gate oxide of 2 nm to 2.5 nm and a gate voltage swing of 1.5 V (Figure 17).
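The link between retention interval and allowable off current can be checked with a back-of-the-envelope scaling. The Python sketch below assumes that the charge a cell may lose between refreshes is fixed, so the tolerable leakage scales inversely with the retention interval. The function name and the 5-ms reference interval (standing in for "several milliseconds") are hypothetical; the 10⁻¹⁴-A reference current comes from the text.

```python
def relaxed_off_current_a(i_off_ref_a: float,
                          t_ret_ref_s: float,
                          t_ret_new_s: float) -> float:
    """Allowable off current at a new retention interval.

    Assumption for illustration: the tolerable charge loss per refresh
    interval (I_off * t_retention) is held constant.
    """
    if i_off_ref_a <= 0 or t_ret_ref_s <= 0 or t_ret_new_s <= 0:
        raise ValueError("all inputs must be positive")
    return i_off_ref_a * t_ret_ref_s / t_ret_new_s


if __name__ == "__main__":
    I_OFF_REF_A = 1e-14   # reference off current from the text
    T_RET_REF_S = 5e-3    # hypothetical "several-millisecond" retention interval
    for t_new_us in (100.0, 64.0, 20.0):
        i_new = relaxed_off_current_a(I_OFF_REF_A, T_RET_REF_S, t_new_us * 1e-6)
        print(f"retention {t_new_us:5.1f} us -> allowable off current "
              f"{i_new * 1e12:.2f} pA")
```

With these assumed inputs, retention intervals of a few tens of microseconds land in the picoampere range, consistent in order of magnitude with the relaxation described above.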
The use of aggressive oxide thickness scaling can enable shrinking of the array device length and width while still
Figure 16
Depending on the performance target, the choice of node capacitance and bitline (BL) length may optimize differently.

The switching speeds of the circuits driving the DRAM gate also improve as their oxide thickness and voltage swing decrease. In a DRAM memory system, the DRAM gate voltage is typically generated internally with a system of charge pumps. Reducing the level of the generated voltage significantly improves the efficiency and the area required for these pump circuits.
The leakage current through the gate of the device will, however, ultimately limit the data retention for the array cell. For the 64-µs retention, this limit will be approximately 1.8 nm to 2.0 nm for conventional oxide gate dielectrics. These same device considerations can be applied to adapt the silicon-on-insulator devices used in the highest-performance logic technologies for use as an array device [25].
Conclusion
We have described the embedded DRAM platform that provides the basis for the Blue Gene/L SoC. The focus of this technology platform is to deliver high-density, high-bandwidth dynamic memory with true high-performance logic and SRAM technology. This second-generation embedded DRAM technology provides the basis for advanced cache chips for IBM eServer applications as well. It has also found widespread use in network applications and digital signal processors [26]. This technology does not differ from our standard logic technology, which is a subset of the embedded DRAM platform, in either performance or reliability. Embedded DRAM technology has been shown to be an excellent and cost-effective solution.
Going forward, using the capacitor and device scaling approaches outlined here with continued innovation in circuit techniques, it is anticipated that embedded DRAM memory systems can achieve cycle times of <2.5 ns and data rates of >1.5 GHz, while still maintaining a 3× to 4× density advantage over an SRAM solution. This level of performance and density will be ideal for providing embedded data storage for the next generation of multicore processor designs.
Figure 17
Comparison of drain current for 52-Å DRAM-type device and low-power logic 22-Å device. Measurements taken at $V_{ds}$ = 1.2 V for both cases. $V_{gs}$ = 2.5 V for the 52-Å device and 1.5 V for the 22-Å device.
