Abstract-While the performance of flash memory exceeds that of hard disk drives in almost every category, the cost of flash memory must come down for it to gain wider acceptance in mass storage applications. This paper describes a 3.3 V only 32 Mb NAND flash memory that achieves not only high performance but also low cost with a 94.9 mm² die size, improved yields, and a simple process with 0.5 µm CMOS technology. Die size is reduced by eliminating high voltage operation on the bitlines through a self boosted program inhibit voltage generation scheme. Incremental-step-pulse programming results in a 2.3 MB/s program data rate as well as improved process variation tolerance. Interleaved data paths and a boosted wordline result in a 25 ns burst cycle time and a 24 MB/s read data rate. Maximum operating current is less than 8 mA.
I. INTRODUCTION
Since first being introduced, the capabilities of flash memory have improved dramatically. Device density, performance, and endurance have all seen orders of magnitude improvement. When compared to magnetic media based hard disk drives (HDDs), flash memory offers higher performance, lower power consumption, increased reliability, and greater portability. However, the acceptance of flash memory in the solid state mass storage market has not yet met expectations, mainly due to its cost. Even when compared to the "expensive" PC Card type HDDs, flash memory cards cost over 10 times as much. This high cost of flash, and not density limitations, is the biggest obstacle to the acceptance of high capacity flash memory cards.
The NAND type flash memory [1], [2] was originally developed to target solid state mass storage applications. Toward this end, NAND offers a small cell size, low power consumption, and fast page based read/program operations. In addition, the small erase block size and fast erase times of NAND make it a very manageable memory. This 32 Mb NAND flash memory further enhances the key features of the NAND architecture. The new device not only improves on performance features but also reduces cost through a small die size and yield improvement techniques.
To inhibit programming of selected cells in previous NAND flash memories, a high voltage is supplied directly through the bitlines of program inhibited cells. The high voltage is passed to the NAND string channel to prevent Fowler-Nordheim tunneling from occurring. However, a large capacity charge pump is required to set the highly capacitive bitlines to the inhibit voltage. This charge pump occupies a large silicon area and increases both program time and current consumption. To avoid having to supply high voltages through the bitlines, this flash memory utilizes a self-boosting scheme to generate the required program inhibit voltage on the NAND string channel. With the self-boosting scheme, the maximum voltage on the bitlines is Vcc. Low voltage only bitlines not only eliminate the large charge pump area but also reduce the cell array size, since a tighter bitline pitch is allowed with the less stringent bitline isolation requirements.
Manuscript revised July 28, 1995. The authors are with Samsung Electronics Co., Ltd., Kyungki-Do, Korea. IEEE Log Number 9415231.
The page buffer scheme in this flash memory allows the programmed cell Vt to be optimized on a cell-by-cell basis even though a page of cells is programmed simultaneously. One drawback, however, is that the page program speed is determined by the slowest programmed cell within the page. While simply using a higher program voltage would result in a faster program time, easily programmed cells might be overprogrammed. In this device, incremental step pulse programming (ISPP) is introduced to dynamically optimize the program voltage according to cell characteristics on a cell-by-cell basis. With ISPP, a typical 2.3 MB/s program performance is achieved while still maintaining a very tight programmed cell threshold voltage distribution. In addition, excellent tolerance to process and environmental variations is obtained, which improves the yield of the device.
A 24 MB/s read throughput is achieved with interleaved data paths and boosted wordlines. The device is fabricated on a 0.5 µm CMOS process with a die size of 94.9 mm². The device organization is shown in Section II. The page buffer and basic device operation are explained in Section III. In Section IV, self boosted program inhibit voltage generation and its benefits are described. The details of ISPP are presented in Section V. Performance measurements of a fabricated device are summarized in Section VI.
II. DEVICE ORGANIZATION
To minimize the die size, the 34 603 008 cells are organized [...] is typically used to store system data and/or ECC data. If not used, access to the spare data area can be disabled altogether to make the size of each row exactly 4 kb. Each cell in the memory array is part of a unit NAND string which consists of 16 cells and two select transistors (controlled by SSL and GSL) as shown in Fig. [...] decoders where the right block decoder is activated by the left one. This split block decoder design allows a 0.95 µm wordline pitch. To accommodate the 1.4 µm bitline pitch, page buffers are split into top and bottom banks. The page buffers in the top bank are connected to odd numbered bitlines while those in the bottom bank are connected to even numbered bitlines.
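The cell count quoted above can be cross-checked against other figures given in the paper. The following is a minimal sketch; it assumes each of the 512 blocks holds one 16-cell string per bitline (so 16 rows per block) and that a full row equals the 528-byte page mentioned in Section VI. The variable names are illustrative only.

```python
# Cross-check of the array organization from numbers quoted in the text.
BLOCKS = 512               # erasable blocks (Section III)
CELLS_PER_STRING = 16      # assumption: one string per bitline per block -> 16 rows per block
PAGE_BYTES = 528           # bytes transferred per page (Section VI)
MAIN_BITS_PER_ROW = 4096   # "exactly 4 kb" when the spare area is disabled

rows = BLOCKS * CELLS_PER_STRING
bits_per_row = PAGE_BYTES * 8
spare_bits_per_row = bits_per_row - MAIN_BITS_PER_ROW
total_cells = rows * bits_per_row

print("rows:", rows)
print("bits per row:", bits_per_row)
print("spare bits per row:", spare_bits_per_row)
print("total cells:", total_cells)
```

Under these assumptions the product reproduces the 34 603 008 cells stated in the text, with 128 spare bits (16 bytes) per row.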
III. DEVICE OPERATION
In order to perform page based read and program operations, a page buffer is attached to each bitline. The key components of the page buffer are a latch and a sense transistor as shown in Fig. 1. Though simple in design, the page buffer serves several important functions. First, it senses and latches cell data during read operations. Second, it holds program status data during program operations. Third, it controls cell-by-cell program optimization during program verify operations. Bias conditions for ERASE, READ, and PROGRAM operations are shown in Fig. 3.
ERASE operations can be performed in single or multiple block units, where up to all 512 blocks can be erased simultaneously. With lines CG0-CG15 of Fig. 1 grounded, the BSEL of selected block(s) is set high while those of unselected blocks are set low. This grounds the wordlines of selected blocks and floats the wordlines of unselected blocks. A 21 V, 3.5 ms erase pulse is then applied to the bulk. In the selected blocks, the erase voltage creates a large (21 V) potential difference between the bulk and the control gates. This causes F-N tunneling of electrons off the floating gate and into the bulk, resulting in a typical cell threshold voltage of -3 V. Since overerasure is not a concern in NAND flash, cells are deliberately overerased to -3 V to ensure that only a single erase pulse is required. Also, the low erased cell threshold voltage provides additional margin against upward threshold voltage shifts that arise from cycling. Unselected blocks are not affected by the erase pulse because their floating control gates are capacitively coupled to the bulk. The floating control gate is composed of the source side of the BSEL transistor, a metal connection from the source to the poly wordline, and the poly wordline. The coupling ratio can be calculated by considering the capacitances connected to the floating wordline. These capacitances include source-junction capacitance, source and gate overlap capacitance, poly and metal field capacitance, and poly wordline capacitance over the (pocket p-well) bulk. Of these, the capacitance between the poly wordline and the bulk (which is responsible for the coupling to the erase voltage) is two orders of magnitude larger than the sum of the rest. The coupling ratio computes to over 98%, more than sufficient to prevent F-N tunneling from occurring. An erase verify operation is performed on each of the selected blocks to ensure that the threshold voltage of all cells in the blocks is below -1 V. Verification of all cells in a block is performed in parallel through a single read operation which only requires 7.5 µs.
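The coupling argument for unselected blocks can be illustrated with a simple capacitive divider. This is only a sketch: the capacitance values are hypothetical (arbitrary units, chosen so that the wordline-to-bulk capacitance is 100 times the sum of the parasitics, matching "two orders of magnitude"), and the model ignores any initial charge on the floating wordline.

```python
def coupling_ratio(c_wordline_bulk, c_parasitic_total):
    """Fraction of the bulk erase voltage coupled onto a floating wordline."""
    return c_wordline_bulk / (c_wordline_bulk + c_parasitic_total)

V_ERASE = 21.0  # erase pulse amplitude quoted in the text (V)

# Hypothetical capacitances in arbitrary units (not taken from the paper).
ratio = coupling_ratio(c_wordline_bulk=100.0, c_parasitic_total=1.0)
v_floating_wordline = ratio * V_ERASE
v_bulk_to_gate = V_ERASE - v_floating_wordline

print(f"coupling ratio: {ratio:.3f}")
print(f"floating wordline rises to about {v_floating_wordline:.2f} V")
print(f"bulk-to-control-gate difference: {v_bulk_to_gate:.2f} V")
```

With a ratio above 98%, the potential difference across an unselected cell stays well under 1 V in this model, compared with the full 21 V seen by cells in selected blocks.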
In READ operations, a page of cell data is simultaneously transferred to the page buffer latches and then read out in a sequential burst. To sense a row of cells, the page buffer latches are first initialized to "0" (logical low value), the bitlines are discharged to 0 V, and the SSL and GSL lines are raised to 4.5 V as shown in period t1 of Fig. 4. In period t2, 0 V is applied to the selected wordline and a 4.5 V pass voltage is applied to the unselected wordlines. Since the 4.5 V on unselected wordlines is higher than the threshold voltage of both programmed and erased cells, all unselected cells act as pass transistors. On the other hand, the 0 V selected wordline will only turn on erased cells. This causes unit NAND strings with an erased selected cell to form a path to ground and those with a programmed selected cell to be open. In period t3, the direct sensing path from bitline to latch is disabled by setting PGM of Fig. 1 low so that the latch value can only be changed through the SENSE transistor. A rising Vref enables the PMOS current mirror load, which supplies a 2 µA load current to each bitline. Bitlines associated with erased selected cells sink the load current and remain low, while programmed cell bitlines go to a high potential. The high potential on a programmed cell bitline turns on the associated SENSE transistor and flips the latch to a "1" in period t4. Thus, programmed cell latches hold "1" and erased cell latches continue to hold the initial "0" value. These latch values are inverted later in the read path to read as the proper logical levels. Since all latches in a page are set simultaneously, after period t4 the latch data can be read out in a sequential burst cycle. [...]
2) Program (20 µs): A short pulse of the program voltage is applied to the selected wordline.
3) Wordline discharge (4 µs): The high voltage on the selected wordline is discharged so that a low verify voltage can be applied in step 4 below.
4) Program verify (8 µs): The threshold voltage of each programmed cell is checked to see if it is above the target level.
Further details of the program step are presented in conjunction with the self boosted generation of the program inhibit voltage in Section IV. In the verification step, the latches of cells that are sufficiently programmed are switched from "0" to "1" to inhibit further programming. Bias conditions for the verify operation are similar to those of the read operation except that the latches hold program status data (they are not initialized to all "0"s) and 0.7 V, instead of 0 V, is applied to the selected wordline. Under these conditions, a latch value is switched from "0" to "1" when the threshold voltage of the associated cell is over 0.7 V; that is, the cell is sufficiently programmed. "1" value latches are not affected since latches can only be flipped from "0" to "1" in the verify operation. Program cycles are repeated until all page buffer latches hold a "1" or the program operation timeout of 10 cycles has been exceeded.
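The read and verify sensing described above share one mechanism: a latch can only flip from "0" to "1", and it does so when the NAND string fails to conduct. The following is a behavioral sketch only, not the circuit itself; the function names are hypothetical, and each cell is reduced to a single threshold voltage that is compared against the wordline voltage.

```python
V_PASS_READ = 4.5   # volts on unselected wordlines (from the text)
V_READ = 0.0        # selected wordline voltage during read
V_VERIFY = 0.7      # selected wordline voltage during program verify

def string_conducts(selected_vt, v_selected_wl, other_vts=()):
    """A NAND string pulls its bitline low only if every cell in it is on."""
    selected_on = selected_vt < v_selected_wl
    others_on = all(vt < V_PASS_READ for vt in other_vts)
    return selected_on and others_on

def sense(latch, selected_vt, v_selected_wl, other_vts=()):
    """One-way latch update: a non-conducting string lets the bitline rise,
    which flips the latch to 1. A latch that already holds 1 is unchanged."""
    if not string_conducts(selected_vt, v_selected_wl, other_vts):
        return 1
    return latch

def read_page(cell_vts):
    # Latches start at 0; programmed cells end at 1 (inverted later on read-out).
    return [sense(0, vt, V_READ) for vt in cell_vts]

def verify_page(latches, cell_vts):
    # Latches keep their program status and are sensed at the 0.7 V verify level.
    return [sense(latch, vt, V_VERIFY) for latch, vt in zip(latches, cell_vts)]

print(read_page([-3.0, 1.2]))                      # erased cell, programmed cell
print(verify_page([0, 0, 1], [0.3, 0.9, -3.0]))    # under-programmed, done, inhibited
```

In the verify example, the first cell stays at "0" and receives another program pulse, the second flips to "1", and the third (an erased, program-inhibited cell) keeps its "1" even though its string conducts.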
IV. SELF BOOSTED PROGRAM INHIBIT VOLTAGE
In previous NAND flash memories, a high program inhibit voltage (e.g., 8 V in [3]) was supplied to the NAND string channel directly through the bitlines. However, there are several disadvantages with this method: 1) A large capacity charge pump is required to supply the high voltage on the highly capacitive bitlines. This charge pump occupies a large silicon area. 2) Time and extra current are required to set up the bitline to a high voltage. 3) Further scaling down of the memory is burdened by the high voltage bitline isolation requirements. 4) The page buffer size is increased due to the high voltage input path and increased transistor size to handle high voltages.
The bias conditions for supplying program inhibit voltages to the channel of selected cells are shown in Fig. 5(a). With the SSL transistors turned on and the GSL transistors turned off, the bitline voltages for cells to be programmed are set to 0 V, while the bitline voltages for cells to be program [...]
$$V_{ch} = \frac{C_{ins}}{C_{ins} + C_{channel}}\, V_{wl}, \qquad C_{ins} = \frac{C_{ono}\, C_{tunnel}}{C_{ono} + C_{tunnel}}$$
where $C_{ins}$ is the total capacitance between control gate and channel ($C_{ono}$ in series with $C_{tunnel}$).
In program inhibited strings, as the coupled channel voltage rises to Vcc - Vt (of the SSL transistor), the SSL transistor shuts off (Fig. 4(a)) and the channel becomes a floating node. Through the self-boosted generation of program inhibit voltages, program cycle operating current is reduced by 40% to 4.3 mA. Also, the bitline setup time within each 40 µs program cycle is less than 8 µs, saving approximately 20 µs in bitline precharge time. The effectiveness of the self-boosting scheme is evident in the fact that all 512-byte cells within a page can be programmed on a byte-by-byte basis without program interference. In Fig. 6, it can be seen that self-boosting can effectively maintain program inhibit voltages for program pulses longer than 10 ms when Vpass is over 9 V. This is much longer than the 20 µs required in this device. Also, while the effectiveness of self-boosting in generating the inhibit voltage increases with the pass voltage, when the pass voltage is too high, unselected cells will start to get programmed by the pass voltage itself (Vpass Disturb in Fig. 6).
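The capacitive coupling behind self-boosting can be sketched numerically. The example assumes the coupling relation reads Vch = Cins / (Cins + Cchannel) × Vwl, with Cins being the ONO and tunnel oxide capacitances in series; the capacitance values and the 10 V wordline voltage below are hypothetical, chosen only to give round numbers, and the function names are illustrative.

```python
def series(c1, c2):
    """Equivalent capacitance of two capacitors in series."""
    return c1 * c2 / (c1 + c2)

def boosted_channel_voltage(v_wl, c_ono, c_tunnel, c_channel):
    """Channel voltage coupled up from the wordline through C_ins."""
    c_ins = series(c_ono, c_tunnel)
    return v_wl * c_ins / (c_ins + c_channel)

# Hypothetical values in arbitrary capacitance units (not from the paper).
v_ch = boosted_channel_voltage(v_wl=10.0, c_ono=1.0, c_tunnel=1.0, c_channel=0.125)
print(f"boosted channel voltage: {v_ch:.2f} V")
```

The sketch shows why a small channel capacitance relative to Cins matters: the closer the ratio is to 1, the more of the wordline voltage reaches the floating channel, and the smaller the field across the tunnel oxide of an inhibited cell.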
V. INCREMENTAL STEP PULSE PROGRAMMING
The intelligent page buffer described in Section III allows cell-by-cell Vt optimization even though a page of cells is programmed simultaneously. However, an unavoidable side effect of this cell-by-cell optimization is that program speed is determined by the slowest programmed cell within the page. Cell program times can vary widely due to nonuniformity in the process (Tox, coupling ratio) or changes in the environment (Vcc, temperature). High program speeds cannot simply be achieved by increasing the program voltage since this can result in overprogramming problems that will affect the read and verify operations. [...] Fig. 9 shows a slightly faster programming time than the device with ISPP, it can be seen in Fig. 8 that the fast programming device also has a very wide Vt distribution. Similarly, while the constant 16.5 V program voltage device of Fig. 8 has the tightest Vt distribution, it can be seen in Fig. 9 that the tight distribution is obtained at the expense of program speed. ISPP provides an optimum combination of both a tight Vt distribution and a fast program time.
Generally, cells tend to be programmed more easily at higher temperatures. In Fig. 10, it can be seen that the device with ISPP is very resistant to temperature dependent variations. The starting voltage in the ISPP scheme is deliberately set low so that additional margin is obtained against variations (e.g., temperature) that cause cells to program more easily. ISPP is also effective under conditions where cells become difficult to program, since the incrementing program voltage automatically adjusts to these cells.
By effectively adjusting to process and environment variations, ISPP maintains consistent program performance, which helps improve the yield of the device. Marginal cells that were previously out-of-spec when conditions were varied are brought within-spec with ISPP. While adjustments to a reference cell have been reported to compensate for die-by-die or sector based process variations [5], ISPP is able to compensate for cell-by-cell variations that can exist within a die.
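The trade-off between a single high program voltage and ISPP can be illustrated with a toy model. This is a sketch under loud assumptions, not the device physics: each cell is given a hypothetical offset, and one pulse at v_pgm is assumed to raise its threshold to at most v_pgm minus that offset. The starting voltage, step size, and offsets are illustrative values not taken from this text; only the -3 V erased level, the 0.7 V verify level, and the 10-cycle timeout come from the paper. The saturating model does not capture the slow programming seen at a low constant voltage.

```python
V_VERIFY = 0.7     # verify level from the text (V)
V_ERASED = -3.0    # typical erased threshold from the text (V)
MAX_CYCLES = 10    # program timeout from the text

def program_page(offsets, v_start, v_step):
    """Program/verify loop; returns (final thresholds, cycles used).

    offsets: hypothetical per-cell constants (toy model, see lead-in).
    v_step = 0 models a constant program voltage.
    """
    vts = [V_ERASED] * len(offsets)
    latches = [0] * len(offsets)          # 0 = still needs programming
    for cycle in range(1, MAX_CYCLES + 1):
        v_pgm = v_start + (cycle - 1) * v_step
        for i, offset in enumerate(offsets):
            if latches[i] == 0:           # latched cells are program inhibited
                vts[i] = max(vts[i], v_pgm - offset)
        for i, vt in enumerate(vts):      # verify: latches only flip 0 -> 1
            if vt >= V_VERIFY:
                latches[i] = 1
        if all(latches):
            return vts, cycle
    return vts, MAX_CYCLES

offsets = [14.0, 15.25, 16.0]             # fast, typical, slow cell (hypothetical)
ispp_vts, ispp_cycles = program_page(offsets, v_start=15.0, v_step=0.5)
const_vts, const_cycles = program_page(offsets, v_start=17.0, v_step=0.0)

print("ISPP:    ", ispp_vts, "cycles:", ispp_cycles)
print("constant:", const_vts, "cycles:", const_cycles)
```

In this model the high constant voltage finishes in one cycle but leaves the fast cell far above the verify level, while ISPP needs more cycles and keeps every cell within one step of the verify level, since each cell is inhibited as soon as it passes verification.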
VI. DEVICE PERFORMANCE
Cell sensing speed in read and verify operations is dependent on the current driving capability of the cells. In this device, pass wordlines are pumped to 4.5 V to increase cell current from 2 to 4 µA as shown in Fig. 11(a). This larger cell current allows a larger current load (current mirror in Fig. 1) to be used, which reduces bitline charge-up time by 37% as shown in Fig. 11(b). Through wordline boosting, 528 bytes of cell data are transferred to the page buffers in 7.5 µs. A shmoo plot of the transfer time (tR) is shown in Fig. 13(a).
To offer read and write burst cycle times of 25 ns, the data paths between the top and bottom page buffers are interleaved as shown in Fig. 12. Dual stage pipelining and prefetching of page buffer data result in a very short data latency (relative to RE) of 15 ns. A shmoo plot of the burst read cycle time (tRC) is shown in Fig. 13(b). The total time to read out a page is 7.5 µs + (528 × 25 ns) = 20.7 µs. [...] a general upward shift of cell Vt. In Fig. 14(a), it can be seen that the shift is more pronounced in erased cells. However, since the device is designed with sufficient margin against Vt shifts (cells only need to be below -1 V to verify correctly), the electron trapping does not affect device performance until well over 10⁶ P/E cycles. In program operations, as the number of P/E cycles approaches 10⁶, the higher Vt of the erased cells helps program cells more easily. This actually results in a shorter program time as shown in Fig. 14(b).
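The page read-out time above follows directly from the transfer time and the burst cycle time. The short sketch below repeats that arithmetic and derives a sustained data rate from it; counting only the 512 main bytes of each page toward throughput is an assumption made here for comparison with the 24 MB/s figure, and the names are illustrative.

```python
T_TRANSFER_US = 7.5    # cell array to page buffer transfer time (us)
T_BURST_NS = 25        # burst read cycle time (ns)
PAGE_BYTES = 528       # full page including the spare area
MAIN_BYTES = 512       # assumption: main data bytes per page, spare area excluded

def page_read_time_us(n_bytes=PAGE_BYTES):
    """Transfer time plus one burst cycle per byte read out."""
    return T_TRANSFER_US + n_bytes * T_BURST_NS / 1000.0

t_page = page_read_time_us()
full_rate = PAGE_BYTES / t_page    # bytes per microsecond equals MB/s (10^6 bytes/s)
main_rate = MAIN_BYTES / t_page

print(f"page read time: {t_page:.1f} us")
print(f"sustained rate, all 528 bytes: {full_rate:.1f} MB/s")
print(f"sustained rate, 512 main bytes: {main_rate:.1f} MB/s")
```

The burst portion (13.2 µs) dominates the page time, so the 25 ns cycle made possible by the interleaved data paths matters more for sustained throughput than further reductions of the 7.5 µs transfer time.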
VII. SUMMARY
A 32 Mb NAND flash memory configured as 4 M × 8 that operates on a single 3.3 V supply has been successfully developed. Die size has been minimized with a single array architecture and a self-boosted program inhibit voltage generation scheme that reduces charge pump area and allows a tight bitline pitch. ISPP has been shown not only to improve program performance but also to improve the yield of the device. In addition, the NAND flash memory achieves high levels of serial access performance through the use of wordline boosting and interleaved data paths. Fig. 15 shows a micrograph of the 32 Mb NAND flash memory chip. Key device characteristics and parameters are summarized in Table I. The chip has been implemented with a 0.5 µm design rule, resulting in a die size of 94.9 mm² and an effective cell size of 1.6 µm². A triple-well CMOS process on a p-type substrate is used, where the memory array is in the pocket p-well and isolated from the substrate through a surrounding n-well. Only NMOS transistors are used in the high voltage circuits, and interconnect is limited to single metal and double poly to simplify the overall process.
