the densest circuits fabricated today. Because their transistors and lines are packed so closely together, RAMs suffer from a very high average number of physical defects per unit of area compared with other types of circuits. This fact has motivated researchers to develop efficient RAM test sequences that provide good fault coverage.1 The published results have led some researchers to suggest that a better solution to the RAM testing problem is to redesign and augment the peripheral circuits surrounding the RAM cell array to improve the RAM's testability.2 Others have proposed adding even more circuitry to achieve a self-testing RAM, thereby dispensing with expensive external test equipment.5-7 The principal motivation behind these proposals is the significant reduction in total testing time that such techniques can achieve, compared with conventional RAM testing procedures.
We usually cannot test an embedded RAM simply by applying test patterns directly to the I/O pins, because the embedded RAM's data, address, and control signals are not accessible through the I/O pins. As a result, some researchers have proposed designs for built-in self-testing (BIST) embedded RAMs.8,9 Given current trends, it is reasonable to expect that some embedded RAMs will eventually grow so large that yield considerations will require them to have spare rows and spare columns. When that happens, the RAMs will need built-in test circuits that not only detect the presence of faults but also locate them for repair. This article presents a built-in self-diagnosis (BISD) method that detects and locates faults in embedded RAMs and can be implemented with a small amount of extra area.

0740-7475/93/0600-0024$03.00 © 1993 IEEE
The heart of the method is the diagnosis-and-repair unit (D&R unit), a small reduced-instruction-set processor (RISP) that executes instructions stored in a small ROM. When the embedded RAM is externally repairable (that is, when it has spare rows and columns that are programmed by blowing fuses with laser beam pulses), the D&R unit first locates all the faults. It then sends a repair procedure to the equipment controlling the laser beam.
However, when the RAM is internally repairable (when it has "soft fuses," which may be EEPROM cells paired with flip-flops, also known as "shadow RAM" cells), the D&R unit programs the soft fuses by itself, without help from any external equipment. The same D&R unit can test, diagnose, and (optionally) repair more than one embedded RAM on the same chip.
Test-only algorithms: fault detection
A sequence of five test algorithms can detect all the most important faults that represent physical defects in RAMs.1 The details of the location algorithms will be the subject of another article (all the details are in author Treuer's thesis12). In the present context, it suffices to know that the location algorithms are essentially extensions of the detection algorithms, significantly modified to use the BISD hardware.
Algorithms and BISD hardware
After discussing the various forms of march tests, we will consider four different approaches to BISD hardware. It is useful to keep in mind that while the first priority in external testing is to minimize the length of test algorithms, in BIST the first priority is to make the test algorithms as regular and symmetric as possible so that the self-test circuitry can use a minimum of silicon area.
The algorithms are shown in Tables 1 and 2. We can use the algorithm presented in Table 1 only when each address refers to exactly one bit of memory. When we test a word-oriented SRAM, we write whole words of data to the memory, or read them from it, not just single bits. If we applied only words of all 0s and all 1s, we would fail to detect certain coupling faults between cells in the same word. To detect such state-coupling faults, we need only apply the primary data backgrounds to each word.13 For example, Table 4 shows the appropriate primary data backgrounds for testing an SRAM with 8 bits per word (a byte-oriented SRAM). In general, for a word with B bits, the number of primary data backgrounds is ⌈log₂ B⌉ + 1.

We construct the March-B algorithm for a byte-oriented SRAM, using the primary data backgrounds, by replicating the single-bit algorithm four times (since there are four primary data backgrounds) and replacing all instances of w0 and w1 in the first replica by w01010101 and w10101010, in the second replica by w00110011 and w11001100, in the third replica by w00001111 and w11110000, and in the fourth replica by w00000000 and w11111111. The r0 and r1 operations are replaced in exactly the same way in each replica.
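As a minimal sketch of this construction, the Python below generates the primary data backgrounds for a B-bit word and expands a single-bit march test into its word-oriented replicas. Because Table 1 is not reproduced here, it assumes the single-bit March B as it is usually stated in the march-test literature, {⇕(w0); ⇑(r0,w1,r1,w0,r0,w1); ⇑(r1,w0,w1); ⇓(r1,w0,w1,w0); ⇓(r0,w1,w0)}; the names `MARCH_B`, `primary_backgrounds`, and `word_march_b` are illustrative only.

```python
from math import ceil, log2

# Single-bit March B as usually stated in the literature (an assumption here).
# Each element is (address order, operations); "r0" reads and expects 0, "w1" writes 1.
MARCH_B = [
    ("any", ["w0"]),
    ("up", ["r0", "w1", "r1", "w0", "r0", "w1"]),
    ("up", ["r1", "w0", "w1"]),
    ("down", ["r1", "w0", "w1", "w0"]),
    ("down", ["r0", "w1", "w0"]),
]


def primary_backgrounds(b: int) -> list[str]:
    """Return the ceil(log2 b) + 1 primary data backgrounds for a b-bit word."""
    backgrounds = []
    for level in range(ceil(log2(b))):
        run = 1 << level  # length of each alternating run of 0s and 1s
        backgrounds.append("".join(str((i // run) % 2) for i in range(b)))
    backgrounds.append("0" * b)  # the final background is all 0s
    return backgrounds


def complement(pattern: str) -> str:
    return "".join("1" if bit == "0" else "0" for bit in pattern)


def word_march_b(b: int) -> list[tuple[str, list[str]]]:
    """One March B replica per background: single-bit 0 -> background, 1 -> its complement."""
    test = []
    for bg in primary_backgrounds(b):
        for order, ops in MARCH_B:
            test.append((order, [op[0] + (bg if op[1] == "0" else complement(bg)) for op in ops]))
    return test


print(primary_backgrounds(8))  # ['01010101', '00110011', '00001111', '00000000']
print(word_march_b(8)[1])
```

For B = 8 this yields four replicas of the 17-operation single-bit test, that is, 68 operations per word, and the background order matches the replica order described above.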
The word-oriented primary-style march test cannot detect some idempotent coupling faults (idemCFs) within the same word, because such faults require that different patterns be written and read. The marching and walking data backgrounds provide these patterns. The marching data backgrounds, shown in Tables 5a and 5b, replace the single-bit write operations in state-changing march elements, where the single bit's final value (after the march element has finished) is the complement of its initial value (before the march element began).

Analogously, the walking data backgrounds, shown in Tables 5c and 5d, replace the write operations in state-retaining march elements, where the initial and final values of the single bit are identical. For a word with B bits, there are 4B marching and 4B walking data backgrounds. If we think of the marching and walking data backgrounds as B × B square matrices, a simple way to represent them using only 4 bits per background is shown in Table 6. A careful examination of the patterns reveals that each matrix has three constant parts: an upper (U) triangle, a diagonal (D), and a lower (L) triangle. The only difference between ⇑(wx) and ⇓(wx) is the direction of the diagonal (\ or /), which is equivalent to a fourth bit. Table 7 shows the March-B algorithm for a byte-oriented SRAM that uses the marching and walking data backgrounds. Notice that the marching data backgrounds will also detect, as a by-product, all state-coupling faults, by forcing any two cells x and y into all four possible states: (0,0), (0,1), (1,0), and (1,1).

Table 6. U, D, L representation of data backgrounds.

Table 7. Marching and walking 4-bit version of March-B algorithm.
⇑(r0000, w1000, r1000, w0000, r0000, w1000, r1000, w1100, r1100, w1000, r1000, w1100, r1100, w1110, r1110, w1100, r1100, w1110, r1110, w1111, r1111, w1110, r1110, w1111)
⇑(r1111, w0111, w1111, r1111, w1011, w1111, r1111, w1101, w1111, r1111, w1110, w1111)

Serial memory access. The types of march elements described so far assume that the circuitry performs read and write operations directly on all memory cells. However, in the context of embedded memories, it may be necessary to perform read and write operations indirectly during the testing process. For example, we could apply march elements with shifting: although the read/write circuitry may access an entire row of bits per operation, adding a bit-shifting operation to the data word makes it appear to implement a single-bit march element. We obtain the shifting operation by adding multiplexers to the inputs of the write drivers, with each multiplexer selecting either the normal data input (originating outside the memory) or the value stored in the transparent latch (originating from the sense amplifier) of its left-neighboring bit.9 During the normal mode of operation, the normal data input is selected. During the test mode, the left-neighboring transparent latch is selected. Thus, shifting is indissolubly linked to the write operation during test mode (that is, the shifting operation cannot be separated from the write operation).
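The U, D, L encoding and the shifting write can both be checked in a few lines. The sketch below is a hypothetical model: Table 6 is not reproduced here, so the assignment of one bit each to U, D, L, and the diagonal direction is an assumption, and `shift_write` assumes that the leftmost bit, which has no left neighbor, takes a serial input. It shows that the encoding reproduces the write patterns of the two recovered Table 7 rows, and that shifting 1s into an all-0 word produces the same marching background.

```python
def background_matrix(b: int, upper: int, diag: int, lower: int, anti: bool) -> list[str]:
    """Expand a hypothetical (U, D, L, direction) code into a b x b background matrix.

    Row r is the pattern written at step r of a march element. With the main
    diagonal, column c lies in the lower triangle if c < r, on the diagonal if
    c == r, and in the upper triangle otherwise; the anti-diagonal mirrors columns.
    """
    rows = []
    for r in range(b):
        bits = []
        for c in range(b):
            cc = b - 1 - c if anti else c
            bits.append(str(lower if cc < r else diag if cc == r else upper))
        rows.append("".join(bits))
    return rows


def shift_write(word: str, serial_in: str) -> str:
    """Test-mode write with shifting: bit i takes the latched value of bit i-1;
    bit 0 (which has no left neighbor) is assumed to take a serial input."""
    return serial_in + word[:-1]


# Marching 1s (U=0, D=1, L=1): the w1 patterns of the first Table 7 row.
print(background_matrix(4, 0, 1, 1, False))  # ['1000', '1100', '1110', '1111']
# Walking a 0 through 1s (U=1, D=0, L=1): the w0 patterns of the second row.
print(background_matrix(4, 1, 0, 1, False))  # ['0111', '1011', '1101', '1110']

# Shifting 1s into an all-0 word reproduces the marching background.
word, seen = "0000", []
for _ in range(4):
    word = shift_write(word, "1")
    seen.append(word)
print(seen == background_matrix(4, 0, 1, 1, False))  # True
```

Storing four control bits instead of B × B pattern bits is what makes these backgrounds cheap to generate on chip.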
In the serialized mode of operation, faults within the output data latches can corrupt the data values intended to be written into the memory cells. Therefore, to retain fault coverage, we must reinterpret our view of the march elements …

Parallel memory access. In the parallel write operation, B bits are written to a word of memory in one time period. Since we are dealing with embedded SRAMs, we cannot supply this B-bit pattern from outside the chip. Area constraints restrict the number and type of distinct B-bit patterns we can generate on chip. Usually, we generate highly regular patterns, such as this subset of the primary data backgrounds: 11111111, 00000000, 01010101, 10101010. An easy way to obtain these patterns is to use a data buffer in which each cell can be individually cleared to 0 or set to 1, and then to tie together the clear and set control lines of all the even cells and of all the odd cells. The parallel read operation tells us only whether such a regular pattern was correctly stored in a given memory word. If the B-bit pattern was incorrectly stored in the memory word, the parallel read operation cannot identify which, or how many, of the bits are erroneous, only that at least one bit does not match the regular pattern.14 We can implement the parallel read operation with four high-fan-in logic gates: two AND gates and two OR gates, each with a fan-in of B/2. A different implementation uses B two-input EXOR gates and two OR gates with a fan-in of B/2, where half the inputs to the EXOR gates come from the memory cells, and …

Modular memory access. We can also read and write data backgrounds (for testing and diagnosis) in a way that is neither parallel nor serial but is best described as modular. For this type of memory access, we assume that the memory cell array has been segmented into M modules (each word consists of M groups of B/M bits).
The modular write operation writes B/M bits to exactly one of the M segments composing a word, in one time period, leaving the remaining B − B/M bits of the word untouched. As in the parallel write operation, we generate only highly regular patterns to be written to memory: 0101, 1010, 0011, 1100, 0000, 1111.
The modular read operation tells us more than just whether such a regular pattern was correctly stored in the chosen segment: if the B/M-bit pattern was incorrectly stored in the given segment of the memory word, the modular read operation (unlike the parallel read operation) identifies exactly which of the bits are erroneous. We implement the modular read and modular write operations by connecting each module's data buffer to a B/M-bit-wide bus (the diagnosis-and-repair bus, described in the next section). This bus is connected to a special circuit (the D&R unit, also described in the next section) that can, in the time of one memory cycle, either generate the B/M-bit-wide regular patterns or identify (using XOR gates) the individual erroneous bits of incorrect data.
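The difference in diagnostic resolution between the two read operations can be illustrated with a behavioral model. The Python below is a software sketch, not a gate-level netlist: `parallel_read_ok` mimics the four B/2-input AND/OR gates described for the parallel read (it can only say pass or fail), while `modular_read_errors` mimics the per-bit XOR comparison of the modular read. Numbering bit positions from the left, and the even/odd grouping, are assumptions made for illustration.

```python
def parallel_read_ok(word: str, even_value: int, odd_value: int) -> bool:
    """Pass/fail parallel read of a regular pattern using four B/2-input gates.

    Positions 0, 2, 4, ... form the even group and 1, 3, 5, ... the odd group.
    A group expected to hold 1s passes if its AND gate outputs 1; a group
    expected to hold 0s passes if its OR gate outputs 0.
    """
    def group_ok(bits: str, expected: int) -> bool:
        values = [bit == "1" for bit in bits]
        return all(values) if expected else not any(values)

    return group_ok(word[0::2], even_value) and group_ok(word[1::2], odd_value)


def modular_read_errors(segment: str, expected: str) -> list[int]:
    """Modular read of one B/M-bit segment: XOR each bit with the expected
    pattern and return the positions of the erroneous bits."""
    return [i for i, (got, exp) in enumerate(zip(segment, expected)) if int(got) ^ int(exp)]


print(parallel_read_ok("01010101", 0, 1))   # True
print(parallel_read_ok("01010111", 0, 1))   # False: a mismatch exists, but not where
print(modular_read_errors("0111", "0101"))  # [2]
```

The parallel check needs only four gates regardless of which bit failed, which is why it suits a diagnosis-only design; locating the failing bit for repair requires the per-bit comparison.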
All the diagnosis algorithms are available in two formats: 1) using a combination of parallel write, parallel read, and serial read/write operations; and 2) using only the modular write and modular read operations. The hybrid serial/parallel format is appropriate for a diagnosis-only hardware design (using external repair), and the modular format is appropriate for a diagnosis-with-self-repair hardware design.
Hardware design
The ability to locate faults and effect self-repair calls for an intelligent self-diagnosis circuit, which we call the D&R unit, to be added to the repairable embedded RAM (see Figure 1). The D&R unit is essentially a small reduced-instruction-set processor (RISP), combined with a small ROM. Such a design has the following advantages:
The design can be adapted easily to a variety of memories, because the instructions stored in the ROM can be changed to suit various memory organizations and to detect faults associated with various fabrication technologies. Since an embedded RAM cannot be accessed directly through the chip's I/O pins, such a RAM is usually not connected to the data and address buses running through the chip, because such buses almost always lead to I/O pins. Instead, an embedded RAM is connected directly to neighboring circuits without buses. This means that the D&R unit's RISP will be connected to an embedded RAM through an extra bus used only for diagnosis and repair. Hence, the only necessary redesign of the RAM is the incorporation of this diagnosis-and-repair bus (D&R bus), a relatively simple task. The only changes in RAM performance will be caused by the D&R bus, since all other parts of the D&R unit are completely outside the RAM.

… matches the expected signature, stored as the first or the last word in the ROM, we can assume, with high probability, that there are no faults in the program counter, the instruction register, the ROM's address decoder, or the ROM contents, and we go on to pass 2. In pass 2, we run a self-test program, also stored in part of the ROM, which exercises all the remaining registers and all the instructions implemented by the data path. If the D&R unit successfully completes pass 2, we assume it is fault-free, and we can start testing the embedded RAM itself.
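The surviving text does not say how the pass-1 signature is computed, so the sketch below assumes a common choice: a multiple-input signature register (MISR) that compresses every ROM word except the stored signature. The 16-bit width and the feedback constant `0x1021` (the CRC-16-CCITT polynomial) are hypothetical parameters. Because this register is linear and invertible (the feedback constant has bit 0 set), an error confined to a single ROM word can never cancel out; errors spread over several words alias only with a probability of roughly 2^-16, which is the sense of "with high probability" above.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1
FEEDBACK = 0x1021  # hypothetical feedback taps (the CRC-16-CCITT polynomial)


def misr_signature(words: list[int]) -> int:
    """Compress ROM words with a software multiple-input signature register."""
    sig = 0
    for word in words:
        msb = sig >> (WIDTH - 1)
        sig = ((sig << 1) & MASK) ^ (FEEDBACK if msb else 0)  # shift with feedback
        sig ^= word & MASK  # fold in the next ROM word
    return sig


def rom_pass1(rom: list[int], signature_first: bool = False) -> bool:
    """Pass 1: compare the computed signature with the one stored in the ROM."""
    expected, body = (rom[0], rom[1:]) if signature_first else (rom[-1], rom[:-1])
    return misr_signature(body) == expected


program = [0x1234, 0xABCD, 0x0F0F]        # hypothetical ROM contents
rom = program + [misr_signature(program)]  # signature stored as the last word
print(rom_pass1(rom))                      # True
rom[1] ^= 0x0004                           # a single stuck bit in one word
print(rom_pass1(rom))                      # False
```

Reading every word in sequence also exercises the program counter and the ROM address decoder, since a wrong address would feed a wrong word into the signature.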
BIST address generators.
In comparison with test data generators and response data evaluators, address generators almost always require more layout area. The two possible implementations of an address generator are a counter and a pseudorandom pattern generator (PRPG). Counters have two disadvantages: they require more hardware than PRPGs because of the carry propagation logic, and they are more difficult to make self-testable.
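With march tests, address generators based on PRPGs are quite suitable, because a march test requires only that every address be visited exactly once and that the descending order be the exact reverse of the ascending order; the addresses need not appear in numerical order. It is straightforward to construct a PRPG from a linear feedback shift register (LFSR), augmented to produce the all-0s pattern and further modified to generate the same pseudorandom sequence of addresses in both the forward and reverse directions.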
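One standard way to realize both modifications is sketched below in Python; it is an illustration under stated assumptions, not necessarily the article's circuit. A maximal-length LFSR skips the all-0s state; inverting the feedback whenever the low n-1 bits are all 0 splices that state into the cycle, so all 2^n addresses are produced. The reverse sequence comes from the exact inverse of one step. The tap table `TAPS` is an assumption that covers only 3- and 4-bit addresses, which were checked to be maximal-length with this shift-left convention.

```python
# Feedback taps checked to give a maximal-length sequence with this
# shift-left convention; bit n-1 must always be a tap. Other widths need
# their own primitive feedback polynomial (not shown).
TAPS = {3: 0b110, 4: 0b1100}


def parity(x: int) -> int:
    return bin(x).count("1") & 1


def step_up(state: int, n: int) -> int:
    """Next address in the ascending order: shift left and feed back the parity
    of the tapped bits, inverted when the low n-1 bits are all 0 (this splices
    the all-0 address into the cycle)."""
    low_mask = (1 << (n - 1)) - 1
    fb = parity(state & TAPS[n]) ^ int((state & low_mask) == 0)
    return ((state << 1) & ((1 << n) - 1)) | fb


def step_down(state: int, n: int) -> int:
    """Previous address: the exact inverse of step_up, for descending march elements."""
    low_mask = (1 << (n - 1)) - 1
    prev_low = state >> 1  # the low n-1 bits before the shift
    fb = state & 1         # the bit that step_up shifted in
    msb = fb ^ int(prev_low == 0) ^ parity(prev_low & TAPS[n] & low_mask)
    return (msb << (n - 1)) | prev_low


up = [0]
for _ in range(2**4 - 1):
    up.append(step_up(up[-1], 4))
print(up)  # [0, 1, 2, 4, 9, 3, 6, 13, 10, 5, 11, 7, 15, 14, 12, 8]

down = [up[-1]]
for _ in range(2**4 - 1):
    down.append(step_down(down[-1], 4))
print(down == up[::-1])  # True
```

In hardware the inverse step amounts to shifting the other way with a slightly different feedback network, so the forward and reverse generators can share the same register.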
Cells and their operations. The following five paragraphs describe some of the critical hardware components of a self-diagnosis-with-soft-switching-self-repair design for embedded RAMs, using local redundancy. We assume that the RAM consists of M subarrays, each containing (B/M + 1) = (b + 1) bit lines, such that the single spare bit line in a given subarray can replace only one of the B/M = b nonspare bit lines in the same subarray. Due to space limitations, we can provide only a general overview of the design of the diagnosis-and-repair circuitry (all the details are in the thesis12).
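The one-spare-per-subarray rule can be made concrete with a small model. The text does not describe how the fuses steer data, so the Python below assumes the simplest scheme: each data bit either keeps its own bit line or, if that line is faulty, is steered to the spare (bit line b). The names `fuse_map` and `subarray_repairable` are hypothetical.

```python
from typing import Optional


def fuse_map(b: int, faulty_line: Optional[int] = None) -> list[int]:
    """Map data bits 0..b-1 of one subarray onto its b+1 bit lines.

    Bit line b is the spare. With no fault, data bit j uses bit line j;
    if nonspare line k is faulty, data bit k is steered to the spare.
    """
    if faulty_line is not None and not 0 <= faulty_line < b:
        raise ValueError("only a nonspare bit line can be replaced")
    return [b if j == faulty_line else j for j in range(b)]


def subarray_repairable(b: int, faulty_lines: set[int]) -> bool:
    """One spare per subarray: at most one faulty nonspare line, and if there
    is one, the spare itself must be fault-free."""
    nonspare = {k for k in faulty_lines if k < b}
    return not nonspare or (len(nonspare) == 1 and b not in faulty_lines)


print(fuse_map(4, 2))                  # [0, 1, 4, 3]
print(subarray_repairable(4, {1, 2}))  # False
```

Because each subarray is judged independently, two faulty bit lines are repairable if they fall in different subarrays but not if they fall in the same one.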
SRAM with self-repair circuitry. Figure 1 shows the top-level cell of the proposed design, which diagnoses and repairs faults entirely by itself. The Start signal specifies when a self-diagnosis and self-repair session should take place; at other times the circuit operates in normal mode. The Stop(1:2) signals specify whether the self-repair successfully produced an operational circuit or some unrepairable faults remain.
SRAM. The SRAM is composed of M modules (or subarrays), along with the peripheral circuitry embodied by the cells Addrdecoder, Addrbuffer, Databuffer, and part of Clockgenerate. Each module (or subarray) contains the cells Subcellarray, Writedriverssenseamps, Mod#dec, and Fusemodule. In Figure 1, we let M = 2; hence, b = B/2. In general, large memories can contain an arbitrary number of such modules. For ease of illustration, each module contains only one spare bit line to provide redundancy for fault repair; incorporating more redundancy in the form of spare word lines and additional spare bit lines is straightforward. Apart from the D&R unit, the SRAM is also augmented by Addrgenerate, which controls the address bits during testing, and by part of Clockgenerate, which synchronizes all testing and repair functions.
Mod#dec. This cell contains an address decoder for module number #. The address is carried by the b bits of D&Rbus(1:b) and is valid on the rising edge of D&Rbus(b+1). The signal D&Rbus(b+2) specifies whether we are reading/writing to all the modules in the memory together or only to one selected module. Although there can be an arbitrary number of modules, each one requires a unique module-address decoder.
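A behavioral sketch may help clarify how one Mod#dec cell produces its Enable signal. The Python class below is a hypothetical model: it latches the module address from D&Rbus(1:b) on a rising edge of D&Rbus(b+1), and it assumes that D&Rbus(b+2) = 1 means "all modules", since the polarity is not given in the text.

```python
from typing import Optional


class ModuleDecoder:
    """Behavioral sketch of one Mod#dec cell (hypothetical model)."""

    def __init__(self, module_number: int):
        self.module_number = module_number
        self.latched: Optional[int] = None  # module address captured from D&Rbus(1:b)
        self.prev_strobe = 0                # previous level of D&Rbus(b+1)

    def clock(self, bus_addr: int, strobe: int, broadcast: int) -> bool:
        """Sample the bus lines and return this module's Enable output."""
        if strobe and not self.prev_strobe:  # rising edge: the address is valid
            self.latched = bus_addr
        self.prev_strobe = strobe
        # Assumed polarity: D&Rbus(b+2) = 1 selects all modules at once.
        return bool(broadcast) or self.latched == self.module_number


dec = ModuleDecoder(1)
print(dec.clock(1, 1, 0))  # True: module 1 addressed on a rising edge
print(dec.clock(0, 1, 0))  # True: no new edge, so the latched address holds
```

Latching on the edge lets the same b data lines carry a module address first and test data afterward, which keeps the D&R bus narrow.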
Fusemodule. This cell provides a programmable interface between the b bits of the data buffer and the b+1 bits (one of which is the spare) of the memory module. The data-carrying signals on both sides of this cell are bidirectional; hence the Direction signal selects the direction of data flow. The Clearfuses signal forces the fuses into a predetermined default setting. The Enable signal, coming from one of the module address decoders, specifies which module is being tested or being reconfigured by means of its fuses.

D&R unit. This cell, shown in Figure 2, serves as a programmable controller that tests the memory array, locates the faults, and then programs the soft fuses to effect repairs. The only means of data transfer between the D&R unit and the memory array is the bidirectional part of the D&R bus (D&Rbus(1:b)). The unidirectional control lines from the controller are three lines in the D&R bus (D&Rbus(b+1:b+3)) and six more lines (Control(1:6)) …

… modules to keep the array current small.
This trend results in a larger stray capacitance on the global word line, because the number of local word decoders also increases. Therefore, the charging/discharging current in the word-decoding circuit increases significantly even in the divided-word-line architecture. The HWD (hierarchical word decoding) architecture is a method to further reduce the power consumption and increase the speed of the word-decoding circuitry. In this architecture, the word-decoding circuit is divided into (at least) three stages: a global word decoder, a subglobal word decoder, and a local word decoder. The subglobal word decoders are inserted as buffering stages to distribute the stray capacitance on the word-decoding path efficiently. The significance of this development is that only local redundancy is possible for the latest SRAMs, and local redundancy happens to be much better suited to the modular diagnosis-with-self-repair method described in this article.
Although global redundancy can repair a greater variety of faults than local redundancy (because of its greater inherent flexibility), it suffers from a higher area overhead. All possible versions of self-repair algorithms need essentially the same type of hardware (some kind of control unit and a D&R bus that communicates with programmable modules), whereas self-diagnosis algorithms can use a wide variety of extra hardware. In other words, there are many more hardware design alternatives for implementing a diagnosis method than for a repair method. For local redundancy, the repair hardware accesses spare lines spread throughout the RAM. Using the same D&R bus and programmable modules, the nonspare lines spread throughout the RAM are also easily accessible for diagnosis. For global redundancy, the repair hardware accesses spare lines placed in only one small part of the RAM. Therefore, the repair hardware needs significant additions (a longer D&R bus with more lines to cope with a more complex addressing situation) to access the nonspare lines in the rest of the RAM for diagnosis.
Repair choices. When a fault is located, we store its location in a fault map. Only when the fault map is complete do we apply a repair algorithm, which seeks to allocate the spare rows and spare columns as nearly optimally as possible. A fault-free module of the memory cell array stores the complete fault map of another module being diagnosed. It is highly probable that at least one fault-free module will always be available to store the fault maps of the other modules, so no extra area within the D&R unit is needed to store fault maps.
The two possible ways to apply self-repair algorithms are 1) programming an appropriate heuristic algorithm, such as branch-and-bound or best-first search, into the D&R unit's ROM, or 2) using an electronic neural network implementation of a gradient-descent algorithm.10 Compared with the programmed heuristic technique, the neural network technique achieves a significantly higher percentage of successful repair allocations. That is, the neural network will find repair plans in most borderline cases, where the programmed heuristic technique gives up and declares the memory unrepairable. However, the neural network technique may require more area overhead than reported in Table 8. In addition, implementing a neural network may require significant changes to the fabrication processes being used.
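To make the allocation problem concrete, the Python below sketches a bare-bones branching search over a fault map, represented here (as an assumption) as a set of (row, column) cell coordinates. It is not the article's ROM program and has no cost bound, so it is a simpler relative of the branch-and-bound heuristic named above. It relies on one observation: any uncovered fault must be repaired either by a spare row for its row or by a spare column for its column, so branching on those two choices explores every feasible allocation.

```python
from typing import Optional

Fault = tuple[int, int]  # (row, column) of a faulty cell in the fault map


def allocate(faults: frozenset[Fault], spare_rows: int, spare_cols: int) -> Optional[tuple[list[int], list[int]]]:
    """Search for rows and columns to replace so that every fault is covered.

    Returns (rows, columns) to replace, or None if the spares cannot cover
    all the faults.
    """
    if not faults:
        return [], []
    row, col = min(faults)  # branch on one uncovered fault
    if spare_rows:
        rest = allocate(frozenset(f for f in faults if f[0] != row), spare_rows - 1, spare_cols)
        if rest is not None:
            return [row] + rest[0], rest[1]
    if spare_cols:
        rest = allocate(frozenset(f for f in faults if f[1] != col), spare_rows, spare_cols - 1)
        if rest is not None:
            return rest[0], [col] + rest[1]
    return None


fault_map = frozenset({(0, 0), (0, 3), (2, 1)})
print(allocate(fault_map, 1, 1))                            # ([0], [1])
print(allocate(frozenset({(0, 0), (1, 1), (2, 2)}), 1, 1))  # None
```

The search tree has at most 2^(rows + columns) leaves, which stays small for the few spares typical of a memory module; the borderline cases mentioned above are those in which a pruned heuristic gives up before finding an allocation like this one.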
We will not calculate the effect of the BISD circuit and possible self-repair techniques on fabrication yield here; instead, we refer the reader to the literature.10,16

THE FOLLOWING LIST enumerates parameters for evaluating the overall quality of a built-in self-testing, self-diagnosing, and (possibly) self-repairing circuit design. Only methods that optimize most or all of these parameters are acceptable for implementation. The BISD method we have described satisfies these criteria.
