MODULE GENERATORS PROVIDED by library vendors supply chip designers with optimized Booth multipliers, which are widely used as embedded cores in both general-purpose data path structures and specialized digital signal processors. Designers frequently use Booth multipliers in area- and speed-critical parts of complex ICs. Compared to standard array multipliers, Booth multipliers are faster and require less area, and their regular structure (see the box "Booth multiplier operation") facilitates efficient implementation and testing in VLSI devices. Therefore, the Booth architecture is the basis of most embedded multipliers produced by automatic synthesis tools [1]. Testing these deeply embedded multipliers requires an effective BIST scheme that can be easily synthesized along with the multiplier by the module generator. We have developed a BIST scheme for Booth multipliers that fulfills this requirement.
The problem
Achieving high controllability and observability by using scan techniques, engineers can test Booth multipliers efficiently with external testers if the multipliers have either linear or C testability. Linear testability means the number of test vectors increases linearly with the size of the multiplier operands. C testability means the number of test vectors is constant irrespective of multiplier operand size. Several design-for-testability (DFT) approaches for making Booth multipliers linear- or C-testable have appeared in the literature. Two of these approaches assume specific implementations of the multiplier cells, using a specific silicon compiler [2, 3]. Another approach is independent of specific multiplier cell implementations [4]. All three approaches use DFT modifications to make the multiplier C-testable and provide a small test set regardless of operand lengths.
Externally testing Booth multipliers at speed using scan techniques is not trivial, yet at-speed testing is a demand in today's high-speed ICs. Furthermore, scan design is not always cost-effective in high-complexity designs with many deeply embedded cores. In such cases, built-in self-test (BIST), a method that puts the tester on the chip, provides an excellent solution.
The low controllability and observability of multiplier modules deeply embedded in complex ICs impose serious testability problems [5]. For embedded Booth multipliers, as well as other embedded cores such as RAMs [6], ROMs [7], and FIFOs [8], an efficient BIST scheme is the best solution. It permits at-speed testing, provides very high fault coverage, and drives down testing costs for the overall IC.
Adapting the previously mentioned DFT approaches to BIST schemes is very expensive in terms of hardware implementation because of the irregularity of input test sets and their output responses. Therefore, a BIST scheme aimed specifically at relieving the difficulties of testing embedded multiplier modules is necessary. An effective BIST scheme for Booth multipliers must satisfy the following requirements:
• To avoid performance degradation in carefully optimized designs, the scheme must not apply DFT modifications to the multiplier structure.
The solution
Our generic BIST scheme for Booth multipliers complies with the requirements just listed. The BIST algorithm provides fault coverage higher than 99% and uses a fixed-size test set, regardless of the size of the multiplier operands. This means that both the cost of the BIST test pattern generation hardware and the test application time are constants. Synthesizing a test pattern generation block that consists of either an 8-bit counter or an 8-bit maximum-length linear-feedback shift register (LFSR) with an existing multiplier design requires little design effort. The previously derived test sets for Booth multipliers [2-4] are very irregular and not easily produced by small test pattern generators. Thus, using an 8-bit binary counter or LFSR as a BIST test pattern generator is a very efficient solution.
We adopted one of the best-known output data compaction schemes, accumulator-based compaction [9, 10], which provides excellent fault coverage. If an accumulator already exists at the multiplier outputs, this compaction scheme requires no extra hardware and hence is very efficient. An accumulator accompanies a multiplier in most general-purpose computing structures based on data path architectures, as well as in digital signal processing circuits. We extend accumulator-based compaction with rotate-carry adders [9] using multiple rotate-carries. This new compaction scheme provides better results than those obtained by Rajski and Tyszer [9].
If the multiplier is not accompanied by an accumulator, two excellent alternative solutions exist. One solution is to add an accumulator with area-optimized adder cells and exact pitch matching. This requires little design effort due to the accumulator's modularity and imposes a small hardware overhead. The other solution is to add a classical multiple-input shift register (MISR).
Compared to the conventional pseudorandom BIST approach, our BIST scheme is superior in that it guarantees very high fault coverage with a small test set and requires simple BIST hardware. Since our scheme uses only regular blocks (counter, multiplexers, and accumulator), greatly simplifying the BIST synthesis process, it is easily adaptable to any module generator. The scheme does not require DFT modifications in the multiplier structure.
Booth multiplier operation
Consider the multiplication of numbers X and Y, in two's-complement representation, with sizes $N_x$ and $N_y$ respectively. Standard (nonrecoded) $N_x \times N_y$ array multipliers calculate one $N_y$-bit-wide partial product for each of the $N_x$ bits of multiplier operand X. An array of 1-bit adders (half adders and full adders) implements the addition of these $N_x$ partial products. Modified Booth multipliers reduce the number of partial products that must be added together by recoding one of the two operands. The multiplier treats groups of k bits (k ≥ 2) of recoded operand X together to produce one (instead of k) partial product. The total number of partial products thus decreases from $N_x$ to $N_x/k$, and multiplication significantly accelerates. The Booth multiplier with 2-bit recoding (k = 2) is the architecture used in practice. In this case, recoded operand X is divided into groups of 2 bits ($X_{2j}$, $X_{2j+1}$).
The r cells perform recoding. They receive a group of 2 bits ($X_{2j}$, $X_{2j+1}$, $j = 0, 1, \ldots, N_x/2 - 1$) of recoded operand X, along with the most significant bit of the previous group ($X_{2j-1}$). The r cells produce a set of recoding signals ($Sign_j$, $One_j$, $Two_j$) representing the operation that must be performed over nonrecoded operand Y (+0Y, −0Y, +1Y, −1Y, +2Y, −2Y) at row j, as shown in Table A. The $X_{2j-1}$ input for the r cell of row 0 (bottom row) is always 0, so this cell is reduced to a simpler two-input cell instead of a three-input cell, as shown in Figure A1. The 2-bit recoding scheme (k = 2) produces all partial products with trivial and therefore fast operations such as inversion (to generate −Y) and shifting (to generate 2Y). This is the reason that in practice we prefer 2-bit recoding over larger schemes such as 3-bit recoding, which requires nontrivial operations such as the production of 3Y.
The pp cells calculate the partial products that must be added together to form the final product.
The adder cells (full and half) add up the partial products produced by the pp cells. The adders in the array connect in a carry-propagate fashion. The adders in the bottom row are half adders; the remaining are full adders.
Since the Booth algorithm uses numbers in two's-complement representation, the multiplier must perform a sign extension for the correct addition of the partial products. In our case, the multiplier performs the sign extension by adding a constant value to the adders producing the last $N_x$ bits of the product. This action adds +1 at the rightmost half adder. It also connects the sum outputs of the full adders at column $N_y - 1$ to an input of the full adders at their right and inverts them to give the corresponding product bit (see Figure A2).
The BIST architecture proposed in this article is applicable not only to the Booth multiplier structure presented here. We can apply it with equal efficiency to Booth multiplier architectures using other recoding schemes (for example, four recoding signals from the r cells instead of three) or using a carry-save instead of a carry-propagate adder array. It is also applicable to nonrecoded standard-array multipliers with either carry-save or carry-propagate architecture.
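The 2-bit recoding described in the box on Booth multiplier operation can be illustrated with a short behavioral sketch. It models only the arithmetic (one recoded digit in {−2, −1, 0, +1, +2} per row j), not the cell-level array; the function names `booth_digits`, `to_signed`, and `booth_multiply` are hypothetical and do not come from the article.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def booth_digits(x, n_x):
    """Return the N_x/2 recoded digits of operand X, least significant first.

    Digit j is computed from X_{2j+1}, X_{2j} and X_{2j-1}; X_{-1} is taken as 0,
    which matches the reduced two-input r cell of row 0.
    """
    assert n_x % 2 == 0, "this sketch assumes an even operand width"

    def bit(i):
        return (x >> i) & 1 if i >= 0 else 0

    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
            for j in range(n_x // 2)]


def booth_multiply(x, y, n_x, n_y):
    """Sum the N_x/2 partial products digit_j * Y * 4**j."""
    y_signed = to_signed(y, n_y)
    return sum(d * y_signed * 4 ** j for j, d in enumerate(booth_digits(x, n_x)))
```

For example, `booth_digits(6, 4)` yields the two digits −2 and +2, since 6 = −2 + 2·4. Only N_x/2 partial products are summed, and each is 0, ±Y, or ±2Y, which is why no 3Y term is ever needed.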
Test strategy and fault model
Our test strategy is based on pseudoexhaustive testing at the cell level, that is, applying all possible input combinations to all the multiplier's cells. We assume that only one cell can be faulty at a time and that only combinational faults can occur. Adopting this fault model, we cover all detectable combinational faults (faults that can change the single faulty cell's function to any other combinational function that can impact the multiplier's correct operation). The set of faults included in the fault model is a superset of the set of all single stuck-at faults. It also includes any other type of single or multiple fault that can appear in the single faulty cell and change its function to a different combinational function.
Each type of cell (see box for description of cell types) requires the following sets of input combinations:
• r cells. An exhaustive r-cell test requires the eight different input combinations of its three inputs $X_{2j-1}$, $X_{2j}$, $X_{2j+1}$. The r cell of row 0, which always receives $X_{2j-1} = 0$, is reduced to a simpler, nonredundant two-input cell. Its exhaustive testing requires all four combinations of its two inputs.
This fault model makes our test strategy independent of specific cell implementations. We can use any internal realization of the multiplier cells. Our strategy is valid, and every pp cell will receive all required combinations, even if the recoded digits produced by the recoding cells are not represented with three signals (Sign, One, Two) but with a different encoding scheme. (A different encoding scheme, for example, might use four signals representing the recoded digits +1, −1, +2, and −2, with 0 on all four signals meaning recoded digit 0.) Thus, the proposed methodology detects all detectable combinational cell faults.
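To make the exhaustive r-cell test concrete, the sketch below enumerates the eight input combinations of an r cell and the operation each one selects. Table A is not reproduced in this text, so the particular (Sign, One, Two) encoding used here is an assumption based on a common radix-4 Booth convention; the names `r_cell` and `operation` are hypothetical.

```python
from itertools import product


def r_cell(x_prev, x_lo, x_hi):
    """Return (Sign, One, Two) for inputs X_{2j-1}, X_{2j}, X_{2j+1}.

    Assumed encoding: Sign follows X_{2j+1}, One flags a magnitude of 1,
    Two flags a magnitude of 2.
    """
    sign = x_hi
    one = x_lo ^ x_prev
    two = int((x_hi, x_lo, x_prev) in {(1, 0, 0), (0, 1, 1)})
    return sign, one, two


def operation(sign, one, two):
    """Name the operation on Y selected by the recoding signals."""
    magnitude = 2 if two else (1 if one else 0)
    return ("-" if sign else "+") + str(magnitude) + "Y"


# Keys are (X_{2j-1}, X_{2j}, X_{2j+1}); all eight combinations are listed.
table = {bits: operation(*r_cell(*bits)) for bits in product((0, 1), repeat=3)}

# The row-0 cell always sees X_{-1} = 0, leaving four combinations.
row0_table = {bits: op for bits, op in table.items() if bits[0] == 0}
```

Under this encoding the eight combinations produce exactly the six operations named in the box (+0Y, −0Y, +1Y, −1Y, +2Y, −2Y), with ±1Y each reached by two different input patterns.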
Proposed BIST scheme
Designers usually use optimized layouts of multiplier designs produced by module generators. DFT modifications in a multiplier structure may add extra hardware overhead and lead to performance degradation. Our BIST scheme avoids such modifications by adding a test pattern generator and an output data evaluator on the periphery of the multiplier, as shown in Figure 1 . The figure also shows the input data registers for the two operands X and Y. The architecture uses the two sets of multiplexers at the top and left sides of the multiplier array to select between normal and BIST multiplier inputs. The propagation delay of multiplier blocks usually determines the system clock period. Therefore, to avoid affecting system performance, we can place the multiplexers before the input registers, provided that the multiplier inputs come from faster modules. As a result, the BIST architecture imposes virtually no delay overhead. The proposed BIST structure makes a complex BIST controller unnecessary.
Test pattern generation.
The test pattern generation requirements of BIST differ greatly from those of external testing strategies. Conventional external testing requires the fewest possible test vectors. Because they are stored in tester memory and often applied at lower frequencies than the circuit's operating frequency, a greater number of test vectors increases testing time and cost. In contrast, BIST test vectors are applied at the normal operating (system) speed and are produced (and not stored) on chip. Therefore, BIST test vectors must be highly regular (with repetitive patterns and correlation) so that small machines can generate them.
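As an illustration of how small such a pattern generator can be, here is a minimal sketch of an 8-bit maximum-length LFSR. The article does not state which feedback polynomial is used, so the taps below (a right-shifting Fibonacci register realizing the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1) are an assumption, and `lfsr8_sequence` is a hypothetical name.

```python
def lfsr8_sequence(seed=0x01):
    """Return the successive states of an 8-bit Fibonacci LFSR until the seed recurs.

    The all-zero state is excluded because the register would lock up in it.
    """
    assert 0 < seed < 256
    state = seed
    states = []
    while True:
        states.append(state)
        # Feedback is the XOR of bits 0, 2, 3 and 4 of the current state.
        feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
        state = (state >> 1) | (feedback << 7)
        if state == seed:
            return states
```

Calling `len(lfsr8_sequence())` gives the period of the register. In hardware this corresponds to eight flip-flops and three XOR gates, comparable in size to an 8-bit counter.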
The proposed test pattern generator for an $N_x \times N_y$-bit Booth multiplier is an 8-bit counter that generates 256 values (or a maximum-length 8-bit LFSR that generates 255 values). During BIST, the multiplier repeatedly uses four counter outputs as X inputs, and the remaining four counter outputs as Y inputs, as shown in Figure 1. We denote the counter outputs used as X inputs $X_{BIST0}$, $X_{BIST1}$, $X_{BIST2}$, and $X_{BIST3}$. We denote the counter outputs used as Y inputs $Y_{BIST0}$, $Y_{BIST1}$, $Y_{BIST2}$, and $Y_{BIST3}$.
The only exception is the half adders, which cannot receive input combination $c_{in} = 1$, $a = 0$ repeatedly. If we apply this combination to one half adder, no other half adder can receive it because $c_{in} = 1$ cannot be reproduced in the row. Therefore, to apply this combination to every half adder, we need a number of test patterns that increases linearly with the length of the Y operand.
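The repeated use of four counter outputs per operand can be sketched as follows. The exact wiring of Figure 1 is not given in the text, so this sketch assumes that operand bit i is driven by counter output i mod 4, with the low nibble of the counter feeding X and the high nibble feeding Y; `bist_operands` and `r_cell_inputs_seen` are hypothetical names.

```python
def bist_operands(count, n_x, n_y):
    """Expand an 8-bit counter value into N_x-bit and N_y-bit operands.

    Assumed mapping: X_i = X_BIST(i mod 4), Y_i = Y_BIST(i mod 4).
    """
    x_bist = count & 0xF          # assumed: low nibble drives the X inputs
    y_bist = (count >> 4) & 0xF   # assumed: high nibble drives the Y inputs
    x = sum(((x_bist >> (i % 4)) & 1) << i for i in range(n_x))
    y = sum(((y_bist >> (i % 4)) & 1) << i for i in range(n_y))
    return x, y


def r_cell_inputs_seen(n_x, n_y=16):
    """For each row j, collect the (X_{2j-1}, X_{2j}, X_{2j+1}) triples applied."""
    seen = [set() for _ in range(n_x // 2)]
    for count in range(256):
        x, _ = bist_operands(count, n_x, n_y)
        for j in range(n_x // 2):
            prev = (x >> (2 * j - 1)) & 1 if j > 0 else 0  # X_{-1} is always 0
            seen[j].add((prev, (x >> (2 * j)) & 1, (x >> (2 * j + 1)) & 1))
    return seen
```

Under this assumed mapping, any three consecutive X bits come from three different counter outputs, so the 256 counter values apply all eight combinations to every three-input r cell and all four combinations to the row-0 cell, independent of $N_x$.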
Let's examine how the test patterns produced by the 8-bit counter apply all possible input combinations to every multiplier cell except the set of half adders just mentioned.
… with respect to the adopted fault model. This is also true for the reduced four-input pp cells of columns 0 and $N_y$.
• adders. All adders, except the $N_y - 4$ half adders of the bottom row, are exhaustively tested with all eight (full adders) or four (half adders) combinations of their inputs. This is due to the repetitive nature of the 256 patterns, which apply the same input combinations to many adders at the same time.
Apart from applying every possible input combination to all the multiplier's cells, we must be sure that all faulty cell outputs propagate toward primary outputs and thus disclose the faults. Let's consider all cell types that can be faulty and how faulty cell outputs propagate toward primary outputs.
… Since the single output of these cells drives an adder input, the fault propagates through a chain of adders (following the sum lines) to primary output $P_{2j+i}$ and thus is detected.
• adders. When the single faulty cell is one of the adders (either half or full), the fault propagates through a chain of adders, again following sum lines toward primary outputs.
Output data compaction and fault coverage. As mentioned earlier, we adopted accumulator-based output data compaction as proposed in Rajski and Tyszer [9]. Because multiplier units are usually accompanied by accumulators in data path blocks, this solution requires little extra hardware for response verification. Even if we synthesize a dedicated accumulator for output data compaction, it does not require much design effort due to its inherent regularity.
Since the BIST scheme's signature resides in the accumulator, which is an easily accessible register, it can easily be validated by comparison with the fault-free signature inside or outside the chip. Researchers have analyzed the benefits of accumulator-based compaction extensively [9, 10].
We extensively simulated the proposed BIST architecture for a 16 × 16 Booth multiplier with respect to all single stuck-at faults, using the Cadence Design Framework fault simulation tool (the Verifault stuck-at fault simulator). We explored different compaction schemes based on accumulators and a classical MISR. We also considered the single stuck-at faults in the compactor itself. We used the Espresso minimizer to extract the gate-level implementations of the multiplier cells from their logic functions.
The accumulator-based compaction scheme uses multiple rotate-carry loops, as shown in Figure 2 for a 32-bit accumulator. The accumulator consists of four 8-bit segments. Each segment's carry-out signal is fed into its carry-in input and XORed with the previous segment's carry-out signal. Memory elements denoted F in the figure control the feedback loops. During normal operation, the memory elements are set to 0.
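A behavioral sketch of the multiple rotate-carry accumulator may help in following the description of Figure 2. Since the figure is not available here, the timing of the feedback is an assumption: the sketch treats each F element as a one-cycle memory that holds its segment's carry-out and injects it, XORed with the previous segment's carry-out, into that segment's carry-in on the next addition. With the feedback disabled the model reduces to an ordinary 32-bit accumulator. The function name `accumulate` is hypothetical.

```python
def accumulate(values, test_mode=True, segments=4, seg_bits=8):
    """Accumulate values in a segmented adder with optional rotate-carry loops."""
    mask = (1 << seg_bits) - 1
    acc = [0] * segments      # accumulator contents, one entry per segment
    saved = [0] * segments    # assumed F elements: last carry-out of each segment
    for value in values:
        carry_from_prev = 0
        new_saved = []
        for i in range(segments):
            operand = (value >> (i * seg_bits)) & mask
            rotate = saved[i] if test_mode else 0   # F forced to 0 in normal mode
            total = acc[i] + operand + (carry_from_prev ^ rotate)
            acc[i] = total & mask
            carry_from_prev = total >> seg_bits
            new_saved.append(carry_from_prev)
        saved = new_saved
    # Reassemble the segments into one signature word.
    return sum(acc[i] << (i * seg_bits) for i in range(segments))
```

Feeding the multiplier outputs for all test patterns through `accumulate(..., test_mode=True)` yields a signature that would be compared with the fault-free value. The rotate-carry loops reinject carries that a plain accumulator would discard or pass on only once, which is the mechanism intended to reduce fault masking.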
Hardware and delay cost. Finally, we provide an estimate in gate equivalents of the hardware cost imposed by our BIST scheme. We assume that a full adder equals 10 gates, a half adder equals 5 gates, an r cell equals 8 gates, a pp cell equals 14 gates, a flip-flop equals 10 gates, a multiplexer equals 3 gates, and a 2-input NAND gate equals 1 gate.
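The gate-equivalent weights above lend themselves to a simple tally. The sketch below applies those weights to arbitrary component counts; it does not reproduce Table 2, and the example counts (one input multiplexer per operand bit, eight flip-flops for the counter state) are illustrative assumptions. The names `GATES`, `gate_equivalents`, and `input_mux_gates` are hypothetical.

```python
# Gate-equivalent weights as stated in the text.
GATES = {
    "full_adder": 10,
    "half_adder": 5,
    "r_cell": 8,
    "pp_cell": 14,
    "flip_flop": 10,
    "multiplexer": 3,
    "nand2": 1,
}


def gate_equivalents(counts):
    """Weighted sum of component counts, in gate equivalents."""
    return sum(GATES[name] * number for name, number in counts.items())


def input_mux_gates(n_x, n_y):
    """Cost of the input multiplexers, assuming one per operand bit."""
    return gate_equivalents({"multiplexer": n_x + n_y})
```

For instance, the eight flip-flops of an 8-bit counter account for 80 gate equivalents (excluding its increment logic), and under the one-multiplexer-per-bit assumption a 16 × 16 multiplier needs 96 gate equivalents of input multiplexers. The counter cost is independent of operand size, while the multiplexer cost grows only linearly.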
The multiplier's original design consists of the following:
Table 2 gives the hardware overhead for 16 × 16-, 32 × 32-, and 64 × 64-bit multipliers. Obviously, when an accumulator does not exist in the circuit, the cost is slightly higher but not prohibitive. The delay overhead imposed by our BIST scheme is no more than the smallest possible delay overhead of any offline BIST scheme that has a single multiplexing stage. Moreover, assuming that the multiplier inputs come from faster modules, system performance is not affected when the multiplexers (Figure 1) are placed before the input registers.
The small delay overhead introduced with the addition of the XOR gates between adder segments when the multiple rotate-carry compaction scheme is utilized causes no problem in system performance. This is because the multiplier's speed, which in any case is much higher than the adder's speed, determines the clock period of the data path in which both the adder and multiplier participate. Since we introduce no DFT modification to the multiplier structure, the proposed scheme does not significantly affect the data path's performance.
BECAUSE WE APPLY the BIST hardware only to the multiplier's periphery, our scheme causes no internal performance degradation. For multiple embedded multipliers in a complex IC, more than one multiplier module can share the BIST hardware. We can implement the same scheme without extra effort in pipelined Booth multipliers.
Its comprehensive cellular fault model makes the BIST scheme generic and applicable to any Booth multiplier. Thus, the scheme is suitable for use in any multiplier core generator. AT&T's standard cell library has adopted it for macrocell generation of multipliers, providing automatic synthesis of the BIST hardware of any size along with the Booth multiplier itself. We are currently investigating the scheme's applicability in treelike (Wallace) multipliers. 
Dimitris Gizopoulos

