Abstract: A new testable and reconfigurable bit-serial multiplier is proposed. Fault tolerance is established through built-in self-testing and dynamic reconfiguration. During the reconfiguration phase, the faulty modules of the multiplier are automatically isolated and the system reconfigures itself for the required function with no need for external intervention. Quadruple-double modular redundancy (QDMR) techniques are employed to allow the isolation of the bad modules and the use of only good ones. The faulty cells are located and the diagnostic information is scanned out for evaluation. The design method followed supports a two-level testing strategy, in which the multiplier is tested by a small built-in test circuit and the test circuit itself is tested externally by a scan-path technique, providing high fault coverage. This design method adopts the functional building block concept and is especially suitable for linear arrays. The multiplier is based on the bit-serial approach, which has been proven to be an efficient implementation for several digital-signal-processing (DSP) structures; it accepts 2's complement data and coefficients in serial form. A prototype of an 8-bit multiplier has been implemented in single-layer metal 2 µm CMOS technology; it has an area of 2.64 mm x 1.91 mm (without I/O pads) and contains approximately 2500 devices.
Introduction
With recent advances in VLSI technology, very complex digital-signal-processing (DSP) algorithms can be cost-effectively implemented. At the same time, the design complexity needed to achieve high-speed performance, efficient area and reliability becomes a major challenge. Reliability has recently gained wide importance, especially in several critical fields such as radar communications and real-time image processing for robotics applications. Non-traditional design methods and architectures are required to achieve this high performance within the VLSI constraints.

In this paper, a novel reconfigurable testable multiplier architecture is presented. The multiplier is based on the bit-serial approach, to minimise the area. Pipelining is employed to enhance the speed and bring it closer to that of bit-parallel processing. Reliability is provided by including redundancy in hardware, testing mechanisms for all cells, fault detection and location at each cell, and dynamic reconfigurability. In our design, reliability has been established through two phases: testing and fault tolerance.

Testing has been a major challenge in integrated circuits. Traditionally, testing was considered only after the circuit had been fabricated, which requires probing pads, sophisticated test equipment, and heavy computation time for test generation and fault simulation. With the rapid increase in the complexity of VLSI systems, testing has become more difficult or even impossible [1]. This has been the motivation for adopting the 'design for testability' strategy, in which extra circuits are included on a chip to make testing manageable and economical. Scan design, such as the level-sensitive scan design (LSSD) [2] and scan path [3], has been widely used to increase circuit controllability and observability (the main testing criteria [4]) and to reduce test generation for a sequential circuit to that for a combinational circuit.
However, since the test data are generated outside the circuit under test (CUT) and the responses must be evaluated outside the circuit, the information must be shifted serially in and out of the circuit. This makes testing a time-consuming procedure. A more sophisticated approach, the built-in self-test (BIST), has been introduced to achieve a more effective and economical testing strategy [5, 6]. In this paper, a 'two-level testing' strategy is developed, which incorporates both BIST and scan-path approaches; BIST is used to reduce the test application and evaluation time, while scan-path is used to guarantee the correctness of the BIST circuitry.
Fault tolerance has also become highly desirable, since increasing VLSI silicon area means reducing yield. Although current technologies are much more reliable than the early ones, the resulting decrease in the failure rate has been offset by the increased complexity of today's VLSI circuits. In this paper, quadruple-double modular redundancy (QDMR) is employed to provide fault diagnosis and fault tolerance. This technique is based on the modular duplication concept and is applicable to one-dimensional unidirectional arrays. In the QDMR approach, the first cell contains four identical modules, and each of the remaining cells contains two identical modules. This provides dynamic reconfiguration for fault tolerance and also provides fault diagnosis information. A multiplier design is the vehicle used to demonstrate the proposed techniques.
The multiplier represents the main computational kernel and the bottleneck of most DSP algorithms. The performance of any DSP processor depends very much on the multiplier efficiency. The bit-serial approach [7, 8] has gained wide acceptance as a trade-off between the design cost and the operation cost for real-time DSP applications. It offers maximum flexibility and extensibility, and a high clock rate.
The multiplier design strategy
The main design philosophy is reliability, which is achieved in two phases:
(a) testing to detect faults
(b) locating faults and using dynamic reconfiguration to isolate faulty modules and to utilise the good ones.
In the testing phase, the entire multiplier is tested by a small built-in test circuit and the test circuit itself is tested externally by a scan-path technique. This significantly reduces the test turnaround time, while ensuring that the testing proceeds correctly. The second phase is based on utilising redundancy through dynamic reconfiguration. In this approach, the faulty modules are isolated automatically and the system reconfigures itself to perform the required function. This technique contrasts with the common static reconfiguration, where an external process, such as a tailored discretionary interconnection, is needed to isolate the faulty modules and to connect the good ones. Dynamic reconfiguration techniques offer a good solution to the limitations experienced when using the static approach, such as the need to probe each element prior to the final metallisation, the area consumed by the probing pads, and the cost of generating a distinct interconnection pattern and of exposing a tailored photoresist for the discretionary wiring.

Fig. 1 illustrates a non-fault-tolerant bit-serial multiplier. It is composed of three basic modules: the first module, the intermediate modules, and the last module. Its function will be discussed in detail in Section 3. The block diagram of the proposed multiplier is shown in Fig. 2. It has four different units: cell-cu, cell-1, cell-2 and cell-N. Cell-cu functions as a control unit for the entire multiplier. It includes multiplexers (MUX) to separate normal inputs from test inputs, a linear-feedback shift register (LFSR), a set of shift registers, a timing-signal generator, and a global comparator. Cell-1 has four first-multiplier modules (booth-1), two block comparators (CMP), and an arbiter circuit. Cell-2 has two intermediate multiplier modules (booth-2) and an arbiter, like the one in cell-1. Cell-N has two last multiplier modules (booth-N) and an arbiter.

The fault model used in this paper is similar to that used by previous researchers [9, 10]; it assumes single permanent faults, and the proposed multiplier can tolerate only one fault in each multiplier cell.
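The booth-1, booth-2 and booth-N modules named above belong to a serial modified Booth multiplier. As a word-level illustration of the coefficient recoding behind that algorithm, the following minimal sketch recodes a 2's complement coefficient into radix-4 digits and multiplies with them. The function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the sketch assumes the standard radix-4 modified Booth recoding; it does not model the bit-serial circuit of Fig. 1.

```python
from typing import List


def booth_radix4_digits(coeff: int, bits: int) -> List[int]:
    """Recode a `bits`-wide 2's complement coefficient into radix-4
    modified Booth digits in {-2, -1, 0, 1, 2}, least significant first."""
    if bits <= 0 or bits % 2 != 0:
        raise ValueError("bits must be a positive even number")
    if not -(1 << (bits - 1)) <= coeff < (1 << (bits - 1)):
        raise ValueError("coefficient does not fit in the given width")
    pattern = coeff & ((1 << bits) - 1)   # 2's complement bit pattern
    digits = []
    prev = 0                              # implicit bit to the right of the LSB
    for i in range(0, bits, 2):
        low = (pattern >> i) & 1
        high = (pattern >> (i + 1)) & 1
        digits.append(low + prev - 2 * high)
        prev = high
    return digits


def booth_multiply(data: int, coeff: int, bits: int) -> int:
    """Multiply using one partial product per recoded digit (bits/2 steps)."""
    total = 0
    for k, digit in enumerate(booth_radix4_digits(coeff, bits)):
        # Each digit selects 0, +/-data or +/-2*data; the weight 4**k is a shift.
        total += (digit * data) << (2 * k)
    return total


if __name__ == "__main__":
    print(booth_radix4_digits(7, 8))
    print(booth_multiply(-7, 93, 8))
```

An 8-bit coefficient yields four digits, so four partial products are accumulated instead of eight, which is the halving of computational steps that motivates the recoding.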
Fig. 1 Non-fault-tolerant serial modified Booth multiplier

The proposed multiplier operates in three separate modes, selected by two inputs: 'reconfiguration-mode' and 'scan-mode'. These modes are summarised as follows:
(a) The normal mode: The multiplier performs normal multiplication (serial modified Booth multiplier algorithm). Cell-cu has no influence on the multiplier circuit, but the LFSR in cell-cu is active all the time.
(b) The reconfiguration mode: The multiplier should be set to this mode before operation starts. Setting the reconfiguration-mode pin to logical high makes the multiplier reconfigure itself. The control unit cell-cu immediately clears all internal flip-flops in cell-1 to cell-N, then fetches 4 bits from the LFSR as a pseudorandom test pattern and sends it to cell-1. The four identical modules (booth-1) in cell-1 are organised as two self-checking blocks, and the two modules in each block simultaneously perform the same computation.

The 4-bit test pattern (X, Y, PPin, and LSB(0)) propagates through cell-1 at the same speed. Cell-1 also generates an extra 'anticipated response' (AR) bit and sends it, together with the test pattern, to cell-2. The AR is the value that appears on the partial product output (PPout) of the multiplier module in the normal mode; it needs a wire of its own because, in the reconfiguration mode, the test pattern propagates through the cell and the original PPout wire of the multiplier module carries the PPin signal. The ARs of the two modules in each block are compared in every clock cycle. If they match, both modules are assumed to be fault-free with respect to the current test pattern, and the block is also assumed to be fault-free. The output of this block comparison (from the CMP) is sent to the arbiter; the arbiter disables a block that signals a mismatch and allows only one module to be responsible for the cell's operation. (Note: since the first and last modules of a modified Booth multiplier differ slightly from the intermediate modules, booth-1 and booth-N modify themselves to have the same function as booth-2.)

The second cell (cell-2) has only two identical modules; the basic idea is that each module compares its PPout with the AR sent from cell-1 and enables or disables its outputs accordingly. The test pattern and AR from cell-1 propagate through cell-2 to the next cell. Cell-N also has two modules; each module compares its PPout with the AR sent from the preceding cell, and the arbiter then lets one module be responsible for the cell's operation. In this way cell-2 to cell-N have fault-tolerance capability while keeping the redundant circuitry to a minimum (double modular redundancy). Fig. 3a shows the multiplier operation when all modules are fault-free.
Figs. 3b and 3c illustrate two independent reconfigurations caused by faulty modules present in the multiplier.
The AR produced by cell-1 is sent to a shift register in cell-cu; it also propagates through cell-2 to cell-N and then returns to cell-cu, where global checking is done by comparing the AR from cell-1 with the AR returned from cell-N. The test patterns (generated by cell-cu) are sent to another shift register in cell-cu (for global checking) and, at the same time, are propagated through all multiplier cells and finally return from cell-N to the control unit, where another global check compares the original data with the returned data. Global checking is done every clock cycle, and the 'cell-error' output goes to logical high if any mismatch occurs. The cell-error signal indicates that there are intolerable faults inside the multiplier.
(c) The scan mode: Setting the scan-mode pin to logical high puts the multiplier in the scan mode. This mode allows the control unit cell-cu to be tested externally. A small number of deterministically generated patterns are shifted in serially through the scan-data-in pin; the scan-mode pin is then reset for at least one clock cycle and set again to scan out the response. The fault diagnosis information, stored in the two error registers of the arbiter circuit in each of cell-1 to cell-N, becomes available during the reconfiguration mode and is also shifted out for evaluation in the scan mode.

[...] data and coefficient applied serially. The main constraint on bit-serial systems is the latency of the primitives. The latency of the multiplier can be reduced by using the modified Booth algorithm [11, 12]. In this algorithm, the coefficients are recoded so as to halve the number of computational steps involved, and so halve the size of the processing array. The main achievement of this recoding scheme is that the products of the recoded digits with the multiplicand (data) are implemented by optional shifts. The circuit of the serial modified Booth multiplier [13], already shown in Fig. 1, consists of three basic modules: the first (leftmost) module, the intermediate modules, and the last (rightmost) module. The first module is different [...]

The proposed multiplier takes advantage of BIST and on-chip redundancy, while also providing fault-diagnostic capability. The four basic cells of the multiplier, cell-1, cell-2, cell-N, and cell-cu, are discussed in more detail below:
(a) cell-1: This cell has quadruple modular redundancy: four identical booth-1 multiplier modules. This 'module shadowing' concept was introduced by Intel [14]. Every two modules work as a self-checking block, and the outputs of the two blocks are controlled by an arbiter circuit.
If one module in a block fails, the entire block is assumed to be faulty and the output of the block comparator is activated; this mismatch signal tells the arbiter to disable the faulty block. Fig. 4 shows the arbiter circuit used in cell-1 to cell-N. The AR input of the arbiter in cell-1 is always logical low, since each block in cell-1 already has a comparator and cell-1 generates the AR for the following cell. The error registers in the arbiter record whether or not there is a mismatch in the blocks. This information can be serially scanned out during the scan mode. The circuit booth-1 is basically the first multiplier module with some extra circuitry that allows it to function as an intermediate multiplier module and to propagate the test pattern through itself during the reconfiguration mode. In the reconfiguration mode, the test pattern (4 bits: X, Y, PPin, and LSB(0)) [...] appear as inputs to cell-2. Although self-testing and fault tolerance could also be achieved by using triple redundancy with a voter unit in this cell, the quadruple redundancy with the arbiter just described (this arbiter is also used in cell-2 and cell-N) is considered easier to implement in VLSI owing to its regular structure.
(b) cell-2: This cell has double modular redundancy (two booth-2 multiplier modules); the outputs are controlled by an arbiter circuit. The circuit of booth-2 is basically the intermediate multiplier module with some extra circuitry that allows the test pattern and AR to propagate through it. During the reconfiguration mode, cell-2 receives test patterns and the AR from its preceding cell, and the PPout response of each booth-2 module is compared with the AR in the arbiter. The arbiter uses the comparison signals to decide which module should be responsible for the cell output, and enables the appropriate output buffer. The contents of the two error registers in the arbiter are scanned out for fault diagnosis during the scan mode. The test pattern and the AR (5 bits in total) propagate through the cell to the following cell after 3 clock cycles.
(c) cell-N: This is similar to cell-2; it has double modular redundancy with two identical booth-N modules, and its output is also controlled by an arbiter circuit. During the reconfiguration mode, however, each booth-N module functions like an intermediate module of cell-2 and generates the response that lets the arbiter decide which module should be responsible for the cell's outputs. The test pattern and AR propagate through the cell unchanged and, after 3 clock cycles, return to cell-cu.
(d) cell-cu: When the reconfiguration-mode input is activated, cell-cu generates a pulse one clock cycle wide to reset the internal flip-flops in cell-1 to cell-N and then starts the self-testing and reconfiguration processes. The circuit of the timing-signal generator and the timing relationship between its signals are shown in the corresponding figure; the 'power-on-reset' is only used to preset the LFSR. Cell-cu also contains a set of shift registers to store the test pattern generated by the free-running linear-feedback shift register (LFSR) [15] and the AR generated by cell-1 for global checking. Global error checking of the entire multiplier is done by comparing the test pattern sent to cell-1 with the test pattern returned from cell-N, and by comparing the AR generated by cell-1 with the AR returned from cell-N. Any mismatch in the comparison activates the 'cell-error' output. The occurrence of the cell-error signal should be interpreted as follows: there are too many faulty cells to be tolerated, or a failure has occurred in the interconnections between cells, or a failure has occurred in the control unit.
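The reconfiguration behaviour of cells (a) to (d) can be sketched at a purely behavioural level as follows. This is a minimal sketch under several assumptions that are not in the paper: the module response `reference_ppout` is a hypothetical stand-in (4-bit parity) for a fault-free PPout, faults are modelled only as a stuck module output, the LFSR tap positions are assumed, pipeline delays are ignored, and the cell-error flag is simplified to "some cell has no usable module" rather than the shift-register comparison done in cell-cu.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


def reference_ppout(pattern: int) -> int:
    """Hypothetical stand-in for a fault-free module's PPout: 4-bit parity."""
    return bin(pattern & 0xF).count("1") & 1


@dataclass
class Module:
    stuck_at: Optional[int] = None  # None = fault-free, 0 or 1 = output stuck

    def ppout(self, pattern: int) -> int:
        return reference_ppout(pattern) if self.stuck_at is None else self.stuck_at


def lfsr4(seed: int = 1) -> Iterator[int]:
    """Free-running 4-bit LFSR; the tap positions are an assumption (period 15)."""
    state = seed & 0xF
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        yield state
        feedback = ((state >> 3) ^ (state >> 2)) & 1
        state = ((state << 1) | feedback) & 0xF


def reconfigure(cell1: List[Module], others: List[List[Module]],
                n_patterns: int = 15) -> Tuple[List[Optional[int]], bool]:
    """Return (index of the module selected in each cell, cell_error flag)."""
    block_error = [False, False]                      # error registers, cell-1
    module_error = [[False, False] for _ in others]   # error registers, cell-2..N
    patterns = lfsr4()
    for _ in range(n_patterns):
        pattern = next(patterns)
        outs = [m.ppout(pattern) for m in cell1]
        for b in (0, 1):                              # block comparators (CMP)
            if outs[2 * b] != outs[2 * b + 1]:
                block_error[b] = True
        good = [b for b in (0, 1) if not block_error[b]]
        if not good:
            break                                     # no trustworthy AR is left
        ar = outs[2 * good[0]]                        # anticipated response
        for errors, cell in zip(module_error, others):
            for k, module in enumerate(cell):
                if module.ppout(pattern) != ar:       # compare PPout with AR
                    errors[k] = True
    good = [b for b in (0, 1) if not block_error[b]]
    selected: List[Optional[int]] = [2 * good[0] if good else None]
    for errors in module_error:
        usable = [k for k in (0, 1) if not errors[k]]
        selected.append(usable[0] if usable else None)
    return selected, any(s is None for s in selected)


if __name__ == "__main__":
    first_cell = [Module(), Module(stuck_at=0), Module(), Module()]
    rest = [[Module(stuck_at=1), Module()], [Module(), Module()]]
    print(reconfigure(first_cell, rest))
```

In this model a single stuck module per cell is bypassed, while two faulty modules in the same cell raise the error flag, mirroring the one-fault-per-cell limit of the fault model.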
Reliability analysis
The reliability function $R(t)$ is defined as the probability that a component will perform satisfactorily from time zero to time $t$, given that the operation is successful at time zero. The reliability function for a nonredundant component, using the general Poisson distribution, is $R(t) = e^{-\lambda t}$, where $\lambda$ is the constant failure rate of the component. (We assume that the component is in the normal-life region, and therefore the failure rate $\lambda$ is constant.) For a general bit-serial multiplier with a 2N-bit coefficient, the nonredundant reliability $R_0$ is [16]:
where $R_m$ is the reliability of the multiplier intermediate module.
Since the circuit complexities of the multiplier modules (booth-1, booth-2, and booth-N) are approximately the same, we consider them to have the same reliability, i.e. $R_{\text{booth-1}} = R_{\text{booth-2}} = R_{\text{booth-N}} = R_m$. A simple reliability model for the redundant multiplier using our quadruple-double modular redundancy (QDMR) technique is
where $R_1$ is the reliability of the entire cell-1, which includes four multiplier modules, and $R_2$ that of cell-2. To compare $R_{QDMR}$ with $R_0$, $R_1$ and $R_2$ should be expressed in terms of the module reliability $R_m$. Since cell-1 is composed of two parallel self-checking block outputs controlled by an arbiter, its reliability $R_1$ is
where $R_{\text{hardcore1}}$ is the reliability of the hardcore circuitry in cell-1, which includes the arbiter, output switching, and matching circuitry. Since the failure rate $\lambda$ tends to be proportional to the circuit complexity, we can estimate the relative failure rates from the relative transistor counts. Comparing the hardcore circuitry with the basic multiplier module, we found a relative circuit complexity of 0.28 : 1; therefore $R_{\text{hardcore1}} = R_m^{0.28}$. The self-checking block has two parallel modules, and its reliability $R_{\text{block}}$ is therefore
where $R_{CMP}$ is the reliability of the output comparator (a simple XOR gate) in the block of cell-1. The relative complexity of the XOR gate to the multiplier module is about 0.05 : 1; therefore $R_{CMP} = R_m^{0.05}$. In cell-2, two parallel module outputs are controlled by an arbiter. Its reliability $R_2$ is
where $R_{\text{hardcore2}}$ is the reliability of the hardcore circuitry in cell-2. From eqns. 2-6 and the aforementioned analysis, we obtain the reliability of the proposed multiplier, $R_{QDMR}$, as a function of the reliability $R_m$ of the basic module as follows:
Values of $R_{QDMR}$ and of the reliability $R_0$ (eqn. 1) of a non-fault-tolerant multiplier are plotted in Fig. 6 as a function of the module reliability $R_m$ and $N$. It can be seen that the redundant multiplier maintains a higher reliability (the curves of the redundant multiplier lie above those of the nonredundant multiplier). Fig. 7 plots the dependence of the reliability improvement ($R_{QDMR}/R_0$) on $N$ and $R_m$. The plot illustrates a factor of 3 to 40 superiority in reliability predicted for the QDMR over the nonredundant model for 8 to 32 bit multipliers. Fig. 7 also shows that when the module reliability $R_m$ is approximately zero, the reliability of the redundant multiplier is lower than that of the nonredundant multiplier. This is because the redundant multiplier has spare components that tolerate failures, but when the probability of failure of each component is very high, there are merely more components to fail. Since this situation occurs in the wear-out period of the multiplier, we can ignore it.
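To make the comparison concrete, the following minimal sketch evaluates one plausible series/parallel model of the two multipliers. It is not a reproduction of eqns. 1-7. The assumptions, none of which are confirmed by the text, are: a 2N-bit coefficient needs N cells in series; a self-checking block works only if both of its modules and its comparator work; a cell works if at least one block (cell-1) or one module (cell-2 to cell-N) works and its hardcore circuitry works; and the hardcore circuitry of cell-2 has the same 0.28 relative complexity quoted for cell-1. The 0.28 and 0.05 ratios come from the text; the function names are hypothetical.

```python
def r_nonredundant(r_m: float, n: int) -> float:
    """Assumed series model: n identical modules, all of which must work."""
    return r_m ** n


def r_qdmr(r_m: float, n: int,
           hardcore_ratio: float = 0.28, cmp_ratio: float = 0.05) -> float:
    """Assumed QDMR model: one quadruple cell followed by n - 1 double cells."""
    r_cmp = r_m ** cmp_ratio              # comparator, from the 0.05 : 1 ratio
    r_hardcore = r_m ** hardcore_ratio    # arbiter etc., from the 0.28 : 1 ratio
    r_block = r_m * r_m * r_cmp           # both modules and the comparator work
    r_cell1 = (1.0 - (1.0 - r_block) ** 2) * r_hardcore
    r_cell2 = (1.0 - (1.0 - r_m) ** 2) * r_hardcore   # assumption: same hardcore
    return r_cell1 * r_cell2 ** (n - 1)


if __name__ == "__main__":
    for n in (4, 8, 16):                  # 8, 16 and 32 bit coefficients
        for r_m in (0.5, 0.9, 0.99):
            r0 = r_nonredundant(r_m, n)
            rq = r_qdmr(r_m, n)
            print(f"N={n:2d} R_m={r_m:.2f} R_0={r0:.4g} R_QDMR={rq:.4g} "
                  f"ratio={rq / r0:.3g}")
```

Under these assumptions the model shows the same qualitative behaviour described above: the redundant design is ahead for high module reliability and falls behind as the module reliability approaches zero. The numerical ratios it prints should not be expected to match Fig. 7.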

Implementation
A prototype of the proposed multiplier with an 8 bit coefficient has been implemented in 2 µm single-metal CMOS technology using two approaches. In the first, standard static CMOS logic is used; the multiplier (without I/O pads) occupies an area of 4.86 mm x 2.91 mm and contains approximately 3500 devices. In the second, domino CMOS logic [17] is used; the implementation measures 2.64 mm x 1.91 mm and has about 2500 devices. Approximately 35% of the total silicon area is occupied by the BIST and scan-path circuitry for the 8 bit multiplier; however, this area overhead decreases as the size of the multiplier increases. A simulation program indicates that the multiplier should operate at a maximum clock rate of about 25 MHz, i.e. with a cycle time of about 40 ns. The BIST and scan-path circuitry account for about 8 ns of delay on the multiplier's critical path during normal-mode operation. A layout has been generated (Fig. 8) and is ready to be implemented on a CMOS DSP chip. The design of a fault simulator and a dedicated test pattern generator for the scan path is under way.

Fig. 8 Layout of a domino CMOS prototype
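The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below only combines numbers stated in the text; the variable names are illustrative, and treating the 8 ns test-circuit delay as a share of the 40 ns cycle is an interpretation rather than a claim made by the paper.

```python
# Figures quoted in the text for the 8 bit prototype (areas exclude I/O pads).
static_area_mm2 = 4.86 * 2.91      # static CMOS implementation
domino_area_mm2 = 2.64 * 1.91      # domino CMOS implementation
area_ratio = static_area_mm2 / domino_area_mm2

clock_hz = 25e6                    # simulated maximum clock rate
cycle_ns = 1e9 / clock_hz          # corresponding cycle time
test_delay_ns = 8.0                # BIST and scan-path delay on the critical path
test_delay_fraction = test_delay_ns / cycle_ns

if __name__ == "__main__":
    print(f"static area : {static_area_mm2:.2f} mm^2")
    print(f"domino area : {domino_area_mm2:.2f} mm^2")
    print(f"area ratio  : {area_ratio:.2f}")
    print(f"cycle time  : {cycle_ns:.0f} ns")
    print(f"test delay  : {test_delay_fraction:.0%} of the cycle")
```

The domino version is thus roughly 2.8 times smaller than the static version, and the test circuitry accounts for about one fifth of the simulated cycle time.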
Conclusions
We have presented a novel, easily testable and fault-tolerant multiplier. By combining built-in self-testing with self-reconfiguration, we eliminate the need for complicated test pattern generation, response analysis, and externally controlled reconfiguration. The use of QDMR and the two-level testing strategy provides complete fault tolerance, fault diagnosis and testability. The design is simple and has a highly regular structure, which produces an efficient VLSI implementation. The field reliability is significantly increased without seriously compromising the operation speed.
