To reduce cost and achieve high speed, a new hardware accelerator for fault simulation has been designed.
INTRODUCTION
Simulation is the process of exercising a realistic model of a digital system with sets of input stimuli. Fault simulation plays an important role in testing and diagnosis by determining the fault coverage of a test set. However, the computation time for this process is growing rapidly due to increases in the size and complexity of the systems to be simulated.
In order to reduce simulation time, a number of techniques have been introduced in recent years. One of these techniques is hardware simulation. It is known that hardware simulation can be more than 1000 times faster than software simulation [1]. These hardware approaches include two speedup factors, namely architectural concurrency and algorithmic concurrency. Efficient architectural concurrency can be achieved by allowing parallel processing elements to deal effectively with the circuit. Algorithmic concurrency is characterized by a simulation algorithm which can take advantage of the architectural concurrency. The major disadvantage of the hardware simulation approach is that it is expensive and inflexible compared to software simulation approaches. Several hardware accelerators have been reported, including [2], the Yorktown Simulation Engine [3], the Logic Evaluator [4], Realfast [5], LSM [6], and MARS [7]. These were primarily designed for logic simulation purposes, since the fault simulation process is far more complex and costly. Recently, it was reported by Fehr [8] that a dramatic speed-up potential is achievable in logic simulation by using a parallel hardware configuration which directly maps the design simulation topology onto the accelerator. This architecture focuses on large scale concurrency through hardware parallelism, high speed input/output, optimized interconnect (fanout) circuits, and pipelining of the design execution cycle. It was mainly intended for functional verification of gate-level, unit-delay, binary-valued logic designs.
There have been several other attempts to develop a hardware accelerator for fault simulation. Levendel [9] introduced a special purpose architecture for simulation, using a parallel fault simulation algorithm. There is also an approach based on a concurrent fault simulation algorithm [10], which consists of parallel processing units for each task of concurrent fault simulation. Agrawal [11] presented a pipelined multiprocessor system using a concurrent fault simulation algorithm. Later, the accelerator was upgraded by adopting a new pipelined algorithm for message passing multicomputers [12]. The major drawback of existing approaches to fault simulation has been their cost. The object of this paper is to describe a new architecture for fault simulation. Since the main disadvantage of the hardware approach is the cost, a major concern was to reduce it. The architecture introduced in this paper is an array type architecture, which is very cost effective compared to other architectures. To achieve high performance, a new fault simulation algorithm was also developed for the new architecture. This algorithm takes advantage of the new architecture through the use of parallelism, where possible.
ARCHITECTURE
The underlying architecture of earlier hardware accelerators is that of a multiprocessor system, where the processors can be configured in a distributed fashion for architectural concurrency, or in a pipelined fashion for algorithmic concurrency. In the new approach, fault simulation is executed on a massively parallel processor array. This massively parallel hardware accelerator is composed of a tightly connected array of simple Boolean evaluation processing elements (PEs), and it provides high speed fault simulation at a reasonable cost. In this architecture, we apply a direct mapping strategy, as described by Fehr [8] for logic simulation, which maps the fault simulation topology onto the accelerator PE array. The basic mapping concept of the new architecture is a one-netlist-node-per-array-element representation of the circuit, with pass-gate logic used for the array interconnections. Therefore, the netlist topology can be mapped as an overlay onto the PE array, and all gates which are assigned to the same level in a netlist can be evaluated concurrently. An example of the mapping of a circuit onto the hardware accelerator is illustrated in the accompanying figure.
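The one-node-per-PE mapping described above can be illustrated with a minimal sketch. It assumes a netlist given as a dictionary from gate name to (type, fanins), places each gate in the column equal to its level, and numbers the rows within a level in name order. The netlist, the function names, and the placement rule are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def levelize(netlist, primary_inputs):
    """Level of a gate = 1 + max level of its fanins; primary inputs are level 0."""
    level = {pi: 0 for pi in primary_inputs}

    def visit(node):
        if node not in level:
            level[node] = 1 + max(visit(f) for f in netlist[node][1])
        return level[node]

    for gate in netlist:
        visit(gate)
    return level

def map_to_array(netlist, primary_inputs):
    """Return {gate: (column, row)} with column = level and row = slot in that level."""
    level = levelize(netlist, primary_inputs)
    rows_used = defaultdict(int)
    placement = {}
    for gate in sorted(netlist, key=lambda g: (level[g], g)):
        col = level[gate]
        rows_used[col] += 1
        placement[gate] = (col, rows_used[col])
    return placement

# Hypothetical three-gate netlist: gate -> (type, fanins)
netlist = {
    "G1": ("AND", ["a", "b"]),
    "G2": ("OR", ["b", "c"]),
    "G3": ("AND", ["G1", "G2"]),
}
placement = map_to_array(netlist, ["a", "b", "c"])
```

In this sketch G1 and G2 share a column, which corresponds to gates of the same level being evaluated concurrently.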
ALGORITHM
In order to achieve high performance using the new architecture, a new simulation algorithm was developed. To maximize the capability of the new algorithm, preprocessing is required. During preprocessing, two stages of circuit expansion are performed. This is done in order to make maximal use of the parallelism, by expanding a circuit into a new circuit in which each element has at most two faults. After expansion, the preprocessing for the PE array reconfiguration is performed. In the first stage of circuit expansion, any gate which has more than two fanins is expanded to gates having only two fanins. Consider the circuit shown in Figure 8. The circuit after expansion is shown in Figure 9. If we apply fault collapsing after expansion, the number of faults in the new circuit is 14, whereas the number of faults in the example circuit is 13. The difference is one uncollapsed fault, i.e., the signal from G_TEMP1 to G_TEMP2 stuck-at-0. This is because we only consider fault equivalence relationships. An XOR gate is expanded as shown in Figure 10(a). Since internal faults are not considered, there are only 6 faults. Therefore, after the expansion, the number of faults is the same and the difference is only in the method of internal expression, since the propagation and excitation conditions of all faults are the same. An XNOR gate is expanded in a similar way, as shown in Figure 10(b).
There are several ways to expand a multi-input gate to a set of 2-input gates. Since the number of levels in the circuit is an important factor for the overall performance of the new algorithm, the expansion is performed by choosing a set of gates so as to minimize the number of levels. For example, the 4-input AND gate of Figure 11(a) is expanded to two levels of gates as shown in (c), rather than to three levels of gates as shown in (b).
[Figure: Example Circuit after First Expansion]
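The level-minimizing expansion can be sketched as a balanced pairing of the inputs, which gives a depth of ceil(log2(n)) for an n-input gate. This is a minimal illustration for associative gates such as AND and OR with at least two inputs. The function name and the `_TEMP` naming scheme are assumptions made for the example.

```python
def expand_gate(name, inputs):
    """Expand an n-input gate (n >= 2) into a balanced tree of 2-input gates.

    Returns (gates, depth); gates is a list of (gate_name, in1, in2) tuples
    and the last entry drives the original output signal `name`.
    """
    gates = []
    signals = list(inputs)
    depth = 0
    while len(signals) > 2:
        next_signals = []
        # Pair neighbouring signals; each pair becomes one temporary gate.
        for i in range(0, len(signals) - 1, 2):
            temp = f"{name}_TEMP{len(gates) + 1}"
            gates.append((temp, signals[i], signals[i + 1]))
            next_signals.append(temp)
        if len(signals) % 2:
            next_signals.append(signals[-1])  # odd signal passes to the next level
        signals = next_signals
        depth += 1
    gates.append((name, signals[0], signals[1]))
    return gates, depth + 1

gates, depth = expand_gate("G", ["a", "b", "c", "d"])
```

For four inputs the sketch produces three 2-input gates on two levels, the same depth as the preferred expansion in Figure 11(c). A chain-style expansion would need three levels.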
In the second stage of circuit expansion, all fanout branches are regarded as dummy gates. The object of this preprocessing is to force the number of faults in a gate to always be less than or equal to 2, and to have all faults inserted on the input side of the gates. After considering all fanout branches, the example circuit is shown in Figure 12.
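A minimal sketch of this second stage follows, assuming a netlist given as a dictionary from gate name to (type, fanins). Every signal that feeds more than one gate input is routed through one single-input dummy gate per branch. The `BUF` type and the `_DUMMY<n>` names are hypothetical choices for illustration.

```python
def insert_fanout_dummies(netlist):
    """Return a new netlist in which each fanout branch passes through a dummy gate."""
    # Count how many gate inputs each signal drives.
    fanout = {}
    for _, fanins in netlist.values():
        for sig in fanins:
            fanout[sig] = fanout.get(sig, 0) + 1

    expanded = {}
    branch_no = {}
    for gate, (gtype, fanins) in netlist.items():
        new_fanins = []
        for sig in fanins:
            if fanout[sig] > 1:
                # One dummy gate per branch of a fanout stem.
                branch_no[sig] = branch_no.get(sig, 0) + 1
                dummy = f"{sig}_DUMMY{branch_no[sig]}"
                expanded[dummy] = ("BUF", [sig])
                new_fanins.append(dummy)
            else:
                new_fanins.append(sig)
        expanded[gate] = (gtype, new_fanins)
    return expanded

netlist = {"G1": ("AND", ["a", "b"]), "G2": ("OR", ["b", "c"])}
expanded = insert_fanout_dummies(netlist)
```

Here signal b drives two gates, so each branch receives its own dummy gate, and branch faults can then be inserted on the input side of a gate like any other fault.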
Notice that after expansion, the number of faults in the expanded circuit is the same as in the original circuit, and the faults inserted on the gates are distributed as uniformly as possible. The minimum insertion is 0 and the maximum insertion is 2. Since the number of fanins of any gate is 2, the number of propagated faults towards the next level gate is at most 2. Furthermore, the maximum number of faults on any gate is known, so the required memory space can be optimized. The algorithm for the new fault simulation is shown in Figure 13. Initially, a circuit is simulated without any faults and the results are stored. Then, after inserting the first fault, fault evaluation is performed. If the value is different from that of the good simulation, it is propagated. For the second fault, the same processing is performed. The algorithm is highly dependent on the number of levels in the circuit and the number of columns in the array architecture. If the architecture has L columns and a given circuit has fewer than L levels, then simulation can be done in less than (L 1) steps. These procedures can be efficiently performed in a parallel array architecture. Since the good simulation is performed before fault evaluation and single fault propagation is executed at each PE, this algorithm takes advantage of the parallelism using parallel patterns. Handling fault lists is another advantage over handling one fault at a time, since it reduces unnecessary overhead.
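The procedure described above (good simulation first, then fault insertion, evaluation, and propagation only when the value differs) can be sketched in software as follows. This is a sequential illustration under assumed data structures, not the algorithm of Figure 13 itself. In the accelerator, gates of the same level would be evaluated concurrently. All names and the example circuit are hypothetical.

```python
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "BUF": lambda a: a,
}

def fault_simulate(gates, local_faults, pattern):
    """gates: list of (name, type, fanins) in level order.
    local_faults: {gate: [(fault_id, input_index, stuck_value), ...]}.
    pattern: {primary_input: value}.
    Returns (good_values, fault_lists)."""
    # Good simulation is performed first and its results are stored.
    good = dict(pattern)
    for name, gtype, fanins in gates:
        good[name] = OPS[gtype](*(good[f] for f in fanins))

    lists = {pi: set() for pi in pattern}
    for name, gtype, fanins in gates:
        visible = set()
        # Faults inserted on the input side of this gate.
        for fid, idx, stuck in local_faults.get(name, []):
            vals = [good[f] for f in fanins]
            vals[idx] = stuck
            if OPS[gtype](*vals) != good[name]:
                visible.add(fid)
        # Faults arriving from the previous level: flip the affected fanins.
        incoming = set().union(*(lists[f] for f in fanins))
        for fid in incoming:
            vals = [good[f] ^ (fid in lists[f]) for f in fanins]
            if OPS[gtype](*vals) != good[name]:
                visible.add(fid)
        lists[name] = visible  # only differing faults are propagated
    return good, lists

gates = [("G1", "AND", ["a", "b"]), ("G2", "OR", ["G1", "c"])]
local_faults = {
    "G1": [("a-s-0", 0, 0), ("b-s-1", 1, 1)],
    "G2": [("c-s-1", 1, 1)],
}
good, lists = fault_simulate(gates, local_faults, {"a": 1, "b": 1, "c": 0})
```

In this example only a-s-0 changes the output of G1, so it is the only fault passed on to G2, where it again changes the output.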
Consider the circuit shown in Figure 14, whose elements are assigned to levels, with D_PO among the elements in level 3. During simulation, the contents of each processing unit are shown in Figure 15. For simplicity, only the element name, the good simulation value, and the contents of the fault list are represented. Let (1,0) be the input pattern for simulation, and let the architecture have 2 × 3 processing elements. After preprocessing and good simulation of the pattern (1,0), the contents of each processing element are shown in Figure 15(a). Consider P(2,1) as an example. The processing element P(2,1) represents the level 2 gate C. After good simulation, the result is stored as 0. Then, 2 faults (A_DUMMY-s-1 and B_DUMMY-s-1) related to gate C are inserted. After evaluation for the fault A_DUMMY-s-1, the effect is not propagated, since the value is the same as the good simulation value. However, since the evaluation result for the fault B_DUMMY-s-1 is different from the good simulation result, it is propagated to the next level. Therefore, B_DUMMY-s-1 appears in the list of P(3,1). After one level is considered, the contents of each processing element are shown in Figure 15(b). Since all faults in the first level have been considered and the next level processing elements have the fanin values of the good simulation, the first level elements are not considered any further. Therefore, when the second step is considered, the first column of processing elements is idle. The number of these idle processing units becomes larger as the process continues. However, these idle processing elements can be used to perform simulation of the next patterns, if available. Therefore, the processing units are busy most of the time during simulation. After the consideration of two levels, the contents of each processing element are shown in Figure 15(c).
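The evaluation at P(2,1) can be replayed with a short sketch. Because Figure 14 is not reproduced here, the gate type of C is an assumption: a 2-input AND is used, since it is consistent with the values in the text (good value 0 for inputs (1,0), A_DUMMY-s-1 masked, B_DUMMY-s-1 visible). The function name and the fault tuple layout are hypothetical.

```python
def evaluate_faults(gate_fn, good_inputs, faults):
    """Evaluate each input-side stuck-at fault against the good output.

    faults: list of (fault_name, input_index, stuck_value).
    Returns (good_output, names of faults whose effect must be propagated).
    """
    good_out = gate_fn(*good_inputs)
    propagated = []
    for name, index, stuck_value in faults:
        faulty_inputs = list(good_inputs)
        faulty_inputs[index] = stuck_value
        if gate_fn(*faulty_inputs) != good_out:
            propagated.append(name)
    return good_out, propagated

good_out, propagated = evaluate_faults(
    lambda a, b: a & b,  # assumption: gate C behaves as a 2-input AND
    (1, 0),              # good values on the two inputs of C for pattern (1,0)
    [("A_DUMMY-s-1", 0, 1), ("B_DUMMY-s-1", 1, 1)],
)
```

Under this assumption only B_DUMMY-s-1 is returned for propagation, matching the fault that appears in the list of P(3,1).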
PERFORMANCE EVALUATION
It is very difficult to provide precise timing for this new algorithm, since it is highly dependent on the circuit's topology. However, a rough comparison between the new algorithm and a one-fault-at-a-time simulation algorithm using the same architecture can be made as follows. The algorithm which will be compared to the new algorithm is based on logic simulation. In other words, after insertion of each fault, the evaluation is performed as in logic simulation. This does not require much additional overhead. We will call this a 'conventional algorithm'. Also, an accelerator using this conventional algorithm will be called a 'conventional accelerator', for notational convenience.
FAULT SIMULATION 129
Let L be the number of levels in the circuit and L' be the number of levels after the expansion. For simplicity, only two cases are considered. One is that there is only one column in the architecture and the other is that there are L columns. Also for simplicity, fault dropping is not considered. Let P be the number of given simulation patterns and F be the number of faults in the circuit. If a conventional fault simulation algorithm is used with one column, the total simulation time is given by equation (1); equations (2) through (4) cover the conventional algorithm with L columns and the new algorithm with one column and with L columns, respectively. This is the worst case, since fault dropping is not considered. Since the new algorithm deals with many faults simultaneously (in parallel), the fault dropping speed of the new algorithm is faster than that of the conventional one.
Comparing equations (1) and (3) shows that, for the case where there is only one column, the new algorithm is more efficient if the following condition is satisfied:
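$L' \times (L' - 1) < L \times F \qquad (5)$
where L' denotes the number of levels after expansion. Comparing equations (2) and (4) for the case of L columns gives a corresponding condition, equation (6). The conditions shown in equations (5) and (6) are true for most practical circuits.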
As shown in previous sections, the fault simulation cycle consists of two timing parameters: the timing value for a good circuit simulation, t_good, and the timing value for a faulty circuit simulation, t_fault. Assume that the times [...] are the same as the system cycle, t_sys. Also, let the time t_w [...], which is also not more than t_good + ([...] - 1) × t_fault [...] (9L - 5) [...]
[...] [14] and the ISCAS 89 se[...]
For example, circuit c499 uses many XOR gates. Therefore, it has a high GOR and LOR. The main factor affecting the overall performance is the LOR, rather than the GOR, since this architecture depends highly on the number of levels in the circuit. After circuit expansion, the PE array is rearranged according to the number of gates in a level; this rearrangement takes place in the processing scheme and not in the actual hardware. This PE array reconfiguration lets the architecture achieve more efficient fault simulation, since it increases the number of levels that can be handled at one down-loading time.
In these results, the PE reconfiguration is done in 4 depths within a chip, i.e., 256 × 1, 128 × 2, 64 × 4, and 32 × 8. The average Spr is 4.44 for these circuits.
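One way to picture the choice among the four configurations is the following sketch. It rests on two assumptions that the text does not state: each configuration is read as (gates per level) × (levels per download), and the configuration chosen is the one with the most columns whose row count still fits the widest level. The selection rule and all names are hypothetical.

```python
# (rows, columns) per chip; this reading of the four depths is an assumption.
SHAPES = [(256, 1), (128, 2), (64, 4), (32, 8)]

def choose_shape(max_gates_per_level):
    """Pick the shape with the most columns whose rows fit the widest level."""
    fitting = [shape for shape in SHAPES if shape[0] >= max_gates_per_level]
    if not fitting:
        raise ValueError("level is wider than a single chip in this sketch")
    return max(fitting, key=lambda shape: shape[1])

narrow = choose_shape(20)
medium = choose_shape(50)
wide = choose_shape(200)
```

Every shape uses all 256 PEs of a chip. A narrower circuit lets more levels be loaded at once, which is the benefit the reconfiguration is meant to provide.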
To compare the performance of the new approach, a software approach was developed using the same environment. For software fault simulation, a parallel pattern algorithm [16] [17], with fanout free region analysis [18] [19] , was used. For fault simulation, 1024 random patterns were used. The new approach averages about 30,000 times faster than the software fault simulation, with a 25 nsec clock cycle, as shown in Table I .
The configuration of the hardware accelerator used for these results is as follows. Since the architecture is based on an array, there is no additional overhead. Currently, the cost of a hardware accelerator is higher than the cost of a software simulation system, i.e., (64 C + W) > (W + S); hence the Cost Performance Ratio is less than the Speed Up Ratio. When the average speedup is considered, the total cost performance is about 19,000 times better than that of a conventional software simulator. (In this computation, W, C, and S are assumed to be $30,000, $1,000, and $30,000, respectively.)
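The figure of about 19,000 can be reproduced with a short calculation. The formula below is an assumption: it scales the average speedup by the ratio of the software system cost (W + S) to the accelerator system cost (64 C + W), which is consistent with the inequality and the dollar values given in the text. The variable names are chosen for illustration.

```python
W, C, S = 30_000, 1_000, 30_000  # dollar values quoted in the text
speedup = 30_000                 # average speedup over software simulation

hardware_cost = 64 * C + W       # accelerator system cost
software_cost = W + S            # software simulation system cost

# Assumed definition: speedup scaled by the cost ratio of the two systems.
cost_performance = speedup * software_cost / hardware_cost
```

With these numbers the hardware system costs $94,000 against $60,000 for the software system, and the ratio comes out a little above 19,000.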
CONCLUSION
The demand for a high speed, low cost, fault simulator has led to the need for a new computer architecture and a new simulation algorithm. Our objective for developing this hardware accelerator for fault simulation was to satisfy this demand.
The architecture of the new hardware accelerator was designed based on a reconfigurable mesh type PE array. The netlist is directly mapped onto a massively parallel PE array, which is composed of a tightly connected array of simple processing elements. Circuit elements to be simulated at the same level in the netlist are executed concurrently as in a pipelined process.
132 S. KANG et al.
The new parallel simulation algorithm expands the gates to two-input gates, which permits us to limit the number of faults at each gate to two, so that the faults can be spread out uniformly throughout the PE array. The PE array reconfiguration operation provides a great advantage for simulation speed, since it keeps each PE cell busy most of the time. Another advantage of this algorithm is that it is possible to predict the total memory size used during the simulation, whereas unpredictable memory usage is a major drawback of conventional concurrent simulation programs.
Simulation results, based on benchmark circuits, show that the hardware accelerator, in a multi-chip mode, would be orders of magnitude faster than a software simulation program on a general purpose computer. The performance of this architecture can be improved further by new technologies, such as adopting a one-transistor memory cell model and using a faster system clock. In addition, the size of the proposed hardware accelerator can be reconfigured for maximum cost performance, from the chip level to the system level, based on the size of the circuit to be simulated.
