Abstract
Introduction
Logic simulation plays an important role as a verification tool in the process of VLSI design. Most widely used logic simulators employ an algorithm based on (1) compiled code [1, 4], (2) event-driven simulation [1, 4], or (3) the T-algorithm [2, 4]. Logic simulation is extremely time-consuming; for large systems, we can often only hope for partial simulation [11]. Even these partial simulations take an enormous amount of CPU time, possibly days or weeks. Exploiting the parallelism inherent in these algorithms and implementing them on parallel machines helps reduce the execution time of the simulation process [9].
Speedup can be achieved using vector processors, hardware accelerators [5] such as YSE (IBM) and LE (ZYCAD), or by mapping the simulation algorithm onto the processors of general-purpose parallel machines or onto a network of workstations. In the case of vector processors, the code has to be rewritten to extract the computational power of the vector processors (high vectorization ratio and long vector length) [3]. Hardware accelerators provide good speedups, but because the simulation algorithm is mapped directly onto the hardware, their cost-to-performance ratio turns out to be quite high, which makes them a less attractive scheme. Current trends in parallel logic simulation concentrate on mapping the simulation algorithm onto a network of general-purpose workstations connected by Ethernet. Either functional parallelism, which is inherent in the algorithm, or data parallelism, obtained by dividing the circuit into subcircuits and assigning these to different processors, can be exploited [6]. Functional parallelism normally results in a pipeline of processors, each executing a subtask of the whole simulation task. In the case of data parallelism, load balancing and synchronization of simulation time are the vital issues that decide the partition of the circuit among processors [7]. In our approach, the whole circuit is divided into cones of fanout free regions (FFRs), the cones in the same level are partitioned using a partitioning scheme, and each partition (a set of cones) is assigned to a workstation for evaluation.
The Sequential T-Algorithm
In a T-algorithm (time-first evaluation algorithm), the evaluation of gates proceeds in the direction of signal flow, i.e., from the input side towards the output side. The T-algorithm is based on the fact that the events associated with a gate can be evaluated independently of other gates for the whole simulation time period (in combinational circuits).
During simulation, the evaluation of each gate advances asynchronously. The basic principle of the T-algorithm is to carry out the evaluation of a gate for the whole simulation period for which the gate inputs are known, before the evaluation of the next gate commences. Prior to simulation, either level sorting or DF (data flow) sorting is used to find the order of gate evaluation. The level of a gate (cone) is its maximum distance, in terms of the number of gates (cones), from the primary inputs. The primary inputs are at level zero. In the case of combinational circuits, the primary input gates are the first target gates for evaluation. Among the gates whose inputs are available, the order of evaluation can be random or can depend on some criterion such as the number of fanins of the gate. Once these gates are evaluated, their outputs for the whole simulation period are known, and their fanout gates are selected for the next evaluation.
The process is carried out till all the gates in the circuit are evaluated.
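The evaluation order described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes zero gate delays, represents a waveform as a time-sorted list of (time, value) events, and uses hypothetical names (`evaluate_gate`, `levelize`, `simulate`) chosen only for this example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumption: a waveform is a time-sorted list of (time, value) events.
Waveform = List[Tuple[int, int]]
# Assumption: a netlist maps gate name -> (gate type, list of fanin names).
Netlist = Dict[str, Tuple[str, List[str]]]

GATE_FUNCS: Dict[str, Callable[[List[int]], int]] = {
    "AND": lambda v: int(all(v)),
    "OR": lambda v: int(any(v)),
    "NAND": lambda v: 1 - int(all(v)),
    "NOT": lambda v: 1 - v[0],
}

def value_at(wave: Waveform, t: int) -> int:
    """Value of a waveform at time t (last event at or before t)."""
    val = wave[0][1]
    for time, v in wave:
        if time > t:
            break
        val = v
    return val

def evaluate_gate(kind: str, inputs: List[Waveform]) -> Waveform:
    """Evaluate one gate for the whole simulation period (zero delay)."""
    times = sorted({t for w in inputs for t, _ in w})
    out: Waveform = []
    for t in times:
        v = GATE_FUNCS[kind]([value_at(w, t) for w in inputs])
        if not out or out[-1][1] != v:  # keep only real value changes
            out.append((t, v))
    return out

def levelize(netlist: Netlist, primary_inputs: Iterable[str]) -> Dict[str, int]:
    """Level = maximum distance from the primary inputs (which are level 0)."""
    level = {pi: 0 for pi in primary_inputs}

    def lvl(g: str) -> int:
        if g not in level:
            level[g] = 1 + max(lvl(f) for f in netlist[g][1])
        return level[g]

    for g in netlist:
        lvl(g)
    return level

def simulate(netlist: Netlist, stimuli: Dict[str, Waveform]) -> Dict[str, Waveform]:
    """Visit each gate exactly once, in level order, over the full period."""
    level = levelize(netlist, stimuli.keys())
    waves = dict(stimuli)
    for g in sorted(netlist, key=lambda name: level[name]):
        kind, fanins = netlist[g]
        waves[g] = evaluate_gate(kind, [waves[f] for f in fanins])
    return waves
```

Each gate is referenced once and produces its complete output waveform before any of its fanout gates is evaluated, which is the property the T-algorithm relies on.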
The principal advantage of the T-algorithm over event-driven simulation in terms of execution time is that, once a gate is evaluated, the same gate is not referenced again, which saves table lookup time. Because of this, the T-algorithm runs faster than the event-driven algorithm for most combinational circuits [2]. In the case of synchronous sequential circuits and circuits with short feedback loops, the efficiency of the T-algorithm drops depending on the interval length, but performance can still be improved using the two path simulation technique [2].
Parallel Logic Simulation
The overhead of communication is a major bottleneck in parallel/distributed processing. If the ratio of computation to communication is kept high, we can execute algorithms faster in a parallel processing environment. The following paragraphs describe how the T-algorithm is tuned to obtain maximum parallelism in a distributed environment. First, since the unit of communication in the T-algorithm is a sequence of events, rather than a single event as in the event-driven or compiled-code simulation algorithms, the communication cost is kept low. Second, the computation is increased, because a gate is evaluated for the whole simulation period and is referenced only once during the simulation process. For these reasons, T-algorithm based logic simulation performs well in a distributed environment. In addition, it is desirable to increase the computation by evaluating more gates per processor, and to reduce the communication time by communicating only the few gate outputs that are needed for subsequent computation. Fanout free region partitioning groups the gates into cones, so that the cluster of gates associated with each cone can be evaluated together, while the communication between worker task and master task is restricted to the input waveforms and the output waveform, along with the gate numbers associated with the cones. Further, we have carried out load balancing by dividing the cones among processors according to the number of gates associated with the cones, such that all processors share an equal number of gates for evaluation. This enabled us to increase the computation time and decrease the communication time. In our implementation, we have increased the computation/communication ratio further by evaluating a set of gates instead of a single gate, and by carrying out communication only once at the start and once at the end for the whole set of gates to be evaluated on a worker task [10].
The following subsections briefly describe the partitioning and the master-worker abstraction of the simulation algorithm.
Partitioning of Gates among Worker Tasks
As mentioned in section 2, we have to maintain a high computation to communication ratio. To achieve this, we initially partition the circuit into maximal fanout free regions (FFRs) in a preprocessing step. FFR partitioning is based on the principle that every maximal FFR output is either a primary output or a fanout stem, and every FFR input is either a primary input or a fanout branch. The sequence of operations carried out in this algorithm is explained with the help of pseudocode in Figure 1. The FFR cones are levelized, and all the cones in a particular level are evaluated in parallel. This technique effectively reduces the number of levels in the circuit, as shown in Table 2, increases the number of gates and hence the computation per processor, and decreases the communication among processors, as only the outputs of the FFR cones have to be communicated to the master processor. The decrease in the number of levels also reduces the number of simulation cycles, which in turn decreases the overall communication time. The partitioning strategy adopted in our present implementation is to assign the FFR cones in the same level to different processors in such a way that an almost equal number of gates is allotted to every processor, to maintain a good load balance among the processors, as shown in Table 1.
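The FFR principle stated above (a cone output is a primary output or a fanout stem) can be sketched as follows. This is an illustrative sketch rather than the pseudocode of Figure 1; the function names `ffr_partition` and `cone_levels`, and the netlist representation as a fanin dictionary, are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set

def ffr_partition(fanins: Dict[str, List[str]],
                  primary_outputs: Set[str]) -> Dict[str, List[str]]:
    """Return {cone root: gates of that maximal FFR}.

    Assumption: `fanins` maps every gate to its fanin names; primary
    inputs appear only as fanin names, never as keys.
    """
    fanouts: Dict[str, List[str]] = defaultdict(list)
    for gate, ins in fanins.items():
        for src in ins:
            fanouts[src].append(gate)

    def is_root(g: str) -> bool:
        # A cone output is a primary output or a fanout stem.
        return g in primary_outputs or len(fanouts[g]) != 1

    root_of: Dict[str, str] = {}

    def find_root(g: str) -> str:
        if g not in root_of:
            # A gate with a single fanout joins the cone of that fanout.
            root_of[g] = g if is_root(g) else find_root(fanouts[g][0])
        return root_of[g]

    cones: Dict[str, List[str]] = defaultdict(list)
    for g in fanins:
        cones[find_root(g)].append(g)
    return dict(cones)

def cone_levels(fanins: Dict[str, List[str]],
                cones: Dict[str, List[str]]) -> Dict[str, int]:
    """Level of a cone = 1 + highest level among the cones feeding it."""
    root_of = {g: r for r, gates in cones.items() for g in gates}
    level: Dict[str, int] = {}

    def lvl(root: str) -> int:
        if root not in level:
            feeders = {root_of[s] for g in cones[root] for s in fanins[g]
                       if s in root_of and root_of[s] != root}
            level[root] = 1 + max((lvl(f) for f in feeders), default=0)
        return level[root]

    for r in cones:
        lvl(r)
    return level
```

For a hypothetical four-gate netlist with g1 = f(a, b), g2 = f(g1, c), g3 = f(g2) and g4 = f(g2, d), where g3 and g4 are primary outputs, g1 is absorbed into the cone rooted at the stem g2. The circuit then has two cone levels instead of three gate levels, which is the kind of level reduction the partitioning aims for.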
Master and Worker Tasks
Both the worker and the master tasks read the circuit, build the required data structures, and evaluate cones. In addition, the master task partitions the circuit into cones using the FFR partitioning technique, and divides the cones that belong to the same level among the master and worker tasks. A brief description of the routines and communication constructs present in the pseudocode of the master task is as follows.
Randassign partitions the cones among the workstations. Form-buffer sets up the array of data for communication, and this array is subsequently communicated to the other workstations using the send and receive constructs. After communicating the sets of gates to the worker processors, the master task evaluates the gates assigned to it. After the evaluation, the master task receives the outputs from all workers and updates the circuit data structure. This process is carried out in a loop until all levels (of FFR cones) in the circuit are exhausted.

Referring to Table 1, the ten cones in level 2 are allocated such that the three processors get 4, 2 and 4 cones with 5, 4 and 4 gates respectively. In most cases, the cones allocated to each processor in a level keep the number of gates nearly equal. There are a few exceptions, such as level 9, where there is only one FFR cone of 54 gates, which could not be partitioned further. Load balancing can be further improved if we consider the number of inputs associated with each gate and the length of the input list during the assignment to the processors [12].

Table 2 gives the characteristics of the ISCAS85 benchmark circuits. Table 1 shows the partitioning of cones and gates among processors for a good load balance. The FFR cone generation time is constant irrespective of the number of processors on which the simulation is run, and depends only on the number of gates in the circuit. The preprocessing time is also independent of the length of the simulation time and of the number of times the simulation is run. We can clearly observe from Figure 2 that the execution time reduces sharply for larger circuits (e.g., c1908) as the number of processors increases. We also observed that our technique using FFR cones resulted in a good load balance and reduced communication among the workstations.
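The per-level assignment of cones to processors by gate count can be sketched as below. This is not the Randassign routine itself, whose details are not given here; a greedy "largest cone to the least-loaded processor" heuristic is substituted for illustration, and the cone names and sizes in the usage note are hypothetical.

```python
import heapq
from typing import Dict, List

def assign_cones(cone_sizes: Dict[str, int], num_procs: int) -> List[List[str]]:
    """Assign the cones of one level so that gate counts are nearly equal.

    Assumption: `cone_sizes` maps a cone name to its number of gates.
    Greedy heuristic: take cones from largest to smallest and give each
    to the processor with the fewest gates so far.
    """
    heap = [(0, p) for p in range(num_procs)]  # (gates assigned, processor)
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(num_procs)]
    ordered = sorted(cone_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for cone, size in ordered:
        load, p = heapq.heappop(heap)
        assignment[p].append(cone)
        heapq.heappush(heap, (load + size, p))
    return assignment

def loads(cone_sizes: Dict[str, int], assignment: List[List[str]]) -> List[int]:
    """Number of gates each processor received."""
    return [sum(cone_sizes[c] for c in cones) for cones in assignment]
```

With hypothetical sizes `{"c0": 5, "c1": 3, "c2": 3, "c3": 2, "c4": 1, "c5": 1}` and three processors, the processors receive different numbers of cones but five gates each. A single large cone, like the 54-gate cone mentioned for level 9, cannot be split by this scheme and leaves the other processors idle for that level.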
Results and Observations

Conclusions
In this paper we have described the suitability of the T-algorithm based logic simulation method on a network of workstations. We have chosen to partition by fanout free regions, in order to decrease the number of levels and at the same time increase the computation, by first forming clusters of gates and then associating clusters of cones with worker and master processors depending on the number of gates, to achieve a good load balance. (The processor running the master task is called the master processor, and the processors running the worker tasks are called worker processors.) As we can see from the table, the simulation time decreases with the increase in the number of processors. The future direction will be to modify the present implementation by refining the load balancing heuristic and optimizing the code in the partition and simulation modules to give still better performance.
