Abstract
Introduction
Unlike other verification methods, such as formal verification or hardware-assisted simulation, simulation provides full signal visibility and scales with design size. Simulation is also used to complement static methods, such as static timing analysis (STA) and formal verification (equivalence checking), for timing verification [1]. However, its major drawback, low speed, makes it difficult to apply to today's complex designs. Several approaches have been proposed to address this deficiency, such as hardware-assisted simulation acceleration [2], distributed parallel simulation [3], and simulation at a higher level of design abstraction [4]. This paper concentrates on the parallel approach as a means to improve overall simulation performance. A new approach to parallel HDL simulation is proposed, based on the concept of temporal parallel simulation rather than traditional, spatially distributed simulation. Unlike the spatial simulation method, the method described in this paper does not introduce any dependency among the partitions.
The synchronization and communication overhead due to inter-module communication and signal passing, characteristic of traditional distributed simulation, is completely eliminated. This important feature increases parallelism and maximizes the speedup ratio with respect to the number of simulation nodes.
In this paper we explain the concept of temporal parallel simulation and show that it is compatible with the accepted design and verification flows. We discuss technical issues to make this method practical and show some preliminary experimental results.
The proposed technique is applicable to full-timing simulation of designs of any size and type (with a single clock or with multiple, asynchronous clocks). However, for the purpose of this paper we confine its description to gate-level designs with a single clock. It is worth mentioning that gate-level timing simulation is still used, even for designs with a single clock, because STA alone is not efficient for timing verification due to gated clocks, false paths, multi-cycle paths, etc. [5].
State of the Art
A rich body of literature exists in the area of parallel simulation; however, most of the known work addresses traditional parallel simulation, which is based on physical partitioning of the design into modules that are distributed to individual simulators. We refer to this approach as spatial parallelism, since the simulation relies on partitioning the design in the spatial domain. This form of parallel simulation has been known since the late 1980s as Parallel Discrete Event Simulation (PDES) [6], [7], [8], [9], [10]. Such an approach to distributed simulation suffers from unavoidable communication and signal synchronization overhead between the modules, and from a lack of methods for partitioning the design efficiently enough to minimize this overhead. To the best of our knowledge, most current research in distributed HDL simulation is based on this approach. The results have been demonstrated on relatively small or well-structured designs that can be easily partitioned without incurring a large communication cost.
Li et al. [11] implemented a parallel Verilog simulator by adopting an Object-Oriented concept. However, its performance is still not acceptable because of the high synchronization and communication overheads for real, complex designs.
Zhu et al. [12] attempted to apply this approach to realistic designs and achieved a good event processing rate, with the number of events growing linearly in the number of processors. In spite of these promising results, the actual simulation performance can easily deteriorate, as it strongly depends on the design structure and the dynamic behavior of the design.
No clean solution has been proposed to address the traditional issues plaguing this approach, such as synchronization, message passing/sharing, and design partitioning. As a result, only a few commercial products have been developed for such spatial parallel simulation, including SimCluster [13] and MP-Sim [14]. To date, however, they have not attracted the attention of expert designers.
Temporal Parallel Simulation
The temporal parallel simulation proposed in this paper is a radical departure from the conventional parallel HDL simulation for gate-level timing simulation. In contrast to spatial parallelism, which partitions the design to be simulated, our temporal parallel simulation partitions the simulation run in time, by cutting the entire simulation run into a number of independent simulation slices. It consists of two major steps:
1. Fast, zero-delay reference simulation on a single processor that stores essential information at selected checkpoints; and
2. Full-timing target simulation, which is distributed to the individual processors.
The reference simulation runs at the zero-delay gate level (GL) of design abstraction, while the target simulation runs at the full-timing GL, using annotated SDF (Standard Delay Format). Usually zero-delay gate-level simulation is 10 to 50 times faster than full-timing gate-level simulation. This large difference in simulation speed makes it possible for the proposed approach to achieve a significant speedup. The basic concept of this new technique is shown in Figure 1.
Both steps of the simulation work on the entire design under test (DUT), while the entire simulation run is divided into simulation slices, each to be executed on an independent simulator. For this approach to work, the initial design state for each slice of the target simulation must be captured and saved during the first (reference) run. This is done at predetermined checkpoints, whose number is determined by the number of processors available for parallel simulation. The design state consists of the state of all internal registers and the memory image of the design. By restoring the design states, the slices can be made independent of one another. As a result, the target simulation can run concurrently and independently for each slice.
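The idea of checkpointed, independent slices can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `dut_step` is a hypothetical toy stand-in for one clock cycle of the DUT, the stimulus is a pure function of the cycle number (so no testbench state is involved), and the same toy model plays the role of both the reference and the target simulation.

```python
from typing import Dict, List, Set, Tuple


def slice_bounds(total_cycles: int, num_slices: int) -> List[Tuple[int, int]]:
    """Split [0, total_cycles) into num_slices contiguous, near-equal slices."""
    if num_slices <= 0 or total_cycles < num_slices:
        raise ValueError("need 1 <= num_slices <= total_cycles")
    base, extra = divmod(total_cycles, num_slices)
    bounds = []
    start = 0
    for i in range(num_slices):
        length = base + (1 if i < extra else 0)
        bounds.append((start, start + length))
        start += length
    return bounds


def dut_step(state: int, cycle: int) -> int:
    """Hypothetical toy DUT: next register state for one clock cycle."""
    stimulus = (cycle * 7) % 13  # stimulus depends only on the cycle number
    return (state * 5 + stimulus + 1) % 251


def reference_run(total_cycles: int, checkpoints: Set[int]) -> Dict[int, int]:
    """Single sequential run that saves the DUT state at each checkpoint."""
    saved = {}
    state = 0
    for cycle in range(total_cycles):
        if cycle in checkpoints:
            saved[cycle] = state  # state at the start of this cycle
        state = dut_step(state, cycle)
    return saved


def run_slice(start: int, end: int, initial_state: int) -> List[int]:
    """Simulate cycles [start, end) from a restored state; needs no other slice."""
    trace = []
    state = initial_state
    for cycle in range(start, end):
        state = dut_step(state, cycle)
        trace.append(state)
    return trace


bounds = slice_bounds(100, 4)
saved = reference_run(100, {start for start, _ in bounds})
stitched = [v for start, end in bounds for v in run_slice(start, end, saved[start])]
```

Because each call to `run_slice` depends only on its own restored state, the calls could be dispatched to separate processors in any order; in this toy setting the stitched trace equals the trace of one uninterrupted run.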
Testbench Forwarding
While the design state of the DUT can be stored at any point during the reference simulation, the state of the testbench cannot be similarly captured (due to its sequential nature and the fact that the state of the testbench is not clearly defined). In order to maintain the correct testbench state at the restoring checkpoints, the testbench must be simulated from the beginning of the simulation time to the starting point of each simulation slice. We refer to this simulation as testbench forwarding. It is a fast, testbench-only simulation needed to provide the correct state of the testbench for target simulation for each simulation slice.
Testbench forwarding is implemented as follows. The values of the output ports of the DUT, saved continuously during the reference simulation, serve as a stimulus provider (a dummy DUT) for the testbench simulation. The testbench is simulated with this stimulus from time 0 up to the starting point of the simulation slice in question. At this point the design state is restored from the data stored at the checkpoint and the dummy DUT is replaced by the original DUT; each slice is then simulated normally and independently of the other slices.
Because testbench forwarding is a fast, testbench-only simulation, it adds little overhead to the target simulation time of a full-timing gate-level simulation.
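The forwarding mechanism can be sketched with a toy model. The `Testbench` class, the `dut_step` function and their arithmetic are hypothetical and chosen only so that the testbench carries internal state that depends on past DUT outputs; a real flow would use an HDL simulator and the recorded port values instead.

```python
from typing import List, Optional, Tuple


class Testbench:
    """Hypothetical stateful testbench that reacts to DUT outputs."""

    def __init__(self) -> None:
        self.sent = 0
        self.acks = 0

    def step(self, dut_out: int) -> int:
        """Consume the previous DUT output and produce the next stimulus."""
        if dut_out & 1:
            self.acks += 1
        self.sent += 1
        return (self.sent * 3 + self.acks) % 16


def dut_step(state: int, stim: int) -> Tuple[int, int]:
    """Hypothetical toy DUT: returns (next state, output port value)."""
    new_state = (state * 5 + stim + 1) % 251
    return new_state, new_state % 7


def reference_run(cycles: int, checkpoint: int) -> Tuple[List[int], Optional[int]]:
    """Full run that records every DUT output and the state at the checkpoint."""
    tb = Testbench()
    state, out = 0, 0
    outputs: List[int] = []
    saved_state = None
    for t in range(cycles):
        if t == checkpoint:
            saved_state = state
        stim = tb.step(out)
        state, out = dut_step(state, stim)
        outputs.append(out)
    return outputs, saved_state


def run_slice(recorded: List[int], saved_state: int,
              checkpoint: int, cycles: int) -> List[int]:
    """Forward the testbench against recorded outputs, then simulate the slice."""
    tb = Testbench()
    out = 0
    for t in range(checkpoint):
        tb.step(out)       # stimulus is discarded: the dummy DUT ignores it
        out = recorded[t]  # dummy DUT replays the recorded output
    state = saved_state    # restore the DUT state and swap in the real DUT
    trace = []
    for t in range(checkpoint, cycles):
        stim = tb.step(out)
        state, out = dut_step(state, stim)
        trace.append(out)
    return trace


outputs, saved = reference_run(40, 25)
slice_trace = run_slice(outputs, saved, 25, 40)
```

In this toy setting the slice trace matches the corresponding part of the full run, because the forwarded testbench reaches the same internal state it had at the checkpoint in the reference run.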
Possible State Mismatches
The design state is restored and latched for the purpose of target simulation at the same checkpoints at which it was saved during the reference simulation. However, because delays in the target simulation can cross the slice boundary, the discrepancy between the zero-delay reference simulation and the final full-timing simulation may cause the restored state to be incorrect. This situation is illustrated in Figure 2(a).
In order to address this issue, a slice overlapping technique is used, as shown in Figure 2(b). Slices n-1 and n are allowed to share some amount of the simulation period around the checkpoint. Since the mismatch happens at the beginning of Slice n, that period is discarded for Slice n. The correct simulation result for that period is taken from Slice n-1. For this technique to be effective, the overlap period must cover the entire mismatch period.
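The overlap bookkeeping can be sketched as follows. This is only an illustration under stated assumptions: time is counted in clock cycles, slices are of equal base length (the last one absorbs any remainder), and the function name `overlapped_windows` and the fixed `overlap` parameter are hypothetical. How large the overlap must be to cover the mismatch period depends on the design and is not addressed here.

```python
from typing import List, Tuple


def overlapped_windows(total_cycles: int, num_slices: int,
                       overlap: int) -> List[Tuple[int, int, int, int]]:
    """Return (sim_start, sim_end, keep_start, keep_end) for each slice.

    Slice n is simulated from its checkpoint and runs `overlap` cycles past
    the next checkpoint. The first `overlap` cycles of every slice except
    the first are discarded; they are covered by the preceding slice.
    """
    if num_slices <= 0 or overlap < 0:
        raise ValueError("num_slices must be positive and overlap non-negative")
    base = total_cycles // num_slices
    if overlap >= base:
        raise ValueError("overlap must be shorter than a slice")
    windows = []
    for n in range(num_slices):
        start = n * base
        is_last = n == num_slices - 1
        sim_end = total_cycles if is_last else (n + 1) * base + overlap
        keep_start = start if n == 0 else start + overlap
        windows.append((start, sim_end, keep_start, sim_end))
    return windows


windows = overlapped_windows(100, 4, 5)
# The kept windows tile the whole run without gaps or double coverage.
kept = [c for _, _, ks, ke in windows for c in range(ks, ke)]
```

With these numbers, slice 1 is simulated over cycles 25 to 54 but contributes only cycles 30 to 54 to the final result; cycles 25 to 29 come from slice 0.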
Predictive Performance Analysis
The following equations show the expected total simulation time of our approach as a function of the conventional, stand-alone simulation time and the overhead introduced by temporal parallelism. Analysis of the testbench forwarding term shows that the TB simulation for later slices (X closer to N) takes longer than for earlier slices (X closer to 0), because the testbench must be simulated from simulation time 0 up to slice X-1. However, we still expect temporal simulation to achieve high performance, since $T_{\text{0-delay}}$ and $T_{\text{TB}}$ are very small compared to $T_{\text{full-tsim}}$. Furthermore, by selecting a sufficiently large value of N, it is possible to have $T_{\text{tot}} \ll T_{\text{full-tsim}}$, which is the main goal of this method.
Note that $T_{\text{save}}$ is very small, because fetching the design state at particular simulation points incurs little overhead. $C_{\text{stimu}}$ is also small because, although stimulus capturing happens continuously, it is done only for output ports. Therefore, increasing N results in only a small increase of $T_{\text{ref}}$. On the other hand, $T_{\text{target}}$ can be reduced significantly by increasing N, because $T_{\text{full-tsim}}$ is large. A significant shortening of simulation time, $T_{\text{tot}} \ll T_{\text{full-tsim}}$, can therefore be achieved for practical designs.
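As a rough numerical illustration of the terms discussed above, the sketch below assumes one plausible cost model. The functional form is an assumption and may differ from the paper's equations: the reference time is taken as the zero-delay time inflated by the stimulus-capture factor plus one state save per slice, the target time as the slowest slice (testbench forwarding up to slice X-1 plus one N-th of the full-timing run), and the overlap period is ignored. All parameter names are hypothetical.

```python
def total_time(t_full: float, t_zero: float, t_tb: float,
               t_save: float, c_stimu: float, n: int) -> float:
    """Assumed model of the total time for temporal parallel simulation.

    t_full  : full-timing simulation time of the whole run on one simulator
    t_zero  : zero-delay simulation time of the whole run
    t_tb    : testbench-only simulation time of the whole run
    t_save  : time to save the design state at one checkpoint
    c_stimu : relative overhead of continuously capturing DUT outputs
    n       : number of slices (one simulator per slice)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Reference run: sequential, on a single processor.
    t_ref = t_zero * (1.0 + c_stimu) + n * t_save
    # Target run: slices execute concurrently, so the slowest slice dominates.
    t_target = max(t_tb * (x - 1) / n + t_full / n for x in range(1, n + 1))
    return t_ref + t_target


def speedup(t_full: float, t_zero: float, t_tb: float,
            t_save: float, c_stimu: float, n: int) -> float:
    """Speedup over the stand-alone full-timing simulation under this model."""
    return t_full / total_time(t_full, t_zero, t_tb, t_save, c_stimu, n)
```

Under this assumed model the sequential reference run bounds the achievable speedup: as N grows, the total time approaches the reference time plus the testbench forwarding time, so the ratio between zero-delay and full-timing simulation speed matters as much as the number of simulators.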
Experimental Results

Experiment 1
The experiment was carried out with the help of designers from a major semiconductor corporation on an industrial 18M-gate design. The design was simulated with Cadence NC-Verilog simulators, running on SUN machines. Despite a small (10:1) speedup ratio between the zero-delay simulation and full-timing GL simulation, the results showed an expected linear (6×) speedup with 10 simulators on a simulation run of 72,600,000 cycles.
Experiment 2
In this experiment an S1_Core design from Simply RISC [15] was used. The design was elaborated and simulated with Cadence NC-Verilog simulators, running on PCs with Linux. The design has 1.2 million gates and contains one 64-bit SPARC core, a Wishbone bridge, a reset controller and a basic interrupt controller. The following shows the characteristics of the traditional, single-processor simulation and the zero-delay simulation used as reference simulation.
• Single-processor Simulation (full GL timing)
The results of the simulation are shown in Table 1. As discussed earlier (refer to the equations), the time to simulate individual slices during the target simulation differs slightly among the slices. This is due to the time needed to save/restore the testbench, with the restore time increasing with the number of slices. The best simulation performance is attributed to the first slice: the initial condition for the first slice does not require the design state to be restored, and the testbench does not need to be simulated to recover the state. From the second slice on, the simulation may be slower. However, the plots in Figures 3 and 4 show that the overhead due to testbench forwarding is small, and the last slice is simulated almost as fast as the first one. Furthermore, the simulation overhead due to testbench forwarding is not a function of the number of slices but a function of the simulation time. Therefore, for gate-level full-timing simulation, which requires reasonably long simulation runs, the total simulation overhead of temporal parallel simulation remains small, regardless of the number of slices.
Conclusions and Future Work
A radical solution that completely eliminates communication and synchronization overhead in a distributed parallel simulation environment for full-timing gate-level simulation has been proposed. This is accomplished by performing temporal partitioning of the simulation run, instead of spatial partitioning of the design. For long simulation runs, a linear speedup can be obtained; this is not achievable in traditional (spatial) parallel simulation, due to the unavoidable inter-simulator communication and synchronization overhead.
In addition, our method makes it possible to detect timing violations easily by comparing the results of the target simulation and the reference simulation at the clock cycle boundary, since the two simulations should produce the same result. This is an additional strong point of our approach: the comparison can be readily automated, so the violating path can be detected without human intervention.
The proposed approach is also applicable to designs with multiple asynchronous clocks. By applying a proper delay on a clock domain crossing (CDC) wire during the reference simulation, we believe consistency can be maintained for designs with multiple clocks. As a result, our approach might be applicable to general designs; this extension is left for future research.
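The comparison at clock cycle boundaries can be sketched as below. The data layout is an assumption: each simulation is represented as a list with one dictionary per clock boundary, mapping hypothetical signal names to sampled values. The sketch only reports where the two runs disagree; tracing a mismatch back to the violating path is not shown.

```python
from typing import Dict, List, Tuple

Sample = Dict[str, int]  # signal name -> value sampled at a clock boundary


def find_mismatches(reference: List[Sample],
                    target: List[Sample]) -> List[Tuple[int, str, int, int]]:
    """Return (cycle, signal, reference value, target value) for each mismatch."""
    mismatches = []
    for cycle, (ref, tgt) in enumerate(zip(reference, target)):
        for signal in sorted(ref):
            if signal in tgt and ref[signal] != tgt[signal]:
                mismatches.append((cycle, signal, ref[signal], tgt[signal]))
    return mismatches


# Hypothetical samples: register q fails to capture its new value in cycle 1
# of the full-timing run, which would point to a candidate timing violation.
reference_samples = [{"q": 0, "r": 1}, {"q": 1, "r": 1}]
target_samples = [{"q": 0, "r": 1}, {"q": 0, "r": 1}]
report = find_mismatches(reference_samples, target_samples)
```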
