Abstract
Introduction
VLSI logic simulation is used extensively in the design verification of digital systems. With the advent of Application Specific Integrated Circuits (ASICs) into widespread commercial use, the need for design verification has increased dramatically. The ASIC designer does not have the luxury of breadboarding a design, testing for design errors, and then modifying the breadboard prototype to correct for the errors. Given the high costs associated with individual fabrication runs for integrated circuits, the designer must have a high degree of confidence in the correctness of the design prior to fabrication. Currently, logic simulation is the primary design tool that fulfills this need.
Figure 1 illustrates a typical design flow for an ASIC development. The design is originally specified in a hardware description language (HDL). HDLs in common use are Verilog and VHDL. This original description of the system is at the user level. The user-level system description is simulated in the iterative design cycle on the right-hand side of the figure until the designer is satisfied with the functionality of the system. At this point, the user-level HDL description is input to the synthesis and place-and-route tools, yielding a gate-level description of the system. This gate-level system description is simulated in the iterative design cycle on the left-hand side of the figure until the designer is completely satisfied that both functionality and timing are correct. Finally, the system is fabricated and tested.
This example design flow illustrates the important role that simulation plays in the design process. It is not uncommon to have long simulation runs (measured in hours or even days) be a significant bottleneck in the design process. As a result, improvements in simulator performance can provide significant benefits to the ASIC design community.
One approach to performance improvement is the use of parallelism. The research community has been investigating the parallel execution of discrete-event simulation models in general [5] and logic simulation models in particular [1] for more than a decade. Yet, in spite of these efforts, production logic simulators today are almost exclusively executed on uniprocessor systems. We believe that the lack of adoption of parallel implementations by the commercial community has two primary causes:
1. Historically, parallel computer systems were expensive to acquire and difficult to maintain. This impediment to adoption has clearly been diminished in recent years with the availability of relatively inexpensive symmetric multiprocessor systems from a number of manufacturers.
2. Parallel simulation performance results have often been inconsistent. Lin and Lazowska [11] coined the term "S phenomenon" to describe the observation that speedup curves for optimistic, asynchronous simulators often have several local minima and maxima. This observation was made over a large set of different simulation applications [4, 8, 10, 20].
In this paper, we describe an approach that is aimed at addressing the second cause listed above. We are proponents of the use of synchronous algorithms for parallel simulation, specifically aimed at the need for consistent performance across a range of individual circumstances. This paper provides a performance analysis of the use of speculative computation in a synchronous simulation environment, with the target simulation application of VLSI logic design.
In the section that follows, we describe what is meant by speculative, synchronous simulation. Section 3 describes the performance evaluation methodology, based upon trace-driven simulation using data derived from the ISCAS-89 benchmark set. Section 4 provides the performance predictions and Section 5 concludes the paper.
Speculative, Synchronous Simulation
Speculative computation has received a great deal of attention in the parallel computing community as a technique for balancing computational load and masking latencies in interprocessor communications [SI. It has even found its way into processor design at the instruction level, with proposals for value prediction, speculating the results of individual instructions or basic code blocks [9]. In discrete-event simulation, parallel algorithms that perform computation in a speculative manner are generally referred to as optimistic algorithms.
While both synchronous and asynchronous optimistic algorithms exist, our interest is in synchronous algorithms. This is due to a desire to avoid the inconsistent (and sometimes inexplicable) performance associated with many asynchronous protocols. In addition, synchronous algorithms have an inherent simplicity and ease of implementation that is not present in asynchronous techniques.
As with many other algorithms, there is a tradeoff between simplicity and performance; the simplicity of the synchronous algorithm comes with a potential cost in performance. If frequent synchronizations are required, the algorithm becomes more fine grained. Since the critical path lies with the slowest processor at each iteration, idle time can accumulate at the other processors, and the total execution time is bounded below by the sum, over iterations, of the execution time of the slowest processor in each iteration. In an attempt to alleviate these performance concerns for synchronous discrete-event simulation, techniques used in asynchronous simulation algorithms (e.g., speculative computation) can be applied to the synchronous algorithm, while retaining the iterative nature of the algorithm. Figure 2 illustrates a typical set of iterations of a synchronous iterative algorithm executing on four processors (labeled 1 through 4). An iteration can be seen as consisting of 3 phases:
1. Computation - performing the computational tasks associated with the application.
2. Idle - time between first and last processor to complete work in an iteration.
3. Synchronization - time to complete the barrier synchronization operation.
Computation starts on all processors immediately following the barrier synchronization. During this phase, each processor executes all the tasks assigned to it that iteration. For discrete-event simulation, this consists of processing simulation events. Interprocessor data communication may be concurrent with computation. At the end of the computation phase, each processor enters a barrier and waits for its completion. The idle phase is a result of variation in computation times between processors due to imbalances in workload as the algorithm progresses, multitasking other unrelated processes (background load), or processor heterogeneity.
Synchronization time is determined by the communication performance of the parallel platform in completing the barrier synchronization. After the barrier synchronization completes, the processors proceed to the next iteration, repeating the cycle until the algorithm completes.
Speculative computation utilizes the idle phase of the above algorithm by allowing processing to proceed into future iterations. While waiting for the barrier synchronization to complete, computation progresses speculatively, with the hope that a message arrival from a remote processor does not subsequently invalidate the computation. Once the barrier synchronization is complete, the speculated computation is tested for correctness and either committed or discarded.
To support processing during the execution of the barrier synchronization, a fuzzy barrier implementation is used [7]. Processors signal their willingness to complete the barrier and, rather than blocking, proceed to compute speculatively. A "barrier complete" signal indicates the end of the current iteration. The execution time line, illustrated in Figure 3, shows the speculative computation occurring during the idle times and while waiting for the barrier to complete. Note that the time required to complete iteration 2 is less than in the previous figure, since some of the computation has been completed during the otherwise idle time of iteration 1. Mehl [12] presents a performance model for a similar algorithm that predicts performance gains over a purely conservative synchronous algorithm. Steinman's Breathing Time Buckets algorithm [19] has been implemented in the SPEEDES environment and exhibits good performance on a pair of simulation models (queueing networks and proximity detection).
Performance Evaluation Methodology
The set of variables used in the performance model is summarized in Table 1. A more complete definition of each variable is given in the text near the first use of the variable.
Table 1. Variables used in the performance model (partial).
$v_{i,j}$: work to be completed during iteration i on processor j, with speculation
$s_{i,j}$: work available to speculate during iteration i on processor j
$r$: speculation success ratio: fraction of speculative computation that is correct
The execution time of a synchronous iterative algorithm requiring I iterations and running on a set of P processors can be modeled by a function consisting of three distinct terms. In any particular iteration i, there is a portion of work which is serial in nature. The first term represents the time required to complete the portion which cannot be parallelized, which we denote as $t_{s,i}$. Each processor j has some assigned work to be performed during the parallel portion of each iteration. The time for each processor to complete this work is denoted by $t_{p,i,j}$. However, the parallel portion of each iteration is not complete until the last processor completes its assigned work. This gives the second term as $\max_{1 \le j \le P}(t_{p,i,j})$. Finally, the time required for the overheads associated with the parallel algorithm itself during iteration i is denoted by $t_{ov,i}$.
Combining these terms gives a model for the total execution time, written here as $T$:
$$T = \sum_{i=1}^{I} \left[ t_{s,i} + \max_{1 \le j \le P}(t_{p,i,j}) + t_{ov,i} \right] \qquad (1)$$
Previous work has shown this model to be effective for estimating run time for several different types of synchronous iterative algorithms [18], including quantifying the performance impact of speculative computation in discrete-event simulation both with uncorrelated [15] and correlated [16] parallel workloads.
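The three-term model described above (serial time, plus the slowest processor's parallel time, plus overhead, summed over iterations) can be sketched in a few lines of Python. The function name and the sample numbers are hypothetical and chosen only for illustration; they are not taken from the benchmark data.

```python
def execution_time(t_serial, t_parallel, t_overhead):
    """Total run time of a synchronous iterative algorithm.

    t_serial[i]      : serial time in iteration i
    t_parallel[i][j] : parallel time of processor j in iteration i
    t_overhead[i]    : parallel overhead in iteration i
    """
    return sum(
        ts + max(tp_row) + tov  # an iteration ends when the slowest processor ends
        for ts, tp_row, tov in zip(t_serial, t_parallel, t_overhead)
    )


# Hypothetical example: I = 3 iterations on P = 2 processors.
t_serial = [1, 1, 1]
t_parallel = [[4, 6], [5, 3], [2, 2]]
t_overhead = [1, 1, 1]
total = execution_time(t_serial, t_parallel, t_overhead)
```

In this example the faster processor is idle for 2 time units in each of the first two iterations, which is the time that speculative computation tries to reclaim.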
Speculative Workload Characterization
The expressions in this subsection were originally presented in [15]. The development is repeated here for clarity of understanding in the performance predictions that follow.
For the VLSI logic simulation applications we are interested in, both the serial term $t_{s,i}$ and the parallel overhead term $t_{ov,i}$ in (1) do not vary significantly between iterations. As such, these terms can be treated as constants. We focus on characterizing $E[\max_{1 \le j \le P} t_{p,i,j}]$ for both the initial workload without speculative computation and the resulting workload with speculative computation.
Let us define $w_{i,j}$ as the work to be completed on processor j during iteration i without speculative computation. Assuming the units of work are relatively constant in time (e.g., event evaluations with similar computational complexity), $t_{p,i,j}$ will be proportional to $w_{i,j}$. When speculative computation is performed, it will take work away from future iterations whenever a processor completes before the barrier synchronization. We will define this new workload as $v_{i,j}$, the work to be completed on processor j during iteration i with speculation. In this case, $t_{p,i,j}$ is proportional to $v_{i,j}$ which, by definition, must be less than or equal to $w_{i,j}$.
The development of $v_{i,j}$ from $w_{i,j}$ can be made by examining a specific iteration, i, of a synchronous algorithm that incorporates speculative computation. Examining Figure 3, we determine that the work to be completed during iteration i is
$$\max_{1 \le j \le P}(v_{i,j})$$
Therefore, the amount of work that can be speculated on processor j during iteration i is given by
$$s_{i,j} = \min\left( \max_{1 \le k \le P}(v_{i,k}) - v_{i,j},\; w_{i+1,j} \right) \qquad (3)$$
provided we limit speculation to one iteration into the future. This yields a recursive formulation that relates $v_{i+1,j}$ to $v_{i,j}$:
$$v_{i+1,j} = w_{i+1,j} - r\, s_{i,j} \qquad (4)$$
with initial condition
$$v_{1,j} = w_{1,j} \qquad (5)$$
The scalar $r$ that is introduced in (4) represents the speculation success ratio, or the fraction of the speculated work that was successfully committed. Substituting (3) into (4) gives:
$$v_{i+1,j} = w_{i+1,j} - r \min\left( \max_{1 \le k \le P}(v_{i,k}) - v_{i,j},\; w_{i+1,j} \right) \qquad (6)$$
These expressions can be used to empirically evaluate $v_{i,j}$ for specific instances where $w_{i,j}$ is known. Essentially, the evaluation is a form of trace-driven simulation, where trace data (which can be collected from a serial simulation execution) provides the information for $w_{i,j}$ and the repeated evaluation of (6) provides $v_{i,j}$.
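A minimal sketch of this trace-driven evaluation is given below. It assumes that the work speculated on a processor equals its idle time in the current iteration, capped at the work available in the next iteration (speculation limited to one iteration ahead), and that a fraction r of it is committed. The function name `speculate` and the small trace are hypothetical.

```python
def speculate(w, r=1.0):
    """Derive the workload with speculation, v, from the trace w.

    w[i][j] : work on processor j in iteration i without speculation
    r       : speculation success ratio (fraction of speculated work committed)
    """
    v = [list(w[0])]  # initial condition: the first iteration is unchanged
    for i in range(len(w) - 1):
        slowest = max(v[i])  # work that determines the length of iteration i
        next_row = []
        for j, w_next in enumerate(w[i + 1]):
            # Idle time on processor j, capped at the next iteration's work
            # (assumption: speculation reaches only one iteration ahead).
            s = min(slowest - v[i][j], w_next)
            next_row.append(w_next - r * s)
        v.append(next_row)
    return v


# Hypothetical trace: 3 iterations on 2 processors.
w = [[4, 8], [6, 2], [3, 3]]
v_full = speculate(w, r=1.0)
v_half = speculate(w, r=0.5)
```

With r = 1.0, processor 1 uses its 4 idle units in the first iteration to remove 4 units from its second iteration; with r = 0.5, only 2 of those units are committed.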
Model Evaluation
In [15] and [16], the above expressions were used to guide the development of a stochastic workload model that describes the steady state distribution of the computational workload both with and without speculation. Here, we are interested in using (2) and (6) directly on empirical workload data. The empirical data, $w_{i,j}$, is evaluated using (6) to determine $v_{i,j}$. This evaluation constitutes a trace-driven simulation of the speculative execution algorithm. An example trace of $v_{i,j}$ is shown in Figure 5 for circuit s9234 simulated on four processors (P = 4) assuming a success ratio of r = 1.0.
Once trace data is available that represents the computational workload both with and without speculation, the central term in (2) can be estimated by evaluating the expectation (across iterations) of the maximum (across processors) of the workload. These values are tabulated in Table 2 for two different values of the speculation success ratio r. Also given in the table is the percentage performance improvement predicted due to the use of speculative computation.
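The estimate described here, the mean across iterations of the per-iteration maximum across processors, and the resulting percentage improvement can be sketched as follows. The helper names and the two small workloads are hypothetical; the values are not from Table 2.

```python
def mean_max(workload):
    """Mean over iterations of the maximum workload over processors."""
    return sum(max(row) for row in workload) / len(workload)


def percent_improvement(w, v):
    """Percentage decrease in expected workload per iteration due to speculation."""
    base = mean_max(w)
    return 100.0 * (base - mean_max(v)) / base


# Hypothetical workloads without (w) and with (v) speculation.
w = [[4, 8], [6, 2], [3, 3]]
v = [[4, 8], [2, 2], [3, 3]]
gain = percent_improvement(w, v)  # 100 * (17/3 - 13/3) / (17/3)
```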
Speculation Success Ratio
The above techniques provide predicted performance for a speculative, synchronous simulation execution, provided that the speculation success ratio, r, is given. In this subsection, we address two issues associated with the speculation success ratio: 1) whether or not a mean value (i.e., scalar) model is sufficient, and 2) what is an appropriate value for VLSI logic simulation.
To address the first issue, we model the speculation success ratio, r, as a random variable, and investigate the sensitivity of the performance results to various distributions for r, each with the same mean. Consider the following four distributions: uniform, binomial, triangle, and Gamma. The density functions of each of these distributions are shown in Figure 6, where each distribution is shown with a mean value of 0.5.
To evaluate the sensitivity of the logic simulator performance to the distribution of the speculation success ratio, the trace-driven simulation is executed again with a probabilistic model for r. Each time (6) is evaluated, a sample is drawn for r from the desired distribution. The resulting values for $E[\max_{1 \le j \le P} v_{i,j}]$ are plotted in Figures 7 and 8. The deterministic value for r is plotted along with various distributions, each for a variety of logic circuits.
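The probabilistic variant can be sketched by drawing a fresh r for every evaluation of the recursion. The sketch below assumes the same capped idle-time rule for the speculated work and compares only a deterministic r with uniform and triangle distributions of mean 0.5; the distribution parameters, the seed, and the trace are assumptions made for illustration, and the binomial and Gamma cases are omitted.

```python
import random


def speculate_random_r(w, draw_r):
    """Trace-driven evaluation with r sampled anew for each (iteration, processor)."""
    v = [list(w[0])]
    for i in range(len(w) - 1):
        slowest = max(v[i])
        row = []
        for j, w_next in enumerate(w[i + 1]):
            s = min(slowest - v[i][j], w_next)  # work available to speculate
            r = draw_r()                        # fresh sample per evaluation
            row.append(w_next - r * s)
        v.append(row)
    return v


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samplers = {
    "deterministic": lambda: 0.5,
    "uniform": lambda: rng.uniform(0.0, 1.0),           # mean 0.5
    "triangle": lambda: rng.triangular(0.0, 1.0, 0.5),  # mean 0.5
}

# Hypothetical trace: 4 iterations on 2 processors.
w = [[4, 8], [6, 2], [3, 3], [7, 1]]
results = {name: speculate_random_r(w, f) for name, f in samplers.items()}
```

Comparing the mean of the per-iteration maximum across the entries of `results` gives the kind of sensitivity comparison described in the text.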
The primary observation from these plots is how little the results change when a probabilistic model for r is used rather than the deterministic model. The maximum deviation is under 10% across the range of experiments. It should be noted that the deterministic model is consistently optimistic (i.e., smaller) with respect to the probabilistic models, and that the binomial distribution (likely the most realistic of the distributions) consistently gives the most pessimistic (i.e., largest) results.
Given that there is some impact on predicted performance due to the distribution of the speculation success ratio, we wish to determine a reasonable distribution for the VLSI logic simulation application. In [13], we evaluated the effectiveness of the standard optimistic assumption at predicting the future message content on a communications channel. For a two-valued logic simulation, this prediction is equivalent to whether or not a speculated event will be committed. As a result, it is reasonable to use this data to derive an estimate of the true distribution of the speculation success ratio. Figure 9 shows a histogram of the message prediction accuracy across communications channels. The data comes from the latch outputs of all of the ISCAS-89 circuits (a total of 9696 channels). Here, we assume that this data generalizes to all of the functional evaluations in the simulation.
The trace-driven simulation now proceeds as follows. At each iteration, (3) is evaluated to determine $s_{i,j}$, the number of speculated events. For each event in $s_{i,j}$, a random sample is drawn from the distribution of Figure 9. This sample determines the probability p that the event is committed. This probability p is used to control a Bernoulli trial, the outcome of which determines whether or not the event is committed. Once it is known how many of the events are committed, this quantity is used for $r s_{i,j}$, the second term in (4). Evaluation of (4) then follows, and the simulation of one iteration is complete.
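One iteration of this procedure can be sketched as below. The list `accuracy_samples` is a hypothetical stand-in for the measured distribution of Figure 9 (its values are invented), workloads are assumed to be integer event counts, and the speculated work is assumed to be the idle time capped at the next iteration's work.

```python
import random


def committed_events(s, draw_p, rng):
    """Count how many of s speculated events are committed."""
    count = 0
    for _ in range(s):
        p = draw_p()          # commit probability sampled for this event
        if rng.random() < p:  # Bernoulli trial with success probability p
            count += 1
    return count


def step(v_row, w_next_row, draw_p, rng):
    """Advance one iteration: return the next row of v (integer event counts)."""
    slowest = max(v_row)
    out = []
    for v_ij, w_next in zip(v_row, w_next_row):
        s = min(slowest - v_ij, w_next)  # number of speculated events
        out.append(w_next - committed_events(s, draw_p, rng))
    return out


rng = random.Random(1)
# Hypothetical prediction accuracies, NOT the data behind Figure 9.
accuracy_samples = [0.2, 0.5, 0.8, 0.9, 1.0, 1.0]
draw_p = lambda: rng.choice(accuracy_samples)
v_next = step([4, 8], [6, 2], draw_p, rng)
```

Here the first processor speculates 4 events, of which a random number commit; the second processor is the slowest in the iteration and therefore speculates nothing.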
Performance Predictions
Here, we use the performance evaluation methodology described in the previous section and apply it to a subset of the ISCAS-89 benchmark circuits. Table 3 lists the specific circuits used, their size (component count), and the mean workload per iteration (event count) for a serial execution. These specific circuits were chosen because of their size (i.e., they represent a range across the available choices).
Figure 9. Distribution of successful predictions for latch output channels across the ISCAS-89 benchmarks [13].
The expected workload per iteration is plotted for the circuits listed above, both with speculative computing and without speculative computing. The largest circuit (s38584) is shown separately due to the differences in scale. Table 4 gives the percentage improvement (i.e., decrease in expected workload per iteration) due to speculative computation relative to the synchronous algorithm without speculation. As one would predict, the performance improvement is greater for larger numbers of processors, due to the increased variability in the original workload and, therefore, the increased opportunity for speculative computation. The smaller percentage increases associated with the largest circuit (s38584) will be understood upon examination of the table that follows (Table 5).
Table 4. Performance gain due to speculative computation.
In order to evaluate the overall gains due to parallelism, one must estimate the remaining terms in (2), the serial processing time and the parallel overhead. Table 5 gives the predicted speedup vs. a serial implementation assuming that the sum of the serial and parallel overhead processing each iteration is equivalent to a single event evaluation. Clearly this assumption is not valid for every parallel execution platform, but it is reasonable for tightly-coupled, shared-memory architectures on which a barrier synchronization is relatively inexpensive. Examining the data in the table, it is clear that the overall performance of the speculative computing technique is quite favorable, with parallel efficiencies of about 50% or higher across the board. The lowest performance is associated with the smallest circuits, and the highest performance is associated with the largest circuits. The lower than average percentage improvements due to speculative computation shown in Table 4 for the largest circuit can now be understood in the appropriate context. With speedup values so high (e.g., 3.7 using 4 processors, a parallel efficiency of 93%), the performance gain due to parallelism is near the maximum achievable.
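The speedup and efficiency calculation can be sketched as follows. It assumes that the serial time per iteration is the total event count in that iteration, and that the parallel time per iteration is the slowest processor's workload plus one event evaluation of combined serial and overhead cost; the function names and the small trace are hypothetical.

```python
def predicted_speedup(w, v, overhead=1.0):
    """Speedup of the parallel algorithm with speculation over a serial run.

    w[i][j] : events per processor per iteration without speculation
    v[i][j] : events per processor per iteration with speculation
    overhead: combined serial plus overhead cost per iteration, in event
              evaluations (assumed to be 1, following the text)
    """
    iterations = len(w)
    serial = sum(sum(row) for row in w) / iterations  # mean events per iteration
    parallel = sum(max(row) for row in v) / iterations + overhead
    return serial / parallel


def efficiency(speedup, processors):
    """Parallel efficiency: speedup per processor."""
    return speedup / processors


# Hypothetical workloads on 2 processors.
w = [[4, 8], [6, 2], [3, 3]]
v = [[4, 8], [2, 2], [3, 3]]
speedup = predicted_speedup(w, v)
eff = efficiency(3.7, 4)  # the example quoted in the text: 3.7 on 4 processors
```

For the quoted example, 3.7 / 4 = 0.925, which rounds to the 93% efficiency stated above.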
Summary and Conclusions
This paper has presented performance predictions for a speculative, synchronous simulation algorithm applied to the application domain of parallel VLSI logic simulation. The performance results show that there is clear benefit to be gained by developing parallel implementations of logic simulation design tools.
An important piece of this investigation is the uniformity of the results. The performance predictions were explicitly tested for sensitivity to the distribution of the speculation success ratio, and the shape of the distribution did not have a significant impact on the results.
In addition, the overall performance figures are quite consistent across a range of circuits and processor populations. This is quite different from the typical circumstance encountered when executing asynchronous parallel simulations.
Given this level of potential performance gain, we believe that commercial implementations of these techniques will be successful in improving the design environment for ASIC development, consistently reducing the time required to perform design verification.
