The high costs associated with logic simulation of large VLSI based systems have led to the need for new computer architectures tailored to the simulation task. Such architectures have the potential for significant speedups over standard software based logic simulators.
Introduction
The high costs associated with the detection and correction of design errors once a VLSI chip has been fabricated have led to an increased reliance on simulation techniques in the logic design process. Logic simulation is used extensively to initially verify logic correctness and subsequently to develop vectors for testing fabricated chips. As circuit complexity has grown, the time delays and costs of performing logic simulation on standard serial computers have grown until they can consume months of machine time [15].
These high costs have led to the development of a number of special purpose processors dedicated to logic simulation [0, 8, 12, 13, 16, 18, 19]. Such processors typically perform simulations at 10 to 1000 times the speed of standard general purpose computers. The techniques employed in achieving these speedups vary from microcode implementation of simulation algorithms to the development of special purpose logic and multiprocessors tailored to a simulation algorithm. In addition to the approaches found in commercial simulation engines, other possible simulation architectures have also been proposed [1, 2, 9, 10].
In general, it has been difficult to effectively compare these alternative approaches. There are several reasons for this. First, commercial products in this area often have proprietary designs whose details are not publicly available. Second, developing reasonable performance models over a range of complex architectural alternatives is still more an art than a science. Third, basic data on the simulation process (e.g. event distributions) is difficult and time consuming to obtain, and has not been generally available in the open literature.
This paper is concerned principally with the third item above, that is, obtaining and presenting data on the logic simulation process. Such data, relating mainly to event activity, event list statistics and time distributions, is important in determining the effectiveness and sizing of various pipeline and multiprocessor design options, in addition to designing event list scheduling algorithms and hardware.
The section to follow reviews several simulation speedup techniques.
In the course of this presentation, the sort of data necessary in evaluating such techniques and architecture alternatives is discussed. Section 3 describes Lsim, a discrete event logic simulation program which has extensive facilities for data collection, and which was used in the data collection process. Section 4 presents the test case workload of five VLSI designs and the data collection methodology employed. Section 5 discusses the results of the data collection. The final section summarizes the paper.
2. Logic Simulation Speed-Up Techniques and Data Requirements
Numerous options are available to the system designer for accelerating logic simulations through the development of novel architectures and computing structures. These options break down into two broad approaches (see Table 1).
[Table 1 fragment: special function evaluation hardware; special hardware for net-list operations]
The first, functional specialization, refers to those techniques which take a component of a standard software based simulation algorithm, and decrease its execution time by placing it in hardware.
In this approach the basic sequential nature of the original simulation algorithm may be maintained. For example, given the central role played by event list manipulation routines, it is possible to design special purpose hardware which will time order a list of events, and permit insertion and removal of events in essentially a single instruction time. Given data on the time associated with this task when executed as a software routine, the cost effectiveness of developing such special purpose hardware can be evaluated. Given data on event time distributions, the sizing of such special purpose hardware can be optimized.
[...] relates to how clocks on separate processors in a multiprocessor are synchronized. In the unit time increment approach the clock is advanced by a single uniform time step at each simulation cycle. This is done whether or not any events must be evaluated on that cycle. While this eases the clock distribution problem, it also introduces an overhead associated with processing clock times having no event activity.
In the event increment approach the clock is advanced in a nonuniform manner depending on when the next event is to take place. This eliminates the overhead associated with processing no-activity clock times; however, it introduces extra processing associated with scheduling events and distributing time information. Data on the percentage of no-event times is needed to help evaluate this design decision.
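The trade-off between the two time advance approaches can be illustrated with a small sketch. It is not taken from lsim: the function names and the event trace are hypothetical, and event times are assumed to be integers in simulated time units.

```python
def unit_increment_cycles(event_times):
    """Clock steps taken when time advances one unit per simulation cycle."""
    if not event_times:
        return 0
    return max(event_times) - min(event_times) + 1


def event_increment_cycles(event_times):
    """Clock steps taken when time jumps directly to the next event time."""
    return len(set(event_times))


def idle_fraction(event_times):
    """Fraction of unit-increment cycles on which no event is scheduled."""
    total = unit_increment_cycles(event_times)
    if total == 0:
        return 0.0
    return 1.0 - event_increment_cycles(event_times) / total


# Hypothetical trace of scheduled event times (not measured data).
trace = [0, 0, 1, 5, 5, 5, 12, 20]
print(unit_increment_cycles(trace))   # steps from time 0 through time 20
print(event_increment_cycles(trace))  # distinct times that hold events
print(idle_fraction(trace))           # share of steps with no activity
```

A high idle fraction favors the event increment approach, provided the added scheduling and time distribution cost per step stays small.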
The time synchronization component is of importance principally when using multiple processor architectures. A global clock scheme simplifies time synchronization since all processors execute in a lock step fashion, executing events scheduled for the same time. However, as pointed out earlier, if only a few events are available at each time point, it is possible that only a small number of the available processors in the multiprocessor will be active at each time point. The use of a local clock scheme in which processors move ahead to future events at their own pace (subject to various precedence constraints) might allow significantly more events to be processed in parallel [5, 14]. Related to this design decision are the problems of circuit partitioning, design of the interprocessor communications network, and design of the algorithmic pipelining structure. These decisions require data associated with event activity, various event distributions, and the communications impact of various partitioning strategies. Table 3 indicates some simulation machine design decisions and certain data needed to aid in those decisions. The next section discusses a software based simulator used in the data collection process.
[...] is associated with each signal in addition to its logical state. Lsim uses two strengths, strong and weak, corresponding to a high and low current drive capability.
A strong signal is one that is connected directly to the power supply, ground, or through an active transistor to supply or ground. A weak signal is one that is connected to a voltage source through a resistance, such as a depletion mode pullup transistor.
Timing analysis is supported at three different levels: a unit delay model in which every gate is assumed to have a delay of one simulated time unit, a fixed delay model where gate delays are modeled by fixed low-to-high and high-to-low propagation times, and a variable delay model in which gates have variable delays specified by a maximum and a minimum value. In addition, enable and disable times (i.e. switching times for the setup and removal of a high impedance state on a component output) may also be specified. The data presented in this paper were obtained with the fixed delay model.
The seven logical states associated with signal lines are divided into two major types: stable states ("1", "0", "z", "x") and transient states ("r", "f", "t").
Stable states apply to all timing models. The "1" and "0" states are used to model high and low voltages respectively.
The "z" state is used to model the high impedance output of components that have tri-state outputs. The "x" state is used when little is known about the voltage level of the signal.
Transient states only apply to the variable delay model and are used to represent intermediate states during a transition between stable states. The "r" and "f" states are used during a transition from low to high, and from high to low respectively. The "t" state is used during a transition to or from a high impedance state.
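The state and strength classification described above can be written down compactly. The following is a minimal sketch and not lsim's internal representation: the class names, member names, and the `allowed_states` helper are hypothetical, while the state symbols, the two strengths, and the three delay models come from the text.

```python
from enum import Enum


class Strength(Enum):
    STRONG = "strong"  # driven directly or through an active transistor
    WEAK = "weak"      # driven through a resistance


class State(Enum):
    ONE = "1"          # high voltage
    ZERO = "0"         # low voltage
    HIGH_Z = "z"       # high impedance output
    UNKNOWN = "x"      # little known about the voltage level
    RISING = "r"       # transition from low to high
    FALLING = "f"      # transition from high to low
    TRISTATE = "t"     # transition to or from high impedance


STABLE = frozenset({State.ONE, State.ZERO, State.HIGH_Z, State.UNKNOWN})
TRANSIENT = frozenset(State) - STABLE


def allowed_states(delay_model):
    """States usable under a delay model: 'unit', 'fixed', or 'variable'."""
    if delay_model not in ("unit", "fixed", "variable"):
        raise ValueError(f"unknown delay model: {delay_model!r}")
    # Transient states apply only to the variable delay model.
    return STABLE | TRANSIENT if delay_model == "variable" else STABLE


print(sorted(s.value for s in allowed_states("fixed")))
print(sorted(s.value for s in allowed_states("variable")))
```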
There are several components supported by lsim that differ from the normal unidirectional gate model that is common in gate level simulators.
These components, the pass transistor and resistor, are capable of propagating signals in two directions. Internally to lsim, these components are handled by creating, in effect, two parallel unidirectional components that are connected back to back. This construction is hidden from the user, who simply refers to one terminal of the component as the input and the other terminal as the output. The algorithms for processing bidirectional components and handling multiple strength signals follow those proposed by Hayes [11].
Data Collection Facilities
Lsim has features to collect information related to three basic items: events, timing of subtasks in the lsim program, and communications across user defined circuit partitions. An event refers to a discrete action performed by the simulator, such as the modification of the logical state of a component output, or the periodic display of signal states to the user. Each event has a time associated with it which indicates when that event is to occur. Events are stored in an event queue which is used in event scheduling and (lowest time value) event retrieval.
Lsim collects the following statistical data on events:
- the number of events associated with each component in the circuit
- the number of events in the event queue
- the times between events in the event queue
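As an illustration of how such statistics might be gathered, the sketch below instruments a simple time-ordered event queue. It is an assumption-laden example and not lsim's implementation: the `EventQueue` class and its attribute names are hypothetical, a binary heap stands in for lsim's event list, and "times between events" is interpreted here as the gap between successively retrieved events.

```python
import heapq
from collections import Counter


class EventQueue:
    """Time-ordered event queue that records simple usage statistics."""

    def __init__(self):
        self._heap = []
        self._seq = 0                      # tie-breaker keeps insertion order
        self.events_per_component = Counter()
        self.length_samples = []           # queue length after each operation
        self.inter_event_times = []        # gaps between retrieved events
        self._last_time = None

    def insert(self, time, component):
        heapq.heappush(self._heap, (time, self._seq, component))
        self._seq += 1
        self.events_per_component[component] += 1
        self.length_samples.append(len(self._heap))

    def pop(self):
        """Remove and return the event with the lowest time value."""
        time, _, component = heapq.heappop(self._heap)
        if self._last_time is not None:
            self.inter_event_times.append(time - self._last_time)
        self._last_time = time
        self.length_samples.append(len(self._heap))
        return time, component


# Hypothetical usage with made-up component names.
queue = EventQueue()
queue.insert(5, "g1")
queue.insert(2, "g2")
queue.insert(5, "g2")
retrieved = [queue.pop() for _ in range(3)]
print(retrieved)
print(dict(queue.events_per_component))
print(queue.length_samples)
print(queue.inter_event_times)
```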
In addition to the data mentioned above, there is a provision for lsim to send out a record to a file each time event queue activity occurs. Each record contains fields indicating the event type, current simulated time, scheduled time of the event, and whether an event insertion, removal or deletion occurred. The resulting data file can be analyzed using a statistical analysis package.
The UNIX profiling utilities may be used to obtain data on the execution times associated with various tasks involved in simulation. The utilities report the number of times that subroutines have been called as well as CPU times for the subroutines themselves. The subroutine calls can then be classified, as shown in Table 4, into a set of general tasks that comprise the simulation.
[...] 1) a stop watch, 2) a priority queue, 3) an associative memory, 4) a Radiation Treatment Planning (RTP) chip, and 5) a crossbar switch.
The stop watch circuit determines the elapsed time between a start and a stop signal. The priority queue can be used as an event list manipulation device. It stores 48-bit records, each divided into four fields, and retrieves the record whose first field contains the smallest value. The associative memory functions like a normal random access memory as well as a memory in which records can be retrieved by content (i.e. those matching a specified pattern). The RTP chip implements an algorithm used in cancer treatment planning, which calculates the radiation dosage at a specified point. The crossbar switch provides an interconnection network between four input and four output ports.
These circuits reflect a mix of characteristics (Table 5) and are the product of five graduate student design teams. The two most prevalent VLSI technologies (nMOS and CMOS) and clocking schemes (synchronous and asynchronous) are represented. The circuit sizes range from approximately 650 transistors to 7950 transistors. The priority queue, associative memory, and crossbar switch were designed so that they could be scaled to larger versions as required (assuming no pin or power limitations).
The test circuits were kept small enough to ensure that simulation run lengths were reasonable and disk storage availability was adequate. The Switches and Gates columns in Table 5 indicate the number of lsim bidirectional switch and unidirectional gate blocks used in defining the circuit (the Total entry is the sum of these columns). The right column reflects the total number of transistors in each circuit.
[...] Once this first criterion was met, the number of test vectors was increased further if less than 95 percent of the components experienced at least one output change.
The test vectors were applied to the test circuits using lsim's program interface.
In this technique, special test vector generation subroutines were written in the C language and dynamically linked to the normal lsim routines. These routines supplied the inputs necessary to simulate a stream of random test vectors.
The Data

Subtask Time Distributions
The first data to be considered relates to the relative subtask times associated with the standard discrete event oriented logic simulation (Table 6). Time spent in the data output operation has not been included since this will vary greatly depending on the amount of data being collected. For example, if data is collected about every event that occurs during the simulation, this task alone could consume as much as 40 [...]
From this data, the speedup that can result from various types of hardware specialization can be evaluated. The data indicates, however, that there is no single subtask which is a critical bottleneck.
That is, unless all aspects of the algorithm are improved, large speedups will not be achieved. For example, if an infinitely fast device were designed to process events, the most that could be gained is a speedup of about 23 percent. Note that a fairly efficient timing wheel based event list algorithm is used in lsim [17].
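The bound on speedup from accelerating a single subtask follows the usual Amdahl's law argument, sketched below. The function names are hypothetical, and the fraction 0.187 is not a Table 6 entry: it is back-calculated here from the "about 23 percent" figure in the text, as an assumption for illustration only.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: overall speedup when `fraction` of the run time
    is accelerated by `factor` and the rest is unchanged."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def max_speedup(fraction):
    """Limit of overall_speedup as the accelerated subtask takes zero time."""
    return overall_speedup(fraction, float("inf"))


# Assumed share of run time spent processing events (back-calculated).
EVENT_FRACTION = 0.187

print(round(overall_speedup(EVENT_FRACTION, 10), 3))  # 10x faster event unit
print(round(max_speedup(EVENT_FRACTION), 3))          # infinitely fast unit
```

Under this assumption, even a tenfold faster event processor recovers most of the available gain, which is why the text argues that all subtasks must be improved together.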
Event Intensity Data
To get a broad measure of simulation activity over time, it is worthwhile noting the fraction of time points during which no activity takes place. That is, given a resolution of, say, 1 nanosecond, this is the percentage of nanosecond time points when no events are scheduled. As shown in the first column of Table 7, at most of the time points there is no activity. Related to the idle time percentage is the idea of circuit intensity. Intensity corresponds to the percentage of gates which change state on average over the simulation.
The second column of Table 7 shows the percentage of gates which change state averaged over all non-idle time points (i.e. points where at least one event occurs). The third column shows the percentage of gates which change state averaged over all time points (idle and non-idle) in the simulation. The general picture that emerges is that logic simulation is an activity where, during most of the simulation time points, nothing is happening and, when there is activity, it involves a small fraction of the circuit being simulated. The conclusion is that for special purpose simulation architectures to be effective, they must take advantage of the localities of activity which occur in both time and space. Luckily, since we are interested in large circuits, small percentages may still yield enough activity so that the speedup techniques of specialization and parallelism can be effective if they are applied at non-idle time points. Given this result, the queue length and event simultaneity statistics discussed in the next section, unless otherwise specified, are based on measurements taken at non-idle time points.
Note that the CB Switch has the highest non-idle time, yet has the lowest intensities. This is due to the design of the switch, which is constructed from 16 2-by-2 switches operating in an asynchronous, pipelined manner. First, very little logic is exercised except when establishing a path between an input and an output port. Second, the asynchronous design tends to spread the events uniformly over time rather than clustering them near the beginning of a clock period as in synchronous circuits. This results in a larger number of busy time points but less activity at each time point than in the other circuits.
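The three quantities discussed above (idle time percentage, intensity over non-idle time points, and intensity over all time points) can be computed from a trace of gate state changes as in the following sketch. The function name and the sample trace are hypothetical, and the sketch assumes each gate changes state at most once per time point, so that the number of changes at a time point equals the number of gates that changed.

```python
from collections import Counter


def activity_summary(change_times, num_gates, num_time_points):
    """Return (idle %, intensity % over non-idle points, intensity % overall).

    change_times holds one entry, the time point, per gate state change.
    """
    changes_at = Counter(change_times)
    busy_points = len(changes_at)
    total_changes = len(change_times)
    idle_pct = 100.0 * (num_time_points - busy_points) / num_time_points
    if busy_points:
        busy_intensity = 100.0 * total_changes / (busy_points * num_gates)
    else:
        busy_intensity = 0.0
    overall_intensity = 100.0 * total_changes / (num_time_points * num_gates)
    return idle_pct, busy_intensity, overall_intensity


# Hypothetical trace: 100 gates observed over 20 time points.
sample_changes = [0, 0, 0, 0, 5, 5, 10, 10, 10, 10]
idle, busy, overall = activity_summary(sample_changes, 100, 20)
print(idle, busy, overall)
```

In this made-up trace most time points are idle and only a few percent of the gates switch at a busy time point, which is the qualitative pattern the text reports for Table 7.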
Event Queue Length Distribution
The length of the event queue will vary during the simulation.
The distribution associated with queue length yields information on how long one can expect event lists to grow, and this information is useful in designing efficient software and hardware based algorithms for event manipulation.
For example, what should be the size of a hardware unit specialized to event queue function? If made too small, overflow conditions will often arise with a likely associated time penalty. If made too large, such a device could be costly but yield little in added performance. Note that queue sizes are modest, with an average (over non-idle time points) of less than 30 entries, and the probability of a queue length less than 90 being greater than 0.9. This is shown in Figures 1 and 2. Tables 5 and 8 can be used to obtain queue lengths to be expected for larger circuits. For instance, in logic simulations of 100,000 transistor circuits (about 25 times the average circuit of Table 5), 90% of the time the event queue would have a length less than about 1800. This will vary, of course, depending on the characteristics of the individual circuit being simulated.
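The sizing question can be posed as a percentile computation over observed queue lengths. The sketch below is one hedged way to do this; the function name and the sample lengths are hypothetical and are not the measurements behind Figures 1 and 2 or Table 8.

```python
import math


def smallest_capacity(length_samples, coverage=0.9):
    """Smallest capacity c such that at least `coverage` of the observed
    queue lengths are less than or equal to c."""
    if not length_samples:
        raise ValueError("no samples")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(length_samples)
    needed = max(1, math.ceil(coverage * len(ordered)))
    return ordered[needed - 1]


# Hypothetical queue lengths sampled at non-idle time points.
samples = [3, 8, 12, 15, 20, 22, 30, 41, 60, 140]
print(smallest_capacity(samples, 0.9))  # covers 9 of the 10 samples
print(smallest_capacity(samples, 1.0))  # never overflows on these samples
```

The gap between the two results shows the trade-off described in the text: covering the rare long queues costs far more capacity than covering the common case.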
Work in the area of modeling the performance of various architectural alternatives is under way at Washington University.
Related research on the circuit partitioning problem and on associated problems in the design of more general purpose simulation machines for VLSI design automation is also being pursued.
Event Simultaneity
The data on intensities is further refined in Table 9 and Figures 3 and 4. Table 9 is concerned only with those time points during which one or more events are processed. For example, the average in the first column means that there is a 90% probability that there will be 57 or fewer events at a given non-idle time point. Figures 3 and 4 show the general fast drop off in number of simultaneous events after the first few entries. They do, however, also demonstrate that there are apparently a few instances of intensive activity where many simultaneous events occur.
Although the statistics indicate that on average relatively few events occur in parallel, this number scales as the size of the circuit being simulated grows. If we assume, for instance, that the average number of simultaneous events scales linearly with circuit size, then a 100,000 transistor circuit will, on average, have about 465 (25*18.6) simultaneous events to process at each non-idle time point. Though not pursued here, it is clear that this affords many opportunities for exploiting parallelism and pipelining in the design of special purpose simulation architectures.
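The linear scaling assumption used above amounts to a single multiplication, shown here as a short sketch. The function and constant names are hypothetical; the values 18.6 and 25 are the ones quoted in the text, and linear scaling is the stated assumption rather than a measured result.

```python
def scale_linearly(measured_value, size_ratio):
    """Scale a per-circuit statistic by the ratio of circuit sizes,
    assuming the statistic grows linearly with circuit size."""
    if size_ratio <= 0:
        raise ValueError("size_ratio must be positive")
    return measured_value * size_ratio


AVG_SIMULTANEOUS_EVENTS = 18.6  # average over the test circuits (from text)
SIZE_RATIO = 25                 # 100,000 transistors vs. the average circuit

estimate = scale_linearly(AVG_SIMULTANEOUS_EVENTS, SIZE_RATIO)
print(round(estimate))  # about 465 simultaneous events per non-idle point
```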
Summary and Conclusions
This paper discussed some factors which are of importance in the design of a hardware based logic simulator. A summary of architecture approaches for achieving high performance logic simulation engines was presented along with a description of a software simulator, lsim, which has been used as a tool for gathering data on the simulation process.
