We recently introduced symbolic timing simulation (STS) using data-dependent delays as a tool for verifying the timing of full-custom transistor-level circuit designs, and for the functional verification of delay-dependent logic. While STS leverages efficient symbolic encodings to yield huge gains over conventional simulation methodologies, it still suffers from a problem known as event multiplication. We discuss this problem and present an event-list management technique based on event clusters, and a new simulator which utilizes this technique. Finally, we demonstrate substantial speedups on a wide range of test cases, including exponential improvement on a simple logic chain.
INTRODUCTION
As design complexity continues to increase, efficient methods of verifying timing and functionality become both more important and more difficult. Symbolic simulation has been used in the past to greatly increase the effective throughput of simulation-based verification methodologies, and also as a component of a number of formal verification strategies. Symbolic timing simulation is an extension of this technique that correctly accounts for circuit propagation delays.
Symbolic simulation is a form of data-parallel simulation in which Boolean values are used to encode a set of input data patterns. In conventional simulation the user applies a pattern of constant 0's and 1's to each of the circuit inputs, steps the simulator, and verifies that the outputs and state elements have settled to the desired values. With a symbolic simulator, the user may substitute Boolean variables for any of the input values to signify that the input may be either a 0 or a 1. If the user applies $n$ Boolean variables, the symbolic simulator will perform the equivalent of $2^n$ conventional simulations. The outputs and state elements of the circuit will evaluate to Boolean functions of the input variables, which can be verified against the desired behavior.
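The equivalence between one symbolic run and $2^n$ conventional runs can be illustrated with a small sketch. It assumes truth tables stored as integer bitmasks in place of BDDs, purely for brevity; the example circuit and all function names are hypothetical and are not taken from any simulator discussed here.

```python
def variable(index, n):
    """Truth table (integer bitmask) of input variable `index` over n inputs."""
    table = 0
    for row in range(2 ** n):
        if (row >> index) & 1:
            table |= 1 << row
    return table

def circuit(a, b, c, and_, or_, not_):
    """Hypothetical example circuit: out = (a AND b) OR (NOT c)."""
    return or_(and_(a, b), not_(c))

def symbolic_run(n=3):
    """One pass over the circuit; each value is a function of all inputs."""
    full = (1 << (2 ** n)) - 1  # truth table of the constant-true function
    a, b, c = (variable(i, n) for i in range(n))
    return circuit(a, b, c,
                   lambda x, y: x & y,
                   lambda x, y: x | y,
                   lambda x: full & ~x)

def exhaustive_run(n=3):
    """2**n conventional passes, one per constant input pattern."""
    table = 0
    for row in range(2 ** n):
        bits = [bool((row >> i) & 1) for i in range(n)]
        out = circuit(*bits,
                      lambda x, y: x and y,
                      lambda x, y: x or y,
                      lambda x: not x)
        if out:
            table |= 1 << row
    return table
```

Both functions yield the same output truth table; the symbolic version evaluates the circuit once, while the exhaustive version evaluates it eight times for three inputs.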
Symbolic simulation relies on having an efficient means of encoding Boolean functions to represent the values of circuit nodes. Typically, this takes the form of Binary Decision Diagrams (BDDs) [3]. Though the memory required to encode a Boolean function is provably exponential in the worst case, BDDs have been shown to be efficient and easily manipulated data structures for representing a large number of interesting functions.

This research was supported by the SRC (contract DC-068).
We recently applied symbolic timing simulation to the verification of full-custom transistor-level circuits, as an alternative to static timing analysis. Custom circuits often contain transistor topologies that defy heuristic recognition, causing static analyzers to miss timing checks. A symbolic timing simulator can simulate the timing of all possible input patterns in parallel and infer the correctness of all internal timing by checking for correct functional behavior. This avoids the dependence on correct identification of all possible latches, flip-flops, dynamic gates, self-timed circuitry, etc.
SirSim [9] is a transistor-level symbolic timing simulator based on the delay calculation procedures in IRSIM [10] . It demonstrated the computational feasibility of symbolic timing simulation on a number of reasonable-sized benchmarks. However, SirSim suffers from a major bottleneck which we term the event multiplication problem.
In Section 2, we discuss the extension from symbolic simulation to symbolic timing simulation as a natural progression of different delay models. Section 3 discusses event clusters and how they can be used to improve performance. In Section 4 we present a cluster-based event-management scheme and a new symbolic timing simulator, STEED. Section 5 discusses experimental results and demonstrates the advantages of our approach.
DELAY MODELS
Like conventional simulators, symbolic simulators can be built upon a wide range of delay models. Historically, symbolic simulators have been used for functional verification and have tended to utilize either zero- or unit-delay models. However, assigned-delay and data-dependent delay models have been successfully implemented for other applications.
Constant Delay
One of the earliest references to symbolic simulation [8] described a simple zero-delay model. Zero-delay simulation implies a levelized sweep through the circuit going from the inputs to the outputs. At each level, a function is computed that represents the value of each node based on the input variables. While it is extremely efficient, only acyclic circuits can be dealt with in this manner.
To handle circuits containing feedback or state-holding nodes, a unit-delay model can be used. Implementation of the unit-delay model typically utilizes event lists, where an event is a change in the value function for a particular node. To start the simulation, the input values are scheduled by placing them in the list of events to be executed. The simulator then executes all events in the event list, computes new values for fanouts of the changed nodes, and schedules them back onto the event list. This process is repeated until the circuit reaches a steady state, as signalled by an empty event list.

Permission to make digital/hardcopy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2000, Los Angeles, California (c) 2000 ACM 1-58113-188-7/00/0006..$5.00
Symbolic simulators which have implemented the unit-delay model include MOSSYM [2] and Cosmos [4] . Both of these simulators actually implement a mixture of zero-delay and unit-delay to gain efficiency when the generality of the full unit-delay model is not required.
The next level of complexity is the assigned-delay model. Here, the delay from one node to the next (typically through a logic gate) has an assigned value d. To handle this case, we simply extend the event-driven scheme from the unit-delay model by sorting the event list by time. At each time-step the event at the head of the event list is removed (along with any others having the same time value) and executed. As before, the effects on its fanout are computed, and scheduled into the now-sorted event list to be updated d time-units later. Devadas et al. [7] implemented an assigned-delay symbolic simulator to analyze the transition delay of gate-level circuits.
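The time-sorted event loop described above can be sketched as follows. This is a minimal, non-symbolic illustration that assumes constant node values and uses Python's `heapq` as the sorted event list; the data layout (`gates`, `delays`, `values`) is hypothetical and not that of any simulator cited here.

```python
import heapq

def simulate(gates, delays, values, stimuli):
    """Assigned-delay event-driven simulation.

    gates:   output node -> (function, list of input node names)
    delays:  output node -> assigned delay d of the driving gate
    values:  node -> current value (updated in place)
    stimuli: list of (time, node, value) input events
    """
    fanout = {}
    for out, (_, ins) in gates.items():
        for name in ins:
            fanout.setdefault(name, []).append(out)

    events = list(stimuli)
    heapq.heapify(events)          # event list sorted by time
    history = []
    while events:                  # empty list means steady state
        now = events[0][0]
        changed = []
        # Execute every event that shares the earliest time value.
        while events and events[0][0] == now:
            _, node, value = heapq.heappop(events)
            if values[node] != value:
                values[node] = value
                history.append((now, node, value))
                changed.append(node)
        # Recompute fanouts and schedule them d time-units later.
        for node in changed:
            for out in fanout.get(node, []):
                func, ins = gates[out]
                new = func(*(values[i] for i in ins))
                heapq.heappush(events, (now + delays[out], out, new))
    return history

# Two inverters in series with assigned delays of 2 and 3 time-units.
example_gates = {"n1": (lambda x: 1 - x, ["in"]),
                 "n2": (lambda x: 1 - x, ["n1"])}
example_history = simulate(example_gates, {"n1": 2, "n2": 3},
                           {"in": 0, "n1": 1, "n2": 0}, [(0, "in", 1)])
```

With unit delays the same loop applies; the heap then only ever holds events for the next time-step.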
Data-Dependent Delays
Recently, we demonstrated the feasibility of extending symbolic simulation to a data-dependent delay model [9] . The difficulty lies in the fact that the time of an event is dependent on the values of the input-variables applied to the circuit. Consider the skewed inverter in Figure 1 (a). When the input changes value from variable a to variable b (Figure 1(b) ), either of which could symbolize a 0 or a 1, it is not clear when the event on node out should be scheduled.
To handle this case properly, we can schedule an event at each of the potential timepoints, along with a mask M that indicates under which input patterns the event will occur. When the event is removed from the event list, the node takes on the event's value for the input patterns covered by M and retains its previous value for all other patterns.

This mechanism was implemented in SirSim, a transistor-level symbolic timing simulator based on the IRSIM Elmore delay-calculation scheme. While a substantial speedup was attained over an equivalent exhaustive IRSIM simulation (up to $10^{33}$ for a 64-bit adder), SirSim still suffers from the problem of event multiplication. Since each new node value is scheduled several times, once for each potential delay, the effects of a single event multiply at each level. This can cause the total number of events to grow exponentially with circuit depth.
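As a sketch of how a masked event updates a node, the skewed-inverter example can be worked through with truth tables stored as integers. The two event times (100 and 200) are hypothetical placeholders, not values from Figure 1, and the helper names are assumptions made for illustration.

```python
N = 2                         # symbolic inputs: a (old input), b (new input)
FULL = (1 << (1 << N)) - 1    # truth table of the constant-true function

def var(i):
    """Truth table of input variable i as an integer bitmask."""
    return sum(1 << row for row in range(1 << N) if (row >> i) & 1)

def execute(old, value, mask):
    """Take the event's value where the mask holds, keep the old value elsewhere."""
    return (mask & value) | (FULL & ~mask & old)

a, b = var(0), var(1)
not_a, not_b = FULL & ~a, FULL & ~b

out = not_a                   # inverter output before the input change
# One event per potential delay; both carry the same resultant value (NOT b).
events = [
    (100, a & not_b, not_b),  # input falls, output rises (hypothetical time)
    (200, not_a & b, not_b),  # input rises, output falls (hypothetical time)
]
for _, mask, value in sorted(events):
    out = execute(out, value, mask)
```

After both events have executed, `out` equals the truth table of NOT b for every input pattern, even though each event only touched the patterns selected by its own mask.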
EVENT-CLUSTERS
To address the event-multiplication problem in data-dependent symbolic timing simulation, we first introduce the concept of event clusters. An event cluster is a set of events on a single node with mutually disjoint masks. The set of events that result on a fanout node from a single node value change will always form an event cluster. Take for example the events on the output of the skewed inverter in Figure 1(c). Both output events share a common resultant value, but differ in their event times and masks. However, the masks are disjoint ($a\bar{b} \wedge \bar{a}b = 0$). In Figure 1(d) we plot the timeline for all of the possible cases in this simple example, representing every event with a dot. Each dot represents an event which would occur if we were to run a conventional simulation using the input data pattern to the left of that line. We join together events which form clusters with dotted lines.
Each event in a cluster specifies a change on the same node, but at a different timepoint for each input pattern. Therefore, we can potentially compute the resultant effects on fanout nodes once for each cluster, rather than recomputing it for each unique event.
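One possible in-memory view of an event cluster is sketched below. It assumes masks are stored as explicit sets of input assignments rather than BDDs, and the field names and delay values are hypothetical.

```python
from collections import defaultdict

def make_cluster(node, value, delay_of):
    """Group input assignments by delay; each group becomes one (time, mask) event.

    delay_of maps an input assignment to the delay of the transition it causes.
    """
    groups = defaultdict(set)
    for assignment, delay in delay_of.items():
        groups[delay].add(assignment)
    events = sorted((t, frozenset(m)) for t, m in groups.items())
    return {"node": node, "value": value, "events": events}

def masks_disjoint(cluster):
    """Check the defining property: no assignment appears in two events."""
    seen = set()
    for _, mask in cluster["events"]:
        if seen & mask:
            return False
        seen |= mask
    return True

# Skewed inverter: assignments are (a, b); delay values are placeholders.
inverter_cluster = make_cluster("out", "not b", {(1, 0): 100, (0, 1): 200})
```

Because every event in the cluster carries the same node and resultant value, a fanout evaluation needs only the cluster's shared value, while the per-assignment times are carried along separately.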
Cluster-Queues
Unfortunately, event clusters can interact with each other such that one cluster changes the state of the network during the period covered by another. In conventional timing simulation, we sort the event queue such that earlier events are always executed before later events, thus guaranteeing that the network state is up-to-date. However, since clusters span time ranges, it is not always possible to impose a total ordering based on time.
Consider the example in Figure 2 , containing the same skewed inverter but with an edge-triggered flip-flop connected to its output. If the clock transitions high at time 150ps, it should sample the old value for case ab, and the new value otherwise. Clearly, if we were to execute the event cluster on out before the cluster on clk, we would sample the new value for all cases, which is incorrect. Likewise, if we executed the cluster on clk before the cluster on out, we would sample the old value for all cases, which is also incorrect. In this case, the only alternative is to split one of the clusters so that the proper ordering can be maintained.
To be clear about how this splitting takes place, we must first formally define a cluster: 
This states that $C_A \prec C_B$ if and only if $C_A$ occurs earlier than $C_B$ for all cases in which they are both defined.
A cluster queue is an ordered list of clusters such that:
Note that this definition of correctness requires $k^2/2$ comparisons to verify. At the cost of additional splitting, we can reduce the computations required to insert a new cluster into this queue if we define the additional function:
In other words, mustPrecede returns the Boolean function containing all input assignments for which clusters CA and CB are both defined, and in which CA must occur before CB.
During insertion into a cluster queue, we may be required to split the new cluster so that the above ordering conditions can be met.
To compute the portion of cluster C that intersects with an additional masking function F, we introduce the function Chop(C, F), which returns a copy of cluster C where all events that do not intersect F have been removed.
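The three cluster operations described in this section can be sketched under a simplifying assumption: a cluster is a dictionary mapping each input assignment in its mask to the event time for that assignment, and a masking function is a set of assignments. The strict inequality in the ordering test is also an assumption, since the formal definitions are not reproduced here.

```python
def precedes(ca, cb):
    """True if ca occurs earlier than cb wherever both are defined."""
    common = ca.keys() & cb.keys()
    return all(ca[x] < cb[x] for x in common)

def must_precede(ca, cb):
    """Assignments where both clusters are defined and ca occurs before cb."""
    return {x for x in ca.keys() & cb.keys() if ca[x] < cb[x]}

def chop(c, f):
    """Copy of c restricted to the assignments in the masking function f."""
    return {x: t for x, t in c.items() if x in f}

# Hypothetical clusters over two input bits, keyed by assignment string.
cluster_a = {"00": 5, "01": 40}
cluster_b = {"00": 10, "01": 30, "10": 7}
early_part = chop(cluster_a, must_precede(cluster_a, cluster_b))
```

Here `cluster_a` as a whole does not precede `cluster_b`, but the chopped portion `early_part` does, which is the situation that forces a split during insertion.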
When inserting a new cluster $C_A$ into the queue $C_1, \ldots, C_k$ ($k \ge 1$), we will initially split $C_A$ into at most $k + 1$ pieces $P_j$ that will be inserted between each of the existing clusters. Note that in implementing this insertion procedure, we need not insert NULL events (for which $P_i = 0$).
Pseudocode for the cluster enqueue and dequeue operations is shown in Figure 3. EnqueueCluster scans through the queue, splitting off and inserting each portion of the new cluster as required.
If $M_P$ contains the constant function true in line 9, we can insert the entire remaining cluster and terminate. If $M_P$ is false, there is no interaction and we can move on to the next cluster. Lines 14-15 enqueue the portion that must be inserted before the current cluster, and lines 16-19 remove that portion from the original and keep track of what has been done so far. If we reach the end of the queue, any remaining portion of the cluster is simply appended. DequeueCluster simply removes the head of the cluster queue and returns it.
Two example cluster insertions are shown in Figure 4 . In part (a), the cluster Cnew can be inserted completely before C2. Despite the fact that the first (leftmost) event in C2 is earlier than the first event in Cnew, Cnew precedes C2 for all input assignments for which they are both defined. In part (b), we see that Cnew must be split into 3 parts to preserve a consistent ordering. Notice that only those portions that must precede C1 and C2 are split off, and the remainder is grouped together at the end of the queue. 
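The insertion procedure can be sketched as follows. This is a reading of the prose description, not a transcription of Figure 3: clusters are assumed to be dictionaries from input assignment to event time, "must precede" is taken to mean strictly earlier, and the example clusters are hypothetical.

```python
def precedes(ca, cb):
    return all(ca[x] < cb[x] for x in ca.keys() & cb.keys())

def must_precede(ca, cb):
    return {x for x in ca.keys() & cb.keys() if ca[x] < cb[x]}

def chop(c, f):
    return {x: t for x, t in c.items() if x in f}

def enqueue_cluster(queue, new):
    """Insert `new` into the ordered list `queue`, splitting it as needed."""
    remaining = dict(new)
    i = 0
    while i < len(queue) and remaining:
        mp = must_precede(remaining, queue[i])
        if mp == set(remaining):
            # Everything left must go before queue[i]: insert and stop.
            queue.insert(i, remaining)
            return queue
        if mp:
            # Split off the portion that must precede queue[i].
            queue.insert(i, chop(remaining, mp))
            remaining = {x: t for x, t in remaining.items() if x not in mp}
            i += 1          # step over the piece just inserted
        i += 1
    if remaining:
        queue.append(remaining)   # leftover portion goes at the end
    return queue

def dequeue_cluster(queue):
    return queue.pop(0)

def is_valid_queue(queue):
    """Pairwise ordering check; about k*k/2 comparisons for k clusters."""
    return all(precedes(queue[i], queue[j])
               for i in range(len(queue)) for j in range(i + 1, len(queue)))

c1 = {"00": 10, "01": 30}
c2 = {"10": 50, "01": 60}
c_new = {"00": 5, "10": 20, "01": 70, "11": 15}
result = enqueue_cluster([c1, c2], c_new)
```

In this example the new cluster is split into three parts, one before each existing cluster and a remainder appended at the end, and the resulting queue still satisfies the pairwise ordering check.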
Cluster Scheduling
It is possible to construct a data-dependent symbolic timing simulator with a single cluster-queue for all events. However, upon doing so, we discovered it to be highly inefficient due to interactions between independent clusters. This results in the majority of clusters being split several times to maintain proper ordering, which fails to alleviate the event-multiplication problem. Even worse, it incurs the additional overhead of managing the cluster-queues.
The solution lies in realizing that a strict time ordering across the entire circuit is not required. We simply need to make sure that all circuit state which can affect the computation of a certain event has been accounted for when that event is executed.
For clarity, we will describe the case of a gate-level logic network, but the same principles apply to transistor-level networks subdivided into channel-connected regions. For a cluster of events C to be ready for execution at the input to gate G, we must satisfy the following conservative safety conditions:
1. There may be no pending events on the other inputs to G that precede events in C.
2. There may be no pending events on the output of G that precede events in C.
3. There may be no pending events upstream of the other inputs to G that could propagate to G before the completion of C.
The minimum propagation delay between any two nodes can be computed by determining a conservative minimum delay per gate, and computing All-Pairs-Shortest-Paths via the Floyd-Warshall or Johnson's algorithm [6]. Johnson's algorithm requires $O(VE \lg V)$ time, which is $O(V^3 \lg V)$ for a fully-connected graph. However, since the maximum fanout of each gate is limited to a small integer, usually 4, the number of edges is effectively $O(V)$, bringing the overall complexity to $O(V^2 \lg V)$. Furthermore, this computation can be done as a pre-processing step and re-used from one simulation run to the next.
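A minimal sketch of the pre-processing step follows, using the Floyd-Warshall variant for brevity. The graph encoding, the function names, and the non-strict comparison in `may_interfere` are assumptions made for illustration; the latter is one plausible reading of safety condition 3, not its formal statement.

```python
import math

def min_propagation_delay(nodes, gate_delay):
    """All-pairs minimum propagation delay via Floyd-Warshall.

    gate_delay maps (source node, destination node) to a conservative
    minimum delay through the gate connecting them.
    """
    dist = {u: {v: (0 if u == v else math.inf) for v in nodes} for u in nodes}
    for (u, v), d in gate_delay.items():
        dist[u][v] = min(dist[u][v], d)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def may_interfere(dist, event_node, event_time, gate_input, cluster_end):
    """Could a pending upstream event reach the gate before the cluster completes?"""
    return event_time + dist[event_node][gate_input] <= cluster_end

example_dist = min_propagation_delay(
    ["a", "b", "c"], {("a", "b"): 2, ("b", "c"): 3, ("a", "c"): 10})
```

Since the table depends only on the netlist and the per-gate minimum delays, it can be computed once and consulted in constant time during scheduling.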
If no pending clusters satisfy the safety conditions, it is always safe to effectively revert to the non-cluster methodology by splitting off the earliest event contained in all clusters. While this hurts efficiency, it guarantees forward progress. Hopefully, by removing several singular events, we will again be able to guarantee the safety of one or more clusters and return to full-speed operation.
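The fallback step can be sketched as below, again assuming that a cluster is a dictionary from input assignment to event time. The function name is hypothetical.

```python
def split_earliest(clusters):
    """Remove the globally earliest event and return it as its own cluster.

    The cases belonging to that event are deleted from the original
    cluster, which mirrors eliminating them from its mask.
    """
    candidates = [c for c in clusters if c]
    if not candidates:
        return None
    source = min(candidates, key=lambda c: min(c.values()))
    t_min = min(source.values())
    single = {x: t for x, t in source.items() if t == t_min}
    for x in single:
        del source[x]
    return single

pending = [{"00": 30, "01": 10, "10": 10}, {"11": 20}]
earliest = split_earliest(pending)
```

The returned cluster contains a single timepoint, so it can always be executed in time order, exactly as in a non-cluster simulator.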
The improved cluster scheduling algorithm is presented in Figure 5 . It makes use of the EnqueueCluster and DequeueCluster operations from Figure 3 .
Simulate() forms the main body of the simulator. As long as pending events remain, it obtains them via GetNext(). Recall that each event cluster is scheduled into the cluster queue for its driving gate and for all receivers. If the cluster returned by GetNext() was from its driver's queue, only that node state is updated. However, if the cluster was removed from a receiver's queue, then the results of that node transition in the receiving gate are computed and scheduled as a future event.
GetNext() is responsible for finding an event cluster that can be safely executed. It first checks if any of the clusters at the heads of any of the gate queues are "safe", and returns them if they are. Otherwise, it creates a new event cluster containing only the earliest event present in all clusters. That earliest event is then deleted from the original cluster by eliminating that case from its mask.

Safe() checks condition 3 specified above. It visits event clusters in order of increasing earliest timepoints, which allows for the early termination case in line 9. If it finds any event which could arrive at G before C completes, it returns false. If it finds no clusters which could conflict, it returns true.

Schedule() simply schedules an event cluster into the cluster queues for all gates connected to $N_C$.
IMPLEMENTATION
The algorithms from the preceding section have been implemented in STEED (Symbolic Timing Engine for Electronic Design), a transistor-level symbolic timing simulator. STEED computes data-dependent delays based on the Elmore approximation (RC products), utilizing the methodology implemented in IRSIM [5]. It borrows the node value and delay-calculation engines from SirSim [9], an earlier symbolic timing simulator not based on cluster scheduling.
As transistor-level simulators, both STEED and SirSim operate on channel-connected regions (CCRs) rather than on gates. An input to a CCR is any node connected to the gate of a transistor in that CCR. An output of a CCR is any node connected to the source or drain of a transistor in that CCR. Note that internal nodes are considered outputs for the purposes of the scheduling algorithms.
When an input to a CCR changes, the new value of all output nodes is computed based on voltage-divider or on charge-sharing models. This new value is returned as a BDD, representing the Boolean function of that node relative to the applied input variables.
Delays are computed symbolically using Multi-Terminal Binary Decision Diagrams (MTBDDs) [1] . MTBDDs are an extension of BDDs that may contain an arbitrary number of real-valued terminal nodes, rather than the binary terminals 0 and 1. The delay MTBDDs returned by the delay calculation engine encode the delay value due to the current node transition, as well as the mask information. Recall that the mask specifies under what logical conditions the output transition will occur. If no transition will occur, this is encoded as an infinite delay value in the delay MTBDD.
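The information carried by a delay MTBDD can be illustrated with a flat table. The sketch assumes a dictionary from input assignment to delay in place of a real MTBDD, and the delay values are placeholders rather than those of Figure 6.

```python
import math

# Skewed inverter, assignments are (a, b): no transition when a == b.
delay_table = {
    (0, 0): math.inf,
    (0, 1): 200.0,   # hypothetical delay
    (1, 0): 100.0,   # hypothetical delay
    (1, 1): math.inf,
}

def mask_of(delay):
    """Assignments under which the output transition actually occurs."""
    return {x for x, d in delay.items() if math.isfinite(d)}

def event_times(delay, now):
    """Absolute event time per assignment; infinite delays schedule nothing."""
    return {x: now + d for x, d in delay.items() if math.isfinite(d)}
```

Encoding "no transition" as an infinite delay lets a single structure carry both the mask and the timing, so no separate mask function has to be returned by the delay calculation.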
The delay MTBDD for the skewed inverter case is shown in Figure  6 . To interpret this MTBDD, we follow the solid arc when the 
RESULTS
STEED was run for most of the benchmarks reported in [9] as well as several others. STEED achieved a speedup in nearly every case, up to 7.8X. Table 1 shows runtimes, peak BDD memory usage, and event counts for STEED and SirSim, as well as the speedup attained. Recall that, in STEED, events are inserted into the queues for both driver and receiver, increasing their apparent number. The effective event count is typically about half the total number which appears in Table 1.
The lowest speedups occur on the carry-bypass adders, byp add16 and byp add32. These test cases contain especially large channel-connected regions in their carry-chains, meaning more nodes and events per cluster-queue. When large numbers of events are inserted into a single cluster-queue, fragmentation of the events increases, and the computational overhead of queue management becomes costly. These test cases suggest that a hybrid scheduling mechanism might be beneficial.
On the smaller combinational benchmarks, we see very little difference between SirSim and STEED. Apparently, the shallower logic cones (O(log n) versus O(n) for the adders) mitigate the event multiplication problem, such that cluster-queue management offsets the gains in event count.
STEED's performance on the standard adders is exceptional, obtaining speedups as high as 5.3X. As shown in Figure 7, STEED's speedup over SirSim grows with the size of the adder.
In order to demonstrate a further advantage over non-cluster-based scheduling, we generated two test-cases, adder32.r and adder64.r, with additional small randomized capacitance values on every node. These capacitances break up much of the symmetry in delay values on multi-input gates. For SirSim, this aggravates the event multiplication problem, increasing the time required to complete the simulation. However, for STEED, the event time MTBDDs simply grow a little larger, impacting performance less profoundly. The result of these additional capacitances is a larger speedup due to cluster scheduling, reaching 7.8X for the 64-bit circuit. Furthermore, these test cases model the real-life situation in which back-annotated capacitance values are included in the simulation, especially when the circuit being simulated has irregular routing above it.
To isolate the effects of event multiplication, we constructed increasing lengths of inverter chains with data-dependent load elements ( Figure 9 ). Under SirSim's scheduling algorithm, a transition on the input will generate two possible delay values through the first stage, dependent on the state of signal A. At each successive stage, the number of events at the output doubles, resulting in an exponential number of events. This is responsible for the exponential runtime shown by the top two lines of Figure 8 .
STEED encodes the events on each stage's output as an event cluster. If the capacitors connected to the data-dependent load elements are randomized, there will still be a unique delay case for every possible input assignment, so the cluster event-time MTBDDs ($T_C$) will grow exponentially. Note, however, that the exponential runtime is of lower order than in SirSim. If the load-element capacitors are constant or take on some small number of distinct values, the cluster event-time MTBDDs have a mesh structure and quadratic size. In this case STEED achieves polynomial runtime, as shown by the bottom line on Figure 8.
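The difference between the two load configurations can be checked with a small counting sketch. It assumes each stage has its own independent control input that selects between two stage delays; the delay values are hypothetical and the model ignores everything except the sum of stage delays.

```python
from itertools import product

def arrival_times(stage_delays):
    """Arrival time at the chain output for every control assignment.

    stage_delays is a list of (delay if control is 0, delay if control is 1).
    """
    times = {}
    for assignment in product((0, 1), repeat=len(stage_delays)):
        times[assignment] = sum(d[a] for d, a in zip(stage_delays, assignment))
    return times

stages = 8
constant_loads = arrival_times([(10, 13)] * stages)
# Distinct per-stage increments stand in for randomized capacitances.
varied_loads = arrival_times([(10, 10 + 2 ** i) for i in range(stages)])

cases = len(constant_loads)                          # one per input assignment
distinct_constant = len(set(constant_loads.values()))
distinct_varied = len(set(varied_loads.values()))
```

With identical loads, the 256 input assignments of an 8-stage chain collapse onto only 9 distinct arrival times, which is why a compact event-time encoding is possible; with all loads distinct, every assignment keeps its own arrival time and the encoding must grow with the number of cases.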
CONCLUSION
We have identified a major bottleneck in the data-dependent scheduling mechanism found in SirSim, which we term the event multiplication problem. To address this issue, we introduced an event-management methodology based on event clusters, and implemented it in the simulator STEED. We obtained a substantial speedup in nearly all test cases attempted, demonstrating the advantages of cluster-based scheduling.
