Symbolic switch-level simulation has been extensively applied to the functional verification of CMOS circuitry. We have extended this technique to account for real-valued, data-dependent delay values, and have developed a novel mechanism for symbolically computing data-dependent Elmore delays. We present our symbolic simulation and delay calculation algorithms, and discuss their application to the timing and functional verification of full-custom transistor-level CMOS circuitry.
I. Introduction
Symbolic simulation is a form of data-parallel simulation in which Boolean functions are used to encode a set of input data patterns. In conventional simulation the user applies a pattern of constant 0's and 1's to each of the circuit inputs, steps the simulator, and verifies that the outputs and state elements have settled to the desired values. With a symbolic simulator, the user may substitute Boolean variables for any of the input values to signify that the input may be either a 0 or a 1. If the user applies $n$ Boolean variables, the symbolic simulator will perform the equivalent of $2^n$ conventional simulations. The outputs and state elements of the circuit will evaluate to Boolean functions of the input variables, which can be verified against the desired behavior.
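The equivalence between one symbolic run and $2^n$ conventional runs can be illustrated with a minimal sketch. It tabulates a circuit output explicitly over all input patterns, which is only a stand-in for a real symbolic representation; the example circuit and all function names are assumptions made for illustration.

```python
from itertools import product

def conventional_run(a, b, c):
    """One conventional simulation of out = (a AND b) OR c on constant bits."""
    return (a & b) | c

def symbolic_run(n_vars, circuit):
    """Explicit stand-in for a symbolic run: the output for all 2**n patterns."""
    return {bits: circuit(*bits) for bits in product((0, 1), repeat=n_vars)}

# Three symbolic inputs cover 2**3 = 8 conventional simulations at once.
table = symbolic_run(3, conventional_run)

# Checking against a specification verifies every input pattern together.
spec = lambda a, b, c: int((a and b) or c)
all_match = all(table[bits] == spec(*bits) for bits in table)
```

A real symbolic simulator avoids this exponential enumeration by keeping the output as a single Boolean function of the input variables.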
Previous work on symbolic simulation has largely focused on unit-delay models, in which all node transitions require a uniform amount of time. This is sufficient for the majority of functional verification problems, but is clearly inadequate for verifying circuit timing. In some cases, as we show in Section III-A, more sophisticated timing models are required simply to model functionality.
Event-driven symbolic simulation has previously been extended to handle some degree of timing information. Devadas et al. [6] constructed a gate-level symbolic simulator that utilized pre-assigned gate delays in order to study the transition delay of combinational circuits. However, their gate-delay model is unable to simulate separate rising and falling delays, a crucial capability for extension to the transistor level. The skewed inverter in Figure 1(a) is one example which exhibits this behavior. Since the pulldown nFET is stronger than the pullup pFET, the output will fall more quickly than it will rise.
Seger and Bryant [16] proposed another extension to event-driven symbolic simulation to model rising vs. falling delays on specific nodes by inserting explicit delay elements and additional logic gates. One drawback of their model is its assumption of quantized delays. Furthermore, it is limited to the assignment of a single rising and falling delay value for any given node, which fails to capture several other data-dependent delay cases. Consider the data-dependent loading on the inverter output in Figure 1(b). Here, the output capacitance (and thus the inverter's delay) is dependent on the state of signal b. Another case is the asymmetric NOR gate in Figure 1(c), where the falling delay is dependent on which input fired.

To capture the full generality of the data-dependent delay model, we have developed a new methodology called symbolic timing simulation (STS). This methodology has been implemented in the simulator SirSim, a symbolic extension of the well-known transistor-level timing simulator IRSIM [15]. SirSim is event-driven and utilizes several novel event-management techniques which allow for arbitrary real-valued delays. To capture the effects demonstrated in Figure 1, we have also developed a procedure for computing data-dependent delays in transistor networks at run-time. The details of these algorithms are presented in Section II.
STS has applications in both functional and timing verification of VLSI circuits. We discuss these applications and the advantages of STS in Section III, and present experimental results from applying SirSim to a number of substantial test-cases in Section IV. Section V gives our conclusions and suggests future work.
II. Implementation
As a testbed for symbolic timing simulation, we have implemented a symbolic version of the timing simulator IRSIM [15]. IRSIM is itself derived from two earlier simulators, RSIM and nRSIM. RSIM [18] introduced the concept of event-driven switch-level timing simulation based on Elmore delays [7], [14], which are delay estimates computed as RC products. It models transistors as switched linear resistors, with all capacitors connected to ground. Figure 2 shows a simple circuit and its representation under the RSIM model. RSIM contained a simple and somewhat pessimistic model of nodes with unknown (X) values. nRSIM [5] improved on this model of X values and introduced several other enhancements, such as an improved model of charge-sharing effects and the simulation of transient voltage spikes. Lastly, IRSIM implemented an incremental simulation model, where circuit updates could be analyzed with only partial re-simulation. SirSim (Symbolic IRSIM) implements the full nRSIM model with the exception of voltage spikes, but does not include incremental simulation. Thus it is primarily indebted to nRSIM, despite being based on the IRSIM source code.

A. Event Handling

IRSIM is an event-driven timing simulator, and utilizes relatively standard event-management techniques with several interesting enhancements. All circuit nodes are capable of maintaining their voltage state, which can be either 0, 1, or X. "Events" are defined as changes in the state of a circuit node at a particular time. Although events are handled as if instantaneous, IRSIM associates a transition time with each event that is used in the computation of resultant transition delays. While we will not discuss how this information was incorporated into SirSim, it is straightforward to do so using the techniques presented in this paper.
Pending events are stored in the event queue, where they are sorted by increasing time. The main simulation loop repeatedly selects the earliest event in the event queue and updates the node state. It then identifies affected downstream nodes, recomputes their steady-state node values, determines the delays to each node, and schedules the appropriate events.
IRSIM extends this procedure with an inertial delay model, which filters out input glitches having durations less than the stage delay. This is accomplished by removing all pending events on a stage output when it is determined that the current state matches the steady-state value.
IRSIM's event-handling procedure is shown in Figure 3. This algorithm has been re-organized slightly to facilitate comparison with the symbolic version, and several important features have been dropped. In particular, we will only discuss binary simulation here, while the generalization to ternary simulation (node values $\in \{0, 1, X\}$) will be covered in Section II-G. In addition, we do not show the handling of charge-sharing effects and several efficiency enhancements.
Simulate contains the main simulation loop. It calls GetNext to obtain the earliest pending event, and updates curtime and the node state. It then calls AffectedNodes to determine which downstream nodes need to be visited. For each downstream node, it determines the new steady-state value using ComputeDC, and checks if the node value has changed. If it has changed, it computes the logic stage delay with ComputeDelay and schedules a new event on the node. CleanEvents scans through the event queue, deleting all events on the specified node that are scheduled to occur after a certain time. The first use of CleanEvents, immediately after the new event is enqueued, overrides previously computed node values that would take effect after the newly-inserted event. This is necessary because the newly-inserted event represents the latest information about the node's steady-state value and shouldn't be overwritten by long-delay events generated previously using old information. The second call to CleanEvents, performed when no change is detected on the node, is used to implement the inertial delay model.
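The conventional loop described above can be sketched as follows. This is a simplified illustration and not the IRSIM source: the fanout map stands in for AffectedNodes, the two callbacks stand in for ComputeDC and ComputeDelay, and the skewed-inverter delays (1.1 and 2.3) are used only as sample data.

```python
import heapq

def simulate(initial_events, state, fanout, compute_dc, compute_delay):
    """Process (time, node, value) events in time order and return the trace."""
    queue = list(initial_events)
    heapq.heapify(queue)
    trace = []

    def clean_events(node, after):
        # Delete pending events on `node` scheduled later than `after`.
        nonlocal queue
        queue = [e for e in queue if not (e[1] == node and e[0] > after)]
        heapq.heapify(queue)

    while queue:
        curtime, node, value = heapq.heappop(queue)   # GetNext
        state[node] = value
        trace.append((curtime, node, value))
        for n in fanout.get(node, ()):                # stand-in for AffectedNodes
            new_value = compute_dc(n, state)
            if new_value != state[n]:
                when = curtime + compute_delay(n, new_value)
                heapq.heappush(queue, (when, n, new_value))
                clean_events(n, when)      # newest information overrides later events
            else:
                clean_events(n, curtime)   # inertial delay: cancel the pending glitch
    return trace

# Skewed inverter: the output falls after 1.1 and rises after 2.3 time units.
fanout = {"in": ["out"]}
dc = lambda n, s: 1 - s["in"]
delay = lambda n, v: 1.1 if v == 0 else 2.3

clean = simulate([(0.0, "in", 1)], {"in": 0, "out": 1}, fanout, dc, delay)
glitch = simulate([(0.0, "in", 1), (0.5, "in", 0)],
                  {"in": 0, "out": 1}, fanout, dc, delay)
```

In the second run the input pulse is shorter than the falling delay, so the pending output event is cancelled and the output never changes.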
B. Symbolic Event Handling
Utilization of IRSIM's event-handling methodology in the symbolic domain requires some additional enhancements. First of all, each circuit node's state is no longer a simple scalar value in {0, 1} (see footnote 1), but a Boolean function of the variables applied to the circuit inputs. We have chosen to represent node state with Reduced Ordered Binary Decision Diagrams (ROBDDs, or simply BDDs) [3].
The primary difficulty is the proper scheduling of data-dependent delays. Consider again the skewed inverter example from Figure 1. If the input switches from symbolic variable x to symbolic variable y, either or both of which could represent 0 or 1, it is not obvious when the resultant event should be scheduled on the output. However, for any given input pattern, the output will transition at some well-defined point in time. Thus, the value of that node at any time can be represented by a Boolean function of the input variables, and the full symbolic transition is actually a progression through a series of node functions.
Using this idea, we will construct a valid sequence of functions for the output node of the skewed inverter. If we determine that a falling transition on the output occurs after 1.1ns and a rising transition occurs after 2.3ns, we obtain the series of 3 node functions shown in Figure 4. Initially, the output function is $\bar{x}$, and eventually it settles to $\bar{y}$. Since a falling transition will occur at the first timepoint and a rising transition will not occur until the second timepoint, the only way that out will be high in between is if both x and y were 0 and the output actually remained high continuously.
This behavior is captured in the function $\bar{x} \wedge \bar{y}$. In general, the output node function will progress from being dependent only on the old input variables to being dependent on the new, and in between it will be dependent on a mix of the two.
B.1 Event Masks
The key to handling data-dependent delays is to view symbolic simulation as simultaneously simulating all possible input patterns. Under different input patterns, a particular transition might occur at different points in time, or perhaps not occur at all. For the inverter example above, the falling transition always occurs at the same time, but only when the old value x is 0 and the new value y is 1. The rising transition only occurs when x is 1 and y is 0, and no transition occurs if x = y.
In the conventional event-handling algorithm described above, node state was updated by assigning an event's value to the node. For the symbolic case, we wish to update the node state selectively, so that it is not disturbed for input patterns under which no event should occur at that time. This is accomplished using event masks, which are Boolean functions that encode the conditions under which a transition occurs. The mask is added to each event record, such that an event is now defined as:

$$\mathrm{Event} = \langle \mathrm{Node},\ \mathrm{Time},\ \mathbf{Value},\ \mathbf{Mask} \rangle$$

Note that we have used boldface for the Value and Mask fields to highlight the fact that they are Boolean functions rather than scalar values. This convention will be used for the remainder of this paper.
Rather than simply copying the event's Value field into the output node's state, we select the event value only for those cases where the event mask is true:

$$\mathrm{Node.Value} \leftarrow (\mathrm{Event.Mask} \wedge \mathrm{Event.Value}) \vee (\overline{\mathrm{Event.Mask}} \wedge \mathrm{Node.Value})$$

Since our implementation utilizes BDDs, the above can be computed more efficiently using the equivalent ITE (if-then-else) operation, which forms the core of most BDD packages:

$$\mathrm{Node.Value} \leftarrow \mathrm{ITE}(\mathrm{Event.Mask},\ \mathrm{Event.Value},\ \mathrm{Node.Value})$$
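For the skewed inverter example, we would schedule events at both 1.1ns and 2.3ns having masks $(\bar{x} \wedge y)$ and $(x \wedge \bar{y})$ respectively, both with the steady-state value $\bar{y}$:

$$\text{at } t = 1.1\,\text{ns}: \quad \mathrm{out.Value} = ((\bar{x} \wedge y) \wedge \bar{y}) \vee (\overline{(\bar{x} \wedge y)} \wedge \bar{x}) = \bar{y} \wedge \bar{x}$$

$$\text{at } t = 2.3\,\text{ns}: \quad \mathrm{out.Value} = ((x \wedge \bar{y}) \wedge \bar{y}) \vee (\overline{(x \wedge \bar{y})} \wedge (\bar{y} \wedge \bar{x})) = \bar{y}$$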
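The masked update for the skewed inverter can be checked with a small sketch. Boolean functions are stored here as explicit truth tables over (x, y), which is an assumption made for readability; SirSim itself uses BDDs, and the helper names `table` and `ite` are hypothetical.

```python
from itertools import product

ASSIGNMENTS = list(product((0, 1), repeat=2))  # all (x, y) input pairs

def table(f):
    """Tabulate a Boolean function of (x, y) as {assignment: 0 or 1}."""
    return {a: int(bool(f(*a))) for a in ASSIGNMENTS}

def ite(mask, then, other):
    """Select `then` where the mask is true and `other` elsewhere."""
    return {a: then[a] if mask[a] else other[a] for a in ASSIGNMENTS}

not_x = table(lambda x, y: not x)
not_y = table(lambda x, y: not y)

# Each event is (time in ns, mask, value); both carry the steady-state value NOT y.
events = [
    (1.1, table(lambda x, y: (not x) and y), not_y),   # falling output transition
    (2.3, table(lambda x, y: x and (not y)), not_y),   # rising output transition
]

out = not_x                      # output before the input switches from x to y
history = []
for time, mask, value in events:
    out = ite(mask, value, out)  # Node.Value <- ITE(Mask, Value, Node.Value)
    history.append((time, out))
```

After the first event the output equals (NOT x) AND (NOT y), and after the second it has settled to NOT y, matching the sequence derived in the text.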
Footnote 1: X values will be incorporated in Section II-G.

Figure 5 represents the function F having two inputs a and b. To determine the return value for any given input assignment, we work downwards from the root, following the solid arcs from nodes assigned 1 and the dashed arcs from nodes assigned 0. We can see that $F = 2.5$ when either a or b is 1, and $F = 1.2$ otherwise.
To describe the MTBDD operators needed by our algorithm, we introduce the following notation. Let us define $A = \{a_0, a_1, \ldots, a_{2^n - 1}\}$ as the set of $2^n$ possible assignments to the $n$ variables in the support of MTBDD $M$, and define $M_{a_i}$ as the terminal value returned by $M$ under the assignment $a_i$. We will consider BDDs to be a special case of MTBDDs where $M_{a_i}$ is limited to the set {0, 1}.
The operator MtbddITE(I, T, E) is similar to the BDD ITE operator, selecting T for assignments which satisfy I and selecting E otherwise. Note that I must be a BDD, while T and E are MTBDDs. MtbddITE returns an MTBDD $R^{ITE}$, such that for all $i$:

$$R^{ITE}_{a_i} = \begin{cases} T_{a_i} & \text{if } I_{a_i} = 1 \\ E_{a_i} & \text{otherwise} \end{cases}$$

At times, we will also need to convert scalar constants into trivial MTBDDs containing only a single terminal node. For an arbitrary scalar $\alpha$, the trivial MTBDD will be denoted $[\alpha]$.
To illustrate the use of MTBDDs for representing data-dependent delays, Figure 6(a) shows the MTBDD that might result from a static 2-input NOR gate formed from equally sized transistors. Note that the pulldown delay is smaller in the case where both a and b are true than in the case where only one is true. Also note that the pullup delay (a and b false) is significantly larger than the pulldown delay.
In the worst case, the delay MTBDD $\mathbf{T}_{delay}$ can become exponentially large relative to the number of inputs to the circuit. However, our delay calculations are performed on single stages of logic, and the subcircuits in consideration are typically quite small. Furthermore, larger logic stages tend to be highly regular, allowing for significant sharing in the MTBDD delay representation. For example, consider the 4-input dynamic NOR gate in Figure 6(b), and its delay MTBDD. One terminal is required for each number of pulldown FETs that can be on at the same time, resulting in a $\mathbf{T}_{delay}$ with 17 total nodes. An arbitrary-width NOR gate constructed in this manner will produce a $\mathbf{T}_{delay}$ that is quadratic in the circuit size, rather than exponential.

We now have the basic machinery needed to implement the symbolic event handling procedure for SirSim (Figure 7), based on the conventional IRSIM algorithm presented above. Throughout this discussion, we will continue to denote symbolic values (BDDs and MTBDDs) with boldface ($\mathbf{F}$), while scalar values will appear in normal type. In SirSim, all BDD and MTBDD primitive operations are performed using the University of Colorado Decision Diagram Package (CUDD) version 2.2.0 [1], [17].
SymbolicSimulate forms the main body of the simulator, and it differs from Simulate in several places. First of all, scalar node values have been replaced with symbolic node values, represented as BDDs, and masks have been added to each event record. Also, all calls to subroutines have been replaced by symbolic versions.
As discussed in Section II-B.1, the node state update is performed selectively using the ITE operator and the event mask. Once the node state has been updated, SymbolicSimulate identifies those nodes requiring recomputation, and iterates on them as in the conventional algorithm. SymbolicComputeDC returns a BDD representing the symbolic steady-state value for the node. In the symbolic case, the same node may change state under one input assignment but remain stable under another. Therefore, we must compute the function change as the XOR of the old and new state, and then use this function to selectively perform both new event scheduling and event cancellation.
New event scheduling begins by computing an MTBDD representing the data-dependent logic-stage delay. Using MtbddITE, $\mathbf{T}_{delay}$ is then modified by setting it to $\infty$ for all input assignments where no state change occurs.
SymbolicSchedule is new, and is responsible for creating new events for each of the possible delay cases represented in the delay MTBDD. Its main loop repeatedly selects the smallest remaining terminal value, dmin, in $\mathbf{T}_{delay}$. It then selects the subset of events which will occur at time curtime + dmin using MtbddEqual. This result becomes the mask for the new event, which is assembled and inserted into the event queue. The last line of the SymbolicSchedule loop modifies $\mathbf{T}_{delay}$ so that dmin will get the next smallest terminal on the subsequent iteration.
Event cancellation to implement the inertial delay model is performed by the call to SymbolicCleanEvents inside SymbolicSimulate. Like CleanEvents, its purpose is to remove any events on the specified node that occur after a certain time. However, event removal is now qualified by a masking function, such that only those events that occur under certain input assignments will be removed. This translates directly into reducing the event mask by ANDing it with the inverse of the passed-in masking function. If the event mask becomes FALSE (0 under all input assignments), the event is deleted from the queue.

C. Determining the Affected Nodes
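The scheduling loop can be sketched with explicit tables in place of MTBDDs. This is an illustration under stated assumptions: the dictionary representation, the helper names, and the termination test (stop once only infinite terminals remain) are not taken from the SirSim source, and the skewed-inverter delays serve as sample data.

```python
import math
from itertools import product

ASSIGNMENTS = list(product((0, 1), repeat=2))  # all (x, y) input pairs
INF = {a: math.inf for a in ASSIGNMENTS}       # trivial "MTBDD" for infinity

def mtbdd_ite(cond, then, other):
    """Pick `then` where cond is 1 and `other` elsewhere."""
    return {a: then[a] if cond[a] else other[a] for a in ASSIGNMENTS}

def symbolic_schedule(node, curtime, value, t_delay):
    """Create one masked event per distinct finite terminal of t_delay."""
    events = []
    while True:
        dmin = min(t_delay.values())           # smallest remaining terminal
        if math.isinf(dmin):
            break
        mask = {a: int(t_delay[a] == dmin) for a in ASSIGNMENTS}  # MtbddEqual
        events.append((node, curtime + dmin, value, mask))
        t_delay = mtbdd_ite(mask, INF, t_delay)  # retire the handled cases
    return events

old = {(x, y): 1 - x for x, y in ASSIGNMENTS}   # inverter output before
new = {(x, y): 1 - y for x, y in ASSIGNMENTS}   # inverter output after
change = {a: old[a] ^ new[a] for a in ASSIGNMENTS}
stage_delay = {a: 1.1 if new[a] == 0 else 2.3 for a in ASSIGNMENTS}
t_delay = mtbdd_ite(change, stage_delay, INF)   # infinite where nothing changes

events = symbolic_schedule("out", 0.0, new, t_delay)
```

Two events result, one per distinct delay, and their masks are disjoint: the first covers the input pattern x = 0, y = 1 and the second covers x = 1, y = 0.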
The first thing that occurs after updating a node's value is to compute the set of downstream nodes that will need to be recomputed. While this is perhaps the least complicated portion of the IRSIM algorithm, it required some of the most subtle modifications to enable symbolic operation.
Under a switch-level model with grounded capacitors, a node's state can only be affected by other nodes that are reachable through transistor channels, and by the gate nodes of those transistors. Groups of nodes connected by source-drain (channel) connections are called channel-connected regions (CCRs). To enumerate the nodes affected by a node transition, it would be sufficient to simply list all nodes of all CCRs connected to the switching node. However, this overlooks the fact that some nodes in each CCR may only be reachable through currently turned-off transistors. For small CCRs such as static logic gates, simple enumeration only results in a small number of unnecessary node re-evaluations. However, for very large CCRs such as barrel-shifters or SRAMs, the number of unnecessary re-evaluations could become prohibitive. Since these large CCRs typically have mutually-exclusive control lines that effectively partition the CCR into smaller regions, we might greatly reduce the number of nodes to be processed by identifying them at runtime. This is the approach taken by both IRSIM and SirSim.
IRSIM's affected-node computation is complicated by an additional responsibility. Since the Elmore delay (discussed below) is defined only for tree structures, we must heuristically break any loops formed by conducting transistors. This loop breaking is accomplished by setting a broken flag on transistors which close conducting loops.
In IRSIM, affected-node identification and loop-breaking are done using a breadth-first search, but a depth-first version converts more easily to the symbolic case. Thus, Figure 8 presents a depth-first version of IRSIM's affected-node procedure. The search is started at the sources and drains of all transistors whose gates are connected to the switching node. AffectedRecur performs a recursive depth-first search through source/drain connections. All nodes discovered are added to the list of affected nodes, and transistors are marked as broken if they lead to a previously reached node. This algorithm dynamically identifies the nodes that make up the channel-connected regions affected by the transitioning node.
In SirSim, the algorithm is conceptually similar though it is complicated by the fact that transistor gates can have symbolic state values. Thus, each transistor may be "transparent" only under certain input assignments. Furthermore, we can actually reach a node several times under mutually disjoint sets of input assignments without being forced to break loops. In fact, the transistor broken flag itself must also be symbolic, since there will be input assignments for which the transistor closes a loop, and others for which it does not. For example, the simple circuit in Figure 9 contains a data-dependent loop. Since the gates of all three transistors have symbolic values, they will only form a closed loop when a, b, and c are all 1. This means we must set the broken flag for at least one of these transistors to the function $(a \wedge b \wedge c)$. It doesn't matter which transistor we set the flag for, and the choice will be determined by the order in which the transistors happen to be identified. With the broken flag set as shown, the transistor controlled by node a will be considered non-conducting under all input assignments that satisfy the function $(a \wedge b \wedge c)$.

Figure 10 shows SirSim's algorithm for identifying affected nodes. At each node, we maintain a BDD, Node.reached, to keep track of input assignments under which the node has already been reached. Also, at each level of recursion, we keep track of the input assignments for which the search is still active. We return when this active BDD becomes FALSE or no unexplored transistors remain. When the active function intersects the Node.reached function, the result (loop) gives the input assignments under which this node was reached multiple times. The value of loop is used to compute a new broken flag, and then to deactivate further search under those input conditions. If the active function becomes FALSE, then the recursion terminates. The remainder translates directly from the conventional AffectedNodes.

After we have identified which nodes are potentially affected by a transition, we need to compute a new steady-state value for each. In IRSIM this computation is performed by ComputeDC, shown in Figure 11.
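The symbolic affected-node search with data-dependent loop breaking, as described above for Figures 9 and 10, can be sketched as follows. Sets of input assignments stand in for BDDs, and the triangle of three pass transistors (Ta, Tb, Tc between nodes n1, n2, n3) is a hypothetical netlist in the spirit of Figure 9, not the figure itself.

```python
from itertools import product

ALL = frozenset(product((0, 1), repeat=3))     # assignments to gates (a, b, c)

def gate_on(index):
    """Assignments under which the gate variable at `index` is 1."""
    return frozenset(s for s in ALL if s[index] == 1)

# name -> (terminal 1, terminal 2, assignments under which it conducts)
TRANSISTORS = {
    "Ta": ("n1", "n2", gate_on(0)),
    "Tb": ("n2", "n3", gate_on(1)),
    "Tc": ("n3", "n1", gate_on(2)),
}

def symbolic_affected_nodes(start):
    reached = {}                                   # node -> Node.reached
    broken = {name: frozenset() for name in TRANSISTORS}

    def recur(node, active, via):
        loop = active & reached.get(node, frozenset())
        if loop:                                   # node reached again: break loop
            broken[via] |= loop
            active -= loop
        if not active:                             # search no longer active
            return
        reached[node] = reached.get(node, frozenset()) | active
        for name, (n1, n2, on) in TRANSISTORS.items():
            if name == via or node not in (n1, n2):
                continue
            cond = (active & on) - broken[name]    # conducting and not broken
            if cond:
                recur(n2 if node == n1 else n1, cond, name)

    recur(start, ALL, None)
    return reached, broken

reached, broken = symbolic_affected_nodes("n1")
```

With this visiting order the flag lands on Tc rather than on the transistor gated by a; as the text notes, which transistor receives the flag depends only on the order in which transistors are identified.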
Since all loops have been removed by AffectedNodes, the CCR forms a tree. For each node that must be recomputed, we explore outward along the tree in a depth-first manner until we reach Vdd, GND, or a non-conducting transistor. We then work backwards, performing series and parallel combinations of branches.
Each branch of the tree is represented by the tuple $\langle R_h, R_l, C_h, C_l, D \rangle$, where $R_h$ and $R_l$ are the pullup and pulldown resistances, $C_h$ and $C_l$ are the capacitances charged high and low, and $D$ is a flag indicating whether the branch is driven.

To discuss the symbolic version of the DC-value algorithm, we need to first introduce the concept of symbolic algebra. In SirSim, symbolic algebra is implemented with MTBDDs, again using the CUDD decision diagram package.
The key to performing symbolic algebra is the function MtbddApply(op, M, N). MtbddApply applies an arbitrary algebraic operator (e.g. $+, \times, /, \parallel$) to the argument MTBDDs M and N. As before, we define $A = \{a_0, a_1, \ldots, a_{2^n - 1}\}$ as the set of assignments to the $n$ variables in the support of M and N, and define $M_{a_i}$ as the terminal value reached by M under the assignment $a_i$. MtbddApply(op, M, N) will return the MTBDD R, such that for all $i$:

$$R_{a_i} = M_{a_i} \ \mathrm{op}\ N_{a_i}$$

In the discussion that follows, it will be convenient to use infix notation rather than explicit calls to MtbddApply. Thus, any algebraic expression involving boldface operands should be interpreted as a call to MtbddApply with the appropriate operands:

$$\mathbf{M}\ \mathrm{op}\ \mathbf{N} \equiv \mathrm{MtbddApply}(\mathrm{op}, \mathbf{M}, \mathbf{N})$$

MtbddApply is virtually identical to the well-known Apply operator for ROBDDs [3], and its worst-case complexity is $O(|M| \cdot |N|)$, where $|M|$ represents the number of nodes in MTBDD M.
D.2 Representing Circuit Elements
In symbolic simulation, parameters such as transistor-conductance, which depend on the state of the circuit, require a symbolic representation.
In the IRSIM circuit model, transistors are represented as switched linear resistors, having a finite linear resistance when conducting and an infinite resistance otherwise. We can represent the symbolic resistance as an MTBDD having two terminals, the on-resistance and $\infty$. Figure 13 shows the symbolic resistance for an nFET t, computed as $\mathbf{R}_t \leftarrow \mathrm{MtbddITE}(t.gate.value, [t.res], [\infty])$. Furthermore, since we can perform arbitrary algebraic manipulations on MTBDDs, we can also compute series and parallel combinations of transistors, as shown in Figure 14.

Using symbolic algebra with MTBDDs, it is not difficult to construct a symbolic version of IRSIM's DC-value procedure. In SirSim, this operation is performed by SymbolicComputeDC, shown in Figure 15.
The tuple representation for each branch is unchanged except that each real-valued member is now an MTBDD, and the "driven" flag D is now a BDD: $\langle \mathbf{R}_h, \mathbf{R}_l, \mathbf{C}_h, \mathbf{C}_l, \mathbf{D} \rangle$.
As before, SymbolicComputeDC calls SymbolicGetDC to compute the tuple for the node of interest. However, in the symbolic case, a node may be driven under one input assignment but floating under another, resulting in a non-constant $\mathbf{D}$ BDD. Therefore we must compute the DC voltage under both assumptions ($\mathbf{V}_r$, $\mathbf{V}_c$) and select the proper value for each input assignment using the MtbddITE operator. This voltage MTBDD is then converted into a logical BDD by MtbddThreshold.
The depth-first search is performed in SymbolicGetDC, utilizing the BDD active in the same manner as in SymbolicAffectedNodes. SymbolicSeries is only slightly modified from the conventional version. It first computes a symbolic resistance value for the series transistor and then uses symbolic algebra to compute the same resistance values as before. In the conventional algorithm, series was only called for conducting transistors, so the capacitance values could simply be copied into the output tuple. Now, however, the transistor may be conducting only for a subset of the input assignments, so we must use MtbddITE to remove capacitance values for assignments under which the transistor is not conducting. The only change to SymbolicParallel is the use of symbolic rather than conventional algebraic operations.
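Symbolic resistances and their series and parallel combinations can be sketched with explicit tables standing in for MTBDDs. The helper names, the two-transistor example, and the 5 kΩ on-resistance are assumptions chosen for illustration.

```python
import math
from itertools import product

ASSIGNMENTS = list(product((0, 1), repeat=2))   # gate values (a, b)

def const(c):
    """Trivial table for a scalar, playing the role of [c]."""
    return {s: c for s in ASSIGNMENTS}

def mtbdd_ite(i, t, e):
    return {s: t[s] if i[s] else e[s] for s in ASSIGNMENTS}

def mtbdd_apply(op, m, n):
    """Apply a binary operator terminal by terminal."""
    return {s: op(m[s], n[s]) for s in ASSIGNMENTS}

def parallel(r1, r2):
    """Parallel combination that treats an open circuit as infinite resistance."""
    if math.isinf(r1):
        return r2
    if math.isinf(r2):
        return r1
    return r1 * r2 / (r1 + r2)

gate_a = {s: s[0] for s in ASSIGNMENTS}
gate_b = {s: s[1] for s in ASSIGNMENTS}

# R_t <- MtbddITE(t.gate.value, [t.res], [inf]) for two hypothetical nFETs
r_a = mtbdd_ite(gate_a, const(5e3), const(math.inf))
r_b = mtbdd_ite(gate_b, const(5e3), const(math.inf))

r_series = mtbdd_apply(lambda p, q: p + q, r_a, r_b)
r_parallel = mtbdd_apply(parallel, r_a, r_b)
```

The series path conducts only when both gates are 1, while the parallel path has a lower resistance when both conduct than when only one does, which is the same data dependence the NOR gate of Figure 6(a) exhibits in its pulldown delay.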
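The selection between the resistive and charge-sharing voltages can be sketched per input assignment. Several details here are assumptions rather than facts from the paper: voltages are normalized to Vdd = 1, the charge-sharing voltage is taken as $C_h / (C_h + C_l)$, the logic threshold is 0.5, and the capacitance values are invented sample data for a precharged node with an a-AND-b pulldown.

```python
import math
from itertools import product

ASSIGNMENTS = list(product((0, 1), repeat=2))   # (a, b)

def divider(r_h, r_l):
    """Normalized divider R_l / (R_h + R_l), with limits for infinite resistance."""
    if math.isinf(r_h):
        return 0.0        # no pullup path; ignored via D when the node is undriven
    if math.isinf(r_l):
        return 1.0
    return r_l / (r_h + r_l)

def charge_share(c_h, c_l):
    """Assumed charge-sharing voltage: fraction of capacitance charged high."""
    return c_h / (c_h + c_l)

def symbolic_dc(r_h, r_l, c_h, c_l, driven, threshold=0.5):
    result = {}
    for s in ASSIGNMENTS:
        v_r = divider(r_h[s], r_l[s])
        v_c = charge_share(c_h[s], c_l[s])
        v = v_r if driven[s] else v_c           # MtbddITE(D, V_r, V_c)
        result[s] = int(v > threshold)          # MtbddThreshold
    return result

driven = {s: s[0] & s[1] for s in ASSIGNMENTS}              # driven only for a AND b
r_h = {s: math.inf for s in ASSIGNMENTS}                    # no pullup path
r_l = {s: 15e3 if driven[s] else math.inf for s in ASSIGNMENTS}
c_h = {s: 15.0 for s in ASSIGNMENTS}                        # fF, hypothetical
c_l = {s: 5.0 if s[0] else 0.0 for s in ASSIGNMENTS}        # fF, hypothetical

dc = symbolic_dc(r_h, r_l, c_h, c_l, driven)
```

Under these sample values the node is low only when both a and b are 1 and otherwise retains its precharged high value.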
E. Delay Computation
To compute the delays associated with node value transitions, IRSIM uses the ComputeDelay algorithm shown in Figure 16 .
To compute the Elmore delay, we require the resistance of the driving path(s), and the amount of capacitance to be charged or discharged. Like GetDC, GetTau performs a depth-first search through conducting transistors until it reaches Vdd or GND. However, since it has already computed the DC value for the transition, it need only collect resistance values for the driving paths and capacitances that require charging or discharging.
Besides this simplification, the other primary difference between GetDC and GetTau is the computation of branch capacitances. For Elmore delay calculation, the effective capacitance is obtained by scaling it by the factor $r_o / r_b$ when translating it across a transistor. Again, the reader is referred to [5] for a complete discussion of this computation.
The symbolic version, shown in Figure 17, is derived quite naturally. The $\langle R, C \rangle$ tuple elements are replaced with MTBDDs, and all real-valued computations are performed with symbolic algebra. We again require the active flag to control the recursion depth.
A number of enhancements to this algorithm are implemented in IRSIM and duplicated symbolically in SirSim. Both implement a separate delay calculation algorithm for charge-sharing transitions, and compute an additional delay term to account for the rise/fall time of the triggering transition.
F. An Example
In order to illustrate the different steps of the algorithm, we will work through an example evaluation of the domino AND gate shown in Figure 18. Assume that the first event in the event queue is $\langle \mathrm{Node} = A,\ \mathrm{Time} = 100,\ \mathbf{Value} = a,\ \mathbf{Mask} = a \rangle$, and node A's current value is 0. Further assume that the precharge proceeded as normal, that node B has already transitioned to value b, and the CCR has settled to a steady state such that the internal node values are as shown. After updating the state of node A, we first call SymbolicAffectedNodes to identify the nodes that may need to be re-evaluated. The search will begin at nodes x1 and P, and will return the set (x1, P, x2). Note that no loops are detected in this simple example, so the Broken flag is set to false for all transistors.
We next compute the new DC value for all nodes identified in the previous step. For the sake of brevity, the resultant tuple is only shown for the CCR output node P. $\mathbf{R}_h$, representing the pullup resistance, is identically $\infty$, while the pulldown resistance $\mathbf{R}_l$ has value 15 kΩ for $a \wedge b$, and is infinite otherwise. The capacitance MTBDDs $\mathbf{C}_h$ and $\mathbf{C}_l$ record the amount of capacitance connected to P that is charged high and low respectively. The driven function $\mathbf{D}$ shows that P is resistively driven only for $a \wedge b$.
Given this tuple, we compute the DC voltage of P using resistance and capacitance information separately, and then combine them using $\mathbf{D}$. A voltage divider equation shows that $\mathbf{V}_r$ is identically 0, since $\mathbf{R}_h = \infty$ and $\forall x,\ x / (\infty + x) = 0$. Applying the charge-sharing equation gives $\mathbf{V}_c$ as shown. We then use MtbddITE to separate the two cases, and then threshold the result to obtain the DC logic function $\overline{a \wedge b}$.

Now we can compute the delay for the new transition. Note that we have depicted this operation as a product of a lumped R with a lumped C to avoid explicitly stepping through the recursion. In our implementation (and in the algorithms presented earlier), we compute a true Elmore delay, and would obtain the result $5\,\mathrm{k\Omega} \times 5\,\mathrm{fF} + 5\,\mathrm{k\Omega} \times (5\,\mathrm{fF} + 5\,\mathrm{fF}) + 5\,\mathrm{k\Omega} \times (5\,\mathrm{fF} + 5\,\mathrm{fF} + 15\,\mathrm{fF}) = 200\,\mathrm{ps}$.
Lastly, we schedule a resultant event for node P: $\langle \mathrm{Node} = P,\ \mathrm{Time} = 325,\ \mathbf{Value} = \overline{a \wedge b},\ \mathbf{Mask} = a \wedge b \rangle$
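The Elmore sum quoted in this example can be checked numerically. The sketch below only reproduces the arithmetic: the grouping of capacitances follows the quoted expression term by term, the function name is hypothetical, and no claim is made about how the values map onto Figure 18.

```python
def ladder_delay_ps(resistances_ohm, capacitances_ff):
    """Sum R_i times the running total of capacitance, as in the quoted expression."""
    total = 0
    running_c = 0
    for r, c in zip(resistances_ohm, capacitances_ff):
        running_c += c
        total += r * running_c        # units: ohm * fF
    return total / 1000               # 1 ohm * 1 fF = 1e-15 s, so /1000 gives ps

# 5k*(5) + 5k*(5+5) + 5k*(5+5+15), in ohms and femtofarads
delay_ps = ladder_delay_ps([5000, 5000, 5000], [5, 5, 15])
```

The three terms contribute 25 ps, 50 ps, and 125 ps, which sum to the 200 ps stated in the text.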
G. Ternary Simulation
The preceding discussion has assumed binary (0,1) simulation, when in fact both IRSIM and SirSim utilize ternary (0,1,X) node values. SirSim uses a dual-rail encoding like that used in Cosmos [4] and MOSSYM [2]. Two BDDs, value.h and value.l, are used to encode the symbolic ternary node value as follows:

value.h = 1, value.l = 0 : HIGH
value.h = 0, value.l = 1 : LOW
value.h = 1, value.l = 1 : UNKNOWN (X)
value.h = 0, value.l = 0 : not defined

All nodes are initialized to X values at the start of simulation.
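The dual-rail encoding can be illustrated on scalar values, where each rail is a single bit; in the symbolic setting each rail would be a BDD and the same operations would apply per input assignment. The gate functions below are the usual dual-rail constructions and are an assumption for illustration, since the paper does not list them.

```python
# (value.h, value.l) pairs following the encoding above
HIGH, LOW, X = (1, 0), (0, 1), (1, 1)

def ternary_not(v):
    """Swap the rails: 'may be high' becomes 'may be low' and vice versa."""
    h, l = v
    return (l, h)

def ternary_and(p, q):
    """High only if both may be high; low if either may be low."""
    return (p[0] & q[0], p[1] | q[1])

def ternary_or(p, q):
    """High if either may be high; low only if both may be low."""
    return (p[0] | q[0], p[1] & q[1])

controlled = ternary_and(X, LOW)      # a LOW input forces the AND output LOW
uncertain = ternary_and(X, HIGH)      # the unknown input propagates
```

Reading value.h as "the node may be high" and value.l as "the node may be low" explains why X sets both rails and why the (0, 0) combination is left undefined.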
G.1 Ternary DC Value Computation
When the node connected to a transistor's gate has an unknown (X) value, its equivalent resistance can vary from the finite conducting value to $\infty$. Now each term in the tuple representation is actually a pair of min/max MTBDDs. When performing series and parallel combinations of these tuples with ranges, an approximate solution is computed. The details of this approximation comprise a large part of Chu's thesis [5], and are beyond the scope of this article. However, the operations required involve straightforward algebraic manipulation and can be duplicated exactly in symbolic form without substantial difficulty.
To obtain the DC voltage from the tuple with ranges, the voltage divider and charge-sharing computations are slightly modified.
G.2 Delay Computation with Uncertainty
For delay computation in the presence of unknown node states, resistance ranges are again computed and dealt with in the same manner as above. However, capacitance requires a different treatment.
We maintain two MTBDDs, one for capacitance potentially charged high and the other for potentially discharged capacitors. This is similar to the method described for the binary version of GetDC, except that capacitance in the X state is added to both $\mathbf{C}_{hx}$ and $\mathbf{C}_{lx}$. Since SirSim does not currently allow event times to be time-ranges, we must select either the maximum or minimum delay possible for each input assignment. Following IRSIM's heuristic, we assume that the maximum delay is the conservative choice when switching to a well-defined value, while the minimum delay is selected when transitioning to an X. Note that the data-dependent delay variations are still accounted for, and this approximation only affects delays for input assignments where one or more node values are X's.

$$\mathbf{T}_{min} \leftarrow \mathbf{R}_{min} \times \mathbf{C}_{switch}$$
$$\mathbf{T}_{max} \leftarrow \mathbf{R}_{max} \times \mathbf{C}_{switch}$$
$$\mathbf{T}_{delay} \leftarrow \mathrm{MtbddITE}(\mathrm{DCVal.h} \wedge \mathrm{DCVal.l},\ \mathbf{T}_{min},\ \mathbf{T}_{max})$$
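The min/max delay selection can be sketched per input assignment. Explicit tables over one hypothetical input variable u stand in for the MTBDDs, and the resistance and capacitance figures are invented sample data.

```python
ASSIGNMENTS = [(0,), (1,)]            # one hypothetical input variable u

def select_delay(dc_h, dc_l, r_min, r_max, c_switch):
    """Pick T_min when the new value is X (both rails set) and T_max otherwise."""
    t_delay = {}
    for s in ASSIGNMENTS:
        t_min = r_min[s] * c_switch[s] / 1000   # ohm * fF / 1000 = ps
        t_max = r_max[s] * c_switch[s] / 1000
        to_x = dc_h[s] and dc_l[s]              # DCVal.h AND DCVal.l
        t_delay[s] = t_min if to_x else t_max
    return t_delay

# Under u = 0 the node settles LOW; under u = 1 it becomes X.
dc_h = {(0,): 0, (1,): 1}
dc_l = {(0,): 1, (1,): 1}
r_min = {s: 5000 for s in ASSIGNMENTS}          # ohms
r_max = {s: 10000 for s in ASSIGNMENTS}         # ohms
c_switch = {s: 10 for s in ASSIGNMENTS}         # fF

t_delay = select_delay(dc_h, dc_l, r_min, r_max, c_switch)
```

The transition to a well-defined value takes the conservative 100 ps, while the transition to X is scheduled at the earliest possible time, 50 ps.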
III. Application
Symbolic Timing Simulation has applications to both functional and timing veri cation of transistor-level circuits. While it is substantially more compute intensive than static analysis techniques, it is applicable to a much broader family of circuits.
To verify a block containing arbitrary circuit structures, we simply perform a symbolic timing simulation while monitoring the Boolean functions on the output and state nodes. A speci cation of the correct output function must be supplied by the user or extracted from the RTL description of the block. If the initial values of particular latch nodes are required to express the expected output function, then the user must initialize them to symbolic values as well. If the output nodes settle to the proper functions while being simulated under a realistic delay model, then the timing and functional correctness of the circuit under all input patterns is implied. If an error is detected, a counterexample is generated and the simulation can be run again with non-symbolic inputs to aid in debugging.
We should note that our verification is limited at present to a model having specific delay values for each input pattern, rather than delay ranges that result from process variation and other sources of uncertainty. This is a direct result of our modelling of node transition events as instantaneous. Static techniques, on the other hand, are well suited to analysis based on time ranges, and thus can be more provably conservative. We are looking into extending STS by incorporating delay ranges in the form of min/max delay values. This can be accomplished either by explicitly modelling events as time ranges, or by scheduling additional transitions to an X value at the minimum delay point. The latter approach is substantially easier to implement, but is slightly more pessimistic.
A. Functional Verification
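The second option, scheduling an extra transition to X at the minimum delay point, might look roughly like the following. This is a speculative sketch of a proposed extension, not existing SirSim behavior; `schedule_with_range` and its event format are assumptions made for illustration.

```python
def schedule_with_range(now, new_value, t_min, t_max):
    """Return (time, value) events for a transition whose delay lies
    somewhere in [t_min, t_max].

    The node becomes X at the earliest possible switching time and
    settles to its new value at the latest possible switching time.
    """
    events = []
    if t_min < t_max:
        events.append((now + t_min, "X"))
    events.append((now + t_max, new_value))
    return events

ranged = schedule_with_range(10.0, 1, 2.0, 5.0)
exact = schedule_with_range(10.0, 0, 3.0, 3.0)
```

While the node is X, downstream logic sees an unknown value for the whole interval, which is where the additional pessimism comes from.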
Timing and functionality cannot be easily separated for many circuits, causing problems for methodologies which perform these analyses independently. In some cases, timing information is required simply to model circuit functionality properly.
Consider the example shown in Figure 19 which was used as part of a row-decoder and RAM wordline driver. It is composed of a dynamic NOR stage followed by a static NAND gate. The NAND gate is intended to keep the wordline de-asserted during precharge of the NOR stage. If any of the mismatch lines are high when CLK rises, then select should fall quickly enough to keep nWL from going low. The safety of this circuit is ensured by the relative loading of nWL and select. Since the WordLine driver is very large, nWL is heavily loaded and will not switch nearly as quickly as select.
Existing static functional verification methodologies such as equivalence checking use either zero-delay or unit-delay models of timing behavior. Zero-delay analysis will fail to model the storage of charge (and thus state) on the precharge node. Unit-delay analysis, in turn, assumes that all transitions require the same amount of time, resulting in a unit glitch on WordLine. Capturing the intended behavior of this circuit requires proper modelling of the relative stage delays. Since SirSim implements an Elmore delay model and uses inertial delays, it computes the correct sequence of events for this circuit.
B. Timing Verification
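The difference between the two delay models can be seen in a deliberately reduced model of the decoder stage. The sketch assumes that the NAND gate has inputs CLK and select, that CLK rises at t = 0 with a mismatch line high, and that a pending output event is cancelled when the gate's target value returns to its current value (inertial delay). The function name and the delay numbers are made up for illustration.

```python
def nwl_transitions(d_select_fall, d_nwl_fall, d_nwl_rise):
    """Transitions on nWL after CLK rises at t = 0 with a mismatch line high.

    Initially select = 1 (precharged) and nWL = 1. The NAND gate sees
    CLK = 1 and select = 1, so a fall on nWL is pending at d_nwl_fall.
    """
    if d_select_fall < d_nwl_fall:
        # select drops before the pending fall matures: the NAND target
        # is 1 again, equal to the current value, so the event is cancelled.
        return []
    # Otherwise nWL falls, then recovers once the NAND reacts to select = 0.
    return [(d_nwl_fall, 0), (d_select_fall + d_nwl_rise, 1)]

unit_delay = nwl_transitions(1, 1, 1)   # every transition takes one unit
loaded = nwl_transitions(1, 5, 5)       # nWL heavily loaded, select fast
```

Under unit delays the model produces a one-unit low pulse on nWL, while with nWL much slower than select no transition occurs at all.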
Timing constraints on circuits exist to ensure that signal transitions occur in the order required for proper operation. Some constraints are imposed by the circuit's environment, and some are due to internal structures, such as latches, dynamic gates, and self-timed loops.
Static timing analysis relies on pattern-matching routines to identify these structures from the circuit netlist and apply timing constraints based on a set of precompiled rules. In a full-custom design environment, designers often creatively hand-optimize circuitry to take advantage of local don't-care cases or fix critical timing paths. These hand-optimized circuits rarely match the patterns built into the static timing analyzer, causing it to apply incorrect constraints. To perform a timing analysis in this situation, designers are left with two equally unattractive options: develop a substantial simulation suite or train the analyzer to "understand" the circuit. Both options are labor-intensive, and the simulation option may simply be infeasible if the circuit is large enough. The result in either case is a time-consuming and error-prone analysis.
However, a symbolic timing simulator need only simulate the circuit and check for correct functionality. Since it avoids having to explicitly determine timing constraints based on circuit topology, it is much more robust with respect to variations from standard design styles.
A further advantage of STS is that its output arrival times will be more accurate than those computed by a timing analyzer, given the same delay computation model. One reason for this is that the simulator is not susceptible to false paths, because their effects will be eliminated by a dynamic sensitization criterion. McGeer demonstrates that the dynamic criterion cannot underestimate the true circuit delay, while the static criterion can [10]. While the dynamic criterion does not satisfy the monotone speedup property, we argue that it matches reality much more closely and will give a higher-quality estimate of the true delay. Furthermore, McGeer shows in [11] that the dynamic criterion does satisfy the monotone speedup property for dynamic (precharge-unate) circuits, an application for which this methodology is particularly well suited.
In addition, a symbolic timing simulator need not make worst-case assumptions about the state of the surrounding circuitry when computing delays. A static analyzer must assume worst-case loading, simultaneous switching, and capacitive coupling during delay calculation to ensure a conservative analysis. Because a symbolic simulator stores the current state of every node in the circuit, it can avoid this pessimism if its delay model correctly accounts for these effects.
C. Complexity
Since we are performing a complete analysis over all possible input combinations without any loss of information, the worst-case complexity of symbolic timing simulation is necessarily exponential in the number of inputs to the circuit. However, the actual complexity is highly dependent on the efficiency with which the circuit's node functions can be represented. Implementations of symbolic simulators using BDDs to compute and represent node functions have been shown to be very efficient for a wide range of interesting circuits due to significant amounts of BDD subgraph sharing.
As in any simulation-based approach, runtime complexity can be difficult to evaluate because it is affected by a number of factors. However, it is a natural metric and is of paramount importance to potential users. In general, we have found runtime rather than memory usage to be the limiter for symbolic timing simulation.
STS runtime is affected primarily by the number of events generated and the efficiency of the symbolic encoding. The number of events is in turn dependent on the circuit topology, depth of the logic cones, latching strategies, and circuit size. Of course, if the circuit is astable (for example, a three-inverter loop), the simulation may never terminate. In general, given the same number of transistors, deeper logic cones will generate more events because of the larger number of potential delay paths. Edge-triggered flip-flops help control the event count by resynchronizing multiple events that arrive at their inputs.
Probably the most important factor affecting runtime is the efficiency of the symbolic encoding, which is strongly dependent on the variable order selected for the BDDs and MTBDDs. For some circuits, no sub-exponential variable order exists.
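The sensitivity to variable order can be seen on a small textbook function. The sketch below is an illustration only and is unrelated to SirSim's BDD package: it counts the internal nodes of a reduced ordered BDD by brute-force Shannon expansion with a unique table, which is feasible only for a handful of variables. The function `robdd_size`, the example function `f`, and the variable names are assumptions chosen for the demonstration.

```python
def robdd_size(f, order):
    """Count internal nodes of the reduced ordered BDD of f under `order`.

    f takes a dict mapping variable names to 0/1. Terminals have ids 0
    and 1; internal nodes are made unique per (level, low, high).
    """
    n = len(order)
    unique = {}

    def build(level, partial):
        if level == n:
            return int(bool(f(partial)))
        var = order[level]
        low = build(level + 1, {**partial, var: 0})
        high = build(level + 1, {**partial, var: 1})
        if low == high:
            # Redundant test: the node is skipped entirely.
            return low
        key = (level, low, high)
        if key not in unique:
            unique[key] = len(unique) + 2
        return unique[key]

    build(0, {})
    return len(unique)

def f(v):
    # x1*y1 + x2*y2 + x3*y3
    return (v["x1"] and v["y1"]) or (v["x2"] and v["y2"]) or (v["x3"] and v["y3"])

interleaved = robdd_size(f, ["x1", "y1", "x2", "y2", "x3", "y3"])
separated = robdd_size(f, ["x1", "x2", "x3", "y1", "y2", "y3"])
```

For this family of functions the interleaved order grows linearly with the number of product terms, while the separated order grows exponentially, so the gap widens quickly as the function gets larger.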
In practice, we have seen runtimes on the order of several minutes for most circuits of up to 10,000 transistors. With further algorithmic improvements and tuning, this might be pushed as high as 50-100K transistors, which is sufficient to handle sub-blocks typically assigned to single designers. Above this size, the key to utilizing STS may be extensions capable of generating abstractions of the timing interfaces suitable for use by a static timing analyzer at a higher level of the hierarchy.
A. Adders
For our first substantial test cases, we ran SirSim on Manchester carry-chain adders [8] of varying widths. We expected these circuits to exhibit worst-case behavior for our scheduling algorithm in two ways. First, event timings are highly dependent on the choice of input values, resulting in smaller sets of events that can be scheduled together as symbolic transitions. Second, the depth of the carry-chain logic is proportional to the width of the adder, which we expected to generate an exponential number of events relative to the circuit size. In most other cases, one would expect the depth of the logic cone to be $O(\log n)$, creating a polynomial number of events.
However, the runtimes were excellent, and in fact grew only as the cube of the adder width. They are plotted in Figure 20, along with the time required to perform the equivalent $2^{2n}$ conventional IRSIM runs, for adders of width $n = 1 \ldots 64$.
For the 32-bit adder, SirSim performed a complete analysis in less than 2.5 minutes on a 300MHz UltraSparc system, representing a speedup over exhaustive conventional simulation of $10^{17}$. For the 64-bit adder, SirSim required less than 20 minutes, and achieved a speedup of $10^{33}$. While the absolute speedup values are so large as to be almost meaningless and can be made arbitrarily large by increasing the size of the circuit, it is worth noting that a previously infeasible analysis on realistically sized adders is now possible.
The following analysis justifies this cubic behavior. The runtime is composed primarily of two factors: the number of events processed, $E$, and the average cost of processing one event, $C$. Since most of the processing cost is MTBDD traversal, $C$ will be proportional to the average MTBDD size. For an adder, we know that the output BDDs are linear in the width of the adder, so we can expect this to be true of the computational MTBDDs as well, and thus of $C$. To estimate $E$, we look at the $i$-th adder bit-slice. Each slice locally generates a constant number of events $l$ on its carry output $carry_i$. In the Manchester carry-chain design, there is only one delay path possible from $carry_i$ to $carry_{i+1}$. Thus if we assume that $k_i$ events were scheduled on $carry_i$, then there should be $k_i + l$ events on $carry_{i+1}$. Since $carry_0$ is a constant and has no events, $carry_1$ will have only the locally generated $l$ events, and $carry_n$ will have $nl$ events. Thus the total event count is $E = l \sum_{i=1}^{n} i = O(n^2)$, and the total runtime is $T = EC = O(n^3)$.
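For reference, the event-count argument can be written out step by step. The closed form of the sum is standard arithmetic; everything else restates the assumptions of the paragraph above (a constant $l$ local events per slice, a single delay path between consecutive carries, and $C = O(n)$).

```latex
\begin{align*}
k_1 &= l, \qquad k_{i+1} = k_i + l \;\;\Longrightarrow\;\; k_i = i\,l, \\
E   &= \sum_{i=1}^{n} k_i = l \sum_{i=1}^{n} i = l\,\frac{n(n+1)}{2} = O(n^2), \\
T   &= E \cdot C = O(n^2) \cdot O(n) = O(n^3).
\end{align*}
```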
In order to evaluate SirSim on a more realistic adder implementation, we also simulated varying widths of a carry-bypass design. The runtimes (shown in Table I) were slightly worse than those for the ripple-carry design, due to the additional circuitry. This test case is particularly interesting because carry-bypass adders are notoriously difficult for static timing analysis due to the huge number of false paths.
B. Combinational Circuits
To determine how SirSim performs on combinational networks, we ran several of the ISCAS89 benchmarks. To obtain transistor-level networks, we replaced each gate with an equivalent nominally sized static CMOS subcircuit. We also removed the DFFs and turned their inputs into primary outputs, and their outputs into primary inputs. Note that SirSim can simulate sequential circuits, but our goal was to determine performance on a sample of combinational circuits. For these experiments, we used a simple depth-first-search ordering heuristic for the BDD and MTBDD variables. As can be seen from Table I, the runtimes varied substantially and were not highly correlated with the size of the circuit. The efficiency of this technique is heavily dependent on the BDD/MTBDD variable order selected, and on the
We also implemented a self-resetting 64-bit incrementer designed at IBM [9]. This circuit makes use of self-timed, locally generated reset signals to accept a pulsed input, compute the incremented value, signal a pulsed output, and reset itself to prepare for the next input. It uses no global clocks, and all operations are triggered by the pulsing of the input data lines.
We used SirSim throughout the implementation of the incrementer to verify both functionality and timing, and found it to be a very natural way to identify errors. By simulating a pulsed symbolic input vector and placing checks on the output lines, we located and debugged problems in connectivity, drive strengths, reset delays, etc.
In contrast, the original designers made use of a rather complicated, special-purpose timing verification methodology, which they outlined in [12]. Their methodology involved adding pulse-propagation and overlapping pulse-width checks to an in-house static timing analyzer. Since SirSim uses an inertial delay model, its verification power is virtually identical to these checks. This example clearly demonstrates the power of SirSim to handle even highly customized circuit design styles. Furthermore, SirSim's runtime for this 4200-transistor design was around 40 seconds (sr incr64 in Table I).
Figures 21-22.
In order to gain some intuition into the speedup attained by symbolic timing simulation, we compared the total number of symbolic events per timestep with the total number of real events. We define the real event count as the sum of all events that would occur in a given timestep in an exhaustive conventional simulation suite. This was computed by examining the don't-care sets of the symbolic event trace.
For circuits such as adders, which have highly data-dependent delays, we did not know if each symbolic event would be able to capture a significant number of real events. However, as Figure 22 shows, it was quite successful, resulting in an average symbolic compression (ratio of real events to symbolic events) of over 9600 for the 8-bit adder. This compression increases exponentially with the adder width.
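The compression ratio defined above can be illustrated with a toy computation. The sketch assumes each symbolic event carries a guard, that is, a predicate over the inputs describing the patterns under which the event occurs, and counts satisfying patterns by enumeration rather than from don't-care sets of BDDs. The function name and the example guards are hypothetical.

```python
from itertools import product

def symbolic_compression(guards, num_inputs):
    """Ratio of real events to symbolic events in one timestep.

    Each guard stands for one symbolic event; the number of input
    patterns satisfying it is the number of real events it covers.
    """
    patterns = list(product((0, 1), repeat=num_inputs))
    real_events = sum(
        sum(1 for bits in patterns if guard(*bits)) for guard in guards
    )
    return real_events / len(guards)

# Two symbolic events over three inputs (made-up guards).
guards = [
    lambda a, b, c: a == 1,             # covers 4 input patterns
    lambda a, b, c: a == 1 and b == 1,  # covers 2 input patterns
]
ratio = symbolic_compression(guards, 3)
```

Here two symbolic events stand for six real events, giving a compression of 3.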
V. Conclusions and Future Work

