scheduled to change to 1 at t = 9 ns. At t = 9 ns, e is 1, and the event queue is empty.
The final simulation state is represented by the last line of Table 2.
Note that even though the simulation ran up to time t = 9 ns, the simulation loop was not executed 9 times (once per ns).
Gates α and β can then be evaluated (in parallel), triggering two messages to γ. Two additional messages are sent from the inputs to α and β, indicating that no more input changes will take place and updating their local times to t = ∞. The need for these two messages will be described below. As a result of the input waiting rule, γ's local clock can now be updated to t = 4 ns. All three gates can now be evaluated (again in parallel), triggering two more messages to γ. Again following the input waiting rule, γ can now update its local time to t = 6 ns, since it now has a message from β indicating that no additional messages will come between t = 4 ns and t = 6 ns. Gate γ is then evaluated, and a message is sent to the output at t = 9 ns.
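The input waiting rule used in this walkthrough can be sketched in a few lines. This is a minimal illustration, not the simulator described in the text: the class name `LogicalProcess`, the channel names `in_a` and `in_b`, and the timestamps in the usage lines are all hypothetical. It assumes that a gate may advance its local clock only to the smallest timestamp last seen on any of its input channels, and that a message stamped with infinity signals that no further input changes will arrive.

```python
import math


class LogicalProcess:
    """One gate with several input channels (hypothetical helper class)."""

    def __init__(self, inputs):
        # Timestamp of the most recent message seen on each input channel.
        self.channel_time = {name: 0 for name in inputs}
        self.local_time = 0

    def receive(self, channel, timestamp):
        """Record a message and advance the local clock as far as is safe."""
        if timestamp < self.channel_time[channel]:
            raise ValueError("timestamps on a channel must be nondecreasing")
        self.channel_time[channel] = timestamp
        # Input waiting rule: nothing earlier than the minimum channel
        # timestamp can still arrive, so the clock may advance to it.
        self.local_time = max(self.local_time, min(self.channel_time.values()))
        return self.local_time


gate = LogicalProcess(["in_a", "in_b"])
gate.receive("in_a", 4)   # clock stays at 0: nothing heard from in_b yet
gate.receive("in_b", 4)   # clock advances to 4
gate.receive("in_a", 7)   # clock stays at 4: in_b is only known up to 4
gate.receive("in_b", 6)   # clock advances to 6

source = LogicalProcess(["primary_input"])
source.receive("primary_input", math.inf)  # "no more changes" message
```

The infinite timestamp plays the role of the extra messages mentioned above: without it, a gate fed by an input that never changes again could never advance its clock.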
As another example, consider the circuit in Figure 4. Assume that the propagation delay of each gate is 3 ns, and each gate is on a separate processor. The local clock of the processor containing the top gate has a simulated time of 1 ns, while the local clock of the lower processor
3 We are using lookahead to relax the output waiting rule.
In the first example circuit (Figure 3), the simulation starts out as in the conservative algorithm, with messages from the input to gates α and β (see Table 5
compared. The processor which has the largest set in common is selected for the gate. After a processor is full, it is no longer considered for assignment. They report that this is fairly fast and greatly reduces communication when compared to a simple organization which places gates on the same processor if they are of the same rank.
We illustrate this algorithm using fanin cones, starting from the primary outputs and working back to the primary inputs. Table 6 shows the fanin cones for each of the gates in the example circuit. Starting from the primary outputs, we arbitrarily assign gate 2 to processor 1 and gates 7 and 10 to processor 2. Choosing gate 5 at random, we note that it has more overlap with the cones of gates 7 and 10, so it is assigned to processor 2.
Choosing gates 6 and 8 at random results in the same conclusion: assignment to processor 2. Since half of the gates are now on processor 2, the remaining gates are assigned to processor 1. This results in the partitioning illustrated in Figure 8(b).
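The cone-overlap assignment illustrated above can be sketched as follows. This is one possible reading of the procedure, under stated assumptions: overlap is measured against the union of the fanin cones already placed on a processor, ties go to the lowest-numbered processor, and a full processor is skipped. The function name, the six-gate circuit, and its cones are hypothetical and are not the circuit of Table 6.

```python
def assign_by_cone_overlap(cones, seeds, capacity, order):
    """Assign gates to processors by fanin-cone overlap.

    cones:    gate -> set of gates in its fanin cone (the gate included)
    seeds:    gate -> processor, the initial placement of primary outputs
    capacity: maximum number of gates per processor
    order:    order in which the remaining gates are considered
    Assumes total capacity is large enough to hold every gate.
    """
    assignment = dict(seeds)
    processors = sorted(set(seeds.values()))
    for gate in order:
        if gate in assignment:
            continue
        # Count gates per processor and drop processors that are full.
        load = {p: 0 for p in processors}
        for p in assignment.values():
            load[p] += 1
        open_procs = [p for p in processors if load[p] < capacity]

        def overlap(p):
            # Union of the cones of all gates already placed on processor p.
            placed = set()
            for g, q in assignment.items():
                if q == p:
                    placed |= cones[g]
            return len(cones[gate] & placed)

        assignment[gate] = max(open_procs, key=overlap)
    return assignment


# Hypothetical circuit: gates 5 and 6 are the primary outputs.
cones = {
    1: {1}, 2: {2}, 3: {1, 3}, 4: {1, 2, 4},
    5: {1, 3, 5}, 6: {1, 2, 4, 6},
}
result = assign_by_cone_overlap(
    cones, seeds={5: 1, 6: 2}, capacity=3, order=[4, 3, 2, 1]
)
# Gates 4 and 2 join gate 6 on processor 2; gate 3 joins gate 5 on
# processor 1; processor 2 is then full, so gate 1 goes to processor 1.
```

In the text the remaining gates are chosen at random; the explicit `order` argument is used here only to keep the example reproducible.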
Mueller-Thuns et al. [1993]
Figure 7, we start with gates 3 and 8 initially on processors 1 and 2, respectively. In round 2, gate 5 is added to processor 1 (since it is adjacent to gate 3), and gate 10 is added to processor 2 (being adjacent to gate 8). In round 3, gate 1 is added to processor 1, and gate 6 is added to processor 2. In round 4, gate 7 is added to processor 1, and gate 9 is added to processor 2. Finally, in round 5, gate 4 is added to processor 1, and gate 2 is left for processor 2. The resulting partitioning is the same (in this case) as the strings algorithm and is illustrated in Figure 8(a).
Figure 7, a hierarchical decomposition might place one bit of the adder on processor 1 and the other bit of the adder on processor 2. This results in the partitioning illustrated in Figure 8(
is then derived as the expectation of the maximum of P samples from this distribution. They evaluate the model by comparing the predicted performance with observed performance on several production VLSI circuits.
The results of one of these comparisons are shown in Figure 12. The dotted line represents ideal speedup; the dashed line shows observed performance with up to 16 processors [Agrawal 1986]; and the solid line shows the predicted performance using the model. The modeled results match the observed performance closely for all of the circuits they investigated.
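To make "the expectation of the maximum of P samples" concrete, the sketch below computes it exactly for a discrete distribution. It is only an illustration of that one quantity, not the authors' model: the distribution used here (a per-processor workload of 1 or 2 units with equal probability) and the function name are hypothetical, and the samples are assumed to be independent and identically distributed.

```python
def expected_max(pmf, p):
    """Expected maximum of p i.i.d. samples from a discrete distribution.

    pmf: value -> probability (probabilities are assumed to sum to 1)
    Uses P(max <= v) = F(v) ** p, where F is the cumulative distribution.
    """
    total = 0.0
    cdf_prev = 0.0
    for value in sorted(pmf):
        cdf = cdf_prev + pmf[value]
        # Probability that the maximum is exactly this value.
        total += value * (cdf ** p - cdf_prev ** p)
        cdf_prev = cdf
    return total


workload = {1: 0.5, 2: 0.5}          # hypothetical per-processor workload
one_proc = expected_max(workload, 1)  # 1.5, the plain mean
two_proc = expected_max(workload, 2)  # 1.75
```

With a synchronous algorithm every processor waits for the slowest one, so the expected step time grows with P even though the mean workload per processor does not; this is why the maximum, rather than the mean, is the quantity of interest.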
The above models address synchronous algorithms, but are not useful for asynchronous
Figure 13 compares the speedup predicted for two different benchmark circuits. The most notable conclusion from these curves is the extreme variation in performance from one circuit to the next. Using TW, benchmark 1 (Figure 13(a)) has speedup less than unity (i.e., the parallel execution is slower than the serial execution) for 32 processors, while benchmark 2 (Figure 13(b)) has near-optimal performance.
One 
