Introduction
With the size of digital systems doubling with each VLSI process generation, faster verification of such systems is essential. Traditional methods of functional verification such as software simulation are slow, and hence the system cannot be thoroughly tested. As a result, several technologies have evolved for functional verification. The cost-performance trend of the various options is shown in Figure 1. At one end is software simulation, which is flexible but slow; at the other end is a prototype, which is fast but inflexible and hence expensive. A hardware emulator takes some good qualities of both of these extremes. It is faster than software simulation and more flexible than a prototype [1], [2], [3].
Figure 1. Verification Choices (performance axis: thousands of cycles/sec, 1 to 10000)
As the size of the system being emulated increases, the chief problems faced are as follows:
1. Minimizing the latency/response time seen by the environment to a stimulus.
2. Building the system with simple modular pieces so that large systems can be easily accommodated.
Any emulation system which targets large designs has to be modular and scalable so that smaller modules can be easily put together to emulate large designs.
Our solution to this problem is to decompose a complex emulation system into smaller modules, each of which is itself a complete emulation system. In this paper, we describe just such a modular scheme which, coupled with a particular partitioning and pipelining strategy, in theory, promises unlimited emulation system scaling.
In the remainder of this paper, we will first describe the basic emulation system (which forms a module in the scalable system). Next, we state the scaling problem faced by large emulation systems and describe our solution to the scaling and response time minimization problem by introducing a particular partitioning strategy and the associated pipelined control structure. We develop a Petri-net model for the control structure and analyze this model to obtain performance estimates for the emulation system in terms of the performance of the individual emulation modules. We conclude by discussing the properties of the proposed scheme that enable scaling.
Overview of the Basic Emulation System Module
Each emulation module would have a controller which would coordinate the emulation activities on the module. Our emulation system is aimed at synchronous systems only, and in each cycle of the original system-being-emulated, the emulator controller regulates three types of actions in the following sequence:
1. Obtaining inputs from the environment and applying them to the design.
2. Providing the appropriate control sequence to the system so that the behaviour of the system-being-emulated is faithfully implemented and the logic of the system is correctly evaluated (processing) [4], [5], [6], [7].
3. Providing the outputs of the system-being-emulated to the external environment.
The response time seen by the environment is one cycle of this input/processing (control sequence)/output sequence.
Thus, in the emulation system, a single clock period in the system-being-emulated is modeled by a sequence of actions regulated by the emulation controller.
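As an illustration only, the three-step cycle above can be sketched in Python. The names `run_module` and `toy_accumulator`, and the dictionary-based signal representation, are assumptions made for this sketch; they are not part of the emulator described here.

```python
def run_module(stimuli, evaluate, state):
    """Emulate one module for len(stimuli) clock periods of the original design.

    `evaluate` is a hypothetical stand-in for the control sequence that
    evaluates the mapped logic; it returns (next_state, outputs).
    """
    responses = []
    for inputs in stimuli:                         # one iteration = one emulated clock period
        applied = dict(inputs)                     # 1. obtain inputs from the environment
        state, outputs = evaluate(state, applied)  # 2. processing (control sequence)
        responses.append(outputs)                  # 3. provide outputs to the environment
    return state, responses


def toy_accumulator(state, inputs):
    """Toy synchronous design (assumed for illustration): a running sum."""
    next_state = state + inputs["d"]
    return next_state, {"q": next_state}


final_state, responses = run_module(
    [{"d": 1}, {"d": 2}, {"d": 3}], toy_accumulator, 0
)
```

Each loop iteration stands for one clock period of the system-being-emulated, so the response time seen by the environment is the duration of one iteration.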
The Scaling Problem for Emulation Systems
The chief difficulty in architecting an emulation system is that FPGAs typically have packing densities that are 1/10th that of ASICs. Thus, the number of FPGAs required in an emulation system can be large. To map an arbitrary design into a set of FPGAs, we need to partition the design appropriately. To get maximum performance out of this partitioning, we would need to pipeline these partitions. This gives rise to control/modularity issues. The response time seen would also be a function of the partitioning/pipelining scheme.
To get a truly scalable system we would require a scheme which, along with correctly modeling the system, results in a modular system and for which the response time observed is not a function of pipeline depth.
As system complexity scales, the I/O activity increases with the "perimeter" of the design whereas the processing activity increases with the size of the system. Thus a scalable emulation system must be able to manage the increase in processing activity without affecting response time.
Our Solution to the Scaling Problem
Our solution to the scaling problem tries to approximate the above criterion for true scalability.
Our solution lies in a unique way of partitioning the design under test, as shown in Figure 2. In this "onion partition", the design is broken up such that the resultant partitions depend on each other in a layered fashion [9]. Only one partition interacts with the environment. Every partition interacts with two other partitions, the one on the outside and the one on the inside. Three types of signals are permitted: signals going inwards through latches, signals going outwards through latches, and combinational paths going inwards. Each partition is a module and is controlled by a module controller. The module controllers interact with each other to move the system forward. As soon as the conditions/dependencies for a module to move forward are fulfilled, the controller for that module sends a message to the modules dependent on it to inform them that the signals they need are available. Although this kind of mechanism can support any arbitrary partitioning of the design into modules, an arbitrary partitioning can quickly lead to a complex control scheme.
The response time for this pipeline will be the latency of the input/process/output sequence of the outermost partition only. The latency of any partition also depends on the latencies of its outer and inner partitions because of the dependencies between them, but if we match the latencies of the partitions (by equalizing emulation activity), this effect is minimized. Thus, the response time of the pipeline will not grow appreciably with the number of stages in the pipeline if we follow this rule. For a random partitioning, this property would be hard to achieve. In this way we get modularity as well as independence of response time from pipeline depth.
As the system complexity scales, the increase in I/O activity will directly show in the latency of the outermost partition but the increase in the processing activity can be hidden in the pipeline thus ensuring scalability.
The throughput of the asynchronous pipeline that is set up with these partitions is determined by the worst-case behaviour of any stage, just as in a normal pipeline. Thus, we should equalize the emulation activity of each pipeline stage. Equalizing the emulation activity therefore helps us both in scaling the system and in achieving a high throughput.
Modeling the Pipeline
Each module controller can be visualized as a state machine that performs I/O transfers between the modules and goes through a control sequence to supervise the emulation activity on the module. The latency of the control sequence in an individual module is not fixed; we may consider it as a random variable. The control scheme for this kind of partitioning is described by the Petri-net shown in Figure 3. The module controller mimics the behaviour of this Petri-net at the module level. There is an interlocking arrangement between the partitions (modules) which takes care of the inter-partition dependencies. The part of the Petri-net corresponding to one controller stage is shown between the dotted lines in Figure 3. The controller gets a "move" signal from the partition controller preceding it (after a delay greater than the delay of any combinational path going from the outer partition to the inner one if such a path exists, otherwise with zero delay). It also gets a "complete" (cmp) signal after the transfers across the forward interface are over. Since the "move" signal from the backward interface comes only after the "complete" on that interface is over, we don't need to look at the "complete"
This kind of partitioning can be extended to two dimensions. The simplest two-dimensional structure would be a mesh as in Figure 4. The corresponding partition is shown in Figure 5. The 2D case gives us more flexibility in partitioning. Latched as well as combinational signals are allowed to go across partitions in the same layer, but there should be no purely combinational cyclic paths across two partitions. The Petri-net description of the control for the 2D case is similar to that for the 1D case. The difference is that there are more "Tx" legs for each controller, since in the 2D case each controller talks to more than two partitions. Also, in the 2D case, each partition might get moves from more than one partition, unlike the 1D case where a move comes from only one partition. Similarly, each controller will generate more completes, as one complete is generated for each partition to which the present controller gives a move. Essentially, the "move" from a partition should arrive after the combinational signals from the partition have been transferred, and the "cmp" signal should be generated after all the latched signals for an inter-partition interface have been transferred.
The Petri-net sections for the different kinds of interfaces are shown in Figure 6. There are two kinds of interfaces: an interface to a partition belonging to a different layer and an interface to a partition belonging to the same layer.
Figure 6. Different kinds of interfaces (panels: interface between different levels; interface in same level)
To arrive at such a layered partitioning for an arbitrary design, we start with a graph where nodes represent blocks of logic, solid arcs represent latched paths between blocks and dotted arcs represent combinational paths between blocks. A legal "layer number" labeling as shown in Figure 7 would have to satisfy the following rules:
1. All blocks with input/output signals from/to the environment should be in a partition in layer 1.
2. Signals from a lower layer partition to a higher layer partition can be either latched or combinational.
3. Only latched signals can go from a higher layer partition to a lower layer partition.
4. Between partitions in the same layer, we can have latched as well as combinational signals.
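The four rules above can be checked mechanically. The following is a minimal sketch, assuming the design graph is given as a list of `(src, dst, kind)` edges with `kind` equal to `"latched"` or `"comb"`; the function name `labeling_violations` and this data layout are assumptions made for illustration. The sketch does not check for purely combinational cycles across partitions.

```python
def labeling_violations(layer, edges, env_blocks):
    """Return a list of rule violations (empty if the labeling is legal).

    layer      -- dict mapping block name to layer number (1 = outermost)
    edges      -- iterable of (src, dst, kind), kind is "latched" or "comb"
    env_blocks -- blocks with input/output signals from/to the environment
    """
    problems = []
    for block in env_blocks:  # rule 1: environment-facing blocks sit in layer 1
        if layer[block] != 1:
            problems.append(
                f"{block} talks to the environment but is in layer {layer[block]}"
            )
    for src, dst, kind in edges:
        # Rules 2-4 leave exactly one illegal case: a combinational
        # signal going from a higher layer to a lower layer.
        if kind == "comb" and layer[src] > layer[dst]:
            problems.append(
                f"combinational signal {src}->{dst} goes from layer "
                f"{layer[src]} to layer {layer[dst]}"
            )
    return problems


example_edges = [("A", "B", "comb"), ("B", "A", "latched"), ("B", "C", "comb")]
legal = labeling_violations({"A": 1, "B": 2, "C": 2}, example_edges, ["A"])
illegal = labeling_violations({"A": 1, "B": 2, "C": 1}, example_edges, ["A"])
```

In the second call, block C is placed in layer 1 while it receives a combinational signal from B in layer 2, so rule 3 is violated.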
Such a legal labeling can be obtained as follows:
1. Eliminate all combinational loops by collapsing all strongly connected components into a single node.
2. Perform a breadth-first search (BFS) labeling starting from the environment.
3. If a backward combinational edge is present, relabel one of the two endpoints (using the smaller label). Go back to step 2 if relabeling was done.
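The three steps above can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes that the strongly connected components are taken over combinational edges only, that the BFS treats edges as undirected, that every block is connected to the environment, and that step 3 lowers the source of a backward combinational edge to the label of its destination until nothing changes. The function names and the `(src, dst, kind)` edge format are hypothetical.

```python
from collections import deque


def reachable(start, succ):
    """Return the set of nodes reachable from `start` (including itself)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def label_layers(blocks, edges, env_blocks):
    """Assign a layer number to every block; kind is "latched" or "comb"."""
    # Step 1: collapse combinational loops. Two blocks share a component
    # if each reaches the other over combinational edges.
    comb_succ = {}
    for src, dst, kind in edges:
        if kind == "comb":
            comb_succ.setdefault(src, []).append(dst)
    reach = {b: reachable(b, comb_succ) for b in blocks}
    rep = {b: min(v for v in reach[b] if b in reach[v]) for b in blocks}

    # Step 2: BFS labeling from the environment over the collapsed graph.
    neighbours = {}
    for src, dst, _ in edges:
        a, b = rep[src], rep[dst]
        if a != b:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    label = {rep[b]: 1 for b in env_blocks}
    queue = deque(label)
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in label:
                label[nxt] = label[node] + 1
                queue.append(nxt)

    # Step 3: repair backward combinational edges using the smaller label.
    changed = True
    while changed:
        changed = False
        for src, dst, kind in edges:
            a, b = rep[src], rep[dst]
            if kind == "comb" and label[a] > label[b]:
                label[a] = label[b]
                changed = True
    return {b: label[rep[b]] for b in blocks}


layers = label_layers(
    ["A", "B", "C", "D"],
    [("A", "B", "comb"), ("B", "C", "latched"), ("C", "B", "comb"),
     ("C", "D", "comb"), ("D", "C", "comb")],
    env_blocks=["A"],
)
```

In this example, C and D form a combinational loop and are collapsed together; BFS first places them in layer 3, and the backward combinational edge from C to B then pulls them into layer 2. The repair loop terminates because labels only decrease and never drop below 1.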
Experiments and Observations for the Pipeline Control Scheme
Experiments were carried out to get some performance measurements for the asynchronous pipeline set up for the "onion" partitioning scheme. To perform these experiments, a limited-scope Petri-net simulator was written in C++. This simulator can give possible output traces which can be used to characterize the performance of the system. The Petri-net is described as a connection of places and transitions. Each transition can be assigned two firing latencies with a probability distribution over these two values. After all the places above a transition have tokens, the transition fires after a time equal to one of the latency values, which is picked on the basis of the probabilities specified. Transitions with fixed latencies can also be modeled. The simulator does not give all the possible traces, but is useful for performance evaluation experiments.
The Petri-net of the control scheme for the "onion" partition was simulated using this tool for different depths of the pipeline. For each simulation, the activity (A) transitions were given the latency values of '1' and '10' with the probability of '1' ranging from 0.1 to 1. The other transitions were modeled as having a fixed latency of '1'. For each simulation, we get the throughput by counting the number of firings of the "start" transition of the first stage in the pipeline while simulating the pipeline for a large number of time steps (1 million).
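To make the experimental setup concrete, here is a minimal Python sketch of a timed Petri-net simulator with two-valued transition latencies. It is not the C++ simulator used for the experiments, and the net built by `interlocked_pipeline` is a hypothetical stand-in in which each stage is interlocked with both neighbours; it is not the net of Figure 3. The sketch also assumes that a transition consumes its input tokens when it becomes enabled and deposits its output tokens when its latency expires, and that every transition has at least one input place.

```python
import heapq
import random


def simulate(places, transitions, horizon, seed=0):
    """Count completed firings per transition up to time `horizon`.

    places      -- dict: place name -> initial token count
    transitions -- dict: name -> (inputs, outputs, fast, slow, p_fast)
    """
    rng = random.Random(seed)
    tokens = dict(places)
    fired = {name: 0 for name in transitions}
    events = []  # heap of (completion_time, transition_name)
    now = 0

    def start_enabled():
        # Start every transition whose input places all hold a token.
        progress = True
        while progress:
            progress = False
            for name, (inputs, _, fast, slow, p_fast) in transitions.items():
                if all(tokens[p] > 0 for p in inputs):
                    for p in inputs:
                        tokens[p] -= 1
                    latency = fast if rng.random() < p_fast else slow
                    heapq.heappush(events, (now + latency, name))
                    progress = True

    start_enabled()
    while events and events[0][0] <= horizon:
        now, name = heapq.heappop(events)
        fired[name] += 1
        for p in transitions[name][1]:
            tokens[p] += 1
        start_enabled()
    return fired


def interlocked_pipeline(depth, p_fast):
    """Hypothetical net: stage i waits for both neighbours before refiring."""
    places, transitions = {}, {}
    for i in range(depth):
        inputs, outputs = [f"self{i}"], [f"self{i}"]
        places[f"self{i}"] = 1
        for j in (i - 1, i + 1):
            if 0 <= j < depth:
                places[f"p{j}->{i}"] = 1
                inputs.append(f"p{j}->{i}")
                outputs.append(f"p{i}->{j}")
        # Activity transition with latency 1 or 10, as in the experiments.
        transitions[f"A{i}"] = (inputs, outputs, 1, 10, p_fast)
    return places, transitions


def throughput(depth, p_fast, horizon=10_000):
    """Firings of the first stage's transition within `horizon` time steps."""
    places, transitions = interlocked_pipeline(depth, p_fast)
    return simulate(places, transitions, horizon)["A0"]
```

Calling `throughput(depth, p_fast)` for several depths and probabilities gives a rough analogue of the experiment; the numbers depend on the assumed net and should not be read as reproducing the reported results.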
The average number of firings for different probability values of transition A and for different pipeline depths is shown in Figure 8. In Figure 8, the x-axis shows the probability (multiplied by 10) of transition A having a latency of '1' and the y-axis shows the number of firings.
From Figure 8 we conclude that the performance of the pipeline is worst-case for most of the probability range. As we increase the pipeline depth, the performance becomes increasingly independent of the probability and stays at the worst case. This can be explained by noting that when we make the pipeline longer, there is a higher probability that one of the stages is slow. A slow stage slows down all the stages in the pipeline. The effect of the slow stage ripples through the pipeline with time.
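The explanation above can be made quantitative with a short calculation. As an assumption for this sketch, let each of the $n$ stages independently take the fast latency with probability $p$ in a given round of firings (the symbols $n$ and $p$ are introduced here for illustration):

```latex
P(\text{at least one slow stage}) = 1 - p^{n}
% Example with p = 0.9:
%   n = 5  : 1 - 0.9^{5}  \approx 0.41
%   n = 20 : 1 - 0.9^{20} \approx 0.88
```

Even when each stage is fast 90% of the time, a 20-stage pipeline is likely to contain a slow stage in most rounds, which is consistent with the throughput settling near the worst case as depth grows.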
The same experiments were done for a 2D partition and its corresponding Petri-net control structure. We observed that the performance degrades faster in a 2D connected structure compared to a 1D structure with similar number
Experiments were done to see how a disturbance in the pipeline propagates with time. For this, we used the constant latency model for all the transitions with a latency value of '1', but spiked one of the A transitions to a latency value of '10' for a single firing. The experiment was repeated for different positions of A along the pipeline. We observe that a disturbance in one stage of the pipeline causes disturbances in the neighboring stages, and these propagate further with time. This is because of the interlock between the stages in the pipeline: if any of a stage's neighbors is slow, it slows that stage down. One way to visualize this is shown in Figure 10. An important fact to note is that this disturbance cannot be absorbed by putting queues between the pipeline stages, as is possible with a normal straight pipeline. This is because a partition has dependencies with both the inner and the outer layer partitions.
The pipeline of Figure 11 will have a response time dependent on the number of stages in the pipeline. Compared to this, the "onion" partition response time is independent of the number of stages, as discussed above. The general performance (throughput) trend, as a function of the variation of stage delay, is the same for both types of pipelines. The one-way pipeline also shows worst-case performance for most of the probability range, as shown in Figure 12.
On the Scalability of the Pipeline
Since the response time of the system is determined only by the outer-most partition and the rate at which the pipeline can be stimulated is determined by the maximum latency of an individual partition, we can say that if the complexity of a single partition stays the same as the system size scales, the performance of the pipeline scheme is independent of the size of the system being emulated.
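This argument can be written compactly. The notation is introduced here for illustration only: let $L_k$ be the latency of partition $k$ in a pipeline of $N$ partitions, with $k = 1$ the outermost, and assume the latencies have been equalized so that neighbour effects are small:

```latex
T_{\text{response}} \approx L_{1},
\qquad
\text{stimulation rate} \approx \frac{1}{\max_{1 \le k \le N} L_{k}}
% If every L_k is bounded by a constant that does not depend on N,
% neither quantity depends on N.
```

Neither expression involves $N$ directly, so adding partitions of bounded complexity leaves both the response time and the stimulation rate unchanged under these assumptions.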
To understand why this is the case, we observe that at any given time, the different partitions in our scheme are not working on the same cycle of the original design. There is a "slip" between outer and inner partitions, with the inner partitions working on an earlier clock period than the outer ones. In a sense the results of a single step stimulus of the original system are available as soon as the outer-most partition has stabilized (and we need not wait for the effects of this stimulus to propagate through the entire system).
Unfavorable partitioning is unlikely because most large logic blocks will have latched outputs.
The scheme also preserves modularity: large emulation systems can be built from smaller modules. This, coupled with the independence of the response time from the pipeline depth, ensures a scalable system.
Conclusion
To conclude, we have defined a novel partitioning and pipelining strategy that offers rapid response evaluation properties in modeling the behaviour of complex systems. To make the system scalable, we have opted for a distributed control scheme where local partition controllers interact with each other depending on the way the partitioning and pipelining has been done. The partitioning style used in the scheme has some restrictions but is general enough to allow partitioning of most systems. The modular construction of the basic emulation block and its controller together with the response time property displayed by the partitioning scheme enables scalability.
This technique has the potential of putting together large emulation systems using identical modules. The proposed pipeline shows worst-case behaviour. Even when we can give bounds on the latency of each pipeline stage, the performance of the pipeline is dictated by the upper bound of this latency. With reference to the control scheme this means that we need to equalize and balance the workloads across the stages.
The problem of identifying good partitions for such a system needs additional thought. However, given a partition, there is a simple procedure to identify a valid pipelining with the above properties using a simple BFS technique.
