Abstract-Fast distributed cosimulation is a challenging problem in embedded system design. The main theme of this paper is to increase the simulation speed by reducing the frequency of inter-simulator communications, reducing the active duration of simulators, and utilizing the parallelism of component simulators. These enhancements are accomplished by the proposed virtual synchronization technique, which combines event-driven and data-driven simulation methods. Experimental results show that the proposed technique boosts the cosimulation speed significantly compared with previous conservative approaches.
Annotated compiled cosimulation is faster but sacrifices accuracy [4] [5]. In-circuit emulators boost the speed [6], but they require special hardware components.
The main theme of this paper is to increase the simulation speed by reducing the frequency of inter-simulator communications, reducing the active duration of simulators, and utilizing the parallelism of component simulators, without modifying the component simulators. First, we propose a new distributed event driven simulation technique for component simulators. Our approach uses a cosimulation backplane as the master process in the distributed event driven (DED) cosimulation.
Furthermore, we present a novel technique that significantly reduces the time synchronization and communication overheads by combining event-driven and data-driven simulation methods. Previous distributed event-driven simulation techniques do not assume any special execution semantics of the simulated tasks. Knowing the execution semantics allows us to minimize the communication overhead without violating the causality condition of the event driven simulation. In the next section, we explain the execution model of task graphs and the environment assumed in this paper. Section III explains the proposed distributed event driven (DED) cosimulation technique and compares it with the existing techniques. Section IV discusses the main theme of this paper: how to use data-driven scheduling information for the DED cosimulation. Section V discusses the expected performance improvement qualitatively. Experimental results and conclusions follow in Sections VI and VII, respectively.
II. COSIMULATION ENVIRONMENT
In our proposed environment, we specify control and function modules separately on the codesign/cosimulation backplane, using an FSM model for control modules and an SDF (Synchronous Dataflow) model for function modules. In a dataflow graph, a block represents a function that transforms input data streams into output streams. An arc represents a channel that carries a stream of data samples from the source block to the destination block. The number of samples produced (or consumed) per block firing is predefined. An SDF model has the restriction that a block becomes runnable only after all input ports have the predefined number of input samples. Fig. 1(a) shows an example system specification, which is hierarchical and compositional. After the mapping between function modules and architecture components is determined, we establish a distributed cosimulation environment as illustrated in Fig. 1(c). All component simulators establish their outside connections only to the cosimulation backplane. The backplane is basically an event-driven simulator and a centralized backbone for communication between component simulators [7].
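The SDF firing rule described above can be illustrated with a small sketch. The class name `SdfBlock`, the port names, and the sample values are hypothetical and chosen only for illustration; the sketch shows nothing more than the rule that a block is runnable once every input port holds its predefined number of samples.

```python
from collections import deque


class SdfBlock:
    """A dataflow block with a fixed consumption rate on each input port."""

    def __init__(self, name, consume_rates):
        self.name = name
        # port name -> number of samples consumed per firing (fixed in SDF)
        self.consume_rates = dict(consume_rates)
        self.queues = {port: deque() for port in self.consume_rates}

    def push(self, port, sample):
        """Store one incoming sample on the given input port."""
        self.queues[port].append(sample)

    def is_runnable(self):
        """SDF rule: every input port must hold its required sample count."""
        return all(len(self.queues[p]) >= n for p, n in self.consume_rates.items())

    def fire(self):
        """Consume exactly the predefined number of samples from each port."""
        if not self.is_runnable():
            raise RuntimeError(f"{self.name} is not runnable")
        return {
            p: [self.queues[p].popleft() for _ in range(n)]
            for p, n in self.consume_rates.items()
        }


# Hypothetical block that needs 1 sample on in0 and 2 samples on in1.
block = SdfBlock("C", {"in0": 1, "in1": 2})
block.push("in0", 10)
block.push("in1", 20)
ready_before = block.is_runnable()   # in1 still lacks one sample
block.push("in1", 21)
ready_after = block.is_runnable()
consumed = block.fire()
```

Because the rates are fixed, whether a block can fire depends only on how many samples have arrived, not on when they arrived, which is the property the later sections rely on.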
III. DISTRIBUTED EVENT DRIVEN COSIMULATION
In this section, we propose a new distributed event driven cosimulation technique that enhances parallelism between component simulators. The main difficulty in speeding up distributed event driven simulation is synchronizing the simulators so that no simulator violates the causality condition of event processing. Several techniques for accelerating distributed discrete event simulation have been proposed since the late 1970s, and they are largely classified into two approaches: optimistic [2] and conservative [3].
In the optimistic approach, a component simulator may advance its local time optimistically, assuming that no past event will arrive. If that assumption fails, it rolls back its local time to the event arrival time or to an earlier checkpoint time and cancels all processing results after the adjusted time, which incurs a large overhead [8].
To accommodate component simulators without rollback capability [9] , we take the conservative approach where a component simulator advances its local time and processes events only after it is guaranteed that no event will arrive earlier [10] [11] . There are two schemes to satisfy this causality condition: centralized approach and distributed approach.
In the centralized approach, a central controller manages the component simulators using information on how far the local clock of each simulator can advance. The centralized approach serializes all communication activities between simulators in the global queue even when there is no dependency between them. This serialization is the main cause of its low performance.
In the distributed approach, no central controller is needed. Instead, each simulator waits until it has received input events on all input ports and then processes the event with the smallest time-stamp. The main difficulty of the distributed approach is the possibility of deadlock. If there is a cycle among simulators, no simulator can receive events on all input ports. One solution is to use null messages that carry a lower bound on the time-stamp of the next event [12]. On the other hand, Chandy and Misra [3] allow deadlock to occur. When a deadlock occurs, a central controller initiates a recovery phase in which each simulator exchanges time information with the other simulators and updates the time up to which it can safely advance its local clock. Since the overhead of deadlock detection and recovery is significant, the simulation performance degrades in proportion to the deadlock frequency.
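The input rule of the distributed conservative approach can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the class `ConservativeProcess` and the convention that a payload of `None` stands for a null message are assumptions made for the example.

```python
from collections import deque


class ConservativeProcess:
    """A process that handles an event only when every input port has one."""

    def __init__(self, ports):
        self.inputs = {p: deque() for p in ports}
        self.clock = 0

    def receive(self, port, timestamp, payload=None):
        # payload=None models a null message: it carries only a lower bound
        # on the time-stamp of the next real event on this port.
        self.inputs[port].append((timestamp, payload))

    def next_safe_event(self):
        """Return (port, timestamp, payload), or None while the process is blocked."""
        if any(len(q) == 0 for q in self.inputs.values()):
            return None  # some port is empty, so an earlier event could still arrive
        # Pick the port whose head message has the smallest time-stamp.
        port = min(self.inputs, key=lambda p: self.inputs[p][0][0])
        timestamp, payload = self.inputs[port].popleft()
        self.clock = timestamp
        return port, timestamp, payload


proc = ConservativeProcess(["a", "b"])
proc.receive("a", 5, "x")
blocked = proc.next_safe_event()   # port b is empty, so nothing is safe yet
proc.receive("b", 8)               # null message: no event on b before time 8
first = proc.next_safe_event()     # the event at time 5 is now safe
```

In a cycle of such processes every port can stay empty indefinitely, which is the deadlock that null messages or a recovery phase must resolve.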
The proposed scheme lies between the centralized and the distributed approaches. As in the centralized approach, all data exchange between component simulators goes through the cosimulation backplane. Unlike the centralized approach, however, the backplane does not serialize the communication activities. Instead, as in the distributed approach, each component simulator waits until it has received input events on all input ports and then processes the event with the smallest time-stamp. Therefore, the proposed simulation is based on the idea of deadlock detection and recovery. The difference from Chandy and Misra's approach is that the cosimulation backplane plays the role of the central process, so no extra communication overhead is paid to find a runnable simulator when a deadlock occurs. Fig. 2 illustrates the structure of the proposed scheme. We introduce a simulator interface between the backplane and each component simulator. It contains wrapper code for deadlock detection and recovery, since the component simulator is not expected to have this capability. The simulator interface also has a local queue for checking the causality condition, that is, for confirming that no earlier event arrives after the current event is processed, and it manages the local clock associated with the simulator. An event produced by a predecessor is delivered to the local queue of the destination block. Because events are stored in the distributed local queues, the backplane executes the simulators in parallel as long as they meet the causality condition. When a deadlock occurs, the backplane finds the event with the smallest time-stamp among all local queues and delivers it even though it does not meet the local causality condition [3].
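The deadlock recovery step can be sketched in a few lines. The names `SimulatorInterface` and `backplane_step` are hypothetical, and the sketch ignores the actual simulators and inter-process communication; it only shows local queues, the local causality check, and the fallback that delivers the globally smallest event when no interface is runnable.

```python
from collections import deque


class SimulatorInterface:
    """Wrapper between the backplane and one component simulator (hypothetical)."""

    def __init__(self, name, ports):
        self.name = name
        self.local_queue = {p: deque() for p in ports}
        self.local_clock = 0

    def enqueue(self, port, timestamp, data):
        self.local_queue[port].append((timestamp, data))

    def earliest(self):
        """Earliest pending (timestamp, port), or None if nothing is queued."""
        heads = [(q[0][0], p) for p, q in self.local_queue.items() if q]
        return min(heads) if heads else None

    def meets_causality(self):
        """Local causality holds when every input port has a pending event."""
        return bool(self.local_queue) and all(len(q) > 0 for q in self.local_queue.values())

    def deliver(self):
        """Hand the earliest pending event to the simulator and advance the clock."""
        timestamp, port = self.earliest()
        _, data = self.local_queue[port].popleft()
        self.local_clock = max(self.local_clock, timestamp)
        return timestamp, port, data


def backplane_step(interfaces):
    """Deliver to every runnable interface; on deadlock, deliver the smallest event."""
    runnable = [s for s in interfaces if s.meets_causality()]
    if runnable:
        return [(s.name,) + s.deliver() for s in runnable]
    pending = [(s.earliest(), s) for s in interfaces if s.earliest() is not None]
    if not pending:
        return []  # nothing left to simulate
    # Deadlock: no interface is runnable although events are pending.
    _, chosen = min(pending, key=lambda item: item[0])
    return [(chosen.name,) + chosen.deliver()]


a = SimulatorInterface("A", ["x", "y"])
b = SimulatorInterface("B", ["u", "v"])
a.enqueue("x", 7, "a1")
b.enqueue("u", 3, "b1")
step1 = backplane_step([a, b])   # deadlock: B holds the smallest event
step2 = backplane_step([a, b])   # deadlock again: only A's event remains
step3 = backplane_step([a, b])   # no events pending
```

Since the backplane already owns all local queues in this arrangement, locating the smallest event requires no message exchange with the simulators.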
There is a significant difference between the cosimulation problem and the typical DED simulation problem. In a typical DED simulation, each process is assumed to finish processing the current input event before accepting the next event. In other words, the local process is non-preemptive. A component simulator, on the other hand, itself performs an event-driven simulation, so it advances its local clock while processing events. Without the support of a rollback mechanism, the local clock must not run ahead of the time-stamp of the next event. While the current event is processed, the local clock must be compared with the time-stamp of the next earliest event, and the current processing must be preempted if the next event arrives before completion. If the next event time is not known a priori, this time-synchronization requirement incurs a large overhead of exchanging time information between simulators during event processing.
In the proposed scheme, we can do much better if we consider the underlying computation model of the simulated tasks. Dataflow tasks can be regarded as non-preemptive processes if we use a suitable time adjustment scheme, as explained in the next section.
IV. EVENT-DRIVEN SIMULATION WITH DATA-DRIVEN SCHEDULING
As explained in Section II, the simulated task in a component simulator is a dataflow subgraph. Recall that a dataflow block becomes active after all input ports have the required number of data samples. By utilizing the data-driven execution model, we can significantly reduce both the data communication overhead and the time synchronization overhead. We assume that the simulated task within a component simulator is a connected synchronous dataflow subgraph. Although we restrict our discussion to the dataflow specification language, our approach is also applicable to more general cases in which the simulated tasks have predictable access patterns to the shared resources [5].
A. Virtual Synchronization
The time synchronization problem explained in the previous section has been the main performance bottleneck of distributed cosimulation. The difficulty lies in the fact that the local clock of a component simulator must be synchronized with the global clock of the backplane in order not to violate the causality condition. It should be noted that time synchronization is observed only when data samples are exchanged between the backplane and the component simulator. In the proposed scheme, we do not keep the local clock synchronized with the global clock at all times, but we still preserve correctness. Suppose that data samples from the backplane arrive following the scenario displayed in Fig. 3(d). Then, the actual execution profile of the HW component would be as displayed in Fig. 3(e). To simulate the actual execution profile, the simulator should read the second sample a(2) from the input port of block A when the local clock reaches 5 during the execution of block C. This means that the component simulator has to wait until the next earliest data is available to confirm that the local clock can be safely advanced without violating the causality condition.
In the proposed scheme, we do not wait for the next earliest data but finish the current execution of the local schedule as illustrated in Fig. 3(f). Let the finish time be $T_f$. We compute the time difference $T_\Delta$ between $T_f$ and the arrival time of a(2).
When the component simulator accepts data a(2), it translates the time-stamp by the timing offset $T_\Delta$. When the component simulator sends data to the backplane, it adjusts the time-stamp in the reverse direction by the same magnitude as the timing offset $T_\Delta$. The timing offset is updated at every execution of the local schedule. We call this scheme virtual synchronization because the local clock appears to be synchronized with the global clock although it actually is not. Cockx [5] proposed a similar approach for compiled cosimulation, which allows out-of-order execution of SW modules. The correct execution order is recovered by later time adjustments. The proposed technique is similar in that we also use time adjustment to obtain timing correctness, but there are three significant differences. First, our technique is for system-level distributed cosimulation, which is more complicated. Second, we can utilize parallelism between simulators. Third, we do not synchronize the local clocks of component simulators, whereas the compiled cosimulation approach does.
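The time-stamp translation can be sketched as follows. The sign convention is an assumption, since the text only states that the offset is the difference between the finish time and the arrival time: here the offset is taken as finish time minus arrival time, incoming time-stamps are shifted forward by it, and outgoing time-stamps are shifted back. The class name `VirtualClock` and the numeric values are hypothetical.

```python
class VirtualClock:
    """Translates time-stamps between the backplane and a component simulator."""

    def __init__(self):
        self.offset = 0  # timing offset T_delta, updated once per local schedule

    def update_offset(self, finish_time, arrival_time):
        # Assumed convention: T_delta = T_f - (arrival time of the next sample).
        self.offset = finish_time - arrival_time

    def to_local(self, global_timestamp):
        """Incoming sample: shift the backplane time-stamp into local time."""
        return global_timestamp + self.offset

    def to_global(self, local_timestamp):
        """Outgoing sample: shift back by the same magnitude."""
        return local_timestamp - self.offset


vc = VirtualClock()
# Hypothetical numbers: the local schedule finishes at local time 7,
# while the next sample arrived at the backplane at global time 5.
vc.update_offset(finish_time=7, arrival_time=5)
local_arrival = vc.to_local(5)    # the sample is accepted at the finish time
global_output = vc.to_global(10)  # an output produced at local time 10
```

Because both directions use the same offset, a time-stamp that is translated in and then out is unchanged, which is why the ordering of samples seen at the simulation boundary is preserved.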
The correctness of cosimulation is preserved when all data samples carry the correct values and the correct time stamps at the simulation boundary. Since the virtual synchronization scheme translates the time-stamp values back and forth by the same amount, the processing order of data samples remains the same as without virtual synchronization. Moreover, the data-driven execution model does not depend on the arrival times of data samples but only on their arrival order. In other words, we can arbitrarily translate the arrival times of data samples and still obtain the same result, as long as the arrival order is not violated. In Fig. 3(e), sample a(2) does not affect the value of sample d(1). Therefore, if the simulated task is a synchronous dataflow graph, virtual synchronization preserves the correctness of cosimulation.
Virtual synchronization removes the need for time synchronization during the execution of the current schedule. Synchronization between the backplane and the component simulator is accomplished at the time of sample exchange. As a result, the time-synchronization overhead is reduced to nearly zero, apart from the cost of adjusting the time stamps.
B. Reduction of Active Simulation Duration
Virtual synchronization not only removes the synchronization overhead but also reduces the active duration of component simulators. With virtual synchronization, a simulator need not advance its local clock between the end of processing the last data samples and the arrival of new input data. The absolute value of the local clock is no longer important. Instead, only the time difference between output production and input arrival matters for timing accuracy. Therefore, we may shorten the simulation time significantly.
Consider the simple video decoder example of Fig. 4, where a packet decoder (PD) block and a display (DIS) block are executed on a SW simulator, and an inverse DCT (IDCT) block and a motion compensation (MC) block are run on a HW simulator. The proposed technique changes the execution profile as shown in Fig. 5(c). First, the idle times are removed and the sample times are adjusted by the backplane. Second, multiple HW simulators can be invoked to process separate blocks. Since no time synchronization is necessary, the benefit from parallel simulation is larger than the increased IPC overhead.
C. Message Grouping
Using the backplane as a central simulator doubles the amount of data communication among simulators. Even though the total number of transferred data samples doubles, we can reduce the communication overhead by grouping the samples into a large packet. Because the communication overhead is affected more by the number of message exchanges than by the message size, the overhead can be greatly reduced by message grouping.
This grouping optimization is possible as long as it does not violate the local causality condition of the task. Message grouping in the proposed scheme comes from the data-driven property. Because a dataflow block becomes active only when all input ports have data, there is no benefit in sending a separate packet to each input port. Grouping also incurs no additional overhead.
Because the execution schedule of dataflow subgraph is determined at compile time and fixed at run-time, we can determine the receiving order of input data samples. From the schedule information, we can group the messages without deadlock possibilities.
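As an illustration of message grouping, the sketch below packs all samples needed for one firing into a single length-prefixed packet. The wire format (a 4-byte payload length, then a 2-byte count and 32-bit integer samples per port) and the function names are assumptions made for this example; the paper does not specify a packet layout. The fixed `port_order` stands in for the receiving order known from the compile-time schedule.

```python
import struct


def pack_group(samples_by_port, port_order):
    """Pack one firing's worth of integer samples into a single packet."""
    payload = b""
    for port in port_order:
        values = samples_by_port[port]
        payload += struct.pack("!H", len(values))            # sample count
        payload += struct.pack(f"!{len(values)}i", *values)  # 32-bit samples
    return struct.pack("!I", len(payload)) + payload         # length prefix


def unpack_group(packet, port_order):
    """Recover the per-port samples, relying on the statically known port order."""
    (length,) = struct.unpack_from("!I", packet, 0)
    offset = 4
    result = {}
    for port in port_order:
        (count,) = struct.unpack_from("!H", packet, offset)
        offset += 2
        result[port] = list(struct.unpack_from(f"!{count}i", packet, offset))
        offset += 4 * count
    if offset != 4 + length:
        raise ValueError("packet length does not match its contents")
    return result


order = ["in0", "in1"]                   # hypothetical port order from the schedule
group = {"in0": [1, 2], "in1": [3]}
packet = pack_group(group, order)        # one message instead of three
restored = unpack_group(packet, order)
```

Because both sides know the port order from the static schedule, the packet needs no port identifiers, and a single send replaces one send per sample.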
V. EXPECTED PERFORMANCE IMPROVEMENT
In this section, we discuss the expected performance improvement from the proposed scheme against a typical centralized cosimulation scheme. Due to the virtual synchronization technique, the proposed scheme reduces the time synchronization overhead and active duration of the simulators significantly, and enhances the parallelism between component simulators. To discuss each factor separately, we consider three execution modes of the backplane: centralized, serial and distributed mode.
In the serial mode of execution, the backplane waits for output data immediately after sending input data to a component simulator. So each simulator is executed sequentially and the backplane does not utilize parallelism if any. The serial mode is designed to show the performance gain with the virtual synchronization, and the comparison between the serial mode and the distributed mode.
Even though the serial mode of execution does not use parallelism between component simulators, it is still much faster than the previous centralized approaches because virtual synchronization removes the time synchronization overhead and reduces the active duration of the simulator. Let T(sync) be the overhead of a single time synchronization. If N is the number of time synchronization activities, the simulation time of each component simulator is increased by N × T(sync). Suppose that simulator A takes T(idle, A) to advance the local clock by one time unit during an inactive period. If the active duration of the simulator is D% of the total t time units, then the reduced simulation time becomes T(idle, A) × t × (1 − D/100). Therefore, the expected performance improvement depends on the number of time synchronization activities and the active duration of the simulator.
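Under the definitions given in this section, the two savings can be combined in a single expression. The following is only a sketch derived from those definitions, with the symbol $T_{saved}(A)$ introduced here for illustration; it is not necessarily the form of the paper's own equation (1).

```latex
% Time saved for simulator A by virtual synchronization (sketch):
%   N synchronizations avoided, plus the idle time units that are skipped.
T_{saved}(A) \;=\; N \times T(sync) \;+\; T(idle, A) \times t \times \left(1 - \frac{D}{100}\right)
```

The first term grows with the number of synchronization activities and the second with the idle fraction of the simulated time, which matches the two dependencies stated above.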
Next, we discuss the benefit of distributed execution of simulators. In the distributed mode of execution, after sending the input data to a component simulator, the backplane processes other events and possibly invokes multiple simulators simultaneously. Two schemes are distinguished by when the backplane receives output data from a component simulator: the interrupt scheme and the polling scheme. In the interrupt scheme, the simulator interrupts the backplane at the end of the current event processing. A separate dedicated communication thread can be used to minimize the interrupt handling overhead. We instead implemented the distributed mode using the polling scheme. Each component simulator waits after event processing until the backplane requests the results, which happens just before the backplane sends the next event to the simulator. If the component simulator is still processing the current event when the backplane requests the results, the backplane is blocked. Fig. 6 illustrates the two situations that can be observed from the backplane's viewpoint. Fig. 6(a) corresponds to the situation where the backplane finishes other events earlier than the simulation time of the component simulator, say T(sim, A). Then, the backplane is blocked until the simulator finishes processing the current event, and only then sends the next input event. We define the "backplane time" T(BP, A) as the time that the backplane spends processing events other than those of A. If the backplane time is longer than the simulation time of the simulator, as in Fig. 6(b), the execution time is determined by the backplane time. As a result, the simulation time for one execution of component simulator A becomes
T(A) = T(send,A) + MAX(T(sim,A), T(BP,A)) + T(receive,A) (2)
If we execute Fig. 2 in the distributed mode, the backplane does not wait until component simulator A finishes processing the current event. Instead, it fetches another event from the global queue to fire the event generator block S, which generates the next event for block A. After component simulator A finishes processing, it can immediately process the next event while block B receives events from block A. The Gantt chart in Fig. 7 shows the pipelined fashion of distributed execution of component simulators. If the three simulators in Fig. 2 have different execution times, the Gantt chart of Fig. 7 shows that the execution pattern converges to a steady state in which the backplane is blocked on the slowest simulator as the simulation progresses. The simulation time in the steady state of Fig. 7, T(steady, distributed), is given by equation (3). Although there are multiple simulators, the same equation holds for each simulator. Thus equation (3) says that the slowest simulator determines the simulation time, as one would expect.
T(steady, distributed) = T(send, k) + MAX(T(sim, k), T(BP, k)) + T(receive, k), k ∈ {A, B, C} (3)
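Equations (2) and (3) can be evaluated with a short sketch. The function names and all timing values are hypothetical, in arbitrary time units. The helper `steady_state_time` reflects one reading of equation (3), stated here as an assumption: the per-simulator expression is evaluated for each simulator and the largest value is taken as the steady-state time.

```python
def execution_time(t_send, t_sim, t_bp, t_receive):
    """Equation (2): the backplane waits for the longer of T(sim) and T(BP)."""
    return t_send + max(t_sim, t_bp) + t_receive


def steady_state_time(simulators):
    """Assumed reading of equation (3): the slowest simulator sets the pace."""
    per_simulator = {name: execution_time(**p) for name, p in simulators.items()}
    slowest = max(per_simulator, key=per_simulator.get)
    return slowest, per_simulator[slowest]


# Fig. 6(a)-like case: the backplane finishes early and blocks on the simulator.
blocked_case = execution_time(t_send=1, t_sim=10, t_bp=4, t_receive=1)
# Fig. 6(b)-like case: the backplane time dominates, so there is no waiting.
backplane_case = execution_time(t_send=1, t_sim=3, t_bp=9, t_receive=1)

# Hypothetical timings for the three simulators of Fig. 2.
timings = {
    "A": dict(t_send=1, t_sim=10, t_bp=4, t_receive=1),
    "B": dict(t_send=1, t_sim=6, t_bp=5, t_receive=1),
    "C": dict(t_send=1, t_sim=3, t_bp=2, t_receive=1),
}
slowest_name, steady_time = steady_state_time(timings)
```

In a real steady state the backplane time of a fast simulator includes the time spent blocked on the slowest one, so all simulators would observe the same period; the independent values above are used only to keep the example small.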
Because the simulation time is bounded by the time of the slowest simulator, we obtain the parallelism bound of equation (4).
VI. EXPERIMENTS
Performance improvements from the proposed technique result from several factors. First, virtual synchronization removes the time synchronization overhead and reduces the active duration of simulators. Second, it enhances the parallelism between component simulators. Last, message grouping reduces the data communication overhead. In this section, we present the experimental results on each factor separately. Finally, we compare the proposed approach with the optimized operation of the Seamless CVE co-verification environment. All experiments are performed in the PeaCE environment [13].
A. Reduced Time by the Virtual Synchronization
To isolate the performance gain from the virtual synchronization, we make a simple example which consists of a source, a HW simulator, and a display. We compare two centralized approaches and the proposed approach: "normal" indicates a conservative approach and "optimized" indicates an optimized technique proposed in [7] .
The first set of experiments assumes that the source block generates an event every 200 ns and the HW takes 140 ns to process it. The clock period of the HW simulator is set to 20 ns. Fig. 8 shows the experimental results for varying total simulated time. In the "normal" case, eight synchronization activities are needed every 200 ns, and each synchronization takes 73 µs on average. Since the "optimized" case already knows when the next event arrives, it needs no time synchronization activity in this experiment. Compared with the proposed approach, however, it has another source of performance penalty: it must advance the local clock through the 60 ns idle period to stay synchronized with the global clock, whereas virtual synchronization does not synchronize the two clocks but adjusts the time-stamp values after execution. We observed that it takes 17.6 µs on average to simulate one 20 ns clock period of the simulator. When we increase the source period to 10000 ns, the performance gain from virtual synchronization becomes drastic, as shown in Fig. 9. The proposed scheme is 45 times faster than the "optimized" one in Fig. 9, mainly because of the removal of the time synchronization overhead and the reduction of the active duration of the simulator.
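The per-event overheads implied by these measurements can be worked out directly. The sketch below uses only the figures quoted above (73 µs per synchronization, eight synchronizations per 200 ns period, 17.6 µs per simulated 20 ns clock period); the function names are hypothetical, and the assumption that every idle clock period costs the same 17.6 µs is a simplification. Times are kept in integer nanoseconds of wall-clock time to avoid rounding.

```python
CLOCK_PERIOD_NS = 20         # simulated clock period of the HW simulator
HW_EXEC_NS = 140             # simulated HW processing time per event
SYNC_COST_NS = 73_000        # measured wall-clock cost of one synchronization (73 us)
SYNCS_PER_EVENT = 8          # synchronizations per 200 ns source period ("normal")
IDLE_CYCLE_COST_NS = 17_600  # wall-clock cost of simulating one clock period (17.6 us)


def normal_sync_overhead_ns():
    """Wall-clock time spent on synchronization per event in the 'normal' case."""
    return SYNCS_PER_EVENT * SYNC_COST_NS


def optimized_idle_overhead_ns(source_period_ns):
    """Wall-clock time the 'optimized' case spends advancing through idle cycles."""
    idle_cycles = (source_period_ns - HW_EXEC_NS) // CLOCK_PERIOD_NS
    return idle_cycles * IDLE_CYCLE_COST_NS


sync_200 = normal_sync_overhead_ns()            # 8 synchronizations per event
idle_200 = optimized_idle_overhead_ns(200)      # 3 idle clock periods per event
idle_10000 = optimized_idle_overhead_ns(10000)  # 493 idle clock periods per event
```

With a 200 ns source period the idle cost is about 52.8 µs per event, while with a 10000 ns period it grows to roughly 8.7 ms per event. Virtual synchronization skips this idle advance, which is consistent with the much larger gain reported for the longer source period.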
B. Enhanced Distributed Execution
In this subsection, we examine the performance improvement due to the distributed execution of simulators. We use six different artificial experiment sets with diverse experimental conditions, varying the graph topology and the machine specification. Fig. 10 shows three different graph topologies. Three machines are used in the experiments. Machine 1 is a dual-CPU Pentium III 600 MHz and machine 2 is an UltraSPARC II 450 MHz. Both are connected to a 100 Mb/s Fast Ethernet switch, while machine 3 is a Pentium III 350 MHz connected through a 10 Mb/s Ethernet hub. The six experiment sets are listed in Table 1. The first three sets have a single task graph and the next three sets have two disconnected task graphs. Tasks mapped to the SW component are simulated sequentially in a SW simulator, while we use a separate HW simulator process for each task mapped to the HW component.
For each experiment set, we perform five different experiments by intentionally changing the execution time ratio between a SW task and a HW task to 1(SW):1(HW), 1:2, 2:1, 1:3 and 3:1.
In the distributed mode, each simulator time is measured in the backplane as the sum of the waiting time and the send/receive times. If the backplane time is greater than the simulator time, the waiting time becomes zero, as mentioned in Section V. As a result, the total cosimulation time is close to the execution time of the slowest simulator in the serial mode, as shown in Fig. 11. When more than one task is assigned to the same SW component, distributed execution may be hindered by the scheduling dependency between tasks. The second set of experiments reveals this effect most vividly. Even when the hardware simulation time is greater than the software simulation time, the backplane still experiences the SW simulation time. Fig. 12 summarizes the performance improvement of the distributed execution of simulators over the serial execution with virtual synchronization. The more simulators with similar execution times are involved in the cosimulation, the larger the performance gain.
C. Message Grouping
To demonstrate the performance gain from grouping messages, we experiment with three different graphs that have different numbers of input and output ports, as shown in Fig. 13. Fig. 14 shows the TCP/IP communication times with and without message grouping. The results show that the communication overhead is reduced drastically when message grouping is applied, and that the communication time is relatively insensitive to the number of samples if grouping is used. We also notice that grouping gives a better result even when the number of samples is one and no grouping is required. This is because we use a handshaking protocol when sending separate messages, in order to avoid the extra overhead caused by Nagle's algorithm [14] used in the TCP/IP protocol stack. Nagle's algorithm adds an extra long delay between the first two consecutive "send" operations.
D. Comparison with Seamless CVE
We compare the proposed approach with the Seamless CVE co-verification environment [15] using an H.263 decoder example, which is composed of a HW IDCT block and the remaining SW blocks. As Seamless CVE does, we use ARMulator as the processor simulator and ModelSim as the HW simulator, on the UltraSPARC II 450 MHz machine. In Seamless CVE, we apply the instruction fetch optimization and the data access optimization [15] to enhance the simulation speed while maintaining cycle accuracy.
The experimental result shows a significant performance enhancement. On Seamless CVE, the simulation takes 2031.71 s to decode one frame in QCIF format. The proposed approach finishes the simulation in 303 s, which is 6.7 times faster, without sacrificing timing accuracy. This performance gain comes from the reduction of the synchronization overhead and of the active duration of simulators by virtual synchronization. In this experiment, HW simulation takes 78% of the total cosimulation time, and 98% when the IPC overhead is included. Therefore, the performance bound expected in Section V is almost achieved in this experiment.
VII. CONCLUSION
As embedded systems become more complex and the design turn-around time becomes shorter, the performance of cosimulation becomes more important for verifying such systems. Moreover, the component simulators used in the cosimulation are likely to be heterogeneous or geographically distributed. Fast distributed cosimulation is therefore essential for embedded system design.
Our proposed backplane approach uses the "deadlock detection and recovery" approach to enhance the parallel execution of component simulators. The key contribution of the proposed approach is to combine data-driven and event-driven simulation techniques for fast distributed cosimulation. Data-driven properties of the specification model reduce the number of communication packet exchanges through message grouping and minimize the time-synchronization overhead through virtual synchronization. Virtual synchronization also reduces the active duration of component simulators significantly. The experiments give promising results on the performance improvement, especially when more component simulators are involved.
