Transformative applications are a class of dataflow computation characterized by iterative behavior. The problem of partitioning a transformative application specification onto a set of available hardware (HW) and software (SW) processing elements (PEs) and deriving a job execution order (scheduling) on them has been well studied, but obtaining fast simulation of these applications poses different constraints. In this paper, we propose an efficient framework for a symmetric multi-processor (SMP) simulation host to achieve fast HW/SW co-simulation for transformative applications, given the partition solutions and the derived schedulers. The framework overcomes the limitations of the existing Linux SMP kernel and requires only a reasonable number of modifications to it. We also present a heuristic algorithm which assigns simulation tasks to the processors on the simulation host, considering both the average job simulation time on each processor and other simulation overhead. Our experiments show that the algorithm finds satisfactory suboptimal solutions with very little computation time. Based on the task assignment solution, the simulation time can be reduced by 25% to 50% relative to the obvious but naive approach.
INTRODUCTION
Transformative applications are computation-intensive applications and are usually modeled by dataflow graphs in which each node represents a task, and a directed edge from a source to a sink task represents the communication link between them [1]. A task executes iteratively over different sets of input data. Each execution instance is called a job, which consumes input data at the beginning, processes them and then produces any output data at the end. Typical transformative applications include MPEG codecs and Viterbi codecs. To meet real-time constraints and minimize cost, such systems are often implemented on heterogeneous architectures which use SW processing elements (PEs) such as micro-controllers to execute control tasks, and HW PEs such as FPGAs to execute timing-critical tasks. Tasks that are partitioned to HW and SW PEs are called HW and SW tasks, respectively. A HW job starts execution as soon as its input data arrive. A SW job can be started only if its input data have arrived and it has been dispatched by the scheduler on the SW PE. Traditionally, schedulers in the SW PEs of transformative systems are static and non-preemptive. With the increasing complexity of modern transformative systems, dynamic and/or preemptive schedulers are becoming attractive because of their flexibility.
The problem of optimally partitioning a transformative application specification onto a set of available HW/SW PEs and deriving a job execution order (scheduling) on them has been extensively studied. See, for example, Karam [1] for an excellent survey. In this paper, we consider the problem of how to achieve fast HW/SW co-simulation for a transformative application on a symmetric multi-processor (SMP) simulation host, given the partition solution and the derived schedulers. We assume that the application dataflow graph is a directed acyclic graph (DAG) in which each directed edge represents a shared buffer between a source and a sink task. The source task feeds its output data to the sink task by a non-blocking write to the buffer. We say that a real-time deadline is missed if any data in any shared buffer is overwritten.
HW/SW co-simulation of a transformative application consists of two steps: (1) assign the simulation tasks to all available host processors, and (2) carry out simulation iteratively over a set of input data. The simulation tasks are subject to the same application constraints but because of the differences between the simulation platform and the real HW/SW platform, there is also more flexibility in allocating and scheduling the simulation tasks.
Similarity:
• A job is also simulated as reading its input data at the beginning and producing its output data at the end. The simulation synchronizes the tasks at job boundaries, just like the real HW/SW platform.
• The iterative execution of a transformative application is characterized by the execution initiation interval (EII), which is the interval between initiations of two consecutive iterations [3]. The key idea of most HW/SW partition algorithms for transformative applications is to minimize EII, with or without considering other cost constraints. For instance, pipelined implementations of transformative applications allow successive iterations to be executed concurrently. Fig. 1 shows a dataflow graph in which tasks T1 and T3 are partitioned to a HW PE, and T2 and T4 are partitioned to a SW PE. The pipelined implementation achieves an EII of 80 µs. The goal of fast co-simulation is to minimize the simulation initiation interval (SII). The execution time of each job in fig. 1 is replaced by the simulation time (number of processor cycles).
Difference:
• The simulation time of a job can be treated as constant no matter which host processor it is simulated on. Thus, when trying to minimize SII, the problem is not which PE a task should be partitioned to, but how to balance the workload among all host processors. Tasks partitioned to the same PE do not have to be simulated on the same host processor. For example, HW tasks can execute fast on the specialized HW PE, but may take a comparatively large amount of time to simulate. It is better to distribute those HW tasks to different host processors for simulation purposes.
• The order in which jobs are simulated can differ from the order in which they execute on the real platform. In fig. 1, tasks T1 and T4 are assigned to CPU 1, and T2 and T3 are assigned to CPU 2. On the real platform, job T3(i + 1) is not started before T4(i) completes (unless a real-time deadline is missed). However, they overlap in simulation. Note that simulation fidelity can still be maintained as long as we record the simulation time stamp when T3(i + 1) completes and save the data it has produced for T4(i + 1). Later, when T4(i + 1) is dispatched by its SW PE, we check that its start time is no later than the time at which T3(i + 1) ends. Otherwise, the real-time deadline is missed since the input data to T4(i) was overwritten by T3(i + 1). We also note that a HW job, e.g., T3(i + 1), can be simulated as soon as all its input data are available. On the other hand, we cannot simulate T2(i + 1) until T4(i) is simulated because T2 and T4 are partitioned to the same SW PE. Their jobs are scheduled by the SW scheduler and execute sequentially. The start time of T2(i + 1) is determined not only by when all its input data are available, but also by when T4(i) completes. Generally speaking, all SW jobs belonging to the same SW PE have to be simulated in the same order as they execute on that SW PE.
• Synchronization overhead is a well-known obstacle to fast simulation. It consists of inter-process communication (IPC), inter-processor interrupts (IPI), process context switches, etc. It is directly related to how the tasks are distributed among the host processors, so it has to be included when modeling the optimal simulation task assignment problem.
Given the reasons above, the optimal partition solution of a transformative application is often different from the optimal task assignment solution for simulation on an SMP host.
Related Work
Various approaches have been proposed to reduce the synchronization overhead in the multi-processor simulation environment. Yoo in [5] presented an idea of predicting synchronization points by SW analysis. Kim in [6] proposed a virtual synchronization method which combines the event-driven scheduler with a data-driven model. Bhattacharyya in [8] tried to reduce the inter-processor communication frequency by resorting to compile-time/run-time scheduling. A common feature of these approaches is that they "get away" from or work around the limitations in the host operating system (OS), instead of overcoming them, as we shall do. SystemC [9] implements its own thread primitive to realize fast communication and context switch. However, its simulation cannot be parallelized. For more related work, please refer to [15].
Existing SMP OS kernels are optimized for generic applications but not for HW/SW co-simulation. Consider the Linux SMP kernel as an example: threads belonging to the same process are not distributed to different processors, which means simulation is not parallelized if all tasks are grouped into one process. Different processes can be distributed, but they communicate by IPCs, which are known to be inefficient. Another problem associated with the Linux SMP kernel is that a thread's execution is not bound to a processor. A thread running on one processor can be re-allocated to another by the kernel, which not only hurts cache performance but also makes it impossible to realize an effective task assignment. Given all these reasons, it is evident that modifications to the existing SMP kernel are necessary to support fast HW/SW co-simulation.
Organization
In the rest of this paper, we shall propose an efficient HW/SW co-simulation framework for transformative applications. The framework requires only a reasonable amount of modifications to the existing Linux SMP kernel. We also present a heuristic algorithm to compute suboptimal task assignment solutions, considering both the average job simulation time of each task, and other simulation overhead including (1) processor scheduler execution, (2) task context switch cost, and (3) task blocking cost due to accessing the critical memory.
The paper is organized as follows. The proposed framework is presented in section 2. The task assignment algorithm is described in section 3. Section 4 provides some experimental results. Conclusion is given in section 5. 
FRAMEWORK OVERVIEW
From the host OS's point of view, the whole simulation is a process and each task is simulated in its own thread's context. Threads communicate with one another by shared memory. The main improvement we propose depends on the distribution of threads that are bound to host processors. Thus, simulation can be carried out in parallel while the synchronization overhead can be significantly reduced by eliminating IPC.
Binding threads to host processors requires modifications to the existing Linux SMP kernel. The proposed modifications are realistic. Similar ideas have been proposed and evaluated in other work [11] [12]. We are currently implementing them based on the Linux 2.6 kernel.
The co-simulation framework is shown in fig. 2. It consists of task threads, PEs, processor schedulers, communication links, and a simulation backplane. Before simulation starts, task threads are assigned to host processors according to the task assignment solution described in Section 3. Each time a processor completes simulating a job, it performs an inquiry operation on every PE that has task(s) assigned to this processor, to obtain all the jobs ready for simulation. Then it picks the job with the highest priority and activates the corresponding thread.
Tasks & Processor Schedulers
Each host processor has a priority task queue and a blocking task queue. A task Tk is a thread belonging to processor Pi if Tk is assigned to Pi. The task priority is determined by the index of its next simulation job: a thread whose next job to be simulated has a smaller index has higher priority. Since the simulation is carried out asynchronously, a source task may need to be blocked to wait for the corresponding sink task to catch up. In that case, the source task is put into the blocking queue of its processor. When the source task can be unblocked, it is re-inserted into the task queue. How a source task is blocked and unblocked is explained in Section 2.3.
Each host processor also has a PE set. A PE PEj is contained in the PE set of Pi if there exists a task Tk which is partitioned to PEj and assigned to Pi. As mentioned earlier, after a job is simulated or when it is blocked, the processor inquires all the PEs in its PE set to get all new ready job(s) for simulation.
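The two per-processor queues can be sketched as follows. This is only an illustrative model, not the kernel-level implementation: the class `HostProcessor` and its method names are hypothetical, and a binary heap is assumed as the priority queue.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class HostProcessor:
    """Hypothetical model of one host processor's scheduler state."""
    name: str
    ready: list = field(default_factory=list)    # heap of (next job index, task name)
    blocked: dict = field(default_factory=dict)  # task name -> next job index

    def enqueue(self, task: str, next_job_index: int) -> None:
        # A smaller next-job index means a higher priority.
        heapq.heappush(self.ready, (next_job_index, task))

    def block(self, task: str, next_job_index: int) -> None:
        # A source task waits here until its sink catches up.
        self.blocked[task] = next_job_index

    def unblock(self, task: str) -> None:
        # Re-insert the task into the priority task queue.
        self.enqueue(task, self.blocked.pop(task))

    def pick(self):
        # Return the highest-priority ready task, or None if nothing is ready.
        return heapq.heappop(self.ready) if self.ready else None
```

For example, after `enqueue("T1", 5)`, `enqueue("T4", 3)`, `block("T3", 2)` and `unblock("T3")`, successive calls to `pick()` return T3, T4 and T1 in that order, since T3 has the smallest pending job index.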
PE
A PE maintains the information of all the tasks partitioned to it, and provides the inquiry interfaces that are called by the host processors to get the new jobs ready for simulation. A job of a HW task can be simulated as soon as all its input data are available. On the other hand, a job of a SW task will not be dispatched by its SW PE until the previous job has been simulated.
Access to task information is protected by fine-grained locks. The most important item of information is the simulation time stamp. A task is required to synchronize the time stamp with its PE when either of the following two conditions becomes true: (1) a job simulation starts or ends; (2) a read/write operation starts or ends.
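The inquiry behavior described above might be sketched as below. All names (`TaskState`, `PE`, `inquire`, `job_done`) are hypothetical, a single lock per PE stands in for the fine-grained locks, and the SW scheduler is replaced by a simple "lowest job index first" rule, which is an assumption made only for this illustration.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class TaskState:
    name: str
    inputs_ready: bool = False  # would be set by the communication links
    next_job: int = 0           # index of the next job to simulate


@dataclass
class PE:
    is_hw: bool
    tasks: dict = field(default_factory=dict)  # task name -> TaskState
    busy: bool = False                         # SW PE only: a dispatched job is not finished yet
    lock: threading.Lock = field(default_factory=threading.Lock)

    def inquire(self) -> list:
        """Return (task name, job index) pairs that may be simulated now."""
        with self.lock:  # inquiries execute in a critical section
            ready = [t for t in self.tasks.values() if t.inputs_ready]
            if self.is_hw:
                chosen = ready              # HW jobs only need their input data
            elif self.busy or not ready:
                chosen = []                 # previous SW job not yet simulated
            else:
                # Stand-in for the SW scheduler: lowest job index first.
                chosen = [min(ready, key=lambda t: (t.next_job, t.name))]
                self.busy = True
            for t in chosen:
                t.inputs_ready = False
            return [(t.name, t.next_job) for t in chosen]

    def job_done(self, name: str) -> None:
        with self.lock:
            self.tasks[name].next_job += 1
            self.busy = False
```

With two SW tasks whose inputs are both ready, the first `inquire()` returns one job, a second `inquire()` returns nothing, and the other job is released only after `job_done()` is called for the first.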
Communication Links
Tasks transfer data to each other through the shared buffers in the communication links. When a source task writes output data to a buffer that has enough space, the link saves the data and records the time stamp of the write operation. If the buffer does not have enough space, the source task has to block and wait for the corresponding sink to catch up. After the sink task has retrieved some previous input data and vacated enough space for the source to complete the current write operation, the link notifies the host processor to which the source is assigned that the source task can be unblocked. When all input data have arrived for a sink job, the link notifies its PE, which will dispatch it to its processor at the appropriate time.
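A minimal sketch of such a link is shown below, assuming one source and one sink per link and a capacity counted in whole data items. The class `Link` and its return conventions are hypothetical; in the framework the unblock notification would go to the host processor of the source task, and `read` assumes the buffer is not empty.

```python
from collections import deque


class Link:
    """Hypothetical bounded buffer between one source and one sink task."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = deque()         # entries are (data, write time stamp)
        self.source_blocked = False

    def write(self, data, timestamp: int) -> bool:
        # Returns False when there is no space and the source has to block.
        if len(self.slots) >= self.capacity:
            self.source_blocked = True
            return False
        self.slots.append((data, timestamp))
        return True

    def read(self):
        # Returns (data, write time stamp, unblock_source).
        data, timestamp = self.slots.popleft()
        unblock = self.source_blocked  # space was vacated for a waiting source
        self.source_blocked = False
        return data, timestamp, unblock
```

With `Link(1)`, a second write before any read returns `False`; the next `read()` then reports that the source can be unblocked, and the retried write succeeds.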
Assuming the buffer size is x, the buffer-overwritten events are detected in either of the following two scenarios:
• The k-th job of the source task, Tsrc(k), makes a write at simulation time t0, and the most recent read made by the sink is by its j-th job, Tsink(j), at t1. If t0 < t1 and k − j > x − 1, the buffer is overwritten.
• The j-th job of the sink task, Tsink(j), completes reading at time t0, and there exists a source job Tsrc(k) (k − j > x − 1) that started writing at t1. If t1 < t0, the buffer is overwritten.
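The two detection rules above translate directly into predicates. The function and parameter names below are hypothetical; `x` is the buffer size, `k` and `j` are the source and sink job indices, and the time arguments are simulation time stamps.

```python
def overwritten_on_write(t_write: int, k: int, t_last_read: int, j: int, x: int) -> bool:
    """Scenario 1: source job k writes at t_write; the sink's most recent
    read was made by job j at t_last_read."""
    return t_write < t_last_read and k - j > x - 1


def overwritten_on_read(t_read_done: int, j: int, t_write_start: int, k: int, x: int) -> bool:
    """Scenario 2: sink job j completes reading at t_read_done; source job k
    started writing at t_write_start."""
    return k - j > x - 1 and t_write_start < t_read_done
```

For a buffer of size x = 1, a write by source job 3 at time 50 is flagged when the sink's latest read was by job 2 at time 80, because the write precedes that read in simulated time and the source is a full buffer ahead.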
Simulation Backplane
The simulation backplane interrupts all the host processors to terminate the simulation when any communication link notifies it that a buffer-overwritten event has been detected.
TASK ASSIGNMENT
In this section, we first derive the objective function for the problem of optimal task assignment among all processors in an SMP simulation host, considering both the average job simulation time of each task and the simulation overhead including (1) processor scheduler execution, (2) task context switch cost, and (3) cost of task blockings due to accessing the critical memory. Then we present a heuristic algorithm which finds the optimal solutions when the number of tasks is small and suboptimal solutions when both the number of tasks and processors are large.
Derivation of Task Assignment Objective Function
Notation 4. PT is an M × N matrix specifying the task assignment solution. PTij = 1 if Tj is assigned to Pi; otherwise PTij = 0.
Notation 5. EP is a K × M matrix specifying the task assignment relation between PEs and host processors. EPij = n if PEi has n tasks assigned to Pj.
Notation 6. cs denotes the task context switch cost. qc denotes the average cost of an inquiry operation made by a host processor to a PE.
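As a small illustration of how these matrices relate, EP follows from PT and the K × N partition matrix ET (ETkn = 1 if Tn is partitioned to PEk, as used later for the task effect), since EPkj is the sum over n of ETkn · PTjn. The helper name `ep_matrix` is hypothetical, and the example values use the partition and assignment of fig. 1 as described in the introduction.

```python
def ep_matrix(ET, PT):
    """EP[k][j] = number of tasks partitioned to PE k and assigned to processor j."""
    K, M, N = len(ET), len(PT), len(PT[0])
    return [[sum(ET[k][n] * PT[j][n] for n in range(N)) for j in range(M)]
            for k in range(K)]


# Fig. 1: T1, T3 on the HW PE and T2, T4 on the SW PE;
# T1, T4 simulated on CPU 1 and T2, T3 on CPU 2.
ET = [[1, 0, 1, 0],
      [0, 1, 0, 1]]
PT = [[1, 0, 0, 1],
      [0, 1, 1, 0]]
EP = ep_matrix(ET, PT)  # every PE has one task on each CPU
```

Here both CPUs have to inquire both PEs, which is exactly the situation in which the lock contention modeled by qc matters.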
In each SII the time spent on a particular processor Pi consists of the following 3 parts:
1. Simulating one job for each task assigned to Pi.
2. Processor scheduler execution cost, which is a function of the number of tasks assigned to Pi and the number of PEs that Pi needs to inquire. The time Pi takes to inquire a particular PE is directly related to the number of other processors that will also inquire that PE, since inquiry operations are executed in critical sections and the blocking time to get the lock can be deemed proportional to the number of those processors.
3. Thread context switch overhead, which is proportional to the number of tasks assigned to Pi.
Thus, the average time spent on Pi for an SII can be expressed by equ. (1), where ti1, ti2 and ti3 correspond to parts 1, 2 and 3 above, respectively.
Equ. (3) needs some explanation. The sum of EPks over s = 1, ..., M with s ≠ i is the total number of inquiries to PEk made by processors other than Pi. Assuming that the probability of obtaining the lock is equal for all inquiries, the average blocking delay for each inquiry by Pi to obtain the lock can be approximated as qc times a factor proportional to this total.
The objective function for the optimal task assignment is expressed as equ. (5), which balances SII among all processors.
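The displayed equations are not reproduced in this text, so the following is only a hedged sketch of their likely structure, reconstructed from the prose. It assumes that Cj denotes the average job simulation time of Tj, that the context switch overhead is exactly cs per assigned task, and that "balancing SII" means minimizing the largest per-processor time; the scheduler term ti2 is left unexpanded because its exact form cannot be recovered here.

```latex
\begin{align*}
t_i    &= t_{i1} + t_{i2} + t_{i3} \\
t_{i1} &= \sum_{j=1}^{N} PT_{ij}\, C_j
          && \text{(one job per assigned task; assumed form)} \\
t_{i3} &= cs \sum_{j=1}^{N} PT_{ij}
          && \text{(context switches; assumed form)} \\
       &\min_{PT}\ \max_{i = 1,\ldots,M}\ t_i
          && \text{(assumed min-max objective)}
\end{align*}
```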
Algorithm Finding Suboptimal Task Assignment Solution
Solving equ. (5) is a binary integer programming problem. Norman in [14] showed that mapping parallel algorithms onto parallel architectures is an NP-hard problem in all but restricted cases. To obtain the optimal solution, an exhaustive search takes O(M^N) iterations. When the number of tasks in the application is not large, such an exhaustive search can be used. For the case where the numbers of tasks and host processors are large, we present an effective algorithm which finds suboptimal solutions close to the optimal while significantly reducing the search time. Broadly speaking, most heuristic task assignment algorithms can be classified into 3 categories: local search algorithms, genetic approaches and greedy algorithms. Our algorithm is similar to a greedy algorithm, which starts from an empty solution set, repeatedly chooses a task based on a certain selection strategy and assigns it to a processor to obtain a partial solution. The algorithm continues until all the tasks have been assigned. The key difference between our algorithm and the greedy one is that our algorithm keeps a bounded number of partial solutions after assigning a task. By adjusting the bound, it can trade off the quality of the obtained solution against the search time. In the rest of this section, we first introduce the notations and definitions used, and then present our algorithm.
Notation 7. A partial solution is called an (n, m, i) partial solution if n tasks have been assigned to m processors among which i processors are not assigned any task. We call those i processors idle processors. The other m − i processors are called busy processors. When considering the assignment of the n-th task Tn to an M-processor host, it is easy to see that when n < M we only need to consider n processors, since the other M − n processors will not be assigned any task. We call them backup processors. Now let us look at how to obtain SAn,m(i), given all the promising partial solutions SBn−1,k(0), ..., SBn−1,k(k − 1) kept in the last step, where k = M if M < n − 1 and k = n − 1 otherwise. We need to consider the following two cases:
1. When M > n, there is still at least one backup processor available. By bringing in a backup processor as Pn, the problem becomes assigning Tn to any of the n processors. By induction, we can show equ. (6) and (7). Equ. (6) is easy to see because, to assign n tasks to n processors each of which has at least one task, there is only one possibility: each processor gets exactly one task.
To see how equ. (7) is obtained, first consider the first right-hand-side term Bn−1,n−1(i), which is the number of (n − 1, n − 1, i) partial solutions kept. Each of them becomes an (n, n, i) partial solution if Tn is assigned to Pn. Now consider the term Bn−1,n−1(i − 1), which is the number of (n − 1, n − 1, i − 1) partial solutions kept. Each of them can become an (n, n, i) partial solution if Tn is assigned to any of the n − i busy processors. Fig. 3 illustrates the idea when n = 3.
[Fig. 3: partial solutions for n = 3, shown as task groupings per processor, e.g. (T1,T3 | T2), (T1 | T2,T3), (T1,T2 | T3).]
2. When M ≤ n, there is no backup processor available, and the problem is to assign Tn to any of the existing M processors. Similarly, we can show equ. (8).
The explanation above implies that the search space grows exponentially with M and N. To bound the search space, Bn,m(i) is set by equ. (9). That is, at most D of the most promising (n, m, i) partial solutions are kept in each step. By adjusting D, our algorithm has the flexibility to trade off the goodness of the solution against the search time.
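Since the displayed forms of equ. (6), (7) and (9) are not reproduced in this text, the block below is a hedged reconstruction from the prose only. The symbol A_{n,m}(i) is hypothetical: it is assumed here to denote the number of candidate (n, m, i) partial solutions generated before pruning, while B_{n,m}(i) is the number kept, as in the text.

```latex
\begin{align*}
A_{n,n}(0) &= 1
  && \text{(each of the $n$ processors gets exactly one task)} \\
A_{n,n}(i) &= B_{n-1,n-1}(i) + (n-i)\, B_{n-1,n-1}(i-1)
  && \text{($T_n$ on $P_n$, or on one of the $n-i$ busy processors)} \\
B_{n,m}(i) &= \min\bigl(D,\ A_{n,m}(i)\bigr)
  && \text{(at most $D$ partial solutions kept)}
\end{align*}
```

Under these assumptions the middle line is consistent with the first: for i = 0 the second term vanishes and the count stays at 1.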
Tasks are selected in descending order of their effect on the objective function. The effect of a task Tn is defined in equ. (10), where ET represents the partition mapping from the tasks to PEs. ETkn = 1 if Tn is partitioned to PEk; otherwise ETkn = 0. The first part of the effect is the average job simulation time of Tn. The larger Cn is, the more its assignment will affect the value of equ. (5). The second part considers the total number of tasks partitioned to PEk. The more tasks PEk has, the more time the processor simulating Tn will take to get the lock before inquiring PEk.
The algorithm is summarized as follows. The number of search operations in step 3 is in the order of O(M^2 D), as implied by equ. (7) and (8). Thus, the total number of search operations to find the suboptimal task assignment solution is in the order of O(N M^2 D).
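The bounded search can be illustrated with the simplified sketch below. It is not the algorithm of this paper in full: it assumes a reduced cost model in which each task adds its average job simulation time plus cs to its processor (inquiry and lock costs are ignored), orders tasks by simulation time alone rather than by equ. (10), and ranks partial solutions by their largest processor load. The function name `bounded_greedy` is hypothetical.

```python
def bounded_greedy(costs, M, D, cs=0):
    """Assign tasks to M identical processors, keeping at most D partial
    solutions after each task. Returns (largest load, {task: processor})."""
    order = sorted(range(len(costs)), key=lambda j: -costs[j])
    # A partial solution is (loads per processor, assignment so far).
    partial = [((0,) * M, {})]
    for j in order:
        candidates = {}
        for loads, assign in partial:
            tried = set()
            for p in range(M):
                if loads[p] in tried:
                    continue  # equal-load processors are interchangeable in this model
                tried.add(loads[p])
                new_loads = loads[:p] + (loads[p] + costs[j] + cs,) + loads[p + 1:]
                key = tuple(sorted(new_loads))  # drop symmetric duplicates
                if key not in candidates:
                    candidates[key] = (new_loads, {**assign, j: p})
        # Keep the D most promising candidates (smallest largest load).
        partial = sorted(candidates.values(), key=lambda s: max(s[0]))[:D]
    loads, assign = partial[0]
    return max(loads), assign


greedy_sii, _ = bounded_greedy([8, 7, 6, 5, 4], M=2, D=1)        # plain greedy
bounded_sii, assignment = bounded_greedy([8, 7, 6, 5, 4], M=2, D=10)
```

With D = 1 the sketch degenerates to a plain greedy assignment and ends with loads of 17 and 13, while D = 10 keeps enough partial solutions to reach the balanced 15/15 split, which shows the trade-off controlled by the bound.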
SIMULATION RESULTS
In this section, we present some experimental results. We first generated a set of random DAGs to test the effectiveness of the proposed algorithm in searching for suboptimal task assignment solutions. The number of tasks is in the range of 20 to 128. The constant D is varied from 2 to 1000. The average job simulation time is drawn from a uniform distribution over [5, 300]. qc and cs are both set to 10. The results demonstrate that the suboptimal solutions found by our algorithm are close to the optimal solutions in most cases when D is set to 500, while the search time is negligible.
Although we have not yet implemented the proposed framework itself, we have implemented a simulator of it based on SystemC. To simulate the SMP feature on a single-CPU host, a virtual processor (VP) layer which temporally multiplexes the physical CPU is inserted under the SystemC simulation engine. The time stamp management of each VP is briefly explained as follows. When a job is to be simulated on a VP, the CPU timer counter is recorded. It is read again when the VP needs to synchronize with other VP(s). The difference is accumulated to the VP time. VPs synchronize with each other conservatively in a lock-step manner so that all the inquiry operations to any PE are simulated in the exact order as if the simulation were carried out on a real SMP host. When the simulation ends, the largest VP time is taken as the simulation time. Such time stamp management guarantees the fidelity of the SMP simulation on a single processor. We carried out simulations on the VP simulator for 3 typical transformative applications: IP video telephone, streaming video and video surveillance system. The number of VPs is set to 4. The DAG for each application is obtained from [4]. For each DAG, the simulation iterates 1000 times.
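The VP time stamp bookkeeping described above can be sketched as follows. The class `VirtualProcessor` is hypothetical and the clock is injected as a callable, so that any CPU timer counter can be substituted; the lock-step synchronization between VPs is not modeled.

```python
class VirtualProcessor:
    """Hypothetical VP that accumulates host CPU time between sync points."""

    def __init__(self, clock):
        self.clock = clock   # callable returning the current CPU timer counter
        self.vp_time = 0
        self._start = None

    def start_job(self) -> None:
        # Record the timer counter when a job starts on this VP.
        self._start = self.clock()

    def synchronize(self) -> None:
        # Read the counter again and add the difference to the VP time.
        now = self.clock()
        self.vp_time += now - self._start
        self._start = now


# Fake counter readings stand in for the physical CPU timer.
ticks = iter([0, 40, 40, 100]).__next__
vp1, vp2 = VirtualProcessor(ticks), VirtualProcessor(ticks)
vp1.start_job()
vp1.synchronize()   # vp1 ran from 0 to 40
vp2.start_job()
vp2.synchronize()   # vp2 ran from 40 to 100
simulation_time = max(vp1.vp_time, vp2.vp_time)
```

Although the two VPs share one physical CPU and run one after the other, each VP is only charged for its own interval, and the reported simulation time is the largest VP time rather than the sum.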
We compared the simulation time obtained by optimal task assignment and a naive task assignment. The naive solution assigns tasks partitioned to the same PE to the same VP, which is often the case in practice although the workload is unbalanced on the VPs. The results in table 1 show that simulation time can be reduced by 25% to 50%. 
CONCLUSION AND FUTURE WORK
We proposed an efficient HW/SW co-simulation framework for transformative applications on an SMP host. Task threads are assigned to host processors by a fast algorithm, considering both the average job simulation time and other simulation overhead. Threads are bound to host processors and communicate with each other by shared memory. Simulation can be carried out on parallel processors while the simulation overhead is significantly reduced. The end result is a reduced simulation initiation interval (SII), which measures the efficiency of simulating transformative applications.
We also presented a heuristic algorithm which effectively assigns simulation tasks to host processors. Our experimental results show that the suboptimal solutions found by our algorithm are close to the optimal solutions while the search time is negligible. With the simulation tasks distributed according to the task assignment solution, our experiments also show that co-simulation time can be reduced by 25% to 50% over the usual naive approach.
We will finish implementing the proposed framework by modifying the Linux SMP kernel. The framework will be extended to support simulation of other applications.
