The aim is to explain the current issues of HW/SW cosimulation and to introduce a new challenge of HW/SW cosimulation for multiprocessor SoC (MPSoC). Most of the current issues are related to raising the abstraction levels of HW/SW cosimulation. Mixed-level cosimulation is explained in a unified manner using the concept of a 'HW/SW interface'. First, abstraction levels in HW/SW cosimulation are explained in terms of the abstraction levels of function, SW interface and HW interface. Transaction level models are introduced for the HW interface. OS and device driver levels are explained for the SW interface. Then, the concept, applications and techniques of mixed-level cosimulation are presented. To better understand mixed-level cosimulation through the SoC design flow, a view called the refinement space is presented. Using the refinement space, cases of mixed-level cosimulation are explained in a SoC design scenario. Then, the issue of cosimulation performance in raising abstraction levels, i.e. Amdahl's law in HW/SW cosimulation, is addressed. A new challenge of cosimulation for MPSoC is also introduced.
Introduction
HW/SW cosimulation validates both software (SW) and hardware (HW) functionality in a single simulation. The validation covers performance as well as functionality. As system complexity grows, validation becomes more and more time-consuming, thereby becoming a bottleneck in shortening time-to-market.
One effective way to speed up HW/SW cosimulation is to raise the abstraction levels of simulation. However, while raising the abstraction levels of simulation, designers face a new problem in handling high abstraction levels: mixed abstraction level (in short, mixed-level) cosimulation. The mixed-level cosimulation problem is how to manage many different abstraction levels of SW and HW models (e.g. an OS level SW model and a transaction level HW model) in HW/SW cosimulation.
In this paper, we explain mixed-level cosimulation in a unified manner using the concept of the 'HW/SW interface'. Then, we introduce a new challenge, i.e. cosimulation for MPSoC. For a survey of traditional issues and techniques of HW/SW cosimulation, readers are referred to [1, 2].
Abstraction levels in HW/SW cosimulation
In terms of design flow, HW/SW cosimulation is performed after HW/SW partitioning is finished. HW/SW cosimulation captures application behaviour (SW application, HW design) and architectural component behaviour (bus, network-on-chip, memory, DMA, interrupt controller, etc.). After HW/SW partitioning, HW/SW refinement is needed. The refinement is performed in two ways: function and interface refinement, as shown in Fig. 1. Figure 1a illustrates function refinement. SW functionality in a sequential program can be refined to multiple tasks. HW functionality may be refined from an architecture-independent description (e.g. C code) to an RTL (register transfer level) design which includes functional units (multipliers, adder/subtractors, etc.), memory elements and control units, as shown in the Figure. Figure 1b shows interface refinement. An operating system (OS), device drivers (d/d) and other SW code are required for the refined SW functionality (e.g. multiple tasks) to run on the target processor and to communicate with external HW modules. The HW design also needs interface logic to communicate with other HW or SW modules via an on-chip interconnect (e.g. an on-chip bus) with a communication protocol (e.g. AXI [3], OCP [4]).
Both function and interface refinement can take several steps. Each step may need an abstraction level. For instance, function refinement can go through algorithm, task/process, instruction accurate/RTL and cycle accurate levels. Interface refinement can be performed in two ways depending on the SW or HW side of the interface. We call each side of the interface the 'SW interface' or the 'HW interface'. In the following subsections, we explain the abstraction levels of the two types of interface.
Abstraction levels of HW interface
The abstraction levels of the HW interface are known as transaction level models (TLMs). There are a few slightly different TLM definitions, e.g. message, transaction, transfer and cycle accurate levels [5], and algorithm, programmer's view (PV), PV with timing (PVT) and cycle callable (CC) levels [6]. Figure 2 exemplifies the abstraction levels of the HW interface according to the definition in [5]. We use the transaction level models in [5] only as an example. The arguments described in this paper are not limited to the examples used, but also apply to other TLMs, e.g. [6].
Figure 2b, denoted by TL1, gives a higher abstraction level than the C/A level by abstracting away the bit-by-bit behaviour of address and data signals (a code section denoted by an oval in Fig. 2a). In [5], this level is called the 'transfer level'. By avoiding the bit-by-bit manipulation of address and data signals, this level gives simulation speedup without losing simulation accuracy, since it abstracts only the representations (i.e. data types in simulation models) of address and data signals.
The simulation model of Fig. 2c, denoted by TL2, abstracts away the clock signal (a code section denoted by an oval in Fig. 2b). The model becomes an event-driven one. It may give faster simulation than that of Fig. 2b since event-driven simulation is known to be superior to cycle-based simulation in terms of simulation performance, especially when event activity is low. Simulation accuracy can still be cycle-accurate at this level. (The performance difference between cycle-based and event-driven simulation is also determined by the efficiency of the simulation kernel. In other words, an event-driven simulation kernel with high scheduling overhead may yield worse performance than a highly optimised cycle-based simulation kernel.)
Figure 2d gives a very different abstraction from the other two in Figs. 2b and 2c. Compared with Fig. 2c, it gives an abstraction of control signals (nMREQ and nRW in this case). Since the behaviour of control signals specifies a protocol of the on-chip interconnect, e.g. AXI [3] and OCP [4], the abstraction makes the HW interface independent of the on-chip communication protocol. Thus, the simulation accuracy of this model is not guaranteed to be cycle-accurate. This level is called the 'message level' (TL3). Since a HW module with its interface at message level is not limited to a specific on-chip communication protocol, it can be easily reused over different SoC designs with different on-chip communication protocols.
Figure 3a shows an example of a SoC architecture which consists of two processors, an on-chip bus, shared memory and a dedicated FIFO. SW tasks 1 and 2 (and the associated OS and device drivers) run on one processor and task 3 on the other processor. The SW code (task, OS and device driver code) is assumed to be compiled using a target compiler (e.g. armcc in the case of the ARM processor) and simulated on an instruction set simulator (ISS).
In terms of the simulation model in HW/SW cosimulation, we call the compiled SW code the 'instruction set architecture (ISA) level model'. The ISA is a SW interface to the SW code. From the viewpoint of the HW simulation model, the processor may be represented by a bus functional model (BFM) [7].
Abstraction levels of SW interface
The ISA level model can be obtained only when all the SW code is ready to be compiled and linked. However, SW code is getting more and more complex, and it increasingly needs to be validated even before a HW prototype is available. When the HW design (e.g. the design of a HW device such as the dedicated FIFO) is not yet finished but the current implementation of the SW code has to be validated, the ISA level model may not be usable since the corresponding SW code, i.e. the device driver code for the dedicated HW FIFO, may not be ready yet. Figure 3b shows an abstraction level of the SW model (which we call the 'device driver level model') which may be useful in such a case. At device driver level, SW code consists of task and OS code. (OS code is written on top of an API called the hardware abstraction layer (HAL) or board support package (BSP) API.) Both can call device driver functions (e.g. rd_dev() and wr_dev()) to access HW devices. Since the HW devices are not ready, an abstract memory model replaces them. Thus, device driver functions (to be more specific, functions which emulate the device driver functions) access the abstract memory model when they are called by task and OS code. For further details of the device driver level model, refer to [8-13].
Compared with the device driver level model of Fig. 3b, the OS level model (Fig. 3c) assumes that the OS is not yet designed or selected and only the SW tasks are designed. Such a case happens especially when designers want to design application-specific OSs or to select one suitable OS from several OS candidates [14-17].
At OS level, the execution of multiple tasks on the same processor is serialised by an OS simulation model (omitted in Fig. 3c for clarity). In general, accesses to shared resources (including shared objects such as semaphores) are serialised in OS level simulation.
Multiple tasks may be simulated in a simulation environment (i.e. functional simulation), such as SystemC [18], without the task serialisation that an OS simulation model would provide. However, in this case, simulation accuracy is inferior to that which OS simulation gives. Figure 4 illustrates a typical case where OS level simulation helps to reveal design problems that cannot otherwise be detected by a functional simulation of the system (e.g. simulating multiple tasks in a simulation environment such as SystemC without an OS simulation model).
Figure 4a shows an example of a system specification composed of three tasks T1, T2 and T3. T1 and T2 access a shared resource protected by a semaphore. T3 is activated by an external asynchronous event. Upon reception of that event, a signal is emitted to T2. Then T2 executes some computation (using the shared resource) and sends a signal back to T3 in order to notify the end of the computation.
Figure 4b shows an execution trace of the example obtained in functional simulation. At time t1, task T1 starts to run and acquires the semaphore. At time t2, an external event arrives and task T3 starts to run. Since accesses to shared resources are not serialised in functional simulation, both tasks can run concurrently. At t3, task T2 receives an event from T3 and starts to run. At t4, T2 tries to acquire the semaphore. Since the semaphore is already locked by T1, T2 keeps trying to acquire the semaphore until T1 releases it at t5. T2 acquires the semaphore at t5 and continues its execution.
Figure 4c shows the execution trace obtained in the simulation with the OS simulation model. In this case, we assume that the three tasks are mapped on a single processor, that static priority-based pre-emptive scheduling is used and that task T3 (T1) has the highest (lowest) priority. The execution is the same as in Fig. 4b until time t2.
At t2, due to the external event, the execution of T1 is pre-empted by the OS simulation model and T3 starts to run. Then, after receiving the signal from T3, T2 starts to run. At t4, T2 tries to acquire the semaphore. Since the semaphore is already locked by T1, T2 fails to acquire it. However, since T2 has a higher priority than T1, it keeps trying to acquire the semaphore while holding the processor (forever), so T1 never gets the processor to release the semaphore.
As shown in this example, OS level simulation can reveal design errors (especially, related to multitask synchronisation) that might not have been detected in functional simulation.
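The difference between the two traces of Fig. 4 can be reproduced with a very small model. The following Python sketch is an illustration under simplifying assumptions, not the OS simulation model of the cited work: time advances in unit steps, T2 busy-waits on the semaphore, and the function name and parameters (`step_at_which_t2_acquires`, `t1_work`, `max_steps`) are hypothetical.

```python
def step_at_which_t2_acquires(serialised, t1_work=3, max_steps=20):
    """Return the step at which T2 gets the semaphore, or None if it never does.

    serialised=False mimics functional simulation (tasks run concurrently).
    serialised=True mimics OS level simulation on a single processor with
    static priorities, where T2 outranks T1 (assumption taken from Fig. 4c).
    """
    t1_remaining = t1_work      # steps T1 still needs before it releases
    semaphore_free = False      # T1 already holds the semaphore
    for step in range(max_steps):
        if semaphore_free:
            return step         # T2's acquire attempt succeeds
        # T2 spins on the semaphore during this step.
        if not serialised:
            t1_remaining -= 1   # T1 makes progress in parallel
        # When serialised, the spinning T2 keeps the processor, so T1
        # gets no step and t1_remaining never changes.
        if t1_remaining == 0:
            semaphore_free = True
    return None                 # T2 spun for the whole observation window


functional = step_at_which_t2_acquires(serialised=False)
os_level = step_at_which_t2_acquires(serialised=True)
```

In this sketch the functional run lets T2 acquire the semaphore after T1 finishes its remaining work, whereas the serialised run never does, which is the kind of synchronisation error that only becomes visible once task execution is serialised.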
Mixed-level cosimulation
Figure 5 exemplifies a HW/SW cosimulation model which consists of SW and HW simulation models at different abstraction levels. In the Figure, a SW task running on a processor is at OS level and the processor is modelled at transaction level (TL) on its HW interface. The on-chip bus is at transaction level (TL), and the two HW modules are at TL and cycle-accurate level (C/A), respectively.
In the case of the processor, we have a special interface for mixed-level cosimulation which is located between the SW code and the HW interface of the processor. It is called the 'HW/SW interface for mixed-level cosimulation' or, in short, the HW/SW interface in this paper (shaded rectangle in Fig. 5). (In more precise terminology, the HW/SW interface represents real SW code (OS and device driver) and HW interface logic, which enable SW tasks running on the processor to communicate with HW modules. In this paper, we use the terms only from the simulation perspective.) It has two interfaces: a SW interface for SW code and a HW interface for connection with the other HW parts. In HW/SW cosimulation, the HW/SW interface serves to enable SW code (e.g. multiple tasks) to run (e.g. by an OS simulation model) and to communicate with HW modules.
The HW/SW interface can be specified by the abstraction levels at both sides of the HW/SW interface, as shown in Fig. 6.
For instance, assuming that the HW interface has four abstraction levels and the SW interface three abstraction levels as shown in the Figure, we may need up to 12 cases for the HW/SW interface. In this regard, the HW/SW interface is a generalised model of the conventional BFM since it covers a wider range of abstraction levels than the conventional BFM. The conventional BFM covers only the cases with a SW interface at algorithm level or ISA level and a HW interface at cycle-accurate level. Each of the existing solutions in mixed-level HW/SW cosimulation covers only a subset of the total combinations of the HW/SW interface. For instance, [19] covers only the combinations of device driver level (SW interface) and TLMs (HW interface) while [16] covers only those of OS level (SW interface) and TLMs (HW interface).
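The count of 12 cases is simply the product of the two sets of levels. The short Python sketch below enumerates them. The level names are assumptions: the four HW interface levels follow the TLM definition quoted earlier, and the three SW interface levels are the OS, device driver and ISA levels discussed above. The helper names `interface_cases` and `covered` are hypothetical.

```python
from itertools import product

# Assumed level names (see the lead-in); Fig. 6 may label them differently.
HW_LEVELS = ["message", "transaction", "transfer", "cycle-accurate"]
SW_LEVELS = ["OS", "device driver", "ISA"]


def interface_cases(sw_levels=SW_LEVELS, hw_levels=HW_LEVELS):
    """Every (SW interface level, HW interface level) pair."""
    return list(product(sw_levels, hw_levels))


def covered(cases, sw_allowed, hw_allowed):
    """Subset of the cases supported by one particular solution."""
    return [(sw, hw) for sw, hw in cases
            if sw in sw_allowed and hw in hw_allowed]


cases = interface_cases()
# A BFM-like solution restricted to ISA level SW and cycle-accurate HW.
bfm_like = covered(cases, {"ISA"}, {"cycle-accurate"})
```

Filtering the same list with other level sets gives a quick view of which part of the HW/SW interface space a given solution addresses.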
The mixed-level cosimulation model exemplified in Fig. 5 is quite common in current SoC design flows. In this Section, we explain how mixed-level cosimulation models are produced in the design flow and how to simulate those models.
In the conventional sequential design flow (Fig. 7a), where SW design starts only after HW design is finished, the abstraction level of the HW design in HW/SW cosimulation is fixed (mostly at C/A level). Only the abstraction levels of SW can change (though there are usually only two levels of SW abstraction, algorithm and ISA levels). In such a case, mixed-level cosimulation is enabled by a bus functional model (BFM) which transforms memory accesses from SW code into cycle-accurate events on the processor interface ports (address/data buses and control signals) [7].
Figure 7b illustrates HW/SW co-development exploiting the TLM concept. In this flow, the transaction level model of the HW design is created first. Then, SW design can start using the HW TLM as a virtual platform [20-23]. SW design can be refined from algorithm level down to ISA level. SW code can be compiled, loaded and executed on the virtual platform. The debugging and optimisation of the SW design can be performed on the virtual platform. In the meantime, the HW design can be refined from TLM to C/A in parallel. Figure 7b exemplifies the reduction in design cycle obtained by HW/SW co-development.
In terms of the number of abstraction levels, designers encounter more cases of the mixed-level HW/SW cosimulation model in the HW/SW co-development flow than in the conventional sequential design flow. Considering a general design flow where we can encounter all the possibilities of mixed-level cosimulation models, we can imagine a refinement space of function and interface as shown in Fig. 8a.
The space has three dimensions: one of SW function refinement and two of HW and SW interface refinement. Given a processor or HW module, we can imagine one refinement space like Fig. 8a, which represents the refinement space of a processor on which function f1 runs, as shown in the left-hand side of Fig. 8a.
The abstraction levels of function are shown to be the task/process level, instruction/RT (register transfer) level and cycle-accurate level (C/A). The SW and HW interface dimensions in the space represent those of the HW/SW interface, shown as the two dashed arrows in the left-hand side of Fig. 8a. (In the case of a HW module, the refinement space will be two-dimensional since only the function and HW interface dimensions are required.)
In the space, a point is denoted by a tuple ⟨AS, AH, AF⟩, where AS, AH and AF represent the abstraction levels of the SW interface, HW interface and SW function. The origin of the space represents an algorithm level code (which is not yet partitioned into HW and SW).
A scenario of refinement corresponds to a walk in the refinement space. Figure 8b shows a scenario of refinement. Figure 8a shows the corresponding walk from the space origin to the point ⟨ISA, C/A, C/A⟩, denoted by a solid circle in the Figure. Point a (in both Figures) represents the case where the algorithm level code is mapped to SW and refined to multiple SW tasks. A solid arrow from the space origin to point a represents the refinement. The Figure also shows the projections (shaded arrows) of the arrow on the sub-space of function and HW interface and on that of function and SW interface. In terms of SW interface abstraction, the multiple tasks are at OS level. The HW interface (of the processor on which the SW tasks will run) is at message level.
Refinement from point a to b represents the step in which an OS is designed or selected for the multiple tasks. However, device drivers are not yet fixed since the HW device design is not yet finished. Since only the SW interface is refined from OS level to device driver level, there is no projected arrow on the sub-space of function and HW interface in this case. The HW interface is still at message level.
Refinement from point b to c represents the step in which the device drivers are designed (possibly because the design of the corresponding HW devices is finished). Since all the SW code is ready, it can be compiled and run on an instruction set simulator (ISS) which provides instruction accuracy, e.g. [23]. However, the HW interface of the processor is still at message level in this case (possibly because the HW interface of HW module f2 in Fig. 8 is not yet refined but remains at message level). The projected arrow on the sub-space of function and SW interface corresponds to the refinement taken in this step.
Refinement from point c to d represents the introduction of a new ISS (denoted by ISS′ in the Figure) with cycle accuracy. The HW interface is refined to cycle-accurate level (possibly because the HW interface of module f2 is refined to cycle-accurate level).
As shown in Fig. 8, the refinement of function and interface can give various combinations of mixed-level cosimulation models. In the next subsection, we will explain how to simulate mixed-level cosimulation models. Before going to the next subsection, note that another important application of mixed-level cosimulation is to enhance simulation performance by raising the abstraction levels of some parts of the system that are already refined to a low level. Section 4 will address the performance issue.
How to simulate mixed-level cosimulation
First, we will introduce mixed-level HW simulation methods. Then, we explain a key technology in mixed-level HW/SW cosimulation: the simulation of interrupt handling in OS level simulation.
Mixed-level HW simulation:
We consider cases where HW modules with different abstraction levels of interface communicate with each other. There are two approaches to tackle this issue. One is to simulate all the abstraction levels of the HW interface during simulation [24]. The other is to design/generate a simulation wrapper that adapts different abstraction levels [25]. Figure 9 shows a case where all the abstraction levels are simulated [24]. In this model, called 'SystemC SV', three abstraction levels of the HW interface are assumed, including RTL. The advantages of this method are two-fold. One is that the communication protocol being simulated at a high level can be validated against the simulation results obtained at low levels. The other is that interconnecting modules with different abstraction levels of HW interface does not need additional adaptation, since all the possible abstraction levels are already supported in the HW interface.
In [25], the authors present a method of generating only the necessary simulation wrapper for the HW interface. Figure 10 shows the internal architecture of the simulation wrapper of the HW interface. It consists of two parts. One is called the 'simulator interface', which adapts different simulation languages and simulators. The other is called the 'communication interface', which adapts different abstraction levels of the HW interface. The communication interface is divided into three parts: port adapter, channel adapter and internal communication media.
A port adapter is connected to the port(s) of the HW module (via the simulator interface, if different simulation languages/simulators are used). For each pair of HW module port and abstraction level, a port adapter is predesigned in a simulation model library. For each communication protocol, a channel adapter is predesigned for each abstraction level of the communication protocol. The internal communication media does not carry any abstraction level information between port and channel adapters. Thus, a port/channel adapter depends only on the abstraction level of the HW module port/communication channel.
Given a HW module and a set of communication channels, a simulation model is generated according to the abstraction levels of ports of the HW module and those of communication channels. This approach aims at automatically generating simulation wrappers for mixed-level HW simulation while minimising the number of required library components.
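The library-based generation described above can be pictured as two independent lookups, one per side of the internal communication media. The Python sketch below is only an illustration of that idea, not the tool of [25]: the library contents, level names, port and channel names, and the functions `make_library` and `generate_wrapper` are all hypothetical.

```python
def make_library(names, levels, kind):
    """One predesigned adapter per (name, abstraction level) pair."""
    return {(name, lvl): f"{kind}[{name}@{lvl}]"
            for name in names for lvl in levels}


def generate_wrapper(ports, channels, port_lib, channel_lib):
    """Select only the adapters needed for these ports and channels."""
    return {
        "port_adapters": [port_lib[p] for p in ports],
        "channel_adapters": [channel_lib[c] for c in channels],
        # The internal media carries no abstraction level information.
        "internal_media": "level-independent",
    }


LEVELS = ["message", "transfer", "RTL"]   # hypothetical level names
port_lib = make_library(["data_in", "data_out"], LEVELS, "PortAdapter")
channel_lib = make_library(["fifo", "bus"], LEVELS, "ChannelAdapter")

wrapper = generate_wrapper(
    ports=[("data_in", "message"), ("data_out", "RTL")],
    channels=[("bus", "transfer")],
    port_lib=port_lib,
    channel_lib=channel_lib,
)
```

Because each adapter depends only on its own side, the library in this sketch holds one entry per port/level and per channel/level pair, rather than one wrapper for every combination of port level and channel level.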
Mixed-level HW/SW cosimulation:
As the abstraction level of SW design is raised, integrating OS level simulation of SW code into HW/SW cosimulation is receiving more and more attention [10, 11, 16-19]. A commercial tool for this purpose has even been released [14].
In terms of simulation model composition, in this case, the SW simulation model consists of SW code and OS/device driver models. Compared with conventional simulation with OS models [26], where we use OS simulators on top of which application tasks run, HW/SW cosimulation with OS simulation differs in that (i) HW simulation is performed and (ii) SW simulation is timed [17]. In particular, timed SW simulation enables us to validate performance as well as functionality, which is not possible in conventional simulation with OS models.
Timed SW simulation is enabled by annotating the SW execution delay on the target processor into the SW task code. Figure 11a exemplifies the annotation with function delay(). Delay values can be estimated using existing methods of estimating SW execution delay on the target processor [28].
The key technology in HW/SW cosimulation with OS (device driver) models is the simulation of interrupt handling. Function delay() needs to simulate interrupt handling. Figure 11b shows how function delay() works in a simulation environment such as SystemC. When it is called by task code, function wait() is called. (In this case, we use SystemC wait(event, time), which returns when time elapses without event or when event occurs before time elapses.) When function wait() returns, we can have one of two cases. One is that no interrupt event arrived at the processor on which the task code runs during the time period of delay d. Figure 11c exemplifies this case. At time t, function delay(10) is called by task code. There is no interrupt event from time t to t + 10. In this case, function delay(10) returns after advancing the simulation time of SW, T_SW, by the amount of elapsed time, T_e (= 10), as Fig. 11b shows. (We assume that SW simulation time and HW simulation time are managed separately. They are synchronised when function delay() or, to be more specific, function wait() is called.)
The other case where function wait() returns is when an interrupt event arrives at the processor before the delay period elapses. Figure 11d exemplifies this case. An interrupt event (e.g. nIRQ changes from '1' to '0' in the case of the ARM7 processor) arrives at time t + 5. Upon the interrupt, an interrupt service routine (ISR) is executed in reality. In HW/SW cosimulation, a simulation model of the ISR is executed as shown in Fig. 11b.
After the ISR simulation is finished, function delay() needs to continue to complete the remaining delay. To do that, before calling the ISR simulation, the remaining delay value is calculated (d = d - T_e). As Fig. 11d shows, after finishing the ISR simulation, the remaining time period needs to elapse as if a new function delay() were called with the remaining period.
During ISR simulation, the ISR simulation model can invoke OS scheduling (in the OS model), thereby yielding a task context switch. The ISR simulation can even be pre-empted by the arrival of another higher priority interrupt (e.g. FIQ in the case of the ARM7 processor). That is, nested interrupts need to be supported. For further details of OS/device driver level simulation in HW/SW cosimulation, refer to [12, 13].
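The control flow of function delay() can be sketched without a simulation kernel. The following Python sketch is a simplified stand-in for the SystemC code of Fig. 11b, under explicit assumptions: interrupt arrival times are known in advance as a list, `wait()` mimics wait(event, time) by returning the elapsed time and whether an interrupt cut the wait short, the ISR has a fixed duration, and nested interrupts and OS scheduling are not modelled. The class and attribute names (`ProcessorModel`, `t_sw`, `isr_log`) are hypothetical.

```python
class ProcessorModel:
    def __init__(self, interrupt_times, isr_delay):
        self.pending = sorted(interrupt_times)  # absolute arrival times
        self.isr_delay = isr_delay              # assumed fixed ISR duration
        self.t_sw = 0                           # SW simulation time T_SW
        self.isr_log = []                       # times at which the ISR ran

    def wait(self, d):
        """Stand-in for wait(event, time): return (T_e, interrupted)."""
        if self.pending and self.pending[0] < self.t_sw + d:
            # Interrupt arrives before the delay period elapses.
            elapsed = max(self.pending.pop(0) - self.t_sw, 0)
            return elapsed, True
        return d, False                         # full period elapsed

    def run_isr(self):
        """ISR simulation model; here it only consumes time."""
        self.isr_log.append(self.t_sw)
        self.t_sw += self.isr_delay

    def delay(self, d):
        while d > 0:
            t_e, interrupted = self.wait(d)
            self.t_sw += t_e        # advance T_SW by the elapsed time
            d -= t_e                # remaining delay, computed before the ISR
            if interrupted:
                self.run_isr()      # then resume with the remaining delay


cpu = ProcessorModel(interrupt_times=[5], isr_delay=3)
cpu.delay(10)   # interrupt at time 5, as in the case of Fig. 11d
```

With these assumed numbers the task is charged its full delay of 10 plus the ISR duration of 3, and the ISR is recorded at time 5.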
Performance of mixed-level cosimulation
In this Section, we address performance issues that designers face while raising the abstraction levels of HW/SW cosimulation. Simulation performance is determined by simulation workload (of SW and HW simulation) and simulation kernel overhead. It is known that simulation workload decreases as the abstraction level is raised. Simulation kernel overhead is proportional to the number of synchronisations between SW and HW simulation. Figure 12 illustrates the synchronisation overhead in HW/SW cosimulation.
Amdahl's law in HW/SW cosimulation performance
Raising the abstraction levels of simulation does not always yield the expected speedup. If only one item (e.g. cycle-accurate HW simulation workload) is improved (by raising its abstraction level to algorithm or TLM level), the overall speedup is limited by the items that are not improved, because the dominance in simulation runtime changes as we raise the abstraction levels of SW and HW simulation. For instance, in the case that SW simulation is performed by instruction set simulators and HW simulation by cycle-accurate RTL simulation, HW simulation may dominate the total simulation runtime (yielding a simulation performance of ~1 K cycles/s). If we raise the abstraction level of HW simulation from cycle-accurate level to cycle-approximate task level (function) and TLM level (HW interface), then dominance may move from HW simulation to SW simulation, thereby giving a simulation performance (~100 K cycles/s) which may be inferior to what the speedup of HW simulation alone would give. In this case, if we then raise the abstraction level of SW simulation from ISA level to device driver or OS level, dominance may move again from SW simulation to synchronisation overhead while giving higher simulation performance, e.g. ~10 M cycles/s. The simulation speedup is inferior to what device driver/OS level simulation alone gives (e.g. ~1000 times speedup). The reason why synchronisation overhead dominates is that HW and SW simulation must synchronise with each other frequently to check interrupt events going from HW to SW simulation.
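The shift of dominance can be made concrete with a small calculation. In the Python sketch below, the per-cycle host costs and the speedup factors are hypothetical values chosen only so that the three resulting throughputs land near the orders of magnitude quoted above; they are not measurements from any cited work.

```python
def cycles_per_second(hw_us, sw_us, sync_us):
    """Simulated cycles per second for given per-cycle costs (microseconds)."""
    return 1e6 / (hw_us + sw_us + sync_us)


# Hypothetical host time spent per simulated cycle, in microseconds.
HW_RTL, SW_ISS, SYNC = 990.0, 10.0, 0.09

# Cycle-accurate RTL HW + ISS: HW simulation dominates.
base = cycles_per_second(HW_RTL, SW_ISS, SYNC)

# HW raised to a high abstraction level (assumed 1e5 times cheaper):
# SW simulation now dominates.
hw_raised = cycles_per_second(HW_RTL / 1e5, SW_ISS, SYNC)

# SW also raised to device driver/OS level (assumed 1000 times cheaper):
# synchronisation overhead now dominates.
both_raised = cycles_per_second(HW_RTL / 1e5, SW_ISS / 1e3, SYNC)

# Overall gain from the SW step, to compare with the factor of 1000
# that the SW simulation alone was sped up by.
gain_from_sw_step = both_raised / hw_raised
```

Under these assumptions the last step improves the whole cosimulation by well under 1000 times, because the untouched synchronisation cost becomes the largest term in the sum.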
Techniques of improving simulation performance
The most powerful way to improve simulation performance is to raise the abstraction levels of simulation. However, for each possible combination of abstraction levels in HW/SW cosimulation, we need techniques to further improve simulation performance. In the following, we explain existing techniques for speeding up SW simulation and for reducing synchronisation overhead, thereby improving HW/SW cosimulation performance. There is little work specific to HW simulation speedup at abstraction levels higher than RTL, i.e. algorithm or task/process level, since existing simulation techniques [30], which are independent of HW/SW partitioning, can be applied to high-level HW simulation.
Most techniques of SW simulation speedup focus on instruction set simulation [31-33]. Figure 13 illustrates two types of instruction set simulation: interpretive and compiled simulation. The compiled ISS overcomes the problem of instruction decoding in the interpretive ISS by pre-interpreting assembly instructions (in the code generation step in Fig. 13) and by generating the simulation model of each assembly instruction in the source code of the ISS, as shown in the Figure [31]. By removing the overhead of instruction decoding during the simulation run, the compiled ISS can improve simulation performance significantly. However, it lacks support for the simulation of self-modifying code and interrupts. It also suffers from the overhead of increased code size.
There have been several approaches to preserve the speed of the compiled ISS while keeping the capability of the interpretive ISS. In [32], SW simulation is based on an interpretive ISS. The key idea of this method is to reuse the information of instruction decoding. First, an assembly instruction is simulated by the interpretive ISS. The information of instruction decoding is stored in a table together with the address of this instruction. Later, when the same instruction (identified by its address) needs to be decoded, the saved information of instruction decoding is used instead of executing the decoding step of the interpretive ISS again. Since the instruction decoding is performed only once (if the table is large enough) during the simulation run, the method is called JIT (just-in-time) cache compiled simulation.
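The idea of reusing decode information keyed by instruction address can be sketched as follows. This Python sketch is a toy, not the simulator of [32]: the 16-bit instruction encoding, the two opcodes and the names `CachingISS`, `decode_cache` and `decode_count` are invented for illustration, and the table is unbounded.

```python
class CachingISS:
    """Interpretive ISS that decodes each instruction address only once."""

    def __init__(self, program):
        self.program = program      # address -> raw instruction word
        self.decode_cache = {}      # address -> decoded (opcode, operand)
        self.decode_count = 0       # how often the decoder actually ran

    def decode(self, word):
        """Hypothetical encoding: high byte = opcode, low byte = operand."""
        self.decode_count += 1
        return word >> 8, word & 0xFF

    def fetch_decoded(self, addr):
        if addr not in self.decode_cache:        # first visit: decode, store
            self.decode_cache[addr] = self.decode(self.program[addr])
        return self.decode_cache[addr]           # later visits: reuse

    def run(self, trace):
        """Execute the instructions at the given sequence of addresses."""
        acc = 0
        for addr in trace:
            opcode, operand = self.fetch_decoded(addr)
            if opcode == 1:       # hypothetical ADD immediate
                acc += operand
            elif opcode == 2:     # hypothetical SUB immediate
                acc -= operand
        return acc


iss = CachingISS({0: 0x0105, 4: 0x0202})   # ADD 5 at 0, SUB 2 at 4
result = iss.run([0, 4, 0, 4, 0])          # a loop revisiting both addresses
```

A sketch like this would need its table entry invalidated whenever the program memory at a cached address is written, which is one way the capability to handle self-modifying code could be retained.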
In [34], a concept called 'cached simulation' is presented. The main idea of this method is to replace the ISS execution in HW/SW cosimulation as often as possible by that of the source code of the application function. To do that, the delay of the application function is obtained by executing the ISS when the application function is called for the first time. Then, the delay value is stored in a table called the 'delay cache' together with the information of the execution path in the code of the application function. When the same execution path of the application function is simulated, the delay value is reused instead of executing the ISS. Thus, ISS execution can be minimised.
Synchronisation overhead reduction can be achieved by optimistic approaches [35, 36] or by the concept of message grouping [37]. Figure 14 shows two types of synchronisation between SW and HW simulation: lock step and optimistic synchronisation. In lock step synchronisation, at every system clock cycle, SW and HW simulation synchronise with each other to exchange events (e.g. memory read/write request signals from SW to HW simulation and interrupts from HW to SW simulation), as shown in Fig. 14a. The overhead of such synchronisation is prohibitively high, especially when multiple processes are involved in simulation or when the abstraction levels of both SW and HW simulation are high (e.g. OS level for SW simulation and task/process level for HW simulation, respectively).
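A delay cache of this kind amounts to a table keyed by the function and its execution path. The Python sketch below illustrates the lookup only, and is not the implementation of [34]: the ISS is replaced by an arbitrary callable, the path is represented as a sequence of basic-block labels, and the names `DelayCache`, `iss_delay` and `iss_runs` are hypothetical.

```python
class DelayCache:
    """Reuse ISS-measured delays for already-seen execution paths."""

    def __init__(self, iss_delay):
        self.iss_delay = iss_delay   # callable standing in for an ISS run
        self.table = {}              # (function, path) -> delay in cycles
        self.iss_runs = 0            # number of times the ISS was executed

    def delay(self, func_name, path):
        key = (func_name, tuple(path))
        if key not in self.table:            # first time on this path
            self.iss_runs += 1
            self.table[key] = self.iss_delay(func_name, path)
        return self.table[key]               # otherwise reuse stored delay


# Hypothetical ISS stand-in: 10 cycles per basic block on the path.
cache = DelayCache(lambda func_name, path: 10 * len(path))

first = cache.delay("filter", ["b0", "b2"])         # runs the ISS stand-in
again = cache.delay("filter", ["b0", "b2"])         # served from the table
other = cache.delay("filter", ["b0", "b1", "b2"])   # new path, ISS again
```

In this sketch the ISS stand-in runs once per distinct path, so the benefit depends on how often the application functions repeat the same execution paths.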
Optimistic simulation reduces the synchronisation overhead by skipping synchronisation for a certain number of simulation cycles. In Fig. 14b , SW simulation is assumed to advance its execution by four clock cycles optimistically, i.e. without synchronisation with HW simulation. During that period, it saves its simulation state (at time 2). At time 4, both simulations synchronise with each other. If there is an event that should have been exchanged between SW and HW simulation before time 4, SW simulation rolls back to one of the saved simulation states whose timestamp is earlier than that of the missed event. If not, the simulation continues in the same way. The Figure illustrates simulation speedup by optimistic simulation. In this method, a key consideration is to decide when to save simulation states since too frequent state saving may offset the speedup by state saving overhead.
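The trade-off between skipped synchronisations and rollback cost can be explored with a small counting model. The Python sketch below is an illustration under strong assumptions, not one of the cited optimistic simulators: HW event times are given in advance and lie after time 0, SW state is reduced to its timestamp, states are saved every `save_every` cycles from the start of each window, and the function name `optimistic_run` and its parameters are hypothetical.

```python
def optimistic_run(end_time, window, save_every, hw_events):
    """Return (handled_events, syncs, rollbacks, executed_cycles)."""
    pending = sorted(hw_events)
    handled, syncs, rollbacks, executed = [], 0, 0, 0
    t = 0
    while t < end_time:
        start = t
        target = min(start + window, end_time)
        # Timestamps of the states saved during this window.
        checkpoints = list(range(start, target, save_every))
        executed += target - start       # advance without synchronising
        t = target
        syncs += 1                       # synchronise with HW simulation
        missed = [e for e in pending if start < e < target]
        if missed:
            event = missed[0]
            # Roll back to the latest saved state earlier than the event.
            restore = max(c for c in checkpoints if c < event)
            rollbacks += 1
            executed += event - restore  # re-execute up to the event
            t = event
            handled.append(event)
            pending.remove(event)
        else:
            # Events exactly at the synchronisation point need no rollback.
            for event in [e for e in pending if e == target]:
                handled.append(event)
                pending.remove(event)
    return handled, syncs, rollbacks, executed


# Window of four cycles with a state saved every two cycles, as in Fig. 14b.
stats = optimistic_run(end_time=8, window=4, save_every=2, hw_events=[3])
```

In this run, lock step synchronisation would need eight synchronisations, whereas the sketch performs three at the price of one rollback and a few re-executed cycles; a smaller `save_every` shortens re-execution but increases the number of states to save.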
Challenge: cosimulation for MPSoC
MPSoC is introducing new concepts and problems into SoC design methodology, including network-on-chip [38], parallel processors [39], distributed memory [40], parallel programming models [39], distributed OS [41], etc. HW/SW cosimulation faces a new challenge in the MPSoC era. Assume that we perform HW/SW cosimulation of the MPSoC with a cycle-accurate ISS for each processor. We may need up to 100 ISSs for the simulation, as exemplified in the figure. Such a simulation may suffer from very low speed (less than about 1 kcycle/s) due to the high workload of the SW simulation (e.g. running 100 ISSs), even though HW simulation may be performed at a high abstraction level. Such low performance may not be acceptable in application software development or in architecture exploration (e.g. buffer size optimisation for DMA or network interfaces).
To overcome the problem of poor simulation performance, we may need to apply higher abstraction levels to HW/SW cosimulation (e.g. OS level simulation [14-17, 42, 43] instead of cycle-accurate ISS execution) or apply other methods such as parallel simulation.
In the near future, a slightly higher abstraction level model, e.g. an instruction accurate (IA) ISS, may be practically applied together with functional/timed models of peripherals. The performance of an IA ISS ranges between 10 and 100 MIPS on high-performance PCs. Such performance even allows an entire board consisting of a complex processor, bus and peripherals to be simulated almost in real time [43]. Embedded software developers will benefit from this simulation model since they can validate application SW code in real time before the hardware prototype is available. When tens of processors need to be simulated, however, IA ISSs may not give enough performance for application SW validation. A set of higher abstraction levels for both hardware and software simulation [12-15] may need to be applied in this case.
High-level simulation may not always be the best way to improve simulation performance, since it sacrifices simulation accuracy (which might be needed in performance estimation) and visibility of the simulated system (which might be needed in functional validation or in debugging). One desirable solution may be the ability to change the abstraction levels of simulation dynamically during the simulation run. Figure 16 exemplifies this idea.
Assume that we simulate an MPSoC with N processors. The figure shows that the abstraction level at which each processor is simulated changes dynamically during simulation between a high abstraction level (HL), e.g. OS level, and a low abstraction level (LL), e.g. ISA level. The abstraction level may be lowered when a detailed simulation is needed for debugging purposes (e.g. ISS execution to check stack overflow) or for simulation accuracy (e.g. ISS execution to simulate interrupt handling at cycle accuracy). The figure illustrates that the abstraction levels of processors #1 and #N are lowered between time t2 and t3 to accurately simulate inter-processor communication (e.g. DMA launch, waiting on an interrupt, etc.). After the low-level simulation, the abstraction levels may be raised again to speed up simulation.
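The scenario above can be mimicked with a toy cost model in which each processor model switches between HL and LL according to a per-processor list of time windows. Everything here is assumed for illustration: the names `Level`, `ProcessorModel` and `simulate`, the window encoding, and in particular the relative host costs in `COST`. The sketch also ignores the transfer of simulation state between the two models, which is the hard part of dynamic switching.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    HL = "OS level"
    LL = "ISA level"


# Hypothetical host cost (arbitrary units) of simulating one target cycle.
COST = {Level.HL: 1, Level.LL: 100}


@dataclass
class ProcessorModel:
    name: str
    level: Level = Level.HL
    host_cost: int = 0

    def advance(self, cycles: int) -> None:
        # Charge the host cost of the currently selected abstraction level.
        self.host_cost += cycles * COST[self.level]


def simulate(procs, end_time, low_level_windows):
    """low_level_windows maps a processor name to [(t_start, t_end), ...]."""
    for t in range(end_time):
        for p in procs:
            windows = low_level_windows.get(p.name, [])
            in_window = any(a <= t < b for a, b in windows)
            p.level = Level.LL if in_window else Level.HL
            p.advance(1)
    return {p.name: p.host_cost for p in procs}


procs = [ProcessorModel("P1"), ProcessorModel("P2"), ProcessorModel("PN")]
# P1 and PN drop to ISA level while they communicate (between t=4 and t=6).
costs = simulate(procs, end_time=10, low_level_windows={"P1": [(4, 6)], "PN": [(4, 6)]})
```

With these assumed costs, P2 stays at HL throughout and costs 10 units, while P1 and PN each cost 208 units, almost all of it spent in the two low-level cycles. Keeping the low-level windows short is therefore what preserves the speed advantage.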
Changing the abstraction levels of simulation is not a new idea. However, the existing technique [44] is limited to changing the abstraction levels of the HW interface. In the case of MPSoC cosimulation, we need a method that can change the abstraction levels of the HW/SW interfaces and of the function. To the best of the authors' knowledge, there is little work on this issue. Considering Amdahl's law in HW/SW cosimulation explained in Section 3.2.3, if the abstraction levels of both SW and HW simulation are raised, synchronisation overhead may again dominate the total HW/SW simulation runtime. For instance, recalling the OS level simulation in Fig. 7, if the granularity of delay annotation is very small, e.g. a few clock cycles, then the synchronisation overhead caused by the function delay() will dominate the entire simulation runtime, which may yield poor simulation performance and mask the benefit of high-level simulation.
Possible solutions to this problem are (i) to increase the granularity of delay annotation (to delays of hundreds or thousands of cycles) in order to reduce the number of synchronisations, (ii) to predict only those timing points that are necessary for synchronisation [36], or (iii) to apply optimistic simulation approaches [35, 37]. More research is required on novel techniques to reduce synchronisation overhead in high-level HW/SW cosimulation of MPSoC.
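The effect of solution (i) can be estimated with a simple runtime model: if one synchronisation is issued per annotated delay of g cycles, the synchronisation cost per simulated cycle shrinks as 1/g. The function `sync_fraction` and the cost values `c_sim` and `c_sync` below are assumptions chosen only to show the trend, not measured figures.

```python
def sync_fraction(granularity, c_sim=1.0, c_sync=50.0):
    """Fraction of host runtime spent on synchronisation.

    granularity: target cycles covered by one delay annotation
    c_sim:       assumed host cost of simulating one target cycle
    c_sync:      assumed host cost of one synchronisation
    """
    sync_per_cycle = c_sync / granularity
    return sync_per_cycle / (c_sim + sync_per_cycle)


fine = sync_fraction(4)       # annotation every few clock cycles
coarse = sync_fraction(1000)  # annotation every thousand cycles
```

Under these assumed costs, synchronisation takes more than 90% of the runtime at a granularity of 4 cycles and less than 5% at 1000 cycles. The price of the coarser annotation is that events are observed later, so the granularity has to be balanced against the required timing accuracy.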
Conclusions
In this paper, we explained mixed-level HW/SW cosimulation as the current issue of HW/SW cosimulation and addressed the performance problem of MPSoC cosimulation. First, we introduced the abstraction levels of function, HW interface and SW interface. Mixed-level cosimulation is required in HW/SW co-development and for the purpose of enhancing cosimulation performance. To better understand how mixed-level cosimulation models are produced in HW/SW co-development, we introduced the concept of refinement space. Techniques of mixed-level cosimulation were explained for both the HW interface and the SW interface.
In terms of cosimulation performance, raising the abstraction levels of simulation may not always give as much performance improvement as expected, due to Amdahl's law. Thus, in addition to raising abstraction levels, we need techniques to improve SW and HW simulation individually and to reduce synchronisation overhead in HW/SW cosimulation. In this paper, we explained techniques to improve SW simulation based on instruction set simulation and to reduce synchronisation overhead by exploiting optimistic simulation.
MPSoC poses a new problem for HW/SW cosimulation. Owing to the high number of processors in MPSoC, current solutions of HW/SW cosimulation based on ISS execution may not give sufficient simulation performance. In this paper, we addressed the direction of required study to 
