We present a new approach to interconnecting diverse SystemC simulations using the High Level Architecture (HLA) as simulation backbone. The presented simulation environment is characterized by its generality and extensibility. It supports different kinds of execution, such as distributed simulation of a single SystemC model as well as co-simulation with other arbitrary simulators. The emphasis of this work is on the synchronization and time flow mechanisms that need to be applied when executing a single SystemC model in parallel. A case study is performed by means of a loosely-timed SystemC transaction level model of a homogeneous Multi-Processor System-on-Chip. The SystemC model exploits temporal decoupling, which allows different computation-to-synchronization ratios to be set and thus serves as the basis for performance evaluation.
INTRODUCTION
In recent years embedded systems have evolved rapidly, allowing more and more functionality to be integrated into a single device. Within this development two facts can be observed: (1) Shrinking VLSI structure sizes in combination with the system-on-chip paradigm are one of the key enablers for power-efficient single-chip [...] One classic example is sensor actuator networks comprising a huge number of embedded sensor nodes. Each of the nodes can be a system-on-chip device of intrinsic complexity. When simulating such communication-centric systems, simulation speed is mostly limited by the communication and synchronization overhead. One approach to reducing this overhead is to use more abstract transaction level models (TLM) [5], [13], which increase the ratio between computation and synchronization while still allowing quite accurate evaluations of single devices.
However, we believe that simulating a single device exclusively on the transaction level is not enough to explore the impact of distributed sensor network applications on the system-on-chip architecture of the single device. Our work targets a more holistic approach that allows for a scalable, detailed simulation of several nodes or sub-modules on different abstraction levels like transaction level, cycle-accurate level (CAL) or register transfer level (RTL) [13] concurrently. This demands an underlying simulation environment that supports scalability, adaptability and modularity. Therefore, within our concept each node or sub-module can be placed into its own SystemC [4] simulation. In a next step the SystemC simulations are interconnected by a generic simulation backbone based on the High Level Architecture (HLA) [6], which inherently supports parallel/distributed discrete event simulation (PDES or DDES) [12], [19] on diverse platforms as well as the connection of other arbitrary simulators like network simulators for more comprehensive evaluations. Communication between several SystemC simulations is thereby generally performed on higher abstraction levels in order to increase the computation-to-synchronization ratio.
In this contribution we present the new approach of interconnecting diverse SystemC simulations using the HLA. We discuss important aspects of time synchronization between different SystemC simulations in detail and evaluate the synchronization performance of the HLA backbone by means of a scalable transaction level model of a homogeneous Multi-Processor SoC (MPSoC). Based on the results we draw first conclusions regarding distributed SystemC simulations using our approach.
The remainder of this paper is organized as follows: In section 2 we summarize a selection of related work. SystemC, the TLM 2.0 design methodology and the concepts of the HLA are briefly described in section 3. Afterwards, in sections 4 and 5 the simulation environment is introduced and the experimental setup is described. Performance analysis results are presented in section 6. Section 7 concludes and points to further research.
RELATED WORK
The parallel simulation of SystemC models is attracting more and more interest in the research community. In recent years diverse approaches in this direction have been developed. One of the first works explicitly referring to the application of PDES is presented in [9] and [10]. The approach is based on kernel integration of remote functionality and allows for synchronization between several distributed SystemC kernels on delta cycle level via MPI. It is therefore basically applicable to RTL simulation. In [14] a similar approach is described that accesses the SystemC kernel via the SystemC user-level functions. Due to the high delta cycle synchronization overhead, both approaches only gain leverage for models that provide high computation-to-synchronization ratios, like TLMs. To further speed up pure TLM simulations several specific TLM engines have been developed. In [23] a light-weight simulation kernel is presented that is optimized for parallel simulation of adaptive TLM models. In [17] a specialized simulation engine for TLMs and a modelling methodology called TLM with distributed time are introduced to simulate an MPSoC in parallel. Concepts other than PDES are used e.g. in [22] and [20]. In [22] the SystemC kernel is improved by parallel programming techniques, leveraging the parallel execution capabilities of multi-core machines. The authors of [20] use general purpose graphics processors (GPGPUs) for kernel parallelization. Finally, an example that focuses not on performance but on distributed IP core verification is described in [16] and [11].
To the best of our knowledge none of these approaches uses the HLA as communication backbone. They all have the drawback of being limited to SystemC parallelization/distribution, whereas our approach allows for parallel/distributed simulation as well as co-simulation of SystemC models with other simulators. Furthermore, many approaches in the transaction level domain depend on purpose-built simulation engines, which makes them unsuitable for concurrent simulation on different abstraction levels like TLM and RTL. Such concurrent simulation, however, is a main requirement of our research in the context of wireless sensor networks.
FUNDAMENTALS

SystemC
There is a set of design languages that can be chosen for hardware modeling during design space exploration (DSE). The most important characteristics of such a language are that it must support
• HW/SW co-design
• reuse of existing IP (still C/C++ is clearly the most used high level programming language [18] )
• system level IP integration
• different levels of abstraction 
The High Level Architecture
The HLA was originally defined by the Defense Modeling and Simulation Office (DMSO) for the U.S. Department of Defense. Its original field of application is military training simulations in which thousands of military participants interact within a shared training exercise. The HLA is a generic software architecture combining all the components necessary for PDES. It determines the functional entities, design rules and interfaces for computer-based simulation systems and specifies the communication between the single components of an entire simulation. Communication between different simulators is independent of the underlying computing platforms. Further advantages of the HLA are its support of easy interoperability, reusability and adaptability. In 2000 the HLA became an international standard (IEEE 1516.x) [6].
The general structure of the HLA is shown in fig. 1. The HLA specifies a software architecture and not an implementation. Several commercial as well as open-source implementations of the run-time infrastructure (RTI) exist. The task of the simulation designer is to select an appropriate RTI and to develop the federates, which access the RTI via the ambassador modules.
THE SIMULATION ENVIRONMENT
To the best of our knowledge this work is the first one considering the HLA for distributed simulation of SystemC models. Since we plan to apply it in the context of a complete wireless sensor network simulation, we are reliant upon sufficient performance, flexibility and scalability. These requirements are the justification for the HLA to form the core for managing communication and synchronization. The simulation environment shall support two types of execution:
1. Distributed simulation of several nodes together with other simulators, e.g. network simulators [2]
2. Distributed simulation of a single sensor node by dividing it into sub-modules which can be described on different abstraction levels like TLM, CAL or RTL concurrently
Each node/sub-module can thereby be simulated by a separate SystemC kernel. In general, the execution performance of a distributed simulation is greatly influenced by the synchronization overhead, which grows with increasing communication effort (this issue is also illustrated by the later described experiments in section 5). For case one, synchronization is less problematic since wireless communication latencies are much higher than intra-chip communication latencies. The bottleneck within the simulation environment will most likely be located in the distributed simulation of a sensor node itself. To relax the synchronization effort, data exchange between sub-modules (case two) is generally performed on higher levels of abstraction. In order to investigate the synchronization effort of intra-chip communication, in this paper we evaluate the distributed simulation of a single scalable transaction level model of a system-on-chip.
Synchronization and Time Flow
A distributed SystemC simulation may execute orders of magnitude slower than its non-distributed counterpart if synchronization becomes the dominant factor. The literature on PDES distinguishes between two types of synchronization algorithms: conservative and optimistic [12]. In short, conservative synchronization algorithms avoid violating the causality relationships between the logical processes that are to be synchronized. They always guarantee that events are delivered in the correct time order. In contrast, optimistic approaches allow the causality relationships to be violated but provide mechanisms like Time Warp [15] to restore earlier points in time. The HLA time management provides both types. Due to the implementation complexity and the enormous memory requirements of optimistic synchronization [12], we use the conservative synchronization interface of the HLA.
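The conservative rule can be illustrated with a minimal sketch. It assumes that every federate publishes its current logical time and a lookahead, and it grants a time advance only up to the lower bound on the time stamp (LBTS) of events that may still arrive. The names `Federate`, `lbts_for` and `grant` are hypothetical and are not part of the HLA interface.

```python
from dataclasses import dataclass


@dataclass
class Federate:
    name: str
    time: int       # current logical time in ns
    lookahead: int  # promise: no outgoing event stamped earlier than time + lookahead


def lbts_for(target, federates):
    """Lower bound on the time stamp of any event `target` may still receive."""
    others = [f for f in federates if f is not target]
    return min(f.time + f.lookahead for f in others)


def grant(target, requested_time, federates):
    """Conservative advance: never move the federate past its LBTS."""
    return min(requested_time, lbts_for(target, federates))


a = Federate("a", 0, 5)
b = Federate("b", 10, 2)
c = Federate("c", 3, 1)
granted = grant(a, 100, [a, b, c])  # limited by federate c: 3 + 1
```

In this sketch federate `a` asks for time 100 but is held back by the slowest peer, which is exactly the behaviour that makes small lookahead values expensive.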
A second technique for which the HLA provides freedom of choice is the time flow mechanism. It can be time-stepped or event-driven. In a time-stepped approach simulation time is subdivided into a sequence of equal-sized time steps. The time globally advances from one time step to the next. In contrast, in an event-driven simulation the simulation state is only updated when an event occurs. Simulation time does not advance from one time step to the next but from the time stamp of one event to the next. Since the SystemC kernel itself is an event-driven simulation kernel and the time of the next event to be processed can always vary, the event-driven time flow mechanism is preferred.
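The difference between the two time flow mechanisms can be made concrete with a small sketch. It only lists the points in time a simulator would visit; the function names and the integer nanosecond time base are assumptions made for illustration.

```python
import heapq


def time_stepped(step, end):
    """Visit every multiple of `step` up to `end`, whether or not an event is due."""
    return list(range(0, end + 1, step))


def event_driven(event_times, end):
    """Jump directly from the time stamp of one event to the next."""
    queue = list(event_times)
    heapq.heapify(queue)
    visited = []
    while queue and queue[0] <= end:
        visited.append(heapq.heappop(queue))
    return visited


events = [30, 7, 1000]
steps_visited = time_stepped(10, 1000)      # 101 points in time
events_visited = event_driven(events, 1000)  # only the three event times
```

With sparse, irregularly spaced events the event-driven variant visits far fewer points in time and never has to round an event to a step boundary.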
In the following a simulation library is described, integrating an HLA interface with SystemC and implementing the described techniques.
The SystemC Federate
The SystemC federate library is the basic component for distributed SystemC execution. It combines the OSCI SystemC 2.2.0 kernel together with the HLA interfaces. In order to switch the RTI the federate library only has to be recompiled with a different RTI implementation.
Within each federate a separate SystemC simulation kernel is executed. To provide maximum flexibility, a modular structure similar to the HLA structure itself has been chosen (fig. 2). The functional components RTI ambassador, federate ambassador, adaptor, controller and object database are interconnected by the mediator, which forwards communication requests. The federate library is equipped with an XML parser, which also makes it parameterizable at runtime.
Controller
The controller directs the local simulation, which means initializing, executing and shutting down the federate. It contains the simulation loop. Since its implementation is model dependent, the controller is provided as an abstract base class. In fig. 3 the simulation loop is illustrated by means of a state chart, using the transaction level model of section 5 as an example. As in [14], the SystemC user-level functions are used. In contrast to [14], in our approach synchronization is generally not performed on delta cycle level but only on the level of timed events, which is sufficient for our transaction level model.
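A simulation loop of this kind could be sketched as follows. This is not the state chart of fig. 3 but a simplified reading of it under stated assumptions: the controller asks for the time of the next local timed event, requests a time advance, and runs the local kernel up to the granted time. All class and method names (`Controller`, `next_event_time`, `run_until`, `request_advance`, `EchoRti`) are hypothetical.

```python
from abc import ABC, abstractmethod


class Controller(ABC):
    """Hypothetical controller base class holding the simulation loop."""

    def __init__(self, rti, end_time):
        self.rti = rti
        self.end_time = end_time
        self.now = 0

    @abstractmethod
    def next_event_time(self):
        """Time stamp of the next local timed event (model dependent)."""

    @abstractmethod
    def run_until(self, time):
        """Execute the local kernel up to and including `time`."""

    def simulate(self):
        while self.now < self.end_time:
            requested = min(self.next_event_time(), self.end_time)
            granted = self.rti.request_advance(requested)
            self.run_until(granted)
            self.now = granted


class EchoRti:
    """Stand-in for the RTI that grants every request and records it."""

    def __init__(self):
        self.requests = []

    def request_advance(self, time):
        self.requests.append(time)
        return time


class ListController(Controller):
    """Concrete controller whose 'kernel' is just a sorted list of event times."""

    def __init__(self, rti, end_time, events):
        super().__init__(rti, end_time)
        self.events = sorted(events)
        self.processed = []

    def next_event_time(self):
        return self.events[0] if self.events else float("inf")

    def run_until(self, time):
        while self.events and self.events[0] <= time:
            self.processed.append(self.events.pop(0))


rti = EchoRti()
controller = ListController(rti, 100, [40, 10, 250])
controller.simulate()
```

Because only timed events trigger a request, the number of synchronizations in this sketch equals the number of distinct event times, not the number of delta cycles.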
Adaptor
Models are integrated into the federate via an adaptor module. The adaptor performs the translation of local data into remote RTI objects or interactions. Its implementation can vary depending on the type of the model to be integrated (e.g. sub-module or complete node). Because of that, it is also provided as an abstract base class which has to be specialized for the connection of a designated SystemC model.
Object and Interaction Database
Each federate must define a Simulation Object Model (SOM) [6] which determines the data (objects and interactions) that it is able to exchange. During the initialization phase the SOM is stored in the object and interaction database of a federate and is thereby accessible to all federate components.
EXPERIMENTAL SETUP
As a case study and in order to evaluate the applicability and performance of the simulation environment, a transaction level model of a homogeneous MPSoC has been implemented and prepared for distribution. In the following, the functionality of the model is briefly described.
TLM-based Reference Model
The overall structure of the MPSoC model is shown in fig. 4 . It consists of several processing nodes being interconnected by a network on chip with mesh topology. The main feature is the scalability of the computation to synchronization ratio of each processing node as well as the quantity of nodes per row and column. The processing nodes are described in a loosely-timed coding style using the OSCI TLM 2.0 blocking transport interface [5] .
A more detailed view of a single processing node is given in fig. 5. Its structure is typical of MPSoC nodes and consists of a processor connected to timer, router and memory via a local bus. The model applies temporal decoupling [5]. In general, temporal decoupling allows processes to run ahead of simulation time (instead of forcing them to synchronize with the SystemC kernel) by deferring calls to the wait(time) function. This reduces costly context switches and increases the computation-to-synchronization ratio. To prevent processes from running ahead without limit [...]
Processor
With regard to future investigations we selected an instruction set simulator (ISS) as processor model, simulating an arbitrarily chosen architecture (SPARC V8) [7]. The ISS allows for simulations with instruction level time granularity. Temporal decoupling makes the time granularity of communication coarser by aggregating the execution time of several instructions into the global quantum q. For this purpose, we define q as the product of a number of instructions n_i and the cycle time c (assuming one instruction to be executed per cycle). This means that the wait(q) function is called only after n_i instructions have been executed; control is then returned from the ISS thread to the SystemC kernel, which is able to schedule other threads. In general, q is chosen several orders of magnitude higher than c in order to mimic long software execution times before interaction with the external network via the router can be performed.
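The effect of the global quantum on the number of kernel synchronizations can be illustrated with a short sketch. It assumes one instruction per cycle as stated above and a cycle time of 10 ns (100 MHz); the functions `global_quantum_ns` and `run_iss` are hypothetical, and the time bookkeeping merely stands in for a wait(q) call.

```python
def global_quantum_ns(n_instructions, cycle_time_ns):
    """q as the product of the instruction count and the cycle time."""
    return n_instructions * cycle_time_ns


def run_iss(total_instructions, n_instructions, cycle_time_ns):
    """Return (simulated time in ns, number of synchronizations with the kernel)."""
    q = global_quantum_ns(n_instructions, cycle_time_ns)
    kernel_time = 0
    local_offset = 0  # how far the ISS has run ahead of kernel time
    syncs = 0
    for _ in range(total_instructions):
        local_offset += cycle_time_ns
        if local_offset >= q:
            kernel_time += local_offset  # stands in for wait(q)
            local_offset = 0
            syncs += 1
    return kernel_time + local_offset, syncs


fine = run_iss(10_000, 1, 10)        # synchronize after every instruction
coarse = run_iss(10_000, 1_000, 10)  # synchronize every 1000 instructions
```

Both runs cover the same simulated time, but the coarse setting hands control back to the kernel a thousand times less often.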
Router
The router is equipped with four fifos for each of the cardinal directions north, east, south and west (see fig. 5). The host fifos are used for packets to be transmitted to direct neighbours. Routing fifos are used for packet forwarding between nodes that are more than a single hop apart. Each packet is divided into a header and a data field. Packet transmission and reception are performed by the routing thread, which is invoked whenever a new packet is available either in one of the input fifos or in the internal packet buffer. In the reception case the routing thread compares the destination address of the packet with its own address. If a match is detected, the packet is forwarded to the host; otherwise it is written into one of the outgoing routing fifos using a simple x-y routing scheme, which routes first in the horizontal and then in the vertical direction. In the transmission case the destination address is also analyzed and compared with the addresses of the neighbouring nodes. If a match is detected, the packet is written into the appropriate host fifo; otherwise it is written into one of the routing fifos. The basis for address comparison is the address assignment shown in fig. 4. Transmissions via fifos are subject to a delay d_f on the order of several clock cycles, which is modelled by a wait(d_f) call. This creates events with a much finer time granularity than the events created by the processor. The different granularity was chosen for more realistic simulation timing, since in reality hardware routing is much faster than software-controlled packet transmission.
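The x-y routing decision can be sketched in a few lines. The address format is an assumption: nodes are addressed as (x, y) pairs with x growing towards east and y growing towards south, which need not match the address assignment of fig. 4. The function names are hypothetical.

```python
STEP = {"east": (1, 0), "west": (-1, 0), "south": (0, 1), "north": (0, -1)}


def next_hop(own, dest):
    """x-y routing: resolve the horizontal offset first, then the vertical one."""
    if own == dest:
        return "host"
    if dest[0] != own[0]:
        return "east" if dest[0] > own[0] else "west"
    return "south" if dest[1] > own[1] else "north"


def route(src, dest):
    """List of nodes a packet visits from `src` to `dest`."""
    path = [src]
    position = src
    while True:
        direction = next_hop(position, dest)
        if direction == "host":
            return path
        dx, dy = STEP[direction]
        position = (position[0] + dx, position[1] + dy)
        path.append(position)


path = route((0, 0), (2, 1))
```

For the example the packet first travels along the row to column 2 and only then turns into the column, so every intermediate router takes the same deterministic decision.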
Model Partitioning
There exist several possibilities for cutting the MPSoC model into parallelizable sub-modules e.g. partitioning along architectural boundaries or logical partitioning. We decided to use the former (being the most obvious variant) and simulated one processing node per federate. Fig. 4 exemplarily shows the cut edge for node 0/1. That way, each federate is always charged with similar workload. For the chosen partitioning the fifos that interconnect the processing nodes need to be cut into two parts. Each counterpart communicates with an adaptor (see section 4.2.2) which handles remote transmission and reception.
Remote fifo packets transmitted by one federate are broadcast to all others. This makes a unique ID necessary. Data that was read from a fifo is written by the adaptor into an HLA object called hla packet (see table 2). Besides the already mentioned fifo packet fields of table 1 there are two additional ones, namely fifoID and rFlag, for unique identification of the counterpart fifo at the receiver node. The fifoID is generated by concatenating the sender address and the receiver address of the next hop node. The order of concatenation determines the direction of communication. The rFlag determines whether the next hop is the final one or not.
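How a receiver might filter broadcast packets by their ID can be sketched as follows. Several details are assumptions made only for illustration: addresses are fixed-width strings such as "00" and "01", the flag is true when the next hop is the final destination, and the names `HlaPacket`, `make_fifo_id`, `wrap` and `accepts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HlaPacket:
    header: str
    data: bytes
    fifo_id: str  # sender address followed by next-hop address
    r_flag: bool  # assumed polarity: True if the next hop is the final one


def make_fifo_id(sender, receiver):
    """Concatenate the addresses; the order encodes the direction."""
    return sender + receiver


def wrap(header, data, sender, next_hop, destination):
    """Adaptor side: turn a local fifo packet into the remote object."""
    return HlaPacket(header, data, make_fifo_id(sender, next_hop), next_hop == destination)


def accepts(own_address, neighbours, packet):
    """Receiver side: keep a broadcast packet only if it names an incoming fifo."""
    incoming = {make_fifo_id(neighbour, own_address) for neighbour in neighbours}
    return packet.fifo_id in incoming


packet = wrap("hdr", b"payload", "00", "01", "01")
```

Since "00" + "01" differs from "01" + "00", the sending node itself and all uninvolved nodes discard the broadcast, and only the fifo counterpart at the next hop picks it up.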
ISS Software
Two types of software are executed on the processing nodes, making them either a producer or a consumer. Consumers query the producers one after another by transmitting request-packets. After having forwarded a request-packet, a consumer waits in a loop for an acknowledge-packet from the addressed producer. As soon as the acknowledge-packet is received, the consumer transmits the next request-packet. Conversely, producers permanently wait in a loop for request reception. As soon as a request-packet is received, an appropriate acknowledge-packet is generated and forwarded.
Temporal decoupling affects the behaviour of consumer and producer by increasing the number of loop iterations until synchronization with the kernel and therefore with the appropriate router as well as other processing nodes is done. That way, different algorithm execution times are modelled.
Functional Verification
The global timing behaviour was verified to be correct by comparing the time stamps of transmitted packets with and without distribution. Fig. 6 shows an example of the emerging synchronization pattern when simulating four processing nodes in parallel. The global quantum q sets an upper bound for global synchronization. However, additional shorter synchronization cycles are automatically inserted in case of low latency packet transmission and direct routing, because the delay d_f of a fifo was chosen lower than the global quantum q.
PERFORMANCE ANALYSIS
To evaluate performance we used a network (Ethernet, 100 MBit/s) of up to nine Linux workstations with identical configuration, namely a 2 GHz dual-core CPU and 2 GB RAM, as hardware platform. We compiled the federate library with CERTI [24], [21], [1] since it is free open source software. Unfortunately we were not able to use Portico [3] because of a bug in its time management services that was discovered during implementation. However, we plan to do so as soon as the bug is fixed. In CERTI the RTI is implemented in a centralized manner. Fig. 7 shows the resulting layered implementation when using CERTI. The lower layers consist of two types of processes, local ones called RTI Ambassadors (RTIA) and a central one called RTI Gateway (RTIG). These processes as well as the SystemC federate library are linked with each other using Unix and TCP sockets. The RTIG is of predominant importance since any form of communication between federates, be it for data exchange or for synchronization purposes, passes through the RTIG.
Synchronization Algorithms
In its current implementation CERTI can be compiled with two types of time synchronization algorithms, the so-called Chandy/Misra/Bryant Null Message Algorithm (NMA) [8] or the Null Prime Message Algorithm (NPMA), which was designed by the CERTI developers. The former has the drawback of not avoiding the time creep problem, which occurs in case of low lookahead values [12]. The time creep problem has a strong effect in our case since we are forced to choose the fifo delay d_f as lookahead, which consists of only a few clock cycles. Applying the NMA we generally obtained execution times several orders of magnitude above the sequential case. The NPMA avoids the time creep problem by including the time stamp of the next unprocessed event (so-called conditional information [12]) in computing LBTS values. Because of that, the measurements presented in the following are all based on the NPMA.
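The time creep effect can be reproduced with a deliberately simplified model. It is not CERTI's implementation of either algorithm: all logical processes start at time 0, exchange their bounds in lockstep rounds, messages in transit are ignored, at least two processes and a positive lookahead are assumed, and the function name is hypothetical. Without next-event information each process may only promise its clock plus lookahead; with it, the earliest pending event in the system raises every bound.

```python
def rounds_to_first_event(next_events, lookahead, use_next_event_info):
    """Count synchronization rounds until all clocks reach the earliest event."""
    clocks = [0] * len(next_events)
    target = min(next_events)  # earliest unprocessed event in the whole system
    rounds = 0
    while min(clocks) < target:
        bounds = []
        for clock in clocks:
            bound = clock + lookahead  # unconditional promise of this process
            if use_next_event_info:
                bound = max(bound, target)  # nothing can happen before `target`
            bounds.append(bound)
        clocks = [
            min(min(b for j, b in enumerate(bounds) if j != i), next_events[i])
            for i in range(len(clocks))
        ]
        rounds += 1
    return rounds


creeping = rounds_to_first_event([1000, 1000], 10, False)
direct = rounds_to_first_event([1000, 1000], 10, True)
```

With a lookahead of 10 and the first event at 1000, the purely lookahead-based variant needs 100 rounds, whereas the variant using next-event information needs a single one. The ratio grows with the distance between events relative to the lookahead, which matches the situation of a few clock cycles of fifo delay against a large global quantum.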
Experiment 1: Varying Node Number
To evaluate performance when varying the node number, the reference model was configured with 2, 4 and 9 nodes as shown in table 3. In each row we put one producer. The clock frequency of the MPSoC model was set to 100 MHz. The overall simulation was executed for one second of simulation time. The experiment was carried out for the parallel as well as the sequential case. In the parallel case, each federate process was executed together with its respective RTIA process on a separate CPU core. When a federate process is started, its RTIA process is automatically initialized as a background process with the same CPU affinity as the federate process itself. In experiment one the RTIG process was also executed on a separate core. For the sequential case the whole simulation was performed as a pure SystemC simulation without HLA extension on one single core.
Case   Per row   Per column   Producers   Consumers
1      1         2            1           1
2      2         2            2           2
3      3         3            3           6
Table 3: Model Configuration (Experiment 1)
Fig. 8 shows the achieved speedup of the parallel case relative to the sequential case as a function of the global quantum, which is given in nanoseconds. For all three cases the speedup increases with increasing global quantum, which corresponds to an increasing number of simulated instructions per synchronization step. The speedup approaches a saturation region which is the maximum theoretically achievable speedup. Since we used one core for each node, this value corresponds to the number of cores in each case. In addition, when considering for example nine nodes, a parallel simulation is profitable for the given kind of distribution if the global quantum exceeds about 3.3 × 10^4 nanoseconds. At 100 MHz simulated clock frequency and one instruction per clock cycle this corresponds to 3.3 × 10^3 instructions which each ISS must be allowed to run ahead before synchronizing again with the remaining simulation.
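The conversion from the break-even quantum to a number of instructions is plain arithmetic and can be checked with a short sketch. It assumes a clock frequency that divides one gigahertz evenly; the function name is hypothetical.

```python
def instructions_per_quantum(quantum_ns, clock_hz, instructions_per_cycle=1):
    """Number of instructions an ISS executes within one global quantum."""
    cycle_ns = 1_000_000_000 // clock_hz  # 10 ns at 100 MHz
    return (quantum_ns // cycle_ns) * instructions_per_cycle


# Break-even quantum of about 3.3 x 10^4 ns for nine nodes at 100 MHz.
break_even = instructions_per_quantum(33_000, 100_000_000)
```

The result is 3300 instructions, i.e. each ISS has to run ahead by a few thousand instructions per synchronization before the distributed run pays off in this configuration.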
Experiment 2: Varying Distribution
Finally, we evaluated the impact of different kinds of process distributions. For that reason, we used an MPSoC comprising only two nodes, one executing the consumer application and the other executing the producer application. The clock frequency of the MPSoC model as well as the maximum simulation time were the same as in experiment one. As shown in table 4, we distributed the federates F1 and F2 (including their respective RTIA processes) and the RTIG on the four cores of two workstations (WS1 and WS2) in different ways. For each of the three distributions we again measured the execution time needed to carry out the simulation for different global quanta. The results of the measurement are shown in fig. 9. For all three cases the speedup again increases with increasing global quantum until it reaches the maximum achievable speedup, which is two. As can be seen, the centralized implementation of CERTI has great influence on simulation performance. The smallest global quantum from which parallel simulation is profitable depends on the kind of distribution. The earliest profitable variant is distribution three, since the speedup is greater than one already for a global quantum of about 2.6 × 10^4 ns. Although F1 and the RTIG are executed on the same core in case of local execution on a single workstation, local execution is faster than any distribution on two workstations, which adds the network latency of the Ethernet connection. For that reason, distribution one is the slowest variant since both federates have to communicate with the RTIG via Ethernet.
CONCLUSION AND OUTLOOK
We presented a simulation environment for parallel and distributed SystemC simulation. The approach is the first one combining SystemC with the HLA. Using the HLA a modular structure is obtained which is characterized by its scalability and its extensibility to different simulators. We verified the correct functionality and evaluated the performance of the simulation environment by means of a scalable loosely-timed transaction level model of an MPSoC. We showed that for a loosely-timed transaction level model we are able to achieve a maximum speedup that is almost proportional to the degree of parallelization if the ratio of computation to synchronization reaches a certain level. We also showed that the underlying RTI implementation has great influence on execution performance.
Further research includes the evaluation of other possibilities for model partitioning as well as the impact on performance when describing selected nodes internally on lower abstraction levels like register transfer level. Moreover, the environment shall be extended regarding co-simulation capabilities with other domains like network simulators [2] which forms the basis for building up a simulation framework for SoC based wireless sensor nodes.
