Abstract-This paper presents synthesis of Hardware Dependent Software (HdS) for multicore and many-core designs using Embedded System Environment (ESE). ESE is a tool set, developed at UC Irvine, for transaction level design of multicore embedded systems. HdS synthesis is a key component of the ESE back-end design flow. We follow a design process that starts with an application model consisting of C processes communicating via abstract message passing channels. The application model is mapped to a platform net-list of SW and HW cores, buses and buffers. A high speed transaction level model (TLM) is generated to validate abstract communication between processes mapped to different cores. The TLM is further refined into a Pin-Cycle Accurate Model (PCAM) for board implementation. The PCAM includes C code for all the HdS layers including routing, packeting, synchronization and bus transfer. The generated HdS methods provide a library of application level services to the C processes on individual SW cores. Therefore, the application developer does not need to write low level HdS for board implementation. Synthesis results for a multi-core MP3 decoder design, using ESE, show that the HdS is generated on the order of seconds, compared to hours of manual coding. The quality of synthesized code is comparable to manually written code in terms of performance and code size.
I. INTRODUCTION
Multi-core and many-core embedded systems are being increasingly used to meet the complexity and performance requirements of modern applications. Embedded application developers for multi-core systems need a library of communication services to validate and debug their distributed multiprocess code. On the other hand, system designers need to provide board prototypes and system SW for application development.
Transaction level modeling (TLM) is widely seen as an enabler for early application development before the system prototype is ready. This is because TLMs execute much faster than traditional pin-cycle accurate models (PCAMs). However, with higher abstraction in TLMs, there are fewer design details to allow realistic estimation of design metrics. PCAMs provide much more accurate estimation of performance, design cost and power consumption. They are also necessary for prototyping systems with existing EDA tools and methodologies. However, PCAMs require an implementation of core, platform and application-specific system SW services on top of the SW core's instruction set. Some of these services are available directly in an RTOS for single processor systems with standard peripherals. Others, such as Hardware-dependent Software (HdS), must be developed specifically for the given cores, platform, application and mapping. In a complex multi-core or many-core system, manual HdS development may become very time consuming. This is not only due to code size, but also due to the complex interaction of processes in concurrent applications mapped to multi-core platforms.
Integrated design environments, such as ESE [3], are needed to transform application level models into platform specific TLMs for validation and PCAMs for implementation. In this paper we discuss the model based design methodology of ESE, with a focus on HdS synthesis. Our methodology and synthesis technique allow automatic transformation of application level models with abstract message passing communication into PCAMs with an HdS stack of communication services. The automation not only cuts design time, but also results in modular HdS code that is consistent with the application level communication requirements.
II. RELATED WORK
There has been significant research in model based design for embedded systems in recent years. Standardization approaches such as AUTOSAR [2] and OSEK [4] provide common APIs and middleware for automotive SW development. On the other hand, system level design languages such as SystemC [5] and SpecC [9] allow multi-core system modeling with simulation speeds suitable for SW development. Such efforts have provided the groundwork for developing and deploying model automation tools such as the one presented in this paper.
There has also been much work in embedded system modeling frameworks and SW code generation from specific input languages. POLIS [7] (Co-Design Finite State Machine), DESCARTES [19] (ADF and an extended SDF), Cortadella [8] (Petri nets) and SCE [10] (SpecC) provide some automation for SW generation from certain languages and models of computation. Our approach, in ESE, provides a C based input with multi-core support and has been demonstrated with actual board implementation.
Modular communication modeling has been proposed for application domains such as real-time systems and platforms such as heterogeneous multi-core systems. Kopetz [13] proposes a component model for dependable automotive systems. Sangiovanni-Vincentelli [21] has proposed a three phase simulation model for platform based design. These approaches tackle security, dependability and heterogeneity at the system level, but require underlying SW services and tools to implement the models. Communication optimization techniques [18, 20, 17], on the other hand, have dealt primarily with platform and application transformations using simulation models. In contrast, HdS synthesis in ESE focuses on code generation for accurate optimization feedback and is fast and flexible enough to incorporate application and platform modifications on the fly.
HdS [15] itself has been a topic of active research lately and our work contributes to it. Commercial vendors provide a board support package (BSP) [6, 1] with their board IDEs, but such software is customized for the limited set of IP cores available or synthesizable on the board. Most academic approaches so far have dealt with porting of simulation models onto an RTOS, discounting external communication. Herrera [12] proposes overloading SystemC library elements to reuse the same model for specification and target execution, but partly replicates the simulation engine on the host and thereby imposes strict input requirements. Krause [14] proposes generation of source code from SystemC mapped onto an RTOS, while Gauthier's method [11] provides generation of an application-specific RTOS and the corresponding application SW. Neither technique can be extended to multi-core platforms with inter-core communication synthesis. Yu [23] shows generation of application C code from concurrent SpecC, which requires the initial system modeling to be done in SpecC. The Phantom Serializing Compiler [16] translates multi-tasking POSIX C code input into sequential C code by custom scheduling, but is a purely SW core-specific optimization. Schirner [22] also proposes hardware dependent software synthesis from SpecC models but only considers platforms with a single core connected to several peripherals. In contrast to all the above techniques, ESE provides automatic HdS synthesis for multi-core and many-core systems, starting from an abstract C based application model.
III. MODEL BASED DESIGN WITH ESE
Our model based design methodology is shown in Figure 1. We start with an application model that consists of C processes communicating via synchronized point-to-point handshake channels and shared variables. The platform definition is a graphical net list of processing elements (PEs), buses and interface cores called transducers.
Processes and variables in the application model are mapped to the PEs in the platform. Channels are mapped to routes in the platform. If the route includes a transducer, then the communicated data may need to be broken up into smaller packets according to the buffer size limitations. The above design decisions and data models of PEs, buses and RTOSes are used by the ESE Front-End to generate a TLM. The TLM models the PEs as SystemC modules connected to the communication architecture model consisting of bus channels and transducer modules. The original application processes are encapsulated as SystemC threads instantiated inside the PE modules. The point-to-point channel accesses of the application model are mapped into equivalent packet transactions routed over the communication model.
The step of refining the TLM into a PCAM is performed by the ESE Back-End. The component data models in TLM are replaced with respective implementation libraries in the PCAM. Synchronization is modeled in the TLM via abstract SystemC flags and events. The flag and event accesses must be transformed into interrupts or polling in the PCAM. Similarly, the packet transactions over the bus channels in the TLM must be transformed into equivalent arbitration and data transfer cycles on the system buses. The transformations applied to the model result in various C functions per SW core. These functions form the HdS library for that core. If there are HW IPs in the platform, they will require RTL interface blocks for the same functions, with platform specific timing constraints. In this section, we will discuss the above models in greater detail to provide an idea of the input and output of the HdS synthesis procedure.
A. Hardware Platform Template
In order to automate HdS synthesis, we first need to define the platform components and connections. The platform is composed of processing elements (PEs), memories, buses and transducers. PEs are our generic term for HW and SW cores on which application processes are mapped. Memories are storage cores that do not have any active thread of computation. Shared variables in the application are mapped to memories. Buses are generic communication units that can act as point-to-point links or shared buses with arbitration. Buses have well defined protocols and may connect to compatible ports on a given core.
Transducers are generic interface cores that provide functionality of (1) protocol conversion and (2) store-and-forward static routing. Transducers consist of internal buffers and may connect to incompatible buses via different ports. For each bus connection, they have an IO interface and a Request Buffer. This request buffer stores all send/receive requests made to the transducer for storing and forwarding data on a channel. Thus, they allow sending data from one PE to another if the two PEs are not connected to a common bus. A route in the platform is a sequence of buses and transducers with the following regular expression:
Channels in the application are mapped to routes in the platform. As a result, each transducer in the platform may have several channels routed through it. For each such channel, the transducer defines (1) a unique buffer partition to be used by data on that channel, (2) a unique bus address for a send request, and (3) a unique bus address for a receive request. Since transactions on a channel are sequential, the partitioning of transducer buffers guarantees safety and liveness of implementation, provided the application model is safe and live.
B. Application Model
Communication in the application model is enabled with calls to (a) send/recv methods for direct process communication, and (b) read/write methods for accessing variables shared between processes. The send/recv methods are encapsulated in process-to-process channels with no message buffering. Instead, process-to-process channels follow handshake synchronization semantics, where the receiver process blocks until the sender has sent the communicated data. All communication in the MP3 Decoder is modeled using process-to-process channels Ch1 through Ch9.
On the other hand, the communication with read/write methods is non-blocking. The shared variables are in the global scope and are accessed with unsynchronized access channels. The two communication mechanisms are sufficient to model more complex communication services such as FIFOs, mutexes, mailboxes or events. Therefore, the synthesis of the basic communication models of handshake channels and shared variable access channels is necessary and sufficient for implementing any inter-process communication service at this level of abstraction.
The set of processes, variables and channels is built on top of the SystemC simulation kernel, as shown in Figure 2. The processes execute as concurrent threads on the simulation kernel. The process-to-process channels use the notify-wait semantics of the kernel events to implement handshake synchronization. The shared variables are modeled as passive SystemC modules that export read and write interfaces, which are used to connect them to the access channels. Interfaces are also defined for processes to allow connection to channels. A well defined interface template provides a communication API with the following functions, where < i > is the name of the interface used:
• < i > Send(void *data, int size) Synchronized send
• < i > Write(void *data, int size) Non-blocking write
• < i > Read(void *data, int size) Non-blocking read
By separating the communication interface from the rest of the computation code, we are able to successively refine only the interface implementation code. The API provided to the application developer stays the same after HdS synthesis. In other words, HdS synthesis is the implementation of application channel methods, specific to the given core, platform, application and mapping.
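The separation between application code and interface implementation can be illustrated with a minimal C sketch. The interface name `out`, the concatenated function names (`outSend`, `outWrite`, `outRead`), the process function `producer_step` and the in-memory stand-in implementation are all assumptions made for illustration; they are not taken from ESE.

```c
#include <assert.h>
#include <string.h>

/* Prototypes for a hypothetical interface named "out". The names follow the
   < i > Send / Write / Read template; the exact spelling is an assumption. */
void outSend(void *data, int size);
void outWrite(void *data, int size);
void outRead(void *data, int size);

/* Application process code: it depends only on the prototypes above, so it
   does not change when the implementation behind them is replaced. */
void producer_step(int sample)
{
    outSend(&sample, (int)sizeof sample);
}

/* Stand-in implementation (assumption): plain memory copies instead of a
   simulation kernel or a bus. A synthesized HdS would replace only this part. */
enum { OUT_CAPACITY = 16 };
unsigned char out_last_msg[OUT_CAPACITY]; /* last message passed to outSend */
int out_last_size;                        /* its size in bytes */
unsigned char out_shared[OUT_CAPACITY];   /* models one shared variable */

void outSend(void *data, int size)
{
    assert(data != NULL && size >= 0 && size <= OUT_CAPACITY);
    memcpy(out_last_msg, data, (size_t)size);
    out_last_size = size;
}

void outWrite(void *data, int size)
{
    assert(data != NULL && size >= 0 && size <= OUT_CAPACITY);
    memcpy(out_shared, data, (size_t)size);
}

void outRead(void *data, int size)
{
    assert(data != NULL && size >= 0 && size <= OUT_CAPACITY);
    memcpy(data, out_shared, (size_t)size);
}
```

In this sketch `producer_step` would compile unchanged against any other implementation of the three prototypes, which is the property the refinement flow relies on.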
C. Transaction Level Model
The TLM is derived by mapping the application model in Section B to an embedded platform. The platform components are modeled with a well defined SystemC code template. PEs are modeled as SystemC modules that instantiate application processes. The system buses are modeled with a universal bus channel (UBC) that provides methods for synchronized send/receive, non-blocking read/write and memory service. Memories are modeled as SystemC modules with a local array. Transducers are modeled as SystemC modules with a local buffer and controller threads for each bus interface. Figure 3 shows the TLM of the MP3 Decoder. The HdS model is highlighted inside the CPU core model. Processes Left and Right DCT are mapped to the HW units (IP 1 and IP 2), while all other processes reside in a SW core (CPU) model. The route between the core and the HW units includes two UBCs and a Transducer. Access to the HW units from the SW core is modeled with a Channel API that encapsulates routing and packeting methods. These methods in turn are implemented with the UBC functions. Routing includes programming the Transducer with the encoded route using the UBC write method. Packeting divides the message into data packets of the selected size. Since multiple processes are mapped to the SW core, a dynamic scheduler model that exports a threading API simulates processor multitasking.
Channels between processes in the SW core are implemented with an inter-process communication (IPC) model. The IPC and scheduler models are only core dependent and can be included into the TLM from a library. However, the HdS code is application, platform and core specific. Therefore, it has to be generated for every design change that impacts communication parameters in the application, platform or mapping.
D. Pin-Cycle Accurate Model
The TLM is refined into a PCAM that is used for board implementation. Board design tools such as those from Xilinx and Altera can be used to convert PCAMs into bitstreams for configuring the FPGA to obtain a prototype. Board debugging tools can then be used to run and debug the prototype in real time. Figure 4 shows the PCAM of the MP3 Decoder. The platform consisting of one SW core and two IP units connected with two buses and a transducer is now modeled in synthesizable RTL. The six MP3 Decoder processes mapped to a SW core are cross-compiled with the processor's C compiler (e.g. Xilinx compiler for Microblaze core) and linked with the generated HdS and other system SW libraries for download. The processes mapped to hardware can be either synthesized using C-to-RTL tools or replaced with the respective RTL IP. The system SW stack includes the threading and IPC libraries of the RTOS, and the HdS library generated by our synthesis tool. The RTOS itself may consist of several other services such as file handling, memory management, standard C library, networking and so on.
The HdS library, generated by ESE, consists of four layers as shown in Figure 4. The lowest layer consists of a set of interrupt handlers (IHs) and memory access functions. Each application level handshake channel requires synchronization that may be implemented with interrupts or polling. For interrupt based synchronization, an IH is implemented per handshake channel. For a polling implementation, a memory mapped flag is implemented in the slave device and is periodically checked by the master SW core. The memory access functions also provide basic IO to the peripherals. The synchronization and data transfer layer consists of C methods that use the IHs and memory access methods to manage packet level synchronization and bus word transfers. The higher level layers for routing and packeting and the channel API are imported directly from the TLM. In summary, the communication in the PCAM is implemented with core specific C methods as opposed to SystemC kernel methods in the TLM.
IV. HDS SYNTHESIS
In this section we describe automatic HdS synthesis and code generation from a set of design parameters. The design parameters are determined from the application and platform decisions as well as core properties and are treated as constants for HdS code generation. Two layers of communication functions are generated, namely for routing/packeting and synchronization/transfer. These functions are specific to the interface of the application process. An example shows the typical code synthesized for a Send interface.
A. Communication Design Parameters
In order to automate HdS code generation, we define a set of communication specific system parameters. Based on our platform template, explained in Section A, we define a Global Static Routing Table (GSRT). The GSRT stores the mapping of each application level channel to a platform route. For each channel Ch, routed through a transducer Tx, we define BufferSize(Tx, Ch) to be the buffer partition size in bytes for Ch on Tx. We also define the transducer send and receive request buffer addresses per channel as SendRB(Tx, Ch) and RecvRB(Tx, Ch), respectively. The above parameters are required to generate routing and packeting layers for the SW core.
For each channel Ch, routed over a bus B, we define SyncType(B, Ch) to be the synchronization method to be used for Ch for the route segment at B. The two possible synchronization methods are Interrupt and Polling. For direct memory accesses that do not require routing through a transducer, synchronization is not required. A synchronization flag table is maintained for each core. Each channel Ch gets a unique entry SyncFlag Ch in this table. For interrupt based synchronization, we also define a binding from the interrupt source to the flag and the handler instance. For polling, the flag is bound to an address in the slave PE. Finally, for the data transfer implementation, we define the bus word size and the low to high address range for each channel Ch on bus B as AR(B, Ch). For each SW core we also define WordSize as the number of bytes per word.
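One way to picture these parameters is as a constant per-channel record that a code generator consults at synthesis time. The sketch below is an illustration only: the record layout, the field names, the lookup helper `find_channel` and every numeric value (addresses, sizes, channel numbers) are hypothetical and do not describe ESE's internal data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Synchronization choice for a channel on a bus: SyncType(B, Ch).
   SYNC_NONE covers direct memory accesses that need no synchronization. */
typedef enum { SYNC_NONE, SYNC_INTERRUPT, SYNC_POLLING } SyncType;

/* Hypothetical record holding the design parameters of one channel Ch that
   is routed over one bus B and one transducer Tx. */
typedef struct {
    int      ch;          /* channel identifier                     */
    uint32_t buffer_size; /* BufferSize(Tx, Ch), in bytes           */
    uint32_t send_rb;     /* SendRB(Tx, Ch), request buffer address */
    uint32_t recv_rb;     /* RecvRB(Tx, Ch), request buffer address */
    SyncType sync_type;   /* SyncType(B, Ch)                        */
    uint32_t ar_low;      /* low end of AR(B, Ch)                   */
    uint32_t ar_high;     /* high end of AR(B, Ch)                  */
} ChannelParams;

enum { WORD_SIZE = 4 };   /* WordSize in bytes; value is an assumption */

/* Illustrative values only; none of them come from the MP3 decoder design. */
static const ChannelParams channel_params[] = {
    { 1, 256u, 0x40000000u, 0x40000004u, SYNC_INTERRUPT, 0x40001000u, 0x400010FFu },
    { 2, 128u, 0x40000008u, 0x4000000Cu, SYNC_POLLING,   0x40001100u, 0x4000117Fu },
};

/* Look up the record of a channel; returns NULL for an unknown channel. */
const ChannelParams *find_channel(int ch)
{
    size_t n = sizeof channel_params / sizeof channel_params[0];
    for (size_t k = 0; k < n; ++k) {
        if (channel_params[k].ch == ch) {
            return &channel_params[k];
        }
    }
    return NULL;
}
```

Because the parameters are treated as constants for code generation, a table like this would live in the synthesis tool rather than in the generated library.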
B. Routing and Packeting
The communication functions are synthesized for each interface i that is bound to a channel Ch. Since we allow only static routing, a route object Rt is stored in the GSRT corresponding to each channel. Note that the GSRT does not need to be part of the communication library, since the routing per channel is static. The route for Ch determines the channel packet size as follows:
Hence, packet size is the largest data size that can fit into any transducer buffer allocation for Ch. Again, note that PktSz is a constant per channel, due to static routing.
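Read together with the definition of BufferSize(Tx, Ch) in Section A of this part, the description above corresponds to the following expression. It is a reconstruction from the surrounding text rather than a formula quoted from the source, and it assumes that the route Rt of Ch contains at least one transducer:

```latex
\[
  \mathit{PktSz}(Ch) \;=\; \min_{Tx \,\in\, Rt} \mathit{BufferSize}(Tx, Ch)
\]
```

For instance, if Ch crossed two transducers with hypothetical buffer partitions of 512 and 128 bytes, every packet on Ch would be limited to 128 bytes.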
The code generated for the interface communication method is a do-while loop, with a temporary variable to keep track of already sent/received data. A lower level method i SyncTr is called by the routing/packeting layer to synchronize with the corresponding process and send or receive each packet.
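The loop structure described above might look as follows for a send method. This is a minimal sketch under stated assumptions: the names `iSend` and `iSyncTrSend`, the packet size `PKT_SZ` of 8 bytes, and the stand-in lower layer that merely records packet sizes are all hypothetical, and the guard for empty messages is an addition of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define PKT_SZ 8  /* PktSz for the channel; the value is an assumption */

/* Stand-in for the synchronization/transfer layer (assumption): it records
   the size of each packet instead of driving a bus. */
int pkt_count;
int pkt_sizes[16];

void iSyncTrSend(const unsigned char *pkt, int size)
{
    assert(pkt != NULL && size > 0 && size <= PKT_SZ);
    if (pkt_count < 16) {
        pkt_sizes[pkt_count] = size;
    }
    pkt_count++;
}

/* Routing/packeting layer: split the message into packets of at most
   PKT_SZ bytes, tracking the amount already sent in a temporary variable. */
void iSend(void *data, int size)
{
    const unsigned char *bytes = data;
    int sent = 0;                 /* bytes handed to the lower layer so far */

    if (size <= 0) {
        return;                   /* guard added here: nothing to send */
    }
    do {
        int chunk = size - sent;
        if (chunk > PKT_SZ) {
            chunk = PKT_SZ;
        }
        iSyncTrSend(bytes + sent, chunk);
        sent += chunk;
    } while (sent < size);
}
```

With these assumed values, a 20-byte message is passed to the lower layer as three packets of 8, 8 and 4 bytes.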
C. Synchronization and Transfer
The routing of channel Ch determines the synchronization code generated inside the i SyncTr method. Given the route object Rt, as obtained from the GSRT, we determine the first bus B in Rt. We also determine if Rt contains any transducers. If so, we assign Tx to be the first transducer in Rt. The first step of packet synchronization is to make a transducer request for the transaction. This is done by generating code to write the packet size (in bytes) into the request buffer at the address given by the parameter SendRB(Tx, Ch) or RecvRB(Tx, Ch), depending on the transaction type. Once the request is written, the transducer initiates lower level synchronization via interrupt or polling, just like any other slave core.
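The request step above, together with the flag wait, flag reset and data transfer that complete a packet transaction, can be sketched in C for the send case with interrupt synchronization. Everything concrete here is an assumption: the bus is simulated by a byte array, `SEND_RB` and `AR_LOW` are offsets into that array, the request is written through `WrMem`, and the wait loop calls the handler directly where real hardware would idle until the interrupt fires.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORD_SIZE 4u    /* WordSize in bytes (assumed)                     */
#define SEND_RB   0x00u /* SendRB(Tx, Ch) as an offset in the simulated bus */
#define AR_LOW    0x10u /* low end of AR(B, Ch) in the simulated bus        */
#define BUS_BYTES 64u

uint8_t bus_mem[BUS_BYTES];  /* simulated address space of bus B   */
volatile int SyncFlag_Ch;    /* synchronization flag of channel Ch */

/* Interrupt handler bound to the channel: it only sets the flag. */
void IH_Ch(void)
{
    SyncFlag_Ch = 1;
}

/* Stand-in for the core-specific WrMem: copies the given number of bytes
   in transactions of at most WORD_SIZE bytes. */
void WrMem(uint32_t addr, const void *src, uint32_t bytes)
{
    const uint8_t *s = src;
    assert(addr + bytes <= BUS_BYTES);
    for (uint32_t off = 0; off < bytes; off += WORD_SIZE) {
        uint32_t n = (bytes - off < WORD_SIZE) ? (bytes - off) : WORD_SIZE;
        memcpy(&bus_mem[addr + off], s + off, n);
    }
}

/* Synchronize with the transducer and transfer one packet. */
void iSyncTrSend(const void *pkt, uint32_t size)
{
    /* 1. Transducer request: write the packet size to the request buffer. */
    WrMem(SEND_RB, &size, (uint32_t)sizeof size);

    /* 2. Wait for the flag. Real hardware would enter a power save mode
          here; the simulation invokes the handler itself. */
    while (!SyncFlag_Ch) {
        IH_Ch();
    }

    /* 3. Reset the flag for the next transaction. */
    SyncFlag_Ch = 0;

    /* 4. Transfer the packet, starting at the low end of the address range. */
    WrMem(AR_LOW, pkt, size);
}
```

A polling variant would differ only in step 2, where the flag would be read repeatedly from an address in the slave core instead of being set by a handler.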
Lower level synchronization is implemented by generating code that uses the flag SyncFlag Ch in the i SyncTr method. In case of interrupt synchronization, the flag is set by the associated interrupt handler. If the flag is not yet set, the processor is suspended into a power save mode and re-awoken by the next interrupt. In case of polling synchronization, the flag is periodically read directly from the corresponding slave core. The flag waiting code is followed by resetting the synchronization flag. Finally, data transfer is performed by generating a call to the core-specific WrMem or RdMem functions. These functions write or read the given number of bytes using bus transactions of size WordSize. The starting address of the transfer is obtained from the address range AR(B, Ch).
Figure 5 shows an example of the embedded SW code generated for the send method of interface i. The sender process is mapped to a SW core, and its interface i is connected to bus B. Interface i is bound to channel Ch that is routed over B and transducer Tx and onto the destination core. The interrupt signal (Interrupt) from the transducer to the SW core is used for synchronization, and is bound to handler IH Ch and flag SyncFlag Ch.
V. EXPERIMENTAL RESULTS
Figure 6 shows a multi-core design with an MP3 decoder application mapped to a platform consisting of one SW core (Microblaze) and four HW cores (Left/Right DCT and IMDCT) used as accelerators. The HW cores use a DoubleHandshake (DH) Bus interface, while the SW core is connected to the Open Peripheral Bus (OPB). Since the two bus protocols are incompatible, a transducer is used to interface between the cores. We created four mappings of the application, which we refer to as SW+1DCT, SW+2DCT, SW+2IMDCT and SW+2DCT+2IMDCT, with parts of the application mapped to the hardware accelerators, as indicated by the mapping name.
As the DCT and IMDCT processes are moved from the SW core to the HW cores, the inter-core bidirectional channels are routed over the OPB, the DH buses and transducer Tx. The HdS on Microblaze for the PCAMs of the different designs is generated using ESE. Xilinx EDK [6] is used to convert the generated PCAMs into bitstreams for implementation on the FF896 Virtex-II device. The decoding performance for all the synthesized designs is measured with an OPB timer on the board, using a common MP3 input file.
Table I shows a comparison between manually implemented and automatically synthesized PCAMs using the quality metrics of HdS code size and communication delay. It can be seen that the synthesized SW binary is only marginally larger than the manual implementation (by 1-4%). However, the performance of the HdS synthesized by ESE, as measured by the on-chip timer, is 6-9% better than the manual implementation. The difference arose because the manual implementation shared the synchronization function between different application channels, while the synthesized code had a unique synchronization function for each channel. Therefore, the manual code had fewer total instructions, but incurred more instruction fetches for each communication call at run-time.
Table II shows a comparison of lines of code between manual and synthesized HdS. Due to the difference in synchronization implementation mentioned above, the synthesized source code is marginally larger than the manual code. The development time includes the 5 hours that it took to define the application level channels and the design parameters. It took 2-4 hours to implement and test the manual communication code. In contrast, with the given parameters, ESE synthesized the HdS code in a fraction of a second. This resulted in an overall development time savings of 33% on average. These results show that with automatic HdS synthesis in ESE, designer productivity can improve significantly, without loss in design quality.
VI. CONCLUSIONS
We presented a model based technique and methodology for HdS synthesis for heterogeneous multi-core systems. The novelty of our work lies in defining embedded system models at different abstraction levels with clear synthesis semantics. Application level models were defined as a set of processes communicating via message passing channels and shared variables. A well defined, yet highly flexible, platform template and associated design parameters were presented. We also presented a synthesis procedure to generate core, application and platform specific HdS for the design. Synthesis results for an MP3 decoder example demonstrated the applicability of our technique for large industrial size embedded systems. Our automatic HdS synthesis reduces overall design time, while consistently providing better performance and a negligible increase in code size over manual implementation. For future work, we are investigating HdS synthesis from dependability and security oriented application models. We are also working on extending our model based design framework with application and platform templates for real-time architectures such as time-triggered networks.
