Recent advancements in mixed hardware-software modeling platforms have simplified the process of concurrently modeling complex algorithms in hardware while considering the software generation in an integrated environment. To efficiently facilitate design space exploration for RISC extensible processors, it is imperative to use a tightly coupled hardware-software co-design and modeling effort that also maintains the flexibility for exploring other types of microarchitectures. We have introduced a simulation framework using a software-oriented design methodology that can be adopted to model the software and hardware components in a reconfigurable co-processor. Experiments on some commonly used DSP and image processing tasks show that hardware-software trade-offs in the various models can be efficiently analyzed during the initial design phase.
INTRODUCTION
Hybrid embedded systems that couple a co-processor with an embedded CPU are widely used, especially in the multimedia application domain. Recently, reconfigurable co-processors have emerged [1], [2], [3] that can reconfigure their resources (i.e. functional units and interconnection patterns) through software control to support different tasks. The software control requires memory buffers to store the current context of the hardware resources while allowing new configuration data to be loaded. These platforms have simple reconfiguration methods, but their flexibility is restricted by the degree of programmability of the software control, and dynamic reconfiguration is exploited at a coarser granularity than the underlying device technology.
Other types of hybrid systems, when implemented on reconfigurable hardware (e.g. FPGA), can provide a significant speed-up [4], [5] over a general-purpose CPU and are more flexible than their ASIC counterparts. The implementation in [5] uses a soft-core processor and the partial reconfiguration capability of the FPGA device to achieve dynamic reconfiguration. The co-processor forms a dynamic core that can be 'swapped' with other types of cores within a designated area of the device. Since only a partial area needs to be reconfigured, the overhead is less than that of changing the entire device configuration. The use of soft-core processors provides the flexibility to explore different types of coupling interface between the CPU and the co-processor. MicroBlaze [6] and Nios-II [7] are examples of commercial soft-core processors that provide an efficient method for implementing hybrid systems where the interface between the CPU and the co-processor is well defined. The co-processor provides an extension where computationally intensive tasks can be accelerated in hardware. Depending on the type of interface, there is a potential risk of slowing down the internal CPU datapath, especially if custom instructions are used to control the co-processor.
Current modeling frameworks use different source types to describe the hardware and software models for system-level co-simulation. Moreover, the models are described implicitly and do not represent the actual hardware implementation. The framework in [8] introduced a two-step approach whereby SystemC [9] is used at the beginning of the system-level specification stage. The resulting C++ models are only in abstract form (i.e. abstract data types and the interfaces between modules) and cannot be used directly for hardware synthesis. A manual translation to VHDL is required to provide the functional and timing details for system validation using an external logic simulator. This framework also includes a software-based CPU simulator that emulates an instruction set architecture, but it can only provide an instruction-accurate interpretation of the software component.
Similar works use the SimpleScalar [10] CPU simulator to model the hardware resources in hybrid architectures. In [11], SimpleScalar is extended to model the reconfigurable functional units in a behavioural manner. The models are not pin-accurate and a considerable amount of effort is required to achieve system verification. A two-source framework is presented in [12]. The C-based software processes that control the co-processor in the CPU simulator are integrated into a logic simulator to achieve a cycle-accurate co-simulation. However, the interface between the two simulators is of an abstract type that uses messages to communicate. Moreover, synchronization between the simulators is required in order to give an accurate estimate of the software performance.
Commercial design tools like Impulse CoDeveloper [13] and Celoxica DK Design Suite [14] also provide a convenient platform for transforming a requirement described in a C-like language into a hardware implementation. However, these tools are less flexible for facilitating design space exploration, as the compilers target specific processor architectures and mainly exploit the fine granularity of the underlying device technology.
This paper presents the design of a simulation framework that aims to enhance system-level modeling by using a common platform driven from a software-oriented methodology. Our approach focuses on using SystemC to describe explicitly the hardware and software components that are used for system-level simulation. The SystemC simulation kernel provides a hardware clocked modeling platform that will give a cycle-accurate co-simulation of our models. Our SystemC models are also pin-accurate conforming to vendor specifications (i.e. functional and timing details) and this will provide a close representation of the actual hardware circuits. The same SystemC models can also be readily translated using commercial back-end FPGA tools for synthesis and implementation on the target device. This approach is more economical and shows that early system verification can be achieved during high-level modeling using a single source throughout the co-simulation. To showcase our framework, we will model a cycle-accurate MicroBlaze co-processor using a loosely-coupled interface and a Nios-II CPU to represent a tightly-coupled instruction set extension (ISE) model. With this integrated environment, the constraints imposed on the overall system performance by the hardware specific co-processor and the microprocessor models can be analyzed by a system designer to determine trade-off decisions. The framework can also be extended to explore other types of hardware and software models.
The next section of this paper describes the components of the target architectural models. Section 3 introduces our co-simulation and compilation framework for the hardware and software generation of our extension models. In Section 4, we describe the experiments that are conducted using our simulation framework and present the results in Section 5. Finally, we conclude our work in Section 6.
ARCHITECTURAL MODEL
Our loosely coupled extension uses a SystemC RISC CPU model that is similar to the MicroBlaze processor [6], which has a 3-stage (fetch, decode and execute) pipeline architecture and excludes branch prediction. Data is transferred to the datapath functional units through an internal 32-bit register file in the CPU, whose contents can be fetched from the data cache or the main memory. The reconfigurable extension is coupled to the CPU in the same way that a MicroBlaze core is connected to custom IP through the Fast Simplex Link (FSL) interface [15]. A total of 16 FSL channels are available: up to 8 channels for writing from the CPU to the extension and up to 8 channels for reading from the extension to the CPU. The master FSL interface is integrated into the decode stage of the CPU through a 32-bit register file. This interface ensures that a complex result with a long execution time does not stall the internal datapath unnecessarily while waiting for the execution to complete. One dedicated master FSL channel carries the control signals to the extension control unit, which manages the internal operations of the computational core. The switch network supplies the operands and reconfiguration data to the extension and can be reprogrammed so that the FSL channels can be re-assigned dynamically. This programmable switch network allows other types of hardware cores to be analyzed in the same design space without having to re-define the interfaces.
Our tightly coupled extension model shown in Fig. 1b is based on the Nios-II custom instruction logic interface [16]. We have also created a SystemC RISC CPU model based on a 5-stage (fetch, decode, execute, memory and writeback) pipeline architecture with no branch prediction. This is similar to the Nios-II/s (standard) core [7]. The ALU and custom logic operate independently of each other and can be managed by the same decoder unit in the CPU datapath. Multiple custom functions can co-exist in the hardware logic but need to share the two input registers and the one output register. Our extension only supports the execution of one custom function at a time. If a multiple-input, multiple-output extension is available, we can easily extend our framework to include control logic that supports simultaneous execution of multiple custom functions. Comparing the two models, we can see that the MicroBlaze extension requires more hardware control logic while the Nios-II extension has simpler control over the custom logic. The rest of this section describes in detail the two methods of interfacing the CPU, and the resource models of the reconfigurable and custom logic respectively.
SystemC RISC CPU extension
Fig. 2 shows the block diagram of the FSL and the custom instruction interface. The FSL is a 32-bit wide FIFO-based communication interface. A set of pre-defined functions is used to control the sequences that transfer the contents from the CPU registers to the FSL bus and vice versa. When the operation code (e.g. write to FSL channels) is received from the decoder unit and the appropriate data is available, the FSL interface asserts the relevant 'write' and 'control' signals to reliably transfer the contents to the slave. The reverse also holds: the FSL interface must wait for the 'exists' signal (activated by the slave) to indicate that the result from the slave is valid. It is imperative that the core in the reconfigurable logic conforms to the FSL slave definitions in order to communicate with the CPU. Our custom instruction interface uses a multi-cycle model instead of a combinatorial one. This represents the worst case, where custom functions cannot complete in one clock cycle. For the multi-cycle type, the 'start' and 'done' signals must be asserted by the custom instruction interface and the custom functions respectively.
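The FIFO behaviour of one FSL channel described above can be illustrated with a minimal behavioural sketch in Python. This is not our SystemC model: the class name FslChannel, the default depth of 16 and the method names are assumptions made for illustration only. A write succeeds only when the FIFO is not full, and a read succeeds only when the 'exists' condition holds.

```python
from collections import deque


class FslChannel:
    """Behavioural sketch of one unidirectional 32-bit FIFO link.

    Hypothetical model: names and the default depth are assumptions.
    """

    def __init__(self, depth=16):
        self.depth = depth
        self.fifo = deque()

    @property
    def full(self):
        # The master must not write while the FIFO is full.
        return len(self.fifo) >= self.depth

    @property
    def exists(self):
        # High when at least one word is waiting for the reader.
        return len(self.fifo) > 0

    def write(self, data, control=False):
        """Push one 32-bit word plus its control bit; False means stall."""
        if self.full:
            return False
        self.fifo.append((data & 0xFFFFFFFF, bool(control)))
        return True

    def read(self):
        """Pop the oldest word, or return None when nothing exists yet."""
        if not self.exists:
            return None
        return self.fifo.popleft()
```

In this sketch a False return from write, or a None return from read, stands for the cycles in which the corresponding side would have to wait.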
Reconfigurable and custom logic resource model
The reconfigurable logic shown in Fig. 1 contains a 2D processing element array that is based on a multiply-accumulate (MAC) unit implementation. This core is coarse-grained and reconfigurable at run-time. Fig. 3 shows the block diagram of a MAC-unit, which uses a reconfigurable constant multiplier. Dynamic reconfiguration is achieved when new configuration data is loaded into one of the context registers while the other is currently active. We have created an 8x8 MAC-unit array where each MAC-unit is connected to its nearest 16 neighbours. Four FSL write channels are used to directly load operands into the first row of the array and each FSL channel is shared by two MAC-units. All the other MAC-units use the 16-bit bus operand ('bus op') to load data into the register buffers using the same set of channels. Configuration data for each MAC-unit is transferred in a column format, sharing one FSL channel with four MAC-units in a column. The rest of the write channels are used to carry control signals to the extension control unit. Results from the array are also read in a column format as shown.
Fig. 4 shows three custom functions that are implemented in our custom logic. All the custom functions have a two-stage pipeline, which allows the multiplication and addition to operate concurrently. For the custom0 and custom2 functions, the multiplier only accepts two 16-bit operands supplied by one input register. We must therefore pack two 16-bit operands into one input register before presenting the data to these two custom functions. The custom1 function can perform a 32 x 32-bit multiplication and accumulation using two input registers. However, only the lower 32-bit word of the result is returned. As some Nios-II cores do not have an ALU that supports hardware multiplication, we can reuse the custom2 function to perform a single multiply operation.
The multiplexer in the custom2 function selects between an internal register (hardwired to '0' to disable the addition) and one of the input registers using the 'readrb' signal. When the signal is low, the internal register is read instead.
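The operand packing and the behaviour of the custom1 and custom2 functions can be illustrated with the following Python sketch. It is a functional illustration only and ignores the two-stage pipeline. The helper names, the placement of the two 16-bit operands in the upper and lower halves of the register, the signed interpretation of the 16-bit operands and the modelling of the accumulator as an explicit argument are all assumptions, not details taken from Fig. 4.

```python
MASK32 = 0xFFFFFFFF


def to_signed16(x):
    """Interpret the low 16 bits of x as a two's-complement value (assumed)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pack16(hi, lo):
    """Pack two 16-bit operands into one 32-bit input register."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def custom2(ra, rb, readrb):
    """Multiply the two 16-bit halves of ra.

    rb is added only when readrb is high; otherwise the addend is the
    internal register hardwired to 0, giving a plain multiply.
    """
    product = to_signed16(ra >> 16) * to_signed16(ra)
    addend = rb if readrb else 0
    return (product + addend) & MASK32


def custom1(ra, rb, acc):
    """32 x 32-bit multiply-accumulate; only the lower 32 bits are kept."""
    return (acc + ra * rb) & MASK32
```

For example, custom2(pack16(3, 5), 10, readrb=False) performs a single multiply and yields 15, while the same call with readrb=True yields 25.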
All the custom functions run simultaneously, but only the result from the desired function is returned to the CPU. The 'sel result' multiplexer selects the output of the desired custom function and puts it onto the 'n[7..0]' bus. This selection is specified by the operation code of the function when the custom instruction is decoded. Our custom function models conform to the Nios-II multi-cycle custom instruction timing [16] and are controlled by the appropriate 'done', 'start' and 'clk' signals.
SIMULATION FRAMEWORK
Fig. 5 shows our co-simulation and compilation flow that is based on the SystemC platform. The compilation flow consists of two parts: hardware compilation and software compilation.
Hardware-Software Co-simulation
Our initial phase begins with a timed model of our two extensions, and the maximum value is selected for each of the component parameters that are used for simulating the models. These parameters are then fine-tuned (see Fig. 5) based on the results obtained from the hardware compilation stage. The co-simulation flow is sub-divided into three domains: modeling, exploration and refinement.
The modeling domain deals with the different abstraction levels of modeling the hardware and software components. Our RISC CPU models are cycle-accurate and use detailed bus channels for the communication between the various components (e.g. the decoder and fetch units). These models are not pin-accurate and cannot be directly synthesized for hardware. However, they are provided as custom IP modules available in commercial back-end tools. The rest of the components in the two extension models are pin-accurate and provide a close representation of the actual hardware circuits.
The characteristics of the modeling domain are presented in the exploration stage so that the designs can be refined and different component parameters (e.g. clock speed, array size, interconnection pattern, FSL allocation and custom functions) can be experimented with. To facilitate switching between the multiple views/contexts created by mapping a computational problem onto various extension models, we have created a software sequencer program that runs on top of the SystemC simulation kernel and contains the interface information of all the components in the two extension models. Any refinement to the individual components in the extension models requires the simulation executable to be re-compiled. However, the re-compilation is straightforward, as the compilation process for simulation is separated from the compilation for implementation. This separation ensures that any parameter changes are verified at system level, both for the correctness of the hardware solution and for any real-time constraints. To minimize re-compilation, a system designer can also create multiple hardware solutions with different parameters and use the software sequencer to switch between the solutions at run-time.
Hardware Compilation
The same sources used for generating the simulation executable can also be readily translated for synthesis and implementation on the target device. For the reconfigurable extension, we used SystemCrafter [17] to generate an RTL VHDL description for Xilinx FPGA devices. Since SystemCrafter does not support Altera devices, we have to manually convert our SystemC descriptions into their equivalent VHDL form. The RTL description can then be synthesized to a netlist using Synplify Pro tools for the respective target device. From the synthesis results, we can derive the maximum clock performance of each extension model. This is used to fine-tune the initial parameters used in the co-simulation phase. Our hardware synthesis excludes the RISC CPU models, the FSL and the custom instruction interface. It is more economical to use Xilinx ISE/EDK and Altera SOPC Builder as the back-end implementation tools, where our synthesized netlist can be imported as custom IP modules. It is imperative, though, that our modeling framework ensures that the requirements (i.e. functional and timing) of the custom instruction and the FSL interfaces strictly conform to the vendor specifications [16], [18].
Software Compilation
The software produced by our framework for simulation is assembly code that is not targeted at any RTOS. Fig. 6 shows the assembler instruction set syntax for each CPU interface model. In the multi-cycle custom instruction model, a custom instruction that depends on the result of a preceding instruction must wait for its completion. To achieve this, we use a conservative approach by inserting wait instructions between the dependent instructions. We have also optimized the software code by minimizing the amount of register spilling. To automatically compile the assembly code for the respective RTOS target, we employ the software compilers in the Xilinx ISE/EDK and Altera SOPC Builder design tools. We can subsequently use the optimization levels offered by each of the compilers to meet any real-time constraints. Already at the initial design phase, our simulation framework gives a system designer an outlook on the software performance, especially of the cost-performance sensitive instructions that control the hardware computational core.
Fig. 6. Pseudo assembly code of the MicroBlaze extension using FSL functions and the Nios-II ISE using custom instructions.
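The conservative insertion of wait instructions between dependent instructions can be sketched as follows. This is a simplified Python illustration, not the code generator used in our framework: it assumes a single-issue, in-order instruction stream, one slot per instruction, and a hypothetical per-opcode latency table. The tuple format and the 'wait' mnemonic are likewise assumptions.

```python
def insert_waits(program, latency):
    """Pad a program with 'wait' slots so that no instruction reads a
    register before the instruction producing it has completed.

    program: list of (opcode, dest_register, source_registers)
    latency: dict mapping opcode -> cycles until its result is valid
             (opcodes not listed are assumed to take one cycle)
    """
    out = []
    ready_at = {}  # register -> first slot at which its value may be read
    for op, dest, srcs in program:
        # Earliest slot at which every source operand is available.
        need = max((ready_at.get(r, 0) for r in srcs), default=0)
        while len(out) < need:
            out.append(("wait", None, ()))
        out.append((op, dest, tuple(srcs)))
        if dest is not None:
            ready_at[dest] = len(out) - 1 + latency.get(op, 1)
    return out
```

With a hypothetical three-cycle custom instruction, an add that consumes its result immediately is preceded by two wait slots, whereas an independent instruction placed in between reduces the padding to one slot.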
APPLICATION EXAMPLES
In order to demonstrate our integrated framework, we experimented with three kernel (integer) tasks. They are the discrete cosine transform (DCT), matrix multiplication (MM) and finite impulse response (FIR) filtering that are commonly found in DSP and image processing applications.
DCT: We used the MediaBench [19] JPEG and MPEG compression C program for an 8x8 DCT block on a 512x512 pixel image. For the ISE model, we translated the source code into assembly code that corresponds to the Nios-II CPU instruction set and used the custom instructions to achieve a speed-up. For the MicroBlaze extension model, we used a systolic mapping onto a 2-D array of MAC-units to exploit data parallelism. Fig. 7a shows a 1-D DCT mapping onto a 2-D mesh array. A 2-D DCT can be accomplished by performing a 1-D DCT on each column followed by another 1-D DCT on each row using transposed coefficients [20]. The resulting assembly code contains hardware-specific instructions (i.e. FSL functions) to perform the execution.
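The row-column decomposition of the 2-D DCT can be illustrated with a short Python sketch. It uses a floating-point, orthonormal DCT-II for clarity, which is an assumption made for illustration: the MediaBench kernels use integer arithmetic, and the sketch says nothing about the systolic mapping of Fig. 7a.

```python
import math


def dct_1d(v):
    """Orthonormal 1-D DCT-II of a sequence (floating-point sketch)."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        out.append(scale * s)
    return out


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct_2d(block):
    """2-D DCT as a 1-D DCT on each column followed by one on each row."""
    cols = [dct_1d(c) for c in transpose(block)]  # column pass
    return [dct_1d(r) for r in transpose(cols)]   # row pass
```

Each output coefficient is a sum of products, which is why the kernel maps naturally onto multiply-accumulate hardware.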
MM and FIR: We can directly write the assembly code for both of the tasks (4x4 MM and 16-tap FIR with 16-bit input samples and filter coefficients) as they are fairly straightforward. For the ISE model, the hardware and software compilations for simulation are similar to the above DCT case. For the hardware mapping on the MicroBlaze extension model, we have also used a 2-D array to implement a basic FIR filter as shown in Fig. 7b. The input samples are shifted serially into the array and we can dedicate more rows of MAC-units to execute in parallel for a larger number of input samples. There is a limit, restricted by the mesh array size (i.e. 8x8 mesh array), beyond which the MAC-units have to be reused.
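The FIR kernel reduces to one multiply-accumulate per tap for every input sample, as the following direct-form Python sketch illustrates. It is a functional reference only, with a serially shifted delay line, and does not model the array mapping of Fig. 7b. The function name and the use of unbounded Python integers instead of fixed-width accumulators are assumptions.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k]."""
    taps = [0] * len(coeffs)  # delay line, newest sample first
    out = []
    for x in samples:
        taps = [x] + taps[:-1]  # shift the new sample in serially
        acc = 0
        for h, t in zip(coeffs, taps):
            acc += h * t        # one multiply-accumulate per tap
        out.append(acc)
    return out
```

Feeding a unit impulse returns the coefficients in order, which is a convenient check when comparing against a hardware model.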
EXPERIMENTAL RESULTS
The application examples were run on the three models we have created: MicroBlaze (software), MicroBlaze extension (MAC-unit array) and Nios-II extension (ISE). Fig. 8 shows the computational time for simulating the tasks on each of the models using the parameters (described in Fig. 5) obtained from the hardware compilation results. We compared the speed-up for the ISE vs. software, MAC-unit vs. software and MAC-unit vs. ISE models. Both the ISE and MAC-unit extension models perform better than the software implementation, with an average speed-up of 2.8 and 4.9 times respectively. In contrast, the speed-up of the MAC-unit over the ISE is only 1.65 times on average. One possible contributing factor is that both extensions have inputs of similar width (i.e. 32-bit wide). Although the MAC-unit extension has more input channels to supply data to, it needs a large number of data sets to exploit its vertical parallelism and horizontal pipeline structure in order to gain a significant speed-up. Another explanation is that the results are produced serially by the MAC-unit extension and the CPU incurs additional cycles to read the immediate outputs. However, with some modification to the design and by interleaving these read cycles, we can improve the performance of the extension significantly. Opportunities to interleave these instructions with non-computational ones in the CPU are easier to find in the MAC-unit extension, since the core is more loosely coupled than in the ISE model.
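When reading these averages, note that an average of per-task MAC-unit vs. ISE speed-ups generally differs from the ratio of the two averages against software (4.9 / 2.8 = 1.75). The Python sketch below illustrates this effect under the assumption that arithmetic means of per-task speed-ups are used. The execution times in it are hypothetical values chosen for illustration and are not the measured results of Fig. 8.

```python
def speedup(t_base, t_new):
    """Speed-up of a new implementation over a baseline (ratio of times)."""
    return t_base / t_new


def mean(values):
    return sum(values) / len(values)


# Hypothetical execution times (arbitrary units) for three tasks.
t_sw = [120.0, 120.0, 120.0]
t_ise = [60.0, 40.0, 30.0]
t_mac = [30.0, 24.0, 20.0]

ise_vs_sw = [speedup(s, i) for s, i in zip(t_sw, t_ise)]
mac_vs_sw = [speedup(s, m) for s, m in zip(t_sw, t_mac)]
mac_vs_ise = [speedup(i, m) for i, m in zip(t_ise, t_mac)]

# The mean of the per-task ratios and the ratio of the means differ.
mean_of_ratios = mean(mac_vs_ise)
ratio_of_means = mean(mac_vs_sw) / mean(ise_vs_sw)
```

With these hypothetical values the two quantities are not equal, so a gap between a directly averaged comparison and the quotient of two averages does not by itself indicate an inconsistency.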
The ISE offers another attractive option. With careful design of the custom functions, where the multiplication and addition stages are pipelined (see Section 2.2), we can achieve a significant speed-up with the same degree of flexibility as the non-custom instructions in the software-only implementation. Generally, not all applications using these custom functions will see an improvement in performance, but where the difference in speed-up between a co-processor and the ISE is only marginal, it may be more economical to adopt the simpler ISE solution since the hardware design effort is less complex.
CONCLUSION
In summary, we have introduced a flexible simulation framework where computationally intensive kernels in different applications can be efficiently evaluated across a range of solutions. It also presents to a system designer an outlook on the performance during the initial design stage. At this level, system functional specifications can be readily verified and different parameters of the hardware components can be analyzed to meet any real-time constraints. Refinements to the design (hardware and software) can be performed easily, and system trade-off decisions can be considered before going to the next level of optimizing the hardware or software. We have also demonstrated that our compilation flow can be combined with commercial back-end design tools to provide an economical approach for synthesis and implementation on the target devices.
