Abstract
Introduction
The applications of modern Systems-on-Chip (SoCs) today range from low-end domains like automotive dashboards to high-end systems powering smartphones and the like, and the complexity of these systems grows steadily. The design of these SoCs would be eased if they were based on a parallel, flexible, scalable, and generic architecture. Such an architecture can help to reduce the design complexity by supporting a kind of divide-and-conquer approach. Processing elements can be replicated or replaced by application-specific function units to reach performance or energy targets.
Today, multicore FPGAs are an important implementation technology. They allow systems to be designed with most of the components placed on-chip. With multiple processor cores, either as fixed hardware or implemented using the configurable logic, and with their ability to be reconfigured, they provide a good basis for parallel, flexible, and scalable systems.
However, the long-familiar problems of performance, reliability, flexibility, and power management still exist in FPGAs, as they only provide the basis for solving these problems. To use multicore FPGAs efficiently, all cores should be included in the parallel system. To optimize power management, the number of active, or even configured, cores must be adapted dynamically to the current workload. Furthermore, the configurable logic has to be used to implement application-specific function units to accelerate performance. Therefore, a modern FPGA-based system is a complex one whose optimal performance depends on parameters that can in part only be determined at runtime.
To make these features of an FPGA manageable and to augment the system with adaptivity, a virtualization layer is required that hides the hardware system, which changes due to runtime reconfiguration, from the application software. The Scalable Dataflow-driven Virtual Machine (SDVM) is such a virtualization of a parallel, adaptive, and heterogeneous cluster of processing elements [3, 2]. Thus, it is well suited to serve as a managing firmware for multicore FPGAs.
This paper is structured as follows: Section 2 describes the fundamentals of FPGAs and the outline of the firmware concept. Section 3 discusses the realization of the firmware using the SDVM and its technical challenges. In Section 4 an overview of related work is given. The paper closes with a brief conclusion in Section 5.
A Firmware concept for FPGAs
Besides the primary functions that a System-on-Chip (SoC) should accomplish, e.g. speech encoding in a cell phone, its design has to address a multitude of secondary requirements. These requirements are important for most systems; merely the weighting differs. The introduction of FPGAs as a target platform for SoCs adds another important requirement: the runtime reconfiguration ability of some FPGAs provides additional flexibility to the system. To make optimal use of these reconfigurable systems, efficient management of the reconfiguration process is necessary.
The list of secondary requirements can be summarized as follows:
• performance, scalability and adaptivity
• support for parallelism
• robustness and reliability
• energy efficiency
• support for runtime reconfiguration
• incorporation of heterogeneous components
As these requirements, and therefore the techniques to achieve them, are common to a vast number of SoCs, it is beneficial to supply a generic module which manages these supporting features. This lightens the burden on the designer, who can concentrate on the primary functions of the SoC.
A possible way to avoid an increase in complexity, provide flexibility, and improve portability and code reusability across different hardware is to divide the functionality into several layers. The aforementioned generic module should therefore be implemented as a functional layer between the system hardware and the application software, thus acting as a middleware.
The middleware should provide a complete virtualization of the underlying hardware. The application no longer has to be tailored to the hardware; instead, it is sufficient to tailor it to the virtual layer. This virtual layer not only provides hardware independence, it can also hide changes in the underlying hardware due to reconfiguration. Thus, such a middleware is particularly well suited to be used as a firmware for FPGAs.
Dynamically reconfigurable platform FPGAs
A generic FPGA architecture consists of three major components: configurable logic blocks (CLBs), input/output blocks (IOBs), and a configurable routing network that connects all IOBs and CLBs. The CLBs can be used to implement arbitrary logic functions, while the IOBs provide the physical interface to the off-chip environment. Today, FPGAs are no longer limited to these basic components. They incorporate additional specialized function blocks like embedded memory or processor cores. Thus, modern FPGAs can implement complete parallel system architectures on-chip, so-called microgrids.
As seen in Table 1, even the second smallest device of the Virtex-4 FX family can host two MicroBlaze softcores, each including an FPU and dedicated RAM, and still has more than 50 % of its logic resources free to implement application-specific functions. Larger FPGAs can support systems with four cores and still have 42 % of their logic elements unused.
The Middleware concept
To efficiently use a dynamically reconfigurable FPGA as an implementation architecture for multicore systems, the middleware has to support a number of different features. Today, even small FPGAs can host multiple cores (see Table 1). Besides vendor-supplied softcores or embedded hardcores, system designers often add application-specific function blocks like digital signal processors (DSPs) or specialized datapath units. Unless a distinction is necessary, these cores and function blocks will be referred to as processing elements (PEs) in the following.
The logic resources, and therefore the computing power of the FPGA, and the internal memory blocks can be distributed evenly among all PEs, but there are resources which cannot be split efficiently. One of the most important of these is the external memory. As FPGAs typically have only up to a few hundred kilobytes of internal memory, many applications require external memory. On the one hand, to avoid bottlenecks, each PE should have its own memory for both program and data. On the other hand, applications for shared memory are much easier to design than applications for message passing systems. Therefore, the middleware should support a multi-level memory architecture in which the distribution of the memory is transparent to the application software. Besides external memory, the interfaces of the FPGA system to the outside world, like Ethernet or PCIe, cannot be allocated to every PE. The middleware must manage these resources at the cluster level.
The middleware should provide a complete virtualization of runtime reconfigurable platform FPGAs. Therefore it has to support the following primary features:
• Combine all PEs to create a parallel system.
• Provide task mobility between all PEs even if they are heterogeneous. It should be possible to execute a task on processors of different architectures and on custom function units if applicable.
• Virtualize the I/O-system to enable the execution of a task on an arbitrary PE.
• Combine the distributed memory of each PE to form a virtually shared memory.
• Manage the reconfiguration of the FPGA, i.e. keep track of the current usage of the FPGA resources and available alternative partial configurations. Furthermore, an adequate replacement policy has to be defined.
• Monitor a number of system parameters to gather information the configuration replacement policy depends on.
• Adjust the number of active PEs at runtime. For example, this can be used to meet power dissipation or reliability targets.
• The previous feature requires the firmware to hide the actual number of PEs from the application to ease programming.
• As the user software does not know the number and architecture of the PEs the firmware has to provide dynamic scheduling as well as code and data distribution.
One of the fundamental decisions in the design process of the firmware is whether each PE forms an independent building block of the parallel cluster or multiple PEs are merged in a higher-order cluster element. The latter may impose less overhead but the former eases the implementation of adaptive features like coping with errors in the fabric or reducing hotspots.
If each PE is augmented with a complete set of the virtualization functions and therefore no PE is the sole provider of any function, the system is much more flexible. If an error is detected in some part of the FPGA the affected PE can be disabled or reconfigured to avoid the erroneous location without hampering the functionality of the cluster. Furthermore, as each augmented PE provides its share of the cluster management functionality the number of bottlenecks is reduced. The distribution of functionality can lead to a better distribution of workload thus reducing the number of hotspots on the FPGA.
In addition, the firmware depends to a much lesser extent on the number and type of cores in a cluster if it runs on each core independently and communicates using hardware-independent messages.
Therefore, the middleware is implemented as a firmware running on each core of the FPGA-based system. The parallel cluster is created by each firmware instance communicating with the other instances.
Realization
In this section the realization of the presented virtualization concept is described. Because its features match the requirements specified in Section 2, the SDVM was chosen as a basis. Thus, the firmware for reconfigurable systems is called SDVM R.
The Scalable Dataflow-driven Virtual Machine (SDVM)
The Scalable Dataflow-driven Virtual Machine [3, 2] is a dataflow-driven parallel computing middleware (see Fig. 1). It was designed to keep the parallel computation flow undisturbed while processing units are added to and removed from computing clusters. Applications for the SDVM must be cut into convenient code fragments (of any size). The code fragments and the frames (data containers for the parameters needed to execute the former) are spread automatically throughout the cluster depending on the data distribution.
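The relationship between code fragments and frames can be illustrated with a minimal single-process sketch. It assumes the usual dataflow firing rule (a frame becomes executable once all of its parameters have arrived); the names Frame, put, run, and demo are hypothetical and are not taken from the SDVM sources.

```python
from collections import deque


class Frame:
    """Data container holding the parameters for one execution of a code fragment."""

    def __init__(self, fragment, n_params, target=None, slot=None):
        self.fragment = fragment        # callable standing in for a code fragment
        self.params = [None] * n_params
        self.missing = n_params         # parameters that have not arrived yet
        self.target = target            # frame that receives the result (hypothetical)
        self.slot = slot                # parameter slot in the target frame

    def put(self, slot, value):
        """Store one parameter; return True once the frame is executable."""
        self.params[slot] = value
        self.missing -= 1
        return self.missing == 0


def run(ready_frames):
    """Execute frames in dataflow order: a frame fires only when it is complete."""
    ready = deque(ready_frames)
    results = []
    while ready:
        frame = ready.popleft()
        value = frame.fragment(*frame.params)
        if frame.target is None:
            results.append(value)               # final result of the application
        elif frame.target.put(frame.slot, value):
            ready.append(frame.target)          # last parameter arrived
    return results


def demo():
    # Computes 3*3 + 4*4 with three fragments; the sum waits for both squares.
    total = Frame(lambda a, b: a + b, 2)
    sq1 = Frame(lambda x: x * x, 1, total, 0)
    sq2 = Frame(lambda x: x * x, 1, total, 1)
    sq1.put(0, 3)
    sq2.put(0, 4)
    return run([sq1, sq2])
```

In this sketch the ready queue is local; in a cluster the executable frames would be the units that are distributed among the sites.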
Each processing unit which is encapsulated by the SDVM virtual layer and thus acts as an autonomous member of the cluster is called a site. The sites consist of a number of modules with distinct tasks and communicate by message passing.
The SDVM is a convenient choice as a middleware (virtual layer) for FPGAs due to several distinguishing features. The two most important features are that the SDVM cluster can be resized at runtime without disturbing the parallel program execution, and each site in a cluster can use a different internal hardware architecture. These two features are the basis for the runtime reconfiguration ability of the system.
A typical reconfiguration cycle in such a system starts with the site that occupies the area to be reconfigured dropping out of the running cluster. The displacement of data and code objects is managed by the SDVM middleware during the drop-out process. After the new configuration has been loaded, the newly created site, now having a different architecture, joins the cluster. On joining, the site is automatically included in the ongoing runtime distribution of the workload. On FPGAs which support partial runtime reconfiguration, and therefore allow continuous operation of the unaffected logic fabric, the other sites can proceed with their operation. This makes even the reconfiguration of significant parts of the FPGA, which may be time consuming, worthwhile.
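The cycle can be sketched as follows. This is an illustrative model only: the class and method names (Site, Cluster, drop_out, reconfigure) are hypothetical, the round-robin displacement of objects is an assumption, and the loading of the partial configuration is reduced to a comment.

```python
class Site:
    """One processing element wrapped by the virtual layer."""

    def __init__(self, name, arch):
        self.name = name
        self.arch = arch
        self.objects = []       # frames and code fragments held locally


class Cluster:
    def __init__(self):
        self.sites = {}

    def join(self, site):
        self.sites[site.name] = site

    def drop_out(self, name):
        """Remove a site and displace its objects to the remaining sites."""
        if name not in self.sites or len(self.sites) < 2:
            raise RuntimeError("site unknown or no other site can take over")
        leaving = self.sites.pop(name)
        remaining = list(self.sites.values())
        # Assumption: simple round-robin displacement of data and code objects.
        for i, obj in enumerate(leaving.objects):
            remaining[i % len(remaining)].objects.append(obj)
        leaving.objects = []
        return leaving

    def reconfigure(self, name, new_arch):
        """Drop out, load a new configuration, and rejoin with a new architecture."""
        self.drop_out(name)
        # The partial configuration for the freed area would be loaded here
        # while the other sites keep running.
        new_site = Site(name, new_arch)
        self.join(new_site)
        return new_site
```

The sketch does not model the subsequent workload distribution; the rejoined site simply starts empty and would receive work through the normal scheduling.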
Besides changing the sites' architectures, the cluster resize mechanism can be used to adapt the cluster to the different degrees of parallelism of the currently running applications. When there is an excess of computing power, sites can drop out of the cluster and the affected area of the FPGA can be shut down to minimize power consumption. When the demand for computing power rises, free areas of the FPGA can be used to deploy new sites, thus increasing the capacity of the cluster.
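A resize decision of this kind could look like the following sketch. The policy is an assumption made for illustration: it takes the number of executable frames as the workload measure and uses a hypothetical frames_per_site parameter; the text does not specify the actual policy.

```python
def target_site_count(ready_frames, max_sites, frames_per_site=4):
    """Number of sites wanted for the current workload (at least one stays active)."""
    wanted = -(-ready_frames // frames_per_site)    # ceiling division
    return max(1, min(max_sites, wanted))


def resize_step(ready_frames, active_sites, max_sites, frames_per_site=4):
    """Positive result: deploy that many sites; negative: let that many drop out."""
    return target_site_count(ready_frames, max_sites, frames_per_site) - active_sites
```

For example, with nine executable frames and one active site on a device with room for four sites, resize_step(9, 1, 4) asks for two additional sites.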
The integration of adaptive features into the middleware, which can be used to adjust the behaviour of the system to a variety of targets, requires knowledge gathered at runtime. For example, a power management policy using the SDVM was developed which can be used to improve the reliability of a multicore chip [1]. Independent selection of the cores' power states in conjunction with dynamic parallelism is used to minimize temperature changes of the chip, thereby improving the reliability significantly while sacrificing less performance than simpler power management policies. The policy requires the collection of several runtime parameters, such as the amount of workload and the current core temperature, which are gathered and distributed by the SDVM.
All in all the SDVM offers the following features that are beneficial for the implementation of an FPGA middleware:
• undisturbed parallel computation while resizing the cluster
• dynamic scheduling and thereby automatic load balancing
• distributed scheduling: no specific site has to decide a global scheduling and therefore any site can be shut down at any time
• participating computing resources may have different processing speeds and ISAs
• applications may be run on any SDVM-driven system, as the number and types of the processing units do not matter
• support for any connection network topology
• no common clock needed: the clock is locally synchronous but globally asynchronous
• distribution of knowledge gathered at runtime
Different realization schemes
One important question when implementing a parallel system on an FPGA is how the reconfigurable area provided by the FPGA is used. There are two principal possibilities:
1. The available resources on the FPGA are used up by configuring additional processing units. Thus, the SDVM R cluster consists of more sites and a higher parallelism can be achieved. The number of sites can be changed by reconfiguration to adapt to the available parallelism of the application as far as FPGA resources permit.
2. The FPGA fabric is used to implement custom function units, each attached to and therefore controlled by one of the cores. The function units correspond to specific code fragments which are executed often. The functions supported by the custom function units can be changed at runtime by reconfiguration to adapt the system to the needs of the application.
The two approaches can be combined (see Fig. 2). This is aided in particular by the fact that both the MicroBlaze and the PowerPC hardcore provide a fast low-level interface to the FPGA logic fabric. In this way the middleware still runs as software on the core while some of the data processing is shifted to specialized hardware (i.e. the logic fabric).
Technical challenges
As the virtualization layer is implemented completely in software which runs on the PE of each site, the memory footprint is a main concern. The original SDVM consists of 50,000 lines of code, depends on a full-featured operating system, and makes heavy use of the C++ STL, resulting in a binary code size of about 500 KB. On an FPGA, a PE can typically host 64 KB of on-chip memory. Furthermore, only a basic operating system kernel can be used due to the limited resources. Thus, the virtualization layer software was reimplemented using plain C.
Nevertheless, the overall structure of the existing SDVM and the functionality of its modules could be largely reused for the new implementation. Additionally, the data structures and algorithms used in the SDVM could be simplified because the original implementation was developed for an arbitrarily large cluster in an insecure environment. This further reduced the required implementation complexity.
To extend the memory resources of the system, the external memory has to be included. This memory cannot be accessed directly by all sites efficiently, as the available bandwidth would be a bottleneck. The best memory architecture is still under research. One option is to attach the memory to one site and rely on the data distribution of the SDVM to move needed data to the requesting site, i.e. the local memory of each site would act as a cache.
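The option of using local memory as a cache can be sketched as below. This is only an illustration of the idea under stated assumptions: the names HomeSite and CachingSite are hypothetical, a least-recently-used replacement is assumed, and writes and coherence between sites are not modeled.

```python
from collections import OrderedDict


class HomeSite:
    """Site to which the external memory is attached."""

    def __init__(self, data):
        self.external = dict(data)
        self.requests = 0               # remote requests served so far

    def fetch(self, key):
        self.requests += 1
        return self.external[key]


class CachingSite:
    """Site whose small local memory caches objects owned by the home site."""

    def __init__(self, home, capacity):
        self.home = home
        self.capacity = capacity
        self.local = OrderedDict()      # oldest entry first

    def read(self, key):
        if key in self.local:
            self.local.move_to_end(key)         # mark as recently used
            return self.local[key]
        value = self.home.fetch(key)            # miss: request from the home site
        self.local[key] = value
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)      # evict least recently used
        return value
```

Repeated reads of the same object are then served locally, and only misses load the link to the site holding the external memory.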
Another issue is runtime reconfiguration. A system design must adhere to certain constraints to enable reconfiguration. Mapping and placement of the modules to be configured are time-consuming tasks which cannot be done fast enough at runtime on today's FPGA architectures. Thus, all modules which will be used in the system must be created in advance. As the number of sites can change due to reconfiguration, a reconfiguration-optimized on-chip network connecting the sites is under research.
Development system architecture
For the development of the SDVM R firmware, a scalable system has been created using the Xilinx EDK 8.1 software. The system is based on IP blocks supplied by Xilinx. The MicroBlaze softcore is used as the processing element. It is supported by a timer and an interrupt module connected to a local On-Chip Peripheral Bus (OPB) to allow for the execution of the Xilinx XMK operating system. The MicroBlaze core has 64 KB of embedded memory blocks connected to its Local Memory Bus (LMB) to store and execute the firmware. These four IP blocks form the basic processing element of the system.
The communication between the processing elements is done using a shared memory connected to the system OPB. To allow for mutually exclusive access to this memory and to the RS232 interface, a hardware mutex module is attached to the system OPB. The multiple cores are attached to the system using OPB-to-OPB bridges (see Fig. 3). The PowerPC-based system is basically the same: the MicroBlaze and its memory are replaced by the PowerPC core and a PLB-connected memory block. The resource requirements of different system configurations are listed in Table 1. The MicroBlaze is implemented with all features and the FPU enabled but without caches and debugging support.
As the hardware basis, an evaluation board populated with a Virtex-4 FX20 was chosen due to its fine-grained reconfiguration features and embedded PowerPC core. Resources on the V4FX20, especially embedded memory blocks, are too limited to implement systems with more than two cores, so a Virtex-II Pro 30 based system is used for tests with 2 PowerPCs and up to 4 MicroBlaze cores. The busses and MicroBlaze cores of all systems run at a clock frequency of 100 MHz; the PowerPC cores are clocked at 300 MHz.
Preliminary performance results
The current development version of the SDVM R has been tested with a parallel application calculating a Mandelbrot set. The application is mapped to the execution model of the SDVM R in such a way that the calculation of one row of the Mandelbrot set is done in one code fragment. Thus, the parallelism of the application increases with the number of rows to calculate. The disadvantage of this straightforward approach is that the runtime of each instance of this code fragment differs. This is due to the Mandelbrot algorithm, which requires a different number of iterations depending on the point it is currently calculating.
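The uneven runtime per row can be reproduced with a small sketch of a row-wise code fragment. The coordinate window, the image size, and the iteration limit are assumptions chosen for illustration; they are not the parameters used for the measurements.

```python
def mandelbrot_row(y, width, height, max_iter=100):
    """Iteration counts for one image row; stands in for one code fragment."""
    ci = -1.5 + 3.0 * y / (height - 1)          # imaginary part of this row
    counts = []
    for x in range(width):
        cr = -2.0 + 3.0 * x / (width - 1)       # real part of this column
        zr = zi = 0.0
        n = 0
        # Iterate z = z*z + c until the point escapes or the limit is reached.
        while n < max_iter and zr * zr + zi * zi <= 4.0:
            zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
            n += 1
        counts.append(n)
    return counts


def row_costs(width, height, max_iter=100):
    """Total iterations per row, a rough proxy for the runtime of each fragment."""
    return [sum(mandelbrot_row(y, width, height, max_iter)) for y in range(height)]
```

Rows that cross the interior of the set, such as the centre row, need far more iterations than rows near the border of the window, which is why fragment runtimes differ and dynamic scheduling is useful here.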
The application has been tested on the one-, two-, and four-core systems described in Section 3.4, including combinations of PowerPC and MicroBlaze cores. The results are shown in Table 2. Despite the lower clock frequency of the MicroBlaze (100 MHz) compared to the PowerPC cores (300 MHz), the latter are much slower because they lack an FPU. The dual-core PowerPC system scales noticeably better than the dual-core MicroBlaze system, with an efficiency of 0.99 compared to 0.94. As the MicroBlaze system is more than 8 times faster, the shared memory communication and the mutual exclusion mechanism may be a bottleneck in this case. For the four-core MicroBlaze system the efficiency drops to 0.78.
Table 2. Runtimes of a single-precision floating point Mandelbrot set
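The efficiency figures can be related to speedups with a short calculation. The sketch assumes the usual definitions (speedup is the single-core runtime divided by the parallel runtime, efficiency is the speedup divided by the number of cores); the function names are hypothetical and the measured runtimes of Table 2 are not reproduced here.

```python
def speedup(t_single, t_parallel):
    """Speedup of a parallel run relative to the single-core runtime."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, cores):
    """Parallel efficiency: speedup per core."""
    return speedup(t_single, t_parallel) / cores


def implied_speedup(eff, cores):
    """Speedup implied by a reported efficiency."""
    return eff * cores


# Efficiencies quoted in the text, converted to the speedups they imply.
reported = {
    "2 x PowerPC": implied_speedup(0.99, 2),
    "2 x MicroBlaze": implied_speedup(0.94, 2),
    "4 x MicroBlaze": implied_speedup(0.78, 4),
}
```

Under these definitions the reported efficiencies correspond to speedups of about 1.98 and 1.88 for the two dual-core systems and about 3.12 for the four-core MicroBlaze system.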
Related work
The middleware presented in this paper uses hardware abstraction based on a virtual layer to exploit runtime reconfiguration and parallelism of currently available FPGAs at the system level. Most related work in this field focuses either on the virtualization of an FPGA at the level of the logic fabric or on techniques for runtime reconfiguration. Furthermore, the discussion of adaptive features is most often detached from concrete implementation technologies. An overview of related work in a broader sense confirms this.
Roman Lysecky et al. developed techniques for dynamic hardware/software partitioning based on online profiling of software loops and just-in-time (JIT) synthesis of hardware components, called WARP [6]. They also present a dynamic FPGA routing approach which can be used to solve the routing and placement problem of reconfigurable components at runtime [7]. However, their approach relies on a special FPGA architecture called Configurable Logic Architecture [5], which to our knowledge has not yet been implemented. To overcome the need for this special FPGA architecture, Roman Lysecky et al. developed a virtualization layer for FPGAs [8] which is placed on top of the logic fabric of a real FPGA. This virtualization layer emulates the Configurable Logic Architecture and enables the use of JIT techniques on existing FPGAs, but leads to a 6X decrease in performance and a 100X increase in hardware requirements. In contrast, the concept presented in our paper uses softcores and functional units specifically designed for the target FPGA family; thus, the overhead should be much lower.
The usage of adaptive features to tackle the complexity of modern SoCs is extensively covered by Gabriel Lipsa et al. [4]. Their paper proposes a concept that applies autonomic or organic computing principles to hardware designs. The paper does not present any kind of implementation, neither as software nor as hardware. The SDVM R as a firmware for FPGAs is a software realization of these autonomic principles.
The widely used Parallel Virtual Machine (PVM) [9] is a collection of procedures and functions to simplify the usage of heterogeneous computer clusters. On each member of the cluster a daemon has to be started which performs the communication with the other systems. The members and the communication structures have to be set up beforehand and cannot be changed during runtime. The SDVM, in contrast, allows dynamic entry and exit of computers at runtime. This allows automatic scaling of parallelism and adaptation to various needs for special hardware resources.
Conclusion
In this paper a middleware providing a virtualization layer for FPGAs is presented. The virtualization layer can reduce system design complexity as it separates the applications to be run from the underlying hardware while exploiting runtime reconfiguration and parallelism of currently available FPGAs at the system level. It is based on the SDVM, a middleware for computer clusters and multicore chips. Due to its features, the FPGA may reconfigure itself at runtime to adapt to changing conditions and requirements. The work is currently under development; several technical issues that have arisen and the development system architecture were described.
