Partial reconfiguration is the process of configuring a portion of an FPGA while the rest of the device is still operating. This paper proposes a novel allocation methodology for realizing applications with partial and dynamic features on FPGAs. The methodology was implemented as a manager that incorporates two stages. The first stage modifies the configuration data of each partial bitstream by re-placing the associated application's functionalities (or slices); its goal is to compact the slice distribution while keeping the same functionality. The second stage determines the appropriate spatial location on the FPGA device where the previously optimized configuration data should be placed. The proposed manager is device independent, since it derives partial configuration data that can dynamically program any island-style or hierarchical FPGA. For demonstration purposes, the proposed manager was implemented as part of an existing bitstream generator tool named DAGGER (part of the MEANDER framework), targeting Virtex-like architectures.
Introduction
Perhaps the most important advantage of employing Field-Programmable Gate Arrays (FPGAs) over Application-Specific Integrated Circuits (ASICs) is their inherent capability to reconfigure an application at any time. In the majority of cases, however, this feature is used exclusively during the design phase of a project. In such an approach, the configuration data affects the programming of the entire device, and the stream is called a bitstream. Nevertheless, the hardware resources that an application actually requires may be only a subset of the FPGA. In order to reprogram only a part of the device, we have to generate a portion of the configuration data, called a partial bitstream file. Among other benefits, partial reconfiguration allows us to adapt applications, share hardware resources between applications, and provide continuous device service.
Dynamic reconfiguration is a mechanism that allows applications to be attached to (logically added) or detached from (logically removed) an FPGA device without incurring any system downtime. Furthermore, specific parts of complex applications can be realized consecutively (i.e., partially), even if the application cannot be mapped onto the device as a whole. Partial and dynamic reconfiguration, also known as run-time reconfiguration, are rather complex tasks that must be supported by both the target device and the configuration tool.
Several existing reconfigurable architectures support partial reconfiguration, among them Chimaera (Ref. 3), PipeRench (Ref. 4), NAPA (Ref. 5), and Virtex FPGAs (Ref. 6). Extensive work has been done to improve the multi-context handling capability of these devices by storing several configurations and enabling quick context switching. The main goal was to improve the execution time by minimizing external memory transfers, assuming that some amount of on-chip data storage was available in the reconfigurable architecture. However, this solution is feasible only when the implemented functions are mutually exclusive in the temporal domain (e.g., context switching between coding/decoding schemes in communication, video or audio systems); otherwise, the length of the reconfiguration intervals would lead to unacceptable delays.
In Ref. 1, an algorithm that alleviates the above-mentioned problems posed by the consecutive reconfiguration of the same logic was presented. It aims to rearrange the hardware resources on-line, solving the fragmentation problem. An approach for predicting the possible locations of the maximal empty device areas on a partially reconfigurable FPGA can be found in Ref. 10. In Ref. 7, a component reuse-based strategy to reduce the reconfiguration overhead between two consecutive reconfigurations was proposed. However, this solution incurs constraints which may lead to unroutability. In Ref. 9, two algorithms for resource allocation on FPGAs, which aim to minimize the percentage of the device that needs run-time reconfiguration, were described. A hardware/software approach that relocates and defragments a partially reconfigurable FPGA was proposed in Ref. 8. In Ref. 16, an algorithm that reduces the time required for resource allocation on computational hardware devices with partial reconfigurability is discussed. Finally, a placement of hardware modules in space and time for reconfigurable architectures is shown in Ref. 17. Such an approach achieves reasonable execution times, while it tries to find the FPGA of minimal size that accomplishes the tasks within a fixed time limit.
In this paper we introduce a device-independent allocation methodology for realizing applications with partial and dynamic features on FPGAs. This methodology consists of two main stages. The first stage modifies the configuration data of each partial bitstream by re-placing the associated slices, while keeping the application's functionality unchanged. Its goal is to compact the slice distribution over the FPGA in order to reduce the continuous silicon area required on the device. The second stage deals with the assignment of the previously optimized configuration data onto the FPGA device.
Using the MCNC benchmarks, the effectiveness of the proposed methodology is demonstrated by a comparison study that shows a significant reduction of the actually utilized logic modules (or slices). Also, by appropriately handling the contents of the bitstream file, we aim to reduce the transition activity between consecutive bits of the configuration stream.
For demonstration purposes, the proposed methodology was implemented as part of an existing tool named DAGGER (Ref. 2), targeting Virtex-like architectures (Ref. 13). More specifically, DAGGER ver. 1 supported only bitstream generation, while DAGGER ver. 2 also supports partial and dynamic reconfiguration.
The rest of the paper is organized as follows: Sec. 2 describes the supported configuration procedure (partial and dynamic), while the proposed re-allocation methodology is discussed in Sec. 3. Comparison results and conclusions are summarized in Secs. 4 and 5, respectively.
Partial and Dynamic Device Programming
In order to support partial reconfiguration, our solution incorporates a methodology that dynamically adds applications to, or removes them from, the target FPGA. Figure 1 shows the proposed methodology for modifying the functionality of the target FPGA. A prerequisite for applying this methodology is that the appropriate configuration bitstream files for all the applications that need to program the device have been specified in advance.
Whenever there is a request to modify the current functionality of an FPGA, the appropriate data from the configuration RAM (which contains the bit sequence for the new application) should be read and checked for errors through error checking routines. Possible errors might have occurred during either the storage/retrieval process from the configuration memories or the transfer from the configuration memory to the programming tool.
Whenever an error is found, the algorithm reloads the bitstream (i.e., it sets FLAG = "1"). If the error occurred during the transfer process, the correct configuration data is available after the second transfer. Otherwise, the bitstream file is still invalid after the second retrieval. In such a case, an error message is printed and the algorithm terminates its execution, since the new application has a corrupted configuration file. In contrast (i.e., when the configuration data is valid), the algorithm locates and reserves an appropriate area on the FPGA where the new application is mapped.
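The check-and-reload flow described above can be sketched as follows. This is only an illustration under stated assumptions: the function name load_bitstream, the fetch callback and the use of a CRC-32 checksum are hypothetical, since the text does not specify which error checking routines are used.

```python
import zlib
from typing import Callable


def load_bitstream(fetch: Callable[[], bytes], expected_crc: int) -> bytes:
    """Fetch a partial bitstream, reloading it once if the check fails.

    `fetch` and the CRC-32 check are assumptions made for this sketch.
    """
    flag = 0  # FLAG = 1 means one reload has already been attempted
    while True:
        data = fetch()
        if zlib.crc32(data) == expected_crc:
            return data  # valid data: the caller can now reserve an area
        if flag == 1:
            # Still invalid after the second retrieval: stored file is corrupted
            raise ValueError("corrupted configuration file")
        flag = 1  # first failure: assume a transfer error and reload
```

A transfer error is recovered by the second fetch, whereas a file that is corrupted in storage fails twice and raises the error.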
For the purposes of this paper, we have developed two novel algorithms for sufficient configuration data re-allocation, which are described in more detail in the upcoming sections. If there is not enough continuous empty area on the device, some already stalled applications might be overwritten. Finally, the reserved area is programmed with the bitstream file of the new application. The proposed slice re-allocation task is highlighted in Fig. 1.
In order to support the attachment (logical addition) of applications to an FPGA device, or their detachment (logical removal) from it, we incorporate a method based on correlation. More specifically, we correlate the current bitstream data stored in the programming memories with the configuration file of the application that needs to be attached to or removed from the FPGA. The current programming information can be retrieved from the device through the read-back process. If there is no mapped application on the device, the corresponding bit sequence is null.
An example of applying the data correlation for adding and removing applications from a configuration file is depicted in Fig. 2. In particular, a "programmed" bit represents a value "1" or "0" that programs the device, while the remaining bits are "don't care" bits. The don't care bits are the ones padded onto the useful bits to form a RAM word of a specific length (e.g., 16-bit or 32-bit), and they do not affect the device functionality.
Since the value of the "don't care" bits does not affect the application functionality, we keep their value constant and equal to the last used bit. This choice results in fewer transitions between consecutive bits inside the configuration file. It leads to lower switching activity during the programming phase, and therefore to lower power consumption at the device configuration level.
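The effect of this padding rule on the number of bit transitions can be illustrated with a short sketch. The helper names pad_word and transitions, and the representation of a RAM word as a string of "0"/"1" characters, are assumptions made only for this example.

```python
def pad_word(useful_bits: str, word_len: int) -> str:
    """Pad the useful bits to a full RAM word by repeating the last used bit."""
    if not useful_bits or len(useful_bits) > word_len:
        raise ValueError("need between 1 and word_len useful bits")
    return useful_bits + useful_bits[-1] * (word_len - len(useful_bits))


def transitions(bits: str) -> int:
    """Count positions where two consecutive bits differ."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


useful = "10110"
padded = pad_word(useful, 8)       # don't care bits copy the last used bit
alternating = useful + "101"       # a different, arbitrary padding choice
print(padded, transitions(padded), transitions(alternating))
```

Padding with the last used bit never adds a transition inside the word, while an arbitrary choice for the don't care bits can add up to one transition per padded bit.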
The correlation operation can be implemented at either the software or the hardware level. A software implementation requires a CPU core attached to the reconfigurable device. This core might be an embedded component or a software IP core. Alternatively, it is possible to XOR the bitstream files by using dedicated hardware resources of the FPGA. Even though the latter approach is faster than the CPU core, it requires a synchronization mechanism to ensure proper bitwise junction of the bit sequences.
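A software-level version of the XOR-based correlation could look like the sketch below. It assumes that a null sequence is all zeros and that each application programs a disjoint set of bits; the function name correlate and the sample byte values are hypothetical, and the exact handling of don't care bits in Fig. 2 is not modelled.

```python
def correlate(current: bytes, application: bytes) -> bytes:
    """Bitwise XOR of the read-back data with an application's configuration.

    With disjoint bit regions, one XOR attaches the application and a
    second XOR with the same data detaches it again.
    """
    if len(current) != len(application):
        raise ValueError("bit sequences must have the same length")
    return bytes(c ^ a for c, a in zip(current, application))


device = bytes(4)                      # no mapped application: null sequence
app_a = bytes([0x0F, 0x00, 0x00, 0x00])
app_b = bytes([0x00, 0xA5, 0x00, 0x00])

with_a = correlate(device, app_a)      # attach A
with_ab = correlate(with_a, app_b)     # attach B next to A
only_b = correlate(with_ab, app_a)     # detach A, B stays programmed
```

The length check stands in for the synchronization requirement: both sequences must be aligned bit for bit before they are combined.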
Configuration Data Re-Allocation (De-Fragmentation)
Over time, as a partially reconfigurable device loads and unloads configurations, the unoccupied area of the array is likely to become fragmented, similar to what occurs in memory systems. One of the most critical processes for efficient partial reconfiguration is to determine the appropriate region of the device where a new application should be mapped without affecting the already existing and running applications. The next paragraphs describe the proposed allocation manager, which optimizes the configuration data and determines the spatial location for the partial configuration bitstream. The efficiency of the proposed allocation method depends mainly on the current fragmentation state of the device, while the optimal approach is to perform an FPGA defragmentation after each device reconfiguration.
The proposed allocation method incorporates a mechanism to defragment the reconfigurable array by fusing portions of the unused areas, which results in larger continuous areas in the FPGA device. For that purpose, a number of valid configuration data have to be moved to new locations.
The basic idea of the proposed allocation methodology is that a logic module (or slice) currently being used by a given function has its functionality transferred into another logic module, without disturbing the overall system operation. This relocation mechanism does more than just copying the functional specification of the logic module to be replicated: the corresponding interconnections with the rest of the circuit have to be re-established. Additionally, according to its current functionality, internal state information may also have to be copied. Any reconfiguration action must therefore ensure that the signals from the original logic module are not broken before being totally re-established from its replica; otherwise its operation will be disturbed or even halted. It is also important to ensure that the functionality of the logic module replica is perfectly stable before its outputs are connected to the system, in order to prevent output glitches.
More specifically, in this section we study the following problem: given a set of configurations $\{C_1, C_2, \ldots, C_i, \ldots, C_n\}$ that should be loaded into the device in a specific order, each of which requires a corresponding device area (as defined by a placement and routing tool) equal to $\{\mathrm{Area}_1, \mathrm{Area}_2, \ldots, \mathrm{Area}_i, \ldots, \mathrm{Area}_n\}$, respectively, minimize the value of $\sum_{i=1}^{n} \mathrm{Area}_i$, where $\mathrm{Area}_i$ denotes the array of slices required if we map solely the configuration $C_i$ onto the FPGA device. Also, during the loading of configuration $C_i$, try to reorganize (i.e., re-place) the existing configurations placed onto the FPGA, in order to save as much continuous area as possible.
In order to show the necessity of device defragmentation, Fig. 3 gives a specific example. Here, four applications (marked as "a" to "d") are already running on a 4 × 4 FPGA (Fig. 3(a)), while a new application (marked as "e") that requires a subarray of 2 × 3 slices (Fig. 3(d)) needs to be implemented alongside the others. Even though there are enough empty slices on the device, there is no sufficient continuous area for the new application to be realized.
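The situation can be reproduced with a small occupancy grid. The two layouts below are hypothetical and are not the ones drawn in Fig. 3; "." marks an empty slice and a letter marks the application that occupies it. The helper names free_slices and fits are likewise assumptions of this sketch.

```python
def free_slices(grid: list[str]) -> int:
    """Count the empty slices ('.') on the device."""
    return sum(row.count(".") for row in grid)


def fits(grid: list[str], rows: int, cols: int) -> bool:
    """Return True if an empty rows x cols subarray exists somewhere."""
    height, width = len(grid), len(grid[0])
    for r in range(height - rows + 1):
        for c in range(width - cols + 1):
            if all(grid[r + i][c + j] == "."
                   for i in range(rows) for j in range(cols)):
                return True
    return False


fragmented = ["aa..",
              "..bb",
              "c...",
              "c.dd"]
compacted = ["aabb",
             "cdd.",
             "c...",
             "...."]

for name, grid in (("fragmented", fragmented), ("compacted", compacted)):
    ok = fits(grid, 2, 3) or fits(grid, 3, 2)
    print(name, free_slices(grid), "free slices, 2 x 3 fits:", ok)
```

Both layouts leave eight slices free, but only the compacted one offers a continuous 2 × 3 region for the new application.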
Our proposed re-allocation methodology aims to prevent and solve such problems. The main concept is first to identify the boundaries of each of the existing applications (Fig. 3(b)) and then, iteratively, to move some of the utilized slices so as to preserve as much empty area as possible (Fig. 3(c)), taking into consideration the timing constraints of the already implemented applications.
The next two subsections describe the proposed allocation methodology in more detail. More specifically, Sec. 3.1 gives an algorithm for modifying the contents of a partial bitstream by altering the spatial location of slices, while Sec. 3.2 proposes a methodology for reorganizing (i.e., defragmenting) the partial configuration data across the device. Both of these approaches focus on maximizing the area utilization.
Modifying configuration data of partial bitstream files
The first task of the FPGA defragmentation methodology concerns the appropriate modification of the configuration data for each of the partial bitstreams that form the whole application. The goal of this step is to compact the slice distribution by merging the unused area (unutilized slices) as much as possible. This can be characterized as a bottom-up procedure for optimizing the application placement while keeping the same routing.
In order not to change the application's functionality, the constraints imposed during this task are that neither the Manhattan distance between consecutive slices on the application networks (a network being a sequence of connected logic elements) nor the total number of routing bends in each network may be modified.
To handle this problem, the proposed allocation manager keeps a floorplan of all the device slices (utilized or not). Then, it parses the placement and routing information of the new partial bitstream file (i.e., the one that needs to be added to the FPGA) and moves the slices of each network to new spatial locations, while keeping the same routing properties (i.e., Manhattan distance, number of bends, etc.). The pseudo-code of the developed algorithm for slice re-allocation inside the partial bitstream is shown in Fig. 4.
Initially, the algorithm determines the distance between all the adjacent slices of the partial bitstream file (lines 5–9 of the algorithm). A modified version of the maze router then produces a placement that achieves a higher area utilization ratio (lines 10–28 of the algorithm). The basic concept of this strategy is explained by a specific example shown in Fig. 5. It is assumed that the initial application mapping requires six (6) slices, while the target FPGA incorporates an array of 6 × 6 slices. After the placement and routing, the logic modules are connected as shown in Fig. 5(a). If we apply the proposed re-allocation algorithm (Fig. 4) to the considered example, the allocation manager sets $D_{max} = 6$ and reserves an 11 × 11 array (step 13 of Fig. 4), as shown in Fig. 6(a). The values inside the boxes (step 19 of Fig. 4) represent the Manhattan distance of each slice from the slice denoted as "1" in Fig. 6(b). After searching for all the possible valid sequences of slices (six in our example) that exhibit the same functionality as the sequence of the initial placement (Fig. 5(a)), we obtain the final configuration placement shown in Fig. 5(b) (steps 21–27 of Fig. 4). It should be noted that a sequence is valid if: (i) the order of slices remains unchanged and (ii) the interconnect distance between two successive slices remains unchanged too. In Fig. 6(b), any sequence of slices from 1 to 6 represents a valid placement for the configuration data that realizes the initial application. Based on the algorithm shown in Fig. 4, the configuration that utilizes the available hardware resources most efficiently is chosen (step 26). If there is more than one such placement, the one that maximizes the resource utilization in the horizontal direction (x-axis) is chosen. This leads to the maximum empty rectangular space on the device, and hence to maximum routability for the remaining slices.
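The two validity conditions can be checked with a few lines of code. The sketch below is not the algorithm of Fig. 4: it only tests conditions (i) and (ii) for a given candidate and compares bounding-box areas. The coordinates, the helper names and the extra requirement that no two slices share a location are assumptions of this example, and the constraint on the number of routing bends is not modelled.

```python
Slice = tuple[int, int]  # (x, y) location of a slice on the array


def manhattan(a: Slice, b: Slice) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hop_distances(seq: list[Slice]) -> list[int]:
    """Manhattan distance between each pair of successive slices."""
    return [manhattan(p, q) for p, q in zip(seq, seq[1:])]


def is_valid_sequence(original: list[Slice], candidate: list[Slice]) -> bool:
    """Same slice order (index i maps to index i) and same hop distances."""
    return (len(candidate) == len(original)
            and len(set(candidate)) == len(candidate)  # no shared location
            and hop_distances(candidate) == hop_distances(original))


def bounding_box_area(seq: list[Slice]) -> int:
    xs = [x for x, _ in seq]
    ys = [y for _, y in seq]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)


initial = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2)]   # staircase
compact = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]   # folded 3 x 2
print(is_valid_sequence(initial, compact),
      bounding_box_area(initial), bounding_box_area(compact))
```

Among all valid candidates, a selection step would keep the one with the smallest footprint, here the folded sequence that covers 6 slices instead of a 12-slice bounding box.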
For instance, in the initial configuration assignment, the maximum continuous rectangular area in the device consists of 6 × 3 slices, while the proposed allocation methodology provides an area of 6 × 4 slices. By applying this method recursively to all the networks that compose an application, the configuration data assignment is optimized. This feature is critical especially for applications which require consecutive reconfigurations.
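The size of the maximum continuous empty rectangle can be measured with a standard histogram-based scan, sketched below. The two 6 × 6 layouts are hypothetical stand-ins chosen to reproduce the 6 × 3 and 6 × 4 figures quoted above; they are not the placements of Fig. 5, and the function name is an assumption.

```python
def largest_free_rectangle(grid: list[str]) -> int:
    """Area of the largest axis-aligned rectangle made only of '.' cells."""
    if not grid:
        return 0
    heights = [0] * len(grid[0])
    best = 0
    for row in grid:
        # Height of the free column segment ending at this row
        heights = [h + 1 if cell == "." else 0
                   for h, cell in zip(heights, row)]
        stack: list[int] = []  # column indices with increasing heights
        for i, h in enumerate(heights + [0]):  # trailing 0 flushes the stack
            while stack and heights[stack[-1]] >= h:
                top = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                best = max(best, top * (i - left))
            stack.append(i)
    return best


before = ["x.x...", ".x.x..", "..x.x.", "......", "......", "......"]
after = ["xxx...", "xxx...", "......", "......", "......", "......"]
print(largest_free_rectangle(before), largest_free_rectangle(after))
```

Both layouts use six slices, yet compacting them raises the largest empty rectangle from 18 slices (6 × 3) to 24 slices (6 × 4).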
Determining partial bitstream placement
The second task of the proposed allocation methodology is to find a sufficient area on the FPGA where the new partial bitstream can be placed. The proposed algorithm tries to assign the configuration data to spatial locations of the FPGA so as to maximize the area utilization with respect to the upcoming configurations. A constraint of this task is that the placement should have as little impact as possible on existing or upcoming applications, which leads to maximal area utilization. The algorithmic approach of the proposed configuration file placement procedure is shown in Fig. 7.
The allocation procedure assigns a unique label to every existing bitstream file and then aims to iteratively maximize the hardware utilization across the x-direction of the reconfigurable architecture. Initially, all the configurations are sorted by the area of their occupied slices. After the initial mapping, a simulated annealing-based approach is applied to further optimize the area efficiency. During this step, partial configuration files are selected randomly and swapped. If the derived configuration leads to better utilization of the reconfigurable architecture (i.e., more continuous area is free), then the swap is kept. Otherwise, the swap may still be accepted, with a probability that decreases as the execution time advances.
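The acceptance rule of such an annealing step can be sketched as follows. The text only states that worse swaps are accepted with a probability that shrinks over time; the exponential (Metropolis-style) form, the temperature parameter and the name accept_swap are assumptions of this example rather than details taken from Fig. 7.

```python
import math
import random


def accept_swap(delta_free_area: float, temperature: float,
                rng: random.Random) -> bool:
    """Decide whether to keep a swap of two partial configurations.

    delta_free_area > 0 means the swap frees more continuous area.
    The temperature is assumed to be lowered as execution proceeds.
    """
    if delta_free_area > 0:
        return True                      # better utilization: always keep
    if temperature <= 0:
        return False                     # fully cooled: reject worse swaps
    # Worse (or equal) swap: accept with a probability that falls with T
    return rng.random() < math.exp(delta_free_area / temperature)


rng = random.Random(0)
early = sum(accept_swap(-1.0, 10.0, rng) for _ in range(1000))
late = sum(accept_swap(-1.0, 0.1, rng) for _ in range(1000))
print(early, late)  # many worse swaps accepted early, almost none late
```

Accepting some worse swaps early lets the search escape layouts that are locally compact but block a larger free rectangle elsewhere.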
An example of applying this feature of the proposed allocation manager on an FPGA is shown in Fig. 8(a). We assume an array of 6 × 6 slices and three distinct applications (denoted as "A", "B" and "C") that run on it simultaneously. The partial configuration data for each of these applications is available to the allocation manager. Let us say that application "A" requires 12 slices (i.e., a 4 × 3 array), application "B" needs 4 slices (a 2 × 2 array), and application "C" needs 9 slices (a 3 × 3 array).
Considering the 6 × 6 device, the placement procedure starts from the application with label "A" (step 10), which is placed at the bottom-left corner of the FPGA device (step 11). At this point no other partial bitstream exists on the device, so there is no conflict (step 12).
Step 13 indicates which of the available partial bitstreams is selected for the next placement. Afterwards, the allocation manager finds a valid placement for application "B", because it utilizes the available hardware resources in the x-direction more efficiently (Criterion = TRUE) (steps 13–16). Finally, during the third iteration, application "C" is placed as close as possible to the bottom-right corner of the device. This method guarantees that the non-utilized slices across the device form the maximum empty continuous rectangular shape. For comparison, Fig. 8(b) gives the mapping of the same partial bitstream files onto identical hardware resources, but in a different order (starting from the one that requires the minimum number of slices). Although the total number of unutilized slices in Fig. 8(b) is identical to that of Fig. 8(a), they do not form a larger continuous area of the device than the area shown in Fig. 8(a).
Comparison Results
In order to evaluate the efficiency of the proposed strategy for deriving partial and dynamic reconfiguration files, the proposed allocation methodology is implemented as an extension of an existing bitstream generator tool named DAGGER (DEMOCRITUS UNIVERSITY OF THRACE E-FPGA BITSTREAM GENERATOR) (Ref. 11).
DAGGER ver. 2 is available through the MEANDER Design Framework (part of the AMDREL Project) at http://proteas.microlab.ntua.gr. This tool is capable of configuring any Virtex-like device and is publicly available under the GNU licence for modifications or improvements, in order to support additional features.
The effectiveness of the proposed partial reconfiguration scheme is demonstrated by a series of comparisons in terms of certain design parameters (e.g., hardware resource utilization). We performed the comparison study using the 20 largest MCNC benchmarks (Ref. 14), which are widely accepted in the reconfigurable architectures community. Table 1 shows the results of applying the proposed allocation methodology to the 20 MCNC benchmarks in terms of logic block utilization. The second and third columns correspond to the required number of logic blocks and the total number of networks for each benchmark, respectively. The fourth column gives the number of logic blocks, R, required to place this partial bitstream onto the FPGA, while the fifth column shows the number of logic blocks, S, that can be saved by applying the proposed allocation methodology. This number mainly depends upon the current fragmentation state of the device and the available unutilized resources of the FPGA. The initial placement of an application, and therefore the fragmentation status of the device, are determined by the place and route tool used (EX-VPR (Ref. 2) in our case). The last column shows the percentage of the total area that can potentially be saved if the proposed approach is applied. For our experimental setup, we achieve continuous area savings of about 10.6% on average over all MCNC benchmarks, considering that each benchmark was realized on the smallest square Virtex-like FPGA using the EX-VPR tool.
The results in Table 1 refer to the smallest Virtex FPGA architecture that each benchmark fits into. However, this approach does not show the actual gains from the proposed allocation manager, as each benchmark is realized on an FPGA of a different size. In Tables 2 and 3, we provide results for implementing the MCNC benchmarks on a predefined FPGA device. Having a single FPGA platform for mapping many applications is a common approach. More specifically, we realize the same benchmarks on two different FPGA architectures. The first architecture (Table 2) is composed of a 70 × 70 array of slices (onto which almost all benchmarks can be mapped), while the second one (Table 3) is an architecture with 100 × 100 slices. From these tables, we infer that a remarkable reduction of the actually utilized hardware resources (slices) relative to the available hardware resources can be achieved: 50.6% for the 70 × 70 array and 66% for the 100 × 100 array. Also, the column denoted as "Speedup versus initial" refers to the speedup factor obtained by applying the proposed allocation approach to the 70 × 70 and 100 × 100 FPGAs, compared with the minimum square device (shown in Table 1). As a consequence of the reduction in hardware resources, the proposed re-allocation (defragmentation) manager can perform the placement procedure up to 3.9× faster on average than the conventional placement.
Table 4 provides the power savings for the MCNC benchmarks obtained with the proposed technique for minimizing unnecessary transitions (i.e., reducing the switching activity) inside the bitstream file, as described in Sec. 2. The second column shows the minimum required square FPGA array onto which each benchmark is mapped using the EX-VPR tool. The third column gives the number of bits actually required for implementing each benchmark on a Virtex-like FPGA. The "Null bits" column corresponds to the "don't care" bits of each configuration file, while the last column gives the percentage of power saving achieved. Based on these results, an average power reduction of about 5% can be achieved by employing this low-power technique. Even though this percentage is not remarkable, it should not be seen as an isolated number: it has to be added to the power savings resulting from the remaining tools of the design flow (e.g., synthesis and mapping).
Conclusion
A new systematic, software-supported methodology for deriving partial and dynamic configuration files through a novel allocation manager capable of defragmenting an FPGA device was presented. This strategy is architecture independent, since it can generate bitstream files for any island-style or hierarchical FPGA architecture. The proposed allocation scheme consists of two steps: the first one produces the appropriate configuration data by modifying the slice placement inside each bitstream file while keeping its functionality, and the second one determines the appropriate spatial locations in an FPGA device where the configuration data should be placed. A comparison study demonstrated the efficiency of the proposed allocation methodology.
