Abstract-This paper presents a solution to support the run-time readback, relocation and replication of cores in embedded systems with dynamic and partial reconfiguration capabilities. The proposal is structured as a peripheral that integrates and communicates easily with the rest of the system, and includes an API that makes the reconfiguration details more transparent to software applications. Unlike other proposals, all functionality is implemented in hardware, achieving a higher reconfiguration speed. In addition, several design decisions have been taken to increase the portability of the solution to existing and, possibly, future FPGAs. Finally, a use case is provided, which shows the features of this module applied to the run-time scaling of a hardware coprocessor.
I. INTRODUCTION
The emergence of commercial FPGAs with dynamic and partial reconfiguration (DPR) features has opened promising research opportunities in the field of reconfigurable computing. The benefits and possibilities offered by this technique have been extensively reported, mainly in academic works [1] [2] [3] [4]. An important one is that partial reconfiguration reduces time and memory overheads compared with full reconfiguration. In addition, it allows the design of specific hardware for each particular application, providing advantages in both power consumption and performance. Furthermore, dynamic partial reconfiguration allows the implementation of complex systems in constrained devices, and enhances run-time adaptability for embedded systems working in variable environments. Consequently, due to these advantages and the limits of silicon integration, reconfigurable computing is seen as a must for future embedded systems.
Despite these advantages, the main obstacles to providing partially reconfigurable solutions for commercial purposes are the lack of commercial design flows and tools supporting it [5], as well as the need for specific knowledge of dynamic reconfiguration techniques. In addition, the particularities of each device, and the changes from one generation of reconfigurable devices to the next, make it difficult to create portable and upgradable solutions. To deal with dynamic reconfiguration from scratch, different issues have to be solved. Some of them are the creation of partial configuration files to reconfigure only the desired portion of the device, the tools to relocate these modules to any arbitrary position, and the control of the configuration port and the communication infrastructure compatible with the DPR technique [6].
Regarding the configuration ports, different interfaces are available. Xilinx FPGAs contain the ICAP, an internal port that allows the partial reconfiguration of the device. This leads to the concept of a self-reconfigurable embedded system, that is, a system that can modify its functionality autonomously at run-time. To control the ICAP, Xilinx provides a wrapper called HWICAP that can be integrated in the system as a peripheral through the IBM PLB CoreConnect bus [7].
Regarding the generation of partial bitstreams, that is, configuration files to reconfigure only the desired resources of the device, there has traditionally been an important gap between silicon availability and the appearance of design tools and flows supporting this feature. In addition, partial reconfiguration files are addressed to the position of the device where they were initially mapped. As a result, either a different partial bitstream is generated in advance for each possible reconfigurable region, or the final position of each block is restricted to the original one. With the second option, the system can stall if it requires the execution of a certain hardware block and the corresponding reconfigurable region is not available. An alternative, with low computing overhead and bitstream storage requirements, is to modify the target position of the hardware modules at run-time, during the reconfiguration process. Some solutions exist to carry this out, including both software and hardware implementations, like those reported in [8] and [8]. However, hardware approaches have restricted functionality and are very device dependent, while software ones provide lower reconfiguration rates because of their high overhead.
The work described in this paper provides an easy and fast hardware-based solution for the readback, relocation and replication of partial bitstreams. The main purpose of this solution is to let digital system designers create logic designs that take advantage of partial dynamic reconfiguration without prior training in this field. Additionally, the hardware system was designed to be easily ported to new devices and families of Xilinx FPGAs. This module is designed as a peripheral module, including integration with the rest of the system by means of the CoreConnect technology, and a software API to increase independence between the familyimplementation and the software applic[...] dynamic reconfiguration capabilities offered [...] This independence will be proved with the solution not only across devices within a [...] between different families. Also, the recon[...] process provided by the block, may all[...] modifications (also called mutations) of the b[...]
The rest of the paper is organized as follo brief description of the Xilinx Virtex FPGA details is provided. Section III includes an state-of-the art of the existing hardware solu manipulation, while in section IV the approa introduced. In sections V and VI, implement at hardware and software levels are describ section VII, implementation results are given
II. XILINX VIRTEX RECONFIGURATI
Before introducing the details of the sol this paper, it is necessary to explain some c Xilinx Virtex architecture and its bitstream st Partial reconfiguration of Xilinx FPGA modification of the content of the SRAM that configuration of each element of the dev addressable unit of this configuration me frame, and defines the minimum reconfigu Frames for Virtex families prior to Virtexcolumns that span the device entirely from the other side, Virtex-4, Virtex-5 and Virte more complex architectures, since they a stacked rows of configurable blocks, called c clock region is composed by columns elements. Each frame expands the height of a As a result, not the full height of the d reconfigured at each time, permitting a 2 scenario. This can be seen in Figure 1.
Access to the internal configuration mem among other possibilities, through the intern ICAP port. Partial reconfiguration implies configuration commands, the register progr registers in the area being reconfigured, and itself. For the purpose of this paper, the regis the FAR, where the address of the frame to b be stored; the CRC, a checksum of the conf COR, where the configuration options are se where the FAR data to be configured is store that provides readback configuration informa command register (CMD) is used to trigger machine of the ICAP. Details of this proces the configuration user guides like [10] [11].
To reconfigure a region, the FAR values [...] this region have to be generated. In Virtex d[...] created like the composition of different fi[...] seen in Figure 2. The first one is the positio[...] the top or in the bottom of the device. Then [...] where it is situated has to be indicated. Afterwards, the Major Address, that is, the column inside the clock region, has to be selected. Finally, the Minor Address, the position of the frame inside a column, is identified. The number of frames needed to configure a column depends on the kind of logical elements that it contains. Another field, called block type, selects the configuration layer of the device. In this work, only type 0 elements will be addressed, that is, the interconnect and block configuration type. The concrete value of all these parameters, as well as the number of resources of the FPGA and their relative positions, depend on the particular device.
[...] partial reconfiguration in Virtex-5
Figure 2. Virtex-5 FAR format
As already mentioned, the solution provided by Xilinx to create self-reconfigurable platforms is called HWICAP. This is a wrapper of the ICAP primitive, including some internal memories, an ICAP control state-machine and the ICAP instantiation. This hardware system has a peripheral structure, to be integrated on a SoC, including some Peripheral Interface Blocks to communicate with the PLB, as well as a software driver. This driver includes functions to access the FPGA resources, but not to relocate arbitrary reconfigurable blocks, nor to perform the readback of an arbitrary region.
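As an illustration of how the FAR fields described above compose into a single address word, the following is a minimal sketch in Python. The bit positions (block type 23:21, top/bottom 20, row 19:15, major 14:7, minor 6:0) are an assumption based on the Virtex-5 layout and should be checked against the configuration user guide of the target device. The function names are hypothetical and are not part of the proposed peripheral.

```python
# Assumed Virtex-5 FAR layout (verify against the device configuration guide):
#   block type [23:21] | top/bottom [20] | row [19:15] | major [14:7] | minor [6:0]


def pack_far(block_type: int, bottom: int, row: int, major: int, minor: int) -> int:
    """Compose a frame address from its fields (hypothetical helper)."""
    assert 0 <= block_type < 8 and bottom in (0, 1)
    assert 0 <= row < 32 and 0 <= major < 256 and 0 <= minor < 128
    return (block_type << 21) | (bottom << 20) | (row << 15) | (major << 7) | minor


def unpack_far(far: int) -> tuple:
    """Split a frame address back into (block_type, bottom, row, major, minor)."""
    return ((far >> 21) & 0x7, (far >> 20) & 0x1, (far >> 15) & 0x1F,
            (far >> 7) & 0xFF, far & 0x7F)


# Type 0 frame, bottom half, row 2, column (major) 3, frame (minor) 4.
far = pack_far(block_type=0, bottom=1, row=2, major=3, minor=4)
print(hex(far), unpack_far(far))
```

Keeping the field positions in one place mirrors the idea that only the device description should change when moving to another family.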
III. STATE OF THE ART
Since dynamic reconfiguration of commercial FPGAs became feasible, the relocation of presynthesized bitstreams has been addressed in different works. A first classification of the existing solutions can be made according to where the bitstream is manipulated, that is, off-chip or on-chip. The first external solution was PARBIT [12]. However, in this work, the objective of the designed peripheral is to provide an embedded system with enhanced dynamic reconfiguration capabilities. Consequently, only on-chip solutions will be analyzed. Among them, a further classification can be made depending on the processing approach, since both embedded software and hardware solutions exist. Examples of software-based solutions are XPART, a C version derived from the set of JBits Java classes, and pBITPOS [13], an embedded version of BITPOS. A good summary of software-based solutions is provided in [14]. However, since software-based solutions incur too much overhead for run-time operation, this state of the art will be restricted to hardware approaches.
Focusing on embedded hardware solutions, a remarkable approach is the REPLICA (Relocation per online Configuration Alteration) filter proposed in [15]. This solution is based on filtering the bitstream during relocation, in order to reduce the process overhead. Basically, the idea is to implement a finite state machine that parses the bitstream during configuration, identifying the different fields and commands. Among these fields, it is necessary to detect the FAR, in order to replace the original addresses where the block was synthesized with the final addresses where it will be relocated, as well as the CRC. The generation of the new addresses where each frame has to be relocated is the heart of the filter. Since this approach is restricted to Xilinx Virtex and Virtex-E FPGAs, it only addresses 1D relocations. The address calculation is reduced to obtaining a constant offset value to be added to the original major address (MJA). A correction factor has to be included to skip BRAM columns. The calculation of the FAR in Virtex-E devices is more difficult, so some necessary operations are performed in advance. The filter also includes an FPGA Type Decoder, which generates from the device identification code some chip-specific parameters needed to calculate the addresses. In addition, a block to generate the new CRC value after bitstream manipulation is included. This block introduces no overhead, since the next MJA is calculated while a column is being reconfigured. The output of this block is the manipulated bitstream, which can be used to feed the ICAP or other interfaces such as SelectMAP. Details of the control of the reconfiguration port are out of the scope of this work. The authors also provide a configuration manager to control the data transfers from the bitstream memory to the port. The relocator can be placed between the bitstream memory and the configuration manager.
A new version including support for Virtex-2/-Pro devices, called REPLICA2Pro, was introduced in [8] . It also includes support for the relocation of tasks that make use of BlockRAM and multiplier columns.
The BIRF (Bitstream Relocator Filter), presented in [16], is similar to REPLICA in the sense that it is also based on a filter approach and is restricted to 1D relocation in Virtex FPGAs. BIRF claims better performance, although no performance figures were reported for REPLICA. The main difference is a simplification of the Parser FSM, limiting its behavior to detecting the FAR and the CRC. An enhanced version of this proposal, reported in [8], solves the two-dimensional relocation problem, addressing Virtex-4 and Virtex-5 devices. The FAR is also calculated by the parser, and the calculation depends on the differences between Virtex-4 and Virtex-5 devices. An optimized version is provided, including a bypass value for the CRC that avoids the calculation of the new CRC.
Another approach is the Self-reconfiguring Platform (SRP) by Blodget et al. in [17] and [18], which includes the idea of attaching the hardware reconfiguration subsystem as a peripheral of the embedded MicroBlaze processor, using the CoreConnect bus technology. Inside the hardware subsystem, a cache is included, with capacity to store a single configuration frame. The software component of the platform defines a low-level API, in charge of the ICAP-cache memory control, and a high-level API programmed in C called XPART, derived from JBits. As a result, it contains methods for random access to bitstreams, as well as for relocating partial bitstreams. This software component contains the read/modify/write operation for configuration data, which enables fine-grain hardware modification, like changing LUT equations, as well as the copyCLBModule function, which copies a rectangular region of configuration memory and pastes it in another position. This can be very useful to some applications, avoiding the access to the external memory. In [19], based on this Self-Reconfigurable Platform, the concept of merge reconfiguration is introduced. Instead of directly writing the new configuration data in the device memory, the existing configuration is first read and then merged, frame by frame, with the new data. This allows the inclusion of static routing, as well as the reconfiguration of blocks occupying less than a frame height. This idea was initially addressed to Virtex-2 devices, but has also been successfully tested in Virtex-4. In [20], a software version of this read-modify-write method is also exploited.
The PRR-PRR reconfiguration approach introduced in [21] relocates frames from one partially reconfigurable region (PRR) in the device into another region. This is done on a frame-by-frame basis, in order to accelerate the relocation process and to reduce the bitstream storage requirements. However, this approach incurs a large overhead, because the header sequences are repeated for each pair of frames, and it offers no possibility of reconfiguring blocks from the external memory. Furthermore, the algorithm to relocate a full rectangular region is implemented in software.
The solution proposed in this paper builds incrementally on previous works, while contributing some original implementation ideas and enhanced functionality.
IV. PROPOSED RELOCATION SOLUTION
The solution for bitstream relocation proposed in this paper can be considered a hardware-based one, with enhanced functionality. As a result, this platform offers fast configuration, with features offered by software solutions, but still with reduced area overhead, and following a generic design approach to increase the portability of the solution. Furthermore, it is compatible with cascaded bitstream filters in order to produce mutations of the bitstream for other purposes like, but not limited to, evolvable hardware solutions, redundancy addition for fault tolerance or side-channel attack protection.
The hardware system is integrated as a peripheral structure, to be used with the microprocessor embedded in the FPGA (hard or soft core), making it compatible with the CoreConnect Technology. In addition, its file structure fits perfectly with Xilinx's EDK peripheral cores, so a user with no expertise in dynamic reconfiguration can just add it to an EDK project the same way any other core is added. The system designer will have access to the resources provided by the hardware block by means of a software API.
In addition to bitstream relocation, among the hardware implemented options, bitstream readback, replication and relocation of full configurable columns have been implemented. This approach also allows performing fine-grain modifications during the relocation process. As a result, this peripheral can be seen as a final solution to easily provide the dynamic reconfiguration capabilities developed so far. Another characteristic that can be very useful to certain applications is the copy and paste property, from a position in the internal memory to a different one. This can be very useful in applications like the scalable coprocessors shown in [22]. This approach also allows a read-modify-write operation, like the one proposed in [19].
The peripheral module has been designed following a generic and modular design approach, in order to easily adapt the core to new functionality, to other bus protocols, as well as to other FPGA families. Also, to improve the portability to new devices and between different members of the same family, the module in charge of address generation relies on a VHDL package that describes the elements of the FPGA, which can be changed and resynthesized. Furthermore, an important structural difference, compared with other approaches, is that this solution, instead of relocating partial bitstreams, self-generates headers and tails of the bitstreams internally, and only receives the pure configuration data as an input. As a result, the platform has been designed so that it can be adapted to other families of FPGAs just by exchanging specific VHDL packages. In addition, the size of the configuration data of the core is reduced; on the other hand, the relocation process is easier, since no parser has to be implemented. Furthermore, to allow a faster portability to new devices, as well as to simplify the generation of the ICAP control signals, the Xilinx HWICAP is included as a subcomponent. This ICAP already includes the configuration control machine.
In the following sections, hardware implementation details, the software interface and the implementation results will be shown.
V. IMPLEMENTATION DETAILS
As already mentioned, the peripheral module is completely compatible with the CoreConnect technology, allowing the easy integration of this block with the rest of the system. To this end, an IPIF has been included, as well as some control and data registers, accessible from the processor, and a FIFO memory. This memory stores the configuration data from the processor, as well as readback configuration frames, to be replicated and relocated in different regions of the reconfigurable device.
Figure 3. Dynamic Reconfiguration Peripheral Hardware Architecture
In addition, the hardware system includes some blocks in charge of control issues, the generation of the FAR, the generation of the header and the tail, as well as the control of the ICAP. In the following subsections, further details of the hardware implementation are provided.
A. VHDL packages
Reconfiguration procedures for all devices of the same Virtex family are equivalent. However, they differ in many parameters and specific values, both architecture dependent and regarding the state-machine control details. This affects not only the generation of the necessary commands to perform the partial reconfiguration, but also the generation of the FAR addresses. To solve this problem, instead of storing all the parameters for all devices in the family, two VHDL packages are generated. The first one includes the specific architecture details to generate the FAR, and the second one contains the command sequences that constitute both header and tail of the bitstream. The elements of the FPGA architecture are indexed in XY coordinates and represent FPGA columns, not frames. When the peripheral is instantiated and resynthesized for a different device, the parameters stored in these files are used by the tool, creating specific hardware for each particular device. For new devices in the market, these packages might be created easily, provided the structure is kept similar. This way, it will be possible to shorten the development of reconfigurable systems when new families appear in the market, thus narrowing the aforementioned tools gap.
B. FAR Calculation
The inputs of this module are the XY coordinates of the lower-left and upper-right corners of the rectangular area that is going to be read or written. The block takes the information stored in the architecture device package to generate every frame address that spans the rectangular region to be reconfigured. This subsystem is a simple state machine that just reads the proper elements of the architectural package. It goes through each element of the package's array and takes its frame address fields. To generate the minor addresses, it counts from zero up to the number of frames of the currently selected column. When all the addresses of the selected column have been generated, it moves on to the next column. This process goes on until the last column is reached. Because the state machine depends only on the package contents, new FPGAs with different architectures could be easily supported.
At the output of this block, a FIFO memory is used, to store the generated addresses. This FIFO allows the synchronization with the data coming from the processor. When all addresses have been generated and sent, a false address (an address in which every bit is set to '1') is sent to the FIFO to indicate that the previous address was the last one.
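The address generation described in this subsection can be sketched in software as follows. This is a minimal Python model, not the VHDL state machine of the peripheral: the `Column` and `Row` tables stand in for the architecture package, their contents are made up for illustration, the traversal order (rows outermost) is an assumption, and the address packing assumes a Virtex-5 style layout with type 0 frames only (top/bottom at bit 20, row at bits 19:15, major at bits 14:7, minor at bits 6:0).

```python
from typing import Iterator, List, NamedTuple


class Column(NamedTuple):
    major: int   # major address of the column within its clock region
    frames: int  # number of frames (minor addresses) in this column


class Row(NamedTuple):
    bottom: int  # assumed encoding: 0 = top half, 1 = bottom half
    row: int     # clock-region row within that half


END_MARKER = 0xFFFFFFFF  # all-ones "false address"; the 32-bit width is assumed


def generate_fars(columns: List[Column], rows: List[Row],
                  x0: int, y0: int, xf: int, yf: int) -> Iterator[int]:
    """Yield one frame address per frame in the rectangle, then the end marker."""
    for y in range(y0, yf + 1):
        r = rows[y]
        for x in range(x0, xf + 1):
            col = columns[x]
            # Minor addresses run from 0 to frames - 1 for the selected column.
            for minor in range(col.frames):
                yield (r.bottom << 20) | (r.row << 15) | (col.major << 7) | minor
    yield END_MARKER


# Hypothetical device description: three columns, two clock-region rows.
columns = [Column(major=0, frames=36), Column(major=1, frames=36), Column(major=2, frames=30)]
rows = [Row(bottom=1, row=0), Row(bottom=0, row=0)]

addresses = list(generate_fars(columns, rows, x0=0, y0=0, xf=1, yf=0))
print(len(addresses))
```

Porting the sketch to a different device only requires replacing the two tables, which is the role the architecture package plays in the hardware.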
C. Bitstream Commands generation
The idea of this block is very similar to the FAR generator. It reads the sequence of commands from the commands VHDL package, and feeds the ICAP with it. In addition, it inserts the generated FAR, for each frame, in the proper position of the bitstream.
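A rough software analogue of this composition step is sketched below in Python. It assumes that the stream consists of a header, then for every frame a short command prefix followed by the frame address and the frame data, and finally a tail. The function name and all word values are placeholders chosen for illustration; they are not real ICAP commands, which would come from the commands package of the target device.

```python
from typing import Iterable, List, Sequence, Tuple


def build_stream(header: Sequence[int], frame_prefix: Sequence[int],
                 tail: Sequence[int],
                 frames: Iterable[Tuple[int, Sequence[int]]]) -> List[int]:
    """Interleave fixed command sequences with generated addresses and raw data."""
    words = list(header)
    for far, data in frames:
        words.extend(frame_prefix)  # commands that precede each frame address
        words.append(far)           # generated frame address
        words.extend(data)          # pure configuration data for this frame
    words.extend(tail)
    return words


# Placeholder words only; not actual configuration commands.
HEADER = [0xAAAA0001, 0xAAAA0002]
FRAME_PREFIX = [0xBBBB0001]
TAIL = [0xCCCC0001]

stream = build_stream(HEADER, FRAME_PREFIX, TAIL,
                      [(0x100000, [1, 2]), (0x100001, [3, 4])])
print([hex(w) for w in stream])
```

Because the header and tail are supplied as data, the same routine serves any device for which the command sequences are known.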
D. ICAP wrapper
The differences in the control and the interface of the ICAP primitive, depending on the specific device, make it difficult to build portable solutions. Therefore, instead of instantiating the ICAP port directly in this solution, the HWICAP wrapper provided by Xilinx is used. The connection with the wrapper is made through the IPIF included in the HWICAP. This IPIF is completely standard, and compatible with new generations or versions of this wrapper.
E. Control blocks
Other modules of the system are small state-machines in charge of the generation of different control signals, like the control of the ICAP through the IPIF, as well as the execution of some tasks like the inclusion of headers and tails, or the information merging from different sources (data from external sources or from readback, relocated addresses, headers and tails).
VI. SOFTWARE API
The proposed peripheral includes a software API to ease its integration in the system, as well as to hide the reconfiguration details from the software applications using it. The proposed interface is based on the definition of a set of commands that are stored in the peripheral registers, allowing the control of the hardware subsystem from the processor. The block has six registers: the command register, a data register, and four coordinate registers to communicate two positions in the FPGA. The possible contents of the command register are shown in Table 1.
Table 1. Software API commands
Command | Effect
READ2EXTERNAL | Reads the configuration of a region from the FPGA configuration memory and sends it to the processor.
READ2INTERNAL | Reads the configuration of a region from the FPGA configuration memory and stores it in the internal FIFO.
WRITEFROMEXTERNAL | Writes the configuration of a region from the processor to the FPGA configuration memory.
WRITEFROMINTERNAL | Writes the configuration of a region from the peripheral FIFO memory to the configuration memory.
In addition, two bits of this register are used to signal data transfers between the processor and the peripheral. The addresses of the rectangular regions involved in the operations are introduced through the registers X0, XF, Y0, and YF. These addresses are conventional coordinates of the FPGA, starting from the bottom-left corner.
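To make the register-level protocol more concrete, the following Python sketch models how a driver might issue commands and replicate a region using the internal FIFO. The command names and the X0, XF, Y0, YF registers come from the text; everything else is an assumption made for illustration, including the numeric command encodings, the register indices, the order of the register writes, and the `FakeBus` class that stands in for the processor bus.

```python
from enum import IntEnum
from typing import List, Tuple


class Command(IntEnum):
    # Numeric encodings are hypothetical; only the names come from Table 1.
    READ2EXTERNAL = 0
    READ2INTERNAL = 1
    WRITEFROMEXTERNAL = 2
    WRITEFROMINTERNAL = 3


# Hypothetical register indices for the six registers of the block.
REG_COMMAND, REG_DATA, REG_X0, REG_XF, REG_Y0, REG_YF = range(6)

Region = Tuple[int, int, int, int]  # (x0, y0, xf, yf), origin at bottom-left


class FakeBus:
    """Stand-in for the processor bus: records every register write."""

    def __init__(self) -> None:
        self.log: List[Tuple[int, int]] = []

    def write(self, reg: int, value: int) -> None:
        self.log.append((reg, value))


def issue(bus: FakeBus, command: Command, x0: int, y0: int, xf: int, yf: int) -> None:
    """Program the region corners, then write the command to start the operation."""
    bus.write(REG_X0, x0)
    bus.write(REG_XF, xf)
    bus.write(REG_Y0, y0)
    bus.write(REG_YF, yf)
    bus.write(REG_COMMAND, int(command))


def replicate(bus: FakeBus, src: Region, destinations: List[Region]) -> None:
    """Read one region into the internal FIFO and write it to each destination."""
    issue(bus, Command.READ2INTERNAL, *src)
    for dst in destinations:
        issue(bus, Command.WRITEFROMINTERNAL, *dst)


bus = FakeBus()
replicate(bus, (0, 0, 1, 0), [(2, 0, 3, 0), (4, 0, 5, 0)])
print(len(bus.log))
```

A real driver would also poll the status bits of the command register between operations; that handshake is omitted here because its details are not given in the text.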
VII. EXPERIMENTAL RESULTS
In this section, implementation results, as we the reconfiguration peripheral, are provided.
A. Implementation results
Synthesis values are compared with the si[...] The first analysis is offered in terms of resou[...] Table 2 [...] In Figure 4, the header sequence introduced before a reconfiguration process is shown, while the tail is reported in Figure 5. To reconfigure each frame, as can be seen in Figure 6, it is necessary to introduce in the ICAP the [...] and the generated address value, fo[...] corresponding configuration data. To reconf[...] region, the process of a single frame has t[...] repeated, preceded by the header, and finish[...] The full reconfiguration process is seen in th[...]ing the header, the [...]inal tail. The reconfiguration time of a region can [...] follows (in clock cycles): [...] Table 3, actual values for these parameters [...]
B. Use case application
To validate the block, the proposed peripheral module has been used to create a dynamically scalable coprocessor. Scalable coprocessors are a possibility to adapt the functionality of some hardware tasks, depending on run-time variable application requirements, or also on changing system conditions, like available power or chip area.
The solution proposed in this paper is also seen as a possibility to carry out fast size adaptation of scalable coprocessors. To prove the correctness of this solution, it has been employed together with the architecture proposed in [22]. The idea is to reconfigure the peripheral from a 3 × 3 to a 5 × 5 size. However, due to the homogeneity of the modules, a single processing element of the array can be read back, and later replicated in all the new positions. As can be seen in Figure 8, the element has to be replicated into 16 different positions. Each processing element requires 72 reconfiguration frames (2 CLB columns by 36 frames for a CLB column). According to the previously reported equations, the readback process takes 100 cycles, and the write process 1536. At 100 MHz, the required time for this operation is 16.36 µs.
[...] scalable coprocessor
In this case, since the readback data is already in the peripheral memory, no overhead for the transfer from the external memory is incurred. Furthermore, to generate new processing elements, the proposal of this paper can also be used, configured as a readback to the external memory. For larger reconfiguration areas which are read from external sources, the provision of data to this block requires fast data access to the peripheral, but [23] shows that maximum speed may be achieved by using DMA schemes.
VIII. CONCLUSIONS AND FUTURE WORK
Real-time applications that make use of run-time dynamic reconfiguration are becoming a key design topic, because they combine the speed of hardware solutions with the flexibility of reconfigurable systems. In this context, and especially when dealing with restricted devices (in logic resources and/or power), fast and autonomous reconfiguration is of main importance.
The proposed solution provides extended features, comparable with software-based approaches, for run-time reconfiguration modules, but at speeds that are close to the maximum achievable with today's reconfiguration ports. Additionally, the block is seamlessly integrated with the bus technologies available for SoPC systems, and the abstraction level at which reconfiguration is handled relieves the system designer of needing deep knowledge of reconfiguration (apart from the knowledge required to produce a reconfigurable block layout, which is assumed to be done by a hardware designer with much higher expertise in reconfiguration). An effort has been made to generalize the way different FPGA families behave with dynamic reconfiguration, and the design of the peripheral module based on VHDL configuration packages opens up an opportunity to rapidly migrate the reconfiguration process to upcoming families, with reduced development times. Experimental results show that a full-featured hardware-based solution still requires very little area overhead, at very high speeds. Future work goes in three directions. First, the possibility of adding more bitstream filters (mutators) opens up the enhancement of the block with other features like increased fault tolerance by automatic circuit modifications, protection against side-channel attacks, etc. Second, the 300 MHz frontier reported in [23] for maximum programming speed is a challenge for our all-at-one relocation and programming feature.
Third, the use of this module in a variety of scenarios where fast reconfiguration is required, such as video processing with inter-frame algorithm adaptation or evolvable systems, is a broader research line that is currently being considered.
