Abstract: Runtime relocation of circuits on field-programmable gate arrays (FPGAs) has been proposed for achieving many desirable features, including fault tolerance, defragmentation, and system load balancing. However, changes in the architectural composition of FPGAs have made relocation more challenging, mainly because FPGAs have become more heterogeneous. Previous and state-of-the-art circuit relocation systems on FPGAs have relied only on direct bitstream relocation, which requires the source and destination resource layouts to be the same, as well as access to the design bitstream for manipulation. Hence, their effectiveness on modern heterogeneous chips is greatly reduced, and they mostly cannot be applied to encrypted bitstreams of intellectual property blocks. In this brief, we present a circuit relocator which augments direct bitstream relocation with a functionality-based relocation scheme. We demonstrate the feasibility of the proposed technique using a CORDIC application and show that, for this case study, an average increase of over 2.6-fold in the number of relocations can be obtained compared to direct bitstream relocation alone, at the expense of a small memory overhead and a manageable relocation time.
I. INTRODUCTION AND RELATED WORKS
FIELD-PROGRAMMABLE gate arrays (FPGAs) have gained increasing popularity in many domains, including consumer electronics, aerospace, defense, scientific instruments, autonomous vehicles and various video processing applications [1], [2]. In embedded systems, they have continued to be popular because of their unique combination of performance and flexibility. A key feature of modern FPGAs is that circuits (or parts of circuits) configured on them can be removed when not needed, to make room for other circuits or to modify the functionality of the system, a feature called dynamic partial reconfiguration (DPR). In addition, circuits can be moved from one location to another on the FPGA, a process termed relocation. Relocation of circuits on FPGA chips is beneficial for many reasons. Three important ones are: to circumvent permanent damage on chips and consequently improve the fault tolerance of critical applications in hostile environments such as space [3], to achieve defragmentation of the chip area [4], and to maintain a desired thermal distribution on the chip [5].
A major condition that must be satisfied for a circuit to be relocatable at runtime is that the resource composition of the original location for which the circuit was synthesized must be the same as that of the intended destination. That is, the source and destination are required to have identical chip area, not only in size, but also in the type, number and order of the resources they contain. This condition was easily satisfied in older generations of FPGAs, which were essentially homogeneous. Modern FPGA chips, in a bid to improve performance and lower power consumption, have hard blocks such as memory blocks (BRAMs) and digital signal processors (DSPs) sandwiched between the conventional configurable logic blocks (CLBs) [6]. In addition, these BRAMs, DSPs and CLBs sometimes have different orientations (left and right) which differ in routing types, as in the Xilinx 7 series FPGAs. Thus, FPGAs have become increasingly heterogeneous, and this places greater restrictions on the relocation of circuits. As a result, the number of relocations possible for typical circuits has fallen with newer generations of FPGAs. Figure 1 illustrates this point. The figure shows that while on a homogeneous chip the circuit on LOC 1 could be relocated to 2 additional identical locations (LOC 2 and LOC 3), on the heterogeneous chip no identical location can be found.
Most of the circuit relocation systems in the literature still demand that identical location(s) exist on the chip before any form of circuit relocation is possible [5], [7]-[10]. They address only direct bitstream placement and relocation and do not provide any means of relocating circuits to non-identical locations. Becker et al. [11] reported a technique that successfully relocated a design bitstream synthesized for a location containing a set of CLBs and an unused DSP to another location with the same CLB layout but with an unused BRAM replacing the DSP. However, the technique is based on online editing of configuration bitstreams, which can be time consuming. In addition, the routing of the DSP and the BRAM is required to be identical, which is not the case in recent FPGA architectures like the Xilinx 7 series. Hence, we consider the work presented in this brief as novel, as we are not aware of any other functionality-based relocation scheme in the literature.
1549-7747 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
In this brief, we propose a relocation manager to increase the number of possible relocations for a circuit on heterogeneous FPGAs. The proposed relocator augments direct bitstream relocation with a functionality-based relocation. The functionality-based relocation presented in this brief relies on replicating the functionality of selected circuits with a look-up table (LUT) or a memory block at runtime. The basic idea is shown in figure 2.
The rest of this brief is organized as follows: Section II discusses the proposed system architecture and presents the details of the relocation flow proposed. Implementation details of the proposed design are presented in Section III, in addition to a case-study application. Experimental results are presented and discussed in Section IV, and Section V gives a conclusion.
II. SYSTEM OPERATION AND ARCHITECTURE
Our complete relocation flow occurs in two stages: first, the possibility of relocating a circuit by direct bitstream is evaluated; when this is not possible or leads to undesirable effects, functionality-based relocation is used instead. That is, the proposed functionality-based relocation is performed when an exactly matching position for the circuit's original bitstream is either not available or would lead to undesirable effects, such as increased fragmentation of the chip area. Our flow for relocating circuits by direct bitstream can be found in [10]. The following description focuses on the functionality-based relocation aspect. A circuit to be relocated using this technique has its computation results memorized during its normal operation. A bitstream of an LUT or memory resource template is pre-synthesized and stored in an off-chip memory at design time. When relocation is required at runtime, a destination location is configured with the bitstream template, and its memory content is filled with the previously memorized outputs of the original circuit.
The operational flow of the relocation mechanism is shown in figure 3. When a request is received to relocate a circuit (after finding an exact location for the original bitstream on the chip has proved infeasible or unprofitable), a duration evaluator checks whether the timing constraints associated with the relocation request can be met.
Next, an area check is done to find a suitable location for a pre-synthesized memory template. The time required for a relocation procedure is detailed in Section II-B below, while Section II-C explains the procedure for the area check. If both checks are successful, then the relocation request is accepted and executed in 3 additional steps:
i) The outputs of the circuit not present in memory are computed and saved.
ii) A memory template is configured on the chip.
iii) Data is copied from the original circuit's memory onto the already configured template.
These operations are managed by various units of a relocation module discussed below. The proposed relocation module consists of an Output Memorizer, a Duration Evaluator, and an Area Finder.
Fig. 3. Operational flow of the proposed functionality-based relocation system.
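The decision flow described above can be summarized in a few lines of Python. This is only a minimal sketch: the function name, the boolean parameters and the action strings are hypothetical, and `direct_possible` is assumed to mean that direct bitstream relocation is both feasible and free of undesirable effects.

```python
def handle_relocation(direct_possible, meets_deadline, area_available):
    """Return the ordered actions taken for one relocation request.

    All three flags are assumed to be computed elsewhere (direct-bitstream
    check, duration evaluator, area finder); names are illustrative only.
    """
    if direct_possible:
        # Stage 1: an identical, desirable location exists.
        return ["direct_bitstream_relocation"]
    if not meets_deadline:
        return ["reject: timing constraint cannot be met"]
    if not area_available:
        return ["reject: no location for memory template"]
    # Stage 2: functionality-based relocation, steps i) to iii).
    return [
        "compute_and_save_missing_outputs",
        "configure_memory_template",
        "copy_memorized_outputs_to_template",
    ]
```

For example, `handle_relocation(False, True, True)` yields the three execution steps in order, while a failed duration or area check rejects the request before any reconfiguration takes place.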
A. Output Memorizer
The output memorizer saves the results of computations of selected circuits in memory at runtime. Thus, it connects to the circuits whose outputs it memorizes. It has 3 units: task memory, evaluation logic (which we shall call memo logic) and output memory. These are shown in figure 4. The task memory saves the list of circuits which are currently configured on the FPGA chip and are potentially relocatable by functionality. The memo logic manages the conversion of the raw inputs to address values, determines whether the output for an input has been previously saved, and switches mode to save the current output of the application when it has not. Each circuit has a unique identifier (Circuit ID) which corresponds to its address in the task memory (Base_Addr). The memo logic has a fixed overhead of 3 clock cycles when operating in the CHECK mode, where it verifies whether an input has been previously saved, and an overhead of 2 clock cycles in the SAVE mode, where it saves an output onto its output memory if not already saved. The fixed number of clock cycles is achieved by concatenating the inputs of a circuit into a unique address value (Base_Addr + offset), with Base_Addr being the start of the memory region assigned to the circuit and offset determined using information on the circuit's inputs and tolerance.
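As an illustration of the address computation, the sketch below concatenates fixed-width inputs into one word and adds the base address. The function name and parameters are hypothetical, and modelling tolerance as discarding a number of least significant bits is an assumption made here for illustration; the text does not specify how tolerance enters the offset.

```python
def memo_address(base_addr, inputs, widths, tolerance_bits=0):
    """Map a circuit's inputs to an output-memory address (Base_Addr + offset).

    inputs/widths: input values and their bit widths, most significant first.
    tolerance_bits: ASSUMPTION, low-order bits ignored to model tolerance.
    """
    word = 0
    for value, width in zip(inputs, widths):
        if not 0 <= value < (1 << width):
            raise ValueError("input does not fit its declared width")
        word = (word << width) | value  # concatenate inputs bit-wise
    return base_addr + (word >> tolerance_bits)
```

Because the mapping is a pure shift-and-add, the lookup cost does not depend on what is stored, which is in keeping with the fixed cycle counts quoted for the CHECK and SAVE modes.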
The output memory contains the results of computations. An application with multiple outputs has these outputs concatenated and saved at a single address. The least significant bit (LSB) of each output memory location is reserved as a validity flag for the value stored at that address, as shown in figure 5. This bit is checked to determine whether the results of a computation are available in memory. A value of '1' indicates that a value has been saved and is valid; a '0' means that a valid output is missing for this input and the original circuit would have to compute it.
To compute missing outputs in runtime after a request to relocate a circuit is received, the memo logic iterates through the LSBs of the section of its own output memory dedicated to memorizing the circuit's outputs. The LSB of a missing output has a value of '0'. The address indices (which correspond to inputs) of missing outputs are then each decoupled and fed into the original circuit as inputs for it to compute corresponding outputs.
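The validity-bit scan can be sketched as follows. The memory is modelled as a Python list of integers, the offset within a circuit's region stands in for its concatenated input, and `circuit` is a hypothetical callable representing the original hardware; none of these names come from the design itself.

```python
def pack(output):
    """Store an output with the LSB set to 1 to mark it valid."""
    return (output << 1) | 1


def is_valid(word):
    """A word is valid when its reserved LSB is 1."""
    return word & 1 == 1


def fill_missing(memory, base_addr, size, circuit):
    """Scan the LSBs of one circuit's region and compute missing outputs.

    Returns the number of outputs that had to be computed (n).
    """
    filled = 0
    for offset in range(size):
        addr = base_addr + offset
        if not is_valid(memory[addr]):
            # The address index is fed back to the circuit as its input.
            memory[addr] = pack(circuit(offset))
            filled += 1
    return filled
```

The returned count corresponds to the number of missing outputs that the duration evaluator has to account for when a relocation request arrives.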
The sizes of the task and output memories of the Output Memorizer are determined by the number of relocatable circuits on the chip, the sum of the number of inputs of the constituent circuits, and the tolerance of the circuits. By tolerance, we mean the permissible variation in a circuit's outputs. Since this technique requires that space be reserved for all potential outputs, its memory overhead could be a major bottleneck for large-port-width applications that require numerous distinct outputs to be saved. Hence, we acknowledge that to keep the memory requirement reasonable, the port width of the circuits which can be relocated using this mechanism must be small, or, if the port width is large, the application tolerance must be large as well. Moreover, the functionality-based relocation proposed in this brief is only applicable to circuits which are referentially transparent, that is, circuits implementing systems that produce the same set of outputs for the same set of inputs. Circuits whose current outputs depend on some internal state, or are determined by factors other than the current input(s), are not directly relocatable by the technique proposed in this brief. Nevertheless, there are many applications which can profit from the proposed scheme even with these limitations. Three examples are: an RGB to YCrCb colour conversion circuit, which is widely used in computer graphics; CORDIC circuits designed to compute trigonometric functions of angular inputs; and multiplier circuits, which form the basis for many other applications.
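To make the scaling argument concrete, the following sketch estimates the output-memory size for one circuit. It assumes one entry per distinct input value, one extra validity bit per entry, and tolerance modelled as a number of ignored low-order input bits; the function name and the widths used in the example are illustrative only and are not taken from the case study.

```python
def output_memory_bits(input_bits, output_bits, tolerance_bits=0):
    """Estimate the bits needed to memorize every output of one circuit.

    ASSUMPTIONS: one entry per distinguishable input, one valid bit per
    entry, tolerance expressed as ignored low-order input bits.
    """
    if not 0 <= tolerance_bits <= input_bits:
        raise ValueError("tolerance_bits must lie between 0 and input_bits")
    entries = 1 << (input_bits - tolerance_bits)
    return entries * (output_bits + 1)  # +1 for the validity LSB


# Each additional input bit doubles the requirement.
small = output_memory_bits(10, 16)
assert output_memory_bits(11, 16) == 2 * small
```

Under these assumptions, a wider port can be offset by a larger tolerance: a 12-bit input with 2 tolerated bits needs the same storage as an exact 10-bit input.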
B. Duration Evaluator
This unit checks whether the requested relocation can be completed within the time constraint associated with the request. Its architecture consists of an LUT RAM which contains the essential parameters of the circuits, including the durations associated with each circuit's operations: the configuration time, the number of clock cycles for the computation of outputs (e), and the duration of data transfer from the output memorizer's memory to a memory template. The time constraint of a relocation request is evaluated using (1). The term R_t in (1) is the total time required for relocation; C_t is the greater of the time required for the memory template to be configured on the chip and the time required for the area finder module to execute; e is the time required to compute one missing output of the circuit(s) to be relocated, with n being the number of missing (yet to be saved) outputs; and M_t is the time required for the memorized content to be transferred to the template. It is worth noting that the operation of the area finder and the computation of the missing outputs of a circuit to be relocated are done in parallel; thus C_t in (1) takes the value of the greater of the times required to complete these two operations. Like the configuration duration of the circuit, e is initially measured at design time. However, since e depends on the architecture and functionality of a circuit, when these are changed by DPR, its new value is measured (by observing the time the updated circuit takes to turn a set of inputs into outputs) and updated at runtime.
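Equation (1) itself does not appear in this copy of the text. From the term definitions above, an additive form R_t = C_t + n·e + M_t is a plausible reading, and it agrees with the breakdown reported in Section IV (82.30 µs + 175.36 µs + 49.14 µs = 306.80 µs). The sketch below evaluates a request under that assumed form; the function names and the example values are hypothetical.

```python
def relocation_time_us(c_t_us, n_missing, e_us, m_t_us):
    """Total relocation time, ASSUMING R_t = C_t + n * e + M_t.

    c_t_us:    C_t, already taken as the greater of the overlapped durations
    n_missing: n, number of outputs not yet memorized
    e_us:      e, time to compute one missing output
    m_t_us:    M_t, time to copy memorized data to the template
    """
    return c_t_us + n_missing * e_us + m_t_us


def accept_request(total_us, deadline_us):
    """Accept only if the relocation fits within the request's constraint."""
    return total_us <= deadline_us


# Illustrative values only (not measurements from the case study).
total = relocation_time_us(80.0, 100, 2.0, 50.0)
assert accept_request(total, 1000.0)
```

Since C_t and M_t are fixed for a given application, only the n·e term changes with the moment at which the request arrives.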
C. Area Finder
The area finder basically checks if there is an area on the chip for a template to be placed on. It has access to a RAM containing the state of the chip (State Memory), as well as a memory containing all the potential locations of the template. The State Memory represents the state of all resources on the chip by an M × N Matrix, where M and N are respectively the number of rows and columns in the device. An available resource is represented by a '0' and a used or damaged resource by a '1'. Thus, each element in the matrix defines the state of a specific reconfigurable resource on the chip. A scan function is used to check the availability of potential locations for the circuit in the light of the current state of the chip. Further details of the scan procedure can be seen in [10] .
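A simplified version of such a scan is sketched below. It is not the procedure of [10]: it assumes a rectangular template footprint, takes the candidate top-left corners from a precomputed list (standing in for the memory of potential locations), and ignores resource types. All names are hypothetical.

```python
def find_free_locations(state, candidates, height, width):
    """Return the candidate locations whose footprint is entirely free.

    state:      M x N list of lists, 0 = available, 1 = used or damaged
    candidates: (row, col) top-left corners of potential template locations
    """
    rows, cols = len(state), len(state[0])
    free = []
    for row, col in candidates:
        if row + height > rows or col + width > cols:
            continue  # footprint would fall off the chip
        if all(
            state[r][c] == 0
            for r in range(row, row + height)
            for c in range(col, col + width)
        ):
            free.append((row, col))
    return free
```

With a 3 × 4 state matrix in which only a few cells are marked '1', the function keeps exactly those candidates whose 2 × 2 footprint contains no used or damaged resource.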
Finally, it is worth stating that the memory template consists of a generic memory block capable of holding all potentially required output data of the circuit(s) it is designed to replace. It also contains associated logic to manage functionalities such as memory read and delay management. Its memory size is determined like the output memory of the output memorizer discussed in Section II-A. The delay management block manages the difference between the timing behavior of the memory template and the original circuit so as to maintain the timing characteristics of the entire system. It does this by delaying the assertion of 'done' by the difference in the number of clock cycles between the operation of the memory template and that of the original circuit.
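The delay calculation reduces to a subtraction, sketched below with hypothetical names. Treating a template that is slower than the original circuit as an error is an assumption made here; the text does not discuss that case.

```python
def done_delay_cycles(original_latency, template_latency):
    """Clock cycles by which the template must postpone asserting 'done'."""
    extra = original_latency - template_latency
    if extra < 0:
        # ASSUMPTION: a slower template cannot be hidden by adding delay.
        raise ValueError("template is slower than the original circuit")
    return extra
```

For instance, if the original circuit needed 20 cycles (an illustrative figure) and the template answers in 2, `done_delay_cycles(20, 2)` gives 18 cycles of added delay, so the rest of the system sees unchanged timing.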
III. IMPLEMENTATION DETAILS
A. Case-Study Application
We implemented a CORDIC application using Xilinx IP blocks. The application consists of 3 independent circuits: Square root, Sine/Cosine trigonometric operations and the hyperbolic tangent (Tanh) computing circuits. CORDIC was chosen as it is an important algorithm for various mathematical functions [12] . Details of the circuits' operations as well as their data format can be found in [13] . We created a custom wrapper for the circuits for easy comparison with our relocation model. Each circuit was optimized to take an Table I shows the resource utilization of the circuits, while table II shows the number of clock cycles for each operation. The partial bitstream of the application is 140 kB in size.
B. Relocation Module
The relocation module, comprising the output memorizer, duration evaluator and area finder described in Section II, was implemented using the Xilinx Vivado 15.1 design tools. Its resource utilization is shown in table III. A total of 66 LUTs, 58 flip-flops and a single 18-Kb BRAM were used on the xc7a35tcpg236-1 chip. It is worth noting that the size of the memory used is dependent on the application. We chose an 18-Kb memory because it is sufficient to save all the outputs of our target case-study application. The relocation module connects to the inputs and outputs of the application(s) to memorize new computations. It is also worth noting that practical relocation techniques require access to the configuration memory of the FPGA, as well as a means of communication between a relocated module and other parts of the chip. Thus, a self-reconfiguration controller [14] with the required access to the configuration memory was instantiated. The controller is used for configuring the chip, as well as for copying data between the block memories of the relocation module and the relocated module via the configuration layer. To address the need for a communication technique that supports relocation, we adopted the technique described in [15], which uses clock buffers not used by applications for on-chip communication.
Next, a memory template for relocation was implemented. This template reserves 10 kB of memory and manages the delay of the application it replaces. This memory size was determined by the maximum memory requirement of the circuits whose functionality it is intended to replace. The actual resource utilization of the memory template on the target FPGA is 14 LUTs, 21 flip-flops and an 18-Kb RAM, and it has a delay of 2 clock cycles. The bitstream size of the template is 76.9 kB. The delay mechanism ensures that the relocated equivalent does not alter the timing of the relocated application, so that synchronization with the rest of the system is not lost.
IV. RESULTS AND DISCUSSION
At runtime, we initiated a relocation request when 50% of the outputs of the application had been saved by the output memorization module. The floor plan of the application required a pattern of 8 contiguous CLB columns on the xc7a35tcpg236-1 chip. This pattern occurs only once on that chip. Hence, only a functionality-based relocation was possible. The timing constraint associated with the relocation request required the relocation to take at most 1 ms. The total duration of the relocation was measured as 306.80 µs at 100 MHz, with the configuration of the memory template taking 82.30 µs, the computation of missing outputs taking 175.36 µs, and the copying of data from the memorization unit's memory to the template taking 49.14 µs. We also measured the worst-case relocation duration for this module as 361.86 µs and the best case as 131.44 µs. This was done by generating relocation requests when 0% (worst case) and 100% (best case) of the outputs had been saved, respectively. The time required for the configuration of the memory template and the copying of data is constant for an application, irrespective of when a relocation request is received.
We also observed the outputs of both the original circuit and the relocated equivalent for the same inputs. The results were the same for both circuits: in both cases, the value of Data_out when the ap_done signal goes high was the same. In addition, we evaluated the improvement in the number of possible relocations brought about by incorporating the proposed technique into the state-of-the-art direct bitstream relocation technique. Table IV shows the result for different Xilinx FPGA chips. As shown, the proposed technique leads to a significant improvement in the number of relocations. For the chips compared, an average of about 36 more relocations (an increase of over 260%) of the case-study circuits could be obtained using the proposed technique. This is a great advantage in applications (such as space missions) which aim to improve reliability by circumventing permanent damage on the chip. It means that augmenting the traditional direct bitstream relocation with the proposed functionality-based technique would significantly improve the fault tolerance of a design.
As already noted in Section II and in the relocation case study, our relocator resorts to functionality-based relocation when the bitstream of the original design cannot be placed on a matching location on the chip, when placing it would lead to undesired effects, or when access to the location information of the bitstream is not possible (as with encrypted bitstreams). The technique is especially suitable for modern heterogeneous FPGA chips, such as the Xilinx 7 Series and UltraScale FPGAs, which are rich in memory resources, many of which are sometimes unused. We have also noted in Section II that relocation by functionality is only applicable to circuits with low port width. This is because its memory overhead does not scale well with port width, resulting in large overheads for large-port-width circuits. It is therefore important to restate that the relocator system which we present is also capable of bitstream relocation for circuits which cannot be memorized.
In addition, we compared the time overhead of direct bitstream relocation and our proposed functionality-based relocation. Table V shows the relocation time for both techniques for 3 different circuits: CORDIC, an RGB to YCrCb colour converter and a multiplier circuit. All the circuits were implemented using Xilinx IPs from Xilinx Vivado 15.1 for the xc7a35tcpg236-1 chip. As shown, the functionality-based relocation technique has a larger time overhead than direct bitstream relocation in a majority (2 out of 3) of the cases. For example, direct bitstream relocation of a 12-bit RGB to YCrCb colour converter circuit would require only 174.46 µs, as against a minimum of 326.15 µs for the functionality-based technique. It is worth noting that the relocation time for the functionality-based technique grows with the port width of the circuit. Hence, for the CORDIC circuit with 10-bit inputs, its relocation time is smaller than that of direct bitstream relocation, whereas as port width increases, direct bitstream relocation becomes the faster of the two. A major disadvantage of the functionality-based relocation technique is that it does not scale well with increasing port width. In fact, the memory requirement doubles for each additional bit of port width. However, since direct relocation is impossible in certain cases, such as for encrypted bitstreams and when an identical location is not present on the chip, servicing relocation requests whose time constraint can be satisfied in those cases is always an advantage. Therefore, relocating circuits by functionality provides an added benefit whenever direct bitstream relocation is impossible or leads to undesired effects.
Finally, the size of the additional memory template bitstream required for functionality-based relocation is only 55% of that of the original circuits' bitstream in our case study (76.9 kB against 140 kB). Hence, in terms of additional memory required, the functionality-based technique compares favourably with having to store multiple bitstreams of the original circuit. Moreover, since the template is an empty memory, most of the bits in its bitstream are '0's, and it would be much smaller when compressed than the original circuit's bitstream.
V. CONCLUSION
In this brief, we have presented a runtime circuit relocation system which can relocate circuits on FPGAs based on functionality. The technique is applicable to applications that are referentially transparent and have low port widths. We have demonstrated its feasibility using a set of CORDIC circuits and shown that it has the potential to greatly increase the average number of relocations (an increase of over 2.6 times in our case study), while incurring only a small additional bitstream storage overhead (a template bitstream only 55% of the size of the original) and a respectable relocation time.
