Abstract
Introduction
The process of dynamically reconfiguring an FPGA introduces a delay into the operation of the realised circuit and can thus be critical to the performance of the system. This is especially important in embedded systems that can demand fast context switching of the configured circuits [6]. In this paper, we study an approach to reducing reconfiguration cost that is applicable when the configurations are being loaded onto the device from off-chip.
The basic idea behind the technique is to re-use the configurations that are on-chip instead of loading them afresh from off-chip, thereby reducing the time needed to reconfigure the device. While this seems similar to instruction caching in a microprocessor, there are two differences between an instruction sequence and a sequence of configurations. Firstly, a configuration has a spatial dimension since it configures a logic circuit inside the device. Secondly, the smallest unit of configuration, or the configuration granularity, can be many times larger than an instruction for a typical processor.
To study the effect of circuit placement and configuration granularity on the problem, we analyse two cases. First, when each configuration in a sequence has a fixed placement on the device (Section 4). Second, when we have freedom to place the configurations anywhere on the device (Section 5). We have such freedom when the runtime management system for the device provides a virtual address space of the configuration memory to the higher layers of the design environment, either to support hardware re-use or to support multiple applications running concurrently (e.g. see [22]). In this work, we allow configurations to be shifted in one dimension, horizontally (or vertically) across the chip. We show that for a typical commercial device, only 1% of the configuration data in the input sequence can be re-used if the configuration placements are fixed, and 3% can be re-used if configuration relocation is allowed. However, we show that the amount of configuration re-use can be dramatically improved by reducing the size of the basic unit of reconfiguration (Section 6).
We conclude this paper by summarising our main result: we were able to reduce the amount of configuration data for a sequence of benchmark circuits by as much as 41%. However, the actual reduction in reconfiguration time is not possible without a configuration memory that allows fine-grained and random access to its data. In the light of this result, we plan to further investigate configuration memory architectures for FPGA-based systems.
Related Work and Contributions
Various algorithmic (e.g. [17]) and architectural (e.g. [20] [18]) techniques have been presented in the literature to address the problem of reducing reconfiguration time. This paper deals with this problem at the configuration level. [10] and Sadhir et al. [16]. Our technique focuses on the unit of configuration. Recently, Kennedy has performed experiments that are somewhat similar to ours [8]. While we confirm his findings that as much as 80% of the configurations can be redundant while changing a typical circuit into another, our techniques are essentially different. He generated pseudo-configurations using JBits and inspected only two of them at a time. In contrast, our method considers a sequence of actual configuration datasets and focuses on the unit of configuration in particular. Koch et al. [9] outline their configuration model and present techniques somewhat similar to Kennedy's. They also consider configuration placement but at the circuit level. We instead focus on the placement of circuits from the perspective of the configuration bitstreams. We have been able to show the impact of the configuration-memory architecture on this problem.
Various tools to generate the difference between two configurations have been reported, e.g. for the Virtex series [15] and for the XC6200 series [12], and can be incorporated into our methodology.
Models and Problem Statement
We use the following definitions in our analysis. A configuration for a device consists of data that will be loaded into the configuration memory during a given circuit (re)configuration phase. The instructions to write this data at appropriate places are also included in a configuration. A complete configuration contains data for each and every configurable element of a device. A partial configuration is a configuration that only contains the configuration data for a subset of the elements. By an application, we mean a circuit, or a set of inter-related circuits (e.g. the output of one is an input to another).
Models
The device is an SRAM-based partially reconfigurable FPGA consisting of c columns. We assume that the FPGA offers a column-oriented reconfiguration method in which the atomic unit of configuration is a frame consisting of a slice of configuration data for an entire column of resources. Let there be f frames per column, where each frame contains b bytes of configuration data. Reconfiguration time is directly proportional to the number of frames loaded onto the device. We also assume that the device is homogeneous, meaning that, excepting the frame addresses, the same configuration configures the same circuit no matter where it is loaded. While commercial devices do not represent this ideal (e.g. some hex wires in Virtex can be read after three CLBs), our assumption about the homogeneity of the device simplifies the subsequent analysis and helps us to understand the issues involved in configuration re-use.
The device is attached to a microprocessor as a peripheral component (e.g. [5]). The partial reconfigurability of the device is a prerequisite to the problems presented in this paper. However, a loose coupling of the gate arrays with the host processor is not critical and is only mentioned because the experimental results reported in this paper have been gathered on such a model. Indeed, we intend to study tightly-coupled architectures in the future.
The application model
The need to construct a reconfigurable circuit arises in many situations. We consider two cases:
The circuit needs more hardware resources than are available. The design is partitioned into manageable units and configurations are generated for each partition. The loading of these configurations is then scheduled to produce the same final result as if a bigger FPGA were present (e.g. v-11).
The circuit is specialised around certain commonly occurring data patterns. Configurations corresponding to these customised partitions are generated and loaded onto an FPGA when needed (e.g. [13]).
We assume that our model circuits span the entire height of the FPGA (a similar model to the one described by Li et al. [10]) and their physical placements are specified by the column/frame addresses. We also assume that the execution times of the circuits are not configuration delay or placement dependent. We therefore assume a sufficiently homogeneous and interconnected resource to allow arbitrary circuit placement without affecting performance. We assume that IO is performed either via the top or the bottom of the device or is managed by the runtime system. We also assume that we can disconnect IO pins from cached configurations that are no longer active or become active again. In this paper, we ignore the performance issues that arise due to IO constraints. Instead, our focus is on analysing the effect of circuit placement on configuration re-use. We assume a time-shared multi-tasking model (e.g. [22]).

The Configuration Re-use Problem
Consider a sequence of three configurations, $C_1 \rightarrow C_2 \rightarrow C_3$, to be loaded onto a device having three columns and five frames per column (Figure 1). Let us represent frames by characters. Assume that each configuration has a fixed placement, e.g. $C_1$ will start from the second frame of the first column (i.e. @1.2). Assume that the device has been initialised to the null configuration, $\phi_0$. We then load the first configuration, $C_1$. Let us call the resulting on-chip configuration $\phi_1$. Now consider $C_2$, which starts from the fourth frame of the first column. We note that there are common frames between $\phi_1$ and $C_2$ (i.e. a1@1.4 and a2@1.5). We leave these frames and load the residue, $C_{[\phi_1, 2]}$, onto the device. We
$C_{[\phi_{n-1}, n]}$ such that:
Loading $C_{[\phi_i, i+1]}$, given $\phi_i$ is already on-chip, results in a configuration $\phi_{i+1}$ that contains $C_{i+1}$.
The total amount of reconfiguration data is minimised.
As $\phi_i$ may contain configurations from the previous partial reconfigurations, let us recursively define $\phi_i = C_0$ for $i = 0$, else $\phi_i = C_{[\phi_{i-1}, i]} \oplus \phi_{i-1}$. Thus, $\phi_i$ is recursively defined as an addition ($\oplus$) of all previous configurations. $C_0$ is the initial null configuration that is needed to start the device in a safe state.
Let $fcount(C_i)$ be the number of frames in $C_i$. The above problem can now be redefined as:
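Under the definitions above, one natural way to write this objective is given below. This is a reconstruction from the surrounding text rather than a quotation, and it assumes the sequence has $n$ configurations and that $\phi_0 = C_0$ is the null configuration.

```latex
\min \; \sum_{i=1}^{n} fcount\!\left(C_{[\phi_{i-1},\, i]}\right)
\quad \text{subject to} \quad
\phi_i = C_{[\phi_{i-1},\, i]} \oplus \phi_{i-1} \ \text{contains} \ C_i,
\qquad 1 \le i \le n .
```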
Analysis
The configurations must be placed at a fixed location on the device when the circuit is customised around certain data and the configuration updates modify a given part of the currently executing configuration, or when the designer fixes the circuit placements for performance reasons (e.g. the use of a high-bandwidth IO port). In the following, we first discuss the above problem given that the configurations have fixed placements and then analyse the case when restricted movement of the configured circuits is possible.
Reducing Reconfiguration Cost with Fixed Placements
If placement is fixed then the amount of configuration data to be loaded can be minimised by removing the common frames between successive configurations in the sequence. The algorithm in Figure 2 describes this procedure. The worst case complexity for the algorithm is $O(fnb)$, where f is the maximum number of frames in the device (f = 4,096 for an XCV1000), n is the number of configurations in the sequence and b is the size of the frame (b = 156 bytes for an XCV1000).
In Virtex, one needs to write configuration commands in the command register for every contiguous block of frames. By removing the common frames, we fragment these blocks and thus increase the number of commands that must be issued. However, it should be mentioned that the commands contain only a few words of data, while a single frame contains on the order of a hundred bytes for a typical device. Thus the amount of data saved by removing common frames is much more than the extra command data.
We assume that simply laying the next configuration over the current one will give us a safe circuit. This is very much a device-specific issue and will not be considered in detail. For a discussion on this, please see [8].
Results
In this section, we illustrate the use of our technique by giving an example. We envisage that our technique will be useful in an embedded system domain where fast context switching of circuits is needed and application characteristics are known a priori, making static optimisations possible. We have in mind an FPGA-based system where various cores are swapped in and out of the device (e.g. [19]).
We generated a sequence of thirteen cores targeted at an XCV1000 device [4] using ISE5.2 [1] CAD tools (see Table 1). We let the tool decide the physical placement of these circuits. The first column of Table 1 assigns a number to each core. The third column lists the number of columns spanned by the respective core (XCV1000 = 96 columns). The fourth column lists the critical delay of the circuit (as calculated by the CAD tools) and the last column lists its source.
Table 1. A set of benchmark circuits for a Virtex device.
We implemented the algorithm of Figure 2 in Java. As the configuration format for the Virtex devices is not fully open, we first generated a byte representation of the configurations using JBits. The JBits read-
A thousand random permutations of the sequence of thirteen cores were considered. For each permutation the program took on average 3.6 seconds to compare the frames in the successive configurations and output the difference. There is a total of 18,008 frames present in the input sequence and, on average, the greedy algorithm removed 229 frames (with a standard deviation of 110 frames). The resulting reduction in reconfiguration time was calculated to be about 1%. There can be three reasons for this relatively small improvement: there are not many common frames to remove; there are common frames but they do not occur in consecutive configurations; and there are common frames but they do not occupy the same column/frame position in the respective configurations.
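The frame-removal procedure for fixed placements can be illustrated with a minimal Python sketch. It is not the Java implementation used for the measurements: the dictionary representation of a configuration, the empty dictionary standing in for the null configuration, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Address = Tuple[int, int]      # (column, frame index within the column)
Config = Dict[Address, bytes]  # frame address -> frame data (assumed layout)


def fixed_placement_residues(sequence: List[Config]) -> List[Config]:
    """For each configuration, keep only frames that differ from what is on-chip."""
    on_chip: Config = {}  # assumption: the null configuration is modelled as empty
    residues: List[Config] = []
    for config in sequence:
        # A frame is skipped when the same data already sits at the same address.
        residue = {addr: data for addr, data in config.items()
                   if on_chip.get(addr) != data}
        on_chip.update(residue)  # frames that are not rewritten stay on-chip
        residues.append(residue)
    return residues


def frames_loaded(residues: List[Config]) -> int:
    """Total number of frames that must be transferred to the device."""
    return sum(len(residue) for residue in residues)


# Hypothetical three-configuration sequence on a single column.
c1 = {(0, 1): b"a", (0, 2): b"b", (0, 3): b"c"}
c2 = {(0, 3): b"c", (0, 4): b"d"}
c3 = {(0, 1): b"a", (0, 4): b"x"}
example_residues = fixed_placement_residues([c1, c2, c3])
```

In this toy sequence seven frames are named but only five need to be loaded, because two frames are already present at the right addresses when their configurations arrive.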
We analysed the configurations to answer these questions. Let us assume that we have n configurations. Let $f_{unique}$ be the total number of unique frames in the n configurations. Reconfiguration cost cannot be less than $f_{unique}$. It was found that $f_{unique}$ was equal to 16,916 frames for the thirteen configurations under test. This still gives us 1,092 frames that could be removed (or a 6% maximum possible reduction, assuming the cores were placed at positions that maximised their overlap and the configuration sequence suited our placement). For the purposes of this analysis, two frames were considered similar only if they had the same configuration data and they were located at the same frame index within the respective columns.
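The unique-frame bound can be sketched in a few lines of Python. The configuration layout (a dictionary from (column, frame index) to frame data) and the function name are assumptions for illustration; the similarity rule follows the text, so the column is ignored and only the frame index and the data are compared.

```python
from typing import Dict, List, Tuple

Config = Dict[Tuple[int, int], bytes]  # (column, frame index) -> frame data


def unique_frame_count(sequence: List[Config]) -> int:
    """Count distinct (frame index, data) pairs across the whole sequence."""
    return len({(frame_idx, data)
                for config in sequence
                for (_col, frame_idx), data in config.items()})


# Hypothetical data: five frames in total, three of them unique.
toy_sequence = [
    {(0, 1): b"a", (1, 1): b"a", (0, 2): b"b"},
    {(2, 1): b"a", (0, 2): b"c"},
]
toy_unique = unique_frame_count(toy_sequence)

# Arithmetic quoted in the text: 18,008 frames in total, 16,916 unique.
removable = 18_008 - 16_916
removable_percent = round(100 * removable / 18_008)
```

The last two lines reproduce the figures given above: 1,092 removable frames, which is about 6% of the input sequence.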
Let us consider the second and third of the above-mentioned reasons for poor performance. As we generated a thousand random permutations of the sequence and found that the standard deviation in the result was only 0.6%, the second reason does not seem plausible. Hence we are left with the issue of frame alignability. By alignability we mean that the frames can be placed at the same column/frame address (thereby eliminating the frames in the successive configurations once the first frame has been loaded). We analyse this dimension of the problem in the next section.
Reducing Reconfiguration Cost with 1D Placement Freedom
In this section we tackle a more general version of the problem defined in Section 3.2. We allow one-dimensional placement freedom and introduce further simplifications to the models outlined in Section 3.1.
The input configurations provide data for a contiguous region of the configuration memory, meaning that each configuration has a start column/frame address and an ending column/frame address. Moreover, the configurations span the entire column of the device. The placement freedom of a configuration, $C_i$, is given by $c - |C_i|$, where c is the total number of columns in the device and $|C_i|$ is the number of columns spanned by $C_i$. The placement freedom corresponds to all legal column addresses for the leftmost column of the configuration. The configurations can only be shifted by a multiple of columns. This means that if a particular frame is at position 2 within a column then it will occupy the same position in any column when the configuration is shifted across the device. The partitioning of the configuration address space into columns and frames therefore simplifies our analysis.
Complexity analysis
This section analyses the complexity of the one-dimensional configuration re-use problem. It shows that the problem is NP-complete by transforming the k-cache misses problem [14] to it. We first discuss this problem.
We are given a direct-mapped cache of size k, a finite main memory, and a set of m > k memory objects
Please note that our problem allows configuration relocation, whereas in the k-cache misses problem each object has a fixed location in memory. It can be seen that the k-cache misses problem remains NP-complete even if we allow re-allocation. This is because it does not matter where an object is placed in a k-cache as long as it is in the cache (in other words, k-cache hits are position oblivious).
A greedy approach
Algorithm. We have examined the performance of a greedy algorithm when applied to the problem of configuration re-use with variable placements. This algorithm places each configuration at a position that minimises the reconfiguration data between it and the on-chip configuration (Figure 3). The worst case complexity for this algorithm is $O(f^2 n b)$, where f is the maximum number of frames in the device, n is the number of configurations in the sequence and b is the size of the frame.
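A minimal Python sketch of such a greedy placement is given below. It is an illustration only, not the program used for the measurements. It assumes that each configuration is stored with column offsets relative to its own leftmost column, that legal shifts run from 0 up to the device width minus the configuration width, and that ties are broken in favour of the smallest shift; the names are hypothetical.

```python
from typing import Dict, List, Tuple

Config = Dict[Tuple[int, int], bytes]  # (relative column, frame index) -> data


def greedy_variable_placement(sequence: List[Config],
                              num_columns: int) -> Tuple[List[int], int]:
    """Place each configuration at the column shift that loads the fewest frames."""
    on_chip: Dict[Tuple[int, int], bytes] = {}  # absolute address -> data
    placements: List[int] = []
    total_loaded = 0
    for config in sequence:
        width = 1 + max(col for col, _ in config)  # columns spanned
        if width > num_columns:
            raise ValueError("configuration is wider than the device")
        best_shift, best_residue = 0, None
        for shift in range(num_columns - width + 1):
            # Frames that would have to be written if placed at this shift.
            residue = {(col + shift, frame): data
                       for (col, frame), data in config.items()
                       if on_chip.get((col + shift, frame)) != data}
            if best_residue is None or len(residue) < len(best_residue):
                best_shift, best_residue = shift, residue
        on_chip.update(best_residue)
        placements.append(best_shift)
        total_loaded += len(best_residue)
    return placements, total_loaded


# Hypothetical device with four columns and one frame per column.
first = {(0, 0): b"a", (1, 0): b"b"}
second = {(0, 0): b"b", (1, 0): b"c"}
example_placements, example_cost = greedy_variable_placement([first, second], 4)
```

In the example the second configuration is shifted by one column so that its frame `b` lands on the copy already on-chip, and three frames are loaded instead of four.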
We generated a hundred different permutations of the sequence in Table 1. The program took, on average, 72.4 seconds to run for each sequence. With an initial cost of 18,008 frames, the program removed 579 frames, resulting in a 3% reduction in configuration data (standard deviation = 154 frames). This still falls 3% short of the estimated maximum reduction (see previous section). We found that even though there can be common frames among configurations, they might not be alignable due to physical constraints on the configuration placements. Please consider Figure 4, in which two configurations $C_i$ and $C_{i+1}$ are shown on a device with only one frame per column.
Let the common frames between the two be located at opposite ends, as shown by the lighter regions (blocks numbered 1). It is clear that, because of constraints on the placement freedom, the two configurations cannot be placed such that the common frames of $C_{i+1}$ are placed in the same column as those of $C_i$. Thus, the common frames of $C_{i+1}$ should be considered to be unique.
The algorithm operates on frames that occur more than once in the overall sequence. It takes one such frame at a time and creates n bit vectors, each of size equal to the maximum number of frames the device can have. If the frame occurs in the ith configuration, $0 \le i \le n$, it marks those bits of the ith vector where this frame can possibly be placed. Finally, it traverses the sequence from the start and performs an AND operation between successive vectors. The uniqueness of the frames is thus deduced from the result. In order to accommodate the frames that are separated by configurations that do not contain those frames, the algorithm simply assumes that these configurations are placed such that the frame before the in-between configurations can be seen by the frames after these configurations.
We performed the above analysis for 100 random permutations of the sequence listed in Table 1. It was found that there were 16,916 actual unique frames (as found previously) and, after running the alignability test, this number rose to 17,012 (or almost 95%), partly explaining the unexpectedly poor reduction in cost. The non-alignability of frames arises due to the limited size of the device. In order to explore this we increased the device size and ran the program again for 10 random permutations of the input sequence.
The results are shown in Table 2. We increased the number of columns but kept the number of frames per column the same. It can be seen that the reduction in reconfiguration cost approaches the estimated 5%. It should be noted that increasing the device size beyond this will gain no benefit as the total number
In the case of an FPGA there exists another kind of non-alignability, which we call frame-interlocking. As an example, consider Figure 5. Shown are common frames numbered 1 and 2. Notice that we can either align the 1's (resulting in a misalignment of the 2's) or vice versa, but we cannot align both simultaneously.
As we have not yet developed a simple solution to detect frame-interlocking of a sequence of configurations, we are unable to provide the tightest lower bound on the optimal cost. Thus, our cost estimates are optimistic. In the coming section we show that:
The absolute lower bound on the number of unique frames (whether alignable or not) can be drastically reduced if we divide a frame into sub-frames and allow them to be loaded independently.
The greedy method of placing the configurations, if such freedom is allowed, is a reasonable solution in practice.
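The bit-vector alignability test described earlier in this section can be sketched in Python as follows. This is a simplified reading, not the original program: it uses one bit per device column (the frame index within a column does not change under column shifts), it describes each occurrence of a frame by a hypothetical pair (relative column, configuration width), and, like the estimate in the text, it ignores frame-interlocking.

```python
from typing import List, Tuple


def reachable_columns(rel_col: int, width: int, num_columns: int) -> int:
    """Bit mask of device columns a frame can occupy over all legal shifts."""
    mask = 0
    for shift in range(num_columns - width + 1):
        mask |= 1 << (rel_col + shift)
    return mask


def loads_needed(occurrences: List[Tuple[int, int]], num_columns: int) -> int:
    """Count how often a repeated frame must be loaded.

    occurrences holds, in sequence order, (relative column, configuration width)
    for every configuration that contains the frame. Configurations without the
    frame are simply skipped, which is the optimistic assumption in the text.
    """
    loads = 0
    running = 0  # AND of the vectors since the last load
    for rel_col, width in occurrences:
        vector = reachable_columns(rel_col, width, num_columns)
        if running & vector:
            running &= vector  # still alignable with the copy on-chip
        else:
            loads += 1         # not alignable: counts as a unique frame
            running = vector
    return loads


# Common frame at opposite ends of two 8-column configurations (as in Figure 4).
opposite_ends = [(0, 8), (7, 8)]
small_device = loads_needed(opposite_ends, 10)
large_device = loads_needed(opposite_ends, 16)
```

On the 10-column device the two occurrences cannot share a column, so the frame is loaded twice; on the 16-column device one load suffices, which mirrors the observation that non-alignability arises from the limited size of the device.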
Breaking the Atom of Reconfiguration
Every FPGA has a smallest unit of reconfiguration. This means that a certain amount of data must be written to the configuration memory even if only a one-bit change is required. Our target device, Virtex, has a name for this unit (a frame). The frame size is 156 bytes for an XCV1000 device. The technique presented so far performs a frame-by-frame comparison. Let us now break the frames into smaller sub-frames and re-apply the caching technique, assuming that the sub-frames can be loaded independently.
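As a rough illustration of this idea, the following Python sketch splits each frame into fixed-size sub-frames and re-applies the fixed-placement comparison at sub-frame granularity. The data layout and function names are assumptions for illustration, and addressing overhead is deliberately left out here.

```python
from typing import Dict, List, Tuple

Config = Dict[Tuple[int, int], bytes]  # (column, frame index) -> frame data


def split_into_subframes(config: Config,
                         subframe_size: int) -> Dict[Tuple[int, int, int], bytes]:
    """Cut every frame into independently addressable pieces."""
    pieces = {}
    for (col, frame_idx), data in config.items():
        for start in range(0, len(data), subframe_size):
            key = (col, frame_idx, start // subframe_size)
            pieces[key] = data[start:start + subframe_size]
    return pieces


def subframes_to_load(sequence: List[Config], subframe_size: int) -> int:
    """Number of sub-frames written when unchanged sub-frames are left on-chip."""
    on_chip: Dict[Tuple[int, int, int], bytes] = {}
    loaded = 0
    for config in sequence:
        for addr, data in split_into_subframes(config, subframe_size).items():
            if on_chip.get(addr) != data:
                on_chip[addr] = data
                loaded += 1
    return loaded


# Hypothetical 4-byte frames that differ only in their last byte.
toy = [{(0, 0): b"AAAA"}, {(0, 0): b"AAAB"}]
bytes_at_4 = 4 * subframes_to_load(toy, 4)
bytes_at_2 = 2 * subframes_to_load(toy, 2)
bytes_at_1 = 1 * subframes_to_load(toy, 1)
```

For this toy pair the data to load drops from 8 bytes with whole frames to 6 bytes with 2-byte sub-frames and 5 bytes with byte-sized sub-frames.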
We divided the frames into sub-frames of various sizes (using the same configurations as in Section 4.1). The results are shown in Table 3 (rounded to the nearest whole number). The first column lists the frame sizes we examined. The %Est column states our estimate of the possible percentage reduction in the configuration data of the input sequence. This is the percentage of common frames, i.e. 100% less the percentage of unique frames (calculated by performing the alignability test), assuming an XCV1000 target device. The %Fix. Place column lists the reduction in configuration data obtained after applying the fixed placement algorithm (Figure 2) and the last column lists the reduction in configuration data obtained when the variable placements algorithm (Figure 3) is applied. It can be seen that the number of unique frames steadily decreases as the frame size gets smaller. It can also be seen that for a byte-sized frame, the variable placement algorithm yields an 85% reduction in the configuration data. The significant reduction in the configuration data can be due to three reasons. First, the floor-plans of the benchmark circuits revealed that not all of the resources within the columns were used. These resources were probably set to the null configuration by the CAD tool, thereby allowing us to increase the reduction in configuration data once a smaller frame size was introduced. Second, there can be circuit fragments that occur in more than one core. Lastly, a sparse encoding of the configurations can also result in redundant configurations (see [7]). A detailed analysis of these factors is yet to be done.
Table 4. Deriving the optimal frame size.
The above analysis does not include the overhead incurred due to the addition of extra address data that is required as frames become smaller and more fragmented. While decreasing the frame size decreases the amount of data to be loaded, it also increases the addressing overhead. Let us derive an optimal frame size for the configurations under test (see Table 4). We assume that the configuration interface consists of an 8-bit port. We also assume that each frame is individually addressed. Note that this over-estimates the addressing overhead used currently by Virtex, which provides a start address and a count of the number of consecutive frames to be loaded. However, we only account for the minimum number of bytes needed to address each frame. Whereas Virtex currently uses 4-byte addresses, we estimate that 2 suffice for an XCV1000 with 156-byte frames. We call this model explicit individual frame addressing (EIFA).
The second column of Table 4 lists the total size of the unoptimised bitstreams taking into account the number of frames loaded as well as the address of each frame. For example, for a frame size of 156 bytes, we had 18,008 frames. We added 2 bytes of address data to each of these yielding a total of 2,845,264 bytes. The number of frames needed for the smaller frame sizes was estimated by dividing 156 by each respective frame size and multiplying the result by the number of original frames, 18,008. For frame sizes of less than 16 bytes we estimated 3 address bytes were needed per frame written.
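The size estimate described above can be reproduced with a short Python sketch. The function names are hypothetical, and rounding the frame count up when the frame size does not divide 156 evenly is an assumption, since the text does not state how fractional counts were handled. The percentage passed to `optimised_eifa_bytes` is a placeholder to be taken from the measured reductions, not a value from the tables.

```python
ORIGINAL_FRAME_BYTES = 156     # XCV1000 frame size quoted in the text
ORIGINAL_FRAME_COUNT = 18_008  # frames in the benchmark sequence


def address_bytes(frame_size: int) -> int:
    """2 address bytes per frame, or 3 for frame sizes below 16 bytes."""
    return 2 if frame_size >= 16 else 3


def unoptimised_eifa_bytes(frame_size: int) -> int:
    """Total bitstream size when every frame carries its own address."""
    total_data = ORIGINAL_FRAME_COUNT * ORIGINAL_FRAME_BYTES
    frames = -(-total_data // frame_size)  # ceiling division (assumption)
    return frames * (frame_size + address_bytes(frame_size))


def optimised_eifa_bytes(frame_size: int, reduction_percent: float) -> float:
    """Apply a measured percentage reduction to the unoptimised size."""
    return unoptimised_eifa_bytes(frame_size) * (1 - reduction_percent / 100)


def improvement_percent(frame_size: int, reduction_percent: float) -> float:
    """Improvement relative to the unoptimised 156-byte-frame bitstream."""
    baseline = unoptimised_eifa_bytes(ORIGINAL_FRAME_BYTES)
    return 100 * (1 - optimised_eifa_bytes(frame_size, reduction_percent) / baseline)


baseline_bytes = unoptimised_eifa_bytes(156)
quarter_frame_bytes = unoptimised_eifa_bytes(39)
```

For 156-byte frames the sketch gives 2,845,264 bytes, matching the figure quoted above. It also makes the trade-off visible: at 39 bytes the unoptimised size is already larger than the baseline because four times as many addresses are sent, so a smaller frame only pays off when the measured reduction outweighs that overhead.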
The optimised EIFA bitstream sizes listed in the third column were obtained by reducing the sizes obtained in the second column by the %Fix. Place listed in Table 3. Finally, the %improvement in bitstream size, the last column of Table 4, was estimated by comparing the optimised bitstream size with the unoptimised bitstream size using 156-byte frames (close to the actual total Virtex configuration file sizes). Table 4 suggests that a frame size of 39 bytes, or one quarter of the current Virtex frame size, is optimal since it offers good compression with little address overhead. A frame size of 16 bytes offers equal compression.
A similar analysis for the variable placement case reveals that a frame size of 16 bytes offers a 41% reduction in the total bitstream size. While this is 7% more than the fixed placement case, extra effort is needed to find the placement of each configuration. Moreover, variable placement demands an FPGA model that allows one-dimensional configuration shifts. The device we worked on (an XCV1000) does not fully support arbitrary placements of configurations.
We now discuss the main conclusions from the above analysis. Firstly, for relatively fine-grained logic fabrics such as Virtex, fine-grained, random access to the configuration memory is needed in order to adequately exploit the redundancy present in configuration data. Secondly, introducing placement freedom does reduce the amount of reconfiguration data, but not significantly. Lastly, the relatively simple and quick greedy strategies we explored provided reasonable reductions in overall configuration bitstream sizes.
Conclusions and Future Work
In this paper we have developed techniques to reduce the reconfiguration overhead of an FPGA. Our method reduces the amount of reconfiguration data that needs to be transferred to the device by making use of configurations that are already present in the configuration memory. We have studied the effect of circuit placement and configuration granularity on configuration re-use. We have found that introducing placement freedom has little impact on the overall reduction in configuration data. However, fine-grained, random access to configuration memory could help to reduce the reconfiguration time by more than 41%.
In future, we intend to investigate configuration-memory architectures that support efficient reconfiguration. We intend to focus on the granularity of the configuration memory and efficient configuration addressing schemes for a given circuit domain.
