Although run-time reconfigurable systems have been shown to achieve very high performance, the speedups over traditional microprocessor systems are limited by the cost of configuration of the hardware. In this paper, we explore the idea of configuration caching. We present techniques to carefully manage the configurations present on the reconfigurable hardware throughout program execution. Through the use of the presented strategies, we show that the number of required reconfigurations is reduced, lowering the configuration overhead. We extend these techniques to a number of different FPGA programming models, and develop both lower bound and realistic caching algorithms for these structures.
Introduction
In recent years, coupled processor-FPGA systems have attracted a lot of attention because of their promise to deliver the high performance provided by reconfigurable hardware along with the flexibility of general purpose processors. In such systems, portions of an application with repetitive logic and arithmetic computation are mapped to the reconfigurable hardware, while the general-purpose processor handles other portions of the computation.
For many applications, the systems need to be reconfigured frequently during run-time to exploit the full potential of the reconfigurable hardware. Reconfiguration overhead is therefore a major concern because the CPU often must sit idle during this process. By reducing the overall reconfiguration overhead, the performance of the system is improved. Therefore, many studies have examined techniques to reduce the configuration overhead. Some of these techniques include configuration prefetching [Hauck98a] and configuration compression [Hauck98b, Li99]. In this paper, we present another approach called configuration caching that reduces the reconfiguration overhead by buffering configurations on the FPGA.
Caching configurations on an FPGA, which is similar to caching instructions or data in a general memory, retains the configurations on the chip so that the amount of data that needs to be transferred to the chip can be reduced. In configuration caching we view the area of the FPGA as a cache. If this cache is large enough to hold more than one computation, configuration cache management techniques are used to determine when configurations should be loaded and unloaded to minimize the overall reconfiguration time. In a general-purpose computational system, caching is an important approach to hide memory latency by taking advantage of two types of locality: spatial locality and temporal locality. Spatial locality states that items whose addresses are near one another tend to be referenced close together in time. Temporal locality addresses the tendency of recently accessed items to be accessed again in the near future. These two localities also apply to the caching of configurations on the FPGA in coupled processor-FPGA systems. However, the traditional caching approaches for general-purpose computational systems are unsuitable for configuration caching for the following reasons:
1) In general-purpose systems, the data loading latency is fixed because the block represents the atomic data transfer unit, while in coupled processor-FPGA systems, the loading latency of configurations may vary because of non-uniform configuration sizes. This variable latency factor could have a great impact on the effectiveness of caching approaches. In traditional memory caching, the data items frequently accessed are kept in the cache in order to minimize the memory latency. However, this might not be true in coupled processor-FPGA systems because of the non-uniform configuration latency. For example, suppose that we have two configurations with configuration latencies of 10 ms and 1000 ms respectively. Even though the first configuration is executed 10 times as often as the second one, the second configuration is likely to be cached in order to minimize the total configuration overhead.
2) Since the ratio of the average size of configurations to chip area is much larger than the ratio of the block size to the cache size, only a small number of configurations can be retained on the chip. This makes the system more likely to suffer from the "thrashing problem", in which configurations are swapped between the configuration memory and the FPGA excessively.
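The latency example in reason 1 can be checked with a few lines of arithmetic. The sketch below reads the two latencies as 10 ms and 1000 ms and assumes, purely for illustration, that every access to a configuration that is not kept on the chip triggers a full reload; the function name `reload_cost_ms` is hypothetical.

```python
def reload_cost_ms(accesses, latency_ms):
    """Total reload time if the configuration is never kept on the chip.

    Assumption: each access to an uncached configuration costs one full load.
    """
    return accesses * latency_ms


# Values from the example: the first configuration runs 10 times as often.
fast_cost = reload_cost_ms(accesses=10, latency_ms=10)    # frequent, cheap to load
slow_cost = reload_cost_ms(accesses=1, latency_ms=1000)   # rare, expensive to load

# Keep whichever configuration would be more expensive to reload repeatedly.
keep = "second" if slow_cost > fast_cost else "first"
```

Under these assumptions the rarely used configuration costs ten times more to reload, which is why frequency alone is a poor caching criterion here.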
0-7695-0871-5/00 $10.00 © 2000 IEEE
The challenge in configuration caching is to determine which configurations should remain on the chip and which should be replaced when a reconfiguration occurs. An incorrect decision will fail to reduce the reconfiguration overhead and can lead to a much higher overhead than a correct decision. The non-uniform configuration latency and the small number of configurations that reside simultaneously on the chip increase the complexity of this decision. Both the frequency and the latency of configurations need to be considered to achieve the best reduction in reconfiguration overhead. Specifically, in certain situations retaining configurations with high latency is better than keeping frequently required configurations that have lower latency. In other situations, keeping configurations with high latency and ignoring the frequency factor will result in switching between other frequently required configurations because they cannot fit in the remaining area. This switching causes reconfiguration overhead that would not occur if the configurations with high latency but low frequency were unloaded.
FPGA Models
The three FPGA models mentioned previously, the Single Context FPGA, the Partial Run-Time Reconfigurable FPGA, and the Multi-Context FPGA, are the three dominant models for current run-time reconfigurable systems. For a Single Context FPGA, the whole chip area must be reconfigured during each reconfiguration. Even if only a small portion of the chip needs to be reconfigured, the whole chip is rewritten. Configuration caching for the Single Context model allocates multiple configurations that are likely to be accessed close together in time into a single context to minimize switching of contexts. By caching configurations in this way, the reconfiguration latency is amortized over the configurations in a context. Since the reconfiguration latency for a Single Context FPGA is fixed (based on the total amount of configuration memory), minimizing the number of times the chip is reconfigured will minimize the reconfiguration overhead.
For the Partial Run-Time Reconfigurable (PRTR) FPGA, the area that is reconfigured is just the portion required by the new configuration, while the rest of the chip remains intact. Unlike the configuration caching for the Single Context FPGA, where multiple configurations are loaded to amortize the fixed reconfiguration latency, the configuration caching method for the PRTR is to load and retain configurations that are required rather than to reconfigure the whole chip. The overall reconfiguration overhead is the summation of the reconfiguration latency of the individual reconfigurations. Compared to the Single Context FPGA, the Partial RTR FPGA provides more flexibility for performing reconfiguration.
Figure 1: The structure of a 4-context FPGA [DeHon94].
The configuration mechanism of the Multi-Context model [DeHon94] is similar to that of the Single Context FPGA. However, instead of having one configuration stored in the FPGA, multiple complete configurations are stored. Each complete configuration can be viewed as multiple configuration memory planes contained within the FPGA.
The structure of a 4-context FPGA is illustrated in Figure 1. For the Multi-Context FPGA, the configuration can be loaded into any of the contexts. When needed, the context containing the required configuration will be switched to control the logic and interconnect plane. Compared with the configuration loading latency, the single cycle configuration switching latency is negligible. In this paper, we assume that the Multi-Context model cannot be partially reconfigured; thus every SRAM context can be viewed as a Single Context FPGA, and the methods for allocating configurations onto contexts for the Single Context FPGA can be applied.
We will present two new models based on the Partial Run-Time Reconfigurable (PRTR) FPGA. In standard PRTR devices, configurations are mapped to fixed locations in the array, and whenever they are loaded they must be mapped to that specific location. Therefore, current PRTR systems are likely to suffer a "thrashing problem" if two or more frequently used configurations occupy overlapping locations in the array. Simply increasing the size of the chip will not alleviate this problem. Instead, we present a new model, the Relocation model, which dynamically allocates the position of a configuration on the FPGA at run time instead of at compilation time. Another model, called the Relocation + Defragmentation model [Compton00], further improves the hardware utilization. In current PRTR systems, portions of the chip area can be wasted during reconfiguration because they are too small to hold another configuration. These small portions are called fragments, by analogy with fragments in traditional memory systems, and they can represent a significant percentage of chip area. In the Relocation + Defragmentation model, a special hardware unit called the Defragmentor can move configurations within the chip such that the small unused portions are collected into a single large fragment. This allows more configurations to be retained on the chip, increasing the hardware utilization and thus reducing the reconfiguration overhead. For example, Figure 2 shows three configurations currently on the chip with two small fragments. Without defragmentation, one of the three configurations has to be replaced when configuration 4 is loaded. However, as shown on the right side of Figure 2, by pushing configurations 2 and 3 upward the Defragmentor produces a single fragment that is large enough to hold configuration 4. Notice that the previous three configurations are still present, and therefore the reconfiguration overhead caused by reloading a replaced configuration can be avoided.
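The effect of defragmentation described above can be illustrated with a small one-dimensional model. This is a minimal sketch, not the hardware mechanism: the 12-row chip, the row counts, and the function names `largest_free_run` and `defragment` are all hypothetical, chosen only to mimic the situation of three resident configurations separated by two small fragments.

```python
def largest_free_run(layout, total_rows):
    """Largest contiguous block of free rows.

    layout maps a configuration ID to (start_row, n_rows).
    """
    used = [False] * total_rows
    for start, size in layout.values():
        for row in range(start, start + size):
            used[row] = True
    best = run = 0
    for is_used in used:
        run = 0 if is_used else run + 1
        best = max(best, run)
    return best


def defragment(layout):
    """Push every configuration toward row 0, preserving their order."""
    compacted, next_row = {}, 0
    for cid, (_, size) in sorted(layout.items(), key=lambda kv: kv[1][0]):
        compacted[cid] = (next_row, size)
        next_row += size
    return compacted


# Hypothetical 12-row chip: two 2-row fragments (rows 3-4 and rows 10-11).
TOTAL_ROWS = 12
layout = {1: (0, 3), 2: (5, 3), 3: (8, 2)}
config4_rows = 3

fits_before = largest_free_run(layout, TOTAL_ROWS) >= config4_rows
compacted = defragment(layout)
fits_after = largest_free_run(compacted, TOTAL_ROWS) >= config4_rows
```

In this toy layout configuration 1 stays in place, configurations 2 and 3 move upward, and the two fragments merge into one 4-row region that can hold the new 3-row configuration without evicting anything.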
Experimental Setup
In order to investigate the performance of configuration caching for the five different models presented in the last section, we develop a set of caching algorithms for each model. A fixed amount of hardware resources (in the form of overall area) is allocated to each model. To conduct the evaluation, we must perform three steps. First, for each model, the capacity equation must be derived for a given architecture model and a given area. Since the number of programming bits represents the maximum amount of configuration information that a model can retain, the number of programming bits is calculated to represent the capacity of each model. Second, we test the performance of the algorithms for each model by generating a sequence of configuration accesses from an execution profile of each benchmark. Third, for each model, caching algorithms are executed on the configuration access sequence, and the configuration overhead for each algorithm is measured.
Capacity analysis
We created layouts for the programming structures of the different FPGA types: the Single Context, the Partial Run-Time Reconfigurable, the Multi-Context, the PRTR with Relocation, and the PRTR with Relocation and Defragmentation. These area models are based on the layout of tileable structures that compose the necessary memory system. This layout was performed using the Magic tool, and sizes (in lambda²) were obtained for the tiles. [Hauser97]. In this type of FPGA, a full row of computational structures is the atomic unit used when creating a configuration: configurations may use one or more rows, but any row used by one configuration becomes unavailable to other configurations. While a two-dimensional model could improve the configuration density, the extra hardware required and the complexities of two-dimensional placement limit the benefits gained through the use of the model. One-dimensionality simplifies the creation of relocation and defragmentation hardware, however, because these problems become one-dimensional issues similar to memory allocation and defragmentation.
The PRTR design forms the basis of the PRTR with Relocation FPGA. A small adder and a small register both equal in width to the number of address bits for the row address of the memory array were added for the new design. This allows all configurations to be generated such that the "uppermost" address is 0. Relocating the configuration is therefore as simple as loading an offset into the offset register, and adding this offset to the addresses supplied when loading a configuration.
Finally, the PRTR with Relocation and Defragmentation [Compton00] is similar to the PRTR with Relocation, with the addition of a row-sized set of SRAM cells that form a buffer between the input of the programming information and the memory array itself. A full row of programming information can be read back into this buffer from the array, and then written back to the array in a different position as dictated by the offset register. In order to make this operation efficient, an additional offset register and a 2:1 multiplexer to choose between the offset registers are added. This provides one offset for the reading of configuration data from the array, and a separate one for writing the information back to a new location. This buffer requires its own row decoder, since it is composed of several data words. A column decoder between the incoming configuration data and the buffer is not needed, as we design the structure to have multiple rows, but a single column. However, the column decoders connected to the main array (as appear in the basic PRTR design) are no longer necessary, as the information written from the buffer to the array is the full width of the array. This structure is similar to an architecture proposed by Xilinx [Trimberger95].
In order to account for the size of the logic and interconnect in these FPGAs, we use the assumption that the programming layer of a Single Context FPGA uses approximately 25% of the area of the chip. All other architectures are assumed to require this same logic and interconnect area per bit of configuration (or active configuration in the case of a Multi-Context device). See Appendix I for calculation details.
As mentioned before, all models are given the same total silicon area. However, due to the differences in the hardware structures, the number of programming bits, and thus the capacity of the device, varies among our models. For example, according to Appendix I, a Multi-Context model with 1 Megabit of active configuration information and 3 Megabits of inactive information has the same area as a PRTR with 2.4 Megabits of configuration information. Thus the PRTR device has 2.4 times as many logic blocks as the Multi-Context device, but 40% less total configuration space (2.4 Megabits versus 4 Megabits).
Configuration Sequence Generation and Size Definition
We use two sets of benchmarks to evaluate caching algorithms for FPGA models. One set of the benchmarks was compiled and mapped to the Garp architecture [Hauser97], where the compute-intensive loops of C programs are extracted automatically for acceleration on a tightly-coupled dynamically reconfigurable coprocessor [Callahan99]. The other set of benchmarks was created for the Chimaera architecture [Hauck97]. In this system, portions of the code that can accelerate computation are mapped to the reconfigurable coprocessor [Hauck98a]. For both sets of benchmarks the mappings to the coprocessors are referred to as RFUOPs. In order to evaluate the algorithms for different FPGA models, we need to create an RFUOP access trace for each benchmark, which is similar to a memory access string used for memory evaluation.
The RFUOP sequence for each benchmark was generated by simulating the execution of the benchmark. During the simulated execution, the RFUOP ID is output when an RFUOP is encountered. After the completion of the execution, an ordered sequence of the execution of RFUOPs is created. In the Garp architecture, each RFUOP in the benchmark programs has size information in terms of the number of rows occupied. For Chimaera, we assume that the size of an RFUOP is proportional to the number of instructions mapped to that RFUOP.
Configuration Caching Algorithms
In this work, we seek to find caching methods that target the different FPGA models. For each FPGA model, we will develop realistic algorithms that can significantly reduce the reconfiguration latencies. In order to evaluate the performance of these realistic algorithms, we also attempt to develop tight lower bound algorithms by using complete application execution information. For the models where true lower bound algorithms are unavailable, we will develop algorithms that we believe are near optimal.
We divide our algorithms into 3 categories: run time algorithms, general off-line algorithms, and complete prediction algorithms. The classification of the algorithms depends on the time complexity and input information needed for each algorithm.
The run time algorithms use only basic information on the execution of the program up to that point, and thus must make guesses as to the future behavior of the program. This is similar to run time cache management algorithms such as LRU. Because of the limited information at run time, a prediction of keeping a configuration or replacing a configuration may not be correct and can even cause higher reconfiguration overhead. Therefore, we believe that these realistic algorithms will provide an upper bound on reconfiguration overhead, and for some domains better predictors could be developed that improve over these results.
The complete prediction algorithms use complete, exact execution information of the application, and can use computationally intractable approaches. These algorithms attempt to search the whole execution stream in order to lower the configuration overhead. They provide the optimal (lower bound) or near optimal solution. In some cases these algorithms relax restrictions on system behavior in order to make the algorithm a true (but unachievable) lower bound.
The general off-line algorithms use profile information of each application, with computationally tractable algorithms. They represent realistic algorithms for the case where static execution information is available, or approximate algorithms where highly accurate execution predictions can be developed. These algorithms will typically perform between the run time and complete prediction algorithms in terms of quality, and are realistic algorithms for some situations.
With these three classes of algorithms, we can get upper bounds (the run time algorithms) and lower bounds (the complete prediction algorithms), as well as an estimate of behavior for executions without data-dependent execution (the general off-line algorithms).
Single Context Algorithms
In the next two sections we present a near lower bound algorithm (based on simulated annealing), and a more realistic general off-line algorithm, which uses more restricted information. Note that since there are no run-time decisions in a single context device (if a needed configuration is not loaded the only possible behavior is to overwrite all currently loaded configurations with the required configuration), we do not present a run-time algorithm.
Simulated Annealing Algorithm for the Single Context FPGA
When a reconfiguration occurs in a Single Context FPGA, even if only a portion of the chip needs to be reconfigured, the entire configuration memory store will be rewritten. Because of this property, multiple RFUOPs should be configured together onto the chip. In this manner, during a reconfiguration a group (context) that contains the currently required RFUOP, as well as possibly one or more later required RFUOPs, is loaded. This amortizes the configuration time over all of the configurations grouped into a context. Minimizing the number of group (context) loadings will minimize the overall reconfiguration overhead.
The method used for grouping has a great impact on the latency reduction; the overall reconfiguration overhead resulting from a good grouping could be much smaller than that resulting from a bad grouping. For example, suppose there are 4 RFUOPs with equal size and equal configuration latency for a computation, and the RFUOP sequence is 1 2 3 4 3 4 2 1, where 1, 2, 3, and 4 are the RFUOP IDs. Given a Single Context FPGA that has the capacity to hold two RFUOPs, the number of context loads is 3 if RFUOPs 1 and 2 are placed in the same group (context), and RFUOPs 3 and 4 are placed in another. However, if RFUOPs 1 and 3 are placed in the same group (context) and RFUOPs 2 and 4 are placed in the other, the number of context loads will be 7.
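The two load counts in this example can be reproduced with a short sketch. It assumes a Single Context device in which a context load happens whenever the demanded RFUOP belongs to a group other than the one currently on the chip; the function name `count_context_loads` and the group labels "A" and "B" are illustrative only.

```python
def count_context_loads(sequence, group_of):
    """Count context loads on a Single Context device.

    sequence: RFUOP IDs in execution order.
    group_of: maps each RFUOP ID to the ID of its group (context).
    """
    loads = 0
    current = None  # no context is loaded initially
    for rfuop in sequence:
        group = group_of[rfuop]
        if group != current:
            loads += 1       # the whole chip is rewritten with the new context
            current = group
    return loads


sequence = [1, 2, 3, 4, 3, 4, 2, 1]
good_grouping = {1: "A", 2: "A", 3: "B", 4: "B"}
bad_grouping = {1: "A", 3: "A", 2: "B", 4: "B"}

good_loads = count_context_loads(sequence, good_grouping)  # 3
bad_loads = count_context_loads(sequence, bad_grouping)    # 7
```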
In order to create the optimal solution for grouping, one simple method is to create all combinations of configurations and then compute the reconfiguration latency for all possible groupings, from which an optimal solution can be found. However, this method has exponential time complexity, and is therefore not applicable to real applications. In this paper, we instead present a simulated annealing approach to acquire a near optimal solution. For the simulated annealing algorithm, we use the exact reconfiguration overhead for a given grouping as our cost function, and the moves consist of shuffling the different RFUOPs between contexts. Specifically, at each step an RFUOP is randomly picked to move to a randomly selected group, and if there is not enough room in that group to hold the RFUOP, RFUOPs in that group are randomly chosen to move to other groups. Once finished, the reconfiguration overhead of the grouping is computed by applying the complete RFUOP sequence. Since configuration caching for the Single Context FPGA is similar to the placement problem in CAD applications, the simulated annealing algorithm should provide a near optimal solution.
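A minimal sketch of such an annealer is given below. It follows the move described above (random RFUOP, random destination group, random evictions until the group fits), but the cooling schedule, the acceptance rule (a standard Metropolis test), the parameter defaults, and all function names are assumptions made for illustration, not the settings used in the paper. It also assumes every RFUOP fits in a context on its own.

```python
import math
import random


def count_context_loads(sequence, group_of):
    """Cost function: number of context loads for a given grouping."""
    loads, current = 0, None
    for rfuop in sequence:
        if group_of[rfuop] != current:
            loads += 1
            current = group_of[rfuop]
    return loads


def group_load(group_of, sizes, group):
    """Total size of the RFUOPs currently assigned to a group."""
    return sum(sizes[r] for r, g in group_of.items() if g == group)


def random_move(group_of, sizes, capacity, rng):
    """Move a random RFUOP to a random group, then evict until that group fits."""
    new = dict(group_of)
    groups = list(range(len(sizes)))  # one slot per RFUOP, so a free group always exists
    dest = rng.choice(groups)
    new[rng.choice(list(new))] = dest
    while group_load(new, sizes, dest) > capacity:
        victim = rng.choice([r for r, g in new.items() if g == dest])
        fits = [g for g in groups
                if g != dest and group_load(new, sizes, g) + sizes[victim] <= capacity]
        new[victim] = rng.choice(fits)
    return new


def anneal(sequence, sizes, capacity,
           t_start=5.0, t_end=0.05, alpha=0.9, iters=50, seed=0):
    rng = random.Random(seed)
    current = {r: i for i, r in enumerate(sizes)}  # start with one RFUOP per group
    cost = count_context_loads(sequence, current)
    best, best_cost = current, cost
    temperature = t_start
    while temperature > t_end:
        for _ in range(iters):
            cand = random_move(current, sizes, capacity, rng)
            cand_cost = count_context_loads(sequence, cand)
            # Always accept improvements; accept worse moves with Boltzmann probability.
            if cand_cost <= cost or rng.random() < math.exp((cost - cand_cost) / temperature):
                current, cost = cand, cand_cost
                if cost < best_cost:
                    best, best_cost = current, cost
        temperature *= alpha
    return best, best_cost
```

A call such as `anneal([1, 2, 3, 4, 3, 4, 2, 1], {1: 1, 2: 1, 3: 1, 4: 1}, capacity=2)` returns a grouping and its load count; because the search is randomized, the result is a good grouping rather than a guaranteed optimum.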
General Off-line Algorithm for the Single Context FPGA
Although the simulated annealing approach can generate a near optimal solution, the high computation complexity and the requirement of knowledge of the exact execution sequences make this solution unreasonable for most real applications. We therefore suggest an algorithm more suited for general purpose use. The Single Context FPGA requires that the whole configuration memory be rewritten if a demanded RFUOP is not currently on the chip. Therefore, if two consecutive RFUOPs are not allocated to the same group, a reconfiguration will result. Our algorithm attempts to compute the likelihood of RFUOPs following one another in sequence, and uses this knowledge to minimize the number of reconfigurations required. Before we further discuss this algorithm, we first give the definition of a "correlate" as used in the algorithm. Figure 3 illustrates an example of the general off-line algorithm. Each line connects a pair of correlated RFUOPs and the number next to each line indicates the degree of the correlation. As presented in the algorithm, we will merge the highly correlated groups together under the size constraints of the target architecture. In this example, assume that the chip can only retain at most 3 RFUOPs at a time. At the first grouping step we place RFUOP17 and RFUOP4 together. In the 2nd step we add RFUOP43 into the group formed at step 1, since it has a correlation of 30 (15+15) to that group. We then group RFUOP2 and RFUOP34 together in step 3, and they cannot be merged with the previous group because of the size restriction. Finally, in the 4th step RFUOP22 and RFUOP68 are grouped together.
Compared to the simulated annealing algorithm, this algorithm only requires profile information on the degrees of correlation between RFUOPs. In addition, since the number of RFUOPs tends to be much smaller than the length of the RFUOP sequence, it should be much quicker to find a grouping by searching the matrix instead of traversing the RFUOP sequence as the simulated annealing algorithm does. Therefore, the computation time is significantly reduced.
Figure 3: Example of the general off-line algorithm (panels: Step 1, Step 2, Step 3).
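The merging procedure illustrated by the example can be sketched as a greedy loop. Two points are assumptions rather than the paper's definitions: the degree of correlation between two RFUOPs is taken here to be the number of times they appear back to back in the profiled sequence, and ties are broken by list order. The correlation between two groups is the sum of the pairwise correlations, matching the "30 (15+15)" step in the example. The names `adjacency_counts` and `greedy_grouping` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def adjacency_counts(sequence):
    """Assumed correlation: how often two distinct RFUOPs occur back to back."""
    counts = Counter()
    for a, b in zip(sequence, sequence[1:]):
        if a != b:
            counts[frozenset((a, b))] += 1
    return counts


def greedy_grouping(sequence, sizes, capacity):
    """Repeatedly merge the two most correlated groups that still fit in a context."""
    counts = adjacency_counts(sequence)
    groups = [frozenset([r]) for r in sorted(sizes)]

    def corr(g1, g2):
        # Group correlation is the sum of all pairwise correlations.
        return sum(counts[frozenset((a, b))] for a in g1 for b in g2)

    def size(group):
        return sum(sizes[r] for r in group)

    while True:
        candidates = [(corr(g1, g2), g1, g2)
                      for g1, g2 in combinations(groups, 2)
                      if size(g1) + size(g2) <= capacity and corr(g1, g2) > 0]
        if not candidates:
            return groups
        _, g1, g2 = max(candidates, key=lambda c: c[0])
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]


sequence = [1, 2, 3, 4, 3, 4, 2, 1]
groups = greedy_grouping(sequence, sizes={1: 1, 2: 1, 3: 1, 4: 1}, capacity=2)
```

On the sequence 1 2 3 4 3 4 2 1 with room for two RFUOPs per context, this sketch first merges RFUOPs 3 and 4 (adjacent three times) and then RFUOPs 1 and 2, without replaying the sequence after every candidate move.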
Multi-Context Algorithms
In this section we present algorithms for multi-context devices. This includes a complete prediction algorithm that represents a near lower bound, and a general offline algorithm that couples the single-context general offline algorithm with a run-time replacement policy.
A Complete Prediction Algorithm for the Multi-Context FPGA
A Multi-Context FPGA can be regarded as multiple Single Context FPGAs, since the atomic unit that must be transferred from the host processor to the FPGA is a full context. During a reconfiguration, one of the inactive contexts is replaced. In order to reduce the reconfiguration overhead, the number of reconfigurations must be reduced. The factors that could affect the number of reconfigurations are the configuration grouping method and the context replacement policies.
We have discussed the importance of the grouping method for the Single Context FPGA, where an incorrect grouping may have significantly larger overhead than a good grouping. This is also true for the Multi-Context FPGA, where a context (a group of configurations) remains the atomic reconfiguration data transfer unit. The reconfiguration overhead caused by the incorrect grouping remains very high even though the flexibility provided by the Multi-Context FPGA can somewhat reduce part of the overhead.
As mentioned previously, even the perfect grouping will not minimize the reconfiguration overhead if the policies used for context replacement are not considered. A context replacement policy specifies which context should be replaced when a demanded configuration is not in any of the contexts currently present on the chip. Just as in the general caching problem, where frequently used blocks should remain in the cache, the contexts that are frequently used should be kept configured on the chip. Furthermore, if the atomic configuration unit (context) is considered as a data block, we can view the Multi-Context FPGA as a general cache and apply tactics that work for general caches to the Multi-Context FPGA. More specifically, we can apply an existing optimal replacement algorithm, the Belady algorithm [Belady66], to the Multi-Context FPGA. The Belady algorithm is well known in the operating systems and computer architecture fields. It yields the fewest possible replacements provided the memory access sequence is known in advance. The algorithm is based on the idea that the data item to replace is the one whose next access lies farthest in the future. For a Multi-Context FPGA, the optimal context replacement can be achieved as long as the context access string is available. Since the RFUOP sequence is known, it is trivial to create the context access string by transforming the RFUOP sequence. We integrate the Belady algorithm into the simulated annealing grouping method used in the Single Context model to achieve a near optimal solution. Specifically, for each grouping generated, the number of context replacements determined by the Belady algorithm is used as the cost function of the simulated annealing approach. The steps below outline the complete prediction algorithm for the Multi-Context model:
1. Traverse the RFUOP sequence, and for each RFUOP appearing, change the RFUOP ID to the corresponding group ID. This will result in a context access sequence.
2. Initially assign each RFUOP to a group such that for each group the total size of all RFUOPs is smaller than or equal to the size of the context. Set up the parameters of initial temperature and the number of iterations under each temperature.
3. While the current temperature is greater than the terminating temperature:
3.1. While the number of iterations is greater than 0:
3.1.1. A candidate RFUOP is randomly chosen along with a randomly selected destination group to which the candidate will be moved.
3.1.2. After the move, if the total size of the RFUOPs in the destination group exceeds the size of the context, a new candidate RFUOP in the destination group is randomly selected. This RFUOP is then moved to any group that can hold it. This step is repeated until all groups satisfy the size constraint.
3.1.3. Rebuild the context access sequence for the new grouping (as in step 1), apply the Belady algorithm to it, let the new cost be the resulting number of context replacements, and determine whether the move is allowed. Decrease the number of iterations by one.
3.2. Decrease the current temperature.
The reconfiguration overhead for a Multi-Context FPGA is therefore the number of context loads multiplied by the configuration latency for a single context. As mentioned above, the factors that can affect the performance of configuration caching for the Multi-Context FPGA are the configuration grouping and the replacement policies.
Since the optimal replacement algorithm is integrated into the simulated annealing approach, this algorithm will provide the near optimal solution. We consider this algorithm to be a complete prediction algorithm.
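The cost evaluation used inside this annealer can be sketched as follows. The sketch converts an RFUOP sequence into a context access sequence and counts context loads under Belady replacement, evicting the resident context whose next use lies farthest in the future. It assumes, for simplicity, that any resident context may be replaced and that every miss costs one context load; the function names and the sample access strings are hypothetical.

```python
def to_context_sequence(rfuop_sequence, group_of):
    """Replace each RFUOP ID by the ID of the context (group) that holds it."""
    return [group_of[r] for r in rfuop_sequence]


def belady_context_loads(context_sequence, num_contexts):
    """Count context loads with Belady (farthest next use) replacement."""
    on_chip = set()
    loads = 0
    for i, ctx in enumerate(context_sequence):
        if ctx in on_chip:
            continue  # hit: switching contexts is treated as free
        loads += 1
        if len(on_chip) >= num_contexts:
            def next_use(c):
                # Position of the next access to c, or infinity if never used again.
                for j in range(i + 1, len(context_sequence)):
                    if context_sequence[j] == c:
                        return j
                return float("inf")
            on_chip.remove(max(on_chip, key=next_use))
        on_chip.add(ctx)
    return loads


# Hypothetical context access string on a device with two contexts.
loads_example = belady_context_loads(["A", "B", "C", "A", "B", "D", "A"], 2)

# The earlier 4-RFUOP example, grouped as {1, 2} and {3, 4}.
grouping = {1: "A", 2: "A", 3: "B", 4: "B"}
ctx_seq = to_context_sequence([1, 2, 3, 4, 3, 4, 2, 1], grouping)
loads_grouped = belady_context_loads(ctx_seq, 2)
```

Multiplying the returned count by the load latency of a single context gives the reconfiguration overhead for that grouping.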
Least Recently Used (LRU) Algorithm for the Multi-Context FPGA
The LRU algorithm is a widely used memory replacement algorithm in operating systems and computer architecture. Unlike the Belady algorithm, the LRU algorithm does not require future information to make a replacement decision. Because of the similarity between the configuration caching problem and the data caching problem, we can apply the LRU algorithm to the Multi-Context FPGA model. The LRU algorithm is more realistic than the Belady algorithm, but the reconfiguration overhead incurred is higher. The basic steps are outlined below:
1. Apply the Single Context general off-line algorithm to acquire a final grouping of RFUOPs into contexts, and give each group formed its own ID.
2. Traverse the RFUOP sequence, and for each RFUOP appearing, change the RFUOP ID to the corresponding group ID. This will result in a context access sequence.
3. Apply the LRU algorithm to the context access string. Increase the total number of context loads by one if a replacement occurs.
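Step 3 can be sketched with an ordered dictionary that tracks recency. This is an illustrative sketch: it counts every miss, including the initial loads into empty contexts, as one context load, which is an assumption about how the count is kept. The function name `lru_context_loads` and the sample access string are hypothetical.

```python
from collections import OrderedDict


def lru_context_loads(context_sequence, num_contexts):
    """Count context loads when the least recently used context is replaced."""
    resident = OrderedDict()  # keys ordered from least to most recently used
    loads = 0
    for ctx in context_sequence:
        if ctx in resident:
            resident.move_to_end(ctx)      # mark as most recently used
            continue
        loads += 1
        if len(resident) >= num_contexts:
            resident.popitem(last=False)   # evict the least recently used context
        resident[ctx] = True
    return loads


# Hypothetical context access string on a device with two contexts.
lru_loads = lru_context_loads(["A", "B", "C", "A", "B", "D", "A"], 2)
```

On this access string every access misses, which illustrates how a policy without future knowledge can thrash when the working set is one context larger than the device.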
Algorithms for the Partial Run Time Reconfigurable FPGA
Compared to the Single Context FPGA, an advantage of the Partial Run-Time Reconfigurable FPGA is its flexibility of loading and retaining configurations. Any time a reconfiguration occurs, instead of loading the whole group only a portion of the chip is reconfigured while the other RFUOPs located elsewhere on the chip remain intact. The basic idea of configuration caching for PRTR is to find the optimal location for each RFUOP. This is to avoid the thrashing problem, which could be caused by the overlap of frequently used RFUOPs. In order to reduce the reconfiguration overhead for the Partial Run-Time Reconfigurable FPGA, we need to consider two major factors: the reconfiguration frequency and the average latency of each RFUOP. Any algorithm that attempts to lower only one factor will fail to produce an optimal solution because the reconfiguration overhead is the product of the two. A complete prediction algorithm that can achieve a near optimal solution and a general off-line algorithm that can significantly reduce the running time are presented below.
A Simulated Annealing Algorithm for the PRTR FPGA
Similar to the simulated annealing algorithm used for the Single Context FPGA, the purpose of annealing for the Partial Run-Time Reconfigurable FPGA is to find the optimal mapping for each configuration such that the reconfiguration overhead is minimized. For each step, a randomly selected RFUOP is assigned to a random position within the chip and the exact reconfiguration overhead is then computed. Before presenting the full simulated annealing algorithm, we first give the definition of a "conflict" as used in our discussion.
Definition 2: Given two configurations and their positions on the FPGA, RFUOP A is said to be in conflict with RFUOP B if any part of A overlaps with any part of B.
We now present our simulated annealing algorithm for the PRTR FPGA.
1. Assign a random position for each RFUOP. Set up the parameters of initial temperature, number of iterations under each temperature, and terminating temperature.
2. While the current temperature is greater than the terminating temperature:
2.1. While the number of iterations is greater than 0:
2.1.1. A randomly selected RFUOP is moved to a random location within the chip.
2.1.2. Traverse the RFUOP sequence. If the demanded RFUOP is not currently on the chip, load the RFUOP to the specified location, and increase the overall reconfiguration latency by the loading latency of the RFUOP. If the newly loaded RFUOP conflicts with any other RFUOPs on the chip, those conflicting RFUOPs are removed from the chip.
2.1.3. Let the new cost be equal to the overall RFUOP overhead and determine whether the move is allowed. Decrease the number of iterations by one.
2.2. Decrease the current temperature.
Since finding the location for each RFUOP is similar to the placement problem in physical design, where the simulated annealing algorithm usually provides good performance, our simulated annealing algorithm should create a near optimal solution.
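The annealing loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in our experiments: the chip is modeled as a one-dimensional array of programming bits, the acceptance test in step 2.1.3 is assumed to be the standard Metropolis rule, and the names reconfiguration_overhead and anneal, together with the cooling parameters, are hypothetical.

```python
import math
import random


def overlaps(a_start, a_size, b_start, b_size):
    """True if the half-open intervals [start, start + size) share any cell."""
    return a_start < b_start + b_size and b_start < a_start + a_size


def reconfiguration_overhead(sequence, position, size, latency):
    """Step 2.1.2: replay the RFUOP sequence for one fixed placement."""
    on_chip = set()
    total = 0
    for op in sequence:
        if op in on_chip:
            continue  # already resident: no reconfiguration needed
        total += latency[op]
        # Loading op overwrites every resident RFUOP it conflicts with.
        on_chip = {r for r in on_chip
                   if not overlaps(position[op], size[op], position[r], size[r])}
        on_chip.add(op)
    return total


def anneal(sequence, size, latency, chip_size, t_start=10.0, t_end=0.1,
           cooling=0.9, iters_per_temp=50, seed=0):
    """Return (best cost seen, placement that produced it)."""
    rng = random.Random(seed)
    ops = sorted(size)
    # Step 1: random initial position for every RFUOP.
    position = {op: rng.randrange(chip_size - size[op] + 1) for op in ops}
    cost = reconfiguration_overhead(sequence, position, size, latency)
    best_cost, best_position = cost, dict(position)
    temperature = t_start
    while temperature > t_end:                       # step 2
        for _ in range(iters_per_temp):              # step 2.1
            op = rng.choice(ops)                     # step 2.1.1
            old = position[op]
            position[op] = rng.randrange(chip_size - size[op] + 1)
            new_cost = reconfiguration_overhead(sequence, position, size, latency)
            # Step 2.1.3: Metropolis acceptance (an assumed rule).
            if (new_cost <= cost or
                    rng.random() < math.exp((cost - new_cost) / temperature)):
                cost = new_cost
                if cost < best_cost:
                    best_cost, best_position = cost, dict(position)
            else:
                position[op] = old                   # rejected: undo the move
        temperature *= cooling                       # step 2.2
    return best_cost, best_position
```

Every move replays the whole sequence in reconfiguration_overhead, which makes the cost of this exact evaluation visible: the work per move grows with the length of the RFUOP sequence.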
An Alternate Simulated Annealing Algorithm for the PRTR FPGA
In the simulated annealing algorithm presented in the last section, the computational complexity is very high, since the RFUOP sequence must be traversed to compute the overall reconfiguration overhead after every move. A better algorithm is needed to reduce the running time. Again, as for the Single Context FPGA, an adjacency matrix of size N × N is built, where N is the number of RFUOPs. The main purpose of the matrix is to record the possible conflicts between RFUOPs. To reduce the reconfiguration overhead, the conflicts that would create larger configuration loading latency are distributed to non-overlapping locations. This is done by modifying the cost computation step of the previous algorithm. To clarify, we present the full algorithm:
4. While the current temperature is greater than the terminating temperature:
4.1. While the number of iterations is greater than 0:
4.1.1. A randomly selected RFUOP is reallocated to a random location within the chip. After the move, if ...
Generally, the total number of RFUOPs is much less than the length of the RFUOP sequence. Therefore, by looking up the conflict matrix instead of the whole configuration sequence, the time complexity can be greatly decreased. Still, one final concern is the quality of the algorithm because, instead of the configuration sequence, the matrix of potential conflicts derived from the sequence is used. Although the matrix may not represent the conflicts exactly, it gives an estimate of the potential conflicts between any two configurations.
Algorithms for the PRTR FPGA with Relocation and Relocation + Defragmentation
For the PRTR FPGA with Relocation + Defragmentation, the replacement policies have a great impact on reducing reconfiguration overhead. This is because of the high flexibility available for choosing victim RFUOPs when a reconfiguration is required. With Relocation, an RFUOP can be dynamically remapped and loaded to an arbitrary position. With Defragmentation, a demanded RFUOP can be loaded as long as there is enough room on the chip, since the small fragments existing on the chip can be merged. Instead of giving the algorithms for the PRTR FPGA with only Relocation, we first analyze the case of PRTR with both Relocation and Defragmentation. This includes a lower bound algorithm that relaxes restrictions in the system, a general off-line algorithm integrating the Belady algorithm, and two run-time algorithms using different approaches.
A Lower Bound Algorithm for the PRTR FPGA with Relocation + Defragmentation
As discussed previously, the major problems that prevent us from acquiring an optimal solution of configuration caching are the different sizes and different loading latencies of different RFUOPs. Generally, the loading latency of a configuration is proportional to the size of the configuration.
The Belady [Belady66] algorithm gives the optimal replacement for the case in which the memory access string is known and the data transfer unit is uniform. Given the RFUOP sequence for the PRTR with Relocation + Defragmentation model, we can achieve a lower bound for our problem if we assume that a portion of any RFUOP can be transferred. Under this assumption, when a reconfiguration occurs, only a portion of an RFUOP might be replaced while the other portion is still kept on the chip. Once the removed RFUOP is needed again, only the missing portion (possibly the whole RFUOP) is loaded instead of the entire RFUOP. We present the Lower Bound Algorithm as follows:
1. If a required RFUOP is not on the chip, do the following:
1.1. Find the missing portion of the RFUOP. While the missing portion is greater than the free space on the chip, do the following:
1.1.1. For all RFUOPs that are currently on the chip, a victim RFUOP is identified such that in the RFUOP sequence its next appearance is later than the appearance of all others.
1.1.2. Let R = the size of the victim + the size of the free space - the missing portion.
1.1.3. If R is greater than 0, a portion of the victim that equals R is retained on chip while the other portion is replaced and added to the free space. Otherwise add the space occupied by the victim to the free space.
1.2. Load the missing portion of the demanded RFUOP into the free space. Increase the RFUOP overhead by the loading latency of the missing portion.
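The lower bound steps above can be sketched as follows. This is a minimal illustration under two assumptions: loading latency is taken to be exactly proportional to size, so overhead is counted in programming bits loaded, and no single RFUOP is larger than the chip. The function names are hypothetical.

```python
def next_use(sequence, start, op):
    """Index of the next appearance of op at or after start, or infinity."""
    for j in range(start, len(sequence)):
        if sequence[j] == op:
            return j
    return float("inf")


def lower_bound_overhead(sequence, size, chip_size):
    """Belady-style replacement in which RFUOPs may be partially resident."""
    resident = {op: 0 for op in size}  # bits of each RFUOP currently on chip
    free = chip_size
    overhead = 0
    for i, op in enumerate(sequence):
        missing = size[op] - resident[op]          # step 1.1
        if missing == 0:
            continue                               # fully resident
        while missing > free:
            # Step 1.1.1: victim is the resident RFUOP reused furthest ahead.
            candidates = [r for r in resident if r != op and resident[r] > 0]
            victim = max(candidates,
                         key=lambda r: next_use(sequence, i + 1, r))
            keep = resident[victim] + free - missing   # step 1.1.2 (R)
            if keep > 0:
                # Step 1.1.3: retain R bits of the victim, free the rest.
                free += resident[victim] - keep
                resident[victim] = keep
            else:
                free += resident[victim]
                resident[victim] = 0
        # Step 1.2: load only the missing portion.
        resident[op] = size[op]
        free -= missing
        overhead += missing
    return overhead
```

For the sequence 1 2 3 1 2 3 with sizes of 1000, 10 and 10 bits on a 1010-bit chip, this sketch loads 1030 bits in total: the three initial loads plus a 10-bit reload of RFUOP 2, for which only 10 bits of RFUOP 1 are given up.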
Now the size of every R_i in the sequence is equal to the atomic configuration unit.
In our algorithm, we assumed that a portion of any RFUOP can be retained on the chip, and that during reconfiguration only the missing portion of the demanded RFUOP will be loaded. This can be viewed as loading multiple atomic configuration units. Therefore, this problem can be viewed as the general caching problem, with the atomic configuration unit as the data transfer unit. We now present our general off-line algorithm for the PRTR with Relocation + Defragmentation:
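1. If a demanded RFUOP is not currently on the chip, do the following:
1.1. While there is not enough room to load the RFUOP, do the following:
1.1.1. Find the reappearance window W.
1.1.2. For each RFUOP, calculate the total number of appearances in W.
1.1.3. For each RFUOP, multiply the loading latency and the number of appearances. The RFUOP with the smallest value is replaced.
1.2. Load the demanded RFUOP. Increase the overall latency by the loading latency of the RFUOP.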
Steps 1.1.1-1.1.3 specify the rules for selecting the victim RFUOP. Counting the number of appearances of each RFUOP gives the frequency with which the RFUOP will be used in the near future. As we mentioned, this alone is not adequate to determine a victim RFUOP, because an RFUOP with lower frequency may have a much higher configuration latency. Therefore, by multiplying the latency and the frequency, we can find the possible overall latency in the near future if the RFUOP is replaced.
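The victim selection rule can be illustrated with a short Python sketch. The construction of the reappearance window W is not restated here, so the sketch approximates it as a fixed number of upcoming accesses (the window parameter); that approximation, the function name offline_overhead, and the tie-breaking on RFUOP identifier are all assumptions made for illustration.

```python
def offline_overhead(sequence, size, latency, chip_size, window):
    """Replay the sequence, evicting the RFUOP with the smallest
    latency * (appearances in the lookahead window) product."""
    on_chip = set()
    free = chip_size
    total = 0
    for i, op in enumerate(sequence):
        if op in on_chip:
            continue
        # Step 1.1.1 (assumed form of W): the next `window` accesses.
        upcoming = sequence[i + 1:i + 1 + window]
        while free < size[op]:
            # Steps 1.1.2-1.1.3: cost = latency * appearances in W.
            victim = min(on_chip,
                         key=lambda r: (latency[r] * upcoming.count(r), r))
            on_chip.remove(victim)
            free += size[victim]
        # Step 1.2: load the demanded RFUOP.
        on_chip.add(op)
        free -= size[op]
        total += latency[op]
    return total
```

With latency equal to size, the sequence 1 2 3 1 2 3, sizes of 1000, 10 and 10 bits, a 1010-bit chip and a window of three accesses, the sketch yields an overhead of 1030: the large RFUOP is kept while it still appears in the window and is given up only once it has no further use.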
LRU Algorithm for the PRTR FPGA with Relocation + Defragmentation
Since the PRTR with Relocation + Defragmentation model can be viewed as a general memory model, we can use an LRU algorithm for our reconfiguration problem. Here, we traverse the RFUOP sequence, and when a demanded RFUOP is not on the chip and there is not enough room to load it, an RFUOP on the chip is selected for removal by the LRU algorithm. Although simple to implement, this algorithm may display poor quality because it ignores the sizes of the RFUOPs.
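A minimal Python sketch of this LRU policy is given below. Because defragmentation lets free fragments be merged, the chip is modeled only by its amount of free space. The function name lru_overhead is hypothetical, and no RFUOP is assumed to be larger than the chip.

```python
from collections import OrderedDict


def lru_overhead(sequence, size, latency, chip_size):
    """Total loading latency when victims are chosen by least recent use."""
    on_chip = OrderedDict()  # keys ordered from least to most recently used
    free = chip_size
    total = 0
    for op in sequence:
        if op in on_chip:
            on_chip.move_to_end(op)  # mark as most recently used
            continue
        while free < size[op]:
            victim, _ = on_chip.popitem(last=False)  # least recently used
            free += size[victim]
        on_chip[op] = True
        free -= size[op]
        total += latency[op]
    return total
```

As an example of the weakness noted above, take three RFUOPs of 1000, 10 and 10 programming bits, latency equal to size, a 1010-bit chip, and the sequence 1 2 3 repeated three times. Every access then misses, and the sketch reports an overhead of 3060.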
Penalty Oriented Algorithm for the PRTR FPGA with Relocation + Defragmentation
Since the non-uniform size of RFUOPs is not considered as a factor in the LRU algorithm, a high RFUOP overhead could potentially result. For example, consider an RFUOP sequence 1 2 3 1 2 3 1 2 3 ..., where RFUOPs 1, 2 and 3 have sizes of 1000, 10 and 10 programming bits respectively. Suppose also that the size of the chip is 1010 programming bits. Under the LRU algorithm, the RFUOPs are replaced in the same order as the RFUOP sequence, so every access causes a reload. The configuration overhead will clearly be much smaller if RFUOP 1 is always kept on the chip. This does not mean that we always want to keep larger RFUOPs on the chip, as keeping larger configurations with low reload frequency may not reduce the reconfiguration overhead. Instead, both size and frequency factors should be considered in the algorithm. Therefore, we use a variable "credit" to determine the victim [Young94]. Every time an RFUOP is loaded onto the chip, its credit is set to its size. When a replacement occurs, the RFUOP with the smallest credit is evicted from the chip and the credit of all other RFUOPs on the chip is decreased by the credit of the victim. To make this clearer, we present the algorithm as follows:
1. If a demanded RFUOP is currently on the chip, set its credit equal to its size. Otherwise, do the following:
1.1. While there is not enough room to load the required RFUOP:
1.1.1. For all RFUOPs on chip, replace the one with the smallest credit and decrease the credit of all other RFUOPs by that value.
1.2. Load the demanded RFUOP and set its credit equal to its size.
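The credit scheme above can be sketched in Python as follows. The function name credit_overhead and the tie-breaking on RFUOP identifier are assumptions made for illustration; as before, the chip is modeled by its free space only and no RFUOP is assumed to be larger than the chip.

```python
def credit_overhead(sequence, size, latency, chip_size):
    """Total loading latency under the credit-based replacement policy."""
    credit = {}  # resident RFUOP -> current credit
    free = chip_size
    total = 0
    for op in sequence:
        if op in credit:
            credit[op] = size[op]  # step 1: refresh credit on a hit
            continue
        while free < size[op]:     # step 1.1
            # Step 1.1.1: evict the RFUOP with the smallest credit ...
            victim = min(credit, key=lambda r: (credit[r], r))
            penalty = credit.pop(victim)
            free += size[victim]
            # ... and charge that credit to every remaining RFUOP.
            for r in credit:
                credit[r] -= penalty
        credit[op] = size[op]      # step 1.2
        free -= size[op]
        total += latency[op]
    return total
```

On the example from the text (sizes of 1000, 10 and 10 bits, a 1010-bit chip, latency equal to size, and the sequence 1 2 3 repeated three times), the sketch keeps RFUOP 1 resident and reports an overhead of 1060, since only the two small RFUOPs displace each other.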
A General Off-line Algorithm for the PRTR FPGA with Relocation
One major advantage that the PRTR with Relocation + Defragmentation has over the PRTR with Relocation is its higher utilization of the space on the chip. Any small fragments can be combined into one larger area, so that an RFUOP could possibly be loaded without forcing a replacement. For PRTR with only Relocation, however, those fragments could be wasted. This could cause an RFUOP that is currently on chip to be replaced, and thus may result in extra overhead if the replaced RFUOP is demanded again very soon. In order to reduce the reconfiguration overhead for this model, the utilization of the fragments must be improved. We present the algorithm as follows:
1. If a demanded RFUOP is not currently on the chip, do the following:
1.1. While there is not enough room to load the RFUOP, do the following:
1.1.1. Find the reappearance window W.
1.1.2. For each RFUOP, calculate the total number of appearances in W.
1.1.3. For each RFUOP, multiply the loading latency and the number of appearances, producing a cost.
1.1.4. For each RFUOP on chip, presume that it is the candidate victim and identify the adjacent configurations that must also be removed to make room for the demanded RFUOP. Sum up the costs of all the potential victims.
1.1.5. Identify the smallest sum over all candidates; the victims that produce this smallest cost are replaced.
1.2. Load the demanded RFUOP. Increase the overall latency by the loading latency of the configuration.
The general off-line heuristic that is applied to the PRTR with Relocation + Defragmentation is also used in this algorithm. The major difference is that this algorithm considers the geometric positions of the RFUOPs. Since the PRTR with Relocation + Defragmentation model has the ability to collect the fragments, the RFUOPs are replaced in increasing order of their costs (load latency times the number of appearances in the reappearance window). However, this scheme does not work for the PRTR with Relocation if the victim RFUOPs are separated by other non-victim RFUOPs, because the system cannot merge non-adjacent spaces. Therefore, when multiple RFUOPs are to be replaced in the PRTR FPGA with Relocation, these RFUOPs must be adjacent or separated only by fragments. Considering this geometric factor, the victims to be replaced are the adjacent RFUOPs (or RFUOPs separated only by fragments) that produce the smallest overall cost.
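The geometric victim selection can be sketched as below. This is one possible reading of steps 1.1.4 and 1.1.5, assuming a one-dimensional chip and non-negative costs already computed as in step 1.1.3. The function name choose_victims and the layout representation (a list of start address and RFUOP pairs sorted by address) are hypothetical.

```python
def choose_victims(layout, size, cost, chip_size, need):
    """Return the cheapest run of neighbouring RFUOPs whose removal frees
    at least `need` contiguous units, or None if no run suffices.

    layout: list of (start, op) pairs sorted by start address.
    """
    best = None
    for i in range(len(layout)):
        # The freed region begins where the previous non-victim ends.
        if i == 0:
            left = 0
        else:
            prev_start, prev_op = layout[i - 1]
            left = prev_start + size[prev_op]
        total = 0
        for j in range(i, len(layout)):
            total += cost[layout[j][1]]
            # The freed region ends where the next non-victim begins.
            right = chip_size if j + 1 == len(layout) else layout[j + 1][0]
            if right - left >= need:
                if best is None or total < best[0]:
                    best = (total, [op for _, op in layout[i:j + 1]])
                break  # a longer run from i can only cost more
    return best[1] if best else None
```

Fragments on either side of a run count toward the freed region, which is how the sketch makes use of space that would otherwise be wasted in the Relocation-only model.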
Simulation Results and Discussion
All algorithms are implemented in C++ on a Sun Sparc-20 workstation. As can be seen in Figure 4, the reconfiguration penalties of the PRTR model are much smaller (64% to 85% smaller) than those of the Single Context model. This is because, with almost the same capacity, the PRTR model can significantly reduce the average reconfiguration latency of the Single Context model without incurring a much larger number of reconfigurations. The Multi-Context model has a smaller reconfiguration overhead (20% to 40% smaller) than the PRTR when the chip silicon is small. With small silicon area, the Multi-Context model wins because of its much larger configuration area. As the silicon area becomes larger, the number of conflicts incurred in the PRTR model is greatly reduced, and thus the PRTR has almost the same reconfiguration penalty as the Multi-Context model. In fact, the PRTR performs even better than the Multi-Context model in some cases. The Multi-Context device must reload a complete context each time, making the per-reconfiguration penalty in large chips much higher than in the PRTR model. Since the number of conflicts is small, the overall reconfiguration overhead of the PRTR FPGA is smaller than that of the Multi-Context FPGA.
Figure 5 demonstrates the reconfiguration overheads of the two new models we proposed. For the PRTR with Relocation + Defragmentation, the general off-line algorithm performs almost as well as the lower bound algorithm in reducing the reconfiguration overhead, especially as the chip silicon becomes larger. Note that the lower bound algorithm relaxes the PRTR model restrictions by allowing portions of RFUOPs to be replaced and loaded. As can be seen in Figure 5, future information is very important, as the general off-line algorithm for the PRTR with Relocation performs better than both the LRU and the penalty oriented algorithms for the PRTR with Relocation + Defragmentation. Because it considers only the frequency factor while ignoring load latency, LRU, a traditional caching algorithm, performs worse than the penalty oriented algorithm.
[Figure; x-axis: Normalized FPGA size]
Conclusions
Configuration caching, in which configurations are retained on chip until they are required again, is a technique to reduce the reconfiguration overhead. However, the limited on-chip memory and the non-uniform configuration latency add complexity in deciding which configurations to retain to maximize the odds that the required data is present in the cache. To deal with these problems, we have developed new caching algorithms targeted at a number of different FPGA models. In addition to the three currently dominant models (Single Context FPGA, Partial Run-
