Abstract-One of the major impediments to deploying partially run-time reconfigurable FPGAs as hardware accelerators is the time overhead involved in loading the hardware modules. While configuration prefetching is an effective method for reducing this overhead, mispredicted prefetches may worsen the situation by increasing the number of reconfigurations needed. In this paper, we present a static algorithm for configuration prefetching in partially reconfigurable FPGAs that minimizes the reconfiguration overhead. By making use of profiling, the interprocedural control flow graph, and the placement information of the hardware modules, our algorithm predicts hardware execution and prefetches hardware modules as early as possible while minimizing the risk of mispredictions. Our experiments show that our algorithm performs significantly better than current state-of-the-art prefetching algorithms for control-bound applications.
I. INTRODUCTION
Configuration prefetching [1] seeks to address the problem of the high run-time reconfiguration overhead of FPGAs by parallelizing the (partial) reconfiguration of the FPGA with the application's execution. However, as we shall show, a misprediction in the prefetch can be very costly because it increases the number of reconfigurations. Therefore, the correct scheduling of configurations is key to the good performance of such accelerators. Ideally, the execution of a hardware module should be predicted as early as possible (given the large configuration latency) and as accurately as possible (so as to avoid costly recoveries). In this paper, we present an algorithm that reduces configuration latency for specifications written as interprocedural control flow graphs [3]. Through the use of profile information, the algorithm predicts the execution of hardware modules by computing 'placement-aware' probabilities. These probabilities are in turn used to generate prefetching code that is inserted into the control flow graph. Our experiments show that our approach significantly outperforms previous work.
II. BACKGROUND

A. Architecture Model
We consider the architecture model shown in Figure 1. The model is based on actual silicon devices such as the Xilinx Virtex family of FPGAs, in particular Virtex-II Pro, Virtex-4 and Virtex-5. The software code, data, and the bitstreams to be loaded onto the reconfigurable region are stored in memory. The CPU is the main controller of application execution and is also responsible for initiating the reconfiguration of the FPGA. The reconfiguration manager is a hardware module that loads bitstream data from memory upon requests issued by the CPU.
The reconfigurable region is organized as n slots where hardware modules can be placed. We consider any two placements of the hardware modules with overlapping slots to be in 'physical placement conflict' (or just 'conflict' for the rest of the paper). Conflicting hardware modules cannot be loaded into the reconfigurable region at the same time. 
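Concretely, the conflict test reduces to a slot-range overlap check. The following is a minimal sketch, assuming each placement is described by its first slot and its width in slots (this representation is ours, not mandated by the architecture):

```python
# Minimal sketch of the 'physical placement conflict' test. A placement is a
# (first_slot, width_in_slots) pair; names are illustrative, not the paper's.

def conflicts(placement_a, placement_b):
    """Two placements conflict iff their slot ranges overlap."""
    a_start, a_width = placement_a
    b_start, b_width = placement_b
    return a_start < b_start + b_width and b_start < a_start + a_width

# Example: modules occupying slots [0, 3) and [2, 5) overlap in slot 2.
print(conflicts((0, 3), (2, 3)))  # True
print(conflicts((0, 3), (3, 2)))  # False
```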
III. PROBLEM FORMULATION AND MOTIVATION
The aim of this paper is to minimize the reconfiguration delay of a single, sequential program for the platform described above. We assume that the placements of the hardware modules are fixed. We represent the program as an interprocedural control flow graph (ICFG), a directed graph G = (V, E, C, U, HW) in which every node is either a basic block or a hardware node (a block of code that invokes hardware execution). V is the set of all nodes in the graph and E is the set of all edges; head(e) and tail(e) refer to the begin and end nodes of edge e, respectively. C ⊂ V is the set of all call sites, and HW ⊂ V is the set of hardware nodes.

A correctly timed prefetch hides the configuration latency behind useful computation. On the other hand, a mispredicted loading may result in a schedule that is longer than "fetch-on-demand": in Figure 2(c), the loading of a during the execution of c results in an additional reconfiguration of b later, lengthening the original "fetch-on-demand" schedule. This paper aims to ensure that configurations are loaded at appropriate times so that the reconfiguration overhead is minimized.
IV. INTERPROCEDURAL PLACEMENT-AWARE CONFIGURATION SCHEDULING
The proposed algorithm has five stages:

1) Obtain the frequency of executing each control-flow edge through profiling and remove all edges that are never executed. The weight of each edge is computed as w(e) = freq(e) / freq(head(e)), where freq(e) is the profiled execution frequency of edge e and freq(head(e)) is the total frequency count of node head(e).

2) Compute post-dominators [6] for every node, denoting the immediate post-dominator of each node v by ipdom(v).

3) Compute the intra post-dominator path (IPDP) information for each node v, i.e., the set of paths starting from v with the following properties: a) no node along the path is a post-dominator of any other node along the path, and b) the estimated probability of the path being taken is greater than a threshold value (set to 0.0005 in our experiments). The estimated probability P_path(p) of taking a path p is the product of the weights of the edges on the path.

4) Compute, for every node in the graph, the estimated placement-aware probability (PAP) of reaching each hardware node, using the IPDP and post-dominator information, through a fixed-point iterative method.

5) Insert hardware loading instructions into candidate basic blocks chosen based on the PAP information.

The rest of this section describes steps 4 and 5 in more detail. The update rule for a general node, referred to as Algorithm 2 below, is:

    forall hw ∈ HW do
        max_prob ← IPDP_Prob(v, hw);
        factor ← 1.0;
        forall hw′ ∈ HW : hw′ ≠ hw do
            factor ← factor − IPDP_Prob(v, hw′);
        end
        if factor × P(ipdom(v), hw) > max_prob then
            max_prob ← factor × P(ipdom(v), hw);
        end
        if max_prob < threshold then temp_p(hw) ← 0;
        else temp_p(hw) ← max_prob;
        end
    end
    if ∃ hw ∈ HW : temp_p(hw) ≠ P(v, hw) then
        change ← true; P(v) ← temp_p;
    end
    return change;
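Read operationally, the listing above translates into the following Python sketch. The dictionary-based representation of IPDP_Prob and P, the THRESHOLD constant, and the initialization of max_prob from IPDP_Prob are our reading of the update rule, not code from the paper:

```python
# Sketch of the general-node PAP update (Algorithm 2), under our assumptions:
# IPDP_prob and P are dicts keyed by (node, hw_node), ipdom maps a node to its
# immediate post-dominator, and HW is the set of hardware nodes.

THRESHOLD = 0.0005  # same cut-off used for the IPDP paths

def update_general_node(v, HW, IPDP_prob, P, ipdom):
    """Recompute the PAP vector of node v; return True if it changed."""
    temp_p = {}
    for hw in HW:
        # Probability of reaching hw directly along one of v's IPDPs.
        max_prob = IPDP_prob.get((v, hw), 0.0)
        # Probability mass not absorbed by other hardware nodes on the IPDPs,
        # carried onward through the immediate post-dominator of v.
        factor = 1.0
        for other in HW:
            if other != hw:
                factor -= IPDP_prob.get((v, other), 0.0)
        via_ipdom = factor * P.get((ipdom[v], hw), 0.0)
        max_prob = max(max_prob, via_ipdom)
        temp_p[hw] = 0.0 if max_prob < THRESHOLD else max_prob
    changed = any(temp_p[hw] != P.get((v, hw), 0.0) for hw in HW)
    if changed:
        for hw in HW:
            P[(v, hw)] = temp_p[hw]
    return changed
```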
A. Iterative PAP Estimation and Prefetch Code Generation
Our configuration prefetch algorithm is based on a placement-aware probability computed for every node.
Definition 4.1: Placement-aware Probability (PAP) The placement-aware probability (PAP) of a node n of the ICFG reaching hardware node g is the sum of the estimated probabilities of all paths from n to g on which no hardware node conflicting with g appears.
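In terms of the edge weights defined above, this can be written as the following formula, where Π(n, g) (our notation) denotes the set of paths from n to g containing no hardware node that conflicts with g:

P(n, g) = \sum_{p \in \Pi(n, g)} \; \prod_{e \in p} w(e)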
We estimate the PAP of reaching every hardware node from each node in the ICFG through an iterative fixed-point method. Algorithm 1 shows the main loop, which processes all the nodes in the graph during each iteration and continues doing so until a fixed point is reached (i.e., until the estimated probabilities for each node have stabilized). Throughout all iterations, we maintain two two-dimensional tables, IPDP_Prob and P. IPDP_Prob(v, hw) is the estimated probability that node v reaches hardware node hw through one of its IPDPs. P(v, hw) is the estimated PAP that node v reaches hardware node hw through any possible path, and P(v) is the vector of all estimated PAPs for node v. Every P(v, hw) is initialized to zero except when v = hw, in which case it is initialized to 1. A procedure may have multiple callers; due to the uncertainty of the calling context, we do not compute PAPs for the exit nodes of procedures.
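A minimal sketch of the fixed-point driver follows. Here update_fn(v) stands for the per-node update (the general-node rule sketched above for ordinary nodes, and the call-site rule sketched after the next paragraph); the interface is our assumption:

```python
# Sketch of the fixed-point main loop (Algorithm 1). `update_fn(v)` recomputes
# the PAP vector of node v in place and returns True if it changed.

def compute_paps(nodes, is_exit, update_fn):
    """Iterate over all nodes until no PAP vector changes any more."""
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if is_exit(v):
                continue          # no PAPs are computed for procedure exits
            if update_fn(v):
                changed = True    # keep iterating until a fixed point
```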
We distinguish between the general case and call sites when updating the estimated PAPs. Algorithm 2, listed at the end of the previous section, shows how the estimated PAPs of a general node v (i.e., neither a call site nor an exit node) are updated: a vector of estimated PAPs temp_p is computed and used to update P(v) if the two vectors differ, in which case a change is reported. We estimate the PAP of a call site as the maximum of a) the weighted sum of the PAPs of the entry nodes of its callees and b) the PAP of its own immediate post-dominator.
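A hedged sketch of the call-site rule is shown below. The callees(v) accessor, assumed to yield (entry_node, weight) pairs derived from profiling, is our invention, not the paper's interface:

```python
# Sketch of the call-site PAP update: take the larger of (a) the weighted sum
# of the PAPs of the callees' entry nodes and (b) the PAP of the call site's
# own immediate post-dominator.

def update_call_site(v, HW, P, ipdom, callees):
    """Recompute the PAP vector of call site v; return True if it changed."""
    temp_p = {}
    for hw in HW:
        via_callees = sum(w * P.get((entry, hw), 0.0)
                          for entry, w in callees(v))
        via_ipdom = P.get((ipdom[v], hw), 0.0)
        temp_p[hw] = max(via_callees, via_ipdom)
    changed = any(temp_p[hw] != P.get((v, hw), 0.0) for hw in HW)
    if changed:
        for hw in HW:
            P[(v, hw)] = temp_p[hw]
    return changed
```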
Example: Figure 3 shows how the estimated PAPs are computed for a simple CFG. During initialization, the PAPs of reaching the hardware nodes are set to 0 except for the hardware nodes themselves (e.g., the probability of node C reaching C is 1). Note that although C is a post-dominator of node 4, the estimated PAP of node 4 reaching C is 0.75, because there is also a 0.25 probability of reaching D through node 2 from node 4. While this could be an over-estimation, it is sufficient for our purposes to obtain the relative sizes of the probabilities of reaching each hardware module.
After PAP estimation, we select the basic blocks that become candidates for the insertion of hardware prefetch instructions. The number of candidates can be reduced by clearing the PAPs of nodes whose parents all have the same PAPs as the node itself. Hardware prefetch instructions are inserted into the basic blocks with non-zero PAPs. The exact hardware module loaded, however, depends on run-time conditions: if the most probable hardware module is neither loaded nor currently being reconfigured, it is loaded at the candidate basic block; otherwise, if there is no ongoing loading, the next most probable hardware module that is not yet loaded and does not conflict with the most probable one is loaded onto the FPGA, as sketched below.
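The run-time decision at a candidate block can be sketched as follows, assuming a state object exposing is_loaded/is_loading/start_load, an in-progress query, and a pairwise conflict test (all names are illustrative); ranked is the list of hardware modules sorted by decreasing PAP at this block:

```python
# Sketch of the run-time decision taken at a candidate basic block.

def prefetch_at_candidate(ranked, state):
    if not ranked:
        return
    best = ranked[0]
    if not state.is_loaded(best) and not state.is_loading(best):
        state.start_load(best)          # most probable module wins
    elif not state.reconfiguration_in_progress():
        for hw in ranked[1:]:
            # Next most probable module that is absent and compatible
            # with the most probable one.
            if not state.is_loaded(hw) and not state.conflicts(hw, best):
                state.start_load(hw)
                break
```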
V. EXPERIMENTAL EVALUATION
A. Experimental Setup
To evaluate the effectiveness of our approach, we performed experiments with two applications: 429.mcf from the SPEC2006 benchmark suite [7] and h264enc from the MediaBench II video benchmark suite [8]. 429.mcf performs single-depot vehicle scheduling, while h264enc [9] is an H.264/AVC (Advanced Video Coding) encoder. Through profiling, we identified 6 compute-intensive regions in 429.mcf and 7 in h264enc to be implemented in hardware; these regions are either basic blocks or loops in the original program. For our experiments, we assumed that each hardware module is 5 times faster than its software counterpart.
We modeled our experimental platform along the lines of ReCoBus [10], which supports complex run-time reconfiguration. The ReCoBus reconfigurable regions are organized in terms of reconfigurable slots that are 6 CLB columns wide; a slot is the smallest granularity that any hardware module can occupy on the FPGA. We assumed a device with a geometry similar to that of the Xilinx Virtex-II Pro XC2VP30 [11], organized as a CLB matrix of 80 rows and 56 columns, with the PowerPC CPU operating at 300 MHz and the 32-bit reconfiguration port at 100 MHz. The overhead of reconfiguring each slot can be calculated from the data in the datasheet [11]; it is approximately 81,576 PowerPC cycles.
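As a back-of-the-envelope consistency check (ours; the exact per-slot frame counts come from the datasheet [11]), the quoted cycle count converts into latency and configuration data volume per slot as follows:

```python
# Convert the quoted per-slot overhead into time and bitstream volume,
# using only the clock rates and port width stated above.

CPU_HZ = 300e6            # PowerPC clock
PORT_HZ = 100e6           # reconfiguration port clock
PORT_BYTES_PER_CYCLE = 4  # 32-bit reconfiguration port

slot_cycles = 81_576                  # per-slot overhead in PowerPC cycles
slot_seconds = slot_cycles / CPU_HZ   # ~272 microseconds per slot
slot_bytes = slot_seconds * PORT_HZ * PORT_BYTES_PER_CYCLE
print(f"{slot_seconds * 1e6:.0f} us/slot, ~{slot_bytes / 1024:.0f} KiB/slot")
```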
We performed our experiments using a trace-based simulator that takes the basic-block trace and the execution time information as input and computes the execution time of the application on the reconfigurable computing architecture described above. We compared the performance of our algorithm against the four algorithms described below.
Fetch-on-demand (FOD): In the FOD schedule, there is no prefetching of configurations. A hardware module is loaded when it is encountered during execution and is not already resident on the FPGA (a sketch of this policy follows). It is reasonable to expect any prefetching approach to do better than this; we used the execution time of the fetch-on-demand scenario as the baseline for comparison in our experiments.
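The baseline amounts to the following trace-driven loop; a minimal sketch assuming per-block execution times, per-module load latencies, and a pairwise conflict predicate (the interface is ours, not the actual simulator's):

```python
# Sketch of the fetch-on-demand baseline over a basic-block trace.

def simulate_fod(trace, hw_nodes, exec_time, load_time, conflicts):
    loaded = set()   # hardware modules currently resident on the FPGA
    total = 0.0
    for block in trace:
        if block in hw_nodes and block not in loaded:
            # Evict conflicting modules, then pay the full load latency.
            loaded = {m for m in loaded if not conflicts(m, block)}
            loaded.add(block)
            total += load_time(block)
        total += exec_time(block)
    return total
```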
Optimal prefetching (OPT): Our implementation of OPT relies on the algorithm described in [15]. It can be applied only when the entire execution trace is known beforehand. We do not expect any static approach to perform better, but the gap between OPT and FOD serves as a useful gauge for the effectiveness of our approach.
Placement-blind probabilistic algorithm (PBP): The implementation of PBP is based on [5]. Note that PBP was developed for relocatable and defragmentable FPGAs rather than for the Xilinx FPGA architectures; it therefore does not account for conflicts between hardware modules.
Conservative analysis (CA): The implementation of CA is based on [4] . Reconfigurations are not preempted in this case. Instead, all previously issued prefetches (maintained in a queue) must be completed before a hardware module that is yet to be configured can execute. The prefetch queue is cleared only at the insertion edges.
B. Experimental Results
The specific placement of the hardware modules affects the conflict relationships between them. To evaluate the effectiveness of our algorithm, we therefore performed experiments for different placements; Figure 4 shows the resulting speedups and slowdowns. Each placement is named after the corresponding application: labels starting with h264- refer to placements for h264enc, while labels starting with mcf- refer to placements for 429.mcf. Placements labeled 's6' were generated for a reconfigurable region of 6 reconfigurable slots, and those labeled 's8' for a region of 8 slots. We make the following observations:

a) Performance degrades severely when conflicts are not taken into account. PBP suffers a 20% to 90% degradation in performance for most of the placement sets tested in our experiments.

b) CA is consistently either very close to, or slightly worse than, the baseline. Being conservative, its prefetches are inserted at control-flow points very near to where the hardware modules need to execute.

c) For the same benchmark, the speedup gained by our approach depends on the placement. In particular, h264-s6-1 is the best placement for h264enc, achieving a speedup of nearly 30% of that of the optimal prefetch algorithm. This shows how placement affects both overall performance and the opportunities available for configuration prefetching.

d) On the whole, our algorithm returned results that fall between 17% and 72% of the OPT results, without having to process the gigabytes of traces the latter requires.
VI. CONCLUSION
In this paper, we have described a novel method that statically determines the places in an application's control flow graph at which prefetches of hardware modules into the FPGA should be initiated so as to minimize the reconfiguration overhead. Our approach performs consistently better than our baseline and also outperforms state-of-the-art static prefetching algorithms, reaching 72% of the optimal prefetching results at its best. As future work, we intend to extend the algorithm to also take into consideration the execution phases of applications.
