A program phase is an interval over which the working set of the program remains more or less constant. This paper presents a dynamic optimization scheme which uses program phase information to optimize designs for reconfigurable computing. We present a mathematical formulation of the optimization problem and propose a solution which comprises: (1) a hardware compilation scheme for generating configurations that are specialized for different phases of execution; (2) a runtime system which manages interchange of these configurations to maintain specialization across phase transitions. We report experimental results for Xilinx Virtex FPGAs involving OpenGL SPECviewperf benchmarks and demonstrate 95.39% speedup over an optimized uniform rate static design and 11.13% speedup over an optimized multi-initiation interval static design. We present a framework for a posteriori performance analysis and architectural exploration with which we (a) establish a performance upper bound under perfect phase optimization, (b) investigate sensitivity to reconfiguration time, and (c) examine the quality of the proposed algorithm for phase detection. The optimization is shown to be surprisingly insensitive to increased reconfiguration time. Faster reconfiguration yields limited benefits and performance improvements are possible up to 1 second reconfiguration time.
INTRODUCTION
Since the 1960s [1] it has been known that a broad set of programs exhibit phase behavior. Any program which adheres to the Phase Transition Model [2] has predictable program memory access patterns. Program phase is one of the basic principles which underpin cache design and branch prediction. Recently, microprocessors with reconfigurable cache configurations have been proposed [3] which include an explicit phase prediction model to specialize the cache configuration at runtime in response to phase change. This paper explores using phase behavior to optimize the mapping of computer programs to reconfigurable architectures such as FPGAs.
OPTIMIZATION FORMULATION AND EXISTING APPROACHES
Phase-optimization consists of two tasks: generating phase-optimized configurations and managing reconfiguration.
Configurations can be generated at runtime or offline. In microprocessors with multi-configuration caches [3] the different phase-optimized cache configurations are designed by hand offline. Several software [4] and hardware [5] environments have been reported which specialize designs at runtime. Offline generation may allow for greater specialization whilst runtime generation requires less state.
Configurations must be interchanged at runtime to maintain phase-specialization. This task is modeled using a trellis graph (Fig. 1):
1. Let $t_p$ be the number of computation steps in a program execution.
2. Let $c$ be the number of phase-optimized configurations.
3. $T \in \mathbb{R}^{c \times t_p}$ stores trellis node weights. $T_{i,j}$ is the cost of configuration $i$ for step $j$.
4. $R \in \mathbb{R}^{c \times c}$ holds edge weights. $R_{i,j}$ is the reconfiguration time between configurations $i$ and $j$.
A reconfiguration schedule is represented by $S \in \mathbb{N}^{t_p}$, where $S_i$ is the index of the configuration used at computation step $i$. The cost of $S$ is its path length, $\mathrm{length}(S) = \sum_i \left( T_{S_i,i} + R_{S_i,S_{i-1}} \right)$. The optimal reconfiguration schedule is the shortest path through the trellis, $S_{opt}$, for which $\mathrm{length}(S_{opt})$ is minimal. Given a complete execution trellis, $S_{opt}$ can be computed by a simplified Dijkstra's shortest-path algorithm [6]. The reconfiguration manager must compute an approximation of these operations at runtime. This consists of the following tasks:
1. Monitor the working set. At each $t_{current}$, a configuration-independent measure of state called the working set signature is recorded. A windowed working set history of length $w$, $W_{past} \in \mathrm{sig}^{w}$, is stored. In multi-configuration cache microprocessors [3], the working set signature consists of a lossily compressed histogram of program counter values over time.
2. Working set sequence prediction. The future working set sequence $W_{future} \in \mathrm{sig}^{w}$ is predicted from $W_{past}$.
3. Evaluate cost of alternative configurations. The future trellis window $T_{future} \in \mathbb{R}^{c \times w}$ is determined from $W_{future}$.
4. Reconfiguration scheduling. The shortest path over $T_{future}$ is estimated.
5. Invoke reconfiguration. The reconfiguration schedule is implemented by invoking reconfiguration at the specified time-steps.

Fig. 1. Trellis representation of execution history. $T_{i,j}$ is the cost of configuration $i$ for step $j$. $R_{i,j}$ is the reconfiguration time between configurations $i$ and $j$.
In existing systems, task (3) is achieved by tuning [3] or modeling. A tuning sequence systematically tries each of a number of configurations and measures the performance of each. Tasks (2) and (4) are typically bundled together in a combined algorithm [3].
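The a posteriori shortest-path computation over a complete trellis can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a dynamic-programming pass over the layered trellis graph (which finds the same shortest path as Dijkstra's algorithm on this graph), and it assumes that the first step may start in any configuration at no cost. The function name `optimal_schedule` and the example values are hypothetical.

```python
def optimal_schedule(T, R):
    """Return (total_cost, schedule) for the cheapest path through the trellis.

    T[i][j]: cost of running computation step j in configuration i.
    R[a][b]: reconfiguration time when switching from configuration a to b
             (R[a][a] is expected to be 0).
    Assumption: the first step may start in any configuration at no cost.
    """
    c, steps = len(T), len(T[0])
    best = [T[i][0] for i in range(c)]   # cheapest cost of a path ending in config i
    back = []                            # back[j-1][i]: predecessor of config i at step j
    for j in range(1, steps):
        prev_best, best, choice = best, [], []
        for i in range(c):
            # cheapest predecessor configuration for arriving in i at step j
            k = min(range(c), key=lambda p: prev_best[p] + R[p][i])
            best.append(prev_best[k] + R[k][i] + T[i][j])
            choice.append(k)
        back.append(choice)
    # walk the predecessor table backwards from the cheapest final node
    last = min(range(c), key=lambda i: best[i])
    schedule = [last]
    for choice in reversed(back):
        schedule.append(choice[schedule[-1]])
    schedule.reverse()
    return best[last], schedule


# Toy trellis: configuration 0 suits the first two steps, configuration 1 the last two.
T = [[1, 1, 5, 5],
     [3, 3, 1, 1]]
R = [[0, 2],
     [2, 0]]
cost, schedule = optimal_schedule(T, R)
```

With the toy values above, one reconfiguration after the second step is optimal. Raising the off-diagonal entries of `R` to 10 makes the switch too expensive, and the cheapest path stays in configuration 1 throughout.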
PROPOSED SYSTEM
In our system, designs are specialized for different phases by optimizing resource allocation between different program branches. The optimal resource allocation for a phase is a function of program branch probabilities [7]. We define a program phase as an interval over which the branch probabilities of a program remain more or less constant. The proposed system consists of:

Generation of phase-optimized configurations

We compile a single high-level program into a spectrum of phase-optimized FPGA configurations offline. Our compilation scheme [7] [8] combines coarse-grain asynchronous and fine-grain synchronous pipelines and allows different program branches to operate at different initiation intervals. For a design of $n$ basic blocks, parameter $b \in \mathbb{N}^{n}$ sets the initiation interval of each block. $b_i$ is the cycles per result of block $i$. The spectrum of configurations covers a subset of the Cartesian product of possible parameterizations, culled by applying flow heuristics. A parameterization is culled if it contains a sub-graph with:
1. Downstream slack. The sustainable output rate is greater than the sum of the maximum input rates.
2. Upstream blocking. The sum of the input rates is greater than the maximum sustainable output rate.
Management of reconfiguration
Our reconfiguration manager mixes hardware and software. Monitoring of the working set is conducted in hardware with all other activities in software.
Monitor the working set

We define the working set signature to be the set of branch probabilities over a finite execution window. The working set signature is recorded by profiling counters in hardware. Each BRANCH node contains two counters which record the number of branches and the total number of TRUE branches. At the end of a 10 million input-sample execution window the signature is fetched by software.

Evaluate cost of alternative configurations

A simple steady-state M/M/1/∞/FCFS queuing network model is compiled for each parameterization. In [7] we demonstrated that this model is both accurate and fast. The input parameters for each configuration are:
2. The branch probabilities. Routing matrix $Q \in \mathbb{R}^{n \times n}$, where element $Q_{ij}$ is the steady-state probability that a job completing basic block $i$ branches to block $j$.
The traffic equations (eq. 1) are solved, subject to utilization constraints (eq. 2), to determine overall performance: the maximum sustainable input rate to block one, $\gamma_1$.
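Equations 1 and 2 are not reproduced in this text, so the following sketch assumes the standard open queuing network form: traffic equations $\lambda_j = \gamma_j + \sum_i \lambda_i Q_{ij}$ with external arrivals only at block one, and a utilization constraint $\lambda_i b_i \le 1$ that takes the service rate of block $i$ to be $1/b_i$ results per cycle. Both assumptions, and the names `solve` and `max_input_rate`, are illustrative and not taken from the paper.

```python
from fractions import Fraction


def solve(A, rhs):
    """Solve A x = rhs by Gauss-Jordan elimination using exact fractions."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, rhs)]
    for col in range(n):
        # find a row with a non-zero pivot and move it into place
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [row[n] for row in M]


def max_input_rate(Q, b):
    """Largest input rate to block one (jobs per cycle) that keeps every block stable.

    Q[i][j]: probability that a job leaving block i goes to block j.
    b[i]:    initiation interval of block i in cycles per result
             (assumed service rate 1 / b[i]).
    """
    n = len(Q)
    # Visit ratios v solve (I - Q^T) v = e_1, so that lambda_i = gamma_1 * v_i.
    A = [[(1 if i == j else 0) - Fraction(Q[j][i]) for j in range(n)]
         for i in range(n)]
    v = solve(A, [1] + [0] * (n - 1))
    # Utilization of block i is gamma_1 * v_i * b_i, which must not exceed 1.
    return min(1 / (v[i] * b[i]) for i in range(n) if v[i] > 0)


# Block 0 branches to block 1 with probability 1/4 and to block 2 otherwise;
# block 1 always continues to block 2; block 2 leaves the network.
Q = [[0, Fraction(1, 4), Fraction(3, 4)],
     [0, 0, 1],
     [0, 0, 0]]
slow_branch = max_input_rate(Q, [1, 8, 1])
balanced = max_input_rate(Q, [1, 4, 1])
```

In this toy network the rarely taken block 1 can run at an initiation interval of 4 without limiting throughput, while an interval of 8 halves the sustainable input rate. This is the kind of trade-off that makes the best parameterization depend on the branch probabilities.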
We propose a partial evaluation scheme to minimize calculations at runtime. A symbolic solution is generated offline and at runtime Q is substituted in. This means that only two sets of N linear equations need be evaluated at runtime. It has been suggested that if (eq. 1) is ill-conditioned, the partial evaluation should be abandoned and a full numeric solution should be computed at runtime.

Working set prediction, reconfiguration scheduling and invoke reconfiguration

We propose a very simple combined algorithm which is based on a 1-step history. The algorithm reconfigures the device to the highest performance configuration over the previous execution window as determined by
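The one-step-history policy can be sketched as follows. This is a hedged illustration only: the helper names (`signature`, `next_configuration`, `toy_rate`) are hypothetical, the performance model is passed in as a function, and `toy_rate` is a stand-in for the queuing model rather than the model used in the paper.

```python
def signature(counters):
    """Turn per-BRANCH-node counters into branch probabilities.

    counters: list of (branches_seen, true_branches) pairs read from hardware.
    """
    return [true / seen if seen else 0.0 for seen, true in counters]


def next_configuration(current, configs, sig, predicted_rate):
    """Choose the configuration for the next window from the last window's signature.

    predicted_rate(cfg, sig) returns the modeled throughput of cfg for sig.
    The current configuration is kept unless another one is strictly better.
    """
    best = max(configs, key=lambda cfg: predicted_rate(cfg, sig))
    if predicted_rate(best, sig) > predicted_rate(current, sig):
        return best
    return current


def toy_rate(cfg, sig):
    """Stand-in model for one branch: cfg = (interval of TRUE path, interval of FALSE path)."""
    p = sig[0]
    return 1.0 / max(p * cfg[0], (1.0 - p) * cfg[1])


configs = [(1, 4), (4, 1)]
sig = signature([(10, 9)])   # 9 of the last 10 branches were TRUE
chosen = next_configuration((4, 1), configs, sig, toy_rate)
```

Because the decision uses only the previous window, the chosen configuration lags a phase change by up to one profiling window.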
PERFORMANCE ANALYSIS: SPECVIEWPERF
OpenGL is an industry standard API for real-time rendering. We implement parts of the OpenGL-like Mesa3D [9] graphics library. The top-level dataflow graph is shown in the left pane of Fig. 2. There are five basic blocks and two feedback loops with loop-carried dependencies. We use eight of the SPECviewperf [10] benchmarks for OpenGL. A combined benchmark is also examined which runs all benchmarks in sequence. All benchmarks were run at 320x200 resolution, 32-bit colour.
All experiments begin with compilation and synthesis of the phase-optimized configurations for Xilinx Virtex-E XCV1000-E. Our arithmetic library comes from Xilinx Core GENERATOR and supports initiation intervals 1, 2, 4, 8, 16 and 32 cycles per result. The compiler generates 56 different parameterizations for b, of which 54 fit the XCV1000-E.
Performance upper bound
For an arbitrary program and dataset, phase-optimization will deliver a theoretical maximum performance improvement when :
1. Reconfiguration management overhead is zero.
2. Reconfiguration time is zero.
3. Detection efficiency is one, such that the reconfiguration schedule is optimal.
Clearly, no real-world system will share these properties. However, analyzing performance under these conditions is useful as it defines a performance upper bound, below which all real-world phase-optimizing systems exist. Upper bound performance $U_1$ is determined by a posteriori analysis of the execution trellis. A trellis is computed for each benchmark using cycle-accurate simulation. The optimal reconfiguration schedule is then computed by shortest path. Two control experiments are reported. Control $C_1$ is the fastest single design with uniform rate for all blocks. Control $C_2$ is the fastest single design with multiple initiation intervals. Table 1 shows the results for these experiments. A timing constraint of 90MHz clock rate is met by all 54 designs in the spectrum of phase-optimized FPGA configurations. In total, 66 reconfigurations are made over the course of the nine benchmarks. Dynamic reconfiguration is present in four of the nine optimal reconfiguration schedules. Fig. 3 illustrates the optimal reconfiguration schedule for the combined benchmark.
The variance of speedup results for different datasets is significant and shows that phase optimization is a data-dependent optimization. Phase-optimization is only beneficial if the underlying assumption of phase behavior holds true. The flat sections of Fig. 3 illustrate that there is little phase behavior in benchmarks 1, 3, 4, 5 and 7. The optimal reconfiguration schedules for these benchmarks there-

Why is there little speedup over the best single multi-initiation interval design? Firstly, there is a lack of phase-transition behavior in the dataset. In general the SPECviewperf benchmarks do not exhibit the classical behavior of the phase-transition execution model. If an application could be found with longer or more varied phases of execution, more reconfigurations would occur in the optimal reconfiguration schedule and a greater theoretical upper bound on performance improvement would be achieved. Secondly, there is a lack of fine-grain control over design specialization. Our arithmetic library is restricted to initiation intervals that are powers of two. As a result there are large intervals of branch probability over which the same design parameterization is optimal. Greater flexibility, for example arbitrary integer initiation intervals, would permit finer-grain specialization and encourage more frequent reconfiguration.
Architectural exploration: sensitivity to reconfiguration time
This section explores the sensitivity of the optimal reconfiguration schedule to increased reconfiguration time. We begin by constructing the complete execution trellis for each benchmark. The a posteriori shortest path is then computed with parameterized reconfiguration overhead edge costs $R$. Performance is analyzed over the interval of zero-cost reconfiguration ($U_1$) up to 2 seconds per reconfiguration.

Table 2. Sensitivity of runtime phase optimization to increased reconfiguration overhead for SPECviewperf benchmarks 2, 6, 8 and 9. $R_{i,j}$ is the reconfiguration time in milliseconds. Remaining columns show the % speedup of $U_1$ over $C_2$ and the number of reconfigurations in $U_1$.

Table 2 shows how the performance of the optimal reconfiguration schedule degrades with increased reconfiguration overhead. Tolerance to high reconfiguration overhead is only possible in benchmarks which exhibit sufficient phase-transition behavior. SPECviewperf benchmarks 2, 6, and 8 exhibit limited phase behavior and are sensitive to increased reconfiguration time. Combined benchmark 9 exhibits greater phase behavior. The results for the combined benchmark show:
1. Where applicable, phase-optimization is well suited to existing architectures and complete device reconfiguration in the range of 10-50 ms. The XCV1000-E (16.466 ms) suffers a loss of speedup of only 1.35% compared to the zero-reconfiguration-time optimal schedule.
2. Where applicable, phase-optimization is surprisingly insensitive to increased reconfiguration time. Performance improvements are still possible at 1 second reconfiguration time.
3. There is surprisingly little benefit in faster reconfiguration. Techniques such as partial reconfiguration or coarse-grain reconfiguration would be ineffective.
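The sensitivity analysis amounts to recomputing the shortest path while sweeping the reconfiguration cost. The sketch below illustrates this on a toy trellis. It assumes a single uniform cost `r` for every change of configuration and breaks ties towards fewer reconfigurations. The function name and the trellis values are hypothetical and unrelated to the measured results in Table 2.

```python
def best_path(T, r):
    """Cheapest trellis path when every change of configuration costs r.

    T[i][j] is the cost of step j in configuration i.
    Returns (total_cost, number_of_reconfigurations); among equally cheap
    paths, the one with fewer reconfigurations is reported.
    """
    c = len(T)
    # state[i] = (cost, reconfigurations) of the best path ending in config i
    state = [(T[i][0], 0) for i in range(c)]
    for j in range(1, len(T[0])):
        state = [
            min((cost + (r if p != i else 0) + T[i][j], n + (p != i))
                for p, (cost, n) in enumerate(state))
            for i in range(c)
        ]
    return min(state)


# Three phases of four steps each: configuration 0 suits the first and last
# phase, configuration 1 suits the middle phase.
T = [[1] * 4 + [3] * 4 + [1] * 4,
     [3] * 4 + [1] * 4 + [3] * 4]
sweep = [(r, best_path(T, r)) for r in (0, 2, 4, 8)]
```

In this toy case the two reconfigurations save 8 cost units in total, so they remain in the optimal schedule while `r` is below 4 and disappear once reconfiguration costs as much as it saves. The longer a phase lasts, the larger the reconfiguration time that can be tolerated, which is consistent with the observation that only benchmarks with sufficient phase behavior tolerate high overhead.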
Prototyping board experiments
This section describes experiments using the RC1000-PP board. All designs run at the maximum memory clock rate 25MHz. Four control experiments were conducted using techniques described in Section 4.1.
$C_1$ The fastest uniform rate single design.
$C_2$ The fastest multiple initiation interval single design.
$C_3$ The optimal reconfiguration schedule using designs with multiple initiation intervals. Zero reconfiguration time.
$C_4$ The optimal reconfiguration schedule using designs with multiple initiation intervals. Reconfiguration time 16.4688ms for the XCV1000-E on the RC1000-PP.

$I$ is the full phase-optimization system. Tables 3 and 4 show performance results.

Table 4. SPECviewperf OpenGL benchmarks running on RC1000-PP. Table shows the percentage speedup of the implemented system $I$ over the control experiments.

Table 4 shows overall experimental speedup. Column three shows the percentage speedup of the experimental runtime phase-optimizing system $I$ versus the best single configuration with uniform initiation interval. Speedup ranges from 54.86% for benchmark one to 175.61% for benchmark eight. For the combined benchmark a speedup of 95.39% is achieved. Column four shows that for the combined benchmark, an 11.13% speed improvement is made over the best possible single configuration. Columns five and six show that $I$ is 4.77% slower than the optimal possible reconfiguration strategy for the XCV1000-E.

Table 3 lists the configurations used and the number of reconfigurations during execution. The results indicate that the experimental reconfiguration management system performs well. The number of reconfigurations used in $I$ correlates well with the optimal reconfiguration schedule for XCV1000-E, control study $C_4$. $I$ also uses the same configurations as $C_4$, with configuration 6 used on startup.
Quality of phase-detection scheme
The final column of Table 3 shows the mean phase change miss distance of experimental schedule $I$ compared to $C_4$. Our phase detection algorithm exhibits high detection efficiency: it schedules each reconfiguration on average less than half a profiling sample away from the optimum reconfiguration schedule.
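The text does not define the miss distance precisely, so the sketch below assumes one plausible reading: for each reconfiguration in the optimal schedule, measure the distance in profiling windows to the nearest reconfiguration in the implemented schedule, then take the mean. The function names and the example schedules are hypothetical.

```python
def change_points(schedule):
    """Indices of the windows at which the configuration changes."""
    return [j for j in range(1, len(schedule)) if schedule[j] != schedule[j - 1]]


def mean_miss_distance(implemented, optimal):
    """Mean distance, in windows, from each optimal reconfiguration to the
    nearest implemented one. Returns None if either schedule never reconfigures."""
    ref = change_points(optimal)
    got = change_points(implemented)
    if not ref or not got:
        return None
    return sum(min(abs(r - g) for g in got) for r in ref) / len(ref)


# The implemented schedule reacts one window late to the first phase change
# and on time to the second.
optimal = [0, 0, 0, 1, 1, 1, 0, 0]
implemented = [0, 0, 0, 0, 1, 1, 0, 0]
distance = mean_miss_distance(implemented, optimal)
```

Under this reading, a policy based on a one-window history can be expected to miss each phase change by at most about one window, so a mean below half a profiling sample would indicate that most reconfigurations coincide with the optimum.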
CONCLUSIONS
The key contribution of this work is a system of phase-optimization which comprises (1) a hardware compilation scheme for generating configurations that are specialized for different phases of execution and (2) a runtime system which manages interchange of these configurations to maintain specialization across phase transitions. We provide an experimental implementation for Xilinx Virtex FPGAs and demonstrate 95.39% speedup over an optimized uniform rate static design and 11.13% speedup over an optimized multi-initiation interval static design.
We characterize the zero-reconfiguration time upper bound on performance and explore the sensitivity of the proposed system to increased reconfiguration time. The upper bound for XCV1000-E at 90MHz is 16.72% over the best possible single configuration. Performance degrades gracefully as reconfiguration time is increased. The optimization is shown to be beneficial in the 10-50ms reconfiguration time region exhibited by modern FPGAs and is surprisingly insensitive to increased reconfiguration time. Performance improvements are possible up to 1 second reconfiguration time and there is little benefit in faster reconfiguration.
We analyze the quality of the proposed reconfiguration management system. The runtime system is shown to be extremely lightweight: only two sets of linear equations need be evaluated at each timestep. The overall performance of the system is only 4.77% slower than the optimal possible reconfiguration strategy for the XCV1000-E and the phase detection algorithm is shown to exhibit a very high detection efficiency.
There are several possible directions for future work. The most pressing requirement is to build a flexible arithmetic library which targets Virtex IV to address issues raised in Section 4.1. Our compilation scheme generates a globally asynchronous locally synchronous design. Greater specialization will be sought in multiple-clock domain configurations, in effect enabling more flexible selection of initiation intervals. There is also significant scope for improving management of reconfiguration. Our queuing network model would be improved by attempting to model burstiness. More sophisticated phase change detection algorithms such as the Signature-Based Reconfiguration Algorithm or Rochester Algorithm [3] will also be investigated. Finally, modern FPGAs are capable of self-reconfiguration [11], inviting the possibility of phase-optimizing systems-on-chip.
