Abstract. This paper explores using information about program branch probabilities to optimise reconfigurable designs. The basic premise is to promote utilization by dedicating more resources to branches which execute more frequently. A hardware compilation system has been developed for producing designs which are optimised for different branch probabilities. We propose an analytical queueing network performance model to determine the best design from observed branch probability information. The branch optimisation space is characterized in an experimental study, on Xilinx Virtex FPGAs, of two complex applications: video feature extraction and progressive refinement radiosity. For designs of equal performance, branch-optimised designs require 24% and 27.5% less area. For designs of equal area, branch-optimised designs run up to 3 times faster. Our analytical performance model is shown to be highly accurate, with relative error between 0.12 and […].
Introduction
For most computer programs, the execution frequency of each basic block is controlled by the runtime behavior of conditional branches. Optimal resource allocation between basic blocks requires that execution frequencies be known. Software profilers collect execution frequencies for a representative dataset to support static resource allocation. Microprocessors demonstrate that branch probability information can be used at runtime to aid dynamic resource allocation. In this paper we explore the analogous use of branch probability information to optimize resource allocation in hardware compilation. In particular, the novel aspects of our work include:
- a compiler that maps programs written in a subset of C to a set of hardware designs that are optimised for different branching probabilities;
- analytical methods, including a queueing network model, for elucidating the properties of the proposed compilation procedure;
- evaluation of our approach based on both analytical and experimental methods for two large applications: video feature extraction and radiosity.
The rest of the paper is organised as follows. Section 2 describes the two basic compilation phases: dependency analysis and circuit synthesis. Section 3 then presents the branch-optimised compilation path. Section 4 deals with the models for studying the analytical properties of this compilation procedure, and is followed by Section 5 which evaluates both analytically and experimentally the effectiveness of the proposed approach. Finally, Section 7 summarises our current and future research.
Our compilation procedure consists of two phases: dependency analysis and circuit synthesis. The input language for the compiler is a streaming subset of the C language in which arbitrary pointers and loop-carried dependencies are not supported. Each input program specifies the body of a single loop, with flow control specified by an if..then..else branch construct. These restrictions preclude certain types of program, such as the Fibonacci generator. However, an extensive set of applications can be automatically transformed [10] into this form. A simple example program, shown on the left of Fig. 1, will be used to illustrate our compilation procedure in the following sections. The dependency analysis phase constructs a two-level data flow graph from the input program. The data flow graph for our simple example program is shown on the right of Fig. 1. It includes a numbered directed acyclic graph (DAG) for each basic block. Flow control between DAGs is represented by BRANCH and MERGE nodes with firing rule semantics as described in the data flow computing literature [9]. Reads and writes to vector variables at the start and end of the data flow graph are mapped to READ and WRITE nodes.
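As a rough illustration of the two-level structure described above, the following Python sketch holds a top-level graph of READ, BRANCH, MERGE and WRITE nodes around two basic blocks, each with its own DAG of operations. The class names, fields and the example program are assumptions made for illustration; they are not taken from the compiler or from Fig. 1.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    """Lower level: a DAG of operations, stored as op -> list of predecessor ops."""
    name: str
    dag: dict = field(default_factory=dict)

    def topological_order(self):
        """Return the operations so that every op follows its predecessors."""
        order, seen = [], set()

        def visit(op):
            if op in seen:
                return
            seen.add(op)
            for pred in self.dag[op]:
                visit(pred)
            order.append(op)

        for op in self.dag:
            visit(op)
        return order


@dataclass
class FlowGraph:
    """Upper level: flow control between basic blocks."""
    edges: list = field(default_factory=list)  # (source, destination) pairs

    def successors(self, node):
        return [dst for src, dst in self.edges if src == node]


# Hypothetical loop body: y[i] = (x[i] > 0) ? x[i] * x[i] + 1 : -x[i]
true_block = BasicBlock("BB1", {"mul": [], "add": ["mul"]})
false_block = BasicBlock("BB2", {"neg": []})
graph = FlowGraph([
    ("READ", "BRANCH"),
    ("BRANCH", "BB1"), ("BRANCH", "BB2"),
    ("BB1", "MERGE"), ("BB2", "MERGE"),
    ("MERGE", "WRITE"),
])
```

Keeping the flow-control level separate from the per-block DAGs means each basic block can be scheduled on its own, which is the property the later compilation paths rely on.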
The circuit synthesis phase transforms the dependency graph into a unidirectional pipeline captured in structural VHDL. It consists of module selection, scheduling, binding and instantiation of appropriate flow control circuits. The initiation interval of a library block is the number of cycles between successive outputs. An XML library block database specifies the initiation interval, latency in cycles, and area of the available library blocks. A static pipelined list scheduler [2] is provided for basic block scheduling.
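Module selection against such a library can be pictured as follows. This is a minimal sketch assuming a policy of picking the smallest block whose initiation interval does not exceed the target; the policy, the field names and every number in the library are invented for illustration and do not come from the actual XML database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryBlock:
    operation: str
    initiation_interval: int  # cycles between successive outputs
    latency: int              # cycles from input to output
    area: int                 # slices (made-up values)


# Hypothetical library: slower multipliers are assumed to be smaller.
LIBRARY = [
    LibraryBlock("mul", 1, 4, 600),
    LibraryBlock("mul", 2, 6, 340),
    LibraryBlock("mul", 4, 10, 190),
    LibraryBlock("add", 1, 1, 16),
]


def select_block(operation, target_ii, library=LIBRARY):
    """Pick the smallest block that is at least as fast as the target interval."""
    candidates = [
        block for block in library
        if block.operation == operation and block.initiation_interval <= target_ii
    ]
    if not candidates:
        raise ValueError(f"no {operation!r} block meets initiation interval {target_ii}")
    return min(candidates, key=lambda block: block.area)
```

For example, `select_block("mul", 3)` returns the interval-2 multiplier, since the interval-4 block would be too slow and the interval-1 block is larger than necessary.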
Circuit synthesis is specialized to form two compilation paths: a control study compilation path and a branch-optimised compilation path. The control study compilation path is inspired by the StReAm [4] compiler. It creates pipelines which perform equally well under all branching conditions. Designs are parameterised with a global initiation interval parameter. The control study compilation path circuit with […]
Branch-Optimised Compilation Path
In this section we introduce a new compilation scheme which promotes efficiency in the presence of branch probability information. A branch-optimised circuit for the simple example program is shown on the right of Fig. 2. The branch-optimised compilation scheme transforms the data flow graph into a set of hardware configurations in which different basic blocks run at different initiation intervals. From this set, a configuration can be chosen in which the resources assigned to different branches match the observed computational load. The branch-optimised compilation scheme creates designs with the following characteristics.
4. Each basic block propagates a ready signal back up through the pipeline, shown as a dashed line in Fig. 2. The ready signal allows basic blocks to stall incoming computation when input queues are full. For each basic block, the incoming ready signal fans out to the clock-enable input of all registers in the datapath. In our current implementation we adopt a fully synchronous design style. However, a globally asynchronous locally synchronous (GALS) design style could potentially be adopted, in which each basic block operates in a separate clock domain and the ready signal is replaced with true asynchronous handshaking.
5. The BRANCH node routes sequencing tokens and data to the branch target specified by the branch condition. It receives ready signals from the two branch targets, and blocks computation if the branch target set by the branch condition is not ready.
6. The MERGE node forwards data and sequencing tokens from the true and false branch targets. If sequencing tokens arrive from both branch targets simultaneously, the MERGE node blocks the branch targets alternately in a round robin fashion.
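The BRANCH and MERGE behaviour described above can be sketched at cycle level in Python. This is a behavioural illustration only, not the generated VHDL; the names are hypothetical, and two details are assumptions: the true branch wins the first conflict, and an uncontested grant also counts as that branch's turn.

```python
def branch_route(condition, true_ready, false_ready):
    """Return the target that receives the token, or None if BRANCH must block."""
    target = "true" if condition else "false"
    ready = true_ready if condition else false_ready
    return target if ready else None


class MergeNode:
    """Forwards tokens from two branch targets, alternating when both arrive."""

    def __init__(self):
        # Assumption: start as if "false" was served last, so "true" goes first.
        self.last_granted = "false"

    def arbitrate(self, true_valid, false_valid):
        """Return the branch forwarded this cycle, or None if no token arrived."""
        if true_valid and false_valid:
            # Conflict: serve whichever side was not served last time.
            grant = "false" if self.last_granted == "true" else "true"
        elif true_valid:
            grant = "true"
        elif false_valid:
            grant = "false"
        else:
            return None
        self.last_granted = grant
        return grant
```

In a conflict cycle the side that is not granted is the one that is blocked, so over a run of conflicts each branch target is stalled every other cycle.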
Analytical Modeling
In this section we describe analytical models of the area-throughput design space for the control study and branch-optimised compilation paths. These models are used to determine the best compilation path and parameterization from observed branch probability information. In the experimental study presented in Section 5, branch probability information is collected at compile time by profiling. In a future system, branch probability information could be collected and acted upon at runtime. Analytical techniques are of increasing importance, as severe time constraints on the optimisation process would almost certainly preclude more complex modelling. We model the cycle count throughput of branch-optimised designs using a queueing network model. Branch-optimised designs introduce finite queue lengths, blocking, and the possibility of correlated arrival rates. Queueing networks which model these properties are generally solved by simulation [7]. We adopt a simple analytical model based on a queueing network with saturating external arrivals to node one [5]. Given information about steady state branch probabilities, the known variables in the model include the routing probabilities between nodes, which are filled with the known branch probability information.
To estimate performance, we determine the maximum sustainable external arrival rate to node one. In the model, external arrival rates are captured in […]
2. Determine the maximum arrival rate at node one given that the utilization of each node is less than or equal to one. In the model, the utilization of each node is an element in […]. We maximize the arrival rate at node one subject to the utilization constraint (eq. 3).
Any design with […] will exhibit steady state blocking.
The control study compilation path is parameterised with the global initiation interval […] for all branching probabilities.
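The maximisation described above can be sketched in Python under two assumptions that are not stated in the text: that the network is feed-forward with nodes numbered in topological order, and that the service rate of each node is the reciprocal of its initiation interval. Finite queues and blocking are ignored. The function names and the four-node example are illustrative only and do not reproduce the paper's equations.

```python
def visit_ratios(routing):
    """Arrival rate at each node per unit of external arrivals to node 0.

    routing[i][j] is the probability that a token leaving node i goes to node j.
    Nodes are assumed to be numbered in topological order (feed-forward pipeline).
    """
    n = len(routing)
    visits = [0.0] * n
    visits[0] = 1.0
    for j in range(1, n):
        visits[j] = sum(visits[i] * routing[i][j] for i in range(j))
    return visits


def max_arrival_rate(routing, initiation_intervals):
    """Largest external arrival rate (tokens per cycle) keeping utilization <= 1.

    Utilization of node j is rate * visits[j] * initiation_intervals[j],
    assuming a service rate of 1 / initiation interval.
    """
    visits = visit_ratios(routing)
    limits = [
        1.0 / (interval * visit)
        for interval, visit in zip(initiation_intervals, visits)
        if visit > 0
    ]
    return min(limits)


def single_branch_example(p_true, ii_true, ii_false):
    """Hypothetical network: 0 = BRANCH, 1 = true block, 2 = false block, 3 = MERGE."""
    routing = [
        [0.0, p_true, 1.0 - p_true, 0.0],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    return max_arrival_rate(routing, [1, ii_true, ii_false, 1])
```

With a true-branch probability of 0.25, a true block running at initiation interval 4 still sustains one token per cycle in this sketch, whereas at probability 0.5 the same block limits throughput to 0.5 tokens per cycle. This is the intuition behind assigning fewer resources to rarely taken branches.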
Case Studies
In this section we compare the performance of both compilation paths and evaluate the accuracy of analytical models for two case study applications. The input programs and their corresponding top-level data flow graphs for the case study applications are shown in Fig. 4 and Fig. 5. The test scenes are shown in Fig. 3.
Video feature extraction. The algorithm [11] consists of edge detection, thresholding and 3x3 sum-squared difference. There are four basic blocks and one branch.
Progressive refinement radiosity. Radiosity algorithms [8] simulate radiation of energy between surfaces. There are ten basic blocks guarded by three branches. 
Results
For the purposes of the experiments, all designs have a uniform word length of 32 bits. All results use the Xilinx XCV3200E-8 device. Arithmetic library blocks are generated using Xilinx Core GENERATOR 5.1.02i, with […]. The analytical and experimental results for both compilation paths and case studies are shown in Fig. 6 and Tables 1, 2, 3 and 4. Fig. 7 illustrates the effects of different probabilities on the performance of both compilation paths for the video feature extraction case study. The key results of the experimental study are as follows.
1. The branch-optimised compilation path automatically identifies the basic blocks that can benefit from branch probability information and produces designs with different parameterizations of b, the initiation interval vector. For the video feature extraction application, the compiler identifies basic block 2 and produces 10 different designs; for the progressive refinement radiosity application, the compiler identifies basic blocks 1, 2, 3 and 4 and produces 35 designs.
2. For a given area, branch-optimised designs can often run significantly faster than non-branch-optimised designs. In Fig. 6, for instance, a branch-optimised design smaller than EC4 (7514 slices) runs at 15.61 ns/pixel, about 3.2 times faster than EC4 at 49.86 ns/pixel. Similarly, while RB1 and RB2 are respectively 22.6% and 13.5% larger than RC4, they run 322% and 162% faster than RC4.
3. For a given performance, branch-optimised designs often require smaller areas than non-branch-optimised designs. In Fig. 6, for instance, at 64 Mpixels/sec EB1 is 24% smaller than EC1 and at 32 Mpixels/sec EB2 is 18% smaller than EC2. Similarly, at 70 Mray-triangle intersections per second RB1 is 27.5% smaller than RC1, while at 35 Mray-triangle intersections per second RB2 is 27.5% smaller than RC2.
4. The analytical performance model is shown to be accurate. For video feature extraction, the relative error varies between 0.12 and […]; for progressive refinement radiosity, the worst case relative error is smaller than […].
5. As the probability of a branch tends towards zero or one, the branch becomes more biased and branch-optimised compilation becomes more attractive. Fig. 7 shows that for video feature extraction, branch-optimised compilation is favourable if branch probability […]

Table 1. Complete area-throughput design space with control study compilation path for video feature extraction case study, with input scenes shown in Fig. 3. Columns include Area, Clk and Pixel Time. Designs EC1, EC2 and EC4 are the smallest control study compilation path designs which meet performance constraints 64Mpixel/sec, 32Mpixel/sec and 16Mpixel/sec. These designs are labeled in Fig. 6.
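The pixel times quoted in the results can be related to the Mpixel/sec performance constraints with a short calculation. The sketch below uses only the 15.61 ns/pixel and 49.86 ns/pixel figures from the text; the function names are illustrative.

```python
def mpixels_per_second(ns_per_pixel):
    """Convert a pixel time in nanoseconds to throughput in Mpixels/sec."""
    return 1e3 / ns_per_pixel


def speedup(slow_ns_per_pixel, fast_ns_per_pixel):
    """Ratio of pixel times, i.e. how many times faster the second design runs."""
    return slow_ns_per_pixel / fast_ns_per_pixel


branch_optimised_ns = 15.61  # branch-optimised design quoted in the results
control_study_ns = 49.86     # EC4

print(round(mpixels_per_second(branch_optimised_ns), 2))         # 64.06
print(round(mpixels_per_second(control_study_ns), 2))            # 20.06
print(round(speedup(control_study_ns, branch_optimised_ns), 2))  # 3.19
```

The faster design therefore just meets the 64 Mpixel/sec constraint, EC4 comfortably meets 16 Mpixel/sec, and the ratio between the two pixel times is a little under 3.2.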
Conclusion
This paper explores using branch probability information to optimise hardware compilation. We demonstrate that this technique can result in significant improvements in area and performance. Future work will focus on extending the analytical model and compilation system. In the long term we intend to develop a dynamically reconfigurable system in which branch optimisation techniques are applied at runtime.

Fig. 7. […] for branch-optimised and control study compilation paths. Video feature extraction case study application is shown with performance constraint of 64Mpixels/sec. The observed probability of 0.0891 is indicated with a vertical line through the graph. EC1 and EB1 correspond to the optimal designs for each compilation path as shown in Table 1 and Table 2. The trend line for branch-optimised compilation path with different probabilities is produced using our analytical model. The intersection of trend lines for branch-optimised compilation path and control study compilation path shows that branch-optimised compilation is favourable when […]. As the probability […]

Table 2. Selected area-throughput results for branch-optimised compilation path in the video feature extraction case study with input scene shown in Fig. 3. 10 designs are automatically generated. Designs EB1 and EB2 are the smallest branch-optimised designs which meet performance constraint 64Mpixel/sec and 32Mpixel/sec. Clock period for both designs is 13.96ns. […] can be calculated as […]. EB1 and EB2 are labeled in Fig. 6.
