Abstract. Hardware realization of kernel loops holds the promise of accelerating overall application performance and is therefore an important part of the synthesis process. In this paper, we consider two important loop optimization techniques, namely loop unrolling and software pipelining, that can impact the performance and cost of the synthesized hardware. We propose a novel model that accounts for various characteristics of a loop, including dependencies, parallelism and resource requirements, as well as certain high-level constraints of the implementation platform. Using this model, we are able to deduce the optimal unroll factor and technique for achieving the best performance given a fixed resource budget. The model was verified using a compiler-based FPGA synthesis framework on a number of kernel loops. We believe that our model is general and applicable to other synthesis frameworks, and will help reduce the time for design space exploration.
Introduction
A standard practice in the synthesis of application-specific hardware is to focus attention on kernel loops. In many applications, they account for the bulk of the execution time and are thus natural candidates for hardware acceleration. A key difficulty in synthesizing hardware for kernel loops is that there are many loop optimizations available, and the complex interactions among these optimizations make it difficult to predict the cost-benefit of applying each. In particular, one cannot tell how many more or fewer resources a particular optimization will take or what its impact will be on performance. This means that one has to either settle for sub-optimal results or go through a costly process of trial and error in order to arrive at the combination of loop optimizations that fits the needs of the user. Having a model of how a particular loop optimization will impact resources and performance is therefore necessary.
Two important loop optimizations applicable to kernel loops are loop unrolling and software pipelining. Loop unrolling is a technique to expand the loop such that a new iteration consists of two or more of the original iterations. This is performed by a compiler to expose more instruction-level parallelism and reduce the overhead of updating index variables. The number of times the loop is expanded is called the unroll factor. If the loop iteration count is not a multiple of the unroll factor, then the remaining iterations must be executed separately at the end in their original form.
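As a minimal illustration of the transformation described above, the following Python sketch unrolls a dot-product loop by a factor of four and handles the leftover iterations in a separate remainder loop. The function names and the choice of a dot product are assumptions made for illustration only; they are not taken from the benchmarks used in this paper.

```python
def dot_rolled(a, b):
    """Original loop: one multiply-accumulate per iteration."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Same loop unrolled with an unroll factor of 4 (illustrative)."""
    n = len(a)
    main = n - n % 4  # largest multiple of 4 that is <= n
    total = 0
    # Unrolled body: each new iteration covers four original iterations,
    # so the index update and loop test run once per four products.
    for i in range(0, main, 4):
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
    # Remainder loop: the leftover n % 4 iterations run in original form.
    for i in range(main, n):
        total += a[i] * b[i]
    return total
```

With ten elements, the unrolled body runs twice and the remainder loop handles the last two iterations.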
Software pipelining [1] tries to achieve a higher level of instruction-level parallelism by moving operations across iteration boundaries. This optimization achieves overlap among the iterations by pipelining their execution. The loop body is scheduled such that (a) all iterations have an identical schedule and (b) each iteration is scheduled to start some fixed number of cycles later than the previous iteration. The delay between the start cycles of two successive iterations is called the Initiation Interval (II). The modulo scheduling algorithm attempts to achieve the smallest value of II such that no intra- or inter-iteration dependencies and resource constraints are violated.
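The effect of the initiation interval can be seen in a small sketch. The Python code below computes the start cycle of each iteration in an idealized software pipeline and compares the total cycle count with purely sequential execution. It assumes no setup overhead and an unlimited number of iterations in flight; the function names are hypothetical.

```python
def pipelined_schedule(n_iters, sched_len, ii):
    """Idealized software pipeline: iteration i starts at cycle i * ii.

    Returns (start_cycles, total_cycles). Overheads such as register
    setup are deliberately ignored in this sketch.
    """
    if n_iters == 0:
        return [], 0
    starts = [i * ii for i in range(n_iters)]
    # The last iteration starts at (n_iters - 1) * ii and runs sched_len cycles.
    total = starts[-1] + sched_len
    return starts, total


def sequential_cycles(n_iters, sched_len):
    """Without overlap, every iteration waits for the previous one to finish."""
    return n_iters * sched_len
```

For five iterations with a schedule length of 4 and an II of 1, the pipelined loop finishes in 8 cycles, whereas sequential execution needs 20.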
As multiple iterations are executed in parallel, both loop unrolling and software pipelining increase register pressure and resource requirements, but in different ways. Furthermore, the two can be used in combination, i.e. unrolled loops can be software pipelined. The complex interaction between the two optimizations makes it difficult to decide how they should be deployed in optimizing a loop under a particular resource constraint. Often the only way to tell is to exhaustively try various combinations of these two optimizations to find the best one.
In this paper, we propose a model for the performance and resource requirements of the hardware realization of unrolled and software pipelined loops. The novelty of our model lies in the use of the compiler to extract certain key parameters of the loop in question. These parameters characterize the code, including its data dependences, as scheduled for a given hardware platform. For example, the platform we use allows at most four parallel reads to memory, and only if they do not hit the same memory bank. Such characteristics are hard to model analytically, so we rely instead on the instruction scheduler of a compiler to capture them. From the parameters reported by the compiler, the model informs the user whether, given a certain resource constraint, unrolling alone or software pipelining used in combination with loop unrolling would deliver the better performance. It also outputs the optimal unroll factor that should be used. The contribution of this model is that, without exhaustively trying a large number of possibilities, it can very quickly recommend a solution that we believe is optimal or very near it.
Related Work
Hardware realization of kernel loops has been actively studied by many research groups. However, the focus has been mainly on automatic synthesis of kernel loops from high-level language constructs. The exploitation of compiler optimizations such as loop unrolling and modulo scheduling has remained largely unexplored. Even the few commercial synthesis tools that apply these compiler optimizations depend on user feedback to choose the unroll factor or to decide between unrolling and modulo scheduling. Our work bridges this gap in automatic hardware realization of kernel loops.
There are two main approaches towards hardware synthesis from high-level constructs. One approach is to design new languages for hardware design that are at a much higher level than traditional hardware description languages such as Verilog and VHDL. The claim is that the productivity gap will be reduced as software programmers can easily learn these new languages. An example is the Handel-C [2] programming language, which has C-like syntax with support for explicit hardware parallelism, communication, and hardware structures such as memories and buses.
The other approach attempts to map a subset of commonly used software programming languages such as C to hardware automatically. These efforts include SA-C [3], the PipeRench C compiler [4], the Garp C compiler [5], and the work by Weinhardt et al. [6, 7], Babb et al. [8] and Snider et al. [9]. The PACT project [10] at Northwestern University performs C to hardware synthesis by taking power/performance trade-offs into account. The PICO project [11, 12] performs static timing analysis to identify chains of operators so as to minimize the number of cycles while maintaining cycle time constraints.
The only existing tool that allows application of high-level compiler optimizations in hardware synthesis is Monet [13]. However, it requires user feedback in deciding, for example, the unroll factor. Among research projects, Derien et al. [14] have developed an analytical model to choose a tiling strategy that will minimize loop execution time. The closest to our work is So et al. [15]. They perform fast and automatic design space exploration to choose the loop unroll factor that satisfies the area constraints and maximizes performance. However, they do not use other compiler optimizations such as software pipelining, which can potentially improve the performance significantly.
Our Model
In this section, we will present our proposed model. The novelty of the model lies in the use of key parameters supplied by the compiler in characterizing aspects of the kernel loop as well as the machine that are hard to model correctly.
Model for Performance
For the discussion below, we assume a loop L that is executed N times. Let $S_1$ be the schedule length of the loop. In our model, $S_1$ is a quantity reported by the compiler as it performs instruction scheduling. As we are realizing the loop in hardware, we assume infinite registers and skip the traditional register allocation phase. The quantity $S_1$ encapsulates various complex issues such as the machine's configuration, instruction type distribution and data dependencies. The user can, for example, use the machine configuration to constrain the amount of parallelism or the number and types of functional units to be realized in hardware. We also generalize $S_1$ to $S_u$, the schedule length of the kernel when it is unrolled u times. The following formula gives the total number of cycles the unrolled kernel takes to execute N iterations.
After unrolling, the loop executes $\lfloor N/u \rfloor$ iterations and the schedule length is $S_u$. Therefore the first term in Eq. 1 accounts for the total number of cycles executed by the unrolled loop. However, if N is not divisible by u, a compensation loop of $N - \lfloor N/u \rfloor \times u$ iterations with a schedule length of $S_1$ is generated. In practice, we do not want to obtain every $S_u$ from the compiler, as that requires multiple runs. Rather, we estimate $S_u$ given $S_1$. In particular, we assume that
where $c_S$ is a constant. Based on the experience gained from our experiments, we chose
This is because we found that empty resource slots available at the end of the instruction schedule can sometimes be filled by a new instance of the loop.
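The two terms described for the unrolled cycle count can be sketched in a few lines of Python. This is only an illustration of the description above: the schedule lengths are passed in as parameters (in the model they come from the compiler or from the estimate of $S_u$), and the function and parameter names are hypothetical.

```python
def unrolled_cycles(n, u, s_u, s_1):
    """Cycle count of a loop of n iterations unrolled u times.

    s_u is the schedule length of the unrolled body and s_1 that of the
    original body; both are assumed to be supplied by the caller.
    """
    main_iters = n // u              # iterations of the unrolled loop
    leftover = n - main_iters * u    # iterations of the compensation loop
    return main_iters * s_u + leftover * s_1
```

For example, with N = 10, u = 4, an unrolled schedule length of 10 and an original schedule length of 4, the unrolled loop contributes 2 × 10 cycles and the compensation loop 2 × 4 cycles, for a total of 28.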
To model software pipelining, we assume the technique of iterative modulo scheduling given by Rau [1], which uses predicated execution and rotating registers [16]. It is characterized by two important parameters also obtained from the compiler: the initiation interval, II, and the epilog count, e. The initiation interval is the gap (in machine cycles) between two successive software pipelined iterations. In effect, after a successful modulo scheduling, each iteration of the software pipelined kernel loop takes exactly II cycles. The epilog count is the number of iterations in the epilog of the software pipelined loop. Again, II and e hide away the complexity of machine configuration, resource requirements, and data dependencies. Since we would like to combine software pipelining with unrolling, we introduce $II_u$ and $e_u$, which are the II and e for a software pipelined loop that has been unrolled u times. We have the following formula for the total number of cycles taken by a software pipelined loop that has previously been unrolled u times:
A constant of 1 is added to $II_u$ because at the end of each iteration, it is necessary to shift the contents of the rotating registers so as to prepare for the next iteration. These shifts can be done in parallel in hardware and thus cost one cycle. The constant of 3 is needed because our scheme requires one clock cycle at the beginning of the loop to set up the rotating registers, another clock cycle to initialize the loop and epilog counters, and one more at the end of the loop to copy out the contents of the rotating registers. $S_u$ is obtained from Eq. 2. As is the case for $S_u$, we do not redo modulo scheduling over all possible values of u to obtain $II_u$ and $e_u$. Given a machine configuration, M, and a loop, L, the following holds:
where $c_{II}$ is dependent on M and L. However, we also found that the simple recurrence relation for $II_u$ does not necessarily hold all the way down to an unroll factor of 1. In particular, for software pipelining, if there are sufficient resources, then $II_i = II_{i-1}$ and the recurrence is not established until resource over-subscription comes into play. In our experiments, we used a machine that has only four memory ports but otherwise has unlimited resources. The memory port limit reflects the limitation of the FPGA board that we are using. We used the following strategy: we perform software pipelining with
$e_u$ can be derived from $S_u$ and $II_u$ through Eq. 5. This relationship is apparent once we see the idealized diagram for software pipelining shown in Fig. 1. In this example, $S_u = 4$ and $II_u = 1$, giving $e_u = 3$. Since $S_u > II_u$, $e_u \ge 1$.
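The following Python sketch illustrates how the quantities above might fit together. Two points are assumptions rather than statements of the formulas in this paper. First, the epilog count is computed with the standard modulo scheduling relation, number of stages minus one, i.e. $\lceil S_u / II_u \rceil - 1$, which agrees with the Fig. 1 example. Second, the cycle count is one plausible composition of the terms described in the text (a cost of $II_u + 1$ per kernel iteration, a fixed overhead of 3 cycles, and a compensation loop for leftover iterations). All names are hypothetical.

```python
def epilog_count(s_u, ii_u):
    """Assumed relation: number of pipeline stages minus one."""
    stages = -(-s_u // ii_u)  # integer ceiling of s_u / ii_u
    return stages - 1


def swp_cycles(n, u, ii_u, e_u, s_1):
    """Assumed composition of the software pipelined cycle count.

    The kernel is taken to run floor(n / u) + e_u times, each costing
    ii_u + 1 cycles (the extra cycle is the rotating-register shift),
    plus 3 cycles of setup and copy-out, plus a compensation loop of
    schedule length s_1 for iterations left over after unrolling.
    """
    kernel_iters = n // u + e_u
    leftover = n - (n // u) * u
    return kernel_iters * (ii_u + 1) + 3 + leftover * s_1
```

With $S_u = 4$ and $II_u = 1$, `epilog_count` returns 3, matching the example above.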
Estimating FPGA Frequencies. The total running time of an implementation of a loop in an FPGA is given by the product of the number of cycles it takes to execute the code and the clock period, i.e. the reciprocal of the frequency at which the realized design can operate safely. It turns out that it is difficult to use static compiler information to obtain an accurate model of the final realizable frequency. To overcome this problem, we use the following strategy. We run place and route for three instances of the loop, namely the loop unrolled two, three, and four times. These three runs are also used in our resource estimation process described in the next section. Let the actual frequencies obtained from the three runs be $f_l(2)$, $f_l(3)$ and $f_l(4)$, respectively, where l is either 'unrolled' or 'swp'. We set the predicted frequency as follows:
Using these equations, we can finally approximate the time taken to execute the realized design to be
Model for Resource Usage
While we can easily count the various operators emitted by the compiler, optimizations further down the synthesis chain, in particular the place and route pass, introduce nontrivial relationships between the high-level hardware description that our compiler outputs and the final resource usage. From experimental results, we found this to be especially true for software pipelined loops. From the same three place and route runs used to obtain the frequencies, we also derive a resource consumption model by means of linear regression. In particular, for a machine M and loop L, we model resource usage as:
where $m_{unrolled}$, $c_{unrolled}$, $m_{swp}$, and $c_{swp}$ are constants obtained from the linear regression.
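The regression step can be sketched as an ordinary least-squares line fit through the three measured points (u = 2, 3, 4), followed by extrapolation to other unroll factors. The code below is a minimal pure-Python sketch; the function names and any resource figures used with it are hypothetical, not measurements from this paper.

```python
def fit_line(points):
    """Least-squares fit of y = m * x + c through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx              # slope: extra resources per unit of unrolling
    c = mean_y - m * mean_x    # intercept: fixed resource cost
    return m, c


def predict_resources(m, c, u):
    """Estimated resource usage at unroll factor u under the linear model."""
    return m * u + c
```

For instance, fitting hypothetical measurements of 2500, 3500 and 4500 resource units at u = 2, 3 and 4 yields a slope of 1000 and an intercept of 500, so the estimate at u = 7 is 7500 units.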
Putting it together
The model is used as follows. The user decides on a certain amount of resource, $R_{user}$, to be used for realizing the loop in hardware. Using Equations 7 and 8, we obtain two maximal unroll factors $u_1$ and $u_2$ such that
Next we examine all unroll factors up to $u_1$ and $u_2$ to find a $u'_1 \le u_1$ and a $u'_2 \le u_2$ such that $T_{unrolled}(u'_1)$ and $T_{swp}(u'_2)$ are the respective minima. If $T_{unrolled}(u'_1) > T_{swp}(u'_2)$, then we will get better performance by using software pipelining with the loop unrolled $u'_2$ times; otherwise, unrolling alone with the loop unrolled $u'_1$ times is preferable.
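The selection procedure can be sketched as follows. The code assumes a linear resource model with slope m and intercept c, as described for the resource estimation, and takes the two time estimates as caller-supplied functions of the unroll factor. All names are hypothetical, and the time functions used in any example are made up for illustration.

```python
def max_unroll_within_budget(m, c, budget):
    """Largest integer u with m * u + c <= budget (0 if nothing fits)."""
    if m <= 0:
        raise ValueError("sketch assumes resource usage grows with u")
    return max(int((budget - c) // m), 0)


def choose_strategy(u1_max, u2_max, t_unrolled, t_swp):
    """Pick the faster strategy and its unroll factor.

    t_unrolled and t_swp map an unroll factor to an estimated time.
    All factors from 1 up to the respective maximum are examined.
    """
    if u1_max < 1 or u2_max < 1:
        raise ValueError("budget too small for at least one strategy")
    best_u1 = min(range(1, u1_max + 1), key=t_unrolled)
    best_u2 = min(range(1, u2_max + 1), key=t_swp)
    if t_unrolled(best_u1) > t_swp(best_u2):
        return "swp", best_u2
    return "unrolled", best_u1
```

Because only a handful of unroll factors fall under a realistic budget, this search is cheap compared with running place and route for every candidate.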
Compilation Framework
We used the Trimaran [17] compiler infrastructure to experiment with the model. The compiler targets a parameterized Explicitly Parallel Instruction Computing (EPIC) architecture called HPL-PD [16]. We modified the compilation framework as follows:
-An EPIC machine with infinite resources except for four memory ports was defined. The four memory ports are a constraint of the FPGA board used in our experiments. It has four banks of memory that can be accessed simultaneously, with only one access to a bank at any time. Consequently, we also had to modify the instruction and modulo schedulers of Trimaran. We assumed that an entire array is stored in a single bank. Thus any two accesses to the same array have to be performed in different machine cycles.
-Trimaran uses some heuristics to guide unrolling. Furthermore, it does not always emit compensation loops during unrolling, as these can be folded into the unrolled loop using predicated execution. For our purpose, we forced unrolling to be performed as per our requirements.
-Finally, we added a phase to generate Handel-C [18] code from Trimaran's Elcor intermediate representation. Handel-C is a C-like behavioral hardware description language. The Handel-C compiler compiles our output into an EDIF [19] file for the FPGA vendor's synthesis tools to process.
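The memory constraint in the first modification can be illustrated with a small greedy placement of memory accesses into cycles. This is not the modified Trimaran scheduler; it is a simplified sketch that ignores data dependencies and all non-memory operations, and that assumes, as in the text, that each array lives in a single bank. The function name and the array names in the example are hypothetical.

```python
def schedule_memory_accesses(arrays_accessed, ports=4):
    """Greedily assign memory accesses to machine cycles.

    arrays_accessed lists the array touched by each access, in program
    order. A cycle may hold at most `ports` accesses and at most one
    access per array (one array per bank, one access per bank per cycle).
    Returns the cycle index chosen for each access.
    """
    cycles = []       # cycles[i] is the set of arrays accessed in cycle i
    assignment = []
    for name in arrays_accessed:
        for idx, used in enumerate(cycles):
            if name not in used and len(used) < ports:
                used.add(name)
                assignment.append(idx)
                break
        else:
            # No existing cycle can take this access: open a new one.
            cycles.append({name})
            assignment.append(len(cycles) - 1)
    return assignment
```

For the access sequence a, b, a, c, the second access to array a is pushed to the next cycle because it would hit the same bank as the first, while c still fits in the first cycle.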
In the resultant design flow, we are able to utilize the advanced features used by Trimaran, including predicated execution and rotating registers, and translate them into Handel-C. From Handel-C's EDIF output, we synthesize the bitmap for a Xilinx XCV1000 FPGA and execute it on a Celoxica RC1000 board.
Results
We used six kernel loops to verify our model:
-Edge detection. A 32×32 mask is computed over a 128×128 image to detect edges.
The accuracy of our performance model is given in Table 1. The first set of columns presents the results for loop unrolling and the second set of columns presents the results for unrolling combined with software pipelining. "Est." is the predicted execution time, i.e. $T_{unrolled}(u)$ and $T_{swp}(u)$. "Act." is the actual execution time taken to execute the loop. This is computed from the actual frequency obtained after place and route and the actual number of cycles executed. "Diff T" represents the percentage difference between "Est." and "Act.", while "Diff C" represents the percentage difference in estimating $C_{unrolled}(u)$ and $C_{swp}(u)$. The average values of "Diff C" for loop unrolling and for loop unrolling with modulo scheduling are 2.84% and 2.19%, respectively. In addition, the values for $S_u$, $II_u$ and $e_u$ in Table 1 were computed using Equations 2, 4 and 5, while $S^a_u$, $II^a_u$ and $e^a_u$ were obtained from the actual compilation. The average relative errors for "Diff T" are 3.6% and 8.4%, respectively, for loop unrolling alone and software pipelining with unrolling. Given that the average relative difference between the actual execution times of the two strategies is 36%, we conclude that our performance estimation model is within the necessary margin and is accurate.
Fig. 2 shows the accuracy of our resource model. Due to space limitations, we show the results for two benchmarks only: Edge and LM1. The results for the other benchmarks are similar. "Unroll" and "SWP" show the actual resource usage due to unrolling and unrolling with software pipelining, respectively. These points are obtained from the reports of the FPGA synthesis tool. "Linear of Unroll" and "Linear of SWP" show the estimated resource usage using linear regression over u = 2, 3 and 4. As can be seen from the figures, the estimated resource usage closely follows the actual resource usage.
It seems that in most cases, unrolling alone yields better performance under the same resource constraints. However, if we set $R_{user} = 100{,}000$, then for the LM1 benchmark, the unroll factors to be used for unrolling and software pipelining are 7 and 5, respectively. Using these unroll factors, our model predicts that we should use software pipelining instead of unrolling. The actual execution time given in Table 1 confirms that our prediction is correct. Table 2 shows the various constants of Equations 7 and 8 obtained in our model. The results show that our model is fairly accurate and can significantly cut down the design space exploration time.
