Abstract
Introduction
Software pipelining for loop nests is a challenging research topic. While numerous algorithms have been proposed for single loops [2, 1, 5, 6, 10], only a few address loop nests [6, 8, 15]. They all modulo schedule a loop nest hierarchically, starting from the innermost loop and working outward to the outermost one. This approach, henceforth referred to as innermost-loop-centric modulo scheduling, naturally extends the single loop scheduling method to the multi-dimensional domain. However, the approach has two major shortcomings. First, it commits itself to the innermost loop without considering how much parallelism the other loop levels have to offer. Second, it cannot exploit the data reuse potential that may be present in the outer loops.
In [14], we introduced a resource-constrained scheduling method for software pipelining of loop nests, called Single-dimension Software Pipelining (SSP). In contrast to the traditional innermost-loop-centric approach, SSP searches the entire loop nest and chooses the most profitable loop level to software pipeline, considering both parallelism and data reuse in order to reduce the actual execution time of the loop nest. SSP retains the simplicity of the classical modulo scheduling of single loops, yet achieves significantly higher performance than the traditional innermost-loop-centric approach.
SSP has three steps: (1) Loop selection: select the loop level that may yield the best performance if software pipelining is applied to it. (2) Dependence simplification and 1-dimensional schedule construction: simplify the n-dimensional (n-D) scheduling problem to a 1-dimensional (1-D) one, and then schedule the operations. (3) Final schedule computation: map the 1-D schedule to the n-D iteration space to form a final schedule, which is semantically equivalent to the selected (serial) loop.
This paper presents a code generation scheme for the SSP method. In the context of a modern compiler framework, the scheme is shown in Fig. 1. It follows the three steps of the SSP method. First, it chooses a profitable loop from the source loop nest. The selected loop is then lowered into CGIR (Intermediate Representation for Code Generation). Second, it simplifies dependences and performs scheduling. The output is a kernel, called the intermediate kernel in the rest of this paper, that expresses the 1-D schedule for the selected loop. Lastly, the SSP code generator (the bigger dotted box in the figure) translates the intermediate kernel into target machine code. This is equivalent to the third step of SSP (final schedule computation), and it is the focus of this paper.
Figure 1. Compilation Flow
Code generation for the SSP method presents several interesting issues and this paper addresses them in an effective way. More specifically:
The intermediate kernel generated by the SSP method leads to two repetitive patterns instead of one. These patterns, referred to as the outermost and the innermost loop patterns, introduce a more challenging code generation problem than traditional software pipelining.
Because the SSP method overlaps different iterations of an outer loop, their inner loops are also overlapped. Consequently, the live ranges of a TN in different outer loop iterations overlap not only in the outer loop, but also in the inner loops. In this case, a two-level rotating register file is required to handle register renaming [13]. In the absence of such hardware, we combine dynamic register renaming (using rotating registers) and static register renaming (using code replication) to address the problem.
The code size increase in SSP schedules is more significant than in traditional software pipelining. The challenge is how to limit the code size increase while retaining the performance benefits of the SSP method.
In this paper, we discuss the code generation scheme and then target it to the IA-64 architecture. We show how to apply to loop nests the traditional hardware support for software pipelining of single loops, e.g., the Intel IA-64 hardware support (rotating registers, predication, and special operations). Initial experimental results demonstrate the feasibility and correctness of our code generation scheme. They also reveal the code quality and performance of the SSP method. Due to space limitations, this paper addresses only code generation issues, and the reader is referred to [14] for details about the SSP method.
The rest of the paper is organized as follows. Section 2 motivates our study with a simple example. Section 3 outlines our code generation method, while Section 4 presents in detail the algorithms for the IA-64 architecture. Section 5 presents extensions and optimizations to the basic method. Experimental results are reported in Section 6. Future work, related work, and concluding remarks are then presented in the remaining sections.
Motivation, Assumption, and Problem Statement

Motivating Example
Fig. 2(a) shows a perfect loop nest. Suppose the outermost loop is selected by SSP. After lowering it into an equivalent internal representation for code generation (CGIR), it becomes imperfect (see Fig. 2(b), where the for loops are shown in pseudo code for ease of understanding). Every register in the CGIR is a logical register, i.e., a Temporary Name (TN). TN{-1} refers to the instance of the TN in the next outermost loop iteration. SSP schedules this internal representation of the outermost loop, and outputs an intermediate kernel in the form shown in Fig. 2(c). The scheduling process of an imperfect loop nest is similar to that of a perfect loop nest [14]. Details are documented elsewhere [12] and in the technical memo version of [14].
As in traditional software pipelining, the register allocator maps each TN to an architectural register. One possible plan is to allocate r35 to TN1, r45 to TN2, and r40 to TN3. Fig. 2(d) shows the corresponding register-allocated kernel. Note that in this kernel, the same TN in adjacent stages, which come from adjacent outermost loop iterations, is allocated registers with successive indexes. For instance, TN1 is allocated r35, r36, r37, and r38, respectively, in each of the stages from right to left. TN1{-1} in the rightmost stage is the register that will contain the TN1 value in the next outermost loop iteration and is therefore assigned r34.
The main problem is then to generate the final executable code in a compact form from the register-allocated kernel.
Assumptions
Source Loop Nest
In this paper, we assume an n-deep (n > 1) loop nest, and that the loop selected by SSP for scheduling is the outermost loop $L_1$.
In the loop nest, $OPSET_x$ represents the set of non-branch operations at the CGIR level between the beginnings of two adjacent loops. For simplicity, we assume that there is no operation between the ends of two adjacent loops, although code generation for arbitrary loop nests is similar [12]. For the example in Fig. 2(b), $OPSET_1$ is composed of the two add operations, and $OPSET_2$ is composed of the ld4 and st4 operations.
In the following sections, we assume that $OPSET_x$ ($2 \le x \le n-1$) is empty to simplify our discussion. The code generation algorithms are then extended in Section 5.2 to the general case, where $OPSET_x$ is not necessarily empty.
Intermediate Kernel
Problem Statement
Now we state the code generation problem addressed in this paper as below:
Problem Statement: Given an intermediate kernel generated by SSP and a target architecture, generate the SSP final schedule, while reducing code size and loop control overheads.
In this paper, we propose a code generation scheme and then target it to the IA-64 architecture to make use of the available hardware support, i.e., rotating registers, predicated execution, and a specialized ISA (Instruction Set Architecture), which were originally designed for modulo scheduling of single loops, and show how to apply them to loop nests.
Figure 3. Generic Example
SSP Code Generation Overview
In this section, we present a high-level overview of the code generation scheme and explain its components, based on the repeating patterns in the SSP final schedule.
Components in a Final Schedule
Let us first identify the different components involved in the SSP final schedule. It consists of 4 separate components, which we will refer to as the prolog, the outermost loop pattern, the innermost loop pattern, and the epilog. These components are shown for our example in Fig. 4. All the operations of the kernel, up to and including $OPSET_n$, appear in the outermost loop pattern, whereas only the operations in the innermost loop, i.e. $OPSET_n$, appear in the innermost loop pattern. Note that to make the outermost loop pattern appear repetitively, ineffective operations need to be added. The ineffective operations are circled in Fig. 4. They are ineffective because their first indexes are beyond the legal range of $i_1$, the outermost loop index variable (the range is assumed to be [0, 6) in Fig. 4). For the IA-64 architecture, predicate registers will be used to make them ineffective.
Register Usage Strategy
An invariant in the loop nest can be assigned a non-rotating register by conventional register allocation techniques. In this paper, we discuss only the allocation of rotating registers (predicate, integer, and floating-point) to variables in the loop nest.
From Fig. 4 , we see that after the outermost loop pattern, control will finally transfer to the innermost loop pattern. In general, the code sequence is like that shown in Fig. 5 .
It can be seen that the outermost loop pattern is composed of $S_n$ copies of the kernel arranged in a staggered way. This reminds us of traditional modulo scheduled code. For such code, we can simply repeat the kernel $S_n$ times, with dynamic register renaming applied after each repetition. The prolog and the epilog also consist of repeated kernel copies; to simplify our discussion, however, we never treat them as repeating patterns, although they do repeat. Therefore, there are only two repeating patterns in any case.
The innermost loop pattern contains $S_n$ copies of the $S_n$ leftmost stages in the kernel. As indicated by the arrows in Fig. 5, the first copy (copy 0) is formed by simply shifting right by 1 stage the $S_n$ leftmost stages of the last kernel copy in the outermost loop pattern. The next copy (copy 1) is simply a permutation of copy 0, copy 2 is a permutation of copy 1, and so on. The permutation rotates the current copy right by 1 stage. To achieve the effect of the permutation, we have to statically rename registers in each copy, unless there is hardware support for dynamic renaming.
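The rotation of kernel copies described above can be illustrated with a small list-based model. This is only a sketch under stated assumptions: stages are represented by their indexes (with the leftmost stage being the one with index S - 1), and the function name `innermost_copies` is hypothetical and not part of the paper's algorithms.

```python
def innermost_copies(S, Sn):
    """Model the Sn kernel copies of the innermost loop pattern.

    Each copy is a list of stage indexes, written left to right.
    Copy 0 holds the Sn leftmost stages; every later copy is the
    previous one rotated right by one stage.
    """
    copy = list(range(S - 1, S - Sn - 1, -1))  # leftmost Sn stages
    copies = []
    for _ in range(Sn):
        copies.append(copy)
        copy = [copy[-1]] + copy[:-1]  # rotate right by one position
    return copies


# Example with S = 4 stages and Sn = 2, as in the running example.
for k, c in enumerate(innermost_copies(4, 2)):
    print(f"copy {k}: stages {c}")
```

After $S_n$ rotations the arrangement returns to that of copy 0, which is why only $S_n$ copies are needed in this model.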
In this paper, our strategy for register usage is to combine dynamic (hardware) and static (software) register renaming. Dynamic register renaming, e.g., the rotating register support, is used in the outermost loop pattern, prolog and epilog. Static register renaming is used in the innermost loop pattern.
Let us consider the compilation flow in Fig. 1. We assign rotating registers to the TNs in the intermediate kernel. The first S rotating predicate registers are used to control the issue of the outermost loop iterations, as in traditional modulo scheduling [1]. For the IA-64 architecture, p16, p17, ..., p(16 + S - 1) are assigned to the stages in the kernel from right to left. See the example in Fig. 2(d).
For other rotating registers, our current implementation makes the simplistic choice of allocating S rotating registers per variable. This method is conservative, and some allocated registers might never be used. For instance, in Fig. 2(d), TN1 is allocated rotating register r35, whose value is referenced only in the first and the third stages of the kernel; thus r36 and r38 are not used by TN1 and are not allocated to other TNs either. An optimal (tight) allocation of rotating registers is left for future work.
After obtaining the register-allocated kernel, the code generator begins to generate the final schedule. In this process, we use the kernel directly to form the prolog, the outermost loop pattern, and the epilog, using dynamic register renaming. For the innermost loop pattern, however, the register indexes of the operations in the kernel must be adjusted to reflect the permutation of the kernel, as shown in Section 4. The register-allocated kernel is used throughout the subsequent code generation process. Thus, from now on, "kernel" refers to the register-allocated kernel by default.
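The conservative allocation above can be sketched as follows. This is an illustration only: the helper names are hypothetical, stage $A_0$ is assumed to be the rightmost stage, and the base registers are those of the allocation plan of the motivating example (r35 for TN1, r45 for TN2, r40 for TN3) with S = 4.

```python
ROT_PRED_BASE = 16  # p16 is the first rotating predicate register on IA-64


def stage_predicates(S):
    """Predicate guarding each stage A_0 .. A_{S-1} (A_0 is the rightmost)."""
    return [f"p{ROT_PRED_BASE + j}" for j in range(S)]


def stage_registers(base, S):
    """The S rotating registers reserved for one TN, indexed by stage."""
    return [f"r{base + j}" for j in range(S)]


S = 4
plan = {"TN1": 35, "TN2": 45, "TN3": 40}  # base register of each TN
table = {tn: stage_registers(base, S) for tn, base in plan.items()}

print(stage_predicates(S))
for tn, regs in table.items():
    print(tn, regs)
```

In this model the load in stage $A_2$ is guarded by p18 and sees TN3 as r42 and TN1 as r37, which agrees with the registers quoted for the load operation in Section 4.3.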
Generated SSP Code Skeleton
Knowing the different components of the final schedule and the register usage strategy, now we can show the skeleton of the generated code in Fig. 6 .
The skeleton is written in pseudo-code. Each $L'_i$ corresponds to the $L_i$ loop in the original loop nest. Each for loop structure is to be replaced by its equivalent in the target assembly language. This is straightforward and we do not show the details here. The code in bold font is to be generated by the corresponding algorithms in Section 4.
In this skeleton, the variable initial_i_n is used to set the initial value of the innermost loop index $i_n$. When execution reaches $L'_n$ for the first time, initial_i_n is 1; otherwise, it is 0.
Figure 6. Generated Code Skeleton
br.ctop is a branch operation in the IA-64 ISA that rotates registers automatically for dynamic register renaming, and decrements the loop count register LC if LC > 0, or the epilog count register EC if LC = 0. Fig. 6 shows only one br.ctop, which will either branch back to $L'_1$ or fall through to the epilog. Other br.ctop operations will appear in the prolog, the outermost loop pattern, and the epilog, as shown later.
Based on the above skeleton, the final code produced by our code generation method for our example (depicted in Fig. 2) is shown in Fig. 7. The generated code is shown in IA-64 assembly language and pseudo code. Among its components, we can distinguish the initialization (lines 1-6), the prolog (lines 7-9), and the outermost loop pattern (lines 10-20).
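The counter behavior of br.ctop can be summarized with a small model. This is a simplified sketch rather than a full description of the instruction: it tracks only LC, EC, whether the registers rotate, and whether the branch is taken, and it leaves out the stage predicate that br.ctop also writes. The function name `br_ctop` is chosen here for illustration.

```python
def br_ctop(lc, ec):
    """Simplified model of one br.ctop execution.

    Returns (new_lc, new_ec, rotated, taken).
    """
    if lc > 0:
        # Kernel phase: a new iteration is issued.
        return lc - 1, ec, True, True
    if ec > 1:
        # Draining phase: no new iteration, but keep looping.
        return 0, ec - 1, True, True
    if ec == 1:
        # Last drain step: rotate once more and fall through.
        return 0, 0, True, False
    # LC == 0 and EC == 0: nothing left to do, fall through.
    return 0, 0, False, False


state = (5, 2)
while True:
    lc, ec, rotated, taken = br_ctop(*state)
    print(f"br.ctop{state} -> ({lc}, {ec}), taken={taken}")
    state = (lc, ec)
    if not taken:
        break
```

Note that the loop above only traces the counters of a single self-looping br.ctop; in the generated SSP code most br.ctop operations target the label that immediately follows them, so only the rotation and counter updates matter there.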
Code Generation for the IA-64 Architecture
The algorithms to generate the different components will now be described in detail in the context of the IA-64 architecture. The code generation scheme, however, is general and can be easily adapted to any architecture with similar architectural support. Note that there is more than one way to generate code for a given SSP final schedule, and that we describe here only one possible solution.
In the following descriptions, we use emit_op to emit an operation and emit_label to create a label. We will keep using the example from Fig. 2 to illustrate each algorithm.
Prolog
The prolog occurs only once for a given SSP final schedule. It accounts for $S - S_n - 1$ copies of the kernel, with some stages peeled off in each copy. There is no prolog if $S - S_n - 1 = 0$. The prolog for the example appears on lines 7-9 in Fig. 7. The br.ctop operation ensures that the rotating registers are rotated and the LC or EC counter is decremented. Since the branch label (end_prolog_0) immediately follows the branch, control simply falls through.
The algorithm for generating the prolog is shown in Fig. 8, where function emit_stages() emits operations cycle by cycle from a series of stages. Here we simply emit a stop bit ";;" when all operations in a cycle are emitted.
The Outermost Loop Pattern
The outermost loop pattern (see Fig. 5) is composed of $S_n$ identical copies of the entire kernel, shifted by one outermost loop iteration between each copy. Therefore, to generate the code associated with the outermost loop pattern, we use rotating registers and rotating branches: we emit $S_n$ copies of the kernel alternated with a br.ctop operation to force the rotation of the registers. Once again, the br.ctop operation is not used for control flow transfer, but for register rotation. Furthermore, the last kernel copy issued is not immediately followed by a br.ctop operation. This freezes the hardware register renaming process until new iterations of the outermost loop are initiated again, which happens in the next occurrence of the outermost loop pattern.
Lines 10-20 in Fig. 7 show the outermost loop pattern for our example. Note that we have exactly $S_n = 2$ copies of the kernel: one within lines 10-13, and another within lines 16-19. After the first copy, there is one br.ctop operation (in line 14). The second copy, however, is not immediately followed by a br.ctop; that operation is delayed until after the innermost loop pattern and appears in line 27. The code generation algorithm for the outermost loop pattern is shown in Fig. 8.
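The emission order described above can be sketched in a few lines. This is not the algorithm of Fig. 8, only a minimal illustration under stated assumptions: the kernel is given as a list of already formatted operations, and the function name and the label names `olp_<i>` are hypothetical.

```python
def emit_outermost_pattern(kernel_ops, Sn):
    """Emit Sn kernel copies separated by br.ctop operations.

    Each br.ctop targets the label that immediately follows it, so it
    only rotates the registers. The last copy gets no br.ctop, which
    freezes register rotation until after the innermost loop pattern.
    """
    code = []
    for i in range(Sn):
        code.extend(kernel_ops)
        if i < Sn - 1:
            code.append(f"br.ctop olp_{i} ;;")
            code.append(f"olp_{i}:")
    return code


for line in emit_outermost_pattern(["<kernel cycle 0> ;;", "<kernel cycle 1> ;;"], 2):
    print(line)
```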
The Innermost Loop Pattern
As shown in Fig. 5, after the outermost loop pattern, control finally transfers to the innermost loop pattern. Since the outermost loop pattern freezes hardware renaming at its end, as said above (Section 4.2), some form of register renaming is still needed to ensure that overlapping live ranges of the same TN from different outermost loop iterations do not use the same register. However, the available hardware register renaming is used for the outermost loop pattern, and the IA-64 architecture provides only one rotating register base. Hence, the register renaming in the innermost loop pattern must be handled by software.
To express the innermost loop pattern equivalently in terms of the register-allocated kernel, the index of each rotating register used in stage $A_j$ of kernel copy $i$ is decreased by offset(j, i), where "%" in the definition of offset denotes the modulo operation.
In other words, the first copy of the kernel (copy 0) in the innermost loop pattern is formed by decrementing by 1 the indexes of the rotating registers in each operation in the leftmost $S_n$ stages. From then on, the indexes of the rotating registers must be permuted between copies of the kernel (copy 1 to copy $S_n - 1$ in the innermost loop pattern in Fig. 5).
For our example, the $S_n = 2$ copies of the leftmost 2 stages of the kernel are shown in Fig. 7, lines 22-25. Note how the registers in the original register-allocated kernel have been renamed to make sure each operation uses the correct registers. Take the load operation for instance, which is operation c and appears from cycle 3 to cycle 4. After mapping to real code, its two instances correspond to lines 23 and 25 in Fig. 7. For the ld operation in copy 0 of the kernel, offset(j, i) = offset(2, 0) = 1 (j = 2 since the ld operation is in stage $A_2$, as shown in Fig. 2(d), and i = 0 since we are solving the offset for copy 0). For the ld operation in copy 1 of the kernel, offset(j, i) = offset(2, 1) = 0.
Therefore, in the first copy, the rotating registers used in the load operation are renamed from p18, r42 and r37 in the register-allocated kernel to p17, r41 and r36. Then in the second copy, they are renamed back to p18, r42 and r37 again.
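The static renaming of the load operation can be reproduced with a short sketch. It is an illustration only, under several assumptions: the function name `rename` is hypothetical, the operand layout of the load (`(p18) ld4 r42 = [r37]`) is assumed since only the three registers are given in the text, the offset is taken as an input rather than computed, and the rotating regions are assumed to start at p16, r32, and f32.

```python
import re


def rename(operation, offset):
    """Decrease the index of every rotating register in `operation` by `offset`."""

    def shift(match):
        kind, idx = match.group(1), int(match.group(2))
        # Assumed rotating regions: p16 and up, r32 and up, f32 and up.
        rotating = (kind == "p" and idx >= 16) or (kind in ("r", "f") and idx >= 32)
        return f"{kind}{idx - offset}" if rotating else match.group(0)

    return re.sub(r"\b([prf])(\d+)\b", shift, operation)


load = "(p18) ld4 r42 = [r37]"   # assumed operand layout of operation c
print(rename(load, 1))           # copy 0, offset(2, 0) = 1
print(rename(load, 0))           # copy 1, offset(2, 1) = 0
```

Registers outside the assumed rotating regions are left untouched, so loop invariants held in non-rotating registers are not affected by the renaming.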
The corresponding algorithms for generating the innermost loop pattern and adjusting the rotating registers' indexes are shown in Fig. 8 , where index(r) refers to the index of register r. Function TS() transforms a stage with a given adjustment.
Epilog
The final phase of the SSP schedule is the epilog, which consists of $S_n$ copies of the kernel, except that only a subset of the $S_n$ leftmost stages of the kernel are executed. In a sense, it is similar to the prolog, and thus the code generation algorithm (shown in Fig. 8) is also similar.
Initialization
The initialization part in Fig. 6 sets the LC and EC registers provided by the IA-64 architecture. Their values control all the generated code except the initialization itself and the epilog (the epilog has its own setting of LC and EC, as shown in Fig. 8). Setting these values properly is crucial to the correctness of the generated code. The formulas below ensure that when (LC, EC) becomes (0, 1):
1. We have issued all the outermost loop iterations, and have not issued any more iterations.
2. The next br.ctop to be executed is the one shown in Fig. 6. According to the behavior of br.ctop [1], the control flow then definitely goes to the epilog.
There are $N_1$ outermost loop iterations in total. One br.ctop issues one iteration. Therefore, LC is initialized to:
$$LC = N_1 - 1.$$
To find the correct value for EC, let us reconsider Fig. 4. Since EC is used only by br.ctop, which is not used in the innermost loop pattern at all, removing all the occurrences of the innermost loop pattern from Fig. 4 gives Fig. 9, where the br.ctops are shown explicitly. A br.ctop operation is controlled by the two registers LC and EC. For clarity, we write br.ctop(LC, EC) for a br.ctop operation with (LC, EC) as its input values. For example, br.ctop(5, 2) (in the prolog) means that the current value of the (LC, EC) pair is (5, 2); from this value, br.ctop rotates the registers once and changes the pair to (4, 2), according to the semantics of this operation [1]. Therefore, the next br.ctop is br.ctop(4, 2), as shown in the first occurrence of the outermost loop pattern in Fig. 9.
We count the br.ctops in the figure vertically and horizontally in order to find the correct initial value of EC. Vertically, there are $S - 1 + \lceil N_1 / S_n \rceil S_n$ br.ctops. (As said in Section 4.2, the last kernel copy in the outermost loop pattern is not followed by a br.ctop until after the innermost loop pattern; in Fig. 9 the br.ctop follows that copy directly because the innermost loop pattern is removed.) Assume EC has an initial value of $x$. Then horizontally, there are $N_1 - 1 + x + S_n$ br.ctops. First, the br.ctops with inputs from $(N_1 - 1, x)$ down to $(1, x)$ account for $N_1 - 1$ of them, since LC is initialized to $N_1 - 1$. Then the br.ctops with inputs from $(0, x)$ down to $(0, 1)$ account for $x$ more, since EC is initialized to $x$. As said before, when (LC, EC) = (0, 1) we definitely fall through to the epilog, where another $S_n$ br.ctops are needed to drain the pipelines completely. Therefore, we have the following equation:
$$S - 1 + \lceil N_1 / S_n \rceil S_n = N_1 - 1 + x + S_n.$$
From that, we find $x = S - S_n + \lceil N_1 / S_n \rceil S_n - N_1$, that is, EC should be initialized to
$$EC = S - 1 - ((N_1 - 1) \% S_n).$$
In our example, $S = 4$ and $S_n = 2$. Therefore, $EC = 3 - ((N_1 - 1) \% 2)$. That is, EC = 3 when $N_1$ is odd, and 2 when $N_1$ is even.
The initialization phase should also prepare the live-in values for the rotating registers when needed, and the bit mask for the rotating register base. The final code for our example is shown in Fig. 7.
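The initial values of LC and EC can be checked numerically. The sketch below assumes the closed form $EC = S - 1 - ((N_1 - 1) \bmod S_n)$, which reproduces the example values quoted above (EC = 3 for odd $N_1$ and 2 for even $N_1$ when S = 4 and $S_n = 2$), and a vertical count of $S - 1 + \lceil N_1 / S_n \rceil S_n$ br.ctops; both are assumptions made for this illustration, and the function names are hypothetical.

```python
def init_lc_ec(N1, S, Sn):
    """Initial (LC, EC) for N1 outermost iterations, S stages, Sn leftmost stages."""
    lc = N1 - 1
    ec = S - 1 - ((N1 - 1) % Sn)
    return lc, ec


def brctop_counts(N1, S, Sn):
    """Count the br.ctops of Fig. 9 in the two ways used in the derivation."""
    lc, ec = init_lc_ec(N1, S, Sn)
    groups = -(-N1 // Sn)                 # ceil(N1 / Sn) with integers
    vertical = S - 1 + groups * Sn
    horizontal = lc + ec + Sn
    return vertical, horizontal


print(init_lc_ec(6, 4, 2))     # N1 even
print(init_lc_ec(7, 4, 2))     # N1 odd
print(brctop_counts(6, 4, 2))  # both counts should agree
```

For $N_1 = 6$ the first br.ctop then has inputs (5, 2), which matches the br.ctop(5, 2) shown in the prolog of Fig. 9.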
Extensions & Optimizations
Building on the basic algorithms introduced in the previous section, this section presents techniques for code-size reduction. We then generalize the algorithms to more general loop nests.
Code-Size Optimizations
To facilitate understanding, the code generation algorithms presented in the previous section are not optimized for code size. The prolog, the outermost loop pattern, and the epilog might contain several copies of the kernel that could be avoided.
If code size is an issue, the multiple copies of the kernel can be replaced by a single copy enclosed in a loop. The corresponding code generation algorithm for the outermost loop pattern is shown in Fig. 10(a). In this code, pd designates a non-rotating predicate register used for storing the loop condition, and rc a non-rotating integer register used as a loop counter. In the generated loop, rc is decremented ("rc = rc - 1") and then compared with zero ("pd,p0 = cmp.eq rc,0") to set pd.
Note that the above algorithm generates the outermost loop pattern with a single copy of the kernel. The same optimization can also be applied to the prolog and epilog.
To further reduce the code size, we can merge the epilog and the outermost loop pattern. As seen in Fig. 4, the epilog and the outermost loop pattern contain the same operations. The stages in the outermost loop pattern that are not used by the epilog can be peeled off by setting LC and EC correctly. Predicate registers will then turn off the operations that do not need to be executed. To achieve this, in the initialization phase in Fig. 6, we first initialize a non-rotating predicate register pe to false to indicate that we are not draining the pipeline yet. The register is used at the end of the outermost loop pattern to force the control flow to exit the loop nest at the end of the draining. This is done by adding an instruction emit_op("(pe) br exit") at the end of the algorithm for generating the outermost loop pattern, where exit is a label. Correspondingly, we change the epilog generation algorithm to the one shown in Fig. 10(b), where pe is set to true and control branches back to reuse the outermost loop pattern.
Extension to Generic Source Loop Nest
In previous sections, we have considered the case when the OPSETs are empty for the loops between the outermost and the innermost loops. Let us now consider a more generic case when these OPSETs are not necessarily empty. Let the leftmost $S_x$ stages in the kernel consist of operations executed by loop $L_x$ and its inner loops.
Each time we finish an iteration of such an inner loop $L_x$ ($1 < x < n$), we should fill the pipeline with its next iteration, if any. So the generated code skeleton is a little different, as shown in Fig. 11(a).
To fill the pipelines, the stages from $A_{S-1}$ to $A_{S-S_x}$ need to be permuted using the algorithm shown in Fig. 11(b).
In the algorithm, $\mathit{offset}_x$ is defined as
$$\mathit{offset}_x(j, i, x) = \begin{cases} 1 & \text{if } i = 0 \\ (j - i - S) \% S_x - j + S - S_x - 1 & \text{otherwise} \end{cases}$$
which is an extension of the function offset(j, i) defined before.
Function $TS_x(A_j, \mathit{ofst})$ in the algorithm returns an empty stage if $\mathit{ofst} + j < S - S_n$. In this case, the stage $A_j$ after permutation is not within the current group of the outermost loop iterations, so we simply ignore it. Otherwise, the algorithm is the same as TS() shown in Fig. 8.
Experiments
Experimental Setup
The code generation algorithms have been implemented as a tool set on an IA-64 Itanium workstation. For simplicity, our method was implemented as a stand-alone module working at the assembly level. The GNU assembler is then used to assemble the resulting code. The 1-D scheduler (Step 2 of the SSP method [14]) was implemented using a standard modulo scheduling method [5]. We have implemented two versions of SSP code generation, one without the code-size optimizations (SSP) and one with them (CS-SSP).
We have compared the SSP and CS-SSP methods with two other methods: a traditional modulo scheduling method of the innermost loop (MS) [5], and an extended modulo scheduling method (xMS) which overlaps the draining and filling parts of an outer loop [8]. We compare the different methods in terms of performance, code size, and bundling capability.
For the experiments we chose important loops extracted from scientific applications. Because SSP is equivalent to MS when applied to the innermost loop of a loop nest, we considered only loops where SSP would select a loop level other than the innermost one. The following benchmarks extracted from the Livermore Loops suite [9] have been used: matrix multiply (MM), modified 2-D hydrodynamics (HD), LU decomposition (LU), and Successive Over-Relaxation (SOR). For matrix multiply, we have considered 6 different versions, corresponding to the 6 different ways in which the loops can be interchanged, in order to fully demonstrate the impact of data reuse and parallelism upon the final code quality. These versions are referred to as ijk, jik, ikj, kij, jki, and kji, according to the order of the indexes of the loop nest. We also applied loop tiling to jki with loops k and i tiled, for further comparison. The chosen tile size was the one giving the best performance. Upon tiling, we further applied unroll-and-jam, also known as register tiling. The tiled and register-tiled versions are named jki+T and jki+UJ for short. Here we report the results for the matrix size 1000 × 1000, with double precision floating point values. Other matrix sizes were considered in [14].
Results & Analysis
In this section we report the performance results obtained by running the code generated by our code generation method on an IA-64 Itanium workstation equipped with a 733 MHz Itanium 1 processor, 2GB of main memory, 16KB/96KB/2MB of L1/L2/L3 caches, and running the Red Hat Linux 7.2 operating system. In reporting the performance results, our goal is three-fold. First, the results demonstrate the feasibility and correctness of the proposed code generation method. Second, we would like to know whether the code generation scheme retains the predicted performance benefits of SSP final schedules; in particular, whether the use of static register renaming or the code size increase due to our code generation scheme hinders performance. To address these questions we report the speedup of xMS, SSP, and CS-SSP schedules over the MS version, for each of the benchmarks, by directly measuring the execution time of the respective loops on the Itanium workstation. Third, we report the code size and bundle density of the code generated by our code generation scheme.
Correctness
To ensure that our code generation method produces correct code, we compared the outputs produced by SSP, CS-SSP, MS, and xMS with those generated by a serial version of the code (without any software pipelining). In certain cases, we have also manually checked the generated code. In all benchmarks, the outputs produced by MS, xMS, SSP, and CS-SSP match exactly those generated by the serial version.
Performance
As reported in [14] and as shown in Fig. 12, SSP schedules perform significantly better than xMS and MS schedules for every benchmark tested. SSP achieves speedups between 1.1 and 4.24 over MS or xMS, with an average speedup of 2.1. This significant performance improvement is due to the fact that SSP is able to take advantage of the parallelism or data reuse available in outer loop levels. The two SSP versions perform comparably: SSP performs better on ikj and LU, while the code-size-optimized version (CS-SSP) performs slightly better on the other benchmarks.
We note that neither the static register renaming method nor the code size increase has resulted in SSP final schedules performing worse than MS or xMS schedules. Obtaining more performance numbers that further indicate the impact of these two is left for future work. 
Bundle Density
Bundle density is the average number of operations per bundle, excluding NOPs (Null OPerations). A larger bundle density implies more compact code, and probably more parallelism. We point out that bundle density is a measure of parallelism in the static code, and does not necessarily equal the instruction-level parallelism exploited at run time.
The bundle density of all scheduling methods for the different benchmarks is shown in Fig. 13. While MS and xMS achieve a bundle density of 1.90 on average, the average bundle densities of SSP and CS-SSP are 1.91 and 2.1, respectively. CS-SSP in particular achieves a noticeably higher bundle density than MS and xMS.
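As a concrete reading of the bundle density definition, the following sketch computes it for a list of bundles. The representation is an assumption made for illustration: each bundle is a list of three operation mnemonics, and any mnemonic starting with "nop" is counted as a NOP; the function name and the sample operations are hypothetical.

```python
def bundle_density(bundles):
    """Average number of non-NOP operations per bundle."""
    useful = sum(1 for bundle in bundles for op in bundle if not op.startswith("nop"))
    return useful / len(bundles)


# Two three-slot bundles holding four useful operations in total.
code = [
    ["ld4", "add", "nop.i"],
    ["nop.m", "fma", "br.ctop"],
]
print(bundle_density(code))
```

Since an IA-64 bundle holds three operation slots, the density in this model is at most 3.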
Code Size
Last, we compare the code size of the different scheduling methods in Fig. 14. Despite our precautions to avoid code duplication during code generation, the code produced by SSP is between 3.6 and 9.0 times larger than that of MS or xMS. The code produced by CS-SSP is between 2 and 6.85 times larger than that of MS or xMS.
Although the code size increase in SSP and CS-SSP code is high, it is not a surprise. There are several reasons for this code size increase. First, the SSP method uses two patterns (the outermost and the innermost loop patterns) instead of one.
The code size increase, although noticeable, does not result in any performance degradation. In particular, the measured L1 instruction cache misses were still extremely low. In addition, we observe that the maximum size of the generated SSP code, among all the benchmarks considered, is less than 4.2KB, which is less than a typical L1 I-cache size. Thus, as long as the schedules for the loops can be held in the I-cache, the code size increase does not affect the performance significantly. As we see in all our experiments, SSP and CS-SSP perform as well as or significantly better than MS and xMS schedules. Thus we observe that the code size increase is largely outweighed by the improvement in execution time, a result quite acceptable in general purpose computing.
Future Work
Our experiments revealed that most of the code expansion is caused by the multiple copies of the kernel for the innermost loop pattern. The copies were introduced because there was no rotating register file available for the inner loops. Therefore one possible future direction is to investigate hardware support and ISA extensions (more affordable than that in [13] ) to generate kernel-only code.
As explained in Section 3.2, our method currently allocates rotating registers conservatively. More efficient register allocation for SSP will be investigated in the future.
Lastly, we will introduce the extension of our code generation scheme to non-rectangular iteration spaces.
Related Work
Code generation schemes for modulo scheduling of single loops are discussed for VLIW architectures with and without hardware support in [11]. The hardware support considered includes rotating registers, predicated execution, and iteration control registers [3]. The code generation approach for modulo scheduling in the Cydra-5 compiler has been discussed in [3]. Code size reduction for software pipelined loops has been discussed in [7, 4]. All these works consider software pipelining only for the innermost loop.
In contrast, this paper deals with code generation issues for the SSP method, which handles multi-dimensional loop nests. Code generation for architectures supporting rotating registers and predicated execution has been considered in this paper. Dynamic and static register renaming are combined to address the issue of live ranges overlapping at multiple levels.
Conclusion
The Single-dimension Software Pipelining (SSP) method for a multi-dimensional loop nest [14] chooses the most profitable loop level in the loop nest and software pipelines it. This paper discusses a code generation scheme for the SSP method. In particular, it proposes a code generation skeleton and targets it for the IA-64 architecture. It addresses several interesting issues in code generation, including (1) code generation of the outermost and the innermost loop patterns, with dynamic and static register renaming to assure that overlapping live ranges of different instances of the same TN use different registers; (2) code generation of the prolog and the epilog; (3) code generation using predicated execution; and (4) code size reduction. We have implemented our code generation scheme for the IA-64 architecture. Initial experimental results demonstrate the feasibility and advantages of the proposed scheme.
