Abstract. Enhanced pipeline scheduling (EPS) is a software pipelining technique that can achieve a variable initiation interval (II) for loops with control flows via its code motion pipelining. EPS, however, leaves behind many renaming copy instructions that cannot be coalesced due to interferences. These copies consume resources and, more seriously, may cause a stall if they rename a multi-latency instruction whose latency is longer than the II aimed for by EPS. This paper describes how those renaming copies can be deleted through unrolling, which enables EPS to avoid a serious slowdown from latency handling and resource pressure while keeping its variable II and other advantages. In fact, EPS's renaming through copies, followed by unroll-based copy elimination, provides a more general and simpler solution to the cross-iteration register overwrite problem in software pipelining, one that works for loops with control flows as well as for straight-line loops. Our empirical study performed on a VLIW testbed with a two-cycle load latency shows that the unrolled version of the 16-ALU VLIW code includes fewer no-op VLIWs caused by stalls, improving the performance by a geometric mean of 18%, and the peak improvement with a longer latency reaches as much as a geometric mean of 25%.
Introduction
Enhanced pipeline scheduling (EPS) is a software pipelining technique that is unique due to its code motion pipelining [1]. This feature allows EPS to pipeline any type of loop including those with arbitrary control flows and outer loops. EPS, however, leaves behind many renaming copies that cannot be coalesced due to interferences, suffering from problems caused by those copies.
For example, consider the loop in Fig. 1 (a) which searches for the first nonzero element of an array. Let us assume for the moment that each instruction takes a single cycle. After the loop is scheduled by EPS as in Fig. 1 (b) (we will show later how the loop is scheduled), no true data dependences exist among the three data instructions in the loop (cc=(y==0),y=load(x'),x"=x'+4). The loop can be executed at a rate of one cycle per iteration on a superscalar or a VLIW machine capable of executing two integer instructions, one load and one branch per cycle, assuming we can get rid of the two copies located at the beginning of the loop (we assume the branch executes in parallel with other instructions of the next iteration). However, we cannot coalesce either of these two copies due to interferences: x' is live when x" is defined, and x is live when x' is defined.
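The schedule described above can be mimicked with a small Python simulation. This is only a sketch under assumptions: Fig. 1 is not reproduced here, so the form of the source loop, the placement of the two copies, the prologue, the register names `x`, `x1`, `x2` (standing for x, x', x"), the unit stride used in place of the 4-byte stride, and the non-faulting `load` helper are all hypothetical choices made for illustration.

```python
def load(mem, addr):
    # Assumed non-faulting load: speculative reads past the array return 0.
    return mem[addr] if 0 <= addr < len(mem) else 0


def search_sequential(mem, start):
    # Assumed form of the source loop: advance, load, test.
    x = start - 1
    while True:
        x = x + 1
        y = load(mem, x)
        if y != 0:
            return x


def search_pipelined_with_copies(mem, start):
    # Prologue: issue the load of iteration 1 and the address of iteration 2.
    x1 = start
    y = load(mem, x1)
    x2 = x1 + 1
    while True:
        # Renaming copies at the top of the kernel. Neither can be coalesced:
        # x1 is still needed when x2 is redefined, and x when x1 is redefined.
        x = x1
        x1 = x2
        # None of the three data instructions reads a value written by
        # another one of them in the same pass through the kernel.
        cc = (y == 0)          # test of iteration i
        y = load(mem, x1)      # load of iteration i+1
        x2 = x1 + 1            # address of iteration i+2
        if not cc:
            return x           # x holds the address tested by cc
```

Both functions return the same index for any array that contains a nonzero element at or after `start`; the pipelined version merely overlaps the test, the load, and the address update of three consecutive iterations.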
Both copies in Fig. 1 (b) can be removed if we transform the code as follows: first, unroll the loop twice (thus generating three copies of the original loop) and insert a dummy copy x=x at each loop exit for the variable x, which was live at the original loop exit, as shown in Fig. 1 (c); then calculate live ranges as in Fig. 1 (d) and perform coalescing. The final result after coalescing is shown in Fig. 1 (e), where all copies have disappeared except for the dummy artifact copies at the loop exits, which have now become real copies. The loop will execute at a rate of three cycles per three iterations and give the correct result of x at the loop exit (exit:).
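The effect of unrolling followed by coalescing can be illustrated with a Python simulation. This is a sketch under assumptions rather than a transcription of Fig. 1 (e): the register names `a`, `b`, `c`, the prologue, the unit stride, and the non-faulting `load` helper are hypothetical. The idea it shows is that the roles of x, x', and x" rotate through three registers across the three loop copies, so no copy is needed inside the loop and only the exit copies remain.

```python
def load(mem, addr):
    # Assumed non-faulting load: speculative reads past the array return 0.
    return mem[addr] if 0 <= addr < len(mem) else 0


def search_unrolled(mem, start):
    # Prologue: a and b hold the addresses of iterations 1 and 2.
    a = start
    y = load(mem, a)
    b = a + 1
    while True:
        # Loop copy 1: the roles (x, x', x") are played by (a, b, c).
        cc = (y == 0)
        y = load(mem, b)
        c = b + 1
        if not cc:
            x = a   # former dummy copy x=x, now a real copy at the exit
            return x
        # Loop copy 2: the roles are played by (b, c, a).
        cc = (y == 0)
        y = load(mem, c)
        a = c + 1
        if not cc:
            x = b
            return x
        # Loop copy 3: the roles are played by (c, a, b).
        cc = (y == 0)
        y = load(mem, a)
        b = a + 1
        if not cc:
            x = c
            return x
```

Each loop copy contains only the three data instructions and the branch, which corresponds to the rate of three cycles per three iterations stated above. After the third copy the register roles are back where they started, which is why three copies of the loop body suffice in this example.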
There are two independent issues involved in this example. The first issue is about the unrolling technique itself, i.e., how many times the loop should be
