Abstract. To achieve high resource utilization for multi-issue Digital Signal Processors (DSPs), production compilers commonly include variants of the iterative modulo scheduling algorithm. However, excessive cyclic data dependences, which exist in communication and media processing loops, often prevent the modulo scheduler from achieving ideal loop initiation intervals. As a result, replicated functional units in multi-issue DSPs are frequently underutilized. In response to this resource underutilization problem, this paper describes a compiler preprocessing strategy that capitalizes on two techniques for effective modulo scheduling, referred to as cloning1 and cloning2. The core of the proposed techniques lies in the direct relaxation of cyclic data dependences by exploiting functional units which are otherwise left unused. Since our preprocessing strategy requires neither code duplication nor additional hardware support, it is relatively easy to implement in DSP compilers. The strategy proposed has been validated by an implementation for a StarCore SC140 optimizing C compiler.
INTRODUCTION
As communication and media signal processing applications get more complex, system designers seek programmable high performance fixed-point Digital Signal Processors (DSPs). Recent multi-issue high performance DSPs are designed to meet such demand by providing (1) multiple functional units, (2) advanced issue logic that allows a variable number of instructions to be dispatched in parallel, and (3) optimizing compilers that automatically tune C algorithms for performance [8, 17].
In particular, to exploit the multiple functional units available in multi-issue DSPs, optimizing compilers commonly use a software pipelining strategy. Software pipelining is a global loop scheduling concept which exploits instruction level parallelism across loop iteration boundaries. Optimizing C compilers for multi-issue DSPs commonly adopt variants of the iterative modulo scheduling pioneered by Rau and Glaser [2]. Although existing iterative modulo scheduling approaches [7, 17] are proven to be effective, excessive cyclic data dependences, which are frequently observed in communication and media processing loops, restrict modulo scheduling quality [13]. As a result, replicated functional units in multi-issue DSPs are often left underutilized.
To address this resource utilization problem, the objectives of this paper are twofold: (1) analyzing the nature of the data dependences existing in various signal processing applications, and (2) engineering an effective compiler preprocessing strategy for multi-issue DSPs to help an existing modulo scheduler achieve a high quality loop schedule. For this, the paper describes our preprocessing, which directly relaxes excessive cyclic data dependences with two techniques, referred to as cloning1 and cloning2. Since these two techniques exploit underutilized functional resources, neither code duplication nor additional hardware support is required, and therefore the strategy is relatively easy to implement in DSP compilers. To measure the feasibility and effectiveness of our preprocessing strategy for multi-issue DSPs, the StarCore SC140 DSP processor is used as the representative target.
MOTIVATION: Excessive RecMII
We formally define the commonly used loop scheduling terms in this paper; for the definitions of other modulo scheduling terms, consult [2] and [10].
Definition 1. A candidate loop for an iterative modulo scheduler is a loop with a branch-free body that can run in DSP hardware looping mode.

Definition 2. A recurrence circuit is a data dependence circuit that exists in a Data Dependence Graph (DDG), which is formed from an instruction to an instance of itself.
Definition 3. ExRecMII is the difference between RecMII and ResMII, iff RecMII > ResMII.
According to our benchmarks for the SC140, ExRecMII is the dominant limiting factor in various signal processing loop kernels: candidate loops either fail to be modulo scheduled or are modulo scheduled with an excessively large II.
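To make the three quantities in Definition 3 concrete, the following Python sketch computes ResMII, RecMII, and ExRecMII for a toy loop. It assumes the commonly used formulation in which ResMII is decided by the most heavily used resource class and RecMII by the largest delay-to-distance ratio over the recurrence circuits. The function names, resource classes, and numbers are illustrative assumptions and are not taken from the SC140 compiler.

```python
from math import ceil

def res_mii(uses, units):
    """Resource-constrained bound: the busiest resource class decides."""
    return max(ceil(uses[r] / units[r]) for r in uses)

def rec_mii(circuits):
    """Recurrence-constrained bound over (total_delay, total_distance) pairs."""
    return max((ceil(delay / dist) for delay, dist in circuits), default=0)

def ex_rec_mii(res, rec):
    """Excess of RecMII over ResMII.

    Definition 3 only covers RecMII > ResMII; returning 0 otherwise
    is an assumption made for this sketch.
    """
    return max(rec - res, 0)

# Hypothetical loop: 8 ALU ops on 4 ALUs, 4 address ops on 2 AAUs,
# and two recurrence circuits given as (delay, iteration distance).
res = res_mii({"alu": 8, "aau": 4}, {"alu": 4, "aau": 2})
rec = rec_mii([(6, 1), (4, 2)])
mii = max(res, rec)
excess = ex_rec_mii(res, rec)
```

In this toy case the recurrences, not the functional units, set the MII, which is the situation the preprocessing targets.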
LOOP-CARRIED TRUE DEPENDENCE
As the first example of ExRecMII, consider the C code fragment shown in Figure 1(a) that implements the Fast Fourier Transform (FFT) algorithm. For the shaded candidate loop body in Figure 1(a), the SC140 optimizing compiler produces highly optimized assembly code as shown in Figure 1(b), which is yet to be modulo scheduled.
For iterative modulo scheduling, II of the candidate loop in Figure 1(b) is initially set equal to MII, which is computed as follows. First, each iteration of the branch-free loop body shown in Figure 1(b) [...]. Since MII is max(ResMII, RecMII), II is initially set to 6 for a modulo schedule.

[Figure 1. (a) C code of the FFT candidate loop. (b) Assembly code of (a), listed as SC140 instructions with comments (for example, load twiddle[n][j] to d4 and postincrement the array index by 4 bytes); AAU instructions are marked. (c) Loop-carried data dependence.]
For analysis, consider the loop-carried data dependence in Figure 1(c). This dependence is true since the value of induction variable r1 updated in instruction 9 is referenced by instruction 1 in the subsequent loop iteration. In addition, the dependence chain from instruction 1 down to instruction 9 is transitively true. This type of cyclic true dependence is often created by the compiler when a source address for a computation is used as the destination address to store the result of the computation, which is a very common pattern in DSP applications. Due to these cyclic true dependences, the FFT candidate loop fails to be modulo scheduled, since an II equal to the MII of 6 can already be achieved by local acyclic scheduling.
LOOP-CARRIED FALSE DEPENDENCE
As the second example of ExRecMII, consider the C code fragment in Figure 2(a) that implements the half-rate Global System for Mobile communication (GSM) algorithm. For the candidate loop body in Figure 2(a), which uses the European Telecommunications Standards Institute (ETSI) compliant C macros [6], the optimizing compiler produces highly optimized assembly code as shown in Figure 2(b), which is yet to be modulo scheduled.
For iterative modulo scheduling, candidate loop II is initially set equal to MII, which is computed as follows. First, [...]. Second, RecMII for the half-rate GSM is 5, and the corresponding RecMII recurrence circuits are depicted in Figure 2(c). Since MII is max(ResMII, RecMII), II is initially set to 5 for modulo scheduling.

[...] The dependence between the [...] instructions is an output dependence, since both instructions store results to d10. This type of composite loop-carried dependence is often observed when DSP-specific instructions are selected by the code generator. Due to these two RecMII=5 recurrence circuits, the half-rate GSM candidate loop fails to be modulo scheduled with II smaller than 5.
Note that modulo scheduling requires a candidate loop II be selected before scheduling is attempted. A smaller II corresponds to a shorter execution time. Since the MII is a lower bound on the smallest possible value of II for which a modulo schedule exists, the candidate loop II is initially set equal to the MII and increased until a modulo schedule is obtained. Therefore, a preprocessing strategy that lowers the MII by reducing RecMII can be quite an effective preparation to achieve high loop initiation rate modulo schedules.
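The search for II described above can be sketched in a few lines of Python. This is a minimal illustration, not the SC140 scheduler: `find_ii`, `try_schedule`, and `max_ii` are hypothetical names, and the two lambda functions merely stand in for a real modulo scheduler that succeeds once II reaches the MII.

```python
def find_ii(mii, try_schedule, max_ii):
    """Return the smallest II in [mii, max_ii] that try_schedule accepts, else None."""
    for ii in range(mii, max_ii + 1):
        if try_schedule(ii):
            return ii
    return None

# Hypothetical numbers: ResMII = 4, RecMII = 6 before preprocessing.
mii_before = max(4, 6)
ii_before = find_ii(mii_before, lambda ii: ii >= 6, max_ii=16)

# If preprocessing lowers RecMII to 4, the search starts (and here ends) lower.
mii_after = max(4, 4)
ii_after = find_ii(mii_after, lambda ii: ii >= 4, max_ii=16)
```

Because the search only ever moves upward from MII, any reduction of RecMII below its original value directly lowers the first II the scheduler attempts.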
PROBLEM FORMULATION
The compiler eases our preprocessing task by putting every candidate loop body into a form in which intra-loop false dependences are removed whenever possible. In that setting, our preprocessing reduces ExRecMII by exploiting underutilized functional resources with the following two techniques.
- Cloning1: eliminate loop-carried true dependences of RecMII recurrence circuits by cloning the value of an induction register, and
- Cloning2: relax loop-carried false dependences of RecMII recurrence circuits by splitting and cloning the excessive lifetime data value used in a destructive instruction that requires use of the same register for source and destination, e.g., mac [14].
Since few functional resources are typically left for the preprocessing, the challenge is to find an optimal allocation of critical resources for cloning1 and cloning2 that reduces RecMII by the largest degree, subject to the constraints of ResMII increase.

[...] [14]. Therefore, for loop-carried false dependences, each destructive instruction of the RecMII data dependence circuit is a potentially splittable point.
Benefit Estimation
Once splittable points are identified for a set of RecMII recurrence circuits, these points need to be partially ordered to make the best use of underutilized resources for our preprocessing. First, the potential benefit of the preprocessing for the $i$-th recurrence circuit is estimated by Equation 1:

[...] (1)

where $\mathrm{RecMII}_i'$ is the largest dependence length of the dismantled recurrence circuits when cloning1 and cloning2 are applied to the $i$-th recurrence circuit. This estimate is used as a metric for the local benefit of the preprocessing. Second, the overall (global) benefit of clonings for the whole set is estimated by Equation 2:

[...] (2)
However, Equations 1 and 2 are not sufficient to achieve the desired partial ordering, since a splittable point can be shared by multiple RecMII recurrence circuits. For this reason, the benefit estimation can potentially make our preprocessing fail to find the desired partial ordering. Note that since the preprocessing techniques cloning1 and cloning2 require additional registers and functional resources, the proposed preprocessing is required to check the availability of these resources while estimating benefit.
Register Constraint and Resource Constraint
The number of architected registers of a processor is denoted as $R_D$, where $D$ represents a register file type, i.e., data or address, which depends on architecture characteristics. First, to avoid spill code in a candidate loop, our preprocessing applies cloning1 and cloning2 only when the register budget allows. This register budget is the first constraint, which requires that the additional registers needed for clonings must not exceed the difference between $R_D$ and the number of loop variables kept in registers.
[Figure 3(a)]
    1  mac d1,d2,d3
    2  mac d2,d3,d5
    3  mpy d6,d5,d5
    4  mac d9,d7,d5

[Figure 3(b)]
    1  mac d1,d2,d3
    2  mac d2,d3,d5
    3  mpy d6,d4,d5
    4a tfr d5,d8
    4b mac d9,d7,d8

Improper application of clonings to a given set of RecMII circuits either makes no improvement on II by simply wasting resources or may even make II worse due to an excessive increase in ResMII. As an illustration, consider the example candidate loop in Figure 3(a), which contains a RecMII=[...] recurrence circuit (2-3-4) and ResMII=[...]. When the data register d5 of the 3rd mac instruction is cloned with additional data register d8, MII is reduced by 1 in Figure 3(b) (ExRecMII is reduced from 2 to 0). Nevertheless, if clonings increase ResMII by more than they decrease RecMII, the net effect can make MII worse.
To prevent such an indiscriminate application of clonings which may require more resources than the architecture can possibly provide, resource budget is additionally considered as the second constraint for our preprocessing. The resource budget is modeled using Rau's reservation table [2] , which represents the resource occupation of each loop instruction in a partial schedule.
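The two constraints can be combined into a simple acceptance test, sketched below in Python. The function `cloning_pays_off` and all of its parameters are hypothetical; in particular, the sketch reduces the resource budget to a single recomputed ResMII value instead of modeling a reservation table.

```python
def cloning_pays_off(res_mii, rec_mii, new_res_mii, new_rec_mii,
                     extra_regs, num_regs, loop_vars_in_regs):
    """Accept a cloning only if it fits the register budget and lowers MII."""
    # Register constraint: extra registers must fit into what the loop leaves free.
    if extra_regs > num_regs - loop_vars_in_regs:
        return False
    # Resource constraint (simplified): the ResMII increase must not cancel the gain.
    return max(new_res_mii, new_rec_mii) < max(res_mii, rec_mii)

# Hypothetical cases, all with 16 registers of which 14 hold loop variables.
good = cloning_pays_off(1, 3, 2, 2, extra_regs=1, num_regs=16, loop_vars_in_regs=14)
too_costly = cloning_pays_off(2, 3, 4, 2, extra_regs=1, num_regs=16, loop_vars_in_regs=14)
no_regs = cloning_pays_off(1, 3, 2, 2, extra_regs=3, num_regs=16, loop_vars_in_regs=14)
```

The second case shows the failure mode described above: RecMII drops, but the added operations push ResMII past the original MII.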
Problem Formalization: MAX-MIN
The solution to our resource allocation problem, which searches for an optimal sequence of splittable points, requires the ability to identify the best splittable point T from all possible permutations of splittable points. Obviously, a backtracking-based algorithm cannot be a viable approach, since the runtime complexity of this combinatorial problem grows exponentially in terms of the number of splittable points. To respond to this intractability,
1. a Max-Min problem is formulated that requires a solution to maximize the decrease in II while minimizing both register pressure and resource bound, and
2. a Branch and Bound approach is employed to effectively search for an optimal splittable point T.
In particular, the Max-Min problem for our preprocessing is to seek an optimal splittable point T under the register and resource constraints described in Section 3.2:

[...]

where the benefit term is computed by Equations 1 and 2, and the cost term is the number of additional registers required for clonings. Note that, since the benefit estimation with Equations 1 and 2 is not sufficient to find an optimal partial ordering, the branch and bound approach may produce a suboptimal solution.
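A generic branch and bound search of this flavor can be sketched as follows. This is an illustrative Python model, not the paper's algorithm: it assumes each splittable point has a known register cost and a known effect on circuit lengths, it ignores the resource bound, and the names `Point`, `rec_mii`, and `branch_and_bound` are hypothetical. The objective is to minimize the resulting RecMII first and the register count second.

```python
from dataclasses import dataclass, field

@dataclass
class Point:
    name: str
    regs: int                                    # extra registers this cloning needs
    new_len: dict = field(default_factory=dict)  # circuit id -> length after cloning

def rec_mii(lengths, chosen):
    """Longest circuit after applying every chosen cloning."""
    return max(min([length] + [p.new_len[c] for p in chosen if c in p.new_len])
               for c, length in lengths.items())

def branch_and_bound(lengths, points, reg_budget):
    best = (rec_mii(lengths, []), 0, [])         # (RecMII, registers, chosen points)

    def visit(i, chosen, regs):
        nonlocal best
        current = (rec_mii(lengths, chosen), regs)
        if current < best[:2]:
            best = (*current, list(chosen))
        if i == len(points):
            return
        # Bound: cloning all remaining points is the most optimistic outcome.
        if (rec_mii(lengths, chosen + points[i:]), regs) >= best[:2]:
            return
        if regs + points[i].regs <= reg_budget:  # branch 1: clone this point
            visit(i + 1, chosen + [points[i]], regs + points[i].regs)
        visit(i + 1, chosen, regs)               # branch 2: skip this point

    visit(0, [], 0)
    return best

# Hypothetical circuits and splittable points; A is shared by c1 and c2.
lengths = {"c1": 5, "c2": 5, "c3": 4}
points = [Point("A", 1, {"c1": 3, "c2": 3}), Point("B", 1, {"c1": 3}),
          Point("C", 1, {"c3": 2})]
rec, regs, chosen = branch_and_bound(lengths, points, reg_budget=2)
```

The shared point A illustrates why per-circuit benefit alone is not enough: it relaxes two circuits for the price of one register, whereas B spends a register without lowering the overall RecMII.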
PREPROCESSING STRATEGY FOR EFFECTIVE MODULO SCHEDULING
Since a candidate loop can have exponentially many recurrence dependence circuits, the proposed preprocessing strategy sets up the Max-Min problem described in Section 3.3, and exploits a divide-and-conquer principle to effectively search for a suboptimal splittable point T.
DIVIDE STEP: Detecting ExRecMII Recurrence Circuits and Finding Splittable Points
To identify all recurrence circuits which account for ExRecMII in a candidate loop, we use Tiernan's algorithm [12] with the C data structures shown in Figure 4. The i_ecs and p_ecs fields are later exploited to find an optimal splittable point for cloning2. Since the initialization of these two fields for each recurrence circuit potentially requires the algorithm to determine set relationships with all other recurrence circuits, the upper bound for set operations is $n^2$, where $n$ is the number of recurrence circuits of a candidate loop body. Therefore, in order to perform a set operation in constant time, we represent the constituent instructions of a recurrence circuit as a bit vector, which is encoded into the circuit field of EM_CT.
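The bit vector encoding can be illustrated with a short Python sketch. It does not reproduce the EM_CT structure of Figure 4; the helper names and the example circuits are hypothetical. Python integers are used as bit vectors here, whereas a C implementation would typically use a fixed-width word so that each set operation is a single machine instruction.

```python
def circuit_mask(instructions):
    """Encode the instructions of one recurrence circuit as an integer bit vector."""
    mask = 0
    for i in instructions:
        mask |= 1 << i
    return mask

def overlaps(a, b):
    """True if two circuits share at least one instruction."""
    return (a & b) != 0

def contains(a, b):
    """True if every instruction of circuit b also lies on circuit a."""
    return (a & b) == b

def circuits_through(instruction, masks):
    """Indices of the circuits that pass through the given instruction."""
    return [k for k, m in enumerate(masks) if (m >> instruction) & 1]

# Hypothetical circuits over instructions numbered 1..9.
masks = [circuit_mask([2, 3, 4]), circuit_mask([3, 4]), circuit_mask([7, 8, 9])]
shared = circuits_through(4, masks)
```

A query such as `circuits_through` is the kind of set relationship a profit estimate needs: it tells how many circuits a single splittable point can affect.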
When the circuit confirmation completes, Theorem 1 implies the following corollaries, which simplify the desired search into three steps.
Corollary 1. Cloning1 must be applied prior to applying cloning2 to reduce ExRecMII.
Corollary 2. The optimality in allocating functional resources is not affected by the order of splittable points selection for cloning1.
1. Partition the set of ExRecMII recurrence circuits into c-worklist for cloning1 and d-worklist for cloning2, where c-worklist is a set of ExRecMII circuits whose loop-carried dependences are true and d-worklist is a set of ExRecMII recurrence circuits whose loop-carried dependences are false.
2. According to Corollaries 1 and 2, randomly select a sequence of splittable points from c-worklist as long as register and resource constraints can be met.
3. Find an optimal sequence of splittable points from d-worklist using a branch-and-bound search algorithm.
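The first two steps can be sketched in Python as follows. The dictionary layout, the field names `carried` and `regs`, and the circuit names are assumptions made for illustration; the sketch also treats each circuit as having exactly one splittable point and checks only the register budget, not the resource budget.

```python
def partition(circuits):
    """Split ExRecMII circuits by the kind of their loop-carried dependence."""
    c_worklist = [c for c in circuits if c["carried"] == "true"]
    d_worklist = [c for c in circuits if c["carried"] == "false"]
    return c_worklist, d_worklist

def select_for_cloning1(c_worklist, reg_budget):
    """Take circuits in any order while the register budget lasts (Corollary 2)."""
    chosen = []
    for c in c_worklist:
        if c["regs"] <= reg_budget:
            chosen.append(c["name"])
            reg_budget -= c["regs"]
    return chosen

# Hypothetical circuits: two with true and one with false loop-carried dependences.
circuits = [
    {"name": "fft_a", "carried": "true", "regs": 1},
    {"name": "fft_b", "carried": "true", "regs": 1},
    {"name": "gsm_a", "carried": "false", "regs": 1},
]
c_worklist, d_worklist = partition(circuits)
picked = select_for_cloning1(c_worklist, reg_budget=1)
```

Only the d-worklist is handed to the more expensive branch-and-bound search of step 3.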
PHASE 2: Search for an optimal solution from a d-worklist
An optimal sequence of splittable points for a d-worklist can be found only when the following side effects are accurately estimated.
1. The number of circuits in the d-worklist whose RecMII can be simultaneously improved when a particular splittable point is cloned.
2. The prediction whether the overall loop schedule gets worse when the particular point is cloned.
As an instance of the first side effect, consider Figure 2(c), which shows two RecMII=5 recurrence circuits. When mac instruction 7 is cloned, the loop-carried false dependences from these two circuits can be simultaneously relaxed with no increase in ResMII. As an illustration of the second side effect, consider the example candidate loop in Figure 5(a), which contains two RecMII=3 recurrence circuits. When the 3rd mac instruction is accidentally cloned for one of the circuits, the dependence path of the other circuit is increased by 1 as shown in Figure 5(b), and as a result, the overall loop schedule can get worse.
To consider these two side effects in finding an optimal sequence of splittable points, the first requirement is a profit function that estimates the benefit of a given splittable point in terms of the overall number of circuits delisted from the d-worklist when the point is cloned. For this profit estimation, the i_ecs and p_ecs fields in EM_CT, which are described in Section 4.1, are exploited. Second, the preprocessing exploits the branch and bound search approach [4].

[...] Otherwise, the current context before the branching will be saved so that the remaining search space rooted at other splittable points can be explored.
4. Iteration: steps 2 and 3 will be repeated until isEmpty() returns true.

[...] The recurrence circuit in Figure 6(a) highlights RecMII=6 of the FFT loop illustrated in Figure 1. Since this circuit is formed with cyclic true dependences, the ExRecMII seems irreducible. However, careful analysis of this circuit leads us to observe the following:
1. The loop-carried true dependence is an artifact of scheduling-insensitive loop optimization, such as induction variable elimination and addressing mode optimization.
2. The loop-carried true dependence has no memory (store-load) dependence.
When these two conditions are met, the loop-carried true dependence edge can be removed by cloning1, which replicates an original induction register. As an illustration, cloning1 removes the loop-carried true dependence in Figure 6 (a) as follows.
1. Allocate one additional induction register to clone the value of the original induction register r1, and initialize it at the loop preheader. First, live-variable analysis followed by local reaching definition analysis indicates the availability of additional register r10 to clone the value of r1. Second, a transfer instruction tfra r1,r10, which initializes the cloned register r10, is placed at the loop preheader as shown in Figure 6(b).
2. Prepare one additional operation that clones the induction register r1 used in instruction 1, which was selected according to Definition 4. Place this additional operation prior to the update of the r1 value. In particular, to minimize resource pressure on AAU units, postincrement addressing mode is exploited at instruction 1. Note that, since the memory stride between instructions 1 and 9 differs by two bytes, indexed postincrement addressing mode is selected as shown in Figure 6(b).
3. Finally, eliminate the original loop-carried true dependence by making the instruction reference the cloned operand. Since instruction 1 is already amended to reference the cloned operand r10, no additional change is required.
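The effect of these steps on the address stream can be checked with a small Python simulation. This is a deliberately simplified model and not the code of Figure 6: it assumes a single load (instruction 1) and a single store (instruction 9) that walk memory with the same stride, and the function names are hypothetical. It only shows that routing the load through a cloned register leaves every accessed address unchanged while the load no longer reads the register that the store updates.

```python
def original_loop(base, stride, n):
    """Load and store both go through r1; the store's update feeds the next load."""
    r1 = base
    loads, stores = [], []
    for _ in range(n):
        loads.append(r1)      # instruction 1 reads through r1
        stores.append(r1)     # instruction 9 writes through r1 ...
        r1 += stride          # ... and postincrements it (loop-carried true dependence)
    return loads, stores

def cloned_loop(base, stride, n):
    """Same loop after cloning: the load uses r10, the store keeps r1."""
    r1 = base
    r10 = r1                  # preheader: tfra r1,r10
    loads, stores = [], []
    for _ in range(n):
        loads.append(r10)     # instruction 1 now reads through the clone
        r10 += stride         # postincrement on the clone only
        stores.append(r1)
        r1 += stride
    return loads, stores

same = original_loop(0x100, 2, 4) == cloned_loop(0x100, 2, 4)
```

In the cloned version the only recurrences left on r1 and r10 are their own postincrements, so the long chain from instruction 1 to instruction 9 no longer closes into a circuit.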
As a result of this transformation, the original loop-carried true dependence is removed. By applying cloning1 to the other RecMII=6 recurrence circuit that exists in Figure 1(b), MII is reduced from 6 to 4. Without a single modification to the existing modulo scheduler, a higher loop initiation rate (II of 4) is effectively achieved, and the modified schedule results in a 17% performance improvement.

[Fig. 7. GSM code after applying the solution]

Phase 3 - Cloning2: Figure 7(a) shows the candidate loop body, which contains the two RecMII=5 recurrence circuits highlighted in Figure 2(c). To relax the loop-carried false dependences from these RecMII recurrence circuits without strip-mining the original loop and unrolling the loop kernel, the cloning2 technique is engineered, which splits excessive lifetimes of registers by moving data values around. In particular, cloning2 targets complex [14] and destructive instructions that require use of the same register for source and destination. As an illustration, cloning2 relaxes the RecMII loop-carried false dependences that exist in Figure 7(a) as follows.
UNIFIED FRAMEWORK: Divide-and-Conquer
For a given candidate loop, the actual cloning1 and cloning2 techniques will be performed only when the analysis from the divide-and-conquer steps forecasts that all recurrence circuits in c-worklist and d-worklist can be simultaneously relaxed. Since this process iterates until there is no further change in ExRecMII, most of the search space for an optimal sequence of splittable points, which means an optimal allocation of underutilized functional units in a multi-issue DSP, is typically exhausted.
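The iteration described above has the shape of a fixed-point loop, sketched below in Python. The driver `preprocess` and its callback `relax_once` are hypothetical; the callback stands in for one full divide-and-conquer round and is assumed to return the unchanged ExRecMII when the round is rejected.

```python
def preprocess(ex_rec_mii, relax_once):
    """Repeat relax_once until ExRecMII stops changing; return every value seen."""
    history = [ex_rec_mii]
    while True:
        nxt = relax_once(history[-1])
        if nxt == history[-1]:    # no further change: stop iterating
            return history
        history.append(nxt)

# Hypothetical round that removes one cycle of excess per iteration.
history = preprocess(2, lambda excess: max(excess - 1, 0))
```

Termination relies on the callback eventually returning its input, which holds whenever each accepted round strictly lowers ExRecMII.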
EXPERIMENTAL RESULTS
This section describes results of a set of experiments to illustrate the effectiveness of the unified preprocessing strategy described in Section 4.3, which is implemented for the SC140 optimizing C compiler. The experimental input is a set of candidate loops obtained from DSPStone [18], MediaBench [3], half-rate GSM, enhanced full rate GSM, and other industry signal application kernels. Table 1 lists the benchmarks used for our experiments.
In order to isolate impacts on performance and code size purely from our preprocessing, two sets of executables for the SC140 multi-issue DSP are produced for the benchmarks listed in Table 1:
- ORIG: fully optimized with the original C compiler, and
- PRE: fully optimized with the revised C compiler, which includes the proposed preprocessing.
With these two sets of executables, we measured (1) cycle counts with the StarCore cycle-accurate simulator, simsc100, and (2) code size with the StarCore utility tool, sc100-size. The performance improvements (decrease in cycle counts) and code size increase due to our preprocessing were measured in percent, using the formula [...].

Figure 8(a) reports performance improvements achieved by applying the unified algorithm in Section 4.3, which is based on the cloning1 and cloning2 techniques. The overall performance improvement from our preprocessing ranges from 0.3% to 29.5%, and the average performance improvement is 12.9%. Considering that no modification is made to the existing iterative modulo scheduler and that the performance comparison is made against highly optimized SC140 DSP code, the performance gain from our preprocessing is substantial. In particular, the performance improvements on the Mat1x3, FIR, FFT and ComFFT benchmarks deserve attention, since they show that
1. iterative application of cloning1 followed by cloning2 for an existing modulo scheduler can deliver more performance gain by effectively reducing the ExRecMII of a candidate loop, and
2. the preprocessing strategy described in Section 4.3 can detect and exploit such opportunities for effective modulo scheduling.
Note that none of the benchmarks in Figure 8(a) reports performance degradation. This is not a coincidence, since our algorithm is designed to apply cloning1 and cloning2 only when the additional operations for these techniques can be placed in non-critical recurrence circuits.

Figure 8(b) reports the code size increase due to the unified algorithm. Since cloning1 and cloning2 reduce ExRecMII, the existing modulo scheduler discovers instruction level parallelism across more loop iteration boundaries and, as a result, achieves a better modulo schedule. Since the sizes of the prologue and the epilogue grow proportionally as more loop iterations of a candidate loop get overlapped in the final schedule, a code size increase is unavoidable. However, we also observed that the existing modulo scheduler can find a better loop schedule for the same number of loop iteration boundaries when our preprocessing is applied. This is the reason why our preprocessing reports significant performance improvements with a negligible increase in code size for the IIR, GSMdec, GSMad and GSMsy benchmarks.
For the benchmarks listed in Figure 8(b), the overall code size increase from the proposed preprocessing ranges from 0% to 63.1%, and the average increase is 13.99%. However, note that the benchmarks in Table 1 are critical loop kernels, which typically account for 5% to 10% of the entire application code size. By carefully applying the preprocessing to mission-critical loops with profiling, the overall code size increase can be held to a moderate amount.
RELATED WORK
To effectively lower this ExRecMII, Lam pioneered a compiler technique, referred to as Modulo Variable Expansion (MVE), that removes loop-carried anti and output dependences in recurrence circuits [15]. Since MVE achieves the desired removal with loop unrolling followed by register renaming, a high loop unrolling factor might incur an increase in code size and register pressure. Another drawback of this scheme is that only candidate loops whose iteration count is a multiple of the unrolling factor can be properly accommodated. To overcome this problem, either peeling candidate loops for some number of loop iterations or adding a branch out of the unrolled loop body is required [11].
To duplicate the effect of MVE without loop unrolling, Huff proposed rotating register files as an architectural feature in a hypothetical VLIW processor similar to Cydrome's Cydra 5 [16]. Since the Huff technique still requires a large number of architected rotating registers to support MVE without code expansion, Tyson et al. improved on the Huff technique with register queues and the rq-connect instruction [9]. In their technique, register queues share a common name-space with physical register files. As a consequence, the architected rotating register space is no longer a limiting factor.
CONCLUSION
This paper describes compiler optimizations that preprocess loop kernels of signal processing applications to relax their intrinsic data dependences and thereby complement an iterative modulo scheduler. The presented strategy is implemented for the StarCore SC140 optimizing C compiler backend. As a result of the implementation, a 12.9% average runtime improvement is reported for the benchmarks in Table 1; this runtime improvement is made at the expense of a 13.99% average code size increase. Considering that no modification is made to the existing modulo scheduler and that the performance comparison is made against highly optimized SC140 DSP code, we believe that this gain is impressive.
