Reducing address arithmetic instructions by optimizing address offset assignment greatly improves the performance of DSP applications. However, minimizing address operations alone may not directly reduce code size and schedule length on DSPs with multiple functional units. In this paper, we exploit address assignment and scheduling for applications with loops on such DSPs. Array transformation is used in our approach to leverage the indirect addressing modes provided by most DSP architectures. We propose an algorithm, Address Instruction Reduction Loop Scheduling (AIRLS), that combines rotation scheduling, address assignment, and array transformation to minimize both address instructions and schedule length. Compared to list scheduling, AIRLS shows an average reduction of 35.4% in schedule length and an average reduction of 38.3% in address instructions. Compared to rotation scheduling, AIRLS shows an average reduction of 19.2% in schedule length and 39.5% in the number of address instructions.
INTRODUCTION
DSP processors generally provide dedicated address generation units (AGUs) that can perform address computations in parallel with the central data path. As a result, when data is accessed in register-indirect addressing mode, the address stored in the address register (AR) can be auto-incremented or auto-decremented without an extra addressing instruction. If the address of the next variable can be reached by auto-increment or auto-decrement, the next instruction can be executed without an additional address arithmetic instruction. Consequently, unlike traditional compilers, DSP compilers can carefully determine the relative location of data in memory to achieve compact object code and improved performance. Loops are the most critical sections in many computation-intensive DSP applications. Efficient loop scheduling can help reduce both schedule length and code size. In this paper, we develop a scheme to exploit address assignment and scheduling for applications with loops on DSPs with multiple functional units.
Recently, a lot of research has been done on optimizing the address assignment of variables to minimize the total number of address arithmetic instructions. Address assignment was first studied in [1, 2]. More research [3] has been done on address assignment with fixed scheduling on single-functional-unit architectures. Some work [4] has been done on combining scheduling and address assignment in code generation. These algorithms target only a single functional unit and cannot be directly applied to architectures with multiple functional units.
This work is partially supported by TI University Program, NSF EIA-0103709, Texas ARP 009741-0028-2001, and NSF CCR-0309461, USA.
This paper proposes an algorithm, Address Instruction Reduction Loop Scheduling (AIRLS), to minimize both address instructions and schedule length for loop applications with array transformation. In the AIRLS algorithm, schedules are generated by repeatedly rotating down and re-allocating nodes with minimum address instructions based on rotation scheduling [5], and the schedule with the minimum schedule length is selected. During each step of the rotation, we generate a transformed array sequence based on the result of the address assignment algorithm. This array sequence is then used to determine the new location for each rotated node. AIRLS shows significant performance improvement. Compared to list scheduling, the average reduction in schedule length is 35.4% and the average reduction in address instructions is 38.3%. Compared to rotation scheduling, the average reduction in schedule length is 13.6% and the average reduction in address instructions is 38.3%. When the unfolding technique is applied with an unfolding factor of 2, the average reduction in schedule length increases to 24.8% and the average reduction in address instructions increases to 40.7%.
The remainder of this paper is organized as follows. Section 2 provides a motivating example. Section 3 introduces the basic concepts and the architecture model. The algorithm is discussed in Section 4. Experimental results and concluding remarks are provided in Sections 5 and 6, respectively.
MOTIVATING EXAMPLES
In this section, we provide a motivating example based on the loop in Figure 1(a). The DFG for the loop is shown in Figure 1(b). A schedule generated by list scheduling for the DFG is shown in Figure 1(c). We compare the schedule length obtained by traditional rotation scheduling with the schedule length obtained after applying array transformation and address assignment to traditional rotation scheduling. The detailed schedule is shown in Figure 3. In this detailed schedule, each node is expanded into assembly-level code. In our architecture model, we assume that there is only one address register available, and that no other register is available to each accumulator. Under this constraint, and with the traditional array layout, it is very difficult to leverage the indirect addressing modes. However, if we apply the array transformation technique, we can leverage the indirect addressing modes provided by most DSPs. For example, we transform the array data according to Figure 4(2). Figure 4(1) shows the traditional sequential memory layout for arrays. Figure 4(3) denotes the array transformation in Figure 4(2) and is the notation used in this paper for the memory layout of transformed array data. With this array layout, we obtain a new detailed schedule, shown in Figure 4(4). The schedule length is reduced from 17 to 11, and the number of address instructions is reduced from 14 to 5.
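The effect of array transformation on address instructions can be illustrated with a minimal Python sketch. It does not reproduce the loop of Figure 1 or the layout of Figure 4; it assumes a hypothetical loop that touches A[i], B[i], and C[i] in every iteration, a single address register, and a cost model in which a move of at most one memory word is free (post-increment or post-decrement) while any larger jump costs one address instruction. The function names are illustrative only.

```python
def address_instructions(addresses):
    """Count explicit address updates for one address register.

    A step of 0 or +/-1 is covered by auto-increment/decrement (assumed free);
    any larger jump is assumed to cost one address instruction.
    """
    return sum(1 for prev, cur in zip(addresses, addresses[1:])
               if abs(cur - prev) > 1)


def sequential_layout(n, arrays):
    """Traditional layout: each array occupies its own contiguous block."""
    return {(name, i): k * n + i
            for k, name in enumerate(arrays) for i in range(n)}


def interleaved_layout(n, arrays):
    """Transformed layout (assumed here): elements with the same index are adjacent."""
    return {(name, i): i * len(arrays) + k
            for k, name in enumerate(arrays) for i in range(n)}


n = 4
arrays = ["A", "B", "C"]
# Hypothetical access order: A[i], B[i], C[i] for i = 0 .. n-1.
accesses = [(name, i) for i in range(n) for name in arrays]

seq = sequential_layout(n, arrays)
inter = interleaved_layout(n, arrays)
seq_cost = address_instructions([seq[a] for a in accesses])
inter_cost = address_instructions([inter[a] for a in accesses])
print(seq_cost, inter_cost)
```

Under these assumptions the sequential layout needs an address instruction on every transition, while the interleaved layout needs none, which is the kind of saving the transformed layout is meant to expose.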
Fig. 3. Detail Schedule after Rotation Scheduling
BASIC CONCEPTS AND MODELS
The processor model we use in this paper is given as follows. Each functional unit in a processor with multiple functional units has an accumulator and an address register. Each operation involves the accumulator and, if any, another operand from memory. Memory access can only occur indirectly via address registers, AR0 through ARk. Furthermore, if an instruction uses ARi for indirect addressing, then in the same instruction, ARi can be optionally post-incremented or post-decremented by one without extra cost. If an address register does not point to the desired location, it may be changed by adding or subtracting a constant, using the instructions ADAR and SBAR. In this paper, FNi is used to denote functional unit i, and ARi is used to denote the address register for FNi. We also use *(ARi), *(ARi)+, and *(ARi)- to denote indirect addressing through ARi, indirect addressing with post-increment, and indirect addressing with post-decrement, respectively. This processor model reflects the addressing capabilities of most DSPs and can be easily adapted to other architectures.
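The addressing rules of this model can be sketched as a small Python code emitter. This is only an illustration under stated assumptions: the address register is assumed to already point at the first operand, the operand syntax of ADAR and SBAR (register, then constant) is assumed, and the mnemonics LOAD, MUL, and STOR are hypothetical placeholders.

```python
def emit(ops, ar="AR0"):
    """Expand (mnemonic, address) pairs into assembly-like text for one unit.

    Assumes `ar` already points at the first address. A step of +1 or -1 to
    the next operand is folded into the instruction as post-increment or
    post-decrement; a larger step needs an explicit ADAR or SBAR.
    """
    code = []
    for idx, (mnemonic, addr) in enumerate(ops):
        next_addr = ops[idx + 1][1] if idx + 1 < len(ops) else addr
        delta = next_addr - addr
        if delta == 1:
            code.append(f"{mnemonic} *({ar})+")
        elif delta == -1:
            code.append(f"{mnemonic} *({ar})-")
        else:
            code.append(f"{mnemonic} *({ar})")
            if delta > 1:
                code.append(f"ADAR {ar}, {delta}")   # assumed operand syntax
            elif delta < -1:
                code.append(f"SBAR {ar}, {-delta}")  # assumed operand syntax
    return code


program = emit([("LOAD", 0), ("ADD", 1), ("MUL", 4), ("STOR", 3)])
for line in program:
    print(line)
```

In this example only the jump from address 1 to address 4 requires an explicit address instruction; the other transitions are absorbed by post-increment or post-decrement.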
Data Flow Graphs are used to model loops and are defined as follows. A Data Flow Graph (DFG) $G$ is a directed graph, where $V$ is the set of nodes, $E \subseteq V \times V$ is the edge set that defines the precedence relations for all nodes in $V$, a binary string is associated with each node $v \in V$, and $d(e)$ represents the number of delays for an edge $e \in E$. Nodes in $V$ can be various operations, such as addition, subtraction, multiplication, logic operation, etc.
In our case, a DFG can contain cycles. The intra-iteration precedence relation is represented by the edge without delay and the inter-iteration precedence relation is represented by the edge with delays. The cycle period of a DFG corresponds to the minimum schedule length of one iteration of the loop when there are no resource constraints.
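As a concrete illustration of these definitions, the following Python sketch stores a DFG as a dictionary from edges to delay counts and computes the cycle period as the longest path over zero-delay edges. The representation, the function name, and the default of one time unit per node are assumptions made for this sketch; it also assumes the zero-delay subgraph is acyclic, as it must be for a legal DFG.

```python
from functools import lru_cache


def cycle_period(nodes, edges, time=None):
    """Longest computation path that uses only zero-delay edges.

    nodes: iterable of node names
    edges: dict mapping (u, v) -> number of delays on edge u -> v
    time:  optional dict of node computation times (assumed 1 if omitted)
    """
    nodes = list(nodes)
    time = time or {v: 1 for v in nodes}
    succ = {v: [] for v in nodes}
    for (u, v), delays in edges.items():
        if delays == 0:          # intra-iteration precedence only
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(v):
        return time[v] + max((longest_from(w) for w in succ[v]), default=0)

    return max(longest_from(v) for v in nodes)


# Hypothetical three-node loop: A -> B -> C within an iteration,
# and C feeds A two iterations later.
nodes = ["A", "B", "C"]
edges = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 2}
period = cycle_period(nodes, edges)
print(period)
```

Here the chain A, B, C has no delays between its nodes, so one iteration cannot be shorter than three time units without moving delays.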
An example is shown in Figure 1 . The DFG in Figure 1 (b) models the loop in Figure 1(a) . In this example, there are two kinds of operations: multiplication and addition. They are denoted by the rectangle and circle as shown in Figure 1(b) .
A static schedule of a cyclic DFG is a repeated pattern of an execution of the corresponding loop. In our work, a schedule implies both control step assignment and functional unit allocation. A static schedule must obey the precedence relations of the directed acyclic graph (DAG) portion of the respective DFG. The DAG is obtained by removing all edges with delays in the DFG. For example, in the schedule in Figure 1(c), node B is scheduled at control step 2 and assigned to one functional unit.
Unfolding, also called unrolling or unwinding, is widely used in compiler design [6]. A schedule of unfolding factor f can be obtained by unfolding G f times. That is, a total of f iterations are scheduled together, and the schedule is repeated every f iterations.
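Unfolding can be sketched in Python using the standard construction for data flow graphs: each node gets f copies, and an edge from u to v with d delays becomes, for each copy i of u, an edge to copy (i + d) mod f of v carrying floor((i + d) / f) delays. The edge-dictionary representation and the function name are assumptions of this sketch, not notation from the paper.

```python
def unfold(nodes, edges, f):
    """Unfold a DFG f times.

    nodes: iterable of node names
    edges: dict mapping (u, v) -> number of delays
    Returns (new_nodes, new_edges), where copy i of node v is the pair (v, i).
    """
    new_nodes = [(v, i) for v in nodes for i in range(f)]
    new_edges = {}
    for (u, v), d in edges.items():
        for i in range(f):
            # Copy i of u feeds the copy of v that runs d iterations later.
            new_edges[((u, i), (v, (i + d) % f))] = (i + d) // f
    return new_nodes, new_edges


# Hypothetical loop with one delay on the feedback edge C -> A.
nodes = ["A", "B", "C"]
edges = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 1}
new_nodes, new_edges = unfold(nodes, edges, 2)
for edge, delays in sorted(new_edges.items()):
    print(edge, delays)
```

The total number of delays is preserved by this construction, so the unfolded graph describes the same loop with f iterations grouped into one.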
Retiming [7] can be used to optimize the cycle period of a DFG by evenly distributing the delays in it. Rotation scheduling, presented in [5], is a scheduling technique used to optimize a loop schedule with resource constraints. It iteratively transforms a schedule into a more compact one. In most cases, the node-level minimal schedule length can be obtained in polynomial time by rotation scheduling. In each step of rotation, nodes in the first row of the schedule are rotated down. By doing so, the nodes in the first row are rescheduled to the earliest possible available locations. From the retiming point of view, each rotated node gets retimed once by drawing one delay from each of its incoming edges and adding one delay to each of its outgoing edges in the DFG. The new location of the node in the schedule must also obey the precedence relations in the new retimed graph. The retimed graphs and schedules after the first and second rotations are shown in Figure 2(a) and Figure 2(b), respectively; both are based on the original schedule in Figure 1(c). The node-level minimal schedule length is obtained by the schedule in Figure 2(b).
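The retiming effect of one rotation step can be sketched in Python as follows. The sketch covers only the delay bookkeeping on the DFG (one delay drawn from each incoming edge of a rotated node, one delay added to each outgoing edge); re-placing the rotated nodes in the schedule table is not modeled. The edge-dictionary representation and the function name are assumptions.

```python
def rotate(edges, rotated):
    """Retime every node in `rotated` once.

    edges:   dict mapping (u, v) -> number of delays
    rotated: nodes taken from the first row of the schedule
    An edge between two rotated nodes loses and gains one delay, so it is unchanged.
    """
    rotated = set(rotated)
    new_edges = {}
    for (u, v), d in edges.items():
        new_d = d + int(u in rotated) - int(v in rotated)
        if new_d < 0:
            raise ValueError(f"node {v!r} has a zero-delay incoming edge "
                             "and cannot be rotated")
        new_edges[(u, v)] = new_d
    return new_edges


# Hypothetical loop: A -> B -> C within an iteration, C -> A with two delays.
edges = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 2}
after = rotate(edges, ["A"])
print(after)
```

After rotating A, the only remaining zero-delay edge is B to C, so the longest intra-iteration chain shrinks from three nodes to two, which is what allows the schedule to become more compact.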

THE AIRLS ALGORITHM
In this section, an algorithm, Address Instruction Reduction Loop Scheduling (AIRLS), is designed to reduce address operations based on loop unrolling and rotation scheduling. The basic idea is to first unroll the loop, then generate the schedules by repeatedly rotating down and re-allocating nodes with minimum schedule length and address operations based on Rotation Scheduling, and then select a best schedule that has the minimal schedule length. The AIRLS algorithm is shown in Algorithm 4.1.
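The address assignment step builds on the Solve-SOA heuristic, which can be sketched in Python as below. This is a sketch of the original greedy heuristic as it is commonly described, not of the modified mSOA() function used by AIRLS: edges of the access graph are weighted by how often two variables are accessed consecutively, and edges are accepted in order of decreasing weight as long as no variable gets more than two neighbors and no cycle is formed. The function name, the tie-breaking rule, and the example access sequence are assumptions.

```python
from collections import Counter


def solve_soa(access_sequence):
    """Greedy path cover of the access graph.

    Returns (chosen_edges, cost), where cost is the total weight of the edges
    left uncovered, i.e. the transitions that need an address instruction.
    """
    weights = Counter()
    for a, b in zip(access_sequence, access_sequence[1:]):
        if a != b:
            weights[frozenset((a, b))] += 1

    parent = {v: v for v in access_sequence}

    def find(v):
        # Union-find root lookup with path halving, used for cycle detection.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    degree = Counter()
    chosen = []
    # Heaviest edges first; ties broken alphabetically (an assumption).
    ordered = sorted(weights.items(), key=lambda item: (-item[1], sorted(item[0])))
    for edge, _ in ordered:
        a, b = sorted(edge)
        if degree[a] >= 2 or degree[b] >= 2:
            continue                      # would exceed degree 2
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue                      # would close a cycle
        parent[root_a] = root_b
        degree[a] += 1
        degree[b] += 1
        chosen.append((a, b))

    cost = sum(weights.values()) - sum(weights[frozenset(e)] for e in chosen)
    return chosen, cost


chosen, cost = solve_soa(["a", "b", "a", "b", "c", "a", "c", "d"])
print(chosen, cost)
```

For this access sequence the chosen edges form the path b, a, c, d; placing the variables in memory in that order leaves a single transition (b to c) that needs an explicit address instruction.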
We use a function mSOA() [8] in the AIRLS algorithm, which is a modified version of the original Solve-SOA algorithm [2], so that it can handle a partial access sequence and also handle variables from the same array. In the original Solve-SOA algorithm, an edge is not selected if it would give any node a degree greater than 2.
We obtain the best location for a rotated node by the following strategy. For each candidate location in the schedule, we define a function that computes the number of address operations incurred if the rotated node is assigned to that location. The function takes into account the node in the first non-empty location above and the node in the first non-empty location below the candidate location, both in the same column.
EXPERIMENTS
Our experiments assume an architecture similar to the TI C6000 DSP. We compare our results with those from the traditional list scheduling and rotation scheduling algorithms. The experiments are performed on a PC with a P4 2.1 GHz processor and 512 MB memory running Red Hat Linux 9.0. In the experiments, the running time of AIRLS on each benchmark is less than one minute.
The experimental results for list scheduling, rotation scheduling, and the AIRLS algorithm are shown in Table 1 for 4 and 6 functional units, respectively. Column "AI" presents the number of address instructions and Column "SL" presents the schedule length obtained from the three scheduling algorithms: list scheduling (Field "List"), traditional rotation scheduling (Field "Rotation"), and our AIRLS algorithm (Field "AIRLS"). Field "AIRLS(Unfolding factor=2)" denotes the data obtained by AIRLS with each benchmark unfolded by 2. Columns "%SL-L" and "%SL-R" under "AIRLS" represent the percentage reduction in schedule length compared with list scheduling and rotation scheduling, respectively. Columns "SL/2" and "AI/2" denote the average schedule length and number of address instructions per iteration, given that two loop iterations are processed at the same time. Columns "%AI-L" and "%AI-R" under "AIRLS" represent the percentage reduction in the number of address instructions compared with list scheduling and rotation scheduling, respectively.
Compared to list scheduling, the average reduction in schedule length is 35.4% and the average reduction in address instructions is 38.3%. Compared to rotation scheduling, the average reduction in schedule length is 13.6% and the average reduction in address instructions is 38.3%. When we apply the unfolding technique with an unfolding factor of 2, the average reductions in schedule length and in the number of address instructions both increase. Compared to list scheduling, the reduction in schedule length becomes 45.3% and the reduction in address instructions becomes 40.7%. Compared to rotation scheduling, the reduction in schedule length becomes 24.8% and the reduction in address instructions becomes 40.7%. In summary, AIRLS reduces both schedule length and the number of address instructions compared to list scheduling and rotation scheduling.
CONCLUSION
Loops are the most critical sections of DSP applications. Minimizing the schedule length and reducing code size for loops can significantly increase performance for computation-intensive DSP applications. In this paper, we proposed an algorithm, AIRLS, that utilizes array transformation, address assignment, and rotation scheduling techniques to reduce schedule length and address operations for loops on DSPs with multiple functional units. AIRLS significantly reduces schedule length and address instructions compared to previous work.
