Abstract: Digital signal processors provide dedicated address generation units (AGUs) that are capable of performing address arithmetic in parallel with the main data path. Address assignment, the optimization of the memory layout of program variables to reduce address arithmetic instructions by taking advantage of the capabilities of AGUs, has been studied extensively for single-functional-unit (FU) processors. In this brief, we study address assignment together with scheduling for multiple-FU processors and propose an efficient address assignment and scheduling algorithm for them. Experimental results show that our algorithm reduces both schedule length and the number of address operations on multiple-FU processors compared with previous work.
TMS320C2x/5x/6x [7] and AT&T DSP 16xx [8]. An AGU is a dedicated address generation unit that is capable of performing auto-increment/decrement address arithmetic in parallel with the main data path. When auto-increment/decrement is used in an instruction, the value of the address register is modified in parallel with the instruction, so the next instruction can execute without any extra address instructions. With a careful placement of variables in memory, we can reduce the total number of address arithmetic instructions in an application and thereby improve both the schedule length and the code size.
Address assignment has been studied extensively for single-FU processors. Assuming that instruction scheduling has been done, the problem of finding an optimal memory layout for program variables has been studied in [2]-[5]. Considering AGUs that can also perform auto-increment/decrement based on modify registers, various problems have been studied in [13]-[17]. Address mode selection has been studied in [20]. Optimal address register live range merging is solved in [18]. An algorithm that allows data variables to share memory locations is proposed in [22]. Considering scheduling and address assignment together, various techniques have been proposed in [10]-[12]. Experimental results comparing different algorithms are presented in [19], [21]. The goal of all of these works is to minimize address operations in order to reduce code size and improve performance. These approaches work well on single-FU processors. However, as shown in Section II-B, minimizing address operations alone may not directly reduce code size and schedule length for multiple-FU architectures. In this brief, we study the address assignment problem together with scheduling for multiple-FU architectures.
The basic idea is to construct an address assignment first and then perform scheduling. In this way, we can take full advantage of the obtained address assignment and significantly reduce code size and schedule length. An algorithm, MFSchAS, is proposed in this brief to generate both an address assignment and a schedule for multiple-FU processors. In the MFSchAS algorithm, we first obtain an address assignment and then use bipartite matching to find the best schedule based on the address assignment. Compared with list scheduling, MFSchAS shows an average reduction of 16.9% in schedule length and an average reduction of 33.9% in the number of address operations. Compared with Solve-SOA [5], MFSchAS shows an average reduction of 9.0% in schedule length and an average reduction of 8.3% in the number of address operations.
The remainder of this brief is organized as follows. Section II introduces the basic models and provides a motivational example. The algorithm is discussed in Section III. Experimental results and concluding remarks are provided in Sections IV and V, respectively. 
II. MODELS AND EXAMPLES

A. Basic Models
The processor model we use is as follows. For each FU in a multiple-FU processor, i.e., $FU_i$, there is an accumulator and one or more address registers (ARs). Each operation involves the accumulator and an optional second operand from memory. Memory access can only occur indirectly via address registers. Furthermore, if an instruction uses an address register for indirect addressing, then, in the same instruction, that register can be post-incremented or post-decremented by one or by the value stored in the modify register (MR) without extra cost. If an address register does not point to the desired location, it may be changed by adding or subtracting a constant using the instructions ADAR and SBAR. LDAR loads an address into an address register, and ADD performs addition. In this brief, $AR_i^j$ denotes the $j$th AR of $FU_i$. For simplicity, $AR_i$ denotes the AR of $FU_i$ in the examples, where only one AR is available per FU. We use $*(AR_i)$, $*(AR_i)+$, and $*(AR_i)-$ to denote indirect addressing through $AR_i$, indirect addressing with post-increment, and indirect addressing with post-decrement, respectively. This processor model reflects the addressing capabilities of most DSPs and can be easily adapted to other architectures. The input of our algorithm is a directed acyclic graph (DAG) $G = (V, E)$, where $V$ is the node set in which each node represents a computation, and $E$ is the edge set in which each edge denotes a dependency relation between two nodes.
B. Examples
Here, we provide a motivating example. For a given DAG, we compare the schedule length and code size generated by list scheduling, Solve-SOA algorithm [5] , and our algorithm.
The input DAG shown in Fig. 1 (a) is used throughout this brief. Each node in the DAG is a computation. For example, node Y denotes the computation of . The list of nodes and computations is shown in Fig. 1(b) .
Assume that we have two functional units in our system. Using list scheduling, with the priority of each node set to the length of the longest path from that node to a leaf node, we obtain the schedule shown in Fig. 2(a). The address assignment is simply the alphabetical order, as shown in Fig. 2(b). The detailed assembly code for this schedule is shown in Fig. 2(c). Each node in the schedule in Fig. 2(a) corresponds to several assembly instructions in Fig. 2(c) that together complete the computation denoted by this node. For example, node S in Fig. 2(a) corresponds to the assembly code from line 1 to line 5 of its functional unit, written $FU_i$ here, in Fig. 2(c), which computes $d = e + f$. In this assembly code, we first load the address of variable e into address register $AR_i$, i.e., LDAR $AR_i$,&e. Then we load the value pointed to by $AR_i$ into the accumulator of $FU_i$, i.e., LOAD $*(AR_i)-$. In this instruction, the auto-decrement addressing mode is used to make $AR_i$ point to variable f. Then, the value of f is added to the accumulator, i.e., ADD $*(AR_i)$. Also, since the distance between d and f in the address assignment in Fig. 2(b) is 2, we move $AR_i$ from f to d by adding 2 to it, i.e., ADAR $AR_i$,2. Finally, we store the result to d with a STOR through $AR_i$. The schedule length is 25, as shown in Fig. 2(c).
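The cost accounting behind the node S walkthrough can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a single address register, no modify register, and a hypothetical three-variable layout chosen only to match the distances stated in the text (e to f is one step down, f to d is two steps up). The function name `count_address_ops` is introduced here for illustration.

```python
def count_address_ops(access_seq, layout):
    """Count explicit address instructions for one AR walking an access sequence.

    access_seq: list of variable names in access order.
    layout: dict mapping variable name -> integer memory address.
    Returns (ldar, adjust): number of LDAR loads and of ADAR/SBAR adjustments.
    Assumption: no modify register, so only distances of 0 or 1 are free.
    """
    ldar, adjust = 0, 0
    prev = None
    for var in access_seq:
        if prev is None:
            ldar += 1      # initial LDAR to point at the first variable
        elif abs(layout[var] - layout[prev]) > 1:
            adjust += 1    # distance > 1 needs an explicit ADAR or SBAR
        # distance 0 or 1 is absorbed by plain or post-inc/dec addressing
        prev = var
    return ldar, adjust


# Hypothetical layout consistent with the text: e -> f is -1, f -> d is +2.
layout = {"f": 0, "e": 1, "d": 2}
ldar, adjust = count_address_ops(["e", "f", "d"], layout)
print(ldar, adjust)  # 1 1
```

With three memory accesses (LOAD, ADD, STOR), one LDAR, and one ADAR, the sketch accounts for the five instructions attributed to node S. Placing e, f, and d contiguously in access order would remove the ADAR, which is the saving that address assignment targets.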
Based on the schedule from Fig. 2(c), the Solve-SOA algorithm [5] is applied to generate a better address assignment, as shown in Fig. 3(b). With this new address assignment, some address arithmetic operations are saved. We obtain a new schedule with a total schedule length of 21, as shown in Fig. 3(c). From the schedule in Fig. 3(c), we can see that the number of address operations (ADAR and SBAR) is reduced. However, the schedule length is not reduced as much as it could be. Even though address operations in one FU are saved, the schedule length or code size may not shrink because of the dependency constraints shown in the dashed boxes in Fig. 3(c). This implies that we cannot achieve the best result with a fixed schedule on a multiple-FU processor.
The schedule generated by our algorithm is shown in Fig. 4. With a different address assignment, as shown in Fig. 4(b), the schedule length is 17. In our schedule, both address operations and schedule length are reduced. Among the three schedules, the schedule generated by our algorithm has the minimal schedule length.
III. ADDRESS ASSIGNMENT AND SCHEDULING
As shown in Section II, minimizing address operations alone cannot directly reduce schedule length and code size for multiple-FU processors. We use an approach that generates address assignment first and then performs scheduling based on the obtained address assignment to solve this problem. In this section, we first show how to generate a good address assignment and then propose an algorithm, MFSchAS, to minimize schedule length and code size for multiple-FU processors.
A. Address Assignment Before Scheduling
The input of the Solve-SOA/GOA algorithm [5] is a complete access sequence based on a fixed schedule. In our algorithm, the schedule is not yet known. However, we can obtain a partial access sequence based on the access sequence within each node. In this section, we propose an algorithm, GetAS, that extends the Solve-SOA/GOA algorithm [5] so that it can handle partial access sequences. The algorithm is shown in Algorithm III.1 and has two steps. In step 1, a partial access sequence is obtained based on the access sequence within each node. A special separator symbol, written "|" here, is inserted between the access sequences of two neighboring nodes to denote that there is no relation between the two variables on either side of it. For example, "e f d | f a h" is a partial access sequence in which the "d" and "f" on either side of the separator have no relation. A partial access sequence for a DAG contains around 60% of the total access sequence information obtained from a fixed schedule, assuming that there are three variable accesses in each node.
Algorithm III.1 GetAS(G,k)
In step 2, depending on the number of address registers that are available per FU, either Solve-SOA or Solve-GOA is called with the access sequence created in step 1. Here, Solve-SOA/GOA is modified slightly in the calculation of edge weights. If there is a separator symbol between u and v in the partial access sequence, then u and v are not adjacent to each other, so this occurrence is not counted in the weight of the edge e between u and v.
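The two steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' GetAS: the separator glyph `SEP`, the function names, and the fixed node order are all hypothetical, and the per-node sequences are the two quoted in the text for nodes S and T.

```python
from collections import Counter

SEP = "|"  # hypothetical separator glyph marking "no relation"


def partial_access_sequence(node_sequences):
    """Step 1: concatenate per-node access sequences with SEP between nodes."""
    seq = []
    for i, accesses in enumerate(node_sequences):
        if i > 0:
            seq.append(SEP)
        seq.extend(accesses)
    return seq


def access_graph(seq):
    """Step 2 (edge weights only): count adjacencies of distinct variables.

    A pair with SEP between its two variables is not adjacent and adds nothing.
    """
    weights = Counter()
    for u, v in zip(seq, seq[1:]):
        if SEP in (u, v) or u == v:
            continue
        weights[frozenset((u, v))] += 1
    return weights


# Node S accesses "e f d" and node T accesses "f a h", as in the text.
seq = partial_access_sequence([["e", "f", "d"], ["f", "a", "h"]])
print(" ".join(seq))  # e f d | f a h
graph = access_graph(seq)
print(graph[frozenset(("d", "f"))])  # 1
```

The edge between d and f has weight 1, from the adjacency inside node S only. Under a fixed schedule that ran T directly after S on the same FU, the same edge would have weight 2; the separator withholds that second count because the order is not yet known.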
For a commutative operation such as $d = e + f$, the access sequence can be either "e f d" or "f e d." To exploit commutativity or associativity, we can apply Commute2-SOA by Rao and Pande [11] in place of the Solve-SOA algorithm [5]. When modify registers are available in the system, we can apply the technique presented in [14] inside Algorithm III.1 to take advantage of them.
An example is presented in Fig. 5. Given the DAG in Fig. 1, a partial access sequence obtained by GetAS is shown in Fig. 5(a), assuming only one address register is available per FU. For example, there is a separator symbol between d and f since we do not know whether or not node S (with internal access sequence "e f d") will be scheduled before node T (with internal access sequence "f a h"). Based on this partial access sequence, an access graph is constructed in Fig. 5(b). A maximum-weight path cover is shown with thick lines in the access graph. The cost of the path cover is the sum of the weights of the edges that are not covered by it, which also equals the number of address arithmetic instructions we have to include in our schedule. For this example, the cost is 5.
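The path-cover cost can be illustrated with a common greedy heuristic: take edges in order of decreasing weight, rejecting any edge that would give a vertex degree above two or close a cycle. Whether this matches Solve-SOA [5] in every detail is an assumption, and the edge weights below are hypothetical rather than those of Fig. 5(b).

```python
def greedy_path_cover(weights):
    """Greedy maximum-weight path cover heuristic on an access graph.

    weights: dict mapping frozenset({u, v}) -> positive integer weight.
    Returns (chosen_edges, cost), where cost is the total weight of the
    edges left uncovered by the chosen paths.
    """
    parent, degree = {}, {}

    def find(x):
        # Union-find lookup with path halving, used for cycle detection.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    # Heaviest edges first; ties broken by name for a deterministic result.
    for edge in sorted(weights, key=lambda e: (-weights[e], sorted(e))):
        u, v = sorted(edge)
        if degree.get(u, 0) >= 2 or degree.get(v, 0) >= 2:
            continue  # a path vertex has at most two neighbors
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # the edge would close a cycle
        parent[root_u] = root_v
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        chosen.append((u, v))
    covered = sum(weights[frozenset(e)] for e in chosen)
    return chosen, sum(weights.values()) - covered


# Hypothetical access graph, not the one in Fig. 5(b).
example = {
    frozenset(("a", "b")): 2,
    frozenset(("b", "c")): 1,
    frozenset(("a", "c")): 1,
    frozenset(("b", "d")): 1,
}
edges, cost = greedy_path_cover(example)
print(edges, cost)  # [('a', 'b'), ('a', 'c'), ('b', 'd')] 1
```

The chosen edges form the single path c, a, b, d, which can be read off directly as a memory layout. The one uncovered edge (b, c) has weight 1, so one explicit address instruction remains under this layout.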
B. Algorithm for Multiple-FU Processors
Here, we present an algorithm, MFSchAS, to minimize the schedule length and code size for multiple-FU processors. The basic idea is to find a matching between available FUs and ready nodes. As shown in Algorithm III.2, we first obtain an address assignment using GetAS(G,k) and then generate a schedule with minimum schedule length using weighted bipartite matching. In the MFSchAS algorithm, we repeatedly create a weighted bipartite graph between the set of available functional units and the set of ready nodes in $G$ and assign nodes based on the min-cost maximum bipartite matching $M$. In each scheduling step, the weighted bipartite graph is constructed as follows. Its vertex set is the union of the set of currently available FUs and the set of ready nodes. For each available FU $f$ and each ready node $u$, an edge $(f, u)$ is added with edge weight $WCF(AL(f), First(u), Priority(u))$, where $AL(f)$ is the list of variables last accessed by each AR in FU $f$, $First(u)$ is the first variable that will be accessed by node $u$, and $Priority(u)$ is the length of the longest path from node $u$ to a leaf node. $WCF(AL, y, Z)$ is a weight function, where $AL$ is a list of variables, $y$ is a variable in the address assignment, and $Z$ is the priority; it is defined by cases, according to whether or not $y$ is a neighbor of a variable in $AL$ in the address assignment. In this way, the ready nodes with higher priority are considered first. Given the same priority, nodes with address operation savings have more advantage. ARs in use are preferred over unused ARs to save the initialization cost.
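One scheduling step can be sketched as below. The cost function `wcf` here is a hypothetical stand-in, not the WCF defined in this brief: it only encodes the three stated preferences (higher priority first, then address operation savings, then ARs already in use). The brute-force matcher replaces the Fredman and Tarjan technique purely for brevity, and the layout, FU names, and node names are invented for illustration.

```python
from itertools import combinations, permutations


def wcf(last_vars, first_var, priority, layout):
    """Hypothetical edge cost (NOT the paper's WCF). Lower is better."""
    if not last_vars:
        bonus = 0  # AR not yet in use: an LDAR initialization is needed
    elif any(abs(layout[first_var] - layout[x]) <= 1 for x in last_vars):
        bonus = 2  # reachable by auto-increment/decrement: no ADAR/SBAR
    else:
        bonus = 1  # AR in use, but an explicit ADAR/SBAR is needed
    return -(10 * priority + bonus)  # priority dominates, bonus breaks ties


def min_cost_matching(fus, ready, cost):
    """Brute-force min-cost maximum matching; only suitable for tiny inputs."""
    k = min(len(fus), len(ready))
    best = None
    for fu_sel in combinations(fus, k):
        for node_sel in permutations(ready, k):
            pairs = list(zip(fu_sel, node_sel))
            total = sum(cost[f, u] for f, u in pairs)
            if best is None or total < best[0]:
                best = (total, pairs)
    return best[1]


layout = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}  # hypothetical assignment
last_accessed = {"FU1": ["b"], "FU2": []}          # AL for each available FU
ready = {"n1": ("c", 2), "n2": ("e", 2)}           # node -> (First, Priority)

cost = {(f, u): wcf(al, first, prio, layout)
        for f, al in last_accessed.items()
        for u, (first, prio) in ready.items()}
matching = min_cost_matching(list(last_accessed), list(ready), cost)
print(matching)  # [('FU1', 'n1'), ('FU2', 'n2')]
```

Both ready nodes have the same priority, so the address saving decides: FU1 last touched b, which is adjacent to c in the layout, so n1 goes to FU1 and the unused FU2 takes n2. The factor 10 is an arbitrary choice that keeps the priority term dominant for a small number of FUs.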
An example is shown in Fig. 6. Given the DAG in Fig. 1(a), the second scheduling step of the MFSchAS algorithm is shown in Fig. 6 for two FUs with one AR per FU, after nodes T and S have been scheduled to the two FUs in the first step. The address assignment generated by the algorithm GetAS is shown in Fig. 6(a). A weighted bipartite graph based on the set of available functional units and the set of ready nodes in $G$ is constructed in Fig. 6(b). A min-cost maximum bipartite matching is shown in Fig. 6(c). Applying this matching, we schedule nodes W and V to the two FUs as the matching prescribes. The technique proposed by Fredman and Tarjan [6] can be used to obtain a min-cost maximum bipartite matching in $O(n^2 \log n + nm)$ time, where n is the number of nodes and m is the number of edges of the bipartite graph. Let $p$ be the number of FUs and $|V|$ the number of nodes in the DAG. In every scheduling step, we need at most $O((|V| + p)^2 \log(|V| + p) + (|V| + p)\,|V|\,p)$ time to find a minimum-weight maximum bipartite matching, since the number of nodes is $O(|V| + p)$ and the number of edges is $O(|V|\,p)$ in the bipartite graph. Thus, the complexity of MFSchAS is $O(|V|\,((|V| + p)^2 \log(|V| + p) + (|V| + p)\,|V|\,p))$, since the number of scheduling steps is at most $|V|$.
IV. EXPERIMENTS
Here, we experiment with our algorithm on a set of DSP benchmarks including IIR filter, IIR-UF2 (IIR filter with unfolding factor 2), IIR-UF3 (IIR filter with unfolding factor 3), four-stage lattice filter, eight-stage lattice filter, differential equation solver, all-pole filter, elliptic filter, and Volterra filter. The algorithm is implemented in C on Red Hat Linux 9.
An optimal solution for the address assignment and scheduling problem on a multiple-FU architecture can be computed by trying all possible schedules and applying Liao's branch-and-bound algorithm [9], which solves SOA optimally. We computed optimal solutions for several of the benchmark programs that are relatively small (with at most 16 variables) and compared them with MFSchAS. This allows us to measure the quality of the proposed approach. The results are given in Table I. For two of the benchmark programs, the MFSchAS algorithm achieves the same results as the optimal solution. The average overhead is 2.1%.
TABLE I: COMPARISON BETWEEN OPTIMUM AND MFSCHAS
We compare the MFSchAS algorithm with list scheduling and with the algorithm that directly applies Solve-SOA [5] on multiple-FU processors. Tables II and III show the comparison for schedule length and for the number of address operations, respectively. In Tables II and III, the results are shown for list scheduling (Column "List"), Solve-SOA (Column "SOA"), and MFSchAS (Column "MFSchAS") for three and four functional units, assuming only one AR is available for each FU. Column "%LS" denotes the percentage reduction achieved by MFSchAS relative to list scheduling. Column "%SOA" denotes the percentage reduction achieved by MFSchAS relative to Solve-SOA. Compared with list scheduling, MFSchAS shows an average reduction of 16.9% in schedule length and an average reduction of 33.9% in the number of address operations. Compared with the algorithm that directly applies Solve-SOA [5], MFSchAS shows an average reduction of 9.0% in schedule length and an average reduction of 8.3% in the number of address operations.
TABLE II: COMPARISON OF SCHEDULE LENGTH FOR LIST SCHEDULING, SOLVE-SOA, AND MFSCHAS
TABLE III: COMPARISON OF ADDRESS OPERATIONS FOR MFSCHAS, SOLVE-SOA, AND LIST SCHEDULING
V. CONCLUSION
In this brief, we show that we can improve both performance and code size when we combine scheduling with address assignment for multiple-FU DSPs. Specifically, we can generate an address assignment and then utilize this address assignment during the scheduling. Hence, we can minimize the number of address operations needed and significantly reduce schedule length.
