Abstract-One of the important issues in embedded system design is optimizing the program code that is stored in ROM for the microprocessor. In this paper, we propose an integrated approach to the DSP address-code generation problem that minimizes the number of addressing instructions. Unlike previous works, in which code scheduling and offset assignment are performed sequentially without any interaction between them, our work tightly couples the offset assignment problem with code scheduling to exploit the effect of scheduling on minimizing addressing instructions more effectively. We accomplish this by developing a fast but accurate two-phase assignment procedure which, for a sequence of code schedules, finds a sequence of memory layouts with minimum addressing instructions. Experimental results with benchmark DSP programs show an average improvement of 5.8% in the whole code size over the existing methods.
I. INTRODUCTION AND RELATED WORK
THE COMPLEXITY of designing embedded VLSI systems has left traditional DSP compilers unable to meet the very tight constraints on on-chip program code size and real-time performance. Unfortunately, because of architectural restrictions such as nonhomogeneous register sets, specialized functional units and registers, and irregular architectures, even compilers available for commercial DSP processors generate very inefficient code, and few researchers have addressed the problems of optimizing code size and program execution time. The code sizes produced by conventional code-generation techniques, and even by compilers specifically designed for commercial DSP processors, are not satisfactory [1], [2].
Storage assignment, i.e., optimization of memory layout for program variables, is an important part of code generation since the resulting address code can account for over 50% of all program bits and one out of every six instructions for a typical general purpose processor [3] . It was shown in [4] that, for a set of programs in MediaBench [5] that was compiled for the Motorola DSP56000 family, more than 55% of the instructions involve address registers (ARs). Consequently, optimizing address assignment could lead to a significant reduction of code size and program execution time. Many architectures (e.g., the VAX, TI TMS320C2x DSP family, most embedded controllers) provide indirect addressing modes with autoincrement/autodecrement address arithmetic. These architectures also provide a set of dedicated address generation units (AGUs), as shown in Fig. 1 , that perform fast address computation in parallel to the central data path and contain a separate adder/subtractor for performing next-address computations. Since address generation on AGUs does not use datapath resources, a high instruction-level parallelism can be achieved by generating many autoincrement/autodecrement addressing modes to access variables. However, the high utilization of autoincrement/autodecrement requires a careful placement of variables in memory. Consequently, contrary to the traditional compilers which perform storage assignment based on naive approaches such as declaration order, first use, or lexicographic order of variables in the program, DSP compilers should carefully determine the relative location of variables in memory to maximize the use of autoincrement/autodecrement address modes to produce a very compact address code.
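To make the dependence on variable placement concrete, the following is a minimal sketch, assuming a single AR whose autoincrement/autodecrement can move by at most one memory word for free. The function name `address_updates` and the example layouts are illustrative assumptions, not part of the original work.

```python
def address_updates(layout, accesses):
    """Count AR instructions for a memory layout and an access sequence.

    layout:   variables in memory order (offset = position in the list)
    accesses: the order in which the program touches the variables
    Assumes one AR; moving by 0 or 1 word is free (autoincrement/decrement),
    any larger jump needs one explicit address instruction.
    """
    offset = {var: i for i, var in enumerate(layout)}
    explicit = sum(
        1
        for a, b in zip(accesses, accesses[1:])
        if abs(offset[a] - offset[b]) > 1
    )
    return 1 + explicit  # the leading 1 is the AR initialization


accesses = list("abdca")
print(address_updates(list("abcd"), accesses))  # declaration-order layout
print(address_updates(list("cabd"), accesses))  # a layout tuned to the accesses
```

For this access sequence, the declaration-order layout needs more explicit address instructions than the tuned layout, which is the effect that offset assignment exploits.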
The storage assignment with a single AR in the AGU [the simple offset assignment (SOA) problem] was first studied by Bartley [6]. He modeled the problem as a maximum-weighted Hamiltonian path cover (MWPC) problem, which is NP-complete, and proposed a heuristic algorithm for the problem. Liao et al. [1], [7] also proposed a heuristic to solve SOA and extended the problem to the case with multiple ARs in the AGU [the general offset assignment (GOA) problem]. Leupers and Marwedel [8] extended the work of Bartley [6] and Liao et al. [1], [7] by proposing a tie-breaking heuristic and a variable partitioning method to improve the quality of the SOA/GOA solution. Leupers and David [9] solved GOA problems with arbitrary register file sizes and autoincrement ranges, using a technique based on a genetic algorithm. Sudarsanam et al. [10] take into account the utilization of autoincrement with increments varying over a range of values. Gebotys [11] modeled the problem of assigning ARs to every variable access in the code, given a fixed memory layout, as a network flow problem and solved it optimally.
All of the previous approaches [1], [6]-[10] have addressed the SOA/GOA problem as a separate code optimization problem for a given, fixed access sequence of the program variables. However, since the access sequence of the variables is one of the most critical factors affecting the quality of SOA/GOA solutions, the memory layouts produced by the previous approaches are SOA/GOA solutions that are only locally optimized. In this context, we study the problem of integrating SOA/GOA with code scheduling to exploit the effect of scheduling on minimizing the size of the address code (in assembly) more fully and effectively. Our solution extends naturally to incorporate reordering of the input operands of commutative operations (e.g., addition; see footnote 2) with scheduling to further refine the address code. To our knowledge, only the work of Rao and Pande [12] has addressed the optimization of the access sequence of the variables, by a (local) reordering of commutative input operands. However, their solution is confined to the SOA problem, and code scheduling is not taken into account. Lim et al. [13] addressed the scheduling effect on the SOA problem. Their approach aimed to make the access graph sparser by an exhaustive search algorithm with pruning techniques. However, it is not true that sparser graphs always lead to a cheaper MWPC cost than denser graphs.
The key contribution of our work is an effective integration of code scheduling (together with input-operand permutation) which we call Sch-SOA/GOA, into the SOA/GOA heuristics Solve-SOA/GOA [1] , [7] , [8] . The remainder of this paper is organized as follows. In Section II, we illustrate our improvement to the SOA solution over the previous approaches through an example. Our iterative and incremental improvement algorithm for the unification of SOA/GOA and code scheduling is presented in Section III. In Section IV, we provide a set of experimental results with DSP benchmark programs to show the effectiveness of our solutions. Finally, concluding remarks are given in Section V.
II. MOTIVATING EXAMPLE
As an example illustrating the effectiveness of code scheduling on the problem of assigning offsets to program variables, consider the C program in Fig. 2(a). Suppose we want to solve an SOA problem. That is, we assume a target architecture with a single AR and only the indirect and autoincrement/autodecrement addressing modes. Fig. 2(b-i) shows the access sequence of the variables corresponding to the input C code in Fig. 2(a).
Bartley [6] modeled SOA by an undirected weighted graph, called the access graph, $G(V, E)$, where each node corresponds to a unique variable and an edge between two nodes exists with weight $w$ if the corresponding variables are adjacent to each other $w$ times in the access sequence. The access graph corresponding to the access sequence in Fig. 2(b-i) is shown in Fig. 2(b-ii). An optimal offset assignment is to find a path cover $P$ in $G(V, E)$ that minimizes the total number of address arithmetic instructions:

$$C = 1 + \sum_{e \in E,\; e \notin P} w(e) \qquad (1)$$

where $w(e)$ is the weight of edge $e$ and the term "1" indicates the cost of AR initialization. Finding an MWPC minimizes the cost in (1). The heavy lines in Fig. 2(b-ii) form a path cover. The path was obtained by applying the Liao et al. SOA solution [1] with Leupers and Marwedel's tie-breaking rule [8]. The corresponding memory layout is shown in Fig. 2(b-iii).

Footnote 2: The optimization is performed at the level of an intermediate representation (IR), such as three-address instructions with two source operands.
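The access-graph model can be sketched in a few lines. The code below is a minimal illustration, assuming a plain greedy path-cover heuristic in the spirit of Liao et al. (heaviest edges first, rejecting edges that would create a cycle or a node of degree three); it does not implement the tie-breaking rule of [8], and the function names are hypothetical.

```python
from collections import Counter


def access_graph(seq):
    """Edge weight = number of times two distinct variables are adjacent in seq."""
    weights = Counter()
    for a, b in zip(seq, seq[1:]):
        if a != b:
            weights[frozenset((a, b))] += 1
    return weights


def greedy_path_cover(weights):
    """Pick edges by descending weight while every component stays a simple path."""
    degree = Counter()
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x

    cover = set()
    # Ties are broken alphabetically here; this is an arbitrary choice.
    for edge, _ in sorted(weights.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
        a, b = sorted(edge)
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            cover.add(edge)
            degree[a] += 1
            degree[b] += 1
            parent[find(a)] = find(b)
    return cover


def soa_cost(weights, cover):
    """1 for AR initialization plus the weight of every uncovered edge."""
    return 1 + sum(w for e, w in weights.items() if e not in cover)


w = access_graph(list("abcab"))
print(soa_cost(w, greedy_path_cover(w)))
```

Each path in the cover corresponds to a run of variables placed contiguously in memory, so covered edges are served by autoincrement/autodecrement and only uncovered edges cost an instruction.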
On the other hand, the left side of Fig. 2(c) shows a rescheduled C code in which the highlighted statements indicate the change of schedule. The access sequence for the rescheduled code and the resulting memory layout are shown on the right side of Fig. 2(c). The total number of nonzero-cost addressing instructions is reduced to 11. Finally, Fig. 2(d) shows the further optimized access sequence and memory layout obtained when scheduling is considered together with input permutation for commutative operations. The number of nonzero-cost addressing instructions is then reduced to 8, which is a 38% improvement over the schedule in Fig. 2(a).
The example in Fig. 2 clearly reveals that the offset assignment problem is very sensitive to the access sequence of the variables. Since the existing compilers have designed the offset assignment as a postpass optimization (after code generation), they failed to globally optimize the access sequence. To overcome this limitation, we propose a new offset assignment algorithm tightly coupled with code scheduling to exploit the effect of scheduling on minimizing address arithmetic instructions more fully and effectively. We then extend the scheduling problem to include input permutation for commutative operations to enhance the quality of the assignment further.
III. A UNIFICATION OF CODE SCHEDULING AND ADDRESS ASSIGNMENT
Based on the motivating example in Section II, we propose an effective two-phase algorithm for address assignment integrated with code scheduling. First, we describe an overview of the proposed algorithm in Section III-A, followed by the detailed description of our two-phase algorithm: (Phase 1) a generation of initial schedule and address assignment in Section III-B, and (Phase 2) a stepwise refinement of the solution obtained in Phase 1 in Sections III-C, III-D, and III-E.
A. Overview
The input to our algorithm is the dependency graph of operations (i.e., instructions), which is an IR of a basic block. A basic block is a sequence of IR statements of maximum length that satisfies two conditions: 1) the control flow enters only at the first statement and 2) the control flow leaves only at the last statement. We apply our offset assignment algorithm to the most frequently used basic blocks first and the least frequently used blocks last. The most/least frequently used blocks are determined by the execution profile or simply taken to be the basic blocks inside/outside the loops. A node in the dependency graph represents an operation, and an arc from one node to another indicates that the former must precede the latter in execution. For a loop, the access at the end of the loop is considered to be adjacent to the access at the beginning of the loop. Note that our technique considers the possibility of rescheduling only within a basic block. However, it can handle access sequences that cross a loop boundary, optimizing the address code around the boundary by incorporating the consecutive variable accesses across the loop boundary into the edge weights of the access graph. (Our technique can be extended to procedure-wide optimization by picking a frequently executed straight-line path through the branches of control flow and extracting an access sequence from the path. When there are function call sites in the selected path, some of their actual parameters are included in the access sequence according to the given calling convention.)
The objective of our algorithm is to schedule the operations in the dependency graph and determine a complete sequence of variable accesses so that the size of the resulting address code is minimized. Fig. 3 summarizes the flow of our offset assignment algorithm combined with scheduling/operand permutation, called Sch-SOA/GOA. The proposed algorithm is performed in two phases: (Phase 1) generating an initial schedule and address assignment from the dependency graph of each basic block derived from the input C program and (Phase 2) iteratively improving the SOA/GOA solution obtained in Phase 1 in the outer while-loop. An operation scheduled at one execution step is called "reschedulable" to another execution step if scheduling the operation at that step does not violate the data-flow dependency constraints and the operation was not locked in a previous iteration of the inner while-loop. For every reschedulable operation, the cost in (1) that results from the reschedule is computed. The operation with the least cost is selected and rescheduled to the corresponding execution step. Once the operation is rescheduled, it is locked at that execution step for the rest of the inner while-loop; in other words, it is not rescheduled again in the inner while-loop. At the end of the inner while-loop, all operations are unlocked and become candidates for rescheduling again. The outer while-loop continues until it can no longer generate a schedule and SOA/GOA solution whose cost is less than the minimum found during the previous iterations.
The time complexity of Sch-SOA/GOA depends on the time to generate a path cover solution for a trial of rescheduling at the for-loop. Since there are total number of operations and the number of execution steps at which each operation can be rescheduled is no more than , the number of iterations executed in the for-loop is bounded by . Further, at each iteration Solve-SOA is applied at most once and its runtime is where is the number of edges in access graph. Thus, the total runtime spent by the for-loop is bounded by .
B. Initial Schedule and Address Assignment
It is important to produce a good initial schedule and address assignment because the quality of the final address assignment obtained by rescheduling is greatly affected by the initial solution. We focus our discussion on generating an initial schedule and its SOA solution, because the extension to a GOA solution is straightforward with minor modifications.
We construct an access sequence incrementally, one execution step at a time, from the first execution step to the last. An operation in the dependency graph is said to be a ready operation for a certain execution step if all of its predecessors (and thus, their variable accesses) have already been scheduled in the previous execution steps. At each iteration, our algorithm selects the most "promising" operation and schedules it at the current execution step, and appends its variable accesses to the end of the current partial access sequence. After the (partial) access sequence is augmented according to the selection of an operation from the ready operations, the algorithm repeats for the next execution step until all of the operations in the dependency graph have been scheduled for execution.
Among the ready operations, the algorithm selects the operation with the minimum additional cost in (1) for the augmented access graph. For example, suppose we have already generated a partial access sequence up through the current execution step, as depicted in Fig. 4(a). Now, we want to select an operation to be executed in the next execution step and, accordingly, append its variable accesses to the current partial access sequence, as indicated in Fig. 4(b). Suppose the access graph corresponding to the partial access sequence is the one shown in Fig. 4(c). The path with heavy lines indicates a solution of MWPC obtained by applying the heuristic Solve-SOA [8]; the resulting cost is given in Fig. 4(c). Let us now consider all possible schedules of ready operations at the next execution step. Fig. 4(d)-(f) shows the access graphs with augmented weights and the computation of their costs when operations 1-3 in Fig. 4(a), respectively, are selected for execution at the next execution step. Consequently, we choose operation 2 because it results in the minimum increase of the cost. Thus, we append the access sequence for operation 2 to the partial access sequence. The ready operations are then updated and the process repeats until there are no more operations in the list. Note that if there is more than one ready operation with equal cost, say op1 and op2, we break the tie by performing one further iteration step under the condition that each of op1 and op2 has been selected for execution. Fig. 5 summarizes the flow of our SOA algorithm combined with code scheduling, called Sch-SOA-init. It is a variant of list scheduling. For the dependency graph of each basic block derived from the input C program, we iteratively order the executions of operations, from the first step to the last.
At each iteration, we select, among the ready operations, the operation whose variable access transitions lead to a minimum increase of the number of nonzero-cost address instructions, i.e., the cost in (1). The total time of the algorithm is determined by the number of operations and by the runtime of Solve-SOA, which depends on the number of edges in the access graph: the outer for-loop is executed once per operation and, each time, at most all of the ready operations are tested using Solve-SOA. However, since the number of ready operations is very small in practice, the actual runtime is well below this worst-case bound.
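The greedy selection described for Sch-SOA-init can be sketched as follows. This is a minimal sketch under several assumptions: the cost is evaluated with a simple greedy path cover rather than Solve-SOA with the tie-breaking of [8], ties between ready operations are broken by name instead of the one-step lookahead described above, and the dependency graph is acyclic. The names `soa_cost`, `initial_schedule`, `accesses`, and `preds` are hypothetical.

```python
from collections import Counter


def soa_cost(seq):
    """Cost in the sense of (1): 1 + weight of edges left out of a greedy path cover."""
    w = Counter(frozenset(p) for p in zip(seq, seq[1:]) if p[0] != p[1])
    degree, comp, covered = Counter(), {}, 0

    def find(x):
        while comp.setdefault(x, x) != x:
            x = comp[x]
        return x

    for edge, wt in sorted(w.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
        a, b = sorted(edge)
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            comp[find(a)] = find(b)
            degree[a] += 1
            degree[b] += 1
            covered += wt
    return 1 + sum(w.values()) - covered


def initial_schedule(accesses, preds):
    """List-scheduling variant: at each step take the ready operation that
    yields the smallest cost for the extended access sequence.

    accesses: op name -> list of variables the op touches, in order
    preds:    op name -> set of ops that must be scheduled earlier
    """
    done, order, seq = set(), [], []
    while len(order) < len(accesses):
        ready = [op for op in accesses
                 if op not in done and preds.get(op, set()) <= done]
        best = min(ready, key=lambda op: (soa_cost(seq + accesses[op]), op))
        order.append(best)
        done.add(best)
        seq += accesses[best]
    return order, seq


ops = {"o1": ["a", "b"], "o2": ["c", "a"], "o3": ["b", "a"]}
deps = {"o2": {"o1"}, "o3": {"o1"}}
print(initial_schedule(ops, deps))
```

In this small example, both `o2` and `o3` become ready after `o1`; `o3` is scheduled first because appending its accesses does not add an uncovered edge to the access graph.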
C. Iterative Improvement Techniques
The core of our algorithm is to efficiently but accurately generate the address assignment solution (i.e., path cover) that minimizes the cost in (1) when an operation is rescheduled from one execution step to another. Clearly, the rescheduling changes the access sequence and, thus, changes the weights of some of the edges in the access graph. A path cover solution can be obtained simply by applying Solve-SOA/GOA [8]. However, this requires an excessive runtime when the problem size is large and a large number of reschedules are performed. Because each reschedule changes the execution order only locally, causing a small change of edge weights in the access graph, and a path cover solution with a minimum cost for the previous schedule is already known, it is natural to ask whether a path cover with a minimum cost for the current schedule can be found rapidly by exploiting the previous path cover solution. To do this, we develop a fast and comprehensive path cover computation procedure based on the previous solution. Fig. 6(a) shows an example of code rescheduling. According to the reschedule, as depicted in Fig. 6(b), the weights of edges (x1,x2), (x3,x4), and (x6,x7) are decremented by 1, and the weights of (x1,x4), (x3,x7), and (x2,x6) are incremented by 1. Thus, there are at most six edges in the access graph whose weights are changed by the reschedule; that is, the number of edge weights to update for a reschedule of an operation is bounded by a constant (six), independently of the size of the access graph. Further, the amount of change for each edge is in [-3, 3] (see footnote 4).

Footnote 4: "3" occurs when x1 = x2, x3 = x4, x6 = x7 in Fig. 6, whereas "-3" occurs when x1 = x4, x3 = x7, x2 = x6.
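The six-edge bound can be checked on the rescheduling pattern described for Fig. 6. The sketch below is a brute-force comparison of the access graphs before and after a move, written only to illustrate the claim; it is not the constant-time update itself, and the block contents are an assumed reading of the figure.

```python
from collections import Counter


def edge_weights(seq):
    """Access-graph edge weights of an access sequence."""
    return Counter(frozenset(p) for p in zip(seq, seq[1:]) if p[0] != p[1])


def reschedule_delta(blocks, src, dst):
    """Move the access block at index src so that it ends up at index dst,
    and return the edges whose weight changed together with the change.

    blocks: one list of variable accesses per scheduled operation (or group)
    """
    moved = blocks[:src] + blocks[src + 1:]
    moved.insert(dst, blocks[src])
    before = edge_weights([v for b in blocks for v in b])
    after = edge_weights([v for b in moved for v in b])
    return {e: after[e] - before[e]
            for e in before.keys() | after.keys()
            if after[e] != before[e]}


# x1 | x2 x3 | x4 x6 | x7  ->  x1 | x4 x6 | x2 x3 | x7
blocks = [["x1"], ["x2", "x3"], ["x4", "x6"], ["x7"]]
for edge, change in sorted(reschedule_delta(blocks, 1, 2).items(),
                           key=lambda kv: sorted(kv[0])):
    print(sorted(edge), change)
```

Only the three junctions that are broken and the three that are created appear in the result, regardless of how long the blocks or the surrounding sequence are.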
D. Schedule-SOA
Let $P$ denote the path cover (i.e., the Solve-SOA solution) in access graph $G(V, E)$. Suppose some of the edge weights in $G(V, E)$ are updated according to a reschedule of an operation. Let $G'(V, E')$ denote the updated access graph. The problem is to find a new path cover $P'$ for $G'(V, E')$ that minimizes the cost in (1). We determine $P'$ according to the following rules.
Case 1: $C(G', P) < C(G, P)$. Here, $C(G, P)$ is the cost in (1) for the access graph $G$ before the reschedule using path cover $P$, and $C(G', P)$ is the cost for the access graph $G'$ after the reschedule using the same path cover $P$. Case 1 is the case in which the cost obtained by using $P$ for the updated graph $G'$ is smaller than that obtained by using $P$ for the old graph $G$. In this situation, we simply use $P$ as the new path cover for $G'$, rather than computing a new path cover for $G'$. Since the number of edges whose weights are updated is at most six, computing the cost for $G'$ using $P$ can be done in constant time. For example, Fig. 7(b) shows the access graph updated by the reschedule in Fig. 7(a). Since the cost for the new access graph is reduced even when $P$ is used, we use $P$ as the path cover $P'$ for the new access graph.
The described procedure for determining a path cover for ( , ) in Case 1 has an optimal property for a special case. Property 1: Let be an optimal path cover for ( , ), and and denote the weights of edge in ( , ) and ( , ), respectively. We define to be if , and 0, otherwise. Then, is an optimal path cover for ( , ) if the following conditions are satisfied:
. Proof: Let be a path cover for ( , ) which satisfies . We define the edge sets
We first show the inequality (2) From the assumption that is an optimal path cover for ( , )
Thus (3) According to condition 2 and the fact that the total sum of the decreases of edge weights (from to ) is at most three, it is true that , from which (4) Thus, from (3) and (4) and according to condition 1, we have (5) To show , by the definitions of , , and , and , and using the inequality in (5), we derive
Case 2: $C(G', P) \ge C(G, P)$, where $C(G, P)$ denotes the cost in (1) of access graph $G$ under path cover $P$ and $G'$ is the access graph after the reschedule. This is the case where using $P$ for $G'$ does not reduce the cost. Thus, we find another path cover by applying Solve-SOA. If we maintain an edge list for $G'$ sorted by edge weight, Solve-SOA does not need to re-sort the edges, and its runtime is determined by the number of edges in $G'$. For example, Fig. 8(b) shows $G'$ and the new path cover $P'$ corresponding to the reschedule in Fig. 8(a). Since the cost for $G'$ using $P$ is not smaller than that for $G$ using $P$, we compute $P'$ for $G'$ by applying Solve-SOA, which reduces the cost for $G'$ from 6 to 5.
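The two cases can be summarized in a short sketch. It assumes that edge weights are kept in a dictionary, that a path cover is a set of edges, and that a full solver is supplied by the caller as `solve_soa`; all of these names are hypothetical stand-ins rather than the authors' implementation.

```python
def cost(weights, cover):
    """Cost in the sense of (1): 1 (AR initialization) + weight of uncovered edges."""
    return 1 + sum(w for e, w in weights.items() if e not in cover)


def update_cover(weights, cover, delta, solve_soa):
    """Apply the weight changes of one reschedule and choose a path cover.

    weights:   edge -> weight before the reschedule
    cover:     set of edges in the previous path cover
    delta:     edge -> weight change (at most six entries per reschedule)
    solve_soa: caller-supplied function, weights -> path cover (assumed)
    Returns (new weights, chosen cover, cost of the chosen cover).
    """
    old_cost = cost(weights, cover)
    # Only changed edges outside the cover affect the cost of reusing it,
    # so this evaluation touches len(delta) edges, not the whole graph.
    reused_cost = old_cost + sum(d for e, d in delta.items() if e not in cover)
    new_weights = dict(weights)  # copied for clarity; could be updated in place
    for e, d in delta.items():
        new_weights[e] = new_weights.get(e, 0) + d
    if reused_cost < old_cost:
        return new_weights, cover, reused_cost       # Case 1: keep the old cover
    new_cover = solve_soa(new_weights)               # Case 2: recompute
    return new_weights, new_cover, cost(new_weights, new_cover)
```

The expensive call to the solver is made only when reusing the previous cover fails to lower the cost, which is what keeps a large number of trial reschedules affordable.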
E. Schedule-GOA
GOA is the generalization of SOA to a number $k$ of ARs. A reasonable approach proposed by Liao et al. [1] and Leupers and Marwedel [8] is to partition the variables into $k$ disjoint subsets $V_1, \ldots, V_k$, for each of which a distinct AR is used. The objective is to minimize

$$\sum_{i=1}^{k} C_i \qquad (6)$$

where $C_i$ is the cost in (1) for the access subgraph containing only the variables in $V_i$. The problem is then reduced to $k$ SOA problems. Solve-GOA [8] works as follows.
Step 1: Sort the nonzero-weight edges of the access graph in descending order of weight.
Step 2: Create $k$ subsets of size two by selecting $k$ disjoint edges of the highest weight and placing each node pair in its own subset.
Step 3: Assign the remaining variables, in the order of the sorted list, one at a time to the subset $V_i$ for which the increase of $C_i$ caused by adding the variable is minimal.
At each trial of rescheduling, we selectively apply Solve-GOA to the updated access graph. Let $G$ be the access graph for the previous schedule, and let $G_1, \ldots, G_k$ be the access subgraphs corresponding to the partitioned vertex sets $V_1, \ldots, V_k$, respectively. Also, let $P_1, \ldots, P_k$ be the path covers for $G_1, \ldots, G_k$, respectively. From the change of weights in $G$ caused by the current reschedule, we calculate $C_i$ for each $G_i$ using path cover $P_i$. According to the change of the value of $C_i$, we partition the variables into two sets: variables whose partition is retained and variables to be repartitioned. If the value of $C_i$ is reduced, all of the variables in the corresponding $V_i$ are assigned to the retained set. Otherwise, they are assigned to the set to be repartitioned. Then, for each access subgraph whose variables are all in the retained set, we use the corresponding previous path cover as the path cover for the current reschedule. For the variables to be repartitioned, however, we apply Solve-GOA to the access subgraph induced by them and generate new access subgraphs and their path covers. Fig. 9 shows an example of our selective application of Solve-GOA. Fig. 9(a) shows an access graph that was partitioned into three subgraphs for the previous schedule. After the changes of weights in the access graph due to a reschedule, the path cover for the subgraph whose cost is reduced does not change, whereas the partitioning and path covers for the remaining variables are updated as shown in Fig. 9(b). The reschedule thereby reduces the total cost in (6).
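The selective step can be sketched as a simple split of the existing partitions. The sketch assumes the per-partition costs before and after the reschedule have already been evaluated with the previous path covers; the function name and argument layout are hypothetical.

```python
def select_for_repartition(partitions, old_costs, new_costs):
    """Decide which AR partitions are kept and which variables are repartitioned.

    partitions: list of sets of variables, one set per AR
    old_costs:  cost of each partition before the reschedule
    new_costs:  cost of each partition after the reschedule, evaluated with
                the previous path cover of that partition
    Returns (indices of partitions kept unchanged, variables to repartition).
    """
    keep, redo = [], set()
    for i, variables in enumerate(partitions):
        if new_costs[i] < old_costs[i]:
            keep.append(i)        # cost went down: reuse partition and cover
        else:
            redo |= variables     # otherwise hand the variables back to Solve-GOA
    return keep, redo


parts = [{"a", "b"}, {"c", "d"}, {"e", "f"}]
keep, redo = select_for_repartition(parts, [3, 2, 4], [2, 2, 5])
print(keep, sorted(redo))
```

Solve-GOA then only has to be rerun on the returned variables, using the ARs that are not tied to a retained partition.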
F. Exploiting Operand-Input Commutativity
For a commutative operation such as addition, the two input operands can be accessed in either order. For a commutative operation to be rescheduled, our algorithm generates the access graph for each alternative variable access sequence and computes the new path cover from the access graph according to the procedure described in Sch-SOA/GOA. Then, we choose the access sequence that has the smaller cost in (1).
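A minimal sketch of the operand-order choice follows. It assumes a three-address operation whose accesses are the two sources followed by the destination, and it evaluates the cost with a simple greedy path cover rather than the incremental procedure of Sch-SOA/GOA; the function names are hypothetical.

```python
from collections import Counter


def soa_cost(seq):
    """Cost in the sense of (1): 1 + weight of edges left out of a greedy path cover."""
    w = Counter(frozenset(p) for p in zip(seq, seq[1:]) if p[0] != p[1])
    degree, comp, covered = Counter(), {}, 0

    def find(x):
        while comp.setdefault(x, x) != x:
            x = comp[x]
        return x

    for edge, wt in sorted(w.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
        a, b = sorted(edge)
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            comp[find(a)] = find(b)
            degree[a] += 1
            degree[b] += 1
            covered += wt
    return 1 + sum(w.values()) - covered


def best_operand_order(prefix, src1, src2, dst, suffix=()):
    """Try both source orders of a commutative operation and keep the cheaper one.

    prefix / suffix: accesses scheduled before / after the operation
    """
    candidates = [(src1, src2, dst), (src2, src1, dst)]
    return min(candidates,
               key=lambda c: soa_cost(list(prefix) + list(c) + list(suffix)))


# d = b + a, scheduled right after accesses to x and a
print(best_operand_order(["x", "a"], "b", "a", "d"))
```

Here the swapped order is preferred because accessing `a` immediately after the preceding access to `a` avoids an extra edge in the access graph.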
For noncommutative operations, we exclude the possibility of reordering the operand accesses. As an example target, consider TI's TMS320C2x DSP processor. It has an ALU in which one operand always comes from the accumulator and the other comes from data memory, an immediate, or a register. The product register is the destination register of multiplication and is not a temporary register. Hence, for subtraction (especially unsigned subtraction), for example, it is not possible to reorder the accesses of the two input operands because of the lack of temporary registers.
IV. EXPERIMENTAL RESULTS
The proposed address code generation algorithm was implemented in C++ and executed on an Intel Pentium 3 workstation. We tested a set of benchmark programs and randomly generated designs to demonstrate the effectiveness of our algorithm integrated with scheduling. In the experiments, we considered target DSP architectures that have not only a single AR (i.e., SOA) with indirect autoincrement/autodecrement addressing modes but also multiple ARs (i.e., GOA). Table I compares the address code sizes produced by OFU (order of first use) offset assignment, Solve-SOA with the tie-breaking in [8], and the proposed Sch-SOA for the randomly generated C programs. Each test case is characterized by the number of variables and the access sequence length. For each such pair, we tested 10 designs (i.e., 10 runs) and entered the average of the resulting address code sizes in the corresponding entry of the table. The last column shows the runtime of Sch-SOA. In summary, the proposed address assignment technique considering scheduling effects reduces the address code size by up to 74% and 27% over OFU and Solve-SOA, respectively. Note that the runtime for the test case with 100 variables in Table I is relatively high. This is because most of the runtime spent by Sch-SOA is in finding path covers, which is very sensitive to the length of the access sequence. Table II shows the results produced by Solve-GOA [8] and the proposed Sch-GOA for the randomly generated C programs, for different numbers of ARs. For each pair of variable count and access sequence length with multiple ARs, we tested 10 designs and entered the average of the resulting address code sizes in the corresponding entry of the table. The comparison of the results indicates that Sch-GOA performs well compared with Solve-GOA, reducing the address code size by up to 36% relative to Solve-GOA. Moreover, the reductions are consistent. Note that Solve-GOA is much faster than Solve-SOA for large examples. This is mainly because the time-consuming process, finding path covers, is applied independently to the partitioned subgraphs of the access graph, one for each AR, in Solve-GOA, whereas in Solve-SOA it is applied to the entire access graph.

TABLE I: SOA RESULTS FOR RANDOMLY GENERATED CODES
TABLE II: GOA RESULTS FOR RANDOMLY GENERATED CODES

Table III shows results for benchmark C programs when a single AR is used (i.e., SOA). For the experiments with the benchmark programs, we generated code targeting TI's TMS320C25. The table compares the results of Solve-SOA [8], Hybrid [13], and our algorithm Sch-SOA in terms of the total number of (whole) instructions, including address instructions. In the first column, the names of the benchmarks and their corresponding numbers of variables and access sequence lengths are given. [In Tables III-VI, these pairs are the same.] FIR and BIQUAD (biquad_one_section) are taken from the DSPstone benchmark suite [14]; for BIQUAD, the variables specified as static were regarded as automatic variables. The architecture-specific pointer operations with increment/decrement were assumed to be achieved with a single variable access operation in code, such as an LAR/LACL instruction pair that loads an AR with the pointer and then loads the accumulator through that AR. COMP (complex multiplier) and ELLIP (elliptical wave filter) are taken from high-level synthesis benchmark designs [15].
GAULEG (Gauss-Legendre weights and abscissas), GAUHER (Gauss-Hermite weights and abscissas), and GAUJAC (Gauss-Jacobi weights and abscissas) are taken from [16]. (GAULEG, GAUHER, and GAUJAC are the main routines for Gaussian quadrature, one of the widely used techniques in numerical integration.) ADPCM is taken from "adpcm.c" in MediaBench [5]. The numbers in parentheses in the fifth column of the table represent the reductions in the number of code instructions obtained by applying "rescheduling" and "input-reordering" in Sch-SOA, respectively, relative to Solve-SOA.
Tables IV-VI show the comparisons of the total instruction counts produced by Solve-GOA [8] and Sch-GOA as the number of ARs changes. As the number of ARs increases, the improvement by Sch-GOA becomes smaller. This is mainly because the number of ARs is then large enough to fully utilize the autoincrement/autodecrement addressing modes without exploiting scheduling. Note that, in some cases, the total number of instructions required with more ARs is larger than the number required with fewer ARs. This is because of the increased number of instructions for AR initializations in these cases. Table VII shows the GOA results for randomly generated codes produced by our algorithm Sch-GOA and by Naive-it, which uses a full cost computation for every reschedule. (Precisely, Naive-it did not apply Cases 1 and 2 in Section III-D to the cost evaluations.) As shown in the fifth column of Table VII, Naive-it uses up to 24.7% fewer address instructions than Sch-GOA; however, in six out of nine test cases, the differences are less than 10% and, further, for relatively large design examples, Sch-GOA was able to produce comparable solutions. On the other hand, in terms of runtime, Sch-GOA is over ten times faster than Naive-it, as indicated in the last column of the table.
V. CONCLUSION
In this paper, we proposed a new technique for optimizing address instructions for DSP code generation by utilizing code scheduling. Contrary to all prior work, in which the code scheduling and address assignment phases are performed sequentially without any interaction between them, we integrated code scheduling and address assignment so that the scheduling effect can be exploited more fully and effectively in minimizing addressing instructions. Specifically, we solve the SOA/GOA (offset assignment) problem coupled with scheduling in two steps: 1) generating an initial schedule and address assignment and 2) iteratively improving the SOA/GOA solution. To support a large number of iterations, we devised a set of techniques for fast evaluation of the SOA/GOA solution for a reschedule based on the solution for the previous schedule. From experiments using benchmark DSP programs, we confirmed that utilizing code scheduling in address code generation reduces the address code size by 13-33% (and the whole code by 5.8% on average) over Solve-SOA/GOA [8] and by 43-71% over a naive storage assignment algorithm. Because it is based on an iterative improvement technique, the proposed algorithm may be slow for large code. However, by limiting the number of iterations or reschedulable operations, it can achieve a moderate speed. Further, note that since the proposed approach is oriented toward reducing addressing code size, it may not be suitable for superscalar or VLIW DSPs or for DSP codes with few addressing instructions.
In order to utilize modify registers (MRs) in our algorithm, the cost of MR modification should be taken into account in addition to the AR modification cost in (1). The simplest way is to consider the effects of MRs for every reschedule. That is, for each trial of rescheduling, we obtain the path cover and memory layout, from which we generate the sequence of modify values (those that cannot be covered by autoincrement/autodecrement and will not be covered by immediate AR modification either). Then, by using a page replacement algorithm [18], which is proven to be optimal in exploiting MRs [8], we calculate the MR modification cost, i.e., the number of immediate loads of an MR. Thus, by using the sum of the cost in (1) and the MR modification cost as the new cost, our algorithm can incorporate the effect of MRs. However, the page replacement algorithm takes quadratic time, or linear time with a very careful implementation [19]. To reduce the runtime, for every trial of rescheduling we can simply check whether the number of MRs is equal to or larger than the number of distinct modify values. If it is, the MR modification cost becomes exactly the number of distinct modify values. (Note that the number of distinct modify values for the reschedule can be calculated in constant time if the modify values for the previous schedule were known and the same path cover is used.)
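The MR modification cost can be sketched as a count of register loads under an optimal replacement policy. The sketch below assumes that the page replacement algorithm referred to is the farthest-next-use rule, treats each MR as a page frame and each modify value as a page, and uses a straightforward quadratic implementation; the function name is hypothetical.

```python
def mr_load_count(modify_values, num_mrs):
    """Number of immediate MR loads needed to serve a sequence of modify values.

    modify_values: list of address increments not covered by autoincrement
    num_mrs:       number of modify registers available
    Assumes a farthest-next-use replacement rule (an assumption of this sketch).
    """
    distinct = set(modify_values)
    # Shortcut from the text: enough MRs to hold every distinct value.
    if num_mrs >= len(distinct):
        return len(distinct)
    if num_mrs < 1:
        raise ValueError("at least one MR is required")
    loaded, loads = set(), 0
    for i, value in enumerate(modify_values):
        if value in loaded:
            continue              # value already sits in some MR
        loads += 1
        if len(loaded) == num_mrs:
            def next_use(v):
                try:
                    return modify_values.index(v, i + 1)
                except ValueError:
                    return float("inf")   # never used again
            loaded.remove(max(loaded, key=next_use))
        loaded.add(value)
    return loads


print(mr_load_count([2, 3, 2, 4, 3, 2], 2))
print(mr_load_count([2, 3, 2, 4, 3, 2], 3))
```

With three MRs, the shortcut applies and the count equals the number of distinct modify values; with two MRs, one value has to be reloaded.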
Finally, the approach we proposed is a local one. Consequently, for programs with many small basic blocks, the solutions produced by the proposed approach may not be satisfactory. To overcome this problem, further investigation is required to design a global and efficient technique, such as the one in [17] for the array reference allocation problem.

Her research interests include design automation of embedded systems and retargetable compiler/simulator for embedded DSPs and ASIPs.
Taewhan Kim (M'93) received the B.S. degree in computer science and statistics and the M.S. degree in computer science from Seoul National University, Seoul, Korea, in 1985 and 1987, respectively, and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign, in 1993.
From 1993 to 1998, he was with Lattice Semiconductor Corporation and Synopsys Inc., where he was involved in the research and development of CPLD mapping, multichip partitioning, and logic and behavioral synthesis. Currently, he is an Associate Professor in Electrical Engineering and Computer Science at the Korea Advanced Institute of Science and Technology, Daejon. His research interests are in the computer-aided design for VLSI systems, embedded software, and combinatorial optimizations.
