Abstract-This paper presents a new approach to solving the DSP address code generation problem. A minimum cost circulation approach is used to efficiently generate high-performance addressing code in polynomial time. Results show that addressing code size improvements of up to 6× are obtained, accounting for up to 1.6× improvement in the code size and performance of compiler-generated DSP code. This research is important for industry since this value-added technique can improve code size, energy dissipation, and performance without increasing cost.
I. INTRODUCTION
As digital signal processing (DSP) applications rapidly grow more complex, some designers are moving from full custom digital circuitry to programmable processors or in-house cores to obtain lower risk solutions. The DSP core is a DSP processor that can be reused and combined with program/data memory, dedicated logic, and application-specific integrated circuits (ASIC's), and incorporated onto a large silicon chip, providing a cost-efficient and flexible solution for many typical embedded applications requiring low power and high reliability. These systems demand small code size and high performance. Due to increasing complexity, high-level compilation is a necessity. However, the biggest drawback to the use of either DSP processors or DSP cores is code generation.
Conventional code generation techniques, and even compilers specifically designed for commercial DSP processors, produce very inefficient code [2], [4], [24]. There are many more limitations placed upon code generation for the DSP processor than for the general-purpose processor. The difficulty arises from nonhomogeneous register sets, a small number of very specialized registers, very specialized functional units, restricted connectivity, limited addressing, and highly irregular datapaths [1]. For example, specialized functional units such as address calculation units are typically found in these architectures. Instructions take operands and store results of computations in well-defined registers with limited connectivity.
Manuscript received November 12, 1998; revised January 19, 1999. This work was supported by a grant from NSERC. This paper was recommended by Associate Editor F. Catthoor.
The author is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ont. N2L 3G1 Canada.
Publisher Item Identifier S 0278-0070(99)03970-6.
Limited addressing modes and the use of address registers are also typical. For example many DSP processors assume autoincrement/autodecrement addressing modes for sequential accessing of data variables from memory will be used heavily.
In particular, there are a number of address registers which point to addresses in memory. The addresses in the address registers can be incremented or decremented at negligible cost. However, adding or subtracting offsets (not equal to zero or one) to these address registers often requires a specific instruction and, therefore, has a performance, code size, and energy dissipation cost associated with it. Autoincrement/autodecrement addressing subsumes address arithmetic instructions and requires shorter instructions [2] than other forms of addressing. Unfortunately, it is assumed that an efficient data memory layout has been performed to support this type of addressing. Code generation for DSP processors must meet tight timing constraints, typically dictated by DSP throughput, as well as tight resource constraints. Given that these DSP processors must meet these requirements using very small code space (all in on-chip ROM), the code generation problem is a very difficult one [2]. Typically, DSP processors are difficult to use, requiring long product development times (and a large number of assembler programmers) even though the program may occupy less than 1 K of program ROM. The need to decrease time to market, development costs, and maintenance costs demands the use of high-level language compilation. All of these factors imply several challenges in writing efficient code generators for such DSP processors. This is even more difficult for in-house core instruction set architectures, which require retargetable compilation.
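The cost model described above can be sketched in a few lines. The following is only an illustration under stated assumptions: a single address register, a step of -1, 0, or +1 treated as free, and one extra instruction for any other step as well as for the initial load. The function name and the example layout are hypothetical.

```python
def addressing_cost(layout, accesses):
    """Count explicit addressing instructions for ONE address register.

    layout:   dict mapping variable name -> memory address (assumed given)
    accesses: list of variable names in access order
    Assumption: steps of -1, 0, +1 are free (autoincrement/autodecrement);
    any other step costs one address arithmetic instruction.
    """
    if not accesses:
        return 0
    cost = 1  # one instruction to load the register with the first address
    for prev, cur in zip(accesses, accesses[1:]):
        if abs(layout[cur] - layout[prev]) > 1:
            cost += 1  # offset too large for autoincrement/autodecrement
    return cost


layout = {"a": 0, "b": 1, "c": 2, "d": 3}
print(addressing_cost(layout, ["a", "b", "c", "d"]))  # sequential accesses
print(addressing_cost(layout, ["a", "d", "b", "d"]))  # scattered accesses
```

With this layout the sequential order needs only the initial load, while the scattered order pays for every step, which is why the access order and the layout together determine the addressing overhead.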
II. PROBLEM DESCRIPTION AND RELATED WORK
Problem 1, given below, is an important part of the code optimization problem and will be studied in this paper. For simplicity, let us assume that an algorithm to implement the application has already been assigned based upon the accuracy required, low-energy implementation, etc. The algorithm is composed of basic blocks, which are given as a partially ordered list of code operations. For the problem definition below, we assume that there is one target DSP processor or core defined with an instruction set architecture. The target processor supports indirect addressing; in particular, there are one or more address registers which point to an address in memory and whose value can be incremented or decremented at negligible cost. Data layout is performed by using the approach described in [2] for single-offset assignment layout or by a compiler or user.
Problem 1: Assume we are given initial code generated for the target processor and a memory layout (or sequence) of the data variables. Given the number of address registers in the target processor and the costs of loading an address register with a new value and of adding/subtracting a value to/from the address registers, the problem is to generate addressing code such that performance is maximized and code size is minimized.
0278-0070/99$10.00 © 1999 IEEE
The performance and code size costs are exact measures of how many extra cycles or instructions will be required. By minimizing the number of instructions generated, one can often also minimize energy dissipation, as reported in [22]. In our case, the instructions minimized consist of address computations and not data compute operations. Furthermore, it can be assumed that the majority of the energy dissipated for these types of instructions is in the instruction fetch, decoding, and control line distribution, not in the address calculation units themselves. Thus, one can make the assumption that the energy is minimized as a result of minimizing the address-related instructions. Extensions to this problem include supporting addressing with fixed offsets whose value is stored in an index register, in addition to the autoincrement/autodecrement addressing. Loops, conditional code, and data structures should also be supported.
Although many researchers have studied code generation for DSP processors or application-specific instruction-set processors (ASIP's) [1], fewer have studied address code generation. Early research examined the index register allocation problem [13], where an instruction was specified with a memory address and an index register (where the address computation involved adding the memory address to the value in the index register). The number of index registers used in a program was greater than the number of index registers supported in the architecture, so the problem was to determine at each time which indexes should be present in the registers so as to minimize the total number of loads and stores. Extensions of this algorithm to loops assumed that register-to-register moves incurred zero cost. Researchers in [2] defined the single-offset assignment (SOA) and general-offset assignment (GOA) problems as determining the data layout in memory so as to reduce the code required to implement autoincrement/autodecrement addressing with one address register (SOA) or with several address registers (GOA). They defined an access graph which modeled the sequential data variable accesses in the initial code. The SOA problem was solved directly on the graph using a modification of Kruskal's algorithm [7]. The GOA problem was solved through a recursive algorithm that selects subsets of variables from the access sequence, solves the SOA problem on each subset, and calculates its cost [2]. This research modeled the problem as the maximum weighted path covering (MWPC) problem, and researchers recently extended it heuristically [11] to support index registers (or incrementing over a range of offsets bounded by an integer value). In [9], researchers defined a tie-breaking technique that can improve the addressing technique in [2] and proposed an extension to support dynamic index (or modify) register allocation in a post-pass phase.
A simulated annealing approach to memory layout and address code generation was presented in [17] and also included extensions to loops. Address register allocation for array assignment in loops has also been studied. In this problem the memory layout is fixed by the array index. Researchers in [12] formulated the array assignment problem in loops as a minimum disjoint path covering (MDPC) technique, but did not support array access optimization across loop iteration boundaries. However, the approach did not support constraints on the maximum number of address registers available in the architecture and offsets were not supported. In [18] , researchers developed a branch and bound algorithm for handling address assignment for arrays in loops and although it is an exponential time algorithm, upper and lower bounds are used to prune the search space and the solution is valid across loop iteration boundaries. Researchers in [4] introduced a transformation of C code using pointer-based code instead of array-based code to make better use of address calculation units on DSP's. Improvements in code size and performance were obtained. Address generation has been shown to account for a large percent of energy dissipation in [3] .
In this paper, a new approach is presented to solve Problem 1, DSP address code generation. Unlike previous research, we study the problem of generating addressing code, given a fixed data layout in memory, so as to minimize code size or maximize performance. This is a value-added approach that may be used to improve compiler-generated DSP code by returning the address-optimized code. It can also be used in combination with a data memory layout technique such as [2], [11], or [9] to further improve code generation. This optimization approach supports cases where the data memory layout may be predefined at an interface with another processor or external input-output in the system. A minimum cost circulation technique is used to obtain solutions in polynomial time. Section III will outline the assumptions and terminology to be used in the balance of the paper.
III. ASSUMPTIONS AND NOTATION
The following terminology will be used in this paper. The set of address registers is defined, and the cardinality of this set is the number of address registers in the processor. A data access is denoted by the data variable together with the time at which it is accessed; this access may be a read or a write. The value associated with a data variable is its memory address. Additional terminology used in the optimization discussions is as follows.
The graph is composed of a set of vertices and a set of arcs. The flow variable gives the flow in the arc from one vertex to another, where both vertices belong to the vertex set and the arc belongs to the arc set. Each arc has a capacity and a cost per unit of flow. Section IV will provide an introduction to minimum cost circulation. The following sections will show how DSP addressing code generation is mapped into a circulation problem and provide examples to illustrate the technique. In this paper, the term optimal will refer to locally optimal unless otherwise stated as globally optimal.
IV. INTRODUCTION TO CIRCULATIONS
To introduce the circulation problem, we will first discuss network flow and its extension to circulations. The network flow problem is defined on a directed acyclic graph $G = (V, A)$, composed of vertices $V$ and arcs $A$, where each arc has a capacity associated with it. The capacity is a real-valued quantity, where arc $(i, j)$ has an associated capacity $u_{ij}$. Let the variable $x_{ij}$ represent the flow in arc $(i, j)$. We will call this the flow variable. For any vertex in the graph other than the two special vertices introduced below, the flow into the vertex is equal to the flow out of the vertex (known as the conservation of flow [7]). The flow along any arc (in the same direction as the arc) must be nonnegative (and may also be required to be greater than or equal to a positive lower bound $l_{ij}$ placed on that arc), but less than or equal to the capacity of that arc. There are two special vertices in this graph, called vertex $s$ and vertex $t$. The flow out of vertex $s$ is equal to the flow into vertex $t$. Arcs incident to vertex $s$ only leave vertex $s$, and arcs incident to vertex $t$ are only directed into vertex $t$. The maximum flow problem [7] is to find the maximum amount of flow from vertex $s$ to vertex $t$ through the network graph, such that the conservation of flow is maintained.
To study the minimum cost circulation problem [7], we add an arc from vertex $t$ to vertex $s$ (that circulates the flow in the graph). Each arc has a cost, which is also a real-valued quantity, where arc $(i, j)$ has associated cost $c_{ij}$. The problem is to find the flow through this graph such that the sum over all arcs of the product of the flow in each arc and the cost of each arc is minimum. The following equations represent the formulation of the minimum cost circulation problem as a mathematical programming problem:
$$\text{Minimize} \quad \sum_{(i,j) \in A} c_{ij} x_{ij}$$
$$\text{subject to} \quad \sum_{j : (j,i) \in A} x_{ji} - \sum_{j : (i,j) \in A} x_{ij} = 0 \quad \forall i \in V, \qquad l_{ij} \le x_{ij} \le u_{ij} \quad \forall (i,j) \in A.$$
In the problem above, the capacities $u_{ij}$, lower bounds $l_{ij}$, and costs $c_{ij}$ are given. The problem is to solve for values of the variables $x_{ij}$ that represent the flows in the network graph $G$ such that the objective function is minimum. As long as the capacities $u_{ij}$ and the lower bounds on the flow $l_{ij}$ are integer, we are guaranteed to obtain integer flows in the solution of this problem [7]. Globally optimal solutions to this problem can be obtained in polynomial time using linear programming or, more commonly, by using faster and more efficient network algorithms [7].
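As a small illustration of the definitions in this section, the sketch below checks that a given flow is a feasible circulation (bounds respected on every arc, conservation at every vertex) and evaluates its cost. It does not solve the optimization problem; the data structures, the vertex names, and the example values are assumptions made for illustration only.

```python
def circulation_cost(arcs, flow):
    """Return the total cost of `flow`, or raise ValueError if infeasible.

    arcs: dict mapping (u, v) -> (lower_bound, capacity, cost_per_unit)
    flow: dict mapping (u, v) -> units of flow (missing arcs carry zero)
    """
    balance = {}  # net flow into each vertex
    total = 0
    for (u, v), (lower, capacity, cost) in arcs.items():
        x = flow.get((u, v), 0)
        if not lower <= x <= capacity:
            raise ValueError(f"flow on arc {(u, v)} violates its bounds")
        balance[u] = balance.get(u, 0) - x
        balance[v] = balance.get(v, 0) + x
        total += cost * x
    if any(balance.values()):
        raise ValueError("conservation of flow is violated")
    return total


# Hypothetical three-vertex example with a returning arc from t to s.
arcs = {("s", "a"): (0, 1, 1), ("a", "t"): (0, 1, 0), ("t", "s"): (0, 2, 0)}
flow = {("s", "a"): 1, ("a", "t"): 1, ("t", "s"): 1}
print(circulation_cost(arcs, flow))
```

Because the returning arc closes the loop, conservation can be checked uniformly at every vertex, including the two special ones.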
V. METHODOLOGY AND MODELING
This section will briefly describe the methodology for the DSP address code generation problem and how the minimum cost circulation formulation is used to solve Problem 1. First, an algorithm or task flow graph is selected for the application and transformations are performed based upon cost, performance, and energy dissipation requirements. Initial code generation is performed, and a memory layout is either specified by the compiler or generated in a post-process similar to the technique in [2] and [15]. In the latter case, storage allocation is performed after the access graph is created to further reduce the addressing complexity. Complex applications with data structures, nested loops, conditional code, branches, etc. are supported. The minimum cost circulation algorithm is performed globally over a special trace of the entire algorithm. A post-pass stage is used to adjust or generate addressing code in remaining conditional code blocks and at the end of loops and branch locations. Details will be given in Section V-C. If basic block frequencies are available within the application, costs can be changed from code size metrics to performance or estimated performance metrics (for example, by multiplying the performance cost of the instruction by the number of times it is executed, or is expected to be executed, in that block). If index registers or automodify ranges are supported in the DSP architecture, this procedure can be repeated with different offsets or ranges to further optimize the address code. In each case, the minimum cost circulation technique is used, providing polynomial time solutions. Sections V-A through V-D will illustrate how the minimum cost circulation technique solves the address generation problem and its extensions to loops, data structures, and branches. Finally, application to more recently introduced processors and different families of DSP processors will be provided.
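The switch from code size costs to performance costs mentioned above amounts to scaling each arc cost by how often its basic block executes. A minimal sketch follows; the function name, the parameter names, and the numbers are hypothetical.

```python
def arc_cost(instr_words, instr_cycles, exec_count=None):
    """Cost of the addressing instruction implied by one arc.

    instr_words:  code size of the extra instruction (assumed known)
    instr_cycles: cycles taken by the extra instruction (assumed known)
    exec_count:   measured or estimated executions of the enclosing basic
                  block; None selects the code size metric instead.
    """
    if exec_count is None:
        return instr_words              # code size metric
    return instr_cycles * exec_count    # (estimated) performance metric


print(arc_cost(1, 1))        # code size: one instruction word
print(arc_cost(1, 1, 100))   # performance: block executed 100 times
```

Only the cost values change between the two metrics; the graph and the solution technique stay the same.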
A. Mapping to Min-Cost Circulation
To model Problem 1 as a circulation problem, we first have to develop a network flow graph. The memory layout (used as the horizontal axis) along with the variable access sequence (the vertical axis) is used to form the graph. Each vertex in the graph corresponds to a variable access in the initial code sequence, positioned at the time that the variable is accessed from memory. A source vertex and a sink vertex are added to the graph, representing time zero and the final time, respectively. Arcs from the source vertex to all vertices in the graph (except the sink vertex) are added. Arcs from each data access vertex to all other vertices accessed at a later time (including the sink vertex) are also added. Finally, one more arc from the sink vertex to the source vertex is added. Next, capacities and costs are assigned to each arc in the network flow graph and the minimum cost flow problem is solved. The capacity of all arcs is one except the arc from the sink vertex to the source vertex, which has capacity equal to the maximum number of address registers in the target processor. The costs per arc can now be set up to reflect the actual costs of code size or performance. The path of each unit of flow identifies a partition of accessed variables that will be assigned to one address register. As each unit of flow passes from the source vertex to its first data access vertex, it accumulates a cost representing the instruction for loading an initial value into the address register. As each unit of flow passes through the vertices between the source vertex and the sink vertex, it accumulates any costs associated with adding or subtracting offsets. For example, if an offset is required on an arc, a cost of one is incurred, representing a separate add or subtract instruction (as in the TMS320C2x [6] DSP processor). To illustrate the graph formulation, consider Fig. 1, where the data access variables (listed on the left-hand side in order of increasing time from top to bottom) represent memory accesses along the vertical axis (access time). Each variable is accessed from a different address in memory, shown along the horizontal axis (memory address). Fig. 2 shows how arcs are formed in the graph from each node to all other nodes which are accessed at a later time. The arcs with zero cost include those between accesses whose memory addresses differ by at most one. Arcs requiring a larger offset have a nonzero cost associated with them (which is one for the TMS320C2x processor when no index registers are used). Fig. 3 illustrates a set of two unit flows in the graph, shown as bold arcs. In total, two address registers are used in the figure, each serving its own subset of the memory accesses. In Fig. 4, one address register is allocated; however, an offset of three occurs on one arc, incurring a nonzero cost (for example, an extra instruction that applies the offset of three in the TMS320C2x [6] DSP processor). In Fig. 5, the actual mapping from compiler-generated code and a memory layout into optimized code for the TMS320C2x DSP processor is illustrated. This processor [6] supports loading an address register with an initial value (i.e., lark), autoincrement addressing, and autodecrement addressing. Each instruction refers to the current address register (initialized by the instruction larp) and can change the current address register by specifying it at the end of the instruction (for example, an instruction can increment the current address register and then set the next current address register). There is always one extra cost, the instruction larp 0, which identifies the current address register for the TMS320C2x DSP processor [6]. Sections V-B and V-C will describe in detail the minimum cost circulation formulation for solving Problem 1.
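The mapping described in this subsection can be sketched as follows. This is not the paper's implementation: the requirement that every access vertex carries exactly one unit of flow is handled here by the standard split into an "out" and an "in" copy of each access, and the resulting problem is solved with a textbook successive-shortest-path minimum cost flow. The unit load and offset costs follow the TMS320C2x discussion above; all names and the example address sequence are hypothetical.

```python
def min_cost_flow(n_nodes, edges, src, dst, required):
    """Successive shortest paths (Bellman-Ford); edges are (u, v, cap, cost)."""
    graph = [[] for _ in range(n_nodes)]
    to, cap, cost = [], [], []
    for u, v, c, w in edges:  # forward edge at even index, reverse at odd
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    inf = float("inf")
    total = 0
    for _unit in range(required):
        dist, via = [inf] * n_nodes, [-1] * n_nodes
        dist[src] = 0
        for _round in range(n_nodes - 1):
            changed = False
            for u in range(n_nodes):
                if dist[u] == inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]], via[to[e]] = dist[u] + cost[e], e
                        changed = True
            if not changed:
                break
        if dist[dst] == inf:
            raise ValueError("no feasible assignment")
        v = dst
        while v != src:  # push one unit back along the shortest path
            e = via[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]
        total += dist[dst]
    return total, cap


def assign_registers(addresses, num_regs, load_cost=1, offset_cost=1):
    """addresses[t] is the memory address accessed at time t."""
    n = len(addresses)
    S, POOL, T = 0, 1, 2
    out_node = lambda i: 3 + i
    in_node = lambda j: 3 + n + j
    edges = [(S, POOL, num_regs, 0)]  # at most num_regs registers are started
    follow = {}                       # (i, j) -> index of forward edge i -> j
    for j in range(n):
        edges.append((POOL, in_node(j), 1, load_cost))  # load a register at j
        edges.append((S, out_node(j), 1, 0))
        edges.append((in_node(j), T, 1, 0))             # j is accessed once
        for i in range(j):                              # only later accesses
            step = 0 if abs(addresses[j] - addresses[i]) <= 1 else offset_cost
            follow[(i, j)] = 2 * len(edges)
            edges.append((out_node(i), in_node(j), 1, step))
    total, cap = min_cost_flow(3 + 2 * n, edges, S, T, n)
    regs, next_reg = [None] * n, 0
    for j in range(n):  # flow on a forward edge equals its reverse capacity
        pred = next((i for i in range(j) if cap[follow[(i, j)] + 1] == 1), None)
        if pred is None:
            regs[j], next_reg = next_reg, next_reg + 1
        else:
            regs[j] = regs[pred]
    return total, regs


print(assign_registers([0, 5, 1, 6, 2, 7], 2))  # (2, [0, 1, 0, 1, 0, 1])
```

In this example the two registers walk the interleaved address runs 0, 1, 2 and 5, 6, 7, so only the two initial loads are paid; with a single register every step would need an explicit offset instruction.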
The general formulation of the address generation problem as a minimum cost circulation problem will now be presented. For illustration purposes, the costs for the autoincrement/autodecrement addressing problem will be formulated below using the TMS320C2x [6]. However, in general, costs of autoincrement/autodecrement addressing modes used in other DSP processors (such as TMS320C5x, TMS320C8x, etc.) can also be supported; see Section IV. The cost for using an offset whose value is greater than one is one instruction in the TMS320C2x instruction set.
The cost for using each address register is the cost to load each address register with an initial value, which again is one instruction in the TMS320C2x instruction set. All other costs are zero.
The formulation of the minimum cost circulation problem will now be presented. General vertices of the graph can be the source vertex, the sink vertex, or data access vertices unless otherwise stated. Equations (2) and (3) ensure that the flow into and out of each data access vertex is equal to one. Finally, (4) and (5) set the arc capacities and lower bounds. Alternatively, (2) and (3) can be transformed into a pure circulation problem as presented in Section IV. In this case, each data access vertex in the graph is replaced by an arc whose lower bound is set to one [which has the same effect as (2) and (3)]. This lower bound, along with the conservation of flow constraint (1), is used along with inequalities (4) and (5) to formulate the circulation problem.
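One way to write down a formulation consistent with the description above is sketched here. The symbols are assumptions rather than the paper's own notation: $x_{ij}$ is the flow on arc $(i,j)$, $c_{ij}$ its cost, $a_i$ the memory address of the access at vertex $i$, $D$ the set of data access vertices, $s$ and $t$ the source and sink vertices, and $R$ the number of address registers. The equation numbers are matched to the prose as closely as it allows and may not coincide exactly with the original.

```latex
\begin{align*}
c_{ij} &=
  \begin{cases}
    1 & \text{if } i, j \in D \text{ and } |a_j - a_i| > 1
        \quad \text{(explicit offset instruction)}\\
    1 & \text{if } i = s \text{ and } j \in D
        \quad \text{(load of an address register)}\\
    0 & \text{otherwise}
  \end{cases}\\[6pt]
\text{minimize}\quad & \sum_{(i,j) \in A} c_{ij}\, x_{ij}\\
\text{subject to}\quad
  & \sum_{j:(j,i) \in A} x_{ji} - \sum_{j:(i,j) \in A} x_{ij} = 0
      && \forall i \in V  && (1)\\
  & \sum_{i:(i,j) \in A} x_{ij} = 1 && \forall j \in D && (2)\\
  & \sum_{j:(i,j) \in A} x_{ij} = 1 && \forall i \in D && (3)\\
  & 0 \le x_{ts} \le R && && (4)\\
  & 0 \le x_{ij} \le 1 && \forall (i,j) \in A \setminus \{(t,s)\} && (5)
\end{align*}
```

Under these assumptions, each unit of flow leaving $s$ traces the accesses served by one address register, and the bound on the returning arc $(t,s)$ limits how many registers can be used.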
This model can be extended to generate addressing code that additionally supports a fixed nonzero offset that is loaded into an index register and used to increment or decrement the address stored in any other address register by that amount (such as the capability provided by the index register of the TMS320C2x processor [6]). In this case, the cost formulation is extended so that an arc whose offset equals the fixed offset value also has zero cost, and the cost is one otherwise. Note that the offset value is fixed. Any number of offsets can be supported with the circulation technique (supporting a wide range of DSP processor architectures).
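The extended cost rule can be sketched as a small function, together with a loop that tries several candidate offsets as suggested in Section V. This is an illustration only: it evaluates the cost along one fixed access path rather than solving the circulation problem, and the function names and example addresses are hypothetical.

```python
def sequence_cost(addresses, index_offsets=()):
    """Offset instructions needed along one address register's path.

    A step is assumed free if it is -1, 0, or +1, or if its magnitude
    equals a fixed offset held in an index register.
    """
    cost = 0
    for a, b in zip(addresses, addresses[1:]):
        delta = abs(b - a)
        if delta > 1 and delta not in index_offsets:
            cost += 1
    return cost


def best_fixed_offset(addresses, candidates):
    """Pick the candidate offset that leaves the fewest costly steps."""
    return min(candidates, key=lambda k: sequence_cost(addresses, (k,)))


path = [0, 3, 6, 7, 10]
print(sequence_cost(path))                 # no index register
print(sequence_cost(path, (3,)))           # index register holds 3
print(best_fixed_offset(path, [2, 3, 4]))  # best of the candidates
```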
B. Extensions for Data Structures, Loops, Conditionals
This section will describe the extensions of the basic block optimized address code generation problem to data structures, loops, and conditional code. Data structures can be supported with the minimum cost circulation problem and optimal solutions can be provided. Extensions for data structures in basic blocks will be described first, followed by approaches to handling loops. Approaches to supporting multiple and nested loops and conditional code are discussed. Details for combining these features, for example nested loops with data structures, in real algorithmic code will be provided in Section V-C.
1) Data Structures: Data structures can be supported with the minimum cost circulation problem. For illustration purposes, data structure support within basic blocks only will be discussed in this section. Extensions for data structures in loops and conditional code will be described later in Section V-C. In particular, support for pointers is used to illustrate data structure extensions. At the assembly code level, initial access to a pointer data structure is obtained by loading an address register with a value stored in memory which is being pointed to by the current address register, unlike immediate loads into an address register (i.e., lark). One address register is pointing to the memory address that contains a data value equal to the desired memory address. Apart from the initial access of a data pointer, within basic blocks there are different options which may be supported whenever a pointer-based memory access is required. Examples of these options are shown in Fig. 6 with TMS320C2x assembly code and costs. The options are a) set the next address register to the one whose last use was pointing to the appropriate data, b) adjust an address register which was pointing to an element of that pointer array, or c) if necessary, adjust an address pointer so that it can be loaded in order to fetch the pointer data. Note that these examples refer to different code and are just used to illustrate the extra instructions and graph formulations. When address registers are finished being used to access pointer data, they are loaded with a constant value (lark), as shown in Fig. 6 by a nonzero cost on the returning arc from pointer accesses.
2) Loops: Support for loops requires that two important constraints be met. These constraints will be called the interloop constraint and the intraloop constraint. The interloop constraint states that the last address pointed to by each address register at the end of the loop must correspond to the address pointed to by the same address register at the top of the loop. The intraloop constraint states that each set of data access variables (or a set of memory accesses) that an address register is assigned to in the first iteration of the loop must be the same set of data access variables it points to in the next iteration of the loop. Note this does not necessarily mean that they have to point to the same memory address. For example, some loop structures will access the next value of an array in the next loop iteration (increment by one or some other value in each loop iteration) or it will access a decremented array value in the next loop iteration. To handle address register assignment for loops one has to make a distinction between these different types of data access variables.
One set of data access variables will be called loop-invariant; in other words, the same memory addresses are accessed in each loop iteration. The other type of data access variables will be called loop-variant. We group the loop-variant data variables into sets such that all data variables in one set use the same function to map their memory addresses of one loop iteration into their memory addresses of the next loop iteration. For illustration purposes only, a "C" loop (taken from part of the durbin code found in the LDCELP voice compression/decompression application [16]) will be used to give an example of the different sets. In this "C" loop, the memory addresses used to store the scalar variables (including sum) are loop-invariant, whereas the memory addresses used to access and store one array are loop-variant, with a mapping function that increments by one for each loop iteration. The memory addresses used to access a second array are also loop-variant; however, they must be placed in a separate group since their mapping function is that of decrementing. In the assembly code for this "C" loop, one address register points to the loop-invariant set, another points to the decrementing loop-variant set, and a third points to the incrementing loop-variant set. In this code, a separate address pointer is used to keep track of the loop index. During formation of the network flow graph, edges can be placed only between data variables in the same set; otherwise, the intraloop constraint is violated. For example, if in the code segment one address register were assigned to access both the incrementing array and the decrementing array, it would be impractical to generate code to update both elements in the next loop iteration (since one requires incrementation and the other requires decrementation).
For this simple loop, the network flow graph (corresponding to the five memory accesses of the assembly code) would contain four paths from the source node to the sink node. These paths go from the source node to the first memory access, to the second memory access, to the third and fourth memory accesses (which are sequential and refer to the same memory location, so they could be represented by one node), and finally to the fifth memory access (the loop index).
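The restriction that edges may only join accesses of the same set can be sketched as below. Each access is tagged with the per-iteration change of its address (zero for loop-invariant accesses), which is an assumed encoding of the mapping function; the access names are hypothetical and do not reproduce the durbin code.

```python
def allowed_arcs(accesses):
    """Arcs permitted by the intraloop constraint.

    accesses: list of (name, stride) in access order, where stride is the
    change in memory address from one loop iteration to the next
    (0 = loop-invariant, +1 = incrementing set, -1 = decrementing set).
    Only accesses with the same stride may share an address register.
    """
    arcs = []
    for i, (_, stride_i) in enumerate(accesses):
        for j in range(i + 1, len(accesses)):
            if accesses[j][1] == stride_i:
                arcs.append((i, j))
    return arcs


loop_body = [("sum", 0), ("up", +1), ("down", -1), ("sum", 0), ("up", +1)]
print(allowed_arcs(loop_body))  # [(0, 3), (1, 4)]
```

No arc joins the incrementing and decrementing accesses, so a single address register can never be asked to move in both directions between iterations.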
In general, the interloop constraint causes the problem to become a multicommodity flow problem, which is NP-complete [21] (see Section IV-A for more details). However, solutions can be handled in a number of ways, which differ depending upon whether they trade off optimality or the polynomial run time (see Section VIII for more details on alternatives). In the current implementation, we solve the minimum cost flow problem as formulated with the flow graph described above, obeying the intraloop constraints, and afterwards, in a post-pass stage, make adjustments to satisfy the interloop constraint by adding or modifying instructions if necessary (to update the address registers to point to the required memory addresses at the bottom of the loop for the next loop iteration). For each application, which may contain complex data structures, nested loops, and conditional code, one large network flow graph is formed. This provides a global approach to address code generation. Details are provided below for the formulation of edges in the flow graph and costs. Support for pointers within (separate or nested) loops is discussed, followed by flow graph formulation for code preceding and following loops. In all cases, memory address values used to form the network flow graph are obtained from the trace through the code (taking the high-probability conditional blocks or branches) and only the first iteration of each loop encountered. This initial minimum cost circulation is solved and followed by a post-pass stage where code adjustments are made for loops and branches. Also during this stage, remaining blocks of code, which were not a part of the original trace, are processed (flow graphs formed and optimized code generated). Fig. 11 is an example of a flow graph of an application, the durbin example, where the shaded basic blocks of code represent the trace through the code optimized simultaneously.
The circles indicate where code adjustments had to be made and the unshaded basic blocks were separately optimized, both during the post-pass stage.
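The post-pass adjustment for the interloop constraint can be sketched for a single address register as follows. The sketch assumes that the register must point, at the bottom of the loop, to its first address of the next iteration, and that a correction of at most one can be folded into autoincrement/autodecrement addressing at no cost; both the function name and these assumptions are illustrative rather than taken from the paper.

```python
def interloop_adjustment(first_addr, last_addr, stride):
    """Correction needed at the bottom of a loop for one address register.

    first_addr: address the register points to at the top of the loop
    last_addr:  address it points to after its last access in the body
    stride:     per-iteration change of its set (0 for loop-invariant)
    Returns (delta, cost): the offset to apply and the assumed
    instruction cost (0 if |delta| <= 1, otherwise 1).
    """
    delta = (first_addr + stride) - last_addr
    cost = 0 if abs(delta) <= 1 else 1
    return delta, cost


# Loop-invariant register that walked from address 10 to address 12.
print(interloop_adjustment(10, 12, 0))   # (-2, 1)
# Incrementing register that stays on one element per iteration.
print(interloop_adjustment(20, 20, 1))   # (1, 0)
```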
Within each loop, all memory accesses are placed into loop-invariant sets and different types of loop-variant sets (as described earlier). This is performed for pointer data as well. Pointers to data values inside loops may be loop-invariant or loop-variant. In the examples in Section IV-B, loop-variant examples were given; however, in practice loop-invariant pointers are often present. Edges between loop-invariant memory accesses from the same, nested, or separate loops are formed. Loop-variant memory accesses form edges only with other memory accesses in the same set (incrementing or decrementing by the same amount) of the same loop. Fig. 7 gives an example of an edge between two loop-invariant data pointers of separate loops or of nested loops. Allocating flow on this edge (which means the same address register can be used to access this pointer data) is an example of an improvement to addressing code that would not be possible if optimization proceeded locally on a loop-by-loop or basic block basis, which would require unnecessary reloading of the pointer within the loop. A further example is given in Fig. 8. Blocks of code preceding loops have data variable accesses that can form edges with accesses within the loop (since the loop contains addresses based upon the initial loop iteration). For example, if a basic block preceding a loop has just accessed an array element and the initial loop iteration accesses an element of the same array, whose index is then incremented for each loop iteration after that, then the same address register can be used for the block preceding the loop and within the loop itself (as long as the loop accesses are adjusted for interloop constraints, if necessary, at the bottom of the loop). However, blocks of code that are executed after the loop have nonzero costs associated with all edges except possibly those which are formed from the loop-invariant sets in the loop.
In other words any address registers used to access data variables in loop-variant sets must be reloaded outside of the loop unless it can be guaranteed that the last iteration of the loop is pointing to the required data being accessed in the code which follows.
Conditional code support also reduces to a multicommodity flow problem, similar to loops. However, in our implementation the minimum cost circulation technique is used and code adjustment is performed in the post-pass stage. The most probable conditional block is chosen (or, for branches, the branch is taken or not taken depending upon which is most likely) as part of the trace or, if this data is not known, an arbitrary choice is made (typically by choosing the one which results in optimizing the largest code). After the minimum cost circulation is solved with this trace of the code, the remaining conditional blocks (which were not a part of the original trace) are separately optimized, with initial addresses of address registers taken from the optimized code included in the network flow graph at the beginning (time 0). An example is given in Fig. 9, where the branch to label L20 is not taken in the trace which is being optimized (and the L20 block is not a part of the original trace and has only one branch statement to it, called a single-entry block). In the post-pass stage, the branch to L20 is solved as a separate network flow graph with addresses loaded into the address registers as they were right before the branch is taken (shown in Fig. 8 with hollow circles representing three nodes at time 1 of the flow graph for block L20). For cases where conditional blocks are included in the original trace but specific branches are not taken, the postprocessing stage performs adjustments to the optimized code. Fig. 10 illustrates the adjustment required for multientry blocks. For example, Fig. 10(a) has the last block (L15) with two entry points into it (a multientry block). In the trace being optimized [see Fig. 10(b)], only one entry point is considered. To adjust this code so that it can incorporate the other branch into this block, the circled code [shown in Fig. 10(c)] must be added/modified.
In other words, where necessary, address registers must be loaded with values at the top of multientry blocks. Additionally, the next address register must often be set. For example, a value must be loaded into an address register with lark instead of adding five to the current address register value, since the register contained a different value in the top block before the branch statement than in the middle block immediately before the label. Again, pointers previously loaded do not have to be reloaded with the global trace technique, as illustrated in Fig. 10, again providing code improvements (shown with dashed lines) over the loads implemented in the compiler-generated code.
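The adjustment for multientry blocks can be sketched as a comparison of address register contents at the two entry points. This is only an illustration under stated assumptions: the function name `entry_adjustments`, the register names, and the addresses are hypothetical, and register state is modeled as a simple mapping from register name to the address it holds.

```python
def entry_adjustments(trace_state, other_entry_state):
    """Return the absolute loads needed on a non-trace entry into a block.

    trace_state: register -> address expected at the top of the block,
                 as produced by the optimized trace (hypothetical model).
    other_entry_state: register -> address held when the block is entered
                 through the branch that was not part of the trace.
    """
    loads = {}
    for reg, addr in trace_state.items():
        # A register that already holds the expected address needs no code.
        if other_entry_state.get(reg) != addr:
            # Otherwise an absolute load (e.g. lark) is required; a relative
            # add would be wrong because the starting value differs.
            loads[reg] = addr
    return loads


if __name__ == "__main__":
    trace = {"AR1": 100, "AR2": 205}
    branch = {"AR1": 100, "AR2": 200}
    print(entry_adjustments(trace, branch))
```

Registers that already point to the expected address on both paths need no extra code, which is where the savings from pointers that are not reloaded come from.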
D. Extensions for Other Addressing Modes and Recent Processors
This section will describe extensions to this methodology for supporting other DSP processors and addressing modes. Support for parallel memory accesses, no-penalty offset-based addressing, addressing with automodify ranges and nonlinear types of addressing modes will be discussed.
The minimum cost circulation technique supports parallel memory accesses (such as those available in the TMS320C3x [19], TMS320C6x (VLIW) [20], and other processors). In the network flow graph, nodes with the same access time will use separate address registers (or create separate flows, since no edges between nodes with the same access time are formed). In some processors, such as the TMS320C3x, although offset-based addressing is supported with no performance penalty, the use of autoincrement/autodecrement addressing supports the parallel memory accesses required for parallel instructions, which improve both performance and code size. In the C3x case, memory accesses of all instructions are used to form the network flow graph. The minimum cost circulation problem is solved and the resulting code is processed for parallel instruction substitution. Specifically, each pair of sequential instructions in the code is examined to see if it can be substituted by a parallel instruction. This approach is not guaranteed to always provide improved code, since it relies on potential parallelism in the existing schedule and in the initial code. For example, a small part of code obtained from the compiler is shown on the left-hand side below. The memory accesses are identified by the use of FP. The right-hand side shows the transformation to autoincrement/autodecrement addressing and parallel instructions (identified by ||, which concatenates two instructions into one parallel instruction [19]). Thus, the minimum cost circulation technique is applicable (see Section VII for examples).
[Listing: compiler-generated TMS320C3x code with FP-relative memory accesses (left) and the transformed code using autoincrement/autodecrement addressing and parallel instructions (right).]
In the TMS320C6x and other newer DSP processors, 5-bit constants or an index register can be used to specify an offset to the address without any performance or cost penalty. This can be supported with the circulation technique by using a cost function which assigns nonzero cost only to edges whose distance from the current memory address exceeds the largest offset that can be specified. Alternatively, since any register can be used as an address, index, or data register, the optimization of the number of address registers (see Section VII-A) performed by the circulation technique is important. Thus the circulation technique can be used for this processor also.
In other processors, such as the TMS320C8x (DSP multiprocessor), offset addressing within a fixed range around the current address (automodify ranges [17]) is allowed. This is supported with the circulation technique, again by setting the edge cost function to use zero cost only when the absolute difference between the two memory addresses being accessed is within this range. Specifically, the edge cost is zero within the range and nonzero otherwise.
Bit-reversed addressing and modulo (or circular) addressing are examples of nonlinear types of addressing. These types of addressing can be supported only if the architecture does not restrict the mode to one specific address register (which would require a multicommodity formulation). For example, an edge is assigned zero cost if its second address is the bit-reversed successor of the first (for bit-reversed addressing), if it is the successor of the first within the circular buffer (for circular addressing, specified by the block size and top of buffer stored in special registers), or if the two addresses differ by the offset for offset-based addressing modes [where the offset is stored in an index register (or is zero or one for autoincrement/autodecrement addressing)]; otherwise a nonzero cost is assigned. Offset-based addressing where the offset is stored in the index register but the address register is not modified (i.e., the access uses the address plus the offset, but the address register keeps its original value) can only be supported in a post-pass stage (not as part of the minimum cost circulation formulation; see Section VI-A for more details). If the offset is a displacement value stored in the instruction word, or if the address register is always modified (i.e., it is updated by the offset), then the minimum cost circulation technique can be used.
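The edge costs described above can be sketched as a single function. This is a minimal illustration, not the implementation used in the paper: the name `edge_cost` and the parameters `max_step`, `index_values`, `circ`, and `reload_cost` are hypothetical, a uniform nonzero cost is assumed for every non-free transition, and bit-reversed addressing is omitted.

```python
def edge_cost(addr_from, addr_to, max_step=1, index_values=(),
              circ=None, reload_cost=1):
    """Cost of moving one address register from addr_from to addr_to.

    max_step     : largest free automodify distance (1 models plain
                   autoincrement/autodecrement); assumed parameter.
    index_values : offsets held in index registers; a step of exactly
                   one of these sizes is assumed to be free.
    circ         : optional (base, size) of a circular buffer.
    reload_cost  : assumed uniform cost of any other transition.
    """
    delta = addr_to - addr_from
    # Automodify range: zero cost when the addresses are close enough.
    if abs(delta) <= max_step:
        return 0
    # Offset addressing through an index register (address register updated).
    if abs(delta) in index_values:
        return 0
    # Circular addressing: stepping past the end wraps to the buffer start.
    if circ is not None:
        base, size = circ
        inside = (base <= addr_from < base + size
                  and base <= addr_to < base + size)
        if inside and (addr_to - base) == (addr_from - base + 1) % size:
            return 0
    return reload_cost


if __name__ == "__main__":
    print(edge_cost(10, 11))                      # adjacent addresses
    print(edge_cost(10, 15))                      # needs an extra instruction
    print(edge_cost(10, 15, index_values=(5,)))   # covered by an index register
    print(edge_cost(19, 16, circ=(16, 4)))        # circular wrap-around
```

Changing only this function, while keeping the network flow graph the same, is what lets one formulation cover several processors.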
The circulation technique is also applicable to the design of DSP cores, where designers would like to explore, for their set of applications, how many index and address registers to support. For example, the minimum cost circulation technique can be applied a number of times to a group of DSP basic blocks, each time with a different index register value and number of index registers, to determine the addressing costs (including the optimal number of address registers that should be supported). In these cases, the fast solution time and the optimal solutions available from the circulation technique for the specific problem it solves provide an important exploration tool. Section VI will provide some examples of this type of exploration, using multiple index registers.
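The exploration loop can be sketched as a sweep over candidate index register values. The sketch below is an assumption-laden stand-in: instead of solving the minimum cost circulation, it uses a much simpler cost model with a single address register, and the names `single_register_cost` and `explore_index_values` and the access sequence are hypothetical.

```python
def single_register_cost(accesses, index_value=None):
    """Addressing cost of walking `accesses` with one address register.

    Simplified stand-in for the circulation solver: a step of 0 or +/-1 is
    free (autoincrement/autodecrement), a step equal to +/-index_value is
    free (index register), and every other step costs one instruction.
    """
    cost = 0
    for prev, cur in zip(accesses, accesses[1:]):
        step = abs(cur - prev)
        if step <= 1:
            continue
        if index_value is not None and step == index_value:
            continue
        cost += 1
    return cost


def explore_index_values(accesses, candidates):
    """Evaluate each candidate index register value on the same accesses."""
    return {value: single_register_cost(accesses, value)
            for value in candidates}


if __name__ == "__main__":
    accesses = [0, 4, 8, 9, 5, 1]          # hypothetical access sequence
    costs = explore_index_values(accesses, candidates=[2, 4])
    best = min(costs, key=costs.get)
    print(costs, "best index value:", best)
```

In an actual exploration, the call to `single_register_cost` would be replaced by a solve of the circulation problem with the corresponding edge costs, repeated for each candidate configuration.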
VI. EXPERIMENTAL RESULTS
Several DSP applications are used to illustrate this methodology. The benchmarks are the least mean squares algorithm [6], a high-pass infinite impulse response (IIR) filter realized in two different structures, the discrete cosine transform, and the fast Fourier transform. A larger real industrial example, the durbin code, taken from the LDCELP algorithm [16], was used to further illustrate the support for loops, conditional code, and complex data structures. Code was generated using the Texas Instruments (TI) C compiler [6]. The minimum cost circulation was solved on a Pentium PC workstation using a linear programming solver, GAMS/CPLEX [10], although faster CPU times are possible using network solvers [7]. Table I compares the optimized results to the TI-compiler-generated addressing. The optimized code was generated from the memory accesses in the compiler-generated DSP code, using the same memory layout as the compiler. No index registers were used in this table (see Tables III and IV for index register analysis). Results were compared to the initial code generated by the TI compiler in terms of the final code size (CS) and the number of instructions used for addressing alone (#, or the cost resulting from adrk, sbrk, lark, etc.). For the applications chosen, the code size is equivalent to the number of cycles required to execute the code on the TMS320C2x. The performance improvement shown in Table I is due only to optimal address generation for the given TI-generated memory layout. In general, the TI compiler selects the memory layout in the same sequence in which the input parameters are passed to a routine. All examples in the table are compiled and optimized for the TMS320C2x except two examples, which show improvements for the TMS320C3x through optimized indirect addressing, thus supporting parallel instructions (see Section IV for details).
Table II lists the size of the LP problems (the number of variables/inequalities) and the CPU times for solving these network problems using an LP solver. See the Appendix for complete details of a fast Fourier transform (FFT) application showing the TI C compiler-generated output and the optimized code from the minimum cost circulation technique.
The durbin example [16] consisted of 26 lines of "C" code and 204 lines of assembly code. It contained six loops (two of which were nested) and seven branches (two of which were inside a loop). Fig. 11 shows the basic blocks of assembly code (shown as rectangles) and arrows representing branches. Three pointer data structures were used. Only six address registers were used in the optimization, to compare directly with the TI-generated code. Due to the large number of control instructions (branches) and only 27 address-modification instructions, the percent improvement in terms of code size was not as high as for the basic block examples. However, the actual number of addressing-related instructions (including loading of address registers and pointers, and address modifications) was reduced by more than half. Optimizations included not only a reduction in the number of address-modification instructions, but also a reduction of unnecessary loading of pointers. This is a consequence of applying the circulation technique in a global manner, where pointers in loop-invariant sets do not have to be reloaded in each loop iteration, and initial loading outside of the loop is also avoided by using address registers which pointed to them in the previous basic block or previous loop (where they were also used in a loop-invariant set). The frequencies of the loops were not known, so the performance improvement could not be calculated, and conditional/other branches were chosen based upon size. The original trace that was optimized is shown in Fig. 11 with shaded blocks. The circled areas illustrate where code adjustments were made for interloop constraints and multientry blocks. The unshaded blocks illustrate single entry blocks that were optimized in the post-pass stage, since they were not a part of the original trace. Some parts of the durbin code (but not the entire code) have also been presented in Figs. 8-10 (Sections V-B and V-C, where savings in code size and performance were indicated by dashed horizontal lines) to illustrate code adjustments and optimization of remaining blocks, both in the post-pass stage.
Table III presents the effects of using offsets stored in one index register for the TMS320C5x, where performance was optimized. The cost to load an address register is two (two cycles required). The cost of the optimized addressing code shown in the tables is the number of cycles required by the addressing code alone (inversely related to performance). Table IV shows cost changes for the use of two or three index registers, if they were available in the TMS320C5x architecture. Table V shows some effects of using different memory layouts on the address generation problem for a variation of the least mean squares algorithm. The TI-generated memory results are shown as TI-mem. An improved memory layout was generated for these algorithms using a technique similar to the modified Kruskal algorithm presented in [2] (Krusk-mem), except that variable sharing was incorporated. A fixed number of address registers was used each time to further compare solutions. The total addressing costs are given.
Fig. 12 illustrates the percent improvement in performance and energy (for zero data, Energy 0, and speech data, Energy 1) of the optimized code compared to the TI-compiler-generated code (for different offset values stored in the index register) for the TMS320C5x. The first bar, none, is the optimized solution that does not use any index register in the addressing code. A current measurement technique similar to [22], [5] was used to obtain real energy measurements.
A. Comparison with Previous Research
Unlike [2], [11], [9], and [17], simultaneous memory layout is not performed with the address code generation. It is assumed that memory layout has already been performed before the minimum cost circulation technique is applied to improve the address code. However, one can make a brief comparison of address code generation with these techniques. Unlike [9], we do not determine what values should be placed in the index register or determine when they should be overwritten. In the approach presented in this paper these decisions have to be explored by the user. Similar to [17] and unlike [2] and [11], different address registers can be assigned to the same memory address (or variable) at different times. Also similar to [17], we incorporate the address register loading into our cost and support automodify ranges (see Section V-D). However, unlike [17], the value stored in the index register(s) and the number of index registers have to be fixed by the user and unfortunately are not part of the optimizable cost. Also, offset-based addressing (where the index register holds the offset value) in which the contents of the address register are not changed is not supported by the minimum cost circulation formulation. However, it can be incorporated during the post-pass stage (where in each case the address register of minimum cost is chosen to generate the address). Similar to [11], our methodology can be used to explore the number of address registers and the ranges of autoincrement/autodecrement which are best for performance and cost for a particular application. Whereas in [11] a parameterizable algorithm is used during exploration, our methodology must iterate on the number and value of index registers alone (since the number of address registers is minimized during the circulation technique).
In the first example (see row 1 of Table VI), the original memory layout was changed in order to reduce the cost to three (using two AR's).
Our methodology (MinCost1) takes any memory layout, for example the original layout, and determines the minimum cost solution, in this case also three (using three AR's), without having to change the memory layout. Using an example presented in Fig. 2 in [9], the original memory layout was used with the circulation technique to obtain a minimum cost solution of two (or two AR's, see row 2 of Table VI), whereas the paper explored addressing using a single address register. They modified the memory layout in order to reduce the cost to three for one address register and one index register (where the index register is allocated in a post-pass phase [9]). In our approach, the user must specify the number of index registers and the value each must hold in order to generate addressing code simultaneously with determining how many address registers should be used. Thus, such an improvement is not always possible, since it depends on the input of the designer (allowing more control, which most industrial designers highly appreciate, but also more responsibility). No post-pass phase is used to support index registers. In cases where there exists more than one solution of minimum cost, if a solution with a minimum number of address registers is desired, one can use a weighted cost function with the circulation technique (in which the weight for address register loading is higher than that of the address calculation instructions). In the last column (MinCost2) of Table VI, we assume the loading of an address register requires a cost of two (for example, representing two cycles for execution as in the C5x). Notice how the solution changes in the first row from three address registers to two (or one address register and one index register), reflecting our cost-based approach to the address code generation problem (where cost can be modeled as performance or code size).
Previous research [12], [18] has also studied the problem of address register allocation for array accesses in loops. This problem is identical to generating optimal addressing code for any general loop where memory layout has already been performed (since the memory layout is determined by the index of array elements). The solution of this problem is also particularly important for critical loops (for example, 10% of the code requiring 90% of the execution time). Due to the interloop constraints, this problem is NP-complete. Similar to [12], one can ignore the interloop constraints and adjust the results during the post-pass stage, as discussed in Sections V-B and V-C. Unlike [12], the minimum cost circulation can also support a constraint on the number of address registers and can support index registers. Alternatively, one can extend the circulation technique presented in this paper to a multicommodity flow formulation in order to incorporate the interloop constraints. Similar to [18], this approach requires a branch and bound solution (here integer linear programming); however, we can also support index registers during this process. This multicommodity extension replaces the single flow variable per arc of the circulation problem with one variable per arc for each address register (or commodity). To disallow any extra costs or instructions within the loop, only edges in the flow graph that represent increments or decrements of one or zero (or the specific values of the index registers) between nodes of the same set are allowed. If the loop is optimized in isolation, the cost to be minimized is the total number of address registers used. Constraints are set up to ensure that two memory accesses (arcs) on either side of a loop iteration boundary can both have unit flow of the same commodity only if their addresses are within one or zero (or an index register value) of each other.
Specifically, the constraint ensures that a memory access at the bottom of the loop can be incremented, decremented, or left unchanged for use at the top of the loop in the next loop iteration by the same address register. Using the small loop example from [18], the minimum cost circulation approach had a total cost of three, requiring two address registers from the solution of the minimum cost circulation and an additional cost of one instruction to adjust for interloop constraints. The multicommodity solution had an optimal cost of three (representing three address registers), similar to the optimal solution obtained in [18].
VII. DISCUSSIONS
In summary, code size and performance savings from 18%-61% (see Table I) were attained by optimizing the address generation code across several DSP examples. The technique presented in this paper performs optimal address code generation for a given memory layout. Unlike previous research, the address code generated is optimized for code size or performance when applied to basic blocks (along with pointer data structures). Other polynomial time techniques, such as MDPC for array access in loops [12], do not model costs or support constraints on the maximum number of address registers. Support for costs is very important in order to support data structures (which require different costs to load pointers, etc.). The formulation of the address code generation problem as a minimum cost circulation problem not only provides optimal solutions in polynomial time (for basic blocks), but also allows efficient mathematical codes to be used. Extensions of this formulation to include loops and conditional/branch constraints (i.e., the multicommodity flow formulation) allow the user to attempt to solve for optimal address code using an integer linear programming approach. This may be very practical for small code sections which are time critical. However, for larger codes (and the majority of codes used to demonstrate the technique in this paper), the methodology typically will apply the minimum cost circulation followed by code adjustment to one or more traces of the code (as was the case in the durbin example). In algorithms where the number of control constructs outweighs the number of data computation operations, the minimum cost circulation methodology would have limited use or show very little improvement (since the methodology concentrates on optimizing addressing code for data computation operations).
In these control-dominated cases, other methodologies which involve optimization of control constructs would be more appropriate (since the circulation methodology presented in this paper supports control constructs, but does not attempt to optimize the control). Many classical compiler approaches [23] can be applied to various stages of code generation before address code generation is optimized using the minimum cost circulation technique presented in this paper. For example, loop pipelining, loop induction variable optimization, code compaction, etc., can be performed, since the minimum cost circulation technique is typically applied after assembly code is generated and attempts to improve the addressing code only. It is assumed that these other compiler steps have very limited interaction with the address generation approach proposed in this paper.
There are alternative heuristics for handling loops which may be implemented to satisfy the interloop constraints. For example, one approach is to solve a number of single flow circulation problems on the same graph, except that an initial memory access is chosen to be the only access node connected to the source node, and the only edges which can be connected to the sink node are those whose access node represents a memory address at an offset of zero, +1, or -1 from the initial memory address. After this path of flow is calculated (with zero cost and containing a maximum number of nodes), the nodes it includes are deleted from the graph and the algorithm continues. However, since we are iterating and solving for one address register assignment at a time, the constraint on the maximum number of address registers is not supported.
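The path-peeling heuristic above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: a longest-path dynamic program over the time-ordered accesses stands in for each single flow circulation solve, only steps of zero or +/-1 are treated as zero cost, and the names `longest_loop_path` and `assign_registers` are hypothetical.

```python
def longest_loop_path(addrs, start, free):
    """Longest zero-cost path from access `start` through unassigned accesses.

    addrs[i] is the address of the i-th access in time order. A step is zero
    cost when consecutive addresses differ by at most 1. The path may only
    end at an access within 1 of addrs[start], so the register can wrap to
    the start address in the next loop iteration (interloop constraint).
    """
    n = len(addrs)
    best = [-1] * n      # nodes on the best valid path from i, -1 if none
    nxt = [None] * n     # successor of i on that path
    for i in range(n - 1, start - 1, -1):
        if i not in free:
            continue
        if abs(addrs[i] - addrs[start]) <= 1:
            best[i] = 1  # the path is allowed to end here
        for j in range(i + 1, n):
            if j in free and best[j] > 0 and abs(addrs[j] - addrs[i]) <= 1:
                if best[j] + 1 > best[i]:
                    best[i] = best[j] + 1
                    nxt[i] = j
    path, i = [], start
    while i is not None:
        path.append(i)
        i = nxt[i]
    return path


def assign_registers(addrs):
    """Peel off one address register (path) at a time until all are covered."""
    free = set(range(len(addrs)))
    paths = []
    while free:
        start = min(free)                 # earliest unassigned access
        path = longest_loop_path(addrs, start, free)
        paths.append(path)
        free -= set(path)
    return paths


if __name__ == "__main__":
    # Hypothetical loop body touching two small address regions.
    print(assign_registers([0, 5, 1, 6, 0, 5]))
```

Because each pass fixes one register without regard to the ones still to come, the number of paths produced is not bounded in advance, which matches the limitation noted above.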
Since address generation is dependent upon the memory layout, the circulation technique could be used in conjunction with a search technique to find a good memory layout and the corresponding optimal address generation. Furthermore, address code generation for a fixed memory layout may be useful for hand-coded DSP methodologies where the memory layout and initial code have already been determined. In this case, the address code generation is often tedious and can, therefore, be automated. This approach also supports improving sections of code instead of a complete algorithm when the memory layout is fixed (for example, when it is determined by the output of a previous routine which one does not wish to modify). Although the paper has concentrated on presenting the formulation for the TMS320C2x DSP processor, Section V-D has discussed its applicability to a wide range of other DSP processors.
VIII. CONCLUSIONS
In contrast to previous research [1], [12], [2], [8], [9], which examined the general offset assignment problem or other addressing techniques, we have presented an optimal polynomial time technique for generating address code which can work in conjunction with any data memory layout technique, such as in [2], or with a memory layout generated by a compiler. This may be advantageous when memory layout is constrained by interfacing with external systems or when it is performed by an algorithm the user has selected and does not wish to change. It also allows a decomposition, or task by task, approach to code generation, since one can fix the memory layout at the beginning of a task according to what was used in previous tasks, unlike [2], which examined how to deal with memory layout in large code segments. This provides a value-added advantage where code can be quickly generated by a compiler and optimized for addressing without changing the memory layout. Unlike previous research, the formulation as a minimum cost circulation problem supports cost modeling, the limit on the number of address registers, and data structures. For the first time, optimal address code is generated for basic blocks and data structures. This approach has also been extended to loops and branches/conditional code, showing that significant improvements to compiler-generated code are still possible.
We have introduced a new methodology for address generation given a fixed memory layout. It is applicable to many different DSP processors (both fixed-point and floating-point), including some very recently introduced. For the first time, a cost-based approach to address code generation (where cost is based upon performance or code size) using this new application of the minimum cost circulation technique is presented. Results have shown that significant improvements are achieved in terms of performance, cost, and energy, with a wide range of realistic industrial applications (ranging from data-computation-intensive, such as the discrete cosine transform, to the control- and data-structure-intensive durbin algorithm from the LDCELP application).
APPENDIX
Fig. 13 shows the C code describing an FFT which was input to the TI C compiler. Figs. 14 and 15 show the TMS320C2x assembly code generated by TI's C compiler for this FFT application. This FFT differs from the one listed in Section VI of this paper in that each line of C code uses only two-operand operations (i.e., no subexpressions, etc., are used), so that the resulting assembly code can more easily be followed for illustration purposes. The minimum cost circulation technique was applied to minimize the code size of this assembly code. The exact same memory layout as the compiler's was used. Fig. 16 shows the optimized assembly code output from the minimum cost circulation technique. The compiler-generated code consists of 191 instructions (or 190 if we ignore the load used for temporary storage) and the optimized code consists of 116 instructions. The optimized code used seven address registers, whereas the compiler-generated code used only two. Since the TI C compiler solution did not use indexing, neither did our optimized solution, for comparison purposes. The seven address register solution shown is an optimal solution for address code in this example.
