Abstract--This paper presents a network flow approach to solving the register binding and allocation problem for multiword memory access DSP processors. In recently announced DSP processors, sixteen bit instructions which simultaneously access four words from memory are supported. A polynomial-time network flow methodology is used to allocate multiword accesses, including constant data memory layout, while minimizing code size. Results show that improvements of up to 87% in terms of memory bandwidth are obtained compared to compiler-generated DSP code. This research is important for industry since this value-added technique can improve code size and utilize higher memory bandwidths without increasing cost.
I. INTRODUCTION
Due to increasing complexities of domain applications, high level language compilation is a necessity for DSP processor cores. However, the biggest drawback to DSP processor cores is the lack of efficient optimizing compilers. Conventional code generation techniques, and even compilers specifically designed for commercial DSP processors, are known to produce very inefficient code [1, 2]. The code generation problem has tight constraints due to DSP architectural features as well as price, performance, and power requirements. In more recently announced DSP processors, constraints placed upon code generation include dual bank register files, higher memory bandwidths available for aligned sequential data words, and execution set overheads. DSP processors have steadily been increasing the number of functional units in their architecture to support higher levels of parallelism. However, only recently has a somewhat equivalent increase in memory bandwidth been available. For example, a DSP processor [3], which has four complex functional units, can fetch up to 8 memory aligned words in a single cycle (thus supporting access of two operands per functional unit per cycle).
Manuscript received October 30, 2000. This research was supported in part by grants from NSERC, Motorola, and CITO.
C. H. Gebotys is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ONT N2L 3G1 Canada (telephone: 519-885-1211, e-mail: cgebotys@optimal.vlsi.uwaterloo.ca).
Multiword memory accesses provide high memory bandwidth matching computational throughputs of recent embedded DSP processors. However, due to compact code size requirements, this multiword capability places additional constraints on data register allocation and binding. For example, in the Star*core processor [3] (jointly developed by Motorola and Lucent), 16 bit quad memory access instructions are supported. These quad memory access instructions load or store 4 memory aligned (16 bit) data words with only one instruction. Dual memory access instructions, which are also supported, require that the first of two data words (aligned in memory) be loaded into an even numbered data register (such as d2) and the second be loaded into the adjacent odd numbered data register (such as d3, adjacent to d2). Alternatively, a dual load could use the data register pairs (d0,d1), (d4,d5), etc. In the quad memory access instruction the first and third of four data words (aligned in memory) must be loaded into even numbered data registers and the second and fourth must be loaded into the adjacent odd numbered data registers. Each quad memory access has a choice of only four sets of data registers it can load into (or store from), out of a total of 16 data registers available in the register file [3]. These sets are (d0,d1,d2,d3), (d4,d5,d6,d7), (d8,d9,d10,d11), and (d12,d13,d14,d15). If the compiler cannot bind the data to one of these four sets, obeying both memory layout and alignment restrictions, then it must use several single or dual memory access instructions. Since each memory access instruction, whether quad or single, is 16 bits, code size can be reduced by using dual and quad memory access instructions wherever possible.
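The register constraints described above can be illustrated with a minimal Python sketch. It assumes that an aligned access of width 1, 2, or 4 words may only target a block of consecutive data registers whose first index is a multiple of the width, which matches the pairs and sets listed in the text. The function names are hypothetical and are not part of any Star*core tool.

```python
NUM_DATA_REGISTERS = 16  # d0..d15, as stated in the text


def legal_register_sets(width):
    """Return the register index tuples that a `width`-word access may target."""
    if width not in (1, 2, 4):
        raise ValueError("width must be 1, 2 or 4 words")
    # Blocks start at register indices that are multiples of the access width.
    return [tuple(range(base, base + width))
            for base in range(0, NUM_DATA_REGISTERS, width)]


def is_legal_binding(registers):
    """True if `registers`, listed in memory-address order, form a legal target set."""
    regs = tuple(registers)
    return len(regs) in (1, 2, 4) and regs in legal_register_sets(len(regs))
```

For instance, `is_legal_binding((2, 3))` accepts the pair (d2,d3), while `is_legal_binding((3, 4))` rejects a pair that starts on an odd register.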
A Network Flow Approach to Memory Bandwidth Utilization in Embedded DSP Core Processors
Catherine H. Gebotys, Member, IEEE
Out of a 16 bit instruction, only 3 bits are allocated per operand to address the lower bank of 8 registers (d0-d7) in the register file. If an operand is stored in a register of the upper bank of 8 registers (d8-d15), an extra 16 bit word attached to the execution set (the set of instructions executed in parallel, that is, in the same clock cycle) is used to identify the upper bank register usage. This extra word (used to group instructions in an execution set) is called a prefix overhead. In DSP applications, coefficients or constant data values are commonly used in computations. In some processor architectures, immediate addressing requires the extra 16 bit coefficient (or constant) word plus a prefix word to define the execution set. Alternatively, the compiler may choose to use register addressing and load the coefficient from data memory if a register is available. If we ignore the additional prefix savings of registered coefficient addressing, the registered coefficient would reduce code size by one or more words in two cases: first, if the registered coefficient is accessed more than once; second, if a second, different coefficient is stored in adjacent and aligned memory and loaded with this coefficient using a dual memory access instruction (thus hiding the second memory access cost). Coefficients occur frequently in many DSP applications. In summary, the multiword memory access instructions have an important impact on code size, which in turn affects cost, energy, and power dissipation. The need to decrease time to market, development costs, and maintenance costs demands efficient high level language compilation for these sophisticated DSP processors, which impose these multiword access constraints. All of these factors imply several challenges in writing efficient code generators for such DSP processors.
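The two code size effects described above can be made concrete with a small Python sketch. It is a simplified model built on assumptions that go beyond the text: every instruction occupies one 16 bit word, a prefix word is charged only when an operand lies in the upper bank, each immediate use of a coefficient costs one extra word, and each load instruction costs one word. The function names are hypothetical.

```python
LOWER_BANK_SIZE = 8  # d0..d7 are addressable with 3 bits per operand


def execution_set_words(instruction_words, registers_used):
    """Words in one execution set, plus one prefix word if any operand is in d8..d15."""
    needs_prefix = any(r >= LOWER_BANK_SIZE for r in registers_used)
    return instruction_words + (1 if needs_prefix else 0)


def coefficient_words_saved(immediate_uses, load_instructions):
    """Words saved by registering coefficients under the simplified model above.

    immediate_uses: immediate coefficient words removed from the code.
    load_instructions: load instructions added to bring coefficients into registers.
    """
    return immediate_uses - load_instructions
```

Under this model a coefficient used twice and loaded once saves one word, and two coefficients used once each but loaded together by one dual access also save one word, which corresponds to the two cases named in the text.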
II. PROBLEM DESCRIPTION AND RELATED WORK
The following problem, Problem 1, is an important part of the code optimization problem for DSP processors and is the subject of this paper. For the problem definition we assume that there is one target DSP processor or core with a defined instruction set architecture, such as the Star*core processor [3]. The target processor supports dual and quad memory accesses (of aligned and sequential 16 bit words) with aligned data register constraints. In the general application, it is assumed that some data has a predetermined memory layout (for example, input data or output data of a DSP algorithm), with a fixed or flexible alignment. Other data may have no predetermined memory layout or alignment, such as data coefficients, which can be placed in memory by the designer, compiler, or CAD tool.
Problem 1: Assume we are given initial assembly code generated for the target processor, including memory layout (and alignment) of the data. The problem is to improve the register allocation and binding of this code such that the code size is minimum (or the utilization of the memory bandwidth of the DSP processor is maximum). Specifically, allocate and bind registers, with constraints of both fixed and flexible data memory layout, to maximize the number of quad and dual memory access instructions (and determine how to address coefficients, through parallel memory accesses or as immediates) such that the total resultant code size is minimized. Extensions to this problem include handling loops, conditional code, and data structures.
Although many researchers have studied code generation for DSP processors or ASIPs [4], fewer have studied memory-access code generation. A number of researchers have examined allocation of address registers [5, 6], data layout [1], and index registers [7]. Researchers in [2] introduced a transformation of C code using pointer based code instead of array based code to make better use of address calculation units on DSPs. Register binding and allocation for low energy was studied in [8] using network flow; however, memory accesses were not considered. Register allocation, binding, and scheduling techniques are discussed in [9]. A network flow formulation in [6] was used to allocate address registers for DSP processors with significant savings in code size. Researchers in [10] perform register binding and instruction scheduling to minimize the number of registers and find tight schedules through the addition of sequencing relations. Memory bandwidth minimization for video and communications applications is studied in [11], in order to reduce the cost and complexity of memory architectures. An outline of other research in memory optimizations can be found in [12]. Dynamically maximizing the utilization of the memory bandwidth of a data cache, using a hardware approach, is described in [13].
In this manuscript a new problem, memory-access register allocation and binding, is defined and solved. Unlike previous research, we study the following problem: given flexible and fixed data layouts in memory, generate optimal memory-access code to minimize code size. A maximum cost flow technique is used to obtain solutions in polynomial time. The next section outlines the assumptions and terminology used in the rest of the paper.
III. TERMINOLOGY AND ASSUMPTIONS
The following terminology will be used in this paper. Variables, i, can be data stored in registers (produced as the result of an instruction, accessed by other instructions, modified by multiply-accumulate type instructions, or moved to/from memory) or coefficients used in the code ($i \in K$, where the coefficient value is represented by value(i)). A variable, i, can have a number of attributes: a lifetime, which is the range of time from define_time(i) to last_used_time(i), where define_time(i) is the cycle in which the variable is defined and last_used_time(i) is the cycle in which the variable is last used; and access_times(i), the set of cycles during which the variable is accessed or used. Additional terminology used in the optimization discussions is as follows. $G=(V,A)$ is a graph composed of vertices V and arcs A. The variable $x_{i \to j}$ is the flow in the arc $i \to j$ from vertex i to vertex j, where $i, j \in V$ and $i \to j \in A$. The value $c_{i \to j}$ is the capacity on arc $i \to j$. The value $e_{i \to j}$ is the cost (or gain) per unit of flow on arc $i \to j$. There are two special vertices in this graph, called vertex s and vertex t. The flow out of vertex s is equal to the flow into vertex t, and for all other vertices the flow in is equal to the flow out. Arcs incident to s only leave vertex s, and arcs incident to vertex t are only directed into vertex t.
The maximum cost flow problem is to fix the amount of flow through the graph and maximize the sum over all arcs of the flow multiplied by the cost. As long as the capacities, $c_{i \to j}$, and the lower bounds on the flow, $l_{i \to j}$, are integer, we are guaranteed to obtain integer flows in the solution of this problem [14]. This problem can be solved in polynomial time using linear programming or network algorithms. In this paper we refer to the maximization of cost (to be consistent with the application of network flow); however, it is more appropriate to use the term gain in place of cost, since we maximize the gain in utilization of memory bandwidth or the gain in code size savings.
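To make the fixed-flow maximum cost problem concrete, the following is a minimal Python sketch. It is not the solver used in this paper, which relies on linear programming; it substitutes a standard successive shortest path method with Bellman-Ford on negated gains. It assumes integer capacities, no lower bounds on flow, and no directed cycle of positive total gain (true for the acyclic graphs built from lifetimes). The function name and the arc tuple layout are hypothetical.

```python
def max_cost_flow(num_vertices, arcs, s, t, flow_required):
    """Send exactly `flow_required` units from s to t with maximum total gain.

    arcs: list of (tail, head, capacity, gain) with integer vertex ids.
    Returns (total_gain, flow_per_arc), or None if the flow cannot be routed.
    """
    to, cap, cost = [], [], []
    adj = [[] for _ in range(num_vertices)]
    for u, v, c, e in arcs:
        # Residual edges are stored in pairs: index k is forward, k ^ 1 is backward.
        adj[u].append(len(to)); to.append(v); cap.append(c); cost.append(-e)
        adj[v].append(len(to)); to.append(u); cap.append(0); cost.append(e)

    inf = float("inf")
    sent, total_cost = 0, 0
    while sent < flow_required:
        # Bellman-Ford: cheapest residual path, i.e. highest gain augmenting path.
        dist = [inf] * num_vertices
        prev_edge = [-1] * num_vertices
        dist[s] = 0
        for _ in range(num_vertices - 1):
            changed = False
            for u in range(num_vertices):
                if dist[u] == inf:
                    continue
                for k in adj[u]:
                    if cap[k] > 0 and dist[u] + cost[k] < dist[to[k]]:
                        dist[to[k]] = dist[u] + cost[k]
                        prev_edge[to[k]] = k
                        changed = True
            if not changed:
                break
        if dist[t] == inf:
            return None  # the required amount of flow is infeasible

        # Find the bottleneck along the path, then augment.
        push = flow_required - sent
        v = t
        while v != s:
            k = prev_edge[v]
            push = min(push, cap[k])
            v = to[k ^ 1]
        v = t
        while v != s:
            k = prev_edge[v]
            cap[k] -= push
            cap[k ^ 1] += push
            v = to[k ^ 1]
        sent += push
        total_cost += push * dist[t]

    # The flow on arc i equals the capacity accumulated on its backward edge.
    flows = [cap[2 * i + 1] for i in range(len(arcs))]
    return -total_cost, flows
```

With unit capacities, as used later in the allocation graph, each unit of flow traces one path from s to t, so the returned per-arc flows can be read directly as register chains.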
IV. METHODOLOGY AND MODELING
This section briefly describes the methodology for DSP memory-access code optimization (Problem 1) and how the maximum cost flow formulation is used to solve it. The methodology first determines a lower bound on the number of registers, R, required in the application code (based upon variable lifetimes, using network flow). It then uses this number as a fixed amount of flow and solves a maximum cost network flow problem whose cost is the number of dual memory access instructions plus the number of words saved from registered coefficients. It next proceeds with the same fixed amount of flow and maximizes a cost equivalent to the number of quad memory access instructions in the application. In the final stage of the methodology, with the dual and quad memory accesses fixed (from the solutions of the two previous maxcost flow problems), the register binding is performed by solving several network flow problems with fixed flows ranging from 1 to |R|-2(#dual)-4(#quad) and costs representing the sum of the number of accesses of each variable in the binding. This final cost attempts to minimize the number of accesses to higher numbered data registers, since accesses to registers in the upper bank have a prefix overhead [3] (an extra 16 bit word which increases the code size). Thus we use the cost formulation to allocate more frequently accessed variables into the lower bank of registers.
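The first step, a lower bound on the number of registers, can be illustrated with a short Python sketch. Note that this sketch does not use network flow as the methodology does. It substitutes a sweep over lifetime endpoints that counts the peak number of simultaneously live variables, assuming each lifetime is a single interval with define_time(i) < last_used_time(i) and that a register freed in a cycle can be reused by a variable defined in that same cycle. The function name is hypothetical.

```python
def register_lower_bound(lifetimes):
    """Peak number of simultaneously live variables.

    lifetimes: iterable of (define_time, last_used_time) pairs.
    """
    events = []
    for define_time, last_used_time in lifetimes:
        events.append((define_time, 1))       # variable becomes live
        events.append((last_used_time, -1))   # variable dies
    # At equal times the -1 event sorts first, so a freed register is reused.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak
```

For lifetimes (0,3), (1,4), and (3,5) the peak is 2, since the third variable can reuse the register released by the first.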
A. Allocation
To illustrate how Problem 1 can be modeled as a flow problem, we first have to develop a network flow graph. Each vertex, $v_i \in V$, in the graph corresponds to a variable, i, in the assembly code. Vertices s and t are added to the graph, representing times 0 and s+1 respectively, where in the latter expression s is the number of execution sets in the application assembly code. Arcs from the s vertex to all vertices in the graph (except vertex t) are added. Arcs are formed from each vertex $v_i$ to all other vertices $v_j$ such that last_used_time(i) ≤ define_time(j) (in other words, variables i and j can share the same register since their lifetimes do not overlap). The capacities of all arcs are set to 1. In the maxcost solution, each path of flow through vertices represents one register which stores the respective variables. For illustration purposes the data registers are represented by d0 through d7. Single memory access instructions (load or store) will be represented by move.f. Dual and quad memory access instructions will be represented by move.2f and move.4f respectively.
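The graph construction described above can be sketched in Python as follows. The sketch assumes lifetimes are given as (define_time, last_used_time) pairs keyed by variable name, and it additionally assumes an arc from every variable vertex to t, which the text implies but does not state at this point. The function name and the string labels "s" and "t" are hypothetical.

```python
def build_allocation_graph(lifetimes):
    """Return unit-capacity arcs (tail, head, capacity) of the allocation graph.

    lifetimes: dict mapping variable name -> (define_time, last_used_time).
    """
    arcs = []
    for name in lifetimes:
        arcs.append(("s", name, 1))  # any variable may start a register chain
        arcs.append((name, "t", 1))  # assumed: any variable may end a chain
    for i, (_, last_i) in lifetimes.items():
        for j, (def_j, _) in lifetimes.items():
            # i and j may share a register when i is dead before j is defined.
            if i != j and last_i <= def_j:
                arcs.append((i, j, 1))
    return arcs
```

For example, with lifetimes a=(0,2), b=(2,4), and c=(1,3), only the sharing arc a→b is created, because c overlaps both a and b.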
Variable lifetimes are extracted from the compiler generated assembly code to create the network flow graph. Additional vertices are added into this graph in order to increase the number of multiword memory accesses and support possible registered coefficients. The maximization of multiword memory accesses will be briefly described. Consider figure 1, where the horizontal axis represents the memory layout (the first box representing an even address, followed by three sequential memory addresses) and the vertical axes of figures 1a) and 1b) each represent part of the assembly code, with clock cycles increasing downwards. Each circle represents a vertex in our network flow graph, which in this example is a variable being loaded from memory (at the address identified on its x-axis, with the code located on the y-axis). Arcs into and out of the circles are not shown for simplicity. The empty circles shown in figure 1a) represent the additional vertices placed in the network flow graph. In figure 1a) an empty circle represents an earlier load of a variable, which is represented by the circle located vertically below it and connected to it by the dashed arc. In the network flow solution, a flow through these circles (see figure 1b)) represents a multiword memory access instruction (since the earlier memory access can be performed in parallel with the other memory access). Thus a cost of one is assigned to this arc (connected from the empty circle to the circle below it). Part of the network flow graph is shown in figure 1c), and the flows, indicating an allocation, are shown in figure 1d). The final two dual accesses and partial assembly code are shown in figure 1b). In the network flow problem we maximize the number of multiword accesses, which is the sum of the costs on the dashed arcs multiplied by the flow on these arcs.
For example, a flow through these arcs transforms the four move.f (single word memory access) instructions in figure 1a) into two move.2f (dual memory access) instructions in figure 1b). The total cost is the savings in code size in words (i.e. the total number of move.f words saved, since two are replaced by one move.2f instruction) or, equivalently, the number of dual memory access instructions allocated. In the next step, illustrated in figure 2, we take two move.2f instructions and add one arc with a cost of one to create one move.4f, or quad memory access, instruction, as shown in figure 2b). Again, the objective of this second network flow problem is to maximize the number of quad memory access instructions allocated. Note that in figure 2b), if the flow in the arc of cost 1 is zero, then we still have a valid allocation of memory access instructions, since we would use two dual instructions (illustrated by the dashed circle).
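The code size arithmetic of this two step transformation can be checked with a small Python sketch. It assumes, as stated in the introduction, that every memory access instruction occupies one 16 bit word regardless of how many data words it moves. The function name is hypothetical.

```python
def memory_access_instructions(num_words, duals=0, quads=0):
    """Instruction words needed to move `num_words` data words.

    duals and quads are the numbers of move.2f and move.4f instructions used;
    every remaining data word is moved by its own move.f instruction.
    """
    singles = num_words - 2 * duals - 4 * quads
    if singles < 0:
        raise ValueError("dual and quad accesses cover more words than exist")
    return singles + duals + quads
```

Moving four words therefore costs 4 instruction words with move.f only, 2 with two move.2f instructions, and 1 with a single move.4f instruction, so each dual allocated saves one word and merging two duals into a quad saves one more.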
For applications which use immediate addressed data coefficients, two methods exist for minimizing code size. First, we can hide memory accesses of coefficients by incorporating their access in dual or quad memory accesses, and therefore save code size. Secondly, if a coefficient is used more than once in the code, one or more words are saved by loading the coefficient once into a register and accessing it several times afterwards directly from the register. The extra vertices incorporated into the graph in this case represent one vertex per coefficient (where an arc from one coefficient to another coefficient of the same value has a cost of 1). The single vertex can have a flow of one or zero through it, which represents a registered or an immediate addressed coefficient respectively. For example, in figure 3, two coefficients with the same value are accessed in the BEFORE column. Each is represented by a vertex, with an arc of cost 1 between them. The lines in the graph indicate a possible solution where the flow is one, shown in the AFTER column on the right hand side, where 1 word is saved. Furthermore, if a coefficient is already registered (from an earlier use of network flow or as determined by the compiler) there is the possibility of transforming a single move.f into a dual memory access, move.2f, thus hiding another coefficient access. This is performed by adding a second extra vertex for this coefficient. For example, in figure 4, by adding two vertices (a hollow circle with an arc of cost 1 connected to the coefficient's vertex representing the access in the BEFORE code), the coefficient #-7979 is register addressed in the network flow solution of this graph. The solution, shown with non-dotted arcs, transforms the code into the AFTER column, which utilizes a dual memory access.
Figure 5 illustrates a different solution of the same graph as in figure 4, where, given the fixed constraint on the number of registers (or fixed flow through the graph), it is not possible to support the dual load. However, in this case the second coefficient is still registered because code size is saved through use of the registered coefficient accessed later in the code (see the arc of cost 1 at the bottom of the figure). In figures 3, 4, and 5 a flow of one is used to show only a partial solution of the network flow problem, for illustration purposes.
For simplicity, dual memory accesses (of data or coefficient data) will next be used to illustrate the initial network flow model. Let the original graph presented above be represented by G, where the vertices and edges are V(G) and E(G) respectively. Let two nodes, $v_i$ and $v_j$, represent data moves from registers to a fixed memory layout, at sequentially aligned addresses d+0 and d+2, byte addressable (or vice versa for loads). Assume that the earliest (or latest, for loads) cycle that one variable (i.e. j) can move is at cycle t and that cycle t is greater than the latest cycle the other variable (i.e. i) can move. An extra node (i.e. [...] i). Let the set of variables which can move to an earlier cycle for loads be L and the set which can move to a later cycle for stores be S. The new graph, G1, is defined and costs for dual memory accesses and coefficients are also shown below:
The complete formulation of the maximum cost flow problem will now be presented (Let X=K∪ L∪ S).
The first equality represents the conservation of flow. The next two equalities ensure that all data variables (except coefficients and earlier memory access variables) are allocated into registers. The next two inequalities allow coefficients to use immediate or register addressing, and allow allocation of single or dual memory accesses, with flows of 0 or 1 respectively. This model can be extended to quad memory accesses as described earlier, using the same network flow structure (see figure 2).
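For orientation, one formulation consistent with the prose description above can be written as follows. This is a sketch assembled from the surrounding definitions and should not be read as the paper's exact equations: the fixed-flow constraint at vertex s, the use of R as the amount of flow, and the way the equalities and inequalities are split between vertices outside and inside X are assumptions. Here V and A denote the vertices and arcs of the augmented graph.

```latex
\begin{align}
\text{maximize}\quad
  & \sum_{i \to j \in A} e_{i \to j}\, x_{i \to j} \\
\text{subject to}\quad
  & \sum_{j} x_{j \to i} - \sum_{j} x_{i \to j} = 0,
  && \forall i \in V \setminus \{s, t\} \\
  & \sum_{j} x_{j \to i} = 1, \qquad \sum_{j} x_{i \to j} = 1,
  && \forall i \in V \setminus (X \cup \{s, t\}) \\
  & \sum_{j} x_{j \to i} \le 1, \qquad \sum_{j} x_{i \to j} \le 1,
  && \forall i \in X \\
  & \sum_{j} x_{s \to j} = R, \qquad 0 \le x_{i \to j} \le c_{i \to j},
  && \forall i \to j \in A
\end{align}
```

With all capacities equal to 1 and integer data, an optimal solution can be taken with every $x_{i \to j}$ equal to 0 or 1, so a vertex in X either carries one unit of flow (registered coefficient or multiword access) or none (immediate addressing or single access).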
However even when the dual and quad memory accesses are maximized using the technique described above, the final code cannot be generated until register binding is performed (which specifically assigns a data register from the bank to each variable), which will be described next.
B. Multiple Access Register Binding
After the network flow has been used to allocate registers (for data and/or coefficients) and dual/quad memory accesses, the register binding problem must be solved. In figure 6 the possible register bindings for each dual and quad memory access field are shown. For the register binding problem, we also use a network flow formulation. Each flow through our graph is a binding of a register di to those variables represented as vertices the flow goes through.
The methodology incrementally binds one or more registers at a time. We consider even and odd flows, which are specifically bindings of even (d0,d2,...,d6) or odd (d1,d3,...,d7) numbered registers respectively. In figure 6a) one quad memory access (shown by four boxes in a row at the top, representing the four memory layout locations, with odd numbered registers shaded) and three dual memory accesses are shown. In the network flow graph, vertices for all the other data variables are part of the flow graph; however, they are not shown in the figures for simplicity. In figure 7a) the total flow going in from the top vertex (s, shown as a circle) and out of the bottom vertex (t, shown as a circle) is 2, which we call the even flow since it represents flow through two even data registers (d0,d2). A maximum cost flow problem is solved where the number of accesses of each variable is the cost of that vertex, which is maximized. In figure 7b) the flows of the solution are shown. In this case we have allocated data registers d0 and d2 as shown. Next we must bind data registers d1 and d3. For this we solve two network flow problems, the first with a single flow through the d1 variables of dual or quad memory accesses, as shown in figure 7c), and the second for d3, shown in figure 7d).
Consider two quad memory accesses, where the variables in one quad memory access do not have lifetimes which overlap with any variable in the second quad memory access variable set. We will use the term nonoverlapped quad memory accesses for this case. Furthermore, if these two quad memory accesses have fixed memory layouts then a single flow formulation must be used, as shown in figure 8a). In figure 8b) the quad memory accesses overlap, thus flows of 4 can be used through the even variables. After all dual and quad memory access variables are bound to registers, the remaining variables are allocated using network flow. We can now formulate the register binding network flow problem. We define the set of variables which have been given the possibility of flow as set F, where variable $k \in F$. If a variable cannot be bound or allocated in the specific network flow problem, we write $k \notin F$. We define our costs, $e_{i \to j}$, as the number of accesses of variable j (as explained earlier). For example, if we are solving the problem in figure 7b), we define the variables $k \in F$ as all even variables of quad or dual memory access instructions with fixed memory layouts, all variables of dual or quad memory access instructions with flexible memory layouts, and all other variables in the graph extracted from variables of the assembly code. The set of variables with no flow, $k \notin F$, are the odd numbered variables in dual/quad memory access instructions with fixed memory layouts. We denote a variable, k, which is a member of a quad or dual access instruction, i, as $k \in Q_i$, where $Q_i$ represents the set of variables in multiword memory access instruction i. Finally, R is the number of registers we are currently binding (for example, R=1 for the problem in figure 8a), whereas R=4 in figure 8b)).
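The intent of the access-count cost, placing frequently accessed variables in low numbered registers, can be illustrated with a simple Python sketch. This is a greedy stand-in and not the network flow binding described above: it assumes the register chains (sets of variables sharing one register) are already known and unconstrained by dual or quad accesses, and it simply gives the chain with the most accesses the lowest free register. All names are hypothetical.

```python
def bind_remaining(chains, access_counts, free_registers):
    """Bind each chain of variables to one free register index.

    chains: list of lists of variable names that share a register.
    access_counts: dict mapping variable name -> number of accesses.
    free_registers: register indices not yet bound.
    Returns a dict mapping tuple(chain) -> register index.
    """
    if len(chains) > len(free_registers):
        raise ValueError("not enough free registers for all chains")
    # Most frequently accessed chains first, so they land in the lower bank.
    ranked = sorted(chains,
                    key=lambda chain: sum(access_counts[v] for v in chain),
                    reverse=True)
    return {tuple(chain): reg
            for chain, reg in zip(ranked, sorted(free_registers))}
```

In this sketch a chain with 5 accesses is placed ahead of chains with 4 and 2 accesses, so it would avoid the upper bank, and its prefix overhead, whenever a lower bank register is still free.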
If quad or dual loads or stores have flexible memory layouts then this can be supported in the network flow formulation. For dual or quad load instructions the total flow into the 2 or 4 variables being loaded is set to 1 or 2 respectively. For example, in figure 9 the flow into the quad memory access variables is fixed to two. Mathematically, we add the following constraint for each dual or quad load instruction i:
$$\sum_{k \in Q_i} \; \sum_{j:\, j \to k \in A} x_{j \to k} = W,$$
where W=1 for dual loads and W=2 for quad loads. Dual and quad stores are similar, except that the flow constraint is set on the output arcs leaving the dual/quad variable vertices (since the lifetimes of these variables end on the same clock cycle).
In general, more complex codes which have nested loops with dual and quad memory accesses are supported in this framework. In these cases the innermost nested loops are used to solve single flow maximum cost flow problems for each loop feedback variable. Multiword memory access instructions inside nested loops can be supported by early binding of the feedback variables to d0, d1, etc. for each run of the network flow problem.
V. EXPERIMENTAL RESULTS
Several DSP applications are used to illustrate this methodology. The notation bqa, red, dct, fir, pad refers to a biquad filter, a polynomial modular reduction algorithm (used in cryptography), the discrete cosine transform, a linear FIR filter, and a polynomial addition algorithm (also used in cryptography) respectively. Code was generated using the Star*core C compiler [3] for red, dct, and pad, and the assembler for the hand-coded fir and bqa examples. The original C compiler generated dct assembly code contains 39 execution sets, and uses only single memory access instructions. The fir example has dual nested loops with high degrees of parallelism (execution sets of 7 words each cycle), using dual and single moves. The maximum cost flow was solved on a Sun workstation using a linear programming solver [15]. All optimizations ran in under 5 cpu seconds in total for each application on a 300 MHz UltraSparcIIi Sun workstation. Table 1 illustrates the optimized results of solely minimizing code size for registered or immediate addressing of coefficients, Opt-name, compared to the Star*core C compiler generated code, Comp-name. The original compiler generated code used 12 words for immediate addressing of coefficients. This application involved a loop in which 8 coefficients were accessed. The table gives the total number of registers, R, the number of words saved through registered coefficients only (#words), and the number of words saved due to both registered coefficients and multicoefficient accesses (#wordsMA). The results were generated automatically from the assembly code using the maxcost formulation and using the same memory layout and alignment for data input and data output as the compiler did. The bqa example was optimized three times, each time allowing a larger number of registers, R, to illustrate the reduction of immediate word overheads.
For example, the fourth row of table 1 indicates that for the same number of registers as the compiler generated code, a total of 8 16 bit words are saved. The last row indicates that for 12 registers, 11 out of 12 extra words are saved. Table 2 illustrates the percent improvement in memory bandwidth utilization of the optimized results, Opt-name, compared to the C compiler generated code, Comp-name. The number of registers, R, is given in each row. After the first application of maximum cost flow (or in the original compiler generated code for the Comp- rows), the number of dual memory accesses, $D_1$, and the total number of memory access instructions, $M_1$, are shown. After the second application of maximum cost flow, followed by register binding, the number of dual memory accesses, $D_2$, the number of quad memory accesses, $Q_2$, and the final number of memory access instructions, $M_2$, are also shown in the table. The improved memory bandwidth is shown as %BW, which we define as the number of memory access instructions in the original compiler generated code divided by the number of memory access instructions in the optimized code (after network flow optimization), shown as a percent. Results were generated automatically from assembly code using the maxcost formulation and using the memory layout and alignment for data input and data output specified by the compiler. To explore the relationship between increasing multiword memory accesses and the total number of registers used, the network flow problems were solved with different amounts of flow (or number of registers). As shown in the table, the multiword memory accesses can often be improved through an increase in the number of registers used.
The register binding for the dct example, with 12 registers maximum, had three quad memory accesses, where one was a quad access of coefficients with flexible memory layout (as in figure 9) and two were quad memory accesses of input data. The register binding methodology solved an initial network with a flow of 4 (since the two quad memory accesses overlapped, similar to figure 8b)), two network flow problems each with a flow of one for each odd data register (as in figure 7b)), one final odd data register network with a flow of 2, and one last network flow for all remaining variables, which was a flow of 4. Each flow problem was solved in under one second. The optimized dct was up to 1.76 times (37/21) more efficient in terms of memory accesses.
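The ratio quoted for dct can be reproduced with a one line computation. The Python sketch below assumes the ratio is the number of memory access instructions in the compiler generated code divided by the number in the optimized code, as in the definition of %BW given for table 2. The function name is hypothetical.

```python
def bandwidth_ratio(m_original, m_optimized):
    """Ratio of memory access instruction counts, original over optimized."""
    if m_optimized <= 0:
        raise ValueError("optimized code must contain at least one memory access")
    return m_original / m_optimized


dct_ratio = bandwidth_ratio(37, 21)  # about 1.76, matching the figure in the text
```

Read this way, a ratio of 1.76 corresponds to roughly a 76% improvement over the compiler generated code, in the same sense that a ratio of 1.87 corresponds to the 87% improvement quoted for the best case.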
The red example used 32 bit word memory accesses, so only dual memory accesses were supported (since quad accesses were only supported for 16 bit words, using a 64 bit bus in [3]). In this example, up to 1.87 times improvement in memory accesses was observed. The hand-coded fir examples had different memory accesses, which reduced the overall code size by 1 word. This example had a loop which involved solving the network flow problem for interloop variables with single flows, in addition to constraining any dual and quad memory access allocations. All results shown were verified and included loop support and register constraints on dual and quad memory accesses. A bar chart is included in figure 10 to further illustrate the impact of dual and quad memory accesses on memory bandwidth utilization.
The pad example in table 2 was also executed in hardware (using a Star*core chip on a development board) in order to measure the effect on power dissipation, using the experimental setup of [16, 17]. The memory bandwidth optimization technique reduced energy dissipation by more than one third, from 4.8 nJ to 3.0 nJ. This energy reduction was due to improved performance (15 cycles versus 26 cycles for the compiler generated code) as a result of the elimination of execution sets through allocation of dual loads and dual stores (and the elimination of some address register load instructions) within the main loop of the pad example. Figure 11 illustrates the savings in code size (bytes), latency per loop iteration (ns/LI), and energy (nJ). A power trace in figure 12, using the setup in [16], is also provided, showing how power increases for higher parallelism, yet latency drops, providing an overall improvement in energy.
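The size of these reductions follows directly from the measured values. The short Python sketch below computes the relative reduction for the energy and cycle counts reported for the pad example. The function name is hypothetical, and the inputs are the figures given in the text.

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after` (0.25 means 25% lower)."""
    if before <= 0:
        raise ValueError("the reference value must be positive")
    return (before - after) / before


energy_reduction = relative_reduction(4.8, 3.0)  # energy in nJ, about 0.375
cycle_reduction = relative_reduction(26, 15)     # cycles, about 0.42
```

The energy figure therefore falls by about 37.5% and the cycle count by about 42%.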
VI. DISCUSSION AND CONCLUSIONS
In summary, code size overheads (including memory accesses) were improved up to 1.8 times and memory bandwidth utilization improved up to 87%, using the maximum cost network flow formulation of the memory-access code problem. Unfortunately, there is no previously published research to compare with; however, the results reported were compared with the Star*core C compiler [3]. The network flow improved codes showed significant improvements. The technique presented in this paper performs register allocation and binding to minimize code size. Furthermore, only memory bandwidth utilization and its impact on code size reduction were explored in this problem. It is possible that the improved memory bandwidth may also lead to improvements in application throughput (through rescheduling). Although the paper has illustrated the network flow technique for 8 registers, it was implemented for 16 registers in the experimental section. For more than 16 registers, a larger number of network flow problems would have to be solved for the binding problem. However, in the worst case for N registers, on the order of N network flow problems would be solved: one initial flow to find a lower bound on registers, two network flows to allocate dual and quad memory accesses, and at most N network flow problems for the N data registers. It is also interesting that the memory access ordering and layout may have a significant impact on the solution of the network flow problem. The network flow approach for scheduling multiple memory accesses presented in this paper may also be applied to increase the utilization of memory bandwidth described in [13], by increasing the number of candidate words to be accessed from the cache row buffer.
The approach can be extended for loop support. Although several network flow problems are solved, each can be solved in polynomial time with faster network solvers. For loop support this number would increase; however, the network flow graphs would be of different sizes. For example, one would solve the network flow on smaller graphs representing variables in innermost nested loops and merge the flows into a vertex in the graph at a higher level (outside of the nested loops) to complete the register binding. In all cases the register binding network flows did not increase the total number of registers being allocated. Many applications examined involved loops and nested loops which showed significant code size savings after the network flow methodology. It is possible that in these cases a savings of energy would occur, since fewer memory accesses would be required through many loop iterations. Future research will examine more efficient techniques for handling complex loops and conditionals.
The formulation of a new problem and a maximum cost flow approach to solving it have been presented in this paper. Unlike previous research, we have studied register allocation and binding for multiword memory access. Important code size savings have been presented. Notably, even a hand-coded assembly program has shown improvement with this new approach. We have introduced a new methodology for minimizing code size. It is applicable to other DSP processors (both fixed point and floating point ones) in addition to Star*core. For the first time, code-size-minimized memory-access code can be generated, by maximizing the utilization of the memory bandwidth of the DSP processor, using this new application of the maximum cost flow technique.
ACKNOWLEDGMENT
The author would like to thank the reviewers, and editors for their helpful comments. This research was supported in part by grants from NSERC, Motorola, and CITO.
