This paper presents a new transformation for the scheduling of memory-access operations in High-Level Synthesis. This transformation is suited to memory-intensive applications with synthesized designs containing a secondary store accessed by explicit instructions. Such memory-intensive behaviors are commonly observed in video compression, image convolution, hydrodynamics and mechatronics. Our transformation removes load and store instructions which become redundant or unnecessary during the transformation of loops. The advantage of this reduction is a decrease in secondary memory bandwidth demands. This technique is implemented in our Percolation-Based Scheduler, which we used to conduct experiments on core numerical benchmarks. Our results demonstrate a significant reduction in the number of memory operations and an increase in performance on these benchmarks.
Introduction
Traditionally, one of the goals in High-Level Synthesis (HLS) is the minimization of storage requirements for synthesized designs [5, 7, 15, 19]. As the focus of HLS shifts towards the synthesis of designs for inherently memory-intensive behaviors [6, 8, 14, 16, 18], memory optimization becomes crucial to obtaining acceptable performance. Examples of such behaviors are abundant in video compression, image convolution, speech recognition, hydrodynamics and mechatronics. The memory-intensive nature of these behaviors necessitates the use of a secondary store (e.g., a memory system), since a sufficiently large primary store (e.g., register storage) would be impractical. This memory is explicitly addressed in a synthesized system by memory operations containing indexing functions. However, due to bottlenecks in the access of memory systems, memory-accessing operations must be effectively scheduled so as to improve performance.
Our strategy for optimizing memory access is to eliminate the redundancy found in memory interaction when scheduling memory operations. Such redundancies can be found within loop iterations, possibly over multiple paths, as well as across loop iterations. During loop pipelining [17, 13], redundancy is exhibited when values loaded from and/or stored to the memory in one iteration are loaded from and/or stored to the memory in future iterations.
[Introductory example: a nested loop over i and j; listing not recovered.]
The inner loop would normally require four load and two store instructions per iteration. However, after application of our transformation, the inner loop contains only one load and one store.
This work was supported in part by NSF grant CCR8704367, ONR grant N0001486K0215 and a UCI Faculty Research Grant.
Previous work on reducing memory accesses balances load operations with computation [3]. However, that algorithm only removes redundant loads and only deals with single-dimensional arrays and a single flow of control. In [6] a model for access dependencies is used to optimize memory system organization. In [16] background memory management is discussed, but no details of an algorithm are presented. Therefore, it is not clear what approach is taken in determining redundancy removal, nor how generally applicable the technique is. Related work includes the minimization of registers [8, 14], the minimization of the interconnection between secondary store and functional units [11], and the assignment of arrays to storage [18].
Our transformation has many significant benefits. By eliminating unnecessary memory operations that occur on the critical dependency path through the code, the performance of the resulting schedule can increase dramatically: the length of the critical path can be shortened, thus generating more compact schedules and reducing code size. Also, due to our transformation's local nature, it integrates easily into other parallelizing transformations [4, 13]. Another benefit is the possible savings in hardware due to the decrease in memory bandwidth requirements and/or the exploration of more cost-effective implementations.
Program Model
In our model, a program is represented by a control data-flow graph (CDFG) where each node corresponds to operations performed in one time step and the edges between nodes represent flow of control. Initially, each node contains only one operation. Parallelizing a program involves the compaction of multiple operations into one node (subject to resource availability).
Memory operations contain an indexing function, composed of a constant base and induction variables (iv's), and either a destination for load operations or a value for store operations (see footnote 1). The semantics of a load operation are that issuing a load reserves the destination (local storage) at issuance time (i.e., the destination is unavailable during the load's latency). For the purpose of dependency analysis, each memory operation carries a symbolic expression, a string that formulates the indexing function without iv's. (During loop pipelining these expressions must be updated w.r.t. iv's.)
The initial analysis algorithm in Fig. 1 computes initial program information. Detecting loop invariants and iv's and building iv use-def chains can be done with standard algorithms found in [1], with the results stored in a database. The function build_symbolic_exprs creates symbolic expressions for each memory operation in the program by getting the iv definitions that define the current operation's indexing function and deriving an expression for each. Next, the base of the memory structure is added to each expression. An operation is then annotated with its expression, combining multiple expressions into one of the form "((expr1) or ... or (exprN))".
The function derive_expr constructs the expression "(LoopId * Const)" if iv is self-referencing (e.g., i = i + const), where LoopId is the identifier of the loop over which iv inducts and Const is a constant derived from the constant in the iv operation multiplied by a data size and possibly other variables and constants (see footnote 2). In the introductory example, the data size for the j loop is the array element size and for the i loop is the size of a column (or row) of data. If iv is defined in terms of another iv (e.g., i = j + 1, where j is an iv), then recursive calls are made on all definitions of that other iv. In this case, marking of iv's is necessary to detect cyclic dependencies, which are handled by a technique called variable folding. Essentially, variable folding determines the initial value of a variable on input to the loop or resulting from the first iteration (i.e., values which are loop-carried are not considered) from the reverse flow of the graph. The result can be a constant or another variable (which is recursively folded until the beginning of the loop is reached).
Fig. 2 shows a sample behavior and its CDFG annotated with symbolic expressions. The load from A builds the expression "((8 * L0) + (4 * L0))", which is the sum of 2 (the const for iv j) times 4 (the element size) and 1 (the const for iv i) times 4. The second loop over k adds the expression "(400 * L1)". Finally, the base address of A is added. For the store operation, the expression "(12 * […]
Footnote 1: We use the term argument to refer to the destination if the operation is a load or to the value if the operation is a store.
Footnote 2: For clarity we present a simplified algorithm. More complex analysis (based on [12]) has been implemented in our scheduler.
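The arithmetic behind the Fig. 2 load expression can be sketched as follows. This is a minimal illustration and not the scheduler's implementation: representing a term as a (loop identifier, constant) pair, the helpers `combine_terms` and `format_expr`, and the split of the k-loop constant 400 into a step of 1 and a data size of 400 are all assumptions made for the example.

```python
from collections import defaultdict

def derive_expr(loop_id, iv_step, data_size):
    """Build the term (LoopId * Const) for a self-referencing iv.

    Const is the iv's step constant multiplied by the data size.
    The (loop_id, const) tuple is a hypothetical representation.
    """
    return (loop_id, iv_step * data_size)

def combine_terms(terms):
    """Add together the constants of terms that induct over the same loop."""
    coeffs = defaultdict(int)
    for loop_id, const in terms:
        coeffs[loop_id] += const
    return dict(coeffs)

def format_expr(coeffs, base):
    """Render the sum of terms plus the base of the memory structure."""
    parts = [f"({const} * {loop})" for loop, const in sorted(coeffs.items())]
    return " + ".join(parts + [base])

# Terms for the load from A described in the text.
load_terms = [
    derive_expr("L0", 2, 4),    # iv j: const 2, element size 4 -> 8 * L0
    derive_expr("L0", 1, 4),    # iv i: const 1, element size 4 -> 4 * L0
    derive_expr("L1", 1, 400),  # loop over k: assumed step 1, size 400
]
load_expr = combine_terms(load_terms)
print(format_expr(load_expr, "A"))
```

Collecting the constants per loop identifier leaves one coefficient per loop, which is the linear form the GCD test in the next section operates on.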
Memory Disambiguation
Memory disambiguation is the ability to determine if two memory access instructions are aliases for the same location [1] . In our context, we are interested in static memory disambiguation, or the ability to disambiguate memory references during scheduling. In the general case, memory indexing functions can be arbitrarily complex due to explicit and implicit induction variables and loop index increments. Therefore, a simplistic pattern matching approach to matching loads and stores over loop iterations cannot provide the power of memory aliasing analysis. For instance, in the following behavior, if arrays a and b are aliases:
[Example behavior listing not recovered.]
pattern matching will fail to find the redundancy.
In our scheduler, memory disambiguation is based on the well-known greatest common divisor (GCD) test [2]. Performing memory disambiguation on two operations, op1 and op2, involves determining whether the difference equation (op1's symbolic expression) - (op2's symbolic expression) = 0 has any integer solution. Fig. 3 contains an algorithm to disambiguate two memory references. This algorithm works by iterating over all expressions of operations one and two, thereby testing each possible address that the two operations can have. The first step in disambiguating two expressions is to convert them into the sum-of-products form "((a * b) + ... + (y * z))". Next, operation two's expression is subtracted from operation one's. If the resultant expression is not linear, the disambiguator returns CANT_TELL; otherwise the gcd of the coefficients of the equation is computed. If the gcd does not divide the constant term, the equation has no integer solution and there is no dependence between op1 and op2.
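The GCD test on a linear difference equation can be sketched as follows. This is a simplified illustration, not the algorithm of Fig. 3: it assumes each symbolic expression has already been reduced to a list of integer coefficients (one per loop variable) plus a constant term, and the function name and return strings are hypothetical.

```python
from functools import reduce
from math import gcd

def gcd_test(coeffs1, const1, coeffs2, const2):
    """Decide whether a1*x1 + ... + const1 = b1*y1 + ... + const2
    can have an integer solution.

    coeffs1, coeffs2: integer coefficients of each operation's loop variables.
    const1, const2:   constant parts (e.g., base address plus offset).
    Returns "INDEPENDENT" when no integer solution exists, otherwise
    "MAY_DEPEND" (the test is necessary, not sufficient, for dependence).
    """
    # Move the constants to one side: sum(a*x) - sum(b*y) = const2 - const1.
    rhs = const2 - const1
    coeffs = [abs(c) for c in list(coeffs1) + list(coeffs2) if c != 0]
    if not coeffs:
        # No loop variables at all: the addresses are plain constants.
        return "MAY_DEPEND" if rhs == 0 else "INDEPENDENT"
    g = reduce(gcd, coeffs)
    return "MAY_DEPEND" if rhs % g == 0 else "INDEPENDENT"

# a[2*i] versus a[2*i + 1] with 4-byte elements: 8*x - 8*y = 4.
print(gcd_test([8], 0, [8], 4))
# a[i] versus a[i + 1] with 4-byte elements: 4*x - 4*y = 4.
print(gcd_test([4], 0, [4], 4))
```

In the first call the gcd of the coefficients is 8, which does not divide 4, so the two references can never touch the same location. In the second call the gcd is 4, so a solution may exist and the operations must be treated as potentially dependent.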
Returning to the example in Fig. 2, if the load from iteration i + 1 is overlapped with the store from iteration i, the disambiguator determines that the updated expression for the load minus the store's expression is 0, exposing the redundancy in loading a value which has just been computed.
If the disambiguator cannot determine that two memory operations refer to the same location, we follow the conservative approach and assume that there is a dependence between them (i.e., no optimization can be done). Assertions (source-level statements declaring, for example, that certain arrays reside in disjoint memory space, or giving absolute bounds on loops) can be used to alleviate this. Alternatively, an interactive approach is to present the user with the information the disambiguator has derived and ask for an answer to the dependence question.
Reducing Memory Traffic
Our solution to reducing the amount of memory traffic in HLS is to make explicit the redundancy in memory interaction within the behavior and to eliminate those extraneous operations. Our technique is employed during scheduling rather than as a pre-pass or post-pass phase: a pre-pass phase may not remove all redundancy, since other optimizations can create opportunities that would not otherwise have existed, while a post-pass phase cannot derive as compact a schedule, since operations eliminated on the critical path allow further schedule refinement.
Algorithm in Detail
Fig. 4 shows the main algorithm for removing unnecessary memory operations. This function is invoked in our Percolation-based scheduler [17] by the move_op transform (or any suitable local code motion routine in other systems) when moving a memory operation into a previous step that contains other memory operations.
The function redundant_elimination checks to see if the memory operation is invariant. If so, then the function remove_inv_mem_op tries to remove it. If it is not invariant or could not be removed, then op is checked against each memory operation in the previous step for possible optimization. If two operations refer to the same location, then the appropriate action is taken depending upon their types. The load-after-load and load-after-store cases will be discussed shortly. In the case of a store-after-store, the first operation is dead and can be removed if it stores the same argument as the second and the argument has the same reaching definitions. We choose to simply remove op, rather than removing mem_op and moving op into its place. For the store-after-load, nothing is done, as this is a false (anti-) dependency that should be preserved for correctness. Status reflecting the outcome is returned, allowing operations to continue to move if no redundancy was found.
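The case analysis on operation types can be sketched as a small dispatch function. This is only an illustration of the four cases described above, not the algorithm of Fig. 4: the function name, its boolean parameters and the returned action strings are hypothetical, and the invariant check and the disambiguator call are assumed to have happened already.

```python
def redundancy_action(op_kind, prev_kind, same_location, same_argument=False):
    """Pick an action for op moving into a step that holds an earlier memory op.

    op_kind, prev_kind: "load" or "store" (prev is the op in the previous step).
    same_location: True when the two ops are taken to refer to one location.
    same_argument: for store pairs, True when both store the same argument
                   with the same reaching definitions.
    """
    if not same_location:
        # No redundancy between these two ops; op may keep moving.
        return "no_redundancy"
    if op_kind == "load" and prev_kind == "load":
        return "load_after_load"
    if op_kind == "load" and prev_kind == "store":
        return "load_after_store"
    if op_kind == "store" and prev_kind == "store":
        # Store-after-store: drop op when it repeats the earlier store.
        return "remove_op" if same_argument else "keep_dependence"
    # Store-after-load is an anti-dependence and must be preserved.
    return "keep_dependence"

print(redundancy_action("load", "store", same_location=True))
print(redundancy_action("store", "load", same_location=True))
```

Returning a status rather than acting in place mirrors the text: the caller can let the operation continue to move whenever no redundancy was found.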
Removing Invariants
Removing invariant memory operations is slightly different from general loop invariant removal. Traditional loop invariant removal moves an invariant into a pre-loop time step. For load operations this is correct; for store operations it is not. Conceptually, invariant loads are "inputs" to the loop, while invariant stores are "outputs." Therefore, loads must be placed in pre-loop steps and stores must be placed in loop exit steps.
An algorithm to perform invariant removal appears in Fig. 5. The conditions necessary for loop invariant removal (adapted from [1]) are: 1) the step that op is in must dominate all loop exits (i.e., op must be executed every iteration), 2) only one definition of the variable (for loads) or memory location (for stores) occurs in the loop, and 3) no other definition of the variable or memory location reaches their users. Additionally, a store operation requires that the definition of its argument be the same at the loop exits so that correctness is preserved. If these conditions are met, then the operation can be hoisted out of the loop. If condition 2 fails and the operation is a load, it still might be possible to hoist the operation if a register can be allocated to the loaded value for the duration of the loop.
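The hoisting conditions can be summarized in a short sketch. It is a hedged illustration rather than the algorithm of Fig. 5: the function name, the boolean flags standing in for the dataflow queries, and the returned placement strings are all hypothetical.

```python
def hoist_target(kind, dominates_exits, single_def, sole_reaching_def,
                 same_arg_at_exits=False, spare_register=False):
    """Return where an invariant memory op may be placed, or None.

    kind:              "load" or "store".
    dominates_exits:   condition 1, op executes on every iteration.
    single_def:        condition 2, only one definition in the loop.
    sole_reaching_def: condition 3, no other definition reaches the users.
    same_arg_at_exits: stores only, argument definition agrees at loop exits.
    spare_register:    loads only, a register can hold the value all loop long.
    """
    if kind not in ("load", "store"):
        raise ValueError(f"unknown memory operation kind: {kind}")
    if not (dominates_exits and sole_reaching_def):
        return None
    if kind == "load":
        # Loads are loop inputs; a spare register can rescue condition 2.
        return "pre_loop" if (single_def or spare_register) else None
    # Stores are loop outputs and need a stable argument at the exits.
    return "loop_exit" if (single_def and same_arg_at_exits) else None

print(hoist_target("load", True, True, True))
print(hoist_target("store", True, True, True, same_arg_at_exits=True))
```

The asymmetry between the two return values reflects the point made above: loads go to pre-loop steps, stores go to loop exit steps.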
Load-After-Load Optimization
The load-after-load optimization is applied in situations where a load operation accesses a memory value that has been previously loaded and no intervening modification has occurred to that location's value (i.e., there is no intervening store). In Fig. 5 […] Although move operations are introduced into the schedule, the number of registers used does not increase (a proof appears in [9]).
Load-After-Store Optimization
The load-after-store optimization is used to remove a load operation which accesses a value that a store operation previously wrote to the memory. Due to limited resources it is possible that this optimization cannot be applied. Consider the partial code fragment:
[Code fragment not recovered.]
[…] placing a move operation in step 1 will violate program semantics because it introduces a write-live conflict; the move must be inserted into step 2. Notice that in both cases the transformation is still possible, but analysis is required to determine which step is applicable. Therefore, the precise case in which the load-after-store optimization fails to remove a redundant load is composed of three conditions:
1. A move in this step results in a read-wrong.
2. A move in the previous step results in a write-live.
3. No free storage cell exists in the previous time step.
In practice, this situation occurs very infrequently.
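The three conditions above can be read as a decision cascade, sketched below. This is an assumption-laden illustration and not the algorithm of Fig. 6: the function name, the boolean inputs (which stand in for the conflict analyses) and the returned strategy strings are hypothetical.

```python
def place_move(read_wrong_here, write_live_prev, free_cell_prev):
    """Choose how to replace a redundant load with a move, if possible.

    read_wrong_here: a move in the load's own step causes a read-wrong.
    write_live_prev: a move in the previous step causes a write-live.
    free_cell_prev:  a free storage cell exists in the previous step.
    """
    if not read_wrong_here:
        return "move_in_current_step"
    if not write_live_prev:
        return "move_in_previous_step"
    if free_cell_prev:
        # Both steps conflict, but a free cell can carry the value across.
        return "use_free_cell"
    # All three failure conditions hold: the load must stay.
    return "no_optimization"

print(place_move(False, False, False))
print(place_move(True, True, False))
```

Only the single combination in which all three conditions hold leaves the load in place.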
The load-after-store optimization algorithm is found in Fig. 6. This algorithm determines the step in which to place a move operation. Initially, the step that op is in is tried. If a read-wrong conflict occurs, the previous step is tried. If a write-live conflict arises, a free cell is necessary to transfer the value; in this case, two move operations are added to the schedule. If a free cell is not available, no optimization is done. If no conflicts occur (or they can be alleviated by switching steps), then a move operation is inserted. Finally, the load operation is deleted and the necessary information is updated.
Example
[…] invariants the store cannot be removed unless the load is also removed. Once the load is hoisted, the store can then be hoisted as well.

Experiments and Results
Four memory-intensive benchmarks were used to study our transformation: three numerical algorithms (prefix sums, tri-diagonal elimination and general linear recurrence equations), adapted from [10], which are core routines in many algorithms (as discussed in the introduction), and a two-dimensional hydrodynamics implicit computation adapted from [20].
Latencies used for scheduling these behaviors were two steps for add/subtract, three steps for multiply, and five steps for load/store. Also, the memory model adopted here assumed that memory ports are homogeneous, that each port has its own address calculator, and that the memory is pipelined with no bank conflicts.
With these assumptions, two experiments were conducted. In the first, schedules were generated with the number of memory ports constrained between one and four and no functional unit (FU) constraints. Two schedules were produced for each benchmark, the sole difference between them being the application of our transformation. The goal of this experiment was to isolate the difference in transformed schedules without the bias of FU constraints. In the second experiment, schedules were generated with one to four memory ports, two adder units and one multiplier unit. This experiment was designed to study performance in the presence of realistic FU resources.
For each experiment, the number of steps in the schedule of the innermost loop was counted. The GLR equations benchmark (marked with a ?) has two loops at the same innermost nesting level; the results indicate the sum of the number of steps in both loops. The results of experiments one and two are found in Tables 1 and 2, respectively. The column labelled "RE" indicates application of our transformation. The columns collectively labelled "Number of Ports" contain the number of steps in the innermost loop for the respective FU and memory port parameters.
The results for experiment one (Table 1) demonstrate that this optimization considerably reduces the number of cycles for the inner loop. In the prefix sums and tri-diagonal elimination benchmarks, a performance limited by the latency of a load is achieved with a sufficient number of ports. Since a latency of 5 cycles was used for load operations and not all loads can be eliminated, the schedule length cannot be any shorter. The same characteristic is exhibited by the GLR equations benchmark, although computational latency causes a longer schedule length, while the hydrodynamics benchmark exhibits improved performance as the number of memory ports increases.
The results of experiment two (Table 2) indicate that load elimination plays an important role in the presence of realistic resource constraints. For the prefix sums and tri-diagonal elimination benchmarks, the transformed behaviors were not affected by the resource constraints due to the increased flexibility in scheduling. When a load operation is removed, the dependent operations can move to earlier time steps. In resource-constrained scheduling these operations have a much higher mobility, and thus a higher degree of scheduling freedom, w.r.t. the untransformed behavior. This flexibility is also demonstrated by the GLR equations and hydrodynamics benchmarks, although they are also affected by the particular resource constraints.
Conclusion
In this paper we have presented a new local scheduler transformation which optimizes the accessing of a secondary memory, thereby reducing memory traffic. This method is based on the redundancy found in memory access instructions both within and across iterations of a loop. These redundancies are eliminated through the use of memory aliasing theory, which determines when two memory instructions access the same location. With this foundation, our technique provides a powerful method of memory optimization, in contrast to a simplistic pattern-matching approach (either the comparison of source-level text or the overlaying of the behavioral CDFG to determine equivalency) that can often and easily be fooled. We have presented our algorithm in detail and provided results of its application to several benchmarks which demonstrate the utility and power of this memory minimization transformation. In this paper we have restricted our discussion to memory traffic minimization. We believe that this transformation, when used in conjunction with other traditional HLS transformations, should yield better designs for memory-intensive applications. Future work will address this interaction.
