This paper presents a new transformation for the scheduling of memory-access operations in High-Level Synthesis. This transformation is suited to memory-intensive applications with synthesized designs containing a secondary store accessed by explicit instructions. Such memory-intensive behaviors are commonly observed in video compression, image convolution, hydrodynamics and mechatronics. Our transformation removes load and store instructions which become redundant or unnecessary during the transformation of loops. The advantage of this reduction is the decrease of secondary memory bandwidth demands. This technique is implemented in our Percolation-Based Scheduler, which we used to conduct experiments on a suite of memory-intensive benchmarks. Our results demonstrate a significant reduction in the number of memory operations and an increase in performance on these benchmarks.
Introduction
Traditionally, one of the goals in High-Level Synthesis (HLS) is the optimization of storage requirements for synthesized designs [12, 18, 26, 29, 30]. As the focus of HLS shifts towards the synthesis of designs for inherently memory-intensive behaviors [17, 20, 28, 31], this step becomes crucial to obtaining acceptable performance. Examples of such behaviors are abundant in video compression, image convolution, speech recognition, hydrodynamics and mechatronics. The memory-intensive nature of these behaviors necessitates the use of a secondary store (e.g., a memory system), since a sufficiently large primary store (e.g., register storage) would be impractical. This memory is then explicitly addressed in a synthesized system by memory operations containing indexing functions. However, due to the bottleneck that a memory system represents, memory accessing operations must be effectively scheduled so as to optimize memory access. In this paper, we present a scheduler transformation which optimizes memory access by the elimination of extraneous memory operations.
Our strategy for optimizing memory access is to eliminate the redundancy found in memory interaction when scheduling memory operations. Such redundancies can be found within loop iterations, possibly over multiple paths, as well as across loop iterations. During loop pipelining [27, 33], redundancy is exhibited when values loaded from and/or stored to the memory in one iteration are loaded from and/or stored to the memory in future iterations.
For example, consider the behavior in Fig. 1. The inner loop would normally require four load and two store instructions per iteration. However, after application of our transformations, the inner loop contains only one load and one store at the expense of additional local (register) storage. Note that the assignment to t3 is not an actual memory fetching instruction; the value stored to b[i][j] is copied to t3.
During loop pipelining, multiple iterations of the loop are overlapped (subject to data dependencies and resource constraints) to find a repeating pattern in loop execution. As each iteration is integrated and overlapped with the current loop schedule, memory operations from the new iteration are exposed to the memory operations from previous iterations. With the aid of memory disambiguation, the recurrence patterns in data usage become apparent from loop iteration to iteration as the new memory operations move into time steps with loads and stores from previous iterations. Thus, the scheduler uncovers redundancy in memory interaction as the parallelism in the loop is exposed.
Our transformation has many significant benefits. By eliminating redundant load operations occurring on the critical dependency path through the code, the performance of the resulting schedule can increase dramatically as the length of the critical path can be shortened, thus generating more compact schedules and reducing code size. Also, due to our transformation's local nature, it integrates easily into other parallelizing transformations [9, 27]. Another benefit is the possible savings in hardware due to the decrease in memory bandwidth requirements and/or the exploration of more cost-effective implementations.
In Section 2, we discuss related work. Section 3 describes our behavior model and Section 4 discusses our approach for reducing memory traffic. Section 5 details experiments performed to study the effects of our transformations and the results observed on a suite of benchmarks. Finally, Section 6 concludes.
Related Work
In order to perform memory optimization, some formal method is necessary to model the accessing of memory. Once a model is developed, allocation strategies map data onto registers and memory modules. Then this information is used by behavioral transformations to optimize memory access. In this section we discuss related work on each of these: memory-access models, memory allocation and memory-related transformations.
Memory-Access Models
Duesterwald, Gupta and Soffa [15] extend traditional data flow analysis for scalar variables to include subscripted (array) variables through the use of iteration distance values. These values denote the number of iterations between the production of an array value and its final use, and are used in analyzing dependencies between memory operations.
Verhaegh et al. 
Memory Allocation
One formulation of the memory allocation problem is that of assigning a processor's registers to variables in a program. Probably the most pervasive solution to this problem is a graph coloring approach [10, 11] where the nodes in the graph correspond to program variables and the edges between nodes correspond to overlapping lifetimes. The task is then to color the graph such that connected nodes have different colors. If such a coloring is not found, some variable is heuristically selected and spilled to memory; all references to that variable then refer to that memory location. This process is repeated until a colored graph results. Callahan, Carr and Kennedy [6] extend the graph coloring technique to handle arrays by scalar replacement, which replaces repeated references to an array location by references to a (newly created) scalar variable. Another formulation of the memory allocation problem is to determine the number of registers necessary to preserve values produced in one control step and used in subsequent control steps. Kurdahi and Parker [22] present an algorithm which accomplishes this by analyzing variable lifetimes in the data flow graph. Some work [19, 28, 36] has been done to improve register usage for loops by breaking a variable's lifetime at the loop boundary, creating two "coupled" variables which the assignment process tries to assign to the same register. Register sharing (the use of one register by multiple variables when the lifetimes of those variables do not overlap) is employed to reduce the necessary number of registers.
Goossens [19] uses a heuristic to fill the gaps between coupled variables and then applies the left-edge algorithm. Stok [36] iteratively tries to improve an initial allocation produced by the left-edge algorithm by "permuting" variables in registers at loop end to match the variable assignment at loop beginning. Park, Kim and Liu [28] extend the left-edge algorithm to deal with behaviors having conditional and looping constructs.
A natural extension to this formulation is the grouping of those allocated registers into memory modules. Two issues arise in doing so: 1) the exact grouping of variables so as to minimize the number of modules used; and, 2) the interconnections now necessary to connect modules to functional units. Balakrishnan et al. [3] provide a solution which places primary importance on variable grouping. Ahmad and Chen [1] extend Balakrishnan's approach to take into account the commutative property of operations, thus allowing the connection of a memory module to either input of a functional unit. Kim and Liu [20] note that the first issue heavily influences the second, and, thus, in their approach, they place primary importance on interconnection cost rather than variable grouping.
Ramachandran, Gajski and Chaiyakul [35] present an algorithm to allocate storage for array variables. Their algorithm performs array-variable clustering, which allows more than one array to be allocated to the same memory module subject to user performance and cost criteria.
Memory-Related Transformations
Callahan, Cocke and Kennedy [7] present a technique which improves the balance of a loop (the ratio of memory words accessed to the number of operations performed). Their algorithm restructures the loop at the input level based upon estimated functional unit pipeline hazards and interlocking mechanisms, but deals only with single-dimension arrays and single flow-of-control within a loop body.
Davidson and Jinturkar [14] reduce memory traffic in the context of instruction set processors without a cache by combining (or "coalescing") multiple memory accesses into a single memory access. Their technique combines small bit-length accesses of adjacent data into one larger bit-length access. Two limitations to this strategy are: 1) architectural support (a wide bus) is necessary to be able to combine small bit-length adjacent accesses into a single larger access; and, 2) combining multiple memory accesses scattered throughout the program can adversely affect scheduling due to the new dependencies created.
Pöchmüller, Glesner and Longsen [31] discuss the notion of background memory management (the allocation of memory modules to arrays at the behavioral level). The authors note that background memory management is an important subtask in High-Level Synthesis, as unnecessary (redundant) array references must be removed because of the hardware that they generate.
Callahan, Kennedy and Porterfield [8] investigate the idea of software prefetching, a technique to explicitly fetch data into a cache. Although the goal of the technique is to improve performance, memory traffic can actually increase in situations where a prefetched item is bumped from the cache before its use.
Data blocking [32, 23] is another technique aimed at improving cache effectiveness. This approach partitions a large data space that does not fit within a cache into smaller data blocks that do, and then restructures the program to improve data reuse within those blocks.
Our work differs from previous work in that our memory-access model is based on symbolic expressions: strings which symbolically represent the address calculation being performed in a reduced form. These expressions are then subject to symbolic evaluation during scheduling to determine if redundancy between memory operations exists. Thus, redundancy is exposed when warranted, i.e., when eliminating that particular redundancy has an actual effect on the performance of the schedule. Also, our technique assumes that data has been previously allocated to memory and is, therefore, complementary to the discussed techniques for memory allocation.
Behavioral Model
In our model, the input behavior is a procedural specification and is internally represented by a control data-flow graph where each node corresponds to operations performed in one time step and the edges between nodes represent flow-of-control. Initially, each node contains only one operation. Parallelizing a behavior involves the compaction of multiple operations into one node (subject to resource availability).
Memory operations contain an indexing function which is composed of the variables used in indexing each dimension of the array access, as well as either a destination for load operations or a value for store operations. The semantics of a load operation are that issuing a load reserves the destination (local register storage) at issuance time (i.e., the destination is unavailable during the load's latency), while the register storing the value argument of a store operation is assumed free after the issuance of the store. For the purpose of dependency analysis on memory operations, each contains a symbolic expression, which is a string of symbols that formulates the memory address calculation without the behavior variables [21].
The purpose of the symbolic expression is to be able to efficiently compare memory operations for dependency analysis. In our approach, the variables used in the memory address calculation are "normalized" to a unique symbol for each loop, thereby re-formulating the expression in as reduced a form as possible. Determining dependencies between two memory operations involves the comparison of the respective symbolic expressions. Various methods for performing this test exist, including the GCD test [5], the Omega test [34], PIP [16] and inexact methods such as Fuzzy Array Analysis [13].
Because our goal is to remove redundant memory operations, it is necessary in our approach to have some preliminary allocation and binding of array variables to storage (secondary memory). This may entail allocating multiple arrays to the same module, separate modules for each array, splitting a (large) array among several (smaller) memories or any combination of the above. Then, this information is given to the scheduler along with the system resource constraints (e.g., number of memory ports on secondary memories, size and number (if partitioned) of primary memories, number and type of functional units, functional and memory operation latencies, etc.). This is in contrast to [4, 24], for instance, which both perform memory transformations before allocation.
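As one concrete instance of such a test, the GCD test can be sketched as follows. This is a minimal illustration that assumes one-dimensional affine subscripts of the form a*i + b; the function name and its parameters are hypothetical and are not taken from the scheduler described here.

```python
from math import gcd

def gcd_test_may_alias(a1, b1, a2, b2):
    """Return False only when a1*i + b1 == a2*j + b2 has no integer solution.

    A True result means a dependence is possible, not that it is certain,
    because the test ignores loop bounds.
    """
    g = gcd(a1, a2)
    if g == 0:
        # Both subscripts are constants: they alias only if they are equal.
        return b1 == b2
    return (b2 - b1) % g == 0

# a[2*i] versus a[2*j + 1]: even and odd addresses never coincide.
print(gcd_test_may_alias(2, 0, 2, 1))   # False
# a[i] versus a[j + 3]: a dependence is possible.
print(gcd_test_may_alias(1, 0, 1, 3))   # True
```

The test is conservative: it can rule a dependence out, but a positive answer still has to be refined by a more exact method.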
Reducing Memory Traffic
Our solution to reducing the amount of memory traffic in HLS is to make explicit the redundancy in memory interaction within the behavior and eliminate those extraneous operations. Our technique is employed during scheduling rather than as a pre-pass or post-pass phase; a pre-pass phase may not remove all the redundancies since other optimizations can create opportunities for redundancy removal that may not have otherwise existed, while a post-pass phase cannot derive as compact a schedule since operations eliminated on the critical path allow for further schedule refinement. Fig. 2 shows the main algorithm for removing unnecessary memory operations. This function is invoked in our Percolation-based scheduler [33] by the move_op transform (or any suitable local code motion routine in other systems) when moving a memory operation from one step into a previous step that contains other memory operations.
Algorithm in Detail
The function redundant_elimination checks to see if the memory operation op is invariant. If so, then the function remove_inv_mem_op tries to remove it. If it is not invariant or could not be removed, then op is checked against each memory operation mem_op in the previous step for possible optimization. If two operations refer to the same location then the appropriate action is taken depending upon their types. The load-after-load and load-after-store cases will be discussed shortly. In the case of a store-after-store, the first store operation is dead [2] and can be removed if it stores the same argument as the second operation and the argument has the same reaching definitions. We choose to simply remove op, rather than removing mem_op and moving op into its place. For the store-after-load, nothing is done, as this is a false (anti-) dependency on a memory location that should be preserved to retain behavior correctness. A status reflecting the outcome is returned, allowing operations to continue to move if no redundancy was discovered.
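The dispatch on operation types described above might be sketched as follows. This is an illustration only: the MemOp record, the same_location helper (plain string equality standing in for a real disambiguator) and the returned labels are assumptions, and the invariant check and the reaching-definitions check are omitted.

```python
from dataclasses import dataclass

@dataclass
class MemOp:
    kind: str       # "load" or "store"
    address: str    # normalized symbolic expression for the address
    register: str   # destination (load) or value argument (store)

def same_location(x, y):
    # Assumption: equal symbolic expressions mean the same memory location.
    return x.address == y.address

def classify(op, prev_step_ops):
    """Name the optimization that applies when op moves into the previous step."""
    for mem_op in prev_step_ops:
        if not same_location(op, mem_op):
            continue
        if op.kind == "load" and mem_op.kind == "load":
            return "load-after-load"
        if op.kind == "load" and mem_op.kind == "store":
            return "load-after-store"
        if op.kind == "store" and mem_op.kind == "store":
            # Removable only when both stores write the same argument.
            if op.register == mem_op.register:
                return "store-after-store"
        # store-after-load is an anti-dependency and is left untouched.
    return "none"

prev = [MemOp("store", "a[s1]", "t1")]
print(classify(MemOp("load", "a[s1]", "t2"), prev))   # load-after-store
print(classify(MemOp("load", "a[s1+1]", "t2"), prev)) # none
```

Returning a label rather than performing the rewrite keeps the sketch small; in a scheduler the label would select the corresponding transformation.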
Removing Invariants
Removing invariant memory operations is slightly different from general loop invariant removal. Traditional loop invariant removal moves an invariant into a pre-loop time step. For load operations this is correct; for store operations it is not. Conceptually, invariant loads are "inputs" to the loop, while invariant stores are "outputs." Therefore, loads must be placed into pre-loop steps and stores must be placed into loop exit steps.
To illustrate this, consider the following example:
for i = 1 to N
  for j = 1 to N
In this example, the memory location a[i] is used from iteration to iteration as a temporary accumulator. To maintain correct semantics, both memory operations must be removed. If only the load operation is replaced with a temporary then incorrect values are stored; if only the store operation is replaced with a temporary then incorrect values are loaded. The solution to this situation (or, in general, when a loop contains scattered invariant loads and stores to the same location) is to remove the invariant if and only if all other invariants to the same location can be removed (i.e., hoisted out of the loop).
An algorithm to perform invariant removal appears in Fig. 3. The conditions necessary for loop invariant removal (adapted from [2]) are: 1) the step that op is in must dominate all loop exits (i.e., op must be executed every iteration), 2) only one definition of the variable (for loads) or memory location (for stores) occurs in the loop and 3) no other definition of the variable or memory location reaches their users. Additionally, a store operation requires that the definition of its argument be the same at the loop exits so that correctness is preserved. If these conditions are met, then the operation can be hoisted out of the loop. If condition 2 fails and the operation is a load, it still might be possible to hoist the operation if a register can be allocated to the loaded value for the duration of the loop.
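The effect of hoisting both invariants can be illustrated with a small sketch. The loop body used here, an accumulation of b[i][j] into a[i], is a hypothetical stand-in of the same shape, and the CountingMemory class is an assumption introduced only to count explicit memory operations on a.

```python
class CountingMemory:
    """A list wrapper that counts explicit loads and stores."""
    def __init__(self, values):
        self.values = list(values)
        self.loads = 0
        self.stores = 0

    def load(self, i):
        self.loads += 1
        return self.values[i]

    def store(self, i, v):
        self.stores += 1
        self.values[i] = v

def accumulate_naive(a, b, n):
    for i in range(n):
        for j in range(n):
            a.store(i, a.load(i) + b[i][j])   # load and store in every iteration

def accumulate_hoisted(a, b, n):
    for i in range(n):
        t = a.load(i)          # invariant load placed in a pre-loop step
        for j in range(n):
            t = t + b[i][j]    # the temporary replaces both memory operations
        a.store(i, t)          # invariant store placed in a loop exit step

n = 3
b = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # accesses to b are not counted here
naive, hoisted = CountingMemory([0] * n), CountingMemory([0] * n)
accumulate_naive(naive, b, n)
accumulate_hoisted(hoisted, b, n)
print(naive.values == hoisted.values)   # True
print(naive.loads, naive.stores)        # 9 9
print(hoisted.loads, hoisted.stores)    # 3 3
```

Replacing only the load or only the store in accumulate_naive with the temporary would leave the other operation working on a stale copy, which is why both must be hoisted together.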
Load-After-Load Optimization
The load-after-load optimization is applied in situations where a load operation accesses a memory value that has been previously loaded and no intervening modification has occurred to that location's value (i.e., there is no intervening store). The algorithm in Fig. 4 transfers the value without re-loading it (i.e., it transfers the value from the destination of the first load to the destination of the second load, thereby obviating the second load).
The algorithm is invoked with the initial load, mem_op, and the redundant load, op, and commences by gathering all of the nodes in the graph where the latency of mem_op expires. Then, for each of those nodes, if that node is on a path in common with op, a move operation from the destination of mem_op to the destination of op is inserted into that node. Finally, op is deleted from the behavior graph and any necessary local information is updated.
Because move operations are introduced into the schedule by this optimization, it may appear at first glance that the necessary number of registers increases. However, due to the machine model, the register count never increases as a result of the load-after-load optimization. The following theorem establishes this fact.
Theorem 1: The load-after-load optimization does not increase the number of registers used.
Proof: Let Op_t denote a load operation which defines a and is scheduled in time step t, and let Op_{t+1} denote a redundant load which defines b and is scheduled in time step t + 1. Let load operations have latency l. Before the transformation, a is available for computation in time step t + l and b is available in time step t + l + 1. Concerning only a and b, the register counts in each cycle are: cycle t, one; cycles t + 1 to t + l + 1, two. After the optimization is applied, a move operation b := a appears in time step t + l, when a is available. Now the register counts are: cycles t to t + l - 1, one; cycles t + l to t + l + 1, two. Therefore the total register count in each step has not increased. In fact, in the period t + 1 to t + l - 1 the register usage has decreased, leaving a free register which other transformations can possibly exploit. □
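The counting argument in the proof can be checked with a short sketch. It assumes, as in the machine model above, that a register is reserved from the step in which its defining load or move is issued; the helper names are illustrative.

```python
def live_counts(reserve_steps, first, last):
    """Registers occupied in each cycle from first to last, given the step at
    which each register becomes reserved (it stays reserved afterwards)."""
    return [sum(1 for s in reserve_steps if s <= c)
            for c in range(first, last + 1)]

def compare(t, l):
    last = t + l + 1
    before = live_counts([t, t + 1], t, last)   # a reserved at t, b at t + 1
    after = live_counts([t, t + l], t, last)    # b reserved by the move at t + l
    return before, after

before, after = compare(t=0, l=5)
print(before)   # [1, 2, 2, 2, 2, 2, 2]
print(after)    # [1, 1, 1, 1, 1, 2, 2]
```

In every cycle the count after the optimization is at most the count before it, and it is strictly smaller in cycles t + 1 to t + l - 1.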
Load-After-Store Optimization
The load-after-store optimization is used to remove a load operation which accesses a value that a store operation previously wrote to the memory. Again, the method employed is to insert a move operation into the schedule to transfer the value. Due to limited resources it is possible that this optimization cannot be applied in some cases. Consider the partial code fragment:
Eliminating the load c := a[i] and replacing it with the move operation c := b in step 2 would violate behavior semantics because it introduces a read-wrong conflict: the move would read a value other than the one written to memory. The move operation must be placed in step 1 to guarantee correct results. However, in this code fragment:
placing a move operation in step 1 would violate behavior semantics because it introduces a write-live conflict: the move would overwrite a value that is still live. The move must be inserted into step 2. Notice that in both cases the transformation is still possible; analysis is required to determine which step is applicable.
This optimization might not be feasible in the following situation:
Therefore, the precise case in which the load-after-store optimization fails to remove a redundant load is composed of three conditions:
In practice, this situation occurs very infrequently.
The load-after-store optimization algorithm is found in Fig. 5. This algorithm determines the step in which to place a move operation. Initially, the step that op is in is tried. If a read-wrong conflict occurs, the previous step is tried. If a write-live conflict arises, a free register is necessary to transfer the value. In this case, two move operations are added to the schedule. If a free register is not available, no optimization is done. If no conflicts occur (or they can be alleviated by switching steps) then a move operation is inserted. Finally, the load operation is deleted and necessary information updated.
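The step-selection logic just described might be sketched as follows. The three boolean inputs are assumed to come from the scheduler's conflict analysis, and the placement of the two moves in the free-register case (save in the previous step, copy in the load's step) is an assumption made for illustration.

```python
def place_move(read_wrong_in_op_step, write_live_in_prev_step, free_register):
    """Choose where the move that replaces a redundant load is placed.

    Returns a list of (step, move) pairs, or None when the load must stay.
    "op" is the step holding the load and "prev" is the step before it.
    """
    if not read_wrong_in_op_step:
        return [("op", "dest := value")]
    if not write_live_in_prev_step:
        return [("prev", "dest := value")]
    if free_register:
        # Assumed placement: save the value early, copy it at the load's step.
        return [("prev", "tmp := value"), ("op", "dest := tmp")]
    return None   # both conflicts and no free register: no optimization

print(place_move(False, False, False))  # [('op', 'dest := value')]
print(place_move(True, True, False))    # None
```

Only the last branch leaves the redundant load in the schedule, which matches the observation that the failing case is rare.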
A Note on Move Operations
The move operations inserted into the schedule by our transformation may or may not be found in the final schedule, depending upon the context in which the moves appear. If the primary memory is consolidated, then the semantics of a move keep the value within the same memory, but in a different register. As it is not necessary to actually perform a move in this case, traditional copy propagation [2] and copy elimination [25] techniques, which are implemented in our system, may delete move operations from the schedule. However, as previously outlined in Section 3, the primary memory may be partitioned into multiple memories. If this is the case, then care in applying removal techniques is necessary. The moves may be deleted if the source and destination are in the same module; otherwise the move operation is necessary to transfer a value from one memory module to another.
An Example
Fig. 6 illustrates the application of our transformation to an example behavior. For simplicity, all of the operations constituting the address calculations for the memory operations have been grouped into one node, which has been darkened in the illustration. In practice, the memory operations will have symbolic expressions; however, we show the memory operations annotated by the source-level counterpart for illustration. During loop pipelining, when a future iteration is overlapped with the current loop schedule, the schedule in Fig. 6(a) results. When percolating the load operation from iteration N + 1 into the previous time step, redundant_elimination is invoked as the previous time step contains another memory operation. The disambiguator discovers that the symbolic expressions for the load and store operations are the same, so the load-after-store optimization is applied and the schedule in Fig. 6(b) results.
Experiments, Results and Analysis
We conducted experiments on a suite of benchmarks to study the effects of our transformation on performance. The benchmark suite is composed of three numerical codes, two scientific codes and five image codes. Table 1 lists the benchmarks along with a brief description of each. Table 2 contains statistics related to the memory operations in these benchmarks. The first column lists the benchmark, while the second column lists the total number of memory operations contained within the respective innermost loop. The third column lists the number of redundant memory operations in its first sub-division and, in its second, that number as a percentage of the total number of memory operations. The fourth column lists the number of redundant memory operations that our transformation removed and the percentage this is of the total number of redundant memory operations (found in column three). In all cases, our transformation was able to remove all of the redundant memory operations found in the benchmark suite.
Experimental Set-up
For the purposes of generating schedules, we make certain assumptions on the latencies of functional and memory operation types and on memory system organization. Our scheduler is fully parameterized, and, as such, these assumptions are in no way "hard-coded" into the solutions. Latencies used are two steps for add/subtract operations, three steps for multiply operations, and five steps for load/store operations. Fig. 7 shows the memory architectural model used. In this model, a large secondary memory contains run-time data which must be transferred into the primary memory for computation. Explicit memory-accessing (load/store) instructions are found in the behavior for this purpose. Data may only be transferred between the primary and secondary memories through memory ports. We assume that all ports are homogeneous; that is, each port can issue both read (load) and write (store) operations.
With these assumptions, two experiments were conducted with our Percolation-based scheduler. In the first, schedules were generated with the number of memory ports constrained between one and four and no functional unit (FU) constraints. Two schedules were produced for each benchmark, with the sole difference between them being the application of our transformation. The goal of this experiment was to isolate the difference in transformed schedules without the bias of FU constraints.
In the second experiment, schedules were generated with one to four memory ports and the functional unit constraints of two adders and one multiplier. Again, two schedules were generated for each benchmark, with the sole difference between them being the application of our transformation. This experiment was designed to study the effects on performance in the presence of realistic FU resources.
In the next section, we present our observed results and in the following section, we analyze the benefits of our transformations.
Results
For each experiment, the number of steps in the schedule of the innermost loop was counted. The results of limited and unlimited resource schedules for the numeric and scientific codes are found in Table 3, while the results for the image codes are found in Table 4. Also, these tables contain the percentage performance gain (% Gain) of the transformed schedules over the untransformed schedules; this is measured as:
$$\frac{\mathit{cycles}_{untrans} - \mathit{cycles}_{trans}}{\mathit{cycles}_{trans}},$$
where cycles_untrans is the number of cycles for the untransformed behavior and cycles_trans is the number of cycles for the behavior with our enhancement. The column labelled "Opt" indicates whether or not our optimization was applied to the particular resource configuration. The columns collectively labelled "Number of Ports" contain the number of steps in the innermost loop for the respective FU and memory port parameters. In subsection 5.2.1 we present our results by the type of benchmark, while in subsection 5.2.2 we extract the percentage performance gains from the tables, average them and present them by the FU resources used.
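As a quick check of how this measure behaves, the following sketch evaluates it as a percentage. The cycle counts are hypothetical values chosen for illustration and are not taken from the tables.

```python
def percent_gain(cycles_untrans, cycles_trans):
    """Percentage gain of the transformed schedule over the untransformed one."""
    return 100.0 * (cycles_untrans - cycles_trans) / cycles_trans

print(percent_gain(12, 6))   # 100.0
print(percent_gain(10, 8))   # 25.0
```

Because the denominator is the transformed cycle count, halving the schedule length corresponds to a 100% gain.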
Results by Benchmark Type
The results of our experimentation with the numerical benchmarks are found in Table 3. From Table 2, we note that these benchmarks have a moderate degree (21% to 33%) of redundancy and therefore we can expect a moderate increase in performance. The results for these benchmarks are better than expected: the performance increases observed are between 33% and 100%, with a majority of them above 50%.
Results for the scientific benchmarks are found in Table 3. These benchmarks have a moderate degree (between 25% and 33%) of redundancy, so a moderate performance increase could be expected. The 2D-Hydro benchmark does demonstrate a moderate increase (13% to 46%) in performance. However, observed performance increases for SOR range between 5% and 17%. Although applying our transformation is worthwhile, the lower performance increase stems from the large number of memory operations that remain after redundancy removal. Since there are a large number of memory operations which contend for memory port resources, the schedules still remain bottlenecked by the memory.
The results for the image benchmarks are found in Table 4. These benchmarks have a high degree (66% to 70%) of redundancy, and therefore we expect our transformation to make a significant impact on performance. For the Laplace and Wavelet codes this is indeed the case: we observed a 27% to 40% increase in performance for Laplace and 17% to 100% for Wavelet. For the other benchmarks, the performance increase is lower, ranging between 5% and 19%. With so much redundancy, why are the performance gains less than expected? These benchmarks have long chains of computations which effectively "hide" the latency of the memory operations. When memory port resources are scarce, delaying memory operations has little effect on the critical path and, hence, little effect on performance. However, when combining our technique with other transformations that reduce the critical path length (tree height reduction, for instance), the importance of removing redundant memory operations increases, as those operations now affect the critical path and thus affect performance.
To test this hypothesis, we generated schedules (as previously outlined) for the low-pass1 benchmark; however, this time tree height reduction (THR) was also applied. The results for this experiment are found in Table 5. These results confirm the idea that our transformation plays a crucial role (increasing performance by 43%, on average) in the presence of other behavioral transformations. One thing to note for these results is that, without our transformation, the resource-constrained schedules are, in one case (1 port, 2+, 1*), worse if THR is applied to the behavior. Once redundancy is eliminated, however, the results are much better: a 48% performance gain rather than 7%.
Results by Resource Constraints
Fig. 8 graphically shows the average increase in performance for the schedules produced with no FU constraints. Because the bias of FU constraints has been removed from these measurements, bandwidth limitations have effectively become the bottleneck to performance. Clearly the numeric benchmarks benefit the most (54% to 95%), but the scientific and image benchmarks show considerable performance gains too, 5% to 16% and 8% to 38%, respectively.
Fig. 9 graphically shows the average increase in performance for the schedules produced with the FU constraints of two adders and one multiplier. Except for the GLR benchmark (which is affected to a greater extent by the particular resource constraints used), the average performance gain between the untransformed and transformed behaviors increases or stays the same as compared with the results in Fig. 8. These results indicate that redundant elimination plays an important role in the presence of realistic resource constraints, yielding 64% to 73% performance gain for the numerical benchmarks, 12% to 26% for the scientific benchmarks and 9% to 38% for the image benchmarks.
Analysis
These improvements are largely due to the increased flexibility in scheduling that redundancy removal offers. When redundant memory operations are removed from the behavior, this flexibility results from less contention for memory port resources as compared to the untransformed behavior. Higher operation mobility, which also contributes to scheduling flexibility, is exhibited by operations which were originally dependent upon (now) removed memory operations. These operations can now be scheduled into time steps corresponding to the latency steps of the removed operation, possibly utilizing free resources in those steps. Finally, critical path length can be reduced when memory operations on the critical dependency path in the loop are removed, again providing more flexibility to the scheduler. All of these factors (less resource contention, higher operation mobility and critical path length reduction) combine to increase scheduler flexibility, and, as previously noted in our experiment with combining redundancy removal and tree height reduction, result in the generation of higher quality schedules.
Conclusion
In this paper we have presented a new local scheduler transformation which optimizes the accessing of a secondary memory, thereby reducing memory traffic. Our strategy is based on eliminating the redundancy found in memory access instructions both within and across iterations of a loop. These redundancies are detected through the use of memory aliasing theory, which determines when two memory instructions access the same location. With this foundation, our technique provides a powerful method of memory optimization, in contrast to a simplistic pattern-matching approach (either based upon the comparison of source-level text or complicated manipulations of a control data flow graph to determine equivalency) that can often and easily be fooled. We have presented our algorithm in detail and provided results of its application to a suite of memory-intensive benchmarks, where we observed performance gains up to 100%. Our approach is easily integrated into existing High-Level Synthesis systems and has been implemented in the UCI Percolation-based Synthesis tool suite. In the future, we will address the interactions of this local transformation with other optimizations.
