As memory system performance becomes an increasingly dominant factor in overall system performance, it is important to optimize programs for memory-related operations. This paper concerns static analysis to detect redundant memory operations and enable other compiler transformations to remove such redundant operations.
INTRODUCTION
The rate of improvement in microprocessor CPU speed continues to exceed the rate of improvement in DRAM memory speed, producing an increasing gap between processor and memory performance. It is vital to optimize memory usage to achieve better performance on modern processor architectures. An effective technique is to detect redundant memory operations (loads and stores that will have the same effect as earlier memory operations in execution) and either remove them or replace them with cheaper scalar operations. Consider the sample C code in Figure 1. For the accesses to p->x and p->y in lines 10 and 11, the compiler generates four loads. However, because the values of p->x and p->y are not changed, the loads in line 11 are redundant. The two loads can be removed and their results from line 10 can be reused.
Figure 1: An Example Code Fragment
*This work was supported by the DOE through the Los Alamos Computer Science Institute.
MSP'02
Redundant memory operations can be classified into three categories: run-time redundancies, partially static redundancies, and fully static redundancies. Run-time redundancy is the most general form: loads or stores are redundant if, in program execution, they access the same memory address and use the same value as a previous memory operation. This is shown in the upper left corner of Figure 2, where operations P and Q access the same memory address and operate on the same value. Sometimes, static analysis can prove that on some control-flow path from P to Q, P and Q operate on exactly the same address and value. This case, called partially static redundancy, is shown in the upper center of Figure 2. If static analysis can prove that P occurs on all possible control-flow paths to Q, and that P and Q access the same address with the same value, then Q is a fully static redundancy, shown in the upper right corner of Figure 2. Generally, for arbitrary P and Q, it is impossible to decide statically whether they will access the same address with the same value. The Venn diagram at the bottom of Figure 2 shows the relationship between these three categories.
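The distinction between partially and fully static redundancy is a question about control-flow paths. The following minimal Python sketch illustrates it under a strong assumption: every node listed in p_nodes is already known to operate on the same address and value as q, so only the path structure is checked. The function name classify and the CFG encoding are hypothetical and are not part of the algorithm presented in this paper.

```python
from collections import deque

def classify(cfg, entry, p_nodes, q):
    """Classify Q relative to P on a CFG given as {node: [successors]}.

    Assumption: each node in p_nodes performs an operation on the same
    address and value as q; this sketch only inspects the paths.
    """
    def reachable(start, blocked):
        # Breadth-first search that never enters a blocked node.
        seen, work = set(), deque([start])
        while work:
            node = work.popleft()
            if node in seen or node in blocked:
                continue
            seen.add(node)
            work.extend(cfg.get(node, []))
        return seen

    # Q reachable while avoiding every P: some path to Q lacks P.
    p_free_path = q in reachable(entry, set(p_nodes))
    # Some P reaches Q: at least one path to Q carries P.
    some_p_path = any(q in reachable(p, set()) for p in p_nodes)

    if some_p_path and not p_free_path:
        return "fully static"
    if some_p_path:
        return "partially static"
    return "not statically redundant"
```

On a diamond-shaped CFG, placing P on only one arm yields "partially static", while placing P in the entry block yields "fully static".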
[Figure 2: run-time, partially static, and fully static redundancy between memory operations P and Q, with a Venn diagram relating the three categories]
Both hardware [18, 23, 32, 37] and software techniques [25, 30, 24, 4, 6] have been proposed to detect and remove redundant memory operations. In their dynamic instruction reuse work [32], Sodani et al. propose using hardware lookup tables to store operation inputs and results. A redundant operation is detected by matching its input against lookup table entries. If there is a match, the operation is canceled and the earlier result is reused. For loads, the memory address can be used as the lookup input; for stores, the address and store value should be used to match any earlier load or store to determine redundancy. Yang et al. describe a more elaborate hardware scheme for dynamic load redundancy removal [37]. Although hardware methods have the advantage that they can see run-time memory addresses and values, they require significant hardware support. The limited lookup table size also limits the scope over which they can detect redundancy. Thus, they capture a subset of the run-time redundancy suggested in Figure 2. In contrast, software methods target the partially static and fully static redundancies. They remove the redundancies with transformations that rewrite the code. This paper presents a static technique to detect redundant memory operations.
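To make the scope limitation of table-based hardware schemes concrete, the sketch below models a bounded lookup table in Python. It is a toy model written for illustration only: the class LoadReuseBuffer, its least-recently-used replacement policy, and the dictionary standing in for memory are all assumptions, and it does not reproduce the designs in [32] or [37].

```python
from collections import OrderedDict

class LoadReuseBuffer:
    """Toy bounded lookup table: address -> last value seen at that address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.table = OrderedDict()

    def _remember(self, addr, value):
        self.table[addr] = value
        self.table.move_to_end(addr)
        while len(self.table) > self.capacity:
            self.table.popitem(last=False)   # evict least recently used entry

    def load(self, addr, memory):
        """Return (value, reused); reused is True on a table hit."""
        if addr in self.table:
            self.table.move_to_end(addr)
            return self.table[addr], True
        value = memory[addr]
        self._remember(addr, value)
        return value, False

    def store(self, addr, value, memory):
        """Return True if the store writes a value the table already holds."""
        redundant = addr in self.table and self.table[addr] == value
        memory[addr] = value
        self._remember(addr, value)
        return redundant
```

With a capacity of one entry, a load that is repeated immediately hits the table, but the same load misses again once a single intervening access has evicted its entry. This is the limited scope that the static techniques discussed here do not share.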
Previous work on static memory redundancy analysis targets the problem in isolation: memory operations are inspected separately from scalar operations. However, because the addresses and values involved in memory operations are either computed or used by scalar operations, we believe that scalar and memory redundancy detection should be considered together rather than separately. In this paper, we present a unified global redundancy detection algorithm based on optimistic global value numbering, which is at least as powerful as the isolated approach and is capable of detecting both scalar and memory redundancy at the same time. This avoids the multiple iterations needed when separate passes are used to detect scalar and memory redundancy. Furthermore, traditional scalar redundancy removal transformations can be easily extended to use the analysis result to remove redundant memory operations in the same way as scalar operations, eliminating the need for a dedicated memory redundancy removal phase.
Figure 3: ILOC Code for Compute
The remainder of the paper is organized as follows. Section 2 introduces the intermediate representation we use to detect program redundancy, with a focus on memory-related operations. Section 3 gives an overview of SCC-based value numbering, which forms the basis for our global redundancy detection scheme. Section 4 describes a new SSA form for memory operations and details the key memory value numbering algorithm. Section 5 demonstrates how the widely used scalar common subexpression elimination can be modified to remove fully static redundant memory operations, and evaluates the redundancy detection and removal algorithms on a suite of realistic benchmarks. Section 6 discusses related work, and Section 7 concludes the paper.
INTERMEDIATE REPRESENTATION
Our compiler uses a low-level, RISC-style, three-address intermediate representation called ILOC. All memory accesses in ILOC occur in load and store operations. The other operations work from an unlimited set of virtual registers. The ILOC code for the example function Compute is shown in Figure 3. The line number indicates the source line in the C code of Figure 1.
For static analysis to detect that two memory operations P and Q are equivalent, memory alias information must be considered. (The absolute access address and value are generally not available at compile time.) To account for aliasing effects, an explicit list of memory objects, called an M-list, is associated with each memory operation (loads, stores, and calls). The M-list indicates the possible set of memory objects that the operation may affect. Our algorithm assumes that an external alias analysis is performed to disambiguate memory and compute M-lists for memory operations.
The results reported in this paper were generated with a flow-insensitive, context-insensitive, Andersen-style pointer analysis [3] for C programs.1 The pointer analysis generates, as its result, the M-lists for loads and stores. In addition, it ...
1 The whole-program pointer analysis includes C library functions, which are summarized in stubs and treated polymorphically. C array objects are treated as a single entity; array elements are not distinguished. For C struct objects, every scalar and array field is treated as a separate object. Heap objects are classified by their allocation sites; all objects from a single site are treated as a single entity.
xLD: xLDs are the family of load operations in ILOC, where x designates a specific load instruction. ILOC supports signed and unsigned loads of bytes, half-words, words, and double-words. ra is the load memory address, and rv is the load result. M-use is the set of memory object names that the load may read; it is computed by the pointer analysis.
xST: xSTs are the family of store operations in ILOC, where x can take on the same values as in a load. ra is the memory address defined by the store, and rv is the value to be stored at ra. M-use and M-def are identical; they contain the set of memory objects that the store may define.
The M-def and M-use sets are both called M-lists for FRAME, JSRI/r, xLD, and xST, and they play the key role in our new memory redundancy detection algorithm, which uses them to reason about memory states and detect equivalence of load and store operations. The M-lists for the memory operations of Compute are shown in brackets in Figure 3. For example, iLD r6 => r7 [@pa_0 @pb_0] means that the load may read from object @pa_0 or @pb_0 using the base address in r6, and that the load result is put in r7. Our front end encodes a C struct field by its offset; @pa_0 and @pb_0 correspond to pa.x and pb.x in Figure 1.
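As an illustration of how M-lists attach to operations, the following Python sketch models a load and a store as records carrying M-use and M-def sets. The names MemOp, make_load, make_store, and may_change_state are hypothetical, as is the object name @result_0; the sketch only mirrors the conventions described above (a load has an M-use set, and a store has identical M-use and M-def sets).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MemOp:
    opcode: str                      # e.g. "iLD" or "iST"
    address: str                     # register holding the address (ra)
    value: Optional[str]             # load result or stored value (rv)
    m_use: frozenset = frozenset()   # objects the operation may read
    m_def: frozenset = frozenset()   # objects the operation may define

def make_load(opcode, address, result, objects):
    # A load only uses memory objects; it defines none.
    return MemOp(opcode, address, result, m_use=frozenset(objects))

def make_store(opcode, address, value, objects):
    # For a store, M-use and M-def are identical.
    objs = frozenset(objects)
    return MemOp(opcode, address, value, m_use=objs, m_def=objs)

def may_change_state(store, other):
    # A store can only affect another operation if their M-lists intersect.
    return bool(store.m_def & (other.m_use | other.m_def))
```

For the load iLD r6 => r7 [@pa_0 @pb_0], a store whose M-list is disjoint from {@pa_0, @pb_0} cannot change the memory state that the load observes.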
SCC-BASED VALUE NUMBERING
To prove that memory operation P is redundant in terms of Q, static analysis must decide that the address and value they operate upon are equivalent. However, in the presence of memory aliasing, the address and value can only be determined symbolically. Addresses that differ symbolically may, in fact, produce the same address at run time. Thus, the analysis must further prove that the memory states before and after execution of P will be equivalent to those surrounding the execution of Q. Our algorithm uses value numbering to discover both address and value equivalence and memory state equivalence. The key property of value numbering is: if two scalar values are assigned the same value number, they must have the same value at run time. To extend value numbering to memory objects, we must ensure the analogous property for memory objects.
The new algorithm builds on Simpson's SCCVN algorithm for optimistic global value numbering [31]. It discovers value-based identities (as opposed to lexical identities) [13]. Simpson's algorithm is, arguably, the strongest global technique for detecting redundant scalar values. It finds all the redundancies discovered by the Alpern-Wegman-Zadeck algorithm [2]. It finds a broad class of algebraic identities. It discovers all the constants found by the sparse simple constant algorithm [35]. SCCVN has been implemented in a number of compilers.
The SCCVN algorithm extends Simpson's dominator-based technique (DVNT) to a global scope [7, 31] . It abstracts cycles out in the control-flow graph (CFG) and iterates over each cycle's internal structure to find a fixed-point solution for that cycle. That solution then factors back into the global propagation of value numbers.
The algorithm assumes the presence of both a CFG and the SSA graph. It constructs a reduced CFG by replacing each cycle in the CFG with a single node that represents it. The reduced CFG is acyclic. SCCVN visits all the nodes in the reduced CFG in reverse postorder and value numbers them. When it encounters a node that represents a cycle in the original CFG, it iterates over the basic blocks in the cycle, value numbering as it goes. Figure 4 shows the algorithm. SCCVN maintains two key data structures: the value table and the operation table. The value table maps each SSA name to a value number. The process that generates value numbers guarantees that if two SSA names have the same value number, they will always have the same value at run time. The operation table maps a tuple, containing an opcode and the value numbers of its operands, into a value number. It is used to discover redundant operations. If the current operation matches an earlier entry in the operation table, then the current operation must be redundant with that earlier operation.
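The interplay of the two tables can be sketched for a single basic block as follows. This is a minimal Python illustration, not the algorithm of Figure 4: it omits the reduced CFG, the iteration over cycles, and algebraic simplification, and the function name value_number_block and the tuple encoding of operations are assumptions made for the example.

```python
import itertools

def value_number_block(ops):
    """ops: list of (result, opcode, operands) over SSA names."""
    fresh = itertools.count()
    value_table = {}   # SSA name -> value number
    op_table = {}      # (opcode, operand value numbers) -> value number
    redundant = []

    def vn(name):
        # Names seen for the first time receive a fresh value number.
        if name not in value_table:
            value_table[name] = next(fresh)
        return value_table[name]

    for result, opcode, operands in ops:
        key = (opcode, tuple(vn(x) for x in operands))
        if key in op_table:
            # Same opcode on the same values: reuse the earlier value number.
            value_table[result] = op_table[key]
            redundant.append(result)
        else:
            op_table[key] = vn(result)
    return value_table, redundant
```

Two uADDI operations that add 4 to the same register map to the same operation-table key, so the second result receives the value number of the first and is reported as redundant.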
VALUE NUMBERING MEMORY
The new algorithm extends SCCVN so that it computes value numbers for memory operations and uses the results to find redundant memory operations. By value numbering mem- 
Building SSA Form with M-lists
Like SCCVN, the new algorithm works on SSA form. It constructs SSA names for the memory objects in the M-lists as follows: for FRAME, names in M-def are treated as definitions; for JSR operations, names in M-use are treated as uses, and names in M-def are treated as definitions; for xLD, names in M-use are treated as uses; for xST, names in M-use are treated as uses, and names in M-def are treated as definitions. The construction inserts φ-functions for memory object names, just as it does for virtual registers.
The uses and definitions for memory operations are defined so that SSA naming of memory objects represents a flow-sensitive description of memory states for those memory operations. In particular, if two memory operations have the same SSA name for one memory object in their uses, then the state of that memory object must be the same before the execution of the operations. The M-def set for FRAME can be considered as the initial states of all possible memory objects accessed by the function. Because JSR and xST operations can change the states of memory objects in M-def, the SSA construction conservatively assumes that those operations do change the states of those memory objects. Thus, it treats M-def as a definition set. Figure 5 shows the SSA form of Compute.
Figure 5: SSA Form for Compute
Of course, in some contexts, those operations will not modify the states of those memory objects; for example, they might store the same value into the same memory objects. The new algorithm discovers such facts and propagates that knowledge along edges in the SSA graph.
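The naming rules for M-lists can be illustrated on straight-line code, where no φ-functions are needed. The Python sketch below is a simplification under that assumption; the function rename_memory, the "object.version" naming scheme, and the object name @res_0 are hypothetical.

```python
def rename_memory(ops):
    """ops: list of (opcode, m_use, m_def) in execution order.

    Straight-line code only: every use refers to the most recent
    definition, so no phi-functions are inserted.
    """
    version = {}
    renamed = []
    for opcode, m_use, m_def in ops:
        # Uses see the current version of each memory object.
        uses = [f"{obj}.{version[obj]}" for obj in m_use]
        defs = []
        for obj in m_def:
            # Each definition conservatively creates a new version.
            version[obj] = version.get(obj, -1) + 1
            defs.append(f"{obj}.{version[obj]}")
        renamed.append((opcode, uses, defs))
    return renamed
```

In this scheme, FRAME introduces version 0 of every object. Two loads that use the same version of an object observe the same state of that object, while a store to a different object leaves that version untouched.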
SCC-based Memory Numbering
To extend SCCVN to handle memory operations, we must modify the basic-block value-numbering algorithm to deal with the M-lists on memory operations and to number the φ-functions for memory objects. The modified version is shown in Figure 6. Because the new algorithm extends SCCVN, it inherits the optimistic nature of that algorithm. Like SCCVN, it finds the maximum fixed point for cycles, except that it handles both scalar values and memory states. The value numbering for a store checks the canonical store opcode (generated by Norm), the store address, the store value, and the M-list value numbers against an earlier load or store operation to detect redundancy. If the xST is redundant, the state of the objects in the M-def set is not changed, and the algorithm propagates the value numbers from the M-use set to the M-def set; if the xST is not redundant, the objects in the M-def set must be assigned new value numbers to indicate a new memory state.
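The treatment of loads and stores described above can be sketched for straight-line code in SSA form. The Python below is an illustration under several assumptions, not a transcription of Figure 6: the function number_memory_ops is hypothetical, Norm is modeled by stripping the LD/ST suffix, φ-functions and cycle iteration are omitted, and a non-redundant store is recorded under its M-def value numbers so that a later load of the same address and state can reuse the stored value.

```python
import itertools

def number_memory_ops(ops):
    """ops: list of (opcode, addr, value, m_use, m_def) over SSA names."""
    fresh = itertools.count()
    value_table = {}   # SSA name (register or memory object) -> value number
    op_table = {}      # (canonical opcode, address VN, M-list VNs) -> value VN
    redundant = []

    def vn(name):
        if name not in value_table:
            value_table[name] = next(fresh)
        return value_table[name]

    def norm(opcode):
        # Assumed Norm: "iLD" and "iST" share the canonical opcode "i".
        return opcode[:-2]

    for index, (opcode, addr, value, m_use, m_def) in enumerate(ops):
        key = (norm(opcode), vn(addr), tuple(vn(m) for m in m_use))
        if opcode.endswith("LD"):
            if key in op_table:
                value_table[value] = op_table[key]   # reuse earlier value
                redundant.append(index)
            else:
                op_table[key] = vn(value)
        elif op_table.get(key) == vn(value):
            # Redundant store: memory state is unchanged, so M-def
            # names inherit the value numbers of the M-use names.
            for use, new in zip(m_use, m_def):
                value_table[new] = vn(use)
            redundant.append(index)
        else:
            # Non-redundant store: every M-def name gets a new state.
            for new in m_def:
                value_table[new] = next(fresh)
            new_key = (key[0], key[1], tuple(vn(m) for m in m_def))
            op_table[new_key] = vn(value)
    return value_table, redundant
```

In a sequence that loads through r2, stores r10 through r3, and then repeats the load, reloads through r3, and repeats the store, the last three operations are reported as redundant, and the M-def name of the repeated store keeps the value number of its M-use name.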
Correctness
The value numbering process maintains two invariants with regard to the M-list sets. First, the memory state immediately prior to the execution of a memory operation is rep-
Time Complexity
The time complexity of the new algorithm is determined by the time needed to manipulate the SSA value 
O(αMN × D(G)), and the overall complexity for SCC-based memory value numbering is O((αMN + M) × D(G)). In practice, D(G) is
Discussion
The value table and operation table for Compute after value numbering are shown in Figure 7 and Figure 8. For the first and third loads, the algorithm proves that they use equivalent addresses (r2) and have the same memory states (@pa_0 and @pb_0); thus the third load is redundant. The same holds for the second and fourth loads. As this example shows, detecting redundant memory operations requires tracing the flow of scalar values (addresses and memory values). Conversely, load results are used in other scalar computations, so the ability to trace values through memory operations contributes to the detection of scalar redundancy.
Our new algorithm extends the value-based redundancy detection of Simpson's algorithm to handle memory-based values. It relies on value identity, rather than lexical identity. Because it unifies the treatment of register-based values and memory-based values, it can detect equalities that separate analyses cannot find. The power of the underlying algorithm, Simpson This repetition might be required to trace values from registers through memory and back to registers. Unifying these two analyses eliminates the need to iterate between them.
EXPERIMENTAL RESULTS
The execution model for our compiler system is depicted in Figure 9. The C front end converts the program into ILOC. The compiler applies multiple analysis and optimization passes to the ILOC code. Finally, the back end generates an executable in C form that emulates the ILOC code.
Value-based CSE
Our new algorithm discovers redundancies through value numbering. Because it uses value equality and ignores control flow, it requires a separate phase to discover which operations can actually be removed. This follow-on transformation must incorporate information about control flow that allows it to distinguish between a partially redundant and a fully redundant operation. Two candidates for this redundancy removal phase are traditional common subexpression elimination (CSE) [13, 1] and partial redundancy elimination [26, 20]. For this work, we implemented the classic CSE framework, reworked to reflect the fact that the equations operate on an SSA-based name space where kills cannot occur. With information about redundant memory operations, it is now capable of removing the fully static memory redundancy shown in Figure 2.
Table 1
    2    12    55     69     5     6     4
   14   180   197    637    27    59    58
   57   769  1125   3833   162   664   288
   36   523  1099   1569   190   307   121
   96  1089  1474   6363   456  1371   361
  113  1645  2465   5080   660  1212   516
   24   313   611   1205    81   392   213
   60   969  1649   3841   303   970   496
   62   718  1626   3695   302   796   293
  255  3095  5610  14428  1903  4632  1001
... AVLOCi; (2) for a memory operation in block i with ID m, m ∈ AVLOCi; furthermore, if it is an xLD with result value rv, then rv ∈ AVLOCi.
When the equations in Figure 10 are solved, the AVINi set contains the values and memory IDs available at the entry of block i. Fully redundant operations can be detected and removed by scanning the operations in block i in execution order as follows: (1) if scalar operation s computes rv ∈ AVINi, s is redundant and is removed; otherwise, add rv to AVINi; and (2) if a memory operation has ID m ∈ AVINi, it is redundant and is removed; otherwise, add m to AVINi and, for a load, also add the result value rv to AVINi. In the example for Compute, the third and fourth loads have the same IDs as the first and second loads, and thus are removed by CSE. Figure 11 shows the code for Compute after CSE.
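The availability computation and the removal scan can be sketched together in Python. The equations used here are an assumption based on the description above (no kills, so the set leaving a block is AVINi ∪ AVLOCi and AVINi is the intersection over predecessors); they are not copied from Figure 10. The function names solve_avin and scan_block and the string tags for value numbers ("v7") and memory IDs ("m1") are hypothetical.

```python
def solve_avin(preds, avloc):
    """Iterate AVIN_i = intersection over predecessors p of (AVIN_p | AVLOC_p).

    preds: {block: [predecessor blocks]}; avloc: {block: set of names}.
    Blocks without predecessors (the entry) start with an empty AVIN.
    """
    universe = set().union(*avloc.values())
    avin = {b: (set() if not preds[b] else set(universe)) for b in avloc}
    changed = True
    while changed:
        changed = False
        for b in avloc:
            if not preds[b]:
                continue
            new = set(universe)
            for p in preds[b]:
                new &= avin[p] | avloc[p]
            if new != avin[b]:
                avin[b] = new
                changed = True
    return avin

def scan_block(ops, avin):
    """ops: list of (kind, mem_id, result); kind is scalar, load, or store."""
    available = set(avin)
    kept = []
    for kind, mem_id, result in ops:
        key = result if kind == "scalar" else mem_id
        if key in available:
            continue                  # fully redundant: drop the operation
        available.add(key)
        if kind == "load":
            available.add(result)     # the loaded value is now available too
        kept.append((kind, mem_id, result))
    return kept
```

On a diamond-shaped CFG, a value computed on only one arm is not available at the join, whereas a load performed in the entry block is, so a repeated load with the same ID at the join is removed.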
Benchmarks
We integrated the new algorithm, followed by CSE, as a single pass on ILOC, referred to as v1. For comparison, we also implemented the scalar SCCVN with CSE, referred to as v0. We tested the two passes on 10 benchmarks: 6 from Mediabench [22] and 4 from the SPEC2000 CPU integer benchmarks [33]. The benchmarks and the statistics of the compiled ILOC intermediate form for the applications are listed in Table 1.
The applications are first translated from C into ILOC, and a sequence of sparse conditional constant propagation [35], dead code elimination, control flow restructuring (to remove empty and unreachable basic blocks), and copy coalescing passes is applied to the ILOC. After that, whole-program pointer analysis is run to produce the annotated ILOC with points-to and function REF and MOD information. The v0 and v1 passes work on the annotated ILOC after pointer analysis. The output ILOC code is then run through dead code elimination, peephole optimization, and copy coalescing, and is passed to the back end to generate the final executables.
    0  FRAME 0 => r2 r3
    10 iLD r2 => r7
    10 uADDI 4 r2 => r8
    10 iLD r8 => r9
    10 iADD r7 r9 => r10
    10 iST r3 r10
    11 uADDI 4 r3 => r11
    11 iSUB r7 r9 => r16
    11 iST r11 r16
    12 RTN
Figure 11: Compute after Redundancy Removal
Results
The dynamic total instruction counts and the instruction counts of loads and stores are shown in Table 2. The differences in total instruction count and memory instruction count are shown in Table 3.
The dynamic instruction counts show that all the benchmarks except adpcm had significant reductions in run-time memory operations from doing CSE based on memory value numbering (v1), when compared to CSE with scalar value numbering only (v0). The column labeled Z-.h4 shows that six of the benchmarks (gsm, mpeg2dec, 181.mcf, 164.gzip, 256.bzip2, 175.vpr) have a larger difference in total instruction count than in memory operation count. This suggests that, for those applications, detection of memory redundancy also leads to detection of more scalar redundancy.
To understand why memory value numbering finds no opportunity in adpcm, we inspected its source code. Interestingly, in the key encoding function adpcm_coder, the programmer stores the values of frequently used global variables into local variables which the compiler maps to virtual registers. In the kernel computation loop, the values in the local variables are used rather than loading the values from the global variables. In this way, the programmer has manually rewritten the function to perform exactly the transformation that the new algorithm would do: find redundant memory operations and reuse the results in registers.
As the data from Table 3 show, for g721 and pegwit, the entire difference in instruction counts occurs in memory operations. This suggests that the redundant memory instructions do not lead to discovery of more redundant scalar instructions. For epic, the memory instruction count difference is slightly larger than total instruction count difference, as some eliminated memory operations are replaced with scalar operations.
For all six other benchmarks, the total instruction count difference is significantly larger than the memory instruction count difference. This suggests detection of redundant memory operations leads to discovery of additional redundant scalar operations. In these cases, it would take the separate approach multiple iterations to detect the same set of scalar and memory redundancy.
Our execution model does not model the microarchitecture. As memory operations become more expensive relative to arithmetic, a reduction in memory operations should produce a larger run-time performance improvement. We would like to explore the memory performance issues using sophisticated microarchitecture simulators and study how the compile-time reduction of memory operations interacts with different configurations of processor resources and cache systems. The data in Tables 2 and 3 also show that most redundant memory operations discovered by our technique are loads.
... Figure 12. A careful code review would have caught these cases; however, this does highlight the need for automatic analysis to detect and remove such inefficiencies.
Algorithm Run Time Analysis
To assess the impact of memory value numbering on compile time, we measured the running time of the SSA construction, SCC value numbering, and CSE removal phases of v0 and v1. Because the absolute running times are very short, we intentionally measured them on a relatively slow machine: a lightly loaded SUN Ultra-1 workstation with a 140MHz clock and 240MB of memory. The results are shown in Table 4. These results show that the key phases of memory value numbering are fast.
... for real applications, we also collected the set size for memory-related operations. The results are shown in Table 6. The data show that, for load and store operations, the M-use and M-def set sizes are very small. These measurements explain why the new algorithm runs so quickly. As the data suggest, it has essentially linear time complexity in practice.
RELATED WORK
Program redundancy detection and removal have long been studied in the literature [1, 13, 26, 20, 5]. However, most algorithms only deal with unambiguous scalar values. Aliased memory operations cause the algorithms to use worst-case assumptions, i.e., all scalar values related to memory would be killed by an aliased store. On the other hand, work on register promotion focuses solely on memory redundancy [25, 24, 30]. As our experimental results show, these two kinds of redundancy interact, so detecting one often allows the compiler to detect the other. The published algorithms for register promotion work from a lexical notion of identity, rather than from a value-based identity. The value-based approach taken in our algorithm should reveal a larger set of equivalent expressions. In [4, 6], Bodik et al. use a value-numbering-based, path-sensitive analysis to detect both scalar and memory redundancy based on value equality. However, the algorithm mainly targets array-oriented applications where array subscripts can be represented as linear algebraic expressions and can be analyzed symbolically. They also treat aliased stores unsafely (values will not be killed even if the store does change those values) and depend on data speculation support from hardware to maintain program correctness. In comparison, our algorithm targets general-purpose applications, and the analysis is conservative and safe.
Other researchers have also proposed using SSA form to represent aliasing information [14, 11]. These extended SSA representations separate the aliased memory object information from the memory operation itself. Thus they require significant numbers of new SSA names to model the possible alias effects, and the authors propose various solutions to resolve the problem. In contrast, in our M-list representation, the aliased memory object information is directly associated with the memory operation it affects, so there is no need to invent additional SSA names. The aliasing effect is handled by taking memory object state as an additional operand in the value numbering process, which allows our algorithm not only to detect redundant memory operations, but also to propagate unchanged state along SSA edges. The value numbering process in [11] does not detect and propagate unchanged states.
Our algorithm assumes the existence of an interprocedural pointer analysis pass. The implementation uses a version of Andersen's algorithm [3]. Nothing in the work relies directly on that algorithm; other styles of pointer analysis would work in this framework [29, 21, 10, 28, 36, 15, 34]. Although our algorithm needs the alias analysis result to build M-lists, the relation between the effectiveness of our algorithm and the accuracy of alias analysis is subtle: it is more profitable to have empty intersections between M-lists (so that more memory states might be preserved as unchanged) than to keep the M-list size small, which has been used as the dominant criterion to measure the accuracy of alias analysis. Our experience suggests that flow-insensitive, context-insensitive pointer analysis works well with the new algorithm. As C structures are widely used in applications, it is important to distinguish field accesses to aggregate data structures. We use an ad hoc method in the Andersen algorithm to distinguish C struct fields. The aggregate decomposing algorithm described in [27] would be useful if it can be targeted to C-like, pointer-intensive languages.
CONCLUSION AND FUTURE WORK
This paper presents a powerful, unified approach to finding redundancy in both register-based scalar operations and memory-based operations. By using an SSA-based M-list representation, our algorithm captures accurate information about the state of memory. It can then detect redundant memory operations alongside scalar redundancies. It can discover all the scalar and memory redundancies found by separate approaches. Our experiments show that interaction exists between scalar and memory redundancies. Our algorithm finds them both in a single analysis phase.
The redundancy relation detected by the new algorithm can then be used to drive a redundancy-removal transformation. We showed that traditional scalar common subexpression elimination can be easily adapted to remove fully static redundant memory operations. A similar approach should adapt partial redundancy elimination to use this information [26, 20]. In particular, with the M-list representation, analysis can be developed to identify loop-invariant loads and stores and move them outside the loop. We are building a redundancy-removal transformation based on lazy code motion to demonstrate the use of memory value numbers in that transformation.
ACKNOWLEDGMENTS
This work was performed in the research compiler built over the years by the Massively Scalar Compiler Group at Rice. Many people have contributed to that effort with time, implementation, and insight. All these people deserve our thanks. We also thank the anonymous reviewers for their helpful suggestions. This effort was funded through the Los Alamos Computer Science Institute, as part of the project on Compilation Issues for High-performance Microprocessors.
