We revisit the classical problem of scalar replacement of array elements and pointer accesses. We generalize the state-of-the-art algorithm, by Carr and Kennedy [CK94], to handle a combination of both conditional control-flow and inter-iteration data reuse. The basis of our algorithm is to make the dataflow availability information precise using a technique we call SIDE: Statically Instantiate and Dynamically Evaluate. In SIDE the compiler inserts explicit code to evaluate the dataflow information at runtime. Our algorithm operates within the same assumptions as the classical one (perfect dependence information) and has the same limitations (increased register pressure). It is, however, optimal in the sense that within each code region where scalar promotion is applied, given sufficient registers, each memory location is read and written at most once.
benchmarks. We have implemented our algorithm in a C compiler, and we have found numerous instances where it is applicable as well.
For the impatient reader, the key idea is the following: for each value to be scalarized, the compiler creates a 1-bit runtime flag variable indicating whether the scalar value is "valid." The compiler also creates code which dynamically updates the flag. The flag is then used to detect and avoid redundant loads and to indicate whether a store has to occur to update a modified value at loop completion. This algorithm ensures that only the first load of a memory location is executed and only the last store takes place. This algorithm is a particular instance of a new general class of algorithms: it transforms values customarily used only at compile-time for dataflow analysis into dynamic objects. Our algorithm instantiates availability dataflow information into run-time objects, therefore achieving dynamic optimality even in the presence of constructs which cannot be statically optimized.
We introduce the algorithm by a series of examples which show how it is applied to increasingly complicated code structures. We start in Section 2 by showing how the algorithm handles a special case, that of memory operations from loop-invariant addresses. In Section 3.3 we show how the algorithm optimizes loads whose addresses are induction variables. Finally, we show how stores can be treated optimally in Section 3.4. In Section 4 we describe two implementations of our algorithm: one based on control-flow graphs (CFGs), and one relying on a special form of Static Single Assignment (SSA) named Pegasus. Although the CFG variant is simpler to implement, Pegasus simplifies the dependence analysis required to determine whether promotion is applicable. Special handling of loop-invariant guarding predicates is discussed in Section 5. Finally, in Section 7, we quantify the impact of an implementation of this algorithm when applied to the innermost loops of a series of C programs.
This paper makes the following new research contributions:
it introduces the SIDE class of dataflow analyses, in which the analysis is carried out statically, but the computation of the dataflow information is performed dynamically, creating dynamically optimal code for constructs which cannot be statically made optimal;
Conventions
We present all the optimization examples as source-to-source transformations of schematic C program fragments. For simplicity of the exposition we assume that we are optimizing the body of an innermost loop. We also assume that none of the scalar variables in our examples have their address taken. We write f(i) to denote an arbitrary expression involving i which has no side effects (but not a function call). We write for(i) to denote a loop having i as a basic induction variable; we assume that the loop body is executed at least once. For pedagogical purposes, the examples we present all assume that the code has been brought into canonical form through the use of if-conversion [AKPW83], such that each memory statement is guarded by a predicate; i.e., the code has the shape in Figure 1. Our algorithms are easily generalized to handle nested natural loops and arbitrary forward control-flow within the loop body.
Scalar Replacement of Loop-Invariant Memory Operations
In this section we describe a new register promotion algorithm which can eliminate memory references made to loop-invariant addresses in the presence of control flow. This algorithm is further expanded in Section 3.3 and Section 3.4 to promote memory accesses into scalars when the memory references have a constant stride.
The Classical Algorithm
Figure 2 shows a simple example and how it is transformed by the classical scalar promotion algorithm. Assuming p cannot point to i, the key fact is that *p always loads from and stores to the same address, therefore *p can be transformed into a scalar value. The load is lifted to the loop pre-header, while the store is moved after the loop. (The latter is slightly more difficult to accomplish if the loop has multiple exits going to multiple destinations. Our implementation handles these as well, as described in Section 4.2.2.)
Figure 3: A small program that is not amenable to classical register promotion.
Loop-Invariant Addresses and Control-Flow
However, the simple algorithm is no longer applicable to the slightly different Figure 3. Lifting the load or store out of the loop may be unsafe with respect to exceptions: one cannot lift a memory operation out of a loop if it may never be executed within the loop.
To optimize Figure 3, it is enough to maintain a valid bit in addition to the tmp scalar. The valid bit indicates whether tmp indeed holds the value of *p, as in Figure 4. The valid bit is initialized to false. A load from *p is performed only if the valid bit is false. Either loading from or storing to *p sets the valid bit to true. This program will forward the value of *p through the scalar tmp between iterations arbitrarily far apart.
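The valid-bit scheme can be sketched in C as follows. This is a minimal illustration rather than a reproduction of the figures: the loop shape (an update of *p guarded by `i & 1`) is an assumption modeled on the condition mentioned in the text, and the counters `load_count` and `store_count` and the helpers `load_int` and `store_int` are hypothetical instrumentation added only to make the memory traffic observable.

```c
#include <stdbool.h>

/* Hypothetical instrumentation so the memory traffic can be observed. */
static int load_count = 0;
static int store_count = 0;

static int load_int(const int *addr) { load_count++; return *addr; }
static void store_int(int *addr, int value) { store_count++; *addr = value; }

/* Unoptimized loop (assumed shape): *p is touched only when (i & 1) holds. */
void accumulate_odd(int *p, int n) {
    for (int i = 0; i < n; i++) {
        if (i & 1)
            store_int(p, load_int(p) + i);
    }
}

/* The same loop after promotion with a run-time valid bit. */
void accumulate_odd_promoted(int *p, int n) {
    int tmp = 0;             /* scalar standing in for *p */
    bool tmp_valid = false;  /* true iff tmp holds the current value of *p */

    for (int i = 0; i < n; i++) {
        if (i & 1) {
            if (!tmp_valid) {        /* load only if tmp is not yet valid */
                tmp = load_int(p);
                tmp_valid = true;
            }
            tmp += i;                /* the store is redirected to tmp */
            tmp_valid = true;        /* a store also makes tmp valid */
        }
    }
    if (tmp_valid)                   /* postlude: write the value back once */
        store_int(p, tmp);
}
```

If the guard never holds, the promoted loop performs no memory access at all, which is why the load cannot simply be hoisted into the pre-header.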
The insight is that it may be profitable to compute dataflow information at runtime. For example, the valid flag within an iteration is nothing more than the dynamic equivalent of the availability dataflow information for the loaded value, which is the basis of classical Partial Redundancy Elimination (PRE) [MR79]. When PRE can be applied statically, it is certainly better to do so. The problem with Figure 3 is that the compiler cannot statically summarize when condition (i&1) is true, and therefore has to act conservatively, assuming that the loaded value is never available. Computing the availability information at run-time eliminates this conservative approximation. Maintaining and using runtime dataflow information makes sense when we can eliminate costly operations (e.g., memory accesses) by using inexpensive operations (e.g., Boolean register operations).
This algorithm generates a program which is optimal with respect to the number of loads within each region of code to which promotion is applied (if the original program loads from an address, then the optimized program will load from that address exactly once), but may execute one extra store: if the original program loads the value but never stores to it, the valid bit will be true, enabling the postlude store. In order to treat this case as well, a dirty flag, set on writes, has to be maintained, as shown in Figure 5. Note: in order to simplify the presentation, the examples in the rest of the paper will not include the dirty bit. However, its presence is required for achieving an optimal number of stores.
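A sketch of the dirty-flag refinement, under assumed names: the function `sum_with_reset`, its parameter `reset_at`, and the counter `writeback_count` are hypothetical and serve only to separate the "read but never written" case from the "written" case.

```c
#include <stdbool.h>

static int writeback_count = 0;   /* hypothetical instrumentation */

/* Reads *p on odd iterations and overwrites it with 0 when i == reset_at.
   Pass reset_at = -1 to get a loop that only reads *p. */
int sum_with_reset(int *p, int n, int reset_at) {
    int tmp = 0;
    bool tmp_valid = false;   /* tmp mirrors *p */
    bool tmp_dirty = false;   /* tmp was modified and differs from memory */
    int acc = 0;

    for (int i = 0; i < n; i++) {
        if (i & 1) {                  /* guarded read of *p */
            if (!tmp_valid) {
                tmp = *p;
                tmp_valid = true;
            }
            acc += tmp;
        }
        if (i == reset_at) {          /* guarded write of *p */
            tmp = 0;
            tmp_valid = true;
            tmp_dirty = true;
        }
    }
    if (tmp_dirty) {                  /* write back only if tmp was written */
        *p = tmp;
        writeback_count++;
    }
    return acc;
}
```

Testing `tmp_valid` instead of `tmp_dirty` in the postlude would still be correct, but it would issue one store in the read-only case.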
Inter-Iteration Scalar Promotion
Here we extend the algorithm for promoting loop-invariant operations to perform scalar promotion of pointer and array variables with constant stride. We assume that the code has been subjected to standard dependence analysis prior to scalar promotion. consecutive values of i. This quickly creates register pressure and therefore heuristics are usually used to decide whether promotion is beneficial. Since register pressure has been very well addressed in the literature [CCK90, Muc97, CMS96, CW95], we will not concern ourselves with it anymore in this text.
The Carr-Kennedy Algorithm
A later extension to the Carr-Kennedy algorithm [CK94] allows it to also handle control flow. The algorithm optimally handles reuse of values within the same iteration, by using PRE on the loop body. However, this algorithm can no longer promote values across iterations in the presence of control-flow. The compiler has difficulty in reasoning about the intervening updates between accesses made in different iterations in the presence of control-flow.
Partial Redundancy Elimination
Before presenting our solution let us note that even the classical PRE algorithm (without the support of special register promotion) is quite successful in optimizing loads made in consecutive iterations. Figure 7 shows a sample loop and its optimization by gcc, which does not have a register promotion algorithm at all. By using PRE alone gcc manages to reuse the load from ptr2 one iteration later.
The PRE algorithm is unable to achieve the same effect if data is reused in any iteration other than the immediately following iteration or if there are intervening stores. In such cases an algorithm like Carr-Kennedy is necessary to remove the redundant accesses. Let us notice that the use of valid flags achieves the same degree of optimality as PRE within an iteration, but at the expense of maintaining run-time information.
do { *ptr1++ = *ptr2++; } while(--cnt && *ptr2);
tmp = *ptr2;
do {
    *ptr1++ = tmp;
    ptr2++;
    if (!--cnt) break;
    tmp = *ptr2;
    if (!tmp) break;
} while(1);
Figure 7: Sample loop and its optimization using PRE. (The output is the equivalent of the assembly code generated by gcc.) PRE can achieve some degree of register promotion for loads.
Figure 9. Applying constant propagation and dead-code elimination will simplify this code by removing the unnecessary references to a2_valid.
Removing All Redundant Stores
Handling stores seems to be more difficult, since one should forgo a store if the value will be overwritten in a subsequent iteration. However, in the presence of control-flow it is not obvious how to deduce whether the overwriting stores in future iterations will take place. Here we extend the register promotion algorithm to ensure that only one store is executed to each memory location, by showing how to optimize the example in Figure 10.
We want to avoid storing to a[i+2], since that store will be overwritten two iterations later by the store to a[i]. However, this is not true for the last two iterations of the loop. Since, in general, the compiler cannot generate code to test loop-termination several iterations ahead, it looks as if both stores must be performed in each iteration. However, we can do better than that by performing within the loop only the store to a[i], which certainly will not be overwritten. The loop in Figure 11 does exactly that. The loop body never overwrites a stored value but may fail to correctly update the last two elements of array a. Fortuitously, after the loop completes, the scalars a0, a1 hold exactly these two values. So we can insert a loop postlude to fix the potentially missing writes. (Of course, dirty bits should be used to prevent useless updates.)
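The store-side transformation can be sketched as follows. The loop body is a hypothetical stand-in for the figures (the guards `i % 3 == 0` and `i & 1` and the stored values are assumptions), and `stores_issued` and `store_elem` are instrumentation added for illustration. Because this loop contains only stores, the valid bits also play the role of dirty bits.

```c
#include <stdbool.h>

static int stores_issued = 0;   /* hypothetical instrumentation */
static void store_elem(int *addr, int value) { stores_issued++; *addr = value; }

/* Assumed original loop: guarded stores to a[i+2] and to a[i]. */
void fill_original(int *a, int n) {
    for (int i = 0; i + 2 < n; i++) {
        if (i % 3 == 0) store_elem(&a[i + 2], i);
        if (i & 1)      store_elem(&a[i], -i);
    }
}

/* Promoted loop: a0, a1, a2 shadow a[i], a[i+1], a[i+2]. */
void fill_promoted(int *a, int n) {
    int a0 = 0, a1 = 0, a2 = 0;
    bool a0_valid = false, a1_valid = false, a2_valid = false;
    int i;

    for (i = 0; i + 2 < n; i++) {
        if (i % 3 == 0) { a2 = i;  a2_valid = true; }  /* store to a[i+2] */
        if (i & 1)      { a0 = -i; a0_valid = true; }  /* store to a[i]   */

        /* Generating store: a[i] is never written again after this point. */
        if (a0_valid) store_elem(&a[i], a0);

        /* Shift the scalars and their flags for the next iteration. */
        a0 = a1; a0_valid = a1_valid;
        a1 = a2; a1_valid = a2_valid;
        a2_valid = false;
    }
    /* Postlude: after the final shift, a0 and a1 correspond to a[i], a[i+1]. */
    if (a0_valid) store_elem(&a[i], a0);
    if (a1_valid) store_elem(&a[i + 1], a1);
}
```

In this sketch every array element is written at most once by the promoted version, while the original may write the same element in two different iterations.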
Implementation
This algorithm is probably much easier to illustrate than to describe precisely. Since the important message was hopefully conveyed by the examples, we will just briefly sketch the implementation in a CFG-based framework and describe in somewhat more detail the Pegasus implementation.
CFG-Based Implementation
In general, for each constant reference to a[i+j] (for a compile-time constant j) we maintain a scalar t_j and a valid bit t_j_valid. Then scalar replacement just makes the following changes:
Replace every load from a[i+j] with a pair of statements: t_j = t_j_valid ? t_j : a[i+j]; t_j_valid = true.
Replace every store a[i+j] = e with a pair of statements: t_j = e; t_j_valid = true.
Furthermore, all stores except the generating store are removed. Instead, compensation code is added "after" the loop: for each t_j append a statement if (t_j_valid) a[i+j] = t_j.
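The load rule can be illustrated on a small loop with two guarded, strided loads. The loop itself is a hypothetical example (the guards and the summation are assumptions), `loads_issued` and `load_elem` are instrumentation, and the scalars t0, t1, t2 with their valid bits follow the naming pattern of the rules above.

```c
#include <stdbool.h>

static int loads_issued = 0;   /* hypothetical instrumentation */
static int load_elem(const int *addr) { loads_issued++; return *addr; }

/* Assumed original loop: guarded loads from a[i] and a[i+2]. */
int sum_original(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i + 2 < n; i++) {
        if (i % 3 != 0) sum += load_elem(&a[i]);
        if (i & 1)      sum += load_elem(&a[i + 2]);
    }
    return sum;
}

/* After scalar replacement: t0, t1, t2 shadow a[i], a[i+1], a[i+2]. */
int sum_promoted(const int *a, int n) {
    int t0 = 0, t1 = 0, t2 = 0;
    bool t0_valid = false, t1_valid = false, t2_valid = false;
    int sum = 0;

    for (int i = 0; i + 2 < n; i++) {
        if (i % 3 != 0) {
            t0 = t0_valid ? t0 : load_elem(&a[i]);       /* load rule */
            t0_valid = true;
            sum += t0;
        }
        if (i & 1) {
            /* t2_valid is always false here, so this test folds away
               under constant propagation. */
            t2 = t2_valid ? t2 : load_elem(&a[i + 2]);
            t2_valid = true;
            sum += t2;
        }
        /* Shift scalars and flags: a[i+1] becomes a[i], and so on. */
        t0 = t1; t0_valid = t1_valid;
        t1 = t2; t1_valid = t2_valid;
        t2_valid = false;
    }
    return sum;
}
```

A value loaded as a[i+2] is reused two iterations later as a[i] only when its valid bit survived the two shifts, which is exactly the information a static analysis cannot summarize for data-dependent guards.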
Complexity: the algorithm, aside from the dependence analysis, is linear in the size of the loop.
Correctness and optimality: both follow from the following invariant: each t_valid flag is true if and only if the scalar t represents the contents of the memory location it scalarizes.
Pegasus
Hyperblocks are stitched together into a dataflow graph representing the entire procedure by creating dataflow edges connecting each hyperblock to its successors. Each variable live at the end of a hyperblock gives rise to an eta node [OBM90]. Eta nodes have two inputs (a value and a predicate) and one output. When the predicate evaluates to "true," the input value is moved to the output; when the predicate evaluates to "false," the input value and the predicate are simply consumed, generating no output. A hyperblock with multiple predecessors receives control from one of several different points; such join points are represented by merge nodes.
Operations with side-effects are parameterized with a predicate input, which indicates whether the operation should take place. If the predicate is false, the operation is not executed. Predicate values are indicated in our figures with dotted lines.
The compiler adds dependence edges between operations whose side-effects may not commute. Such edges only carry an explicit synchronization token, not data. Operations with memory side-effects (loads, stores, calls, and returns) all have a token input. When a side-effect operation depends on multiple other operations (e.g., a write operation following a set of reads), it must collect one token from each of them. For this purpose a combine operator is used; a combine has multiple token inputs and a single token output; the output is generated after it receives all its inputs. In figures (e.g., see Figure 12) dashed lines indicate token flow and the combine operator is depicted by a "V". Token edges explicitly encode data flow through memory. In fact, the token network can be interpreted as an SSA form for the memory values, where the combine operator is similar to a φ function. The tokens encode true-, output- and anti-dependences, and they are "may" dependences. In Figure 12(A) there is one load and two stores. A load is denoted by "=[ ]" and has three inputs: address, predicate and token; it produces two outputs: the loaded value and another token. A store is denoted by "[ ]=" and has four inputs: address, data, predicate and token; its only output is a token.
Register Promotion in Pegasus
We sketch the most important analysis and transformation steps carried out by CASH for register promotion. Although the actual promotion in Pegasus is slightly more complicated than in a CFG-based representation (because of the need to maintain -nodes), the dependence tests used to decide whether promotion can be applied are much simpler: the graph will have a very restricted structure if promotion can be applied. A key element of the representation is the token edge network, whose structure can be quickly analyzed to determine important properties of the memory operations.
We illustrate register promotion on the example in Figure 8 .
1. The token network for the Pegasus representation is shown in Figure 13. Memory accesses that may interfere with each other will all belong to the same connected component of the token network. Operations that belong to distinct components of the token network commute and can therefore be analyzed separately. In this example there is a single connected component, corresponding to accesses made to the array a.
these accesses are constant (i.e., iteration-independent), making these accesses candidates for register promotion.
The induction step of the addresses indicates the type of promotion: a 0 step indicates loop-invariant accesses, while a non-zero step, as in this example, indicates strided accesses.
3. The token network is further analyzed. Notice that prior to register promotion, memory disambiguation has already proved (based on symbolic computation on address expressions) that the accesses to a[i] and a[i+2] commute, and therefore there is no token edge between them. The token network for a consists of two strands: one for the accesses to a[i], and one for a[i+2]; the strands are generated at the mu, on top, and joined before the etas, at the bottom, using a combine (V). Promotion can be carried out if and only if all memory accesses within the same strand are made to the same address.
CASH generates the initialization for the scalar temporaries and the "valid" bits in the loop pre-header. We do not illustrate this step.
4. Each strand is scanned from top to bottom (from the mu to the eta), term-rewriting each memory operation:
Figure 14 shows how a load operation is transformed by register promotion. The resulting construction can be interpreted as follows: "If the data is already valid do not do the load (i.e., the load predicate is 'and'-ed with the negation of the valid bit) and use the data. Otherwise do the load if its predicate indicates it needs to be executed." The multiplexor will select either the load output or the initial data, depending on the predicates. If neither predicate is true, the output of the mux is not defined, and the resulting valid bit is false.
Figure 15 shows the term-rewriting process for a store. After this transformation, all stores except the generating store are removed from the graph (for this purpose the token input is connected directly to the token output, as described in [BG03]). The resulting construction is interpreted as follows: "If the store occurs, the data-to-be-stored replaces the register-promoted data, and it becomes valid. Otherwise, the register-promoted data remains unchanged."
5. Code is synthesized to shift the scalar values and predicates around between strands (the assignments t_{j-1} = t_j), as illustrated in Figure 16.
6. The insertion of a loop postlude is somewhat more difficult in general than a loop prelude, since by definition natural loops have a unique entry point, but may have multiple exits. In our implementation each loop body is completely predicated and therefore all instructions get executed, albeit some are nullified by the predicates. The compensating stores are added to the loop body and executed only during the last iteration. This is achieved by using the loop-termination predicate as the predicate controlling these stores. This step is not illustrated.
Figure 17: Sample code with loop-invariant memory accesses. c1 and c2 stand for loop-invariant expressions.
Handling Loop-Invariant Predicates
The register promotion algorithm described above can be improved by handling loop-invariant predicates specially. If the disjunction of the predicates guarding all the loads and stores of the same location contains a loop-invariant subexpression, then the initialization load can be lifted out of the loop and guarded by that subexpression. Consider Figure 17, to which we apply loop-invariant scalar promotion. By applying our register promotion algorithm one gets the result in Figure 18. However, using the fact that c1 and c2 are loop-invariant, the code can be optimized as in Figure 19. Both Figure 18 and Figure 19 execute the same number of loads and stores, and therefore, by our optimality criterion, are equally good. However, the code in Figure 19 is obviously superior.
We can generalize this observation: the code can be improved whenever the disjunction of all conditions guarding loads or stores from *p is weaker than some loop-invariant expression (even if none of the conditions is itself loop-invariant), such as in Figure 20. In this case the disjunction of all predicates is f(i)||!f(i), which is constant "true." Therefore, the load from *p can be unconditionally lifted out of the loop as shown in Figure 21.
In general, let us assume that each statement s is controlled by a predicate P(s).
Figure 17 using the invariance of c1 and c2.
Our current implementation of this optimization in CASH only lifts out of the loop the disjunction of all predicates which are actually loop-invariant.
Discussion

Dynamic Disambiguation
Our scalar promotion algorithm can be naturally extended to cope with a limited number of memory accesses which cannot be disambiguated at compile time. By combining dynamic memory disambiguation [Nic89] with our scheme to handle conditional control flow, we can apply scalar promotion even when pointer analysis determines that memory references interfere. Consider the example in Figure 22: even though dependence analysis indicates that p cannot be promoted since the access to q may interfere, the bottom part of the figure shows how register promotion can be applied.
This scheme is an improvement over the one proposed by Sastry [SJ98], which stores to memory all the values held in scalars when entering an un-analyzable code region (which in this case is the region guarded by f(i)).
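One possible way to combine a run-time address comparison with the valid-bit scheme is sketched below. It is an assumption about the general shape of such code, not a transcription of Figure 22: the loop body, the guard `i % 4 == 3` (standing in for f(i)), and the function names are hypothetical. The sketch also assumes that p and q point to objects of the same type, so that the accesses either coincide exactly or do not overlap.

```c
#include <stdbool.h>

/* Assumed original loop: *p is updated every iteration, and a guarded
   store through q may or may not alias *p. */
void update_original(int *p, int *q, int n) {
    for (int i = 0; i < n; i++) {
        *p += i;
        if (i % 4 == 3) *q = 0;
    }
}

/* Promoted version: the potentially interfering store is disambiguated
   at run-time by comparing the two addresses. */
void update_promoted(int *p, int *q, int n) {
    int tmp = 0;
    bool tmp_valid = false;

    for (int i = 0; i < n; i++) {
        if (!tmp_valid) { tmp = *p; tmp_valid = true; }
        tmp += i;
        if (i % 4 == 3) {
            if (q == p)
                tmp = 0;     /* aliasing store: update the scalar instead */
            else
                *q = 0;      /* independent store: goes to memory as before */
        }
    }
    if (tmp_valid) *p = tmp; /* postlude write-back */
}
```

Redirecting the aliasing store into the scalar keeps the value in a register; spilling tmp to memory before the guarded region would also be correct but would cost extra stores.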
Hardware support
While our algorithm does not require any special hardware support, certain hardware structures can improve its efficiency.
Rotating registers were introduced in the Cydra 5 architecture [DHB89] to support software pipelining. These were used on Itanium for register promotion [DKK+99] to shift all the scalar values in one cycle.
/* second store */ if (! fi) { tmp = 2; tmp_valid = true; } } if (tmp_valid) *p = tmp;
Figure 21: Optimization of the code in Figure 20 using the fact that the disjunction of all predicates guarding *p is loop-invariant (i.e., constant) "true" and the same code after further constant propagation.
Rotating predicate registers as in the Itanium can rotate the "valid" flags.
Software valid bits can be used to reduce the overhead of maintaining the valid bits. If a value is reused n iterations later, then our algorithm requires the use of 2n different scalars: n valid bits and n values. A software-only solution is to pack the valid bits into a single integer and to use masking and shifting to manipulate them. This makes rotation very fast, but testing and setting more expensive, a trade-off that may be practical on a wide machine having "free" scheduling slots.
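Packing the valid bits into one integer might look like the sketch below. The type name `valid_mask` and the helper names are hypothetical; bit k is assumed to hold the valid flag of the scalar shadowing a[i+k], and k is assumed to be smaller than 32.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit k is the valid flag of the scalar that shadows a[i+k]. */
typedef uint32_t valid_mask;

/* Testing a flag costs a shift and a mask. */
static bool valid_test(valid_mask m, unsigned k) {
    return ((m >> k) & 1u) != 0;
}

/* Setting a flag costs a shift and an or. */
static valid_mask valid_set(valid_mask m, unsigned k) {
    return m | ((valid_mask)1 << k);
}

/* End-of-iteration shift: flag k+1 becomes flag k and the top flag is
   cleared, all in a single operation regardless of the number of flags. */
static valid_mask valid_rotate(valid_mask m) {
    return m >> 1;
}
```

The per-iteration shift replaces a chain of flag-to-flag copies with one instruction, at the price of an extra shift and mask on every test or set.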
Predicated data [RC03] has been proposed for an embedded VLIW processor: predicates are not attached to instructions, but to data itself, as an extra bit of each register. Predicates are propagated through arithmetic, similar to exception poison bits. The proposed architecture supports rotating registers by implementing the register file as an actual large shift register. These architectural features would make the valid flags essentially free both in space and in time.
Other Applications of SIDE
This paper introduces the SIDE framework for run-time dataflow evaluation, and presents the register promotion algorithm as a particular instance. Register promotion uses the dynamic evaluation of availability and uses predication to remove memory accesses for achieving optimality. SIDE is naturally applied to the availability dataflow information, because it is a forward dataflow analysis, and its run-time determination is trivial.
PRE [MR79] is another optimization which uses availability information and which could possibly benefit from the application of SIDE. In particular, safe PRE forms (i.e., which never introduce new computations on any path) seem amenable to the use of SIDE. While some forms of PRE, such as lazy c
Second, it contains more computations than the original program in maintaining the flags. The optimized program may end up being slower than the original, depending, among other things, on the frequency with which the memory access statements are executed and whether the predicate computations are on the critical path. For example, if none of them is executed dynamically, all the inserted code is overhead. In practice profiling information and heuristics should be used to select the loops which will most benefit from this transformation.
Third, scalar promotion removes memory accesses which hit in the cache, therefore its benefit appears to be limited. However, in modern architectures L1 cache hits are not always cheap. For example, on the Intel Itanium 2 some L1 cache hits may cost as much as 17 cycles [CL03]. Register promotion trades bandwidth to the load-store queue (or the L1 cache) for bandwidth to the register file, which is always bigger.
Fourth, by predicating memory accesses, operations which were originally independent, and could potentially be issued in parallel, now become dependent through the predicates. This could increase the dynamic critical path of the program, especially when memory bandwidth is not a bottleneck.
Performance Measurements
In this section we present measurements of our register promotion algorithm as implemented in the CASH C compiler. We show static and dynamic data for C programs from three benchmark suites: Mediabench [LPMS97], SpecInt95 [Sta95] and Spec CPU2000 [Sta00].
Our implementation does not use dirty bits and therefore is not optimal with respect to the number of stores (it may, in fact, incur additional stores with respect to the original program). However, dirty bits can only save a constant number of stores, independent of the number of iterations. We have considered this overhead unjustified. We only lift loop-invariant predicates to guard the initializer; our implementation can thus optimize Figure 17, but not Figure 20. As a simple heuristic to reduce register pressure, we do not scalarize a value if it is not reused for 3 iterations.
Table 2 shows how often scalar promotion can be applied. Column 3 shows that our algorithm found many more opportunities for scalar promotion that would not have been found using previous scalar promotion algorithms (however, we do not include here the opportunities discovered by PRE). CASH uses a simple flow-sensitive intra-procedural pointer analysis for dependence analysis.
Figure 23 and Figure 24 show the percentage decrease in the number of loads and stores, respectively, that results from the application of our register promotion algorithms. The data labeled PRE indicates the number of memory operations removed by our straight-line code optimizations only. The data labeled l shows the additional benefit of applying inter-iteration register promotion. We have included both bars since some of the accesses can be eliminated by both algorithms.
The most spectacular results occur for 124.m88ksim, which has substantial reductions in both loads and stores. Only two functions are responsible for most of the reduction in memory traffic: alignd and loadmem. Both these functions benefit from a fairly straightforward application of loop-invariant memory access removal. Although loadmem contains control-flow, the promoted variable is always accessed unconditionally. The substantial reduction in memory loads in gsm_e is also due to register promotion of invariant memory accesses, in the hottest function, Calculation_of_the_LTP_parameters. This function contains a very long loop body created using many C macros, which expand to access several constant locations in a local array. The loop body contains control-flow, but all accesses to the small array are unconditional. Finally, the substantial reduction of the number of stores for rasta is due to the FR4 function, which also benefits from unconditional register promotion.
Figure 23: Percentage reduction in the number of dynamic load operations due to the application of the memory PRE and register promotion optimizations.
The impact of these reductions on actual execution time depends highly on hardware support. The performance impact modeled on Spatial Computation (described in [BG03, Bud03]) is shown in Figure . Spatial Computation can be seen as an approximation for a very wide machine, but one which is connected by a bandwidth-limited network to a traditional memory system.
We model a relatively slow memory system, with a 4-cycle L1 cache hit time. Interestingly, the improvement in running time is better if memory is faster (e.g., with a perfect memory system of 2-cycle latency the gsm_e speed-up becomes 18%). This effect occurs because the cost of the removed L1 accesses becomes a smaller fraction of total execution cost when memory latency increases.
The speed-ups range from a 1.1% slowdown for 183.equake, to a maximum speed-up of 14% for gsm_e. There is a fairly good correlation between the speed-up and the number of removed loads. The number of removed stores seems to have very little impact on performance, indicating that the load-store queue contention caused by stores is not a problem for performance (since stores complete asynchronously, they do not have a direct impact on end-to-end performance). Five programs have a performance improvement of more than 5%. Since most operations removed are relatively inexpensive, because they have good temporal locality, the performance improvement is not very impressive. Register promotion alone causes a slight slow-down for 4 programs, while being responsible for a speed-up of more than 1% for only 7 programs.
Schemes that use hardware support for register promotion such as [PGM00, DO94, OG01] are radically different from our proposal, which is software-only. Hybrid solutions, utilizing several of these techniques combined with SIDE, can be devised.
Bodík et al. [BGS99] analyze the effect of PRE on promoting loaded values and estimate the potential improvements. The idea of predicating code for dynamic optimality was also advanced by Bodík [BG and was applied for partial dead-code elimination. In fact, the latter paper can be seen as an application of the SIDE framework to the dataflow problem of dead-code. Muchnick [Muc97] gives an example in w Figure 21), but he does not describe a general algorithm solving the problem optimally.
Conclusions
We have described a scalar promotion algorithm which eliminates all redundant loads and stores even in the presence of conditional control flow. The key insight in our algorithm is that availability information, traditionally computed only at compile-time, can be more precisely evaluated at run-time. We transform memory accesses into scalar values and perform the loads only when the scalars do not already contain the correct value, and the stores only when their value will not be overwritten. Our approach substantially increases the number of instances when register promotion can be applied.
As the computational bandwidth of processors increases, such optimizations may become more advantageous. In the case of register promotion, the benefit of removing memory operations sometimes outweighs the increase in scalar computations to maintain the dataflow information; since the removed operations tend to be inexpensive (i.e., they hit in the load-store queue or in the L1 cache), the resulting performance improvements are relatively modest.
