Abstract. Data dependences are known to hamper efficient parallelization of programs. Memory expansion is a general method to remove dependences by assigning distinct memory locations to dependent writes. Parallelization via memory expansion requires both moderation in the expansion degree and efficiency at run time. We present a general storage mapping optimization framework for imperative programs, applicable to most loop nest parallelization techniques.
Introduction
Data dependences are known to hamper automatic parallelization of imperative programs and their efficient compilation on modern processors or supercomputers. A general method to reduce the number of memory-based dependences is to disambiguate memory accesses by assigning distinct memory locations to non-conflicting writes, i.e. to expand data structures. In parallel processing, expanding a datum also allows placing one copy of the datum on each processor, enhancing parallelism. This technique is known as array privatization [5, 12, 16] and is extremely important to parallelizing and vectorizing compilers.
In the extreme case, each memory location is written at most once, and the program is said to be in single-assignment form (total memory expansion). The high memory cost is a major drawback of this method. Moreover, when the control flow cannot be predicted at compile time, some run-time computation is needed to preserve the original data flow: similarly to the static single-assignment framework [6], $\phi$-functions may be needed to "merge" possible data definitions due to several incoming control paths.
Therefore parallelization via memory expansion requires both moderation in the expansion degree and efficiency in the run-time computation of $\phi$-functions. A technique limited to affine loop nests was proposed in [11] to optimize memory management. The systolic community has a similar technique implemented in ALPHA compilers [14]. A different approach [15] is limited to perfect uniform loop nests, and introduces universal storage mappings. We present a general storage mapping optimization framework for expansion of imperative programs, applicable to most parallelization techniques, for any nest of loops with unrestricted conditionals and array subscripts.
Section 2 studies a motivating example showing what we want to achieve, before pointing out contributions in more detail. Section 3 formally defines storage mapping optimization, then we present our algorithm in Section 4. Experimental results are studied in Section 5, before we conclude.
Motivating Example
We first study the kernel in Figure 1.1, which appears in several convolution codes (see footnote 1). Parts denoted by "..." have no side effect on variable x. For any statement in a loop nest, the iteration vector is built from the surrounding loop counters (see footnote 2). Each loop iteration spawns instances of the statements included in the loop body: instances of S are denoted by $\langle S, i, j \rangle$, for $1 \le i \le n$ and $j \ge 1$.
Instance-wise Reaching Definition Analysis
We believe that an efficient parallelization framework must rely on a precise knowledge of the flow of data, and advocate Instance-wise Reaching Definition Analysis (IRDA): it computes which instance of which write statement defined the value used by a given instance of a statement. This write is the (reaching) definition of the read access.
Any IRDA is suitable for our purpose, but Fuzzy Array Data-flow Analysis (FADA) [1] is preferred since it handles any loop nest and achieves today's best precision. Value-based Dependence Analysis [17] is also a good IRDA. In the following, $\sigma$ is alternatively seen as a function and as a relation. The results for references to x in the right-hand sides of R and S are nested conditionals:
$\sigma(\langle S, i, j \rangle, x) = \text{if } j = 1 \text{ then } \{\langle T, i \rangle\} \text{ else } \{\langle S, i, j-1 \rangle\}$,
$\sigma(\langle R, i \rangle, x) = \{\langle S, i, j \rangle : 1 \le j\}$.
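As an illustration, the two source sets given above for the motivating kernel can be written as small functions. This is a minimal sketch, not the output of FADA: the names `source_S` and `source_R`, the tuple encoding of instances (with the instance of T in outer iteration i written `("T", i)`), and the explicit bound `m` on the inner-loop counter are assumptions introduced here so that the sets stay finite.

```python
def source_S(i, j):
    """Possible sources of the read of x in instance <S, i, j>."""
    if j == 1:
        return {("T", i)}        # first inner iteration reads the value written by T
    return {("S", i, j - 1)}     # later iterations read the previous instance of S


def source_R(i, m):
    """Possible sources of the read of x in instance <R, i>.

    The trip count of the inner loop is unknown at compile time, so every
    instance <S, i, j> with 1 <= j is a possible source; m is a hypothetical
    finite bound used only to enumerate the set.
    """
    return {("S", i, j) for j in range(1, m + 1)}
```

The read in S always has exactly one possible source, whereas the read in R has several: this is the situation in which run-time restoration of the data flow becomes necessary.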
Conversion to Single Assignment Form
Here, memory-based dependences hamper direct parallelization via scheduling or tiling. We need to expand scalar x and remove as many output, anti, and true dependences as possible. In the extreme expansion case, we would like to convert the program into single-assignment (SA) form [8], where all dependences due to memory reuse are removed.
Footnote 1: E.g. Horn and Schunck's 3D Gaussian smoothing by separable convolution.
Footnote 2: When dealing with while loops, we introduce artificial integer counters.
Reaching definition analysis is at the core of SA algorithms, since it records the location of values in expanded data structures. However, when the flow of data is unknown at compile time, $\phi$-functions are introduced for run-time restoration of values [4, 6]. Figure 1.2 shows our program converted to SA form, with the outer loop marked parallel (m is the maximum number of iterations that the inner loop can take). A $\phi$-function is necessary, but can be computed at low cost: it represents the last iteration of the inner loop.
Reducing Memory Usage
SA programs suffer from high memory requirements: S now assigns a huge $n \times m$ array. Optimizing memory usage is thus a critical point when applying memory expansion techniques to parallelization. Figure 1.3 shows the parallel program after partial expansion. Since T executes before the inner loop in the parallel version, S and T may assign the same array. Moreover, a one-dimensional array is sufficient since the inner loop is not parallel. As a side effect, no $\phi$-function is needed any more. Storage requirement is $n$, to be compared with $n \times m + n$ in the SA version, and with 1 in the original program (with no legal parallel reordering).
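The three versions discussed so far can be compared on a small executable model. The sketch below is reconstructed from the prose only and is not the authors' figure: the callbacks `init` and `step` stand in for the elided computations, `trips(i)` is a hypothetical trip count replacing the while condition, and the inner loop is assumed to run at least once and at most m times, as the source relation for R suggests. Index 0 of each array is left unused so that subscripts match the text.

```python
def original(n, trips, init, step):
    """Sequential kernel: a single scalar x is reused by every iteration."""
    out = []
    for i in range(1, n + 1):
        x = init(i)                        # T
        for j in range(1, trips(i) + 1):   # stands in for the while loop
            x = step(x, i, j)              # S
        out.append(x)                      # R
    return out


def single_assignment(n, m, trips, init, step):
    """SA form: every write gets its own cell (n + n*m cells in total)."""
    xT = [None] * (n + 1)
    xS = [[None] * (m + 1) for _ in range(n + 1)]
    out = [None] * n
    for i in range(1, n + 1):              # outer iterations are independent
        xT[i] = init(i)                    # T
        last = 0
        for j in range(1, trips(i) + 1):
            prev = xT[i] if j == 1 else xS[i][j - 1]
            xS[i][j] = step(prev, i, j)    # S
            last = j                       # bookkeeping for the phi-function
        out[i - 1] = xS[i][last]           # R reads phi(...): last iteration of S
    return out


def partial_expansion(n, trips, init, step):
    """Partial expansion: T and S share one array of n cells, no phi-function."""
    x = [None] * (n + 1)
    out = [None] * n
    for i in range(1, n + 1):              # outer iterations are independent
        x[i] = init(i)                     # T
        for j in range(1, trips(i) + 1):
            x[i] = step(x[i], i, j)        # S
        out[i - 1] = x[i]                  # R
    return out
```

In the two expanded versions no cell is shared between different values of i, which is what allows the outer loop to be marked parallel; the loops are written sequentially here only to keep the sketch simple.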
We have built an optimized schedule-independent or universal storage mapping, in the sense of [15]. On many programs, a more memory-economical technique consists in computing a legal storage mapping according to a given parallel execution order, instead of finding a universal storage compatible with any legal execution order. This is done in [11] for affine loop nests only.
Our contributions are the following: (i) formalize the correctness of a storage mapping, according to a given parallel execution order, for any nest of loops with unrestricted conditional expressions and array subscripts; (ii) show that universal storage mappings defined in [15] correspond to correct storage mappings according to the data-flow execution order; (iii) present an algorithm for storage mapping optimization, applicable to any nest of loops and all parallelization techniques based on polyhedral dependence graphs.
Formalization of the Correctness
Let us start with some vocabulary. A run-time statement instance is called an operation. The sequential execution order of the program defines a total order over operations, call it $\prec$. Each statement can involve several array or scalar references, at most one of these being on the left-hand side. A pair (o, r) of an operation and a reference in the statement is called an access. The set of all accesses is denoted by A, built of R, the set of all reads (i.e. accesses performing some read in memory), and W, the set of all writes.
Imperative programs are seen as pairs $(\prec, f_e)$, where $\prec$ is the sequential order over all operations and $f_e$ maps every access to the memory location it either reads or writes. Function $f_e$ is the storage mapping of the program.
The basis of our parallelization scheme is instance-wise reaching definition analysis: each read access to a memory location is mapped to the last write access to the same memory location. To stress the point that we deal with operations (i.e. run-time instances of statements), we talk about sources instead of definitions. In our sense, reaching definition analysis computes a subset of the program dependences (associated with Bernstein's conditions). In practice, the source relation computed by IRDA is a pessimistic (a.k.a. conservative) approximation: a given access may have several "possible sources".
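For a single, fully known execution, the exact source of each read can be obtained by scanning accesses in sequential order and remembering the last write to each location. The sketch below illustrates that definition only; it is not an IRDA, which must approximate this relation at compile time for all executions. The function name `sources`, the trace encoding, and `demo_trace` are assumptions made for the example.

```python
def sources(trace):
    """Map each read access to the last write to the same memory location.

    `trace` lists accesses in sequential execution order as tuples
    (operation, kind, location), with kind "w" for a write and "r" for a read.
    Within one operation, reads must be listed before the write.
    Reads of never-written locations map to None.
    """
    last_write = {}                  # location -> most recent writing operation
    result = {}
    for op, kind, loc in trace:
        if kind == "r":
            result[(op, loc)] = last_write.get(loc)
        else:
            last_write[loc] = op
    return result


# One outer iteration of the motivating kernel with two inner iterations.
demo_trace = [
    (("T", 1), "w", "x"),
    (("S", 1, 1), "r", "x"), (("S", 1, 1), "w", "x"),
    (("S", 1, 2), "r", "x"), (("S", 1, 2), "w", "x"),
    (("R", 1), "r", "x"),
]
```

On `demo_trace`, the read in R is mapped to the second instance of S; with a different trip count it would be mapped to another instance, which is why a compile-time analysis can only report a set of possible sources.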
As a compromise between expressivity and computability, and because our preferred IRDA is FADA [1], we choose affine relations as an abstraction, processed using tools like Omega [13] and PIP [8].
Given a parallel execution order $\prec'$, we have to characterize correct expansions allowing parallel execution to preserve the program semantics. We need to handle "absence of conflict" equations of the form $f_e(v) \neq f_e(w)$, which are undecidable since the subscript function $f_e$ may be very complicated. Therefore, we suppose that a pessimistic approximation $\not\updownarrow$ is made available by a previous stage of program analysis (probably as a side effect of IRDA): $f_e(v) \neq f_e(w) \Rightarrow v \not\updownarrow w$.
Theorem 2 (Correctness of storage mappings). If condition (2) holds, then the expansion is correct, i.e. it allows parallel execution to preserve the program semantics.
In both cases, and for any polyhedral representation, computing $\prec'$ yields an affine relation, compatible with the expansion correctness criterion.
Computing Parallel Execution Orders
Finally, the data-flow order defined by relation $\sigma$ is supposed (from Theorem 1) to be a sub-order of every other parallel execution order. Plugging it into (2) describes schedule-independent storage mappings, compatible with any parallel execution. This generalizes the technique by Strout et al. [15] to any nest of loops. Schedule-independent storage mappings have the same "portability" as SA with a much more economical memory usage. Of course, tuning expansion to a given parallel execution order generally yields more economical mappings.
An Algorithm for Storage Mapping Optimization
Finding the minimal amount of memory to store the values produced by the program is a graph coloring problem where vertices are operations and edges represent interferences between operations: there is an edge between v and w iff they cannot share the same memory location, i.e. when the left-hand side of (2) holds. Since classic coloring algorithms only apply to finite graphs, Feautrier and Lefebvre designed a new algorithm [11], which we extend to general loop nests.
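On a finite set of operations, the coloring view can be illustrated with a plain greedy heuristic. This is a sketch of the finite case only, not the algorithm of [11], which is designed for the parametric, unbounded case, and greedy coloring does not guarantee a minimal number of colors. The predicate `interferes` is a hypothetical stand-in for the left-hand side of (2).

```python
def greedy_coloring(vertices, interferes):
    """Give each vertex the smallest color not used by an interfering, colored vertex."""
    color = {}
    for v in vertices:
        taken = {color[w] for w in color if interferes(v, w) or interferes(w, v)}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color


# Writes of the motivating kernel for n = 3 and two inner iterations.
writes = [(s, i, j) for i in (1, 2, 3) for s, j in (("T", 0), ("S", 1), ("S", 2))]

# Assumed interference: writes of different outer iterations must not share a
# cell (the outer loop is parallel); writes of the same iteration may.
same_cell_forbidden = lambda v, w: v[1] != w[1]
```

With these assumptions, `greedy_coloring(writes, same_cell_forbidden)` uses one color per outer iteration, which mirrors the storage requirement n obtained by hand for the motivating example.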
Partial Expansion Algorithm
The input is the sequential program, the result of an IRDA, and a parallel execution order (not used for simple SA form conversion). The algorithm leaves the control structures of the program unchanged but thoroughly reconstitutes its data. Let us define Stmt($\langle S, x \rangle$) = S and Index($\langle S, x \rangle$) = x.
1. For each statement S whose iteration vector is x: build an expansion vector $E_S$ which gives the shape of a new data structure $D_S$ (see Section 4.2 for details). Then, the left-hand side (lhs) of S becomes $D_S[x \bmod E_S]$.
2. Considering $\sigma$ as a function from accesses to sets of operations (as in Section 2), it can be expressed as a nested conditional. For each statement S and iteration vector x, replace each read reference r in the right-hand side (rhs) with Convert(r), where:
   If $\sigma(\langle S, x \rangle, r) = \emptyset$, then Convert(r) = r (the initial reference expression).
   If $\sigma(\langle S, x \rangle, r) = \{\langle S', y \rangle\}$, then Convert(r) = $D_{S'}[y \bmod E_{S'}]$, the cell written by that unique source according to step 1.
   If $\sigma(\langle S, x \rangle, r)$ is neither empty nor a singleton, then Convert(r) = $\phi(\sigma(\langle S, x \rangle, r))$. There is a general method to compute $\phi$ at run time, but we prefer pragmatic techniques, such as the one presented in [3] or another algorithm proposed in [4].
   If $\sigma(\langle S, x \rangle, r)$ = if p then r1 else r2, then Convert(r) = if p then Convert(r1) else Convert(r2).
3. Apply partial renaming to coalesce data structures, using any classical graph coloring heuristic (see [11]).
This algorithm outputs an expanded program whose data are adapted to the partial execution order $\prec'$. We are assured that with these new data, the original program semantics will be preserved in the parallel version.
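The rewriting of read references in step 2 is a structural recursion over the nested conditional that describes the sources. The following minimal sketch works on strings and is not the authors' implementation: the encoding of a source description (a set of `(statement, index expression)` pairs, or a tuple `("if", predicate, then, else)`), the `D_S[... mod E_S]` output syntax, and the function name `convert` are all assumptions. The singleton case, where the read is redirected to the cell written by its unique source, is inferred here from the lhs rewriting of step 1.

```python
def convert(src, ref, expansion):
    """Rewrite read reference `ref` according to its source description `src`.

    src is either a set of (statement, index_expr) pairs or a conditional
    ("if", predicate, src_then, src_else); expansion maps a statement name
    to the name of its expansion vector. All expressions are plain strings.
    """
    if isinstance(src, tuple) and src and src[0] == "if":
        _, pred, then_src, else_src = src
        return (f"({convert(then_src, ref, expansion)} if {pred} "
                f"else {convert(else_src, ref, expansion)})")
    if not src:                                # no source: keep the initial reference
        return ref
    cells = [f"D_{stmt}[({idx}) mod {expansion[stmt]}]" for stmt, idx in sorted(src)]
    if len(cells) == 1:                        # unique source: read the cell it wrote
        return cells[0]
    return "phi(" + ", ".join(cells) + ")"     # several possible sources: run-time merge


# Source description of the read of x in S, as given for the motivating example.
sigma_S = ("if", "j == 1", {("T", "i")}, {("S", "i, j-1")})
```

Calling `convert(sigma_S, "x", {"T": "E_T", "S": "E_S"})` yields a conditional expression that reads the cell written by T in the first inner iteration and the cell written by the previous instance of S otherwise.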
Building an Expansion Vector
For each statement S, the expansion vector must ensure that expansion is systematically done when the lhs of (2) holds.
Computing this for each dimension of $E_S$ ensures that $D_S$ has a sufficient size for the expansion to preserve the sequential program semantics.
Summary of the Expansion Process
Since we consider unrestricted loop nests, some approximations (see footnote 3) are performed to stick with affine relations (automatically processed by PIP or Omega). The most general application of our technique starts with IRDA, then applies a parallelization algorithm using $\sigma$ as the dependence graph (thus avoiding constraints due to spurious memory-based dependences), describes the result as a partial order $\prec'$, and finally applies the partial expansion algorithm. This technique yields the best results, but involves an external parallelization technique, such as scheduling or tiling. It is well suited to parallelizing compilers.
Footnote 3: Source function $\sigma$ is a pessimistic approximation, as well as $\not\updownarrow$.
If one looks for a schedule-independent storage mapping, the second technique sets the partial order $\prec'$ according to $\sigma$, the data-flow execution order. This is useful whenever no parallel execution scheme is enforced: the "portability" of SA form is preserved, at a much lower cost in memory usage.
Experimental Results
Partial expansion has been implemented for Cray-Fortran affine loop nests [11]. Semi-automatic storage mapping optimization has also been performed on general loop nests, using FADA, Omega, and PIP.
The result for the motivating example is that the storage mapping computed from a scheduled or tiled version is the same as the schedule-independent one (computed from the data-flow execution order). The resulting program is the same as the hand-crafted one in Figure 1.
A few experiments have been made on an SGI Origin 2000, using the mp library (but not PCA, the built-in automatic parallelizer). As one would expect, results for the convolution program are excellent even for small values of n. The interested reader may find more results on the following web page:
http://www.prism.uvsq.fr/~acohen/smo/smo.html.
Conclusion and Perspectives
Expanding data structures is a classical optimization to cut memory-based dependences. The first problem is to ensure that all reads refer to the correct memory location in the generated code. When control and data flow cannot be known at compile time, run-time computations have to be done to find the identity of the correct memory location. The second problem is that converting programs to single-assignment form is too costly in terms of memory usage.
We have tackled both problems here, proposing a general method for partial memory expansion based on instance-wise reaching definition information, a robust run-time data-flow restoration scheme, and a versatile storage mapping optimization algorithm. Our techniques are either novel or generalize previous work to unrestricted nests of loops.
Future work is twofold. First, improve optimization of the generated code and study, both theoretically and experimentally, the effect of $\phi$-functions on parallel code performance. Second, study how comprehensive parallelization techniques can be plugged into this framework: reducing memory usage is a good thing, but choosing the right parallel execution order is another one.
