In this paper we briefly survey some loop transformation techniques which break anti or output dependences, or artificial cycles involving such "false" dependences. These false dependences are removed through the introduction of temporary buffer arrays. Next we show how to plug these techniques into loop parallelization algorithms (such as Allen and Kennedy's algorithm). The goal is to extract as many parallel loops as the intrinsic parallelism of the nest allows, while avoiding a full memory expansion. We aim at reducing the number of temporary arrays that we introduce, as well as their dimension.
Introduction
Flow (or value-based) dependences are the only "true" dependences of a program. Anti dependences and output dependences are due to storage reuse and can be eliminated at the price of more memory usage. Removing anti and output dependences proves very useful to break data dependence cycles, thereby enabling vectorization and/or improving parallelization. However, removing all memory-based or "false" (i.e. anti and output) dependences may have a prohibitive cost. A complete removal of false dependences is usually achieved, if feasible, via conversion of the original loop nest into single assignment form. This turns out to be unnecessarily costly. Indeed, there are some memory-based dependences whose removal will not improve the parallelization. Rather, we should introduce as much memory overhead as needed to expose all the parallelism of the original program: as much as, but no more than, needed.
The aim of this paper is to show how to plug false dependence removal techniques into loop parallelization algorithms. The idea is to characterize those false dependences that do decrease the amount of parallelism, and to remove only these dependences. This will lead to memory savings without sacrificing performance.
Section 2 is devoted to a brief survey of techniques aimed at removing anti and output dependences. In Section 3 we work out an example to illustrate the key ideas of our "plugging" scheme. We summarize the general steps of our method in Section 4. Finally, we give some conclusions in Section 5.
False dependence removal techniques
Many papers have been devoted to the problem of eliminating anti and output dependences. Proposed methods include "array dataflow analysis" [6, 9], "variable expansion", "variable renaming" and "node splitting" [3]. See the survey paper of Bacon, Graham and Sharp [2], as well as the book of Wolfe [11], for further references. Note that "array privatization" [7] is yet another technique that can be applied, but it comes later, when moving from virtual to physical processors.
Renaming. Scalar (or array) renaming gives different names to different occurrences of a variable in a program. This allows the removal of anti and output dependences due to the multiple uses of the variable.
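As a minimal sketch of renaming, consider a hypothetical straight-line fragment in which a scalar t is reused for two unrelated values. The fragment, the statement labels and the names t1 and t2 are illustrative assumptions, not taken from the paper.

```python
def fragment_original(a, b):
    # The scalar t is written twice. S3 has an output dependence on S1
    # (both write t) and an anti dependence on S2 (S2 reads t before S3
    # overwrites it).
    t = a + 1        # S1
    x = t * 2        # S2
    t = b - 1        # S3
    y = t * 3        # S4
    return x, y


def fragment_renamed(a, b):
    # Each occurrence of t gets its own name. Only the flow dependences
    # S1 -> S2 and S3 -> S4 remain, so {S1, S2} and {S3, S4} are independent.
    t1 = a + 1       # S1
    x = t1 * 2       # S2
    t2 = b - 1       # S3
    y = t2 * 3       # S4
    return x, y
```

Both versions compute the same values; only the storage used for the intermediate results differs.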
Expansion. If a scalar variable is written at several iterations of a loop nest, there are output dependences from and to the statement that performs these writes. To suppress these dependences, the scalar is expanded into an array. This technique extends to multi-dimensional arrays that can be considered as scalars once some loop indices are fixed.
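Scalar expansion can be sketched as follows, assuming a hypothetical one-dimensional loop; the loop body and the names t_exp and order are illustrative assumptions. The expanded version accepts an arbitrary iteration order to show that the loop no longer carries any dependence.

```python
def original_loop(a, b, n):
    c = [0] * n
    for i in range(n):
        t = a[i] + b[i]     # S1: the scalar t is rewritten at every iteration
        c[i] = t * t        # S2: reads the value just written by S1
    return c


def expanded_loop(a, b, n, order=None):
    # t is expanded into the array t_exp, indexed by the loop counter.
    # The loop-carried output and anti dependences on t disappear, so the
    # iterations may be executed in any order.
    order = range(n) if order is None else order
    t_exp = [0] * n
    c = [0] * n
    for i in order:
        t_exp[i] = a[i] + b[i]        # S1
        c[i] = t_exp[i] * t_exp[i]    # S2
    return c
```

The price is n memory cells for t_exp instead of one for t.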
Node splitting. This technique splits a statement into two statements, as indicated in Figure 1, in order to break cycles in the dependence graph. The value computed at each iteration of statement S is stored into a temporary array whose access function is the same as that of "lhs", the left-hand side of S. Obviously, if another statement instance depends upon a value "lhs(g(i))" computed by S, then the access to "lhs(g(i))" must be replaced by "temp(g(i))". This requires knowledge of the dataflow graph.
Conversion to single assignment form (SAF) relies on an exact analysis of the flow dependences. The result of each computation is stored in a dedicated temporary cell, and the right-hand side is modified to reference either a temporary cell for a value computed in the code, or the original scalar or array for a value kept unchanged. All new arrays have as many dimensions as there are loops surrounding their definition. This transformation has two main weaknesses: the resulting code is in general very complicated, including many "if" tests in the innermost loops; and it requires a very large amount of memory. Some attempts have been made to remedy these two problems: more sophisticated rewriting techniques have been proposed to move "if" tests into the outermost loops if possible, and to minimize the memory usage (through memory folding, i.e. memory reuse) once parallelism has been detected. Chamski looked at the latter problem in [4]. His technique has three main steps: 1) transform the code into SAF, through full memory expansion; 2) parallelize the code; 3) reduce memory size by analyzing the lifetime of each cell in the parallelized code.
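Returning to node splitting, the rule stated above (store the result in a temporary with the same access function as the left-hand side, and redirect flow-dependent reads to it) can be sketched on a hypothetical loop with a self anti dependence. The loop body, the statement labels S, S', S'' and T, and the array names are illustrative assumptions.

```python
def loop_original(a, b, n):
    a = list(a)                    # a holds n + 1 elements
    c = [0] * n
    for i in range(n):             # must run in increasing order of i
        a[i] = a[i + 1] + b[i]     # S: reads a[i+1] before iteration i+1 overwrites it
        c[i] = a[i] * 2            # T: flow dependence on the value computed by S
    return a, c


def loop_split(a, b, n, order=None):
    a = list(a)
    order = list(range(n)) if order is None else list(order)
    temp = [0] * n                 # same access function as the left-hand side a[i]
    c = [0] * n
    for i in order:                # S': only reads the original values of a
        temp[i] = a[i + 1] + b[i]
    for i in order:                # T: the read of a[i] is redirected to temp[i]
        c[i] = temp[i] * 2
    for i in order:                # S'': copies the result back into a
        a[i] = temp[i]
    return a, c
```

After splitting, the self-loop on S is replaced by an anti dependence from S' to S'' with no edge back, so the three distributed loops can each run in any order.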
In this paper, we explore the opposite approach to that of [4]: first determine the anti and output dependences that are responsible for a loss of parallelism, remove them through memory expansion, and then parallelize. We believe that this approach is more flexible and powerful, and that it enables various parallelization strategies. We point out that a similar approach is currently being developed in [8], but with a more restricted methodology, since the false dependence removal is done with respect to a given schedule.
Motivating example
We briefly review Allen and Kennedy's algorithm (AK) in Section 3.1. Next we present a simple example, to which we apply three parallelization schemes. First, in Section 3.2, we apply AK directly to the example. Then, in Section 3.3, we apply AK to the single assignment form of the example. Finally, in Section 3.4, we integrate the false dependence removal techniques into the parallelization process.
Allen and Kennedy's parallelization algorithm
We summarize Allen and Kennedy's algorithm (AK) because we use this parallelization algorithm throughout the paper. More details on this algorithm can be found in [1]. AK works on a structure called the reduced leveled dependence graph (RLDG), i.e. a description of the level of dependences. For a loop-carried dependence, the level of dependence is the rank of the first non-null component of the distance vectors. This is also the depth of the outermost loop which carries this dependence. For a loop-independent dependence, the level of dependence is said to be infinite and is denoted $\infty$. If $e$ is an edge of the RLDG, $l(e)$ denotes its level of dependence. Before summarizing the algorithm in its simplest form, we need to recall some simple graph definitions:
A strongly connected component of a directed graph $G$ is a maximal subgraph of $G$ in which, for any vertices $p$ and $q$ ($p \neq q$), there is a path from $p$ to $q$;
The acyclic condensation of a graph $G$ is the acyclic graph whose nodes are the strongly connected components $V_1, \ldots, V_c$ of $G$. There is an edge from $V_i$ to $V_j$ if there is an edge $e = (x_i, y_j)$ in $G$ such that $x_i \in V_i$ and $y_j \in V_j$;
Let $G$ be a reduced leveled dependence graph. Let $H$ be a subgraph of $G$. Then $l(H)$ (the level of $H$) is the minimal level of an edge of $H$: $l(H) = \min\{ l(e) \mid e \in H \}$.
AK($H$, $l$):
2. Otherwise, let $k = l(V_i)$. Generate parallel "For" loops ("ForPar") for levels from $l$ to $k - 1$, and a sequential "For" loop ("ForSeq") for level $k$. Call AK($V_i$, $k + 1$).
Finally, to apply AK to a reduced leveled dependence graph $G$, call AK($G$, 1).
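The recursion can be sketched in a few lines. This is only an illustration under stated assumptions: it computes loop marks and does not generate code; the base case (a component reduced to a single statement with no edge gets parallel loops at all remaining levels) follows the usual presentation of AK rather than the text above; edges with level lower than $l$ are dropped before computing components; and the graphs FLOW and FALSE are hypothetical, not the RLDG of the example in this paper.

```python
INF = float("inf")   # level of a loop-independent dependence


def reachable(src, edges):
    """Set of vertices reachable from src (src included) along edges."""
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for x, y, _ in edges:
            if x == u and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def components(vertices, edges):
    """Strongly connected components by mutual reachability (small graphs only)."""
    reach = {v: reachable(v, edges) for v in vertices}
    comps, done = [], set()
    for v in vertices:
        if v not in done:
            comp = {u for u in reach[v] if v in reach[u]}
            comps.append(comp)
            done |= comp
    return comps


def ak(vertices, edges, l, depth, marks):
    # Keep only the edges of level >= l inside the current subgraph.
    kept = [(x, y, lev) for x, y, lev in edges
            if lev >= l and x in vertices and y in vertices]
    for comp in components(sorted(vertices), kept):
        inner = [lev for x, y, lev in kept if x in comp and y in comp]
        if not inner:
            # Assumed base case: single statement, no edge, all loops parallel.
            for s in comp:
                for lev in range(l, depth + 1):
                    marks[s][lev - 1] = "par"
            continue
        k = min(inner)                       # k = l(V_i)
        if k == INF:
            raise ValueError("cycle of loop-independent dependences")
        for s in comp:
            for lev in range(l, k):          # levels l .. k-1 are parallel
                marks[s][lev - 1] = "par"
            marks[s][k - 1] = "seq"          # level k is sequential
        ak(comp, kept, k + 1, depth, marks)


def parallelize(statements, edges, depth):
    marks = {s: [None] * depth for s in statements}
    ak(set(statements), edges, 1, depth, marks)
    return marks


# Hypothetical graphs: edges are (source, target, level).
FLOW = [("S1", "S2", INF), ("S2", "S1", 1), ("S2", "S1", 2), ("S1", "S3", INF)]
FALSE = [("S3", "S3", 1), ("S3", "S1", 1), ("S1", "S1", 3), ("S2", "S2", 3)]
```

On these hypothetical graphs, `parallelize(["S1", "S2", "S3"], FLOW, 3)` marks the third loop parallel for S1 and S2 and all loops parallel for S3, whereas adding the FALSE edges makes every loop around S1 and S2 sequential and also sequentializes the outermost loop around S3.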
Direct parallelization scheme
The direct parallelization scheme consists in the application of AK on the following loop nest:
[Listing: Motivating example]
The RLDG of the motivating example is drawn in Figure 2. This RLDG contains a single strongly connected component which includes the three statements and dependences at level 1. Thus, the outermost loop (loop i) is marked "sequential" by AK. We now remove level 1 edges: there is still a unique strongly connected component including dependences at level 2. Thus, the second loop (loop j) is marked "sequential". We now remove level 2 edges: there are two strongly connected components, each including dependences at level 3. Thus, the third loop (loop k) is marked "sequential" for both components, and hence for all statements. AK therefore finds no parallelism in this example when anti and output dependences are taken into account, hence the need to remove at least some of the memory-based dependences in order to expose parallelism.
Parallelization of the single assignment form
Another approach could be first to transform the loop nest into single assignment form (thereby removing all memory-based dependences), and then to apply the parallelization algorithm.
One can easily see that AK will mark the two outermost loops (loops i and j) "sequential" for statements $S_1$ and $S_2$. All other loops will be found "parallel". This is expressed in the parallelized form written below (strictly speaking, we should also copy back atemp(i, j, k) into a(i, j, k), ctemp(i, j, k) into c(i, j, k), and btemp(N, N, N) into b).
[Listing: Parallelization of the single assignment form]
This scheme introduces three 3-dimensional temporary arrays, each of which has the size of the iteration domain. We show in the next section that a clever integration of the false dependence removal techniques into the scheduling process enables maximum parallelism to be found while introducing less memory overhead.
Plugging false dependence removal techniques into the parallelization
The dataflow graph (the RLDG where only flow dependence edges are kept, see Figure 4(a)) tells us exactly what amount of parallelism can be found in the program. Our aim is to find as much parallelism in the whole RLDG, while introducing as little memory overhead as possible.
First loop
Because of the flow dependences, the outermost loop (loop i) must be sequential for statements $S_1$ and $S_2$. However, this loop could be made parallel for $S_3$. In order to expose this parallelism, $S_3$ should no longer belong to a strongly connected component including some level 1 dependence. Thus only two false dependence edges, called incompatible edges, must be removed: 1) the anti dependence from $S_1$ to $S_3$, which strongly connects $S_3$ with a strongly connected component that contains an edge at level 1; 2) the self anti dependence on $S_3$ at level 1. These two edges can be removed by splitting the node $S_3$. This introduces only a single new three-dimensional array. The new RLDG is depicted in Figure 4(c). We can now apply the first step of Allen and Kennedy's algorithm and we get:
Second loop
We now consider the second step of AK for the loops surrounding $S_1$ and $S_2$ (since the rest of the code is already fully parallelized). The remaining RLDG at level 2 is depicted in Figure 5. Because of the flow dependences at level 2, the second loop (loop j) must be sequential for statements $S_1$ and $S_2$. Thus no false dependence is removed and the second loop is marked sequential.
Third loop
Considering the dataflow graph of Figure 6(a), we see that the innermost loop could be marked parallel because there is no cycle at level 3. In order to expose this parallelism, $S_1$ and $S_2$ should no longer belong to a strongly connected component including some level 3 dependence. Therefore three false dependence edges must be removed (see Figure 6(b)): the self output dependence on $S_2$, the anti dependence from $S_1$ to $S_2$, and the self anti dependence on $S_1$. The first two dependences can be removed by expanding the scalar b. The third dependence can be removed by splitting the node $S_1$. However, instead of introducing two 3-dimensional arrays to suppress these dependences, we introduce only two 1-dimensional arrays "atemp" and "btemp": this is because the two outermost loops are already sequential. Indeed, we only need to remove incompatible dependences for each iteration of the outermost loops, as if the indices of the two outermost loops were fixed. We first get the code of Figure 7(a), whose RLDG at level 3 is depicted in Figure 7(c). Finally, applying the last step of AK leads to the code of Figure 7(b).
Plugging false dependence removal techniques into parallelization algorithms
In Section 3, we showed that integrating the false dependence removal techniques into the parallelization scheme makes it possible to find all the parallelism while introducing less memory overhead. In our example, one 3D array and two 1D arrays were introduced instead of three 3D arrays as needed by the parallelization of the single assignment form. Note that if external constraints had dictated that introducing a 3D array were too costly, we would have been able to cope with these constraints. We would have exposed less parallelism, but that is the price to pay! We now summarize our methodology.
Figure 7: Third level: (a) after dependence removal; (b) after parallelization
Integration scheme
Instead of removing all possible false dependences first and then applying a parallelization algorithm, we propose to combine both techniques. Suppose that we use a parallelization algorithm that has the following properties: each statement $S$ surrounded by $n_S$ loops in the original code is surrounded by $n_S$ loops in the parallelized code; the choice of the loops surrounding $S$ is made from the outermost to the innermost. We propose to plug false dependence removal techniques into loop parallelization algorithms as follows (where $G$ is a given dependence graph):
Parallelization($G$)
Determine the false dependences that could be removed by standard dependence removal techniques, as presented in Section 2. Let $F$ be the dependence graph obtained if these dependences were removed.
$F$ is the dependence graph that exhibits as much parallelism as can be exposed. Apply the parallelization algorithm to $F$. Let $PP$ be the parallelized program obtained. Each statement $S$ is surrounded in $PP$ by a sequence of $n_S$ loops, marked either sequential or parallel.
For $d = 1$ to $\max(n_S)$ do
1. Mark as incompatible all dependences in $G$ but not in $F$ (i.e. false dependences that can be removed) which, if not removed at depth $d$, will induce a loss of parallelism if $PP$ is taken as reference. In other words, incompatible dependences are those that prohibit the generation of the same number of parallel loops as in $PP$ within the $n_S - d + 1$ remaining loops surrounding $S$.
2. Remove all the incompatible dependences by introducing temporary arrays of dimension as small as possible.
3. Generate the loop.
Comments
We do not formalize the scheme further: the integration scheme above is simply a sketch of our methodology. Many problems remain to be solved: how to characterize incompatible edges for an arbitrary parallelization algorithm, how to minimize the number of incompatible dependences that should be removed, how to minimize the dimension of the temporary arrays that have to be introduced, and so on. However, we have given several examples in Section 3 which should make these problems clearer.
On the dimension of temporary arrays. In theory, we can hope to introduce temporary arrays with as many dimensions as nested loops minus the number of sequential loops already generated. See for example how node splitting has been performed in Figure 6: we introduced only a 1D array and not a 3D array for splitting $S_1$. This adds a new output dependence at levels 1 and 2 for $S'_1$, but it has no effect in terms of parallelization since the two outermost loops are already sequential.
In practice, however, the dimension of the temporary arrays may be different. On the one hand, we may need extra memory if the false dependence removal technique is not powerful enough to enable code generation: this is the case, for example, if the access functions are too complicated. On the other hand, memory overhead can be reduced by the use of scalar/array privatization, instead of expansion, along the parallel dimensions.
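The dimension argument can be illustrated on a hypothetical three-deep nest, which is not the example of Section 3: flow dependences at levels 1 and 2 force the two outer loops to be sequential, and a self anti dependence at level 3 is removed by node splitting with a one-dimensional temporary. The loop body and the names nest_original, nest_split_1d and temp are assumptions made for this sketch.

```python
def copy3(a):
    """Deep copy of a 3-dimensional nested list."""
    return [[row[:] for row in plane] for plane in a]


def nest_original(a, n):
    a = copy3(a)                        # shape n x n x (n + 1)
    for i in range(1, n):
        for j in range(1, n):
            for k in range(n):          # sequential: reads a[i][j][k+1] before it is overwritten
                a[i][j][k] = a[i - 1][j][k] + a[i][j - 1][k] + a[i][j][k + 1]
    return a


def nest_split_1d(a, n):
    a = copy3(a)
    temp = [0] * n                      # 1D temporary, reused for every (i, j)
    for i in range(1, n):               # sequential (flow dependence at level 1)
        for j in range(1, n):           # sequential (flow dependence at level 2)
            for k in reversed(range(n)):    # parallel: any order gives the same result
                temp[k] = a[i - 1][j][k] + a[i][j - 1][k] + a[i][j][k + 1]
            for k in reversed(range(n)):    # parallel: copy back
                a[i][j][k] = temp[k]
    return a
```

Reusing temp across the iterations of i and j creates output dependences at levels 1 and 2, which cost nothing here because those loops are sequential anyway; the temporary needs n cells rather than one cell per point of the iteration domain.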
Incompatible edges. Plugging dependence removal techniques into Allen and Kennedy's algorithm is straightforward, because it uses only loop distribution/fusion, which corresponds in terms of graphs to the detection of strongly connected components in the RLDG. We have seen in Section 3 that this permits the identification of incompatible edges. They are precisely defined as follows:
For an RLDG $H$, we denote by $H_l$ the subgraph of $H$ obtained by deleting all edges with level lower than $l$. Then, incompatible edges at level $l$ are the edges which belong to (at least) one cycle $C$ such that: 1) $C$ contains an edge of level $l$; 2) $C$ contains a vertex that, in $F_l$, only belongs to cycles of level strictly greater than $l$.
The characterization of incompatible edges for Darte and Vivien's algorithm is much more complicated. We refer to [5] for a complete description of the algorithm. We just give here the flavor of this algorithm.
Darte and Vivien's algorithm takes as input a reduced dependence graph $G$ whose edges are labeled by dependence polyhedra. First, the reduced dependence graph $G$ is uniformized into a graph $G_u$, which contains the nodes of $G$ and some new nodes, called virtual nodes. Then $G_u$ is processed by the parallelization algorithm as the reduced dependence graph of a system of uniform recurrence equations, except that one has to take into account the fact that some nodes are virtual. The algorithm is recursive and relies on the construction of a particular subgraph of $G_u$, the subgraph of null weight multi-cycles, denoted by $G'$. Incompatible edges are the edges that change the structure of $G'$, more precisely those that change the structure of the vector space generated by the weights of the cycles of $G'$ (or of one of the $G'$ that will be recursively defined). We have not yet studied the complexity of this complete characterization.
Conclusion
We have presented a framework to plug false dependence removal techniques into loop parallelization algorithms. Our approach has two main advantages. First, we remove only false dependence edges that are responsible for a lesser degree of parallelism, i.e. responsible for the sequentialization of an extra loop. Second, for each dependence removal (hence for each new temporary array) we use our knowledge of the already generated loops to (try to) minimize the number of dimensions of this temporary array. Furthermore, our algorithmic sketch can be controlled by external parameters. For instance, we can require that each statement is surrounded by the same number of sequential loops. More importantly, we can straightforwardly cope with external constraints on the total memory overhead that is allowed. For instance, if only 2D arrays may be introduced, we will generate only 1D or 2D temporaries, at the price of some loss in parallelism. Further work should be devoted to a better characterization of "incompatible" edges, and to a precise estimation of the extra memory that is needed to expose all the potential parallelism.
