Abstract. The accurate modeling of the electronic structure of atoms and molecules involves computationally intensive tensor contractions involving large multidimensional arrays. The efficient computation of complex tensor contractions usually requires the generation of temporary intermediate arrays. These intermediates could be extremely large, but they can often be generated and used in batches through appropriate loop fusion transformations. To optimize the performance of such computations on parallel computers, the total amount of interprocessor communication must be minimized, subject to the available memory on each processor. In this paper, we address the memory-constrained communication minimization problem in the context of this class of computations. Based on a framework that models the relationship between loop fusion and memory usage, we develop an approach to identify the best combination of loop fusion and data partitioning that minimizes inter-processor communication cost without exceeding the per-processor memory limit. The effectiveness of the developed optimization approach is demonstrated on a computation representative of a component used in quantum chemistry suites.
Introduction
The development of high-performance parallel programs for scientific applications is usually very time consuming. The time to develop an efficient parallel program for a computational model can be a primary limiting factor in the rate of progress of the science. Our overall goal is to develop a program synthesis system to facilitate the rapid development of high-performance parallel programs for a class of scientific computations encountered in quantum chemistry. The domain of our focus is electronic structure calculations, as exemplified by coupled cluster methods [4] , in which many computationally intensive components are expressible as a set of tensor contractions. We are developing a synthesis system that will transform an input specification expressed in a high-level notation into efficient parallel code tailored to the characteristics of the target architecture.
A number of compile-time optimizations are being incorporated into the program synthesis system. These include algebraic transformations to minimize the number of arithmetic operations [8, 13] , loop fusion and array contraction for memory space minimization [13, 12] , tiling and data locality optimization [1, 2] , space-time trade-off optimization [3] , and data partitioning for communication minimization [9, 10] . Since the problem of determining the set of algebraic transformations to minimize operation count was found to be NP-complete, we developed a pruning search procedure [8] that is very efficient in practice. The operation-minimization procedure results in the creation of intermediate temporary arrays. Often, these intermediate arrays that help in reducing the computational cost create a problem with the memory required. Loop fusion was found to be effective in significantly reducing the total memory requirement. However, since some fusions could prevent other fusions, the choice of the optimal set of fusion transformations is important. So we addressed the problem of finding the choice of fusions for a given operator tree that minimizes the space required for all intermediate arrays after fusion [12, 11] .
We have also previously addressed the problem of communication optimization in the context of the operator trees [9, 10] . An efficient polynomial-time dynamic programming algorithm was developed for the determination of optimal distributions of the various arrays through the evaluation of the operator tree so as to minimize interprocessor communication overhead. However, that model did not consider the effects of loop fusion for memory minimization. As we elaborate later with examples, it is not feasible to simply apply the previously developed loop fusion algorithm and the previous communication minimization algorithm (in either order) to optimize for the parallel context when memory size constraints are severe. For many computations of interest to quantum chemists, the unoptimized form of the computation could require in excess of hundreds of terabytes of memory. Therefore, the following optimization problem is of great interest: given a set of computations expressed as a sequence of tensor contractions (explained later on), an empirically derived measure of the communication cost for a given target computer, and a specified limit on the amount of available memory on each processor, re-structure the computation so as to minimize the total execution time while staying within the available memory. In this paper, we present a framework that we have developed to address this problem. The memory-constrained communication minimization algorithm we develop here will be incorporated into the synthesis system being developed.
The computational structures that we target arise in scientific application domains that are extremely compute-intensive and consume significant computer resources at national supercomputer centers. They are present in various computational chemistry codes such as ACES II, GAMESS, Gaussian, NWChem, PSI, and MOLPRO. In particular, they comprise the bulk of the computation with the coupled cluster approach to the accurate description of the electronic structure of atoms and molecules [14, 15] . Computational approaches to modeling the structure and interactions of molecules, the electronic and optical properties of molecules, the heats and rates of chemical reactions, etc., are very important to the understanding of chemical processes in real-world systems.
There has been some recent work on using loop fusion for memory reduction for sequential execution. Fraboulet et al. [5] use loop alignment to reduce memory requirement between adjacent loops by formulating the one-dimensional version of the problem as a network flow problem; they did look at the effect of their solution on cache behavior or communication. Song et al. [17, 18] present a different network flow formulation of the memory reduction problem and they include a simple model of cache misses as well. They do not consider trading off memory for recomputation or the impact of data distribution on communication costs while meeting per-processor memory constraints in a distributed memory machine. There has been much less work investigating the use of loop fusion as a means of reducing memory requirements [6, 16] . To the best of our knowledge, loop fusion transformation for memory reduction, in combination with data partitioning for communication minimization in the parallel context, has not been previously considered.
The paper is organized as follows. In the next section, we elaborate on the computational context of interest and the pertinent optimization issues. Section 3 presents our multi-dimensional processor model, discusses the interaction between distribution of arrays and loop fusion, and describes our algorithm for the memory-constrained communication minimization problem. Section 4 presents results from the application of the new algorithm to an example abstracted from NWChem [7] . Conclusions are provided in Section 5.
Elaboration of Problem
In the class of computations considered, the final result to be computed can be expressed as multi-dimensional summations of the product of several input arrays. Due to commutativity, associativity, and distributivity, there are many different ways to obtain the same final result and they could differ widely in the number of floating point operations required. Consider the following example:
If implemented directly as expressed above, the computation would require 2N i N j N k N t arithmetic operations to compute. However, assuming associative reordering of the operations and use of distributive law of multiplication over addition is acceptable for the floating-point computations, the above computation can be rewritten in various ways. One equivalent form that only requires N i N j N t +N j N k N t +2N j N t operations is as shown in Fig. 1(a) .
Generalizing from the above example, we can express multi-dimensional integrals of products of several input arrays as a sequence of formulae. Each formula produces some intermediate array and the last formula gives the final result. A formula is either: summation formula. Such a sequence of formulae fully specifies the multiplications and additions to be performed in computing the final result.
A sequence of formulae can be represented graphically as a binary tree to show the hierarchical structure of the computation more clearly. In the binary tree, the leaves are the input arrays and each internal node corresponds to a formula, with the last formula at the root. An internal node may either be a multiplication node or a summation node. A multiplication node corresponds to a multiplication formula and has two children which are the terms being multiplied together. A summation node corresponds to a summation formula and has only one child, representing the term on which summation is performed. As an example, the binary tree in Fig. 1(b) represents the formula sequence shown in Fig. 1(a) .
The operation-minimization procedure discussed above usually results in the creation of intermediate temporary arrays. Sometimes these intermediate arrays that help in reducing the computational cost create a problem with the memory capacity required. For example, consider the following expression:
If this expression is directly translated to code (with ten nested loops, for indices a − l), the total number of arithmetic operations required will be 4N 10 if the range of each index a−l is N. Instead, the same expression can be rewritten by use of associative and distributive laws as the following:
This corresponds to the formula sequence shown in Fig. 2 (a) and can be directly translated into code as shown in Fig. 2(b) . This form only requires 6N 6 operations. However, additional space is required to store temporary arrays T 1 and T 2. Often, the space requirements for the temporary arrays poses a serious problem. For this example, abstracted from a quantum chemistry model, the array extents along indices a − d are the largest, while the extents along indices i − l are the smallest. Therefore, the size of temporary array T 1 would dominate the total memory requirement.
We have previously shown that the problem of determining the operator tree with minimal operation count is NP-complete, and have developed a pruning search procedure [8, 9] that is very efficient in practice. For the above example, although the latter form is far more economical in terms of the number of arithmetic operations, its implementation will require the use of temporary intermediate arrays to hold the partial results of the parenthesized array subexpressions. Sometimes, the sizes of intermediate arrays needed for the "operation-minimal" form are too large to even fit on disk.
A systematic way to explore ways of reducing the memory requirement for the computation is to view it in terms of potential loop fusions. Loop fusion merges loop nests with common outer loops into larger imperfectly nested loops. When one loop nest produces an intermediate array which is consumed by another loop nest, fusing the two loop nests allows the dimension corresponding to the fused loop to be eliminated in the array. This results in a smaller intermediate array and thus reduces the memory requirements. For the example considered, the application of fusion is illustrated in Fig. 2(c) . By use of loop fusion, for this example it can be seen that T 1 can actually be reduced to a scalar and T 2 to a 2-dimensional array, without changing the number of arithmetic operations.
For a computation comprised of a number of nested loops, there will generally be a number of fusion choices, that are not all mutually compatible. This is because different fusion choices could require different loops to be made the outermost. In prior work, we have addressed the problem of finding the choice of fusions for a given operator tree that minimizes the total space required for all arrays after fusion [13, 12, 11] .
A data-parallel implementation of the unfused code for computing S abi j would involve a sequence of three steps, each corresponding to one of the loops in Fig. 2(b) . The communication cost incurred will clearly depend on the way the arrays A, B, C, D, T 1, T 2, and S are distributed. We have previously considered the problem of minimization of communication with such computations [13, 9] . However, the issue of memory space requirements was not addressed. In practice, many of the computations of interest in quantum chemistry require impractically large intermediate arrays in the unfused operation-minimal form. Although the collective memory of parallel machines is very large, it is nevertheless insufficient to hold the full intermediate arrays for many computations of interest. Thus, array contraction through loop fusion is essential in the parallel context too. However, it is not satisfactory to first find a communicationminimizing data/computation distribution for the unfused form, and then apply fusion transformations to minimize memory for that parallel form. This is because 1) fusion changes the communication cost, and 2) it may be impossible to find a fused form that fits within available memory, due to constraints imposed by the chosen data distribution on possible fusions. In this paper we address this problem of finding suitable fusion transformations and data/computation partitioning that minimize communication costs, subject to limits on available per-processor memory.
Memory-Constrained Communication Minimization
Given a sequence of formulae, we now address the problem of finding the optimal partitioning of arrays and operations among the processors and the loop fusions on each processor in order to minimize inter-processor communication and computational costs while staying within the available memory in implementing the computation on a message-passing parallel computer. Section 3.1 introduces a multi-dimensional processor model used to represent the computational space. Section 3.2 discusses the combined effects of loop fusions and array/operation partitioning on communication cost, computational cost, and memory usage. An integrated algorithm for solving this problem is presented in Section 3.3.
Preliminaries: A Multi-Dimensional Processor Model
A logical view of the processors as a multi-dimensional grid is used, where each array can be distributed or replicated along one or more of the processor dimensions. As will be clear later on, the logical view of the processor grid does not impose any restriction on the actual physical interconnection topology of the processor system since empirical characterization of the cost of redistribution between different distributions is performed on the target system.
Let p d be the number of processors on the d-th dimension of an n-dimensional processor array, so that the number of processors is p 1 × p 2 × . . . × p n . We use an ntuple to denote the partitioning or distribution of the elements of a data array on an n-dimensional processor array. The d-th position in an n-tuple α, denoted α[d], corresponds to the d-th processor dimension. Each position may be one of the following: an index variable distributed along that processor dimension, a '*' denoting replication of data along that processor dimension, or a '1' denoting that only the first processor along that processor dimension is assigned any data. If an index variable appears as an array subscript but not in the n-tuple, then the corresponding dimension of the array is not distributed. Conversely, if an index variable appears in the n-tuple but not in the array, then the data are replicated along the corresponding processor dimension, which is the same as replacing that index variable with a '*'.
As an example, suppose 128 processors form a 4-dimensional 2 × 2 × 4 × 8 array. For the array B(b, e, f , l) in Fig. 2(a) , the 4-tuple b, e, * , 1 specifies that the first and the second dimensions of B are distributed along the first and second processor dimensions respectively (the third and fourth dimensions of B are not distributed), and that data are replicated along the third processor dimension and are assigned only to processors whose fourth processor dimension equals 1. Thus, a processor whose id is P z 1 ,z 2 ,z 3 ,z 4 will be assigned a portion of B specified by B (myrange(z 1 , N b , p 1 ), myrange(z 2 , N e , p 2 
Interaction Between Array Partitioning and Loop Fusion
The partitioning of data arrays among the processors and the fusions of loops on each processor are inter-related. Although in our context there are no constraints to loop fusion due to data dependences (there are never any fusion preventing dependences), there are constraints and interactions with array distribution: (i) both affect memory usage, by fully collapsing array dimensions (fusion) or by reducing them (distribution), (ii) loop fusion does not change the communication volume, but increases the number of messages, and therefore the start-up communication cost, and (iii) fusion and communications patterns may conflict, resulting in mutual constraints. We discuss these issues next.
(i) Memory usage and array distribution. The memory requirements of the computation depend on both loop fusion and array distribution. Fusing a loop with index t between a node v and its parent eliminates the t-dimension of array v. If the t-loop is not fused but the t-dimension of array v is distributed along the d-th processor dimension, then the range of the t-dimension of array v on each processor is reduced to N t /p d . Let DistSize(v, α, f ) be the size on each processor of array v, which has fusion f with its parent and distribution α. We have
where v.dimens = v.indices − {v.sumindex} is the array dimension indices of v before loop fusions, v.indices is the set of loop indices for v including the summation index v.sumindex if v is a summation node, Set( f ) is the set of fused indices for fusion f , and Instead, the number of messages increases with loop fusion, while the total volume of communication stays the same. Therefore, the communication cost increases, due to higher start-up costs. Consider the computation sequence presented in Fig. 3(a) , where the array C(i, k) is first "produced" from A(i, j) and B( j, k), and then "consumed" to produce E(i, l). For this simple example, we assume that the computation is executed in parallel on 4 processors, with a one-dimensional logical processor view. Figure 3(b) shows the pseudo-code in the absence of fusion: the array C(i, k) is re-distributed from k to l only once. In the presence of fusion, where the i-loop is the outermost loop, the dimensionality of the array C is reduced to C(k), but the redistribution is performed N i times. The pseudo-code in Fig. 3(c) illustrates this effect. be possible, the loop must either be undistributed at both u and v, or be distributed onto the same number of processors at u and at v. Otherwise, the range of the loop at node u would be different from that at node v, preventing fusion of the loops. Let us consider again the computation given in Figure 3 (a), with a different distribution of the array C(i, k) at the two nodes: assume that we have a k distribution at the first node, and a i distribution at the second node. The pseudo-code for this computation on 4 processors is presented in Fig. 4(a) . Fusion of the i-loop is no longer possible, due to the different loop ranges at the two nodes. However, we can overcome this problem by taking a virtualized view of the computation on a larger set of virtual processors, mapped onto the actual physical processors. Consider a virtual partitioning of the computation and split the i-loop into two loops, i and ii. (see the pseudo-code in Fig. 4(b) ). With this modification, the outermost i-loop can be fused, and the size of the array C is reduced from
This transformation of the i-loop is presented graphically in Fig. 5 . At the first node (where it is produced), the array C is distributed among the 4 processors along the k dimension ( k distribution, or vertical partitioning in the Figure) . In addition, each physical processor can be further viewed as 4 "virtual processors", as showed by the horizontal virtual partitioning lines in Fig. 5 . The purpose of the virtual partitioning along the i dimension at the first (produce) node is to match the actual i partitioning at the second (consume) node and allow for fusion of the i-loop. Fusion of the i-loop no longer produces a one-dimensional C(k) array in this case. Each processor stores the equivalent of 4 such arrays, corresponding to the 4 virtual processors. In Fig. 5 , the elements stored on processor P 0 , before and after re-distribution, are represented by shaded areas.
In general, the virtual partitioning of the computation depends on the distribution at the nodes involved. Let u and v be two nodes in the operator tree T that have a common loop index t. The t-loop is distributed onto p u processors at node u and onto p v processors at node v. Let p virtual be lowest common multiple of p u and p v . With these notations, the t-loop can be virtually partitioned by a factor of p virtual /p u at the u node, and by a factor of p virtual /p v at the v node. The resulting virtual partitions along the t dimension at the u and v nodes become identical, allowing for loop fusion.
Virtual partitioning is essential for the success of our combined loop fusion -data distribution approach. Since both fusion and distribution impose constraints on the array dimensions, the potential for conflict is enormous. In practice, unless we allow virtual partitioning, we often find that optimal array distribution for minimizing inter-processor communication precludes effective memory reduction by fusion. The number of compatible loop fusion and array distribution configurations is very limited. Virtual partitioning relaxes the mutual constraints imposed by the loop fusion and data distribution, allowing for the optimal solution(s) to be found.
Memory-Constrained Communication Minimization Algorithm
In this section, we present an algorithm addressing the communication minimization problem with memory constraint. Previously, we have solved the communication minimization problem but without considering loop fusion or memory usage [9] . In practice, the arrays involved are often too large to fit into the available memory even after partitioning among the processors. We assume the input arrays can be distributed initially among the processors in any way at zero cost, as long as they are not replicated. We do not require the final results to be distributed in any particular way. Our approach works regardless of whether any initial or final data distribution is given.
The main idea of this method is to search among all combinations of loop fusions and array distributions to find one that has minimal total communication and computational cost and uses no more than the available memory. A dynamic programming algorithm for this purpose is given in this section.
Let Mcost(localsize, α, β) be the communication cost in moving the elements of an array, with localsize elements distributed on each processor, from an initial distribution α to a final distribution β. We empirically measure Mcost for each possible non-matching pair of α and β and for several different localsizes on the target parallel computer. Let MoveCost(v, α, β, f ) denote the communication cost in redistributing the elements of array v, which has fusion f with its parent, from an initial distribution α to a final distribution β. It can be expressed as:
where LoopRange(i, v, α, x) and
Let CalcCost(v, γ) be the computational cost in calculating an array v with γ as the distribution of v. Note that the computational cost is unaffected by loop fusions. For multiplication and for summation where the summation index is not distributed, the computational cost for v can be quantified as the total number of operations for v divided by the number of processors working on distinct parts of v. In our example in Fig. 2(a) , if the array T 1(b, c, d, f ) has distribution c, d , f , 1 , its computational cost would be
1875 × 10 12 multiply-add operations on each participating processor. Formally,
For the case of summation where the summation index i = v.sumindex is distributed, partial sums of v are first formed on each processor and then either consolidated on one processor along the i-dimension or replicated on all processors along the same processor dimension. We denote by CalcCost1 and MoveCost1 the computational and communication costs for forming the sum without replication, and by CalcCost2 and MoveCost2 those with replication.
Finally, we define Cost(v, α) to be the total cost for the subtree rooted at v with distribution α. After transforming the given sequence of formulae into an expression tree T (see Section 2), we initialize Cost(v, α) for each leaf node v in T and each distribution α as follows (where NoRep(α) is a predicate meaning α involves no replication.):
For each internal node u and each distribution α, we can calculate Cost(u, α) according to the following procedure: Case (a): u is a multiplication node with two children v and v . We need both v and v to have the same distribution, say γ, before u can be formed. After the multiplication, the product could be redistributed if necessary. Thus,
u is a summation node over index i and with a child v, which may have any distribution γ. If i ∈ γ, each processor first forms partial sums of u and then we either combine the partial sums on one processor along the i dimension or replicate them on all processors along that processor dimension. Afterwards, the sum could be redistributed
With these definitions, the bottom-up dynamic programming algorithm proceeds as follows: At each node v in the expression tree T , we consider all combinations of array distributions for v and loop fusions between v and its parent. If loop fusion of the same index t between v and its parent is not possible because of different distribution ranges, then a virtual processor view is considered in order to allow the fusion. The array size, communication cost, and computational cost are determined according to the equations in Sections 3.1 and 3.3. If the size of an array before and after redistribution is different, the higher of the two should be used in determining memory usage. At each node v, a set of solutions is formed. Each solution contains the final distribution of v, the loop nesting at v, the loop fusion between v and its parent, the total communication and computational cost, and the memory usage for the subtree rooted at v. A solution s is said to be inferior to another solution s if they have the same final distribution, s has less potential fusions with v's parent than s , s.totalcost ≥ s .totalcost, and the memory usage of s is higher than that of s . An inferior solution and any solution that uses more memory than available can be pruned. At the root node of T , the only two remaining criteria are the total cost and the memory usage of the solutions. The set of solutions is ordered in increasing memory usage and decreasing cost. The solution with the lowest total cost and whose memory usage is below the available memory limit is the optimal solution for the entire tree.
An Application Example
In this section, we present an application example of the memory-constrained communication minimization algorithm. Consider again the sequence of computations in As an example, we investigate the parallel execution of this calculation on 32 processors of a Cray T3E, assuming 512MB of memory available at each node, and on 16 processors of an Intel Itanium cluster, assuming 2GB of memory available at each node. The best partitioning of the algorithm depends on the number of processors and the amount of memory available. It also depends on the empirical characterization data that we use to describe the communication costs of a given machine. We generated this data by measuring the communication times for each possible non-matching pair of array distributions and different array sizes for both the Cray T3E and the Itanium cluster. Although generating the characterization is somewhat laborious, once a characterization file is completed, it can be used to predict, by interpolation or extrapolation, the communication times for arbitrary array distributions and sizes. Tables 1 and 2 present the solutions of the memory-constrained communication minimization algorithm on the Cray T3E and Itanium cluster, respectively. For the system of 32 processors of the Cray T3E, the optimal logical view of the processor space is found to be a two-dimensional 4 × 8 distribution. Table 1 shows the full fourdimensional arrays involved in the computation, their reduced (fused) representations, their initial and final distributions, their memory requirements, and the communication costs involved in their re-distribution. The final distribution is defined in the same way for both input and intermediate arrays: it is the distribution at the multiplication node at which the array is used or consumed. The initial distribution is defined differently for input and intermediate arrays: it is the distribution at the leaf node for an input array, and the distribution at the multiplication node where the array is generated, or produced,
