Abstract-This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a Locality-Communication Graph (LCG) and the formulation of the compiler technique as a Mixed Integer Nonlinear Programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions-the decompositions-that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.
INTRODUCTION
M ULTIPROCESSOR architectures are converging toward an organization in which nodes containing memory and one or more processors are connected via a fast network. Processors have access to their local memories and to a hardware-supported global address space [1] . This organization enables high-scalability at a reasonable cost. The organization also facilitates programming by enabling the gradual introduction of parallelism on sequential prototypes. However, the Non-Uniform Memory Access (NUMA) organization of these machines makes data locality a crucial performance factor.
When data placement is controlled by the operating system, the programmer (or the compiler) must be aware of the page placement policy and either modify the program to adapt its memory access pattern to the operating system policy, or bypass the operating system and hand-code a customized page placement scheme. A possible page placement policy is first-touch, which maps a page onto the processor that references it first. Affinity scheduling can then be used to enhance locality. An alternative way to deal with dynamic memory reference patterns is dynamic page migration [2] , which automatically changes page placement if a remote processor accesses the page significantly more frequently than the local processor where the page is allocated.
In order to evaluate the effectiveness of the operating system strategies, we conducted a set of experiments on an SGI Origin 2000 using three codes: tfft2 from NAS, trfd from the Perfect Benchmark Suite, and tomcatv from SPEC95. One parallel version-version 0-of these programs was generated automatically using PFA (Power Fortran Accelerator [3] ). PFA is a source-to-source parallelizer that analyzes a sequential program and automatically inserts parallel directives similar to those of the OpenMP standard [4] . Version 0 relies exclusively on the page migration strategy of IRIX. 1 This strategy is based on counters that keep track of the number of memory accesses issued by each node in the system to each physical memory page. Whenever the number of remote accesses to a particular page exceeds the number of local accesses by a certain threshold, an interrupt is generated so that the OS can decide whether to migrate the page. We generated a second version of the parallel programs-version 1-by inserting SGI page-level distribution directives on version 0 programs, to measure how the compiler could help the operating system to improve locality.
Indeed, we generated a third parallel version-version 2-that is a message-passing version of the program. This was generated by identifying data and computation distributions in which data elements are, whenever possible, placed in the local memories of the processors accessing them. In this version, each processor allocates its own local data, and accesses to remote memories are handled via explicit put/get communication primitives. The goal of version 2 was to measure how much locality can be exploited in a NUMA architecture using true data distribution techniques instead of data placement controlled by the operating system. A similar performance comparison, for another set of benchmarks, has been conducted by Chapman et al. in [5] , achieving similar results.
The sequential times and the parallel times for all three versions of our experiments are shown in Fig. 1 . The lack of scalability of results in version 0 and version 1 is due to the fact that the page migration and page level distribution strategies do not optimize the locality of references in a NUMA architecture, such as the Origin 2000, because the granularity of a page is too coarse and, thus, a significant number of remote accesses happen. On the other hand, version 2 obtains good scalability for all the codes. In addition, the execution times of version 2 are the smallest for any number of processors. The results of version 2 gave us the idea that much more locality can be exploited in a NUMA architecture using true data distribution techniques instead of the page-level distribution (version 1) or dynamic page migration (version 0) techniques.
Given these results, we believe that the best way to exploit all the available locality of a code in a NUMA architecture is to identify a suitable distribution for both iteration and data (in other words, the decomposition), as we did in version 2. However, to find and implement a good decomposition by hand is a difficult task requiring extensive analysis and complex transformations of the sequential source code. Fortunately, we think that an advanced compiler can alleviate this cumbersome task. Our approach is to have the programmer write a conventional serial-nonannotated--program and rely on the compiler to automatically parallelize it, distribute the iteration and data between the processors, and generate the communications necessary to keep global data consistent. If such a compiler were truly successful, it would become the key tool in a highly-scalable, easy-toprogram computer system.
In this paper, assuming that the compiler has detected the parallelism in a previous step, we summarize how the data locality information can be extracted and focus on how to use it to formulate an optimization problem whose goal is to find a decomposition that minimizes overhead and exploits locality for the whole code.
Related Work
Current compiler literature contains a lot of information regarding the exploitation of data locality relying on data caching. Some of these works are based on loop transformation techniques [6] , [7] ; other approaches have addressed data transformation techniques [8] , and some recent efforts have aimed at combining the benefits of loop and data transformations [9] . The main difference between our work and these is that they focus on uniprocessor memory hierarchy, whereas our work focuses on multiprocessor memory hierarchy. In this sense, we are nearer to the automatic data distribution field, where the goal is finding a data distribution that minimizes the overhead due to nonlocal accesses in NUMA architectures.
A substantial body of work in the compiler literature has formulated the data distribution problem into two parts: 1) the alignment problem [10] , [11] , which tries to relate the dimensions of different arrays in order to minimize the number of accesses to remote memory positions (communications), and 2) the mapping problem, which decides which of the aligned dimensions are distributed and the number of processors assigned to each distributed dimension. Another influential approach [12] uses an algebraic representation of data and computation mappings and tries to find a static data decomposition without communications; if a free communications decomposition is not found, a naive algorithm tries to find a dynamic data distribution for the most executed program segment. However, in their formulations these approaches do not consider the load imbalance problem which can become a critical issue in machines with high-throughput and a low latency network [13] . One noteworthy difference in our approach is that, besides communications, we address load balancing. As we will demonstrate in Sections 3 and 4, the decompositions derived from considering the load imbalance are completely different to the ones that we obtain if we do not take into account the imbalance. As we will see, these new decompositions have an important impact on the performance of the parallel code in NUMA machines.
One interesting strategy in the automatic data distribution field consists of formulating an integer linear programming framework [14] , [15] , where the objective is to determine the mappings that minimize communications. Only in [15] is the load imbalance taken into account in a particular way. One important difference between our work and these others is that we base our techniques on the Linear Memory Access Descriptor (LMAD), a powerful and accurate array access notation introduced by Paek et al. [16] , for finding parallelism in the Polaris compiler [17] . We use the LMAD to extract the locality information for all arrays across the whole program. The LMAD notation lets us extend the domain of interest of the previous works to programs that contain loops not perfectly nested and to cover nonaffine subscript expressions. Another major difference is that our objective function is a nonlinear expression of problem variables because we have incorporated load imbalance into the problem formulation.
Next, we summarize how the paper is organized. In Section 2, we briefly outline the Locality-Communication Analysis Algorithm [18] , [19] , an algorithm that identifies the constraints that the iteration/data distributions must fulfill in order to ensure the locality of array references. When remote accesses are unavoidable, our techniques can identify the communication patterns suitable for each situation [19] . Using the information of locality and communication patterns provided by the Locality-Communication Analysis, in Section 3 we present a new technique based on a Mixed Integer Non-Linear Programming (MINLP) problem that attempts to derive, for a code, the optimal iteration/data distributions that exploit all the available locality. Our MINLP problem models the overhead of the parallel code due to the load imbalance and communications. In Section 4, we use several benchmarks to evaluate the approach presented in the paper, in terms of the complexity of the MINLP model and in terms of performance of the derived iteration/data distributions using as target different NUMA machines. Finally, in Section 5, we present the conclusions.
COMPILER REPRESENTATION OF LOCALITY
In our approach, the compiler first identifies the parallel loops. After loop parallelization, the first question to be answered is which loop is to be executed in parallel in each DO loop nest with two or more parallel loops because our techniques only exploit one-level parallelism. This is important because the selection of the parallel loop determines if communications are necessary in such loop nests as well as the amount of computation carried out by each processor. We assume that this selection will be done using an exhaustive search of all possible combinations of parallel loops. That is, at each step of this search, we consider all the loop nests (these loop nests do not have to be perfectly nested) of the program and, for each loop nest, at most one parallel loop is considered. Each of these loop nests we call a phase (see examples in Fig. 4) .
For each step of the search algorithm, the compiler builds a graph called the Locality-Communication Graph-LCG -which is our compile time representation of locality in the code. Next, using the information provided by the the LCG, the compiler formulates the optimization problem that looks for the decomposition that minimizes the overhead (communication and load imbalance) at that step. Doing this, we reduce the complexity of the optimization problem. This approach is guaranteed to produce the solution with the best predicted behavior although in some cases the number of possibilities could be too large. In these cases, the compiler could only analyze a subset of possibilities selected at random or by using a search strategy. In this paper, we focus on how the compiler works for one iteration of this exhaustive search algorithm: the process of building the LCG of the code, formulating the optimization problem, and solving it.
Notation Overview
For precise locality analysis, it is essential to have an accurate representation of the array regions accessed in each phase of the code. For this, we use the Linear Memory Access Descriptors (LMADs) [16] . Initially, the LMAD form summarizes the region that an array reference touches during the execution of a phase. The collection of the LMADs for all the references to the array in a phase summarizes the array region accessed in such a phase. One advantage of the LMAD notation is that it can accurately represent array references with affine or nonaffine subscript expressions [16] . In addition, since the LMAD can represent array access across procedure boundaries efficiently [20] , this notation lets us obtain a complete interprocedural analysis of the source program. Another advantage of the LMAD notation is that it allows, even for complex subscripts, summarization operations that need a polynomial time algorithm in contrast to worst-case exponential algorithms required in the summarization process of the linear constraint-based techniques [21] , [22] . More details of this comparison can be found in [16] , [20] .
The Iteration Descriptor-ID [18] , [19] -is an extension of the LMAD developed to more conveniently pinpoint the subregions of an array accessed by each iteration of the parallel loop in the phase. Due to space limitations, the LMADs and IDs are only briefly described in Section 2.2. The IDs are used to build the LCG (the Locality-Communication Graph). The LCG is a set of directed, connected graphs, each of which represents an array. Each graph contains exactly one node for each phase accessing the array the graph represents. The nodes are connected according to the program control flow. The LCG may contain cycles if there are outer loops that are not included in any phase. For example, Fig. 2 shows the LCG for tomcatv and a fragment of tfft2. The LCG of tomcatv contains four graphs corresponding to arrays X, Y , RX, and RY . In the LCG of the tfft2 code section, there are two graphs, corresponding to arrays X and Y . The LCG is constructed by the Locality-Communication Analysis Algorithm, which we summarize in Section 2.3.
Linear Memory Access and Iteration Descriptors
In our framework, accessing an array is viewed as traversing a linear memory space. For example, in Fig. 3 , the two-dimensional array access is, in reality, the traversal in a linear memory space starting from the base address ( (= the memory location for A (1, 4) ). The diagram in Fig. 3 illustrates that memory traversals are driven by the loop indices I and J. The LMAD attempts to capture the pattern of a memory traversal driven by a single loop index with the notion of stride and span. The stride is the distance (measured in number of array elements) between accesses generated by consecutive values of the index. In this example, the stride for index I is 2. Similarly, we can see the stride for J is N. The span is the total number of elements that the reference traverses when the index moves across its entire range. In the example, the span for index I is 2 Á K, which is the entire distance traversed between iterations 0 and K for a fixed value of J. Similarly, the span of J is (M-1) ÁN because J takes values from 1 up to M.
The LMAD contains the base offset and the collection of stride/span pairs of all indices involved in the array access. Its general form is: A stride i 1 ;stride i 2 ;ÁÁÁ;stride i d spani 1 ; spani 2 ;ÁÁÁ; spani d þ (. Here, we assume that access to array A is driven by d loop indices,
The base offset is (, which is the distance in elements from the location of the first reference to A to the beginning of the array. As can be seen in the above example, a single stride/span pair for a loop index-such as f2; 2 Á Kg for I and fN; ðM À 1Þ Á Ng for J-characterizes the access pattern generated by one of the indices assuming the other index or indices remain constant. The LMAD representing all the accesses in the example of 
where P ¼ 2 R . These examples illustrate how the LMAD notation can be used to capture the region of an array accessed by each reference in a phase.
The last two LMADs are a good example of accurate representation of accesses controlled by nonaffine subscripts. In this case, despite the apparent complexity of the subscript expressions, a simplification operation such as coalescing can safely remove redundant information that may be contained in the initial LMADs without losing accuracy; for instance, the third and fourth stride/spans on each LMAD can be coalesced as a single pair [23] . After applying this operation twice to those descriptors, respectively, the new form of those LMADs is:
We note that, after the coalescing operation, the new LMADs still represent the accesses for the two array references accurately, and are more manageable expressions. More details of the coalescing and other simplification operations can be found in [16] . The Iteration Descriptor (ID) [18] was developed to make explicit the subregions of an array accessed by each iteration of the parallel loop in a phase. For example, the subregions of array X accessed by one iteration, say i, of the parallel I-loop in the kth phase of the program, can be represented by an ID with the general form: 
Assuming that there are m different LMADs representing all the accesses to array X in the phase, each I k j ðX; iÞ item is derived from the jth collected LMAD. Each item represents a different subregion LMAD. Thus, the ID is, in general, the collection of several LMADs. In the ID, B is a m Â s matrix with a row for each LMAD, and a column for each span derived for a sequential loop index, assuming that there are s sequential loop indices in the LMADs (after applying the stride coalescing operation). The entry D represents a similar matrix, but with column entries representing strides. ( ( I ðiÞ is the extended offset vector whose elements are ( I ðj; iÞ ¼ ( j þ i Á P ðjÞ for j ¼ 1 : m, where P ðjÞ is the stride associated with the parallel loop index in the jth LMAD. ( I ðj; iÞ points to the first memory position of the subregion accessed by the jth LMAD in the ith iteration of the parallel loop.Á Á is the symmetry distance vector. This is originally set to null. The components of this vector are briefly described later. Fig. 5 shows a graphical representation of the IDs of the array X, associated with each iteration i of the parallel loop (index I) in phase F 3 of the tfft2 code (the code of Fig. 4c) . After computing the two LMADs for the two array references in the code and applying the coalescing operation mentioned before, we find two LMADs (the expressions in (2)). As there are two LMADs, there are two ID items, thus,
; ð1Þ; ð2 Á P Á iÞ; ðÁÞ; W; LE : 1
; ðÁÞ; R; LE : 1
where i is one of the iterations of the parallel loop. The general expression for the ID I 3 ðX; iÞ is shown in Fig. 5 . When i is instantiated to a value, we then find the elements of the array subregions accessed during such parallel iterations. For instance, I 3 ðX; 0Þ defines the elements of array subregions accessed during parallel iteration i ¼ 0, I 3 ðX; 1Þ the elements of subregions for iteration i ¼ 1, and so on. On each parallel iteration, two subregions are accessed: one starts at element 2 Á P Á i, which has a span of P =2 À 1 and the elements are separated by stride 1; the other starts at element P =2 þ 2 Á P Á i, which has a span of P =2 À 1 and again the elements are separated by stride 1.
The ID of an array also includes access and control flow information about the execution of the phase. The term access_type in (3) represents the memory access type for all the subregions accessed by an iteration of the parallel loop. This will be used to annotate the nodes of the LCG (Fig. 2) with one of these possible values: P if the array is privatizable [17] in the phase, W when all accesses are writes, R if all accesses are reads, and R/W if there are both read and write accesses. In the example of Fig. 5 , the memory access type for the array in all the subregions is R/W. In addition, one or more than one execution preds in the ID specify the execution properties of the phase. The possible values are: 1) CE if the phase is conditionally executed, in which case par is the probability of execution of that phase, 2) IE if the phase is inside an outer loop, in which case par is the number of iterations of the external loop, and 3) LE if the phase is executed in lexicographic order, in which case par is always 1. Cases 2 and 3 are exclusive because, during the building of the LCG in the Locality-Communication Analysis algorithm, we need to add a backward edge in the second case (e.g., the backward edge that connects phases F 1 -F 7 in the LCG of Fig. 2a ), but not in the third case (e.g., LCG of Fig. 2b , where all the phases are executed in lexicographic order). In the example of Fig. 5 , LE states that F 3 is executed lexicographically (and the parameter is 1). The execution predicates and their parameters are used to compute the cost functions, as we will discuss in Section 3.
TheÁ Á vector of a ID can be null or, in some cases, can contain symmetry distances to indicate that there are symmetries between the subregions accessed in a parallel iteration. This situation appears when the ID contains more than one I k j ðX; iÞ item. In these cases, two ID components can describe subregions with the same spans and strides, but different offsets. We identify the symmetry through three classes of symmetry distances [18] . The most relevant symmetry distance for the Locality-Communication Analysis algorithm is the Overlapping distance or Á o : This represents that two subregions accessed in different parallel iterations have the same access pattern, but are partially overlapped. For instance, the ID items that describe them, I k j and I k l , have the same spans, strides, and parallel stride, but I k l ðX; iÞ contains elements of I k j ðX; i À 1Þ. In this case, Á o ¼ ( I ðj; iÞ À ( I ðl; iÞ specifies the overlapping distance. Á o can be used to compute the overlapped elements (these elements define the shadow areas [19] that will be array subregions that our approach will replicate in the local memory of two processors). In the example in Fig. 5 , there is no overlapped elements (i.e., the subregion I 
Locality-Communication Analysis Algorithm
As we mentioned before, we use the LCG representation to capture the memory locality and communication patterns. Each node of the LCG corresponds to a phase accessing an array X. The compiler labels each node of the graph using the flag access_type of the corresponding ID (i.e., R, W, R/W, and P). The edges connecting nodes are also labeled with two attributes, which we call the locality labels: C if interprocessor communication is needed for the array when execution proceeds from the first node to the second, and L if communication is not needed. The locality labels, L and C, will be determined at the end of the Locality-Communication Analysis Algorithm.
Our Locality-Communication Analysis Algorithm looks for decompositions in which: 1) the iterations of each parallel loop are statically distributed between the H processors involved in the execution of the code according to a CYCLIC(k) (block-cyclic) pattern, and 2) the data distribution for each array is induced from the iteration distribution. In other words, our algorithm ensures the affinity between the parallel iterations and the data required by them.
In the following sections, we present some details of our Locality-Communication Analysis. The interested reader can find more details in [19] . Here, we assume that array X is accessed in phases F k and F g , thus there is a node for each phase in the graph that represents the array in the LCG.
Intraphase Locality
Our algorithm must ensure the affinity between a parallel iteration and the data required by it. Let us assume that parallel iteration i (of phase F k ) is scheduled in processor P E l ; 0 l H À 1. H is the total number of processors. We can state that the affinity is achieved when all accesses to array X in that iteration are local to P E l , which means that the subregions described by the ID of the array for the parallel iteration i-i.e., I k ðX; iÞ-are allocated to the local memory of P E l . This is what we call the intraphase locality condition. This is a sufficient condition to ensure that all accesses to array X in the iteration i are local to P E l . Clearly, this condition must be fulfilled for all the parallel iterations scheduled on each processor, in order to guarantee the locality of array references in F k . The data distribution for the array in that phase will be derived in order to hold this condition, as we will see next.
Interphase Locality
Once we have established the locality of array references in phase F k , the Locality-Communication Analysis algorithm determines when a decomposition for the array in two phases, F k and F g , can avoid communications when execution moves from the first phase to the second.
First, we analyze the case when array X is nonprivatizable in F k and F g . In this situation, the algorithm needs to relate a sequence of parallel iterations on a phase to the array region accessed by these iterations because we are looking for block-cyclic distributions. To do this, our algorithm uses two values: the upper limit and the memory gap. As shown in Fig. 6 , the upper limit, ULðI k ðX; iÞÞ, of array X for a parallel iteration i represents the highest memory position of the subregions described by the ID of X in the iteration i. This is computed from the sum of all spans plus the extended offset of the ID:
If there is more than one I k j ðX; iÞ item, then the sum is computed for each component and the upper limit is the maximum value. Similarly, ULðI k ðX; iÞ; pÞ = max i:iþpÀ1 ðULðI k ðX; iÞÞÞ represents the highest memory position of all the subregions of X for a sequence of p parallel iterations (the sequence goes from i to i þ p À 1). In the example of Fig. 6 , the upper limit for the parallel iteration 0 is ULðI 3 ðX; 0ÞÞ = ðP À 1Þ þ 2 Á P Á i = P À 1, whereas the upper limit for the sequence of iterations 0 : 2 is ULðI 3 ðX; 0Þ; 3Þ ¼ max i¼0:2 ðULðI 3 ðX; iÞÞÞ ¼ 5 Á P À 1:
The memory gap, h k , of array X in phase F k is defined as being the distance (measured in number of array elements) between the lowest memory position of the subregions associated with the ID of the iteration i þ 1 and the highest memory position of the subregions associated with the ID of the parallel iteration i. There can be phases in the program where this distance is not zero, for instance, the memory gap is a positive value. This happens when the sequential loops of the phase do not access all the memory positions between two consecutive parallel iterations. For example, in Fig. 6 , the memory gap between subregions is h 3 ¼ P . In other cases, this distance can be negative, but then our algorithm sets h k ¼ 0 because, in this situation, the parallel iterations cover all the memory positions.
The Regular Region, R k ðX; iÞ, of an array for a parallel iteration i of phase F k , can be characterized by the upper limit ULðI k ðX; iÞÞ and the memory gap h k . As illustrated in Fig. 6 , each R k ðX; iÞ (represented by a bounding box)
contains all the elements of the array subregions described by the corresponding ID. Similarly, the aggregation of all the regular regions that a sequence of p parallel iterations covers, S i:iþpÀ1 R k ðX; iÞ, can be characterized by the upper limit for the sequence of parallel iterations i to i þ p À 1, and the memory gap, i.e., ULðI k ðX; iÞ; pÞ and h k . Henceforth, we will call S i:iþpÀ1 R k ðX; iÞ the Aggregated Regular Region.
iÞ contains all the elements of the array subregions accessed in the chunk of parallel iterations
The ULðI k ðX; iÞ; pÞ and h k values are used to formulate the following balanced locality condition for array X in two phases F k and F g , in which the array is accessed:
where u k and u g represent the upper bounds of the parallel loops in phases F k and F g , and H is the number of processors. For instance, starting at parallel iteration i = 0, from Fig. 6 , we can deduce for array X that,
and from the code of Fig. 4c , we know that u 3 ¼ Q À 1, thus p 3 must verify (7),
Equation (6) determines the length of the sequence of parallel iterations (p k , p g ) that ensure that the aggregated regular region is identical in F k and F g . Equations (7) and (8) limit the maximum size for each sequence of parallel iterations that can be scheduled in a processor for phases F k and F g . Thus, if we find a solution (p k , p g ) that verifies the system of (6), (7), and (8), we say that the balanced locality condition holds. In other words, we can find a (p k , p g ) pair, for which the sequence of parallel iterations i : i þ p k À 1 in F k , and the sequence i 0 Á p g : ði 0 þ 1Þ Á p g À 1 in F g access the array elements contained in the same aggregated regular region of the array. To find a (p k , p g ) pair means that (6) must be held by sequences 0 : p k À 1 and 0 : p g À 1 of F k and F g , respectively, then by sequences p k : 2 Á p k À 1 and p g : 2 Á p g À 1, and so on until all parallel iterations of F k and F g have been considered.
If the sequence of parallel iterations i : i þ p k À 1 and i 0 Á p g : ði 0 þ 1Þ Á p g À 1 are scheduled in P E l , then the accesses to array X are local when the common aggregated regular region is stored in the local memory of P E l (the intraphase locality condition). This process can be repeated in a round robin fashion to cover all sequences of parallel iterations. That is, we can find a decomposition in which the sequences of iterations ðl Á p k :
. . from F k , and the sequences of iterations ðl Á p g : ðl þ 1Þ Á p g À 1Þ, ððl þ HÞ Á p g : ðl þ H þ 1Þ Á p g À 1Þ, . . . from F g , can be scheduled in processor P E l , following a CYCLIC (p k ) and a CYCLIC (p g ) iteration distribution, respectively. Next, we can build a data distribution for the two phases in such a way that the corresponding common aggregated regular region of array X covered by each sequence of iterations is allocated in the corresponding local memory. That is, the regular regions of array X induced by the iteration distributions will define the data distributions of the array. The implementation of these data distributions takes place in the data allocation procedure during the code generation step (see Section 3.4). From now on, we will call a sequence of parallel iterations a chunk.
The Locality-Communication Analysis algorithm checks if one of the next four situations occur and labels the LCG accordingly:
1. Array X is nonprivatizable in phases F k and F g and the balanced locality condition holds. Thus, it is possible to find iteration distributions in F k (CYCLIC (p k )) and F g (CYCLIC (p g )), and to build a data distribution for the two phases, able to avoid communications, following the aggregated regular region strategy explained above. In this case, the edges connecting the corresponding nodes in the LCG are labeled with L. 2. Array X is nonprivatizable in phases F k and F g and the balanced locality condition holds. However, now there are shadow areas for the array in another phase (i.e., 9 Á o -see Section 2.2-) and accesses to the array in F k are writes. In this case, an updating of the replicated shadow areas may happen, and we say that the Frontier communication pattern occurs. Thus, the nodes are connected with a C F label (a communication label). 3. Array X is nonprivatizable in phases F k and F g and the balanced locality condition does not hold. In other words, each phase requires a different static data distribution, so an array redistribution between the two connected nodes must take place, and we say that the Global communication pattern occurs. Thus, the edges connecting the corresponding nodes in the LCG are labeled with C G (another communication label) 4. Array X is privatizable in phase F k (F g ). By definition, the value of X in F g (F k ) does not depend on the value of X in phase F k (F g ). In this case, the data allocation procedure-during the code generation step-will privatize on the local memory of each processor the regular regions that correspond to the parallel iterations scheduled on it. The edge connecting the two nodes can be removed (these are the dashed edges in Fig. 2b ). More details and examples that illustrate these situations are found in [19] .
Chains
Notice that (Global or Frontier) communications occur between the phases that are connected by a C edge that represents a data redistribution (C G ) or an updating of shadow areas (C F ) when the program control crosses from one phase to the other. On the other hand, the set of nodes that are connected consecutively by L edges we call a chain of nodes. There can be more than one chain for an array, and each array of the LCG has at least one chain. For instance, there is a chain of nodes for array X that includes the nodes corresponding to phases Fig. 2b .
It is possible to find a static data distribution for all the nodes of a chain, such that all the accesses to the array in any of the connected nodes of the chain are in local memory. Intuitively, the reason is because the verification of the balanced locality condition for the L-connected nodes of a chain will guarantee that all the nodes belonging to the same chain cover the same common aggregated regular regions of the array. A data allocation procedure must allocate the array elements of the common aggregated regular regions, in the corresponding local memory, before the first node of a chain. This is implemented during the code generation step (see Section 3.4).
PROBLEM FORMULATION
The goal of this section is to show how, from the array access and control flow, information supplied by the IDs of each array, and from the locality information of the LCG, we can derive an optimization problem to find out the decompositions (iteration/data distributions) that minimize the execution overhead of the parallel code, while exploiting all the available locality detected by the LCG. The p k values (the size of the chunk in the block-cyclic or CYCLIC (p k ) iteration distribution for each phase) are the variables of the problem. The objective function of our programming problem is the overhead (in seconds) due to: 1) the load imbalance (which depends on the iteration distribution of each phase, i.e., the computation), and 2) the communications (which depend on the communication patterns of the C edges on the LCG). We compute the objective function in Section 3.1. This objective function is subject to several constraints that are described in Section 3.2. We will see in these sections that, for the computation of the costs functions (especially the communication costs) and the collection of the constraints, we use the graph structure of the LCG. Next, in Section 3.3, we discuss several issues relating to our model and their solutions.
Objective Function
The C-connected nodes take part in the definition of the objective function because, as we said, communications are one of the overhead sources that we consider in our approach. The other overhead source is the accumulation of the load imbalance assigned to processors. The general expression of this objective function is:
where j traverses all arrays (X j ) of the LCG and k traverses, for each array, all the nodes where the array is referenced. D k ðX j ; p k Þ represents, for array X j , the load imbalance cost function in phase F k , and C kg ðX j ; p k Þ represents, for array X j , the communication cost function for the communications that take place between phases F k and F g when the corresponding nodes of the LCG are connected with a C-edge. Both cost functions will be described next.
Load Imbalance Cost Function
In this paper, to simplify the computation of the load imbalance cost function, let us assume that, for each phase, loops are perfectly nested and that they have been linearized by the compiler. We compute the load imbalance cost function, D k ðX j ; p k Þ, of a phase F k and for an array X j , as the difference between the time consumed by the most loaded processor and the least loaded one. Roughly, this cost function depends on p k and the number of references to array X j . In order to compute this cost function, we must be able to estimate the computational load of a processor P E l , in other words, the time consumed by all the occurrences of an array X j in a phase F k when a CY CLICðp k Þ iteration distribution has been selected for the parallel loop. This time can be estimated as being due to two factors: 1) the load of processor P E l , i.e., the total number of iterations scheduled in that processor, and 2) the time consumed by all the occurrences of array X j inside the phase.
In order to calculate factor 1), we start computing the sequential load for a chunk of p k parallel iterations in a phase F k , L k ðp k Þ, as the number of sequential iterations executed by the p k parallel iterations. For example, in Fig. 7 , we compute the sequential load for a chunk in a rectangular phase and in a triangular phase. Of course, there are other kinds of triangular phases and domains. However, we choose these two simple cases to illustrate how our method computes the cost functions. In the L 1 ðp 1 Þ expression (i.e., the number of sequential iterations executed by a chunk of p 1 parallel iterations in F 1 ), M and R are the number of iterations of the J-loop and the L-loop (the sequential loops), respectively. In the L 2 ðp 2 Þ expression, R is the number of iterations of the L-loop and Chunk is the chunk number. That is, if the chunk number for which we are computing L 2 ðp 2 Þ is Chunk ¼ 0 (i.e., the first chunk), that chunk contains the parallel iterations 0 : p 2 À 1; if the chunk number for which we are computing L 2 ðp 2 Þ is Chunk ¼ 1 (i.e., the second chunk), the chunk now contains parallel iterations p 2 : 2 Á p 2 À 1, and so on.
Next, we compute the load of processor P E l in phase F k , L k ðP E l ; p k Þ, as the accumulation of the sequential load due to all the chunks scheduled in that processor following a
In other words, L k ðP E l ; p k Þ represents the total number of sequential iterations executed by all the chunks scheduled in processor P E l . Obviously, the load of a processor depends on the chunks scheduled on such processor (Chunk in P E l ), and the size of a chunk (p k ).
We use a simple approach for the computation of factor 2-which we call Ref k ðX j Þ. We only roughly estimate the time consumed by all the references to X j in the loop body of phase F k . In our estimation, the compiler analyzes the subscript expressions of the references to X j in the phase. We have assumed that arrays are laid out in memory following the main column order because we are analyzing Fortran codes. If there are references with subscript expressions that contain the stride 1 (here, we only look through the strides associated with the sequential loop indexes), then we assume that these references will be satisfied mainly in cache. The other references will be satisfied in local memory. Recall that, in our model, there are no remote references inside a phase; the intraphase locality condition ensures this fact. Therefore, for each phase, and for each array X j , we count the number of references that will be satisfied mainly in cache and multiply them by the cache latency of the system. We count the rest of the references (that will be satisfied in local memory) and we multiply them now, by the local memory latency. As we see, Ref k ðX j Þ is a machine-dependent function. Another way to compute Ref k ðX j Þ is by using profile information [12] .
Finally, once factors 1 and 2 have been calculated, we can estimate, for processor P E l , the time consumed by all the occurrences of the array X j in F k , when a CY CLICðp k Þ iteration distribution has been selected, as L k ðP E l ; p k Þ ÁRef k ðX j Þ. Now, we can compute the load imbalance cost function for each array X j referenced in 
Equation (10) includes the factors 1 and 2, and the term Q a par a À Á . This term represents the contribution of phase F k in the total overhead time. Thus, we must take into account whether the phase is nested or not in some control sentences (such as iterative and/or conditional sentences). par a represents the parameter associated with one of the execution predicates of the ID for array X j in F k (see (3)). Thus, Q a par a À Á is the contribution due to all parameters of the ID when the phase is nested in several control sentences.
Equation (10) represents a rough approach to the computation of the load imbalance overhead. In our load imbalance cost function, we have to deal with a fact detected in real measurements of the load imbalance: If the number of chunks in the CYCLIC (p k ) iteration distribution is a multiple of the number of processors (H), then the load imbalance is smaller than the load imbalance when this condition does not hold. This is due to the block-cyclic iteration distribution pattern because when the number of chunks is a multiple of H, then all processors have the same number of chunks. Otherwise, some processors have one chunk more than others. To capture this issue, we compute (10) in two situations:
1. The number of chunks is a multiple of the number of processors. Then, the number of chunks is the same for all the processors. Clearly, it is easy to deduce that if the phase is rectangular, then
The expression for a triangular phase is a little more complex because, in this case, the most loaded processor is P E M ¼ H À 1 and the least loaded is
Remember that, for a triangular phase, we must accumulate the sequential load for the corresponding chunk numbers (see L 2 ðp 2 Þ expression in Fig. 7 ). u k þ 1 is the number of iterations of such a parallel loop, and R the number of iterations of all sequential loops that do not depend on the parallel loop index (only the L-loop in our example of F 2 in Fig. 7 ). 2. The number of chunks is not a multiple of the number of processors. In this case, the most loaded processor is the one with the last chunk, i.e., Fig. 7 . Computation of the sequential load for a chunk of parallel iterations: F 1 is rectangular, F 2 is triangular, and Chunk represents the chunk number.
and the least loaded is P E m ¼ P E M þ 1. For example, when the phase is rectangular, then
that is, there is one chunk of load imbalance. When the phase is triangular, then
Once we have obtained the two load imbalance cost expressions for situations 1 and 2, we can combine them into a single expression. For this, we can use a binary variable b k : b k is equal to 0 when the number of chunks is a multiple of H, and equal to 1 in other cases. In Table 1 , we summarize the final expressions of the load imbalance cost function ((10)) derived using this method for the rectangular and triangular phases of Fig. 7 . These expressions are derived using (11) and (13) for the rectangular case, and (12) and (14) for the triangular one. From Table 1 , we notice that the load imbalance cost functions are nonlinear expressions.
Communication Cost Function
The second overhead source in our model is due to the edges labeled with C (C F or C G edges) that connect the nodes that correspond to phases F k and F g in the LCG graph of array X j . They represent the need for communications when the program control crosses from one phase to the other because we cannot ensure the locality of references to array X j in the two connected nodes. In these cases, we compute the communication cost function, C kg ðX j ; p k Þ, which is defined as the time consumed in the communications that takes place between the execution of the corresponding phases.
The communication cost for M messages of n elements, for a single-step communication routine in absence of memory contention, is:
where represents the startup time and ! represents the bandwidth of the communication primitive which, obviously, depend on the target architecture. On the other hand, Q a par a À Á represents the contribution due to all parameters of the ID for array X j in phase F k , when the phase is nested in some control sentences, as we explained in Section 3.1.1. In this work, we use the put primitive as the communication primitive. Thus, the communications are asynchronously initiated by the processor that owns the data. In other words, the number and size of messages that a processor sends depend on the data that are allocated in the local memory of such a processor.
Let's focus our discussion on the Global communication pattern; the detailed analysis of how to derive the cost functions for the Frontier communication patterns can be found in [24] . The Global communication pattern represents a data reallocation. In this case, two phases F k and F g (or two chains of nodes) require two different static data distributions of array X j . Thus, a redistribution of all the elements of the array must take place after the execution of F k and before the execution of F g . The general case of redistribution consists of sent messages of size n ¼ 1, that we call Global Communication without Aggregation. In this case, the number of messages M is the number of elements in the local memory of each processor after the execution of F k . M depends on: 1) the number of chunks per processor scheduled in phase F k ( ukþ1 HÁpk l m ), 2) the size of the chunk (p k ), and 3) the number of elements of array X j that are in the regular region that is allocated in the local memory of a processor, for each parallel iteration, that we call ' k ðX j Þ and can be computed from the corresponding ID. Other cases of redistribution consist of aggregating messages in blocks in order to reduce the latency of the communications, as well as the number of messages in the network. Message aggregation is provided in our model for two particular cases:
1. Messages are aggregated in blocks of size n ¼ p k . In this case, the number of messages M depends only on the number of chunks per processor scheduled in phase F k , and the number of elements of array that belongs to a regular region (' k ðX j Þ). 2. Messages are aggregated in blocks of size n ¼ ' k ðX j Þ. In this case, the number of messages M depends only on the number of chunks per processor scheduled in phase F k , and the size of the chunk. We have chosen these cases of message aggregation because they help our compiler simplify the code generation Fig. 7 for the Global communication pattern. Summarizing, we show in Table 2 the number of messages (M) and the size of each message (n) that must be considered in (15) , for the general case of the Global communication pattern without aggregation of messages (n ¼ 1), and for the particular cases of the Global communication pattern when messages are aggregated in blocks of size n ¼ p k (case 1) or in blocks of size n ¼ ' k ðX j Þ (case 2). More details can be found in [24] , where we show the conditions to perform automatic message aggregation (this issue is beyond the scope of this paper because its focus is to show how to derive the communication cost function in our approach, and that our model is able to consider some cases of message aggregation). In any case, from Table 2 , we notice that the communication cost functions are nonlinear expressions. Table 3 illustrates the objective function for the LCG of the tfft2 code example (Fig. 2b) . X j represents one of the two arrays of the program (X 1 X, X 2 Y ) and index k traverses all the nodes where the corresponding array is accessed, following the flow of the corresponding LCG graph.
Constraints
As we mentioned earlier, our objective function is subject to different kinds of constraints:
1. Locality constraints. These are derived for each pair of phases in the LCG, connected with L, and they represent the balanced locality condition (Section 2.3). The expression of the locality constraints is derived from (6) . The verification of these constraints states that it is possible to exploit memory locality without communications between the L-connected nodes. 2. Maximum chunk size constraints. For each phase in the LCG, these constraints set the limits for the size of the chunks. The expression of these constraints is similar to those of (7) and (8) . In these equations, we see that the size of the chunks in a phase F k , must be between 1 and ukþ1 H AE Ç . 3. Nonlinear behavior constraints. These constraints are due to the nonlinear behavior of the load imbalance cost functions. We have seen that the load imbalance cost functions are nonlinear expressions that depend on a binary variable b k . This binary variable is 0 when the number of chunks in a phase is a multiple of the number of processors, H, and 1 when the number of chunks is not a multiple of H. The expressions of the nonlinear behavior constraints are like those that we show in Table 3 . In these expressions, c k represents the number of chunks, bb k ðlÞ is a binary variable, and MUL k ðlÞ is a table of constants (each constant in this table is a multiple of H). 4. Integrality constraints. These constraints ensure that the size of a chunk (p k ) and the number of chunks (c k ) are just integer variables, while b k and bb k ðlÞ are binary variables. Examples of all these constraints are shown in Table 3 for the LCG of the tfft2 code section of Fig. 2b .
We note that our method formulates a Mixed Integer NonLinear Programming problem or MINLP which is NP-hard. We have chosen to rely on a general purpose programming solver to find the optimal solution of our complex problem, rather than to use a heuristic. In order to compute the solution to this problem, we have invoked a MINLP solver of GAMS (General Algebraic Modeling System) [25] called DICOPT (DIscrete and Continuous OPTimizer) [26] . DICOPT is based on the extensions of the outer-approximation algorithm for the equality relaxation strategy. The MINLP algorithm inside DICOPT solves a series of NLP (Non-Linear Programming) and MIP (Mixed Integer Programming) subproblems. By solving our optimization problem, we find p k . In other words, we obtain the optimal CYCLIC(p k ) iteration distribution for each phase in the code, which minimizes the overhead due to the communications and load imbalance. Once we have obtained the iteration distributions, the arrays are distributed across the processor memories following the procedure sketched in Section 3.4. 
Analysis of the Model and the Solutions
An important question at this point is which of the many components of our overhead model are more relevant. To illustrate this discussion, we present as a case study the fragment of code of Fig. 9a , where we only consider the references to array X. The Locality-Communication Analysis Algorithm (see Section 2.3) detected the Global communication pattern with aggregation-case 1 for array X, when the execution moves from phase F k to F g . For this reason, our compiler inserted the "Global communication routine" in the code. Let us assume that the number of processors is H ¼ 4. Fig. 8a represents the overhead due to the global communications. Fig. 8b represents the overhead due to the load imbalance when we assume that F k is a rectangular phase (for instance, v ¼ 255). Fig. 8c is the overhead due to the load imbalance when we assume that F k is a triangular phase (i.e., v depends on the outer loop index, for instance, v ¼ I). In the figures, we show the overheads estimated by our cost functions and the overheads measured in a real machine. 2 The overheads are times in seconds (Y-axis). To estimate the cost of the global communications pattern with aggregation, we use an expression such as the one in the second row of Table 2 . Regarding the load imbalance overhead, we use similar expressions to those shown in Table 1 for a rectangular and for a triangular phase, respectively. One point here is that our cost functions seem to model quite accurately the real behavior of the code overhead in a real machine. We have estimated and measured each one of the overheads for different values of p k (X-axis). As there are 256 iterations in the parallel loop of F k , the size of the chunk of the iteration distribution that we are looking for, i.e., p k , could range from 1 to 64, i.e., from CYCLIC(1) to CYCLIC(64) (= BLOCK) distribution. Another point is that we can deduce directly from the figures the optimum p k that achieves minimum overhead, which is the value of interest.
If we just consider the communication cost function as the only overhead source, our method tends to select CYCLIC(64) = BLOCK distributions for F k (see Fig. 8a , where the minimum overhead is reached when p k ¼ 64). If we only consider the load imbalance cost function, the selection of p k depends on the type of phase: If the phase is rectangular our method could select those values of p k for which the number of chunks is a multiple of H (see Fig. 8b , where the minimum is reached for p k ¼ f1; 2; 4; 8; 16; 32; 64g); if the phase is triangular, our method selects p k ¼ 1, because bigger values of p k increase the load imbalance overhead (see Fig. 8c ). When we combine the overhead of Figs. 8a and 8b (i.e., communications and load imbalance in a rectangular phase), our method selects p k ¼ 64 (i.e., the BLOCK distribution). When we combine the overhead of Figs. 8a and 8c (i.e., communications and load imbalance in a triangular phase), our method selects a CYCLIC (p k ) distribution in which the size of the chunk, p k , is a trade off between the communications and load imbalance. More specifically, the value of p k depends basically on a machine-dependent parameter: the latencies of the memory hierarchy. We now go into this issue more deeply.
In our research, we consider two types of latencies: the latencies of remote memory access (i.e., the latencies of communication primitives) and the latencies of cache and local memory access. The latencies of communication primitives are considered in the computation of the communication cost function (the parameter ). On the other hand, the latencies of cache and local memory access affect the computation of the load imbalance cost function (because they take part in the estimation of Ref k ðX j Þ). Let us study Figs. 9b and 9c to illustrate how the latencies affect the selection of the optimum chunk size in our model when we analyze the case that we are now studying: a triangular phase with communications. In these figures, we represent the optimum chunk size selected by our approach (the Y-axis), taking into account communication and load imbalance overheads for the triangular phase of Fig. 9a (i. e., v ¼ I) when H ¼ 4.
In Fig. 9b , we fix the parameter (the latency of the communication primitive) and vary the latency of local memory (LT ) from 1 nanosecond to 10 microseconds (the X-axis). In Fig. 9c , we fix the latency of local memory and vary the latencies of the communication primitive () from 1 nanosecond to 10 microseconds (the X-axis). The other values of the cost functions are fixed to the Cray T3E's parameters. We note that the optimum chunk size ranges from p k ¼ 64 (the BLOCK distribution) to p k ¼ 1 (the CYCLIC(1) distribution). We note that our method only selected those values of p k for which the number of chunks is a multiple of H (in this example, the possible values for p k are f1; 2; 4; 8; 16; 32; 64g). The main conclusion drawn from Fig. 9b is that, when the latency of local memory increases, the load imbalance becomes the more important factor in the cost function and the optimum tends to smaller chunks. For example, with a latency of local memory of 10 nanoseconds, the optimum is p k ¼ 64; with a latency of 1 microsecond the optimum is p k ¼ 16; and beyond 7 microseconds, the optimum is p k ¼ 1. On the other hand, the main conclusion from Fig. 9c is that, when the latency of the communication primitive increases, the communication cost becomes the more important factor in the cost function and the optimum leads to bigger chunks. In this case, with a latency of remote memory of 1 nanosecond, the optimum is p k ¼ 1; with a latency of 1 microsecond, the optimum is p k ¼ 16, and beyond 9 microseconds, the optimum is p k ¼ 64. As we can see from these two figures, the effects of the local memory latencies are the opposite to those of the remote memory (communication) latencies when we look for the optimum chunk size in a triangular phase. Intuitively, if we only consider the load imbalance in a triangular phase, scheduling chunks of the smallest size is the right decision. If we only consider communications, aggregating messages to build the biggest chunk is the best option. When we consider the communications and the load imbalance together, we have to find a trade off value. As we see, this trade off is captured in our approach.
Some other issues related to our MINLP formulation (such as the complexity of the nonlinear programming problem and optimality of the solutions) are briefly discussed in Section 4.
Parallel Code Generation
In the following, we simply outline the procedure to generate the parallel code, once our method has found, for each phase F k , the optimum p k , i.e., the CYCLIC (p k ) iteration distribution that minimizes the overhead of the parallel code.
In Fig. 10 , we show how the parallel code is generated by our method for a fragment of the tfft2 code that includes phases F 1 and F 2 . We notice that loops marked as parallel by the compiler in a previous step (Fig. 10a) , are decomposed into two nested loops (following a strip-mining scheme) as we see in Fig. 10b . This is the implementation of the CYCLIC (p k ) iteration distribution for each phase. In this way, the external loop-driven by I 1 -traverses all the chunks, CH k ðP E l Þ, scheduled in processor P E l , 0 l H À 1, and the internal loop-driven by I 2 -crosses the parallel iterations of each chunk. Now, we can build the data distribution for each array of a phase. Our data distribution follows the aggregated regular region strategy explained in Section 2.3.2, and takes care of holding the intraphase locality condition that we identified in Section 2.3.1. This is done in the "Data allocation procedure," which is called at the beginning of the code execution. In [24] , we show the algorithm of a generic data allocation procedure. Basically, this procedure uses the information provided by the ID of array X to select the array elements that must be allocated in the local memory of processor P E l for each chunk of p k parallel iterations scheduled on such a processor.
Let us return to Fig. 10 . Before the execution of F 2 , the compiler inserts a call to a "Global communication routine" for array X. This is because the Locality-Communication Analysis algorithm detected the Global communication pattern for array X between the execution of phases F 1 and F 2 , and this was indicated in the LCG of Fig. 2b with a C G edge. In addition, from this LCG we can deduce that the compiler will insert another call to a "Global communication routine" for array X between the execution of phases F 2 and F 3 and, after this point, all references to X will be local for the chain of nodes that correspond to phases
The generation of the "Global communication routine" for an array X follows a similar process to a data allocation procedure. The idea of the global communication routine is to traverse all array positions of the regular regions allocated in the local memory of each processor, according to the data distribution of the source phase F k . The task of the global communication routine is to compute the target processor and the remote array position where each local element must be sent (using the put primitive), according to the data distribution of the target phase. We do not go deeper into this issue due to space constraints, but more details can be found in [24] .
EXPERIMENTAL RESULTS
In this section, we attempt to evaluate the impact of our approach using six benchmarks: two codes from SPECfp95 (tomcatv and swim), three codes from the Perfect Benchmark Suite (bdna, mdg, and trfd), and one code from the NAS (tfft2). In Table 4a , we summarize the size of the main arrays in the codes.
The first task in our experiments was to measure the complexity of our model in the GAMS optimization tool. For all the cases, we built the LCG of the code, derived the nonlinear integer programming problem, and obtained the solutions invoking the DICOPT solver of GAMS. The second question was to evaluate the efficiency for the decompositions (iterations and data distributions) derived with our techniques in different NUMA machines.
Complexity of the Method
In Table 4b , we summarize, for each benchmark, some issues about the complexity of our nonlinear integer programming model: the number of equations (E), the number of variables (V), the number of terms (T), the number of nonlinear terms (NL.T), the number of iterations (I) till DICOPT found an optimal solution, and the time in milliseconds (Time) that DICOPT spent in finding a solution in a MIPS R10000 running at 195 MHz.
We note that the DICOPT solver found optimal solutions in a relatively small number of iterations. For this reason, the time that DICOPT needed to find the solutions was very small: from 90 milliseconds for mdg to 160 milliseconds for tfft2. We believe that these times show the feasibility of our approach. In fact, the times are quite competitive, justifying the fact that our method can be implemented in the Polaris parallelizing compiler as a new pass. We can even take advantage of these very small times to avoid the necessity of fixing H at compile time. Thus, the compiler can generate the parallel code in which the value of p k for each phase is an unknown parameter. At the beginning of the source parallel code, we insert a call to a routine where we have parametrized the objective function and constraints of the code. At runtime, this routine has the task of finding out the number of processors (H) and invoking DICOPT to solve the parametrized MINLP problem of the code, but now knowing H. The output of this routine are the values of p k for each parallel loop of the code.
Another issue that we should address is that, although the complexity of the source code increases, the runtime of the solver to find the solution does not grow dramatically. For example, bdna is the most complex code, so the number of variables and equations is the biggest. However, if we compare the runtime of bdna versus trfd, which is the least complex code, we notice that they are not significantly different. The reason is because, based on experience with other models, the authors of the DICOPT solver have chosen as the default stopping rule "stop as soon as the NLP subproblem has an optimal objective function that is worse than the previous NLP subproblem" [26] . In several cases, this stopping rule makes DICOPT find the best integer solution in the first few iterations (as happens in our codes). Obviously, this heuristic may provide a suboptimal solution (a local minimum instead the global minimum) because our objective function is not a convex function. As we see, the optimization technology has not yet reached the stage of maturity in MINLP problems. One of the best current approaches to avoid a local minimum in MINLP problems is to compute the convex envelope of the objective function and to formulate a new optimization problem with this new function. In this case, DICOPT would always give us the global minimum. We could compare this global minimum with the minimum obtained with our original nonconvex objective function.
If they had the same values of p k , then we would have found the global minimum. Otherwise, we could set bounds to p k in the area where the global minimum for the convex envelope was found, and again, formulate the original nonconvex objective function with this new constraint, to try to fit the local minimum of the nonconvex objective function to the global minimum of the convex envelope. As we see, this is a challenging area that needs more research and is beyond the scope of this paper. In any case, for our codes, DICOPT always gave us the global optimum in our initial formulation.
Results of the Iteration/Data Distributions Derived With Our Method
In order to evaluate the efficiency of the iterations and data distributions derived in our method, we conducted two sets of experiments. In the first set of experiments, two different parallel versions were generated as targets from each sequential program: version 1 and version 2. Version 1 was generated by Polaris applying the techniques currently implemented in the compiler [16] , [19] : BLOCK iteration distribution for the rectangular parallel loops or CYCLIC (1) iteration distribution for the triangular ones, and always BLOCK distribution for data. Version 2 was generated applying the techniques discussed in Section 3: The LCG of the source code was built, then the MINLP problem for the optimal iteration/data distributions was derived. We used the DICOPT solver of GAMS to obtain the solutions. Finally, the distributions were hand-coded following the procedure explained in Section 3.4. We note that the same loops were parallelized for both versions. These experiments were originally intended to evaluate the effectiveness of our distribution model and to roughly estimate how much we could improve the parallel performance once our techniques are fully implemented in the Polaris compiler. In Table 5 , we summarize for each program the decomposition selected by Polaris (version 1) and, by our method, (version 2) for the major parallel loops and arrays.
In Fig. 11 , we show the execution results of these benchmark programs on the Cray T3D. When we compare the execution times for the two versions, we H is the number of processors and NMOL, N, P, and NRS are input parameters. A.R.R.S. stands for Aggregated Regular Region Strategy. Fig. 11 . Executions of the sequential code and its parallel code for version 1 and version 2 on the T3D.
notice that version 2 always outperforms version 1. Let's examine the reasons for these time differences. In swim and tomcatv, both version 1 and version 2 select BLOCK iteration distributions for all of the parallel loops (BLOCK = CYCLIC(N/H)). However, version 2 is faster than version 1 due to the fact that our approach detects and exploits the Frontier communication pattern ( [19] ), which reduces the number of communications drastically.
Regarding the codes bdna and trfd, these contain important triangular loops that were distributed following a CYCLIC(1) and BLOCK schedule in version 1, and following a block-cyclic schedule in version 2. However, the key is that the main arrays of these codes were distributed using the BLOCK data distribution in version 1, whereas the arrays were distributed using the aggregated regular region strategy in version 2. The problem in version 1 is that there is no affinity between the iteration distributions (CYCLIC (1)) and data distributions (BLOCK) in the triangular loops, so version 1 needs to generate many communications. However, in our approach, the affinity of iterations and data is guaranteed thanks to the intraphase locality condition and the aggregated regular regions strategy. Thus, using our techniques, we remove the need for communications while the load is balanced in these triangular loops, improving the performance of version 1 by approximately 50 percent in these cases.
On the other hand, the tfft2 code is an interesting case. The main loop nests are rectangular, but now the arrays present symmetries between subregions. In other words, in the same iteration of a parallel loop, different subregions on an array are accessed with the same access pattern. This information is captured in version 2 ( [19] ), where several block-cyclic schedules of iterations are selected. The locality constraints and the aggregated regular regions strategy guarantee the affinity of iterations and data accessed in such symmetric subregions. In other words, our technique can identify chunks of parallel iterations and the subregions accessed by these chunks and can schedule the iterations and place the data in the corresponding processor. However, in version 1, the BLOCK distribution for iteration and data cannot handle this complex situation.
We believe that our techniques will be effective for a wide variety of NUMA machines, including machines with high throughput and low latency in which the data locality issue is a less critical factor than load balancing. The reason for this is because our approach looks for a trade off between a minimization of communications and load balancing. In order to check the behavior of our approach in a machine with these characteristics, we conducted a second set of experiments, using two different NUMA machines: a Cray T3E and an Origin 2000.
In Tables 6 and 7 , we show the execution times (in seconds) for version 2 (the parallel version based on our techniques) on the Cray T3E and on the Origin 2000, respectively. In addition, we show the efficiencies of the parallel codes. We see superlinear behavior in the swim and tomcatv codes (due to a cache effect) on the Cray T3E and on the Origin 2000. In fact, we achieve in both machines better efficiencies than in the Cray T3D. However, we notice from the tables that parallel efficiencies for 16 processors are, in general, worse in the Origin 2000 than in the Cray T3E. This can be explained to some extent because the put primitives are less scalable in the Origin than in the Cray T3E. In other words, we note that for the same parallel code that achieves the same load balance in both machines, the higher the remote latency the worse the program scalability.
In summary, we believe that these results show that our MINLP formulation helps the compiler to find efficient decompositions for real codes to be executed in a variety of NUMA platforms. 
CONCLUSIONS
We summarize the contributions of this paper as follows:
1. We have modeled the overhead due to communications and load imbalance in NUMA architectures. Using this model, we have formulated a MINLP optimization problem to find out the optimal iteration/data distributions that minimize this overhead while exploiting the available locality. 2. We have demonstrated, with measurements, that the MINLP formulation does not increase compilation time significantly and that our approach generates efficient decompositions for a variety of NUMA architectures. 3. We have shown that communications (latencies of remote memory) and load imbalance (latencies of cache and local memory) can affect in opposite ways the selection of the optimal iteration/data distributions, and that our model addresses this issue. We note that some of the ideas presented in this paper have been described and reported in a number of journal articles: the use of block-cyclic decompositions for load balancing and scalability, redistribution of data between phases when a static data distribution is not found for the whole program, or the use of nonlinear programming techniques to optimize a MINLP problem. However, the main contribution of this paper is to present a new approach that puts the pieces together and that can be used by a parallelizing compiler to generate efficient parallel code from conventional sequential code without user intervention. . For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.
