Abstract: The Cray T3D and T3E are non-cache-coherent (NCC) computers with a NUMA structure. They have been shown to exhibit a very stable and scalable performance for a variety of application programs. Considerable evidence suggests that they are more stable and scalable than many other shared-memory multiprocessors. However, the principal drawback of these machines is a lack of programmability, caused by the absence of the global cache coherence that is necessary to provide a convenient shared view of memory in hardware. This forces the programmer to keep careful track of where each piece of data is stored, a complication that is unnecessary when a pure shared-memory view is presented to the user. We believe that a remedy for this problem is advanced compiler technology. In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would otherwise be necessary to achieve good performance with this type of machine. From our experiments, we learned that our compiler performs well for a variety of applications on the T3D and T3E and we found a few sophisticated techniques that could improve performance even more once they are fully implemented in the compiler.
INTRODUCTION
The goal of designers of multiprocessor systems has always been to achieve high performance and ease of programmability at the lowest possible cost. Unfortunately, high performance and ease of programmability are often opposing forces. In general, features of a machine that make it easy to program also limit its performance. An example of this principle can be seen in the performance implications of the programming paradigms used within parallel computers. There is wide consensus that shared-memory programming [11], [12], [39] is easier than message-passing programming [33], [34], yet the hardware implementation of shared-memory mechanisms in general seems to limit their speedup. Shared memory computers that are built with a bus connecting all processors to memory modules do not achieve performance that scales beyond a small number of processors. This is chiefly due to the serializing effect of the bus. High-end shared-memory machines are built with scalable hardware, but many of them, such as the SGI Origin 2000 and the Convex Exemplar, still employ a serializing mechanism to maintain their cache coherence protocol, although the serial bottleneck is much less than for bus-based machines. We call a shared-memory machine of this type a CC machine.
The simple fact is that shared-memory mechanisms constructed in hardware lack knowledge of the data access within a particular program and must therefore be implemented in a conservative manner to enable them to operate correctly, even in a worst-case scenario. This may produce inefficiencies for some programs. The extra hardware can also make the machines more expensive. To cope with the expense and lack of scalability, distributed memory computers, also called message passing machines [6] , [19] , became popular in the early 1990s. These machines consisted of a large number of processors connected in various configurations by communication paths. Each processor had its private memory which it could access by issuing memory addresses. Data that was stored in the memory of another processor was not accessible by issuing an address. However, the processors could communicate with messages. One processor would pack some data in a buffer and issue a Send primitive, while another processor would issue a Receive primitive and load the data into its own memory. This process is called two-sided communication.
More recently, it has become evident that more non-cache-coherent scalable shared-memory machines, such as the Cray T3D and T3E and the Fujitsu AP1000+ and AP3000, have been chosen for high-end multiprocessor architectures. These machines, which we call NCC machines, dedicate hardware for explicit one-sided communication such that a processor can issue Put and Get primitives to store and retrieve global data on a remote memory without [...] into one-sided shared-memory codes for execution on NCC machines and analyzing their performance. We identified the limits of these techniques and, to overcome them, proposed several new techniques, which can be divided largely into three groups: 1) precise data flow analysis for capturing array access patterns in the source code, 2) dependence analysis for generating parallel threads, and 3) locality analysis for efficient communication generation. This paper will describe a compiler framework using these techniques.
COMPARISON WITH PRIOR WORK
One of the main differences between our work and much of the prior work in the literature for distributed machines is our focus on automatic parallelization. Many projects deal with the problem of optimizing data parallel languages in which the user codes the distribution of the data and its parallel usage. Such projects include languages like Fortran D [20] , Vienna Fortran [9] , and HPF [18] . Several alternative data parallel languages have also been proposed. One example is the Fx project [47] at Carnegie Mellon University which has proposed a new dialect of HPF that enables the programmer to specify both data parallelism and task parallelism in order to handle a wider range of scientific problems than ordinary data parallel languages. In our work, we expect the user to write only a serial program. The compiler then detects the parallelism of the code and uses the shared-memory paradigm to find the data that is fetched.
Compiler studies based on these data parallel languages originally assumed no global address space for their target machines. Thus, their compilers have focused mainly on the conversion of data parallel programs, annotated with directives, into message passing codes with Send/Receive primitives for execution on their target machines. Consequently, most of their automatic parallelization studies have been conducted in the context of Send/Receive-based message passing models. By contrast, the focus of our work is on Put/Get-based shared-memory models. By focusing on Send/Receive, other compilers must either be able to produce symbolic expressions at compile time that represent the source and destination for each message or else they must produce code which builds these expressions at runtime. This code can be complex to build and can use a lot of time and/or memory at runtime. Our focus on Put/Get allows us to ignore this problem since the Put/Get mechanism will locate the data for us. In fact, the more distributed memory machines have supported Put/Get in hardware, the more attention compiler researchers have paid to code optimizations in the context of shared-memory models. Although there has been some work on optimizations for NCC machines, most of the work did not handle extensive compiler issues, only focusing on a couple of individual optimization issues, such as reducing synchronization overhead in Put/Get-based parallel codes [15] or building a new communication library that is to be used for an array language [8].
The Castle/Titanium project [12], [27] is somewhat similar to our work because their compiler handles Split-C, a Put/Get-based shared-memory language. Also, more recently, some researchers in the Fx project made efforts to generate Put/Get for NCC machines. However, our work differs from theirs since they take explicitly parallel code written in Split-C or an HPF dialect as the source instead of sequential code.
SUIF [16] and Parafrase-2 [44] are typical examples of conventional parallelizing compilers that mainly target bus-based shared-memory architectures. Although several researchers [1] extended SUIF to retarget distributed memory machines, our work also differs from theirs since their techniques were implemented only for Send/Receive-based message passing models, like those based on the data parallel languages.
This paper analyzes the results of an experiment with parallel code generated from sequential code. Much literature exists on similar studies done for diverse distributed memory machines, but, as discussed above, relatively little work has been done on automatic parallelization specifically in the context of Put/Get-based shared-memory models targeting NCC machines. The main contribution of our work lies in our use of accurate array access analysis (Section 4) to do advanced parallelization and communication optimization (Section 5). We quantify the results of an extensive experiment (Section 6) with real codes on two commercial NCC machines, which demonstrates how these techniques play a crucial role in speeding up our target codes.
COMPILING FRAMEWORK
In our framework, there are three major compilation steps:
• parallelism detection,
• iteration/data distribution selection, and
• code generation.
Parallelism Detection
We do interprocedural parallelization using Memory Classification Analysis with Linear Memory Access Descriptors, as described in [23] , [21] . The dependence test used is called the Access Region Test (ART). During the process of doing the interprocedural parallelization, the compiler builds a phase descriptor (PD) and an iteration descriptor (ID) for the references to each array in each loop nest. We will not dwell on the details of the parallelization in this work since it is not the focus of this paper. It is only mentioned since the information needed to determine the iteration and data distribution may be derived during the parallelization analysis.
Iteration/Data Distribution Selection
The inputs to this step are:
1. the parallel loops (at most one parallel loop for a phase because we only exploit unidimensional loop level parallelism) and
2. the ID of each array that is referenced in each phase.
The iteration/data distribution selection step is organized in two stages:
The Locality Analysis Stage. This stage uses the information provided by the IDs to build the Locality Communication Graph (LCG) of the code. The LCG is our compile-time representation of locality and communications patterns for a parallel code. The locality analysis is based on an algorithm that checks the conditions that parallel iterations and array regions accessed by those iterations must satisfy to ensure the locality of references of an array within a phase and between phases. If such conditions cannot be ensured, then communications are necessary. At the end of the locality analysis, we know when it is possible to avoid communications and when communications are required. An analysis of the conditions imposed by locality analysis can potentially find several possible solutions for iteration and data distributions for the code. We then need a procedure to select from these possibilities an optimal iteration and data distribution for the whole code. This is the goal of the formulation stage.
The Formulation Stage. Using the information provided by the LCG and the IDs, we formulate an integer programming problem: Find the iteration and data distribution that minimizes the communications and load imbalance while exploiting all the available locality detected in the locality analysis. The objective function of this optimization problem is derived from the costs due to load imbalance and communications. The communications costs expression models the relation between the cost of Put/Get operations and the choice of iteration/data distribution. On the other hand, the load imbalance costs expression models the relation between the differences of computational load among the processors and the choice of iteration distribution. The constraints on this optimization problem are basically locality constraints imposed by the locality analysis.
The solution of that optimization problem gives us an iteration distribution for the parallel loop in each phase and the data distribution for a set of consecutive phases, which we call a chain of phases, where a consistent static data distribution can be found. The chain of phases is detected by locality analysis.
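As an illustration of the formulation stage, the following Python sketch selects a block size for a CYCLIC(k) distribution by exhaustive search over a small integer domain. It is a toy stand-in for the integer programming problem, not the compiler's actual formulation: the access pattern (iteration i reads X(i) and X(i+1)), the cost weights, and the names `owner`, `cost`, and `best_block_size` are all assumptions made for the example.

```python
def owner(index, k, procs):
    """Processor that owns an index under a CYCLIC(k) distribution."""
    return (index // k) % procs


def cost(k, n_iters, procs, put_get_cost=1.0, imbalance_cost=5.0):
    """Toy objective: communication cost plus load-imbalance cost."""
    load = [0] * procs
    remote = 0
    for i in range(n_iters):
        p = owner(i, k, procs)
        load[p] += 1
        # Assumed access pattern: iteration i reads X(i) and X(i+1), and X
        # uses the same CYCLIC(k) layout as the iterations.
        remote += owner(i + 1, k, procs) != p
    return put_get_cost * remote + imbalance_cost * (max(load) - min(load))


def best_block_size(n_iters, procs):
    """Exhaustive search over k; a real solver would replace this loop."""
    return min(range(1, n_iters + 1), key=lambda k: cost(k, n_iters, procs))


print(best_block_size(16, 4))
```

With 16 iterations on 4 processors, small blocks incur many remote reads and large blocks leave processors idle, so the search settles on the balanced block size.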
The outputs of the second compilation step are:
1. the iteration distribution of the parallel loop of each phase,
2. the data distribution for each array (data distributions can change dynamically during the program execution), and
3. the points in the parallel code where communications are necessary and the associated communication code.
Code Generation
The inputs of the third compilation step (code generation) are the LCG and the outputs of the second compilation step, as noted above. The output of the third compilation step is the parallel code using the SPMD paradigm with calls to communications routines suitable for each communication pattern at the corresponding program execution point. We will not focus on this step in this paper, but rather will concentrate on step two: iteration/data distribution selection.
ARRAY ACCESS ANALYSIS
Existing compiler techniques for finding parallelism and generating communication primitives depend heavily on the accuracy of array access analysis, which identifies the array elements accessed within a certain section of code by a particular reference. Our prior observation on benchmarks [41] indicated that the accuracy of this analysis relies greatly on the representational power of the array region descriptor, which is used to summarize array accesses in the analysis. In general, the accuracy of a descriptor is sensitive to array subscripting patterns. Therefore, to find a practical and accurate descriptor for our array access analysis, we examined the actual subscripting patterns in a set of real codes. We divided the array accesses into six categories, depending on their subscripting patterns, and counted how many of the summaries of each category occur in the codes. Their percentages with respect to the total number of summaries are listed in Table 1.
The results in Table 1 confirm the common belief that a majority of array references in real programs consist of Simple Affine (SA)-type subscripts, which can be represented by the regular section descriptor (RSD) efficiently and accurately. However, also notice from our results that the techniques using the RSD, also called triplet notation, would be limited in several programs involving non-SA-types. In fact, several efforts [17] were already made to improve the accuracy of the RSD by handling Coupled Subscript (CS)-type subscripts and symbolic information. However, all these previous techniques based on the RSD were not accurate enough to handle the remaining types. Thus, as an alternative to the RSD, others have proposed a different type of descriptor based on convex regions, which we collectively call the convex region descriptor (CRD) [45] , [48] . The CRD can summarize the CS, Multiple Subscript (MS), and Triangular Affine (TA)-types, in addition to the SA-type. In this sense, it has more representational power than the RSD. However, it also suffers several critical drawbacks, such as 1) it cannot summarize Non-Affine (NA)-type accesses and 2) the linear inequalities generated during the summarization need linear system solvers, which require worst-case exponential-time algorithms.
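To make the limitation of triplet notation concrete, the following Python sketch expands an RSD into the set of elements it covers. It is an illustration only; the function name `rsd_elements` and the loop bounds are hypothetical. A Simple Affine access is captured exactly, whereas a Coupled Subscript access such as A(I,I) can only be over-approximated by per-dimension triplets.

```python
from itertools import product


def rsd_elements(triplets):
    """Expand an RSD given as one (lower, upper, stride) triplet per dimension."""
    ranges = [range(lo, hi + 1, st) for lo, hi, st in triplets]
    return set(product(*ranges))


# Simple Affine access A(I) for I = 1, 3, ..., 9: one triplet is exact.
sa = rsd_elements([(1, 9, 2)])

# Coupled Subscript access A(I,I) for I = 1..3 touches only the diagonal,
# but the tightest per-dimension triplets describe the whole 3 x 3 square.
diagonal = {(i, i) for i in range(1, 4)}
rsd_cover = rsd_elements([(1, 3, 1), (1, 3, 1)])

print(len(diagonal), len(rsd_cover))
```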
The limitations of existing array region descriptors motivated us to develop two new descriptors, called the linear memory access descriptor (LMAD) and the iteration descriptor (ID), which are indispensable ingredients of our compiler analysis work. This section will present their main concepts.
Linear Memory Access Descriptors
In the LMAD [41], accessing an array is viewed as traversing a linear memory space. For example, in Fig. 1, the two-dimensional array access is, in reality, the traversal in a linear memory space starting from the base address, the memory location for A(1, 4), all the way to
The diagram in Fig. 1 illustrates that memory traversals are driven by the loop indices I and J. In our work, we attempt to capture the pattern of a memory traversal driven by a single loop index with the notion of stride and span. The stride is the distance (measured in array elements) between accesses generated by consecutive values of the index. In this example, the stride for index I is 2 because the access moves across two elements of A on each iteration of I. Similarly, we can see that the stride for J is N. The span is the total element-length that the access traverses when the index iterates over its entire range. Again, in the example, the span for index I is $2 \cdot K$, which is the entire distance traversed between iterations 0 and K for a fixed value of J. Similarly, the span of J is calculated as $(M - 1) \cdot N$ for the iteration of J from 1 up to M.

Table 1 notes: Sources include the Perfect and SPEC benchmark suites, plus some codes obtained through the National Center for Supercomputing Applications. SS stands for array references containing Subscripted-Subscripts (e.g., A(B(I))); NA for those containing Non-Affine subscripts (e.g., $A(I^2)$); TA for those containing only affine subscript expressions within a Triangular loop; CS for those containing Coupled-Subscript affine expressions (e.g., A(I,I)); MS for those containing multiple loop indices that appear in a single array dimension; and SA for those containing Simple Affine subscripts (e.g., A(I)). Here, each category excludes those above it. For instance, a reference with MS-type subscripts within a triangular loop is counted as MS and not TA. In Section 6, the six codes in bold face in the table will be analyzed in more depth.
The LMAD is the collection of stride/span pairs for all loop indices involved in the array access plus the base offset, which has the general form

$$\mathcal{A}^{stride_{i_1},\, stride_{i_2},\, \cdots,\, stride_{i_d}}_{span_{i_1},\, span_{i_2},\, \cdots,\, span_{i_d}} + \tau,$$

assuming the access to array $\mathcal{A}$ is driven by d loop indices, $i_1, i_2, \cdots, i_d$. Here, $\tau$ is set to the offset of the first access from the beginning of the array. As can be seen in the above example, a single stride/span pair for a loop index (e.g., $\{2, 2 \cdot K\}$ for I and $\{N, (M - 1) \cdot N\}$ for J) characterizes the independent access patterns generated by the two loops. Thus, the LMAD represents the overall pattern of the access driven by all indices. The LMAD summarizing the overall access in the example would be $\mathcal{A}^{2,\,N}_{2K,\,(M-1)N} + 3N - 1$. Likewise, we can calculate the LMAD for the accesses of the ocean loop in Fig. 2a: $\mathcal{A}^{\cdots}_{\cdots} + 0$.
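The stride/span view can be checked by enumerating the offsets that an LMAD describes. The Python sketch below is illustrative only: the function name `lmad_offsets` and the concrete sizes (N = 10, K = 2, M = 3) are assumptions, and the base offset is simply taken as 3N - 1.

```python
from itertools import product


def lmad_offsets(strides, spans, base):
    """Enumerate the linear-memory offsets described by an LMAD.

    Each dimension contributes the multiples of its stride from 0 up to
    its span; the base offset is added to every combination.
    """
    steps = [range(0, span + 1, stride) for stride, span in zip(strides, spans)]
    return sorted(base + sum(combo) for combo in product(*steps))


# Hypothetical concrete sizes for the two-loop example in the text.
N, K, M = 10, 2, 3
offsets = lmad_offsets(strides=(2, N), spans=(2 * K, (M - 1) * N), base=3 * N - 1)
print(offsets)
```

Each stride/span pair yields span/stride + 1 accesses, so this LMAD describes 3 x 3 = 9 offsets.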
Several simplification and summarization operations, such as coalescing, aggregation, and interleaving, have been defined for LMADs. These operations offer ways to represent multiple LMADs with a single LMAD and ways to represent the same access pattern with fewer dimensions in a single LMAD. The last two examples in Fig. 2 show situations in which the LMADs representing the accesses may be combined and simplified. More information on these operations may be found in [40] and [41] .
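As a rough illustration of coalescing, the Python sketch below merges two adjacent dimensions when the outer stride continues exactly where the inner dimension ends (outer stride = inner span + inner stride). This contiguity rule is an assumption chosen for the example; the precise conditions are given in [40] and [41]. The names `coalesce` and `offsets` are hypothetical.

```python
from itertools import product


def offsets(strides, spans, base=0):
    """Set of offsets described by an LMAD (used here to check equivalence)."""
    steps = [range(0, sp + 1, st) for st, sp in zip(strides, spans)]
    return {base + sum(c) for c in product(*steps)}


def coalesce(strides, spans):
    """Merge dimensions j and j+1 when strides[j+1] == spans[j] + strides[j]."""
    strides, spans = list(strides), list(spans)
    j = 0
    while j + 1 < len(strides):
        if strides[j + 1] == spans[j] + strides[j]:
            spans[j] += spans[j + 1]          # merged span covers both dimensions
            del strides[j + 1], spans[j + 1]  # drop the absorbed dimension
        else:
            j += 1
    return tuple(strides), tuple(spans)


# Rows of 4 contiguous elements laid end to end collapse to one dimension.
print(coalesce((1, 4), (3, 8)))
```

The coalesced descriptor has fewer dimensions but denotes the same set of memory locations, which is what makes later summarization cheaper.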
Once we have obtained the LMADs for the references to array X in a loop nest, we will use them to describe the whole region of the array accessed in such a loop nest. This is done by the phase descriptor (PD). Let us assume that there happen to be m ($m \geq 1$) different references to array X within the kth nesting of the code. The general form of the PD of array X in the kth nesting is

$$\mathcal{P}^k(X) = (A, D, \tau).$$

The PD, $\mathcal{P}^k(X)$, represents all the elements of the array X accessed by the kth loop nest. Let m be the number of reference sites to X in the loop nest and n be the number of loops in the nest. A and D are matrices of dimension $m \times n$, whereas $\tau$ is a column vector of dimension m. Each row of matrices A and D contains access information for one LMAD and, therefore, for one reference site for X. D represents the values of the strides: The coefficient $\delta_{ij}$ of matrix D is the stride computed for the jth index of the loop nest ($1 \leq j \leq n$) in the ith LMAD ($1 \leq i \leq m$). A represents the spans divided by strides plus 1: The coefficient $\alpha_{ij}$ of matrix A is computed as the jth span of the ith LMAD divided by the corresponding stride $\delta_{ij}$ plus 1 (i.e., the number of elements accessed by the ith LMAD in the jth dimension of the nesting). The vector $\tau$ contains the offset for each occurrence, i.e., $\tau_i$ is the ith offset corresponding to the ith LMAD.
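The construction of the PD from a list of LMADs can be sketched in a few lines of Python. This is an illustration under assumptions: the tuple layout `(strides, spans, offset)`, the function name `phase_descriptor`, and the two sample reference sites are hypothetical and do not come from Fig. 3.

```python
def phase_descriptor(lmads):
    """Build (A, D, tau) from LMADs given as (strides, spans, offset) tuples.

    Row i describes reference site i; column j describes loop index j.
    """
    D = [list(strides) for strides, _, _ in lmads]
    A = [[span // stride + 1 for stride, span in zip(strides, spans)]
         for strides, spans, _ in lmads]
    tau = [offset for _, _, offset in lmads]
    return A, D, tau


# Two hypothetical reference sites in a doubly nested loop (m = 2, n = 2).
A, D, tau = phase_descriptor([((1, 8), (3, 24), 0), ((1, 8), (3, 24), 4)])
print(A, D, tau)
```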
For the code example in Fig. 2c (which is the third loop nest in the tfft2 code), the PD for the array X is shown in Fig. 3 . In Fig. 3a , we show the initial form of the PD before the operations of coalescing and aggregation (that we mentioned above); Fig. 3b represents the PD after coalescing and Fig. 3c the resultant PD after aggregation.
Iteration Descriptors
The PD form is useful for summarizing the whole region of an array accessed in a loop nest, but we also needed to develop the ID [35], [38], an extension of the PD, to more conveniently pinpoint the subregions of an array accessed by each iteration of the parallel loop in the nesting. We will show how the ID is useful for optimizing communication in Section 5. In Fig. 2c, for instance, to represent the set of subregions of array X accessed by one iteration, say i, of the parallel I-loop in the kth nesting of the program, we generate an ID of the general form: [...]

Fig. 4 also shows that the shaded memory positions of X represent the subregions described by each $\mathcal{I}^3(X, i)$. The ID supplies the information that each subregion contains four elements separated by a stride of 1 and that the first array position for each subregion can be computed as $8 \cdot i$. IDs may be simplified by merging two or more of their $\mathcal{I}^k_j(X, i)$ elements under certain conditions, called the storage symmetry conditions. The last component, $\Delta$, is the storage distance vector, originally set to null, that we use to flag these conditions, which are as follows:

Shifted Storage: Two array subregions have the same, but shifted, access pattern. Then, $\Delta$ is set to the shifted storage distance.

Reverse Storage: Two array subregions are accessed with a reverse access pattern, which means that one access function is increasing and the other is decreasing with respect to the parallel loop index. Then, $\Delta$ is set to the reverse storage distance ($\Delta = \Delta_r$).

Overlapping Storage: Two subregions are partially overlapped. $\Delta$ is the overlapping distance ($\Delta = \Delta_s$). These overlapped elements define what we call shadow areas.

[...] figures represent two subregions that are always accessed on parallel iterations $i = 0$, 1, and 2. We can see that each ID consists of two LMAD subregions. Notice that storage distances help us to simplify the notation of IDs. The three kinds of storage symmetry we have just defined are not exclusive and can appear in the same ID.
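The per-iteration view offered by the ID can be illustrated with a small Python sketch. It hard-codes the example values mentioned in the text (four elements, stride 1, first position 8·i) as defaults; the function names `id_subregion` and `shadow_area` are hypothetical, and the overlap check is a simplified stand-in for the overlapping storage condition.

```python
def id_subregion(i, first=lambda i: 8 * i, count=4, stride=1):
    """Offsets of X touched by parallel iteration i (one subregion of the ID)."""
    start = first(i)
    return [start + k * stride for k in range(count)]


def shadow_area(region_a, region_b):
    """Elements shared by two subregions (the overlapping storage case)."""
    return sorted(set(region_a) & set(region_b))


print(id_subregion(0), id_subregion(1), id_subregion(2))
# Consecutive iterations of this example do not overlap, so no shadow area.
print(shadow_area(id_subregion(0), id_subregion(1)))
```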
The compile-time descriptors, LMAD and ID, presented in this section are the key elements of the analysis technique that we present in the next section: locality analysis. The locality analysis is a new technique based on the ID and it is applied after parallelism detection. This technique captures the memory locality exhibited by a program and helps the compiler identify and handle different communication patterns. The communication patterns and locality analysis are presented in Section 5. We will see that the information collected in locality analysis enables us to generate the communication primitives suitable for each communication pattern, as well as to find the optimal iteration and data distributions for the code.
COMMUNICATION ANALYSIS
During program execution on NCC machines, communication is required when necessary data is located in remote memory. In the parallel code, the communication costs depend on parallel loop distribution and data allocation in memory. Therefore, it is crucial for the compiler to find the iteration and data distributions that minimize communication. Our first approach for this issue is the copy-in/copy-out strategy, which is described in Section 5.1. We reduce the copy-in/copy-out costs by finding the suitable iteration/data distribution, as we discuss in Section 5.3. The essential information that the compiler can use to address these communication issues is extracted by locality analysis, which will be discussed in Section 5.2.
Copy-In/Copy-Out
After parallel loops are detected, the program is divided into phases, each being a loop nest (see footnote 2) with at most one parallel loop. Data objects in each phase are either private or shared. Some objects are privatized by the parallelization technique and all remaining objects are, by default, declared as shared and distributed across the machine. In Section 5.3, we will discuss how we distribute shared objects. Processors communicate to access these distributed shared data. Particularly in NCC machines, shared data are not cached; hence, they should be fetched directly from memory each time they are referenced, which significantly increases the average memory latency and network contention. The basic concept of our strategy for tackling this communication issue, called copy-in/copy-out [43], is that all shared accesses within a phase are replaced with private accesses by 1) copying all elements of the shared data used within the phase into private memory at phase entry and by 2) copying out to the original shared memory at phase exit the updated results from private memory.
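The copy-in/copy-out idea can be simulated sequentially in Python. The sketch below is only an illustration: a list stands in for the shared array, slice copies stand in for bulk Get and Put operations, and the name `run_phase` and the block boundaries are hypothetical.

```python
def run_phase(shared, lo, hi, body):
    """Execute one processor's share of a phase using copy-in/copy-out."""
    private = shared[lo:hi]                # copy-in: one bulk Get at phase entry
    for k in range(len(private)):
        private[k] = body(private[k])      # all accesses in the phase are private
    shared[lo:hi] = private                # copy-out: one bulk Put at phase exit


shared_x = list(range(8))
for lo, hi in [(0, 4), (4, 8)]:            # two simulated processors, block distribution
    run_phase(shared_x, lo, hi, lambda v: 2 * v)
print(shared_x)
```

The point of the pattern is that the loop body touches remote memory zero times; all communication is concentrated in two bulk transfers per phase.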
Our copy-in/copy-out strategy is influenced by the communication patterns within a phase or between consecutive phases in the program control flow. We classify these patterns into three types: local, frontier, and global communications. This section discusses how to generate Put/Get to implement the copying strategy for a phase when we find the phase involves one of these communication patterns.
Local Communications
We say a phase has local communications when it has arrays that can be distributed such that all their references can be local to each processor, thereby minimizing communication caused by the arrays in that phase. Fig. 6 shows a typical example of local communications in a phase. Suppose the J-loop is parallel and X is declared as shared and block-distributed. Then, to match the block data distribution, the compiler would perform array access analysis and decide to stripmine the loop with a block schedule, thereby allowing all the work to be done on private array $X'$ with no communication.
Once private references are substituted for shared ones in the loop, Put/Get statements are generated to copy data between the private and shared arrays. To generate efficient Put/Get, we use our LMAD-based array access analysis that accurately summarizes the array regions to be copied at loop boundaries. That is, the array regions accessed in the parallel loop after stripmining are summarized by using the LMAD, as described in Section 4. For instance, in Fig. 6, the read regions for two references to X are summarized and simplified to generate the array region of X in the GET statement, as in Fig. 7. The region of $X'$ to receive data is computed by simply shifting the lower limit of the X region to 1. Then, initialize $X'$ is implemented with a GET statement to move a single data block:
PAEK ET AL.: AN ADVANCED COMPILER FRAMEWORK FOR NON-CACHE-COHERENT MULTIPROCESSORS 7
Fig. 6. Source code, its target code in SPMD form, and the illustration of a data distribution for the array for processors $p_k$. White boxes represent the shared array sections accessed by processors in the loop and gray boxes represent their private counterparts that processors use for their calculations.
Footnote 2: Such a loop nest does not have to be perfectly nested.
For update X, we used the LMAD representing the write access to X: $\mathcal{X}^{1,\,M}_{M-2,\,(b-1)M} + J'M + 1$, which has two stride/span pairs. The region cannot be transferred with a single PUT invocation because Put/Get operations are implemented in hardware to transfer, one at a time, blocks of data with constant strides. Thus, one of the pairs is converted to an enclosing loop in which a vector of $M - 1$ elements is transferred with stride 1 on every iteration, which results in:
Notice here that the number of iterations is computed from the second pair and the size and stride of a data block are computed from both pairs. These examples show how, given accesses to a shared array, the LMAD is used to implement the copy-in/copy-out strategy using Put/Get.
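The conversion of a two-pair LMAD into a loop of constant-stride transfers can be sketched as follows. This Python fragment is an illustration under assumptions, not the generated target code: it treats the first pair as the block moved by each PUT and the second pair as the enclosing loop, and the name `put_schedule` and the concrete sizes are hypothetical.

```python
def put_schedule(strides, spans, base):
    """Turn a two-pair LMAD into (start, count, stride) block transfers.

    The first pair becomes the constant-stride block of each transfer;
    the second pair becomes the enclosing loop, one transfer per iteration.
    """
    (s_in, s_out), (sp_in, sp_out) = strides, spans
    count = sp_in // s_in + 1          # elements moved by each transfer
    iterations = sp_out // s_out + 1   # trip count of the enclosing loop
    return [(base + it * s_out, count, s_in) for it in range(iterations)]


# Hypothetical sizes: M = 5 elements per column, 3 columns owned locally.
M, cols = 5, 3
schedule = put_schedule(strides=(1, M), spans=(M - 2, (cols - 1) * M), base=1)
print(schedule)
```

Each tuple corresponds to one PUT of M - 1 contiguous elements, matching the vector length stated in the text.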
Frontier Communications
In real programs, a loop often contains an array whose references are only of the form $a \cdot i \pm c$, where i is the loop index and a and c can be two arbitrary constants, as shown in Fig. 8. In the example, array X is accessed only with three references of the forms: X(I-C1), X(I), and X(I+C2). Array references of this form result in a communication pattern that we call frontier communications.
Suppose the two inner loops in Fig. 8 are both parallel, whereas the J-loop is serial. If the inner loops are block-scheduled, neighboring processors have overlapped access regions (commonly known as shadow areas [18]); thus, they need to communicate to keep their neighbors updated on the results of every iteration of J. The standard data distribution type that handles frontier communications is the shadow distribution. In the shadow distribution, an array is first block-distributed so as to allocate a private array to each processor. Extra consecutive private space, which we call the out-frontier region (OFR), is allocated for the elements that are shared by neighboring processors. The in-frontier region (IFR) refers to the elements within the local block that are to be moved to neighboring processors after they are updated.
To generate Put/Get, we first summarize access regions the same way we did in Section 5.1.1. In initialization, the initial values of all portions of the shared array corresponding to a private array are transferred by GET statements. In this example, we need a single Get for initializing $Y'$:
The update shadow operation handles the intermediate results generated during the loop execution. These results are stored in the frontier areas. In Fig. 8 , before starting the next iteration of the J-loop, the change in the IFR of a processor during the current iteration is copied to the OFR of its neighbors in two steps:
1. Write all IFRs back to the corresponding shadow areas simultaneously.
2. Update all OFRs with the new results in shadow areas.
These steps are separated by a barrier to ensure that all writes to shadow areas will complete before any processor tries to read the region. For update shadow, we generate the following sequence of statements:
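The two-step exchange can be simulated sequentially. The Python sketch below is an illustration, not the Put/Get sequence produced by the compiler: it assumes one-element frontiers, represents each private block as a list with one OFR slot at each end, and uses the hypothetical name `update_shadow`.

```python
def update_shadow(blocks):
    """Exchange one-element frontiers between neighboring private blocks.

    Each block is laid out as [left OFR, local elements..., right OFR].
    """
    p = len(blocks)
    # Step 1: every processor Puts its IFR (first and last local element)
    # into the shared shadow area.
    shadow = [(blk[1], blk[-2]) for blk in blocks]
    # A barrier belongs here: all Puts must complete before any Get starts.
    # Step 2: every processor Gets its neighbors' values into its OFR.
    for k, blk in enumerate(blocks):
        if k > 0:
            blk[0] = shadow[k - 1][1]
        if k < p - 1:
            blk[-1] = shadow[k + 1][0]
    return blocks


blocks = update_shadow([[None, 1, 2, None], [None, 3, 4, None], [None, 5, 6, None]])
print(blocks)
```

Building the whole `shadow` list before any OFR is written plays the role of the barrier: no processor can observe a neighbor's stale frontier.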
A similar procedure is followed when changes take place in the OFR during the loop execution. When
Global Communications
We say phases contain global communications if they have shared data that needs to be accessed by all processors and, therefore, requires data transfer under the intervention of the processors. Global communications are usually found in programs involving irregular or dynamic data access patterns where the majority of data tend to be frequently transferred between processors. The typical data transfer operations for global communications are broadcast, reduction, and gather/scatter. In Section 5.2, we will show how our compiler deals with global communications. More details can be found in [37] , where we show how to automatically generate the redistribution routine for arrays used in a phase with global communications patterns and how to perform message aggregation.
Locality Analysis
For our locality analysis, we use a compile-time representation that captures the memory locality exhibited by a program and communication patterns in the form of a graph, called the locality communication graph (LCG) [35] . In particular, as will be discussed later, we use the LCG to identify and handle the communication patterns classified in Section 5.1.
The Locality Communication Graph
The LCG is a set of directed, connected graphs, each of which represents an array in the program. A node in each graph corresponds to a phase accessing the array the graph represents. These phases do not have to include outermost loops. The nodes are connected according to the program control flow. The graphs are not necessarily trees because not all loops are part of a phase; thus, the LCG may have cycles.
The LCG is constructed by the Locality Analysis Algorithm whose goal is to identify when it is possible to distribute iterations and data so that all accesses to an array are local. Obviously, it is not always possible to find a static iteration/data distribution such that the locality of references to an array is ensured. In such cases, Put/Get is generated to access the remote memory where the required data resides. Sections 5.2.2 and 5.2.3 will discuss the conditions for the Locality Analysis Algorithm that must hold to ensure the locality of references of an array within a phase and between phases.
In our approach, the data distribution may differ between phases. Our algorithm can identify sets of consecutive phases that cover the same data region of an array for a number of parallel iterations scheduled on each phase. From this set of phases, we can select a static data distribution for the array such that all accesses to this array are going to be local. The Locality Analysis Algorithm here assumes:
• The total number of processors, H, to be involved in the program execution is known at compile time,
• The iterations of each parallel loop are going to be statically distributed between the H processors involved in the execution of the code following a CYCLIC(k) (block-cyclic) pattern, and
• Each array will be dynamically distributed among the local memories.

This algorithm starts by assigning an attribute to each node of the graph for array X in the LCG, identifying the type of memory access for that array in the corresponding phase as follows:
1. When the memory access for X in a phase is write only, label the associated node with the attribute W;
2. When the memory access is read only, the attribute is R;
3. For read and write accesses, the attribute is R/W; and
4. When X is privatizable in a phase, label the corresponding node with attribute P.

In each graph, the edges connecting nodes are also labeled with two attributes:

1. L: It is possible to exploit memory access locality between the connected nodes and
2. C: It is not possible to assure memory access locality.

The attribute C stands for "communication" because the lack of memory access locality implies remote accesses or, in other words, the necessity of communication between processors. In these cases, Put/Get will be placed just after the execution of the source connected phase and before the execution of the drain connected phase. The locality attributes, L and C, will be determined at the end of the locality analysis. Fig. 9 shows the LCG for a fragment of tfft2 in Table 1.
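The node-labeling step can be illustrated with a minimal Python sketch. It is not the Polaris implementation: the function name `label_node`, its boolean inputs, and the choice to let privatizability take precedence over the read/write attributes are assumptions made only for illustration.

```python
def label_node(reads, writes, privatizable):
    """Return the LCG node attribute for one array in one phase.

    Assumption: a privatizable array is labeled P regardless of how it is
    read or written inside the phase.
    """
    if privatizable:
        return "P"
    if reads and writes:
        return "R/W"
    if writes:
        return "W"
    if reads:
        return "R"
    raise ValueError("the phase does not access the array")


# Hypothetical phases of one LCG column: (reads, writes, privatizable).
column = {"F1": (True, False, False),
          "F2": (True, True, False),
          "F3": (True, True, True)}
labels = {phase: label_node(*access) for phase, access in column.items()}
```

In this sketch `labels` maps each hypothetical phase name to its attribute; the edge attributes L and C are not assigned here because, as stated above, they are only known at the end of the locality analysis.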
For precise locality analysis, it is essential to have an accurate form used to summarize array regions accessed in each node of the LCG. For this, we use the ID that is generated from the LMADs for a loop. As shown in Section 4, the LMAD can accurately represent array accesses with affine or nonaffine expressions. Thus, unlike most other techniques [2], [10], [14], [24], [26], our technique works whether subscripts and loop limit expressions are affine or nonaffine. Also, our technique is interprocedural since the LMAD can represent array access across procedure boundaries efficiently, as described in [22].
Intraphase Locality
Let $\mathcal{I}^k(X, i)$ be the ID for array X in the parallel iteration i of the kth nesting (or phase $F_k$). Let us assume that iteration i is scheduled in the processor $PE_j$ for $0 \le j \le H - 1$. We can state that all accesses to X in iteration i are local to $PE_j$ if the subregion described by $\mathcal{I}^k(X, i)$ is allocated to the local memory of $PE_j$. This is an intuitive idea and it is what we call the intraphase locality condition. In fact, this is a sufficient condition to ensure that $PE_j$ requires no communication for X in iteration i of phase $F_k$, as indicated in Section 5.1.1. However, in the verification of the condition, three different situations may arise:
1. The array is privatizable.
2. The array is nonprivatizable and there is no overlapping storage for the array in the phase (i.e., $\nexists\, \Delta_s$).
3. The array is nonprivatizable, there is overlapping storage for the array in the phase (i.e., $\exists\, \Delta_s$), and accesses to the array in that phase are only reads.

Fig. 10a illustrates Situation 1 for array Y in the phase $F_3$ of tfft2. The verification of the intraphase locality condition implies that the local memory of each processor contains a copy of the privatizable subregion of Y. This subregion is represented by $\mathcal{I}^3(Y, 0)$, $\mathcal{I}^3(Y, 1)$, $\mathcal{I}^3(Y, 2)$, etc. The example shows that the subregion of Y is identical for each parallel iteration i. So, array replication of the subregion on the local memory of each processor can guarantee that all accesses to Y in the phase are local.

Fig. 10b shows Situation 2 for Y in the phase $F_2$ of tfft2. In this case, the verification of the condition supposes the distribution of the subregions represented by $\mathcal{I}^2(Y, i)$ to the local memory of the processor that executes the parallel iteration i. In this way, all the array references in this phase are satisfied in the local memory of processor $PE_j$, as we see in Fig. 10b.

Fig. 10c shows Situation 3 for array X in the phase $F_1$. Again, the verification of the condition supposes the distribution of the subregions represented by $\mathcal{I}^1(X, i)$ in the local memory of the processor for iteration i. But now, $\mathcal{I}^k(X, i)$ contains some array elements (the shadow areas) that are replicated in other processor memories. In this case, if accesses are reads only, it is not necessary to update the replicated shadow area; thus, no communication is required during the execution in the phase.
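The three situations can be told apart with a small Python sketch. It makes a strong simplifying assumption: the ID of each parallel iteration is modeled as an explicit set of array offsets rather than as LMADs, and the names `has_overlap` and `intraphase_situation` are hypothetical.

```python
from itertools import combinations


def has_overlap(ids):
    """True if the regions of two different iterations share an element."""
    return any(a & b for a, b in combinations(ids, 2))


def intraphase_situation(ids, privatizable, read_only):
    """Return 1, 2, or 3 for the situations listed above, else None.

    ids[i] is the set of array offsets touched by parallel iteration i.
    None means the array has overlapping storage and is written, so the
    phase cannot be made communication-free by distribution alone.
    """
    if privatizable:
        return 1          # replicate the subregion on every processor
    if not has_overlap(ids):
        return 2          # give each processor the subregion it touches
    if read_only:
        return 3          # replicate the shadow areas, no update needed
    return None


# Two iterations that share offset 4, as with a shadow area.
shadowed = [{0, 1, 2, 3, 4}, {4, 5, 6, 7, 8}]
disjoint = [{0, 1, 2, 3}, {4, 5, 6, 7}]
```

With these inputs, `disjoint` falls under Situation 2, while `shadowed` falls under Situation 3 only when the phase just reads the array.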
Interphase Locality
Once we have established the intraphase locality condition, we extend the locality analysis to determine when two phases $F_k$ and $F_g$ (k < g) access the same local region of array X, so as to avoid communications between the execution of these phases. For this, we define the conditions that iteration and data distributions for the two phases must fulfill to assure that all accesses to arrays in the phases are local.
Fig. 10. In these examples, we assume that parallel iteration i = 0 has been scheduled in processor $PE_0$, parallel iteration i = 1 in processor $PE_1$, parallel iteration i = 2 in processor $PE_2$, and parallel iteration i = 3 in processor $PE_3$, and we illustrate the three situations of verification of the intraphase locality condition: (a) Y is privatizable, (b) Y is nonprivatizable and nonoverlapping storage, and (c) X is nonprivatizable, overlapping storage, and accesses are only-read. This situation arises in tomcatv in Table 1.

To begin, we define two concepts that help us to relate a chunk of parallel iterations to the region accessed by these iterations: the upper limit and the memory gap. As shown in Fig. 11, the upper limit, $\mathcal{LI}^k(X, i)$, of X for a parallel iteration i represents the highest memory position of the subregion represented by the ID of X in iteration i of the parallel loop of phase $F_k$. Similarly, $\mathcal{LI}^k(X, i, p)$ represents the highest memory position of all the subregions of X for a chunk of p parallel iterations, starting from i, of the loop. The memory gap, $h^k$, of X in $F_k$ is defined to be the distance between the highest memory position of the subregion associated with the ID of iteration i and the lowest memory position of the next subregion associated with the ID of iteration i + 1. If there are phases in the program where the sequential loops do not access all the memory positions between two consecutive parallel iterations, $h^k$ is not zero. Regular regions, $\mathcal{R}_i$, of an array for a parallel iteration i of each phase are built in such a way that each subregion fully covers the elements of the array represented by the corresponding ID. As illustrated in Fig. 11, $\mathcal{R}_i$ can be characterized by the upper limit $\mathcal{LI}^k(X, i)$ and the memory gap $h^k$. Similarly, the aggregated regular region, $\bigcup_{j=i}^{i+p-1} \mathcal{R}_j$, can be characterized by $\mathcal{LI}^k(X, i, p)$ and $h^k$. This information is used to formulate the balanced locality condition for array X in two phases $F_k$ and $F_g$ as:
where $u_{k1}$ and $u_{g1}$ represent the upper bounds of the parallel loops in phases $F_k$ and $F_g$. By solving the system of (2), (3), and (4), we get the unknowns $p_k$ and $p_g$, which give us the possible sizes of the chunk in the block-cyclic or CYCLIC($p_k$/$p_g$) iteration distributions for both phases. With these possible iteration distributions, we determine a feasible data distribution that ensures that all accesses to array X are local in phases $F_k$ and $F_g$. The data distribution procedure has the responsibility to allocate in the local memory of each processor the common regular array region $X(\bigcup_{j=i}^{i+p_k-1} \mathcal{R}_j)$ accessed by each chunk. From (2), we determine the size of a chunk of parallel iterations
which needs to be scheduled in each phase in order to ensure that the common regular regions accessed in $F_k$ and $F_g$ are identical.
It is guaranteed that any access to this common region would be local if this identical region and these iterations are scheduled to the same processor $PE_j$, following a CYCLIC($p_k$) and CYCLIC($p_g$) (block-cyclic) scheduling of the parallel iterations for each phase. Load balancing is guaranteed by (3) and (4), which limit the maximum number of parallel iterations that can be scheduled in a processor for phases $F_k$ and $F_g$.
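The block-cyclic scheduling mentioned above can be sketched in a few lines of Python. The mapping used here is the conventional CYCLIC(p) rule (chunk c of p consecutive iterations goes to processor c mod H); the function names are hypothetical and the sketch assumes iterations are numbered from 0.

```python
def cyclic_owner(i, p, H):
    """Processor that executes iteration i under CYCLIC(p) on H processors."""
    return (i // p) % H


def iterations_of(pe, n_iters, p, H):
    """All iterations of a loop with n_iters iterations assigned to pe."""
    return [i for i in range(n_iters) if cyclic_owner(i, p, H) == pe]


# 8 iterations on 2 processors with chunks of 2 iterations.
pe0 = iterations_of(0, 8, 2, 2)   # [0, 1, 4, 5]
pe1 = iterations_of(1, 8, 2, 2)   # [2, 3, 6, 7]
```

Under this rule, p = 1 gives a purely cyclic schedule, and p equal to the number of iterations divided by H (rounded up) gives a block schedule.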
1. Array X is nonprivatizable in phases $F_k$ and $F_g$ and the balanced locality condition holds, meaning that solutions exist for the system of (2), (3), and (4). In this case, for each solution (that is, a value for $p_k$ and $p_g$), we can find an iteration distribution CYCLIC($p_k$/$p_g$) on each phase and we can build a data distribution for both phases such that all accesses to X are to local memory. We just need to schedule the chunks on each phase following a block-cyclic pattern and to allocate the common regular regions covered by each chunk in the corresponding local memory.
2. Array X is nonprivatizable in phases $F_k$ and $F_g$ and the balanced locality condition does not hold, meaning that solutions do not exist for the system of (2), (3), and (4). In this case, the only way to find the iteration distribution CYCLIC($p_k$/$p_g$) and to build identical regular regions covered by each chunk of parallel iterations requires executing all iterations of $F_k$ and $F_g$ in a processor $PE_j$ and allocating the whole array to the local memory of that processor. This implies that, to avoid communication, $F_k$ and $F_g$ should be executed sequentially. This is equivalent to saying that no iteration/data distribution exists such that all accesses to X are to local memory, thereby requiring Put/Get operations for some accesses to X.
3. Array X is privatizable in phase $F_k$ or $F_g$. By definition, the value of X in the phase where it is privatizable does not depend on the value of X in the other phase. Then, we say that phases $F_k$ and $F_g$ are uncoupled for the array. When an array is privatizable in a phase, we do not need to apply the balanced locality condition. We only need to assure the intraphase locality condition in both phases to know all accesses are local. The reason is that the values of X are defined independently on each processor for each phase, so Put/Get primitives are not necessary.
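To make Situations 1 and 2 concrete, the following Python sketch searches for chunk sizes by brute force. It is a stand-in for solving the balanced locality system, not a transcription of it: regions are modeled as explicit sets of array offsets, the load-balance bound is assumed to be "at most ceil(N/H) iterations per chunk", and all names are hypothetical.

```python
from math import ceil


def chunk_regions(ids, p):
    """Union of the per-iteration regions for each chunk of p iterations."""
    return [frozenset().union(*ids[s:s + p]) for s in range(0, len(ids), p)]


def balanced_chunk_sizes(ids_k, ids_g, H):
    """Pairs (p_k, p_g) whose chunks cover identical regions in both phases.

    An empty result corresponds to Situation 2: no block-cyclic schedule
    makes every access local, so Put/Get would be needed.
    """
    solutions = []
    for p_k in range(1, ceil(len(ids_k) / H) + 1):
        regions_k = chunk_regions(ids_k, p_k)
        for p_g in range(1, ceil(len(ids_g) / H) + 1):
            if regions_k == chunk_regions(ids_g, p_g):
                solutions.append((p_k, p_g))
    return solutions


# Phase k: 8 iterations touching 2 elements each; phase g: 4 iterations
# touching 4 elements each; both cover offsets 0..15.
ids_k = [{2 * i, 2 * i + 1} for i in range(8)]
ids_g = [set(range(4 * i, 4 * i + 4)) for i in range(4)]
pairs = balanced_chunk_sizes(ids_k, ids_g, H=2)   # [(2, 1), (4, 2)]
```

Because chunk c of either phase is placed on processor c mod H under a block-cyclic schedule, identical chunk regions in the same order imply that the same processor owns the data in both phases.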
Chains
The edges in the LCG are labeled by the Locality Analysis Algorithm according to the three situations classified in Section 5.2.3. The edges will be labeled with L when Situation 1 arises (i.e., it is possible to avoid communications by exploiting locality). We assign the attribute D to the edges in Situation 3 of the interphase locality analysis (e.g., the dashed edges in Fig. 9 ) that bind two uncoupled phases. Later we remove these edges. The edges will be labeled with C when communication is required between the execution of the two connected phases. That happens in two cases:
1. When Situation 2 of the interphase locality analysis arises. In this case, an array redistribution between the two connected phases must take place because each phase requires a different static data distribution and we say that global communications occur in this part of the code.
2. When the array is nonprivatizable in phases $F_k$ and $F_g$ (the C-connected phases), overlapping storage exists for the array in a previous phase (i.e., $\exists\, \Delta_s$) and accesses to the array in $F_k$ are writes. In this case, an updating of the replicated shadow areas is necessary and we say that frontier communications occur.

Notice that global or frontier communications occur between the phases that are connected by a C edge that represents a data redistribution or an updating of shadow areas when program control crosses from one phase to the other. On the other hand, the set of phases that are connected consecutively by L edges covers common data regions of an array; in other words, we can find CYCLIC($p_k$) iteration distributions for each of the connected phases and we can build a static data distribution for array X in those phases such that all accesses to X are local. We call this set of phases a chain of phases. There can be more than one chain for an array and each column of the LCG has at least one chain. For instance, three chains of X separated by the two C edges are shown in Fig. 9.
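Chain extraction can be sketched as follows in Python, under two simplifying assumptions that are not part of the analysis itself: the column of the LCG is a simple path of phases (no cycles), and only L and C edge labels are present (D edges are left out of the sketch). The function name and phase names are hypothetical.

```python
def chains(phases, edge_labels):
    """Split a path of phases into chains of L-connected phases.

    edge_labels[i] is the label of the edge between phases[i] and
    phases[i + 1]; any label other than "L" closes the current chain.
    """
    if not phases:
        return []
    result = [[phases[0]]]
    for phase, label in zip(phases[1:], edge_labels):
        if label == "L":
            result[-1].append(phase)   # same chain: locality is exploited
        else:
            result.append([phase])     # C edge: communication, new chain
    return result


column = ["F1", "F2", "F3", "F4", "F5"]
labels = ["L", "C", "L", "C"]
found = chains(column, labels)   # [["F1", "F2"], ["F3", "F4"], ["F5"]]
```

In this model two C edges yield three chains, and the number of points where communication is placed equals the number of chains minus one.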
The locality analysis tells us that phases connected with L-edges verify the balanced locality condition. However, there can be several solutions to that system of equations. In other words, to avoid communications, we can find several block-cyclic iteration distributions for each phase of the chain and build a static data distribution of the chain for each combination of iteration distributions. We need a procedure to select the suitable iteration/data distributions for each chain of phases and for the whole code. This is the goal of the next compilation step: the selection of the optimal iteration/data distributions for the whole code.
Iteration/Data Distribution Using the LCG
Using the LCG generated from the locality analysis, we formulate an optimization problem to find the optimal iteration/data distributions. The objective function of our problem is the parallel overhead due to communication costs plus load imbalance costs. The communication costs represent the time consumed in global or frontier communications that take place between the execution of two phases with a C edge in the LCG. These costs depend on the communication pattern, the startup time, and the bandwidth of the primitives (Put/Get in our model) used to implement the communications. The load imbalance costs are the difference between the time consumed by the most loaded processor and the least loaded one. Clearly, they depend on whether the phase is rectangular or triangular. Equation (5) presents the objective function that models the parallel overhead of the optimization problem formulated from tfft2 shown in Fig. 9:
where $A_j$ represents one of the two arrays, X and Y, of the program and index k traverses all the phases where the corresponding array is accessed. The components $C^{kg}(X, p_k)$ and $D^k(X, p_k)$, respectively, represent the functions of the communication costs and the load imbalance costs. The communication cost function models the dependence between cost of Put/Get operations and the choice of iteration/data distribution. On the other hand, the load imbalance cost function models the dependence between the differences of computational load among the processors and the choice of iteration distribution. We explain how to derive those cost functions in [37]. We must note that our optimization problem is an integer programming problem where the variables of the problem are the integer values of $p_k$.
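The load imbalance term can be illustrated with a toy Python model. It is only a sketch under stated assumptions: each parallel iteration has a known integer cost, a triangular loop is modeled by giving iteration i a cost of i + 1, and the communication term of the objective function is not modeled at all. The function names are hypothetical and the cost functions derived in [37] are not reproduced.

```python
def loads(costs, p, H):
    """Total cost per processor under a CYCLIC(p) schedule on H processors."""
    per_pe = [0] * H
    for i, cost in enumerate(costs):
        per_pe[(i // p) % H] += cost
    return per_pe


def imbalance(costs, p, H):
    """Most loaded processor minus least loaded processor."""
    per_pe = loads(costs, p, H)
    return max(per_pe) - min(per_pe)


rectangular = [1] * 16                      # every iteration costs the same
triangular = [i + 1 for i in range(16)]     # cost grows with the index

block = imbalance(triangular, 4, 4)         # 48
cyclic = imbalance(triangular, 1, 4)        # 12
best_p = min(range(1, 5), key=lambda p: imbalance(triangular, p, 4))
```

In this toy model the rectangular loop is perfectly balanced for both chunk sizes, whereas the triangular loop is best served by the smallest chunk size, which illustrates why the imbalance cost depends on the shape of the phase.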
The solution to our optimization problem allows us to find $p_k$. Once the iteration distribution of a phase has been selected, we can build the data distribution for each array of that phase. To do this, the data distribution function must satisfy the intraphase locality requirements imposed by the IDs associated with the chunks of parallel iterations scheduled in each processor. However, we know it is possible to find a static data distribution for all the nodes of the chain such that all the accesses to X in any of the connected nodes of the chain are in the local memory of the processors. The reason is that all nodes in the same chain cover the same data region of X, as stated earlier. Consequently, array X needs to be redistributed only before the first node of the chain. Therefore, for each X in the LCG, we redistribute X after the execution of the last phase of a chain and before the execution of the first phase of the next chain. Using this strategy, we minimize the number of communications during the program execution.
The objective function is subject to several constraints. For example, the IDs of arrays whose nodes are in a chain define some constraints for our integer programming model. In fact, the system of equations that represents the verification of the balanced locality condition for each pair of nodes of a chain is what we call the locality restrictions. Other constraints take into account the storage symmetries of subregions accessed on each parallel iteration. Refer to [38] for the details of the constraints of our optimization problem and the full description of how we validate, by measurement, the communication costs and the load balance costs.
CASE STUDIES WITH BENCHMARKS
This section reports the experimental studies in which a set of real codes were transformed to shared memory codes for the Cray T3D [3] and T3E [46] by using the major techniques described earlier. For our experiments, we used a compiler framework, Polaris [5] , in which we implemented most of our techniques. We will start our reports with the general description of the six benchmarks in the following section.
Benchmark Programs
To measure the effectiveness of all our techniques with various types of applications, we chose the six benchmark codes based on their data access patterns shown in Table 1; that is, two codes with simple accesses (swim and tomcatv), two with many complex accesses (tfft2 and trfd), and two in-between (bdna and mdg). Table 2 shows the characteristics of these codes in more detail.
Array subscripting patterns in swim are fairly simple. The major routines contain doubly nested parallel loops. Their loops access shadow areas and their communication patterns are frontier communications. In tomcatv, the loop main 140, which is the computational kernel, is serial. The major arrays in the inner loops of main 140 access shadow areas and their communication patterns are frontier communications.
Complex NA-type accesses can be found in major loops of tfft2 and also in trfd. In tfft2, inner loops make contiguous access to arrays with stride 1 and matrix transposition (six on each iteration) is performed with power-of-two strides. The repeated procedure calls inside loops form deeply nested loop structures. In trfd, the major loops are olda 100 and olda 300. Loop olda 300 is triangular. In olda, the work array X is divided into a vector, XI10, and three matrices, XI00, XI20, and XI30.
The major routine of bdna is actfor, which consumes 96 percent of the total sequential execution time. Some false dependencies in the loop actfor 240 may be eliminated by privatization. Reduction variable substitution does the same in the loops actfor 240/320/500/700. Most work in mdg is performed on a single array var which is divided into several subarrays. The computations are dominated by a triangular loop interf 1000. Privatization and reduction recognition can be useful to remove dependencies caused by arrays in interf 1000.
Summary of Experiments
To evaluate the major techniques described in this paper, we generated two different parallel code versions as targets and respectively tested their performance. Version 1 was produced by the original implementation of Polaris [22] , [23] , [41] , [43] . Fig. 12 illustrates an example of this code generation process through which Version 1 codes were produced. In this process, Polaris was applied to find parallel loops in the source code. The copy-in/copy-out strategy was used to control communications in the parallel code. To implement this strategy, naive iteration/data distribution and Put/Get generation techniques were employed. That is, the iterations of a parallel loop were stripmined by using the following simple method:
1. If several loops in a loop nest were parallel, then the outermost loop was parallelized,
2. If the loop nesting structure was rectangular, a block schedule of iterations was chosen for it, and
3. If the structure was triangular, a cyclic schedule of iterations was chosen.

For data distribution, all shared arrays were simply block-distributed. The generation of Put/Get is intraphase based, that is, given phases, the Put/Get generation algorithm analyzes the shared array regions accessed in individual phases and generates Put/Get around each phase independently. One-sided communication primitives provided in the SHMEM library [11] were employed to implement Put/Get operations on the Cray NCC machines.

SC represents sequential coverage, that is, the percentage of overall running time of the loop in the sequential code; P/S shows whether the loop is DOALL-type fully parallel or intrinsically serial (or DOACROSS-type partially parallel); and L/F/G shows whether the major communication patterns for shared arrays in the loop would be local, frontier, or global communication patterns in the target shared-memory code.

Fig. 12. Example of the code generation process from a fragment of sequential code: proc num denotes the number of processors and my pid denotes the virtual processor ID number.
Little work has been done to optimize communications in Version 1. The communication costs could be reduced by the locality analysis and iteration/data distribution techniques discussed in Section 5. Unfortunately, these techniques were not implemented in Polaris when this experiment was conducted. However, before their implementation, we intended to evaluate the impact of these advanced techniques on the communication costs in parallel codes running on NCC machines. Thus, we applied by hand some of these techniques to generate Version 2 [35] , [36] , [38] , [42] as follows: After parallelization, the LCG of the source code was built and the integer programming problem for the iteration/data distributions of each phase was derived. The solutions were obtained by the GAMS solver [7] . Finally, those distributions were hand-coded, including the Put/Get generation for global or frontier communications. Table 3 illustrates a comparison of the core techniques used in the two versions. The Access Region Test (ART) was applied to both versions, but the main differences between these two were in the techniques to distribute the arrays and to generate the Put/Get communications.
Experimental Results
In this work, the two parallel versions were executed and their parallel execution times were compared with sequential execution times. This comparison was intended mainly to evaluate the effectiveness of our array access analysis and parallelization techniques already implemented in Polaris, but also to roughly estimate how much we would improve the parallel performance once our locality analysis and iteration/data distribution techniques for communication optimization are fully implemented in our compiler. Fig. 15 compares the execution results for the two parallel versions and the sequential version of each benchmark program on the Cray T3D. The overall performance was affected by several factors, each of which will be analyzed next.

Parallelization. Overall, the ART was effective for detecting parallel loops for all codes; the parallel loops in Table 2 were all identified by the compiler. The ART was able to parallelize loops with complex access patterns, such as intraf 1000 and predic 1000 for mdg. The major loops interf 1000 and poteng 2000 of this code were also parallelized. The ART parallelized virtually all the loops in swim, partially because array subscripting patterns are fairly simple. For tomcatv, all the inner loops were parallelized by the compiler, too. For the tfft2 code, the ART was highly effective in parallelizing the loops with complex access patterns, such as cfftz #1/2/3, as well as other major loops.

Copy-In/Copy-Out. Fig. 13 demonstrates that the copy-in/copy-out strategy enabled us to achieve reasonable scalability on up to 64 processors for several programs, even with the naive data block distribution strategy of Version 1. We credit this to three factors:
• The SHMEM library used to perform our Put/Get communications is quite efficiently implemented on top of the Cray NCC machines for fast one-way communication, which allowed our target codes to tolerate, to some extent, the increase of Put/Get overhead due to the naive distribution strategy.
• A contiguous stream of accesses to an array was common in some programs like bdna. For these programs, block distribution generally works well with the copy-in/copy-out strategy by placing blocks of consecutive array elements into each memory module. For instance, Fig. 15 shows that, on 32 processors, Version 1 for bdna slightly outperformed Version 2 in which several arrays were decomposed with block-cyclic(4) distribution. The block distribution for Version 1 slightly reduced the overall Put/Get invocation cost.
• The T3D's fast asynchronous access to remote memory effectively decreases the communication overhead, which is typically increased by the short messages frequently generated for global address operations in shared-memory programs.

Fig. 13. Comparison of multiprocessor speedups of parallel execution time over sequential execution time before and after the copy-in/copy-out strategy using Put/Get primitives is applied to Version 1.
In our experiment, synchronization overhead was negligible due mainly to the fast hardware circuitry for barrier synchronization (for example, barrier time for 256 processors was around 1.90 μsec in a Fortran code). Fig. 13 indicates that the copy-in/copy-out strategy is the least effective for bdna because the majority of the arrays in its time-critical loops are reduction arrays; thus, they were all privatized for reduction operations, which eliminated the need for copy-in/copy-out operations on them.

Locality Analysis and Iteration/Data Distribution. In Version 2, the main effort was made to optimize communication by finding a suitable data distribution. For example, for bdna, cyclic and block_cyclic data distributions were chosen for major arrays because they are accessed mainly within triangular loops. In Version 2 of mdg, a block-cyclic distribution was chosen for three arrays, FX, FY, and FZ, because they are accessed in triangular loops, too. For swim, shadow distribution was chosen (BLOCK + shadow area) for the majority of arrays and, for this reason, the frontier communication pattern was exploited for those arrays in swim. The iteration and data distributions that our locality analysis and data distribution techniques found for tfft2 were the reason for the improvement of the parallel code performance in Version 2 because:
• A cyclic schedule of iterations was chosen for the loops transa and transb and a block-cyclic distribution for the array Y and array X. Doing this, it was possible to localize the symmetric accesses to array Y in those loops.
• The frontier communication pattern was exploited.
• Global communications on array S in the routine randp were removed by replicating S because this array is always accessed in serial loops.
In Version 2 of tomcatv, a shadow distribution was chosen for the rows of arrays X and Y so that frontier communications (the major communication pattern of main 140) could be exploited. In Version 2 of trfd, a block_cyclic data distribution was chosen for an important array, XI30, in the code. The other key aspect in Version 2 was that arrays XI00, XI10, and XI20 were replicated to avoid communications because they are always accessed in serial loops. The comparison between Versions 1 and 2 revealed an interesting fact: The communication overhead can be substantially reduced for some programs only by recognizing local and frontier communications based on our locality analysis. The typical examples are swim, tfft2, and tomcatv.
In addition, we have observed that scalable speedups are often hampered by the presence of global communications. For the codes dominated by global communications, parallel codes of both versions suffered from diminishing returns beyond a certain number of processors in our experiment. These codes include bdna and mdg. They contain reduction arrays that cause global communications to collect the partial sums stored in each processor. The impact of reductions in three arrays, FX, FY, and FZ, in mdg was approximately 21 percent of the overall parallel execution time on 64 processors. Another example is tfft2. It has a few loops, such as transa 120, transb 120, and transc 120, that contain global communications for matrix transposition. As a result, a large quantity of small array fragments must frequently shuffle across memories, thus leaving little room for communication optimization through the copying strategy or a static data distribution strategy. These global communications, along with several serial loops in randp that took up about 30 percent of the total parallel execution time on 64 processors, caused tfft2 to suffer the worst scalability in speedups among the six codes, as illustrated in Fig. 14 .
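The reduction pattern described above for bdna and mdg can be pictured with a sequential Python simulation. It is purely illustrative: processors are simulated by list indices, the iteration-to-processor mapping is a plain cyclic one, and the final loop stands in for the global communication that collects the partial sums. None of the names come from the benchmarks or from Polaris.

```python
def privatized_reduction(updates, n_bins, H):
    """Sum (index, value) updates into an array using private copies.

    Each of the H simulated processors accumulates into its own private
    copy; the last step, which reads every private copy, is where global
    communication would be required on a real machine.
    """
    private = [[0] * n_bins for _ in range(H)]
    for i, (index, value) in enumerate(updates):
        private[i % H][index] += value          # local, no communication
    # Global step: every bin needs the partial sums of all processors.
    return [sum(private[pe][b] for pe in range(H)) for b in range(n_bins)]


updates = [(0, 1), (2, 5), (0, 3), (1, 2), (2, 4), (1, 7)]
totals = privatized_reduction(updates, n_bins=3, H=4)   # [4, 9, 9]
```

The accumulation phase scales with the number of processors, but the collection step touches H private copies per element, which is consistent with the diminishing returns reported above for reduction-dominated codes.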
Another important fact is that the comparative analysis between speedups for Versions 1 and 2 in Fig. 15 indicates that the importance of data distribution on the T3D becomes evident as the number of processors increases. For instance, in swim, we get almost linear speedups with the shadow distribution in Version 2, even on 64 processors. This explains the increased performance gap between Version 1 and Version 2 for swim on larger numbers of processors. Also, in mdg, Version 2 has better scalability on more than 16 processors than Version 1. This is due to the block-cyclic distribution for the three major arrays, avoiding communications within and between the executions of the main loops of the code. Similarly, in trfd, the data distribution strategy in Version 2 improved the performance of Version 1 by approximately 50 percent. The tfft2 code shows another example of the importance of data distribution. Despite its difficulties with array subscripting patterns and the global communications for matrix transposition, we have achieved reasonable scalability in Version 2 using block-cyclic distributions for the main arrays. The block-cyclic distribution gets local computation for 77 percent of sequential coverage and provides the opportunity to exploit parallelism without communication. For example, when we built the LCG for tfft2 shown in Fig. 9, we found that locality would be exploited without communications for array X between the phases ($F_3$, $F_4$), ($F_4$, $F_5$), ($F_5$, $F_6$), ($F_6$, $F_7$), and ($F_7$, $F_8$) and for array Y between phases ($F_1$, $F_2$). In addition, we found that phases ($F_2$, $F_3$), ($F_3$, $F_4$), ($F_5$, $F_6$), and ($F_6$, $F_7$) were uncoupled for Y; thus, no communications were necessary between the execution of all those phases in Version 2.
Although data distribution is important in code generation for all types of distributed memory architectures, its importance may vary to some extent from machine to machine. LeBlanc and Markatos [28] demonstrated that the data locality issue is a less critical factor for a machine with a high-throughput, low-latency network. In fact, load balancing can become a more important factor in this kind of machine. The approach based on the LCG (Version 2) takes into account these two issues, looking for a trade-off between minimization of communications (data locality) and minimization of the load imbalance. Thus, we believe that these techniques will be effective for a wide variety of NUMA machines, including machines with high throughput and low latency. In order to check the behavior of the techniques implemented in Version 2 in a machine with these characteristics, we conducted a second set of experiments, now using the Cray T3E as the target machine. Fig. 16 shows the execution results for the Version 2 codes and the sequential codes on the Cray T3E. Now, we achieve, for the majority of the codes, better efficiencies than on the Cray T3D. This can be explained to some extent because the Put/Get primitives are less scalable in the T3D (the latency of a Put/Get operation is around 900 processor cycles) than in the Cray T3E (the latency of a Put/Get operation is around 450 processor cycles). In other words, we note that, for the same parallel code which achieves the same load balance in both machines (Version 2), the higher the remote latency, the worse the program scalability.
However, in future NCC machines, the overhead of Put/Get operations, in terms of processor cycles, is likely to be even more skewed than on the T3E. For this reason, the minimization of Put/Get primitives is still one of the key issues in the efficiency of the parallel code for such machines. In any case, our iteration/data distribution step looks for the distributions that minimize the number of communications, while balancing the load. Thus, we conclude that our automatic iteration/data distribution techniques are likely to be effective on future NCC machines.
CONCLUSION
This work shows that, with sufficiently powerful parallelization and array access analysis, it is possible to generate scalable, high-performance code from serial programs for the Cray T3D and T3E. While the absence of global hardware cache coherence makes programming these machines directly more difficult, it also gives a compiler extra flexibility beyond what is available with pure message passing machines. This paper shows some techniques that can make use of this flexibility. The successful application of these techniques makes it possible to achieve both high performance and high programmability on these machines.
In this paper, we report three principal results from our work. First, we showed that all of the techniques discussed in this paper rely on the precision of accurate array access descriptors and on the opportunities for simplification that they afford, and that other array access descriptors, which cannot accurately represent the full richness of memory access within programs, handicap the techniques built upon them from the start. We then demonstrated how powerful array access analysis, based on symbolic range analysis with LMADs and IDs, allowed the compiler to recognize difficult parallel loops and to generate optimal Put/Get for common communication patterns. Second, we showed that locality analysis, based on the LCG, helped us to find suitable iteration/data distributions for target code. Third, we analyzed the results of a set of experiments with real codes and used them to compare the effectiveness of the various techniques described in this paper.
Since 1991, he has been a full professor at the University of Málaga. Currently, he is head of the Computer Architecture Department at the University of Málaga, Spain. He has published more than 90 journal and 200 conference papers in the parallel computing field (applications, compilers, and architectures). His main research topics include application fields, compilation issues for irregularly structured computations and computer arithmetic, and application-specific array processors. He is a member of the editorial board of the Journal of Parallel Computing and Journal of Systems Architecture. He has also been a guest editor of special issues of the Journal of Parallel Computing (languages and compilers for parallel computers) and Journal of Systems Architecture (tools and environments for parallel program development). He is a senior member of the IEEE.
Jay Hoeflinger has done parallel processing research and development for the last 16 years, first at the Center for Supercomputing Research and Development, then in the Department of Computer Science at the University of Illinois, where he focused primarily on parallelizing compilers. In 1998, he joined the Center for Simulation of Advanced Rockets, where he was a senior research scientist, focusing on the use of OpenMP and MPI in the detailed, integrated, whole-system simulation of solid fuel rockets. He joined Intel in 2000, where he has been involved in development of the Guide OpenMP translator as well as new product development. He is a member of the OpenMP C/C++ standards committee.
David Padua is a professor of computer science at the University of Illinois at Urbana-Champaign, where he has been a faculty member since 1985. He has served as a program committee member, program chair, or general chair of more than 40 conferences and workshops. He served on the editorial board of the IEEE Transactions on Parallel and Distributed Systems and as editor in chief of the International Journal of Parallel Programming (IJPP). He is currently a member of the editorial boards of the Journal of Parallel and Distributed Computing and the International Journal on Parallel Processing. His areas of interest include compilers, machine organization, and parallel computing. He has published more than 100 papers in those areas. He is a fellow of the IEEE.
