This paper describes a number of optimizations that can be used to support the e cient execution of irregular problems on distributed memory parallel machines. These primitives (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of o -processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. We present a detailed performance and scalability analysis of the communication primitives. This performance and scalability analysis is carried out using a workload generator, kernels from real applications and a large unstructured adaptive application (the molecular dynamics code CHARMM).
Introduction
Over the past few years we have developed a methodology to produce e cient distributed memory code for sparse and unstructured problems in which array accesses are made through a level of indirection. In such problems the dependency structure is determined by variable values known only at runtime. In these cases e ective use of distributed memory architectures is made possible by a runtime preprocessing phase. The methods we have developed are implemented using compiler embeddable runtime support procedures. These procedures can be thought of as supporting a type of weakly coherent distributed shared memory that can be closely coupled to distributed shared memory compilers. These primitives (1) coordinate interprocessor data movement, (2) manage the storage of and access to copies of oprocessor data, (3) minimize interprocessor communication requirements and (4) support a shared name space.
This paper describes and systematically evaluates a number of new optimizations that have been incorporated into the existing toolkit. Several of the optimizations reduce communication latency and volume. We also describe a paged translation table to e ciently support access to globally indexed distributed arrays that are mapped onto processors in an irregular manner.
We characterize the performance and scalability of our optimizations using (1) a family of test loops and an associated synthetic workload generator, (2) test loops with data access patterns drawn from unstructured applications and templates from real applications. This combination of empirical approaches allows us to characterize the performance of our optimizations under a wide variety of conditions. In order to provide a conceptual framework for understanding ways of evaluating the relative e ectiveness of an optimization on varying numbers of processors, we de ne the concept of optimization scalability.
In sparse and unstructured computations, distributed arrays are typically accessed using indirection. In many cases (e.g. distributed arrays referenced in loops with no loop carried dependencies or distributed arrays referenced in loops with accumulation type dependencies) it is possible to prefetch required o -processor data before a loop begins execution. In many cases several loops access the same o -processor memory locations. As long as it is known that the values assigned to o -processor memory locations remain unmodi ed, it is possible to reuse stored o -processor data. A mixture of compile-time and run-time analysis can be used to generate e cient code for irregular problems 9], 11], 15]. In this paper we give a detailed description of the communication optimizations that prove to be useful for optimizing irregular problem performance but do not further address compilation issues.
Overview of PARTI
In this section, we give an overview of the functionality of the PARTI primitives described in previous publications 22], 1], 17]. In many algorithms data produced or input during program initialization plays a large role in determining the nature of the subsequent computation. In the PARTI approach, when the data structures that de ne a computation have been initialized, a preprocessing phase follows. Vital elements of the strategy used by the rest of the algorithm are determined by this preprocessing phase.
In distributed memory machines large data arrays need to be partitioned between local memories of processors. These partitioned data arrays are called distributed arrays. Long term storage of distributed array data is assigned to speci c memory locations in the distributed machine. It is frequently advantageous to partition distributed arrays in an irregular manner. For instance, the way in which the nodes of an irregular computational mesh are numbered frequently does not have a useful correspondence to the connectivity pattern of the mesh. When we partition the data structures in such a problem in a way that minimizes interprocessor communication, we may need to be able to assign arbitrary array elements to each processor.
Each element in a distributed array is assigned to a particular processor. In order for another processor to be able to access a given element of the array we must know the processor in which it resides and its local address in that processor's memory. We thus build a translation table which, for do i = 1, n x(ia(i)) = x(ia(i)) + y(ib(i)) end do The class of problem with which we are dealing consists of a sequence of clearly demarcated concurrent computational phases where data access patterns cannot be anticipated until runtime. They are called static irregular concurrent computations 1] . In these problems, once runtime information is available, 1) data access patterns are known before each computational phase and 2) the same data access patterns occur many times. Adaptive problems can fall into this class of problem as long as data access patterns change relatively infrequently. A typical loop in such computations is shown in Figure 1 . In this loop the arrays x, y, ia and ib are all distributed arrays. The arrays ia and ib are used to index the arrays x and y respectively. At compile time it is not possible to gure out the indices of x and y that are accessed because they are dependent on the values stored in the arrays ia and ib. This data access pattern becomes available at runtime.
In distributed memory MIMD architectures there is typically a non-trivial communications latency or startup cost. For e ciency reasons information to be transmitted should be collected into relatively large messages. The cost of fetching array elements can be reduced by precomputing what data each processor needs to send and receive.
In irregular problems, such as solving PDEs on unstructured meshes and sparse matrix algorithms, the communication pattern depends on the input data. This typically arises due to some level of indirection in the code. In this case it is not possible to predict at compile time what data must be prefetched. To deal with this lack of information the original sequential loop is transformed into two constructs, namely the inspector and the executor.
During program execution the inspector loop examines the data references made by a processor and calculates what o -processor data needs to be fetched and where that data will be stored once it is received. The executor loop then uses the information from the inspector to implement the actual computation. We have developed a suite of primitives that can be used directly by programmers to generate inspector/executor pairs.
These primitives are named PARTI 5], 1]; they carry out the distribution and retrieval of globally indexed but irregularly distributed data-sets over the numerous local processor memories. Each inspector produces a set of schedules, which specify the communication calls needed to either: (1) obtain copies of data stored in speci ed o -processor memory locations (i.e. gather) or, (2) modify C Create the required schedules (Inspector) S1 call localize(dist of x, schedule ia, part ia, local ia, n local iterations, o proc x) S2 call localize(dist of y, schedule ib, part ib, local ib, n local iterations, o proc y) C The actual computation (Executor) S3 call zero out bu er(x(begin bu er x), o proc x) S4 call gather(y(begin bu er y), y, schedule ib) L1 do i = 1, n local iterations S5 x(local ia(i)) = x(local ia(i)) + y(local ib(i)) S6 end do S7 call scatter add(x(begin bu er x), x, schedule ia) The primitives that build schedules use hash tables to generate communication calls that, for each loop nest, transmit only a single copy of each o -processor datum 12], 22]. The schedules are used in the executor by PARTI primitives to gather, scatter and accumulate data to/from o -processor memory locations. In this paper, the idea of eliminating duplicates has been taken a step further. If several loops require di erent but overlapping data references we can avoid communicating redundant data (See Section 4.1.2).
The description of the functionality of the PARTI primitives is presented by transforming the example loop shown in Figure 1 such that it can be executed on a distributed memory machine. The arrays x, y, ia and ib are partitioned between the processors of the parallel machine. The partitioned indirection arrays ia and ib are called part ia and part ib on each processor and they are passed to the function localize as shown in statement S1 and S2 in Figure 2 . The localize calls shown in Figure 2 contains the inspector code for the simple irregular loop shown in Figure 1 . On each processor the function localize is passed the following:
1. a pointer to the distribution information, (dist of y in S2), 2. a list of globally indexed distributed array references for which processor P will be responsible, (part ib in S2), and 3. number of globally indexed distributed array references (n local iterations in S2).
Localize returns:
1. a schedule that can be used in PARTI gather and scatter procedures (schedule ib in S2), A sketch of how the procedure localize works is shown in Figure 3 . The array part ib in statement S2 in Figure 2 contains the globally indexed reference pattern used to access array y. Localize translates part ib so that valid references are generated when the loop is executed. The bu er for each data array immediately follows the on-processor data for that array. For example, the bu er for data array y begins at y(begin bu er). Hence, when localize translates part ib to local ib, the o -processor references are modi ed to point to bu er addresses. When the o processor data is collected into the bu er using the schedule returned by localize, the data is stored in a way such that execution of the loop using the local ib accesses the correct data. Similarly, part ia is also translated to point to local addresses during the localize call shown in statement S2.
The executor code shown in Figure 2 carries out the actual loop computation. In this computation the values stored in the array x are being updated using the values stored in y. During the computation, accumulations to o -processor locations of array x are carried out in the bu er associated with array x. This makes it necessary to initialize to zero the bu er corresponding to o -processor references of x. We call the PARTI function zero out bu er shown in statement S3 to perform this action. After the loop computation, the data in the bu er location of array x is communicated to the home processors of these data elements (scatter add We have ported a number of scienti c codes to parallel machines using the PARTI primitives. In this section we brie y describe two kernels that have been extracted from these codes and how they stress the primitives quite di erently. In Section 2.1 we describe a representative kernel from the explicit Euler solver 14], 5] and Section 2.2 brie y describes the kernel from the molecular dynamics code CHARMM 3], 7].
Unstructured Euler Kernel
Unstructured meshes provide a great deal of exibility in discretizing complex domains and o er the possibility of easily performing adaptive meshing. However, unstructured meshes result in random data-sets and large sparse matrices which require runtime preprocessing to be executed on a distributed memory machine. The average connectivity of each mesh point is between 6 and 10, hence the number of duplicate data references is low. This connectivity is quite low when compared with the connectivity that is generated for other types of problems such as molecular dynamics or particle dynamics. There are 8 irregular loops in the kernel and the number of indirection arrays is 2. The size of these indirection arrays are 6 to 10 times the size of the data arrays. The communication pattern generated by this kernel is dependent on the type of partitioning method used. When the data and the indirection arrays are block partitioned the number of o -processor data is extremely high, and nearly all processors have to communicate with each other. When the data is partitioned based on the connectivity of the mesh (spectral bisection 16]), we get low volumes of communication between neighboring processors. We have used geometric partitioners with fair amount of success.
Molecular Dynamics
This kernel from the molecular dynamics code CHARMM has a very di erent data access pattern from that of the Euler code kernel. Part of this kernel consists of a non-bonded force calculation. Figure 4 depicts the non-bonded force calculation loop. The array Partners(i, *) stores all the atoms that interacts with atom i. Inside the inner loop, the three force components between atom i and atom j are calculated (Vander Waal's and electrostatic forces). These forces are then added to the total forces associated with atom i and subtracted from the forces associated with atom j. All atoms within a given cuto radius interact with each other. The cuto radius is usually comparable to the geometric size of the problem, hence each atom interacts with hundreds of other atoms. This causes a large number of duplicate array references in the indirection array for the force calculation. In the input we used to test the kernel the total number of atoms (data array size) is 14,026 while the indirection array size is 6,700,818. There are 4 other irregular loops in the kernel, but the size of those indirection arrays are comparable to the data array sizes. We have used geometric data partitioners to partition the data associated with each atom. Even though the data is partitioned irregularly nearly all processors need to communicate with each other.
The molecular dynamics kernel and the Euler code kernel generate contrasting data access patterns, and stress di erent aspects of the communication primitives. We will use these two kernels in conjunction with the workload generator to evaluate the primitives.
Synthetic Workload
In this section we describe a parameterized synthetic workload generator that has been designed to simulate the kinds of data reference patterns and communication characteristics we have encountered in concurrent irregular scienti c problems, examples of which has been presented in section 2.
The synthetic workload consists of two parts, the Communication Pattern Generator (CPG) and the Data Access Pattern Generator (DAPG). The CPG is used to de ne the communication pattern induced by the problem. The DAPG generates indirection arrays that embody the communication pattern speci ed by the CPG. Most of the optimizations presented in this paper are methods that can be used to preprocess indirection arrays. Consequently, we have developed a relatively sophisticated DAPG. We have a very simple CPG and we describe it rst. It should be noted, however, that more sophisticated communication models can be de ned to replicate the communication behavior of irregular problems. One extension is the addition of extra features such as variability of the connectivity and communication volume. However, the current model is general enough to illustrate the key performance parameters of the optimization primitives.
The second part of the synthetic workload generator is the DAPG which is responsible for generating the data access patterns utilizing the communication graph. The actual communication takes place in a way that is determined by the communication graph. The data access pattern is de ned to be a permutation of a subset of the global index space. It speci es which global data indices have been accessed locally. The output of the DAPG is a set of indirection arrays that will be used in accessing the distributed arrays, whereas the input is the communication graph generated by the CPG and the following parameters : 
Number of Duplicates (N dup )
The number of duplicates for a given indirection array is de ned to be the number of distinct occurrences of the same o -processor data reference. If the number of duplicates for an indirection array is 2, then each unique reference in the indirection array will occur twice. Note that this parameter has no e ect on the total volume of unique data communicated.
Number of Dimensions(N dim )
N dim measures the degree of reuse of the same data access pattern across the dimensions of a distributed array.
An example of the type of work load generated is shown in Figure 5 . For this case the various inputs to the DAPG are R int = 0.5, N dup = 2, N loop = 2, and N dim = 2. Since R int equals 0.5, half of the values stored in array ia are also present in the array ib. Since N loop equals 2, there are two indirection arrays namely, ia and ib. N dim equals 2 makes the the upper bound of the compressed dimension of all the data arrays be 2 (in this case x(*,2) and z(*,2)). Since N dup equals 2 each reference in ia is repeated twice. The same follows for ib.
The parameters de ned for the workload and other symbols used throughout the rest of the paper are summarized in Table 1 .
Communication Optimizations
In this section we present a set of communication optimizations. We have implemented these optimizations in a set of primitives that we call the PARTI runtime library. Section 4.1 shows how software caching can be used to reduce the volume of communication between processors. One optimization is to remove redundant o -processor accesses associated with a particular indirect array reference. A more aggressive optimization is to remove redundant o -processor accesses associated with several indirect array references. Section 4.2 describes the optimizations that have been developed to reduce communication startups by coalescing communications into a decreased number of messages.
Software Caching
During the execution of irregular loops on distributed memory (or distributed shared memory) machines, the same o -processor data may be accessed repeatedly. In many cases, we can prefetch all data needed by an array reference before a loop's computation begins. In other cases, we can prefetch all data needed by a set of irregular references to the same array. In either case, the same o -processor data may be accessed multiple times, but only a single copy of that data must be fetched from oprocessor. In this paper, we do not address issues associated with identifying when these prefetches can be carried out; See von Hanxladen 11] and Das 9] for a detailed discussion.
Informally, the prefetches can be carried out when (1) it is possible to predict array reference patterns prior to a loop's execution, and (2) it is known that all array data subject to prefetch remains live, i.e. there is no possibility that the prefetched values are no longer valid.
Simple Software Caching
A hash table is utilized to identify duplicate o -processor data accesses associated with the indirect references to a single data element. Communication schedules are generated from the lists of unique o -processor data accesses. These schedules store the communication patterns to be used by the gather and scatter primitives. During the schedule generation process each processor sends the lists of data it needs from all other processors, it also receives the lists of data it must send to other processors. These lists contain the indices of the data that need to be communicated. Each schedule is associated with a distribution and a data access pattern, rather than being tied to speci c data arrays. Hence, if we encounter two references to arrays where the arrays are distributed in the same way and where data access patterns are identical, we can use the same schedule to gather or scatter data to these arrays. The PARTI primitives described in earlier papers and outlined in Section 1.1 provide runtime support for simple software caching.
Incremental Scheduling
Data communication volume is reduced by tracking and reusing live o -processor data copies. In a number of application codes there occur multiple indirect references to the same data array. When it is known that no array assignments can occur between some sets of indirect references, i.e. the array in question remains live in that portion of the code, we only have to fetch a single copy of each unique o -processor value accessed by any of the indirect references.
Assume there are N di erent indirect array references to distributed data array D. From each reference we can obtain o -processor indices used to access data from the array D. Let IA I be the set of o -processor indices from reference I. Hence, IA = fIA 1 ; IA 2 ; ::::; IA N g is the set of sets of o -processor indices used to access data from D. The use of an incremental schedule allows us to bring in only those data that are not available locally:
S IA I = f ia : ia 2 IA I for some set IA I 2 IA g
The number of indices belonging to this set is potentially smaller than the number of indices we would get if we simply concatenated the indices obtained from separately applying simple software caching to each distributed array reference. If every index listed in each of the sets IA is di erent then there is no advantage in doing incremental scheduling. On the other hand, if there is signi cant overlap in the o -processor references obtained from the reference sets then a large reduction in communication volume is achieved by using incremental scheduling.
The incremental scheduling procedure removes duplicate o -processor data references by using hash tables. This procedure is invoked with a set of references and a set of schedules that have to be considered during the generation of the incremental schedule.
Communication Coalescing
One can frequently collect many data items destined for the same processor into a single message This kind of optimization is sometimes called communication coalescing. The object of communication coalescing is to reduce the number of message startups. For many distributed memory systems there is a substantial latency associated with message passing. For instance, Bokhari 2] 
Simple Communication Aggregation
It is frequently possible to anticipate which data must be communicated before a loop executes. We can carry out the preprocessing needed to characterize the data required by a given right hand side array reference. Prior to loop execution, we pack into a single message all the data each pair of processors need to exchange. In a similar manner, we can often defer until after a loop the communication (and accumulations) associated with left hand side array references. We refer to this optimization as simple communication aggregation.
Communication Vectorization
If a number of columns of a multi-dimensional array are distributed in a conforming manner and if the data access pattern from these columns are the same, then the primitives gather and scatter data from all the columns using a single communication phase. This does not reduce the communication volume but reduces the latency by the number of columns. Hence, if any processor P is to receive data from N processors for L columns then the reduction of latency time is given by, Latency Reduction = N Time latency (L ? 1) The PARTI primitives for multi-dimensional arrays perform this optimization.
Schedule Merging
When data is gathered from, or scattered to, the same data array using a number of di erent schedules, then these schedules can be merged together to reduce the number of message startups and thereby the latency. Schedule merging is orthogonal to the software caching optimizations; for instance one can merge sets of schedules that arise from simple software caching or sets of schedules obtained from incremental scheduling. The total reduction in latency is by the factor (S ? 1) , where the number of merged schedules is S. PARTI provides primitives that merge a number of schedules to form a single communication schedule.
Example Test Codes
The placement of the runtime support depends on the nature of the communication optimization. For instance, in Figures 6 and 7 we compare the test loops associated with simple communication aggregation and schedule merging. The simple communication aggregation case shown in Figure 6 does the preprocessing with the various indirection arrays right at the beginning. It returns four schedules, one for each of the indirection arrays. The z values are fetched immediately before each loop executes the schedule for ic is employed before the rst loop and the schedule for id is employed before the second loop. After the execution of the rst loop the o -processor x values are accumulated using the schedule for ia. Similarly, after the second loop computation, o -processor accumulation of x is done using the schedule for ib.
The schedule merging code is shown in Figure 7 . As in the previous case, schedules are built using all the indirection arrays. In this case the schedules are merged and instead of four we have two, one for ia and ib and one for ic and id. All the required values of z are fetched using vectorized communication (z being a multi-dimensional array) before execution of the loops. We accumulate the o -processor values of x using the schedule for ia and ib after both loops execute. We can do this because of the commutative property of the`+' operator. The executor communication cost when schedule merging and vectorization is performed is much lower than that of the simple software caching case. The inspector cost for schedule merging is higher than the inspector cost of the software caching case.
Note that while the software caching and communication coalescing optimizations are orthogonal, on distributed memory machines it makes sense to use certain optimizations together. For instance, if we employ incremental scheduling, we can easily produce a single merged schedule to perform the Preprocessing for indirection arrays ia, ib, ic and id gather the values of array z using schedule for ic do i = 1, n x(ia(i)) = x(ia(i)) + z(ic(i)) end do accumulate values of x using schedule for ia gather the values of array z using schedule for id do i = 1, n x(ib(i)) = x(ib(i)) + z(id(i)) end do accumulate values of x using schedule for ib 
Paged Distributed Translation Table
In distributed memory machines, large data arrays need to be partitioned between the local processor memories. These partitioned data arrays are called distributed arrays. Long term storage of distributed array data is assigned to speci c memory locations in the distributed memory machine. It is frequently advantageous to partition distributed arrays in an irregular manner. For instance, the way in which the nodes of an irregular computational mesh are numbered frequently does not have a useful correspondence to the connectivity pattern of the mesh. Figure 8 depicts an unstructured mesh. The node numbers are generated during the mesh generation process and they do not follow any regular pattern. In unstructured computational uid dynamics solvers data is associated with each node of the mesh. When we partition the data structures in such a problem in a way that minimizes interprocessor communication, we may need to be able to assign arbitrary array elements to each processor. Once distributed arrays have been partitioned between processors, each processor ends up with a set of globally indexed distributed array elements that must be accessed.
Each element in a size S distributed array, A, is assigned to a particular home processor. In order for another processor to be able to access a given element, A(i), of the distributed array we must know the home processor where A(i) resides, and we must know the local address of A(i). We build a translation table which, for each array element, lists the home processor and the local address in the home processor's memory.
Memory considerations make it clear that it is not always feasible to place a copy of the translation table on each processor. A translation table can be distributed between processors. Earlier versions of Preprocessing to build a single schedule using arrays ia and ib Preprocessing to build a single schedule using arrays ic and id Gather for z using the single schedule for arrays ic and id do i = 1, n x(ia(i), 1) = x(ia(i), 1) + z(ic(i), 1) x(ia(i), 2) = x(ia(i), 2) + z(ic(i), 2) end do do i = 1, n x(ib(i), 1) = x(ib(i), 1) + z(id(i), 1) x(ib(i), 2) = x(ib(i), 2) + z(id(i), 2) end do Accumulate x using the single schedule for arrays ia and ib . This was accomplished by putting the rst N/P elements on the rst processor, the second N/P elements of the table on the second processor, etc ., where P is the number of processors. If we are required to access an element A(m) of distributed array A, we nd the home processor and local o set in the portion of the distributed translation table stored in processor ((m ? 1)=N) P + 1. We will refer to a translation table lookup, aimed at discovering the home processor and the o set associated with a global distributed array index, as a dereference request.
In many cases, the naive translation table described above tends to be costly to use because (1) The distribution of the translation table between processors is xed. The distribution of the translation table bears no particular relationship to the distribution of dereference requests; and (2) Some distributed In this paper, we introduce a paged translation table. In this scheme the translation table is decomposed into xed-sized pages. Each page lists the home processors and o sets associated with a set of B contiguously numbered distributed array indices. Each processor P stores Pa pages, and at least one processor maintains a copy of each page; consequently the total number of stored pages P*Pa must be greater than or equal to the distributed array size S divided by B. We follow the convention in the virtual memory literature and call the memory location associated with each page a page frame. Each processor maintains a page table; for each page, the page table lists a processor and a page frame.
Translation table information for each index must be stored somewhere; we make the simplifying assumption that each processor must store at least S=(B P) pages. In our current paged translation Figure 9 shows the index translation process. Figure 10 depicts a highly simpli ed scenario in which we have 2 processors, an 8-element distributed array (S=8), a page size of 2 (B = 2) and a replication factor of 0.0 (RF=0.0). Since no pages are replicated, each processor has the same page table. In Figure 11 we depict a scenario that is identical to the one shown in Figure 9 , except that we now have a replication factor of 0.5. In this case, Processor 1 contains a dynamic copy of page 3, and processor 2 contains a dynamic copy of page 1. Our runtime support allows each processor to choose which pages to replicate based on the characteristics of a user (or compiler) speci ed distributed array access pattern, speci ed by integer array IA. Each index i of IA is dereferenced by consulting page IA(i)?1 B + 1. On each processor, we choose the most heavily accessed pages as the dynamically assigned pages.
Optimization Scalability
Various models, metrics and analysis techniques for scalability of parallel programs have been proposed in the literature 20, 21, 13] . Under constant problem size scalability model, the application is run on varying sized systems without changing any of its parameters, including problem size. In the memoryconstrained scalability model, the application parameters are scaled so that the memory requirements grow linearly with the increase in system size 10]. Finally, in time-constrained scaling, the application parameters are scaled so that the running time does not change while the system size changes. All of these models have some de ciencies and none of them are universal for all applications as pointed out by 19] .
There are a wide range of optimizations designed to improve the performance of problems on multiprocessor architectures. Any given optimization targets a certain class of problems and the e ectiveness of the optimization varies with problem characteristics. We believe that it is also important to characterize the relative e ectiveness of an optimization on varying system sizes. For instance, consider a mechanism that is capable of exploiting parallelism that may exist between a heterogeneous set of atomic tasks (i.e. no task can be divided between processors). An optimization of this kind may be very e ective when the number of processors used is less than or equal to the number of tasks. In such a case, increasing the number of processors greater than the number of tasks would show no gain in performance. As we shall see below, there are other cases in which the e ectiveness of an optimization may remain constant or increase as one makes use of increasing numbers of processors.
We de ne optimization scalability function for a given problem and architecture as osf(A; P) = T base (s(A; P); P) T opt (s(A; P); P)
where s(A; P) is the scaled version of problem A on P processors using any of the known scaling models, T base (s(A; P); P) is the time taken by the base code to solve s(A; P) on P processors and T opt (s(A; P); P) is the time taken by the optimized code to solve s(A; P) on P processors. It is important to note that optimization scalability can be viewed in the context of any problem scaling model. For instance, in computing an optimization scalability function, we can x problem size and vary the number of processors. In this case s(A; P) = A. Alternately, we can increase the problem size or alter problem characteristics with increasing numbers of processors. We examine the e ect of our optimizations on unstructured problem communication requirements. In the analysis presented in this section we make use of the synthetic workload described in Section 3, which employs a set of loops of the type depicted in Figure 5 . In the experimental analysis presented in the following sections we make use of both the synthetic workload and data access patterns derived from real applications.
Our optimizations are applied to a data access pattern generated when a given unstructured problem is mapped onto a multiprocessor. Measured performance on a given architecture consequently depends on (1) the nature of the unstructured code (e.g. the real codes outlined in Section 2 or the test loops in Figure 5 ), (2) the dataset (e.g. the data structures used to represent unstructured meshes and molecular interactions described in Section 2), and (3) the way in which the dataset is partitioned among processors.
We rst examine the volume of communication and the number of communication startups associated with bringing in o -processor data. This data is presented in Table 2 . In this table, we assume that problem size is xed as we increase the number of processors. The row labeled \naive" stands for no optimization at all; each processor requests its data whenever that data is needed locally. In the \naive" case,the number of communication startups is equal to the number of data elements communicated. Using the notation de ned in Section 3, we assume that the unoptimized or base problem requires communication volume V/P to be sent and received by each processor, N loop represents the number of test loops employed by the DAPG and N dim represents the number of identically referenced array slices. When targeted at distributed memory architectures, the naive implementation is extremely ine cient 18].
The row labeled \simple communication aggregation" gives the communication characteristics associated with the optimization described in Section 4.2.1. This optimization reduces the number of messages that must be transmitted. , \schedule merging" adds the schedule merging optimization (Section 4.2.3) to the optimizations represented in the rows above. This optimization makes it possible to prefetch all data needed by the entire set of test loops before carrying out the rst of the test loops. The number of startups in this case is reduced by a factor of N loop and is equal to C.
Finally, we add the incremental scheduling optimization (Section 4.1.2) to the list above. Incremental scheduling allows us fetch from o -processor only the unique data values needed by any one of the test loops. This produces a savings when more than one test loop uses the same datum.
The left hand side array references in the test loops in Figure 5 involve accumulations. In most cases, our experience with applications scientists has indicated that it is permissible to defer oprocessor accumulations until after a loop. This does have the e ect of changing the order in which the accumulation is carried out. In the authors' experience, this change in operation order does not usually cause problems, since such loops are routinely vectorized, and vectorization also changes the order in which values are accumulated. If we limit ourselves to carrying out deferred accumulations after each loop, we nd that schedule merging and incremental scheduling optimizations cannot be employed. In some applications areas, such as molecular dynamics, some applications programmers nd that they can defer accumulations until after a sequence of (non-dependent) loops are carried out. In these cases one could make use of schedule merging and incremental scheduling optimizations.
Using the executor communication requirements presented in Table 2 we can show that the optimization scalability function for the executor is of the form osf(A; P) = 1 P + 1 2 P + 2 where, 1 P is the communication volume of base code, 2 P is the communication volume of optimized code, 1 is the number of startups for the base code and 2 is the number of startups for the optimized code. Recall that under the constant problem size scalability model s(A; P) = A. The base code in this case is taken to be the code with \simple communication aggregation" optimization (the \naive" code is so ine cient that in practice no one would employ this method!). When we take the derivative of the osf with respect to the number of processors P we get, osf0(A; P) = 1 ( 1 + 1 P) 2 + 2 P ?
The function osf0(A; P) > 0 because 1 2 and 1 2 . Hence we conclude that the optimization scalability function for the executor primitive is monotonically increasing with the increase in the number of processors. We can show similar characteristics for the optimization scalability function of the inspector primitives.
In CHARMM, one can increase problem size by varying the size of a molecule being simulated or by varying the cuto radius (Section 2.2). Increasing the cuto radius has the e ect of increasing N dup . As is clear in Table 2 , the e ectiveness of optimizations that include software caching improve with increasing N dup .
The communication requirements associated with preprocessing are very closely tied to the communication requirements needed to carry out o -processor gathers, scatters and accumulations. Table  3 depicts these communication requirements. We gain some advantage from the fact that we can reuse the same schedule each time we carry out communications in which we encounter identically referenced, Table 3 , we see that the inspector costs do not scale with the number of processors in the constant size problem scaling model. Consequently, when we include the inspector overhead, the optimization scalability function decreases with the number of processors.
We have developed a substantial body of compile time analysis dedicated to identifying such situations 11], 9]. In the case of the test loops, it is clear that we do not have to repeat our preprocessing for identically distributed array sections. This advantage is re ected in the communication volume and startup numbers depicted in Table 3 .
Experiments and Results
This section describes the experiments performed and the corresponding results. A number of di erent experiments were performed using the synthetic workload generator and the application code kernels. We present the results to show the performance of the primitives and also how they scale as we go from a small to a large number of processors.
Scalability of the Primitives
The communication optimizations presented in section 6 have been tested in various synthetic and real computational problems. In this section, we characterize the scalability of problems implemented using our runtime support. We also carry out experiments that make it possible to characterize a few of the optimization scalability functions associated with our optimizations. Experiments based on the memory constrained scaling model were used to demonstrate the scalability of the optimizations. In our experiment we increased the size of the problem together with the size of the machine. A useful property of our workload generator is that it can be used to produce communication patterns whose communication structure is preserved when the number of processors is increased. If we scale the synthetic workload in the above manner we did not observe any signi cant changes in performance with increasing numbers of processors. Table 4 presents the absolute timings for schedule merged incremental gathers using the commu- 
Synthetic Workload Performance Results
We now present empirical performance results to characterize the e ectiveness of the communications optimizations presented in Section 4. The reduction in communication time associated with incremental schedules is shown in Figure 12 . We compare performance of a code which employs schedule merging with incremental scheduling versus simple software caching carried out separately for each loop. We used four loops in our test loop code (N loop = 4). We keep the communication graph constant (C=4) and vary the R int parameter in order to vary the number of shared o -processor accesses. The loop structure is similar to the one presented in Figure 7 . We repeated the experiment for low ( 100 oating point numbers), medium ( 1000 oating point numbers), and high ( 2000 oating point numbers) communication volume. Figure 12 gives the timings for the gather calls both for incremental scheduling and simple software caching. For both Figure 13 . The inspector time for the incremental scheduling is usually higher than with simple software caching because of the larger volume of data that has to be hashed.
We now quantify the performance e ects of simple software caching. We keep the communication graph constant and vary N dup and the volume of communication. The structure of the test loop associated with the duplicate elimination version and the pre-scheduled communication version are very similar to the one in Figure 6 . Figure 14 shows the results when the duplication factor N dup is low (range 0 to 10). This is the case usually found in unstructured mesh computations computational uid dynamic calculations, in these calculations; the connectivity of the mesh ranges from 6 to 10. Figure 15 shows the case where the duplication factor is very high (range 0 to 500). This case is similar to the data access pattern found in molecular dynamics and particle dynamics codes, where each particle interacts with a large number of other particles (usually within a cut-o radius). We nd the expected result: performance improvement associated with software caching increases with the duplication factor, except when the communication volume and duplication factor are both low.
Performance Results Derived from Applications

Comparison of Communication Optimizations
The Euler kernel was timed varying the number of processors from 16 to 128. All timings presented are for 10 iterations of the outermost loop. The communication times for the di erent levels of optimizations are shown in Table 5 . We see that for both the 53K and 100K mesh input, schedule merging together with vectorization makes the communication time decrease slightly as we go to a larger number of processors. Similarly, the total running time, presented in Table 8 , also goes down signi cantly as more processors are used. It is also of interest to note that when we view simple software caching as our base case, our optimizations taken together exhibit an increasing optimization scalability function (as predicted by the discussion in Section 6. For the 53K mesh, the ratio between the last and rst row of Table 5 increases monotonically from 1.4 for the 16 processor case to 2.9 for the 128 processor case. For the 100K mesh, the ratio between the rst and last row of Table 5 increases from 1.4 to 1.9 in a monotonic fashion. This tells us that for a xed sized problem, our optimizations become increasingly e ective as one increases the number of processors.
Behavior of Paged Translation Table
We performed several experiments to measure the performance of the paged translation table. Table  6 shows the e ects of replication factor on the scheduling time for a 53K node unstructured mesh, and a benchmark input for CHARMM (MbCO + 3,830 water molecules; 14,026 atoms) on a 64-processor iPSC/860. The column labeled \Before" corresponds to performance with the initial block distribution of pages across processors. The column labeled \After" corresponds to the performance after a re- organization of replicated pages according to access behavior on each processor. In this experiment, we vary the number of pages replicated on each processor. As we expected, performance improves as the replication factor increases. For the unstructured mesh, reshu ing of translation table pages does not make much di erence in scheduling time. For the molecular dynamics case the reshu ing makes a large di erence, especially for low replication factors. Table 7 shows the performance of dereferences with varying block sizes for a xed replication factor, R = 0:5. We have observed that reasonable communication times can be obtained with relatively large page sizes. When the page size is decreased, the communication e ciency of the fully replicated case can be achieved without having to replicate all the data associated with the translation table. The PARTI primitives were also used to port CHARMM (120K lines) to the iPSC/860. The entire energy calculation portion of CHARMM has been parallelized. This involves both internal and external energy calculations. The internal energy calculation is not very expensive and accounts for 1% of the total energy calculation time. It calculates the bonded interactions of atoms. The external energy calculation is the most computationally intensive part of the code and accounts for more than 90% of the total energy calculation time. It rst generates the nonbond list and then calculates the nonbonded interactions of atoms. But the nonbond list is not updated every iteration. It is updated every n iterations where n is a variable that can be xed by the user. Currently, n is set to 25.
Atoms are partitioned based on geometric positions to reduce communication time by using recursive coordinate bisection. Since the execution time of external energy calculation depends on the size of the nonbond list, we must partition the nonbond list as balanced as possible. If we simply consider geometrical positions of atoms and assign equal numbers of atoms to processors, it is very likely that the load will be not balanced (see Table 9 ). The index of load balance is calculated by (max n i=1 execution time of processor i) (number of processor n) P n i=1 execution time of processor i
In order to keep the load balanced, i.e. index close to 1, we generate the nonbond list for every atom and use the size of nonbond list as the weight (or load) of each atom. Then we use the weighted recursive coordinate bisection to partition atoms by considering both the positions and loads.
During the internal and external energy calculations, the iteration partitioning is done based on the atom partitioning. The iterations of internal energy calculations are assigned to the processors with the largest number of atoms for the calculations. For the external (nonbonded) energy calculation, every processor generates the nonbond list for the atoms that have been assigned to the processor. And then the Van der Waals and electrostatic force calculations are performed.
We ran a benchmark case (MbCO + 3830 water molecules) and present the timing results in Table 9 and Table 10 . We present these results to make clear that we have found that our runtime support and optimizations can be, and are, being used to port challenging application codes. These results are comparable to all other implementations of which we are aware 4]. We should also note that the optimization scalability function for the weighted coordinate bisection load balancing optimization function takes on values of 1.11, 1.35, 1.69 and 1.65 for 16,32,64 and 128 processor respectively. As one would expect, a better load balancing procedure becomes increasingly important as the number of processors increases.
Conclusion
This paper has described and systematically evaluated a number of new communication optimizations aimed at irregular problems. These optimizations reduce communication latency and volume. We also describe a paged translation table to e ciently support access to globally indexed distributed arrays that are mapped onto processors in an irregular manner. This paged translation table can be con gured to reduce the time required to determine processor assignments for irregularly distributed data. The optimizations that are described in this paper are part of the PARTI primitives. These primitives can be used by parallelizing compilers to generate parallel code from programs written in data parallel languages like HPF, Fortran D etc. The choice of the optimization is done by the compiler after indepth analysis of the sequential code. We presented a framework for characterizing the performance and scalability properties of our optimizations. We introduced the concept of optimization scalability function to characterize the relative e ectiveness of an optimization as one employs varying numbers of processors. Optimization scalability can be viewed in the context of any problem scaling model.
Performance was measured using a family of test loops and an associated synthetic workload generator, using the test loops with data access patterns drawn from unstructured applications, and through the use of templates from real applications.
