Introduction
Distributed memory multicomputers have been widely used to solve many structured and unstructured problems. Most of the performance gain from distributed memory multicomputers can be obtained by data distribution and load balancing. In such multicomputers, each processor owns a separate local memory and is connected to an interconnection network. Processors communicate with each other by sending and receiving messages across the network. However, since most applications are written assuming a single global index space, the programming task on distributed memory multicomputers often requires a substantial amount of effort. Thus, current research in parallel programming is focused on closing the gap between globally indexed algorithms, which are independent of the underlying distribution of data, and the separate address spaces of processors.
Compilers for various languages such as Fortran [8] and C++ [4] have been developed to give the illusion of a shared address space on distributed memory multicomputers. For structured problems, compilers such as Fortran D [8] use distribution directives to partition computation across processors. Using the directives, the compilers can statically determine the processor that owns a data item and the processor that requires the value of the data item. The compilers can then generate message passing calls to pass this value directly from the owner processor to the processor that needs it. Another approach, called Distributed Shared Memory (DSM), enables an application's user-level code to support shared memory and message passing efficiently [12]. Distributed shared memory is typically supported by the processor's address translation hardware.
This paper describes and evaluates a set of index translation schemes for implementing a global index space across a collection of distributed memories. These schemes can be incorporated into a runtime support library so that calls to the library functions can be invoked by manually parallelized programs or can be generated by compilers [14]. These schemes can also be incorporated into distributed shared memory systems, such as the one used in the Wisconsin Wind Tunnel project, to support user-level shared memory [12].
To illustrate the need for runtime support for address translation, consider the Jacobi iterative method for solving a partial differential equation on an irregular numerical grid, which arises in molecular dynamics codes and sparse linear solvers. A typical example loop of such an irregular computation is presented in Figure 1. The update of each grid point depends only on the values at neighboring grid points from the previous iteration. Since the grid structure of such irregular problems is determined only at runtime, compilers cannot fully analyze and translate globally indexed memory accesses. For instance, in Figure 1 [...] Thus, communication patterns between processors must be determined at runtime, and accordingly the globally indexed data accesses must also be translated at runtime.
The rest of the paper is organized as follows. Section 2 describes a distributed translation scheme which allows us to map a globally indexed distributed array, and how this scheme can be used in parallel computation. Section 3 introduces new adaptive translation schemes which reduce the overhead of index translation by using software caching techniques. Experimental results obtained with an irregular loop kernel and a direct particle simulation application are presented in Section 4. In Section 5, we compare the performance of the software-cached translation schemes and discuss the operational conditions under which each scheme can produce optimal performance.
Distributed Translation Table
This section describes a distributed translation table which allows us to map a globally indexed distributed array onto processors in an arbitrary fashion. It also briefly outlines how the distributed translation table can be used in the preprocessing stage of the inspector/executor model of parallelization [15, 7].
On distributed memory machines, large data arrays may not fit in a single processor's memory, hence they are divided among processors. Computational work is also divided among individual processors to achieve parallelism. Once distributed arrays have been partitioned, each processor ends up with a set of globally indexed distributed array elements. Each element in a distributed array A of size N is assigned to a particular home processor. In order for any processor to be able to access a given element, A(i), of the distributed array, the home processor and local address of A(i) must be determined.
Generally, unstructured problems solved with irregular data distributions perform more efficiently than with regular data distributions such as BLOCK. In the case of irregular data distribution, a translation table is built that, for each globally indexed data item, lists the home processor and offset. Memory considerations make it clear that it is not always feasible to replicate a copy of the translation table on each processor, so the translation table must be distributed across processors. This is accomplished by distributing the translation table by blocks, i.e., putting the first N/P elements on the first processor, the second N/P elements on the second processor, and so on, where P is the number of processors and N is the number of globally indexed data items. Figure 2 [...] processor needs to access a data item that corresponds to a [...]
L1: do n = 1, nsteps
L2: do i = 1, n_edges
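The block distribution of the translation table described above can be sketched in a few lines. This is a minimal single-process illustration, not the CHAOS implementation: the function names, the 0-based indexing, and the example `home` array are assumptions made here, and the list `pieces[q]` merely stands in for the part of the table that processor q would hold in its own memory.

```python
def block_size(n, p):
    """Number of table entries per processor, i.e. ceil(n / p)."""
    return -(-n // p)

def build_translation_table(home, p):
    """Build the table pieces; pieces[q] holds the entries stored on processor q.

    home[i] is the (arbitrary) home processor of global index i. Offsets are
    assigned in order of appearance on each home processor (an assumption).
    """
    n = len(home)
    b = block_size(n, p)
    next_offset = [0] * p
    pieces = [[] for _ in range(p)]
    for i, proc in enumerate(home):
        pieces[i // b].append((proc, next_offset[proc]))
        next_offset[proc] += 1
    return pieces

def dereference(i, pieces, n):
    """Look up (home processor, offset) of global index i.

    On a real machine, pieces[i // b] lives on another processor in general,
    so this lookup would involve a message exchange.
    """
    b = block_size(n, len(pieces))
    return pieces[i // b][i % b]

# Hypothetical irregular distribution of 12 data items over 4 processors.
home = [0, 2, 1, 3, 0, 1, 2, 3, 1, 0, 3, 2]
pieces = build_translation_table(home, 4)
print(dereference(11, pieces, len(home)))
```

Note that the processor holding the table entry for index i (here `i // b`) is in general different from the home processor of A(i), which is why dereferencing a single index can require communication.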
Software-Cached Adaptive Translation Schemes
In static irregular problems such as sparse linear systems and unstructured mesh codes, data access patterns are determined via a level of indirection and the access patterns remain static. Thus, the dereferenced data accesses of globally indexed data items may be reused over loop iterations by storing the index translation information in local memory (for example, the arrays offseta() and proca() returned from the dereference() function in Figure 3). In adaptive irregular applications, which can be found in direct particle simulation and molecular dynamics simulation, however, the data access patterns may change during the processing of loop iterations. Figure 4 shows the computational structure of such adaptive irregular applications. The data access pattern in loop L2 changes whenever the indirection arrays ia() and ib() are regenerated in the conditional statement S. Then, since the index translation information stored in local memory cannot be reused, the globally indexed data items must be dereferenced whenever the access patterns change.
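The difference between the static and the adaptive case can be illustrated with a small sketch of the inspector/executor pattern. It is only an illustration under stated assumptions: `inspector`, `run`, and `indirection_for_step` are hypothetical names, and the dereference function is a stand-in for whichever translation scheme is in use.

```python
def inspector(indirection, dereference):
    """Translate every global index in the indirection array once."""
    return [dereference(g) for g in indirection]

def run(nsteps, indirection_for_step, dereference):
    """Count how often the inspector has to be rerun over nsteps time steps."""
    translations = None
    previous = None
    inspections = 0
    for step in range(nsteps):
        ia = indirection_for_step(step)
        if ia != previous:
            # The access pattern changed, so stored translations are stale.
            translations = inspector(ia, dereference)
            previous = list(ia)
            inspections += 1
        # The executor phase would use `translations` here to gather and
        # scatter off-processor data.
    return inspections

def toy_dereference(g):
    """Hypothetical translation: 4 processors, cyclic layout."""
    return (g % 4, g // 4)

static_cost = run(5, lambda step: [3, 1, 2], toy_dereference)
adaptive_cost = run(5, lambda step: [step, step + 1], toy_dereference)
print(static_cost, adaptive_cost)
```

In the static case the inspector runs once and its result is reused; in the adaptive case it runs again every time the indirection array is regenerated, which is the cost the schemes in this section aim to reduce.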
In adaptive applications such as DSMC [3] and CHARMM [5], data access patterns change frequently and irregular data distribution is preferred over regular data distribution for better performance. Thus, minimization of the dereferencing cost is crucial for efficient processing of such applications on distributed memory multicomputers. In such cases, the distributed translation table described in Section 2 tends to be too costly to use. There are three main reasons. First, the dereferencing operation inherently requires communication between processors to exchange the translation information. Second, the distribution of the translation table across processors is fixed and bears no particular relationship to the distribution of dereferencing requests. Third, even though a nonlocal global index is dereferenced in several loop instances, the translation information obtained in the previous loop instance cannot be reused in the subsequent loop instances unless it is stored explicitly in local memory.
In many cases there is enough memory to partially replicate the translation table. The distributed translation table is not able to replicate portions of the translation table in order to trade memory for improved performance. This section introduces two variations of the distributed translation table which offer reduced overhead by using extra memory: the paged translation table and the hashed translation table. These translation schemes use software caching techniques so that the extra memory can be exploited adaptively for changeable data access patterns and communication latency can be avoided.
Paged Translation Table
The paged translation table is composed of a page table and a set of page frames. We follow the convention found in the virtual memory literature, where the memory location associated with each page is called a page frame. The process of generating the paged translation table is governed by two adjustable parameters, a page size S and a replication factor R. The replication factor R is defined as the fraction of the maximum number of pages for which extra frames are allocated by each processor. In this scheme, the translation [...] i-th page table entry is null, then a page fault occurs. When a page fault occurs, the distributed translation table is referenced to translate the global index which caused the page fault. Then, a page frame is fetched from the page pool and the home processor and offset of the global index are stored in it. Figure 5(b) shows a snapshot of the paged translation table taken after the set of dereferencing requests {11, 4, 10} is processed.
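The page-fault mechanism can be sketched as follows. This is a simplified, single-process illustration, not the authors' code: the class and attribute names are hypothetical, the table starts with every page table entry null, the whole page is translated on a fault, and when the page pool is exhausted the translation is simply returned without being cached. The `backend` callable stands in for a lookup in the distributed translation table.

```python
import math

class PagedTranslationTable:
    """Per-processor software cache of translations, organized in pages."""

    def __init__(self, n, page_size, replication, backend):
        self.n = n
        self.page_size = page_size
        self.backend = backend                 # global index -> (proc, offset)
        num_pages = math.ceil(n / page_size)
        self.page_table = [None] * num_pages   # None marks a null entry
        self.free_frames = int(replication * num_pages)
        self.faults = 0

    def dereference(self, i):
        page, slot = divmod(i, self.page_size)
        frame = self.page_table[page]
        if frame is None:                      # page fault
            self.faults += 1
            if self.free_frames == 0:
                return self.backend(i)         # pool exhausted: do not cache
            self.free_frames -= 1
            lo = page * self.page_size
            hi = min(lo + self.page_size, self.n)
            # Assumption: every index of the faulting page is translated.
            frame = [self.backend(g) for g in range(lo, hi)]
            self.page_table[page] = frame
        return frame[slot]

backend_calls = []

def toy_backend(g):
    """Hypothetical stand-in for the distributed translation table."""
    backend_calls.append(g)
    return (g % 4, g // 4)

table = PagedTranslationTable(n=12, page_size=2, replication=0.5,
                              backend=toy_backend)
results = [table.dereference(g) for g in (11, 4, 10)]
print(table.faults, backend_calls)
```

With the request sequence {11, 4, 10} and a page size of 2, the third request hits the page brought in by the first one, so only two page faults occur.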
In many cases where the replication factor is chosen to be less than one, page faults may occur while no unused page frames are available in the page pool. There are basically two options for handling this situation: the home processors and offsets obtained by referencing the distributed translation table may be returned without being stored in the paged translation table, or a page may be evicted to make room for the incoming one. The latter was chosen, since it adapts to the variation of data access patterns. A replacement policy governs the choice of the victim when eviction of pages is in order. Since the well-known LRU (Least-Recently-Used) page replacement algorithm imposes too much overhead to be handled by software alone, we implemented the NRU (Not-Recently-Used) page replacement algorithm, one of the approximations of LRU, using the reference counters in the page table [9]. This issue is addressed further in Section 5.
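A not-recently-used policy can be sketched with one reference flag per cached page. The sketch below is a generic NRU approximation written for illustration; the class name, the use of a boolean flag in place of the reference counters mentioned above, and the tie-breaking rule when every page has been referenced are all assumptions rather than details taken from the implementation.

```python
class NRUPageCache:
    """Fixed number of page frames with not-recently-used eviction."""

    def __init__(self, num_frames, load_page):
        self.num_frames = num_frames
        self.load_page = load_page   # page number -> page contents
        self.frames = {}             # page number -> cached contents
        self.referenced = {}         # page number -> reference flag

    def clear_reference_bits(self):
        """Called periodically, e.g. once per time step (an assumption)."""
        for page in self.referenced:
            self.referenced[page] = False

    def _pick_victim(self):
        for page, flag in self.referenced.items():
            if not flag:
                return page          # not referenced since the last clearing
        # Every page was referenced recently: reset all flags and evict the
        # page that has been resident the longest.
        self.clear_reference_bits()
        return next(iter(self.referenced))

    def access(self, page):
        if page not in self.frames:
            if len(self.frames) >= self.num_frames:
                victim = self._pick_victim()
                del self.frames[victim]
                del self.referenced[victim]
            self.frames[page] = self.load_page(page)
        self.referenced[page] = True
        return self.frames[page]

cache = NRUPageCache(num_frames=2, load_page=lambda page: page * 10)
cache.access(0)
cache.access(1)
cache.clear_reference_bits()
cache.access(1)          # page 1 is referenced again, page 0 is not
cache.access(2)          # evicts page 0
print(sorted(cache.frames))
```

Unlike exact LRU, this keeps no ordering of accesses; it only distinguishes pages touched since the last clearing from those that were not, which keeps the bookkeeping per access to a single flag update.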
Hashed Translation Table
As with the paged translation table, a distributed translation table is built up as a back-end data structure. When a hashed translation table is initially created, each processor stores only the translation information for the globally indexed data items which the processor owns. Specifically, if a processor owns a global index i, the processor adds a hash node to the h(i)-th entry of its hash table. Figure 6(a) depicts an initial hashed translation table for the same data distribution given in Figure 2 and Figure 5. Since the number of processors P is 4 and the replication factor R is 0.5, each processor creates a hash table of size 4 and a node pool with 6 unused hash nodes. Then, a list of (global index, processor, offset) triplets for locally owned data items is stored in the hashed translation [...]
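A per-processor hashed translation table along these lines might look like the following sketch. It is illustrative only: the class name, the hash function h(i) = i mod table size, and the oldest-first eviction of cached nodes when the node pool is empty are assumptions, and `backend` again stands in for a lookup in the distributed translation table.

```python
from collections import deque

class HashedTranslationTable:
    """Chained hash table of (global index, processor, offset) nodes."""

    def __init__(self, table_size, pool_size, owned, backend):
        self.table_size = table_size
        self.buckets = [[] for _ in range(table_size)]
        self.pool = pool_size        # unused hash nodes left in the pool
        self.backend = backend       # global index -> (proc, offset)
        self.cached = deque()        # cached nonlocal indices, oldest first
        for g, (proc, offset) in owned.items():
            self.buckets[self._h(g)].append((g, proc, offset))

    def _h(self, g):
        return g % self.table_size   # assumed hash function

    def _find(self, g):
        for node in self.buckets[self._h(g)]:
            if node[0] == g:
                return node
        return None

    def dereference(self, g):
        node = self._find(g)
        if node is None:
            proc, offset = self.backend(g)
            if self.pool == 0 and not self.cached:
                return proc, offset              # nothing can be cached
            if self.pool == 0:
                old = self.cached.popleft()      # evict oldest cached node
                self.buckets[self._h(old)].remove(self._find(old))
                self.pool += 1
            self.pool -= 1
            node = (g, proc, offset)
            self.buckets[self._h(g)].append(node)
            self.cached.append(g)
        return node[1], node[2]

remote_lookups = []

def toy_backend(g):
    """Hypothetical stand-in for the distributed translation table."""
    remote_lookups.append(g)
    return (g % 4, g // 4)

owned = {0: (1, 0), 5: (1, 1), 8: (1, 2)}        # items owned by processor 1
table = HashedTranslationTable(table_size=4, pool_size=2,
                               owned=owned, backend=toy_backend)
first = [table.dereference(g) for g in (5, 3, 3, 6)]
print(first, remote_lookups)
```

Because each cached translation occupies exactly one node, the extra memory is consumed one index at a time rather than one page at a time. Nodes for locally owned items are never evicted in this sketch.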
Experiments with an irregular loop kernel
The sample adaptive loop described in Figure 4 was run with an irregular mesh with 100,000 grid points. To emphasize the effect of the index translation operation, we assumed that the structure of the mesh is redefined every time step by the statement S in Figure 4. Thus, [...] Though the actual costs of the three index translation schemes differ by an order of magnitude on different machines, the index translation schemes show common characteristics. While the performance of the distributed translation scheme is almost invariant during all the time steps, the costs of the other schemes are much higher in the initial time steps and lower in the remaining time steps than that of the distributed translation scheme. This is due to the fact that a number of nonlocal global indices are translated and cached into the paged or hashed translation table in the initial time steps, and most of the global indices are translated locally in the subsequent time steps. It is also observed that the cost of the paged translation scheme is much higher than that of the hashed translation scheme in the initial time steps. This is due to the coarse-grained memory management of the paged translation scheme. In other words, the paged translation scheme needs to translate and cache all the indices in the page frames that are brought into local memory. This property increases the number of dereferencing requests beyond the required number of indices.
The effect of page size on the performance of the paged translation scheme is shown in Figure 9 and Figure 10. The replication factor was 0.05 in the experiments performed on the 512-node Paragon, and it was 0.10 in the experiments performed on the 128-node T3D. The results shown in both figures indicate that the performance of the paged translation scheme is very sensitive to the choice of page size. The paged translation scheme with relatively small page sizes significantly outperformed the distributed translation scheme. However, when larger page sizes were chosen, the performance of the paged translation scheme became even worse than that of the distributed translation scheme. Such performance degradation is mainly due to page thrashing. Page thrashing is more likely to happen with page frames of larger size because the larger the page size, the higher the ratio of page faults.
Figure 10: Index translation times with varying page sizes on the 128-node Cray T3D (R = 0.10)
The DSMC method includes movement and collision handling of simulated particles on a spatial flow field domain overlaid by a Cartesian mesh. The spatial location of each particle is associated with a Cartesian mesh cell. The key concept of the DSMC method is that particle movement is decoupled from particle collisions. That is, the computation of a time step can be split into the calculation of physical quantities of collided particles and the relocation of moved particles. Furthermore, since the computations associated with performing probabilistic chemistry and collisions can be distributed across processors cell by cell, the DSMC method is in principle a good match for parallel processing on distributed memory multicomputers [16, 10].
Changes in position coordinates may cause the particles to move across cell boundaries. In the particular corner flow DSMC code presented here, about 30 percent of the particles change their cell locations every time step. However, particle movements are local enough that particles only move between neighboring cells. The relocation of particles must be done every time step to move them to their new cells. Thus, the index translation must also be done every time step to find the particles' new owner processors. The corner flow DSMC code simulates a 3-dimensional flow field with 77,760 cells and about 600,000 particles. Figure 11 shows the index translation costs of the three translation schemes measured at each time step on the 53-node Paragon. The experiments discussed here focused on the first 80 time steps of the transient phase. During the transient time steps, the number of particles keeps increasing because the number of entering particles is greater than that of leaving particles. This explains the fact that the cost of the distributed translation scheme increases as the computation proceeds. Another key point of the experiments is that the problem domain (that is, the cells) of the DSMC code is repartitioned across processors periodically to balance the work load. In these particular experiments, the domain was repartitioned every 20 time steps. If the problem domain is repartitioned, a translation table must be regenerated and the cached information of nonlocal global indices must be invalidated. Thus, the costs of the paged translation and hashed translation schemes are far higher in the time steps after domain repartitioning because a number of nonlocal global indices are translated and cached into the paged or hashed translation table. Table 1 shows the performance of the translation schemes with varying replication factors. The numbers in parentheses represent the ratio of the translation time to the total elapsed time.
The experiments shown in this table were carried out with the same corner flow DSMC code simulating 9,720 cells and about 50,000 particles. The performance numbers were measured in seconds for the first 200 time steps on the 32-node Paragon, except for the last column, which was obtained using the hashed translation scheme on the 32-node T3D. The replication factor has a significant effect on the performance of the paged and hashed translation schemes. However, when the replication factor becomes large enough to avoid frequent page or node replacements, the performance is almost invariant with respect to varying replication factors. It is also observed that large page frames use up replicated memory fast and may cause severe performance degradation due to page thrashing.
Discussion
Through the experimental results presented in this paper, it has been demonstrated that both the paged and hashed translation schemes significantly outperform the distributed translation scheme. When comparing the results from the paged and hashed translation schemes, the hashed translation scheme slightly outperformed the paged translation scheme in most of the cases. This is due mainly to the difference in the granularity of the replicated memory management. That is, the finer-grained memory management of the hashed translation table adapts better to the highly random access patterns encountered in the experiments with both the irregular loop kernel and the NASA Langley DSMC code.
However, it is anticipated that the paged translation scheme will outperform the hashed translation scheme in other applications where the access patterns change slowly and bear high locality. In dereferencing a global index, if the global index has already been cached in local memory, the paged translation scheme guarantees a constant translation cost. On the other hand, the hashed translation table may suffer from skew built into the hash table. That is, if a particular choice of hash function generates long lists of collided hash nodes, then the overhead of traversing a long list of hash nodes to find the key index may be high. Consequently, to stabilize the performance of the hashed translation scheme, it is necessary to choose a good hash function which does not entail such hash skew.
When hash skew hurts the performance of the hashed translation scheme, one of a collection of randomly generated hash functions can be selected to ensure good performance. This is done by simulating the use of the individual functions with the globally indexed data items owned by each processor. For this purpose, the current implementation of the hashed translation scheme allows the option of choosing a hash function from a universal_2 class of hash functions H1 defined in [6]. It has been shown experimentally that for a given set of keys, by choosing functions at random from the class H1, the theoretically predicted performance of the hash functions can be achieved in practice, independent of the key distribution [11].
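The selection procedure can be illustrated with a short sketch. It assumes that the class takes the commonly used form h(x) = ((a·x + b) mod p) mod m with a prime p larger than any key; whether this matches the definition in [6] exactly is an assumption here, as are the function names, the number of trials, and the use of the longest chain as the selection criterion.

```python
import random

PRIME = 2_147_483_647   # 2**31 - 1, assumed to exceed every global index

def random_hash(m, rng):
    """Draw one function h(x) = ((a*x + b) mod PRIME) mod m at random."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda x: ((a * x + b) % PRIME) % m

def longest_chain(h, keys, m):
    """Simulate inserting the keys and return the longest collision chain."""
    counts = [0] * m
    for k in keys:
        counts[h(k)] += 1
    return max(counts)

def choose_hash(keys, m, trials=8, seed=0):
    """Try several random functions on the owned keys and keep the best."""
    rng = random.Random(seed)
    candidates = [random_hash(m, rng) for _ in range(trials)]
    return min(candidates, key=lambda h: longest_chain(h, keys, m))

# Skewed example: every owned index is a multiple of 4, so the naive
# function x mod 4 would place all 100 keys in a single chain.
owned_keys = list(range(0, 400, 4))
table_size = 4
naive_worst = longest_chain(lambda x: x % table_size, owned_keys, table_size)
chosen = choose_hash(owned_keys, table_size)
chosen_worst = longest_chain(chosen, owned_keys, table_size)
print(naive_worst, chosen_worst)
```

The simulation only needs the indices a processor owns, so it can be run locally before the hash table is populated.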
These translation schemes have been implemented as a part of the CHAOS runtime support library on various distributed memory multicomputers such as the Intel Paragon, IBM SP-1/2, Thinking Machines CM-5 and Cray T3D. The current implementation uses vendor-supplied message passing libraries. It should be noted, though, that the translation schemes have been further optimized using the low latency shared memory functions on the Cray T3D [1]. The shared memory functions copy blocks of data directly from one processor's memory to another. These shared memory functions remove a substantial amount of synchronization overhead. The last two columns in Table 1 demonstrate the optimized performance obtained on the Cray T3D relative to that on the Intel Paragon.
Another issue with the translation schemes is their memory requirement. Suppose that N is the total number of global indices, P is the number of processors, S is the page size, and R is the replication factor. Then, the memory complexity of the paged translation scheme is given by O(N × (1/S + R)).
In order to keep the amount of replicated memory scalable with large numbers of processors and large problems, it is desirable to make the page size S proportional to the number of processors P. However, the need for a large page size may result in severe performance degradation due to page thrashing. Thus, choosing an optimal page size under various situations may be a complicated process. On the other hand, the hashed translation scheme requires O(N × (1/P + R)) memory, which makes the hashed translation scheme ideally scalable. Accordingly, the hashed translation table may be more desirable in situations where the memory constraint is tight.
Conclusions
This paper has presented a set of index translation schemes for implementing a user-level global index space across a collection of local index spaces on distributed memory multicomputers. These schemes have been incorporated into the CHAOS runtime support library so that calls to the library functions can be generated by compilers.
For unstructured problems with irregular data distributions, a distributed translation table can be built to list the home processor and offset for each globally indexed data item. Cached translation schemes use software caching techniques to reduce the dereferencing costs for adaptive irregular applications which require frequent index translations. Experiments have been performed with an adaptively irregular loop kernel and a 3-dimensional NASA Langley DSMC code. It has been observed that the software-cached translation schemes significantly outperform the distributed translation table for such problems with changeable data access patterns. For example, the hashed translation scheme achieved about 46 percent improvement with the DSMC code on the 32-node Paragon.
The performance of the software-cached translation schemes is sensitive to the choice of the parameters that govern them. Future work may include the extension of these schemes so that the parameters can be selected automatically using runtime information such as the amount of available memory and the fraction of locally accessed global indices.
