Index Translation Schemes for Adaptive Computations
on Distributed Memory Multicomputers by Moon, Bongki et al.
Bongki Moon, Mustafa Uysal, Joel Saltz
Institute for Advanced Computer Studies and Department of Computer Science
University of Maryland, College Park, MD 20742
{bkmoon, uysal, saltz}@cs.umd.edu

Abstract

Current research in parallel programming is focused on closing the gap between globally indexed algorithms and the separate address spaces of processors on distributed memory multicomputers. A set of index translation schemes has been implemented as a part of the CHAOS runtime support library, so that the library functions can be used for implementing a global index space across a collection of separate local index spaces. These schemes include two software-cached translation schemes aimed at adaptive irregular problems as well as a distributed translation table technique for statically irregular problems. To evaluate and demonstrate the efficiency of the software-cached translation schemes, experiments have been performed with an adaptively irregular loop kernel and a full-fledged 3D DSMC code from NASA Langley on the Intel Paragon and Cray T3D. This paper also discusses and analyzes the operational conditions under which each scheme can produce optimal performance.

1 Introduction

Distributed memory multicomputers have been widely used to solve many structured and unstructured problems. Most of the performance gain from distributed memory multicomputers can be obtained by data distribution and load balancing. In such multicomputers, each processor owns a separate local memory and is connected to an interconnection network. Processors communicate with each other by sending and receiving messages across the network. However, since most applications are written assuming a single global index space, the programming task on distributed memory multicomputers often requires a substantial amount of effort.
Thus, current research in parallel programming is focused on closing the gap between globally indexed algorithms, which are independent of the underlying distribution of data, and the separate address spaces of processors.

Compilers for various languages such as Fortran [8] and C++ [4] have been developed to give the illusion of a shared address space on distributed memory multicomputers. For structured problems, compilers such as Fortran D [8] use distribution directives to partition computation across processors. Using the directives, the compilers can statically determine the processor that owns a data item and the processor that requires the value of the data item. The compilers can then generate message passing calls to pass this value directly from the owner processor to the processor that needs it. Another

This work was supported by NASA under contract No. NAG-11560, by ONR under contract No. SC 292-1-22913 and by ARPA under contract No. NAG-11485. The authors assume all responsibility for the contents of the paper.





Data distribution: P0 = {0,1,2,4}, P1 = {7,8,3}, P2 = {10,5,6,9}, P3 = {11}
Number of data items = 12
Number of processors = 4

Global index      0  1  2  3  4  5  6  7  8  9 10 11
Home processor    0  0  0  1  0  2  2  1  1  2  2  3
Local offset      0  1  2  2  3  1  2  0  1  3  0  0

Table entries held by: P0 (global indices 0-2), P1 (3-5), P2 (6-8), P3 (9-11)
Figure 2: Examples of Translation Tables: (a) replicated translation table, (b) distributed translation table

achieve parallelism. Once distributed arrays have been partitioned, each processor ends up with a set of globally indexed distributed array elements. Each element in a distributed array A of size N is assigned to a particular home processor. In order for any processor to be able to access a given element, A(i), of the distributed array, the home processor and local address of A(i) must be determined.

Generally, unstructured problems solved with irregular data distributions perform more efficiently than with regular data distributions such as BLOCK. In the case of irregular data distribution, a translation table is built that, for each array element, lists the home processor and the local offset. If the data is distributed in a BLOCK or CYCLIC manner, the translation table can be simulated with an analytic function. Otherwise, a full-fledged translation table needs to be built. This translation table is used for dereferencing, the process of finding the home processor of a global element and the local offset within that processor. Figure 2(a) illustrates a replicated translation table for a given data distribution of 12 globally indexed data items over 4 processors. For instance, the data item with global index 3 is stored in the third memory location within processor P1. Thus, its home processor (1) and local offset (2) are stored in the fourth entry of the translation table.

Memory considerations make it clear that it is not always feasible to replicate a copy of the translation table on each processor, so the translation table must be distributed across processors. This is accomplished by distributing the translation table by blocks, i.e., putting the first N/P elements on the first processor, the second N/P elements on the second processor, and so on, where P is the number of processors and N is the number of globally indexed data items.
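As an illustration of the tables in Figure 2, the following Python sketch builds the replicated translation table from the data distribution and then splits it into blocks of N/P entries. The function names and the list-of-tuples layout are assumptions made for this example and are not taken from the CHAOS library; the sketch also assumes that P divides N.

```python
# Data distribution of Figure 2: processor -> global indices in local storage order.
DISTRIBUTION = {0: [0, 1, 2, 4], 1: [7, 8, 3], 2: [10, 5, 6, 9], 3: [11]}


def build_translation_table(distribution):
    """Return a list whose entry g is (home processor, local offset) of global index g."""
    n = sum(len(items) for items in distribution.values())
    table = [None] * n
    for proc, items in distribution.items():
        for offset, g in enumerate(items):
            table[g] = (proc, offset)
    return table


def partition_by_blocks(table, num_procs):
    """Split the table into contiguous blocks of N/P entries (assumes P divides N)."""
    block = len(table) // num_procs
    return [table[p * block:(p + 1) * block] for p in range(num_procs)]


replicated = build_translation_table(DISTRIBUTION)
distributed = partition_by_blocks(replicated, num_procs=4)

print(replicated[3])    # (1, 2): global index 3 lives on P1 at local offset 2
print(distributed[1])   # [(1, 2), (0, 3), (2, 1)]: entries for indices 3, 4, 5
```

In the replicated form every processor would hold all 12 entries, whereas in the block-partitioned form each processor holds only 3 of them.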
Figure 2(b) illustrates a distributed translation table obtained by partitioning the replicated translation table given in Figure 2(a).

When an element A(i) of a distributed array A is accessed, the home processor and local offset are found in the portion of the distributed translation table stored in processor $\lfloor iP/N \rfloor$. A dereferencing operation using the distributed translation table requires communication between processors to exchange the information stored in each processor's portion of the distributed translation table.

Figure 3 presents an example of the Jacobi iterative loop parallelized with the CHAOS runtime library [13]. Each processor passes the procedure build_translation_table a list of global indices of
array elements for which it will be responsible. To create a translation table, for example, in the first inspector step I1 of Figure 3, each processor passes an array index(1:n_local_grids) to the runtime function. The array index(1:n_local_grids) stores the set of global indices owned by the processor, which is determined by the current distribution of data. If a given processor needs to access a data item that corresponds to a particular global index i of a specific distributed array, the processor can consult the distributed translation table to find the owner processor and the location of that item within the local memory of the owner processor.

    I1:  ttable = build_translation_table(index, n_local_grids)
    I2:  call dereference(ttable, ia, offseta, proca, n_local_edges)
         call dereference(ttable, ib, offsetb, procb, n_local_edges)
    I3:  call CHAOS functions to generate communication schedule
    L1:  do n = 1, n_steps
    E1:    call CHAOS functions to gather off-processor data elements
    L2:    do i = 1, n_local_edges
             y(ia_local(i)) = 0.85 * x(ia_local(i)) + 0.42 * x(ib_local(i))
             y(ib_local(i)) = 0.88 * x(ia_local(i)) + 0.44 * x(ib_local(i))
           enddo
    E2:    call CHAOS functions to scatter off-processor data elements
    L3:    do i = 1, n_local_grids
             x(i) = y(i)
           enddo
         enddo

Figure 3: An irregular loop parallelized by CHAOS runtime library

    L1:  do n = 1, n_steps
    L2:    do i = 1, n_edges
             y(ia(i)) = 0.85 * x(ia(i)) + 0.42 * x(ib(i))
             y(ib(i)) = 0.88 * x(ia(i)) + 0.44 * x(ib(i))
           enddo
    L3:    do i = 1, n_grids
             x(i) = y(i)
           enddo
    S:     if (mesh redefined) then regenerate ia() and ib()
         enddo

Figure 4: An example code segment of an adaptive irregular loop
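The table lookup described above can be sketched in Python as a single-process simulation. The names `table_blocks` and `simulate_dereference`, the batching of one request per table holder, and the message count are illustrative assumptions; they do not reproduce the CHAOS implementation, which exchanges real messages between processors. The data distribution is the one used in Figure 2.

```python
from collections import defaultdict

N, P = 12, 4
DISTRIBUTION = {0: [0, 1, 2, 4], 1: [7, 8, 3], 2: [10, 5, 6, 9], 3: [11]}

# table_blocks[p] holds the translation entries stored on processor p:
# global index -> (home processor, local offset).
table_blocks = [{} for _ in range(P)]
for home, items in DISTRIBUTION.items():
    for offset, g in enumerate(items):
        table_blocks[g * P // N][g] = (home, offset)


def simulate_dereference(me, global_indices):
    """Translate global indices on behalf of processor `me`.

    Returns (procs, offsets, messages); `messages` counts the remote table
    holders contacted, assuming one batched request per holder.
    """
    requests = defaultdict(list)
    for g in global_indices:
        requests[g * P // N].append(g)   # processor that stores the entry for g
    answers = {}
    messages = 0
    for holder, wanted in requests.items():
        if holder != me:
            messages += 1
        for g in wanted:
            answers[g] = table_blocks[holder][g]
    procs = [answers[g][0] for g in global_indices]
    offsets = [answers[g][1] for g in global_indices]
    return procs, offsets, messages


# Local memories: each stored value equals its global index, for easy checking.
local_x = {p: [float(g) for g in items] for p, items in DISTRIBUTION.items()}

ia = [11, 4, 10]                                   # references made on P3
proca, offseta, messages = simulate_dereference(3, ia)
gathered = [local_x[p][o] for p, o in zip(proca, offseta)]

print(proca, offseta, messages)   # [3, 0, 2] [0, 3, 0] 1
print(gathered)                   # [11.0, 4.0, 10.0]
```

The returned processor and offset arrays can be stored and reused for as long as the indirection array is unchanged, so the communication counted here is paid once rather than in every loop iteration.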
The next inspector step I2 carries out the dereferencing operation. Though this step inherently incurs communication overhead, the cost of dereferencing is amortized over loop iterations as long as the indirection arrays ia() and ib() are not changed.

3 Software-Cached Adaptive Translation Schemes

In static irregular problems such as sparse linear systems and unstructured mesh codes, data access patterns are determined via a level of indirection and the access patterns remain static. Thus, the dereferenced data accesses of globally indexed data items may be reused over loop iterations by storing the index translation information in local memory (for example, the arrays offseta() and proca() returned from the dereference() function in Figure 3). In adaptive irregular applications, which can be found in direct particle simulation and molecular dynamics simulation, however, the data access patterns
may change during the processing of loop iterations. Figure 4 shows the computational structure of such adaptive irregular applications. The data access pattern in loop L2 changes whenever the indirection arrays ia() and ib() are regenerated in the conditional statement S. Since the index translation information stored in local memory then cannot be reused, the globally indexed data items must be dereferenced again whenever the access patterns change.

Data distribution: P0 = {0,1,2,4}, P1 = {7,8,3}, P2 = {10,5,6,9}, P3 = {11}
Number of data items = 12
Number of processors = 4
Page table fields: ptr, refcnt
(a) Paged Translation Table snapshot taken after initial creation at P3
(b) Paged Translation Table snapshot taken after dereferencing requests {11,4,10} processed at P3

Figure 5: Paged Translation Tables

In adaptive applications such as DSMC [3] and CHARMM [5], data access patterns change frequently and irregular data distribution is preferred over regular data distribution for better performance. Thus, minimizing the dereferencing cost is crucial for efficient processing of such applications on distributed memory multicomputers. In such cases, the distributed translation table described in Section 2 tends to be too costly to use, for three main reasons. First, the dereferencing operation inherently requires communication between processors to exchange the translation information. Second, the distribution of the translation table across processors is fixed and bears no particular relationship to the distribution of dereferencing requests. Third, even though a nonlocal global index is dereferenced in several loop instances, the translation information obtained in a previous loop instance cannot be reused in subsequent loop instances unless it is stored explicitly in local memory.

In many cases there is enough memory to partially replicate the translation table. The distributed translation table, however, is not able to replicate portions of the translation table in order to trade memory for improved performance.
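The idea of trading memory for fewer remote lookups can be illustrated with a plain bounded cache placed in front of the remote dereference. This Python sketch is neither the paged nor the hashed scheme of this paper: the class name `CachedTranslator`, the capacity of 6 entries, and the least-recently-used eviction policy are all assumptions chosen for the illustration, and `remote_lookup` merely stands in for a dereference that would require communication.

```python
from collections import OrderedDict


class CachedTranslator:
    """Caches (home, offset) pairs obtained from remote dereferences.

    `capacity` bounds the extra memory; LRU eviction is an assumption
    made for this sketch only.
    """

    def __init__(self, remote_lookup, capacity):
        self.remote_lookup = remote_lookup
        self.capacity = capacity
        self.cache = OrderedDict()
        self.remote_calls = 0

    def translate(self, g):
        if g in self.cache:
            self.cache.move_to_end(g)        # mark as most recently used
            return self.cache[g]
        self.remote_calls += 1               # a miss costs one remote lookup
        entry = self.remote_lookup(g)
        self.cache[g] = entry
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry
        return entry


# Translation table of Figure 2, used here only to answer the "remote" lookups.
HOME = [0, 0, 0, 1, 0, 2, 2, 1, 1, 2, 2, 3]
OFFSET = [0, 1, 2, 2, 3, 1, 2, 0, 1, 3, 0, 0]
translator = CachedTranslator(lambda g: (HOME[g], OFFSET[g]), capacity=6)

# Three loop instances whose access pattern drifts slightly each time.
for pattern in ([4, 10, 5], [4, 10, 7], [10, 7, 5]):
    entries = [translator.translate(g) for g in pattern]

print(translator.remote_calls)   # 4 remote lookups instead of 9
```

Without the cache, each of the nine references would be dereferenced remotely; with it, only the first occurrence of each global index is.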
This section introduces two variations of the distributed translation table which offer reduced overhead by using extra memory: the paged translation table and the hashed translation table. These translation schemes use software caching techniques so that the extra memory can be exploited adaptively for changeable data access patterns and communication latency can be avoided.

3.1 Paged Translation Table

The paged translation table is composed of a page table and a set of page frames. Followed here is the convention found in the virtual memory literature where the memory location associated with each













Data distribution: P0 = {0,1,2,4}, P1 = {7,8,3}, P2 = {10,5,6,9}, P3 = {11}
Number of data items = 12
Replication factor = 0.5, hash function h(x) = x mod 4
(a) Hashed Translation Table snapshot taken after initial creation at P0
(b) Hashed Translation Table snapshot taken after
[Performance plots; horizontal axis: Number of time steps]
Figure 11: Index translation times of the 3D DSMC code (R = 0.2 and S = 16)

respect to varying replication factors. It is also observed that large page frames use up replicated memory fast and may cause severe performance degradation due to page thrashing.

5 Discussion

Through the experimental results presented in this paper, it has been demonstrated that both the paged and hashed translation schemes significantly outperform the distributed translation scheme. When comparing the results from the paged and hashed translation schemes, the hashed translation scheme slightly outperformed the paged translation scheme in most of the cases. This is due mainly to the difference in the granularity of the replicated memory management. That is, the finer-grained memory management of the hashed translation table adapts better to the highly random access patterns encountered in both experiments with the irregular loop kernel and with the NASA Langley DSMC code.

However, it is anticipated that the paged translation scheme will outperform the hashed translation scheme in other applications where the access patterns change slowly and bear high locality. In dereferencing a global index, if the global index has already been cached into the local memory, the paged translation scheme guarantees a constant translation cost. On the other hand, the hashed translation table may suffer from skew built in the hash table. That is, if a particular choice of a hash function generates long lists of collided hash nodes, then the overhead of traversing a long list of hash nodes in discovering the key index may be high. Consequently, to stabilize the performance of the hashed translation scheme, it is necessary to choose a good hash function which does not entail such a hash skew.

When the hash skew hurts the performance of the hashed translation scheme, one of a collection of randomly generated hash functions can be selected to ensure good performance.
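A minimal Python sketch of hash skew, and of selecting among randomly generated hash functions, is given here. The functions have the form h(x) = ((a x + b) mod p) mod m, in the style of the universal classes of Carter and Wegman [6]; whether this matches the class used in the CHAOS implementation is an assumption, as are the names `make_hash`, `longest_chain` and `pick_hash`, the prime p, and the use of the longest collision list as the selection criterion.

```python
import random

PRIME = 2_147_483_647   # 2**31 - 1, a prime larger than any key used below


def make_hash(a, b, m):
    """Return h(x) = ((a*x + b) mod PRIME) mod m."""
    return lambda x: ((a * x + b) % PRIME) % m


def longest_chain(h, keys, m):
    """Length of the longest collision list when `keys` are hashed into m buckets."""
    counts = [0] * m
    for k in keys:
        counts[h(k)] += 1
    return max(counts)


def pick_hash(keys, m, candidates=8, seed=0):
    """Try several random functions; keep the one with the shortest worst chain."""
    rng = random.Random(seed)
    best = None
    for _ in range(candidates):
        h = make_hash(rng.randrange(1, PRIME), rng.randrange(PRIME), m)
        worst = longest_chain(h, keys, m)
        if best is None or worst < best[0]:
            best = (worst, h)
    return best


keys = list(range(0, 4000, 4))                    # every key is a multiple of 4
skewed = longest_chain(lambda x: x % 4, keys, 4)
print(skewed)                                     # 1000: all keys fall into bucket 0

worst, h = pick_hash(keys, m=4)
print(worst)                                      # worst chain of the selected function
```

With the key set above, h(x) = x mod 4 places every key in a single list, whereas a function drawn from the random family typically spreads the keys over all four buckets; the shortest possible worst chain here is 250.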
The selection is done by simulating the use of the individual functions with the globally indexed data items owned by each processor. For this purpose, the current implementation of the hashed translation scheme allows the option of choosing a hash function from a universal$_2$ class of hash functions $H_1$ defined in [6]. It has been shown experimentally that, for a given set of keys, choosing functions at random from the class $H_1$ achieves the theoretically predicted performance of the hash functions in practice, independent of the key distribution [11].

These translation schemes have been implemented as a part of the CHAOS runtime support library on various distributed memory multicomputers such as the Intel Paragon, IBM SP-1/2, Thinking Machines CM-5 and Cray T3D. The current implementation uses vendor-supplied message passing libraries. It should be noted, though, that the translation schemes have been further optimized using the low latency shared memory functions on the Cray T3D [1]. The shared memory functions copy blocks of data directly from one processor's memory to another. These shared memory functions remove a substantial amount of overhead for synchronization. The last two columns in Table 1 demonstrate the optimized performance obtained from the Cray T3D over that from the Intel Paragon.

Another issue of the translation schemes is the memory requirement. Suppose that N is the total number of global indices, P is the number of processors, S is the page size, and R is the replication factor. Then the memory complexity of the paged translation scheme is given by $O(N(\frac{1}{S} + R))$. In order to keep the amount of replicated memory scalable with large numbers of processors and large problems, it is desirable to make the page size S proportional to the number of processors P. However, a large page size may result in severe performance degradation due to page thrashing. Thus, it may be a complicated process to choose an optimal page size under various situations. On the other hand, the hashed translation scheme requires $O(N(\frac{1}{P} + R))$ memory, which makes the hashed translation scheme ideally scalable.
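To make the two bounds concrete, the sketch below evaluates N(1/S + R) and N(1/P + R) with constant factors ignored. The problem size N = 1,000,000 and the processor counts are hypothetical; R = 0.2 and S = 16 are the parameter values quoted in the caption of Figure 11. The function names are chosen for this example only.

```python
def paged_bound(n, page_size, replication):
    """Evaluate N * (1/S + R), the bound stated for the paged scheme."""
    return n * (1.0 / page_size + replication)


def hashed_bound(n, num_procs, replication):
    """Evaluate N * (1/P + R), the bound stated for the hashed scheme."""
    return n * (1.0 / num_procs + replication)


N, R = 1_000_000, 0.2
for P in (32, 128, 512):
    fixed_page = paged_bound(N, page_size=16, replication=R)   # S held at 16
    scaled_page = paged_bound(N, page_size=P, replication=R)   # S grows with P
    hashed = hashed_bound(N, num_procs=P, replication=R)
    print(P, round(fixed_page), round(scaled_page), round(hashed))
```

With S fixed, the paged bound does not shrink as P grows, while choosing S = P makes the two expressions coincide at the price of larger pages.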
Accordingly, the hashed translation table may be more desirable in a situation where the memory constraint is tight.

6 Conclusions

This paper has presented a set of index translation schemes for implementing a user-level global index space across a collection of local index spaces on distributed memory multicomputers. These schemes have been incorporated into the CHAOS runtime support library so that calls to the library functions can be generated by compilers.

For unstructured problems with irregular data distributions, a distributed translation table can be built to list the home processor and offset for each globally indexed data item. Cached translation schemes use software caching techniques to reduce the dereferencing costs for adaptive irregular applications which require frequent index translations. Experiments have been performed with an adaptively irregular loop kernel and a 3-dimensional NASA Langley DSMC code. It has been observed that the software-cached translation schemes significantly outperform the distributed translation table for such problems with changeable data access patterns. For example, the hashed translation scheme achieved about 46 percent improvement with the DSMC code on the 32-node Paragon.

The performance of the software-cached translation schemes is sensitive to the choice of the parameters that govern them. Future work may include the extension of these schemes so that automatic selection of the parameters can be done using runtime information such as the amount of available memory and the fraction of locally accessed global indices.

Acknowledgments

The authors would like to thank Richard Wilmoth at NASA Langley for the use of DSMC production codes. Access to the Caltech CCSF Paragon and the Jet Propulsion Laboratory Cray T3D was provided by the Center for Research on Parallel Computation.
References

[1] Ray Barriuso and Allan Knies. Shmem user's guide. Report, Cray Research, Inc., April 1994.
[2] T. J. Bartel and S. J. Plimpton. DSMC simulation of rarefied gas dynamics on a large hypercube supercomputer, AIAA-92-2860. In Proceedings of the 27th AIAA Thermophysics Conference, Nashville, TN, June 1992.
[3] Graeme A. Bird. Molecular Gas Dynamics and the Direct Simulation of Gas Flows. Clarendon Press, Oxford, 1994.
[4] F. Bodin, P. Beckman, D. Gannon, S. Yang, S. Kesavan, A. Malony, and B. Mohr. Implementing a parallel C++ runtime system for scalable parallel system. In Proceedings Supercomputing '93, pages 588-597. IEEE Computer Society Press, November 1993.
[5] B. R. Brooks and M. Hodoscek. Parallelization of CHARMM for MIMD machines. Chemical Design Automation News, 7, 1992.
[6] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18:143-154, 1979.
[7] R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy. The design and implementation of a parallel unstructured Euler solver using software primitives, AIAA-92-0562. In Proceedings of the 30th Aerospace Sciences Meeting, January 1992.
[8] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Department of Computer Science Technical Report TR90-141, Rice University, December 1990.
[9] Milan Milenkovic. Operating Systems Concepts and Design. McGraw-Hill, Inc., 1987.
[10] Bongki Moon and Joel Saltz. Adaptive runtime support for direct simulation Monte Carlo methods on distributed memory architectures. In Proceedings of the Scalable High Performance Computing Conference (SHPCC-94), pages 176-183, Knoxville, TN, May 1994. IEEE Computer Society Press.
[11] M. V. Ramakrishna. Hashing in practice, analysis of hashing and universal hashing. In Proceedings of the 1988 ACM SIGMOD, pages 191-199, June 1988.
[12] Steven K. Reinhardt, James R. Larus, and David A. Wood. Tempest and Typhoon: User-level shared memory. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 325-336. IEEE Computer Society Press, April 1994.
[13] J. Saltz et al. A manual for the CHAOS runtime library. Technical report, University of Maryland, Department of Computer Science and UMIACS, 1993.
[14] Joel Saltz, Harry Berryman, and Janet Wu. Multiprocessors and runtime compilation. Technical Report 90-59, ICASE, NASA Langley Research Center, September 1990.
[15] Joel Saltz, Kathleen Crowley, Ravi Mirchandaney, and Harry Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8(4):303-312, April 1990.
[16] Richard G. Wilmoth. Direct simulation Monte Carlo analysis of rarefied flows on parallel processors. AIAA Journal of Thermophysics and Heat Transfer, 5(3):292-300, July-Sept. 1991.
