Adaptive Runtime Support for Direct Simulation Monte Carlo
Methods on Distributed Memory Architectures by Moon, Bongki & Saltz, Joel
Bongki Moon    Joel Saltz
Institute for Advanced Computer Studies and
Department of Computer Science
University of Maryland
College Park, MD 20742
{bkmoon, saltz}@cs.umd.edu

Abstract

In highly adaptive irregular problems such as many Particle-In-Cell (PIC) codes and Direct Simulation Monte Carlo (DSMC) codes, data access patterns may vary from time step to time step. This fluctuation may hinder efficient utilization of distributed memory parallel computers because of the resulting overhead for data redistribution and dynamic load balancing. To efficiently parallelize such adaptive irregular problems on distributed memory parallel computers, several issues such as effective methods for domain partitioning and fast data transportation must be addressed. This paper presents efficient runtime support methods for such problems. A simple one-dimensional domain partitioning method is implemented and compared with unstructured mesh partitioners such as recursive coordinate bisection and recursive inertial bisection. A remapping decision policy has been investigated for dynamic load balancing on 3-dimensional DSMC codes. Performance results are presented.

1 Introduction

In sparse and unstructured problems, patterns of data access cannot be predicted until runtime. Since this prevents compile time optimization, effective use of distributed memory parallel computers may be achieved by utilizing a variety of profitable pre-processing strategies at runtime. In this class of problems, once data access patterns are known, the pre-processing strategies can exploit the knowledge to partition work, to map global and local data structures and to schedule the movement of data between the processor memories. The goal of the partitioning is to balance the computational load and reduce the net communication volume.
Once data and work have been partitioned among processors, the prior knowledge of data access patterns makes it possible to build up a communication schedule that depicts which data elements need to be exchanged between processors. This communication schedule remains unchanged as long as the data access patterns remain unchanged.

In past work a PARTI runtime support library has been developed for a class of irregular but relatively static problems in which data access patterns do not change during computation [14, 4, 5]. To parallelize such problems, the PARTI runtime primitives coordinate interprocessor data movement, manage the storage of, and access to, copies of off-processor data, and partition work and data structures.

This work was supported by NASA under contract No. NAG-11560, by ONR under contract No. SC 292-1-22913 and by ARPA under contract No. NAG-11485. The authors assume all responsibility for the contents of the paper.

PARTI primitives have been successfully used to port a variety of unstructured mesh single- and multi-grid codes and molecular dynamics codes onto distributed memory architectures. For these problems it is enough to perform pre-processing only once or occasionally.

In many irregular problems the data access patterns and computational load change frequently during computation. While the PARTI runtime primitives are able to handle many highly adaptive problems in which data access patterns change frequently, efficiency may be hindered by relatively high pre-processing costs. Thus the runtime primitives for interprocessor data movement and storage management need to be optimized for problems with varying data access patterns. Since the computational load may change from time step to time step, data arrays may need to be redistributed frequently to achieve load balance. This requires efficient methods to compute partitions of the problem domain and to carry out remapping to balance load.

We have recently developed a new runtime support library called CHAOS which aims at parallelizing highly adaptive problems on distributed memory parallel computers. The runtime support library has been implemented on the Intel iPSC/860, Touchstone Delta and Paragon, Thinking Machines CM-5, and IBM SP-1 platforms. It subsumes the previous PARTI library which was mainly targeted at static irregular problems. The CHAOS library has been used to parallelize several highly adaptive application codes including two and three dimensional Direct Simulation Monte Carlo (DSMC) methods. This paper describes approaches for parallelizing the direct particle simulations on distributed memory parallel machines, emphasizing efficient data migration strategies, domain partitioning and dynamic remapping methods for load balance.

The rest of the paper is organized as follows. Section 2 describes general characteristics of DSMC methods and motivations behind these parallelization approaches.
Section 3 introduces a new domain partitioner and compares it with other unstructured mesh partitioners. Periodic and dynamic remapping methods for dynamic load balancing are presented in Section 4. Communication optimization for fast and frequent data migration is discussed in Section 5. Experimental results are given in Section 6.

2 Motivations for parallelization

2.1 Overview of the DSMC method

The DSMC method is a technique for computer modeling of a real gas by a large number of simulated particles. It includes movement and collision handling of simulated particles on a spatial flow field domain overlaid by a Cartesian mesh [12, 18]. The spatial location of each particle is associated with a Cartesian mesh cell. Each mesh cell typically contains multiple particles. Physical quantities such as velocity components, rotational energy and position coordinates are associated with each particle, and modified with time as the particles are concurrently followed through representative collisions and boundary interactions in simulated physical space.

The DSMC method is similar to the Particle-in-cell (PIC) method in that it tries to simulate the physics of flows directly through Lagrangian movement and particle interactions. However, the DSMC method has a unique feature that distinguishes it from the PIC method: the movement and collision processes are completely uncoupled over a time step [13]. McDonald and Dagum have compared implementations of direct particle simulation on SIMD and MIMD architectures [7].

2.2 Computational characteristics

Changes in position coordinates may cause the particles to move from current cells to new cells according to their new position coordinates. This implies that the cost of transmitting particles among cells may be significant on distributed memory parallel computers since a substantial number of particles migrate
in each time step, and each particle usually has several words of data associated with it. In the particular 3-dimensional DSMC program reported here, each particle consumes about 10 words of storage. On average, more than 30 percent of the particles change their cell locations every time step. However, particle movements are very local. In our simulations, we observed that particles only move between neighboring cells.

Applications which have characteristics like the DSMC computation require runtime support that provides efficient data transportation mechanisms for particle movement. The computational requirement tends to depend on the number of particles. Particle movement therefore may lead to variation in work load distribution among processors. The problem domain overlaid by a Cartesian mesh needs to be partitioned in such a way that work load is balanced and the number of particles that move across processors is minimized. It may also need to be repartitioned frequently in order to rebalance the work load during the computation. These characteristics raise the issue of dynamic load balancing and require effective domain partitioning methods and an adaptive policy for domain repartitioning decisions.

3 Domain partitioning methods

This section presents some techniques for domain partitioning. It is assumed that work units (e.g. cells) in the problem domain may require different amounts of computation. Partitioning of the problem domain is important for parallel computation because efficient utilization of distributed memory parallel computers may be affected by work load distribution and the amount of communication between processors.

3.1 Recursive bisection

There have been several theoretical and experimental discussions of partitioning strategies based on spatial information for many years.
Recursive coordinate bisection (RCB) [1] is a well-known algorithm which bisects a problem domain into two pieces of equal work load recursively until the number of subdomains is equal to the number of processors. Recursive inertial bisection (RIB) [10] is similar to RCB in that it bisects a problem domain recursively based on spatial information, but RIB uses the minimum moment of inertia when it selects bisection directions, whereas RCB selects bisection directions from the x-, y-, or z-dimensions. In other words, RIB bisects a 3-dimensional domain with a plane at any angle with respect to the coordinate axes, whereas RCB does so only with a plane perpendicular to one of the x, y and z coordinate axes. Clearly the number of processors has to be a power of 2 to apply these algorithms on parallel computers.

Recursive bisection algorithms produce partitions of reasonable quality for static irregular problems, with relatively low overhead when compared with recursive spectral bisection [11] and simulated annealing [17]. von Hanxleden [16] and Williams [17] discuss the qualities of partitions produced by the recursive bisection algorithms, and compare their performance with other partitioning methods in several aspects.

3.2 Chain partitioner

Minimization of partitioning overheads is particularly important in problems that require frequent repartitioning. We therefore considered low overhead partitioning methods, especially chain partitioners. Chain partitioners decompose a chain-structured problem domain, whose task graph is a chain of work units, into a set of pieces which satisfies contiguity constraints: each processor has a contiguous
subdomain of work units assigned to it. That is, a problem domain has to be partitioned in such a way that work units $i$ and $i+1$ are assigned to the same or to adjacent processors. Relatively simple algorithms for finding the optimal partition of a chain-structured problem have been suggested [2, 8, 3]. While these algorithms are developed to optimize computation and communication costs at the same time, we have developed and used a new chain partitioning algorithm which considers computation cost only. This algorithm requires only one step of global communication and a few steps of local computation. Supposing that $P$ is the number of processors, each processor $i$ ($0 \le i \le P-1$) executes the same algorithm:

1. Compute $S = \sum_{k=0}^{i-1} L_k$ and $T = iB$, where $L_k$ is the amount of work owned by processor $k$ and $B = \frac{1}{P} \sum_{k=0}^{P-1} L_k$. Then $S$ is a scan sum that indicates the relative position of the subchain owned by processor $i$ within the current distribution of work. $T$ indicates the relative position of the subchain that processor $i$ has to own under the target distribution of work.

2. If $S \le T$, find a processor index $j$ such that $j = \max\{\, j \mid j \le i \text{ and } T - S \le (i-j)B \,\}$. If $S > T$, find a processor index $j$ such that $j = \min\{\, j \mid j \ge i \text{ and } S - T < (j-i+1)B \,\}$. Then $j$ is the smallest index among the processors into which some of the work units of processor $i$ must be moved. Note that if $S = T$ then $j = i$.

3. Compute $\delta$, the amount of processor $j$'s work that will be moved from processors with index numbers less than $i$. If $S \le T$, $\delta = (i-j)B - (T-S)$; otherwise $\delta = (S-T) - (j-i)B$.

4. Compute $m$, the number of processors to which work of processor $i$ will be moved. $m$ is the integer such that $\frac{L_i + \delta}{B} \le m < \frac{L_i + \delta}{B} + 1$.

5. Compute $\delta_k$ ($0 \le k \le m-1$), the amount of work that has to be transferred to processor $j+k$: $\delta_0 = B - \delta$; $\delta_1 = \cdots = \delta_{m-2} = B$; $\delta_{m-1} = L_i - (m-1)B + \delta$.

6. Using $\delta_0, \delta_1, \ldots, \delta_{m-1}$, compute a list of processor indices to which each work unit of processor $i$ has to be moved.

In applying our chain partitioner to 3-dimensional problems that have directional biases in communication requirements, we make two assumptions. First, communication costs may be ignored except along the one direction with the most communication, so that 3-dimensional Cartesian mesh cells can be viewed as a chain of cells. Second, it is also assumed that communication costs between any pair of cells are all the same. Then it is not necessary to take the communication costs into account. These assumptions can be supported by the highly directional nature of particle flow that characterizes some DSMC communication patterns. In some of our tests, more than 70 percent of particles moved along the positive x-axis, and the standard deviation of the number of particles per cell which moved across cell boundaries along the positive x-axis was less than 2.

3.3 Performance of domain partitioners

We have experimented with the above partitioning methods for a NASA Langley 3-dimensional DSMC code which simulates a corner flow on an Intel iPSC/860 parallel computer with 128 processing nodes. When we carried out 1000 time steps of DSMC computation using the chain partitioner, the average number of messages sent by each processor was reduced by about 20 percent, while the average volume of communication was increased by about 32 percent, compared with the use of recursive coordinate bisection. We measured only the message traffic required by the DSMC computation; communication incurred by the partitioners themselves was not considered. Since a few long messages are more desirable than many short messages on most distributed memory architectures, the chain partitioner, in some cases, may actually produce partitions with less communication overhead. Figure 1 illustrates the domain partitions of the 3-dimensional DSMC code. The first partition was produced by recursive coordinate bisection (RCB) and the second one by the chain partitioner.

[Figure 1: Domain partitions produced by RCB (left) and Chain partitioner (right)]

Table 1: Costs of partitioners on iPSC/860 (time in secs)

Nprocs    RCB       RIB       Chain
8         0.4804    0.5662    0.0058
16        0.6531    0.6527    0.0035
32        0.9993    0.9652    0.0029
64        2.3708    2.2050    0.0030
128       7.1666    6.5897    0.0036

Table 1 compares the execution time of the recursive bisections and the chain partitioner, which have been implemented and benchmarked with the NASA Langley 3-dimensional 30 x 18 x 18 Cartesian mesh cells on the Intel iPSC/860. It can be observed that the overheads associated with recursive bisections increase with the number of processors in a non-linear manner, so that partitioning overheads that are acceptable on a relatively small number of processors may no longer be acceptable on a very large number of processors. This can be explained by the fact that our parallel implementations of recursive bisection algorithms require multiple phases of communication to exchange work load information among processors, whereas the chain partitioning algorithm needs only one step of global communication that can be efficiently implemented by most message passing parallel computers.
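
The six steps of the chain partitioner in Section 3.2 can be sketched in Python as follows. This is a minimal illustration, not the CHAOS implementation. The names chain_plan and assign_units are hypothetical. Work amounts are assumed to be integers (for example, measured microseconds) so that exact rational arithmetic can be used, and the global scan sum is simulated by passing the whole list of loads to one function. The greedy per-unit rule used for step 6, and the handling of the cases m = 0 and m = 1, are assumptions made here, since the text does not spell them out.

```python
import math
from fractions import Fraction


def chain_plan(loads, i):
    """Steps 1-5 for processor i.

    loads[k] is the integer amount of work L_k owned by processor k.
    Returns (j, amounts), where amounts[k] is the work that processor i
    hands to processor j + k (processor i may itself be a target).
    """
    P = len(loads)
    B = Fraction(sum(loads), P)          # target load per processor
    if B == 0 or loads[i] == 0:
        return i, []                     # nothing to move (assumed edge case)
    S = sum(loads[:i])                   # step 1: scan sum, current position
    T = i * B                            #         target position
    if S <= T:                           # steps 2 and 3
        j = max(c for c in range(i + 1) if T - S <= (i - c) * B)
        delta = (i - j) * B - (T - S)
    else:
        j = min(c for c in range(i, P) if S - T < (c - i + 1) * B)
        delta = (S - T) - (j - i) * B
    m = math.ceil((loads[i] + delta) / B)            # step 4
    if m == 1:                           # all of L_i fits on processor j
        return j, [Fraction(loads[i])]
    # Step 5: top up processor j, then full shares of B, then the remainder.
    amounts = [B - delta] + [B] * (m - 2) + [loads[i] - (m - 1) * B + delta]
    return j, amounts


def assign_units(unit_weights, j, amounts):
    """Step 6 (assumed greedy rule): walk the work units in chain order and
    switch to the next processor once the current amount is used up."""
    dests, k, used = [], 0, 0
    for w in unit_weights:
        while k < len(amounts) - 1 and used >= amounts[k]:
            used -= amounts[k]
            k += 1
        dests.append(j + k)
        used += w
    return dests
```

With loads of 10, 2, 0 and 4 on four processors the target share is B = 4. Under this sketch processor 0 keeps 4 units of work and hands 4 and 2 to processors 1 and 2, processor 1 hands its 2 units to processor 2, and processor 3 keeps its 4, so every processor ends with a load of 4.
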

is the idle time cost of not remapping. There is an initial tendency for $W(n)$ to decrease as $n$ increases because the remapping cost $C$ is amortized over an increasing number of time steps. An increase in the cost term $\sum_{j=1}^{n} (T_{max}(j) - T_{avg}(j)) / n$ indicates that work load balance is deteriorating. The SAR policy is to remap when $W(n)$ first begins to rise, i.e. the first time that $W(n) > W(n-1)$.

[Figure 2: Remapping cost and idle time. Curves: Rcost (time step and cost of remapping); Case I (idle time without remapping); Case II (idle time with dynamic remapping)]

Figure 2 presents the idle time incurred during each time step of the DSMC simulation. The Case I and Case II curves illustrate the idle time of a 3-dimensional DSMC code without and with remapping, respectively. That is, the results of Case II were obtained by repartitioning the problem domain dynamically with the chain partitioner, whereas those of Case I were generated by keeping the partition static during the entire computation. Both experiments were carried out on a 16-node Intel iPSC/860 parallel computer. Each data point on these curves represents the idle time wasted during one time step. The Rcost curve shows the time steps at which the problem domain is repartitioned and the amount of time spent in repartitioning. For instance, the 8th data point in the Rcost curve shows that the Cartesian mesh cells of the DSMC code were repartitioned at the completion of the 103rd time step and that about 114 milliseconds was spent on repartitioning. The idle time wasted in the 104th time step was reduced from about 292 milliseconds in Case I to about 8 milliseconds in Case II.
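
The SAR test described above can be sketched as a small Python function. The sketch assumes that $W(n)$ has the form $W(n) = \left( C + \sum_{j=1}^{n} (T_{max}(j) - T_{avg}(j)) \right) / n$, which is consistent with the amortized remapping cost $C$ and the idle-time cost term quoted above; this form is an assumption, and the function name sar_remap_step and its arguments are hypothetical.

```python
def sar_remap_step(t_max, t_avg, remap_cost):
    """Return the first step n (1-based, counted since the last remapping)
    at which W(n) > W(n - 1), or None if W never rises.

    t_max[j-1] and t_avg[j-1] are the maximum and average per-processor
    times of step j; remap_cost is C, in the same time unit.
    Assumed form: W(n) = (C + sum_{j<=n} (t_max[j] - t_avg[j])) / n.
    """
    idle_sum = 0.0
    prev_w = None
    for n, (tm, ta) in enumerate(zip(t_max, t_avg), start=1):
        idle_sum += tm - ta                  # idle time added by step n
        w = (remap_cost + idle_sum) / n      # average cost per step so far
        if prev_w is not None and w > prev_w:
            return n                         # W has started to rise: remap
        prev_w = w
    return None
```

For example, with C = 4 and per-step imbalances of 0, 1, 2 and 3 time units, W takes the values 4, 2.5, about 2.33 and 2.5, so the sketch signals a remap at step 4. In a time-stepping code the running sum would be reset after each remapping.
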
The reduction in idle time shown in Figure 2 demonstrates that a substantial amount of idle time can be eliminated by dynamic remapping with the SAR decision policy.

5 Communication optimization

The cost of transmitting particles between processors tends to be significant in the DSMC codes because a substantial number of particles change their cell locations in each time step, and each particle usually has several words of data associated with it. However, a particle's identity is completely determined by its state information (spatial position, momentum, etc.), and computation depends only on a particle's
state information and on the cell to which the particle belongs, and not on the particular numbering scheme for the particles. This implies that communication need only append particle state information to lists associated with each cell. This avoids the overhead involved in specifying locations within destination processors to which particle state information must be moved.

This property of the DSMC computation helps build light-weight schedules which are cheaper to compute than regular communication schedules built by PARTI primitives. A light-weight schedule for processor p stores the numbers and sizes of inbound and outbound messages, and a data structure that specifies to where the local particles of p must be moved. A set of data migration primitives has also been developed which can perform irregular communications efficiently using the light-weight communication schedules. While the cost of building light-weight schedules is much less than that of regular schedules, light-weight schedules and data migration primitives still provide communication optimizations by aggregation and vectorization of messages.

6 Experimental results

This section presents the performance results for the 2-dimensional and the 3-dimensional DSMC codes implemented on various distributed memory architectures including Intel iPSC/860, Intel Paragon and IBM SP-1.

6.1 Light-weight schedules

Table 2 presents the elapsed times of 2-dimensional DSMC codes parallelized using light-weight communication schedules and data migration primitives compared with the results from the same code parallelized by using regular schedules of PARTI runtime support.

Table 2: Inspector/executor vs. Light-weight schedule (iPSC/860), time in secs

                         48x48 Cells                         96x96 Cells
Processors               16      32      64      128         16      32      64      128
Inspector/executor       63.74   50.50   79.58   95.50       226.89  131.99  125.64  118.89
Light-weight schedule    20.14   11.54   7.60    6.77        79.89   40.46   21.77   14.23

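
The idea of a light-weight schedule from Section 5 can be illustrated with a short Python sketch. It is not the CHAOS interface: the function names, the (cell, state) tuple layout and the in-process dictionaries standing in for messages are all assumptions made for illustration. Inbound message counts and sizes, which a real schedule would obtain by communication, are omitted. The sketch shows the two properties the text relies on: outbound particles are aggregated into one message per destination processor, and the receiver simply appends incoming states to per-cell lists without assigning particle numbers.

```python
from collections import defaultdict


def build_light_schedule(dest_procs, my_rank):
    """Group local particle indices by destination processor.

    dest_procs[k] is the processor owning the cell that particle k moved to.
    Returns {processor: [local indices]} for particles leaving my_rank;
    the length of each list is the size of that outbound message.
    """
    schedule = defaultdict(list)
    for k, proc in enumerate(dest_procs):
        if proc != my_rank:
            schedule[proc].append(k)
    return dict(schedule)


def pack_messages(particles, schedule):
    """Aggregate all particles bound for a processor into one message."""
    return {proc: [particles[k] for k in idx] for proc, idx in schedule.items()}


def append_incoming(cell_lists, message):
    """Append received (cell, state) pairs to the per-cell particle lists."""
    for cell, state in message:
        cell_lists[cell].append(state)
```

Because the schedule only groups indices by destination, building it takes a single pass over the local particles, which is the sense in which it is cheaper than a schedule that must also resolve a destination location for every element.
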
Since computational requirements are uniformly distributed over the whole domain of the 2-dimensional DSMC problem space, we have used a regular block partitioning method for the 2-dimensional DSMC code reported here. Thus load balance is not an issue for this problem and we can study the effectiveness of the light-weight schedule and data migration primitives without interference from other aspects such as partitioning methods and remapping frequencies. The inspector/executor method of the PARTI runtime library [15, 4], which is applied to the 2-dimensional problem, carries out pre-processing of communication patterns every time step because the reference patterns to off-processor data change from time step to time step. Consequently the pre-processing cost is greater for the inspector/executor method than for light-weight schedules. Moreover, since the fraction of particles that are local with respect to the initial distribution becomes smaller as the computation proceeds, the communication volume tends to grow. The performance of the inspector/executor method degenerates on a large number of processing nodes, and it actually leads to a decrease in performance when a large number of processors are used on a small problem.

Even though the data migration primitives that use light-weight schedules are invoked in every time step for the purpose of particle relocation, they still outperform the inspector/executor method significantly. This performance benefit is achieved by the aggregated communication which is automatically carried
out by the migration primitives, which incurs minimum communication costs. Each processor aggregates information of all the moved particles into a single message, and sends and receives at most one message to and from each of its neighboring processors. More importantly, the migration primitives allow each processor to locally own all the particles required for its computation in each time step. Provided the workload is optimally distributed, this feature guarantees the efficient utilization of processors as no further communication is required to execute the remaining computation of the time step.

6.2 Periodic remapping

Figure 3 compares periodic domain remapping methods with static partitioning (i.e. no remapping), which are applied to 3-dimensional DSMC codes on the Intel iPSC/860 with varying numbers of processors. We have measured performance numbers in a normalized manner by multiplying the elapsed time by the number of processors used. The Static curve represents the performance numbers produced by the static partitioning method. Curves RCB, RIB and Chain are produced by repartitioning the problem domain with recursive coordinate bisection, recursive inertial bisection and the chain partitioner respectively. The problem domain is repartitioned every 40 time steps based on work load information collected for each Cartesian mesh cell. For an accurate workload estimate, actual computation time is measured in microseconds for each cell in every time step. Results indicate that all of the periodic repartitioning methods generate almost the same quality of partitions and significantly outperform static partitioning on a small number of processors.

[Figure 3: Performance of periodic remapping]

Recursive bisection, however, leads to performance degradation when a large number of processors are used. For instance, the performance of these methods was poorer than that of static partitioning on iPSC/860 with 128 processors.
This performance degradation is a result of large communication overhead which increases as the number of processors increases. The chain partitioner, on the other hand, appears to be very efficient in partitioning the problem domain with a minimum of additional overhead for this set of problems.

6.3 Dynamic remapping

The dynamic remapping method attempts to minimize the average processor idle time since the last remapping by trading off remapping cost against idle time. Since the SAR remapping decision heuristic uses runtime information such as the remapping cost and the processing time of the bottleneck processor, it adapts to the variations in system behavior without any a priori information about computational requirements and workload distribution.

This series of experiments with varying intervals for periodic remapping compared the performance of optimal periodic remapping with that of dynamic remapping. Figure 4 and Figure 5 present performance results measured in elapsed times when the problem domain is remapped by RCB and the chain partitioner respectively on the Intel iPSC/860 with 128 processors. Note that the performance of dynamic remapping is better than that of periodic remapping with an optimal remapping interval when RCB is used to repartition the problem domain. When the chain partitioner is applied, dynamic remapping does not perform better than optimal periodic remapping; however, performance remains comparable to that of optimal periodic remapping.

[Figure 4: Dynamic remapping with RCB. Curves: Fixed (remapped periodically in varying intervals); Sar (remapped dynamically with SAR policy)]

In Figures 6 and 7, the performance of the dynamic remapping method is compared when carried out with each of three domain partitioning algorithms on various distributed memory parallel computers such as Intel iPSC/860, Intel Paragon and IBM SP-1 (see footnote 2). Performance numbers are measured in a normalized manner by multiplying elapsed time by the number of processors used. In Figure 7 the curves RCB-P, RIB-P and Chain-P represent the performance of the Intel Paragon, and RCB-S, RIB-S and Chain-S represent that of the IBM SP-1. The chain partitioner outperformed the other domain partitioning algorithms in all cases.
Note that the chain partitioner can be applied to any number of processors, whereas RCB and RIB can be used only when the number of processors is a power of 2.

Footnote 2: Interrupt-driven EUI-H was selected for incoming messages on IBM SP-1.
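
The power-of-2 restriction on recursive bisection follows directly from the recursive halving, as the following Python sketch of RCB illustrates. It is a simplified, sequential illustration rather than the parallel implementation measured here: the function name rcb, the (coordinates, weight) cell representation and the choice of cutting along the axis of largest extent are assumptions.

```python
def rcb(cells, nparts):
    """Recursive coordinate bisection of weighted cells.

    cells is a list of (coords, weight) pairs; returns nparts lists of
    cells with roughly equal total weight. nparts must be a power of 2
    because every level of the recursion splits a subdomain in two.
    """
    if nparts < 1 or nparts & (nparts - 1):
        raise ValueError("number of parts must be a power of 2")
    if nparts == 1:
        return [cells]
    if len(cells) < 2:                       # too few cells left to split
        return [cells] + [[] for _ in range(nparts - 1)]
    dims = range(len(cells[0][0]))
    # Assumed rule: cut perpendicular to the axis with the largest extent.
    axis = max(dims, key=lambda d: max(c[0][d] for c in cells)
                                   - min(c[0][d] for c in cells))
    ordered = sorted(cells, key=lambda c: c[0][axis])
    half = sum(w for _, w in ordered) / 2
    acc, cut = 0, len(ordered) - 1
    for idx, (_, w) in enumerate(ordered, start=1):
        acc += w
        if acc >= half:                      # first prefix holding half the work
            cut = idx
            break
    cut = min(cut, len(ordered) - 1)         # keep both halves non-empty
    return rcb(ordered[:cut], nparts // 2) + rcb(ordered[cut:], nparts // 2)
```

A chain partitioner has no such restriction because it divides the scan sum of the work into P equal shares for any P, rather than halving subdomains level by level.
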

[Figure 5: Dynamic remapping with Chain partitioner. Curves: Fixed (remapped periodically in varying intervals); Sar (remapped dynamically with SAR policy)]

7 Concluding remarks

The results presented here indicate that data transportation procedures optimized for communication are crucial for irregular problems with fluctuating data access patterns, and that effective domain partitioning methods and remapping decision policies can achieve dynamic load balancing.

Light-weight communication schedules reduced the costs of data transportation dramatically for applications where the numbering of data elements does not matter. A chain partitioner which partitions multidimensional problem domains in one-dimensional processor space was easy to implement. The chain partitioner also produced partitions of almost the same quality at very low cost for applications with directional biases in communication requirements when compared with unstructured mesh partitioners such as RCB and RIB. A dynamic remapping method with the SAR policy carried out dynamic load balancing of DSMC codes without any a priori information about computational requirements on various distributed memory architectures.

The DSMC codes described here employ a uniform mesh structure; however, for many applications it is necessary to use non-uniform meshes which, in some cases, adapt as computation progresses. Collaboration will continue with NASA Langley to address some of the many computational challenges posed by such non-uniform meshes.

Acknowledgements

The authors would like to thank Richard Wilmoth at NASA Langley for his constructive advice and the use of DSMC production codes. The authors thank Robert Martino at the National Institutes of Health for support and use of the NIH iPSC/860. The authors also gratefully acknowledge use of the Argonne High-Performance Computing Research Facility. The HPCRF is funded principally by the U.S. Department of Energy Office of Scientific Computing.

[Figure 6: Dynamic remapping on Intel iPSC/860]

[Figure 7: Dynamic remapping on SP-1 and Paragon]

References

[1] M. J. Berger and S. H. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans. on Computers, C-36(5):570-580, May 1987.
[2] S. H. Bokhari. Partitioning problems in parallel, pipelined, and distributed computing. IEEE Trans. on Computers, 37(1):48-57, January 1988.
[3] Hyeong-Ah Choi and B. Narahari. Efficient algorithms for mapping and partitioning a class of parallel computations. Journal of Parallel and Distributed Computing, 19:349-363, 1993.
[4] R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy. The design and implementation of a parallel unstructured Euler solver using software primitives, AIAA-92-0562. In Proceedings of the 30th Aerospace Sciences Meeting, January 1992.
[5] R. Das and J. Saltz. Parallelizing molecular dynamics codes using the Parti software primitives. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 187-192. SIAM, March 1993.
[6] P. C. Liewer, E. W. Leaver, V. K. Decyk, and J. M. Dawson. Dynamic load balancing in a concurrent plasma PIC code on the JPL/Caltech Mark III hypercube. In Proceedings of the Fifth Distributed Memory Computing Conference, Vol. II, pages 939-942. IEEE Computer Society Press, April 1990.
[7] J. McDonald and L. Dagum. A comparison of particle simulation implementations on two different parallel architectures. In Proceedings of the Sixth Distributed Memory Computing Conference, pages 413-419. IEEE Computer Society Press, April 1991.
[8] David M. Nicol and David R. O'Hallaron. Improved algorithms for mapping pipelined and parallel computations. IEEE Trans. on Computers, 40(3):295-306, March 1991.
[9] David M. Nicol and Joel H. Saltz. Dynamic remapping of parallel computations with varying resource demands. IEEE Trans. on Computers, 37(9):1073-1087, September 1988.
[10] B. Nour-Omid, A. Raefsky, and G. Lyzenga. Solving finite element equations on concurrent computers. In Proc. of Symposium on Parallel Computations and their Impact on Mechanics, Boston, December 1987.
[11] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Mat. Anal. Appl., 11(3):430-452, June 1990.
[12] D. F. G. Rault and M. S. Woronowicz. Spacecraft contamination investigation by direct simulation Monte Carlo - contamination on UARS/HALOE. In Proceedings AIAA 31st Aerospace Sciences Meeting and Exhibit, Reno, Nevada, January 1993.
[13] Patrick J. Roache. Computational Fluid Dynamics. Hermosa Publishers, Albuquerque, N.M., 1972.
[14] Joel Saltz, Harry Berryman, and Janet Wu. Multiprocessors and run-time compilation. Concurrency: Practice and Experience, 3(6):573-592, December 1991.
[15] Joel Saltz, Kathleen Crowley, Ravi Mirchandaney, and Harry Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8(4):303-312, April 1990.
[16] Reinhard v. Hanxleden and L. Ridgway Scott. Load balancing on message passing architectures. Journal of Parallel and Distributed Computing, 13(3):312-324, November 1991.
[17] R. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, 3(5):457-481, October 1991.
[18] M. S. Woronowicz and D. F. G. Rault. On predicting contamination levels of HALOE optics aboard UARS using direct simulation Monte Carlo. In Proceedings AIAA 28th Thermophysics Conference, Orlando, Florida, June 1993.