Runtime Support for In-Core and Out-of-Core Data-Parallel Programs by Thakur, Rajeev
Syracuse University 
SURFACE 
Electrical Engineering and Computer Science College of Engineering and Computer Science 
1995 
Runtime Support for In-Core and Out-of-Core Data-Parallel 
Programs 
Rajeev Thakur 
Graduate School of Syracuse University, Computer Engineering 
Follow this and additional works at: https://surface.syr.edu/eecs 
 Part of the Computer Engineering Commons 
Recommended Citation 
Thakur, Rajeev, "Runtime Support for In-Core and Out-of-Core Data-Parallel Programs" (1995). Electrical 
Engineering and Computer Science. 118. 
https://surface.syr.edu/eecs/118 
This Dissertation is brought to you for free and open access by the College of Engineering and Computer Science 
at SURFACE. It has been accepted for inclusion in Electrical Engineering and Computer Science by an authorized 
administrator of SURFACE. For more information, please contact surface@syr.edu. 
RUNTIME SUPPORT FOR IN-CORE ANDOUT-OF-CORE DATA-PARALLEL PROGRAMSbyRAJEEV THAKURB.E., University of Bombay, India, 1990M.S., Syracuse University, 1992DISSERTATIONSubmitted in partial fulllment of the requirements for thedegree of Doctor of Philosophy in Computer Engineeringin the Graduate School of Syracuse UniversityMay 1995Approved Professor Alok ChoudharyDate
c Copyright 1995 Rajeev ThakurAll rights reserved
AbstractDistributed memory parallel computers or distributed computer systems are widelyrecognized as the only cost-eective means of achieving teraops performance in thenear future. However, the fact remains that they are dicult to program and ad-vances in software for these machines have not kept pace with advances in hardware.This thesis addresses several issues in providing runtime support for in-core as well asout-of-core programs on distributed memory parallel computers. This runtime sup-port can be directly used in application programs for greater eciency, portabilityand ease of programming. It can also be used together with a compiler to translateprograms written in a high-level data-parallel language like High Performance Fortran(HPF) to node programs for distributed memory machines.In distributed memory programs, it is often necessary to change the distributionof arrays during program execution. This thesis presents ecient and portable algo-rithms for runtime array redistribution. The algorithms have been implemented onthe Intel Touchstone Delta and are found to scale well with the number of processorsand array size. This thesis also presents algorithms for all-to-all collective communi-cation on fat-tree and two-dimensional mesh interconnection topologies. The perfor-mance of these algorithms on the CM-5 and Touchstone Delta is studied extensively.A model for estimating the time taken by these algorithms on the basis of systemparameters is developed and validated by comparing with experimental results.A number of applications deal with very large data sets which cannot t in mainmemory, and hence have to be stored in les on disks, resulting in out-of-core pro-grams. This thesis also describes the design and implementation of ecient runtimesupport for out-of-core computations. Several optimizations for accessing out-of-coredata are presented. An Extended Two-Phase Method is proposed for accessing sec-tions of out-of-core arrays eciently. This method uses collective I/O and the I/Oworkload is divided among processors dynamically, depending on the access requests.Performance results obtained using this runtime support for out-of-core programs onthe Touchstone Delta are presented.
ContentsList of Tables viiiList of Figures xAcknowledgments xiii1 Introduction 11.1 Distributed Memory Parallel Computers : : : : : : : : : : : : : : : : 21.2 Software for Distributed Memory Computers : : : : : : : : : : : : : : 21.3 Data-Parallel Languages : : : : : : : : : : : : : : : : : : : : : : : : : 41.4 Need for High Performance I/O : : : : : : : : : : : : : : : : : : : : : 51.5 Contributions of this Thesis : : : : : : : : : : : : : : : : : : : : : : : 61.6 Related Work : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 81.7 Organization of this Thesis : : : : : : : : : : : : : : : : : : : : : : : : 112 Issues in Runtime Support 122.1 Runtime Support for Regular Problems : : : : : : : : : : : : : : : : : 122.2 Runtime Support for Irregular Problems : : : : : : : : : : : : : : : : 132.3 Runtime Support for Compilers : : : : : : : : : : : : : : : : : : : : : 152.3.1 Overview of HPF : : : : : : : : : : : : : : : : : : : : : : : : : 152.3.2 Compiler and Runtime Support for HPF : : : : : : : : : : : : 172.4 Runtime Support for Out-of-Core Programs : : : : : : : : : : : : : : 212.4.1 Out-of-Core Applications : : : : : : : : : : : : : : : : : : : : : 22iv
3 Runtime Support for Array Redistribution 253.1 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 253.1.1 Need for Redistribution : : : : : : : : : : : : : : : : : : : : : 263.2 Notations and Denitions : : : : : : : : : : : : : : : : : : : : : : : : 283.3 Block(m) to Cyclic Redistribution : : : : : : : : : : : : : : : : : : : : 313.4 Cyclic to Block(m) Redistribution : : : : : : : : : : : : : : : : : : : : 353.5 Cyclic(x) to Cyclic(y) Redistribution : : : : : : : : : : : : : : : : : : 383.5.1 Special case x = k y : : : : : : : : : : : : : : : : : : : : : : : 383.5.2 Special case y = k x : : : : : : : : : : : : : : : : : : : : : : : 423.5.3 General Case : : : : : : : : : : : : : : : : : : : : : : : : : : : 463.6 Redistribution of Multidimensional Arrays : : : : : : : : : : : : : : : 503.6.1 Shape Retaining Redistribution : : : : : : : : : : : : : : : : : 513.6.2 Shape Changing Redistribution : : : : : : : : : : : : : : : : : 533.7 Circular Redistribution : : : : : : : : : : : : : : : : : : : : : : : : : : 553.7.1 Circular Block(m)$ Cyclic Redistribution : : : : : : : : : : : 553.7.2 Circular Cyclic$ Block(m) Redistribution : : : : : : : : : : : 563.7.3 Circular Cyclic(x)$ Cyclic(y) Redistribution : : : : : : : : : 564 Runtime Support for All-to-All Collective Communication 584.1 Architecture and Communication Issues : : : : : : : : : : : : : : : : 594.1.1 CM-5 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 594.1.2 Touchstone Delta : : : : : : : : : : : : : : : : : : : : : : : : 614.1.3 Performance Models : : : : : : : : : : : : : : : : : : : : : : : 624.2 All-to-All Communication on a Fat Tree : : : : : : : : : : : : : : : : 634.2.1 Linear Exchange (LEX) : : : : : : : : : : : : : : : : : : : : : 634.2.2 Pairwise Exchange (PEX) : : : : : : : : : : : : : : : : : : : : 644.2.3 Recursive Exchange (REX) : : : : : : : : : : : : : : : : : : : 654.2.4 Balanced Exchange (BEX) : : : : : : : : : : : : : : : : : : : : 674.2.5 Performance of Algorithms on the CM-5 : : : : : : : : : : : : 684.3 All-to-All Communication on a 2D Mesh : : : : : : : : : : : : : : : : 714.3.1 Pairwise Exchange for Power-Of-Two Mesh (PEX) : : : : : : 72v
4.3.2 Pairwise Exchange for General Mesh (PEX-GEN) : : : : : : : 744.3.3 PEX-GEN with Shift (PEX-GEN-SHIFT) : : : : : : : : : : : 744.3.4 General Algorithm for any Mesh (GEN) : : : : : : : : : : : : 774.3.5 Indirect Pairwise Exchange (IPEX) : : : : : : : : : : : : : : : 784.3.6 Recursive Exchange (REX) : : : : : : : : : : : : : : : : : : : 784.3.7 Performance of Algorithms on the Delta : : : : : : : : : : : : 814.3.8 Model Validation : : : : : : : : : : : : : : : : : : : : : : : : : 865 Runtime Support for Out-of-Core Programs: (I) Models and LocalOptimizations 905.1 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 905.2 PASSION Runtime Library : : : : : : : : : : : : : : : : : : : : : : : 915.3 Models : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 935.3.1 Architectural Model : : : : : : : : : : : : : : : : : : : : : : : 935.3.2 Data Storage and Access Models : : : : : : : : : : : : : : : : 945.4 Runtime Support for the Local Placement Model : : : : : : : : : : : 975.4.1 Out-of-Core Array Descriptor (OCAD) : : : : : : : : : : : : : 1005.5 Optimizations : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1005.5.1 Data Sieving : : : : : : : : : : : : : : : : : : : : : : : : : : : 1015.5.2 Data Prefetching : : : : : : : : : : : : : : : : : : : : : : : : : 1085.5.3 Data Reuse : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1126 Runtime Support for Out-of-Core Programs: (II) Collective I/O 1146.1 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1146.2 Need for Collective I/O : : : : : : : : : : : : : : : : : : : : : : : : : 1156.3 Extended Two-Phase Method for Collective I/O : : : : : : : : : : : 1176.3.1 Reading Sections of Out-of-Core Arrays : : : : : : : : : : : : 1176.3.2 Writing Sections of Out-of-Core Arrays : : : : : : : : : : : : 1206.4 Partitioning I/O Among Processors : : : : : : : : : : : : : : : : : : 1216.4.1 Static Partitioning : : : : : : : : : : : : : : : : : : : : : : : : 1226.4.2 Dynamic Partitioning : : : : : : : : : : : : : : : : : : : : : : : 122vi
6.5 Performance : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1246.5.1 Reading Common Sections : : : : : : : : : : : : : : : : : : : : 1256.5.2 Reading Overlapping Sections : : : : : : : : : : : : : : : : : : 1276.5.3 Reading Distinct Sections : : : : : : : : : : : : : : : : : : : : 1296.5.4 Writing Distinct Sections : : : : : : : : : : : : : : : : : : : : : 1316.5.5 Accessing Sections with Non-Unit Strides : : : : : : : : : : : : 1316.5.6 Scalability : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1326.6 Advantages of Extended Two-Phase Method : : : : : : : : : : : : : : 1357 Conclusions 140Bibliography 143Vita 159
vii
List of Tables3.1 Data Distribution and Index Conversion : : : : : : : : : : : : : : : : 314.1 Communication Schedule for PEX on 8 Processors : : : : : : : : : : : 644.2 Communication Schedule for REX on 8 processors : : : : : : : : : : : 664.3 Communication Schedule for BEX on 8 processors : : : : : : : : : : : 694.4 Performance of PEX (time in sec.) : : : : : : : : : : : : : : : : : : : 814.5 Performance of PEX-GEN (time in sec.) : : : : : : : : : : : : : : : : 824.6 Performance of PEX-GEN-SHIFT (time in sec.) : : : : : : : : : : : : 834.7 Performance of GEN on power-of-two mesh (time in sec.) : : : : : : : 844.8 Performance of GEN on non power-of-two mesh (time in sec.) : : : : 844.9 Performance of REX on Delta (time in sec.) : : : : : : : : : : : : : : 854.10 Performance of IPEX (time in sec.) : : : : : : : : : : : : : : : : : : : 855.1 Some of the PASSION Routines : : : : : : : : : : : : : : : : : : : : : 995.2 Performance of Direct Read/Write versus Data Sieving (time in sec.) 1075.3 I/O requirements of Direct Read and Data Sieving Methods : : : : : 1075.4 Performance of Median Filtering using 3 3 window (time in sec.) : : 1105.5 Performance of Median Filtering using 5 5 window (time in sec.) : : 1105.6 Performance of Data Reuse : : : : : : : : : : : : : : : : : : : : : : : 1136.1 Time (sec.) for Reading Common Sections : : : : : : : : : : : : : : : 1266.2 Time (sec.) for Reading Overlapping Sections : : : : : : : : : : : : : 1286.3 Time (sec.) for Reading Distinct Sections : : : : : : : : : : : : : : : : 1306.4 Time (sec.) for Writing Distinct Sections : : : : : : : : : : : : : : : : 131viii
6.5 Time (sec.) for Reading Sections with Non-unit Strides : : : : : : : : 1326.6 Time (sec.) for Writing Sections with Non-unit Strides : : : : : : : : 1336.7 Performance for dierent number of processors : : : : : : : : : : : : : 1346.8 Performance for large sections of large arrays : : : : : : : : : : : : : : 136
ix
List of Figures2.1 Phases of Compilation : : : : : : : : : : : : : : : : : : : : : : : : : : 183.1 Need for Redistribution : : : : : : : : : : : : : : : : : : : : : : : : : : 273.2 2D FFT Program using Redistribution : : : : : : : : : : : : : : : : : 283.3 Notations : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 293.4 Block(m) to Cyclic Redistribution : : : : : : : : : : : : : : : : : : : : 323.5 Algorithm for Block(m) to Cyclic Redistribution : : : : : : : : : : : : 343.6 Algorithm for Cyclic to Block(m) Redistribution : : : : : : : : : : : : 363.7 Performance of Cyclic to Block Redistribution, array size 1M integers 373.8 Cyclic(4) to Cyclic(2) Redistribution : : : : : : : : : : : : : : : : : : 393.9 KY TO Y Algorithm for Cyclic(x) to Cyclic(y) Redist., where x = k y 413.10 KY TO Y Algorithm, cyclic(4) to cyclic(2), array size 1M integers : : 423.11 KY TO Y Algorithm, cyclic(4) to cyclic(2), 64 processors : : : : : : : 433.12 X TO KX Algorithm for Cyclic(x) to Cyclic(y) Redist., where y = k x 453.13 X TO KX Algorithm, cyclic(2) to cyclic(4), array size 1M integers : : 463.14 X TO KX Algorithm, cyclic(2) to cyclic(4), 64 processors : : : : : : : 473.15 General Cyclic(x) to Cyclic(y) Redistribution : : : : : : : : : : : : : 473.16 GCD and LCM Methods : : : : : : : : : : : : : : : : : : : : : : : : : 483.17 GCD, LCM and General Methods; cyclic(15) to cyclic(10); 1M array 493.18 GCD, LCM and General Methods; cyclic(11) to cyclic(3); 1M array : 503.19 LCM v/s General Methods, cyclic(11) to cyclic(3), 64 processors : : : 513.20 (block,block) to (cyclic,cyclic) Redistribution, 1K  1K array : : : : 52x
3.21 (cyclic(11),cyclic(11)) to (cyclic(3),cyclic(3)) Redist., 1K  1K array : 543.22 Circular cyclic(11) to cyclic(3) Redistribution, array size 1M integers 574.1 CM-5 fat tree : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 604.2 CM-5 Data Network with 64 Processing Nodes : : : : : : : : : : : : : 614.3 Pairwise Exchange (PEX) : : : : : : : : : : : : : : : : : : : : : : : : 644.4 Recursive Exchange (REX) on CM-5 : : : : : : : : : : : : : : : : : : 674.5 Balanced Exchange (BEX) : : : : : : : : : : : : : : : : : : : : : : : : 684.6 Performance on 32 node CM-5 for dierent message sizes : : : : : : : 704.7 Performance on CM-5 for message size 512 bytes : : : : : : : : : : : : 704.8 Performance on CM-5 for message size 2 Kbytes : : : : : : : : : : : : 714.9 PEX on 2  4 mesh : : : : : : : : : : : : : : : : : : : : : : : : : : : : 734.10 Pairwise Exchange for General Mesh (PEX-GEN) : : : : : : : : : : : 744.11 Processor Shift : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 754.12 Pairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT) : 764.13 General Algorithm for any Mesh (GEN) : : : : : : : : : : : : : : : : 774.14 Indirect Pairwise Exchange (IPEX) : : : : : : : : : : : : : : : : : : : 794.15 Recursive Exchange (REX) on Delta : : : : : : : : : : : : : : : : : : 804.16 Performance of algorithms on a 16  32 mesh : : : : : : : : : : : : : 864.17 Performance of algorithms on a 16  9 mesh : : : : : : : : : : : : : : 874.18 Performance of power-of-two algorithms for message size 16 Kbytes : 874.19 Performance of power-of-two algorithms for message size 1 Kbytes : : 884.20 Observed and Predicted times (PEX, GEN) : : : : : : : : : : : : : : 894.21 Observed and Predicted times (REX, IPEX) : : : : : : : : : : : : : : 895.1 Software Architecture : : : : : : : : : : : : : : : : : : : : : : : : : : : 925.2 Data Storage and Access Models : : : : : : : : : : : : : : : : : : : : 955.3 HPF Program Fragment: Solving Laplace's Equation by Jacobi Itera-tion Method : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 985.4 Example of OCLA, ICLA and LAF : : : : : : : : : : : : : : : : : : : 985.5 Out-of-Core Array Descriptor (OCAD) : : : : : : : : : : : : : : : : : 101xi
5.6 Accessing Sections of the OCLA : : : : : : : : : : : : : : : : : : : : : 1025.7 Data Sieving : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1035.8 Data Prefetching : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1085.9 Median Filtering using 3 3 window : : : : : : : : : : : : : : : : : : 1115.10 Median Filtering using 5 5 window : : : : : : : : : : : : : : : : : : 1115.11 Data Reuse : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 1136.1 Accessing row blocks of a le stored in column-major order : : : : : : 1166.2 The requests in processor 0's le domain : : : : : : : : : : : : : : : : 1196.3 Static versus Dynamic Partitioning : : : : : : : : : : : : : : : : : : : 1236.4 Extended Two-Phase Method for Reading Sections of Out-of-Core Arrays1246.5 The common sections listed in Table 6.1 (not to scale) : : : : : : : : 1266.6 The overlapping sections listed in Table 6.2 (not to scale) : : : : : : : 1286.7 The distinct sections listed in Table 6.3 (not to scale) : : : : : : : : : 1306.8 The sections listed in Table 6.8 (not to scale) : : : : : : : : : : : : : : 1366.9 Scalability Results, reading sections in case VI of Table 6.8 : : : : : : 1376.10 Scalability Results, writing sections in case VI of Table 6.8 : : : : : : 137
xii
AcknowledgmentsI have been fortunate to have Alok Choudhary as my advisor throughout the courseof my graduate studies. He has helped me in too many ways to mention. I greatlyappreciate his guidance, support, encouragement and enthusiasm. Working with himhas been a pleasure and a great learning experience.I wish to thank Thong Dang, Georey Fox, Salim Hariri, David Kotz and SanjayRanka for serving on my thesis committee and for their feedback on drafts of thisthesis. I thank Georey Fox and everyone else at NPAC for providing an excellentenvironment for my work. I thank Thong Dang for getting me interested in Com-putational Fluid Dynamics and inspiring me in many other ways. I am grateful toall my friends particularly Rajesh Bordawekar, Sachin Damle, Kanad Chakraborty,Ravi Ponnusamy, Sachin More, Kevin Roe and Mario Campuzano for their help atall times and making my stay at Syracuse pleasant and enjoyable.I cannot express in words my gratitude towards my parents, grandparents andsister. All my achievements have been possible only because of their constant en-couragement, love and support. I am fortunate to have such a wonderful family. Idedicate this thesis to them.
xiii
Chapter 1IntroductionMassively Parallel Processors (MPPs) with peak performance greater than 100 Gopshave already made their advent into the supercomputing arena. As a result, parallelcomputers are increasingly being used in many applications which require a greatdeal of computational power. Examples of such applications include many largescale computations in physics, chemistry, biology, engineering, medicine and othersciences, which have been identied as Grand Challenge Applications [56, 131]. Manyapplications dealing with information technology, such as multimedia systems, videoon demand, video compression and decompression, also require a large amount ofcompute power. It is estimated that a computer capable of 1 Tops (1012 ops) ormore would be required to solve these applications in a reasonable amount of time.It is widely recognized that rather than conventional vector supercomputers, parallelcomputers or distributed systems provide the only cost-eective means of achievingteraop performance.However, software support for parallel computers has lagged far behind advancesin hardware. Programming a parallel machine can prove to be quite tedious. Inorder to get the best performance from a parallel computer, the programmer has topay attention to many low-level implementation details. Also, the programs are veryoften not portable. One way to tackle this problem is by using advanced compilersand runtime support systems. The research presented in this thesis addresses severalissues in providing runtime support for in-core as well as out-of-core programs on1
CHAPTER 1. INTRODUCTION 2distributed memory parallel computers. This runtime support can be directly usedin application programs for greater eciency, portability and ease of programming.It can also be used together with a compiler to translate programs written in a high-level data-parallel language like High Performance Fortran (HPF) to node programsfor distributed memory parallel computers.1.1 Distributed Memory Parallel ComputersIn a distributed memory computer, the memory is physically distributed among pro-cessors. Each processor has its own local memory and all processors are connectedtogether by an interconnection network such as a hypercube, mesh, torus, fat treeor some other topology. A processor can directly access only its own local memory.If data from some other processor is needed, it can be obtained by explicit mes-sage passing. Examples of such systems are the Intel iPSC/860, Touchstone Deltaand Paragon; Thinking Machines CM-5; IBM SP-1, SP-2; Meiko CS-2; Cray T3D;nCUBE-2 etc.The main advantage of distributed memory computers is that they are scalable.Advances in interconnection network technology have made it possible to connecthundreds or thousands of processors into one large high-performance parallel com-puter. Hence, such machines are also called Massively Parallel Processors (MPPs).Each processor of an MPP is typically a powerful microprocessor like a DEC Alpha,IBM PowerPC, Sun Sparc or Intel i860. Since these microprocessors are also used inworkstations or personal computers, they are available at competitive prices. Hence,MPPs also prove to be very cost-eective.1.2 Software for Distributed Memory ComputersDistributed memory machines are relatively hard to program. The most commonprogrammingmodel for distributed memory computers is the Single ProgramMultipleData (SPMD) model in which each processor runs the same program, but on dierentdata sets. The program running on each node is essentially a sequential program
CHAPTER 1. INTRODUCTION 3(usually in C or Fortran) interspersed with calls to message passing routines. Thenode program is written in the local address space and the absence of a single globaladdress space is what makes message passing programming dicult. The programmerhas to decide how data should be partitioned among the processors and has to managecommunication between processors explicitly. Interprocessor communication is atleast an order of magnitude more expensive than accessing data in local memory.This forces the programmer to pay considerable attention to saving communicationcosts. Since each machine has its own architectural features and idiosyncrasies, theprogrammer tries to hardwire the program to exploit these peculiarities and get thebest performance. Hence a great deal of eort is required to port such programs toother machines. These details also make it dicult to convert existing sequentialprograms to parallel programs. Another problem is that each parallel machine hasits own message passing library which is quite dierent from that of other machines.Some of the common communication libraries are NX for Intel machines, CMMDfor the CM-5, MPL for the IBM SP-1 and SP-2 etc. A program written using theNX library for the Intel Paragon cannot be directly run on the CM-5 and vice-versabecause the message passing routines are totally incompatible.There have been several attempts to provide some measure of portability in par-allel programs. There are number of portable communication libraries like EX-PRESS [90], PVM [114], PICL [47], P4 [19] etc. which provide a communicationlayer above the native message passing library of the system. These libraries providea well-dened set of communication routines which remain the same for any system.For example, a program written using EXPRESS or PVM can be run on almost anyparallel computer and even on a network of workstations. The EXPRESS or PVMroutines make calls to the message passing library provided on the system. However,this portability is at the cost of slightly lower performance because of this additionallayer of communication software.There has also been an eort to develop a standard Message Passing Interface(MPI) [83]. The Message Passing Interface Forum, a group of researchers from indus-try, academia and research laboratories, has dened a set of library interface standards
CHAPTER 1. INTRODUCTION 4for message passing called MPI. MPI has tried to make use of the most attractivefeatures of a number of existing message passing systems. The goal is to establish apractical, portable, ecient and exible standard for message passing. The denitionof a message passing standard such as MPI provides vendors with a clearly denedset of routines that they can implement eciently, or in some cases provide hardwaresupport for, thereby providing better performance. But the diculty of explicitlydoing message passing still remains.1.3 Data-Parallel LanguagesSince message passing programming is dicult, the user would prefer to write a pro-gram in global name space without any message passing, and have a compiler translateit into a message passing program. This requires advanced compiler technology, butit would result in parallel programming being easy and portable. Researchers havefound it very hard to build compilers which can parallelize sequential programs writ-ten in standard C or Fortran 77. Hence, standard sequential languages have beenaugmented with directives and constructs to aid the compiler in generating messagepassing code. Fortran D [43, 58] is one such language. Fortran D consists of a set ofextensions to Fortran 77 which specify how data is to be distributed among the pro-cessors of a distributed memory machine. The Fortran D compiler developed at RiceUniversity [122, 59] can translate a Fortran D program into a Fortran 77 plus mes-sage passing node program. Another such language is Fortran 90D [128]. Fortran 90Dconsists of a set of extensions to Fortran 90, similar to those used in Fortran D. TheFortran 90D compiler developed at Syracuse University [13] translates Fortran 90Dprograms to Fortran 77 plus message passing node programs. Vienna Fortran [21, 130]and CM Fortran [121] also allow the user to write programs in global address space.Recently, the High Performance Fortran Forum, comprising a group of researchersfrom industry, academia and research laboratories, developed a standard languagecalled High Performance Fortran (HPF) [57, 67]. HPF was designed to provide a
CHAPTER 1. INTRODUCTION 5portable extension to Fortran 90 for writing data-parallel applications. It includes fea-tures for mapping data to processors, specifying data-parallel operations, and meth-ods for interfacing HPF programs to other programming paradigms. It is expectedthat HPF will be a standard programming language for computationally intensiveapplications on many types of machines, such as massively parallel MIMD and SIMDmultiprocessors as well as traditional vector processors. In order for HPF to be suc-cessful, it needs a powerful compiler. Compiler technology for HPF is still in itsinfancy and current versions of HPF compilers are not robust and mature enough tocompile large production codes eciently. However once good compilers are available,the modern features and powerful capabilities of HPF are expected to make it thenew popular version of Fortran for scientists and engineers solving complex large-scaleproblems [67].1.4 Need for High Performance I/OAnother important issue in high performance computing is providing support for highperformance parallel input-output (I/O). I/O for parallel systems has drawn increas-ing attention in the last few years as it has become apparent that I/O performancerather than CPU or communication performance may be the limiting factor in fu-ture computing systems. Large scale scientic computations, in addition to requiringa great deal of computational power, also deal with large quantities of data. Atpresent, a typical Grand Challenge Application could require 1Gbyte to 4Tbytes ofdata per run [38]. These gures are expected to increase by orders of magnitude asteraop machines make their appearance. Although supercomputers have very largemain memories, the memory is not large enough to hold this much amount of data.Hence, data needs to be stored on disk and the performance of the program dependson how fast the processors can access data from disks. In order to have a balancedsystem [75], it is essential that the I/O bandwidth is comparable to the CPU andcommunication bandwidth. Unfortunately, the performance of the I/O subsystems ofMPPs has not kept pace with their CPU and communication capabilities. It is still
CHAPTER 1. INTRODUCTION 6orders of magnitude more expensive to do I/O than to do computation or communi-cation. Improvements are needed in both hardware and software in order to improveI/O performance.1.5 Contributions of this ThesisThis thesis addresses several issues in providing runtime support for in-core as wellas out-of-core programs on distributed memory parallel computers. This runtimesupport can be used for performing many of the commonly required operations inapplication programs written using a distributed memory programming model. Theuse of runtime support makes it easier to write application programs and providesgreater eciency and portability. We mainly focus on runtime support for regularcomputations in which the communication and I/O patterns are known statically.This runtime support can also be used together with a compiler to translate pro-grams written in a high-level data-parallel language like HPF to node programs fordistributed memory parallel computers. In fact it forms an essential part of the For-tran 90D/HPF compiler developed at Syracuse University [13]. Runtime supporthelps to separate the machine dependent aspects of compilation from the machineindependent aspects. The compiler can do all the machine independent transforma-tions and the runtime system can be optimized for each dierent machine. Thus,a portable and ecient compiler can be obtained by porting the runtime system todierent machines. The runtime support discussed in this thesis is general and can beused in any other HPF compiler or a compiler for any other data-parallel language.In distributed memory programs, arrays are distributed across processors in somefashion. For a number of reasons, it is often necessary to change the distribution, orredistribute the arrays during the course of program execution. This thesis presentsecient algorithms for runtime array redistribution. Since array distribution andredistribution is rigorously dened in HPF, the algorithms are developed for redis-tributing arrays as dened in HPF. The algorithms are independent of the commu-nication mechanism used and hence are portable. A novel approach is proposed for
CHAPTER 1. INTRODUCTION 7performing the general cyclic(x) to cyclic(y) redistribution using two methods, calledthe GCD Method and the LCM Method, which have low runtime overhead. We haveimplemented all the algorithms on the Intel Touchstone Delta and they are found toperform very well for dierent number of processors and array sizes.A collective communication pattern which arises very often in many applicationssuch as two-dimensional Fast Fourier Transform, parallel quicksort as well as in arrayredistribution, is the all-to-all communication pattern, also called complete exchange.This thesis presents several algorithms for all-to-all collective communication on fat-tree and two-dimensional mesh interconnection topologies. Previous work in thisarea tried to minimize link contention by increasing the number of communicationsteps. The algorithms proposed in this thesis take advantage of the fact that inmany of the present generation machines like the Touchstone Delta and Paragon, thecommunication links have excess bandwidth and can tolerate a certain amount oflink contention. Hence communication can be performed in fewer steps by allowing asmall amount of link contention to exist. The performance of these algorithms on theCM-5 and Touchstone Delta is studied extensively. A model for estimating the timetaken by these algorithms on the basis of system parameters has been developed andvalidated by comparing with experimental results.A large number of applications deal with very large data sets which cannot tin main memory, and hence have to be stored in les on disks. This thesis alsodescribes the design and implementation of ecient runtime support for out-of-corecomputations. The runtime system supports three dierent models of data storageand access. A number of optimizations have been incorporated for improved per-formance. A new method, called the Extended Two-Phase Method, is proposed foraccessing sections of out-of-core arrays eciently. This method uses collective I/Oin which processors cooperate to perform I/O in an ecient manner by combiningseveral I/O requests into fewer larger requests, eliminating multiple le accesses forthe same data and reducing contention for the I/O subsystem. A dynamic schemeis used for partitioning the I/O workload among processors, depending on the accessrequests. Performance results obtained using this runtime support for out-of-core
CHAPTER 1. INTRODUCTION 8programs on the Touchstone Delta are presented.A Parallel Compiler Runtime Consortium (PCRC) [45] has recently been formedwith the goal of developing a common runtime support for high level parallel lan-guages. We believe that the research presented in this thesis would be very usefulto this consortium. It is also useful to vendors developing commercial HPF com-pilers, such as The Portland Group Inc. (PGI), Applied Parallel Research (APR),Digital Equipment Corporation (DEC) and others. Application programmers writ-ing distributed memory programs would also benet a great deal by using the ideasproposed in this thesis.1.6 Related WorkThis section briey describes some of the related research in parallel languages, par-allel compilers, runtime support systems and support for high-performance parallelI/O.The concept of dening processor arrays and distributing data to them was rstintroduced in the programming language BLAZE [82] in the context of shared memorysystems with non-uniform access times. This research was continued in the Kali [65]programming language for distributed memory machines. The Kali compiler usesthe inspector/executor strategy to parallelize irregular computations. The compila-tion system SUPERB [129] parallelizes sequential programs semi-automatically fordistributed memory machines. The SUPERB tool transforms sequential Fortran pro-grams with data distribution annotations into parallel programs. Compilers for func-tional languages like Id Nouveau and Crystal have been developed for distributedmemory machines. The Crystal compiler generates communication statements bystudying the access patterns of the arrays in a statement. Split-C [32] is a parallelextension of C intended for high performance programming on distributed memorymultiprocessors. It provides a global address space and allows a mixture of sharedmemory, message passing and data-parallel programming styles for both regular and
CHAPTER 1. INTRODUCTION 9irregular problems. A compiler for Split-C is being developed at the University of Cal-ifornia, Berkeley, with a runtime support system which uses Active Messages [124].pC++, a data-parallel extension to C++, has been developed at Indiana Univer-sity [5]. A Fortran D compiler is being developed at Rice University [122].Some research has been done in developing compilers which can automaticallydetermine a good distribution and alignment of arrays instead of having the userspecify them. Gupta [50] has proposed a constraint based approach to automati-cally determine a good data distribution. This method has been incorporated in thePARADIGM compiler [49, 112]. The PARADIGM project at the University of Illi-nois aims at developing a fully automated compiler to translate sequential programsto parallel programs for distributed memory computers. The problem of automaticalignment of arrays has been studied by Chatterjee et al. [22] and Li et al. [79].There has been some research in runtime support for applications with irregulardata access patterns. The PARTI/CHAOS toolkit is a collection of runtime libraryroutines to handle irregular computations [35, 102]. These primitives have been inte-grated with the Fortran D compiler at Rice University, the Fortran 90D/HPF com-piler at Syracuse University and the Vienna Fortran compiler at the University ofVienna. Compilation methods for irregular problems have been investigated by Pon-nusamy [93], Das [34] and Hanxleden [125]. Agrawal et al. [1] describe how runtimesupport can be integrated with a compiler to solve irregular block-structured prob-lems.Research has also been done in the area of array redistribution. Gupta et al. [53]and Koelbel [66] provide closed form expressions for determining the send and re-ceive processor sets and data sets for redistributing arrays between block and cyclicdistributions. An analytical model for evaluating the communication cost of data re-distribution is presented in [63]. A multiphase approach to redistribution is discussedin [64]. Wakatani and Wolfe [126] describe a method of array redistribution calledStrip Mining Redistribution in which parts of an array are redistributed in sequence,instead of redistributing the entire array at one time as a whole. The reason for doing
CHAPTER 1. INTRODUCTION 10this is to try to overlap the communication involved in redistribution with some ofthe computation in the program. Kalns and Ni [62] present a technique for mappingdata to processors so as to minimize the total amount of data that must be com-municated during redistribution. Ramaswamy and Banerjee [99] discuss algorithmsfor redistribution based on a mathematical representation for regular distributionscalled PITFALLS. There has also been some research on the closely related prob-lem of determining the local addresses and communication sets for array assignmentstatements likeA(l1 : h1 : s1) = B(l2 : h2 : s2) where A and B have dierent cyclic(m)distributions [23, 110, 111, 52].Algorithms for all-to-all communication (complete exchange) on a hypercube havebeen proposed by Bokhari [6], and also by Johnsson and Ho [61]. Complete exchangealgorithms for a two-dimensional mesh are discussed in [103, 7, 113, 51, 54]. Optimalcommunication algorithms on the hypercube have been developed by Fox and Fur-manski [42]. Algorithms for scheduling irregular communication patterns have beenproposed by Wang and Ranka [100, 127] and also by Shankar and Ranka [106, 107,108].Compiling out-of-core data-parallel programs is a fairly new topic and there hasbeen very little research in that area. A model and compilation strategy for out-of-coredata-parallel programs is discussed in [10]. Bordawekar [11, 8] is developing a compilerfor out-of-core HPF programs which uses the runtime library [116, 25, 115, 118]discussed in Chapters 5 and 6 of this thesis. Cormen and Colvin [31] are developinga compiler-like preprocessor for out-of-core C*, called ViC*, which translates out-of-core C* programs to standard C* programs with calls to a runtime library forI/O. Paleczny et al. [89] are also developing techniques for compiling out-of-coredata-parallel programs. Brezany et al. [18] are working on compilation techniquesfor out-of-core programs in the context of Vienna Fortran. Language extensions forout-of-core data-parallel programs are proposed in [8, 17, 109, 18].There has been a lot of eort to improve parallel I/O performance both by hard-ware and software means. Various RAID schemes (Redundant Arrays of Inexpensive
CHAPTER 1. INTRODUCTION 11Disks) are proposed in [91] for better performance, reliability, power consumption andscalability. An excellent overview of RAID concepts is given in [24]. Disk stripingwhere data is distributed across disks at a small granularity so that each block isdistributed across all disks is proposed in [101]. File declustering, where dierentblocks of a le are stored on distinct disks is suggested in [81]. This is used in theBridge File System [39], in Intel's Concurrent File System (CFS) [92] and in variousRAID schemes [91]. Vesta is a parallel le system which supports logical partitioningof les [29, 27]. The RAMA [84] le system distributes le blocks across disks ran-domly using a hash function, instead of the usual striped layout. Runtime librariesfor parallel I/O, such as the Panda Library [104, 105] and the Jovian Library [4],are being developed. Portable parallel le systems such as VIP-FS [55], PIOUS [85]and PPFS [60] have been developed recently. Techniques for improving I/O perfor-mance using collective I/O have also been proposed. Two-phase I/O is a techniquefor performing collective I/O using a runtime library [37, 12]. Disk-directed I/O is atechnique for performing collective I/O at the le system level [69, 70, 71].1.7 Organization of this ThesisThe rest of this thesis is organized as follows. Chapter 2 gives an overview of someof the issues in providing runtime support for in-core and out-of-core data-parallelprograms. Runtime support for array redistribution is discussed in Chapter 3. Chap-ter 4 describes runtime support for all-to-all collective communication on fat-tree andtwo-dimensional mesh interconnection topologies. Runtime support for out-of-coreprograms is discussed in Chapters 5 and 6. Chapter 5 describes three models of datastorage and access for out-of-core programs, and a number of local optimizations foraccessing out-of-core data eciently. Chapter 6 describes the Extended Two-PhaseMethod for accessing sections of out-of-core arrays using collective I/O. Finally, con-clusions are presented in Chapter 7, along with some ideas for future work.
Chapter 2Issues in Runtime SupportThis chapter gives a brief overview of the various issues involved in providing runtimesupport for programs on distributed memory parallel computers. Runtime supportcan either be used directly in message passing application programs, or used togetherwith a compiler to translate programs written in a high-level data-parallel languagesuch as HPF.2.1 Runtime Support for Regular ProblemsThere are many scientic applications which have very regular array access patterns.These access patterns may arise either from the underlying physical domain beingstudied, or the type of algorithm used. Examples of such applications include Matrix Multiplication, LU Decomposition and other operations in dense linearalgebra Solving Partial Dierential Equations (PDEs) on regular meshes Fast Fourier TransformThe main characteristic of such applications is that the communication pattern isknown statically before program execution. Thus all the necessary optimizationscan be performed beforehand at compile time. In such applications, data is usually12
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 13distributed using either a block, cyclic, or block-cyclic distribution [119]. The arraysubscripts used in the program are usually linear functions of the loop indices.Many dierent communication patterns are possible depending on the array ac-cess pattern in the program. Li and Chen [78] characterize many of the commonlyoccurring communication patterns. A library of collective communication routines isneeded for carrying out communication eciently. A particular type of communica-tion pattern which occurs frequently is the all-to-all communication pattern. Ecientalgorithms for all-to-all communication are presented in Chapter 4 of this thesis. Run-time support is also needed for array redistribution. Although arrays are distributedamong processors at the start of the program, it is very often necessary to changethe distribution during program execution, which is called array redistribution. Thisinvolves calculation of source and destination processor and index sets as well as in-terprocessor communication. Runtime support for array redistribution is discussedin detail in Chapter 3 of this thesis.2.2 Runtime Support for Irregular ProblemsThere is another set of scientic applications in which the array access pattern isirregular. This is usually due to the fact that the underlying physical domain isirregularly connected. Examples of such applications include Computational Fluid Dynamics (CFD) codes using unstructured meshes Molecular dynamics codes Sparse iterative linear systems solversIn irregular problems, arrays are accessed using one or more level of indirection. Anexample of this is the following loopdo i=1, nA(x(i)) = B(y(i)) + C(z(i))end doA regular distribution of data, such as block, cyclic or block-cyclic, may not be the best
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 14distribution for these problems because it may result in higher communication cost. Agraph partitioning algorithm is normally used to determine the best distribution in thecase of irregular applications. Such a distribution is called an irregular distribution.Due to the indirection in the array access, the communication pattern is not knownstatically at compile time. It depends on the values of the indirection arrays, whichmay not be known until runtime. Thus a runtime resolution scheme is needed todetermine the communication required.Runtime support for irregular problems has been studied by a number of re-searchers [93, 102, 34, 35, 68]. One way of detecting and performing the communi-cation is by using an inspector-executor [68, 102] approach. A parallel loop is trans-formed into two constructs called an inspector and an executor. During programexecution, the inspector examines the data references made by a processor, and cal-culates what o-processor data needs to be fetched and where that data will be storedonce it is received. The executor loop then uses the information from the inspectorto perform the actual communication and computation.PARTI [35] is a library of runtime routines for solving irregular problems ondistributed memory computers. PARTI primitives can be directly used to generatethe inspector/executor pairs. Each inspector produces a communication schedule,which is essentially a pattern of communication for gathering or scattering data. Inorder to avoid duplicate accesses, a list of o-processor data references is stored locallyfor each processor in a hash table. For each new o-processor data reference required,a search through the hash table is performed in order to determine if this reference hasalready been accessed. If the reference has not previously been accessed, it is storedin the hash table, otherwise it is discarded. The primitives thus only fetch a singlecopy of each unique o-processor distributed array reference. The executor containsembedded PARTI primitives to communicate data. The primitives issue instructionsto gather, scatter or accumulate (i.e. scatter followed by add) data according to theschedule created by the inspector. Latency or startup cost is reduced by packingsmall messages intended for the same destination into one large message.
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 152.3 Runtime Support for CompilersData-parallel languages like HPF [57] and pC++ [5] have recently been developedto provide support for high performance programming on parallel machines. Theselanguages provide a framework for writing portable parallel programs independentof the underlying architecture and other idiosyncrasies of the machine. The sameprogram can be run on dierent machines by simply using a dierent compiler foreach machine. Compilers for such languages usually rely on runtime support to carryout many of the commonly required operations usually involving communication.Runtime support helps to separate the machine dependent aspects of compilation fromthe machine independent aspects. The compiler can do all the machine independenttransformations and the runtime system can be optimized for each dierent machine.Thus, a portable and ecient compiler can be obtained by simply porting the runtimesystem to dierent machines.We briey describe some of the issues related to providing runtime support for anHPF compiler. We do not discuss runtime support for compiling irregular problems;that is explained in detail in [93]. We rst outline the salient features of HPF in orderto explain the runtime support needed for an HPF compiler.2.3.1 Overview of HPFHPF was designed to be a standard portable programming language for writing e-cient computationally intensive parallel programs. HPF uses Fortran 90 as its baselanguage and provides several extensions to Fortran 90. The new HPF language fea-tures fall into four categories with respect to Fortran 90: new directives, new languagesyntax, new library routines, and some language restrictions. The new directives arestructured comments that suggest implementation strategies or assert facts about aprogram to the compiler. They may aect the eciency of the computation per-formed, but do not change the value computed by the program. Compiler directivesform the heart of the HPF language. They start with the prex !HPF$, CHPF$ or*HPF$ which would actually be treated as comments in Fortran 90. Some of the
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 16new language features are FORALL statement and FORALL construct. HPF extendsthe Fortran 90 library of intrinsic functions to include additional reduction functions,combining scatter functions, prex and sux functions, and sorting functions.The HPF approach is based on two key observations. First, the overall eciencyof the program can be increased if many operations are performed concurrently bydierent processors, and secondly, the eciency of a single processor is likely to bethe highest if the processor performs computations on data elements stored in itslocal memory. Therefore, HPF provides means for explicit expression of parallelismand data mapping. The data alignment and distribution directives in HPF allow theprogrammer to inform the compiler how to partition arrays. Arrays are rst alignedto a template or index space using the ALIGN directive. The DISTRIBUTE directivespecies how the template is to be distributed among a set of abstract processors.The mapping of abstract processors to physical processors is not specied by HPFand is language-processor dependent. The combination of alignment (from arraysto templates) and distribution (from templates to processors) thus determines themapping of an array to the processors. The data mapping is declared using thedirectives: PROCESSORS, ALIGN, DISTRIBUTE and, optionally, TEMPLATE. In addition,arrays may be remapped at runtime. For this, the array must be declared usingthe DYNAMIC directive and the actual remapping is specied using the executabledirectives REALIGN and REDISTRIBUTE.In HPF, an array may be aligned with another in many ways including shifts,strides, or any other linear combination of a subscript (ie., n i+m), transpositionof indices, and collapse or replication of array dimensions. Irregular alignments are notallowed. The template can be distributed as BLOCK, CYCLIC or CYCLIC(m). In a blockdistribution, contiguous blocks of the array are distributed among the processors. Ina cyclic distribution, array elements are distributed among processors in a round-robin fashion. In a cyclic(m) distribution, blocks of size m are distributed cyclically.Irregular distributions are not allowed in version 1.0 of HPF.
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 172.3.2 Compiler and Runtime Support for HPFAn HPF compiler which translates in-core HPF programs to message passing nodeprograms for distributed memory parallel computers has been developed at Syra-cuse University [13, 15]. It translates HPF programs to Fortran 77 programs withcalls to a runtime library [2, 14, 119]. The compiler only exploits the parallelismexpressed in data parallel constructs such as FORALL, WHERE and array assignmentstatements. It does not attempt to parallelize other constructs such as DO loops andWHILE loops, since they are used only as naturally sequential control constructs inHPF. The compiler mainly recognizes commonly occurring computation and com-munication patterns. These patterns are then replaced by calls to the optimizedruntime support system routines. The runtime support system includes parallel in-trinsic functions, data redistribution routines, communication primitives and severalother miscellaneous routines.The basic structure of the HPF compiler is organized around four major mod-ules | parsing, data partitioning, communication detection and insertion, and codegeneration as shown in Figure 2.1. The compiler rst creates a parse tree from theinput HPF program. Also, each array assignment statement and WHERE statement isinternally transformed into its equivalent FORALL statement, so that the subsequentsteps only need to deal with FORALL statements. The partitioning module processesthe data distribution directives, namely TEMPLATE, DISTRIBUTE and ALIGN, and par-titions data and computation among processors. After partitioning, the parallel con-structs in the node program are sequentialized since the code will be executed on asingle processor. Array operations and FORALL statements in the original programare transformed into DO loops. The communication module detects communicationrequirements and inserts appropriate communication primitives.Finally, the code generator produces loosely synchronous [44, 46] SPMD code.The generated code is structured as alternating phases of local computation andcommunication. Local computations consist of operations by each processor on thedata in its own memory. The communication phase includes any transfer of dataamong processors, possibly with arithmetic or logical computation on the data as it
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 18
Communication Insertion
and Optimization
Sequentialization 
and Optimization
Dependency  Analysis
and Partitioning
Lexer and Parser
Fortran 77+MP
Code
Code Generation
HPF Code
Figure 2.1: Phases of Compilationis transferred (e.g. reduction functions). In such a model, processors do not need tosynchronize during local computation. But, if two or more processors communicatewith each other, they are implicitly synchronized during the communication.Communication LibraryThe HPF compiler described above relies on a very powerful runtime support systemwhich includes a library of collective communication routines [13], a library of intrinsicfunctions [2, 14] and other runtime routines such as for array redistribution [119].The HPF compiler produces calls to collective communication routines instead ofgenerating individual send and receive calls inside the compiled code. This is done
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 19for the following reasons:1. Improved performance: To achieve good performance, interprocessor communi-cation must be minimized. By developing a separate library of interprocessorcommunication routines, each routine can be optimized.2. Increased portability of the compiler: By separating the communication libraryfrom the basic compiler design, portability is enhanced because to port thecompiler, only the machine specic low-level communication calls in the libraryneed to be changed.3. Better estimation of communication costs: The cost of collective communicationroutines can be determined more precisely, which enables one to make a betterestimate of the time a program will take.The compiler must recognize the presence of collective communication patterns inthe program in order to generate the appropriate communication calls. This involvesa number of tests on the relationship among subscripts of various arrays in a FORALLstatement. These tests also include information about array alignments and distri-butions. The compiler uses pattern matching techniques to detect communicationpatterns [13, 15].Intrinsic LibraryIntrinsic functions form an important feature of Fortran 90 and HPF. They directlysupport many of the basic data-parallel operations on arrays and provide a meansfor expressing operations on arrays concisely. HPF includes all Fortran 90 intrinsicfunctions and also adds a number of new intrinsic procedures in three categories:system inquiry intrinsics, mapping inquiry intrinsics, and computational intrinsics.The computational intrinsic functions fall into the following main categories:- Simple Reduction Functions: ALL, ANY, COUNT, MAXVAL, MINVAL, PRODUCT,SUM.
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 20 Combining Scatter Functions: ALL SCATTER, ANY SCATTER, COUNT SCATTER,MAXVAL SCATTER, MINVAL SCATTER, PRODUCT SCATTER, SUM SCATTER Prex and Sux Functions: ALL PREFIX, ANY PREFIX, COUNT PREFIX,MAXVAL PREFIX, MINVAL PREFIX, PRODUCT PREFIX, SUM PREFIX,ALL SUFFIX, ANY SUFFIX, COUNT SUFFIX, MAXVAL SUFFIX, MINVAL SUFFIX,PRODUCT SUFFIX, SUM SUFFIX Array Sorting Functions: GRADE UP, GRADE DOWN Array Manipulation Functions: CSHIFT, EOSHIFT, TRANSPOSE. Array Location Functions: MAXLOC, MINLOC Array Construction Functions: SPREAD, MERGE, PACK, UNPACK Vector and Matrix Multiplication Functions: DOT PRODUCT, MATMUL. Bit Manipulation functions: IAND, IOR, POPCNT, POPPAR, LEADZ Elemental Intrinsics functions: SIN, COS, TANIt is necessary to have a library of these intrinsic functions which can be calledfrom the node programs of a distributed memory machine [14]. The HPF compilerdetects calls to intrinsic functions in the HPF program and replaces them with callsto these routines. Since arrays in HPF can have up to seven dimensions and can bedistributed in many dierent ways, it is not feasible to write intrinsic libraries for allpossible distributions. A more practical approach is to write optimized routines for afew distributions. If the distribution of an array is dierent from what is expected bythe intrinsic library, the array must rst be redistributed to the desired distribution,and after returning from the intrinsic, it must be redistributed back to the originaldistribution. It is essential to use ecient algorithms for redistribution, so as tominimize the redistribution overhead. Ecient algorithms for redistributing arraysare discussed in Chapter 3 of this thesis.
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 212.4 Runtime Support for Out-of-Core ProgramsAn out-of-core program is one in which all the data required by the program cannott in main memory at the same time. Hence data needs to be stored in les onsecondary storage such as disks. During program execution, only a portion of thedata can be present in memory at any time. This makes it necessary to move databack and forth between main memory and disks. Since the cost of performing I/Ois very high compared to computation and communication, this can prove to be veryexpensive. Hence it is necessary to do the I/O in out-of-core programs eciently toget good performance.The I/O performance is best when a processor makes large contiguous requestsinstead of many small requests. This is because of the high latency time associatedwith any I/O operation. In a parallel computer, when many processors need to doI/O simultaneously, there is the additional problem of contention for the I/O system.Hence it is necessary to schedule I/O requests of dierent processors as well as reorderand combine I/O requests within and across processors into large contiguous requests.We believe that this can best be done by having a library of optimized routines whichcan be directly called from an out-of-core program.Another important factor is the portability of out-of-core programs. There isno standard parallel le system interface or Application Program Interface (API)at present. Each parallel machine has its own parallel le system with a completelydierent interface from that of any other parallel le system. Hence programs writtendirectly using calls to the parallel le system are not portable to other systems. Thisproblem can be overcome by using runtime support. The application programs cancall a runtime library which provides a common interface for all machines. Theroutines in the runtime library can be implemented using the native parallel lesystem on each dierent machine.Thus runtime support is needed in the case of out-of-core programs for eciencyand portability. As in the in-core case, this runtime support can also be used togetherwith a compiler to translate out-of-core programs written in a high-level language like
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 22HPF to node programs with calls to the runtime library for I/O and communication.Chapters 5 and 6 of this thesis describe in detail the design and implementation ofruntime support for out-of-core programs, including a number of optimizations.2.4.1 Out-of-Core ApplicationsThere are a number of applications which deal with very large data sets, resulting inout-of-core programs. These applications exist in diverse areas such as large scale sci-entic computations, database and information processing, hypertext and multimediasystems, information retrieval and many others.del Rosario and Choudhary [38] have done a survey of the I/O requirements ofseveral Grand Challenge applications. They nd that the data requirements of theseapplications range from 1 Gbyte to 4 Tbytes per run. Some of the applications theysurveyed are:- Imaging of Planetary Data: The spacecraft Magellan has gathered more than3 Tbytes of data from the surface of the planet Venus. To produce a three-dimensional rendering of the surface of Venus at 200 Mbytes of data per framewould require over 13 Gbytes/sec. of I/O throughput at 50 frames per sec-ond [48]. Climate Prediction: A climate prediction code using the General CirculationModel (GCM) has the following requirements on the Intel Touchstone Delta.For a 100-year atmosphere run with 300 km2 resolution and 0.2 simulatedyears/machine hour, the simulation takes 3 weeks run time and generates 1144Gbytes of data at 38 Mbytes per simulation minute. For a 1000-year coupledatmosphere-ocean run with a 150 km2 resolution, the atmospheric simulationtakes about 30 weeks, while ocean simulation takes 27 weeks; the process pro-duces 40 Mbytes of data per simulation minute, or a total of 20 Tbytes of datafor the entire simulation [40].
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 23 Four-Dimensional Data Assimilation: This application from NASA incorpo-rates actual space-time observations into mathematical and computational mod-els in order to create a unied, complete description of the atmosphere. Thealgorithm for this currently operates on a 3 Tbyte database with single runsproducing 100Mbytes to several Gbytes of output. These gures are expectedto increase by orders of magnitude in the near future.Cormen [30] has also compiled a list of several applications which deal with hugedata sets. These include applications in computational biology, computational uiddynamics, combinatorial search, conjugate gradient solvers, genetic algorithms, geo-physics, region growing, logic simulation, computational physics, meteorology, molec-ular dynamics, ocean modeling, partial dierential equations solvers, synthetic aper-ture radar image processing, visualization and graphics. The Applications WorkingGroup of the Scalable I/O Initiative has provided a description of the I/O require-ments of several applications in biology, chemistry, physics, earth sciences, engineeringand graphics [97]. Fox [41] presents a performance analysis of applications on a hyper-cube machine with a hierarchical memory at each node, consisting of a fast cache andslower main memory. The applications considered are the N-body problem, Poisson'sequation solver, matrix multiplication, LU Decomposition, Fast Fourier Transformand neural networks. This performance analysis can be extended to out-of-core ap-plications in which the memory hierarchy includes a much slower secondary memory.There have also been studies of the le-access characteristics of application pro-grams on multiprocessor systems. Kotz and Nieuwejaar [74] present the results ofa three week tracing study in which all le-related activity on an Intel iPSC/860running production, parallel scientic applications at NASA Ames Research Centerwas recorded. An interesting result of this study was that there were a large numberof small strided requests to the le system. They found that 96.1% of all reads werefor fewer than 4000 bytes, but those reads transferred only 2% of all data read. Forwriting, 89.4% of all requests were for fewer than 4000 bytes, but those writes trans-ferred only 3% of all data written. Another study of the le-access characteristics
CHAPTER 2. ISSUES IN RUNTIME SUPPORT 24of production applications on the CM-5 at the National Center for SupercomputingApplications, University of Illinois, also found similar results of a large number ofsmall requests [98]. This shows that although it is well-known that I/O should bedone in large chunks to minimize the eect of high I/O latency, many parallel out-of-core applications actually access small strided data sets. Hence, it is necessary toprovide runtime support for accessing small strided data eciently. This issue is ad-dressed in this thesis and optimizations, such as data sieving and collective I/O usingan Extended Two-Phase Method, are proposed to access strided data in an ecientmanner.
Chapter 3Runtime Support for ArrayRedistribution3.1 IntroductionIn distributed memory parallel computers, arrays have to be distributed among pro-cessors in some fashion. The distribution can either be regular i.e. block, cyclic orblock-cyclic as in Fortran D [43] and High Performance Fortran (HPF) [57, 67]; orirregular in which there is no function specifying the mapping of arrays to processors.The distribution of an array does not need to remain xed throughout the program.In fact, it is very often necessary to change the distribution of the array at runtime,which is called array redistribution. This involves calculating source and destinationprocessor and index sets as well as data communication. It is essential to use ecientalgorithms for redistribution, otherwise the performance of the program may degradeconsiderably.This chapter describes ecient algorithms for runtime array redistribution. Sincearray distribution and redistribution is rigorously dened in HPF, the algorithms aredeveloped for redistributing arrays as dened in HPF. We consider block(m) to cyclic,cyclic to block(m) and the general cyclic(x) to cyclic(y) type redistributions. For thegeneral cyclic(x) to cyclic(y) redistribution, there is no direct algebraic formula tocalculate the source and destination processor and index sets [53]. We use a novelapproach for doing the general cyclic(x) to cyclic(y) redistribution, where we rst25
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 26develop optimized algorithms for two special cases | when x is a multiple of y, or yis a multiple of x. For the general case, we propose two algorithms called the GCDMethod and the LCM Method which make use of the optimized algorithms developedfor the above special cases. We initially describe algorithms for one-dimensionalarrays and then extend the methodology to multidimensional arrays. The algorithmsproposed have low runtime overhead. We study the performance of these algorithmson the Intel Touchstone Delta.3.1.1 Need for RedistributionHPF provides directives (ALIGN and DISTRIBUTE) which specify how arrays shouldbe partitioned among the processors of a distributed memory computer. Arrays arerst aligned to a template or index space. The DISTRIBUTE directive species how thetemplate is to be distributed among the processors. In HPF, an array (or template)can be distributed as BLOCK(m) or CYCLIC(m) [57]. Though the distribution of anarray is specied at compile time, there are a number of reasons for which it may benecessary to redistribute an array during the execution of the program. HPF has a directive called REDISTRIBUTE by which an array can be redis-tributed anywhere in the program provided the array was declared as DYNAMIC.REDISTRIBUTE can be considered as an executable directive. HPF provides intrinsic functions which operate on arrays. It is not practicalto write intrinsic and runtime libraries for all possible distributions, especiallysince arrays can have up to seven dimensions. Libraries can be written fora few common distributions and for any other distribution it is necessary toredistribute before calling the subroutine. After returning from the subroutine itis necessary to redistribute back to the original distribution. We call this type ofredistribution as a circular redistribution. This is illustrated in Figure 3.1 whichshows an HPF program calling the MATMUL intrinsic with all arrays distributed as(BLOCK,*). The MATMUL library routines have been written for (BLOCK,BLOCK)and (CYCLIC,CYCLIC) distributions. So it is necessary to redistribute to one
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 27
REAL A(N,N), B(N,N), C(N,N)
DISTRIBUTE (BLOCK,*) :: A, B, C
C = MATMUL(A, B)
END
HPF  PROGRAM
MATMUL1()
(BLOCK,BLOCK)
 FOR
(CYCLIC,CYCLIC)
MATMUL2()  FOR
REDISTRIBUTE REDISTRIBUTEOR
EXIT
REDISTRIBUTE REDISTRIBUTE
SUBROUTINE
MATMUL()
Figure 3.1: Need for Redistributionof these distributions before calling the intrinsic and then redistribute back to(BLOCK,*) after the intrinsic. Arrays and array sections can be passed as arguments to subroutines. The array(or array section) can be specied to have any distribution in the subroutine.If this distribution is dierent from that in the calling program, the array (orarray section) needs to be redistributed before the subroutine is called. At theend of the subroutine, the array (or array section) needs to be redistributed tothe original distribution. In many applications such as 2D FFT and the ADI method for solving multidi-mensional PDEs, dynamic redistribution can result in signicant performanceimprovement [20].An example of an HPF program using redistribution is shown in Figure 3.2. Thisis a two-dimensional FFT program in which the array is rst block distributed along
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 28REAL A(1024,1024)!HPF$ PROCESSORS P(4)!HPF$ DISTRIBUTE A(BLOCK,*) ONTO P!HPF$ DYNAMIC A...........CALL ROW FFT(A)!HPF$ REDISTRIBUTE A(*,BLOCK) ONTO PCALL COL FFT(A)...........ENDFigure 3.2: 2D FFT Program using Redistributionrows. A one-dimensional FFT is rst performed along each row of the array, whichcan be done without any communication. The array is then redistributed so that itis block distributed along columns. A one-dimensional FFT is performed along eachcolumn of the redistributed array, which again does not require any communication.3.2 Notations and DenitionsThe notations used in this chapter are given in Figure 3.3. We assume that all arraysare indexed starting from 1, while processors are numbered starting from 0 and thatarrays are identically aligned to the template. The algorithms can be easily extendedfor the general case. We also assume that the number of processors on which thearray is distributed remains the same before and after the redistribution.The block(m) and cyclic(m) distributions in HPF are dened as follows. Consideran array of size N distributed over P processors. Let us dene the ceiling division
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 29N global array sizeP number of processorspi logical processor ip logical number of the processorexecuting the programps source processorpd destination processorL local array sizem block sizeFigure 3.3: Notationsfunction CD(j; k) = (j + k   1)=k and the ceiling remainder function CR(j; k) =j   k  CD(j; k). Then block(m) distribution means that index j of the array ismapped to logical processor number CD(j;m)   1. Note that for a valid block(m)distribution, m  P  N must be true. Block by denition means the same asblock(CD(N;P )). In a cyclic(m) distribution, index j of the global array is mappedto logical processor numberMOD(CD(j;m) 1; P )1. Cyclic by denition means thesame as cyclic(1).Suppose we have 4 processors and an array of length 26. Distributing the arrayas BLOCK results in this mapping of array elements to processors :-Proc. 0 1 2 3 4 5 6 7Proc. 1 8 9 10 11 12 13 14Proc. 2 15 16 17 18 19 20 21Proc. 3 22 23 24 25 261MOD(a; b) = a modulo b
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 30Distributing the array as CYCLIC results in this mapping of array elements to proces-sors :- Proc. 0 1 5 9 13 17 21 25Proc. 1 2 6 10 14 18 22 26Proc. 2 3 7 11 15 19 23Proc. 3 4 8 12 16 20 24Distributing the array as CYCLIC(3) results in this mapping of array elements toprocessors :- Proc. 0 1 2 3 13 14 15 25 26Proc. 1 4 5 6 16 17 18Proc. 2 7 8 9 19 20 21Proc. 3 10 11 12 22 23 24In other words, in a block distribution, contiguous blocks of the array are dis-tributed among the processors. In a cyclic distribution, array elements are distributedamong processors in a round-robin fashion. In a cyclic(m) distribution, blocks of sizem are distributed cyclically. The cyclic(m) distribution with 1 < m < dN=P e iscommonly referred to as a block-cyclic distribution with block size m [43]. Block andcyclic distributions are special cases of the general cyclic(m) distribution. A cyclic(m)distribution withm = dN=P e is a block distribution and a cyclic(m) distribution withm = 1 is a cyclic distribution. The formulae for conversion between local and globalindices for the dierent distributions are given in Table 3.1.The redistribution algorithms proposed in this thesis are intended to be portable.Hence, we do not specify how data communication should be performed becausethe best communication algorithms are often machine dependent. Redistributionrequires all-to-many personalized communication in general and in many cases itrequires all-to-all personalized communication. These communication algorithms aredescribed in detail in Chapter 4 of this thesis and in [117, 95, 120]. The performanceresults presented in this chapter have been obtained using these algorithms. Wedo assume that all the data to be sent from any processor i to processor j has to be
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 31Table 3.1: Data Distribution and Index ConversionNote: This assumes that arrays are indexed starting from 1and processors are numbered starting from 0CD(j; k) = (j + k   1)=k and CR(j; k) = j   k  CD(j; k)BLOCK(m) CYCLIC CYCLIC(m)global index (g) to p = CD(g;m)  1 p =MOD(g  1; P ) p =MOD(CD(g;m)  1; P )processor number (p)global index (g) l = m+ CR(g;m) l = (g   1)=P + 1 l =MOD(g  1;m) + 1+to local index (l) (g=(mP ))mlocal index (l) to g = l+mp g = (l  1)P + p+ 1 g =MOD(l  1;m) + 1+global index (g) (P ((l  1)=m)+ p)mcollected together in a contiguous packet in processor i and sent in one communicationoperation, so as to minimize the communication startup cost. In the rest of thischapter, any division involving integers should be considered as integer division unlessspecied otherwise.3.3 Block(m) to Cyclic RedistributionWe rst consider the case of block(m) to cyclic redistribution, a few examples ofwhich are shown in Figure 3.4.Theorem 3.1 Let li1 and li2 be the local array sizes in processor pi corresponding toblock(m) and cyclic distributions respectively. In a block(m) to cyclic redistribution,the number of processors to which pi has to send data isP   1 if li1  Pli1 or li1   1 if li1 < PThe number of processors from which pi has to receive data is at mostCD(N;m) if li2  CD(N;m)li2 if li2 < CD(N;m)
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 32
p0 p1 p2 p31 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Blockp0 p1 p2 p31 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16 Cyclicp0 p1 p2 p3 p41 2 3 4 5 6 7 8 9 10 Blockp0 p1 p2 p3 p41 6 2 7 3 8 4 9 5 10 Cyclicp0 p1 p21 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Blockp0 p1 p21 4 7 10 13 2 5 8 11 14 3 6 9 12 15 Cyclicp0 p1 p2 p31 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Blockp0 p1 p2 p31 5 9 13 17 21 2 6 10 14 18 22 3 7 11 15 19 23 4 8 12 16 20 24 CyclicFigure 3.4: Block(m) to Cyclic Redistribution
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 33Proof: Note that if N is not a multiple of P , li1 may not be equal to li2. If li1 < P ,each of the li1 elements of pi corresponding to a block(m) distribution will lie in adierent processor when the array is distributed cyclically. At most one of the li1elements will be remapped to processor pi itself. Therefore, pi will have to send datato either li1 or li1   1 processors. If li1  P , then clearly pi has at least one elementto send to every other processor.In a block(m) distribution, the number of processors with data is given byCD(N;m).Therefore, in the receive phase, if li2 < CD(N;m), each processor will receive datafrom at most li2 processors. If li2  CD(N;m), each processor will receive data fromat most CD(N;m) processors. 2The algorithm for block(m) to cyclic redistribution is given in Figure 3.5. In thesend phase, each processor only needs to calculate the destination processor of the rstelement of the local array. The other elements have to be sent to the other processorsin a round-robin fashion. Thus the array is scanned only once, which makes gooduse of the cache. In the receive phase, the data received from other processors hasto be stored in contiguous memory locations in order of logical processor number. Inevery processor, the data received from processor 0 is stored rst; from processor 1second and so on. Hence addresses do not need to be calculated. If the amount ofdata to be received from all processors is known, the data can be directly receivedinto appropriate locations in the array.In a block(m) distribution, the last element N of the global array is mapped toprocessor pN = (N   1)=m. If pN < P   1, no elements of the array are mappedto processors pN + 1; pN + 2; :::; P   1. The index of the last element of the localarray in processors p < pN is last index = m. The index of the last element in pN islast index =MOD(N   1;m) + 1. The index of the rst element of ps  pN that ismapped to p in a cyclic distribution is given byfirst index =MOD[p  MODfps MOD(m;P ); Pg + P;P ] + 1
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 34Send Phase1. The destination processor (pd) of the rst element of the local array is pd = MOD(pm; P ).2. Destination processor of element i, i = 2 : N , is MOD(pd + i; P ).Receive Phase1. Last processor with data is pN = (N   1)=m2. For ps = 0 to pN do3. If (ps = pN ) then4. last index =MOD(N   1; m) + 15. Else6. last index =m7. The index of the rst element of ps mapped to p isfirst index = MOD[p MODfpsMOD(m;P ); Pg+ P; P ] + 18. Number of elements to be received from ps is 0, if (last index < first index),and (last index  first index)=P + 1, otherwise9. No data is to be received from processors pN + 1; pN + 2; :::; P   1.10 The received data is to be stored in order of processor number.Figure 3.5: Algorithm for Block(m) to Cyclic RedistributionIf m is divisible by P , the rst element of ps that is mapped to p is p + 1. However,if m is not divisible by P , a shift is introduced in this simple mapping which is takeninto account by the MOD expression in the above equation. Hence, the number ofelements to be sent from any processor ps to p is0; if (last index < rst index)or (ps > pN )(last index  rst index)=P + 1; otherwisewhere first index, last index and pN are calculated as above.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 353.4 Cyclic to Block(m) RedistributionA cyclic to block(m) redistribution is essentially the reverse of a block(m) to cyclicredistribution. The algorithm for cyclic to block(m) redistribution is given in Fig-ure 3.6. In a cyclic distribution, the size of the local array in processor p isL = ( dN=P e if MOD(N;P ) = 0; or p < MOD(N;P )dN=P e   1 otherwiseIn the send phase, each processor p calculates the destination processor pd andthe destination local address ld of the rst element of its local array as pd = CD(p+1;m)  1 and ld = m+CR(p+1;m). The rst (m  ld)=P +1 elements from ps haveto be sent to pd. The next set of elements starting from i = (m  ld)=P + 2 have tobe sent to processor pd1 = CD((i 1)P +p+1;m) 1. The destination local addressof element i is given by ld1 = m + CR((i  1)P + p + 1;m) and so (m   ld1)=P + 1elements starting from i have to be sent to processor pd1. This process is continuedfor the remaining elements of the array. Thus blocks of elements have to be sent todierent processors.In the receive phase, the data received cannot be directly stored in the array asthe data has to be stored with a stride equal to the number of processors. Hence thedata received has to be stored in a temporary buer in memory. This gives us twochoices in implementing the receive phase :{1. Synchronous Method: Each processor waits till it receives data from all otherprocessors, before placing any data in the local array. This increases the memoryrequirements of the algorithm and also increases processor idle time. Theseproblems worsen as the number of processors is increased, so this method is notscalable.2. Asynchronous Method: The processors do not wait for data from all processorsto arrive. Instead, each processor takes any packet which has arrived and placesthe data from that packet into appropriate locations in the local array. This
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 36Send Phase1. i = 12. While (i  L) do3. The destination processor (pd) and destination local address (ld) of element i of thelocal array is pd = CD((i  1)P + p+ 1; m)  1 and ld = m+ CR((i  1)P + p+ 1; m)4. The destination processor of elements i to i+ (m  ld)=P is pd.5 i = i+ (m  ld)=P + 1 Receive PhaseSynchronous Method:1. Receive packets from all processors.2. The source processor of the rst element of the local array ps = MOD(mp; P ).3. The source processor of element i, i = 2 : N , is ps = MOD(ps + 1; P ).Asynchronous Method:1. The source processor of the rst element of the local array is ps = MOD(mp; P ).2. The source processor of the kth element, 2  k  P , is MOD(ps + k   1; P ).3. For i = 0 to P   1 do4. Receive a packet from any processor pi5. Starting from the location calculated above, place elements fromthe packet into the array with stride P .Figure 3.6: Algorithm for Cyclic to Block(m) Redistributionmethod overlaps computation and communication. Each processor posts non-blocking receive calls and waits for any packet to arrive. As soon as a packet isreceived, it places the data in appropriate locations in the local array. Duringthis time, other packets may reach the processor. When the processor has placedall the data from the earlier packet into the local array, it takes up the nextpacket and so on. This reduces processor idle time. Since all packets do nothave to be in memory at the same time, it also reduces memory requirements.This method is scalable as neither processor idle time nor memory requirementsincrease as the number of processors is increased.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 37
0
100
200
300
400
500
600
700
0 20 40 60 80 100 120 140
T
i
m
e
 
(
m
s
)
Processors
Synchronous Method
Asynchronous Method
Figure 3.7: Performance of Cyclic to Block Redistribution, array size 1M integersThe array locations where incoming data has to be stored can be calculated as fol-lows. The source processor (ps) of the rst element of the local array is given byps = MOD(mp;P ). The next (P   1) elements will be received from the remainingprocessors in order of processor number. This cycle is repeated for all elements ofthe local array. If all packets are present in memory (Synchronous Method), the localarray needs to be scanned only once to be lled. If the packets are processed one ata time (Asynchronous Method), the array elements are lled with stride P and thearray has to be scanned P times. So, clearly the Synchronous Method makes betteruse of the cache than the Asynchronous Method. Figure 3.7 compares the perfor-mance of the Synchronous and Asynchronous Methods on the Intel Touchstone Deltafor a global array with 1M (220) integers and the number of processors varied between8 and 128. We observe that the Asynchronous Method performs much better thanthe Synchronous Method as it overlaps computation and communication and thusreduces processor idle time. This dierence is larger for a small number of processorsbecause the amount of data communicated per processor is larger.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 383.5 Cyclic(x) to Cyclic(y) RedistributionFor a general cyclic(x) to cyclic(y) redistribution, there is no direct algebraic formulato calculate the set of elements to send to a destination processor and the localaddresses of these elements at the destination [53]. Several complex schemes have beenproposed for performing this redistribution [110, 111, 23, 99]. Since redistributionhas to be done at runtime, a simple and ecient algorithm with minimum runtimeoverhead is necessary. We propose a new method for performing the general cyclic(x)to cyclic(y) redistribution, which has low runtime overhead. We rst consider twospecial cases where x is a multiple of y, or y is a multiple of x, and develop optimizedalgorithms for these two special cases. For the general case where there is no particularrelation between x and y, we have developed two algorithms called the GCD Methodand the LCM Method, which make use of the optimized algorithms developed for theabove two special cases.3.5.1 Special case x = k yWe rst consider the special case where x is a multiple of y. Let x = k y. An exampleof this is given in Figure 3.8.Theorem 3.2 In a cyclic(x) to cyclic(y) redistribution where x = k y, if k < P ,each processor communicates with k or k   1 processors. If k  P , each processorcommunicates with all other processors.Proof: Assume k < P . Since x = k y, each block of size x is divided into k sub-blocks of size y and distributed cyclically. Consider any processor pi. Assume thatit has to send its rst sub-block of size y to processor pj . Then the remaining k   1sub-blocks of the rst block are sent to the next k   1 processors in order. The nextk(P   1) sub-blocks of the global array are located in the other P   1 processors.This results in a total of k P sub-blocks. Hence the (k+1)th sub-block of size y in piis also sent to pj . Thus all sub-blocks from pi are sent to k processors starting from
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 39p0 p1 p21 2 3 4 13 14 15 16 5 6 7 8 17 18 19 20 9 10 11 12 21 22 23 24Cyclic(4)p0 p1 p21 2 7 8 13 14 19 20 3 4 9 10 15 16 21 22 5 6 11 12 17 18 23 24Cyclic(2)Figure 3.8: Cyclic(4) to Cyclic(2) Redistributionpj . One of these processors may be pi itself, in which case pi has to send data tok   1 processors. For the receive phase, consider the rst k P sub-blocks of size y inthe global array corresponding to the rst P blocks of size x. Let us number thesek P sub-blocks from 0 to k P   1. Out of these, the sub-blocks that are mapped toprocessor pi in the new distribution are numbered pi to P (k   1) + pi with stride P .These sub-blocks come from fP (k 1)+pig piP +1 = k processors. One of these processorsmight be pi itself, in which case pi receives data from k   1 processors.If k  P , each block of size x has to be divided into k sub-blocks and distributedcyclically, where the number of sub-blocks is greater than or equal to the numberof processors. So clearly each processor has to send to and receive from all otherprocessors (all-to-all communication). 2The algorithm for cyclic(x) to cyclic(y) redistribution, where x = k y is given inFigure 3.9. We call this the KY TO Y algorithm. In the send phase, each processorp calculates the destination processor pd of the rst element of its local array as pd =MOD(k p; P ). The rst y elements have to be sent to pd, the next y to MOD(pd +1; P ), the next to MOD(pd + 2; P ) and so on till the end of the rst block of sizex. The next k sub-blocks of size y have to be sent to the same set of k processorsstarting from pd. The sequence of destination processors can be stored and neednot be calculated for each block of size x. In the receive phase there are two casesdepending on the value of k :-
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 401. (k  P ) and (MOD(P; k) = 0) : In this case, each processor p calculates thesource processor of the rst block of size y of its local array as ps = p=k. Thenext block of size y will come from processor MOD(ps+P=k; P ), the next fromMOD(ps + 2(P=k); P ) and so on till the rst k blocks. The above sequence ofprocessors is repeated for the remaining sets of k blocks of size x and hence canbe stored and used. If the Synchronous Method is used for receiving data, thelocal array needs to be scanned only once and the ith block, 0  i  dL=ye  1,of size y of the local array will be read from the packet received from processorMOD(ps + i(P=k); P ). If the Asynchronous Method is used, the rst blockfrom the packet received from some processor pi will be stored starting at thelocation calculated above. The remaining blocks will be stored with stride x.2. If k does not satisfy the above condition, it is necessary to calculate the sourceprocessor of the rst element (j = i y + 1) of each block of size y, 0  i dL=ye   1, of the local array as ps = MOD[(i P + p)=k; P ]. The block is readfrom the packet received from ps. The sequence of processors does not repeatitself and hence cannot be stored. In this case, the Synchronous Method is used.We have tested the performance of the KY TO Y Algorithm using both Syn-chronous and Asynchronous Methods on the Intel Touchstone Delta. Figure 3.10 com-pares the performance of the Synchronous and Asynchronous Methods for a cyclic(4)to cyclic(2) redistribution of a global array of 1M integers for dierent number ofprocessors. We observe that the Asynchronous Method performs better than theSynchronous Method, even though in this case each processor communicates with atmost two other processors. This is because the Asynchronous Method overlaps com-putation and communication and thus reduces processor idle time. The better cacheutilization of the Synchronous Method does not improve its overall performance. Fig-ure 3.11 shows the performance of the KY TO Y Algorithm for a cyclic(4) to cyclic(2)redistribution on 64 processors for dierent array sizes. For small arrays, the dier-ence in performance between the Synchronous and Asynchronous Methods is small,because of the small data sets. For large arrays, the dierence is signicant because
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 41Send Phase1. The destination processor (pd) of the rst element of the local array ispd = MOD(k p; P ).2. For each block of size x in the local array3. For i = 0 to k   14. The destination processor of elements (i y + 1) to (i+ 1)y of this block ofsize x is MOD(pd + i; P ).Receive Phase1. If (k  P ) and (MOD(P; k) = 0) then2. The source processor (ps) of the rst element of the local array is ps = p=k.Synchronous Method:3. Receive data from all processors.4. For j = 1 to dL=xe do5. For i = 0 to k   1 do6. Read the next block of size y from the data received fromprocessor MOD(ps + i(P=k); P ).Asynchronous Method:3. The ith block of size y, 0  i  k   1, is to be received fromprocessor MOD(ps + i(P=k); P ).4. For i = 0 to k   1 do5. Receive data from any processor pi.6. Place the rst block of size y in the local array starting from thelocation calculated above, and the other blocks with stride x.7. Else8. Receive data from all processors.9. For i = 0 to dL=ye   1 do10. The source processor (ps) of the rst element (j = i y + 1) ofthis block of size y is ps = MOD[(i P + p)=k; P ]11. Read this block of size y from the data received from ps.Figure 3.9: KY TO Y Algorithm for Cyclic(x) to Cyclic(y) Redist., where x = k y
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 42
100
200
300
400
500
600
700
800
900
1000
0 20 40 60 80 100 120 140
T
i
m
e
 
(
m
s
)
Processors
Synchronous Method
Asynchronous Method
Figure 3.10: KY TO Y Algorithm, cyclic(4) to cyclic(2), array size 1M integersof the higher processor idle time in the Synchronous Method.3.5.2 Special case y = k xWe now consider the special case where y is a multiple of x. Let y = k x. This isessentially the reverse of the case where x = k y.Theorem 3.3 In a cyclic(x) to cyclic(y) redistribution where y = k x, if k < P ,each processor sends data to k or k   1 processors and receives data from k or k   1processors. If k  P , each processor has to communicate with all other processors(all-to-all communication).Proof: Assume k < P . Consider the rst k P sub-blocks of size x in the global arraycorresponding to the rst P sub-blocks of size y. Let us number these k P sub-blocksfrom 0 to k P   1. Out of these, the sub-blocks that are located in processor pi arenumbered from pi to P (k 1) 1+pi with stride P . In the new distribution, these sub-blocks will be mapped to fP (k 1) 1+pig piP +1 = k processors. One of these processors
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 43
0
50
100
150
200
250
300
1 10 100 1000
T
i
m
e
 
(
m
s
)
Array Size (Kbytes)
Synchronous Method
Asynchronous Method
Figure 3.11: KY TO Y Algorithm, cyclic(4) to cyclic(2), 64 processorsmight be pi itself, in which case pi sends data to k 1 processors. In the receive phase,since y = k x, each block of size y in the new distribution consists of k sub-blocks ofsize x which will come from k processors. Consider any processor pi. Assume thatit receives its rst sub-block of size x from processor pj. Then the remaining k   1sub-blocks of the rst block are received from the next k 1 processors in order. Theother P   1 processors receive the next k(P   1) sub-blocks of the global array. Thisresults in a total of k P sub-blocks. Hence the next sub-block in pi, which is the rstsub-block of the next block of size y, is also received from pj . Thus all sub-blocksfrom pi are received from k processors starting from pj . One of these processors maybe pi itself in which case pi receives data from k   1 processors.If k  P , each block of size y will consist of k sub-blocks of size x, where thenumber of sub-blocks is greater than or equal to the number of processors. So clearlyeach processor has to send to and receive from all other processors (all-to-all commu-nication). 2
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 44The algorithm for cyclic(x) to cyclic(y) redistribution, where y = k x, is given inFigure 3.12. We call this the X TO KX Algorithm. In the send phase, there are twocases depending on the value of k :-1. (k  P ) and (MOD(P; k) = 0) : In this case, each processor p calculates thedestination processor of the rst block of size x of its local array as pd = p=k.The next block of size x has to be sent to processorMOD(pd+P=k; P ), the nextto MOD(pd + 2(P=k); P ) and so on till the rst k blocks. The above sequenceof processors is repeated for the remaining sets of k blocks of size x, and henceneed not be calculated again.2. If k does not satisfy the above condition, it is necessary to individually calculatethe destination processor of each block i of size x, 0  i  dL=xe   1, aspd =MOD[(i P + p)=k; P ].In the receive phase, each processor p calculates the source processor of the rstelement of its local array as ps = MOD[k p; P ]. As in the KY TO Y algorithm,the receive phase can be implemented using either the Synchronous Method or theAsynchronous Method. If the Synchronous Method is used to receive data, for eachblock of size y of the local array, the k sub-blocks of size x are read from the packetsreceived from the k processors starting from ps in order of processor number. If theAsynchronous Method is used, we know that the ith block of size x of the local array,0  i  k  1, will be received from processor MOD(ps + i; P ). Thus the local indexof the rst block received from any source processor can be calculated. The remainingblocks have to be stored with stride y.We have tested the performance of the X TO KX Algorithm on the Intel Touch-stone Delta for dierent array sizes and number of processors. Figure 3.13 comparesthe performance of the Synchronous and Asynchronous Methods for a cyclic(2) tocyclic(4) redistribution of an array of 1M integers for dierent number of processors.Figure 3.14 compares the performance of the two methods for dierent array sizes on64 processors. The results are similar to those obtained for the KY TO Y Algorithm.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 45Send Phase1. If (k  P ) and (MOD(P; k) = 0) then2. The destination processor (pd) of the rst element of the local array is pd = p=k.3. For j = 0 to dL=ye   14. For i = 0 to k   15. The destination processor of the next block of size x of the local arrayis MOD(pd + i(P=k); P ).6. Else7. For i = 0 to dL=xe   18. The destination processor (pd) of the rst element (j = i x+ 1) of thisblock of size x is pd = MOD[(i P + p)=k; P ].Receive Phase1. The source processor (ps) of the rst element of the local array is element of thelocal array is ps = MOD[k p; P ].Synchronous Method:2. Receive data from all processors.3. For each block of size y in the local array do4. For i = 0 to k   1 do5. Read elements (i x+ 1) to (i+ 1)x of the current block of size y from thepacket received from processor MOD(ps + i; P ).Asynchronous Method:2. The ith block of size x, 0  i  k   1, is to be received from processor MOD(ps + i; P ).3. For i = 0 to k   1 do4. Receive data from any processor pi.5. Place the rst block of size x in the local array starting from the locationcalculated above, and the other blocks with stride y.Figure 3.12: X TO KX Algorithm for Cyclic(x) to Cyclic(y) Redist., where y = k x
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 46
100
200
300
400
500
600
700
800
900
1000
0 20 40 60 80 100 120 140
T
i
m
e
 
(
m
s
)
Processors
Synchronous Method
Asynchronous Method
Figure 3.13: X TO KX Algorithm, cyclic(2) to cyclic(4), array size 1M integersThe Asynchronous Method is found to perform better in all cases.3.5.3 General CaseLet us consider the general case of a cyclic(x) to cyclic(y) redistribution in which thereis no particular relation between x and y (Figure 3.15). One algorithm for doing thisis to explicitly calculate the destination and source processor of each element of thelocal array, using the formulae given in Table 3.1. We call this General Algorithmand is described below.General AlgorithmIn the send phase, the destination processor of each element of the local array can bedetermined by rst calculating its global index based on the current distribution andthen the destination processor based on the new distribution. These two calculationscan be combined into a single expression to give the destination processor of elementi of the local array as pd =MOD[fMOD(i 1; x)+(P ((i 1)=x)+p)x+yg=y 1; P ].Similarly in the receive phase, the source processor of each element of the local array
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 47
0
50
100
150
200
250
300
1 10 100 1000
T
i
m
e
 
(
m
s
)
Array Size (Kbytes)
Synchronous Method
Asynchronous Method
Figure 3.14: X TO KX Algorithm, cyclic(2) to cyclic(4), 64 processorsp0 p1 p21 2 3 10 11 12 4 5 6 13 14 15 7 8 9 16 17 18 Cyclic(3)p0 p1 p21 2 7 8 13 14 3 4 9 10 15 16 5 6 11 12 17 18 Cyclic(2)Figure 3.15: General Cyclic(x) to Cyclic(y) Redistributioncan be determined by rst calculating its global index based on the new distributionand then the source processor based on the old distribution. These two calculationscan be combined into a single expression to give the source processor of element i ofthe local array as ps =MOD[fMOD(i   1; y) + (P ((i  1)=y) + p)y + xg=x  1; P ].The drawback of this algorithm is that calculations are needed individually for allelements of the array. We propose two algorithms called the GCD Method and theLCM Method, which make use of the optimized KY TO Y and X TO KX algorithmsand hence require a lot less calculations.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 48GCD Method1. Let m = GCD(x; y).2. Redistribute from cyclic(x) to cyclic(m)using the KY TO Y Algorithm.3. Redistribute from cyclic(m) to cyclic(y)using the X TO KX Algorithm. LCM Method1. Let m = LCM(x; y).2. Redistribute from cyclic(x) to cyclic(m)using the X TO KX Algorithm.3. Redistribute from cyclic(m) to cyclic(y)using the KY TO Y Algorithm.Figure 3.16: GCD and LCM MethodsGCD MethodThis method takes advantage of the fact that we have developed special optimizedalgorithms for a cyclic(x) to cyclic(y) redistribution when x = k y (the KY TO YAlgorithm) and y = k x (the X TO KX Algorithm). In the GCD Method, the re-distribution is done as a sequence of two phases | cyclic(x) to cyclic(m) followedby cyclic(m) to cyclic(y), where m = GCD(x; y). Since both x and y are multiplesof m, the KY TO Y Algorithm can be used for the cyclic(x) to cyclic(m) phase andthe X TO KX Algorithm can be used for the cyclic(m) to cyclic(y) phase. This isdescribed in Figure 3.16. The GCD Method involves the cost of having to do two sep-arate redistributions. For small arrays, the increased communication may outweighthe savings in computation, but for large arrays in some cases, the performance isbetter than that of the General Method. This can be observed from Figure 3.17which shows the performance of a cyclic(15) to cyclic(10) redistribution, for an arrayof size 1M integers on the Delta. We see that for small number of processors, theGCD Method performs considerably better than the General Method because of thesaving in the amount of computation per processor. Since the size of the global arrayis kept constant, as the number of processors is increased, the size of the local arrayin each processor becomes smaller and each processor spends less time on addresscalculation. Hence the improvement of the GCD Method over the General Methodis also small.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 49
8 16 32 64 128
Processors
0.0
500.0
1000.0
1500.0
2000.0
2500.0
3000.0
Tim
e (
ms
)
General
GCD
LCM
Figure 3.17: GCD, LCM and General Methods; cyclic(15) to cyclic(10); 1M arrayOne disadvantage of the GCD Method is that in the intermediate cyclic(m) dis-tribution, the block size m is smaller than both x and y. In the KY TO Y andX TO KX algorithms, all the address and processor calculations are done for blocksof size x or y. Since m is the GCD of x and y, m can even be equal to 1 in some cases.When m = 1, calculations have to be done for each element, which is no better thanin the General Method. In this case the General Method performs better than theGCD Method. Figure 3.18 shows an example of cyclic(11) to cyclic(3) redistributionon the Delta for an array of size 1M integers. Since the GCD of 11 and 3 is 1, we ndthat the General Method always performs better than the GCD Method.LCM MethodThe key to getting good performance in this two-phase approach for redistribution isto have a large value for m. One way of ensuring that m is always large is by choosingm as the LCM of x and y. Since m is a multiple of both x and y, the X TO KX Algo-rithm can be used for the cyclic(x) to cyclic(m) phase and the KY TO Y algorithmcan be used for the cyclic(m) to cyclic(y) phase. This is described in Figure 3.16.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 50
8 16 32 64 128
Processors
0.0
500.0
1000.0
1500.0
2000.0
2500.0
3000.0
Tim
e (
ms
)
GCD
General
LCM
Figure 3.18: GCD, LCM and General Methods; cyclic(11) to cyclic(3); 1M arrayAlso, since m is larger than x and y, all calculations are done for this larger blocksize. This results in fewer calculations than in the GCD and General Methods. Fig-ures 3.17 and 3.18 compare the performance of the LCM, GCD and General Methodsfor an array of 1M integers on dierent number of processors. We observe that theLCM Method performs better in all cases. Figure 3.19 compares the performance ofthe LCM and General Methods for a cyclic(11) to cyclic(3) redistribution keeping thenumber of processors xed at 64 and varying the array size. We observe that for smallarrays, the General Method performs better because it has less communication, butfor large arrays the LCM method performs better because the saving in computationis higher than the increase in communication.3.6 Redistribution of Multidimensional ArraysThe redistribution of two and higher dimensional arrays can be classied into twotypes :-
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 51
0
100
200
300
400
500
600
700
1 10 100 1000
T
i
m
e
 
(
m
s
)
Array Size (Kbytes)
GEN
LCM
Figure 3.19: LCM v/s General Methods, cyclic(11) to cyclic(3), 64 processors1. Shape Retaining: The shape of the local array remains unchanged after theredistribution, eg. (block,block) to (cyclic,cyclic).2. Shape Changing: The shape of the local array changes after the redistribution,eg. (block,*) to (*,block) where '*' indicates that the corresponding dimensionis collapsed. This type of redistribution is used for example in 2D FFT and theADI method for solving multidimensional partial dierential equations.The shape retaining and shape changing redistributions are quite dierent from eachother and require dierent algorithms.3.6.1 Shape Retaining RedistributionA shape retaining redistribution may involve redistribution in only one dimensionfeg. (block,block) to (cyclic,block)g or more than one dimension feg. (block,block)to (cyclic,cyclic)g. If the redistribution is only along one dimension, it is similar tothe redistribution of one-dimensional arrays and the same algorithms described earliercan be used. If both dimensions have to be redistributed, it can either be done directly
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 52
0
50
100
150
200
250
300
350
400
450
500
0 20 40 60 80 100 120 140
T
i
m
e
 
(
m
s
)
Processors
Direct Method
Indirect Method
Figure 3.20: (block,block) to (cyclic,cyclic) Redistribution, 1K  1K arrayor indirectly as a series of one-dimensional redistributions. In the Indirect Method,the array is redistributed separately along each dimension. For example, if an arrayhas to be redistributed from (block,block) to (cyclic,cyclic), it is rst redistributed to(block,cyclic) and then to (cyclic,cyclic). This method has the advantage that all theoptimizations developed for one-dimensional arrays in the previous sections can beeasily extended to multidimensional arrays. The order in which the dimensions areredistributed does not aect the performance. This is because the order of dimensionschosen only results in a dierent set of data values being communicated, and doesnot aect the amount or type of communication. So, one could also do the aboveredistribution as (block,block) to (cyclic,block) to (cyclic,cyclic).In the Direct Method, data is sent directly to the destination processor corre-sponding to the new distribution. Hence the optimized algorithms developed forthe one-dimensional case cannot be used. This method requires dierent algorithmsfor dierent number of dimensions and dierent types of redistributions, and thesealgorithms cannot be optimized much. However, data needs to be communicated
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 53only once in the Direct Method. Figure 3.20 compares the performance of the Di-rect and Indirect Methods for redistributing an array of size 1K  1K integers from(block,block) to (cyclic,cyclic) on the Touchstone Delta. The Indirect Method is foundto perform much better even though data is communicated twice. This is becausethe algorithms for one-dimensional redistribution are highly optimized and a lot ofthe communication during each one-dimensional redistribution actually takes placein parallel.Another interesting observation comes from Figure 3.21 which compares the per-formance of the Direct and Indirect Methods for a (cyclic(11),cyclic(11)) to(cyclic(3),cyclic(3)) redistribution of a 1K  1K array. The indirect redistributionis done in two ways | using the General Method and the LCM Method for eachcyclic(11) to cyclic(3) redistribution. We nd that the General Method performsbetter than the LCM Method. This is because in the General Method, for eachone-dimensional redistribution, the destination and source processors need to be cal-culated for each row (or column) of the array and the entire row (or column) has tobe communicated. In the LCM Method, this communication needs to be done twice.Since the amount of communication is much higher than the amount of computation,the General Method performs better. In the Direct Method, destination and sourceprocessor calculations have to be done for each element of the local array. If the localarray size is L  L, L2 calculations have to be done. In the Indirect Method, calcu-lations are done for each column during the column redistribution and for each rowduring the row redistribution. Hence the number of calculations is only L+ L = 2L.Therefore, the Direct Method performs considerably worse than any of the IndirectMethods in this case.3.6.2 Shape Changing RedistributionThis type of redistribution occurs when at least one dimension of the array is collapsedbefore or after the redistribution. Consider the redistribution from (X,*) to (*,Y) orvice-versa, where X and Y can be either block, cyclic or cyclic(m). This is basically a
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 54
8 16 32 64 128
Processors
0.0
500.0
1000.0
1500.0
2000.0
2500.0
3000.0
Tim
e (
ms
)
Direct
Indirect LCM
Indirect General
Figure 3.21: (cyclic(11),cyclic(11)) to (cyclic(3),cyclic(3)) Redist., 1K  1K arraycollapsed to distributed type of redistribution in one of the dimensions, which is doneas follows. Each processor divides the local array into packets along the collapseddimension, depending on the type of the new distribution Y. The processors thenexchange packets with other processors. At the receiving end, packets are placed inthe local array in order of source processor number. Data from the received packetsmay have to be placed in the local array either contiguously or with a stride, dependingon the type of distributions X and Y.The other type of redistribution involving change of shape of the local array is ofthe type (X,*) or (*,X) to (Y,Z), or vice versa. That is, in either the source or targetdistributions, one of the dimensions is collapsed and in the corresponding target orsource distributions, both dimensions are distributed. Each case of this type requiresa dierent algorithm and so has to be considered separately. Note that we do not usethe Indirect Method for shape changing redistributions.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 553.7 Circular RedistributionVery often, it is necessary to redistribute from a distribution X to a distribution Yand then sometime later in the program it is required to redistribute back from Y toX. We call this X ! Y ! X type redistribution as a circular redistribution and isdenoted as X $ Y . The redistribution fromX to Y is called a forward redistributionand from Y to X is called a backward redistribution. For example when the mainprogram calls a subroutine with a dierent distribution, it is necessary to redistributefrom say X to Y . After returning from the subroutine, it is necessary to redistributeback to X to restore the original distribution.In the case of a circular redistribution, the backward redistribution is essentiallythe reverse of the forward redistribution. The calculation of the destination processorsand addresses done during the forward redistribution can be saved and reused in thebackward redistribution. Thus, no calculations need be done during the backwardredistribution. This decrease in computation is at the cost of increased memory forsaving the information calculated in the forward redistribution.3.7.1 Circular Block(m) $ Cyclic RedistributionWe rst consider a circular block(m)$cyclic redistribution. In the forwardblock(m)!cyclic redistribution, each processor calculates the destination processorto which the rst element of the local array is to be sent. This information can besaved (A). In the receive phase, each processor calculates how much data is to bereceived from other processors. This information can also be saved (B). In the back-ward cyclic!block(m) redistribution, Information (B) is sucient for each processorto know how many elements to send to other processors. Information (A) is sucientfor the receiving processor to know where to store incoming data. Thus, no calcula-tions need to be done in the backward redistribution phase. However, this saving incomputation is at the cost of increased memory requirements. Information (A) canbe stored in a single variable. Information (B) requires an array of length equal tothe number of processors.
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 563.7.2 Circular Cyclic $ Block(m) RedistributionThis is essentially the reverse of a circular block(m) $ cyclic redistribution. Inthe send phase of the forward cyclic!block(m) redistribution, each processor storesthe set of destination processors and the number of elements sent to them (A). Inthe receive phase, each processor stores the source processor of the rst element ofthe local array (B). In the backward block(m)!cyclic redistribution, Information(B) is sucient to carry out the send phase and Information (A) can be used bythe receiving processor to know how many elements are to be received from whichprocessor. Information (B) can be stored in a single variable. Information (A) requiresan array of length equal to the number of processors.3.7.3 Circular Cyclic(x) $ Cyclic(y) RedistributionLet us rst consider the special case of a circular cyclic(x)$ cyclic(y) redistributionwhere x = k y. In the forward cyclic(x)!cyclic(y) redistribution, each processorcalculates the sequence of processors to which sub-blocks of size y have to be sent.This information is stored (A). In the receive phase, each processor calculates thesource processors for blocks of size y of the local array. This information is also stored(B). In the backward cyclic(y)!cyclic(x) redistribution where x = k y, information(B) can be used in the send phase to determine the sequence of processors to sendblocks of size y. In the receive phase, information (A) can be used to determine thesource processors for blocks of size y of the local array. Thus, no calculations need tobe done in the backward redistribution phase. In the forward redistribution phase,since the sequence of processors to send sub-blocks of size y remains the same forevery block of size x, information (A) can be stored in an array of size x=y = k.Similarly, information (B) also requires an array of size k.The other special case y = k x is similar.For the general case where there is no particular relation between x and y, if theLCM or GCD Methods are used, information can be stored and reused in the same
CHAPTER 3. RUNTIME SUPPORT FOR ARRAY REDISTRIBUTION 57
400
600
800
1000
1200
1400
1600
1800
2000
2200
2400
0 20 40 60 80 100 120 140
T
i
m
e
 
(
m
s
)
Processors
With Reuse
Without Reuse
Figure 3.22: Circular cyclic(11) to cyclic(3) Redistribution, array size 1M integersmanner as described above for the x = k y case.If the General Method is used, then in the send phase of the forward redistribution,each processor calculates the destination processor of each element of the local array.This information can be stored (A). In the receive phase, each processor calculatesthe source processor of each element of the local array. This can also be stored (B).For the backward cyclic(y)!cyclic(x) redistribution, Information (B) gives the set ofdestination processors and Information (A) gives the set of source processors. Thissaves a lot of computation in the backward redistribution, since it would otherwise berequired to compute the source and destination processors for each individual element.However the amount of storage required is twice the size of the local array. Figure 3.22compares the performance of a circular cyclic(11)$ cyclic(3) redistribution with andwithout saving any information in the forward redistribution, for an array of size1M integers. We observe that there is considerable improvement in performance byreusing the information.
Chapter 4Runtime Support for All-to-AllCollective CommunicationPrograms on distributed memory parallel computers typically require interprocessorcommunication. In loosely synchronous parallel programs[46], all processors performsimilar operations but on dierent data sets. Hence it is very likely that a group ofprocessors or even all processors may need to perform communication at the sametime. This makes it possible for processors to cooperate with each other to performcommunication eciently, which is known as collective communication. Depending onthe type of communication required, dierent communication patterns are possible.These can be generally classied into the following types:- All-to-all: All processors need to send data to all other processors. All-to-many: All processors need to send data to some other processors. Many-to-all: Some processors need to send data to all other processors. Many-to-many: Some processors need to send data to some other processors.Ecient algorithms are needed to implement these communication patterns ondierent network topologies. In this chapter, we consider the all-to-all communicationpattern in detail. The other communication patterns, namely all-to-many, many-to-all and many-to-many, have been studied in [100, 127, 106, 107, 108].58
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 59The all-to-all communication pattern (also called complete exchange1) occurs fre-quently in many important parallel computing applications such as array redistri-bution, parallel quicksort, some implementations of two-dimensional Fast FourierTransform, matrix transpose etc. It is the densest form of communication because allprocessors need to communicate with all other processors. This can result in severelink contention and degrade performance considerably. Hence, it is necessary to useecient algorithms in order to get good performance over a wide range of messagesizes and number of processors.We describe several algorithms for all-to-all communication on fat-tree and two-dimensional mesh interconnection topologies. We use the Thinking Machines CM-5as an example machine with a fat tree topology and the Intel Touchstone Delta asan example machine with a two-dimensional mesh topology. Since these machineshave dierent architectures and communication capabilities, dierent algorithms areneeded to get the best performance on each of them. We present four algorithms forthe CM-5 and six algorithms for the Delta. Extensive performance results on the CM-5 and Delta are presented and analyzed. We have also developed analytical models toestimate the performance of the algorithms on the basis of system parameters. Theanalytical models are validated by comparing with experimental results.4.1 Architecture and Communication Issues4.1.1 CM-5The CM-5 is a scalable distributed memorymultiprocessor system which can be scaledup to 16K processors. It supports both SIMD and MIMD programming models. Eachnode on the CM-5 is a SPARC processor and has four optional vector processors.The CM-5 has two internal networks that support interprocessor communication |the Control Network and the Data Network. The control network is responsible forcommunication patterns in which many processors may be involved in the processing1We use the words \all-to-all communication" and \complete exchange" interchangeably in thischapter.
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 60
Processors
Switches
SwitchesFigure 4.1: CM-5 fat treeof each datum, such as global reduction operations, parallel prex operations andprocessor synchronization. The data network supports point-to-point communicationand has a fat tree topology [76, 77] as shown in Figures 4.1 and 4.2. It is actually a4-ary fat tree or quad tree, where each node has four children. Each internal nodeof the fat tree is implemented as a set of switches. The number of switches pernode depends on where it is in the tree. Nodes at level 1 have two switches. Thenumber of switches per node doubles for each higher level till level 3, from whereon it quadruples. Each level 1 or level 2 switch has two parents and four children;switches at higher levels have four parents and four children. The data network hasa guaranteed system-wide bandwidth of 5 Mbytes/sec. Messages are divided intopackets of size 20 bytes of which 4 bytes is the header. The routing algorithm for theData Network compares the destination address with the source address to determinehow far up the tree the message must travel. The message can then take any pathup the tree. This allows the switches to perform load balancing on the y. Oncethe message has reached the necessary height in the tree, it must follow a particularpath down. A detailed discussion of the interprocessor communication overhead onthe CM-5 is given in [96, 16, 94].
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 61
Figure 4.2: CM-5 Data Network with 64 Processing Nodes4.1.2 Touchstone DeltaThe Intel Touchstone Delta is a 16 32 mesh of computational nodes, each of whichis an Intel i860/XR microprocessor. The two-dimensional mesh interconnection net-work has bidirectional links and uses wormhole routing. The links have a maximumbandwidth of 10 Mbytes/sec in each direction. Messages are divided into packets ofsize 512 bytes of which 32 bytes is the header. It uses deterministic XY routing inwhich packets are rst sent along the X dimension and then along the Y dimension.In other words, at most one turn is allowed, and that turn must be from the X di-mension to the Y dimension. For a 2D mesh, the XY routing algorithm is guaranteedto be deadlock free [33].In wormhole routing, a packet is divided into a number of its (ow control digits)for transmission. The size of a it is typically the same as the channel width. Theheader it of a packet determines the route. As the header advances along the route,the remaining its follow in a pipeline fashion. If the header it encounters a channelalready in use, it is blocked until the channel becomes available. The ow controlwithin the network blocks the trailing its and they remain in it buers along the
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 62established route. Once a channel has been acquired by a packet, it is reserved for thepacket. The channel is released when the last it has been transmitted on the channel.The pipelined nature of wormhole routing makes the communication latency almostinsensitive to path length in the absence of network contention. The network latencyfor wormhole routing is (Lf=B)D+L=B, where Lf is the length of each it, B is thechannel bandwidth, D is the path length, and L is the length of the message [87].Thus, if Lf << L, the path length D will not signicantly aect the network latencyunless it is very large. Further details of wormhole routing can be found in [87].4.1.3 Performance ModelsBarnett et. al. [3] have proposed algorithms and performance models for global com-bine operations on a wormhole routed mesh. We use similar models for our all-to-allcommunication algorithms, which take into account link conicts and other charac-teristics of the underlying communication system. The following notations are usedin our models :- startup time per messageex transfer time per byte for an exchange with no link conictssr transfer time per byte to send to and receive from dierent processorswith no link conictssat transfer time per byte on a saturated links transfer time per byte for a single send-recv with no link conictsL number of bytes to be exchanged per processor pairf(i) maximum number of messages contending for a saturated link at step iP total number of processorsThe time taken for an exchange operation may be dierent from the time to sendto and receive from dierent processors, because in the latter case the incoming andoutgoing messages may traverse links with dierent amount of contention. Hence,we use ex or sr depending on the algorithm. We assume that the time taken isindependent of distance, a property of both CM-5 and Delta [95, 3]. Thus, the time
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 63required for an exchange step i is given byT = + L max(ex; f(i)sat)We assume that conicting messages share the bandwidth of a network link. Thenetwork may have excess bandwidth, enabling multiple messages to traverse a link inthe same direction without conict. In other words, sat < ex; sat < sr; sat < s.Also, in any expression in this chapter, the division of two integer variables shouldbe considered as integer division, ie. 5=2 = 2.4.2 All-to-All Communication on a Fat TreeIn this section, we describe four algorithms for all-to-all communication on the fattree topology of the CM-5.4.2.1 Linear Exchange (LEX)The Linear Exchange algorithm is the simplest of the four algorithms. In step i,0  i < P , processor i receives messages from every processor except itself. Thealgorithm clearly requires P steps. Since at every step one processor receives fromall other processors, there is a lot of link contention. At step i, every processor sendsdata to processor i. Processor i has two links to its parent node and P   1 processorssimultaneously need to use these links. Hence, the maximum number of messagescontending for a link at any step is dP 12 e. The time taken for any step i isT (i) = + Lmax(s; dP   12 esat)The cost of LEX is obtained by summing over all steps of the algorithm :TLEX = P 1Xi=1 [+Lmax(s; dP   12 esat)] = (P  1)+L(P  1)max(s; dP   12 esat)
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 64do i=1, P   1destination = xor(mynumber, i)Exchange with destinationend doFigure 4.3: Pairwise Exchange (PEX)Table 4.1: Communication Schedule for PEX on 8 ProcessorsStep 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 70$ 1 0$ 2 0$ 3 0$ 4 0$ 5 0$ 6 0$ 72$ 3 1$ 3 1$ 2 1$ 5 1$ 4 1$ 7 1$ 64$ 5 4$ 6 4$ 7 2$ 6 2$ 7 2$ 4 2$ 56$ 7 5$ 7 5$ 6 3$ 7 3$ 6 3$ 5 3$ 44.2.2 Pairwise Exchange (PEX)We consider the Pairwise Exchange (PEX) algorithm which has been shown to be thebest algorithm for a hypercube network, on which it guarantees no link contention atany step [6, 103]. The algorithm is described in Figure 4.3. It requires P  1 steps andthe communication schedule is as follows. At step i, 1  i  P   1, each processorexchanges a message with the processor determined by taking the exclusive-or of itsprocessor number with i. Therefore, this algorithm has the property that the entirecommunication pattern is decomposed into a sequence of pairwise exchanges. Thecommunication schedule of the pairwise exchange algorithm for 8 processors is givenin Table 4.1. The entry i$ j in the table indicates that processors i and j exchangemessages.The time taken by this algorithm on the CM-5 can be estimated as follows. Whena processor has to communicate with another processor in its cluster of 4k processors,k  1, the message has to travel a maximum of k levels up the tree. When a cluster
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 65of 16 processors exchange among themselves, the messages have to travel either 1or 2 levels up the tree depending on the source and destination. There are 32 linksfrom level 0 to level 1 and 16 links from level 1 to 2 for this cluster, enough forthe 16 processors to exchange among themselves without contention. However, whenprocessors in a cluster of 16 need to exchange with processors in another cluster of16, there are only 8 links from a level 2 node to a level 3 node, which results in 16processors contending for 8 links. A similar bottleneck exists at higher levels. Forexample, a level 3 node has 32 links upwards and downwards, and 64 processors inits subtree.The communication schedule of PEX is such that in the rst 15 steps, processorsexchange completely within a cluster of 16 processors and after that they exchangeacross clusters. Hence, in the rst 15 steps there is no contention. From step 16onwards, there are a maximum of 2 messages contending for a link. The time takenfor step i is given byT (i) = ( + Lex for 1  i  15+ Lmax(ex; 2sat) for i > 15The time for the entire PEX algorithm can be obtained by summing over all stepsTPEX = 15Xi=1[+ Lex] + P 1Xi=16[+ Lmax(ex; 2sat)]which can be simplied toTPEX = (P   1)+ L [15ex + (P   16)max(ex; 2sat)]For a complete exchange on 16 processors, this algorithm has no contention.4.2.3 Recursive Exchange (REX)The Recursive Exchange algorithm is described in Figure 4.4. The number of proces-sors is halved in each step and each processor exchanges data with the corresponding
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 66Table 4.2: Communication Schedule for REX on 8 processorsStep 1 Step 2 Step 30$ 4 0$ 2 0$ 11$ 5 1$ 3 2$ 32$ 6 4$ 6 4$ 53$ 7 5$ 7 6$ 7processor in the other half. A processor sends all the data intended for all processorsin the other half to only one processor in that half, which forwards that data to theremaining processors in a later step. The number of steps required is lgP and eachmessage is of size L P=2. The communication schedule for REX on 8 processors isgiven in Table 4.2. Although this algorithm takes less number of steps than LEX andPEX, the amount of data transmitted in each step is much higher. Since it is a store-and-forward type algorithm, each step incurs the additional overhead of reshuingdataIn step i, 1  i  lgP each processor j exchanges with processor j  P2i . Commu-nication always takes place either entirely within a cluster of 16 processors or entirelyacross clusters. In steps 1 to lgP   4, communication takes place across clusters,so the maximum number of messages contending for a link is 2. In steps lgP   3to lgP , communication takes place within a cluster of 16 processors, so there is nocontention. Hence, the time taken by REX isTREX = lgP 4Xi=1 [+ L P2 max(ex; 2sat)] + lgPXi=lg P 3[+ L P2 ex]which can be simplied toTREX =  lgP + L P2 [4ex + (lgP   4)max(ex; 2sat)]
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 67size = Ppos = 0bytes = L P=2do i=1, lgPsize = size/2if (mynumber < (size + pos)) thendest = mynumber + sizeelse dest = mynumber { sizepos = pos + sizeend ifexchange message of size \bytes" with destend doFigure 4.4: Recursive Exchange (REX) on CM-54.2.4 Balanced Exchange (BEX)In the PEX algorithm, the communication schedule is such that all processors in acluster rst exchange completely with each other and then exchange with processorsin other clusters. In other words, all the communication is either entirely within thecluster or entirely across clusters. As explained above, this gives rise to contentionfrom step 16 onwards. An improvement in performance can be expected if there isa balance of local and long distance communication at every step, which will reducecontention in step 16 onwards. The Balanced Exchange (BEX) algorithm providessuch a schedule. BEX is a simple modication of PEX and is described in Figure 4.5.For the purpose of determining the communicating pairs of processors, we dene amapping between the physical number of a processor and its virtual number asvirtual no = MOD(physical no + 1, P)Balanced exchange consists of using the pairwise exchange algorithm with this map-ping and the virtual processor numbers. The communication schedule for BEX isgiven in Table 4.3.
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 68virtual = MOD(mynumber + 1, P )do j=1, P   1dest = xor(virtual, j) { 1if (dest == {1)dest = P   1end ifexchange with destend doFigure 4.5: Balanced Exchange (BEX)The BEX algorithm has the property that in steps 0 to P=2   1, two processorsin each cluster of size P=2 communicate across clusters while the rest communicatewithin the cluster. In steps P=2 to P   1, two processors in each cluster of sizeP=2 communicate within the cluster, while the other processors communicate acrossclusters. In steps 0 to 15, there is no contention as in the PEX algorithm. In stepi > 15, instead of 2blg ic processors contending for 2blg ic 1 links, there are (2blg ic   2)processors contending for 2blg ic 1 links. Hence the maximum contention for any linkat step i > 15 is 2blg ic 22(blg ic 1) on an average. The total time taken by BEX isTBEX = 15Xi=1[+ Lex] + P 1Xi=16[+ Lmax(ex; 2blg ic   22(blg ic 1) sat)]which can be simplied toTBEX = (P   1) + L [15ex + (P   16)max(ex; 2blg ic   22(blg ic 1) sat)]4.2.5 Performance of Algorithms on the CM-5We have implemented the above algorithms on the CM-5 using the message passinglibrary CMMD Version 3.0 Beta. Figure 4.6 compares the communication time of
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 69Table 4.3: Communication Schedule for BEX on 8 processorsStep 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 70$ 7 0$ 2 0$ 1 0$ 4 0$ 3 0$ 6 0$ 51$ 2 1$ 7 2$ 7 1$ 5 1$ 6 1$ 3 1$ 43$ 4 3$ 5 3$ 6 2$ 6 2$ 5 2$ 4 2$ 35$ 6 4$ 6 4$ 5 3$ 7 4$ 7 5$ 7 6$ 7the four algorithms on a 32 node CM-5 with message size varied between 0 and 2048bytes. As expected, LEX performs much worse than the other algorithms, so wedo not consider it any further. For small message sizes, the performance of PEX,REX and BEX is virtually indistinguishable. However, for large message sizes, PEXperforms much better than REX and BEX performs better than PEX. This is becauseof the following reasons. First, even though the number of steps in REX is only lgP ,as compared to P   1 steps in PEX, the message size in REX remains constant atL  P=2, whereas the size of each message in PEX is L. Also, REX uses a store-and-forward approach in which a message is sent from source to destination processorthrough one or more intermediate processors. Sending a message from source todestination through k intermediate processors costs k times more than sending itdirectly. In addition, each node needs to buer and reshue data in REX so thatappropriate data can be sent to the appropriate node. These two overheads outweighthe savings in the number of communication steps. BEX performs the best becauseit maintains a balance of local and remote communication at each step.Figures 4.7 and 4.8 show the performance of the algorithms on up to 256 proces-sors for message size 512 bytes and 2 Kbytes respectively. PEX and BEX performbetter than REX for small number of processors because the overhead of messagesize and number of steps dominate for REX. As the number of processors increases,the overhead of the larger number of messages in PEX and BEX is higher than theoverhead of larger message size and reshuing in REX, and therefore, REX performsbetter.
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 70
00.050.10.150.20.250.3
0.350.4
0 500 1000 1500 2000 2500Time(Secs.) Bytes
LEXPEXREXBEX
Figure 4.6: Performance on 32 node CM-5 for dierent message sizes
00.050.10.15
0.20.25
0 50 100 150 200 250 300Time(Secs.) Number of Processors
REX 3333 3 3 3 3PEX ++++ + + +
+BEX 2222 2 2 2 2Figure 4.7: Performance on CM-5 for message size 512 bytes
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 71
00.10.20.30.40.5
0.60.7
0 50 100 150 200 250 300Time(Secs.) Number of ProcessorsREX 3333 3 3 3
3PEX ++++ + + +
+BEX 2222 2 2 2
2
Figure 4.8: Performance on CM-5 for message size 2 Kbytes4.3 All-to-All Communication on a 2D MeshThis section describes algorithms for all-to-all communication on the two-dimensionalmesh topology of the Intel Touchstone Delta. A mesh is a low-dimension high-bandwidth network which performs well when there is no link contention. However, adense communication pattern like complete exchange results in a lot of link contentionwhich can degrade performance considerably. Also, the algorithms discussed abovefor the CM-5 and the existing algorithms for a hypercube assume that the number ofprocessors is a power-of-two. This is a valid assumption for hypercube and fat-treearchitectures because the number of processors is always a power-of-two. However,on the Delta, the user can allocate a mesh which may not be a power-of-two and mayeven be an odd number (eg. 5 4 or 3 3). So, it is necessary to develop algorithmswhich work even on non power-of-two meshes.Bokhari and Berryman describe two algorithms for a circuit-switched mesh, whichassume that the number of processors is a power-of-two [7]. Scott has shown that a3=4
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 72is the lower bound on the number of phases required to perform complete exchangeon an aa mesh such that there is no link contention in any phase [103]. However, ifwe allow link contention to exist, the operation can be performed in fewer steps. Wehave adopted this approach of allowing a small amount of link contention to exist,thereby reducing the number of steps and keeping all processors active at every step.This approach also takes advantage of the fact that the communication links in theDelta have excess bandwidth [3], so that a small number of contending messages willnot aect the communication time.We consider six algorithms on the Delta, for power-of-two and non-power-of-twomeshes. For the analysis of the algorithms, we assume that the mesh has r rows andc columns. Hence, P = r  c.4.3.1 Pairwise Exchange for Power-Of-Two Mesh (PEX)The Pairwise Exchange algorithm described earlier for the CM-5 can also be used onthe Delta without any modication, as long as the number of processors is a powerof two. However, since the mesh architecture is dierent from a fat tree architecture,the link contention caused by this algorithm on the Delta is dierent from that on afat tree. Figure 4.9 shows the communication pattern of PEX on a 2 4 mesh. Thecomplete exchange requires seven steps. In steps 1, 4 and 5 there is no contention.In steps 2, 3, 6 and 7, the maximum number of messages contending for a link is 2.Messages traveling in opposite directions on a link do not contend, because the linkson the Delta are bidirectional.Since each step of the algorithm involves an exchange between pairs of processors,the maximum number of messages contending for a link at any step is limited bymax(r; c)=2. An exact expression for the maximum number of messages contendingfor a link at step i is f(i) = 2blgfmax(MOD(i;c);i=c)gcThis expression was obtained empirically. We studied the communication pattern ofPEX for a large number of processors and mesh congurations. We found that there
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 73
0 1 2 3
5 6 74
0 1 2 3
5 6 74
0 1 2 3
5 6 74
0 1 2 3
5 6 74
0 1 2 3
5 6 74
0 1 2 3
5 6 74
0 1 2 3
5 6 74
*
*
*
*
*
*
*
*
Step 1,  No Contention Step 2,  Contention Step 3,  Contention Step 4,   No Contention
Step 5,  No Contention Step 6,   Contention Step 7,   Contention
arrows indicate communication pattern*  =  2  contending messagesFigure 4.9: PEX on 2 4 meshis a relation between the step number i, the shape of the mesh and the maximum linkcontention in that step. That relation is given by the above expression for f(i).The time taken for step i isT (i) = + Lmax(ex; f(i)sat)The cost of PEX can be determined by summing over all steps of the algorithm :TPEX = P 1Xi=1 [+ Lmax(ex; f(i)sat)] = (P   1) + L P 1Xi=1 max(ex; f(i)sat)If the number of processors is not a power-of-two, the exclusive-or function doesnot create all the processor pairs in P   1 steps, so the algorithm is not directlyapplicable.
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 74q = 2dlgP edo j=1, q   1destination = xor(mynumber, j)if (destination < P ) thenExchange with destinationend ifend doFigure 4.10: Pairwise Exchange for General Mesh (PEX-GEN)4.3.2 Pairwise Exchange for General Mesh (PEX-GEN)We have extended the basic pairwise exchange algorithm, so that it works even whenthe number of processors is not a power-of-two. We call this algorithm Pairwise Ex-change for General Mesh (PEX-GEN) and is described in Figure 4.10. The algorithmrst nds the smallest power-of-two (say q) greater than the number of processorsand uses this number to schedule q   1 steps of the pairwise exchange. In each step,every processor checks to see if the calculated destination processor number is lessthan the actual number of processors. If so, it exchanges data with the processor,else it goes ahead to the next step. Thus, the algorithm requires q   1 steps whereq is the nearest power-of-two larger than the number of processors. Clearly, the al-gorithm takes more steps than necessary and many processors remain idle in severalof the steps. However, this reduces the link contention in each step. The maximumcontention in each step is upper bounded by that in the PEX algorithm.4.3.3 PEX-GEN with Shift (PEX-GEN-SHIFT)The motivation for the Pairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT) algorithm can be explained with the help of Figure 4.11(a). Assume that theuser has allocated a mesh of 20 processors which may be organized in any way (ie 45,
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 75
20 processors
0 15 31
0 15 3119
20 processors
6 25
(b)  Processor numbers shifted
(a)   20 processors allocated
Figure 4.11: Processor Shiftor 210 etc.). The processor numbers will be 0 to 19. The nearest power of two largerthan 20 is 32. The PEX-GEN algorithm will require 31 steps. The processor pairscreated by the exclusive-or function are such that in steps 1 through 15, processors0 to 15 exchange completely among themselves and do not communicate with anyprocessor in the range 16 to 31. Similarly, processors 16 to 19 exchange completelyamong themselves and do not communicate with any processor in the range 0 to 15.In steps 16 through 31, the processors in one half (0 | 15) exchange with processorsin the second half (16 | 31). Since there are only 4 processors in the second half,several of the processors in the rst half do not do any communication. Thus thecommunication pattern is such that in the rst 15 steps, all the processors in the rsthalf are active and in the next 16 steps, several of them are inactive. Since each stepinvolves communication with link contention, there is high link contention in steps 1| 15 and very little or no link contention in steps 16 | 31. In general, if there areP processors where P is not a power of two, PEX-GEN will require q 1 steps whereq = 2dlgP e. In the rst b(q  1)=2c steps, the rst q=2 processors are active and in theremaining steps, several of them are inactive.A better algorithm is one which balances the contention such that all steps havemore or less equal contention and equal number of inactive processors. This can be
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 76q = 2dlgP eshift = (q   P )/2myvirtual = MOD(mynumber + shift, P )do j=1, q   1virtual destination = xor(myvirtual, j)destination = virtual destination { shiftif (destination < 0) thendestination = destination + qend ifif (destination < P ) thenExchange with destinationend ifend doFigure 4.12: Pairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT)achieved by dening virtual processor numbers such that the real processors 0 |19 are numbered 6 | 25 as shown in Figure 4.11(b). The processor numbers areshifted by an amount equal to half the absolute dierence between the number ofprocessors and the nearest power of two. In other words, in the range 0 | 31, theactual processors are numbered 6 | 25, and there are no processors numbered 0| 5 and 26 | 31. Thus the empty space which earlier existed only in the half 16| 31 is now equally divided among the two halves. So, even in the rst 15 stepsof the algorithm, there are equal number of idle processors in both halves, whichbalances the contention among all the steps of the algorithm. We call this algorithmPairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT) and is describedin Figure 4.12. This algorithm also takes q   1 steps where q is the smallest power-of-two larger than the number of processors. The maximum contention at each stepis upper bounded by that for the PEX algorithm.
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 77do j=1, P   1destination = MOD(mynumber + j, P )source = MOD(mynumber { j + P , P )send to destinationreceive from sourceend doFigure 4.13: General Algorithm for any Mesh (GEN)4.3.4 General Algorithm for any Mesh (GEN)The above algorithms require one less than a power of two number of steps, becausethey use the exclusive-or function to obtain processor pairs which exchange with eachother. For non power-of-two meshes, it would be advantageous to have an algorithmwhich requires only P   1 steps. Figure 4.13 describes such an algorithm, which wecall the General Algorithm for any Mesh (GEN), because it works for any number ofprocessors. In the GEN algorithm, processor pairs do not exchange with each other.Instead, at step i, a processor j sends data to processor MOD(j + i; P ) and receivesdata from processor MOD(j   i + P;P ). This algorithm requires only P   1 steps,for any value of P .The maximum contention at step i is obtained empirically asf(i) = min[MOD(i; c); c MOD(i; c)] +min[i=c; (P   i)=c]The total time for all steps is given by :TGEN = P 1Xi=1 [+ Lmax(sr; f(i)sat)] = (P   1) + L P 1Xi=1 max(sr; f(i)sat)
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 784.3.5 Indirect Pairwise Exchange (IPEX)The Indirect Pairwise Exchange (IPEX) algorithm aims at reducing link contentionin the direct Pairwise Exchange (PEX) algorithm. In IPEX, each processor commu-nicates only with the processors in its row and column. The algorithm is describedin Figure 4.14. Each exchange along a row is followed by a complete exchange alonga column. During the row exchange, each processor sends L r bytes of data to thedestination processor, out of which L(r   1) bytes are intended for other processorsin the same column as the destination processor. This is followed by a complete ex-change along the columns (involving messages of L bytes), in which the data receivedduring the row exchange is sent to the appropriate processors in the same column.This entire operation requires r(c   1) communication steps. Finally, an additionalcomplete exchange is required along the columns for processors to exchange their owndata directly with processors in the same column. In this phase, data is sent directlyfrom source to destination, requiring r   1 exchange steps. Hence, the total numberof steps required is r(c  1) + (r   1) = rc  1 = P   1.The maximum link contention at any step is the same as for pairwise exchangealong a row or column which is 2blg ic, where i is the step number along the row orcolumn. Hence, the total time required for IPEX is given by :TIPEX = c 1Xi=1[+ L r max(ex; 2blg icsat) + r 1Xj=1f+ Lmax(ex; 2blg jcsat)g]+ r 1Xi=1[+ Lmax(ex; 2blg icsat)]4.3.6 Recursive Exchange (REX)The Recursive Exchange algorithm is described in Figure 4.15. It is similar to thatfor the CM-5, except it is recursively applied to both dimensions of the mesh. Themesh is rst recursively halved in the x direction and messages are exchanged over
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 79x = my x-coordinatey = my y-coordinatedo i=1, c  1destx = xor(x, i)dest = y  c + destxexchange L r bytes with destdo j=1, r   1desty = xor(y, i)dest = destyc+ xexchange L bytes with destend doend dodo j=1, r   1desty = xor(y, i)dest = desty c+ xexchange L bytes with destend doFigure 4.14: Indirect Pairwise Exchange (IPEX)each cut. This takes lg c steps. The mesh is then recursively halved in the y directionand messages are exchanged over each cut, which takes lg r steps. Thus, the totalnumber of steps is lg c+lg r = lgP and the message size in each step is LP=2. Thisalgorithm also works only for power-of-two meshes, since the mesh is divided by twoin each step. REX has an indirect form of communication, in the sense that data issent from source processor to destination processor through one or more intermediateprocessors.Since the mesh is recursively divided by two and each processor in one partitioncommunicates with its mirror image in the other partition, the maximum number ofmessages contending for a link at step i isf(i) = ( c=2i for 1  i  lg cr2i lg c for lg c < i  lgP
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 80size = cpos = 0x = my x-coordinatey = my y-coordinatebytes = L P=2do i=1, lg csize = size/2if (x < (size + pos)) thendestx = x + sizeelse destx = x { sizepos = pos + sizeend ifdest = y  c + destxexchange message of size \bytes" with destend dosize = rpos = 0do i=1, lg rsize = size/2if (y < (size + pos)) thendesty = y + sizeelse desty = y { sizepos = pos + sizeend ifdest = destyc+ xexchange message of size \bytes" with destend doFigure 4.15: Recursive Exchange (REX) on Delta
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 81Table 4.4: Performance of PEX (time in sec.)Message Size Mesh Conguration(bytes) 4  4 8  8 16  8 8  16 16  16 16  32256 0.004 0.022 0.045 0.045 0.094 0.2031K 0.008 0.064 0.120 0.115 0.290 0.8604K 0.023 0.114 0.355 0.367 0.999 3.2188K 0.034 0.228 0.692 0.773 2.068 6.79416K 0.064 0.441 1.413 1.565 4.145 13.61The cost of REX can be obtained by summing over all steps of the algorithm :TREX = lgPXi=1[+ L P2 max(ex; f(i)sat)]which can be expanded toTREX =  lgP + L P2 lg cXi=1max(ex; c=2isat) + L P2 lgPXi=lg c+1max(ex; r2i lg csat)4.3.7 Performance of Algorithms on the DeltaWe implemented all the algorithms on the Intel Touchstone Delta and studied theirperformance for dierent mesh congurations and message sizes. As suggested in [80],we use forcedmessages (which provide higher bandwidth but also higher startup cost)if the message size is greater than or equal to 1.5 Kbytes and unforced messages ifthe message size is less than 1.5 Kbytes.The performance of PEX is shown in Table 4.4. The number of processors isvaried from 16 to 512 with message size varied from 256 bytes to 16 Kbytes. Messagesize refers to the amount of data communicated in each send and receive operation,so the total amount of data communicated increases as the number of processorsis increased. Hence, the time taken increases almost linearly with the number of
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 82Table 4.5: Performance of PEX-GEN (time in sec.)Message Size Mesh Conguration(bytes) 4  5 6  8 16  9 8  18 16  14 16  30256 0.008 0.019 0.085 0.090 0.092 0.2111K 0.017 0.038 0.191 0.230 0.270 0.8994K 0.037 0.091 0.576 0.830 0.977 3.5888K 0.073 0.174 1.188 1.743 2.007 7.61616K 0.138 0.333 2.403 3.480 4.056 15.82processors. In a mesh, the time taken depends not only on the number of processors,but also on the mesh conguration. The maximum contention in PEX is max(r; c)=2.Thus, for a xed number of processors, the time taken will be minimum for a squaremesh and maximum for a mesh which is a linear array.The performance of PEX-GEN is given in Table 4.5. We have chosen some meshsizes which are non power-of-two. We observe that for mesh sizes which are onlyslightly less than the nearest higher power-of-two, the performance is close to thatof PEX for that power-of-two. But, if the mesh size is only slightly higher than thenearest smaller power-of-two, the time taken is almost twice the time taken by PEXfor that power-of-two. For example, the time taken by PEX-GEN on a 16  9 meshis much higher than the time taken by PEX on a 16 8 mesh, but the time taken byPEX-GEN on a 16  14 mesh is very close to the time taken by PEX on a 16  16mesh. This is because of the dierence in the number of steps required. Anotherinteresting observation is that the time taken by PEX-GEN on a 16  30 mesh is infact higher than the time taken by PEX on a 16  32 mesh. This is because sincethe processors are numbered in row major order, a change in the number of columnsfrom a power-of-two to a non power-of-two, changes the communication pattern in themesh completely for an algorithm which uses the exclusive-or function to determineprocessor pairs. Hence, there is more contention in the 1630 case than in the 1632case.Table 4.6 shows the performance of PEX-GEN-SHIFT. In all cases it performs
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 83Table 4.6: Performance of PEX-GEN-SHIFT (time in sec.)Message Size Mesh Conguration(bytes) 4  5 6  8 16  9 8  18 16  14 16  30256 0.008 0.019 0.085 0.089 0.092 0.2111K 0.017 0.038 0.188 0.219 0.263 0.8944K 0.036 0.091 0.543 0.782 0.933 3.5268K 0.071 0.170 1.111 1.626 1.948 7.51516K 0.129 0.333 2.242 3.282 3.844 15.74no worse than PEX-GEN. In most cases, the improvement in performance is notsignicant.Table 4.7 gives the performance of GEN on a power-of-two mesh. GEN performsbetter than PEX for small message sizes and small number of processors. However,for large number of processors ( 64) and large message sizes (> 1 Kbytes) PEXperforms better. The GEN algorithm has a certain amount of asymmetry in thecommunication in the sense that each communication operation consists of a sendto one processor and a receive from some other processor. Thus, the incoming andoutgoing messages may traverse a dierent number of links with dierent amountsof contention, and the path which has the highest amount of contention adverselyaects the communication time. On the other hand, in the PEX algorithm, processorpairs exchange with each other at every step, so the incoming and outgoing messagestravel the same number of links with the same amount of contention.The performance of GEN on non power-of-two meshes is given in Table 4.8. GENreduces the number of steps from q   1 in PEX-GEN and PEX-GEN-SHIFT, whereq = 2dlgP e, to P   1. For small number of processors, PEX-GEN performs the bestand the improvement in performance is higher when q   P is large. However, ifq   P is small and the number of processors is large, the performance of PEX-GEN-SHIFT tends to that of PEX and and the performance of GEN tends to that for apower-of-two mesh. So in this case, PEX-GEN-SHIFT performs better than GEN.
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 84Table 4.7: Performance of GEN on power-of-two mesh (time in sec.)Message Size Mesh Conguration(bytes) 4  4 8  8 16  8 8  16 16  16 16  32256 0.004 0.016 0.042 0.042 0.089 0.2831K 0.008 0.042 0.123 0.132 0.346 1.2174K 0.018 0.145 0.461 0.464 1.220 3.9448K 0.037 0.290 0.933 0.927 2.511 8.00716K 0.069 0.576 1.947 1.884 5.052 16.15Table 4.8: Performance of GEN on non power-of-two mesh (time in sec.)Message Size Mesh Conguration(bytes) 4  5 6  8 16  9 8  18 16  14 16  30256 0.004 0.015 0.046 0.048 0.074 0.2461K 0.009 0.027 0.146 0.171 0.285 1.0694K 0.025 0.083 0.527 0.566 0.998 3.7068K 0.052 0.186 1.071 1.154 2.011 7.75216K 0.098 0.369 2.182 2.360 4.005 15.94We observe from Table 4.9 that REX does not perform well for any mesh cong-uration and message size even though it requires only lgP steps. There are severalreasons for this. First, there is a lot of link contention in each step. Second, the mes-sage size per step is increased to L P=2 instead of L in the other algorithms. Third,the indirect form of communication requires a lot of data buering and shuing inorder to send the appropriate data to the appropriate node. Also, this algorithm hashigh memory requirements because the large message size requires more memory pernode. For example, with the other algorithms we could run tests with message sizesup to 2M bytes, but with REX we could only go up to 16 Kbytes on 512 processors.Table 4.10 gives the performance of IPEX. For large meshes and large messagesizes, IPEX performs better than any of the direct algorithms. This is because in
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 85Table 4.9: Performance of REX on Delta (time in sec.)Message Size Mesh Conguration(bytes) 4  4 8  8 16  8 8  16 16  16 16  32256 0.004 0.022 0.056 0.056 0.140 0.3931K 0.011 0.082 0.222 0.221 0.552 1.5814K 0.029 0.212 0.615 0.600 1.566 4.7398K 0.059 0.415 1.216 1.203 3.151 9.39016K 0.114 0.950 2.512 2.381 6.279 18.75Table 4.10: Performance of IPEX (time in sec.)Message Size Mesh Conguration(bytes) 4  4 8  8 16  8 8  16 16  16 16  32256 0.005 0.023 0.050 0.055 0.115 0.2561K 0.010 0.059 0.144 0.145 0.334 0.8294K 0.026 0.156 0.430 0.397 0.974 2.7468K 0.049 0.294 0.823 0.755 3.151 5.49916K 0.089 0.573 1.642 1.756 3.877 10.89large meshes, direct algorithms result in a lot of link contention. The reduction incontention by IPEX is larger than the cost of sending messages indirectly, henceIPEX performs better. The message size and contention in each step in IPEX ismuch smaller than in REX. The additional memory required per node in IPEX is Lr,compared to LP=2 in REX. In small meshes, the cost of communicating indirectlyis higher than the saving in contention, so direct algorithms perform better.A comparison of the power-of-two algorithms on a 16  32 mesh for dierentmessage sizes is shown in Figure 4.16. We can see that for this mesh size IPEXperforms the best, for reasons explained above. This is followed by PEX and GEN.REX always performs the worst in spite of its lgP steps. The relative performanceof non power-of-two algorithms on a 16  9 mesh is shown in Figure 4.17. For this
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 86
02468101214
161820
0 2000 4000 6000 8000 10000 12000 14000 16000 18000Time (s) Message Size
PEX 3
33 3 3 3
GEN +
+ + + + +REX 22 2 2 2
2IPEX    Figure 4.16: Performance of algorithms on a 16  32 meshmesh size, GEN clearly performs the best. For messages up to 2 Kbytes, the per-formance of PEX-GEN and PEX-GEN-SHIFT is almost indistinguishable. But forlarger messages, PEX-GEN-SHIFT performs better than PEX-GEN.The performance of the power-of-two algorithms for dierent mesh sizes keepingthe message size constant at 16 Kbytes is shown in Figure 4.18. The correspondinggraph for a message size of 1 Kbytes is shown in Figure 4.19. We observe that fora large message size of 16 Kbytes, PEX performs the best for meshes with up to200 processors. For larger meshes, IPEX performs the best. For a message size of1 Kbytes, we see that GEN performs the best on meshes with up to 125 processors.When the number of processors is between 125 and 420, PEX performs the best andfor larger systems, IPEX performs the best.4.3.8 Model ValidationWe have validated the models developed to predict the performance of the algorithmsby comparing the predicted times with the actual times observed on the Delta. For
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 87
00.511.5
22.5
0 2000 4000 6000 8000 10000 12000 14000 16000 18000Time (s) Message Size
PEX-GEN 333 3 3
3GEN ++ + + +
+PEX-GEN-SHIFT 22 2 2 2
2
Figure 4.17: Performance of algorithms on a 16  9 mesh
02468101214
161820
0 50 100 150 200 250 300 350 400 450 500 550Time (s) Number of Processors
PEX 3
3 3 3 3 3
GEN +
+ + + +
+REX 22 2 2 2
2IPEX     Figure 4.18: Performance of power-of-two algorithms for message size 16 Kbytes
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 88
00.20.40.60.811.2
1.41.6
0 50 100 150 200 250 300 350 400 450 500 550Time (s) Number of Processors
PEX 3
3 3 3 3 3
GEN +
+ + + + +REX 22 2 2 2
2IPEX     Figure 4.19: Performance of power-of-two algorithms for message size 1 Kbytesthis purpose, we use typical values for the communication costs on the Delta [80, 3],namely for unforced messages  = 75s, ex = 0:35s and for forced messages = 150s, ex = 0:2s. We assume that sr  ex and 2sat  ex, as donein [3], ie. two messages can travel on a link in the same direction without conict.Figures 4.20 and 4.21 show that the observed and predicted times agree very closely.We were not able to validate the models on the CM-5 because accurate values forsat relative to ex or sr for the CM-5 are not available in the literature.
CHAPTER 4. RUNTIME SUPPORT FOR ALL-TO-ALL COMMUNICATION 89
0
0.1
0.2
0.3
0.4
0.5
0.6
1000 4000 8000 12000 16000
Ti
me
 (
s)
Message Size (bytes)
PEX-predicted
PEX-observed
GEN-predicted
GEN-observed
Figure 4.20: Observed and Predicted times (PEX, GEN)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1000 4000 8000 12000 16000
Ti
me
 (
s)
Message Size (bytes)
IPEX-predicted
IPEX-observed
REX-predicted
REX-observed
Figure 4.21: Observed and Predicted times (REX, IPEX)
Chapter 5Runtime Support for Out-of-CorePrograms: (I) Models and LocalOptimizations5.1 IntroductionThere are a number of applications which deal with very large quantities of data.These applications exist in diverse areas such as large scale scientic computations,database applications, hypertext and multimedia systems, information retrieval, vi-sualization etc. The number of such applications and their data requirements keepincreasing day by day. Although supercomputers have very large main memories, thememory is not large enough to hold all the data required by these applications. Forexample, a typical Grand Challenge Application at present could require 1Gbyte to4Tbytes of data per run [38]. These gures are expected to increase by orders of mag-nitude as teraop machines make their appearance. Hence, data needs to be stored ondisk and the performance of the program depends on how fast processors can accessdata from disks. A poor I/O capability can severely degrade the performance of theentire program.Almost all present generation parallel computers provide some kind of hardwareand system software support for parallel I/O [29, 92, 9, 36]. But, the I/O performanceobserved at the application level is usually much lower than what the hardware can90
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 91support. There are several reasons for this. First, the data access patterns of manyparallel programs are such that they result in a large number of small requests to thele system [74]. Since the I/O latency is very high, this results is poor performance.Second, the interface to any parallel le system does not currently allow a programmerto specify strided accesses using a single read or write call; though there are somerecent proposals to rectify this [28, 88]. Third, the interface does not provide supportfor processors to make a collective I/O request. So le systems cannot performany optimizations based on the knowledge of the access requests of all processors.Finally, the programmer cannot specify access requests using a high level description,but instead has to explicitly manipulate le pointers. This makes it dicult for theprogrammer to perform optimizations at the application level, for example prefetchingto overlap I/O with computation, because of the complexities involved in managingbuers and le pointers.We have developed a runtime system called PASSION (Parallel and Scalable Soft-ware for Input-Output) [116, 25] which aims to alleviate many of these problems andprovide better software support for out-of-core programs. We believe that high-levelinterfaces that facilitate the use of semantic knowledge about the accesses from par-allel programs are necessary for simple and portable application programming. Ahigh-level interface can at the same time provide enough information so that I/O canbe done in an ecient manner. The PASSION Runtime Library accepts high-levelrequests from the application program, translates them to the low-level interface sup-ported by the parallel le system, and performs optimizations for ecient I/O. Thischapter discusses the basic design of PASSION and the various models, techniquesand optimizations used in it.5.2 PASSION Runtime LibraryThe PASSION Runtime Library provides routines to eciently perform the I/O re-quired in programs involving out-of-core multidimensional arrays. It provides supportfor loosely synchronous [46] out-of-core computations which use a Single Program
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 92
Interface Interface
Parallel File System
PASSION RUNTIME SYSTEM
MPI I/O
Node + MPHPC++HPF/
Figure 5.1: Software ArchitectureMultiple Data (SPMD) Model. The library can either be directly used by applicationprogrammers, or a compiler can translate out-of-core programs written in a high-level data-parallel language like HPF to node programs with calls to the library forI/O. PASSION provides the user with a simple high-level interface, which is a levelhigher than any of the existing parallel le system interfaces, as shown in Figure 5.1.For example, the user only needs to specify what section of the array needs to beread/written in terms of its lower-bound, upper-bound and stride in each dimension,and the PASSION Runtime Library will fetch it in an ecient manner. A number ofoptimizations such as Data Sieving, Data Prefetching, Data Reuse, and the ExtendedTwo-Phase Method have been incorporated in the library [116, 115, 118].
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 935.3 ModelsThis section discusses the architectural model and the data storage and access modelsused by the PASSION Runtime Library.5.3.1 Architectural ModelAn important goal in the design of PASSION has been to make it architecture in-dependent as far as possible. The architectural model assumed by PASSION is thatof any general distributed memory computer in which the processors are connectedtogether in some fashion. The system is assumed to be provided with a set of disksand I/O nodes. The I/O nodes can either be dedicated processors or some of thecompute nodes may also serve as I/O nodes [72]. Each processor may either haveits own local disk or all processors may share the set of disks. We prefer if the lesystem allows us to control the way les are stored on disks, particularly the numberof I/O nodes (or disks) across which the le is striped and the stripe size. The I/Osubsystem may have a separate interconnection network or it can share the samenetwork which connects the processors together.PASSION has been implemented on the Intel Touchstone Delta using the nativeConcurrent File System (CFS) and the NX message passing library. Hence it can runwithout modication on other Intel machines such as the Paragon and iPSC/860. Allthe performance results in this chapter as well as in Chapter 6 have been taken on theTouchstone Delta. The computation and communication hardware on the Delta isdescribed in Section 4.1.2. The I/O system on the Delta consists of 32 dedicated I/Onodes, each an Intel 80386 microprocessor. Each I/O node is connected to two disks,resulting in a total of 64 disks. A Concurrent File System (CFS) [92] is provided forparallel access to les. By default, a le is striped across all 64 disks in a round-robinfashion in blocks of size 4 Kbytes. It is possible for the user to specify the disks onwhich a le is to be stored, but the stripe size is xed. The CFS provides the user witha Unix-like interface and the parallel reads and writes are handled transparently. Theperformance of the CFS on the Touchstone Delta has been studied in detail in [9]. The
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 94CFS supports four modes of le access. In Mode 0, each processor has an independentle pointer. In Modes 1 { 3, processors have a common le pointer. All PASSIONroutines have been implemented using Mode 0.5.3.2 Data Storage and Access ModelsSince PASSION is used in programs having large arrays which do not t in mainmemory, the arrays have to stored on disks in some fashion. PASSION supports threebasic models of storing and accessing arrays, called the Local Placement Model (LPM),the Global Placement Model (GPM) and the Partitioned In-Core Model (PIM).Local Placement Model (LPM)In this model, the global array is divided into local arrays belonging to each processor.Since the local arrays are out-of-core, they have to be stored in les on disks. Thelocal array of each processor is stored in a separate le called the Local Array File(LAF) of that processor as shown in Figure 5.2(I). The node program explicitly readsfrom and writes into the le when required. The simplest way to view this model isto think of each processor as having another level of memory which is much slowerthan main memory. If the I/O architecture of the system is such that each processorhas its own disk, the LAF of each processor will be stored on the disk attached tothat processor. If there is a common set of disks for all processors, the LAF will bedistributed across one or more of these disks. In other words, we assume that eachprocessor has its own logical disk with the LAF stored on that disk. The mapping oflogical disks to physical disks depends on how much control the parallel le systemprovides the user. At any time, only a portion of the local array is fetched and storedin main memory. The size of this portion depends on the amount of memory available.The portion of the local array which is in main memory is called the In-Core LocalArray (ICLA). All computations are performed on the data in the ICLA. Thus,during the course of the program, parts of the LAF are fetched into the ICLA, thenew values are computed and the ICLA is stored back into appropriate locations in
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 95
(I)  Local Placement Model
Global Array
To P1
To P3To P2
To P0
Out-of-core Distribution
Global Array
To P1
To P3To P2
To P0
Out-of-core Distribution
P3P2
P0 P1
Processors
Disk 0
Disk 1
Disk 2
Disk 3
N
E
T
O
R
K
W
(II)  Global Placement Model
To P0 To P1
To P3To P2
Partitions
Global File
Global Array
(III)  Partitioned In-Core Model
DisksDisks
Local array Local array
  Files  Files
P3P2
P0 P1
Processors
ICLA ICLA
ICLA
I/O Nodes
In-Core Distribution
SlabFigure 5.2: Data Storage and Access Models
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 96the LAF.Global Placement Model (GPM)In this model, the global array is stored in a single le called the Global ArrayFile (GAF) as shown in Figure 5.2(II) and no local array les are created. Theglobal array is only logically divided into local arrays in keeping with the SPMDprogramming model. But, there is a single global array on disk. The PASSIONruntime system fetches the appropriate portion of each processor's local array fromthe global array le, as requested by the user, in an ecient manner. The advantageof the Global Placement Model is that it does not require the initial local arrayle creation phase in the Local Placement Model. The disadvantage is that eachprocessor's data may not be stored contiguously in the GAF, resulting in higher I/Olatency time. Also, explicit synchronization is required when a processor needs toaccess data belonging to another processor. Hence an optimized method, such asthe Extended Two-Phase Method proposed in Chapter 6, is needed to perform I/Oeciently.Partitioned In-Core Model (PIM)The Partitioned In-Core Model, illustrated in Figure 5.2(III), is a variation of theGlobal Placement Model. The array is stored in a single global array le as in theGlobal Placement Model, but there is a dierence in the way data is accessed. Inthe Partitioned In-Core Model, the global array is logically divided into a numberof partitions, each of which can t in the main memory of all processors combined.Thus the computation on each partition is essentially an in-core problem and no I/Ois required during the computation on the partition. Hence the name Partitioned In-Core Model. This model is useful when the data access pattern in the program hasgood locality. Otherwise, creating in-core partitions itself is dicult. The ExtendedTwo-Phase Method is used for I/O in this model.
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 975.4 Runtime Support for the Local PlacementModelLet us rst consider runtime support for the Local Placement Model. Consider theHPF program fragment shown in Figure 5.3, which solves Laplace's equation byJacobi iteration method. Let us assume that arrays A and B are very large and henceout-of-core. We note that this may not be the best algorithm for solving Laplace'sequation with an out-of-core data set as discussed in [41], but we only use it for thepurpose of explanation.The arrays A and B are distributed as (block,block) on a 4 4 grid of processorsas shown in Figure 5.4. In the Local Placement Model, the out-of-core local arrayof each processor is stored in a separate local array le. Consider the out-of-corelocal array (OCLA) on processor P5, shown in Figure 5.4(B). This is stored in thelocal array le (LAF) shown in Figure 5.4(D). Depending on the amount of memoryavailable, the OCLA is divided into slabs each of which can t in the in-core localarray (ICLA). Program execution proceeds by fetching a slab from the LAF to theICLA, doing the computation on that slab, storing the results back to the le, andso on for the remaining slabs.The computation in this example is a stencil computation in which the value ofeach element (i; j) is calculated using the values of its corresponding four neighbors,namely (i 1; j), (i+1; j), (i; j 1) and (i; j+1). Also, the computation in the currentiteration uses values computed in the previous iteration. Thus to calculate the valuesat the four boundaries of the local array, P5 needs the last row of the local array ofP1, the last column of the local array of P4, the rst row of the local array of P9and the rst column of of the local array of P6. Before each iteration of the program,P5 needs to get these rows and columns from its neighboring processors. If the localarray was in-core, these rows and columns would have been placed in the overlap areasshown in Figure 5.4(B). This is done so as to obtain better performance by retainingthe DO loop even at the boundary. Since the local array is out-of-core, these overlapareas are provided in the local array le. The local array le basically consists of
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 98parameter (n=1024)real A(n,n), B(n,n)..........!HPF$ PROCESSORS P(4,4)!HPF$ TEMPLATE T(n,n)!HPF$ DISTRIBUTE T(BLOCK,BLOCK) ONTO P!HPF$ ALIGN with T :: A, B...........FORALL (i=2:n-1, j=2:n-1)A(i,j) = (B(i,j-1) + B(i,j+1) + B(i+1,j)+ B(i-1,j))/4...........B = AFigure 5.3: HPF Program Fragment: Solving Laplace's Equation by Jacobi IterationMethod
P0 P1 P2 P3
P4 P6 P7
P8 P9 P10 P11
P12 P13 P14 P15
Array distributed on Out-of-core Local In core Local
P5
16 processors Array on P5 Array on P5
Local Array File
  on P5
Overlap Area
Overlap Area
Actual Data
(A)
(D)
(B) (C)Figure 5.4: Example of OCLA, ICLA and LAF
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 99Table 5.1: Some of the PASSION RoutinesPASSION Routine Function1 PASSION read Read entire LAF into ICLA2 PASSION write Write entire ICLA to LAF3 PASSION read section Read a regular section (with stride) from LAF to ICLA4 PASSION write section Write a regular section (with stride) from ICLA to LAF5 PASSION read prefetch Prefetch a regular section6 PASSION prefetch wait Wait for a prefetch to complete7 PASSION read reuse read section with data reuse8 PASSION global read Read a regular section (with stride) from global array le9 PASSION global write Write a regular section (with stride) to global array le10 PASSION oc shift Shift type collective communication on out-of-core data11 PASSION oc multicast Multicast communication on out-of-core datathe local array stored in either row-major or column-major order. In either case, thelocal array le will consist of the local array elements interspersed with overlap areaas shown in Figure 5.4(D). The in-core local array also needs overlap area for thesame reason as for the out-of-core local array.At the end of each iteration, processors need to exchange boundary data withneighboring processors. In the in-core case, this would be done using a shift type col-lective communication routine to directly communicate data from the local memoryof the processors. In the out-of-core case, this communication also requires I/O. Inan out-of-core shift type collective communication, each processor reads the bound-ary data from its local array le and communicates it to the neighboring processor.The processor also receives the data sent by neighboring processors and stores it inappropriate locations in the local array le.We have developed a library of routines to do the I/O as well as the out-of-core communication required in programs such as the example described above. Theroutines use a number of optimizations which are described in Section 5.5. Table 5.1lists some of the PASSION routines and their function.
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 1005.4.1 Out-of-Core Array Descriptor (OCAD)The runtime routines require information about the array such as its size, distributionamong the nodes of the distributed memory machine, storage pattern etc. All thisinformation is stored in a data structure called the Out-of-Core Array Descriptor(OCAD) and passed as a parameter to the runtime routines. Before any of theruntime routines are called, the compiler or user must ll the necessary informationin the OCAD. The structure of the OCAD is given in Figure 5.5. Rows 1 and 2 containthe lower and upper bounds of the in-core local array (excluding overlap area) in eachdimension. The lower and upper bounds of the in-core local array in each dimensionincluding overlap area are stored in rows 3 and 4. The size of the global array in eachdimension is given in row 5. Row 6 contains the size of the out-of-core local array.Row 7 species the number of processors assigned to each dimension of the globalarray. The format in which the out-of-core local array is stored in the local array le isgiven in Row 8. The array is stored in the order in which array elements are accessedin the program, so as to reduce the I/O cost. The entry for the dimension which isstored rst is set to 1, the entry for the dimension which is stored second is set to 2and so on. For example, for a two-dimensional array, x,y = 1,2 means that the arrayis stored on disk in column major order and x,y = 2,1 means that the array is storedin row major order. This information enables the runtime system to determine thelocation of any array element (i,j) on the disk. Row 9 contains information about thedistribution of the global array. Since the array can be distributed as BLOCK(m)or CYCLIC(m), where m is the block-size, the value of m is stored in Row 10 of theOCAD.5.5 OptimizationsA number of optimizations, such as data sieving, data prefetching and data reuse,have been incorporated in the PASSION Runtime Library.
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 101Dimension1 2 3 4 5 6 7Incore lbIncore ubIncore lboIncore uboGlobal szOCLA sizeProcsOOC storage x yDistributionBlock szFigure 5.5: Out-of-Core Array Descriptor (OCAD)5.5.1 Data SievingStudies of le access characteristics by Kotz and Nieuwejaar [74] have shown thatmany scientic applications actually make strided accesses to the le. Hence, allthe PASSION runtime routines for reading or writing data from/to les supportthe reading/writing of regular sections of arrays with strides. We dene a regularsection of an array as any portion of an array which can be specied in terms of itslower bound, upper bound and stride in each dimension. The need for reading arraysections from disks may arise due to a number of reasons, for example FORALL orarray assignment statements involving sections of out-of-core arrays.Consider the array of size (11,11) shown in Figure 5.6, which is stored in a le.Suppose it is required to read the section (2:10:2,3:9:2) of this array. The elementsto be read are circled in the gure. The interface to any parallel le system does notcurrently allow a programmer to read or write strided data using a single call. Hencethe only direct way of reading the required data is to explicitly move the le pointerto each element and read it individually, which requires as many reads as the numberof elements. We call this the Direct Read Method. A major disadvantage of this
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 102
(1,1) (1,11)
(11,11)(11,1)
(2,3) (2,9)
(10,9)(10,3)Figure 5.6: Accessing Sections of the OCLAmethod is the large number of I/O calls and low granularity of data transfer. Sincethe I/O latency is very high, this method proves to be very expensive. For example,on the Intel Touchstone Delta using one processor and one disk, it takes 16.06 ms. toread 1024 integers from a le as one block, whereas it takes 1948 ms. to read all ofthem individually.Suppose it is required to read a section of a two-dimensional array specied by(l1 : u1 : s1; l2 : u2 : s2). The number of array elements in this section is(b(u1   l1)=s1c+ 1) (b(u2   l2)=s2c+ 1). Therefore, in the direct read method,No. of I/O requests = (b(u1   l1)=s1c + 1) (b(u2   l2)=s2c+ 1)No. of array elements read per access = 1Thus in this method, the number of I/O requests is very high and the number ofelements accessed per request is very low, which is undesirable.We propose a much more ecient method called Data Sieving to read or writeout-of-core array sections having strides in one or more dimensions. Data sieving canbe explained with the help of Figure 5.7. As explained earlier, each processor has anout-of-core local array (OCLA) associated with it. The OCLA is (logically) dividedinto slabs, each of which can t in main memory (ICLA). The OCLA shown in thegure has four slabs. Let us assume that it is necessary to read the array section
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 103
OCLA
In-core 
buffer
ICLASlab
Sieve
Read contiguous block
Section(l1,l2)
(u1,u2) Figure 5.7: Data Sievingshown in Figure 5.7, specied by (l1 : u1 : s1; l2 : u2 : s2), into the ICLA. Althoughthis section spans three slabs of the OCLA, because of the stride all the data elementscan t in the ICLA.In data sieving, the entire block of data from column l2 to u2 if the storage is col-umn major, or the entire block from row l1 to u1 if the storage is row major, is readinto a temporary buer in main memory using one read call. The required data isthen extracted from this buer and placed in the ICLA. Hence the name data sieving.A major advantage of this method is that it requires only one I/O call and the restis data transfer within main memory. The main disadvantage is the high memoryrequirement. Another disadvantage is the extra amount of data that is read fromdisk. However, we have found that the savings in the number of I/O calls increasesperformance considerably. For this method, assuming column major storage,No. of I/O requests = 1No. of array elements read per access = (u2   l2 + 1) nrowsData sieving is a way of combining multiple I/O requests into one request so as to
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 104reduce the eect of high I/O latency time. A similar method calledmessage coalescingis used in interprocessor communication, where small messages are combined into asingle large message in order to reduce the eect of communication latency. However,data sieving is dierent because instead of coalescing the required data elementstogether, it actually reads even unwanted data elements so that large contiguousblocks are read. The useful data is then ltered out by the runtime system in anintermediate step and passed on to the program. The unwanted data read into mainmemory is dynamically discarded.Data sieving can also be considered from the following perspective. In the directmethod, even though a separate read call is required for each individual element, itmay not result in a disk access each time. There is usually some form of caching doneat the I/O nodes and if the requested data lies in the cache, it can be read from thecache itself. In spite of this, we nd that reading individual elements is a lot moreexpensive than reading one large chunk. This is probably because of the overheadof making several requests to the I/O nodes and looking up the software cache atthe I/O node each time. Data sieving can be considered as a way of doing softwarecaching in user space at the compute node itself. An entire chunk of data startingfrom the rst element in the section is eectively cached at the compute node and allthe required elements are supplied from this cache. The le system sees only a singlerequest or at most a few requests, depending on the amount of memory available forthis cache.In the future when le systems support a strided interface, data sieving can alsobe implemented at the le system level. The I/O processors can read large chunks ofdata from the le, perform sieving, and send only the required data to the computeprocessors.Reducing the Memory RequirementIf the stride in the array section is large or the number of rows in the section issmall compared to the total number of rows in the out-of-core array, the amount ofmemory required to read the entire block from column l2 to u2 may be quite large.
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 105There may not be enough main memory available to store this entire block. Henceinstead of reading the entire section in a single call, a smaller subsection is fetchedfrom the le, depending on the amount of memory available. Sieving is performed onthis subsection, then the next subsection is read, sieving is performed on it, and soon. This reduces the memory requirements of the program considerably and increasesthe number of I/O requests only slightly. Let us assume that the array is stored incolumn major order and n columns of the OCLA can t in main memory. Then forthis case No. of I/O requests = d(u2   l2 + 1)=neNo. of array elements read per access = n nrowsWriting Array SectionsSuppose it is required to write an array section (l1 : u1 : s1; l2 : u2 : s2) from the ICLAto the LAF. The issues involved here are similar to those described above for readingarray sections. A Direct Write Method can be used to write each element individu-ally, but it suers from the same problems of large number of I/O requests and lowgranularity of data transfer. To reduce the number of I/O requests, a method similarto the data sieving method described above needs to be used. If we directly use datasieving in the reverse direction, i.e. elements from the ICLA are placed at appropriatelocations in a temporary buer with stride, and the buer is written to disk, thedata in the buer between the strided elements will overwrite the corresponding dataelements on disk. In order to maintain data consistency, it is necessary to rst readthe entire subsection from the LAF into the temporary buer. Then, data elementsfrom the ICLA can be stored at appropriate locations in the buer and the entirebuer can be written back to disk. This is similar to what happens in cache memorieswhen there is a write miss. In that case, a whole line or block of data is fetched frommain memory into the cache and then the processor writes data into the cache. Thus,writing sections using sieving requires twice the amount of I/O compared to readingsections, because for each write to disk the corresponding block has to rst be fetchedinto memory. Therefore, for writing array sections
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 106No. of I/O requests = 2d(u2   l2 + 1)=neNo. of array elements transferred per access = n nrowsPerformanceTable 5.2 gives the performance of data sieving versus the direct method for readingand writing array sections. An array of size 2K  2K is distributed among 64 pro-cessors in one dimension along columns. So each processor's local array le consistsof an array of size 2K  32 stored in column major order. Each processor needs toread/write a section from/to its local array le. We measured the time taken by thePASSION read section() and PASSION write section() routines for reading andwriting sections of the out-of-core local array on each processor. We observe thatdata sieving provides tremendous improvement over the direct method in all cases.The reason for this is large number of I/O requests in the direct method, even thoughthe total amount of data accessed is higher in data sieving.Table 5.3 gives the number of I/O requests and the total amount of data trans-ferred for each of the array sections considered in Table 5.2. We observe that in thedata sieving method, the number of data elements transferred is more or less the samefor all cases. This is because the total amount of data transferred depends only onthe lower and upper bounds of the section and is independent of the stride. Hencethe time taken using data sieving does not vary much for all the sections we haveconsidered. However, there is a wide variation in time for the direct method, becauseonly those elements belonging to the section are read. The time is lower for smallsections and higher for large sections.We observe that even for writing array sections, data sieving performs better thandirect write even though it requires reading the section before writing. As expected,PASSION write section() takes about twice the time as PASSION read section()when using data sieving. All PASSION routines involving array sections use datasieving for greater eciency.
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 107Table 5.2: Performance of Direct Read/Write versus Data Sieving (time in sec.)Global array size 2K  2K real nos. (single precision), 64 processors,OCLA size 2K  32, slab size = 16 columnsPASSION read section PASSION write sectionArray Section Direct Read Sieving Direct Write Sieving(1:2048:2, 1:32:2) 52.95 1.970 49.96 5.114(1:2048:4, 1:32:4) 14.03 1.925 13.71 5.033(10:1024:3, 3:22:3) 8.070 1.352 7.551 4.825(100:2048:6, 5:32:4) 7.881 1.606 7.293 4.756(1024:2048:2, 1:32:3) 18.43 1.745 17.98 5.290
Table 5.3: I/O requirements of Direct Read and Data Sieving MethodsGlobal array size 2K  2K real nos. (single precision), 64 processors,OCLA size 2K  32, slab size = 16 columnsNo. of I/O requests No. of array elements readArray Section Direct Read Sieving Direct Read Sieving(1:2048:2, 1:32:2) 16384 2 16384 65536(1:2048:4, 1:32:4) 4096 2 4096 65536(10:1024:3, 3:22:3) 2373 2 2373 40960(100:2048:6, 5:32:4) 2275 2 2275 57344(1024:2048:2, 1:32:3) 5643 2 5643 65536
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 108
Read Read ReadComp Comp CompWrite Write Write
(A) Without Prefetch
Read
Read Read
Comp Comp Comp WriteWriteWrite
(B) With PrefetchFigure 5.8: Data Prefetching5.5.2 Data PrefetchingIn the model of computation and I/O described earlier, the OCLA is divided intoa number of slabs, each of which can t in the ICLA. Program execution proceedsas follows:- a slab of data is fetched from the LAF to the ICLA; the computationis performed on this slab and the slab is written back to the LAF. This is repeatedon other slabs till the end of the program. Thus I/O and computation form distinctphases in the program. A processor has to wait while each slab is being read orwritten as there is no overlap between computation and I/O. This is illustrated inFigure 5.8(A) which shows the time taken for computation and I/O on 3 slabs. Forsimplicity, reading, writing and computation are shown to take the same amount oftime, which may not be true in certain cases.The time taken by the program can be reduced if it is possible to overlap com-putation with I/O in some fashion. A simple way of achieving this is to issue anon-blocking I/O read request for the next slab immediately after the current slabhas been read. This is calledData Prefetching. Kotz and Ellis [73] discuss in detail theeects of prefetching in parallel le systems. Since the read request is non-blocking,the reading of the next slab can be overlapped with the computation being performedon the current slab. If the computation time is comparable to the I/O time, this can
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 109result in signicant performance improvement. Figure 5.8(B) shows how prefetchingcan reduce the time taken for the example in Figure 5.8(A). Since the computationtime is assumed to be the same as the read time, all reads other than the rst oneget overlapped with computation. The total reduction in program time is equal tothe time for reading two slabs, as only two of the three reads can be overlapped inthis example.Note that the parallel le system may do some prefetching on its own. For exam-ple, for each read request on the Delta, the CFS automatically prefetches the nextseven blocks of the le into the I/O node, resulting in a total of at least eight blocksbeing read. For many applications, the prefetching done at the le system level maynot be sucient and could even degrade performance, unless the user can control theprefetching policy. This is because dierent applications have dierent access pat-terns which may not match well with the prefetching policy of the le system. Mostof the parallel le systems at present do not allow the user to control prefetching. Onthe other hand, the prefetching done by PASSION is controlled by the user. Since theuser knows what data is needed next, it can be prefetched correctly. Prefetching canbe done in PASSION using the routine PASSION read prefetch() and the routinePASSION prefetch wait() can be used to wait for the prefetch to complete.PerformanceWe use an out-of-core median ltering program to illustrate the performance of dataprefetching. Median ltering is frequently used in computer vision and image pro-cessing applications to smooth the input image. Each pixel is assigned the medianof the values of its neighbors within a window of a particular size, say 3 3 or 5 5or larger. We have implemented a parallel out-of-core median ltering program usingPASSION runtime routines for I/O and communication. The image is distributedamong processors in one dimension along columns and stored in local array les. De-pending on the window size, each processor needs a few columns from its right andleft neighbors. This requires a shift type communication which is implemented usingthe routine PASSION oc shift().
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 110Table 5.4: Performance of Median Filtering using 3  3 window (time in sec.)4 slabs 8 slabs 16 slabsProcs Prefetch No Prefetch Prefetch No Prefetch Prefetch No Prefetch4 36.37 46.56 33.63 46.75 30.65 47.218 18.32 23.37 16.72 24.41 16.36 24.8616 9.180 12.33 8.730 12.60 8.580 13.3532 5.340 6.830 5.260 7.000 5.080 7.16064 5.650 5.850 4.970 5.970 5.410 6.230Table 5.5: Performance of Median Filtering using 5  5 window (time in sec.)4 slabs 8 slabs 16 slabsProcs Prefetch No Prefetch Prefetch No Prefetch Prefetch No Prefetch4 81.47 94.09 79.25 95.63 78.58 96.888 41.81 47.76 41.35 49.32 41.01 50.5916 21.57 25.41 21.40 27.28 21.74 27.8132 11.36 12.83 11.40 13.64 11.43 14.8164 7.110 9.010 6.810 9.110 8.090 9.197Tables 5.4 and 5.5 show the performance of median ltering on the Intel Touch-stone Delta for windows of size 33 and 55 respectively. The image is of size 2K 2K pixels. We assume this to be out-of-core for the purpose of experimentation. Thenumber of processors is varied from 4 to 64 and the size of the ICLA is varied in eachcase in such a way that the number of slabs varies from 4 to 16. Since the TouchstoneDelta has 64 disks, each processor's LAF can be stored on a separate disk.The following observations can be made from these tables:-1. In all cases, prefetching improves performance considerably. In some cases, theimprovement is close to 40%. Figures 5.9 and 5.10 show the relative performancewith and without prefetching when the number of slabs is 8.
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 111
4 8 16 32 64
Processors
0.0
10.0
20.0
30.0
40.0
50.0
Tim
e (
se
c)
Without prefetch
With prefetch
Figure 5.9: Median Filtering using 3  3 window
4 8 16 32 64
Processors
0.0
20.0
40.0
60.0
80.0
100.0
Tim
e (
se
c)
Without prefetch
With prefetch
Figure 5.10: Median Filtering using 5  5 window
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 1122. Without prefetching, as the number of slabs is increased, the time taken in-creases. This is because a larger number of slabs means a smaller slab sizewhich results in a larger number of I/O requests.3. With prefetching, as the number of slabs in increased, the time taken decreasesin most cases. Since the rst slab can never be prefetched, all processors haveto wait for the rst slab to be read. As the slab size is reduced, the wait time forthe rst slab is also reduced and there is more overlap of computation and I/O.However, the number of I/O requests increases. When the slab size is large, areduction in the slab size by half improves performance because the saving inthe wait time for the rst slab is higher than the increase in time due to thelarger number of I/O requests. But when the slab size is small (64 processorcase with 8 or 16 slabs), the higher number of I/O requests costs more thanthe decrease in wait time for the rst slab. Hence the performance actuallydegrades in this case.5.5.3 Data ReuseIn out-of-core programs, data is fetched from les many times. If a portion of the datacurrently fetched into memory is also needed for the computation on the next data set,then that portion already in memory can be reused instead of reading it again withthe next data set. For example, consider the Laplace equation solver discussed earlier(Figure 5.3). Suppose the array is distributed along columns. Then the computationof each column requires one column from the left and one column from the right. Thecomputation of the last column requires one column from the overlap area and thecomputation of the column in the overlap area cannot be performed without readingthe next column from the disk. Hence, instead of reading the column in the overlaparea again with the next set of columns, it can be reused by moving it to the rstcolumn of the array and the last column can be moved to the overlap area before therst column (Figure 5.11). If this move is not done, it would be necessary to read thetwo columns again from the disk along with data for the next slab. The reuse thus
CHAPTER 5. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (I) 113eliminates the reading of two columns in this example. In general, the amount of datareuse would depend on the intersection of the sets of data needed for computationsinvolving two consecutive slabs. The PASSION routine PASSION read reuse() canbe used to perform data reuse.
Overlap Areas
ICLA
Move columns
Figure 5.11: Data ReuseTable 5.6: Performance of Data ReuseLaplace Equation solver, 64 processorsArray size Time in sec.Without Reuse With Reuse2K  2K 75.12 71.714K  4K 274.7 269.1Table 5.6 shows the performance improvement obtained by using data reuse for theLaplace Equation Solver program on the Intel Touchstone Delta using 64 processors.Two dierent global array sizes are used | 2K  2K and 4K  4K. Reuse providesgood performance improvement even in this case where only two columns can bereused.
Chapter 6Runtime Support for Out-of-CorePrograms: (II) Collective I/O6.1 IntroductionChapter 5 discusses runtime support for the Local Placement Model where there is aseparate local array le for each processor. Each processor can directly access only itsown local array le. However, the situation is quite dierent in the Global Placementand Partitioned In-Core Models. In these two models, there is a single global arrayle containing the entire out-of-core array. Each processor accesses the data it needsfrom this common le. A processor may, in general, need to access any arbitrarysection of the global array, with or without stride. The global array may be storedin the global array le in either row-major or column-major order. As a result, thedata required by each processor may not be located contiguously in the le. Also,the requests of some processors may overlap. In the extreme case, all processors maywant to access the same section of the array. If each processor directly tries to readthe data it needs, it may result in a large number of low granularity requests andmultiple requests for the same data.In loosely synchronous SPMD programs, all processors perform similar operationsbut on dierent data sets [46]. Hence, if one processor needs to read data fromdisks, it is very likely that a group of processors or maybe all processors need to readdata from disks at about the same time. This makes it possible for the requesting114
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 115processors to cooperate in reading or writing out-of-core data in an ecient man-ner, which is known as collective I/O. This chapter proposes a method, called theExtended Two-Phase Method, which uses collective I/O for accessing sections of out-of-core arrays eciently. PASSION provides two routines, PASSION global read()and PASSION global write(), to read/write arbitrary array sections with stridesfrom/to a global array le, using the Extended Two-Phase Method.6.2 Need for Collective I/OThe need for collective I/O can be explained using the following example. Considerthe large out-of-core array shown in Figure 6.1 and assume that it is stored in a lein column-major order. Four processors (0 { 3) need to access a block of rows eachas shown. Since the access requests are orthogonal to the le storage order, the datarequested by each processor is located non-contiguously in the le. Also, the requestsof dierent processors lie interleaved in the le. A portion of 0's request lies at thestart of the le, followed by some unwanted data (gap), then a portion of 1's requestfollowed by a gap, then a portion of 2's request followed by a gap, then a portion of3's request followed by a gap, then again a portion of 0's request, and so on. To readthe data using the interface provided by most of the existing parallel le systems,each processor has to explicitly seek to the appropriate location in the le, read thesmall chunk of data, then seek to the next location, and so on. We call this theDirect Method. The Vesta le system [29] and the nCUBE le system [36] do providesupport for the user to specify a logical view of the data to be read, and use a singlecall to read data. But each processor's request is serviced independently and there isno collective optimization based on the requests of all processors.The drawback of the Direct Method is a large number of low granularity requestswhich may arrive from dierent processors in any order. Since I/O latency is veryhigh, this usually results in poor performance. The basic premise behind collectiveI/O is that if a group of processors need to read/write data at the same time, they
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 116
3’s request
2’s request
1’s request
0’s request
Figure 6.1: Accessing row blocks of a le stored in column-major ordercan cooperate among themselves to perform I/O eciently in large chunks and inthe right order. This is usually possible in loosely synchronous SPMD programs inwhich all processors perform similar operations on dierent data sets. Hence, a groupof processors may need to perform I/O at the same time. An appropriate interfaceneeds to be provided for the user to specify a collective I/O request. The PASSIONRuntime Library provides such an interface [26].Given an appropriate collective I/O interface, the collective I/O operation canbe implemented either as a library on top of the le system, or at the le systemlevel itself. A technique called Two-Phase I/O [37, 12] has been proposed for doingcollective I/O at the library level. In this method, I/O is done in two phases. In therst phase, processors cooperate to read data in large chunks, and in the second phasethey do an in-core redistribution of the data. Disk-directed I/O [69] is a techniquewhich proposes to do collective I/O at the le system level. In disk-directed I/O, acollective I/O request is sent to all I/O nodes which determine the order and timingof the ow of data.
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 1176.3 Extended Two-Phase Method for CollectiveI/OThe Two-Phase Method [37, 12] is a collective I/O technique for reading an entirein-core array from a le into a distributed array in main memory, and converselyto write a distributed in-core array to a le. I/O is done in two phases. In the rstphase, processors always read data assuming a conforming distribution. A conformingdistribution is dened as a distribution of an array among processors such that eachprocessor's local array lies contiguously in the le. This results in each processorreading a single large chunk of data. For an array stored in a le in column-majororder, a column-block distribution is the conforming distribution. In the second phase,data is redistributed among processors to whatever is the desired distribution. SinceI/O cost is orders of magnitude more than communication cost, the cost incurred bythe second phase is negligible. This Two-Phase approach is found to give consistentlygood performance for all distributions [37, 12].We have extended the basic Two-Phase Method to access arbitrary sections ofout-of-core arrays. This Extended Two-Phase Method performs I/O for out-of-corearrays eciently by partitioning the I/O workload among processors dynamically, depending on theaccess requests, combining several I/O requests into fewer larger granularity requests, reordering requests so that data is accessed in proper sequence, eliminating multiple disk accesses for the same data, and reducing contention for disks.6.3.1 Reading Sections of Out-of-Core ArraysWe rst describe the Extended Two-Phase Method for reading regular sections ofout-of-core arrays. We dene a regular section of an array as a section which can
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 118be specied by a lower-bound, upper-bound and stride in each dimension. For thepurpose of explanation, we consider the case where each processor needs to read someregular section of a two-dimensional array stored in the le in column-major order.The Extended Two-Phase Method can actually be used for arrays with any number ofdimensions, stored in any order in the le and accessed by any number of processors.Section 6.6 discusses how the general case can be implemented.In the Extended Two-Phase Method, the I/O workload is divided among proces-sors. For this, we assign ownership to portions of the le such that a processor candirectly access only the portion of the le it owns. The le is eectively logicallydivided into domains. The portion of the le which a processor can directly access iscalled its File Domain (FD). For a le stored in column-major order, the le domainof each processor is some set of columns of the array. The issue of how to selectle domains is important because it determines how the I/O workload gets dividedamong processors. This is discussed in detail in Section 6.4.Let us assume that each processor needs to read some regular section(l1 : u1 : s1; l2 : u2 : s2) of the array in global coordinates. In the rst step ofthe Extended Two-Phase Method, processors exchange their own access information(the indices l1; u1; s1; l2; u2; s2) with other processors, so that each processor knowsthe access requests of other processors. This information is stored in a data structurecalled the File Access Descriptor (FAD). The FAD contains exactly the same infor-mation on all processors. This exchange phase is not required if the collective I/Ointerface itself provides information about other processors' access requests.Since each processor knows its own le domain and the access requests of otherprocessors, it can determine what portion of the data in its le domain is neededby other processors. This is done by computing the intersection of the requests ofother processors from the FAD and its own le domain. This information is storedin a data structure called the File Domain Access Table (FDAT). Thus the FDAT ofa processor contains information about which portions of its le domain have beenrequested by other processors.Each processor now has to read data from its le domain, as determined by the
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 119
0’s request
1’s request
2’s request
3’s request
A D
CB
File Domain of  processor 0Figure 6.2: The requests in processor 0's le domainFDAT. For example, Figure 6.2 shows the le domain of processor 0 and, for someaccess pattern, the portions of this le domain which have been requested by otherprocessors. A simple way of reading would be to read all the data needed by processor0, followed by that needed by processor 1 and so on in order of processor number.But, as in the case of Figure 6.2, this may result in too many small accesses which arenot in sequence. For the read to be done eciently, it is important that the FDAT beanalyzed so that the le is accessed in sequence and contiguously, as far as possible.We have devised a very general method for analyzing the information in the FDAT,which ensures that the le is read contiguously and in sequence. Each processorcalculates the minimum of the lower-bounds and the maximum of the upper-boundsof all sections in its FDAT. This eectively determines the smallest section whichcontains all the data that needs to be read from the le domain (for example, sectionABCD in Figure 6.2). This section may also contain some data which is not requiredby any processor. If the processor tries to read only the useful data, it may resultin a number of small strided accesses. To avoid this, it uses the optimization datasieving described in Chapter 5. The processor reads a column of the section at a timein a single operation into a temporary buer. This may include some unwanted data.The useful data is extracted from the temporary buer and placed in communication
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 120buers depending on which processors need the data. The entire section is read fromthe le domain in this fashion. It is possible to read more than one column at a timeif there is enough temporary memory available to do sieving on the set of columns.This forms the rst phase of the Extended Two-Phase Method.The second phase of the Extended Two-Phase Method consists of communicatingthe data read in the rst phase to the respective processors. The information in theFDAT is sucient for each processor to determine what data has to be sent to whichprocessor. Since each processor knows the le domains of other processors and itsown access request, it can calculate how much data it needs to receive from otherprocessors, as well as the locations in main memory where the received data needs tobe placed.The two phases of the Extended Two-Phase Method can either be done distinctlyby performing all the I/O rst and then the communication, or they can be overlapped(pipelined) by reading smaller portions of data and communicating it.6.3.2 Writing Sections of Out-of-Core ArraysThe algorithm for writing sections is essentially the reverse of the algorithm for readingsections. From the information in the File Access Descriptor (FAD), each processorcan determine what portion of the section it needs to write lies in the le domainsof other processors. Each processor also computes its own File Domain Access Table(FDAT), which indicates how much data it needs to receive from other processors, tobe written to its le domain. The rst phase of the Extended Two-Phase Method forwriting is to perform communication to get this data from other processors.The second phase is to write the data to the le in sequence and contiguously.The FDAT is analyzed in the same way as in the read algorithm. Each processordetermines the minimum and maximum of all indices in its FDAT. This eectivelydetermines the smallest section which includes all the data that needs to be writtento the le domain. It may also include some data which is not being written by anyprocessor. The processor writes the useful data in this section one column at a timeusing data sieving. However, for writing using data sieving, we cannot directly use
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 121the reverse of the method used for reading. If the useful data is placed at appropriatelocations (possibly with stride) in a temporary buer and the buer is written to thele, the contents of the buer between the useful data elements will overwrite the datain the le. To maintain data consistency, it is necessary to rst read the entire columnfrom the le into the temporary buer. Then, the data elements to be written in thatcolumn can be stored at appropriate locations in the buer and the entire columncan be written back to disk. Thus, writing sections requires twice the amount ofI/O compared to reading sections, because to write each column, the correspondingcolumn has to rst be read into memory. It is possible to avoid this extra reading incases where the entire column contains useful data to be written. This requires eachprocessor to do a more extensive analysis of the FDAT, to make sure that there areno \holes" between the data sets being written by dierent processors in the samecolumn, and also no strides in the section being written by each processor. As in thecase of reading sections, it is possible to do sieving with more than one column at atime if there is enough temporary memory available.If the sections requested to be written by dierent processors have some elementsin common, there is a data consistency problem. The result depends on how theExtended Two-Phase Method has been implemented. In our implementation, if thereare write requests from multiple processors to the same location, the data from thehighest numbered processor is written to the le.6.4 Partitioning I/O Among ProcessorsIn the Extended Two-Phase Method, processors cooperate to perform I/O. The exactpartitioning of the I/O workload among processors depends on how le domainsare dened. In general, the partitioning of I/O can be done either statically ordynamically.
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 1226.4.1 Static PartitioningOne way of partitioning I/O is to assign a block of columns of the entire out-of-corearray to each processor, as if the array were distributed among processors in a columnblock fashion. Thus the le domain of each processor is a block of columns of thearray, which is stored contiguously in the le. The le domains are predened andxed. The size of each domain can be determined from the size of the array and thenumber of processors, and is independent of the access requests. This is called a staticpartitioning of I/O among processors. Note that this is just a logical partitioning ofthe le among processors, so that each processor directly accesses only a particularportion of the le (its le domain). Figure 6.3(A) shows the le domains with staticpartitioning when there are four processors.6.4.2 Dynamic PartitioningThe main drawback of static partitioning is that the partitioning is independent ofthe type of access requests. For many types of access patterns, this may result in animbalance of I/O among processors. Some processors may perform more I/O thanothers and some may not perform any I/O at all. For example, consider the accesspattern in Figure 6.3. If the partitioning is done statically, the access requests spanthe le domains of only two processors. Thus only two processors (1 and 2) performall the I/O; the remaining two processors (0 and 3) only wait to receive data sent by1 and 2. Another drawback of static partitioning is that as the size of the out-of-corearray is increased keeping the number of processors xed, the size of each le domainalso increases. Hence, for the same access pattern, as the size of the out-of-core arrayis increased, the access requests span the le domains of fewer processors resulting ingreater imbalance of I/O and lower performance.The I/O throughput can potentially be improved by partitioning the I/O workloadamong processors dynamically, depending on the array sections requested. This isillustrated in Figure 6.3(B). For a le stored in column-major order, each processorcalculates the lowest and highest among the columns of the sections requested by
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 123
FD FD FDFD
of 0 of 1 of 2 of 3
FD FD FD FD
of 0 of 3of 2of 1
Bounding Section
(B) Dynamic(A) Static
Access Requests
Figure 6.3: Static versus Dynamic Partitioningall processors. The section formed by these columns and all the rows of the out-of-core array is called the bounding section. The bounding section includes the sectionsrequested by all processors and it lies contiguously in the le. Figure 6.3(B) showsthe bounding section for the given access requests. File domains are determined bydividing the bounding section among the processors in a column-block fashion. Thusthe le domain of each processor is a contiguous chunk of the bounding section.If the requested sections span all the columns of the out-of-core array, the dy-namically selected le domains are exactly the same as those determined statically.But if the sections span only a few columns, as in Figure 6.3, dynamic partitioningprovides a much better balance of I/O among processors. The memory requirementsof the Extended Two-Phase Method are also reduced, because the le domain of eachprocessor is smaller. In the static case, if all requested sections lie in a single proces-sor's le domain, all the requested data may not t in the memory of that processor.Hence I/O and communication may need to be done in stages several times. This isless likely to occur in the case of dynamic partitioning, because the requested data ismore evenly divided among processors.
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 1241. Exchange access information with other processors and ll in the File AccessDescriptor (FAD).2. Determine the smallest section which includes the sections requested by allprocessors, called the bounding section.3. The le domain of each processor is determined by dividing thisbounding section among the processors in a column-block mannerfor arrays stored in column-major order.4. Compute the intersection of the FAD and this processor's le domain,and ll in the File Domain Access Table (FDAT).5. Calculate the minimum of the lower-bounds and the maximum of theupper-bounds of all sections in the FDAT to determine the smallest sectioncontaining all the data needed from the le domain.6. Read this section using data sieving and communicate the datato the requesting processors.Figure 6.4: Extended Two-Phase Method for Reading Sections of Out-of-Core ArraysThe algorithm for reading sections of out-of-core arrays using the Extended Two-Phase Method with dynamic partitioning is given in Figure 6.4. If the array is storedin the le in row-major order instead of column-major order, the only dierence wouldbe that the le domains would be dened in terms of row-blocks and data sievingwould be done along rows.6.5 PerformanceWe extensively tested the performance of the Extended Two-Phase Method versusthe Direct Method on the Intel Touchstone Delta, for many dierent access patterns,
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 125using both static and dynamic partitioning. The access patterns can be classied intothree basic types:-1. Common Sections: All processors need to access exactly the same section of thearray.2. Overlapping Sections: Parts of the section requested by a processor may overlapwith parts of the sections requested by other processors.3. Distinct Sections: The section requested by each processor does not have anydata in common with the section requested by any other processor.6.5.1 Reading Common SectionsTable 6.1 compares the performance of the Extended Two-Phase Method and theDirect Method for reading common sections. The array size is 4K  4K and thenumber of processors is 16. Figure 6.5 shows approximately where each of thesesections is located in the array. The performance of the Extended Two-Phase Methodwas measured using both static and dynamic partitioning. We observe that in allcases, the Extended Two-Phase Method performs considerably better than the DirectMethod. This is primarily because, in the Extended Two-Phase Method, the commonsection is read only once and then broadcast to other processors, whereas in theDirect Method, all processors simultaneously try to read the same section from thele, resulting in extra I/O overhead.In all cases, the Extended Two-Phase Method takes a lot less time with dynamicpartitioning than with static partitioning. In the case of static partitioning, for a4K  4K array with 16 processors, each processor's le domain is of size 4K  256.Thus all sections except V lie in the le domains of only a few processors. But withdynamic partitioning, each section is evenly divided among all available processors,resulting in higher I/O throughput. Since Section V spans all 4096 columns, thestatically and dynamically selected le domains are identical, resulting in identicalperformance.
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 126Table 6.1: Time (sec.) for Reading Common SectionsArray size 4K  4K real nos. (single precision), 16 processorsNo. Array Section Direct Extended Two-PhaseRead Static DynamicI (1:100:1, 1:100:1) 1.632 1.027 0.431II (200:300:1, 200:300:1) 1.867 0.883 0.363III (400:800:1, 400:800:1) 6.265 3.692 1.056IV (32:64:1, 128:1024:1) 9.995 2.780 1.318V (1:16:1, 1:4096:1) 52.06 3.241 3.241VI (1:4096:1, 1:16:1) 1.216 2.024 0.420
(III)(I) (II)
(IV) (V) (VI)Figure 6.5: The common sections listed in Table 6.1 (not to scale)
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 127In cases where the sections span a large number of columns (e.g. Section V),the Extended Two-Phase Method provides a signicant improvement over the DirectMethod. This is because for such cases, the Direct Method results in a large numberof small requests spread across the entire le. In the Extended Two-Phase Method,the I/O gets evenly divided among all processors, the requests are reordered and datais read in large chunks.6.5.2 Reading Overlapping SectionsTable 6.2 compares the performance of the Extended Two-Phase Method and theDirect Method for reading overlapping sections. Figure 6.6 shows approximatelywhere these sections are located in the array. To represent these overlapping sectionsfor all processors concisely, we use the following notation. Each processor's requestis denoted by (l1 + ov1  p : u1 + ov1  p : s1; l2 + ov2  p : u2 + ov2  p : s2),where p is the processor number and ov1, ov2 are some constants. The amount ofoverlap can be changed by varying ov1 and ov2. For example, the notation (1:100:1,1+10p:100+10p:1) in row I of Table 6.2 represents a group of overlapping sectionswith processor 0 requesting section (1:100:1, 1:100:1), processor 1 requesting section(1:100:1, 11:110:1), processor 2 requesting section (1:100:1, 21:120:1) and so on. Thesections in cases I | IV overlap along columns, the sections in cases V | VII overlapalong rows and the sections in case VIII overlap in both dimensions.We observe that the Extended Two-Phase Method with dynamic partitioningperforms the best in all cases. The sections in cases I and II are of the same size, butthey dier in the amount of overlap. The sections in case I have more overlap thanthose in case II. Since the total number of columns of the out-of-core array spannedby the sections in case I is less than that by the sections in case II, it takes lesstime to read the sections in case I. Sections in cases IV, V and VI span only a fewcolumns. For these cases, the Direct Method performs better than the Extended Two-Phase Method with static partitioning. This is because static partitioning results inonly a few processors performing I/O. But the Extended Two-Phase Method with
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 128Table 6.2: Time (sec.) for Reading Overlapping SectionsArray size 4K  4K real nos. (single precision), 16 processorsNo. Array Section Direct Extended Two-Phase(p = processor number) Read Static DynamicI (1:100:1, 1+10p:100+10p:1) 2.000 1.830 0.693II (1:100:1, 1+50p:100+50p:1) 4.627 1.859 0.875III (400:800:1, 400+100p:800+100p:1) 8.097 3.348 2.477IV (1:4096:1, 1+8p:16+8p:1) 1.152 3.374 0.826V (1+50p:100+50p:1, 1:100:1) 1.579 1.994 0.524VI (400+100p:800+100p:1, 400:800:1) 7.442 11.84 1.361VII (1+8p:16+8p:1, 1:4096:1) 50.32 2.992 2.992VIII (200+100p:400+100p:1, 200+100p:400+100p:1) 3.104 2.986 1.739
overlap
(I) (II) (III) (IV)
(VI)(V) (VII) (VIII)
overlap
overlap
overlap
overlap
overlap
overlap
overlapFigure 6.6: The overlapping sections listed in Table 6.2 (not to scale)
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 129dynamic partitioning performs better than the Direct Method, since there is a betterdistribution of I/O among processors. The sections in case VII span all columns ofthe array, which is the worst case for the Direct Method. Finally, in case VIII, thesections have overlap in both dimensions and again the Extended Two-Phase Methodwith dynamic partitioning takes the least time.6.5.3 Reading Distinct SectionsTable 6.3 compares the performance of the Extended Two-Phase Method and theDirect Method for reading distinct sections. Figure 6.7 shows approximately wherethese sections are located in the array. We use the same notation as above, (l1+ov1p : u1+ ov1 p : s1; l2+ ov2 p : u2+ ov2 p : s2), for representing distinct sections.The overlap factors ov1 and ov2 need to be large enough to ensure that the sectionsare distinct.In all cases, the Extended Two-Phase Method with dynamic partitioning per-forms the best. The relative performance of the three methods is similar to that foroverlapping sections in Table 6.2. The sections in case I are located along rows, sothe requests of dierent processors lie in separate locations in the le. Hence theExtended Two-Phase Method performs only slightly better. The sections in cases II| IV are located along columns and so the requests of dierent processors lie in-terleaved in the le. Hence the Extended Two-Phase Method performs considerablybetter. Static partitioning does not give good performance for the sections in case IIbecause they span only a few columns. The best case for the Extended Two-PhaseMethod is case IV since the sections span all the columns. The sections in cases V andVI lie partly interleaved in the le. Even for these cases, the Extended Two-PhaseMethod performs the best.
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 130Table 6.3: Time (sec.) for Reading Distinct SectionsArray size 4K  4K real nos. (single precision), 16 processorsNo. Array Section Direct Extended Two-Phase(p = processor number) Read Static DynamicI (1:100:1, 1+100p:100+100p:1) 1.976 1.954 1.304II (1+100p:100+100p:1, 1:100:1) 1.633 2.182 0.548III (200+200p:400+200p:1, 512:1024:1) 8.016 5.680 1.725IV (1+32p:16+32p:1, 1:4096:1) 51.63 4.823 4.823V (200+200p:400+200p:1, 1+200p:512+200p:1) 5.466 4.524 3.912VI (1+32p:32+32p:1, 1+100p:1024+100p:1) 12.02 2.991 2.371
(III)(I) (II)
(IV) (V) (VI)Figure 6.7: The distinct sections listed in Table 6.3 (not to scale)
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 131Table 6.4: Time (sec.) for Writing Distinct SectionsArray size 4K  4K real nos. (single precision), 16 processorsNo. Array Section Direct Extended Two-Phase(p = processor number) Write Static DynamicI (1:100:1, 1+100p:100+100p:1) 1.944 3.250 2.801II (1+100p:100+100p:1, 1:100:1) 1.182 2.901 0.854III (200+200p:400+200p:1, 512:1024:1) 4.202 8.715 3.569IV (1+32p:16+32p:1, 1:4096:1) 24.85 10.25 10.25V (200+200p:400+200p:1, 1+200p:512+200p:1) 5.155 5.461 4.401VI (1+32p:32+32p:1, 1+100p:1024+100p:1) 8.233 4.994 4.2746.5.4 Writing Distinct SectionsWe only consider the case where each processor writes a distinct section to the le,because it is unlikely that processors will want to write overlapping or common sec-tions. Table 6.4 compares the performance of the Extended Two-Phase Method andthe Direct Method for writing distinct sections. The sections chosen are the same asthose for reading in Table 6.3 (Figure 6.7).We use the most general algorithm for writing in the Extended Two-Phase Method,which requires an extra read for each write. Hence for the sections in case I, the Di-rect Method performs better because it does not require the extra read and also thesesections are small with few columns. The sections in cases II { IV lie interleavedin the le, so the Extended Two-Phase Method with dynamic partitioning performsmuch better than the Direct Method. The sections in cases V and VI lie partly inter-leaved in the le and even for these cases, the Extended Two-Phase Method performsconsiderably better.6.5.5 Accessing Sections with Non-Unit StridesWe have also tested the performance for accessing sections with non-unit strides.When an array section has a non-unit stride, each element of the array lies strided
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 132Table 6.5: Time (sec.) for Reading Sections with Non-unit StridesArray size 4K  4K real nos. (single precision), 16 processorsNo. Array Section Direct Extended Two-Phase(p = processor number) Read Static DynamicI (p+1:4096:nprocs, p+1:4096:nprocs) 210.8 9.330 9.330II (1+250p:250+250p:2, 1+250p:250+250p:2) 53.13 3.610 2.842III (1+200p:500+200p:3, 1+200p:500+200p:3) 87.19 4.394 4.387IV (1+64p:64+64p:2, 500:2500:3) 96.20 4.759 3.848V (500:2500:3, 1+64p:64+64p:2) 130.7 4.574 2.340in the le. The only way of reading such array sections using a direct method is toexplicitly seek to each individual element and read only that element. This results invery low granularity of data transfer, which is very expensive. The Extended Two-Phase Method overcomes this drawback of the Direct Method by reordering requestsand using data sieving for larger granularity reads.Table 6.5 shows the performance of the Extended Two-Phase Method for readingsections with non-unit strides. The sections in case I span almost the entire array,with stride equal to the number of processors. Hence static and dynamic partitioningtake exactly the same time. The sections in cases II and III are located diagonallyacross the out-of-core array. The sections in case IV are located along columns andthe sections in case V are located along rows. In all cases, the Extended Two-PhaseMethod is more than 20 times faster than the Direct Method. Table 6.6 shows theperformance of the Extended Two-Phase Method for writing sections with non-unitstrides. The sections chosen are the same as in Table 6.5. Even for writing sections,the Extended Two-Phase Method provides considerable performance improvement.6.5.6 ScalabilityWe have also studied the scalability of the Extended Two-Phase Method for largenumber of processors, large array sections and large out-of-core arrays. Since dynamic
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 133Table 6.6: Time (sec.) for Writing Sections with Non-unit StridesArray size 4K  4K real nos. (single precision), 16 processorsNo. Array Section Direct Extended Two-Phase(p = processor number) Write Static DynamicI (p+1:4096:nprocs, p+1:4096:nprocs) 53.28 22.77 22.77II (1+250p:250+250p:2, 1+250p:250+250p:2) 25.22 6.438 3.775III (1+200p:500+200p:3, 1+200p:500+200p:3) 44.64 8.696 7.516IV (1+64p:64+64p:2, 500:2500:3) 71.35 8.858 7.279V (500:2500:3, 1+64p:64+64p:2) 79.24 7.724 4.405partitioning always performs better or at least as good as static partitioning, we onlyconsider dynamic partitioning for the scalability experiments. Table 6.7 shows thetimings obtained by varying the number of processors requesting array sections from4 to 128, for both reading and writing. We have selected a few sections in eachcategory, i.e. common, overlapping, distinct, and also sections with non-unit strides.Note that each processor is requesting a section, so as the number of processors isincreased, the amount of I/O required increases.The main observation is that the Extended Two-Phase Method scales well withthe number of processors. In many cases, the time taken increases only slightly as thenumber of processors is increased, which indicates that we are able to get higher I/Othroughput by increasing the number of processors. For example, for the sections incase I, the time taken increases from 1.282 sec. to only 2.130 sec. when the number ofprocessors is increased from 4 to 128. In some cases, such as case II, the time takeneven decreases. The Direct Method performs quite poorly as the number of processorsis increased, especially for sections in cases II, IV and VIII. The Extended Two-PhaseMethod also scales well for writing sections. For small number of processors, theDirect Method takes less time for writing than the Extended Two-Phase Method.This is because of the extra read required for every write in the Extended Two-PhaseMethod. However, for large number of processors ( 16), the Extended Two-PhaseMethod performs better. For sections with non-unit strides, the Extended Two-Phase
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 134Table 6.7: Performance for dierent number of processorsI = (400:800:1, 400:800:1), Figure 6.5(III)II = (1:16:1, 1:4096:1), Figure 6.5(V)III = (400:800:1, 400+25p:800+25p:1), Figure 6.6(III)IV = (1+8p:16+8p:1, 1:4096:1), Figure 6.6(VII)V = (1+25p:16+25p:1, 1:4096:1), Figure 6.7(IV)VI = (1+32p:32+32p:1, 1+24p:1024+24p:1), Figure 6.7(VI)VII = (p+1:4096:nprocs, p+1:4096:nprocs)VIII = (500:2500:3, 1+32p:32+32p:2)Time in sec., 4K  4K array, DR = Direct Read,ETP = Extended Two-Phase (dynamic partitioning), DW = Direct WriteREADING COMMON SECTIONSSec- Procs=4 Procs=8 Procs=16 Procs=32 Procs=64 Procs=128tion DR ETP DR ETP DR ETP DR ETP DR ETP DR ETPI 2.620 1.282 3.184 1.040 4.421 1.056 8.734 1.169 16.28 1.436 32.64 2.130II 12.16 4.315 13.95 3.099 19.65 3.241 32.96 2.647 60.11 3.432 116.7 3.219READING OVERLAPPING SECTIONSSec- Procs=4 Procs=8 Procs=16 Procs=32 Procs=64 Procs=128tion DR ETP DR ETP DR ETP DR ETP DR ETP DR ETPIII 3.079 1.748 5.208 1.699 6.850 1.991 13.61 2.798 24.98 3.801 47.95 4.602IV 13.75 4.450 13.77 3.391 19.63 2.992 32.70 3.696 60.58 4.791 115.9 7.401READING DISTINCT SECTIONSSec- Procs=4 Procs=8 Procs=16 Procs=32 Procs=64 Procs=128tion DR ETP DR ETP DR ETP DR ETP DR ETP DR ETPV 12.37 4.791 13.57 3.929 19.76 4.149 32.38 6.109 46.12 7.276 54.82 8.161VI 3.704 1.893 2.396 1.585 4.125 1.638 7.806 2.418 19.77 2.970 26.23 4.110WRITING DISTINCT SECTIONSSec- Procs=4 Procs=8 Procs=16 Procs=32 Procs=64 Procs=128tion DW ETP DW ETP DW ETP DW ETP DW ETP DW ETPV 3.129 7.900 6.971 6.861 12.45 8.554 27.52 12.74 37.70 18.52 52.41 24.74VI 0.982 1.937 1.803 2.218 3.954 3.058 6.436 5.028 7.139 6.234 21.20 9.403READING SECTIONS WITH NON-UNIT STRIDESSec- Procs=4 Procs=8 Procs=16 Procs=32 Procs=64 Procs=128tion DR ETP DR ETP DR ETP DR ETP DR ETP DR ETPVII 799.2 22.82 216.6 15.83 210.8 9.331 103.1 10.89 54.94 8.307 50.60 9.657VIII 56.44 1.342 77.78 1.440 83.87 1.870 163.1 3.123 331.5 5.062 867.4 7.711WRITING SECTIONS WITH NON-UNIT STRIDESSec- Procs=4 Procs=8 Procs=16 Procs=32 Procs=64 Procs=128tion DW ETP DW ETP DW ETP DW ETP DW ETP DW ETPVII 668.7 42.75 147.3 39.11 84.54 31.40 64.53 26.42 35.35 28.40 51.38 31.16VIII 9.041 1.612 18.83 1.603 35.17 2.972 75.95 4.812 163.6 7.915 341.8 21.75
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 135Method performs considerably better than the Direct Method. For the sections incase VIII, the stride depends on the number of processors, and so the total amountof I/O decreases as the number of processors is increased.Table 6.8 shows the performance for accessing large sections of a large out-of-corearray of size 16K  16K single precision real numbers (le size 1Gbyte). Figure 6.8shows approximately where these sections are located in the array. We consider com-mon, overlapping and distinct sections for reading, and distinct sections for writing.The trend in the results is the same as in Table 6.7 for a 4K  4K array, whichconrms the scalability of the Extended Two-Phase Method and its superiority overthe Direct Method. We observe that the Direct Method performs a lot worse foraccessing large sections than for small sections, whereas the Extended Two-PhaseMethod performs consistently well for sections of any size. The relative performanceof the two methods for reading and writing the sections in case VI of Table 6.8 isillustrated in Figures 6.9 and 6.10 respectively.6.6 Advantages of Extended Two-Phase MethodThe Extended Two-Phase Method with dynamic partitioning provides a very generalway of accessing arbitrary sections of out-of-core arrays in an ecient manner. Therst phase performs I/O optimizations at the cost of interprocessor communicationin the second phase. Since communication cost is orders of magnitude lower than I/Ocost, the overhead of communication is negligible. In all our experiments, we foundthe time spent on communication to be less than 5% of the total time. The ExtendedTwo-Phase Method combines many small requests of dierent processors into singlelarger requests, thus providing larger granularity of data transfer and lower latencytime. Another advantage is that multiple accesses by dierent processors to the samedata in the le are eliminated. For example, if all processors need to read exactly thesame section of the array, it will be read only once from the le and then broadcastto other processors over the interconnection network. Similarly, if the requests of twoor more processors are overlapping, the overlapping portion will only be read once
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 136Table 6.8: Performance for large sections of large arraysI = (5000:6000:1,5000:6000:1)II = (1+100p:300+100p:1,4000:8000:1)III = (1+100p:400+100p:1,2000+20p:2800+20p:1)IV = (4000:8000:1,1+4p:8+4p:1)V = (1+100p:100+100p:1, 1+100p:1024+100p:1)VI = (1+20p:16+20p:1,4000:12000:1)Time in sec., 16K  16K array, DR = Direct Read,ETP = Extended Two-Phase (dynamic partitioning), DW = Direct WriteREADING SECTIONSSec- Procs=4 Procs=8 Procs=16 Procs=32 Procs=64 Procs=128tion DR ETP DR ETP DR ETP DR ETP DR ETP DR ETPI 23.65 7.880 43.43 7.795 78.99 7.935 151.3 9.085 302.7 9.368 605.1 11.86II 53.30 26.51 103.3 28.10 132.3 28.50 157.6 32.49 162.3 40.03 182.4 52.08III 13.31 5.061 24.11 6.489 31.49 7.400 39.81 9.253 41.28 10.12 44.29 13.23IV 0.683 0.699 0.841 0.939 1.343 1.173 2.189 1.663 4.149 2.850 8.486 4.994V 10.97 5.380 19.31 8.475 26.52 10.58 35.06 12.69 52.81 14.10 124.2 22.06VI 57.29 21.94 74.05 23.05 127.3 32.88 240.8 51.26 500.2 112.2 799.7 98.68WRITING SECTIONSSec- Procs=4 Procs=8 Procs=16 Procs=32 Procs=64 Procs=128tion DW ETP DW ETP DW ETP DW ETP DW ETP DW ETPV 7.108 12.01 15.21 18.98 32.20 23.37 35.99 30.17 53.10 35.76 98.90 32.54VI 48.35 44.18 71.85 52.07 151.4 73.34 272.8 122.3 548.1 174.1 746.6 164.2
(I) (II) (III)
(VI)(V)(IV)
overlap
overlap overlap
Figure 6.8: The sections listed in Table 6.8 (not to scale)
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 137
4 8 16 32 64 128
Processors
0.0
100.0
200.0
300.0
400.0
500.0
600.0
700.0
800.0
Tim
e (
se
c.)
Direct Read
Ext. Two-Phase
Figure 6.9: Scalability Results, reading sections in case VI of Table 6.8
4 8 16 32 64 128
Processors
0.0
100.0
200.0
300.0
400.0
500.0
600.0
700.0
800.0
Tim
e (
se
c.)
Direct Write
Ext. Two-Phase
Figure 6.10: Scalability Results, writing sections in case VI of Table 6.8
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 138from the le.The Extended Two-Phase Method also provides a great deal of exibility in divid-ing I/O among processors. This can be done by dening the le domains appropri-ately. We have described one way of doing this dynamically as a function of the accessrequests. This performs better than a static assignment, but it may be possible to doeven better. For example, instead of dividing the bounding section in a column-blockmanner among the processors, it could be divided in a block-cyclic fashion, so that ifthe bounding section includes some unwanted columns, they get evenly distributed.Another approach is to divide I/O among processors so that the I/O requests of dif-ferent processors go to dierent disks or I/O nodes. For this, we need to know onwhich disks the requested sections are located. This information is not provided byany of the parallel le systems at present, but we expect it to be available in futureversions of the le systems. Furthermore, if the ratio of processors to disks on themachine is very high, it is possible to have only a few processors perform I/O, so asto reduce contention for disks.The Extended Two-Phase Method can be used for accessing arrays with any num-ber of dimensions and any storage order. For the dynamic partitioning scheme wehave proposed, the le domains for an n-dimensional array can be obtained by rstcalculating the n-dimensional bounding section of all requests. This section is dividedamong processors such that the le domain of each processor is stored contiguouslyin the le.Array sections other than those which can be represented by a lower-bound, upperbound and stride in each dimension, for example sections with non-uniform strides,can also be accessed using the Extended Two-Phase Method. This requires a moregeneral notation for representing such sections. The data structures such as the FileAccess Descriptor and the File Domain Access Table need to be modied to handlethis, but the basic idea remains the same.It is not necessary that all processors must call the Extended Two-Phase read/writeroutine. It is possible to dene a group of processors involving only those processorsthat need to access data from the le. This is similar to process groups in MPI [83].
CHAPTER 6. RUNTIME SUPPORT FOR OUT-OF-CORE PROGRAMS (II) 139Only the processors in the group need to call the Extended Two-Phase routine andparticipate in the two-phase process. The I/O workload can be divided among theprocessors in this group.The Extended Two-Phase Method is not specic to any particular machine, lesystem or architecture. It can be easily implemented on top of any of the existing lesystem interfaces or portable interfaces such as the proposed MPI-IO interface [28],resulting in portable implementations. It can also be easily modied and tuned forany particular system. This only requires dening the le domains appropriately, andpossibly using a dierent algorithm for interprocessor communication.
Chapter 7ConclusionsThis thesis addresses several issues in providing runtime support for in-core as well asout-of-core data-parallel programs. This runtime support can be used for performingmany of the commonly required operations in application programs written usinga distributed memory programming model. The use of runtime support makes iteasier to write application programs and provides greater eciency and portability.It can also be used together with a compiler to translate in-core as well as out-of-coreprograms written in a high-level data-parallel language like HPF to node programsfor distributed memory parallel computers. Runtime support helps to separate themachine dependent and machine independent aspects of the translation. The compilercan do all the machine independent transformations and the runtime libraries can beoptimized for each dierent machine. Thus, a portable and ecient compiler can beobtained by porting the runtime library to dierent machines.This thesis proposes ecient algorithms for runtime array redistribution which isfrequently required in parallel programs. We have considered block(m) to cyclic, cyclicto block(m) and the general cyclic(x) to cyclic(y) redistributions for one-dimensionaland multidimensional arrays. In all cases, the Asynchronous Method was found toperform better than the Synchronous Method because it overlaps computation andcommunication, and hence reduces processor idle time. For the general cyclic(x) tocyclic(y) redistribution, the LCM Method performs the best. The thesis also showshow circular redistributions can be performed eciently by saving address information140
CHAPTER 7. CONCLUSIONS 141in the forward redistribution and reusing it in the backward redistribution.This thesis presents several algorithms for all-to-all collective communication forthe fat tree and two-dimensional mesh network topologies. The performance of thesealgorithms was studied on the CM-5 and Touchstone Delta. An analytical modelfor estimating the time taken by these algorithms was developed and validated bycomparing with experimental results. The conclusion is that the choice of algorithmdepends on the message size and the number of processors. No one algorithm performsthe best for all message sizes and number of processors.Due to the large memory requirements of many applications, it is becoming in-creasingly important to provide support for out-of-core programs. This thesis alsodescribes the design and implementation of a runtime library for out-of-core compu-tations. This library can either be directly used in out-of-core applications, or usedtogether with a compiler to translate out-of-core programs written in a language suchas HPF. The runtime library supports three dierent models for data storage andaccess. Runtime routines have been developed for all these models. Several optimiza-tions, such as data sieving, data prefetching and data reuse, have been incorporatedin the runtime library. The optimizations have provided signicant performance im-provement. The thesis also proposes the Extended Two Phase Method for readingand writing sections of out-of-core arrays eciently. This method uses collective I/Oin which processors cooperate to perform I/O in an ecient manner by combiningseveral I/O requests into fewer larger requests, eliminating multiple le accesses forthe same data and reducing contention for the I/O subsystem. A dynamic schemeis used for dividing I/O among processors, depending on the access requests. TheExtended Two Phase Method showed impressive performance benets over a DirectMethod for many dierent access patterns.There are a number of areas in which the research presented in this thesis canbe extended. The area of runtime support for parallel I/O is particularly promising.A useful feature of the Extended Two-Phase Method is the exibility it provides indening le domains. We have studied one way of selecting le domains. A goodresearch problem is to determine the best way to dene le domains depending on
CHAPTER 7. CONCLUSIONS 142the pattern of access requests as well as the distribution of the le on disks. Scienticdata is very often stored in standard data formats such as HDF [86] or NetCDF [123].It would be useful to design and implement ecient runtime support for data storedin les using these formats. The PASSION Runtime Library has currently beenimplemented on the Intel Touchstone Delta, Paragon and iPSC/860 systems. It wouldbe an interesting project to port it to the IBM SP-2 using the PIOFS le system.A unique feature of PIOFS (and its predecessor Vesta) is that it supports logicalpartitioning of les [29]. It is not entirely clear how widely this logical partitioningcan be used. Implementing a general runtime system like PASSION using PIOFS canprovide some insight into the usefulness of logical partitioning.
Bibliography[1] G. Agrawal, A. Sussman, and J. Saltz. Compiler and Runtime Support forStructured and Block Structured Applications. In Proceedings of Supercomput-ing '93, pages 578{587, November 1993.[2] I. Ahmad, R. Bordawekar, Z. Bozkus, A. Choudhary, G. Fox, K. Parasuram,R. Ponnusamy, S. Ranka, and R. Thakur. Implementation and Scalability ofFortran 90D Intrinsic Functions on Distributed Memory Machines. TechnicalReport SCCS{256, NPAC, Syracuse University, 1992.[3] M. Barnett, R. Littleeld, D. Payne, and R. van de Geijn. Global Combine onMesh Architectures with Wormhole Routing. In Proceedings of the 7th Interna-tional Parallel Processing Symposium, April 1993.[4] R. Bennett, K. Bryant, A. Sussman, R. Das, and J. Saltz. Jovian: A Frameworkfor Optimizing Parallel I/O. In Proceedings of the Scalable Parallel LibrariesConference, pages 10{20, October 1994.[5] F. Bodin, P. Beckman, D. Gannon, S. Narayana, and S. Yang. DistributedpC++: Basic Ideas for an Object Parallel Language. In Proceedings of theFirst Annual Object-Oriented Numerics Conference, pages 1{24, April 1993.[6] S. Bokhari. Complete Exchange on the iPSC/860. Technical Report 91{4,ICASE, NASA Langley Research Center, 1991.143
BIBLIOGRAPHY 144[7] S. Bokhari and H. Berryman. Complete Exchange on a Circuit Switched Mesh.In Proceedings of the Scalable High Performance Computing Conference, pages300{306, 1992.[8] R. Bordawekar and A. Choudhary. Language and Compiler Support for Paral-lel I/O. In Proceedings of IFIP Working Conference on Programming Environ-ments for Massively Parallel Distributed Systems, April 1994.[9] R. Bordawekar, A. Choudhary, and J. del Rosario. An Experimental Perfor-mance Evaluation of Touchstone Delta Concurrent File System. In Proceedingsof the 7th ACM International Conference on Supercomputing, July 1993.[10] R. Bordawekar, A. Choudhary, K. Kennedy, C. Koelbel, and M. Paleczny. AModel and Compilation Strategy for Out-of-Core Data Parallel Programs. InProceedings of the Fifth ACM SIGPLAN Symposium on Principles and Prac-tices of Parallel Programming, July 1995. To appear.[11] R. Bordawekar, A. Choudhary, and R. Thakur. Data Access Reorganizationsin Compiling Out-of-core Data Parallel Programs on Distributed Memory Ma-chines. Technical Report SCCS{622, NPAC, Syracuse University, September1994.[12] R. Bordawekar, J. del Rosario, and A. Choudhary. Design and Evaluation ofPrimitives for Parallel I/O. In Proceedings of Supercomputing '93, pages 452{461, November 1993.[13] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka. Fortran 90D/HPFCompiler for Distributed Memory MIMD Computers: Design, Implementation,and Performance Results. In Proceedings of Supercomputing '93, pages 351{360,November 1993.[14] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, S. Ranka, R. Thakur, and J. Wang.Scalable Libraries for High Performance Fortran. In Proceedings of the Scalable
BIBLIOGRAPHY 145Parallel Libraries Conference, pages 67{75. Mississippi State University, Octo-ber 1993.[15] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, S. Ranka, and M. Wu. Compil-ing Fortran 90D/HPF for Distributed Memory MIMD Computers. Journal ofParallel and Distributed Computing, pages 15{26, April 1994.[16] Z. Bozkus, S. Ranka, and G. Fox. Modeling the CM-5 Multicomputer. InProceedings of Frontiers of Massively Parallel Computation 92, pages 100{107,October 1992.[17] P. Brezany, M. Gerndt, P. Mehrotra, and H. Zima. Concurrent File Operationsin a High Performance Fortran. In Proceedings of Supercomputing '92, pages230{238, November 1992.[18] P. Brezany, T. Mueck, and E. Schikuta. Language, Compiler and ParallelDatabase Support for I/O Intensive Applications. In Proceedings of High Per-formance Computing and Networking 1995, May 1995.[19] R. Butler and E. Lusk. User's Guide to the P4 Programming System. TechnicalReport ANL{92/17, Argonne National Laboratory, October 1992.[20] A. Carle, K. Kennedy, U. Kremer, and J. Mellor-Crummey. Automatic DataLayout for Distributed Memory Machines in the D Programming Environment.Technical Report CRPC-TR93298, Center for Research on Parallel Computa-tion, Rice University, February 1993.[21] B. Chapman, P. Mehrotra, and H. Zima. Programming in Vienna Fortran.Technical Report 92{9, ICASE, NASA Langley Research Center, March 1992.[22] S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Teng. Automatic ArrayAlignment in Data-Parallel Programs. In Proceedings of Principles of Program-ming Languages (POPL) '93, pages 16{28, January 1993.
BIBLIOGRAPHY 146[23] S. Chatterjee, J. Gilbert, F. Long, R. Schreiber, and S. Teng. Generating LocalAddresses and Communication Sets for Data Parallel Programs. In Proceedingsof Principles and Practices of Parallel Programming (PPoPP) '93, pages 149{158, May 1993.[24] P. Chen, E. Lee, G. Gibson, R. Katz, and D. Patterson. RAID: High-Performance, Reliable Secondary Storage. ACM Computing Surveys, 26(2):145{185, June 1994.[25] A. Choudhary, R. Bordawekar, M. Harry, R. Krishnaiyer, R. Ponnusamy,T. Singh, and R. Thakur. PASSION: Parallel and Scalable Software for Input-Output. Technical Report SCCS{636, NPAC, Syracuse University, September1994. Also available as CRPC Technical Report CRPC{TR94483{S.[26] A. Choudhary, R. Bordawekar, S. More, K. Sivaram, and R. Thakur. A User'sGuide for the PASSION Runtime Library Version 1.0. Technical Report SCCS{702, NPAC, Syracuse University, February 1995.[27] P. Corbett and D. Feitelson. Overview of the Vesta Parallel File System. InProceedings of the Scalable High Performance Computing Conference, pages63{70, May 1994.[28] P. Corbett, D. Feitelson, Y. Hsu, J. Prost, M. Snir, S. Fineberg, B. Nitzberg,B. Traversat, and P. Wong. MPI-IO: A Parallel I/O Interface for MPI, Version0.3. Technical Report NAS-95-002, NASA Ames Research Center, January1995.[29] P. Corbett, D. Feitelson, J. Prost, and S. Baylor. Parallel Access to Files inthe Vesta File System. In Proceedings of Supercomputing '93, pages 472{481,November 1993.[30] T. Cormen. Virtual Memory for Data-Parallel Computing. PhD thesis, Dept.of Electrical Engineering and Computer Science, Massachusetts Institute ofTechnology, February 1993.
BIBLIOGRAPHY 147[31] T. Cormen and A. Colvin. ViC*: A Preprocessor for Virtual-Memory C*. Tech-nical Report PCS-TR94-243, Dept. of Computer Science, Dartmouth College,November 1994.[32] D. Culler, A. Dusseau, S. Goldstein, A. Krishnamurthy, S. Lumetta, T. vonEicken, and K. Yelick. Parallel Programming in Split-C. In Proceedings ofSupercomputing '93, pages 262{273, November 1993.[33] W. Dally and C. Seitz. Deadlock-Free Message Routing in Multiprocessor Inter-connection Networks. IEEE Transactions on Computers, pages 547{553, May1987.[34] R. Das, R. Ponnusamy, J. Saltz, and D. Mavriplis. Distributed Memory Com-piler Methods for Irregular Problems { Data Copy Reuse and Runtime Parti-tioning. In J. Saltz and P. Mehrotra, editors, Languages, Compilers and Run-time Environments for Distributed Memory Machines, pages 185{220. ElsevierScience Publishers, 1992.[35] R. Das, J. Saltz, and H. Berryman. A Manual for PARTI Runtime Primitives.Interim Report 17, ICASE, NASA Langley Research Center, May 1991.[36] E. DeBenedictis and J. del Rosario. nCUBE Parallel I/O Software. In Proceed-ings of 11th International Phoenix Conference on Computers and Communica-tions, pages 117{124, April 1992.[37] J. del Rosario, R. Bordawekar, and A. Choudhary. Improved Parallel I/O via aTwo-Phase Runtime Access Strategy. In Proceedings of the Workshop on I/Oin Parallel Computer Systems at IPPS '93, pages 56{70, April 1993.[38] J. del Rosario and A. Choudhary. High Performance I/O for Parallel Computers:Problems and Prospects. IEEE Computer, pages 59{68, March 1994.
BIBLIOGRAPHY 148[39] P. Dibble, M. Scott, and C. Ellis. Bridge: A High-Performance File System forParallel Processors. In Proceedings of the Eighth International Conference onDistributed Computer Systems, pages 154{161, June 1988.[40] I. Foster, M. Henderson, and R. Stevens. Workshop Introduction. In Proceed-ings of the Workshop on Data Systems for Parallel Climate Models at ArgonneNational Laboratory, July 1991.[41] G. Fox. Domain Decomposition in Distributed and Shared Memory Environ-ments { I: A Uniform Decomposition and Performance Analysis for the nCUBEand JPL Mark IIIfp Hypercubes, volume 297 of Lecture Notes in Computer Sci-ence, pages 1042{1073. Springer-Verlag, 1987. Supercomputing, ed. E. Houstis,T. Papatheodorou, and C. Polychronopoulos.[42] G. Fox and W. Furmanski. Optimal Communication Algorithms for RegularDecompositions on the Hypercube. In The Third Conference on HypercubeConcurrent Computers and Applications, Volume 1, pages 648{713, January1988.[43] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, and C. Tseng.Fortran D Language Specications. Technical Report COMP TR90-141, CRPC,Rice University, 1990.[44] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. SolvingProblems on Concurrent Processors, Vol. I. Prentice-Hall Inc., 1988.[45] G. Fox, S. Ranka, M. Scott, A. Malony, J. Browne, M. Chen, A. Choud-hary, T. Cheatham, J. Cuny, R. Eigenmann, A. Fahmy, I. Foster, D. Gan-non, T. Haupt, M. Karr, C. Kesselman, C. Koelbel, W. Li, M. Lam,T. LeBlanc, J. Openshaw, D. Padua, C. Polychronopoulos, J. Saltz, A. Sussman,G. Weigand, and K. Yelick. Common Runtime Support for High PerformanceParallel Languages | Parallel Compiler Runtime Consortium. In Proceedingsof Supercomputing '93, pages 752{757, November 1993.
BIBLIOGRAPHY 149[46] G. Fox, R. Williams, and P. Messina. Parallel Computing Works. MorganKaufmann Publishers Inc., 1994.[47] G. Geist, M. Heath, B. Peyton, and P. Worley. A User's Guide to PICL, APortable Instrumented Communication Library. Technical Report ORNL/TM{11616, Oak Ridge National Laboratory, October 1991.[48] M. Grossman. Modeling Reality. IEEE Spectrum, pages 56{60, September1992.[49] M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomput-ers. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, September 1992.[50] M. Gupta and P. Banerjee. Demonstration of Automatic Data PartitioningTechniques for Parallelizing Compilers on Multicomputers. IEEE Transactionson Parallel and Distributed Systems, pages 179{193, mar 1992.[51] S. Gupta, S. Hawkinson, and B. Baxter. A Binary Interleaved Algorithm forComplete Exchange on a Mesh Architecture. Technical report, Intel Supercom-puter Systems Division, April 1993.[52] S. Gupta, S. Kaushik, C. Huang, and P. Sadayappan. On Compiling ArrayExpressions for Ecient Execution on Distributed MemoryMachines. TechnicalReport OSU-CISRC-4/94-TR19, Computer and Information Science ResearchCenter, The Ohio State University, April 1994.[53] S. Gupta, S. Kaushik, S. Mufti, S. Sharma, C. Huang, and P. Sadayappan. Onthe Generation of Ecient Data Communication for Distributed Memory Ma-chines. In Proceedings of International Computing Symposium, Taiwan, pages504{513, 1992.[54] S. Hambrusch, F. Hameed, and A. Khokhar. Communication Operations onCoarse-Grained Mesh Architectures. Parallel Computing, 1995. to appear.
BIBLIOGRAPHY 150[55] M. Harry, J. del Rosario, and A. Choudhary. VIP-FS: A Virtual, ParallelFile System for High Performance Parallel and Distributed Computing. InProceedings of the Ninth International Parallel Processing Symposium, April1995.[56] High Performance Computing and Communications: Grand Challenges 1993Report. A Report by the Committee on Physical, Mathematical and Engi-neering Sciences, Federal Coordinating Council for Science, Engineering andTechnology, 1993.[57] High Performance Fortran Forum. High Performance Fortran Language Speci-cation. Version 1.0, May 1993.[58] S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, and C. Tseng. AnOverview of the Fortran D Programming System. Technical Report COMPTR91-154, CRPC, Rice University, March 1991.[59] S. Hiranandani, K. Kennedy, and C. Tseng. Compiler Support for Machine-Independent Parallel Programming in Fortran D. In Languages, Compilers andRun-Time Environments for Distributed Memory Machines. North-Holland,1992.[60] J. Huber, C. Elford, D. Reed, A. Chien, and D. Blumenthal. PPFS: A HighPerformance Portable Parallel File System. Technical Report UIUCDCS-R-95-1903, University of Illinois at Urbana Champaign, January 1995.[61] S. L. Johnsson and C. T. Ho. OptimumBroadcasting and Personalized Commu-nication in Hypercubes . IEEE Transactions on Computers, pages 1249{1268,Sep 1989.[62] E. Kalns and L. Ni. Processor Mapping Techniques Toward Ecient Data Re-distribution. In Proceedings of the 8th International Parallel Processing Sympo-sium, pages 469{476, April 1994.
BIBLIOGRAPHY 151[63] S. Kaushik, C. Huang, R. Johnson, and P. Sadayappan. An Approach toCommunication-Ecient Data Redistribution. In Proceedings of the 8th ACMInternational Conference on Supercomputing, July 1994.[64] S. Kaushik, C. Huang, J. Ramanujam, and P. Sadayappan. Multi-phase Ar-ray Redistribution: Modeling and Evaluation. Technical Report OSU-CISRC-9/94-52, Computer and Information Science Research Center, The Ohio StateUniversity, September 1994.[65] C. Koelbel. Compiling Programs for Nonshared Memory Machines. PhD thesis,Purdue University, August 1990.[66] C. Koelbel. Compile-Time Generation of Regular Communication Patterns. InProceedings of Supercomputing '91, pages 101{110, November 1991.[67] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, and M. Zosel. High Perfor-mance Fortran Handbook. The MIT Press, 1994.[68] C. Koelbel and P. Mehrotra. Compiling Global Name-Space Parallel Loops forDistributed Execution. IEEE Transactions on Parallel and Distributed Systems,pages 440{451, oct 1991.[69] D. Kotz. Disk-directed I/O for MIMD Multiprocessors. In Proceedings of theFirst Symposium on Operating Systems Design and Implementation (OSDI),November 1994. Revised as Technical Report PCS{TR94{226, Dept. of Com-puter Science, Dartmouth College.[70] D. Kotz. Disk-directed I/O for an Out-of-Core Computation. Technical ReportPCS-TR95-251, Dept. of Computer Science, Dartmouth College, January 1995.[71] D. Kotz. Expanding the Potential for Disk-directed I/O. Technical ReportPCS-TR95-254, Dept. of Computer Science, Dartmouth College, March 1995.
BIBLIOGRAPHY 152[72] D. Kotz and T. Cai. Exploring the Use of I/O Nodes for Computation in aMIMD Multiprocessor. Technical Report PCS{TR94{232, Dept. of ComputerScience, Dartmouth College, Revised February 1995.[73] D. Kotz and C. Ellis. Prefetching in File Systems for MIMD Multiprocessors.IEEE Transactions on Parallel and Distributed Systems, pages 218{230, April1990.[74] D. Kotz and N. Nieuwejaar. Dynamic File Access Characteristics of a Pro-duction Parallel Scientic Workload. In Proceedings of Supercomputing '94,November 1994.[75] H. Kung. Memory Requirements for Balanced Computer Architectures. InProceedings of the International Symposium on Computer Architecture, pages49{54, 1986.[76] C. Leiserson. Fat-Trees: Universal Networks for Hardware-Ecient Supercom-puting. In Proceedings of International Conference on Parallel Processing, pages952{958, 1984.[77] C. Leiserson, Z. Abuhamdeh, D. Douglas, C. Feynman, M. Ganmukhi, J. Hill,W. Hillis, B. Kuszmaul, M. St. Pierre, D. Wells, M. Wong, S. Yang, and R. Zak.The Network Architecture of the Connection Machine CM-5. In Proceedings ofthe Symposium on Parallel Algorithms and Architectures (SPAA), 1992.[78] J. Li and M. Chen. Compiling Communication-Ecient Programs for MassivelyParallel Machines. IEEE Transactions on Parallel and Distributed Systems,pages 361{376, July 1991.[79] J. Li and M. Chen. Index Domain Alignment: Minimizing Cost of Cross-References Between Distributed Arrays. In Proceedings of Frontiers of MassivelyParallel Computation 92, October 1992.
BIBLIOGRAPHY 153[80] R. Littleeld. Tuning Communication. Proceedings of the Delta Advanced UserTraining Class Notes, CCSF Technical Report CCSF-25-92, pages 99{119, jul1992.[81] M. Livny, S. Khoshaan, and H. Boral. Multi-Disk Management Algorithms.In Proceedings of the 1987 ACM SIGMETRICS Conference on Measurementand Modeling of Computer Systems, pages 69{77, May 1987.[82] P. Mehrotra and J. Van Rosendale. The BLAZE Language: A Parallel Languagefor Scientic Programming. Parallel Computing, 5:339{361, 1987.[83] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard.Version 1.0, May 1994.[84] E. Miller and R. Katz. RAMA: Easy Access to a High-Bandwidth MassivelyParallel File System. In Proceedings of the 1995 Winter USENIX Conference,pages 59{70, January 1995.[85] S. Moyer and V. Sunderam. PIOUS: A Scalable Parallel I/O System forDistributed Computing Environments. In Proceedings of the Scalable High-Performance Computing Conference, pages 71{78, 1994.[86] National Center for Supercomputing Applications, University of Illinois. NCSAHDF Reference Manual. Version 3.3, February 1994.[87] L. Ni and P. McKinley. A Survey of Wormhole Routing Techniques in DirectNetworks. IEEE Computer, pages 62{76, February 1993.[88] N. Nieuwejaar and D. Kotz. Low-level Interfaces for High-level Parallel I/O.Technical Report PCS{TR95{253, Dept. of Computer Science, Dartmouth Col-lege, February 1995.[89] M. Paleczny, K. Kennedy, and C. Koelbel. Compiler Support for Out-of-CoreArrays on Data Parallel Machines. In Proceedings of the Seventh Symposiumon the Frontiers of Massively Parallel Computation, February 1995.
BIBLIOGRAPHY 154[90] Parasoft Corporation. EXPRESS Fortran User's Guide, 1990.[91] D. Patterson, G. Gibson, and R. Katz. A Case for Redundant Arrays of Inex-pensive Disks. In Proceedings of ACM SIGMOD International Conference onManagement of Data, pages 109{116, June 1988.[92] P. Pierce. A Concurrent File System for a Highly Parallel Mass Storage Subsys-tem. In Proceedings of 4th Conference on Hypercubes, Concurrent Computersand Applications, pages 155{160, March 1989.[93] R. Ponnusamy. Runtime Support and Compilation Methods for Irregular Com-putations on Distributed Memory Parallel Machines. PhD thesis, School ofComputer and Information Science, Syracuse University, April 1994.[94] R. Ponnusamy, A. Choudhary, and G. Fox. Communication Overhead on CM-5 : An Experimental Performance Evaluation. In Proceedings of Frontiers ofMassively Parallel Computation 92, pages 108{115, October 1992.[95] R. Ponnusamy, R. Thakur, A. Choudhary, and G. Fox. Scheduling Regular andIrregular Communication Patterns on the CM-5. In Proceedings of Supercom-puting '92, pages 394{402, November 1992.[96] R. Ponnusamy, R. Thakur, A. Choudhary, K. Velamakanni, and G. Fox. Experi-mental Performance Evaluation of the CM-5. Journal of Parallel and DistributedComputing, pages 192{202, November 1993.[97] J. Poole. Preliminary Survey of I/O Intensive Applications. Scalable I/O Ini-tiative Working Paper Number 1, 1994.[98] A. Purakayastha, C. Ellis, D. Kotz, N. Nieuwejaar, and M. Best. CharacterizingParallel File-Access Patterns on a Large-Scale Multiprocessor. In Proceedingsof the Ninth International Parallel Processing Symposium, April 1995.
BIBLIOGRAPHY 155[99] S. Ramaswamy and P. Banerjee. Automatic Generation of Ecient Array Redis-tribution Routines for Distributed Memory Multicomputers. Technical ReportCRHC-94-09, CRHC, University of Illinois, 1994.[100] S. Ranka, J. Wang, and G. Fox. Static and Runtime Algorithms for All-to-ManyPersonalized Communication on Permutation Networks. IEEE Transactions onParallel and Distributed Systems, pages 1266{1274, December 1994.[101] K. Salem and H. Garcia-Molina. Disk Striping. In Proceedings of the Interna-tional Conference on Data Engineering, pages 336{342, 1986.[102] J. Saltz, H. Berryman, and J. Wu. Multiprocessors and Runtime Compilation.Concurrency: Practice and Experience, pages 573{592, December 1991.[103] D. Scott. Ecient All-to-All Communication Patterns in Hypercube and MeshTopologies. In Proceedings of the 6th Distributed Memory Computing Confer-ence, pages 398{403, 1991.[104] K. Seamons and M. Winslett. An Ecient Abstract Interface for Multidi-mensional Array I/O. In Proceedings of Supercomputing '94, pages 650{659,November 1994.[105] K. Seamons and M. Winslett. A Data Management Approach for HandlingLarge Compressed Arrays in High Performance Computing. In Proceedings ofthe Seventh Symposium on the Frontiers of Massively Parallel Computation,February 1995.[106] R. Shankar, K. Alsabti, and S. Ranka. The Transportation Primitive. Tech-nical report, Dept. of Computer and Information Science, Syracuse University,August 1994.[107] R. Shankar and S. Ranka. Random Data Accesses on a Coarse-grained ParallelMachine { I. One-to-One Mappings. Technical report, Dept. of Computer andInformation Science, Syracuse University, October 1994.
BIBLIOGRAPHY 156[108] R. Shankar and S. Ranka. Random Data Accesses on a Coarse-grained ParallelMachine { II. One-to-Many and Many-to-One Mappings. Technical report,Dept. of Computer and Information Science, Syracuse University, October 1994.[109] M. Snir. Proposal for IO. Posted to HPFF I/O Forum by Marc Snir, July 1992.[110] J. Stichnoth. Ecient Compilation of Array Statements for Private MemoryMulticomputers. Technical Report CMU-CS-93-109, School of Computer Sci-ence, Carnegie Mellon University, February 1993.[111] J. Stichnoth, D. O'Hallaron, and T. Gross. Generating Communication forArray Statements: Design, Implementation, and Evaluation. Journal of Paralleland Distributed Computing, pages 150{159, April 1994.[112] E. Su, D. Palermo, and P. Banerjee. Automatic Parallelization of RegularComputations For Distributed Memory Multicomputers in the PARADIGMCompiler. In Proceedings of International Conference on Parallel Processing,pages II{30|II{38, August 1993.[113] N. Sundar, D. Jayasimha, D. Panda, and P. Sadayappan. Complete Exchangein 2D Meshes. In Proceedings of the Scalable High Performance ComputingConference, pages 406{413, May 1994.[114] V. Sunderam. PVM: A Framework for Parallel Distributed Computing. Con-currency: Practice and Experience, 2:315{339, dec 1991.[115] R. Thakur, R. Bordawekar, and A. Choudhary. Compiler and Runtime Supportfor Out-of-Core HPF Programs. In Proceedings of the 8th ACM InternationalConference on Supercomputing, pages 382{391, July 1994.[116] R. Thakur, R. Bordawekar, A. Choudhary, R. Ponnusamy, and T. Singh. PAS-SION Runtime Library for Parallel I/O. In Proceedings of the Scalable ParallelLibraries Conference, pages 119{128, October 1994.
BIBLIOGRAPHY 157[117] R. Thakur and A. Choudhary. All-to-All Communication on Meshes withWormhole Routing. In Proceedings of the 8th International Parallel Process-ing Symposium, pages 561{565, April 1994.[118] R. Thakur and A. Choudhary. Collective I/O Using an Extended Two-PhaseMethod with Dynamic Partitioning. Technical Report SCCS{704, NPAC, Syra-cuse University, March 1995.[119] R. Thakur, A. Choudhary, and G. Fox. Runtime Array Redistribution in HPFPrograms. In Proceedings of the Scalable High Performance Computing Con-ference, pages 309{316, May 1994.[120] R. Thakur, R. Ponnusamy, A. Choudhary, and G. Fox. Complete Exchangeon the CM-5 and Touchstone Delta. The Journal of Supercomputing, pages305{328, March 1995.[121] Thinking Machines Corporation. CM Fortran Language Reference Manual. Ver-sion 2.1, January 1994.[122] C. Tseng. An Optimizing Fortran D Compiler for MIMD Distributed MemoryMachines. PhD thesis, Dept. of Computer Science, Rice University, January1993.[123] Unidata Program Center, University Corporation for Atmospheric Research.netCDF User's Guide. Version 2.0, October 1991.[124] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. Active Messages: AMechanism for Integrated Communication and Computation. In Proceedings ofInternational Symposium on Computer Architecture, 1992.[125] R. von Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz. CompilerAnalysis for Irregular Problems in Fortran D. In Proceedings of the 5th Workshopon Languages and Compilers for Parallel Computing, August 1992.
BIBLIOGRAPHY 158[126] A. Wakatani and M. Wolfe. A New Approach to Array Redistribution: StripMining Redistribution. In Proceedings of Parallel Architectures and LanguagesEurope (PARLE 94), July 1994.[127] J. Wang. Load Balancing and Communication Support for Regular and IrregularProblems. PhD thesis, School of Computer and Information Science, SyracuseUniversity, December 1993.[128] M. Wu and G. Fox. Compiling Fortran 90 Programs for Distributed Mem-ory MIMD Parallel Computers. Technical Report SCCS{88, NPAC, SyracuseUniversity, April 1991.[129] H. Zima, H. Bast, and M. Gerndt. SUPERB: A Tool for Semi-AutomaticMIMD/SIMD Parallelization. Parallel Computing, pages 1{18, 1988.[130] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald. ViennaFortran - a Language Specication, Version 1.1. Interim Report 21, ICASE,NASA Langley Research Center, March 1992.[131] G. Zorpette. Teraops Galore. IEEE Spectrum, pages 26{76, September 1992.
VitaName: Rajeev ThakurPlace of Birth: Bombay, IndiaDate of Birth: July 15, 1969Previous Degrees: B.E., Computer Engineering, University of Bombay, 1990M.S., Computer Engineering, Syracuse University, 1992Experience: Summer Intern,Intel Supercomputer Systems Division,Beaverton, OR, May 1993 { August 1993Graduate FellowSyracuse University, August 1994 { May 1995Graduate Research Assistant,Northeast Parallel Architectures Center,Syracuse University, August 1991 { August 1994Graduate Research Assistant,Department of Electrical and Computer Engineering,Syracuse University, January 1991 { August 1991159
