Performance effects of node mapping on the IBM BlueGene/L machine by Smith, Brian Edward
Retrospective Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 
1-1-2005 
Recommended Citation 
Smith, Brian Edward, "Performance effects of node mapping on the IBM BlueGene/L machine" (2005). 
Retrospective Theses and Dissertations. 20918. 
https://lib.dr.iastate.edu/rtd/20918 
Performance effects of node mapping on the IBM BlueGene/L machine 
by 
Brian Edward Smith 
A thesis submitted to the graduate faculty 
in partial fulfillment of the requirements for the degree of 
MASTER OF SCIENCE 
Major: Computer Engineering 
Program of Study Committee: 
Diane Rover, Major Professor 
Brett Bode 
Ricky Kendall 
Iowa State University 
Ames, Iowa 
2005 
Copyright © Brian Edward Smith, 2005. All rights reserved. 
Graduate College 
Iowa State University 
This is to certify that the Master's thesis of 
Brian Edward Smith 
has met the thesis requirements of Iowa State University 
Signatures have been redacted for privacy 
TABLE OF CONTENTS
List of Tables
List of Figures
Acknowledgments
Abstract
1 Introduction and Background
1.1 Introduction
1.2 Background
1.2.1 Interconnect Networks in Distributed-Memory Machines
1.2.2 Software on Distributed-Memory Machines
1.3 The Goal of the Research
1.4 Organization
2 The IBM BlueGene/L Supercomputer
2.1 The IBM BlueGene Supercomputer Program
2.2 BlueGene/L Overview
2.2.1 The Hardware
2.2.2 The Software
2.2.3 The Control System
2.2.4 Performance
3 Benchmarks
3.1 The NAS Parallel Benchmarks
3.1.1 EP
3.1.2 MG
3.1.3 CG
3.1.4 FT
3.1.5 IS
3.1.6 BT, SP, and LU
3.2 Ames Laboratory Classical Molecular Dynamics
3.3 GAMESS
3.4 Visualization of the Communication Patterns
3.5 Communications Profile Tables
4 Mappings and Map Generation Tools
4.1 Stock Mappings
4.2 Other Mappings
4.2.1 Gray-code based maps
4.2.2 Other mesh-like mappings
4.2.3 Lower/upper MPI Rank split maps
4.2.4 Random Map
4.3 Profiling Tools
4.3.1 Communications Visualizer
4.3.2 Communications Profiler
4.4 Map Figures
5 Procedure and Results
5.1 General Procedures
5.2 NAS Results
5.2.1 General Comments
5.2.2 NAS BT and SP Results
5.2.3 NAS FT Results
5.2.4 NAS CG Results
5.2.5 NAS IS Results
5.2.6 NAS MG Results
5.2.7 LU
5.3 GAMESS Results
5.4 ALCMD Results
5.4.1 Non-Gray-code results
5.4.2 Gray-code results
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
7 Graphs
Bibliography
LIST OF TABLES
3.1 Communications in NAS EP
3.2 Communications in NAS MG
3.3 Communications in NAS CG
3.4 Communications in NAS FT
3.5 Communications in NAS IS
3.6 Communications in NAS LU
3.7 Communications in NAS BT
3.8 Communications in NAS SP
3.9 Communications in ALCMD 100k atoms, 1024 nodes
3.10 Communications in ALCMD 1m atoms, 1024 nodes
3.11 Communications in GAMESS - Quinone (512 processors)
3.12 Communications in GAMESS - Penicillin (512 processors)
4.1 Coprocessor/Heater Mode Mappings
4.2 Virtual Node Mode Mappings
4.3 Virtual Node Mode Mappings
4.4 2D Gray code mesh mappings with mapmaker
LIST OF FIGURES
1.1 Sequential Decomposition of Object
1.2 Mesh-assumption Parallel Decomposition of Object
2.1 BG/L Hardware Expansion
2.2 BG/L ASIC Block Diagram
2.3 MPICH Modular Structure
2.4 BlueGene/L Control System Overview
3.1 Communication Patterns in ALCMD
3.2 Total Electron Density (RMP2 Orbitals) for Penicillin
3.3 Total Electron Density (RMP2 Orbitals) for Quinone
3.4 Visual Communication Patterns in NAS MG
3.5 Visual Communication Patterns in NAS MG for One Node
3.6 Visual Communication Patterns in NAS CG
3.7 Visual Communication Patterns in NAS CG for Two Nodes
3.8 Visual Communication Patterns in NAS IS
3.9 Visual Communication Patterns in NAS IS for Three Nodes
3.10 Visual Communication Patterns in NAS LU
3.11 Visual Communication Patterns in NAS LU for Two Nodes
3.12 Visual Communication Patterns in NAS BT
3.13 Visual Communication Patterns in NAS BT for One Node
3.14 Visual Communication Patterns in NAS SP
3.15 Visual Communication Patterns in NAS SP for One Node
3.16 Visual Communication Patterns in ALCMD for 100,000 atoms
3.17 Visual Communication Patterns in ALCMD for 1 million atoms
3.18 Visual Communication Patterns in ALCMD for 1 million atoms, One Node
3.19 Visual Communication Patterns in GAMESS
3.20 Visual Communication Patterns in GAMESS for One Data Server
3.21 Visual Communication Patterns in GAMESS for One Compute Process
4.1 BlueGene/L 128-way partition
4.2 Sixteen by Eight Mesh
4.3 X direction "unfolded" "mesh"
4.4 Y direction "unfolded" "mesh"
4.5 Z direction "unfolded" "mesh"
4.6 X direction "unfolded" "mesh" in VNM with cores increasing first
4.7 X direction "unfolded" "mesh" in VNM with cores increasing last
4.8 X direction "unfolded" "mesh" in VNM with cores increasing as planes
4.9 X directional sheet mapping
4.10 Y directional sheet mapping
4.11 Z directional sheet mapping
4.12 X direction "Plus-One" mapping
4.13 Y direction "Plus-One" mapping
4.14 Z direction "Plus-One" mapping
4.15 X direction "blocks" mapping
4.16 Y direction "blocks" mapping
4.17 Z direction "blocks" mapping
7.1 BT 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.2 BT 128 Total Processors - Gray-code Meshes
7.3 BT 128 Total Processors - GAMESS-style Maps
7.4 BT 128 Total Processors - Unfold-style Maps
7.5 BT 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
7.6 BT 256 Total Processors - Gray-code Meshes
7.7 BT 256 Total Processors - Unfold-style Maps
7.8 BT 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.9 BT 512 Total Processors - Gray-code Meshes
7.10 BT 512 Total Processors - GAMESS-style Maps
7.11 BT 512 Total Processors - Unfold-style Maps
7.12 CG 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.13 CG 128 Total Processors - Gray-code Meshes
7.14 CG 128 Total Processors - GAMESS-style Maps
7.15 CG 128 Total Processors - Unfold-style Maps
7.16 CG 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
7.17 CG 256 Total Processors - Gray-code Meshes
7.18 CG 256 Total Processors - Unfold-style Maps
7.19 CG 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.20 CG 512 Total Processors - Gray-code Meshes
7.21 CG 512 Total Processors - GAMESS-style Maps
7.22 CG 512 Total Processors - Unfold-style Maps
7.23 FT 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.24 FT 128 Total Processors - Gray-code Meshes
7.25 FT 128 Total Processors - GAMESS-style Maps
7.26 FT 128 Total Processors - Unfold-style Maps
7.27 FT 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
7.28 FT 256 Total Processors - Gray-code Meshes
7.29 FT 256 Total Processors - Unfold-style Maps
7.30 FT 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.31 FT 512 Total Processors - Gray-code Meshes
7.32 FT 512 Total Processors - GAMESS-style Maps
7.33 FT 512 Total Processors - Unfold-style Maps
7.34 IS 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.35 IS 128 Total Processors - Gray-code Meshes
7.36 IS 128 Total Processors - GAMESS-style Maps
7.37 IS 128 Total Processors - Unfold-style Maps
7.38 LU 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.39 LU 128 Total Processors - Gray-code Meshes
7.40 LU 128 Total Processors - GAMESS-style Maps
7.41 LU 128 Total Processors - Unfold-style Maps
7.42 LU 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
7.43 LU 256 Total Processors - Gray-code Meshes
7.44 LU 256 Total Processors - Unfold-style Maps
7.45 LU 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.46 LU 512 Total Processors - Gray-code Meshes
7.47 LU 512 Total Processors - GAMESS-style Maps
7.48 LU 512 Total Processors - Unfold-style Maps
7.49 MG 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.50 MG 128 Total Processors - Gray-code Meshes
7.51 MG 128 Total Processors - GAMESS-style Maps
7.52 MG 128 Total Processors - Unfold-style Maps
7.53 MG 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
7.54 MG 256 Total Processors - Gray-code Meshes
7.55 MG 256 Total Processors - Unfold-style Maps
7.56 SP 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.57 SP 128 Total Processors - Gray-code Meshes
7.58 SP 128 Total Processors - GAMESS-style Maps
7.59 SP 128 Total Processors - Unfold-style Maps
7.60 SP 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
7.61 SP 256 Total Processors - Gray-code Meshes
7.62 SP 256 Total Processors - Unfold-style Maps
7.63 SP 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
7.64 SP 512 Total Processors - Gray-code Meshes
7.65 SP 512 Total Processors - GAMESS-style Maps
7.66 SP 512 Total Processors - Unfold-style Maps
7.67 GAMESS Penicillin Workload (512 Nodes)
7.68 GAMESS Quinone Workload (512 Nodes)
7.69 GAMESS Penicillin Workload (512 Nodes in VNM, Stock Mappings)
7.70 GAMESS Quinone Workload (512 Nodes in VNM, Stock Mappings)
7.71 GAMESS Penicillin Workload (512 Nodes in VNM, Gray Mappings)
7.72 GAMESS Quinone Workload (512 Nodes in VNM, Gray Mappings)
7.73 GAMESS Penicillin Workload (1024 Nodes, Stock and GAMESS Mappings)
7.74 GAMESS Quinone Workload (1024 Nodes, Stock and GAMESS Mappings)
7.75 GAMESS Penicillin Workload (1024 Nodes, Gray mappings)
7.76 GAMESS Quinone Workload (1024 Nodes, Gray mappings)
7.77 GAMESS Penicillin Workload (1024 Nodes, Unfold Maps)
7.78 GAMESS Quinone Workload (1024 Nodes, Unfold Maps)
7.79 CMD - 10k atoms, Non-Gray mappings
7.80 CMD - 100k atoms, Non-Gray mappings
7.81 CMD - 1m atoms, Non-Gray mappings
7.82 CMD - 10m atoms, Non-Gray mappings
7.83 CMD - 100m atoms, Non-Gray mappings
7.84 CMD - 10k atoms, 512 Processors, Gray Mappings
7.85 CMD - 100k atoms, 512 Processors, Gray Mappings
7.86 CMD - 1m atoms, 512 Processors, Gray Mappings
7.87 CMD - 10m atoms, 512 Processors, Gray Mappings
7.88 CMD - 10k atoms, 1k Processors, Gray Mappings
7.89 CMD - 100k atoms, 1k Processors, Gray Mappings
7.90 CMD - 1m atoms, 1k Processors, Gray Mappings
7.91 CMD - 10m atoms, 1k Processors, Gray Mappings
7.92 CMD - 100k atoms, 4k Processors, Gray Mappings
7.93 CMD - 1m atoms, 4k Processors, Gray Mappings
7.94 CMD - 10m atoms, 4k Processors, Gray Mappings
7.95 CMD - 100m atoms, 4k Processors, Gray Mappings
Acknowledgments 
I am grateful to many people who helped me in my research over my many years as 
a graduate student. 
First, I must thank my committee: my major professor, Dr. Diane Rover, provided
initial guidance and put up with my somewhat non-standard situation (being funded 
by Ames Lab and therefore working more closely with Dr. Bode). Dr. Brett Bode 
and Dr. Ricky Kendall were always there for answering dumb questions and providing 
guidance. Brett was also responsible for my funding while I was working at Ames Lab 
as a graduate student. 
Dr. Dave Turner and Dr. Bode were very helpful porting ALCMD and GAMESS 
to BlueGene and helping me understand some of the underlying science behind the two 
codes. 
I must also thank many people at IBM for their help and encouragement and for 
letting me use hardware when necessary. Charles Archer was especially helpful but I 
must also thank my manager, Sam Ellis, for allowing me to stay at IBM for an extra 
semester while working on the thesis. Dr. Jose Moreira was a very useful resource 
as well. Dr. Jim Sexton was a good person to talk to during the many overnight/all 
night/weekend runs and helped keep me sane while dealing with the problems of working 
with such early prototype hardware and software. 
Part of the work was performed at IBM Rochester on early development hardware. 
Without access to that hardware, much of this work would have been impossible. 
I must also thank my family and friends for putting up with me for so many years. 
Part of this work was performed at Ames Laboratory under contract No. W-7405-
ENG-82 with the U.S. Department of Energy. The United States government has as-
signed the D.O.E. report number IS-T2703 to this document. This document is also 
Iowa State University technical report TR-2005-01-1. 
Abstract 
The IBM BlueGene/L (BG/L) supercomputer is a new machine consisting of up to 
65536 relatively modest compute nodes connected with three application-level networks -
a high-performance point-to-point 3D torus network, a global combining/broadcast tree 
network for collective operations, and a global interrupt/barrier network for extremely 
fast global barriers. 
The BG/L control system allows the user to assign MPI logical ranks to physical torus
coordinates at run-time in an arbitrary manner as long as all nodes are uniquely included 
in the mapping. This presents the possibility of increasing application performance with 
very little effort. 
This thesis investigates the performance effects of node mapping with several bench-
marks and scientific codes using a variety of existing and new mapping strategies. 
The benchmarks are the NAS parallel benchmarks, the Ames Laboratory Classical
Molecular Dynamics code (ALCMD), and the General Atomic and Molecular Electronic
Structure System (GAMESS) application. The NAS benchmarks are short, easy to 
understand, and fairly well known. ALCMD has an interesting communication pattern 
that should benefit from a good mapping strategy. GAMESS is one application that is 
not necessarily well-suited for running on BlueGene because it requires a large amount 
of compute power and memory per node. However, it provides an interesting data point 
for performance of applications that were not designed for a particular system and the 
possible benefits of mapping for such applications. 
The mappings investigated were the stock permutations (XYZ, XZY, etc.), Gray-code
based mesh mappings, random maps, variations on Gray-code maps for embedding 2D
meshes in the 3D torus, and three maps designed for GAMESS.
Performance results are presented for node mappings on several BG/L partition sizes.
1 Introduction and Background 
1.1 Introduction 
Scientific and engineering problems in computational science continuously push the 
envelope of what is achievable on even the fastest computers available today. Model-
ing large, complex systems for long periods of time (relative to the time steps involved) 
requires extremely powerful computers with immense quantities of memory to give mean-
ingful results. For example, a full atom-level simulation of a very "fast" protein folding 
is estimated to require approximately 10^23 instructions (AAA+01). This is equivalent
to the five hundred fastest computers on Earth working together for three years con-
tinuously. Time-sensitive information such as financial modeling or weather prediction 
requires a careful balance of accuracy and detail versus time to obtain a solution. If pre-
dicting tomorrow's weather extremely accurately takes forty-eight hours, the information 
calculated is worthless. However, if a less-detailed model is used and the prediction is 
incorrect, then the prediction was worthless as well. 
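As a rough check on the scale of the protein-folding estimate, the sustained rate implied by 10^23 instructions over three years can be worked out directly. The sketch below is only illustrative: it assumes 365-day years and continuous operation, and the constant and function names are not from the thesis.

```python
# Back-of-the-envelope check of the protein-folding estimate quoted above.
# Assumptions (not from the thesis): 365-day years, continuous operation.

TOTAL_INSTRUCTIONS = 1e23            # estimate quoted in the text
YEARS = 3                            # duration quoted in the text
SECONDS_PER_YEAR = 365 * 24 * 3600


def required_rate(total_instructions: float, years: float) -> float:
    """Return the sustained instructions per second needed to finish in `years`."""
    return total_instructions / (years * SECONDS_PER_YEAR)


rate = required_rate(TOTAL_INSTRUCTIONS, YEARS)
print(f"{rate:.2e} instructions per second")   # prints 1.06e+15 instructions per second
```

Under these assumptions the aggregate rate is on the order of 10^15 instructions per second, sustained for the entire three years.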
Typically such requirements are beyond the abilities of single computers. Instead, 
multiple computers are employed to work together on solving a single problem. It is 
important to get the most performance for the amount of money spent and time and 
energy invested in a system. This thesis focuses on one aspect of getting the most per-
formance from a machine - the effects of re-arranging the logical ordering of physical 
nodes in a large multi-processor system. Many applications assume certain things about 
the machine they will be run on (for example, the amount of memory per node, the 
processor speed relative to network speed, or the layout of the nodes). When these 
assumptions are valid, the application should perform as efficiently as possible. How-
ever, if one of the variables changes, performance could be drastically affected. This 
thesis looks at how performance changes if an application assumes a specific layout of 
the physical nodes but a different layout is employed. This "node mapping" has been 
investigated before for some very specific classes of machines and very specific network 
topologies. (CC88) discusses one common mapping - embedding hypercubes of degree
m into hypercubes of degree n. This thesis investigates the performance effects of
mapping two-dimensional grids into a three-dimensional torus, a general case of the
m-into-n mapping, for several current applications, as well as several
application-specific mappings that take into account some of the particular
application assumptions.
1.2 Background
Parallel computer architectures are generally classified according to Flynn's Taxon-
omy (Fly72). Flynn's taxonomy classifies machines based on their information flow. 
There are four broad categories: 
• SISD - Single-instruction, single-data 
• SIMD - Single-instruction, multiple-data 
• MISD - Multiple-instruction, single-data
• MIMD - Multiple-instruction, multiple-data 
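The four categories follow mechanically from whether a machine has one or many instruction streams and one or many data streams. A minimal sketch of that rule is shown here; the function name `classify` and its integer arguments are hypothetical and chosen only for illustration.

```python
def classify(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"   # single or multiple instruction
    d = "S" if data_streams == 1 else "M"          # single or multiple data
    return f"{i}I{d}D"


print(classify(1, 1))    # SISD
print(classify(1, 64))   # SIMD
print(classify(8, 8))    # MIMD
```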
Conventional, single-processor computers are classified as SISD. Each instruction
performs an operation on data taken from a single stream of data elements. SISD
machines are the classical von Neumann machines.
SIMD machines have one controller or instruction processing unit and several data 
processing units. Because instruction decoding is typically rather complicated, SIMD 
machines can be very efficient since they operate on multiple pieces of data with a single 
instruction decode operation. The Thinking Machines CM-1 and the MasPar MP-1 and 
MP-2 are examples of SIMD machines. Modern vector machines are classified as SIMD 
as well. 
MISD machines are very rare. The typical example given for MISD computers is the
systolic array, where a grid of processing elements is controlled by a global clock.
In each cycle, an element reads data from one of its neighbors and performs an
operation on the data. The processed data is then ready for the next element to use
on the next clock cycle. Systolic arrays allow high-throughput data flow with less
memory access, i.e., a piece of data is accessed once, then multiple computations are
done with it. One commercial example of a MISD machine was the Carnegie Mellon/Intel
iWarp machine (BCC+88), (BCC+90). This machine consisted of a number of cells, each
containing a communications unit and a computational unit. The cells could be
connected as one-dimensional lines or rings or two-dimensional meshes or tori.
Systolic architectures are still useful in some applications such as signal/image
processing and neural network simulation. See (Kun82) for more information.
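The clocked hand-off between neighboring elements can be illustrated with a small simulation. The sketch below models a one-dimensional line of cells in software; it is not a model of the iWarp hardware, and the function name, the use of `None` to mark an empty register, and the example operations are all assumptions made for illustration.

```python
from typing import Callable, List, Optional


def run_systolic_line(cells: List[Callable[[int], int]],
                      inputs: List[int]) -> List[int]:
    """Push `inputs` through a line of cells, one hand-off per clock tick."""
    n = len(cells)
    regs: List[Optional[int]] = [None] * n      # output register of each cell
    outputs: List[int] = []
    # Feed the inputs, then n empty ticks so the pipeline can drain.
    stream: List[Optional[int]] = list(inputs) + [None] * n
    for value in stream:
        # Whatever the last cell produced on the previous tick leaves the array.
        if regs[-1] is not None:
            outputs.append(regs[-1])
        # Update right to left so every cell reads its left neighbour's
        # value from the previous tick, not the freshly computed one.
        for i in range(n - 1, 0, -1):
            left = regs[i - 1]
            regs[i] = cells[i](left) if left is not None else None
        regs[0] = cells[0](value) if value is not None else None
    return outputs


# Three cells, each applying a different operation to the same data stream.
cells = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_systolic_line(cells, [1, 2, 3]))      # [1, 3, 5]
```

Each input is read once and then transformed by every cell in turn, which is the property the text attributes to systolic arrays: several operations applied to a single stream of data.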
MIMD machines are the most general category and include such sub-categories 
as shared-memory machines and distributed-memory machines plus hybrid distributed 
shared-memory machines. 
Shared-memory MIMD machines are machines where multiple processors share a
memory bus. Any processor can read or write directly to the memory. This tends to
be faster than distributed-memory architectures, but it introduces memory-consistency
problems that require some sort of locking scheme, which increases overhead.
Programming shared-memory machines tends to be simpler than programming
distributed-memory machines because each processor has transparent access to the
entire global pool of memory. Typically, underlying hardware or low-level software
ensures memory consistency for the application programmer. Modern examples of
shared-memory machines include the SGI (Silicon Graphics, Inc.) Origin 2000 and Altix
machines and Sun Ultra HPC machines. The SGI machines are examples of cache-coherent
non-uniform memory access (ccNUMA) architectures. This means that while an
application programmer sees the entire global address space as a large flat piece of
memory, in reality, some addresses are "closer" to a given processor than others.
Performance can be significantly enhanced by considering this locality. Also, the
hardware ensures data is consistent across processors in a more rigorous fashion than
NUMA machines such as the Sun Ultra HPC.
The other typical shared memory architecture type is simple cache-only memory 
architecture (S-COMA). There are currently no commercially available S-COMA ma-
chines, though the Kendall Square Research (KSR) KSR-1 was very similar to a COMA 
architecture. Each node had a small amount of local memory that was considered cache. 
The hardware automatically routed data between nodes as nodes requested non-local 
data. From a user perspective, memory behaved as a shared address space. 
Distributed-memory MIMD machines are machines where multiple processors have 
independent local memory. To access memory on another processor requires some sort 
of "message" passing over a communications network. 
There are many different methods for passing "messages" between processes. One
method of passing messages is to use UNIX-style sockets or UNIX-style FIFO (first in,
first out) pipes. Other, more common methods include using standard message passing
application programming interfaces (APIs) such as PVM (parallel virtual machine)
(Sun90) or MPI (message passing interface) (For95), (Mes). Before the standardization
of message passing APIs (and even still), many distributed-memory machine vendors had
their own message passing APIs.
These APIs provide the support needed for operations such as sending a message or
receiving a message. Message passing APIs typically provide other features to simplify
application development, such as support for collective operations, global
synchronization, parallel file I/O, and process management.
The current MPI standard specification (Mes), known as MPI-2, provides support for
one-sided operations - also known as remote memory access (RMA) operations.
Normally, a remote process is explicitly involved in a data transfer (i.e., both the
sender and the receiver are involved in the communication and the synchronization is
implicit through the communication operation). With RMA operations, only one side
(e.g., the sender or receiver) is involved in the communication process.
Synchronization must be done explicitly to ensure the communication is complete before
the remote process uses the data. Support for one-sided operations is currently rather
limited in most MPI implementations.
MPI-2 also provides hooks for user-provided profiling tools. This is discussed more 
in section 4.3. 
1.2.1 Interconnect Networks in Distributed-Memory Machines 
Many different communications networks are employed on distributed-memory
machines for passing messages. Commodity workstation clusters use everything from
cheap, ubiquitous ethernet (typically "fast-ethernet" - 100 Mb per second or gigabit
ethernet - 1000 Mb per second) to much more exotic interconnects such as Quadrics
(Qua), Myrinet (Myra), InfiniBand (Mel), or scalable coherent interconnect (SCI)
(IEE92).
These networks have different topologies as well. SCI is a two- or three-dimensional 
torus. Quadrics, Myrinet, and InfiniBand are all fat-tree networks. (Lei85) discusses 
some benefits of fat-tree networks. 
Large distributed-memory systems such as the International Business Machines (IBM)
SP/2, the Intel Paragon, and the Cray T3E typically use very high-bandwidth, very
low-latency interconnects that are more tightly integrated with the processors,
memory, and operating system of the machine. Systems with an extremely large number
of nodes - usually termed MPP (massively parallel processing) - such as the Thinking
Machines CM-5 and the IBM BlueGene/L (BG/L or BGL) machine demand very scalable
networks, in addition to fast networks. Frequently, MPP machines will employ multiple
disjoint networks to maximize performance. For example, the IBM BG/L machine has
a global combining tree network for doing reduction operations, the three-dimensional
torus network for point-to-point traffic, and a global interrupts network for extremely
fast global barriers. MPP machines might have separate data movement and control
networks as well. The CM-5 and BG/L both have a separate control network. BG/L
actually has two control networks. See chapter 2 for more specific information on the
BG/L networks.
Hybrid distributed shared-memory machines are becoming more popular with
multi-processor, workstation-based clusters. These systems have workstations with
multiple processors (typically two or four, but as many as 2048 on the SGI Altix
machine) connected together with some manner of interconnect. Programming such
machines is non-trivial and requires very careful effort to ensure maximum performance
because of the multiple levels of memory. The "local" shared memory is significantly
faster than the distributed memory, so the application developer must ensure that the
most frequently used data resides in local memory areas. One important hybrid
distributed shared-memory machine is the SGI/NASA (National Aeronautics and Space
Administration) "Columbia" machine. This is a "small" cluster of very large nodes.
Each node is a 2048-way SGI Altix shared-memory machine. There are five nodes
comprising the entire system for a total of 10240 processors. This machine is
currently ranked as the second fastest supercomputer on Earth.
1.2.2 Software on Distributed-Memory Machines 
Not only is it important to have a good, high-performance interconnect on
distributed-memory machines, but it is also important to ensure the end-user
application can send data as quickly as the hardware allows. It is important for the
operating system, the device drivers, and the message passing mechanism to be as
efficient as possible. Fast hardware will always be limited by software overhead, so
many strategies are utilized to overcome this problem. Workstation clusters typically
run some UNIX variant (e.g., Linux, AIX, IRIX, or Solaris). Rarely are these UNIX
variants tuned for the type of network traffic seen in distributed-memory message
passing applications. Low-level protocol stacks such as TCP/IP (transmission control
protocol/internet protocol) are meant for reliable end-to-end transmission between
distant networks. They are not optimized for the short-distance, single-hop traffic
common in clusters. Because of this, several strategies have been investigated for
by-passing the TCP/IP stack or at least using it more efficiently. One example is VIA
(the virtual interface architecture) (CCC97). Myrinet also has an operating-system
(OS) by-pass method in their GM software (Myrb). There are also OS by-pass methods
for InfiniBand, Quadrics, and SCI. Another off-loading approach is called remote
direct memory access (RDMA). RDMA allows data to be transmitted from the memory space
of one computer to the memory space of another computer without passing through
either computer's main processor or cache. The transfers can occur concurrently with
other system operations. RDMA requires hardware support on the network cards on both
computers. RDMA over InfiniBand is the most common implementation, but there is
on-going work on RDMA with TCP/IP over commodity networks such as ethernet. For more
information on RDMA over TCP/IP, see (Con04).
Another strategy is to optimize the message passing API. The most typical message
passing APIs are PVM and an implementation of MPI. The two most common
implementations of MPI are MPICH (GLDS96) and LAM (BDV94). Both carry a fair amount
of overhead in order to be as general purpose as possible. MP_Lite (Tur) is one
effort to reduce the overhead of an MPI implementation by removing some redundancy
and lesser-used features. Also, "general purpose" MPI implementations such as MPICH
realistically cannot be optimized for all possible hardware platforms or all possible
applications because they are meant to run on a wide variety of architectures and with
a wide variety of applications. Vendor implementations still exist as well to help
minimize overhead by dealing with fewer, more specific types of hardware.
Other approaches for ensuring hardware is used as effectively as possible are directed
more towards the actual application(s) being run on the machine. For example,
attempting to overlap communication with computation is a standard approach. This
works well in many instances because the network is typically far slower than the
processors. If the machine allows the application to initiate communication without
blocking, a carefully set up program will make sure the data being sent or received
is not used in the next computation step.
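The overlap pattern can be sketched without a message passing library. In the example below, a background thread with an artificial delay stands in for a non-blocking network transfer; `fake_send`, `step`, and the delay value are hypothetical names and parameters introduced for illustration, not part of any MPI implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_send(buffer, delay=0.05):
    """Stand-in for a network transfer: wait, then report how many items were sent."""
    time.sleep(delay)
    return len(buffer)


def step(boundary, interior):
    """Start sending the boundary data, compute on the interior, then wait."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Initiate the transfer without blocking; a snapshot of the boundary
        # data is handed to the transfer so it is not modified while in flight.
        pending = pool.submit(fake_send, tuple(boundary))
        # Computation that touches only data not involved in the transfer.
        interior_sum = sum(x * x for x in interior)
        # Block only when the result of the communication is actually needed.
        sent = pending.result()
    return interior_sum, sent


print(step([1, 2, 3], [4, 5]))   # (41, 3)
```

The key design point is the ordering: the transfer is started first, independent work is done second, and the wait comes last, so the transfer time is hidden behind the computation whenever the two are comparable in length.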
This thesis looks at one additional possibility for performance improvement: node
re-mapping. Node re-mapping re-assigns MPI logical ranks to physical hardware
coordinates. Many applications assume some sort of underlying topology, which may or
may not be present on the machine the application is running on.
For example, a code might be simulating forces inside some three-dimensional
object. A sequential code might decompose the object into a finite number of
three-dimensional cubes or tetrahedrons. The force in each cell would have to be
calculated, along with the effects of neighboring cells. One way of parallelizing this code
would be to assign each processor some number of cells: each processor gets
numcells / numprocessors cells to work on. In each step, each processor would have to
send the force data from its bordering cells to neighboring processors. This algorithm is
easiest to implement if one assumes the processors are set up in a three-dimensional grid.
Typically a message passing API uses a sequential numbering of nodes or processes from
zero to the total number of processes minus one. The code would use some message
passing API features to determine which nodes are the neighbors in three-dimensional
space. A two-dimensional example is shown in figure 1.1 and figure 1.2. Figure 1.1 shows
a physical object being discretized so some computations can be done. Figure 1.2 shows
the simplest method of conceptually parallelizing the computations.
Figure 1.1 Sequential Decomposition of Object 
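The neighbor lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the sequential ranks are assumed to be laid out with X varying fastest, and the grid is treated as a mesh without wraparound. A real code would normally ask the message passing library for this information.

```python
# Illustrative sketch: map sequential ranks onto an assumed 3D processor grid
# and find the face neighbors a process would exchange border data with.
# Assumption: rank 0..N-1 is laid out with X varying fastest, then Y, then Z.

def rank_to_coords(rank, dims):
    """Convert a sequential rank to (x, y, z) grid coordinates."""
    nx, ny, nz = dims
    x = rank % nx
    y = (rank // nx) % ny
    z = rank // (nx * ny)
    return (x, y, z)

def coords_to_rank(coords, dims):
    """Inverse of rank_to_coords."""
    nx, ny, nz = dims
    x, y, z = coords
    return x + nx * (y + ny * z)

def grid_neighbors(rank, dims):
    """Ranks of the (up to six) face neighbors in a mesh without wraparound."""
    x, y, z = rank_to_coords(rank, dims)
    offsets = [(-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    neighbors = []
    for dx, dy, dz in offsets:
        cx, cy, cz = x + dx, y + dy, z + dz
        # Skip positions that fall off the edge of the mesh.
        if 0 <= cx < dims[0] and 0 <= cy < dims[1] and 0 <= cz < dims[2]:
            neighbors.append(coords_to_rank((cx, cy, cz), dims))
    return neighbors

if __name__ == "__main__":
    dims = (4, 4, 4)
    print(rank_to_coords(21, dims))
    print(grid_neighbors(21, dims))
```

An interior process has six neighbors, while a corner process such as rank 0 has only three. Whether those logical neighbors are also physically close depends on the machine and on how ranks are mapped to hardware.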
However, assuming a specific topology is not wise: if the code is run on a new
machine with a different interconnect network, performance might suffer. Some
applications lend themselves to certain configurations more easily than others.
Assuming processors are laid out in a two-dimensional grid makes it much simpler to
write matrix-based codes or two-dimensional decomposition codes, for example. If the
code is going to run on something like an Intel Paragon, which has a two-dimensional
mesh, this assumption is reasonable. Many codes have been written that do make these
sorts of assumptions, so it is worthwhile to see how they perform on topologies
different from the ones they were designed for.
Some machines allow the user to renumber logical ranks. The hardware is obviously
still the same, but the logical rankings are different. One example of this is the IBM
BlueGene/L supercomputer. The control system on BG/L allows the user to specify an
arbitrary logical-to-physical rank map. This makes it an excellent environment in which
to look at node mappings and their effect on application performance.
Figure 1.2 Mesh-assumption Parallel Decomposition of Object
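To make the idea of a logical-to-physical rank map concrete, the following sketch builds such a map as a plain list of coordinates. It is a hypothetical illustration only: the function name, the in-memory representation, and the convention that the first letter of the order string names the fastest-varying axis are all assumptions, and this is not the map format used by the BG/L control system.

```python
# Illustrative sketch: a logical-to-physical rank map as a list, where entry r
# holds the physical (x, y, z) coordinates assigned to logical rank r.
# Assumption: the first letter of `order` is the axis that varies fastest.

def build_map(dims, order="XYZ"):
    """Return the physical coordinates for each logical rank, in rank order."""
    sizes = dict(zip("XYZ", dims))
    fast, mid, slow = order
    mapping = []
    for s in range(sizes[slow]):
        for m in range(sizes[mid]):
            for f in range(sizes[fast]):
                pos = {fast: f, mid: m, slow: s}
                mapping.append((pos["X"], pos["Y"], pos["Z"]))
    return mapping

if __name__ == "__main__":
    dims = (2, 2, 2)
    for order in ("XYZ", "YXZ"):
        print(order, build_map(dims, order))
```

Both orders cover exactly the same set of physical nodes. Only the assignment of logical ranks to those nodes changes, which is why a re-mapping can alter communication distances without touching the application code.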
1.3 The Goal of the Research 
The goal of the research presented in this thesis is to examine the effects of node 
mappings on application performance on large machines with non-flat networks such as 
BG/L. It is possible that other machines such as clusters connected via SCI or machines 
such as the Cray T3E would benefit from this as well. 
It was not investigated in this research, but it is possible node mapping could benefit 
NUMA shared memory machines and hybrid systems as well. If an application assumes 
a flat address space and does not consider the non-uniform memory throughput, re-
mapping may provide some benefits. Similarly, it might be possible for node mapping 
to benefit fat-tree based networks if applications make assumptions on the underlying 
topology. (For example, it might make sense to re-position leaf nodes in the fat tree
layout to exploit application data locality assumptions.)
To measure the effects of node mappings, several standard benchmarks were used. 
Several tools were built to examine this data and to attempt to generate better mappings. 
1.4 Organization 
In Chapter 2, the IBM BlueGene/L machine is described in detail. Chapter 3
discusses the benchmarks that were used to measure the effects of node mappings. Chapter
4 discusses the different node mapping types that were employed and the tools used to
generate the maps. The profiling library and related applications are also discussed.
Chapter 5 discusses the results. Chapter 6 has concluding remarks and ideas
for future work.
2 The IBM BlueGene/L Supercomputer 
2.1 The IBM BlueGene Supercomputer Program 
In 1999, IBM announced the start of a program to build massively parallel computers 
to be applied to computational life sciences problems such as protein folding. A better 
understanding of how proteins fold is important in such areas as protein-drug interactions 
and understanding diseases such as cystic fibrosis. However, to simulate protein folding 
requires both fine-grained spatial details (i.e., atomic interactions) and fine-grained time
steps (typically on the order of femtoseconds) for large amounts of time (typically
on the order of milliseconds). Each time step requires calculating the interactions and
forces of thousands of atoms, resulting in billions of calculations. (AAA+01) suggests
roughly 10^23 total machine instructions are required for simulating a protein that folds
very quickly (hundreds of microseconds). This is an enormous amount of work. On a
one-petaflop/second machine (10^15 operations per second, roughly the sum of the five
hundred fastest computers (Don04) today) this simulation would take over three years
of continuous run time. This is obviously infeasible.
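The run time estimate can be checked with a short calculation. The sketch below simply divides the instruction count quoted from (AAA+01) by the machine rate. The variable names and the simplification of one instruction per operation are assumptions made for illustration.

```python
# Back-of-the-envelope check of the protein folding run time estimate.
# Assumption: one machine instruction is counted as one operation.

TOTAL_INSTRUCTIONS = 10**23      # estimate quoted from (AAA+01)
OPERATIONS_PER_SECOND = 10**15   # one petaflop/second machine

SECONDS_PER_YEAR = 365.25 * 24 * 3600

run_time_seconds = TOTAL_INSTRUCTIONS // OPERATIONS_PER_SECOND
run_time_years = run_time_seconds / SECONDS_PER_YEAR

if __name__ == "__main__":
    print(f"{run_time_seconds} seconds, or about {run_time_years:.2f} years")
```

The result is 10^8 seconds, a little more than three years, which agrees with the figure given in the text.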
The IBM BlueGene program was created to investigate extremely large-scale
supercomputers that will allow scientists to simulate folding (and other problems) in far
greater detail than was previously possible. One goal of the BlueGene program is to
develop a one-petaflop/second machine. Doing this requires massive parallelism. The
BlueGene approach is similar to a cellular-based architecture. (It is not related to the
"cell processor" IBM is currently developing.) In this architecture, each "cell" consists of
a small, simple, low-powered chip, plus some cache memory and supporting units such as
network and memory controllers. In the BlueGene approach all of this resides in a single
application-specific integrated circuit (ASIC). For more information on the BlueGene/L
ASIC, see section 2.2.1. These cells are then replicated and connected together to form
a whole system. This method is very scalable.
There are three architectures that the BlueGene program is looking at: BlueGene/C,
BlueGene/P, and BlueGene/L. BlueGene/C is less general-purpose than BG/P or BG/L.
BlueGene/P was originally proposed as a one-petaflop/second machine. The first node
card worth of BlueGene/L hardware was demonstrated in approximately July of 2003.
2.2 BlueGene/L Overview 
The approaches outlined in this thesis could be implemented on any cluster or
supercomputing architecture that has the ability to re-map node rankings. For example,
the Scalable Coherent Interface (SCI) network, which has become popular for commodity
PC-based clusters, is available configured as a two-dimensional or three-dimensional
processor mesh with wraparound links, similar to BG/L. The Myrinet and Infiniband
software allow some node ordering/re-mapping as well. Tools could be built to re-map
nodes at the message passing layer for these architectures. The MPI specification even
allows for re-mapping based on an understanding of the underlying hardware when setting
up sub-communicators or subgroups of nodes. For example, if an application program
wants to split the process space into a Cartesian grid with the MPI_Cart_create function,
the MPI library is free to re-order nodes within the sub-communicator to take advantage of
the underlying hardware topology.
However, the work described in this thesis was done on the IBM BlueGene/L prototype rack(s)
and the second generation beta hardware. After the ability to re-map MPI rankings was added to
the control system, BG/L presented an excellent candidate for looking at the performance effects
of node mapping on larger scale systems.
2.2.1 The Hardware 
The small "cells" (dual PowerPC cores, cache, memory and network controllers, and dual 64-bit
floating point units (FPUs) called "double hummer") are generally grouped into eight by eight
by eight units called "midplanes". Midplanes are the smallest officially supported partitions
available.
Physically, a midplane has 256 compute cards arranged on sixteen "node boards". Each
node board has sixteen "compute cards". Each compute card has two "cells" along with
the main memory for the two cells (512MB in second generation hardware; 256MB in the
prototype hardware). Each node board may have between zero and two "I/O" cards in addition
to the sixteen compute cards.
Each BG/L rack has two midplanes stacked vertically for a total of 1024 compute nodes.
The number of I/O cards is variable and is dependent upon the customers' needs. An "I/O 
rich" rack can have as many as 128 I/O nodes (an eight-to-one compute to I/O ratio). A 
typical rack would have between eight and thirty-two I/O nodes (i.e. 128:1, 64:1, or 32:1 
compute to I/O ratios). The typical configuration in Rochester is either 128:1 or 64:1, though 
some midplanes are broken down as small as eight to one. The configuration being shipped 
to Lawrence Livermore National Laboratory (LLNL) is 64:1. The LLNL configuration will 
have sixty-four racks for a total of 65536 processors (131072 cores) arranged as a sixty-four 
by thirty-two by thirty-two three-dimensional torus. Figure 2.1 shows how all of the pieces fit 
together for the Lawrence Livermore National Lab machine. 
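The node counts quoted in this section follow directly from the packaging hierarchy, as the following sketch shows. The constant and function names are chosen for illustration. The figures themselves are the ones given in the text.

```python
# Arithmetic check of the BG/L packaging hierarchy described in the text.

CELLS_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2
CORES_PER_CELL = 2
LLNL_RACKS = 64

nodes_per_midplane = (CELLS_PER_COMPUTE_CARD
                      * COMPUTE_CARDS_PER_NODE_BOARD
                      * NODE_BOARDS_PER_MIDPLANE)
nodes_per_rack = nodes_per_midplane * MIDPLANES_PER_RACK
llnl_nodes = nodes_per_rack * LLNL_RACKS
llnl_cores = llnl_nodes * CORES_PER_CELL

def io_nodes_per_rack(compute_to_io_ratio):
    """Number of I/O nodes in one rack for a given compute-to-I/O ratio."""
    return nodes_per_rack // compute_to_io_ratio

if __name__ == "__main__":
    print("nodes per midplane:", nodes_per_midplane)   # an 8 x 8 x 8 block
    print("nodes per rack:", nodes_per_rack)
    print("LLNL nodes and cores:", llnl_nodes, llnl_cores)
    for ratio in (8, 32, 64, 128):
        print(f"{ratio}:1 gives {io_nodes_per_rack(ratio)} I/O nodes per rack")
```

The 512 nodes of a midplane match the eight by eight by eight grouping, and the 65536 nodes of the LLNL machine match its sixty-four by thirty-two by thirty-two torus.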
Each of the ASICs making up a "cell" on BG/L has the two PowerPC 440 cores each 
with two 64-bit floating-point units, five different network controllers, and level 2 and level 3 
cache (which are coherent between the two cores). Each PowerPC core has level 1 cache, but
there is no hardware cache coherency for L1 cache. The hardware does provide lock-boxes to
allow coherent processor-to-processor communication.
I/O nodes use the same ASICs, and the same compute card boards except they have the 
physical gigabit ethernet network connection wired in. (Compute cards have all of the ethernet 
logic, but no physical connection to use it). Figure 2.2 shows a block diagram of the BGL ASIC. 
Figure 2.1 BG/L Hardware Expansion (levels shown: compute card, 2 chips, 2x1x1; node
board, 32 chips, 4x4x2, 16 compute cards; cabinet, 32 node boards, 8x8x16; system, 64
cabinets, 64x32x32)
Figure 2.2 BG/L ASIC Block Diagram (blocks shown: two 440 CPU cores, each with 32k/32k
L1 cache, a double hummer FPU, and an L2 cache; SRAM; L3 cache; memory controller; torus
and ethernet interfaces)
There are five networks on each ASIC.
1. Global combining binary tree
The global combining tree network is used for high-performance broadcasts and
reductions. It can also be used for synchronizations (barriers). The tree hardware has an
integrated arithmetic-logic unit (ALU) to perform simple arithmetic operations (such
as addition) or logical operations with no CPU intervention. The tree ALU is designed
for integer values only, but there is some software support for doing double-precision
operations.
There are sixteen class routes available on the tree. Two of them are generally reserved:
one is set up by the micro-loader for I/O operations from the compute nodes to the I/O
node and the other is generally set up for broadcasts and reductions over MPI_COMM_WORLD,
i.e., the entire system. The other class routes are available and will probably be used for
sub-communicator broadcasts/reductions. The tree can also be used for point-to-point
communications, but it isn't generally used in that way.
2. The torus network 
The torus network is a high-speed point-to-point network. Each compute node has six 
bidirectional links (±X, ±Y, ±Z). The torus hardware guarantees reliable delivery of 
variable-length packets (between thirty-two and 256 bytes in multiples of thirty-two). 
Packets are routed individually using either deterministic routing or a minimal adaptive 
routing scheme. In deterministic routing, a packet always goes from the current node to 
the destination physical X coordinate, then Y, then Z. In the minimal adaptive routing 
scheme, the packet can bypass heavily congested links en route to the destination node.
3. The global interrupts/global synchronization network (GI) 
The GI network allows very high-speed (approximately 1.5 microseconds at the hardware level)
full-system barriers. It is essentially a set of (user-configurable) OR logic gates.
4. The Joint Test Action Group (JTAG) over gigabit ethernet network 
This network is used for booting the machine and provides other very low-level control 
and test possibilities. For example, the temperature sensor information is gathered via 
JTAG. Also, it is possible to stop the processor cores and gather debug information such 
as instruction address registers and stack dumps over this network. 
5. External gigabit ethernet network 
The hardware for this network is present on the compute nodes, but the physical connec-
tors are only present on I/O nodes. This network enables the I/O nodes to communicate 
with the outside network (i.e., file servers, the front-end node, etc.). All compute node 
communication with the outside network is done through the I/O node over the tree. 
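The deterministic routing scheme described for the torus network (correct X first, then Y, then Z) can be sketched as follows. This is a simplified model, not the hardware algorithm: the function names are hypothetical, and it assumes a packet takes the shorter way around each ring, with ties broken toward the positive direction.

```python
# Illustrative sketch of deterministic dimension-order routing on a 3D torus.
# Assumptions: the shorter direction around each ring is used, ties go toward
# the positive direction, and congestion is ignored.

def torus_step(src, dst, size):
    """Return -1, 0, or +1: the direction to move along one ring of the torus."""
    if src == dst:
        return 0
    forward = (dst - src) % size
    backward = (src - dst) % size
    return 1 if forward <= backward else -1

def deterministic_route(src, dst, dims):
    """List the nodes visited when X is corrected first, then Y, then Z."""
    current = list(src)
    path = [tuple(current)]
    for axis in range(3):
        step = torus_step(current[axis], dst[axis], dims[axis])
        while current[axis] != dst[axis]:
            # The modulo implements the wraparound links of the torus.
            current[axis] = (current[axis] + step) % dims[axis]
            path.append(tuple(current))
    return path

if __name__ == "__main__":
    dims = (8, 8, 8)
    route = deterministic_route((0, 0, 0), (7, 1, 0), dims)
    print(route, "hops:", len(route) - 1)
```

In the example, the wraparound link lets the packet reach X coordinate 7 in a single hop instead of seven. The number of hops between two logical neighbors is exactly what a node mapping changes.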
In addition to all of the node boards, each rack has eight link cards, two service cards,
and one clock card. The link cards are responsible for connecting midplanes together. They
provide the interface from the internal networks on the backplane (i.e., torus, tree, and global
interrupts) to the physical cables used to connect midplanes together. The link cards also make
it possible to electrically isolate midplanes from each other. This means multiple independent
sub-partitions can be booted at the same time and will not interfere with each other. These
sub-partitions get the point-to-point network (in some configurations it will be a mesh; in others
it will be a torus, depending on how other partitions are using the system), the combining tree,
and the global interrupts. The service card acts as an ethernet hub for the JTAG-over-ethernet
network for the iDo chips (see below) on the node boards and link cards. It also handles the very
minimal communication with the midplane and is responsible for providing initial power-on
signals for the node cards and link cards. There is one global clock for the entire system (i.e.,
all sixty-four racks in the LLNL machine). Each rack has a local clock card that regenerates
the clock signal for the rack to help minimize skew between racks.
Every node board, link card, and service card has another custom field programmable
gate array (FPGA) called an "iDo" chip. These chips provide support for a variety of serial
protocols for communications with the core, for example JTAG and Inter-IC (usually written
as I2C). I2C is used for monitoring the fan speed and core temperatures. The iDo chips also
provide the ability to read a small block of static RAM on each node. The static RAM is for
things such as storing performance counter values. The iDo chips allow this data to be read
with no software intervention on the compute core.
The IBM Rochester site has four racks (4,096 compute node cards, 8,192 total cores) of 
first-generation hardware and sixteen racks (16,384 compute node cards, 32,768 cores total) of
second-generation hardware. First-generation hardware has compute nodes running at 500MHz 
with 256MB of RAM per board (128MB per processor core). Second-generation hardware has 
compute nodes running at 700MHz with 512MB of RAM per board (256MB per processor core). 
Because the network speed and memory bandwidth are very tightly coupled to processor speed 
(since they are all on the same ASIC), the increase in performance going from first-generation 
to second-generation is higher than the expected forty percent improvement based on processor 
speed increase alone. 
For more information on the BlueGene/L hardware, see (A+02).
2.2.2 The Software 
There are several pieces of software that applications make use of on BlueGene. 
2.2.2.1 BlueGene/L Compilers 
Application code on BG/L is compiled on a "front-end node". This front-end node could
be a single Intel or PowerPC machine for a small BG/L configuration or a cluster of Intel
or PowerPC machines for a larger BG/L machine such as the LLNL machine. Because the
front-end node is not likely to be a PowerPC 440 based machine, applications developed for
BlueGene require cross-compilers (compilers running on one architecture that generate code
for another). The common, freely-available GNU's Not UNIX (GNU) C (GCC) and Fortran 77
(G77) compilers can be used to generate code to run on BGL. However, GCC and especially G77
are not as optimized as IBM's commercial compilers, the "XL" compilers. The XL compilers
have support for the double hummer instruction set and are more optimized for the PowerPC
440 than GCC/G77. XL also has support for compiling Fortran 90 codes.
2.2.2.2 BlueGene/L MPI 
As discussed previously, BlueGene/L is a distributed memory machine. Because of this, the 
machine needs a communications library that facilitates passing messages (i.e., data) between 
nodes. BlueGene/L uses the Message Passing Interface (MPI) as a basis for the communications 
library. Specifically, BG/L uses MPICH from Argonne National Laboratory (ANL) as the basis 
for the system MPI implementation. 
MPICH was chosen as the base of the BlueGene/L communications library because of its
modular design.
This modularity is beneficial because it allows BlueGene/L-specific optimizations while
easily retaining MPI standard compliance. For example, BG/L has a tree-optimized MPI_Allreduce.
There is also a torus-optimized MPI_Allreduce for sub-communicators, and there are tree-
and torus-optimized MPI_Bcast functions. The torus-optimized broadcast takes
advantage of the special "deposit bit" on the torus network to do very efficient broadcasts along
lines. MPICH has these collective functions implemented with point-to-point communication,
so whenever it is not possible to use optimized collectives, BG/L falls back to using the
point-to-point based collectives. The BG/L MPI also uses the data type handling routines
and communicator management routines from MPICH. This saved having to implement these
routines and ensures they are compliant with MPI-2 standards.
Figure 2.3 shows the MPICH model and what pieces BlueGene/L added or modified.
For more information on the MPI implementation on BGL see (AAC+03) and (AAC+04).
2.2.2.3 Other software tools 
Because BGL emulates a number of Linux system calls (see section 2.2.3.5), uses a standard
communications library, and has a number of other open APIs, it is possible to use many
existing open-source and proprietary software packages with BGL. For example, the Etnus
TotalView debugger is being ported to BlueGene/L and the hooks required by the debugger
are available for other debuggers to use (such as the GNU debugger, GDB). A number of
performance tools have been ported to BlueGene or make use of BlueGene data, such as
Paraver (fPoB). As soon as more end-users have access to BlueGene/L hardware, more tools
will become available.
Figure 2.3 MPICH Modular Structure (layers shown: application; MPI, with point to point,
datatype, topology, debug, and collectives components; MPID (Abstract Device Interface);
bgltorus; BG/L additions marked)
2.2.3 The Control System 
The BGL control system (ABB+03a), (ABB+03b) is responsible for all controlling aspects
of the machine, i.e., it is responsible for booting partitions, keeping track of allocated blocks,
handling application loading, monitoring environmental sensors, etc. The BG/L core is
stateless when powered on, so the control system is required to manage state information.
Figure 2.4 shows the overview of the BG/L control network. 
It consists of eight pieces: 
1. A back-end database
2. The "multi-midplane control system" (MMCS) server
3. The iDo proxy
4. The Console I/O Daemon/DataBase (CIODB) job controller
5. The lightweight Compute Node Kernel (CNK) known as BLRTS (BlueLight run-time
supervisor)
6. The Linux image on the I/O nodes
7. The Console I/O Daemon (CIOD) running on the I/O nodes
8. The hardware discovery process
Figure 2.4 BlueGene/L Control System Overview (labels shown: Scheduler; Service Node;
MMCS; Control Ethernet; JTAG; iDo Chip; Environmentals; I/O Node 0 through I/O Node 1023
(Linux); tree; Compute Node 0 through Compute Node 63 (BLRTS); Pset 0 through Pset 1023)
2.2.3.1 Back-end Database 
BlueGene/L has a database that maintains hardware inventory tables, job status tables,
performance monitoring data, environmental data, and similar state information for the
hardware. Many of the other BG/L control system pieces interact with the database. The database
resides on the external network on a service node.
2.2.3.2 MMCS Server 
The MMCS server (called mmcs_db_server) interacts with the back-end database to
allocate and boot blocks, manage job submission, get hardware debugging information, and
record reliability, availability, and serviceability (RAS) events to the database (problems that
require user intervention, kernel panic messages, hardware error counters, etc.). The MMCS
server also interacts with CIODB and the iDo proxy for booting partitions and starting jobs.
2.2.3.3 iDo Proxy 
The iDo proxy is responsible for the low-level system interaction. It is the interface to
the JTAG environment via the iDo chips on node cards, service cards, and link cards. The
proxy is responsible for sending the initial micro-loader to the nodes, the Linux images to the
I/O nodes, and the CNK images to the compute nodes, as well as for querying temperature
sensors, fan module speeds, etc. The proxy is also used for starting and stopping the CPU cores
and reading the static RAM on the compute nodes. All low-level control functionality goes
through this proxy.
2.2.3.4 CIODB job controller 
CIODB is responsible for telling the MMCS server (and the user) when a partition or block 
is fully booted and ready for job submission. It also starts and terminates jobs. CIODB is also 
responsible for transferring a job's output and error text to a file so an end-user can read it. 
2.2.3.5 CNK (BLRTS) 
BLRTS is a custom kernel that runs on the compute nodes. BLRTS provides a flat,
fixed-size address space with no paging. The kernel and application share the same address
space, with the kernel residing in a protected portion of it. BLRTS provides a portable operating
system interface (POSIX)-like interface to applications.
BLRTS starts and stops jobs as directed by CIOD, and forwards or handles
application-generated signals. BLRTS also provides basic I/O operations to the application
(which are proxied by CIOD).
BLRTS runs in one of three modes - coprocessor mode, virtual node mode, or heater mode. 
In coprocessor mode, one core generally handles MPI functions while the other core handles 
computation. 
The second mode is called virtual node mode. In this mode, BLRTS splits the address 
space in half and both cores do computations and MPI functions. All hardware resources 
(memory, send/receive queues, etc.) are split in half. Each node has read-only access to the 
other node's memory. There is also a set of lock-boxes provided to safely copy data between
the two cores since the L1 caches are not coherent. Virtual node mode is useful for applications
that do not have large memory requirements.
The third mode (which is rarely used) is called "heater mode". In this mode one of the cores 
does absolutely nothing. All computation and communication happens on just one core. This 
mode was originally the default mode because the other two modes were not fully implemented. 
2.2.3.6 Linux 
A very minimal Linux system runs on the I/O nodes to interact with the outside world 
(i.e., it has support for various file systems, etc.). All system calls an application makes 
are executed on the Linux system via function shipping from CIOD. Linux is responsible for 
mounting external file-systems for compute nodes to utilize as well. 
2.2.3.7 CIOD
The console I/O daemon is a background process running under Linux on the I/O nodes. 
CIOD is responsible for proxying file I/O requests from the compute nodes to the Linux kernel. 
It also is responsible for loading application code onto the compute nodes. 
2.2.3.8 Discovery 
Because there are so many resources in a large BlueGene/L machine, it is very difficult to
gather configuration information from midplanes and node cards, find new hardware, and
properly prepare new hardware for running application code by hand. The discovery process is
another background process running on the service node. It periodically scans for new hardware
or hardware that has become inoperable (perhaps an over-temperature condition caused the
midplane to shut down) and updates the state information in the database. Discovery is also
responsible for finding the hardware for the initial database population step.
2.2.3.9 The end-user perspective 
The typical end-user running applications on BlueGene/L interacts with the control system
by using one of the consoles that communicate with mmcs_db_server. The three most popular
consoles used currently are mmcs_db_console, ciorun, and mpirun. All three of them provide
essentially the same functions and only differ in how transparent the control system is made
to be.
The end-user selects a block, boots it, submits a job, waits for output, and frees the block
for other users. All of this can be done by a scheduler, and a good scheduler is required to
maximize utilization of a large machine like the LLNL machine.
2.2.4 Performance 
Each processor core can theoretically perform four floating point operations per cycle (in
the form of two 64-bit multiply-adds). The second generation BG/L hardware runs at
700MHz, giving a peak performance of 2.8 billion floating point operations per second (gigaflop/s,
GF). If both cores are active, the theoretical maximum per ASIC is 5.6 GF. This gives a
maximum peak performance of approximately 2.9TF (or 5.7TF with both cores) for a rack.
The maximum theoretical peak performance of the entire 64-rack machine at LLNL is 183.5TF
(or 367.0TF with both cores).
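The peak figures above can be reproduced from the clock rate and the operations per cycle. The following sketch uses names chosen for illustration and takes the node counts from section 2.2.1.

```python
# Reproduce the theoretical peak performance figures from first principles.

FLOPS_PER_CYCLE_PER_CORE = 4      # two 64-bit multiply-adds per cycle
CLOCK_HZ = 700_000_000            # second generation hardware
CORES_PER_NODE = 2
NODES_PER_RACK = 1024
LLNL_RACKS = 64

def peak_flops(nodes, cores_per_node_used):
    """Theoretical peak in floating point operations per second."""
    return nodes * cores_per_node_used * FLOPS_PER_CYCLE_PER_CORE * CLOCK_HZ

core_gf = peak_flops(1, 1) / 1e9
node_gf = peak_flops(1, CORES_PER_NODE) / 1e9
rack_tf_one_core = peak_flops(NODES_PER_RACK, 1) / 1e12
rack_tf_both_cores = peak_flops(NODES_PER_RACK, CORES_PER_NODE) / 1e12
llnl_tf_one_core = peak_flops(NODES_PER_RACK * LLNL_RACKS, 1) / 1e12
llnl_tf_both_cores = peak_flops(NODES_PER_RACK * LLNL_RACKS, CORES_PER_NODE) / 1e12

if __name__ == "__main__":
    print(f"per core: {core_gf:.1f} GF, per ASIC: {node_gf:.1f} GF")
    print(f"per rack: {rack_tf_one_core:.1f} TF or {rack_tf_both_cores:.1f} TF")
    print(f"LLNL system: {llnl_tf_one_core:.1f} TF or {llnl_tf_both_cores:.1f} TF")
```

The rack values are rounded in the text: 1024 nodes at 2.8 GF give 2.8672 TF, which is reported as approximately 2.9TF.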
These are very impressive numbers, but theoretical peak and "real-world" performance
are not the same thing. One commonly used benchmark for evaluating performance (not
necessarily "real-world" performance though) is the LINPACK benchmark (DLP01). The top
500 fastest machines on Earth are listed at (DSSM04), which is updated every 6 months. There
is also a report with more detail at (Don04). In the latest list (November 2004), BlueGene/L
machines have three of the top fifteen spots, including two in the top ten. The fastest machine
on Earth currently is a sixteen-rack BG/L beta machine located at IBM in Rochester, MN.
This machine achieved 70.720TF out of a theoretical peak of 91.750TF. The other machine
in Rochester is an older prototype four-rack system. It achieved 11.680TF (out of a peak
of 16.384TF). The peak on the prototype machine is lower because its nodes only run
at 500MHz. Also, the percentage of peak achieved is lower because of the smaller memory
configuration (256MB instead of 512MB) and the slower memory and network speeds. A
two-rack second generation machine at IBM research came in at fifteenth place with 8.655TF out
of a maximum 11.469TF. These numbers show that the BG/L architecture is indeed scalable
for at least one problem (LINPACK). For the purpose of this thesis, it is useful to point out
that getting the maximum performance on LINPACK required a node re-mapping. LINPACK
was not used as one of the benchmarks for measuring node mapping performance in this thesis,
so it is possible there exists a better map than the stock YXZ mapping used for the reported
LINPACK results. Switching from the default XYZ mapping to YZX increased performance by
roughly one percent. This may not sound like much, but in this case the performance difference
alone was almost as much as the performance of the "slowest" machine on the top500 list.
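The percentage of peak achieved by each machine can be computed directly from the LINPACK numbers quoted above. In this sketch the dictionary keys are descriptive labels chosen for illustration, and the values are the achieved and peak figures from the text.

```python
# Fraction of theoretical peak achieved on LINPACK, from the figures in the text.

LINPACK_RESULTS_TF = {
    # label: (achieved TF, theoretical peak TF)
    "16-rack beta machine": (70.720, 91.750),
    "4-rack prototype": (11.680, 16.384),
    "2-rack second generation": (8.655, 11.469),
}

def percent_of_peak(achieved_tf, peak_tf):
    """Achieved performance as a percentage of theoretical peak."""
    return 100.0 * achieved_tf / peak_tf

efficiency = {name: percent_of_peak(*values)
              for name, values in LINPACK_RESULTS_TF.items()}

if __name__ == "__main__":
    for name, value in efficiency.items():
        print(f"{name}: {value:.1f}% of peak")
```

The prototype system reaches a smaller share of its peak than either second generation machine, which is consistent with the explanation given for its smaller memory and slower memory and network speeds.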
3 Benchmarks 
To present performance data, some sort of benchmark(s) must be used. For this thesis, 
several different parallel benchmarks were chosen to show the effects of mappings on different 
types of applications. The benchmarks used were: 
1. The NASA Advanced Supercomputing (NAS) Parallel Benchmarks 
2. The Ames Laboratory Classical Molecular Dynamics Code (ALCMD) 
3. The General Atomic and Molecular Electronic Structure System (GAMESS) 
These benchmarks were chosen because they use a variety of communication patterns. The 
NAS benchmarks are fairly well known in parallel computing and they have been used for 
various performance studies (LWPS04), (LKC+03). GAMESS is a very well known quantum 
chemistry code. The ALCMD code has interesting communication patterns and is also good 
for showing overall system scaling. 
The profiling tools discussed in section 4.3 were used to make the communication sizes and 
types tables in section 3.5. The communications visualizer discussed in section 4.3 was used 
to make the point-to-point communications visualizations. 
3.1 The NAS Parallel Benchmarks 
The NAS Parallel Benchmarks (B+95) (B+94) (dW02) are a suite of eight benchmarks 
designed to compare performance of parallel computers on several pseudo-applications and 
some numeric kernels. The benchmarks are known by two-letter acronyms. They are: 
1. LU - LU (lower-upper) decomposition
2. BT - block tri-diagonal
3. SP - scalar pentagonal
4. IS - integer sort
5. MG - multigrid
6. FT - Fourier transform
7. CG - conjugate gradient
8. EP - embarrassingly parallel
All of the NAS benchmarks except IS have four problem sizes for parallel machines (and 
two workstation-level problem sizes). They are referred to as classes. "Class A" is the smallest 
problem size, i.e., it requires the minimal amount of memory to run and will run on very small 
numbers of CPUs. They also run in a very short amount of time. On large configurations, the 
run times are probably too short to be meaningful. For example, on a 512-way system the FT 
benchmark runs in only 0.26 seconds and only has 2176 MPI communications events. However, 
when possible, runs were made with "Class A" problem sizes on the larger configurations. The 
other classes are "B", "C", and "D". Up to class "C", the benchmarks will easily run on sixty-
four node partitions. Some of the class "D" benchmarks will run on configurations as small as 
128-way, but for consistency the class "D" benchmarks were only run on 512-way and 1024-way 
partitions. 
Because the benchmarks are fairly old (especially by high-performance computing
standards), running some of the benchmarks on larger configurations (typically starting at 512
nodes) proved difficult. All classes of a given benchmark that ran correctly on a given machine
size were run.
Despite these limitations, the benchmarks are still valid for smaller configurations and they 
do a good job of testing different communication patterns. 
The NAS benchmarks report three values for measuring performance: 
1. Total run time (wall-clock time) for the job
2. The total number of operations per second (in millions) (MOp/s)
3. The total number of processors used and the number of operations per second per
processor (in millions) (MOp/s/proc)
Only the MOp/s or MOp/s/proc values are used for the graphs in this thesis. Details on each of the
benchmarks are provided below.
The descriptions of the benchmarks below are gathered from (B+94), (B+95), and (dW02).
3.1.1 EP 
The EP benchmark "provides an estimate of the upper achievable limits of floating point
performance, i.e., the performance without significant interprocessor communication." The
code generates random numbers according to a specific scheme. The code is "embarrassingly
parallel" because all of the nodes involved do this computation independently. The only timed
communication between them is a verification step at the end that ensures all processors
achieved identical results and that the results are the expected values for each class size of
the benchmark. The first two communications use MPI_Allreduce, each reducing a
single double-precision value. The third communication is another call to MPI_Allreduce, but
on ten double-precision values. The totals for class A and class C with 128 nodes are shown in
table 3.1. For more information on table 3.1 (which is located in section 3.5), see section 4.3.2.
3.1.2 MG 
The MG benchmark is "a simplified multigrid kernel. It requires highly structured long 
distance communication and tests both short and long distance data communications." The 
benchmark solves for an approximate solution to the discretized Poisson equation on a cubic 
grid with periodic boundary conditions. The benchmark employs a V-cycle multigrid algorithm.
The V-cycle multigrid algorithm tries to discretize a grid with finer granularity at
points with higher-frequency information that might be lost in a very coarse grid, but without
finely discretizing the entire grid (and wasting space or introducing aliasing problems). The
communication in MG is more complicated than in EP. First, the entire problem space (typically
256 by 256 by 256) must be divided among the processors involved. The code does this by
successively halving the problem among the processors starting with Z, then Y, then X. For
example, if there are two processors involved, each one gets a 256 by 256 by 128 problem. If
there are four processors involved, each one gets a 256 by 128 by 128 problem. If there are
thirty-two processors involved, each one gets a 128 by 64 by 64 piece. The code
uses a mix of point-to-point and collective operations, though the majority of the work is done
with point-to-point operations. The code uses non-blocking receives (e.g. MPI_Irecv) in an
attempt to overlap communication and computation. The benchmark uses a wide range of
message sizes for the point-to-point communications, as is shown in table 3.2. Figure 3.4 shows
the point-to-point traffic for all of the nodes while figure 3.5 shows a single, typical node's
point-to-point communications.
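The successive halving described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the benchmark's own setup code; the function name mg_subdomain and the restriction to power-of-two processor counts are assumptions made for the example.

```python
def mg_subdomain(nprocs, grid=(256, 256, 256)):
    """Return the (x, y, z) extent owned by each processor.

    Assumption for this sketch: nprocs is a power of two and the grid is
    halved along Z first, then Y, then X, repeating in that order.
    """
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    dims = list(grid)      # [x, y, z]
    axis = 2               # index 2 is Z, the first axis to be halved
    remaining = nprocs
    while remaining > 1:
        dims[axis] //= 2
        axis = (axis - 1) % 3   # Z -> Y -> X -> Z ...
        remaining //= 2
    return tuple(dims)


if __name__ == "__main__":
    for p in (2, 4, 32):
        print(p, mg_subdomain(p))
```

For 2, 4, and 32 processors this reproduces the 256 by 256 by 128, 256 by 128 by 128, and 128 by 64 by 64 pieces quoted in the text.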
3.1.3 CG 
The CG kernel uses "a conjugate gradient method to compute an approximation to the 
smallest eigenvalue of a large, sparse, symmetric positive-definite matrix. This kernel is typical 
of unstructured grid computations in that it tests irregular long distance communication, 
employing unstructured matrix vector multiplication." The benchmark uses only point-to-point
communication in the timed section. It again uses non-blocking receives but does no
computation between the MPI_Irecv, MPI_Send, and MPI_Wait calls. The non-blocking receives
serve instead to prevent deadlocks on machines where neither MPI_Send nor MPI_Recv returns
until completed. About one-third of the communications are large messages and the other
two-thirds are small messages, as can be seen in table 3.3.
See figure 3.6 for the all-nodes point-to-point communication pattern and figure 3.7 for an
arbitrary pair of nodes.
3.1.4 FT 
The FT kernel is "a three-dimensional partial differential equation solution using Fast 
Fourier Transforms (FFT). This kernel performs the essence of many 'spectral' codes. It
is a rigorous test of long-distance communication performance." While this code may be
designed to test long-distance communication performance, it does so using MPI_Alltoall. This
makes it a test of a system's bisection bandwidth more than of long-distance point-to-point
performance. As table 3.4 shows, there are no point-to-point calls in FT, so there is nothing
to visualize in the point-to-point communications visualizer.
3.1.5 IS 
IS performs "a large integer sort. This kernel performs a sorting operation that is important 
in 'particle method' codes. It tests both integer computation speed and communication
performance." The definition for this kernel specifies how the keys must be distributed. The
definition can lead to load imbalance with a large number of processors, but because the
benchmark is fairly small for large partitions, this was not readily observed. The benchmark uses
a mixture of point-to-point operations and collective operations though the majority of the work 
is done using the collective operations. The point-to-point communications are single nearest-
neighbor communications and there are very few of them. Figure 3.8 shows the point-to-point 
communications for all nodes, while figure 3.9 shows the nearest neighbor communications 
more clearly for three nodes. The communication types and sizes are summarized in table 3.5. 
3.1.6 BT, SP, and LU 
The last three benchmarks, BT (block tridiagonal solver), SP (scalar pentadiagonal solver),
and LU (LU decomposition solver), are designed to mimic "real" applications rather than just
functional kernels. The three benchmarks solve the same problem in three different ways.
All three are computational fluid dynamics (CFD) codes. Most CFD codes involve solving a
system of partial differential equations numerically. In the same way that Maxwell's equations
are a system of partial differential equations describing electromagnetic radiation, the
Navier-Stokes equations describe fluid motion by ensuring conservation of mass, momentum, and
energy in the fluid.
The majority of the communications in all three of these pseudo-applications is point-to-
point. Most of the collective operations are broadcasting problem parameters to all nodes or 
gathering timing results. 
3.1.6.1 LU 
The LU benchmark solves the CFD system by using a symmetric successive over-relaxation
algorithm. The benchmark distributes the grid across processors by repeatedly halving it,
alternating between x and y. The LU profile is in table 3.6. The point-to-point communication
in LU is different from that in BT or SP. It is shown in figure 3.10. Figure 3.11 shows a
typical pair of nodes. Each node communicates with two nearest neighbors and then with a node
half-way across the partition.
3.1.6.2 BT 
The BT benchmark solves the CFD system by using a Beam-Warming approximate factor-
ization. The factorization decouples the x, y, and z dimensions. In BT the resulting system of 
equations is block-tridiagonal. The communications profile for BT is shown in table 3.7. 
3.1.6.3 SP 
The SP benchmark solves the CFD system in essentially the same way as BT but ends up
with a system that is scalar pentadiagonal. As in BT, the majority of the communication in SP is
point-to-point. This is shown in table 3.8.
Figures 3.12 and 3.14 show the point-to-point communications for BT and SP respectively.
Figures 3.13 and 3.15 show a single typical node. The communication looks rather odd until
one remembers that these two benchmarks only make use of 121 processors in the 128-way
partition. They attempt to make a two-dimensional square mesh of the partition they are
given. In the 128-way partition, this is an eleven by eleven grid. So, node five (the
one pictured) communicates with its two neighbors (four and six) and then nodes sixteen and
fifteen (eleven away, plus one nearest neighbor). Finally, the two nodes at 115 and 116 are
sent data. These two nodes are 110 (i.e., eleven times ten) away and a neighbor
of that node.
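The arithmetic behind the node-five example can be checked with a short Python sketch. It only restates the offsets quoted in the text (one nearest neighbor on each side, one row of the square mesh, and ten rows of the mesh); it is not taken from the BT or SP sources, and the function names are hypothetical.

```python
import math


def square_mesh(partition_size):
    """Largest square mesh that fits in the partition: (side, nodes used)."""
    side = math.isqrt(partition_size)
    return side, side * side


def partners_from_text(node, side):
    """Partners of `node` using the offsets described for node five.

    Assumption: the same offsets (+/-1, +side, +side*(side-1)) apply to the
    pictured node only; other nodes may wrap around differently.
    """
    one_row = node + side
    far_rows = node + side * (side - 1)
    return [node - 1, node + 1, one_row, one_row - 1, far_rows, far_rows + 1]


if __name__ == "__main__":
    side, used = square_mesh(128)
    print(side, used)                      # 11 121
    print(partners_from_text(5, side))     # [4, 6, 16, 15, 115, 116]
```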
3.2 Ames Laboratory Classical Molecular Dynamics 
The Ames Laboratory Classical Molecular Dynamics (abbreviated ALCMD) (TMH04) code 
is representative of molecular dynamics types of codes. It has been used by Ames Laboratory to 
benchmark systems but can also be used to do "real" science. The code supports calculations 
of a variety of classical atomic interactions such as Lennard-Jones pair interactions and 3-body
interactions using Tersoff potentials with large numbers of atoms. The code prefers a
two-dimensional processor arrangement. If the number of atoms is significantly larger than the
number of processors, only five communication steps are required. This occurred with ten 
million or more atoms on 512-way and 1024-way configurations. One million atoms on 512 
nodes required eight communication steps. The communications are global nearest neighbor 
communications. In the first step, all nodes send data "up" to their neighbor. Then, all 
nodes shift their data right. In the next step, all nodes shift down twice. The final shift is a 
diagonal up-left shift back to the origin. When there are more than five communication steps,
the communications cover every node above and to the right of the initial node that is within
the interaction range specified. For example, with 512 nodes and one million atoms, eight
communication steps are performed. They are up, right, down, down, right, up, up, and back
to the origin. The communication patterns for five and eight communication steps are shown
in figure 3.1. Communication profile tables are shown in table 3.9 for 100,000 atoms on 1024 
nodes and table 3.10 for one million atoms on 1024 nodes. 
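The shift sequences described above can be traced with a small Python sketch. It follows one node's data across a two-dimensional processor grid and computes the final shift needed to return to the origin. The coordinate convention (up is +y, right is +x) and all names are assumptions made for illustration; the sketch is not taken from the ALCMD source.

```python
# Unit displacements on the processor grid (assumed convention).
STEP = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}

# The last step of each pattern ("back to the origin") is computed, not listed.
FIVE_STEP = ["up", "right", "down", "down"]
EIGHT_STEP = ["up", "right", "down", "down", "right", "up", "up"]


def trace(moves):
    """Return the grid offsets visited, relative to the originating node."""
    x, y = 0, 0
    visited = []
    for move in moves:
        dx, dy = STEP[move]
        x, y = x + dx, y + dy
        visited.append((x, y))
    return visited


def closing_shift(moves):
    """Displacement of the final shift that brings the data back home."""
    x, y = trace(moves)[-1]
    return (-x, -y)


if __name__ == "__main__":
    print(trace(FIVE_STEP), closing_shift(FIVE_STEP))
    print(trace(EIGHT_STEP), closing_shift(EIGHT_STEP))
```

For the five-step pattern the closing shift is one node left and one node up, which matches the diagonal up-left shift described in the text.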
Because the communications are based on discrete time steps, the point-to-point visualizer 
tool is not as useful as it is in some of the NAS benchmarks. However, figures 3.17 and 3.16 
show the communications for all nodes for one million atoms and for one hundred thousand 
atoms respectively. Both runs were made on a 1024-way partition. Figure 3.18 shows the 
communication for one node for one million atoms on a 1024-way partition. 
Figure 3.1 Communication Patterns in ALCMD
3.3 GAMESS 
The General Atomic and Molecular Electronic Structure System (abbreviated GAMESS) 
(SBB+93) is a general ab initio quantum chemistry package. GAMESS is a large code (roughly 
500,000 lines of FORTRAN and a few C wrapper functions) that is used for many types of 
quantum chemistry simulations. GAMESS can do many calculations in parallel using a custom 
communications library, the distributed data interface (DDI) (SFBG00). DDI is designed to
take advantage of shared memory machines or libraries, but has UNIX-based sockets and MPI-based
implementations as well. On most machines, GAMESS starts two processes on each node:
a "data server" and a computation process. Because of the single-threaded, single-application
nature of applications that run on BlueGene/L, the easiest method of porting GAMESS to
BlueGene/L is simply to split the partition space in half. The lower MPI ranks (e.g. zero through
255 in a 512-way midplane) are computation processes and the upper MPI ranks (e.g. 256
through 511 in a midplane) are data servers.
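A minimal Python sketch of this rank split follows. The role assignment comes directly from the description above. The pairing rule (compute rank r is served primarily by rank r plus half the partition) is an assumption generalized from the single example of node zero and node 256; the function names are hypothetical and do not come from GAMESS or DDI.

```python
def role(rank, size):
    """Lower half of the ranks compute; upper half act as data servers."""
    return "compute" if rank < size // 2 else "data_server"


def primary_data_server(rank, size):
    """Data server paired with a compute rank (assumed rule: rank + size/2)."""
    half = size // 2
    if not 0 <= rank < half:
        raise ValueError("rank is not a compute process")
    return rank + half


if __name__ == "__main__":
    size = 512
    print(role(255, size), role(256, size))   # compute data_server
    print(primary_data_server(0, size))       # 256
```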
GAMESS communications are generally point-to-point, but are very all-to-all-like. Each 
compute node sends data to and receives data from every data server, perhaps at different 
time steps. 
However, each compute node sends approximately four times as many messages to its own data
server as to any other. For example, a 512-way run of quinone has a total of 30,855,575
unique sends. Of those, node zero posts 70,354. Of those 70,354 sends, 1,043 go to node 256
(its data server), while node zero sends an average of 270 times to each of the other data servers.
Similarly, node 256 (the first data server) posts 50,418 sends. Of those, 880 are to node zero.
Node 256 averages two hundred sends to each of the other 254 compute nodes. The message size is
also highly variable, though most of the communications are small messages (less than sixty-four
bytes) for both the point-to-point and collective operations.
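The "approximately four times" figure can be recomputed from the send counts quoted above. The following Python fragment is only a worked check of that arithmetic; the variable names are hypothetical.

```python
# Send counts quoted in the text for the 512-way quinone run.
node0_to_own_server = 1043       # node 0 -> node 256
node0_to_other_avg = 270         # node 0 -> each other data server (average)
server_to_own_compute = 880      # node 256 -> node 0
server_to_other_avg = 200        # node 256 -> each other compute node (average)


def locality_ratio(own, other_avg):
    """How many times more often a node talks to its partner than to others."""
    return own / other_avg


compute_ratio = locality_ratio(node0_to_own_server, node0_to_other_avg)
server_ratio = locality_ratio(server_to_own_compute, server_to_other_avg)

if __name__ == "__main__":
    print(f"{compute_ratio:.2f} {server_ratio:.2f}")   # 3.86 4.40
```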
In general, this all-to-all-like communication makes it difficult to develop efficient node 
mappings. However, there is just enough data locality between the compute node and its 
primary data server that it might be possible to get better-than-stock performance with a 
good mapping file. 
GAMESS reports total run times in seconds (based on wall-clock time). Two different 
workloads were used for node mapping performance comparisons. Both workloads perform an
MP2 (Møller-Plesset second-order perturbation theory) gradient calculation, which finds
the lowest-energy geometry configuration for the molecule being simulated. The two workloads
were penicillin and quinone. Penicillin is a popular antibiotic drug. It has 41 atoms, 88
molecular orbitals, and 385 atomic orbitals. Quinone is a precursor in some biological reactions
and is used in making dyes. It has 12 atoms, 91 molecular orbitals, and 254 atomic orbitals.
Both are reasonably good inputs for showing communications performance.
GAMESS is not an application that will likely be run a great deal on the BlueGene/L
machine, because GAMESS generally needs more memory per node than BlueGene/L
can provide. Also, there are many sequential steps in a typical GAMESS workload that require
more powerful processors than the PowerPC 440s in BlueGene/L. However, the very large amount
of total memory on the larger BlueGene/L configurations does allow some very large GAMESS
simulations that might not be possible elsewhere; they just might not run well on BlueGene/L.
Communication profiles for GAMESS are shown in Tables 3.11 and 3.12. Figure 3.19 
shows the point-to-point data flow for a penicillin workload. It shows the all-to-all nature of
the communications very well. Figure 3.20 shows the communications a typical data server 
makes. Figure 3.21 shows the communications a typical compute process makes. All three 
figures are from 512-way partitions. 
The GAMESS workloads calculate the MP2 orbitals for the electrons in the two molecules. 
The results are shown in figure 3.2 for penicillin and figure 3.3 for quinone. 
3.4 Visualization of the Communication Patterns 
This section has all of the communications pattern figures. As explained further in section 
4.3.1, the color of each arrow represents the distance from sender to receiver. The arrowhead
diameter shows the relative number of bytes (compared to the largest communication in a
complete application run) for the average of all communications between the source and
destination nodes.
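One plausible way to quantify the sender-to-receiver distance is the hop count on the torus. The Python sketch below assumes that metric (shortest path along each dimension, with wraparound); the visualizer's actual definition is given in section 4.3.1 and may differ. The function name and the 8 by 8 by 8 example dimensions are illustrative.

```python
def torus_hops(src, dst, dims):
    """Minimum hop count between two nodes on a torus.

    Assumption: distance is the sum, over each dimension, of the shorter of
    the direct path and the wraparound path.
    """
    hops = 0
    for a, b, size in zip(src, dst, dims):
        direct = abs(a - b)
        hops += min(direct, size - direct)
    return hops


if __name__ == "__main__":
    dims = (8, 8, 8)   # illustrative 512-node torus
    print(torus_hops((0, 0, 0), (7, 0, 0), dims))   # 1 (wraparound link)
    print(torus_hops((0, 0, 0), (4, 4, 4), dims))   # 12 (farthest node)
```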
Figure 3.2 Total Electron Density (RMP2 Orbitals) for Penicillin
Figure 3.3 Total Electron Density (RMP2 Orbitals) for Quinone
Figure 3.4 Visual Communication Patterns in NAS MG
Figure 3.5 Visual Communication Patterns in NAS MG for One Node
Figure 3.6 Visual Communication Patterns in NAS CG
Figure 3.7 Visual Communication Patterns in NAS CG for Two Nodes
Figure 3.8 Visual Communication Patterns in NAS IS
Figure 3.9 Visual Communication Patterns in NAS IS for Three Nodes
Figure 3.10 Visual Communication Patterns in NAS LU
Figure 3.11 Visual Communication Patterns in NAS LU for Two Nodes
Figure 3.12 Visual Communication Patterns in NAS BT
Figure 3.13 Visual Communication Patterns in NAS BT for One Node
Figure 3.14 Visual Communication Patterns in NAS SP
Figure 3.15 Visual Communication Patterns in NAS SP for One Node
Figure 3.16 Visual Communication Patterns in ALCMD for 100,000 atoms
Figure 3.17 Visual Communication Patterns in ALCMD for 1 million atoms
Figure 3.18 Visual Communication Patterns in ALCMD for 1 million atoms, One Node
Figure 3.19 Visual Communication Patterns in GAMESS
Figure 3.20 Visual Communication Patterns in GAMESS for One Data Server
Figure 3.21 Visual Communication Patterns in GAMESS for One Compute Process
3.5 Communications Profile Tables 
The following tables show the average sizes (in bytes) of communications in the application 
benchmarks. For the NAS results, only class A and class C results are presented. The tool 
used to create the table is explained more fully in section 4.3.2. 
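The rows in these tables group messages by size range and report a count, a byte total, and an average. A minimal Python sketch of that kind of binning is shown below. It is not the tool from section 4.3.2; the function name, the bin edges, and the input format (a flat list of message sizes in bytes) are assumptions for illustration.

```python
from bisect import bisect_right

# Lower edges of the size ranges (assumed; lower bound inclusive).
EDGES = [0, 32, 256, 512, 1024, 2048, 4096, 8192, 16384,
         32768, 65536, 131072, 262144, 1048576]


def profile(sizes, edges=EDGES):
    """Map (low, high) size ranges to (count, total bytes, average bytes)."""
    bins = {}
    for size in sizes:
        i = bisect_right(edges, size) - 1          # index of the range's lower edge
        low = edges[i]
        high = edges[i + 1] if i + 1 < len(edges) else None   # None: open-ended
        count, total = bins.get((low, high), (0, 0))
        bins[(low, high)] = (count + 1, total + size)
    return {
        key: (count, total, total / count)
        for key, (count, total) in sorted(bins.items(), key=lambda kv: kv[0][0])
    }


if __name__ == "__main__":
    # Two 8-byte messages and one 80-byte message.
    for key, row in profile([8, 8, 80]).items():
        print(key, row)
```

The average column is simply total bytes divided by count; for example, 1569280 bytes over 19360 events in table 3.2 gives 81.057851 bytes per event.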
Table 3.1 EP Communications Profile (Classes "A" and "C", 128 nodes)

Class A, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
N/A                                |         0 |             0 |               0
Collectives
AllReduce  0 ≤ Size < 32           |       384 |          3072 |        8.000000
AllReduce  32 ≤ Size < 256         |       128 |         10240 |       80.000000
Total MPI Calls: 512

Class C, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
N/A                                |         0 |             0 |               0
Collectives
AllReduce  0 ≤ Size < 32           |       384 |          3072 |        8.000000
AllReduce  32 ≤ Size < 256         |       128 |         10240 |       80.000000
Total MPI Calls: 512
Table 3.2 MG Communications Profile (Classes "A" and "C", 128 nodes)

Class A, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |      8000 |        128000 |       16.000000
32 ≤ Size < 256                    |     19360 |       1569280 |       81.057851
256 ≤ Size < 512                   |     11840 |       3399680 |      287.135135
512 ≤ Size < 1024                  |      3840 |       3072000 |      800.000000
1024 ≤ Size < 2048                 |      7680 |       8355840 |     1088.000000
2048 ≤ Size < 4096                 |      3840 |       9953280 |     2592.000000
4096 ≤ Size < 8192                 |      7680 |      32440320 |     4224.000000
8192 ≤ Size < 16384                |      3840 |      35512320 |     9248.000000
16384 ≤ Size < 32768               |      9728 |     161873920 |    16640.000000
32768 ≤ Size < 65536               |      4864 |     169500672 |    34848.000000
Collectives
Broadcast  0 ≤ Size < 32           |       640 |          2560 |        4.000000
Broadcast  32 ≤ Size < 256         |       128 |          4096 |       32.000000
AllReduce  0 ≤ Size < 32           |     11264 |        131072 |       11.636364
Reduce     0 ≤ Size < 32           |       128 |          1024 |        8.000000
Total MPI Calls: 738976

Class C, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |     33600 |        537600 |       16.000000
32 ≤ Size < 256                    |     81312 |       6590976 |       81.057851
256 ≤ Size < 512                   |     49728 |      14278656 |      287.135135
512 ≤ Size < 1024                  |     16128 |      12902400 |      800.000000
1024 ≤ Size < 2048                 |     32256 |      35094528 |     1088.000000
2048 ≤ Size < 4096                 |     16128 |      41803776 |     2592.000000
4096 ≤ Size < 8192                 |     32256 |     136249344 |     4224.000000
8192 ≤ Size < 16384                |     16128 |     149151744 |     9248.000000
16384 ≤ Size < 32768               |     32256 |     536739840 |    16640.000000
32768 ≤ Size < 65536               |     16128 |     562028544 |    34848.000000
65536 ≤ Size < 131072              |     34304 |    2265710592 |    66048.000000
131072 ≤ Size < 262144             |     17152 |    2318950400 |   135200.000000
Collectives
Broadcast  0 ≤ Size < 32           |       640 |          2560 |        4.000000
Broadcast  32 ≤ Size < 256         |       128 |          4096 |       32.000000
AllReduce  0 ≤ Size < 32           |     11264 |        131072 |       11.636364
Reduce     0 ≤ Size < 32           |       128 |          1024 |        8.000000
Total MPI Calls: 3409312
Table 3.3 CG Communications Profile (Classes "A" and "C", 128 nodes)

Class A, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |    434176 |       3538944 |        8.150943
4096 ≤ Size < 8192                 |    266240 |    1863680000 |     7000.000000
Collectives
Reduce     0 ≤ Size < 32           |       128 |          1024 |        8.000000
Total MPI Calls: 2101504

Class C, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |   2062336 |      16809984 |        8.150943
65536 ≤ Size < 131072              |   1264640 |   94848000000 |    75000.000000
Collectives
Reduce     0 ≤ Size < 32           |       128 |          1024 |        8.000000
Total MPI Calls: 9981184
Table 3.4 FT Communications Profile (Classes "A" and "C", 128 nodes)

Class A, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
N/A                                |         0 |             0 |               0
Collectives
Broadcast  0 ≤ Size < 32           |       256 |          2048 |        8.000000
AlltoAll   8192 ≤ Size < 16384     |      1024 |          8192 |     8192.000000
Reduce     0 ≤ Size < 32           |       768 |         12288 |       16.000000
Total MPI Calls: 2176

Class C, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
N/A                                |         0 |             0 |               0
Collectives
Broadcast  0 ≤ Size < 32           |       256 |          2048 |        8.000000
AlltoAll   131072 ≤ Size < 262144  |      2816 |        131072 |   131072.000000
Reduce     0 ≤ Size < 32           |      2560 |         40960 |       16.000000
Total MPI Calls: 5760
Table 3.5 IS Communications Profile (Classes "A" and "C", 128 nodes)

Class A, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |       127 |           508 |        4.000000
Collectives
AllReduce  4096 ≤ Size < 8192      |      1408 |       5795328 |     4116.000000
AlltoAll   0 ≤ Size < 32           |      1408 |          5632 |        4.000000
Reduce     0 ≤ Size < 32           |       256 |          1536 |        6.000000
Total MPI Calls: 3453

Class C, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |       127 |           508 |        4.000000
Collectives
AllReduce  4096 ≤ Size < 8192      |      1408 |       5795328 |     4116.000000
AlltoAll   0 ≤ Size < 32           |      1408 |          5632 |        4.000000
Reduce     0 ≤ Size < 32           |       256 |          1536 |        6.000000
Total MPI Calls: 3453
Table 3.6 LU Communications Profile (Classes "A" and "C", 128 nodes)

Class A, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
32 ≤ Size < 256                    |   3472232 |     538186368 |      154.997238
256 ≤ Size < 512                   |   3720000 |    1153200000 |      310.000000
512 ≤ Size < 1024                  |        44 |         22528 |      512.000000
16384 ≤ Size < 32768               |     56448 |    1156055040 |    20480.000000
32768 ≤ Size < 65536               |     60480 |    2477260800 |    40960.000000
Collectives
Broadcast  0 ≤ Size < 32           |      1024 |          6656 |        6.500000
Broadcast  32 ≤ Size < 256         |       128 |          5120 |       40.000000
AllReduce  0 ≤ Size < 32           |       512 |          4096 |        8.000000
AllReduce  32 ≤ Size < 256         |       512 |         20480 |       40.000000
Total MPI Calls: 15441140

Class C, 128 nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
32 ≤ Size < 256                    |       112 |         18144 |      162.000000
256 ≤ Size < 512                   |   8960120 |    3584042720 |      399.999411
512 ≤ Size < 1024                  |   9600000 |    7680000000 |      800.000000
1024 ≤ Size < 2048                 |        44 |         57024 |     1296.000000
65536 ≤ Size < 131072              |     49392 |    6401203200 |   129600.000000
131072 ≤ Size < 262144             |     52416 |   12763215360 |   243498.461538
262144 ≤ Size < 1048576            |     15120 |    4115059200 |   272160.000000
Collectives
Broadcast  0 ≤ Size < 32           |      1024 |          6656 |        6.500000
Broadcast  32 ≤ Size < 256         |       128 |          5120 |       40.000000
AllReduce  0 ≤ Size < 32           |       512 |          4096 |        8.000000
AllReduce  32 ≤ Size < 256         |       512 |         20480 |       40.000000
Total MPI Calls: 38177140
Table 3.7 BT Communications Profile (Classes "A" and "C", 128 nodes (121 used))

Class A, 128 nodes (121 used)
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
1024 ≤ Size < 2048                 |    729630 |    1050667200 |     1440.000000
8192 ≤ Size < 16384                |    729630 |    6304003200 |     8640.000000
16384 ≤ Size < 32768               |    146652 |    3971481600 |    27080.991736
Collectives
Broadcast  0 ≤ Size < 32           |       484 |          3388 |        7.000000
AllReduce  32 ≤ Size < 256         |       242 |          9680 |       40.000000
Reduce     0 ≤ Size < 32           |       121 |           968 |        8.000000
Total MPI Calls: 6155882

Class C, 128 nodes (121 used)
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
8192 ≤ Size < 16384                |    729630 |    6566670000 |     9000.000000
32768 ≤ Size < 65536               |    729630 |   39400020000 |    54000.000000
131072 ≤ Size < 262144             |    146652 |   25446182400 |   173514.049587
Collectives
Broadcast  0 ≤ Size < 32           |       484 |          3388 |        7.000000
AllReduce  32 ≤ Size < 256         |       242 |          9680 |       40.000000
Reduce     0 ≤ Size < 32           |       121 |           968 |        8.000000
Total MPI Calls: 6155882
Table 3.8 SP Communications Profile (Classes "A" and "C", 128 nodes (121 used))

Class A, 128 nodes (121 used)
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
1024 ≤ Size < 2048                 |    300750 |     558192000 |     1856.000000
2048 ≤ Size < 4096                 |   1215030 |    3344532480 |     2752.633663
4096 ≤ Size < 8192                 |   1395480 |    7935565440 |     5686.620690
16384 ≤ Size < 32768               |    291852 |    7903641600 |    27080.991736
Collectives
Broadcast  0 ≤ Size < 32           |       363 |          2904 |        8.000000
AllReduce  32 ≤ Size < 256         |       242 |          9680 |       40.000000
Reduce     0 ≤ Size < 32           |       121 |           968 |        8.000000
Total MPI Calls: 9367101

Class C, 128 nodes (121 used)
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
8192 ≤ Size < 16384                |    360900 |    5538612000 |    15346.666667
16384 ≤ Size < 32768               |   1178940 |   21768718080 |    18464.653061
32768 ≤ Size < 65536               |   1371420 |   51532477920 |    37576.000000
131072 ≤ Size < 262144             |    291852 |   50640422400 |   173514.049587
Collectives
Broadcast  0 ≤ Size < 32           |       363 |          2904 |        8.000000
AllReduce  32 ≤ Size < 256         |       242 |          9680 |       40.000000
Reduce     0 ≤ Size < 32           |       121 |           968 |        8.000000
Total MPI Calls: 9367101
Table 3.9 ALCMD Communications Profile, 100k Atoms, 1024 Nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |   7629786 |      43083514 |        5.646753
32 ≤ Size < 256                    |   1415680 |      81797120 |       57.779385
256 ≤ Size < 512                   |     22528 |      11444224 |      508.000000
512 ≤ Size < 1024                  |    112640 |     102809600 |      912.727273
2048 ≤ Size < 4096                 |   2483197 |    8867483936 |     3570.994946
4096 ≤ Size < 8192                 |   2152704 |   15567052800 |     7231.394934
8192 ≤ Size < 16384                |      3840 |      32342016 |     8422.400000
32768 ≤ Size < 65536               |     10240 |     335544320 |    32768.000000
Collectives
Total MPI Calls: 41493893
Table 3.10 ALCMD Communications Profile, 1m Atoms, 1024 Nodes
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |   4832218 |      30582522 |        6.328879
32 ≤ Size < 256                    |   1473280 |      89047040 |       60.441355
512 ≤ Size < 1024                  |     10240 |      10240000 |     1000.000000
2048 ≤ Size < 4096                 |      3069 |      12276000 |     4000.000000
4096 ≤ Size < 8192                 |     10240 |      47185920 |     4608.000000
8192 ≤ Size < 16384                |    102400 |     926515200 |     9048.000000
16384 ≤ Size < 32768               |    114752 |    3116187648 |    27155.846068
32768 ≤ Size < 65536               |   1140672 |   41335635968 |    36237.968468
65536 ≤ Size < 131072              |    921600 |   66709094400 |    72384.000000
Collectives
Total MPI Calls: 25827461
Table 3.11 GAMESS Communications Profile - Quinone
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |  15429324 |     370303768 |       23.999999
32 ≤ Size < 64                     |  10127616 |     469984680 |       46.406250
128 ≤ Size < 256                   |   1477875 |     281610000 |      190.550622
256 ≤ Size < 512                   |    545805 |     158544000 |      290.477368
512 ≤ Size < 1024                  |    690750 |     538731000 |      779.921824
1024 ≤ Size < 2048                 |    840991 |    1131868408 |     1345.874579
2048 ≤ Size < 4096                 |        11 |         33616 |     3056.000000
4096 ≤ Size < 8192                 |    144612 |    1104795952 |     7639.725279
8192 ≤ Size < 16384                |    584195 |    5364104144 |     9182.043913
16384 ≤ Size < 32768               |    124032 |    3789799080 |    30555.010642
32768 ≤ Size < 65536               |    809391 |   34819581152 |    43019.481502
65536 ≤ Size < 131072              |      2118 |     173823752 |    82069.760151
131072 ≤ Size < 262144             |     34155 |    5150300760 |   150792.000000
262144 ≤ Size < 1048576            |     44700 |   13874455920 |   310390.512752
Collectives
Broadcast  0 ≤ Size < 32           |     16640 |         66560 |        4.000000
Broadcast  256 ≤ Size < 512        |      8192 |       2621440 |      320.000000
Broadcast  512 ≤ Size < 1024       |       512 |        471040 |      920.000000
Broadcast  1024 ≤ Size < 2048      |      2048 |       3862528 |     1886.000000
Broadcast  2048 ≤ Size < 4096      |     11520 |      43057152 |     3737.600000
Broadcast  4096 ≤ Size < 8192      |    558592 |    4559093760 |     8161.759853
Broadcast  65536 ≤ Size < 131072   |       768 |      69341184 |    90288.000000
AllReduce  0 ≤ Size < 32           |      4352 |         36864 |        8.470588
AllReduce  32 ≤ Size < 64          |      2048 |         90112 |       44.000000
AllReduce  64 ≤ Size < 128         |      4096 |        376832 |       92.000000
AllReduce  128 ≤ Size < 256        |      2304 |        327680 |      142.222222
AllReduce  512 ≤ Size < 1024       |       512 |        307200 |      600.000000
AllReduce  1024 ≤ Size < 2048      |      1024 |       2007040 |     1960.000000
AllReduce  16384 ≤ Size < 32768    |       512 |      12607488 |    24624.000000
AllReduce  65536 ≤ Size < 131072   |     20224 |    1750560768 |    86558.582278
AllReduce  131072 ≤ Size < 262144  |     10240 |    1342177280 |   131072.000000
Total MPI Calls: 62359598
Table 3.12 GAMESS Communications Profile - Penicillin
Point-to-Point
Message Size                       |     Count |   Total Bytes | Avg Bytes/Event
0 ≤ Size < 32                      |  38774774 |     930594568 |       24.000000
64 ≤ Size < 128                    |  17449380 |    2093925600 |      120.000000
128 ≤ Size < 256                   |   7367516 |     943042048 |      128.000000
256 ≤ Size < 512                   |   2495700 |    1197936000 |      480.000000
512 ≤ Size < 1024                  |   2790700 |    1814877760 |      650.330655
1024 ≤ Size < 2048                 |   1001852 |    1915178560 |     1911.638206
2048 ≤ Size < 4096                 |   2833573 |    7372876392 |     2601.971572
4096 ≤ Size < 8192                 |    335627 |    1478608560 |     4405.511356
8192 ≤ Size < 16384                |        22 |        263736 |    11988.000000
16384 ≤ Size < 32768               |      7868 |     246213880 |    31293.070666
32768 ≤ Size < 65536               |   1963602 |   71368788744 |    36345.852542
65536 ≤ Size < 131072              |       163 |      15940584 |    97794.993865
131072 ≤ Size < 262144             |   2334082 |  395250843240 |   169338.884941
262144 ≤ Size < 1048576            |    181893 |  101198929832 |   556365.169809
Size ≥ 1048576                     |      9723 |   11529533400 |  1185800.000000
Collectives
Broadcast  0 ≤ Size < 32           |     23552 |         94208 |        4.000000
Broadcast  32 ≤ Size < 64          |      1024 |         32768 |       32.000000
Broadcast  256 ≤ Size < 512        |     12800 |       4096000 |      320.000000
Broadcast  512 ≤ Size < 1024       |       512 |        380928 |      744.000000
Broadcast  1024 ≤ Size < 2048      |       512 |        899072 |     1756.000000
Broadcast  2048 ≤ Size < 4096      |      1280 |       4026368 |     3145.600000
Broadcast  4096 ≤ Size < 8192      |    975616 |    7957526528 |     8156.412490
Broadcast  131072 ≤ Size < 262144  |      3584 |     749371392 |   209088.000000
AllReduce  0 ≤ Size < 32           |     10496 |         61440 |        5.853659
AllReduce  32 ≤ Size < 64          |      2048 |         90112 |       44.000000
AllReduce  64 ≤ Size < 128         |      4096 |        376832 |       92.000000
AllReduce  128 ≤ Size < 256        |      1280 |        172032 |      134.400000
AllReduce  512 ≤ Size < 1024       |       512 |        503808 |      984.000000
AllReduce  2048 ≤ Size < 4096      |       768 |       2365440 |     3080.000000
AllReduce  4096 ≤ Size < 8192      |      2560 |      15749120 |     6152.000000
AllReduce  32768 ≤ Size < 65536    |       512 |      30576640 |    59720.000000
AllReduce  65536 ≤ Size < 131072   |     19968 |    1505742848 |    75407.794872
AllReduce  131072 ≤ Size < 262144  |     63744 |    8355053568 |   131072.000000
Total MPI Calls: 156222678
4 Mappings and Map Generation Tools 
Several different mapping methods were investigated for this thesis. The first mapping 
is a simple permutation of the physical torus coordinates and is available to all applications 
running on BlueGene/L. All other mappings require a map file be provided to the control 
system. This is explained further in section 4.2. 
The second mapping (and first mapping to require a map file) is a Gray-code based mesh 
mapping. The third set of mappings is similar to Gray-code meshes but is meant to keep
physically-neighboring nodes close together while embedding a mesh-like topology in the torus.
The fourth set of maps consists of application-specific mappings designed to improve
performance of an application that is not necessarily meant to run on a machine like
BlueGene/L. For comparison,
a random mapping is also provided. 
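For readers unfamiliar with Gray codes, the short Python sketch below generates the standard binary-reflected Gray code, in which consecutive values differ in exactly one bit. It illustrates only that property; how the Gray-code mesh mapping in this thesis is actually constructed is not shown here, and the function name is hypothetical.

```python
def gray(i):
    """Binary-reflected Gray code of a non-negative integer."""
    return i ^ (i >> 1)


def differ_by_one_bit(a, b):
    """True if a and b differ in exactly one bit position."""
    return bin(a ^ b).count("1") == 1


if __name__ == "__main__":
    codes = [gray(i) for i in range(8)]
    print(codes)   # [0, 1, 3, 2, 6, 7, 5, 4]
    print(all(differ_by_one_bit(a, b) for a, b in zip(codes, codes[1:])))
```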
4.1 Stock Mappings 
The BlueGene/L control system allows twelve different standard mappings. They are the 
six permutations of X, Y, and Z coordinates with either the processor core first or last (when 
in virtual node mode). Generally, the processor core is referred to by the letter 'T'. 
The permutations in virtual node mode are: XYZT, XZYT, YXZT, YZXT, ZXYT, ZYXT and TXYZ,
TXZY, TYXZ, TYZX, TZXY, TZYX. In coprocessor and heater mode, only the first six mappings
are available. Table 4.1 shows the arrangement of MPI ranks and corresponding physical 
coordinates for an eight-way partition arranged as two by two by two. Tables 4.2 and 4.3 show 
the same eight-way partition booted in virtual node mode. 
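The twelve mapping names and the rank-to-coordinate rule can be sketched in Python. The convention used here, that the rightmost letter of the mapping name varies fastest and that coordinates are listed in x, y, z, t order, is an assumption inferred from the rows of tables 4.1 through 4.3; it is not taken from the control system's documentation. The function names are hypothetical.

```python
from itertools import permutations


def stock_mapping_names():
    """Six permutations of X, Y, Z with the core letter T last or first."""
    xyz = ["".join(p) for p in permutations("XYZ")]
    return [p + "T" for p in xyz] + ["T" + p for p in xyz]


def rank_to_coord(rank, order, dims):
    """Return (x, y, z, t) for an MPI rank under a stock mapping.

    Assumption: the rightmost letter in `order` varies fastest.
    `dims` gives the extent of each dimension, e.g. {"X": 2, "Y": 2, "Z": 2, "T": 1}.
    """
    coord = {}
    for letter in reversed(order):
        coord[letter] = rank % dims[letter]
        rank //= dims[letter]
    return tuple(coord[c] for c in "XYZT")


if __name__ == "__main__":
    coprocessor = {"X": 2, "Y": 2, "Z": 2, "T": 1}
    virtual_node = {"X": 2, "Y": 2, "Z": 2, "T": 2}
    print(stock_mapping_names())
    print(rank_to_coord(1, "XYZT", coprocessor))    # (0, 0, 1, 0)
    print(rank_to_coord(9, "TXZY", virtual_node))   # (0, 1, 0, 1)
```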
Table 4.1 Rank to Physical Coordinates, Coprocessor/Heater Mode
       Phys. Coord by Mapping Type
Rank | XYZT    | XZYT    | YXZT    | YZXT    | ZXYT    | ZYXT
   0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0 | 0 0 0 0
   1 | 0 0 1 0 | 0 1 0 0 | 0 0 1 0 | 1 0 0 0 | 0 1 0 0 | 1 0 0 0
   2 | 0 1 0 0 | 0 0 1 0 | 1 0 0 0 | 0 0 1 0 | 1 0 0 0 | 0 1 0 0
   3 | 0 1 1 0 | 0 1 1 0 | 1 0 1 0 | 1 0 1 0 | 1 1 0 0 | 1 1 0 0
   4 | 1 0 0 0 | 1 0 0 0 | 0 1 0 0 | 0 1 0 0 | 0 0 1 0 | 0 0 1 0
   5 | 1 0 1 0 | 1 1 0 0 | 0 1 1 0 | 1 1 0 0 | 0 1 1 0 | 1 0 1 0
   6 | 1 1 0 0 | 1 0 1 0 | 1 1 0 0 | 0 1 1 0 | 1 0 1 0 | 0 1 1 0
   7 | 1 1 1 0 | 1 1 1 0 | 1 1 1 0 | 1 1 1 0 | 1 1 1 0 | 1 1 1 0
Table 4.2: Rank to Physical Mapping, Virtual Node Mode 
Mapping Type 
XYZT XZYT YXZT YZXT 
Rank Phys. Coord Rank Phys. Coord Rank Phys. Coord Rank Phys. Coord 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
1 0 0 0 1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 1 
2 0 0 1 0 2 0 1 0 0 2 0 0 1 0 2 1 0 0 0 
3 0 0 1 1 3 0 1 0 1 3 0 0 1 1 3 1 0 0 1 
4 0 1 0 0 4 0 0 1 0 4 1 0 0 0 4 0 0 1 0 
5 0 1 0 1 5 0 0 1 1 5 1 0 0 1 5 0 0 1 1 
6 0 1 1 0 6 0 1 1 0 6 1 0 1 0 6 1 0 1 0 
7 0 1 1 1 7 0 1 1 1 7 1 0 1 1 7 1 0 1 1 
8 1 0 0 0 8 1 0 0 0 8 0 1 0 0 8 0 1 0 0 
9 1 0 0 1 9 1 0 0 1 9 0 1 0 1 9 0 1 0 1 
10 1 0 1 0 10 1 1 0 0 10 0 1 1 0 10 1 1 0 0 
11 1 0 1 1 11 1 1 0 1 11 0 1 1 1 11 1 1 0 1 
12 1 1 0 0 12 1 0 1 0 12 1 1 0 0 12 0 1 1 0 
13 1 1 0 1 13 1 0 1 1 13 1 1 0 1 13 0 1 1 1 
14 1 1 1 0 14 1 1 1 0 14 1 1 1 0 14 1 1 1 0 
15 1 1 1 1 15 1 1 1 1 15 1 1 1 1 15 1 1 1 1 
Mapping Type (cont) 
ZXYT ZYXT 
Rank Phys. Coord Rank Phys. Coord 
0 0 0 0 0 0 0 0 0 0 
1 0 0 0 1 1 0 0 0 1 
2 0 1 0 0 2 1 0 0 0 
3 0 1 0 1 3 1 0 0 1 
4 1 0 0 0 4 0 1 0 0 
5 1 0 0 1 5 0 1 0 1 
6 1 1 0 0 6 1 1 0 0 
7 1 1 0 1 7 1 1 0 1 
8 0 0 1 0 8 0 0 1 0 
9 0 0 1 1 9 0 0 1 1 
10 0 1 1 0 10 1 0 1 0 
11 0 1 1 1 11 1 0 1 1 
12 1 0 1 0 12 0 1 1 0 
13 1 0 1 1 13 0 1 1 1 
14 1 1 1 0 14 1 1 1 0 
15 1 1 1 1 15 1 1 1 1 
Table 4.3: Rank to Physical Mapping, Virtual Node Mode 
Mapping Type 
TXYZ TXZY TYXZ TYZX 
Rank Phys. Coord Rank Phys. Coord Rank Phys. Coord Rank Phys. Coord 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
1 0 0 1 0 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 
2 0 1 0 0 2 0 0 1 0 2 1 0 0 0 2 0 0 1 0 
3 0 1 1 0 3 0 1 1 0 3 1 0 1 0 3 1 0 1 0 
4 1 0 0 0 4 1 0 0 0 4 0 1 0 0 4 0 1 0 0 
5 1 0 1 0 5 1 1 0 0 5 0 1 1 0 5 1 1 0 0 
6 1 1 0 0 6 1 0 1 0 6 1 1 0 0 6 0 1 1 0 
7 1 1 1 0 7 1 1 1 0 7 1 1 1 0 7 1 1 1 0 
8 0 0 0 1 8 0 0 0 1 8 0 0 0 1 8 0 0 0 1 
9 0 0 1 1 9 0 1 0 1 9 0 0 1 1 9 1 0 0 1 
10 0 1 0 1 10 0 0 1 1 10 1 0 0 1 10 0 0 1 1 
11 0 1 1 1 11 0 1 1 1 11 1 0 1 1 11 1 0 1 1 
12 1 0 0 1 12 1 0 0 1 12 0 1 0 1 12 0 1 0 1 
13 1 0 1 1 13 1 1 0 1 13 0 1 1 1 13 1 1 0 1 
14 1 1 0 1 14 1 0 1 1 14 1 1 0 1 14 0 1 1 1 
15 1 1 1 1 15 1 1 1 1 15 1 1 1 1 15 1 1 1 1 
Mapping Type (cont)
TZXY TZYX
Rank Phys. Coord Rank Phys. Coord
0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 1 1 0 0 0
2 1 0 0 0 2 0 1 0 0
3 1 1 0 0 3 1 1 0 0 
4 0 0 1 0 4 0 0 1 0 
5 0 1 1 0 5 1 0 1 0 
6 1 0 1 0 6 0 1 1 0 
7 1 1 1 0 7 1 1 1 0 
8 0 0 0 1 8 0 0 0 1 
9 0 1 0 1 9 1 0 0 1 
10 1 0 0 1 10 0 1 0 1 
11 1 1 0 1 11 1 1 0 1 
12 0 0 1 1 12 0 0 1 1 
13 0 1 1 1 13 1 0 1 1 
14 1 0 1 1 14 0 1 1 1 
15 1 1 1 1 15 1 1 1 1 
4.2 Other Mappings 
The BG/L control system allows a user-specified file to be used for node mappings. The
control system requires all physical coordinates in the block to be assigned uniquely. For
example, a 512-way midplane is an eight by eight by eight configuration. Any mapping file for
a 512-way midplane must have 512 entries, with X, Y, and Z values each between zero and
seven. Every (X, Y, Z) coordinate triple must be used exactly once. In virtual node mode,
every (X, Y, Z) triple must appear with both processor cores ('T'=0 and 'T'=1) for a total
of 1024 entries.
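The uniqueness requirement can be checked mechanically before a map is handed to the control system. The sketch below is a minimal illustration in Python. It assumes the map has already been read into a list of (x, y, z, t) tuples, one per rank; the on-disk format of the mapping file is not modeled, and the name `validate_mapping` is hypothetical.

```python
from itertools import product

def validate_mapping(entries, dims, cores=1):
    """Return True if `entries` assigns every physical coordinate once.

    entries: list of (x, y, z, t) tuples, index = MPI rank.
    dims:    (nx, ny, nz) extent of the block, e.g. (8, 8, 8).
    cores:   1 for coprocessor mode, 2 for virtual node mode.
    """
    expected = set(product(range(dims[0]), range(dims[1]),
                           range(dims[2]), range(cores)))
    if len(entries) != len(expected):
        return False                  # wrong number of entries
    return set(entries) == expected   # no duplicates, nothing out of range

# A 512-way midplane in coprocessor mode, listed in a simple nested order.
midplane = [(x, y, z, 0) for x, y, z in product(range(8), repeat=3)]
print(validate_mapping(midplane, (8, 8, 8)))           # valid map
print(validate_mapping(midplane, (8, 8, 8), cores=2))  # too few entries for VNM
```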
As stated previously, any mapping which is not a stock permutation of the physical
coordinates requires a mapping file. To facilitate this, a map-generating program called
mapmaker was written. mapmaker will generate maps for any partition size in any orientation
(if appropriate for the map in question). It can generate different maps if virtual node
mode is being used or if a mesh (rather than a torus) is being booted.
4.2.1 Gray-code based maps 
When hypercubic-topology machines were more popular (e.g., the nCube and MasPar in
the early 1990s and earlier machines), several papers were written on embedding algorithms and
topologies into higher-order hypercubes, i.e., embedding rings, meshes, tori, etc. in hypercubes.
This is because any (n - 1)-cube can be embedded in an n-cube (SS88). (It is also possible to
embed trees in hypercubes. This is presented in (Wu85).) To embed different order graphs in
hypercubes, the nodes in the hypercube must be numbered such that two neighbor nodes differ
by only one bit. In a sixteen-node machine, the nodes would be numbered: 0000, 0001, 0011,
0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000. This sequence is
a Gray code. Specifically, this is a binary-reflected Gray code.
Gray codes were originally utilized in the early 1950s by Frank Gray at Bell Labs (Gra53)
in rotary shaft encoders. Rotary shaft encoders are used to determine the physical position of
a control switch electro-mechanically by making and breaking switch contacts. Without
something like Gray codes, a change from one position to another can flip multiple bits. For
example, assume a rotary encoder has sixteen positions numbered zero through fifteen. The
device outputs a four-bit binary number representing the current position. When the user
changes state from (say) position seven (binary 0111) to position eight (binary 1000), all four
bits change state. The logic using the switch position value might detect any one of several
intermediate states (for example, the shaft decoding logic might go from $0111_2$ to $1111_2$
then $1110_2$ then $1100_2$ then $1000_2$) depending on how fast the decoding logic is versus
how fast the shaft rotates. In Gray codes, only one bit is flipped between adjacent positions
so there is far less chance of incorrect intermediate states being detected.
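The difference between the two encodings can be checked with a few lines of Python. This is a minimal sketch: the helper names are illustrative, and the expression `i ^ (i >> 1)` is the standard closed form for the binary-reflected Gray code rather than anything specific to this thesis.

```python
def gray(i):
    """Binary-reflected Gray code of the integer i."""
    return i ^ (i >> 1)

def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# Moving from position seven to position eight:
print(bits_changed(7, 8))               # plain binary, 0111 -> 1000: prints 4
print(bits_changed(gray(7), gray(8)))   # Gray code,    0100 -> 1100: prints 1
```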
There are many methods of generating Gray codes. The most common method is the
so-called binary-reflected Gray code. This is generated recursively. Start with the one-bit
code: 0 and 1. Now add a zero in front of all elements in the current code. Then copy the
current code in reverse order and change the front zero to a one. So, the two-bit code is
00, 01, 11, 10. The three-bit sequence will start with a zero-padded two-bit code: 000, 001,
011, 010, then change the leading zero to a one and reverse the order: 110, 111, 101, 100. The
three-bit code is therefore 000, 001, 011, 010, 110, 111, 101, 100.
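The recursive construction translates almost directly into code. The following Python sketch (the function name is illustrative) builds the code as a list of bit strings, prefixing the shorter code with zeros and then appending its reversed copy prefixed with ones.

```python
def reflected_gray(bits):
    """Return the binary-reflected Gray code on `bits` bits as bit strings."""
    if bits == 1:
        return ["0", "1"]
    shorter = reflected_gray(bits - 1)
    zero_half = ["0" + code for code in shorter]           # zero-padded copy
    one_half = ["1" + code for code in reversed(shorter)]  # reflected copy
    return zero_half + one_half

print(reflected_gray(2))
print(reflected_gray(3))
```

With `bits` set to four, the function returns the sixteen-node numbering listed at the start of this section.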
As stated previously, Gray codes make it possible to map lower-dimension n-cube graphs
into higher-order m-cube graphs. The easiest case is embedding a one-dimensional line with
wrap-around (i.e., the last node connects to node zero), commonly referred to as a ring. A
ring with any even number of nodes, l, less than or equal to the number of nodes in the
n-cube ($2^n$) can be embedded in the n-cube easily. The procedure is outlined in (SS88) and
involves constructing a reflected Gray code using the first (l - 2)/2 elements of the
(n - 1)-bit Gray code. When $l = 2^n$, the procedure constructs the normal reflected sequence.
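A small Python sketch of a ring embedding of this kind follows. It is one reading of the procedure, not a transcription of (SS88): it takes the first l/2 codes of the (n - 1)-bit reflected Gray code (elements zero through (l - 2)/2), prefixes them with a zero, and then appends the same codes in reverse order prefixed with a one. The function names are illustrative.

```python
def reflected_gray(bits):
    """Binary-reflected Gray code on `bits` bits (an empty code for 0 bits)."""
    if bits == 0:
        return [""]
    shorter = reflected_gray(bits - 1)
    return ["0" + c for c in shorter] + ["1" + c for c in reversed(shorter)]

def embed_ring(length, n):
    """Return `length` n-cube node labels that form a ring.

    Assumption: `length` is even and no larger than 2**n.
    Consecutive labels (including last to first) differ in exactly one bit.
    """
    if length % 2 or not 2 <= length <= 2 ** n:
        raise ValueError("ring length must be even and at most 2**n")
    half = reflected_gray(n - 1)[: length // 2]
    return ["0" + c for c in half] + ["1" + c for c in reversed(half)]

print(embed_ring(6, 3))   # a six-node ring inside a three-cube
print(embed_ring(8, 3))   # the full three-bit reflected sequence
```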
Embedding higher-order graphs (two-dimensional tori, three-dimensional tori, etc.) is only
slightly more complicated. Figure 4.1 shows the standard BlueGene/L 128-way partition, an
eight by four by four torus. The nodes along each axis are numbered as Gray codes. The X
axis has eight nodes and is represented by a three-bit Gray code. The Y and Z axes have four
nodes and are therefore represented by two-bit Gray codes.
Figure 4.1 BlueGene/L 128-way partition
Figure 4.2 shows a sixteen by eight two-dimensional mesh. Again the axes are numbered with
Gray codes (X with a four-bit Gray code and Y with a three-bit Gray code). The procedure for
embedding the mesh in a higher-order hypercube is to realize that any node in the
higher-order hypercube (in this case the eight by four by four torus with wrap-around) can be
written as $n = x_1 x_2 x_3 y_1 y_2 z_1 z_2$. The mesh nodes can be written as
$m = x_1 x_2 x_3 x_4 y_1 y_2 y_3$. The mapping is then such that the X coordinate in the torus
is the first three bits of the X coordinate in the mesh. The Y coordinate in the torus is the
fourth bit of X in the mesh and the first bit of the Y coordinate of the mesh. The Z
coordinate in the torus is the second and third bits of Y in the mesh.
Figure 4.2 Sixteen by Eight Mesh
The mapmaker program can split apart the combined mesh node number (i.e.,
$m = x_1 x_2 x_3 x_4 y_1 y_2 y_3$ in this case) in X, Y, or Z first ordering. In virtual node
mode, the processor core can be first or last in the split. These four maps are shown in
table 4.4 for a node picked at random. Element 326 in the list of Gray codes for a midplane in
virtual node mode (1024 nodes total which requires a 10-bit Gray code) is $0111100101_2$ which
corresponds to node 997. The normal location for node 997 is (4, 4, 7, 1).
Table 4.4 Gray code mesh mappings
Rank: 326 | 0111100101 (binary)
Split   X          Y          Z          T
TXYZ    7 (111)    4 (100)    5 (101)    0
XYZT    3 (011)    6 (110)    2 (010)    1
YZXT    2 (010)    3 (011)    6 (110)    1
ZXYT    6 (110)    2 (010)    3 (011)    1
mapmaker allows any valid power-of-two mesh to be constructed with all three different mesh
orientations. For example, a 512-way partition can be decomposed into eight different meshes:
256x2, 128x4, 64x8, 32x16, 16x32, 8x64, 4x128, and 2x256. Any one of these meshes can be
constructed as an X oriented, Y oriented, or Z oriented mesh (or with T first in virtual node
mode). This presents a very large number of possible mappings.
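The splits in Table 4.4 can be reproduced by taking the Gray code for a rank and cutting its bits into fields, most significant bits first, in the order named by the split. The Python sketch below does this for rank 326. The function names and the `bits` dictionary are illustrative, and the closed form `i ^ (i >> 1)` is assumed to match the Gray code list that mapmaker builds.

```python
def gray(i):
    """Binary-reflected Gray code of the integer i."""
    return i ^ (i >> 1)

def split_gray(rank, order, bits):
    """Split the Gray code of `rank` into (X, Y, Z, T) fields.

    order: split name such as "TXYZ"; the first letter takes the
           most significant bits.
    bits:  number of bits per field, e.g. {"X": 3, "Y": 3, "Z": 3, "T": 1}.
    """
    code = gray(rank)
    shift = sum(bits[axis] for axis in order)
    field = {}
    for axis in order:
        shift -= bits[axis]
        field[axis] = (code >> shift) & ((1 << bits[axis]) - 1)
    return field["X"], field["Y"], field["Z"], field["T"]

# Midplane in virtual node mode: three bits each for X, Y, Z and one for T.
midplane_bits = {"X": 3, "Y": 3, "Z": 3, "T": 1}
for split in ("TXYZ", "XYZT", "YZXT", "ZXYT"):
    print(split, split_gray(326, split, midplane_bits))
```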
4.2.2 Other mesh-like mappings 
Another mapping type mapmaker will generate is an "unfolded" "mesh". This map attempts
to turn the three-dimensional torus into a two-dimensional torus but with fewer physically
neighboring nodes moving apart. In the Gray-code mappings, nodes that are neighbors in
"Gray-code space" are physical neighbors only when the embedding is into a fully-connected
hypercube. On the torus, the Gray-code mapping implemented here moves some of those nodes
apart (to preserve the single-bit-flip nature of Gray codes).
In the X direction case, the MPI ranks count up from zero at physical location (0, 0, 0) 
to seven at (7, 0, 0), assuming a 512-way eight-by-eight-by-eight configuration. Rank eight is 
at (7, 1, 0). The ranks then increase as the X coordinate decreases so rank fifteen is located 
at (0, 1, 0). This creates a line that is sixteen nodes long. In a torus, the next line starts 
at (0, 3, 0), proceeds along the X axis with the X coordinate increasing, then drops down to 
(7, 2, 0) and heads back to (0, 2, 0). This is shown in figure 4.3 located at the end of the chapter 
in section 4.4. As in the other map types, this mapping is possible when oriented with respect 
to Y and Z as well as X. These two maps are shown in Figures 4.4 and 4.5. 
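The first two sixteen-node lines of the X-direction unfolding can be written out as a short Python sketch. Only the thirty-two ranks described above are generated; how the pattern continues through the remaining rows and planes is not stated here and is not guessed at. The function name is illustrative.

```python
def unfold_x_first_lines(nx=8):
    """Coordinates of the first 4*nx ranks of the X-direction unfolding.

    Each line runs out along X in one row and returns in the partner row.
    The row pairs (0, 1) and (3, 2) are the ones described in the text.
    """
    coords = []
    for y_out, y_back in [(0, 1), (3, 2)]:
        coords += [(x, y_out, 0) for x in range(nx)]             # X increasing
        coords += [(x, y_back, 0) for x in reversed(range(nx))]  # X decreasing
    return coords

for rank, coord in enumerate(unfold_x_first_lines()):
    print(rank, coord)
```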
This unfolding is also available in virtual node mode in one of three forms based on when
the core ID changes in relation to the physical coordinates changing. The first form increases
the core ID first, then unfolds. Rank zero is at the origin as usual. In the X case, rank one
is processor zero's second core, (0, 0, 0, 1). Rank two is at (1, 0, 0, 0). Rank three is at
(1, 0, 0, 1), and so on. This X case is shown in figure 4.6. The Y and Z cases are essentially
the same, but unfold along the Y or Z axis instead. These mappings are referred to as
"unfold-corefirst" or "unfold-CF" in the results section. The second form proceeds as in the
coprocessor cases: along X, Y, or Z, then "up", then back along X, Y, or Z. It then skips over
the next "up" direction and proceeds along the main axis, "down", then back. The next ranks
are then the virtual node mode partners, and the same sequence (along the axis, "up", back,
skip, along the axis, "down", back) is repeated. Figure 4.7 shows this mapping for an X split.
The Y and Z cases are essentially the same except that they unfold along the Y and Z
dimensions. This mapping is referred to as "unfold-corelast" or "unfold-CL" in the results
section.
The third form proceeds along the orientation axis then goes "up" in processor core and 
comes back. So, MPI rank zero is at physical location (0, 0, 0, 0). MPI rank one is at (1, 0, 0, 0) 
in the X case. MPI rank seven is at (7, 0, 0, 0) and MPI rank eight then is at (7, 0, 0, 1). Now, 
the ranks decrease in X so rank fifteen is at (0, 0, 0, 1). The "skipping" part does not happen. 
Rank sixteen is at (0, 1, 0, 1). Ranks increase along X, drop "down" in processor core ID, then 
back track along X so rank thirty-one is at physical coordinates (0, 1, 0, 0). Figure 4.8 shows 
this arrangement graphically. The Y and Z cases are laid out similarly, with the unfold along
the Y and Z dimensions. This mapping is referred to as "unfold-coreplanes" or "unfold-CP"
in the results section.
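The "unfold-coreplanes" ordering for the X case can be sketched in Python as follows. The first thirty-two ranks follow the description above exactly. Continuing the same alternation for further rows is an assumption of this sketch, and the function name is illustrative.

```python
def unfold_coreplanes_x(nx=8, rows=2):
    """(x, y, z, t) coordinates for the first rows*2*nx ranks, X case.

    Even rows go out on core 0 and come back on core 1; odd rows go out
    on core 1 and come back on core 0. Rows beyond the first two are an
    assumed continuation of the pattern.
    """
    coords = []
    for y in range(rows):
        out_core, back_core = (0, 1) if y % 2 == 0 else (1, 0)
        coords += [(x, y, 0, out_core) for x in range(nx)]
        coords += [(x, y, 0, back_core) for x in reversed(range(nx))]
    return coords

for rank, coord in enumerate(unfold_coreplanes_x()):
    print(rank, coord)
```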
4.2.3 Lower/upper MPI Rank split maps
Because BlueGene/L is not necessarily a general-purpose machine, it is interesting to see if 
one can improve performance for an application that is not ideally suited for BlueGene/L. One 
such application discussed previously is the quantum chemistry code, GAMESS. (See section 
3.3 for an explanation of GAMESS and why it is not ideally suited for BlueGene/L.) 
There are three different mapping types that were added to mapmaker to see if they might 
exploit the minimal compute node to primary data server preference discussed previously. 
Recall the GAMESS application splits the node space in half; the lower MPI ranks are
considered compute servers and the upper MPI ranks are considered data servers. In general,
a compute node communicates with all data servers but favors its upper-half partner (e.g.,
rank zero favors rank 256 in a 512-way partition, rank one favors rank 257, etc.).
mapmaker has three maps available to attempt to exploit this. Each map allows splitting
in any of three dimensions (X, Y, or Z).
The first map type (called "sheets") arranges adjacent planes in the partition as neighbors.
For example, in a 512-way partition arranged as an eight by eight by eight torus, an
X-directional sheet mapping assigns MPI ranks zero through sixty-three to the YZ plane at
X = 0. The YZ plane at X = 1 has MPI ranks 256-319. The YZ plane at X = 2 has MPI ranks
64-127. The YZ plane at X = 3 has MPI ranks 320-383. This pattern carries over to the X = 4,
X = 5, X = 6, and X = 7 planes. This is represented in figure 4.9.
A Y-directional sheet mapping assigns MPI ranks zero through sixty-three to the XZ plane
at Y = 0. The XZ plane at Y = 1 has MPI ranks 256-319. Again, this carries over to the
other six planes. Figure 4.10 shows this more clearly.
As with the other mappings discussed, it is also possible to specify a Z-directional sheet
mapping where the XY plane at Z = 0 has MPI ranks zero through sixty-three and the plane at
Z = 1 has MPI ranks 256-319. This continues for the other six planes. Figure 4.11 shows this
mapping.
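The plane-to-rank pattern of the sheet mappings can be summarized in a few lines of Python. The sketch only computes which block of ranks lands on each plane; the order of ranks within a plane is not specified here and is not modeled. The function name and default arguments are illustrative.

```python
def sheet_rank_range(plane, plane_size=64, total=512):
    """Ranks assigned to plane number `plane` in a sheet mapping.

    Even planes hold compute ranks (lower half of the rank space) and odd
    planes hold the matching data servers (upper half).
    """
    base = (plane // 2) * plane_size
    if plane % 2 == 1:
        base += total // 2
    return range(base, base + plane_size)

for plane in range(8):
    ranks = sheet_rank_range(plane)
    print(plane, ranks[0], ranks[-1])
```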
The second GAMESS-specific mapping is called a "Plus-One" mapping. In an X "Plus-
One" mapping, MPI rank zero (at location (0,0,0)) has its data server (rank 256) at location 
(1, 0, 0), i.e. the next X location. Rank one is located at (2, 0, 0) while its data server (rank 
257) is located at (3, 0, 0). This is shown in figure 4.12. 
"Plus-One" in the Y direction is similar. Rank zero is still located at (0, 0, 0). Its data 
server - rank 256 - is located at (0, 1, 0). Similarly, rank one is now located at (0, 2, 0) and rank 
257 is at (0, 3, 0). See figure 4.13 for details. 
Finally, "Plus-One" in the Z direction retains rank zero at the origin, (0, 0, 0), with its
data server at (0, 0, 1). Rank one is at (0, 0, 2) and rank 257 is at (0, 0, 3). See figure 4.14.
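An X "Plus-One" placement can be sketched in Python as follows. The first pairs match the locations given above. The way the pattern continues once a row along X is full (next Y, then next Z) is an assumption of this sketch, and the function name is illustrative.

```python
def plus_one_x(rank, dims=(8, 8, 8)):
    """Physical (x, y, z) location of `rank` in an X "Plus-One" mapping.

    Each compute rank is followed immediately along X by its data server.
    Assumption: after a row is full, placement continues in Y, then in Z.
    """
    nx, ny, nz = dims
    half = nx * ny * nz // 2
    pair = rank % half                       # compute/data-server pair index
    slot = 2 * pair + (1 if rank >= half else 0)
    return (slot % nx, (slot // nx) % ny, slot // (nx * ny))

for rank in (0, 256, 1, 257):
    print(rank, plus_one_x(rank))
```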
The third application-specific mapping (labeled "blocks" in the figures) is similar to how
stock mappings lay out the nodes for GAMESS, but it takes advantage of the wrap-around
connections on BG/L. Figure 4.15 shows a picture of the X-direction layout. In the
X-direction case, rank zero is at the origin (i.e., (0, 0, 0)). Its data server is on the
opposite edge of the torus, i.e., (7, 0, 0) for a 512-way eight-by-eight-by-eight torus. Rank
one is at (1, 0, 0) while its data server (rank 257) is at (6, 0, 0). Similarly, rank two is at
(2, 0, 0) with rank 258 at (5, 0, 0). Finally, rank three is at (3, 0, 0) with rank 259 at
(4, 0, 0).
In the Y-direction case (shown in figure 4.16), a similar bisection is performed along the
Y axis. Rank zero is at the origin and its data server is on the opposite edge of the Y axis.
Similarly, rank one is at (0, 1, 0) with its data server at (0, 6, 0). The Z-direction case is
the same, except the bisection is along the middle of the Z axis. See figure 4.17 for details.
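The first row of the X-direction layout can be expressed as a short Python sketch. Only the eight locations spelled out above are produced; the placement of later ranks is not described and is left out. The function name is illustrative.

```python
def blocks_x_first_row(rank, nx=8, total=512):
    """Location of `rank` in the first X row of the wrap-around layout.

    Compute rank r sits at x = r; its data server (rank r + total/2) sits
    at the mirrored position x = nx - 1 - r, so the pair is joined by the
    torus wrap-around links.
    """
    half = total // 2
    pair = rank % half
    if pair >= nx // 2:
        raise ValueError("only the first row is described in the text")
    x = pair if rank < half else nx - 1 - pair
    return (x, 0, 0)

for rank in (0, 256, 1, 257, 2, 258, 3, 259):
    print(rank, blocks_x_first_row(rank))
```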
4.2.4 Random Map 
As a basis for "worst-case" comparisons, random mappings were also desired, so mapmaker
can generate them. The logical MPI ranks proceed sequentially, and each rank is assigned a
randomly chosen, unique set of physical coordinates.
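A random map of this kind amounts to shuffling the list of physical coordinates. The Python sketch below is one way to do it. The function name and the optional `seed` argument are illustrative and say nothing about how mapmaker itself draws its random numbers.

```python
import random
from itertools import product

def random_mapping(dims, cores=1, seed=None):
    """Return a list where index = MPI rank and value = random (x, y, z, t).

    Every physical coordinate in the block is used exactly once.
    """
    coords = list(product(range(dims[0]), range(dims[1]),
                          range(dims[2]), range(cores)))
    random.Random(seed).shuffle(coords)   # seeded so a map can be regenerated
    return coords

mapping = random_mapping((8, 8, 8), seed=1)
print(len(mapping), mapping[:4])
```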
4.3 Profiling Tools 
As mentioned in section 1.2, the MPI-2 standard requires implementations to provide
hooks for user-provided profiling libraries. This is known as the PMPI interface in MPICH.
PMPI is discussed in chapter 8 of (SOH+oo). Basically, an MPI implementation is required
to provide a mechanism through which all MPI functions can be overridden by a user library.
A normal application would call MPI_Bcast, for example. The profiling library could then
intercept this call and do almost anything as long as a call to PMPI_Bcast is eventually made.
The typical use would be to wrap the call to PMPI_Bcast with timing routines. This profiling
interface can also be used for sanity-checking MPI calls. For example, all processes in a
communicator should generally make collective calls with identical arguments. There is no
way for the MPI implementation to make sure all processes called the collective function with
the same parameters. A library can use the profiling interface to make a hash of the function
call and then broadcast this hash to all processes involved, for example. Processes can then
verify whether they are making the same collective call as the other processes.
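The interposition pattern itself is simple. The Python sketch below only imitates it and is not the PMPI interface, which is a C-level mechanism. Here `pmpi_bcast` is a hypothetical stand-in for the real broadcast, and `mpi_bcast` plays the role of the wrapper that the application would call.

```python
import time

profile_log = []   # one (call name, root, bytes, seconds) record per call

def pmpi_bcast(buffer, root):
    """Hypothetical stand-in for the real broadcast routine."""
    return buffer

def mpi_bcast(buffer, root):
    """Wrapper seen by the application: time the call, then forward it."""
    start = time.perf_counter()
    result = pmpi_bcast(buffer, root)       # the real work happens here
    elapsed = time.perf_counter() - start
    profile_log.append(("MPI_Bcast", root, len(buffer), elapsed))
    return result

mpi_bcast(b"parameters", root=0)
print(profile_log[-1][:3])
```

The application code is unchanged; only the symbol it resolves to differs, which is what makes the approach usable on unmodified benchmarks.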
A BlueGene/L specific profiling library was developed to help examine communications
patterns in the benchmarks used in this thesis. The library measures the time for all MPI
operations (including point-to-point operations) using the processor cycle counters available
on the PowerPC 440 cores. This cycle count is recorded to a text file on disk, along with which
MPI call was made. The number of bytes transmitted is also recorded. For point-to-point
operations, the sender and the destination ranks along with their physical coordinates are
recorded as well. Collective operations with a root (e.g., MPI_Bcast, MPI_Reduce, MPI_Gather,
etc.) record the root's rank and physical coordinates. Because all of this information is
recorded to disk, there is no practical limit on the number of communications events that can
be recorded and no memory is taken from the application. However, this process slows down
the application significantly and can generate substantial files on disk. For that reason, the
profiling library was only used for gathering communications snapshots; all timing runs were
done without the library.
Two applications were developed to make use of the profiling data. They are the
communications visualizer and the communications profiler.
4.3.1 Communications Visualizer 
The communications visualizer is an OpenGL-based application that shows the point-to-point
communications between nodes visually. It takes a stripped down version of the profile
data (basically, just unique point-to-point node pairs and byte transmissions) and creates a
three-dimensional representation of the data flow. Each communication is represented by a
vector that goes from the sender's physical coordinates to the receiver's physical coordinates.
The color of the vector indicates the physical distance from sender to receiver. The colors
proceed from red to purple in rainbow order. Red or orange vectors represent short
transmission lengths (typically nearest-neighbor sorts of communications). Yellow or green
vectors represent medium-distance transmissions (typically three to five hops in each
direction). Finally, blue or purple vectors represent long-distance transmissions, typically
from one edge of the partition to the opposite edge. Note however that some of the blue or
purple vectors might be misleading since the program does not know whether the communications
data is from a mesh or a torus partition. For example, in an eight-by-eight-by-eight partition,
a transmission from physical coordinates (0, 0, 0) to (7, 7, 7) takes only three hops in a
torus (one wrap-around hop per dimension) but twenty-one hops in a mesh. The visualizer
currently assumes the data is for a mesh.
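The difference between mesh and torus distances is easy to compute. The Python sketch below counts minimal hops between two nodes under either assumption. It assumes shortest-path routing along each dimension, and the function name is illustrative rather than taken from the visualizer.

```python
def hops(src, dst, dims, torus):
    """Minimal number of hops from src to dst in a mesh or a torus.

    src, dst: (x, y, z) physical coordinates.
    dims:     (nx, ny, nz) extent of the partition.
    torus:    True if wrap-around links are available.
    """
    total = 0
    for s, d, n in zip(src, dst, dims):
        delta = abs(s - d)
        # With wrap-around links the shorter way around the ring is taken.
        total += min(delta, n - delta) if torus else delta
    return total

block = (8, 8, 8)
print(hops((0, 0, 0), (7, 7, 7), block, torus=True))    # prints 3
print(hops((0, 0, 0), (7, 7, 7), block, torus=False))   # prints 21
```

A visualizer that knew the partition type could color its vectors by the value returned for the appropriate case.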
The radius of the vector arrowhead represents how many total bytes were sent from the
sender to the receiver. The arrowhead radii are scaled relative to the largest node-to-node
total byte count for the entire application being viewed.
The user can rotate the machine representation and turn on or off specific nodes. This is 
helpful in determining what sorts of communications are typical for a node. Chapter 3 has 
screen-captures from the communications visualizer program from data collected during a run 
of each of the benchmarks discussed in that chapter. 
4.3.2 Communications Profiler 
It was necessary to determine types of communications (point-to-point versus collective, 
small messages versus large messages, etc.) in the benchmarks to help determine if mappings 
can have any performance benefits. Initially, this involved looking through the benchmarks' 
source code. This is a very time-consuming and error-prone operation for large benchmarks, so 
the communications profiler tool was written. This application takes the profile data files and 
generates tables of communications types and sizes. In general, this is useful for determining 
what operations are likely to consume the most time or what operations should be optimized 
for future runs. For example, if an application has many smaller point-to-point operations, the
target machine's latency is going to be more important than its bandwidth. Communications
profiles for all of the benchmarks are in chapter 3.
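The kind of table the profiler produces can be illustrated with a small Python sketch. The record layout (call name, byte count) and the 1024-byte boundary between "small" and "large" messages are assumptions made for the example and do not reflect the actual profile file format.

```python
from collections import Counter

def summarize(records, small_limit=1024):
    """Count communication events by MPI call and message-size class.

    records:     iterable of (call_name, byte_count) pairs.
    small_limit: messages below this many bytes are counted as "small".
    """
    table = Counter()
    for call, nbytes in records:
        size_class = "small" if nbytes < small_limit else "large"
        table[(call, size_class)] += 1
    return table

sample = [("MPI_Send", 64), ("MPI_Send", 128), ("MPI_Send", 65536),
          ("MPI_Bcast", 8), ("MPI_Allreduce", 4096)]
for (call, size_class), count in sorted(summarize(sample).items()):
    print(call, size_class, count)
```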
4.4 Map Figures 
The diagrams on the following pages show the maps discussed in this chapter on 512-way
midplanes.
Figure 4.3 X direction "unfolded" "mesh"
Figure 4.4 Y direction "unfolded" "mesh"
Figure 4.5 Z direction "unfolded" "mesh"
Figure 4.6 X direction "unfolded" "mesh" in VNM with cores increasing first
Figure 4.7 X direction "unfolded" "mesh" in VNM with cores increasing last
Figure 4.8 X direction "unfolded" "mesh" in VNM with cores increasing as planes
Figure 4.9 X directional sheet mapping
Figure 4.10 Y directional sheet mapping
Figure 4.11 Z directional sheet mapping
Figure 4.12 X direction "Plus-One" mapping
Figure 4.13 Y direction "Plus-One" mapping
Figure 4.14 Z direction "Plus-One" mapping
Figure 4.15 X direction "blocks" mapping
Figure 4.16 Y direction "blocks" mapping
Figure 4.17 Z direction "blocks" mapping
5 Procedure and Results 
5.1 General Procedures 
First generation BlueGene/L hardware was used for running the NAS benchmarks. This 
was because there were more partition sizes available on the first generation hardware. Ideally, 
it would be possible to boot a large partition (say 512-way or 1024-way) and specify the number 
of processors actually desired for running a job. However, at the time the runs were made, it 
was not possible to specify an arbitrary number of processors and a mapping file at the same 
time because of limitations in the preliminary control system. 
It is also important to use smaller-than-midplane partitions because some of the
benchmarks do not run on larger-than-midplane partitions. To see if the aspect ratio of the
partition configuration affects results, it is necessary to have non-cubic partitions. A
128-way partition is set up as eight compute nodes in X, four in Y, and four in Z. If a
midplane were used,
the 128-way would end up as an eight by eight by two configuration and mapping files would 
simply rotate this rectangular solid inside the midplane cube. A 1024-way partition on the 
first generation hardware is arranged as an eight by eight by sixteen rectangular solid, i.e., it 
is two midplanes stacked on top of each other along the Z axis. This configuration would be 
useful for determining if the aspect ratio matters, but not all of the benchmarks run on such 
large systems. 
BT and SP both require a perfect-square number of processors. For 128-way runs, only
121 processors were used, but NAS reports measurements as if 128 processors were used. The
MOp/s per processor shown on the graphs are re-computed for 121 processors. This is also
done for 512-way runs where only 484 processors are used. All the other benchmarks require
power-of-two partition sizes, so nothing special is required for them because the BlueGene/L
standard partition sizes are all powers of two.
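The re-computation is a simple rescaling. The Python sketch below shows the arithmetic, assuming the reported per-processor figure was obtained by dividing the total rate by the full partition size. The function names are illustrative.

```python
from math import isqrt

def usable_processors(partition_size):
    """Largest perfect square that fits in the partition."""
    return isqrt(partition_size) ** 2

def rescale_per_processor(reported, partition_size):
    """Convert a per-processor rate reported for the whole partition into
    a rate per processor actually used."""
    used = usable_processors(partition_size)
    return reported * partition_size / used

print(usable_processors(128), usable_processors(512))   # prints 121 484
print(rescale_per_processor(100.0, 128))                # 100 MOp/s reported
```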
Unfortunately, there are no 256-way partitions available (except 128-way partitions in
virtual node mode), nor are there partitions above a midplane that are not multiples of a
midplane (e.g., 768-way). Therefore only 128-way, 128-way in virtual node mode, and 512-way
runs were made (when possible). The IS benchmark does not work on partitions greater
than 512 nodes, and fails to verify at 256 nodes. MG does not work properly on machines with
more than 512 nodes so it was not run above 256-way. LU, BT, and SP do not run at small
classes with a large number of nodes, i.e., LU does not run at class A or class B with 512 or
more nodes. IS has no class D problem size. CG has verification problems (the result does not
match the expected value) with small problem sizes and large numbers of nodes. However,
the discrepancy is very small (typically on the order of $10^{-6}$) and the benchmark still
reports timing information so the numbers are included. Originally, it was thought that EP
might be a good control benchmark, but the lack of point-to-point operations in FT made it
more useful as a control since it was actually doing some communication.
To summarize: BT, CG, FT, IS, LU, MG, and SP were run with class A, B, and C problem
sizes on 128 nodes in coprocessor mode. BT, CG, FT, LU, MG, and SP at class A, B, and C 
sizes were run on 128 nodes in virtual node mode (256 nodes total). BT, LU, and SP were run 
on 512 nodes with the class C problem size. 
Because some of the NAS benchmarks use collective operations, runs were made with 
and without the BlueGene/L optimized collectives to see if there would be any performance 
difference. This was only done on 128-way configurations. The default MPICH collective 
operations are implemented with point-to-point messages so it was hypothesized that
mappings might make a difference on the benchmarks that use mostly collective operations if
the collectives were done with these unoptimized point-to-point calls.
GAMESS and ALCMD were both run on second generation hardware since both applications
require large amounts of memory per node and more processing power. The second
generation hardware is arranged the same as first generation for 512-way and 1024-way
partitions (i.e., eight-by-eight-by-eight and eight-by-eight-by-sixteen). 4096-way partitions
are thirty-two by eight by sixteen. Since neither ALCMD nor GAMESS has had much run time
on machines larger than 4096 nodes and might not scale well at such large machine sizes, 4096
was the largest partition used.
All applications were run with the stock mappings first. Other mappings were generated
with the mapmaker application. Each benchmark was run with every mapping when possible.
Generally, as many configurations as possible were run simultaneously. For example, if the
entire 16384-node machine was available, typically eight racks would be booted as 1024-ways,
four racks would be booted as midplanes (512-ways), and the other four racks would be booted
in virtual node mode as either racks or midplanes or the four racks would be booted as a
4096-way, depending on which benchmarks were being run. Benchmarks would then be running
simultaneously on multiple machines with different mappings. In most situations it is
desired to have an entire machine available exclusively for benchmark runs. Since the
partitions in BlueGene/L are electrically isolated from each other, as soon as they are booted,
the benchmarks are running exclusively on the partitions with no interference between
different partitions.
The mapmaker application has different mapping strategies for several of the maps if they 
are going to be used on meshes (i.e., no wrap-around links are enabled). Unfortunately, the 
ability to boot larger configurations as meshes was not implemented at the time the runs were 
made so this was not investigated. 
All result graphs are in chapter 7. 
5.2 NAS Results 
5.2.1 General Comments 
The NAS graphs show mapping on the X axis and millions of operations per second per 
processor on the Y axis. Larger values are better. Most of the NAS benchmarks showed 
improvements over stock mappings for at least one mapping type, except FT and LU. It was 
expected that FT would show very little variation among the many mappings since it uses only 
collective operations. However, the lack of variation among mappings for LU was surprising. 
The majority of the communication in LU is point-to-point so it was expected that some 
variation would be present. Except in virtual node mode, none of the mappings (including
random) showed significant variation in MOp/s per processor on LU. In virtual node mode,
nothing performed significantly better than stock mappings (and only the stock mappings
with T last performed well). The mappings that performed worse than the stock mappings all performed
similarly to each other. It was originally believed that LU should behave like BT and SP since 
they are doing similar things and have similar code. However, analysis with the communications 
visualizer showed the communications in LU are significantly different than those in SP or BT. 
5.2.2 NAS BT and SP Results 
As expected, BT and SP had similar qualitative results. As shown in figures 7.1 (BT) and 
7.56 (SP), there was very little difference using the non-optimized collectives. The main reason 
for this is that there are very few collective operations in either code (mostly broadcasting 
parameters to all nodes and collecting timing information). Most of the BlueGene/L
tree-optimized collectives operate only on MPI_COMM_WORLD, so if the codes do collectives on
sub-communicators, they will not benefit from optimized tree collectives. The optimized torus
collectives that do work on sub-communicators require rectangular sub-communicators. In
figures 7.1 (BT) and 7.56 (SP) the two stock mappings with the greatest benefit are the two
with the X coordinate last. This is because of the configuration of 128-way partitions. Recall
that a 128-way partition is eight by four by four. Therefore, there are twice as many hops
in the X direction as in Y or Z. When the X coordinate increases slowest, adjacent MPI
ranks tend to be closer together than when X increases fastest, which raises the average
number of hops. The qualitative differences between the three classes ("A", "B", and "C")
were minimal.
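The hop-count argument above can be checked with a small sketch. It is an illustration
only, not part of mapmaker: it assumes that a stock mapping name such as XYZ lists the
fastest-varying coordinate first, that the 128-way partition is the eight by four by four
torus described in the text, and all function names are hypothetical.

```python
# Illustrative sketch only; helper names are hypothetical, not from mapmaker.
DIMS = {"X": 8, "Y": 4, "Z": 4}  # 128-way partition shape given in the text


def rank_to_coords(rank, order, dims=DIMS):
    """Convert an MPI rank to torus coordinates.

    Assumption: the first letter of `order` varies fastest, so "XYZ"
    walks along X first and "ZYX" keeps X varying slowest.
    """
    coords = {}
    for axis in order:
        coords[axis] = rank % dims[axis]
        rank //= dims[axis]
    return coords


def torus_hops(a, b, dims=DIMS):
    """Minimum hop count between two nodes when wrap-around links exist."""
    hops = 0
    for axis, size in dims.items():
        delta = abs(a[axis] - b[axis])
        hops += min(delta, size - delta)
    return hops


def mean_ring_hops(size):
    """Average hop count from one node to every node on a ring of `size`."""
    return sum(min(d, size - d) for d in range(size)) / size


if __name__ == "__main__":
    for axis, size in DIMS.items():
        print(axis, mean_ring_hops(size))
    print(rank_to_coords(1, "XYZ"), rank_to_coords(1, "ZYX"))
```

Under these assumptions the average ring distance is 2.0 hops along X and 1.0 hops along
Y or Z, which matches the observation that X has twice as many hops.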
Figures 7.5 (BT) and 7.60 (SP) show the stock mappings in virtual node mode. There 
is significantly more variation among these mappings than in coprocessor mode, though, as 
expected there is very little variation between using the optimized collectives and the default 
point-to-point-based collectives. The most interesting thing to note on these two figures is 
that (as in the 128-way case), the YZX and ZYX mappings are the best performers. However, 
it is interesting that the TXYZ and TXZY mappings are the best mappings with T (i.e. core 
ID) varying fastest (and are comparable to the YZX and ZYX mappings). It is unfortunate that
there is no way to compare a coprocessor-mode 256-way configuration to these 128-way
virtual node mode numbers.
In the GAMESS-style mappings results shown in figures 7.3 (BT) and 7.58 (SP), most of the 
GAMESS-style split-in-half mappings do not perform as well as stock mappings. It was not 
expected that BT or SP would show any significant improvement using these mappings. It is 
somewhat surprising that the "blocks-X" and "sheets-X" mappings outperform the XYZ stock 
mapping. The two maps performed well mainly because they hide the fact that
there are twice as many hops in X.
The Gray-code mappings, with results shown in figures 7.2 (BT) and 7.57 (SP), do not perform
nearly as well as expected. The best Gray code map is nearly equivalent to the best stock
mapping (as shown in figures 7.1 (BT) and 7.56 (SP)). The aspect ratio of the mesh has less
effect than expected as well. A very skewed mesh was expected to have significantly worse 
performance than the nearly-square meshes. In SP the aspect ratio of the mesh has nearly no 
effect. BT shows marginally worse performance on the two highly-skewed extreme cases. Again, 
both benchmarks on 128-way partitions favor any mapping that hides the fact that there are 
extra hops in the X dimension. 
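For readers unfamiliar with Gray codes, the property these mappings rely on can be
sketched briefly. The snippet shows only the standard binary reflected Gray code, in which
consecutive values differ in exactly one bit. How mapmaker uses the code to lay a
two-dimensional mesh onto the torus is not reproduced here, and the function names are
illustrative.

```python
def gray(i):
    """Binary reflected Gray code of a non-negative integer."""
    return i ^ (i >> 1)


def differs_by_one_bit(a, b):
    """True when a and b differ in exactly one bit position."""
    diff = a ^ b
    return diff != 0 and (diff & (diff - 1)) == 0


if __name__ == "__main__":
    codes = [gray(i) for i in range(8)]
    print(codes)
    # The sequence is cyclic: the last code is also one bit away from the first.
    print(all(differs_by_one_bit(codes[i], codes[(i + 1) % 8]) for i in range(8)))
```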
Figures 7.6 (BT) and 7.61 (SP), which show the Gray-code results for 256 processors, are
much more interesting than the 128-way equivalents, though they still do not show spectacular
performance improvements. However, the expected benefit from square meshes is much more dramatic.
The unfolding style map was expected to show at least some performance improvement 
over stock and managed to do so for all three dimensions, as shown in figures 7.4 (BT) and 
7.59 (SP). The best unfold mapping (the unfolding along Y) was not better than the best 
stock mapping however. It was better than the default stock mappings though. Refer back to 
Figures 7.5 (BT) and 7.60 (SP) for stock results. 
The virtual node mode unfolding cases showed much more variation but not much perfor-
mance improvements. As expected, the mappings that hide the extra latency in X performed 
best. See figures 7.7 (BT) and 7.62 (SP) for the results. 
The 512-way results showed very little variation among stock mappings. See figure 7.8 for 
the BT results and figure 7.63 for the SP results. This is the expected behavior since a 512-way 
midplane is a cube. The random results are only slightly worse than the stock mappings, which is
somewhat surprising since it was assumed a random mapping would make performance suffer. The Gray-code
mapping results were very good. Almost all of the Gray-code maps performed better than 
the default XYZ mapping. As expected, those closest to square (16x32 and 32x16) were the 
best mappings. Also, as expected since the midplane is cubic, there was not much difference 
between the three Gray-code orientations. This is shown in figure 7.9 for the BT benchmark and 
in figure 7.64 for SP. Similarly, the unfold maps in figure 7.11 (BT) and 7.66 (SP) out-perform 
the stock mappings with very little difference between the three orientations. 
Figure 7.10 shows the results for the GAMESS mappings with BT and figure 7.65 shows 
the results from SP. There is not much difference between the GAMESS mappings and stock, 
but BT and SP were not expected to benefit from the GAMESS-style mappings. 
5.2.3 NAS FT Results 
As expected, FT showed almost no performance difference, regardless of mapping, because
FT uses only collective operations and is very small for the partition sizes used.
The only notable performance variation was the optimized versus non-optimized collectives in 
virtual node mode as shown in figure 7.27. The optimized collectives are almost all identical 
but the unoptimized collectives show a difference between T first and T last mappings. The rest 
of the FT results are shown in figures 7.23 (128-way stock mappings), 7.24 (128-way Gray-code
mappings), 7.25 (128-way GAMESS mappings), and 7.26 (128-way unfolding maps). Figures
7.28 and 7.29 are the 256-way Gray-code and 256-way unfolding results respectively. Figures
7.30, 7.31, 7.32, and 7.33 are the 512-way results for stock mappings, Gray-code mappings,
GAMESS mappings, and unfold mappings.
5.2.4 NAS CG Results 
CG is another benchmark that runs extremely fast on even the smallest BlueGene/L
configurations. It also fails to verify correctly with a large number of nodes on a small problem
size, though the verification is only off by a small amount (typically on the order of 10^-5 or
10^-6) and the benchmark still reports results, so they are shown in the figures. As expected,
there is very little difference between the optimized and non-optimized collectives shown in
figures 7.12 (128-way), 7.16 (256-way), and 7.19 (512-way). There is more variation among
the different stock mappings, especially in virtual node mode (figure 7.16). However, the maps
that have the X coordinate varying slowest are the best performers once again. Strangely, this
behavior repeats in the 512-way case (figure 7.19 for the class A problem size). It is likely that
this is just an anomaly with the small problem size and the large number of nodes along with
how CG divides the problem space among the processors (discussed in section 3.1.3).
The Gray-code mappings shown in figures 7.13 (128-way), 7.17 (256-way), and 7.20 (512-
way) are somewhat surprising. It was not expected that the Gray-code mappings would benefit 
CG but many of the different aspect ratio meshes do beat the stock mappings. It is also odd 
that the meshes closest to square tended to be the worst-performers. This is probably related 
to how CG divides the problem space among the processors. A more in-depth analysis of the 
problem space distribution would be required to determine why the more rectangular Gray 
codes benefit CG. 
As expected, the GAMESS-style mappings (which only run in coprocessor mode) made
almost no difference to performance. See figures 7.14 (128-way) and 7.21 (512-way).
The unfolding maps also provided unexpected benefits over stock mappings, as shown in
figures 7.15 (128-way), 7.18 (256-way), and 7.22 (512-way). At least, the maps hiding the fact
that the X dimension is longer than Y or Z (only apparent on the 128-way and 256-way maps)
showed the best improvements.
5.2.5 NAS IS Results 
Results for IS with the stock mappings are shown in figure 7.34. The runs with optimized
collectives perform better, as expected, since IS has many collectives. There is very little
difference between mappings otherwise. There is also almost no difference between the three
classes. The run times for the benchmark are extremely small and the operations count is
extremely low. IS was the least scalable of the NAS benchmarks, but is also among the oldest
of the benchmarks. Figure 7.35 shows the results for the Gray-code mappings. Again, there is
very little difference between the runs. Figure 7.36 shows the results for the GAMESS mappings
and figure 7.37 shows the results for the unfolding maps. Again, there was almost no variation
between the mappings because of the many collectives in IS and because the run times are so short.
5.2.6 NAS MG Results 
MG showed very little variation among mappings. In the 128-way stock mappings case 
(figure 7.49), only the mappings with X last were any better than a random map. This is 
pretty typical of the other NAS benchmarks. The rest of the mappings showed very little 
improvement or difference between mappings as shown in figures 7.53 (256-way stock), 7.50
(128-way Gray), 7.54 (256-way Gray), 7.51 (128-way GAMESS), 7.52 (128-way unfold), and
7.55 (256-way unfold).
5.2.7 NAS LU Results
As previously mentioned, LU showed surprisingly little variation with different mappings.
It was assumed the behavior might be similar to SP or BT. However, this was not the case.
In the stock mappings, there was some variation between the T first and T last mappings,
but only between the two groups as a whole. The majority of the communication in LU is
point-to-point so more performance differences between mappings were expected. As shown in
figures 7.38, 7.39, 7.40, and 7.41, there was almost zero variation among all three classes on all
128-way mappings. There is some variation in the 256-way mappings, but again, not much,
and nothing much better than stock mappings. See figures 7.42 (stock), 7.43 (Gray-codes),
and 7.44 (unfold). The results for 512-way partitions are very similar, i.e. very little difference
between mappings. See figure 7.45 for stock mappings, 7.46 for Gray-code mappings, 7.47 for
the GAMESS-style mappings, and 7.48 for the unfolding map results.
As noted in chapter 3, the LU communications are mostly nearest neighbor and half-way 
across the partition. These hops are very short in a torus, so it is possible that a mesh would 
show more variation among the mappings. 
5.3 GAMESS Results 
The GAMESS results were rather mixed. Unfortunately, the best mappings were not 
consistent across workloads, nor were they consistent across partition sizes. 
Figures 7.67 and 7.68 show the results for 512 nodes (in coprocessor mode) for penicillin 
runs and quinone runs respectively. Only two mappings show a performance difference for 
the penicillin input file - the GAMESS-style Plusone Y and Sheets Z mappings. It seems 
odd that either mapping would make a significant difference on a midplane, especially since 
neither mapping improves the runtime for the quinone input file. In fact, the GAMESS-style 
Sheets Z mapping is one of the worst mappings for quinone. One possible explanation for this 
is the sequential steps in the GAMESS runs. It is possible the total runtime for quinone is 
more dominated by the sequential steps than the parallel steps or that the parallel steps are 
compute-bound, more so than with the penicillin input file. 
Figure 7.69 and figure 7.70 show results for stock mappings with 512 nodes in virtual node
mode (1024 processors active) for the penicillin and quinone input files. Very little variation is 
seen in the penicillin workload. However, the quinone workload has one interesting data point 
worth explaining. The random mapping performed eight percent better than the average of all 
twelve stock mappings. The random mapping performance improvement makes sense because 
of the all-to-all nature of GAMESS communications. A typical all-to-all implementation will
lead to link congestion as nodes try to communicate with distant nodes through nodes
communicating with their neighbors. One strategy to minimize link contention is random packet
injection. In fact, the BlueGene/L MPI_Alltoall uses a random injection strategy where as
many as six packets are injected from a node into the network at any one time. With an
optimal injection size, the random injection performs extremely well.
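The idea behind randomized injection can be illustrated with a toy model. This is not the
BlueGene/L MPI_Alltoall implementation. It only compares, under the assumption that every
node sends one message per step, how many senders target the same destination at once when
all nodes walk the destination list in the same order versus a per-node shuffled order. It
models destination contention rather than individual torus links, and all names are
hypothetical.

```python
import random
from collections import Counter


def naive_schedule(n):
    """Every node visits destinations 0, 1, ..., n-1 in the same order."""
    return [[d for d in range(n) if d != src] for src in range(n)]


def shuffled_schedule(n, seed=0):
    """Every node visits the other nodes in its own random order."""
    rng = random.Random(seed)
    schedule = []
    for src in range(n):
        dests = [d for d in range(n) if d != src]
        rng.shuffle(dests)
        schedule.append(dests)
    return schedule


def worst_step_contention(schedule):
    """Largest number of senders aiming at one destination in any step."""
    worst = 0
    for step in range(len(schedule[0])):
        hits = Counter(dests[step] for dests in schedule)
        worst = max(worst, max(hits.values()))
    return worst


if __name__ == "__main__":
    n = 64
    print("same order:", worst_step_contention(naive_schedule(n)))
    print("shuffled:  ", worst_step_contention(shuffled_schedule(n)))
```

In the same-order schedule nearly every node targets one destination in the first step,
which is the kind of hot spot a randomized order spreads out.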
Figures 7.71 and 7.72 show the Gray-code mappings applied to GAMESS runs on midplanes
in virtual node mode (1024 processors total) for the penicillin and quinone input files.
The quinone workload results are as expected - very little variation between the different
orientations and aspect ratios for the mappings. As discussed previously, the random mapping for
512 nodes in virtual node mode performs very well. However, the penicillin results are rather
surprising. Several orientations of the Gray-code maps with very bad aspect ratios performed
very well. For example, the Y oriented 2x512 mapping was the best performer of all, showing
an eleven percent improvement over the average of the twelve stock mappings. The 4x256,
256x4 and 512x2 maps in the Y orientation all showed very good performance increases over
the stock average. The square 32x32 X oriented mapping showed the best performance of the
X oriented Gray mappings while the skewed 256x4 Z mapping showed the best performance
for the Z oriented mappings. All of the T oriented mappings showed similar performance to
stock mappings. Again, it is difficult to explain the lack of symmetry in the results. If Y
oriented highly-skewed mappings showed such performance increases, why did only the square
X oriented mapping show a performance gain? Given the cubic nature of a 512-way midplane,
it is difficult to explain why any X, Y, or Z oriented mapping showed significant improvement.
Explaining this would take a very complete understanding of GAMESS and the communications
in the input files. It might just be that the orbitals are distributed among the nodes strangely
and that ends up favoring some of the mappings.
Figures 7.73 and 7.74 show the results from stock and GAMESS-style mappings for 1024
nodes for the penicillin and quinone input files respectively. The penicillin results are somewhat
strange and rather disappointing. Only one mapping performed better than the average of the
stock mappings - the GAMESS Plusone-X mapping. However, there is no easy explanation
for the nearly twelve percent improvement over the average of the stock mappings, especially
considering the Plusone-Y mapping showed no performance improvement over stock mappings.
The quinone results are slightly more as expected. A Z oriented mapping was able to produce
better results than the average of the stock mappings (and a very respectable sixteen percent at
that). Recall that 1024-way partitions are eight by eight by sixteen, so there are twice as many
hops on average in the Z direction as in X or Y. The Sheets-Z map helps to group compute
and data servers closer together to hide the extra latency when there are more network hops.
The random mapping also showed a very good performance increase, as discussed previously.
Figures 7.75 and 7.76 show the expected results for Gray-code mappings on GAMESS runs 
- almost no variation between orientations and between aspect ratios. 
Figures 7.77 and 7.78 show the results from unfolding-style maps on the penicillin and
quinone workloads for 512 nodes in virtual node mode (1024 nodes total) and 1024 nodes in 
coprocessor mode. As expected, none of the maps made a significant performance improvement 
for GAMESS (less than five percent). 
The GAMESS-specific mappings gave large gains in the cases where they helped, but they
did not make as much of a consistent difference as hoped or expected. The GAMESS maps
that showed the best performance gains were not the ones that were expected to be the best
performers for a given partition configuration. It was rather surprising to see the skewed
Gray-code mappings performing extremely well in some cases, though not consistently. It is safe to
say there is enough data locality in GAMESS that some mappings will provide performance
gains (up to sixteen percent), but further research is required to determine why some mappings
performed so much better than expected and with such inconsistent results.
5.4 ALCMD Results 
As with the GAMESS results, the ALCMD results were rather mixed. Runs with smaller
numbers of atoms showed the most variation among the mappings since there are more
communication passes happening when the number of atoms is not significantly greater than the
number of compute processes.
The ALCMD graphs are divided into two sets - Gray-code mappings and everything else
(stock, GAMESS-style maps, unfolding maps). The non-Gray-code graphs are arranged by
number of atoms simulated. It was not possible to run ten thousand atoms with 4096 nodes,
nor was it possible to run one hundred million atoms with 512 or 1024 nodes.
5.4.1 Non-Gray-code results 
Figure 7.79 shows the non-Gray mappings for ten thousand atoms with 512 and 1024 nodes. 
The run times, even on 512 nodes, are really too short to make meaningful comparisons. 
Figures 7.80, 7.81, and 7.82 show run times for one hundred thousand, one million, and
ten million atoms respectively, all for 512, 1024, and 4096 processors. Several trends are easy
to spot between the three graphs. First, there is very little variation in run times for the
smaller machine configurations in the larger-number-of-atoms cases. Second, certain mappings
are consistently poor performers for a given machine configuration. This makes sense since the 
different machine configurations have different layouts. For example, the Unfold-Y mapping 
is always one of the worst performers for 1024 nodes. This might be because the mapping is 
constructed by unfolding along Y then Z. Recall that a 1024-way rack is eight by eight by 
sixteen, so there are twice as many hops in the Z direction. This mapping might show better 
results if construction went along Y then X. 
Similarly, the GAMESS-style Sheets-Z mapping was a consistently poor performer for
4096 nodes. The particular implementation of this mapping employed by mapmaker arranges
the nodes with X varying then Y. Because of this, physical coordinates (0, 0, 0) get assigned
to MPI rank 0. However, MPI ranks for the neighboring Z sheet (i.e., nodes with Z
coordinate equal to 1) start at 2048. The performance for this mapping might improve if it were
constructed with Y then X varying. However, it was not possible to try every possible method
of constructing all of the maps.
The worst performing map for 4096 nodes was consistently the GAMESS-style PlusOne-X
map. This is because the PlusOne-X map puts nodes with neighboring X physical coordinates
far apart in MPI rank. Given that the X direction has the fewest hops in a 4096-way
configuration (which is eight by thirty-two by sixteen), it makes sense that the PlusOne-X map
performs poorly. Instead of hiding the latency of extra hops in Y and Z, the map actually
moves the neighbor nodes with the fewest hops as far apart in the MPI ranking as
possible.
The stock mappings for 4096 nodes show a reasonable amount of variation consistent across 
the different atom counts. Maps with X varying fastest (i.e. XYZ and XZY) showed the poorest 
performance since they do nothing to hide the extra hop latency for the Y and Z axes. 
Figure 7.83 shows the non-Gray mappings for one hundred million atoms on 4096 nodes. 
As discussed above, the PlusOne-X and Unfold-Z mappings show the worst performance.
The rest of the mappings are consistent qualitatively with the fewer atom cases. 
5.4.2 Gray-code results 
The Gray-code mappings, especially the nearly-square ones, were expected to increase
performance for ALCMD runs. In some cases this happened; in others it did not.
Figures 7.84, 7.85, 7.86, and 7.87 show the results for Gray-code maps on ten thousand, 
one hundred thousand, one million, and ten million atoms. Very little variation is seen in the 
one million and ten million atom cases. This is probably because there is a minimal amount 
of communication with so many atoms and so few nodes. In the ten thousand and hundred 
thousand atom cases, all Gray-code mappings did at least marginally better than the stock 
XYZ and random mappings. However, the run times were too short to make strong comments 
on variations between different aspect ratios for the Gray-code meshes. 
Figures 7.88, 7.89, 7.90, and 7.91 show the results for Gray-code maps on 1024 nodes for ten
thousand, one hundred thousand, one million, and ten million atoms. All of the results are
rather disappointing, but at least consistent across the different numbers of atoms. The Y
oriented Gray codes performed the most poorly of all the orientations.
More importantly, none of the Gray-code maps were significantly better than stock
mappings. However, the nearly-square meshes were solid performers. Strangely, the very skewed
meshes were not as bad as expected. In fact, some of the most skewed meshes performed as well
as the nearly-square meshes.
As shown in figures 7.92, 7.93, 7.94, and 7.95, the 4096-way Gray mappings performed much
better than the 1024-way cases. In the 4096-way cases, the aspect ratio did make a difference
for performance, but the very flat meshes did not show the worst performance. Instead,
intermediary aspect ratios performed poorly, especially in the X oriented cases. The fact that
the X oriented cases are the poorest overall Gray-code performers is not surprising since the
maps are constructed in the YZ planes. Physically neighboring X nodes are separated a great
deal in their MPI ranks. In the Z oriented maps, the 32x128 mesh actually performed better
than the 64x64 mesh for two of the three cases where mappings made a difference (one million
and ten million atoms).
6 Conclusions and Future Work 
6.1 Conclusions
As was shown in many of the results graphs, the NAS problem class size made very little 
difference in relative performance - if a mapping showed the best performance on a "class A" 
problem size, it usually showed the best performance on a "class B" or "class C" problem size. 
This is not much of a surprise on the machine sizes in question. Even if the smaller problem 
sizes are more communications bound than computationally bound, because there are so many 
nodes involved, the difference is easily absorbed since each node has such a tiny amount of 
work to do. 
Turning off optimized collectives also had very little relative effect. For collectives-intensive 
benchmarks, the absolute effect was larger, but mappings made no difference, i.e. performance
went down for non-optimized collectives runs, but the results were essentially the same
regardless of mapping used.
This might be because the MPICH collectives are implemented efficiently and do not use
naive point-to-point methods (i.e. they might use a broadcast as part of an MPI_Allreduce
instead of simply sending point-to-point messages after doing the reduction).
The shape of the machine had the most interesting effects. It is unfortunate that
arbitrarily-shaped sub-midplane blocks are not supported and that the NAS benchmarks do not run on
larger configurations where the aspect ratios become significantly more skewed (multiple racks
expanding in one dimension, for example).
One would assume the benchmarks that showed performance improvements in the
permutation mappings would not show improvements in cubic configurations. For example, MG
showed performance gains on a 128-way partition using YZX and ZYX because there are (on
average) more hops in the X direction than Y or Z. Would it perform similarly on a 4096-way
configuration such as a thirty-two by eight by sixteen? In that case, benchmarks that
performed best with slow-varying X coordinates should show that preference more
strongly. There should also be more of a difference between the YZX and ZYX stock mappings,
with YZX being the best mapping for that configuration.
Proper node mappings can make very real gains in performance on benchmarks that assume 
certain configurations. For example, good maps on the BT benchmark provided as much as a 
fifteen percent gain over the default mapping (see figure 7.2).
The GAMESS and ALCMD runs also showed that performance gains are possible with the
proper maps. However, the results were generally not as expected. The GAMESS results were 
also inconsistent between the quinone and penicillin workloads. Perhaps the communication 
to computation ratio or load balance among the nodes is significantly different between the 
two workloads. Either way, a much more comprehensive study of the communication patterns 
of these large applications would be required to get the absolute best mappings and to explain 
some of the inconsistent and surprising results. 
6.2 Future Work 
Given that mappings can improve performance and can do so with very little effort to the 
application developer, it is worthwhile to pursue other mapping strategies. 
One strategy would be to gather communications data with the PMPI profiling library and
construct an "optimal" node mapping from the list of nodes that were actually involved in
the communication. For example, using the PMPI tools used for determining the benchmarks'
communication profiles, it was determined that each node involved in the NAS BT benchmark
only communicates with six other nodes. There should therefore be an optimal node mapping
for BT on BlueGene/L. The data from a small profiling run of an application could be analyzed
to determine an optimal mapping for future, larger runs of the same application. When the
number of nodes a given node communicates with is larger than the number of links available,
the problem becomes significantly more difficult. Some definition of what is really "optimal"
given the machine, application, and amount of time required to analyze the profiling data
would be required because it is believed there is no efficient algorithm for determining an
optimal mapping for the general case.
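One way to make "optimal" concrete is to score a candidate mapping by the traffic-weighted
hop count of the profiled communication. The sketch below is an assumption about how such
a score could be defined, not a method from this thesis: traffic is a hypothetical dictionary
of message volumes between rank pairs, such as a PMPI-based profiler might record, and
placement maps each rank to its torus coordinates.

```python
def torus_hops(a, b, dims):
    """Minimum hop count between coordinate tuples a and b on a torus."""
    return sum(min(abs(p - q), size - abs(p - q))
               for p, q, size in zip(a, b, dims))


def mapping_cost(traffic, placement, dims):
    """Sum of (message volume x hop count) over all communicating pairs.

    traffic:   {(src_rank, dst_rank): volume}  (hypothetical profile format)
    placement: {rank: (x, y, z)}               (the candidate node mapping)
    """
    return sum(volume * torus_hops(placement[src], placement[dst], dims)
               for (src, dst), volume in traffic.items())


if __name__ == "__main__":
    dims = (4, 2, 2)  # small example torus, not a real partition shape
    traffic = {(0, 1): 100, (1, 2): 100, (0, 3): 10}
    near = {0: (0, 0, 0), 1: (1, 0, 0), 2: (2, 0, 0), 3: (0, 1, 0)}
    far = {0: (0, 0, 0), 1: (2, 1, 1), 2: (0, 0, 1), 3: (2, 0, 0)}
    print(mapping_cost(traffic, near, dims), mapping_cost(traffic, far, dims))
```

A lower score means heavily communicating ranks sit closer together. Comparing scores only
ranks candidate mappings; searching for the best one is the hard part noted above.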
It might also be possible to change the mappings during the application run. This might be 
useful for especially long-running jobs (i.e., multiple days) if the communications profile changes 
in the middle of the application (the application might employ a fast-Fourier transform (FFT) 
first and then do nearest-neighbor shifts for example). If the steps take long enough to justify 
it, changing mappings in the middle of the run might increase performance significantly since 
any mapping optimized for one step is likely to be a poor match for other steps. It might even 
be possible for a very sophisticated control system to actually re-map nodes automatically 
during a run as it determines that is necessary. 
MPI even allows the implementer to re-map ranks for best performance in routines like
MPI_Cart_create. It would be interesting to employ some "intelligence" in the routine to
attempt to give a user a good mapping based on the nodes they wish to involve in the
sub-communicator.
Despite the large number of mappings investigated in this thesis, there are more variations 
on each mapping possible. Many mappings have one dimension for their orientation but then 
assume a secondary physical dimension when constructing the maps. On machine configu-
rations with large differences in hop counts in different dimensions (for example, the eight 
by thirty-two by sixteen 4096-way rows), the choice of the secondary dimension might make 
significant differences in performance. Also, the NAS LU and CG benchmarks have interesting 
communications patterns that might benefit from a customized map. 
One other possible use for mappings would be to help maintain good performance after 
a fault condition. For example, if an application is not using all of the compute nodes in a 
partition and the control system detects a node with some sort of hardware fault preventing 
it from being used further, it might be possible to re-map the good nodes to minimize the 
performance effects of the faulty node dropping out from the run. 
The ideas presented in this thesis should also be investigated on other machines or clusters. 
The effects of node mapping in NUMA machines are especially interesting since it is very likely
an application programmer would assume all addresses are "local" to the processor. Seeing how
logical mappings work in fat-tree clusters (e.g., InfiniBand or Myrinet) is another interesting
problem. 
7 Graphs 
0 
U) 
U) 
Cl> 
0 
0 c.. 
Q; 
Cl. 
U) 
Ci. 
0 
~ 
109 
100 
80 
60 
40 
20 
0 
a. c a. c a. c a. c a. c a. 
~ 
0 
~ 
0 
~ 
0 
~ 
0 
~ 
0 
~ .s .s .s .s .s 
N N >- >- N N x x >- >- x >- >- N N x x N N x x >-x x x x >- >- >- >- N N N 
Figure 7.1 BT 128 Total Processors (Coprocessor, 
Non-Optimized Collectives) 
c 
0 
.s 
x 
>-
N 
li 
~ 
E 
0 
u 
c 
ell 
a: 
c 
0 
.s 
E 
0 
u 
c 
ro 
a: 
Class C EXX:LJ 
ClassB ~ 
Class A~ 
Optimized vs 
100 
80 
~ 
!fl 60 Q) 
() 
0 a. 
Qi 
0.. 
!fl 
Ci. 
0 
~ 40 
20 
0 
8: 8: 
v N 
<.O (") x x 
N v 
110 
8: 8: 8: 8: E E E E E E ~ ~ ~ ~ ~ ~ N >-
<.O 00 v N v N <.O 00 v N v N <.O 00 v N x x x x <.O (") x x x x <.O (") x x x x x ~ N v x x <.O N v x x <.O N v 00 (") <.O N v 00 (") <.O N v 00 (") <.O 
Figure 7.2 BT 128 Total Processors - Gray-code Meshes 
E 
0 
"O 
c 
Cll a: 
Class C KX20;l 
ClassB ~ 
Class A~ 
0 
CJ) 
CJ) 60 (]) 
(.) 
0 c.. 
w 
a. 
CJ) 
0. 
0 
:2: 40 
111 
g ~ fl g ~ fl g ~ fl N' E >- 0 CJ) CJ) CJ) (]) (]) (]) CJ) CJ) CJ) ~ "O ~ ~ ~ c c c (j) Q) (j) c (.) (.) (.) 0 0 0 ~ ell 0 0 0 (]) (]) (]) a: en en en ~ CJ) CJ) ..c ..c ..c (.) :::J :::J Cf) Cf) Cf) 0 a: a: a: (fj 
Figure 7.3 BT 128 Total Processors - GAMESS-style Maps 
ClassC ~ 
ClassB ~ 
Class A rs:::::s:::::s: 
100 
80 
0 
If) 
If) 
60 Q) 
u 
0 
0.. 
03 
0.. 
-a. 
0 
~ 40 
20 
0 
112 
E g ~ ~ N' >- 0 
-0 -0 -0 ~ 
-0 
0 0 0 c ell c c c .:.:. er: u ::::> ::::> ::::> .8 
(J) 
Figure 7.4 BT 128 Total Processors - Unfold-style Maps 
ClassC ~ 
ClassB ~ 
Class A i;;::::s:::s:si 
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.5 BT 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.6 BT 256 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.7 BT 256 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings.]
Figure 7.8 BT 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class C.]
Figure 7.9 BT 512 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class C.]
Figure 7.10 BT 512 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings.]
Figure 7.11 BT 512 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.12 CG 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.13 CG 128 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class A, Class B, Class C.]
Figure 7.14 CG 128 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings; legend: Class A, Class B, Class C.]
Figure 7.15 CG 128 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.16 CG 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.17 CG 256 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.18 CG 256 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class C.]
Figure 7.19 CG 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Rotated plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code meshes from 2x256 to 256x2 in X, Y, and Z, plus XYZ and Random.]
Figure 7.20 CG 512 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class C.]
Figure 7.21 CG 512 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings.]
Figure 7.22 CG 512 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.23 FT 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.24 FT 128 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class A, Class B, Class C.]
Figure 7.25 FT 128 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings; legend: Class A, Class B, Class C.]
Figure 7.26 FT 128 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.27 FT 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.28 FT 256 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.29 FT 256 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class C.]
Figure 7.30 FT 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class C.]
Figure 7.31 FT 512 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class C.]
Figure 7.32 FT 512 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings; legend: Class C.]
Figure 7.33 FT 512 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.34 IS 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.35 IS 128 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class A, Class B, Class C.]
Figure 7.36 IS 128 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings; legend: Class A, Class B, Class C.]
Figure 7.37 IS 128 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.38 LU 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.39 LU 128 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class A, Class B, Class C.]
Figure 7.40 LU 128 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings; legend: Class A, Class B, Class C.]
Figure 7.41 LU 128 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.42 LU 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.43 LU 256 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings.]
Figure 7.44 LU 256 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings.]
Figure 7.45 LU 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class C.]
Figure 7.46 LU 512 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class C.]
Figure 7.47 LU 512 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings.]
Figure 7.48 LU 512 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.49 MG 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings.]
Figure 7.50 MG 128 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class A, Class B, Class C.]
Figure 7.51 MG 128 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings; legend: Class A, Class B, Class C.]
Figure 7.52 MG 128 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.53 MG 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.54 MG 256 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.55 MG 256 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.56 SP 128 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.57 SP 128 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class A, Class B, Class C.]
Figure 7.58 SP 128 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings; legend: Class A, Class B, Class C.]
Figure 7.59 SP 128 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.60 SP 256 Total Processors (VirtualNodeMode, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class A, Class B, Class C.]
Figure 7.61 SP 256 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class A, Class B, Class C.]
Figure 7.62 SP 256 Total Processors - Unfold-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: mappings; legend: Class C.]
Figure 7.63 SP 512 Total Processors (Coprocessor, Optimized vs Non-Optimized Collectives)
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Gray-code mesh mappings; legend: Class C.]
Figure 7.64 SP 512 Total Processors - Gray-code Meshes
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: GAMESS-style mappings; legend: Class C.]
Figure 7.65 SP 512 Total Processors - GAMESS-style Maps
[Plot not reproduced. Y-axis: MOp/s per processor; x-axis: Unfold-style mappings; legend: Class C.]
Figure 7.66 SP 512 Total Processors - Unfold-style Maps
[Rotated plot not reproduced; caption recovered from partly illegible rotated text. Y-axis: Runtime (seconds); x-axis: mappings, including Blocks, Plusone, Sheets, Gray (X, Y, Z), Unfold, and Random.]
Figure 7.67 GAMESS Penicillin Workload (512 Nodes)
[Rotated plot not reproduced; caption recovered from partly illegible rotated text. Y-axis: Runtime (seconds); x-axis: mappings, including XYZ permutations, Blocks, Plusone, Sheets, Gray meshes from 2x256 to 256x2 (X, Y, Z), Unfold, and Random.]
Figure 7.68 GAMESS Quinone Workload (512 Nodes)
[Plot not reproduced. Y-axis: Runtime (seconds); x-axis: Mappings; legend: VNM, Coprocessor.]
Figure 7.69 GAMESS Penicillin Workload (512 Nodes in VNM, Stock Mappings)
[Plot not reproduced. Y-axis: Runtime (seconds); x-axis: Mappings.]
Figure 7.70 GAMESS Quinone Workload (512 Nodes in VNM, Stock Mappings)
[Plot omitted: runtime (seconds) vs. mapping]
Figure 7.71 GAMESS Penicillin Workload (512 Nodes in VNM, Gray Mappings)
[Plot omitted: runtime (seconds) vs. mapping; legend: VNM, Coprocessor]
Figure 7.72 GAMESS Quinone Workload (512 Nodes in VNM, Gray Mappings)
[Plot omitted: runtime (seconds) vs. mapping; legend: Coprocessor, VNM]
Figure 7.73 GAMESS Penicillin Workload (1024 Nodes, Stock and GAMESS Mappings)
[Plot omitted: runtime (seconds) vs. mapping]
Figure 7.74 GAMESS Quinone Workload (1024 Nodes, Stock and GAMESS Mappings)
[Plot omitted: runtime (seconds) vs. mapping; legend: Coprocessor, VNM]
Figure 7.75 GAMESS Penicillin Workload (1024 Nodes, Gray mappings)
[Plot omitted: runtime (seconds) vs. mapping]
Figure 7.76 GAMESS Quinone Workload (1024 Nodes, Gray mappings)
[Plot omitted: runtime (seconds) vs. mapping; legend: Coprocessor, VNM]
Figure 7.77 GAMESS Penicillin Workload (1024 Nodes, Unfold Maps)
[Plot omitted: runtime (seconds) vs. mapping]
Figure 7.78 GAMESS Quinone Workload (1024 Nodes, Unfold Maps)
[Plot omitted: runtime (seconds) vs. mapping; legend: 512, 1024]
Figure 7.79 CMD - 10k atoms, Non-Gray mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 512, 1024, 4096]
Figure 7.80 CMD - 100k atoms, Non-Gray mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 512, 1024, 4096]
Figure 7.81 CMD - 1m atoms, Non-Gray mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 512, 1024, 4096]
Figure 7.82 CMD - 10m atoms, Non-Gray mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 4096]
Figure 7.83 CMD - 100m atoms, Non-Gray mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 10k]
Figure 7.84 CMD - 10k atoms, 512 Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 100k]
Figure 7.85 CMD - 100k atoms, 512 Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 1m]
Figure 7.86 CMD - 1m atoms, 512 Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 10m]
Figure 7.87 CMD - 10m atoms, 512 Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 10k]
Figure 7.88 CMD - 10k atoms, 1k Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 100k]
Figure 7.89 CMD - 100k atoms, 1k Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 1m]
Figure 7.90 CMD - 1m atoms, 1k Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 10m]
Figure 7.91 CMD - 10m atoms, 1k Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 100k]
Figure 7.92 CMD - 100k atoms, 4k Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; Gray mappings labeled 2x2048 through 2048x2]
Figure 7.93 CMD - 1m atoms, 4k Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; Gray mappings labeled 2x2048 through 2048x2]
Figure 7.94 CMD - 10m atoms, 4k Processors, Gray Mappings
[Plot omitted: runtime (seconds) vs. mapping; legend: 100m]
Figure 7.95 CMD - 100m atoms, 4k Processors, Gray Mappings
Bibliography 
[AAA+01] F. Allen, G. Almasi, W. Andreoni, D. Beece, B. J. Berne, A. Bright, J. Brunheroto, 
C. Caşcaval, J. Castaños, P. Coteus, P. Crumley, A. Curioni, M. Denneau, 
W. Donath, M. Eleftheriou, B. Fitch, B. Fleischer, C. J. Georgiou, R. Germain, 
M. Giampapa, D. Gresh, M. Gupta, R. Haring, H. Ho, P. Hochschild, S. Hummel, 
T. Jonas, D. Lieber, G. Martyna, K. Maturu, J. Moreira, D. Newns, M. Newton, 
R. Philhower, T. Picunko, J. Pitera, M. Pitman, R. Rand, A. Royyuru, V. Salapura, 
A. Sanomiya, R. Shah, Y. Sham, S. Singh, M. Snir, F. Suits, R. Swetz, 
W. C. Swope, N. Vishnumurthy, T. J. C. Ward, H. Warren, and R. Zhou. Blue 
gene: A vision for protein science using a petaflop supercomputer. IBM Systems 
Journal, 40(2), 2001. 
[AAC+03] Gheorghe Almasi, Charles Archer, Jose Castaños, Manish Gupta, Xavier Martorell, 
Jose Moreira, William Gropp, Silvius Rus, and Brian Toonen. MPI on 
BlueGene/L: Designing an efficient general purpose messaging solution for a large 
cellular system. In Proceedings of the 10th European PVM/MPI Users' Group 
Meeting, 2003. 
[AAC+04] Gheorghe Almasi, Charles Archer, Jose Castaños, C. Chris Erway, Philip Heidelberger, 
Xavier Martorell, Jose Moreira, Kurt Pinnow, Joe Ratterman, Nils Smeds, 
Burkhard Steinmacher-Burow, William Gropp, and Brian Toonen. Implementing 
MPI on the BlueGene/L supercomputer. In Proceedings of Euro-Par 2004, 2004. 
[ABB+03a] G. Almasi, L. Bachega, R. Bellofatto, J. Brunheroto, C. Caşcaval, J. Castaños, 
P. Crumley, C. Erway, J. Gagliano, D. Lieber, P. Mindlin, J. Moreira, R. K. Sahoo, 
A. Sanomiya, E. Schenfeld, R. Swetz, M. Bae, G. Laib, K. Ranganathan, 
Y. Aridor, T. Domany, Y. Gal, O. Goldshmidt, and E. Shmueli. System management 
in the BlueGene/L supercomputer. In International Parallel and Distributed 
Processing Symposium (IPDPS'03), 2003. Third Workshop on Massively Parallel 
Processing (WMPP). 
[ABB+03b] Gheorghe Almasi, Ralph Bellofatto, Jose Brunheroto, Calin Caşcaval, Jose Moreira, 
Alda Sanomiya, and Karin Strauss. An overview of the BlueGene/L system 
software organization. Parallel Processing Letters, 13(4):561-574, 2003. 
[B+95] D. Bailey et al. The NAS parallel benchmarks 2.0. Technical report, NASA 
Advanced Supercomputing Division, Ames Research Center, Moffett Field, CA 
94035-1000, 1995. NAS-95-020. 
[BCC+88] Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. 
Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Ranking, 
P. S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An integrated 
solution to high-speed parallel computing. In Proceedings of Supercomputing 1988. 
IEEE Computer Society and ACM SIGARCH, 1988. 
[BCC+90] Shekhar Borkar, Robert Cohn, George Cox, Thomas Gross, H. T. Kung, Monica Lam, 
Brian Moore, Wire Moore, Craig Peterson, Jim Susmam, Jim Sutton, 
John Urbanski, and Jon Webb. Supporting systolic and memory communication 
in iWarp. In Proceedings of the 11th International Symposium on Computer 
Architecture, pages 70-81. ACM, 1990. CMU-CS-90-197. 
[BDV94] Greg Burns, Raja Daoud, and James Vaigl. LAM: An open cluster environment 
for MPI. In Proceedings of Supercomputing Symposium, pages 379-386, 1994. 
[CC88] M. Y. Chan and F. Y. L. Chin. On embedding rectangular grids in hypercubes. 
IEEE Transactions on Computers, 37(10):1285-1288, 1988. 
[CCC97] Compaq Computer Corp., Intel Corporation, and Microsoft Corporation. Virtual 
Interface Architecture specification. http://www.viarch.org/, 1997. Date 
retrieved: 27 Jan 2005. 
[Con04] RDMA Consortium. Architectural specifications for RDMA over TCP/IP. 
http://www.rdmaconsortium.org/home, 2004. Date retrieved: 27 Jan 2005. 
[D. 94] D. Bailey and others (sic.). The NAS parallel benchmarks. Technical report, 
NASA Advanced Supercomputing Division, Ames Research Center, Moffett Field, 
CA 94035-1000, 1994. NAS-94-007. 
[DLP01] Jack Dongarra, Piotr Luszczek, and Antoine Petitet. The LINPACK benchmark: 
Past, present, and future. 
http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf, 2001. Date 
retrieved: 27 Jan 2005. 
[Don04] Jack Dongarra. Performance of various computers using standard linear equations 
software. 
http://www.netlib.org/benchmark/performance.ps, 2004. Date retrieved: 27 Jan 
2005. 
[DSSM04] Jack Dongarra, Erich Strohmaier, Horst Simon, and Hans Meuer. www.top500.org. 
http://www.top500.org, 2004. Date retrieved: 27 Jan 2005. 
[dW02] Rob F. Van der Wijngaart. NAS parallel benchmarks version 2.4. Technical report, 
NASA Advanced Supercomputing Division, Ames Research Center, Moffett 
Field, CA 94035-1000, 2002. NAS-02-007. 
[Fly72] M Flynn. Some computer organizations and their effectiveness. IEEE Transactions 
on Computing, 21, 1972. 
[For95] The MPI Forum. MPI: A message-passing interface standard. 
http://www-unix.mcs.anl.gov/mpi/index.html, 1995. Date retrieved: 27 Jan 2005. 
[fPoB] European Center for Parallelism of Barcelona. Paraver. 
http://www.cepba.upc.es/paraver/. Date retrieved: 27 Jan 2005. 
[GLDS96] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, 
portable implementation of the MPI message passing interface standard. 
Parallel Computing, 22(6):789-828, 1996. 
[Gra53] Frank Gray. Pulse code communication, 1953. United States Patent Number 
2632058. 
[IEE92] IEEE. IEEE standard 1596-1992: IEEE standard for scalable coherent interface 
(SCI), 1992. 
[Kun82] H. T. Kung. Why systolic architectures? IEEE Computer, 15(1):37-46, 1982. 
[Lei85] Charles E. Leiserson. Fat-trees: universal networks for hardware-efficient 
supercomputing. IEEE Transactions on Computers, 34(10):892-901, 1985. 
[LKC+03] Jiuxing Liu, Sushmitha Kini, Balasubramanian Chandrasekaran, Weikuan Yu, 
Darius Buntinas, Jiesheng Wu, Peter Wyckoff, Weihang Jiang, and D K Panda. 
Performance comparison of MPI implementations over InfiniBand, Myrinet, and 
Quadrics. In Supercomputing 2003, 2003. 
[LWPS04] Q. Lu, J. Wu, D. Panda, and P. Sadayappan. Applying MPI derived datatypes 
to the NAS benchmarks: A case study. Technical report, Ohio State University, 
2004. OSU-CISRC-4/04-TR19. 
[Mel] Inc Mellanox. Infiniband solutions from mellanox. 
http://www.mellanox.com. Date retrieved: 27 Jan 2005. 
[Mes] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing 
Interface. Date retrieved: 27 Jan 2005. 
[Myra] Inc Myricom. Myrinet overview. 
http://www.myri.com/myrinet/overview. Date retrieved: 27 Jan 2005. 
[Myrb] Myrinet. GM - the low-level message-passing system for myrinet networks. 
http://www.myri.com/scs/. Date retrieved: 27 Jan 2005. 
[N.R02] N.R. Adiga and others (sic.). An overview of the BlueGene/L supercomputer. In 
Supercomputing 2002, 2002. 
[Qua] Inc Quadrics. Quadrics QsNet high performance interconnect. 
http://doc.quadrics.com. Date retrieved: 27 Jan 2005. 
[SBB+93] M. W. Schmidt, K. K. Baldridge, J. A. Boatz, S. T. Elbert, M. S. Gordon, J. H. 
Jensen, S. Koseki, N. Matsunaga, K. A. Nguyen, S. Su, T. L. Windus, M. Dupuis, 
and J. A. Montgomery Jr. The general atomic and molecular electronic structure 
system. Journal of Computational Chemistry, 14, 1993. 
[SFBG00] MW Schmidt, GD Fletcher, BM Bode, and MS Gordon. The distributed data 
interface in GAMESS. Computer Physics Communications, 128, 2000. 
[SOH+00] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. 
MPI - The Complete Reference Volume 1, The MPI Core. The MIT Press, second 
edition, 2000. 
[SS88] Youcef Saad and Martin H. Schultz. Topological properties of hypercubes. IEEE 
Transactions on Computers, 37(7):867-872, 1988. 
[Sun90] V. S. Sunderam. PVM: A framework for parallel distributed computing. 
Concurrency, Practice and Experience, 2(4):315-340, 1990. 
[TMH04] Dave Turner, James Morris, and Kai-Ming Ho. 
Ames lab classic molecular dynamics (alcmd). 
http://www.cmp.ameslab.gov/cmp/CMP_Theory/cmd/alcmd_source.html, 
2004. Date retrieved: 27 Jan 2005. 
[Tur] David Turner. MP_Lite: A light weight message passing library. 
http://cmp.ameslab.gov/MP_Lite/. Date retrieved: 27 Jan 2005. 
[Wu85] A. Y. Wu. Embedding of tree networks into hypercubes. Journal of Parallel and 
Distributed Computing, 2:238-249, 1985. 
