Optimizing Applications and Message-Passing Libraries for the QPACE Architecture by Wunderlich, Simon









Chemnitz, March 9, 2009
Supervisor: Prof. Dr.-Ing. Wolfgang Rehm
University Adviser: Dipl-Inf. Torsten Mehlan
Company Advisers: Dipl-Ing. (FH) Hans Böttiger
Dipl-Ing. (FH) Heiko J. Schick
(IBM Deutschland Research & Development GmbH)
Acknowledgements
I would like to thank Prof. Wolfgang Rehm for providing the topic of this diploma
thesis and for his guidance, ideas and suggestions not only for this thesis but
also for my studies at TU Chemnitz. Further I would like to thank my company
advisers Heiko J. Schick and Hans Böttiger from IBM Deutschland Research
& Development GmbH who introduced me to IBM and the QPACE team and
supported me with much detailed information and many insightful discussions. My
university adviser Torsten Mehlan supported me with answers to many technical
and administrative questions I had while writing this thesis. I would also
like to thank the other colleagues of our computer architecture group Andreas
Heinig, Frank Mietke, René Oertel, Timo Schneider and Jochen Strunk for their
suggestions and help.
Also I am very grateful to the whole QPACE team for their support and answers
to many small but important questions, especially to Dr. Dirk Pleiter for the
interesting discussions and his helpful critical comments on my drafts. Special
thanks go to Willi Homberg of the Jülich Supercomputing Centre who kindly
gave me access and support on the JUICEnext QS22 cluster.
Maybe to the greatest part, I am deeply indebted to my girlfriend Christiane
Weidauer and my family who unconditionally supported me over all these years.
Abstract
The goal of the QPACE project is to build a novel cost-efficient massively parallel
supercomputer optimized for LQCD (Lattice Quantum Chromodynamics)
applications. Unlike previous projects which use custom ASICs, this is accomplished
by using the general purpose multi-core PowerXCell 8i processor
tightly coupled with a custom network processor implemented on a modern
FPGA. The heterogeneous architecture of the PowerXCell 8i processor,
its core-independent OS-bypassing access to the custom network hardware and
the application-oriented 3D torus topology pose interesting challenges for the
implementation of the applications. This work will describe and evaluate the
implementation possibilities of message passing APIs: the more general MPI,
and the more QCD-oriented QMP, and their performance in PPE centric or
SPE centric scenarios. These results will then be employed to optimize HPL for
the QPACE architecture. Finally, the developed approaches and concepts will
be briefly discussed regarding their applicability to heterogeneous node/network
architectures, as is the case in the "High-speed Network Interface with Collective
Operation Support for Cell BE (NICOLL)" project.
Theses
1. The QCD oriented special purpose computer QPACE can be opened up
for more general applications.
2. The communication patterns of HPL can be mapped on the 3D torus of
QPACE.
3. Collective operations can be implemented efficiently on the QPACE torus.
4. The QS22 HPL version provides a good foundation for an efficient
implementation on QPACE.




1.1 Motivation
1.2 QPACE
1.3 QCD Machines
1.4 Message Passing Libraries
1.5 NICOLL
1.6 Organization
2 High Performance LINPACK on QPACE
2.1 Introduction to the HPL Benchmark
2.1.1 Parameter
2.1.2 Benchmark Algorithm
2.1.3 Look-ahead
2.2 The QS22 Patch for HPL
2.2.1 Computation and Reorganization Specialists
2.2.2 PPE and SPE Load Balancing
2.2.3 Hugepage Support
2.2.4 MPI Collectives
2.2.5 Parameter Limitations
2.3 MPI Binding
2.3.1 Required MPI Features
2.3.2 MPI Versus QMP
2.4 HPL Communication Patterns
2.5 Process Mapping
2.6 Profiling
2.7 Communication Sizes and Network Requirements
2.7.1 Example Setup
2.7.2 Message Passing and Network Requirements
3 QPACE Architecture and Torus Communication
3.1 Introduction to the QPACE Architecture
3.1.1 QPACE Rack Configuration
3.2 Torus Network Hardware
3.2.1 Low-Level Communication Rules
3.2.2 SPE Access
3.2.3 PPE Access
3.2.4 Proof of Concept: Memcpy with MFC
3.2.5 Access Time Benchmarks
3.2.6 TX FIFO Filling Level Considerations
3.2.7 PPE Overhead Times
3.2.8 Latency Limitations on the PPE
3.2.9 Suggestions to Circumvent the PPE Latency Limitations
3.3 Torus Network Model
3.3.1 Memory Locations
3.3.2 Communicators
3.3.3 Topology Graphs
3.4 Communication Algorithms
3.4.1 Broadcast
3.4.2 Reduce
4 Message Passing Libraries on QPACE
4.1 SPE Centric Approaches
4.1.1 MPI on QMP
4.1.2 QMP on MPI
4.1.3 QMP on QPACE Torus
4.1.4 MPI on QPACE Torus
4.1.5 Conclusion
4.2 PPE Centric Approaches
4.2.1 PPE with direct MFC access to the NWP
4.2.2 Function Offloading
4.2.3 Integration into MPI or QMP
4.3 Programming Model Considerations for HPL
4.3.1 SPE Accelerated Communication Tasks
4.3.2 MPI Network Module
4.3.3 Conclusion
4.4 Integration into OpenMPI
4.4.1 OpenMPI Modular Component Architecture
4.4.2 OpenMPI Byte Transfer Layer (BTL)
4.4.3 Collectives Component (COLL)
5 Conclusion and Outlook
5.1 Application to the NICOLL Project
5.1.1 Related Work
5.1.2 Architecture Assumptions for NICOLL
5.1.3 QPACE Programming Models and Message Passing Strategies for NICOLL
5.2 Conclusion
5.3 Further Work
Bibliography
A Source Code
List of Figures
2.1 Example HPL.dat optimized for a single node run on QS22
2.2 Data distribution scheme over the process grid PxQ
2.3 High level visualization of the LU process
2.4 Look-ahead illustration: top without look-ahead, bottom with look-ahead
2.5 SPE function offload architecture
2.6 Embedding 2D and 1D tori in a 3D torus
3.1 QPACE node card schematic diagram
3.2 Overview of the QPACE architecture
3.3 Torus configuration alternatives within one backplane
3.4 Standard (top) and alternative cabling (bottom) of 4 racks
3.5 Communication Primitives
3.6 Comparison of memcpy performance
3.7 Sending packets to the NWP
3.8 Dimension ordered spanning tree in a 3 × 3 × 3 mesh
3.9 Illustration of the BlueGene/L adapted broadcast on a 4 × 4 × 4 mesh
3.10 Illustration of the 3D-EDF algorithm for a 3 × 3 × 3 torus
4.1 MPI on QMP block diagram for a possible implementation
4.2 QMP on MPI block diagram
4.3 List of used MPI calls in the QMP MPICH implementation
4.4 OpenMPI Layer Model
4.5 OpenMPI BTL Component Interface (OpenMPI version 1.2.8)
4.6 OpenMPI COLL Component Interface (as of OpenMPI 1.2.8)
5.1 The NICOLL Architecture
List of Tables
2.1 MPI Calls used in HPL
2.2 Communication algorithms in HPL
2.4 Example Profiling output from a run on 2 QS22
2.5 General message sizes and call counts
2.6 Message sizes and call counts for example setup
3.1 Memory Access times
3.2 Edge assignment to the Spanning Trees in a 3D Torus for spanning trees 1-3
3.3 Edge assignment to the Spanning Trees in a 3D Torus for spanning trees 4-6
4.2 OpenMPI Layers (partial list) (from [1, 2])
4.3 Code size of the HPL library (without MPI and BLAS, and test code), compiled with spu-gcc -Os
List of Acronyms
ACCFS: Accelerator File System
APE: Array Processor Experiment
ASIC: Application-Specific Integrated Circuit
BLAS: Basic Linear Algebra Subprograms
BML: BTL Management Layer
BTL: Byte Transfer Layer
Cell/B.E.™: Cell Broadband Engine™
COLL: OpenMPI Collectives Component
CP-PACS: Computational Physics by Parallel Array Computer System
DCR: Device Control Register
DDR: Double Data Rate
DMA: Direct Memory Access
DSP: Digital Signal Processor
EIB: Element Interconnect Bus
FLOP: Floating Point Operation
FPGA: Field Programmable Gate Array
FPU: Floating Point Unit
GBIF: Global Bus Infrastructure
GPGPU: General Purpose computing on Graphic Processing Units
HPL: High Performance Linpack
IO-MMU: IO Memory Management Unit
IWC: Inbound Write Controller
LQCD: Lattice Quantum Chromodynamics
MFC: Memory Flow Controller
MMIO: Memory-mapped Input/Output
MPI: Message Passing Interface
NICOLL: High-speed Network Interface with Collective Operation Support for Cell BE
NWP: NetWork Processor
OWC: Outbound Write Controller
PCIe®: PCI Express®
PML: Point to Point Management Layer
PPC64: PowerPC® 64
PPE: Power Processing Element
QCD: Quantum Chromodynamics
QMP: Lattice QCD Message Passing
QPACE: QCD Parallel computing on Cell Broadband Engine™
QS21: IBM BladeCenter® QS21
QS22: IBM BladeCenter® QS22
RAM: Random Access Memory
RSPUFS: Remote SPU File System
SIMD: Single Instruction Multiple Data
SPE: Synergistic Processing Element
SXU: Synergistic Execution Unit




This chapter gives an introduction to the topics tackled in this diploma thesis.
Section 1.1 explains the motivation for special purpose supercomputers and
for the High Performance LINPACK benchmark used to evaluate these systems. The
QPACE machine and its main components are briefly described in section 1.2.
Other QCD machines are described and compared to QPACE in section 1.3. The
message passing libraries which can be considered for the QPACE machine are
introduced in section 1.4. The NICOLL project and its objective are introduced
in section 1.5. Finally, the organization of the rest of this thesis is described in
section 1.6.
1.1 Motivation
Supercomputers are one of the most important research vehicles for modern science.
The applications demanding extraordinarily high computing power range
from climate research, weather forecasting, quantitative finance, and quantum
mechanical physics simulation to molecular modeling and cryptanalysis, to name
only a few. General purpose clusters built from commodity parts are
used to support a wide range of these applications, but special purpose
machines like the QPACE supercomputer are also built to support one specific
application efficiently. The hardware architecture of these special purpose machines
is designed according to the requirements of the target problem and can reach
higher performance for these problems at a lower cost or power consumption.
To compare the different supercomputers, the High Performance LINPACK
[3] benchmark (HPL), a portable version of LINPACK [4] for distributed memory
systems, measures the floating point performance by solving a linear system
Ax = b. This problem is one of the most time critical kernels within many
applications, and therefore this benchmark is traditionally used to compare
supercomputers on the basis of their sustained double precision floating point
operations per second (FLOP/s). Other benchmarks like
the HPC Challenge [5] benchmark which evaluate a wider range of properties
CHAPTER 1. INTRODUCTION
of the supercomputers have been proposed, but HPL is still the most popular
benchmark for comparison.
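The sustained rate reported for such a run follows directly from the problem size and the measured wall-clock time. The following Python sketch is not taken from HPL itself; it assumes the operation count 2/3 N³ + 3/2 N² that is commonly credited for an LU-based solve of an N × N system, and the function names and example numbers are hypothetical.

```python
def lu_solve_flop_count(n: int) -> float:
    """Operations credited for solving an n x n dense system via LU.

    Assumption: the commonly used count 2/3 n^3 + 3/2 n^2.
    """
    return (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2


def sustained_gflops(n: int, seconds: float) -> float:
    """Sustained rate in GFLOP/s for a solve of size n that took `seconds`."""
    if seconds <= 0:
        raise ValueError("run time must be positive")
    return lu_solve_flop_count(n) / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical problem size and run time, chosen only to show the arithmetic.
    print(f"{sustained_gflops(20000, 60.0):.1f} GFLOP/s")
```

Because the cubic term dominates, doubling N increases the credited work roughly eightfold, which is why large problem sizes are preferred when the sustained rate is to approach the peak rate.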
A public ranking of the fastest supercomputers based on their HPL performance
has been maintained since 1993 in the top 500 list [6], which is updated
twice a year. This list has been established to detect trends in the high
performance computing market. An early implementation of HPL on new machines
is therefore not only interesting for performance evaluation but also important
for promotional reasons.
The green500 list [7] is a new ranking established in 2007 which ranks the
most energy efficient supercomputers from the top 500 list based on their HPL
performance per Watt ratio. In this list the focus moves from raw computing
power to the energy efficiency of the machines, which becomes increasingly important
for modern data centers. The machines on the first 7 ranks of the current
green500 list (November 2008) are based on IBM® PowerXCell™ 8i processors,
which are among the most energy efficient commodity processors available today.
The QPACE machine, which also uses this processor along with a very power
efficient architecture, could be a strong competitor on this green500 list.
1.2 QPACE
The QPACE machine (QCD Parallel computing on Cell Broadband
EngineTM) [8, 9] is a supercomputer designed for Lattice Quantum Chromo-
dynamics (LQCD) applications. Quantum Chromodynamics (QCD) is a well
established theory describing interactions between quarks and gluons, the build-
ing blocks of particles such as the neutron and the proton. To perform numerical
simulation which are necessary to tackle certain QCD problems, Lattice QCD
uses a discretized formulation on a space-time lattice which is suitable for mas-
sively parallel computers. QCD on the Cell/B.E.TMarchitecture was already
explored by many different groups [10, 11, 12, 13, 14]. One of the most time
consuming kernels in LQCD codes is the Wilson-Dirac Operator [15], and it
has been shown that an implementation on the PowerXCell 8i processor can
reach an efficiency of 25% of the peak performance [10]. Hence, The Cell/B.E.
architecture appears to be a very promising platform.
The Cell Broadband Engine™ processor (Cell/B.E.) [16] is a heterogeneous
multicore microprocessor jointly developed by IBM, Sony and Toshiba. It features
one general purpose PowerPC compatible core called PPE (Power Processing
Element) and 8 vector processing units called SPEs (Synergistic Processing
Elements), which are interconnected on the chip by the high performance
ring bus EIB (Element Interconnect Bus). While the PPE is dedicated
to the operating system and PPC64 compatible legacy applications, the
SPEs provide the bulk of the potential of the architecture. These SIMD processors
do not access the main memory directly, but operate on a 256 KiB local memory
called Local Store and transfer data from and to main memory using an integrated
DMA engine called MFC (Memory Flow Controller). The processing core
SXU (Synergistic Execution Unit) of an SPE is decoupled from the memory
transfers, which allows an overlap of computation and communication.
Commercially available machines which use the Cell/B.E. architecture are the Sony
PlayStation 3, which uses a Cell/B.E. processor with 7 enabled SPEs, and the
IBM BladeCenter® QS20 and QS21, which are SMP systems with 2 Cell/B.E.
processors. The IBM BladeCenter QS22 is the successor of these systems and is
equipped with 2 PowerXCell 8i processors, an updated version of the Cell/B.E.
with highly improved double precision floating point performance of the SPEs.
The QPACE networks are implemented on an FPGA (Field Programmable
Gate Array). The advantages of using an FPGA over a custom ASIC
(Application-Specific Integrated Circuit) are the lower risk and development
costs of the system, and the possibility to reconfigure the chips even after the
system is deployed. Three networks are implemented: standard Gigabit Ethernet
for storage access and maintenance, a global signal tree for application
communication and exception handling, and a custom high performance 3D
torus network with nearest neighbor connections which is designed for the QCD
applications.
The QPACE architecture is designed to be a low power architecture with
very high performance. The PowerXCell 8i processor is a suitable choice for
this, as it delivers the highest performance per watt of the currently available
commodity processors. Implementing only the required south bridge functionality
in the FPGA further reduces the number of power consuming
standard components. By using special metal enclosures for the node cards and
a cost-efficient cooling system, a very high package density can be reached.
At the time of writing this thesis, the QPACE machine is still being actively
developed by several academic institutions together with IBM at the
IBM Research and Development Lab in Böblingen (Germany). The QPACE
torus network is not yet complete enough to allow library implementations or a final
performance evaluation. For this reason, the contribution of this work is not a
finished implementation of an application or message passing library.
Instead, it presents concepts, requirements, and performance extrapolations and
proposes enhancements.
1.3 QCD Machines
Specialized QCD machines have a long history [17]. Among the first QCD
machines was the PACS/PAX [18, 19] series installed at Kyoto University
and the University of Tsukuba. The first PACS machine, PACS-9, which was
built in 1978, used a 3 × 3 two-dimensional grid of 8 bit microprocessors. The
processors were equipped with 1 KiB of memory, reaching an aggregate
performance of 0.01 MFLOP/s (Million Floating Point Operations per Second) when
solving the Poisson equation. Several upgraded machines were built, and the latest
PACS machine, installed in 2005, is the PACS-CS [20], which employs low-power
Intel® Xeon® processors interconnected in a Gigabit Ethernet based 3-dimensional
hyper crossbar of size 16 × 16 × 10. The whole system reaches a
peak performance of 14.3 TFLOP/s and is the first to break with the tradition
of custom processors and network interfaces, which were still used for its
predecessor CP-PACS [21] in 1997.
The QCDSP [22, 23] machines installed in 1998 at Columbia University and
RIKEN-BNL research center employ Texas Instruments DSPs (Digital Signal
Processor) for the calculation and a custom ASIC for accessing the 4D mesh
network and memory prefetching. The two machines reach a combined peak
performance of 1 TFLOP/s, and a maximum sustained performance of 30% of
the peak performance for the QCD code. The successor QCDOC [23, 24, 25]
(QCD on a Chip) instead used a IBM System-on-a-Chip design. A PPC440
core with an attached 64 bit IEEE FPU, 4 MiB1 memory, 2 Ethernet Con-
trollers, a DDR RAM controller, and a communication controller supporting
24 communication links were combined on a custom ASIC. With this ASIC, all
components of the QCDSP node card are combined on a single chip. A 6D mesh
network, standard Ethernet and a global interrupt network were employed to
interconnect the node cards. The machines installed in 2004 and 2005 at the
Brookhaven National Laboratory and the University of Edinburgh both reach
a peak performance greater than 10 TFLOP/s.
Another family of QCD computers with a long history are the APE machines
(Array Processor Experiment), which were designed and manufactured by different
research groups of the Italy based INFN (National Institute of Nuclear
Physics), DESY (Deutsches Elektronen Synchrotron), and Université Paris-Sud
11 in Orsay. After the first APE machine [26], which was built starting in 1984 and
featured a 16 node linear array system, the APE 100 [27], the APEmille [28] and
APEnext [29] machines were built and deployed in this chronological order. The
APEnext machine features a custom VLIW (Very Long Instruction Word) processor
implemented in a VLSI (Very-Large-Scale Integration) chip for computation
and 32 MiB RAM on each node card. 16 node cards are combined on
one processing board. They are backed by Linux® based standard PC host
systems which handle the start up and disk I/O. A backplane crate holds 16 of
these processing boards. The nodes are interconnected with a custom 3D torus
network which is integrated in the processing boards and backplane crates. The
largest installation can be found in Rome with 6656 nodes arranged in 13 racks,
reaching a theoretical peak performance of 10.6 TFLOP/s.
Different factors motivated these groups to use custom hardware instead of
commodity hardware. The standard processors available did not offer enough
computing power to implement the QCD kernels efficiently. This argument
became weaker with the introduction of SIMD extensions such as SSE in conventional
microprocessors. Another reason is the high power consumption of standard
processors. The custom microprocessors work with a fraction of the power which
standard processors would need to deliver the same performance for the specific
application. This allows a much tighter packaging and better cooling. The
optimal network for QCD applications is a low latency multi-dimensional mesh
network, and the coupling of the network and the processing elements can be
built much tighter with a custom processor.
¹ The IEC binary prefixes for storage units are used in this document. 1 GiB = 1024 MiB,
1 MiB = 1024 KiB, 1 KiB = 1024 Byte
The decision for QPACE to use a PowerXCell 8i processor was motivated
by the fact that it can deliver high peak performance with a very good
performance per Watt ratio, and the needed QCD kernels could be implemented quite
efficiently on this architecture [11, 10]. The tight coupling is accomplished by
directly connecting the PowerXCell 8i processor with the FPGA without using
a south bridge in between. The cost and complexity of developing a special purpose
VLSI chip in modern chip technologies are so high that this is no longer feasible
for an academic project of this scale.
1.4 Message Passing Libraries
The Message Passing Interface Standard (MPI) [30, 31] is the de facto standard
for message passing and communication on general purpose distributed
memory systems. It is implemented as a library with interfaces to C, C++
and FORTRAN. Optimized libraries are available for many different processor
and network architectures. MPI provides various send and receive mechanisms
between pairs of processes, so-called point-to-point communication, as well
as communication within a group of processes, so-called collective communication.
These groups of processes are organized in so-called communicators.
A communicator is a subset of processes from the complete process set, and
can be configured according to the distribution requirements of the application.
There are also special communicators, like the Cartesian communicator or graph
communicators, which allow the application to be aligned to special logical
topologies. Furthermore, the messages may use arbitrary datatypes, and MPI
takes care of the correct packing and machine datatypes. This feature makes it
possible to use heterogeneous clusters where the machine words may have a different
format. The MPI 2 standard extends these features with one-sided communication,
dynamic process management and I/O functionality.
Another message passing standard is the QCD Message Passing Interface
(QMP). It is designed for QCD applications, which usually communicate in lattice
topologies. QMP provides point-to-point communication with a special
focus on repetitive communication patterns. The communication is done between
the neighbors of a logical multi-dimensional torus or between arbitrary
pairs of processes. QMP also provides a set of collective algorithms which QCD
applications use, but these are fewer and less general than those of MPI. These
collectives only work on the complete set of processes. QMP can therefore be
considered a functional subset of MPI.
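The neighbor relation in a logical multi-dimensional torus can be made concrete with a small sketch. The Python code below does not use the MPI or QMP API. It only assumes that ranks are numbered in row-major order over the torus dimensions (last dimension varying fastest) and that every dimension wraps around; all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def rank_to_coords(rank: int, dims: Sequence[int]) -> Tuple[int, ...]:
    """Convert a linear rank to torus coordinates (last dimension varies fastest)."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return tuple(reversed(coords))


def coords_to_rank(coords: Sequence[int], dims: Sequence[int]) -> int:
    """Convert coordinates back to a rank, wrapping each coordinate periodically."""
    rank = 0
    for coord, extent in zip(coords, dims):
        rank = rank * extent + (coord % extent)
    return rank


def torus_neighbors(rank: int, dims: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Return (axis, direction, neighbor_rank) for both directions of every axis."""
    coords = rank_to_coords(rank, dims)
    neighbors = []
    for axis in range(len(dims)):
        for step in (-1, +1):
            shifted = list(coords)
            shifted[axis] += step  # the wrap-around happens in coords_to_rank
            neighbors.append((axis, step, coords_to_rank(shifted, dims)))
    return neighbors


if __name__ == "__main__":
    # Hypothetical 4 x 4 x 4 torus: every process has six nearest neighbors.
    for axis, step, neighbor in torus_neighbors(0, (4, 4, 4)):
        print(f"axis {axis}, direction {step:+d}: rank {neighbor}")
```

In a three-dimensional torus each process has exactly six such neighbors, which is the communication pattern a nearest neighbor network serves directly.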
1.5 NICOLL
The NICOLL project (High-speed Network Interface with Collective Operation
Support for Cell BE) [32], a project of the Computer Architecture Group
at Chemnitz University of Technology in collaboration with the Center of Advanced
Study IBM Böblingen, Development and Research, is a case study of a
heterogeneous system with an AMD Opteron™ microprocessor and a Cell/B.E.
microprocessor for acceleration. The microprocessors are tightly coupled using
an FPGA as a bridge between the AMD HyperTransport™ protocol and the
Cell/B.E. interface. The objective of this project is to develop a research prototype
and to explore the potential of this tightly coupled hybrid system in an HPC
environment using InfiniBand as interconnect. One main focus of this project is to
research the potential of the Cell/B.E. architecture as an accelerator for collective
operations.
1.6 Organization
This diploma thesis is organized into 4 parts: An overview of HPL and a discussion
of how to implement this application efficiently on QPACE are presented
in chapter 2. In chapter 3, approaches to using the QPACE torus network, which
was originally designed for QCD applications, for general purpose communication
are presented, and torus-optimized collective algorithms are discussed. Possible general
message passing libraries for the QPACE architecture and consequences for the
programming models and HPL are discussed in chapter 4. Finally, this work is
concluded and the applicability of these results and strategies is discussed for
the NICOLL project in chapter 5.





This chapter describes the HPL application and optimization approaches on
the QPACE architecture. Sections 2.1 and 2.2 give an introduction to the HPL
benchmark and its PowerXCell 8i optimized QS22 version. The communication
patterns and the implementation on MPI are analyzed in sections 2.3 and 2.4.
The process mapping for the QPACE HPL version is discussed in section 2.5.
Finally, based on profiling runs of the existing HPL version as described in section 2.6,
an extrapolation of the expected message patterns and sizes is given for a large
QPACE setup in section 2.7, and requirements for the network hardware are
formulated.
2.1 Introduction to the HPL Benchmark
HPL [3, 33] is the High Performance LINPACK Benchmark, a portable, scalable
implementation of LINPACK [4] used to measure and compare the performance
of large supercomputers with shared or distributed memory. The software package
depends on either the Basic Linear Algebra Subprograms (BLAS) [34, 35] or the
Vector Signal Image Processing Library (VSIPL) [36] for the local linear algebra
tasks and on an MPI 1.1 [30, 31] compliant implementation for message passing.
LINPACK implements a solver for a dense system of linear equations Ax = b
using the right-looking LU factorization [37, 38] as a variant of Gaussian
elimination. To exploit the processing power of machines with memory hierarchies,
blocking algorithms are used which enable the implementation of most
performance-critical routines with Level 3 BLAS routines [3, 39], which are
known to reach close to peak performance on most architectures.
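The right-looking variant can be illustrated with a minimal, unblocked sketch in Python. It is not the HPL implementation, which works on blocks of NB columns and distributes the matrix over many processes. The sketch only shows the defining property: after each pivot column is processed, the whole trailing submatrix is updated immediately. Partial pivoting by rows is assumed, and the function name is hypothetical.

```python
from typing import List


def lu_solve_right_looking(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b with an unblocked right-looking LU and partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies, leave the inputs untouched
    b = b[:]
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k at or below row k.
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0:
            raise ValueError("matrix is singular")
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            b[k], b[pivot] = b[pivot], b[k]
        # Store the multipliers (column k of L) below the diagonal.
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        # Right-looking step: update the entire trailing submatrix now.
        for i in range(k + 1, n):
            multiplier = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= multiplier * a[k][j]
    # Forward substitution with the unit lower triangular factor L.
    y = b[:]
    for i in range(n):
        for j in range(i):
            y[i] -= a[i][j] * y[j]
    # Back substitution with the upper triangular factor U.
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= a[i][j] * x[j]
        x[i] /= a[i][i]
    return x


if __name__ == "__main__":
    print(lu_solve_right_looking([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

In a blocked formulation the trailing update becomes a matrix-matrix product, which is the part that maps onto Level 3 BLAS routines.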
The following analysis is based on the HPL version 1.0a, freely available on
the HPL website [33]. At the time of writing, HPL version 2.0 is also available,
which adds a better random number generator [40] and a new correctness test.
CHAPTER 2. HIGH PERFORMANCE LINPACK ON QPACE
As these changes are not critical for the QPACE case and the QS22 patch is
written for version 1.0a, the version 2.0 was not used.
2.1.1 Parameter
The benchmark is controlled by various parameters from the input file which
control the algorithm variants implemented in LINPACK. The most important
parameters are:
• The problem size N specifies the size of the linear system, where an N × N
matrix is factorized. This parameter is limited by the available RAM of
the system.
• The block size NB specifies the size of the sub-matrices used for computation.
It depends on the employed BLAS implementation and hardware
characteristics like cache size and architecture.
• The grid size P × Q specifies the data distribution of the blocks. It depends
mostly on the network interconnect between the processes.
Other parameters select algorithm variants for the local factorization and
various thresholds. Optimal values for them have been found by single node runs.
Furthermore, there are parameters to select different algorithms for the
communication. The selection of an algorithm should depend on whether it can
be implemented in the high-capacity torus network. These algorithms are further
investigated in section 2.4. Another interesting parameter is the depth of the
look-ahead, which is described in section 2.1.3. A complete sample configuration
file used for a single node is given in Figure 2.1.
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
20479        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
0            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
3            SWAP (0=bin-exch,1=long,2=mix,3=MPI-coll)
64           swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
64           memory alignment in double (> 0)
Figure 2.1: Example HPL.dat optimized for a single node run on QS22
The blocks are distributed in a row and block cyclic fashion over the pro-
cess grid PxQ as illustrated in Figure 2.2. This distribution assures good load
balancing over the processes while maintaining local alignment to block sizes
within the main memory. Locally, the data is stored in column major order
format.
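The block-cyclic distribution can be illustrated with a short sketch. It
assumes the usual two-dimensional block-cyclic rule, in which block (bi, bj)
belongs to process (bi mod P, bj mod Q); the function names are chosen for
this illustration only and are not taken from the HPL sources.

```python
def block_owner(i, j, nb, p, q):
    """Return (process row, process column) owning global element (i, j)."""
    return (i // nb) % p, (j // nb) % q


def local_blocks(n, nb, p, q):
    """Count the blocks (partial edge blocks included) held by each process."""
    nblocks = -(-n // nb)  # ceiling division
    counts = {(r, c): 0 for r in range(p) for c in range(q)}
    for bi in range(nblocks):
        for bj in range(nblocks):
            counts[(bi % p, bj % q)] += 1
    return counts


if __name__ == "__main__":
    print(block_owner(300, 130, nb=128, p=2, q=2))
    print(local_blocks(n=1024, nb=128, p=2, q=2))
```

Because consecutive blocks go to consecutive processes, every process keeps a
share of the shrinking trailing sub-matrix until late in the factorization,
which is the load-balancing property mentioned above.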
A process only communicates with other processes within its local row and
local column, depending on its position in the process grid. The ratio of P and
Q therefore has a strong influence on the communication pattern and load. A
description of the communication patterns and their communication space (rows
or columns) is given in section 2.4, and the message sizes and number of
calls depending on the grid size are given in section 2.7.
Figure 2.2: Data distribution scheme over the process grid PxQ
2.1.2 Benchmark Algorithm
The main benchmark can be divided into the following steps for a single run:
1. Init and Setup
2. for all panels in A
(a) factorize a panel
(b) broadcast the panel
(c) update trailing sub-matrix
3. backward substitution
4. check the solution
The time for steps 2 and 3 is measured and used for the time result of the
benchmark. Step 2 is visualized in Figure 2.3. The panel factorization step
2a is only performed by the column process group of the grid which owns the
panel. The other processes receive the factorized panel via the broadcast in step
2b and use it to update the trailing sub-matrix in step 2c. The sub-matrix
shrinks by the block size in both dimensions in each iteration, and by the end of
step 2 the matrix A is replaced by the lower triangular matrix L and the upper
triangular matrix U. In step 3, backward substitution is used to obtain the final
result vector x. To verify the result, various scaled residuals are computed to
check the correctness and precision of the computation.
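The loop of step 2 can be sketched for a single process as follows. This is a
simplified illustration under stated assumptions: it uses plain Python lists,
omits the row pivoting that HPL performs, and only marks the place where a
distributed run would broadcast the panel. The function name blocked_lu is
hypothetical.

```python
def blocked_lu(a, nb):
    """In-place right-looking blocked LU factorization without pivoting.

    On return, a holds the unit lower triangular L below the diagonal
    and the upper triangular U on and above the diagonal.
    """
    n = len(a)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # (a) factorize the panel a[k:n][k:k+kb] column by column
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, k + kb):
                    a[i][c] -= a[i][j] * a[j][c]
        # (b) a distributed implementation would broadcast the panel here
        # (c) update, part 1: U12 = L11^-1 * A12 (the DTRSM part)
        for j in range(k, k + kb):
            for i in range(j + 1, k + kb):
                for c in range(k + kb, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # (c) update, part 2: A22 -= L21 * U12 (the DGEMM part)
        for i in range(k + kb, n):
            for c in range(k + kb, n):
                s = 0.0
                for j in range(k, k + kb):
                    s += a[i][j] * a[j][c]
                a[i][c] -= s
    return a
```

Without pivoting the sketch is only safe for matrices such as diagonally
dominant ones; it is meant to show the panel, broadcast and update structure,
not to replace the numerically robust HPL implementation.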
Figure 2.3: High level visualization of the LU process
2.1.3 Look-ahead
A very popular optimization of blocked factorization algorithms is the
look-ahead. With the basic algorithm as described above, process columns which
do not factorize a panel have to wait for the broadcast and the matrix update
before they can start to factorize their own panel. This creates idle times, as
we can see in Figure 2.4. With look-ahead, the matrix update is postponed and
the factorization of the local panel is brought forward, thus creating an
overlap of the different steps. The dependencies between the different steps of
the LU part are relaxed, and the pipelining effect removes idle times on the
processors. Different look-ahead depths can be selected for HPL, where a depth
of 1 or 2 is assumed to give the best performance. More detailed information
about this technique can be found e.g. in [41, 42].
Figure 2.4: Look-ahead illustration: top without look-ahead, bottom with look-
ahead
2.2 The QS22 Patch for HPL
The QS22 patch for HPL, optimized by IBM [43], contains multiple optimizations
to maximize the performance on QS22 blade clusters. Because the QS22
architecture uses the same PowerXCell 8i processor and has memory performance
comparable to the QPACE node cards, the QS22 patch for HPL offers a good
foundation for a QPACE-optimized HPL version.
2.2.1 Computation and Reorganization Specialists
The performance of HPL is dominated by the BLAS routines DGEMM and
DTRSM. The QS22 patch contains so-called specialists, optimized routines
for certain kernels which are implemented on the SPEs. To reach maximum
performance, the data is reordered into a custom blocked-row format by SPE
reorganization specialists before the computation specialists are called. The
data is reformatted back just before its panel factorization, and by the end of
the LU process it is completely stored in the conventional column-major format.
SPE routines are called by function offloading: the PPE informs the SPEs
by mailbox messages about the task to do. The SPEs then fetch the parameters
from main memory, execute the job and notify the PPE about the
completion by setting a byte in a completion array in main memory.
2.2.2 PPE and SPE Load Balancing
Figure 2.5: SPE function offload architecture
Most of these specialists are called synchronously, i.e. the PPE starts the SPEs
and waits (idles) for their completion. The only point where the specialists are
called asynchronously is within the update of the trailing sub-matrix: the SPEs
are called for a (very long) DGEMM matrix multiplication operation while the
PPE polls for the panel broadcast. Figure 2.5 illustrates the design of this
function offload mechanism.
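The offload pattern (mailbox message, job execution, completion byte, polling
PPE) can be modelled with ordinary threads. The sketch below is only an analogy
under stated assumptions: Python threads stand in for SPEs, a queue for the
mailbox and a bytearray for the completion array in main memory. None of the
names correspond to the actual QS22 patch or to Cell SDK calls.

```python
import queue
import threading
import time

NUM_SPES = 8


def spe_worker(spe_id, mailbox, completion):
    """Model of one SPE: wait for a task, run it, set the completion byte."""
    while True:
        task = mailbox.get()
        if task is None:          # shutdown request
            return
        func, args = task         # a real SPE would fetch parameters by DMA
        func(*args)
        completion[spe_id] = 1


def offload(mailboxes, completion, func, args_per_spe):
    """PPE side: clear the completion bytes and post one task per SPE."""
    for spe_id, args in enumerate(args_per_spe):
        completion[spe_id] = 0
        mailboxes[spe_id].put((func, args))


def run_demo():
    mailboxes = [queue.Queue() for _ in range(NUM_SPES)]
    completion = bytearray(NUM_SPES)
    results = [0] * NUM_SPES
    threads = [threading.Thread(target=spe_worker,
                                args=(i, mailboxes[i], completion))
               for i in range(NUM_SPES)]
    for t in threads:
        t.start()

    def job(spe_id, lo, hi):
        results[spe_id] = sum(range(lo, hi))

    offload(mailboxes, completion, job,
            [(i, i * 1000, (i + 1) * 1000) for i in range(NUM_SPES)])
    polls = 0
    # Asynchronous call: instead of idling, the PPE could poll for the
    # panel broadcast inside this loop.
    while not all(b == 1 for b in completion):
        polls += 1
        time.sleep(0)
    for mb in mailboxes:
        mb.put(None)
    for t in threads:
        t.join()
    return sum(results), polls
```

In the synchronous case the polling loop does nothing useful; the asynchronous
DGEMM call is the one place where that waiting time is spent on communication.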
2.2.3 Hugepage Support
Using the conventional main memory with 4 KiB page size to store the matrix
produces a lot of TLB misses when working on large data sets, which reduces
performance. This TLB thrashing can be avoided by using huge pages
with a page size of 16 MiB, as far fewer page table entries are needed to
reference the matrix data.
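The difference in the number of page entries is easy to quantify. The sketch
below assumes a local matrix share of 3840 MiB per node, the value used for the
example setup in section 2.7.1; the helper name is illustrative.

```python
KIB = 1024
MIB = 1024 * KIB


def pages_needed(data_bytes, page_bytes):
    """Number of pages required to map data_bytes (rounded up)."""
    return -(-data_bytes // page_bytes)


if __name__ == "__main__":
    matrix_bytes = 3840 * MIB  # assumed per-node share of the matrix
    print(pages_needed(matrix_bytes, 4 * KIB))   # conventional pages
    print(pages_needed(matrix_bytes, 16 * MIB))  # huge pages
```

With these assumptions the number of pages drops from 983040 to 240, a factor
of 4096, which is simply the ratio of the two page sizes.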
2.2.4 MPI Collectives
The algorithm variants of the matrix update step, which were implemented using
point-to-point communication primitives in the original HPL, were extended by
a version using MPI collectives. The MPI library might provide algorithms for
these tasks which are optimized for the underlying network and perform better
than the original point-to-point versions.
2.2.5 Parameter Limitations
The computation specialists operate on block sizes of 128 elements, thus the
parameter NB is fixed to 128. As the matrix A is stored along with the vector
b in an N × (N + 1) matrix, the best performance is reached if N has the form
N = 128 · k − 1, k ∈ ℕ. All blocks will then be properly aligned. L1 must
not be transposed (1), and U must be transposed (0), as only this parameter
combination is implemented. The memory alignment must be set to a multiple
of 64 double words.
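These constraints can be collected in a small checker. It is a sketch written
for this text, not part of the QS22 patch; the argument names loosely follow
the HPL.dat entries of Figure 2.1, and the form of N is treated as a
recommendation rather than a hard requirement.

```python
def check_qs22_parameters(n, nb, l1_form, u_form, align):
    """Return a list of violated constraints (empty if all are met)."""
    problems = []
    if nb != 128:
        problems.append("NB must be 128")
    if (n + 1) % 128 != 0:
        problems.append("N should have the form 128*k - 1")
    if l1_form != 1:
        problems.append("L1 must be in no-transposed form (1)")
    if u_form != 0:
        problems.append("U must be in transposed form (0)")
    if align <= 0 or align % 64 != 0:
        problems.append("memory alignment must be a multiple of 64 doubles")
    return problems


if __name__ == "__main__":
    # Values of the single node configuration in Figure 2.1
    print(check_qs22_parameters(n=20479, nb=128, l1_form=1, u_form=0, align=64))
```

For the single node configuration, N = 20479 = 128 · 160 − 1, so the returned
list is empty.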
2.3 MPI Binding
The following list of HPL/LINPACK MPI Functions has been extracted from
the HPL binary with the IBM patch.
Component         MPI Functions
Initialization    MPI_Init, MPI_Wtime, MPI_Finalize, MPI_Abort
Message Passing
Table 2.1: MPI Calls used in HPL
2.3.1 Required MPI Features
The requirements on the MPI implementation are discussed for each feature set.
This is important for a possible early MPI subset for the QPACE torus, and for
the decision whether the QCD Message Passing Library QMP is sufficient for the
HPL benchmark.
Message passing
Panel broadcasts are processed in a nonblocking way. Many functions probe at
their beginning or end whether a panel has been received, and forward it if
needed. Instead of probing and then calling MPI_Recv, it is possible to post a
nonblocking MPI_Irecv operation and poll for its completion with MPI_Test.
Messages do not overtake each other for one pair of sender and receiver in
HPL. Panel broadcast communication is overlapped with panel factorization
communication, but these communication patterns are performed in different
communicators, either row or column communicators.
Probing for a message is not possible in QMP. It would be necessary to
replace the probing mechanism with the nonblocking receive counterparts as
described above. Nonblocking receive operations are supported by QMP.
Communicators
MPI_Comm_split() is used to build row and column communicators. Most
collectives happen in one column (panel factorization) or in one row (panel
broadcast etc). The process grid spanned by HPL can be mapped on the 3D-
torus as described in section 2.5. The resulting communicators are tori of lower
dimension.
QMP does not provide communicators, but mapping row/column commu-
nicator ranks to global ranks is trivial, which is sufficient as long as primitive
send and receive operations are used.
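The rank translation mentioned above can be sketched as follows. The sketch
assumes the row-major process mapping (PMAP = 0 in Figure 2.1), where the
global rank is row · Q + column; the function names are illustrative and do not
belong to HPL or QMP.

```python
def grid_coords(rank, p, q):
    """Return (row, column) of a global rank in a row-major P x Q grid."""
    if not 0 <= rank < p * q:
        raise ValueError("rank outside the process grid")
    return divmod(rank, q)


def row_members(rank, p, q):
    """Global ranks of the row communicator, indexed by column rank."""
    row, _ = grid_coords(rank, p, q)
    return [row * q + c for c in range(q)]


def column_members(rank, p, q):
    """Global ranks of the column communicator, indexed by row rank."""
    _, col = grid_coords(rank, p, q)
    return [r * q + col for r in range(p)]


if __name__ == "__main__":
    print(grid_coords(4, p=2, q=3))
    print(row_members(4, p=2, q=3))
    print(column_members(4, p=2, q=3))
```

A send to rank c of the row communicator then becomes a send to the global rank
row_members(my_rank, p, q)[c], which is all that primitive send and receive
operations require.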
Datatypes
Most usage of MPI Datatypes can be turned off by setting the define
HPL_NO_MPI_DATATYPE. However, MPI_Type_vector datatypes are not disabled.
They are used to send non-contiguous pieces of a block in various functions.
QMP does provide support for vector and strided datatypes, but not for struc-
tured datatypes. By using the HPL_NO_MPI_DATATYPE directive, the QMP
datatypes can be employed without further changes.
Collective communications
Most of the communication in LINPACK is collective communication. In the
original HPL it is implemented with only Send/Recv primitives to allow
benchmarking of early machine prototypes which only have an MPI subset available.
It is possible to replace some of them with MPI collectives, but others like the
panel broadcast overlap computation and communication and call the collectives
in a nonblocking way. Replacing these algorithms with blocking MPI collectives
would severely degrade the performance. An overview of the communication
algorithms used is given in section 2.4. Using MPI collectives has the advantage
that they could be optimized for the 3D torus topology, while the original
implementations use hypercube algorithms or other topologies which cannot be
mapped on a torus with nearest neighbor communication only.
Besides initialization routines which use the MPI_COMM_WORLD communicator,
all time-critical communication uses either row or column communicators. The
collectives must therefore support these sub-communicators. QMP does not
support sub-communicators. It provides only a few of the needed collectives
with limited functionality (e.g. broadcast only from root = 0) in the global
communicator.
2.3.2 MPI Versus QMP
The QMP collectives do not appear to be useful for HPL. The collective routines
have to be implemented from scratch for the torus in both the MPI and the QMP
case. Nothing would prevent implementing collectives on top of the QMP
primitives. However, implementing them directly on the torus hardware instead
of QMP has more optimization potential. For example, a broadcast could use a
cut-through approach to forward a packet as soon as it arrives instead of
waiting for a whole buffer. A light-weight MPI subset with the required
collectives could be implemented directly on the low-level torus API. This
could be reused for other applications later, even by a QMP implementation.
The message passing library must support the main memory as a valid location
for message buffers, as the matrix data is stored in main memory. Flow
control and buffer management in the collective algorithms can also take this
issue into account for further optimization, e.g. by scheduling when the buffers
are kept in the Local Store or written back to main memory.
2.4 HPL Communication Patterns
An overview of the communication algorithms of HPL is given in Table 2.2.
We will evaluate whether the functions can be replaced by MPI collectives to
optimize them for the 3D torus topology and gain performance advantages.
As we can see, all communication can be optimized to use the 3D torus topol-
ogy efficiently, given a reasonable process-to-node mapping. Some algorithms
use ring topologies which can be embedded into the torus. Other algorithms
can be replaced with MPI counterparts which in turn can use optimized trees
within the 3D torus. Each of them can be implemented to use nearest neighbor
communication only (see sections 2.5 and 3.4).
2.5 Process Mapping
Figure 2.6: Embedding 2D and 1D tori in a 3D torus
A key factor for using the QPACE torus network efficiently in HPL is to map the
process grid into the 3D torus such that row and column communication of each
process can be performed using nearest neighbor communication. We therefore
need to embed the row and column communicators, which are logically utilized
as 1D rings or for collective communication, into the torus. One possible
mapping of 2 rings, expressed as functions from zero-based Cartesian
coordinates to ranks, is:
\[
\begin{aligned}
\mathrm{ring}_1(x, y, z) &= x \\
\mathrm{ring}_2(x, y, z) &= z \cdot n_y + \mathrm{even}(z) \cdot y + (1 - \mathrm{even}(z)) \cdot (n_y - 1 - y) \\
\mathrm{even}(z) &= \begin{cases} 1 & z \text{ is even} \\ 0 & z \text{ is odd} \end{cases}
\end{aligned}
\]
For tori with even dimensions, this mapping creates rings of length nx
and ny · nz which satisfy the topology requirements of HPL. An illustration is
given in Figure 2.6. Note that ring2 can also be used as a 2D torus (the
additional edges are dotted in the illustration). It can be used for collective
communication, which will be faster than using only the logical ring because of
more links and shorter path lengths.
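The ring embedding can be checked mechanically. The sketch below assumes
zero-based coordinates and therefore uses ny − 1 − y for odd z, which is what
makes the ranks of neighboring z-planes distinct; this indexing convention and
the helper is_nearest_neighbor_ring are assumptions of this illustration.

```python
def ring1(x, y, z):
    """Rank within the first ring (the x dimension)."""
    return x


def ring2(x, y, z, ny):
    """Rank within the second ring, a snake through the y-z plane."""
    if z % 2 == 0:
        return z * ny + y
    return z * ny + (ny - 1 - y)


def is_nearest_neighbor_ring(ny, nz):
    """Check that consecutive ring2 ranks are torus neighbors in (y, z)."""
    pos = {ring2(0, y, z, ny): (y, z) for y in range(ny) for z in range(nz)}
    if sorted(pos) != list(range(ny * nz)):
        return False  # ranks are not unique
    for r in range(ny * nz):
        (y0, z0), (y1, z1) = pos[r], pos[(r + 1) % (ny * nz)]
        dy = min((y0 - y1) % ny, (y1 - y0) % ny)
        dz = min((z0 - z1) % nz, (z1 - z0) % nz)
        if dy + dz != 1:
            return False
    return True


if __name__ == "__main__":
    print(is_nearest_neighbor_ring(ny=8, nz=2))
    print(is_nearest_neighbor_ring(ny=4, nz=3))
```

For an odd nz the last rank ends at the far end of the y dimension, so the ring
cannot be closed with a single hop; this is why even dimensions are required.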
The QPACE hardware allows different physical configurations of the links, as
described in section 3.1.1. Assuming the maximum configuration of one rack,
the torus sizes for the default cabling are n1 × n2 × n3 = 8 × 16 × (2 · #Racks).
With a logical torus nx = n2, ny = n1, nz = n3, a square grid of
dimension 16 × 16 can be generated with 1 rack. The largest grid, 16 × 128, can
be generated with 8 racks. Because of the asymmetry of the three dimensions,
this is the most symmetric 2D mapping which can be generated for 8 racks with
the default cabling. With an alternative cabling (see section 3.1.1), a grid
size of 32 × 64 could be used with 8 racks.
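The grid sizes quoted above follow directly from the ring lengths nx and
ny · nz. The sketch below reproduces them; the alternative cabling is taken as
8 × 32 × #Racks as described in section 3.1.1, and the function names are
illustrative.

```python
def default_cabling(racks):
    """Physical torus n1 x n2 x n3 for fully populated racks."""
    return 8, 16, 2 * racks


def alternative_cabling(racks):
    """Physical torus when front and back of each rack form one long ring."""
    return 8, 32, racks


def process_grid(n1, n2, n3):
    """P x Q for the logical torus nx = n2, ny = n1, nz = n3."""
    nx, ny, nz = n2, n1, n3
    return nx, ny * nz


if __name__ == "__main__":
    for racks in (1, 2, 8):
        print(racks, process_grid(*default_cabling(racks)))
    print(8, process_grid(*alternative_cabling(8)))
```

The ratio Q/P grows linearly with the number of racks for the default cabling,
which is why large installations lead to strongly rectangular grids.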
Another possible approach is to implement only one of the communicators
in the torus, and to use Gigabit Ethernet for the other one. This relaxes the
mapping constraints. For example, the column communicator, which is latency
sensitive in the panel factorization, can be implemented in a 2 × 4 × 4 cube.
The row communicator could use Gigabit Ethernet. For large process numbers, 3D
cubes have lower latencies than 2D tori because of their shorter path lengths.
The relaxed neighborhood constraint allows more square process mappings.
However, the PPE-limited bandwidth of the Gigabit Ethernet device might be a
severe bottleneck.
2.6 Profiling
To find bottlenecks and confirm assumptions about running times of the different
sections, the HPL source code has been instrumented. Every MPI call and every
BLAS routine was wrapped into a respective wrapper routine which measures
the execution time using the high resolution timers of the PPE. The parameters
passed to the functions were evaluated, which makes it possible to calculate the
FLOP count for each BLAS routine and the message sizes of MPI calls. The source
code is divided into the different sections as described above, and the calls
are attributed to their respective section. Based on these evaluations we can
compile aggregate statistics for the routines and see if they perform as expected.
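The wrapper idea can be sketched in a few lines. The example is an analogy in
Python rather than the C instrumentation used for HPL: the Profiler class, the
plain triple-loop dgemm and the section name are all hypothetical, and
time.perf_counter stands in for the PPE timers. The FLOP count of 2 · m · n · k
for a matrix multiplication is the usual convention.

```python
import time
from collections import defaultdict


class Profiler:
    """Accumulates call counts, time and FLOPs per (section, routine)."""

    def __init__(self):
        self.section = "init"
        self.stats = defaultdict(
            lambda: {"calls": 0, "seconds": 0.0, "flops": 0})

    def wrap(self, name, func, flop_count=None):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            entry = self.stats[(self.section, name)]
            entry["calls"] += 1
            entry["seconds"] += elapsed
            if flop_count is not None:
                entry["flops"] += flop_count(*args, **kwargs)
            return result
        return wrapper


def dgemm(m, n, k, a, b, c):
    """Reference C += A * B on lists of rows (not an optimized kernel)."""
    for i in range(m):
        for j in range(n):
            s = 0.0
            for l in range(k):
                s += a[i][l] * b[l][j]
            c[i][j] += s


def dgemm_flops(m, n, k, a, b, c):
    return 2 * m * n * k
```

Setting prof.section before entering, for example, the trailing matrix update
attributes all wrapped calls to that section, so GFLOP/s figures can be derived
per section from the accumulated flops and seconds.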
Table 2.4 gives an example of a profiling output. The data has been gathered
on 2 QS22 blades with the modified and instrumented QS22 version of HPL.
The benchmark was run on a problem size N = 30847 with a grid P×Q = 2×2,
with 2 processes per node.
We can see that the dgemm and dtrsm operations in the trailing matrix
update section, which is the most time consuming one, perform nearly at
the peak performance of the 8 SPEs [44], which is 102.4 GFLOP/s. This is due
to the fact that mostly big data sets are used in this section. On the other
hand, the same operations in the panel factorization section perform much
slower because mostly small or thin matrices are used in this section.
2.7 Communication Sizes and Network Requirements
Based on the profiling results and analysis of the source code it is possible to
estimate the sizes of the messages and call counts of the individual routines.
Table 2.5 shows an overview of the message sizes and calls depending on the
main parameters. The other parameters are set according to the sample
configuration file in Figure 2.1. The message sizes for the collective operations
are given as the combined size of all parts of the message, e.g. for a scatter
the original vector (and not its parts) is displayed. A factor for the message
sizes accounts for the part of the message size sent or received. For example,
for MPI_Allgatherv it is assumed that equally sized parts of size m/p are sent
from each node, and the whole message m is received. Further communication
needed to forward or accumulate the data is not considered. Note that the call
numbers and message sizes are rough estimates which may depend on the position
in the grid, the random input data and other algorithmic details. The formulas
have been derived from the results of benchmark runs on many different small
systems while varying the N, NB, P, and Q parameters.
Section Operation Calls Message size (double) Factor
Panel MPI_Bcast NQ NB 1
Table 2.5: General message sizes and call counts
2.7.1 Example Setup
From previous experience with HPL and the HPL FAQ [45] we know that a P:Q
ratio of 1:k with small k (say 2) reaches the best performance on
conventional networks. With the dimension-based process mapping (see section
2.5) and the various torus configurations (see section 3.1.1), the largest grid
with a good P:Q ratio would be a 2-rack configuration with P × Q = 16 × 32 for
the default cabling. More racks would allow larger grids with a larger Q value,
but a more asymmetric grid. We therefore choose 2 racks for the best per-node
performance on a big system. These 2 racks would have a peak performance
of 16 · 32 · 108.8 GFLOP/s = 55.7 TFLOP/s. Assuming an efficiency of 70%,
which is a reasonable ratio from the experience with QS22 systems, a LINPACK
performance of 0.7 · 55.7 TFLOP/s = 39.0 TFLOP/s can be expected for this
system.
The block size NB is limited to 128 by the SPE specialists. The problem
size depends on the available main memory. Assuming that 240 hugepages of
16 MiB are available, each node can hold a part of m = 3840 MiB of the matrix.
The maximum problem size can then be calculated:
\[
N = \sqrt{\frac{3840 \cdot 2^{20}\,\mathrm{Byte} \cdot 16 \cdot 32}{8\,\mathrm{Byte}}} \approx 507640
\]
Aligning this value to the next lower multiple of 128 minus 1 gives a final
value of N = 507519. Furthermore we set the broadcast algorithm to 1ring,
which is the only one usable for nearest neighbor communication in a ring. We
use MPI collectives wherever possible, assuming the message passing subsystem
executes them optimally on the given communicator. Other parameters affect
local computations, which can be adopted from the single node configuration
(see Figure 2.1).
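The numbers of this example setup can be reproduced with a few lines. As in the
text, the sketch ignores the extra column for the vector b when estimating N;
the function names and default values are chosen for this illustration.

```python
import math

MIB = 2 ** 20


def max_problem_size(hugepages, page_bytes, p, q, nb=128, elem_bytes=8):
    """Return (unaligned N, N aligned to the form nb*k - 1)."""
    total_elems = hugepages * page_bytes * p * q // elem_bytes
    n_raw = math.isqrt(total_elems)        # largest N with N*N <= elements
    n_aligned = (n_raw + 1) // nb * nb - 1
    return n_raw, n_aligned


def peak_tflops(p, q, node_gflops=108.8):
    """Aggregate double precision peak of a P x Q grid of node cards."""
    return p * q * node_gflops / 1000.0


if __name__ == "__main__":
    print(max_problem_size(240, 16 * MIB, 16, 32))
    print(peak_tflops(16, 32), 0.7 * peak_tflops(16, 32))
```

The integer square root gives 507639, which rounds to the value of about
507640 stated above, and the alignment step yields N = 507519 = 128 · 3965 − 1.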
2.7.2 Message Passing and Network Requirements




Section                  Operation       Calls      Message size    Transfer
Trailing Matrix Update   MPI_Allgather   4089       [1, 15860]      33628.163
                         MPI_Scatter     4089       [1, 15860]      7916.234
Panel Factorization      bcast           15860      1               15.488
                         allreduce       15860      1               30.976
Backward Substitution    send/recv       ≤ 495      31              15.004
Panel Broadcast          send/recv       3841       [128, 31720]    118983.404
Total Transfer                           ≤ 44234                    160589.269
Table 2.6: Message sizes and call counts for example setup
From the formulas for the message sizes given in section 2.7, we can extrapolate
the expected message sizes for the example setup, which are given in Table 2.6.
On the one hand, we have large messages in the range of multiple MiB like the
panel broadcast, where the run time is mostly bound by the available bandwidth.
By changing the grid dimensions, the trailing matrix update messages or the
broadcast messages will change their respective message sizes and weight in the
overall benchmark, but it is not possible to make all messages small. On the
other hand, we have many small messages of size 1 KiB in the panel factorization
step, which have a latency-bound run time. These message sizes are independent
of the grid and are only determined by the fixed block size. Therefore the
network and the message passing system must handle both long and short messages
efficiently.
In our example, approximately 160.6 GiB are transferred in 44234 operations.
Assuming that all communication operations have a linear running time
t_op = α_op + m · β_op, where m is the message size in bytes, β_op is the
transfer time per byte (the inverse of the bandwidth) and α_op is the latency
and call overhead, we can estimate the total running time for the
communication. Assuming an optimistic average bandwidth of 1/β = 1 GB/s
and an average operation latency of α = 100 µs, we obtain:
\[
\begin{aligned}
t_{total} &= \sum_{op} (\alpha_{op} + m_{op} \cdot \beta_{op}) \\
          &= 44234 \cdot \alpha + m_{total} \cdot \beta \\
          &= 4.42\,\mathrm{s} + 160.5\,\mathrm{s} \\
          &= 164.92\,\mathrm{s}
\end{aligned}
\]
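The linear model can be evaluated with a short sketch, which also makes the
bandwidth versus latency trade-off visible. The total volume of 160.5 · 10^9
bytes is an assumption chosen so that the bandwidth term matches the 160.5 s
used above; the function name is illustrative.

```python
def comm_time(calls, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Return (latency part, bandwidth part) of the linear cost model."""
    latency_part = calls * latency_s
    bandwidth_part = total_bytes / bandwidth_bytes_per_s
    return latency_part, bandwidth_part


if __name__ == "__main__":
    calls = 44234
    volume = 160.5e9          # assumed total volume in bytes
    base = comm_time(calls, volume, 100e-6, 1e9)
    half_latency = comm_time(calls, volume, 50e-6, 1e9)
    double_bandwidth = comm_time(calls, volume, 100e-6, 2e9)
    for name, parts in (("baseline", base),
                        ("half latency", half_latency),
                        ("double bandwidth", double_bandwidth)):
        print(name, round(sum(parts), 2))
```

Under these assumptions, halving the latency saves about 2.2 s, whereas
doubling the bandwidth saves about 80 s of the estimated communication time.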
This approximation is a lower bound for the communication time, because
more time may be consumed by protocols or synchronization effects, e.g.
waiting for messages to arrive. We can see in the calculation that the share of
time attributed to the bandwidth is much higher than the time for the latency.
If a tradeoff had to be made between bandwidth and latency, a bandwidth
increase would improve the communication time a lot while an increased





In this section, the QPACE architecture and the possibilities of communication
on the QPACE torus are explored. Section 3.1 gives an introduction to
the QPACE architecture and the physical interconnection of the torus network.
Section 3.2 analyzes the low level access to the torus network. Alternatives to
open the QCD-specialized torus network for a wider range of applications are
proposed, and limitations with a special focus on PPE access are discussed. A
model for the QPACE torus network is introduced in section 3.3, which allows
communication algorithms to be formulated. Example algorithms optimized for the
QPACE architecture based on this model are presented in section 3.4.
3.1 Introduction to the QPACE Architecture
Figure 3.1: QPACE node card schematic diagram
CHAPTER 3. QPACE ARCHITECTURE AND TORUS
COMMUNICATION
The QPACE machine employs custom node cards as main computing components.
Each QPACE node card (see Figure 3.1) contains an IBM PowerXCell 8i
processor clocked at 3.2 GHz with a maximum IEEE-compliant double
precision peak performance of 108.8 GFLOP/s [44] for computing, and a
Xilinx® Virtex®-5 FPGA for I/O and communication, in which the Network
Processor (NWP) is implemented. The PowerXCell 8i processor and the Virtex-5
FPGA are coupled with 2 FlexIO™ links with an aggregate bandwidth of 6
GiB/s using RocketIO transceivers. The node cards are interconnected with
a custom three-dimensional torus network with nearest neighbor connections.
Unlike other Cell/B.E.-based parallel machines, it is possible to transmit data
directly from the Local Store of one SPE to the Local Store of an SPE on a
neighbor node. The design goal is to reach a latency of 1 µs for SPE-to-SPE
transfers and a bandwidth of 1 GiB/s. The node cards in the machine are also
interconnected with Gigabit Ethernet and a global signal tree network.
The NWP address space is mapped into the physical address space of the
PowerXCell 8i processor. Various controllers handle the memory transfers
between the PowerXCell 8i processor and the NWP: The DCR (Device Control
Register) bus is used to control the low-speed devices, provides status
registers and is used by the torus network to provide credits. 32 bit reads and
writes are supported. The IWC (Inbound Write Controller) is used to send the
packets for the torus network or Gigabit Ethernet. It supports DMA write
transfers with packet sizes of 128 Bytes from the Cell/B.E. SPE MFC, which
allows very high speed. The OWC (Outbound Write Controller) writes the received
packets from the torus network or Gigabit Ethernet into the main memory or
the SPE Local Stores.
3.1.1 QPACE Rack Configuration
Figure 3.2: Overview of the QPACE architecture
The QPACE node cards as described above are plugged into a custom QPACE
backplane which provides power and network access. Each backplane can host
up to 32 node cards and 2 root cards. One root card manages 16 node cards:
It controls the boot up, provides the global tree network, generates and/or
distributes the clock signal, and monitors status signals of the node cards at
run time.
One dimension of the torus network is completely routed within the backplane,
providing an 8 × 4 × 1 torus. Each group of 8 node cards together provides
the first dimension (red links). Using redundant links, the network can also
be configured as multiple smaller tori with sizes {1, 2, 4, 8} × {1, 2, 4} × 1.
The possible configurations are illustrated in Figure 3.3.
Figure 3.3: Torus configuration alternatives within one backplane
One QPACE rack hosts up to 8 backplanes, 4 in the front and 4 in the back.
The 4 backplanes at one side are physically interconnected by cables to provide
the second dimension (green links), along with the links embedded in the
backplane. Redundant links allow software reconfiguration for the green links.
One rack can therefore offer partitions of size {1, 2, 4, 8} × {1, 2, 4, 8, 12, 16} ×
{1, 2}.
Multiple QPACE racks finally form the complete QPACE system. The third
dimension (blue links) is used to connect the different racks and the front and
back sides within each rack. The configuration of the third dimension can only
be changed by recabling. The final installed QPACE system with n racks provides
torus sizes {1, 2, 4, 8} × {1, 2, 4, 8, 12, 16} × {1, 2 · n}. The first 2 systems
which will be delivered contain 4 racks, and 8 racks would be possible with the
current configuration.
Other torus sizes are possible if other (longer) cables can be used: The front
and the back sides of one rack could be connected at the top and the bottom
using the green links to form a long ring of 32 nodes in the second dimension.
The third dimension (blue links) would be decreased accordingly and would only
form a ring of size n instead of 2 · n along the front planes or the back planes.
The maximum size of this configuration is 8 × 32 × n, which is more asymmetric
in 3 dimensions, but allows very symmetric 2D torus mappings: With the mapping
algorithm described in section 2.5 and n = 4 racks, a square 2D torus with the
dimension 32 × 32 can be embedded. An application like HPL which benefits
from square 2D tori can then achieve better performance. An illustration of the
two cabling alternatives is presented in Figure 3.4.
Figure 3.4: Standard (top) and alternative cabling (bottom) of 4 racks
3.2 Torus Network Hardware
Because the network is a three-dimensional torus, each NWP has six links to its
neighbors, two for each dimension (X+, X-, Y+, Y-, Z+, Z-). To allow all 8 SPEs
to communicate independently, there are 8 virtual channels for each link.
Packets have a size of 128 Bytes, and up to 8 KiB (according to the
configuration at the time of writing) can be sent in one step. The maximum
message size is limited by the provided window in the mapped address space and
by the buffer size of the transmit FIFO in the NWP. Given a larger window,
longer messages could be sent, but would possibly block the interface until the
message is sent completely. This behavior could lead to deadlocks (see section
3.2.1). Messages are transmitted reliably and in order to the nearest neighbor,
practically offering an interface with channel semantics. The interface between
the PowerXCell 8i processor and the NWP, however, does not transfer the 128 Byte
packets in order, for performance reasons.
It is not possible to send to other nodes than the nearest neighbors, as no
hardware routing for messages is implemented. Data and credits are transmit-
ted by writing to addresses of the NWP address space which depend on the
peer link and channel. By using this implicit addressing, no headers or control
information have to be sent from SPE to SPE.
Data is sent to the NWP by performing a DMA transfer into a memory
window of the NWP address space. This can be accomplished by using the
SPE's Memory Flow Controller (MFC) to execute a simple DMA PUT to the
address specified by link and channel of the target peer. On the receiver side, a
credit has to be granted for this data to allow the transfer to the target NWP. A
credit is a pair of an address offset and a length (up to 512 KiB in multiples of
128 Bytes), allowing the NWP to write the data asynchronously to base + offset.
The base usually points to the Local Store of the granting SPE. The offset can
currently address memory within a window of 1 MiB with an alignment of 16
Bytes, but this may be changed to a window of 8 MiB and an alignment of 128
Bytes. After the transfer, the NWP writes to the notify location in the Local
Store of the SPE to signal that the transfer has finished. Up to 16 pending
credits per link and channel may be active at one time. Note that no interrupt
mechanism is used to signal the completion of transfers, and all transfers happen
concurrently with the code flow on the SPE. The basic communication primitives
are illustrated in Figure 3.5.
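The limits on a credit quoted above can be collected in a small check. The following Python sketch is only an illustration: the function name and the constants are hypothetical and not part of any QPACE interface, and it assumes that the whole target range has to stay inside the current 1 MiB window.

```python
PACKET = 128                  # packet size in bytes
MAX_CREDIT_LEN = 512 * 1024   # maximum length of one credit
WINDOW = 1024 * 1024          # current offset window (1 MiB)
OFFSET_ALIGN = 16             # current offset alignment in bytes


def credit_is_valid(offset, length):
    """Return True if (offset, length) respects the limits quoted in the text."""
    if length <= 0 or length > MAX_CREDIT_LEN or length % PACKET != 0:
        return False
    if offset < 0 or offset % OFFSET_ALIGN != 0:
        return False
    # Assumption: the complete target range must lie inside the window.
    return offset + length <= WINDOW


print(credit_is_valid(16, 512 * 1024))   # aligned offset, maximum length
print(credit_is_valid(0, 100))           # length is not a multiple of 128
```

If the window and alignment are changed to 8 MiB and 128 Bytes as mentioned above, only the two corresponding constants would have to be adapted.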
Figure 3.5: Communication Primitives
There is 8 KiB of packet buffer space at the sender side (TX FIFO) and at
least 8 KiB of packet buffer space at the receiver side on the NWP for each link.
These buffers are shared among the virtual channels of each link.
3.2.1 Low-Level Communication Rules
Because the communication is reliable but the buffer space on the NWP is
limited, communication rules have to be followed to avoid back pressure and
deadlocks. In the worst case, pending DMA PUTs to the NWP may end up
locking the interface to the GBIF (Global Bus Infrastructure) when no buffer
space is available and the remote side has not (yet) issued credits.
The rules defined for the QCD applications are:
1. The order of send and receive operations must match for each pair of nodes
connected by a link and channel.
2. Credits should be provided before sending the data to the NWP.
Rule 1 can be followed for arbitrary communication types by employing proto-
cols. While rule 2 is acceptable for synchronous applications like QCD where
all communication operations match, it is challenging to apply this rule to gen-
eral asynchronous message passing as used in MPI. In these applications, the
assumption is made that packets can always be inserted into the network, or it
can be checked for congestion without blocking the interface. An MPI subsys-
tem can postpone the transfer of a message if the network is congested.
An alternative to rule 2 is to monitor the filling level of the NWP transmit
FIFO (TX FIFO) and to send new packets only if the FIFO is almost empty. For
considerations on this filling level see section 3.2.6.
When applications like HPL only employ one process per PowerXCell 8i
processor, only one virtual channel is necessary. Therefore no synchronization
between multiple SPEs is needed to monitor the filling level. Applications which
use multiple virtual channels need to synchronize the filling level before sending
data, as the TX FIFO is shared among the virtual channels. This overhead can
make the proposed alternative expensive to use for these applications.
3.2.2 SPE Access
The QPACE torus hardware is designed with SPE access in mind. The SPE can
send by programming the MFC for a DMA PUT with channel instructions. The
filling level of the FIFO can be checked by reading a DCR register, which can
be done with a 32-bit DMA GET using the MFC. Receiving is done by writing
a 32 bit credit into a DCR register of the NWP. The completion of the DMA
PUT can be verified by polling or waiting on the channels, and the completion
of the Receive operation can be checked by polling the configured notification
area in the Local Store.
3.2.3 PPE Access
The PPE alone cannot write messages directly to the NWP because the PPE
cannot write 128 Byte packets, which is the only packet size the NWP accepts for
messages (larger data blocks are internally fragmented into 128 Byte packets).
A possible alternative is to use the SPE's Proxy Command Queue. The SPE
offers its MFC service to other devices through the MMIO register interface
in the problem state area [46]. This proxy interface provides a separate queue
and is not shared with the SPE's local command queue. The provided MMIO
register interface uses the same semantics as the SPE channel interface, with
the limitation that no DMA list transfers are possible and only 8 transfers can
be queued up at one time (the local limit is 16 on the SPEs). PUT and GET
commands to transfer data to and from the SPE's Local Store with Fence and
Barrier modifiers are supported. The completion of the transfers can be checked
with an MMIO register in the MFC command area.
The NWP can thus also be used from the PPE, provided that an SPE's Local
Store and MFC may be used. The DCR registers for checking the filling level and
providing credits are accessible by simple read and write instructions to the
NWP address space. The receive operation can be executed by writing the
credit directly to the DCR register in the NWP and polling the notification
area of the Local Store until the transfer is complete.
An alternative is to use the MFC to provide the credits or generally access the
DCR. This is cheaper in terms of used cycles on the PPE (one MFC instruction
and one LS read), but the overall latency for MFC assisted DCR reads is higher:
the MFC access to DCR has a similar latency as from the PPE, but the MFC
instruction and LS read add more latency until the data is available on the PPE.
3.2.4 Proof of Concept: Memcpy with MFC
For a proof of concept of the MFC access and performance evaluation, a simple
memcpy routine has been implemented. It uses double buffering on the SPE
and interleaved PUT and GET instructions to copy the data to the Local Store
and back to its new position. Barriers in the GET instructions are used to
guarantee the correct order of the message chunks. The benchmark results
in Figure 3.6 illustrate and compare the results with the same access pattern
implemented on the SPE, the PPE glibc memcpy version and a hand-optimized
PPE replacement. The optimized PPE memcpy performs cache prefetching and
writes complete cache lines without loading the destination buffer into the cache,
which allows significant speedup. The buffer sizes used on the Local Store (and
consequently the data size per DMA transfer) were 2 KiB, 4 KiB and 16 KiB.
The transferred data was aligned at 128 Byte boundaries in the main memory,
[Plot: memcpy performance on QS22 node, bind cpu=0, mem=0. Series: PPE memcpy
with MFC (2k, 4k, 16k); SPE memcpy with MFC (2k, 4k, 16k); PPE, GLIBC memcpy;
PPE, hand-optimized assembly memcpy]
Figure 3.6: Comparison of memcpy performance
From the performance measurements we can see that the PPE can sustain
the bandwidth at least for large DMA transfer sizes. The sustainable bandwidth
is limited by the memory bandwidth on one hand and by the latency to instruct
the MFC from the PPE on the other hand. For the m =2 KiB buffer size case,
this effect is visible in the experiment results. An upper bound to the bandwidth
can be given by considering the latencies (for measurements see section 3.2.5):
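$$\text{Bandwidth} \le \frac{m \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{t_{MFC}} = \frac{2048\,\text{Byte} \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{795\,\text{cycles}} = 7.677\,\text{GiB/s}$$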
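As a quick cross-check, the bound can be evaluated numerically. The following Python sketch uses the clock rate and the PPE latency tMFC = 795 cycles from the text; the function name is chosen here for illustration only.

```python
CLOCK_HZ = 3.2e9   # clock rate of the PowerXCell 8i in cycles per second
GIB = 2 ** 30      # bytes per GiB


def mfc_issue_bound(chunk_bytes, t_mfc_cycles):
    """Upper bound in GiB/s when each chunk costs one MFC instruction."""
    return chunk_bytes * CLOCK_HZ / t_mfc_cycles / GIB


print(f"{mfc_issue_bound(2048, 795):.3f} GiB/s")   # prints 7.677 GiB/s
```

Because the bound grows linearly with the chunk size m, the same latency is far less restrictive for the 4 KiB and 16 KiB buffers.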
3.2.5 Access Time Benchmarks
Operation                Target               From PPE (cycles)   From SPE (cycles)
Read 32 bit              NWP DCR              tDR = 2775.960      2767.764
Read 32 bit, 6 times     NWP DCR              16640.508           4149.420
Write 32 bit             NWP DCR              tDW = 2585.580      2587.884
Write 32 bit, 6 times    NWP DCR              15584.484           15560.088
Instruct MFC PUT/GET     SPE MMIO/Channels    tMFC = 795          7.5
Read MFC Queue Status    SPE MMIO/Channels    tQ = 232            54
Read 32 bit              SPE Local Store      tLR = 227           7
Write 32 bit             SPE Local Store      tLW = 127           12
Table 3.1: Memory Access times
In Table 3.1, the results of the memory benchmarks are listed. Each opera-
tion is executed multiple times (100000 for DCR reads/writes, and 8 or 16 for
MFC instructions, which is the limit of the queue) and measured with the high
resolution clocks of the respective core, which was available with a resolution
of 120 clock cycles (timebase of 26.666 MHz at a clockrate of 3.2 GHz) on the
QPACE node card. Loop overhead was not removed. Each operation is started
and then waited on until completion for the timing measurement. This approach
works for the measurements listed in the table, as the operations are
acknowledged or answered by their respective receivers. The IWC latencies however
cannot be measured with this method, as not the IWC but the I/O controller
acknowledges the requests. The OWC latencies cannot be measured either,
because not the PowerXCell 8i processor but the NWP initiates these memory
requests.
On the SPE, all MFC transfers can be executed in an overlapping fashion,
and the initiation or polling can always be done in <60 cycles; therefore no
bandwidth impact due to high latencies is to be expected.
This is unlike the PPE, where DCR access or MFC instructions are very
expensive and can not be overlapped, thus the PPE will wait until one operation
is completed. The impact of this fact is analyzed in section 3.2.8.
3.2.6 TX FIFO Filling Level Considerations
Figure 3.7: Sending packets to the NWP
To prevent a buffer overrun of the TX FIFO, it is possible to first check whether
buffer space is available and then send data accordingly. However, if we wait
until the buffer is completely empty and send data only after this event, the
link will be idle from the empty signal until the next data arrives. To prevent
this idle time, we can use a signal which indicates that the buffer is almost
empty. This signal should be triggered when the buffer is nearly empty, but
enough data should remain in the buffer to keep the link busy until the next
data arrives. This mechanism prevents idle time of the NWP link and avoids
overflow of the send buffers, and should allow the maximum sustainable
bandwidth to be reached.
To calculate the optimal threshold for this signal, we consider three points in
time from the NWP view:
• t0: the almost empty level is reached and the signal is set to 1.
• t1: the DCR access request of the PPE arrives and the almost empty bit is
returned as 1.
• t2: the first data from a DMA PUT arrives.
The TX FIFO should keep at least as much data as can be sent in the
maximum possible time span t2 − t0 to prevent NWP idle time. We further
assume the worst case for the event t0: the almost empty signal is raised
immediately after the last PPE status request was returned as not almost
empty. Under optimal circumstances,
$$t_1 - t_0 \le t_{DR}$$
holds when the sending core is constantly polling the NWP. The time until the
DMA data arrives at the NWP can be split into the following components:
$$t_2 - t_1 = t_{DR\text{-}back} + t_{MFC} + t_{DMA\text{-}startup} + t_{IW\text{-}forward}$$
The DCR and IWC access times have been split into the forward direction
from the Cell/B.E. core to the NWP and the backward direction from NWP to
the Cell/B.E. core. As we can only measure round trip timings, we assume
$$t_{DR} := t_{DR\text{-}forward} + t_{DR\text{-}back}$$
$$t_{DW} := t_{DW\text{-}forward} + t_{DW\text{-}back}$$









As mentioned before, the IWC access times cannot be measured, but it is
safe to assume that the IWC access time is less than or equal to the DCR access
time. The time tDMA−startup cannot be measured at all, but we can assume that
the MFC request is dispatched fast, so we set
$$t_{DMA\text{-}startup} := 0.$$
The specific times depend on whether the PPE or the SPE issues the request:
$$t_2 - t_0 = \left(1 + \tfrac{1}{2}\right) \cdot t_{DR} + t_{MFC} + t_{DMA\text{-}startup} + t_{IW\text{-}forward} = \begin{cases} 6563\ \text{cycles} & \text{on the PPE} \\ 5781\ \text{cycles} & \text{on the SPE} \end{cases}$$
If the assumed link bandwidth of 1 GiB/s is reached, the minimum filling
level of the TX FIFO is
$$\frac{(t_2 - t_0) \cdot 2^{30}\,\frac{\text{bytes}}{\text{s}}}{3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}} = \begin{cases} 2202\ \text{bytes} & \text{for PPE access} \\ 1940\ \text{bytes} & \text{for SPE access} \end{cases}$$
For very tight timing and SPE-only access, 2 KiB may be enough. For a
more conservative timing and general access, an almost empty limit of 3 KiB is
more advisable.
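The conversion from the time span t2 − t0 to a filling level can be reproduced with a few lines of Python. This is a minimal sketch: the cycle counts are the ones given above, while the function names and the rounding up to whole 128 Byte packets are assumptions made here for illustration.

```python
import math

CLOCK_HZ = 3.2e9               # cycles per second
LINK_BYTES_PER_S = 2 ** 30     # assumed link bandwidth of 1 GiB/s
PACKET = 128                   # packet size in bytes


def min_fill_bytes(gap_cycles):
    """Bytes the link drains during gap_cycles at full link bandwidth."""
    return gap_cycles / CLOCK_HZ * LINK_BYTES_PER_S


def min_fill_packets(gap_cycles):
    """Same level, rounded up to whole packets (an assumption of this sketch)."""
    return math.ceil(min_fill_bytes(gap_cycles) / PACKET)


for core, cycles in (("PPE", 6563), ("SPE", 5781)):
    print(core, round(min_fill_bytes(cycles)), "bytes,",
          min_fill_packets(cycles), "packets")
```

Rounded up to whole packets, the SPE value corresponds to 16 packets (2 KiB) and the PPE value to 18 packets, which fits the suggested limits of 2 KiB and 3 KiB.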
3.2.7 PPE Overhead Times
From the latency times we can calculate the send and receive overhead times,
by considering a single short message which is stored in memory. On the sender
side, the steps are:
1. DMA GET from Main Memory to the Local Store (with Barrier)
2. check via DCR read if the TX FIFO link is empty
3. DMA PUT the message into the TX FIFO
We can assume that the message dispatched in step 1 is transferred to Local
Store after step 2 because the DCR read takes sufficiently long.
The time from the beginning of step 1 until the data arrives at the network
processor is therefore:
$$t_{send\text{-}ovhd} = t_{MFC} + t_{DR} + t_{MFC} + t_{IW\text{-}forward} = 5894.61\,\text{cycles} \mathrel{\widehat{=}} 1.842\,\mu\text{s}$$
Assuming that the credit has been granted before the data arrives, the re-
maining steps for the receiver are:
1. Check via a Local Store read if the data arrived
2. DMA PUT the message into the Main Memory
3. Check Queue Status of the MFC
Assuming the OWC access time is similar to the DCR access time, we set
tOR = tDR. The receive overhead can then be expressed as:
$$t_{recv\text{-}ovhd} = t_{OR\text{-}back} + t_{LR} + t_{MFC} + t_{Q} = 2717.58\,\text{cycles} \mathrel{\widehat{=}} 0.85\,\mu\text{s}$$
The minimum transfer overhead for Main Memory to Main Memory without
the network latency (NWP to NWP) is tsend−ovhd + trecv−ovhd = 2.69µs.
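The conversion of the two overheads from cycles to microseconds can be checked as follows. The sketch only uses the cycle counts stated above and the 3.2 GHz clock rate; the helper name is hypothetical.

```python
CLOCK_HZ = 3.2e9   # cycles per second


def cycles_to_us(cycles):
    """Convert a cycle count at 3.2 GHz into microseconds."""
    return cycles / CLOCK_HZ * 1e6


send_us = cycles_to_us(5894.61)
recv_us = cycles_to_us(2717.58)
print(f"send {send_us:.3f} us, receive {recv_us:.2f} us, "
      f"total {send_us + recv_us:.2f} us")
# prints: send 1.842 us, receive 0.85 us, total 2.69 us
```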
3.2.8 Latency Limitations on the PPE
Compared to the SPE, issuing a transfer on the PPE takes a very long time
because the MFC is not integrated in the PPE, and the transfers to the MFC or
the NWP cannot be overlapped. The PPE stalls until its request is complete.
For the following examples we consider messages from main memory and an
optimistic almost empty limit of 2 KiB. The messages are sent in chunks of
6 KiB (the 8 KiB TX FIFO minus the 2 KiB almost empty level), because this is
the limit which can be safely sent to the NWP without risking a buffer overrun.
As we will see, the latencies alone can prevent the PPE from requesting the
transfers fast enough to sustain the links, so that the latencies become
a bottleneck. For the following examples we will only consider the latencies to
issue the commands, without regarding dependencies, stall times or bus
contention. MFC assisted DCR reads are used to minimize the PPE latency. The
achievable bandwidths can therefore be seen as upper bounds:
Sending/Receiving long messages in one direction
We consider sending and receiving a long message. It is assumed that double
buffering is used, therefore 2 ·6KiB = 12KiB of buffer space in the Local Store
has to be provided. On the sender side, for each 6 KiB chunk we have to:
• 1 DMA GET of 6 KiB from main memory to Local Store into the first
buffer
• 1 MFC assisted DCR read to check if the TX FIFO of the link is empty
• 1 DMA PUT of 6 KiB data from the other buffer
The sustainable send bandwidth is therefore bounded by:
$$\frac{6144\,\text{Byte} \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{3 \cdot t_{MFC} + t_{LR}} = 7.010\,\text{GiB/s}$$
On the receiver side, we have to issue:
• 1 MFC assisted DCR write for the credit of the first buffer
• 1 Local Store read to check if the previous credit has been received suc-
cessfully (in best case).
• 1 DMA PUT of 6 KiB to transfer the other Local Store buffer into Main
Memory
The upper bound for the receive bandwidth is therefore:
$$\frac{6144\,\text{Byte} \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{t_{LR} + 2 \cdot t_{MFC}} = 10.077\,\text{GiB/s}$$
We can see that on both sides the PPE is fast enough to issue the commands
to sustain the link which is limited to 1 GiB/s.
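Both bounds follow the same pattern: the chunk size divided by the time the PPE spends issuing commands. A minimal Python sketch, using the PPE latencies tMFC = 795 and tLR = 227 cycles from Table 3.1 and a hypothetical helper name, reproduces them:

```python
CLOCK_HZ = 3.2e9   # cycles per second
GIB = 2 ** 30      # bytes per GiB
T_MFC = 795        # cycles, PPE: instruct MFC PUT/GET
T_LR = 227         # cycles, PPE: read 32 bit from an SPE Local Store
CHUNK = 6 * 1024   # bytes per chunk


def issue_bound(chunk_bytes, n_mfc, n_lr):
    """Bandwidth bound in GiB/s from n_mfc MFC instructions and n_lr LS reads."""
    cycles = n_mfc * T_MFC + n_lr * T_LR
    return chunk_bytes * CLOCK_HZ / cycles / GIB


# Sender: GET + MFC assisted DCR read (MFC instruction and LS read) + PUT.
send = issue_bound(CHUNK, n_mfc=3, n_lr=1)
# Receiver: MFC assisted DCR write + LS read + PUT.
recv = issue_bound(CHUNK, n_mfc=2, n_lr=1)
print(f"{send:.3f} {recv:.3f}")   # prints 7.010 10.077
```

An MFC assisted DCR read is counted as one MFC instruction plus one Local Store read, as described in section 3.2.3.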
Broadcast example
Consider a double buffered broadcast, where the example node broadcasts a
message from main memory to all 6 links. Again, 12 KiB of buffer space in the
Local Store has to be provided. For each 6 KiB part of data, we have to do:
• 1 DMA GET of 6 KiB from Main Memory to Local Store into the first
buffer
• 6 MFC assisted DCR reads to check whether each link's TX FIFO is empty (we
assume that the checks always return positive)
• 6 DMA PUTs of 6 KiB data from the other buffer to the NWP, one for
each link
Considering only the latencies to issue the commands (and not yet the packet
transfer itself), the sustainable bandwidth on each link for the broadcasting
node is limited to:
$$\frac{6144\,\text{Byte} \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{6 \cdot t_{LR} + (1 + 6 + 6) \cdot t_{MFC}} = 1.565\,\text{GiB/s}$$
This is enough to sustain the links. However, it is possible that the MFC
assisted DCR read cannot be used in the scheduling because of the time that
passes between reading the filling level and the actual transmission of the
data. If only direct DCR access can be employed, the sustainable bandwidth
bound per link drops to:
$$\frac{6144\,\text{Byte} \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{6 \cdot t_{DR} + (1 + 6) \cdot t_{MFC}} = 0.791\,\text{GiB/s}$$
which is less than the peak bandwidth of 1 GiB/s.
All-to-all example
The worst case is the all-to-all pattern, where the node sends to and receives
from all links simultaneously. Again assume a chunk size of 6 KiB with double
buffering. The buffer space needed on the Local Store is 2 · 2 · 6 · 6 KiB = 144 KiB
(two directions, two buffers each, six links, 6 KiB per chunk),
which is more than half of one SPE Local Store size. If an application cannot
allow that much buffer space for the communication, the buffers could be
distributed over multiple Local Stores. Another argument to employ multiple SPEs
is that the needed memory bandwidth of 12 GiB/s (6 GiB/s for each direction)
cannot be achieved with one SPE, as seen in the MFC micro benchmark in
section 3.2.4. Multiple SPEs allow a higher aggregate bandwidth to be reached,
while the number of MFC instructions from the PPE stays the same.
In this example, each node has to do:
• 6 DMA GETs of 6 KiB from Main Memory to Local Store into the first
buffer for the respective link
• 6 MFC assisted DCR reads to check whether each link's TX FIFO is empty (we
assume that the checks always return positive)
• 6 DMA PUTs of 6 KiB data from the other buffer to the NWP, one for
each link
• 6 MFC assisted DCR writes to grant credits for the next incoming data,
one for each link.
• 6 Local Store reads to check that the received data is completely trans-
ferred
• 6 DMA PUTs of 6 KiB data from the Local Store receive buffer to the
Main Memory
The sustainable bandwidth available for each link in each direction is limited
by the instruction latencies to:
$$\frac{6144\,\text{Byte} \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{(2 \cdot 6) \cdot t_{LR} + (5 \cdot 6) \cdot t_{MFC}} = 0.862\,\text{GiB/s}$$
If for scheduling reasons the DCR access must be done directly from the
PPE, the maximum sustainable bandwidth bound drops to:
$$\frac{6144\,\text{Byte} \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{6 \cdot t_{DR} + 6 \cdot t_{DW} + 6 \cdot t_{LR} + (3 \cdot 6) \cdot t_{MFC}} = 0.368\,\text{GiB/s}$$
This example shows again that the PPE latencies may become a bottleneck
in the communication.
3.2.9 Suggestions to Circumvent the PPE Latency Limi-
tations
As we have seen in the previous examples, the PPE latencies can limit
the sustainable bandwidth. Scheduling, congestion and other synchronization
effects will further decrease the bandwidth seen in practice. Therefore,
suggestions to circumvent the PPE limitations are presented.
TX FIFO filling levels in one DCR register
Instead of using expensive DCR operations to read the TX FIFO filling levels for
each link separately, the filling level information could be combined in one DCR
register. Special care has to be taken because the information might already be
outdated when many links are served in an interleaved fashion, e.g. the FIFO
is already empty while the information indicates that it is still full, or the link
runs out of data because other links are served with DMAs first. This technique
would decrease the needed DCR accesses on the sender side from up to 6 DCR
reads to 1 DCR read.
Using this method, the upper bound of the send bandwidth for the broadcast
example could be increased by 78% for the MFC assisted DCR access, and by
172% for the direct DCR access.
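For the MFC assisted case, the gain can be retraced by counting the commands the PPE has to issue per 6 KiB chunk in the broadcast example. The following Python sketch uses the PPE latencies tMFC = 795 and tLR = 227 cycles from Table 3.1; the helper name is hypothetical, and only the MFC assisted variant is covered.

```python
T_MFC = 795   # cycles, PPE: instruct MFC PUT/GET
T_LR = 227    # cycles, PPE: read 32 bit from an SPE Local Store
LINKS = 6


def issue_cycles(n_mfc, n_lr):
    """Cycles the PPE spends issuing the commands for one chunk."""
    return n_mfc * T_MFC + n_lr * T_LR


# One filling level check per link: 1 GET + 6 DCR reads + 6 PUTs.
per_link = issue_cycles(n_mfc=1 + LINKS + LINKS, n_lr=LINKS)
# Combined register: 1 GET + 1 DCR read + 6 PUTs.
combined = issue_cycles(n_mfc=1 + 1 + LINKS, n_lr=1)

gain = per_link / combined - 1.0
print(f"{gain:.0%}")   # prints 78%
```

Since the chunk size is the same in both cases, the ratio of the issue times directly gives the ratio of the bandwidth bounds.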
Use a faster controller to read/write credits and status information
Instead of using the slow DCR bus to control status information or provide
credits, an additional controller could be integrated in the NWP to provide
the time critical functions. This controller should be directly coupled with the
FlexIO interface or integrated into another high speed controller like the IWC.
Special care has to be taken that the credits are provided in order, e.g. by using
fenced MFC operations on the SPE.
NWP direct write to main memory
The NWP credits are granted for buffer spaces defined by the offset within
the credit and a base address set when initializing the torus interface. This
base address usually points to the physical address of the respective Local Store
address to allow direct SPE communication. For large messages, an alternative
would be to let the base address point to the receive buffer within the main
memory. The NWP would then transfer the received data directly to the main
memory. Larger credits could be issued, as they are not limited by the size of
the Local Store buffers. Larger credits also decrease the number of needed DCR
writes per block.
The direct write to main memory relieves the SPE in two ways: First, the
MFC is not involved in the receive direction, which is a benefit if an application
uses the SPEs while communicating. The second advantage is that no buffer
space has to be reserved in the Local Store for receiving data.
The NWP can not handle virtual addresses. One strategy would be to
allocate a contiguous buffer in the physical address space using a special device
driver. The address has to be translated from the virtual to the physical address,
and the physical address would then be passed to the NWP. Another possible
strategy is to program the IO-MMU of the PowerXCell 8i processor to handle
the address mapping.
When the base address is set, the offset in the credit can only address a
memory window of 1 MiB. If a larger message is to be transmitted, the base
address has to be moved multiple times and new credits have to be provided
for the moved window. While the base address is reset, no credits may be in
flight to avoid inconsistencies. This implies that no data can be received while
moving the base address, which will lead to minor regressions in the bandwidth.
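How often the base address has to be moved for a given message can be estimated with a small planning routine. The following Python sketch is only an illustration under stated assumptions: credits of at most 512 KiB are used, the message size is a multiple of 128 Bytes, a credit never crosses the 1 MiB window, and the names are hypothetical.

```python
KIB = 1024
PACKET = 128             # packet size in bytes
WINDOW = 1024 * KIB      # memory window addressable by the credit offset
MAX_CREDIT = 512 * KIB   # maximum length of one credit


def plan_credits(message_bytes):
    """Split a message into (window index, offset, length) credits."""
    if message_bytes <= 0 or message_bytes % PACKET != 0:
        raise ValueError("message size must be a positive multiple of 128")
    plan = []
    pos = 0
    while pos < message_bytes:
        window, offset = divmod(pos, WINDOW)
        # A credit must neither exceed 512 KiB nor cross the window end.
        length = min(MAX_CREDIT, WINDOW - offset, message_bytes - pos)
        plan.append((window, offset, length))
        pos += length
    return plan


plan = plan_credits(3 * WINDOW)
base_moves = plan[-1][0]        # the base initially points to window 0
print(len(plan), base_moves)    # prints 6 2
```

For a 3 MiB message this yields six credits and two moves of the base address, and all credits of one window have to complete before the base is moved.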
Note that this mechanism can already be used without modification to the
current FPGA design, given that the software layer provides support for the
address handling.
Use a DMA engine to read directly from main memory
The send direction is implemented by actively sending buffers to the NWP. If
the NWP had a DMA engine, the PPE would only instruct the DMA engine
to perform transfers to the NWP. The NWP could decide on its own when TX
FIFO space is available, thus the filling level checks could be skipped. Another
advantage is that the SPE is relieved in terms of MFC access and Local Store
usage, similar to the NWP direct write case.
The same memory restrictions as from the NWP direct write case must be
applied to the DMA engine: a simple DMA engine would not be able to read
from virtual addresses, and special buffers or the IO-MMU must be used.
For a system with a DMA engine combined with the direct memory write
on the receiver side, we can review the all-to-all example. We assume that
message sizes of 512 KiB can be sent by the DMA engine and received with
one credit. We assume 2 DCR writes for the sending side (one word for the
address, one word for the length). 1 DCR read is used to check whether the
send has completed. 1 DCR write and 1 Local Store read is used for the credit.
Furthermore 1 DCR write is used to move the credit base address after each
received packet:
$$\frac{512\,\text{KiB} \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{6 \cdot (4 \cdot t_{DW} + t_{DR} + t_{LR})} = 18.367\,\text{GiB/s}$$
As we can see, the bound to the bandwidth inflicted by the PPE latency is no
longer a bottleneck for the link bandwidth. Another advantage is that the SPE
and its MFC are completely decoupled from the message transfer. No buffer
space must be allocated in the SPE Local Store. Performance regression of
parallel running SPE code can still occur because the interface to main memory
on the PowerXCell 8i processor must still be shared between the SPEs and the
NWP.
Change blocking communication model to a feedback model
Instead of using a blocking write which requires the filling level check to avoid
deadlocks, a nonblocking feedback model has been proposed [47] which could
be used instead. A message can then be sent without blocking regardless of
the buffer filling, and the NWP must then provide feedback e.g. by writing a
notification in the Local Store. A message would only be accepted if enough TX
FIFO space is available. The impact on the PPE latencies is relaxed as the DCR
reads for the filling level can be skipped and LS reads to check for success have to
be added instead. However the scheduling is simplified as timing dependencies
between the DCR reads and the MFC sends are removed. From the optimistic
sending we can also expect an improvement in the memory to memory latency,
as no filling level check has to be done before sending a message.
Larger buffers for the TX FIFO
If the buffer space on the FPGA is available, an obvious improvement would be
to enlarge the buffers of the TX FIFO. As the upper bounds in the presented
examples all depend linearly on this buffer size, increasing the buffer size would
increase the sustainable bandwidth by the same factor. The largest buffer size is
16 KiB, because this is the maximum transfer size possible for the MFC. Larger
buffers (e.g. 16 KiB) combined with the feedback model would allow an upper
bound in the all-to-all example of:
$$\frac{16384\,\text{Byte} \cdot 3.2 \cdot 10^{9}\,\frac{\text{cycles}}{\text{s}}}{6 \cdot t_{DW} + 2 \cdot 6 \cdot t_{LR} + (3 \cdot 6) \cdot t_{MFC}} = 1.454\,\text{GiB/s}$$
3.3 Torus Network Model
For theoretical examination of communication algorithms we make the following
assumptions:
1. The processing nodes are organized in a 3D torus of dimension nx × ny × nz,
with n = nx · ny · nz nodes in total.
2. Each node can only communicate with its nearest neighbors.
3. It is possible to send and receive from all links simultaneously, 6 send and
6 receive operations may be active at the same time with full bandwidth.
4. The time to transmit a message located in SPE Local Stores from one
node to its neighbor is α+β ·m, where m is the message size in Bytes, α is
the latency (design goal: 1 µsecond) and β is the sustainable bandwidth
(design goal: 1 GiB/s).
Note for assumption 3 that the messages are practically not sent simultaneously,
but sent individually by the MFC. If one certain message is sent to multiple links,
one DMA PUT per link has to be issued separately. The bandwidth of the link
between the PowerXCell 8i processor and the FPGA however is 6 times higher
than the actual torus link bandwidth.
From these assumptions, we can draw the following conclusions:
1. Due to assumption 3, it is also possible to multicast one message to
several peers with full bandwidth.
2. When routing a message between arbitrary nodes of the torus, the minimum
sustainable latency is nhops · α, where nhops is the number of links used
on the path. This follows from the fact that the latency for one hop is α,
which is the time to complete the transfer from one node to its neighbor.
3. When routing a message between arbitrary nodes of the torus, the maximum
sustainable bandwidth is β. This follows from assumptions 3 and 4 and
can be achieved using double buffering.
3.3.1 Memory Locations
For the whole model, it is always assumed that data is transferred from and
to the Local Stores of the respective SPEs. Transferring messages from and to
the main memory is also possible, but the latency will be slightly higher: The
packets have first to be transferred to the Local Store, sent via the network as
usual, and moved back from Local Store to the main memory on the receiver
side. Assuming that the latency between LS and main memory is γ, the expected
time is α ·nhops+β ·m+2 ·γ for a path of length nhops. The bandwidth will still
be limited by the network bandwidth, as the SPE memory bandwidth is much
larger than the expected torus link bandwidth of 1GiB/s. We can also assume
that the memory latency γ will be much lower than the network latency α. For
simplicity, this additional memory latency is ignored, but can trivially be added
to the presented run time formulas.
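The run time expression can be written down as a small function. The following Python sketch is an interpretation rather than part of the model definition: it reads the term β · m as the message size divided by the sustainable bandwidth (that is, β as a time per Byte), uses the design goals of 1 µs and 1 GiB/s as defaults, and all names are hypothetical.

```python
def transfer_time(m_bytes, nhops, alpha=1e-6, bandwidth=2 ** 30, gamma=0.0):
    """Expected time in seconds for a pipelined transfer over nhops links.

    alpha: latency per hop in seconds
    bandwidth: sustainable link bandwidth in bytes per second
    gamma: latency between Local Store and main memory (0 for LS to LS)
    """
    beta = 1.0 / bandwidth   # assumption: beta is the time per byte
    return alpha * nhops + beta * m_bytes + 2 * gamma


# An empty message over 3 hops only pays the latency term.
print(transfer_time(0, nhops=3))
# A 1 MiB message to a direct neighbor is dominated by the bandwidth term.
print(transfer_time(2 ** 20, nhops=1))
```

Setting gamma to a nonzero value gives the main memory to main memory case described above.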
3.3.2 Communicators
In MPI, communicators can be used to create subspaces of other communica-
tors or the original set of processes. The communicators can be used to organize
the processes according to the pattern used in the application, and use collec-
tive algorithms only in these communicators. For example, in HPL separate
communicators are used for row and column communication.
When using the QPACE torus network, an obvious limit is that the pro-
cesses of one communicator must be connected, because only nearest neighbor
communication is supported. For arbitrary connected communicators, we can
implement most of the algorithms by using spanning trees. There also have
been publications [48, 49, 50] for many different architectures [51, 52, 23] about
efficient collective algorithms in torus and mesh networks, which mostly operate
on different assumptions: network devices are only single ported, or the torus
is not limited to nearest neighbor communication. Unlike arbitrary communi-
cators, it is often easier to find the spanning trees in torus and meshes, usually
they don't even have to be constructed explicitly (peers can be determined by
the cartesian coordinates of the local node), and multiple edge-disjoint spanning
trees may also be employed to increase throughput. 4 types of communicators
for a 3D torus network will be considered here, from the general to the specific:
1. General Communicators: an arbitrary set of the nodes.
2. Connected Communicators: A general communicator, where each pair of
nodes of the communicator is connected by a path of nearest neighbors
which are part of the communicator.
3. Mesh: A connected communicator whose nodes are numbered
with ids (x, y, z) satisfying 0 ≤ x ≤ nx − 1, 0 ≤ y ≤ ny − 1, 0 ≤ z ≤
nz − 1, where nx, ny, nz are the dimensions of this communicator. The nodes
are connected to their neighbors
(x+ 1, y, z),
(x− 1, y, z),
(x, y + 1, z),
(x, y − 1, z),
(x, y, z + 1),
(x, y, z − 1),
if these neighbors are within the defined boundaries.
4. Torus: A mesh with the difference that each node (x,y,z) is connected to
the neighbors
f(x+ 1, y, z),
f(x− 1, y, z),
f(x, y + 1, z),
f(x, y − 1, z),
f(x, y, z + 1),
f(x, y, z − 1),
where f(x, y, z) = (x mod nx, y mod ny, z mod nz). Unlike the mesh, there
are no boundary points or corners; each node has exactly 6 neighbors.
3.3.3 Topology Graphs
We will consider the topology graph as the directed graph G = (V,E) to
formulate propositions, where the vertices V are the set of nodes in the network,
and the edges E are the (bidirectional) links of the torus network, where
$$E = \left\{ \{(x_1, y_1, z_1), (x_2, y_2, z_2)\} \;\middle|\; (x_1, y_1, z_1) \ne (x_2, y_2, z_2);\ (x_1, y_1, z_1) \text{ is neighbor of } (x_2, y_2, z_2) \text{ in the torus} \right\}.$$
Communicators can then be considered as subgraphs G′ = (V′, E′) with
$$V' \subseteq V \quad \text{and} \quad E' = E \cap \binom{V'}{2}.$$
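The subgraph definition can be illustrated with a short Python sketch that builds the link set of a torus, restricts it to a communicator and tests whether the communicator is connected. The representation of a link as an unordered pair of nodes and all function names are assumptions of this sketch.

```python
from collections import deque
from itertools import product


def torus_edges(dims):
    """All links of a torus as unordered pairs (frozensets) of nodes."""
    edges = set()
    for node in product(*(range(n) for n in dims)):
        for axis in range(3):
            other = list(node)
            other[axis] = (other[axis] + 1) % dims[axis]
            other = tuple(other)
            if other != node:
                edges.add(frozenset((node, other)))
    return edges


def induced_edges(edges, members):
    """E' = all links whose two end points both belong to the communicator."""
    members = set(members)
    return {e for e in edges if e <= members}


def is_connected(members, edges):
    """Breadth-first search over the induced edges of the communicator."""
    members = set(members)
    if not members:
        return True
    adj = {v: set() for v in members}
    for a, b in (tuple(e) for e in edges):
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(members))
    seen, queue = {start}, deque([start])
    while queue:
        for w in adj[queue.popleft()] - seen:
            seen.add(w)
            queue.append(w)
    return seen == members


E = torus_edges((4, 4, 4))
row = [(x, 0, 0) for x in range(4)]
print(len(E), is_connected(row, induced_edges(E, row)))   # prints 192 True
```

A row of the torus forms a ring and is therefore a connected communicator, while two nodes that are not direct neighbors do not form one.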
3.4 Communication Algorithms
We will consider an example set of collective communication algorithms in
different types of communicators (connected, mesh, torus) and give performance
estimations. The estimations assume that all nodes enter the communication at
the same time and that the network has no contention. All algorithms are defined
analogously to the MPI standard [30, 31].
We define the latency as the time starting with the entry of the communication
function and ending when the last node has finished the communication,
while sending the smallest possible message (m = 0). This time is the minimum
time required for all nodes to complete. One node might finish its participation
in the communication earlier, and pipelining effects [49] when calling many
broadcasts in a row may give faster results than would be expected from the
latency alone.
The bandwidth is defined as the ratio of data processed in the time from
start to completion of the algorithm, for sufficiently large data sets (theoretically
unlimited). The processed data depends on the algorithm.
3.4.1 Broadcast
In a broadcast, a message is sent from one node called root to all other nodes
in the network. The data processed is the message sent by the root.
Spanning trees
For connected communicators, a broadcast can be implemented by constructing
a rooted spanning tree on the topology graph G using the root node as the
tree's root. The message can then be sent in a pipelined fashion: The root
node sends a chunk of the message to all of its neighbors within the tree
simultaneously.
Every other node, on receiving a chunk, forwards it to all of its other
neighbors within the tree (all except the neighbor it received the chunk
from) [49].
Bandwidth
The sustainable bandwidth for a single spanning tree is limited by the
bandwidth β of one link. Each node receives the message over exactly one
link with bandwidth β, and can forward it on the outgoing links at full
speed because of conclusion 1.
Latency bounds
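The latency of any spanning tree based collective algorithm has a lower bound
given by the maximum path length l from the root node to any other node of
the communicator. In a torus, the maximum length of a shortest path is
\[ l_{torus} = \lfloor n_x/2 \rfloor + \lfloor n_y/2 \rfloor + \lfloor n_z/2 \rfloor \]
for any root node.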
In meshes, the maximum path length depends on the position of the root
node. If the root node is one of the corners, the farthest node is the node at
the opposite corner of the mesh. The maximum path length for this case is
therefore (nx − 1) + (ny − 1) + (nz − 1). On the other hand, when the root
node is in the center of the mesh, the maximum path length is only
⌊nx/2⌋ + ⌊ny/2⌋ + ⌊nz/2⌋. Therefore
\[ \lfloor n_x/2 \rfloor + \lfloor n_y/2 \rfloor + \lfloor n_z/2 \rfloor \le l_{mesh} \le (n_x - 1) + (n_y - 1) + (n_z - 1) \]
holds.
For arbitrary connected communicators, the path length lconnected may be
[...] ⌋ ≤ lconnected ≤ n holds.
The lower bound for the latency is l ·α, because this is the minimum time the
information needs to travel from the root node to the farthest node (or back).
Depending on the spanning tree, the maximum path length can be much longer
than the introduced lower bound, e.g. a Hamiltonian path through a torus is
also a tree with the maximum path length of n.
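The path length bounds can be evaluated with a few lines of C. This is a minimal sketch under the assumptions of this section (shortest paths, one hop costs α); the function names are chosen for this example only. For the torus it uses the fact that the farthest node in a ring of n nodes is ⌊n/2⌋ hops away.

```c
/* Larger of two integers. */
int imax(int a, int b)
{
    return a > b ? a : b;
}

/* Longest shortest path from the root (rx, ry, rz) in an nx x ny x nz mesh:
 * per dimension, the farther of the two boundaries counts. */
int mesh_max_path(int nx, int ny, int nz, int rx, int ry, int rz)
{
    return imax(rx, nx - 1 - rx) + imax(ry, ny - 1 - ry)
         + imax(rz, nz - 1 - rz);
}

/* Longest shortest path in a torus; independent of the root position. */
int torus_max_path(int nx, int ny, int nz)
{
    return nx / 2 + ny / 2 + nz / 2;
}

/* Lower bound for the latency: l hops, each taking the time alpha. */
double latency_lower_bound(int l, double alpha)
{
    return l * alpha;
}
```

For a 4 × 4 × 4 mesh this gives 9 hops for a corner root and 6 hops for the root (2, 2, 2), which matches the two ends of the inequality for lmesh.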
Constructing spanning trees in meshes and tori
A simple, yet optimal spanning tree in terms of latency is the dimension
ordered spanning tree, which can be constructed in meshes and tori. The
construction is similar to dimension order routing [53]. The root node sends
in all possible directions. Every other node forwards in all possible
directions which are the same as or behind the receiving direction in the
dimension order X, Y, Z. In a torus, a link is not used if the neighbor has a
lower distance to the root node¹, to prevent loops. The constructed tree
reaches the minimum bounds for the path lengths, and the latency is therefore
optimal with lmin · α. As expected for a single spanning tree, the bandwidth
is β. The running time of this algorithm is t = lmin · α + m/β with the
message size m in bytes.
¹ If the distance is the same, the message is only sent if the direction is
positive, to support even dimensions.
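For a mesh, the dimension ordered spanning tree can be described by a parent function. The following sketch assumes that a message travels along X first, then Y, then Z, so that the last hop into a node is along the highest dimension in which the node differs from the root. The names node_t, dim_order_parent() and dim_order_depth() are assumptions for this illustration; the torus case with wrap-around links is not covered.

```c
/* Coordinates of a node in the mesh. */
typedef struct { int x, y, z; } node_t;

/* Move coordinate v one hop towards the root coordinate r. */
int step_towards(int v, int r)
{
    return v < r ? v + 1 : v - 1;
}

/* Parent of node c in the dimension ordered tree rooted at r.
 * For the root itself, the node is returned unchanged. */
node_t dim_order_parent(node_t c, node_t r)
{
    node_t p = c;
    if (c.z != r.z)
        p.z = step_towards(c.z, r.z);   /* last hop was along Z */
    else if (c.y != r.y)
        p.y = step_towards(c.y, r.y);   /* last hop was along Y */
    else if (c.x != r.x)
        p.x = step_towards(c.x, r.x);   /* last hop was along X */
    return p;
}

/* Number of hops from the root r to node c within the tree. */
int dim_order_depth(node_t c, node_t r)
{
    int depth = 0;
    while (c.x != r.x || c.y != r.y || c.z != r.z) {
        c = dim_order_parent(c, r);
        depth++;
    }
    return depth;
}
```

The depth of every node equals its Manhattan distance to the root, which is why the tree reaches the lower bound for the path length in a mesh.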
Figure 3.8: Dimension ordered spanning tree in a 3× 3× 3 mesh
Figure 3.9: Illustration of the BlueGene/L adapted broadcast on a 4 × 4 × 4
mesh
Edge disjoint spanning trees
In dense graphs, multiple edge-disjoint spanning trees may be constructed. In
a 3D torus, the number of edge-disjoint spanning trees k cannot be larger than
6:
Each spanning tree has n − 1 edges. Because of assumption 3, each node
can have 6 ingoing and 6 outgoing edges. This limits the number of edges to
(12 · n)/2 = 6 · n, because each edge is counted once as outgoing and once as
ingoing. The condition k · (n − 1) ≤ 6 · n can then only be satisfied for
k ≤ 6 (assuming n ≥ 8).
With multiple spanning trees, we can interleave the chunks of messages on
these k spanning trees. Each spanning tree transfers the chunks independently
as described above for single trees. With this mechanism we can achieve a
bandwidth of at most k · β. The latency is still bounded by the maximum
path length of the spanning trees used. Constructing k spanning trees in a
general graph with k ≥ 2 is known to be an NP-hard problem [54], but efficient
algorithms for regular topologies like meshes and tori can be formulated.
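The counting argument and the interleaving can be sketched as follows. Both functions are illustrative assumptions: max_disjoint_trees() simply evaluates the inequality k · (n − 1) ≤ 6 · n, and tree_for_chunk() shows one possible round-robin assignment of chunks to trees.

```c
/* Largest k that satisfies k * (n - 1) <= 6 * n for a 3D torus with
 * n nodes. Requires n >= 2; for n >= 8 the result is 6. */
int max_disjoint_trees(long n)
{
    return (int)((6 * n) / (n - 1));
}

/* Round-robin interleaving: chunk number i travels on tree i mod k. */
int tree_for_chunk(long chunk, int k)
{
    return (int)(chunk % k);
}
```

With k = 6, chunks 0 to 5 start on six different trees at the same time, so six links of the root are busy in parallel.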
Constructing 3 edge disjoint spanning trees in meshes and tori
Algorithms to construct these spanning trees for meshes have been implemented
e.g. for the IBM BlueGene®/L™ [50]. The broadcast algorithm for the
BlueGene/L machine is not optimal in the sense that it does not construct the
maximum number of spanning trees possible in tori, but it offers higher
bandwidth for mesh communicators than a single tree.
We assume a mesh communicator. The 3 spanning trees are practically
rotated copies of the first spanning tree. Figure 3.9 presents an illustration
of the constructed trees.
The spanning trees are constructed in 3 steps:
1. The message is distributed from the root node along the X-axis to the
outside boundary of the mesh.
2. The message is then broadcast using a dimension ordered routing along
the Y and Z axis.
3. All nodes on the boundary planes which don't have the same Y and Z
coordinate as the root node forward back to the center, and the inner
nodes forward it in the same direction they received it from until the X
coordinate of the root is reached.
The 2 other spanning trees are constructed using the same algorithm but with
shifted dimensions: Spanning tree 2 replaces the dimensions X,Y,Z in the
algorithm with Y,Z,X, and spanning tree 3 replaces the dimensions X,Y,Z with
Z,X,Y.
We can verify that all links are mutually exclusive: The inner links of the
mesh are only used by the first dimension of the respective spanning tree. The
links within the boundary planes are used only for the outside direction in
step 2 in the respective spanning tree and for the inside direction in step 3
in the remaining spanning trees.
The maximum path length l from the root to a leaf in the constructed
spanning trees is bounded by
l ≤ max(2 · nx + ny + nz, nx + 2 · ny + nz, nx + ny + 2 · nz)
and reaches its maximum if the root node is a corner node.
The running time of this broadcast algorithm is t = l · α + m/(3 · β) with
the message size m in bytes, as 3 disjoint spanning trees were constructed.
Constructing 6 edge disjoint spanning trees in tori
Based on the EDF (edge-disjoint fences) algorithm by Barnett et al. [48] for
2D tori, we can construct 6 edge-disjoint spanning trees for 3D tori. Without
loss of generality we can assume that the root is (0,0,0), as any other root
(x,y,z) can be reduced to this case by translating all nodes by (-x,-y,-z).
Figure 3.10: Illustration of the 3D-EDF algorithm for a 3× 3× 3 torus
On each node, the links in the positive directions X+, Y+, Z+ are sending for
spanning trees 1-3 and receiving for spanning trees 4-6, and the links in the
negative directions X-, Y-, Z- are receiving for spanning trees 1-3 and
sending for spanning trees 4-6. Each link is assigned to two of the 6 spanning
trees according to Tables 3.2 and 3.3, once for receiving and once for
sending.
Proof of correctness
Connectivity can be checked by verifying in the tables that for any node
(except the root), and for each spanning tree, sending links only exist when
there is exactly one receiving link. Furthermore, we need to check that the
constructed spanning trees are indeed loop free: All nodes only forward in
positive direction for spanning trees 1-3, and in negative direction for
spanning trees 4-6. We can verify that for each spanning tree there is, in
each dimension, a plane which does not forward packets of this spanning tree
except along this plane, acting like a barrier for the routing. Thus the
routing in this spanning tree is equivalent to positive (negative) routing in
a mesh, which is known to be loop free:
• Spanning tree 1 is not forwarded further in the 3 planes x = 0; y = yn −
1; z = zn − 1
• Spanning tree 2 is not forwarded further in the 3 planes x = xn − 1; y =
0; z = zn − 1
• Spanning tree 3 is not forwarded further in the 3 planes x = xn − 1; y =
yn − 1; z = 0
X Range         Y Range         Z Range         Send          Recv          Category
                                                X+  Y+  Z+    X-  Y-  Z-
x = 0           y = 0           z = 0           1   2   3     -   -   -     root
x = 0           y = 0           z = zn − 1      3   3   -     1   2   3     corner point
x = 0           y = yn − 1      z = 0           2   -   2     1   2   3     corner point
x = 0           y = yn − 1      z = zn − 1      2   2   3     1   3   2     corner point
x = xn − 1      y = 0           z = 0           -   1   1     1   2   3     corner point
x = xn − 1      y = 0           z = zn − 1      1   3   3     3   2   1     corner point
x = xn − 1      y = yn − 1      z = 0           1   2   1     2   1   3     corner point
x = xn − 1      y = yn − 1      z = zn − 1      1   2   3     2   3   1     corner point
0 < x < xn − 1  y = 0           z = 0           1   1   1     1   2   3     edge point
0 < x < xn − 1  y = 0           z = zn − 1      3   3   3     3   2   1     edge point
0 < x < xn − 1  y = yn − 1      z = 0           2   2   1     2   1   3     edge point
0 < x < xn − 1  y = yn − 1      z = zn − 1      2   2   3     2   3   1     edge point
x = 0           0 < y < yn − 1  z = 0           2   2   2     1   2   3     edge point
x = 0           0 < y < yn − 1  z = zn − 1      2   3   3     1   3   2     edge point
x = xn − 1      0 < y < yn − 1  z = 0           1   1   1     2   1   3     edge point
x = xn − 1      0 < y < yn − 1  z = zn − 1      1   3   3     2   3   1     edge point
x = 0           y = 0           0 < z < zn − 1  3   3   3     1   2   3     edge point
x = 0           y = yn − 1      0 < z < zn − 1  2   2   2     1   3   2     edge point
x = xn − 1      y = 0           0 < z < zn − 1  1   3   1     3   2   1     edge point
x = xn − 1      y = yn − 1      0 < z < zn − 1  1   3   1     2   2   1     edge point
0 < x < xn − 1  0 < y < yn − 1  z = 0           2   3   2     1   3   2     face point
0 < x < xn − 1  0 < y < yn − 1  z = zn − 1      1   3   1     2   3   1     face point
0 < x < xn − 1  y = 0           0 < z < zn − 1  3   3   1     3   2   1     face point
0 < x < xn − 1  y = yn − 1      0 < z < zn − 1  2   2   1     2   3   1     face point
x = 0           0 < y < yn − 1  0 < z < zn − 1  2   3   2     1   3   2     face point
x = xn − 1      0 < y < yn − 1  0 < z < zn − 1  1   3   1     2   3   1     face point
0 < x < xn − 1  0 < y < yn − 1  0 < z < zn − 1  2   3   1     2   3   1     inner point
Table 3.2: Edge assignment to the Spanning Trees in a 3D Torus for spanning
trees 1-3
X Range         Y Range         Z Range         Recv          Send          Category
                                                X+  Y+  Z+    X-  Y-  Z-
x = 0           y = 0           z = 0           -   -   -     4   5   6     root
x = 0           y = 0           z = 1           4   5   6     6   6   -     corner point
x = 0           y = 1           z = 0           4   5   6     5   -   5     corner point
x = 0           y = 1           z = 1           4   6   5     5   5   6     corner point
x = 1           y = 0           z = 0           4   5   6     -   4   4     corner point
x = 1           y = 0           z = 1           6   5   4     4   6   6     corner point
x = 1           y = 1           z = 0           5   4   6     4   5   4     corner point
x = 1           y = 1           z = 1           5   6   4     4   5   6     corner point
1 < x ≤ xn − 1  y = 0           z = 0           4   5   6     4   4   4     edge point
1 < x ≤ xn − 1  y = 0           z = 1           6   5   4     6   6   6     edge point
1 < x ≤ xn − 1  y = 1           z = 0           5   4   6     5   5   4     edge point
1 < x ≤ xn − 1  y = 1           z = 1           5   6   4     5   5   6     edge point
x = 0           1 < y ≤ yn − 1  z = 0           4   5   6     5   5   5     edge point
x = 0           1 < y ≤ yn − 1  z = 1           4   6   5     5   6   6     edge point
x = 1           1 < y ≤ yn − 1  z = 0           5   4   6     4   4   4     edge point
x = 1           1 < y ≤ yn − 1  z = 1           5   6   4     4   6   6     edge point
x = 0           y = 0           1 < z ≤ zn − 1  4   5   6     6   6   6     edge point
x = 0           y = 1           1 < z ≤ zn − 1  4   6   5     5   5   5     edge point
x = 1           y = 0           1 < z ≤ zn − 1  6   5   4     4   6   4     edge point
x = 1           y = 1           1 < z ≤ zn − 1  5   6   4     4   5   4     edge point
1 < x ≤ xn − 1  1 < y ≤ yn − 1  z = 0           5   4   6     5   4   4     face point
1 < x ≤ xn − 1  1 < y ≤ yn − 1  z = 1           5   6   4     5   6   6     face point
1 < x ≤ xn − 1  y = 0           1 < z ≤ zn − 1  6   5   4     6   6   4     face point
1 < x ≤ xn − 1  y = 1           1 < z ≤ zn − 1  5   6   4     5   5   4     face point
x = 0           1 < y ≤ yn − 1  1 < z ≤ zn − 1  4   6   5     5   6   5     face point
x = 1           1 < y ≤ yn − 1  1 < z ≤ zn − 1  5   6   4     4   6   4     face point
1 < x ≤ xn − 1  1 < y ≤ yn − 1  1 < z ≤ zn − 1  5   6   4     5   6   4     inner point
Table 3.3: Edge assignment to the Spanning Trees in a 3D Torus for spanning
trees 4-6
• Spanning tree 4 is not further forwarded in the 3 planes x = 0; y = 1; z = 1
• Spanning tree 5 is not further forwarded in the 3 planes x = 1; y = 0; z = 1
• Spanning tree 6 is not further forwarded in the 3 planes x = 1; y = 1; z = 0
As each spanning tree is connected and loop free, the constructed spanning trees
are proper trees.
Properties
The maximum path length used in this algorithm is (xn−1)+(yn−1)+(zn+1)+1
in any of the constructed spanning trees. Compared to the dimension ordered
spanning tree, this is more than twice the path length used there. However,
the sustainable bandwidth is 6 · β, which is optimal for our model, as all
links are used (except the receiving links at the root node) and no part of
the message is sent redundantly.
The running time of this broadcast algorithm is t = l · α + m/(6 · β) with
the message size m in bytes, as 6 disjoint spanning trees were constructed.
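The running time estimates of the three broadcast variants differ only in the path length l and the number of trees k. The sketch below assumes that β is the bandwidth of one link in bytes per second, as in the bandwidth discussion for a single spanning tree, so that the transfer term is the message size divided by the aggregate bandwidth k · β. The function name bcast_time() is an assumption for this example.

```c
/* Estimated broadcast time for k edge-disjoint spanning trees.
 *   l     : maximum path length of the trees in hops
 *   alpha : latency of one hop in seconds
 *   m     : message size in bytes
 *   k     : number of spanning trees (1, 3 or 6 in this section)
 *   beta  : bandwidth of one link in bytes per second               */
double bcast_time(int l, double alpha, double m, int k, double beta)
{
    return l * alpha + m / (k * beta);
}
```

For small messages the term l · α dominates, which favors the dimension ordered tree with its short paths; for large messages the term m/(k · β) dominates, which favors the 6 tree construction despite its longer paths.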
Using Gigabit Ethernet for broadcasts
The Ethernet protocol supports unreliable native broadcasts which can be used
for example with the UDP protocol. Based on the unreliable multicast or
broadcast, it is possible to build a practically constant time broadcast [55]
for the MPI_COMM_WORLD communicator. Depending on the final latency
performance of the torus network, this method might be superior in terms of
latency, as all nodes can be reached through the switched Ethernet
interconnect without intermediate nodes. The bandwidth however is limited to
1 GBit/s, which is only a fraction of the bandwidth available in the torus
network.
3.4.2 Reduce
A reduce operation reduces a value (or a vector of values) from all processes
to one value, which is returned at a distinct node called the root node.
Usually rooted trees are employed where each node combines the incoming
messages with its own and forwards the result to its parent in the tree. The
communication pattern is the same as in the broadcast operation, but in
reverse direction, and we can use the same (multiple) spanning trees as for
the broadcast operation.
A potential bottleneck is the processing power and memory bandwidth needed
for the combination of the input data. Unlike the broadcast, where the message
parts can be buffered in the Local Store and sent out in the next step, the
message parts from all incoming links must be collected and reduced before the
result can be forwarded. This requires more synchronization with the neighbors
than in the broadcast. The reduction must be actively performed by the PPE or
the SPE. The PPE might be a bottleneck in this reduction step: For example,
when a node has 5 incoming links and 1 outgoing link, as in the spanning tree,
it requires a memory bandwidth of 6 GiB/s if all links are used, which is more
than the PPE can handle. It is therefore advisable to outsource the reduction
operation to the SPE by using function offloading (see section 4.2.2) if a PPE
centric approach is used. The SPE does not have this limitation if the message
parts are kept in the Local Store.
Assuming that the processing speed does not limit the reduction, the run time
of the reduce operation is similar to that of the broadcast operation for the
same spanning trees. The same bandwidth can be sustained using double
buffering, but the latency will be higher because of the reduction. Assuming a
time τ which is needed to perform a reduction of the message parts on one
node, the running time for the reduce operation is
t_reduce = t_bcast + τ · l




Chapter 4
Message Passing Libraries on QPACE
In this chapter the message passing libraries and their consequences for the
programming models are discussed. The Cell/B.E. architecture, as a
heterogeneous multi-core microprocessor, can be programmed in different ways.
Section 4.1 presents possible SPE centric message passing designs and ways to
layer them on top of each other. In section 4.2 the PPE centric approaches are
discussed. The proposed programming models are then discussed for HPL in
section 4.3. Finally, an interface description for the integration into the
MPI framework OpenMPI is presented in section 4.4.
4.1 SPE Centric Approaches
This section discusses the SPE centric implementation options for the
communication libraries MPI and QMP on QPACE. Various alternatives are
possible: implement QMP or MPI on the low level interface which is provided by
the torus, or map one of the libraries on top of the other. Parts of the
libraries can also be outsourced to PPE assisted callbacks. In an SPE centric
programming model, the constraints of the SPEs, like the Local Store size,
must also be considered.
From the introduction to QMP [56]:
"Depending upon demand, a subset of MPI could be imple-
mented above this new API so that legacy codes which use MPI
could function on the new architectures which implement only the
new API (albeit at somewhat reduced efficiency). Further, the new
API has been implemented atop MPI so that new applications using
this new API can still be run on older machines for which only MPI
is available, with negligible overhead."
From the functionality point of view, QMP is only a subset of MPI specialized
for QCD, designed for efficient (zero copy) nearest neighbor communication.
Therefore implementing QMP on top of MPI or directly on the network hardware
is the natural choice, and implementations exist at least for MPI (MPICH
[57]), QCDOC [23] and M-VIA [58].
4.1.1 MPI on QMP
There seems to be no publicly available implementation of an MPI mapping on
QMP at the time of writing. This section discusses the limits imposed by QMP
and the effort needed to overcome them for a (complete) MPI implementation on
top of QMP.
A high level approach
OpenMPI Layer                  Offered by the Layer      Offered by QMP
MPI - User Interface           user interface            none
DDT - Derived Datatype Engine  [...]                     [...]
[...]                          [...]                     limitations, only in global communicator
[...]                          move raw bytes in order   supported
Table 4.2: OpenMPI Layers (partial list) (from [1, 2])
A blunt approach might be to embed QMP in an existing MPI framework like
OpenMPI. Features which QMP provides can be used directly; unsupported
features could be taken from the framework's code base. A compliant MPI could
be produced this way, but some limitations make it difficult to port a huge
library to the SPE:
• the small Local Store size of 256 KiB which has to keep the library code,
application code and run-time data
• missing direct hardware access (only partly true for QPACE)
• the unconventional programming model which does not give direct access
to the main memory
• slow path to the operating system (via PPE assisted callbacks)
There are various approaches to address these problems, like software managed
caches or code overlays. These techniques still require rather large porting
efforts and also add non-negligible overhead. Previous implementations [59,
60] therefore chose to implement only a subset of the library to keep the
footprint small.
However, from the organization of the OpenMPI layers as illustrated in
Table 4.2 we can estimate what QMP can offer for MPI, and can compare the
layers and their features.
A low level approach
Figure 4.1: MPI on QMP block diagram for a possible implementation
Instead of using a complete MPI framework like OpenMPI, we can try to build
the MPI primitives directly on the QMP primitives. This is an approach for a
scenario where a reduced subset of MPI is acceptable, as there are some
limitations in function and performance as shown below. The needed features
must be implemented from scratch. When a more complete MPI is targeted, QMP
would only be used as a pure Byte Transport Layer. Complicated functions could
be outsourced to the PPE and an existing MPI implementation. The design would
be similar to the direct MPI on QPACE torus approach described in section
4.1.4, but with the (redundant) QMP layer in between.
Send/Recv implementation illustration
The most common primitives, MPI_Send() and MPI_Recv(), can be easily
implemented directly with the QMP primitives:
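int MPI_Send(void *buf, const int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)
{
    void *packed_buf;
    int nbytes, phys_dest;
    QMP_msgmem_t msgmem;
    QMP_msghandle_t handle;

    /* serialize the user data into one contiguous buffer */
    nbytes = pack(buf, count, datatype, &packed_buf);
    /* translate the communicator rank into the physical QMP node */
    phys_dest = get_phys_pos(dest, comm);
    msgmem = QMP_declare_msgmem(packed_buf, nbytes);
    handle = QMP_declare_send_to(msgmem, phys_dest, 0);
    /* blocking semantics: start the transfer and wait for completion */
    QMP_start(handle);
    QMP_wait(handle);
    QMP_free_msghandle(handle);
    QMP_free_msgmem(msgmem);
    free(packed_buf);
    return MPI_SUCCESS;
}

int MPI_Recv(void *buf, const int count, MPI_Datatype datatype, int src,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    void *packed_buf;
    int nbytes, phys_src;
    QMP_msgmem_t msgmem;
    QMP_msghandle_t handle;

    /* receive into a temporary contiguous buffer of the packed size */
    nbytes = get_size(buf, count, datatype);
    packed_buf = malloc(nbytes);
    phys_src = get_phys_pos(src, comm);
    msgmem = QMP_declare_msgmem(packed_buf, nbytes);
    handle = QMP_declare_receive_from(msgmem, phys_src, 0);
    QMP_start(handle);
    QMP_wait(handle);
    QMP_free_msghandle(handle);
    QMP_free_msgmem(msgmem);
    /* scatter the received bytes into the user datatype layout */
    unpack(buf, count, datatype, packed_buf, nbytes);
    free(packed_buf);
    return MPI_SUCCESS;
}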
Redundant declaration and deallocation of buffers
QMP is designed for repetitive sends and receives of similar data. In the
implementation above the memory is declared and freed for each send/recv.
This is probably slow; from [56]:
"QMP messaging is meant to be highly repetitive and high per-
formance, and uses a gated message channel paradigm. In this
case messaging is done by first declaring the source and destina-
tion buffers and node ID (expensive part), then executing the pre-
computed I/O operation on demand as rapidly as possible. Destina-
tions are always known & pre-allocated buffers are used (no queuing
and so no extra copy for all but very short messages)."
Datatype engine
Datatypes must be converted from MPI datatypes to QMP datatypes. As
aggregated datatypes are not supported, they must be converted manually. This
is illustrated with the pack()/unpack() functions above.
However, it is possible to use QMP's support for strided datatypes or vectors,
which is not illustrated above for simplicity.
Groups, communicators
QMP only uses the global communicator and does not support
sub-communicators, so all communicator housekeeping must be handled by the MPI
layer itself. This is illustrated with the get_phys_pos() function.
Communicator ranks have to be computed relative to the physical ranks defined
by QMP.
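A rank translation of this kind can be sketched with a simple lookup table. The structure mini_comm_t and the signatures below are hypothetical and only illustrate the housekeeping; in particular, this get_phys_pos() takes a pointer to the table instead of an MPI_Comm handle.

```c
#include <stddef.h>

/* Hypothetical communicator record kept by the MPI layer, since QMP
 * itself only knows the global node numbers. */
typedef struct {
    int        size;       /* number of ranks in this communicator      */
    const int *phys_rank;  /* phys_rank[i] = QMP node number of rank i  */
} mini_comm_t;

/* Translate a communicator rank into the physical QMP node number.
 * Returns -1 for an invalid rank or communicator. */
int get_phys_pos(int rank, const mini_comm_t *comm)
{
    if (comm == NULL || rank < 0 || rank >= comm->size)
        return -1;
    return comm->phys_rank[rank];
}

/* Inverse lookup: communicator rank of a physical node, or -1 if the
 * node is not a member of the communicator. */
int get_comm_rank(int phys, const mini_comm_t *comm)
{
    if (comm == NULL)
        return -1;
    for (int i = 0; i < comm->size; i++)
        if (comm->phys_rank[i] == phys)
            return i;
    return -1;
}
```

The linear inverse lookup keeps the table small, which matters for the limited Local Store; a second array could be used instead if the lookup turns out to be time critical.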
The communicator, among other identifiers, must be stored in some control
information for the message if nonblocking communication is to be supported
(see below).
Collectives/global operations
There are a few MPI collectives that can be represented with QMP
collectives: MPI_Barrier() maps to QMP_barrier(), MPI_Allreduce() maps
to QMP_sum*(), and MPI_Bcast() maps to QMP_broadcast() if the
root is 0. The main limitation is that only the main communicator
(MPI_COMM_WORLD) is supported here; sub-communicators cannot use
these functions. Some other collectives can probably be mapped on the
provided QMP collectives (with performance overhead), e.g. MPI_Reduce() could
be mapped on MPI_Allreduce(). The datatypes might also limit the usefulness
of the QMP collectives.
For all cases not covered above, it is possible to implement the collectives
using the Send/Recv primitives as fallback routines.
Nonblocking communication, tags, protocols
It is possible to split the MPI_Send (MPI_Recv) call from above into
nonblocking MPI_Isend (MPI_Irecv) + MPI_Wait() calls. However, unlike in MPI,
messages in QMP are not expected to be able to overtake each other, so
deadlocks might occur. From the QMP introduction [56] (in Communication
Operations, Restrictions):
"Starting a second send (separate handle) to the same adjacent
node before the first completes, or a second receive before the first
completes, is allowed. The second send/receive start function is
allowed to block. That is, the implementation is not required to
implement an I/O queue to support this behavior."
The same applies to different tags. Tags are completely unsupported by QMP,
so they must be embedded in some control information for the sent data.
Furthermore, deadlocks might be possible if different nonblocking messages
from the same node cannot overtake each other.
To support concurrent messages, message matching (identified by count,
datatype, tag, communicator), message scheduling and flow control must be
implemented. This can be done by a layer of protocols and message queues
between the MPI calls and the corresponding QMP calls.
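A message queue of this kind can be sketched as follows. The control header and the queue layout are hypothetical; the example matches on source, tag and communicator only and leaves out wildcards, the size check and the buffering of unexpected messages.

```c
#include <stddef.h>

/* Hypothetical control header sent in front of every payload, since
 * QMP itself carries neither tags nor communicator identifiers. */
typedef struct {
    int src;      /* sender rank within the communicator */
    int tag;      /* MPI tag                              */
    int comm_id;  /* identifies the communicator          */
    int nbytes;   /* payload size in bytes                */
} msg_header_t;

/* One posted (not yet completed) receive in a singly linked queue. */
typedef struct posted_recv {
    int src, tag, comm_id;
    struct posted_recv *next;
} posted_recv_t;

/* Find the oldest posted receive that matches the header, unlink it
 * from the queue and return it. NULL means that the message is
 * unexpected and has to be buffered by the caller. */
posted_recv_t *match_and_remove(posted_recv_t **queue, const msg_header_t *h)
{
    for (posted_recv_t **p = queue; *p != NULL; p = &(*p)->next) {
        posted_recv_t *r = *p;
        if (r->src == h->src && r->tag == h->tag &&
            r->comm_id == h->comm_id) {
            *p = r->next;      /* unlink the matched entry */
            r->next = NULL;
            return r;
        }
    }
    return NULL;
}
```

Searching the queue from its head preserves the order in which receives were posted, which MPI requires for messages with identical source, tag and communicator.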
4.1.2 QMP on MPI
Figure 4.2: QMP on MPI block diagram
There is already an implementation of QMP on MPI¹, written by Jie Chen and
Robert Edwards of the Jefferson Lab HPC group. The code is rather compact with
about 5000 lines of code and about 22 KiB of code size (.text segment compiled
with spu-gcc) in QMP-2.3.1. This version worked nearly out of the box on
SPE-MPI, with only a one-line patch to make it MPI 1.X compliant. It offers a
complete implementation of QMP and is intended for single node use and
clusters where direct hardware support is not (yet) available. As QMP is
mostly a functional subset of MPI, many functions like collectives can be
translated to the MPI variants without limitations.
Only a subset of MPI is used, so QMP on MPI only needs a few MPI functions to
be actually (efficiently) implemented. A reduced subset of MPI or a version
where only special cases are implemented efficiently would be sufficient. The
performance is expected to be lower than a directly implemented QMP on the
QPACE torus (see next section)
¹ Available at: http://usqcd.jlab.org/usqcd-software/qmp
Component     MPI Functions
[...]         [...]
Collectives   MPI_Barrier, MPI_Bcast, MPI_Allreduce, MPI_Op_create
Figure 4.3: List of used MPI calls in the QMP MPICH implementation
because of software overhead like the message scheduling in MPI, which is not
needed for the QMP operation. The impact of this kind of overhead has yet to
be evaluated. A list of the employed MPI calls is given in Figure 4.3.
4.1.3 QMP on QPACE Torus
The torus low level library offers nonblocking nearest neighbor communication
primitives which can be used directly by the corresponding QMP calls. There
is no need for flow control if it can be expected that the program runs
synchronously, posting receives at the same time as the neighbor is posting
sends. Collective calls can be implemented on top of the basic communication
primitives.
However, QMP has further capability requirements (as listed in [56]) which
are not trivial to satisfy. Sending a contiguous message to a given node which
is not a nearest neighbor requires routing capabilities from the torus
network. This is not implemented in hardware at the time of writing. Software
routing is an option, but requires protocols to handle unexpected incoming
messages which have to be forwarded, and hence the performance advantage over
MPI, gained by using only synchronous communication without protocols, is
lost.
Another option for the routing problem is to use PPE assisted callbacks for
these primitives. This could be solved for example as in SPE-MPI [61], with a
QMP instance running on the PPE for each SPE node. This option would slow
down the non-nearest-neighbor communication latency by an order of magni-
tude, but would keep the nearest neighbor communication fast and synchronous.
This would be a fair option if non-nearest-neighbor communication is not time
critical.
4.1.4 MPI on QPACE Torus
There are different approaches to implement MPI on the SPE. Reduced subsets
of MPI have been introduced in [59, 60]. A complete MPI implementation can
be provided with an approach like SPE-MPI [61], which can also be combined
with the reduced implementations. The idea is to implement only time critical
calls and features on the SPE, and falling back to PPE assisted callbacks if this
is not possible. For example on QPACE, an MPI_Send could check if it is an
easy datatype (basic, maybe also vector or strided) and if the destination is a
neighbor. If the requirements are met, the data can be sent directly over the
torus, otherwise the PPE must send it via its own MPI implementation, e.g.
OpenMPI using Ethernet. It is important to note that this mechanism only
works if the requirement decision can also be made a-priori on the receiver side.
Otherwise there could be conflicts, for example the data is transmitted directly
to the SPE, but the PPE is waiting for the reception and cannot be notified.²
This also makes the MPI_ANY_SOURCE and MPI_ANY_TAG wildcards unusable; they
must be forbidden if direct torus communication is to be used. This is not a
very strong limitation compared to the gained performance, but it breaks full
MPI compliance.
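The a-priori decision can be sketched as a pure function of information that is known identically on the sender and on the receiver side. The enumerations and the function name below are hypothetical and only illustrate the idea.

```c
/* Hypothetical classification of an MPI datatype. */
typedef enum { DT_BASIC, DT_VECTOR, DT_DERIVED } dt_class_t;

/* Transport used for a message. */
typedef enum { PATH_TORUS, PATH_PPE } path_t;

/* Decide the transport from the datatype class and from whether the
 * peer is a nearest neighbor in the torus. Sender and receiver call
 * this function with the same arguments and therefore reach the same
 * decision without exchanging any message. */
path_t choose_path(dt_class_t dt, int peer_is_neighbor)
{
    if (peer_is_neighbor && (dt == DT_BASIC || dt == DT_VECTOR))
        return PATH_TORUS;   /* fast path handled on the SPE        */
    return PATH_PPE;         /* fall back to a PPE assisted callback */
}
```

A receive with MPI_ANY_SOURCE cannot evaluate peer_is_neighbor, which is why the wildcards have to be excluded when the fast path is used.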
There are features which have to be implemented for the direct torus com-
munication to work, and some which are optional and can be applied if the
application needs it. The considerations for optional features are the gained
performance and the bloat of the library size.
Mandatory features
Communication protocol on the QPACE torus
The QPACE torus allows messages to be delivered in order to the nearest
neighbor, where the message size is limited by the transmit buffer. Large
messages have to be split into smaller packets. Each neighbor link FIFO is
limited to 8 KiB, which may not be exceeded to avoid deadlock problems, hence
a flow control engine with congestion control must be used. MPI semantics
support concurrent messages which might have to be delivered out of order, for
example when nonblocking or buffered sends with different tags are used and
the corresponding receive calls are posted in a different order.
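The packet splitting and a simple credit style flow control can be sketched as follows. Only the 8 KiB FIFO limit is taken from the text; the packet size is a parameter, and the bookkeeping structure and function names are assumptions for this illustration, not the interface of the torus library.

```c
#include <stddef.h>

#define LINK_FIFO_BYTES (8 * 1024)  /* per link FIFO limit (8 KiB) */

/* Hypothetical per link state kept by the sender. */
typedef struct {
    size_t in_flight;  /* bytes sent but not yet consumed by the receiver */
} link_state_t;

/* Number of packets needed for a message of nbytes bytes. */
size_t packet_count(size_t nbytes, size_t packet_bytes)
{
    return (nbytes + packet_bytes - 1) / packet_bytes;
}

/* Account for one more packet. Returns 1 if the packet may be sent,
 * 0 if it would exceed the FIFO limit and has to wait. */
int try_send(link_state_t *l, size_t packet_bytes)
{
    if (l->in_flight + packet_bytes > LINK_FIFO_BYTES)
        return 0;
    l->in_flight += packet_bytes;
    return 1;
}

/* The receiver has consumed some bytes and returned the credit. */
void link_ack(link_state_t *l, size_t bytes)
{
    l->in_flight -= (bytes <= l->in_flight) ? bytes : l->in_flight;
}
```

With 4 KiB packets, for example, a third packet is held back until the receiver has acknowledged one of the first two.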
There are various approaches [60, 62] to implement MPI message passing
on zero copy RDMA or channel semantic networks, for example with eager




The MPI_COMM_WORLD communicator is mandatory. If sub-
communicators are to be supported, the book keeping information has to be
stored on the SPE and synchronized with the PPE, e.g. the map of neighbors
to the physical devices within the communicator or a flag if the communicator
² Employing MPI_Irecv, polling on both PPE and SPE, and calling MPI_Cancel
when the other core has received the data would be an option, but is too
expensive because PPE assisted callbacks would have to be used in every case.
is connected. More complicated communicators like graph communicators can
be left unsupported and handled by the PPE.
Collectives
Collectives can easily be built on top of the nearest neighbor primitives when
the MPI_COMM_WORLD communicator is used. If this is not the case, it depends
on the communicator topology whether collectives can be implemented
efficiently. For example, a communicator in a subspace of the torus can use
the same algorithms as in MPI_COMM_WORLD, but if the communicator is scattered
over the torus in unconnected parts and routing capability is not available,
one needs to resort to PPE assisted callbacks, and it might even be more
efficient to just call the corresponding collective call on the PPE.
An overview of some collective communication algorithms is given in section
3.4.
Datatypes
The datatype engine for QPACE can be simpler than in a general MPI
like OpenMPI which has to support heterogeneous environments. Derived
datatypes still pose a greater challenge and require a packing/unpacking
engine or special support to build scatter/gather lists for non-contiguous
datatypes. A first implementation might only support the basic datatypes
(MPI_FLOAT, MPI_DOUBLE, ...) and can then be extended to support strided
datatypes or vector datatypes.
Software routing
The torus does not support routing in hardware. Software Routing is needed
if torus communication beyond neighbors should be possible on the SPE, with-
out falling back to the PPE. The link protocol between the SPEs has to be
extended to support this, and interrupts on the SPEs should be used to allow
fast forwarding of packets and thus reasonable latencies.
4.1.5 Conclusion
Mappings of the different message passing protocols on each other have been
discussed. Each mapping comes with its own overheads. Figure 4.2 and Figure
4.1 show possible designs with the most important features described in this
section.
For the MPI on QMP mapping, QMP lacks several features and semantics
which have to be implemented (again) in the MPI layer. QMP provides only
limited datatype and collective support, and this functionality becomes
redundant if the same functions must be implemented again in MPI for full support.
For an MPI implementation which should only offer a limited subset, however,
this is a reasonable option. For a complete implementation, most of the MPI
library would have to be implemented on the SPE, which is not feasible. Using PPE
assisted callbacks is an option, but in this case it would also be possible to skip
the QMP layer altogether and use the torus low level library directly.
The QMP on MPI mapping is more favorable from a practical point of view,
as the QMP on MPI implementation is already available and only needs to be
adapted to the SPE architecture. QMP only uses a subset of MPI, therefore it is
sufficient for QMP if only this subset of MPI is implemented efficiently, e.g.
without PPE assisted callbacks. It has to be evaluated how the protocol overhead in
MPI affects the performance in QMP compared with a direct implementation.
If the performance turns out to be unsatisfactory, this approach can also be used
as a base for further optimizations where critical calls are implemented directly
on the torus library.
4.2 PPE Centric Approaches
Applications written for general purpose architectures are sometimes challenging
to port to the SPE architecture. A common approach, which has also been
applied to HPL, is to keep the original main code on the general purpose PPE
and only port performance (computation) critical routines to the SPE. Even though
the QPACE torus network is designed with SPE centric applications in mind,
the torus can be used from PPE centric applications as described in section 3.2.
4.2.1 PPE with direct MFC access to the NWP
The memcpy method described in section 3.2.3 can be used to send and receive
data to and from main memory. For the send, one chunk from the main memory
can be transferred to the SPE Local Store and then from the Local Store to the
IWC of the NWP. Receiving a message would require a supplied credit which
can be sent directly from the PPE or with the MFC. The packet received in the
Local Store then can be transferred with another MFC operation back to the
main memory.
The only requirement for this method is that some buffer space on the
SPE is available. The usual SPE code could be executed independently on the
SPE during the transfer, but performance regressions are to be expected as the MFC
is shared between PPE and SPE. The PPE only needs access to the problem
state area on the SPE, which can be requested in a Linux system with
sufficient privileges.
If all links are used for sending and receiving at the same time at full speed,
the MFC has to process at least 6 · 2 · 1 GiB/s = 12 GiB/s for the send direction
(2 memory transfers per packet) and at least 6 · 1 GiB/s = 6 GiB/s for the
receive direction. This totals 18 GiB/s of required bandwidth, which can
only be reached with packet sizes of at least 5 KiB according to the bandwidth
calculation given in section 3.2.4, because of the latency to program the MFC
from the PPE. Distributing the transfers onto multiple SPEs would not change
this, as the total number of requests stays the same regardless of how many SPEs
are used; however, this is a valid option if it turns out that one MFC is not
sufficient to sustain the memory bandwidth.
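The arithmetic behind this estimate can be written down as a small sketch. The constants are taken from the paragraph above (6 links, 1 GiB/s per link and direction, 2 MFC transfers per sent packet, 1 per received packet); the function names are hypothetical.

```c
#include <assert.h>

enum { TORUS_LINKS = 6 };               /* links per node */
static const double LINK_GIB_S = 1.0;   /* GiB/s per link and direction */

/* MFC bandwidth in GiB/s needed for one direction if every packet requires
 * `transfers_per_packet` MFC transfers and all links run at full speed. */
static double mfc_bandwidth(int transfers_per_packet)
{
    assert(transfers_per_packet > 0);
    return TORUS_LINKS * transfers_per_packet * LINK_GIB_S;
}

/* Total MFC bandwidth for concurrent sending and receiving:
 * main memory -> Local Store -> NWP on the send side (2 transfers),
 * Local Store -> main memory on the receive side (1 transfer). */
static double mfc_bandwidth_total(void)
{
    return mfc_bandwidth(2) + mfc_bandwidth(1);
}
```

mfc_bandwidth(2) gives the 12 GiB/s for the send direction, mfc_bandwidth(1) the 6 GiB/s for the receive direction, and mfc_bandwidth_total() the sum of 18 GiB/s.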
4.2.2 Function Offloading
If the SPEs in an application are not busy with tasks while communicating, they
can also be used for communication functions. This is mostly the case for HPL
with its synchronous calls to the SPEs (see section 2.2.2). Collective operations
can be outsourced as functions on the SPEs. This requires that the SPE code is
prepared for these functions: it should not only contain the application-specific
code, but also the communication routines. These could be implemented as an
SPE library, and the application specific dispatcher function needs to recognize
the commands for these routines. If the application does not use the SPEs at
all or keeps one SPE for communication only, the message passing library has to
ensure that the communication code is loaded.
Besides the function offloading, the SPEs could also asynchronously send
and receive messages for the PPE: the PPE only gives the command for a send
or receive with the respective data position and length in the main memory,
and the SPE executes the transfer asynchronously. This is a possible solution
especially for nonblocking communication. The SPE can then poll for progress
in the NWP without disturbing the PPE.
4.2.3 Integration into MPI or QMP
The previous approaches can be integrated into existing OpenMPI frameworks
or provided as a QMP implementation to support legacy applications. This method is
convenient as the porting effort of the applications can be minimized. Only
the process mapping should be set according to the application's communication
patterns to allow nearest neighbor communication wherever possible. As an
example, an integration into the OpenMPI framework is described in section
4.4.
4.3 Programming Model Considerations for HPL
Different programming models are possible on a heterogeneous processor like the
PowerXCell 8i processor. The IBM version of HPL uses the PPE centric model,
handing over the most critical BLAS tasks like matrix multiplication to the
SPEs, but keeping the communication library and other BLAS routines on the
PPE. Task parameters are pushed via processor internal mailbox transfers. The
main considerations in this model are the task granularity versus the overhead
to hand over the task, and the library size on the SPE with its limited Local
Store size.







Table 4.3: Code size of the HPL library (without MPI and BLAS, and test
code), compiled with spu-gcc -Os
Another alternative programming model for HPL would be the SPE centric
model, where the complete application code is located on the SPE. The PPE
is only called for operating system tasks via PPE assisted callbacks. As we
can see in Table 4.3, a code size optimized HPL, in which MPI and BLAS have
yet to be included, already consumes 75% of the 256 KiB SPE Local Store,
i.e. about 192 KiB. The Local Store also needs to hold the stack and
run-time data, which makes code overlays and other size saving techniques
(with their resulting overhead) mandatory. The currently optimized specialists
also consume almost the complete Local Store. SPE overlays are therefore
unavoidable, and they will decrease the performance.
For an SPE centric model, there is the question of how many SPEs should
form one process on each node. When each SPE has its own LINPACK process,
each process will only have 512 MiB of RAM available for data. The SPE
computation specialists, however, only reach good performance on very large
data sets; a small process to SPE ratio is therefore advisable. Using only
one SPE as master SPE which needs to synchronize with the other SPEs
brings us to a model similar to the PPE centric accelerator model, with similar
overheads.
An advantage of the SPE centric model in the case of QPACE is that the
network devices are designed for SPE access and should provide lower latencies.
However, the cost of SPE overlays will probably cancel out this benefit, and this
possibly small advantage alone does not justify the effort needed to
port the code to the SPE.
As the advantages of a PPE centric model strongly prevail, the proposal is
to use this model for the QPACE HPL implementation. A pragmatic approach
is to use and enhance the publicly available QS22 version of HPL.
4.3.1 SPE Accelerated Communication Tasks
The communication tasks are various collectives and have been identified above.
As the IBM version of HPL does for the BLAS routines, we can outsource
MPI collectives like MPI_Allreduce(), MPI_Bcast(), MPI_Scatterv()
or HPL-specific routines to the SPE. Instead of calling the MPI library, we
would then call the SPE library.
If the granularity of the collectives turns out to be too fine, the
calling functions can also be moved to the SPE until the granularity is coarse
enough. If asynchronous calls are possible, the SPE may also run concurrently
with the PPE.
There is only one MPI process per PowerXCell 8i processor, so one SPE
is sufficient to handle communication, provided that the MFC can handle the
memory requests fast enough. Implementing the functions on the SPE instead
of the PPE has the advantage that memory accesses do not cause stalls but
can be overlapped, which enables better performance (see section 3.2 for more on
this topic).
4.3.2 MPI Network Module
Instead of outsourcing the communication tasks on the application level, these
communication tasks could also be implemented and outsourced in the message
passing library, as described in section 4.4 for OpenMPI. However, the communication
operations of an MPI library are likely to be more general than the specific
application requires, and therefore overhead might be associated with the support
of this generality.
Unlike the SPE accelerated communication tasks, the granularity is fixed to
the MPI calls. Inside of the MPI module, both PPE network access and SPE
accelerated communication is possible, provided that the SPE code supports
it. A very strong argument for this option is that the original application can
be transparently used without modification, and other applications can benefit
from this approach as well. Also the MFC assisted access can be used in this
module.
4.3.3 Conclusion
As the High Performance LINPACK sends large messages in most of its time
critical communication tasks, the granularity being fixed to MPI calls is not a
performance problem for this application. For portability reasons the proposal is
therefore to implement an MPI module which can then be used by HPL without
major modification of the application. A possible solution for this proposal is
described in the following section.
4.4 Integration into OpenMPI
An OpenMPI BTL [2, 63] or Channel Device for MPICH [64] or MPICH2 [65, 66]
allows unmodified applications to run on the PPE using the torus network while
minimizing the effort to support the new network. Function offloading or direct
MFC access can be employed to implement the module. If the direct MFC
access method is used, bandwidth-critical applications may however be limited
by the PPE latencies for some communication patterns as described in section 3.2.8.
The application should be started with a process mapping that allows nearest
neighbor communication in most of its communication patterns, otherwise the
MPI implementation will resort to the slow Gigabit Ethernet.
As an example we will consider the requirements for the OpenMPI modules
in more detail:
4.4.1 OpenMPI Modular Component Architecture
Figure 4.4: OpenMPI Layer Model
OpenMPI is organized into layers as illustrated in Figure 4.4. Most of these
layers are implemented in so-called components of the OpenMPI Modular
Component Architecture (MCA). Each of these components is implemented
according to a well defined interface. This modular architecture allows components
to be developed independently, or to be substituted with other implementations
without changing the other components. This is especially useful for the
implementation of new network devices: only a byte transfer layer (BTL)
component, and optionally a collective component (COLL), have to be implemented;
these are marked green in Figure 4.4. An efficient implementation of only
these components is enough to leverage the hardware potential of the new
network interconnect. The run time system can decide which components are to
be enabled, or the user can forbid certain components.
Any component must provide version information and implement the follow-
ing functions:
/**
 * BTL module interface functions and attributes.
 */
struct mca_btl_base_module_t {
    /* BTL common attributes */
    mca_btl_base_component_t* btl_component; /**< pointer back to
        ** the BTL component structure */
    size_t btl_eager_limit; /**< maximum size of first fragment
        ** -- eager send */
    size_t btl_min_send_size; /**< threshold below which the
        ** BTL should not fragment */
    size_t btl_max_send_size; /**< maximum send fragment size
        ** supported by the BTL */
    size_t btl_min_rdma_size; /**< threshold below which the
        ** BTL should not fragment */
    size_t btl_max_rdma_size; /**< maximum rdma fragment size
        ** supported by the BTL */
    uint32_t btl_exclusivity; /**< indicates this BTL should
        ** be used exclusively */
    uint32_t btl_latency; /**< relative ranking of latency
        ** used to prioritize btls */
    uint32_t btl_bandwidth; /**< bandwidth (Mbytes/sec)
        ** supported by each endpoint */
    uint32_t btl_flags; /**< flags (put/get ...) */












    mca_btl_base_module_dump_fn_t btl_dump; /* diagnostics */
    /* the mpool associated with this btl (optional) */
    mca_mpool_base_module_t* btl_mpool;
    /* register a default error handler */
    mca_btl_base_module_register_error_fn_t btl_register_error;
};
typedef struct mca_btl_base_module_t mca_btl_base_module_t;
Figure 4.5: OpenMPI BTL Component Interface (OpenMPI version 1.2.8)
• int mca_open_component (void); loads the component and initializes
component internal data structures.
• int mca_close_component (void); is called after all modules have been
finalized to clean up internal data structures.
4.4.2 OpenMPI Byte Transfer Layer (BTL)
A BTL module offers a point to point byte data transfer service from one process
to another. It therefore implements low level point to point primitives directly
on the network interface. Furthermore the BTL offers information to the upper
protocol layers, and the BTL Management Layer (BML) can decide, depending
on this information and these attributes, which BTL component to use.
Figure 4.5 shows the interface of the BTL component. The scalar values
inform the upper layers about thresholds and limits of the BTL as described
in the comments. The exclusivity, latency, bandwidth and flags parameters are
used to prioritize and select the appropriate BTL for a certain message.
The BTL functions which must be implemented are:
• int btl_add_procs(struct mca_btl_base_module_t* btl,
size_t nprocs, struct ompi_proc_t** procs, struct
mca_btl_base_endpoint_t** endpoints, struct ompi_bitmap_t*
reachable); is called by the Point to Point Management Layer (PML)
to inform the BTL about new processes which have been added to the process
list. The BTL can then inform the upper layers via the reachable
bitmap which processes can be reached. A torus BTL would only mark
those processes reachable which are nearest neighbors of this node. The
endpoints parameter can be used to store BTL specific information
about the neighbor for faster access, e.g. the link number.
• int btl_del_procs (struct mca_btl_base_module_t* btl,
size_t nprocs, struct ompi_proc_t** procs, struct
mca_btl_base_endpoint_t** ); is called to inform the BTL that
processes were deleted from the process list.
• int btl_register (struct mca_btl_base_module_t* btl,
mca_btl_base_tag_t tag, mca_btl_base_module_recv_cb_fn_t
cbfunc, void* cbdata ); registers a callback which is called when a
packet fragment is received.
• int btl_finalize(struct mca_btl_base_module_t* btl ); is called
before unloading to let the BTL clean up and release its allocated re-
sources.
• mca_btl_base_descriptor_t* btl_alloc (struct
mca_btl_base_module_t* btl, size_t size ); allocates a segment of
the specified size which can be used for sending or receiving.
• int btl_free (struct mca_btl_base_module_t* btl,
mca_btl_base_descriptor_t* descriptor ); returns an allocated
segment to the BTL.
• mca_btl_base_descriptor_t* btl_prepare_src (struct
mca_btl_base_module_t* btl, struct mca_btl_base_endpoint_t* endpoint,
mca_mpool_base_registration_t* registration, struct
ompi_convertor_t* convertor, size_t reserve, size_t* size
); prepares a descriptor for a send and returns it. The function must pack
the data if the data supplied in the convertor is not contiguous. reserve
specifies how many bytes should precede the packed data, and size returns the
actual size of the data.
• The btl_prepare_dst call has the same prototype as btl_prepare_src
and is only required if MPI-2 RDMA functionality should be supported.
• int btl_send (struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* endpoint, struct
mca_btl_base_descriptor_t* descriptor, mca_btl_base_tag_t
tag); initiates an asynchronous send of the prepared data to the peer.
If the data cannot be sent at the time of this call, the data may be
queued and sent later in a progress function.
• int btl_put (struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* endpoint, struct
mca_btl_base_descriptor_t* descriptor); initiates an asynchronous
RDMA put. This function is only required if MPI-2 RDMA functionality
is implemented, and we may skip this for the QPACE torus.
• int btl_get (struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* endpoint, struct
mca_btl_base_descriptor_t* descriptor); initiates an asynchronous
RDMA get. This function is only required if MPI-2 RDMA functionality
is implemented, and we may skip this for the QPACE torus.
• int btl_dump (struct mca_btl_base_module_t* btl, struct
mca_btl_base_endpoint_t* endpoint, int verbose); dumps
diagnostic information about the BTL state for this endpoint.
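The reachability decision of a torus BTL in btl_add_procs can be sketched as follows. The sketch deliberately does not use the OpenMPI types: coord_t, ring_dist, is_neighbor and mark_reachable are hypothetical names, and a plain bool array stands in for the reachable bitmap. It assumes that the torus coordinates of every process are known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int x, y, z; } coord_t;

/* Hop distance between positions a and b on a ring with `size` nodes. */
static int ring_dist(int a, int b, int size)
{
    assert(size > 0);
    int d = ((a - b) % size + size) % size;
    return d < size - d ? d : size - d;
}

/* True if b is exactly one hop away from a on the 3D torus. */
static bool is_neighbor(coord_t a, coord_t b, coord_t dims)
{
    return ring_dist(a.x, b.x, dims.x)
         + ring_dist(a.y, b.y, dims.y)
         + ring_dist(a.z, b.z, dims.z) == 1;
}

/* Mark every process in `procs` that the torus can reach directly.
 * Returns the number of reachable processes. */
static size_t mark_reachable(coord_t self, const coord_t *procs,
                             size_t nprocs, coord_t dims, bool *reachable)
{
    size_t n = 0;
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = is_neighbor(self, procs[i], dims);
        if (reachable[i])
            n++;
    }
    return n;
}
```

Processes which are not marked as reachable would be left to another BTL, for example the one using Gigabit Ethernet.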
The component itself must furthermore implement the following functions:
• struct mca_btl_base_module_t** mca_btl_component_init
(int *num_btls, bool enable_progress_threads, bool
enable_mpi_threads ); initializes one or more modules (instances)
of this BTL for each device (we would only create one for the torus
device), the number is returned in num_btls. The hardware discovery
and setup should be done in this function. enable_progress_threads
and enable_mpi_threads inform the BTL whether progress threads or
MPI threads are supported.
• int mca_btl_base_component_progress (void); defines the progress
function. This function should poll for incoming packets and call the
appropriate receive handler or send packets which were postponed before.
For the QPACE torus we should check the notify area, provide credits,
and send buffered fragments which have been queued by btl_send.
As we have seen the send and receive functions are called in a nonblocking
way. This requires that any send operation must return immediately even if
the fragment could not be sent (yet) because of back pressure in the network.
Furthermore a tag is used to identify the message on the peer side. A minimal
envelope protocol must therefore be employed to identify the messages.
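The nonblocking behaviour and the progress function can be sketched with a simple FIFO of postponed fragments. All names (envelope_t, fragment_t, send_queue_t, nb_send, progress) are hypothetical, and the link is simulated by a credit counter so that back pressure can be reproduced without the torus hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal envelope sent in front of every fragment. */
typedef struct {
    uint8_t  tag;     /* selects the receive callback on the peer */
    uint32_t length;  /* payload length in bytes */
} envelope_t;

typedef struct fragment {
    envelope_t env;
    const void *payload;
    struct fragment *next;
} fragment_t;

typedef struct {
    fragment_t *head, *tail;                        /* postponed fragments */
    bool (*link_try_send)(const fragment_t *frag);  /* false: back pressure */
} send_queue_t;

/* Simulated link: accepts one fragment per available credit. */
static int link_credits;
static bool sim_link_try_send(const fragment_t *frag)
{
    (void)frag;
    if (link_credits == 0)
        return false;
    link_credits--;
    return true;
}

/* Nonblocking send: returns 1 if the fragment went out immediately,
 * 0 if it was queued. It never waits for the network. */
static int nb_send(send_queue_t *q, fragment_t *frag)
{
    assert(q != NULL && frag != NULL);
    frag->next = NULL;
    /* Keep message order: never overtake fragments that are still queued. */
    if (q->head == NULL && q->link_try_send(frag))
        return 1;
    if (q->tail != NULL)
        q->tail->next = frag;
    else
        q->head = frag;
    q->tail = frag;
    return 0;
}

/* Progress function: send queued fragments until the link pushes back.
 * Returns the number of fragments sent. */
static int progress(send_queue_t *q)
{
    int sent = 0;
    while (q->head != NULL && q->link_try_send(q->head)) {
        q->head = q->head->next;
        if (q->head == NULL)
            q->tail = NULL;
        sent++;
    }
    return sent;
}
```

A send queue would be set up as send_queue_t q = { NULL, NULL, sim_link_try_send }. In a real BTL the progress function would additionally check the notify area and provide credits, as described above.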
The decision whether the QPACE BTL should be implemented with the
direct MFC access or the function offloading model depends on the application:
when the SPEs are not utilized by the application, synchronous SPE calls are
used, or one (or more) SPE(s) can be sacrificed for communication, the function
offloading model can be employed. The PPE could then pass new transfer
requests to the SPE, which checks for progress asynchronously to the PPE. An
application which uses the SPEs while communicating may only use the direct
MFC access method from the PPE.
4.4.3 Collectives Component (COLL)
A collective component can offer special network optimized collective operation
support for a specific network. As in the BTL, the upper layers may have
multiple candidates for a collective operation in a given communicator and can
choose the best candidate available by priority.
/*
* Structure for coll v1.0.0 components










typedef struct mca_coll_base_component_1_0_0_t mca_coll_base_component_1_0_0_t;
/*
 * This struct is hung on the communicator by the winning coll component
 * after the negotiation. It has pointers for all the collective
 * functions, as well as a "finalize" function for when the
 * communicator is freed.
 */
struct mca_coll_base_module_1_0_0_t {
    /* Per-communicator initialization and finalization functions */
    mca_coll_base_module_init_1_0_0_fn_t coll_module_init;
    mca_coll_base_module_finalize_fn_t coll_module_finalize;


















typedef struct mca_coll_base_module_1_0_0_t mca_coll_base_module_1_0_0_t;
Figure 4.6: OpenMPI COLL Component Interface (as of OpenMPI 1.2.8)
The interface definition of a collective component is given in Figure 4.6. collm_version
and collm_data provide metadata for the component. A collective component
implements:
• int coll_init_query (bool enable_progress_threads, bool
enable_mpi_threads); allows the collective component to back off
if it does not support the required threads.
• const mca_coll_base_module_1_0_0_t *coll_comm_query
(struct ompi_communicator_t *comm, int *priority, struct
mca_coll_base_comm_t **data); is called when a new communicator
is created. The component can then analyze the new communicator
comm, for example check whether the communicator topology is a torus, a mesh,
or connected at all. It then sets its priority for this communicator and,
if it can support this topology, provides a set of function pointers to the
appropriate collective functions together with the data pointer.
• int coll_comm_unquery (struct ompi_communicator_t *comm,
struct mca_coll_base_comm_t *data); allows the component to clean up data which
has been allocated in the coll_comm_query function. This function is
optional.
When the collective component has the highest priority for this communicator,
its module structure is attached to the communicator. This structure contains the functions:
• struct mca_coll_base_module_1_0_0_t * coll_module_init
(struct ompi_communicator_t *comm); is called for the winning
module and allows it to attach additional data to the communicator comm.
• int coll_module_finalize (struct ompi_communicator_t *comm);
is called when the communicator is destroyed to let the module clean up
its data from the communicator comm.
• the respective collective functions
One or more collective operations can be implemented and optimized for the
different topologies. For example, a broadcast can be implemented for a 3D torus
and a 2D mesh, and the correct callback function is then passed by the query
function. It is also possible not to implement a collective operation; in that case
one of the standard algorithms will be used (which might not use the torus network).
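The selection logic of such a query function can be sketched as follows. The sketch does not use the OpenMPI types: topo_t, coll_table_t and comm_query are hypothetical, the priority values are arbitrary, and the broadcast functions are stubs which only return an identifier. It assumes that the topology of the communicator has already been classified.

```c
#include <assert.h>
#include <stddef.h>

/* Result of analyzing the communicator topology (assumed to be known). */
typedef enum { TOPO_TORUS_3D, TOPO_MESH_2D, TOPO_DISCONNECTED } topo_t;

typedef int (*bcast_fn_t)(void *buf, size_t len, int root);

/* Table of collective functions handed back to the upper layer. */
typedef struct {
    bcast_fn_t bcast;
} coll_table_t;

/* Stubs standing in for topology-specific broadcast algorithms. */
static int bcast_torus3d(void *buf, size_t len, int root)
{
    (void)buf; (void)len; (void)root;
    return 3;
}

static int bcast_mesh2d(void *buf, size_t len, int root)
{
    (void)buf; (void)len; (void)root;
    return 2;
}

static const coll_table_t torus_table = { bcast_torus3d };
static const coll_table_t mesh_table  = { bcast_mesh2d };

/* Query function: returns the function table matching the topology and
 * sets the priority, or returns NULL to back off so that one of the
 * standard algorithms is used instead. */
static const coll_table_t *comm_query(topo_t topo, int *priority)
{
    assert(priority != NULL);
    switch (topo) {
    case TOPO_TORUS_3D:
        *priority = 90;
        return &torus_table;
    case TOPO_MESH_2D:
        *priority = 80;
        return &mesh_table;
    default:
        *priority = -1;
        return NULL;
    }
}
```

Backing off with NULL for a disconnected communicator corresponds to the case in which no torus-optimized algorithm is available.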
The considerations whether direct MFC access or function offloading should
be used are similar to the BTL case, as the COLL component also has direct




In this chapter, the applicability of the strategies developed for QPACE is
discussed in section 5.1. The thesis is then concluded in section 5.2, and the
next steps to provide a final HPL implementation on the QPACE architecture
are discussed in section 5.3.
5.1 Application to the NICOLL Project
The NICOLL architecture shares similarities with the QPACE architecture as
network access from the Cell/B.E. SPE is intended in both systems. In section
5.1.1 we discuss the steps taken towards a NICOLL prototype. Section 5.1.2
presents assumptions about what can be expected from this prototype, and based on
these assumptions the applicability of QPACE programming models and message
passing strategies is finally discussed in section 5.1.3.
5.1.1 Related Work
The objective of the NICOLL project is to build a hybrid system which couples an
AMD Opteron microprocessor and an IBM Cell/B.E. microprocessor by
employing an FPGA which bridges between the processor specific protocols
HyperTransport and FlexIO. This proposed architecture is illustrated in Figure
5.1. Unfortunately such a prototype is not yet available. As an intermediate
step, a PCI Express® coupled system using standard hardware was built. An
x86(-64) host system and an IBM BladeCenter QS21 were interconnected with
PCIe® host interface boards [67], establishing a PCIe x8 link between the
systems. This setup made it possible to explore the bring-up, system level integration and
shared memory approaches of such a hybrid system [68]. For software and
operating system integration, a Remote SPU File System (RSPUFS) was proposed
which allows SPE contexts to be created and managed from systems other than the
Cell/B.E. architecture [69]. This approach has evolved into ACCFS [70, 71, 72], a
modular Linux file system driver which allows controlling various types of
accelerators like Cell/B.E. SPEs, FPGAs or GPGPUs (General Purpose computing
on Graphics Processing Units), providing one consistent interface to manage
different types of accelerators. This interface is the preferred interface for the
NICOLL prototype because of its portability. It allows acceleration not only
with Cell/B.E. SPEs, but also, for example, with reconfigurable modules in the FPGA
[73].
Figure 5.1: The NICOLL Architecture
5.1.2 Architecture Assumptions for NICOLL
As the final NICOLL architecture is not yet specified and there are still different
alternatives for the architecture which have yet to be evaluated, we need to make
assumptions to formulate ideas or apply the QPACE strategies.
The NICOLL machine is assumed to be a shared memory architecture, at
least from a process point of view. The SPEs can access at least a specified
window in the main memory which can also be accessed from the Opteron. The
shared memory is assumed to be cache coherent, or mechanisms exist to ensure
the coherence of the memory system, e.g. by calling explicit synchronization
operations.
The AMD Opteron takes the typical place of the PPE: it runs the operating
system and legacy application code. The reason for this decision is that the
AMD Opteron is superior to the PPE in terms of processing power, memory bandwidth
and I/O bandwidth, and more legacy codes are available for the
x86 architecture. To support an efficient acceleration of collective operations on
the SPE, it is required that the communication hardware can physically access
the SPE Local Store. The SPE Local Stores must therefore be mapped into the
physical address space of the system and be accessible from the FPGA.
InfiniBand was planned as the cluster interconnect. No proof of concept
study of an InfiniBand stack implemented on
a Cell/B.E. SPE core has been published yet. A possible reason for this is that
typical Cell/B.E.-based systems do not allow physical access to the SPE Local
Store from the I/O hardware. However, there is a case study of a 10 Gbps
Ethernet interface on an SPE core [74] using a custom Cell/B.E.-based system,
which reached very good performance. One problem, however, was the
limited size of the Local Store, and several techniques were
applied to avoid exhausting this space. From this study it can be concluded
that a stripped down InfiniBand stack might also be feasible on the SPE.
Another alternative for NICOLL could be to use a custom network
interconnect like the QPACE torus network. The advantage of this approach is that
the interface is very simple and a software stack could be implemented with
a very small amount of Local Store space. Because many interesting alternatives
are possible, in the following we only assume that a network interconnect exists which
allows an efficient implementation on an SPE core.
5.1.3 QPACE Programming Models and Message Passing Strategies for NICOLL
When the Opteron takes the place of the PPE of the Cell/B.E. with the same
functionality, the proposed programming models for the Cell/B.E. can be
employed. The PPE centric approaches (function offloading, direct MFC transfers
or OpenMPI components) could be implemented in the same way as Opteron
centric versions. The memory position where messages are stored will
influence the performance: if the Opteron memory is used, the link between the
FPGA and the Cell/B.E. is occupied multiple times for the data transfer to the Local
Store and the network transfer. Therefore messages stored in the Cell/B.E. memory
should allow better performance, as the link is then exclusively available for the
network transfer.
The latencies to the MFC of the SPEs must be revised for Opteron-driven
communication, for example as in section 3.2.4. These latencies are expected to
be higher than for PPE-driven communication, as the MFC is not chip local for the
Opteron. This might require larger buffer sizes to sustain the peak network
bandwidth. The function offloading mechanism should only suffer from higher
call overheads, which are not critical for large messages. This mechanism can
be implemented using the ACCFS framework. Finally, these mechanisms can
be included in an MPI framework module which would allow legacy code to
directly benefit from the accelerated network functionality.
SPE centric message passing libraries are also possible. However, special care
has to be taken regarding the different datatypes. Assisted callback mechanisms
as used in SPE-MPI [61] build on the fact that the datatypes in 32-bit PPE
code and SPE code are very similar. The native Opteron datatypes, however,
differ from the SPE datatypes in both size and byte ordering.
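The byte ordering issue can be illustrated with a short sketch. The Cell/B.E. is a big-endian architecture while the Opteron is little-endian, so values exchanged between both sides need a defined wire format. The helper names below are hypothetical; the sketch assumes a big-endian wire format, fixed-width integer types to sidestep the size differences, and IEEE 754 doubles on both sides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Store a 64-bit value in big-endian order, independent of the host. */
static void store_be64(unsigned char out[8], uint64_t v)
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Load a 64-bit big-endian value, independent of the host byte order. */
static uint64_t load_be64(const unsigned char in[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | in[i];
    return v;
}

/* Doubles are transferred via their 64-bit pattern (IEEE 754 assumed). */
static void store_be_double(unsigned char out[8], double d)
{
    uint64_t bits;
    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&bits, &d, sizeof bits);
    store_be64(out, bits);
}

static double load_be_double(const unsigned char in[8])
{
    uint64_t bits = load_be64(in);
    double d;
    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Because the load and store helpers work byte by byte, the same source code gives identical wire data on the Opteron and on the Cell/B.E. A callback mechanism between both sides would have to apply such a conversion to every argument and result.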
5.2 Conclusion
This thesis provides the theoretical background for a successful High Perfor-
mance LINPACK implementation on QPACE. An implementation could not be
provided and evaluated as the QPACE hardware is still under active develop-
ment.
We have seen that the communication pattern of HPL can be mapped on a
3D torus using nearest neighbor communication, and how to map the processes
on the torus. The network requirements indicate that mostly large messages
are sent, and the communication performance mostly depends on the available
bandwidth. The implementation should be based on the QS22 version of HPL,
which already provides well optimized linear algebra subroutines for the most
demanding kernels.
The QPACE architecture has been analyzed for these requirements. The
original QCD optimized communication model has been evaluated, and an al-
ternative model for general communication has been proposed. We have seen
that also the PPE can be used for torus communication. The limits of the cur-
rent hardware and enhancement proposals have been presented. Based on an
abstract model, example collective algorithms for different topologies and their
performance estimations have been presented which can be used in QPACE
optimized message passing libraries.
Different programming models with a focus on SPE and PPE centric
programming have been shown. QMP and MPI were compared from a functional
point of view and different alternatives for their implementation have been
evaluated. The PPE centric mechanisms, function offloading and direct MFC access,
were reviewed for use in the message passing library, and their integration into
an existing MPI framework has been evaluated using OpenMPI as an example. The
programming model alternatives have been evaluated for the HPL application,
concluding that a PPE centric approach with either SPE accelerated
communication tasks or a torus optimized MPI implementation should be employed.
Finally we have seen that it is possible to apply these programming model
strategies to the NICOLL project, with certain limitations posed by the
architectural differences and higher latencies.
5.3 Further Work
The natural next steps are to implement the low level interface and the
proposed MPI components on the QPACE machine. The scheduling of the message
transfers to different links needs to be analyzed along with an evaluation of the
practical sustainable NWP capabilities for long messages. The proposed NWP
enhancements should be evaluated for their feasibility in the current hardware
design, and eventually implemented on the FPGA to allow higher bandwidth.
As soon as a large QPACE partition is available, the High Performance
LINPACK benchmark can be evaluated. The impact of the PPE directed MFC
transfers on the SPE accelerated functions should be analyzed. If possible, a
76
CHAPTER 5. CONCLUSION AND OUTLOOK
recabling of the machine would allow new grid sizes for the benchmark. The more
symmetric grids should then sustain a higher performance on 4 racks. A further
enhancement of HPL worth evaluating is to convert more of the SPE accelerated
synchronous functions to asynchronous functions, which would allow more
overlap between PPE and SPE tasks.
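To illustrate what "more symmetric grids" means for HPL, the following minimal
sketch enumerates the possible P x Q process grids for a given number of
processes and orders them from most to least square. The function name
process_grids and the process count of 1024 are assumptions made for this
illustration only. The sketch also ignores the constraints that the torus
cabling places on the usable grid sizes, which are the reason a recabling is
suggested above.

```python
import math


def process_grids(n):
    """Return all (P, Q) with P * Q == n and P <= Q, most square first."""
    grids = [(p, n // p) for p in range(1, math.isqrt(n) + 1) if n % p == 0]
    # A smaller aspect ratio Q / P means a more symmetric grid.
    return sorted(grids, key=lambda g: g[1] / g[0])


if __name__ == "__main__":
    # 1024 processes is an illustrative assumption, not a measured setup.
    for p, q in process_grids(1024)[:3]:
        print(f"P={p:3d}  Q={q:3d}  aspect ratio Q/P={q / p:.1f}")
```

Which of the listed grids can actually be mapped onto the machine depends on
the torus dimensions of the available partition.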
Bibliography
[1] R. Graham, G. Shipman, B. Barrett, R. Castain, G. Bosilca, and A. Lums-
daine, Open MPI: A high-performance, heterogeneous MPI, in Proceed-
ings of the Fifth International Workshop on Algorithms, Models and Tools
for Parallel Computing on Heterogeneous Networks, September, 2006.
[2] R. L. Graham, T. S. Woodall, and J. M. Squyres, Open MPI: A flexible
high performance MPI, in Proceedings, 6th Annual International Con-
ference on Parallel Processing and Applied Mathematics, Poznan, Poland,
September 2005.
[3] J. Dongarra, P. Luszczek, and A. Petitet, The LINPACK Benchmark:
past, present and future, Concurrency and Computation Practice and Ex-
perience, vol. 15, no. 9, pp. 803820, 2003.
[4] J. Dongarra, LINPACK Users' Guide. Society for Industrial Mathematics,
1979.
[5] P. Luszczek, J. Dongarra, D. Koester, R. Rabenseifner, B. Lucas, J. Kepner,
J. McCalpin, D. Bailey, and D. Takahashi, Introduction to the HPC Chal-
lenge Benchmark Suite, Lawrence Berkeley National Laboratory, 2005.
[6] H. W. Meuer, E. Strohmaier, J. J. Dongarra, and H. Simon, Top500
supercomputer sites; electronic version, 2008, http://www.netlib.org/
benchmark/top500.html. [Online]. Available: http://www.netlib.org/
benchmark/top500.html
[7] W. Feng and K. Cameron, The Green500 List: Encouraging Sustainable
Supercomputing, Computer, vol. 40, no. 12, pp. 5055, 2007.
[8] H. Baier, H. Boettiger, M. Drochner, N. Eicker, U. Fischer, Z. Fodor,
G. Goldrian, S. Heybrock, D. Hierl, T. Huth et al., Status of the QPACE
Project, Arxiv preprint arXiv:0810.1559, 2008.
[9] G. Goldrian, T. Huth, B. Krill, J. Lauritsen, H. Schick, I. Ouda, S. Hey-
brock, D. Hierl, T. Maurer, N. Meyer et al., QPACE: Quantum Chromo-
dynamics Parallel Computing on the Cell Broadband Engine, Computing
in Science and Engineering, vol. 10, no. 6, pp. 4654, 2008.
78
BIBLIOGRAPHY
[10] A. Nobile, Performance analysis and optimization of LQCD kernels on the
Cell BE processor, Ph.D. dissertation, Univ. Milano-Bicocca, 2008.
[11] F. Belletti, G. Bilardi, M. Drochner, N. Eicker, Z. Fodor, D. Hierl,
H. Kaldass, T. Lippert, T. Maurer, N. Meyer, A. Nobile, D. Pleiter,
A. Schaefer, F. Schifano, H. Simma, S. Solbrig, T. Streuer, R. Tripiccione,
and T. Wettig, QCD on the Cell Broadband Engine, Proceedings of Sci-
ence, vol. LAT2007, 2007.
[12] K. Ibrahim and F. Bodin, Implementing Wilson-Dirac operator on the
cell broadband engine, in Proceedings of the 22nd annual international
conference on Supercomputing. ACM New York, NY, USA, 2008, pp.
414.
[13] S. Motoki and A. Nakamura, Development of QCD Code on a Cell Ma-
chine, Proceeding of Science, vol. LAT2007, 2007.
[14] J. Spray, J. Hill, and A. Trew, Performance of a Lattice Quantum Chro-
modynamics kernel on the Cell processor, Computer Physics Communica-
tions, 2008.
[15] K. Wilson, Confinement of quarks, Physical Review D, vol. 10, no. 8, pp.
24452459, 1974.
[16] J. C.R. and B. D.A., Introduction to the cell broadband engine architec-
ture, IBM Journal of Research and Development, vol. 51, pp. 503519,
2007.
[17] N. Christ, Computers for Lattice QCD, Nuclear Physics B Proceedings
Supplements, vol. 83, 2000.
[18] T. Hoshino, Less Known Historical Computers in Japan. PACS at The
University of Tsukuba. Joho Shori, vol. 43, no. 2, pp. 116117, 2002.
[19] T. Shirakawa, T. Hoshino, Y. Oyanagi, Y. Iwasaki, and T. Yoshie, Qcdpax-
an mimd array of vector processors for the numerical simulation of quan-
tum chromodynamics, in Supercomputing '89: Proceedings of the 1989
ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM,
1989, pp. 495504.
[20] S. Aoki, K.-I. Ishikawa, T. Ishikawa, N. Ishizuka, K. Kanaya, Y. Kuramashi,
M. Okawa, K. Sasaki, Y. Taniguchi, N. Tsutsui, A. Ukawa, and T. Yoshie,
The PACS-CS Project, Proceedings of Science, vol. LAT2005, p. 111,
2006.
[21] T. Boku, H. Nakamura, and K. Nakazawa, CP-PACS: a massively parallel
processor for large scale scientific calculations, in Proceedings of the 11th




[22] R. Mawhinney, The 1 Teraflops QCDSP computer, Parallel Computing,
vol. 25, no. 10-11, pp. 12811296, 1999.
[23] P. Boyle, D. Chen, N. Christ, M. Clark, S. Cohen, C. Cristian, Z. Dong,
A. Gara, B. Joo, C. Jung et al., Overview of the QCDSP and QCDOC
computers, IBM Journal of Research and Development, vol. 49, no. 2-3,
pp. 351365, 2005.
[24] D. Chen, N. Christ, C. Cristian, Z. Dong, A. Gara, K. Garg, B. Joo, C. Kim,
L. Levkova, X. Liao et al., QCDOC: A 10-teraflops scale computer for
lattice QCD, in 18th International Symposium on Lattice Field Theory
Lattice 2000, Bangalore (IN), 08/17/200008/22/2000, 2000.
[25] P. Boyle, C. Jung, and T. Wettig, The QCDOC supercomputer: hardware,
software, and performance, Arxiv preprint hep-lat/0306023, 2003.
[26] P. Bacilieri, S. Cabasino, F. Marzano, P. Paolucci, and S. Petrarca, The
APE project: a gigaflop parallel processor for lattice calculations, in Pro-
ceedings of the conference on Computing in high energy physics table of
contents. North-Holland Publishing Co. Amsterdam, The Netherlands,
The Netherlands, 1986, pp. 330337.
[27] C. Battista, S. Cabasino, F. Marzano, P. Paolucci, J. Pech, F. Rapuano,
R. Sarno, G. Todesco, M. Torelli, W. Tross et al., The APE-100 Com-
puter:(I) The Architecture, International Journal of High Speed Comput-
ing, vol. 5, no. 4, pp. 637656, 1993.
[28] A. Bartoloni, P. Boucaud, N. Cabibbo, F. Calvayrac, M. Della Morte,
R. De Pietri, P. De Riso, F. Di Carlo, F. Di Renzo, W. Errico et al.,
Status of APEmille, Arxiv preprint hep-lat/0110153, 2001.
[29] F. Belletti, S. Schifano, R. Tripiccione, F. Bodin, P. Boucaud, J. Micheli,
O. Pene, N. Cabibbo, S. de Luca, A. Lonardo et al., Computing for LQCD:
apeNEXT, Computing in Science and Engineering, vol. 8, no. 1, pp. 1829,
2006.
[30] M. P. I. Forum, MPI: A message-passing interface standard, Tech. Rep.
UT-CS-94-230, 1994. [Online]. Available: citeseer.ist.psu.edu/519858.html
[31] , MPI-2: Extensions to the Message-Passing Interface, Technical
Report, University of Tennessee, Knoxville, 1996. [Online]. Available:
citeseer.ist.psu.edu/517818.html
[32] A. Heinig, J. Strunk, W. Rehm, and H. Schick, Heterogeneous multipro-
cessing - on a tightly coupled opteron cell evaluation platform, 2007.
[33] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, HPL




[34] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh, Basic Linear
Algebra Subprograms for FORTRAN usage, in In ACM Trans. Math.
Soft., 5 (1979), pp. 308-323, 1979.
[35] J. J. Dongarra, J. D. Croz, S. Hammarling, and R. J. Hanson, An extended
set of FORTRAN Basic Linear Algebra Subprograms, in In ACM Trans.
Math. Soft., 14 (1988), pp. 1-17, 1988.
[36] R. Janka, R. Judd, J. Lebak, M. Richards, and D. Campbell, VSIPL:
an object-based open standard API for vector, signal, and image process-
ing, in Proceedings of the Acoustics, Speech, and Signal Processing, 200.
on IEEE International Conference-Volume 02. IEEE Computer Society
Washington, DC, USA, 2001, pp. 949952.
[37] J. Dongarra, Numerical Linear Algebra for High-performance Computers.
Society for Industrial Mathematics, 1998.
[38] J. Dongarra, R. Geijn, and D. Walker, Scalability issues affecting the
design of a dense linear algebra library, Journal of Parallel and Distributed
Computing, vol. 22, no. 3, 1994.
[39] J. Demmel, Applied Numerical Linear Algebra. Society for Industrial
Mathematics, 1997.
[40] J. J. Dongarra and J. Langou, The problem with the Linpack benchmark
matrix generator, LAPACK Working Note, Tech. Rep. 206, Jun. 2008.
[Online]. Available: http://www.netlib.org/lapack/lawnspdf/lawn206.pdf
[41] J. Kurzak and J. Dongarra, Implementing Linear Algebra Routines on
Multi-core Processors with Pipelining and a Look Ahead, in Applied Par-
allel Computing: State of the Art in Scientific Computing. 8th International
Workshop, PARA 2006, Umea, Sweden, June 18-21, 2006, Revised Selected
Papers. Springer-Verlag New York Inc, 2007.
[42] P. Strazdins, A Comparison of Lookahead and Algorithmic Blocking Tech-
niques for Parallel Matrix Factorization, 1998.
[43] IBM, Implementation of the High-Performance Linpack benchmark for IBM
QS22 systems with PowerXCell 8i processors., IBM, http://www.netlib.
org/benchmark/hpl/hpl_qs22-2008-11-30.patch. [Online]. Available: http:
//www.netlib.org/benchmark/hpl/hpl_qs22-2008-11-30.patch
[44] K. Barker, K. Davis, A. Hoisie, D. Kerbyson, M. Lang, S. Pakin, and
J. Sancho, Entering the petaflop era: the architecture and performance of
Roadrunner, in Proceedings of the 2008 ACM/IEEE conference on Super-
computing. IEEE Press Piscataway, NJ, USA, 2008.
[45] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, HPL - A
Portable Implementation of the High-Performance Linpack Benchmark for




[46] IBM, Cell Broadband Engine Programming Handbook including PowerX-
Cell 8i. IBM, 2008.
[47] H. Simma, Architecture of the qpace torus-network, 2009, eQPACEMeet-
ing Feb 2009.
[48] M. Barnett, D. Payne, R. van de Geijn, and J. Watts, Broadcasting on
Meshes with Wormhole Routing, Journal of Parallel and Distributed Com-
puting, vol. 35, no. 2, pp. 111122, 1996.
[49] J. Watts, A Pipelined Broadcast for Multidimensional Meshes, Parallel
Processing Letters, vol. 5, no. 2, pp. 281292, 1995.
[50] G. Almasi, C. Archer, J. Castanos, J. Gunnels, C. Erway, P. Heidelberger,
X. Martorell, J. Moreira, K. Pinnow, J. Ratterman et al., Design and
implementation of message-passing services for the Blue Gene/L super-
computer, IBM Journal of Research and Development, vol. 49, no. 2/3, p.
393, 2005.
[51] S. Scott and G. Thorson, Optimized Routing in the Cray T3D, in Pro-
ceedings of the First International Workshop on Parallel Computer Routing
and Communication. Springer-Verlag London, UK, 1994, pp. 281294.
[52] N. Adiga, G. Almasi, G. Almasi, Y. Aridor, R. Barik, D. Beece,
R. Bellofatto, G. Bhanot, R. Bickford, M. Blumrich et al., An overview of
the BlueGene/L Supercomputer, in Proceedings of the 2002 ACM/IEEE
conference on Supercomputing. IEEE Computer Society Press Los Alami-
tos, CA, USA, 2002, pp. 122.
[53] L. Ni and P. McKinley, A Survey of Wormhole Routing Techniques in
Direct Networks, Computer, vol. 26, no. 2, pp. 6276, 1993.
[54] S. KHULLER and U. VISHKIN, Biconnectivity approximations and graph
carvings, Journal of the Association for Computing Machinery, vol. 41,
no. 2, pp. 214235, 1994.
[55] T. Hoefler, C. Siebert, and W. Rehm, A practically constant-time MPI
Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast, in
Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE
International, 2007, pp. 18.
[56] J. Chen, R. Edwards, and W. Watson, QMP: LQCD Message Passing
API, Tech. Rep., 2004. [Online]. Available: http://usqcd.jlab.org/
usqcd-docs/qmp/QMP-2-0-Introduction.html
[57] W. Gropp and E. Lusk, Users guide for mpich, a portable implementation




[58] P. Bozeman and B. Saphir, A Modular High Performance Implementation
of the Virtual Interface Architecture, in Proceedings of the 2nd Extreme
Linux Workshop, 1999.
[59] A. Kumar, N. Jayam, A. Srinivasan, G. Senthilkumar, P. Baruah, M. Kr-
ishna, and R. Sarma, Feasibility study of MPI implementation on the
heterogeneous multi-core Cell BE architecture, in Proceedings of the nine-
teenth annual ACM symposium on Parallel algorithms and architectures.
ACM Press New York, NY, USA, 2007, pp. 5556.
[60] S. Pakin, Receiver-initiated Message Passing over RDMA Networks.
[61] S. Wunderlich, SPE-MPI: an SPE-Centric MPI Implemen-
tation Approach for Cell BE, Student Research Project,




[62] F. O'Carroll, H. Tezuka, A. Hori, and Y. Ishikawa, The design and im-
plementation of zero copy MPI using commodity hardware with a high
performance network, in Proceedings of the 12th international conference
on Supercomputing. ACM New York, NY, USA, 1998, pp. 243250.
[63] J. M. Squyres and A. Lumsdaine, The component architecture of open
MPI: Enabling third-party collective algorithms, in Proceedings, 18th
ACM International Conference on Supercomputing, Workshop on Compo-
nent Models and Systems for Grid Applications, V. Getov and T. Kielmann,
Eds. St. Malo, France: Springer, July 2004, pp. 167185.
[64] W. Gropp and E. Lusk, Creating a New MPICH Device using the Channel
Interface, ANL/MCS-TM-213, Argonne National Laboratory, Tech. Rep.,
1996.
[65] R. Grabner, F. Mietke, and W. Rehm, Implementing an MPICH-2 channel
device over VAPI on InfiniBand, in Parallel and Distributed Processing
Symposium, 2004. Proceedings. 18th International, 2004.
[66] W. Gropp and I. Argonne, MPICH2: A New Start for MPI Implementa-
tions, in Proceedings of the 9th European PVM/MPI Users' Group Meet-
ing on Recent Advances in Parallel Virtual Machine and Message Passing
Interface. Springer, 1999.
[67] One Stop Systems, PCI Express x8 Host Interface Board Data Sheet, Apr.
2008, http://www.onestopsystems.com/documents/OSS-HIB2-x8-H-T_
002.pdf.
[68] R. Oertel, Design and implementation of an amd opteron and cell/b.e.
pci express based hybrid system - an environment for a tightly coupled
accelerator, Diploma thesis, Chemnitz University of Technology, 2008.
83
BIBLIOGRAPHY
[69] A. Heinig, Execution of SPE code in an Opteron-Cell/B.E. hybrid system,
Diploma thesis, Chemnitz University of Technology, 2008.
[70] A. Heinig, J. Strunk, W. Rehm, and H. Schick, Accfs - operating system
integration of computational accelerators using a vfs approach, in Accepted
for publication in proceedings of Applied Reconfigurable Computing. ARC,
2009.
[71] A. Heinig, W. Rehm, and J. Strunk, Accfs - accelerator file system, in Re-
search poster at International Supercomputing Conference 2008 (ISC'08).
IOS Press, 2008.
[72] A. Heinig, R. Oertel, J. Strunk, W. Rehm, and H. J. Schick, Gen-
eralizing the SPUFS concept  a case study towards a common accel-
erator interface, in Many-core and Reconfigurable Supercomputing Con-
ference (MRSC) 2008, Apr. 2008, http://private.ecit.qub.ac.uk/MSRC/
/Wednesday_Abstracts/Heinig_Chemnitz.pdf.
[73] J. Strunk, A. Heinig, T. Volkmer, W. Rehm, and H. Schick, Run-time
reconfiguration for hypertransport coupled fpgas using accfs, in Accepted
for publication in proceedings of First International Workshop on Hyper-
Transport Research and Applications. WHTRA, 2009.
[74] Y. Kawamura, T. Yamazaki, T. Ishiwata, K. Horie, and H. Kyusojin, Net-
work processing on an spe core in cell broadband engine, in HOTI '08:
Proceedings of the 2008 16th IEEE Symposium on High Performance In-





The source code which has been written for this diploma thesis can be obtained
on request from the Computer Architecture Group at TU Chemnitz. By an
agreement between the author and IBM, this source code belongs to IBM and
may only be used with permission of IBM. It contains:
• A patch to the QS22 version of HPL 1.0a which adds function instrumen-
tation and collective support as described in section 2.6.
• A patch to the MPICH version of QMP 2.3.1 to make it MPI-1.1 compli-
ant.
• The MFC memcpy benchmark as described in section 3.2.4.
• The access time benchmark as described in section 3.2.5.
Thesis Declaration
I hereby declare that the whole of this diploma thesis is my own work, except
where explicitly stated otherwise in the text or in the bibliography. I declare
that it has not been submitted in whole, or in part, for any other degree.
Chemnitz, March 9, 2009
Simon Wunderlich
