First Experiences with Intel Cluster OpenMP by Terboven, Christian et al.
  
  
Communication in
Cluster Computers and
Clustered Systems
3rd Workshop

12 December 2007

at

RWTH Aachen University
(Rheinisch-Westfälische Technische Hochschule Aachen)
  
 
 
 
 
 
 
 
Organizers:
 Professor Dr.-Ing. Wolfgang Rehm, TU Chemnitz 
 Professor Dr. habil. Thomas Bemmerl, RWTH Aachen 
 Dr.-Ing. Carsten Trinitis, TU München 
 Dr. rer. nat. Stefan Lankes, RWTH Aachen 
 Dipl.-Inform. Torsten Hoefler, Indiana University, Bloomington 
 Dipl.-Inform. Torsten Mehlan, TU Chemnitz 
 
  
  
  
  
  
Table of Contents

MPI

Andrew Friedley, Torsten Hoefler, Matthew Leininger:
Scalable High Performance Message Passing over InfiniBand for Open MPI

Torsten Hoefler, Marek Mosch, Torsten Mehlan, Wolfgang Rehm:
CollGM - A Myrinet/GM optimized collective component for Open MPI

Boris Bierbaum, Georg Wassen, Stefan Lankes, Thomas Bemmerl:
Evaluation of Optimized Barrier Algorithms for SCI Networks with
Different MPI Implementations

Metacomputing

Carsten Clauss, Stephan Gsell, Stefan Lankes, Thomas Bemmerl:
An Approach for Deploying Externally Defined MPI Communicators
at Runtime

Daniel Becker, Wolfgang Frings, Felix Wolf:
Performance Evaluation and Optimization of Metacomputing
Applications

Design and Management of Clustered Computer Systems

Frank Mietke, Torsten Mehlan, Torsten Hoefler, Wolfgang Rehm:
Design and Evaluation of a 2048 Core Cluster System

Silke Schuch, Rodolfo Bamberg, Thomas Bemmerl:
Planning Methods in Heterogeneous Environments Using
a Genetic Algorithm (Planungsverfahren in heterogenen Umgebungen
mit Hilfe eines Genetischen Algorithmus)

Parallel Computing in Practice

Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner:
First Experiences with Intel Cluster OpenMP

Josef Minde, Josef Weidendorfer, Tobias Klug, Carsten Trinitis:
PET Image Reconstruction on the Cell BE (PET-Bildrekonstruktion
auf der Cell BE)
 
 
 
 
 
 
 
 
 
 
Scalable High Performance Message Passing over InfiniBand for Open MPI
Andrew Friedley (1,2,3), Torsten Hoefler (1), Matthew L. Leininger (2,3),
Andrew Lumsdaine (1)
(1) Open Systems Laboratory, Indiana University, Bloomington IN 47405, USA
{afriedle,htor,lums}@cs.indiana.edu
(2) Sandia National Laboratories, Livermore CA 94551, USA
(3) Lawrence Livermore National Laboratory, Livermore CA 94551, USA
{friedley1,leininger4}@llnl.gov
Abstract
InniBand (IB) is a popular network technology for
modern high-performance computing systems. MPI im-
plementations traditionally support IB using a reliable,
connection-oriented (RC) transport. However, per-process
resource usage that grows linearly with the number of pro-
cesses, makes this approach prohibitive for large-scale sys-
tems. IB provides an alternative in the form of a connection-
less unreliable datagram transport (UD), which allows for
near-constant resource usage and initialization overhead as
the process count increases. This paper describes a UD-
based implementation for IB in Open MPI as a scalable al-
ternative to existing RC-based schemes. We use the software
reliability capabilities of Open MPI to provide the guaran-
teed delivery semantics required by MPI. Results show that
UD not only requires fewer resources at scale, but also al-
lows for shorter MPI startup times. A connectionless model
also improves performance for applications that tend to
send small messages to many different processes.
1 Introduction
The Message Passing Interface (MPI) has become the de
facto standard for large-scale parallel computing. In large
part, MPI’s popularity is due to the performance-portability
it offers. Applications written using MPI can be run on
any underlying network for which an MPI implementation
is available. Because so many HPC applications rely on
MPI, all network hardware intended for high-performance
computing has at least one (usually more than one) imple-
mentation of MPI available.
InfiniBand (IB) [6] is an increasingly popular high-
performance network solution. Recent trends in the Top
500 list show that more and more systems are equipped with
IB (from 3% in June 2005 to more than 24% in November
2007). The size of such systems also shows an increasing
trend. The largest InfiniBand-based system in 2005, Sys-
temX, connected 1,100 dual processor nodes. In 2006, the
Thunderbird system appeared, consisting of 4,500 dual pro-
cessor IB compute nodes.
InfiniBand provides several network transports with
varying characteristics. A reliable, connection-oriented
(RC) transport similar to TCP is most commonly used (especially
by MPI implementations). A less widely adopted
alternative is the connectionless unreliable datagram (UD)
transport, which is analogous to UDP. While UD
does not guarantee message delivery, its connectionless
model reduces the resources needed for communication and
provides gains in performance when communicating with
many different processes.
These characteristics of UD make it an interesting op-
tion as a transport protocol for IB support in an MPI imple-
mentation. The traditional RC (connection-based) approach
with IB can lead to unacceptably high memory utilization
and long startup times. Although the UD (connectionless)
approach does not have the resource consumption problems
of the RC approach, the burden of guaranteeing reliable
message delivery is shifted from the transport protocol to
the MPI implementation. This is not necessarily a disadvantage,
however. Providing reliability within an MPI implementation
allows flexibility in designing a protocol that
can be optimized for MPI communication.
1.1 Related Work
Several strategies have been employed to reduce the re-
source requirements of RC in an attempt to preserve the vi-
ability of an RC-based MPI implementation at large scale
process counts. A shared receive queue (SRQ) feature al-
lows for buffers to be posted to a single receive queue,
instead of spreading resources across many different re-
ceive queues [12, 16]. MPI processes rarely receive large
amounts of data over all connections simultaneously, so
sharing allows the overall number of receive buffers to be
reduced.
Another method of improving efficiency at scale is to use
a lazy connection establishment mechanism. Connections
are not established until they are required for communica-
tion, resulting in lower resource utilization in applications
that require fewer than O(N^2) connections. However, applications
exhibiting a fully-connected communication pattern
do not benefit, as all connections must be established at
some point.
The use of the UD transport to solve the remaining is-
sues with “fully-wired” applications was first developed in
LA-MPI [5] by Mitch Sukalski, but to our knowledge this
work was never published. Another approach was recently
proposed in [7, 8]. Although the implementation described
by Koop et al. has recently been released, we did not have
time to perform a comparison against our work before publication.
While our implementation appears to be similar in many
ways, it differs in its clear design within Open MPI’s MCA
architecture as well as in its implementation of a reliability
protocol. Open MPI already provides a network-agnostic
reliability protocol via the Data Reliability (DR) Point-to-Point
Messaging Layer (PML) component, which we use to
provide reliable communication over the UD transport. Furthermore,
we describe an optimization (striping sends across multiple
queue pairs) and show that it increases
bandwidth significantly for UD.
2 MPI Over InfiniBand
The InfiniBand (IB) Architecture [6] is a high-
performance networking standard. Several network trans-
ports are defined. The most commonly used are the reli-
able connection-oriented (RC) and the unreliable datagram
(UD) transport. RC is analogous to TCP, as is UD to UDP.
Both transports provide traditional send/receive semantics,
but extend this functionality with different feature sets.
2.1 Reliable Connection Transport
The Reliable Connection (RC) transport allows for arbi-
trarily sized messages to be sent. IB hardware automatically
fragments the data into MTU-sized packets before sending,
then reassembles the data at the receiver. Additionally, Re-
mote DMA (RDMA) is supported over the RC transport.
RDMA allows one process to read from or write to
the memory of another process directly, without requiring
any communication processing by the remote process.
A connection-oriented communication model means that
a connection must be explicitly established between each
pair of peers that wish to communicate. Each connection
requires a send and receive queue (together referred to as a
queue pair, or QP). A third type of queue, shared by many
queue pairs, is used by the hardware to signal completion of
send and receive requests.
Requiring a separate receive queue for each connection
means that separate receive buffers must be posted to each
of these queues for the host channel adapter (HCA) to copy
data into. Distributing receive buffers across many connec-
tions is wasteful, as the application is not likely to be re-
ceiving data from every other connected process simulta-
neously. Memory utilization increases with the number of
connections, reducing memory available for use by applica-
tions. To work around this issue, IB supports the concept
of a shared receive queue (SRQ). Each new queue pair (QP)
may be associated with an SRQ as it is created. Any data ar-
riving on these QPs consumes receive buffers posted to the
shared receive queue. Since receive buffers are shared, ap-
plications receiving from only a subset of peers at any one
time require fewer total buffers to be allocated and posted.
2.2 Unreliable Datagram Transport
As its name implies, the Unreliable Datagram transport
does not guarantee message delivery. Thus, a UD-based
MPI implementation must implement its own reliability
protocol. In principle, a hardware-based reliability mechanism
should have superior performance over a software-based
mechanism. In practice, the flexibility of a software-based
reliability protocol has led to performance gains
at the InfiniBand HCA. No reliability is guaranteed when
the data moves from the HCA across a PCI bus and into
system memory.
Hardware message fragmentation is not provided with
the UD transport, so software must also deal with fragment-
ing large messages into MTU-sized packets before being
sent. Current IB hardware allows for a maximum two kilo-
byte data payload. Since protocol headers must be sent with
each message, a small maximum message size presents a
challenge for achieving optimal bandwidth. This issue will
be discussed in section 4.2.
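The software fragmentation described above can be illustrated with a minimal sketch. It assumes a 2048-byte maximum datagram payload and a hypothetical fixed per-packet header length (`header_len`); the function names are illustrative and are not taken from Open MPI.

```python
def fragment(message: bytes, mtu: int = 2048, header_len: int = 4) -> list[bytes]:
    """Split a message into chunks so that header_len + len(chunk) <= mtu.

    header_len is a hypothetical per-packet header size; the real value
    depends on the protocol headers in use.
    """
    if header_len >= mtu:
        raise ValueError("header leaves no room for payload")
    chunk = mtu - header_len
    if not message:
        # A zero-length message still occupies one (empty) packet.
        return [b""]
    return [message[i:i + chunk] for i in range(0, len(message), chunk)]


def reassemble(chunks: list[bytes]) -> bytes:
    """Receiver side: concatenate chunks that are already in order."""
    return b"".join(chunks)
```

With these assumed sizes, a 5000-byte message needs three datagrams, because each one carries at most 2044 bytes of message data. Ordering and loss are ignored here; a real receiver would need sequence information to reassemble correctly.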
UD features a connectionless communication model.
Only address information for a single QP at each process
must be exchanged; peers may communicate without any
sort of connection establishment handshake. This leads to
both efficient and scalable resource utilization. First, only
one QP is required for communication with any number
of other processes. Active connections require resources
in both the application and the HCA, so a connectionless
model is desirable when minimizing resource usage at
scale. Second, this single QP’s receive queue behaves the
same as an RC shared receive queue. Our results show
that a simple, fully connected MPI job with 1024 processes
would require 8.8 MiB of memory per process using our
UD implementation compared to 14.75 or 25.0 MiB using
the existing RC implementation with or without SRQ, re-
spectively.
3 Open MPI
Open MPI [3] is a collaborative effort to produce an open
source, production quality MPI-2 [9] implementation. First
introduced in LAM/MPI [15], Open MPI’s Modular Com-
ponent Architecture (MCA) allows for developers to imple-
ment new ideas and features within self-contained compo-
nents. Components are selected and loaded at runtime.
MCA components are organized into a set of frameworks
responsible for managing one or more components that per-
form the same task. Each framework defines an interface
for each component to implement as well as an interface
for other parts of the implementation to access the com-
ponents’ functionality. Although many frameworks exist
within Open MPI, only those most relevant to the work pre-
sented are discussed in detail here.
3.1 Point to Point Messaging Layer
One of the most important frameworks in Open MPI’s
communication path is the Point-to-point Messaging Layer
(PML). PML components implement MPI send and re-
ceive semantics in a network-independent fashion. Message
matching, fragmentation and re-assembly are managed by
the PML, as well as selection and use of different protocols
depending on message size and network capabilities. Other
frameworks, like the Memory Pool (MPool) and Registra-
tion Cache (RCache) are enabled for networks that require
memory registration (such as InfiniBand). Figure 1 illus-
trates how all of these frameworks fit together to form Open
MPI’s point to point architecture. References [11, 12] discuss the
related frameworks in greater detail.
An interesting PML component in the context of this pa-
per is the data reliability (DR) component [10]. DR uses an
explicit acknowledgement protocol along with data checksumming
to provide network failover capabilities. This allows
DR to detect when network links are no longer functional
and to switch automatically to a different network
when failure occurs. Furthermore, DR verifies
checksums once data reaches main memory. This allows
for detection of data corrupted by the system buses,
which IB’s hardware-based reliability does not provide.
Such reliability comes at a price, however; performance is
slightly lower than that of the default PML component, referred
to as OB1.
[Figure 1. Point-to-point Architecture. Components shown: MPI Layer; OB1 or DR PML; UD or OpenIB (RC) BTL; MPool; RCache.]
What makes the DR component interesting is its use in
conjunction with InfiniBand’s UD transport. As discussed
in section 2.2, a software reliability protocol is needed to
meet MPI’s guaranteed message delivery semantics. Rather
than implementing a reliability protocol specifically for UD,
the DR component may be used to achieve guaranteed mes-
sage delivery.
3.2 Byte Transfer Layer
Both the DR and OB1 PML components rely on another
framework, the Byte Transfer Layer (BTL), to implement
support for communicating over specific networks. The
BTL framework is designed to provide a consistent inter-
face to different networks for simply moving data from one
peer to another. This simplicity allows for fast development
of new components for emerging network technologies, or
to explore research ideas.
The BTL interface is designed to provide a simple ab-
straction for communication over a variety of network
types, while still allowing for optimal performance. Tra-
ditional send semantics must be supported by each BTL,
while RDMA put/get semantics may be supported by net-
works that provide it. Receives of any form (traditional or
RDMA) are initiated only by registering a callback func-
tion and a tag value with the BTL. The BTL then calls this
function whenever a message arrives with a corresponding
tag value, providing the upper layers access to the received
message. Some networks (including IB) require that mem-
ory be registered for direct access by the HCA, so a memory
allocation interface is provided by the BTL as well.
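The tag-based receive path described above can be sketched as a small registry that maps tag values to callbacks. This is only an illustration of the idea; `ToyBTL`, `register`, and `deliver` are hypothetical names and do not reflect the actual BTL interface in Open MPI.

```python
from typing import Callable


class ToyBTL:
    """Toy model of receive dispatch: upper layers register (tag, callback)."""

    def __init__(self) -> None:
        self._callbacks: dict[int, Callable[[bytes], None]] = {}

    def register(self, tag: int, callback: Callable[[bytes], None]) -> None:
        # One callback per tag; registering again replaces the old one.
        self._callbacks[tag] = callback

    def deliver(self, tag: int, payload: bytes) -> bool:
        """Called when a message arrives; returns False if no callback matches."""
        callback = self._callbacks.get(tag)
        if callback is None:
            return False
        callback(payload)
        return True
```

An upper layer would call `register` once during setup, after which every arriving message with that tag is handed to its callback without the BTL interpreting the payload.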
Each BTL component implements support for a particu-
lar network and transport. InfiniBand is primarily supported
via the OpenIB component, which uses the RC transport
type exclusively. For clarity, this is referred to as the RC
BTL. RDMA is supported for both small and large mes-
sages. While using RDMA for small messages (referred to
in Open MPI as eager RDMA) may reduce latency, such an
approach is not scalable due to the requirement that RDMA
buffers be polled to check for completion [12].
Connection management is a major issue for large scale
MPI jobs. With a connection-oriented transport like RC, a
connection and associated resources must be allocated for
each pair of peers that wish to communicate. In the worst
case this results in O(N^2) connections for an all-to-all communication
pattern.
One way to avoid this problem is to establish connections
only when communication between two peers is requested.
Each time a process sends a message, it checks the state of
the connection to the destination process. If no connection
is established, a new connection is initiated. Rather than
waiting for the connection to be established, the message
to be sent is queued and the application or MPI library continues
execution. If the process tries to send another message
to the same destination before the connection is established,
that message is queued as well. Finally, when the connec-
tion is available, any queued messages are sent. Subsequent
messages are sent immediately, but a small cost is still in-
curred to check the state of the connection. Since applica-
tions rarely exhibit an all-to-all communication pattern [17],
the reduction in resources required for established connec-
tions outweighs the small cost of managing connection state
dynamically.
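The lazy connection scheme described above amounts to a small per-peer state machine. The following sketch models it under simplifying assumptions: the handshake is represented by an explicit `on_connected` call, and `wire` stands in for messages actually handed to the network. All names are hypothetical.

```python
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = 0
    CONNECTING = 1
    CONNECTED = 2


class LazyEndpoint:
    """Per-destination endpoint that connects on first use."""

    def __init__(self) -> None:
        self.state = State.CLOSED
        self.pending = deque()  # messages queued while the handshake runs
        self.wire = []          # messages actually sent, in order

    def send(self, msg) -> None:
        # Every send pays for this state check, even once connected.
        if self.state is State.CONNECTED:
            self.wire.append(msg)
            return
        if self.state is State.CLOSED:
            # Start the handshake but do not wait for it to finish.
            self.state = State.CONNECTING
        self.pending.append(msg)

    def on_connected(self) -> None:
        """Handshake finished: flush queued messages in their original order."""
        self.state = State.CONNECTED
        while self.pending:
            self.wire.append(self.pending.popleft())
```

The state check at the top of `send` is the small recurring cost the text refers to; the UD approach removes both the check and the queue.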
4 Implementation
To support communication over unreliable datagrams, a
new BTL component was developed. Currently named the
UD BTL, its design is far simpler than that of the existing
RC-based BTL. RDMA is not available over UD, so sup-
port for RDMA protocols is not needed. More importantly,
UD is a connectionless network transport; no connection
management is necessary. A single queue pair is used for
receiving all data from all peers. During MPI initialization,
QP address information is exchanged with every other peer
using an all-to-all algorithm over an existing TCP commu-
nication channel established by the runtime environment.
Unlike lazy connection establishment, no logic is needed in
the send path for managing connections – UD connections
are always available.
Bandwidth over the UD transport may be increased by
exploiting the ability of IB hardware to process multiple
send requests in parallel. This is done by initializing a
small constant number of QPs and posting sends among
them in a round-robin fashion. Four QPs were chosen
to give near-optimal bandwidth without affecting latency.
Figure 2 shows peak bandwidth reported by NetPIPE [14]
when striping over a varying number of queue pairs. Un-
like the send side, distributing receive buffers across several
QPs did not yield any performance gains. In our implemen-
tation, only one QP is used for receiving messages, while
four are used for sending.
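As a rough illustration of the striping scheme, the sketch below posts sends across a fixed set of send queues in round-robin order. The class and method names are hypothetical, and plain Python lists stand in for hardware queue pairs.

```python
from itertools import cycle


class StripedSender:
    """Round-robin distribution of send requests over a few send queues."""

    def __init__(self, num_send_qps: int = 4) -> None:
        self.queues = [[] for _ in range(num_send_qps)]
        self._next_qp = cycle(range(num_send_qps))

    def post_send(self, packet) -> int:
        """Append the packet to the next queue and return that queue's index."""
        qp = next(self._next_qp)
        self.queues[qp].append(packet)
        return qp
```

Six consecutive sends over four queues land on queues 0, 1, 2, 3, 0, 1. Because the queue choice depends only on a counter, no per-destination state is needed.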
[Figure 2. QP Striping Bandwidth. Peak bandwidth (Mb/s, axis range 4000 to 6500) for 1, 2, 4, 8, and 16 QPs.]
4.1 Buffer Management
Ensuring that enough receive buffers remain posted is
a critical aspect of maximizing performance over the UD
transport. When data arrives for a QP that has no receive
buffers posted, the HCA silently drops the data. This prob-
lem is exacerbated by the fact that applications must call
into the MPI library in order for consumed receive buffers
to be processed and re-posted to the QP, and may not do so
for long periods of time. Therefore, the UD BTL should
always have as many receive buffers posted as the number
of messages that might be received between MPI calls.
Flow control is difficult for two reasons. First, receive
buffers are shared, so any combination of remote sender ac-
tivity may exhaust the receive queue. Many senders each
sending small messages may overwhelm a receiver just as
easily as one or a few senders sending large messages. Any
flow-control scheme would have to notify all potential senders
of congestion at the receiver in order to be effective. Doing
this means sending a message to each sender, which is not
feasible at scale.
Second, accurately detecting a shortage of receive
buffers with enough time to react is difficult or even im-
possible, especially with data arriving while applications
are not making frequent calls into the MPI library. Due to
the simplicity of the BTL abstraction, it is not possible to
take advantage of a PML-level rendezvous protocol to pre-
post receive buffers when large messages are expected. Nor
does the IB verbs interface provide any information about
how many receive buffers are currently posted; the UD BTL
must keep its own counter of currently posted buffers based
on completion events. Furthermore, the time required to either
process currently filled buffers or allocate and register new
buffers is often greater than the time remaining before cur-
rently posted resources are exhausted.
Our solution attempts to strike a balance between min-
imizing resource usage and keeping a large number of
buffers posted to the receive queue at all times. Rather than
trying to dynamically adapt to the communication load, a
static pool of buffers is allocated and left unchanged after
initialization. For tuning to particular applications, an MCA
parameter may be used to specify the number of buffers in
the pool at runtime. The receive path has been optimized to
process and re-post buffers as quickly as possible.
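The behavior of a fixed-size receive pool can be modeled in a few lines. This is a simplified sketch, not the UD BTL's code: `ReceiveQueue`, `on_arrival`, and `progress` are hypothetical names, and `progress` stands for the work done when the application calls into the MPI library.

```python
class ReceiveQueue:
    """Static pool of receive buffers with a software-maintained counter."""

    def __init__(self, pool_size: int) -> None:
        self.pool_size = pool_size  # fixed after initialization
        self.posted = pool_size     # own count of currently posted buffers
        self.filled = []            # arrived data not yet processed
        self.dropped = 0

    def on_arrival(self, packet) -> bool:
        """Model the HCA: with no posted buffer, the data is silently dropped."""
        if self.posted == 0:
            self.dropped += 1
            return False
        self.posted -= 1
        self.filled.append(packet)
        return True

    def progress(self, deliver) -> int:
        """Process filled buffers and re-post them; returns how many were handled."""
        handled = len(self.filled)
        for packet in self.filled:
            deliver(packet)
        self.filled.clear()
        self.posted += handled
        return handled
```

With a pool of two buffers, a third arrival before the next `progress` call is lost, which is why the pool size has to match the number of messages that may arrive between MPI calls.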
When allocating buffers for an SRQ, the RC BTL uses a
similar strategy, except that it adjusts the number of buffers
based on the number of processes in the MPI job. How-
ever, large process counts do not necessarily correlate to in-
creased communication load per receiver, possibly leading
to an unnecessarily large buffer pool at scale. Accurate sizing
of the buffer pool depends heavily on an application’s
behavior, particularly its communication patterns and frequency
of calls into the MPI library.
4.2 Message Format and Protocol
Both the PML and BTL are free to define their own pro-
tocol and message headers as they require. The PML builds
messages of a suitable size (bounded by the maximum mes-
sage size the BTL can send) and prefixes its own header
data. A BTL component treats this as an opaque data pay-
load, and prefixes its own header data as needed. When
messages arrive at the receiver, the BTL uses its header data
however it chooses, and passes the opaque data payload (in-
cluding the PML headers) to the PML for processing.
Lack of flow control has the advantage of allowing for a
simpler communication protocol. The UD BTL adds only a
single byte message tag value (rounded to 4 bytes for align-
ment purposes), used for determining which registered call-
back function should be used to pass the data to the upper
layer. However, this low per-message overhead is offset by
the UD transport’s 2 KiB MTU. Packet headers for both the
PML and BTL must be included in each 2 KiB message. In
contrast, the RC BTL includes additional flow control infor-
mation in its packet headers, but is able to send much more
data with each message.
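The header layering and its cost can be made concrete with a short sketch. It assumes a 2048-byte limit per datagram and a one-byte tag padded to four bytes, as described above; the little-endian layout, the function names, and the PML header length passed to `user_bytes_per_datagram` are assumptions for illustration only.

```python
import struct

MTU = 2048                          # assumed maximum datagram size in bytes
BTL_HEADER = struct.Struct("<B3x")  # 1-byte tag followed by 3 pad bytes


def pack(tag: int, pml_data: bytes) -> bytes:
    """Prefix the opaque PML data (PML header + user data) with the BTL tag."""
    if BTL_HEADER.size + len(pml_data) > MTU:
        raise ValueError("datagram exceeds MTU")
    return BTL_HEADER.pack(tag) + pml_data


def unpack(datagram: bytes) -> tuple[int, bytes]:
    """Return the tag and the opaque PML data for callback dispatch."""
    (tag,) = BTL_HEADER.unpack_from(datagram)
    return tag, datagram[BTL_HEADER.size:]


def user_bytes_per_datagram(pml_header_len: int) -> int:
    """User data that fits once both headers are accounted for."""
    return MTU - BTL_HEADER.size - pml_header_len
```

For a hypothetical 32-byte PML header, each datagram would carry 2012 bytes of user data, so both headers are paid once for roughly every 2 KiB sent.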
To ensure reliability, the DR PML defines the concept of
a virtual fragment, or VFRAG [10], which acts as a unit of
acknowledgement. Two timeout mechanisms are associated
with each VFRAG. The first is used to detect when local
send completion fails to occur. The second is the familiar
ACK timeout, which allows the sender to detect loss by lack
of positive acknowledgement from the receiver within some
time frame. Retransmission occurs either when a timeout
expires, or the receiver responds with an acknowledgement
indicating some portion of the VFRAG was not received.
After several retransmission attempts over one BTL, the
PML marks that BTL as failed, and begins using a different
BTL. While such an approach is intended to transparently
respond to network failure, it also guarantees delivery over
an unreliable network. When used in conjunction with the
DR PML, our UD implementation provides MPI’s guaran-
teed delivery semantics.
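The retransmit-then-fail-over behavior can be sketched as two nested loops. This is a deliberately simplified model, not the DR PML's algorithm: timers are not modeled, a `transmit` callable returning `True` stands for a complete acknowledgement arriving before the timeout, and the retry limit `max_attempts` is a hypothetical parameter.

```python
def send_vfrag(vfrag, btls, max_attempts: int = 3) -> str:
    """Try each BTL in order and return the name of the one that delivered.

    btls is a list of (name, transmit) pairs. transmit(vfrag) returns True
    when the whole VFRAG was acknowledged, and False on a timeout or a
    partial acknowledgement.
    """
    for name, transmit in btls:
        for _ in range(max_attempts):
            if transmit(vfrag):
                return name
        # Retries exhausted: treat this BTL as failed and move to the next.
    raise RuntimeError("all BTLs failed")
```

Over a lossy but working transport such as UD, the inner loop is what turns occasional loss into guaranteed delivery; the outer loop only matters when a network actually fails.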
5 Results
5.1 Experimental Environment
Atlas, a production system housed at Lawrence Liver-
more National Laboratory (LLNL), was used for all exper-
imental results. Atlas is a 1,152 node cluster connected via
an InfiniBand network. Each node consists of four dual core
2.4 GHz Opteron processors (eight cores per node) with
16 GiB RAM and Mellanox PCI-Express DDR InfiniBand
HCAs, running Linux.
A development version of Open MPI was used, based on
subversion revision 16080 of the main development trunk.
Source code is available to the public on the Open MPI
website [http://www.open-mpi.org]. Stable UD support is
planned for Open MPI version 1.3.
Results for UD are provided using both the OB1 and DR
PMLs. The rationale behind including results for UD with
OB1 is that the reliability overhead introduced by DR may
be directly observed. Although use of OB1 with UD does
not guarantee message delivery, no messages were actually
dropped while running our tests.
5.2 Microbenchmarks
Microbenchmarks are used to determine the behavior of
a system under certain, usually isolated conditions. We
use these benchmarks to determine basic system parameters
such as latency, bandwidth, and memory utilization.
5.2.1 Point-to-point Performance
NetPIPE [14], a common ping-pong benchmark, was used
to establish a baseline comparison of the latency and band-
width capabilities of the RC and UD BTLs. Latency as
a function of message size is presented in Figure 3, while
bandwidth is presented in Figure 4. For the RC BTL, eager
RDMA was disabled to approximate latencies at scale.
For messages smaller than 2 KiB, UD with OB1 is nearly
equivalent to RC, with or without SRQ. UD performance
with DR is worse due to the DR PML reliability protocol. A
study of the performance overhead inherent in the DR PML
may be found in [10]. At 2 KiB, UD’s MTU forces the PML
to switch to the slower rendezvous protocol, causing a drop
in performance that is overcome as message size increases.
[Figure 3. NetPIPE Ping Pong Latency. Latency (us) versus data size (bytes), logarithmic axes; series: OpenIB OB1, OpenIB SRQ OB1, UD DR, UD OB1.]
[Figure 4. NetPIPE Ping Pong Bandwidth. Bandwidth (Mib/s) versus data size (bytes), logarithmic axes; series: OpenIB OB1, OpenIB SRQ OB1, UD DR, UD OB1.]
5.2.2 Startup Performance and Memory Overhead
We developed a benchmark to measure the overhead in-
curred by lazy connection establishment. Each process in
the MPI job iteratively sends a 0-byte message to every
other process, while receiving a 0-byte message from an-
other process. This is done in a ring-like fashion to pre-
vent flooding any one process with messages. The total time
taken by each process is averaged to produce a single mea-
surement.
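One way to realize such a ring-like exchange is to shift the partner by one position in every step. The schedule below is an assumption about how the pattern could look, not the benchmark's actual source: in step s, a rank sends to the process s positions ahead of it and receives from the process s positions behind it.

```python
def ring_schedule(rank: int, nprocs: int) -> list[tuple[int, int]]:
    """Return (destination, source) pairs for steps 1 .. nprocs - 1."""
    return [
        ((rank + step) % nprocs, (rank - step) % nprocs)
        for step in range(1, nprocs)
    ]
```

In every step the destinations of all ranks form a permutation, so each process receives exactly one message per step and no single process is flooded.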
Time taken to execute the benchmark for varying process
counts is shown in Figure 5. A total of 512 nodes were used;
results past 512 processes were obtained by running more than
one process per node. This does not significantly affect the
results, as the benchmark is not bandwidth bound. The UD
BTL yields excellent performance, as no connections need
to be established. Since there is no connection overhead, UD’s
time is only that taken for every process to send a 0-byte
message to every other process. RC, shown both with and
without SRQ enabled, incurs significant overhead due to a
connection establishment handshake protocol.
[Figure 5. Startup Overhead. Time (seconds, logarithmic axis) versus number of processes (128 to 1024); series: UD, RC SRQ, RC.]
Figure 6 shows the average memory utilization per MPI
process, measured at the end of the allconn benchmark.
Again, results past 512 processes were obtained by running
multiple MPI processes per node. UD’s memory utilization
grows very slowly, at a near-constant rate. Each
MPI process stores a small amount of address information
for every other MPI process, leading to slowly increasing
utilization as scale increases. RC with SRQ does signifi-
cantly better than without, resulting in lower utilization at
low process counts but increasing at a faster rate than UD.
This is due to the RC BTL adjusting the size of the receive
buffer pool based on the number of processes, while the
UD BTL receiver buffer count remains constant (cf. Sec-
tion 4). Memory utilization for RC without SRQ is signif-
icantly higher since receive buffers are allocated for each
connection.
5.3 Application Benchmarks
Application benchmarks help to analyze the behavior
of our implementation for real-world applications. We in-
cluded positive and negative results to illustrate the trading
of memory and wireup costs for communication costs.
[Figure 6. Memory Overhead. Size (KiB, logarithmic axis) versus number of processes (128 to 1024); series: UD, RC SRQ, RC.]
5.3.1 ABINIT
ABINIT is an open source code for ab initio electronic
structure calculations based on the density functional the-
ory. The code is the object of an ongoing open software
project of the Université Catholique de Louvain, Corning
Incorporated, and other contributors [4]. We use a sample
calculation representing a real-world problem described
in [1].
Overall execution times are presented in Figure 7. Only
one run in each configuration was possible due to limited
system availability. Even so, it is fairly clear that for a
real application such as ABINIT, performance is similar for
both UD and RC-based implementations. We were unable
to measure memory usage, but believe the results would be
positive but less pronounced than those of the microbench-
marks.
[Figure 7. Abinit results. Abinit running time (s) for P=128, P=256, P=512, and P=1024; series: UD/OB1, UD/DR, OpenIB, OpenIB/SRQ.]
5.3.2 SMG2000
SMG2000 [2] is a benchmark written around a parallel
semicoarsening multigrid solver designed for modeling ra-
diation diffusion. An analysis presented in [17] indi-
cates that SMG2000 tends to send many small messages
to many distinct peers. Based on the microbenchmark re-
sults, our expectation is that a UD-based MPI implementa-
tion will perform well in this sort of scenario due to superior
small-message latency and a connectionless communication
model.
Figure 8 shows the execution time of the SMG2000
solver phase for varying numbers of processes. For this application,
UD has a clear advantage, although part of it is given up
to the overhead of the data reliability protocol. The short execution time
of our tests emphasizes UD’s better connection setup performance,
especially as scale increases.
[Figure 8. SMG2000 Solver Time. SMG2000 solver phase time (s) for P=216, P=512, P=1000, P=1728, P=2744, and P=4096; series: UD/OB1, UD/DR, OpenIB, OpenIB/SRQ.]
6 Conclusions and Future Work
There are several conclusions that can be drawn from
the results in this paper. First, an unreliable-datagram based
MPI implementation can be a viable alternative to reliable
connection-based approaches, especially at large scale. The
UD approach provides comparable message passing per-
formance to RC approaches and provides distinct advan-
tages in startup time and memory overhead. Second, our
use of multiple queue pairs for sending messages optimizes
UD communication bandwidth. Finally, our implementa-
tion demonstrates how components in a component-based
software architecture can be reused to solve new problems.
In particular, we were able to provide data reliability in the
UD implementation by reusing the data reliability component
(Data Reliability PML), thereby reducing code complexity
and maintenance costs.
Microbenchmarks and certain applications (e.g., Net-
PIPE and SMG2000) show that the UD BTL incurs a slight
performance penalty in some cases when compared to Open
MPI’s existing RC implementation. On the other hand,
our UD-based implementation offers reduced memory over-
head and fast startup times, especially for applications com-
municating with many peers. SMG2000 and ABINIT show
that applications can benefit, especially at scale. As larger
IB systems are deployed, the highly scalable, low-overhead
characteristics of a UD-based implementation will be pre-
ferred over RC-based implementations.
Future work will investigate the use of UD-specific reliability
schemes to minimize the performance penalties associated with
the current DR-based implementation. Alternative solutions to
the software flow control problem will also be explored.
Acknowledgements
This work was supported by a grant from the Lilly
Endowment and National Science Foundation grant EIA-
0202048. This research was funded in part by a gift from
the Silicon Valley Community Foundation, on behalf of the
Cisco Collaborative Research Initiative of Cisco Systems.
Sandia is a multiprogram laboratory operated by Sandia
Corporation, a Lockheed Martin Company, for the United
States Department of Energy’s National Nuclear Security
Administration under Contract DE-AC04-94-AL85000.
This work was performed under the auspices of the U.S.
Department of Energy by Lawrence Livermore National
Laboratory under Contract DE-AC52-07NA27344.
UCRL-CONF-235949.
References
[1] F. Bottin and G. Zerah. Formation enthalpies of monovacan-
cies in Aluminium and Gold a large-scale supercell ab initio
calculation. submitted to Physical Review B, 2006.
[2] P. N. Brown, R. D. Falgout, and J. E. Jones. Semicoarsening
multigrid on distributed memory machines. SIAM Journal
on Scientific Computing, 21(5):1823–1834, 2000.
[3] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Don-
garra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett,
A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham,
and T. S. Woodall. Open MPI: Goals, concept, and design
of a next generation MPI implementation. In Proceedings,
11th European PVM/MPI Users’ Group Meeting, pages 97–
104, Budapest, Hungary, September 2004.
[4] X. Gonze, J.-M. Beuken, R. Caracas, F. Detraux, M. Fuchs,
G.-M. Rignanese, L. Sindic, M. Verstraete, G. Zerah, F. Jol-
let, M. Torrent, A. Roy, M. Mikami, P. Ghosez, J.-Y. Raty,
and D. Allan. First-principles computation of material prop-
erties : the ABINIT software project. Computational Mate-
rials Science 25, 478-492, 2002.
[5] R. L. Graham, S.-E. Choi, D. J. Daniel, N. N. Desai, R. G.
Minnich, C. E. Rasmussen, L. D. Risinger, and M. W.
Sukalski. A network-failure-tolerant message-passing system
for terascale clusters. International Journal of Parallel
Programming, 31(4):285–303, August 2003.
[6] InfiniBand Trade Association. Infiniband Architecture Spec-
ification Volume 1, Release 1.2. InfiniBand Trade Associa-
tion, 2004.
[7] M. J. Koop, S. Sur, Q. Gao, and D. K. Panda. High perfor-
mance MPI design using unreliable datagram for ultra-scale
InfiniBand clusters. In ICS ’07: Proceedings of the 21st
annual international conference on Supercomputing, pages
180–189, New York, NY, USA, 2007. ACM Press.
[8] M. J. Koop, S. Sur, Q. Gao, and D. K. Panda. Zero-copy
protocol for mpi using infiniband unreliable datagram. IEEE
Cluster 2007: International Conference on Cluster Comput-
ing, Austin, TX, USA, September 17-20, 2007.
[9] Message Passing Interface Forum. MPI-2: Extensions to
the Message Passing Interface, July 1997. http://www.mpi-
forum.org.
[10] G. M. Shipman, R. L. Graham, and G. Bosilca. Network
fault tolerance in open MPI. In Proceedings, Sixth Interna-
tional Workshop on Algorithms, Models and Tools for Paral-
lel Computing on Heterogeneous Networks, Rennes, France,
September 2007.
[11] G. M. Shipman, T. S. Woodall, G. Bosilca, R. L. Graham,
and A. B. Maccabe. High performance RDMA protocols
in HPC. In Proceedings, 13th European PVM/MPI Users’
Group Meeting, Lecture Notes in Computer Science, Bonn,
Germany, September 2006. Springer-Verlag.
[12] G. M. Shipman, T. S. Woodall, R. L. Graham, A. B. Mac-
cabe, and P. G. Bridges. Infiniband scalability in open mpi.
In Proceedings of IEEE Parallel and Distributed Processing
Symposium, April 2006.
[13] R. Sivaram, R. K. Govindaraju, P. H. Hochschild, R. Black-
more, and P. Chaudhary. Breaking the connection: Rdma
deconstructed. In Hot Interconnects, pages 36–42. IEEE
Computer Society, 2005.
[14] Q. Snell, A. Mikler, and J. Gustafson. NetPIPE: A Network
Protocol Independent Performance Evaluator. In IASTED
International Conference on Intelligent Information Management
and Systems, June 1996.
[15] J. M. Squyres and A. Lumsdaine. A Component Archi-
tecture for LAM/MPI. In Proceedings, 10th European
PVM/MPI Users’ Group Meeting, number 2840 in Lecture
Notes in Computer Science, pages 379–387, Venice, Italy,
September / October 2003. Springer-Verlag.
[16] S. Sur, M. J. Koop, and D. K. Panda. High-performance and
scalable MPI over InfiniBand with reduced memory usage:
an in-depth performance analysis. In SC ’06: Proceedings
of the 2006 ACM/IEEE conference on Supercomputing, page
105, New York, NY, USA, 2006. ACM Press.
[17] J. Vetter and F. Mueller. Communication characteristics of
large-scale scientific applications for contemporary cluster
architectures. In 16th Intl. Parallel & Distributed Processing
Symp., May 2002.
CollGM - A Myrinet/GM optimized collective component for Open MPI
Torsten Hoefler1,2 Marek Mosch2 Torsten Mehlan2
Wolfgang Rehm2
1Open Systems Laboratory, Indiana University, Bloomington IN 47405, USA
htor@cs.indiana.edu
2Technical University of Chemnitz, Chemnitz, 09107 GERMANY
{htor,mosm,tome,rehm}@cs.tu-chemnitz.de
Abstract
The Open MPI collective framework offers a way to implement
hardware-specific collective operations for Open
MPI. We used this framework to develop a Myrinet/GM
collective component by combining common knowledge of
the implementation of collective algorithms with GM pro-
tocol optimized techniques to achieve highest performance.
Our results show that the good performance of the exist-
ing point-to-point based tuned collective implementation in
Open MPI can be improved with the use of these techniques.
1 Introduction
Cluster systems dominate today’s high performance computing
(HPC) market due to their excellent price-performance ratio
(cf. Top 500 list 06/2007). Small and mid-sized cluster systems
in particular are built from commodity components. However,
commodity interconnection networks like Gigabit Ethernet
are often not able to deliver the required communication
performance. Thus, special cluster interconnection networks
are often used to connect workstations to a cluster system.
One of those specialized networks is Myrinet [1], distributed
by the company Myricom. It has been analyzed in detail by
Qian et al. in [14]. Myrinet is defined in the ANSI standard
document ANSI/VITA 26-1998 [18]. It offers special features,
such as low latency, cut-through switching, communication
offload, flow control and continuous link monitoring, that are
not common in commodity Ethernet networks. Those features
are especially beneficial in the context of HPC applications.
Myrinet also supports large systems with switches that can
connect up to 512 nodes.
Two versions of Myrinet are currently available, Myrinet
2000 and Myrinet 10G. For Myrinet 2000, two main Ap-
plication Programming Interfaces (APIs) are available, the
Glenn’s Messages (GM) API [12] and the Myrinet Express
(MX) [11] API. Both are fundamentally different. The GM
API resembles a Virtual Interface Architecture (VIA) [3]
and the MX API is closer to the Message Passing Interface
(MPI) standard [8, 9]. Myricom decided to discontinue the
support for the GM API, but there are still many systems
that run on this well proven and stabilized API (e.g., Eu-
rope’s currently fastest Supercomputer Mare Nostrum, the
256 CPU “Strider” cluster at the High Performance Com-
puting Center Stuttgart or the 16 CPU “Oscar” cluster at the
Technical University of Chemnitz).
While it seems natural to map MPI point-to-point oper-
ations to the MX API (which offers non-blocking point-to-
point functionality similar to MPI), it is not that obvious
for collective communication. The similarity of MX and
MPI suggests that the potential of a low-level implemen-
tation compared to the existing highly optimized collective
component (based on point-to-point messages) is very low.
Thus, we explore the possibility to implement collective op-
erations directly on top of the low-level GM API to use the
full semantics for collective algorithm design.
Our main theses for optimization potential in GM are:
1. special GM optimized algorithms (e.g., n-ary trees)
2. special handling of memory registration/de-
registration
3. optimized small/large message handling
4. avoiding the overhead of the point-to-point messaging
layer in MPI (PML/BTL, see Section 3)
5. optimized message forwarding (uses pre-registered
buffers to forward messages)
The remainder of this document is structured as follows. The
GM API is described in detail in Section 2. We sketch our
implementation, which uses the semantic advantages of the GM
API, in Section 3, followed by benchmark results in Section 4.
Conclusions and future work are presented in Section 5.
2 The Myrinet/GM API
The sole vendor of Myrinet hardware, Myricom, provides
several software stacks for interfacing the network.
As discussed in Section 1, the programmer can choose between
two APIs, GM (see footnote 2) and MX. Each of these alternatives
comprises a user space library, an operating system driver
and firmware intended to run on the network adapter’s
processing unit. The two versions of GM and MX are incompatible
since they use different wire protocols. We present a short
overview of the GM API.
The GM software stack provides the GM Mapper for
automatic network discovery. This entity detects network
switches and hosts and calculates the routing information.
Thus, any user application can assume that the network is
properly configured. Applications usually access the net-
work via the functions of the library. These functions serve
as interface to the operating system driver as well as to the
hardware directly (bypassing the operating system).
Each application has to connect to the hardware by opening
a so-called “port” to access further services of GM. The
port works as a software context and as the logical root of any
other resources that may be needed to communicate over the
network. The logical network port belongs to a specific
application and is not accessible from other applications. GM–1
provides at most 8 ports, while GM–2 provides at most
16 ports. Figure 1 shows the organization of GM ports.
The basic concept behind the GM interface is based on
queues. Each port provides one send queue, one receive
queue and one event queue. The application performs queue
operations to send and receive data. The protocol is
connectionless and guarantees lossless, reliable, in-order
transmission of all data. The working principle of Myrinet,
mainly the simple hardware design of the switches, does not
permit native support of multicast and broadcast messages.
However, GM allows zero-copy transmission of data. Thus,
the virtual addresses of the user space have to be translated
into physical addresses before any data transmission can
take place. The DMA engine of the network adapter has
to issue physical addresses while the user application only
has knowledge of the virtual addresses.
Footnote 2: two versions of GM exist, GM–1 and GM–2; we
concentrate on the more recent GM–2 in this paper.
[Figure 1 diagram labels: four applications, each with one or more ports, attached to the network]
Figure 1. Communication ports of GM
The functions of the GM library reflect the attributes
of the GM protocol described so far. The application
opens and closes a port by calling gm_open() or
gm_close() respectively. The send and receive buffers
have to be registered with the library before use, which
effectively pins the pages in memory (disables swapping).
Memory registration is performed by the functions
gm_register_memory() or gm_dma_alloc(). The
latter function allocates and registers memory in one
step. De-registration of memory is done by the functions
gm_deregister_memory() and gm_dma_free().
The function gm_send_with_callback() appends one
send request to the send queue and returns immediately.
The real data transfer is done by the network adapter’s
DMA engines while the host CPU is free to do other work.
To prepare the receipt of a message, the application
has to post a receive descriptor to the receive queue that
describes a buffer in memory. This is done by a call
to gm_provide_receive_buffer_with_tag(). On
completion of the receive request, a receive event becomes
available in the event queue. Once the data transfer is
complete or an error has occurred, a corresponding
event descriptor is put into the event queue of the related
port. The application calls either gm_receive() or
gm_blocking_receive() to get the next event from the
event queue. Until the data transfer has completed, the
application must not change the contents of a send buffer and
must not make any assumptions about the content of a receive
buffer. After a successful data transmission, the application
can access the data buffer: the send buffer may be changed in
any way and the receive buffer is guaranteed to contain valid
data. Figure 2 shows the message reception process.
[Figure 2 diagram labels: application (program control) in user space; receive queue and event queue on the network card; gm_provide_receive_buffer(), gm_receive(); incoming message from Myrinet]
Figure 2. Reception of a message in GM
The Myrinet GM user interface supports Remote Direct
Memory Access (RDMA) as well. Using RDMA the ap-
plication can directly write to remote memory or read from
remote memory. The receive queue of the remote applica-
tion is not involved in any way. Thus, the only way for the
remote application to notice an RDMA write operation is
reading the memory where it expects the data. Myrinet does
not support any access restrictions for RDMA operations.
Thus any process is able to read and write the registered
memory of all other processes with open Myrinet ports.
A comparison between Myrinet GM and the more recent
InfiniBand technology [17] shows that both network
technologies use fairly similar concepts. The user interfaces
of both Myrinet GM and InfiniBand are based on
queues. The application has to post requests to the send
queue or the receive queue, respectively. The completion of
requests is signaled via a completion queue. The difference
between Myrinet GM and InfiniBand is the addressing
scheme. Myrinet GM uses a connectionless mechanism
while InfiniBand provides both end-to-end connections
and connectionless datagrams. One should note that
the connectionless datagram service of InfiniBand works
differently from the mechanism of Myrinet GM. Finally,
both network technologies require registered memory for
data transfers. The process of memory registration and
de-registration in InfiniBand [10] and Myrinet GM is very
similar.
3 Implementation of CollGM
This section describes the implementation of the collgm
component.
3.1 Open MPI Structure
Open MPI [5] is based on the Modular Component
Architecture (MCA), which provides a flexible way of defining
frameworks [16]. For instance, there are frameworks
handling point-to-point messages, datatype conversion,
collective communication and many other tasks. A single
framework defines an interface that may be implemented by
several components. In this work, the collective framework
serves as the target for the implementation of a component
that handles the MPI functions MPI_Barrier, MPI_Bcast,
MPI_Scatter[v], MPI_Gather[v] and MPI_Alltoall[v]
over Myrinet GM in an optimized way. We refer to this
component as the collgm component in the following
discussion.
3.2 The collective GM component
The collgm component is divided into two parts which
are explained in the following.
The MPI level. The MPI level handles the entire semantics
of the collective operation. This part is responsible for:
• Copying (packing) the actual data of complex data
types into contiguous memory areas
• Selecting the appropriate protocol for sending data
• Handling the protocol transactions, such as message
segmentation
• Selecting an appropriate algorithm that is considered
optimal
• Management of memory registration and de-
registration
In contrast the GM level of the collgm module provides
basic communication services that encapsulate the underly-
ing GM user interface:
• Provide reliable data transfer functions either blocking
or non–blocking
• Encapsulate the underlying protocol
• Save messages that cannot be processed at the time of
reception for later processing
The GM level. The GM level hides several implementation
details of the GM protocol from the upper layer. There
is a hard limit on the number of ports the applications
can use. Thus, we decided to use only one port. Consequently,
there is only one send queue, one receive queue and
one event queue. Any incoming message results in an entry
in this event queue. Since the precise ordering of incoming
messages cannot be controlled by the application, some of
these events have to be saved for later processing. An example
situation occurs while performing a blocking send
operation. The completion of the send operation is reported
through the event queue. Thus, the GM level of the collgm
module polls this queue. It may happen that an unexpected
receive event is reported first. This event has to be saved for
further processing during a subsequent receive operation.
Moreover the GM interface maintains a pool of tokens to
provide some basic flow control. It is the application’s re-
sponsibility to acquire a token before every send or receive
request. The GM level of the collgm module checks the
availability of tokens before the actual send or receive op-
eration starts. In case of a lack of tokens any blocking send
or receive operation blocks until a token becomes available.
The non–blocking functions report an appropriate error.
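The interplay of the single event queue, saved receive events and send tokens can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the collgm code: the class PortModel, its methods and the event tuples are hypothetical names, the adapter is simulated by appending events by hand, and the model assumes that a send token becomes available again once the matching send completion has been read.

```python
from collections import deque


class PortModel:
    """Toy model of the single GM port used by the GM level (hypothetical names)."""

    def __init__(self, send_tokens):
        self.send_tokens = send_tokens  # flow-control tokens still available
        self.event_queue = deque()      # events reported by the simulated adapter
        self.saved_receives = deque()   # receive events that arrived unexpectedly

    def try_send(self, msg_id):
        """Non-blocking send: report an error (False) if no token is left."""
        if self.send_tokens == 0:
            return False
        self.send_tokens -= 1
        return True

    def wait_send_done(self, msg_id):
        """Poll the event queue until the completion of msg_id is found."""
        while self.event_queue:
            kind, ident = self.event_queue.popleft()
            if kind == "recv":
                # Unexpected receive event: keep it for a later receive call.
                self.saved_receives.append(ident)
            else:
                # Assumption: a send completion hands the token back.
                self.send_tokens += 1
                if ident == msg_id:
                    return True
        return False  # queue drained; a real implementation would keep polling

    def receive(self):
        """Serve saved receive events first, then look at the event queue."""
        if self.saved_receives:
            return self.saved_receives.popleft()
        while self.event_queue:
            kind, ident = self.event_queue.popleft()
            if kind == "recv":
                return ident
        return None
```

In this model a receive event that overtakes a send completion is not lost: wait_send_done() stores it, and the next receive() call returns it before looking at the event queue again.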
3.3 Communication Protocols
The MPI level of the collgm module implements two
types of protocols. The eager protocol avoids synchronization
between sender and receiver, and the rendezvous protocol
performs a handshake before the actual data transmission
starts. The eager protocol works with preregistered
memory, and the MPI level of the collgm module is
in charge of copying the data from the source location into
a buffer of registered memory. This memory consists of
segments of 64 kByte size, and larger data blocks have to
be segmented. This helps to improve performance because
the first segment can be processed by the network adapter
while the remaining data is copied into subsequent segments.
Figure 3 shows the measurements with Netgauge
[7] of a transmission of 32 MByte of data with different
segment sizes. The best performance is reached with a segment
size of 64 kByte.
[Figure 3 plot: latency (ms), axis 160 to 360, over segment sizes from 512 kiB down to 1 kiB]
Figure 3. Optimal eager segment size
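The segmentation of an eager message can be sketched in a few lines of Python. The function name eager_segments is hypothetical, and the sketch assumes that 1 kByte means 1024 bytes; it only computes the segment boundaries and does not model the copy into registered memory or the overlap with the network adapter.

```python
SEGMENT_SIZE = 64 * 1024  # eager segment size reported as optimal in the text


def eager_segments(message_len, segment_size=SEGMENT_SIZE):
    """Return (offset, length) pairs that cover a message of message_len bytes."""
    return [
        (offset, min(segment_size, message_len - offset))
        for offset in range(0, message_len, segment_size)
    ]


# The 32 MByte transfer used for the measurement splits into 512 segments.
segments = eager_segments(32 * 1024 * 1024)
print(len(segments))
```

Only the last segment of a message may be shorter than 64 kByte; all others are full segments, so the adapter can start sending the first one while the host is still copying the rest.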
A limit on the number of segments to be transmitted prevents
the sender from flooding another node. If a large message
needs more segments, the sender has to send a request and
wait for a positive response. This request is sent at the
beginning of the data transfer, increasing the probability
that a positive acknowledgment arrives before the data
transfer has to be interrupted. The scheme is shown in Figure 4.
[Figure 4 diagram labels: sender and receiver timelines; 1st segment, request for more segments, ack]
Figure 4. Flow control of large messages
The rendezvous protocol does not copy the data into local
memory. Instead, the original memory area is registered on
the fly at the sender. The receiver also has to register the
appropriate memory area. Thus, two messages are needed to
announce the pending data transfer and to report the
receiver’s memory address back to the sender. In order to
avoid any confusion with entries in the receive queue
belonging to eager protocol messages, the rendezvous protocol
makes use of the RDMA feature of GM. The initiator of the
RDMA transaction directly writes into the remote memory.
Moreover, the protocol does not register the entire memory
area at once. Blocks of 128 kByte size are registered in a
pipeline fashion (cf. [15]). The data transfer of a block
starts as soon as its registration has finished. While
the data transfer progresses, the collgm module is able to
start the next memory registration. Since memory registration
is time-consuming, this helps to partially hide the
registration process behind the data transfer. The
principle is shown in Figure 5.
[Figure 5 diagram labels: sender and receiver timelines; RDMA request, fragments, acks]
Figure 5. RDMA transfer of large messages in
a pipeline fashion
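The effect of overlapping registration and transfer can be estimated with a simple timing model. The following Python sketch is an illustration under assumptions, not a measurement: the function names and the per-block costs reg_time and xfer_time are hypothetical, every block is assumed to cost the same, and registrations and transfers are each assumed to run strictly one after another.

```python
def sequential_time(n_blocks, reg_time, xfer_time):
    """Register a block, transfer it, then continue with the next block."""
    return n_blocks * (reg_time + xfer_time)


def pipelined_time(n_blocks, reg_time, xfer_time):
    """Register block i+1 while block i is still being transferred."""
    reg_done = 0.0   # time at which the current block is registered
    xfer_done = 0.0  # time at which the previous transfer has finished
    for _ in range(n_blocks):
        reg_done += reg_time
        # A transfer needs its block registered and the adapter to be free.
        xfer_done = max(xfer_done, reg_done) + xfer_time
    return xfer_done


# Four blocks, registration 1 time unit, transfer 2 time units per block.
print(sequential_time(4, 1.0, 2.0), pipelined_time(4, 1.0, 2.0))
```

In this model only the registration of the first block remains fully visible when the transfer is the slower step; if registration is slower than the transfer, the total time is dominated by the registrations plus one final transfer.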
3.4 The Algorithms of the MPI level
The Myrinet GM interface does not provide native
broadcast or multicast features. Thus, the collgm module
relies completely on point-to-point messages to implement
the collective communication. Many algorithms are available
in this area. Thus, we performed measurements in order
to decide which algorithm is beneficial for Myrinet/GM
and which algorithms are suitable for different data sizes
and communicator sizes. The barrier implementation relies
on recursive doubling and the algorithm of Bruck [2]. The
benchmark section shows that the results are very close to
those of the tuned module [4] of Open MPI.
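The communication pattern of recursive doubling can be illustrated as follows. This Python sketch is not taken from collgm: the function names are hypothetical, it covers only power-of-two process counts (the general case, for which the text names the algorithm of Bruck, is not shown), and message exchange is simulated by merging sets of ranks.

```python
def recursive_doubling_rounds(size):
    """Partner of every rank in each round; size must be a power of two."""
    if size < 1 or size & (size - 1):
        raise ValueError("this sketch only handles power-of-two sizes")
    rounds = []
    distance = 1
    while distance < size:
        # In each round, rank r exchanges a message with rank r XOR distance.
        rounds.append([rank ^ distance for rank in range(size)])
        distance *= 2
    return rounds


def simulate_barrier(size):
    """Return, for each rank, the set of ranks it has heard from (directly or not)."""
    known = [{rank} for rank in range(size)]
    for partners in recursive_doubling_rounds(size):
        known = [known[r] | known[partners[r]] for r in range(size)]
    return known
```

After log2(size) rounds every rank has heard, directly or indirectly, from every other rank, which is the condition for leaving the barrier.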
The options for implementing the broadcast operation
are a flat tree (linear), a binomial tree, a binary tree, a split
binary tree [13] and a pipeline. Measurements show that
very small messages benefit from a binomial tree. Small
to standard size messages should use the split binary
tree, and large messages show the best performance with the
pipeline scheme.
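As an illustration of one of these options, the following Python sketch computes the forwarding targets of each rank in a binomial-tree broadcast. It shows the textbook construction, not necessarily the one used in collgm; the function name binomial_children and the ordering of the children (largest subtree first) are assumptions.

```python
def binomial_children(rank, size, root=0):
    """Ranks that `rank` forwards to in a binomial-tree broadcast rooted at `root`."""
    vrank = (rank - root) % size  # renumber the ranks so that the root becomes 0
    children = []
    mask = 1
    while mask < size:
        if vrank & mask:
            # Lowest set bit reached: this rank receives in that round
            # and has no children at larger distances.
            break
        child = vrank | mask
        if child < size:
            children.append((child + root) % size)
        mask <<= 1
    children.reverse()  # serve the largest subtree first
    return children


# Forwarding targets of every rank for 8 processes and root 0.
for r in range(8):
    print(r, binomial_children(r, 8))
```

With 8 processes the root sends to ranks 4, 2 and 1, so the message reaches all ranks after three rounds instead of the seven sends a flat tree needs at the root.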
The operations scatter and gather may be accomplished
by a flat tree or a binomial tree. In a flat tree scheme, the
root node of the scatter operation sends the messages
directly to each of the receivers. Accordingly, during a gather
operation the root node receives all messages directly from
the source nodes. The binomial tree algorithm applies some
message aggregation at nodes in the middle of the binomial
tree. Measurements show that for small messages the
binomial tree works best and for large messages (1 kByte or
larger) the flat tree performs best.
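The difference between the two scatter schemes can be made concrete by listing the messages the root has to send. The Python sketch below is illustrative only: all function names are hypothetical, the binomial variant assumes a power-of-two number of processes, and the selection function simply encodes the 1 kByte threshold stated above, assuming 1 kByte = 1024 bytes.

```python
def flat_scatter_root_sends(size, block_bytes):
    """Flat tree: the root sends one block to each of the other ranks."""
    return [block_bytes] * (size - 1)


def binomial_scatter_root_sends(size, block_bytes):
    """Binomial tree: the root hands half of its remaining blocks to one child per round."""
    sends = []
    remaining = size  # blocks still held by the root, including its own
    while remaining > 1:
        half = remaining // 2
        sends.append(half * block_bytes)  # aggregated message for a whole subtree
        remaining -= half
    return sends


def choose_scatter_algorithm(block_bytes):
    """Selection rule from the text: binomial below 1 kByte, flat tree otherwise."""
    return "binomial" if block_bytes < 1024 else "flat"
```

For 8 processes and 100-byte blocks the flat tree issues seven 100-byte sends at the root, while the binomial tree issues three sends of 400, 200 and 100 bytes. The root sends the same number of bytes in both cases, but the binomial tree needs fewer messages at the root, at the price of forwarding the aggregated data at the inner nodes.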
Message forwarding. Collective operations are often
implemented on top of MPI point-to-point functions, as in
the tuned module of Open MPI. Network technologies like
Myrinet or InfiniBand require registered memory for data
transfers. Thus, each MPI point-to-point function has to
copy the user data into a preregistered memory area or
register/de-register the user buffer on the fly. This design
can lead to performance loss when nodes have to forward
messages, as in the pipeline broadcast algorithm. With
point-to-point functions, message forwarding works as
follows: First, MPI_Recv copies the received message into
the specified user buffer. The following MPI_Send function
must copy the message again, this time from the user buffer
to a preregistered memory area (see footnote 3). The
collective functions of the collgm module, in contrast, have
direct access to the transfer buffers. This has several
advantages. A received message can be forwarded immediately
by performing a send request using the preregistered receive
buffer. Furthermore, the message is copied only once (into
the user buffer) while the network adapter already sends
the data to other node(s). The broadcast algorithms of the
collgm module make extensive use of this technique.
4 Microbenchmark Results
We benchmarked our implementation on the Strider system
at the High Performance Computing Center Stuttgart
(HLRS). This cluster system consists of 125 dual 2 GHz
Opteron compute nodes connected by Myrinet 2000 running
the GM-2 API. We analyze alltoall, broadcast and
scatter/gather operations for small (16) and large (64) node
counts, running with a single process per node. All benchmarks
have been conducted with NBCBench [6].
4.1 Small node counts
Figure 6 shows the MPI_Alltoall performance on 16
nodes. Both the Open MPI tuned implementation and the
collgm component use a hard-coded, hand-tuned map of
algorithms for every combination of communicator and data
size. The map of the fastest algorithms (also comparing to
OMPI/tuned and MPICH-GM) for alltoall is displayed in
Figure 7. This map was used to hard-code the algorithm
selection in the collgm collective component.
Figure 8 shows MPI_Bcast performance measurement
results. A similar map as for alltoall has been benchmarked
for broadcast and is shown in Figure 9.
MPI_Scatter results are shown in Figure 10. Results
for MPI_Gather are, due to the similar implementation,
identical and are omitted here. The algorithm selection
map in Figure 11 shows the optimal algorithm for every
node-count/data size combination.
Results for 64 nodes of the Strider system for broadcast and
scatter/gather are shown in Figures 12 and 13, respectively.
Footnote 3: assuming no zero-copy implementation.
[Figure 6 plot: latency (ms), axis 0 to 80, over data size 0 to 500 kiB; series: MPICH-GM, OMPI/tuned, OMPI/collgm]
Figure 6. Alltoall results on 16 nodes
Figure 7. Alltoall algorithm selection map
[Figure 8 plot: latency (ms), axis 0 to 14, over data size 0 to 500 kiB; series: MPICH-GM, OMPI/tuned, OMPI/collgm]
Figure 8. Bcast results on 16 nodes
Figure 9. Broadcast algorithm selection map
[Figure 10 plot: latency (ms), axis 0 to 35, over data size 0 to 500 kiB; series: MPICH-GM, OMPI/tuned, OMPI/collgm]
Figure 10. Scatter results on 16 nodes
The alltoall benchmark aborted with GM errors with all
three implementations when run on 64 nodes.
Figure 11. Scatter algorithm selection map
[Figure 12 plot: latency (ms), axis 0 to 9, over data size 0 to 500 kiB; series: MPICH-GM, OMPI/tuned, OMPI/collgm]
Figure 12. Bcast results on 64 nodes
[Figure 13 plot: latency (ms), axis 0 to 140, over data size 0 to 500 kiB; series: MPICH-GM, OMPI/tuned, OMPI/collgm]
Figure 13. Scatter results on 64 nodes
5 Conclusions and Future Work
Our work is the first extensive collective implementation
that uses the advantages of the Open MPI MCA structure to
optimize collective communication for a specific networking
hardware. We showed with the Myrinet/GM interface
that a performance benefit can be achieved with this approach.
We combine common knowledge of the implementation
of collective communication operations with GM
protocol specific techniques to achieve the best performance
on Myrinet/GM cluster systems. However, we are not sure
whether the performance benefit outweighs the relatively
high effort of designing, implementing and maintaining the
collgm component.
References
[1] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik,
C. L. Seitz, J. N. Seizovic, and W.-K. Su. Myrinet:
A gigabit-per-second local area network. IEEE Micro,
15(1):29–36, 1995.
[2] J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby.
Efficient algorithms for all-to-all communications in multi-
port message-passing systems. In Transactions on Parallel
and Distributed Systems. IEEE Computer Society, 1997.
[3] D. Cameron and G. Regnier. The Virtual Interface Architec-
ture, 2002.
[4] G. E. Fagg, J. Pjesivac-Grbovic, G. Bosilca, T. Angskun,
J. J. Dongarra, and E. Jeannot. Tuned: An open mpi collec-
tive communications component. In Distributed and Paral-
lel Systems, pages 65–72. Springer US, 2007.
[5] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Don-
garra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett,
A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham,
and T. S. Woodall. Open MPI: Goals, Concept, and Design
of a Next Generation MPI Implementation. In Proceedings,
11th European PVM/MPI Users’ Group Meeting, Budapest,
Hungary, September 2004.
[6] T. Hoefler, A. Lumsdaine, and W. Rehm. Implementation
and Performance Analysis of Non-Blocking Collective Op-
erations for MPI. 11 2007. Accepted for publication at the
Supercomputing 2007 (SC07).
[7] T. Hoefler, T. Mehlan, A. Lumsdaine, and W. Rehm. Net-
gauge: A Network Performance Measurement Framework.
9 2007. Accepted for publication at the High Performance
Computing Conference 2007.
[8] Message Passing Interface Forum. MPI: A Message Passing
Interface Standard. 1995.
[9] Message Passing Interface Forum. MPI-2: Extensions to the
Message-Passing Interface. Technical Report, University of
Tennessee, Knoxville, 1997.
[10] F. Mietke, R. Baumgartl, R. Rex, T. Mehlan, T. Hoefler, and
W. Rehm. Analysis of the Memory Registration Process in
the Mellanox InfiniBand Software Stack. In Euro-Par 2006
Parallel Processing, pages 124–133. Springer-Verlag Berlin,
8 2006.
[11] Myricom. A High Performance, Low-Level, Message-
Passing Interface for Myrinet. Myricom, 2006.
[12] Myricom. GM: A message-passing system for Myrinet net-
works. Myricom, 2006.
[13] J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg,
E. Gabriel, and J. J. Dongarra. Performance Analysis of MPI
Collective Operations. In Proceedings of the 19th Interna-
tional Parallel and Distributed Processing Symposium, 4th
International Workshop on Performance Modeling, Evalua-
tion, and Optimization of Parallel and Distributed Systems
(PMEO-PDS 05), Denver, CO, April 2005.
[14] Y. Qian, A. Afsahi, and R. Zamani. Myrinet networks:
A performance study. In NCA ’04: Proceedings of the
Network Computing and Applications, Third IEEE Interna-
tional Symposium on (NCA’04), pages 323–328, Washing-
ton, DC, USA, 2004. IEEE Computer Society.
[15] G. M. Shipman, T. S. Woodall, G. Bosilca, R. L. Graham,
and A. B. Maccabe. High performance RDMA protocols
in HPC. In Proceedings, 13th European PVM/MPI Users’
Group Meeting, Lecture Notes in Computer Science, Bonn,
Germany, September 2006. Springer-Verlag.
[16] J. M. Squyres and A. Lumsdaine. The Component Archi-
tecture of Open MPI: Enabling Third-Party Collective Al-
gorithms. In Proceedings, 18th ACM International Confer-
ence on Supercomputing, Workshop on Component Models
and Systems for Grid Applications, St. Malo, France, 2004.
[17] The InfiniBand Trade Association. Infiniband Architecture
Specification Volume 1, Release 1.2. InfiniBand Trade As-
sociation, 2004.
[18] VITA Standards organization. Myrinet-on-VME, Protocol
Specification. VITA Standards organization, 1998.
Evaluation of Optimized Barrier Algorithms for SCI Networks
with Different MPI Implementations
Boris Bierbaum, Georg Wassen, Stefan Lankes, Thomas Bemmerl
Chair for Operating Systems
RWTH Aachen University
Kopernikusstr. 16
52056 Aachen, Germany
{bierbaum,wassen,lankes,bemmerl}@lfbs.rwth-aachen.de
Abstract
The SCI Collectives Library is a new software package
which implements optimized collective communication op-
erations on SCI networks. It is designed to be coupled to
different higher-level communication libraries (especially
MPI implementations) by adapter modules, thereby giving
them access to these optimized collectives. In this work,
we present the design of the SCI Collectives Library and of
adapter modules for OpenMPI and NMPI. We also describe
various barrier algorithms which we have implemented for
this library and compare their performance to one another
and to the barrier performance of MPI implementations
which include support for SCI.
1. Introduction
The performance characteristics and the design of low-
level interfaces vary greatly between different local area
networks, such as SCI [8], Myrinet, and Ethernet. There-
fore, for an MPI implementation to achieve good applica-
tion performance on a cluster equipped with such a network,
proper support for a specific network architecture must be
developed. This has led to MPI implementations which are
tailored to a specific high-speed network, e.g. SCI-MPICH
[17] for SCI and MPICH-MX for Myrinet. Unfortunately,
this limits the users of a cluster to a certain MPI imple-
mentation, even if other characteristics of it (like thread-
safety or tool support) may be unsatisfactory. These spe-
cific MPI implementations usually contain collective oper-
ations which are highly optimized for the network architec-
ture which the implementation supports.
The SCI Collectives Library provides optimized collec-
tive communication routines for SCI networks and is de-
signed to be adaptable to different MPI libraries. This gives
users of SCI clusters more freedom in their choice of an
appropriate MPI implementation. With the implementation
of this library, we aim to show that it can be coupled
to different MPI libraries without a significant performance
penalty. The library is also meant to serve as a tool for
doing research in the area of collective algorithm design and
implementation on SCI networks. So far, we have imple-
mented various barrier algorithms in the SCI Collectives Li-
brary and evaluated their performance.
The structure of this paper is as follows: Sec. 2 refers to
prior work about barrier algorithms for SCI networks and
describes the MPI implementations mentioned in the later
sections. Sec. 3 details the architecture of the SCI Collec-
tives Library and the way it can be adapted to different MPI
libraries. The barrier algorithms and related benchmark results
are described in Sec. 4, and Sec. 5 concludes the paper.
2. Related Work
[6] compares various barrier implementations on an SCI
cluster. Unfortunately, this comparison is biased by the
best-performing barrier using direct writing to remote SCI
memory (see Sec. 4) while the other ones are based on
MPI_Send and MPI_Recv and therefore suffer from ad-
ditional overhead. In [7], SCI clusters of SMPs are con-
sidered. For the implementation of a barrier on such a
system, the author argues in favour of a dedicated process
per node doing the network-wide synchronization with the
node-internal parts of the barrier performed before and af-
ter that, an approach we followed in the SCI Collectives
Library. The algorithms and data layout for barrier syn-
chronization presented in [18] served as the starting point
for our implementation.
2.1. SCI-MPICH
SCI-MPICH, part of the MP-MPICH software package,
is primarily a channel device for MPICH, called ch_smi
[17]. It is based on the SMI library [2] which in turn
makes use of the low-level SISCI [1] interface for SCI.
SCI-MPICH implements optimized point-to-point and col-
lective operations for SCI networks [18, 19]. Because of its
MPICH heritage, it provides neither thread safety nor full
support for the MPI-2 standard.
2.2. NMPI
NMPI [13] is based on MPICH2 and implements a chan-
nel device with optimized point-to-point operations for SCI.
Compared to SCI-MPICH, it has the advantages which
MPICH2 has over MPICH, but it does not contain opti-
mized collective algorithms. Instead, the standard MPICH2
collective algorithms [16], which are designed for switched
networks and not for the ring or torus topologies of SCI
clusters, are used for SCI, albeit on the basis of the fast
point-to-point operations.
2.3. Open MPI
Open MPI [3] is an MPI implementation which aims to
integrate the features of several older software distributions
like FT-MPI, LA-MPI, and LAM/MPI into a single package.
It also differs from MPICH and MPICH2 in its component-based
architecture, called Modular Component Architecture
(MCA) [15]. The MCA allows for the inclusion of
new functionality and the replacement of software compo-
nents without the need to make source code changes to the
Open MPI distribution, because it can detect and activate
components implemented as shared libraries at runtime.
Point-to-point communication with Open MPI on an SCI
cluster can be done via sockets [14], and an implementation
on top of SISCI has been considered [11], but to our
knowledge no collective communication operations tailored
to SCI networks are available for Open MPI yet.
3. Architecture of the SCI Collectives Library
3.1. Overview
Fig. 1 shows the design of the SCI Collectives Library
from a high-level point of view. The collective algorithms
are implemented inside of the scicoll library, which is
coupled to the MPI libraries via the respective adapters. For
its algorithms, the scicoll library calls point-to-point op-
erations from the higher-level libraries or directly uses the
SISCI interface, when this is preferable (see Sec. 4).
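Figure 1. Architecture of the SCI Collectives Library
[Block diagram: scicoll sits on top of SISCI and connects through the Open MPI adapter to the Open MPI collective framework and through the NMPI adapter to NMPI; further adapters are indicated by "...".]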
3.2. Interface Design
To provide multiple MPI implementations with opti-
mized point-to-point operations on a specific architecture,
the uDAPL interface [10] can be implemented for this archi-
tecture, since there are several MPI implementations which
can make use of this API, e.g. Intel MPI [9] and Open
MPI. For collective communication, there is currently no
such interface available, which made the development of
adapter modules for different MPI implementations neces-
sary. This situation also motivates the design of an interface
for the scicoll library which is suitable to be used by
such adapter modules. The interface for the SCI Collectives
Library has the following main properties:
• Support for all MPI collective operations (which are
not all implemented yet)
• Functions to register point-to-point operations used as
a basis for the collective algorithms
• We plan to provide all collectives also in asynchronous
versions compatible with LibNBC [5] (and to support
non-MPI collectives which may need this)
• The possibility for the user to choose between different
algorithms for a collective operation (if available)
The SCI Collectives API provides functions to initialize
and finalize the library and to create and destroy groups of
processes. For the creation of such a group, an adapter must
provide the pointers to some communication functions (derived
from MPI blocking and nonblocking send and receive)
and can provide settings to override the default algorithm
choice and parameters. The results of precalculations influ-
encing the collective algorithms are stored in internal data
structures. A pointer to that data is returned to the adapter
and must be provided to the collective calls.
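The group-creation step described above can be pictured with a small sketch. It is not the scicoll API: every name in it (`P2POps`, `GroupHandle`, `create_group`, `barrier`, the `barrier_algorithm` setting and its default) is hypothetical and chosen for illustration only. It merely shows an adapter handing over send and receive callables, optionally overriding a default, and receiving a handle that it must pass to each collective call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical stand-ins for the function pointers an adapter registers.
@dataclass
class P2POps:
    send: Callable[[int, bytes], None]   # (destination rank, payload)
    recv: Callable[[int], bytes]         # (source rank) -> payload

# Hypothetical handle holding the precalculated data of one process group.
@dataclass
class GroupHandle:
    size: int
    ops: P2POps
    settings: Dict[str, object] = field(default_factory=dict)

def create_group(size: int, ops: P2POps,
                 overrides: Optional[Dict[str, object]] = None) -> GroupHandle:
    """Create a group; adapter-supplied overrides replace the defaults."""
    settings = {"barrier_algorithm": "nx", "n": size}  # assumed defaults
    settings.update(overrides or {})
    return GroupHandle(size=size, ops=ops, settings=settings)

def barrier(group: GroupHandle) -> str:
    """Placeholder collective: only reports which algorithm would run."""
    return f"{group.settings['barrier_algorithm']} barrier on {group.size} processes"

# Usage: an adapter registers dummy callables and selects another algorithm.
mailbox = {}
ops = P2POps(send=lambda dst, data: mailbox.setdefault(dst, []).append(data),
             recv=lambda src: mailbox[src].pop(0))
handle = create_group(4, ops, overrides={"barrier_algorithm": "hsb"})
print(barrier(handle))
```

Returning an opaque handle keeps the precalculated data inside the library while leaving each adapter free to store the handle wherever its MPI implementation keeps per-communicator state.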
3.3. Adapter Modules
So far, we provide adapters for Open MPI and NMPI.
We plan to develop additional adapter modules and provide
documentation and sample source code to enable the devel-
opment of third-party modules.
The adapters control the initialization of process groups
during the creation of each MPI communicator. They im-
plement send and receive functions using the same inter-
nal functions of the MPI library that are also used inside
of MPI_Send, MPI_Recv etc. Furthermore, they contain
functions for the collective operations that are called by the
MPI library and which in turn call the optimized algorithms
of scicoll. This way, the collective functions are called
with almost no overhead during execution.
Open MPI. In Open MPI, the collective functions are
handled by the coll framework. It is able to deal with
multiple available collective components which implement
a subset or all of the collective routines. The adapter is cur-
rently based on Open MPI 1.2.1 with collective framework
version 1.0.0. Upon the creation of a new MPI communica-
tor, the coll framework queries the available components
to find the one most suitable for this particular setup. That
component is then used to create and initialize a module,
which is an instance of the component. The module returns
pointers to its collective functions to the framework. Func-
tions not implemented by this module are automatically
realized by generic algorithms from the included basic
component. The current version of the collective framework
can use multiple components and does not need to fall
back to the generic functions if the best-suited component
provides only a few collective operations.
The Open MPI adapter is available as a shared library
which maps collective routines required by Open MPI to
the scicoll interface. If put in the correct place, it is de-
tected by the framework and loaded automatically. Open
MPI supports MCA parameters that can be set in configura-
tion files and at the command line to influence the behaviour
of components. The adapter reads its parameters and hands
them to the library. This way, specific algorithms or modes
can be selected by a user.
NMPI. MPICH2, the origin of NMPI, supports the replacement
of its collective functions by hooks that are called
each time a new communicator is created. A hook is a predefined
macro which is overwritten by the adapter to initialize
the library. Only those collective operations which
are implemented by scicoll are replaced by optimized
versions while the others use their original algorithms. The
NMPI adapter is realized as a source code patch, therefore
a rebuild of NMPI is required. Parameters can be passed to
the NMPI adapter via a configuration file.
4. The Barrier Implementation
In MPI [12], a barrier is defined as a synchronization
among a group of processes which blocks the caller un-
til all group members have entered the corresponding call.
Thus no process can proceed with execution after the barrier
while there are processes which have not entered the barrier
yet.
In barrier algorithms, a process marks its arrival at the
barrier by emitting some kind of signal which must then be
passed to all the other processes until every process has re-
ceived enough signals to be sure that each other process has
reached the barrier. Thus, a signal carries the information
about the arrival of one or more processes at the barrier. The
generic barrier algorithms in Open MPI and NMPI use mes-
sage passing functions to send and receive such signals. To
avoid the overhead of the point-to-point communication, we
directly use the SISCI API for our barrier implementation.
SISCI allows the creation of SCI memory segments
which can be exported by a process A and imported by a
process B running on a different node. After B has mapped
the segment into its virtual address space, it can send data to
A via CPU store operations with very low latency for small
messages. A signal for a barrier algorithm can thus be real-
ized by writing via a remote pointer.
The SCI adapters used by us (see Tab. 1) contain stream
buffers, in which write gathering is performed for outgo-
ing data. A sequence check must be done to detect failed
data transfers, which are then repeated until the check suc-
ceeds. Each sequence check contains by default an inherent
store barrier to force the completion of all pending trans-
fers. These sequence checks take more than 5 µs, but fail-
ures are rare, while a remote write operation with a size of
4 bytes stalls the sender’s CPU for about 200 ns. Therefore,
it is preferable to protect as many data transfer operations
as possible with a single check. Furthermore, our experi-
ments revealed that a dedicated store barrier followed by a
sequence check which has its inherent store barrier deacti-
vated is 1.5 to 2 µs faster so that the combination of store
barrier and fast sequence check takes below 4 µs.
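The trade-off can be made concrete with a small calculation. The sketch below is only a rough cost model: it uses the figures quoted above as constants (5 µs is a lower bound for the default sequence check, 4 µs an upper bound for the store barrier plus fast check, 0.2 µs the stall of a 4-byte remote write) and it ignores retries after failed transfers. The function names are chosen for illustration.

```python
# Rough constants taken from the figures quoted in the text (microseconds).
SEQ_CHECK_US = 5.0       # default sequence check: "more than 5 µs" (lower bound)
FAST_CHECK_US = 4.0      # store barrier + fast check: "below 4 µs" (upper bound)
REMOTE_WRITE_US = 0.2    # 4-byte remote write: "about 200 ns"

def cost_check_per_write(writes: int, check_us: float = SEQ_CHECK_US) -> float:
    """Every remote write is followed by its own sequence check."""
    return writes * (REMOTE_WRITE_US + check_us)

def cost_single_check(writes: int, check_us: float = SEQ_CHECK_US) -> float:
    """All remote writes are issued first, then protected by one check."""
    return writes * REMOTE_WRITE_US + check_us

# Columns: writes, check per write, single check, single fast check.
for writes in (1, 4, 8):
    print(writes,
          round(cost_check_per_write(writes), 1),
          round(cost_single_check(writes), 1),
          round(cost_single_check(writes, FAST_CHECK_US), 1))
```

Under these assumptions the check dominates the cost, which is why grouping several remote writes under one check pays off as soon as more than one flag has to be set.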
4.1. Local Synchronization
If multiple processes are running on the same node, one
of them is nominated master to communicate with the other
nodes. The slaves synchronize with the local master via System V
shared memory. Each slave sets a flag and the master
waits until all have checked in before it synchronizes with
the other nodes (Fig. 2). The check-in flags are aligned
at the beginning of a cache line so that as few cache misses
as possible occur. After the remote synchronization, the master
sets a single check-out flag that the slaves are waiting for.
Figure 2. Shared Memory Check-in/-out of Local Processes
[Diagram: at check-in, each slave sets its own cache-line aligned check-in flag while the master polls these flags; at check-out, the master sets a single check-out flag which the slaves poll.]

For any node n, this intra-node synchronization scales
linearly with the number of processes $P_n$ on the node. The
master reads $P_n - 1$ flags from the local memory and writes
a single one in addition to the remote synchronization. The
slaves issue only a single write and read operation. This
is very fast compared to the remote memory access and reduces
the problem of the synchronization of P processes on
connected SMP nodes to the synchronization of N nodes
(with N ≤ P) [18].
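The check-in/check-out protocol can be tried out with a short simulation. The sketch below is an assumption-laden stand-in: it uses Python threads and plain lists instead of processes and System V shared memory, it does not model cache-line alignment, and the remote synchronization is only marked by a comment.

```python
import threading
import time

def local_barrier_demo(num_procs: int) -> list:
    """Simulate one check-in/check-out round; returns the order of events."""
    check_in = [False] * (num_procs - 1)   # one flag per slave
    check_out = [False]                    # single flag written by the master
    events = []
    lock = threading.Lock()

    def log(msg):
        with lock:
            events.append(msg)

    def slave(idx):
        check_in[idx] = True               # single write: announce arrival
        while not check_out[0]:            # poll the single check-out flag
            time.sleep(0.0001)
        log(f"slave {idx} released")

    def master():
        while not all(check_in):           # poll the P_n - 1 check-in flags
            time.sleep(0.0001)
        log("master: all slaves checked in")
        # The remote (inter-node) synchronization would take place here.
        check_out[0] = True                # single write releases all slaves

    threads = [threading.Thread(target=slave, args=(i,))
               for i in range(num_procs - 1)]
    threads.append(threading.Thread(target=master))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

events = local_barrier_demo(4)
print(events[0], len(events))
```

No slave can be released before the master has seen every check-in flag, because the check-out flag is written only after the master's polling loop has finished.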
4.2. Remote Synchronization
For the synchronization among the nodes, each master
process exports a local SCI segment and imports the seg-
ments of the other nodes. Each flag is aligned at the top of
the stream buffer size so that writing to that position makes
the SCI adapter issue the network transfer instantly. The
flags are always located at the receiver so that setting the
flag requires one data transfer and waiting for the flag can
be done by polling a variable in local memory.
The communication pattern is prepared during initializa-
tion and stored as barrier-data within the MPI communica-
tor. During the barrier call, each process just executes the
precomputed list of write and read operations.
Hierarchical Shared Memory Barrier. The hsb algorithm
described in [18] and implemented in SCI-MPICH
concentrates the arrival signals of all processes at a tree root
(a gather pattern forming an $f_{in}$-ary tree) and afterwards
broadcasts the arrival information in the opposite direction
(via an $f_{out}$-ary tree). In SCI-MPICH, this algorithm is
performed with $f = f_{in} = f_{out} = 8$. We re-implemented it
and did an experimental evaluation to find the optimal
value of f on our cluster. Our experiments did not yet show
any advantages of setups with $f_{in} \neq f_{out}$.
As an example for the hsb algorithm, Fig. 3 shows the
synchronization of seven nodes with a ternary tree (f = 3).
Node 1 waits until nodes 4, 5 and 6 have set their flags and
then sets its fan-in flag at node 0. Once node 0 has detected
the flags from nodes 1, 2 and 3, it begins the fan-out process
by setting the corresponding flags in these nodes. Node 1,
which was waiting for this event, passes the signal on to its
children and returns from the barrier call.

Figure 3. Hierarchical Shared Memory Barrier
[Tree diagram: node 0 is the root with children 1, 2 and 3; nodes 4, 5 and 6 are the children of node 1.]
Each node waits for up to f children by reading local
memory until a flag is set. Except at the tree root, a single
remote write operation with sequence check follows, and then
the node waits for the fan-out flag. Finally, the children are
released by up to f remote write operations and a single
sequence check. The effort of each process is highly scalable,
but the further down a node is located in the tree, the longer
it has to wait until the signal has been passed to the tree root
and back. This algorithm performs $2 \cdot \lceil \log_f N \rceil$ steps.
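The tree structure used for fan-in and fan-out can be sketched in a few lines. The heap-style node numbering below (children of node k are f·k+1 to f·k+f) is an assumption; it reproduces the seven-node ternary example of Fig. 3, but the actual implementation may map nodes differently. The function names are chosen for illustration.

```python
def hsb_tree(num_nodes: int, fan: int):
    """Return {node: (parent, children)} for a heap-style fan-ary tree."""
    tree = {}
    for node in range(num_nodes):
        parent = None if node == 0 else (node - 1) // fan
        first = node * fan + 1
        children = [c for c in range(first, first + fan) if c < num_nodes]
        tree[node] = (parent, children)
    return tree

def depth(tree, node: int) -> int:
    """Number of tree edges between a node and the root."""
    d = 0
    while tree[node][0] is not None:
        node = tree[node][0]
        d += 1
    return d

tree = hsb_tree(7, 3)
print(tree[0], tree[1], tree[4])
levels = max(depth(tree, n) for n in tree)
print("fan-in levels + fan-out levels:", 2 * levels)
```

For the seven-node example the deepest node is two edges away from the root, so a signal travels two levels up and two levels down, in line with the step count given above for f = 3 and N = 7.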
Exchange Algorithms. The exchange algorithms were
inspired by the binary exchange barrier also presented in
[18]. However, the number of steps of the binary exchange is
bounded by $O(\log_2 N)$ and each step requires a time-consuming
sequence check. To decrease the number of steps, we generalized
the binary exchange to an n-ary exchange (nx) so that the
number of steps is bounded by $O(\log_n N)$, which is better
for n > 2. Each node sets (n − 1) flags per step, but as setting
a flag just creates a small data transmission, this does not
congest the network for a reasonable number of nodes.
Two modes are realized. In the first one, groups of n nodes
synchronize themselves in each of $\log_n N$ steps.
The groups are assembled in such a way that in each step
representatives from different groups meet (Fig. 4) and convey
the synchronizations they made before.

Figure 4. n-ary Exchange
[Diagram: nine nodes P0 to P8 with the exchanges of step 1 and step 2.]

This algorithm has the disadvantage that it requires N
to be a power of n. If this precondition cannot be met,
additional synchronization is needed, resulting in a total of
$\lfloor \log_n N \rfloor + 2$ steps.
This overhead can be avoided by a different communication
pattern (mode 2) derived from the (binary) dissemination
algorithms presented in [4] and used in MPICH2. This
algorithm requires $\lceil \log_n N \rceil$ steps. In step i, each node
$n_{id}$ sets (n − 1) flags at the nodes $n_{id} + n^i \pmod N$,
$n_{id} + 2 \cdot n^i \pmod N$, etc. and waits for the same number
of local flags to be set by other nodes. Figure 5 illustrates
the dissemination of the first node's signal for n = 3 on 9
nodes. In the same manner, the signal of each other node is
dispersed.

Figure 5. n-ary Dissemination
[Diagram: nine nodes P0 to P8 with the signal paths of step 1 and step 2.]
With the factor n, the number of steps (thus sequence
checks) and network packets can be influenced. In the
above example, each node sends and receives 4 signals dur-
ing 2 steps and a total of 9 · 4 = 36 remote write operations
are executed. With n = 9, only a single sequence check is
done, but each node sends and receives 8 signals, a total of
72 remote write operations.
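The counts in this example can be reproduced with a small simulation of the mode 2 pattern. It is a sketch under stated assumptions: steps are numbered from i = 0, a flag is modelled as the sender's set of known arrivals being merged into the target's set, and the function names are chosen for illustration.

```python
def num_steps(num_nodes: int, n: int) -> int:
    """Smallest s with n**s >= num_nodes, computed without floating point."""
    steps, reach = 0, 1
    while reach < num_nodes:
        reach *= n
        steps += 1
    return steps

def disseminate(num_nodes: int, n: int):
    """Simulate mode 2; returns (steps, remote writes, all nodes informed?)."""
    known = [{node} for node in range(num_nodes)]   # arrivals known per node
    steps = num_steps(num_nodes, n)
    writes = 0
    for i in range(steps):
        snapshot = [set(k) for k in known]          # state before this step
        for nid in range(num_nodes):
            for k in range(1, n):                   # (n - 1) flags per step
                target = (nid + k * n**i) % num_nodes
                known[target] |= snapshot[nid]
                writes += 1
    done = all(len(k) == num_nodes for k in known)
    return steps, writes, done

print(disseminate(9, 3))
print(disseminate(9, 9))
```

The third element of each result confirms that every node has learned about the arrival of all other nodes once the last step is finished, which is the condition for leaving the barrier.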
4.3. Benchmark Results
To find optimal parameter settings, we benchmarked
each barrier algorithm with the Intel MPI Benchmarks
(IMB) on our development cluster, which is detailed in
Tab. 1, using the software with the given versions. The re-
sults for the nx algorithms with different values for n are il-
lustrated in Fig. 6, where each additional step can be seen as
an abrupt increase in the time taken for the barrier. Within
a constant number of steps, only a slight increase is visible.
The graph for n = 16 shows that n should not be set greater
than N . Up to 16 nodes, n = N is the optimal parameter
selection for this algorithm. Above 16 nodes, n = 2 and
n = 16 would require an additional step, but n = 6 can
avoid this for up to 36 nodes. By extrapolation, we assume
that the graphs for n = N and n = 6 cross at about 20
nodes.
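The step counts behind these observations can be checked with integer arithmetic. The sketch assumes that the number of steps is the smallest s with n^s ≥ N, as stated for mode 2 above; it says nothing about the measured times, and the function name is chosen for illustration.

```python
def steps_needed(num_nodes: int, n: int) -> int:
    """Smallest s with n**s >= num_nodes (integer arithmetic only)."""
    steps, reach = 0, 1
    while reach < num_nodes:
        reach *= n
        steps += 1
    return steps

# Step counts just below and above the thresholds discussed in the text.
for num_nodes in (16, 17, 36, 37):
    row = {f"n={n}": steps_needed(num_nodes, n) for n in (2, 6, 16)}
    row["n=N"] = steps_needed(num_nodes, num_nodes)
    print(num_nodes, row)
```

Going from 16 to 17 nodes adds a step for n = 2 and n = 16, while n = 6 stays at two steps up to 36 nodes and needs a third one at 37.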
All presented barrier algorithms were analyzed likewise.
By default, for a barrier involving up to 16 nodes, their
parameters are set to the number of nodes (hsb: $f_{in} = f_{out} = N - 1$
and nx: n = N) in the SCI Collectives Library. Above a
certain number of nodes, we limit the parameters and set them
to a much lower value (e.g. 6). This reduces the network
traffic at the cost of additional steps, but should be
faster overall.
Figure 6. Optimization of the n-ary Exchange Algorithm
[Plot: Intel MPI Benchmark Suite 3.0, Barrier (nx-mode 2); x-axis: number of nodes (2 to 16); y-axis: barrier time in µs (2 to 12); curves: n=2, n=6, n=16, n=N.]
Therefore, both algorithms hsb and nx degenerate to se-
quential access (hsb) and all-to-all communication (nx) be-
low 16 nodes. The additional steps do not appear until more
nodes are involved, but the behaviour above that limit is cur-
rently extrapolated and must be confirmed by experiments.
Table 1. Hardware and Software used for Performance Evaluation

Hardware
  Processor      16 x Intel Pentium D 2.8 GHz
  RAM            2 GB per node
  SCI            D352 adapter, 4x4 2D Torus
Software
  DIS            Release 3.2.5
  Open MPI       1.2.1
  SCI-MPICH      rc-1.5
  NMPI           1.2
  Linux Kernel   2.6.18
  IMB            3.0
We measured the performance of our algorithms in comparison
to NMPI and the optimized barrier of SCI-MPICH
on our cluster; the results are depicted in Fig. 7. The NMPI
barrier took more than 15 µs on 3 nodes and about 37 µs on
16 nodes and is therefore not contained in that figure. The
newly implemented hsb algorithm is slightly slower below 9
nodes than the same algorithm in SCI-MPICH. Above that,
SCI-MPICH needs two steps because of the fixed fan parameter
of 8, so that our optimized algorithm becomes faster by using
a single step. The new nx algorithm proved to be faster than
the other barrier routines for any number of nodes up to 16.
Both the hsb and nx algorithms show the same performance
with the Open MPI and the NMPI adapter, demonstrating
that neither of the two adapters introduces too much
overhead.

Figure 7. Comparison of Barrier Implementations
[Plot: Intel MPI Benchmark Suite 3.0, Barrier; x-axis: number of nodes (2 to 16); y-axis: barrier time in µs (0 to 14); curves: hsb, nx-mode 2, sci-mpich.]
If two processes are running on each node, the barrier
time increases by 0.5 µs for the local synchronization (Sec.
4.1) independent of the number of nodes.
5. Conclusion, Outlook, and Acknowledgements
This work describes the SCI Collectives Library, a new
software package designed to provide optimized collective
communication routines for SCI clusters to different MPI
implementations. Our experiences in developing this library as
well as the performance results we present for the barrier
implementation show that this is indeed a feasible goal.
By implementing and evaluating various barrier algo-
rithms in the SCI Collectives Library, we were able to show
that the fan-parameter f for the hsb algorithm was set sub-
optimally in SCI-MPICH. In addition to that, the new nx al-
gorithm shows significantly better performance on our clus-
ter than any other barrier algorithm we tested. Thus, there
is now an improved barrier available for users of Open MPI
and NMPI on SCI clusters.
We plan to implement a full set of collective communica-
tion patterns in this library to support all available collective
functions of MPI. We also strive to develop adapter modules
for MPI implementations which are not yet supported, es-
pecially Intel MPI. We will also explore the possibility to
support other APIs besides MPI, which include collective
communication routines, with our library. Concerning our
barrier algorithms, we are currently conducting tests on a
cluster with more nodes to gather new insights.
We would like to thank Intel Corporation for sponsor-
ing this work and Dolphin Interconnect Solutions for their
support.
References
[1] Dolphin Interconnect Solutions. SISCI API User Guide, Ver-
sion 1.0, May 2001.
[2] M. Dormanns, K. Scholtyssik, and T. Bemmerl. A Shared-
Memory Programming Interface for SCI Clusters. In
H. Hellwagner and A. Reinefeld, editors, SCI: Scalable Co-
herent Interface, pages 281–290. Springer Verlag, 1999.
[3] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Don-
garra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett,
A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham,
and T. S. Woodall. Open MPI: Goals, Concept, and De-
sign of a Next Generation MPI Implementation. In Proceed-
ings of the 11th European PVM/MPI Users’ Group Meeting,
volume 3241 of LNCS, pages 97–104, Budapest, Hungary,
September 2004. Springer.
[4] D. Hensgen, R. Finkel, and U. Manber. Two algorithms for
barrier synchronization. International Journal of Parallel
Programming, Volume 17, Number 1:1–17, February 1988.
[5] T. Hoefler, A. Lumsdaine, and W. Rehm. Implementation
and Performance Analysis of Non-Blocking Collective Op-
erations for MPI. In Proceedings of Supercomputing 2007,
Reno, Nevada, November 2007.
[6] L. P. Huse. Collective Communication on Dedicated Clus-
ters of Workstations. In Proceedings of the 6th European
PVM/MPI Users Group Meeting 1999 (EuroPVM/MPI),
Barcelona, Spain, September 1999.
[7] L. P. Huse. MPI optimization for SMP based clusters inter-
connected with SCI. In Proceedings of the 7th European
PVM/MPI Users Group Meeting 2000 (EuroPVM/MPI),
Lake Balaton, Hungary, September 2000.
[8] IEEE. ANSI/IEEE Std. 1596-1992, Scalable Coherent Inter-
face (SCI), 1992.
[9] http://www.intel.com/go/mpi.
[10] J. Lentini, V. Pham, S. Sears, and R. Smith. Implementa-
tion and Analysis of the User Direct Access Programming
Library. In Proceedings of the 2nd Workshop on Novel Uses
of System Area Networks, SAN-2, February 2003.
[11] T. Mehlan, T. Hoefler, F. Mietke, and W. Rehm. Concepts
for Integrating SISCI into Open MPI. In Proceedings of the
1st Workshop Kommunikation in Clusterrechnern und Clus-
terverbundsystemen, Chemnitz, Germany, November 2005.
[12] MPI Forum. MPI: A Message-Passing Interface Stan-
dard. International Journal of Supercomputing Applica-
tions, 1994.
[13] http://www.nicevt.ru/research/nmpi.
[14] F. Seifert and H. Kohmann. SCI SOCKET - A
Fast Socket Implementation over SCI. [Avail-
able on WWW at http://www.dolphinics.com
/whitepapers/sci-socket.pdf].
[15] J. M. Squyres and A. Lumsdaine. The Component Archi-
tecture of Open MPI: Enabling Third-Party Collective Al-
gorithms. In V. Getov and T. Kielmann, editors, Proc. of the
18th ACM International Conf. on Supercomputing, Work-
shop on Component Models and Systems for Grid Applica-
tions, pages 167–185, St. Malo, France, July 2004. Springer.
[16] R. Thakur and W. Gropp. Improving the Performance of
Collective Operations in MPICH. In Proceedings of the
10th European PVM/MPI Users Group Meeting 2003, volume
2840 of LNCS, pages 257–267, Venice, Italy, September
2003. Springer.
[17] J. Worringen. SCI-MPICH - The Second Generation. In
Proceedings of SCI-Europe 2000 (Conference Stream of
Euro-Par 2000), pages 11–20, Munich, Germany, August
2000.
[18] J. Worringen. Effizienter Nachrichtenaustausch auf
speichergekoppelten Rechnerverbundsystemen mit SCI
Verbindungsnetz. doctoral thesis, RWTH Aachen, 2003.
[19] J. Worringen. Pipelining and Overlapping for MPI Collec-
tive Operations. In Proceedings of the Workshop on High-
Speed Local Networks (HSLN), in conjunction with 28th
Annual IEEE International Conference on Local Computer
Networks (LCN 2003), pages 548–557, Bonn/Königswinter,
Germany, October 2003.
An Approach for Deploying Externally Defined MPI Communicators at Runtime
Carsten Clauss, Stephan Gsell, Stefan Lankes, Thomas Bemmerl
Chair for Operating Systems, RWTH Aachen University
Kopernikusstr. 16, 52056 Aachen, Germany
{clauss, gsell, lankes, bemmerl}@lfbs.rwth-aachen.de
Abstract
When writing parallel applications according to the
MPI standard especially for hierarchical computing envi-
ronments, the recognition of the underlying heterogeneous
hardware structure at application level is not trivial at all.
Although the MPI standard tries to support the application
programmer with some process grouping and mapping fa-
cilities (notably the communicator concept and the topology
mechanism), the actual hardware hierarchy is usually still
kept opaque. In this paper, we present a generalized ap-
proach that allows the programmer to create suitable pro-
cess groups according to the given topologies by externally
defining MPI communicators in corresponding XML files.
We further introduce a small external library for MPI im-
plementations that is able to parse those XML files and can
build the desired communicators at runtime. That way, the
actual hardware hierarchy becomes visible also at applica-
tion level.
1 Introduction
An important feature of the Message Passing Interface
(MPI) [15] is the communicator concept. This concept al-
lows the application programmer to group the parallel pro-
cesses by assigning them to abstract objects called com-
municators. For that purpose, the programmer can split
the group of initially started processes into sub-groups, each
forming a new self-contained communication domain rep-
resented by such a communicator object [13].
This concept usually follows a top-down approach where
the process groups are built according to the communi-
cation patterns required by the parallelized algorithm. In
this way, hierarchical communication structures within the
algorithms can easily be implemented on top of the MPI
layer. However, since most MPI implementations nor-
mally assume homogeneous hardware environments, the
processes are usually mapped onto the available proces-
sors in a transparent way so that an arbitrary (or at least an
implementation-dependent) process-to-processor mapping
is most likely to result. That means that the MPI runtime
system usually does not draw any association between the
logical communication patterns of the algorithm on the one
side and the underlying physical hardware topology on the
other side. Although the MPI standard offers such an as-
sociation feature in terms of the MPI topology mechanism,
its intended functionality is very rarely realized by generic
MPI implementations [19, 18].
Heterogeneity-Aware MPI. In contrast to ordinary MPI
implementations, heterogeneity-aware MPI libraries often
provide dedicated adaptation features which help the appli-
cation programmer to adapt the algorithms’ communication
patterns to the respective heterogeneity of the physical com-
munication topology. However, the implementation of such
optimization features normally follows a bottom-up method
where the topology information must be passed from the
MPI runtime system to the application in a more or less
unconventional way. At this point, two different ways can
be followed: the one way is to aim to keep standard con-
formity, whereas the other way is to sacrifice source code
compatibility e.g. for a more convenient handling. For
instance, several heterogeneity-aware MPI libraries supply
the programmer with additional predefined MPI communi-
cators that try to reflect the underlying hardware hierarchy.
However, if the symbol names of those additional commu-
nicators are already set at compile time of the MPI library,
an application breaks with the standard when using them
and hence makes its source code less portable.
Remainder of the Paper. In this paper, we present a generalized
approach that circumvents the communicator determination
at compile time by providing the programmer
with the ability to use self-predefined MPI communicators
whose group composition is determined only at runtime. After
a brief recapitulation of the communicator concept in
Section 2, we present a small additional library for MPI
implementations in Section 3 that follows our approach
and which is capable of building the desired communica-
tors for the application. Although the approach introduced
is mainly intended to build communicators following the
bottom-up method for heterogeneous computing environ-
ments, it can also be applied for a top-down strategy where
the communicators are built, guided by the programmer, ac-
cording to the patterns of the respective algorithm. There-
fore, application examples for both those methods are pre-
sented in Section 4. An overview of related work and pos-
sible future extensions to our work concludes the paper in
Section 5.
2 The MPI Communicator Concept
In order to follow the novelty of the approach introduced
in this paper, one needs some basic knowledge about the
common way of dealing with MPI communicators. For that
reason, we briefly recapitulate the MPI communicator concept
and its handling on application level in this section. For
more detailed explanations, please refer to [10, 11, 13].
Intra-Communicators. An MPI communicator is, as already
mentioned in the introduction, an abstract object that
represents a (sub-)group of parallel processes defining an
explicit communication space. According to the MPI stan-
dard, there exist three predefined communicators in every
MPI environment. Those communicators are:
• MPI_COMM_WORLD — specifying all started pro-
cesses within an MPI run
• MPI_COMM_SELF — identifying each MPI process
itself
• MPI_COMM_NULL — pseudo communicator repre-
senting invalid communicators
The first and, to a very minor degree, also the second of
those can be used by an application programmer for (intra-
group) communication as well as for deriving new commu-
nicator groups. For that purpose of creating new commu-
nicators, the standard defines, amongst others, the function
MPI_Comm_split() that partitions the group of a par-
ent communicator into disjoint subgroups. Each process of
those subgroups in turn will be associated either with the re-
spective communicator created or with MPI_COMM_NULL
if the calling process does not take part in any of the de-
fined subgroups. At this point it should be emphasized that
the created child communicators within one call to the split
function will get the same communicator names assigned.
That in turn means that a symbol name representing a com-
municator object on application level does not necessarily
represent identical groups for the different processes at run-
time (see Figure 1).
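Figure 1. Splitting of a Communicator
[Diagram: a split of MPI_COMM_WORLD (ranks 0 to 3) yields two disjoint subgroups, each referred to by the same name MPI_COMM_REALM and each with its own rank numbering starting at 0.]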
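The effect of a split can be illustrated without an MPI installation. The following pure-Python simulation mimics the grouping rules of MPI_Comm_split() (same color means same subgroup, new ranks ordered by key and then by old rank, an undefined color yields the null communicator). The function `comm_split`, the constant `UNDEFINED` and the chosen colors are illustrative assumptions, not MPI calls.

```python
UNDEFINED = None  # stands in for MPI_UNDEFINED in this simulation

def comm_split(colors, keys=None):
    """Return, per world rank, (members of its new group, its new rank).

    Ranks whose color is UNDEFINED get None, mirroring MPI_COMM_NULL.
    """
    keys = keys if keys is not None else [0] * len(colors)
    groups = {}
    for rank, color in enumerate(colors):
        if color is not UNDEFINED:
            groups.setdefault(color, []).append(rank)
    result = []
    for rank, color in enumerate(colors):
        if color is UNDEFINED:
            result.append(None)
            continue
        # New ranks are ordered by key, ties broken by the old rank.
        members = sorted(groups[color], key=lambda r: (keys[r], r))
        result.append((members, members.index(rank)))
    return result

# Four world ranks; ranks 0 to 2 choose color 0, rank 3 chooses color 1.
for world_rank, realm in enumerate(comm_split([0, 0, 0, 1])):
    print(world_rank, realm)
```

Every process stores its result under the same variable name, here `realm`, yet the group behind that name differs between world ranks 0 to 2 and world rank 3.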
Inter-Communicators. Besides the regular communicators
introduced in the preceding paragraph, which are
intended for intra-group communication, there also exists
a further type of communicators, which is designated for
inter-group communication. Therefore, communicators of
this second type are formally called inter-communicators,
whereas those of the prior type are formally called intra-
communicators. The difference between both types is that
an inter-communicator is associated with two groups of pro-
cesses: a local and a remote group.
In the context of inter-communicators, the communication
partner is always identified by its rank within the remote
group. That way, messages sent or received via an
inter-communicator are always exchanged between processes
of the two disjoint groups. Accordingly, a new
inter-communicator is created by interlinking two disjoint
intra-communicators through a call to the
MPI_Intercomm_create() function.
At this point it should be emphasized that the returned
data type of an inter-communicator is the same as for a
regular intra-communicator. Hence, in order to be able
to distinguish between those communicator types repre-
sented by the same data type, the standard provides the
MPI_Comm_test_inter() function.
Process Identification When defining new MPI commu-
nicators, the application programmer can build them either
according to the top-down approach, which is the usual
case, or according to the bottom-up approach for hardware
awareness. In the first case, the world ranks of the processes
can serve as a unique differentiator among them.
The second case, following the bottom-up approach, is
a little bit more complicated: First of all, each process has
to determine on which node of the hardware topology it is
running. This identification can be done by utilizing the
result of the MPI_Get_processor_name() function,
whereas the programmer has to supply the correlation be-
tween processor names and the actual topology to be repre-
sented by the new communicators. By comparing its own
name with a list of processor names assigned to correspond-
ing communicator groups, each process can then determine
if it becomes part of a new communicator or not.
3 Implementation Details
3.1 A Library for Communicator Creation
So far, an application programmer usually has to encode
the process identification described in the former section di-
rectly into the respective application. In order to ease the
creation of self-defined communicators on the one hand and
to make the process identification more flexible and inde-
pendent from the application code on the other hand, we
have developed a small software library that relieves the
programmer from dealing with the communicator creation
procedure.
For that purpose, the library provides a special function
that only expects a reference to an uninitialized MPI com-
municator object as well as the symbol name of that object.
When called, the function will look up a table containing the
descriptions of the new communicators indexed by the com-
municator name. If the desired communicator can be found
in the table, each calling process will check the respective
description in order to determine whether it becomes part
of a new communicator group or not. Afterwards, the par-
ticipating processes will build the new communicator while
the others will just return a reference to MPI_COMM_NULL.
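The lookup step described above can be sketched as follows. This is a minimal illustration in plain C under stated assumptions: the structure comm_desc and the function lookup_rank are hypothetical names and do not reproduce the library's actual interface, and the processor name is passed in directly instead of being queried from the MPI environment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one externally defined communicator. */
struct comm_desc {
    const char *name;               /* symbol name, e.g. "MPI_COMM_RED" */
    const char *const *processors;  /* processor names in rank order */
    size_t count;                   /* number of processor names */
};

/* Returns the rank of `processor` in a communicator called `comm_name`,
 * or -1 if the name is unknown or the processor is not a member
 * (the case in which a reference to MPI_COMM_NULL would be returned). */
int lookup_rank(const struct comm_desc *table, size_t entries,
                const char *comm_name, const char *processor)
{
    assert(table != NULL && comm_name != NULL && processor != NULL);
    for (size_t i = 0; i < entries; ++i) {
        /* Several entries may carry the same name, so keep searching. */
        if (strcmp(table[i].name, comm_name) != 0)
            continue;
        for (size_t r = 0; r < table[i].count; ++r)
            if (strcmp(table[i].processors[r], processor) == 0)
                return (int)r;
    }
    return -1;
}
```

A non-negative result tells the calling process that it takes part in building the new communicator, whereas -1 corresponds to the processes that only return the NULL communicator.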
In order to supply the building function with the needed
information stated in the lookup table, an additional initial-
ization function must firstly read the desired communica-
tor configurations from an appropriate XML file. The dis-
placement of the communicator definitions into an external
configuration file offers several advantages and opportuni-
ties: For example, the application does not need to be re-
compiled if the desired grouping scheme or the processor
names have changed. Furthermore, the configuration needs
not necessarily be written by a user or an application pro-
grammer. In fact, the XML file containing the communica-
tor definitions can rather be generated by a process sched-
uler, for instance, or even by the runtime environment of a
heterogeneity-aware MPI implementation.
3.2 The XML Configuration Files
The reasons for XML [20] being the file-format of choice
are that it is human readable, easy to understand and widely
used. Another important reason is that it is hierarchically
structured and thus represents the structure of the computer
system in use quite well. Several public domain XML
parsers exist, of which we chose libxml2 (http://xmlsoft.org/)
for our implementation.
The data is structured by so-called elements.
Usually an element consists of the wanted information
surrounded by a dedicated start tag and the corresponding
end tag. For example, in our implementation we
use <processor>igor</processor> to specify a
processor named igor. For the definition of a pro-
cess group represented by a communicator, we have in-
troduced the <comm> tag. This element may harbor
processor elements as well as other comm elements, al-
lowing a recursive definition style. To name a communi-
cator, we use a corresponding attribute within the tag like
<comm name="MPI_COMM_RED">...</comm>.
Identification by Processor Names In order to identify
the nodes of the hardware topology according to the bottom-
up approach, the communicators have to be associated with
the respective processor names. Therefore, consider the fol-
lowing example of a short configuration file:
<comm name="MPI_COMM_RED">
<processor>pd-01</processor>
<processor>pd-02</processor>
<comm name="MPI_COMM_PINK">
<processor>pd-02</processor>
</comm>
</comm>
<comm name="MPI_COMM_RED">
<processor>pd-03</processor>
</comm>
<comm name="MPI_COMM_BLACK">
<processor>pd-04</processor>
</comm>
Here, the first communicator, named MPI_COMM_RED,
contains the processors pd-01 and pd-02, whereby
the latter is also in a sub-communicator named
MPI_COMM_PINK. It is important to know that each
sub-communicator may only consist of processors that are
also defined in its parent communicator. Therefore it is for
instance not possible in the above example configuration
for the sub-communicator to include the processor pd-03,
since it is not in the parent-communicator. As already
stated in Section 2, the MPI standard allows different
communicator groups to have the same symbol name (as
above with MPI_COMM_RED), which is also supported by
our library.
An additional feature is that processor names may also
include regular expressions, as for example p[d4]-01.
This expression matches the processors pd-01 and p4-01
which can be quite useful for example if you have two clus-
ters at hand that use the same naming scheme. (In fact, PD
and P4 are names of two actual cluster installations at our
institute.)
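How such an expression can be checked against a processor name is sketched below with the POSIX regex interface. Which regular expression dialect the library accepts is not stated here, so the use of POSIX extended syntax, the anchoring of the pattern and the function name processor_matches are assumptions of this sketch.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Returns 1 if `name` matches `pattern` as a whole, 0 otherwise. */
int processor_matches(const char *pattern, const char *name)
{
    assert(pattern != NULL && name != NULL);

    /* Anchor the pattern so that "p[d4]-01" does not match "pd-011". */
    char anchored[256];
    int len = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (len < 0 || (size_t)len >= sizeof anchored)
        return 0; /* pattern too long for this sketch */

    regex_t re;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return 0; /* invalid expression: treat as no match */

    int matched = (regexec(&re, name, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

Under these assumptions, processor_matches("p[d4]-01", name) accepts the names pd-01 and p4-01 and rejects any other name.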
Although the new communicator-internal processor
ranks are typically derived from the order of occurrence in
the XML file, they can also be stated explicitly via an ad-
ditional key attribute. However, in case of a regular ex-
pression, the communicator-internal ranks are determined
by the alphabetical order of the actual processor names.
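The alphabetical ordering for names matched by a regular expression can be illustrated with a short sketch. It assumes a byte-wise strcmp comparison and distinct processor names; the function name alphabetical_rank is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rank of names[me] when all matched names are ordered alphabetically
 * (here: byte-wise strcmp order, an assumption of this sketch). */
int alphabetical_rank(const char *const *names, size_t n, size_t me)
{
    assert(names != NULL && me < n);
    int rank = 0;
    for (size_t i = 0; i < n; ++i)
        if (strcmp(names[i], names[me]) < 0)
            ++rank; /* every name sorting before ours pushes us back */
    return rank;
}
```

For the two names matched by p[d4]-01, this ordering puts p4-01 before pd-01, because the digit 4 precedes the letter d in the ASCII character set.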
Inter-Communicators In analogy to Section 2, it is possible
to define inter-communicators, too. In the XML file
this can be accomplished in a way similar to the following:
<intercomm name="MPI_COMM_INTER">
<first color="1">MPI_COMM_RED</first>
<second>MPI_COMM_BLACK</second>
</intercomm>
This code fragment defines an inter-communicator
named MPI_COMM_INTER between the communicators
MPI_COMM_RED and MPI_COMM_BLACK. Since inter-
communicators may only connect two communicators that
have the same parent, the intercomm element may only
stand within a comm element (or in the top level node with
MPI_COMM_WORLD being the common parent). In our im-
plementation, different communicators with the same name
are distinguished by their different colors, which again is
just a value between zero and the number of the equally
named communicators minus one. If the communicator
name is unambiguous, the color statement can be omitted.
That means in this example that the inter-communicator
will represent an interlinking domain between the proces-
sors pd-03 and pd-04.
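The selection of one of several equally named communicators by its color can be sketched as follows. The sketch assumes that colors are assigned in the order in which the definitions appear in the file, which is consistent with the example above but not stated explicitly; the function name find_by_color is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the communicator with the given name and color,
 * where color k selects the (k+1)-th definition carrying that name in
 * order of appearance; returns -1 if there is no such definition. */
int find_by_color(const char *const *names, size_t n,
                  const char *name, int color)
{
    assert(names != NULL && name != NULL);
    int seen = 0; /* number of equally named definitions passed so far */
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(names[i], name) != 0)
            continue;
        if (seen == color)
            return (int)i;
        ++seen;
    }
    return -1;
}
```

Applied to the top-level definitions MPI_COMM_RED, MPI_COMM_RED and MPI_COMM_BLACK, the color 1 selects the second MPI_COMM_RED, that is the group containing pd-03.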
3.3 Portable Integration into Applications
To us, it was very important that our approach chosen for
deploying externally defined communicators ensures porta-
bility. Portability means that the library introduced is inter-
operable with any underlying MPI library on the one side,
and that the respective applications can still be written in
a standard-conforming manner on the other side. However, in
order to utilize the new features introduced, an application
has to be written in a distinctive way that will be described
below. That way, the application can not only be compiled
with and without the additional library but can also still be
started in both cases while switching back to the standard
communicator environment in the latter case of lacking sup-
port.
For that reason, we have chosen to place our library
transparently between the application and the respective
MPI implementation. Thus, the call to the new communica-
tor creation function described in Section 3.1 becomes in-
visible to the application by hiding it inside faked MPI com-
mands. Therefore, the application merely has to include
an also faked mpi.h header instead of the corresponding
header file of the native MPI library when being compiled.
Alternatively, the call to the new communicator creation
function can be placed directly into the application code,
with preprocessor directives ensuring portability.
However, in this paper we focus on the former alternative.
Faked MPI Functions When using this option, the
externally defined communicators are built at runtime
during an appropriate call of MPI_Comm_rank() or
MPI_Comm_size(), assuming that those are among the
first MPI functions called which expect a communicator
as one of the arguments. For that purpose, all occurrences of
those function calls are replaced within the application via
the preprocessor by the following directives and prototypes
(for MPI_Comm_size() in an analogous manner):
#define MPI_Comm_rank( a, b ) \
        MPIX_FAKE_Comm_rank( &a, b, #a )

int MPIX_FAKE_Comm_rank( MPI_Comm *comm,
                         int *rank,
                         char *name );
That way, the library can become aware of the respective
communicator name in case the function is called with the
object’s symbol name as an immediate argument.
That is for example:
MPI_Comm_rank(MPI_COMM_BLACK, &rank);
Here, the symbol name MPI_COMM_BLACK is
passed as the third argument of the fake function in the
string name. By searching in the previously memorized
look up table, the library can now determine whether the
given communicator is an externally defined one, and if so,
how to create the desired entity. Furthermore, since the pre-
processor also converts the former call-by-value style for
the communicator argument into a call-by-reference one,
a reference to the newly built communicator entity can
now be returned to the application level. By this means, a
second search for this communicator becomes unnecessary
for further MPI function calls because the returned refer-
ence now actually represents a valid MPI communicator.
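The interplay of stringification and call-by-reference can be tried out without an MPI installation. The following sketch uses mock types: comm_t, COMM_NULL, fake_comm_rank and the handle value 7 are placeholders chosen for illustration and are not part of the library or of MPI.

```c
#include <assert.h>
#include <string.h>

typedef int comm_t;            /* mock handle type standing in for MPI_Comm */
#define COMM_NULL (-1)         /* mock of MPI_COMM_NULL */

/* Mock of the fake function: builds the communicator on the first call if
 * its symbol name is known, then reports a rank to the caller. */
int fake_comm_rank(comm_t *comm, int *rank, const char *name)
{
    assert(comm != NULL && rank != NULL && name != NULL);
    if (*comm == COMM_NULL) {
        if (strcmp(name, "COMM_RED") != 0)
            return 1;          /* unknown name: report an error */
        *comm = 7;             /* pretend a new communicator was built */
    }
    *rank = 0;                 /* placeholder rank for this sketch */
    return 0;
}

/* #a turns the spelling of the argument into a string; &(a) lets the
 * callee overwrite the handle variable of the caller. */
#define comm_rank(a, b) fake_comm_rank(&(a), (b), #a)
```

A call such as comm_rank(COMM_RED, &rank) on a variable that still holds COMM_NULL overwrites the variable with the new handle, so later calls skip the name lookup. The macro only yields a useful name if the variable itself is written as the argument; an expression or an alias would be stringified literally.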
Usage at Application Level Nevertheless, since the li-
brary has to decide whether it is a first call or not, all com-
municators that are assumed to be defined externally have
to be explicitly declared as MPI_COMM_NULL before be-
ing used (or rather, before being built). However, due to
the fact that a call with a NULL communicator will most
likely result in an abort of the running program in common
MPI environments, an application has to take appropriate
measures in order to be still consistent with the standard.
For that purpose, an application should ensure that an MPI
function also returns in case of an erroneous communica-
tor argument. This is usually done by setting the MPI error
handler to MPI_ERRORS_RETURN. That way, the applica-
tion can determine on its own whether a communicator is
valid or not.
In order to clarify the handling of those issues, refer to
the following code example:
MPI_Comm MPI_COMM_RED = MPI_COMM_NULL;

MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

if(MPI_Comm_rank(MPI_COMM_RED, &rank) == MPI_SUCCESS)
{
    /* I am part of MPI_COMM_RED! */
    . . .
}
else
{
    /* I am NOT in MPI_COMM_RED! */
    . . .
}
Initially, a variable of the data type MPI_Comm is declared
for the new communicator MPI_COMM_RED and initialized
with MPI_COMM_NULL. Since the error handler
of the MPI environment gets instructed to return all oc-
curring errors to the application level, the subsequent call
to MPI_Comm_rank() with a NULL communicator
aborts neither if the new communicator cannot be found
in the external communicator definition nor if
the application was built without our library’s support. In
both cases, all calling processes will discover that they are
not part of MPI_COMM_RED. However, in the other case
of an adequately defined external communicator, the call-
ing processes will build the new communicator according
to its grouping definitions from the XML file right within
the faked MPI_Comm_rank() function. At this point it
should be emphasized that the creation of a new communi-
cator is a collective operation within the parent group. That
in turn means that even though they may not become part
of the new communicator group, all processes within the
parent group have to call the respective function.
For inter-communicators, all these descriptions can
be applied in a similar manner with the exception that
MPI_Comm_test_inter() is used as the communica-
tor building fake function:
if((MPI_Comm_test_inter(MPI_COMM_INTER, &flag)
    == MPI_SUCCESS) && (flag))
{
    /* inter-communicator created! */
    . . .
}
Also in this case, the function must be called by all
processes within the parent communicator because the parent
communicator serves as the so-called bridge communicator in MPI terms.
4 Application Examples
The examples presented here are derived from paral-
lel algorithms that perform so-called nearest neighbor ex-
changes of row and column halos from a 2D array [4, 10,
21]. We have chosen this communication pattern because it
is a common operation for domain decompositions applied
in parallel simulation applications. Such a decomposition
scheme is exemplarily shown in Figure 2, where 12 proces-
sors work on a 3 × 4 block partitioned domain. As one can
see, the resulting communication pattern is quite structured
since only directly neighboring pairs of processors are ex-
changing messages in a horizontal and vertical manner.
Figure 2. Domain Decomposition (processes Proc 0 to Proc 11 on a 3 × 4 grid with halo exchange between neighbors)
A Top-Down Process Grouping According to this com-
munication pattern, the processes can obviously be arranged
into horizontal and vertical communicating subgroups as
quoted below:
Group          Processes
Horizontal 0   0, 1, 2, 3
Horizontal 1   4, 5, 6, 7
Horizontal 2   8, 9, 10, 11

Group          Processes
Vertical 0     0, 4, 8
Vertical 1     1, 5, 9
Vertical 2     2, 6, 10
Vertical 3     3, 7, 11
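The two tables follow directly from a row-major numbering of the processes. As a minimal sketch (the function names are chosen for illustration), the group indices can be computed from the rank and the number of grid columns:

```c
#include <assert.h>

/* Index of the horizontal group (grid row) of `rank` in a row-major grid. */
int horizontal_group(int rank, int cols)
{
    assert(rank >= 0 && cols > 0);
    return rank / cols;
}

/* Index of the vertical group (grid column) of `rank`. */
int vertical_group(int rank, int cols)
{
    assert(rank >= 0 && cols > 0);
    return rank % cols;
}
```

For the 3 × 4 example, process 6 belongs to the groups Horizontal 1 and Vertical 2. Inside an application, these two values could serve as the color arguments of two calls to MPI_Comm_split().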
Of course, this simple grouping scheme can easily be im-
plemented inside an application, that means without deploy-
ing externally defined communicators. Nevertheless, exter-
nally defined communicators can still be useful here in order
to map the virtual topology (that is the algorithm’s commu-
nication pattern) onto the underlying (homogeneous) hard-
ware topology. If the underlying network is, for example, a
Cartesian mesh, then an optimal virtual to physical topol-
ogy mapping can be performed by placing the processes
onto the appropriate processors as denoted in Figure 3. In
this example, the processor names are composed of a tuple
that indicates the position (row and column) of a processor
in the mesh network. Thus, by creating the above process
groups according to this naming scheme, an ideal mapping
can be accomplished.
Figure 3. Process to Processor Mapping (Proc 0 to Proc 3 on pd-00 to pd-03, Proc 4 to Proc 7 on pd-10 to pd-13, Proc 8 to Proc 11 on pd-20 to pd-23)
When using externally defined communicators for that
purpose, just the following XML entries have to be supplied
with substituted x and y:
<comm name="MPI_COMM_HORIZONTAL_x">
<processor>pd-[x][0-3]</processor>
</comm>
<comm name="MPI_COMM_VERTICAL_y">
<processor>pd-[0-2][y]</processor>
</comm>
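Which of the substituted entries apply to a given processor follows from its name alone. The sketch below derives both communicator names from a name of the form pd-<row><column>; it assumes single-digit coordinates as in the example, and the function name mesh_comm_names is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Derives the names of the horizontal and vertical communicator from a
 * processor name such as "pd-12" (row 1, column 2). Both output buffers
 * have `len` bytes. Returns 0 on success and -1 otherwise. */
int mesh_comm_names(const char *processor, char *horizontal,
                    char *vertical, size_t len)
{
    assert(processor != NULL && horizontal != NULL && vertical != NULL);
    int row, col;

    /* Exactly "pd-" plus two digits; %1d reads one digit per coordinate. */
    if (strlen(processor) != 5 ||
        sscanf(processor, "pd-%1d%1d", &row, &col) != 2)
        return -1;

    int h = snprintf(horizontal, len, "MPI_COMM_HORIZONTAL_%d", row);
    int v = snprintf(vertical, len, "MPI_COMM_VERTICAL_%d", col);
    if (h < 0 || (size_t)h >= len || v < 0 || (size_t)v >= len)
        return -1; /* output buffers too small */
    return 0;
}
```

The processor pd-12, for instance, matches the expression pd-[1][0-3] of MPI_COMM_HORIZONTAL_1 and the expression pd-[0-2][2] of MPI_COMM_VERTICAL_2.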
At this point it should be mentioned that the MPI topol-
ogy mechanism is exactly what the standard defines to over-
come this issue. In particular, the MPI_Cart_create()
function serves as an easy way to create a new communi-
cator with a Cartesian topology attached [10, 13]. Further-
more, an MPI implementation may reorder the processes
within this function call for a better performance. Unfor-
tunately, this reorder mechanism is only very rarely real-
ized in a beneficial way in common MPI implementations
[19, 18]. Since our library provides an explicit rank reorder-
ing determined on the basis of the processor names rep-
resenting the actual hardware topology, its utilization can
obviously be helpful if the respective MPI implementation
does not offer appropriate mapping facilities on its own.
An Example following the Bottom-Up-Approach As
already pointed out in the introduction, many heterogeneity-
aware MPI implementations provide the application pro-
grammer with additional adaptation features that should
support an appropriate process mapping onto the (mostly)
hierarchical physical topology. However, the realizations of
those features are usually not conforming to the standard
and, moreover, depend on the MPI implementation used.
That means that when adapting an application to a hierarchi-
cal topology by using the auxiliary features offered by a cer-
tain MPI implementation, the application becomes bound to
this particular environment. In fact, this issue was the ori-
gin of the work presented here since we were looking for
a portable way to specify hierarchical topologies in an MPI
convenient manner.
Figure 4. Two Coupled Clusters (P4-Cluster with MPI_COMM_P4 and PD-Cluster with MPI_COMM_PD, both within MPI_COMM_WORLD)
Assume the following two-tier hierarchical system con-
sisting of two coupled clusters in Figure 4. In such a cou-
pled system, the interlinking network between the clusters
obviously constitutes the system’s bottleneck, whereas the
cluster internal connections are usually built up from dedi-
cated high performance interconnects. Hence, in order to
be able to forward messages along the inter-cluster link,
while still being able to benefit from the fast internal cluster
networks, an MPI implementation with multiple network
support needs to be applied. There exist a couple of multi-
network capable MPI implementations like Open MPI [9]
or MPICH/Madeleine [1] and special Grid-enabled MPI li-
braries like MPICH-G2 [12], PACX-MPI [3] and GridMPI
[14]. All of those libraries are proven to run large-scale ap-
plications and most of them offer an individual implemen-
tation of the above mentioned adaptation features.
However, at this point, an application programmer now
has the opportunity to abandon the use of those intrinsic fea-
tures by utilizing our approach of externally defined com-
municators that reflect the system’s hierarchy in a portable
way. Although in this case an additional communicator con-
figuration file needs to be supplied, this can be either stated
by a user who possesses the information about the hard-
ware structure, or this file can be generated by an automated
mechanism.
Currently, we have already implemented such a mech-
anism into the runtime environment of MetaMPICH [16],
a Grid-enabled MPI library that has also been developed
at our institute. MetaMPICH allows the user to config-
ure the coupled system in a very detailed way via so-called
meta-configurations that help to provide an explicit defini-
tion of each cluster involved. By extending such a meta-
configuration, it is now possible for the user to include an
additional communicator assignment into the configuration
in order to provide self-named MPI communicators repre-
senting the respective cluster sites. The following paragraph
shows an exemplary section of such a meta-configuration
that typically contains many more items than shown here,
as for example the types of the internal networks and the in-
formation about the interlinking topology between the sites.
For more information about MetaMPICH and the syntax of
its meta-configurations, please refer to [5].
METAHOST p4_cluster
{
NODES = p4-01,p4-02,p4-03,p4-04;
INTRACOMM = "MPI_COMM_P4";
. . .
}
METAHOST pd_cluster
{
NODES = pd-01,pd-02,pd-03,pd-04;
INTRACOMM = "MPI_COMM_PD";
. . .
}
CONNECTIONS
PAIR p4_cluster pd_cluster
- { INTERCOMM = "MPI_COMM_INTER" }
. . .
Although a meta-configuration is not coded in XML but
in a proprietary syntax, a designated parser, which is part
of the MetaMPICH runtime system, can read this configuration
and is able to set up the needed XML file containing
the desired communicator definitions. For that purpose, the
name and the path to the XML file are passed via environ-
ment variables to our library, whereas the runtime system of
MetaMPICH has to assure the accessibility of the generated
XML file on all relevant nodes. For the presented example,
the resulting XML file would look like the following:
<comm name="MPI_COMM_P4">
<processor>p4-01</processor>
<processor>p4-02</processor>
<processor>p4-03</processor>
<processor>p4-04</processor>
</comm>
<comm name="MPI_COMM_PD">
<processor>pd-01</processor>
<processor>pd-02</processor>
<processor>pd-03</processor>
<processor>pd-04</processor>
</comm>
<intercomm name="MPI_COMM_INTER">
<first>MPI_COMM_P4</first>
<second>MPI_COMM_PD</second>
</intercomm>
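How a generator might emit one such element from a node list is sketched below. This is not the parser of the MetaMPICH runtime system but a hypothetical helper named append_comm_xml; it assumes that the names contain no characters that would require XML escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Appends one <comm> element to `out`, a NUL-terminated buffer of capacity
 * `cap`. Returns 0 on success and -1 if the buffer is too small, in which
 * case `out` may hold a truncated element. */
int append_comm_xml(char *out, size_t cap, const char *name,
                    const char *const *nodes, size_t count)
{
    assert(out != NULL && name != NULL && nodes != NULL);
    size_t used = strlen(out);
    assert(used < cap);

    int n = snprintf(out + used, cap - used, "<comm name=\"%s\">\n", name);
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    used += (size_t)n;

    for (size_t i = 0; i < count; ++i) {
        n = snprintf(out + used, cap - used,
                     "<processor>%s</processor>\n", nodes[i]);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, cap - used, "</comm>\n");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    return 0;
}
```

Calling the helper once per METAHOST entry with the INTRACOMM name and the NODES list would yield the two comm elements shown above; the intercomm element would need a separate, analogous step.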
The user can, of course, choose arbitrary communicator
names representing the cluster sites. That way, it is possible
to adapt an application e.g. for a hierarchical system con-
sisting of two or more coupled sites without being bound
to any actual system. Moreover, since the creation of the
XML file may also be delegated to another instance than
MetaMPICH, as for example to a topology analyzing tool,
the application also becomes independent of the runtime en-
vironment in use.
Figure 5. Application on Coupled Clusters (P4-Cluster with MPI_COMM_P4, PD-Cluster with MPI_COMM_PD)
As a result, also the recently introduced application ex-
ample can easily be adapted to a hierarchical system as de-
noted in Figure 5. In this exemplary case, the domain is
partitioned into strips (columns) so that only two proces-
sors have to communicate across the inter-cluster link. For
that purpose, the processes can now be grouped by the cor-
responding intra-communicators representing their respec-
tive cluster sites, whereas an additional inter-communicator
can serve to handle the communication between the two
clusters. That way, an adaptation of the application to the
heterogeneous system can be achieved by applying a par-
tially synchronous relaxation scheme between the sites (in
this case, the inter-cluster halo exchanges are just performed
in a periodic manner), while still being fully synchronous
within the clusters. By this means, the inter-cluster com-
munication bottleneck can be compensated by employing
accessory computing power in terms of additional iteration
steps. For more details about this adaptation and optimiza-
tion approach, please refer to [21, 2, 6].
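The periodic inter-cluster exchange can be expressed as a simple predicate on the iteration counter. The following sketch only models the schedule; the period is a free parameter assumed for illustration, and the function names are hypothetical.

```c
#include <assert.h>

/* Returns 1 if the inter-cluster halo exchange is due in this iteration;
 * the intra-cluster exchanges take place in every iteration. */
int inter_cluster_exchange_due(int iteration, int period)
{
    assert(iteration >= 0 && period > 0);
    return iteration % period == 0;
}

/* Number of inter-cluster exchanges within the first `iterations` steps. */
int count_inter_cluster_exchanges(int iterations, int period)
{
    assert(iterations >= 0 && period > 0);
    int count = 0;
    for (int it = 0; it < iterations; ++it)
        count += inter_cluster_exchange_due(it, period);
    return count;
}
```

With a period of 10, for instance, only 10 out of 100 iterations involve the slow inter-cluster link, while a period of 1 corresponds to the fully synchronous scheme.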
5 Conclusions, Outlook and Related Work
In this paper, we have presented our approach to sim-
plifying the communicator creation for an MPI application
programmer without losing the freedom of choosing an arbitrary
underlying MPI library on the one hand, and
without breaking the applications’ source code portability
and standard conformity on the other hand. By using
XML, the communicator configuration file need not be
written by the programmer; given an appropriate plug-in,
it can also be generated automatically,
for example, by a process scheduler, by a topology
analyzing tool or even by the MPI runtime environment it-
self. Further application areas may be the automated and
standardized communicator definition by load balancers or
by domain decomposition tools that are able to provide
simulation applications with appropriate process grouping
schemes for a given problem to be solved on a certain sys-
tem.
Currently, we plan to develop a plug-in for the MP-
Cluma cluster management tool [17] that should allow the
user to compose the desired communicators in a very con-
venient way. MP-Cluma has also been developed at our in-
stitute in order to enable a uniform and comfortable startup
of MPI applications on heterogeneous systems. Since MP-
Cluma offers a Java-based graphical frontend to the user,
we want to include an intuitive drag-and-drop facility for
an easy grouping of processes. And, furthermore, since
MP-Cluma needs to collect information about the respec-
tive hardware environment, the inclusion of an additional
topology analyzer seems obvious.
There also exists some related work within this scope of
GUI-based handling of MPI artefacts like communicators
and MPI-related data types: VisualMPI [8] and BladeRun-
ner [7] are tools that help the user to program MPI appli-
cations by representing those data types in a visual and ab-
stract way. However, both projects focus rather on a semi-
automated code generation at development time of an MPI
application than on the mapping of communication patterns
at runtime, as we do.
References
[1] O. Aumage and G. Mercier. MPICH/MADIII: a Cluster
of Clusters Enabled MPI Implementation. In Proceedings
of the 3rd IEEE/ACM International Symposium on Cluster
Computing and the Grid, May 2003.
[2] H. E. Bal, A. Plaat, M. G. Bakker, P. Dozy, and R. F. H.
Hofman. Optimizing Parallel Applications for Wide-Area
Clusters. In Proceedings of the IPPS/SPDP Workshops on
Parallel and Distributed Processing, Orlando, Florida, April
1998.
[3] T. Beisel, E. Gabriel, M. Resch, and R. Keller. Distributed
Computing in a Heterogeneous Computing Environment. In
Proceedings of the 5th European PVM/MPI Users’ Group
Meeting, September 1998.
[4] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed
Computation: Numerical Methods. Prentice Hall, Engle-
wood Cliffs, N.J., 1989.
[5] Chair for Operating Systems, RWTH-Aachen, University.
MP-MPICH – User Documentation & Technical Notes.
[6] C. Clauss, S. Gsell, S. Lankes, and T. Bemmerl. A Fair
Benchmark for Evaluating the Latent Potential of Heteroge-
neous Coupled Clusters. In Proceedings of the 6th Inter-
national Symposium on Parallel and Distributed Computing
(ISPDC 2007), Hagenberg, Austria, July 2007.
[7] D. P. Pazel and B. R. Tibbitts. Intentional MPI Programming in
a Visual Development Environment. In Proceedings of the
2006 ACM symposium on Software visualization SoftVis ’06.
ACM Press, September 2006.
[8] D. Ferenc, J. Nabrzyski, M. Stroinski, and P. Wierzejewski.
VisualMPI - A Knowledge-Based System for Writing Effi-
cient MPI Applications. In Proceedings of the 6th European
PVM/MPI Users’ Group Meeting, volume 1697 of Lecture
Notes in Computer Science, Barcelona, Spain, September
1999.
[9] R. L. Graham, G. M. Shipman, B. W. Barrett, R. H. Cas-
tain, G. Bosilca, and A. Lumsdaine. Open MPI: A High-
Performance, Heterogeneous MPI. In Proceedings of the
Fifth International Workshop on Algorithms, Models and
Tools for Parallel Computing on Heterogeneous Networks,
Barcelona, Spain, September 2006.
[10] W. Gropp, E. Lusk, and A. Skjellum. Using MPI - second
edition. Scientific and Engineering Computation series. MIT
Press, Cambridge, 1999.
[11] W. Gropp, E. Lusk, and R. Thakur. Using MPI-2: Ad-
vanced Features of the Message Passing Interface. Scien-
tific and Engineering Computation series. MIT Press, Cam-
bridge, 1999.
[12] N. Karonis, B. Toonen, and I. Foster. MPICH-G2: A Grid-
enabled Implementation of the Message Passing Interface.
Journal of Parallel and Distributed Computing, 63(5), 2003.
[13] M. Snir, S. Otto, S. Huss-Lederman, D. W. Walker, and
J. Dongarra. MPI: The Complete Reference. Scientific and
Engineering Computation Series. MIT Press, Cambridge,
1996.
[14] M. Matsuda, Y. Ishikawa, Y. Kaneo, and M. Edamoto.
Overview of the GridMPI Version 1.0. In Proceedings of
the SWoPP05, Japan, 2005.
[15] MPI Forum. MPI: A Message-Passing Interface Stan-
dard. International Journal of Supercomputing Applica-
tions, 1994.
[16] M. Pöppe, S. Schuch, and T. Bemmerl. A Message Passing
Interface Library for Inhomogeneous Coupled Clusters.
In Proceedings of the IEEE International Parallel and
Distributed Processing Symposium (IPDPS 2003), Nice,
France, April 2003.
[17] S. Schuch and M. Pöppe. MP-Cluma - A CORBA Based
Cluster Management Tool. In Proceedings of the Inter-
national Conference on Parallel and Distributed Process-
ing Techniques and Applications (PDPTA 2004), Las Vegas,
USA, June 2004.
[18] R. Thakur and W. Gropp. Open Issues in MPI Implemen-
tation. In L. Choi, Y. Paek, and S. Cho, editors, Advances
in Computer Systems Architecture, 12th Asia-Pacific Con-
ference, ACSAC 2007, Seoul, Korea, August 23-25, 2007,
Proceedings, volume 4697 of Lecture Notes in Computer
Science, pages 327–338. Springer, 2007.
[19] J. L. Träff. Implementing the MPI Process Topology Mechanism.
In Proceedings of the IEEE/ACM SC 2002 Conference,
Baltimore, USA, November 2002.
[20] W3C. Extensible Markup Language (XML) 1.0 (Fourth Edi-
tion). http://www.w3.org/TR/xml/, September 2006.
[21] B. Wilkinson and M. Allen. Parallel Programming - Tech-
niques and Applications Using Networked Workstations and
Parallel Computers. Prentice Hall, 2nd edition, 2005.
Performance Evaluation and Optimization of Metacomputing Applications
Daniel Becker1,2, Wolfgang Frings1 and Felix Wolf1,2
{d.becker, w.frings, f.wolf}@fz-juelich.de
1 Forschungszentrum Jülich, Jülich Supercomputing Centre (JSC), 52425 Jülich, Germany
2 RWTH Aachen University, Department of Computer Science, 52056 Aachen, Germany
Abstract
The combination of independent and potentially hetero-
geneous parallel machines creates a powerful metacom-
puter. Such a metacomputer can be used to run a single
parallel application if a single machine does not provide
enough CPUs. However, achieving satisfactory application
performance on such a metacomputer is difficult since in-
stances of grid-related as well as non grid-related perfor-
mance properties may introduce various wait states during
communication and synchronization. In our earlier work,
we have introduced an extension to the SCALASCA tool
set for recording event traces of metacomputing applica-
tions and searching them automatically for patterns of inef-
ficient behavior related to wide-area communication. Here,
we show how this extension in combination with statistical
analyses and time-line visualization provided by VAMPIR
can be applied to evaluate and optimize the performance
of a multi-physics production code running on a heteroge-
neous and geographically dispersed metacomputer.
Keywords: Performance tools, grid computing, meta-
computing, event tracing.
1 Introduction
The solution of critical numerical problems may require
more processing power and memory capacity than is avail-
able on a single parallel machine. Often, coupling multiple
independent parallel machines (i.e., metahosts) to form a
more powerful metacomputer is the only method to increase
the available resources for a single application.
However, although applications can benefit from the in-
creased parallelism offered by a metacomputer, achieving
satisfactory application performance is difficult. Algorithm
design has to adapt to hierarchies of latencies and bandwidths
in addition to the heterogeneous hardware architectures
found in such environments. Hence, performance optimization
is a crucial but non-trivial task that needs adequate
tool support. Wait states are a frequent problem that needs
special attention: they occur when the speed at which
the computation progresses varies between metahosts or
when message transfers are delayed by high network latency.
In our earlier work [1], we have shown that automatic
pattern search in event traces is a suitable method to iden-
tify wait states that appear as a result of using a meta-
computer consisting of multiple geographically dispersed
metahosts. There, we have extended the trace-analysis tool
SCALASCA [10] so that it can be used in metacomputing
environments. Challenges addressed by our extension in-
clude performing the pattern analysis in the absence of a
shared file system between metahosts, the synchronization
of time stamps in hierarchical networks, and the definition
of grid-specific patterns that target communication and syn-
chronization across metahost boundaries.
In this paper, we demonstrate that not only performance
evaluation but also performance optimization of applica-
tions running on a heterogeneous and geographically dis-
persed metacomputer are feasible. Using the grid-enabled
tracing and analysis capabilities of the SCALASCA tool set,
we determine relevant performance properties and demon-
strate how this information can be used to significantly im-
prove the performance of MetaTrace [5], a grid-enabled
multi-physics application that simulates the transport of pol-
lutants in groundwater.
The starting point of our study is a set of event traces generated
using the enhanced SCALASCA measurement infrastructure.
First, we evaluate how the bandwidth and latency require-
ments of our application are met by the wide-area connec-
tion in our grid testbed using the statistical trace-analysis
capabilities of VAMPIR [8]. Second, we show how the lo-
calization, classification, and quantification of wait states
performed by the SCALASCA trace analyzer assists us in
eliminating a major fraction of waiting times, leading to
a significant improvement of the overall performance. Fi-
nally, by running the application on a homogeneous cluster
and comparing the results with those obtained on the meta-
computer, we verify that some of the performance problems
we have identified are indeed the consequence of using a
metacomputer.
The outline of this article is as follows: We start in Sec-
tion 2 with a short description of VIOLA, the metacomputer
testbed we used for our experiments, and the application
MetaTrace. In Section 3, we describe the methods and tools
used during the optimization process. Then, in Section 4,
we summarize our network analysis followed by an outline
of the incremental optimization process. Finally, in Section 5,
we conclude our paper.
2 The VIOLA metacomputer
VIOLA [4] is a project funded by the German Ministry
for Education and Research, which provides a testbed for
advanced optical network technology. A major focus is the
enhancement and test of advanced grid applications.
2.1 Network topology and hardware architecture
The network behind the VIOLA grid consists of a 10
Gbps backbone network with connections to workstations
and compute clusters located at various sites in Germany including
Sankt Augustin, Jülich, Bonn, Nürnberg, and Erlangen.
The nodes of the connected compute clusters are linked
to the backbone with 1 Gbps adapters. The high bandwidth
of the backbone can only be used if the data transmission
between the clusters is done in parallel.
These components form a very heterogeneous metacom-
puter layout with a hierarchy of different network latencies
and varying characteristics of the compute clusters, which
differ with respect to their operating systems (different ver-
sions of Linux) and compilers. It can be expected that the
high latency of inter-machine communication as well as the
heterogeneous hardware may adversely affect application
performance.
2.2 Middleware
Running a parallel application on such a metacomputer
needs middleware components for application startup and
a wide-area communication library for the transfer of data
between application processes residing on geographically
dispersed metahosts. The middleware interacts with local
resource managers to co-schedule jobs on different clus-
ters. The communication library should support transparent
high-bandwidth and low-latency message transfers between
all nodes of the attached clusters.
The co-scheduling of jobs on different clusters in the
VIOLA grid is managed by the grid middleware UNI-
CORE [7] which has been enhanced by adding a meta-
scheduler for the simultaneous allocation of compute and
network resources. Bierbaum et al. [2] describe this
UNICORE-based infrastructure supporting the co-allocation
of metacomputing resources in more detail, with special
emphasis on the intricate task of coordinating network al-
location with application startup. This infrastructure pro-
vides seamless access to distributed grid resources through
a graphical user interface, which is depicted in Figure 1.
Figure 1. Meta-scheduler and the UNICORE
graphical user interface.
Moreover, VIOLA uses MetaMPICH [3], the MPICH-
based MPI-implementation developed at RWTH Aachen
University, to establish direct connections to the external
network from each node. MetaMPICH supports these direct
connections through a multi-device architecture that allows
external communication within the VIOLA-testbed with the
maximum bandwidth of 1 Gbps per node across the wide-
area network without the involvement of dedicated router
processes.
2.3 Applications
Applications on the VIOLA grid cover various research
disciplines including environmental research, the design of
complex technological systems like biosensors and crystal
growth for microchip wafer production, and structural me-
chanics in engineering.
MetaTrace, one of the applications running on the
VIOLA-testbed, simulates the transport of pollutants in
groundwater. MetaTrace is a combination of two paral-
lel simulation submodels, Trace and Partrace. Whereas
Trace simulates water flow in porous media, Partrace com-
putes the transport of solutes in this water flow. Trace
applies a three-dimensional domain decomposition (in our
case 192×32×32 m³) with nearest-neighbor communication,
whereas Partrace tracks individual particles. For simulating
pollutant transport in non-steady flows, the simultaneous
execution of both submodels is crucial. MetaTrace
couples the two submodels through a parallel connection.
This connection is mainly
used in one direction for the transfer of the distributed three-
dimensional velocity field from Trace to Partrace whenever
Trace completes a simulation step. The unidirectional com-
munication scheme makes MetaTrace suitable to run effi-
ciently on a computational grid. Running each submodel
on a single metahost allows the internal communication to
benefit from the low-latency network whereas only syn-
chronization as well as data exchange between the two sub-
models have to use the high-latency network. The unidirec-
tional and low-frequency communication between the two
submodels is done synchronously over the VIOLA backbone
network through the node-local network adapters. After re-
ceiving the data, Partrace replicates the received velocity
field on each node by synchronously distributing it across
all Partrace nodes using a systolic loop.
3 Performance measurement and analysis
In this section, we illustrate our performance measure-
ment and analysis method used to optimize the application.
We focus on the SCALASCA tool set and its recent exten-
sion that can be used to analyze metacomputing applica-
tions. In addition, we briefly describe the VAMPIR graphical
trace browser.
Often, parallel applications that are free of computational
errors still need to be optimized. This requires knowing
which component of the program is responsible for
which kind of inefficient behavior. Performance analysis is
the process of identifying those parts, exploring the reasons
for their unsatisfactory performance, and quantifying their
overall influence. To do this, performance data are mapped
onto program entities. Using software tools, a developer can
then investigate the application’s runtime behavior and
understand its performance. The process of gathering performance
data is called performance measurement and forms the basis
for subsequent analysis.
Event tracing is a technique for post-mortem perfor-
mance analysis of parallel applications. Time-stamped
events, such as entering a function or sending a message, are
recorded at runtime and analyzed afterwards with the help
of software tools. The information recorded for an event
includes at least a time stamp, the location (e.g., the pro-
cess or node) where the event happened and the event type.
Depending on the type, additional information may be sup-
plied, such as the function identifier for function call events.
Message event records typically contain details about the
message they refer to (e.g., the source or destination loca-
tion and message tag).
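As a minimal sketch of such an event record, the attributes listed above could be represented as follows. The class and field names are hypothetical and chosen for illustration only; they do not reflect the EPILOG or OTF trace formats.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EventType(Enum):
    ENTER = "enter"  # a function is entered
    EXIT = "exit"    # a function is left
    SEND = "send"    # a message is sent
    RECV = "recv"    # a message is received


@dataclass(frozen=True)
class Event:
    time: float                   # time stamp in seconds
    location: int                 # process (or node) where the event happened
    kind: EventType               # event type
    region: Optional[str] = None  # function identifier (ENTER/EXIT only)
    peer: Optional[int] = None    # source or destination (SEND/RECV only)
    tag: Optional[int] = None     # message tag (SEND/RECV only)


# A tiny hypothetical trace: process 1 enters its receive before
# process 0 enters the matching send.
trace = [
    Event(3.0, 0, EventType.ENTER, region="MPI_Send"),
    Event(3.5, 0, EventType.SEND, peer=1, tag=7),
    Event(1.0, 1, EventType.ENTER, region="MPI_Recv"),
    Event(4.0, 1, EventType.RECV, peer=0, tag=7),
]

# Post-mortem analysis typically starts from a time-ordered view.
ordered = sorted(trace, key=lambda event: event.time)
```

Keeping the type-specific attributes optional mirrors the description above: every event carries a time stamp, a location, and a type, while the remaining fields depend on the event type.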
Graphical trace browsers, such as VAMPIR, allow the
fine-grained, manual investigation of parallel performance
behavior using a zoomable time-line display and provide
statistical summaries of communication behavior. However,
in view of the large amounts of data generated on contem-
porary parallel machines, the depth and coverage of the vi-
sual analysis offered by a browser is limited as soon as it
targets more complex patterns not included in the statistics
generated by such tools.
By contrast, the trace analyzer of the SCALASCA tool
set [6] automatically searches event traces for patterns of in-
efficient behavior, classifies detected instances by category,
and quantifies the associated performance penalty. To do
this efficiently at larger scales and also to circumvent the
obstacles arising from the absence of a shared file system
in grid environments, the traces are analyzed in parallel by
replaying the original communication using the same hard-
ware configuration and the same number of CPUs as have
been used to execute the target application itself.
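To illustrate what such a pattern search quantifies, the following sketch computes the waiting time of a single Late Sender instance, one of the patterns discussed in Section 4, in which a receive operation is entered before the matching send. The function name and its plain time-stamp arguments are assumptions made for this example and are not part of the SCALASCA interface.

```python
def late_sender_wait(send_enter: float, recv_enter: float, recv_exit: float) -> float:
    """Time a receiver loses because the matching send was entered late.

    All arguments are time stamps on a common (synchronized) time axis.
    """
    if send_enter <= recv_enter:
        return 0.0  # the sender was on time: no Late Sender instance
    # The receiver cannot have waited longer than it stayed in the receive.
    return min(send_enter, recv_exit) - recv_enter


# Hypothetical matched message pairs: (send_enter, recv_enter, recv_exit).
pairs = [
    (3.0, 1.0, 4.0),  # receiver entered 2.0 s before the sender
    (1.0, 2.0, 2.5),  # sender was early: no waiting time
    (9.0, 6.0, 8.0),  # waiting time is capped by the receive duration
]

total_wait = sum(late_sender_wait(*pair) for pair in pairs)
```

Because the comparison involves time stamps taken on different processes, the result is only meaningful if the clocks have been synchronized beforehand.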
For our experiments presented in Section 4, we used the
SCALASCA tool set which has been extended to support the
automatic performance analysis of metacomputing applications.
The goal of these extensions was (i) to enable automatic
trace analysis on a metacomputer and (ii) to help identify
metacomputing-specific performance problems in applica-
tions. On a technical level, capabilities have been added to
identify the metahost a process is running on, to synchro-
nize time stamps across a hierarchical network with dif-
ferent latencies, and to analyze traces in the absence of a
shared file system. In addition, special metacomputing pat-
terns have been added to the existing pattern base. The in-
terested reader can find a more detailed description in [1].
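One common way to relate time stamps across a hierarchical network is to estimate clock offsets with request/reply exchanges and to add the offsets along the hierarchy. The sketch below assumes symmetric latency in both directions and a two-level hierarchy (global master, metahost-local master, worker). It is an illustration of the general idea under these assumptions; the scheme described in [1] may differ in detail, and all names are hypothetical.

```python
def estimate_offset(t_send: float, t_remote: float, t_recv: float) -> float:
    """Offset of the remote clock relative to the local clock.

    t_send and t_recv are local time stamps taken before sending a request
    and after receiving the reply; t_remote is the remote time stamp carried
    in the reply. Assumes equal latency in both directions.
    """
    return t_remote - (t_send + t_recv) / 2.0


def simulate_ping(true_time, local_skew, remote_skew, latency):
    """Return (t_send, t_remote, t_recv) for one simulated exchange."""
    t_send = true_time + local_skew
    t_remote = true_time + latency + remote_skew
    t_recv = true_time + 2.0 * latency + local_skew
    return t_send, t_remote, t_recv


# Simulated clock skews: global master 0.0, local master 5.0, worker 5.25.
# The wide-area exchange has a higher latency than the local one.
wan_offset = estimate_offset(*simulate_ping(10.0, 0.0, 5.0, 0.5))
lan_offset = estimate_offset(*simulate_ping(20.0, 5.0, 5.25, 0.125))

# Offset of the worker clock relative to the global master clock.
worker_offset = wan_offset + lan_offset

# Mapping a worker time stamp onto the global master's time axis.
corrected = 25.375 - worker_offset
```

With symmetric latency, the estimate is independent of the latency itself, which is why the same formula can be applied to both the local and the wide-area level.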
4 Performance evaluation and optimization
In this section, we present experimental results that show
the feasibility of evaluating relevant performance metrics
and of optimizing the performance of a real-world produc-
tion code in metacomputing environments.
4.1 Experiment description
To demonstrate that performance measurement in com-
bination with performance analysis can be used to identify
inefficient performance behavior, we analyzed the afore-
mentioned multi-physics application MetaTrace. For our
experiments we used the VIOLA sites at FH Bonn-Rhein-Sieg
Sankt Augustin (FH-BRS) and at Forschungszentrum
Jülich (FZJ) to execute MetaTrace. That is, the metacomputer
used for our measurements includes two metahosts,
one at each site:
• A PC Linux cluster with 6 4-way AMD Opteron SMP
nodes at 2 GHz with a usock over Myrinet interconnect
located at FH-BRS.
• A Cray XD1 Linux cluster with 60 2-way AMD
Opteron SMP nodes at 2.2 GHz with a usock over RapidArray
interconnect located at FZJ.
Figure 2. Analysis results of metacomputer experiment: Late Sender problem inside Trace function
cgiteration() at FH-BRS.
In our first experiment, Partrace ran at FZJ, while Trace
was executed at FH-BRS. To enable a comparison between a
grid environment and a homogeneous cluster we performed
a second experiment on an IBM AIX POWER 4+ cluster at
Forschungszentrum Jülich. In both cases we used 24 processes
in total. The detailed configurations of these experiments
are listed in Table 1.
4.2 Experimental results
To generate the trace data needed to investigate the per-
formance behavior, the instrumented program was executed
on the VIOLA grid. MetaTrace was instrumented by man-
ually inserting directives which were automatically trans-
lated into appropriate SCALASCA measurement API calls by
a preprocessor. During the program run, the trace files were
generated in the EPILOG format. The trace data were analyzed
by SCALASCA’s parallel analyzer to generate a profile
of high-level performance properties. From the analysis
results, we decided which optimizations to apply
to the application. For fine-grained visual
trace analysis, the EPILOG event trace was converted to the
OTF format.
Table 1. Detailed configurations of the two-metahost and one-metahost experiments.
            Experiment 1         Experiment 2
Partrace    FZJ:                 IBM AIX POWER 4+:
            8 nodes              1 node
            1 process/node       8 processes/node
Trace       FH-BRS:              IBM AIX POWER 4+:
            4 nodes              1 node
            4 processes/node     16 processes/node
4.2.1 Network characteristics of the VIOLA-testbed
For our initial performance measurement we used Meta-
Trace in the configuration described in Section 2. After ap-
plying an OTF converter to our EPILOG traces, we were able
to determine several performance metrics of the VIOLA-
testbed using VAMPIR’s statistical summary functionality.
Partrace and Trace simulate the spread of groundwater
pollution collaboratively, and thus, the two submodels ex-
change simulation data at synchronization points across the
external network. That is, the total amount of data sent
across the wide area network represents the use of VIOLA’s
infrastructure. Table 2 shows the total amount of data transferred
across the internal and external network within the
VIOLA-testbed. As can be seen in our experiment, Trace at
FH-BRS sent in total 547.8 MByte of data across the external
network to Partrace at FZJ. Thus, each Partrace process received
on average 68.5 MByte of data from Trace across the
external network. It should be mentioned that Partrace sent
only minor control and status information back to Trace.
Figure 3. Analysis results of metacomputer experiment: Difference experiment obtained by subtracting
the original version from the optimized version.
Table 2. Total amount of data transferred across the internal and
external network in the VIOLA-testbed in MByte.
          FZJ       FH-BRS
FZJ       4320.0    0.0
FH-BRS    547.8     1120.0
To clarify whether the data transfer used the full band-
width offered by VIOLA’s infrastructure, we determined the
maximum data rate of the internal and external commu-
nication as well. Our measurements summarized in Ta-
ble 3 show a maximum data transfer rate of 47.3 MByte/s
between two corresponding processes at FH-BRS and FZJ.
Each node at FH-BRS used a network link with the maxi-
mum bandwidth of 1 Gbps. Since we assigned 16 processes
to Trace and 8 processes to Partrace, only two Trace pro-
cesses on the same 4-way node could communicate in par-
allel with two corresponding Partrace processes during the
data exchange. Given that these two Trace processes shared
a single network link, each of the two could use half of the
bandwidth (62.5 MByte/s per process) offered by VIOLA’s
network links, and thus, our measurements show that Meta-
Trace almost fully utilized the VIOLA network bandwidth.
Table 3. Maximum P2P communication rate of the internal and external
communication in the VIOLA-testbed in MByte/s.
          FZJ       FH-BRS
FZJ       208.6     0.4
FH-BRS    47.3      511.7
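The bandwidth argument above can be retraced with a few lines of arithmetic. The sketch below uses only figures from the text and Tables 2 and 3; the variable names are chosen for illustration, and decimal units (1 Gbps = 125 MByte/s) are assumed, which matches the 62.5 MByte/s per process stated above.

```python
LINK_GBPS = 1.0               # bandwidth of one node-local network link
SHARING_PROCESSES = 2         # Trace processes sharing one link during the exchange
MEASURED_MBYTE_PER_S = 47.3   # maximum external rate FH-BRS -> FZJ (Table 3)
EXTERNAL_MBYTE = 547.8        # data sent from Trace to Partrace (Table 2)
PARTRACE_PROCESSES = 8

link_mbyte_per_s = LINK_GBPS * 1000.0 / 8.0                # 125.0
share_mbyte_per_s = link_mbyte_per_s / SHARING_PROCESSES   # 62.5
utilization = MEASURED_MBYTE_PER_S / share_mbyte_per_s     # about 0.76

avg_mbyte_per_process = EXTERNAL_MBYTE / PARTRACE_PROCESSES  # about 68.5

print(f"per-process share: {share_mbyte_per_s:.1f} MByte/s")
print(f"utilization of that share: {utilization:.1%}")
print(f"average data per Partrace process: {avg_mbyte_per_process:.1f} MByte")
```

Under these assumptions, the measured maximum rate corresponds to roughly three quarters of the per-process share of a 1 Gbps link.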
In addition, Table 4 illustrates the minimum duration
of the internal and external communication in the VIOLA-testbed.
In our configuration, the external message transfer
duration exceeded the internal message transfer duration by
more than one order of magnitude (a factor of roughly 30).
During the communication between Trace and Partrace, the
minimum message transfer duration was 862.0 µs. Given that
the sites at FZJ and FH-BRS lie 100 km apart, a lower bound on the
message transfer time of roughly 333 µs can be calculated based on the speed
of light. Hence, it can be concluded that the VIOLA net-
work indeed offered a low-latency wide area network link
between the sites used for our experiments.
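The speed-of-light estimate can be reproduced as follows. The distance and the measured duration are taken from the text; the refractive index of the optical fiber is an assumed typical value that does not appear in the paper and is included only to show how the bound shifts for light travelling in glass.

```python
DISTANCE_M = 100e3               # distance between FZJ and FH-BRS (from the text)
C_VACUUM_M_PER_S = 299_792_458   # speed of light in vacuum
MEASURED_US = 862.0              # minimum external transfer duration (Table 4)
FIBER_REFRACTIVE_INDEX = 1.5     # assumption: typical value for optical fiber

bound_vacuum_us = DISTANCE_M / C_VACUUM_M_PER_S * 1e6
bound_fiber_us = bound_vacuum_us * FIBER_REFRACTIVE_INDEX

ratio_vacuum = MEASURED_US / bound_vacuum_us
ratio_fiber = MEASURED_US / bound_fiber_us

print(f"lower bound in vacuum: {bound_vacuum_us:.1f} us")
print(f"lower bound in fiber (assumed index): {bound_fiber_us:.1f} us")
print(f"measured / vacuum bound: {ratio_vacuum:.2f}")
print(f"measured / fiber bound:  {ratio_fiber:.2f}")
```

The measured minimum thus lies within a factor of about 2.6 of the vacuum bound, and closer still to the bound for an assumed fiber link.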
Our measurements show that MetaTrace took advantage
of the state-of-the-art network capabilities offered by the
VIOLA grid. Solving larger input problems might necessi-
tate further improvements of the underlying network tech-
nology.
Table 4. Minimum duration of the internal and external communication
in the VIOLA-testbed.
          FZJ         FH-BRS
FZJ       27.3 µs     879.0 µs
FH-BRS    862.0 µs    30.3 µs
4.2.2 Incremental performance optimization
To optimize the performance of MetaTrace, we used
SCALASCA to identify undesirable wait states hoping that
they can be easily removed. The optimization was carried
out in two cycles each consisting of a trace analysis using
SCALASCA and a subsequent source-code modification.
The analysis of the unoptimized version showed an over-
all execution time of 1837.40 seconds aggregated across all
processes, whereby a major fraction (72.1 %) was spent in
MPI function calls. This MPI fraction is composed of the
time used for actual communication (15.4 %) and the time
spent waiting (56.7 %) for a communication partner. The
waiting time clearly dominated the overall communication
behavior, making it the most promising target
for our optimization efforts. Often, reasons for such wait
states can be found in the scheduling of communication op-
erations or in the distribution of work among the processes
involved.
Figure 2 shows a screen shot of SCALASCA’s trace anal-
ysis results. Apparently, the application suffered from grid-
specific Wait at Barrier situations (i.e., global) and non grid-
specific Late Sender and Wait at N×N situations (i.e., local),
when communicating or synchronizing. As the display indi-
cates, the global Wait at Barrier problem consumed 18.7 %
of the overall execution time. In addition, the local Late
Sender problem consumed 10.6 % of the overall execution
time. Finally, the local Wait at N×N problem caused 20.2 %
of the overall execution time. For a description of these pat-
terns, the reader may refer to [1].
Trace and Partrace synchronize at a global barrier before
Trace unidirectionally sends the velocity field to Partrace
for further processing. However, because Trace and Par-
trace are essentially two different programs, each submodel
invokes this barrier from a different function. As a result
both functions are diagnosed with the global Wait at Bar-
rier, although both occurrences are closely connected. Most
of the waiting time was attributed to the Partrace function
ReadFieldsFromTrace(), which had to wait until all pro-
cesses in Trace had reached the corresponding barrier call
in function printtolink(). That is, we detected an imbal-
ance between Trace and Partrace, since Partrace went ahead
of Trace. Moreover, Trace suffered from local Late Sender
and Wait at N×N situations, which together represent most
of the waiting time in internal communication.
The Trace-local Late Sender was concentrated in
cgiteration(). All Trace processes performed calcula-
tions inside cgiteration() and subsequently distributed
their local results to their nearest neighbors. Afterwards, a
dot product was calculated using MPI_Allreduce(). Although
the domain decomposition assigned equally-sized
subdomains to every process, border processes were quicker
because they had fewer neighbors to exchange border cells
with. Given that these processes had fewer communica-
tion partners, they not only waited during the data exchange
phase for their peers in the center but they could also leave
the data exchange phase earlier. That is, this imbalance in-
troduced two performance problems. The first problem oc-
curred while all Trace processes were synchronizing in pairs
to exchange their local results, causing a local Late Sender
situation. The second problem occurred when Trace subse-
quently calculated the dot product, causing a local Wait at
N×N situation.
The goal of our first optimization was to make Trace
faster. More precisely, we assumed that reducing the Trace-local
Late Sender problem inside cgiteration() would
allow Trace to reach the synchronization point with Par-
trace earlier, which would also decrease the barrier waiting
time between the two submodels. We therefore replaced the
synchronous communication operations in cgiteration()
with their asynchronous counterparts, allowing more vari-
ability for the nearest-neighbor data exchange. Now, pro-
cesses inside Trace would be able to process received results
earlier. In addition, the Trace-local Wait at N×N situation
would also be reduced, since processes in the center of the
domain could leave the data exchange phase earlier as well.
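The effect of this change can be illustrated with a toy model, which is not the MetaTrace code. A process has to receive one message from each neighbor and spends a fixed time processing each message. With blocking receives posted in a fixed order, it waits for each neighbor in turn; with non-blocking receives (in MPI, for example, MPI_Irecv combined with MPI_Waitany, although the text does not state which calls were used), it can process whichever message arrives first. The function name and the arrival times are assumptions.

```python
def finish_time(arrivals, processing_time, in_arrival_order):
    """Time at which all neighbor messages have been received and processed.

    arrivals: arrival time of each neighbor's message, in posting order.
    in_arrival_order: False models blocking receives in posting order,
                      True models handling messages as they arrive.
    """
    order = sorted(arrivals) if in_arrival_order else list(arrivals)
    now = 0.0
    for arrival in order:
        # Wait until the message is there, then process it.
        now = max(now, arrival) + processing_time
    return now


# Hypothetical arrival times: the first neighbor in posting order is late.
arrivals = [5.0, 1.0, 2.0]

blocking = finish_time(arrivals, 1.0, in_arrival_order=False)
nonblocking = finish_time(arrivals, 1.0, in_arrival_order=True)

print(f"blocking, fixed order: {blocking}")    # 8.0
print(f"non-blocking:          {nonblocking}") # 6.0
```

In this model, handling messages in arrival order never finishes later than a fixed order, because the work for early messages overlaps with the wait for the late one.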
After our first optimization cycle, we measured an over-
all execution time of 877.90 seconds, corresponding to a
reduction by more than a factor of two. Now, only a frac-
tion of 42.0 % of the overall execution time was spent in
MPI function calls. Figure 3 depicts a screen shot of a difference
experiment [9] obtained by subtracting the original version
from the optimized one. Performance gains are
represented by sunken reliefs (negative numbers), perfor-
mance losses by raised reliefs (positive numbers). The num-
bers show the difference in execution time in percent rela-
tive to the unoptimized version. One can easily recognize
that the global Wait at Barrier as well as the Trace-local
Late Sender and Wait at N×N were significantly reduced.
For instance, the figure shows that the global waiting time
at the barrier inside ReadFieldsFromTrace was reduced
by roughly 14.1% of the total execution time.
Moreover, in the optimized version the global Wait at
Barrier problem consumed 8.6 % (18.7 % before) of the
overall execution time and the Trace-internal Late Sender
problem consumed 3.7 % (10.6 % before) of the overall ex-
ecution time. Finally, the Trace-local Wait at N×N prob-
lem caused 7.8 % (20.2 % before) of the overall execution
time. By means of asynchronous communication, we were
able to significantly reduce the Late Sender situation inside
Trace since Trace did not wait at synchronization points
inside cgiteration() during the internal data exchange.
In addition, Trace now needed less time for a single iter-
ation and so Trace reached the synchronization point with
Partrace earlier, which reduced the global Wait at Barrier
problem. Finally, the waiting time at the Trace-local Wait
at N×N situation was notably decreased as well, which was
caused by the elimination of synchronization points during
the preceding data exchange phase.
Figure 4. Vampir display: Event traces of all
Partrace processes during one simulation cy-
cle.
However, the application still suffered from a global Wait
at Barrier situation apparent in the two functions mentioned
earlier. We decided to perform a second optimization cy-
cle. Trace has variable simulation time steps which depend
on the accuracy of the respective calculation whereas Par-
trace uses constant time steps independent of the accuracy.
Since the communication between the two submodels is es-
sentially unidirectional and asynchronous by nature, we re-
placed the synchronous communication operations between
Trace and Partrace including the barrier call with their asyn-
chronous equivalents to eliminate the global Wait at Barrier
problem. It is worth noting that without the asynchronous
communication scheme, removing the barrier call would
cause waiting times during the data exchange. Also, al-
though in our case a decreased runtime of Partrace increases
the waiting time during the data transfer between Trace and
Partrace, we applied an optimization to Partrace as well.
Figure 4 visualizes the event traces of all Partrace processes
during one simulation cycle by showing a time line for each
process indicating its current execution state by color. Us-
ing VAMPIR’s zooming capability we examined the runtime
behavior further. Partrace used a systolic loop to distribute
its simulation data internally. We decided to replace the
original communication scheme with a collective communication
since the collective operation MPI_Allgather()
needs substantially less effort.
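To make the equivalence of the two schemes concrete, the following sketch simulates a systolic loop on a ring of processes in plain Python, without MPI. In every step each process forwards one block to its right neighbor; after p - 1 steps all processes hold all blocks, which is the result an allgather operation delivers. The function name and the ring direction are assumptions; the actual Partrace implementation is not shown in the text.

```python
def systolic_allgather(blocks):
    """Simulate a ring-based systolic loop over len(blocks) processes.

    blocks[r] is the block initially owned by process r. Returns, for each
    process, the list of all blocks ordered by owner rank, together with
    the number of communication steps that were needed.
    """
    p = len(blocks)
    held = [{rank: blocks[rank]} for rank in range(p)]
    # Index of the block each process forwards next (its own block first).
    outgoing = list(range(p))
    steps = 0
    for _ in range(p - 1):
        # Process r receives the block its left neighbor is forwarding.
        incoming = [outgoing[(rank - 1) % p] for rank in range(p)]
        for rank in range(p):
            held[rank][incoming[rank]] = blocks[incoming[rank]]
        outgoing = incoming  # forward the newly received block next
        steps += 1
    gathered = [[held[rank][i] for i in range(p)] for rank in range(p)]
    return gathered, steps


gathered, steps = systolic_allgather(["a", "b", "c", "d"])
```

The ring needs p - 1 dependent steps. An MPI library is free to implement MPI_Allgather() with a different algorithm, for example one that needs fewer steps for small messages, so delegating the exchange to the collective leaves this choice to the library.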
Table 5. Summary of performance measurements of the unoptimized version and after
each optimization cycle.
                   unoptimized    Optimization 1    Optimization 2
MPI fraction       72.1 %         42.0 %            34.4 %
Wait at Barrier    18.7 %         8.6 %             0.9 %
Late Sender        10.6 %         3.7 %             1.7 %
Wait at N×N        20.2 %         7.8 %             6.0 %
The results of our final performance measurement in-
cluding the aforementioned optimizations showed only a
fraction of 34.4 % of the overall execution time (771.50
seconds) spent in MPI function calls. Further, our analysis
results showed that the global Wait at Barrier problem
could be almost completely eliminated. Additionally, the Trace-local
Late Sender problem only consumed 3.9 % of the overall
execution time and the Trace-local Wait at N×N problem
caused 6.0 % of the overall execution time. Hence,
the major performance problems were significantly reduced
and, thus, the performance behavior was significantly im-
proved. Table 5 summarizes the values of the respective
performance problem after each optimization cycle accord-
ing to the functions mentioned above.
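The reported totals and MPI fractions can be combined into absolute times with a short calculation. The sketch below uses only the aggregate execution times from the text and the MPI fractions from Table 5; the dictionary layout and names are chosen for illustration.

```python
# Aggregate execution time in seconds and MPI fraction in percent,
# copied from the text and Table 5.
runs = {
    "unoptimized": (1837.40, 72.1),
    "cycle 1": (877.90, 42.0),
    "cycle 2": (771.50, 34.4),
}

baseline_s = runs["unoptimized"][0]
summary = {}
for name, (total_s, mpi_percent) in runs.items():
    mpi_s = total_s * mpi_percent / 100.0
    summary[name] = {
        "speedup": baseline_s / total_s,  # relative to the unoptimized run
        "mpi_s": mpi_s,                   # time spent in MPI calls
        "non_mpi_s": total_s - mpi_s,     # remaining time
    }

for name, row in summary.items():
    print(f"{name:12s} speedup {row['speedup']:.2f}  "
          f"MPI {row['mpi_s']:7.1f} s  other {row['non_mpi_s']:6.1f} s")
```

With these figures, the time spent outside MPI stays between roughly 506 and 513 seconds in all three versions, which is consistent with the optimizations removing waiting time rather than computation.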
Finally, we compared the application performance on the
VIOLA metacomputer achieved before and after our opti-
mizations with the performance when running on the ho-
mogeneous IBM AIX POWER 4+ cluster. While Figure 5 (a)
shows the total execution time before and after one and
two optimization cycles, Figure 5 (b) shows the correspond-
ing percentage of the execution time spent in MPI function
calls. In addition, the respective MPI waiting time is de-
picted. As can be seen, the overall execution time as well
as its MPI fraction is smaller in each experiment performed
on the homogeneous cluster than on the metacomputer. The
optimizations showed only minor influence on the applica-
tion performance in the homogeneous case. We were able to
significantly reduce the total execution time from 1837.40
seconds to 771.50 seconds on the metacomputer. Hence,
we were able to significantly reduce grid-specific perfor-
mance problems of a parallel computational grid applica-
tion by eliminating the major fraction of waiting times in
several optimization cycles.
Figure 5. Optimization results on a homogeneous cluster and a metacomputer.
(a) The total execution time before the optimization and after one and
two optimization cycles (x-axis: optimization cycle; y-axis: time [s];
series: two metahost case, one metahost case).
(b) The percentage of the waiting time and execution time spent in MPI
calls before the optimization and after one and two optimization cycles
(x-axis: optimization cycle; y-axis: [%]; series: MPI fraction and waiting
time for the two metahost case and for the one metahost case).
5 Conclusion
In this paper, we have shown that our extension to the
SCALASCA tool set in combination with statistical analyses
and time-line visualization provided by VAMPIR can be used
to evaluate and optimize the performance of a multi-physics
production code running on a heterogeneous and geograph-
ically dispersed metacomputer. Using the grid-enabled trac-
ing and analysis capabilities of the SCALASCA tool set, we
have determined relevant performance properties and have
experimentally demonstrated that this information can be
used to significantly improve performance.
First, we were able to verify that the bandwidth and latency
requirements of our application are met by the wide-area
connection in the VIOLA grid. Second, we presented a
detailed description of the performance optimizations ap-
plied to MetaTrace. While MetaTrace fully utilized the
entire network resources provided by the VIOLA grid, we
have shown in several optimization cycles that our modifi-
cations eliminated the major fraction of waiting times. In
addition, we compared results from a homogeneous cluster
with those obtained on the metacomputer, confirming that
some of the performance problems we identified are indeed
the consequence of using a metacomputer.
Given the fact that performance optimization for just a
single machine is already a non-trivial task that requires
substantial tool support, we argue that this is even more im-
portant for grid environments. With grid-enabled tools de-
velopers are able to optimize their applications to achieve an
appropriate performance level. Using MetaTrace as an ex-
ample, we have shown that grid-enabled performance tools
allow efficient execution of parallel applications in grid en-
vironments.
References
[1] D. Becker, F. Wolf, W. Frings, M. Geimer, B. Wylie, and
B. Mohr. Automatic trace-based performance analysis of
metacomputing applications. In Proceedings of the IEEE In-
ternational Parallel and Distributed Processing Symposium
(IPDPS), Long Beach, California, March 2007.
[2] B. Bierbaum, C. Clauss, T. Eickermann, L. Kirtchakova,
A. Krechel, S. Springstubbe, O. Wäldrich, and W. Ziegler.
Orchestration of distributed MPI-applications in a
UNICORE-based grid with metampich and metaschedul-
ing. In Proc. 13th European PVM/MPI Conference, Bonn,
Germany, September 2006. Springer.
[3] B. Bierbaum, C. Clauss, M. Pöppe, S. Lankes, and T. Bemmerl.
The new multidevice architecture of MetaMPICH
in the context of other approaches to grid-enabled MPI.
In Proc. 13th European PVM/MPI Conference, Bonn, Ger-
many, September 2006. Springer.
[4] BMBF (Ministry for Education and Research). Vertically
Integrated Optical Testbed for Large Applications in DFN
(VIOLA). http://www.viola-testbed.de/.
[5] Forschungszentrum Jülich. Solute Transport in Heterogeneous
Soil-Aquifer Systems. http://www.fz-juelich.de/icg/icg-iv/modeling.
[6] M. Geimer, F. Wolf, B. J. N. Wylie, and B. Mohr. Scalable
parallel trace-based performance analysis. In Proc. 13th Eu-
ropean PVM/MPI Conference, Bonn, Germany, September
2006. Springer.
[7] S. Haubold, H. Mix, W. E. Nagel, and M. Romberg. The
UNICORE grid and its options for performance analysis.
pages 275–288, 2004.
[8] W. Nagel, M. Weber, H.-C. Hoppe, and K. Solchenbach.
VAMPIR: Visualization and analysis of MPI resources. Su-
percomputer, 12(1):69–80, 1996.
[9] F. Song, F. Wolf, N. Bhatia, J. Dongarra, and S. Moore.
An algebra for cross-experiment performance analysis. In
Proc. of the International Conference on Parallel Process-
ing (ICPP), Montreal, Canada, August 2004. IEEE Com-
puter Society.
[10] F. Wolf and B. Mohr. Automatic performance analysis of
hybrid MPI/OpenMP applications. Journal of Systems Ar-
chitecture, 49(10-11):421–439, Nov. 2003.
Design and Evaluation of a 2048 Core Cluster System
Frank Mietke¹, Torsten Mehlan¹, Torsten Hoefler¹,², Wolfgang Rehm¹
¹ Technical University of Chemnitz, Chemnitz, 09107 GERMANY
{mief,tome,htor,rehm}@cs.tu-chemnitz.de
² Open Systems Laboratory, Indiana University, Bloomington IN 47405, USA
htor@cs.indiana.edu
Abstract
Designing a 2048 core high performance cluster, including
an appropriate parallel storage complex and a high
speed network, under the pressure of a limited budget (2.6
million euros) and of performance, thermal, and space
limitations is a challenging task.
In this paper, we present our design decisions and their rea-
sons, our experiences during the installation stage as well
as performance numbers using well-known benchmarks in
the field of scientific computing, networking and I/O, and
real world applications.
1 Introduction
Up to the beginning of the year 2000, the TOP500 list¹
was dominated by Massively Parallel Processor (MPP) (51.6%)
and Symmetric Multiprocessor (SMP) (33.8%) systems.
Cluster systems (5.6%) played
only a minor role. Since then the cluster architecture has
become the dominant supercomputing platform (81.2% in
11/2007 issue of TOP500 list). This was mainly due to a
growing PC/server market, which made the single machine
more affordable, the broader support of Linux, the inven-
tion of high-speed networks as well as a growing software
stack which simplified the setup, administration and pro-
gramming [17, 27].
1.1 Supercomputing in Chemnitz
The growing complexity of scientific problems and the
demand of more compute power led to the procurement
1bi-annual ranking of fastest supercomputers in the world, issue
11/1999
of the first supercomputer at the Technical University of
Chemnitz (TUC), a Parsytec GC 128 PowerPlus which was,
in 1994, one of the fastest machines in Germany. After
4 years it had become obsolete, and the need for a new
system able to serve the steadily growing user community
became evident. To achieve the best price-performance ratio,
the university computing center decided to design and build
its own cluster computer from desktop PCs running Linux. Two
independent Fast Ethernet networks served as the
high-performance communication and administration media.
This new system was named CLiC (Chemnitzer Linux Cluster)
and became operational in 2000. This cluster was, in
the Top500 metric, the fastest in Germany and the second
fastest self-made cluster in the world after
CPlant/Siberia at the Sandia National Laboratories. Its
price-performance ratio was also one of the best in the world.
The user community as well as the complexity of scien-
tific problems grew further which led, 4 years later, to new
discussions about an update of the supercomputer. A short
overview of current projects is given in [1].
We describe the design process and the hardware and
software experiences with this new system, called CHiC
(Chemnitzer High-Performance Cluster), in the following
sections. Several results of synthetic benchmarks and real
world applications are presented in Section 3 to assess the
performance of the newly deployed cluster. We summarize
the relevant results in Section 4 and outline future work
in Section 5.
2 CHiC
To further strengthen the HPC capabilities at the TUC
and thus accelerate the scientific outcome, the CHiC was
optimized for high-performance parallel computing as well
as high job throughput. In this section we discuss the
design, hardware, software and the first experiences with
the new system.
2.1 Design
In the year 2000 the CLiC was an effort to build a big
cluster (528 nodes) using desktop computers and a Fast
Ethernet communication network. All nodes were connected
through a single Fast Ethernet switch. To make the most of
the budget of 1.25 million Euro, this cluster was self-made
and exclusively based on open source software and tools.
This system ran for about 7 years and was a milestone for
the researchers in Chemnitz as well as for the whole HPC
community in Germany. The main achievement of the CLiC
system, its excellent price-performance ratio, was retained
as one of the main goals for the new CHiC system which is
described in this article.
The experiences with the aged CLiC set some further
goals we had to fulfill. The most error prone components
of the old system were the local hard disk, the memory
modules and the power supply. On the software side the
Andrew Filesystem (AFS) was sometimes difficult to han-
dle and to stabilize. To avoid or at least mitigate the above
problems, the new system had to remove or improve these
components. Therefore we decided to run diskless compute
nodes using server components and ECC protected mem-
ory modules. The whole software repository should reside
in a high performance clustered file system to enhance the
access performance to the application data and avoid some
problems of AFS during application runs.
The budget for the new system was set to about 2.6 million
Euro. For this budget we had to design a balanced machine
that would dissipate no more than 200 kilowatts of heat.
Furthermore, we decided to design the cluster ourselves but
not to deploy the hardware ourselves, while keeping full
responsibility for the software installation process. To support
the design process we collected user requirements through
project descriptions, questionnaires and interviews,
and by benchmarking user applications. The results showed
that we would need a well balanced general purpose system
with high performance in terms of floating-point operations
per second as well as memory bandwidth, and with job
throughput capabilities. Figure 1 shows the concept of the
targeted cluster system.
Compute Nodes After an evaluation of the commodity
processors available at that time we found that the best
price-performance ratio could be achieved with dual processor
SMP systems. The emerging dual core processors improved
this further. The processor of choice could be one of AMD
Opteron, Intel Xeon or IBM PowerPC-970. For the migration
from the old CLiC to the new system the Intel and
AMD processors seemed to be best suited but we left this
decision to the vendors. A compute node had to be equipped
with 4 GiB of main memory, which resulted from the user
requirements. As mentioned above, to improve the stability of
a node we decided against hard disks and for server
components with a better mean time between failures (MTBF).
To support some special projects, we integrated 12 nodes
with graphics card accelerators in the cluster.
[Figure 1: diagram with the labelled elements: 512 compute nodes (w/o hdd), 12 graphics nodes (with hdd), IO nodes (w/o hdd), login nodes (with hdd), management nodes (with hdd), storage complex, InfiniBand fabric, access gateway and campus network; links are annotated "2 cables each", "6 cables each" and "max. 8 cables" (redundancy); legend: InfiniBand cable, Gigabit-Ethernet cable.]
Figure 1. Cluster Concept at a Glance
Network Due to the higher budget it was possible to design
the system with a network offering much better latency and
bandwidth than the communication network of the CLiC, which
was a Fast Ethernet network comprising one big switch. To
simplify the network management we decided to request
proposals using only one network for all tasks like
communication, storage I/O, management, monitoring and campus
connectivity. The only network architecture which offered low
latency, high bandwidth, quality of service (QoS) and congestion
control, combined with a broad range of software APIs,
was InfiniBand [11, 12]. We had gained experience with
this network technology since its market introduction in
2002 [5, 8, 10, 9, 20, 24]. At that time, the biggest single
InfiniBand switch available had 288 ports. Therefore, we had to
plan a hierarchy of InfiniBand switches which is shown in
Figure 2. This hierarchy has some advantages regarding
availability: if one of the big switches fails, all nodes can
still communicate with half the bandwidth (5Gbit/s) in the average
case. If one of the small switches fails, only 12 nodes
become unreachable. The disadvantage of this switch
hierarchy is that there are communication patterns which
could halve the bandwidth in the worst case. We could not
mathematically prove this but one can intuitively lay these
patterns over the hierarchy.
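The availability argument can be checked with simple arithmetic. The following Python sketch is only an illustration: it assumes that each 24-port switch serves 12 nodes and has 6 uplinks to each of the two 288-port switches (one reading of the "6 cables each" annotation in Figure 2) and that every link runs at 10 Gbit/s. The function name and its parameters are hypothetical.

```python
def avg_bandwidth_per_node(nodes_per_leaf, uplinks_per_core, cores_alive, link_gbits):
    """Average bandwidth per node (Gbit/s) when all nodes of one small
    switch send through the big switches at the same time."""
    uplink_capacity = uplinks_per_core * cores_alive * link_gbits
    # A node can never exceed the speed of its own link.
    return min(link_gbits, uplink_capacity / nodes_per_leaf)

# Assumed values: 12 nodes and 2 x 6 uplinks per 24-port switch, 10 Gbit/s links.
healthy = avg_bandwidth_per_node(12, 6, cores_alive=2, link_gbits=10.0)
degraded = avg_bandwidth_per_node(12, 6, cores_alive=1, link_gbits=10.0)
print(healthy, degraded)
```

Under these assumptions the loss of one big switch leaves 6 uplinks for 12 nodes, which gives the 5 Gbit/s average quoted above.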
[Figure 2: network diagram with the labelled elements: 24-port InfiniBand switches, two 288-port InfiniBand switches, InfiniBand fabric access, InfiniBand/GbE gateway, GbE module with 6 ports, firewall module, two Cisco 6500 devices and the campus network; links are annotated "6 cables each" and "12 cables each"; legend: InfiniBand cable, Gigabit-Ethernet cable.]
Figure 2. Network Concept at a Glance
Storage For the new system, there were no special storage
requirements regarding the capacity. Therefore, we took the
latest storage consumption of the project directories on the
old system and calculated the necessary capacity for the next
5 years under the assumption that the needed capacity doubles
every 12 months. To ensure I/O scalability and a good
price-performance ratio we decided to request at least 2GBytes/s
of aggregate throughput to the hard disks. The storage complex
should be as redundant as possible; therefore we required RAID
level 6 or better, which means that 2 hard drives could fail
without data loss. The connectivity to the cluster also had to
be redundant.
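The capacity estimate follows from compound doubling. The Python sketch below illustrates the calculation under the stated assumption of one doubling per year; the function name is hypothetical and the actual consumption on the old system is not given here, so the result is expressed as a multiple of today's consumption.

```python
def required_capacity(current, years, doubling_period_years=1.0):
    """Capacity needed after `years` if demand doubles every period."""
    return current * 2 ** (years / doubling_period_years)

# Growth relative to the current consumption (current = 1.0) over 5 years.
growth_factor = required_capacity(1.0, years=5)
print(growth_factor)
```

With a 12-month doubling period, the 5-year horizon corresponds to 32 times the current consumption.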
On the software side, we chose Lustre from Cluster File
Systems Inc. because of its good performance [6], native
InfiniBand support, robustness and open source availability.
We tested other open source filesystems like PVFS2 and
GFS but neither showed equivalent properties [15]. We
left it to the vendors to offer a complete proprietary solution
that conformed to our requirements.
Software Since we had decided to run diskless compute
nodes we looked for a maintained open source toolkit to
facilitate this setup. The only software package we found was
the Warewulf cluster toolkit [16]. It supports the creation of
node images, and the provisioning, management and
monitoring of these nodes. As the underlying Linux distribution
we chose Scientific Linux 4.x, a Red Hat Enterprise Linux
clone, because it is supported in the computing center.
All the remaining software for monitoring, management,
message passing, development and job startup
was required to be completely open source. The only
exception to this was the procurement of an optimizing
compiler suite and math library for the offered hardware
architecture.
2.2 Hardware
The CHiC consists of 530 compute, 12 visualization, 8
I/O, 2 management and 2 login nodes. All nodes are con-
nected with a high speed InfiniBand network and connected
to a 60 TiB (80 TiB gross) storage complex running the par-
allel filesystem Lustre. The hardware was delivered by IBM
(nodes), Voltaire (InfiniBand interconnection network) and
Megware/Xiranet (Storage System) and was installed in 18
water cooled racks from Knürr.
A compute node (IBM x3455) comprises two AMD
Opteron 2218 Dual-Core 2.6GHz CPUs, 4GiB DDR2
(667MHz) ECC RAM, a single-port Voltaire InfiniBand
410Ex HCA and an Ethernet port with IPMI support. Each
visualization node (IBM IntelliStation A Pro) is equipped
with two AMD Opteron 285 Dual-Core 2.6GHz CPUs,
4GiB DDR (400MHz) ECC RAM, a two-ported Voltaire
InfiniBand HCA 400 (PCI-X) and an Ethernet port (without
IPMI). The visualization nodes are also equipped with an
nVidia Quadro FX 4500 X2 graphics card and two 250GiB
SATA HDDs. The I/O nodes are identical to the compute
nodes except that they have 16GiB DDR2 (667MHz) ECC
RAM, a two-ported Voltaire InfiniBand HCA 400Ex and an
80GiB SATA HDD. Two I/O nodes have an integrated LSI
SAS controller. A management node (IBM x3755) contains
two AMD Opteron 8218 Dual-Core 2.6GHz CPUs, 6GiB
DDR2 (667MHz) ECC RAM, 4 Ethernet ports with IPMI
support, one two-ported Voltaire InfiniBand HCA 400Ex
and a 4x300GB 10k SAS RAID5 with hot-spare. The lo-
gin nodes (also IBM x3755) are similar to the management
nodes with the exception that they have four AMD Opteron
8218 Dual-Core with 2.6GHz and 16GiB DDR2 (667MHz)
ECC RAM.
The nodes are connected with four different networks
instead of one single InfiniBand network as originally
planned. The reason for this was that IBM could sell the
x1350 cluster product only with all these networks bundled.
An InfiniBand-only installation might have been possible
through an IBM Business Partner, but there was no appropriate
offer. Each node connects to the 10Gbit/s InfiniBand fabric,
a low-end Gigabit-Ethernet network, a serial console network
and a Keyboard-Video-Mouse (KVM) network. The
InfiniBand switch components, comprising two 288-port
switches (ISR 9288) and 46 24-port switches (ISR 9024S),
form a 5-stage Clos network. This network is mainly used
for computation and some administration tasks.
The remaining administration tasks are done through the
Ethernet network. Each rack is connected to the other racks
by only two Gigabit-Ethernet lines and all compute nodes
in one rack are connected to one Gigabit-Ethernet switch
which offers full bisectional bandwidth.
The other two networks are used for monitoring purposes
only. To connect all nodes with the campus network, we use
a special InfiniBand-Ethernet gateway device (ISR 9096)
which provides the remaining 48 InfiniBand ports.
The 60TiB storage complex consists of 10 RAID controller
systems (XAS1000), 10 SATA JBODs² with 16
500GiB hard disks each and 1 Serial Attached SCSI (SAS)
JBOD with 16 36GiB SAS hard drives. The SAS JBOD
is connected to two of the I/O nodes and serves as metadata
repository for the parallel filesystem Lustre. For object
data storage the 10 SATA JBODs are divided into 20
RAID-6 sets which are managed by 20 RAID controllers
(two RAID controllers per host). The connection
topology is shown in Figure 3. This figure also shows the
redundant approach of creating pairs of RAID controllers
and JBODs.
[Figure 3: diagram with the labelled elements: InfiniBand, one MDS and two OSS (IBM x3455), two RAID controllers, SAS; one group is annotated "5x".]
Figure 3. Connection Topology of Storage Complex
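The gross and net capacity figures can be related to the disk layout with a short calculation. The Python sketch below is an illustration only: it assumes that the 160 SATA disks are split evenly over the 20 RAID-6 sets (8 disks per set) and that each set gives up the capacity of two disks for parity. The function name is hypothetical.

```python
GIB_PER_TIB = 1024

def raid6_capacity_tib(jbods, disks_per_jbod, disk_gib, raid_sets):
    """Return (gross, net) capacity in TiB for disks split evenly into RAID-6 sets."""
    total_disks = jbods * disks_per_jbod
    disks_per_set = total_disks // raid_sets   # assumed even split
    data_disks = disks_per_set - 2             # two disks' worth of parity per set
    gross = total_disks * disk_gib / GIB_PER_TIB
    net = raid_sets * data_disks * disk_gib / GIB_PER_TIB
    return gross, net

gross_tib, net_tib = raid6_capacity_tib(jbods=10, disks_per_jbod=16,
                                        disk_gib=500, raid_sets=20)
print(f"gross {gross_tib:.1f} TiB, net {net_tib:.1f} TiB")
```

Under these assumptions the layout yields about 78 TiB gross and about 59 TiB net, which is consistent with the rounded 80 TiB gross and 60 TiB usable capacity stated in Section 2.2.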
2.3 Software
Using Scientific Linux 4.4 (RedHat Enterprise Linux
clone) ensures the best support for all hardware (especially
the IBM x3755 systems) and software components (espe-
cially Lustre parallel filesystem) we had installed. Another
reason is the usage of this distribution in the local comput-
ing center. To further facilitate the installation process we
decided to work with the Extreme Cluster Administration
Toolkit (xCAT) [3] in conjunction with the Warewulf toolkit
to run diskless and diskful nodes under one administration
domain. We use Nagios version 2.9 to monitor all the nodes
and infrastructure components.
On the system side the Open Fabrics Enterprise Edition
InfiniBand software stack in version 1.1 is used. To acceler-
ate the I/O throughput we installed the object based parallel
filesystem Lustre 1.6.0beta7 where all home/project direc-
tories and software installations reside. This filesystem in-
cludes native InfiniBand support and offers high throughput
performance.
For development of application codes the GNU compiler
suite in versions 3.4.6 and 4.2.0 as well as the QLogic
EKOPath Compiler suite 3.0 were installed. As MPI
middleware, Open MPI 1.2, MVAPICH-0.9.9 and MVAPICH2-0.9.8
can be used. Several math libraries, like Goto BLAS
1.13 and AMD Core Math Library (ACML) 3.6.0, are available
for users. To easily manage this software set and its
environment variables the Module [4] tool was installed.
² JBOD (Just a Bunch of Disks) here means a chassis with special controller hardware.
To facilitate requests for nodes we installed the resource
management system TORQUE in version 2.1.8 and the
scheduler Maui in version 3.2.6p20. This ensures a seam-
less migration from the old system where a similar installa-
tion using OpenPBS was used because the user commands
are the same.
2.4 Experiences
During the installation and the first months of production,
we gathered several experiences on the hardware and
software level. Generally, the IBM hardware seems to be
very stable and reliable so far. However, to get the best
memory performance a BIOS update was necessary, which
doubled the achievable memory bandwidth. The management
controllers (IPMI) have a documented feature that they are
not available during the PXE boot stage. Sometimes, after
rebooting, a node does not get a DHCP lease. In this case
the only way to reboot the node again is to use the switched
power distribution unit. Otherwise, the IPMI information is
really helpful in finding hardware defects if they occur.
The InfiniBand network performs really well but
some minor drawbacks of the current software installation
were revealed. We tried using IP over InfiniBand in
a high-availability mode on our server nodes, but when
migrating the IP address from one port to the other the
server node itself was no longer reachable from the other
nodes. We are convinced that the problem will be solved
with OFED version 1.2. From time to time we experience
a similar problem where a random node cannot reach the
management node but can reach all the others. This might be
related to the problem described before. On the InfiniBand
hardware level, the InfiniBand-to-Ethernet gateway revealed a
single point of failure, the software image. To ensure full
redundancy and no single point of failure one would have
to insist on buying at least two devices. We accepted the
one-device solution with hardware redundancy in the internal
fabric due to delivery problems of other solutions.
A documented problem with the InfiniBand stack itself occurs
if the system() C function, which itself calls fork(),
is called. This leads to a failure during the job run and an
abort of the job. To solve this problem we have to install a
relatively recent vanilla kernel and the latest versions of the
InfiniBand stack and MPI implementations.
The Lustre filesystem shows good performance numbers,
as can be seen in the next section, but sporadically the
metadata server behaves strangely when the
filesystem is under load. Currently, this could be seen when
running several stress tests but not with production codes.
Here we also believe that an update to the latest stable
version might resolve this issue. The biggest drawback of the
current Lustre implementation is that, if one Object Storage
Target (OST)³ fails and is lost, one part of the filesystem is
missing. Due to the even distribution of files over all OSTs
the loss of one OST could mean that the remaining data is
useless and the lost files have to be restored from a backup.
Therefore, the RAID level used should be as redundant as
possible to make this problem less likely. Besides these minor
issues, the Lustre filesystem exhibits a really good failover
capability. The only task is to mount the filesystem on
the hot-standby metadata or object storage server and
ensure connectivity. Due to the several issues we were facing
with the Lustre parallel filesystem we now recommend
having some kind of backup system.
The batch system TORQUE is adaptable to all problem
cases, has a simple configuration and good support for
diskless clients. However, we could not configure all of our
policies with the standard configuration process; therefore
we have written a wrapper script for the main user command
qsub which now enforces these policies. Further advantages are
the big user community and a Python interface to the batch
system.
3 Benchmarks
For assessing the effectiveness of the cluster system
and its software stack we have performed several micro-
benchmarks and application runs. This will show the indi-
vidual and combined performance of several subsystems.
3.1 Synthetic Benchmarks
In the following we will present performance num-
bers of STREAM, Intel MPI Benchmarks (IMB), High-
Performance Linpack (HPL) and Interleaved Or Random
(IOR) benchmark.
³ Lustre differentiates between Object Storage Server (OSS) and Object
Storage Target (OST). The latter represents the real block device and
provides access to the chunks of user files. The OSS provides the network
request handling for one or more local OSTs.
STREAM The STREAM benchmark [19] is a simple but
effective stress test of the memory subsystem. The benchmark
consists of four kernels, COPY, SCALE, ADD and
TRIAD. TRIAD performs the operation
a[i] = b[i] + q · c[i]
with vectors of 2 million double precision elements (8 Byte
words). This is supposed to avoid cache effects. Furthermore
one can simply calculate the achieved floating-point
performance. One iteration step of the above calculation
includes two floating-point operations. This is multiplied by
the number of iterations and then divided by the execution
time. Using the result of memory speed and floating-point
performance one can calculate the balance of the system
which is defined as
$$\text{balance} = \frac{\text{peak floating ops/s}}{\text{sustained memory ops/s}}$$
This balance can be interpreted as the number of floating-
point operations that can be performed during the time for
an average memory access. To calculate the sustained mem-
ory ops/s one must simply divide the measured memory
bandwidth by the number of bytes of one double precision
element, in our case 8 bytes.
To get more comparable performance numbers we
benchmarked an Intel Woodcrest system (2.0GHz dual-SMP
dual core, 533MHz⁴ DDR2 main memory) and
one of our compute nodes⁵ using several numbers
of DIMM modules (2, 4 and 8) in the machines.
The gcc-4.2 and the PathScale-3.0 compilers with the
-O3, -march/-mcpu/-mtune flags set to the appropriate
architecture were used to compile the benchmark.
OpenMP support was enabled and the additional
-fprefetch-loop-arrays flag was used for gcc-4.2.
Table 1 shows the results of the TRIAD benchmark, in-
cluding the measured memory bandwidth and the “balance”
as described above. The peak floating-point performance
of a single Opteron core is 2FLOP/cycle · 2.6GHz and
4FLOP/cycle · 2.0GHz for a single Woodcrest core.
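The relation between measured bandwidth, peak performance and balance can be reproduced with a few lines of Python. This is a sketch for illustration, not the STREAM source code: the function names are hypothetical, the kernel is written with plain lists rather than optimized arrays, and MB/s is assumed to mean 10^6 bytes per second.

```python
def triad(b, c, q):
    """STREAM TRIAD kernel: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]

def balance(peak_flops, bandwidth_bytes_per_s, bytes_per_element=8):
    """Peak floating-point ops/s divided by sustained memory ops/s."""
    memory_ops = bandwidth_bytes_per_s / bytes_per_element
    return peak_flops / memory_ops

# Peak per core as stated in the text: FLOP/cycle times clock rate.
opteron_peak = 2 * 2.6e9
woodcrest_peak = 4 * 2.0e9

# Table 1, gcc-4.2, 1 thread, 2 DIMMs; MB/s assumed to mean 10^6 bytes/s.
opteron_balance = balance(opteron_peak, 3294.2e6)
woodcrest_balance = balance(woodcrest_peak, 3063.7e6)
print(round(opteron_balance, 1), round(woodcrest_balance, 1))
```

These two inputs give approximately 12.6 and 20.9, matching the corresponding Table 1 entries. The multi-threaded rows appear to follow the same formula with the peak multiplied by the number of threads.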
The first observation is the clear advantage of the Path-
Scale compiler for both architectures. It seems that the
prefetching of data from memory is much better imple-
mented with this compiler. Another problem we have been
facing is the high variance in the results achieved with the
-fprefetch-loop-arrays optimization flag of gcc
compiler running with 4 threads. Here we took the best
value for comparison but sometimes the achieved band-
width is only half of the given values. Finally, one can
clearly see the advantage of the AMD architecture with in-
tegrated memory controller versus the shared memory con-
troller of the Intel one.
At the time of procurement there was no official Intel compiler
available for the AMD64 architecture. More recent
benchmarks we made with version 10 of Intel’s compiler
suite have shown the same relative gap between AMD and
Intel processors. Using the latest Intel compilers improved
the bandwidth to the memory compared with the PathScale
compiler results.
⁴ We had only these DIMM modules in the machines but the 667MHz
DDR2 modules would only be marginally better.
⁵ Due to the multiplier used in the Opteron to get the CPU speed,
2.6GHz is a perfect match with the memory speed of 667MHz because
no decrease in memory bandwidth is necessary.
                             gcc-4.2                             pathscale-3.0
                    Opteron           Woodcrest           Opteron           Woodcrest
Threads  DIMMs  BW (MB/s) Balance  BW (MB/s) Balance  BW (MB/s) Balance  BW (MB/s) Balance
1        2        3294.2    12.6     3063.7    20.9     5655.7     7.3     3672.8    17.4
1        4        3227.1    12.9     3252.0    19.7     5572.9     7.4     3896.4    16.4
1        8        3731.0    11.1     3338.1    19.2     5769.8     7.2     3959.6    16.2
2        2        3708.6    22.4     3230.5    39.6     6056.0    13.7     3967.9    32.2
2        4        3212.3    25.9     4345.8    29.4     6114.7    13.6     5061.7    25.3
2        8        4854.7    17.1     5232.6    24.5     6520.9    12.7     5876.6    21.8
4        2        3142.9    52.9     3255.1    78.6     5025.1    33.1     3949.3    64.8
4        4        7426.8    22.4     4322.3    59.2    11527.4    14.4     5111.2    50.1
4        8        9345.7    17.8     5294.5    48.3    12796.4    13.0     5653.6    45.3
Table 1. Results of STREAM TRIAD Benchmark
IMB The Intel MPI Benchmarks [13] provide a set of
concise communication kernels for evaluating the most im-
portant MPI functions. It delivers simple timings and
throughput values for message sizes between 1 Byte and
4 MiB in the standard mode.
Our goal is to compare the four different MPI implemen-
tations, Open MPI 1.2.0, MVAPICH2-0.9.8, MVAPICH-
0.9.8 and MVAPICH-0.9.9beta. For space reasons we only
compare the PingPong, PingPing, SendRecv, Allreduce,
Alltoall and Broadcast benchmark results because they are
the most important ones for us.
The Ping-Pong kernel uses the blocking MPI_Send() and
MPI_Recv() functions to implement its well-known
uni-directional communication pattern. The Ping-Ping kernel
uses the non-blocking MPI_Isend(), starts this operation
on both sides simultaneously and then blocks in
an appropriate MPI_Recv(). In this way it is similar to
a Ping-Pong benchmark with non-optimal conditions
(oncoming traffic). To measure the bi-directional performance,
the Send-Recv kernel establishes a periodic communication
chain where each process receives from the left and
sends to the right. This benchmark should reveal the
possible full-duplex bandwidth. The collective benchmarks,
in our case Allreduce, Bcast and Alltoall, are simple calls
to their appropriate collective MPI functions with a simple
root-rotation (cf. [13]).
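The periodic communication chain of the Send-Recv kernel can be pictured as a ring of processes. The Python sketch below only computes the communication partners and does not call MPI; it is not taken from the IMB sources, and the function name is hypothetical.

```python
def ring_neighbours(rank, size):
    """Return (left, right) partners of `rank` in a periodic chain of `size` processes."""
    left = (rank - 1) % size
    right = (rank + 1) % size
    return left, right

# Every process receives from its left neighbour and sends to its right one.
chain = [ring_neighbours(rank, 4) for rank in range(4)]
print(chain)
```

The modulo operation closes the chain, so the first process receives from the last one and every link carries traffic in both directions across the ring.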
In Figures 4 and 5 we show the measured bandwidth for
the PingPong and PingPing benchmarks using two nodes (1
core per node). The results for SendRecv, Allreduce, Alltoall
and Broadcast using 32 nodes (1 core per node) are
shown in Figures 6, 7, 8 and 9.
In the PingPong benchmark all MPIs show nearly the
same numbers and achieve bandwidth values of about
900MB/s, whereby Open MPI exhibits a slightly better
bandwidth for large messages and a slightly worse one for
small messages. Between 8KB and 16KB message size all MPIs
implement the transition from an eager protocol to a
handshake protocol.
[Figure 4: plot "IMB Results of PingPong for 2 Nodes"; bandwidth in MB/s over message size in bytes (logarithmic axis) for MVAPICH2-0.9.8, MVAPICH-0.9.9beta, MVAPICH-0.9.8 and Open MPI 1.2.0.]
Figure 4. PingPong Results
Bigger differences among the MPI implementations can be
seen in the PingPing benchmark which simulates a
non-optimal condition. The maximum achievable bandwidth
is 700MB/s for MVAPICH-0.9.8, about 750MB/s for Open
MPI and about 800MB/s for MVAPICH2 and MVAPICH-0.9.9beta
when using a 4MB message size.
The SendRecv test shows twice the bandwidth of the one-way
PingPing benchmark as expected, but a strange behavior
can be seen for MVAPICH2 which achieves less than
half the bandwidth of the other MPIs. Sometimes this effect
is visible on higher node counts but we currently have no
explanation. When running on 2, 4, 8 and sometimes
16 nodes it achieves the same bandwidth as MVAPICH-0.9.9beta.
This effect is still under investigation.
Another transition, from InfiniBand inline send to “normal”
send operations, seems to be visible in the Alltoall benchmark
as the first kink. Another kink for Open MPI
is seen between 8KB and 16KB message size, which
comes from the protocol transition. Maybe due to some
optimizations of the Alltoall MPI function the other MPIs
did not show this behavior at this message size. The latency
of this MPI collective for small message sizes is worse for
Open MPI compared to the other MPIs.
The Allreduce benchmark exhibits a worse behavior for
Open MPI compared to the other MPIs, as already seen in
the Alltoall benchmark. The Broadcast benchmark shows
no significant differences among the several MPIs.
The biggest problem with this benchmark is the ambiguous
interpretation of the results. For some parameter testing
this benchmark seems to be a good test tool but for
an overall evaluation of the several MPI implementations
it should always be used in combination with application
benchmarks. One example is the usage of polling or
callback-triggered completion. Polling is the fastest method for
waiting on messages but wastes CPU cycles. So, what is
good for micro-benchmarks need not necessarily be good
for real applications [2].
[Figure 5: plot "IMB Results of PingPing for 2 Nodes"; bandwidth in MB/s over message size in bytes (logarithmic axis) for MVAPICH2-0.9.8, MVAPICH-0.9.9beta, MVAPICH-0.9.8 and Open MPI 1.2.0.]
Figure 5. PingPing Results
[Figure 6: plot "IMB Results of SendRecv for 32 Nodes"; bandwidth in MB/s over message size in bytes (logarithmic axis) for the same four MPI implementations.]
Figure 6. SendRecv Results
[Figure 7: plot "IMB Results of Allreduce for 32 Nodes"; average communication time in µs over message size in bytes (both axes logarithmic) for the same four MPI implementations.]
Figure 7. Allreduce Results
[Figure 8: plot "IMB Results of Alltoall for 32 Nodes"; average communication time in µs over message size in bytes (both axes logarithmic) for the same four MPI implementations.]
Figure 8. Alltoall Results
[Figure 9: plot "IMB Results of Broadcast for 32 Nodes"; average communication time in µs over message size in bytes (both axes logarithmic) for the same four MPI implementations.]
Figure 9. Broadcast Results
HPL Solving a system of linear equations is fundamental
in the field of scientific computing. The typical way to
solve such a system is an LU factorization followed by a
backward substitution.
The High-Performance Linpack benchmark [21] solves a
random dense linear system on distributed-memory computers
using the above methods. The aim is to measure the
maximum floating-point performance of a supercomputer.
The algorithm itself is scalable but depends slightly on the
latency of the communication network and on the memory
subsystem, as shown in Figure 10. For this
test we used merely 4 nodes with 16 cores and compared
using 4 or 8 memory slots out of 12, and using the
TCP stack with IP over InfiniBand or the native InfiniBand
verbs inside the MPI implementation. The parameters in the
input file for the benchmark were always the same and we
measured the floating-point performance for several process
grids. Using the definition of the efficiency below, we
achieve an efficiency that is about 3% better when using the
native InfiniBand verbs, which exhibit a much lower latency.
We gain a further 1% if we use 8 instead of 4 memory
slots. We repeated the benchmark several times and the
relative gap stayed the same. Running on 4 nodes we could
achieve an efficiency of 84% when taking the best measured
value into account.
[Figure 10: plot "HPL Results for 4 Nodes (16 Cores)"; floating-point performance in Gflop/s (axis range 58 to 72) over the P_Q process grid (4_4, 2_8, 1_16) for OpenIB_4DIMMS, OpenIB_8DIMMS and TCP_4DIMMS.]
Figure 10. HPL Results
Therefore, this benchmark can be used to get another
measure of the balance of the whole system, the efficiency.
$$\text{system efficiency} = \frac{R_{\text{max}}}{R_{\text{peak}}}$$
Rmax is the measured HPL performance and Rpeak is the
theoretical peak performance as presented in the
STREAM paragraph. The result of this benchmark is used
for the well-known bi-annual Top500 list of the fastest
supercomputers in the world on which this benchmark was
run. The problem with this benchmark is that it mainly
assesses one aspect of today’s supercomputers, the
floating-point performance. A ranking depending only
on this result is not expressive enough to assess a
supercomputer. To overcome this problem the HPC Challenge
suite [18] was composed.
The largest run we completed on the CHiC used 520 nodes
(2080 cores), the PathScale-2.4 compiler suite,
MVAPICH-0.9.7-mlx2.2.0 (shipped with OFED-1.1) and the
Goto-BLAS library version 1.10. The achieved result was
8210 GFlop/s, which corresponds to an efficiency of 76%. The
CHiC entered the Top500 list at rank 117 in June 2007
(rank 237 in November 2007). An equivalent system using
Intel Woodcrest CPUs could achieve about twice as much.
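As a plausibility check, the 76% figure can be reproduced from the numbers above. The following sketch is illustrative only: the clock rate of 2.6 GHz and the 2 floating-point operations per cycle and core are assumptions that are not stated in this section, and the function names are hypothetical.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak performance Rpeak in GFlop/s."""
    return cores * clock_ghz * flops_per_cycle


def system_efficiency(rmax_gflops: float, rpeak_gflops: float) -> float:
    """System efficiency = Rmax / Rpeak."""
    return rmax_gflops / rpeak_gflops


# 2080 cores and Rmax = 8210 GFlop/s are taken from the text.
# ASSUMPTION: 2.6 GHz clock rate and 2 flops per cycle per core.
rpeak = peak_gflops(cores=2080, clock_ghz=2.6, flops_per_cycle=2)
efficiency = system_efficiency(8210.0, rpeak)
print(f"Rpeak = {rpeak:.0f} GFlop/s, efficiency = {efficiency:.0%}")
```

Under these assumptions Rpeak is about 10816 GFlop/s, which is consistent with the reported efficiency of 76%.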
IOR The best way to assess the performance of the parallel
file system and its underlying hardware components is to
benchmark it with several application access patterns. The
benchmark b_eff_io [23] aims to produce a characteristic
average value of the I/O bandwidth achievable with parallel
MPI-I/O applications exhibiting various access patterns. The
result should be a comparable number for storage systems,
similar to the Top500 benchmark. This benchmark could not be
run on our Lustre file system due to the lack of full POSIX
locking support in version 1.6.x of Lustre. Therefore we
chose another benchmark, IOR [6], which fulfills the above
requirement. IOR is a parallel file system bandwidth testing
code which was initially developed to test GPFS [25] from
IBM on the ASCI Blue Pacific and White machines at the
Lawrence Livermore National Laboratory [28]. The supported
access patterns are an attempt to represent the access
patterns of ASC applications.
The benchmark supports three access patterns: “one file per
process”, “shared file segmented access”, and “shared file
strided access”. The main difference between the two shared
file access patterns is whether the data of a process is
contiguous (segmented) or non-contiguous (strided) in the
file. Several interfaces such as POSIX and MPI-IO are
available, and some interface-specific parameters can be
fine-tuned, for example the use of collective functions with
the MPI-IO interface. The result of the benchmark is always
the best read/write bandwidth achieved among all
repetitions. The implementers justify this with the argument
that they run the benchmark during production cycles, when
other applications access the storage system simultaneously.
Our benchmark runs were performed during production cycles
as well.
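The difference between the two shared-file patterns can be made concrete by computing where each process places its blocks in the file. The following sketch is illustrative only; the function names and the block-based layout are assumptions made for this example and are not taken from the IOR source code.

```python
def segmented_offsets(rank: int, blocks_per_proc: int, block_size: int) -> list:
    """Segmented access: each process owns one contiguous segment of the file."""
    segment_start = rank * blocks_per_proc * block_size
    return [segment_start + i * block_size for i in range(blocks_per_proc)]


def strided_offsets(rank: int, nprocs: int, blocks_per_proc: int, block_size: int) -> list:
    """Strided access: the blocks of all processes are interleaved round-robin."""
    return [(i * nprocs + rank) * block_size for i in range(blocks_per_proc)]


# Example: 4 processes, 3 blocks each, block size 1 (arbitrary unit).
print(segmented_offsets(rank=1, blocks_per_proc=3, block_size=1))
print(strided_offsets(rank=1, nprocs=4, blocks_per_proc=3, block_size=1))
```

In the segmented case the blocks of process 1 are adjacent in the file, whereas in the strided case they are separated by the blocks of the other three processes.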
In Figures 11 and 12 we show the read and write performance
for the three access patterns described above, using the
POSIX interface on several node counts. For the “file per
process” case we measured with the 2.5GB file striped over
either one object storage target (OST) or 20 OSTs. For the
“shared file” test cases the single file is always striped
over 20 OSTs. The file size is (No. of Nodes) · 2.5GB, since
each node reads/writes a 2.5GB data set in this case. The
transfer size parameter of IOR is set to 1MB, which is the
stripe size of the Lustre file system installation.
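To illustrate what striping over 1 or 20 OSTs means for a single transfer, the sketch below maps file offsets to OSTs. It assumes round-robin placement of stripes, interprets 1MB as 1 MiB and 2.5GB as 2560 MiB, and uses hypothetical function names; the returned index is a position within the file's own list of OSTs, not a global OST number.

```python
MIB = 1024 * 1024


def ost_index(offset: int, stripe_size: int = MIB, stripe_count: int = 20) -> int:
    """Index (within the file's OST list) of the OST that stores this byte offset."""
    return (offset // stripe_size) % stripe_count


def osts_touched(offset: int, length: int, stripe_size: int = MIB,
                 stripe_count: int = 20) -> set:
    """Set of OST indices accessed by one transfer of `length` bytes at `offset`."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    # More than stripe_count consecutive stripes cannot add new OSTs.
    stripes = range(first, min(last, first + stripe_count - 1) + 1)
    return {s % stripe_count for s in stripes}


# One aligned 1 MiB transfer hits exactly one OST ...
print(osts_touched(offset=5 * MIB, length=MIB))
# ... while a whole 2560 MiB file spreads over all 20 OSTs.
print(len(osts_touched(offset=0, length=2560 * MIB)))
```

With a transfer size equal to the stripe size, every aligned transfer is served by a single OST, so the number of clients per OST rather than the size of the individual request determines the contention.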
Figure 11. IOR Read Results (IOR results for 1MB transfer
size, READ: aggregate I/O throughput in MiB/s over the number
of nodes for fpp_1OST, fpp_20OST, seg_20OST and str_20OST)
Figure 12. IOR Write Results (IOR results for 1MB transfer
size, WRITE: aggregate I/O throughput in MiB/s over the
number of nodes for fpp_1OST, fpp_20OST, seg_20OST and
str_20OST)
The highest values were achieved with the “file per process”
test case with no file striping running on 96 nodes:
3.2 GiB/s write and 2.6 GiB/s read performance. These results
remain relatively stable if at least 16 nodes each work on a
big file per process. This comes from the separation of
files among the OSTs: with 16 nodes, each file is placed on
a separate OST. It also means that this access pattern
scales relatively well up to 500 nodes.
If a file is striped over all 20 OSTs, concurrent access to
the hard drives has a major impact on the performance of the
“file per process” test case when more than 16 nodes are
used. If fewer than 16 nodes are used, the write performance
is much better than in the no-striping case. The read
performance for this test case is always worse due to the
much higher seek time overhead.
In the “shared file” benchmark the performance numbers for
node counts of 96 and 128 could not always be measured due
to unexpected behavior of the metadata server during the
runs. We believe that the reason is the use of the 1.6Beta7
version of Lustre. For more than 8 nodes the strided case
shows poor write performance, which was expected due to the
non-contiguous access pattern. The segmented case shows
nearly the same write performance as the “file per process”
case on small node counts, which was also expected since the
segmented access pattern is nearly equivalent to the “file
per process” one using one or several OSTs. For higher node
counts the access pattern corresponds more closely to the
“file per process” case using 20 OSTs, which can also be
seen in the performance numbers. The biggest influences on
the read/write performance when striping over all 20 OSTs
are the number of locks necessary to access the respective
part of the file and the slow seek time of the SATA disks.
With the same number of storage servers, more performance
can be gained by using more hard drives and thus more OSTs
per server when running with a large number of clients.
3.2 Application Benchmarks
In this section we present performance numbers of real-world
applications used at our site, which we gathered during the
tender process. These application runs were done on a
16-node Intel Woodcrest cluster and a 16-node AMD Opteron
cluster. Both used InfiniBand as interconnect, and the nodes
were dual-processor dual-core machines. The Intel cluster
comprised 3.0GHz CPUs and 8GB RAM per node, with Intel
compiler suite 9.x and Math Kernel Library 8.x installed.
The nodes of the AMD system were similar to the current CHiC
nodes except that they comprised 8GB RAM per node with all
memory slots filled. This system had the PathScale compiler
suite 2.3 and the corresponding AMD Core Math Library 3.0
installed. For benchmarking, both compilers were used with
no aggressive optimization settings.
ABINIT ABINIT is a package for quantum mechanics
calculations whose main program allows one to find the total
energy, charge density and electronic structure of systems
made of electrons and nuclei (molecules and periodic solids)
within Density Functional Theory (DFT), using
pseudo-potentials and a planewave basis. The results for a
small Si−SiO2 system [7] with 43 atoms, 126 bands, 48728
plane waves and a 61x61x256 FFT grid are shown in Table 2.
In this benchmark the Opteron system shows a 5% advantage
over the Woodcrest system. This result seems to be mainly
influenced by the speed of the memory subsystem.
             AMD Cluster   Intel Cluster
Time in s    1,384.6       1,454.2
Table 2. Results of ABINIT Benchmark on 32 Cores
Figure 13. Results of ApoA1 Benchmark (NAMD results for
16 nodes: running time in s over the number of cores, 16, 32
and 64, for the Opteron system and the Woodcrest system)
NAMD NAMD is a parallel molecular dynamics code designed
for high-performance simulation of large biomolecular
systems. Based on Charm++ parallel objects [14], which
provide adaptive overlap of communication and computation
across modules, NAMD scales to hundreds of processors [26].
For benchmarking we used the ApoA1 test case [22], which
simulates a complex system of 92,224 atoms and is therefore
a good estimate of the performance of a long production
simulation. We benchmarked this test case on 16 nodes with
varying numbers of processor cores, as shown in Figure 13.
One can clearly see the good scaling behavior of the
application when adding more cores per node. This also means
that the memory subsystem plays no primary role. Due to the
overlap of communication and computation, the MPI
implementation has no major impact either. Ultimately, the
computational throughput of the processor is the decisive
factor, and thus the Intel Woodcrest exhibits the best
results.
4 Conclusions
In this paper we presented our design decisions for a
2048 processor core cluster using the InfiniBand high-speed
interconnect and the Lustre parallel filesystem. We showed
that finding a balanced system for a limited budget is a chal-
lenging task.
We presented benchmark results using micro-benchmarks and
real-world applications. With the STREAM memory bandwidth
benchmark the AMD Opteron can outperform an Intel Woodcrest
system by a factor of 2. Taking the HPL (maximum
floating-point performance) into account, the situation is
exactly reversed. Comparing the running times of the two
application test cases did not yield a clear winner either.
The answer to the question of which architecture is better
suited is: it depends. For the system we purchased, the
accumulated benchmark results were almost the same for the
Intel and AMD architectures. We chose IBM because they
offered the better overall system approach.
The Lustre parallel filesystem on our storage system
delivers 3.2 GiB/s write and 2.6 GiB/s read bandwidth,
measured with IOR on 96 nodes. Under load the metadata
server occasionally shows stability issues, but we believe
that these issues will disappear once the latest Lustre
version and the latest OFED stack are installed.
Figure 14. View on the CHiC (1 row)
5 Future Work
To further enhance the performance of the cluster we will
work on several software components of the system. First of
all, we are trying to shrink the node image from currently
300MB to only 50MB to increase the memory available to
applications. For better configurability we are planning a
new implementation of the qsub command using the Python
interface. To accelerate I/O to the Lustre parallel
filesystem we intend to create a kind of hierarchical
storage management. To this end we are investigating the use
of a Lustre filesystem in a RAM disk. Another small project
will be the use of the graphics accelerator cards in the
12 visualization nodes for scientific computing. Our work on
MPI implementations and InfiniBand, which is one of the
topics of our research group, will bring further
optimizations to the system.
References
[1] Forschung mit Profil. [http://archiv.tu-
chemnitz.de/pub/2005/0148/data/05 TU SonderheftBS.pdf],
ISSN 0946–1817.
[2] R. Dimitrov and A. Skjellum. Impact of Latency on Ap-
plications’ Performance. In Proceedings of the Fourth MPI
Developer’s and User’s Conference, March 2000.
[3] E. Ford, B. Elkin, S. Denham, B. Khoo, M. Bohnsack,
C. Turcksin, and L. Ferreira. IBM Redbook, ISBN
0738426776, 2002.
[4] J. L. Furlani and P. W. Osel. Abstract Yourself with Mod-
ules. In Proceedings of the Tenth Large Installation Systems
Administration Conference (LISA ’96), 1996.
[5] R. Grabner, F. Mietke, and W. Rehm. An MPICH2 Channel
Device Implementation over VAPI on InfiniBand. In
Proceedings of Workshop on Communication Architecture for
Clusters (CAC’04) held in conjunction with IPDPS, 2004.
[6] R. Hedges, B. Loewe, T. McLarty, and C. Morrone. Paral-
lel File System Testing for the Lunatic Fringe: the Care and
Feeding of Restless I/O Power Users. In Proceedings of the
22nd IEEE/13th NASA Goddard Conference on Mass Stor-
age Systems and Technologies (MSST 2005). IEEE Com-
puter Society, 2005.
[7] T. Hoefler, R. Janisch, and W. Rehm. Parallel scaling
of Teter’s minimization for Ab Initio calculations. In
Proceedings of International Conference for High Per-
formance Computing, Networking, Storage and Analysis
(SC06), 2006.
[8] T. Hoefler, T. Mehlan, F. Mietke, and W. Rehm. Fast Bar-
rier Synchronization for InfiniBand. In Proceedings of the
20th IEEE International Parallel and Distributed Process-
ing Symposium (IPDPS), 4 2006.
[9] T. Hoefler and W. Rehm. A Communication Model
for Small Messages with InfiniBand. In Proceedings
of the Parallel-Algorithmen, -Rechnerstrukturen und -
Systemsoftware (PARS) Workshop 2005. ISSN 0177-0454.
[10] T. Hoefler, C. Siebert, and W. Rehm. A practically constant-
time MPI Broadcast Algorithm for large-scale InfiniBand
Clusters with Multicast. In Proceedings of the 21st IEEE In-
ternational Parallel and Distributed Processing Symposium,
page 232. IEEE Computer Society, 03 2007.
[11] InfiniBand Trade Association. InfiniBand Architecture Spec-
ification Release, 1.2, volume 1 edition, October 2004.
[12] InfiniBand Trade Association. InfiniBand Architecture Spec-
ification Release, 1.2, volume 2 edition, October 2006.
[13] Intel GmbH. Intel MPI Benchmarks, Users Guide and
Methodology Description. Intel GmbH, D-50321 Brühl,
Germany, 2006.
[14] L. V. Kale and S. Krishnan. Charm++: Parallel Program-
ming with message-driven objects. MIT Press, 1996.
[15] M. Knapp. Evaluierung Paralleler Dateisysteme unter
Linux, 2006. Seminar Paper, Computer Architecture Group,
Chemnitz University of Technology.
[16] G. M. Kurtzer. Warewulf: The Cluster Node
Management Solution. Technical Presentation
at Supercomputing Conference 2003 in Phoenix,
http://scs.lbl.gov/html/reports/warewulf-SC2003.pdf.
[17] R. W. Lucke. Building Clustered Linux Systems. Pear-
son Education Inc., Upper Saddle River, New Jersey 07458,
2004.
[18] P. Luszczek, J. Dongarra, D. Koester, R. Rabenseifner,
B. Lucas, J. Kepner, J. McCalpin, D. Bailey, and D. Taka-
hashi. Introduction to the HPC Challenge Benchmark Suite,
March 2005. [http://icl.cs.utk.edu/hpcc].
[19] J. D. McCalpin. Memory Bandwidth and Machine Bal-
ance in Current High Performance Computers. IEEE Com-
puter Society Technical Committee on Computer Architec-
ture (TCCA) Newsletter, December 1995.
[20] F. Mietke, D. Dunger, T. Mehlan, T. Hoefler, and W. Rehm.
A native InfiniBand Transporter for MySQL Cluster. In Pro-
ceedings of the 2nd Workshop Kommunikation in Cluster-
rechnern und Clusterverbundsystemen (KiCC’07), 2007.
[21] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary.
HPL - A Portable Implementation of the High-
Performance Linpack Benchmark for Distributed-
Memory Computers, January 2004. version 1.0a,
[http://www.netlib.org/benchmark/hpl].
[22] J. C. Phillips, W. Wriggers, Z. Li, A. Jonas, and K. Schul-
ten. Predicting the structure of apolipoprotein A-I in recon-
stituted high density lipoprotein disks. Biophysical Journal,
73, 1997.
[23] R. Rabenseifner, A. E. Koniges, J.-P. Prost, and R. Hedges.
The Parallel Effective I/O Bandwidth Benchmark: b_eff_io.
Calculateurs Parallèles Journal on Parallel I/O for Cluster
Computing, Special Issue, February 2004.
[24] R. Rex, F. Mietke, C. Raisch, H.-N. Nguyen, and W. Rehm.
Improving Communication Performance on InfiniBand by
Using Efficient Data Placement Strategies. In Proceed-
ings, International Conference on Cluster Computing (Clus-
ter 2006), 2006.
[25] F. Schmuck and R. Haskin. GPFS: A Shared-Disk File Sys-
tem for Large Computing Clusters. In Proceedings of the
First Conference on File and Storage Technologies (FAST),
pages 231–244, 2002.
[26] K. Schulten, J. C. Phillips, L. V. Kale, and A. Bhatele.
Biomolecular modelling in the era of petascale computing.
Chapman and Hall/CRC Press, Taylor and Francis Group,
New York, 2008. In press.
[27] J. D. Sloan. High Performance Linux Clusters with OS-
CAR, Rocks, openMosix, and MPI. O’Reilly Media Inc.,
Sebastopol, California 95472, 2005.
[28] F. Wang, Q. Xin, B. Hong, S. A. Brandt, E. L. Miller,
and D. D. E. Long. File System Workload Analysis for
Large Scale Scientific Computing Applications. In Proceed-
ings of the 21st IEEE/12th NASA Goddard Conference on
Mass Storage Systems and Technologies (MSST 2004). IEEE
Computer Society, 2004.
Scheduling in Heterogeneous Environments Using a Genetic Algorithm
(Planungsverfahren in heterogenen Umgebungen mit Hilfe eines
Genetischen Algorithmus)
Silke Schuch, Rodolfo Bamberg, Thomas Bemmerl
Lehrstuhl für Betriebssysteme (Chair for Operating Systems)
RWTH Aachen
Kopernikusstrasse 16, 52056 Aachen, Germany
{schuch, bamberg, bemmerl}@lfbs.rwth-aachen.de
Abstract
Many different hardware architectures exist for executing
parallel applications. One MPI implementation that focuses
on supporting many different platforms and interconnects is
MP-MPICH, developed at the Chair for Operating Systems of
RWTH Aachen. To start an MPI application on a heterogeneous
system, the program MP-Cluma was developed at the same
chair. MP-Cluma supports starting the processes of an MPI
application on all platforms supported by MP-MPICH.
This article describes the requirements that such a
heterogeneous environment places on a start program for an
MPI application. Furthermore, approaches are presented by
which scheduling systems support such environments. Finally,
the scheduling approach used in MP-Cluma, which is based on
a genetic algorithm, is introduced.
1. Introduction
A compute cluster is a set of coupled computers that work
together on one task. These computers may also be
off-the-shelf personal computers. The pioneer of clusters
built from off-the-shelf PCs is the Beowulf cluster
presented by D. Becker and T. Sterling in 1995 [11].
The literature often distinguishes between a Cluster of
Workstations (COW) and a Network of Workstations (NOW). A
COW is a group of mostly uniform computers that are placed
physically close to each other and usually have a fast
interconnect. Their main purpose is to serve as cluster
nodes.
A NOW consists of workstations in the local network that are
additionally used to compute parallel applications. If
regular workstations (NOW) are used as compute nodes in a
parallel environment in addition to dedicated clusters
(COW), such an environment is called an Office Grid in this
article.
1.1 Office Grid
Figure 1. Example of an Office Grid
The distinguishing feature of the Office Grid is its
increased heterogeneity. Not only different platforms are
used, but also different interconnects.
The MP-MPICH project [1] makes it possible to run MPI
applications on an Office Grid. MP-Cluma offers the user a
convenient way to start all processes belonging to an MPI
application. As this article shows, starting an MPI
application is not trivial. The particular properties of the
processes belonging to the application must be taken into
account, as must the platforms used and the structure of the
Office Grid. To offer the user the greatest possible
convenience, the process start is optionally handled by a
scheduler that works with a genetic algorithm.
This article is structured as follows: Section 2 briefly
discusses the particular properties of processes that belong
to an MPI application. Section 3 then briefly explains the
structure of the cluster management tool MP-Cluma [10]. The
structure of the scheduler developed for MP-Cluma, together
with test results and possibilities for optimization, is
presented in Section 4. Section 5 gives a summary and an
outlook on future developments.
2. MPI and MP-MPICH
The Message Passing Interface (MPI) [3] is a standard for
implementing communication by message passing. An MPI
application consists of several processes that communicate
with each other. A process is identified by its so-called
rank. The process with rank zero is often called the master
process, because the application programmer often assigns
special tasks to it.
MP-MPICH provides various MPI libraries for computers
running Unix/Linux or Windows operating systems. Support for
particular platforms and networks is provided by so-called
devices. The ch_smi device [13] supports clusters with an
SCI interconnect based on x86 and SPARC as well as the
operating systems Linux, Solaris and Windows. The ch_wsock2
device [9] supports x86-based computers with a Windows
operating system that communicate via TCP/IP through the
Windows Sockets 2 interface. In addition, ch_wsock2 supports
communication via shared memory when several processes are
located on the same computer. The newest device of the Chair
for Operating Systems is ch_usock. It is a pure TCP/IP
device that also supports computers running Linux. With
ch_usock it is also possible to run the processes of an MPI
application in mixed operation on both Linux and Windows
computers.
When the MPI application is started, the matching device
must be chosen for the respective platform. In addition, the
command-line parameters of the individual MPI processes
depend on the chosen device.
3. MP-Cluma
MP-Cluma is a distributed application, as shown in Figure 2.
The communication between the individual parts is carried
out using CORBA [6].
Figure 2. Schematic structure of MP-Cluma
On every compute node runs a program that enables MPI
processes to be started on this node. This program
additionally offers administrative functions and transmits
information about the compute node.
The manager system maintains information about the available
compute nodes and the users of the system. It consists of a
manager program written in Java and two CORBA services: the
naming service [5], in which the manager and the compute
nodes register themselves, and the event service [4],
through which events can be triggered between the
components.
The manager program, or manager for short, holds information
about all available compute nodes. The compute nodes are
managed in a hierarchical structure. The corresponding
information is stored in a configuration file on each
compute node. In this way MP-Cluma can represent the
physical structure of the compute nodes within the Office
Grid.
The manager also administers the rights of all users of the
system. Thus the access of users to particular compute nodes
can be restricted. In addition, password-protected
information about the operating system profiles is stored
for every user. A user logs in to the system with the
MP-Cluma account data; the manager selects the
operating-system-specific user profile according to the
chosen compute nodes and uses it for the process start.
The user can start the MPI application through a dedicated
graphical frontend or through a dedicated mpiexec. Within
the graphical frontend the user sees all available compute
nodes in a tree structure that corresponds to the physical
structure of the computers. For example, all computers of a
cluster can be grouped together. The path /DE/LFBS/P3 would
contain all computers of the chair's Pentium 3 cluster, the
path /DE/LFBS/P4 all computers of the P4 cluster.
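The hierarchical grouping can be pictured as a lookup by path prefix. The following sketch is only an illustration under stated assumptions: the host names, the dictionary-based registry and the function name are hypothetical and do not describe the actual MP-Cluma data structures.

```python
def nodes_under(path: str, registry: dict) -> list:
    """Return all node names whose group path lies at or below `path`."""
    prefix = path.rstrip("/") + "/"
    return sorted(
        name for name, node_path in registry.items()
        if (node_path + "/").startswith(prefix)
    )


# Hypothetical registry: node name -> group path as it might appear in the
# per-node configuration file.
registry = {
    "p3-01": "/DE/LFBS/P3",
    "p3-02": "/DE/LFBS/P3",
    "p4-01": "/DE/LFBS/P4",
}

print(nodes_under("/DE/LFBS/P3", registry))  # the Pentium 3 cluster only
print(nodes_under("/DE/LFBS", registry))     # every node of the chair
```

Comparing complete path components (note the trailing slash) avoids accidental matches between groups whose names merely share a prefix.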
3.1. Application Start
To start an MPI application, the user selects a group of
computers. In addition to the hierarchical information,
information about the operating system and the hardware,
such as clock rate and number of processors, is available
for this selection. It is also shown whether users are
currently logged in on a workstation. Furthermore, the user
can have MP-Cluma display which processes are currently
running on the possible target nodes in order to estimate
the current load. The maximum number of MPI processes is
determined by specifying a number of slots. This number is
displayed to the user, as is the current number of free
slots per compute node.
For every operating system involved, the user selects a
suitable executable file with optional start parameters and
a suitable device. It is also possible to choose these
values individually for each node. The user then specifies a
minimum and a maximum number of processes with which the
application should run. If the user has the corresponding
rights, the application can be started immediately without
any scheduling taking place. Normally, this description is
passed to the scheduler of MP-Cluma, which is part of the
manager. The criteria by which the start time and the
computers for the application start are selected are
explained in Section 4.
4. Scheduling
In general, a scheduler plans the execution of a job
depending on other jobs and available resources. In the case
of MP-Cluma, a job consists of the quasi-simultaneous start
of all processes that belong to an MPI application. We
assume that these MPI applications are self-contained and do
not depend on each other. Therefore the order of the jobs is
arbitrary. Different jobs, however, compete for the
so-called slots, that is, for the available “free places” on
the compute nodes. It must be taken into account that the
group of target nodes as well as the minimum and maximum
number of processes are specified by the user.
The scheduler must therefore specify, for every job, a
certain number and set of nodes among all compute nodes on
which the process start is to take place. For a set of jobs,
an execution order is then determined in which the MPI
applications run on the nodes. Determining such an execution
order is called scheduling, and the execution order itself
is called a schedule.
The term valid schedule denotes an execution order in which
every job has all required resources available to it during
its runtime. To determine an execution order, however,
knowledge of the runtime of a job is important. In an Office
Grid, the heterogeneity of the environment makes it
difficult to estimate the runtime. The flexible number of
processes makes this determination even harder. Therefore a
runtime estimate provided by the user serves as the basis
for the runtime of an application.
Based on the specified range of the process count and the
heterogeneity of the chosen group of compute nodes, the user
can assess how heterogeneous the job is. For instance, a job
that runs on 3 or 4 nodes of the P4 cluster is considerably
more homogeneous than a job that is to be started with 2 to
20 processes on all available workstations of the institute.
Based on this, MP-Cluma offers two ways of handling the
user's runtime estimate.
The user may already have started the MPI application once
or several times in a similar configuration and can
therefore give a good runtime estimate. In this case the
user's estimate is adopted unmodified. However, a record is
kept of how accurate the estimate was. If a user repeatedly
states too short a runtime for an application, a “penalty
time” is added to the estimates of future applications. This
is intended to prevent users from gaining an advantage in
scheduling by stating runtimes that are too short.
If the job is heterogeneous, the user can state the runtime
estimate for a single process on the basis of the slowest
compute node. In this case the manager modifies the estimate
according to the number of compute nodes actually made
available to the job.
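The two ways of handling the runtime estimate could be sketched as follows. The text gives neither the size of the penalty, nor the number of tolerated underestimates, nor the rule used to scale a single-process estimate; the constants below and the ideal linear speedup are therefore assumptions chosen purely for illustration, and all names are hypothetical.

```python
from dataclasses import dataclass

PENALTY = 10.0   # ASSUMPTION: fixed penalty time added to future estimates
TOLERANCE = 2    # ASSUMPTION: underestimates tolerated before the penalty applies


@dataclass
class UserRecord:
    underestimates: int = 0  # how often the real runtime exceeded the estimate


def record_run(user: UserRecord, estimated: float, actual: float) -> None:
    """Keep track of how accurate the user's estimate was."""
    if actual > estimated:
        user.underestimates += 1


def planned_runtime(user: UserRecord, estimate: float,
                    heterogeneous: bool = False, nodes_granted: int = 1) -> float:
    """Runtime the scheduler plans with for one job."""
    if heterogeneous:
        # The estimate refers to one process on the slowest node.
        # ASSUMPTION: ideal linear speedup over the granted nodes.
        return estimate / nodes_granted
    if user.underestimates >= TOLERANCE:
        return estimate + PENALTY
    return estimate


user = UserRecord()
print(planned_runtime(user, 100.0))                    # estimate used unmodified
record_run(user, estimated=100.0, actual=130.0)
record_run(user, estimated=100.0, actual=120.0)
print(planned_runtime(user, 100.0))                    # penalty time added
print(planned_runtime(user, 400.0, heterogeneous=True, nodes_granted=4))
```

A real implementation would probably weight the nodes by their speed instead of assuming a linear speedup; the sketch only shows where the two cases diverge.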
An optimal schedule is a schedule that is optimized with
respect to certain criteria. Criteria can be, for example,
an even utilization of the resources or a minimal total
runtime of all jobs. The problem of determining an optimal
schedule is NP-complete [2]. One approach to solving the
problem is the use of a genetic algorithm (GA). In addition
to several approaches that use GAs in homogeneous
environments, there are also solutions for heterogeneous
environments [7, 12, 8, 14].
The aforementioned works mostly schedule tasks that depend
on each other, and heterogeneity refers only to different
execution speeds. For example, the algorithm presented in
[7] represents a job by the MFLOPs it requires. All
available resources are likewise characterized by their
speed in MFLOPs per second. This characterization is not
sufficient for MP-Cluma, because stronger restrictions exist
here with respect to possible target computers or target
platforms.
4.1. The Genetic Algorithm in MP-Cluma
A GA is used to evolve an optimal schedule from a set of
possible schedules. The method is modeled on the
evolutionary techniques of nature. From a pool of
chromosomes, new chromosomes are created through
modifications. From these, the pool for the next iteration
step is determined by applying certain quality criteria.
Within the genetic algorithm, a chromosome represents one
possible schedule.
Figure 3. Example of a valid schedule
To create the initial pool, a number of different valid
schedules are therefore required. In the case of MP-Cluma,
not only the order of the jobs but also the actual number of
processes within the given range is variable. At the same
time, the requirements for a valid schedule are particularly
strict, because the processes of a job may not use the
complete pool of computers but only nodes within the group
specified by the user. Under our constraints, a randomly
generated schedule would be invalid with high probability.
Therefore we do not create chromosomes with an arbitrary
schedule and test them for validity afterwards; instead we
exclusively use chromosomes that represent a valid schedule.
Figure 3 shows a chromosome that contains a valid schedule.
A valid schedule is created by first choosing a random order
of the jobs and then, job by job, choosing a random number
of processes between the given minimum and maximum. Then it
is tested whether the chosen nodes have free time slots. For
Job 1 no nodes are occupied yet, so it can be placed at the
front of the queues of nodes 1 and 2. Job 2 is to be
executed on only one node; the earliest start time is time
zero on node 3. Job 3 is executed after Job 1 on node 1.
Since Job 4 wants to occupy all three nodes, the start time
for the processes of Job 4 is time 5 on all three nodes. On
node 2 there are now three free time units starting at
time 2, and on node 3 there are two free time units starting
at time 3. Job 5 can now be placed in these free areas on
nodes 2 and 3.
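The placement walk-through above can be reproduced with a small sketch that assigns each job the earliest time at which all of its chosen nodes are free at once, filling gaps where possible. This is not the MP-Cluma implementation: the function names are hypothetical, the nodes of each job are taken as already chosen, the durations of Jobs 1 to 3 are inferred from the free time units mentioned in the text, and the durations of Jobs 4 and 5 are assumed.

```python
def earliest_start(busy, nodes, duration):
    """Earliest time at which all `nodes` are free for `duration` time units.

    `busy` maps a node id to a list of reserved (start, end) intervals.
    """
    # The earliest feasible start is either 0 or the end of a reservation.
    candidates = sorted({0} | {end for n in nodes for _, end in busy[n]})
    for t in candidates:
        if all(end <= t or start >= t + duration
               for n in nodes for start, end in busy[n]):
            return t
    raise AssertionError("unreachable: the latest end time is always free")


def build_schedule(jobs, node_ids):
    """Place the jobs in the given order; returns {job_id: start_time}."""
    busy = {n: [] for n in node_ids}
    starts = {}
    for job_id, nodes, duration in jobs:
        t = earliest_start(busy, nodes, duration)
        for n in nodes:
            busy[n].append((t, t + duration))
        starts[job_id] = t
    return starts


# (job id, chosen nodes, duration); durations of Jobs 4 and 5 are ASSUMED.
jobs = [
    (1, [1, 2], 2),
    (2, [3], 3),
    (3, [1], 3),
    (4, [1, 2, 3], 4),
    (5, [2, 3], 2),
]
starts = build_schedule(jobs, node_ids=[1, 2, 3])
print(starts)
```

With these values Job 4 starts at time 5 and Job 5 is placed in the gap beginning at time 3, matching the description. Because every job is only ever placed into free time on its own nodes, each schedule built this way is valid by construction.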
Figure 4. Two random chromosomes
The set of chromosomes with valid schedules generated in
this way forms the initial pool of the genetic algorithm.
The operations performed on these chromosomes are mutation
and swapping, which are explained in the following.
4.1.1 Mutation
Figure 5. Chromosomes after the swap operation
During mutation, two properties of the chromosome are changed. First, a
job within the chromosome is selected at random and its number of target
processes is changed; second, the order of the jobs is modified. A new
valid schedule is then built from the original schedule and the modified
job.
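A minimal sketch of the mutation step, under stated assumptions: a chromosome is encoded here as an ordered list of (job, process count) pairs, the order is modified by swapping two random positions, and the function name `mutate` is hypothetical. Rebuilding the start times of the new valid schedule is left out.

```python
import random

def mutate(chromosome, limits, rng):
    """chromosome: list of (job, process_count); limits: {job: (min_procs, max_procs)}."""
    mutated = list(chromosome)
    # 1) Pick one job at random and redraw its process count within its limits.
    i = rng.randrange(len(mutated))
    job, _ = mutated[i]
    lo, hi = limits[job]
    mutated[i] = (job, rng.randint(lo, hi))
    # 2) Modify the job order; here (an assumption) by swapping two random positions.
    a, b = rng.sample(range(len(mutated)), 2)
    mutated[a], mutated[b] = mutated[b], mutated[a]
    return mutated
```

The input chromosome is copied rather than changed in place, so the original can stay in the pool next to its mutated variant.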
4.1.2 Swapping
In a swap, two chromosomes are chosen at random, and from these a job is
selected at random. The target nodes of this job are then exchanged
between the two chromosomes. The order of the jobs remains unchanged, but
the start times of the subsequent jobs may shift.
An example of the swap operation is shown in Figures 4 and 5. Figure 4
shows the original chromosomes that were randomly selected for the swap.
Job 2 is now selected to be exchanged between the chromosomes. In the
first chromosome job 2 occupies nodes n2 and n3; in the second chromosome
it occupies all nodes. After the swap, job 2 occupies all nodes in the
first chromosome and only nodes n2 and n3 in the second chromosome. The
result is shown in Figure 5.
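The swap can be sketched as follows, assuming a chromosome is an ordered list of (job, target nodes) pairs. The function name `swap_job_nodes` and the optional `job` argument are hypothetical, and recomputing the shifted start times is omitted.

```python
import random

def swap_job_nodes(chrom_a, chrom_b, rng, job=None):
    """Exchange the target nodes of one job between two chromosomes."""
    if job is None:
        job = rng.choice([name for name, _ in chrom_a])   # random job
    nodes_a = dict(chrom_a)[job]
    nodes_b = dict(chrom_b)[job]
    # The job order of both chromosomes is left untouched.
    new_a = [(n, nodes_b if n == job else nodes) for n, nodes in chrom_a]
    new_b = [(n, nodes_a if n == job else nodes) for n, nodes in chrom_b]
    return new_a, new_b
```

Since the target nodes of a job lie within that job's allowed group in both chromosomes, exchanging them keeps the group constraint satisfied.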
4.1.3 Test and Tuning
In the following, the scheduling of four different jobs is shown as an
example in order to illustrate how the genetic algorithm works:
• Job 1: execution time: 100, max. processes: 2, min. processes: 2,
possible target nodes: n1, n2, n3 and n4.
• Job 2: execution time: 200, max. processes: 3, min. processes: 2,
possible target nodes: n2, n3 and n4.
• Job 3: execution time: 300, max. processes: 2, min. processes: 1,
possible target nodes: n1, n3 and n4.
• Job 4: execution time: 200, max. processes: 4, min. processes: 1,
possible target nodes: n1, n2, n3 and n4.
Since the initial pool already consists of chromosomes that contain a
valid schedule, the genetic algorithm serves only for optimization.
Because there are no chromosomes in the pool that represent an invalid
schedule, we can keep the pool small. To determine the influence of the
various parameters on the quality of the result, four differently
configured algorithms are presented below. It will turn out that the
quality of the final result increases when the number of chromosome
modifications per iteration step increases. In all cases a pool of 3
chromosomes is considered.
In the first approach, five iterations are performed, each with one
mutation and one swap operation. Figure 6 shows four independent results,
each corresponding to one run of the algorithm. The best schedule has a
total runtime of 201 time units, the worst a total runtime of 291 time
units. The variation in the total runtime of the resulting schedules
shows that this number of iterations, mutations and swap operations is
not sufficient to generate good results reliably.
Next, two cases are considered that increase the accuracy, and with it
the computational effort, in different ways compared to the previous
example. In the first case only the number of iterations is raised from 5
to 50. In the second case the number of mutations and swap operations per
iteration step is raised, but only 5 iterations take place.
Figure 7 shows that the larger number of iterations improves the
runtimes. The best schedule has a total runtime of 191 time units, the
worst a total runtime of 225 time units, so the runtimes vary only within
limits.
The results of the four runs with few iterations but larger changes to
the chromosomes are shown in Figure 8. Here the final results are two
schedules each with a runtime of 191 time units and 200 time units. The
better results of these runs, and above all the small variation of the
runtimes, show that a higher number of mutations and swap operations is
more efficient than using more iteration steps. The reason is that with
more iterations the chromosomes are improved progressively in smaller
steps. In contrast, with several mutations and swaps, larger variations
are considered in each iteration.
Figure 6. Four final results after 5 iterations with one mutation and one
swap operation each
Figure 7. Four final results after 50 iterations with one mutation and
one swap operation each
Figure 8. Four final results after 5 iterations with several mutations
and swap operations
Figure 9. Four final results after 50 iterations with several mutations
and swap operations
To check whether a longer runtime improves the result further, the
algorithm with several changes per iteration step is now also run with 50
iteration steps. Figure 9 shows the results for this case. All final
results contain a schedule with a runtime of 191 time units.
It can be seen that the algorithm reaches an optimal runtime for the
schedule after a certain number of iterations. In Figure 8 this is
already the case after 5 iterations in two of the runs. Raising the
number to 50 iterations brought no substantial advantage here.
The runtime of all examples above is in the range of seconds. Both
increasing the complexity of an iteration step and increasing the number
of iteration steps by a factor of ten each led to a tenfold increase in
runtime.
To use resources efficiently, it is important to be able to determine the
parameters of the algorithm optimally. Since the number of jobs with
their specific constraints and the available compute nodes can change at
runtime, it may be necessary to adapt the parameters dynamically to the
changed situation. Further attention must be paid to how the runtimes
develop as the complexity of the system grows.
5. Summary and Outlook
In this article we have presented the successful integration of a
scheduler into MP-Cluma. Although the constraints on a valid schedule are
restrictive, the genetic algorithm delivers very good results even with a
small chromosome pool and few iterations. The next task is to determine
parameters that are as close to optimal as possible for the chromosome
pool, the number of iterations and the strength of the modification,
depending on the number and structure of the jobs and on the number of
available machines.
As a first step, we therefore plan to make these parameters adjustable at
runtime. A corresponding setting can be built into the existing
administration program of MP-Cluma. Ideally, the scheduler itself would
adapt the parameters automatically to the external conditions.
Furthermore, it would be interesting to build additional scheduling
mechanisms into MP-Cluma and to compare their results under various
constraints with those of the genetic algorithm.
Literatur
[1] Chair for Operating Systems, RWTH-Aachen, University.
MP-MPICH – User Documentation & Technical Notes.
[2] M. R. Garey and D. S. Johnson. Computers and Intracta-
bility : A Guide to the Theory of NP-Completeness (Series
of Books in the Mathematical Sciences). W. H. Freeman,
January 1979.
[3] MPI Forum. MPI: A Message-Passing Interface Stan-
dard. International Journal of Supercomputing Applicati-
ons, 1994.
[4] OMG Technical Document formal/01-03-01. Event Service
Specification, 1.1 edition, 2001.
[5] OMG Technical Document formal/04-10-03. Naming Ser-
vice Specification, 1.3 edition, 2004.
[6] OMG Technical Document orbos/98-05-05. CORBA Mes-
saging Specification, 1998.
[7] A. J. Page and T. J. Naughton. Dynamic task scheduling
using genetic algorithms for heterogeneous distributed com-
puting. In IPDPS ’05: Proceedings of the 19th IEEE In-
ternational Parallel and Distributed Processing Symposium
(IPDPS’05) - Workshop 6, page 189.1, Washington, DC,
USA, 2005. IEEE Computer Society.
[8] A. J. Page and T. J. Naughton. Framework for task schedu-
ling in heterogeneous distributed computing using genetic
algorithms. Artif. Intell. Rev., 24(3-4):415–429, 2005.
[9] S. Schuch, C. Clauss, T. Arens, S. Lankes, and T. Bemmerl.
Entwurf und Implementierung einer Windows-spezifischen,
TCP/IP-basierten Gera¨teschnittstelle fu¨r MP-MPICH. In Ta-
gungsband zum Workshop KiCC05, TU Chemnitz, Germa-
ny, November 2005.
[10] S. Schuch and M. Po¨ppe. MP-Cluma - A CORBA Ba-
sed Cluster Management Tool. In Proceedings of the In-
ternational Conference on Parallel and Distributed Proces-
sing Techniques and Applications (PDPTA 2004), Las Ve-
gas, USA, June 2004.
[11] T. Sterling, D. Savarese, D. J. Becker, J. E. Dorband, U. A.
Ranawake, and C. V. Packer. BEOWULF: A parallel work-
station for scientific computation. In Proceedings of the
24th International Conference on Parallel Processing, pa-
ges I:11–14, Oconomowoc, WI, 1995.
[12] L. Wang, H. J. Siegel, V. R. Roychowdhury, and A. A.
Maciejewski. Task matching and scheduling in heteroge-
neous computing environments using a genetic-algorithm-
based approach. J. Parallel Distrib. Comput., 47(1):8–22,
1997.
[13] J. Worringen. Effizienter Nachrichtenaustausch auf spei-
chergekoppelten Rechnerverbundsystemen mit SCI Verbin-
dungsnetz. PhD thesis, RWTH Aachen, 2003.
[14] H. Yu, D. C. Marinescu, A. S. Wu, and H. J. Siegel. A ge-
netic approach to planning in heterogeneous computing en-
vironments. ipdps, 00:97a, 2003.
First Experiences with Intel Cluster OpenMP
Christian Terboven Dieter an Mey Dirk Schmidl
Marcus Wagner
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, 52074 Aachen, Germany
{terboven, anmey, schmidl, wagner}@rz.rwth-aachen.de
Abstract
MPI and OpenMP are the de-facto standards for distributed-memory and
shared-memory parallelization, respectively. By employing a hybrid
approach, that is, combining OpenMP and MPI parallelization in one
program, a cluster of SMP systems can be exploited. Nevertheless, mixing
programming paradigms and writing explicit message passing code might
increase the parallel program development time significantly. Intel
Cluster OpenMP is the first commercially available OpenMP implementation
for a cluster, combining the ease of use of the OpenMP parallelization
paradigm with the cost efficiency of a commodity cluster. In this paper
we present our first experiences with Intel Cluster OpenMP.
1 Introduction
The main advantage of shared-memory parallelization
with OpenMP [2] over MPI [1] is that data can be accessed
by all instruction streams without reasoning whether it must
be transferred beforehand. This allows for an incremental
parallelization approach and leads to shorter parallel pro-
gram development time. Complicated dynamic data struc-
tures and irregular and possibly changing data access pat-
terns make programming in MPI more difficult, whereas
the level of complexity introduced by shared-memory paral-
lelization is lower in many cases. As OpenMP is a directive-
based language, the original serial program can stay in-
tact, which is an advantage over other shared-memory par-
allelization paradigms.
The downside of OpenMP, and of other shared-memory paradigms as well, is
that the resulting parallel programs are normally restricted to executing
in a single address space. Bus-based multi-processor machines typically
do not scale well beyond four processors for memory-intensive
applications. Larger SMP and ccNUMA systems require scalable and thus
expensive interconnects. Because of that, several attempts to bring
OpenMP to clusters have been made in the past.
In [8] an OpenMP implementation for the TreadMarks
software has been presented, which supports only a subset
of the OpenMP standard. In [10] an OpenMP implemen-
tation on top of the page-based distributed shared-memory
(DSM) system SCASH has been presented for the Omni
source-to-source translator. In this approach, all accesses
to global variables are replaced by accesses into the DSM
and all shared data is controlled by the DSM. Although the
full OpenMP specification is implemented, support for the
C++ programming language is missing. In 2006, Intel made
the first commercial implementation of OpenMP for clus-
ters available, named Intel Cluster OpenMP [6], sometimes
referred to as ClOMP in this paper. The full OpenMP 2.5
standard for Fortran, C and C++ is implemented, although
nested parallel regions are not yet supported.
This paper is organized as follows: In section 2 we give
an overview of OpenMP and Intel Cluster OpenMP. In sec-
tion 3 we present micro-benchmark [3, 9] measurements of
OpenMP and ClOMP constructs and discuss which types of
applications we expect to profit from running on a cluster.
In section 4 we present results of four applications with In-
tel Cluster OpenMP. The current tool support for ClOMP is
discussed briefly in section 5. We draw our conclusions and
touch on future plans in section 6.
2 OpenMP
This section will present an overview of OpenMP and In-
tel Cluster OpenMP and discuss some aspects of the mem-
ory model of both.
2.1 Overview
OpenMP consists of a collection of compiler directives,
library functions and a few environment variables. It ap-
plies the so-called fork/join programming model, thus an
OpenMP program starts as a single thread. At the entrance
of a parallel region, additional worker threads are created,
thus forming a team of threads together with the initial
thread, which becomes the master of that team. The worker
threads are suspended at the end of the parallel region and
are ready to be reused at the next opportunity.
Unless specified otherwise, all threads execute the whole code within the
parallel region redundantly. If, for example, a loop inside a parallel
region is enclosed by an OpenMP work-sharing loop construct, the loop
iterations are distributed across the threads of the current team. The
way in which the loop iterations are distributed among the threads can be
controlled via the schedule clause. Other work-sharing constructs are
available, as are reduction operations and synchronization constructs.
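To illustrate what the schedule clause controls, the following sketch models a static schedule in plain Python. It is a model of the iteration-to-thread mapping only, not OpenMP code: the function name `static_schedule` is hypothetical, and the block size used when no chunk size is given is one possible choice, since the exact split is left to the implementation.

```python
def static_schedule(n_iters, n_threads, chunk=None):
    """Return, per thread, the list of loop iterations it executes."""
    if chunk is None:
        # One contiguous block per thread; ceiling division is one possible choice.
        chunk = max(1, -(-n_iters // n_threads))
    plan = [[] for _ in range(n_threads)]
    for first in range(0, n_iters, chunk):
        thread = (first // chunk) % n_threads    # chunks are dealt out round-robin
        plan[thread].extend(range(first, min(first + chunk, n_iters)))
    return plan
```

With a chunk size, consecutive chunks go to consecutive threads in a round-robin fashion; without one, every thread gets a single contiguous block.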
As OpenMP is a shared-memory parallelization
paradigm, all threads share a single address space, but still
can have thread local storage to hold private data. It is the
programmer’s responsibility to control the scoping, that is
the classification of variables into shared and private, of all
variables that are used within a parallel region.
At the beginning and the end of any parallel region, all
threads of a team are implicitly synchronized. At barrier
synchronization points all threads have to wait until every
team member has arrived, before any thread may continue.
If one thread modifies a shared variable while other threads read or
write the same variable without careful synchronization, a so-called data
race may occur. A data race causes the program's output to depend on the
actual interleaving of threads, which cannot be predicted. It is the
programmer's responsibility to use the synchronization constructs
provided by OpenMP to make sure that modifications of shared data are
properly visible to all threads.
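The dependence on the interleaving can be made concrete with a small deterministic model. It does not use real threads: two hypothetical threads A and B each read a shared counter and write back the value plus one, and the sketch enumerates every order in which these four steps can interleave. The names `run` and `outcomes` are illustrative.

```python
from itertools import permutations

def run(interleaving):
    """Execute read/write steps of an unsynchronized 'shared = shared + 1'."""
    shared = 0
    local = {}
    for thread, op in interleaving:
        if op == "read":
            local[thread] = shared        # thread copies the shared value
        else:
            shared = local[thread] + 1    # thread writes back its incremented copy
    return shared

def outcomes():
    steps = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
    results = set()
    for order in permutations(steps):
        # Keep only orders in which each thread reads before it writes.
        if all(order.index((t, "read")) < order.index((t, "write")) for t in "AB"):
            results.add(run(order))
    return results
```

Whenever both reads happen before the first write, one increment is lost, so the final value depends on the interleaving. Protecting the update with a synchronization construct removes these orders.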
2.2 Memory Model
OpenMP provides a relaxed memory consistency model
similar to the weak ordering memory model [5]. Each
thread has a temporary view of the memory that is not re-
quired to be consistent with the memory at all times. Writes
to memory are allowed to overlap other computation and
reads from memory are allowed to be satisfied from a local
copy of memory, under some circumstances. For example,
if within one synchronization period the same memory lo-
cation is read again, this can be done from fast local stor-
age (the temporary view, e.g. a cache). Thus, it is possible
to hide the memory latency within an OpenMP program to
some extent. This also allows Intel Cluster OpenMP to ful-
fill reads from local memory under certain circumstances,
instead of accessing remote memory in all cases, as will be
explained in the following subsection 2.3.
The flush construct of OpenMP serves as a memory synchronization
operation: it enforces consistency between the temporary view and the
memory by writing back a set of variables, or even all of a thread's
variables, to the memory. All reads and writes from and to the memory are
unordered with respect to each other (except for those being ordered by
the semantics of the base language), but ordered with respect to an
OpenMP flush operation. All OpenMP barriers also contain an implicit
flush operation.
2.3 Intel Cluster OpenMP
Beginning with version 9.1, the Intel C/C++ and Fortran
compilers for Linux are available with Cluster OpenMP.
The distributed shared-memory (DSM) system of Intel
Cluster OpenMP is based on a licensed derivative of the
TreadMarks software.
Intel has extended OpenMP with one additional direc-
tive: The sharable directive. It identifies variables that are
referenced by more than one thread and thus have to be
managed by the DSM system. While certain variables are
automatically made sharable by the compiler, some vari-
ables have to be declared sharable explicitly by the pro-
grammer, e.g. file-scope variables in C and C++. Thus,
the programmer’s responsibility for variable scoping has
been extended to finding all variables that have to be made
sharable, in the cases where the compiler was unable to de-
tect it. As will be shown in section 4, this can sometimes be
a tedious task for application codes.
For the Fortran programming language several compiler
options exist to make different kinds of variables sharable
automatically, e.g. all module or common block variables.
In addition to finding all variables that have to be declared
sharable, dynamic memory management in an application
deserves some attention. For all variable allocations from
the heap (e.g. by malloc), it has to be determined whether
the memory should be taken from the regular heap, thus be-
ing only accessible by the thread calling malloc, or from
the DSM heap, thus being accessible by all threads. Intel
Cluster OpenMP provides several routines to easily replace
native heap memory management routines by DSM heap
routines.
The task of keeping shared variables consistent across
multiple nodes is handled by the Cluster OpenMP runtime
library. Intel provides detailed information on how this pro-
cess works in the product documentation and in a white
paper [6]. In principle the mechanism relies on protect-
ing memory pages via the mprotect system call; pages
that are not fully up-to-date are protected against reading
and writing. When a program reads from such a protected
page, a segmentation fault occurs and after intercepting the
corresponding signal the runtime library requests updates
from all nodes, applies them to the page and then removes
the protection. At the next access, the instruction finds the
memory accessible and then the read will complete success-
fully. Still the page is protected against writing. In case of a
write operation, a so-called twin page is created for further
reads and writes on the accessing node, after the protection
has been removed. The twin page then becomes the thread’s
temporary view.
The higher the ratio of cheap memory accesses, that
means to thread private memory or to twin pages, versus
expensive memory accesses, the better the program will per-
form. At each synchronization construct, e.g. a barrier,
nodes receive information about pages modified by other
nodes and invalidate those. As a consequence, the next ac-
cess will be expensive.
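The bookkeeping described in this subsection can be mimicked by a toy model that counts expensive accesses. This is a sketch under strong simplifications and not Intel's implementation: the class `PageDSM` and its methods are hypothetical, no data is moved, and a write to a page that is not yet present counts as a single fault.

```python
class PageDSM:
    """Toy model of page-based DSM bookkeeping (not Intel's implementation)."""

    def __init__(self, n_nodes):
        self.state = [dict() for _ in range(n_nodes)]   # page -> "read" or "twin"
        self.dirty = [set() for _ in range(n_nodes)]    # pages written since last barrier
        self.faults = 0                                 # number of expensive accesses

    def read(self, node, page):
        if page not in self.state[node]:                # protected page: fault, fetch updates
            self.faults += 1
            self.state[node][page] = "read"

    def write(self, node, page):
        if self.state[node].get(page) != "twin":        # write-protected: fault, create twin
            self.faults += 1
            self.state[node][page] = "twin"
        self.dirty[node].add(page)

    def barrier(self):
        for node, state in enumerate(self.state):
            others = (d for i, d in enumerate(self.dirty) if i != node)
            for page in set().union(*others):
                state.pop(page, None)                   # invalidate: next access is expensive
        self.dirty = [set() for _ in self.dirty]
```

Repeated accesses to a page between two barriers cost one fault at most, whereas a page modified by another node is invalidated at each barrier and must be fetched again. This is why the ratio of cheap to expensive accesses governs performance.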
3 Micro-Benchmarks
In order to better understand the behavior of Intel Clus-
ter OpenMP’s DSM mechanism and to get an estimate of
how expensive the DSM overhead is, we created a set of
micro-benchmarks. In addition, we compared the well-
known OpenMP micro-benchmarks [3, 9] with Intel Cluster
OpenMP on two different network fabrics.
All measurements presented in this and the following
sections were carried out on a cluster of eight Dell Pow-
erEdge 1950 servers equipped with two Intel Xeon 5160
(dual-core, 3.0 GHz) CPUs. All nodes are running Scien-
tific Linux 5.0 and are connected via Gigabit Ethernet (re-
ferred to as Eth) and 4x SDR InfiniBand (referred to as IB).
The InfiniBand adapters are attached to the PCI-Express
bus. We used the Intel 10.0.025 compiler suite for 64-bit
systems.
Table 1 shows selected results of the OpenMP micro-
benchmarks for traditional OpenMP and Intel Cluster
OpenMP. The EPCC OpenMP micro-benchmarks measure
the overhead of OpenMP constructs by comparing the time
taken for a section of code executed sequentially, to the time
taken for the same code executed in parallel enclosed in a
given directive. We ported the EPCC micro-benchmarks to
Intel Cluster OpenMP by adding sharable directives, where
necessary.
It is obvious that there is a severe difference in overhead between
OpenMP and Intel Cluster OpenMP, independent of the network fabric. Thus,
the granularity of parallelism that can be exploited efficiently with
Cluster OpenMP has to be much coarser. While for a run with a single
thread only a small difference between the two network fabrics can be
observed, the overhead increase with two and four threads is
significantly lower on InfiniBand than on Ethernet. As will be seen in
section 4, application codes show the same behavior. We found that using
a fast network like InfiniBand is crucial in order to exploit application
scalability with Intel Cluster OpenMP.
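How much coarser the granularity has to be can be estimated with a short calculation. The sketch below assumes a simple model in which the construct overhead from Table 1 is added to the useful work per construct; the function name `min_work_us` and the 5% target are illustrative choices, not values from our measurements.

```python
def min_work_us(overhead_us, max_overhead_fraction):
    """Useful work per construct (in us) so that overhead / (work + overhead)
    stays at or below max_overhead_fraction."""
    return overhead_us * (1.0 - max_overhead_fraction) / max_overhead_fraction

# PARALLEL FOR overheads with 4 threads, taken from Table 1 (microseconds).
overheads = {"OpenMP": 1.12, "ClOMP, Eth": 1540.97, "ClOMP, IB": 962.52}
for name, overhead in overheads.items():
    print(f"{name}: at least {min_work_us(overhead, 0.05):.2f} us of work per loop")
```

Under this model, a parallel loop on four InfiniBand-connected nodes needs roughly 18 ms of work to keep the construct overhead at 5%, compared with about 21 us for traditional OpenMP.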
                 OpenMP   ClOMP, Eth   ClOMP, IB
PARALLEL FOR
  1 thread         0.31       478.82      482.84
  2 threads        1.00      1159.53      720.62
  4 threads        1.12      1540.97      962.52
BARRIER
  1 thread         0.01       478.24      481.37
  2 threads        0.43       738.38      589.95
  4 threads        0.60       751.61      634.64
REDUCTION
  1 thread         0.35       479.44      481.34
  2 threads        1.54      1888.25     1302.87
  4 threads        2.32      3315.19     2660.42
Table 1. Selected results (overhead in microseconds [us]) of the EPCC
OpenMP micro-benchmarks for OpenMP and Intel Cluster OpenMP, with one
thread per node.
We also implemented a few micro-benchmarks of our own specifically to
test the DSM performance, employing the same measurement approach as the
EPCC micro-benchmarks. The following are of interest here:
• testheap: A number of pages is allocated via
kmp_aligned_sharable_malloc (OpenMP: valloc), written and then freed
again. This process is repeated several times and the average runtime is
calculated.
• read_f_other: The time required to read a page allocated via the DSM by
a different thread is measured. For the Cluster OpenMP runtime that
requires transferring the page.
• write_t_other: Similar to read_f_other, but now the page allocated by a
different thread is written. For the Cluster OpenMP runtime that requires
creating a twin page.
The performance results for traditional OpenMP and In-
tel Cluster OpenMP are shown in table 2. The OpenMP
measurements were run with two threads. Both ClOMP
measurements were run with two Cluster OpenMP pro-
cesses on one and two nodes, respectively.
We experienced noteworthy variations in the results on
InfiniBand. This is due to thread creation by the Intel Clus-
ter OpenMP runtime for communication handling.
It becomes obvious that allocating dynamic memory gets more expensive. It
is considered good parallel programming practice to allocate chunks of
memory that are as large as possible (and thus to allocate as seldom as
possible) in order not to stress the operating system's memory
management. With Intel Cluster OpenMP, special care has to be taken with
dynamic data structures that involve many allocations, which may even be
hidden from the user behind an abstract interface.
               testheap   read_f_other   write_t_other
OpenMP             0.85           1.8             2.32
ClOMP, Eth
  1 node           3.81           2.74            2.44
  2 nodes         10.75         255.56          251.82
ClOMP, IB
  1 node           3.80           1.76            4.26
  2 nodes         26.33         101.34          104.54
Table 2. Selected results (two threads, overhead in microseconds [us]) of
our Cluster OpenMP micro-benchmarks.
Although Cluster OpenMP allows the programmer to access memory on other
nodes transparently, this does not come for free from a performance
perspective. Intel Cluster OpenMP can also be started with more than one
thread per node instead of multiple processes on one node. In that case,
accessing memory from a different thread on the same node is
significantly cheaper.
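The price of a remote access can be read off Table 2 by dividing the two-node result by the one-node result. The sketch below does this arithmetic; the dictionary `TABLE2_US` and the function `remote_penalty` are names introduced for the example only.

```python
# Overheads in microseconds from Table 2, keyed by (fabric, number of nodes).
TABLE2_US = {
    ("Eth", 1): {"read_f_other": 2.74, "write_t_other": 2.44},
    ("Eth", 2): {"read_f_other": 255.56, "write_t_other": 251.82},
    ("IB", 1): {"read_f_other": 1.76, "write_t_other": 4.26},
    ("IB", 2): {"read_f_other": 101.34, "write_t_other": 104.54},
}

def remote_penalty(fabric, benchmark):
    """Factor by which an access across two nodes is slower than on one node."""
    return TABLE2_US[(fabric, 2)][benchmark] / TABLE2_US[(fabric, 1)][benchmark]

for fabric in ("Eth", "IB"):
    for bench in ("read_f_other", "write_t_other"):
        print(f"{fabric} {bench}: x{remote_penalty(fabric, bench):.1f}")
```

Reading a page from another node is thus roughly 93 times slower than on the same node over Gigabit Ethernet and roughly 58 times slower over InfiniBand.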
With Intel Cluster OpenMP it is even more important to follow these
OpenMP tuning recommendations:
• Enlarge the parallel region: Creating a team of threads
at the entrance to a parallel region and putting it aside
at the exit involves some overhead, although most cur-
rent compilers do a good job in keeping it minimal.
Fewer and shorter serial parts contribute to better scal-
ability, thus parallel regions should be as large as possi-
ble in most cases. With Cluster OpenMP the overhead
of creating or activating a team of threads is higher
than for OpenMP, as all involved nodes have to com-
municate.
• Work on data locally: Keeping data local is very important on ccNUMA
architectures. We found that tuning measures for ccNUMA also improve
performance on Cluster OpenMP, for example respecting the first touch
initialization strategy of the Linux operating system. If threads access
local memory, no page has to be transferred and no twin page has to be
created.
• Prevent false sharing: Normally, false sharing occurs when threads
write to different parts of the same cache line. It does not result in a
data race, but it can affect the performance significantly: if, for
example, two threads run on two different cores that do not share a
cache, only one core can hold the valid cache line at a time, so the
other core has to wait and update later. In the case of Cluster OpenMP,
false sharing becomes an issue on a per-page basis. If two or more
threads write to different locations on the same page, the update process
has to occur at the next synchronization point. This kind of problem can
be resolved by inserting appropriate padding in many cases, although it
is hard to detect in complex applications.
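Per-page false sharing and the effect of padding can be checked with a small helper. This is an illustrative sketch: the page size of 4096 bytes is an assumption about the platform, and the function name `shared_pages` and the byte-offset representation of writes are hypothetical.

```python
PAGE_SIZE = 4096  # bytes; assumed page size, typical for x86 Linux

def shared_pages(writes_by_thread, page_size=PAGE_SIZE):
    """writes_by_thread: {thread: iterable of byte offsets written}.
    Returns the pages that are written by more than one thread."""
    writers = {}
    for thread, offsets in writes_by_thread.items():
        for offset in offsets:
            writers.setdefault(offset // page_size, set()).add(thread)
    return sorted(page for page, threads in writers.items() if len(threads) > 1)

# Four threads, each writing its own block of 1000 doubles (8000 bytes).
packed = {t: range(t * 8000, (t + 1) * 8000, 8) for t in range(4)}
# Same blocks, but each one padded to start on a two-page (8192 byte) boundary.
padded = {t: range(t * 8192, t * 8192 + 8000, 8) for t in range(4)}
```

With the packed layout the blocks of neighbouring threads meet in the middle of pages 1, 3 and 5, so these pages are written by two threads each. With the padded layout no page has more than one writer.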
4 Applications
In this section, we present our experiences of porting
four different applications to Intel Cluster OpenMP. We
used the same experiment setup as in the previous section.
4.1 Jacobi
We tried Intel Cluster OpenMP on the Jacobi solver available on the
OpenMP website [2]. We measured the scalability using a matrix size of
6000 × 6000. We compared the Cluster OpenMP version on Ethernet and
InfiniBand to traditional OpenMP and to two MPI implementations, one with
synchronous communication and one with asynchronous communication.
In all versions the domain decomposition approach for
the parallelization is exactly the same. The main differ-
ence is that with MPI the data on the boundary has to be
transferred explicitly, while the programmer does not have
to reason about that in OpenMP. As the DSM system of Intel
Cluster OpenMP works on a per page basis, in some cases
depending on the total number of threads and the number of
threads per node, some threads will have to access pages on
other nodes for reading data at or near the boundaries.
The comparison between OpenMP and Cluster OpenMP is shown in figure 1. We
found that performance can be improved by binding Cluster OpenMP threads
to scattered cores with the KMP_AFFINITY environment variable. Binding
led to speedup improvements of up to 10% and was especially effective for
the runs with two threads.
It can be noticed that the scalability on one node is lim-
ited to two threads, as the Jacobi solver stresses the mem-
ory bandwidth. Thus, running with two Cluster OpenMP
threads on two nodes shows a better scalability (1.95 over
1.67) than the traditional OpenMP version on a single node,
as the memory bandwidth available to the application is vir-
tually doubled by running on two nodes. Of course, using
more than two threads per node does not improve scalability
with Cluster OpenMP for the same reason as with the tradi-
tional OpenMP version. The maximum speedup is obtained
with eight nodes: 9.92 with InfiniBand and two threads per
node and 7.50 with Ethernet and four threads per node.
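To put these speedups in relation to the resources used, the parallel efficiency (speedup divided by the total number of threads) can be computed from the figures quoted above. The helper `parallel_efficiency` is a name chosen for this sketch.

```python
def parallel_efficiency(speedup, nodes, threads_per_node):
    """Speedup divided by the total number of threads used."""
    return speedup / (nodes * threads_per_node)

# Speedups quoted in the text for the Jacobi solver: (speedup, nodes, threads per node).
runs = {
    "OpenMP, 1 node x 2 threads": (1.67, 1, 2),
    "ClOMP, 2 nodes x 1 thread": (1.95, 2, 1),
    "ClOMP IB, 8 nodes x 2 threads": (9.92, 8, 2),
    "ClOMP Eth, 8 nodes x 4 threads": (7.50, 8, 4),
}
for label, (speedup, nodes, tpn) in runs.items():
    print(f"{label}: efficiency {parallel_efficiency(speedup, nodes, tpn):.2f}")
```

The two-node run uses its two threads almost perfectly, while the best eight-node InfiniBand run reaches an efficiency of about 0.62 and the best Ethernet run about 0.23.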
The scaling of the MPI version with synchronous com-
munication is shown in figure 2, the version with asyn-
chronous communication is shown in figure 3.
Figure 1. Speedup of the Cluster OpenMP version of Jacobi. (Plot: speedup
over number of nodes, 1 to 8; curves: best effort OpenMP; 1 thread/node,
Eth; 4 threads/node, Eth; 1 thread/node, IB; 2 threads/node, IB;
4 threads/node, IB.)
Overlapping communication and computation with
asynchronous MPI is particularly beneficial when employ-
ing the slower Gigabit Ethernet network fabric, whereas
for InfiniBand it does not make a big difference. Like-
wise the Cluster OpenMP version profits from the faster
network, because communication and computation cannot
be overlapped. In all cases MPI clearly outperforms Cluster
OpenMP.
Both MPI versions deliver a speedup of slightly more than 13 with eight
nodes and four processes per node when using the fast InfiniBand network
fabric, whereas the speedup of Cluster OpenMP is limited to 9.92,
obtained with 2 threads per node.
There are two places in the program where communication is involved:
updating data on the boundaries of the subdomains, and the reduction
operation that calculates the error estimate. In order to improve the
Cluster OpenMP version, we implemented prefetching with an additional
POSIX thread, which is similar to overlapping computation and
communication.
Unfortunately, we were unable to achieve any significant performance
improvement by prefetching the boundary data (we used the segvprof.pl
tool provided by Intel to make sure the prefetching worked as expected).
To understand this disappointing result, we extrapolated the runtime for
eight threads on eight nodes from the serial runtime, assuming perfect
scalability. Separately, we predicted the runtime on the basis of our
previous micro-benchmark measurements for page transfers and reduction
operations on eight nodes. As the two estimates differ by no more than
1.5 percent, we concluded that there is little to gain from prefetching.
This result corresponds to the observation that the asynchronous and
synchronous MPI versions perform similarly on the fast InfiniBand
network. When applying the prefetch strategy in combination with the
slower Gigabit Ethernet network, we observed a slight speedup improvement
of about four percent.
Figure 2. Speedup of the synchronous MPI version of Jacobi. (Plot:
speedup over number of nodes, 1 to 8; curves: best effort OpenMP;
1 process/node, Eth; 4 processes/node, Eth; 1 process/node, IB;
2 processes/node, IB; 4 processes/node, IB.)
We took a closer look at the MPI version as well. Using the Intel Trace
Analyzer tool, we observed a communication overhead of 3.3 percent of the
total runtime in MPI_Recv and about 22.4 percent in MPI_Allreduce for a
run with 32 processes. This confirms that the collective reduction
operation is much more expensive than the point-to-point sends and
receives, and this gap will grow further as the number of processes
increases.
As our micro-benchmark experiments revealed that an MPI reduction
operation performs significantly better than a reduction operation in
Cluster OpenMP, we linked the Intel Cluster OpenMP program with the Intel
MPI library and called the MPI reduction operation from within the
Cluster OpenMP program; this combination is probably not officially
supported by Intel. Unfortunately, the current Intel MPI version does not
support full multi-threading, so we had to implement expensive locking.
By replacing Cluster OpenMP's reduction operation with the MPI reduction,
we obtained an increase in speedup of 1.5% on the presented dataset. On a
different dataset, where more iterations are required and thus more
reduction operations are called, we achieved an improvement of up to 12%
with the MPI reduction operation. We concluded that there is still room
for improvement in the Intel Cluster OpenMP implementation, as the
reduction can be implemented with less locking than in our experiments.

Figure 3. Speedup of the asynchronous MPI version of Jacobi.
[Plot: speedup over # nodes (1, 2, 4, 8) for best effort OpenMP;
1 and 4 processes/node, Eth; 1, 2 and 4 processes/node, IB.]

4.2 Sparse Matrix-Vector-Multiplication
A sparse matrix-vector multiplication (SMXV) is typically the most
time-consuming part of iterative solvers. In order to estimate
whether Intel Cluster OpenMP is suited for this class of
applications, we examined the SMXV benchmark kernel of DROPS, a 3D
CFD package for simulating two-phase flows, with a matrix of some
300 MB and about 19,600,000 nonzeros.

The performance of the SMXV benchmark is shown in Table 3. In
addition to the Woodcrest-based systems (UMA), we evaluated the
performance on a Sun Fire V40z server system equipped with four AMD
Opteron 848 single-core 2.2 GHz CPUs, which provides a ccNUMA
architecture. We compared two parallelization strategies: in the
rows-strategy the parallel loop runs over the rows and a dynamic
loop schedule is used for load balancing, while in the
nonzeros-strategy the nonzeros are statically partitioned into
blocks of approximately equal size, one block for each thread.
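
The nonzeros-strategy can be illustrated with a minimal sketch. It
assumes the matrix is stored in CSR format and that blocks are cut
at row boundaries; neither detail is stated for the DROPS kernel,
and the function names are hypothetical.

```python
from bisect import bisect_left

def partition_by_nonzeros(row_ptr, num_threads):
    """Split CSR rows into contiguous blocks with roughly equal nonzeros.

    row_ptr is the usual CSR row-pointer array: row r owns the entries
    row_ptr[r] .. row_ptr[r+1]-1, so row_ptr[-1] is the total nonzero count.
    Returns num_threads + 1 row boundaries; thread t handles the rows
    bounds[t] .. bounds[t+1]-1.
    """
    total_nnz = row_ptr[-1]
    num_rows = len(row_ptr) - 1
    bounds = [0]
    for t in range(1, num_threads):
        target = total_nnz * t // num_threads
        # First row boundary at which the running nonzero count
        # reaches the target for thread t.
        row = bisect_left(row_ptr, target)
        bounds.append(min(max(row, bounds[-1]), num_rows))
    bounds.append(num_rows)
    return bounds

def smxv_block(row_ptr, col_idx, values, x, y, first_row, last_row):
    """Compute y[r] = (A x)[r] for the rows in [first_row, last_row)."""
    for r in range(first_row, last_row):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y[r] = acc
```

Because the boundaries are computed once and never change, each
thread can also initialize its own block of the arrays, which is
what a first touch policy needs in order to place the data locally.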
The nonzeros-strategy outperforms the rows-strategy on the ccNUMA
architecture and on Intel Cluster OpenMP as well, provided all data
are carefully initialized with respect to the operating system's
first touch policy. While the dynamic loop scheduling in the
rows-strategy successfully provides good load balance, its memory
locality is not optimal: over Gigabit Ethernet with one thread per
node, its performance drops to 14.5 MFLOP/s on four nodes. The
nonzeros-strategy shows a negligible load imbalance for the given
dataset, and its advantage is that each thread works on local data.
Exploiting the locality of the nonzeros-strategy, we observed a
nearly linear speedup for the case of one thread per node. There is
little difference between Gigabit Ethernet and InfiniBand, as
little communication is involved. In short, Cluster OpenMP behaves
like a distinct ccNUMA architecture.

                            rows                  nonzeros
                     1 thread  4 threads    1 thread  4 threads
                     per node  per node     per node  per node
OpenMP, UMA             561.9      960         561.5      978.1
OpenMP, ccNUMA          326.3      793.9       324.5     1147.6
ClOMP, Eth, 1 node      548.0      887.2       551.8      939.4
ClOMP, Eth, 2 nodes     113.0      540.1      1058.7     1382.4
ClOMP, Eth, 4 nodes      14.5      136.8      2037.9     2435.6
ClOMP, IB, 1 node       547.9      817.9       551.9      940.4
ClOMP, IB, 2 nodes      904.4     1208.4      1072.0     1415.3
ClOMP, IB, 4 nodes     1328.3     1845.4      2075.0     2536.6

Table 3. Performance [MFLOP/s] of SMXV.

4.3 FIRE
The Flexible Image Retrieval Engine (FIRE) [4] has been developed
at the Human Language Technology and Pattern Recognition Group of
RWTH Aachen University. The benchmark version which we examined
consists of more than 35,000 lines of C++ code. The current version
of FIRE is available for download on the Internet.
Given a query image and the goal of finding the k images in a
database that are most similar to it, a score is calculated for
each database image and the k database images with the highest
scores are returned. In [11] two layers were parallelized with
OpenMP and displayed nearly linear scalability. Shared-memory
parallelization is clearly more suitable than distributed-memory
parallelization for the image retrieval task, as the image database
can then be accessed by all threads and does not need to be
distributed. Because of that, we expected FIRE to be a perfect
candidate for Intel Cluster OpenMP, as searching through the
database involves very little synchronization and only negligible
writing to shared memory.
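
The retrieval step described above can be sketched in a few lines.
This is not FIRE's code: the scoring function, the toy "images"
(plain numbers) and all names are hypothetical and serve only to
show that the search reads the database without writing to it.

```python
import heapq

def retrieve_top_k(query, database, score, k):
    """Return the k database entries with the highest score against query.

    score(query, image) is a caller-supplied similarity function; the
    database is only read, never written, during the search.
    """
    return heapq.nlargest(k, database, key=lambda image: score(query, image))

# Hypothetical toy similarity: the closer two numbers, the higher the score.
def toy_score(query, image):
    return -abs(query - image)

best = retrieve_top_k(10, [1, 9, 12, 30, 10], toy_score, k=3)
```

Each score depends only on the query and one database entry, so
the scoring loop can be divided among threads, with synchronization
needed only when the per-thread results are merged.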
To make variables of the C++ STL sharable, instances of such
variables have to use the kmp_sharable_allocator. To achieve this,
that allocator has to be specified at the variable declaration. On
the one hand, this solution is elegant and does not require many
code changes at the point of declaration; on the other hand, the
type signature of the variable changes. This implies that if such a
variable is passed as a parameter to a function, the function
declaration has to be changed to reflect the type change.
The FIRE code makes extensive use of the STL: many of FIRE's object
data types use STL data types as members or are even derived from
STL data types. Variables are passed down the call stack to all
functions requiring access to them. In order to make FIRE work with
Intel Cluster OpenMP, virtually the whole code base would have to
be touched and nearly every class would have to be changed. This is
not feasible in a limited amount of time and stands in contrast to
the findings in [11], where only very few code changes were
necessary with OpenMP. Providing an STL that allocates all STL
variables as sharable might be a solution for this and similar
codes.
4.4 PANTA
PANTA is a 3D solver that is used in the modeling of
turbomachinery [12]. The package used in our experiments
consists of about 50,000 lines of Fortran 90 code. Several
approaches to parallelize this code have been described, e.g.
[7]. In order to achieve the best possible speedup with Clus-
ter OpenMP, we have chosen the highest level paralleliza-
tion currently exploited with OpenMP, that is a loop over 80
inversion zones.
We had to manually compute the distribution of loop iter-
ations onto threads, as the OpenMP DO work-sharing con-
struct was not applicable in this case because of the code
structure. As the number of loop iterations is relatively
small and as at the end of each loop iteration there is a crit-
ical region in which some global arrays are updated in a
reduction-type manner, we cannot expect good scaling from
this code.
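
A manual distribution of loop iterations onto threads can be
sketched as follows. The PANTA code is Fortran 90 and its actual
distribution scheme is not given here, so this block distribution
and the name block_range are assumptions for illustration.

```python
def block_range(num_iterations, num_threads, thread_id):
    """Half-open range [start, end) of iterations owned by thread_id.

    The first (num_iterations % num_threads) threads get one extra
    iteration, so block sizes differ by at most one.
    """
    base, extra = divmod(num_iterations, num_threads)
    start = thread_id * base + min(thread_id, extra)
    end = start + base + (1 if thread_id < extra else 0)
    return start, end

# 80 inversion zones on 3 threads: blocks of 27, 27 and 26 iterations.
zones = [block_range(80, 3, t) for t in range(3)]
```

With only 80 iterations, the granularity becomes coarse quickly: on
32 threads each thread receives two or three zones, so the threads
holding three zones determine the runtime of the loop.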
Creating a Cluster OpenMP version of the PANTA code parallelized
with OpenMP was straightforward: we enabled the compiler's
autodetection and propagation of sharable variables and asked the
compiler to make all argument expressions, all common block
variables, all module variables and all save variables sharable by
default. The performance of the resulting program is shown in
Figure 4.
We are aware of the fact that making all these variable
types sharable by default puts more variables under the con-
trol of the DSM than necessary and that this will probably
cause a performance penalty. Nevertheless, the scalability
of the Cluster OpenMP version on a single node is similar
to the OpenMP version, thus the penalty is acceptable in
the case where as many Cluster OpenMP threads (not pro-
cesses) are used per node as possible.
Better scalability with traditional OpenMP on a single node is
prevented because the available memory bandwidth is saturated.
Using Intel Cluster OpenMP, we can use more than one node and thus
effectively increase the available memory bandwidth. With two
nodes, the best effort speedup increases from 2.9 with traditional
OpenMP to 3.3, and with four nodes to 4.3. Adding more nodes leads
to only slight improvements.

Figure 4. Speedup of PANTA.
[Plot: speedup over # nodes (1, 2, 4, 8) for best effort OpenMP;
1 and 4 threads/node, Eth; 1, 2, 3 and 4 threads/node, IB.]

Using four threads per node performs worse than using only three
threads per node. According to Intel, one possible explanation is
that the Cluster OpenMP management thread, which takes care of the
DSM system, interferes with the computational threads.
As already seen with the micro-benchmarks and the Jacobi solver,
using InfiniBand improves the performance of Cluster OpenMP
significantly. For PANTA, Gigabit Ethernet performs worse than
traditional OpenMP in all cases. Improvements in the latency and
bandwidth of recent InfiniBand products might increase the
scalability of this application further.

Unfortunately, the current version of Intel Cluster OpenMP does not
support nested OpenMP. For the PANTA code, an additional OpenMP
parallelization is available at the loop level, namely in the
linear equation solver [7]. We suspect that employing this level
with two or even four threads per node would increase the total
scalability of the program.
5 Tool support
The DSM mechanism used by Intel Cluster OpenMP uses segmentation
fault signals to activate the page movement and synchronization
mechanism. This makes debugging a Cluster OpenMP program very hard,
if not impossible, unless the debugger can be taught to ignore the
segfaults and not to step into the Cluster OpenMP library's handler
routine. With the debuggers configured in this way, we successfully
used the Intel command line debugger and the TotalView GUI-based
debugger with Cluster OpenMP programs. Nevertheless, traditional
debuggers are not very helpful in finding errors related to Intel
Cluster OpenMP. The typical problem is that a variable has
erroneously not been made sharable. In this case some threads will
run into segmentation faults when accessing that memory location,
but the runtime system is unable to deliver the page and thus
terminates the program in most cases.
In order to find the places in which accesses to variables
that are not sharable occur, one can use the command line
tool addr2line on a core dump. We found it easy to use
and in most cases it was no problem to figure out which
variable has caused the problem. Intel has announced that
future versions of the Intel Thread Checker tool will also
find variables that should be made sharable.
In addition, Intel delivers a command line tool named segvprof.pl
that counts the number of segmentation faults at the function
level. This can be handy for locating parts of the program that
perform poorly, e.g. because too many accesses to remote pages
occur. However, this tool is very basic in its current form, and
for complex codes like PANTA the provided functionality is too
limited to find and understand performance problems related to
Cluster OpenMP. Intel has announced that future versions of the
Intel Trace Collector and Analyzer will support such an analysis.
6 Conclusions and Future Work
Intel Cluster OpenMP allows shared-memory OpenMP programs to be
executed on a cluster. It takes advantage of the relaxed
consistency memory model of OpenMP. Nevertheless, OpenMP primitives
become two to four orders of magnitude more expensive.

Intel Cluster OpenMP proved successful for several small
applications: a cluster of SMP nodes could be exploited while
preserving the easier and more comfortable parallelization paradigm
of OpenMP and shared memory. For more complex applications like
PANTA, however, scalability does not come for free and further
tuning effort has to be invested. We ran into problems with C++
programs employing the STL, which still have to be resolved. We
suspect that there is room for improvement in Intel's current
implementation of reductions and in the tool support.

Future work will be to evaluate more programs from the scientific
domain with Intel Cluster OpenMP. We will apply tuning measures to
Cluster OpenMP programs: porting codes like PANTA was
straightforward because of the compiler features provided, but the
full performance potential has not yet been achieved. We are also
interested in combining Cluster OpenMP with other parallelization
paradigms to enable multi-level parallelism.
Acknowledgements
We sincerely thank Jay Hoeflinger and Larry Meadows
from Intel for providing hints on potential performance im-
provements.
References
[1] MPI: A Message-Passing Interface Standard. Technical re-
port, University of Tennessee, Knoxville, TN, USA, May
1994.
[2] ARB. OpenMP Application Program Interface, May 2005.
[3] J. M. Bull. Measuring Synchronisation and Scheduling
Overheads in OpenMP. In European Workshop on OpenMP
(EWOMP), Lund, Sweden, September 1999.
[4] T. Deselaers, D. Keysers, and H. Ney. Features for Image
Retrieval - a quantitative comparison. In 26th DAGM Symposium,
Pattern Recognition (DAGM 2004), number 3175
in Lecture Notes in Computer Science, pages 228 – 236,
Tübingen, Germany, 2004.
[5] J. L. Hennessy and D. A. Patterson. Computer Architecture
- A Quantitative Approach. Morgan Kaufmann Publishers
Inc., 2006.
[6] J. P. Hoeflinger. Extending OpenMP to Clusters. 2006.
[7] Y. Lin, C. Terboven, D. an Mey, and N. Copty. Automatic
Scoping of Variables in Parallel Regions of an OpenMP Pro-
gram. In Workshop on OpenMP Applications and Tools
(WOMPAT 2004), Houston, USA, May 2004.
[8] H. Lu, Y. C. Hu, and W. Zwaenepoel. OpenMP on Network
of Workstations. 1998.
[9] F. J. L. Reid and J. M. Bull. OpenMP Microbenchmarks Ver-
sion 2.0. In 6th European Workshop on OpenMP (EWOMP
2004), pages 63 – 68, Stockholm, Sweden, October 2004.
[10] M. Sato, H. Harada, A. Hasegawa, and Y. Ishikawa. Cluster-
enabled OpenMP: An OpenMP compiler for the SCASH
software distributed shared memory system. Scientific Pro-
gramming, 9(2,3):123–130, 2001.
[11] C. Terboven, T. Deselaers, C. Bischof, and H. Ney. Shared-
Memory Parallelization for Content-based Image Retrieval.
In ECCV 2006 Workshop on Computation Intensive Meth-
ods for Computer Vision (CIMCV), Graz, Austria, May
2006.
[12] T. Volmar, B. Brouillet, H. E. Gallus, and H. Benetschik.
Time Accurate 3D Navier-Stokes Analysis of a 1.5 Stage
Axial Flow Turbine, 1998.
PET Image Reconstruction on the Cell BE
Josef Minde, Josef Weidendorfer, Tobias Klug, Carsten Trinitis
Lehrstuhl für Rechnertechnik und Rechnerorganisation
Institut für Informatik
TU München, Boltzmannstraße 3, 85748 Garching bei München
{minde|weidendo|klug|trinitic}@in.tum.de
Irene Torres Espallardo
Klinikum rechts der Isar der TU München, Nuklearmedizin
Ismaninger Straße 22
81675 München
i.torres@lrz.tu-muenchen.de
Abstract
The Cell BE processor, originally developed for the multimedia
market, has attracted strong interest in high-performance computing
since its release because of its high peak performance.
This article examines how well this performance can be exploited
for a given application from nuclear medicine. The work presented
here was carried out as a diploma thesis within the Munich
Multicore Initiative (Münchner Multicore-Initiative) in order to
gain experience with an important representative of heterogeneous
multicore processors.
1 Introduction
In early 2006, the first rumors emerged about the very high
computing performance of a chip that IBM was developing on behalf
of Sony for its newest game console. Upon its release, the Cell was
able to show that, for suitable code, it comes fairly close to its
theoretical peak performance of about 250 GFlop/s. In [10] and
[11], Williams et al. show that this performance can also be put to
good use in scientific computing.
Processors from other manufacturers that are fabricated at the same
feature size achieve only considerably lower performance. However,
modern standard processors use cores that are optimized for maximum
execution speed of sequential code, which is bought with greatly
increased complexity through techniques such as branch prediction,
dynamic instruction reordering and speculative execution. All of
these techniques are avoided in the Cell's accelerator units, the
Synergistic Processing Elements (SPEs), which saves transistor
resources and makes it possible to integrate many of these units on
one chip [9].
In addition to eight of these accelerator units, a first-generation
Cell processor has a conventional PowerPC core (Power Processing
Element, PPE), on which a regular operating system (here Linux) can
run and which controls the accelerator cores. As Figure 1 shows,
the heterogeneous processing units communicate via an internal ring
bus with multiple rings (Element Interconnect Bus, EIB). While the
PPE has a regular cache hierarchy, each SPE can access only its own
256 KB memory (Local Store, LS). Communication between these local
stores and main memory takes place through explicit DMA operations.
As a consequence, it is sometimes difficult for software to
actually exploit the theoretically available performance of the
Cell processor's accelerator units.
With multicore architectures now a necessity, it is to be expected
that the strategy pursued for the Cell processor, simplifying
individual cores to achieve a considerably better overall
performance per chip, will be followed by other manufacturers as
well. The idea is that newly written software, which works in
parallel anyway, achieves higher overall performance on many simple
cores than when executed in parallel on a few complex cores.
However, since there will always be applications that are hard to
parallelize, standard processors can be expected to retain a small
number of complex cores on the chip, as is the case for the Cell
processor with its PowerPC core. This processor is therefore ideal
for gaining experience with the class of heterogeneous multicore
processors in order to better assess future processor development.
The latter is a central goal of the Munich Multicore Initiative
(footnote 1), within which the application presented in this
article was ported to the Cell processor as the first author's
diploma thesis. An analysis of the programmability of the Cell
processor had already been carried out beforehand [2].

Figure 1. Block diagram of the Cell

A program from medical image processing, used in a nuclear medicine
research group at Klinikum rechts der Isar at the chair of Prof.
Schwaiger, appeared well suited for studying the Cell processor.
The program iteratively reconstructs a 3D image of an object from
the measurement data of a so-called PET scanner (MADPET-II) (see
Section 2). In a cooperation with the chair of Prof. A. Bode that
has existed for years, the serial code is currently being
parallelized with MPI for a future cluster. The present work
investigates to what extent the Cell processor is suitable as a
platform for parallelization.
Reconstructing a single 3D image with a resolution of 140 × 140 ×
40 from the measurement data of the research device with its 1152
detectors currently takes several hours on a normal PC for 60
iteration steps. The vision, however, is real-time reconstruction
in which a beating heart can be observed. The image reconstruction
uses precomputed input data, a so-called system matrix, which
statistically captures the expected measurement for individual
decay events in the examined region. It can be approximated by
Monte Carlo simulation, where a longer simulation promises a better
reconstruction. The system matrices currently in use have a size of
up to 56 GB. They therefore do not fit into the main memory of
today's typical PCs, so the runtime then depends on the read
performance of the hard disk. In principle, the same problem
applies to available Cell systems. However, the structure of the
system matrix can be modeled. This makes it possible to compute the
matrix on demand at runtime during the actual image reconstruction.
A very low main memory requirement is thus bought with considerably
higher computational effort. The accelerator units of the Cell,
however, appear to be well suited to this approach.

Footnote 1: http://mmi.cs.tum.edu

In the following, we first take a closer look at the medical
background and the MLEM algorithm used. We then describe how the
algorithm was extended for the Cell port. Section 4 shows how high
the performance of the Cell processor is for this application and
what effect the adaptations have on the quality of the image
reconstruction. Finally, we summarize our experiences with the Cell
processor and give an outlook.
2 Medical Background
Medical imaging makes it possible to produce images of both the
exterior and the interior of the human body or of the bodies of
animals. Both the structure of the body and the functioning of
processes such as metabolism or blood circulation can be
visualized. The most important imaging techniques are X-ray film,
mammography (with soft X-rays), computed tomography (CT, a moving
X-ray technique for 3D imaging), magnetic resonance imaging (MRI,
likewise a 3D technique, which uses strong magnetic fields to
excite oscillations of molecules), sonography (reflection of
ultrasound) and positron emission tomography (PET) [5, 3]. The
program examined in this work is used for the last of these imaging
techniques. PET is used to assess heart and brain function, but
also for tumor detection.
How PET Works
Like CT or MRI, PET produces cross-sectional images of the examined
body. PET is based on visualizing the distribution of a
radioactively labeled substance (radiopharmaceutical) in the
organism, primarily to depict biochemical processes. The
radiopharmaceutical is administered by injection or inhalation. It
then spreads through the body and emits positrons. These positrons
(e+) are not stable; after a short distance (about 2-3 mm on
average) they interact with electrons (e−). In this so-called
annihilation, energy is released in the form of two γ quanta (511
keV each). These two γ quanta move apart at an angle of about 180°
(see Fig. 2), which can be exploited to determine the location of
the radioactive decay and thus to trace the path of the
radiopharmaceutical through the body.

Figure 2. Decay process in PET

The core of the PET scanner consists of several hundred γ detectors
(scintillation counters) arranged in a ring. If two γ quanta are
detected almost simultaneously (in coincidence, i.e. within a time
window of about 10 nanoseconds), this is taken as an event on the
imaginary line between the two detectors that fired. This imaginary
line is called the line of response (LOR) or coincidence line.
Since the detectors have a spatial extent and the emission angle of
the γ quanta can deviate from 180° by up to 5°, a LOR has a
thickness that increases somewhat toward the middle. For image
reconstruction, the events detected for individual LORs are
converted into activities in cuboid voxels, which yields a 3D
image. The more detectors there are and the more decays can be
measured, the higher the spatial resolution of the PET scanner.
Typical PET scanners have 10,000 detectors, with which a spatial
resolution of about 4-5 mm can be achieved [1].
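
The coincidence rule above can be sketched as a small event filter.
This is a simplified illustration, not the scanner's acquisition
logic: the hit format, the pairing of only neighbouring hits and
the function name are assumptions.

```python
def find_coincidences(hits, window_ns=10.0):
    """Pair detector hits that arrive within window_ns of each other.

    hits: list of (timestamp_ns, detector_id), sorted by timestamp.
    Returns a list of LORs as (detector_a, detector_b) pairs.
    Simplification: each hit is used at most once, and only hits that
    are neighbours in time are compared.
    """
    lors = []
    i = 0
    while i + 1 < len(hits):
        t0, d0 = hits[i]
        t1, d1 = hits[i + 1]
        if t1 - t0 <= window_ns and d0 != d1:
            lors.append((min(d0, d1), max(d0, d1)))
            i += 2  # both hits are consumed by this event
        else:
            i += 1  # the unpaired hit is discarded
    return lors

# Hypothetical hit list: two coincidences and one single hit.
events = find_coincidences(
    [(0.0, 17), (4.0, 530), (100.0, 8), (250.0, 3), (255.0, 901)]
)
```

Counting how often each detector pair occurs in such a list gives
the per-LOR event counts that serve as input to the reconstruction.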
One PET scanner used in research at Klinikum rechts der Isar in
Munich is the MADPET-II. The aim of the research is to achieve,
even for relatively small devices that can be used, for example, to
examine small animals, an image quality as high as that of large
devices in clinical use. This requires a compromise between spatial
resolution and sensitivity. The desired high spatial resolution can
only be achieved with small detectors; small detectors, however,
mean poor detection rates (= low sensitivity), since these depend
on the length of the intersection of a photon ray with a detector
crystal. To detect a number of decays sufficient for the quality of
a measurement, one would have to increase the concentration of the
radiopharmaceutical, which leads to a higher burden on the examined
organism. For this reason, the detectors of the MADPET-II are
arranged in two concentric rings instead of the usual one, for
better resolution and higher sensitivity [6]. The number of
detectors is 1152 [7], and a spatial resolution of about 0.5 mm can
be achieved. Figure 3 shows the detector arrangement with some
connecting lines between detectors, the so-called LORs of the
corresponding detector pairs.

Figure 3. Detector arrangement with LORs

2.1 Image Reconstruction in PET
The goal of image reconstruction is to convert measurement data
$g_i$, each corresponding to the number of coincidence events
recorded during the measurement on the connecting line of a
detector pair $i$, into a three-dimensional image composed of a
grid of voxels, represented by the amplitude $f_j$ for voxel $j$.
For a given arrangement there is a probability distribution $A$
that states how high the probability $a_{ij}$ is that a decay took
place within voxel $j$ when detector pair $i$ detected a decay. The
image reconstruction can thus be summarized by the formula

$$ g = A f $$

where the measurement $g$ is known and the matching image $f$ is
sought. The probability distribution $A$ is called the system
matrix, since it completely describes the system behavior of a PET
scanner. It can either be determined by lengthy physical
experiments (a source is moved through all voxel positions) or
approximated by Monte Carlo simulation, in which case effects such
as shadowing of detectors and scattering of γ radiation must be
taken into account [7].
Since the decay processes in the PET scanner are statistical in
nature, it is sensible and considerably faster to solve the image
reconstruction problem iteratively rather than exactly. A suitable
rule is provided by the MLEM method (Maximum Likelihood Expectation
Maximization) [8], where $f^{(n)}$ denotes the iterates of the
image:

$$ f_j^{(n+1)} = \frac{f_j^{(n)}}{\sum_{i'} a_{i'j}} \sum_{i} a_{ij} \frac{g_i}{\sum_{j'} a_{ij'} f_{j'}^{(n)}} $$

In a first step, the so-called forward projection, this iteration
rule computes from the approximate solution $f^{(n)}$, for each
detector pair $i$, the measurement result
$g_i^{(n)} = \sum_{j'} a_{ij'} f_{j'}^{(n)}$ that one should
actually have obtained for this approximation. From this, with the
help of the actual measured value $g_i$, a correction factor is
computed, which in the second step, the back projection, leads to
an improved next solution for each voxel $j$. The coefficient
$1/\sum_{i'} a_{i'j}$, which is constant for each $j$, merely
provides the necessary normalization at the end of an iteration
step.
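
The iteration rule can be written down directly. The following is a
minimal sketch for a small dense matrix in pure Python; the
reconstruction program itself works on a sparse, compressed system
matrix, and the toy matrix, image and function name here are
hypothetical.

```python
def mlem_step(A, g, f):
    """One MLEM iteration for a dense system matrix A (list of rows).

    A[i][j]: probability linking detector pair i and voxel j
    g[i]:    measured counts for detector pair i
    f[j]:    current image estimate for voxel j
    """
    num_pairs, num_voxels = len(A), len(f)
    # Forward projection: expected measurement for the current estimate.
    expected = [sum(A[i][j] * f[j] for j in range(num_voxels))
                for i in range(num_pairs)]
    # Back projection of the ratio measured / expected.
    back = [0.0] * num_voxels
    for i in range(num_pairs):
        if expected[i] > 0.0:
            ratio = g[i] / expected[i]
            for j in range(num_voxels):
                back[j] += A[i][j] * ratio
    # Normalization by the column sums of A.
    norm = [sum(A[i][j] for i in range(num_pairs)) for j in range(num_voxels)]
    return [f[j] * back[j] / norm[j] if norm[j] > 0.0 else 0.0
            for j in range(num_voxels)]

# Hypothetical toy problem: 3 detector pairs, 2 voxels, noise-free data.
A = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]
true_image = [4.0, 2.0]
g = [sum(a * x for a, x in zip(row, true_image)) for row in A]
f = [1.0, 1.0]
for _ in range(200):
    f = mlem_step(A, g, f)
```

For this noise-free toy problem the iterates approach the image
that generated the data. The three commented stages correspond to
the forward projection, the back projection and the normalization
described above.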
Stable convergence of the solution by means of an iterative method
requires that the number of unknowns in the system of equations
roughly matches the number of knowns, i.e. that the number of
measured values is approximately equal to the number of voxels.
This matches the observation that the achievable resolution of a
PET scanner depends on the number of detector pairs.

Figure 4. Probability distribution of LOR 35 with interpolation

In the case of the MADPET-II with its 1152 × 1152 detector pairs,
140 × 140 × 40 was chosen as a sensible number of voxels. The
system matrix generated by Monte Carlo simulation that was used for
the investigations in this work has about 650 million nonzero
entries. Although it already has a size of over 7 GB when stored
naively, it can only be regarded as a very coarse approximation of
the probability distribution, as the very patchy ray of LOR 35 in
Fig. 4 shows by way of example (footnote 2). Experience has shown
that satisfactory reconstruction results are obtained for the
MADPET-II when 60 iterations are carried out. The approximate
solution with MLEM after iteration 60 in Fig. 5 shows a somewhat
limited resolving power compared to the reference image.

Figure 5. Reference image (left) and MLEM reconstruction after 60
iterations (right)

Compression methods can reduce the size of the system matrix
considerably, both on disk and in memory; the main memory capacity
of current PCs is nevertheless exceeded. A single iteration can
therefore easily take 5 minutes, which corresponds to the time for
reading the system matrix sequentially from the hard disk. For 60
iterations this results in an unacceptable reconstruction time, on
the order of five hours, for a single image, as long as no better
hardware is used that can hold the matrix in main memory.

Footnote 2: The system matrix currently being generated, with a
size of over 50 GB(!), should yield a somewhat better
approximation.

3 Adaptation to the Cell
Acceptable times for an image reconstruction can only be achieved
if the performance of an external storage medium is not a limiting
factor, i.e. if all computations can take place in main memory. The
system available for this work, an IBM Cell Blade QS20, has only 1
GB of main memory, which is not sufficient for a precomputed system
matrix of sensible size. In principle, the work can also be
parallelized across several Cell blades so that each partition of
the system matrix fits into the memory of one blade. Unfortunately,
such systems were not available; considering that the system matrix
can become very large depending on the length of the Monte Carlo
simulation (the largest system matrix currently in existence has 56
GB), this approach was not pursued for practical reasons (footnote
3).
In the following, a model of the matrix is used that can be
computed quickly on demand and requires only a few parameters.
Modeling the System Matrix
Based on theoretical considerations as well as on the probability
distribution delivered by the Monte Carlo simulation (for instance
for LOR 35 in Fig. 4), a simple model appears to be possible. The
idea of extracting the boundaries of the LORs from the existing
system matrix is unfortunately not feasible because of its
patchiness.
Instead, we start from the exact geometric positions of the
detectors. For the voxel coverage of the LORs, a simple model is
used that connects the eight corner points of the two detector
crystals involved and then, for each Y plane, assigns equal
probability to the area swept in the X/Z direction, as shown in
Figure 6: the black rectangles represent two detectors, the red
area the region lying within the LOR of this detector pair. The
dark red circle segment shows the section relevant for image
reconstruction.

Figure 6. Modeling of a LOR

Footnote 3: An MPI parallelization for a compute cluster is under
development.

Figure 7. MLEM with modeling

Using this model of the system matrix, Fig. 7 shows the result of
the image reconstruction in comparison to the reference image from
Fig. 5. The result of MLEM with this model is visibly worse than
with the system matrix obtained from the Monte Carlo simulation.
However, the priority was initially a model that is as simple as
possible, and thus quick to implement efficiently on the Cell
processor, and that allows statements about the general suitability
of the processor.
Parallelization Strategy
Within one iteration, the MLEM method itself consists of
• a forward projection, which computes for the current image estimate
what the measurement should have looked like (the sum in the
denominator of the last fraction of the iteration formula);
• a backward projection, which uses the expected measurement for the
current image estimate (from the first step) and the actual
measurement to compute a corrected image (the summation term in the
iteration formula);
• a normalization of the voxel values, so that the solution can
converge (the leading factor in the iteration formula).
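The three steps can be sketched as follows. This is a minimal illustration of the standard MLEM update on a sparse matrix, not the code used in this work; the row layout as a list of (voxel, probability) tuples follows the description in this paper, while the function and variable names are assumptions.

```python
def mlem_iteration(rows, measured, image):
    """One MLEM update (sketch).

    rows[i] is the sparse system-matrix row of LOR i as a list of
    (voxel_index, probability) tuples; measured[i] is the count of LOR i;
    image is the current image estimate as a flat list of voxel values.
    """
    correction = [0.0] * len(image)
    sensitivity = [0.0] * len(image)
    for row, counts in zip(rows, measured):
        # Forward projection: expected measurement for the current image.
        expected = sum(p * image[v] for v, p in row)
        ratio = counts / expected if expected > 0.0 else 0.0
        # Backward projection: distribute measured/expected to the voxels.
        for v, p in row:
            correction[v] += p * ratio
            sensitivity[v] += p
    # Normalization by the column sums of the system matrix.
    return [
        image[v] * correction[v] / sensitivity[v] if sensitivity[v] > 0.0 else 0.0
        for v in range(len(image))
    ]
```

The accesses `image[v]` and `correction[v]` are the indirections discussed next: which voxels are touched depends on the contents of each row.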
Since the system matrix is sparse and stored in compressed form, the
first two steps each access the measurement data, as well as the old
and new image data, through indirections. The relevant parts of these
data structures would have to be transferred in many small pieces for
processing by the accelerator units, because the data structures
themselves do not fit into the small local stores. For this reason,
the MLEM was left on the PowerPC core of the Cell in the first
prototype implementation.
In contrast, computing the currently required part of the system
matrix at run time with the model described above is relatively easy
to implement and to distribute over the eight accelerators. Fig. 8
shows this extended master/slave approach. In the drawing, PPE refers
to the PowerPC core and SPE to an accelerator.
Figure 8. Parallelization concept
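The division of labor can be sketched with a thread pool as an analogy. This is not Cell code: the worker threads merely stand in for the SPEs and the calling thread for the PPE, and `toy_model_lor` is a hypothetical placeholder for the real model of a system-matrix row.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_model_lor(lor):
    """Hypothetical stand-in for the modeled system-matrix row of one LOR."""
    # Toy model: each LOR covers two neighbouring voxels with equal probability.
    return lor, [(lor, 0.5), (lor + 1, 0.5)]

def forward_project(image, num_lors, num_workers=8):
    """Master loop: workers generate rows, the master consumes them."""
    expected = [0.0] * num_lors
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        # The workers compute rows on demand; only the master touches the image.
        for lor, row in workers.map(toy_model_lor, range(num_lors)):
            expected[lor] = sum(p * image[v] for v, p in row)
    return expected
```

In this arrangement every row passes through the single master, so the master's per-row work limits the overall throughput regardless of the number of workers.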
The data that an SPE sends to the PPE for one LOR consist of a list of
(voxel, probability) tuples, which for the present model has an
average length of about 4500 voxels, corresponding to 36 KB.
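The stated figures imply 8 bytes per tuple (36 KB / 4500). The following sketch reproduces that size under the assumption, not confirmed by the text, that a tuple is stored as a 32-bit voxel index followed by a 32-bit floating-point probability.

```python
import struct

# Assumed layout: little-endian uint32 voxel index + float32 probability.
TUPLE_FORMAT = "<If"

def pack_lor(row):
    """Serialize a list of (voxel_index, probability) tuples."""
    return b"".join(struct.pack(TUPLE_FORMAT, v, p) for v, p in row)

# An average LOR of the present model: about 4500 tuples.
row = [(v, 1.0 / 4500) for v in range(4500)]
payload = pack_lor(row)
```

With this layout, `len(payload)` is 36000 bytes, which matches the 36 KB quoted above.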
4 Results
The model of the system matrix was first implemented on a PC (Intel
P4, 3.2 GHz). Although continuously recomputing the matrix through the
model requires considerably more computing power, this variant is,
surprisingly, faster (about 6 minutes per iteration) than the original
code (about 8.5 minutes per iteration), which has to load the
precomputed matrix from hard disk.
Figure 9. Run times of the parallel MLEM
For the measurements on the CellBlade, run times were taken with the
GNU compiler (gcc) and the IBM compiler (xlc), both from CellSDK 2.1.
The parallel variant on the Cell with 16 SPEs needs 219 s (xlc) or
218 s (gcc) for one iteration. This is unexpectedly slow, since the
Cell is thus only somewhat faster than the sequential code including
the model on the PC. The run times (see Fig. 9) do decrease when more
SPEs are used, but a high sequential portion of more than 200 seconds
is evident. This can only mean that the MLEM running on the PowerPC
core is responsible for the poor timing behavior when many SPEs are
used, and that the accelerators then cannot be kept busy.
Figure 10. Run times of the modeling alone
Next, the run-time behavior of the accelerator units was examined
separately, without executing the MLEM on the PowerPC core, which
slows everything down. Fig. 10 shows that the computations of one
iteration take 19 seconds (xlc) on 16 accelerators, in contrast to the
219 seconds needed in total for one MLEM iteration. As Figure 11
shows, the pure computations of the accelerator components for the
modeled system matrix achieve a very good speedup.
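The two xlc measurements can be combined into a rough estimate of the sequential portion. The sketch below assumes that the work on the PowerPC core and the work on the accelerators do not overlap, which the measurements do not establish; the function name is hypothetical.

```python
def serial_bound(total_s, accelerated_s):
    """Split one iteration into a serial part and its share of the total.

    Assumption: serial (PPE) and accelerated (SPE) work do not overlap.
    """
    serial_s = total_s - accelerated_s
    return serial_s, serial_s / total_s

# xlc, 16 SPEs: 219 s per full iteration, 19 s for the modeling alone.
serial_s, serial_share = serial_bound(219.0, 19.0)
```

Under this assumption the PowerPC core accounts for 200 s, or roughly 91 % of an iteration, so even arbitrarily fast accelerators could shorten an iteration by at most 19 s.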
The advantage of the IBM compiler is clearly visible in Figure 10,
since only the SPEs are active in the pure modeling. A comparison with
Figure 9 shows, however, that this does not hold for the PPE: with 16
SPEs, the code generated by the GNU compiler is minimally faster
overall.
Figure 11. Speedup of the modeling alone
These results of the first prototype allow the following conclusions:
• The accelerator units are fast for simple tasks, and their potential
can be exploited well.
• The PowerPC core should only be entrusted with distributing tasks to
the accelerators. For tasks of its own, it is weak relative to the
performance of the accelerators.
• The IBM compiler (xlc) generates clearly better optimized code for
the SPEs.
The parallelization strategy of the first prototype, which had to be
implementable within the available time, evidently was the wrong
choice, although this was quite unexpected. Any other strategy that
distributes the MLEM itself across the accelerator components is
foreseeably more complex and more laborious to implement, since
disjoint parts of the measurement data and of the old and new image
data of an iteration have to be copied to the SPEs in many small
messages and aggregated afterwards. On the other hand, the accelerator
units of the Cell processor still have so many free resources for the
task at hand that a clearly improved model should be efficiently
implementable.
Given the difficulties mentioned above in fully distributing the MLEM
across the accelerator units, the results of this work suggest that a
good solution may be to use a standard processor from Intel or AMD for
the MLEM algorithm itself, while leaving the modeling of the system
matrix to a Cell processor working in the same system. Exactly this
strategy is currently being pursued by IBM in the Roadrunner
project [4].
5 Summary and Outlook
In this work, the MLEM algorithm, which is used for image
reconstruction in PET, was ported to the Cell processor. A
prerequisite was an analytical model for generating the system matrix
of a PET scanner at run time, since any matrix needed for usable
quality exceeds the main memory capacity of the available CellBlade
QS20 many times over. It turned out that the implementation of an
analytical model is very well suited to the accelerator units of the
Cell processor. High-quality image reconstruction results do require
an improved model of the system matrix, but a corresponding adaptation
should be straightforward.
In the first step presented in this article, the actual MLEM algorithm
was executed unchanged on the PowerPC core of the Cell processor. This
strategy turned out not to be sensible because of the low performance
of this core. There are essentially two approaches to work around this
problem:
• The MLEM algorithm is distributed across the accelerator units of
the Cell processor. The distribution can be done either per LOR or
through a geometric partitioning of the voxel space.
• Where hybrid systems are available that use Cell processors as
coprocessors of standard processors, parallelizing the MLEM on the
available standard processors while offloading the modeling to the
accelerator units, as in the present work, is promising.
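The two distribution options named in the first approach can be sketched as index computations. This is only an illustration under simple assumptions (round-robin assignment of LORs, contiguous slabs of Y planes); both function names are hypothetical and neither scheme has been implemented or measured in this work.

```python
def partition_lors(num_lors, num_workers):
    """Per-LOR distribution: worker w handles LORs w, w + num_workers, ..."""
    return [list(range(w, num_lors, num_workers)) for w in range(num_workers)]

def partition_y_slabs(num_y_planes, num_workers):
    """Geometric distribution: one contiguous slab of Y planes per worker.

    Returns half-open ranges (y_begin, y_end).
    """
    return [
        (w * num_y_planes // num_workers, (w + 1) * num_y_planes // num_workers)
        for w in range(num_workers)
    ]
```

With a per-LOR distribution each worker produces a partial correction image that must be summed afterwards, whereas with a geometric partitioning each worker owns its voxels but has to process every LOR that crosses its slab.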
6 Acknowledgments
We thank IBM Deutschland for the uncomplicated cooperation and for
remote access to a Cell blade center that was made available to us.
References
[1] J. Braun. Bildgebende Verfahren in der Medizin. Charité Campus
Benjamin Franklin, Institut für Medizinische Informatik, 2007.
[2] T. Eissler. Bewertung paralleler Programmierkonzepte für
heterogene Multicore-Architekturen am Beispiel der Cell BE. TU
München, Fakultät für Informatik, 2007.
[3] E. Krestel. Handbuch der Medizinischen Informatik, chapter
Bildgebende Systeme für die medizinische Diagnostik, page 349. Hanser
Verlag München, 2002.
[4] Los Alamos National Laboratory: Roadrunner Homepage.
http://www.lanl.gov/orgs/hpc/roadrunner.
[5] T. Lehman, J. Hiltner, and H. Handels. Handbuch der Medizinischen
Informatik, chapter Medizinische Bildverarbeitung, pages 339-414.
Hanser Verlag München, 2002.
[6] M. Rafecas, G. Böning, B. J. Pichler, E. Lorenz, M. Schwaiger, and
S. I. Ziegler. Inter-crystal scatter in a dual layer, high resolution
LSO-APD positron emission tomograph. Phys Med Biol, 48(7):821-848,
Apr 2003.
[7] M. Rafecas, B. Mosler, M. Dietz, M. Pogl, A. Stamatakis,
D. McElroy, and S. Ziegler. Use of a Monte Carlo-based probability
matrix for 3-D iterative reconstruction of MADPET-II data. Nuclear
Science, IEEE Transactions on, 51(5):2597-2605, Oct. 2004.
[8] G. Steidl. Bildrekonstruktion für PET/SPECT-Systeme. Seminar
Algorithmen der Bildrekonstruktion in der Medizintechnik, 2007.
[9] IBM Systems and Technology Group. Cell Broadband Engine
Architecture. www.ibm.com/developerworks/power/cell, 2006.
[10] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and
K. Yelick. The potential of the cell processor for scientific
computing. In Proceedings of the 3rd Conference on Computing Frontiers
(Ischia, Italy, May 03 - 05, 2006), New York, NY, 2006. ACM Press.
[11] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and
K. Yelick. Scientific computing kernels on the cell processor.
International Journal of Parallel Programming, 35(3):263-298,
June 2007.
