MPI-Semantic Memory Checking Tools for Parallel Applications by Fan, Shiqing
High Performance Computing Center Stuttgart HLRS
University of Stuttgart
Prof. Dr.-Ing. Dr. h.c. Dr. h.c. Michael Resch
Nobelstrasse 19
70569 Stuttgart
MPI-Semantic Memory
Checking Tools for Parallel
Applications
Dissertation approved by the Faculty of Energy-, Process- and Bio-Engineering
(Fakultät Energie-, Verfahrens- und Biotechnik) of the University of Stuttgart
in fulfilment of the requirements for the degree of Doktor-Ingenieur (Dr.-Ing.)
submitted by
M.Sc. Shiqing Fan
from Liaoning, China
Main referee: Prof. Dr.-Ing. Dr. h.c. Dr. h.c. Michael Resch
Co-referees: Prof. Dr. Ulrich Rüde
Prof. Dr.-Ing. Rainer Keller
Date of submission: 6 July 2012
Date of oral examination: 22 July 2013
High Performance Computing Center Stuttgart
2012
All rights reserved.
© 2012 by Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
University of Stuttgart
Nobelstraße 19
D-70569 Stuttgart
Abstract
The Message Passing Interface (MPI) is a language-independent application interface
that provides a standard for communication among the processes of programs running
on parallel computers, clusters or heterogeneous networks. However, writing correct and
portable MPI applications is difficult: parameters may be used inconsistently or
incorrectly, and the subtle semantic differences between MPI calls may be overlooked
even by expert programmers. To achieve the highest possible performance, MPI
implementations typically perform only minimal sanity checks.
Although many interactive debuggers have been developed or extended to handle the
concurrent processes of MPI applications, there are still numerous classes of bugs that
are hard or even impossible to find with a conventional debugger. In many cases
of memory conflicts or errors, for example overlapping accesses or segmentation faults,
a debugger does not provide enough useful information for the programmer to solve the
problem. The situation is even worse for MPI applications: the MPI standard allows
memory to be used flexibly and frequently in parallel, which makes memory problems
more difficult to observe in the traditional way. Currently, no available debugger
is particularly helpful for MPI semantic memory errors, i.e. for detecting memory
problems or potential errors according to the standard. For this specific purpose, memory
checking tools have been implemented in this dissertation, and the corresponding
frameworks in Open MPI for parallel applications based on MPI semantics have been
developed using different existing memory debugging tool interfaces. Developers are able
to detect hard-to-find bugs such as memory violations, buffer overruns and inconsistent
parameters. The memory checking tools provide detailed, comprehensible error messages
that help MPI developers locate the cause of an error. Furthermore, the memory checking
frameworks may also help improve the performance of MPI-based parallel applications by
detecting whether the communicated data is actually used. The new memory checking tools
may also be used in other projects or debuggers to perform different memory checks.
The memory checking tools do not only apply to MPI parallel applications, but may
also be used in other kinds of applications that require memory checking. The technology
allows programmers to implement their own memory checking functionality in a flexible
way: they may define what information they want to know about the memory and how the
memory in the application should be checked and reported.
The world of high performance computing is Linux-dominated and open-source based.
However, Microsoft is also playing an increasingly important role in this domain,
establishing its foothold with Windows HPC Server 2008 R2. In this work, the advantages
and disadvantages of these two HPC operating systems are discussed. To improve
programmability and portability, we introduce a version of Open MPI for Windows with
several newly developed key components. Correspondingly, an implementation of a memory
checking tool on Windows is also introduced.
This dissertation has five main chapters: after an introduction and the state of the
art, the development of Open MPI for the Windows platform is described, including the
work on InfiniBand network support. Chapter four presents the methods explored and the
opportunities for error analysis of memory accesses. It also describes the two tools
implemented for this work, based on Intel Pin and on Valgrind, as well as their
integration into the Open MPI library. In chapter five, the methods are evaluated using
several benchmarks (NetPIPE, IMB and NPB) and analyzed for their usefulness with real
applications (a heat conduction application and the MD package Gromacs). It is shown that
the instrumentation introduced by the tool causes no significant overhead (1.2% to 2.5%
on the NetPIPE latency) and accordingly has no major impact on application benchmarks
such as NPB or Gromacs. If the application is executed under the memory access tools for
analysis, the execution time naturally increases by up to 30x; with the presented
MemPin the slowdown is only half as large. The methods prove successful in the sense
that unnecessarily communicated data is found in the heat conduction application
and in Gromacs; in the first case, this allows the communication time of the application
to be reduced by 12%.
Zusammenfassung
The Message Passing Interface (MPI) is a standardized, programming-language-independent
application interface for running parallel software on high-performance computers such
as clusters and supercomputers. For performance reasons, MPI implementations provide
only minimal checking functionality, for example checks for inconsistent and incorrect
use of parameters. Writing correct and portable MPI applications, however, requires far
more intensive checking to ensure that MPI functions are used correctly, as does the
development of complex scientific software projects that aim for the highest performance.
Although several solutions for interactive debugging of MPI-parallel processes exist,
there are numerous classes of errors that are difficult or even impossible to identify
with conventional debuggers. These include memory access conflicts and errors such as
overlapping accesses, which in the best case become visible as a so-called segmentation
fault, while the programmer receives hardly any useful information about the cause or,
better, about how to fix it. For MPI-parallel applications the problem is aggravated
considerably by the flexibility of MPI and by parallel processing. Traditional analysis
methods are hardly applicable for finding these errors. Currently there are no debuggers
that can help the programmer detect and fix memory problems in MPI codes. For this
purpose, a memory checking method was developed in this dissertation, two tools were
implemented for it, and their uses were investigated. This is realized by a specialized
framework that integrates several useful debugging technologies and tools and makes them
available to the user. The framework was implemented in Open MPI, one of the most widely
used open-source MPI implementations. With the proposed framework and tools, developers
can identify and fix numerous types of MPI errors, such as memory violations, buffer
overflows and inconsistent MPI parameters, which would otherwise be hardly detectable.
Furthermore, it is shown how the tools can contribute to improving communication
performance by identifying memory that is communicated but not used for the computation.
The memory access checking tools presented here can be used for MPI-parallel as well as
other kinds of applications, and can be integrated into other debuggers. The technology
lets programmers define the kind of checking themselves, i.e. they can define which
information about the memory they want to know and how the memory in the application is
to be checked and the result reported.
The world of high performance computing is dominated by Linux-based systems.
Nevertheless, Microsoft has also played an important role since the introduction of
Windows HPC Server 2008 R2. To improve programmability and portability for existing
users of Windows systems, we introduce a version of Open MPI for Windows.
Correspondingly, this work presents a memory access checking tool developed for Windows.
This dissertation consists of five main chapters: after an introduction and the state of
the art, the development of the Open MPI components for the Windows platform is
described, including the work on the InfiniBand network. Chapter four presents the
methods and possibilities for error analysis of memory accesses investigated here. In
addition, the two tools implemented for this work, based on Intel Pin and on Valgrind,
are described, together with their integration into the Open MPI library. In chapter
five the methods are evaluated with several benchmarks (NetPIPE, IMB, NPB) and analyzed
for their usefulness with real applications (a heat conduction application and the MD
package Gromacs). This shows that the instrumentation by the tool generates no
significant overhead (1.2% to 2.5% on the NetPIPE latency) and accordingly has no large
impact on application benchmarks such as NPB or Gromacs. If the application is run under
the memory access tools for analysis, the execution time naturally increases by up to
30x; with the MemPin tool presented here the slowdown is only half as large. The methods
prove successful in that unnecessarily communicated data is found in the heat conduction
application and in Gromacs, which in the first case allows the communication time of the
application to be reduced by 12%.
Acknowledgements
This dissertation would not have been possible without the support of many people.
Thanks to Professor Michael Resch of the High Performance Computing Center Stuttgart
for all the support necessary to accomplish this work. I also want to thank Professor
Ulrich Rüde, who gave me many invaluable comments and suggestions. Special and
great thanks to Professor Rainer Keller, who was my group leader and the supervisor of
this dissertation, for his persistent help, guidance, correction and encouragement.
I also want to thank Professor Jürgen Pleiss at the Institute of Technical Biochemistry
and two of his students, Sascha Rehm and Sven Benson, for their cooperation in the
project, especially for providing me with several benchmarks and helping me understand
them, so that I could run them directly with my frameworks.
Many thanks to all the colleagues at the High Performance Computing Center Stuttgart,
especially our work group. Dr. Colin W. Glass, Dr. José Gracia, and Christoph
Niethammer gave me many comments and suggestions in the PhD seminar presentations.
Thanks to Dr. Alexey Cheptsov, who helped me with the correction of the German
abstract. Thanks to Blasius Czink, who shared a lot of his technical knowledge with me.
Great thanks to the Open MPI community, especially Dr. Jeff Squyres. With their
knowledge and technology, I could integrate and realize my design and framework in the
Open MPI project.
Finally, I want to thank my family. Without their understanding and support, this
work would never have come true.

Contents
Abstract
Zusammenfassung
Acknowledgements
1 Introduction and Motivation
1.1 Computer Simulation
1.2 Parallel Computing
1.2.1 The von Neumann Architecture and Moore's Law
1.2.2 Flynn's Classical Taxonomy
1.2.3 Parallel Computer Memory Architectures
1.2.4 Parallelization Strategies
1.2.5 Parallel Programming Models
1.3 High Performance Computing
1.3.1 Supercomputers
1.3.2 High Performance Computing on Different Platforms
1.4 About this Dissertation
1.4.1 Motivation
1.4.2 Dissertation structure
2 State of the Art
2.1 Overview
2.2 MPI Implementations
2.2.1 MPIch
2.2.2 Microsoft MPI
2.2.3 Intel MPI
2.2.4 Open MPI
2.3 Parallel Debugging Tools
2.4 Valgrind
2.4.1 How Valgrind Works
2.4.2 Memcheck
2.4.3 Shadow Memory
2.5 Intel Pin
2.5.1 How Intel Pin Works
2.5.2 Pin Instrumentations
3 Open MPI for Windows
3.1 Overview
3.2 Integration with Windows
3.3 Multiple Node Support
3.3.1 Integration with WMI
3.3.2 Integration with CCP
3.4 High-Speed Network Support
3.4.1 Introduction to InfiniBand
3.4.2 BTL Implementations
3.5 A realization of a Windows Cluster
4 Semantic Memory Checking Frameworks
4.1 Overview
4.2 MPI Semantic Memory Checking
4.2.1 Pre-communication memory checks
4.2.2 Post-communication memory checks
4.2.3 Semantic MPI memory errors by code examples
4.3 Valgrind memory debugging framework
4.3.1 Valgrind extensions
4.3.2 Implementation and Integration with Valgrind
4.3.3 Implementation in Open MPI
4.4 Intel Pin tools debugging framework
4.4.1 MemPin Tool
4.4.2 Implementation and integration with MemPin
4.4.3 Implementation in Open MPI
5 Performance Implication and Real Use Cases
5.1 Overview
5.2 Performance Implication and Benchmarks
5.2.1 Intel MPI Benchmark
5.2.2 NAS Parallel Benchmark
5.2.3 NetPIPE
5.3 A 2D Heat Conduction Algorithm as a Use Case
5.4 MD Simulation as a Use Case
6 Conclusion
Glossary
Bibliography
Index
List of Figures
1.1 The von Neumann Architecture
1.2 Moore's Law on Intel Processors (modified based on Keller [23])
1.3 Non-Uniform Memory Access architecture
1.4 Uniform Memory Access architecture
1.5 Distributed memory architecture
2.1 Overview of the Open MPI abstraction layers
2.2 Overview of the component architecture in Open MPI
2.3 Screen shot of Allinea DDT debugger
2.4 Overview of program execution with a Valgrind tool
2.5 A/V bits addressing mechanism
2.6 Overview of program execution with an Intel Pin tool on Windows
3.1 WMI architecture
3.2 Work flow of the PLM component using WMI
3.3 Architecture of the job scheduler in Windows HPC Server
3.4 Work flow of the RAS component using CCP
3.5 Work flow of the PLM component using CCP
3.6 Example of IBA architecture
3.7 InfiniBand network layer abstraction
3.8 Example of IBA QP communication
3.9 The software stack of WinOF
3.10 Latency of openib and winverbs on Windows
3.11 Bandwidth of openib and winverbs on Windows
3.12 Latency of Microsoft MPI and Open MPI on Windows
3.13 Bandwidth of Microsoft MPI and Open MPI on Windows
4.1 Non-blocking buffer check
4.2 Non-blocking receive buffer usage check
4.3 Buffers storing access information of communicated data
4.4 Example of user application using Valgrind
4.5 Integration of Memcheck into Open MPI
4.6 Extended shadow memory on watch buffer
4.7 Run-time structure of MemPin
4.8 Example of user application using MemPin
4.9 Integration of MemPin into Open MPI
4.10 Shadow memory for pre-communication check with MemPin
4.11 Shadow memory for post-communication checks with MemPin
5.1 IMB benchmark Pingpong test on two nodes of BWGrid
5.2 IMB benchmark Bi-directional get and put on two nodes of Viscluster
5.3 NAS Parallel BT Benchmark performance
5.4 NetPIPE comparison over TCP (memchecker disabled)
5.5 NetPIPE comparison over Infiniband (memchecker disabled)
5.6 NetPIPE comparison over TCP (memchecker enabled)
5.7 NetPIPE comparison over Infiniband (memchecker enabled)
5.8 NetPIPE comparison over Infiniband (memchecker enabled and disabled)
5.9 An example of border update in domain decomposition
5.10 Transferred but unused data in example domain decompositions
5.11 Running the heat program with two processes and checked with memory checking
5.12 Comparison of the communication time between the original and modified Heat Conduction program on 4 nodes
5.13 Communication time comparison of the Heat Conduction program
5.14 Communication time and computation time comparison of the Heat Conduction program
5.15 Influence of Solvent on Lipases, lid open and closed states
5.16 Simulation of the lipase lid in different solvent
5.17 Screen shot of visualization using PyMOL
5.18 Simulation screen shot of lid open and closed state
5.19 Gromacs run on Windows with/without MemPin integration using shared memory
5.20 Gromacs run on Windows with/without MemPin integration over TCP connection
5.21 Gromacs run on Nehalem with/without Valgrind and MemPin integration using Shared Memory
5.22 Gromacs run on Nehalem with/without Valgrind and MemPin integration using InfiniBand
5.23 Gromacs run on Nehalem with Valgrind and MemPin supervision using Shared Memory
List of Tables
1.1 Flynn's Classical Taxonomy
2.1 Memcheck client requests in Valgrind 3.7.0
4.1 MemPin macros for user application
4.2 Intel Pin API for cache instrumentations

1 Introduction and Motivation
1.1 Computer Simulation
In many scientific fields, e.g. physics and chemistry, researchers who set out to solve
a problem first envision and build a model that describes the system. This process is
known as modeling. The traditional way of forming large models of systems in research
is to use mathematical models as the basis of the computer simulation process. Modeling
and simulation are used to predict or evaluate the behavior of the system from a set of
initial parameters and conditions.
Computer simulation is the technique of simulating an abstract model of a particular
system. Although many problems may be solved and analyzed by experiments in the
laboratory, computer simulation is irreplaceable for many reasons. It may be used
for problems that are unsolvable by traditional theoretical and experimental approaches,
e.g. the prediction of future climate, for cases that are too hazardous to study in the
laboratory, e.g. nuclear detonations, and for cases that are expensive or time-consuming
to solve by traditional means, e.g. car crashes or the development of new materials.
Modeling and simulation are the tunnel that connects theoretical conception and reality.
Computer simulation is now widely used in many areas, not only in science but also in
industry. In the health industry, it is used for blood flow simulation, which cannot be
studied by experiments on human beings. Computer simulation also plays a critical role
in the automobile industry, where it helps to minimize the output of pollutants and fuel
consumption and to optimize safety in car crashes.
1.2 Parallel Computing
The computing power of the average desktop has exploded in the past years. The
performance of a typical personal computer now exceeds that of a supercomputer of
a decade ago. However, this does not mean that performance is no longer an issue. There
is always a need for more processing power, even with the fastest computers.
[Figure: block diagram of Memory, a CPU containing a Control Unit and an Arithmetic
Logic Unit, Input and Output]
Figure 1.1: The von Neumann Architecture
Traditional software is mainly developed for serial computation running on a single
computer with a single Central Processing Unit (CPU). The problem it solves is
broken into a discrete series of instructions that are executed one after another, so
only one instruction may execute at any moment in time. Parallel computing distributes
the computing tasks onto several or even thousands of computers and processors, in order
to solve problems that could not be solved before or to decrease the computation time.
In parallel computing, a problem is broken into discrete parts that can be solved
simultaneously. Operations from each part are distributed to different computers and
CPUs. Today, parallel computing is widely used as an evolution of serial computing for
simulating the natural world and engineered systems, in areas such as galaxy formation,
planetary movement, weather prediction and automobile assembly lines.
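The idea of breaking a problem into discrete parts that are solved simultaneously can be sketched in a few lines. The following is only an illustration under stated assumptions: the function names `split` and `parallel_sum_of_squares` are hypothetical and do not come from this dissertation, and Python threads are used merely to show the structure (in standard CPython they do not speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split data into `parts` contiguous chunks of nearly equal size."""
    base, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks receive one additional element.
        end = start + base + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum_of_squares(data, workers=4):
    """Solve each chunk independently, then combine the partial results."""
    def partial(chunk):
        return sum(x * x for x in chunk)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is handed to a worker; the final sum is the serial step.
        return sum(pool.map(partial, split(data, workers)))
```

The serial counterpart would be a single loop over `data`; the parallel version differs only in that the parts are processed independently before the partial results are combined.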
The reasons for using parallel computing are not only to decrease the computing
time and to save resources, but also to overcome the physical limitations of a single
processor. It uses a form of simultaneous computation, e.g. Multiple Instruction
Multiple Data (MIMD) processing, and involves domain decomposition to partition the
work load. Parallel programs are more difficult to write than sequential ones, because
new classes of potential software bugs may arise in concurrent work flows, of which race
conditions and memory violations are the most common.
1.2.1 The von Neumann Architecture and Moore's Law
Nowadays, virtually all computers follow the basic design of the von Neumann
architecture, which defines a computer as comprising memory, a control unit, an
Arithmetic Logic Unit (ALU), and input/output devices, as shown in Figure 1.1. Program
instructions and data are stored in the Random Access Memory (RAM). They are fetched and
decoded by the CPU sequentially to accomplish the programmed task. Basic arithmetic
operations are processed by the ALU, and the communication between computer and
human takes place via input and output devices.
Moore's law describes the trend in the development of computing hardware: the number
of transistors that can be placed inexpensively on an integrated circuit
doubles approximately every two years [39]. After decades of development, the capacity
and speed of RAM and CPUs have increased significantly. A present-day personal computer
is as fast as a supercomputer of 30 years ago. For example, an Intel Pentium-III 550
Coppermine processor in 2000 provided a clock frequency of up to 550MHz, whereas an
Intel Core i7 990x processor (Gulftown, Westmere microarchitecture) runs at 3.46GHz.
[Figure: performance [MFlops] and clock frequency [MHz] plotted against year (1986 to
2011) for Intel processors from the 80386 to the Sandy Bridge Core i7, annotated with
architectural milestones: instruction pipeline, integrated cache, multiple instructions
per cycle, speculative out-of-order execution, MMX instructions, level-2 cache at full
clock speed, longer pipeline with double-clocked arithmetic, hyper-threading, multi-core]
Figure 1.2: Moore's Law on Intel Processors (modified based on Keller [23])
The clock frequency thus increased by a factor of approximately 6.3 (3.46GHz / 550MHz),
which is broadly in line with the growth trend that Moore's law describes, although the
law itself refers to transistor counts rather than clock frequency. Figure 1.2
is a Moore's Law microprocessor chart of Intel processors over the past 25 years, which
shows that after decades Moore's Law still holds. However, apart from the
law, one bottleneck of the von Neumann architecture becomes more important: the
throughput between the CPU and memory is much smaller than the rate at which the CPU
can work. Caches are used to alleviate this problem; another solution is parallel
computing.
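The arithmetic behind such comparisons is easy to reproduce. The sketch below computes the ratio of the two clock frequencies quoted in the text and, separately, the growth factor that a fixed two-year doubling period would predict for transistor counts. The function names and the ten-year interval are assumptions chosen only for illustration.

```python
def clock_ratio(old_mhz: float, new_mhz: float) -> float:
    """Ratio of two clock frequencies given in MHz."""
    return new_mhz / old_mhz


def moore_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Transistor-count growth factor predicted by a fixed doubling period."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Frequencies quoted in the text: 550 MHz versus 3.46 GHz.
    print(f"clock ratio: {clock_ratio(550.0, 3460.0):.2f}")
    # Assumption: a 10-year interval, used here only as an example.
    print(f"predicted transistor growth over 10 years: {moore_factor(10.0):.0f}x")
```

Note that the two quantities measure different things: Moore's law is stated for transistor counts, so the clock ratio is at most an indirect indicator of the same trend.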
1.2.2 Flynn's Classical Taxonomy
There are many ways to classify parallel computers. One popular and widely used
method is Flynn's Taxonomy, which was introduced by Michael Flynn in the mid-1960s.
Parallel computers are classified along the two independent dimensions of instruction
and data, according to whether operations are under centralized control by a single
control unit or work independently [16].
Flynn's Taxonomy defines four classifications, as shown in Table 1.1. A Single
Instruction Single Data (SISD) system is normally a serial computer, which processes a
single instruction stream on a single data stream. This is the type of a conventional
single-core computer. Single Instruction Multiple Data (SIMD) is a type of parallel
processor that applies a single instruction to multiple data elements. Each processing
unit can operate on different data elements. An example of SIMD is the Connection
Machine, a series of supercomputers that grew out of Danny Hillis' research in the early
1980s at MIT on alternatives to the traditional von Neumann architecture of
computation [18].
SISD                                SIMD
Single Instruction Single Data      Single Instruction Multiple Data
MISD                                MIMD
Multiple Instruction Single Data    Multiple Instruction Multiple Data
Table 1.1: Flynn's Classical Taxonomy
Multiple Instruction Single Data (MISD) refers to systems in which multiple processing
units operate on the same data independently via independent instruction streams.
Only a few actual MISD systems exist. In MIMD systems,
every processor may execute a different instruction stream and work with a different data
stream at the same time. The execution can be synchronous or asynchronous,
deterministic or non-deterministic. Most modern supercomputers employ cluster
architectures with multi-processor Symmetric Multiprocessing (SMP) compute nodes,
which may be classified as MIMD.
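The four classes follow mechanically from the two dimensions of the taxonomy. As a minimal sketch, the hypothetical helper `flynn_class` below (not part of this dissertation) derives the class name from the number of concurrent instruction streams and data streams.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class for the given numbers of concurrent streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    # "S" for a single stream, "M" for multiple streams, per dimension.
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"
```

For example, a cluster in which every node runs its own instruction stream on its own data corresponds to `flynn_class(n, n)` with `n > 1`, which yields the MIMD class named in the text.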
1.2.3 Parallel Computer Memory Architectures
Another major way to classify parallel computers is based on the memory architecture.
In shared memory parallel computers, generally all processors may access all memory
through a global address space. The same memory resources are shared and operated on
by every processor independently. Changes to a memory location made by one
processor are visible to all other processors. More specifically, shared memory
computers can be divided into Uniform Memory Access (UMA) and Non-Uniform Memory
Access (NUMA) architectures. In the UMA architecture, each processor may use a private
cache but shares the physical memory uniformly (Figure 1.4). Most SMP computers use
the UMA architecture. In the NUMA architecture, accessing local memory is faster than
accessing remote memory (Figure 1.3). Data access in shared memory architectures is fast
due to the proximity of memory to the CPUs, but adding more CPUs increases the traffic
on the path between CPUs and memory. Moreover, synchronizing memory accesses correctly
is difficult for the programmer.
[Figure: groups of CPUs, each group attached to its own memory through a memory bus,
connected by a distributed shared memory network with directory]
Figure 1.3: Non-Uniform Memory Access architecture
[Figure: several CPUs attached to one memory through a single memory bus]
Figure 1.4: Uniform Memory Access architecture
[Figure: CPUs, each with its own memory, connected by a high speed interconnection]
Figure 1.5: Distributed memory architecture
A more scalable form of parallel computer memory architecture is the distributed
memory architecture (see Figure 1.5), which is a multi-processor computer system. Each
processor has its own private memory, and computational tasks operate only on local
data. Therefore, a communication network among the processors is required for
transmitting data to remote memory. There is no global memory address space across
all processors, as memory addresses in one processor do not map to any other processor.
Changes made to the local memory of one processor do not affect memory locations on
other processors. The programmer has to define explicitly how and when data is
communicated if one processor needs to access data in another processor's private
memory. In this architecture, the total size of memory scales with the number of
processors. Every processor can access its private memory quickly, without the overhead
of maintaining cache coherency. However, the programmer needs to take more care to map
existing data structures that were designed for global memory, and to manage the data
communication between processors.
As both shared and distributed memory architectures have their benefits and drawbacks,
a hybrid memory architecture is designed to combine their advantages. The memory is
physically separated across several SMP nodes but logically shared as one address space.
Processors on a given SMP node can address all memory as global, but communication may
still be needed for accessing remote data. Most of the largest and fastest computers
today employ the hybrid memory architecture.
1.2.4 Parallelization Strategies
In order to design a parallel program, the speciﬁc problem has to be partitioned into
discrete parts, which can be distributed and operated simultaneously. This process is
known as decomposition or partitioning. By means of decomposition, the main computation
is divided into programmer-defined tasks. The amount of work per task
may be arbitrary, and different problem sizes per task may lead to load
imbalance.
There are mainly three methods for decomposing a problem into smaller tasks to
be performed in parallel: functional decomposition, domain decomposition, or a
combination of both. Functional decomposition separates the problem into different tasks
that can be distributed to multiple processors for simultaneous execution. It focuses
on decomposing the computation that is to be performed. Functional decomposition is
frequently used when there is no static structure or predetermined number of
calculations to be performed. Domain decomposition divides the data domain of the problem
and distributes portions of it to multiple processors for simultaneous execution. The data
partitioning may use different methods to decompose program data, e. g., cyclic
distribution is normally used for one-dimensional data structures and block-cyclic
distribution for two-dimensional ones. In large scientific problems, like multi-body or
fluid vortex problems, the combination of these two types of decomposition is commonly used.
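The difference between a contiguous (block) and a cyclic distribution of a one-dimensional data domain can be illustrated with a minimal Python sketch. The function names and the rule that the first processes receive the surplus items are assumptions made for this illustration and are not taken from any particular library.

```python
def block_distribution(n_items, n_procs):
    """Give every process one contiguous index range.

    Assumption for this sketch: if the items do not divide evenly,
    the lowest ranks each take one extra item.
    """
    base, extra = divmod(n_items, n_procs)
    parts, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        parts.append(list(range(start, start + size)))
        start += size
    return parts


def cyclic_distribution(n_items, n_procs):
    """Deal the indices out one at a time: item i goes to rank i % n_procs."""
    return [list(range(rank, n_items, n_procs)) for rank in range(n_procs)]


if __name__ == "__main__":
    for rank, part in enumerate(block_distribution(10, 3)):
        print("block  rank", rank, part)
    for rank, part in enumerate(cyclic_distribution(10, 3)):
        print("cyclic rank", rank, part)
```

A cyclic layout tends to balance the load when the work per item varies along the index range, whereas a block layout keeps neighbouring items on the same process, which usually reduces communication for stencil-like computations.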
1.2.5 Parallel Programming Models
A parallel programming model is an abstraction above hardware and memory architec-
tures, and most often is not speciﬁc to a particular type of machine or memory archi-
tecture. The most commonly used parallel programming models are shared memory,
thread, data parallel, and message passing. Using any of these models has its beneﬁts,
but the actual use is normally a combination of several models.
Shared Memory Model
In the shared memory model, a common memory address space is shared among the exe-
cuting entities. The common memory address may be read and written asynchronously.
Various synchronization mechanisms, such as locks and semaphores may be used to con-
trol access to the shared memory. Programmers do not need to specify explicitly the
communication of data between tasks, so that the program development is simpliﬁed.
However, managing local data is diﬃcult, due to cache refreshing and bus traﬃc that
occurs when multiple processors use the same or nearby data within a cache-line.
Threads Model
The threads parallel programming model uses a single process with multiple concurrent
execution paths. The main program is scheduled to run by the native operating system.
It loads and acquires all of the necessary system and user resources to run. Then the
main program performs some serial work, and creates a number of worker threads that
can be scheduled and run by the operating system concurrently. Each thread has local
data, but shares the entire resources of the main program such as memory, ﬁles and
signals. This saves the overhead associated with replicating a program's resources for
each thread. Each thread also beneﬁts from a global memory view because it shares
the memory space of the main program. A thread's work may best be described as a
subroutine within the main program. Any thread can execute any subroutine at the same
time as other threads. Threads communicate with each other through global memory.
This requires synchronization constructs to ensure that not more than one thread is
updating the same global address at any time. Threads can be terminated or created,
but the main program remains present to provide the necessary shared resources until
the application has completed. The worker threads are commonly associated with shared
memory architectures and operating systems.
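The need for a synchronization construct when several threads update the same global address can be sketched as follows. This is only an illustration using Python's standard `threading` module; the function name `run_workers` and the shared counter are hypothetical and chosen for this example.

```python
import threading


def run_workers(n_threads, increments):
    """Let several worker threads update one shared counter under a lock."""
    counter = {"value": 0}      # shared data, visible to every thread
    lock = threading.Lock()     # ensures only one thread updates at a time

    def worker():
        for _ in range(increments):
            with lock:          # critical section: read, modify, write
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # the main program outlives its workers
    return counter["value"]


if __name__ == "__main__":
    print(run_workers(4, 1000))
```

Without the lock, the read-modify-write sequence of two threads may interleave, so that some updates are lost. The lock serializes exactly the part of the work that touches shared data.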
Data Parallel Model
The data parallel programming model is a mechanism that uses data partitioning for
parallelism. The data set that the parallel application works on is typically organized
into a common structure such as an array. Every task in the worker set operates on a different
partition of the same data structure. On shared memory architectures, all worker tasks
may have access to the data structure through global memory. On distributed memory
architectures, the data structure is divided into partitions that reside in the local
memory of each task. The data parallel construction is normally done by calls to a data
parallel subroutine library or by compiler directives, e. g., High Performance Fortran (HPF)
supports data parallel programming. A more recent example of data parallel programming and
execution is general-purpose computing on graphics processing units (GPGPU), where
tremendous parallelism may be achieved by employing hundreds of simple cores.
Message Passing Model
In contrast to the parallel programming model OpenMP [6], which targets shared-memory
architectures and is mainly compiler-based, in the message passing
model the programmer makes calls to libraries to explicitly share information between
processors. A set of tasks or processes is defined, each using its own local memory during
computation. Multiple processes can reside on the same machine or across an arbitrary
number of machines. Tasks exchange data by sending and
receiving messages over a network connection. Data transfer usually requires each
process to perform cooperative operations. For example, a send operation must have a
matching receive operation.
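The cooperative nature of a transfer, where every send is paired with a matching receive, can be mimicked in a small Python sketch. It does not use MPI: two threads stand in for two processes, a `queue.Queue` stands in for the network link, and the function name `exchange` and the end-of-stream marker are assumptions of this illustration.

```python
import queue
import threading


def exchange(data):
    """Send each item over a channel; the receiver doubles what it receives."""
    channel = queue.Queue()     # stand-in for the network connection
    received = []

    def sender():
        local = list(data)      # data held only by the sending task
        for item in local:
            channel.put(item)   # "send"
        channel.put(None)       # assumed end-of-stream marker

    def receiver():
        while True:
            item = channel.get()    # matching "receive", blocks until a message arrives
            if item is None:
                break
            received.append(item * 2)

    tasks = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return received


if __name__ == "__main__":
    print(exchange([1, 2, 3]))
```

If the receiver never called `get`, the messages would remain in the channel undelivered, and if the sender never called `put`, the receiver would block forever. Both situations correspond to unmatched communication operations in a message passing program.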
The Message Passing Interface (MPI) is the most important message passing programming
model. Its libraries provide hundreds of function call interfaces, which can be called
directly from Fortran, C and C++. It is portable, powerful and highly efficient, and there
are many practical, efficient and free implementations. Almost all parallel
computer vendors support it.
The original idea of MPI was born in 1992, and the standard was first published in 1994. In
June of the following year, a revision, MPI-1.1 [28], was released to complete and
extend the previous standard. Version 1.2 [30] is an extension to 1.1 that contains
clarifications and corrections. As the standard became widely accepted, extending and
improving its functionality became more important. In July of 1997, based on MPI-1.1
and MPI-1.2 [30], the MPI extension MPI-2 [29] was published. This new version focuses
on process creation and management, one-sided communications, extended collective
communications, external interfaces and parallel I/O. The latest version of MPI is 2.2 [32],
released in September of 2009, which provides additional clarifications and errata
corrections as well as a few enhancements.
The MPI standard was created relatively late. However, as it absorbed the advantages
of a variety of other parallel environments and took into account performance,
functionality, portability and other characteristics, within just a few years it became
the most popular standard for message passing parallel programming, which
also shows the vitality and superiority of MPI. There are many MPI implementations
for multiple platforms, such as Open MPI, MPIch, Intel MPI and Microsoft MPI. A
more detailed description and discussion of these MPI implementations is given in
Section 2.2.
1.3 High Performance Computing
High Performance Computing (HPC) is the use of parallel processing for running advanced
applications and programs efficiently, reliably and quickly. It has enabled tremendous
achievements in science and in military technology. Without HPC systems
that deliver more than a teraflops of computational power, research on the human genome,
accurate prediction of the global climate, ocean circulation and so on would not be
accomplished easily.
1.3.1 Supercomputers
The term supercomputer denotes the more powerful subset of high performance computers,
and supercomputing is accordingly a subset of HPC. The first supercomputer, the CDC
1604, was designed and introduced in the 1960s by Seymour Cray, who is considered to
be the father of the supercomputer. Later, in the 1970s, Cray founded his own company,
Cray Research, and designed the 100 megaflops CRAY-1 computer in 1976 and the 1-2
gigaflops CRAY-2 computer system in 1985, which held the top spot in supercomputing
for many years. In the 1990s, many other fast and powerful supercomputers appeared,
e. g., the GRAPE-4 [25] with 1,692 processors from the University of Tokyo, which broke the
one teraflops barrier. As the performance of microprocessors increased and they became
widely used in parallel SMP-based servers, supercomputers moved from specialized
processors to more common processors. An example of this shift was the Intel
Paragon machine, which was based on the newly developed i860 RISC processor, a chip
that was commonly used in laser printers and embedded devices, and whose technology
was integrated in Intel's i386-based product line. As the number of processors increased
dramatically in supercomputer systems, parallel processing attained a more important
role. This also gave the supercomputer industry the chance to use relatively inexpensive
third-party processors, such as the processors that were developed for personal
computers or workstations.
The Top 500 project compiles statistics on the 500 fastest high performance computers
available worldwide. The ranking is based on the results of the High
Performance LINPACK (HPL) Benchmark [10], which are contributed by the computer sites.
The Top 500 statistics show that since 1993, performance has grown
faster than predicted by Moore's law [38], doubling roughly every 14 months.
As of June 2012, the fastest supercomputer is Sequoia, the IBM BlueGene/Q system
installed at Lawrence Livermore National Laboratory, with 16.32 petaflops.
The Japanese K-Computer 1 [56], which was ranked first in the list of November 2011, has
also reached a performance of 8 petaflops.
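To give a feeling for what a doubling period of roughly 14 months means, the accumulated growth factor can be computed with a few lines of Python. The 14-month period comes from the text above. The 24-month period used for comparison and the choice of June 1993 to June 2012 as the interval are assumptions made for this illustration only.

```python
def growth_factor(months, doubling_period_months):
    """Factor by which a quantity grows if it doubles every given period."""
    return 2.0 ** (months / doubling_period_months)


# Assumed interval for this sketch: June 1993 to June 2012.
MONTHS = (2012 - 1993) * 12

top500_growth = growth_factor(MONTHS, 14)   # doubling period stated in the text
reference_growth = growth_factor(MONTHS, 24)   # assumed two-year doubling for comparison

if __name__ == "__main__":
    print("months:", MONTHS)
    print("growth at 14-month doubling: %.3g" % top500_growth)
    print("growth at 24-month doubling: %.3g" % reference_growth)
```

Because the period appears in the exponent, even a modest difference in the doubling time leads to growth factors that differ by orders of magnitude over two decades.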
1.3.2 High Performance Computing on Diﬀerent Platforms
Microsoft Windows is the most widely used operating system for
desktop computers and is pre-installed on almost all new personal
computers, while Linux dominates parallel computing. Windows also
has the most hardware and software support from manufacturers, whereas installing
and configuring an old driver on Linux can be rather difficult, especially for beginners.
This does not mean that Linux is not a good operating system; which one to use is
normally a matter of personal preference. People like Linux because it offers higher
configurability, higher security, and most importantly because it is free. In practice,
the choice depends on whether people prefer to spend more money or more time. Windows is
not cheap, although it is pre-installed on new personal computers. On the other hand,
getting used to Linux may be time consuming, e. g., someone with no Linux experience might
need months or years to become proficient with it, or might even lose interest in it
within the first few days. The answer to which operating system is better suited for a
specific task will never be obvious, and it is mostly a matter of personal preference.
For parallel computing, and especially for high performance computing, Linux is the
dominating operating system, but the emergence of the Windows HPC operating systems also
gives people the opportunity to run their parallel applications on Windows clusters.
Windows Compute Cluster Server 2003, released in 2006, was the first Windows cluster
operating system. It provides Microsoft MPI and the Cluster Manager Graphical User
Interface (GUI). Meanwhile, Windows drivers for high speed interconnects are also
available from many manufacturers, e. g., OpenFabrics Enterprise Distribution [44] and
Myricom [41] provide their own hardware drivers and Application Programming
Interfaces (APIs). In recent Top 500 lists of supercomputers, there have already
been several clusters installed with the Microsoft HPC operating system and
integrated with high speed network connections. The overall performance of Windows
clusters has increased rapidly over the last five years.
Microsoft Windows has only recently targeted cluster computing as a market, so only a
few Windows HPC based clusters are in the Top 500 [57] list. But the potential of
Windows HPC must not be underestimated. In the Top 500 list of November
2001, the first Windows cluster listed ran Windows 2000 with Gigabit Ethernet.
In November 2008, the 10th fastest system, the Chinese Dawning 5000A at the
Shanghai Supercomputer Center, became the largest cluster with the Windows HPC 2008
operating system installed at that time.
1 This computer uses an adapted Open MPI implementation.
With regard to parallel computing in non-HPC environments, Windows could play a
more important role. Based on the statistical information provided by w3counter from
44,281 web sites [63], Windows has a share of more than 80% of all the sampled operating
systems. Because of the high requirements for using an HPC cluster, an alternative
solution could be to build a cluster based on Windows as well.
The most common users of parallel computing and HPC systems are scientific
researchers, engineers and academic institutions, and there are also some government
agencies, particularly the military, that rely on HPC for complex applications. They might
have basic knowledge of the operating system and the hardware, but learning how to
configure and use a complex and user-unfriendly system can be time consuming, e. g.,
weeks or months may be spent on constructing a cluster based on Transmission Control
Protocol (TCP) connections, and issues may arise from not knowing how the system
manages hardware.
MPI is a language-independent application interface, which provides a standard for
communication among the processes of programs on parallel computers, clusters, and
heterogeneous networks. It is the dominant model for parallel programming and
HPC today. Currently, there are many MPI implementations available on both Linux
and Windows providing different features. However, the current MPI implementations
on Windows are either not open source or not user configurable. Open MPI is
one of the widely used MPI implementations, and the PACX-MPI
team at the University of Stuttgart is among its contributors. It combines technologies
and resources from several other projects in order to build the best MPI library and to
offer advantages for system and software vendors, application developers and scientific
researchers. Part of the work of this dissertation has been done in Open MPI, to help MPI
programmers with memory checks and to support Open MPI on Windows platforms, including
the Windows HPC environment.
1.4 About this Dissertation
1.4.1 Motivation
Parallel programming with the distributed memory paradigm using MPI is often considered
an error-prone process. Great effort has been put into parallelizing libraries
and applications using MPI. However, when it comes to maintaining the software,
optimizing for new hardware or even porting the code to other platforms and other
MPI implementations, the developers may experience difficulties in fixing errors due to
implementation-defined behavior, hard-to-track timing-critical bugs or deadlocks due to
communication characteristics of the MPI implementation, or even hardware-dependent
behavior. A large class of hard-to-track bugs are memory errors, such as memory leaks,
segmentation faults and memory accesses that violate the MPI standard.
Another class of difficulties in parallel programming arises at the system
level. Learning to use a complex system is hard and time consuming. Students and
researchers outside computer science normally do not have a strong background in
system architecture or operating systems, which makes it troublesome to learn a
system like Linux. In contrast, Windows is considered to be user friendly and easy
to learn. Moreover, when writing and testing a parallel program, a proper
Integrated Development Environment (IDE) is important and makes the programming
process easier and faster. It not only helps programmers check the correctness of the
syntax, but also helps to test and debug the program at run time. There are many IDEs and
debuggers for C, C++ and Fortran on Windows platforms, e. g., Visual Studio 2008 [61]
is a powerful IDE for program editing and for local and remote debugging. Working on an
easy-to-handle platform frees people from system level issues so that they can
concentrate on their research.
The primary motivation of this work is to provide memory debugging features for
Open MPI, in order to help developers and programmers check memory problems in
MPI applications and also in Open MPI itself. This includes implementing and
integrating memory debugging tools into Open MPI. The other goal is to make Open MPI
available on Windows platforms, in order to simplify the process of creating, executing
and managing parallel applications. This includes porting Open MPI to
Windows, implementing features to better support it and developing a novel memory
debugging tool with the Intel Pin tools. Finally, we show that the new memory
debugging tools are useful for detecting memory problems in real applications.
1.4.2 Dissertation structure
In this dissertation, the above-mentioned problems are tackled separately, and the
thesis is structured as follows:
Chapter 2 first introduces the current state of the art of MPI implementations
and of the debugging tools that are available on different platforms. A short history of
the MPI standard is given, and several MPI implementations are discussed and
compared on both Linux and Windows with regard to their features and capabilities.
Then the two debugging tools that are integrated for memory checking in this work
are introduced thoroughly with regard to their functionality, their internal
implementation, how they instrument user applications, the shadow information they use
for tracking memory states, and their performance implications.
Chapter 3 describes the efforts that have been made to support Open MPI on
Windows platforms. The first part of this chapter introduces the build system, which
automatically generates build solutions in different system environments,
such as native Windows and GNU for Windows environments. In the second
part, the integration of Open MPI with Windows HPC and non-HPC environments is
discussed. To accomplish this integration, two new components were developed in
Open MPI based on the API provided by Windows HPC Server. The next part gives
a detailed description of the high speed network support via different drivers for Open
MPI on Windows, which aims to provide high speed, high bandwidth communication for
parallel applications. Finally, in the last part, the performance results and improvements
are shown.
Chapter 4 mainly focuses on the development of the novel memory checking frameworks
for Open MPI, using the tools that have been introduced in Chapter 2, and discusses
the memory problems in parallel programs that are detected with these newly developed
frameworks. It first gives an overview of the popular basic and parallel debuggers and of
the kinds of error checks they can be used for. The next section explains the necessary MPI
semantic memory checking in different communication models, then defines the errors
and gives example code for the different error classes in parallel programs. The last two
sections introduce how the tools were implemented on Linux and Windows respectively,
including how they were integrated into Open MPI and to what extent they may help to
debug or improve the performance of parallel applications.
In Chapter 5, several benchmarks and applications will be introduced, and the perfor-
mance comparison among diﬀerent MPI implementations on both Linux and Windows
will be discussed. The performance implication of using the memory debugging frame-
works will be analyzed.
Chapter 6 concludes the entire dissertation.

2 State of the Art
2.1 Overview
MPI is the most widely used parallel programming model for applications running on
clusters and supercomputers. It is a language-independent application interface standard
that allows processes to communicate by sending and receiving messages. The
goals of MPI are scalability, portability, and high performance. The standard provides
several concepts, such as communicators and derived datatypes, for programmers to build
parallel applications. It also defines different communication models: point-to-point,
collective, and one-sided communication, the last of which was introduced with the
MPI-2 series.
There are several versions of the MPI standard: MPI-1 [27], published in 1994, and
MPI-2 [29], published in 1997, each with different incarnations and corrections. MPI-2
added several functionalities such as one-sided communication, parallel file I/O and
dynamic process management. Since the MPI Forum resumed its work in 2008, the latest
version of MPI-1 is MPI-1.3 [31], defined in 2008, while MPI-2 is available as
MPI-2.2 [32]. MPI-2.2 defines over 500 functions and provides language bindings for C,
C++ and Fortran; the C++ bindings, however, are marked as deprecated.
Currently, there are many implementations for MPI, such as Intel MPI, Open MPI,
MPIch and Microsoft MPI. In this chapter, these primary MPI implementations will be
brieﬂy introduced and their features will be discussed and compared. Then we introduce
the memory debugging tools and libraries that are used for the debugging framework
implementation.
2.2 MPI Implementations
2.2.1 MPIch
One of the best-known MPI implementations is MPIch, which was developed during the
MPI standardization process. The CH comes from Chameleon, the portability layer used
in the original MPIch to provide portability to the existing message passing systems. MPIch
was one of the implementations that provided feedback to the MPI Forum on implementation
and usability issues. It is a high performance and widely portable MPI implementation
supporting the MPI-1 and MPI-2 series of standards. It runs in multiple communication
environments, such as commodity clusters including desktop systems, shared memory systems
and multicore architectures, and it supports high speed networks and proprietary high-end
computing systems. The initial goal of the project was to replace the proprietary
message passing systems on the massively parallel computers of that time, such as the
Intel Paragon [11], IBM Scalable POWERparallel (IBM SP) [19], and Connection Machine
5 (CM5) [59]. The original implementation of MPIch is called MPIch1 and
implements the MPI-1.1 standard. The latest implementation is called MPIch2 and
implements the MPI-2.2 standard 1.
In order to manage processes on multiple platforms, MPIch provides several process
managers that implement the so-called Process Management Interface (PMI), such as Hydra,
remshell, Gforker and SMPD. Hydra is the default process manager in MPIch. It is designed
to work natively with multiple launchers and resource managers such as Secure Shell (SSH),
Remote Shell (RSH), Portable Batch System (PBS) and SLURM [22], in order to provide
run-time process management through different launchers. In the latest MPIch2, Hydra has
gained several capabilities that were missing in previous versions, such as binding
processes automatically based on a round-robin mechanism or by command line specification
via arguments or a hostfile. The remshell process manager provides a very simple launcher
that uses SSH to start processes on a collection of machines. Gforker is a process
manager for starting processes on a single machine, while SMPD is an alternative process
manager that runs on both Unix and Windows.
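The idea of binding processes in a round-robin fashion can be illustrated independently of any process manager. The following Python sketch is not Hydra's implementation; the function name `round_robin_binding`, the host names and the representation of a hostfile as a list of (host, core count) pairs are assumptions made for this example.

```python
def round_robin_binding(n_ranks, hosts):
    """Map ranks onto (host, core) slots, wrapping around when slots run out.

    hosts is an assumed hostfile-like list of (hostname, n_cores) pairs.
    Returns a list of (rank, hostname, core) tuples.
    """
    slots = [(name, core) for name, n_cores in hosts for core in range(n_cores)]
    return [(rank, *slots[rank % len(slots)]) for rank in range(n_ranks)]


if __name__ == "__main__":
    for binding in round_robin_binding(5, [("node0", 2), ("node1", 2)]):
        print(binding)
```

In this sketch the fifth process wraps around to the first slot again, which corresponds to oversubscribing a core when more processes are started than slots are available.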
2.2.2 Microsoft MPI
Microsoft MPI is a portable, flexible, and platform-independent implementation of the
MPI-2 specification. It is the default MPI implementation for Windows HPC Server
2003 and 2008. Microsoft MPI is based on, and designed for maximum
compatibility with, the reference MPIch2 implementation from Argonne National Laboratory.
The exceptions to that compatibility all concern the job launch and job management
functionality. These deviations from the MPIch2 implementation were necessary to meet
the strict security requirements of Windows Compute Cluster environments.
Microsoft MPI contains more than 160 APIs, including bindings that support the
C, C++, Fortran 77, and Fortran 90 programming languages. Microsoft Visual Studio
2005, 2008 and 2010 provide a remote debugger feature that works with this MPI
implementation. Users can start their MPI applications on multiple compute nodes from
within the Visual Studio environment. Visual Studio then automatically connects to the
processes on each node, so the developer can individually pause and examine program
variables on each node.
1 The name MPIch in this dissertation refers only to the latest MPIch2.
Microsoft MPI provides a job scheduler on Microsoft HPC Server 2008 and Windows
Compute Cluster Server 2003, which helps the user to submit jobs to the compute
nodes. The job scheduling may be done with the cluster monitor console, the command
line or scripts. The job scheduler manages the resources that are required by jobs. It
works primarily on a first-come, first-served basis, with backfill capabilities. For example,
if a job that requires more nodes than are currently available is waiting in the queue,
a job that requires fewer nodes might be sent to the cluster first. The job scheduler
supports MPI jobs that run in a shared environment, where the jobs are governed by resource
allocation policies that are specified by the cluster administrator. It enables node access
control that prevents unauthorized jobs from using restricted nodes. It also guarantees
fail-safe execution, i. e. when some nodes fail, a job will be assigned only healthy nodes.
Finally, the job scheduler ensures reliable termination of MPI processes that run
on nodes, thus preventing runaway processes from using resources that are needed by
the next job in the queue.
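The first-come, first-served policy with backfill described above can be sketched as a single scheduling pass over the job queue. This is a simplified illustration and not the algorithm of the Microsoft job scheduler: the function name `schedule_pass`, the job tuples and the absence of run-time estimates or reservations for the waiting job are assumptions of this sketch.

```python
def schedule_pass(job_queue, free_nodes, backfill=True):
    """Start jobs from the queue in arrival order.

    job_queue is a list of (name, nodes_needed) in arrival order.
    Without backfill, the first job that does not fit blocks all later jobs.
    With backfill, later jobs that fit into the free nodes may start first.
    Returns (started job names, jobs still waiting, remaining free nodes).
    """
    started, waiting = [], []
    blocked = False
    for name, needed in job_queue:
        if not blocked and needed <= free_nodes:
            started.append(name)
            free_nodes -= needed
        else:
            waiting.append((name, needed))
            if not backfill:
                blocked = True   # strict FCFS: nobody may overtake
    return started, waiting, free_nodes


if __name__ == "__main__":
    jobs = [("A", 8), ("B", 2), ("C", 4), ("D", 1)]
    print(schedule_pass(jobs, free_nodes=4))
    print(schedule_pass(jobs, free_nodes=4, backfill=False))
```

With four free nodes, job A has to wait in both cases, but with backfill the smaller jobs B and D can already use nodes that would otherwise stay idle. A production scheduler would additionally make sure that backfilled jobs do not delay the start of the waiting job.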
The Compute Cluster Pack (CCP) is an additional package for Microsoft Compute
Cluster Server 2003, Microsoft HPC Server 2008 and later server operating systems. It
provides secure, scalable cluster resource management, a job scheduler, and command line
APIs with C, C++, .NET and C# language bindings. In order to run an application that
uses this API, the computer must have the CCP installed 2.
The Microsoft MPI uses the Microsoft WinSock Direct and Network Direct protocols
for maximum compatibility and CPU eﬃciency. It can use any Ethernet interconnect
that is supported by the system, as well as interconnects like InﬁniBand or Myrinet.
The Windows HPC environment supports the use of any network interconnect that has a
WinSock Direct or Network Direct provider. Gigabit Ethernet provides a high-speed
and cost-eﬀective interconnect fabric, while InﬁniBand and Myrinet are ideal for latency-
sensitive and high-bandwidth applications. The WinSock Direct protocol bypasses the
TCP/IP stack, using Remote Direct Memory Access (RDMA) on supported hardware
to improve performance and reduce CPU overhead. These will be discussed more in
Section 3.4.
2.2.3 Intel MPI
Intel MPI, another derivative of MPIch, is a high performance implementation of the MPI-2
specification on multiple fabrics. It focuses on making applications perform better on
Intel-based clusters and makes it possible to deliver maximum performance quickly even if
the interconnect changes, without requiring major changes to the software or to the
operating environment. The latest Intel MPI is available for both Linux and Windows with C,
C++, Fortran 77 and Fortran 90 language bindings. On Linux it provides a free run-time
environment for installation and redistribution, while on Windows it provides the Software
Development Kit (SDK) including compilation tools, interface libraries, debug libraries,
trace libraries, include files and modules, and test codes.
A so-called Direct Access Programming Library (DAPL) methodology is implemented
to support various network protocols such as TCP, shared memory, or RDMA networks
like InfiniBand and Myrinet. The system chooses the most efficient network module
automatically, or the user may specify the network interface with options. Moreover, the
Intel MPI library provides new levels of performance and flexibility for applications
through improved interconnect support for the network interfaces, faster on-node
messaging and an application tuning capability that adjusts to the cluster architecture
and application structure.
Many job schedulers are integrated with the Intel MPI library, for example PBS, Torque,
and SLURM, which are intended for the Linux platform. The remote jobs started by a
job scheduler are then handled by the process manager. As in
MPIch, the process manager runs as a service on each compute node and
accepts remote requests in order to launch local tasks.
2 For the latest Windows HPC Server 2008 and R2 SDK, the CCP is installed by default.
Figure 2.1: Overview of the Open MPI abstraction layers (OMPI, the MPI layer; ORTE, the Open Runtime Environment; OPAL, the Open Portable Access Layer; operating system)
2.2.4 Open MPI
Open MPI [45] is an open source implementation of MPI-1 and MPI-2 for multiple platforms,
including Linux and now also Windows, in both 32 and 64 bit. It is developed and
maintained by a consortium of academic, research, and industry partners. Currently
there are 13 Members, 2 Partners, and 14 Contributors in total. The project goal is to
combine the expertise, technologies, and resources from all across the High Performance
Computing community in order to build the best MPI library available. It offers
advantages for system and software vendors, application developers and computer science
researchers [14].
The Open MPI software stack consists of three abstraction layers, as shown in Figure 2.1.
The topmost layer, OMPI, contains the MPI API and supporting modules. The Open
RTE (Run-Time Environment) layer, ORTE, is the basis for launching, monitoring and properly
terminating Open MPI jobs, with support for different back-end run-time systems. The
Open PAL (Portable Access Layer), OPAL, provides utilities for interacting with the
operating system, for example memory management, and also "glue" code used by OMPI and
ORTE, for example classes and data types.
The project uses the so-called Modular Component Architecture (MCA) as the foundation
of the entire Open MPI project. It provides all the component architecture services that
the rest of the system uses. There are three basic elements in MCA: framework, component
and module [15], as shown in Figure 2.2. An MCA framework is a construct that is created
for a single, specific purpose; e.g. the Byte Transfer Layer (BTL) framework is responsible
for sending and receiving data on different network connections. A framework not only
provides a public interface that is used by external code, but also has its own internal
services that are responsible for finding, loading and finalizing the implemented
components of the framework's interface at run time. An MCA component is an
implementation of a framework interface. Such a component is also called a plug-in. For
example, as seen in Figure 2.2, the BTL framework has the tcp component for TCP, the
openib and winverbs components for InfiniBand, and the sm component for shared memory.
An MCA module is an instance of a component. The difference between components and
modules is that modules have a private state but components do not.
Figure 2.2: Overview of the component architecture in Open MPI (the user application calls
the MPI API, which is built on the Modular Component Architecture; each framework, e.g.
BTL, holds several components, e.g. tcp, sm, openib and winverbs)
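The relation between framework, component and module can be illustrated with a few lines of C. This is only a minimal sketch and not Open MPI source code: the types toy_framework, toy_component and toy_module and all functions are hypothetical names chosen for illustration. It shows a component as a stateless table of functions, a module as an instance with private state, and a framework that selects a component by name at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration; not Open MPI's real MCA structures. */
struct toy_module;

/* A component implements the framework interface and has no per-instance state. */
struct toy_component {
    const char *name;
    void   (*init_module)(struct toy_module *m, int device_id);
    size_t (*send)(struct toy_module *m, size_t nbytes);
};

/* A module is one instance of a component and owns private state. */
struct toy_module {
    const struct toy_component *component;
    int    device_id;     /* private state */
    size_t bytes_sent;    /* private state */
};

static void tcp_init(struct toy_module *m, int device_id)
{
    m->device_id = device_id;
    m->bytes_sent = 0;
}

static size_t tcp_send(struct toy_module *m, size_t nbytes)
{
    m->bytes_sent += nbytes;   /* only this module's counter changes */
    return nbytes;
}

static const struct toy_component toy_tcp = { "tcp", tcp_init, tcp_send };

/* A framework knows its components and finds one by name at run time. */
struct toy_framework {
    const char *name;
    const struct toy_component *components[4];
    size_t count;
};

static const struct toy_component *
framework_find(const struct toy_framework *fw, const char *name)
{
    for (size_t i = 0; i < fw->count; i++)
        if (strcmp(fw->components[i]->name, name) == 0)
            return fw->components[i];
    return NULL;
}

/* Create a module (an instance) from a component. */
static struct toy_module module_create(const struct toy_component *c, int device_id)
{
    struct toy_module m;
    assert(c != NULL);
    m.component = c;
    c->init_module(&m, device_id);
    return m;
}
```

Two modules created from the same toy_tcp component share the component's functions but keep separate bytes_sent counters, which mirrors the statement that only modules carry private state.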
The wide variety of framework types allows third party developers to use Open MPI
as a research platform or as a deployment engine for commercial products. The component
architecture in Open MPI enables the usage of multiple components within a single MPI
process. For example, a process can use several network device drivers simultaneously.
Due to its modular approach and its liberal licensing, Open MPI provides a convenient way
to use third party software, supporting both source code and binary distributions.
Furthermore, it makes the development and integration of new or derived frameworks
easy and fast.
2.3 Parallel Debugging Tools
Due to the complexity and difficulty of parallel programming, especially MPI programming,
debugging is essential. The classical way to debug a parallel application is to attach
the parallel processes to a conventional debugger like gdb or to a parallel debugger like
TotalView or DDT. TotalView [58] is a GUI-based program analysis tool that gives the user
control over processes and thread execution, and visibility into program state and
variables. TotalView also supports many parallel programming models including MPI. The
user is able to control the details of data, access patterns, memory management and
execution for C, C++ and Fortran applications. The TotalView GUI also presents detailed
views of objects, data structures and pointers, which simplifies working with complex
objects. However, memory error detection is limited to memory leaks and malloc errors.
Another well-known and popular parallel debugging tool is Allinea DDT [1], which provides
more advanced features, such as detection of invalid memory accesses and the visualization
of Graphics Processing Unit (GPU) data. An example screen shot is shown in Figure 2.3.
It also provides an add-on for Visual Studio that can work together with the Visual Studio
Remote Debugger for parallel programs on Windows platforms.
Figure 2.3: Screen shot of Allinea DDT debugger
However, these traditional debuggers may not be freely available, since they are
commercial products. In order to use their debugging features, they may require source code
modification or language parsers to support parallel models. Also, they may be limited
to special platforms. Most importantly, they do not catch incorrect memory usage
related to MPI semantics, but rather analyze the situation only after the incorrect
usage has produced an error like a segmentation violation. They are not able to detect
whether a memory access is legal or illegal during the communication, or how the
parallel application uses the communicated data afterwards. The debugging framework
introduced in Chapter 4 aims to close this gap.
2.4 Valgrind
Valgrind [51] is a set of simulation-based tools for debugging and profiling on Linux,
Android and Mac OS X systems. Each Valgrind tool implements some kind of debugging
or profiling task. The default tool is Memcheck, which is a heavyweight memory
debugging tool. It instruments every instruction of the running application, and marks
the memory as undefined, defined, noaccess, addressable or unaddressable. The Memcheck
tool manages a large balanced tree data structure to keep track of every byte of memory;
for each memory operation, it finds and checks the corresponding memory state entry in
the tree and, if necessary, writes an access violation message to the standard output
device. In this way, Memcheck is able to detect various memory errors, such as invalid
reads or writes, buffer overruns, freeing an already freed memory area, and so on.
Sections 2.4.2 and 2.4.3 introduce the classes of errors that Memcheck can detect and
its internal mechanisms for managing the memory states.
Cachegrind is a cache simulator for observing cache problems in user applications.
Modern computers normally have two levels of instruction and data caches. On machines
that use three levels of caches, Cachegrind simulates only the first-level and last-level
caches for efficiency. It gathers several statistics, such as cache reads and writes and
the execution of conditional and indirect branches. This information is presented for the
entire application and for each function in the application. The source code can also be
annotated with the counts caused by each line and the number of instructions executed
per source line. This information can be useful for traditional profiling of cache
behavior in user applications.
Callgrind is a profiling tool that records the call graph of an application, optionally
together with cache simulation and branch prediction data. It collects data including the
number of instructions executed, their location in source lines, the caller/callee
relationship between functions, and so on. Further information on the cache simulation
and branch prediction can also be generated for run-time analysis of the application.
KCachegrind [64], a call graph viewer, is also available to visualize and analyze the
output of Callgrind.
Helgrind is a thread error checking tool for detecting synchronization errors in C, C++
and Fortran programs that use the POSIX pthreads primitives. It abstracts the POSIX
pthreads functionalities such as thread creation, thread joining, thread exit and mutexes.
As a result, it may ﬁnd misuses of the POSIX pthreads API, potential deadlocks, or data
races in the user application. For non-pthreads primitives, it is also possible to extend
Helgrind to adapt user application behavior.
Another useful tool is Massif, a heap profiler. It measures how much heap memory
the application consumes over time, and may also measure the size of the application
stack when this is enabled. With the help of Massif, the amount of memory that the
application uses can be reduced, which may speed it up and helps to avoid system
resource exhaustion.
Besides these tools, Valgrind also provides interfaces and utilities for building new
tools. As part of this work, Memcheck is directly extended with more powerful memory
checks. This will be discussed in Section 4.3.1.
2.4.1 How Valgrind Works
Valgrind is highly dependent on the operating system and the processor. It provides
a synthetic CPU, which can also be considered a Just In Time (JIT) virtual machine.
The synthetic CPU translates every instruction into a temporary form called Intermediate
Representation (IR) for the core of Valgrind and then processes the translated IR like a
real CPU. When running programs with Valgrind, regardless of which tool is in use,
Valgrind takes control of the program before it starts. The original program does not run
directly on the native processor but on the synthetic processor, without recompiling or
relinking. With this technique, Valgrind can gather information for debugging the program.
Nevertheless, running under the JIT virtual machine involves adding extra information and
possibly additional instrumentation code to the original program, which causes extra
overhead. The slowdown may range from 5 to 100 times, depending on which Valgrind tool
and run-time parameters are used. However, this may not be a big concern for application
developers during debugging or testing, as Valgrind helps to find bugs or hot spots and
thereby to improve the usability and performance of the application.
When the JIT runs a client program, the Valgrind tool first starts itself in the
client process and translates each basic block of the client program into an intermediate
representation (IR), which may be instrumented by the tool, and then converts the basic
block back into x86 code. The generated code is stored in a code cache so that it can be
rerun when necessary. Dynamic compilation and caching can be considered an alternative
to interpreted execution with a different trade-off: to avoid repeating operations such
as instruction decoding, the compiled code has to be stored at the cost of extra space.
Nevertheless, generating, finding and running the translations consumes most of the
execution time, which introduces a heavy overhead when running the application with
Memcheck.
Because Valgrind is execution-driven, it naturally handles almost all code, such as
executable code, dynamically linked libraries and dynamically generated code. However,
system calls can only be observed indirectly, and for self-modifying code Valgrind
provides the macro VALGRIND_DISCARD_TRANSLATIONS to discard any translations of x86
code in a certain address range [42].
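The translate-once, run-many behavior of a code cache, and the need to discard translations when code in an address range changes, can be sketched with a toy model. The following C code is an illustration under simplifying assumptions and not Valgrind's implementation: the names toy_cache, toy_translate, toy_run_block and toy_discard are hypothetical, the "translation" is reduced to a counter, and the cache is a small direct-mapped table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_CACHE_SLOTS 64u   /* hypothetical cache size */

struct toy_entry {
    uint32_t guest_addr;   /* start address of the basic block */
    int      valid;        /* slot holds a usable translation */
};

struct toy_cache {
    struct toy_entry slot[TOY_CACHE_SLOTS];
    unsigned translations; /* how often the expensive path ran */
    unsigned hits;         /* how often a cached translation was reused */
};

static void toy_cache_init(struct toy_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Stand-in for decoding, instrumenting and re-generating a basic block. */
static void toy_translate(struct toy_cache *c, uint32_t guest_addr)
{
    struct toy_entry *e = &c->slot[guest_addr % TOY_CACHE_SLOTS];
    e->guest_addr = guest_addr;
    e->valid = 1;
    c->translations++;
}

/* Run one basic block: reuse the cached translation or create it first. */
static void toy_run_block(struct toy_cache *c, uint32_t guest_addr)
{
    const struct toy_entry *e = &c->slot[guest_addr % TOY_CACHE_SLOTS];
    if (e->valid && e->guest_addr == guest_addr)
        c->hits++;
    else
        toy_translate(c, guest_addr);
}

/* Drop all translations of blocks starting in [start, start + len). */
static void toy_discard(struct toy_cache *c, uint32_t start, uint32_t len)
{
    for (unsigned i = 0; i < TOY_CACHE_SLOTS; i++) {
        struct toy_entry *e = &c->slot[i];
        if (e->valid && e->guest_addr >= start && e->guest_addr - start < len)
            e->valid = 0;
    }
}
```

Running the same block ten times costs one translation and nine cache hits in this model; after toy_discard the next execution has to translate again, which is the cost that self-modifying code incurs.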
Figure 2.4(a) gives a conceptual overview of a normal program execution. The client
program can directly access the user-level parts of the machine, such as the
general-purpose registers, but can only access the system-level parts of the machine
through the Operating System (OS) with system calls. Figure 2.4(b) shows how this changes
when a program is run under the control of a Valgrind tool. The client program and the
tool are in the same process, but the tool has overall control and mediates every action
of the client. However, the validity of data structures passed upwards from the operating
system, e.g. from the Linux kernel, may not be checked by Valgrind, because the memory
operations inside the kernel are not handled by Valgrind.
Figure 2.4: Overview of program execution with a Valgrind tool ((a) the user application
and the system libraries libc, libstdc++ and libmpi run directly on top of the operating
system and the hardware; (b) the Valgrind core sits between the user application with its
libraries and the operating system)
The following is an example of how to run a user application with Valgrind:
valgrind --tool=callgrind --collect-jumps=yes ./application
2.4.2 Memcheck
Memcheck is a heavyweight tool in the Valgrind tool suite. It can detect many memory
management problems and memory errors in user applications. By processing every
instruction translated for the synthetic CPU of the Valgrind core, all reads and writes
of memory are checked for legality. When copying data with memcpy(), strcpy() and so on,
source and destination may overlap due to a wrong offset of the starting address, which
can also be detected by Memcheck. Furthermore, calls to malloc, new, free or delete are
intercepted, in order to keep track of allocated memory and report errors such as double
frees. To show in detail how Memcheck reports application errors, the following simple
but erroneous C code is taken as an example:
int main(void)
{
    int a;        /* a is never initialized */
    if (a > 0)    /* the branch depends on an undefined value */
        a++;
    return 0;
}
The output of running an application with Valgrind may contain a large amount of
information, for example the output and status of the application, and also the
information or error messages from the Valgrind tool that the application is running with.
A simple example output of running the above program with Valgrind 3.7.0 using the
Memcheck tool is shown below:
==3746== Memcheck, a memory error detector
==3746== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==3746== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==3746== Command: ./test
==3746==
==3746== Conditional jump or move depends on uninitialised value(s)
==3746== at 0x400450: main (test.c:6)
==3746==
==3746==
==3746== HEAP SUMMARY:
==3746== in use at exit: 0 bytes in 0 blocks
==3746== total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==3746==
==3746== All heap blocks were freed -- no leaks are possible
==3746==
==3746== For counts of detected and suppressed errors, rerun with: -v
==3746== Use --track-origins=yes to see where uninitialised values come from
==3746== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
Every line of output from Valgrind starts with the process ID, in order to distinguish
program output from Valgrind commentary. The first three lines of the output show the
basic information about the Valgrind tool, i.e. the tool name, copyright and version
number. The next four lines give the name of the running application and the error that
Memcheck detected in it; output of the application itself would appear without the
process ID prefix. The final part of the output summarizes the overall memory usage of
the application. In the above example, the variable a is compared with 0 but has not been
initialized, which is an obvious error and can make the result of the program
unpredictable. Valgrind reports that there is a conditional jump or move depending on
uninitialized value(s) at line six of test.c, in the main function.
A common memory problem that can be detected by Memcheck is accessing memory
at an illegal place. This is normally caused by accessing unallocated memory or buﬀer
overrun. However, Memcheck is not able to detect buﬀer overrun if the memory is static
or stack allocated, for example, the errors in the following code will not be reported by
Memcheck:
#define NUM 10
int main(void)
{
    int value[NUM];                   /* on stack */
    for (int i = 0; i <= NUM; i++)
        value[i] = i;                 /* value[10] does not exist, overrun */
    return 0;
}
Another class of errors that might happen often is use of uninitialized memory, also
known as undeﬁned memory. Memcheck tracks all the data that are copied around in the
application, but it only complains when the uninitialized memory is used in a function
call or in a conditional branch.
Allocated memory must be deallocated before the application finishes; otherwise there
will be memory leaks, which diminish the performance of the computer by reducing the
amount of available memory and may even crash other applications. Memcheck can also
report the memory leaks in the application by indicating the addresses and sizes of the
leaked memory. It tracks all memory blocks requested by calls to malloc, calloc, realloc
and the new operator until the corresponding free or delete operations are issued, keeps
track of pointers to the allocated memory, and finally reports the memory that has not
been deallocated when the application finishes.
The opposite problem to a memory leak is deallocating memory multiple times or with the
wrong deallocation operation. Because Memcheck tracks the allocation operations on the
memory, it can check whether the free or delete operations are compatible with how the
memory was allocated.
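The bookkeeping behind leak, double-free and mismatched-free detection can be sketched as a small table of allocation records. The C code below is a simplified illustration and does not reproduce Memcheck's data structures: toy_tracker, toy_on_alloc, toy_on_free and toy_leaked_bytes are hypothetical names, the table has a fixed size, and only two allocation kinds are distinguished.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for illustration; not Memcheck's data structures. */
enum toy_kind   { TOY_MALLOC, TOY_NEW };
enum toy_result { TOY_OK, TOY_INVALID_FREE, TOY_DOUBLE_FREE, TOY_MISMATCHED_FREE };

#define TOY_MAX_BLOCKS 32

struct toy_block {
    const void   *addr;
    size_t        size;
    enum toy_kind kind;   /* how the block was allocated */
    int           live;   /* 1 until the block is released */
};

struct toy_tracker {
    struct toy_block block[TOY_MAX_BLOCKS];
    size_t count;
};

/* Record an allocation of `size` bytes at `addr`. */
static void toy_on_alloc(struct toy_tracker *t, const void *addr, size_t size,
                         enum toy_kind kind)
{
    assert(t->count < TOY_MAX_BLOCKS);
    t->block[t->count++] = (struct toy_block){ addr, size, kind, 1 };
}

/* Check a release of `addr` by the deallocator that belongs to `kind`. */
static enum toy_result toy_on_free(struct toy_tracker *t, const void *addr,
                                   enum toy_kind kind)
{
    /* Search newest first, so that a reused address matches its latest block. */
    for (size_t i = t->count; i-- > 0; ) {
        struct toy_block *b = &t->block[i];
        if (b->addr != addr)
            continue;
        if (!b->live)
            return TOY_DOUBLE_FREE;
        if (b->kind != kind)
            return TOY_MISMATCHED_FREE;
        b->live = 0;
        return TOY_OK;
    }
    return TOY_INVALID_FREE;   /* address was never allocated */
}

/* Sum the sizes of all blocks that are still live, i.e. leaked at exit. */
static size_t toy_leaked_bytes(const struct toy_tracker *t)
{
    size_t total = 0;
    for (size_t i = 0; i < t->count; i++)
        if (t->block[i].live)
            total += t->block[i].size;
    return total;
}
```

In a real tool the calls to toy_on_alloc and toy_on_free would be driven by the intercepted allocation functions; here they have to be invoked by hand.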
2.4.3 Shadow Memory
Dynamic Binary Analysis (DBA) tools, which are used for debugging programs and improving
software quality, analyze client programs at the level of machine code at run time. Many
of them store information in shadow memory, which lets the tool remember the history of
every memory location and/or value in memory, so that it can tell where the memory is
and what is stored in it. The shadow memory of the client program is updated and checked
by the DBA tools for correctness and violations, so that any misuse or critical error is
reported to the programmer.
Memcheck shadows the entire memory of the client program, tracking the addressability of
every byte and the validity of every bit [43]. Every bit of memory has an associated
validity bit (V bit) indicating whether it contains a valid value, and every byte has an
associated addressability bit (A bit) indicating whether the memory is accessible. V bits
and A bits are maintained all the time and checked when the corresponding part of memory
is accessed; for example, when reading a word-sized (four-byte) variable from memory, its
four A bits and 32 V bits are checked for addressability and validity. Memcheck remembers
all the allocation/deallocation operations that have been issued on each memory location,
and can thus detect accesses to unreachable or already deallocated memory. It also
remembers which values are defined and which are not, in order to detect uses of
undefined values. When the client program is launched, all the global data areas are
marked as accessible. When the program executes memory allocation operations, the A bits
for that area of memory are marked as accessible. When the program accesses memory,
Memcheck checks the A bits associated with the address to determine whether the access
refers to an invalid address. Furthermore, it checks the V bits of that memory for
undefined values. During program execution, A bits are also set on Stack Pointer (SP)
register movements, which is useful for automatically marking local variables as
accessible on function entry and as inaccessible on exit. The stack is marked as
accessible from SP up to the stack base, and as inaccessible below SP. Operations on
registers are also handled by Memcheck by storing and calculating the relevant A/V bits
in the simulated CPU until the register is written back to memory, and the A/V bits are
consulted and checked if values in the CPU registers are used to generate a memory
address or to determine conditional branches.
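The check of four A bits and 32 V bits for a word-sized load can be written down directly if the shadow state is kept uncompressed. The sketch below assumes exactly that: one A bit and eight V bits per byte stored in a hypothetical struct toy_shadow_byte. It is not Memcheck's representation, and the function returns the state of the loaded value rather than reporting an error, since whether an undefined value is an error depends on how it is used afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, uncompressed shadow state: one A bit and eight V bits per byte. */
struct toy_shadow_byte {
    uint8_t a;   /* 1 = the byte is addressable */
    uint8_t v;   /* bit i set = bit i of the byte holds a valid value */
};

enum toy_load { TOY_LOAD_OK, TOY_LOAD_UNADDRESSABLE, TOY_LOAD_UNDEFINED };

/* Mark `len` bytes starting at `off` as allocated but not yet written. */
static void toy_mark_allocated(struct toy_shadow_byte *sh, size_t off, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        sh[off + i].a = 1;
        sh[off + i].v = 0x00;
    }
}

/* Mark `len` bytes starting at `off` as fully written. */
static void toy_mark_written(struct toy_shadow_byte *sh, size_t off, size_t len)
{
    for (size_t i = 0; i < len; i++)
        sh[off + i].v = 0xFF;
}

/* Inspect a load of `len` bytes: first all A bits, then all V bits. */
static enum toy_load toy_check_load(const struct toy_shadow_byte *sh,
                                    size_t off, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (!sh[off + i].a)
            return TOY_LOAD_UNADDRESSABLE;
    for (size_t i = 0; i < len; i++)
        if (sh[off + i].v != 0xFF)
            return TOY_LOAD_UNDEFINED;
    return TOY_LOAD_OK;
}
```

A four-byte load therefore touches four shadow entries, which corresponds to the four A bits and 32 V bits mentioned in the text.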
Because Memcheck stores and tracks the state of all memory and registers, it is able to
detect and report most memory problems at run time. However, there are also cases that
Memcheck detects but does not report as errors. For example, copying values around does
not cause Memcheck to report an error; an error is reported only when a value is used in
a way that might affect the result of the application, e.g. when the value is written to
a file or to stdout, or when it is the argument of a conditional jump instruction. This
also avoids long chains of error messages. Another case in which Memcheck does not
complain is the use of low-level operations, such as add. The V bits of the result are
calculated from the V bits of the operands, and may be partially or wholly undefined. The
V bits are only checked for definedness when a value is used to generate a memory address,
or when making control flow decisions or system calls. Only then, once the undefined
property is detected, is an error message issued.
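How shadow bits can travel through copies and additions without triggering a report may be sketched as follows. This is a deliberately simplified model and an assumption about one reasonable propagation rule, not a description of Memcheck's exact scheme: the struct toy_value and the functions are hypothetical, and the mask marks undefined bits (bit set = undefined), which is the inverse of the "valid" meaning of a V bit used in the text. For an addition, the sketch conservatively treats every bit at or above the lowest undefined operand bit as undefined, because a carry may reach all higher bits.

```c
#include <assert.h>
#include <stdint.h>

/* A 32-bit value together with a mask of its undefined bits
 * (bit set = undefined; inverse of the "valid" convention in the text). */
struct toy_value {
    uint32_t bits;
    uint32_t undef;
};

/* Spread undefinedness from the lowest undefined bit to all higher bits. */
static uint32_t toy_smear_left(uint32_t undef)
{
    return undef | (0u - undef);
}

/* Copying moves the shadow state along with the value; nothing is checked. */
static struct toy_value toy_copy(struct toy_value x)
{
    return x;
}

/* Addition: a carry out of an undefined bit may change every higher bit. */
static struct toy_value toy_add(struct toy_value x, struct toy_value y)
{
    struct toy_value r;
    r.bits  = x.bits + y.bits;
    r.undef = toy_smear_left(x.undef | y.undef);
    return r;
}

/* A use that matters (branch, address, system call): returns 1 if an error
 * would be reported because some bit of the value is undefined. */
static int toy_use_in_branch(struct toy_value x)
{
    return x.undef != 0;
}
```

Only toy_use_in_branch produces a verdict; toy_copy and toy_add merely compute new shadow state, which is why a chain of copies and arithmetic yields a single report at the point of use.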
As described above, every byte of memory has eight V bits, one for each bit, and one
A bit for the whole byte, which is 9 bits in total. Memcheck uses compressed maps to
store these bits, in order to reduce the memory overhead. A two-level map structure is
used on 32-bit machines: each entry of the top-level (primary) map is a pointer to a valid
second-level map, while the second-level map stores the accessibility and validity
permissions (A bits and V bits) of the corresponding memory region. The top-level map is
indexed by the top 16 bits of the address, and the second-level map is indexed by the
lower 16 bits. So there are 2^16 = 65536 entries on each level, and with two state bits
per byte in the compressed encoding each second-level map occupies 65536*2/8 = 16384
bytes. The 4 GB address space is consequently divided into 64 K chunks of 64 KB each, as
shown in Figure 2.5. As many of the 64 KB chunks have the same status for every byte, the
primary map entries for such chunks point to one of three distinguished, pre-defined maps
indicating not accessible, undefined or defined, which decreases the size of the stored
memory status bits. In fact, when running a real application, more than half of the
addressable memory is defined or undefined [43].
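The two-level lookup with shared distinguished maps can be sketched in C as follows. The sketch assumes a 32-bit address, two state bits per byte and a copy-on-write step when a chunk that still points to a distinguished map is modified; the names (toy_primary, toy_secmap, toy_get_state, toy_set_state) and the numeric state encoding are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Two state bits per byte; the numeric encoding is hypothetical. */
enum { TOY_NOACCESS = 0, TOY_UNDEFINED = 1, TOY_DEFINED = 2 };

#define TOY_CHUNK   65536u            /* bytes covered by one secondary map */
#define TOY_SM_SIZE (TOY_CHUNK / 4u)  /* 65536 * 2 / 8 = 16384 bytes */

struct toy_secmap { uint8_t bits[TOY_SM_SIZE]; };

static struct toy_secmap  toy_distinguished[3];  /* shared, read-only after init */
static struct toy_secmap *toy_primary[65536];    /* indexed by the upper 16 bits */

static void toy_shadow_init(void)
{
    /* Repeat each 2-bit state four times per byte: 0x00, 0x55, 0xAA. */
    for (int s = 0; s < 3; s++)
        memset(toy_distinguished[s].bits, s * 0x55, TOY_SM_SIZE);
    for (size_t i = 0; i < 65536; i++)
        toy_primary[i] = &toy_distinguished[TOY_NOACCESS];
}

static int toy_is_distinguished(const struct toy_secmap *sm)
{
    return sm == &toy_distinguished[0] || sm == &toy_distinguished[1] ||
           sm == &toy_distinguished[2];
}

/* Read the state of one byte: upper 16 bits pick the map, lower 16 the entry. */
static int toy_get_state(uint32_t addr)
{
    const struct toy_secmap *sm = toy_primary[addr >> 16];
    uint32_t low = addr & 0xFFFFu;
    return (sm->bits[low >> 2] >> ((low & 3u) * 2u)) & 3;
}

/* Change the state of one byte, copying a distinguished map on first write. */
static void toy_set_state(uint32_t addr, int state)
{
    struct toy_secmap **slot = &toy_primary[addr >> 16];
    uint32_t low = addr & 0xFFFFu;
    unsigned shift = (low & 3u) * 2u;

    assert(state >= TOY_NOACCESS && state <= TOY_DEFINED);
    if (toy_is_distinguished(*slot)) {
        struct toy_secmap *copy = malloc(sizeof *copy);
        assert(copy != NULL);
        memcpy(copy, *slot, sizeof *copy);
        *slot = copy;   /* this 64 KB chunk now has a private map */
    }
    (*slot)->bits[low >> 2] = (uint8_t)(((*slot)->bits[low >> 2] & ~(3u << shift))
                                        | ((unsigned)state << shift));
}
```

As long as a whole 64 KB chunk has a uniform state, it costs only one pointer in the primary map; a private 16384-byte secondary map is allocated on the first write that breaks the uniformity.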
On 64-bit machines the implementation is more complicated. A four-level structure could
be used, but it would make the number of memory accesses extremely large. Instead, an
improved two-level structure is implemented to reduce the cost. The top-level map is
enlarged to 2^19 entries, indexed by bits 16 to 34 of the memory address. This new
top-level map covers the bottom 32 GB of memory. Accesses to addresses above this range
are handled by a sparse auxiliary table.
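The numbers in this paragraph can be checked with a short calculation: 2^19 primary entries of 64 KB each cover 2^35 bytes, i.e. 32 GB. The following C sketch encodes this; the macro and function names are hypothetical, and the auxiliary table itself is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PRIMARY_BITS 19u   /* 2^19 primary entries */
#define TOY_CHUNK_BITS   16u   /* 64 KB per entry      */
#define TOY_PRIMARY_SIZE (UINT64_C(1) << TOY_PRIMARY_BITS)
#define TOY_COVERED      (TOY_PRIMARY_SIZE << TOY_CHUNK_BITS)   /* 2^35 bytes */

/* Returns 1 and stores the primary index if `addr` lies in the range covered
 * by the main primary map; returns 0 if the auxiliary table would be needed. */
static int toy_primary_index(uint64_t addr, uint32_t *index)
{
    if (addr >= TOY_COVERED)
        return 0;
    *index = (uint32_t)(addr >> TOY_CHUNK_BITS);   /* bits 16..34 */
    return 1;
}
```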
Figure 2.5: A/V bits addressing mechanism (the upper 16 bits of a memory address select
one of the 2^16 entries of the primary map; each entry points either to a secondary map,
whose 2^16 entries are selected by the lower 16 bits and shadow 2^16 bytes of memory, or
to one of the distinguished maps DEFINED, UNDEFINED and NOACCESS)
Client request mechanism
The client request mechanism is a method provided by Valgrind for better interaction with
user applications. Client requests are not normal function invocations, but rather macros
that can be used directly in user applications. These requests only affect the application
when it runs under the control of Valgrind. Table 2.1 lists several powerful client
requests. The client request macros expand into processor instructions that do not
otherwise change the semantics of the application. Valgrind recognizes the special
instruction preamble inserted in this way as a command to steer the instrumentation;
otherwise these instructions have no effect on registers, flags or other state of the
processor.
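A typical use of client requests is a custom allocator that tells Memcheck which parts of its storage are currently handed out. The following C sketch illustrates this with a hypothetical single-slot buffer pool (toy_pool and its functions are not part of Valgrind). It uses the Memcheck macros VALGRIND_MAKE_MEM_NOACCESS and VALGRIND_MAKE_MEM_UNDEFINED from valgrind/memcheck.h if that header is installed; as an assumption made only to keep the sketch compilable everywhere, it falls back to no-op stand-ins when the header is missing.

```c
#include <assert.h>
#include <stddef.h>

/* Use the real Memcheck header if it is installed; otherwise fall back to
 * no-op stand-ins so that the sketch still compiles (assumption: the
 * compiler supports __has_include, as recent GCC and Clang do). */
#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#    define TOY_HAVE_MEMCHECK 1
#  endif
#endif
#ifndef TOY_HAVE_MEMCHECK
#  define VALGRIND_MAKE_MEM_NOACCESS(addr, len)  ((void)(addr), (void)(len), 0)
#  define VALGRIND_MAKE_MEM_UNDEFINED(addr, len) ((void)(addr), (void)(len), 0)
#endif

#define TOY_BUF_SIZE 256

/* Hypothetical single-slot buffer pool. */
struct toy_pool {
    unsigned char storage[TOY_BUF_SIZE];
    int in_use;
};

static void toy_pool_init(struct toy_pool *p)
{
    p->in_use = 0;
    /* Nobody may touch the storage until it is handed out. */
    (void)VALGRIND_MAKE_MEM_NOACCESS(p->storage, TOY_BUF_SIZE);
}

static unsigned char *toy_pool_get(struct toy_pool *p)
{
    if (p->in_use)
        return NULL;
    p->in_use = 1;
    /* Accessible again, but the contents are stale: treat as uninitialized. */
    (void)VALGRIND_MAKE_MEM_UNDEFINED(p->storage, TOY_BUF_SIZE);
    return p->storage;
}

static void toy_pool_put(struct toy_pool *p, unsigned char *buf)
{
    assert(buf == p->storage && p->in_use);
    p->in_use = 0;
    /* A later access through a dangling pointer is now reported by Memcheck. */
    (void)VALGRIND_MAKE_MEM_NOACCESS(p->storage, TOY_BUF_SIZE);
}
```

When the program runs natively, the requests do nothing and the pool behaves like ordinary C code; under Memcheck, reading a buffer before writing it or touching it after toy_pool_put would be flagged.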
The special instruction preamble rotates a register several times. On the x86
architecture, the left-rotation instruction rol is used to rotate the 32-bit register edi
by 3, 13, 29 and 19 bits, which is 64 bits in total and therefore leaves the same value
in edi. The special instruction preamble is defined as follows:
#define __SPECIAL_INSTRUCTION_PREAMBLE \
"roll $3, %%edi ; roll $13, %%edi\n\t" \
"roll $29, %%edi ; roll $19, %%edi\n\t"
The actual command to be executed is then encoded with a register-exchange instruction
(xchgl) that exchanges a register with itself (in this case ebx). The complete client
request assembly code macro is defined as:
#define VALGRIND_DO_CLIENT_REQUEST( \
        _zzq_rlval, _zzq_default, _zzq_request, \
        _zzq_arg1, _zzq_arg2, _zzq_arg3, _zzq_arg4, _zzq_arg5) \
  { volatile unsigned int _zzq_args[6]; \
    volatile unsigned int _zzq_result; \
    _zzq_args[0] = (unsigned int)(_zzq_request); \
    _zzq_args[1] = (unsigned int)(_zzq_arg1); \
    _zzq_args[2] = (unsigned int)(_zzq_arg2); \
    _zzq_args[3] = (unsigned int)(_zzq_arg3); \
    _zzq_args[4] = (unsigned int)(_zzq_arg4); \
    _zzq_args[5] = (unsigned int)(_zzq_arg5); \
    __asm__ volatile(__SPECIAL_INSTRUCTION_PREAMBLE \
                     /* %EDX = client_request ( %EAX ) */ \
                     "xchgl %%ebx,%%ebx" \
                     : "=d" (_zzq_result) \
                     : "a" (&_zzq_args[0]), "0" (_zzq_default) \
                     : "cc", "memory" \
                    ); \
    _zzq_rlval = _zzq_result; \
  }
Client request                            Description
VALGRIND_MAKE_MEM_NOACCESS                Marks address ranges as inaccessible
VALGRIND_MAKE_MEM_UNDEFINED               Marks address ranges as undefined
VALGRIND_MAKE_MEM_DEFINED                 Marks address ranges as defined
VALGRIND_MAKE_MEM_DEFINED_IF_ADDRESSABLE  Marks address ranges as defined where they are
                                          addressable
VALGRIND_DISCARD                          Stops reporting errors on user-defined blocks
VALGRIND_CHECK_MEM_IS_ADDRESSABLE         Checks whether the address range is addressable
VALGRIND_CHECK_MEM_IS_DEFINED             Checks whether the address range is defined
VALGRIND_CHECK_VALUE_IS_DEFINED           Checks whether the value is defined
VALGRIND_DO_LEAK_CHECK                    Immediately performs a memory leak check
VALGRIND_DO_QUICK_LEAK_CHECK              Immediately performs a full memory leak check
                                          with a summary
VALGRIND_COUNT_LEAKS                      For test harness code: after calling
                                          VALGRIND_DO_LEAK_CHECK or
                                          VALGRIND_DO_QUICK_LEAK_CHECK, returns the number
                                          of bytes in each category
VALGRIND_COUNT_LEAK_BLOCKS                For test harness code: after calling
                                          VALGRIND_DO_LEAK_CHECK or
                                          VALGRIND_DO_QUICK_LEAK_CHECK, returns the number
                                          of blocks
VALGRIND_GET_VBITS                        Gets V bits for an address range
VALGRIND_SET_VBITS                        Sets V bits for an address range
Table 2.1: Memcheck client requests in Valgrind 3.7.0
For example, a call to VALGRIND_MAKE_MEM_NOACCESS is expanded by the macro defined as:
#define VALGRIND_MAKE_MEM_NOACCESS(_qzz_addr,_qzz_len) \
VALGRIND_DO_CLIENT_REQUEST_EXPR(0 /* default return */, \
VG_USERREQ__MAKE_MEM_NOACCESS, \
(_qzz_addr), (_qzz_len), 0, 0, 0)
After compilation, this results in the following assembler sequence:
rol $0x3,%edi
rol $0xd,%edi
rol $0x3d,%edi
rol $0x33,%edi
xchg %ebx,%ebx
mov %edx,0xfffffffffffffff0(%ebp)
mov 0xfffffffffffffff0(%rbp),%eax
mov %eax,0xfffffffffffffffc(%ebp)
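When the program is not executed under Valgrind, the rotations add up to a multiple of
the register width and leave the register intact, and exchanging a register with itself
has no effect, i.e. the sequence of rol and xchg does not change the state of the
application. If the program runs under Valgrind, however, the sequence is recognized and
the memory is marked as not accessible.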
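That four rotations by 3, 13, 29 and 19 bits restore a 32-bit register can be verified with a few lines of C. The sketch below models only the register value, not the processor flags, and its function names toy_rol32 and toy_preamble are hypothetical helpers that are not taken from Valgrind.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (0 < n < 32), like the x86 rol instruction. */
static uint32_t toy_rol32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32u - n));
}

/* Apply the four rotations of the 32-bit preamble in order. */
static uint32_t toy_preamble(uint32_t edi)
{
    static const unsigned amount[4] = { 3, 13, 29, 19 };   /* 3+13+29+19 = 64 */
    for (int i = 0; i < 4; i++)
        edi = toy_rol32(edi, amount[i]);
    return edi;
}
```

Since 64 is a multiple of the register width of 32 bits, toy_preamble returns its argument unchanged for every input value.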
2.5 Intel Pin
Robust and powerful software instrumentation tools are important for program analysis
tasks such as profiling, performance evaluation, and bug detection. Intel Pin is designed
to meet this need. It is a JIT tool for analyzing and instrumenting application code.
Similar tools are Valgrind and DynamoRIO [4, 5]. The goal of the Pin framework is to
provide an easy-to-use, portable, transparent, and efficient instrumentation framework
for building instrumentation tools (Pintools) written in C/C++. The Pin framework follows
the ATOM [54] model, which allows the Pintools to analyze the application at the
instruction level without knowing the details of the underlying instruction set. The
framework API is designed to be platform independent, in order to make the Pintools
compatible across different architectures. However, the framework can provide
architecture-specific details when necessary. The instrumentation process is transparent,
as the application and the Pintool observe the original application code. Pin uses
techniques like inlining, register re-allocation, and instruction scheduling in order to
run more efficiently. The basic overhead of the Pin framework ranges from 10% to 20%, and
extra overhead may be caused by the Pintool.
Pin can be used to observe high-level application abstractions such as procedure
invocations, shared library loading, thread creation, and system call execution.
Lower-level behavior of the application, like memory references, instruction execution,
and control flow, can also be observed. When running an application with Pin, Pin
intercepts execution at the beginning of the process and injects a run-time agent that
is similar to a dynamic binary translator. After inserting the instrumentation code, it
generates the new code sequence, which contains calls to functions written in C++, and
then transfers control to the new code sequence. Only the generated code is executed;
the original code is used as a reference.
The API suite provided by the Pin framework abstracts the underlying instruction
set. It allows application context information to be passed to the injected code as
parameters. Pin also automatically saves and restores the registers that are overwritten
by the injected code, so the application may continue to work. Limited access to symbol
and debug information is available as well by cooperating with the OS-dependent system
libraries, such as dbghelp.dll provided on Windows, or libc, libstdc++, libelf and
counterparts on Linux.
2.5.1 How Intel Pin Works
The Pin JIT compiler re-compiles and instruments small chunks of binary code imme-
diately before executing them. The modiﬁed instructions are stored in a software code
cache. It allows code regions to be generated once and reused for the remainder of
program execution later on, in order to reduce the cost of recompilation. The overhead
introduced by the compilation is highly dependent on the application and workload [24].
Figure 2.6 shows the basic structure of the Pin software architecture. It consists of
three main processes: the launcher process, the server process and the instrumented
process. The launcher process creates the other two processes, injects the Pin modules
and instrumented code into the instrumented process, and then waits for the process to
terminate. The server process provides services for managing symbol information,
injecting Pin, and communicating with the instrumented process via shared memory. The
instrumented process includes the Pin Virtual Machine Manager (VMM) (pinvm.dll), the
user-defined Pin tool library, and the application executable and libraries. The VMM acts
as a system call emulator, event and thread dispatcher, and also as JIT compiler. It is
the engine of the entire instrumented process. After Pin takes over control of the
application, the VMM coordinates its execution. The JIT compiler then instruments the
code and passes it to the dispatcher, which launches the execution. The compiled code is
stored in the code cache. The Pin tool contains the instrumentation and analysis
routines. It is a plug-in that is linked with the Pin library, which allows it to
communicate with the Pin VMM.
Figure 2.6: Overview of program execution with an Intel Pin tool on Windows (the process
launcher PIN.exe creates the server process, a debug symbol server using DBGHELP.DLL, and
the instrumented process; the latter contains the application code and data, the PinTool
library, and the Pin VMM (PINVM.DLL) with system call emulator, event dispatcher, thread
dispatcher and code cache, on top of NTDLL.DLL, KERNEL32.DLL and the Windows kernel)
In order to gain control of execution and then instrument the application, the Pin VMM
is loaded into the address space of the application by an injection procedure. The
injection should be performed as early as possible after the Pin executable is launched,
so that the Pin tools can observe the execution of the guest application. The application
process is created in suspended state and Pin attaches to it via the Win32 debugger APIs.
When the system kernel has finished initializing the process, the debugger detaches from
the debugged process. The context of the application is stored and the instruction
pointer is modified to load the VMM boot routine. After the VMM is fully initialized, it
loads the Pin tool. In principle, this is the earliest time at which the tool library can
be loaded, as the initialization of the kernel libraries cannot be instrumented.
For system calls, such as I/O access and new process creation, Pin is able to handle
them on behalf of the application. It first detects when the application is about to
execute a system call, then inserts instrumentation immediately after the system call
instruction, and finally executes the call with the precisely constructed register state
of the application. The extra code is inserted because the kernel may let the application
continue after the system call; the instrumentation ensures that Pin always retains
control of the execution.
Exceptions are synchronous events, which are delivered immediately by the system kernel
when detected by hardware or generated by software. Windows provides the so-called
Structured Exception Handling (SEH) [34] mechanism to handle exceptions, and this can
also be used by applications to capture exceptions. Further information about an
exception can then be analyzed, and appropriate exception handlers can be specified by
the application. This makes it possible for Pin to intercept exceptions and insert
instrumented code at the entry of the original exception routine. However, it is
necessary to distinguish exceptions raised by the application from those raised inside
Pin. The Pin dispatcher checks the exception origin to accomplish this. As the
instructions inserted by Pin into the code cache are guaranteed not to fault, the only
source of exceptions is the application. The JIT compiler guarantees that the
instrumented application raises the same exceptions under the same conditions as the
original instructions.
The overhead introduced by instrumenting with Pin depends on the application
and the Pin tool. In the VMM, time is spent on processing system events such as
system calls, exceptions, and Asynchronous Procedure Calls (APCs), on invoking the JIT
compiler, and on managing the code cache. Time is also spent on executing the
application and its instrumentation from the code cache, and on translating the application
and inserting instrumentation in the JIT compiler. In addition, the interaction with
the operating system adds overhead, especially on Windows [52], where system calls
cause a higher overhead than exceptions, APCs and callbacks.
2.5.2 Pin Instrumentations
Using the API, Pintools may observe all the architectural state of a process includ-
ing the contents of registers, memory and control ﬂow. The mechanism is similar to
ATOM [54]; the user adds analysis functionalities to the application process, and inserts
instrumentation routines to determine where and when to call the analysis routines. The
architectural state or constants of the application, such as instruction pointer, register
content, or memory address, may be passed as arguments to the analysis routines. It is
also possible to overwrite the application registers and application memory, in order to
change the application behavior.
Although the instrumentation in Pin is performed by a JIT compiler, the input to this
compiler is not byte code, but a native executable. Pin intercepts the execution of the first
instruction of the executable and generates a new code sequence starting from it. The
generated code sequence is almost identical to the original one, but Pin ensures that it
regains control when execution reaches the exit of the code sequence. Pin then generates
code for the next block and continues execution. Whenever the JIT compiler fetches code,
the Pintool has the opportunity to instrument it, and the instrumented code is saved in a
code cache for future executions of the same sequence of instructions. A Pintool must
run in the same address space as Pin and the executable to be instrumented. Therefore,
the Pintool has access to all of the executable's data. It also shares file descriptors and
other process information with the executable. A Pintool can instrument the application
in several modes, which differ in the scope of the execution they cover.
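The interplay between the JIT compiler, the code cache and the Pintool can be illustrated with a minimal sketch. It is a toy model written in Python, not the Pin API: the program representation, the class CountingTool and the function run are assumptions made for illustration only. The instrumentation routine runs once per translated block, while the analysis routine it returns runs on every execution of that block.

```python
from typing import Callable, Dict, List, Tuple

Analysis = Callable[[], None]


class CountingTool:
    """Hypothetical tool that counts block executions and memory operations."""

    def __init__(self) -> None:
        self.block_executions = 0
        self.memory_ops = 0

    def instrument(self, instructions: List[str]) -> List[Analysis]:
        # Instrumentation routine: runs once, when a block is translated.
        n_mem = sum(1 for ins in instructions if ins in ("load", "store"))

        def analysis() -> None:
            # Analysis routine: runs every time the block is executed.
            self.block_executions += 1
            self.memory_ops += n_mem

        return [analysis]


def run(program: Dict[int, List[str]], path: List[int], tool: CountingTool) -> int:
    """Execute the blocks listed in `path`; return the number of translations."""
    code_cache: Dict[int, Tuple[List[str], List[Analysis]]] = {}
    translated = 0
    for addr in path:  # control returns to this loop after every block
        if addr not in code_cache:  # cache miss: translate and instrument once
            code_cache[addr] = (program[addr], tool.instrument(program[addr]))
            translated += 1
        _, analysis_calls = code_cache[addr]
        for call in analysis_calls:  # inserted analysis code
            call()
    return translated


# Toy guest program: block address -> instructions (names are illustrative).
program = {0x10: ["load", "add"], 0x20: ["store", "jmp"]}
tool = CountingTool()
translated = run(program, [0x10, 0x20, 0x10, 0x20, 0x10], tool)
print(translated, tool.block_executions, tool.memory_ops)
```

In this model, five block executions cause only two translations, which is the effect the code cache is meant to achieve.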
Trace Instrumentation
The trace instrumentation occurs immediately before a code sequence is executed. It
allows the Pintool to inspect and instrument an executable at run time. A trace is a
single-entrance-multiple-exits sequence of instructions. It usually begins at the target
of a taken branch and ends with an unconditional branch, including calls and returns.
When a branch into the middle of a trace is executed, Pin creates a new trace that begins
at the branch target.
Basic Block (BBL) Instrumentation
A BBL is a portion of code into which compilers decompose a program, and whose
properties make it amenable to analysis. The code in a BBL has only one entry point and
one exit point: no jump instruction anywhere in the program targets an instruction other
than the first one of this BBL, and only the last instruction of this BBL can cause the
program to begin executing code in a different BBL. Briefly, a BBL is a single-entrance,
single-exit sequence of instructions. Pin breaks each trace into BBLs. A branch into the
middle of a BBL causes Pin to create a new trace and hence a new BBL. As with trace
instrumentation, certain BBLs and the instructions inside them may be generated and
instrumented multiple times.
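The decomposition of a trace into BBLs can be sketched as follows. This is a simplified illustration in Python, not Pin code: the mnemonics, the set BRANCHES and the function split_into_bbls are assumptions, and a real implementation works on decoded machine instructions.

```python
from typing import List

# Mnemonics treated as control transfers in this toy instruction set (an assumption).
BRANCHES = {"jz", "jnz", "jmp", "call", "ret"}


def split_into_bbls(trace: List[str]) -> List[List[str]]:
    """Split a trace into BBLs: every branch ends the current block."""
    bbls: List[List[str]] = []
    current: List[str] = []
    for ins in trace:
        current.append(ins)
        if ins in BRANCHES:
            bbls.append(current)
            current = []
    if current:  # trailing instructions without a closing branch
        bbls.append(current)
    return bbls


# A trace with one conditional exit in the middle and an unconditional branch at the end.
trace = ["mov", "cmp", "jz", "add", "sub", "jmp"]
bbls = split_into_bbls(trace)
print(bbls)
```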
Image Instrumentation
Image instrumentation allows a Pintool to inspect and instrument an entire executable
image when it is loaded. The Pintool can therefore walk through each section of the
loaded image, each routine of a section, and each instruction of a routine.
Routine Instrumentation
In routine instrumentation mode, the Pintool inspects and instruments an entire routine
once it is called. Instrumentation can be inserted before or after a routine is executed. As
a Pintool can walk the instructions of a routine, instrumentation can also be inserted
before or after an individual instruction of the routine is executed. Routine instrumentation
can be more efficient than image instrumentation when only a small number of routines in an
image are executed.
Instruction Instrumentation
Instruction instrumentation is convenient, as the Pintool can inspect and instrument an
executable one instruction at a time. This is essentially identical to trace instrumentation
in which the instructions inside each trace are iterated. However, instrumenting at the
level of traces and BBLs reduces the number of analysis calls and makes
the instrumentation more efficient.
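The saving in analysis calls can be made concrete with a small sketch that counts executed instructions in two ways. It is a Python model under the assumption that the execution is given as a list of executed BBL sizes; the function names are hypothetical and do not belong to the Pin API.

```python
from typing import List, Tuple


def count_per_instruction(executed_bbl_sizes: List[int]) -> Tuple[int, int]:
    """One analysis call before every executed instruction."""
    icount = calls = 0
    for size in executed_bbl_sizes:
        for _ in range(size):
            icount += 1
            calls += 1
    return icount, calls


def count_per_bbl(executed_bbl_sizes: List[int]) -> Tuple[int, int]:
    """One analysis call per executed BBL, passing the BBL size as argument."""
    icount = calls = 0
    for size in executed_bbl_sizes:
        icount += size
        calls += 1
    return icount, calls


executed = [3, 2, 3, 2, 4]  # sizes of the BBLs in execution order
per_instruction = count_per_instruction(executed)
per_bbl = count_per_bbl(executed)
print(per_instruction, per_bbl)
```

Both variants report the same instruction count, but the BBL variant needs one analysis call per block instead of one per instruction.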
The different types of instrumentation described above make it flexible and easy to
build Pintools at different levels of observation. We implemented a Pintool using image
and trace instrumentation for wrapping the entry functions in the user application and
processing the memory instructions. More details are given in Section 4.4.
3 Open MPI for Windows
3.1 Overview
Open MPI is an open source implementation of MPI-1 and MPI-2 that targets
various Linux and Windows systems in both 32-bit and 64-bit versions. Earlier versions
were not supported natively on Windows, but only under Cygwin [47], which led to poor
performance in compilation and at run-time. The network fabric support was also weak,
as there was no multiple node support on Windows at all: parallel jobs could only be run
on a single compute node and could therefore be parallelized using shared memory only.
In order to better support Open MPI on Windows, new work has been accomplished,
including the introduction of a new cross-platform build system, the integration of the
Windows HPC environment, and support for communication over various network
protocols.
Besides, many other features were incomplete or missing in the internal support
for Windows: the tcp component in the BTL framework needed to be improved; the
event library, which drives the entire event dispatch in the project, was not optimized
for Windows; there was no support for the Windows HPC environment; and support for
high speed network fabrics was missing. All these issues had to be resolved in order to
run parallel tasks natively and efficiently on Windows.
3.2 Integration with Windows
Open MPI was originally developed for Linux platforms. The common way of building
Open MPI is to use the autotools chain. However, this is a Unix-centric approach to
detecting installed software and hardware, based on the m4 macro language and shell
scripts. Software using autotools is therefore not easily ported to Windows, which lacks
a Linux-like environment. Although earlier Open MPI releases were successfully tested
with Cygwin, this method has been deprecated because of its high maintenance
complexity and its time-consuming configuration and compilation. Furthermore, it
supports only single node parallelization, which means that the user application can only
be parallelized with shared memory on multi-core systems. Another problem is that the
overhead of running a parallel program under Cygwin on Windows is very high, because
Cygwin works as an extra software layer between the system and the application.
In order to simplify the build process on Windows, and to achieve both lower complexity
and higher build speed, CMake [7], a cross-platform build system, has been integrated
into Open MPI on Windows. It integrates the entire project into the Visual Studio
environment and uses the native Microsoft Visual C++ compiler, which speeds up the
configuration and build process considerably¹. It has the advantage of generating Windows
solution files based on the selected compiler environment, for example, the Visual Studio
and MinGW [37] environments. This allows the user to build Open MPI with the Visual C++
compiler and also with the GNU's Not Unix (GNU) compilers for Windows, such as gcc, g++
and gfortran. The integration of CMake also makes it possible to combine the build
process with different IDEs on Windows; for example, the entire project may be built
under Visual Studio², or under CodeBlocks [8] within the MinGW environment. Additionally,
the debug features of these IDEs may be used to debug and improve the user's parallel
application.
The implementation also supports building the MCA libraries at a finer granularity with
the help of libtool [60]. When the project is configured with libtool enabled, each MCA
library is compiled as a separate library. Open MPI loads the required
MCA libraries dynamically at run-time. This reduces the size of the main Open MPI
libraries and therefore the memory consumption.
Libevent [55] is used as the base event library. It receives events from the operating
system and relays them to the Open MPI components that are registered with Libevent, and
it plays an important role in the porting process. It supports different kernel event
notification mechanisms on Unix-like systems, for example, poll, kqueue, and select, while
the mechanisms currently working on Windows are thread-based callbacks and select.
Because the thread-based event mechanism is expensive in catching and dispatching events,
and because it is hard to scale as far as required, using threads severely limits performance
on Windows platforms. A more efficient event engine therefore has to be used.
Using select gives much better performance on socket events and scales well.
Currently, select is the default Libevent mechanism
used in Open MPI.
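The principle of a select-based event engine can be sketched in a few lines. The example uses the select and socket modules of the Python standard library rather than Libevent itself; the function dispatch_once and the handler table are assumptions made for illustration.

```python
import select
import socket
from typing import Callable, Dict, List


def dispatch_once(handlers: Dict[socket.socket, Callable[[socket.socket], None]],
                  timeout: float = 1.0) -> int:
    """Wait until a registered socket becomes readable and call its handler."""
    readable, _, _ = select.select(list(handlers), [], [], timeout)
    for sock in readable:
        handlers[sock](sock)
    return len(readable)


received: List[bytes] = []
left, right = socket.socketpair()  # a connected pair standing in for two peers
handlers = {right: lambda s: received.append(s.recv(1024))}
left.sendall(b"ping")
dispatched = dispatch_once(handlers)  # one pass of the event loop
left.close()
right.close()
print(dispatched, received)
```

A single thread can watch many sockets in this way, which avoids the cost of one callback thread per event source.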
In order to test the Windows build periodically and unattended, a nightly build system
on the Viscluster at HLRS is used to report configuration and compilation errors. The MPI
Testing Tool (MTT) is a general infrastructure for testing MPI implementations nightly
and running performance benchmarks automatically, across many different clusters,
environments and organizations. It gathers all results in a central database for
analysis, including the configure options, logs, compilation outputs, warnings and
errors. The MTT is written in Perl and has been adapted to work on Windows platforms.
The CMake module for testing Open MPI on Windows has been integrated into MTT.
The nightly tests run on the Viscluster at HLRS, which consists of machines with
AMD Opteron 250 processors and 4.3GB memory running Windows HPC 2008 R2 with
TCP and InfiniBand connections.
¹ Configuration and compilation are approximately 5 times faster.
² This includes Visual Studio 2005, 2008, 2010 and also the Express versions.
Figure 3.1: WMI architecture (WMI consumers: C/C++, .NET and script clients using the
WMI COM API; WMI infrastructure: WMI service (CIM Object Manager) and WMI repository;
WMI providers and managed objects)
3.3 Multiple Node Support
The multiple node support in Open MPI has been developed as MCA components for
different platforms, such as rsh and slurm, which use RSH and SLURM on Linux systems.
On Windows platforms, however, there was no support for starting processes on remote
nodes, and only local processes could be launched under Cygwin. Therefore, new
components have been integrated in order to provide multiple node support on Windows.
Two new components for the Resource Allocation Subsystem (RAS) and the Process
Launch Management (PLM) frameworks have been developed and integrated with the HPC
Cluster Manager using the API of the CCP [33], so that MPI jobs can be submitted
and monitored by the job scheduler. For non-HPC environments, such as Windows
XP or Vista systems, a Windows Management Instrumentation (WMI) [35] based PLM
component has also been implemented.
3.3.1 Integration with WMI
WMI is an infrastructure for the management of data and operations on Windows-based
operating systems, which was first introduced in Windows 2000 Professional with
Service Pack 2. Additional SDKs may be installed to support WMI on earlier operating
systems. WMI helps system administrators and programmers automate administrative
tasks on remote computers and supplies management data to other parts of the operating
system. For building WMI applications, APIs exist for C/C++, C# and Visual Basic.
WMI is based on the Web-based Enterprise Management (WBEM) initiative and
the Common Information Model (CIM) adopted by the Distributed Management Task
Force (DMTF). WMI includes the managed objects defined by CIM as well as extensions
to the CIM model for additional information available from the Windows platform.
It also provides a uniform interface for any local or remote applications or scripts that
obtain management data from a computer system, a network, or an enterprise. Many
operating system APIs cannot be called by script clients or Visual Basic applications,
and many of them cannot make calls to remote computers. The uniform WMI
interface is therefore designed so that WMI client applications and scripts do not have
to call a wide variety of operating system APIs.
The WMI architecture contains three levels, as shown in Figure 3.1. The lowest level
consists of WMI providers and managed objects. A WMI provider is a Component
Object Model (COM) object which monitors one or more managed objects for WMI.
A managed object is a logical or physical component, for example, a hard disk drive,
network adapter, database system, operating system, process, or service. A provider
handles the data and messages between WMI and managed objects. WMI providers
consist of a DLL ﬁle and a Managed Object Format (MOF) ﬁle that deﬁnes the classes
for which the provider returns data and performs operations. The middle level, the WMI
infrastructure, includes the WMI service and the WMI repository. The WMI repository is
organized by WMI namespaces. In Windows systems, the WMI service creates some namespaces
for system functionalities, for example, a namespace for system start-up. The remaining
namespaces found in the system are created by providers for other parts of the operating
system or products. The WMI service also provides the core functionality and acts as an
intermediary between the providers, management applications, and the WMI repository. On
the top level, there may be several WMI consumers: management applications or script
clients that interact with the WMI infrastructure. The clients can query and enumerate
data, run provider methods, or subscribe to events through the WMI COM API.
In order to support multiple node start-up for Open MPI on Windows, WMI has
been integrated as a component in the PLM framework. As shown in Figure 3.2, when
it is called to launch jobs on multiple nodes, the WMI PLM component first reads the
user-specified node list, which may be a list of remote host names or IP addresses.
As the required resources are always specified by the user, there is no need to allocate
resources explicitly, and a separate RAS component is not necessary. If no list of
hosts is specified, the job is started locally. Otherwise, the component tries to connect
to each remote host using the WMI service provider. When user credentials are needed, it
prompts for them on the command line or in a GUI. The user name and password
are then encrypted and passed to the standard Windows credential interface to create
connections to the namespaces of the remote hosts. After the connections have been
established, environment settings such as the installation path of Open MPI are queried
from the remote registry. With a valid path of Open MPI on the remote hosts, it is possible
to generate the commands that launch the Open MPI daemon on each host. The remote launch
commands are executed by the WMI service on each host, and the process IDs are returned.
There are special security requirements for multiple node start-up using WMI. The user must
have sufficient privileges to use the remote execution method. Furthermore, a few WMI
namespaces must be made readable across the network with the corresponding user
privileges [50].
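The launch procedure described above can be summarized in a short sketch. It is written in Python and does not use the WMI COM API: the function launch_job, the three callables it receives and the stand-in implementations are all hypothetical, and the daemon command line is only illustrative.

```python
from typing import Callable, Dict, List, Optional


def launch_job(
    hosts: List[str],
    connect: Callable[[str], Optional[object]],
    read_install_path: Callable[[object], str],
    execute: Callable[[object, str], int],
) -> Dict[str, int]:
    """Return the daemon process ID per host; an empty result means a local launch."""
    pids: Dict[str, int] = {}
    for host in hosts:  # with an empty host list, nothing is launched remotely
        connection = connect(host)  # a real tool may prompt for credentials here
        if connection is None:
            raise ConnectionError("cannot connect to " + host)
        install_path = read_install_path(connection)  # queried from the remote registry
        command = install_path + "\\bin\\orted"  # illustrative daemon command line
        pids[host] = execute(connection, command)
    return pids


# Stand-ins that make the sketch runnable without any remote machine.
commands: List[str] = []


def fake_connect(host: str) -> Optional[object]:
    return None if host == "down" else host


def fake_read_install_path(connection: object) -> str:
    return "C:\\OpenMPI"


def fake_execute(connection: object, command: str) -> int:
    commands.append(command)
    return 1000 + len(commands)


pids = launch_job(["node1", "node2"], fake_connect, fake_read_install_path, fake_execute)
print(pids)
```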
Figure 3.2: Work flow of the PLM component using WMI (flowchart steps: read command
line arguments; is there a host file?; launch job locally; get next host info; get user
credential; connect to host; connected?; error; read remote settings; prepare remote
command line; execute remote command line; the last host?)
Figure 3.3: Architecture of the job scheduler in Windows HPC Server (interface layer,
scheduler layer and execution layer; components: CCP COM API, C# API, command line
tools, applications, job manager, job scheduler service, scheduler listener, Cluster
Service, data layer, node manager)
3.3.2 Integration with CCP
The CCP is a development toolkit that Microsoft first released with Compute Cluster
Server 2003; it is now included by default in the Windows HPC Pack for Windows
HPC Server. It provides simplified interfaces for job submission and monitoring, as well
as ﬂexible and extensible job scheduling and cluster resource allocation. In addition,
CCP ensures secure process start-up and complete cleanup.
Figure 3.3 gives an overview of the architecture of the job scheduler in Windows HPC Server. The
node manager in the execution layer manages compute nodes. In the scheduler layer, the
core service for CCP is the job scheduler service, which controls resource allocation, job
execution, and recovery on failure. The cluster resources are allocated by job priority,
which ensures that high-priority jobs are added at the front of the queue. If jobs have
equal priority, resources are allocated to the job that was submitted ﬁrst. However,
there is no task priority within a job. Resources are allocated to tasks in the order that
they were added to the job. The job scheduler selects the best available nodes in the
cluster to run each job. A job may also specify a list of nodes on which it can be run,
and the job scheduler will choose the nodes from this list in the speciﬁed order. The job
scheduler service also ensures that a resource-intensive application will not delay other
applications that are ready to run. The job scheduler will schedule a lower-priority job if
a higher-priority job is waiting for resources to become available and the lower-priority
job can be ﬁnished with the available resources without delaying the start time of the
higher-priority job.
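The ordering and backfilling rules described above can be expressed compactly. The following Python sketch is an illustration only and not the CCP API: the class Job, its fields and the two functions are assumptions, and the real scheduler takes further criteria into account.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    priority: int   # a larger value means a higher priority
    submitted: int  # submission time stamp
    cores: int      # cores requested
    runtime: int    # estimated run time


def queue_order(jobs: List[Job]) -> List[str]:
    """Higher priority first; among equal priorities, first submitted first."""
    ordered = sorted(jobs, key=lambda j: (-j.priority, j.submitted))
    return [j.name for j in ordered]


def can_backfill(job: Job, free_cores: int, time_until_high_start: int) -> bool:
    """A lower-priority job may start early only if it fits into the free cores
    and finishes before the waiting higher-priority job is able to start."""
    return job.cores <= free_cores and job.runtime <= time_until_high_start


jobs = [Job("A", 1, 0, 2, 5), Job("B", 3, 2, 16, 60), Job("C", 3, 1, 8, 30)]
order = queue_order(jobs)
backfill = can_backfill(jobs[0], free_cores=4, time_until_high_start=10)
print(order, backfill)
```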
The Cluster Service is a .NET remote service. It provides cluster-wide settings, node-
related operations, job-related operations, task-related operations, and resource-usage
information. The scheduler listener provides communication between the node manager
and the job scheduler. It is called when a compute node starts or a task is ﬁnished
on a compute node [33]. The job manager in the interface layer helps administrators
perform approve, pause, resume, and remove operations. Meanwhile, the CCP API can
be directly called from applications and command-line tools to access the functionality
of the job scheduler [33]. The RAS and PLM components use this approach to launch
and manage jobs through the Open MPI libraries.
Figure 3.4 shows the run-time flowchart of the RAS component using CCP. When
allocating new resources, the component first obtains the full list of reserved resources
that were used for previous tasks. If the reserved resources are not sufficient for the
current job, additional nodes need to be allocated. The component connects to the head
node of the Windows HPC cluster, traverses all available nodes, and registers new nodes
for the job as required. If, after the last node has been checked, the
registered resources are still not sufficient for the job, an error message is generated
and the job is not submitted. Otherwise, once sufficient resources have been allocated,
all allocation information is written into the central registry of the RAS and
the job is submitted successfully.
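The allocation logic can be sketched as follows. This is a simplified Python model, not the CCP API: the function allocate and the representation of nodes as a mapping from node name to core count are assumptions made for illustration.

```python
from typing import Dict


def allocate(required_cores: int,
             reserved: Dict[str, int],
             available: Dict[str, int]) -> Dict[str, int]:
    """Return the nodes (name -> cores) allocated to the job."""
    allocation = dict(reserved)  # start from the nodes reserved by previous tasks
    for node, cores in available.items():  # traverse the nodes known to the head node
        if sum(allocation.values()) >= required_cores:
            break
        if node not in allocation:  # register only nodes that are new to this job
            allocation[node] = cores
    if sum(allocation.values()) < required_cores:
        raise RuntimeError("not enough resources; the job is not submitted")
    return allocation


cluster = {"n1": 4, "n2": 4, "n3": 4}
small = allocate(3, {"n1": 4}, cluster)   # the reserved node is sufficient
large = allocate(10, {"n1": 4}, cluster)  # additional nodes are registered
print(small, large)
```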
After the resources have been allocated successfully, the CCP PLM component is used
to launch the remote computation tasks. As shown in Figure 3.5, the PLM component
first generates a list of nodes that still have daemons left over from previous jobs and
resets the daemon counter. These existing daemons may be reused for the new job. The
component then connects to the head node of the cluster and configures the job details,
such as how many cores the job requires and how many processes are to be started on each
node. The full node list is compared with the generated list of reusable nodes, and
daemons are launched with the task configuration on the nodes where they are not yet
running. Finally, a daemon command is sent to each daemon on the remote nodes, and the
corresponding computation processes are started.
With the help of the new CCP RAS and CCP PLM components, the job status can
be monitored within the Windows HPC Monitor. Users may view the progress of running
jobs, and cancel or delete jobs.
3.4 High-Speed Network Support
In modern HPC, a computation task may run in parallel on thousands of compute nodes
or cores. The interconnect between the nodes and cores is one of the key factors for
performance. InfiniBand is an architecture and specification for transferring data among
processors and I/O devices with high bandwidth and low latency. It provides a switched
fabric communication link that is primarily used in HPC. It also offers features such as
quality of service and failover.
Figure 3.4: Work flow of the RAS component using CCP (flowchart steps: obtain list of
reserved resources; have enough resource?; connect to headnode; inquire next available
node; new node?; register new node; found enough?; the last node?; store allocation into
registry; success; error)
Figure 3.5: Work flow of the PLM component using CCP (flowchart steps: check reusable
daemons; reset daemon counter; connect to headnode; setup job parameters; inquire next
mapped node; has daemon?; add task; the last node?; queue job (start daemons); launch
applications by daemons)
Figure 3.6: Example of IBA architecture (process nodes with CPUs, memory and an HCA;
I/O subsystem nodes with a TCA; switches; a router to InfiniBand subnets, WAN or LAN)
Support for high speed network drivers in Open MPI has been implemented as MCA
components for UNIX and Linux environments, for example, openib for InfiniBand and mx
for Myrinet Express. For Windows, both Mellanox and OpenFabrics provide drivers
supporting InfiniBand. The WinOF driver supported by OpenFabrics has
been chosen and integrated into Open MPI, because it supports the basic libibverbs
interface that is commonly used under Linux, as well as many of the latest interfaces
published by Microsoft, such as WinVerbs. In this section, the basic technology of
InfiniBand is briefly introduced, and the work of integrating WinOF using the libibverbs
and WinVerbs interfaces is discussed.³
3.4.1 Introduction to InﬁniBand
InfiniBand was one of the solutions⁴ for breaking through the bandwidth and fanout
limitations of the bus by using a switched fabric architecture. InfiniBand Architecture
(IBA) is an industry standard architecture designed by the InfiniBand Trade Association
(IBTA) for server I/O and inter-server communication. The standard aims to provide the
levels of reliability, availability, performance, and scalability required by server systems.
The fundamental communications and management architecture is a System Area Network
(SAN), which supports both I/O and Inter-process communication (IPC) within one or
more computer systems.
³ This work was completed by Mr. Jie Hou in his Master's thesis under my supervision.
⁴ Among others, there existed solutions like PCI-X, PCI-DDR by Mellanox Technologies.
Figure 3.7: InfiniBand network layer abstraction (end nodes, switch and router across five
layers: upper layers with consumer operations between clients; transport layer with IBA
operations, messages (QP) and SAR; network layer with inter subnet routing (GRH); link
layer with subnet routing (LRH), flow control, link encoding and media access control
(MAC); physical layer; switches and routers relay packets)
In an InfiniBand fabric, each subnet has a unique subnet ID, which is known as the
subnet prefix. The subnet manager assigns the subnet prefix to the attributes of all ports
in the subnet, and also configures routers with information about
the subnet (such as which virtual lanes to use). The subnet prefix is also used to form
the identity of each port. A fabric consists of several kinds of devices and components,
such as links, channel adapters, switches, and routers, as in the example shown in
Figure 3.6. A link can be a copper cable, an optical cable, or printed circuit wiring on a
backplane, and connects channel adapters, switches and routers. Channel adapters are IBA
devices, comparable to network cards, that connect processor nodes and I/O units to the
fabric. They generate and consume packets for the communication within the system. A
channel adapter supports local and remote Direct Memory Access (DMA) operations. There
are two types of channel adapters: the Host Channel Adapter (HCA) and the Target Channel
Adapter (TCA). The HCA provides a consumer interface with the functions specified by the
IBA Verbs, whereas no Verbs are specified for the TCA. Switches interconnect links by
relaying packets between them. They may also consume packets themselves for configuration
and management. A switch may have more than one port, and each port of a
switch may have several virtual lanes, which can be configured by the subnet manager.
Like switches, routers can consume packets for configuration and management.
A router forwards packets based on their Global Route Header (GRH) and replaces the
Local Route Header (LRH) of each packet as it passes from one subnet to another.
IBA routers are therefore necessary for inter-subnet routing: they interconnect
subnets by relaying packets between them.
The IBA network abstracts the Open Systems Interconnection model (OSI model)
into five layers, as shown in Figure 3.7. Similar communication functions are grouped
into logical layers. An instance of a layer provides services to the layer above it
while receiving services from the layer below. The physical layer defines the symbols used
for framing and specifies how bits are placed on the wire to form symbols. In the physical
layer, the link width may be 1X, 4X or 12X; each individual link is a four-wire
serial differential connection (two wires per direction) that provides a full duplex
connection at 2.5 Gigabits per second. The link layer describes the packet format and the
protocols for point-to-point link operations. The data flow between the two ends of a
link is managed in this layer with a credit-based method: the receiver on a link
sends credits to the transmitter on the other end, and the sender does not send packets
unless the receiver indicates that its receive buffer has enough space. The network layer
is responsible for routing packets between subnets by identifying their headers and IDs.
It is controlled by the transport layer above it, which defines five types of transport
services: Reliable Connection, Reliable Datagram, Unreliable Connection,
Unreliable Datagram and Raw Datagram. The transport layer also handles the segmentation
of transaction data at the sending end and its reassembly at the receiving end. In the
upper layers, IBA supports several upper layer protocols, such as Internet Protocol over
InfiniBand (IPoIB), librdmacm and so on, which can be used by various consumers.
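The credit-based flow control of the link layer can be illustrated with a small simulation. It is a sketch in Python under simplifying assumptions (packets of equal size, one buffer slot per packet, discrete time steps); the function simulate and its parameters are hypothetical and do not reproduce the credit encoding of the IBA specification.

```python
from typing import List


def simulate(n_packets: int, buffer_slots: int, drain_per_step: int) -> List[int]:
    """Return the number of packets sent in each time step."""
    if drain_per_step <= 0:
        raise ValueError("the receiver must drain at least one packet per step")
    credits = buffer_slots  # the receiver initially advertises all free slots
    buffered = 0
    remaining = n_packets
    sent_per_step: List[int] = []
    while remaining or buffered:
        sent = min(remaining, credits)  # the sender never exceeds its credits
        remaining -= sent
        credits -= sent
        buffered += sent
        assert buffered <= buffer_slots  # the receive buffer cannot overflow
        drained = min(buffered, drain_per_step)
        buffered -= drained
        credits += drained  # freed slots are returned to the sender as credits
        sent_per_step.append(sent)
    return sent_per_step


schedule = simulate(n_packets=10, buffer_slots=4, drain_per_step=2)
print(schedule)
```

After the initial burst that fills the buffer, the sender is throttled to the rate at which the receiver frees buffer space.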
Communication in IBA is based on so-called Queue Pairs, each of which consists of
a send work queue and a receive work queue. Figure 3.8 shows the basic communication
process using Queue Pairs. A work queue is a buffer that holds a set of instructions
scheduled for hardware execution on the HCA. The send work queue holds instructions for
sending data, while the receive work queue holds instructions for receiving data from
another consumer. A communication operation is described in a work request, which is
submitted to the queue pair as a work queue element. A work queue element is placed in the
appropriate work queue and is executed by the channel adapter. When a work queue
element is completed, a completion queue element is placed in a completion queue.
Applications may check the completion queue to see whether a work request
has finished. In addition, a consumer may have its own set of queue pairs,
each of which is independent of the others.
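The queue pair model can be illustrated with a toy example. The sketch is written in Python and is not the verbs API: the class QueuePair and the functions post_send, post_recv and process are assumptions, and process plays the role that the channel adapters have in hardware.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Tuple


@dataclass
class QueuePair:
    send_queue: Deque[bytes] = field(default_factory=deque)      # send WQEs
    recv_queue: Deque[bytearray] = field(default_factory=deque)  # receive WQEs


def post_send(qp: QueuePair, data: bytes) -> None:
    qp.send_queue.append(data)  # work request becomes a work queue element


def post_recv(qp: QueuePair, buffer: bytearray) -> None:
    qp.recv_queue.append(buffer)


def process(sender: QueuePair, receiver: QueuePair,
            sender_cq: Deque[Tuple[str, int]],
            receiver_cq: Deque[Tuple[str, int]]) -> None:
    """Execute matching WQEs in order and report completions."""
    while sender.send_queue and receiver.recv_queue:
        data = sender.send_queue.popleft()
        buffer = receiver.recv_queue.popleft()
        buffer[:len(data)] = data  # the payload lands in the posted buffer
        sender_cq.append(("send", len(data)))    # completion queue elements
        receiver_cq.append(("recv", len(data)))


client, remote = QueuePair(), QueuePair()
client_cq: Deque[Tuple[str, int]] = deque()
remote_cq: Deque[Tuple[str, int]] = deque()
target = bytearray(8)
post_recv(remote, target)  # the receiver posts a buffer first
post_send(client, b"hello")
process(client, remote, client_cq, remote_cq)
print(bytes(target[:5]), list(client_cq), list(remote_cq))
```

The application only posts work requests and later polls the completion queues; the transfer itself is carried out without its involvement.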
3.4.2 BTL Implementations
There are several drivers for InfiniBand on Windows provided by different vendors, for
example, Mellanox WinOF [26] and OpenFabrics WinOF [46]. In order to
support high speed networks in Open MPI, we have implemented two MCA components
based on the OpenFabrics WinOF driver. The InfiniBand software architecture,
shown in Figure 3.9, consists of both kernel level and user level components,
which complement each other to provide an end-to-end solution. In the software
stack, the low level InfiniBand kernel driver module is hardware specific and ported
on top of the associated hardware. The rest of the architecture is hardware agnostic
and is divided into kernel-space software and user-space software. The normal way of
accessing network services is through the WinSock Direct provider, which defines a standard
interface between a Windows TCP/IP client application (such as an FTP client or a
web browser) and the underlying TCP/Internet Protocol (IP) protocol stack. It has
also been extended to support InfiniBand implicitly by calling the InfiniBand WinSock
Direct Provider Service Provider Interface (SPI), which calls the internal verbs
implementation to access the kernel API. This is an inefficient way of driving hardware
like InfiniBand. A better way is to make use of the so-called kernel
bypass feature. Normally, memory protection and virtual address translation are
handled by the operating system, which consumes a significant amount of CPU resources and
thus seriously impacts performance. InfiniBand implements these functions in hardware
so that they need not be performed by the kernel of the operating system. This
kernel bypass technique frees up the CPU, making cycles available for applications
rather than for low level operating system functions. Network Direct is another SPI on
the provider layer, which uses the internal InfiniBand verbs directly and performs kernel
bypass on top of verbs. The WinVerbs API has been developed for more direct access
to the InfiniBand hardware. User applications may call WinVerbs directly to get even
better performance than with the other APIs. Furthermore, an abstraction of the libibverbs
API, which is commonly used on Linux platforms, has also been implemented on top of the
WinVerbs API for Windows.
Figure 3.8: Example of IBA QP communication (a client process and a remote process, each
with a QP of send and receive queues holding WQEs and with CQEs for completions; the
transport engines and ports of the host channel adapter and the target channel adapter
exchange IB operations as IB packets over the fabric)
Figure 3.9: The software stack of WinOF (MPI application; WinSock provider and switch;
IB WinSock provider SPI and Network Direct provider SPI, each with a verbs based user
API; WinVerbs; kernel bypass and RDMA; access layer (kernel API); virtual bus driver;
TCP, IP and NDIS with GigE and IPoIB miniports; network hardware)
Open MPI has good support for InfiniBand on Linux through the libibverbs API in the
openib MCA component. The library libibverbs allows user processes to use RDMA
verbs as described in the InfiniBand Architecture Specification [2] and the RDMA
Protocol Verbs Specification [17]. This includes direct hardware access from user space
to InfiniBand adapters (kernel bypass) for fast path operations. The openib component has
been reused for direct InfiniBand support on Windows, which only required adapting the
system calls in openib to their Windows equivalents. In order to fully exploit the
performance of InfiniBand on Windows, a new MCA component has been implemented using the
WinVerbs API. The WinVerbs user-space library provides a transport-neutral RDMA
interface. It is designed to integrate with the Windows programming environment,
providing a COM interface and supporting asynchronous operations. The WinVerbs library
itself is threadless, leaving all threading control under the direction of the calling
application.
The kernel driver of WinVerbs handles input and output control (IOCTL) calls from
the WinVerbs user-space library. Its purpose is to export an asynchronous IOCTL
interface to user space and to provide support for general RDMA communication management.
Its lower interface is designed to support any RDMA-based device, such as InfiniBand.
Its upper interface is capable of providing a low-latency verbs interface. WinVerbs is
the lowest-level interface for applications. It provides the thinnest abstraction of the
hardware, while also exposing Windows-specific features such as overlapped operations for
asynchronous control. WinVerbs is also used by other APIs, such as Microsoft's
Network Direct interface and the Direct Access Programming Library.
Both openib and WinVerbs are integrated as MCA components in the Open MPI BTL
framework. Using either of these two implementations improves the communication
performance considerably compared with TCP or IPoIB connections. Figure 3.10 shows the
performance comparison of the openib and winverbs BTLs. The benchmark used here is the
NetPIPE benchmark, which will be introduced in detail in Section 5.2.3. The benchmark
was run ten times, and eight results were selected to calculate the arithmetic mean of
the data. The test was done on the Viscluster at HLRS, which has AMD Opteron 250
processors at 2.4 GHz (2 processors), 4.25 GB RAM, an InfiniHost MT25208 adapter and a
Windows Server 2008 HPC (64 bit) operating system. The figure shows that the latency of
both openib and winverbs is about 6.5 to 7.5 microseconds, and that the winverbs
component is 0.93 microseconds better than the openib component. However, the latency
of the winverbs component still needs to be optimized to reach the latency that the
hardware can achieve in principle, which is normally around 5 microseconds. The bandwidth
of the same test is shown in Figure 3.11, where winverbs gives a slightly higher
bandwidth than openib for messages larger than 100 KB.
[Figure 3.10: Latency of openib and winverbs on Windows. Latency [usec] over message
size [Byte]; series: Open MPI wv Windows, Open MPI openib Windows.]
[Figure 3.11: Bandwidth of openib and winverbs on Windows. Bandwidth [MB/s] over message
size [Byte]; series: Open MPI wv bandwidth, Open MPI openib bandwidth.]
[Figure 3.12: Latency of Microsoft MPI and Open MPI on Windows. Time [usec] over message
size [Byte]; series: Open MPI wv Windows, Open MPI openib Linux, Microsoft MPI Windows.]
Since the Viscluster is a dual-boot system, the performance of Open MPI on Windows and
Linux can be compared on the same hardware. Microsoft MPI is also taken into account. As
shown in Figure 3.12, Microsoft MPI has the best latency, about 4.5 microseconds, and
Open MPI on Linux achieves around 4.7 microseconds. Both show very good results
compared to the winverbs component on Windows. Figure 3.13 shows the bandwidth
results of the test. The winverbs component yields a much lower bandwidth when the
message size is larger than 10 KB. The comparison also shows that there is still room to
improve the performance of the winverbs component in the future.
[Figure 3.13: Bandwidth of Microsoft MPI and Open MPI on Windows. Bandwidth [MB/s] over
message size [Byte]; series: Open MPI wv Windows, Open MPI openib Linux, Microsoft MPI
Windows.]
3.5 A realization of a Windows Cluster
Computing resources are a significant factor for the simulation tasks at the Institute of
Technical Biology (ITB). Its researchers run large simulation jobs on the HPC platforms
offered by the High Performance Computing Center Stuttgart. However, the computation
budget allotted to their project is limited. They are therefore always looking for
cheap and easy solutions for running small or medium-sized jobs and for validating
the simulation process before putting it onto large systems. This reduces the chance
that a simulation running on an HPC cluster fails before it finishes. Small or
medium-sized simulations that do not have high priority may also be carried out on the
cheap platforms in order to save HPC computing resources.
The ITB offered a PC room used for tutorials, equipped with more than 20 computers
mainly for teaching or training. Most of the time the PCs are idle, i.e. power is on
but no one is using them. All the PCs run Windows XP and are connected by a
Local Area Network (LAN). The idea is to make the idle computers available for
parallel computing, especially for running parallel Gromacs jobs.
Gromacs was not originally developed for Windows. Although ported versions were
available, none of them was up to date or worked out of the box. Since the end of 2010
the developers have provided a CMake solution, which makes the build process for Windows
possible. However, the solution did not work easily on Windows: several dependencies
were missing, and CMake bugs were found in earlier versions. These issues have been
fixed, and a Windows 32 bit version of Gromacs together with a pre-built Open MPI was
provided to the ITB.
The software package was successfully installed on all the PCs. Students and
researchers may run their Gromacs jobs locally via shared memory or in parallel over TCP
connections. The computing power of this experimental cluster cannot compete with that
of a large HPC cluster; however, it gives the students and researchers a small test
environment where they may learn Gromacs or run small simulations.

4 Semantic Memory Checking Frameworks
4.1 Overview
In this chapter, the classes of MPI semantic memory errors are discussed and new
frameworks for memory debugging in Open MPI are introduced. One of the main objectives
is to extend and implement memory debugging tools and then to integrate them into
Open MPI as an MCA plug-in to support memory checking for different communication
modes. This makes it possible to detect memory errors in MPI parallel applications,
such as buffer overruns or non-standard usage of non-blocking communication, from
within the MPI library.
As described in Section 2.4 and 2.5, both Valgrind and Intel Pin may be used to
instrument user applications. However, they cannot be used directly for advanced memory
checking in MPI applications. In Section 4.3, we introduce an extension for Valgrind
that helps check MPI communication buffers. Intel Pin, on the other hand, provides only
a set of APIs that can be used for developing tools, and there is no shadow memory
mechanism available in the Pin libraries. Therefore, developing a completely new tool
was necessary. In Section 4.4 we describe the details of the newly developed tool based
on the Pin libraries, as well as its shadow memory mechanism.
4.2 MPI Semantic Memory Checking
4.2.1 Pre-communication memory checks
According to the MPI standards, the send and receive buffers in the different
communication modes are subject to strict limitations on read and write operations. The
results of a parallel computation depend heavily on how the buffers are used. Violating
the usage rules for buffers passed to the MPI library causes unpredictable behavior and
may lead to wrong results or even segmentation faults. In order to observe the
communication buffers of user MPI applications, several memory checking rules have been
implemented by integrating memory checking tools into Open MPI. In this section, the
scenarios of the communication buffer checks are discussed; the implementation of the
tools and their integration with Open MPI are introduced in the following sections.
For non-blocking communications, memory passed to send operations should
be checked for accessibility and definedness, while pointers in receive operations are
checked for accessibility only. Reading or writing to buffers of active non-blocking
receive operations and writing to buffers of active non-blocking send operations are
obvious bugs. As a result, buffers passed to non-blocking operations should (after the
above checks) be set to undefined within the MPI layer of Open MPI until the
corresponding completion operation is issued. When the application touches the
corresponding part of memory before completion with MPI_Wait, MPI_Test or one of the
multiple-completion calls, an error message should be issued. This setting of the
visibility is handled independently of the non-blocking MPI_Isend or MPI_Irecv function,
in the lower-level BTL layer, which is adapted to set the fragment in question to
accessible and defined so that the lower-level MPI functionality can send the user
buffer as fragments. Care has also been taken to handle derived data types and their
implications. Figure 4.1 shows this scenario as an example.
[Figure 4.1: Non-blocking buffer check. Timeline of Process 1 and Process 2 (user code
and MPI library) with MPI_Isend, MPI_Irecv, MPI_Wait and MPI_Finalize; legend: buffer
check, buffer readable, send/receive, buffer no access.]
The same check is also extended for collective communication in Open MPI, checking
whether the buﬀer is addressable for receiving or whether the buﬀer is deﬁned for sending.
Inter- and intra-communicators are also taken care of.
For one-sided communications, the MPI-2 standard defines that any conflicting
accesses to the same memory location in a window are erroneous (see [29], p. 112). If a
location is updated by a put or an accumulate operation, then this location cannot be
accessed by a load or another remote memory access operation until the updating
operation has completed on the target. If a location is fetched by a get operation, this
location cannot be accessed by other operations either. When a synchronization call
starts, the local communication buffer of the remote memory access call should not be
updated until the call has finished. User buffers of MPI_Put or MPI_Accumulate, for
instance, are set to not accessible when these operations are initiated and remain so
until the completion operation has finished. Valgrind will produce an error message if
there is any read from or write to the memory area of the user buffer before the
corresponding completion operation terminates. In Open MPI, there are two one-sided
communication modules, point-to-point and RDMA. According to the standard, the
corresponding checks have been implemented for MPI_Get, MPI_Put, MPI_Win_fence and
MPI_Accumulate in the point-to-point module.
[Figure 4.2: Non-blocking receive buffer usage check. Timeline of Process 1 and
Process 2 (user code and MPI library) with MPI_Isend, MPI_Irecv, MPI_Wait and
MPI_Finalize; labels: watch memory operations, check buffer usage.]
All the checks described above require the memory checking tool to register the send
and receive buffers when the communication request is started, and to deregister the
buffers when the communication has finished. During the whole communication, i.e.
the pre-communication phase, the memory checking back-end checks every memory
access of the client application and monitors all registered buffers for illegal
reads or writes. The shadow memory contains the read and write permissions of the
communication buffers. If an illegal buffer access happens during the communication,
the tool immediately reports it and shows a trace of the current application stack.
4.2.2 Post-communication memory checks
In MPI communications, data may be checked for whether they have been transferred
correctly in order to ensure correct calculation results. However, it is also important
to check whether the communicated data are used correctly. For example, there may be
communicated data that are never actually used or that are overwritten before their
value is read. Identifying such data helps to improve the communication performance.
In the post-communication memory check phase, all communicated data buffers are
monitored with regard to the order of read and write operations. More precisely, taking
non-blocking communication as an example (see Figure 4.2), a read as the first access
to the received buffer is counted as a good access, while a Write Before Read (WBR),
i.e. a write as the first access, is not, because the communicated data is overwritten
before any actual use. In order to achieve this, the Memcheck callback extension is
integrated into the Open MPI memchecker component to notify Open MPI of the
corresponding tasks based on the type of operation, i.e. read from or write to the
received buffer.
Figure 4.3 shows the overall data allocated to store the corresponding information
within a trivial loop of a receiver. One may see that the first element in buf is first
accessed as a read (good), while the second element is first accessed as a write (bad);
subsequent accesses do not change the outcome of the bad access, i.e. the communicated
data has been overwritten. Upon re-entry into MPI_Recv, our extension triggers a warning
of unused data, i.e. data that has been transferred over the wire but not properly
accessed on the receiving side.
[Figure 4.3: Buffers storing access information of communicated data. The figure shows
a receiver loop (while (...) { MPI_Recv(buf, len, ...); data = buf[0]; buf[1] = 42;
data += buf[1]; buf[1] = 43; }) together with the V-bit, read and write shadow
information for (buf,len) and the read and write callbacks triggered by each access.]
Another case that may occur in the post-communication phase is that the transferred
buffer is never used (read). Detecting this helps to improve the communication
performance. In order to implement these checks, the tool has to remember the states of
all received buffers and check them for unread buffers when MPI_Finalize is called.
As in the pre-communication phase, all communicated buffers on the receiving peers are
registered and monitored in the post-communication phase. The back-end of the memory
checking tool checks every memory access to the communicated buffers and records the
sequence of read and write operations in the shadow memory. The shadow memory is the
same one that was used for the pre-communication phase, but it is cleared for the
post-communication phase on each peer. If a Write Before Read (WBR) happens, the tool
immediately generates an error message indicating the source of the problem. Finally,
when the application calls MPI_Finalize, a summary is generated that states how many
WBRs happened and how many bytes were communicated but never accessed.
4.2.3 Semantic MPI memory errors by code examples
Memory problems may arise from coding mistakes, some of which are not easily
detectable by normal debuggers, for example the use of an uninitialized buffer or a WBR
on a receive buffer in parallel applications. In order to better understand the memory
errors that often occur in user applications, some code examples are shown in this
section, each with a short description. Most of the problems in these examples are
unlikely to be noticed by programmers or by normal debuggers, but they may be found with
the help of the new memory checking framework in Open MPI introduced in the following
sections.
• Wrong input parameters, wrongly sized send buﬀers:
char * send_buffer;
send_buffer = malloc(SIZE);
memset(send_buffer, 0, SIZE);
MPI_Send(send_buffer, SIZE+1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
The send buffer is allocated with SIZE bytes, but the send operation sends one more
element, which lies outside the buffer. This is a buffer overrun on the sending
process.
• Memory that is outside of receive buﬀer is overwritten:
buffer = malloc(SIZE);
memset(buffer, 0, SIZE); /* argument order: pointer, fill value, size */
MPI_Recv(buffer, SIZE+1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
This is another buffer overrun problem: the received data may be larger than the
receive buffer, so memory behind the buffer is overwritten. This may result in
unpredictable behavior of the application.
• Uninitialized input buﬀers:
int * buffer;
buffer = (int *) malloc (10*sizeof(int));
MPI_Send (buffer, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
In this example, the send buffer is allocated but never initialized, so undefined data
is sent and the computation result will be wrong. Some debugging tools, such as
Memcheck and Intel Inspector, warn about this problem. However, the information they
provide points to code within the MPI library and not to the faulty user code.
• Usage of the uninitialized MPI_ERROR-ﬁeld of MPI_Status:
MPI_Wait (&request, &status);
if(status.MPI_ERROR != MPI_SUCCESS) /* use undefined value */
return ERROR;
The MPI-1 standard defines the MPI_ERROR field in the status structure to be
undefined for single-completion calls such as MPI_Wait or MPI_Test (see [27], p. 22).
• Writing into the buﬀer of active non-blocking Send or Recv-operation or persistent
communication:
int buf = 0;
MPI_Request req;
MPI_Status status;
MPI_Irecv (&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
buf = 4711; /* write receive buffer during communication */
MPI_Wait (&req, &status);
This example shows a case of overwriting the receive buﬀer during the communi-
cation, which is strictly not allowed according to the MPI standard.
• Read from the buﬀer of active non-blocking Send-operation in strict-mode:
int inner_value = 0, shadow = 0;
MPI_Request req;
MPI_Status status;
MPI_Isend (&shadow, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
inner_value += shadow; /* read send buffer during communication */
MPI_Wait (&req, &status);
This example is similar to the previous one: the send buffer is read during the
communication. According to the MPI-1 and MPI-2 standards, it is not allowed to read the
send buffer during a non-blocking send operation. This restriction was removed in
MPI-2.2, since no known MPI implementation required this performance-related
feature; therefore the developed tool tests this situation only in the so-called strict
mode. This is further explained in Section 4.3.
• Write before read on received buﬀer:
MPI_Recv(buffer, 2, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
buffer[0] = sizeof(long);
result = buffer[1]*1000;
In an MPI communication, the received data is normally used for calculation.
Overwriting the received buﬀer may not be what the program intends to do and
may be a potential error.
• Buﬀer read and write of an active accumulate operation:
MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);
MPI_Accumulate(A, NROWS*NCOLS, MPI_INT, 1, 0, 1, xpose, MPI_SUM, win);
printf("\n%d\n",A[0][0]); /* Will produce a warning */
A[0][1] = 4711;
MPI_Win_fence(0, win);
In this accumulate operation, the specified buffer is read and then overwritten
before the operation is complete. The MPI standard requires that the accumulation
buffer remain inaccessible until the accumulation has finished.
• Write to the buﬀer of active get operation:
MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);
MPI_Get(A, NROWS*NCOLS, MPI_INT, 1, 0, 1, xpose, win);
A[1][0] = 4711; /* Will produce a warning */
MPI_Win_fence(0, win);
The buﬀer of an active get operation gets overwritten before the get operation is
ﬁnished. This violates the standard that the buﬀer should not be written before
the operation is complete.
• Transmitted data not used:
while(have_data) {
MPI_Recv(buffer, 2, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
sum+=buffer[1];
}
MPI_Finalize();
exit(0);
In this example, part of the received buffer is not used for any computation: the
final result only adds up the second element of each transmission. Although this does
no harm to the application itself, the result might not be what was intended.
Furthermore, transferring additional data that is never used increases the
communication overhead.
4.3 Valgrind memory debugging framework
4.3.1 Valgrind extensions
The current functionality of Valgrind Memcheck does not suffice for the advanced MPI
semantic memory checking described in Section 4.2.1 and 4.2.2. Furthermore, MPI-2.2
changed the behavior compared to previous versions in that it allows read access to
send buffers of non-blocking operations. In order to detect writes to active send
buffers, Memcheck has to know which memory regions are readable or writable. As a
result, a new extension based on the current Valgrind is necessary in order to support
this kind of memory checking with Open MPI.
Make Memory Readable and Writable
Making a user buffer only readable or only writable makes memory checking more flexible
and accurate, but this is not implemented in the current design of Valgrind Memcheck.
For MPI parallel memory debugging, this feature becomes important because of the MPI
standard: for example, MPI-2 defines that the send buffer of MPI_Send must be
completely inaccessible during the operation, whereas the MPI-2.2 standard removes the
read restriction and makes the buffer readable [40]. Two new Memcheck client requests
were implemented:
• VALGRIND_MAKE_MEM_READABLE(addr,len)
Make the user speciﬁed memory region readable.
• VALGRIND_MAKE_MEM_WRITABLE(addr,len)
Make the user speciﬁed memory region writable.
The new design uses the so-called Ordered Set (OSet) implementation from the Valgrind
source base. It stores the memory regions with their readable and writable flags
in a binary tree. Overlapping and multiple definitions of the same memory region are
handled to avoid redundant records. The basic search algorithm is a binary search on
the starting address of the memory region. When memory is accessed, the OSet table is
searched first; if the accessed memory matches an entry in the table, the recorded
state is checked to decide whether the read or write restriction is lifted for this
access.
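The lookup described above can be illustrated with a minimal sketch. It is not the
Valgrind OSet code: a sorted array stands in for the binary tree, all names (region_t,
region_set, region_access_ok, PERM_READ, PERM_WRITE) are hypothetical, and overlap
handling is simplified to "update an identical region, reject any other overlap".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; a sorted array replaces Valgrind's OSet for brevity. */
enum { PERM_READ = 1, PERM_WRITE = 2 };

typedef struct {
    uintptr_t start;  /* first address of the region */
    size_t    len;    /* length in bytes */
    int       perms;  /* bitwise OR of PERM_READ and PERM_WRITE */
} region_t;

#define MAX_REGIONS 64

typedef struct {
    region_t entries[MAX_REGIONS];  /* sorted by start, non-overlapping */
    size_t   count;
} region_table_t;

/* Binary search: index of the first entry whose start is greater than addr. */
static size_t upper_bound(const region_table_t *t, uintptr_t addr)
{
    size_t lo = 0, hi = t->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (t->entries[mid].start <= addr) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* Record permissions for [start, start+len). Returns 0 on success and
 * -1 if the table is full or the region overlaps a different region. */
int region_set(region_table_t *t, uintptr_t start, size_t len, int perms)
{
    assert(t != NULL && len > 0);
    size_t pos = upper_bound(t, start);
    if (pos > 0) {
        region_t *prev = &t->entries[pos - 1];
        if (prev->start == start && prev->len == len) {
            prev->perms = perms;  /* redefinition: update, no second record */
            return 0;
        }
        if (prev->start + prev->len > start) return -1;
    }
    if (pos < t->count && start + len > t->entries[pos].start) return -1;
    if (t->count == MAX_REGIONS) return -1;
    memmove(&t->entries[pos + 1], &t->entries[pos],
            (t->count - pos) * sizeof(region_t));
    t->entries[pos] = (region_t){ start, len, perms };
    t->count++;
    return 0;
}

/* Returns 1 if the access is permitted. Addresses outside every
 * recorded region are treated as unrestricted in this sketch. */
int region_access_ok(const region_table_t *t, uintptr_t addr, int op)
{
    size_t pos = upper_bound(t, addr);
    if (pos == 0) return 1;
    const region_t *r = &t->entries[pos - 1];
    if (addr - r->start >= r->len) return 1;  /* addr lies behind the region */
    return (r->perms & op) != 0;
}
```

In this sketch, a send buffer would be recorded with PERM_READ when a non-blocking send
starts, so that region_access_ok reports a write to it as a violation until the entry
is updated to PERM_READ | PERM_WRITE at completion.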
Callback Extension on Memory Access
On the other hand, in order to evaluate data transfers precisely, for example
whether the received data has been correctly accessed and used, functionality is needed
to monitor buffer accesses on user demand. To achieve this, two more Memcheck client
requests were implemented:
• VALGRIND_REG_USER_MEM_WATCH(addr,len,op,cb,data)
Registers a callback function for the user speciﬁed buﬀer on speciﬁc operation,
and the callback function will be called when the operation is triggered.
• VALGRIND_UNREG_USER_MEM_WATCH(addr,len,cb)
Remove the callback function for the memory region.
[Figure 4.4: Example of user application using Valgrind. The user application runs on
the Valgrind core and the Memcheck tool, whose routines memcheck_reg_mem(...),
memcheck_search(...), memcheck_cb(...) and memcheck_unreg_mem(...) serve the watch
requests. The user code shown in the figure is:]
void callback(...)
{
  printf("accessed %d times\n", ++cnt);
}
void main(void)
{
  int mem=100;
  MEMCHECK_REG_MEM_WATCH(&mem, sizeof(int), 1/*write*/, &callback, NULL);
  mem +=100;
  MEMCHECK_UNREG_MEM_WATCH(&mem,sizeof(int));
}
This Memcheck callback extension uses the Ordered Set to record the starting address of
the memory, the size of the block, the callback function pointer, the type of memory
operation that is being watched, and user callback data, which can be handed back
through the callback function. The search algorithm used for managing OSet entries is a
binary search with multiple keys. It first searches for an address matching the given
key in the OSet, then for a matching memory size, and finally for a matching callback
function pointer. When the OSet entry is found, the watched memory operation is compared
with the current operation, and the callback function is triggered upon a match. Special
care must be taken to do correct type casting in the invoked callback: as the user data
is always passed as a void pointer, it has to be cast correctly, otherwise run-time
errors may occur.
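The register, match and dispatch cycle can be sketched in plain C as follows. This is
an illustration only: the names (watch_register, watch_unregister, watch_notify,
count_access) are hypothetical and are not the Memcheck client requests, a linear scan
replaces the multi-key binary search in the OSet, and watch_notify stands in for the
analysis function that the instrumentation would call on each memory access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { WATCH_READ = 0, WATCH_WRITE = 1 } watch_op_t;
typedef void (*watch_cb_t)(const void *addr, size_t size, void *user_data);

typedef struct {
    uintptr_t  start;   /* starting address of the watched block */
    size_t     len;     /* size of the block in bytes */
    watch_op_t op;      /* operation type that triggers the callback */
    watch_cb_t cb;      /* callback function pointer */
    void      *data;    /* opaque user data handed to the callback */
    int        in_use;
} watch_t;

#define MAX_WATCHES 32
static watch_t watches[MAX_WATCHES];

/* Register a callback for one operation type on [addr, addr+len). */
int watch_register(const void *addr, size_t len, watch_op_t op,
                   watch_cb_t cb, void *data)
{
    assert(addr != NULL && len > 0 && cb != NULL);
    for (size_t i = 0; i < MAX_WATCHES; i++) {
        if (!watches[i].in_use) {
            watches[i] = (watch_t){ (uintptr_t)addr, len, op, cb, data, 1 };
            return 0;
        }
    }
    return -1;  /* table full */
}

/* Remove the entry matching address, size and callback pointer. */
int watch_unregister(const void *addr, size_t len, watch_cb_t cb)
{
    for (size_t i = 0; i < MAX_WATCHES; i++) {
        watch_t *w = &watches[i];
        if (w->in_use && w->start == (uintptr_t)addr &&
            w->len == len && w->cb == cb) {
            w->in_use = 0;
            return 0;
        }
    }
    return -1;
}

/* Stand-in for the analysis function: fire every callback whose block
 * overlaps the access and whose watched operation matches. */
int watch_notify(const void *addr, size_t size, watch_op_t op)
{
    uintptr_t a = (uintptr_t)addr;
    int fired = 0;
    for (size_t i = 0; i < MAX_WATCHES; i++) {
        const watch_t *w = &watches[i];
        if (w->in_use && w->op == op &&
            a < w->start + w->len && a + size > w->start) {
            w->cb(addr, size, w->data);
            fired++;
        }
    }
    return fired;
}

/* Example callback: the void pointer must be cast back to the type
 * that was registered (here: int *). */
void count_access(const void *addr, size_t size, void *user_data)
{
    (void)addr;
    (void)size;
    int *counter = (int *)user_data;
    (*counter)++;
}
```

Registering count_access with a pointer to an int counter and then notifying a write on
the watched variable increments the counter once; a read on the same variable leaves it
unchanged because only the write operation is watched.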
Figure 4.4 shows an example of how this callback extension works. The main function
defines a variable and then registers it in the OSet-based address map of Memcheck using
the macro MEMCHECK_REG_MEM_WATCH. Then the content of the variable is modified, which
triggers the Memcheck analysis function for the memory instruction. The memory address
is searched in the address map, and the callback function is triggered when the address
is found. The user-defined callback function, named callback, prints the access count
of the memory. Finally, the application deregisters the memory watch and exits.
4.3.2 Implementation and Integration with Valgrind
As explained in Section 2.2.4, the Open MPI project consists of three abstraction layers,
which were shown in Figure 2.1. The project uses the MCA as its foundation, which
provides services that can be used by the rest of the system. The architecture of the
MCA in Open MPI was described in Figure 2.2.
In order to provide the debugging functionality for user applications and also for Open
MPI itself, the new memory debugging features are integrated as an MCA framework
named memchecker. The component that integrates and wraps Memcheck is named
valgrind. The basic structure is shown in Figure 4.5.
[Figure 4.5: Integration of Memcheck into Open MPI. Layers shown: Operating System,
OPAL (Open Portable Access Layer), ORTE (Open Runtime Environment) and Open MPI, with
the Memchecker framework and its Valgrind component.]
To use this debugging framework in Open MPI, the valgrind component has to be enabled
in the configuration phase when building Open MPI via the option --enable-memchecker.
By default, the Valgrind installed on the system is searched for to build the memchecker
framework. If no Valgrind is found on the system, the build of the component fails. The
option --with-valgrind may be used to specify a different Valgrind installation.
Launching the user application with memchecker is easy. The application is simply
prefixed with valgrind and any valgrind parameters, for example:
mpirun -np 8 valgrind --num-callers=20 ./my_app inputfile
The valgrind module in the memchecker framework detects whether the user application
is running under the Valgrind tool. If the application is running without Valgrind,
the memchecker calls do nothing but simply return. Otherwise, the calls to
memchecker start the specific memory checking routines.
4.3.3 Implementation in Open MPI
In Open MPI, the Memcheck callback extension is integrated into the memchecker
component for checking user buffers in blocking and non-blocking communications. The
buffer registration is implemented within the Point-to-Point Management Layer (PML)
framework, which manages the packing and unpacking of message data for sending and
receiving. Implementing the buffer checks in the PML layer allows memchecker to trace
the receive buffer accesses in the PML layer as well as in user applications. By
tracking the flow of the received data, one may determine whether the received data has
been used correctly. In blocking and non-blocking communications, the receive buffer is
registered for read and write callbacks after the transferred data has been stored. When
MPI_Finalize is issued, the registered entries are removed.
[Figure 4.6: Extended shadow memory on watch buffer. Each byte of the buffer (byte-wise)
is shadowed by two flag bits (bit-wise), a read access flag and a write access flag. The
flag table lists the combinations 0 0, 0 1, 1 0 and 1 1 with the descriptions "Memory
not accessed", "Read before write", "Write without read" and "Write before read".]
For every registered buffer, every byte is shadowed with 2 bits of flags; Figure 4.6
shows an example of the shadow memory. Every registered buffer is shadowed with
memory of the same size, so that access information for each byte is stored at run time.
In the finalization stage, the access information of each byte is checked for
write-before-read problems and unused received data.
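A possible encoding of the two flag bits and of the check at finalization is sketched
below. The bit assignment and the transition rules are assumptions made for this
illustration (a write only sets its flag on a byte that has not been accessed yet, a
read always sets the read flag), and all names (shadow_t, shadow_on_read,
shadow_on_write, shadow_summarize) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed bit assignment, for illustration only. */
enum { FLAG_READ = 0x1, FLAG_WRITE = 0x2 };

typedef struct {
    unsigned char *shadow;  /* one shadow byte per buffer byte, two bits used */
    size_t         len;
} shadow_t;

typedef struct {
    size_t untouched;    /* bytes received but never accessed */
    size_t write_first;  /* bytes whose first access was a write */
} shadow_summary_t;

/* Allocate zeroed shadow memory of the same size as the buffer. */
int shadow_init(shadow_t *s, size_t len)
{
    assert(s != NULL && len > 0);
    s->shadow = calloc(len, 1);
    s->len = s->shadow ? len : 0;
    return s->shadow ? 0 : -1;
}

void shadow_free(shadow_t *s)
{
    free(s->shadow);
    s->shadow = NULL;
    s->len = 0;
}

/* A read always sets the read flag:
 * untouched -> read first, write only -> written first and read later. */
void shadow_on_read(shadow_t *s, size_t off)
{
    assert(off < s->len);
    s->shadow[off] |= FLAG_READ;
}

/* A write is recorded only as a first access; a write after a read
 * leaves the "read first" state untouched. */
void shadow_on_write(shadow_t *s, size_t off)
{
    assert(off < s->len);
    if (s->shadow[off] == 0)
        s->shadow[off] = FLAG_WRITE;
}

/* Finalization check: count untouched bytes and bytes written first. */
shadow_summary_t shadow_summarize(const shadow_t *s)
{
    shadow_summary_t r = { 0, 0 };
    for (size_t i = 0; i < s->len; i++) {
        if (s->shadow[i] == 0)
            r.untouched++;
        else if (s->shadow[i] & FLAG_WRITE)
            r.write_first++;
    }
    return r;
}
```

Replaying the receiver loop of Figure 4.3 on this sketch (read of buf[0], then write,
read and write of buf[1]) leaves element 0 in the read-first state and element 1 in
the written-first state, which the summary counts as one write-before-read byte.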
Using the memory callbacks certainly introduces extra overhead, which is mainly due to
the nature of Valgrind and to the sorting algorithm it uses. Currently the callback
extension is based on the Memcheck tool, which normally slows the original application
down by a factor of 10 to 80, sometimes even by more than 100. However, as the
extension relies only on the instruction translation mechanism in Memcheck, making it
a self-contained tool could remove most of the overhead.
On the other hand, the overhead when running the application without Valgrind
is minimal, even though the callback checks are executed. Care has been taken to
limit the overhead by using binary search trees in the back-end search algorithm.
A binary tree is fast to search, with an average cost of O(log N). However, searching
for an address range is not as simple as searching for a single value. The back-end
search algorithm uses two binary trees: one stores the start address of each registered
memory region and the callback function pointer; the other stores the end address of
each registered memory region and the callback function input parameters. Given a key
memory region, the search proceeds in four steps. First, all registered regions whose
start address is less than or equal to the start address of the key are collected.
Second, the end tree is searched for regions whose end address is greater than or equal
to the start address of the key, and the regions common to this result and the result
of the first step are determined, i.e. an AND operation on the two search results.
Third, two further searches find the start addresses that are less than or equal to the
end address of the key and the end addresses that are greater than or equal to the end
address of the key, and the same AND operation is applied to these two results. Finally,
the results of the second and third steps are merged, similar to an OR operation, which
yields the final set of matching regions. This optimized algorithm is much faster than
simply iterating over every registered memory region and comparing addresses,
especially when the key memory is small and does not generate multiple matches.
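The range query can be illustrated with a simplified variant. The sketch below is not
the two-tree implementation described above: it assumes that the registered regions do
not overlap each other and are kept in an array sorted by start address, so that a
single binary search finds the first candidate. It uses the general overlap condition
(region start <= key end and region end >= key start), which also reports regions lying
entirely inside the key. The names range_t and find_overlaps are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t start;  /* first address of the region */
    uintptr_t end;    /* last address of the region (inclusive) */
} range_t;

/* regions: sorted by start, mutually non-overlapping (assumption).
 * Stores the index of the first candidate in *first (if not NULL)
 * and returns the number of regions that overlap key. */
size_t find_overlaps(const range_t *regions, size_t n, range_t key,
                     size_t *first)
{
    assert(key.start <= key.end);

    /* Binary search for the first region with end >= key.start. */
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (regions[mid].end < key.start) lo = mid + 1;
        else hi = mid;
    }

    /* All following regions with start <= key.end overlap the key. */
    size_t count = 0;
    for (size_t i = lo; i < n && regions[i].start <= key.end; i++)
        count++;

    if (first != NULL)
        *first = lo;
    return count;
}
```

For a small key that touches at most one region, the cost is dominated by the binary
search, which matches the observation above that small keys without multiple matches
benefit most.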
4.4 Intel Pin tools debugging framework
4.4.1 MemPin Tool
As described in Chapter 3, Open MPI is now supported on the Windows platform. Valgrind,
however, is not available for Windows, which makes it impossible to use the newly
developed extension described in the previous section there. Intel Pin may be used to
build a tool similar to Valgrind for both Windows and Linux. Furthermore, Intel Pin
involves less overhead than Valgrind [24]. A new tool named MemPin has been designed on
top of the Intel Pin framework to meet this need.
The MemPin tool takes advantage of Intel Pin's instrumentation API to provide callbacks
on memory accesses directly in the user application. That is, the user can call the
MemPin API to register memory regions with specific callback function and parameter
pointers. When the user application is not running under the Pintool, all MemPin calls
are treated as empty macros and add no overhead to the application. When it is running
under the MemPin tool, Pin first reads the entire executable, and all MemPin calls are
replaced with the corresponding function calls defined inside MemPin. The generated
instrumented code is then executed, and MemPin observes and responds to the behavior
of the user application.
MemPin uses two modes of instrumentation, image and trace. Image instrumentation
is performed when the image is loaded. In this stage, all MemPin calls used in the user
application are replaced, and the main entry function, such as main, is instrumented
to start the trace engine and the callstack log of MemPin. The next stage mainly takes
care of memory accesses, callback functions and the user application callstack. Trace
instrumentation works on each basic block (BBL), and every memory operation in the
BBL is checked. When memory is read or written, the single instruction performing the
memory operation is instrumented with an analysis function that takes the memory
information as operands. In order to report exactly where the memory operation
happened, a callstack log engine is also instrumented in this stage. The callstack engine
is implemented using a simple C++ stack structure, which stores only the necessary
history of instruction addresses of the application and translates the addresses into
source information when required. The entry address of a newly called function is
pushed onto the stack at the call, and it is popped off at the end of the callee. To
achieve this, the tail instruction of each BBL has to be analyzed. More precisely, every
call and return instruction is instrumented to push onto and pop from the instruction
address stack.
¹ MemPin stands for Memory debugging with Intel Pin.
Macro                         Description
MEMPIN_RUNNING_WITH_PIN       Checks whether the user application is running under
                              Pin and Pintool
MEMPIN_REG_MEM_WATCH          Registers the memory entry for specific memory
                              operation
MEMPIN_UPDATE_MEM_WATCH       Updates the memory entry parameters for specific
                              memory operation
MEMPIN_UNREG_MEM_WATCH        Deregisters one memory entry
MEMPIN_UNREG_ALL_MEM_WATCH    Deregisters all the memory entries
MEMPIN_SEARCH_MEM_INDEX       Returns the memory entry index from the memory
                              address storage
MEMPIN_PRINT_CALLSTACK        Prints the current callstack to standard output or
                              a file
Table 4.1: MemPin macros for user application
MemPin defines several macros that can be called directly in the user application. The
implemented macros are listed in Table 4.1.
The macro MEMPIN_RUNNING_WITH_PIN is an empty instrumentation point: when the
application runs under MemPin, the call is replaced by the instrumentation function in
MemPin, which returns true. Otherwise the original macro simply returns false, indicating
that the macro comes from the original executable. Registration and deregistration
are handled with a C++ multimap that uses the memory address as the key. MemPin uses
the fast binary tree search of this container to find the key in order to insert, delete or
look up memory entries. However, registering multiple entries with the same address is
not allowed, since the parameters of a registered memory entry can easily be updated
with the macro MEMPIN_UPDATE_MEM_WATCH. In general, the memory referenced by a single
instruction is small, 1 to 16 bytes. Overlapping memory regions are also forbidden in
the address map, for ease of management. The mapped value of each multimap entry
contains the index number of the entry, the size of the memory region, the memory
operation being watched, the callback function pointer and its arguments.
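The address map described above might be organized along the following lines. This is a minimal sketch, not MemPin's actual code: WatchEntry, WatchRegistry, reg, unreg and find are hypothetical names, and the callback type simply mirrors the signature given in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

typedef int (*mempin_reg_cb_t)(void* addr, std::size_t size,
                               int offset, int is_write, void* cb_info);

// Hypothetical mapped value: index, region size, watched operation, callback data.
struct WatchEntry {
    int index;
    std::size_t size;
    int is_write;
    mempin_reg_cb_t cb;
    void* cb_info;
};

class WatchRegistry {
    std::multimap<std::uintptr_t, WatchEntry> map_;  // keyed by start address
    int next_index_ = 0;

    static std::uintptr_t key(const void* p) {
        return reinterpret_cast<std::uintptr_t>(p);
    }

public:
    // Rejects duplicate start addresses and overlapping regions.
    bool reg(void* addr, std::size_t size, int is_write,
             mempin_reg_cb_t cb, void* cb_info) {
        assert(size > 0);
        const std::uintptr_t start = key(addr);
        auto next = map_.lower_bound(start);  // first entry at or after start
        if (next != map_.end() && next->first < start + size) return false;
        if (next != map_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second.size > start) return false;
        }
        map_.emplace(start, WatchEntry{next_index_++, size, is_write, cb, cb_info});
        return true;
    }

    bool unreg(void* addr) { return map_.erase(key(addr)) > 0; }

    // Finds the entry containing addr and reports the offset into that region.
    const WatchEntry* find(const void* addr, int* offset) const {
        const std::uintptr_t a = key(addr);
        auto it = map_.upper_bound(a);
        if (it == map_.begin()) return nullptr;
        --it;  // last entry starting at or before addr
        if (a >= it->first + it->second.size) return nullptr;
        *offset = static_cast<int>(a - it->first);
        return &it->second;
    }
};
```

Because overlapping regions are rejected at registration time, a lookup only needs to inspect the single entry that starts at or before the accessed address.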
The callback function is deﬁned by the user with ﬁxed parameters:
typedef int (*mempin_reg_cb_t)(void* addr, size_t size,
int offset, int is_write, void* cb_info);
The parameter addr is the starting address of the memory operand and has a pointer
type. The parameter size is the size of the memory referenced by the instruction. The
offset parameter is the difference between the memory operand address and the
registered memory address. The parameter is_write indicates the type of the memory
operation, where 1 represents a write operation and 0 a read operation. Any other
information required by the callback function may be passed through the cb_info
pointer, which has to be defined and created by the user.
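A user callback matching this signature could look like the sketch below. The AccessStats structure and the function count_accesses are hypothetical, and the return value 0 is an assumption, since the meaning of the return codes is not specified here.

```cpp
#include <cassert>
#include <cstddef>

typedef int (*mempin_reg_cb_t)(void* addr, std::size_t size,
                               int offset, int is_write, void* cb_info);

// Hypothetical user data handed to the callback through cb_info.
struct AccessStats {
    int reads;
    int writes;
    int last_offset;
};

// Counts read and write accesses and remembers the most recent offset.
int count_accesses(void* addr, std::size_t size,
                   int offset, int is_write, void* cb_info) {
    (void)addr;  // not needed in this sketch
    (void)size;
    AccessStats* stats = static_cast<AccessStats*>(cb_info);
    if (is_write) {
        ++stats->writes;
    } else {
        ++stats->reads;
    }
    stats->last_offset = offset;
    return 0;  // assumed "no further action" value
}
```

The cb_info pointer keeps the callback free of global state, so the same function can serve several registered regions with separate statistics.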
Figure 4.8 shows an example user application running under MemPin observation. The
main function first defines a variable and then registers it in the MemPin address map
using the macro MEMPIN_REG_MEM_WATCH. Then the variable content is modified, which
triggers the MemPin analysis function on the memory instruction. The memory address is
searched for in the address map, and the callback function is invoked when the address
is found. The user-defined callback function is named callback; it does nothing but
return a predefined value that tells MemPin to print the current application callstack.
Finally, the application deregisters the memory watch and exits.
[Block diagram. Labels: Open MPI, Memchecker, Callback Impl, MPI Routines, set mem
state, buffer usage, reg callback, r/w counts, WBR check, PIN/Valgrind, MemPin/Memcheck,
Application Code, Callback, Search Engine, Mem Reg Engine, Memory Management, Memory
Storage, Registration, Translation Engine, insert, delete.]
Figure 4.7: Run-time structure of MemPin
User application:
    int callback(...)
    {
        return MEMPIN_PRINT_CALLSTACK;
    }
    void main(void)
    {
        int mem=100;
        MEMPIN_REG_MEM_WATCH(
            &mem, sizeof(int), 1/*write*/,
            &callback, NULL);
        mem +=100;
        MEMPIN_UNREG_MEM_WATCH(&mem,sizeof(int));
    }
MemPin Tool:
    mempin_reg_mem(...);
    mempin_cb(...);
    mempin_search(...);
    mempin_print_callstack();
    mempin_unreg_mem(...);
PIN Core
Figure 4.8: Example of user application using MemPin
Optimized MemPin
When large applications are run with MemPin, performance is affected heavily, so a
good optimization of MemPin is essential. The optimization is done in such a way that
the main functionality of MemPin is not affected; only the most inefficient feature is
disabled. The callstack engine in MemPin instruments every entry and exit of functions
and branches, which allows MemPin to maintain a vector of instruction addresses that
follows the application context. Instruction addresses are inserted (pushed) into and
removed (popped) from this vector when an entry or exit instruction is detected. Only
on request are the collected instruction addresses translated into source information
and printed from bottom to top. This callstack mechanism consumes many CPU cycles
because of the heavy management of the address collection and the recursive address
translation. In the optimized version, the callstack mechanism is therefore replaced by
translating and printing only the address of the currently executing instruction. Users
may enable or disable this option to obtain either better performance or detailed
callstack information. Normally, printing the source location of the current instruction
is sufficient to track down the source of a memory problem. Section 5.2.3 shows the
performance improvement of the optimized MemPin.
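The push and pop behavior of the callstack engine can be illustrated with a small vector-based stack. This is a sketch only: ShadowCallStack and its member names are hypothetical, the Pin instrumentation hooks that would call onCall and onReturn are omitted, and address-to-source translation is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical shadow call stack fed by instrumented call/return instructions.
class ShadowCallStack {
    std::vector<std::uintptr_t> frames_;  // bottom of the stack is frames_[0]

public:
    // Would be invoked by the analysis routine attached to a call instruction.
    void onCall(std::uintptr_t entryAddress) { frames_.push_back(entryAddress); }

    // Would be invoked by the analysis routine attached to a return instruction.
    void onReturn() {
        if (!frames_.empty()) frames_.pop_back();
    }

    std::size_t depth() const { return frames_.size(); }

    // Addresses from bottom to top, ready for translation when requested.
    const std::vector<std::uintptr_t>& frames() const { return frames_; }
};
```

Every call and return pays for a vector operation in this scheme, which illustrates why replacing it with a single translation of the current instruction address reduces the overhead.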
Other Possibilities of Using MemPin
The Intel Pin library already provides a rich and powerful API set for cache analysis [20].
Table 4.2 shows several basic cache trace functions in the Pin library. The MemPin tool
was originally developed for memory access analysis, but it may be extended with
other checking functionality. Cache-line access analysis may be implemented in the
current tool. Integrated with the current MemPin callback mechanism, a detailed cache
analysis may become possible. For example, the cache analysis could collect cache
information such as the cache size, cache misses and cache hits. Such a cache analysis
tool may help users to improve the overall performance of their applications.
4.4.2 Implementation and integration with MemPin
Similar to the integration of Valgrind, MemPin was also integrated as an MCA component
in Open MPI, as shown in Figure 4.9. However, its implementation is different. As MemPin
is able to detect and implement specific functions in the image load stage, the API
functions in Open MPI can be much simpler. For example, in the Valgrind integration the
function for registering a memory watch has to call directly into the Memcheck core. In
the MemPin integration, the function is defined empty with the necessary parameters, and
MemPin automatically detects the function name and performs the corresponding operation
internally with the provided parameters.
Function                       Description
AddCacheInitFunction           Adds a function that gets called once, when the
                               code cache is first formed.
AddCacheBlockFunction          Adds a function that gets called whenever a new
                               cache block is formed.
AddFullCacheFunction           Adds a function that gets called whenever the
                               cache fills up.
AddCacheFlushedFunction        Adds a function that gets called whenever the
                               cache is physically flushed.
AddCodeCacheEnteredFunction    Adds a function that gets called whenever control
                               enters the code cache.
AddCodeCacheExitedFunction     Adds a function that gets called whenever control
                               exits the code cache.
Table 4.2: Intel Pin API for cache instrumentations
[Layer diagram. Labels: Operating System; ORTE - Open Runtime Environment; Open MPI;
OPAL - Open Portable Access Layer; Memchecker; Valgrind; MemPin; ...]
Figure 4.9: Integration of MemPin into Open MPI
MemPin does not provide functionality for checking whether memory is defined or
addressable, so the corresponding integrated APIs have no effect. This was also not
among the original goals of this work. Nevertheless, such checks could still be
implemented in the MemPin tool, but this would require separate storage for the A/V bits,
just as in Valgrind, which would introduce considerable overhead at run time.
4.4.3 Implementation in Open MPI
Unlike Valgrind, MemPin has no information about whether memory is readable or
writable. However, similar functionality for making memory readable and writable has
been implemented in Open MPI using the callback mechanism of MemPin. In the callback
function defined in Open MPI, the same memory state bits are used for both pre- and
post-communication checks. For pre-communication checks, the two bits of memory
state are used for marking the memory as readable or writable, as shown in Figure 4.10.
If the first bit is set to 1, the memory is marked as not readable. The second bit is the
writable bit, where a 1 means not writable. When both bits are set to 1, the memory is
marked as not accessible at all.
Flag (read access flag, write access flag)    Description
0 0                                           Read and write are both allowed.
0 1                                           Write is not allowed.
1 0                                           Read is not allowed.
1 1                                           Read and write are both not allowed.
Each byte of the buffer (byte-wise) has one two-bit flag in the shadow memory (bit-wise).
Figure 4.10: Shadow memory for pre-communication check with MemPin
Flag (read access flag, write access flag)    Description
0 0                                           Memory not accessed.
0 1                                           Write without read.
1 0                                           Read before write.
1 1                                           Write before read.
Each byte of the buffer (byte-wise) has one two-bit flag in the shadow memory (bit-wise).
Figure 4.11: Shadow memory for post-communication checks with MemPin
For post-communication checks, the same bit table is used in order to save storage, but
the two bits have different meanings, see Figure 4.11. The first bit indicates whether the
byte of memory has been read. The second bit indicates whether the byte of memory has
been written. For each registered receive buffer, the callback function checks whether it
is read before it is written; otherwise a WBR error is reported. Furthermore, in the
MPI_Finalize call, all memory state bits are checked for buffers that have not been used
after communication.
These two phases of memory checks in MPI communication switch automatically from
one to the other. If a parallel application performs several communications, the
pre-communication check is enabled whenever a communication is started, for example
by calling MPI_Isend. When the communication is finished, for example when MPI_Wait
is called, the post-communication check is executed. The shadow memory for both
checks does not need to be reallocated, as both use the same format and differ only in
the meaning of the bits. All registered memory checks are cleared in MPI_Finalize, which
is the end of the parallel computation.
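The shared two-bit shadow memory can be sketched as below. Everything in this example is an assumption made for illustration: the names ShadowBits, preCheckAllowed, postRecord and countUntouched are hypothetical, the "first" bit is taken to be the higher of the two bits, four buffer bytes are packed into one shadow byte, and a WBR error is interpreted as a write to a received byte that has not been read yet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed encoding: first (higher) bit = read flag, second (lower) bit = write flag.
enum : std::uint8_t { WRITE_FLAG = 0x1, READ_FLAG = 0x2 };

class ShadowBits {
    std::vector<std::uint8_t> bits_;  // two bits per buffer byte
    std::size_t size_;

public:
    explicit ShadowBits(std::size_t n) : bits_((n + 3) / 4, 0), size_(n) {}

    std::size_t size() const { return size_; }

    std::uint8_t get(std::size_t i) const {
        assert(i < size_);
        return static_cast<std::uint8_t>((bits_[i / 4] >> ((i % 4) * 2)) & 0x3u);
    }

    void set(std::size_t i, std::uint8_t flags) {
        assert(i < size_);
        const unsigned shift = static_cast<unsigned>(i % 4) * 2;
        const unsigned cleared = bits_[i / 4] & ~(0x3u << shift);
        bits_[i / 4] = static_cast<std::uint8_t>(cleared | ((flags & 0x3u) << shift));
    }

    // Reuses the same storage when switching between the two check phases.
    void fill(std::uint8_t flags) {
        for (std::size_t i = 0; i < size_; ++i) set(i, flags);
    }
};

// Pre-communication phase: a set bit forbids the corresponding access.
bool preCheckAllowed(const ShadowBits& s, std::size_t i, bool isWrite) {
    return (s.get(i) & (isWrite ? WRITE_FLAG : READ_FLAG)) == 0;
}

// Post-communication phase: records the access, returns true on write-before-read.
bool postRecord(ShadowBits& s, std::size_t i, bool isWrite) {
    const std::uint8_t old = s.get(i);
    s.set(i, static_cast<std::uint8_t>(old | (isWrite ? WRITE_FLAG : READ_FLAG)));
    return isWrite && (old & READ_FLAG) == 0;
}

// Bytes never accessed after communication (flag 0 0), e.g. for a final report.
std::size_t countUntouched(const ShadowBits& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s.get(i) == 0) ++n;
    }
    return n;
}
```

Switching phases only requires resetting the flags with fill, which mirrors the statement that the shadow memory does not have to be reallocated.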
The back-end search algorithm differs slightly from the one used for the Valgrind
extension. The data structure in the MemPin search engine is a standard C++ multimap,
which provides easy and powerful methods. However, the idea behind the data structure
is the same as described in Section 4.3.3, in order to minimize the overhead when
running with the tool.
5 Performance Implication and Real Use Cases
5.1 Overview
For parallel applications, the communication speed for small messages is dominated
by the overhead in the communication layers, meaning that the transmission time is
latency bound. For larger messages, the communication rate becomes bandwidth limited
by the subsystem, e.g. PCI buses, network card links or network switches. Adding
instrumentation¹ to the code certainly induces a performance hit due to the assembler
instructions explained in Chapter 4, even when the application is not run under the
debugging tools². This chapter mainly focuses on the performance of Open MPI with the
memory debugging features integrated. It shows that integrating the debugging features
has little performance impact on normal runs of the applications and benchmarks.
The performance is evaluated using Open MPI built with the memchecker framework
and Open MPI built without memchecker. The test benchmarks and applications are
run with and without the supervision of the debugging tools. The test environments are
the NEC Nehalem cluster, the BWGrid cluster and Viscluster. The Nehalem cluster has 700
dual-socket quad-core nodes, each with Intel Xeon (X5560) Nehalem 2.8 GHz CPUs,
8 MB cache and 12 GB memory. The node-to-node interconnect uses InfiniBand
and Gigabit Ethernet. The BWGrid cluster has 545 nodes in total, with Intel Xeon
5150 and Intel Xeon E5440 processors. A small number of IBM Cell and Nvidia Quadro
FX 5800 devices are also installed on the BWGrid cluster. InfiniBand and GigE are used
as interconnects. Both the Nehalem and BWGrid clusters run Scientific Linux and provide
a wide range of software and tools, such as the Intel compiler, Open MPI and DDT.
Viscluster basically has 9 nodes, each with an AMD Opteron 250 2.5 GHz CPU and 4.25 GB
memory. It also uses InfiniBand and GigE as the interconnect among the nodes.
The operating system on Viscluster is Windows 2008 Server with the HPC package
installed. The installed software includes Visual Studio 2008, CMake, NSIS and Intel
Parallel Studio XE 2011.
¹ In the rest of this chapter, instrumentation refers to the newly developed memory
checking frameworks, which have been integrated into Open MPI.
² In this chapter, debugging tools refers only to Valgrind and MemPin.
This chapter has three main parts. In the first part, three widely used benchmarks are
tested under the following conditions: (i) run with Open MPI compiled with the memory
checking frameworks but without supervision of the debugging tools, which shows the
minimum overhead added by the frameworks; (ii) run with Open MPI compiled with the
memory checking frameworks and with supervision of the debugging tools, which
compares the performance of using the debugging tools and frameworks together; (iii) run
with Open MPI compiled without the memory checking frameworks and without
supervision of the debugging tools, which is a plain Open MPI test run. Each test is run
five times; the best and worst results are excluded and the remaining three are averaged.
With these basic configurations, the performance results are collected and compared in
order to give a detailed view of the performance implications of the new tools and
frameworks.
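The averaging rule for the five runs can be written down in a few lines. The function name trimmedAverage is hypothetical and the sketch is not taken from the benchmark scripts used in this work; it only restates the rule of dropping the best and the worst result.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Average of the run times after discarding the smallest and the largest value.
double trimmedAverage(std::vector<double> runs) {
    assert(runs.size() >= 3);
    std::sort(runs.begin(), runs.end());
    const double sum = std::accumulate(runs.begin() + 1, runs.end() - 1, 0.0);
    return sum / static_cast<double>(runs.size() - 2);
}
```

With five measurements this leaves exactly three values, so a single outlier run, for example one disturbed by other activity on the node, does not influence the reported average.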
The second part gives an example of using the memory checking tools and frameworks
with a 2D heat conduction algorithm. This algorithm is based on the Parallel
Computational Fluid Dynamics (CFD) Test Case [48]. It solves the partial differential
equation for unsteady heat conduction over a square domain. The test results show that
sending extra data that is not used at all in the computation is also a factor that slows
down communication. Using the newly implemented debugging tools, data that is sent
but not used can be found, and the communication speed can be improved.
The third part shows another real example, a Molecular Dynamics (MD) simulation
using Gromacs. Gromacs is a versatile package for molecular dynamics, primarily
designed for biochemical molecules such as proteins, lipids and nucleic acids, which
have many complicated bonded interactions. It simulates the Newtonian equations of
motion for systems with hundreds to millions of particles, with fast calculation of the
non-bonded interactions. Gromacs is widely used for research on biological and
non-biological systems. The simulation results show that running the application with
the memory debugging tools does reveal several problems.
The work in this part was carried out in close cooperation with Professor Jürgen Pleiss
of the Institute of Technical Biochemistry at the University of Stuttgart. Sascha Rehm
and Sven Benson from the Institute gave a lot of help with this work, including the
inputs for the example code run with Gromacs, and they also helped me to understand
the examples. With the provided examples, we were able to run Gromacs and to see the
performance implications of running with and without the debugging tools.
5.2 Performance Implication and Benchmarks
5.2.1 Intel MPI Benchmark
The Intel MPI Benchmark (IMB) suite is a popular performance benchmark in the HPC
community. It provides a concise set of elementary MPI and MPI-2 benchmark kernels.
The suite is written in ANSI C plus standard MPI, and the subset of supported
benchmarks to run can be selected on the command line. In standard mode, the message
sizes are 0, 1, 2, 4, 8, ... bytes, up to a maximum of 4 MB. There are three classes of
benchmarks, namely single transfer, parallel transfer and collective benchmarks. This
section shows and compares the performance results of the IMB for the MPI-1 and MPI-2
standards.
[Plots: (a) "IMB-3.2 Alltoall on 2 nodes over TCP", (b) "IMB-3.2 Alltoall on 2 nodes over
InfiniBand". Axes: Time [usec] over Message Size [Byte]. Curves: with instrumentation,
without instrumentation.]
(a) Over TCP connection (b) Over InfiniBand connection
Figure 5.1: IMB benchmark Pingpong test on two nodes of BWGrid
Figure 5.1 shows the MPI_Alltoall test of the IMB collective benchmarks over TCP
(Figure 5.1(a)) and InfiniBand (Figure 5.1(b)) connections between two nodes of
Viscluster. The nodes run Windows HPC 2008. As can be seen, the overhead introduced
by the memory checking instrumentation is no more than 4% for the MPI_Alltoall
test on both TCP and InfiniBand connections. The integration of the memory checking
frameworks does not introduce any large overhead on Windows platforms.
Similar tests have also been run on the BWGrid cluster at HLRS. Bi-directional put and
get are used, both in aggregate mode, i.e. both tests run with varying transfer sizes in
bytes, issued by the corresponding one-sided communication call, and timings are
averaged over multiple samples. The bi-directional benchmarks are exact equivalents of
the message passing PingPing. All tests were run under the same conditions as mentioned
above.
Figure 5.2 presents the average time of the bi-directional get and put tests with
and without the memchecker implementation, running without Valgrind. The
performance of MPI_Get (see Figure 5.2(a)) in these cases is nearly identical; the run
with the memchecker implementation loses only about 1% in run time. For MPI_Put (see
Figure 5.2(b)), the result is similar to that of MPI_Get. Notably, however, MPI_Put
generally performs better than MPI_Get. Several factors affect the performance of
MPI_Put transfers, for example the choice of window location and the shape and location
of the origin and target buffers. Transfers to a target window in memory allocated by
MPI_ALLOC_MEM may be much faster on shared memory systems; transfers from
contiguous buffers will be faster on most systems; the alignment of the communication
buffers may also impact performance [29, p. 114].
[Plots: (a) "IMB-3.2 get over TCP", (b) "IMB-3.2 put over TCP". Axes: Time [usec] over
Message Length [Byte]. Curves: with instrumentation, without instrumentation.]
(a) Bi-directional get connection (b) Bi-directional put connection
Figure 5.2: IMB benchmark Bi-directional get and put on two nodes of Viscluster
5.2.2 NAS Parallel Benchmark
The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate
the performance of parallel computers. They are MPI-based implementations written and
distributed by NASA Advanced Supercomputing (NAS). The benchmarks are derived
from CFD applications and consist of five kernels and three pseudo-applications. The
BT benchmark, which is one of the CFD applications, has been successfully ported to
and built on Windows platforms. This section focuses on the BT benchmark tests on the
BWGrid cluster and presents and discusses the performance results.
The BT benchmark has several classes, which differ in complexity and data size.
The BT algorithm solves three sets of uncoupled systems of equations,
first in the x, then in the y, and finally in the z direction. The Class A and Class B
benchmarks were chosen and run on the BWGrid cluster with and without observation by
the main debugging tool (Valgrind in this case). The Class A (size 64x64x64) and Class B
(size 102x102x102) tests were run with the standard parameters (200 iterations, time
step dt of 0.0008). Two versions of Open MPI were evaluated with the benchmark:
Open MPI built without the memchecker framework, and Open MPI built with the
memchecker framework. Figure 5.3 shows the performance of the BT benchmark runs.
As expected, this benchmark shows no performance difference whether the
memory checking instrumentation is added or not (see Figure 5.3(a)). Due
to the large memory requirements, the execution shows the expected slow-down when
running under Valgrind, as every memory access is checked (see Figure 5.3(b)).
5.2.3 NetPIPE
NetPIPE [53] is a protocol-independent performance tool. It maps the performance of
a network across a wide range. Its protocol independence allows for visualization of
the overhead associated with a protocol layer. NetPIPE visually represents the network
performance under a variety of conditions by performing simple ping-pong tests,
bouncing messages of increasing size between two processes, either across a network or
within an SMP system. Message sizes are chosen at regular intervals, with slight
perturbations, to provide a complete test of the communication system. Each data point
involves many ping-pong iterations to provide accurate timing. Latencies are calculated
by dividing the round-trip time in half for messages smaller than 64 bytes.
[Plots titled "NAS Parallel Benchmarks 2.3 -- BT Benchmark". (a) Time [sec] for Class A, 4;
Class A, 9; Class B, 4; Class B, 9. Series: plain, memchecker/No MPI object checking.
(b) Time [sec] for Class A, 9; Class B, 9. Series: plain with valgrind, memchecker/no
MPI-object checking with valgrind.]
(a) Running without Valgrind (b) Running with Valgrind
Figure 5.3: NAS Parallel BT Benchmark performance
NetPIPE can also measure the MPI communication layer running over TCP/IP,
shared memory or InfiniBand connections. Use of the MPI interface for NetPIPE
depends on the MPI implementation being used. We measure the performance of two
different versions of Open MPI, i.e. Open MPI built without the memory checking
framework and Open MPI built with the memory checking framework, in order to compare
and analyze the overhead introduced by the framework. TCP and InfiniBand connections
are tested using the tcp and winverbs BTLs in Open MPI. The benchmark was run on two
nodes of the Viscluster at HLRS, on both TCP and InfiniBand connections, on Windows
HPC 2008 cluster nodes.
The tests were run without the control of the MemPin tool. Figure 5.4 shows the latency
and bandwidth of both Open MPI versions over a TCP connection. The additional latency
(see Figure 5.4(a)) incurred by the memchecker framework ranges from 1.2%
to 2.5%, which is hardly noticeable. Both versions also show nearly the
same bandwidth (see Figure 5.4(b)) with NetPIPE over TCP. As described in
Section 4.4.3, the integration of MemPin in Open MPI adds a few empty macros, which
have minimal effect on the speed of execution.
The latency using an InfiniBand connection is shown in Figure 5.5. For these test runs,
the winverbs framework was used to obtain high benchmark performance. The latencies
of the compared Open MPI versions (see Figure 5.5(a)) are around 3 microseconds, and
the difference between them is only about 0.1-0.3 microseconds. The bandwidths (see
Figure 5.5(b)) of the two tests are nearly the same. The introduced overhead
is extremely small and acceptable for users and developers.
[Plots: (a) "NetPIPE Latency - TCP (Viscluster)", Time [usec] over Message Size [Byte];
(b) "NetPIPE Bandwidth - TCP (Viscluster)", Bandwidth [Mbps] over Message Size [Byte].
Curves: with instrumentation, without instrumentation.]
(a) Latency over TCP connection (b) Bandwidth over TCP connection
Figure 5.4: NetPIPE TCP latency (left) and bandwidth (right) comparison of Open
MPI compiled with and without the memchecker framework
In order to compare the overhead when the memory checking frameworks are active,
performance has also been measured with two compute nodes of the Nehalem cluster.
Figure 5.6 shows the results of running the NetPIPE benchmark with memchecker-enabled
Open MPI over a TCP connection. Two tests were run, under the supervision of
Valgrind and of the optimized MemPin (see Section 4.4.1) respectively. The latency
(Figure 5.6(a)) with MemPin is nearly 50% lower than with Valgrind. The bandwidth
difference between the two tests grows considerably as the message size
increases, as shown in Figure 5.6(b). The memchecker framework based on MemPin
thus performs better than the framework based on Valgrind.
Another two tests were run with the same parameters and configuration, but using an
InfiniBand connection, as shown in Figure 5.7. The performance of the memchecker
framework based on MemPin is far better than that of the framework based on Valgrind.
The latency is about 80% better, as shown in Figure 5.7(a).
Running the benchmark with the debugging tools introduces extra overhead. Running
with MemPin is about four times slower than running plain Open MPI without
supervision of the debugging tools (see Figure 5.8(a)), while running with Valgrind is
more than 30 times slower. The bandwidth comparison is shown in Figure 5.8(b), where
MemPin reaches the same bandwidth as the plain test for very large data sizes.
5.3 A 2D Heat Conduction Algorithm as a Use Case
In Section 1.2.4, a common approach to designing a parallel program was introduced,
namely domain decomposition. This technique is important for porting an existing
single-processor algorithm to a parallel context. A good domain decomposition
normally leads to good load balance and fast communication among sub-domains,
while a bad decomposition may result in load imbalance and heavy communication, and
thus in poor overall performance.
[Plots: (a) "NetPIPE Latency - Infiniband (Viscluster)", Time [usec] over Message Size
[Byte]; (b) "NetPIPE Bandwidth - Infiniband (Viscluster)", Bandwidth [Mbps] over Message
Size [Byte]. Curves: with instrumentation, without instrumentation.]
(a) Latency over InfiniBand connection (b) Bandwidth over InfiniBand connection
Figure 5.5: NetPIPE InfiniBand latency (left) and bandwidth (right) comparison of
Open MPI compiled with and without the memchecker framework
[Plots: (a) "NetPIPE Latency", Time [usec] over Message Size [Byte]; (b) "NetPIPE
Bandwidth", Bandwidth [Mbps] over Message Size [Byte]. Curves: Running with MemPin,
Running with Valgrind.]
(a) Latency over TCP connection (b) Bandwidth over TCP connection
Figure 5.6: NetPIPE TCP latency (left) and bandwidth (right) comparison of Open
MPI run with the memchecker framework
[Plots: (a) "NetPIPE Latency", Time [usec] over Message Size [Byte]; (b) "NetPIPE
Bandwidth", Bandwidth [Mbps] over Message Size [Byte]. Curves: Running with MemPin,
Running with Valgrind.]
(a) Latency over InfiniBand connection (b) Bandwidth over InfiniBand connection
Figure 5.7: NetPIPE InfiniBand latency (left) and bandwidth (right) comparison of
Open MPI run with the memchecker framework
[Plots with logarithmic axes: (a) "NetPIPE Latency", Time [usec] over Message Size [Byte];
(b) "NetPIPE Bandwidth", Bandwidth [Mbps] over Message Size [Byte]. Curves: Running with
MemPin, Running with Valgrind, Running without tools.]
(a) Latency over InfiniBand connection (b) Bandwidth over InfiniBand connection
Figure 5.8: NetPIPE InfiniBand latency (left) and bandwidth (right) comparison of
Open MPI run with and without the memchecker framework
[Diagram of Process 1 and Process 2. Legend: Blocks that will be computed; Overlap area
of two processes; Blocks that are used for computation; Data block owned by one process;
Communicated blocks but not used for computation.]
Figure 5.9: An example of border update in domain decomposition
The basic idea of domain decomposition is to divide the original computational do-
main Ω into sub-domains Ωi, i = 1, ....M, and then solve the global problem as a sum of
contributions from each sub-domain, that may be computed in parallel.The process as-
sociated with a sub-domain requires elements belonging to its neighbors when it updates
the elements on the border of its partition [13], that requires large amount of exchang-
ing data between each neighbor pairs. However, there are cases that part of the border
data need not to be updated. For example, in a 2D domain decomposition algorithm,
it may require calculate elements only from their horizontal and vertical neighbor ele-
ments, but the whole border element arrays are updated from neighbor sub-domain, as
shown in Figure 5.9. The result is that for the border update in every sub-domain, there
will be four corner elements that will never be used for calculation (without periodic
boundary condition, the virtual border [36] elements are not taken into account in the
example). This might be no harm for the calculation result of the algorithm. But when
decomposing the entire problem into a large number of sub-domains, the total amount
of transferred but unused data may be high, and as consequence communication might
require more time.
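The following minimal sketch illustrates why the corner elements are never read, assuming a five-point stencil that uses only horizontal and vertical neighbors, as described above. The function names `five_point_update` and `with_corners`, the averaging rule, and the tiny 4 x 4 block are illustrative assumptions and are not taken from the heat program.

```python
def five_point_update(grid):
    """Return the updated interior of `grid` (a list of rows with a one-cell halo).

    Each interior cell becomes the mean of its four horizontal and vertical
    neighbors. The halo (outermost rows and columns) is only read.
    """
    rows, cols = len(grid), len(grid[0])
    new_interior = []
    for i in range(1, rows - 1):
        new_row = []
        for j in range(1, cols - 1):
            total = grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            new_row.append(total / 4.0)
        new_interior.append(new_row)
    return new_interior


def with_corners(grid, value):
    """Copy `grid` and overwrite its four halo corner cells with `value`."""
    copy = [row[:] for row in grid]
    for i in (0, -1):
        for j in (0, -1):
            copy[i][j] = value
    return copy


# Hypothetical local block: a 2 x 2 interior surrounded by a one-cell halo.
local = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# Changing the corner halo cells does not change the result of the update,
# so receiving them from a neighbor is wasted communication for this stencil.
same_result = five_point_update(local) == five_point_update(with_corners(local, 1e9))
```

A stencil that also reads diagonal neighbors (for example a nine-point stencil) would need the corner cells, so the observation only holds under the stated assumption.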
(a) 4 x 4 decomposition (b) 8 x 8 decomposition
Figure 5.10: Transferred but unused data in example domain decompositions
Figure 5.10 shows a more detailed example of the transferred but unused data in
the domain decomposition example. The left figure is a 4 x 4 domain decomposition,
where every element calculation requires only horizontal and vertical neighbor
elements. In this specific case, 72 elements are transferred but not used (36 corner
elements, each transferred two times). When this code is scaled by doubling the number
of sub-domains in each dimension, the number of elements communicated but not used
increases dramatically. The right figure shows a similar 8 x 8 domain decomposition,
in which 392 elements (196 corner elements, each transferred two times) need not be
communicated. For an M × N domain decomposition, the total number of such elements is:
(M − 1) × (N − 1) × 4 × 2    (5.1)
The amount of unnecessarily communicated data therefore grows with the product of the
sub-domain counts in the two dimensions, i.e. quadratically when M and N are
increased together.
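Equation (5.1) can be checked against the two decompositions of Figure 5.10 with a few lines of code. This is only a sketch; the function name `unused_corner_transfers` is an assumption introduced here for illustration.

```python
def unused_corner_transfers(m, n):
    """Corner elements transferred but never used in an m x n decomposition.

    Follows equation (5.1): (m - 1) * (n - 1) interior meeting points of
    sub-domains, four corner elements at each point, each transferred twice.
    """
    if m < 1 or n < 1:
        raise ValueError("need at least one sub-domain per dimension")
    return (m - 1) * (n - 1) * 4 * 2


# The two cases discussed in the text: 4 x 4 and 8 x 8 decompositions.
counts = {(m, n): unused_corner_transfers(m, n) for (m, n) in [(4, 4), (8, 8)]}
```

For the 4 x 4 case this gives 72 and for the 8 x 8 case 392, matching the numbers quoted above. A one-dimensional decomposition (M = 1 or N = 1) yields zero, since no four sub-domains meet in a point.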
An example 2D heat conduction algorithm, which behaves in the way described above,
has been run with the newly implemented memory checking framework on Windows. The
algorithm is based on the Parallel CFD Test Case [48]. It solves the partial
differential equation for unsteady heat conduction over a square domain. It was run
with two processes and under the control of the memory checking tool (MemPin in this
case). The entire domain was decomposed into two sub-domains, one with the range
[(0,0), (8,15)] and the other with the range [(7,0), (15,15)]. The run-time output (see
Figure 5.11) shows the details of the execution. At the end of the output, our tool
MemPin reports that in total 112 bytes of data on each process have been transferred
but never used in the program. All of these bytes come from the exchange of corner
elements. Each process has two corner elements, and each element is transferred once
for every communication. In the remainder of this section, further test results show
that the communication time of the application is affected when more corner elements
are exchanged, and that reducing such data improves the communication performance.
Figure 5.11: Running the heat program with two processes and checked with memory
checking
The heat program has been tested with more processes on different numbers of nodes
on the BWGrid and Nehalem clusters at HLRS, in order to discover the relationship
between the communicated but unused corner elements and the communication
performance. For the first test, the heat program was set to a 1500 × 1500 domain and
parallelized with different numbers of processes over four compute nodes (eight cores
on each node) of the Nehalem cluster. A modified version of the heat program, which
does not send any unused corner elements, was also used for the test. The average
communication time and overall run time were measured over five executions for each
number of processes, as shown in Figure 5.12. When the nodes are not oversubscribed
(each core runs no more than one process), the modified version is generally 3% to 7%
better than the original version. Communicating the corner elements of the
sub-domains thus does affect the communication time of the program.
Another test was made on 64 nodes of the Nehalem cluster in order to start a large
number of processes without oversubscribing the compute node cores. The processes are
assigned to the nodes in round-robin order to achieve better load balancing for the
simulation. Figure 5.13 shows the communication time of the same simulation for
different numbers of processes (64 to 310) on the cluster. The modified version has a
shorter communication time, on average 10% better than the original version; in the
best case it is even 20% better.
The communication time does not increase or decrease linearly with the number of
processes, because the domain decomposition influences the communication efficiency.
[Plot: Communication time of the Heat Conduction program. x-axis: Number of Processes; y-axis: Time [usec]; series: original program, modified program.]
Figure 5.12: Comparison of the communication time between the original and modified
Heat Conduction program on 4 nodes
[Plot: Communication time of the Heat Conduction program on 64 nodes. x-axis: Number of Processes; y-axis: Time [usec]; series: original program, modified program.]
Figure 5.13: Communication time comparison between the original and modified Heat
Conduction program on 64 nodes
Assuming eight bytes of data in each corner element, a run with 96 processes (12 × 8
decomposition) performs (12 − 1) × (8 − 1) × 4 × 2 = 616 border exchanges (based on
Equation 5.1). This results in 4928 bytes of communicated but never used data. For
the same configuration run with 128 processes (16 × 8 decomposition), the size of
each border element is halved, i.e. four bytes, but the number of border exchanges is
now (16 − 1) × (8 − 1) × 4 × 2 = 840, with 3360 bytes in total. One may argue that,
since the total size of the transferred data is smaller with a finer domain
decomposition, the communication speed should increase. However, this is not the case.
The overall communication speed is determined largely by the number of communications
rather than by the size of the transferred data. In Open MPI, blocking and non-blocking
communication use two transmission protocols, Eager and Rendezvous. When the data
size is smaller than 12 kB, the data is sent in one package (Eager protocol). When the
data size is larger than 12 kB, the data is divided into smaller packages (Rendezvous
protocol), so more than one send and receive operation is performed on this data. As
long as the data size does not exceed the limit, the number of communications
determines the overall communication speed. This also explains why the communication
time is larger when running with 64 processes: in this case, the corner data is much
larger than 12 kB, so the number of communications is doubled or even tripled.
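The arithmetic of this paragraph can be reproduced with the short sketch below. The helper names are assumptions introduced for illustration. The 12 kB threshold is taken from the text and interpreted here as 12 * 1024 bytes, and treating a message of exactly that size as eager is also an assumption; the real limit depends on the Open MPI configuration and transport.

```python
# Threshold quoted in the text, assumed here to mean 12 * 1024 bytes.
EAGER_LIMIT_BYTES = 12 * 1024


def border_exchanges(m, n):
    """Number of unused corner transfers for an m x n decomposition (Equation 5.1)."""
    return (m - 1) * (n - 1) * 4 * 2


def unused_bytes(m, n, bytes_per_corner):
    """Total bytes communicated but never used, for a given corner element size."""
    return border_exchanges(m, n) * bytes_per_corner


def protocol_for(message_bytes, eager_limit=EAGER_LIMIT_BYTES):
    """Name the protocol a message of this size would use under the assumed limit."""
    return "eager" if message_bytes <= eager_limit else "rendezvous"


# 96 processes (12 x 8) with 8-byte corners versus 128 processes (16 x 8) with 4-byte corners.
run_96 = (border_exchanges(12, 8), unused_bytes(12, 8, 8))
run_128 = (border_exchanges(16, 8), unused_bytes(16, 8, 4))
```

The finer decomposition moves fewer unused bytes (3360 instead of 4928) but needs more exchanges (840 instead of 616), which is the trade-off the paragraph describes.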
Figure 5.14 shows the full-range comparison of the overall execution time of the two
versions. The overall execution time differs very little, by only 0.7% on average.
For a single test run, the communication time is only about 0.003% of the execution
time, so the effect of the communication time on the execution time is extremely
small. On the other hand, the execution time drops linearly with an increasing number
of processes.
5.4 MD Simulation as a Use Case
In this use case, an MD simulation of the lid behavior of lipases was used. A lipase
is a water-soluble enzyme that normally breaks down lipids. Lipases play a very
important role in human health, for example in overweight and underweight,
cardiovascular disease, diabetes, strokes, degenerative muscle diseases, cancer, and
degenerative diseases of the brain and nervous system, and also in rejuvenation and
regeneration in general. Lipases are able to catalyze the hydrolysis of carboxylic
esters in water and the reverse reaction, the acylation of alcohols with carboxylic
acids in organic solvents [49]. A characteristic of most lipases is their activation,
mediated by a mobile lid, upon binding to a hydrophobic substrate interface [21]. In
water, the active site is covered by the lid, which opens upon binding of the lipase
to a hydrophobic interface [3]. As shown in Figure 5.15, the stable state of the lid
in water is closed; it changes to open in an organic solvent, and vice versa. The
reason is that the inside of the lid is hydrophobic, whereas the outside is
hydrophilic. Therefore, the stable form in water is the closed state, while in organic
solvent the stable form is the open state. When the solvent changes, the lid state
changes accordingly.
[Plot: Total run time of the Heat Conduction program on 64 nodes. x-axis: Number of Processes; y-axis: Time [sec]; series: original program, modified program.]
Figure 5.14: Comparison of the communication and computation time between the
original and modified Heat Conduction program on 64 nodes
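[Diagram: interfacial activation of a lipase. Stable states: closed in water, open in organic solvent. The lid opens on the transition from water to organic solvent and closes on the transition from organic solvent to water.]
Figure 5.15: Influence of solvent on lipases, lid open and closed states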
Figure 5.16: Simulation of the lipase lid in different solvents
In order to observe this lid movement in water and organic solvent, we ran the
simulation with Gromacs and then visualized the result in PyMOL, an open-source
molecular visualization system. The simulation procedure is presented in Figure 5.16.
First, a homology model has to be created, or an existing 3D structure is used. A box
is created around the protein, to which water and organic solvent are added. Before
the simulation is run, all run-time parameters have to be set up, such as temperature,
number of solvent atoms and simulation time period.
When the simulation is done, the result may be visualized in PyMOL. A screenshot
of the PyMOL visualization window is shown in Figure 5.17. The red atoms are the
water solvent and the green atoms in the middle are the organic solvent. The
Rhizomucor miehei lipase (RML) used for the simulation belongs to the GX [12] class and
has a short lid helix, connected to the remaining part of the lipase by two hinges [62].
The lipases lie on the left side of the water solvent, as shown in Figure 5.17. The
simulation moves the lipases from the water solvent into the organic solvent and back
to the water solvent, and the lid movement can be observed during this procedure. The
simulated lid movement may be animated and replayed in PyMOL. Figure 5.18 shows the
open and closed lid states captured from the PyMOL result.
This simulation has been run on both Linux (Nehalem) and Windows (Viscluster)
clusters, in order to evaluate the performance overhead introduced by the debugging
features. Figure 5.19 compares Gromacs simulation runs using shared memory with two
versions of Open MPI on one Windows cluster node, one integrated with the MemPin
framework and one without. The simulations were run without supervision of MemPin.
Different simulation sizes (steps) were used, where each size was
Figure 5.17: Screen shot of visualization using PyMOL
(a) Lid is open (b) Lid is closed
Figure 5.18: Simulation screen shot of lid open and closed state
[Plot (a): Gromacs run on Windows without MemPin using SM. x-axis: Simulation Size (Steps); y-axis: Time [sec]; series: integrated with MemPin, plain.]
(a) Simulation time
[Plot (b): Gromacs run on Windows without MemPin using SM. x-axis: Simulation Size (Steps); y-axis: GFlops; series: integrated with MemPin, plain.]
(b) Simulation performance
Figure 5.19: Gromacs run on Windows with/without MemPin integration using shared
memory
[Plot (a): Gromacs run on Windows without MemPin using TCP. x-axis: Simulation Size (Steps); y-axis: Time [sec]; series: integrated with MemPin, plain.]
(a) Simulation time
[Plot (b): Gromacs run on Windows without MemPin using TCP. x-axis: Simulation Size (Steps); y-axis: GFlops; series: integrated with MemPin, plain.]
(b) Simulation performance
Figure 5.20: Gromacs run on Windows with/without MemPin integration over TCP
connection
[Plot (a): Gromacs run without debugging tools (SM). x-axis: Number of cores; y-axis: Time [sec]; series: integrated with Valgrind, integrated with MemPin, integrated without tool.]
(a) Simulation time
[Plot (b): Gromacs run without debugging tools (SM). x-axis: Number of cores; y-axis: GFlops; series: integrated with Valgrind, integrated with MemPin, integrated without tool.]
(b) Simulation performance
Figure 5.21: Gromacs run on Nehalem with/without Valgrind and MemPin integration
using Shared Memory
run five times in order to get average numbers and reduce the influence of system
processes. As shown in Figure 5.19(a), the difference in simulation time is very small:
running with MemPin integrated is approximately 4% slower. Correspondingly, as seen in
Figure 5.19(b), the performance of the simulation without the tool is slightly better
(around 0.6%). For users who do not want to use the memory checking feature, the
performance of running without debugging tools is therefore practically unaffected
when the framework is integrated.
Figure 5.20 presents the simulation result over a TCP connection between two Windows
cluster nodes. The overall simulation time still shows very little difference
(less than 3%). The performance shown in Figure 5.20(b) differs a bit more, by 5% to
7%. However, the throughput difference does not come from the debugging tools and
frameworks, but rather from the communication connection. For example, for problem
size 300, running without the framework (plain) shows better performance, while for
problem size 600, running with the framework is better. Whether the framework is
integrated is therefore not the determining factor for the throughput; the TCP
connection is.
On the Nehalem cluster, the same simulation was used with a different configuration.
Tests run without debugging tools used 10000 computation steps. Tests run with our
debugging tools used 100 computation steps in order to shorten the simulation time.
Figure 5.21 shows the result of running the simulation on Nehalem with and without
debugging framework integration, using shared memory on a single node. The tests
were run without control of the debugging tools. As shown in Figure 5.21(a), the
simulation time is approximately the same among the tests. Gromacs running with
clean Open MPI has the lowest running time, while running with MemPin integrated is
slightly faster than running with Valgrind integrated. The performance in
Figure 5.21(b) shows the same result. This is because
[Plot (a): Gromacs run without debugging tools (IB). x-axis: Number of cores; y-axis: Time [sec]; series: integrated with Valgrind, integrated with MemPin, integrated without tool.]
(a) Simulation time
[Plot (b): Gromacs run without debugging tools (IB). x-axis: Number of cores; y-axis: GFlops; series: integrated with Valgrind, integrated with MemPin, integrated without tool.]
(b) Simulation performance
Figure 5.22: Gromacs run on Nehalem with/without Valgrind and MemPin integration
using InfiniBand
MemPin does not add any direct instrumentation code to the application, whereas
Valgrind does (see Section 2.4.3). The analysis functions of MemPin are all defined
empty, which adds only a minimum of extra instructions to the original application.
In addition, all of the test runs scaled almost linearly with an increasing number of
cores. As this test uses shared memory for communication, the performance increases
with the number of processes as long as the number of processes is not larger than
the number of cores on the node. If the number of processes exceeds the number of
cores, however, the performance no longer increases linearly.
Figure 5.22 shows the same tests on the Nehalem cluster with two nodes using
InfiniBand connections. The execution time of Gromacs is nearly the same for all test
cases, with a maximum difference of around two seconds. It is difficult to determine
which test case is the best, but the case with MemPin integration is a bit slower for
the tests with five to ten cores (see Figure 5.22(a)). The throughput of this test shows
a similar result, with no obvious difference in performance, although the test case
without any framework integration achieves a higher overall score. Using a faster
network connection also narrows the performance difference, making it even more
negligible for users.
Simulation under the supervision of the debugging tools introduces a very large
overhead. Both implementations slow down the simulation considerably. As shown in
Figure 5.23, the same simulations were run under the control of the tools on the
Nehalem cluster with shared memory on a single node. In Figure 5.23(a), the running
time of both implementations is more than 200 seconds (green and red lines), compared
to 48 seconds without any tools (see Figure 5.22). Valgrind shows better performance
than the unoptimized MemPin, but the results do not scale linearly when running on
more cores.
On the other hand, the optimized version of MemPin, as described in Section 4.4.1,
shows much better performance than Valgrind.
[Plot (a): Gromacs run with debugging tools (SM). x-axis: Number of cores; y-axis: Time [sec]; series: run with Valgrind, run with MemPin, run with MemPin (Opt.).]
(a) Simulation time
[Plot (b): Gromacs run with debugging tools (SM). x-axis: Number of cores; y-axis: MFlops; series: run with Valgrind, run with MemPin, run with MemPin (Opt.).]
(b) Simulation performance
Figure 5.23: Gromacs run on Nehalem with Valgrind and MemPin supervision using
Shared Memory
For the same test, the performance of MemPin is increased dramatically by disabling
the callstack engine. As shown in Figure 5.23 (blue dotted line), the optimized
version of MemPin gains a speedup of more than two compared to both the original
MemPin and Valgrind. The test results also show a large number of write-before-read
errors in Gromacs. An excerpt of the detection messages from MemPin is shown below:
[cl3fr1:28476] memchecker: write before read at 1d65fca0:416
[cl3fr1:28476] memchecker: write before read at 1d65fca0:417
[cl3fr1:28476] memchecker: write before read at 1d65fca0:418
[cl3fr1:28476] memchecker: write before read at 1d65fca0:419
[cl3fr1:28476] memchecker: write before read at 1d65fca0:420
[cl3fr1:28476] memchecker: write before read at 1d65fca0:421
[cl3fr1:28476] memchecker: write before read at 1d65fca0:422
[cl3fr1:28476] memchecker: write before read at 1d65fca0:423
[cl3fr1:28476] memchecker: write before read at 1d65fca0:424
== <MemPin Debug 1> [28733:0] == **** application callstack ****
== <MemPin Debug 1> [28733:0] == <bc_inputrec> .../mvdata.c:470,0
Each line in the first part of the message indicates that one byte of communicated
memory was overwritten before the received data was actually used. The second part is
the callstack information generated by MemPin. It shows the source location of the
detected problem, including the line number. Navigating to the specified source file
of Gromacs reveals the following code:
464 static void bc_inputrec(const t_commrec *cr,t_inputrec *inputrec)
465 {
466   gmx_bool bAlloc=TRUE;
467   int i;
468
469   block_bc(cr,*inputrec);
470   snew_bc(cr,inputrec->flambda,inputrec->n_flambda);
471   nblock_bc(cr,inputrec->n_flambda,inputrec->flambda);
472   bc_grpopts(cr,&(inputrec->opts));
473   if (inputrec->ePull != epullNO) {
474     snew_bc(cr,inputrec->pull,1);
475     bc_pull(cr,inputrec->pull);
476   }
477   for(i=0; (i<DIM); i++) {
478     bc_cosines(cr,&(inputrec->ex[i]));
479     bc_cosines(cr,&(inputrec->et[i]));
480   }
481 }
This function is called on the root process to prepare and broadcast the input
parameters of Gromacs among all processes. It first calls block_bc in line 469 to
trigger a broadcast operation on all processes. Immediately after that, it calls
snew_bc (defined in smalloc.h) in line 470, which allocates new memory for a number of
elements and returns it in a pointer. The newly allocated memory is then broadcast
again in line 471. In total there are thus two (partially) duplicated broadcasts.
This behavior is harmless as long as the memory is handled correctly, but MemPin
still gives a hint about a potential source of memory errors. In this specific case,
combining the two broadcasts may be a good way to reduce the communication time and
minimize the possibility of memory errors. MemPin also detects a large amount of
communicated but unused data when running Gromacs simulations. The amount of such
data increases with the number of processes, as mentioned in Section 5.3.
This may be improved for better communication performance.
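The write-before-read pattern described above can be mimicked with a small toy model. This is a sketch under stated assumptions and does not reflect how MemPin is implemented: the class `ReceiveBuffer`, the buffer size of 512 bytes, and the pointer field at offset 416 with a width of 8 bytes are hypothetical values chosen only to resemble the log excerpt.

```python
class ReceiveBuffer:
    """Toy model of a receive buffer whose bytes are watched after a receive."""

    def __init__(self, size):
        self.size = size
        self.unread = set()   # offsets that were received but not yet read
        self.reports = []     # offsets overwritten before they were read

    def _span(self, offset, length):
        if offset < 0 or length < 0 or offset + length > self.size:
            raise ValueError("access outside the buffer")
        return range(offset, offset + length)

    def receive(self, offset, length):
        """Mark bytes as freshly received (for example by a broadcast)."""
        self.unread.update(self._span(offset, length))

    def read(self, offset, length):
        """The application consumes the bytes, so they are no longer pending."""
        self.unread.difference_update(self._span(offset, length))

    def write(self, offset, length):
        """Report every byte that is overwritten while still unread."""
        for byte in self._span(offset, length):
            if byte in self.unread:
                self.reports.append(byte)
                self.unread.discard(byte)


buf = ReceiveBuffer(size=512)
buf.receive(0, 512)   # the whole structure arrives via the first broadcast
buf.read(0, 416)      # scalar fields are consumed by the application
buf.write(416, 8)     # a pointer field is replaced by a fresh allocation
```

After these steps `buf.reports` holds the eight offsets 416 to 423, one entry per overwritten byte, which is the same per-byte granularity as the messages in the excerpt.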
Running applications with the help of debugging tools and frameworks did provide a lot
of useful information. During the communication of the simulation processes, many
bytes were transferred but never used. When the simulation size increases, the number
of such bytes increases heavily, which puts a considerable burden on communication.
Reducing the communication of unnecessary bytes may improve the overall performance
of the application.
All performance tests and comparisons show that the integration of the tools
introduces only a small, even negligible overhead for normal simulation runs without
the debugging tools. When running with the debugging tools, the overhead is very high,
and our optimized MemPin shows much better general performance than Valgrind.
Nevertheless, the above results show that with the information from our debugging
tools, memory access errors and performance issues can be detected, and the simulation
can be amended so that it does not communicate data needlessly.

6 Conclusion
In this thesis, the advanced topic of MPI memory checking is introduced, and the
implementation of the tools is described. This chapter summarizes the previous
chapters.
Chapter 2 gives the state of the art of MPI implementations and widely used debugging
tools. Nowadays, there are many MPI implementations, e.g. MPICH, Microsoft MPI, Intel
MPI, and Open MPI, each of which has different features and system support. Most of the
MPI semantic checking work in this thesis has been done in Open MPI. The second half
of that chapter introduces two important modern debugging tools, Valgrind and Intel
Pin, including their functionality and basic working mechanisms on the supported
platforms. These are the foundations of the work in this thesis.
Chapter 3 focuses on supporting Open MPI on Windows, which is a preparatory step
towards making the MPI memory checking tools usable under Windows. The original
Open MPI supported only Linux systems and could not be used on a native Windows
platform. Although Cygwin had previously been used as a solution, it limits the
performance of running MPI applications and also the build process of the project on
Windows. The work has been divided into three parts: the first part covers native
support for Windows, including a new build system and event system support; the second
part is multiple-node support for both HPC and non-HPC Windows platforms; and the last
part adds high-speed network support (InfiniBand) to the project. The first and second
parts of the work in this chapter were mainly funded by the Microsoft TCI project¹. The
WinVerbs support in the third part was mainly implemented within a master thesis². All
the work has been published in the main repository of the project and in several
releases.
¹ The Microsoft TCI is a research project on software and visualization tools for the
automotive industry.
² Mr. Jie Hou, a Master student from the University of Stuttgart, was the main
developer on this topic.
Chapter 4 describes the primary work of the thesis. It first explains the scenarios of
the different phases of memory checking in MPI-based parallel applications. Several
erroneous code examples are also given for better understanding. Then the
implementation of two memory checking frameworks based on heavy-weight debugging tools
is explained in detail. The Memcheck tool in the Valgrind tool suite has been extended
for advanced memory checking. Another memory checking framework has been developed
with the Intel Pin API. This new framework aims to provide functionality similar to
that of Valgrind and to enable memory checking on both Windows and Linux. Both
frameworks have been integrated into Open MPI, i.e. the corresponding MCA components
have been implemented. The core of these frameworks is the so-called callback
mechanism, which allows user-defined callback routines to be attached to a specific
memory region that the user might want to check³. The attached callback function is
called when the corresponding memory is accessed, either read or written. In addition,
in order to realize the advanced memory checking in Open MPI, we implemented a
bit-wise checking callback routine. For every memory region that is registered, a
bit-wise shadow memory is allocated, in which two shadow bits indicate the state of
one byte of memory. For example, when part of the memory is written, the corresponding
shadow bits are set to the write state. At run time, the state of the accessed memory
is checked to determine whether it is used correctly. Furthermore, in the finalization
stage of the MPI application, we check the state of every registered memory region to
summarize the memory usage information and generate the error report.
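The two-bits-per-byte shadow memory summarized above can be sketched as follows. This is a minimal illustration under stated assumptions: the class `ShadowMemory`, the four state codes and their numeric values are hypothetical, since the summary does not list the actual encoding used by the frameworks.

```python
# Hypothetical 2-bit state codes; the actual encoding is not given in the text.
UNTOUCHED, RECEIVED, READ, WRITTEN = 0, 1, 2, 3


class ShadowMemory:
    """Two shadow bits per application byte, packed four states per shadow byte."""

    def __init__(self, size):
        self.size = size
        self.bits = bytearray((size + 3) // 4)

    def _check(self, index):
        if not 0 <= index < self.size:
            raise IndexError("byte outside the registered region")

    def get(self, index):
        """Return the 2-bit state of the application byte at `index`."""
        self._check(index)
        shift = (index % 4) * 2
        return (self.bits[index // 4] >> shift) & 0b11

    def set(self, index, state):
        """Store a 2-bit state without disturbing the three neighboring states."""
        self._check(index)
        shift = (index % 4) * 2
        cleared = self.bits[index // 4] & ~(0b11 << shift) & 0xFF
        self.bits[index // 4] = cleared | ((state & 0b11) << shift)

    def summary(self):
        """Count how many bytes are in each state, as a finalization report would."""
        counts = {UNTOUCHED: 0, RECEIVED: 0, READ: 0, WRITTEN: 0}
        for i in range(self.size):
            counts[self.get(i)] += 1
        return counts


# A 10-byte receive buffer of which the application reads only the first 8 bytes.
shadow = ShadowMemory(10)
for i in range(10):
    shadow.set(i, RECEIVED)
for i in range(8):
    shadow.set(i, READ)
unused = shadow.summary()[RECEIVED]
```

In this sketch the two bytes that remain in the RECEIVED state at the end correspond to "transferred but unused" data, and the shadow region costs one byte per four application bytes.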
Chapter 5 presents the results of using the newly developed frameworks. The
performance implication section shows the results of running several benchmarks using
Open MPI built with the memory checking framework, while the debugging tool is not
launched. This makes it possible to evaluate the overhead introduced by integrating
the memory checking frameworks in Open MPI, which turns out to be negligible for
normal runs. Several real use cases are also given in that chapter to show how the
memory checking framework helps with finding bugs or improving application performance
in MPI applications. The 2D heat conduction algorithm and the Gromacs simulation both
show that the integration of the memory checking tools in Open MPI is helpful for
finding memory problems. There are indeed memory misuses in the algorithm and the
application, which may be resolved to improve the overall communication performance.
Furthermore, running Gromacs with the memory checking tools shows that the optimized
MemPin has a lower overhead than Valgrind. We also introduced the cooperation with
ITB on realizing a Windows-based test cluster using existing applications and tools.
This shows that it is important to have and support Open MPI, Gromacs and other
related tools on the Windows platform for simulation and research. This solution may
be used widely in universities and institutes where the tutorial PCs run Windows and
are idle most of the time, in order to make efficient use of the available compute
resources.
³ Here the word attach means that the specific memory address and size are recorded
together with the callback function information, which makes it appear as if the
callback were attached to the memory.
The memory checking frameworks are helpful for finding memory problems in MPI
parallel applications. However, the Valgrind extension and the MemPin tool introduced
in Chapter 4 may also be used in other projects or applications. One example is the
parallel free surface lattice Boltzmann method [9], where data exchange has been
adapted to be communicated in a restricted neighborhood. When bubbles extend across
several subdomains and topological changes occur through the coalescence of bubbles,
data exchange becomes more complicated. Therefore, special care has to be taken to
make sure that data exchange is correct and efficient. In this case, the memory
debugging tools may be helpful. Moreover, although this work has been carried out for
memory checking in MPI applications, it can be used more widely in different ways. For
example, a callback function can be implemented to simply count the reads and writes
to a specific memory region, to strictly limit the number of reads or writes on the
memory, or even to check data structures for correct alignment and padding. New
functionality like cache-line analysis could also be implemented to improve the cache
usage of the application. The tools may help investigate the memory usage of any
common application, and further callback designs may be applicable and helpful in many
different use cases.
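The idea of a counting or limiting callback mentioned above can be sketched as follows. The classes `AccessWatch` and `Counter`, the method names `attach` and `notify`, and the example addresses are assumptions made for illustration and do not correspond to the Valgrind extension or the MemPin interface.

```python
class AccessWatch:
    """Illustrative registry that maps memory ranges to user callbacks."""

    def __init__(self):
        self._entries = []   # list of (start address, size, callback)

    def attach(self, start, size, callback):
        """Record a callback for the range [start, start + size)."""
        self._entries.append((start, size, callback))

    def notify(self, address, is_write):
        """Invoke every callback whose range contains `address`."""
        for start, size, callback in self._entries:
            if start <= address < start + size:
                callback(address, is_write)


class Counter:
    """Callback that counts reads and writes and flags writes beyond a limit."""

    def __init__(self, max_writes=None):
        self.reads = 0
        self.writes = 0
        self.max_writes = max_writes
        self.violations = []   # addresses of writes that exceeded the limit

    def __call__(self, address, is_write):
        if is_write:
            self.writes += 1
            if self.max_writes is not None and self.writes > self.max_writes:
                self.violations.append(address)
        else:
            self.reads += 1


watch = AccessWatch()
counter = Counter(max_writes=1)
watch.attach(0x1000, 16, counter)
watch.notify(0x1004, is_write=False)   # counted as a read
watch.notify(0x1004, is_write=True)    # first write, within the limit
watch.notify(0x1008, is_write=True)    # second write, exceeds the limit
watch.notify(0x2000, is_write=True)    # outside the watched range, ignored
```

In a real tool the `notify` step would be driven by the instrumentation of memory accesses; here it is called by hand to show how a single registry can serve counting and limiting policies with different callbacks.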

Glossary
AD Active Directory:
A directory service for Microsoft Windows domain networks.
ALU Arithmetic Logic Unit:
A digital circuit unit that performs arithmetic and logical operations.
APC Asynchronous Procedure Call:
A function that executes asynchronously in the context of a particular thread.
API Application Programming Interface:
A particular set of code and specifications that software programs or packages can
follow to exchange data with each other.
BBL Basic Block:
A portion of the code decomposed by compilers from a program with certain
desirable properties that make it highly amenable to analysis.
BTL Byte Transfer Layer:
The data transfer layer in Open MPI that manages several components for different
hardware protocols, like InfiniBand, TCP or Shared Memory.
CCP Compute Cluster Pack:
A programming package and software that provide secure, scalable cluster resource
management, a job scheduler, and a Message Passing Interface (MPI) stack for
parallel programming.
CFD Computational Fluid Dynamics:
A branch of methodology that uses numerical models and algorithms to solve and
analyze problems that involve fluid flows.
CIM Common Information Model:
An open standard which defines the presentation of managed elements in an IT
environment as a common set of objects and relationships between them. It allows
consistent management of managed elements, independent of their manufacturer
or provider.
Cluster Cluster:
A computer cluster is a group of computers, linked by a high-speed network like
InfiniBand, working together in many respects as a single computer.
CM5 Connection Machine 5:
One of the supercomputers that resulted from Danny Hillis' research in the
early 1980s at MIT on alternatives to the traditional von Neumann architecture
of computation.
COM Component Object Model:
An interface standard in binary form for software componentry introduced by
Microsoft. It is widely used for interprocess communication and dynamic object
creation in many programming languages.
CPU Central Processing Unit:
A digital circuit unit that carries out the instructions of a computer program,
to perform the basic arithmetical, logical, input and output operations of the
computer system.
CQ Completion Queue:
A queue mechanism that provides notifications when I/O requests complete.
A client may access the completion queue to determine if a work request has been
completed.
CRC Cyclic Redundancy Check:
An error-detecting code designed to detect accidental changes to raw computer
data. It is commonly used in storage devices and digital networks such as Ethernet.
DAPL Direct Access Programming Library:
A direct access framework that runs on transports supporting direct data access,
such as InfiniBand.
DBA Dynamic Binary Analysis:
A method of analyzing the behavior of an application at runtime by tracing or
instrumenting its binary code.
DMA Direct Memory Access:
A feature of modern computers that allows certain hardware subsystems of the
computer to access system memory without involving the CPU.
DMTF Distributed Management Task Force:
An industry organization which develops, maintains and promotes open industry
standards for systems management in enterprise IT environments.
DSM Distributed Shared Memory:
A memory architecture in which physically separated memories can be addressed
as one logically shared address space.
GNU GNU's Not Unix:
A Unix-like computer operating system developed by the GNU project, which aims
to develop a complete Unix-compatible software system composed of free software.
GPGPU General-Purpose Computing on Graphics Processing Units:
A technique that uses a GPU, which typically handles only computer graphics, to
perform computation in applications traditionally handled by the CPU.
GPU Graphics Processing Unit:
A specialized circuit designed to rapidly manipulate and alter memory to accelerate
the building of images in a frame buffer. Due to its data-parallel nature, graphics
hardware is increasingly used as a general-purpose computing device.
GROMACS Groningen Machine for Chemical Simulations:
An open source package for molecular dynamics simulation. It is developed at the
University of Groningen, University of Uppsala, University of Stockholm and the
Max Planck Institute for Polymer Research.
GUI Graphical User Interface:
A type of user interface that allows users to interact with computer applications
or electronic devices through images rather than text commands.
HCA Host Channel Adapter:
An InfiniBand network card supporting data transfer into the host's memory
(RDMA) without intervention of the processor (CPU).
HPC High Performance Computing:
A modern computing method that uses supercomputers and computer clusters to
solve advanced computation problems.
HPF High Performance Fortran:
An extension of Fortran 90 with support of parallel computing, published by the
High Performance Fortran Forum.
IB InfiniBand:
A switched fabric communications link used in HPC and enterprise data centers.
Its features include high throughput, low latency, quality of service, failover, and
scalability. The InfiniBand architecture specification defines a connection between
processor nodes and high performance I/O nodes such as storage devices.
IBA InfiniBand Architecture:
An industry standard architecture for server I/O and inter-server communication
designed by the InfiniBand Trade Association.
IBM SP IBM Scalable POWERparallel:
A family of massively parallel computer systems from IBM.
IBTA InfiniBand Trade Association:
A group of 180 or more companies founded in August 1999 to develop IBA.
IDE Integrated Development Environment:
A software application that provides comprehensive facilities to computer
programmers for software development.
IMB Intel MPI Benchmarks:
A performance measurement tool in the high performance computing community
that provides a concise set of benchmarks targeted at measuring the most
important MPI functions.
IOCTL Input/Output Control:
A system interface for device-specific operations and other operations through
which an application can communicate directly with a device driver.
IP Internet Protocol:
The principal communications protocol used for relaying datagrams across a
network using the Internet Protocol Suite. It is responsible for routing packets across
network boundaries, and it is the primary protocol that establishes the Internet.
IPC Inter-Process Communication:
The exchange of data among multiple threads in one or more processes, which may
run on one computer or on several computers connected by a network.
IR Intermediate Representation:
An abstract machine language independent of the source language and the
underlying computer architecture. It is translated from the source by the compiler,
and contains information about raw memory, registers, and data addresses.
JIT Just In Time:
A phrase that is used for program compiling, debugging, or analysis. It refers to
an observation or operation that is performed during the application runtime.
LAN Local Area Network:
A computer network that connects computers and devices in a limited area such
as a company or school.
MCA Modular Component Architecture:
The foundation architecture of the Open MPI project. Each layer of the project
has several MCA frameworks that provide services or interfaces for the project or
the user application.
MD Molecular Dynamics:
A computer simulation procedure of physical movements of atoms and molecules
within a period of time, in order to present a view of the motion of the atoms and
molecules.
MIMD Multiple Instruction Multiple Data:
One classification under Flynn's taxonomy of a parallel processor where many
functional units perform different operations on different data.
MISD Multiple Instruction Single Data:
One classification under Flynn's taxonomy of a parallel processor where many
functional units perform different operations on the same data.
MOF Managed Object Format:
The language for describing CIM classes in WMI for Windows.
MPI Message Passing Interface:
A standardized and portable message passing system designed by a group of
researchers from academia and industry. It is designed to function on a wide variety
of parallel computers. The standard defines the syntax and semantics of a core of
library routines, and helps users write portable programs in C, C++ and Fortran.
MTT MPI Testing Tool:
A general infrastructure for testing MPI implementations and running performance
benchmarks automatically.
MTU Maximum Transmission Unit:
The largest data unit that a communication protocol can pass in one transfer.
NIC Network Interface Controller:
A computer hardware component that connects a computer to a computer network.
NUMA Non-Uniform Memory Access:
A computer memory architecture design used in multiprocessors, where the
memory access time depends on the memory location relative to a processor. A
processor can access its own local memory faster than non-local memory which is
local to another processor or is shared between processors.
OEM Original Equipment Manufacturer:
A company that manufactures products or components that are purchased by
another company and retailed under that purchasing company's brand name. The
term refers to the company that originally manufactured the product.
OFED OpenFabrics Enterprise Distribution:
An open source software distribution developed by the OpenFabrics Alliance for
high-performance networking applications that demand low latency and high
scalability.
OpenMP OpenMP:
A compiler extension and API for multi-platform shared memory parallel
programming in C/C++ and Fortran. It defines a portable, scalable model with a
simple and flexible interface for developing parallel applications on platforms from
the desktop to the supercomputer.
OS Operating System:
The most important type of system software in a computer system. It consists of
programs and data that run on computers, manage computer hardware resources,
and provide common services for the execution of various application software.
OSI model Open Systems Interconnection model:
A model for characterizing and standardizing the functions of a communications
system in terms of abstraction layers. Similar communication functions are
grouped into logical layers. An instance of a layer provides services to instances
of the layer above it while receiving services from the layer below.
P2P Peer-to-Peer:
A distributed application architecture for computing or networking that
partitions tasks or workloads between peers. Peers are equally privileged, equipotent
participants in the application.
PBS Portable Batch System:
Software that performs job scheduling, e.g. allocating resources, mainly for Unix
cluster environments.
PDE Partial Differential Equation:
A type of differential equation that involves two or more independent variables,
an unknown function (dependent on those variables), and partial derivatives of the
unknown function with respect to the independent variables.
PLM Process Launch Management:
An Open MPI MCA framework for launching, managing and terminating local
and remote MPI processes.
PMI Process Management Interface:
A programming interface that allows different process managers to interact with
the MPI library in a standardized way.
PML Point-to-Point Management Layer:
An Open MPI MCA framework for managing the messages to be transferred.
Necessary information, such as headers, data and peer context, is packed or unpacked
in this layer.
POSIX Portable Operating System Interface:
The family of standards that defines the API for software compatible with variants
of the Unix operating system.
RAM Random Access Memory:
A form of computer storage that allows stored data to be accessed in any order
with a worst case performance of constant time.
RAS Resource Allocation Subsystem:
An Open MPI MCA framework for looking up and allocating resources.
RDMA Remote Direct Memory Access:
A DMA from the memory of one computer into that of another without
involving either one's operating system. This permits high throughput, low latency
networking.
RML Rhizomucor miehei Lipase:
One of the most often used lipases obtained from fungi. It is also widely used as a
model for the determination of the structure of some other lipases due to the deep
knowledge of its three dimensional structure.
RPC Remote Procedure Call:
An inter-process communication that allows a computer program to execute a
subroutine or procedure in another address space or on another computer on a
shared network, without the programmer explicitly specifying the details for this
remote interaction.
RSH Remote Shell:
A command line computer program that can execute shell commands as another
user on another computer across a computer network.
SAN Storage Area Network:
A dedicated storage network that provides access to consolidated, block level
storage, in order to make storage devices accessible to servers as if the devices were
locally attached to the operating system.
SDK Software Development Kit:
A set of development tools for creating applications for a particular software
framework, hardware platform, computer system, operating system, or similar platform.
SEH Structured Exception Handling:
A Windows-specific mechanism for handling hardware and software exceptions. It
enables the programmer to have complete control over the handling of exceptions,
provides support for debuggers, and is usable across all programming languages
and machines.
SIMD Single Instruction Multiple Data:
One classification of parallel computers in Flynn's taxonomy. It describes
computers with multiple processing elements that perform the same operation on
multiple data simultaneously.
SISD Single Instruction Single Data:
One classification of parallel computers in Flynn's taxonomy. A single processor
executes a single instruction stream to operate on data stored in a single memory.
SMP Symmetric Multiprocessing:
A multiprocessor computer hardware architecture in which two or more identical
processors are connected to a single shared main memory and are controlled by a
single OS instance.
SP Stack Pointer:
A pointer, held in a hardware register, that points to the most recently referenced
location on the application stack. If the stack is empty, it points to the origin of
the stack.
SPI Service Provider Interface:
A software mechanism to support replaceable components in a specific
implementation of a service.
SSE Streaming SIMD Extensions:
A SIMD instruction set extension to the x86 architecture designed by Intel and
introduced in 1999. It contains 70 new instructions, most of which work on
single precision floating point data. SIMD instructions can increase performance
when the same operation is performed on multiple data objects.
SSH Secure Shell:
A network protocol that allows users on a local computer to open a secure session
on a remote computer and display remote results locally.
SSI Single System Image:
Refers to a cluster of machines that appears to be one single system.
TCA Target Channel Adapter:
Similar to the HCA, but a Target Channel Adapter is for peripheral devices.
TCP Transmission Control Protocol:
One of the core protocols of the Internet Protocol Suite that major Internet
applications rely on. It provides reliable, ordered delivery of a stream of bytes from
a program on one computer to another program on another computer.
UMA Uniform Memory Access:
A shared memory architecture used in parallel computers. All the processors in
the UMA model share the physical memory uniformly, and the access time to a
memory location is independent of which processor makes the request or which
memory chip contains the transferred data.
UPC Unified Parallel C:
An extension of the C programming language designed for HPC on large-scale
parallel machines, including those with a common global address space (SMP and
NUMA) and those with distributed memory (clusters).
VMM Virtual Machine Manager:
A component in Intel Pin for managing its virtual machine, which consists of a
JIT compiler, an emulator, and a dispatcher.
WBEM Web-based Enterprise Management:
A set of systems management technologies developed to unify the management of
distributed computing environments.
WBR Write Before Read:
An error that occurs when data transferred by MPI communication are overwritten
before they have been read or otherwise used. It normally indicates that
unnecessary data were transferred or that computation results are wrong.
WMI Windows Management Instrumentation:
The infrastructure for management data and operations on Windows-based
operating systems. The user can write WMI scripts or applications to automate
administrative tasks on remote computers.
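
To make the Write Before Read (WBR) entry above more concrete, the following minimal sketch tracks one shadow state per byte of a communication buffer and reports bytes that are overwritten before being read. It is an illustrative model only, not the mechanism used by the tools in this work; the names `ShadowBuffer`, `on_receive`, `on_read` and `on_write` are hypothetical stand-ins for callbacks that a memory checking tool might provide.

```python
from enum import Enum


class State(Enum):
    UNTRACKED = 0   # byte was not delivered by communication
    RECEIVED = 1    # byte was delivered by communication, not yet read
    READ = 2        # byte was delivered and has been read at least once


class ShadowBuffer:
    """Hypothetical per-byte shadow state for one communication buffer."""

    def __init__(self, size):
        self.shadow = [State.UNTRACKED] * size
        self.errors = []  # list of (first offset, number of bytes) per bad write

    def on_receive(self, offset, length):
        # Data arrived from a communication call: mark it as received but unread.
        for i in range(offset, offset + length):
            self.shadow[i] = State.RECEIVED

    def on_read(self, offset, length):
        # The application consumed the data: received bytes become read.
        for i in range(offset, offset + length):
            if self.shadow[i] is State.RECEIVED:
                self.shadow[i] = State.READ

    def on_write(self, offset, length):
        # Overwriting received-but-unread bytes is a Write Before Read.
        span = range(offset, offset + length)
        lost = [i for i in span if self.shadow[i] is State.RECEIVED]
        if lost:
            self.errors.append((lost[0], len(lost)))
        for i in span:
            self.shadow[i] = State.UNTRACKED
```

For example, receiving 8 bytes, reading the first 4 and then writing 4 bytes at offset 2 overwrites two unread bytes (offsets 4 and 5), so one error is recorded for that write.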

Bibliography
[1] Allinea DDT. Internet, 2012. http://www.allinea.com/products/ddt.
[2] Inﬁniband T. Association. InﬁniBand Architecture Speciﬁcation, Release 1.0, 2012.
[3] Leo Brady, Andrzej M. Brzozowski, Zygmunt S. Derewenda, Eleanor Dodson, Guy
Dodson, Shirley Tolley, Johan P. Turkenburg, Lars Christiansen, Birgitte Huge-
Jensen, Lars Thim, and Ulrich Menge. A serine protease triad forms the catalytic
centre of a triacylglycerol lipase. Nature, 343(6260):767770, 1990.
[4] Derek Bruening, Evelyn Duesterwald, and Saman Amarasinghe. Design and Imple-
mentation of a Dynamic Optimization Framework for Windows. In ACM Workshop
on Feedback-Directed and Dynamic Optimization, Autin, Texas, Dec 2001.
[5] Derek Bruening and Qin Zhao. Practical Memory Checking with Dr. Memory. In
The International Symposium on Code Generation and Optimization, Autin, Texas,
Apr 2011.
[6] David Clark. OpenMP: A parallel standard for the masses. IEEE Concurrency,
6(1):1012, 1998.
[7] CMake. CMake. Internet, 2012. http://www.cmake.org.
[8] Code::Blocks. The open source, cross platform, free C++IDE, 2011.
[9] Stefan Donath, Christian Feichtinger, Thomas Pohl, Jan Götz, and Ulrich Rüde. A
Parallel Free Surface Lattice Boltzmann Method for Large-Scale Applications. In
Proceedings of the 21st International Conference on Parallel Computational Fluid
Dynamics, pages 198202, 2009.
[10] Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet. The LINPACK Benchmark:
past, present and future. Concurrency and Computation: Practice and Experience,
15(9):803820, 2003.
[11] Rüdiger Esser and Renate Knecht. Intel Paragon XP/S - Architecture and Software
Enviroment. In Anwendungen, Architekturen, Trends, Seminar, pages 121141,
London, UK, 1993. Springer-Verlag.
[12] Markus Fischer and Jürgen Pleiss. The Lipase Engineering Database: a navigation
and analysis tool for protein families. Nucleic Acids Research, 31(1):319321, 2003.
107
Bibliography
[13] Geoﬀrey C. Fox, Mark A. Johnson, Gregory A. Lyzenga, Steve W. Otto, John K.
Salmon, and David W. Walker. Solving problems on concurrent processors. Vol. 1:
General techniques and regular problems. Prentice-Hall, Inc., Upper Saddle River,
NJ, USA, 1988.
[14] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra,
Jeﬀrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew
Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timo-
thy S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI
Implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting,
pages 97104, Budapest, Hungary, September 2004.
[15] Richard L. Graham, Timothy S. Woodall, and Jeﬀrey M. Squyres. Open MPI: A
Flexible High Performance MPI. In Proceedings, 6th Annual International Confer-
ence on Parallel Processing and Applied Mathematics, Poznan, Poland, September
2005.
[16] Ananth Grama, George Karypis, Vipin Kumar, and Anshul Gupta. Introduction to
Parallel Computing. Addison Wesley, second edition, January 2003.
[17] Jeﬀ Hilland, Paul Culley, Jim Pinkerton, and Renato Recio. RDMA Proto-
col Verbs Speciﬁcation. Internet Draft, 2012. http://tools.ietf.org/html/
draft-hilland-rddp-verbs-00.
[18] W. Daniel Hillis. The connection machine. MIT Press, Cambridge, MA, USA, 1986.
[19] IBM. IBM SP Red Book. Internet, 2012. http://www.redbooks.ibm.com/
abstracts/sg244541.html.
[20] Intel PIN. Internet, 2012. http://software.intel.com/sites/landingpage/
pintool/docs/49306/Pin.
[21] Morten Ø. Jensen, Torben R. Jensen, Kristian Kjaer, Thomas Bjørnholm, Ole G.
Mouritsen, and Günther H. Peters. Orientation and Conformation of a Lipase
at an Interface Studied by Molecular Dynamics Simulations. Biophysical Journal,
83(1):98111, 2002.
[22] Morris A. Jette, Andy B. Yoo, and Mark Grondona. SLURM: Simple Linux Utility
for Resource Management. In Proceedings of Job Scheduling Strategies for Parallel
Processing (JSSPP), Lecture Notes in Computer Science (LNCS), pages 4460.
Springer-Verlag, 2002.
[23] Rainer Keller. Analyse und Optimierung der Softwareschichten von wis-
senschaftlichen Anwendungen für Metacomputing. PhD thesis, Universität
Stuttgart, Holzgartenstr. 16, 70174 Stuttgart, 2008.
108
Bibliography
[24] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoﬀ
Lowney, Steven Wallace, Vijay J. Reddi, and Kim Hazelwood. Pin: building cus-
tomized program analysis tools with dynamic instrumentation. In Proceedings of
the 2005 ACM SIGPLAN conference on Programming language design and imple-
mentation, pages 190200. ACM, 2005.
[25] Junichiro Makino, Makoto Taiji, Toshikazu Ebisuzaki, and Daiichiro Sugimoto.
GRAPE-4: a one-Tﬂops special-purpose computer for astrophysical N-body prob-
lem. In Proceedings of the 1994 conference on Supercomputing, pages 429438. IEEE
Computer Society Press, 1994.
[26] Mellanox network technology. Mellanox, Internet, 2012. http://www.mellanox.
com.
[27] Message Passing Interface Forum. MPI: A Message Passing Interface Standard,
Version 1.0, June 1994.
[28] Message Passing Interface Forum. MPI: A Message Passing Interface Standard,
Version 1.1, June 1995.
[29] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing In-
terface, Version 2.0, July 1998.
[30] Message Passing Interface Forum. MPI: A Message Passing Interface Standard,
Version 1.2, June 1998.
[31] Message Passing Interface Forum. MPI: A Message Passing Interface Standard,
Version 1.3, June 2008.
[32] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard,
Version 2.2, September 2009.
[33] Microsoft. Microsoft Compute Cluster Pack. Internet, 2012. http://msdn.
microsoft.com/en-us/library/cc136762.aspx.
[34] Microsoft. Structured Exception Handling. Internet, 2012. http://msdn.
microsoft.com/en-us/library/ms680657.aspx.
[35] Microsoft. Windows Management Instrumentation. Internet, 2012. http://msdn.
microsoft.com/en-us/library/aa394582.aspx.
[36] S. Miller and S. Luding. Event-driven molecular dynamics in parallel. Journal of
Computational Physics, 193(1):306  316, 2004.
[37] MinGW. Home of the MinGW, MSYS and mingwPORT Projects, 2011.
[38] Ethan Mollick. Establishing Moore's Law. IEEE Annals of the History of Comput-
ing, 28:6275, 2006.
109
Bibliography
[39] Gordon E. Moore. Cramming more components onto integrated circuits. Electron-
ics, 38(8), April 1965.
[40] MPI Forum ticket number 45. Internet, 2012. https://svn.mpi-forum.org/trac/
mpi-forum-web/ticket/45.
[41] Myricom. Internet, 2012. http://www.myri.com/.
[42] Nicholas Nethercote. Dynamic Binary Analysis and Instrumentation. PhD thesis,
Computer Laboratory, University of Cambridge, United Kingdom, November 2004.
[43] Nicholas Nethercote and Julian Seward. How to shadow every byte of memory
used by a program. In Proceedings of the 3rd international conference on Virtual
execution environments, VEE '07, pages 6574, New York, NY, USA, 2007. ACM.
[44] OpenFabrics Alliance. Internet, 2012. http://www.openfabrics.org/index.php.
[45] Open MPI. Open MPI project website, August 2011.
[46] OpenFabrics Alliance. OpenFabrics, Internet, 2012. https://www.openfabrics.
org.
[47] Jeﬀrey S. Racine. The Cygwin tools: a GNU toolkit for Windows. Journal of
Applied Econometrics, 15(3):331341, 2000.
[48] Michael Resch, Björn Sander, and Isabel Loebich. A comparison of OpenMP and
MPI for the parallel CFD test case. In Proc. of the First European Workshop on
OpenMP, pages 7175, 1999.
[49] Rolf D. Schmid and Robert Verger. Lipases: Interfacial Enzymes with Attractive
Applications. Angewandte Chemie International Edition, 37(12):16081633, 1998.
[50] Securing a Remote WMI Connection. MSDN, Internet, 2012. http://msdn.
microsoft.com/en-us/library/aa393266.aspx.
[51] Julian Seward and Nicholas Nethercote. Using Valgrind to detect undeﬁned value
errors with bit-precision. In Proceedings of the annual conference on USENIX An-
nual Technical Conference, ATEC '05, page 2, Berkeley, CA, USA, 2005. USENIX
Association.
[52] Alex Skaletsky, Tevi Devor, Nadav Chachmon, Robert S. Cohn, Kim M. Hazelwood,
Vladimir Vladimirov, and Moshe Bach. Dynamic program analysis of Microsoft
Windows applications. In ISPASS, pages 212. IEEE Computer Society, 2010.
[53] Quinn O. Snell, Armin R. Mikler, and John L. Gustafson. NetPIPE: A Network
Protocol Independent Performance Evaluator. In in IASTED International Conf.
on Intelligent Information Management and Systems, 1996.
110
Bibliography
[54] Amitabh Srivastava and Alan Eustace. Atom: A system for building customized
program analysis tools. In Proceedings of the ACM SIFPLAN 1994 conference on
Programming language design and implementation, PLDI '94, pages 196205, New
York, NY, USA, 1994. ACM.
[55] Lambert M. Surhone, Mariam T. Tennoe, and Susan F. Henssonow. Libevent. VDM
Verlag Dr. Mueller AG & Co. Kg, 2010.
[56] The K-Computer. FUJITSU, Internet, 2012. http://www.fujitsu.com/global/
about/tech/k.
[57] TOP500, Supercomputer sites. Internet, 2012. http://www.top500.org.
[58] TotalView. Rogue Wave Software, Inc, Internet, 2012. http://www.roguewave.
com/products/totalview-family/totalview.aspx.
[59] Lewis W. Tucker and George G. Robertson. Architecture and Applications of the
Connection Machine. Computer, 21:2638, August 1988.
[60] Gary V. Vaughan, Ben Elliston, Tom Tromey, and Ian L. Taylor. GNU Autoconf,
Automake, and Libtool. Pearson Education, October 2000.
[61] Visual Studio 2008. Microsoft Corporation, 2012. http://www.microsoft.com/
visualstudio/en-us/products/2008-editions.
[62] Karl Volz. Structural conservation in the CheY superfamily. Biochemistry,
32(44):1174111753, November 1993.
[63] W3counter, Global Web Stats. Internet, 2012. http://www.w3counter.com/
globalstats.php.
[64] Josef Weidendorfer. Sequential Performance Analysis with Callgrind and
Kcachegrind. In Tools for High Performance Computing, pages 93113, 2008.
111

Curriculum Vitae
Personal data
Shiqing Fan
Anemonenstrasse 5
70771 Leinfelden-Echterdingen
Tel.: (0172) 7444810
Email: fan@hlrs.de
Date of Birth: 05.04.1981
Place of Birth: Liaoning, China
Education
09/1999-07/2003 Information Technology, Dalian University of Technology, China
04/2004-02/2007 Infotech, University of Stuttgart, Germany
Occupational history
08/2004-08/2006 Scientific Assistant (Wissenschaftliche Hilfskraft) at HLRS
02/2006-04/2006 Internship at Alcatel SEL AG in Stuttgart
03/2007-Now Scientific Researcher at HLRS
Stuttgart, 6. Juni 2012
