Performance Analysis of Complex Shared Memory Systems by Molka, Daniel
Performance Analysis of Complex Shared Memory Systems
Dissertation
submitted for the academic degree of Doktoringenieur (Dr.-Ing.)
presented to
Technische Universität Dresden
Faculty of Computer Science
submitted by
Diplom-Informatiker Daniel Molka
born on 10 December 1981 in Zeitz
Reviewers:
Prof. Dr. rer. nat. Wolfgang E. Nagel, Technische Universität Dresden
Prof. Dr. rer. nat. habil. Thomas Ludwig, Universität Hamburg
Date of submission: 12 September 2016
Date of defense: 10 March 2017
Dresden, 20 March 2017

Summary (Kurzfassung)
The complexity of high performance computers is steadily increasing. On the one hand, this is due to
the use of an ever larger number of processors. On the other hand, the performance of the individual
processors also keeps rising. In recent years, this has increasingly been achieved by raising the number
of cores per processor. Unfortunately, scientific applications often fail to exploit the available
potential. Therefore, analysis tools that help to detect and remedy an inefficient use of the hardware
are indispensable. Large computer systems usually consist of multiple nodes that are connected via a
network. Tools for analyzing performance losses that arise from using multiple nodes are amply
available. However, the strong increase in the number of cores represents a substantial change in the
architecture of the nodes. This work is therefore dedicated to performance analysis at the node level.
The goal of this dissertation is to improve the understanding of the computing performance that
scientific applications achieve on available hardware. The performance gains from using multi-core
processors differ, in part considerably, from those observed when using multiple single-core
processors. Therefore, both the properties of resources that are shared by the cores of a processor and
the effects of remote memory accesses in multi-processor systems are investigated, and their respective
influence on the achieved performance is analyzed. To this end, micro-benchmarks are first developed
that are able to determine the characteristics of memory accesses for various distances to the data used
as well as for different coherence states. These benchmarks are used to examine the performance of the
memory hierarchy of current multi-processor systems in depth in order to identify possible bottlenecks.
However, in order to detect memory related performance losses in applications, it is also necessary to
determine to what extent the respective application is limited by certain components. To make this
possible, a methodology is developed that derives metrics for the utilization of individual components
and for the delays in program execution caused by memory accesses from hardware monitoring data. The
approach is based on the previously developed micro-benchmarks, which are used to fully utilize
individual components. The selective stressing of individual components makes it possible to identify
those events recorded by the hardware monitoring that allow a usable assessment of the utilization of
the respective component. Finally, a visualization that displays performance losses caused by memory
accesses is developed on the basis of existing performance analysis tools.
The results of the micro-benchmarks show that the increasing number of cores and the use of multiple
processors per node lead to complex systems in which the performance characteristics of memory
accesses vary greatly depending on the distance to the accessed data. Furthermore, it can be observed
that the performance of components shared by multiple cores does not necessarily grow linearly with the
number of cores, which limits the scalability of parallel programs. It is shown that the presented
methodology for identifying hardware events that are usable for performance analysis leads to useful
metrics that enable the detection of memory related performance losses.
Abstract
Systems for high performance computing are getting increasingly complex. On the one hand, the number
of processors is increasing. On the other hand, the individual processors are getting more and more
powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per
processor. Unfortunately, scientific applications often fail to fully utilize the available computational
performance. Therefore, performance analysis tools that help to localize and fix performance problems
are indispensable. Large scale systems for high performance computing typically consist of multiple
compute nodes that are connected via a network. Tools for analyzing performance
problems that arise from using multiple nodes are readily available. However, the increasing number of
cores per processor that can be observed within the last decade represents a major change in the node
architecture. Therefore, this work concentrates on the analysis of the node performance.
The goal of this thesis is to improve the understanding of the achieved application performance on ex-
isting hardware. It can be observed that the scaling of parallel applications on multi-core processors
differs significantly from the scaling on multiple processors. Therefore, the properties of shared re-
sources in contemporary multi-core processors as well as remote accesses in multi-processor systems
are investigated and their respective impact on the application performance is analyzed. As a first step,
a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able
to determine the performance of memory accesses depending on the location and coherence state of the
data. They are used to perform an in-depth analysis of the characteristics of memory accesses in con-
temporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize
performance problems, it also has to be determined to which extent the application performance is
limited by certain resources. Therefore, a methodology to derive metrics for the utilization of individual
components in the memory hierarchy as well as waiting times caused by memory accesses is developed
in the second step. The approach is based on hardware performance counters, which record the number
of certain hardware events. The developed micro-benchmarks are used to selectively stress individual
components, which can be used to identify the events that provide a reasonable assessment for the uti-
lization of the respective component and the amount of time that is spent waiting for memory accesses
to complete. Finally, the knowledge gained from this process is used to implement a visualization of
memory related performance issues in existing performance analysis tools.
The results of the micro-benchmarks reveal that the increasing number of cores per processor and the
usage of multiple processors per node lead to complex systems with vastly different performance
characteristics of memory accesses depending on the location of the accessed data. Furthermore, it can be
observed that the aggregated throughput of shared resources in multi-core processors does not necessar-
ily scale linearly with the number of cores that access them concurrently, which limits the scalability
of parallel applications. It is shown that the proposed methodology for the identification of meaningful
hardware performance counters yields useful metrics for the localization of memory related performance
limitations.
Contents
1 Introduction
2 Background and Related Work
2.1 Processor Architecture
2.1.1 Processor Core
2.1.2 Cache
2.1.3 Multi-core Processors
2.1.4 Power Management
2.2 System Architecture
2.2.1 Node Architecture
2.2.2 Large Scale Systems
2.3 Cache Coherence
2.3.1 Snooping-based Cache Coherence Protocols
2.3.2 Snoop Filtering
2.3.3 Directory-based Cache Coherence Protocols
2.3.4 Overhead
2.4 Operating Systems
2.4.1 Processes and Threads
2.4.2 Virtual Memory
2.4.3 NUMA Considerations
2.5 Programming Models
2.5.1 Shared Memory
2.5.2 Partitioned Global Address Space
2.5.3 Message Passing
2.5.4 Hybrid Programming Models
2.6 Performance Evaluation
2.6.1 Scalability
2.6.2 Benchmarking
2.6.3 Performance Analysis of Parallel Applications
2.6.4 Analytical Performance Modeling and Simulation
3 Micro-benchmarks for Analyzing Memory Hierarchies
3.1 Objective and Realization
3.2 Data Placement
3.3 Coherence State Control
3.4 Hardware Detection
3.5 Measurement Routines
3.5.1 Latency Benchmark
3.5.2 Single-threaded Bandwidth Benchmarks
3.5.3 Aggregated Bandwidth Benchmarks
3.5.4 Throughput of Arithmetic Instructions
3.5.5 Support for Hardware Performance Counters
3.5.6 Parameter Description
3.6 Limitations
4 Performance Characterization of Memory Accesses
4.1 Systems With Two NUMA Nodes
4.1.1 Dual-socket AMD Opteron 2435
4.1.2 Dual-socket Intel Xeon X5670
4.1.3 Dual-socket Xeon E5-2670
4.2 Standard Compute Nodes With Complex NUMA Topologies
4.2.1 Dual-socket Intel Xeon E5-2680 v3
4.2.2 Quad-socket AMD Opteron 6274
4.3 Potential Bottlenecks in the Memory Hierarchy
4.3.1 Latency of Memory Accesses
4.3.2 Bandwidth Limitations
4.3.3 Influence of the Cache Coherence Protocol
5 Performance Impact of the Memory Hierarchy
5.1 Case Study: SPEC OMPM2001 Scalability
5.1.1 Multi-core Scaling
5.1.2 Multi-processor Scaling
5.2 Possible Causes for Low Application Performance
5.2.1 Impact of the Memory Latency
5.2.2 Bandwidth Limitations
5.2.3 Saturation of Shared Resources
5.3 Identification of Meaningful Hardware Performance Counters
5.3.1 Indicators for Bandwidth Utilization
5.3.2 Metrics for Memory-boundedness
5.4 Identification of Limiting Resources in Parallel Applications
5.4.1 Determining the Degree of Memory-boundedness
5.4.2 Resource Usage per Core
5.4.3 Utilization of Shared Resources
6 Summary
Bibliography
List of Abbreviations
List of Figures
List of Tables
1 Introduction
High performance computing (HPC) is an indispensable tool that is required to obtain new insights in
many scientific disciplines [Vet15, Chapter 1]. HPC systems are getting more and more powerful from
year to year as documented in the Top500 list of the fastest supercomputers [Str+15, Figure 1]. Un-
fortunately, scientific applications typically are not able to fully utilize this potential [FC07, Figure 1].
Table 1.1 shows a selection of No. 1 ranking systems from several Top500 lists and compares their
theoretical peak performance (Rpeak) to the achieved LINPACK performance (Rmax) as well as the per-
formance that is actually reached by several distinguished scientific applications [Str+15, Table 1]. The
LINPACK benchmark typically achieves a high percentage of the peak performance. However, this is
not representative of typical applications. Even in the case of the award-winning applications listed in
Table 1.1, the achieved performance is significantly lower in most cases. The effectiveness is even
worse in several other scientific
applications. Utilization levels of 10% and lower are not uncommon [Kra12, Figure 2]; [Oli+05, Table 4].
The percentage of the peak performance that is achieved by an application can also differ significantly
on different systems [Oli+05, Table 2, 3, and 4].
The continuous performance improvement of HPC systems is enabled by two developments: increasing
the number of processors per system [Str+15, Figure 2] and increasing the performance per proces-
sor [Str+15, Figure 3]. Multi-processor systems can be implemented as shared memory or distributed
memory systems [DAS12, Chapter 5]. In shared memory systems all processors can directly access all
the memory in the system via load and store instructions. In distributed memory systems each processor
has a private memory that it can access directly while data is exchanged between processors via message
passing. Most contemporary HPC systems are hybrid systems, i.e., multiple shared memory systems are
connected via a network to form a larger system. This is in part due to the shift from single-core processors
with increasing operating frequency to multi-core processors with an increasing number of cores,
which has become an important driver of the performance per processor within the last decade [Str+15,
Figure 3]. The cores of a single processor typically share the memory interface, i.e., they form a shared
memory system. Furthermore, the building blocks (nodes) of large scale systems often contain multiple
processors that can access each other's memory, e.g., [Bul13; Cra10; Meg14].
Table 1.1: Comparison of Top500 results and achieved application performance: Rpeak and Rmax
are taken from the official Top500 lists (see https://www.top500.org/lists/). The application
performance refers to the ACM Gordon Bell Prize winning application of the respective years as reported
in [Str+15, Table 1]. Only the years in which the winning application was executed on the number
one system are considered here in order to enhance comparability. Percentages are relative to Rpeak.

year  system                 Rpeak [Tflop/s]  Rmax [Tflop/s]   application [Tflop/s]
1995  Numerical Wind Tunnel  0.236            0.170 (72.0%)    0.179 (75.8%)
1997  ASCI Red               1.830            1.338 (73.1%)    0.43 (23.5%)
2002  Earth Simulator        40.96            35.86 (87.5%)    26.58 (64.9%)
2003  Earth Simulator        40.96            35.86 (87.5%)    5.0 (12.2%)
2005  BlueGene/L             367.0            280.6 (76.5%)    101.7 (27.7%)
2006  BlueGene/L             367.0            280.6 (76.5%)    207.3 (56.5%)
2007  BlueGene/L             596.4            478.2 (80.2%)    115.1 (19.3%)
2009  Jaguar (Cray XT5)      2,331            1,759 (75.5%)    1,030 (44.2%)
2011  K Computer             11,280           10,510 (93.2%)   3,080 (27.3%)
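
The percentages in Table 1.1 can be reproduced by dividing the achieved performance by Rpeak. The following Python sketch illustrates this arithmetic for three rows of the table; the function name `efficiency` is chosen here for illustration and is not part of the thesis.

```python
def efficiency(achieved_tflops: float, rpeak_tflops: float) -> float:
    """Return the achieved performance as a percentage of the theoretical peak."""
    return 100.0 * achieved_tflops / rpeak_tflops


# (year, system, Rpeak, Rmax, application), all in Tflop/s, copied from Table 1.1
rows = [
    (1997, "ASCI Red", 1.830, 1.338, 0.43),
    (2003, "Earth Simulator", 40.96, 35.86, 5.0),
    (2011, "K Computer", 11280.0, 10510.0, 3080.0),
]

for year, system, rpeak, rmax, app in rows:
    print(f"{year} {system}: LINPACK {efficiency(rmax, rpeak):.1f}%, "
          f"application {efficiency(app, rpeak):.1f}%")
```

The gap between the two percentages in each row is the difference between LINPACK efficiency and application efficiency that the text refers to.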
The increasing number of cores per processor in recent years [Str+15, Figure 3] represents a major
change in the node architecture, which poses a challenge for performance analysis and optimization
efforts. For many parallel applications, it can be observed that the performance improvement that is
achieved by using multiple cores of a single processor significantly differs from the attainable speedup
when multiple processors are used. Figure 1.1 illustrates this phenomenon using the SPEComp2001
suite [Mül+04] as an example. SPEComp2001 consists of eleven individual benchmarks, which have
very small sequential portions. Thus—according to Amdahl’s Law [Amd67]—speedups close to the
ideal linear speedup can be expected for small numbers of threads [Asl+01, Table 2]. However, as
illustrated in Figure 1.1a, the achieved speedup on a single multi-core processor is far from optimal for
some of the benchmarks. Furthermore, the scaling with the number of used processors, which is depicted
in Figure 1.1b, significantly differs from the multi-core scaling. A similar discrepancy between multi-core
and multi-processor scaling can be observed for SPEComp2012 [Mül+12, Figure 3]. It must be
assumed that these differences are caused by characteristics of the hardware.
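
The expectation of near-linear speedup follows from Amdahl's Law, which bounds the speedup on n processing elements by 1 / (s + (1 - s) / n) for a sequential fraction s. The Python sketch below evaluates this bound; the sequential fraction of 1% is an assumed value for illustration and is not a measured property of the SPEComp2001 benchmarks.

```python
def amdahl_speedup(sequential_fraction: float, n: int) -> float:
    """Upper bound on the speedup with n processing elements (Amdahl's Law)."""
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be within [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / n)


# Assumed sequential fraction of 1 % (illustrative, not taken from the thesis)
for cores in (1, 2, 4, 8):
    print(f"{cores} cores: speedup <= {amdahl_speedup(0.01, cores):.2f}")
```

With such a small sequential portion the bound stays close to the ideal linear speedup for up to eight cores, so a much lower measured speedup points to other limiting factors.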
The complexity of today’s shared memory systems results in various potential bottlenecks that may
lead to suboptimal performance. The processor performance increases faster than the memory perfor-
mance [HP06, Section 5.1]. Therefore, modern microprocessors feature multiple levels of cache—small
and fast buffers for frequently accessed data—in order to improve the performance of memory accesses.
Nevertheless, memory accesses can account for a significant portion of the average cycles per instruc-
tion [BGB98] and thus constitute a significant portion of the processing time. Furthermore, caches and
the memory interface can be shared between multiple cores of a processor. The contention of shared re-
sources can limit the scalability of parallel applications [GSP11]. In multi-processor systems the physical
memory is typically distributed among the processors, which leads to different performance depending
on the distance to the accessed data. These non-uniform memory access (NUMA) characteristics also
influence the performance and scalability of parallel applications [MG11b; MG11a].
[Figure 1.1a: multi-core scaling using a single processor; y-axis: speedup (1.0 to 8.0); x-axis: number of used cores (1 to 8)]
[Figure 1.1b: multi-processor scaling; y-axis: speedup (1.0 to 4.0); x-axis: number of processors (1 to 4), 8 cores per processor]
[Legend: peak GFLOPS, 310.wupwise, 312.swim, 314.mgrid, 316.applu, 318.galgel, 320.equake, 324.apsi, 326.gafort, 328.fma3d, 330.art, 332.ammp]
Figure 1.1: SPEC OMPM2001 scaling on a quad-socket Intel Xeon X7560 system [Mol+11], based on
the ICA3PP'11 presentation (footnote 1), slide 15: The performance increase due to using multiple cores
(left) can differ significantly from the speedup that is achieved using multiple processors (right), e.g.,
312.swim scales poorly on the selected multi-core processor, but benefits strongly from using multiple
processors. 318.galgel and 320.equake show underwhelming performance gains if multiple processors are
used. The super-linear speedup of 316.applu is a known characteristic of this benchmark [FGD07].
Footnote 1: https://fusionforge.zih.tu-dresden.de/plugins/mediawiki/wiki/benchit/images/9/97/2011_Molka_ICA3PP.pdf
Cache coherence [HP06, Section 4.2 – 4.4] is another important aspect of shared memory systems. In
multi-core or multi-processor systems multiple caches are operating independently of one another. Thus,
multiple copies of a single memory address can exist. However, if multiple copies exist they must appear
to the software as a single copy. This is typically ensured by cache coherence protocols, which assign
status information to the cached data that indicates if the data is up-to-date and if other copies may
exist. Potentially conflicting accesses have to be coordinated, which involves communication between
the caches that may have a copy. Therefore, the state of the cached data influences the performance of
memory accesses [HLK97; Mol+09; MHS14; Mol+15].
In order to improve the understanding of the observed application performance it has to be determined
to which extent the performance is limited by the various components of the memory hierarchy. This
can be achieved with the help of performance monitoring units, which are included in many modern
processors [Amd13b, Section 2.7]; [Int14b, Volume 3, Chapter 18]; [Arm15, Chapter D5]. Performance
monitoring units implement multiple hardware performance counters that record each occurrence of a
certain event while the application is running. Using performance counters to detect memory related per-
formance issues is common practice [Era08; Lev09; THW13; Yas14]; [Int14a, Appendix B.3]. However,
the available events are often specific to a certain processor generation and their meaning is not always
obvious. Thus, it is not trivial to decide if the observed number of events is significant. To put things
into perspective, the maximal possible event rates—which are generally not available—are needed as a
reference. Furthermore, it has to be checked which events correctly represent the utilization of individual
components.
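
The role of the maximal event rate as a reference can be sketched as follows: a raw event count only becomes meaningful once it is converted into a rate and related to the highest rate the component can sustain. This Python sketch uses hypothetical numbers; the function name and all values are assumptions for illustration, not measurements from the thesis.

```python
def utilization(event_count: int, elapsed_seconds: float,
                max_events_per_second: float) -> float:
    """Relate an observed event rate to the maximal rate of the component."""
    if elapsed_seconds <= 0 or max_events_per_second <= 0:
        raise ValueError("elapsed time and maximal rate must be positive")
    observed_rate = event_count / elapsed_seconds
    return observed_rate / max_events_per_second


# Hypothetical values: 1.5e9 events counted in 2 s, while a benchmark that
# saturates the same component was measured at 1e9 events per second.
observed_events = 1_500_000_000
elapsed = 2.0
peak_rate = 1.0e9

print(f"utilization: {utilization(observed_events, elapsed, peak_rate):.0%}")
```

Without the measured peak rate, the same count could indicate either a nearly idle or a saturated component.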
This thesis focuses on the performance analysis of parallel applications on shared memory systems.
The trend of increasing parallelism within the nodes is expected to continue for the foreseeable fu-
ture [DOE10]; [Don+11, Section 3.1], thus the node complexity is presumably also going to increase
further. Therefore, understanding the node performance is an essential prerequisite for the efficient
utilization of contemporary as well as future HPC systems. Furthermore, the trend of increasing par-
allelism is not limited to HPC. Multi-core processors are also commonly used in various electronic
consumer devices like smartphones and computers. Moreover, powerful multi-processor workstations
are an established tool in professional computing. The performance analysis of shared memory systems
is also relevant for these areas. Distributed memory systems are not considered in this work. However,
multiple performance analysis tools that facilitate the detection of the performance issues that are associ-
ated with distributed memory are already available, e.g., Scalasca [Gei+10], HPCToolkit [Adh+10], and
Vampir [Knü+08]. The approach presented here can be combined with such tools in order to analyze
applications that span multiple nodes.
In order to detect bottlenecks in the memory hierarchy, one needs to know the peak achievable perfor-
mance of the individual components [HS11]. Therefore, this thesis introduces highly optimized micro-
benchmarks for processors that implement the 64 bit version of the x86 instruction set (x86-64). These
benchmarks measure the achievable performance of data transfers in multi-core processors as well as
multi-processor systems. This includes latency and bandwidth measurements for data that is located in
local and remote caches as well as the system’s NUMA characteristics. Furthermore, the impact of the
cache coherence protocol is considered. Based on this, a methodology for the identification of meaning-
ful hardware performance counters—that can be used to measure the utilization of various resources and
determine the impact of the memory hierarchy on the performance of parallel applications—is presented.
The procedure comprises three steps:
1. a comprehensive analysis of the performance of cache and memory accesses in contemporary
multi-processor systems in order to identify potential bottlenecks
2. stressing individual components in the memory hierarchy using micro-benchmarks in order to
identify performance counters that measure the utilization of these resources as well as the time
spent waiting for the memory hierarchy
3. a proof-of-concept visualization of the component utilization within the memory hierarchy as well
as memory related waiting times using existing performance analysis tools
Remote cache accesses as well as the impact of the coherence protocol are not sufficiently covered by
existing benchmarks. This necessitates the development of new benchmarks in order to consider all
potential bottlenecks in step 1. These highly optimized benchmarks can be configured to use individual
components to their full capacity. Furthermore, the amount of data that is accessed by the benchmarks
is known. This facilitates the identification of performance counters that correlate with the number of
memory accesses in step 2. Step 3 shows that the identified counters can be used to analyze the influence
of memory accesses on the achieved application performance. The contribution of this thesis is twofold:
• The newly developed micro-benchmarks enable an in-depth analysis of the memory performance
of shared memory systems including the impact of cache coherence protocols. Their sophisticated
design significantly advances the state-of-the-art in that area. The information that can be obtained
using these benchmarks provides valuable input for the analytical performance modeling of shared
memory systems [RH13; LHS13; Put14; PGB14; RH15; RH16].
• The methodology for the identification of meaningful hardware performance counters greatly im-
proves the ability of performance engineers to attribute performance problems to their respective
source. Due to the careful construction of the micro-benchmarks it can be verified which hard-
ware performance counters actually correlate with the utilization of the memory hierarchy. This
knowledge is an essential prerequisite for the performance counter based performance analysis.
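
Because the amount of data accessed by a micro-benchmark is known, counter readings can be checked against an expected value. The Python sketch below is a strongly simplified illustration of that idea and not the procedure used in the thesis: the event names, the readings, the 64 byte cache line size, and the 5% tolerance are all assumptions.

```python
CACHE_LINE_BYTES = 64  # assumption: cache line size of the examined processor


def expected_line_transfers(bytes_accessed: int,
                            line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Number of cache line transfers needed to move the given amount of data."""
    return bytes_accessed // line_bytes


def matching_counters(readings: dict, bytes_accessed: int,
                      tolerance: float = 0.05) -> list:
    """Return the events whose counts are within the tolerance of the expectation."""
    expected = expected_line_transfers(bytes_accessed)
    return sorted(name for name, count in readings.items()
                  if abs(count - expected) <= tolerance * expected)


# Hypothetical counter readings for a benchmark run that reads 1 GiB of data
readings = {"EVENT_A": 16_400_000, "EVENT_B": 8_100_000, "EVENT_C": 16_777_000}
bytes_accessed = 1 << 30

print(matching_counters(readings, bytes_accessed))
```

In this example EVENT_B reports roughly half of the expected transfers, so it would not be a reliable indicator of the utilization, regardless of what its name suggests.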
This thesis is organized as follows: Chapter 2 discusses related work and provides the required back-
ground knowledge. Chapter 3 describes the design and implementation of the micro-benchmarks.
In Chapter 4, these benchmarks are used to analyze the characteristics of memory accesses on a se-
lection of contemporary NUMA systems. This includes the properties of shared resources in multi-core
processors and interconnects in multi-socket systems as well as the influence of the cache coherence pro-
tocols. The analysis reveals fundamental differences in the performance of the different systems, which
can be attributed to differences in the processor and system architecture. Furthermore, several cases of
unexpectedly low performance are uncovered. Chapter 5 presents the methodology to identify mean-
ingful performance counters as well as the visualization of memory related performance problems. The
verification of the number of reported events using the micro-benchmarks shows that making assump-
tions based on the name of an event can easily result in wrong conclusions. It also shows that events
that are suggested by the hardware vendor [Int14a, Appendix B.3] do not always work as promised.
Chapter 6 concludes the thesis with a summary.
The following text formatting is used to improve the readability: Important terminology is intro-
duced using bold font. Small capitals are used to emphasize ASSEMBLER INSTRUCTION MNEMONICS.
Monospaced non-italic font is used to refer to function_calls() while command line tools
are written in monospaced italic font. All other highlighting of words uses proportional italic font, which
is for example used for names of third party tools. Abbreviations (ABR) are introduced in brackets and
listed in the list of abbreviations on page 157 and following.
2 Background and Related Work
Shared memory systems—i.e., computers with multiple processing elements (cores) and a shared random
access memory—are commonly used in a wide range of electronic equipment. Smartphone and tablet
processors with multiple general purpose cores are provided by multiple vendors (e.g., [Qual15; Nvi14]).
Multi-core processors are also customary in laptop and desktop computers (e.g., [Yuf+12; Dor+07]).
Computationally demanding tasks like 3D rendering can be performed on workstations with multiple
processors (e.g., [Del12c; Fuj16]). Furthermore, large scale systems for high performance computing
(HPC) typically consist of a multitude of interconnected shared memory systems (e.g., [Bul13; Cra10]).
An efficient utilization of shared memory systems is desirable in all the aforementioned areas. Therefore,
this work focuses on the performance analysis of shared memory systems. This section discusses the
state-of-the-art in that field and conveys the necessary technical background.
Processors are a basic component of all shared memory systems. Their structure and principle of opera-
tion is discussed in Section 2.1. Section 2.2 describes how customary multi-processor systems as well as
large scale systems with thousands of processors are constructed. Cache coherence protocols, which are
required to maintain a consistent view on the shared memory for all connected processors, are detailed
in Section 2.3. An overview of the essential resource management techniques provided by the operating
system is given in Section 2.4. Section 2.5 introduces the prevalent parallel programming models that
are necessary to utilize today’s computers. Performance evaluation is a crucial aspect of software devel-
opment as it is important that parallel applications efficiently utilize the complex hardware. Established
evaluation techniques and performance analysis tools are introduced in Section 2.6.
2.1 Processor Architecture
Processor development is driven by Moore’s Law [Moo65]. In 1965 Moore observed that the practicable
number of components in integrated circuits had doubled every twelve months in the preceding years and
predicted that this trend would continue. In 1975 the expected growth rate was reduced to a factor of
two in approximately 24 months [Moo75]. To this day the actual development matches this 40-year-old
prediction [SVL12]. The primary challenge of processor development is to turn the steadily increasing
number of transistors into more performance. One way is to improve the performance of the individual
cores as described in Section 2.1.1. The increasing transistor budget can also be used to increase the
cache size or add more cores as discussed in Sections 2.1.2 and 2.1.3, respectively. There are many
technical terms in the field of processor architecture, which are used in this document as follows:
CPU CPU is the short form of central processing unit. Originally, CPU referred to the single
general purpose processor in a computer [SGG12, Section 1.3.1]. A CPU contains all
necessary functionality to process programs in memory, namely the control unit, the
arithmetic/logic unit (ALU), and registers [Lim12, Section 2.2]. From the operating
system perspective, a CPU is a computational resource that processes a program, i.e.,
an entity the operating system can assign tasks to (see Section 2.4.1). In this work
CPU is used to denote such an individual computational resource. Many contemporary
processors contain multiple such units [KH09; Kur+11; Saw+11; Yuf+12; Hua+12;
Ham+14; Dor+07; Wei+11], i.e., a single processor can contain multiple CPUs.
Processor In this work, the term processor refers to the physical device plugged into the mother-
board socket. A processor contains one or more CPUs. Multiple CPUs can be imple-
mented as chip multi-processor (CMP) [DAS12, Section 8.4] with multiple processor
cores or via multi-threading [HP06, Section 3.5].
8 2 Background and Related Work
Processor core This term as well as its short form core is used in the context of multi-core processors.
Each processor core contains all necessary functionality to execute a program, thus it
constitutes at least one CPU. However, some processors support multiple CPUs per
core, e.g., [Int14a, Section 2.5]; [Fri+14]; [KAO05].
Compute unit Compute unit (CU) is used by AMD to refer to multiple processor cores that share
certain function units [McI+12]. It is comparable to a processor core that exposes
multiple CPUs to the operating system. However, each CPU has its own L1 data
cache and several dedicated execution units to reduce the impact of multi-threading on
the performance of the individual CPUs [But+11], i.e., it is an implementation of the
concept of “conjoined cores” [DAS12, Section 8.4.3].
Thread Thread can have two meanings. From the hardware point of view it refers to a com-
putational resource that a processor core exposes to the operating system, i.e., it is
synonymous with CPU. Thread is also used as an operating system construct that defines
an instruction stream within an address space (see Section 2.4). In this work, thread is
used with the latter meaning while several CPUs that are provided by a single core are
referred to as hardware threads or logical CPUs.
Die, chip A die or chip is a single piece of silicon that is incorporated in a package [ANE98].
Package Package denotes the device that is plugged into a socket or soldered on the mother-
board. It includes one or more dies [ANE98; Too+11]. A package that contains multi-
ple dies is called multi-chip-module (MCM) [Too+11].
2.1.1 Processor Core
One way to achieve a high performance of individual cores is to exploit instruction level paral-
lelism [HP06, Chapter 2]. Many contemporary processors (e.g., [Int14a, Section 2.1 – 2.4]; [Amd11,
Appendix A]; [Amd14c, Chapter 2]; [Fri+14]; [Arm14]) are superscalar designs—i.e., they execute mul-
tiple instructions each cycle—and use pipelining as well as out-of-order execution to achieve a high rate
of instructions per second (IPS). Single instruction multiple data (SIMD) extensions are used to further
improve the performance by increasing the number of integer and floating point operations per second
(IOPS, FLOPS). These techniques are explained in the following sections.
2.1.1.1 Pipelining and Superscalar Architectures
Pipelining [DAS12, Section 3.3] is one way to improve the performance of a core. The execution is split
into multiple phases as depicted in Figure 2.1. Multiple instructions can be in the pipeline concurrently—
one in every phase. Ideally, one instruction is completed in each cycle. The phases are less complex and
therefore require less time than the complete processing of the instructions. Thus, the cycle time is
reduced, which enables a higher clock rate and thereby increases the achievable IPS. The maximal speedup is
equal to the number of pipeline stages. However, the decoupling of the stages also introduces overhead.
Thus, the depth of the pipeline cannot be increased indefinitely [HP02]; [Eye+09, Section 4.2.1].
[Figure 2.1 diagram: Instruction Fetch (IF) → Instruction Decode (ID, wait for operands) → Execute (EX) → Memory (ME) → Write Back (WB)]
Figure 2.1: Basic 5-stage pipeline, based on [Bae10, Figure 2.1]: Each instruction is split into multiple
phases: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (ME), and Write
Back (WB). All instructions pass through all phases, thus the processing speed of individual instructions
is not increased. However, the five units simultaneously work on different instructions. Each
cycle the instructions proceed to their respective next phase and the fetch unit picks the next instruction.
If there are no stalls (e.g., due to data dependencies), one instruction completes in each cycle.
Superscalar architectures support multiple instructions in each pipeline stage. This increases the max-
imum number of instructions that can be executed per cycle (IPC). However, dependencies between
operations restrict the practically achievable performance [HP02]. Branch instructions—that interrupt
the sequential stream of instructions—as well as delays caused by memory accesses further reduce the
resource utilization [Eye+09, Section 3.1.2 - 3.1.4]. Consequently, increasing the width of the micro-
architecture has diminishing benefit for every stage of extension [Eye+09, Section 4.2.2].
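The timing argument for pipelining can be checked with a small calculation. The following Python sketch assumes an ideal pipeline without stalls in which every stage takes exactly one cycle, and compares it with a hypothetical core that finishes all stages of one instruction before starting the next. The function names are illustrative and not taken from the cited literature.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles needed if every instruction passes all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles needed by an ideal pipeline without stalls.

    The first instruction needs n_stages cycles to pass through the pipeline,
    every further instruction completes one cycle after its predecessor.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined execution time."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)


if __name__ == "__main__":
    for n in (1, 5, 100, 10_000):
        print(f"{n:>6} instructions, 5 stages: speedup {speedup(n, 5):.3f}")
```

For long instruction streams the speedup approaches the number of stages but never exceeds it. The sketch ignores the overhead of decoupling the stages as well as all stalls, so it only describes the upper bound.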
2.1.1.2 Out-of-Order Execution
Processors that support out-of-order execution [DAS12, Section 3.4]; [HP06, Section 2.4 - 2.5]; [Bae10,
Section 3.3] (also called dynamic scheduling) process instructions in a sequence that is determined by
the availability of the required operands instead of the program order. For this purpose, a so-called reservation
station (RS) is added, which decouples fetch and decode from the execution. As depicted in Figure 2.2,
multiple operations can wait in the reservation station until their operands become available instead of
one instruction waiting in the decode stage as is the case in in-order architectures. The operations
can enter the execution phase independently of the original program order. This reduces the number of
stall cycles due to true data dependencies [HP06, Section 2.1] (read-after-write (RAW) hazards) between
operations as subsequent independent operations can be used to fill the gaps. Furthermore, name depen-
dencies [HP06, Section 2.1] (write-after-write (WAW) and write-after-read (WAR) hazards) are resolved
by register renaming. Out-of-order execution is typically used together with branch prediction [Smi98;
YP92; ECP96]; [HP06, Section 2.3] and speculation [HP06, Section 2.6].
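How independent operations fill the gaps caused by RAW hazards can be illustrated with a toy scheduler. The following Python sketch is a strongly simplified model under stated assumptions: at most one instruction is issued per cycle, the reservation station is unbounded, every destination register is written only once (i.e., name dependencies are assumed to be already removed by register renaming), and the instruction sequence and latencies are made up for illustration.

```python
from typing import NamedTuple, Sequence, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction
    latency: int            # cycles until the result is available


def finish_cycle(program: Sequence[Instr], out_of_order: bool) -> int:
    """Return the cycle in which the last result becomes available."""
    # Map every source operand to the earlier instruction that produces it (RAW dependency).
    last_writer = {}
    producers = []
    for idx, ins in enumerate(program):
        producers.append([last_writer[r] for r in ins.srcs if r in last_writer])
        last_writer[ins.dest] = idx

    done_at = {}                           # instruction index -> cycle of result availability
    waiting = list(range(len(program)))    # not yet issued, in program order
    cycle = 0
    while waiting:
        # In-order: only the oldest waiting instruction may issue.
        # Out-of-order: any waiting instruction may issue, oldest first.
        window = list(waiting) if out_of_order else waiting[:1]
        for idx in window:
            if all(done_at.get(p, float("inf")) <= cycle for p in producers[idx]):
                done_at[idx] = cycle + program[idx].latency
                waiting.remove(idx)
                break                      # at most one instruction issues per cycle
        cycle += 1
    return max(done_at.values(), default=0)


# Hypothetical example: a long-latency load followed by a dependent add,
# and an independent multiply followed by a dependent add.
PROGRAM = [
    Instr("r1", (), 4),            # load r1
    Instr("r2", ("r1", "r0"), 1),  # add depends on the load (RAW)
    Instr("r3", ("r4", "r5"), 3),  # independent multiply
    Instr("r6", ("r3", "r7"), 1),  # add depends on the multiply (RAW)
]

if __name__ == "__main__":
    print("in-order:    ", finish_cycle(PROGRAM, out_of_order=False))
    print("out-of-order:", finish_cycle(PROGRAM, out_of_order=True))
```

In this example the out-of-order variant issues the independent multiply while the first add waits for the load, which shortens the total execution time although the issue width is the same in both cases.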
Branch instructions interrupt the sequential program flow. Thus, it is unclear which instructions have to
be fetched next. Waiting until the execution of the branch instruction is completed creates gaps in the in-
struction stream that lower the utilization of the pipeline. Branch prediction reduces this overhead [HP06,
Section 2.3]. To this end, the more probable branch direction (taken or not taken) is determined, typically
based on the previous behavior of branch instructions, which is recorded at runtime. Sequential fetch and
decode continues if not taken is predicted. Otherwise the branch target is searched in the branch target
buffer [HP06, Section 2.9], which stores previous destinations. It is checked if the prediction was correct
when the processing of the branch instruction is completed. In case of a wrong prediction, execution is
resumed at the correct target. However, it has to be ensured that instructions from a wrong path do not
change the software visible state.
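Predicting a branch direction from recorded behavior can be sketched with a 2-bit saturating counter. This is one common textbook scheme and is used here only as an assumption for illustration; the text does not state which predictor a particular processor implements. The function name and the example outcome sequence are hypothetical.

```python
from typing import Iterable


def count_correct_predictions(outcomes: Iterable[bool], counter: int = 1) -> int:
    """Return how many branch outcomes a single 2-bit saturating counter predicts correctly.

    Counter values 0 and 1 predict "not taken", values 2 and 3 predict "taken".
    """
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        # Record the actual behavior: move towards 3 if taken, towards 0 otherwise.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct


if __name__ == "__main__":
    # Branch at the end of a loop body: taken nine times, then not taken once,
    # with the loop being executed three times.
    loop_branch = ([True] * 9 + [False]) * 3
    hits = count_correct_predictions(loop_branch)
    print(f"{hits} of {len(loop_branch)} predictions correct")
```

With two bits of history, a single loop exit does not flip the prediction, so only the exit itself is mispredicted once the counter has warmed up.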
[Figure 2.2 diagram: in-order front-end with Instruction Fetch (IF) and Instruction Decode (ID); out-of-order execution with Reservation Station (RS, wait for operands) and Execute (EX); in-order completion with Reorder Buffer (ROB, wait for retirement) and Write Back (WB)]
Figure 2.2: Speculative out-of-order execution, based on Figure 2.1, extended with RS and ROB stages
from [Gwe95, Figure 3]: Instructions are fetched, decoded, and issued to the reservation station in
program order. Instructions are then dispatched to the execution units depending on the availability
of the required operands [Sim97]. The results are stored in the reorder buffer where they wait until all
prior instructions are completed. The software visible state is updated in program order. The memory
phase can be omitted for instructions without memory access [Gwe95, Figure 3].
Speculation [HP06, Section 2.6] ensures that the software visible state is updated in program order.
To this end, speculative results (i.e., results of instructions that are potentially never reached) are kept in
additional registers, which are not exposed by the instruction set architecture (ISA). This can for example
be implemented using a reorder buffer (ROB) as depicted in Figure 2.2. Instructions are completed
out-of-order and their results are written to the ROB from where they can be used as input for subsequent
instructions. Each cycle it is checked if the results of the oldest instructions in the ROB are available.
If this is the case, they are copied into the architectural registers. This is called “commit” or “retire-
ment” and completes the processing of the instructions. Speculation enables the processor to continue
execution beyond predicted branches. If a wrongly predicted branch is discovered, the entries of subse-
quent instructions are removed from the RS and the ROB and the execution is continued at the correct
branch target [Gwe95]. Speculation can also be implemented using physical register files that contain
architectural as well as the additional internal registers [HP06, Section 2.9].
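The interplay of out-of-order completion and in-order retirement can be sketched as follows. The Python model below is a simplification under stated assumptions: all instructions are already allocated in an unbounded ROB at cycle 0, the completion cycles are given as input, and the retirement width is a parameter (the default of four merely mirrors the retirement width listed in Table 2.1). Mispredicted branches are not modeled.

```python
from collections import deque
from typing import List


def retirement_cycles(completion_cycle: List[int], width: int = 4) -> List[int]:
    """Return the cycle in which each instruction retires.

    completion_cycle[i] is the cycle in which instruction i (in program order)
    finishes execution; instructions may finish in any order.
    """
    rob = deque(enumerate(completion_cycle))   # entries are allocated in program order
    retired_at = [0] * len(completion_cycle)
    cycle = 0
    while rob:
        retired_now = 0
        # Only the oldest entries may leave the ROB, and only if their results are available.
        while rob and rob[0][1] <= cycle and retired_now < width:
            idx, _ = rob.popleft()
            retired_at[idx] = cycle
            retired_now += 1
        cycle += 1
    return retired_at


if __name__ == "__main__":
    # Instructions 1, 2, and 4 finish early but have to wait for older instructions.
    print(retirement_cycles([5, 2, 3, 7, 4]))
```

Although instructions 1 and 2 finish in cycles 2 and 3, they cannot retire before instruction 0 has completed, which keeps the update of the software visible state in program order.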
Out-of-order architectures also contain load and store buffers in order to process memory accesses out-
of-order [HP06, Section 2.4]. After an address is calculated in the address generation unit (AGU) it
is stored in the load or store buffer assigned to the instruction. Afterwards the AGU can be used by
other instructions while the queued memory transfers are processed. These buffers also enable multiple
concurrently outstanding memory requests [Bae10, Section 6.2.2]. Furthermore, store buffers are used to
prevent speculative stores from becoming visible in the memory hierarchy. To this end, data written to memory
remains in the store buffers until the corresponding instructions are retired [Int14a, Section 2.3.5.2].
While each CPU sees its own memory accesses in program order, the order that is perceived by other
CPUs can differ. The possible rearrangements are architecture specific [McK10, Table 5].
Out-of-order execution, branch prediction and speculation do not increase the peak performance. How-
ever, the increased effective IPC—that results from the data-driven execution model [But+91; BD97]—
improves application performance. The further the processor can look ahead the more likely it is to locate
independent operations. Thus, the reorder window—the number of consecutive instructions that can be
in execution concurrently—is constantly increasing. The advancement of out-of-order resources across
multiple generations of Intel micro-architectures is detailed in Table 2.1.
2.1.1.3 Single Instruction Multiple Data
Single instruction multiple data (SIMD) instructions are another way to improve the performance of
a single core. As depicted in Figure 2.3, a single SIMD instruction performs multiple operations on
different operands. SIMD instructions are available in many processor families, e.g.:
• NEON extension for ARM processors [Arm13b]
• MMX technology, streaming SIMD extensions (SSE), and advanced vector extensions (AVX) in
x86 processors [Int14b, Volume 1, Chapter 9, 10, and 14]
• Vector/SIMD multimedia extension technology in PowerPC processors [Ibm05]
The width of the SIMD registers in x86 processors has been increased multiple times. Furthermore, sup-
port for fused multiply-add has been added in the Haswell micro-architecture. Table 2.1 shows how the
number of double precision floating point operations per cycle has developed in Intel micro-architectures.
SIMD instructions significantly increase the peak performance of a processor core if the width of the
floating point units (FPUs) matches the register size [Wec06].
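[Figure 2.3 diagram: source registers Src1 (X3, X2, X1, X0) and Src2 (Y3, Y2, Y1, Y0); the operation op is applied to each pair of elements; destination register Dest (X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0)]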
Figure 2.3: Functionality of SIMD instructions, based
on [Int14b, Vol. 1, Figure 9-4] (derived from [Mol08,
Figure 2.10]): SIMD instructions operate on vec-
tor registers that contain multiple operands (e.g.,
four single precision or two double precision float-
ing point values). Multiple operations are performed
with a single instruction. Each operation reads one
input operand from each source register and writes
one output operand to the destination register.
Table 2.1: Development of execution resources in Intel micro-architectures: Comparison of the P6
micro-architecture with the Core micro-architecture and its successors Nehalem, Sandy Bridge, and
Haswell¹. Decode and retirement have remained a 4-wide² superscalar design since the introduction of the
Core micro-architecture. The reorder window is steadily increasing and the peak number of floating
point operations per cycle (flop/cycle) is growing due to more capable SIMD instructions.
| micro-architecture | P6 | Core | Nehalem | Sandy Bridge | Haswell |
|---|---|---|---|---|---|
| decode [instr./cycle] | 3 | 4 | 4 | 4 | 4 |
| execute [micro-ops³/cycle] | 5 | 6 | 6 | 6 | 8 |
| retirement [micro-ops³/cycle] | 3 | 4 | 4 | 4 | 4 |
| RS entries | 20 | 32 | 36 | 54 | 60 |
| reorder window (ROB entries) | 40 | 96 | 128 | 168 | 192 |
| floating point / SIMD ISA | x87 | SSSE3 | SSE4.2 | AVX | AVX2, FMA |
| FPU width | 1x 80 bit | 2x 128 bit | 2x 128 bit | 2x 256 bit | 2x 256 bit |
| flop/cycle (double) | 1 | 4 | 4 | 8 | 16 |
| load/store buffers | 16/12 | 32/20 | 48/32 | 64/36 | 72/42 |
In order to benefit from SIMD instructions the code needs to be vectorized, i.e., independent instructions
of the same type have to be identified and replaced by SIMD instructions. Furthermore, the operands
need to be assigned to different slots in the vector registers. For some cases this can be done efficiently by
the compiler, e.g., for loops that conform to certain requirements [NZ08]. However, it is very important
that the operands are consecutive in memory in order to utilize the vector load and store instructions
that transfer data in chunks of the register size. The achievable speedup is reduced significantly for non-
sequential access patterns [Fra+05; Pen+13] as the vector registers need to be packed and unpacked in
that case [TJB03, Section 4.2]. SIMD instructions can also be used manually by the programmer using
intrinsics [Lom11] or assembly language. Furthermore, highly optimized math libraries are available
that efficiently utilize the SIMD instructions [Int14c; Amd13a].
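The relation between FPU width and peak floating point throughput in Table 2.1 can be reproduced with a short calculation. The following Python sketch assumes that all FPUs can start one vector operation per cycle, that a double precision operand occupies 64 bit, and that a fused multiply-add counts as two operations. The function name is illustrative. The scalar 80 bit x87 unit of the P6 micro-architecture does not fit this model and is left out.

```python
def peak_dp_flop_per_cycle(fpu_count: int, fpu_width_bits: int, fma: bool = False) -> int:
    """Peak double precision floating point operations per cycle of one core."""
    lanes = fpu_width_bits // 64      # double precision operands per vector register
    ops_per_lane = 2 if fma else 1    # a fused multiply-add counts as two operations
    return fpu_count * lanes * ops_per_lane


if __name__ == "__main__":
    configurations = {
        "Core":         (2, 128, False),
        "Nehalem":      (2, 128, False),
        "Sandy Bridge": (2, 256, False),
        "Haswell":      (2, 256, True),
    }
    for name, (count, width, fma) in configurations.items():
        print(f"{name:<13} {peak_dp_flop_per_cycle(count, width, fma):>2} flop/cycle")
```

The results match the flop/cycle row of Table 2.1 for the four SIMD-capable micro-architectures. They are peak values that are only reached by fully vectorized code.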
2.1.1.4 Multi-threading
Superscalar out-of-order architectures require a certain number of independent instructions within the
reorder window in order to achieve maximal throughput. However, the instruction level parallelism that
is available in applications is limited and only a fraction of it can actually be exploited with reasonable
effort [HP06, Section 3.2]. Thus, the execution units are often underutilized. Multi-threading [DAS12,
Section 8.3]; [HP06, Section 3.5]; [Lo+97] converts thread level parallelism into instruction level parallelism
by executing multiple threads on a single core in order to improve the resource utilization. To this end,
the data structures that contain the state of the executing thread (e.g., the architectural registers)
are duplicated. Most other resources (e.g., renaming registers and ROB entries) can be dynamically
shared or statically partitioned between the logical CPUs [Int14a, Section 2.5.1]. The execution units
are also shared by all threads. Sharing can be implemented by switching threads at frequent intervals
(fine-grained multi-threading) or in case of long stalls (coarse-grained multi-threading). The former is
for example implemented in Sun’s Niagara (UltraSPARC T1) processor [KAO05], the latter is used in
some Intel Itanium processors [MB05]. The most sophisticated implementations enable the concurrent
execution of instructions from different threads, which is called simultaneous multi-threading (SMT)
and provides the highest increase in resource utilization [HP06, Figure 3.8]. SMT is implemented in
many contemporary Intel x86 [Int14a, Section 2.5] as well as IBM POWER processors [Fri+14].
¹ Sources: P6 [Gwe95; Int95]; [Bae10, Section 3.4], Core [Wec06]; [Int14a, Section 2.3], Nehalem [Int09b]; [Int14a, Section 2.4], Sandy Bridge [Yuf+12]; [Int14a, Section 2.2], Haswell [Ham+14]; [Int14a, Section 2.1].
² decode: up to five instructions per cycle with macro-op fusion [Wec06]; retirement: four fused micro-ops [Goc+03]
³ Instructions are decoded into micro-ops (µops), i.e., elementary operations that can be executed by the execution units [Gwe95].
2.1.2 Cache
Caches are fast on-chip memories that provide a higher bandwidth and have a lower access latency than
main memory, which is also called random access memory (RAM). The performance improvement of
processors has outpaced the development of the memory technology—dynamic random access mem-
ory (DRAM)—for more than three decades. This leads to the so-called processor-DRAM performance
gap [HP06, Section 5.1]. In modern server systems, memory accesses can take hundreds of clock cy-
cles [MHS14; Mol+15], which is beyond the reorder capabilities and would therefore frequently stall
the execution. Furthermore, the memory bandwidth is not high enough to provide the execution units
with operands each clock cycle. However, frequently used data is migrated into the caches in order to
bridge the processor-DRAM performance gap and thereby reduce stalls in the processor core. This is
based on the principle of locality [HP06, Section 1.9], which asserts that recently accessed data is likely
used again in the near future (temporal locality) and objects that are located close together in memory
are likely accessed close together in time (spatial locality).
Caches store copies of memory blocks alongside their corresponding addresses as depicted in Figure 2.4.
When a processor requests data for a certain memory address it first checks if the cache contains a valid
entry for that address. If a cache miss occurs—i.e., no valid entry is found—the data is requested from
memory and inserted in the cache [Bae10, Section 2.2.1]. Memory is byte-addressable. However, storing
the whole address of every accessed byte would be inefficient as this would consume more of the cache
memory than the actual data. Therefore, caches store consecutive blocks of memory with a common
base address in order to limit the area overhead to acceptable levels. The blocks are referred to as cache
lines. Typical cache line lengths are 32 to 128 bytes [Bae10, Section 2.2.2].
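The grouping of bytes into cache lines with a common base address can be expressed with simple bit operations. The following Python sketch assumes a line length of 64 bytes, which is one value within the range named above and not a property of a specific processor. The function names are illustrative.

```python
LINE_SIZE = 64  # bytes; assumed value, must be a power of two


def line_base(address: int, line_size: int = LINE_SIZE) -> int:
    """Base address of the cache line that contains the given byte address."""
    return address & ~(line_size - 1)


def line_offset(address: int, line_size: int = LINE_SIZE) -> int:
    """Position of the addressed byte within its cache line."""
    return address & (line_size - 1)


def lines_touched(start: int, n_bytes: int, line_size: int = LINE_SIZE) -> int:
    """Number of cache lines covered by n_bytes consecutive bytes starting at start."""
    first = line_base(start, line_size)
    last = line_base(start + n_bytes - 1, line_size)
    return (last - first) // line_size + 1


if __name__ == "__main__":
    print(hex(line_base(0x1234)), line_offset(0x1234))
    # Eight consecutive, line-aligned double precision values share one cache line.
    print(lines_touched(0x1200, 8 * 8))
    # A 16 byte access that crosses a line boundary touches two cache lines.
    print(lines_touched(0x1238, 16))
```

Only one base address has to be stored per line, and an access to one byte also brings the neighboring bytes of the line into the cache, which is what makes spatial locality pay off.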
[Figure 2.4 diagram: the requested memory address (p bit) is split into a tag (p-(m+k) bit), an index (m bit), and a byte select field (k bit); each of the n ways consists of 2^m directory entries (status, tag) and 2^m cache lines of 2^k bytes; the tags of the entries selected by the index are compared (=?) with the tag of the requested address; example address fields: tag 1F5F78h, index 001b, byte select 00110b]
Figure 2.4: Cache structure and functional principle, based on [DAS12, Figure 4.4]; [Amd15a, Figure
7-3]: Caches consist of two data structures, a directory, which stores tags and status of cache lines,
as well as the data memory [DAS12, Section 4.3.1]. The requested memory address is typically split
into three parts, a tag, an index, and a byte select field. Data is returned from the cache if the directory
contains a valid entry that matches the tag of the requested address. In direct-mapped (single way)
and set-associative caches (multiple ways), the index restricts the number of possible locations for
a given address to one per way. In a fully-associative cache the tag has to be checked against all
directory entries. The byte select field is used to extract the required bytes from the cache line.
There are three types of caches: fully-associative caches, set-associative caches, and direct-mapped
caches. In a fully-associative cache [DAS12, Section 4.3.1] each cache line can be stored in every
location of the cache. This is the most flexible design. However, this also requires checking the tag against
every directory entry, which is not feasible for large caches [Bae10, Section 2.2.1]. A set-associative
cache—as depicted in Figure 2.4—reduces the number of possible locations to a manageable quantity.
This is done by using bits from the requested address as an index to preselect a subset of directory
entries [DAS12, Section 4.3.1]. In a direct-mapped cache an n-bit index is used to select one of 2^n
entries, whose tag is then compared with the tag of the requested address. Set-associative caches are commonly used
as a reasonable trade-off between hardware effort and hit rate [Bae10, Section 2.2.2]. There are three
reasons for cache misses [DAS12, Section 4.3.5]: compulsory misses, capacity misses, and conflict
misses. Compulsory misses occur on the first access to a cache line. They can be reduced by prefetching.
Capacity misses happen if the working set does not fit into a certain cache level, i.e., the data is evicted
before it is accessed again. Conflict misses arise if the working set contains more cache lines with the
same index than the degree of associativity allows. Conflict misses can occur even if the working set is
smaller than the cache capacity.
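The address decomposition and the occurrence of conflict misses can be illustrated with a small cache model. The following Python sketch is not a description of a real cache: it assumes least recently used (LRU) replacement, which the text does not prescribe, and the class and method names are hypothetical. The example address combines the field values that appear in Figure 2.4 (tag 1F5F78h, index 001b, byte select 00110b).

```python
from collections import OrderedDict
from typing import Tuple


class SetAssociativeCache:
    """Toy model of a cache directory; n_ways=1 yields a direct-mapped cache."""

    def __init__(self, n_sets: int, n_ways: int, line_size: int) -> None:
        # n_sets and line_size are assumed to be powers of two.
        self.n_sets, self.n_ways, self.line_size = n_sets, n_ways, line_size
        self.sets = [OrderedDict() for _ in range(n_sets)]  # per set: tags in LRU order
        self.misses = 0

    def split(self, address: int) -> Tuple[int, int, int]:
        """Split an address into (tag, index, byte select)."""
        byte_select = address % self.line_size
        index = (address // self.line_size) % self.n_sets
        tag = address // (self.line_size * self.n_sets)
        return tag, index, byte_select

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss the line is inserted (LRU replacement assumed)."""
        tag, index, _ = self.split(address)
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)        # mark as most recently used
            return True
        self.misses += 1
        if len(ways) == self.n_ways:
            ways.popitem(last=False)     # evict the least recently used line of this set
        ways[tag] = None
        return False


if __name__ == "__main__":
    direct_mapped = SetAssociativeCache(n_sets=8, n_ways=1, line_size=32)
    two_way = SetAssociativeCache(n_sets=4, n_ways=2, line_size=32)
    print([hex(field) for field in direct_mapped.split(0x1F5F7826)])
    # Two lines (64 bytes) that map to the same index are accessed alternately.
    for _ in range(4):
        for address in (0, 256):
            direct_mapped.access(address)
            two_way.access(address)
    print("direct-mapped misses:", direct_mapped.misses)
    print("2-way misses:        ", two_way.misses)
```

Both caches have a capacity of 256 bytes and the working set comprises only two lines. The direct-mapped cache nevertheless misses on every access because both lines compete for the same entry, whereas the 2-way set-associative cache only incurs the two compulsory misses.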
Typically, caches are organized in multiple levels [BDM09]. The memory hierarchy is often represented
as a pyramid with the level one cache (L1) being the highest level and main memory being the lowest
level of primary memory [DAS12, Section 4.2]. The level one cache typically supports multiple requests
each cycle and delivers data within a few cycles. However, the extremely fast L1 caches are usually very
small. Thus, multiple levels of caches are required to bridge the memory gap. Each additional cache level
features a higher capacity. The bandwidth decreases and the access latency increases with each level as
exemplified in Table 2.2. The bandwidth of the lower levels in the memory hierarchy is too low to provide
the execution units with a sufficient number of operands each cycle. Furthermore, the long latencies can
stall the processing of instructions. However, hardware prefetchers [Int14a, Section 3.7.2 – 3.7.3] that
recognize common memory access patterns—e.g., strides—reduce the impact of memory related stalls
on the performance. The level one cache is often divided into a data cache and an instruction cache while
the lower cache levels are typically unified caches that store code and data [DAS12, Section 4.2]. Caches
can be inclusive of higher cache levels [DAS12, Section 4.2.3]. Inclusive caches contain copies of the
data in higher cache levels. In contrast, exclusive caches ensure that cache lines are removed when they
are requested by a higher cache level. Exclusive caches result in a higher usable cache size as redun-
dant copies are avoided. However, inclusive caches simplify the cache coherence mechanism [BW89]
(see Section 2.3).
The write policy can be write-through or write-back [DAS12, Section 4.3.3]. A write-through cache
also updates the lower levels in the memory hierarchy if cache lines are changed. In contrast, write-back
caches only update their copy of the cache line when they are written to. The changes are eventually
written back to lower levels when the cache line gets evicted. Write-back caches have a lower bandwidth
demand and are therefore often preferred—especially in the case of the last level cache (LLC)—in order
to reduce the required main memory bandwidth [Bae10, Section 2.2.2]. However, the write-back policy
can result in inconsistent copies of the cache line and thereby complicates the cache coherence mechanism
(see Section 2.3). Caches can implement a write-allocate or no-write-allocate policy in case of
cache misses during a store operation [HP06, Section C.1]. Write-allocate caches place a copy of the
cache line that is written to in the cache. In contrast, caches that implement the no-write-allocate policy
only update the lower levels of the memory hierarchy in that case.
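The different bandwidth demand of the two write policies can be made tangible by counting the writes that reach the next lower level. The following Python sketch assumes a small fully-associative, write-allocate cache with LRU replacement and a made-up sequence of stores that are given as cache line numbers. It counts write operations, not bytes: a write-through cache forwards the stored data, while a write-back cache transfers whole cache lines.

```python
from collections import OrderedDict
from typing import Sequence


def lower_level_writes(store_lines: Sequence[int], capacity: int, write_back: bool) -> int:
    """Count the writes to the next lower level caused by a sequence of stores."""
    if not write_back:
        return len(store_lines)          # write-through: every store is propagated
    dirty = OrderedDict()                # modified lines in LRU order
    writes = 0
    for line in store_lines:
        if line in dirty:
            dirty.move_to_end(line)      # line is already modified, no traffic
        else:
            if len(dirty) == capacity:
                dirty.popitem(last=False)
                writes += 1              # the evicted modified line is written back
            dirty[line] = None
    return writes + len(dirty)           # remaining modified lines are written back eventually


if __name__ == "__main__":
    stores = [0, 0, 0, 1, 1, 0, 2, 2]    # hypothetical store sequence (cache line numbers)
    print("write-through:", lower_level_writes(stores, capacity=2, write_back=False))
    print("write-back:   ", lower_level_writes(stores, capacity=2, write_back=True))
```

Repeated stores to the same cache line are absorbed by the write-back cache, which is the reason for its lower bandwidth demand.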
Table 2.2: Latency and bandwidth for different levels in the memory hierarchy of an Intel Xeon X5570
processor (measured at 2.93 GHz) [Mol+09]: The time spent waiting for the memory hierarchy can
constitute a considerable fraction of a program’s execution time [HP06, Figure 4.10].
| level of memory hierarchy | capacity | latency [cycles (ns)] | read bandwidth [GB/s] |
|---|---|---|---|
| Level one data cache (L1D) | 32 KiB | 4 (1.3) | 45.6 |
| Level two cache (L2) | 256 KiB | 10 (3.4) | 31.1 |
| Level three cache (L3) | 8 MiB | 38 (13.0) | 26.2 |
| main memory (RAM) | 12 GiB | 191 (65.1) | 10.1 |
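The bandwidth values in Table 2.2 can be related to the clock rate in order to see how many operands each level can deliver per cycle. The following Python sketch assumes that 1 GB/s corresponds to 10^9 bytes per second and that operands are double precision values of 8 bytes. The function names are illustrative.

```python
CLOCK_GHZ = 2.93                      # clock rate used for the measurements in Table 2.2
READ_BANDWIDTH_GBS = {"L1D": 45.6, "L2": 31.1, "L3": 26.2, "RAM": 10.1}


def bytes_per_cycle(bandwidth_gbs: float, clock_ghz: float = CLOCK_GHZ) -> float:
    """GB/s divided by 10^9 cycles per second yields bytes per cycle."""
    return bandwidth_gbs / clock_ghz


def doubles_per_cycle(bandwidth_gbs: float, clock_ghz: float = CLOCK_GHZ) -> float:
    """Number of 8 byte operands that can be read per cycle."""
    return bytes_per_cycle(bandwidth_gbs, clock_ghz) / 8


if __name__ == "__main__":
    for level, bandwidth in READ_BANDWIDTH_GBS.items():
        print(f"{level:<4} {bytes_per_cycle(bandwidth):5.1f} byte/cycle, "
              f"{doubles_per_cycle(bandwidth):4.2f} double/cycle")
```

According to this estimate, main memory delivers less than one double precision operand every two cycles, whereas the Nehalem micro-architecture can perform up to four double precision operations per cycle (see Table 2.1).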
2.1.3 Multi-core Processors
Increasing the number of cores in a processor is another way to utilize the growing transistor budget that
is enabled by the perpetual improvement of semiconductor manufacturing. Progress on the core level
is still being made as shown in Section 2.1.1 (see Table 2.1). However, the available instruction level
parallelism is limited and extracting it requires excessive control logic [HP06, Chapter 3]; [Eye+09].
Therefore, further enhancements of the number of instructions per cycle (IPC) are increasingly hard.
Furthermore, the huge clock rate improvements that have been common since the 1990s came to an
end around 2002 [DAS12, Section 1.2, Figure 1.5]. These two effects lead to a stagnation of the instructions
per second (IPS) per core. Thus, SIMD instructions remain the only option to significantly
improve the performance per core. However, SIMD instructions require code modification or sophisti-
cated compilers to improve performance [HP06, Section B.8]. The speedup can be very low for complex
codes [Kri+12b]; [Pen+13]; [TJB03, Section 4.3]. Consequently, the focus of processor development
shifted towards increasing the number of processor cores in the early 2000s [HP06, Section 3.8] in order
to maintain the accustomed performance increase of new processor generations. The peak performance
scales linearly with the increasing core count. However, in order to benefit from multiple cores the appli-
cation has to use multiple threads or processes, which the operating system can assign to different cores.
A selection of parallel programming models that enable this is described in Section 2.5.
Multi-core processors can be homogeneous or heterogeneous. Homogeneous multi-core processors con-
sist of several identical cores [Bae10, Section 8.2]; [DAS12, Section 8.4.1]. In contrast, heterogeneous
multi-core processors consist of different types of cores [DAS12, Section 8.4.2]. This enables the integration
of general purpose cores and graphics processing units in a single chip as found, e.g., in
AMD’s accelerated processing units [Amd14b]. Other examples are so-called single ISA heterogeneous
core architectures like ARM’s big.LITTLE [Arm13a] concept, which combines high performance
as well as low power cores with the same instruction set. As depicted in Figure 2.5 some resources
are usually competitively shared between all cores. A shared last level cache is a widely-used feature.
Furthermore, integrated memory controllers that all cores have access to are common nowadays [Int14a,
Section 2.1, 2.2, and 2.4]; [Amd11; Amd14c; Wen+11]. Unfortunately, the computational performance
increases faster than the memory bandwidth. Figure 2.6 shows this development using the example of
Intel server processors from 2004 to 2014. The memory bandwidth is steadily increasing, but it does not
keep pace with the combined effects of increasing core count and increased performance per core. While
the floating point performance increased by a factor of 69 in the examined period, the memory bandwidth
only improved by a factor of 10.6. Consequently, an efficient usage of the cache hierarchy is of growing
importance in order to avoid limitations caused by memory accesses. In multi-core processors, caches
provide an additional benefit. Data can be replicated, i.e., each core can have its own copy of the data
in order to reduce contention in the shared memory [HP06, Section 4.2]. However, it has to be ensured
that modifications become visible to all cores as if they were made directly in memory. The cache
coherence mechanisms that ensure correct parallel operation are explained in Section 2.3.
[Figure 2.5 diagram: a multi-core processor with cores 0 to n, each with its own L1 and L2 cache; all cores share the L3 cache, the memory controller with the attached RAM, and the point-to-point connections to I/O and other processors]
Figure 2.5: Composition of multi-core processors, based
on [Hil+10, Figure 2]; [Con+10, Figure 1] (derived from
[Mol+10, Figure 1]): In a multi-core processor several pro-
cessor cores are integrated on a single die. Typically, the
level one caches (L1) are duplicated as well [DAS12, Sec-
tion 8.4.1]. Separate level two caches (L2) for each core
also are a common feature [Int14a, Section 2.1, 2.2, and
2.4]; [Amd11; Amd14c; Wen+11]. However, certain sup-
porting components are usually shared by all cores [DAS12,
Section 8.4.1]. Most notably, the last level cache (LLC), the
integrated memory controller (IMC), and the processor in-
terconnects are often implemented as shared resources.
[Figure 2.6 chart: normalized floating point performance and normalized memory bandwidth of Intel processors over one decade; horizontal axis: performance improvement compared to 2004 (0x to 80x). Values shown in the chart:]
| year | processor | cores | flop/cycle | frequency | memory interface | peak performance | memory bandwidth |
|---|---|---|---|---|---|---|---|
| 2004 | Intel Xeon 3.20 GHz | 1 | 2 | 3.2 GHz | FSB-800 | 6.4 GFLOPS | 6.4 GB/s |
| 2006 | Intel Xeon 5060 | 2 | 2 | 3.2 GHz | FSB-1066 | 12.8 GFLOPS | 8.5 GB/s |
| 2008 | Intel Xeon X5472 | 4 | 4 | 3.0 GHz | FSB-1600 | 48 GFLOPS | 12.8 GB/s |
| 2010 | Intel Xeon X5670 | 6 | 4 | 2.93 GHz | 3x DDR3-1333 | 70.3 GFLOPS | 32 GB/s |
| 2012 | Intel Xeon E5-2690 | 8 | 8 | 2.9 GHz | 4x DDR3-1600 | 185.6 GFLOPS | 52.2 GB/s |
| 2014 | Intel Xeon E5-2690 v3 | 12 | 16 | 2.3 GHz | 4x DDR4-2133 | 441.6 GFLOPS | 68.2 GB/s |
Figure 2.6: Double precision floating point performance and memory bandwidth development of Intel
processors⁴: Bars are normalized to the performance of the 3.2 GHz Xeon processor from 2004. The
computational performance improved by a factor of 69 due to the increasing core count and more
capable FPUs (see Table 2.1). In the same time frame, the memory bandwidth only increased by a
factor of 10.6. Thus, the processor-DRAM performance gap [HP06, Section 5.1] is still widening.
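The improvement factors stated for Figure 2.6 follow directly from the values shown in the figure. The following Python sketch recomputes the peak performance as the product of core count, flop/cycle, and frequency and derives the available memory bandwidth per floating point operation. The processor data is copied from the labels of Figure 2.6, the assignment of the years to the processors follows the order of the labels, and the function names are illustrative.

```python
# (year, processor, cores, flop/cycle, frequency in GHz, memory bandwidth in GB/s)
# Values as printed in Figure 2.6.
PROCESSORS = [
    (2004, "Xeon 3.20 GHz",   1,  2, 3.2,   6.4),
    (2006, "Xeon 5060",       2,  2, 3.2,   8.5),
    (2008, "Xeon X5472",      4,  4, 3.0,  12.8),
    (2010, "Xeon X5670",      6,  4, 2.93, 32.0),
    (2012, "Xeon E5-2690",    8,  8, 2.9,  52.2),
    (2014, "Xeon E5-2690 v3", 12, 16, 2.3, 68.2),
]


def peak_gflops(cores: int, flop_per_cycle: int, ghz: float) -> float:
    """Peak double precision GFLOPS of a processor."""
    return cores * flop_per_cycle * ghz


def improvement_factors():
    """Return (performance factor, bandwidth factor) of the last entry relative to the first."""
    _, _, c0, f0, g0, bw0 = PROCESSORS[0]
    _, _, c1, f1, g1, bw1 = PROCESSORS[-1]
    return peak_gflops(c1, f1, g1) / peak_gflops(c0, f0, g0), bw1 / bw0


if __name__ == "__main__":
    for year, name, cores, flop, ghz, bandwidth in PROCESSORS:
        gflops = peak_gflops(cores, flop, ghz)
        print(f"{year} {name:<16} {gflops:6.1f} GFLOPS, {bandwidth / gflops:.2f} byte/flop")
    performance, bandwidth = improvement_factors()
    print(f"performance x{performance:.1f}, memory bandwidth x{bandwidth:.1f}")
```

The available memory bandwidth per floating point operation drops from 1 byte/flop in 2004 to roughly 0.15 byte/flop in 2014, which quantifies the widening gap.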
2.1.4 Power Management
Contemporary processors implement several power management techniques in order to reduce the power
consumption if the full performance is not required. ACPI P-states and C-states [Hew+13] are of par-
ticular interest as they lower the power consumption of running systems.
P-states define several performance levels. P0 is the state with the highest performance. Higher P-states
reduce the performance and the power consumption. The operating system selects a P-state based on the
current load on the system. Often it is also possible to control them manually. The hardware implements
P-states using Dynamic Voltage and Frequency Scaling (DVFS) [Int04], which enables significant power
savings while the cores continue to execute instructions. However, DVFS can increase the application’s
runtime. The extent of the performance degradation varies depending on hardware and software char-
acteristics. Hsu et al. and Ge et al. present approaches that consider this and enable significant energy
savings with limited performance loss [HF05; Ge+07]. P-states can be set per CPU, i.e., independently
for each core or even per logical CPU if multi-threading is supported. However, the hardware may select
higher voltages or frequencies if multiple cores share a voltage or frequency domain. Some processors
implement hardware controlled DVFS in addition to the software controlled ACPI P-states, e.g.,
Intel’s Turbo Boost [Int09b] and AMD’s Core Performance Boost [Amd13b, Section 2.5.2.1.1] features.
These technologies attempt to improve performance by increasing the frequency beyond the nominal
frequency as long as certain thermal and power constraints are met. In that case a certain P-state is
selected, but the hardware dynamically changes voltages and frequencies. The operating frequency of
shared resources—e.g., the last level cache—can be coupled to the core clock [Hua+12], fixed to a certain
frequency [Hil+10], or changed independently [Amd13b, Section 2.5.2.2].
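How strongly a lower P-state affects the runtime depends on how much of the execution time scales with the core clock. A minimal sketch, assuming a simple two-component runtime model that is not taken from the cited works: only the core-bound share scales with 1/f, while the memory-bound share stays constant. The function name and all numbers are illustrative placeholders.

```python
def relative_runtime(core_bound_share, f_nominal_ghz, f_ghz):
    """Runtime at frequency f relative to the runtime at the nominal frequency.

    Assumed model: the core-bound share scales with 1/f, the remaining
    (memory-bound) share is unaffected by the core frequency.
    """
    if not 0.0 <= core_bound_share <= 1.0:
        raise ValueError("core_bound_share must be within [0, 1]")
    memory_bound_share = 1.0 - core_bound_share
    return core_bound_share * (f_nominal_ghz / f_ghz) + memory_bound_share


# Halving the frequency (illustrative values) for two hypothetical applications.
compute_bound = relative_runtime(0.9, f_nominal_ghz=2.9, f_ghz=1.45)
memory_bound = relative_runtime(0.2, f_nominal_ghz=2.9, f_ghz=1.45)
print(f"compute bound: {compute_bound:.2f}x, memory bound: {memory_bound:.2f}x")
```

In this toy model the memory-bound application loses far less performance, which is the kind of workload dependence that DVFS policies can exploit.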
C-states reduce the power consumption of idle cores. C0 is the active state in which the processor cores
are fully operational. The execution of instructions stops in higher C-states. Contemporary processors
implement multiple C-states that gradually decrease the power consumption by disabling more and more
functionality. C-states enable more powerful power saving techniques like clock-gating or power-gating.
However, the reactivation of the cores is not instantaneous and the cache content can be lost. The operating
system uses C-states if cores are idle. Thus, if an application does not utilize all available cores, spare
cores will be deactivated to reduce the energy consumption. Dynamic Concurrency Throttling (DCT) is
a technique that dynamically adjusts the number of active threads depending on workload characteristics
in order to exploit C-states to improve the energy efficiency. Curtis-Maury et al. and Lively et al. show
that DCT can improve the energy efficiency of threaded applications [Cur+07; Liv+11]. It can also be
used in combination with DVFS to further improve the energy efficiency [Cur+08; Liv+12].
4 Source: [Int02; Int06a; Int06b; Int13b; Int14a; Int15b; Int15d; Int15e]. The peak GFLOPS are calculated using the base
frequency (no Turbo Boost). For the Xeon E5-2690 v3 processor the reduced frequency for AVX workloads [Int15e,
Table 3] is considered.
2.2 System Architecture
The performance of a single processor is limited by the manufacturable chip-size as well as the man-
ageable power density. Thus, multiple processors need to be combined to perform more demanding
tasks. Two or four processors are commonly used in off-the-shelf workstations and servers, e.g., [Del12c;
Hew14; Fuj14]. Multi-processor systems are also used as building blocks for larger systems, e.g., [Bul13;
Cra10; Meg14]. In that case they are referred to as nodes. Section 2.2.1 details the composition of these
nodes. The architecture of large scale systems for high performance computing (HPC) is described
in Section 2.2.2.
2.2.1 Node Architecture
Often multiple processors are combined in a node, e.g., [Del12c; Hew14; Fuj14]. All processors have
access to all the memory in such a multi-processor system [HP06, Section 4.1]. Figure 2.7 depicts two
customary types of multi-processor systems: bus-based systems with a central memory controller and
systems with distributed memory controllers and a point-to-point interconnection network.
In the bus based approach (see Figure 2.7a) all processors can access each memory location with the same
latency, which is called uniform memory access (UMA). However, the achievable bandwidth is limited
by the shared bus and the central memory controller [HP06, Section 4.2]. In contrast, the total bandwidth
scales with the number of processors if each processor has an integrated memory controller [Kel+03].
Therefore, AMD and Intel both switched from Front-Side-Bus (FSB) based systems to integrated mem-
ory controllers and point-to-point connections for their contemporary multi-socket x86 servers. AMD
conducted the change with the introduction of HyperTransport (HT) based systems [Kel+03] in 2003.
Intel adopted distributed memory controllers with the QuickPath Interconnect (QPI) [Int09a] in 2008.
[Figure 2.7, panel (a) bus based system: one node in which Processors 1 to 4 (multiple cores each) access all memory modules through a shared bus and a central memory controller.]
[Figure 2.7, panel (b) system with point-to-point connections: one node in which Processors 1 to 4 (multiple cores each) each have their own memory modules.]
Figure 2.7: Memory distribution in systems with four processors: In bus based systems [HP06, Section 4.1]
(left, based on [Int09a, Figure 3]) all memory accesses are performed via the shared bus.
In systems with integrated memory controllers in the processors [CH07] (right, based on [Int09a,
Figure 6]) data delivered from local memory does not consume interconnect bandwidth. The cache
coherence protocol (see Section 2.3) creates additional traffic in both cases.
Unfortunately, distributed memory controllers also have a downside. The characteristics of memory accesses
depend on the distance between the requesting core and the memory controller that services the
request. Accesses to the local memory are generally faster than remote memory accesses, which require
additional transfers via the point-to-point connections between the processors. This non-uniform mem-
ory access (NUMA) behavior [HP06, Section 4.1] needs to be taken into account in order to achieve
good performance. Furthermore, the beneficial effects of migrating and replicating data in the caches are
more pronounced in NUMA systems, which makes their efficient usage even more important.
The term node can also be used in the context of NUMA systems [HP06, Section 4.1]. In that case, a
NUMA node refers to a unit (e.g. a processor) that contains one or more processor cores and a memory
controller. Thus, each NUMA node itself is a uniform memory access (UMA) architecture, i.e., the
cores within a node can access the local memory with identical performance. Only accesses to memory
in other NUMA nodes show the different performance levels of the NUMA architecture.
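The impact of data placement in a NUMA system can be estimated with a short calculation. The following sketch assumes that accesses are spread evenly over all NUMA nodes and that there is only a single remote latency level, which real systems do not necessarily have. The function names are illustrative and the latencies are placeholders in arbitrary units, not measurements.

```python
def remote_fraction_interleaved(n_nodes):
    """Share of remote accesses if data is spread evenly over all NUMA nodes."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be positive")
    return (n_nodes - 1) / n_nodes


def average_latency(local, remote, remote_fraction):
    """Weighted average of local and remote access latency."""
    return (1.0 - remote_fraction) * local + remote_fraction * remote


# Four NUMA nodes: three out of four accesses target a remote node.
share = remote_fraction_interleaved(4)

# Placeholder latencies in arbitrary units (remote assumed 1.5x local).
avg = average_latency(local=1.0, remote=1.5, remote_fraction=share)
print(f"remote share: {share:.2f}, average latency: {avg:.3f}")
```

The more NUMA nodes a system has, the larger the share of remote accesses becomes if the data placement is not controlled.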
Hardware accelerators are used in many contemporary HPC systems, e.g., in four (Tianhe-2, Titan, Piz
Daint, and Stampede) of the top ten systems of the November 2015 TOP500 list [Top15]. They typically
consist of many rather simple cores. For instance, the Nvidia Tesla K20X contains 2688 cores [Nvi12].
Applications need to be able to utilize the immense extent of parallelism in order to benefit from accelera-
tors. Furthermore, sequential program phases have a more severe performance impact than on traditional
processors. Thus, it depends on the characteristics of the application whether accelerators or general purpose
processors are more appropriate. Accelerators commonly possess a certain amount of integrated mem-
ory and are connected to the host system via PCI Express (PCIe). Examples for this approach are Intel’s
Xeon Phi coprocessors [Int15a], Nvidia’s Tesla K20X [Nvi12], and AMD’s FirePro S9000 [Amd14a].
In that case, the physical address spaces (see Section 2.4.2) of the host processors and the accelerator are
disjoint. Therefore, the required input data needs to be copied from the host to the accelerator memory
prior to the computation and the results have to be copied back. These data transfers can be avoided
if general purpose and accelerator cores are integrated in a heterogeneous multi-core processor with a
shared memory controller, as available, e.g., from AMD [Amd14b] and Nvidia [Nvi14].
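The effect of sequential program phases on accelerators can be illustrated with an Amdahl-style runtime model. This is a sketch under stated assumptions: the relative speed of 0.1 for a simple accelerator core and the 8-core general purpose processor are made-up values, and the data transfers between host and accelerator memory are ignored. The function name modeled_runtime is illustrative.

```python
def modeled_runtime(parallel_share, n_cores, core_speed):
    """Runtime relative to one core of speed 1.0 (Amdahl-style model).

    The sequential share runs on a single core, the parallel share is
    distributed perfectly over all cores.
    """
    serial_share = 1.0 - parallel_share
    return serial_share / core_speed + parallel_share / (n_cores * core_speed)


for parallel_share in (0.99, 0.90):
    # Hypothetical general purpose processor: 8 cores, relative speed 1.0.
    cpu = modeled_runtime(parallel_share, n_cores=8, core_speed=1.0)
    # Accelerator with 2688 simple cores; the relative speed 0.1 is assumed.
    acc = modeled_runtime(parallel_share, n_cores=2688, core_speed=0.1)
    faster = "accelerator" if acc < cpu else "general purpose processor"
    print(f"parallel share {parallel_share:.2f}: {faster} is faster")
```

With these placeholder values the accelerator only pays off for the application with the very small sequential share.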
2.2.2 Large Scale Systems
Large scale systems are required for enterprise applications and high performance computing. Such
systems consist of multiple nodes which are connected with a network. They can be implemented as
distributed shared memory systems with a single large address space or as distributed memory systems
with disjoint address spaces per node [DAS12, Chapter 5]. Distributed shared memory systems are
controlled by a single operating system as depicted in Figure 2.8a and provide multiple TiB of directly
addressable memory for large in-memory applications [Sgi12a]. Remote memory accesses are handled
transparently by the hardware. In distributed memory systems each node runs its own operating system
instance as depicted in Figure 2.8b. In such systems data is exchanged explicitly over the network.
[Figure 2.8, panel (a) distributed shared memory: a single instance of operating system (OS) and applications runs on thousands of cores with multiple TiB of coherent shared memory, connected by a coherent interconnect (e.g., SGI NUMAlink).]
[Figure 2.8, panel (b) distributed memory: four nodes, each with multiple cores, several GiB of memory, and its own OS and applications, connected by a non-coherent interconnect (e.g., Infiniband, Ethernet).]
Figure 2.8: Software view of large scale systems, based on [Sgi12a, Figure 2]: Distributed shared
memory systems (left) provide a single system view that hides the complex structure. In distributed
memory systems (right) the individual nodes (see Section 2.2.1) are visible to the software.
2.2.2.1 Distributed Shared Memory Systems
Large scale shared memory systems are provided by multiple companies, e.g., the SGI Altix UV
line [Sgi12a], IBM’s POWER 795 system [Che+13], and the Oracle SuperCluster M6-32 [Ora13]. They
provide a single large address space. Thus, they can be managed by a single operating system in-
stance and applications can directly access all the memory in the system by performing load or store
instructions [Sgi12a]. This enables large in-memory computations using the shared memory programming
model (see Section 2.5.1). The processors are spread over multiple printed circuit boards (PCBs),
which are coupled via an interconnection network [HP06, Figure 4.19]. The network is directly con-
nected to a coherent processor interface and extends the cache coherence protocol to the whole sys-
tem [Sgi12b]; [Che+13, Section 2.2.1]; [Ora13, Appendix A]. However, the NUMA characteristics are
much more pronounced than in the commodity multi-socket servers discussed in Section 2.2.1. In large
systems remote memory accesses can have a latency of 1 µs [Sgi12a, Table 1], which is an order of mag-
nitude higher than local memory accesses [Old14, Section 4.2.1]. The size of distributed shared memory
systems is limited to a few racks. E.g., the largest cache coherent configuration of the SGI Altix UV
2000 consists of four racks with 256 processors [Sgi12b]. One limitation is the scalability of the cache
coherence mechanism [DAS12, Section 5.5], which is described in Section 2.3.3. Furthermore, the total
memory capacity is limited by the width of the physical addresses that the processors support. E.g., the
46 bit physical addresses of the Sandy Bridge-EP processors (reported by CPUID for EAX=80000008H
in EAX:07-00 [Int14b, Table 3-17]) limit the memory capacity of the Altix UV2000 to 64 TiB [Sgi12b].
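The relation between the physical address width and the memory capacity limit is a simple power of two. The following sketch retraces the 64 TiB limit from the 46 bit physical addresses mentioned above; the function name is illustrative.

```python
def max_addressable_tib(physical_address_bits):
    """Size of the physical address space in TiB (1 TiB = 2**40 bytes)."""
    return (1 << physical_address_bits) / (1 << 40)


# 46 bit physical addresses as stated for the Sandy Bridge-EP processors.
print(max_addressable_tib(46), "TiB")
```

Each additional address bit doubles this limit.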
2.2.2.2 Distributed Memory Systems
Distributed memory systems are offered by many vendors, e.g., Bull [Bul13], Cray [Cra10], and Megware
[Meg14]. They consist of multiple shared memory nodes that are connected via a network. Each
node has its own address space and is managed by an independent operating system instance. The basic
architecture of large scale systems is depicted in Figure 2.9. This architecture enables very large systems
with millions of cores, e.g., the number one system in the November 2015 TOP500 list [Top15], which
contains 3,120,000 cores in 16,000 nodes [Lia+14] (including accelerators). There are three distinct
levels of parallelism. The processors have multiple cores, each node contains multiple processors, and
multiple nodes are connected via network. Likewise, there are three categories of accessibility of data.
The cores of a processor typically share a memory controller (see Section 2.1.3), thus they have access to
the local memory with uniform memory access (UMA) behavior. As memory within the nodes is usually
distributed among the processors [CH07; Zia+10], accesses to remote memory in the node are affected
by the non-uniform memory access (NUMA) characteristics. Data in other nodes is not directly accessi-
ble by the processors via load or store instructions. Instead of performing memory accesses, data has to
be exchanged via network messages [DAS12, Section 5.3]. The transfers involve function calls and pass
through a software stack with multiple layers [Iso94]. Furthermore, the data needs to be addressed with
the unique id of the node and the node-internal memory address. Thus, applications need to utilize a pro-
gramming model that supports multiple address spaces, e.g., the message passing or partitioned global
address space programming model (see Section 2.5.3 and Section 2.5.2). Furthermore, the applications
have to partition the data and distribute it among the nodes.
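The addressing with a node id and a node-internal address can be sketched for a one-dimensional array. The block distribution and the function name block_owner are hypothetical choices for illustration; an actual application would typically rely on a message passing or PGAS library for the data exchange itself.

```python
def block_owner(global_index, n_elements, n_nodes):
    """Map a global array index to (node_id, local_index).

    Assumes a block distribution: every node holds one contiguous block of
    ceil(n_elements / n_nodes) elements (the last block may be shorter).
    """
    if not 0 <= global_index < n_elements:
        raise IndexError("global_index out of range")
    block_size = -(-n_elements // n_nodes)  # ceiling division
    return global_index // block_size, global_index % block_size


# Ten elements distributed over four nodes (blocks of three elements).
for index in (0, 5, 9):
    node_id, local_index = block_owner(index, n_elements=10, n_nodes=4)
    print(f"element {index} -> node {node_id}, local index {local_index}")
```

Every access to an element that is owned by another node has to be turned into an explicit data transfer.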
[Figure 2.9: several shared memory nodes, each containing multiple processors (with multiple CPUs and local memory each) and a NIC; intra-node communication takes place within a node, inter-node communication between the nodes.]
Figure 2.9: Architecture of distributed memory systems, based on [DAS12, Figure 5.23]: Multiple
shared memory nodes are connected via network to form a larger system.
Figure 2.10: Projected hardware development (figure taken from [DOE10]): Systems are expected to
grow in all dimensions. With several thousand cores, the node concurrency will presumably reach the
magnitude of contemporary large scale distributed shared memory systems.
2.2.2.3 Expected Future Development
Figure 2.10 shows the future development of high performance computers as expected by the U.S. De-
partment of Energy. A significant increase in node concurrency is predicted for the near future. The
International Exascale Software Project roadmap predicts a similar development [Don+11, Section 3.1].
Tianhe-2 [Lia+14]—the number one system in the November 2015 TOP500 list [Top15]—features 195
cores with 732 logical CPUs per node (24 2-way SMT host processor cores and 171 4-way SMT Xeon
Phi cores), which shows that the prediction for 2015 was correct. Thus, it can be assumed that the predic-
tions for the near future are appropriate as well. That means that the concurrency level and complexity
of today’s large scale distributed shared memory systems, e.g., [Sgi12b], will soon be found in standard
compute nodes. Future many-core designs with multiple memory controllers will likely show consider-
able NUMA characteristics [LB10; Kim+15]. Processors with multiple NUMA domains already exist.
For example, the 12-core Opteron 6100 series of processors consists of two 6-core dies in a multi-chip-
module [Con+10]. Xeon E5 v3 processors that have more than eight cores support the cluster-on-die
mode, which splits the chip in two NUMA nodes [Kar14]. Furthermore, directory based coherence pro-
tocols (see Section 2.3.3) will presumably become inevitable. Directory assisted coherence protocols
have already been introduced in contemporary multi-socket servers (see Section 2.3.2.2 and 2.3.2.3).
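The node concurrency figures quoted for Tianhe-2 can be retraced with a few lines of arithmetic. The sketch below only uses the numbers given in the text (Section 2.2.2.2 and above); the variable names are illustrative.

```python
# Per-node composition of Tianhe-2 as stated in the text.
host_cores, host_smt = 24, 2   # 2-way SMT host processor cores
phi_cores, phi_smt = 171, 4    # 4-way SMT Xeon Phi cores

cores_per_node = host_cores + phi_cores
logical_cpus_per_node = host_cores * host_smt + phi_cores * phi_smt

# System totals from Section 2.2.2.2: 3,120,000 cores in 16,000 nodes.
total_cores, nodes = 3_120_000, 16_000

print(cores_per_node, logical_cpus_per_node, total_cores // nodes)
```

The per-node core count derived from the system totals agrees with the sum of host and coprocessor cores.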
2.3 Cache Coherence
Shared memory systems consist of multiple processors, which can access all the available memory (see
Section 2.2.1 and 2.2.2). Caches (see Section 2.1.2) are used to bridge the processor-DRAM performance
gap (see Figure 2.6). Therefore, multiple copies of the same memory address can coexist in caches of
different cores as well as in different cache levels. Cache coherence ensures that modifications made by
any core become visible to all cores, as would be the case if the change were made directly in the shared
main memory. This is a prevalent feature in general purpose processors [BDM09].
Caches are coherent if all memory requests return the latest value written to the accessed memory location
[HP06, Section 4.2]. This requires that changes made by any core eventually become visible to all
cores. Furthermore, changes made to the same memory location by different cores have to be observed
by all other cores in the same order. Cache coherence protocols are used to maintain the coherence of
cache lines. They can be distinguished on the basis of the behavior of writes to cache lines that have
multiple copies. Write-update protocols update all existing copies in the system while write-invalidate
protocols invalidate all other copies before the change is made. Write-invalidate is used in the majority
of systems, because of the high bandwidth demand of the write-update approach [HP06, Section 4.2].
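The difference in bandwidth demand between the two approaches can be illustrated by counting transactions for a simple access trace. This is a strongly simplified sketch: it tracks a single cache line, counts every miss and every update or invalidation as one transaction, and counts a write miss as a fetch plus a separate invalidation. The function name and the trace format are assumptions made for this example.

```python
def coherence_transactions(trace, policy):
    """Count externally visible transactions for one cache line.

    trace:  list of (operation, core) tuples with operation "r" or "w"
    policy: "write-update" or "write-invalidate"
    """
    if policy not in ("write-update", "write-invalidate"):
        raise ValueError(f"unknown policy: {policy}")
    holders = set()  # cores that currently hold a valid copy
    transactions = 0
    for operation, core in trace:
        if core not in holders:
            transactions += 1  # miss: the line has to be fetched
            holders.add(core)
        if operation == "w" and len(holders) > 1:
            transactions += 1  # update or invalidate the other copies
            if policy == "write-invalidate":
                holders = {core}
    return transactions


# Two cores read the line, then core 0 writes it ten times in a row.
trace = [("r", 0), ("r", 1)] + [("w", 0)] * 10
updates = coherence_transactions(trace, "write-update")
invalidations = coherence_transactions(trace, "write-invalidate")
print(updates, invalidations)
```

For repeated writes by the same core, write-invalidate only pays for the first write, whereas write-update generates traffic for every write as long as other copies exist.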
Coherence protocols require a mechanism to perceive accesses by other processors. This can be im-
plemented with snooping or using directories [Bae10, Section 7.2]. Section 2.3.1 discusses snooping-
based coherence protocols, which observe all memory accesses via a shared bus or a broadcast network.
Snooping-based protocols can generate a considerable amount of additional traffic [HP06, Section 4.4],
which can be reduced using snoop filters as described in Section 2.3.2. Directory-based coherence
protocols, which eliminate broadcast messages entirely, are detailed in Section 2.3.3.
2.3.1 Snooping-based Cache Coherence Protocols
Snooping-based coherence protocols [AB86]; [HP06, Section 4.2]; [Pfi98, Section 6.4.3]; [Bae10, Sec-
tion 7.2.1] were originally developed for systems with a shared memory bus (see Figure 2.7a). In such
systems each processor listens on the shared medium, which is typically called snooping. Therefore,
memory accesses of any processor are visible to all processors. However, snooping based coherence
protocols can also be used in multi-processor systems with point-to-point connections (see Figure 2.7b).
In that case, snoop requests [Int09a] (also called probe requests [Con+10]) are broadcast to all proces-
sors and the snoop responses are sent back to the requester. Such protocols are called snoop broadcast
protocols and are used e.g., in AMD systems with Direct Connect Architecture [Kel+03; CH07] as well
as Intel systems that use the QuickPath Interconnect [Int09a; Zia+10]. Maintaining cache coherence in
NUMA systems involves multiple NUMA nodes [Cen10; Int09a; CH07; Con+10]:
Source Node: The source node (or request node) is the NUMA node that contains the core that initiates a memory request.
Home Node: The home node is the NUMA node that contains the memory controller that is responsible for the requested memory location.
Peer Node: Other NUMA nodes that might contain copies of the requested memory location in their caches are called peer nodes.
There are basically two methods to perform a snoop broadcast: source snoop and home snoop [Int09a].
In systems that use the source snoop approach, the node that contains the requesting core broadcasts the
request and all nodes send a reply. This is also called a two-hop coherence protocol and is the fastest
implementation of snooping in systems with point-to-point connections. The reply can be an empty
message (snoop response) or include the requested data (data response). At least one data response is
sent from the home node, which delivers the data from main memory. If the home snoop method is used,
the request is forwarded to the home node first, which then broadcasts the request. This is also called
a three-hop coherence protocol. This approach adds latency. However, the home node can implement
a snoop filter for its memory region in order to avoid the broadcast and reduce the number of snoop
responses. This reduces the coherence traffic on the interconnection network (see Section 2.3.2).
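The two-hop and three-hop message flows can be written down as lists of messages per hop. The following sketch is a simplification: it assumes that all responses are sent directly to the requester, that the home node differs from the source node in home snoop mode, and it does not model snoop filters. The function name and the message labels are illustrative.

```python
def snoop_hops(mode, source, home, nodes):
    """Return the messages of each hop as lists of (sender, receiver, kind)."""
    others = [n for n in nodes if n != source]
    if mode == "source":
        # Hop 1: the source node broadcasts, hop 2: all other nodes reply.
        return [
            [(source, n, "snoop request") for n in others],
            [(n, source, "response") for n in others],
        ]
    if mode == "home":
        if source == home:
            raise ValueError("this sketch assumes a remote home node")
        peers = [n for n in others if n != home]
        # Hop 1: request to the home node, hop 2: broadcast by the home node,
        # hop 3: responses from the peer nodes and the home node.
        return [
            [(source, home, "request")],
            [(home, n, "snoop request") for n in peers],
            [(n, source, "response") for n in peers + [home]],
        ]
    raise ValueError(f"unknown mode: {mode}")


nodes = [0, 1, 2, 3]
source_hops = snoop_hops("source", source=0, home=2, nodes=nodes)
home_hops = snoop_hops("home", source=0, home=2, nodes=nodes)
print(len(source_hops), "hops vs.", len(home_hops), "hops")
```

The additional hop is the price for giving the home node the chance to filter the broadcast.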
Coherence protocols assign a state to each cache line—stored in the state field (see Figure 2.4)—that
changes according to a state change diagram if the memory location is accessed [DAS12, Section 7.3.2].
Transitions that are caused by accesses of the processor or core the cache belongs to are denoted as
follows (based on [Bae10, Section 7.2]; [Hay02, Figure 7.65]):
Read Hit: read access to data that is already present in the local cache
Write Hit: write access to data that is already present in the local cache
Read Miss: read access to data that is not yet present in the local cache
Write Miss: write access to data that is not yet present in the local cache
Transitions can also be initiated by memory accesses that are observed on the bus (bus based systems)
or in response to incoming snoop requests (systems with point-to-point connections). This is denoted as
follows (in the style of [Hay02, Figure 7.65]):
Snoop Read Hit: observed read access by another processor that hits in the local cache
Snoop Write Hit: observed write access by another processor that hits in the local cache
Furthermore, some transitions depend on the presence and coherence state of the accessed data in caches
of other processors. In that case the attempt to access the memory is observed by the other processors,
which notify the requester about their copies and supply the most recent version of the data if necessary.
The responses are denoted as follows (using AMD’s terminology [KM01; Con10]):
Probe Hit [state]: data is present in one or more remote caches [in a certain state]
Probe Miss: data is not present in any remote cache
2.3.1.1 MESI Protocol
The MESI protocol [PP84]; [Bae10, Section 7.2.1] uses four states: Modified (M), Exclusive (E),
Shared (S) and Invalid (I). Each cache line is in one of these states. The states Exclusive and Mod-
ified guarantee that there are no further copies. Cache lines are in the state Shared if multiple cores
potentially (see footnote 5) contain a valid copy of the corresponding memory address. Invalid cache lines do not contain
useful data. They can be used to accommodate new data without replacing existing cache lines. State
transitions can be triggered by memory requests as well as snoop events as depicted in Figure 2.11.
Table 2.3 summarizes the properties of the different states. All except Invalid cache lines can be read
by the processor the cache belongs to without any externally visible access [DAS12, Section 5.4.3].
Exclusive and Modified cache lines can also be written directly. However, Shared cache lines cannot
be modified without sending a notification to all other processors, since all other copies have to be
invalidated prior to changing the data. Exclusive and Shared cache lines have a valid copy in main
memory. They are consistent with main memory, are marked as “clean”, and do not require a write back
to memory if they are evicted (silent eviction). In contrast, the copies of Modified cache lines in main
memory are out-dated. They are marked as “dirty” and are therefore written back if they get evicted.
Modified cache lines also have to be forwarded to other cores if they request the cache line since it is
the only valid copy [DAS12, Section 5.4.3]. Exclusive lines can be forwarded as well. However, in that
case forwarding is optional since the copy in memory is valid, too. Shared cache lines are typically not
forwarded between caches even though the data is valid [HG05]; [DAS12, Section 5.4.3].
The MESI protocol has two major disadvantages:
• Modified cache lines have to be written back to main memory if they are read by other cores as the
resulting Shared copies require an up-to-date copy in main memory.
• Forwarding Shared cache lines is generally not implemented as avoiding or dealing with multiple
responses would be too complex, especially in point-to-point connected systems [HG05].
Therefore, contemporary multi-socket systems use more sophisticated coherence protocols that are ex-
plained in the following sections.
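The MESI state transitions described in this section can be expressed as a small state machine. The sketch below tracks the state of a single memory location in the caches of several cores. It is a simplified model (no evictions, no data values, no distinction between bus and point-to-point systems), and the class and attribute names are illustrative.

```python
class MesiLine:
    """MESI state of one memory location in the caches of several cores."""

    def __init__(self, n_caches):
        self.states = ["I"] * n_caches
        # Write backs caused by snoop read hits on Modified copies.
        self.write_backs = 0

    def read(self, cache):
        if self.states[cache] != "I":
            return  # read hit: the state does not change
        probe_hit = False
        for other, state in enumerate(self.states):
            if other == cache or state == "I":
                continue
            probe_hit = True
            if state == "M":
                self.write_backs += 1  # Shared copies need up-to-date memory
            self.states[other] = "S"   # snoop read hit
        # Read miss: Shared after a probe hit, Exclusive after a probe miss.
        self.states[cache] = "S" if probe_hit else "E"

    def write(self, cache):
        # Write hits on M or E need no externally visible request. In all
        # other cases the remaining copies are invalidated (snoop write hit).
        if self.states[cache] not in ("M", "E"):
            for other in range(len(self.states)):
                if other != cache:
                    self.states[other] = "I"
        self.states[cache] = "M"


line = MesiLine(n_caches=2)
line.read(0)    # core 0: read miss, probe miss
line.read(1)    # core 1: read miss, probe hit
line.write(1)   # core 1: write hit on a Shared line
line.read(0)    # core 0: read miss, probe hit on a Modified line
print(line.states, line.write_backs)
```

The last step shows the first disadvantage listed above: reading a Modified line from another core forces a write back to main memory.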
[Figure 2.11: state change diagram with the states M, E, S, and I. The transitions are labeled Read Hit, Write Hit, Write Miss, “Read Miss, Probe Hit”, “Read Miss, Probe Miss”, Snoop Read Hit, and Snoop Write Hit.]
Figure 2.11: State change diagram of the MESI coherence protocol, based on [Hay02, Figure 7.65] (derived
from [Mol08, Figure 3.15]): All write hits and write misses change the state to Modified. Writes invalidate
all existing copies in other cores (snoop write hit). Read misses insert data in state Exclusive or Shared
depending on the probe responses. Read hits do not change the state.
5 Shared cache lines do not change back to Exclusive state if all other copies are evicted, thus a cache line can be in the state
Shared even if only a single copy exists [PP84].
Table 2.3: States of the MESI protocol [PP84]; [AB86, Section 2.4]; [Bae10, Section 7.2.1]: The column
“readable” indicates if read accesses can be serviced by the cache without sending a request to the
bus or broadcast network. The column “writable” indicates if the cache line can also be modified
without externally visible request. The column “clean” indicates if the cache line is consistent with
main memory. The column “eviction” indicates if a cache line is written back to memory when it is
evicted. The column “may forward” indicates if the protocol allows the cache to respond to requests
of other processors by sending the data.
State      readable  writable  clean  eviction    may forward
Modified   yes       yes       no     write back  yes
Exclusive  yes       yes       yes    silent      yes
Shared     yes       no        yes    silent      no (see footnote 6)
Invalid    no        no        -      -           -
2.3.1.2 MESIF Protocol
The MESIF protocol [Int09a; Zia+10; HG05] is used by contemporary Intel processors. It inherits the
states Modified (M), Exclusive (E), Shared (S), and Invalid (I) from the MESI protocol and extends it
with the Forward (F) state. Figure 2.12 depicts the state transitions of the MESIF protocol. Table 2.4
summarizes the properties of the different states. The Forward state is adapted from the Shared state.
In contrast to Shared cache lines, the Forward copy forwards data upon request. The protocol ensures
that at most one copy of a cache line can be in the Forward state. If another core requests the data the
Forward copy switches to state Shared and the line is inserted in the state Forward in the requester’s
cache. This enables cache-to-cache transfers of shared cache lines, but avoids the complexity of having
to deal with multiple data responses. However, it is not ensured that the data is provided by the closest
copy within the NUMA topology. MESIF supports source snooping as well as home snooping [Int09a].
In source snoop mode the source node broadcasts a snoop request to all nodes in case of an LLC miss.
Cache lines in state Modified, Exclusive, or Forward are forwarded directly to the requester. All nodes
also send their snoop response to the home node which provides data from memory if no copy was
forwarded. In home snoop mode snoop requests are first sent to the home node, which then broadcasts
them. This increases the memory latency of accesses to data that is cached in other nodes. However, the
coherence protocol’s bandwidth demand can be reduced using snoop filtering mechanisms, which are
described in Section 2.3.2.2.
[Figure 2.12: state change diagram with the states M, E, S, F, and I. The transitions are labeled Read Hit, Write Hit, Write Miss, “Read Miss, Probe Hit”, “Read Miss, Probe Miss”, Snoop Read Hit, and Snoop Write Hit.]
Figure 2.12: State change diagram of the MESIF coherence protocol based on Figure 2.11, extended with
information provided in [Int09a; HG05]: The outgoing transitions of the M, E, and S state are identical to
the MESI protocol. The I state turns into the additional F state in case of a read miss if the data is also
present in another cache. The F state behaves like the S state for read hits, write hits, and snoop write hits.
The only difference is the snoop read hit transition, which changes the state to S while it does not change
the state if the line already is in the S state.
6 Forwarding Shared cache lines is possible [PP84]; [Bae10, Section 7.2.1]. However, it is generally not implemented [HG05].
Instead, the data is typically read again from memory, which guarantees a single response no matter how many copies exist.
Table 2.4: States of the MESIF protocol [Int09a, Table 5]: The states Modified, Exclusive, Shared and
Invalid have the same properties as in the MESI protocol. The Forward state is derived from the
Shared state. It enables forwarding of shared clean data between caches.
State      readable  writable  clean  eviction    may forward
Modified   yes       yes       no     write back  yes
Exclusive  yes       yes       yes    silent      yes
Shared     yes       no        yes    silent      no
Forward    yes       no        yes    silent      yes
Invalid    no        no        -      -           -
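The handling of read misses in the MESIF protocol, in particular the migration of the Forward state to the most recent requester, can be sketched as follows. The function is a simplified model for a single cache line: it ignores write backs, evictions, and the NUMA topology, and the function name and the returned data source labels are illustrative.

```python
def mesif_read_miss(states, requester):
    """Return (new_states, data_source) for a read miss of `requester`.

    states: list with one MESIF state ("M", "E", "S", "I", "F") per cache
    """
    if states[requester] != "I":
        raise ValueError("a read miss requires an Invalid line in the requester")
    new_states = list(states)
    data_source = "memory"
    probe_hit = False
    for cache, state in enumerate(states):
        if cache == requester or state == "I":
            continue
        probe_hit = True
        if state in ("M", "E", "F"):
            data_source = f"cache {cache}"  # these states forward the data
        new_states[cache] = "S"             # snoop read hit
    # Probe hit: the requester becomes the new forwarder, otherwise Exclusive.
    new_states[requester] = "F" if probe_hit else "E"
    return new_states, data_source


states = ["E", "I", "I"]
states, source_1 = mesif_read_miss(states, requester=1)
states, source_2 = mesif_read_miss(states, requester=2)
print(states, source_1, source_2)
```

After the two read misses the Forward copy resides in the cache of the last requester, while all older copies are in the Shared state.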
2.3.1.3 MOESI Protocol
The MOESI protocol [Amd15a, Section 7.3] is used by contemporary AMD processors. It extends the
MESI protocol by implementing the Owned (O) state, which enables sharing of dirty cache lines without
writing them back to memory. The conventional MOESI protocol is depicted in Figure 2.13. Figure 2.14
shows an extended version [Lep+12]. The enhanced MOESI protocol introduces the ModifiedUnWrit-
ten (MuW) state [Lep+12]. Table 2.5 summarizes the properties of the different states.
Cache lines in state Owned are not consistent with main memory, thus need to be written back if they
are evicted. Owned differs from the Modified state by allowing further Shared copies in other caches.
The Owned copy remains in the core that originally modified the cache line and forwards data to sub-
sequent read requests. However, writes by other cores have to invalidate all copies. In contrast to the
MESIF protocol, shared clean data (i.e., all copies in state Shared) is not forwarded between caches in
the MOESI protocol. MOESI is a home snooping protocol, i.e., requests that miss in the local cache
hierarchy are forwarded to the home node, which broadcasts them to the other NUMA nodes. A snoop
filtering mechanism—called “HT Assist” [Con+10]—can be used to filter unnecessary requests as de-
scribed in Section 2.3.2.3.
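The role of the Owned state can be sketched with two small functions for a single cache line. This is a simplified model of the conventional MOESI protocol: optional forwarding by an Exclusive copy is not modelled, and the function names and data source labels are illustrative.

```python
def moesi_read_miss(states, requester):
    """Return (new_states, data_source) for a read miss of `requester`."""
    if states[requester] != "I":
        raise ValueError("a read miss requires an Invalid line in the requester")
    new_states = list(states)
    data_source = "memory"
    probe_hit = False
    for cache, state in enumerate(states):
        if cache == requester or state == "I":
            continue
        probe_hit = True
        if state in ("M", "O"):
            new_states[cache] = "O"         # dirty data is shared, no write back
            data_source = f"cache {cache}"  # the owner forwards the data
        elif state == "E":
            new_states[cache] = "S"         # forwarding by E is not modelled
    new_states[requester] = "S" if probe_hit else "E"
    return new_states, data_source


def moesi_evict(states, cache):
    """Return (new_states, write_back_required) for evicting a line."""
    write_back_required = states[cache] in ("M", "O")
    new_states = list(states)
    new_states[cache] = "I"
    return new_states, write_back_required


states, source = moesi_read_miss(["M", "I"], requester=1)
after_eviction, write_back = moesi_evict(states, cache=0)
print(states, source, after_eviction, write_back)
```

In contrast to MESI, the read of the Modified line does not cause a write back; the write back is deferred until the Owned copy is evicted.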
[Figure 2.13: state change diagram with the states M, O, E, S, and I. The transitions are labeled Read Hit, Write Hit, Write Miss, “Read Miss, Probe Hit”, “Read Miss, Probe Miss”, Snoop Read Hit, and Snoop Write Hit.]
Figure 2.13: State change diagram of the MOESI coherence protocol, based on [Amd15a, Figure 7-2] (derived
from [Mol08, Figure 3.11]): The outgoing transitions of the E, S, and I state are identical to the MESI
protocol. The M state switches to the additional O state in case of a snoop read hit. The O state behaves
like the M state for read hits and snoop write hits. In case of snoop read hits the state does not change if
the line already is in state O. Write hits cause the cache line to transition back to the state M.
[Figure 2.14: state change diagram with the states M, MuW, O, E, S, and I. The transitions are labeled Read Hit, Write Hit, Write Miss, “Read Miss, Probe Miss”, “Read Miss, Probe Hit M”, “Read Miss, Probe Hit MuW/O/E”, “Read Miss, Probe Hit S”, Snoop Read Hit, and Snoop Write Hit.]
Figure 2.14: State change diagram of the extended MOESI protocol: This diagram is derived from [MHS14,
Figure 3]. It is based on Figure 2.13 with adjustments according to the information provided in [Lep+12].
The additional MuW state is reached when a Modified cache line is read by another core. Instead of changing
its state to Owned and inserting a Shared copy in the requesting core (as would be the case in the
traditional MOESI protocol), the original copy is invalidated and the requester gets the cache line in the
MuW state, which allows modification of the cache line without further action. If this dirty MuW copy is read
again by another processor, it is forwarded, marked Shared, and inserted into the second requester’s cache in
the Owned state.
The ModifiedUnWritten (MuW) state in the extended MOESI protocol enables faster transitions to
Modified if multiple processors perform read-modify-write operations [Lep+12]. Write accesses always
change the state to Modified. However, subsequent reads by other cores invalidate the Modified copy and
insert a MuW copy in the requesting core, which can turn into Modified without having to invalidate other
copies. Furthermore, the extended MOESI protocol (see Figure 2.14) always migrates the ownership of
forwarded cache lines to the last requester. Therefore, memory read requests that receive forwarded
data from other nodes (Probe Hit M/MuW/E/O) insert a MuW or Owned copy in the requesting core’s
cache and leave an Invalid or Shared copy in the cache of the previous owner. This simplifies the
implementation of the snoop filtering mechanism [Lep+12]. It also enables forwarding of shared clean
data, which is not supported in the conventional MOESI protocol. However, the transition from Exclusive
to Shared inserts an Owned copy in the second requester, which results in unnecessary write backs of
clean data. As is the case in the MESIF protocol, the data is not necessarily provided by the copy that is
closest to the requester within the NUMA topology.
Table 2.5: States of the MOESI protocol [Amd15a, Section 7.3]; [Lep+12]: Modified, Exclusive, and
Invalid state have identical properties as in the MESI protocol. Shared cache lines can be “dirty”.
However, in that case there also is an Owned copy which is responsible for the write back to memory.
Thus, Shared cache lines can be evicted silently even if they are not consistent with main memory.
The new ModifiedUnWritten state has the same permissions as the Modified state. However, the state
transitions are different (see Figure 2.14).
State               readable  writable  clean  eviction    may forward
Modified            yes       yes       no     write back  yes
Modified Unwritten  yes       yes       no     write back  yes
Owned               yes       no        no     write back  yes
Exclusive           yes       yes       yes    silent⁷     yes
Shared              yes       no        no     silent      no
Invalid             no        no        -      -           -
⁷ Notification of home node required if snoop filter is used (see Section 2.3.2.3)
2.3.2 Snoop Filtering
Snoop filtering reduces the traffic that is generated by the coherence protocol. If a central memory
controller exists, it can keep track of which processors requested a cache line [Int09a, Figure 4 and 5].
In point-to-point connected systems, distributed snoop filters can be implemented within the NUMA
nodes—each monitoring a portion of the installed memory [Con+10; Kot+12]. This is particularly im-
portant in systems with several processors and the corresponding costly broadcasts. This section focuses
on filter mechanisms that exist in contemporary x86 processors. However, comparable techniques are
also available in other architectures, e.g., IBM’s POWER6 micro-architecture [Le+07].
2.3.2.1 Intel Core Valid Bits
Since the Nehalem generation [Int14a, Section 2.4] the LLCs of Intel’s server processors implement core
valid bits [Sin08; SLC08] in order to reduce the number of snoop requests to the cores. One bit per core
indicates if a cache line can have copies in the higher level caches. If a bit is not set, the associated core
certainly does not hold a copy of the cache line. Therefore, it does not need to be snooped if another core
requests the line. If two or more core valid bits are set, the cache line is known to be shared. In that case
the cores cannot modify the line without acquiring exclusive ownership first. Thus, read requests can be
served directly by the L3 cache, which also contains a valid copy. However, unmodified cache lines may
be silently evicted from a core’s cache without clearing the corresponding core valid bit. Thus, a set core
valid bit does not guarantee that the cache line is still present in a higher level cache.
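The filtering decision for read requests can be illustrated with a bit mask. The sketch below is a simplified model under the assumption of an inclusive L3 cache that holds one bit per core; the function name `snoop_targets_for_read` is hypothetical.

```python
def snoop_targets_for_read(core_valid_bits, requester):
    """Return the cores that have to be snooped for a read request.

    core_valid_bits: integer bit mask, bit c is set if core c may hold a copy.
    requester: index of the core that issues the read.
    """
    if bin(core_valid_bits).count("1") >= 2:
        # Two or more bits set: the line is shared and cannot be modified
        # in a core, so the L3 copy is valid and no snoop is required.
        return []
    # At most one bit is set. Only another core with a set bit can hold
    # a (potentially modified) copy and needs to be snooped.
    others = core_valid_bits & ~(1 << requester)
    return [c for c in range(others.bit_length()) if (others >> c) & 1]
```

Since a set bit does not guarantee that the line is still present in the core, the snooped core may respond without data. In that case the L3 copy is used.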
2.3.2.2 Intel Directory Assisted Snoop Broadcast Protocol
Some Intel processors implement a directory [Kot+12; Kar14]; [Int12a, Chapter 2.4.1] that supplements
the MESIF protocol (see Section 2.3.1.2). The resulting “directory assisted snoop broadcast protocol”
stores one or two bits of directory information per cache line in the memory’s error-correcting code (ECC)
bits [Int15g, Section 2.1.2], which are used to encode three states:
remote-invalid  indicates that no copies exist in other NUMA nodes
snoop-all       indicates that a potentially Modified copy could exist in another node
shared          indicates the presence of multiple clean copies, requires two directory bits
Snoop requests to other nodes can be reduced using these states. The home node does not have to
forward requests concerning memory locations with the state remote-invalid to other nodes. Further-
more, reads can be serviced directly from the home node’s memory if the directory bits indicate state
shared [Gee+13]. Writes to locations that are marked shared or snoop-all as well as reads of data in
state snoop-all still require broadcasts. Since the ECC bits are read in the course of the memory access
anyway, the directory look-up does not require additional accesses. However, snoop requests that have
to be forwarded are delayed, which could reduce the performance in some cases. It is also possible to
combine the directory information with the MESIF’s source snoop variant—which is then called “early
snoop” [Kar14]. In that case the home node forwards data from memory without waiting for the snoop
responses of other nodes if the directory state allows it. This mode of operation optimizes latency but
does not reduce the generated snoop traffic.
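The effect of the three directory states on the snoop traffic can be summarized in a few lines. The following Python sketch only restates the rules given above for the home snoop variant; the function name `must_broadcast` is an assumption made for this illustration.

```python
def must_broadcast(directory_state, is_write):
    """Decide whether the home node has to snoop the other NUMA nodes."""
    if directory_state == "remote-invalid":
        return False      # no copies exist in other nodes
    if directory_state == "shared":
        return is_write   # reads are served from memory, writes invalidate
    if directory_state == "snoop-all":
        return True       # a Modified copy could exist in another node
    raise ValueError("unknown directory state: " + directory_state)
```

With only one directory bit the state shared cannot be encoded, so clean remote copies would have to be tracked as snoop-all in this model.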
The directory look-up can be accelerated with directory caches [Kar14]. These so-called “HitME”
caches [Mog+14] store presence vectors that indicate which NUMA nodes requested copies of a cache
line. If the directory cache contains an entry for a requested cache line, snoops are sent accordingly.
Entries in the directory cache are only allocated if cache lines are forwarded between NUMA nodes.
Thus, cache lines that are exclusively used by a single node do not occupy directory cache entries. If
a directory cache entry is allocated, the corresponding in-memory directory bits are set to the snoop-all
state. This conservative choice ensures that no further updates in the in-memory directory are needed.
However, it also creates the possibility that the in-memory directory misleadingly indicates Modified
copies.
2.3.2.3 AMD HT Assist
AMD’s Direct Connect Architecture [CH07] supports up to eight NUMA nodes. The MOESI protocol is
used to maintain cache coherence (see Section 2.3.1.3). As broadcast messages are extremely expensive
in a system with up to eight NUMA nodes, current AMD processors implement a snoop filter to reduce
the coherence traffic [Con+10; CL12]. This so-called HT Assist (also referred to as probe filter) uses a
portion of the L3 cache, which reduces the usable L3 size [Con+10, Figure 1].
MOESI is a home snoop protocol, thus snoop requests are first sent to the home node, which then
forwards them to the other nodes. If the HT Assist is enabled, each node tracks which nodes have copies
of cache lines from its local memory. The snoop filter entries contain a tag field, a state field, and an
owner field [Con+10, Figure 5b]. The tag is required to identify the correct entry for the requested
physical address. The owner field stores the node id of the node that holds a Modified, Exclusive, or
Owned copy, i.e., it points to the node that forwards the data upon request. The state field encodes the
coherence state of the corresponding cache line. According to [Con+10; CL12] the possible states are:
EM  there is an Exclusive or Modified copy in the owner node
O   owner has a copy in state Owned, additional Shared copies can exist in multiple nodes
S1  multiple Shared copies can exist in a single node
S   Shared copies can exist in multiple nodes
I   the line is not cached
Depending on the state and type of the request, snoop requests are filtered completely, forwarded only to
a single node, or sent to all nodes (see [Con+10, Figure 6]). It is ensured that requests that are forwarded
to a single node return the requested data. Therefore, the snoop filter entries are updated when cache
lines in the state Exclusive, Modified, or Owned are evicted. Furthermore, Exclusive cache lines change
their state to Owned instead of Shared when they are read by another core in order to retain a forwarding
copy [Con+10]. Capacity and associativity of the snoop filter are limited. If an entry has to be removed
in order to accommodate a new one, the corresponding cache line is removed from all caches.
Lepak et al. [Lep+12] describe an optimized version of the filtering mechanism, which is used in con-
junction with the extended MOESI protocol shown in Figure 2.14. In this version, different snoop filter
states are used that do not always correspond to the coherence state in the cache:
M   owner has an Exclusive, Modified, MuW, or Owned copy, in case of Owned additional
    Shared copies only exist in the owner node
O   a Modified, MuW, or Owned copy exists in the owner node, in case of Owned additional
    Shared copies can exist in multiple nodes
S   Shared copies can exist in multiple nodes
I   the line is not cached
2.3.3 Directory-based Cache Coherence Protocols
In snooping-based coherence protocols, locating the most recent copy of the required data after a last
level cache miss as well as invalidating all other copies in case of writes involves all processors. In multi-
processor systems with point-to-point connections, this results in frequent broadcast messages, which
limits the viable number of processors [HP06, Section 4.4]. The snoop filtering mechanisms described
in Section 2.3.2.2 and 2.3.2.3 reduce the number of broadcasts. However, they do not register every node
that receives a copy of a cache line. Thus, broadcasts can still be required. Therefore, directory based
coherence protocols, which further reduce the coherence traffic, are typically used in large scale shared
memory systems [HP06, Section 4.4]; [Bae10, Section 7.2.2]. Directory based coherence protocols keep
track of all copies of a cache line and forward requests only to the affected processors.
One example of a directory based coherence protocol is the DASH protocol [Pfi98, Section 7.2.1.2]. It
is used to interconnect multiple bus based multi-processor systems—called clusters—and form a large
scale coherent shared memory system. It uses a distributed directory where each cluster has a directory
for its portion of the shared memory. Coherence within the clusters is maintained by a snooping-based
protocol. Each cluster also contains a directory controller, which acts as a proxy for remote accesses.
The directory distinguishes the states uncached-remote, shared-remote, and dirty-remote [Len+90]. A
bit-vector is used to track remote copies and forward requests only to the affected clusters. However,
a full bit-vector is not suitable for large numbers of clusters as the aggregated size of the distributed
directories grows quadratically with the number of clusters [GWM92; Hei+99]. The size of the directory
can be restricted by storing only a limited number of cluster ids with a broadcast mechanism or a coarse
vector scheme as fallback for cache lines that have too many copies [GWM92].
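The quadratic growth can be made explicit with a short calculation. The sketch below assumes that every cluster contributes the same number of memory lines and that a directory entry consists only of presence information; state bits are ignored. Both function names are hypothetical.

```python
def full_bitvector_bits(clusters, lines_per_cluster):
    """Aggregated directory size with one presence bit per cluster and line."""
    total_lines = clusters * lines_per_cluster
    return total_lines * clusters


def limited_pointer_bits(clusters, lines_per_cluster, pointers):
    """Aggregated size if each entry stores a fixed number of cluster ids."""
    id_bits = max(1, (clusters - 1).bit_length())
    total_lines = clusters * lines_per_cluster
    return total_lines * pointers * id_bits
```

Doubling the number of clusters quadruples the size of the full bit-vector directory, whereas the limited pointer scheme grows only slightly faster than linearly because the width of a cluster id increases logarithmically.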
Another approach for a directory based coherence mechanism is the Scalable Coherent Interface [Pfi98,
Section 7.2.1.1]; [Aln+90], which uses distributed doubly linked sharer lists instead of a bit-vector or
another representation of the sharing information in the home node. The home node only stores the
pointer to the first node in the list and each node stores a forward and a backward pointer alongside each
cache line. This has the advantage that the required memory to store the directory information does not
grow significantly if the number of nodes increases. However, the list approach also has disadvantages.
If, e.g., all copies have to be invalidated, the message has to be forwarded from node to node. This
adds latency compared to the bit-vector approach which enables the home node to send messages to all
affected nodes immediately.
2.3.4 Overhead
Cache coherence protocols cause overhead. Write-invalidate protocols introduce a new sort of cache
misses in addition to the types described in Section 2.1.2. These coherence misses [HP06, Section 4.3]
arise if cache lines that have been invalidated by writes of another CPU are accessed again. They can
be subdivided into true sharing misses and false sharing misses. True sharing misses occur if the in-
volved accesses affect the same or overlapping memory addresses. This happens if processes or threads
(see Section 2.4.1) exchange data via the shared memory, i.e., the accesses correspond to data dependen-
cies in the application. In contrast, false sharing generates cache misses without actual sharing on the
application level. False sharing misses are caused by the cache line granularity that is used by the co-
herence protocols. Therefore, invalidations can occur even though the CPUs work with disjoint subsets
of a cache line. Another form of overhead is the waiting time that is introduced by the execution of the
coherence protocol. This direct protocol overhead [Hei+99] is caused by the processing time required
by the logic that implements the protocol as well as the delay caused by directory look-ups, snoop filter
accesses, and waiting for snoop responses. Furthermore, the coherence traffic, i.e., forwarding the re-
quests and responses between NUMA nodes, puts pressure on the interconnection network—especially
if broadcasts are involved.
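The distinction between true and false sharing can be expressed in terms of the accessed address ranges. The following sketch assumes a cache line size of 64 bytes and classifies two accesses by different CPUs, at least one of which is a write; the function name `classify_sharing` is chosen for this illustration.

```python
CACHE_LINE_SIZE = 64  # bytes, assumed line size


def classify_sharing(addr_a, size_a, addr_b, size_b, line=CACHE_LINE_SIZE):
    """Classify two accesses by different CPUs (at least one is a write)."""
    # True sharing: the accessed byte ranges overlap.
    if addr_a < addr_b + size_b and addr_b < addr_a + size_a:
        return "true sharing"
    # False sharing: disjoint bytes, but at least one common cache line.
    lines_a = set(range(addr_a // line, (addr_a + size_a - 1) // line + 1))
    lines_b = set(range(addr_b // line, (addr_b + size_b - 1) // line + 1))
    if lines_a & lines_b:
        return "false sharing"
    return "no sharing"
```

Two adjacent 8 byte counters at the addresses 0 and 8 therefore cause false sharing, which disappears if the second counter is moved to address 64.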
The extent of the performance loss due to the coherence overhead depends on the application
characteristics. In [HP06, Figure 4.13] it is shown that the impact of true sharing can be significant in a memory
intensive transaction-processing workload. Other studies [Kol+13; Ben+13] show that coherence misses
are negligible for the Reverse Time Migration application used in geophysics. However, even if coher-
ence misses are not an issue, the direct protocol overhead and coherence traffic remain. Snoop filtering
and directory protocols reduce the coherence traffic (lower bandwidth demand) but can increase the di-
rect protocol overhead (higher latency). In contrast, unfiltered snoop broadcast protocols minimize the
latency while generating a lot of traffic.
2.4 Operating Systems
Applications require certain resources to execute their tasks (e.g., processor cores, memory, disk
storage). The operating system (OS) allocates these resources. To this end, the OS provides abstractions of
the hardware resources—most notably processes, threads, and virtual memory—in order to establish a
generic environment for applications. These abstractions are also required to support multiple concur-
rently running applications.
2.4.1 Processes and Threads
Applications are implemented in the form of programs, i.e., files that contain sequences of machine
instructions. When a program is executed it becomes a process [SGG12, Chapter 3]. To this end, the
program is loaded into memory in order to be executed by one or more CPUs (see Section 2.1). Multiple
programs can be executed concurrently. However, each process has its own address space (AS), i.e., a
memory area with consecutive addresses. The required multitude of contiguous address spaces is implemented
using virtual memory as described in Section 2.4.2. An address space typically contains different
regions [SGG12, Section 3.1.1]. The text section contains the program code. Global data is stored in
the data segment. Temporary data—e.g., local variables and function parameters—is stored on the stack,
which grows when functions are called. Dynamically allocated memory is placed on the heap.
CPUs process programs sequentially. Even if the processor cores utilize out-of-order execution, the
externally visible state—i.e., the registers defined by the ISA—is typically updated in program order
(see Section 2.1.1.2). The OS’s representation of such a sequential stream of instruction processing
within a process’s address space is called a thread [SGG12, Chapter 4]. A process includes one or
more threads. Threads that belong to the same process share the address space. However, each thread
has its own stack for temporary data [SGG12, Section 4.1]. Multiple threads from the same or different
processes can be executed concurrently, either in turns on a single CPU or simultaneously on multiple
CPUs. The allocation of CPUs to threads is called scheduling [SGG12, Chapter 6]. If a thread is not
assigned to a CPU its state is stored in an OS data structure [SGG12, Section 3.2.3]. Pending threads can
resume execution on any CPU that becomes available.
2.4.2 Virtual Memory
In multi-programming operating systems the available physical memory needs to be partitioned and as-
signed to multiple concurrently running processes. Therefore, applications use virtual memory [SGG12,
Chapter 9]; [DAS12, Section 4.4] instead of directly addressing the physical memory. As depicted in Fig-
ure 2.15 each process has a separate virtual address space. The allocation of physical memory is governed
by the operating system.
Virtual memory can be implemented using paging [SGG12, Section 8.5], which uses translation tables.
This is supported by many general purpose processor architectures—e.g., the paging mechanism in x86
processors [Amd15a, Section 1.2.2]; [Int14b, Volume 3, Chapter 4] or ARM’s virtual memory system
architecture [Arm15, Section D4]. Paging divides the address spaces into multiple blocks of the same
size, which are called pages. The translation from virtual to physical addresses uses page tables that are
stored in memory. Figure 2.16 depicts the address translation for 4 KiB pages in 64 bit x86 processors.
Contemporary 64 bit x86 processors are limited to 48 bit virtual addresses⁸ [Amd15a, Section 5.1]. The
currently unused upper 16 bits have to be the sign extension of the 48 bit virtual address. The width of
the physical address is implementation dependent. Up to 52 bits are possible.
Figure 2.15: Basic principle of virtual memory, based on [SGG12, Figure 9.3 and 9.8] (derived
from [Mol08, Figure 2.5]): Processes have separate address spaces that isolate them from each other,
i.e., they cannot access each other’s private memory. This enables the coexistence of arbitrary
applications. Shared memory regions are also possible, which can for example be used for shared libraries.
⁸ Future extension is intended but not yet specified
Figure 2.16: Address translation in 64 bit x86 processors using 4 KiB pages, based on [Amd15a,
Figure 5.17] (derived from [Mol08, Figure 2.15]): the virtual address is split into page number and page
offset. The page number is split into four indices, which are used to select entries from the four levels
of page tables. The PML4 base address is defined per process. The last level of the page table
hierarchy contains the page frame. Page frame and page offset form the physical address.
[Diagram: the sign extended 64 bit virtual address contains a 48 bit virtual address, which consists of a
36 bit page number (9 bit PML4 offset, 9 bit PDP offset, 9 bit PD offset, 9 bit PT offset) and a 12 bit
page offset. The offsets select the PML4-Entry, PDP-Entry, PD-Entry, and PT-Entry in the Page-Map
Level-4 Table, Page Directory Pointer Table, Page Directory Table, and Page Table. The PT-Entry
provides an up to 40 bit page frame, which is combined with the 12 bit page offset to form the up to
52 bit physical address.]
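The decomposition of the virtual address shown in Figure 2.16 can be reproduced with shift and mask operations. The following Python sketch assumes 4 KiB pages and four levels of page tables; the function names are chosen for this illustration and the actual table look-ups are omitted.

```python
INDEX_BITS = 9         # width of each page table index
PAGE_OFFSET_BITS = 12  # 4 KiB pages


def split_virtual_address(va):
    """Split a 64 bit virtual address into four table indices and the offset."""
    if not 0 <= va < 1 << 64:
        raise ValueError("not a 64 bit address")
    # Bits 63..47 have to be identical (sign extension of the 48 bit address).
    if (va >> 47) not in (0, 0x1FFFF):
        raise ValueError("upper bits are not the sign extension of bit 47")
    mask = (1 << INDEX_BITS) - 1
    return {
        "pml4": (va >> 39) & mask,
        "pdp": (va >> 30) & mask,
        "pd": (va >> 21) & mask,
        "pt": (va >> 12) & mask,
        "offset": va & ((1 << PAGE_OFFSET_BITS) - 1),
    }


def physical_address(page_frame, offset):
    """Combine the page frame from the last table level with the page offset."""
    return (page_frame << PAGE_OFFSET_BITS) | offset
```

With 2 MiB pages the `pt` index would become part of the page offset, which then has a width of 21 bit.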
Due to the multi-level page table structure, four additional memory accesses are necessary to de-
termine the physical address. Translation Lookaside Buffers (TLBs) [Int14b, Volume 3, Sec-
tion 4.10.2]; [Arm15, Section D4.7] are used to reduce this tremendous overhead. TLBs are content-
addressable memories within the processor that store individual translations, i.e., they are addressed with
the requested page number and return the page frame if a matching entry is found. The page table
based translation is only used in case of TLB misses. As listed in Table 2.6, a single processor core
does not necessarily have enough 4 KiB entries to cover the whole last level cache, which can result
in costly page table walks even if the data is cached. The translation can be accelerated by additional
paging-structure caches [Int14b, Volume 3, Section 4.10.3]; [BT09, Section 2], which reduce the number
of required memory accesses. When the translation is complete, a corresponding TLB entry is created.
Thus, subsequent accesses to the same page are serviced by the TLB.
Another method to reduce the address translation overhead is to increase the page size. For instance,
contemporary x86 processors support 2 MiB pages—also called huge pages. Using huge pages reduces
the number of page table accesses to three, i.e., the “PD-Entry” in the “Page Directory” points to a 2 MiB
page instead of a page table and the offset for addressing within the page increases to 21 bit. Furthermore,
the TLBs support larger memory regions if huge pages are used as detailed in Table 2.6. Linux kernels
since version 2.6.38 support transparent huge pages (THP) [Arc11], i.e., large contiguous memory
regions are automatically allocated in 2 MiB pages. If THP is not enabled, 2 MiB pages can be allocated
explicitly via hugetlbfs.
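The memory region that is covered by a TLB follows directly from the number of entries and the page size. The short calculation below uses the values listed for the Intel Xeon X5670 in Table 2.6; the helper `tlb_coverage` is a hypothetical name.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_coverage(entries, page_size):
    """Size of the memory region that can be translated without a TLB miss."""
    return entries * page_size


llc_size = 12 * MIB                        # last level cache of the Xeon X5670
l1_small = tlb_coverage(64, 4 * KIB)       # 256 KiB
l2_small = tlb_coverage(512, 4 * KIB)      # 2 MiB
l1_huge = tlb_coverage(32, 2 * MIB)        # 64 MiB

covers_llc_with_4k_pages = l2_small >= llc_size   # False
covers_llc_with_2m_pages = l1_huge >= llc_size    # True
```

In this example, not even the second level TLB covers the last level cache with 4 KiB pages, whereas the first level TLB alone is sufficient if 2 MiB pages are used.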
Table 2.6: Number of data TLB entries per core in customary x86 processors [Amd11, Appendix A.10];
[Int14a, Section 2.4, 2.2, and 2.1]; [Amd14c, Section 2.9]:
micro-architecture                    LLC size  L1 TLB [entries: coverage]  L2 TLB [entries: coverage]
                                                4 KiB pages  2 MiB pages    4 KiB pages  2 MiB pages
AMD Opteron 2435 (Istanbul)           6 MiB     48: 192 KiB  48: 96 MiB     512: 2 MiB   128: 256 MiB
Intel Xeon X5670 (Westmere-EP)        12 MiB    64: 256 KiB  32: 64 MiB     512: 2 MiB   n/a
Intel Xeon E5-2670 (Sandy-Bridge-EP)  20 MiB    64: 256 KiB  32: 64 MiB     512: 2 MiB   n/a
Intel Xeon E5-2680 v3 (Haswell-EP)    30 MiB    64: 256 KiB  32: 64 MiB     1024: 4 MiB  1024: 2 GiB
AMD Opteron 6274 (Interlagos)         6 MiB⁹    32: 128 KiB  32: 64 MiB     1024: 4 MiB  1024: 2 GiB
⁹ 8 MiB total, 2 MiB used by probe filter in four socket configurations, see Section 4.2.2 for details
Page table entries (PTE) can be marked as invalid by the operating system [DAS12, Section 4.4.4], which
indicates that the physical pages are not present in memory. If such an unavailable page is accessed it
causes an exception. These so-called page faults have to be handled by the operating system. This can be
used to implement swapping, which moves pages to disk if there is not enough physical memory [SGG12,
Section 8.5 and 9.2]. Page faults can also be used to implement on-demand allocation, which allocates
physical memory only for pages that are actually accessed [Gor04, Section 5.6]. To this end, requests to
allocate memory (e.g., dynamic memory allocation using malloc()) only reserve the required number
of virtual pages and generate invalid page table entries. Physical pages are allocated when the page
faults are handled that arise when pages are accessed for the first time. Consequently, pages that are
never accessed do not consume physical memory. However, the exception handling increases the latency
of the affected memory accesses.
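The principle of on-demand allocation can be illustrated with a toy model of a page table. The class below is a simulation written for this purpose and not an interface of any operating system; invalid entries are represented by `None` and the page fault handler is reduced to assigning the next free frame.

```python
class AddressSpace:
    """Toy model of on-demand allocation of physical pages."""

    def __init__(self):
        self.page_table = {}   # virtual page -> physical frame or None
        self.next_frame = 0
        self.page_faults = 0

    def reserve(self, first_page, count):
        """Reserve virtual pages, all entries are initially invalid."""
        for page in range(first_page, first_page + count):
            self.page_table[page] = None

    def access(self, page):
        """Access a reserved page and return the frame that backs it."""
        if self.page_table[page] is None:
            # Page fault: allocate a physical frame on first access.
            self.page_faults += 1
            self.page_table[page] = self.next_frame
            self.next_frame += 1
        return self.page_table[page]

    def resident_pages(self):
        return sum(f is not None for f in self.page_table.values())
```

If 100 pages are reserved but only two of them are accessed, the model records two page faults and two resident pages. Repeated accesses to the same page do not cause further faults.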
Parallel applications (see Section 2.5) can be implemented using a single process with multiple threads
or using multiple processes. Multiple threads in a process share the entire address space. In contrast,
each process has a unique address space, i.e., a dedicated set of page tables. However, some pages can
have the same content. For instance, the text segments of processes that execute the same program are
identical. The copy-on-write mechanism [SGG12, Section 9.3] can be used to avoid redundant copies
in physical memory. In that case page table entries of different processes point to the same physical
pages with a set copy-on-write flag. When a process modifies such a shared page it is replicated and the
PTE of the writing process is updated prior to the change. Read-only pages—i.e., code and read-only
data—remain shared.
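The copy-on-write mechanism can be sketched with reference counted frames. The following model is a simplification for illustration purposes: page tables are plain dictionaries, the copy-on-write flag is replaced by a reference count per frame, and all names are hypothetical.

```python
class PhysicalMemory:
    """Toy model of physical frames that can be shared copy-on-write."""

    def __init__(self):
        self.frames = {}     # frame number -> content
        self.refcount = {}   # frame number -> number of page table entries

    def allocate(self, content):
        frame = len(self.frames)
        self.frames[frame] = content
        self.refcount[frame] = 1
        return frame

    def share(self, page_table):
        """Create a second page table that points to the same frames."""
        for frame in page_table.values():
            self.refcount[frame] += 1
        return dict(page_table)

    def write(self, page_table, page, content):
        frame = page_table[page]
        if self.refcount[frame] > 1:
            # Shared page: replicate it and update the writer's entry first.
            self.refcount[frame] -= 1
            frame = self.allocate(self.frames[frame])
            page_table[page] = frame
        self.frames[frame] = content
```

After a write by one process only the modified page is replicated. Pages that are never written, e.g., the program code, remain shared.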
2.4.3 NUMA Considerations
In systems with a NUMA memory architecture (see Section 2.2.1) it is advantageous if the CPU a thread
is scheduled on and the physical memory it uses are in the same NUMA node [Tan+13]; [Lam06, Sec-
tion 4.1]. Therefore, the operating system needs to be aware of the distributed memory. On NUMA
systems Linux manages the memory of each node separately [Lam06, Section 3.2]. Consequently, multiple
memory pools exist. Upon page allocation it has to be decided which pool is used. Several policies
are available [Kle05]:
localalloc  Allocate memory from the NUMA node that contains the CPU that processes the
            requesting thread. This is the default setting.
bind        Allocate memory from a specific node or set of nodes.
interleave  Allocate physical pages for consecutive virtual addresses alternately from all nodes
            in the specified set of nodes.
preferred   Try to allocate pages from a specific node. Use pages of other nodes if no free pages
            are available in the pool of the preferred node.
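The effect of the policies on the placement of consecutive pages can be illustrated with a small simulation. The sketch below is not the Linux implementation; the function `place_pages` and the order in which other nodes are tried for the preferred policy are assumptions.

```python
def place_pages(policy, num_pages, local_node, nodes=(), free_pages=None):
    """Return the NUMA node chosen for each of num_pages consecutive pages."""
    if policy == "localalloc":
        return [local_node] * num_pages
    if policy == "interleave":
        return [nodes[i % len(nodes)] for i in range(num_pages)]
    if policy == "preferred":
        preferred = nodes[0]
        free = dict(free_pages)
        placement = []
        for _ in range(num_pages):
            if free.get(preferred, 0) > 0:
                node = preferred
            else:
                # Assumption: fall back to the lowest numbered node with
                # free pages.
                node = next(n for n in sorted(free) if free[n] > 0)
            free[node] -= 1
            placement.append(node)
        return placement
    raise ValueError("unsupported policy: " + policy)
```

For example, interleaving five pages over the nodes 0 and 1 yields the placement `[0, 1, 0, 1, 0]`.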
The first touch policy [Lam13] defines which policy is used for the allocation of physical pages if pages
are accessed for the first time. If the default policy is used, physical pages are allocated in the local
memory of the CPU that performs the first access. The numactl command line tool [Kle05] can be
used to alter the policy for whole processes. The shared library libnuma [Kle05] provides an API that
enables explicit control of the policy used for individual allocations. In general there is no fixed assignment
of CPUs to threads. Threads are periodically interrupted by preemptive scheduling [SGG12,
Section 6.1.3] and can move to another CPU when they are restarted. It is possible that a thread is mi-
grated away from its memory to another NUMA node. Thus, even if local allocation is ensured at first
touch, it is not guaranteed that memory accesses remain local. Migration between NUMA nodes can
be avoided by restricting the set of allowed CPUs. This can be done per process via the command line
tools taskset or numactl. Linux also provides an API to explicitly manage the CPU affinity per
thread. The sched_setaffinity() system call [SGG12, Section 6.5.2] can be used to select the set
of CPUs the scheduler can choose from. Another possibility to reduce the number of remote memory ac-
cesses is to also migrate the physical pages [Rie14; GF09]. This is implemented by periodically marking
pages as invalid and checking in which node the resulting page fault occurs. If multiple accesses from
the same remote node are detected the content is copied to physical pages in the local memory of this
node, the page table entries are updated accordingly, and the old physical pages are freed.
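Restricting the set of allowed CPUs can be demonstrated with the Python wrappers `os.sched_getaffinity()` and `os.sched_setaffinity()`, which are only available on platforms that provide the corresponding system calls. The sketch below pins the caller to a single CPU and restores the original setting afterwards; the function name and the choice of the lowest numbered CPU are arbitrary.

```python
import os


def pin_to_one_cpu():
    """Temporarily restrict the caller to one CPU, return (old, new) CPU sets.

    Returns None if the affinity cannot be changed on this platform.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        old = os.sched_getaffinity(0)          # 0 refers to the caller
        os.sched_setaffinity(0, {min(old)})    # allow a single CPU only
        new = os.sched_getaffinity(0)
        os.sched_setaffinity(0, old)           # restore the original set
    except OSError:
        return None
    return old, new
```

In combination with the default first touch policy, such a restriction keeps a thread on the NUMA node on which its memory has been allocated.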
2.5 Programming Models
A programming model provides an abstraction of the computer system [MSM04, Section 2.3]. This
simplified view of the hardware enables application developers to create programs that run on a variety
of systems. The OS already provides processes, threads, and virtual address spaces as abstractions of
CPUs and main memory (see Section 2.4). Parallel programming models also require a mechanism to
exchange data between processes or threads. This can for instance be implemented via shared memory,
network communication, or the file system.
Large scale systems are used for high throughput (HTC) as well as high performance computing (HPC).
High throughput computing [MML11] focuses on executing many independent or loosely coupled tasks
in parallel while the response time for a single task is of little interest. In contrast, high performance
computing [Vet15, Chapter 1] utilizes parallelism to solve one large problem as fast as possible.
However, multiple processors that are working in parallel on a single problem have to exchange data
frequently. Figure 2.17 shows a selection of parallel programming models that facilitate the
communication required in HPC. These models make certain assumptions about the composition of the
system, i.e., they target a certain class of systems (see Section 2.2). Executing applications developed
using one model on a system that is better represented by another model can be inefficient or even
impossible [Pfi98, Chapter 9]. Unfortunately, no single model covers all levels of parallelism that are
available in contemporary HPC systems (see Figure 2.9 in Section 2.2.2.2).
2.5.1 Shared Memory
Shared memory programming models [Gra+03, Chapter 7]; [Pfi98, Section 9.4] rely on shared memory
to implement communication. Parallelism is typically implemented using multiple threads (see Section 2.4.1)
that share the whole virtual address space¹⁰ (see Section 2.4.2). This implies a flat logical
memory model [Gra+03, Figure 7.1], i.e., a uniformly accessible global address space as depicted in Fig-
ure 2.17a. This simple model provides some advantages. First of all, it is easy to use as one does not
have to think about the distribution of the data. The shared address space also enables dynamic load
balancing as all threads can perform each task [Gra+03, Section 7.2]. However, there are also considerable
limitations. The shared memory programming model is restricted to shared memory systems¹¹ (see
Section 2.2.1 and Section 2.2.2.1). Furthermore, the non-uniform memory access (NUMA) characteristics
of contemporary multi-processor systems, which need to be taken into account in order to achieve high
performance, are not considered by the flat memory model. Therefore, locality has to be managed
carefully [Nik+01] either explicitly (e.g., via libnuma [Kle05]) or implicitly by relying on the default first
touch policy and automatic page migration (see Section 2.4.3).
Figure 2.17: Communication in parallel programming models, based on [Wae+15, Figure 1]; [Ope15,
Figure 1]: In the shared memory programming model all threads have uniform access to all the
memory, which can be used to exchange data [Gra+03, Section 7.1]. In the partitioned global address
space (PGAS) model each thread has a local memory, which contains the thread’s private data as
well as a portion of globally shared data [Ope15, Figure 1]. All global memory is accessible by all
threads. However, accesses to another thread’s section can be slower than local memory accesses.
The message passing programming model uses separate address spaces for every thread [Gra+03,
Section 6.1]. Data is exchanged between the threads via messages.
[Panels: (a) shared memory, threads access one memory in a global address space (AS) via load/store;
(b) partitioned global address space, threads with local memories and local AS that form a global AS,
accessed via put/get; (c) message passing, threads with local memories and local AS that communicate
via send/receive.]
¹⁰ Each thread has its own stack for local variables, which is not intended to be used by other threads
Another complication is that no assumptions can be made about the relative execution speed of different
threads [Gra+03, Section 7.9], which is—amongst other things—influenced by the OS’s scheduling deci-
sions. For example, threads can be interrupted and moved to another CPU or NUMA node, which poten-
tially increases the cache miss rates or the ratio of remote memory accesses, respectively. Furthermore,
out-of-order micro-architectures (see Section 2.1.1) can change the perceived order of memory accesses.
For instance, loads can appear to happen before stores to other locations that precede them in program
order, which is permitted, e.g., in x86 based multi-processor systems [Int14b, Volume 3, Section 8.2.2].
Broader types of reordering are possible in other micro-architectures [McK10, Table 5]. Consequently,
the chronology of memory accesses in parallel programs is hardly predictable. Cache coherence pro-
tocols (see Section 2.3) do not resolve this problem. They enforce that changes to the same memory
location are observed in the same order by all threads, but do not guarantee a specific order of compet-
ing accesses. Moreover, accesses to different cache lines are not considered at all since the state of the
cache lines is managed independently of one another. Therefore, synchronization mechanisms [SGG12,
Chapter 5]—e.g., mutual exclusion [Gra+03, Section 7.5.1] and barriers [Gra+03, Section 7.8.2]—are
often required to coordinate the memory accesses of multiple threads. Memory barriers [McK10] can
also be needed to ensure that changes become visible to other threads in the correct order—especially in
systems with aggressive reordering of memory accesses.
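The ordering problem described above can be illustrated with a minimal sketch that uses C11 atomics. It is only one possible way to express a memory barrier and is not taken from the cited sources. The names `payload`, `ready`, `publish`, and `try_consume` are hypothetical. The sketch assumes one producer thread that calls `publish()` and one consumer thread that polls `try_consume()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

int payload;          /* ordinary shared data (hypothetical example variable) */
atomic_bool ready;    /* flag that publishes the payload; zero-initialized, i.e., false */

/* Producer side: write the data first, then set the flag with release
   semantics, so the payload store cannot be reordered after the flag store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: read the flag with acquire semantics. If the flag is set,
   the payload written before the matching release store is visible. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

Without the release/acquire pair, a consumer on a system with aggressive reordering could observe the flag before the payload. Cache coherence alone does not prevent this, because the two variables are different memory locations.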
Low-level thread libraries are available in many environments, e.g., POSIX threads in Unix systems,
Windows threads, and Java threads [SGG12, Section 4.4]. However, implementing a parallel application
using the low-level APIs is a highly error-prone process. Directive-based languages (or language
extensions), which hide a lot of the complexity from the programmer, are a more convenient alternative.
OpenMP [SGG12, Section 4.5.2]; [OMP13] is one example of such an approach. It enables developers
to easily parallelize applications written in C/C++ or Fortran. For that purpose, the source code is anno-
tated with #pragma directives that identify parallel regions, i.e., parts of the code that can be processed
by multiple concurrent threads. By default one thread is started on each CPU. As the number of threads
is determined at runtime, one executable can be used on systems with different numbers of cores. The
work distribution can be done manually based on the unique thread ids. However, there is a specific
#pragma to parallelize loops, which automatically assigns iterations to threads. This can be done statically,
i.e., iterations are distributed as evenly as possible and each thread receives a contiguous range of
iterations. Multiple dynamic loop scheduling methods are supported as well. In that case threads request
further chunks when they have completed their work until all iterations are executed. This can be used
to balance the load on the CPUs, but impedes NUMA optimizations.
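The two loop scheduling variants can be sketched as follows. This is a minimal illustration under assumptions: the function names `dot_static` and `dot_dynamic` are hypothetical, and the chunk size of 64 iterations is an arbitrary choice. If the compiler is invoked without OpenMP support, the pragmas are ignored and both loops run serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Static scheduling: the iteration space is divided up front and each thread
   receives one contiguous block of iterations. */
double dot_static(const double *a, const double *b, long n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Dynamic scheduling: threads request a further chunk (here 64 iterations,
   an arbitrary value) whenever they have finished their current one. */
double dot_dynamic(const double *a, const double *b, long n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

With static scheduling, a thread touches the same index range in every execution of the loop. Under a first touch policy this keeps the data in the thread's local NUMA node. With dynamic scheduling, the mapping of chunks to threads can change from run to run, which is why it impedes NUMA optimizations.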
2.5.2 Partitioned Global Address Space
The Partitioned Global Address Space (PGAS) [Wae+15] programming model also works with a glob-
ally shared address space. However, there is a clear distinction between local and remote accesses. Each
thread has a local address space that contains private data and a portion of the global address space as
depicted in Figure 2.17b. The threads have fast access to their private data and their local portion of
the global data. The performance of accesses to another thread’s piece of global memory can be signifi-
cantly lower. Such remote accesses are implemented using one-sided communication, i.e., the initiating
thread performs the memory access without involvement of the communication partner. This can be im-
plemented explicitly using put() and get() routines [Kri+12a, Section 4.1]; [Ope15, Section 8.3] or
11 It is also possible to use virtual shared memory systems that implement coherent shared memory in software on top of a
distributed memory system [Sca12].
other syntactical means to distinguish remote from local accesses, e.g., by specifying the target thread’s
id to access a remote copy of a variable [NR98]. Remote accesses can also be implicit, i.e., syntactically
identical to local accesses [Wae+15, Section 3.4]. In that case other mechanisms are available that make
sure that each thread processes the objects in its local portion of the global memory, e.g., the affinity
statement in upc_forall() [UPC13].
The PGAS model can be mapped to memory accesses in cache coherent distributed shared memory
systems [Fei95] as well as to network communication in distributed memory systems [Kri+12a, Sec-
tion 2.1]; [GAS13]. The former is useful to implement NUMA-aware shared memory programs as
potentially costly remote accesses, i.e., accesses to memory in other NUMA nodes, have to be added
deliberately. The latter enables PGAS applications that span multiple nodes in a distributed memory
system. However, there is only one kind of remote memory. Thus, different costs, e.g., for intra-node
and inter-node communication (see Figure 2.9 in Section 2.2.2.2), cannot be considered within a sin-
gle application. The PGAS model can be implemented in the form of communication libraries [Fei95;
Kri+12a; GAS13] or language extensions [UPC13; NR98]. One of the oldest implementations of the
PGAS model is Cray’s SHMEM library [Fei95], which was used in their large NUMA systems, e.g.,
the T3D. This development continued under SGI’s control [TW12] and led to the emergence of OpenSHMEM
[Ope15]. Another early PGAS representative is the Global Arrays toolkit [Kri+12a], which
provides PGAS semantics on top of a message passing library. Unified Parallel C [UPC13] and CoArray
Fortran [NR98] are PGAS extensions for C and Fortran, respectively.
2.5.3 Message Passing
The message passing programming model [Gra+03, Chapter 6]; [Pfi98, Section 9.5] is designed for dis-
tributed memory systems (see Section 2.2.2.2). It uses multiple processes that have separate address
spaces as depicted in Figure 2.17c. The data that constitutes the problem the program solves is distributed
among the processes. The processes cannot directly access each other’s portion of the data. Instead, data
has to be exchanged explicitly via messages, which can be implemented using network communication.
As a matter of principle, no shared memory is required to implement the message passing programming
model. However, the nodes in a distributed memory system typically are shared memory multi-processor
systems (see Section 2.2). Therefore, read-only pages from the text and data segment as well as the mes-
sage passing library itself can be shared by the processes in a node (see Section 2.4.2). Furthermore,
intra-node communication can be implemented via shared memory [EM05; PS00].
Even if a message passing based application is partially or completely executed in a shared memory
environment, there are few or no coherence misses as the virtual address spaces (see Section 2.4.2) of
the processes are disjoint. There are no true sharing misses except for the receive buffers that can be
invalidated during data exchange. False sharing also does not occur as the page size is a multiple of the
cache line length, thus cache lines cannot span pages that are mapped to different processes. However, the
coherence protocol overhead (see Section 2.3.4)—in the form of waiting times and additional messages
that are caused by the snoop requests and responses—exists even if coherence is not required. Because
of the separate address spaces it is rather easy to preserve locality in NUMA architectures. If each
process is restricted to a single NUMA node, the default first touch policy avoids remote accesses
(unless a memory pool is depleted). However, shared pages (copy-on-write) for code and read-only
global variables can cause remote accesses.
The Message-Passing Interface (MPI) [MPI94] is a standard for portable message-passing programs.
MPI was developed in the early 1990s as an alternative to the numerous vendor specific message-passing
libraries that existed before it [Gra+03, Section 6.3]. Communicators are a key concept of MPI. They
define communication domains, i.e., sets of processes that can communicate with each other [Gra+03,
Section 6.3.2]. The default communicator MPI_COMM_WORLD includes all processes. MPI supports
point-to-point as well as collective communication within the communication domains. The domains
can be organized in virtual topologies [Gra+03, Section 6.4], e.g., multi-dimensional grids. However,
these topologies do not necessarily correspond to the hardware topology. The point-to-point communica-
tion is implemented using pairs of send and receive functions [MPI94, Chapter 3]—e.g., MPI_Send()
and MPI_Recv(). This is called two-sided communication, as sender and receiver have to agree to
communicate. The basic collective operations [MPI94, Chapter 4] include MPI_Barrier() for syn-
chronization, MPI_Bcast() and MPI_Scatter() to spread data as well as MPI_Reduce() and
MPI_Gather() to consolidate distributed data. Later versions of MPI [MPI09; MPI12] extend the
standard to support, e.g., parallel I/O and one-sided communication via remote memory accesses (RMA).
The latter can be implemented using direct updates—i.e., ordinary load and store operations—in systems
with coherent shared memory [MPI12, Section 11.4].
2.5.4 Hybrid Programming Models
Contemporary HPC systems have up to three levels of parallelism12 (see Figure 2.9 in Section 2.2.2.2):
• multiple cores per processor (sometimes supplemented with multi-threading)
• multiple processors in each node
• multiple nodes in the system
Typically, multiple processor cores share a part of the cache hierarchy and a memory controller (see Sec-
tion 2.1.3). This introduces a first class of remote accesses: core-to-core transfers within a processor.
The nodes often have a NUMA architecture (see Section 2.2.1), which adds a second class of remote
accesses: transfers between NUMA nodes within a shared memory system. Using multiple nodes cre-
ates a third class of remote accesses: transfers between nodes via an interconnection network. None
of the programming models discussed so far can simultaneously consider the specifics of all classes of
remote accesses. The shared memory programming model enables a fine grained distribution of tasks,
but has no inherent perception of remote memory. It can be extended with NUMA-awareness [Kle05;
Nik+01]. However, this simple model is most suitable for uniform memory access architectures. In
contrast, PGAS and message passing innately distinguish local from remote accesses. Thus, they are
better suited for large scale systems. However, there is no intrinsic differentiation of intra-node (shared
memory) and inter-node (network) communication.
Hybrid programming models [DMN12, Chapter 6] interleave multiple programming models in order
to combine their specific advantages. For instance, MPI can be combined with OpenMP [RHJ09;
Liv+12]. In that case a single MPI process is started per node or NUMA node and OpenMP is
used to utilize all processor cores. Other combinations, e.g., MPI+Accelerators [SBO11; Aji+12] and
MPI+PGAS [Din+10], are also considered. The increasing node concurrency in HPC systems (see Fig-
ure 2.10 in Section 2.2.2.3) poses a challenge for MPI [Tha+10]. Therefore, hybrid programming models
are expected to become the prevalent model for future exascale systems [Tha+10; Hoe+13].
2.6 Performance Evaluation
Performance evaluation plays a prominent role in each phase of a computer system’s life cycle [Jai91,
Part I, preamble]. Its use cases include assessing the performance of different design alternatives
during development, comparing the performance of candidate systems in a procurement, and
testing how well applications perform while the system is in operation. Multiple criteria (also
called metrics), e.g., response times or the throughput of arithmetic instructions, can be used to rate the
performance of computer systems [Jai91, Section 3.3]. An important aspect in the evaluation of parallel
applications is their scalability, which is discussed in Section 2.6.1. There are three techniques to perform
performance evaluations: analytical modeling, simulation, and measurement [Jai91, Section 3.1]; [Eis86;
HG97]. The measurement technique comprises benchmarks of the system performance as well as the
performance analysis of parallel applications. Benchmarks are discussed in Section 2.6.2. Section 2.6.3
introduces methods and tools for the performance analysis of parallel applications. An overview of
performance modeling and simulation is provided in Section 2.6.4.
12 In addition to the processors’ exploitation of instruction level parallelism and SIMD capabilities (see Section 2.1.1).
2.6.1 Scalability
An application is scalable if it can efficiently utilize an increasing number of processors [KG94]. There-
fore, the speedup (S) of a parallel implementation compared to the sequential solution of a problem is an
important metric for parallel applications [Gra+03, Section 5.2.3]; [MSM04, Section 2.5]. It is defined
as the ratio of the runtime of the sequential algorithm (T1) to the runtime of the parallel algorithm on N
processors (TN), as shown in Equation (2.1) [KMC72, Equation 2]. A related metric is the parallel efficiency
(E), which divides the speedup by the number of processors as shown in Equation (2.2) [KMC72,
Equation 1].
\[ S(N) = \frac{T_1}{T_N} \tag{2.1} \]
\[ E(N) = \frac{S(N)}{N} = \frac{T_1}{N \times T_N} \tag{2.2} \]
The ideal speedup using N processors, which is also called linear speedup, is N. This is achieved
if TN = T1/N and results in a parallel efficiency of 1.0. Unfortunately, linear speedup can hardly be
achieved in practice as usually a part of the application—called the serial fraction—cannot be paral-
lelized [MSM04, Section 2.5]. Parallelization overhead such as load imbalances and excess computation
in the parallel algorithm can further reduce the achievable speedup [Gra+03, Section 5.1]. Parallel ap-
plications typically include communication between processors. Therefore, the scaling of applications
also strongly depends on the topology of the interconnection network [NW88]. Furthermore, resource
contention can limit the performance of multi-core processors [GSP11]. However, in some rare cases
a super-linear speedup (S(N) > N ) can be observed. Possible causes for such behavior include re-
duced cache miss rates if the problem fits into the aggregated cache capacity of N processors [Gra+03,
Example 5.3] as well as a changed order of operations in the parallel algorithm, e.g., when traversing
data structures [Gra+03, Example 5.4]. The achievable speedup in parallel architectures is the subject of
numerous research studies. A comprehensive overview is given in [KG94].
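As a small illustration of Equations (2.1) and (2.2), the two metrics can be computed from measured runtimes as sketched below. The function names `speedup` and `efficiency` are hypothetical, and the runtimes in the usage note are made-up example values, not measurements.

```c
#include <assert.h>

/* Speedup S(N) = T1 / TN, with t1 the sequential runtime and tn the runtime
   on N processors (both in the same unit, e.g., seconds). */
double speedup(double t1, double tn)
{
    assert(t1 > 0.0 && tn > 0.0);
    return t1 / tn;
}

/* Parallel efficiency E(N) = S(N) / N = T1 / (N * TN). */
double efficiency(double t1, double tn, int n)
{
    assert(n >= 1);
    return speedup(t1, tn) / (double)n;
}
```

For example, a program that takes 100 s sequentially and 25 s on 8 processors has a speedup of 4 and a parallel efficiency of 0.5. Linear speedup would require a runtime of 12.5 s.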
Three speedup metrics can be distinguished [HJ10, Section 3.3]—fixed-size (strong scaling), fixed-time
(weak scaling), and memory-bounded speedup. The fixed-size speedup is described by Amdahl’s
Law [Amd67]; [HP06, Section 1.9]; [MSM04, Section 2.5]. Amdahl’s Law states that the achievable
speedup for a given workload (w) is determined by the program’s serial fraction (fs) as shown in
Equation (2.3), which has an upper bound of 1/fs for N → ∞.
\[
S(N) = \frac{T_1(w)}{T_N(w)}
     = \frac{T_1(w)}{f_s \times T_1(w) + \frac{(1 - f_s) \times T_1(w)}{N}}
     = \frac{1}{f_s + \frac{1 - f_s}{N}}
\tag{2.3}
\]
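Equation (2.3) can be evaluated numerically with the following minimal sketch. The function names `amdahl_speedup` and `amdahl_limit` are hypothetical, and the serial fraction in the usage note is an arbitrary example value.

```c
#include <assert.h>

/* Fixed-size speedup according to Amdahl's Law:
   S(N) = 1 / (fs + (1 - fs) / N), with serial fraction fs in [0, 1]. */
double amdahl_speedup(double fs, int n)
{
    assert(fs >= 0.0 && fs <= 1.0 && n >= 1);
    return 1.0 / (fs + (1.0 - fs) / (double)n);
}

/* Upper bound of the fixed-size speedup for N -> infinity: 1 / fs. */
double amdahl_limit(double fs)
{
    assert(fs > 0.0 && fs <= 1.0);
    return 1.0 / fs;
}
```

With a serial fraction of 0.1, 16 processors yield a speedup of 1 / (0.1 + 0.9 / 16) = 6.4, and no number of processors can exceed a speedup of 10.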
According to Gustafson [Gus88]; [MSM04, Section 2.5], the fixed-size assumption is not always ap-
propriate. He argues that parallel systems are used to solve large problems in a reasonable amount of
time, which is not possible on smaller systems. He divides the workload into a sequential (ws) and a
parallel part (wp). The parallel part is scaled such that solving the enlarged problem with N processors
(TN (ws + w′p)) takes as long as solving the original problem on one processor (T1(ws + wp)) [HJ10,
Figure 3.9]. It is assumed that the parallel part scales linearly with the number of processors, i.e.,
w′p = N×wp. The resulting scaled speedup (fixed-time speedup) is defined as shown in Equation (2.4).
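\[
\begin{aligned}
S(N) &= \frac{T_1(w_s + N \times w_p)}{T_1(w_s + w_p)}
      = \frac{f_s \times T_1(w) + N \times (1 - f_s) \times T_1(w)}{T_1(w)} \\
     &= f_s + N \times (1 - f_s) \\
     &= N + (1 - N) \times f_s
\end{aligned}
\tag{2.4}
\]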
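Equation (2.4) can be evaluated in the same way. The sketch below is an illustration only. The function name `gustafson_speedup` is hypothetical, and fs denotes the serial fraction of the original workload on one processor, as in the derivation above.

```c
#include <assert.h>

/* Fixed-time (scaled) speedup according to Gustafson:
   S(N) = fs + N * (1 - fs) = N + (1 - N) * fs. */
double gustafson_speedup(double fs, int n)
{
    assert(fs >= 0.0 && fs <= 1.0 && n >= 1);
    return (double)n + (1.0 - (double)n) * fs;
}
```

For a serial fraction of 0.1 and 16 processors the scaled speedup is 16 - 15 × 0.1 = 14.5. The fixed-size speedup from Amdahl's Law for the same values is only 6.4, because the parallel part is not enlarged there.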
Sun and Ni [SN93] describe that the feasible problem size can be increased more than linearly with the
number of processors if the problem is constrained by the memory capacity and the arithmetic intensity
increases faster than the memory requirements. This further reduces the impact of the serial fraction,
thus the memory-bounded speedup can be higher than the fixed-time speedup.
2.6.2 Benchmarking
Using measurements in order to compare the performance of two or more computer systems is called
benchmarking [Jai91, Section 4.6]. The term benchmark is commonly used for all kinds of workloads
that are used in that process. This includes application benchmarks, synthetic programs (also called
synthetic benchmarks), and kernels [HG97, Section 4.1]; [Bae10, Section 1.3.1]. Application bench-
marks are based on real applications, i.e., they execute complex algorithms that solve problems of prac-
tical importance (not necessarily with real data). For instance, the SPEC MPI2007 [Mül+10], SPEC
OMP2001 [Mül+04], and SPEC OMP2012 [Mül+12] benchmarks fall into this category. In contrast,
synthetic programs do not perform any meaningful work. However, they carry out a sequence of operations
that simulates a real workload [HG97, Section 4.1]. An example of this is eeMark [Mol+12],
which simulates MPI based parallel applications. The LU, SP, and BT benchmarks that are included in
the NAS parallel benchmarks [Bai+91] and emulate the behavior of computational fluid dynamics (CFD)
codes also belong to that category. A kernel [Jai91, Section 4.3]; [Eis86] is an elementary function that is
used by multiple applications. Unlike application benchmarks and synthetic programs, kernels have well
known parameters, i.e., the exact number and type of operations is known. Furthermore, many kernels
are focused on the processor’s computational performance, which allows them to derive the throughput
of integer or floating point operations. The LINPACK benchmark [DLP03], which is used to rank HPC
systems in the TOP500 list [Top15], can be categorized as such a processing kernel.
Due to their complexity, the overall performance of benchmarks is typically influenced by multiple sys-
tem components, e.g., execution units, caches, main memory, and the interconnection network. There-
fore, the performance of individual components cannot be measured with the types of benchmarks de-
scribed so far. Micro-benchmarks (also called component benchmarks) are designed to evaluate the
performance of a certain component [Lil00, Section 7.1.4]; [Sta05], e.g., measure the throughput of the
FPU or the main memory bandwidth. Therefore, they mostly execute operations that stress the target
component without being limited by other components. For instance, memory accesses can be avoided
by using only operands from the CPU’s registers. Micro-benchmarks do not reflect the behavior of real-
istic workloads. However, they are able to determine the maximal performance of a single component.
The versatile micro-benchmark suite lmbench [MS96] includes measurement routines that evaluate the
performance of memory accesses, I/O operations, and various operating system functions, e.g., the time
required for creating a new process. The initial version of lmbench does not include support for multi-
processor systems. However, scalability tests have been introduced later on [Sta05]. Another well-
known micro-benchmark is STREAM [McC95], which is included in lmbench [Sta05] and measures
the bandwidth of memory accesses. STREAM performs simple operations on large vectors that do not
reuse any data, e.g., a[i] = q * b[i]. The same approach is used in likwid-bench [THW12],
which implements the measurement in assembly language for x86 processors and also considers remote
accesses in NUMA systems. STREAM and likwid-bench can also measure the aggregated bandwidth of
concurrent memory accesses in shared memory multi-processor systems. The performance of message
passing based applications is also an important aspect of HPC systems. Therefore, various benchmarks
that focus on the properties of communication are available. A comparison of selected MPI benchmarks
is provided by Hamid and Coddington [HC10].
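The measurement principle of a bandwidth micro-benchmark can be sketched as follows. This is a simplified illustration in the spirit of the `a[i] = q * b[i]` operation mentioned above, not the STREAM or likwid-bench implementation. The names `scale`, `wall_seconds`, and `scale_bandwidth` are hypothetical. The sketch counts one load and one store per element and ignores additional traffic such as write-allocate transfers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* The "scale" operation: one load and one store per element, no data reuse. */
void scale(double *a, const double *b, double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = q * b[i];
}

/* Wall-clock time in seconds (C11 timespec_get). */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Run the kernel `repetitions` times on vectors of n doubles and return the
   best observed bandwidth in bytes per second. Returns 0.0 if the allocation
   fails or the elapsed time is below the timer resolution. */
double scale_bandwidth(size_t n, int repetitions)
{
    assert(n > 0 && repetitions > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double best = 0.0;
    if (a != NULL && b != NULL) {
        for (size_t i = 0; i < n; i++) {   /* initialization touches all pages */
            a[i] = 0.0;
            b[i] = 1.0;
        }
        for (int r = 0; r < repetitions; r++) {
            double t0 = wall_seconds();
            scale(a, b, 3.0, n);
            double dt = wall_seconds() - t0;
            volatile double sink = a[n / 2];   /* read a result back */
            (void)sink;
            double bytes = 2.0 * (double)n * sizeof(double);
            if (dt > 0.0 && bytes / dt > best)
                best = bytes / dt;
        }
    }
    free(a);
    free(b);
    return best;
}
```

To measure main memory rather than cache bandwidth, the vectors have to be considerably larger than the last level cache. A production benchmark would additionally guard against compiler optimizations that remove the kernel, and would pin threads to cores.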
Detailed information about the memory hierarchy is required in order to optimize programs for a certain
system [Dan+13]. Thus, there are many benchmarks that are able to characterize the performance of
caches and TLBs, e.g., [SS95; BT09; YPS05a; Dan+13; Duc+08; Gon+10]. Saavedra and Smith present
micro-benchmarks that measure the latency of cache and TLB misses [SS95]. Babka and Tůma present
a benchmark suite that determines the organization of the TLB and cache hierarchy and measures their
respective miss penalties [BT09]. The study also considers paging structure caches [Int14b, Volume 3,
Section 4.10.3], which is a distinctive feature compared to other works in that area. X-Ray [YPS05a] is
another tool that determines the cache organization and measures the latency of local memory accesses.
BlackjackBench [Dan+13] also determines cache and TLB parameters, namely the cache line length, the
number of cache levels, their capacity, associativity, and access latency as well as the page size and the
number of TLB entries. It also measures the bandwidth of core-to-core transfers, within and between
multi-core processors, as well as the latency and throughput of arithmetic instructions. Multiple tools
include latency measurements for local cache and memory accesses [MS96; Dan+13; YPS05a]. A
method that also considers the latency of core-to-core transfers in multi-core processors is presented
in [Sch07, Section 6.5]. P-Ray [Duc+08] discovers which cores share certain caches in multi-core pro-
cessors and measures the scaling of the bandwidth with the core count. The detection of shared caches
is also covered by Servet [Gon+10].
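A common technique for latency measurements is pointer chasing, in which every load depends on the result of the previous one. The following sketch illustrates the idea. It is an assumption about the general approach, not a description of how the cited tools are implemented. The names `build_cycle` and `chase` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill next[] with a random permutation that consists of one single cycle
   (Sattolo's algorithm). A chase then visits every element exactly once per
   round, in an order that is hard to predict for hardware prefetchers.
   Note: rand() limits the randomness for n larger than RAND_MAX. */
void build_cycle(size_t *next, size_t n, unsigned seed)
{
    assert(next != NULL && n >= 2);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Perform `steps` dependent loads starting at index 0 and return the final
   index. The data dependency prevents the loads from overlapping. */
size_t chase(const size_t *next, size_t steps)
{
    size_t pos = 0;
    for (size_t s = 0; s < steps; s++)
        pos = next[pos];
    return pos;
}
```

The average latency is obtained by timing a call to `chase()` with many steps and dividing the elapsed time by the number of steps. Varying the size of the array moves the working set through the levels of the memory hierarchy.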
2.6.3 Performance Analysis of Parallel Applications
The purpose of performance analysis is to investigate the behavior of applications in a certain hardware
and software environment in order to locate and understand performance problems [Moo+01]. There-
fore, information about the progression of the program execution has to be recorded. This is called
monitoring [Jai91, Chapter 7]. Monitoring comprises hardware and software monitoring. A hardware
monitor is a measurement device that is attached to the system under test. A software monitor is a pro-
gram that uses functions provided by the operating system, compiler, or the programming language to
intercept the application and collect information. The hardware monitor uses dedicated resources for
the performance analysis, thus the impact on the system under test is minimal. In contrast, a software
monitor can significantly influence the behavior of the system.
2.6.3.1 Data Acquisition
There are basically two approaches to obtain performance data: sampling and instrumentation [Ils+15].
Sampling periodically intercepts the application. The events that trigger the data acquisition are provided
by the operating system or the hardware, thus sampling can be used without modification of the appli-
cation. Sampling can be implemented using timer interrupts [Jai91, Section 7.3.1], which leads to a
constant sampling interval. Some contemporary processors also support performance counter sampling,
which collects performance data on counter overflows [Moo02], e.g., after every 100000 instructions. In-
strumentation incorporates calls to the data acquisition functions into the program that is to be analyzed.
Common events are: function entries and exits that can be instrumented by the compiler as well as calls
to library functions that can be intercepted with wrapper libraries. Another possibility is to manually add
instrumentation to the source code at certain points in the execution. Furthermore, binary instrumentation
tools, e.g., Dyninst [BH00], can be used to instrument applications even if their source code is not
available.
The difference between sampling and instrumentation is depicted in Figure 2.18. Figure 2.19 shows the
runtime distribution that can be derived from the recorded data. Time-based sampling has the advantage
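  int x, y, z;          //inputs (zero-initialized globals here)
  int c(int arg);       //forward declaration, defined below
  int a(){
    //some computation
    return c(x);
  }
  int b(){
    //some computation
    return c(x)+c(y)+c(z);
  }
  int c(int arg){
    int result = arg;   //stands in for some computation
    return result;
  }
  int main(){
     return a() + b();
  }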
[Figure 2.18: timeline of the execution of main, a(), b(), and c() over time, shown once with the
measurement points of time-based sampling and once with the measurement points of instrumentation]
Figure 2.18: Comparison of sampling and instrumentation, in the style of [Vam15, Figure 3.12]: If
time-based sampling is used, samples are taken at a fixed frequency. The performance data from
a whole sampling interval is assigned to the source code location at which an interrupt occurs. If
instrumentation is used, the performance data is assigned to the corresponding program phase.
[Figure 2.19: pie charts of the runtime shares of main, a(), b(), and c(); (a) Time-based Sampling,
(b) Instrumentation]
Figure 2.19: Exemplary runtime statistics: The pie charts correspond to Figure 2.18. Sampling does not
record enough information to derive the correct time distribution in this small example.
of a constant overhead of n samples per unit of time. On the other hand, the obtained performance
data is not necessarily accurate. Figure 2.19a depicts the imprecise runtimes that the sampling approach
generates for the example application. However, the statistical significance improves when the number
of samples increases, e.g., in loops with many iterations. In contrast, the instrumentation based approach
captures all the necessary information to correctly determine the time spent in each function as depicted
in Figure 2.19b. The downside is that the overhead depends on the number of events per unit of time that
is generated by the application.
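Manual source code instrumentation of the example program can be sketched as follows. This is a minimal illustration, not the mechanism of a particular tool. The hooks `region_enter` and `region_exit`, the arrays `region_calls` and `region_seconds`, and the arguments passed to `c()` are hypothetical.

```c
#include <assert.h>
#include <time.h>

enum { REGION_A, REGION_B, REGION_C, REGION_COUNT };

unsigned long region_calls[REGION_COUNT];   /* number of entries per function */
double region_seconds[REGION_COUNT];        /* inclusive time per function */

static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Hook at function entry: count the event and return the entry time. */
static double region_enter(int id)
{
    assert(id >= 0 && id < REGION_COUNT);
    region_calls[id]++;
    return wall_seconds();
}

/* Hook at function exit: add the elapsed time to the function's total. */
static void region_exit(int id, double entered)
{
    region_seconds[id] += wall_seconds() - entered;
}

int c(int arg)
{
    double t = region_enter(REGION_C);
    int result = arg * arg;              /* stands in for some computation */
    region_exit(REGION_C, t);
    return result;
}

int a(void)
{
    double t = region_enter(REGION_A);
    int r = c(1);
    region_exit(REGION_A, t);
    return r;
}

int b(void)
{
    double t = region_enter(REGION_B);
    int r = c(1) + c(2) + c(3);
    region_exit(REGION_B, t);
    return r;
}

int run_instrumented(void)
{
    return a() + b();
}
```

Every call generates two measurement events, so the four calls to `c()` cause eight timer reads. This shows why the overhead of instrumentation grows with the event rate of the application, whereas time-based sampling costs a fixed number of samples per unit of time.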
2.6.3.2 Data Recording and Presentation
Performance analysis tools can be classified into two groups—profiling and tracing—based on the type
and amount of data that is preserved [Ils+15]. Both techniques can be combined with sampling as well as
instrumentation [Juc12, Figure 3.1]. Profiling generates statistics, e.g., the time spent in every function
as depicted in Figure 2.19. It is also possible to identify the instructions that are intercepted by each
sample. Instructions that take longer to be processed are intercepted more frequently. Thus, it can
be estimated to what extent individual instructions contribute to the total runtime. Tracing stores all
events together with a time-stamp. This retains the chronology of events. The gathered data is typically
displayed using timelines [Ils+15] as shown in Figure 2.20. With timelines it is possible to investigate
what happened concurrently in different processes or threads. However, the memory accesses required
to store the performance data interfere with the processing of the application. Therefore, events have to
be infrequent in order to limit the perturbation.
Figure 2.20 shows three scenarios of two parallel processes that alternately execute compute bound and
bandwidth bound functions. Profiling tools typically gather the performance data independently for
each core, thus the concurrency of events on different cores is not captured. Therefore, the statistic
for the runtime distribution would be identical in all three cases. However, the load on the DRAM—
which typically is a shared resource in multi-core processors—differs significantly in the three scenarios.
Tracing is required to attribute this behavior to the different interplay of the cores.
[Figure 2.20, panels: timelines for Core 0, Core 1, and DRAM; (a) synchronous phases: both cores execute
foo, bar, foo, bar in lockstep; (b) shifted phases: Core 1 executes the same sequence with an offset;
(c) inverted phases: Core 0 executes bar, foo, bar, foo while Core 1 executes foo, bar, foo, bar]
Figure 2.20: Utilization of shared resources in a multi-core processor: Both cores alternately execute
functions foo() and bar(), which have different memory bandwidth requirements. The runtime
is evenly distributed between the functions in all three scenarios. However, the phases are shifted
differently. The timing information is essential to understand the behavior of shared resources.
2.6.3.3 Hardware Performance Monitoring
Many contemporary processors include performance monitoring units (PMUs) that observe the hard-
ware utilization, e.g., [Amd13b, Section 2.7]; [Int14b, Volume 3, Chapter 18]; [Arm15, Chapter D5]. The
PMU typically contains multiple hardware performance counters, which can be programmed to count
certain events, e.g., cache misses and snoop requests. Usually, each core has dedicated counters. Ad-
ditional PMUs for the shared resources are common as well [Amd13b, Section 2.7.2]; [Int12a; Int14d].
Hardware performance counters are often used by performance analysis tools (see Section 2.6.3.4) to associate
the behavior of applications with the hardware utilization. This is an example of hybrid monitoring
[Jai91, Section 7.6]. The PMU records performance data without disturbing the program execution.
However, a software monitor is used to occasionally retrieve and process the information.
Performance counters can be used in counting mode and sampling mode [Moo02]. Counting mode is
useful for instrumentation-based data capturing. In that case the counter is read whenever an instrumentation
point is reached. This correctly assigns the hardware events that occur within an instrumented region
to that region. In sampling mode the PMU generates an interrupt that signals an overflow whenever a
counter reaches a configurable threshold. The percentage of overflows that is caused by the individual
instructions can be used to estimate their respective portion of the total number of events. However, there
can be a significant delay between the overflow and the generation of the corresponding interrupt, which
leads to imprecise results. Therefore, contemporary Intel processors also support precise event based
sampling (PEBS), which avoids this inaccuracy [Spr02]. A similar technique—called instruction based
sampling (IBS)—is available on several AMD processors [Dro07].
Performance counters are configured using control registers [Wea15], e.g., via the machine specific reg-
isters in x86 processors [Amd13b, Section 2.7]; [Int12a; Int14d]. Since version 2.6.32 the Linux kernel
includes the perf_event subsystem, which—among other things—provides access to hardware perfor-
mance counters [Wea15]. The command line tool perf13 can, for example, be used to create profiles
of applications (perf record) [Tak13]. A widely-used tool to access PMUs of various processor ar-
chitectures is the Performance API (PAPI) [Ter+09]. PAPI defines a standard interface to access perfor-
mance counters from within the application in order to record performance data per thread. The current
version of PAPI uses the perf_event infrastructure to access the PMUs of x86 processors [Wea15]. The
LIKWID tool suite [THW10] includes the command line tool likwid-perfCtr, which collects aggregated
performance data for the whole application or user defined regions. In contrast to PAPI, likwid-perfCtr
only supports x86 based processors. It accesses the PMUs directly via machine specific registers (MSRs)
and collects performance data on a per core basis. The Intel Performance Counter Monitor [Int16] can
be used to access the performance counters of selected Intel processors. It supports system-wide mea-
surements as well as instrumenting the source code.
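Counting mode can be illustrated with the Linux perf_event_open system call, which is the kernel interface behind the perf_event subsystem. The following sketch is Linux-specific and makes several assumptions: the function names `count_instructions` and `sample_work` are hypothetical, only user-space instructions of the calling thread are counted, and the function returns -1 whenever the counter is not accessible, e.g., in virtual machines or with a restrictive perf_event_paranoid setting.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                 /* for syscall() */
#endif
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Small workload used as the instrumented region in this example. */
void sample_work(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000; i++)
        x += i;
}

/* Count retired user-space instructions while work() runs on the calling
   thread. Returns -1 if the counter cannot be opened or read. */
long long count_instructions(void (*work)(void))
{
    assert(work != NULL);
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;              /* start stopped, enable explicitly */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid = 0 and cpu = -1: this thread on any CPU; no group, no flags */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    long long count = -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);   /* entry of the region */
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  /* exit of the region */
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        count = -1;
    close(fd);
    return count;
}
```

The counter is enabled at the entry and read at the exit of the region, so the events are attributed to exactly that region. Libraries such as PAPI wrap this kind of low-level access behind a portable interface.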
2.6.3.4 Performance Analysis Tools
Benchmarks (see Section 2.6.2) determine how fast a certain task is performed. They are suitable to com-
pare the performance of computer systems and can help to detect poor system performance. However, a
benchmark does not reveal why it takes the measured time [Eis86, Section 4.2]. In contrast, performance
analysis tools enable programmers to identify and understand performance problems [Moo+01], which
is a prerequisite for fixing them. This section introduces a selection of such tools.
Intel’s tool VTune Amplifier XE [Int13a; Cep13] uses the event based sampling facilities of the PMUs to
record performance data. It is focused on the performance of OpenMP applications on shared memory
nodes. However, hybrid MPI+OpenMP codes can be analyzed as well. VTune generates function and
call path profiles and highlights potentially problematic code regions based on predefined thresholds for
certain event rates, e.g., instructions retired per cycle. It also provides a source code view that correlates
performance counter events with code locations. A timeline view that displays thread concurrency and
hardware events is available as well.
¹³ Documentation available at: https://perf.wiki.kernel.org
The performance analysis tool Vampir [Mül+07; Knü+08] visualizes trace files that are generated us-
ing VampirTrace or Score-P [Knü+12]. Multiple processes can be used in order to process large trace
files [Bru07]. The sequences of functions that are executed by the individual processes and threads are
displayed in the form of timelines. Furthermore, communication is depicted using connecting lines be-
tween the participants. Vampir also derives profiling information from the trace that shows to what extent
individual functions contribute to the total runtime. The tracing infrastructures VampirTrace [Vam13] and
Score-P [Sco15] support compiler-based as well as manual source code instrumentation. MPI, OpenMP,
and Pthread calls can also be instrumented. OpenCL and CUDA support is available as well. Perfor-
mance counters are recorded using PAPI [Ter+09]. Furthermore, a plugin counter interface [Sch+11] can
be used to integrate information from other sources, e.g., power meters.
Scalasca [Gei+10] is a performance analysis toolkit for large-scale systems with many thousands of
cores. It supports the instrumentation of MPI, OpenMP, and hybrid MPI+OpenMP applications. Source
code annotation and compiler-based instrumentation are also available. Furthermore, Scalasca can be
used together with TAU’s source-code instrumentor [SM06] and Score-P [Knü+12]. A special feature
of Scalasca is the detection of wait states, e.g., in MPI calls that have to wait for their communication
partner due to load imbalances. To this end, it creates so-called “runtime summaries”, which combine
profiling information with inefficiency metrics derived from event traces. The summaries are created by
a fully automatic trace analyzer, which uses as many threads as the original application. The results are
displayed in the result explorer Cube [Sav+15] or with third-party tools.
The TAU Performance System [SM06; She+06] is a versatile set of tools for the performance analysis of
parallel applications. It uses a modular approach with an instrumentation layer, a measurement layer, and
a visualization and analysis layer. The instrumentation layer supports preprocessor-based and compiler-
based instrumentation, manual code annotation, library wrapping, and binary instrumentation. Library
wrapping is used to record the communication in MPI and SHMEM applications. Preprocessor-based
instrumentation (source-to-source transformation) is, for example, used to incorporate OpenMP direc-
tives and monitor memory allocation. The measurement layer supports profiling as well as tracing. The
generated profiles consider the call path in addition to the code location in order to detect problems that
are specific to a function’s calling context. Hardware performance counters can also be recorded. The
visualization and analysis layer includes text-based and graphical tools to examine profiles. A trace visu-
alizer is not included. However, several trace translators are available that produce input for third-party
tools like Vampir [Knü+08] and Cube [Sav+15].
Periscope [FG06; GK07] is an online performance analysis tool for MPI and OpenMP codes. It uses
a hierarchy of distributed agents. The node-level agents collect the performance data independently in
every node of the system. They are executed on processor cores that are set aside for the performance
analysis in order to reduce the disturbance of the application. The intermediate agents aggregate the
collected data and send it to the master agent, which establishes the connection to the tool’s graphical user
interface. These additional agents are executed on nodes that are not used by the application. Periscope
searches for formally defined performance problems—e.g., imbalances in OpenMP parallel regions—in
order to detect inefficiencies. In addition to MPI and OpenMP related problems, Periscope can also
detect inefficient memory accesses. Performance counters can be recorded as well. This enables the
definition of performance problems that depend on the hardware utilization, e.g., cache miss rates.
HPCToolkit [Adh+10] is a sampling based performance analysis tool that works with fully optimized
binaries. It uses time-based or counter overflow-based sampling in order to generate statistics for the
runtime distribution and hardware events. Each sample obtains the complete call path using stack un-
winding in order to generate call path profiles or traces. The data acquisition can also be triggered by
certain program activities, e.g., when memory is allocated or I/O operations are performed. A binary
analysis mechanism recovers the program structure. Computed metrics—e.g., floating point operations
per cycle—help to identify inefficient hardware usage. It is also possible to detect scalability bottlenecks
in parallel applications. Furthermore, data-centric profiling [LM13] can be used to reveal which part of
the data is responsible for identified problems. HPCToolkit includes a viewer that correlates the recorded
metrics to the reconstructed program structure and, if available, to the source code location.
2.6.4 Analytical Performance Modeling and Simulation
The performance analysis tools presented in Section 2.6.3.4 are suitable to analyze the behavior of paral-
lel applications on existing hardware. However, for some tasks it is necessary to predict the performance
of systems that are not available yet, e.g., in order to perform design space explorations for future pro-
cessor architectures [HM08]. This section provides an overview of modeling and simulation techniques
that facilitate such research.
2.6.4.1 Analytical Performance Modeling
Performance modeling can be analytical or empirical [BGH12]. Analytical performance modeling uses
mathematical techniques to solve systems of equations in order to estimate the steady state perfor-
mance [Eis86, Chapter 2]. Queuing models [Jai91, Part VI] are a widely-used technique in analytical
modeling. Once a queuing network model is solved it can be used to assess the resulting performance
for other input parameters. However, realistic models of complex systems result in systems of many
equations, which cannot be solved without approximations. Empirical performance modeling is
based on measurements under certain conditions. Regression is used to derive a function that describes
the correlation between the observed performance and the varied system parameters [MPV12]. This
function can then be used to estimate the performance under different conditions, e.g., another number
of cores [Bar+08].
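As an illustration of the empirical approach, the following sketch fits a simple model t(p) = a + b/p to runtimes measured for different core counts p, using ordinary least squares on x = 1/p. The model form, the function names, and the data are assumptions chosen for illustration; the data points are synthetic (generated from a = 2, b = 8), not measurements.

```python
def fit_inverse_core_model(cores, runtimes):
    """Least-squares fit of t(p) = a + b / p; returns (a, b)."""
    xs = [1.0 / p for p in cores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, runtimes))
    b = sxy / sxx            # slope: part of the runtime that scales with 1/p
    a = mean_y - b * mean_x  # intercept: part of the runtime that does not scale
    return a, b

def predict_runtime(a, b, cores):
    """Evaluate the fitted model for another number of cores."""
    return a + b / cores

# Synthetic "measurements" for 1, 2, 4, and 8 cores.
a, b = fit_inverse_core_model([1, 2, 4, 8], [10.0, 6.0, 4.0, 3.0])
estimate_16_cores = predict_runtime(a, b, 16)
```

Extrapolating beyond the measured range is only as reliable as the assumed model form, which is why such models need to be validated against additional measurements.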
Analytical performance modeling is used in many studies that investigate the influence of the memory
hierarchy on the achievable performance. Williams et al. present the roofline model [WWP09], which
explains the achievable performance by the application's operational intensity, i.e., the number of
floating point operations per byte transferred between cache and main memory. Applications with low
operational intensities are limited by the available memory bandwidth while applications with high operational
intensities can fully utilize the peak computational performance. The roofline model determines an upper
bound for the achievable performance, but it does not consider any specifics of the processor and system
architecture. Other models use more detailed representations of the memory hierarchy in order to obtain
more accurate performance predictions. Saavedra and Smith [SS95] consider the delay caused by cache
and TLB misses in their abstract machine performance model. Marin and Mellor-Crummey use reuse
distances to predict the number of cache misses on different architectures in order to estimate application
performance [MM04]. Treibig and Hager [TH10] analyze how data transfers between the individual
cache levels affect bandwidth-limited loops. Stengel et al. present the execution-cache-memory (ECM) model [Ste+15],
which estimates the execution time of a number of loop iterations based on the time required to process
the instructions in the core and the delay caused by data transfers. Majo and Gross [MG11b] develop a
simple model to describe resource contention for local and remote memory accesses in NUMA systems.
The influence of the cache coherence protocol on communication performance in shared memory sys-
tems is investigated by Ramos and Hoefler [RH13]. All these models require detailed information about
hardware characteristics.
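The roofline bound described above can be written as the minimum of the peak computational performance and the product of memory bandwidth and operational intensity. The following sketch uses made-up machine parameters (100 GFLOP/s peak, 50 GB/s memory bandwidth) purely for illustration.

```python
def roofline_gflops(peak_gflops, bandwidth_gbytes_per_s, flops_per_byte):
    """Upper bound on the achievable performance according to the roofline model."""
    return min(peak_gflops, bandwidth_gbytes_per_s * flops_per_byte)

# Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
PEAK, BANDWIDTH = 100.0, 50.0

memory_bound = roofline_gflops(PEAK, BANDWIDTH, 0.125)   # limited by bandwidth
compute_bound = roofline_gflops(PEAK, BANDWIDTH, 4.0)    # limited by peak performance
ridge_point = PEAK / BANDWIDTH  # operational intensity at which both limits meet
```

For the assumed parameters, kernels with fewer than 2 floating point operations per byte are bandwidth-bound and all others are bound by the peak performance.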
Analytical performance modeling is also used for the design space exploration of multi-core
architectures. The transistor budget is steadily increasing due to Moore’s Law (see Section 2.1). This leads to
the question of how the additional transistors can be used most effectively, i.e., whether they should be used to
improve the performance per core or to increase the number of cores [BC11]. Hill and Marty present a
simple speedup model based on Amdahl’s Law [HM08], which predicts the achievable performance for
different trade-offs between core complexity and number of cores. They argue that fixed-size speedup
is an appropriate metric for multi-core processors and under this assumption they conclude that single-
core performance remains highly relevant. Yavits et al. [YMG14] include core-to-core communication
in their study of multi-core scalability and arrive at even more pessimistic results than Hill and Marty. In
contrast, Sun and Chen examine fixed-time speedup and reason that there are no intrinsic scalability limi-
tations for multi-core processors [SC10]. Gunther et al. describe a universal scalability law [GSP11] that
considers the saturation of shared resources in multi-cores as well as core-to-core communication. Their
approach relies on empirical modeling, i.e., it requires measurements to determine the model parameters.
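The trade-off studied by Hill and Marty can be sketched for their symmetric multi-core case: a chip with a budget of n base core equivalents is built from cores that each use r of them, and a core built from r base core equivalents delivers perf(r) times the base performance. The square root default for perf(r) follows the assumption used in [HM08]; the parameter values in the example are illustrative only.

```python
import math

def hill_marty_symmetric_speedup(f, n, r, perf=math.sqrt):
    """Speedup of a symmetric multi-core chip relative to one base core.

    f: parallelizable fraction of the work
    n: chip budget in base core equivalents
    r: base core equivalents spent per core (n / r cores in total)
    perf: sequential performance of a core built from r base core equivalents
    """
    p = perf(r)
    serial_time = (1.0 - f) / p        # one core executes the serial part
    parallel_time = f * r / (p * n)    # n / r cores share the parallel part
    return 1.0 / (serial_time + parallel_time)

# With r = 1 the model reduces to Amdahl's Law for n cores.
many_small_cores = hill_marty_symmetric_speedup(f=0.9, n=16, r=1)
one_large_core = hill_marty_symmetric_speedup(f=0.9, n=16, r=16)
```

Comparing different values of r for a fixed budget n shows how the serial fraction shifts the optimum towards fewer, more powerful cores.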
2.6.4.2 Simulation
Simulation [Bae10, Section 1.3.2]; [Jai91, Part V]; [Eis86, Chapter 3] is another technique to evaluate
the performance of computing systems. Arbitrary programs can be used to represent the target system.
Therefore, simulation is more versatile than analytical modeling. However, it is also more computation-
ally intensive. Simulations can operate at almost any level of detail, e.g., register transfer level, ISA level,
or system level. Furthermore, simulation is not limited to the steady state performance, i.e., the dynamic
behavior of applications can be investigated as well. Therefore, simulators typically have a representa-
tion of time and perform stepwise calculations of changes in the system state [Jai91, Section 24.5.3],
e.g., cycle by cycle.
Three main simulation techniques are used for performance evaluation: stochastic, trace-driven,
and execution-driven simulation [Bae10, Section 1.3.2]; [Eis86, Chapter 3]. Stochastic simulation
(also called Monte Carlo simulation) is based on random number generators that generate input parame-
ters for performance models with a specific distribution. This can for example be used to handle queuing
networks that cannot be solved analytically. Trace-driven simulation is based on trace files (see Sec-
tion 2.6.3.2) that are collected on one system. The traces are used as input for the simulator, which
replays them and manipulates execution times according to the hardware characteristics of the target sys-
tem. Execution-driven simulation uses the program itself as input and simulates how it interacts with the
hardware during its execution. This can be used to realistically reproduce the sequence of operations on
the micro-architectural level in order to analyze how applications utilize the hardware. In comparison to
performance analysis tools the attainable level of detail is much higher. However, a detailed software rep-
resentation of the target architecture is required to perform such a detailed reproduction of the processing
of instructions. The time required to simulate the execution of an application can be multiple orders of
magnitude higher than the actual execution time on real hardware [Bae10, Section 1.3.2]. Therefore,
simulating entire applications with realistic input data is typically not feasible.
Cachegrind [Sew+14] simulates how an application’s memory accesses interact with the memory hi-
erarchy. It generates statistics for cache hits and misses, which help to detect performance problems.
However, the simple two-level inclusive cache hierarchy is not representative of current multi-core
processors. Furthermore, the execution of multi-threaded applications is serialized [Sew+14, Section 2.7],
thus Cachegrind is not suitable to study the performance impact of shared resources.
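The basic idea of replaying memory accesses against a cache model can be shown with a minimal sketch of a single set-associative cache with LRU replacement. This is not Cachegrind's implementation; the class, its parameters, and the tiny cache geometry are assumptions for illustration.

```python
from collections import OrderedDict

class LRUCache:
    """Single-level set-associative cache model with LRU replacement."""

    def __init__(self, size_bytes, line_bytes, ways):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up one address; returns True on a hit, False on a miss."""
        line = address // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)    # evict the least recently used line
        cache_set[line] = True
        return False

def replay(cache, addresses):
    """Feed an address trace into the cache model; returns (hits, misses)."""
    for address in addresses:
        cache.access(address)
    return cache.hits, cache.misses

# Two passes over a 1 KiB array with one access per 64 byte line (fits into the cache).
fits = replay(LRUCache(1024, 64, 2), list(range(0, 1024, 64)) * 2)
# Two passes over a 2 KiB array (does not fit, LRU evicts every line before its reuse).
too_large = replay(LRUCache(1024, 64, 2), list(range(0, 2048, 64)) * 2)
```

The second case shows the behavior of repeated sequential passes over a data set that exceeds the cache capacity: with LRU replacement no access hits in this level.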
There are multiple simulators that perform more detailed simulations of multi-core and many-core pro-
cessors [Ahn+13, Table 1]. PTLsim [You07] is a cycle-accurate simulator for x86 micro-architectures
that is based on the Xen virtual machine monitor. It uses elaborate models for the processor cores. How-
ever, it only supports dedicated cache hierarchies per core and the “instant visibility” coherence model
does not consider the latency induced by the coherence protocol. The detailed simulation results in a
simulation speed of around 400000 cycles per second on a 2.2 GHz machine, i.e., simulating one second
takes more than an hour. MPTLsim [Zen+09] is an extended version of PTLsim, which adds support for
shared caches, a cycle-accurate coherence protocol model, and models for on-chip interconnects and the
memory controller. Unfortunately, this further reduces the simulation speed to several tens of thousands of
cycles per second. MARSSx86 [Pat+11] is also based on PTLsim for cycle-accurate micro-architecture
simulation. It uses QEMU, which operates completely in user space, instead of Xen. Like MPTLsim it
supports multi-core processors with shared caches and also models the coherence protocol.
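The slowdown quoted for PTLsim follows directly from the ratio of the native clock rate to the simulation speed. The following lines reproduce this arithmetic; the function names are chosen for illustration.

```python
def simulation_slowdown(native_rate, simulated_rate):
    """Factor by which simulation is slower than native execution."""
    return native_rate / simulated_rate

def wallclock_hours(simulated_seconds, native_rate, simulated_rate):
    """Wall-clock time in hours needed to simulate the given amount of native time."""
    return simulated_seconds * simulation_slowdown(native_rate, simulated_rate) / 3600.0

# PTLsim figures from the text: 2.2 GHz machine, about 400000 simulated cycles per second.
slowdown = simulation_slowdown(2.2e9, 400_000)      # 5500
hours_per_second = wallclock_hours(1.0, 2.2e9, 400_000)  # about 1.5 hours
```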
McSimA+ [Ahn+13] focuses on higher core counts and asymmetric many-core architectures. It sim-
ulates a directory-based coherence protocol as broadcasts are not practical in many-core architectures.
In order to enable the simulation of a high number of cores in reasonable time McSimA+ forgoes
full-system simulation, i.e., it does not consider system calls and operating system activities. Sniper [CHE11]
uses interval simulation, which is based on an analytical model instead of detailed micro-architecture
simulation, in order to improve simulation speed. This leads to simulation speeds of up to two million
instructions per second, which is considered fast but still means a slowdown of around 1000×
compared to native execution. The lack of detailed micro-architecture simulation also causes substantial
deviation from the actual execution time; the authors measure up to 23.8%.
3 Micro-benchmarks for Analyzing Memory Hierarchies
The objective of this work is to identify components in the memory hierarchy that limit the performance
of parallel applications. The characteristics of memory accesses can significantly influence the achievable
performance on NUMA systems [MG11a]. Therefore, the capabilities of the local memory hierarchy as
well as the NUMA characteristics need to be determined. Accurate analytical performance modeling and
simulation-based performance evaluation (see Section 2.6.4) require detailed models of the hardware.
Unfortunately, many details—e.g., the exact implementation of the coherence protocol and prefetching
mechanisms—are not comprehensively specified in the publicly available documentation. This leads to
considerable deviations from the actual performance even for very sophisticated modeling approaches
and simulation tools, e.g., [Hof+16, Table 1]; [You07, Table 1]. Furthermore, accurate simulation is
very time-consuming (see Section 2.6.4.2). Therefore, I decided to use micro-benchmarks—which are
commonly used to determine the capabilities of the hardware (see Section 2.6.2)—to analyze the memory
subsystem of shared memory systems. This chapter describes the design and implementation of the used
micro-benchmarks. It is largely based on several publications that introduce the benchmarks [Mol08,
Section 4.3]; [Mol+09; HMN09; Mol+10; MHS14; Mol+15].
3.1 Objective and Realization
Systems for high performance computing (HPC) are typically constructed as distributed memory systems
with hundreds or thousands of compute nodes [Top15]. The compute nodes usually are shared memory
systems—often with multiple processors (see Section 2.2.1). This work focuses on the performance of
the compute nodes as:
• Understanding the node performance is relevant for all parallel applications, including shared
memory as well as message passing and hybrid applications (see Section 2.5).
• Inter-node communication is extensively covered by existing tools (see Section 2.6.3.4).
The achievable performance of applications is influenced by various factors. Due to their latency and
limited bandwidth (see Table 2.2; Figure 2.6) cache and main memory accesses can constitute a sig-
nificant portion of the execution time [HP06, Figure 4.10]. Furthermore, the performance of shared
resources in multi-core processors (see Figure 2.5) does not necessarily scale with the number of
cores [MG11b, Figure 2], which can limit the performance of parallel applications. Remote memory
accesses in NUMA systems as well as cache coherence protocols also have negative effects on the per-
formance [Tan+13]; [HP06, Figure 4.17]. No existing benchmark covers all of these aspects.
STREAM [McC95] measures the local cache and memory bandwidths. It is not inherently NUMA-
aware. The memory affinity can be controlled using additional tools (e.g., numactl [Kle05]). However,
STREAM cannot be used to measure the bandwidth of remote cache accesses. The cache and memory
latency can be measured with lmbench [MS96]. This is implemented by repeatedly traversing a chain of
pointers until 1000000 loads have been performed. This results in multiple accesses to each cache line
for small data set sizes. Thus, remote cache accesses cannot be measured as the first access to a cache
line creates a local copy. For the same reason, coherence protocol transactions cannot be analyzed. These
restrictions also apply to the latency measurements of X-Ray [YPS05b, Section 5]. The work presented
by Schöne [Sch07, Section 6.5] includes latency measurements of core-to-core transfers. However, dif-
ferent coherence states are also not considered in this study. Hristea et al. [HLK97] consider different
coherence states in their analysis of core-to-core transfers, but only cover the MESI protocol.
The insufficient consideration of remote cache accesses and cache coherence protocols in the existing
benchmarks necessitates the development of new benchmarks. Therefore, x86-membench, a comprehensive
set of micro-benchmarks for characterizing the memory performance of shared memory
systems, has been developed. It supports latency and bandwidth measurements for local and remote
cache and memory accesses and complements them with a mechanism to control the coherence state of
the accessed data. The benchmarks are implemented as kernels for the BenchIT framework [Juc+04].
This performance measurement suite is designed to run micro-benchmarks on UNIX based systems.
BenchIT and x86-membench are available as open source from
https://fusionforge.zih.tu-dresden.de/frs/?group_id=885. The benchmarks use the Pthread library for
parallelization and rely on libnuma to
implement CPU and memory affinity in NUMA systems. The data placement and coherence state con-
trol mechanisms, which are described in Section 3.2 and Section 3.3, load data into certain cache levels
and enforce a particular coherence state prior to the measurement. The hardware detection mechanism
detailed in Section 3.4 provides the necessary information about the system. The measurement routines,
which are implemented in inline assembler, are described in Section 3.5.
Portability is a design goal of many benchmarks, e.g., [McC95; MS96; YPS05b; Dan+13], as they are
typically used to compare different systems (see Section 2.6.2). However, this results in severe restric-
tions. The benchmarks have to be implemented in a high level language, thus the achieved performance
is also influenced by the compiler. Furthermore, standard timer functions like gettimeofday() have
to be used, which often have a limited resolution. In contrast, the x86-membench kernels are tailored
to the 64 bit x86 ISA—the most widely used instruction set in contemporary HPC systems [Top15].
Therefore, critical parts can be implemented in assembler. This avoids any unanticipated compiler in-
fluence. Furthermore, special instructions like RDTSC and CLFLUSH can be used, which do not exist on
other architectures. The data exchange between host CPUs and hardware accelerators—which are used
in numerous HPC systems (see Section 2.2.1)—also is not covered by x86-membench as the accelerator
memory is typically not directly accessible for the host CPUs. However, communication between host
and accelerators is covered by existing tools, e.g., [Juc12]. Furthermore, accelerators themselves are
shared memory systems, which can be analyzed if they use the x86 instruction set [RH13; Fan+14].
A basic version of the benchmarks has been introduced in [Mol08, Section 4.3]. Since then the bench-
marks have been improved significantly to enable a more sophisticated analysis. The enhancements
include: an improved data placement mechanism (see Section 3.2), a newly developed coherence state
control mechanism (see Section 3.3), support for additional instruction set extensions (see Section 3.5.2
and Section 3.5.3), improved thread synchronization in multi-threaded benchmarks (see Section 3.5.3),
kernels that measure the throughput of arithmetic instructions (see Section 3.5.4), and support for hard-
ware performance counters (see Section 3.5.5). These extensions have been presented in [Mol+09;
HMN09; Mol+10; MHS14; Mol+15]. Several other tools have emerged since the development of
x86-membench started. BlackjackBench [Dan+13] also covers the bandwidth of core-to-core
transfers. Likwid-bench [THW12] measures the throughput of loop kernels, which includes the
aggregated bandwidth of parallel memory accesses. However, neither benchmark suite covers different
coherence states or the latency of remote cache accesses. To the best of my knowledge there is
no other benchmark suite that provides the same functionality as x86-membench, especially with respect
to the influence of the cache coherence protocol.
3.2 Data Placement
Caches are typically managed in hardware (see Section 2.1.2), i.e., there are no special load instructions
that place data in a certain cache level. Thus, the data placement has to rely on the cache replacement
strategy. It is implemented by accessing the whole data set multiple times in order to replace other data
in the caches. After that the data resides in the highest level of the memory hierarchy that is large enough
to accommodate all data and partially in higher levels unless all data fits into the highest level—the
L1 cache. Performing a latency measurement (see Section 3.5.1) after this form of placement shows a
mixture of effects from different cache levels as depicted by the dark green dots in Figure 3.1a. This
effect is more pronounced in x86-membench than in other benchmarks [MS96; YPS05b; Dan+13]. The
underlying measurement technique is the same: a linked list of pointers is traversed and the average
latency is calculated by dividing the total runtime by the number of accesses. However, lmbench, X-Ray,
and BlackjackBench traverse the chain of pointers multiple times. Only the first pass is affected by the
initial caching of the data while the hit rates in the subsequent passes depend on the reuse distance of
the cache lines. As soon as the whole data set does not fit into a certain cache level, most accesses
(starting with the second pass) go to the next level. In contrast, x86-membench dereferences every pointer
exactly once as the design goals to consider different coherence states and remote cache accesses preclude
multiple accesses. An optional cache flush routine can be used to place data in a certain cache level or
main memory. It removes the data from all cache levels that are too small for the whole data set. This
enables precise performance measurements for the individual levels in the memory hierarchy as depicted
by the orange squares in Figure 3.1a. The flush is implemented by repeated accesses to another memory area,
which replaces the data that is later used for the measurement. The size of this additional memory area
is chosen slightly larger than the capacity of the cache level that should be flushed.
Figure 3.1: Data placement mechanism: Multiple accesses by threads that are pinned to certain CPUs
ensure that the data is present in the caches of a certain core prior to the measurement. Optional cache
flushes can be used to clearly distinguish the individual levels of the memory hierarchy. These figures
are based on the PACT’09 [Mol+09] presentation¹, slides 8–11.
(a) With enabled cache flushes it is possible to clearly distinguish the performance of different cache levels.
(b) Placing data in other cores’ caches facilitates the evaluation of data transfers between cores.
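The single-pass pointer chasing described in this section can be sketched as follows. This is an illustration of the access pattern only: the actual measurement routines are written in assembler, and timings obtained from Python are dominated by interpreter overhead. The random ordering of the chain (Sattolo's algorithm, which yields a single cycle through all elements) is an assumption made here; the function names are hypothetical.

```python
import random
import time

def build_pointer_chain(num_elements, seed=0):
    """Create a chain in which element i stores the index of its successor.

    Sattolo's algorithm produces a permutation with exactly one cycle, so a
    traversal starting anywhere visits every element before it repeats.
    """
    rng = random.Random(seed)
    chain = list(range(num_elements))
    for i in range(num_elements - 1, 0, -1):
        j = rng.randrange(i)                 # 0 <= j < i
        chain[i], chain[j] = chain[j], chain[i]
    return chain

def visit_order(chain, start=0):
    """Return the order in which one pass dereferences the elements."""
    order, index = [], start
    for _ in range(len(chain)):
        index = chain[index]
        order.append(index)
    return order

def traverse_once(chain, start=0):
    """Dereference every pointer exactly once; returns (final index, average time per access)."""
    index = start
    begin = time.perf_counter()
    for _ in range(len(chain)):
        index = chain[index]                 # each load depends on the previous one
    elapsed = time.perf_counter() - begin
    return index, elapsed / len(chain)

chain = build_pointer_chain(1024)
final_index, average_seconds = traverse_once(chain)
```

Because every load address depends on the result of the previous load, the accesses cannot overlap, so the total runtime divided by the number of accesses approximates the latency of a single access.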
Data placement and measurement can be executed on different cores. This enables the analysis of
core-to-core transfers as shown in Figure 3.1b. Since contemporary server systems consist of multi-core
processors, including all cores in a measurement can result in redundant results and an unnecessarily long
runtime. Therefore, the measurement can be limited to a subset of the cores, e.g., one core per processor.
One thread is pinned to each selected core via sched_setaffinity() (sched.h). Each thread
allocates its own memory and controls the memory affinity via numa_set_membind() (numa.h). The
core and NUMA node selection is configured using the BENCHIT_KERNEL_CPU_LIST and
BENCHIT_KERNEL_MEM_BIND parameters (see Section 3.5.6). The measurement of core-to-core
transfers is implemented by Algorithm 3.1. This sequence is repeated for various data set sizes (selected by
BENCHIT_KERNEL_{MIN|MAX|STEPS}). Threads that are inactive perform a busy waiting loop that
periodically checks for signals from the master thread.
The data placement mechanism [Mol08, Section 4.3.1]; [Mol+09] is a modification of the approach pre-
sented in [Sch07]. It is implemented using Pthreads instead of OpenMP and supports more than two
cores by repeating the data placement prior to each measurement. Furthermore, each thread uses a dedi-
cated memory area for the data placement and all measurements are performed by a single thread instead
of performing successive measurements by different threads on the same memory area. Using multiple
buffers reduces the number of threads that access a single buffer, which reduces the influence of
hardware prefetchers. The original version of the benchmarks relies on the default first touch policy, which
allocates memory at the NUMA node of the core that first accesses a memory page [Lam06], to control
the threads’ memory affinity [Mol08, Section 4.1.4]. All threads allocate and initialize their own memory
buffers, thus the initial memory affinity is determined by the CPU affinity. However, this is not sufficient
on contemporary Linux systems as these are able to move pages between NUMA nodes depending on the
origin of later accesses [Rie14]. Such page migrations would hinder the analysis of NUMA
characteristics. Therefore, an enhanced memory affinity control mechanism has been implemented that enforces
memory affinity using libnuma [Kle05]. This also enables intentional remote allocations in order to
investigate the performance of the interconnection between the processors.
    for (i = 0; i < num_selected_cpus; i++) {
        thread on CPU[0]: if (i != 0) signal thread on CPU[i] to access its data set
        thread on CPU[i]: load data from CPU[i]'s buffer into local cache hierarchy
        thread on CPU[0]: if (i != 0) wait until thread on CPU[i] finishes data placement
        thread on CPU[0]: perform measurement using the buffer assigned to CPU[i]
    }
Algorithm 3.1: Data placement mechanism for measuring core-to-core transfers: The ids of the CPUs
that have been selected for the measurement are stored in the array CPU[num_selected_cpus]. For
each selected CPU a thread is created and pinned to that CPU. Furthermore, each thread allocates a
memory area (per default from the corresponding CPU’s local NUMA node). The measurements are
performed one after another by loading data into the caches of a certain CPU before measuring the
performance of accesses by the first CPU. The first measurement evaluates the local memory
hierarchy. The remaining iterations determine the performance of remote cache and memory accesses.
¹ https://fusionforge.zih.tu-dresden.de/plugins/mediawiki/wiki/benchit/images/9/91/2009_Molka_PACT_slides.pdf
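The hand-over between the master thread and the other threads in Algorithm 3.1 can be sketched with Python threads. This only illustrates the order of the steps: it uses threading.Event objects instead of the busy waiting loop of the benchmarks, it does not pin threads or allocate NUMA-local buffers, and data placement and measurement are replaced by log entries. All names are hypothetical.

```python
import threading

def run_core_to_core_sequence(num_selected_cpus):
    """Replay the control flow of the core-to-core measurement; returns a log of steps."""
    log = []
    start = [threading.Event() for _ in range(num_selected_cpus)]
    done = [threading.Event() for _ in range(num_selected_cpus)]

    def worker(i):
        start[i].wait()                                   # idle until the master signals
        log.append(f"cpu{i}: place data in buffer {i}")   # stands for loading the buffer
        done[i].set()                                     # report finished data placement

    workers = [threading.Thread(target=worker, args=(i,))
               for i in range(1, num_selected_cpus)]
    for thread in workers:
        thread.start()

    for i in range(num_selected_cpus):
        if i == 0:
            log.append("cpu0: place data in buffer 0")    # local measurement: master places data
        else:
            start[i].set()                                # signal thread i to access its data set
            done[i].wait()                                # wait until the placement has finished
        log.append(f"cpu0: measure buffer {i}")           # master always performs the measurement

    for thread in workers:
        thread.join()
    return log

steps = run_core_to_core_sequence(3)
```

Since the master waits for each placement to finish before it measures, only one thread is active at any time and the measurements do not disturb each other.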
3.3 Coherence State Control
The consideration of different coherence states is an essential capability of x86-membench. In the original
implementation, the benchmarks perform the data placement by reading and writing the data multiple
times [Mol08, Section 4.3.1]. Therefore, cached data always is in state Modified before the measurement.
With the added coherence state control mechanism [Mol+09; HMN09; MHS14; Mol+15] the impact of
the coherence protocol can be analyzed. The example in Figure 3.2 shows how the coherence state
influences the latency of cache accesses.
(a) Cache lines in state Exclusive (b) Cache lines in state Modified
Figure 3.2: Coherence state control mechanism: Data is cached with a certain coherence state, which
determines the required coherence state transitions as well as the source of the response (direct cache-
to-cache forwarding or via main memory). These figures are based on [Mol+09, Figure 2].
The coherence state control mechanism enhances the basic approach presented in [HLK97] by adding
support for contemporary cache coherence protocols. It supports the coherence protocols MESI, MESIF,
and MOESI (see Section 2.3.1). The coherence states are generated as follows, where thread N is pinned
to core N and thread M is pinned to another core [MHS14, Section 3.3]:
• Modified state in caches of core N is generated by:
1) thread N: writing the data (invalidates all other copies of the cache line)
• Exclusive state in caches of core N is generated by:
1) thread N: writing the data to invalidate copies in other caches,
2) thread N: invalidating its cache using the CLFLUSH instruction²,
3) thread N: reading the data
• Invalid state in all caches is generated by:
1) thread N: writing the data to invalidate copies in other caches,
2) thread N: invalidating its cache using the CLFLUSH instruction
• Shared state in caches of core N is generated by:
1) thread N: caching data in Exclusive state,
2) thread M: reading the data
• Forward state in caches of core N is generated by:
1) thread M: caching data in Exclusive state,
2) thread N: reading the data
• Owned state in caches of core N is generated by:
1) thread N: caching data in Modified state,
2) thread M: reading the data,
3) thread N: reading the data
• ModifiedUnWritten state in caches of core N is generated by:
1) thread M: writing the data,
2) thread N: reading the data
State Forward is only supported by Intel processors since the Nehalem generation. It is similar to the
state Shared. In both cases, data is read by two cores. The state is determined by the order of the
accesses. The last core that reads the data receives it in state Forward while the older copy changes its
state to Shared.
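The sequences listed above can be traced with a toy model. The following C sketch is an illustration only and not part of x86-membench. It tracks a single cache line in two caches (index 0 is core N, index 1 is the helper core M) under simplified MESIF rules that are assumptions of this sketch: a write leaves the writer in Modified and invalidates the other copy, CLFLUSH removes all copies, a read miss without another copy yields Exclusive, and a read miss that finds another copy yields Forward while the older copy becomes Shared. All type and function names are hypothetical.

```c
#include <assert.h>

/* Toy model of one cache line in two caches under simplified MESIF rules.
 * Index 0 is core N (the core under test), index 1 is helper core M. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED, FORWARD } coh_state;
typedef struct { coh_state core[2]; } line_t;

/* A write leaves the writer in Modified and invalidates the other copy. */
static void op_write(line_t *l, int c) {
    assert(c == 0 || c == 1);
    l->core[c] = MODIFIED;
    l->core[1 - c] = INVALID;
}

/* CLFLUSH removes the line from all caches (dirty data goes to memory). */
static void op_flush(line_t *l) {
    l->core[0] = INVALID;
    l->core[1] = INVALID;
}

/* A read hit changes nothing. A read miss yields Exclusive if no other copy
 * exists; otherwise the reader gets Forward and the older copy becomes Shared. */
static void op_read(line_t *l, int c) {
    assert(c == 0 || c == 1);
    if (l->core[c] != INVALID) return;
    if (l->core[1 - c] == INVALID) {
        l->core[c] = EXCLUSIVE;
    } else {
        l->core[c] = FORWARD;
        l->core[1 - c] = SHARED;
    }
}

/* Exclusive in core c: write, flush, read. */
static void make_exclusive(line_t *l, int c) {
    op_write(l, c);
    op_flush(l);
    op_read(l, c);
}

/* Shared in core N: N caches the line in Exclusive, then M reads it. */
static coh_state gen_shared(void) {
    line_t l = {{INVALID, INVALID}};
    make_exclusive(&l, 0);
    op_read(&l, 1);
    return l.core[0];
}

/* Forward in core N: M caches the line in Exclusive, then N reads it. */
static coh_state gen_forward(void) {
    line_t l = {{INVALID, INVALID}};
    make_exclusive(&l, 1);
    op_read(&l, 0);
    return l.core[0];
}
```

In this model gen_shared() and gen_forward() differ only in which core reads last, which mirrors the statement that the order of the accesses determines the state.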
State Owned is only available on AMD processors. AMD family 15h processors also support the additional
MuW state [Amd13b, Section 1.5.2]. Support for the extended MOESI protocol is implemented
based on information provided by Lepak et al. [Lep+12]. The sequence that generates the Owned state is
compatible with the conventional MOESI protocol as well as its extended version (see Section 2.3.1.3).
In the conventional MOESI protocol the three steps result in the states (core N/other core):
1): (*/*) → (M/I)   2): (M/I) → (O/S)   3): no change
In the extended MOESI protocol steps 2) and 3) have different results:
1): (*/*) → (M/I)   2): (M/I) → (I/MuW)   3): (I/MuW) → (O/S)
However, the final states are identical.
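The two transition chains can be written down as a small C sketch that only encodes the transitions stated above for the Owned sequence. It is not a general MOESI model, and the type and function names are hypothetical.

```c
#include <assert.h>

typedef enum { ST_I, ST_S, ST_O, ST_M, ST_MUW } moesi_state;
typedef struct { moesi_state n, other; } pair_t;   /* (core N / other core) */

/* Step 1: thread N writes the data. */
static pair_t step_write_n(void) {
    return (pair_t){ST_M, ST_I};
}

/* Step 2: thread M reads the data that core N holds in state Modified. */
static pair_t step_read_m(pair_t p, int extended) {
    if (p.n == ST_M && p.other == ST_I)
        return extended ? (pair_t){ST_I, ST_MUW} : (pair_t){ST_O, ST_S};
    return p;
}

/* Step 3: thread N reads the data again. */
static pair_t step_read_n(pair_t p) {
    if (p.n == ST_I && p.other == ST_MUW)
        return (pair_t){ST_O, ST_S};
    return p;   /* read hit in core N: no change */
}

/* Runs all three steps; extended selects the extended MOESI protocol. */
static pair_t owned_sequence(int extended) {
    assert(extended == 0 || extended == 1);
    pair_t p = step_write_n();
    p = step_read_m(p, extended);
    return step_read_n(p);
}
```

Both owned_sequence(0) and owned_sequence(1) end in (O/S), which is why a single three-step sequence serves both protocol versions.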
The result of the Shared state on AMD processors is influenced by the protocol version. The first step
generates the states (E/I) which the second step changes into (S/S) for the original MOESI protocol
and (S/O) in the extended MOESI protocol. Thus, the coherence state on the target core is identical
in both protocols. However, the always migrate approach of the extended protocol results in Owned
copies in the caches of the helper thread. The parameter BENCHIT_KERNEL_FLUSH_SHARED_CPU
(see Section 3.5.6) can be used to remove the Owned cache lines before the measurement.
² The usage of CLFLUSH can be disabled (see BENCHIT_KERNEL_DISABLE_CLFLUSH in Section 3.5.6).
3.4 Hardware Detection
The benchmarks require detailed information about the hardware configuration of the examined system.
This includes:
• number of available CPUs
• operating frequency
• number of cache levels as well as their respective size and associativity
• number of TLB levels and their respective number of entries
• supported ISA extensions
An extended version of the hardware detection mechanism presented in [Mol08, Section 4.1.3] provides
the required information. Support has been added for AMD family 12h and 15h as well as Intel Nehalem,
Westmere, Sandy Bridge, Ivy Bridge and Haswell based processors. Furthermore, the accuracy of the
clock rate measurement has been increased.
The number of active CPUs is detected via sysconf(_SC_NPROCESSORS_ONLN) (unistd.h). If
this is not available, the names of the directories in /sys/devices/system/cpu are parsed, considering
/sys/devices/system/cpu/cpu*/online to identify disabled CPUs. Algorithm 3.2 determines the processor’s clock rate
using the time-stamp counter (TSC) [Int14b, Volume 2, Section 4.2] and gettimeofday() (sys/time.h).
This method ensures that the measurement error is limited to 0.05% of the actual TSC frequency. The
measured clock rate is used to determine time durations based on start and stop time-stamps. This
requires a “constant rate” time-stamp counter, i.e., the TSC readings need to be independent of the ACPI
P-states (see Section 2.1.4). In that case the reported performance in ns and GB/s is correct even if the
actual CPU clock rate—which is influenced by DVFS—differs from the TSC frequency. If the TSC has
a variable rate, the clock rate needs to be fixed (e.g., by disabling the power saving features) in order
to ensure correct results. Contemporary x86 processors export all the required information about the
cache and TLB hierarchy via the CPUID instruction [Amd15b, Appendix E]; [Int14b, Volume 2, Section 3.2].
Thus, there is no need to measure the cache and TLB parameters with micro-benchmarks like
Servet [Gon+10] or X-Ray [YPS05a]. However, AMD family 10h and 15h processors that consist of two
dies in one package report the L3 size per package, which is divided by two in order to determine the L3
size per die. The supported ISA extensions can also be detected via the CPUID instruction.
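Two of the detection steps described above can be illustrated with a minimal C sketch. It only shows the sysconf() query and the per-die correction of the L3 size. The sysfs fallback and the CPUID queries are omitted, and the function names are hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Returns the number of online CPUs, or -1 if sysconf() cannot report it
 * (in which case a fallback such as parsing sysfs would be needed). */
static long online_cpus(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n >= 1) ? n : -1;
}

/* L3 size per die for a package that consists of `dies` dies and
 * reports the L3 size of the whole package. */
static unsigned long l3_per_die(unsigned long l3_per_package, unsigned dies) {
    assert(dies >= 1);
    return l3_per_package / dies;
}
```

For a package with two dies, l3_per_die(reported_size, 2) performs the division by two mentioned above.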
    i = 1;
    do {
        start_tsc_outer = asm_rdtsc();
        gettimeofday(&start, NULL);
        start_tsc_inner = asm_rdtsc();
        /* busy-wait until i * 1000000 TSC ticks have passed */
        do { temp_tsc = asm_rdtsc(); } while (temp_tsc < start_tsc_inner + i * 1000000);
        end_tsc_inner = asm_rdtsc();
        gettimeofday(&end, NULL);
        end_tsc_outer = asm_rdtsc();
        upper_bound = end_tsc_outer - start_tsc_outer;
        lower_bound = end_tsc_inner - start_tsc_inner;
        i++;
    } while (upper_bound > 1.001 * lower_bound);
    cycles = (upper_bound + lower_bound) / 2;
    usecs = (end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec);
    frequency = cycles / usecs;
Algorithm 3.2: Measurement of the CPU clock rate: The measurement routines (see Section 3.5) use
the CPU’s clock rate to derive time durations from the measured number of CPU cycles. The clock
rate in MHz (million cycles per second) is computed by dividing the counted number of clock cycles
by the elapsed time in µs. The time duration is determined via gettimeofday(). The cycles
are counted with asm_rdtsc()—an inline assembler routine that executes the RDTSC instruction.
The measurement is repeated with increasing time periods between start and end time-stamp until the
maximal error is below 0.05%.
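The error bound follows from the termination condition of the outer loop. As a short derivation, with symbols introduced here for illustration only: let L be lower_bound, U be upper_bound, and C the unknown number of TSC cycles that elapsed between the two gettimeofday() readings. Since both readings are taken between the outer and the inner time-stamps, C lies between L and U:

```latex
L \le C \le U \le 1.001\,L
\quad\Longrightarrow\quad
\left| \frac{U + L}{2} - C \right|
\le \frac{U - L}{2}
\le \frac{0.001\,L}{2}
= 0.0005\,L
\le 0.0005\,C
```

The midpoint (U + L)/2 therefore deviates from C by at most 0.05%.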
3.5 Measurement Routines
The micro-benchmarks are designed to run on 64 bit x86 processors. Inline assembler is used to generate
instruction sequences that cannot be generated using a high level language like C, e.g., transferring data
into registers without performing any computations with it. However, only the measurement routines and
in part the data placement and coherence state control mechanisms (see Section 3.2 and Section 3.3) are
implemented in assembler. The initialization and coordination of the threads is implemented in the high
level language C. The design goals to consider different coherence states and remote cache accesses pose
a significant challenge. Each access to a cache line can change its coherence state and creates a local copy
if it does not exist yet. Thus, each cache line can only be accessed once during the measurement, which
results in extremely short durations if small data sets are used. Consequently, as proposed in [Sch07] the
high-resolution time-stamp counter (TSC) is used to measure durations. However, a small overhead still
exists, which is noticeable in some results.
3.5.1 Latency Benchmark
Like other latency benchmarks [MS96; YPS05b; Dan+13], the latency measurement uses pointer-chasing
to determine the latency of memory accesses, i.e., each load operation provides the address for the next
access. At least 24 loads at randomly selected addresses are performed for each measurement, which is
significantly less than in other latency benchmarks. The number of accesses is restricted by the data set
size as x86-membench accesses each cache line only once. A configurable minimal distance between the
accesses avoids the reuse of cache lines. The default is a distance of at least 512 bytes between accesses,
which also reduces the influence of hardware prefetchers. Therefore, the minimal data set size is 12
KiB (24 × 512 bytes), which typically fits into the L1 cache. The number of accesses is increased for larger data set
sizes in order to reduce variation in the results. This is implemented by repeating the inner loop until
the maximal number of accesses that fit into the data set or a predefined upper limit—which defaults to
2400—is reached.
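The pointer-chasing principle can be sketched in portable C. This is an illustration under stated assumptions, not the measurement routine of x86-membench, which is written in inline assembler. The access order is passed in as a permutation, the distance of 512 bytes is the default mentioned above, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DISTANCE 512  /* bytes between accesses, the default named above */

/* Links the slots of `buf` into one closed chain that follows `order`,
 * a permutation of 0..n-1. Slot i starts at byte offset i * DISTANCE and
 * stores the address of the next slot. `buf` must be pointer-aligned. */
static void build_chain(unsigned char *buf, const size_t *order, size_t n) {
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        void **slot = (void **)(buf + order[i] * DISTANCE);
        *slot = buf + order[(i + 1) % n] * DISTANCE;
    }
}

/* Follows the chain for n loads. Each load provides the address of the
 * next one, so the loads cannot overlap. */
static void *chase(void *start, size_t n) {
    void *p = start;
    for (size_t i = 0; i < n; i++)
        p = *(void **)p;
    return p;
}
```

With 24 slots and a distance of 512 bytes the chain covers 24 × 512 = 12288 bytes, i.e., the minimal data set size of 12 KiB. In a real measurement the loop would be enclosed by time-stamp readings and each slot would be visited only once.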
Algorithm 3.3 [Mol+09, Section IV.A]; [HMN09, Section 3] implements a latency measurement of
CPU 0 accessing memory that was previously used by CPU N (N>0). Step 1 ensures that TLB entries
for the accessed data are created in the same way for measurements concerning local as well as remote
cache accesses. Step 3 places data in the caches of core N in Exclusive or Modified state as described
in Section 3.3. Generating cache lines in state Shared, Forward, Owned, or MuW requires accesses by
another CPU, which is not considered here. Step 5 optionally performs the cache flush routine described
in Section 3.2. Step 6 carries out the latency measurement. The duration of the whole access sequence is
derived from TSC readings at the beginning and the end of the measurement routine. Memory barriers
(MFENCE³) are used to ensure that all data transfers are performed between the start and the stop
time-stamp. The duration is divided by the number of accesses to determine the average latency of a single
access. The measurement is performed multiple times for each data set size and returns the minimal
measured value. The latency is reported in nanoseconds and clock cycles. If the actual operating
frequency differs from the TSC frequency (e.g., if DVFS is used, see Section 2.1.4), the reported number
of cycles is incorrect. The optional parameter BENCHIT_KERNEL_CPU_FREQUENCY can be used
to obtain correct cycle values if a fixed clock rate other than the one reported by the hardware detection
(see Section 3.4) is used.
³ MFENCE is not an official serializing instruction, but has proven to be sufficient. If CPUID is used in its place, the accuracy
of the measurement is reduced because of its high overhead.
    1 thread on CPU 0: touch the memory in order to warm up the TLB
    2 thread on CPU 0: signal thread on CPU N to access its data set
    3 thread on CPU N: perform data placement (-> Exclusive or Modified state in CPU N)
    4 thread on CPU 0: wait until thread on CPU N finishes data placement
    5 both threads: perform cache flushes (optional)
    6 thread on CPU 0: measure latency using memory assigned to CPU N
Algorithm 3.3: Latency measurement of remote cache accesses: A thread that is pinned to CPU N (N>0)
performs the data placement before the thread pinned to CPU 0 performs the measurement.
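The relation between the measured TSC cycles, the reported latency in nanoseconds, and a manually specified clock rate can be sketched as follows. This is a minimal illustration of the arithmetic. How x86-membench applies BENCHIT_KERNEL_CPU_FREQUENCY internally is not shown here, and the function names are hypothetical.

```c
#include <assert.h>

/* Average latency in TSC cycles per access. */
static double cycles_per_access(unsigned long long tsc_start,
                                unsigned long long tsc_stop,
                                unsigned long long accesses) {
    assert(accesses > 0 && tsc_stop >= tsc_start);
    return (double)(tsc_stop - tsc_start) / (double)accesses;
}

/* Cycles at a clock rate given in MHz (cycles per microsecond) -> ns. */
static double cycles_to_ns(double cycles, double mhz) {
    assert(mhz > 0.0);
    return cycles * 1000.0 / mhz;
}

/* Nanoseconds -> cycles of a core that runs at a fixed rate core_mhz. */
static double ns_to_core_cycles(double ns, double core_mhz) {
    return ns * core_mhz / 1000.0;
}
```

For example, 240 TSC cycles for 24 accesses at a TSC rate of 2500 MHz correspond to 10 TSC cycles or 4 ns per access. A core that is fixed at 2000 MHz spends only 8 of its own cycles in that time, which is the kind of discrepancy described above.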
An integrated random number generator, a mixed linear congruential generator (LCG) [Jai91, Section 26.2]
derived from BenchIT’s bi_random48() function, generates a new sequence of addresses
for each individual measurement. It generates a random permutation of the possible addresses without
costly checks if a generated address has already been selected. This is done during the data placement
(Step 3). Thus, in case of remote cache accesses the hardware prefetchers of the measuring CPU have no
way of knowing the selected addresses beforehand. The random number sequence is calculated on the
basis of Equation (3.1):
X_{n+1} = (a \times X_n + b) \bmod m    (3.1)
LCGs have a maximal period of m (if a, b, and m meet certain conditions [Jai91, Section 26.2]), i.e., a
sequence of all values from 0 to m − 1 can be generated. For b = 0 a full cycle of all values from 1 to
m − 1 is generated if m is prime and a is a primitive root of m [Jai91, Section 26.2.3], i.e., the period is
m − 1. The period remains m − 1 for b ≠ 0 if m is prime and a is a primitive root of m [Mas98]. In that
case, all values from 0 to m − 1 are created with the exception of one fixed point Xfix as described by
Equation (3.2)⁴:
X_{fix} = (a \times X_{fix} + b) \bmod m    (3.2)
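Equation (3.2) can also be solved in closed form, which shows why exactly one fixed point exists for a prime modulus. The following derivation is added for illustration and assumes that a − 1 is not a multiple of m:

```latex
X_{fix} \equiv a\,X_{fix} + b \pmod{m}
\;\Longleftrightarrow\;
(a - 1)\,X_{fix} \equiv -b \pmod{m}
\;\Longleftrightarrow\;
X_{fix} \equiv -b\,(a - 1)^{-1} \pmod{m}
```

Since m is prime, the inverse of a − 1 modulo m exists and is unique. For a = 71, b = 821, and m = 1847 this gives 70 · 859 = 60130 ≡ 1026 ≡ −821 (mod 1847), i.e., Xfix = 859.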
In order to generate a random access sequence, the random number generator is initialized with a seed
value (tv_usec determined by gettimeofday()) and a maximal value randmax (the number of accesses
that fit into the data set size). The parameters of the LCG are determined as follows:
1) set m to the smallest prime number greater than randmax
2) check prime factors of m − 1 for primitive roots of m
3) if a primitive root > √m is found use it as a,
   else increase m to the next larger prime number and go back to 2)
4) set b to a value between m/4 and 3m/4: b = m/4 + (seed mod (m/2))
5) determine fixed point Xfix using Algorithm 3.4
6) set X0 to seed mod m, if X0 = Xfix set X0 to 0.
A call to the function _random() returns the next random value in the sequence. It repeatedly performs
iterations of (3.1) until the latest random value is within the limit defined by randmax. Furthermore, it
hides the gap in the sequence that is created by the fixed point Xfix by subtracting 1 from the return
value if the current random value is > Xfix. Therefore, randmax calls of _random() create a random
permutation of all values from 0 to randmax − 1.
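The behavior of the generator can be sketched in C as follows. The sketch assumes that suitable parameters a, b, and m are already known, so the search for a primitive root is omitted. It further assumes that the limit check is applied after the gap has been closed. The names lcg_t, lcg_init, and lcg_next are hypothetical and do not refer to the actual _random() implementation.

```c
#include <assert.h>

typedef struct {
    long long a, b, m;   /* LCG parameters, m prime */
    long long x;         /* current value X_n */
    long long xfix;      /* fixed point, never produced by the sequence */
    long long randmax;   /* values 0..randmax-1 are returned */
} lcg_t;

/* Fixed point search following Algorithm 3.4. Negative candidates are
 * skipped, which is a precaution of this sketch. */
static long long lcg_fixed_point(long long a, long long b, long long m) {
    for (long long x = 0; x <= a; x++) {
        long long f1 = (x * m - b) / (a - 1);
        if (f1 >= 0 && f1 == (f1 * a + b) % m) return f1;
    }
    return -1;  /* no fixed point found */
}

static lcg_t lcg_init(long long a, long long b, long long m,
                      long long seed, long long randmax) {
    assert(a > 1 && randmax > 0 && m > randmax && seed >= 0);
    lcg_t g = {a, b, m, seed % m, lcg_fixed_point(a, b, m), randmax};
    if (g.x == g.xfix) g.x = 0;   /* step 6 of the parameter selection */
    return g;
}

/* Next value: iterate Equation (3.1), close the gap left by the fixed
 * point, and retry while the result is not below randmax. */
static long long lcg_next(lcg_t *g) {
    for (;;) {
        g->x = (g->a * g->x + g->b) % g->m;
        long long v = (g->xfix >= 0 && g->x > g->xfix) ? g->x - 1 : g->x;
        if (v < g->randmax) return v;
    }
}
```

With the small parameters a = 3, b = 3, m = 7 (fixed point 2), a seed of 0, and randmax = 5, five calls of lcg_next() return 2, 4, 3, 1, 0, i.e., a permutation of the values 0 to 4.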
    1 fix = -1; // -1 indicates that no fixed point exists
    2 for (x = 0; x <= a; x++) {
    3     f1 = ((x * m) - b) / (a - 1);
    4     f2 = ((f1 * a) + b) % m;
    5     if (f1 == f2) { fix = f1; break; }
    6 }
Algorithm 3.4: Determining the fixed point in LCGs with prime modulus: The number of possible fixed
points is limited: fix = a ∗ fix + b | fix = (a ∗ fix + b) − m | fix = (a ∗ fix + b) − 2m | ... |
fix = (a ∗ fix + b) − (a ∗ m). This is reflected by the for loop in line 2. The loop counter (x) is
carried over to line 3, which is derived from Equation (3.2) as follows: f1 = (a ∗ f1 + b) − (x ∗ m)
→ 0 = (a − 1) ∗ f1 + b − (x ∗ m) → f1 = ((x ∗ m) − b)/(a − 1). Line 4 performs one iteration of
the LCG algorithm (see Equation (3.1)) to check if f1 is a fixed point. If so, line 5 aborts the loop.
⁴ Some examples: a = 71, b = 821, m = 1847 → Xfix = 859; a = 2371, b = 120017, m = 355651 → Xfix = 25010;
a = 989011, b = 641804515, m = 900000011 → Xfix = 460309667. Full-period LCGs with composite m do not
show this behavior, i.e., Algorithm 3.4 does not find a fixed point.
    "rdtsc; shl $32, %%rdx; add %%rdx, %%rax;"   // 1st time-stamp -> RAX
    "mfence;"                                    // lightweight serialization
    /* measurement loop */
    "_work_loop_movdqa_2:"
    "movdqa 0(%%rbx), %%xmm0; movdqa 16(%%rbx), %%xmm1;"
    "movdqa 32(%%rbx), %%xmm0; movdqa 48(%%rbx), %%xmm1;"
    [...]                                        // 29 more lines like this
    "movdqa 992(%%rbx), %%xmm0; movdqa 1008(%%rbx), %%xmm1;"
    "add $1024, %%rbx;"
    "sub $1, %%rcx;"
    "jnz _work_loop_movdqa_2;"
    "mfence;"                                    // lightweight serialization
    "mov %%rax, %%rbx;"                          // 1st time-stamp -> RBX
    "rdtsc; shl $32, %%rdx; add %%rdx, %%rax;"   // 2nd time-stamp -> RAX
Algorithm 3.5: Implementation of the single-threaded bandwidth measurement: This measurement rou-
tine measures the read bandwidth using 128 bit SSE instructions with two registers. The load instruc-
tions use the address stored in RBX and an increasing offset in order to generate consecutive memory
addresses. The base pointer is increased by 1024 bytes in each iteration. The data set size is specified
by the loop counter in register RCX. The measurement is surrounded by time-stamp counter accesses
that measure the duration. The benchmarks can also be configured to use CPUID for serialization.
3.5.2 Single-threaded Bandwidth Benchmarks
The phases of the single-threaded bandwidth benchmarks are very similar to the latency benchmark
(see Algorithm 3.3). Steps 1 to 5 are identical, except that the generation of the random pointer chain is
omitted. The measurement routines (step 6) perform sequential accesses to the whole data set in order
to determine the bandwidth that is available for a single core that performs reads, writes, or a mixture of
reads and writes. The data placement enables bandwidth measurements for accesses to the local cache
hierarchy, core-to-core transfers as well as inter-processor communication. The data set size determines
the cache level that is used for the measurement. The buffer is accessed only once in order to facilitate
the analysis of remote cache accesses and the impact of the cache coherence protocol.
The measurement routines are implemented as extensively unrolled loops that access 1024 bytes in the
loop body as shown in Algorithm 3.5. Transport instructions are used to load ((V)MOVDQA) or store
((V)MOVDQA, (V)MOVNTDQ) data without performing any computation in order to avoid being limited
by arithmetic operations. Multiple widths of load and store instructions are supported to assess the
influence of SIMD instructions on the memory performance. The peak performance can only be reached
with a sufficient number of independent instructions [Duc+08, Fig. 8]. Therefore, the number of used
registers is configurable via the parameter BENCHIT_KERNEL_BURST_LENGTH (see Section 3.5.6).
The different registers are used alternately in order to maximize the number of consecutive independent
operations. With respect to the write bandwidth, it is important to note that contemporary x86-64
processors do not allow one core to write into another core’s cache. Any write access to a cache line that
is not found in the local cache in either Exclusive or Modified state triggers a read for ownership (RFO)
request. This invalidates all other copies (see “Snoop Write Hit” in Figure 2.11, 2.12, 2.13, and 2.14)
and grants exclusive ownership to the requesting core. Therefore, results of the benchmarks that include
writes show a combination of two effects: first, reading the data from its original location, and second,
writing to the local cache. This has to be considered in the interpretation of the results.
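How a bandwidth value follows from the two time-stamps can be sketched in a few lines of C. The sketch assumes that the TSC rate is known in MHz (see Section 3.4) and that GB/s denotes 10^9 bytes per second. Both function names are hypothetical.

```c
#include <assert.h>

/* Bytes transferred by `iterations` passes through a loop body that
 * touches `bytes_per_iteration` bytes (1024 in Algorithm 3.5). */
static unsigned long long bytes_moved(unsigned long long iterations,
                                      unsigned bytes_per_iteration) {
    return iterations * bytes_per_iteration;
}

/* Bandwidth in GB/s (10^9 bytes per second) from a TSC cycle count and
 * the TSC rate in MHz (cycles per microsecond). */
static double bandwidth_gbs(unsigned long long bytes,
                            unsigned long long cycles, double tsc_mhz) {
    assert(cycles > 0 && tsc_mhz > 0.0);
    double usecs = (double)cycles / tsc_mhz;
    return (double)bytes / usecs / 1000.0;
}
```

For example, 32 loop iterations move 32 KiB. If this takes 1024 cycles at a TSC rate of 2000 MHz, the duration is 0.512 µs and the resulting bandwidth is 64 GB/s.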
3.5.3 Aggregated Bandwidth Benchmarks
The aggregated bandwidth benchmarks measure the achievable bandwidth for a variable number of
threads that perform concurrent memory accesses. This is particularly helpful to determine the characteristics
of shared caches and memory controllers. Algorithm 3.6 [Mol+09, Section IV.A]; [HMN09,
Section 3] implements the measurement. The aggregated bandwidth benchmarks only support the coherence
states Modified, Exclusive, and Invalid as those can be created independently by each thread. Steps
3 through 8 are implemented as a continuous block of inline assembler. The synchronization mechanism
(steps 3 to 5) ensures that all threads are tightly synchronized. The memory access sequences used in step
7 are identical to the single-threaded bandwidth benchmarks (see Section 3.5.2), i.e., each thread
consecutively accesses a variable amount of data. Measurement routines are available for the read and write
bandwidth as well as the bandwidth of combined reads and writes. The supported widths of the individual
data transfers are 64 bit (scalar), 128 bit (packed SSE), and 256 bit (packed AVX). The earliest of the
recorded start time-stamps (t_begin) and the latest of the end time-stamps (t_end) are used to calculate
the accumulated bandwidth to make sure that all accesses occurred between the selected time-stamps.
The memory affinity can be specified by the parameter BENCHIT_KERNEL_MEM_BIND (see Section 3.5.6).
This can be used to measure the interconnect bandwidth by binding all threads to CPUs in
one NUMA node and allocating memory from another NUMA node.
    1 all threads: access data (-> Modified, Exclusive, or Invalid)
    2 all threads: flush caches (optional)
    3 all threads: barrier synchronization using cmpxchg instructions
    4 master thread: define start time in future
    5 all threads: poll time-stamp counter until start time is reached
    6 all threads: measure t_begin
    7 all threads: access data (read / write / read + write)
    8 all threads: measure t_end
    9 duration = max(t_end) - min(t_begin)
Algorithm 3.6: Multi-threaded bandwidth measurement: After all threads have arrived at the barrier, the
master thread defines a start time in the future (current TSC value + 50000). All threads wait until
this start time is reached before they begin their measurement.
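The evaluation of the per-thread time-stamps, i.e., duration = max(t_end) − min(t_begin), can be sketched in C as follows. The sketch assumes that every thread has stored its two TSC readings in shared arrays and that all threads process data sets of equal size. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* duration = max(t_end) - min(t_begin) over all threads, in TSC cycles */
static unsigned long long aggregate_duration(const unsigned long long *t_begin,
                                             const unsigned long long *t_end,
                                             size_t threads) {
    assert(threads >= 1);
    unsigned long long first = t_begin[0], last = t_end[0];
    for (size_t i = 1; i < threads; i++) {
        if (t_begin[i] < first) first = t_begin[i];
        if (t_end[i] > last) last = t_end[i];
    }
    return last - first;
}

/* Accumulated bandwidth in bytes per TSC cycle. */
static double aggregate_bytes_per_cycle(unsigned long long bytes_per_thread,
                                        size_t threads,
                                        unsigned long long duration) {
    assert(duration > 0);
    return (double)bytes_per_thread * (double)threads / (double)duration;
}
```

Using the widest interval is the conservative choice: a thread that starts slightly early or finishes late extends the duration and can only lower the reported bandwidth.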
3.5.4 Throughput of Arithmetic Instructions
The benchmark for the throughput of arithmetic instructions [Mol+10] is an adapted version of the
multi-threaded bandwidth benchmark (see Section 3.5.3). In these routines, the (V)MOVDQA instruction
used for loading data is replaced by instructions that additionally perform arithmetic operations.
This combines the original load operations with arithmetic operations without significantly changing the
benchmark code. The arithmetic instructions comprise 64 bit (scalar) as well as 128 bit (SSE) and 256 bit
(AVX) instructions. Stores and floating point load instructions are also considered. Fused multiply-add
instructions (3-address and 4-address format) are supported as well. Table 3.1 lists the available
benchmarks. The load variants move data from a cache or main memory location into registers but
perform no operation on it. The and, add, mul, div, and sqrt benchmarks perform one operation on each
operand. The multiply-add benchmark consists of a multiply instruction using a memory operand and
a subsequent addition using the result from the register. The fused multiply-add benchmark uses only
one memory operand per fma-instruction as well. Thus, two floating point operations are executed per
memory operand by the multiply-add benchmarks. The results of the calculations are not stored back to
memory. The measurement routines can also be configured to use only register operands as input.
The purpose of this benchmark is not only to measure the performance of various operations. It is meant
to investigate differences in the power consumption as well [Mol+10]. However, the power consumption
of the system fans as well as the processor itself depends on the temperature [LHL05]. Thus, in order
to perform reliable power measurements the runtime has to be extended to multiple minutes in order to
reach a stable temperature. However, increasing the data set size is not an option as this would limit the
ability to measure the throughput of arithmetic operations using cached data. Therefore, the runtime of
the measurement routine itself is increased. This is implemented by accessing the whole data set multiple
times. In this case, the replacement strategy of the caches ensures that data is mostly evicted back to the
intended cache level before it is accessed again. However, the throughput benchmarks do not consider
different coherence states as the initial coherence state of the data cannot be preserved.
Table 3.1: Measurement routines of the throughput kernel: Like the aggregated bandwidth benchmarks,
the load and store routines perform parallel memory accesses without using the data for any computation.
The remaining routines perform arithmetic operations in addition to loading the data into the
registers.

The columns scalar, packed 128 bit, and packed 256 bit list the instruction (required ISA).

data type                       | operation          | scalar         | packed 128 bit       | packed 256 bit
64 bit integer                  | load               | MOV (Intel64)  | MOVDQA (SSE2)        | VMOVDQA (AVX)
                                | store              | MOV (Intel64)  | MOVDQA (SSE2)        | VMOVDQA (AVX)
                                | and                | AND (Intel64)  | PAND (SSE2)          | -
                                | add                | ADD (Intel64)  | PADDQ (SSE2)         | -
                                | mul                | IMUL (Intel64) | PMULDQ (SSE4.1)      | -
single precision floating point | load               | -              | MOVAPS (SSE)         | VMOVAPS (AVX)
                                | store              | -              | MOVAPS (SSE)         | VMOVAPS (AVX)
                                | and                | -              | ANDPS (SSE)          | VANDPS (AVX)
                                | add                | ADDSS (SSE)    | ADDPS (SSE)          | VADDPS (AVX)
                                | mul                | MULSS (SSE)    | MULPS (SSE)          | VMULPS (AVX)
                                | div                | DIVSS (SSE)    | DIVPS (SSE)          | VDIVPS (AVX)
                                | sqrt               | SQRTSS (SSE)   | SQRTPS (SSE)         | VSQRTPS (AVX)
                                | fused multiply-add | -              | -                    | VFMADD132PS (AVX2, FMA)
                                | fused multiply-add | -              | -                    | VFMADDPS (FMA4)
double precision floating point | load               | -              | MOVAPD (SSE2)        | VMOVAPD (AVX)
                                | store              | -              | MOVAPD (SSE2)        | VMOVAPD (AVX)
                                | and                | -              | ANDPD (SSE2)         | VANDPD (AVX)
                                | add                | ADDSD (SSE2)   | ADDPD (SSE2)         | VADDPD (AVX)
                                | mul                | MULSD (SSE2)   | MULPD (SSE2)         | VMULPD (AVX)
                                | div                | DIVSD (SSE2)   | DIVPD (SSE2)         | VDIVPD (AVX)
                                | sqrt               | SQRTSD (SSE2)  | SQRTPD (SSE2)        | VSQRTPD (AVX)
                                | multiply-add       | -              | MULPD, ADDPD (SSE2)  | VMULPD, VADDPD (AVX)
                                | fused multiply-add | -              | -                    | VFMADD132PD (AVX2, FMA)
                                | fused multiply-add | -              | -                    | VFMADDPD (FMA4)
3.5.5 Support for Hardware Performance Counters
All measurement routines can record hardware performance counters (see Section 2.6.3.3) in addition
to the performance metrics. For this purpose, the inline assembler regions that implement the performance
measurement are surrounded by calls to the PAPI library [Ter+09]. PAPI_reset() is used to reset
the counters before each individual measurement. The number of events that occurred during the
measurement is read out via PAPI_read() directly after the measurement routine. This prevents events
that occur during the setup phase (e.g., data placement and coherence state control) from being included.
Figure 3.3 shows an example with enabled hardware performance counter measurement. Multiple counters
can be recorded concurrently. However, the number of counter registers as well as the valid combinations
of events are architecture specific. The benchmark is aborted if PAPI_add_event() reports
that the selected combination of counters cannot be measured concurrently. The throughput benchmarks
(see Section 3.5.4) are also instrumented via the VampirTrace API [Vam13, Section 2.4]. If they are
compiled with vtcc the benchmarks create a trace file that contains the performance information for
the selected levels of the memory hierarchy (e.g., “L1”, “L2”, “L3”, or “RAM”). Furthermore, additional
metrics (e.g., the power consumption) can be recorded via VampirTrace’s plugin counter
interface [Sch+11]. The tightly coupled performance measurement and hardware monitoring in
conjunction with the data placement and coherence state control mechanisms enables the identification of
performance counters that indicate the utilization of certain components in the memory hierarchy.
Figure 3.3: Hardware performance counter example: The read bandwidth using Modified data shows an
unexpected behavior for data set sizes slightly larger than the level two cache. This can be explained by
write backs from the level two cache that happen in parallel to the reads from the level three cache. The
effect diminishes if the data set size is increased further as only the data that is initially located in the
L2 cache needs to be written back. This figure is based on the PACT’09 [Mol+09] presentation⁵, slide 31.
3.5.6 Parameter Description
The benchmarks have many tunable parameters that can be tailored to the system under test. The
parameters that configure the measurement are listed below.
• BENCHIT_KERNEL_ACCESSES: Specifies the maximal number of accesses, which is
  automatically reduced for small data set sizes. It is only available in the latency kernel.
• BENCHIT_KERNEL_ALIGNMENT: Specifies the minimal distance between accesses (default
  512 bytes). Values lower than 512 are not suitable for measurements of remote cache accesses
  as the hardware prefetchers typically induce a significant number of local cache hits
  for small data set sizes in that case. This parameter is only available in the latency kernel.
• BENCHIT_KERNEL_ALLOC: Specifies the memory allocation policy (L, G, or B):
  – local (L): each thread allocates memory in its local memory
  – global (G): all memory is allocated by the master thread
  – bind-to-core (B): memory affinity is defined individually for each thread
• BENCHIT_KERNEL_AVX_STARTUP_REG_OPS: Specifies the number of AVX instructions performed
  before the actual measurement routine (default 0). A sufficiently high number ensures
  that the transition to AVX mode is completed prior to the measurement. This setting
  is only available in the bandwidth kernels.
• BENCHIT_KERNEL_BURST_LENGTH: Defines how many consecutive accesses are made before
  reusing a source or destination register (1, 2, 3, 4, or 8; default 8).
• BENCHIT_KERNEL_CPU_FREQUENCY: CPU frequency in Hz, overrides the operating frequency
  determined by the hardware detection (see Section 3.4).
• BENCHIT_KERNEL_CPU_LIST: Selects the CPUs that are used by the benchmark. All available
  CPUs are used if this parameter is not specified.
• BENCHIT_KERNEL_DISABLE_CLFLUSH: Disables usage of the CLFLUSH instruction in the coherence
  state control routine (0 | 1, default 0). Improves the measured L3 performance on AMD
  processors with enabled HT Assist feature in some cases. It is strongly recommended to set
  BENCHIT_KERNEL_ENABLE_CODE_PREFETCH to 1 as well when this workaround
  is activated.
• BENCHIT_KERNEL_ENABLE_CODE_PREFETCH: Enables prefetching of the measurement
  routine (0 | 1, default 0). If enabled, the measurement routine is called with dummy data
  prior to the measurement. This ensures that the code needed for the measurement is in the
  L1 instruction cache, but partially evicts data needed for the measurement.
• BENCHIT_KERNEL_ENABLE_PAPI: Enables recording of hardware performance counters via
  PAPI (0 | 1 | 2, default 0). 0: disabled, 1: enables core counters, 2: enables uncore counters
⁵ https://fusionforge.zih.tu-dresden.de/plugins/mediawiki/wiki/benchit/images/9/91/2009_Molka_PACT_slides.pdf
• BENCHIT_KERNEL_FLUSH_L{1 | 2 | 3}: Enables cache flushes (0 | 1, default 0). Sometimes
necessary to clearly distinguish the cache levels.
• BENCHIT_KERNEL_FLUSH_MODE: Specifies coherence state of cache lines after the flush (M,
E, or I): The caches can be filled with Modified (M), Exclusive (E), or Invalid (I) cache lines.
• BENCHIT_KERNEL_FLUSH_SHARED_CPU: Enables cache flushes on assisting cores (0 | 1,
default 0). Removes copies of the data in the caches of the assisting cores (see accesses
by thread M in Section 3.3) after coherence state generation.
• BENCHIT_KERNEL_{HUGEPAGES | HUGEPAGE_DIR}: Specifies parameters for hugetlbfs.
• BENCHIT_KERNEL_INSTRUCTION: This setting selects the instructions used by the measure-
ment, which defines the width of memory accesses (64, 128, or 256 bit) as well as the
performed arithmetic operation (e.g., mul, add, fma) in the throughput kernel. It also in-
cludes options for non-temporal stores. This parameter is only available for bandwidth and
throughput kernels.
• BENCHIT_KERNEL_MEM_BIND: This setting is required if BENCHIT_KERNEL_ALLOC is
set to bind-to-core (B). It defines the memory affinity of each thread. Needs to have at least
as many entries as BENCHIT_KERNEL_CPU_LIST.
• BENCHIT_KERNEL_{MIN | MAX | STEPS}: These parameters determine the range and number
of different data set sizes used for the measurement. Automatically chooses data set sizes
suitable for display on a logarithmic scale.
• BENCHIT_KERNEL_PAPI_COUNTERS: Comma separated list of hardware performance coun-
ters that should be recorded. The performance counters are only counted for the actual
measurement routine.
• BENCHIT_KERNEL_RANDOM: Enable random order of measurements (0 | 1, default 0).
• BENCHIT_KERNEL_REGONLY: Use register operands only (0 | 1, default 0) Only available in
throughput kernel.
• BENCHIT_KERNEL_RUNS: Number of repetitions for each measurement.
• BENCHIT_KERNEL_SERIALIZATION: Specifies which instruction is used for serialization be-
tween time-stamp counter readings and the measurement routine (MFENCE | CPUID, default
MFENCE).
• BENCHIT_KERNEL_SHARE_CPU_LIST: Selects one or more cores that assist in generating
the requested coherence state. Required for BENCHIT_KERNEL_USE_MODE = S, F,
O, or U. Should be as far away (maximal number of QPI/HT hops) from the first CPU in
BENCHIT_KERNEL_CPU_LIST as possible. The selected cores must not be included in
BENCHIT_KERNEL_CPU_LIST.
• BENCHIT_KERNEL_TIMEOUT: Timeout in seconds (default 3600). When the configured run-
time limit is reached the benchmark terminates itself after receiving a SIGTERM signal
from an additional watchdog thread, which sleeps until the time limit expires.
• BENCHIT_KERNEL_USE_MODE: Specifies the coherency state of the data prior to the measure-
ment (M, E, I, S, F, O, or U): The supported coherence states are Modified (M), Exclusive
(E), Invalid (I), Shared (S), Forward (F), Owned (O), and ModifiedUnWritten (U).
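Several of the parameters above constrain each other. The following minimal sketch checks three of the stated rules. It is not part of the benchmark: the function name and the already parsed argument values (lists of CPU or node numbers, single-letter strings) are assumptions made for illustration only.

```python
# Coherence states that need an assisting core (taken from the parameter list).
STATES_NEEDING_HELPER = {"S", "F", "O", "U"}


def check_config(cpu_list, share_cpu_list, use_mode, alloc=None, mem_bind=None):
    """Return a list of problems; an empty list means all checks passed.

    The arguments are hypothetical, already parsed values of the
    BENCHIT_KERNEL_* parameters, not the benchmark's own data structures.
    """
    problems = []
    # USE_MODE S, F, O, or U requires at least one assisting core.
    if use_mode in STATES_NEEDING_HELPER and not share_cpu_list:
        problems.append("USE_MODE=%s requires SHARE_CPU_LIST" % use_mode)
    # Assisting cores must not be included in CPU_LIST.
    overlap = sorted(set(cpu_list) & set(share_cpu_list))
    if overlap:
        problems.append("CPUs %s appear in both CPU lists" % overlap)
    # With ALLOC set to bind-to-core (B), MEM_BIND needs an entry per CPU.
    if alloc == "B" and (mem_bind is None or len(mem_bind) < len(cpu_list)):
        problems.append("MEM_BIND needs one entry per CPU in CPU_LIST")
    return problems


print(check_config([0, 1], [11], "S"))  # consistent configuration
print(check_config([0, 1], [1], "S"))   # CPU 1 is listed twice
```

The sketch only mirrors the rules stated in the list; it does not check the recommendation to place the assisting core as far away from the first measuring CPU as possible.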
3.6 Limitations
Unfortunately, the design decisions that enable x86-membench’s novel features also result in considerable
limitations. First and foremost, x86-membench is limited to processors with 64 bit x86 instruction set
while most other benchmarks do not depend on a certain ISA. Although x86-membench is not designed
for portability, a port to the ARMv7 instruction set has successfully been implemented [Old13]. This
shows that the methodology is applicable to other processor architectures, too. Another unfavorable
circumstance is that adjustments to the coherence state control mechanism can be required whenever
the coherence protocols are adapted, as has been the case for the extended MOESI protocol in AMD
family 15h processors (see Section 2.3.1.3).
The data placement mechanism is not impeccable, especially if coherence states other than Exclusive and
Modified are used. In that case the hardware prefetchers of the core that assists in generating the selected
coherence state occasionally “steal” some data before the measurement. In addition, the cache flush
routines pollute the lower level caches. Restricting the hardware managed dynamic frequency scaling
mechanisms in contemporary processors—e.g., Haswell’s clock rate reduction for AVX workloads and
uncore frequency scaling [Kar14, pp. 19–20]—in order to obtain reproducible results is also getting
increasingly difficult. Therefore, many parameters (see Section 3.5.6) need to be tuned in order to obtain
optimal results—especially the cache flush parameters and the number of repetitions.
Another obstacle is that marginal source code modifications can cause noticeable changes in the achieved
performance. For instance, on some systems the measured bandwidth of the level one cache changes if
another register is used to store the memory address and the effectiveness of the hardware prefetchers is
influenced by the extent of loop unrolling. Consequently, some results in Chapter 4 slightly differ from
the previously published results [Mol+09; HMN09; MHS14; Mol+15] as they are performed with the
latest version of the benchmarks, which includes the following changes:
• an increased minimal distance (512 byte) between the accesses of the latency benchmark in order to
reduce the influence of the hardware prefetchers as well as a reduced minimal number of accesses
(24 instead of 32) to allow data set sizes < 16K in spite of the increased distance
• using the same register for the memory address in all measurement routines
• more extensive loop unrolling in the bandwidth measurement in order to reduce the overhead
caused by branch instructions
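The relation between access distance, number of accesses, and smallest data set size in the first item can be checked with a short calculation. The sketch assumes that the smallest data set simply covers one access every 512 byte; the function name is hypothetical.

```python
STRIDE_BYTES = 512  # minimal distance between accesses (from the list above)


def min_data_set_bytes(min_accesses, stride=STRIDE_BYTES):
    """Smallest data set that holds min_accesses accesses, one per stride."""
    return min_accesses * stride


old_min = min_data_set_bytes(32)  # previous minimal number of accesses
new_min = min_data_set_bytes(24)  # reduced minimal number of accesses
print(old_min // 1024, "KiB ->", new_min // 1024, "KiB")
```

Under this assumption 32 accesses would need 16 KiB, whereas 24 accesses fit into 12 KiB, which is why data set sizes below 16K remain possible.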
4 Performance Characterization of Memory Accesses
In this chapter the benchmarks presented in Chapter 3 are used to perform an in-depth analysis of con-
temporary multi-processor systems. In Section 4.1 a selection of systems with two NUMA nodes is
analyzed. Section 4.2 evaluates more complex systems with up to eight NUMA nodes. Section 4.3
summarizes the identified bottlenecks in the memory hierarchy. AMD’s Core Performance Boost and
Intel’s Turbo Boost technology (see Section 2.1.4) are disabled during the measurements to avoid varia-
tion in the results, which could be caused by the variable operating frequency. Apart from that, default
system settings are used unless otherwise noted. Memory is allocated in 2 MiB pages via hugetlbfs by
default in order to reduce the impact of TLB misses.
The analysis includes latency and bandwidth measurements of core-to-core transfers as well as the ag-
gregated bandwidth of parallel memory accesses. The characteristics of core-to-core transfers are inves-
tigated for multiple data locations (see Section 3.2):
• local: data placed in the measuring core’s cache hierarchy or main memory within the measuring
core’s NUMA domain.
• within NUMA node: data placed in the cache hierarchy of another core or main memory within the
measuring core’s NUMA domain.
• other NUMA node [(distance)]: data placed in the cache hierarchy of another core or main memory
in another NUMA domain.
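For data that resides in caches, the three cases can be told apart from the measuring core, the core that holds the data, and a core-to-node mapping. The sketch below is only an illustration: the mapping (cores 0-5 on node 0, cores 6-11 on node 1) is an assumption read off Figure 4.2, and the function name is hypothetical.

```python
# Assumed mapping for the dual-socket Opteron 2435 system as drawn in
# Figure 4.2: cores 0-5 belong to node 0, cores 6-11 to node 1.
NODE_OF_CORE = {core: core // 6 for core in range(12)}


def classify_cache_placement(measuring_core, data_core, node_of_core=NODE_OF_CORE):
    """Name the data location case for data held in data_core's caches."""
    if data_core == measuring_core:
        return "local"
    if node_of_core[data_core] == node_of_core[measuring_core]:
        return "within NUMA node"
    return "other NUMA node"


print(classify_cache_placement(0, 0))
print(classify_cache_placement(0, 1))
print(classify_cache_placement(0, 11))
```

Main memory placements are classified by the NUMA node of the memory in the same way; the optional distance annotation is not modeled here.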
In case of the aggregated bandwidth benchmarks all cores access their local cache hierarchy. If not noted
otherwise, all cores also allocate memory within their NUMA domain. These benchmarks are performed
for various numbers of active cores to determine the performance and scalability of shared resources.
4.1 Systems With Two NUMA Nodes
Table 4.1 lists the properties of the selected two socket test systems. The systems are analyzed in detail
in the following sections.
Table 4.1: Dual-socket Test Systems: Sun Fire X4140 [Ora09]; [Amd11, Appendix A], Dell PowerEdge
R510 [Del12a]; [Int14a, Section 2.4], Dell PowerEdge R720 [Del12b]; [Int14a, Section 2.2]
System              Sun Fire X4140            Dell PowerEdge R510        Dell PowerEdge R720
Processors          2x AMD Opteron 2435       2x Intel Xeon X5670        2x Intel Xeon E5-2670
Codename            Istanbul                  Westmere-EP                Sandy Bridge-EP
Cores/logical CPUs  12/12                     12/24                      16/32
Core clock (Turbo)  2.6 GHz (n/a)             2.93 GHz (3.33 GHz)        2.6 GHz (3.3 GHz)
Uncore/NB clock     2.2 GHz                   2.66 GHz                   2.6 GHz (3.3 GHz)
FPUs                2x 128 Bit                2x 128 Bit                 2x 256 Bit
L1 cache            2x 64 KiB per core        2x 32 KiB per core         2x 32 KiB per core
L2 cache            512 KiB per core          256 KiB per core           256 KiB per core
L3 cache            6 MiB per chip            12 MiB per chip            20 MiB per chip
IMC per socket      2x PC2-5300R              3x PC3L-10600R             4x PC3-12800R
Memory size         16 GiB (8x 2 GiB)         12 GiB (6x 2 GiB)          64 GiB (8x 8 GiB)
Interconnect        HT 4.8 GT/s (19.2 GB/s)   QPI 6.4 GT/s (25.6 GB/s)   QPI 8.0 GT/s (32.0 GB/s)
4.1.1 Dual-socket AMD Opteron 2435
This section details the cache and memory performance of a system with two AMD Opteron 2435
processors. It extends the analysis of the preceding processor generation, which has been published
in [HMN09]. The text in this section is partially based on this publication.
The cores are based on AMD’s family 10h micro-architecture [Amd11, Appendix A], which is depicted
in Figure 4.1. Up to three x86 instructions can be decoded into so-called “macro-ops” each cycle. Simple
instructions are converted into one or two macro-ops (DirectPath). Complex instructions generate a
sequence of macro-ops (VectorPath). The macro-ops are sent to the instruction control unit (ICU), which
contains the 72-entry reorder buffer. In the next step operands are renamed and the macro-ops are issued
to the schedulers (reservation stations). Macro-ops can contain two micro-ops (µops): an ALU or FPU
operation as well as an AGU operation. Micro-ops are dispatched to the execution units out of program
order. The results are written back to the ICU where they wait for retirement. The three integer units
execute instructions that operate on the 64 bit general purpose registers. All floating point and SIMD
instructions are handled by the FPU. There are two 128 bit wide execution units (FMUL and FADD),
thus four double precision or eight single precision operations can be performed each cycle. The third
unit (FSTORE) handles stores. Memory accesses are carried out by the load store unit (LSU). The LSU
can perform two data cache accesses each cycle. Loads can be 128 bit wide, whereas stores are limited to
64 bit. The L1 instruction and data caches each have a capacity of 64 KiB. Each core also has a 512 KiB
unified L2 cache, which is exclusive of the L1 caches.
Figure 4.2 depicts the composition of the dual-socket test system. The cores are connected to a system
request interface (SRI), which provides access to the shared L3 cache and main memory. The shared
“non-inclusive” L3 cache [Con+10] is directly connected to the processor’s SRI. It can retain a copy
(inclusive behavior) of a cache line that is transferred to an L1 cache, “if it is likely the data is being
accessed by multiple cores” [Amd11, Appendix A5.4]. Otherwise the cache line is removed from the
L3 (exclusive behavior). Memory requests are sent to a crossbar which forwards them to the integrated
memory controller if the address points to local memory or routes the requests to another processor via
the HyperTransport links. Each integrated memory controller supports two DDR2 channels. Thus, the
installed DDR2-667 memory (PC2-5300R) provides 10.6 GB/s per socket. The HyperTransport 3.0 link
that connects the processors operates at up to 2.4 GHz (4.8 GT/s) [Amd10, Table 6] and transfers 16 bit
per cycle (9.6 GB/s) in each direction. Cache coherence is maintained by the MOESI protocol (see Sec-
tion 2.3.1.3). The test system supports the HT Assist feature (see Section 2.3.2.3) to reduce the snoop
traffic between the sockets. The feature is disabled by default, but can be enabled in the BIOS.
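The peak rates quoted above follow from channel counts, module ratings, and link width. The following sketch reproduces the arithmetic; it assumes the nominal PC2-5300 rating of 5300 MB/s per channel and decimal units (1 GB/s = 1000 MB/s), and the function names are hypothetical.

```python
def dram_peak_gbs(channels, module_mb_per_s):
    """Peak DRAM bandwidth in GB/s from the nominal module rating."""
    return channels * module_mb_per_s / 1000.0


def link_gbs_per_direction(gt_per_s, width_bits):
    """Peak link bandwidth per direction: transfers/s times bytes per transfer."""
    return gt_per_s * width_bits / 8.0


memory_peak = dram_peak_gbs(channels=2, module_mb_per_s=5300)
ht_one_way = link_gbs_per_direction(gt_per_s=4.8, width_bits=16)
ht_both_ways = 2 * ht_one_way

print("DDR2-667, two channels: %.1f GB/s" % memory_peak)
print("HyperTransport: %.1f GB/s per direction, %.1f GB/s in total"
      % (ht_one_way, ht_both_ways))
```

The total of both directions corresponds to the 19.2 GB/s listed for the interconnect in Table 4.1.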
[Figure 4.1, block diagram. In-order front-end: fetch, branch prediction, 64 KiB L1 instruction cache,
decode (DirectPath, VectorPath). Out-of-order execution: instruction control unit with 72 entries
(dispatch/retirement), integer and FP rename, three integer schedulers with three ALU/AGU pairs, FP
scheduler with FADD, FMUL (both SSE, MMX), and FSTORE. Memory subsystem: load store unit with 44
entries, 64 KiB L1 data cache, 512 KiB L2 cache, connection to the system request interface.]
Figure 4.1: AMD family 10h micro-architecture, based on [Amd11, Figure 8] (derived from [Mol08,
Figure 2.29]): The processor cores implement superscalar out-of-order execution (see Section 2.1.1)
with up to 72 instructions in flight.
[Figure 4.2, block diagram. Two Opteron 2400 series processors (six-core Istanbul dies) with cores 0-5
and cores 6-11. Each core has its own L1 and L2 cache. Each die contains a system request interface, a
shared L3 cache (non-inclusive), a crossbar, a memory controller with two channels (DDR2 A and B on
the first die, DDR2 C and D on the second), and a HyperTransport interconnect. An I/O connection is
shown as well.]
Figure 4.2: Composition of the dual-socket AMD Opteron 2435 system, based on [Con+10, Figure 1]
(derived from [Mol+10, Figure 1]): The cores are attached to a system request interface, which
connects them to the shared L3 cache and the crossbar. The crossbar provides access to the integrated
memory controller and the HyperTransport interconnect.
4.1.1.1 Latency of Cache and Main Memory Accesses
This section details the results of the latency benchmark (see Section 3.5.1). Cache lines are placed in
the memory hierarchy of different cores with variable coherence state. Figure 4.3 depicts the behavior
using default settings (HT Assist disabled). Table 4.2 summarizes the observed performance levels.
The read latency is independent of the coherence state if the data is found in the requesting core’s local
cache hierarchy (“local” cases in Figure 4.3). The L1 and L2 caches have a latency of 1.15 and 5.8 ns,
respectively. Accesses to the L3 cache take 16.9 ns. If data is present in another core’s L1 or L2 cache
(“within NUMA node” cases up to 512 KiB in Figure 4.3), the access latency strongly depends on its
coherence state. Core-to-core transfers only occur for Modified and Owned cache lines. They require
46 ns. Requests to Shared and Exclusive cache lines are apparently serviced by main memory since
they resemble the behavior of case Invalid, which represents the latency of DRAM accesses—except for
very small data set sizes that show minor prefetching effects. This is common for Shared cache lines
(see Section 2.3.1) in order to avoid multiple replies. However, it is a missed opportunity not to forward
Exclusive cache lines between cores as accessing the local DRAM causes a delay of 80 ns.
The latency increases further if data is located in the second processor’s caches or memory (“other
NUMA node” cases in Figure 4.3). The increment is 49 ns for forwarding L1 and L2 cache lines (Modi-
fied, Owned), which requires 95 ns in that case compared to 46 ns for on-chip transfers. Exclusive L1 and
L2 cache lines as well as Shared cache lines from all cache levels are fetched again from main memory.
Thus, the latency is on the same level as the main memory latency (case Invalid). The ascending slope
for memory sizes up to approximately 1 MiB is presumably caused by the characteristics of DRAM accesses.
Small data sets fit into a few DRAM pages², thus there is a high likelihood to access an already
opened page. In contrast, random accesses to large data sets mostly access closed pages, which takes
longer [Che04, Chapter III, Section 2.3.1]. The snooping of the second processor conceals this effect in
case of local memory accesses. The remote L3 cache forwards Exclusive, Modified, and Owned cache
lines with a latency of 93 ns. Remote cache accesses take longer than accesses to local DRAM (93-95 ns
vs. 80 ns). The second processor’s memory has a latency of up to 129 ns.
Table 4.2: Opteron 2435, memory read latency: accesses to the local cache hierarchy compared to
accesses to data in other locations. All results are in nanoseconds (cycles). HT Assist is disabled.
Source             State            L1          L2          L3          DRAM
local              M/O/E/S          1.15 (3)    5.8 (15)    16.9 (44)   80¹ (208)
within NUMA node   Modified/Owned   46 (121)    46 (121)    16.9 (44)   80¹ (208)
                   Exclusive        80 (208)    80 (208)    16.9 (44)   80¹ (208)
                   Shared           80 (208)    80 (208)    16.9 (44)   80¹ (208)
other NUMA node    Modified/Owned   95 (247)    95 (247)    93 (243)    129¹ (336)
                   Exclusive        112 - 126   112 - 126   93 (243)    129¹ (336)
                   Shared           112 - 126   112 - 126   129 (336)   129¹ (336)
¹ The measured DRAM latencies occasionally switch from 80/129 to 79/127 ns and vice versa when the system is rebooted.
(a) Modified / Owned (2nd copy in other NUMA node) (b) Exclusive
(c) Shared (2nd copy in other NUMA node) (d) Invalid, delivered from DRAM
Figure 4.3: Opteron 2435, memory read latency: One core accesses cache lines in its local cache
hierarchy (local) as well as caches of another core in the same package (within NUMA node) or in
the second processor (other NUMA node). Local cache hits always return the data. Modified and
Owned cache lines are forwarded by all non-local caches. Exclusive cache lines are forwarded by the
remote L3 cache, but not by another core’s L1 or L2 cache. Shared cache lines are never forwarded.
Figure 4.4 shows how the HT Assist feature (see Section 2.3.2.3) affects the latencies. It reduces the local
memory latency as the DRAM response is not delayed until the other processor’s snoop response arrives.
Therefore, the reduced latency for small data sets, which is caused by the DRAM characteristics, can
also be observed for local accesses. The latency using large data sets decreases from 80 to 74 ns. On the
other hand, the snoop filter lookup delays necessary snoop requests. This increases the latency of core-
to-core transfers by 10 ns from 46, 93, and 95 ns to 56, 103, and 105 ns, respectively. The remote memory
latency does not change, as DRAM requests and snoop filter lookup are performed in parallel [Con+10].
Exclusive cache lines—which are read again from memory if HT Assist is disabled—as well as Modified
and Owned cache lines are forwarded between cores. Shared cache lines³ are still not forwarded.
² Rows within the memory banks, not to be confused with pages in the context of virtual memory management.
³ If HT Assist is enabled, Exclusive cache lines change to state Owned when they are read by another core [Con+10]. Thus,
state Shared is generated using the coherence state control mechanism for state Forward, which creates an Owned copy
in another core (CPU 11) and a Shared copy in the target CPU. Afterwards, the unintentionally created Owned copy is
removed (BENCHIT_KERNEL_FLUSH_SHARED_CPU=1), which also invalidates the remote L3 cache.
(a) Modified / Owned (2nd copy in other NUMA node) (b) Exclusive
(c) Shared (2nd copy removed)³ (d) Invalid, delivered from DRAM
Figure 4.4: Opteron 2435—memory latency with enabled HT Assist: HT Assist slightly reduces the
local memory latency. However, it also increases the latency of core-to-core transfers.
Exclusive cache lines are evicted earlier than intended from the remote L3 cache if HT Assist is enabled
and the default benchmark configuration is used. This effect can be traced back to the usage of CLFLUSH
in the coherence state control mechanism of state Exclusive (see Section 3.3). It disappears if an alternative
flush routine is used that accesses an adequate amount of other data. The divergent behavior
is regarded as a measurement artifact since the access sequence that includes CLFLUSH is quite unusual.
Apart from that, the characteristics of Exclusive and Modified are identical.
Figure 4.5 shows the impact of TLB misses on the memory latency. If huge pages are used, there are no
TLB misses up to a data set size of 96 MiB, which is covered by the L1 TLB. After that there is a barely
noticeable increase in memory latency to around 80.5 ns until the L2 TLB is exhausted at 256 MiB. For
larger data set sizes the memory latency steadily increases. Using 4 KiB pages results in much higher
latencies, but this only occurs when transparent huge pages are deliberately deactivated.
Figure 4.5: Opteron 2435, TLB miss penalty: The additional delay caused by the translation of virtual
addresses (see Section 2.4.2) depends on the size of the data set and the used page size. If transparent
huge pages (THP) [Arc11] are enabled, which is the default setting, there is no performance difference
between memory that is allocated with malloc() and explicitly using 2 MiB pages via hugetlbfs.
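The data set sizes at which the latency changes translate into TLB capacities. The sketch below derives the number of 2 MiB entries implied by the reported reach of 96 MiB and 256 MiB; the entry counts are an inference from these two figures, not values stated in the text, and the function name is hypothetical.

```python
MIB = 1024 * 1024


def implied_tlb_entries(reach_bytes, page_bytes):
    """Number of page translations needed to cover reach_bytes."""
    return reach_bytes // page_bytes


l1_entries = implied_tlb_entries(96 * MIB, 2 * MIB)
l2_entries = implied_tlb_entries(256 * MIB, 2 * MIB)

print("L1 TLB reach of 96 MiB  ->", l1_entries, "entries for 2 MiB pages")
print("L2 TLB reach of 256 MiB ->", l2_entries, "entries for 2 MiB pages")
```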
4.1.1.2 Bandwidth of Local Cache Accesses and Core-to-core Transfers
In this section the results of the benchmark introduced in Section 3.5.2 are presented, which show the
bandwidth of the individual cache levels as well as the available data rates of core-to-core transfers
within and between NUMA nodes. Figure 4.6 depicts the read and write bandwidth using 128 bit SSE
instructions depending on the data’s coherence state. The results are summarized in Table 4.3, which
also details the performance of 64 bit wide loads and stores.
The read bandwidth of local cache accesses (“local” cases in Figures 4.6a, 4.6c, and 4.6e) is hardly
influenced by the coherence state. The measured L1 bandwidth is 82.1 GB/s—only 1.3% below the
expected 83.2 GB/s of the two 128 bit L1 read ports. The L2 interface is 128 bit wide [Wal07, page 9].
(a) read, Modified (b) write, Modified
(c) read, Exclusive/Shared (d) write, Exclusive
(e) read, Owned (f) write, Owned/Shared
Figure 4.6: Opteron 2435—single-threaded read and write bandwidths: One core accesses cache lines in
its local cache hierarchy (local) as well as caches of another core in the same package (within NUMA
node) or in the second processor (other NUMA node). The read bandwidth is almost independent of
the coherence state while the write bandwidth shows significant differences.
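Table 4.3: Opteron 2435, core-to-core read and write bandwidths in GB/s using 128 bit MOVDQA
(64 bit MOV) instructions. Using 128 bit loads doubles the read bandwidth from the local L1 cache
compared to 64 bit loads. Local L2 cache and memory accesses as well as core-to-core transfers also
benefit from utilizing SSE. The write bandwidths do not improve if 128 bit stores are used.
read bandwidth
Source             State       L1            L2            L3                  DRAM
local              Modified    82.1 (41.3)   20.7 (18.7)   9.5 (9.5)           5.2 (4.7)
                   Exclusive   82.1 (41.3)   20.7 (18.7)   9.5 (9.5)           5.2 (4.7)
                   Shared      82.1 (41.3)   20.7 (18.7)   9.5 (9.5)           5.2 (4.7)
                   Owned       82.1 (41.3)   20.7 (18.7)   up to 11.2 (10.5)   5.2 (4.7)
within NUMA node   Modified    6.8 (5.0)     6.8 (5.0)     up to 11.2 (10.5)   5.2 (4.7)
                   Exclusive   5.2 (4.1)     5.2 (4.1)     up to 11.2 (10.5)   5.2 (4.7)
                   Shared      5.2 (4.1)     5.2 (4.1)     up to 11.2 (10.5)   5.2 (4.7)
                   Owned       6.8 (5.0)     6.8 (5.0)     up to 11.2 (10.5)   5.2 (4.7)
other NUMA node    Modified    4.2 (3.6)     4.2 (3.6)     4.2 (3.6)           3.7 (3.7)
                   Exclusive   3.7 (3.6)     3.7 (3.6)     3.7 (3.6)           3.7 (3.7)
                   Shared      3.7 (3.6)     3.7 (3.6)     3.7 (3.6)           3.7 (3.7)
                   Owned       4.2 (3.6)     4.2 (3.6)     4.2 (3.6)           3.7 (3.7)
write bandwidth
Source             State       L1            L2            L3                  DRAM
local              Modified    41.3 (41.3)   13.0 (13.3)   8.7 (8.7)           3.8 (3.8)
                   Exclusive   17.7 (17.7)   9.0 (9.8)     8.7 (8.7)           3.8 (3.8)
                   Shared      2.7 (2.7)     2.7 (2.7)     4.2 (4.2)           3.8 (3.8)
                   Owned       2.7 (2.7)     2.7 (2.7)     4.2 (4.2)           3.8 (3.8)
within NUMA node   Modified    4.2 (4.2)     4.2 (4.2)     8.7 (8.7)           3.8 (3.8)
                   Exclusive   4.2 (4.2)     4.2 (4.2)     8.7 (8.7)           3.8 (3.8)
                   Shared      3.9 (3.9)     3.9 (3.9)     4.2 (4.2)           3.8 (3.8)
                   Owned       3.9 (3.9)     3.9 (3.9)     4.2 (4.2)           3.8 (3.8)
other NUMA node    Modified    3.5 (3.5)     3.5 (3.5)     3.5 (3.5)           3.2 (3.2)
                   Exclusive   3.6 (3.6)     3.6 (3.6)     3.6 (3.6)           3.2 (3.2)
                   Shared      3.6 (3.6)     3.6 (3.6)     3.6 (3.6)           3.2 (3.2)
                   Owned       3.5 (3.5)     3.5 (3.5)     3.5 (3.5)           3.2 (3.2)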
However, the achievable bandwidth of 20.7 GB/s is slightly below 20.8 GB/s, which is possible with
64 bits per cycle. The L2 cache is exclusive of the L1 cache. Therefore, an evicted L1 cache line is
written back for every cache line read by the core, which requires half of the raw bandwidth. The L3
and main memory bandwidth is 9.5 and 5.2 GB/s, respectively. The L3 cache is mostly exclusive, thus
reading from the L3 typically causes the same number of write backs to the L3. This is apparently not
the case for Owned cache lines, which can be read with up to 11.2 GB/s. The higher effective read
bandwidth indicates that there are fewer write backs, i.e., the L3 shows its inclusive behavior in this case.
If 64 bit loads are used, the L1 bandwidth is cut in half (see Table 4.3). Furthermore, the L2 and memory
bandwidth is reduced by about 10%.
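The expected cache bandwidths used in this comparison result from the port width and the core clock of 2.6 GHz. The sketch below recomputes them; it assumes decimal GB/s and uses a hypothetical helper name.

```python
CORE_GHZ = 2.6  # core clock of the Opteron 2435


def peak_gbs(bits_per_cycle, ghz=CORE_GHZ):
    """Peak bandwidth in GB/s for a given number of bits moved per cycle."""
    return bits_per_cycle / 8.0 * ghz


l1_peak = peak_gbs(2 * 128)          # two 128 bit L1 read ports
l1_shortfall = 1.0 - 82.1 / l1_peak  # measured 82.1 GB/s
l2_raw = peak_gbs(128)               # 128 bit wide L2 interface
l2_read_peak = l2_raw / 2.0          # half is spent on L1 write backs

print("L1 peak %.1f GB/s, measured value is %.1f%% lower"
      % (l1_peak, 100.0 * l1_shortfall))
print("L2 raw %.1f GB/s, usable for reads %.1f GB/s" % (l2_raw, l2_read_peak))
```

The factor of one half reflects the exclusive L2 design described above: every cache line read from the L2 is accompanied by a write back of an evicted L1 cache line.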
The local write bandwidth (“local” cases in Figures 4.6b, 4.6d, and 4.6f) strongly depends on the co-
herence state. Data that already is in state Modified in the L1, L2 or L3 cache can be written to with
41.3, 13.0, and 8.7 GB/s respectively. Exclusive cache lines show significantly lower write bandwidths
of 17.7 in the L1 and 9.0 GB/s in the L2 cache. This is presumably caused by the L1 data cache, which is
virtually indexed in AMD family 10h [BT09, Table 2]. The 64 KiB 2-way set-associative cache (2x 512
cache lines⁴) requires nine index bits, which in all probability are taken from the address bits [14:6]⁵.
Bits [11:0] are always identical between virtual and physical address (see Figure 2.16 in Section 2.4.2).
However, using bits [14:12] from the virtual address is problematic since the same physical address can
be inserted at eight possible indices. The observed behavior indicates that multiple (consistent) copies of
a cache line may coexist in the L1 data cache. Consequently, all possible locations have to be checked
for duplicates prior to the write, which reduces the performance. Writing to Shared and Owned cache
lines is extremely slow for all cache levels (2.7 – 4.2 GB/s) as snoops are broadcast in order to invalidate
all other copies. The local memory can be written with 3.8 GB/s. Using 128 bit stores does not provide
any benefit compared to 64 bit stores (see Table 4.3). The local L2 performance even decreases.
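The index bit argument for the L1 data cache can be retraced numerically. The following sketch derives the number of sets, the index bit range, and the number of index bits that lie above the 4 KiB page offset from the cache geometry given in the text; the function name and the returned dictionary are hypothetical.

```python
def l1_index_layout(cache_bytes, ways, line_bytes, page_bytes):
    """Derive set count and index bits of a set-associative cache.

    All sizes must be powers of two. Index bits above the page offset
    come from the virtual page number if the cache is virtually indexed.
    """
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1       # log2(line size)
    index_bits = sets.bit_length() - 1              # log2(number of sets)
    high = offset_bits + index_bits - 1             # most significant index bit
    page_offset_bits = page_bytes.bit_length() - 1  # log2(page size)
    virtual_bits = max(0, high + 1 - page_offset_bits)
    return {
        "sets": sets,
        "index_bit_range": (high, offset_bits),
        "virtual_index_bits": virtual_bits,
        "possible_indices": 2 ** virtual_bits,
    }


layout = l1_index_layout(64 * 1024, ways=2, line_bytes=64, page_bytes=4096)
print(layout)
```

For comparison, a hypothetical 32 KiB 8-way cache with 64 byte lines would need index bits [11:6] only, so no index bit would depend on the virtual page number.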
The measurements shown in Figure 4.6 and Table 4.3 use memory that is allocated in 2 MiB pages via
hugetlbfs. However, the page size does not have a significant impact on sequential memory accesses
of sufficient size since the TLB contains the required translation after the first access to a page. Thus,
the overhead shown in Figure 4.5 is distributed among many accesses. It is negligible for 2 MiB pages.
⁴ 64 byte cache lines according to CPUID Fn8000_0005_ECX[L1DcLineSize] (see [Amd15b, Appendix E.4.4])
⁵ According to [Amd11, Section 5.7] the index comprises bits [14:7]. However, the eight banks in the data memory contain
128 byte, i.e., two consecutive cache lines. The index into the tag directory requires one more bit.
If 4 KiB pages are used (and THP is disabled), one TLB miss occurs per 256 accesses, which slightly
reduces the memory read bandwidth to 5.1 GB/s at a data set size of 1 GiB.
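The number of accesses per TLB miss follows from the page size and the access width. The sketch assumes a purely sequential access pattern in which every page causes exactly one TLB miss; the function name is hypothetical.

```python
def accesses_per_tlb_miss(page_bytes, access_bits):
    """Sequential accesses that share one page translation."""
    return page_bytes // (access_bits // 8)


small_pages = accesses_per_tlb_miss(4 * 1024, 128)         # 4 KiB pages
huge_pages = accesses_per_tlb_miss(2 * 1024 * 1024, 128)   # 2 MiB pages

print("4 KiB pages:", small_pages, "accesses per TLB miss")
print("2 MiB pages:", huge_pages, "accesses per TLB miss")
```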
The read bandwidths of accesses to other cores’ data (“within NUMA node” and “other NUMA node”
cases in Figures 4.6a, 4.6c, and 4.6e) reflect the latency results. Modified and Owned cache lines are
forwarded by other cores’ L1 and L2 caches with 6.8 GB/s within a NUMA node while Exclusive and
Shared cache lines are read with 5.2 GB/s, which is identical to the memory bandwidth. With up to
11.2 GB/s data exchanges via the shared L3 cache are significantly faster. The L3 shows its inclusive be-
havior independent of the coherence state, if data was placed in it by another core. Data is forwarded from
the second processor’s caches with 4.2 GB/s. Reading from the remote memory is limited to 3.7 GB/s.
Writing to other cores’ data (“within NUMA node” and “other NUMA node” cases in Figures 4.6b, 4.6d,
and 4.6f) includes a read for ownership and a write to the local cache hierarchy. The write bandwidth is
limited to 4.2 GB/s if a copy of the data exists in another L1 or L2 cache on the requesting core’s die and
the coherence state is Modified or Exclusive. It reduces to 3.9 GB/s for Owned and Shared cache lines
that also have a copy in the other processor. The performance of writes to the shared L3 cache within
a node does not change if the data was placed there by another core. Data that is cached in the second
processor can be written with 3.5 to 3.6 GB/s. The write bandwidth of the remote memory is 3.2 GB/s.
4.1.1.3 Bandwidth Scaling of Shared Resources
The aggregated bandwidth benchmark introduced in Section 3.5.3 stresses the shared resources with
concurrent memory accesses of multiple cores. Table 4.4 shows how the L3 and main memory bandwidth
scales with the number of cores. The L3 read bandwidth increases with every added core, but it does
not scale linearly. It reaches a maximum of 39.4 GB/s. The L3’s write bandwidth is slightly lower than
the read bandwidth up to four concurrently writing cores, but reaches almost the same maximum (39.3 GB/s).
The read bandwidth from local memory is 5.2 GB/s for a single thread. It increases significantly if a
second thread is used. It is saturated at 9.7 GB/s—91% of the theoretical maximum of 10.6 GB/s—with
three concurrent read streams. The local memory’s write bandwidth using normal store instructions
(MOVDQA) does not scale with the number of cores. It is limited to 3.8 GB/s. In contrast, up to 7.8 GB/s
can be reached with non-temporal stores (MOVNTDQ). The remote memory accesses are limited by the
interconnect. This reduces the read bandwidth to 8.2 GB/s while the maximal write bandwidth does not
change. However, single-threaded write accesses as well as read accesses with up to two cores show
additional performance degradations. This is presumably caused by the higher latency in conjunction
with the limited number of outstanding requests per core. The snoop filter (HT Assist) does not improve
the achievable aggregated bandwidths. However, single-threaded read accesses as well as non-temporal
stores with one or two concurrent threads benefit from enabling the feature. On the other hand, the
remote memory bandwidth is reduced.
Table 4.4: Opteron 2435, L3 and main memory bandwidth scaling: The aggregated L3 performance
increases with every added core. The memory bandwidth can be saturated with three cores. The
measurements use 128 bit loads (MOVDQA) and stores (MOVDQA / MOVNTDQ).
Bandwidth in GB/s with disabled (enabled) HT Assist
        L3                          local memory                            remote memory
cores   read          write         read        write       write-nt    read        write
1       9.5 (9.5)     8.7 (8.7)     5.2 (6.1)   3.8 (3.8)   4.0 (7.7)   3.7 (3.6)   3.1 (3.1)
2       18.8 (18.8)   17.3 (17.3)   9.0 (9.1)   3.8 (3.8)   6.2 (7.7)   6.9 (6.8)   3.8 (3.8)
3       25.5 (25.5)   24.7 (24.7)   9.7 (9.7)   3.8 (3.8)   7.7 (7.8)   8.2 (7.8)   3.8 (3.8)
4       30.5 (30.5)   30.1 (30.1)   9.7 (9.6)   3.8 (3.7)   7.8 (7.8)   8.2 (7.6)   3.8 (3.7)
5       35.0 (35.0)   34.9 (34.7)   9.7 (9.6)   3.6 (3.6)   7.8 (7.8)   8.2 (7.4)   3.7 (3.6)
6       39.4 (39.4)   39.3 (38.9)   9.7 (9.6)   3.6 (3.5)   7.8 (7.8)   8.2 (7.3)   3.6 (3.5)
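The scaling behavior in Table 4.4 can be quantified with two simple metrics. The sketch below uses the read bandwidths measured with disabled HT Assist; the metric definitions (bandwidth relative to a linear extrapolation of the single-core value, and the first core count that reaches the maximum) and the function names are choices made for this illustration.

```python
# Read bandwidths in GB/s for 1 to 6 cores (Table 4.4, HT Assist disabled).
L3_READ = [9.5, 18.8, 25.5, 30.5, 35.0, 39.4]
LOCAL_MEM_READ = [5.2, 9.0, 9.7, 9.7, 9.7, 9.7]
MEM_PEAK = 10.6  # theoretical peak of the two DDR2-667 channels


def parallel_efficiency(bandwidths):
    """Measured bandwidth divided by linear scaling of the 1-core value."""
    single = bandwidths[0]
    return [bw / (single * (i + 1)) for i, bw in enumerate(bandwidths)]


def cores_to_saturate(bandwidths):
    """Smallest number of cores that reaches the maximum bandwidth."""
    return bandwidths.index(max(bandwidths)) + 1


l3_efficiency = parallel_efficiency(L3_READ)
mem_cores = cores_to_saturate(LOCAL_MEM_READ)
mem_fraction = max(LOCAL_MEM_READ) / MEM_PEAK

print("L3 read efficiency with 6 cores: %.2f" % l3_efficiency[-1])
print("local memory saturates with", mem_cores, "cores")
print("fraction of the theoretical peak: %.3f" % mem_fraction)
```

With the rounded table values the memory fraction comes out slightly above 0.91, which is consistent with the 91% stated in the text.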
4.1.2 Dual-socket Intel Xeon X5670
In this section the memory subsystem of a dual-socket system with Intel Xeon X5670 processors is
analyzed. It builds on the analysis of the preceding processor generation, which has been published
in [Mol+09] and [HMN09]. The text in this section is partially based on these publications.
Figure 4.7 shows a block diagram of the processor cores, which are based on the Westmere micro-
architecture [Kur+11]—a shrink of Nehalem [Int14a, Section 2.4] to 32 nm. The cores implement
Hyper-Threading [Int14a, Section 2.4.9 and 2.5]—an implementation of simultaneous multi-threading
(SMT)—with two hardware threads per core. Four decoders convert the fetched x86 instructions into
microops (µops). Microops are elementary operations that can be processed by the execution units. The
three simple decoders only handle instructions that are translated into a single microop. This includes
fused microops, which contain an arithmetic operation and a memory access [Goc+03]. Instructions that
require multiple microops are processed by the complex decoder. Macrofusion [Wec06]—which com-
bines certain pairs of x86 instructions in a single microop—is supported as well. Thus, up to five x86
instructions can be decoded each cycle. In the next step register operands are renamed and the required
reorder buffer (ROB) and memory order buffer (MOB) entries are allocated. The microops then enter the
36 entry scheduler, which dispatches them to the execution units via six ports. Ports 0, 1, and 5 are used
for arithmetic and logic operations. However, not all instructions are supported by every port [Int14a,
Table 2-23]. Up to four double precision operations can be performed each cycle by the two 128 bit
wide floating point units (FP MUL and FP ADD). Ports 2, 3, and 4 handle memory accesses. Up to 48
loads and 32 stores can be queued in the MOB [Int14a, Section 2.4.5], which ensures proper memory
ordering [Int14b, Volume 3, Section 8.2.2]. The 32 KiB level one data cache supports one 128 bit load
and one 128 bit store per cycle. Each core also has a dedicated 256 KiB level two cache.
The composition of the Xeon X5670 (Westmere-EP) system is depicted in Figure 4.8. Each processor
contains six cores and several shared resources, which Intel refers to as Uncore [Hil+10]. The cores are
connected to the so-called Global Queue (GQ), which connects them to the other components. A large
portion of the chip is used as last level cache (LLC), which is inclusive of the L1 and L2 caches [Int14a,
Section 2.4.4]. All caches use the write-back policy. Each processor contains an integrated memory
controller (IMC) with three DDR3 channels. The DDR3-1333 memory (PC3L-10600R) has a theoretical
peak bandwidth of 32 GB/s per socket. The QuickPath Interconnect (QPI) [Int09a] is used to connect
the two processors and access I/O devices. Cache coherence is maintained by the MESIF protocol
(see Section 2.3.1.2). Core valid bits are used to determine which cores may have copies of cache lines
that are present in the LLC [Hil+10, p. 38]. This reduces the snoop traffic as described in Section 2.3.2.1.
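The peak bandwidths of this system can be derived in the same manner from transfer rates and bus widths. The sketch assumes 64 bit (8 byte) wide DDR3 channels, a data rate of 1333⅓ MT/s for DDR3-1333, and 16 bit (2 byte) of payload per QPI transfer and direction; the payload width and the function names are assumptions that are not stated in the text.

```python
def dram_peak_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Peak DRAM bandwidth in GB/s from the data rate per channel."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


def qpi_gbs(giga_transfers_per_s, payload_bytes=2, directions=2):
    """Peak QPI bandwidth in GB/s summed over the given directions."""
    return giga_transfers_per_s * payload_bytes * directions


ddr3_peak = dram_peak_gbs(channels=3, mega_transfers_per_s=4000.0 / 3.0)
qpi_total = qpi_gbs(6.4)

print("DDR3-1333, three channels: %.1f GB/s" % ddr3_peak)
print("QPI at 6.4 GT/s, both directions: %.1f GB/s" % qpi_total)
```

Both values match the figures given for the Dell PowerEdge R510 in the text and in Table 4.1.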
[Figure 4.7, block diagram. In-order front-end: 32 KiB L1 instruction cache, fetch and predecode, branch
prediction, instruction queue, decode with one complex and three simple decoders (4+1 x86
instructions), rename/alloc. Out-of-order execution: reorder buffer with 128 entries, scheduler
(reservation station) with 36 entries, ports 0 to 5 with Int ALU, Int SIMD, FP MUL, FP ADD, load, store
address, and store data units. Memory subsystem: memory order buffer with 48 load and 32 store
buffers, 32 KiB L1 data cache, L2 cache, connection to the Global Queue.]
Figure 4.7: Intel Westmere micro-architecture, based on [Int14a, Figure 2-8] (derived from [Mol08,
Figure 2.26]): The cores implement superscalar out-of-order execution (see Section 2.1.1). A single
reservation station (scheduler) is used. The reorder buffer supports 128 microops in flight.
[Figure 4.8, block diagram. Two Xeon X5600 series processors (six-core Westmere-EP dies) with cores 0-5
and cores 6-11. Each core has its own L1 and L2 cache. Each die contains a Global Queue, a shared LLC
(inclusive), a memory controller with three DDR3 channels (A to C on the first die, D to F on the
second), and a QuickPath Interconnect. An I/O Hub is attached via QuickPath.]
Figure 4.8: Composition of the dual-socket Xeon X5670 test system, based on [Hil+10, Fig-
ure 2]; [Kur+11, Figure 2] (derived from [Mol+09, Figure 1]): The six cores in each processor are
connected to the Global Queue, which provides access to the shared components. The processors are
connected to each other and to the I/O Hub via QuickPath connections operating at 6.4 GT/s.
4.1.2.1 Latency of Cache and Main Memory Accesses
This section details the measurements of the latency benchmark (see Section 3.5.1). Figure 4.9 depicts
the results for accesses to Modified, Exclusive, and Shared cache lines that are placed in different loca-
tions of the memory subsystem. Case Invalid shows the DRAM latency depending on the data set size.
The results are summarized in Table 4.5, which also includes results for state Forward.
The latency of accesses to a core’s local L1 and L2 cache is independent of the coherence state. Loads
require 4 and 10 cycles, respectively. The local L3 latency is 14.7 ns (43 cycles) if the data can be sent
directly to the requesting core. This is the case if two or more core valid bits are set (Shared/Forward
cache lines) and if all core valid bits are clear (Modified cache lines). If a single core valid bit is set, the
corresponding core has to be snooped unless it is the bit associated with the requesting core itself. This
happens if another core silently evicts Exclusive cache lines and increases the L3 latency to 25.9 ns. The
L3 cache also services requests to cache lines in the states Exclusive, Shared, and Forward that are still
present in other cores. If Modified cache lines exist in another core, the copies in the L3 are outdated.
Thus, data has to be read from the other core’s L1 or L2 cache, which has a slightly higher latency of
33.4 or 29.7 ns, respectively. The local main memory has a latency of 68.5 ns.
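The latencies above are given in nanoseconds and in core cycles. The sketch below shows the conversion between the two units. The clock frequency is an assumption that is not stated in this section: the nominal X5670 core clock of 22 x 133 1/3 MHz (about 2.93 GHz), which reproduces the cycle counts quoted in the text. Function names are illustrative.

```python
# Assumption: nominal core clock of the Xeon X5670, 22 x 133 1/3 MHz (about 2.93 GHz).
CLOCK_GHZ = 22 * 0.4 / 3


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert core cycles to nanoseconds."""
    return cycles / clock_ghz


def ns_to_cycles(ns, clock_ghz=CLOCK_GHZ):
    """Convert nanoseconds to the nearest whole number of core cycles."""
    return round(ns * clock_ghz)


for label, cycles in [("L1", 4), ("L2", 10), ("L3", 43),
                      ("L3 with core snoop", 76), ("local DRAM", 201)]:
    print(f"{label}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns")

# Additional delay caused by snooping one core on an L3 hit (76 vs. 43 cycles).
snoop_penalty_ns = cycles_to_ns(76) - cycles_to_ns(43)
print(f"core snoop penalty: about {snoop_penalty_ns:.0f} ns")
```

Under this assumption a core snoop adds roughly 33 cycles, or about 11 ns, to an L3 hit.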
Accesses to caches and memory of the second processor include an additional delay for the data transfer
via QPI. The L3 latency increases from 25.9 to 70.6 ns for Exclusive cache lines (including a core snoop)
and from 14.7 to 63.4 ns for cache lines in state Forward (no core snoop). Shared cache lines are not
forwarded by the remote L3 cache. Thus, the data is delivered from main memory with a latency of up to
110 ns. The latency of accesses to Modified cache lines is measured with up to 113 ns, which is slightly
higher than the remote memory latency. This is presumably caused by the required write backs to main
memory as the MESIF protocol does not allow sharing dirty cache lines.
Table 4.5: Xeon X5670 - memory read latency: Accesses to the local cache hierarchy compared to
accesses to data in other locations. In the Shared cases the Forward copy, which is created by the
coherence state control mechanism (see Section 3.3), is removed prior to the measurement
(BENCHIT_KERNEL_FLUSH_SHARED_CPU=1). All results are in nanoseconds (cycles).
Source            State            L1          L2          L3          DRAM
local             M/E/S/F          1.36 (4)    3.4 (10)    14.7 (43)   68.5 (201)
within NUMA node  Modified         33.4 (98)   29.7 (87)   14.7 (43)   68.5 (201)
                  Exclusive        <---------- 25.9 (76) ---------->
                  Shared/Forward   <---------- 14.7 (43) ---------->
other NUMA node   Modified         <---------- up to 113 ---------->   109.4 (321)
                  Exclusive        <---------- 70.6 (207) --------->
                  Forward          <---------- 63.4 (186) --------->
                  Shared           <---------- up to 110 ---------->
(a) Exclusive (b) Modified⁶
(c) Shared, 2nd copy (Forward) removed (d) Invalid, delivered from DRAM
Figure 4.9: Xeon X5670 - memory read latency: One core accessing its local cache hierarchy (local)
as well as cache lines of another core in the same processor (within NUMA node) and a core in the
second processor (other NUMA node).
Figure 4.10 shows the impact of TLB misses on the memory latency. If 2 MiB pages are used explicitly
through hugetlbfs or implicitly via THP, TLB misses do not occur during the measurement up to a data
set size of 64 MiB, which is covered by the 32 entries in the L1 TLB (see Table 2.6). After that the
average latency increases slightly. It reaches an average of 75 ns at a data set size of 2 GiB. 4 KiB pages
already show a performance degradation at data set sizes above 256 KiB when the L1 TLB capacity is
exceeded. The gap between 4 KiB and 2 MiB pages continuously widens for larger data set sizes.
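The data set sizes at which TLB misses start to appear correspond to the TLB reach, i.e., the number of entries multiplied by the page size. A minimal sketch of this calculation follows. The 32 entries for 2 MiB pages are taken from the text; the 64 entries for 4 KiB pages are an assumption inferred from the 256 KiB threshold mentioned above and are not confirmed in this section.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size):
    """Bytes of address space that can be accessed without a TLB miss."""
    return entries * page_size


# 32 L1 TLB entries for 2 MiB pages, as stated in the text.
reach_2m = tlb_reach(32, 2 * MIB)

# Assumption: 64 L1 TLB entries for 4 KiB pages, inferred from the 256 KiB threshold.
reach_4k = tlb_reach(64, 4 * KIB)

print(f"2 MiB pages: {reach_2m // MIB} MiB, 4 KiB pages: {reach_4k // KIB} KiB")
```

The factor of 256 between the two reaches illustrates why the latency curves for the two page sizes diverge so early.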
Figure 4.10: Xeon X5670 - TLB miss penalty: The duration of the page table walk depends on the size
of the data set and the page size. In the default system configuration transparent huge pages (THP)
[Arc11] are enabled. Consequently, there is no performance difference between using malloc() and
allocating 2 MiB pages from hugetlbfs. The memory latency increases significantly if the usage of
4 KiB pages is enforced.
6 Measured with BENCHIT_KERNEL_USE_ACCESSES=1 (default 4) to avoid prefetching by other cores.
4.1.2.2 Bandwidth of Local Cache Accesses and Core-to-core Transfers
In this section the results of the single-threaded bandwidth benchmark (see Section 3.5.2) are presented.
The results show the bandwidth of the individual cache levels for accesses of a single thread as well as
the available data rates of core-to-core transfers within and between NUMA nodes. Figure 4.11 depicts
the read and write bandwidths for various coherence states and data set sizes using SSE instructions. The
results are summarized in Table 4.6. Figure 4.12 shows the advantage of 128 bit loads (MOVDQA) over
64 bit loads (MOV) using the example of data cached in state Exclusive.
(a) read, Exclusive (b) write, Exclusive
(c) read, Modified (d) write, Modified⁶
(e) read, Shared (f) write, Shared
Figure 4.11: Xeon X5670 - single-threaded read and write bandwidths: A thread on core 0 accesses
data in its local memory hierarchy (local) as well as data that is present in caches of another core on
the same chip (within NUMA node) and in the second processor (other NUMA node).
Figure 4.12: Xeon X5670 - SIMD effect on bandwidth: Using 128 bit SSE instructions (MOVDQA)
instead of 64 bit loads (MOV) has a huge impact on the achievable bandwidths. The local L1 bandwidth
doubles since the 128 bit ports can only be fully utilized with SIMD instructions. The L2 and L3
bandwidths also increase significantly. The effect on remote cache and main memory accesses is
smaller, but still noticeable.
The read bandwidth from the local L1 and L2 caches does not depend on the coherence state (see “local”
cases in Figure 4.11a, 4.11c, and 4.11e). It is measured with 46.7 and 31.0 GB/s, respectively. The L1
performance can be explained with the single 128 bit read port, while the L2 bandwidth cannot be derived
from the width of the data paths. Data can be read with 24.9 GB/s from the shared L3 cache, which also
services read requests to unmodified data in other cores’ L1 and L2 caches. As depicted in Figure 4.12,
the local cache bandwidths drop to 23.3 (L1), 17.6 (L2), and 15.8 GB/s (L3) if 64 bit loads are used. The
local DRAM bandwidth also benefits from using 128 bit loads. Data in the state Exclusive that belongs
to another core (see “within NUMA node” test series in Figure 4.11a) can only be read with 22.9 GB/s⁷.
Since Exclusive cache lines are evicted silently, the corresponding core valid bits are still set in that case,
which requires snooping the other core. Modified cache lines in another core’s L1 and L2 cache can be
read with 8.7 GB/s and 13.9 GB/s, respectively.
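The effect of the load width on the local cache bandwidths can be quantified from the numbers quoted above. The sketch below computes the speedup of 128 bit over 64 bit loads per cache level and compares the L1 result with the limit of one 16 byte load per cycle. The clock frequency of about 2.93 GHz is an assumption that is not stated in this section.

```python
# Measured local read bandwidths in GB/s, taken from the text.
bw_128 = {"L1": 46.7, "L2": 31.0, "L3": 24.9}   # 128 bit loads (MOVDQA)
bw_64 = {"L1": 23.3, "L2": 17.6, "L3": 15.8}    # 64 bit loads (MOV)


def simd_speedup(wide, narrow):
    """Per-level ratio of the bandwidth with wide loads to that with narrow loads."""
    return {level: wide[level] / narrow[level] for level in wide}


speedup = simd_speedup(bw_128, bw_64)

# Assumption: nominal X5670 core clock of 22 x 133 1/3 MHz (about 2.93 GHz).
CLOCK_GHZ = 22 * 0.4 / 3
l1_peak = 16 * CLOCK_GHZ   # one 128 bit (16 byte) load per cycle

for level, factor in speedup.items():
    print(f"{level}: x{factor:.2f}")
print(f"L1 peak with one 128 bit load per cycle: {l1_peak:.1f} GB/s")
```

Only the L1 cache shows the full factor of two; the gain shrinks for the L2 and L3 caches.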
7 This result applies to the standard block size of 1 KiB (see Section 3.5.2). The gap disappears (23.8 GB/s in both cases) if
only 512 bytes are read in the inner loop, which is in line with the results presented in [HMN09]. The varying performance
is presumably caused by the hardware prefetchers that also have an influence on the L3 bandwidth [Mol+09].
Table 4.6: Xeon X5670 - core-to-core read and write bandwidths using 128 bit instructions: In the
Shared/Forward cases two copies of the data exist, one in state Shared and one in state Forward.
If a local copy exists, it is in state Shared. If both cores share the L3 cache, it contains a single copy.
State Source
read bandwidth in GB/s write bandwidth in GB/s
L1 L2 L3 L1 L2 L3
Modified
local 46.7 31.0
24.9
46.7 28.8
17.6
within NUMA node 8.7 13.9
8.7
13.3
other NUMA node 7.7 9.6 8.9
Exclusive
local 46.7 31.0 24.9 46.7 28.8 17.6
within NUMA node 22.9 25.6 20.8 15.1
other NUMA node 9.8 9.6 9.7 8.2
Shared/
local + within NUMA node 46.7 31.0
24.9
25.6 20.9 15.1
Forward
2 copies within NUMA node 24.9
local + other NUMA node 46.7 31.0
9.6
9.2 7.2
2 copies in other NUMA node 10.0 9.7 8.2
The write bandwidth strongly depends on the coherence state and location of the data (see Figure 4.11b,
4.11d, and 4.11f). The measured 46.7 and 28.8 GB/s for writes to Exclusive and Modified cache lines
in the local L1 and L2 cache are almost identical to the respective read bandwidths. The performance
of writes to the L3 cache depends on the state of the core valid bits. It is 15.1 GB/s if one or more core
valid bits of other cores are still set because of silent evictions (see Table 4.6: Exclusive within NUMA
node and Shared/Forward with both copies within the NUMA node). Otherwise, the L3 cache can be
written to with 17.6 GB/s. Unmodified data that is still present in other cores’ L1 or L2 cache can be
written to with 25.6 GB/s and 20.8 GB/s, respectively. This is above the L3 performance since the data is
only read from the L3 while the written data stays in the local L1 or L2 cache. However, writing to data
that is cached in another core’s L1 cache (25.6 GB/s) is also slightly faster than reading from the same
location (22.9 or 24.9 GB/s). This unexpected behavior is possibly caused by the limited number of
outstanding requests. The load and store buffers cover twelve cache lines (48 entries) and eight cache
lines (32 entries), respectively. However, the load buffers cannot be released until the corresponding
instructions retire after the requested data has arrived. In contrast, stores can be completed and merged
in the write combining buffers before the read for ownership is finished [Int14a, Section 3.6.10]. Due to
the additional buffers, stores do not stall the execution as quickly as loads. It is also important to note
that writes to data that is shared between multiple cores on the chip achieve the same performance as
writes to Exclusive cache lines in another core. That is, no RFO request is sent to the second processor
if cache lines are shared only within one processor. This is another benefit of the core valid bits, which
handle the internal sharing while maintaining exclusivity for the processor [Hil+10, p. 38].
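The statement that the load and store buffers cover twelve and eight cache lines can be reproduced with a short calculation. The sketch assumes that each buffer entry holds one 128 bit (16 byte) access, as used in these measurements, and a cache line size of 64 bytes; both values are assumptions in the sense that this section does not state them explicitly.

```python
CACHE_LINE_BYTES = 64   # assumption: x86 cache line size
ENTRY_BYTES = 16        # assumption: one 128 bit access per buffer entry


def lines_covered(entries, entry_bytes=ENTRY_BYTES, line_bytes=CACHE_LINE_BYTES):
    """Number of whole cache lines spanned by sequential accesses in the buffer."""
    return entries * entry_bytes // line_bytes


load_lines = lines_covered(48)    # 48 load buffer entries
store_lines = lines_covered(32)   # 32 store buffer entries
print(f"load buffers: {load_lines} lines, store buffers: {store_lines} lines")
```

With narrower accesses the same number of entries would cover fewer cache lines, which is consistent with the lower bandwidths observed for 64 bit loads.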
The bandwidth is limited by the QPI link if the second processor is involved (see “other NUMA node”
cases in Figure 4.11). Considering the protocol overhead of 11% [Int09a, Table 7], up to 11.4 GB/s
are possible. The high latency in combination with the limited number of outstanding requests further
reduces the achievable bandwidth. The read bandwidths are 7.7, 9.8, and 10.0 GB/s for accesses
to Modified, Exclusive, and Shared data in remote caches, respectively. Accesses to Modified cache lines
cause write backs to memory since after the read two Shared copies exist, which requires a valid copy
in main memory (see MESIF protocol in Section 2.3.1.2). Thus, the bandwidth is limited by the main
memory accesses. This is not the case if the accessed data is in Exclusive or Shared/Forward state. The
small difference between the two (9.8 vs. 10.0 GB/s) can again be explained with the core snoops that
are required for Exclusive cache lines, which slightly increase the latency. The write bandwidths are
between 7.2 and 9.7 GB/s if data is cached in the other processor. The measured L1 and L2 performance
is mostly above the L3 performance since data is not written back to the local L3 cache. The exceptions
are writes to Modified data in another core’s L1, which are already limited to 8.7 GB/s if the copy is on
the same chip. Surprisingly, writes to Modified cache lines in the second processor are faster than the
corresponding reads. In the case of writes, no write back of the data to main memory is required, as
Modified copies of the cache lines are inserted in the new owner. The L3 bandwidth (read from remote
L3, write to local L3) varies depending on the number of required core snoops. If the data is in state
Modified it is 8.9 GB/s as no core snoops are required in either processor. If the other socket contains an
Exclusive or Forward copy and no local L3 copy exists, the bandwidth drops to 8.2 GB/s. If copies exist
in both processors, the bandwidth is reduced further to 7.2 GB/s.
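The QPI limit of 11.4 GB/s and the interplay of latency and outstanding requests can be illustrated with two small formulas. The sketch assumes that a QPI link carries 2 bytes of data per transfer and direction, which is not stated in this section, and it applies Little's law (concurrency = throughput x latency) with the unloaded remote L3 latency of 70.6 ns for Exclusive cache lines. The latency under load is likely higher, so the results are rough estimates only.

```python
def qpi_payload_bandwidth_gbs(gt_per_s, overhead, bytes_per_transfer=2):
    """Usable bandwidth per direction after subtracting the protocol overhead."""
    return gt_per_s * bytes_per_transfer * (1.0 - overhead)


def lines_in_flight(bandwidth_gbs, latency_ns, line_bytes=64):
    """Little's law: cache lines that must be outstanding to sustain a bandwidth.

    GB/s multiplied by ns yields bytes, so no further unit conversion is needed.
    """
    return bandwidth_gbs * latency_ns / line_bytes


link_limit = qpi_payload_bandwidth_gbs(6.4, 0.11)   # 6.4 GT/s, 11% overhead
needed = lines_in_flight(link_limit, 70.6)          # to saturate the link
observed = lines_in_flight(9.8, 70.6)               # measured Exclusive read bandwidth

print(f"link limit: {link_limit:.1f} GB/s")
print(f"lines in flight needed: {needed:.1f}, implied by measurement: {observed:.1f}")
```

Under these assumptions, saturating the link would require more than twelve cache lines in flight, while the measured 9.8 GB/s corresponds to roughly eleven.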
4.1.2.3 Bandwidth Scaling of Shared Resources
This section examines the performance of the shared resources. The L1 and L2 cache bandwidths scale
linearly with the number of used cores since each core has dedicated resources. They are therefore not
considered here. The scaling of the L3 and memory bandwidth with the number of concurrently active
cores is depicted in Figure 4.13. Table 4.7 summarizes these results and also details the influence of
Hyper-Threading, the available remote memory bandwidth, and the non-temporal store performance.
The L3 read bandwidth is measured using cache lines in state Exclusive while the write bandwidth
benchmark writes to already Modified cache lines. Therefore, the read bandwidth is not influenced by
write backs from the L1 and L2 (silent eviction) while the write bandwidth includes them. The influence
of write backs is investigated in more detail in [Mol+09].
The L3 read bandwidth scales almost linearly from 24.9 GB/s for a single thread to 72.1 GB/s with three
concurrently reading cores. Using four cores further increases the bandwidth from 72.1 to 83.1 GB/s.
The benefit of using five or six cores is very small. Using two threads per core slightly decreases the
achievable read bandwidths. If 64 bit loads are used, the L3 bandwidth per core drops significantly
(see Figure 4.12). In that case the L3 bandwidth scales almost linearly from 15.8 GB/s using one core to
77.8 GB/s using five cores and reaches a maximum of 83.1 GB/s using all six cores. The write bandwidth
already starts to saturate with two concurrently writing cores. Furthermore, the maximum of 25.7 GB/s is
much lower than the achievable read bandwidth. It is even lower than the 50% of the read bandwidth that
could be explained by the read-modify-write procedure. The L3 write bandwidth does not depend on the
width of the store instructions. The influence of Hyper-Threading is minimal.
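The saturation of the L3 read bandwidth can be expressed as a parallel efficiency, i.e., the aggregate bandwidth divided by the bandwidth that perfect linear scaling of the single-core result would yield. A minimal sketch using the L3 read values with one thread per core (see Table 4.7); the function name is illustrative.

```python
# L3 read bandwidth in GB/s for 1 to 6 cores, one thread per core (Table 4.7).
l3_read = {1: 24.9, 2: 48.3, 3: 72.1, 4: 83.1, 5: 85.3, 6: 85.9}


def parallel_efficiency(bandwidths):
    """Aggregate bandwidth relative to perfect linear scaling of one core."""
    single = bandwidths[1]
    return {n: bw / (n * single) for n, bw in bandwidths.items()}


eff = parallel_efficiency(l3_read)
for n, e in eff.items():
    print(f"{n} cores: {e:.0%}")
```

The efficiency stays above 95% up to three cores and falls below 60% with all six cores.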
(a) read (b) write
Figure 4.13: Xeon X5670—aggregated bandwidth using 128 bit loads and stores: L3 cache and local
memory reach their maximum performance without using all cores. One thread per core is used in
these measurements. See Table 4.7 for more details.
The local memory bandwidths start to saturate with only two active cores. With up to 19.2 GB/s, the
read bandwidth reaches only 60% of the 32 GB/s that the three PC3-10600R channels should provide.
It decreases further to 18.0 GB/s if 64 bit loads are used. If Hyper-Threading is used, the local read
bandwidth is limited to 17.5 GB/s per socket, which is achieved with four threads on two cores. Using
more than two cores is disadvantageous in this case. The write bandwidth of up to 8.8 GB/s (9.0 GB/s
using Hyper-Threading) is close to 50% of the read bandwidth, which is in line with expectations.
Non-temporal stores (write-nt) enable higher data rates of up to 12.8 GB/s per socket. The influence of
TLB misses on the bandwidth of sequential memory accesses is small. If 4 KiB pages are used instead of
2 MiB pages and THP is disabled, the read and write bandwidths using a data set size of 1 GiB decrease
from 19.2 to 18.6 GB/s and from 8.7 to 8.1 GB/s, respectively. The bandwidth of remote memory
accesses is limited by the QPI link that connects the processors. The read bandwidth reaches 11.1 GB/s.
The corresponding write bandwidth of up to 7.1 GB/s is not limited to 50% of that since the bi-directional
QPI connection allows reads from and writes to remote memory to be performed concurrently. However,
the remote write bandwidth is noticeably lower than the local write bandwidth. This indicates that the
performance is limited by the number of credits for remote accesses [Int09a, p. 9]; [Hil+10, p. 40].
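The expectation that writes reach about 50% of the read bandwidth follows from a simple traffic model: with write-allocate caches every stored cache line is first read (read for ownership) and later written back, so a store stream moves twice its own volume. The sketch below encodes this model and compares it with the measured values quoted above. It is a deliberately crude model, and the function names are illustrative.

```python
def expected_write_bandwidth(read_gbs, write_allocate=True):
    """Upper bound for the store bandwidth given the achievable read bandwidth.

    With write-allocate, each line is read before it is written back, which
    doubles the memory traffic per stored byte.
    """
    return read_gbs / 2 if write_allocate else read_gbs


def write_to_read_ratio(write_gbs, read_gbs):
    return write_gbs / read_gbs


model_local = expected_write_bandwidth(19.2)          # local memory, regular stores
ratio_local = write_to_read_ratio(8.8, 19.2)          # measured local write vs. read
ratio_remote = write_to_read_ratio(7.1, 11.1)         # measured remote write vs. read

print(f"model: {model_local:.1f} GB/s, measured ratio local: {ratio_local:.0%}, "
      f"remote: {ratio_remote:.0%}")
```

The remote ratio exceeds 50% because reads and writes travel in opposite directions over the bi-directional QPI link.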
The L3 and memory performance is similar to the preceding processor generation, which is described
in [Mol+09] and [HMN09]. However, the number of cores increases from four to six. The increased
computational performance is not accompanied by an equivalent bandwidth improvement. Therefore, a
higher flop per byte ratio is required in order to achieve the peak performance.
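The required flop per byte ratio can be estimated as the machine balance, i.e., the peak floating point rate divided by the memory bandwidth. The following sketch rests on assumptions that are not part of this section: four double precision floating point operations per cycle and core (one 128 bit addition and one 128 bit multiplication), a core clock of about 2.93 GHz, and, for the hypothetical four-core comparison, the same clock and the same measured memory bandwidth of 19.2 GB/s.

```python
def machine_balance(cores, clock_ghz, flops_per_cycle, mem_bw_gbs):
    """Flops that must be performed per byte from memory to reach peak performance."""
    peak_gflops = cores * clock_ghz * flops_per_cycle
    return peak_gflops / mem_bw_gbs


# Assumptions: about 2.93 GHz core clock and 4 double precision flops per cycle.
CLOCK_GHZ = 22 * 0.4 / 3
FLOPS_PER_CYCLE = 4

six_core = machine_balance(6, CLOCK_GHZ, FLOPS_PER_CYCLE, 19.2)
# Hypothetical four-core processor with identical clock and memory bandwidth.
four_core = machine_balance(4, CLOCK_GHZ, FLOPS_PER_CYCLE, 19.2)

print(f"six cores: {six_core:.2f} flop/byte, four cores: {four_core:.2f} flop/byte")
```

With unchanged bandwidth, the balance grows in proportion to the number of cores.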
Table 4.7: Xeon X5670 - L3 and main memory bandwidth scaling: The last level cache and main
memory bandwidths do not scale linearly with the number of cores. The write bandwidths are significantly
lower than the corresponding read bandwidths. All measurements use 128 bit SSE instructions.
Bandwidth in GB/s with one (two) threads per core:
       L3                        local memory                           remote memory
cores  read         write        read         write      write-nt      read         write
1      24.9 (23.2)  17.6 (17.8)  12.7 (12.9)  7.5 (7.4)  8.0 (7.4)     8.9 (8.9)    5.8 (5.8)
2      48.3 (46.5)  24.6 (24.8)  18.2 (17.5)  8.5 (8.8)  9.6 (9.3)     10.7 (11.0)  7.1 (7.1)
3      72.1 (68.7)  25.5 (25.4)  19.2 (16.9)  8.8 (8.8)  10.4 (10.0)   11.1 (11.0)  7.0 (6.9)
4      83.1 (82.5)  25.7 (25.6)  19.2 (16.9)  8.7 (8.8)  11.2 (10.7)   11.0 (10.9)  6.9 (6.9)
5      85.3 (85.0)  25.6 (25.6)  19.2 (16.7)  8.7 (8.9)  12.2 (11.3)   11.0 (10.9)  6.8 (6.9)
6      85.9 (85.0)  25.6 (25.6)  19.2 (16.7)  8.7 (9.0)  12.8 (11.9)   10.9 (10.9)  6.8 (6.9)
4.1.3 Dual-socket Xeon E5-2670
This section details the cache and memory performance of a system with two Intel Xeon E5-2670 proces-
sors, which are based on the Sandy Bridge [Int14a, Section 2.2] micro-architecture. Preliminary results
have been previously published in [MHS14]. The text in this section is largely based on this publication.
Sandy Bridge is the successor of the Nehalem micro-architecture (see Section 4.1.2). It also is a super-
scalar out-of-order architecture and supports Hyper-Threading—Intel’s implementation of SMT—with
two logical CPUs per core. A block diagram of the processor core is shown in Figure 4.14. The fetch
window and the throughput of the four decoders have not changed. However, a micro-op cache that
caches decoded instructions has been added [Int14a, Section 2.2.2.2]. Fetch and decode are skipped if
the required instructions are found in this cache [Sol+03]. Register renaming and retirement still process
four micro-ops per cycle while the scheduler can dispatch up to six micro-ops. The number of scheduler
and reorder buffer entries has been increased to 54 and 168, respectively. Speculative results are
kept in a physical register file together with the architectural registers [Lem11]. The floating point units
can process two arithmetic instructions per cycle—one addition and one multiplication. They support
256 bit wide SIMD instructions from the newly introduced Advanced Vector Extensions (AVX) instruc-
tion set. However, mixing AVX and SSE code causes transition penalties [Int14a, Section 11.3]. The
data cache capacities of 32 KiB L1 and 256 KiB L2 per core remain unchanged. A second read port has
been added to the L1 cache. Thus, it supports two 128 bit loads and one 128 bit store per cycle. The
number of load and store buffers has been increased to 64 and 36, respectively.
Figure 4.15 depicts the dual-socket test system with two Xeon E5-2670 (Sandy Bridge-EP) processors.
Each processor has eight cores, which share an inclusive 20 MiB L3 cache. The L3 cache is divided into
eight 2.5 MiB slices. Cache lines are placed in a certain slice depending on a hash of the corresponding
memory address [Int12a, Section 2.3.1]. Thus, all cores scatter their data over all slices. Each integrated
memory controller has four DDR3 channels, which are populated with PC3-12800R memory modules.
This results in a memory bandwidth of 51.2 GB/s per socket. A bi-directional ring bus is used to connect
the components on the chip [Hua+12]. The two processors are connected with two QPI links that operate
at 8 GT/s. Each link has a bandwidth of 32 GB/s (16 GB/s per direction). Thus, the two links together
can transfer 32 GB/s in each direction. The MESIF protocol (see Section 2.3.1.2) is used to maintain cache coherence. The
protocol is implemented by the caching agents (CAs) within each LLC slice and the home agents (HAs)
in the memory controllers [Int12a, Section 2.3.1 and 2.4.1].
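The effect of the address hash, namely that all cores scatter their data over all L3 slices, can be illustrated with a toy example. The hash used below is purely hypothetical: it XOR-folds the cache line address into three bits and is not the function used by the processor, which is not given in this section. The sketch only shows that such a hash spreads a contiguous buffer evenly over eight slices.

```python
from collections import Counter


def slice_index(address, n_slices=8, line_bytes=64):
    """Hypothetical XOR-folding hash; NOT the processor's actual hash function."""
    bits = n_slices.bit_length() - 1   # n_slices must be a power of two
    line = address // line_bytes
    index = 0
    while line:
        index ^= line & (n_slices - 1)  # fold the next group of address bits
        line >>= bits
    return index


# Distribute 4096 consecutive cache lines (256 KiB) over the eight slices.
counts = Counter(slice_index(addr) for addr in range(0, 4096 * 64, 64))
for s in sorted(counts):
    print(f"slice {s}: {counts[s]} lines")
```

Because consecutive lines land in different slices, a single core streaming through a buffer accesses every slice and therefore uses the whole ring bus.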
[Block diagram of the Sandy Bridge core: in-order front-end (branch prediction, 128 b fetch and predecode, instruction queue, decode with one complex and three simple decoders, micro-op cache), out-of-order execution (rename/alloc, reorder buffer with 168 entries, scheduler with 54 entries, execution ports 0 to 5), and memory subsystem (32 KiB L1 instruction and data caches, memory order buffer with 64 load / 36 store buffers, L2 cache, connection to the ring bus)]
Figure 4.14: Intel Sandy Bridge micro-architecture, based on [Int14a, Figure 2-3] (derived from [Mol08,
Figure 2.26]): The processor cores implement superscalar out-of-order execution (see Section 2.1.1).
A single reservation station (called scheduler) dispatches instructions to the execution units.
[Diagram of the test system: two 8-core Sandy Bridge-EP packages. Each die contains eight cores (0 to 7 and 8 to 15) with private L1 and L2 caches and one L3 slice per core, an IMC with four DDR3 channels (A to D and E to H), a system agent (SA), and QPI. The diagram also shows the PCH and the PCIe connections]
Figure 4.15: Composition of the dual-socket Xeon E5-2670 test system, based on [Int12a, Figure 1-1]
(derived from [MHS14, Figure 1a]): Each processor contains eight cores, 20 MiB of shared last
level cache (LLC), and an integrated memory controller (IMC), which are connected with a ring bus.
The system agent (SA) provides 40 PCIe 3.0 lanes per socket. The remaining I/O functionality is
implemented in the platform controller hub (PCH). The two processors are connected with two QPI
links [Int12b].
4.1.3.1 Latency of Cache and Main Memory Accesses
Figure 4.16 depicts latency measurements (see Section 3.5.1) for different coherence states. The DRAM
latency is shown in Figure 4.17 for comparison. The results are summarized in Table 4.8. The influence
of TLB misses on the latency is examined in Figure 4.18.
The behavior of on-chip transfers is similar to the Westmere-EP system described in Section 4.1.2. The
local L1 and L2 cache have a latency of 1.5 and 4.6 ns, respectively. L3 accesses cause an average delay
of 15 ns. Surprisingly, accesses to Shared cache lines in the local L1 or L2 cache also take 15 ns. This
indicates that the local caching agent is involved in order to reclaim the Forward state from the other
processor. The additional latency for accesses to Exclusive cache lines that have been used by another
core, which can be observed on the Westmere-EP system, is also present in the Sandy Bridge micro-
architecture. This is again caused by silent evictions, which do not clear the corresponding core valid
bits. Snooping another core increases the latency to 33.5 ns. Shared and Forward cache lines
are evicted silently as well. However, in this case it is not necessary to check the other cores as the data
in the L3 cache is guaranteed to be valid. Therefore, data is returned directly from the L3 cache with a
latency of 15 ns. The latency increases to around 39 ns if the hardware prefetchers are enabled and the
helper thread that is used by the coherence state control mechanism (see Section 3.3) runs on a core in
the other socket. In this case the prefetchers anticipate the next data placement phase, which disturbs the
measurement.
Table 4.8: Xeon E5-2670 - memory read latency: accesses to the local cache hierarchy compared to
accesses to data in other locations. In the Modified and Exclusive cases a single valid copy exists in
the respective location. In the Shared and Forward cases the second copy is more distant or does not
exist anymore. All results are in nanoseconds (cycles).
Source            State                          L1          L2          L3          DRAM
local             Modified/Exclusive/Forward     1.5 (4)     4.6 (12)    15.0 (39)   81.5 (212)
                  Shared, Forward in other node  <---- 15.0 (39) ---->   15.0 (39)
within NUMA node  Modified                       41.5 (108)  38.1 (99)   15.0 (39)   81.5 (212)
                  Exclusive                      <---------- 33.5 (87) ---------->
                  Forward                        <---------- 15.0 (39) ---------->
                  Shared, Forward in other node  <---------- 15.0 (39) ---------->
other NUMA node   Modified                       <---------- 127 - 141 ---------->   134
                  Exclusive                      <---------- 84.6 (220) --------->
                  Forward                        <---------- 87.3 (227) --------->
                  Shared, no Forward copy        <---------- 128 - 134 ---------->
(a) Modified (b) Exclusive
(c) Forward, 2nd copy in other NUMA node (d) Shared, 2nd copy (Forward) removed
Figure 4.16: Xeon E5-2670 - memory read latency: Thread on core 0 accessing its local cache hierarchy
(local) as well as cache lines of core 1 in the same processor (within NUMA node) and core 8 in
the second processor (other NUMA node). Hardware prefetchers are disabled in order to obtain
consistent results for the cases Forward and Shared. The results for the cases Modified and Exclusive
are independent of the prefetcher settings.
Figure 4.17: Xeon E5-2670 - DRAM latency: The DRAM latency is measured after invalidating all
caches using the CLFLUSH instruction. The latency is noticeably lower if small data set sizes are used.
The effect can be attributed to the DRAM characteristics. Small data sets fit into fewer DRAM pages.
This increases the likelihood of accessing an already opened page, which is faster [Che04, Chapter III,
Section 2.3.1].
Figure 4.18: Xeon E5-2670 - impact of TLB misses on memory latency: If 2 MiB pages are used, the
measurements are not influenced by TLB misses up to a data set size of 64 MiB. Since the kernel
transparently uses huge pages by default, it is not necessary to explicitly allocate memory from
hugetlbfs. The memory latency increases significantly if the usage of 4 KiB pages is enforced by
disabling THP.
Accesses to caches in the second processor and main memory have a higher latency. The remote L3
sends data with a delay of 84.6 ns if the cache lines are in state Exclusive. Surprisingly, reading data in
state Forward takes slightly longer instead of being noticeably faster as is the case for on-chip transfers.
This indicates that transferring the Forward state from one processor to the other involves a snoop request
to the cores, i.e., the cores actually distinguish the states Shared and Forward and have to be notified to
complete the transition. Cache lines in state Modified have a higher access latency of 127 – 141 ns
as the data has to be written back to memory. The upward slope shown for remote L1 and L2 accesses is
caused by the decreasing likelihood of accessing already opened DRAM pages. The local main memory
latency is measured with 81.5 ns if at least one core in the second socket is active. It increases to 86.5 ns
if the second processor is completely idle. The difference is even higher in case of remote memory
accesses. They have a latency of 134 ns, which increases to 151 ns if the second processor uses deep
sleep states. The TLB entries for 2 MiB pages support working sets of up to 64 MiB. Using larger data
sets results in slightly higher average latencies. Allocating memory in 4 KiB pages has a significant
impact on the latency, but this does not happen unless transparent huge pages are deliberately disabled.
4.1.3.2 Bandwidth of Local Cache Accesses and Core-to-core Transfers
This section details the results of the single-threaded bandwidth benchmark (see Section 3.5.2), which
shows the bandwidth of the individual cache levels as well as the available data rates of core-to-core
transfers within and between NUMA nodes. Figure 4.19 depicts the read and write bandwidth for ac-
cesses to Exclusive, Modified, and Shared cache lines. The performance of accesses to Shared (Forward)
cache lines depends on the location of the copies. Figure 4.19 shows the worst cases. The results are
summarized in Table 4.9, which also includes different distributions of Shared and Forward cache lines.
Table 4.9: Xeon E5-2670 - read and write bandwidths of core-to-core transfers in GB/s: Writes consist
of a read from the original location and a write to the local cache hierarchy. All measurements use
256 bit load and store instructions (VMOVDQA). Values in brackets show the L3 performance with
disabled hardware prefetchers.
State Source
read bandwidth write bandwidth
L1 L2 L3 L1 L2 L3
Modified
local 82.8 35.2
25.1 (22.4)
41.0 24.5 17.9 (18.9)
within NUMA node 8.2 12.0 8.2 11.8 16.2 (18.9)
other NUMA node 7.0 8.5 8.7 6.8 8.4 7.2
Exclusive
local 82.8 35.2 25.1 (22.4) 41.0 24.5 17.9 (18.9)
within NUMA node 19.4 (16.1) 16.4 13.6 (15.1)
other NUMA node 8.5 8.3 7.8 7.2
Shared/
local + within NUMA node 82.8 35.2
25.1 (22.4)
16.4 13.6 (14.9)
Forward
2 copies, within NUMA node 25.1 (22.4) 15.6 13.6 (14.6)
local + other NUMA node 17.7 (22.4)
7.8 7.2
2 copies, other NUMA node 8.7
(a) read, Exclusive (b) write, Exclusive
(c) read, Modified (d) write, Modified
(e) read, Shared (f) write, Shared
Figure 4.19: Xeon E5-2670—single-threaded read and write bandwidths using 256 bit instructions
(VMOVDQA): Thread on core 0 accesses data in its local memory hierarchy (local) as well as data
that is present in caches of another core on the same chip (within NUMA node) and in the second
processor (other NUMA node). In the Shared cases, the Forward copy, which is created by the
coherence state control mechanism, has been removed before the measurement.
BENCHIT_KERNEL_STARTUP_REG_OPS is set to 1 in order to absorb the AVX/SSE transition
penalty [Int14a, Section 11.3] prior to the measurement.
The measured 82.8 GB/s for reads from the local L1 cache are close to the theoretical peak performance
for two 128 bit loads per cycle at 2.6 GHz. As depicted in Figure 4.20, the L1 bandwidth can only be fully
utilized using SIMD instructions. It is cut in half if 64 bit loads are used. The L2 bandwidth also depends
on the width of the load instructions. It reaches 35.2, 46.0, and 27.4 GB/s using 256 bit (VMOVDQA),
128 bit (MOVDQA), and 64 bit (MOV) loads, respectively. If SIMD instructions are used, the inclusive L3
cache supports a read bandwidth of up to 25.1 GB/s for requests that it can service directly. It drops to
19.4 GB/s if a snoop request has to be sent to another core (see “within NUMA node” case in Figure 4.19a).
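The theoretical L1 read peak and its dependence on the load width can be sketched with a simple port model. The two 128 bit load ports and the 2.6 GHz clock are taken from the text. That a 256 bit load occupies a port for two cycles is an assumption of this model; it is consistent with the observation that 256 bit loads do not significantly increase the data rates, but it is not stated in this section.

```python
import math


def l1_read_peak_gbs(load_bits, clock_ghz, ports=2, port_bits=128):
    """Peak L1 read bandwidth for a given load width under a simple port model."""
    # Assumption: a load wider than the port occupies it for several cycles.
    cycles_per_load = math.ceil(load_bits / port_bits)
    bytes_per_cycle = ports * (load_bits / 8) / cycles_per_load
    return bytes_per_cycle * clock_ghz


peaks = {bits: l1_read_peak_gbs(bits, 2.6) for bits in (64, 128, 256)}
for bits, peak in peaks.items():
    print(f"{bits} bit loads: {peak:.1f} GB/s")

# Fraction of the theoretical peak reached by the measured 82.8 GB/s.
utilization = 82.8 / peaks[128]
print(f"measured / peak: {utilization:.1%}")
```

In this model 64 bit loads reach half of the peak, while 128 bit and 256 bit loads share the same limit of 32 bytes per cycle.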
Figure 4.20: Xeon E5-2670 - the ISA’s impact on memory bandwidth: The read bandwidths in the local
cache hierarchy strongly depend on the used ISA. SIMD instructions significantly increase the achievable
bandwidths in all cache levels and main memory. However, using 256 bit loads (VMOVDQA) instead of
128 bit loads (MOVDQA) does not significantly increase the data rates. The L2 bandwidth even decreases
if AVX code is used.
Modified cache lines from other cores’ L1 and L2 caches can be read with 8.2 and 12.0 GB/s, respectively.
The read bandwidth of data in state Shared is measured with 17.7 GB/s—even from the local L1 or L2
cache. In this case the helper thread used by the coherence state control mechanism runs on a core in the
second processor and its hardware prefetchers disturb the measurement as has already been observed in
the latency measurements. If the hardware prefetchers are disabled, the L3 read bandwidth is 22.4 GB/s
in all cases that do not include a snoop request to another core. The L1 and L2 bandwidths remain on the
L3 performance level, even if the prefetchers are disabled. This is presumably caused by the coherence
protocol, which transfers the Forward state to the requesting core’s socket.
The write bandwidths are generally lower than the corresponding read bandwidths. The L1 and L2 cache
support writes with 41.0 and 24.5 GB/s, respectively. The L3 cache can be written with up to 17.9 GB/s.
There is no performance difference between SSE and AVX stores. 64 bit stores limit the L1, L2, and
L3 bandwidth to 20.3, 18.3, and 15.6 GB/s, respectively. If another core needs to be invalidated, the
L3 bandwidth is reduced to 13.6 GB/s. Writes to Exclusive cache lines in other L1 or L2 caches are
slightly faster (16.4 GB/s) as the data is only read from the L3 and written to the local L1 or L2 cache.
Surprisingly, writes to Modified cache lines also exhibit performance differences depending on the core
that performed the data placement (16.2 vs. 17.9 GB/s). In contrast to the different performance levels
for reading Exclusive cache lines, this effect disappears if the hardware prefetchers are disabled, i.e., it
is caused by suboptimal data placement. Writes to data in state Shared (Forward) are as fast as writes to
data in state Exclusive, if the data has only been shared within one processor. Thus, on-chip sharing is
apparently managed by the core valid bits while maintaining exclusivity for the processor, as is the
case in the Westmere-EP micro-architecture [Hil+10, p. 38].
The bandwidths are significantly lower, if the second processor is involved. The read bandwidth from the
remote L3 cache is limited to 8.7 GB/s. Modified data from L1 or L2 caches in the other NUMA node
can only be read with 7.0 and 8.5 GB/s, respectively. The write bandwidths are between 6.8 and 8.4 GB/s.
The measured L1 and L2 performance is mostly above the L3 performance since data is not written back
to the local L3 cache. The exception is writes to Modified data in another core’s L1 cache, which are
already limited by the low read bandwidth.
4.1.3.3 Bandwidth Scaling of Shared Resources
The aggregated memory bandwidth of one to eight concurrently reading and writing cores within one
processor is depicted in Figure 4.21. The results are summarized in Table 4.10, which also covers non-
temporal stores (write-nt), remote memory accesses, and the impact of Hyper-Threading.
The L3 read and write bandwidths scale almost linearly with the number of cores. They reach 199.6
and 140.2 GB/s, respectively. This is a significant improvement compared to the preceding processor
generation, which only reaches 85.9 and 25.6 GB/s (see Section 4.1.2). The main memory bandwidths
of up to 44.2 and 20.1 GB/s per socket for reading and writing are significantly higher as well. Hyper-
Threading slightly improves the L3 bandwidth. However, the main memory bandwidth decreases a bit
if the utilization is close to its maximum. The usage of non-temporal stores (VMOVNTDQ) reduces the
achievable aggregated bandwidth if fewer than four cores write concurrently. However, due to the better
(a) read (b) write
Figure 4.21: Xeon E5-2670—bandwidth using multiple cores (one thread per core): The L3 bandwidth
scales almost linearly with the number of cores. The memory bandwidth can be fully utilized without
using all cores (see Table 4.10 for details). 256 bit loads and stores are used in these measurements.
scaling with the number of cores, up to 36.6 GB/s can be written per socket, which is almost twice the
performance of normal stores (VMOVDQA).
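The scaling behavior can be quantified as a parallel efficiency, i.e., the aggregated bandwidth divided by the number of cores times the single-core bandwidth. The following sketch uses the one-thread-per-core read bandwidths from Table 4.10; the helper name is illustrative.

```python
# Read bandwidths in GB/s (one thread per core), copied from Table 4.10.
L3_READ = {1: 25.1, 2: 50.0, 3: 74.8, 4: 99.4, 5: 123.9, 6: 148.4, 7: 172.6, 8: 197.4}
MEM_READ = {1: 11.5, 2: 21.9, 3: 31.1, 4: 38.9, 5: 43.3, 6: 44.2, 7: 43.1, 8: 43.8}


def scaling_efficiency(bandwidth_by_cores):
    """Aggregated bandwidth relative to perfect linear scaling of one core."""
    single = bandwidth_by_cores[1]
    return {n: bw / (n * single) for n, bw in bandwidth_by_cores.items()}


l3_eff = scaling_efficiency(L3_READ)
mem_eff = scaling_efficiency(MEM_READ)

for n in sorted(L3_READ):
    print(f"{n} cores: L3 {l3_eff[n]:.0%}, local memory {mem_eff[n]:.0%}")
```

With these numbers the L3 cache retains about 98% efficiency at eight cores, whereas the local memory efficiency falls below 50% because the memory controller saturates.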
The memory bandwidths are influenced by the power management on the second socket [MHS14]⁸.
In the measurements shown in Table 4.10 the cores in the second socket are idle. Thus, the operating
system uses ACPI C-states (see Section 2.1.4) to reduce their power consumption. If at least one core
in the second processor is kept active, the local read bandwidth increases to 45.1 GB/s. The influence
of TLB misses on the aggregated bandwidth is very small. Since transparent huge pages are enabled
by default, the performance stays the same if memory is allocated with malloc() instead of using
hugetlbfs. The memory bandwidth is only affected if THP is disabled as well. In that case the per socket
read bandwidth decreases from 45.1 GB/s to 43.8 GB/s at a data set size of 1 GiB.
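Whether THP is active on a Linux test system can be checked before running such measurements. The sketch below assumes the usual sysfs file /sys/kernel/mm/transparent_hugepage/enabled, in which the active mode is marked with square brackets; the function names are illustrative and not part of the benchmark suite.

```python
from pathlib import Path

# Assumption: standard Linux sysfs location of the THP setting.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the bracketed mode from a string such as 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def current_thp_mode(path=THP_PATH):
    """Read the active THP mode, or None if the file is not available."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


print("THP mode:", current_thp_mode())
```

A result of "never" would correspond to the configuration in which the bandwidth reduction described above is observed.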
The two QPI links have a combined bandwidth of 32 GB/s in each direction. Thus, the QPI links are not
fast enough to fully utilize the remote memory bandwidth. The headers of the QPI packets reduce the
effective bandwidth to 28.44 GB/s as each 64 byte cache line is transferred using a 72 byte packet. However,
only 59% (up to 16.8 GB/s with five reading cores) of this theoretical peak bandwidth is reached in the
Table 4.10: Xeon E5-2670—L3 and main memory bandwidth using 256 bit loads and stores: The local
memory bandwidth starts to saturate with four active cores. The performance of remote memory
accesses is limited by the QPI interconnect.
Bandwidth in GB/s with one (two) threads per core

       L3                            local memory, 2nd socket idle          remote memory
cores  read           write          read         write        write-nt     read         write
1      25.1 (25.4)    17.9 (18.3)    11.5 (12.8)  9.0 (9.3)    5.2 (4.9)    7.8 (9.3)    6.8 (7.5)
2      50.0 (50.7)    35.6 (36.4)    21.9 (24.3)  16.0 (16.9)  10.4 (9.7)   14.4 (14.9)  8.9 (8.9)
3      74.8 (75.4)    53.0 (54.2)    31.1 (34.0)  18.1 (19.8)  15.4 (14.4)  16.7 (16.4)  8.5 (8.4)
4      99.4 (100.4)   70.3 (71.8)    38.9 (39.8)  19.9 (19.9)  20.3 (19.0)  16.7 (16.7)  8.3 (8.3)
5      123.9 (125.2)  87.3 (89.1)    43.3 (42.1)  19.8 (19.9)  25.1 (23.3)  16.8 (16.5)  8.2 (8.3)
6      148.4 (150.0)  104.5 (106.9)  44.2 (42.2)  20.1 (18.5)  29.7 (27.3)  16.7 (16.0)  8.2 (8.3)
7      172.6 (174.6)  121.3 (123.2)  43.1 (41.1)  19.4 (19.2)  33.8 (30.6)  16.7 (16.0)  8.2 (8.2)
8      197.4 (199.6)  137.5 (140.2)  43.8 (40.8)  19.8 (19.0)  36.6 (32.8)  16.5 (15.8)  8.2 (8.2)
⁸ The results presented in this publication are slightly different as they have been obtained with the BIOS option “Alternate
RTID setting” set to enabled while the default setting (disabled) is used here.
default configuration. One reason for the suboptimal performance is the idle state of the second processor.
If one core is kept active, the remote memory bandwidth increases to 20.2 GB/s. The performance of
remote accesses can be improved further by enabling the BIOS option “Alternate RTID setting”, which
is disabled by default. This increases the remote memory bandwidth to 23.6 GB/s—83% of the effective
QPI bandwidth. However, changing this setting also reduces the local bandwidth to 42.2 GB/s and the
BIOS tuning guide advises against using it [Dell12, p. 19]. Nevertheless, the performance discrepancy
between the modes demonstrates that the memory bandwidth is influenced by the number of concurrent
coherence protocol transactions.
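The effective QPI bandwidth and the utilization figures quoted in this paragraph follow from simple arithmetic. The sketch below reproduces them from the values given in the text; the variable and function names are illustrative.

```python
RAW_QPI_BANDWIDTH_GBS = 32.0   # both links combined, per direction
PAYLOAD_BYTES = 64             # one cache line
PACKET_BYTES = 72              # cache line plus header

# Only the payload share of each packet counts as useful bandwidth.
effective_qpi_gbs = RAW_QPI_BANDWIDTH_GBS * PAYLOAD_BYTES / PACKET_BYTES


def qpi_utilization(measured_gbs):
    """Fraction of the effective QPI bandwidth reached by a measurement."""
    return measured_gbs / effective_qpi_gbs


measurements = {
    "default configuration": 16.8,
    "one remote core kept active": 20.2,
    "Alternate RTID setting enabled": 23.6,
}

print(f"effective QPI bandwidth: {effective_qpi_gbs:.2f} GB/s")
for label, gbs in measurements.items():
    print(f"{label}: {gbs} GB/s = {qpi_utilization(gbs):.0%}")
```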
Apart from the L2 cache (see Figure 4.20), there are no significant differences in the achievable cache
and memory bandwidths between the SSE and AVX routines. However, the reduced performance of
64 bit loads and stores also influences the aggregated performance. The read and write bandwidths
that are supported by the L3 cache still scale close to linearly with the number of cores, but they are limited
to 20.9 (one core) – 162.1 GB/s (eight cores) and 15.6 (one core) – 122.8 GB/s (eight cores), respectively. The
read bandwidth from local memory is reduced as well. It starts at 8.2 GB/s using one core instead of the
11.8 GB/s that can be achieved using SIMD instructions. With 41.5 GB/s the aggregated bandwidth of
8 concurrently reading cores is comparable to the SIMD results. However, six instead of four cores are
required to get close to the maximum.
4.2 Standard Compute Nodes With Complex NUMA Topologies
In this section the memory subsystem performance of servers with a more complex NUMA topology is
discussed. Table 4.11 lists the properties of the selected test systems. The Bull system contains only
two 12-core Xeon E5 v3 processors, but can be configured to expose four NUMA nodes to the operating
system. The Megware system contains four 16-core Opteron 6200 series processors. Each processor
contains two dies, which the OS sees as separate NUMA nodes. Both systems feature snooping based
cache coherence protocols with directory support (see Section 2.3.2.2 and Section 2.3.2.3) to reduce the
number of requests and responses between the processors.
Table 4.11: Test systems with complex NUMA topologies: Bull SAS bullx R421 E4 [Bul14];[Int14a,
Section 2.1], Megware/SuperMicro server [Sup06; Sup14; Amd12a];[Amd14c, Section 2.2]
Vendor              Bull SAS                                Megware/Supermicro
System              bullx R421 E4                           SuperChassis 818TQ-1400LPB with H8QG6 motherboard
Processors          2x Intel Xeon E5-2680 v3 (Haswell-EP)   4x AMD Opteron 6274 (Interlagos)
Cores/logical CPUs  24/48                                   64 (32 compute units⁹)/64
Core clock          2.5 GHz¹⁰                               2.2 GHz
Uncore/NB clock     variable, up to 3.0 GHz¹¹               2.0 GHz
FPUs                2x 256 bit FMA per core                 2x 128 bit FMA per compute unit⁹
L1 cache            2x 32 KiB per core                      64K L1-I per compute unit⁹, 16K L1-D per core
L2 cache            256 KiB per core                        2 MiB per compute unit⁹
L3 cache            30 MiB per chip                         2x 6 MiB per socket
IMC per socket      4x PC4-2133P-R                          4x PC3-12800R
Memory size         128 GiB (8x 16 GiB)                     64 GiB (16x 4 GiB)
Interconnect        QPI 9.6 GT/s (38.4 GB/s)                HT 6.4 GT/s (25.6 GB/s)
NUMA topology       2x 12 cores or 4x 6 cores               8x 8 cores
⁹ A compute unit (CU) is a dual-core module, which is used as building block of the Opteron 6200 series processors.
¹⁰ 2.1 GHz base frequency for AVX workloads [Int15e, Table 3]
¹¹ Uncore frequency scaling automatically adjusts frequency based on the workload [Hac+15]
4.2.1 Dual-socket Intel Xeon E5-2680 v3
This section details the cache and memory performance of a dual-socket system with Intel Xeon E5-
2680 v3 processors, which are based on the Haswell micro-architecture. Most of the results have been
previously published in [Mol+15]. The text in this section is largely based on this publication.
The Haswell micro-architecture [Int14a, Section 2.1] is the successor of Sandy Bridge (see Sec-
tion 4.1.3). It features superscalar out-of-order cores with Hyper-Threading (two hardware threads per
core). The instruction set has been extended to support 256 bit integer SIMD instructions (AVX2) as well
as fused multiply-add (FMA) instructions. A block diagram is depicted in Figure 4.22. The front-end is
similar to Sandy Bridge. Instructions are fetched in 16 byte windows and decoded into micro-ops by four
decoders. The micro-op cache, which was introduced in Sandy Bridge, is also present. The out-of-order
execution is enhanced significantly. The number of scheduler and reorder buffer entries increases from
54 and 168 in Sandy Bridge to 60 and 192 in Haswell. Furthermore, the scheduler has two additional
dispatch ports—another ALU (port 6) and a third address generation unit (port 7). Two 256 bit FMA
instructions can be processed per cycle (port 0 and 1), which increases the throughput to 16 double pre-
cision floating point operations per cycle per core. The L1 in Haswell supports two 256 bit loads and one
256 bit store each cycle, which corresponds to the three AGUs (port 2, 3, and 7). The L2 bandwidth is
64 byte per cycle. The MOB contains 72 load and 42 store buffers.
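The per-cycle throughput figures above can be derived from the port configuration. The following sketch counts an FMA as two floating point operations and uses the 2.1 GHz AVX base frequency from Table 4.11 for a per-core estimate; turbo frequencies are ignored, and the function name is illustrative.

```python
def flops_per_cycle(fma_units, simd_width_bits, element_bits=64):
    """Peak floating point operations per cycle for FMA-capable SIMD units."""
    lanes = simd_width_bits // element_bits
    return fma_units * lanes * 2   # each FMA counts as one multiply and one add


dp_flops = flops_per_cycle(fma_units=2, simd_width_bits=256)   # ports 0 and 1
l1_load_bytes = 2 * 256 // 8     # two 256 bit loads per cycle
l1_store_bytes = 1 * 256 // 8    # one 256 bit store per cycle

AVX_BASE_GHZ = 2.1               # AVX base frequency, turbo not considered
peak_gflops_per_core = dp_flops * AVX_BASE_GHZ

print(f"{dp_flops} double precision FLOP per cycle per core")
print(f"L1: {l1_load_bytes} byte loads and {l1_store_bytes} byte stores per cycle")
print(f"{peak_gflops_per_core:.1f} GFLOPS per core at AVX base frequency")
```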
The server version (Haswell-EP) is available in three variants [Int14d, Section 1.1]—an eight-core die,
a 12-core die, and an 18-core die. The eight-core die uses a single bi-directional ring interconnect like
Sandy Bridge-EP (see Section 4.1.3). The 12- and 18-core dies use a partitioned design with two inter-
connected rings. The Xeon E5-2680 v3 processors in the test system are based on the 12-core version,
which is depicted in Figure 4.23. Eight cores, eight L3 slices, one memory controller, the QPI interface,
and the PCIe controller are connected to one bi-directional ring. The remaining cores, L3 slices, and the
second memory controller are connected to another bi-directional ring. The complex ring topology is
hidden from the operating system in the default configuration, which is depicted in Figure 4.23a. How-
ever, an optional Cluster-on-Die (COD) mode can be enabled in the BIOS. This mode splits the 12 cores
into two NUMA nodes as depicted in Figure 4.23b. For easier reference, the first cluster (core0 – core5)
is called “primary node” and the second cluster (core6 – core11) is called “secondary node” in this doc-
ument. Both nodes contain six cores, thus the software visible NUMA topology (6+6) does not match
the hardware configuration (8+4). Figure 4.24 compares the two possible NUMA topologies.
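The mismatch between the software visible topology (6+6) and the ring layout (8+4) can be made explicit by grouping the cores by NUMA node and ring. The sketch assumes that core0 to core7 are attached to the first ring and core8 to core11 to the second ring, as suggested by Figure 4.23; this numbering is an assumption for illustration.

```python
from collections import defaultdict

CORES = range(12)


def ring_of(core):
    """Ring 1 hosts eight cores, ring 2 the remaining four (assumed numbering)."""
    return 1 if core < 8 else 2


def cod_node_of(core):
    """Cluster-on-Die node: core0-core5 primary, core6-core11 secondary."""
    return "primary" if core < 6 else "secondary"


groups = defaultdict(list)
for core in CORES:
    groups[(cod_node_of(core), ring_of(core))].append(core)

for (node, ring), cores in sorted(groups.items()):
    print(f"{node:9s} node, ring {ring}: cores {cores}")
```

Under this assumption two cores of the secondary node (core6 and core7) sit on the first ring, while their memory controller is attached to the second ring.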
[Block diagram: in-order front-end (fetch and predecode, decode, micro-op cache, instruction queue), out-of-order execution (rename/alloc, reorder buffer, scheduler, ports 0 to 7), and memory subsystem (memory order buffer, L1 data cache, L2 cache, ring bus)]
Figure 4.22: Intel Haswell micro-architecture, based on [Int14a, Figure 2-1] (derived from [Mol08,
Figure 2.26]): The processor cores implement superscalar out-of-order execution (see Section 2.1.1).
The scheduler can dispatch up to eight micro-ops to the execution units each cycle.
(a) default configuration (b) Cluster-on-Die mode
Figure 4.23: Structure of the 12-core Xeon E5 v3 die, based on [Int14d, Figure 1-2] (derived
from [Mol+15, Fig. 1a]): There are two memory controllers—one connected to each ring. QPI and
PCIe links are connected to the first ring. In the default configuration (left) all cores can access the
whole L3 cache and memory is interleaved over all four channels. The Cluster-on-Die mode (right)
splits the twelve cores into two six-core groups, which share one memory controller each [Kar14].
The MESIF protocol (see Section 2.3.1.2) is used to keep caches coherent. The default configuration
uses a source snoop mechanism, i.e., the caching agents in the L3 slices broadcast snoop requests when
necessary. The behavior can be changed to home snooping by disabling the Early Snoop option in the
BIOS. If COD mode is enabled, a home snoop mechanism with directory support (see Section 2.3.2.2)
is used [Kar14]. Haswell-EP also includes directory caches—14 KiB per home agent—to accelerate
the directory lookup [Mog+14]. These so-called “HitME” caches store 8-bit presence vectors, which
indicate if copies exist in other nodes. They only contain entries for cache lines that have been forwarded
between caching agents. If a request misses in the directory cache, the home agent reads the status from
the in-memory directory bits [Kot+12] and sends snoop requests accordingly.
Haswell-EP uses integrated voltage regulators [Bur+14]; [Int15f, Section 2.1], which enable individual
voltages and frequencies for every core (per-core P-states) [Kar14]. The voltage and frequency of the
uncore—which includes the last level cache—can also be changed separately. While the frequencies of
the cores can be controlled via the ACPI P-states, uncore frequency scaling (UFS) is performed
transparently in hardware [Hac+15]. Another unique feature of the Xeon E5 v3 processors is the use of
separate AVX base and turbo frequencies [Kar14]. If AVX instructions are detected, the base frequency is reduced from
2.5 to 2.1 GHz and the available turbo frequencies are restricted [Int15e, Table 3].
(a) default configuration (b) Cluster-on-Die mode
Figure 4.24: Composition of the dual-socket Xeon E5 v3 system: In the default configuration, the system
contains two NUMA nodes. If COD is enabled four NUMA nodes are visible to the operating system.
These figures are derived from [Mol+15, Fig. 2].
4.2.1.1 Latency of Cache and Main Memory Accesses
Latency measurements for the default configuration (source snooping) are depicted in Figure 4.25a
and Figure 4.25b. Reads from the local cache hierarchy have a latency of 1.6 ns in the L1, 4.8 ns in
the L2, and 21.2 ns in the L3 cache. Forwarding Modified cache lines from another core’s L1 or L2 cache
takes 53 or 49 ns, respectively. Modified cache lines that have been written back to the L3 by another core
are delivered in 21.2 ns. In contrast, accesses to Exclusive cache lines have a higher latency of 44.4 ns
if they belong to another core. In that case the owning core is snooped even if the line has already been
evicted, since the silent evictions do not clear the core valid bits. If multiple copies of a cache line exist
within the same NUMA node (not depicted), the coherence state can only be Shared (or Forward). In
that case multiple core valid bits are set, thus the L3 services requests in 21.2 ns without snooping the
other cores. Read accesses to Shared cache lines in the local L1 or L2 cache also induce the L3 latency
while Forward cache lines can be read directly. This indicates that the L3 cache is notified of accesses
to Shared cache lines in order to reclaim the Forward state. Forwarding Modified cache lines from the
remote L1, L2, or L3 cache requires 113, 109, or 86 ns, respectively. The core snoop, which is required
for Exclusive cache lines, increases the remote L3 latency to 104 ns.
Main memory has a latency of 96.4 ns for local and 146 ns for remote accesses. Figure 4.26 shows the
influence of TLB misses on the local access latency. If 2 MiB pages are used—explicitly via hugetlbfs or
implicitly by THP—the effect is negligible while disabling THP results in costly page table walks. The
(a) Modified, default configuration (b) Exclusive, default configuration
(c) Modified, COD mode (d) Exclusive, COD mode
Figure 4.25: Xeon E5-2680 v3—read latency of Modified and Exclusive data: Comparison of accesses
to a core’s local cache hierarchy (local) with accesses to caches of another core in the same NUMA
node (within NUMA node) and accesses to the second processor (other NUMA node (1 hop)). The
COD mode results also cover on-chip transfers between the clusters (other NUMA node (on-chip)),
transfers between a primary and a secondary node (other NUMA node (2 hops)), and transfers be-
tween the two secondary nodes (other NUMA node (3 hops)).
Figure 4.26: Xeon E5-2680 v3—impact of TLB
misses on memory latency: The measure-
ments are not influenced by TLB misses up
to a data set size of 64 MiB, if 2 MiB pages
are used. With approximately 3 ns, the increase
in latency for data set sizes up to
2 GiB is smaller than observed on the other
systems due to the large L2 TLB (see Ta-
ble 2.6). The memory latency increases sig-
nificantly if malloc() is used without THP.
impact of switching to the home snooping protocol (Early Snoop disabled) is included in Table 4.12.
The delayed snoop requests increase the latency of local memory and remote cache accesses by 11 ns.
The latency of remote memory accesses does not increase, as the requests are sent directly to the remote
home agent in both cases.
Figure 4.25c and Figure 4.25d show latency measurements in Cluster-on-Die mode. In these measure-
ments the node that contains the cached copy prior to the measurement also is the home node of the
data. The local L3 latency is reduced to 18.0 or 37.2 ns in COD mode compared to 21.2 or 44.4 ns in the
default configuration. The local memory latency drops from 96.4 to 89.6 ns. Accesses to the second L3
partition on the chip require 57.2 ns without and 73.6 ns including a core snoop. Memory attached to the
other cluster can be accessed in 96 ns. The latency of remote cache and memory accesses depends on the
number of hops. The L3 latency is between 90 and 103 ns for immediate replies and between 104 and
118 ns if a core needs to be snooped. Remote memory accesses require between 141 and 153 ns. The
asymmetric chip layout results in performance variations depending on the location of the core, which
are detailed in Table 4.12. Only the six cores in the primary node show the substantial latency reductions
for local accesses described above. The four cores that are connected to the second ring also benefit
significantly from activating COD mode. However, the latency reduction is much lower for the two cores
that are connected to the first ring but belong to the secondary NUMA node. Furthermore, the cores in
the secondary node suffer from higher latencies for accesses to the second processor.
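The latency reductions for the primary node can be expressed as relative changes against the default configuration. The sketch below uses the values quoted in this paragraph; the dictionary labels and the function name are illustrative.

```python
def relative_change_percent(value, reference):
    """Relative change of value against reference in percent."""
    return (value - reference) / reference * 100.0


# Latencies in ns from the text: default configuration vs. COD mode (primary node).
default_ns = {"L3 without core snoop": 21.2, "L3 with core snoop": 44.4, "local memory": 96.4}
cod_ns = {"L3 without core snoop": 18.0, "L3 with core snoop": 37.2, "local memory": 89.6}

for source, reference in default_ns.items():
    change = relative_change_percent(cod_ns[source], reference)
    print(f"{source}: {reference} ns -> {cod_ns[source]} ns ({change:+.1f}%)")
```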
In COD mode it is possible that requester, home node, and the forwarding cache span three nodes. Fig-
ure 4.27 compares such scenarios to cases that have a Forward copy in the home node. Accesses to small
data sets show an unexpected behavior if the Forward copy is outside of the home node. Performance
counter readings¹² show that data is forwarded from the home node’s memory for data set sizes up to
256 KiB¹³. Thus, data is delivered faster by node 1 as no QPI transfers are involved. For larger data set
Table 4.12: Xeon E5-2680 v3—L3 cache and main memory latency in nanoseconds for different coher-
ence protocol modes [Mol+15, Table III]. L3 results are for cache lines in state Exclusive.
                           default        Early Snoop    COD mode,      COD mode,       COD mode,
source                     configuration  disabled       primary node   secondary node, secondary node,
                                                                        first ring      second ring
L3      local              21.2           21.2           18.0 (-15%)    20.0 (-5.6%)    18.4 (-13.2%)
L3      remote first node  104            115 (+10.5%)   104 (+/- 0)    108 (+3.8%)     111 (+6.7%)
L3      remote 2nd node    104            115 (+10.5%)   113 (+8.7%)    118 (+13.5%)    120 (+15.4%)
memory  local              96.4           108 (+12.0%)   89.6 (-7.1%)   94.0 (-2.5%)    90.4 (-6.2%)
memory  remote first node  146            146            141 (-3.4%)    145 (-0.7%)     148 (+1.3%)
memory  remote 2nd node    146            146            147 (+0.7%)    151 (+3.4%)     153 (+4.8%)
¹² PAPI 5.4.1, native event: MEM_LOAD_UOPS_L3_MISS_RETIRED:REMOTE_DRAM
¹³ The variation below 256 KiB presumably is a DRAM effect. Small data sets fit into fewer DRAM pages, thus the likelihood
to access already open pages is higher. This is faster than accesses to closed pages [Che04, Chapter III, Section 2.3.1].
(a) Forward copy in home node (b) Forward copy in third node
Figure 4.27: Xeon E5-2680 v3—read latency of shared data: core in node0 accesses cache lines in state
Forward. The home nodes contain a Shared copy if the Forward copy is in another node. The sharing
also affects the memory latency.
sizes an increasing number of cache lines is forwarded by the L3 cache that contains the Forward copy¹⁴.
Forwarding shared data from memory is allowed if the directory state is shared [Gee+13]. However, the
fact that it only happens for small data sets suggests that the effect is actually caused by the directory
cache. Directory cache entries are allocated when cache lines are forwarded between nodes—which hap-
pens during the access sequence performed by the coherence state control mechanism for state Forward
(see Section 3.3). If two bits are set in the presence vector of the HitME cache entry (Shared copy in
home node and Forward copy in another node), the data in memory is valid and can be forwarded.
Table 4.13 details the L3 latencies for reads by a core in node 0 for data set sizes larger than 2 MiB, which
mostly miss in the directory cache. If a copy is available in the local L3 (first row and first column), it
is forwarded to the requesting core. The cases on the diagonal do not require a snoop broadcast as a
Forward copy is found in the home node¹⁵. The remaining cases show the delay for forwarding cache
lines from a third node. The latency is between 162 and 177 ns depending on the distance between
the nodes. Caching data outside the home node can also influence the memory latency as shown in
Table 4.14. In these measurements, data has already been evicted from all caches. In the cases on the
diagonal the in-memory directory state is remote-invalid. All other cases apparently require a snoop
broadcast, which adds between 78 and 89 ns to the memory latency. Therefore, the directory state has to
be snoop-all. According to the DAS protocol (see Section 2.3.2.2) the state should be shared in this case.
However, the access sequence performed by the coherence state control mechanism apparently creates
directory cache entries, which implies the snoop-all state in the in-memory directory [Mog+14].
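The additional delay attributed to the snoop broadcast can be derived from Table 4.14 by subtracting the latency on the diagonal (no broadcast required) from the other entries of the same home node column. The following sketch performs this subtraction; the data layout and names are illustrative.

```python
NODES = ["node0", "node1", "node2", "node3"]

# Memory latency in ns from Table 4.14.
# Rows: node that had the Forward copy. Columns: home node.
MEM_LATENCY_NS = [
    [89.6, 182, 222, 236],
    [168, 96.0, 222, 236],
    [168, 182, 141, 236],
    [168, 182, 222, 147],
]

overhead_ns = {}
for home in range(len(NODES)):
    baseline = MEM_LATENCY_NS[home][home]          # diagonal: no snoop broadcast
    for fwd in range(len(NODES)):
        if fwd != home:
            overhead_ns[(NODES[fwd], NODES[home])] = MEM_LATENCY_NS[fwd][home] - baseline

smallest = min(overhead_ns.values())
largest = max(overhead_ns.values())
print(f"snoop broadcast overhead: {smallest:.1f} to {largest:.1f} ns")
```

The resulting range matches the 78 to 89 ns stated above.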
Table 4.13: Xeon E5-2680 v3—L3 latency in
nanoseconds if copies exist in multiple NUMA
nodes [Mol+15, Table IV].
node with       home node (Shared copy)
Forward copy    node0   node1   node2   node3
node0           18.0    18.0    18.0    18.0
node1           18.0    57.2    170     177
node2           18.0    166     90.0    166
node3           18.0    169     162     96.0
Table 4.14: Xeon E5-2680 v3—memory latency
in nanoseconds if data has been shared by mul-
tiple cores [Mol+15, Table V].
node that had   home node
Forward copy    node0   node1   node2   node3
node0           89.6    182     222     236
node1           168     96.0    222     236
node2           168     182     141     236
node3           168     182     222     147
¹⁴ according to PAPI 5.4.1, native event: MEM_LOAD_UOPS_L3_MISS_RETIRED:REMOTE_FWD
¹⁵ The local snoop in the home node and the directory lookup are done in parallel [Mog+14].
4.2.1.2 Bandwidth of Local Cache Accesses and Core-to-core Transfers
The single-threaded bandwidth benchmarks (see Section 3.5.2) are influenced by the reduced fre-
quency for AVX workloads. Furthermore, the execution of 256 bit instructions is slowed down dur-
ing the transition from normal execution mode to AVX mode to ensure stable operation [Kar14,
p. 21]. Therefore, the benchmarks do not show the maximal performance unless the transition to
AVX mode is completed prior to the measurement. This is ensured by setting the parameter
BENCHIT_KERNEL_AVX_STARTUP_REG_OPS to 20000¹⁶. Consequently, the achievable L1 bandwidths
increase significantly compared to the values reported in [Mol+15]. Haswell’s uncore frequency scaling
(UFS) feature [Hac+15] influences the measured L3 performance. The L3 performance that is shown
here represents the lower bound of the observed results. The possible performance increase enabled by
UFS is discussed in conjunction with the bandwidth scaling in Section 4.2.1.3.
The single-threaded read and write bandwidths in the default configuration are depicted in Figure 4.28.
Data can be read from the local L1 cache with 154 GB/s using 256 bit instructions (VMOVDQA). The
L2 measurements are unstable. They vary from 57.2 to 65.8 GB/s in the selected measurements. Other
measurements even show a fluctuation range from 54 to 68 GB/s. If the hardware prefetchers are dis-
abled, the L2 bandwidth is fairly constant at around 77 GB/s. As depicted in Figure 4.29 the available
bandwidth also depends on the width of the load instructions. With 128 bit loads it is limited to 77.1
from the L1 and 48.2 GB/s from the L2 cache. Using 64 bit loads further reduces this to 39.7 and 29.4,
respectively. Accesses to shared cache lines in the local L1 or L2 cache (not depicted) only achieve
the full performance if the requesting core’s node holds the Forward copy. Otherwise, the bandwidth is
(a) read, Exclusive (b) write, Exclusive
(c) read, Modified (d) write, Modified
Figure 4.28: Xeon E5-2680 v3—single-threaded read and write bandwidth: Comparison of accesses to
a core’s local cache hierarchy (local) with accesses to cache lines of another core in the same NUMA
node (within NUMA node) as well as accesses to the second processor (other NUMA node).
¹⁶ A single AVX operation before the measurement, as used in Section 4.1.3.2, is not sufficient on Haswell.
Figure 4.29: Xeon E5-2680 v3—the ISA’s im-
pact on memory bandwidth: The full L1 per-
formance can only be reached with 256 bit
wide AVX instructions (VMOVDQA) as the
L1 ports have a width of 256 bit. The L1
and L2 bandwidths decrease significantly if
the width of the loads is reduced to 128
(MOVDQA) or 64 bit (MOV). At least 128 bit
wide loads are required to fully utilize the
L3 cache and local memory bandwidth.
limited to the L3 bandwidth, which resembles the behavior on Sandy Bridge (see Section 4.1.3.2). The
write bandwidths are lower than the corresponding read bandwidths. The L1 and L2 cache support data
rates of 76.8 and 25.5 GB/s, respectively.
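The dependence of the L1 read bandwidth on the load width can be checked by normalizing the measured values to the 256 bit result. The sketch uses the L1 bandwidths reported above; the helper name is illustrative.

```python
# Local L1 read bandwidth in GB/s by load width in bit, values from the text.
L1_READ_GBS = {256: 154.0, 128: 77.1, 64: 39.7}


def relative_to_widest(bandwidth_by_width):
    """Bandwidth of each load width relative to the widest load."""
    widest = max(bandwidth_by_width)
    return {width: bw / bandwidth_by_width[widest]
            for width, bw in bandwidth_by_width.items()}


relative = relative_to_widest(L1_READ_GBS)
for width in sorted(relative, reverse=True):
    print(f"{width:3d} bit loads: {relative[width]:.2f} of the 256 bit bandwidth")
```

The ratios of roughly 1, 0.5, and 0.26 are close to the ratios of the load widths, which is consistent with a fixed number of loads per cycle.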
On-chip transfers (see “within NUMA node” cases in Figure 4.28) are significantly slower than local L1
and L2 cache accesses. Reading Modified cache lines from another core’s L1 and L2 cache is limited
to 7.8 and 10.6 GB/s, respectively. The L3 bandwidth is 26.2 GB/s for Modified cache lines that have
been evicted by other cores. Unmodified cache lines are always delivered from the L3 cache, even if
copies still exist in other cores. The bandwidth is 26.2 GB/s for cache lines in state Shared/Forward and
15.0 GB/s for Exclusive cache lines, which require that snoop requests are sent to another core. The
L3 cache supports writes with up to 15.0 GB/s if data is read directly from the L3 and written back to
it. It drops to 10.5 GB/s if another core needs to be snooped. Writes to Exclusive cache lines that are
located in another core’s L1 or L2 cache are slightly faster as the data does not have to be written back
to the L3 cache. Writes to Modified data in other L1 and L2 caches are limited by the corresponding
read bandwidth. QPI transfers between the sockets further reduce the achievable bandwidths (see “other
NUMA node” cases in Figure 4.28). Data can be read from the remote L3 with 9.1 GB/s if no core
snoops are required. The bandwidth drops to 8.8 GB/s for Exclusive cache lines and 6.8 or 8.1 GB/s for
Modified cache lines that are forwarded from a remote L1 or L2 cache, respectively.
The system configuration affects the achievable bandwidths as detailed in Table 4.15. If Early Snoop is
deactivated, the local memory bandwidth decreases from 10.3 to 9.5 GB/s while the remote L3 and mem-
ory bandwidths increase slightly. Enabling COD mode increases the local L3 and memory bandwidths.
However, the cores show inconsistent performance depending on their location. The L3 bandwidth in-
creases by 10.7% to 29 GB/s in the primary node. In the secondary node the increase is 3.8% for the
two cores that are connected to the first ring and 5.3% for the four cores connected to the second ring. A
significantly higher local memory bandwidth is available to all cores in COD mode. The bandwidth of
remote accesses shows various performance levels depending on the distance between the data and the
core that is accessing it. The difference from the default configuration is between -9% and +5%.
Table 4.15: Xeon E5-2680 v3—single-threaded read bandwidth in GB/s depending on coherence proto-
col mode and data location [Mol+15, Table VI]. L3 results are for cache lines in state Exclusive.
                           default        Early Snoop    COD mode,      COD mode,       COD mode,
source                     configuration  disabled       primary node   secondary node, secondary node,
                                                                        first ring      second ring
L3      local              26.2           26.2           29.0 (+10.7%)  27.2 (+3.8%)    27.6 (+5.3%)
L3      remote first node  8.8            8.9 (+1.1%)    8.7 (-1.1%)    8.3 (-5.7%)     8.4 (-4.5%)
L3      remote 2nd node    8.8            8.9 (+1.1%)    8.3 (-5.7%)    8.0 (-9.0%)     8.1 (-8.0%)
memory  local              10.3           9.5 (-7.8%)    12.6 (+22.3%)  12.5 (+21.3%)   12.6 (+22.3%)
memory  remote first node  8.0            8.2 (+2.5%)    8.4 (+5.0%)    7.8 (-2.5%)     8.1 (+1.3%)
memory  remote 2nd node    8.0            8.2 (+2.5%)    8.0 (+/- 0)    7.4 (-7.5%)     7.5 (-6.3%)
4.2 Standard Compute Nodes With Complex NUMA Topologies 87
Table 4.16: Xeon E5-2680 v3, L3 bandwidth scaling in GB/s (UFS benefit in parentheses) in the default system configuration: The measured bandwidths vary depending on the benchmark configuration (number of accesses), which is presumably caused by Haswell-EP’s uncore frequency scaling.

| cores | read, one thread per core | write, one thread per core | read, two threads per core | write, two threads per core |
|---|---|---|---|---|
| 1 | 26.2 - 29.8 (+13%) | 15.0 - 16.4 (+9%) | 28.2 - 32.4 (+15%) | 16.4 (+/- 0) |
| 2 | 51.7 - 60.0 (+16%) | 29.6 - 32.4 (+9%) | 55.7 - 64.4 (+16%) | 32.4 (+/- 0) |
| 3 | 75.5 - 89.4 (+18%) | 44.1 - 54.9 (+24%) | 80.5 - 96.0 (+19%) | 48.0 - 54.6 (+14%) |
| 4 | 99.7 - 118.9 (+19%) | 58.6 - 73.0 (+24%) | 107.0 - 127.8 (+19%) | 63.8 - 72.6 (+14%) |
| 6 | 148.5 - 177.9 (+19%) | 86.9 - 109.0 (+25%) | 158.8 - 191.2 (+20%) | 95.0 - 108.5 (+14%) |
| 8 | 194.6 - 235.4 (+20%) | 114.3 - 144.3 (+26%) | 205.7 - 252.6 (+23%) | 124.6 - 143.5 (+15%) |
| 10 | 237.0 - 290.4 (+22%) | 138.0 - 177.2 (+28%) | 251.4 - 312.3 (+24%) | 150.4 - 176.1 (+17%) |
| 12 | 278.3 - 343.3 (+23%) | 161.6 - 210.0 (+30%) | 291.5 - 367.8 (+26%) | 173.7 - 208.6 (+20%) |
4.2.1.3 Bandwidth Scaling of Shared Resources
Table 4.16 details the L3 bandwidth scaling in the default configuration using AVX instructions
(VMOVDQA). The results differ between the aggregated bandwidth benchmark (see Section 3.5.3) and
the throughput benchmark (see Section 3.5.4). This anomaly occurs only in the L3 cache. Therefore,
the difference is presumably caused by the uncore frequency scaling (UFS). In order to determine the
minimal bandwidth, the aggregated bandwidth benchmark is configured to perform only a single
sequential access (BENCHIT_KERNEL_RUNS set to 1). The upper bound is estimated using the throughput
kernel, which accesses the buffer repeatedly until at least 3.2 GB have been accessed. Even with this large
number of accesses there is significant variation between measurements, especially if Hyper-Threading
is used. Table 4.16 lists the maximum of the observed results.
The L3 bandwidth scales almost linearly with the number of cores, from 26.2 GB/s to 278.3 GB/s for reads
and from 15.0 GB/s to 161.6 GB/s for writes, respectively. If the processor detects L3-heavy workloads
and increases the uncore frequency, the bandwidths increase by up to 23% for reads and 30% for writes,
which increases the L3 bandwidth per processor to 343 and 210 GB/s, respectively. Using two threads
per core further increases the read bandwidths and the minimal write bandwidths while the maximal
write bandwidths are slightly lower. The L3 performance does not change when Early Snoop is disabled.
If COD mode is activated, the L3 read and write bandwidth per NUMA node is 154 – 197 GB/s and 94
– 113 GB/s, respectively.
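The scaling figures quoted above can be retraced from Table 4.16. The following Python sketch is only an illustration: the helper names and the dictionary layout are chosen for this example, and the inputs are the single-core and 12-core bandwidths (one thread per core) from the table.

```python
# Lower and upper L3 bandwidth bounds in GB/s from Table 4.16
# (one thread per core); the dict layout is an illustrative choice.
L3_BW = {
    "read":  {1: (26.2, 29.8), 12: (278.3, 343.3)},
    "write": {1: (15.0, 16.4), 12: (161.6, 210.0)},
}

def parallel_efficiency(bw_n, bw_1, cores):
    """Fraction of ideal linear scaling that is reached with `cores` cores."""
    return bw_n / (cores * bw_1)

def ufs_benefit(lower, upper):
    """Relative gain of the upper bound over the lower bound."""
    return upper / lower - 1.0

for kind, data in L3_BW.items():
    lo1, _ = data[1]
    lo12, hi12 = data[12]
    eff = parallel_efficiency(lo12, lo1, 12)
    gain = ufs_benefit(lo12, hi12)
    print(f"{kind}: efficiency {eff:.2f}, UFS benefit {gain:+.0%}")
```

With twelve cores the lower bounds reach roughly 89% (reads) and 90% (writes) of ideal linear scaling, which is what "almost linearly" refers to here.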
Table 4.17: Xeon E5-2680 v3, memory read bandwidth in GB/s: The bandwidth depends on the Early Snoop configuration and the number of threads per core (SMT off: one thread per core, SMT on: two threads per core).

| source | Early Snoop | SMT | number of concurrently reading cores (1 2 3 4 5 6 7 8 9 10 11 12) |
|---|---|---|---|
| local DRAM | default (enabled) | off | 10.3 21.8 33.6 43.9 53.3 59.8 62.9 63.6 63.7 63.4 |
| local DRAM | default (enabled) | on | 12.4 27.3 39.9 51.4 59.5 61.8 62.0 62.5 61.7 |
| local DRAM | disabled | off | 9.5 18.9 31.8 41.7 51.4 58.3 62.6 63.5 62.9 |
| local DRAM | disabled | on | 11.7 26.0 38.8 49.4 58.6 61.7 62.0 61.7 |
| remote DRAM | default (enabled) | off | 8.0 13.1 14.1 14.7 15.2 15.6 15.9 16.3 16.5 16.6 16.8 |
| remote DRAM | default (enabled) | on | 9.8 14.4 15.4 16.0 16.3 16.6 16.8 17.0 17.1 17.2 17.4 |
| remote DRAM | disabled | off | 8.2 16.1 24.0 28.3 30.2 30.4 30.6 |
| remote DRAM | disabled | on | 10.5 20.1 28.3 30.2 30.5 30.6 |
88 4 Performance Characterization of Memory Accesses
Table 4.18: Xeon E5-2680 v3, memory write bandwidth in GB/s: The bandwidth depends on the Early Snoop configuration and the number of threads per core (SMT off: one thread per core, SMT on: two threads per core).

| source | Early Snoop | SMT | number of concurrently writing cores (1 2 3 4 5 6 7 8 9 10 11 12) |
|---|---|---|---|
| local DRAM | default (enabled) | off | 7.5 16.0 22.3 25.2 26.3 26.1 26.0 25.9 |
| local DRAM | default (enabled) | on | 8.5 18.4 23.6 25.4 25.5 |
| local DRAM | disabled | off | 7.3 15.2 22.2 24.8 26.3 26.1 25.8 25.5 |
| local DRAM | disabled | on | 8.2 18.2 23.7 25.4 25.6 25.4 25.3 |
| remote DRAM | default (enabled) | off | 5.5 8.2 9.2 9.7 10.0 10.3 10.4 10.6 10.7 10.8 10.9 11.1 |
| remote DRAM | default (enabled) | on | 6.5 9.2 10.0 10.2 10.6 10.7 10.9 11.0 11.2 11.3 |
| remote DRAM | disabled | off | 6.1 12.1 18.0 22.2 24.0 24.8 25.0 25.2 |
| remote DRAM | disabled | on | 7.3 14.7 20.1 24.1 24.8 25.0 24.8 |
The impact of the Early Snoop setting and Hyper-Threading on the DRAM read bandwidth is detailed
in Table 4.17. The corresponding write bandwidths are listed in Table 4.18. Local DRAM can be read
with up to 63.7 GB/s and written with up to 26.3 GB/s. Disabling Early Snoop slightly reduces the local
DRAM bandwidth if only a few cores are used while the maximal bandwidth per socket does not change.
The memory bandwidth does not scale linearly with the number of cores. Using two cores instead of one
results in slightly super-linear speedup, which probably is another effect of the uncore frequency scaling.
The read bandwidth is saturated with six or seven cores, depending on the number of threads per core.
The write bandwidth can already be fully utilized with five cores. Non-temporal stores (VMOVNTDQ)
can be used to improve the performance of writes. This increases the single-threaded write bandwidth
to 14.5 GB/s. Using two or three cores concurrently results in an aggregated bandwidth of 29.1 and
43.6 GB/s, respectively. The maximum of 47.0 GB/s per socket is reached with four concurrently writing
cores. The impact of TLB misses is negligible for data set sizes up to 2 GiB.
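The scaling behavior described in this paragraph can be expressed as speedups relative to a single core. The sketch below is an illustration with a hypothetical helper name; it only uses values that are stated in the text and in the first columns of Tables 4.17 and 4.18 (default configuration, one thread per core).

```python
def speedups(bandwidths):
    """Speedup relative to one core for bandwidths listed for 1, 2, 3, ... cores."""
    base = bandwidths[0]
    return [bw / base for bw in bandwidths]

# Values in GB/s (default configuration, one thread per core).
read_bw = [10.3, 21.8]                  # 1 and 2 cores, local DRAM reads
write_bw = [7.5, 16.0]                  # 1 and 2 cores, regular stores
nt_write_bw = [14.5, 29.1, 43.6, 47.0]  # 1 to 4 cores, non-temporal stores

print([round(s, 2) for s in speedups(read_bw)])
print([round(s, 2) for s in speedups(write_bw)])
print([round(s, 2) for s in speedups(nt_write_bw)])
```

A two-core speedup above 2.0 for reads and regular stores corresponds to the slightly super-linear scaling mentioned above, while the non-temporal stores scale close to linearly up to three cores and flatten with the fourth.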
The remote memory bandwidths are much higher if Early Snoop is disabled. The read bandwidth via
QPI is 30.6 GB/s instead of the 17.4 GB/s available in the default configuration. Furthermore, the
write bandwidth to remote DRAM increases from 11.3 GB/s to 25.2 GB/s, which is almost identical to
the local write bandwidth. The differences show that the achievable remote bandwidths are not limited
by the raw bandwidth of the QPI links in the default configuration. As demonstrated by the measure-
ments with different “Alternate RTID setting” options in Section 4.1.3.3, the achievable bandwidth can
also be restricted by the number of concurrent coherence protocol transactions. Apparently, the default
configuration supports fewer concurrent remote transactions.
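The relation between bandwidth and the number of concurrent transactions can be sketched with Little's law (data in flight = bandwidth × latency). This is an illustrative model, not the method used for the measurements; the function name is hypothetical and the latency value is a placeholder assumption, since the relevant remote latency is not given in this paragraph.

```python
CACHE_LINE_BYTES = 64

def lines_in_flight(bandwidth_gb_s, latency_ns):
    """Little's law: cache lines in flight needed to sustain a bandwidth.

    1 GB/s equals 1 byte/ns, so bandwidth * latency is the number of
    bytes that must be in transit at any time.
    """
    return bandwidth_gb_s * latency_ns / CACHE_LINE_BYTES

# ASSUMPTION: 150 ns is a placeholder latency for illustration only,
# not a measurement from this section.
ASSUMED_LATENCY_NS = 150.0

default_cfg = lines_in_flight(17.4, ASSUMED_LATENCY_NS)
es_disabled = lines_in_flight(30.6, ASSUMED_LATENCY_NS)
print(f"{default_cfg:.1f} vs {es_disabled:.1f} lines in flight")
print(f"ratio: {es_disabled / default_cfg:.2f}")
```

Independent of the assumed latency, the ratio of the two results equals the bandwidth ratio. Under this simple model, and if the latency were the same in both configurations, the higher remote read bandwidth would correspond to proportionally more transactions in flight.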
Table 4.19 shows the bandwidth scaling in COD mode. Up to 32.5 GB/s can be read from local DRAM
in each NUMA node. The corresponding write bandwidth is 13.7 GB/s, which is reached with three
concurrently writing cores. Data can be read with up to 18.8 GB/s from the second memory controller
on the chip. Reading from DRAM that is attached to the other socket is limited to 15.6 or 14.7 GB/s,
depending on the number of hops (see Figure 4.24b). Writing to another NUMA node’s memory is limited
to 8.2 – 8.7 GB/s.
Table 4.19: Xeon E5-2680 v3, memory read (write) bandwidth in COD mode, based on [Mol+15,
Table VIII]. All results are in GB/s.
distance
number of cores, one thread per core
1 2 3 4 5 6
local DRAM 12.6 (8.3) 24.3 (12.9) 30.6 (13.7) 32.5 (13.4) 32.5 (13.2) 32.5 (13.1)
1 hop on-chip 7.0 (6.6) 15.2 (8.2) 18.6 (8.2) 18.8 (8.2) 18.8 (8.1)
1 hop QPI 5.9 (5.9) 12.8 (8.7) 15.4 (8.7) 15.6 (8.5) 15.6 (8.4)
2 and 3 hops 5.5 (5.5) 12.2 (8.7) 14.4 (8.7) 14.7 (8.5) 14.7 (8.3)
4.2.2 Quad-socket AMD Opteron 6274
This section details the cache and memory performance of a system with four AMD Opteron 6274
processors. Preliminary results from a very similar system have been previously published in [MHS14].
The text in this section is partially based on this publication. This section also includes findings from
Mario Ludwig’s bachelor thesis [Lud12], which he conducted under my supervision.
The Opteron 6200 series of processors (Interlagos) is based on AMD’s family 15h micro-architecture
(Bulldozer) [Amd14c, Chapter 2], which is shown in Figure 4.30. The processors are composed of so-
called compute units (CUs)—tightly coupled dual-core modules that share many resources [McI+12;
But+11]. The instruction fetch unit, the decoders, the floating point unit, the L1 instruction cache, and
the L2 cache are shared by both cores. Four integer execution units, a unified 40-entry scheduler, a 128-
entry retirement queue, a load store unit, and a write-through L1 data cache are replicated for each core.
Instructions are fetched from the 64 KiB L1 instruction cache in 32 byte windows. Four instructions
can be decoded each cycle. They are issued to one of the integer schedulers and to the FPU scheduler.
The two ALUs per core only handle legacy x86 instructions. Integer SIMD as well as all floating point
instructions are processed by the shared FPU. The two AGLUs per core are mainly used for address
generation, but also support increment and decrement instructions. The shared FPU consists of two
128 bit fused-multiply-accumulate (FMAC) units and two legacy MMX units. The second MMX unit
also handles stores from the FPU. AVX and FMA4 instructions (fused-multiply-add with four-address
format) are supported. However, 256 bit SIMD instructions are split into two 128 bit parts, so-called
“macro-ops” in AMD’s terminology. The two FMAC units provide a peak performance of 17.6 GFLOPS
per compute unit. The integer cores as well as the FPU use physical register files to avoid copying the
results to a separate register file during retirement.
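The peak performance per compute unit follows from the FPU configuration. The sketch below is a hedged illustration: the function name is hypothetical, and it assumes double precision operands and a core clock of 2.2 GHz, which is consistent with the cycle counts quoted in Section 4.2.2.1 (4 cycles correspond to about 1.8 ns).

```python
def peak_gflops(fmac_units, simd_bits, element_bits, clock_ghz):
    """Peak rate when every FMAC unit issues one fused multiply-add per cycle."""
    lanes = simd_bits // element_bits   # elements per SIMD operation
    flops_per_fma = 2                   # one multiply plus one add
    return fmac_units * lanes * flops_per_fma * clock_ghz

# ASSUMPTION: 2.2 GHz core clock and double precision (64 bit) elements.
print(peak_gflops(fmac_units=2, simd_bits=128, element_bits=64, clock_ghz=2.2))
```

Under these assumptions the two 128 bit FMAC units deliver eight double precision operations per cycle, which matches the 17.6 GFLOPS stated above.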
The cache hierarchy of the compute units is quite complex. The load store units (LSUs) [Amd14c,
Section 2.12] decouple the speculative out-of-order execution from the memory subsystem, i.e., they
service loads out of program order but ensure that only completed stores become visible. Each LSU
contains 40 load as well as 24 store buffers and is connected to a 16 KiB write-through L1 data cache
via two 128 bit read ports and one 128 bit write port. Loads into the floating point register file—which
includes the SIMD registers—go through the FP load buffer [But+11, Figure 2]. Stores are coalesced
in a 4 KiB write coalescing cache (WCC) [Amd14c, Section 2.13]. A shared 2 MiB L2 cache, which is
inclusive of the L1 caches, connects the compute unit to the shared system request interface.
[Figure 4.30, block diagram of a compute unit: a shared in-order front-end (64 KiB L1 instruction cache, fetch, branch prediction, 4x x86 decode) feeds two out-of-order integer cores (core 0 and core 1) and the shared FPU. Each core has a 40-entry integer scheduler, two ALUs, two AGLUs, a 128-entry retire queue, a load store unit with 40/24 entries, and a 16 KiB L1D cache. The shared FPU comprises a 60-entry FP scheduler, two FMAC units, two MMX units (one of which also handles FP stores), and the FP load buffer. The memory subsystem includes the 4 KiB WCC and the 2 MiB L2 cache, which connects to the system request interface.]
Figure 4.30: AMD family 15h (models 00h – 0Fh) micro-architecture, based on [But+11, Figure 1 and
Figure 2]: The processors consist of dual-core compute units (CUs). Each CU contains two out-of-
order integer cores that share the in-order front-end, the L2 cache, and the FPU.
The 16-core AMD Opteron 6200 series processors [Amd12c] are implemented as multi-chip-modules
(MCM) that consist of two 8-core dies as depicted in Figure 4.31a. Each die contains four compute units
(8 cores), 6 MiB L3 cache17, an integrated memory controller, and four HyperTransport 3.0 links. The
shared L3 cache—that operates at the northbridge frequency of 2 GHz [Amd12a]—is connected to the
system request interface (SRI). It is a mostly exclusive victim cache for cache lines that are evicted from
the compute units. However, as in AMD’s family 10h micro-architecture (see Section 4.1.1), cache lines
that are shared by multiple cores can have a copy in the L3 cache. The dual channel memory controller
and the four HyperTransport links—that operate at 6.4 GT/s [Amd12b]—are connected to the crossbar.
The links HT0, HT1, HT2, and HT3 are 16 bit wide. Each link can be split into two 8 bit links. Links
that are used as a single link are called “ganged links”; links that have two sublinks are called “unganged
links” [Amd13b, Section 2.12.1.1]. The links are used to connect the two dies with each other and to
provide the external connections supported by the socket as depicted in Figure 4.31a.
The system contains eight NUMA nodes as depicted in Figure 4.31b. Each die is directly connected
to the second die within the socket via a 16-bit link18 as well as to three dies in other sockets via 8-bit
links. This results in two subsets of four fully connected nodes: {0, 2, 4, 6} and {1, 3, 5, 7}. The
MOESI protocol (see Section 2.3.1.3) is used to maintain cache coherence. A snoop filter [Con+10]—
called HT Assist—reduces the coherence traffic (see Section 2.3.2.3). Furthermore, an “Accelerated
Transition to Modified” (ATM) extension of the MOESI protocol is implemented [Amd13b, Section
1.5.2], which needs to be activated if the snoop filter is used [Amd13b, Section 2.9.4.2]. The exact
implementation of the extended MOESI protocol is not disclosed in the processor manuals. However,
Lepak et al. [Lep+12] describe a version of the snoop filtering mechanism that uses a modified MOESI
protocol with an additional modified unwritten (MuW) state. The coherence state control mechanism of
the benchmarks (see Section 3.3) supports this extended protocol. The benchmark results are plausible if
the ATM-compatible version is used, which indicates that the protocol described in [Lep+12] is actually
used in AMD 15h processors19. In contrast, the benchmarks produce implausible results if the coherence
state control is performed according to the conventional MOESI protocol. For instance, trying to generate
an Owned copy of a cache line—by writing to it followed by a read by another core—invalidates the local
copy instead of the expected state changes M→O and I→S. This behavior corresponds to the M→I and
I→MuW transitions of the extended MOESI protocol.
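The difference between the two protocol variants for this access pattern can be written down as a small lookup table. The sketch below is illustrative only: it encodes just the single case described in the text (a remote read of a Modified cache line), the names are made up for this example, and it is not a complete model of either protocol.

```python
# State changes caused by a remote read of a Modified cache line,
# as (new state of the previous holder, new state of the reader).
# Only this single case from the text is modeled.
REMOTE_READ_OF_MODIFIED = {
    "conventional MOESI": ("O", "S"),    # M -> O, I -> S
    "extended MOESI":     ("I", "MuW"),  # M -> I, I -> MuW
}

def remote_read_of_modified(protocol):
    """Return (writer_state, reader_state) after another core reads an M line."""
    return REMOTE_READ_OF_MODIFIED[protocol]

for protocol in REMOTE_READ_OF_MODIFIED:
    writer, reader = remote_read_of_modified(protocol)
    print(f"{protocol}: writer M -> {writer}, reader I -> {reader}")
```

A benchmark that expects an Owned copy at the writer after this sequence therefore has to use a different sequence of accesses under the extended protocol.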
[Figure 4.31, block diagrams: (a) 16 core Interlagos MCM, based on [Amd13b, Figure 2 and 13]. Each 8-core die comprises four compute units (two cores with private L1 caches and a shared L2 cache each), the system request interface, the L3 cache, the crossbar, a memory controller with two DDR3 channels, and the HyperTransport links HT0 to HT3 (x16 and x8 connections). (b) System topology, based on [Amd13b, Figure 7]: four sockets with two NUMA nodes each (nodes 0 to 7) and memory attached to every node.]
Figure 4.31: Composition of the quad-socket Opteron 6274 test system: Each processor (left, derived
from [MHS14, Figure 1]) consists of two dies in a multi-chip module. Each die has an integrated
memory controller and therefore is a separate NUMA node. Consequently, the four socket system
(right, derived from [MHS14, Figure 2b]) has eight NUMA nodes.
17 Each die contains 8 MiB of L3 cache [Wei+11]. However, 2 MiB are reserved for the snoop filter [Amd13b, Section 2.9.4.1].
18 The 8-bit link connecting the dies is disabled [Amd13b, Chapter 2.12.1.5].
19 The ATM extension is closely related to the snoop filtering mechanism and thus may not be active in all systems.
4.2.2.1 Latency of On-chip Transfers
Figure 4.32 shows latency measurements on the Opteron 6274 system. In order to benchmark the lo-
cal cache hierarchy (“local” test series), data placement (see Section 3.2) and measurement (see Sec-
tion 3.5.1) are both performed on CPU 0. Core-to-core transfers within a NUMA node are examined by
performing the measurement on CPU 0 after placing the data in the cache hierarchy of CPU 1 and CPU 2,
respectively20. The corresponding latencies are displayed by the “within NUMA node (same CU)” and
“within NUMA node (other CU)” test series. In the “other NUMA node” cases data is placed in caches
of CPU 8 (same MCM), CPU 16 (other MCM, 1 hop), and CPU 24 (other MCM, 2 hops).
Reads from the local L1, L2, and L3 cache (“local” test series up to 6 MiB) have a latency of 1.8 ns
(4 cycles), 9.1 ns (20 cycles), and 27.3 ns (60 cycles), respectively. However, the 16 KiB L1 cache is
hardly visible due to the minimal data set size of 12 KiB (see Section 3.5.1). Furthermore, the results
vary from 1.8 to 3.2 ns (4 to 7 cycles). If the minimal distance between the accesses is reduced to
256 byte, data set sizes between 6 and 12 KiB can be used as well21. In that case the data fits into two or
three of the L1 cache’s four ways and the latency of four cycles can be measured reproducibly.
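The conversion between cycles and nanoseconds and the number of occupied L1 ways can be checked with a few lines of Python. This is a simplified sketch: the helper names are hypothetical, the 2.2 GHz clock is an assumption implied by the quoted cycle counts, and the way calculation assumes a contiguous data set that is spread evenly over the cache sets.

```python
import math

CLOCK_GHZ = 2.2  # ASSUMPTION: core clock implied by 4 cycles ~ 1.8 ns

def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert a latency in core cycles to nanoseconds."""
    return cycles / clock_ghz

def ways_needed(data_set_kib, cache_kib=16, ways=4):
    """Number of ways a contiguous data set occupies in a set-associative cache."""
    way_kib = cache_kib / ways
    return math.ceil(data_set_kib / way_kib)

for level, cycles in (("L1", 4), ("L2", 20), ("L3", 60)):
    print(f"{level}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
print(ways_needed(6), ways_needed(12))
```

With 4 KiB per way, data sets between 6 and 12 KiB occupy two or three of the four ways, so they still fit into the L1 cache.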
(a) Exclusive22 / Modified (b) MuW / Owned (Shared copy in node 7)
(c) Shared (Owned copy in node 0) (d) Shared (Owned copy in node 7)
Figure 4.32: Opteron 6274—memory read latency: Comparison of accesses within a core’s local cache
hierarchy (local) with accesses to other cores’ data. The other core can be the second core within
the compute unit (within NUMA node (same CU)), in another compute unit on the same die (within
NUMA node (other CU)), in the 2nd chip of the same processor (other NUMA node (same MCM)),
or in another processor. Cores in another socket can be on a die that is directly connected to the
requester’s die (other NUMA node (other MCM, 1 hop)) or on a die without direct HyperTransport
connection (other NUMA node (other MCM, 2 hops)).
20 Generating Owned and Shared cache lines creates an additional copy of the data (see Section 3.3).
21 The increased prefetcher effectiveness is not relevant for local L1 measurements as the data is already as close as possible.
22 BENCHIT_KERNEL_DISABLE_CLFLUSH set to 1 (see Section 3.5.6) as using CLFLUSH leads to inadvertent L3 evictions.
Data transfers between the cores in a CU (“within NUMA node (same CU)” test series) are handled by
the shared L2 cache up to a data set size of 2 MiB. Accesses to Modified and Exclusive cache lines that
have been placed in the L2 cache by the second core have a latency of 19.5 ns, compared to 9.1 ns for the
states Owned, MuW, and Shared. The higher latency indicates that the second L1 is snooped, which is
required if the second core has write permission for the data23. From this it follows that a mechanism
has to be in place to determine which core a writable cache line belongs to. Furthermore, cache lines are
apparently evicted silently from the L1 caches. Consequently, snoops are necessary even if the second
core does not contain a writable copy any more. This behavior is similar to the core snoops after silent
evictions of Exclusive cache lines in Intel systems (see Section 4.1.2.1, 4.1.3.1, and 4.2.1.1). In contrast,
Shared and Owned cache lines can be read directly, as the second core does not need to be notified
of further read accesses. Interestingly, the MuW state behaves like Owned instead of being equal to
Modified, i.e., the cores apparently do not distinguish the states MuW and Owned.
If a required cache line is not found in the CU’s L2 cache, it is requested from the system request
interface. The L3 cache is directly connected to the SRI and delivers cache lines with a latency of 27.3 ns
if it contains a valid copy (“within NUMA node (∗)” cases between 2 and 6 MiB). The L3 latency does not
increase if data is placed in it by another CU as the (mostly) exclusive design guarantees that no Modified
or Exclusive copies remain in the L1 and L2 caches. In case of an L3 miss, the request is handled by
the memory controller in the home node [CH07]. If the snoop filter entry (see Section 2.3.2.3) points
to an owner of the cache line, the request is sent to the node that contains the forwardable copy (state
Modified, MuW, Exclusive, or Owned). If the requester’s node is the home node and data is forwarded
from another CU on the same die (see “within NUMA node (other CU)” test series in Figure 4.32a, 4.32b,
and 4.32c up to 2 MiB), the latency is 89.5 ns. Figure 4.32c also includes measurements of core-to-core
transfers within node 0 that are triggered by a remote home node. The cases “other NUMA node (*)”
show significantly higher latencies up to a data set size of 2 MiB although the data is transferred between
two CUs on the same die as in the “within NUMA node (other CU)” case. However, the requests are
forwarded to a memory controller in another node, which then sends back the snoop requests to node 0.
The additional HyperTransport transfers increase the latency to 126 ns (one hop within MCM), 131 ns
(one hop between sockets), or 168 ns (two hops). Figure 4.32d includes the case that a Shared copy
exists within the requester’s NUMA node (“within NUMA node (other CU)” up to 2 MiB). However,
Shared cache lines are not forwarded (see Section 2.3.1.3). Instead, data is forwarded from node 7, which
contains the second copy. This confirms that the extended MOESI protocol [Lep+12] (see Figure 2.14)
is used as the conventional MOESI protocol would not create a forwardable copy.
4.2.2.2 Latency of Remote Cache Accesses
The characteristics of remote cache accesses are illustrated by the “other NUMA node (*)” test series
in Figure 4.32a, 4.32b, and 4.32d up to a data set size of 6 MiB. The latency of die-to-die transfers within
an MCM is measured between node 0 and node 1 (same MCM). The external HyperTransport links are
benchmarked using node 2 (other MCM, 1 hop) and node 3 (other MCM, 2 hops) as representatives for
a distance of one hop and two hops, respectively. The measured latencies are slightly higher than those
reported in [MHS14], which is presumably caused by the different mainboards.
Modified, Exclusive, MuW, and Owned cache lines are forwarded to the requesting core. This comprises
three steps as MOESI is a home snoop protocol (see Section 2.3.1.3):
1) A request is sent to the home node of the cache line (determined by the physical address).
2) The corresponding snoop filter entry is checked and the request is forwarded to the owner.
3) The owner forwards the data to the requester.
The number of required HyperTransport transfers depends on the distances between the participating
nodes as depicted in Figure 4.33.
23 The L1 is a write-through cache. However, it needs to perceive requests by other cores in order to perform required coherence
state transitions. Furthermore, the WCC needs to be checked for data that is not up-to-date in the L2 cache.
[Figure 4.33 panels, each showing nodes 0 to 7: (a) one subset: node 0 (Shared), node 2 (home node), node 6 (Owned); (b) two subsets that span two sockets: node 0 (Shared), node 1 (home node), node 7 (Owned); (c) two subsets that span three sockets: node 0 (Shared), node 2 (home node), node 7 (Owned). Legend: send request to home node, forward request to owner, send response to requester.]
Figure 4.33: Opteron 6274—HyperTransport transfers that comprise three nodes: The required number
of HyperTransport transfers to forward a cache line depends on the distribution of the involved nodes.
These figures are derived from [MHS14, Figure 6].
Figures 4.32a and 4.32b show transfers that involve only two nodes—the source node (node 0) and the
home node, which also contains the required cache lines. Therefore, step 2) does not involve Hyper-
Transport transfers. Consequently, two HyperTransport transfers are required for nodes that are directly
connected to node 0—one to send the request to the home node and one to send back the response. The
latency is 131.8 ns if data is forwarded from an L2 cache in the processor’s second node. With 141.8 ns,
transfers between two processors take noticeably longer. Nodes that are not directly connected to node 0
require two HyperTransport transfers in each direction, which increases the latency to 184.5 ns. In the
measurements shown in Figure 4.32d (“within NUMA node (other CU)” and “other NUMA node (*)”,
up to 2 MiB), an Owned copy of the requested cache lines exists in node 7, outside of the home node.
Thus, the snoop request has to be forwarded to the owner node (directed probe, see [Con+10, Figure 6]).
Table 4.20 details the resulting number of HyperTransport transfers. The access latency is 186 ns if four
hops are required. It increases to 207 ns if a fifth transfer is necessary. Examples for these cases are
depicted in Figure 4.33b and 4.33c, respectively.
The observed latencies can be explained using the message delivery times (t_msg) detailed in Table 4.21,
which are derived from the propagation delays (t_delay) and transmission times (t_trans) using
equation (4.1). Transmission times are calculated from the message sizes and data rates using equation (4.2).

$$t_{\mathrm{msg}}(\mathit{link}, \mathit{message}) = t_{\mathrm{delay}}(\mathit{link}) + t_{\mathrm{trans}}(\mathit{link}, \mathit{message}) \tag{4.1}$$

$$t_{\mathrm{trans}}(\mathit{link}, \mathit{message}) = \frac{\mathrm{sizeof}(\mathit{message})}{\text{data-rate}(\mathit{link})} \tag{4.2}$$
Table 4.20: Opteron 6274, number of HyperTransport hops for accessing remotely cached data: The delay for requests from node 0 to data that is cached in node 7 varies depending on the location of the home node.

| request from node 0, Owned copy in node 7 | home node 0 | home node 1, 6 | home node 2, 4 | home node 3, 5 | home node 7 |
|---|---|---|---|---|---|
| send request to home node | 0 | 1 | 1 | 2 | 2 |
| forward request to node 7 | 2 | 1 | 2 | 1 | 0 |
| send response to node 0 | 2 | 2 | 2 | 2 | 2 |
| total hops | 4 | 4 | 5 | 5 | 4 |
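The hop counts in Table 4.20 follow from the topology described in Section 4.2.2: two fully connected subsets {0, 2, 4, 6} and {1, 3, 5, 7} plus one link between the two dies of each socket. The following Python sketch reproduces the totals under the assumption that every message takes a shortest path; the function names are hypothetical and the actual routing tables of the test system are not known.

```python
from collections import deque

def build_topology():
    """Adjacency sets for the eight NUMA nodes as described in the text."""
    links = {n: set() for n in range(8)}
    for subset in ((0, 2, 4, 6), (1, 3, 5, 7)):   # fully connected subsets
        for a in subset:
            links[a].update(b for b in subset if b != a)
    for a in (0, 2, 4, 6):                          # two dies per socket
        links[a].add(a + 1)
        links[a + 1].add(a)
    return links

def hops(links, src, dst):
    """Minimal number of HyperTransport hops (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in links[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("unreachable node")

def total_hops(links, requester, home, owner):
    """Request to home node, forward to owner, response to requester."""
    return (hops(links, requester, home) + hops(links, home, owner)
            + hops(links, owner, requester))

links = build_topology()
for home in range(8):
    print(home, total_hops(links, 0, home, 7))
```

For a request from node 0 to data owned by node 7, the sketch yields four hops for the home nodes 0, 1, 6, and 7 and five hops for the home nodes 2, 3, 4, and 5, in agreement with the table.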
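Table 4.21: Opteron 6274, HyperTransport message delivery times: Time for sending one-way messages (t_msg) based on propagation delays (t_delay) and transmission times (t_trans) of transfers between the dies in an MCM (near) as well as transfers between sockets (far). The round-trip time (t_rt) and the propagation delay apply to both messages of a link type.

| transfer | t_rt | t_delay | t_trans | t_msg |
|---|---|---|---|---|
| near, request | 42.3 ns | 17.7 ns | 0.9 ns | 18.6 ns |
| near, response | 42.3 ns | 17.7 ns | 6.0 ns | 23.7 ns |
| far, request | 52.3 ns | 19.2 ns | 1.9 ns | 21.1 ns |
| far, response | 52.3 ns | 19.2 ns | 12.0 ns | 31.2 ns |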
Requests have a size of 8 byte [HTC10, Section 4.4.1]24. Responses consist of a 4 byte response
packet [HTC10, Section 4.5.1] and a 64 byte data packet [HTC10, Section 3.2.2]. The per packet CRC
code25 increases the message sizes to 12 byte for requests and 76 byte for responses. The raw data rates
are 12.8 and 6.4 byte/ns for the internal (16 bit) and external (8 bit) links, respectively. The usable band-
widths are limited to 12.7 and 6.35 byte/ns by periodic CRC windows26, which consume 4 out of every
516 bus cycles [HTC10, Section 10.1.1]. This results in average transmission times of 0.94 and 1.89 ns
for requests as well as 5.98 and 11.97 ns for responses (rounded to one decimal place in Table 4.21).
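The transmission times can be recomputed from the message sizes and data rates given above. The sketch below is illustrative; the names are hypothetical, and the split into 4 byte of CRC per packet is inferred from the stated sizes (8 byte request becomes 12 byte, 4 byte response packet plus 64 byte data packet become 76 byte).

```python
CRC_BYTES = 4                                        # per-packet CRC (inferred)
REQUEST_BYTES = 8 + CRC_BYTES                        # request packet + CRC
RESPONSE_BYTES = (4 + CRC_BYTES) + (64 + CRC_BYTES)  # response + data packet

def usable_rate(raw_bytes_per_ns):
    """Periodic CRC windows occupy 4 out of every 516 bus cycles."""
    return raw_bytes_per_ns * 512 / 516

def transmission_time_ns(size_bytes, raw_bytes_per_ns):
    """Equation (4.2) with the usable data rate of the link."""
    return size_bytes / usable_rate(raw_bytes_per_ns)

RAW_RATE = {"near": 12.8, "far": 6.4}   # byte/ns for 16 bit and 8 bit links
for link, rate in RAW_RATE.items():
    req = transmission_time_ns(REQUEST_BYTES, rate)
    resp = transmission_time_ns(RESPONSE_BYTES, rate)
    print(f"{link}: request {req:.2f} ns, response {resp:.2f} ns")
```

The results agree with the averages stated in the text for requests and responses on the internal and external links.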
The propagation delays (t_delay) are derived from the calculated transmission times (t_trans) and the
measured HyperTransport round-trip times (t_rt). Based on the latency of on-chip transfers between CUs,
processor internal (near) and external (far) round-trips require 42.3 ns (131.8 ns − 89.5 ns) and 52.3 ns
(141.8 ns − 89.5 ns), respectively. These are composed of propagation delays and transfer times as
shown in equation (4.3), which is solved for t_delay in equation (4.4). The resulting delays are 17.7 ns
((42.3 ns − 0.9 ns − 6.0 ns)/2) for near and 19.2 ns ((52.3 ns − 1.9 ns − 12.0 ns)/2) for far requests.

$$\begin{aligned} t_{\mathrm{rt}}(\mathit{link}) &= t_{\mathrm{msg}}(\mathit{link}, \mathit{request}) + t_{\mathrm{msg}}(\mathit{link}, \mathit{response}) \\ &= 2 \cdot t_{\mathrm{delay}}(\mathit{link}) + t_{\mathrm{trans}}(\mathit{link}, \mathit{request}) + t_{\mathrm{trans}}(\mathit{link}, \mathit{response}) \end{aligned} \tag{4.3}$$

$$t_{\mathrm{delay}}(\mathit{link}) = \frac{t_{\mathrm{rt}}(\mathit{link}) - t_{\mathrm{trans}}(\mathit{link}, \mathit{request}) - t_{\mathrm{trans}}(\mathit{link}, \mathit{response})}{2} \tag{4.4}$$
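Equations (4.1) and (4.4) can be applied directly to the measured values. The following Python sketch uses hypothetical function names and the rounded transmission times from Table 4.21 as inputs.

```python
def propagation_delay(t_rt, t_trans_request, t_trans_response):
    """Equation (4.4): one-way propagation delay of a link in ns."""
    return (t_rt - t_trans_request - t_trans_response) / 2

def message_time(t_delay, t_trans):
    """Equation (4.1): one-way message delivery time in ns."""
    return t_delay + t_trans

# (round-trip time, request transmission, response transmission) in ns
LINKS = {"near": (131.8 - 89.5, 0.9, 6.0), "far": (141.8 - 89.5, 1.9, 12.0)}

for name, (t_rt, t_req, t_resp) in LINKS.items():
    d = propagation_delay(t_rt, t_req, t_resp)
    print(f"{name}: t_delay {d:.1f} ns, "
          f"t_msg request {message_time(d, t_req):.1f} ns, "
          f"response {message_time(d, t_resp):.1f} ns")
```

The resulting delivery times correspond to the t_msg column of Table 4.21.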
Table 4.22 lists the possible HyperTransport round-trip times (t_rt) within the complex NUMA topology
(see Figure 4.31b) as well as the resultant access latencies (t_total). On-chip transfers of cache lines that
have a remote home node are predicted with 126.7 ns, 131.7 ns, and 168.9 ns depending on the distance
to the home node. The corresponding measurements show slightly lower latencies of 126 ns, 131 ns, and
168 ns, respectively. The estimated latency of 184.1 ns for transfers that require two requests and two
responses is only slightly lower than the measured 184.5 ns (two nodes involved) or 186 ns (three nodes
involved). The forecast of 205.2 ns for forwarding cache lines using five HyperTransport transfers is
close to the measured 207 ns as well. All deviations are below 1%, which shows that Table 4.22 provides
substantiated predictions for the other cases.
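The comparison between the model and the measurements can be retraced with the message delivery times from Table 4.21. In the sketch below the names are hypothetical, and the message counts per case are chosen such that their sums reproduce the predictions quoted in this paragraph.

```python
T_ONCHIP = 89.5  # ns, forwarding between two CUs on the same die
T_MSG = {"near_req": 18.6, "near_resp": 23.7, "far_req": 21.1, "far_resp": 31.2}

def predicted_latency(**counts):
    """On-chip latency plus the delivery time of every HyperTransport message."""
    return T_ONCHIP + sum(T_MSG[kind] * n for kind, n in counts.items())

FOUR = dict(near_req=1, near_resp=1, far_req=1, far_resp=1)
# (message counts, measured latency in ns) for the cases discussed above
CASES = {
    "remote home node, same MCM":    (dict(near_req=2), 126.0),
    "remote home node, 1 hop":       (dict(far_req=2), 131.0),
    "remote home node, 2 hops":      (dict(near_req=2, far_req=2), 168.0),
    "four transfers, two nodes":     (FOUR, 184.5),
    "four transfers, three nodes":   (FOUR, 186.0),
    "five transfers":                (dict(FOUR, far_req=2), 207.0),
}

for name, (counts, measured) in CASES.items():
    model = predicted_latency(**counts)
    print(f"{name}: model {model:.1f} ns, measured {measured} ns, "
          f"deviation {(model - measured) / measured:+.1%}")
```

For these cases the model stays within roughly 1% of the measured latencies.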
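Table 4.22: Opteron 6274, latency of cache-to-cache transfers that involve multiple NUMA nodes: The total latency (t_total) depends on the number of involved NUMA nodes and their distribution in the NUMA topology, which determines the required number of HyperTransport transfers. The four transfer columns list the number of messages of each type together with the corresponding delivery time from Table 4.21.

| involved nodes | sockets | subsets | source | home | owner | near request (18.6 ns) | near response (23.7 ns) | far request (21.1 ns) | far response (31.2 ns) | t_rt [ns] | t_total [ns] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | node 0 | node 0 | node 0 | 0 | 0 | 0 | 0 | 0 | 89.5 |
| 2 | 1 | 2 | node 0 | node 1 | node 0 | 2 | 0 | 0 | 0 | 37.2 | 126.7 |
| 2 | 1 | 2 | node 0 | node 1 | node 1 | 1 | 1 | 0 | 0 | 42.3 | 131.8 |
| 2 | 2 | 1 | node 0 | node 2 | node 0 | 0 | 0 | 2 | 0 | 42.2 | 131.7 |
| 2 | 2 | 1 | node 0 | node 2 | node 2 | 0 | 0 | 1 | 1 | 52.3 | 141.8 |
| 2 | 2 | 2 | node 0 | node 3 | node 0 | 2 | 0 | 2 | 0 | 79.4 | 168.9 |
| 2 | 2 | 2 | node 0 | node 3 | node 3 | 1 | 1 | 1 | 1 | 94.6 | 184.1 |
| 3 | 2 | 2 | node 0 | node 1 | node 2 | 2 | 0 | 1 | 1 | 89.5 | 179.0 |
| 3 | 2 | 2 | node 0 | node 2 | node 1 | 1 | 1 | 2 | 0 | 84.5 | 174.0 |
| 3 | 2 | 2 | node 0 | node 2 | node 3 | 1 | 1 | 1 | 1 | 94.6 | 184.1 |
| 3 | 3 | 1 | node 0 | node 2 | node 6 | 0 | 0 | 2 | 1 | 73.4 | 162.9 |
| 3 | 3 | 2 | node 0 | node 2 | node 7 | 1 | 1 | 2 | 1 | 115.7 | 205.2 |
| 3 | 3 | 2 | node 0 | node 3 | node 6 | 2 | 0 | 2 | 1 | 110.6 | 200.1 |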
24 Address extension [HTC10, Section 3.2.1.3] is not required as 40 bit are sufficient to address all the installed memory.
25 Retry mode [HTC10, Section 10.1.3] has to be enabled as the links operate at Gen3 frequencies [HTC10, Section 7.5.7].
26 Periodic CRC is disabled in retry mode, but the CRC windows are present nevertheless [HTC10, Section 10.1.3].
Remote L3 accesses (Figure 4.32a and 4.32b “other NUMA node (*)”, from 2 to 6 MiB) are slightly faster
than the corresponding L2 accesses (up to 2 MiB). However, the latency increase compared to local L3
accesses (“local” and “within NUMA node (*)”, from 2 to 6 MiB) is larger than can be explained by the
additional HyperTransport transfers. This is due to the fact that the local L3 cache is accessed directly
while remote cache accesses are routed through the home node’s memory controller.
4.2.2.3 Main Memory Latency
Table 4.23 summarizes the latency of local and remote cache and memory accesses. With a latency of
83.6 ns, data is delivered faster from local memory than from remote caches. Accessing memory that
is attached to the processor’s second die requires 136.8 ns. Another socket’s memory can be read with
a delay of 151 and 194 ns for 1-hop and 2-hop connections, respectively. These values do not include
TLB misses, which increase the latency as depicted in Figure 4.34. The impact is particularly large if
neither hugetlbfs nor transparent huge pages are used. In contrast to Haswell’s directory assisted COD
mode (see Section 4.2.1.1), the memory read latency is independent of the data’s coherence state prior
to its eviction. It is guaranteed that data is forwarded from cache if the snoop filter indicates an owner
(see Section 2.3.2.3). Consequently, silent evictions of forwardable cache lines (see Table 2.5) are not
allowed. Evictions of Modified, MuW, and Exclusive cache lines invalidate the corresponding snoop
filter entries while Owned cache lines leave directory entries in state S. Shared cache lines are evicted
silently [Con+10], thus entries in state S can remain in the directory. However, reads can be serviced
without snooping other nodes even if the snoop filter indicates the presence of Shared copies. Writes—
which are affected by directory entries in state S—are not covered by the latency benchmark.
The snoop filter (HT Assist, see Section 2.3.2.3) is much more important on the four socket Opteron 6274
test system than on the dual-socket Opteron 2435 system discussed in Section 4.1.1. Despite the signif-
icantly more complex topology, the local memory latency of 83.6 ns is comparable to the dual-socket
systems as data can be read directly from memory if no cached copies exist. The local memory latency
would presumably increase to about 150 ns (≈ 184.5 ns− 16.1 ns− 18.2 ns)27, if snoop broadcasts were
necessary. The measurements that involve a third node indicate that even the remote memory latencies
would probably be higher without HT Assist. Unfortunately, the BIOS of the test system does not include
an option to disable the snoop filter. Thus, the HT Assist advantage cannot be quantified exactly.
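The estimate of about 150 ns can be retraced with a few lines of Python. This is only a sketch of the arithmetic given in footnote 27, using the values from Table 4.23 and the footnote. The constant and function names are chosen here for illustration.

```python
# Sketch of the estimate from footnote 27; all names are illustrative.
FARTHEST_CACHE_NS = 184.5   # forwarding from the most distant caches (Table 4.23)
NEAR_RESPONSE_NS = 6.0      # near response (footnote 27)
FAR_RESPONSE_NS = 12.0      # far response (footnote 27)
SCALE_FROM_BYTES = 76       # sizes as given in footnote 27
SCALE_TO_BYTES = 68
L3_LATENCY_NS = 27.3
L2_LATENCY_NS = 9.1


def data_transfer_ns():
    """Transmission time of the data packet, scaled by the packet sizes."""
    return (NEAR_RESPONSE_NS + FAR_RESPONSE_NS) / SCALE_FROM_BYTES * SCALE_TO_BYTES


def snoop_filter_lookup_ns():
    """Snoop filter lookup, approximated by the L3 access time."""
    return L3_LATENCY_NS - L2_LATENCY_NS


def estimated_broadcast_latency_ns():
    """Estimated local memory latency if snoop broadcasts were necessary."""
    return FARTHEST_CACHE_NS - data_transfer_ns() - snoop_filter_lookup_ns()


print(f"data transfer:       {data_transfer_ns():.1f} ns")
print(f"snoop filter lookup: {snoop_filter_lookup_ns():.1f} ns")
print(f"estimated latency:   {estimated_broadcast_latency_ns():.1f} ns")
```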
The difference between the local and the remote memory latency is higher than the round-trip times (trt)
detailed in Table 4.21. This can be explained by the adaptive DRAM prefetch [Con+10], which starts
local memory requests in parallel to the L3 access. Furthermore, the gap between internal (within MCM)
and external transfers (between sockets) also increases. This indicates that more data is transferred as this
difference is mainly caused by the longer transmission times on the external links. According to [CL12,
para. [0107]] additional information that indicates how many additional snoop responses will arrive may
be included in the response. However, the size of the added data is not revealed.
Table 4.23: Opteron 6274—latency of local and remote cache and memory accesses. Cached data in
state Exclusive. All measurements are in nanoseconds.
level in    location (owner = home node)                              three nodes
memory      same die                      2nd die   other socket      involved
hierarchy   local   2nd core   other CU   in MCM    1 hop    2 hops   (worst case)
L1          1.8     19.5       89.5       131.8     141.8    184.5    207
L2          9.1     19.5       89.5       131.8     141.8    184.5    207
L3          27.3                          118.6     127.7    172.3    195
DRAM        83.6                          136.8     151.4    194.1    n/a
27 Based on 184.5 ns for forwarding cache lines from the most distant caches, 16.1 ns for the transmission of the data packet
((6.0 ns (near response) + 12.0 ns (far response)) / 76 byte * 68 byte), and 18.2 ns for the snoop filter lookup (≈ L3 access
time = 27.3 ns (L3 latency) − 9.1 ns (L2 latency)).
Figure 4.34: Opteron 6274—impact of TLB misses on memory latency: The L1 TLB covers 128 KiB in
4 KiB pages. Using 2 MiB pages—via hugetlbfs or THP—increases its coverage to 64 MiB. Larger data
sets (up to 4 MiB and 2 GiB, respectively) revert to the L2 TLB. This leads to higher latencies, e.g.,
the L3 latency increases from 27.3 to 38.2 ns if malloc() is used without THP. Exceeding the L2 TLB
results in costly page table walks.
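The coverage figures in the caption of Figure 4.34 follow from multiplying the number of TLB entries by the page size. The sketch below retraces this. The entry counts are not stated in the caption but derived from its coverage values, and the helper names are illustrative.

```python
# TLB reach = number of entries x page size. Entry counts are derived from
# the coverage values in the caption of Figure 4.34, not taken from a manual.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_entries(reach_bytes, page_bytes):
    """Number of entries needed to cover reach_bytes with pages of page_bytes."""
    return reach_bytes // page_bytes


def tlb_reach(entries, page_bytes):
    """Amount of memory covered by the given number of entries."""
    return entries * page_bytes


l1_entries = tlb_entries(128 * KIB, 4 * KIB)
l2_entries = tlb_entries(4 * MIB, 4 * KIB)

print(f"L1 TLB: {l1_entries} entries, "
      f"{tlb_reach(l1_entries, 2 * MIB) // MIB} MiB with 2 MiB pages")
print(f"L2 TLB: {l2_entries} entries, "
      f"{tlb_reach(l2_entries, 2 * MIB) // GIB} GiB with 2 MiB pages")
```

With the derived entry counts, the reach for 2 MiB pages matches the 64 MiB and 2 GiB given in the caption.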
4.2.2.4 Bandwidth of Local Cache Accesses and Core-to-core Transfers
Figure 4.35 shows the achievable read bandwidth using 64 bit scalar and 256 bit vector loads as measured
by the single-threaded read bandwidth benchmark (see Section 3.5.2) with cached data in state Exclu-
sive28. Normally, all selected locations are measured in a single benchmark invocation (see Section 3.2).
However, if the second core in the measuring core’s compute unit is included, its busy waiting loop af-
fects the measurements. Therefore, the “local” test series is measured separately and integrated in the
figures via BenchIT’s result plotter [Juc+04]. The effect on the other locations is negligible.
Table 4.24 summarizes the measurements depicted in Figure 4.35 and adds results obtained using 128 bit
loads (MOVDQA SSE instruction). The usage of SIMD instructions has a significant influence on the
bandwidth of local accesses and on-chip transfers. The local L1 and L2 bandwidths using SIMD in-
structions are approximately twice as high as those achieved by scalar instructions. The L3 and DRAM
bandwidths increase by approximately 25% if packed SSE or AVX instructions are used. Since the data
paths between the core and the cache hierarchy are only 128 bit wide, it should be possible to achieve
(a) 64 bit scalar loads (x86 MOV instruction) (b) 256 bit vector loads (VMOVDQA AVX instruction)
Figure 4.35: Opteron 6274—memory read bandwidth of scalar and vector instructions: Comparison of
accesses within a core’s local cache hierarchy (local) with accesses to other cores’ data. The other
core can be the CU’s second core (within NUMA node (same CU)), in another CU on the same die
(within NUMA node (other CU)), in the processor’s 2nd chip (other NUMA node (same MCM)), or
in another processor. Cores in another socket can be one (other NUMA node (other MCM, one hop,
*)) or two HyperTransport hops away (other NUMA node (other MCM, two hops, *)). Furthermore,
the bandwidth of remote accesses is different for ganged and unganged external links.
28 No CLFLUSH is used in the coherence state control, as it causes unintended L3 misses in conjunction with the snoop filtering mechanism.
Table 4.24: Opteron 6274—read bandwidth depending on width of load instructions: Measurements
are performed on core 0 in node 0. Results include the local cache hierarchy (node 0, core 0), trans-
fers within a compute unit (node 0, core 1), on-chip transfers (node 0, core 2-7), and remote accesses
(node 1, node 2, node 4/6, node 5/7). Cached data is in state Exclusive. All measurements are in GB/s.
location 64 bit scalars (MOV) 128 bit vectors (MOVDQA) 256 bit vectors (VMOVDQA)
node core L1 L2 L3 RAM L1 L2 L3 RAM L1 L2 L3 RAM
0
0 31.7 11.7
7.7 6.2
57.2 21.8 10.0
7.7
65.7 23.3 10.2
7.71 9.3 10.8 19.4 20.3
9.7
16.8 17.5
9.7
2-7 4.5 4.9 6.7 6.9 7.3 7.5
1
0-7
3.5 3.6 3.9 4.0 5.0 5.1 5.0 5.5 5.6 5.5 5.3
2 3.0 3.2 3.4 4.2 4.3 4.0 4.3 4.6 4.2
3 2.5 2.7 3.3 3.5 3.4 3.4 3.7
4/6 2.4 2.6 2.7 2.8 3.0 2.8 3.0
5/7 1.7 1.9 1.8 1.9 1.8 1.9
the full performance using 128 bit instructions. However, the measurements show a small advantage for
AVX code. The L1 cache’s theoretical maximum of 70.4 GB/s is not reached in the measurements. The
bandwidth of the shared L2 cache is reduced if data has been placed in it by the CU’s second core. 64
and 128 bit loads show a small reduction of 7%. With 25%, the performance degradation is much more
severe using 256 bit instructions. As for the latency results, this penalty applies only to reads of
Exclusive and Modified data, which cause coherence state transitions in the other core’s L1 cache. Data
in other coherence states can always be read from the L2 cache with the full bandwidth.
SIMD loads also achieve slightly higher remote bandwidths. The data rates supported by remote
caches are almost identical to the corresponding memory bandwidth. As observed in [Lud12, Sec-
tion 4.4.1], one-hop and two-hop connections between sockets each show two distinct performance lev-
els: node 0←node 2 (one hop, fast), node 0←node 4/6 (one hop, slow), node 0←node 3 (two hops, fast),
node 0←node 5/7 (two hops, slow). Furthermore, the fast two-hop connection can sustain a higher band-
width than the slow one-hop connection. All external HyperTransport links are 8 bit wide. However, the
link between node 0 and node 2 is connected to HT0 (ganged link) in node 0, which only has one active
sublink [Lud12, Section 4.2]. Therefore, requests that are targeted at or routed through node 2 can utilize
all of HT0’s request buffers [Lud12, Section 3.3.2]. The resulting remote memory bandwidths are 4.2
and 3.7 GB/s for the one-hop (node 2) and two-hop (node 3) connection, respectively. In contrast, node 4
and node 6 are both connected to HT3 (unganged link) of node 0. Consequently, only half of the request
buffers are available to each sublink, which limits the achievable bandwidth to 3.0 GB/s for one-hop and
1.9 GB/s for two-hop connections. With up to 5.3 GB/s for accessing node 1’s memory, the internal 16 bit
link is only 25% faster than the ganged external link.
Figure 4.36 depicts the achievable write bandwidths for various locations of the accessed data. The
performance strongly depends on the data’s coherence state as the snoop filter (see Section 2.3.2.3) is
less effective for stores [Con+10, Figure 6]. The coherence states Exclusive28 and Modified have identi-
cal performance characteristics, which are shown in Figure 4.36a. Due to the L1 cache’s write-through
policy, the L1 bandwidth is limited by the L2 performance. The local L2 can be written with 11.2 GB/s—
approximately half of the read bandwidth. In accordance with the latency and read bandwidth measurements,
the bandwidth is significantly reduced if the data has been placed in the shared L2 cache by the
other core. Data can be written to the L3 cache and local memory with 6.0 and 5.0 GB/s, respectively.
If data resides in another CU’s L2 cache or a remote L3 cache, the write bandwidths are limited by the
corresponding read bandwidths. The additional local writes further reduce the achievable bandwidths.
For data set sizes above 8 MiB, data is read from and written back to the remote memory29. The achiev-
able bandwidth is 4.2 GB/s for the memory attached to the second die in the MCM. The memory of other
29 Up to 8 MiB remain in the local cache hierarchy of the measuring core.
(a) Exclusive28 / Modified, no further copies (b) MuW, no further copies (already invalidated in node 7)
(c) Owned (Shared copy in node 7) (d) Shared28 (Owned copy in node 7)
Figure 4.36: Opteron 6274—write bandwidth: Comparison of write accesses to data in a core’s local
cache hierarchy (local) with modifications of other cores’ data. The writing core has to obtain ex-
clusive ownership of the affected cache lines before modifications are performed. Therefore, writes
comprise two operations: a local or remote read—that also invalidates all remote copies—as well
as the actual update of the local copy. Eventually data is evicted from the measuring core’s cache
hierarchy to the respective home node’s memory. The data locations are identical to Figure 4.35. All
measurements use 256 bit stores (VMOVDQA).
sockets can be written to with 2.8 and 1.9 GB/s for one-hop and two-hop connections, respectively. The
benefit of the higher read bandwidth provided by the ganged link in node 0 disappears if data also has
to be transferred back. This is caused by the asymmetrical topology, which connects the ganged link in
node 0 to an unganged link in node 2 [Lud12, Section 4.2]. The 8 bit link between node 1 and node 3,
which is ganged on both sides, enables a write bandwidth of 3.3 GB/s.
Figure 4.36b depicts the performance of write accesses to data in state MuW. The local L1 and L2
bandwidth equals the lower of the two performance levels observed for states Exclusive and Modified,
which indicates that the second core is always snooped before writes. This makes sense as the low read
latency (see Figure 4.32b) for state MuW shows that both cores can make read-only copies without
snooping the other core. Consequently, invalidations are required before writes. The results for data in
remote caches—including other L2 caches on the same chip—are different as well. This is caused by
the coherence state control mechanism (see Section 3.3), which involves accesses by another core (in
another NUMA node) to generate state MuW. This apparently changes the directory state to O, which
indicates that multiple nodes may have a copy (see Section 2.3.2.3). Therefore, broadcast invalidates are
required. This significantly reduces the performance to 1.0 – 3.3 GB/s instead of 1.9 – 6.0 GB/s, which
are measured for writes to Exclusive and Modified data. The memory performance is not affected because
MuW evictions clear the corresponding snoop filter entries when the only copy is evicted.
Figure 4.36c depicts the performance of write accesses to data in state Owned. In contrast to MuW,
Owned cache lines are not exclusive to a CU. Therefore, local writes require broadcast invalidations as
well. This reduces the achievable write bandwidth to 2.8 GB/s. Furthermore, the memory bandwidth
(data sets larger than 8 MiB) is reduced. This is caused by the snoop filter. Owned evictions leave
directory entries in state S as other copies can still exist. Therefore, broadcast invalidations are sent,
even if there are no cached copies. The penalty decreases for larger data set sizes as only the last 8 MiB
of the accessed buffer are affected by this30. The performance characteristics of writes to Shared cache
lines are depicted in Figure 4.36d. Shared cache lines are evicted silently and thus do not clear the directory
state. Therefore, the memory bandwidth shows the same reduction as observed for state Owned. The
write bandwidth of cached data depends on the distance to the owner node. Figure 4.36d depicts the
worst case (two HyperTransport hops between requester and owner). Surprisingly, the bandwidth of
1.9 GB/s is higher than the worst case observed for state Owned and MuW. In the measurements shown
in Figure 4.36b and 4.36c, the home node is also the owner of the cached copy. This is the actual
worst case as the snoop filter look-ups interfere with the cache accesses. In the measurements shown
in Figure 4.36d data is not forwarded from the home node’s L3 cache, which avoids this bottleneck.
4.2.2.5 Bandwidth Scaling of Shared Resources
Figure 4.37 depicts the bandwidth scaling within a single compute unit. The results are obtained using
the load and store routines of the throughput benchmark (see Section 3.5.4), which enable smaller data
set sizes that fit into the write coalescing cache (WCC) [Amd14c, Chapter 2.13]. The L1 read bandwidth
can also be measured more precisely due to the longer runtime of the individual measurements. Up
to 35.1 GB/s can be achieved using 64 bit loads that target general purpose registers. Using 128 bit or
256 bit SIMD instructions increases the L1 bandwidth to 68.0 GB/s. Although each core has its own
L1 cache, the aggregated bandwidth of concurrent reads is identical to the bandwidth a single core can
achieve. For SIMD loads this can be explained by the floating point load buffer (see Figure 4.30), which
limits L1 accesses to two 128 bit loads per cycle (70.4 GB/s). However, both integer clusters should be
able to sustain two loads per cycle independent of each other. This is apparently not the case, which
indicates that the shared front-end can only handle two loads per cycle. The read bandwidth from the L2
cache doubles compared to the values in Table 4.24 if both cores access it concurrently.
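The theoretical L1 maximum quoted above follows from the number of loads per cycle, the load width, and the core clock. The following is a minimal sketch of this arithmetic. It assumes a core clock of 2.2 GHz, which is consistent with the 70.4 GB/s figure but not stated in this section, and the function name is illustrative.

```python
def peak_load_bandwidth_gbs(loads_per_cycle, load_width_bits, clock_ghz):
    """Peak L1 load bandwidth in GB/s: loads per cycle x bytes per load x clock."""
    return loads_per_cycle * (load_width_bits / 8) * clock_ghz


CLOCK_GHZ = 2.2  # assumed core clock, see lead-in

# Two loads per cycle, as discussed in the text.
simd_peak = peak_load_bandwidth_gbs(2, 128, CLOCK_GHZ)
scalar_peak = peak_load_bandwidth_gbs(2, 64, CLOCK_GHZ)

# Measured values are taken from the throughput benchmark results above.
print(f"128 bit loads: peak {simd_peak:.1f} GB/s, "
      f"measured 68.0 GB/s ({68.0 / simd_peak:.1%} of peak)")
print(f"64 bit loads:  peak {scalar_peak:.1f} GB/s, "
      f"measured 35.1 GB/s ({35.1 / scalar_peak:.1%} of peak)")
```

Under these assumptions both measured L1 bandwidths are close to the limit of two loads per cycle.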
(a) 64 bit loads and stores (x86 MOV instruction) (b) 128 bit loads and stores (MOVDQA SSE instruction)
Figure 4.37: Opteron 6274—bandwidth scaling within compute unit: The achievable read and write
bandwidths within a compute unit depend on the number of used cores. The scaling is not linear in
most cases due to the shared components. The L1 cache uses a write-through policy. However, up to
4 KiB can be stored in the write coalescing cache (see Figure 4.30), which supports a higher sustained
bandwidth than the L2 cache.
30 The coherence state control (see Section 3.3) only takes effect for the data that remains in cache (2 MiB L2 + 6 MiB L3).
Table 4.25: Opteron 6274—bandwidth scaling using SIMD instructions: Except for writes to the L3
cache, the bandwidths do not scale linearly with the number of used compute units per die. In contrast,
the scaling with the number of used chips and sockets is mostly linear. All measurements are in GB/s.
number of used                 bandwidth
threads  compute  chips        L3                       memory
         units    (sockets)    read   write  write-nt   read   write  write-nt
1        1        1 (1)        10.2   6.0    3.2        7.7    5.0    7.2
2        1        1 (1)        10.9   6.7    3.2        9.0    6.0    7.0
4        2        1 (1)        21.2   13.7   4.0        15.7   7.8    12.0
6        3        1 (1)        27.0   20.5   4.6        16.8   8.0    13.2
8        4        1 (1)        31.1   27.2   5.1        16.8   8.0    13.7
16       8        2 (1)        62.1   54.1   9.0        33.5   15.8   27
32       16       4 (2)        124    108    18.2       66.9   31.5   54
64       32       8 (4)        248    216    32.3       135    62.5   107
Up to a data set size of 4 KiB, writes stay in the WCC, thus are not limited by the L2 bandwidth. The
achievable bandwidth—which also does not scale with the number of cores—is 17.4 and 34.1 GB/s for
64 and 128 bit stores, respectively. That is approximately half of the read bandwidth, which is to be
expected because of the single store port. The write bandwidths using 256 bit stores (not depicted) are
limited to 23.2 GB/s for a single core and 17.5 GB/s for two concurrently writing cores. Splitting 256 bit
operations into two 128 bit writes to the WCC presumably requires additional precautions to ensure that
they become visible at the same time, which may affect the performance in this case. If the data exceeds
the WCC size, the write bandwidth is limited by the L2 cache. Up to 12.9 GB/s are possible when data
is read from the L1 and written back to the L2 cache. If data is also read from the L2, the achievable
bandwidth drops to 11.3 GB/s, which is already achieved with one core using SIMD instructions.
Table 4.25 details how the L3 cache and main memory bandwidth scales with the number of concurrent
threads. The bandwidth improves slightly if both cores perform concurrent reads or writes. The L3 band-
width increases significantly if multiple compute units are used. However, with less than 4 GB/s per core
for concurrent operations, the performance is very low. The per-chip performance even decreased com-
pared to the preceding family 10h micro-architecture (see Section 4.1.1). The memory bandwidth shows
a noticeable improvement when two compute units are used instead of one. The benefit of using more
than two compute units is small. The performance of non-temporal stores (write-nt) is rather low, if the
data is present in the L3 cache. The achievable bandwidth increases for larger data set sizes. Table 4.25
lists the performance of sequential non-temporal stores for 1 GiB of memory per node. However, the
bandwidth increase of up to 70% compared to normal stores is rather theoretical.
The read bandwidths that are available via the HyperTransport links are detailed in Table 4.26. The
bandwidth of concurrent accesses is only slightly higher than the bandwidth a single thread can achieve.
The impact of TLB misses on the bandwidth of sequential accesses is negligible if 2 MiB pages are used.
Table 4.26: Opteron 6274—HyperTransport bandwidths: One or eight threads running on node 0, read-
ing from memory in other NUMA nodes (based on [MHS14, Table 8]). All results are in GB/s.
memory         hops   minimal      link       bandwidth
allocated at          link width   type       one thread   eight threads
node 1         1      16 Bit       ganged     5.3          5.9
node 2         1      8 Bit        ganged     4.2          4.3
node 4/6       1      8 Bit        unganged   3.0          3.0
node 3         2      8 Bit        ganged     3.7          3.8
node 5/7       2      8 Bit        unganged   1.9          1.9
4.3 Potential Bottlenecks in the Memory Hierarchy
This section summarizes the major findings from Section 4.1 and 4.2 and presents common bottlenecks
that potentially limit the performance of parallel applications on NUMA systems.
4.3.1 Latency of Memory Accesses
The memory access latency has a wide range of values depending on the data’s location and coherence
state (see Section 4.1.1.1, 4.1.2.1, 4.1.3.1, 4.2.1.1, and 4.2.2.1 – 4.2.2.3). The L1 cache delivers data
with a latency of only a few cycles—three to four cycles in the examined micro-architectures. The
latency increases significantly if deeper levels of the cache hierarchy are accessed. Although only a
small selection of test systems is considered in this work, substantial differences can be observed with
respect to the latency of L2 and L3 accesses. The observed delays range from 10 to 20 cycles and from
39 to 60 cycles, respectively. The characteristics of on-chip core-to-core transfers also show noteworthy
differences. They are handled by the inclusive L3 cache on the analyzed Intel systems, which results in
a relatively low latency of approximately twice the L3 latency. On AMD systems requests for data that
is present in other cores’ caches are typically handled by the home node’s memory controller as the L3
cache is not inclusive. This results in a higher latency, especially if the data is not forwarded by the core
that has the most recent copy (see Figure 4.3).
Main memory and remote cache accesses generally take longer than forwarding data from an on-chip
location. Furthermore, the distance from the home node has a huge influence. Local memory accesses
already take several hundred cycles. The NUMA factor [Pil+11] describes how costly remote accesses
are. It is computed by dividing the latency of remote memory accesses by the latency of local memory
accesses. The analyzed dual-socket servers have relatively low NUMA factors of 1.61 (Opteron 2435),
1.60 (Xeon X5670), 1.64 (Xeon E5-2670), and 1.51 (Xeon E5-2680 v3). The quad-socket Opteron 6274
server has a NUMA factor of 2.32 for two-hop connections, i.e., remote memory accesses can take more
than twice as long as local accesses. However, compared to large scale shared memory systems this is still
relatively low. For instance, the SGI UV 2000 has remote memory latencies up to 870 ns [Old14, Sec-
tion 4.2.1], which leads to a NUMA factor of 10.88. The address translation overhead (see Section 2.4.2)
is significant if 4 KiB pages are used (see Figure 4.5, 4.10, 4.18, 4.26, and 4.34). If transparent huge pages
are enabled (typically the default), TLB misses are of minor importance.
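The NUMA factor defined above can be retraced with the DRAM latencies of the Opteron 6274 system from Table 4.23. The short sketch below is only an illustration of the definition. The function and dictionary names are chosen here.

```python
def numa_factor(remote_latency_ns, local_latency_ns):
    """NUMA factor: remote memory latency divided by local memory latency."""
    return remote_latency_ns / local_latency_ns


# DRAM latencies of the Opteron 6274 test system in ns (Table 4.23).
LOCAL_DRAM_NS = 83.6
REMOTE_DRAM_NS = {
    "2nd die in MCM": 136.8,
    "other socket, 1 hop": 151.4,
    "other socket, 2 hops": 194.1,
}

for location, latency_ns in REMOTE_DRAM_NS.items():
    print(f"{location}: NUMA factor {numa_factor(latency_ns, LOCAL_DRAM_NS):.2f}")
```

The two-hop value reproduces the factor of 2.32 quoted in the text. The other two values show that the factor strongly depends on which remote node is accessed.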
4.3.2 Bandwidth Limitations
The bandwidth of memory accesses strongly depends on the location of the data as well (see
Section 4.1.1.2, 4.1.2.2, 4.1.3.2, 4.2.1.2, and 4.2.2.4). The available L1 bandwidth is limited by the
supported number of accesses per cycle. The peak bandwidth of the deeper levels in the memory hierarchy
is restricted by the width of the data paths between them [Hag+14]. Furthermore, the data paths cannot
always be fully utilized due to a limited number of outstanding requests. Figure 4.38 shows the achievable
bandwidths for a single reading core and how they are influenced by the width of the load instructions.
[Bar chart: bandwidth [GB/s] (axis range 0 to 160) for L1, L2, L3, and RAM on Xeon E5-2680 v3, Xeon E5-2670, Xeon X5670, Opteron 6274, and Opteron 2435; series: 64 bit loads (MOV), 128 bit loads (MOVDQA), 256 bit loads (VMOVDQA)]
Figure 4.38: Influence of the used ISA on the achievable read bandwidth: Using wider load instructions
is mostly beneficial, especially in the L1 cache. They should be at least as wide as the L1 ports.
[Plots: bandwidth [GB/s] over the number of used cores (1 to 12); series: Xeon E5-2680 v3, Xeon E5-2670, Xeon X5670, Opteron 6274, Opteron 2435]
(a) L3 read bandwidth (axis range 0 to 300 GB/s) (b) RAM read bandwidth (axis range 0 to 70 GB/s)
Figure 4.39: Scaling of the bandwidth with the number of cores: The L3 behavior strongly depends on
the architecture. All tested processors can saturate their memory interface without using all cores.
The largest benefit of using SIMD instructions can be observed for accesses in the L1 caches. Using
128 bit SSE instructions doubles the L1 read bandwidth compared to 64 bit loads on all test systems, as
they all have at least 128 bit wide L1 read ports. Using 256 bit loads does not improve the performance
if the L1 ports are only 128 bit wide. Therefore, only the Haswell system—which features 256 bit L1
ports—shows an additional doubling of the L1 performance. Performing memory accesses that do not
use the full width of the L1 ports wastes a lot of the available L1 bandwidth. This performance deficit
also propagates into the deeper levels of the memory hierarchy.
In addition to the per core limits there are also per chip limits for the bandwidth of shared resources
(see Section 4.1.1.3, 4.1.2.3, 4.1.3.3, 4.2.1.3, and 4.2.2.5). Figure 4.39 shows how the bandwidth of L3
and local main memory accesses scales with the number of used cores. The L3 caches of the Intel archi-
tectures with ring-based uncore design (Xeon E5-2670 and Xeon E5-2680 v3) perform remarkably well.
They combine a high single-core performance with close to linear scaling with the number of cores. The
memory bandwidth does not scale linearly with the number of cores on the examined systems. Furthermore,
none of the test systems reaches the theoretical peak bandwidth of the used DRAM technology.
Remote memory accesses are also restricted by the interconnection network between the processors.
The bandwidth of data transfers between the NUMA nodes is typically not sufficient to use the remote
memory controllers to their full capacity.
4.3.3 Influence of the Cache Coherence Protocol
Cache coherence protocols (see Section 2.3) contribute to the latency of memory accesses. They ensure
that an up-to-date copy of the requested data is delivered. However, the data is not necessarily delivered
from the closest possible location. For instance, shared cache lines can be forwarded by a distant copy
in the Forward or Owned state although a closer valid copy in state Shared exists. Furthermore, main
memory accesses are delayed until all snoop responses arrive. This is partially resolved by snoop filtering
mechanisms (see Section 2.3.2.2 and 2.3.2.3), which prevent snoop requests if the directory information
guarantees that there are no cached copies that need to be transformed or invalidated. For example,
the local DRAM latency is reduced from 80 to 74 ns in the Opteron 2435 system if the HT Assist is
enabled (see Section 4.1.1.1). However, the directory lookup also delays necessary snoop requests,
which increases the latency of remote cache accesses.
The coherence protocols also consume bandwidth on the interconnection network between the proces-
sors. However, the packets are small compared to the data packets that typically transfer whole cache
lines. Therefore, the generated traffic is negligible in systems with only two NUMA nodes. However,
the overhead would become significant in systems with four or more nodes due to the required broadcast
messages. This effect is also mitigated by the snoop filtering mechanisms as snoop requests can often
be filtered completely or restricted to a single node. Therefore, home snooping with distributed snoop
filters is preferred in more complex systems instead of source snooping, which provides the lowest access
latency. A remaining issue—that can be observed on all test systems—is writing to shared data, which
results in low sustainable bandwidth due to the required invalidations.
5 Performance Impact of the Memory Hierarchy
The achievable application performance is determined by hardware characteristics as well as application-
specific factors [MM04; SWC01; WWP09]. In Chapter 4 the characteristics of memory accesses in
NUMA systems have been analyzed. This chapter introduces a methodology that facilitates the attribu-
tion of the lost performance in parallel applications to the identified potential bottlenecks in the memory
hierarchy. The focus is on shared memory systems (i.e., the node performance). However, existing per-
formance analysis tools (see Section 2.6.3.4) can be used to extend the applicability of the performance
impact analysis to parallel applications on distributed memory systems.
The proposed approach uses micro-benchmarks to stress individual components in the memory subsys-
tem in order to identify meaningful hardware performance counters that indicate the resource utilization
as well as the number of cycles spent waiting for the memory hierarchy. This information can be used by
performance analysis tools to detect the components that limit the performance. The remainder of this
chapter is structured as follows: Section 5.1 presents a case study that demonstrates fundamental differ-
ences between multi-core and multi-processor scaling using the example of the SPEComp2001 bench-
mark suite [Asl+01; Mül+04]. Section 5.2 explains how the characteristics of memory accesses affect the
attainable processing speed. The workflow for the identification of meaningful hardware performance
counters is described in Section 5.3. Section 5.4 shows how the identified performance counters can be
used to visualize performance problems in parallel applications.
5.1 Case Study: SPEC OMPM2001 Scalability
The scalability of the SPEComp2001 benchmark suite [Asl+01; Mül+04] on multi-processor systems has
been studied in [Mol+11]. This section summarizes the major findings of this publication, which com-
prises the achieved speedup when using multiple cores of a multi-core processor as well as the scaling
with the number of used processors. Two quad-socket NUMA systems are used in the experiments—one
with four Intel Xeon X7560 processors and one with four AMD Opteron 6172 processors. The hardware
configurations are summarized in Table 5.1. Each Xeon X7560 processor contains eight cores, which are
based on the Nehalem micro-architecture [Int09b]; [Int14a, Section 2.4] (see Figure 4.7 in Section 4.1.2).
However, the 24 MiB L3 cache consists of eight 3 MiB slices and a ring-based on-chip interconnect—
comparable to Sandy Bridge-EP (see Section 4.1.3)—is used [Rus+09]. The cores of the AMD Opteron
6172 processors are based on AMD’s family 10h micro-architecture [Amd11, Appendix A] (see Fig-
ure 4.1 in Section 4.1.1). Each Opteron 6172 processor contains two six-core dies [Con+10]. The L3
Table 5.1: Hardware configuration of quad-socket systems, based on [Mol+11, Table 1]. References:
Intel [Rus+09; Int11]; [Int15c, Table 2], AMD [Con+10]; [Amd11, Appendix A]; [Amd10, Table 6]
System          Intel                               AMD
Processors      4×Intel Xeon X7560 (Nehalem-EX)     4×AMD Opteron 6172 (Magny Cours)
Cores           32 (Hyper-Threading disabled)       48
Core clock      2.26 GHz (Turbo Boost disabled)     2.1 GHz
Cache           2×32 KiB L1, 256 KiB L2 per core    2×64 KiB L1, 512 KiB L2 per core
                24 MiB L3 per processor             2×6 MiB L3 per processor
Interconnect    6.4 GT/s QuickPath Interconnect     6.4 GT/s HyperTransport
Memory          256 GiB DDR3-1066                   64 GiB DDR3-1333
configuration   4 SMI channels per socket           4 DDR3 channels per socket
Table 5.2: Aggregate read and write bandwidths per NUMA node in GB/s, based on [Mol+11, Table 4]:
The write bandwidths include the read from and the write back to the respective location.
Bandwidth    Intel Xeon X7560                             AMD Opteron 6172 (single die)
             1 core  2 cores  4 cores  6 cores  8 cores   1 core  2 cores  4 cores  6 cores
L3 read      19.2    38.3     76.6     114      152       7.8     15.2     24.1     30.8
L3 write     12.7    25.4     50.8     76.0     101       7.1     14.1     24.5     30.9
RAM read     5.5     11.4     20.6     24.5     25.7      6.1     10.7     13.2     13.2
RAM write    4.9     8.5      10.8     10.9     10.9      5.1     6.6      7.1      7.1
cache has a total capacity of 12 MiB—6 MiB per die. Both systems support four memory channels per
socket. The Intel system is optimized for a high DRAM capacity. Therefore, the scalable memory in-
terface (SMI) is used [Rus+09]; [Int11, Section 1]. In the AMD system the DDR3 modules are directly
connected to the processors’ memory controllers.
The differences in the composition of the L3 caches and the used memory technologies result in differ-
ent characteristics of memory accesses [Mol+11, Section 5.1 and 5.2]. Each SMI-channel in the Intel
system connects the processor with a scalable memory buffer that supports two DDR3 channels [Int11,
Figure 1-1]. The additional hop results in rather high latencies of 130.4 and 192.8 ns for local and remote
memory accesses, respectively [Mol+11, Table 2]. The AMD system does not use such an indirection.
Furthermore, a snoop filtering mechanism (HT Assist) is implemented [Con+10] (see Section 2.3.2.3).
This results in a local memory latency of only 65.7 ns [Mol+11, Table 2]. The remote access latencies
of up to 159 ns are also much lower than on the Intel system. The scaling of the L3 and memory band-
widths with the number of used cores is shown in Table 5.2. The L3 read and write bandwidths of the
Intel Xeon X7560 processors scale almost linearly from 19.2 and 12.7 GB/s for a single core to 152 and
101 GB/s using eight cores. On the AMD Opteron 6172 processors, the L3 bandwidths do not scale lin-
early with the number of used cores and the maximum of up to 30.9 GB/s per die is much lower than on
the Intel system. With around 26 GB/s, the achievable memory bandwidth per socket is almost identical
on both systems. Due to the lower performance for the single core case, the scaling with the number
of cores is better on the Intel system. Figure 5.1 compares the NUMA topologies of the test systems.
The QPI links in the Intel system support a much higher bandwidth than the HyperTransport links in the
AMD system [Mol+11, Section 5.2]. With up to 11.0 GB/s the available bandwidth for remote cache and
memory accesses in the Intel system is more than twice as high as the achievable bandwidth between the
dies of one Opteron 6172 processor (5.3 GB/s). The bandwidth provided by the half-width links between
the sockets in the AMD system is even limited to 2.1 GB/s.
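The scaling behavior described above can be recomputed directly from Table 5.2. The following minimal sketch divides each bandwidth by the respective single-core value. The dictionary layout and the function names are chosen for illustration only; the numbers are copied from the table.

```python
# Aggregate bandwidths per NUMA node in GB/s, copied from Table 5.2.
TABLE_5_2 = {
    "Intel Xeon X7560": {
        "cores": [1, 2, 4, 6, 8],
        "L3 read": [19.2, 38.3, 76.6, 114.0, 152.0],
        "L3 write": [12.7, 25.4, 50.8, 76.0, 101.0],
        "RAM read": [5.5, 11.4, 20.6, 24.5, 25.7],
        "RAM write": [4.9, 8.5, 10.8, 10.9, 10.9],
    },
    "AMD Opteron 6172 (single die)": {
        "cores": [1, 2, 4, 6],
        "L3 read": [7.8, 15.2, 24.1, 30.8],
        "L3 write": [7.1, 14.1, 24.5, 30.9],
        "RAM read": [6.1, 10.7, 13.2, 13.2],
        "RAM write": [5.1, 6.6, 7.1, 7.1],
    },
}


def relative_scaling(bandwidths):
    """Return each bandwidth relative to the single-core value (first entry)."""
    single_core = bandwidths[0]
    return [value / single_core for value in bandwidths]


def print_scaling(table):
    """Print the scaling factors of all table rows for every system."""
    for system, rows in table.items():
        cores = rows["cores"]
        print(system)
        for name, values in rows.items():
            if name == "cores":
                continue
            factors = relative_scaling(values)
            cells = ", ".join(f"{n} cores: {f:.2f}x" for n, f in zip(cores, factors))
            print(f"  {name:9s} {cells}")


print_scaling(TABLE_5_2)
```

With these values the memory read bandwidth of the Intel processor grows by a factor of about 4.7 between one and eight cores, whereas a single AMD die reaches a factor of about 2.2 with six cores.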
[Figure 5.1: two topology diagrams. (a) Quad-socket Intel Xeon X7560 [Int09a, Figure 6]: Node 0 to
Node 3 with 16 memory blocks (Mem) in total. (b) Quad-socket AMD Opteron 6172 [Con+10, Figure 3(b)]:
Node 0 to Node 7 with 16 memory blocks (Mem) in total.]
Figure 5.1: Topology of quad-socket test systems, based on [Mol+11, Fig. 1]: The Intel system (left)
consists of four fully connected NUMA nodes with eight cores per node. The AMD system (right)
contains eight NUMA nodes, which are not fully connected. Each Opteron 6172 processor consists
of two NUMA nodes with six cores per node.
5.1 Case Study: SPEC OMPM2001 Scalability 105
[Figure 5.2: two line plots of the speedup over the number of used cores. (a) Xeon X7560 (1 to 8 cores).
(b) Opteron 6172, one NUMA node (1 to 6 cores). Data series: peak GFLOPS, L3 bandwidth, RAM read,
RAM write, 310.wupwise, 312.swim, 314.mgrid, 316.applu, 318.galgel, 320.equake, 324.apsi, 326.gafort,
328.fma3d, 330.art, 332.ammp.]
Figure 5.2: SPEC OMPM2001, scaling with the number of used cores [Mol+11, Fig. 5]: The achieved
speedups of the individual benchmarks vary considerably. In many cases the performance increase is
much lower than expected on the basis of Amdahl’s Law (see [Asl+01, Table 2]).
5.1.1 Multi-core Scaling
Figure 5.2 shows the speedups that the SPEComp2001 benchmarks achieve on the examined multi-core
processors and compares them to the scaling of the shared resources. Except for 318.galgel, the benchmarks
have parallel fractions above 98% [Asl+01, Table 2]. If eight processors are used, this results in a maximal
speedup according to Amdahl’s Law (see Section 2.6.1) between 7.2 and 7.9 for the majority of the
benchmarks (318.galgel: 6.5). However, only three benchmarks—324.apsi, 330.art, and 332.ammp—
actually scale that well. The other benchmarks scale significantly worse than predicted by Amdahl’s
Law. Some benchmarks—e.g. 318.galgel and 320.equake—are known to scale poorly even if only a few
threads are used [FGD07]. However, up to eight threads there is no significant parallelization overhead in
the form of load imbalances, thread management, and synchronization [FGD07, Fig. 1]. Furthermore, the
benchmark 312.swim—which is known to scale well on multi-processor systems [Sai+03, Fig. 4]—also
scales very poorly. Thus, there has to be another reason for the suboptimal performance increase.
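The upper bounds quoted above follow from Amdahl’s Law. The sketch below evaluates the law for a given parallel fraction and also inverts it, which shows what parallel fraction corresponds to a given maximal speedup. The function names are illustrative, and the example inputs are the speedup limits from the text rather than the individual parallel fractions listed in [Asl+01, Table 2].

```python
def amdahl_speedup(parallel_fraction, processors):
    """Maximal fixed-size speedup for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def parallel_fraction_for(speedup, processors):
    """Invert Amdahl's Law: parallel fraction that yields the given speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)


# A parallel fraction of exactly 98% limits the speedup on eight processors.
print(f"p = 0.98 -> S(8) = {amdahl_speedup(0.98, 8):.2f}")

# Parallel fractions implied by the speedup limits mentioned in the text.
for limit in (6.5, 7.2, 7.9):
    fraction = parallel_fraction_for(limit, 8)
    print(f"S(8) = {limit} -> p = {fraction:.4f}")
```

A speedup limit of 7.2 on eight processors corresponds to a parallel fraction of roughly 98.4%, and the limit of 6.5 for 318.galgel to roughly 96.7%.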
The behavior of the individual benchmarks is very similar on both systems. All observed speedups lie
between the optimal speedup and the lower bound defined by the aggregated memory write bandwidth.
The value patterns indicate that the attainable speedup is restricted by the scaling of the shared resources.
However, no benchmark exactly reproduces the performance increase of a single resource. This is to be
expected as different program phases are probably limited by different resources. Thus, the totals over
the whole runtime show a mixture of effects caused by multiple limiting factors.
5.1.2 Multi-processor Scaling
The scaling with the number of used processors is depicted in Figure 5.3. In that case, the aggregated
L3 and memory bandwidths scale linearly with the number of processors. The multi-processor scal-
ing significantly differs from the multi-core scaling in some benchmarks—most obviously in 312.swim
and 316.applu. 312.swim does scale poorly with the number of used cores on the selected multi-core
processors. In contrast, the scaling with the number of processors is very good on both test systems.
The multi-processor scaling of 316.applu shows a super-linear speedup. In accordance with [FGD07]
this happens when the aggregated L3 cache size gets larger than 24 MiB. However, some benchmarks
do not benefit as much from using multiple processors as they do from using multiple cores within a
single processor. For instance, 332.ammp is among the best scaling benchmarks if the number of cores
is increased. In contrast, the multi-processor scaling is mediocre.
[Figure 5.3: two line plots of the speedup over the number of processors (1 to 4). (a) Quad-socket
Intel Xeon X7560, 8 cores per processor. (b) Quad-socket AMD Opteron 6172, 12 cores per processor.
Data series: peak GFLOPS, L3 bandwidth, RAM read, RAM write, 310.wupwise, 312.swim, 314.mgrid,
316.applu, 318.galgel, 320.equake, 324.apsi, 326.gafort, 328.fma3d, 330.art, 332.ammp.]
Figure 5.3: SPEC OMPM2001, scaling with the number of used processors [Mol+11, Fig. 6], based
on ICA3PP’11 conference presentation (see footnote 1), slide 15: The multi-processor scaling shows
significant differences between the two test systems. The parallel efficiency of the fully connected
Intel system (left) is noticeably better than on the AMD system (right).
Table 5.3 lists the parallel efficiencies for the multi-core and multi-processor case. Even the best scaling
benchmarks have parallel efficiencies below 0.9 if four processors are used instead of one. The suboptimal
scaling can partially be explained by the increasing parallelization overhead [FGD07]. However, the
limited bandwidth of the interconnection network between the processors as well as the higher latency
of remote accesses could also be limiting the performance. The average multi-processor efficiency
(excluding 316.applu) is 0.74 on the Intel system compared to 0.65 on the AMD system. Thus, the AMD
system’s lower interconnect bandwidth and/or its higher NUMA factors, 1.81 (other socket, one hop)
and 2.42 (other socket, two hops) compared to 1.48 (any other socket) on the Intel system, presumably
have an adverse effect on the application performance. This is most noticeable for 326.gafort
and 328.fma3d, which scale much better on the Intel system.
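The averages and the Intel NUMA factor stated above can be reproduced from the multi-processor efficiencies in Table 5.3 and the latencies given earlier in this section (130.4 ns local, 192.8 ns remote). The following sketch is only meant to make the arithmetic explicit; the variable and function names are illustrative.

```python
from statistics import mean

BENCHMARKS = [310, 312, 314, 316, 318, 320, 324, 326, 328, 330, 332]

# Multi-processor efficiencies (four processors instead of one), from Table 5.3.
MULTI_PROCESSOR_EFFICIENCY = {
    "Intel": [0.71, 0.83, 0.81, 1.47, 0.40, 0.42, 0.85, 0.89, 0.86, 0.87, 0.75],
    "AMD": [0.84, 0.85, 0.76, 0.83, 0.29, 0.42, 0.74, 0.54, 0.64, 0.75, 0.65],
}


def parallel_efficiency(speedup, units):
    """Parallel efficiency: achieved speedup divided by the number of units."""
    return speedup / units


def mean_efficiency(values, excluded=(316,)):
    """Average efficiency over all benchmarks that are not excluded."""
    kept = [v for number, v in zip(BENCHMARKS, values) if number not in excluded]
    return mean(kept)


def numa_factor(remote_latency_ns, local_latency_ns):
    """Ratio of remote to local memory access latency."""
    return remote_latency_ns / local_latency_ns


for system, values in MULTI_PROCESSOR_EFFICIENCY.items():
    print(f"{system}: average efficiency {mean_efficiency(values):.2f}")
print(f"Intel NUMA factor: {numa_factor(192.8, 130.4):.2f}")
```

316.applu is excluded by default because its super-linear speedup would distort the average.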
Table 5.3: SPEC OMPM2001, parallel efficiency: The multi-core efficiency illustrates the parallel
efficiency in case of using all cores within a NUMA node (Intel: 8, AMD: 6) instead of one core. The
multi-processor efficiency denotes the parallel efficiency of using four processors instead of one.

benchmark number                   310   312   314   316   318   320   324   326   328   330   332
multi-core efficiency       Intel  0.73  0.36  0.85  0.64  0.43  0.54  0.93  0.85  0.79  0.99  0.95
                            AMD    0.66  0.32  0.71  0.58  0.50  0.49  0.89  0.74  0.77  0.88  0.89
multi-processor efficiency  Intel  0.71  0.83  0.81  1.47  0.40  0.42  0.85  0.89  0.86  0.87  0.75
                            AMD    0.84  0.85  0.76  0.83  0.29  0.42  0.74  0.54  0.64  0.75  0.65
1 https://fusionforge.zih.tu-dresden.de/plugins/mediawiki/wiki/benchit/images/9/97/2011_Molka_ICA3PP.pdf
5.2 Possible Causes for Low Application Performance
Various potential bottlenecks that could affect the performance have been identified in Section 4.3. There
are two major categories for performance limitations due to memory accesses: latency bound and band-
width bound. In both cases the execution of instructions is stalled while waiting for the completion of
data transfers. In the latency bound case, the processing of further instructions is blocked until a certain
data transfer completes. This can be caused by data dependencies—i.e., subsequent instructions cannot
proceed to the execution stage because of missing operands. It can also happen that the maximal number
of instructions in flight is reached, which prevents the processing of further instructions until the block-
ing data transfer retires. In the bandwidth bound case, multiple independent load or store instructions
are processed concurrently. The processing speed is then limited by the availability of certain resources
(load/store buffers, entries in request queues, etc.) as well as the widths of the data paths.
5.2.1 Impact of the Memory Latency
As described in Section 4.3.1, the latency of memory accesses strongly depends on the location of the
data. The values range from a few cycles for local cache accesses to several hundred cycles for data that
is delivered from main memory or forwarded by a remote cache. Instructions that need the requested data
as input operands are stalled until it is delivered. However, processors that support out-of-order execution
can already process subsequent instructions that do not have a dependency on the missing data. Thus, the
impact that the latency of memory accesses has on the achieved performance depends on the availability
of independent instructions as well as the processor’s ability to extract the available instruction level
parallelism. Depending on the number of processed instructions per cycle, the supported number of in-
structions in flight results in a certain number of cycles that can be covered by the out-of-order execution,
which is referred to here as “out-of-order windows”. Figure 5.4 compares the memory latencies observed
in Chapter 4 to the out-of-order windows of the respective test systems. The relatively low latencies of
L1 and L2 accesses can typically be hidden by the out-of-order execution—if there is a sufficient number
of independent operations. However, the higher latencies of L3 and especially main memory accesses
can exceed the out-of-order window as it takes only a few tens of cycles to fill the reorder buffer (ROB)
if instructions are executed at the maximal rate of three or four instructions per cycle. Furthermore, even
short latencies can result in performance losses if there are not enough independent instructions to fill the
gaps. Fortunately, hardware prefetchers [Int14a, Section 3.7.2 – 3.7.3]; [Amd14c, Section 6.5] recognize
many common access patterns and request the required data early, which reduces these delays.
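The relation between the number of instructions in flight and the out-of-order window can be sketched as follows. The reorder buffer size of 168 entries and the latencies in the example are assumptions for illustration only; they are not the measured values from Chapter 4.

```python
def ooo_window_cycles(instructions_in_flight, ipc):
    """Cycles that can be bridged before the instruction window is full."""
    return instructions_in_flight / ipc


def latency_is_covered(latency_cycles, instructions_in_flight, ipc):
    """True if the latency fits into the out-of-order window."""
    return latency_cycles <= ooo_window_cycles(instructions_in_flight, ipc)


ASSUMED_ROB_ENTRIES = 168  # assumption: example size of a reorder buffer
ASSUMED_LATENCIES = {"L1": 4, "L3": 40, "RAM": 200}  # assumption: example values in cycles

for ipc in (1, 2, 3, 4):
    window = ooo_window_cycles(ASSUMED_ROB_ENTRIES, ipc)
    covered = [level for level, latency in ASSUMED_LATENCIES.items()
               if latency_is_covered(latency, ASSUMED_ROB_ENTRIES, ipc)]
    print(f"IPC={ipc}: window of {window:.0f} cycles covers {covered}")
```

The window shrinks as the execution rate rises: with the assumed 168 entries it spans 168 cycles at one instruction per cycle but only 42 cycles at four instructions per cycle, which matches the observation that the reorder buffer fills within a few tens of cycles at the maximal rate.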
[Figure 5.4: bar chart of cycles (0 to 250) for L1, L2, L3, and RAM accesses on Xeon E5-2680 v3,
Xeon E5-2670, Xeon X5670, Opteron 6274, and Opteron 2435. Data series: access latency, ooo-window
for IPC=1, IPC=2, IPC=3, and IPC=4.]
Figure 5.4: Comparison of access latency and out-of-order windows: Memory accesses can cause long
delays (see Section 4.3.1). Out-of-order execution (see Section 2.1.1.2) enables processors to partially
fill these gaps by processing subsequent instructions ahead of schedule. The micro-architectures
analyzed in Chapter 4 support different numbers of instructions in flight. The resulting out-of-order
windows (ooo-windows), which specify how many cycles can be bridged, also depend on the number
of instructions that are executed per cycle. The depicted limits assume that each x86 instruction is
translated into one fused micro-op (Intel) or one macro-op (AMD).
5.2.2 Bandwidth Limitations
The latencies of subsequent memory accesses do not necessarily add up. For instance, memory accesses
typically result in a transfer of a full cache line to the local L1 cache. If there are multiple accesses to the
same cache line and the cache line is not evicted or invalidated in between, the potentially costly transfer
of the data to the local L1 cache occurs only once. Furthermore, there can be multiple concurrently
outstanding requests. Load and store buffers enable the processors to continue executing instructions in
case of memory related stalls [HP06, Section 2.5 and 2.6], including the possibility to process further
memory accesses while others are still pending (see footnote 2). The supported number of concurrent requests is limited
at multiple points. The L1 caches can only handle a certain number of misses at any one time. The
capacity of several request queues within the memory hierarchy is restricted as well.
The achievable data rate depends on the location of the accessed data (see Section 4.3.2). The limited
bandwidth can also restrict the achievable performance [WWP09]. Figure 5.5 shows how the bandwidths
of the individual levels in the memory hierarchy limit the achievable double precision floating point per-
formance. The performance is measured with x86-membench’s throughput kernel (see Section 3.5.4).
The selected measurement routines either solely use register operands or each floating point operation
reads one of its two input operands from the specified location, which results in a flop per byte ratio of
0.125. The workloads that perform only one kind of operation are limited to 50% of the processor’s the-
oretical peak performance as they only use one of the two floating point units. Only the AVX version of
the multiply-add routine—which performs the same number of additions and multiplications—gets close
to the theoretical maximum. The achievable performance is cut in half if 128 bit wide SSE instructions
are used. If each operation loads an operand from the memory hierarchy, the x86-membench kernels be-
come bandwidth bound. The performance is then limited to 50% of the peak performance even if all data
is present in the L1 cache. In that case the L1’s two 128 bit ports—which can be fully utilized with one
AVX instruction per cycle—are the limiting factor. The achievable performance drops further if lower
levels of the memory hierarchy are involved. The selected workloads are limited to around 25%, 15%,
and 3.3% if the data is read from the L2 cache, L3 cache or main memory, respectively. The results show
that the available bandwidth can severely limit the achievable computational performance. However, the
extent of the performance degradation depends on the characteristics of the application [Tiw+14].
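The limits discussed above follow from a simple bound: the attainable performance is the minimum of the peak performance and the product of bandwidth and flop per byte ratio. The sketch below applies this bound to the Xeon E5-2670 system, using the peak performance parameters of the test system (8 flop/cycle, 2.6 GHz, 16 cores) and the L1 limit of two 128 bit ports per core. The function names are illustrative, and the main memory bandwidth printed at the end is merely the value implied by the rounded 3.3% figure, not a measurement.

```python
FLOP_PER_BYTE = 0.125  # one double precision operand (8 bytes) per operation


def peak_gflops(flop_per_cycle, frequency_ghz, cores):
    """Theoretical peak performance in GFLOPS."""
    return flop_per_cycle * frequency_ghz * cores


def attainable_gflops(peak, bandwidth_gbs, flop_per_byte=FLOP_PER_BYTE):
    """Performance bound given by the peak and the available bandwidth."""
    return min(peak, bandwidth_gbs * flop_per_byte)


def implied_bandwidth_gbs(fraction_of_peak, peak, flop_per_byte=FLOP_PER_BYTE):
    """Bandwidth that corresponds to a given fraction of the peak performance."""
    return fraction_of_peak * peak / flop_per_byte


peak = peak_gflops(flop_per_cycle=8, frequency_ghz=2.6, cores=16)

# Two 128 bit L1 ports deliver 32 bytes per cycle and core.
l1_bandwidth = 32 * 2.6 * 16  # GB/s
l1_bound = attainable_gflops(peak, l1_bandwidth)

print(f"peak: {peak:.1f} GFLOPS")
print(f"L1 bound: {l1_bound:.1f} GFLOPS ({l1_bound / peak:.0%} of peak)")
print(f"bandwidth implied by 3.3% of peak: {implied_bandwidth_gbs(0.033, peak):.1f} GB/s")
```

The L1 bound evaluates to half of the peak performance, which agrees with the 50% limit observed for data in the L1 cache.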
[Figure 5.5: bar chart of the achieved GFLOPS (0 to 350) for the operations add, mul, and mad with
operands located in registers only, L1 cache, L2 cache, L3 cache, and local memory. Data series:
128 bit SSE (packed double), 256 bit AVX (packed double), peak performance.]
Figure 5.5: Xeon E5-2670, throughput of arithmetic instructions depending on data location: The
test system (see Section 4.1.3) has a peak performance of 332.8 GFLOPS (8 flop/cycle, 2.6 GHz, 16
cores), which can only be reached if each of the two pipelines (“FP ADD” and “FP MUL”) performs
four operations per cycle (AVX packed double). The “add” and “mul” benchmarks are limited to
50% of the peak performance as they only use one pipeline. The “mad” (multiply-add) benchmark
makes use of both units. However, the peak performance is only reached in the register only case. In
the remaining cases each instruction requests one of its operands from the memory hierarchy. This
significantly reduces the achievable performance.
2 Certain restrictions apply with respect to the perceived ordering of memory accesses [Int14b, Volume 3, Section 8.2.2].
5.2.3 Saturation of Shared Resources
As observed in Section 5.1, multi-core scaling can be worse than multi-processor scaling, which is presumably
caused by the shared resources in multi-core processors. However, the scalability metrics introduced
in Section 2.6.1 assume perfect scaling of the parallel fraction and thus do not consider the additional
loss of efficiency that can be caused by the limited capacity of shared resources. An example of suboptimal
scaling due to the saturation of shared resources is depicted in Figure 5.6. The bandwidth bound part
of the parallel fraction does not benefit as much from using more cores as the compute bound portion.
Therefore, the achievable speedup is significantly lower than predicted by Amdahl’s Law. Sections 5.2.3.1
and 5.2.3.2 describe how the fixed-size and fixed-time speedup metrics can be generalized in order to
incorporate such effects.
5.2.3.1 Considering Shared Resources in Amdahl’s Law
In its original form Amdahl’s Law does not consider limitations due to shared resources in multi-core
processors. However, it can be generalized to consider all sorts of enhancements [DAS12, Section 1.4.2].
Let RScomp(N) and RSmem(N) be the resource scaling of the computational performance and the memory
bandwidth, respectively. These functions describe the relative performance using N cores compared
to using one core (see ’peak GFLOPS’ and ’RAM bandwidth’ in Figure 5.6a). Using Equation (2.3)
(see Section 2.6.1) one arrives at Equation (5.1) for applications that only depend on one resource.
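\begin{equation}
S(N) =
\begin{cases}
\dfrac{T_1(w)}{T_N(w)}
  = \dfrac{T_1(w)}{f_s \times T_1(w) + \dfrac{(1-f_s) \times T_1(w)}{RS_{comp}(N)}}
  = \dfrac{1}{f_s + \dfrac{1-f_s}{RS_{comp}(N)}}
  & \text{for compute bound applications} \\[2ex]
\dfrac{T_1(w)}{T_N(w)}
  = \dfrac{T_1(w)}{f_s \times T_1(w) + \dfrac{(1-f_s) \times T_1(w)}{RS_{mem}(N)}}
  = \dfrac{1}{f_s + \dfrac{1-f_s}{RS_{mem}(N)}}
  & \text{for bandwidth bound applications}
\end{cases}
\tag{5.1}
\end{equation}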
For simplification it is assumed that the computational performance scales linearly with the number of
cores (see footnote 3), i.e., RScomp(N) = N. In order to consider influences of multiple resources the parallelizable
part of the workload can be split into multiple non-overlapping sections, e.g., a compute bound (wc) and
a bandwidth bound part (wb). This leads to Equation (5.2) where 1 − fs = fp = fc + fb with fc being
the compute bound and fb being the bandwidth bound fraction of the application.
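\begin{equation}
S(N) = \frac{T_1(w_s + w_c + w_b)}{T_N(w_s + w_c + w_b)}
     = \frac{T_1(w)}{f_s \times T_1(w) + \dfrac{f_c \times T_1(w)}{N} + \dfrac{f_b \times T_1(w)}{RS_{mem}(N)}}
     = \frac{1}{f_s + \dfrac{f_c}{N} + \dfrac{f_b}{RS_{mem}(N)}}
\tag{5.2}
\end{equation}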
[Figure 5.6: two plots over the number of cores (N = 1 to 6). (a) resource scaling: relative performance
of peak GFLOPS, L3 bandwidth, and RAM bandwidth. (b) parallel runtime and speedup: runtime (T_N)
split into serial fraction, FPU bound, L3 bound, and RAM bound parts, together with the speedup S(N)
according to Amdahl's Law and the actual speedup.]
Figure 5.6: Effect of resource scaling in multi-core processors on fixed-size speedup: The figure on the
left shows the Opteron 6172 processor’s resource scaling based on the values of the ’peak GFLOPS’,
’L3 bandwidth’, and ’RAM read’ data series shown in Figure 5.2b. The figure on the right shows how
the resources affect the parallel runtime and the achievable speedup. The example assumes a serial
fraction of 10% and that the parallel fraction is limited by the FPU performance, the L3 bandwidth,
and the RAM bandwidth in equal shares if a single core is used.
3 This is not necessarily the case as the maximal turbo frequency can depend on the number of active cores [Int15e, Table 2].
For the general case of n resources (FPU, L3 cache, RAM, QPI, etc.) Equation (5.2) can be rewritten as
shown in Equation (5.3) where RSi(N) defines the scaling of resource i, wi is the part of the workload
that is limited by resource i, and fi is the corresponding fraction of the workload
($f_s + \sum_{i=1}^{n} f_i = 1$).
\begin{equation}
S(N) = \frac{T_1\left(w_s + \sum_{i=1}^{n} w_i\right)}{T_N\left(w_s + \sum_{i=1}^{n} w_i\right)}
     = \frac{1}{f_s + \sum_{i=1}^{n} \dfrac{f_i}{RS_i(N)}}
\tag{5.3}
\end{equation}
In contrast to the universal scalability law [GSP11], a black-box modeling approach that uses regression
analysis to determine the model parameters, Equation (5.3) differentiates the influences of different
resources. Therefore, measurable system characteristics can be used to model the application performance
on architectures that show non-linear scaling of shared resources. This is, for instance, the case for
the L3 and the main memory bandwidth of the Intel Xeon X5670 and AMD Opteron 6274 test systems
(see Section 4.1.2.3 and 4.2.2.5).
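Equation (5.3) translates directly into a few lines of code. The following sketch evaluates it for six cores of one Opteron 6172 die with a serial fraction of 10% and equal shares for FPU, L3, and RAM bound work. As an assumption, the resource scaling factors are taken from the read bandwidths in Table 5.2 rather than from the data series in Figure 5.2b, so the result may deviate slightly from the curve in Figure 5.6.

```python
def generalized_amdahl_speedup(serial_fraction, fractions, scalings):
    """Fixed-size speedup with per-resource scaling, see Equation (5.3)."""
    if len(fractions) != len(scalings):
        raise ValueError("one scaling factor per workload fraction is required")
    if abs(serial_fraction + sum(fractions) - 1.0) > 1e-9:
        raise ValueError("the fractions have to sum up to one")
    parallel_time = sum(f / rs for f, rs in zip(fractions, scalings))
    return 1.0 / (serial_fraction + parallel_time)


N = 6
RS_FPU = float(N)    # computational performance assumed to scale linearly
RS_L3 = 30.8 / 7.8   # L3 read bandwidth, six cores vs. one core (Table 5.2)
RS_RAM = 13.2 / 6.1  # RAM read bandwidth, six cores vs. one core (Table 5.2)

limited = generalized_amdahl_speedup(0.1, [0.3, 0.3, 0.3], [RS_FPU, RS_L3, RS_RAM])
classic = generalized_amdahl_speedup(0.1, [0.9], [float(N)])

print(f"Amdahl's Law:          S(6) = {classic:.2f}")
print(f"with shared resources: S(6) = {limited:.2f}")
```

If all fractions scale with N, the function reduces to the original form of Amdahl’s Law.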
5.2.3.2 Fixed-time Speedup Under Resource Constraints
Gustafson’s argument that parallel systems are typically used to solve larger problems than single pro-
cessor systems [Gus88] also applies to multi-core processors. Thus, the fixed-size speedup can be overly
pessimistic. However, scaling the problem size linearly with the number of cores can significantly increase
the runtime due to the limited scaling of shared resources. Therefore, an application specific
average scaling factor RS(N) that keeps the runtime constant needs to be determined. Equation (5.4)
follows from the fixed-time assumption.
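\begin{equation}
\begin{aligned}
T_1(w_s + w_p) &= T_N(w_s + RS(N) \times w_p) \\
T_1(w_s) + T_1(w_p) &= T_1(w_s) + T_N(RS(N) \times w_p) \\
T_1(w_p) &= T_N(RS(N) \times w_p)
\end{aligned}
\tag{5.4}
\end{equation}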
Using wi, fi, and RSi(N) as defined for Equation (5.3) and perf(1) as a measure for the performance
of a single core, one arrives at Equation (5.5), which can be solved for RS as shown in Equation (5.6).
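\begin{equation}
\frac{w_p}{perf(1)} = \sum_{i=1}^{n} \frac{RS(N) \times w_i}{RS_i(N) \times perf(1)}
\tag{5.5}
\end{equation}
\begin{equation}
RS(N) = \frac{w_p}{\sum_{i=1}^{n} \dfrac{w_i}{RS_i(N)}}
      = \frac{1 - f_s}{\sum_{i=1}^{n} \dfrac{f_i}{RS_i(N)}}
\tag{5.6}
\end{equation}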
Figure 5.7 shows some examples of RS(N) for the case that the scaling of the parallel fraction is de-
termined by the FPU performance, the L3 bandwidth, and the RAM bandwidth to various degrees. If a
large portion of the program is limited by a resource that does not scale well with the number of cores
then the workload size cannot be increased much without increasing the runtime. Figure 5.8 shows this
for the case ’33.3% FPU, 33.3% L3, 33.3% RAM’. For simplicity it is assumed that the scaling of the
workload size does not change the portions of the workload that are limited by a certain resource
(see footnote 4). Under this assumption the workload can only be scaled to 3.48 times the size of the
sequential version if six cores are used.
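The average scaling factor from Equation (5.6) can be evaluated as sketched below. As an assumption, the resource scaling is derived from the Opteron 6172 read bandwidths in Table 5.2, which are only available for 1, 2, 4, and 6 cores. With these inputs the equal-share case yields a factor of approximately 3.4 for six cores, close to but not identical with the 3.48 stated above, which is based on the data series shown in the figures.

```python
def average_resource_scaling(serial_fraction, fractions, scalings):
    """Workload scaling factor RS(N) that keeps the runtime constant (Eq. 5.6)."""
    weighted = sum(f / rs for f, rs in zip(fractions, scalings))
    return (1.0 - serial_fraction) / weighted


# Opteron 6172 (single die) values in GB/s, copied from Table 5.2.
CORES = [1, 2, 4, 6]
L3_READ = [7.8, 15.2, 24.1, 30.8]
RAM_READ = [6.1, 10.7, 13.2, 13.2]

SERIAL_FRACTION = 0.1
EQUAL_SHARES = [0.3, 0.3, 0.3]  # FPU, L3, and RAM bound fractions

rs_by_cores = {}
for n, l3, ram in zip(CORES, L3_READ, RAM_READ):
    scalings = [float(n), l3 / L3_READ[0], ram / RAM_READ[0]]
    rs_by_cores[n] = average_resource_scaling(SERIAL_FRACTION, EQUAL_SHARES, scalings)

for n, rs in rs_by_cores.items():
    print(f"N = {n}: RS(N) = {rs:.2f}")
```

For a purely FPU bound parallel fraction the function returns N, i.e., the workload can be scaled linearly with the number of cores.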
4 This is not necessarily the case as the workload scaling could also change the flop per byte ratio, which would change
the proportions. The restriction can be lifted by replacing the constants wi and fi in Equation (5.6) with functions of the
workload size (RS(N) × wp). However, this would require a model that predicts how the resource dependencies change
for increasing workload sizes, which is beyond the scope of this work. It would also complicate the solution of the formula.
[Figure 5.7: line plot of the average resource scaling (0 to 7) over the number of cores (N = 1 to 6).
Data series: 100% FPU bound; 100% L3 bound; 100% RAM bound; 80% FPU, 10% L3, 10% RAM;
50% FPU, 25% L3, 25% RAM; 33.3% FPU, 33.3% L3, 33.3% RAM; 25% FPU, 25% L3, 50% RAM;
16.7% FPU, 16.7% L3, 66.7% RAM; 10% FPU, 10% L3, 80% RAM.]
Figure 5.7: Average resource scaling example considering three resources: The possible scaling factor
RS, which keeps the runtime constant, depends on the scaling of the shared resources and the percentage
of the serial runtime that depends on a certain resource. The values for the resource scaling
are based on the Opteron 6172 measurements (see Figure 5.6a).
The fixed-time speedup can be derived from RS as shown in Equation (2.4) (see Section 2.6.1), which
leads to Equation (5.7).
\begin{equation}
S(N) = \frac{T_1(w_s + RS(N) \times w_p)}{T_1(w_s + w_p)}
     = RS(N) + (1 - RS(N)) \times f_s
\tag{5.7}
\end{equation}
Figure 5.9 compares the impact of shared resources on the fixed-size and fixed-time speedup. In the
bandwidth limited scenarios, the fixed-time speedup is not much higher than the fixed-size speedup.
In both cases, the achievable speedups are very low in the RAM-bound cases, which shows that the
saturation of shared resources can be a severe bottleneck in multi-core processors. A limited memory
bandwidth is a common characteristic of contemporary processors (see Figure 4.39b). Therefore, Sun
and Chen’s conclusion [SC10] that in principle there are no multi-core specific scalability problems
is, as the authors note themselves, not valid for existing hardware. It is also important to note that the way
the different workloads fan out between the limits defined by the best and worst scaling resource is
reminiscent of the behavior of actual applications shown in Figure 5.2. This substantiates the assumption
that the observed behavior is in fact caused by the characteristics of the shared resources.
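The comparison of both metrics can be retraced numerically with Equations (5.3), (5.6), and (5.7). The sketch below evaluates three of the scenarios for six cores and a serial fraction of 10%. As an assumption, the resource scaling factors are again derived from the Opteron 6172 read bandwidths in Table 5.2, so the values approximate the curves in Figure 5.9 rather than reproduce them exactly.

```python
def fixed_size_speedup(serial_fraction, fractions, scalings):
    """Fixed-size speedup under resource constraints (Equation 5.3)."""
    weighted = sum(f / rs for f, rs in zip(fractions, scalings))
    return 1.0 / (serial_fraction + weighted)


def fixed_time_speedup(serial_fraction, fractions, scalings):
    """Fixed-time speedup under resource constraints (Equations 5.6 and 5.7)."""
    weighted = sum(f / rs for f, rs in zip(fractions, scalings))
    rs = (1.0 - serial_fraction) / weighted
    return rs + (1.0 - rs) * serial_fraction


# Scaling of FPU, L3, and RAM for six cores (assumption: Table 5.2 values).
SCALINGS_N6 = [6.0, 30.8 / 7.8, 13.2 / 6.1]

# Fractions of the serial runtime bound by FPU, L3, and RAM (serial part: 0.1).
SCENARIOS = {
    "100% FPU bound": [0.9, 0.0, 0.0],
    "33.3% FPU, 33.3% L3, 33.3% RAM": [0.3, 0.3, 0.3],
    "100% RAM bound": [0.0, 0.0, 0.9],
}

results = {}
for name, fractions in SCENARIOS.items():
    size = fixed_size_speedup(0.1, fractions, SCALINGS_N6)
    time = fixed_time_speedup(0.1, fractions, SCALINGS_N6)
    results[name] = (size, time)
    print(f"{name}: fixed-size {size:.2f}, fixed-time {time:.2f}")
```

In the FPU bound scenario the fixed-time speedup (5.5) clearly exceeds the fixed-size speedup (4.0), whereas both values stay close to 2 in the RAM bound scenario.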
[Figure 5.8: two plots over the number of cores (N = 1 to 6). (a) required workload scaling to meet
fixed-time criterion: workload size (0 to 40). (b) runtime distribution: runtime (T_N, 0 to 12) split
into serial fraction, FPU bound, L3 bound, and RAM bound parts.]
Figure 5.8: Workload scaling and resulting runtime distribution: In this example the parallel section is
divided into three parts. Each part is limited by a certain resource and accounts for one third of the
serial runtime. For each number of cores, all workload components are scaled with the same factor
RS(N) (see Figure 5.7). The parallel runtime stays constant. However, the fraction of time that is
consumed by the program phases that are limited by the memory bandwidth increases significantly.
[Figure 5.9: two line plots of the speedup (1 to 6) over the number of used cores (1 to 6).
(a) fixed-size speedup, fs = 0.1. (b) fixed-time speedup, fs = 0.1. Data series (boundedness of
parallel fraction): 100% FPU bound; 100% L3 bound; 100% RAM bound; 80% FPU, 10% L3, 10% RAM;
50% FPU, 25% L3, 25% RAM; 33.3% FPU, 33.3% L3, 33.3% RAM; 25% FPU, 25% L3, 50% RAM;
16.7% FPU, 16.7% L3, 66.7% RAM; 10% FPU, 10% L3, 80% RAM.]
Figure 5.9: Multi-core speedup under resource constraints: The fixed-size as well as fixed-time speedup
are strongly influenced by the saturation of shared resources. In the FPU-bound scenarios, which
allow a significant increase of the workload size, the fixed-time speedup is significantly higher than
the fixed-size speedup. However, there is little difference between the two if the workload size cannot
be increased much due to the saturation of shared resources.
5.3 Identification of Meaningful Hardware Performance Counters
Contemporary x86 processors include performance monitoring units [Amd13b, Section 2.7]; [Int14b,
Volume 3, Chapter 18], which can be used to count numerous events (see Section 2.6.3.3). Such hard-
ware performance counters are a useful tool for detecting performance problems related to the mem-
ory subsystem [Era08; Lev09; THW13; Yas14]; [Int14a, Appendix B.3]. Therefore, many performance
analysis tools are able to record them in addition to information about the application behavior (see Sec-
tion 2.6.3.4). However, it is not always obvious what exactly is counted by the different events, e.g.,
if cache misses are counted per access or per cache line. Thus, it is not trivial to derive the degree of
capacity utilization of a certain component from the associated performance counters, which makes it
difficult to judge if the observed event rates are significant. In this section, it is evaluated if hardware
performance counters can be used to measure the capacity utilization within the memory hierarchy and
determine to which extent memory accesses affect the performance.
Hardware performance counters are architecture specific, i.e., the evaluation of their significance has to
be repeated for every new micro-architecture. Therefore, a portable approach based on a small selection
of x86-membench kernels is proposed and its applicability is demonstrated on two test systems—one with
Intel Xeon E5-2670 (see Section 4.1.3) and one with Xeon E5-2680 v3 processors (see Section 4.2.1).
X86-membench (see Chapter 3) is particularly suitable to stress individual components in the memory
subsystem. The constant workloads in combination with the integrated hardware monitoring (see Sec-
tion 3.5.5) are used to identify performance counters that provide good estimates for the utilization of
the individual components. However, a high utilization of the memory hierarchy does not necessarily
result in low performance. As long as the gaps between the memory accesses can be filled with useful
work, they are not an issue. Therefore, counters that indicate memory related waiting times, i.e., periods
without any useful computation, are identified as well.
5.3.1 Indicators for Bandwidth Utilization
The correlation of hardware performance counters with the utilization of the memory hierarchy is evalu-
ated using the load and store variants of x86-membench’s throughput kernel (see Table 3.1). As described
in Section 4.3.2 there are per core limits for the achievable bandwidths for each level in the memory hi-
erarchy as well as limited aggregate bandwidths for the shared resources. Therefore, the analysis is
performed once using a single core and once using all cores that share a resource in order to determine
the respective upper bounds for the associated hardware performance counters.
5.3.1.1 Bandwidth Usage of a Single Core
In order to determine suitable performance counters for the bandwidth that is consumed per core the
throughput benchmark has to be configured as follows:
• A single CPU (CPU 0) is entered in BENCHIT_KERNEL_CPU_LIST.
• BENCHIT_KERNEL_MEM_BIND contains a single CPU from another socket that is directly
connected to CPU 0.
• The minimal data set size is set to 50% of the L1 cache size or lower.
• The maximal data set size is set to at least ten times the LLC size.
• The allocation method (BENCHIT_KERNEL_ALLOC) is first set to local (L) in order to evaluate
the local memory hierarchy. It is changed to bind-to-core (B) to examine remote memory accesses.
• BENCHIT_KERNEL_INSTRUCTION is set to load or store with different widths as required.
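The configuration rules above can be summarized in a small helper that derives the settings from the cache sizes of the system under test. This is a hypothetical sketch: only the BENCHIT_KERNEL_* names and the allocation codes L and B are taken from the list above. The keys for the data set sizes, the function name, the CPU number used for the memory binding, and the value passed for the instruction are placeholders, and the cache sizes in the example are assumed values.

```python
def throughput_config(l1_bytes, llc_bytes, remote_cpu, instruction, remote=False):
    """Collect the single-core throughput settings described in the list above."""
    return {
        "BENCHIT_KERNEL_CPU_LIST": "0",              # a single CPU (CPU 0)
        "BENCHIT_KERNEL_MEM_BIND": str(remote_cpu),  # CPU on a directly connected socket
        "BENCHIT_KERNEL_ALLOC": "B" if remote else "L",
        "BENCHIT_KERNEL_INSTRUCTION": instruction,
        # Hypothetical keys: the actual parameter names are not given in the text.
        "min_data_set_bytes": l1_bytes // 2,   # 50% of the L1 cache size
        "max_data_set_bytes": 10 * llc_bytes,  # ten times the LLC size
    }


KIB = 1024
MIB = 1024 * KIB

# Assumed example: 32 KiB L1 data cache, 20 MiB last level cache, CPU 8 on another socket.
local_run = throughput_config(32 * KIB, 20 * MIB, remote_cpu=8, instruction="load")
remote_run = throughput_config(32 * KIB, 20 * MIB, remote_cpu=8, instruction="load",
                               remote=True)

for key, value in local_run.items():
    print(f"{key} = {value}")
```

The two dictionaries differ only in the allocation method, which mirrors the switch from the local evaluation to the remote memory measurement.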
The workflow to identify meaningful counters comprises the following steps:
1. Identify events that show high event rates if data is transferred between the core and the L1. This
is the case if the data set fits into the L1 cache. Load and store instructions with multiple widths
are used in this step to test if the usage of SIMD instructions can be recognized.
2. For each further level in the memory hierarchy: identify events that show high event rates if data is
located there. This step is performed using the widest load and store instructions that are available.
It is important to identify events that consider the read for ownership requests (RFOs) in case of
writes in order to capture all transfers to the L1 cache.
3. Change allocation method to bind-to-core (B) and identify events that show high event rates if data
is located in remote memory.
4. Determine the overlap between the identified counters and, if possible, use additional counters to
compensate for it in order to measure the number of accesses for every location.
5. Derive upper bounds for the event rates of the identified counters based on the measured perfor-
mance and the observed number of events per memory access.
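Step 5 of the workflow amounts to a simple conversion from a measured bandwidth to a maximal event rate. The sketch below illustrates the calculation. The bandwidth of 10 GB/s and the observed event rate are made-up example values, and the function names are illustrative.

```python
def max_event_rate(bandwidth_bytes_per_s, bytes_per_access, events_per_access):
    """Upper bound for the event rate at the measured peak bandwidth."""
    accesses_per_s = bandwidth_bytes_per_s / bytes_per_access
    return accesses_per_s * events_per_access


def utilization(observed_event_rate, upper_bound):
    """Fraction of the upper bound that an application reaches."""
    return observed_event_rate / upper_bound


# Made-up example: 10 GB/s measured with 256 bit (32 byte) loads,
# one event for every second load instruction.
bound = max_event_rate(10e9, bytes_per_access=32, events_per_access=0.5)
print(f"upper bound: {bound:.0f} events/s")

# Made-up application reading of 50 million events per second.
print(f"utilization: {utilization(50e6, bound):.0%}")
```

An event rate observed in an application can then be judged relative to this bound instead of as an absolute number.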
In the following, this approach is demonstrated on the Intel Xeon E5-2670 system examined
in Section 4.1.3. Since uncore counters record aggregate performance data that can hardly be attributed
to actions of a single core, the evaluation is restricted to the core counters. However, there are still
numerous memory related events.
Figure 5.10 shows the recorded number of loads from the L1 data cache as well as loads that miss in
the L1 cache for the read bandwidth measurement. The event perf::L1-DCACHE-LOADS counts all
load instructions. This can be used to determine how many times the load ports are used. However,
the amount of the transferred data cannot be derived from this information as no distinction is made
between loads of different widths. As long as the data set fits into the L1 cache, the number of reported
perf::L1-DCACHE-LOAD-MISSES is close to zero. For larger data sets the number of events per load
instruction depends on their width. One out of eight 64 bit loads generates a miss event. 128 bit loads
cause a miss on every fourth access (not depicted). In case of 256 bit wide loads every second load
instruction misses the L1 cache. This means that the perf::L1-DCACHE-LOAD-MISSES increase by one
for every accessed cache line. Writes to deeper cache levels also generate one load miss per cache line,
i.e., RFO requests are included in the measurement. Therefore, the perf::L1-DCACHE-LOAD-MISSES
provide a good estimate for the number of cache lines that are brought into the L1 cache. Analogous
events for write accesses are available as well. The event perf::L1-DCACHE-STORES counts all write
accesses except non-temporal stores. In addition to the perf::L1-DCACHE-LOAD-MISSES, writes also
cause perf::L1-DCACHE-STORE-MISSES. They presumably count cache lines that are written back to
the memory hierarchy since RFOs are already included in the perf::L1-DCACHE-LOAD-MISSES. The
perf::PERF_COUNT_HW_CACHE_L1D:READ and perf::PERF_COUNT_HW_CACHE_L1D:WRITE
events can also be used to count the cache lines that are requested by (:READ) or written back from
(:WRITE) the data cache unit.
(a) 64 bit loads (b) 256 bit loads
Figure 5.10: Xeon E5-2670, read bandwidth and perf::L1-DCACHE-LOADS / -LOAD-MISSES: The
L1 accesses and misses are counted per instruction. Loads that miss in the L1 cache are counted
as L1-DCACHE-LOADS nevertheless. Therefore, hits in the L1 cache can be derived by subtracting
the LOAD-MISSES from the LOADS. Only one miss event is generated per cache line. Thus, the
AVX version causes 50% misses (two accesses per cache line) while the scalar version causes 12.5%
misses (eight accesses per cache line).
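The relation between load width and L1 miss events described above can be illustrated with a short sketch. It assumes 64 byte cache lines and purely sequential accesses to a data set that does not fit into the L1 cache. The function names are hypothetical and do not belong to any tool used in this work.

```python
CACHE_LINE_BYTES = 64  # assumption: cache line size of the examined processor


def expected_miss_ratio(load_width_bits):
    """Expected L1-DCACHE-LOAD-MISSES per load instruction.

    For sequential accesses to data outside the L1 cache, one miss event
    is generated per cache line, independent of the load width.
    """
    loads_per_line = CACHE_LINE_BYTES // (load_width_bits // 8)
    return 1.0 / loads_per_line


def l1_hits(loads, load_misses):
    """Missing loads are still counted as loads, so hits = loads - misses."""
    return loads - load_misses


for width in (64, 128, 256):
    print(width, "bit loads:", expected_miss_ratio(width), "misses per load")
```

Under these assumptions the sketch reproduces the ratios named in the text: one miss per eight 64 bit loads, per four 128 bit loads, and per two 256 bit loads.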
The remaining challenge is to find events that differentiate accesses to different levels in the mem-
ory hierarchy and distinguish local from remote memory accesses. As depicted in Figure 5.11 the
MEM_LOAD_UOPS_RETIRED events are not suitable to determine the origin of the data in case of
sequential loads. Even for main memory accesses a certain number of L1 hits is to be expected as the
benchmark performs two loads per cache line. However, according to this counter there are no accesses
to the deeper levels of the memory hierarchy at all—except for some negligible noise in the performance
counter readings. Apparently, the hardware prefetchers conceal the L2, L3, and main memory accesses
in the default system configuration. This behavior does not change if multiple cores perform concurrent
memory accesses. Furthermore, read for ownership (RFO) requests—that transfer data to the L1 cache
as well—are not considered (not depicted).
Figure 5.11: Xeon E5-2670, read bandwidth and MEM_LOAD_UOPS_RETIRED events (MLUR): This
counter can determine the origin of the accessed data. However, it does not observe data transfers
from deeper levels in the memory hierarchy that are requested in time by the prefetchers and found
close to the core when the actual load operation takes place. In that case it mostly records hits in
the L1 cache as well as in the line fill buffers (LFB) that handle outstanding L1 misses.
Figure 5.12 depicts performance counter readings that dissect the data transferred to the L1 cache ac-
cording to the data’s prior location within the memory hierarchy. If data is delivered from the L2 cache,
loads cause L2_TRANS:LOAD and L2_RQSTS:ALL_DEMAND_RD_HIT events while stores result in
L2_TRANS:RFO and L2_RQSTS:RFO_HITS. The event rates are equivalent to the number of lines re-
quested by the L1 cache as reported by perf::PERF_COUNT_HW_CACHE_L1D:READ, i.e., one event
per accessed cache line is recorded. The benchmarks used in these experiments use 256 bit loads and
stores, thus one event is recorded for every second access. For data sets that exceed the L2 capacity the
values reported by L2_TRANS and L2_RQSTS differ significantly. The sub-events of L2_TRANS also
capture accesses to deeper levels of the memory hierarchy. In contrast, the L2_RQSTS readings are con-
fined to data delivered by the L2. The values reported by L2_TRANS:LOAD and L2_TRANS:RFO for L3
and main memory accesses are slightly higher than the corresponding number of misses in the L1 cache
reported by perf::PERF_COUNT_HW_CACHE_L1D:READ. The difference is presumably caused by
additional prefetcher requests.
The OFFCORE_RESPONSE_0 and OFFCORE_RESPONSE_1 events observe requests that miss in the
L2 cache and provide numerous filters to isolate data transfers from a certain location. Events are spec-
ified in the following format: OFFCORE_RESPONSE_{0|1}<request type><response type> where the
response type is either :ANY_RESPONSE or <supplier><snoop>. The latter is used here as the sup-
plier and snoop fields turn out to be useful for the differentiation of data transfers regarding their source
location. The request type is set to :ANY_DATA:ANY_RFO in order to include read only requests as
well as the RFOs caused by stores. The supplier definition can be used to distinguish L3 and memory
accesses. If it is set to :LLC_HITMESF all L3 accesses that hit cache lines in state Modified, Exclusive,
Shared, or Forward are considered. However, the :SNP_NOT_NEEDED setting for the snoop information
excludes accesses that involve snooping other cores. This is not a problem for the single-threaded
benchmark used here. In general, however, :SNP_NO_FWD:SNP_MISS and :HITM need to be added in order
to include accesses to Exclusive cache lines that are owned by another core as well as on-chip transfers
of Modified cache lines (see Section 5.3.1.3). The number of cache lines requested from local main
memory can be counted using :LLC_MISS_LOCAL as supplier. Remote accesses can be recorded via
:LLC_MISS_REMOTE (not depicted). The :SNP_ANY setting used for counting memory accesses also
includes cache lines that are forwarded from remote caches. Nevertheless, it provides the best estimate.
The setting :SNP_MISS:SNP_NO_FWD, which should only count memory accesses that are not forwarded
from another cache, does not work as expected due to high rates of :SNP_NONE events, which denote
accesses where no snoop related information is available. Furthermore, it cannot be ruled out that
cache lines that are forwarded from another cache are delivered from memory as well. Thus, selecting
:SNP_ANY is the most cautious choice.
(a) 256 bit loads (b) 256 bit stores
Figure 5.12: Xeon E5-2670, counters that identify the source of the accessed data: The utilization of
the L2 cache can be measured via the L2_TRANS and L2_RQSTS events. L3 cache and main memory
accesses can be recorded using the OFFCORE_RESPONSE_0 (or OFFCORE_RESPONSE_1)
counters. The correlation is not perfect but good enough to gain insight.
The L2_RQSTS and OFFCORE_RESPONSE:LLC_MISS_LOCAL / :LLC_MISS_REMOTE events pro-
vide good estimates for the amount of data delivered from the L2 cache and main memory, respectively.
Thus, the number of cache lines delivered from the respective location can be measured as follows:
SourceL2 = L2_RQSTS:ALL_DEMAND_RD_HIT + L2_RQSTS:RFO_HITS
Sourcemem–local = OFFCORE_RESPONSE_0:ANY_DATA:ANY_RFO:LLC_MISS_LOCAL:SNP_ANY
Sourcemem–remote = OFFCORE_RESPONSE_0:ANY_DATA:ANY_RFO:LLC_MISS_REMOTE:SNP_ANY
The OFFCORE_RESPONSE:LLC_HITMESF event correctly represents the number of cache lines read
from the L3 cache if the data is actually located in the L3 cache. However, main memory accesses also
cause a significant number of LLC hit events. Unfortunately, the recorded number of LLC hits is not
inclusive of the LLC misses in this case, which makes it difficult to compensate for the overlap. Luckily,
an estimate for the total number of cache lines requested from the memory hierarchy is also available
in the form of the L2_TRANS events. Therefore, the number of cache lines delivered from the L3 cache
(including on-chip core-to-core forwarding) can be derived from this as follows:
L3raw = OFFCORE_RESPONSE_0:ANY_DATA:ANY_RFO:LLC_HITMESF:SNP_NOT_NEEDED:SNP_NO_FWD:SNP_MISS:HITM
L3max = L2_TRANS:LOAD + L2_TRANS:RFO − (SourceL2 + Sourcemem–local + Sourcemem–remote)
SourceL3 = min(L3raw, L3max)
The number of transfers from the lower cache levels to the L1 cache—including data from lower levels
that passes through the intermediate levels—can be determined as follows:
ReadL2 = L2_TRANS:LOAD + L2_TRANS:RFO
ReadL3 = ReadL2 − SourceL2
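The derived metrics defined above can be evaluated from a set of raw counter readings as in the following sketch. The function name source_metrics() and the dictionary layout are assumptions made for illustration, and the example counts are made up rather than measured.

```python
def source_metrics(counts):
    """Derive per-location cache line counts from raw event counts.

    `counts` maps the event names used in the text to raw counter values.
    """
    prefix = "OFFCORE_RESPONSE_0:ANY_DATA:ANY_RFO:"
    source_l2 = counts["L2_RQSTS:ALL_DEMAND_RD_HIT"] + counts["L2_RQSTS:RFO_HITS"]
    source_mem_local = counts[prefix + "LLC_MISS_LOCAL:SNP_ANY"]
    source_mem_remote = counts[prefix + "LLC_MISS_REMOTE:SNP_ANY"]
    l3_raw = counts[prefix + "LLC_HITMESF:SNP_NOT_NEEDED:SNP_NO_FWD:SNP_MISS:HITM"]

    # All lines requested by the L1 cache, including prefetcher requests.
    read_l2 = counts["L2_TRANS:LOAD"] + counts["L2_TRANS:RFO"]
    # Upper bound for L3 deliveries: everything not served by L2 or memory.
    l3_max = read_l2 - (source_l2 + source_mem_local + source_mem_remote)

    return {
        "SourceL2": source_l2,
        "SourceL3": min(l3_raw, l3_max),
        "Sourcemem-local": source_mem_local,
        "Sourcemem-remote": source_mem_remote,
        "ReadL2": read_l2,
        "ReadL3": read_l2 - source_l2,
    }


# Made-up counts for illustration only (not measured values).
example = {
    "L2_RQSTS:ALL_DEMAND_RD_HIT": 100,
    "L2_RQSTS:RFO_HITS": 20,
    "OFFCORE_RESPONSE_0:ANY_DATA:ANY_RFO:LLC_MISS_LOCAL:SNP_ANY": 50,
    "OFFCORE_RESPONSE_0:ANY_DATA:ANY_RFO:LLC_MISS_REMOTE:SNP_ANY": 10,
    "OFFCORE_RESPONSE_0:ANY_DATA:ANY_RFO:LLC_HITMESF:SNP_NOT_NEEDED:SNP_NO_FWD:SNP_MISS:HITM": 70,
    "L2_TRANS:LOAD": 200,
    "L2_TRANS:RFO": 40,
}
metrics = source_metrics(example)
print(metrics)
```

In this example the raw LLC hit count (70) exceeds the upper bound derived from L2_TRANS (60), so the min() operation caps SourceL3 at the bound, which is the purpose of the L3max term.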
Figure 5.12b shows a write bandwidth measurement, but the recorded events only cover the read for
ownership transfers that place the data in the L1 cache prior to the modification. Indicators for the
write backs are depicted in Figure 5.13. The total number of cache lines that are written back from the
L1 cache can be counted via the L2_TRANS:L1D_WB event, which reports as many write backs as the
perf::PERF_COUNT_HW_CACHE_L1D:WRITE event. The other write back events report values close
to zero as long as the data set fits into the L2 cache. Write backs from the L2 cache to deeper levels
of the memory hierarchy are represented by the values reported by L2_TRANS:L2_WB. However, the
number of write backs from the L2 cache is lower than expected. Very similar values are reported for
L2_LINES_IN and L2_LINES_OUT. Apparently, some write backs bypass the L2 and go directly to the
L3 cache. This makes sense if the cache line has already been evicted from the L2 cache, which is
not inclusive of the L1 [Int14a, Table 2-9]. The OFFCORE_RESPONSE_0:WB:ANY_RESPONSE event
also includes the lines that bypass the L2 cache. Therefore, it is more suitable to determine the number
of write backs to the L3 cache.
Figure 5.13: Xeon E5-2670, write bandwidth and write back events: The numbers reported for the
L2_TRANS:L1D_WB event represent write backs from the L1 cache. The number of write backs from
the L2 cache, reported by L2_TRANS:L2_WB, is lower than that if data is written back to the L3 cache
or main memory. Fortunately, the :WB:ANY_RESPONSE sub-event from the OFFCORE_RESPONSE
can be used to estimate the total number of cache lines written back by a core.
Core counter events that indicate the number of write backs to memory have not been found. The uncore
event LLC_VICTIMS:M_STATE can be used to count write backs to main memory. It also provides
filters to select certain cores, which should enable per core measurements of write backs to memory.
The values reported without a core filter are plausible, but when a core filter mask is added the returned
values are all zero. Furthermore, the :M_STATE sub-event cannot be combined with the :NID sub-event
and a node filter mask (PAPI_add_event() returns an error), although this combination would enable
the distinction of write backs to local and remote memory. The uncore event TOR_INSERTS:NID_EVICTIONS
reports plausible values if node and core filter masks are added, i.e., it can be used to count evictions of
L3 cache lines that belonged to a certain core and target a certain NUMA node. Unfortunately, it also
includes evictions of clean lines. Therefore, the number of cache lines written to memory cannot be counted on a
per core basis. The available indicators for the bandwidth required by write backs are defined as follows:
WriteL2 = L2_TRANS:L1D_WB − L2_L1D_WB_RQSTS:MISS
WriteL3 = OFFCORE_RESPONSE_0:WB:ANY_RESPONSE
Table 5.4 shows the bandwidths a single thread can achieve for read and write accesses to the local
memory hierarchy as well as remote memory (see Section 4.1.3). The measured bandwidth is used
to calculate the resulting number of transfers per second. The maximal event rates are derived from
the performance counter readings presented in this section. Most of them are identical to the number
of transfers. However, the counts reported by ReadL2 and ReadL3 are slightly higher than the actual
number of cache lines requested from the L1—presumably as some prefetcher requests are included as
well. Unfortunately, the number of transfers per second cannot be used to derive the utilization of the L1
cache. With 64 bit instructions it is possible to reach the maximal event rates while using only 50% of the
available bandwidth. This could still be defined as 100% load as all L1 load or store ports are active each
cycle. However, with 256 bit instructions it is also possible to fully utilize the available bandwidth while
the event rates are at 50% of their respective maximum. Thus, it has to be known if the code is properly
vectorized in order to interpret the event rates correctly. This can be checked via the SIMD_FP_256 event
or a manual code inspection. For L2 accesses and beyond the selected metrics provide good estimates
for the used bandwidth independent of the width of the load and store instructions.
Table 5.4: Indicators for bandwidth usage per core: The number of transfers between the registers and
the L1 cache depends on the bandwidth and the width of the load and store instructions. In contrast,
full cache lines are transferred between the L1 cache and the deeper levels of the memory hierarchy.
Therefore, only 128 bit instructions, which achieve the highest bandwidths (see Figure 4.38), are
considered in this case. The selected indicators correlate well with the number of transfers.
Source/        Access type,   Achievable   Million transfers   Most appropriate indicator   Maximal number of
Dest.          instr. width   bandwidth    per second          for bandwidth utilization    events per second
L1D            read, 64b      41.4 GB/s    5,175 (8 byte)      perf::L1-DCACHE-LOADS        5,175 million
L1D            read, 128b     82.8 GB/s    5,175 (16 byte)     perf::L1-DCACHE-LOADS        5,175 million
L1D            read, 256b     82.8 GB/s    2,587 (32 byte)     perf::L1-DCACHE-LOADS        2,587 million
L1D            write, 64b     20.5 GB/s    2,562 (8 byte)      perf::L1-DCACHE-STORES       2,562 million
L1D            write, 128b    41.0 GB/s    2,562 (16 byte)     perf::L1-DCACHE-STORES       2,562 million
L1D            write, 256b    41.0 GB/s    1,281 (32 byte)     perf::L1-DCACHE-STORES       1,281 million
L2             read, 128b     46.0 GB/s    718 (64 byte)       ReadL2                       739 million
L2             write, 128b    24.5 GB/s    382 (64 byte)       WriteL2                      382 million
L3             read, 128b     25.1 GB/s    392 (64 byte)       ReadL3                       418 million
L3             write, 128b    17.9 GB/s    279 (64 byte)       WriteL3                      279 million
local memory   read, 128b     11.5 GB/s    179 (64 byte)       Sourcemem–local              179 million
local memory   write, 128b    9.0 GB/s     140 (64 byte)       n/a                          n/a
remote memory  read, 128b     7.8 GB/s     121 (64 byte)       Sourcemem–remote             121 million
remote memory  write, 128b    6.8 GB/s     106 (64 byte)       n/a                          n/a
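The arithmetic behind Table 5.4 can be retraced with a short sketch: the number of transfers per second follows from the measured bandwidth and the transfer size, and the utilization is the ratio of an observed event rate to the maximal event rate. The function names are hypothetical, and 1 GB is assumed to be 10^9 bytes.

```python
def transfers_per_second(bandwidth_gb_s, bytes_per_transfer):
    """Transfers per second for a given bandwidth (1 GB = 1e9 bytes assumed)."""
    return bandwidth_gb_s * 1e9 / bytes_per_transfer


def utilization(event_rate, max_event_rate):
    """Fraction of the maximal event rate that is reached."""
    return event_rate / max_event_rate


# L1D reads with 64 bit loads: 41.4 GB/s in 8 byte transfers.
l1_read_64b = transfers_per_second(41.4, 8)
# L2 reads: 46.0 GB/s in 64 byte cache lines.
l2_read = transfers_per_second(46.0, 64)

print(round(l1_read_64b / 1e6), "million L1 transfers per second")
print(int(l2_read / 1e6), "million L2 transfers per second")

# 256 bit loads at full L1 bandwidth reach only about half of the
# maximal perf::L1-DCACHE-LOADS event rate.
print(utilization(2587e6, 5175e6))
```

The last line illustrates why the L1 event rates have to be interpreted together with the instruction width: the event-based utilization is roughly 50% although the bandwidth is fully used.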
5.3.1.2 Utilization of Shared Resources
The last level cache, the memory bandwidth, and the links between the processors have been identified
as potential bottlenecks in Chapter 4. In this section the Xeon E5-2670 processor’s uncore performance
counters [Int12a] are analyzed regarding their applicability to measure the utilization of these shared
resources. For these experiments the benchmark configuration is changed as follows:
• All CPUs from the first socket are entered in BENCHIT_KERNEL_CPU_LIST.
• BENCHIT_KERNEL_MEM_BIND contains all CPUs from another socket that is directly con-
nected to the first socket.
The uncore performance monitoring is implemented by multiple per-component performance monitoring
units (also called “boxes”) [Int12a, Figure 1-1, Table 1-1]. Each slice of the shared L3 cache has a
dedicated “C-Box”. The integrated memory controller (IMC), the home agent (HA), and the QuickPath
interface (QPI) each have one performance monitoring unit as well. Uncore events are specified in
the format: snbep_unc_<comp>::<event name> where comp selects a component and the event name
specifies the events that are counted. The “snbep_unc_” prefix is omitted here. X86-membench records
performance counters only on one CPU and derives the event ratios by dividing the recorded number of
events by the number of memory accesses performed by this CPU. This is sufficient for the collection of
core counters since the workload is homogeneous, i.e., all cores would report very similar values. It also
avoids conflicts between the CPUs if uncore counters are recorded as only a single CPU is accessing the
uncore PMUs. However, in this case all the recorded events are attributed to a single CPU, which has
to be considered in the interpretation of the results. In case of the Xeon E5-2670 processor all resources
are shared by eight cores, i.e., the reported event ratios have to be divided by eight to arrive at the correct
result.
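The correction described above amounts to a single division. The following sketch assumes a homogeneous workload, i.e., every core performs the same number of accesses as the measuring CPU. The function name is hypothetical.

```python
def corrected_event_ratio(events, accesses_measuring_cpu, cores_sharing=8):
    """Events per memory access for a resource shared by several cores.

    The uncore events are caused by all cores, but only the accesses of
    the measuring CPU are known, so they are scaled by the core count.
    """
    return events / (accesses_measuring_cpu * cores_sharing)


# Four reported events per access of the measuring CPU correspond to
# 0.5 events per access once all eight cores are taken into account.
print(corrected_event_ratio(4000, 1000))
```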
Figures 5.14 and 5.15 depict last level cache (C-Box) and home agent (HA) events, respectively. The
memory controller (IMC) events record DRAM specific information (precharges, refreshes, ECC errors,
etc.). They are omitted here as the home agent events prove to be sufficient to observe memory accesses.
The UNC_C_LLC_LOOKUP and UNC_C_TOR_INSERTS C-Box events can be used to measure the
aggregated L3 bandwidth. As can be seen in Figure 5.14b, the UNC_C_LLC_LOOKUP:DATA_READ
event does not cover RFOs. Therefore, UNC_C_LLC_LOOKUP:ANY has to be used in order to capture
all loads. However, this event also includes stores, and the event rates are higher than expected in case of
L2 accesses as well as reads from the L3 cache. Furthermore, some but not all main memory accesses
are also counted as LLC lookups. The UNC_C_LLC_LOOKUP:WRITE events correlate well with the
number of cache lines written to the L3 cache. The event ratio of around 0.5 events per 256 bit access
means that one event is generated for each cache line. The UNC_C_TOR_INSERTS events also count L3
accesses. In contrast to the UNC_C_LLC_LOOKUP:ANY event, the reads (:OPC_DRD and :OPC_RFO)
also include all main memory accesses, which simplifies compensating for the overlap with the DRAM
counters. Unfortunately, different UNC_C_TOR_INSERTS events cannot be counted concurrently.
(a) 256 bit loads (b) 256 bit stores
Figure 5.14: Xeon E5-2670, last level cache counters: All C-Boxes report very similar results, i.e., the
data is evenly distributed across the L3 slices. For the purpose of clarity only the results from a single
C-Box are shown. This also compensates for the incorrect attribution of events caused by eight CPUs to
a single CPU as only one eighth of the events is considered. Thus, the shown event ratios are correct.
(a) 256 bit loads (b) 256 bit stores
Figure 5.15: Xeon E5-2670, home agent counters: Each processor has a single home agent that is
shared by all eight cores. The reported event ratios of up to four events per memory access include
events caused by all eight cores that are attributed to the memory accesses performed by one core.
The actual number of memory accesses is eight times higher, thus the correct event ratio is one eighth
of the depicted values.
The UNC_H_REQUESTS events observe reads and writes to the local memory. The corrected event ratio
is again 0.5 events per memory access (see footnote 5), i.e., one event per cache line. Writes can also be
counted via UNC_H_IMC_WRITES:FULL, which explicitly excludes partial writes. Unfortunately, the
qpi0:: and qpi1:: counters are not operational on the tested system (see footnote 6). They are listed by
papi_native_avail, but any attempt to add them to an eventset causes PAPI_add_event() to fail.
Therefore, the utilization of the QPI links cannot be measured via the uncore counters. Table 5.5 lists the
available indicators for the utilization of the shared resources.
Table 5.5: Indicators for bandwidth usage per processor: The peak bandwidths are based on the
measurements presented in Table 4.10 (one thread per core). The maximal event rates are derived from the
performance counter readings presented in this section. The maximal event rates for the UNC_C_*
events refer to the sum of the events reported by the eight C-Boxes. If the maximal event rates are
reached the respective component is 100% occupied, i.e., the achievable bandwidth is fully utilized.
Resource            Access   Peak        Most appropriate indicator                          Max. events
                    type     bandwidth   for bandwidth utilization                           per second
L3                  read     197 GB/s    UNC_C_TOR_INSERTS:OPC_DRD / OPC_RFO (footnote 7)    3,306 million
L3                  write    138 GB/s    UNC_C_LLC_LOOKUP:WRITE                              2,148 million
memory controller   read     43.8 GB/s   UNC_H_REQUESTS:READS                                684 million
memory controller   write    19.8 GB/s   UNC_H_IMC_WRITES:FULL                               309 million
5 The depicted four events per access include events caused by all cores but only consider the measuring core’s accesses.
6 Dell PowerEdge R720, Ubuntu 16.04 LTS, kernel 4.4.0-21-generic, and PAPI 5.4.3.0
7 Cannot be measured together; alternatively: UNC_C_LLC_LOOKUP:ANY − UNC_C_LLC_LOOKUP:WRITE, but this does
not include all DRAM accesses that pass through the L3 cache.
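How the maximal event rates from Table 5.5 could be used is sketched below. The sketch sums the counts of all PMUs of one kind (e.g. the eight C-Boxes), relates the resulting rate to the tabulated maximum, and converts cache line events into a bandwidth. The function names and the 64 byte cache line size are assumptions of this sketch.

```python
# Maximal event rates per processor as listed in Table 5.5 (events per second).
MAX_EVENT_RATE = {
    "UNC_C_LLC_LOOKUP:WRITE": 2148e6,
    "UNC_H_REQUESTS:READS": 684e6,
    "UNC_H_IMC_WRITES:FULL": 309e6,
}
CACHE_LINE_BYTES = 64  # assumption: one event corresponds to one cache line


def shared_utilization(event, counts, seconds):
    """Utilization of a shared resource.

    `counts` holds one value per PMU, e.g. eight values for the C-Boxes
    or a single value for the home agent.
    """
    rate = sum(counts) / seconds
    return rate / MAX_EVENT_RATE[event]


def bandwidth_gb_s(counts, seconds):
    """Bandwidth implied by cache line events (1 GB = 1e9 bytes assumed)."""
    return sum(counts) * CACHE_LINE_BYTES / seconds / 1e9


# Made-up reading: 342 million home agent read requests within one second.
print(shared_utilization("UNC_H_REQUESTS:READS", [342e6], 1.0))
print(bandwidth_gb_s([684e6], 1.0))
```

As a plausibility check, 684 million cache lines per second correspond to roughly 43.8 GB/s, which agrees with the peak read bandwidth listed for the memory controller.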
Figure 5.16: Xeon E5-2670, indicators for core-to-core transfers: CPU 0 performs read accesses after
the data has been placed in the memory hierarchy by another core within the same processor. Data is
allocated from the processor’s local memory. The cached copies are in state Exclusive. The LLC hits
are measured via OFFCORE_RESPONSE_0:ANY_DATA:* events, which are also able to differentiate
incoming cache lines regarding their source.
5.3.1.3 Data Transfers Between Cores
The throughput kernel does not involve core-to-core transfers. Therefore, these measurements are per-
formed with the single-threaded bandwidth benchmark (see Section 3.5.2), which is configured to in-
clude CPU 0, another CPU of the same processor, and one CPU from each other processor in the
measurement. Figure 5.16 depicts reads from the L3 cache that involve snooping another core. Up
to a data set size of 256 KiB the other core still has Exclusive copies of the requested cache lines,
which result in OFFCORE_RESPONSE_0:ANY_DATA:LLC_HITMESF:SNP_NO_FWD events. Accesses
to L3 cache lines that have already been evicted by the other core are recorded by the event
OFFCORE_RESPONSE_0:ANY_DATA:LLC_HITMESF:SNP_MISS. Cache lines that are directly delivered
from the L3 cache cause OFFCORE_RESPONSE_0:ANY_DATA:LLC_HITMESF:SNP_NOT_NEEDED
events. Modified cache lines that are transferred from another core’s L1 or L2 cache can be counted
via OFFCORE_RESPONSE_0:ANY_DATA:LLC_HITMESF:HITM. With around 0.5 events per 256 bit
access the above events measure the number of accessed cache lines. If the request type is changed from
:ANY_DATA to :ANY_DATA:ANY_RFO, read for ownership requests are captured as well.
The OFFCORE_RESPONSE counters can also be used to count remote cache hits. For this purpose, the
supplier field has to be set to LLC_MISS_LOCAL:LLC_MISS_REMOTE in order to cover all LLC misses.
If the snoop field is set to :SNP_FWD:HITM, cache lines that are forwarded from a remote cache are
counted. The setting :SNP_NO_FWD counts remote cache hits that do not provide the data (Shared
cache lines). The combination of all three snoop types with the request type :ANY_RFO can be used
to count writes that hit (and thereby invalidate) remote caches. Unfortunately, in case of L3 misses,
:SNP_NONE events are generated for around half of the accessed cache lines, which indicates that no
snoop related information is available. Therefore, remote cache hits can be significantly underestimated
using the above events.
5.3.1.4 Applicability for Succeeding Processor Generation
The presented approach can also be used on the system equipped with Intel Xeon E5-2680 v3 processors
(see Section 4.2.1). However, the counters identified for Sandy Bridge are not transferable one-to-one.
First of all, the event names are slightly different, e.g., many occurrences of “LLC” are replaced by “L3”.
Another example is the “:HITM” snoop type of the OFFCORE_RESPONSE counters, which is named
“:SNP_HITM” on the Haswell based processor. Furthermore, the uncore events start with “hswep_unc_”
instead of “snbep_unc_”. Once the renaming is accounted for, most events work as
they do on Sandy Bridge. However, there are also some significant differences:
• The L2_RQSTS:RFO_HIT event reports zero if all data is delivered from the L2 cache while
L2_RQSTS:ALL_RFO − L2_RQSTS:RFO_MISS correctly counts RFOs that hit the L2.
• The OFFCORE_RESPONSE counters for LLC misses, which provide good estimates for the
DRAM bandwidth used by the individual cores on Sandy Bridge, report values that are far too low.
Therefore, the per core bandwidth utilization cannot be measured on Haswell. However, the
measurement of the aggregated bandwidth utilization via the uncore counters works.
• OFFCORE_RESPONSE_0:WB:ANY_RESPONSE reports zero on Haswell. Fortunately, it is
also not needed because, in contrast to Sandy Bridge, L2_TRANS:L2_WB correctly counts the
number of write backs to the L3 cache.
• There are two home agent PMUs in each Haswell processor. Their results have to be combined.
• The QPI counters are functioning on Haswell (with the same software versions that do not work
on Sandy Bridge).
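Two of the differences listed above can be captured in a small selection layer, sketched below. The function names and the microarchitecture labels "sandybridge" and "haswell" are hypothetical. The event names are taken from the text.

```python
def rfo_hits_l2(counts, microarch):
    """Number of RFOs that hit the L2 cache."""
    if microarch == "haswell":
        # L2_RQSTS:RFO_HIT reports zero on Haswell, so the hits are
        # derived from all RFOs minus the RFOs that miss.
        return counts["L2_RQSTS:ALL_RFO"] - counts["L2_RQSTS:RFO_MISS"]
    return counts["L2_RQSTS:RFO_HITS"]


def l3_writebacks(counts, microarch):
    """Number of cache lines written back to the L3 cache."""
    if microarch == "haswell":
        return counts["L2_TRANS:L2_WB"]
    return counts["OFFCORE_RESPONSE_0:WB:ANY_RESPONSE"]


# Made-up counts for illustration only.
hsw = {"L2_RQSTS:ALL_RFO": 500, "L2_RQSTS:RFO_MISS": 120, "L2_TRANS:L2_WB": 80}
snb = {"L2_RQSTS:RFO_HITS": 380, "OFFCORE_RESPONSE_0:WB:ANY_RESPONSE": 80}

print(rfo_hits_l2(hsw, "haswell"), l3_writebacks(hsw, "haswell"))
print(rfo_hits_l2(snb, "sandybridge"), l3_writebacks(snb, "sandybridge"))
```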
Figure 5.17 depicts the performance of repeated read and write accesses to data allocated from the
second processor’s memory. As long as the data fits into the local cache hierarchy, it is only requested
once from remote memory. This does not show in the event rates because of the high number of
repetitions. Modifications are also kept local due to the write back policy of the L3 cache. For data sets
that are significantly larger than the L3 cache, however, the data has to be read via QPI every time and
the modifications are eventually written back (see footnote 8). Both QPI PMUs report 24 events for every
access performed by CPU 0. Due to x86-membench’s attribution of all events to the accesses of a single
core (see Section 5.3.1.2), this actually means 48 flits (see footnote 9) per twelve accesses, i.e., four flits
per 256 bit access. This matches the expectation of 64 bit data per flit (80 bits including CRC and
control [Int09a]). It is also possible to count the snoop requests and responses caused by the coherence
protocol.
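The flit arithmetic above can be retraced as follows. The sketch assumes that the two QPI PMUs are summed and that the twelve accesses mentioned in the text correspond to twelve cores performing the same workload. The function names are hypothetical.

```python
def flits_per_access(events_per_access_per_pmu, qpi_pmus, cores):
    """Flits per memory access after correcting the single-core attribution."""
    return events_per_access_per_pmu * qpi_pmus / cores


def expected_flits(access_bits, payload_bits_per_flit=64):
    """Flits needed to transfer one access at 64 bit of data per flit."""
    return access_bits // payload_bits_per_flit


measured = flits_per_access(24, qpi_pmus=2, cores=12)
expected = expected_flits(256)
print(measured, expected)
```

Both values agree at four flits per 256 bit access, which is the consistency check made in the text.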
5.3.2 Metrics for Memory-boundedness
As shown in Section 5.3.1, hardware performance counters can be used to derive the utilization of each
level in the memory hierarchy as well as the interconnection network between the processors. However,
in out-of-order micro-architectures—which are commonly used in high performance computing—the
delays caused by memory accesses can overlap with speculatively executed arithmetic instructions as
well as other memory accesses (see Section 2.1.1.2). Therefore, calculating the total waiting time from
the number of accesses to each location multiplied by the respective latency would often overestimate
the impact of memory accesses on the total runtime [Int14a, Appendix B.3.4.1]. Fortunately, perfor-
mance counters that count various types of stall cycles are also a widely used feature. In this section
it is analyzed which of these counters correctly represent the fraction of time that is spent waiting for
the memory hierarchy and if the delays can be attributed to the potential bottlenecks in the memory
subsystem (see Section 4.3).
(a) 256 bit loads (b) 256 bit stores
Figure 5.17: Xeon E5-2680 v3—QPI counters: The inter-processor transfers are counted separately for
both QPI links. The “RXL_FLITS” correlate with the number of data packets received from the other
processor. The “TXL_FLITS” measure the amount of data that is sent in the other direction.
8 The decline of the performance is not as abrupt as shown in the bandwidth measurements in Section 4.2.1.2. Apparently, the
caching is more effective for the repetitive accesses performed by the throughput kernel (cache flushes are not supported).
9 Flow control unit, the link layer’s unit of transfer [Int09a]
Table 5.6: Xeon E5-2670—Counters for stall cycles: These events can be used to detect inactivity of
certain components. Descriptions are based on the output of papi_native_avail (PAPI 5.4.3.0).
Event               Sub-event (umask)      Indicates
CYCLE_ACTIVITY      :CYCLES_NO_DISPATCH    stalled cycles
CYCLE_ACTIVITY      :STALLS_L1D_PENDING    execution stalled due to L1D pending loads
CYCLE_ACTIVITY      :STALLS_L2_PENDING     execution stalled due to L2 pending loads
INT_MISC            :RAT_STALL_CYCLES      cycles RAT external stall is sent to IDQ (footnote 10)
RESOURCE_STALLS     :ANY                   cycles stalled due to resource related reason
RESOURCE_STALLS     :LD_SB                 stalls due to load or store buffers all being in use
RESOURCE_STALLS     :RS                    cycles stalled due to no eligible RS entry available
RESOURCE_STALLS     :ROB                   cycles stalled due to re-order buffer full
RESOURCE_STALLS2    :OOO_RSRC              cycles stalled due to out-of-order resources full
UOPS_ISSUED         :STALL_CYCLES          cycles no µops issued by this thread
UOPS_EXECUTED       :STALL_CYCLES          number of cycles with no µops executed
UOPS_RETIRED        :STALL_CYCLES          cycles no executable µop retired
Table 5.6 lists a selection of stall counters that are available on Intel Xeon E5-2670 processors. The
counter that most accurately represents the delay caused by memory accesses can be identified with
a slightly adapted version of x86-membench’s latency benchmark (see Section 3.5.1). This benchmark
represents a worst case scenario with only one outstanding memory request at a time, a hard-to-predict
access pattern, and no independent instructions between the memory accesses that would benefit from
out-of-order execution. In order to investigate the effect of the out-of-order execution on the stall cycles,
instructions are added between the loads. There are two versions of the modified latency benchmark.
The alteration shown in Algorithm 5.1 adds multiplications to the workload that do not have a data
dependency on the accessed memory locations. Algorithm 5.2 shows the second modification. In that
case the multiplications are part of the dependency chain.
Figure 5.18 shows the effect of the modifications described above on the measured time per memory
access and the performance counter readings. Multiple counters show very similar results. They are
grouped together as follows (the * marks the representative of the group): CYCLES_NO_ISSUE repre-
sents the event rates observed for INT_MISC:RAT_STALL_CYCLES (*), RESOURCE_STALLS:ANY,
as well as UOPS_ISSUED:STALL_CYCLES. CYCLES_NO_DISPATCH illustrates the behavior of
CYCLE_ACTIVITY:CYCLES_NO_DISPATCH (*) and UOPS_EXECUTED:STALL_CYCLES. The
counter UOPS_RETIRED:STALL_CYCLES behaves like UOPS_ISSUED:STALL_CYCLES if inde-
pendent operations are added to the workload. However, if the multiplications are part of the de-
pendency chain, it resembles the event rates of UOPS_EXECUTED:STALL_CYCLES. The remain-
ing value series represent a single performance counter. STALLS_L1D_PENDING depicts the re-
sults of CYCLE_ACTIVITY:STALLS_L1D_PENDING. The PAGE_WALK_DURATION is measured
via DTLB_LOAD_MISSES:WALK_DURATION. RESOURCE_STALLS :LD_SB, :RS, :ROB, and
:OOO_RSRC are omitted as they show no significant event rates for the latency benchmark.
"_work_loop_independent_mul:"
"mov (%%rbx), %%rbx;"                                    // dereference pointer
"imul $1, %%r8; imul $1, %%r9; imul $1, %%r10; [...]"    // independent operations
"mov (%%rbx), %%rbx;"                                    // dereference pointer
"imul $1, %%r8; imul $1, %%r9; imul $1, %%r10; [...]"    // independent operations
[...]                                                    // 22 more such blocks (dereference + operations)
"sub $1, %%rcx;"
"jnz _work_loop_independent_mul;"
Algorithm 5.1: Latency benchmark with independent operations between the loads: In this version,
integer multiplications that do not have a data dependency on the accessed memory addresses are added
after each dereference of the pointer. Their execution can overlap with the memory accesses.
10 RAT: register alias table; IDQ: instruction decode queue
"_work_loop_dependent_mul:"
"mov (%%rbx), %%rbx;"                                        // dereference pointer
"imul $1, %%rbx; imul $1, %%rbx; imul $1, %%rbx; [...]"      // operations on pointer
"mov (%%rbx), %%rbx;"                                        // dereference pointer
"imul $1, %%rbx; imul $1, %%rbx; imul $1, %%rbx; [...]"      // operations on pointer
[...]                                                        // 22 more such blocks (dereference + operations)
"sub $1, %%rcx;"
"jnz _work_loop_dependent_mul;"
Algorithm 5.2: Latency benchmark with dependent operations between the loads: In this version, all
the operations are part of a single dependency chain. Thus, the multiplications are delayed until the
preceding memory access is completed and have to be processed one after another.
Figures 5.18a and 5.18b show the correlation of the selected counters with the unmodified latency
benchmark. Multiple counters correlate with the execution time of the latency measurements, including the
time spent waiting for address translations. The PAGE_WALK_DURATION event can be used to differ-
entiate stalls caused by TLB misses from the latency of the actual memory access. However, the latency
kernel does not include any operations other than the memory accesses. Consequently, even the counter
CPU_CLK_UNHALTED—which counts all cycles the CPU is active—correlates well with the execu-
tion time of the latency measurements. Therefore, it is unclear if the reported number of total stall cycles
comprises other stall reasons as well. If arithmetic operations are added to the workload (see Figure 5.18c
and 5.18d), it becomes obvious that most events are not suitable to detect memory related stalls.
(a) unmodified latency benchmark, hugepages (b) unmodified latency benchmark, 4K pages
(c) 24 independent multiplications per load (d) 24 data dependent multiplications per load
Figure 5.18: Xeon E5-2670—correlation between performance counters and memory latency: Most
events also include stalls that are not related to memory accesses. Only the STALLS_L1D_PENDING
event provides a reasonable estimate for the time spent waiting for data to arrive.
124 5 Performance Impact of the Memory Hierarchy
If independent operations are added (see Figure 5.18c), the stall cycles that are caused by the memory
accesses can be used to perform the computations. A useful indicator for the performance loss due
to memory accesses should only include the remaining stall cycles—those that cannot be used for any
other operations. This is the case for CYCLES_NO_DISPATCH and STALLS_L1D_PENDING. They
both report 13.3 instead of 36.3 stalls per memory access for data sets that fit into the L3 cache while
the execution time is 39 cycles per load instruction with and without the multiplications. The 23 cycle
reduction is reasonably close to the number of operations added (24). One out of the 24 multiplications is
presumably dispatched together with a load, which explains the difference. In contrast, the numbers
reported for the CYCLES_NO_ISSUE events remain at a higher level and are thus unsuitable. They
only decrease from 39.1 to 32.9. This can be explained by the different issue and dispatch rates. Up to
four micro-ops can be renamed and issued to the reservation station each cycle while only one can be
dispatched to the multiplier. Thus, only six cycles are required to issue the 24 multiplications.
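The arithmetic behind these observations can be checked with a short sketch. It assumes, as stated above, that up to four micro-ops are issued per cycle and that one multiplication is dispatched per cycle. The function names are illustrative and not part of x86-membench.

```python
import math

def cycles_to_issue(n_ops: int, issue_width: int = 4) -> int:
    """Cycles needed to issue n_ops micro-ops at issue_width micro-ops per cycle."""
    return math.ceil(n_ops / issue_width)

def cycles_to_dispatch(n_ops: int, units: int = 1) -> int:
    """Cycles needed to dispatch n_ops micro-ops to `units` suitable execution units."""
    return math.ceil(n_ops / units)

n_mul = 24  # multiplications added per load (Algorithm 5.1)

# Stall cycles per memory access quoted in the text (data set fits into the L3 cache).
no_dispatch_before, no_dispatch_after = 36.3, 13.3
no_issue_before, no_issue_after = 39.1, 32.9

# Modeled cycles next to the observed reduction of the respective stall counter.
print(cycles_to_dispatch(n_mul), round(no_dispatch_before - no_dispatch_after, 1))
print(cycles_to_issue(n_mul), round(no_issue_before - no_issue_after, 1))
```

Under these assumptions the model yields 24 dispatch cycles and 6 issue cycles, which is close to the observed reductions of 23 and 6.2 stall cycles.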
If the multiplications are part of the dependency chain (see Figure 5.18d), the situation changes again.
In this case, the arithmetic operations depend on the result of their respective predecessor. Thus, the
instruction latency (3 cycles [Int14a, Table C-20a]) also contributes to the execution time. These delays
are clearly not caused by memory accesses and should therefore not be counted as memory related
stalls. The results of an ideal memory stall counter would be identical to the latency measurements from
Figure 5.18a. However, such a counter does not exist. The events from the CYCLES_NO_DISPATCH
group also report an increased number of stall cycles in this scenario, i.e., they include the delays caused
by data dependencies. Therefore, they have to be excluded as indicators for memory related stalls. Only
the STALLS_L1D_PENDING event looks promising. For accesses to the local L2 and L3 cache, the
reported number of stalls is very close to the measured latency. The values reported for local and remote
memory accesses are reasonably close as well. However, accesses to the L1 cache are not included and
in case of remote cache accesses the reported values are too high. The latter is presumably caused by
the completion of the coherence protocol transaction in the home node after the data has already been
delivered. Apparently, the L1 miss is still considered to be pending—and the execution is continued
speculatively—until the home node acknowledges that there have been no conflicting accesses.
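A simple serial model illustrates why the dependent multiplications lengthen the execution time without being memory stalls. The sketch assumes that the load and the 24 multiplications of a block form one dependency chain, so that their latencies add up. The 3-cycle multiplication latency is the value cited above; the function names are illustrative, and the resulting total is a model estimate, not a measured value.

```python
def block_cycles_dependent(load_latency, n_mul=24, mul_latency=3):
    """Estimated cycles per block when the load and all multiplications are serialized."""
    return load_latency + n_mul * mul_latency

def non_memory_cycles(n_mul=24, mul_latency=3):
    """Cycles of a block that an ideal memory stall counter should not report."""
    return n_mul * mul_latency

# Example: 39 cycles per load, the L3 value quoted earlier in this section.
total = block_cycles_dependent(39)
compute = non_memory_cycles()
print(total, compute, total - compute)
```

Only the remainder after subtracting the multiplication chain corresponds to the memory latency, which is what an ideal memory stall counter would report.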
5.3.2.1 Decomposition of Stall Cycles
The event CYCLE_ACTIVITY:STALLS_L1D_PENDING has been identified as a useful counter for
the total number of stall cycles caused by loads11. An assignment of the observed stall cycles
to the various components that can cause delays is not easily possible on out-of-order architec-
tures due to the overlapping of multiple concurrent requests with different stall reasons [Eye+06;
AEE12; Yas14]. However, hardware performance counters can give reasonable estimates, e.g., the
MEM_LOAD_UOPS_RETIRED event can be used to measure the latency caused by accesses to each
level of the memory hierarchy [Lev09]; [Int14a, Appendix B.3.4.1]. The fraction of the runtime that
is spent waiting for data from the individual levels of the memory hierarchy can be estimated based
on the CYCLE_ACTIVITY events for pending cache misses and the hit rate in the L3 cache as de-
scribed in [Int14a, Appendix B.3.2.3]. Unfortunately, the suggested MEM_LOAD_UOPS_RETIRED
and MEM_LOAD_UOPS_MISC_RETIRED events do not measure the L3 hit ratio correctly. As depicted
in Figure 5.19, the number of L3 hits reported by MEM_LOAD_UOPS_RETIRED:L3_HIT
is far too low. Furthermore, the MEM_LOAD_UOPS_LLC_MISS_RETIRED event, which provides
sub-events for local and remote DRAM accesses, always reports zero (not depicted, identical
to MEM_LOAD_UOPS_RETIRED:L3_MISS). Alternatively, the SourceL3, Sourcemem–local, and
Sourcemem–remote metrics described in Section 5.3.1.1 can be used to determine the percentages of L3 hits as
well as local and remote memory accesses. However, these events include all the data requested by the
core, not only the accesses that actually stall the execution.
There are two flavors of memory-boundedness—latency bound and bandwidth bound. In the former
case the data waited upon is required to continue execution, i.e., the following instructions have a data
dependency on the requested data. In the latter case independent instructions are available, but the data
paths are used to their capacity, which restricts the processing speed of memory accesses. It is important
to distinguish both cases as there are different optimization strategies for them, e.g., adding prefetch
instructions in latency bound code or introducing cache blocking in bandwidth bound programs.
11 STALLS_LDM_PENDING, which Intel suggests for this [Int14a, Appendix B.3.2.3], is not available on Sandy Bridge.
Figure 5.19: Xeon E5-2670, memory latency and MEM_LOAD_UOPS events (MLU): The number of
reported L2 hits is too high. This probably results from an impairment of the measurement by PAPI.
The effect is even larger for L1 hits (up to 3.5 events per access), which are therefore omitted. For L3
and memory accesses the disturbance is negligible. The number of L3 hits is severely underestimated.
Thus, stalls caused by L3 and memory accesses cannot be differentiated accurately.
The out-of-order execution cores are decoupled from the memory hierarchy via load and store
buffers [HP06, Section 2.4]. Furthermore, several request queues exist that handle outstanding requests at
various levels [DAS12, Figure 5.14]. The achievable bandwidth is influenced by the number of available
entries as depicted in Figure 5.20. Typically, the cores can issue requests faster than they can be serviced
by the lower levels of the memory hierarchy. Thus, bandwidth bound applications—which issue many
independent requests—tend to fully utilize the available request buffers. Therefore, a high utilization
ratio of the request queues is an indicator for memory-boundedness. Figure 5.21 shows the event rates
reported by several stall counters for bandwidth measurements via x86-membench’s throughput kernel.
All stalls can be attributed to the sequential memory accesses as the measurement routines do not include
any other instructions. Therefore, the event CYCLE_ACTIVITY:CYCLES_NO_DISPATCH—which in-
cludes all cycles where no operations are sent from the scheduler to the execution units—is used as
reference. It can be observed that the CYCLE_ACTIVITY:STALLS_L1D_PENDING event also includes loads
that are stalled due to bandwidth limitations. However, it cannot be used to distinguish latency bound
and bandwidth bound scenarios. Furthermore, it does not comprise stall cycles caused by stores if there
are no loads pending at the same time.
[Figure 5.20 diagram: number of occupied queue entries (up to max) over time, with markers for memory
requests and received cache lines; perceived latency (load to use) versus actual latency followed by
consecutive transfer times (Ttrans); recurring utilization pattern with bus idle periods; legend: queue full
cycles that do not limit the bandwidth, queue full cycles that limit the achievable bandwidth, cycles with
queue entries being available]
Figure 5.20: Impact of the memory latency on the achievable bandwidth: In case of consecutive memory
accesses, the memory latency delays the first access. After that the transfer time (Ttrans) determines
the arrival rate of the requested cache lines. However, the time until an allocated entry in the request
queue becomes available again also depends on the memory latency. A sufficient number of entries
is required in order to fully utilize the theoretical peak bandwidth (bus frequency × bus width). If the
supported number of outstanding requests is too low, a recurring pattern of several cache line transfers
followed by a period where the data bus is idle may occur.
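The relation described in Figure 5.20 can be sketched with Little's law. This is a simplified model and not a formula taken from this thesis: it assumes that each queue entry is occupied for one full memory latency and that every request transfers one cache line. All parameter values in the example are hypothetical.

```python
import math

def required_entries(latency_ns: float, t_trans_ns: float) -> int:
    """Outstanding requests needed so that a cache line can arrive every t_trans_ns."""
    return math.ceil(latency_ns / t_trans_ns)

def achievable_bandwidth(entries: int, latency_ns: float, t_trans_ns: float,
                         line_bytes: int = 64) -> float:
    """Bandwidth in GB/s, limited either by the queue size or by the data bus."""
    peak = line_bytes / t_trans_ns                    # bytes per ns equals GB/s
    queue_limit = entries * line_bytes / latency_ns   # Little's law
    return min(peak, queue_limit)

# Hypothetical numbers: 80 ns latency, one 64-byte line every 5 ns.
print(required_entries(80, 5))           # entries needed to keep the bus busy
print(achievable_bandwidth(10, 80, 5))   # too few entries: the bus idles periodically
print(achievable_bandwidth(16, 80, 5))   # enough entries: limited by the bus only
```

With too few entries the model reproduces the recurring pattern from the figure: a burst of transfers followed by an idle bus until the next entry is released.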
(a) one thread, 256 bit loads (b) one thread, 256 bit stores
Figure 5.21: Xeon E5-2670—indicators for bandwidth-boundedness: The benchmarks do not include
any arithmetic instructions that could overlap with the stall cycles caused by the memory accesses.
Therefore, the CYCLE_ACTIVITY:CYCLES_NO_DISPATCH event provides a good estimate for
the total number of cycles the execution is stalled due to memory accesses.
The remaining events depicted in Figure 5.21 show no significant event rates in case of latency
measurements. Thus they can be used to detect bandwidth-boundedness. Unfortunately, none of
the events counts the number of stall cycles accurately. L1D_PEND_MISS:FB_FULL and OFF-
CORE_REQUESTS_BUFFER:SQ_FULL do not cover bandwidth bound accesses in the L2 cache,
but show high event rates in case of load and store misses in the L2 cache. Unfortunately, even
their sum is lower than the total number of stall cycles caused by the resulting L3 and memory
accesses. However, it is the best available estimate for bandwidth bound loads. The event RE-
SOURCE_STALLS:LD_SB does not capture load accesses but correlates well with the stall cycles caused
by stores. Thus, RESOURCE_STALLS:LD is not useful to detect memory bound loads while RE-
SOURCE_STALLS:SB provides good estimates in case of stores. Unfortunately, it also overlaps with
CYCLE_ACTIVITY:STALLS_L1D_PENDING in case of mixed memory accesses (not depicted). That is, the
event CYCLE_ACTIVITY:STALLS_L1D_PENDING includes cycles that are stalled for other reasons if
there are loads outstanding at the time.
Based on the observations presented above, the stall cycles can be decomposed as shown in Figure 5.22.
The memory bound fraction is determined as the maximum of the load related
(CYCLE_ACTIVITY:STALLS_L1D_PENDING) and the store related (RESOURCE_STALLS:SB) stall cycles
due to the potential overlap in mixed workloads. However, this can lead to an underestimation of the
memory-boundedness if a measurement interval includes discrete load bound and store bound phases.
The bandwidth bound metric also uses the maximum as the events that capture loads also include RFOs
caused by stores as depicted in Figure 5.21b. The misprediction of branches is not considered here, i.e.,
“productive cycles” does not mean that the processed instructions are on the correct path. Ineffective
speculation can be detected as described in [Int14a, Appendix B.3.2]. Front-end stalls, including
instruction cache misses, are also not covered by the approach presented here. The address translation
overhead is not explicitly listed but can be determined using the events
DTLB_LOAD_MISSES:WALK_DURATION (see Figure 5.18) and DTLB_STORE_MISSES:WALK_DURATION.
category              estimate
active cycles         CPU_CLK_UNHALTED
productive cycles     CPU_CLK_UNHALTED – CYCLE_ACTIVITY:CYCLES_NO_DISPATCH
stall cycles          CYCLE_ACTIVITY:CYCLES_NO_DISPATCH
memory bound          max(RESOURCE_STALLS:SB, CYCLE_ACTIVITY:STALLS_L1D_PENDING)
bandwidth bound       max(RESOURCE_STALLS:SB, L1D_PEND_MISS:FB_FULL + OFFCORE_REQUESTS_BUFFER:SQ_FULL)
latency bound         memory bound – bandwidth bound
other stall reason    stall cycles – memory bound
Figure 5.22: Decomposition of stall cycles: The active cycles can be divided into productive cycles
and stall cycles based on the number of CYCLE_ACTIVITY:CYCLES_NO_DISPATCH events. The
stalls can be categorized as memory bound and caused by other stall reasons. However, the shown
estimate for the memory bound fraction is conservative as it assumes maximal overlap of the stall
cycles that are caused by blocked stores and the stall cycles with outstanding loads.
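The decomposition in Figure 5.22 can be sketched as follows. The function and field names are illustrative, the counter values in the example are made up, and the clamping of negative differences to zero is an added assumption that the text does not prescribe.

```python
from dataclasses import dataclass

@dataclass
class StallBreakdown:
    active: float
    productive: float
    stall: float
    memory_bound: float
    bandwidth_bound: float
    latency_bound: float
    other: float

def decompose(clk_unhalted, no_dispatch, stalls_l1d_pending,
              stalls_sb, fb_full, sq_full) -> StallBreakdown:
    """Split the active cycles according to Figure 5.22 (Sandy Bridge events)."""
    memory = max(stalls_sb, stalls_l1d_pending)
    bandwidth = max(stalls_sb, fb_full + sq_full)
    return StallBreakdown(
        active=clk_unhalted,
        productive=clk_unhalted - no_dispatch,
        stall=no_dispatch,
        memory_bound=memory,
        bandwidth_bound=bandwidth,
        # Assumption: clamp at zero because some events may over-count.
        latency_bound=max(0, memory - bandwidth),
        other=max(0, no_dispatch - memory),
    )

# Made-up counter readings for one measurement interval.
b = decompose(clk_unhalted=1000, no_dispatch=600, stalls_l1d_pending=450,
              stalls_sb=100, fb_full=200, sq_full=150)
print(b)
```

Taking the maximum instead of the sum mirrors the conservative choice discussed above: cycles counted by both the load related and the store related events are attributed only once.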
5.3.2.2 Applicability for Succeeding Processor Generation
Figure 5.23 depicts performance counters that measure the number of stalls caused by loads on the
Haswell based Xeon E5-2680 v3 processors. The CYCLE_ACTIVITY:L1D_PENDING as well as the
added CYCLE_ACTIVITY:LDM_PENDING event provide a good estimate in case of the unmodified
latency benchmark. If independent operations are added, the number of stall cycles reported by both
events reduces as expected (not depicted). However, if multiplications with data dependencies are
added, the number of stall cycles reported by CYCLE_ACTIVITY:LDM_PENDING increases, i.e., it
partially includes the processing time like the CYCLE_ACTIVITY:CYCLES_NO_DISPATCH event on
Sandy Bridge. Therefore, CYCLE_ACTIVITY:L1D_PENDING is more suitable to measure the ac-
tual delay caused by the memory hierarchy. It works even better than on Sandy Bridge, as the
overestimation of remote cache accesses (see Figure 5.18d) has disappeared. The reliability of the
MEM_UOPS_RETIRED event has also improved. The :L3_HIT and :L3_MISS sub-events accurately
count loads that encounter the L3 and memory latency, respectively. However, the LLC_MISS_RETIRED
event still does not allow local and remote memory accesses to be differentiated. The methodology for
the decomposition of the stall cycles also has to be adapted for Haswell based processors. The
event CYCLE_ACTIVITY:CYCLES_NO_DISPATCH has disappeared. It can be replaced with
CYCLE_ACTIVITY:CYCLES_NO_EXECUTE.
(a) unmodified latency benchmark (b) 24 data dependent multiplications per load
Figure 5.23: Xeon E5-2680 v3—correlation between performance counters and memory latency: The
newly introduced CYCLE_ACTIVITY:LDM_PENDING event perfectly matches the latency in case
of the unmodified latency benchmark (left). However, the reported number of stall cycles caused by
loads increases significantly if multiplications are added that work on the pointer (right). In contrast,
the event ratios reported by CYCLE_ACTIVITY:L1D_PENDING remain on the expected levels.
5.4 Identification of Limiting Resources in Parallel Applications
As shown in Section 5.2, there are various possible causes for suboptimal performance. In order to per-
form targeted optimizations one needs to know which resources are the limiting factors. This section
demonstrates how the meaningful performance counter events identified in Section 5.3 can be used to-
gether with existing performance analysis tools in order to examine applications regarding their memory-
boundedness. This survey uses Score-P and Vampir, which is essentially motivated by the fact that I am
familiar with their usage. Furthermore, Vampir’s custom metrics [BW13] are ideally suited to visualize
the utilization metrics that are derived from multiple event ratios. However, other performance analy-
sis tools that support the collection of hardware performance counters alongside the performance data
(see Section 2.6.3.4) can also be used for this purpose.
A comprehensive characterization of a representative selection of applications is beyond the scope of this
work. The purpose of this study is only to show that the utilization metrics defined in Section 5.3 enable
the differentiation and quantification of the performance impact that is caused by the various components
of the memory hierarchy. Thus, only a small selection of benchmarks—which produce reasonably sized
traces without elaborate filtering—is considered. The benchmarks are part of the SPEC OMP2012 suite
and demonstrate good scaling (350.md, 372.smithwa), moderate scaling (351.bwaves, 370.mgrid331),
and poor scaling (363.swim) on multi-core processors (see [Mül+12, Figure 3(a)]). The benchmarks are
performed on all 16 cores in the dual-socket Xeon E5-2670 system (see Section 4.1.3) using one thread
per core. In Section 5.4.1 the fraction of the runtime that is spent waiting for the memory hierarchy
is determined. Section 5.4.2 and Section 5.4.3 illustrate how Vampir’s custom metrics can be used to
analyze the bandwidth usage per core and per processor, respectively.
5.4.1 Determining the Degree of Memory-boundedness
In order to determine the memory bound fraction of the runtime, the metrics defined in Section 5.3.2.1
are implemented as custom metrics in Vampir. Figure 5.24 shows the definition of the memory bound
metric as depicted in Figure 5.22. This is a conservative measure that is chosen because of the overlap
of RESOURCE_STALLS:SB and CYCLE_ACTIVITY:STALLS_L1D_PENDING in case of mixed load
and store instructions. However, using the sum instead would result in a high likelihood of
overestimating the degree of memory-boundedness. Since the operating frequency is set to 2600 MHz,
CPU_CLK_UNHALTED constantly reports 2.6 billion events per second and can therefore be replaced
by a constant in the metric definitions12. The total stall cycles and other stall reason metrics are defined
in a similar manner. Figure 5.25 visualizes a trace of the 363.swim benchmark, which is known to be
memory bound [Mül+04; Mül+12]. Most of the time is spent in four parallel regions that show a
repetitive pattern for almost the whole runtime. With 88% in one and around 60% in the other three regions,
the degree of memory-boundedness is very high. However, there is also a significant portion of other
stall reasons in three phases.
Figure 5.24: Custom metric for memory-boundedness in Vampir: Memory bound cycles are determined
as maximum of the cycles that are blocked by stores and the stall cycles with outstanding loads.
The event RESOURCE_STALLS:SB can report event rates higher than the total number of stall
cycles (see Figure 5.21b). Therefore, CYCLE_ACTIVITY:CYCLES_NO_DISPATCH is used as upper
bound. The derived number of memory bound cycles per second is divided by clockrate [Hz]/100
in order to convert the result to a percentage of the active cycles.
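The computation described for Figure 5.24 can be restated as a short sketch. This is not Vampir's custom metric syntax; it is a plain reformulation with illustrative names that assumes event rates in events per second and the fixed clock rate of 2600 MHz mentioned above. The example rates are made up.

```python
CLOCKRATE_HZ = 2.6e9  # fixed operating frequency, used in place of CPU_CLK_UNHALTED

def memory_bound_percent(stalls_sb_rate: float, stalls_l1d_pending_rate: float,
                         no_dispatch_rate: float,
                         clockrate_hz: float = CLOCKRATE_HZ) -> float:
    """Memory bound cycles as a percentage of the active cycles.

    All rates are given in events per second.
    """
    memory_bound = max(stalls_sb_rate, stalls_l1d_pending_rate)
    # RESOURCE_STALLS:SB can exceed the stall cycles, so CYCLES_NO_DISPATCH caps it.
    memory_bound = min(memory_bound, no_dispatch_rate)
    return memory_bound / (clockrate_hz / 100)

# Made-up rates: the store related event exceeds the total number of stall cycles.
print(memory_bound_percent(2.4e9, 1.0e9, 2.288e9))
```

Replacing CPU_CLK_UNHALTED by a constant frees one of the four counter slots, at the price of restricting the analysis to fixed frequency scenarios.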
Figure 5.26 depicts the memory-boundedness and the bandwidth-boundedness for the whole runtime
of the five selected benchmarks. 350.md does not show any sign of being limited by the performance
of the memory hierarchy. However, it does show high rates of around 44% for stalls caused by other
reasons, which may include delays caused by accesses to the L1 cache. The remaining benchmarks
spend a significant fraction of their respective runtime waiting for the memory hierarchy. According to
the selected performance counter events, the stalls are mainly bandwidth bound. The bandwidth bound
fraction of 351.bwaves is estimated higher than the total degree of memory-boundedness. Presumably,
the included OFFCORE_REQUESTS_BUFFER:SQ_FULL event registers all cycles in which the queue
is full, not only the ones where no instructions are dispatched. Unfortunately, the upper bound defined
by the memory bound metric cannot be enforced as the required events cannot be counted concurrently.
For 363.swim the memory bound and bandwidth bound metrics are almost identical, i.e., all memory
related stall cycles can be attributed to bandwidth limitations. For 370.mgrid331 and 372.smithwa the
bandwidth bound metric is lower than the memory bound fraction. The difference can be attributed to
the memory access latency.
Figure 5.25: Memory-boundedness, 363.swim: This benchmark has a constantly high rate of stall
cycles, most of which are identified as being memory related. The event ratios of up to 25% for
the other stall reasons may seem high. However, this includes delays caused by accesses to the L1
cache (which are not covered by the selected performance counter events) as well as stalls due to
data dependencies (see Figure 5.18d) and front-end stalls [Int14a, Appendix B.3.2].
12 This is required for some metrics as only four events can be counted concurrently via PAPI on Sandy Bridge. The error
introduced by this simplification is negligible. However, it restricts the analysis to fixed frequency scenarios.
(a) 350.md, total stall cycles (b) 350.md, memory bound (c) 350.md, bandwidth bound
(d) 351.bwaves, total stall cycles (e) 351.bwaves, memory bound (f) 351.bwaves, bandwidth bound
(g) 363.swim, total stall cycles (h) 363.swim, memory bound (i) 363.swim, bandwidth bound
(j) 370.mgrid331, total stall cycles (k) 370.mgrid331, memory bound (l) 370.mgrid331, bandwidth bound
(m) 372.smithwa, total stall cycles (n) 372.smithwa, memory bound (o) 372.smithwa, bandwidth bound
Figure 5.26: Bandwidth-boundedness of selected SPEC OMP2012 benchmarks: The diagrams show
the performance data for the respective total runtime of the benchmarks. The total stall cycles and
memory bound metrics are collected in a single run of each benchmark. The counter timelines for
the bandwidth bound metric are taken from another trace since the required events cannot be counted
together with those required for the memory bound metric13. The reported values can be higher than
the memory bound cycles, as the included OFFCORE_REQUESTS_BUFFER:SQ_FULL event may
report higher numbers than there are stall cycles with outstanding loads.
5.4.2 Resource Usage per Core
The indicators for the used bandwidth defined in Section 5.3.1 have also been implemented as custom
metrics in Vampir with the exception of the SourceL3 metric. The SourceL3 metric cannot be observed
as it is derived from more events than can be counted concurrently. The utilization of the L1 cache can
only be displayed in percent of the maximal number of load and store accesses as depicted in Figure 5.27a
and 5.27b, respectively. The L2 cache, L3 cache, and main memory read bandwidths can be displayed in
GB/s (not depicted) as well as in percent of the maximal number of events per second for their respective
indicator (see Table 5.4) as depicted in Figure 5.27c – 5.27f. The per core write bandwidth can only be
shown for the L2 and L3 cache.
The 370.mgrid331 benchmark depicted in Figure 5.27 has a high degree of memory-boundedness, i.e.,
many stall cycles are caused by memory accesses (see Figure 5.26k). A significant portion of the stall
cycles is caused by a lack of entries in one of the request queues and is therefore categorized as bandwidth
bound. This does not cover loads from the L2 cache (see Figure 5.21a). Based on the ReadL2 and
SourceL2 metrics, around 75% of the L1 misses find the required cache line in the L2 cache. This
leads to a utilization of the L2 bandwidth of around 25%, which is more than a latency bound workload
(one access every 12 cycles) would achieve. Thus, the degree of bandwidth-boundedness is presumably
underestimated in Figure 5.26l. Consequently, a high utilization of the available bandwidth in at least
one level of the memory hierarchy is to be expected. However, this does not show if the per core limits
determined in Section 5.3.1 are used as reference.
13 At most four events can be counted concurrently. Furthermore, the CYCLE_ACTIVITY:STALLS_L1D_PENDING and
L1D_PEND_MISS:FB_FULL events cannot be counted together.
(a) L1D load accesses (b) L1D store accesses (c) L2 cache read bandwidth
(d) L3 cache read bandwidth (e) local DRAM read bandwidth (f) remote DRAM read bandwidth
Figure 5.27: Bandwidth utilization—370.mgrid331 (3.9 s zoom view): In spite of the high degree of
memory-boundedness (see Figure 5.26k) the utilization of the available bandwidths is rather low.
Custom metrics that illustrate data transfers between cores (see Section 5.3.1.3) are available as well.
However, the number of L3 hits that snoop another core, the number of cache lines forwarded from
the second socket, and the number of remote invalidations are insignificant in the selected benchmarks.
Consequently, the ratio of snoop misses is close to 100%14. That is, most of the snoop requests that are
sent to the other socket by the coherence protocol have no effect. Figure 5.28 shows the local and remote
memory bandwidth during the execution of the 351.bwaves benchmark. These metrics can be used to
determine the fraction of local memory accesses. Out of the five selected benchmarks only 350.md shows
critical rates of remote memory accesses. However, due to the low degree of memory-boundedness this
does not have a significant impact on the performance. The remaining four benchmarks request more
than 90% of the cache lines from local memory.
Figure 5.28: NUMA awareness—351.bwaves: The Sourcemem–local and Sourcemem–remote metrics
can be used to derive the fraction of local memory accesses as a measure of the application’s NUMA
awareness. In the 351.bwaves benchmark around 90% of the accessed data is read from local memory.
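The NUMA awareness measure from Figure 5.28 amounts to a simple ratio. The sketch below uses an illustrative function name and made-up event rates; it assumes that both source metrics are given in the same unit (e.g., cache lines per second) and, as a further assumption, returns zero for intervals without memory accesses.

```python
def local_fraction(source_mem_local: float, source_mem_remote: float) -> float:
    """Fraction of the memory accesses that are served from local DRAM."""
    total = source_mem_local + source_mem_remote
    if total == 0:
        return 0.0  # assumption: no memory accesses in this interval
    return source_mem_local / total

# Made-up rates in cache lines per second.
print(f"{local_fraction(9.0e6, 1.0e6):.0%}")
```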
14 OFFCORE_RESPONSE_0:ANY_DATA:ANY_RFO:LLC_MISS_LOCAL:LLC_MISS_REMOTE:SNP_MISS /
OFFCORE_RESPONSE_0:ANY_DATA:ANY_RFO:LLC_MISS_LOCAL:LLC_MISS_REMOTE:SNP_MISS:SNP_FWD:SNP_NOT_NEEDED:HITM
(SNP_ANY cannot be used as reference due to the high ratio of SNP_NONE events)
5.4.3 Utilization of Shared Resources
The integrated PAPI support in Score-P only comprises the per core PMUs. Adding uncore coun-
ters to SCOREP_METRIC_PAPI results in PAPI initialization errors. Therefore, the Uncore Perfor-
mance Events Counter Plugin15—an extended version of the uncore plugin presented in [Wer14, Sec-
tion 4.2.1]—is used to capture uncore events. The plugin records the uncore performance counters asyn-
chronously in 100 ms intervals, i.e., the performance counter readings are not aligned with the entries
and exits of parallel regions. Therefore, this analysis is only useful for regions that have a sufficiently
long runtime.
The custom metrics for the utilization of the shared resources are defined according to Table 5.5. Fig-
ure 5.29 depicts the DRAM bandwidth utilization of 351.bwaves for one of the processors. The read
bandwidth is already close to the maximum. However, DRAM does not support concurrent reads and
writes. Therefore, the combined bandwidth needs to be considered. Using this metric, 351.bwaves
uses up to 94.5% of the measured peak bandwidth. This explains why the per core bandwidth shown
in Figure 5.28 is limited to around 40% of the single-threaded read bandwidth as the maximum de-
fined in Table 5.4 does not consider the saturation of shared resources. 363.swim, 370.mgrid331, and
372.smithwa also show high degrees of memory bandwidth utilization of around 85%. Likewise, their
per-core bandwidth usage is far from the maximum that a single thread can achieve.
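How a combined utilization figure of this kind can be derived from asynchronously sampled counters is sketched below. The sketch makes several assumptions that are not taken from Table 5.5: each counted event corresponds to one 64-byte cache line transfer, the counters are cumulative, and the counter readings as well as the peak bandwidth are made up.

```python
CACHE_LINE_BYTES = 64      # assumption: one event per 64-byte transfer
SAMPLE_INTERVAL_MS = 100   # the plugin samples in 100 ms intervals

def bandwidth_gbs(count_start: int, count_end: int,
                  interval_ms: int = SAMPLE_INTERVAL_MS) -> float:
    """Average bandwidth in GB/s between two cumulative counter readings."""
    transferred_bytes = (count_end - count_start) * CACHE_LINE_BYTES
    return transferred_bytes * 1000 / interval_ms / 1e9

def combined_utilization(read_gbs: float, write_gbs: float, peak_gbs: float) -> float:
    """Utilization in percent based on the sum of read and write bandwidth."""
    return 100.0 * (read_gbs + write_gbs) / peak_gbs

# Made-up readings of hypothetical read and write counters, taken 100 ms apart.
read = bandwidth_gbs(0, 50_000_000)
write = bandwidth_gbs(0, 12_500_000)
print(read, write, round(combined_utilization(read, write, peak_gbs=42.0), 1))
```

Because the readings are averaged over the sampling interval, short regions are blurred together, which is why the analysis is only meaningful for sufficiently long regions.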
Figure 5.29: Per package DRAM utilization—351.bwaves: Depicted are counter timelines for the mea-
sured read, write, and combined bandwidth of the first processor. The results for the second processor
are omitted as they are very similar. The display at the bottom shows the utilization of the memory
controller using the performance radar [BW13]. In the red regions the bandwidth usage is close to
the measured peak bandwidth.
15https://github.com/score-p/scorep_plugin_uncore
6 Summary
This thesis introduces x86-membench—an open source micro-benchmarking suite that facilitates the
performance analysis of memory accesses in cache coherent distributed shared memory systems. These
benchmarks are used to perform an in-depth analysis of contemporary multi-processor systems that iden-
tifies potential bottlenecks in the memory hierarchy. Furthermore, a methodology for the identification of
meaningful hardware performance counters is presented that uses the x86-membench micro-benchmarks
to derive metrics for the utilization of individual components in the memory hierarchy as well as memory
related waiting times from performance counter readings. These metrics can then be used to visualize
memory related performance problems in applications.
X86-membench is a versatile benchmark suite for the analysis of the memory hierarchy of complex
shared memory systems. It extends the state-of-the-art in memory performance measurement in various
directions. Table 6.1 summarizes the features of x86-membench and shows what can also be measured
using other customary benchmark suites. While the local memory hierarchy and the impact of remote
memory accesses in NUMA systems are sufficiently covered by existing benchmarks, the performance
of remote cache accesses is not. The data placement mechanism described in Section 3.2 closes this
gap. The cache flush routines enable a precise differentiation of the levels in the memory hierarchy.
Therefore, characteristics that originate from the location of the data can be distinguished from other
influences, e.g., the overhead of the page table based address translation. Furthermore, the coherence
state control mechanism described in Section 3.3 can be used to measure the costs of coherence protocol
transactions. The assembler implementation of the measurement routines (see Section 3.5) leads to very
accurate results. For instance, the measured local cache latencies typically are in full accordance with
the vendor documentation. Moreover, the performance impact of SIMD instructions can be measured
without having to rely on the compiler to properly vectorize the code.
The benchmarks expose potential bottlenecks in the memory hierarchy of multi-core processors and the
interconnection network between the processors [Mol+09; HMN09; Tho11; Mol+11; MHS14; Mol+15].
The obtained results regarding the impact of the coherence states on the characteristics of memory ac-
cesses facilitate the analytical performance modeling of cache coherent shared memory systems [RH13;
LHS13; Put14; PGB14; RH15; RH16]. The benchmarks can also be used to analyze the energy con-
sumption of data transfers and arithmetic operations as well as to evaluate the potential for energy effi-
ciency optimizations as shown in [Mol+10] and [SHM12], respectively. Furthermore, the understanding
of the throughput and power characteristics of data transfers and arithmetic operations has been taken
into account during the development of the processor stress test FIRESTARTER [Hac+13]. Due to the
extensive use of inline assembly, the implementation is tailored to the x86 architecture. However, the
functional principle can be ported to other architectures [Old13].
Table 6.1: Comparison of x86-membench with other established benchmarks: X86-membench provides
a wider functional range for analyzing the memory hierarchy, especially regarding the influence of
the coherence protocol. To be fair, it has to be noted that the other benchmarks also have features that
are not covered by x86-membench for the reasons outlined in Section 3.1.
                   latency / bandwidth                  explicit   instr. throughput    coherence
benchmark suite    local cache   remote     remote      SIMD       with operands in     protocol
                   & memory      memory     cache       support    register / memory    influence
x86-membench       ✓ / ✓         ✓ / ✓      ✓ / ✓       ✓          ✓ / ✓                ✓
BlackjackBench     ✓ / ✓         ✓ / ✓      ✗ / ✓       ✗          ✓ / ✗                ✗
likwid-bench       ✗ / ✓         ✗ / ✓      ✗ / ✗       ✓          ✓ / ✓                ✗
X-Ray, P-Ray       ✓ / ✓         ✓ / ✓      ✗ / ✗       ✗          ✓ / ✗                ✗
lmbench, STREAM    ✓ / ✓         ✓ / ✓      ✗ / ✗       ✗          ✗ / ✗                ✗
The analysis of contemporary shared memory systems, which are the building blocks of the large distributed
memory systems commonly used in HPC, reveals several potential bottlenecks in the
memory hierarchy. It is shown that the memory access latency can exceed the time span that can be bridged by the out-of-order
window, which stalls the execution. Furthermore, the bandwidths that are supported by the lower levels
in the memory hierarchy are typically not sufficient to fully utilize the available computational perfor-
mance. The bandwidth of shared caches and main memory does not necessarily scale linearly with the
number of concurrently operating cores, which can further limit the performance of parallel applications.
Remote accesses are additionally limited by the point-to-point interconnections between the processors,
which are typically not wide enough to fully utilize the remote memory bandwidth. In many cases the
achieved bandwidths cannot be explained by the width of the data paths. This can partially be attributed
to the interaction of the different cache levels [Hof+16]. However, the limited numbers of outstanding re-
quests that are supported at various points in the memory hierarchy—which are restricted by the number
of entries in associated request queues—also have an influence.
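The influence of the limited number of outstanding requests can be estimated with Little's law: the sustained bandwidth cannot exceed the amount of data in flight divided by the access latency. The following Python sketch illustrates this relation. The function names and all numbers (10 outstanding requests, 64 byte cache lines, 80 ns latency, 40 GB/s target) are assumptions chosen for illustration and are not measurements from this thesis.

```python
import math


def max_bandwidth_gb_per_s(outstanding_requests, line_bytes, latency_ns):
    """Little's law bound: bytes in flight divided by latency.

    One byte per nanosecond corresponds to one GB/s (10**9 bytes per second).
    """
    if outstanding_requests < 0 or line_bytes <= 0 or latency_ns <= 0:
        raise ValueError("arguments must be positive (requests may be zero)")
    return outstanding_requests * line_bytes / latency_ns


def requests_needed(target_gb_per_s, line_bytes, latency_ns):
    """Smallest number of concurrent requests that can sustain the target."""
    if target_gb_per_s < 0 or line_bytes <= 0 or latency_ns <= 0:
        raise ValueError("arguments must be positive (target may be zero)")
    return math.ceil(target_gb_per_s * latency_ns / line_bytes)


if __name__ == "__main__":
    # Hypothetical example values, not taken from the measurements.
    print(max_bandwidth_gb_per_s(10, 64, 80.0))   # bound for 10 requests in flight
    print(requests_needed(40.0, 64, 80.0))        # requests required for 40 GB/s
```

With these assumed values, ten requests in flight bound the bandwidth to 8 GB/s regardless of how wide the data path is, which is one way a queue limit rather than the path width can determine the achieved bandwidth.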
The cache coherence protocols have a strong influence on the characteristics of memory accesses. Wait-
ing for snoop responses increases the memory access latency. Furthermore, the coherence traffic con-
sumes bandwidth on the point-to-point interconnects between the processors. Contemporary processors
support various snoop filtering mechanisms to mitigate these effects. However, the filtering mechanisms
have considerable costs. AMD’s HT Assist consumes a substantial amount of precious on-chip memory
to implement fast directory look-ups. In contrast, Intel’s in-memory directory does not use up on-chip
memory but delays remote cache accesses if the snoop request cannot be filtered.
Knowing the maximal achievable performance of individual components in the memory hierarchy is an
essential prerequisite for the detection of memory related performance losses. However, in order to de-
termine the impact of the memory hierarchy on the achieved application performance, the utilization of
the various components during the runtime of the program as well as the waiting times that are caused by
memory accesses have to be measured as well. Therefore, the presented methodology that derives mean-
ingful metrics for the resource utilization and memory related stall cycles from hardware performance
counters is another major contribution of this thesis. It is shown that the performance degradation due to
resource limitations is reflected by certain hardware events in many cases. Unfortunately, the results ob-
tained on one architecture cannot easily be transferred to other architectures as the set of available events
as well as their definition and functionality can differ. For instance, the OFFCORE_RESPONSE
counters that can be used to measure the per-core DRAM bandwidth on the examined Sandy Bridge system
are unreliable on the system with Haswell-based processors. Therefore, it cannot be recommended
to rely on performance counter readings without validating that they are actually working as expected.
X86-membench is ideally suited to perform such validations, which is an important improvement over the
state of the art in performance counter based performance analysis.
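The principle of such a validation can be sketched as follows: a benchmark transfers a known amount of data, and the data volume derived from the counter reading is compared against it. The Python sketch below is a simplified illustration only. The function names, the 5 % tolerance, and the example counts are assumptions and do not reproduce the actual procedure or results of this thesis.

```python
def counter_deviation(event_count, bytes_per_event, expected_bytes):
    """Relative deviation of the counter-derived data volume from the
    data volume that the benchmark is known to have transferred."""
    if expected_bytes <= 0:
        raise ValueError("expected_bytes must be positive")
    measured_bytes = event_count * bytes_per_event
    return (measured_bytes - expected_bytes) / expected_bytes


def is_plausible(event_count, bytes_per_event, expected_bytes, tolerance=0.05):
    """True if the counter agrees with the known volume within the tolerance.

    The default tolerance of 5 % is an arbitrary choice for this sketch.
    """
    deviation = counter_deviation(event_count, bytes_per_event, expected_bytes)
    return abs(deviation) <= tolerance


if __name__ == "__main__":
    known_volume = 2**30  # hypothetical benchmark run that reads 1 GiB
    # Hypothetical readings of an event that counts 64 byte cache lines.
    for reading in (16_800_000, 9_000_000):
        print(reading, is_plausible(reading, 64, known_volume))
```

A reading that is far below the known data volume, as in the second hypothetical case, indicates that the event does not count what its name suggests on the system under test.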
The novel approach for the selection of suitable counters and the determination of their respective pos-
sible range of values facilitates the detection of memory related performance issues. In contrast to the
raw counter values, the derived metrics are easy to understand. However, many events are necessary
to observe the utilization of all the components. Therefore, recording all metrics requires many runs
since only a limited number of events can be counted concurrently. Nevertheless, many performance
problems can be found using the presented visualization of the performance counter data. Furthermore,
the employed methodology verifies the validity of the used performance counters, which is missing in
other sources of information about the performance counter based breakdown of stall cycles [Lev09;
Yas14]; [Int14a, Appendix B.3]. Returning to the initially mentioned challenge of understanding the
causes of limited application scaling within a node, this thesis provides both the tools for establishing
the technically possible upper limits for the performance of various components in the memory hierarchy
and the means for measuring their impact on the achieved application performance.
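The remark that recording all metrics requires many runs can be made concrete with a small sketch. It splits a list of events into groups that fit into the available counters, so that each group corresponds to one run. The function name, the event names, and the counter count are hypothetical. The sketch also assumes that any event can be placed on any counter, which real performance monitoring units do not generally guarantee.

```python
def schedule_events(events, counters_per_run):
    """Split the event list into consecutive groups, one group per run.

    Assumption: every event fits on every counter. Real hardware often
    restricts which events can be combined, which can increase the number
    of runs beyond this lower bound.
    """
    if counters_per_run < 1:
        raise ValueError("at least one counter is required")
    return [events[i:i + counters_per_run]
            for i in range(0, len(events), counters_per_run)]


if __name__ == "__main__":
    # Ten hypothetical events and four programmable counters.
    events = [f"event_{i}" for i in range(10)]
    for run, group in enumerate(schedule_events(events, 4)):
        print(run, group)
```

The number of groups is the number of events divided by the number of counters, rounded up, so it is a lower bound on the number of program runs needed to cover all metrics.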
Bibliography
[AB86] James Archibald and Jean-Loup Baer. “Cache Coherence Protocols: Evaluation Using
a Multiprocessor Simulation Model”. In: ACM Trans. Comput. Syst. 4.4 (Sept. 1986),
pp. 273–298. ISSN: 0734-2071. DOI: 10.1145/6513.6514.
[Adh+10] L. Adhianto et al. “HPCTOOLKIT: tools for performance analysis of optimized paral-
lel programs”. In: Concurrency and Computation: Practice and Experience 22.6 (2010),
pp. 685–701. ISSN: 1532-0634. DOI: 10.1002/cpe.1553.
[AEE12] Osman Allam, Stijn Eyerman, and Lieven Eeckhout. “An Efficient CPI Stack Counter Ar-
chitecture for Superscalar Processors”. In: Proceedings of the Great Lakes Symposium on
VLSI. GLSVLSI ’12. ACM, May 2012, pp. 55–58. ISBN: 978-1-4503-1244-8.
DOI: 10.1145/2206781.2206796.
[Ahn+13] J. H. Ahn et al. “McSimA+: A manycore simulator with application-level+ simulation and
detailed microarchitecture modeling”. In: Performance Analysis of Systems and Software
(ISPASS), 2013 IEEE International Symposium on. Apr. 2013, pp. 74–85. DOI: 10.1109/
ISPASS.2013.6557148.
[Aji+12] A.M. Aji et al. “MPI-ACC: An Integrated and Extensible Approach to Data Movement
in Accelerator-based Systems”. In: 14th International Conference on High Performance
Computing and Communication & 9th International Conference on Embedded Software
and Systems (HPCC-ICESS). IEEE, June 2012, pp. 647–654. DOI: 10.1109/HPCC.2012.
92.
[Aln+90] K. Alnaes et al. “Scalable Coherent Interface”. In: CompEuro ’90. Proceedings of the
1990 IEEE International Conference on Computer Systems and Software Engineering. May
1990, pp. 446–453. DOI: 10.1109/CMPEUR.1990.113656.
[Amd10] AMD Family 10h Server and Workstation Processor Power and Thermal Data Sheet. Re-
vision: 3.19. Publication # 43374. AMD. June 2010. URL: http://support.amd.com/
TechDocs/43374.pdf (visited on 04/02/2016).
[Amd11] Software Optimization Guide For AMD Family 10h and 12h Processors. Revision: 3.13.
Publication # 40546. AMD. Feb. 2011. URL: http://support.amd.com/TechDocs/40546.
pdf (visited on 01/15/2015).
[Amd12a] AMD Opteron™ 6000 Series Platform Product Comparison. AMD, 2012. URL: http://
www.amd.com/Documents/AMD_Opteron_6000_Comparison.pdf (visited on
02/13/2015).
[Amd12b] AMD Server Solutions Playbook - A Comprehensive Guide to the AMD Opteron™ 6000,
4000 and 3000 Series Server Platforms. Tech. rep. Dec. 2012. URL: http://www.amd.com/
Documents/AMD_Opteron_ServerPlaybook.pdf (visited on 02/13/2015).
[Amd12c] Family 15h Models 00h-0Fh AMD Opteron™ Processor Product Data Sheet. Revision:
3.01. Publication # 49687. AMD. Oct. 2012. URL: http://support.amd.com/TechDocs/
49687_15h_Mod_00h-0Fh_Opteron_PDS.pdf (visited on 02/13/2015).
[Amd13a] AMD Core Math Library (ACML) Version 5.3.1. AMD. 2013. URL: http://amd-dev.
wpengine.netdna-cdn.com/wordpress/media/2013/05/acml.pdf (visited on
07/28/2014).
[Amd13b] BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 15h Models 00h-0Fh Pro-
cessors. Revision: 3.14. Publication # 42301. AMD. Jan. 2013. URL: http://support.amd.
com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf (visited on 07/01/2016).
[Amd14a] AMD FirePro™ S9000 High Density Server Graphics. data sheet. AMD, 2014. URL: http://
www.amd.com/Documents/FirePro_S9000_Data_Sheet.pdf (visited on 08/21/2014).
[Amd14b] Compute Cores. white paper. AMD, 2014. URL: www.amd.com/Documents/Compute_
Cores_Whitepaper.pdf (visited on 01/15/2015).
[Amd14c] Software Optimization Guide for AMD Family 15h Processors. Revision: 3.08. Publication
# 47414. AMD. Jan. 2014. URL: http://support.amd.com/TechDocs/47414_15h_sw_
opt_guide.pdf (visited on 05/03/2014).
[Amd15a] AMD64 Architecture Programmer’s Manual Volume 2: System Programming. Revision
3.25. Publication # 24593. AMD. June 2015. URL: http://support.amd.com/TechDocs/
24593.pdf (visited on 02/10/2015).
[Amd15b] AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System In-
structions. Revision 3.22. Publication # 24594. AMD. June 2015. URL: http://support.
amd.com/TechDocs/24594.pdf (visited on 02/10/2015).
[Amd67] Gene M. Amdahl. “Validity of the Single Processor Approach to Achieving Large Scale
Computing Capabilities”. In: Proceedings of the April 18-20, 1967, Spring Joint Computer
Conference. AFIPS ’67 (Spring). ACM, 1967, pp. 483–485. DOI: 10.1145/1465482.1465560.
[ANE98] I. Anjoh, A. Nishimura, and S. Eguchi. “Advanced IC packaging for the future applica-
tions”. In: Electron Devices, IEEE Transactions on 45.3 (Mar. 1998), pp. 743–752. ISSN:
0018-9383. DOI: 10.1109/16.661237.
[Arc11] Andrea Arcangeli. Transparent Hugepage Support. The Linux Foundation - Collabora-
tion Summit. Apr. 2011. URL: https://events.linuxfoundation.org/slides/2011/lfcs/
lfcs2011_hpc_arcangeli.pdf (visited on 07/29/2015).
[Arm13a] big.LITTLE Technology Moves Towards Fully Heterogeneous Global Task Scheduling.
white paper. ARM, 2013. URL: https://www.arm.com/files/pdf/big_LITTLE_
technology_moves_towards_fully_heterogeneous_Global_Task_Scheduling.pdf
(visited on 07/02/2016).
[Arm13b] NEON™ Programmer’s Guide. Version 1.0. ARM DEN0018A (ID071613). ARM. June
2013. URL: https://silver.arm.com/download/download.tm?pv=1439811 (visited on
05/01/2014).
[Arm14] ARM® Cortex®-A57 MPCore Processor Revision: r1p3 Technical Reference Manual. Re-
vision: r1p3. ARM DDI0488G. ARM. 2014. URL: http://infocenter.arm.com/help/
topic/com.arm.doc.ddi0488g/DDI0488G_cortex_a57_mpcore_trm.pdf (visited on
01/15/2016).
[Arm15] ARM® Architecture Reference Manual - ARMv8, for ARMv8-A architecture profile. Beta.
ARM: DDI 0487A.f (ID032515). ARM. Mar. 2015. URL: https://silver.arm.com/
download/download.tm?pv=2113558 (visited on 07/30/2015).
[Asl+01] Vishal Aslot et al. “SPEComp: A New Benchmark Suite for Measuring Parallel Computer
Performance”. In: OpenMP Shared Memory Parallel Programming: International Work-
shop on OpenMP Applications and Tools, WOMPAT 2001. Springer Berlin Heidelberg,
2001. ISBN: 978-3-540-44587-6. DOI: 10.1007/3-540-44587-0_1.
[Bae10] Jean-Loup Baer. Microprocessor Architecture: From Simple Pipelines to Chip Multiproces-
sors. Cambridge University Press, 2010. ISBN: 9780521769921.
[Bai+91] D. H. Bailey et al. “The NAS parallel benchmarks – summary and preliminary results”. In:
Proceedings of the 1991 ACM/IEEE conference on Supercomputing. Supercomputing ’91.
ACM, 1991, pp. 158–165. ISBN: 0-89791-459-7. DOI: 10.1145/125826.125925.
[Bar+08] Bradley J. Barnes et al. “A Regression-based Approach to Scalability Prediction”. In: Pro-
ceedings of the 22nd Annual International Conference on Supercomputing. ICS ’08. ACM,
2008, pp. 368–377. ISBN: 978-1-60558-158-3. DOI: 10.1145/1375527.1375580.
[BC11] Shekhar Borkar and Andrew A. Chien. “The future of microprocessors”. In: Commun. ACM
54.5 (May 2011), pp. 67–77. ISSN: 0001-0782. DOI: 10.1145/1941487.1941507.
[BD97] D. Bhandarkar and J. Ding. “Performance characterization of the Pentium Pro processor”.
In: High-Performance Computer Architecture, 1997., Third International Symposium on.
Feb. 1997, pp. 288–297. DOI: 10.1109/HPCA.1997.569689.
[BDM09] G. Blake, R.G. Dreslinski, and T. Mudge. “A survey of multicore processors”. In: Signal
Processing Magazine, IEEE 26.6 (Nov. 2009), pp. 26–37. ISSN: 1053-5888. DOI: 10.1109/
MSP.2009.934110.
[Ben+13] Zakaria Bendifallah et al. “PAMDA: Performance Assessment Using MAQAO Toolset and
Differential Analysis”. In: Proceedings of the 7th International Workshop on Parallel Tools
for High Performance Computing. Sept. 2013, pp. 107–127. DOI: 10.1007/978-3-319-
08144-1_9.
[BGB98] Luiz André Barroso, Kourosh Gharachorloo, and Edouard Bugnion. “Memory System
Characterization of Commercial Workloads”. In: SIGARCH Comput. Archit. News 26.3
(Apr. 1998), pp. 3–14. ISSN: 0163-5964. DOI: 10.1145/279361.279363.
[BGH12] G. Bauer, S. Gottlieb, and T. Hoefler. “Performance Modeling and Comparative Analysis
of the MILC Lattice QCD Application su3_rmd”. In: Cluster, Cloud and Grid Computing
(CCGrid), 2012 12th IEEE/ACM International Symposium on. May 2012, pp. 652–659.
DOI: 10.1109/CCGrid.2012.123.
[BH00] Bryan Buck and Jeffrey K. Hollingsworth. “An API for Runtime Code Patching”. In: Inter-
national Journal of High Performance Computing Applications 14.4 (Nov. 2000), pp. 317–
329. ISSN: 1094-3420. DOI: 10.1177/109434200001400404.
[Bru07] Holger Brunst. “Integrative Concepts for Scalable Distributed Performance Analysis and
Visualization of Parallel Programs”. PhD thesis. Technische Universität Dresden, 2007.
URL: http://www.shaker.de/de/content/catalogue/index.asp?lang=de&ID=8&
ISBN=978-3-8322-6990-6 (visited on 03/11/2016).
[BT09] Vlastimil Babka and Petr Tůma. “Investigating Cache Parameters of x86 Family Proces-
sors”. In: Computer Performance Evaluation and Benchmarking. Vol. 5419. Lecture Notes
in Computer Science. Springer Berlin Heidelberg, 2009, pp. 77–96. ISBN: 978-3-540-
93798-2. DOI: 10.1007/978-3-540-93799-9_5.
[Bul13] bullx DLC blade system - B700 series. data sheet. Bull SAS, 2013. URL: http://www.bull.
com/extreme-computing/download/S-bullxB700-en5.pdf (visited on 08/20/2014).
[Bul14] bullx R421 E4 accelerated server. data sheet. Bull SAS, 2014. URL: http://www.bull.com/
extreme-computing/download/S-bullxR421-en3.pdf (visited on 02/13/2015).
[Bur+14] E.A. Burton et al. “FIVR – Fully integrated voltage regulators on 4th generation Intel Core
SoCs”. In: Applied Power Electronics Conference and Exposition (APEC), 2014 Twenty-
Ninth Annual IEEE. Mar. 2014, pp. 432–439. DOI: 10.1109/APEC.2014.6803344.
[But+11] M. Butler et al. “Bulldozer: An Approach to Multithreaded Compute Performance”. In:
Micro, IEEE 31.2 (2011), pp. 6–15. ISSN: 0272-1732. DOI: 10.1109/MM.2011.23.
[But+91] Michael Butler et al. “Single Instruction Stream Parallelism is Greater Than Two”. In:
SIGARCH Comput. Archit. News 19.3 (Apr. 1991), pp. 276–286. ISSN: 0163-5964. DOI:
10.1145/115953.115980.
[BW13] Holger Brunst and Matthias Weber. “Custom Hot Spot Analysis of HPC Software with the
Vampir Performance Tool Suite”. In: Proceedings of the 6th International Parallel Tools
Workshop. Springer Berlin Heidelberg, 2013, pp. 95–114. DOI: 10.1007/978-3-642-37349-7_7.
[BW89] Jean-Loup Baer and Wen-Hann Wang. “Multilevel cache hierarchies: Organizations, pro-
tocols, and performance”. In: Journal of Parallel and Distributed Computing 6.3 (1989),
pp. 451–476. ISSN: 0743-7315. DOI: 10.1016/0743-7315(89)90001-4.
[Cen10] L. Cen. “Two-hop cache coherency protocol”. US Patent 7,822,929. Oct. 2010. URL: https:
//www.google.com.ar/patents/US7822929 (visited on 01/21/2016).
[Cep13] Shannon Cepeda. Using Intel® VTune™ Amplifier XE to Tune Software on the Intel®
Xeon® Processor E5 Family. 2013. URL: https://software.intel.com/sites/default/files/
article/380498/using-intel-vtune-amplifier-xe-on-xeon-e5-family-1.0.pdf (visited on
02/23/2016).
[CH07] Pat Conway and Bill Hughes. “The AMD Opteron Northbridge Architecture”. In: IEEE
Micro 27.2 (Mar. 2007), pp. 10–21. ISSN: 0272-1732. DOI: 10.1109/MM.2007.43.
[Che+13] An Ding Chen et al. IBM Power 795 (9119-FHB) Technical Overview and Introduction.
IBM Redbooks. IBM Form Number REDP-4640-01. International Technical Support Or-
ganization, Feb. 2013. ISBN: 9780738451152. URL: http://www.redbooks.ibm.com/
redpapers/pdfs/redp4640.pdf (visited on 07/01/2016).
[Che04] Wai Kai Chen. The Electrical Engineering Handbook. Academic Press, 2004. ISBN: 978-0-
12-170960-0.
[CHE11] Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. “Sniper: Exploring the Level of
Abstraction for Scalable and Accurate Parallel Multi-Core Simulations”. In: International
Conference for High Performance Computing, Networking, Storage and Analysis (SC). Nov.
2011. ISBN: 978-1-4503-0771-0. DOI: 10.1145/2063384.2063454.
[CL12] Patrick Conway and Kevin M. Lepak. “Snoop filtering mechanism”. US Patent 8,185,695
B2. May 2012. URL: https://www.google.com/patents/US8185695 (visited on
05/16/2014).
[Con+10] Pat Conway et al. “Cache Hierarchy and Memory Subsystem of the AMD Opteron Proces-
sor”. In: IEEE Micro 30.2 (Mar. 2010), pp. 16–29. ISSN: 0272-1732. DOI: 10.1109/MM.
2010.31.
[Con10] Patrik N. Conway. “Method and apparatus for detecting and tracking private pages in a
shared memory multiprocessor”. US Patent 7,669,011. Feb. 2010. URL: https://www.
google.ch/patents/US7669011 (visited on 07/02/2016).
[Cra10] Cray XE6 brochure. data sheet. Cray, 2010. URL: http://www.cray.com/Assets/PDF/
products/xe/CrayXE6Brochure.pdf (visited on 08/20/2014).
[Cur+07] Matthew Curtis-Maury et al. “Identifying energy-efficient concurrency levels using machine
learning”. In: Cluster Computing, 2007 IEEE International Conference on. IEEE. Sept.
2007, pp. 488–495. DOI: 10.1109/CLUSTR.2007.4629274.
[Cur+08] Matthew Curtis-Maury et al. “Prediction models for multi-dimensional power-performance
optimization on many cores”. In: Proceedings of the 17th international conference on Par-
allel architectures and compilation techniques. PACT ’08. ACM, 2008, pp. 250–259. ISBN:
978-1-60558-282-5. DOI: 10.1145/1454115.1454151.
[Dan+13] Anthony Danalis et al. “BlackjackBench: Portable Hardware Characterization with Au-
tomated Results’ Analysis”. In: The Computer Journal (2013). DOI: 10.1093/comjnl/bxt057.
[DAS12] M. Dubois, M. Annavaram, and P. Stenström. Parallel Computer Organization and De-
sign. Parallel Computer Organization and Design. Cambridge University Press, 2012. ISBN:
9780521886758.
[Del12a] Dell PowerEdge R510. spec sheet. Dell, 2012. URL: http://www.dell.com/downloads/
global/products/pedge/R510_Spec_Sheet.pdf (visited on 01/15/2015).
[Del12b] Dell PowerEdge R720. spec sheet. Dell, 2012. URL: http://www.dell.com/downloads/
global/products/pedge/dell-poweredge-r720-spec-sheet.pdf (visited on 01/15/2015).
[Del12c] Dell Precision T7600. spec sheet. Dell, 2012. URL: http://www.dell.com/downloads/
global/products/precn/en/Dell-Precision-T7600-Spec-Sheet.pdf (visited on
01/15/2015).
[Dell12] John Beckett. BIOS Performance and Power Tuning Guidelines for Dell PowerEdge 12th
Generation Servers. Version 1.0. Dell Inc. Dec. 2012. URL: http://en.community.dell.
com/cfs-file/__key/telligent-evolution-components-attachments/13-4491-00-00-
20-24-87-40/12g_5F00_bios_5F00_tuning_5F00_for_5F00_performance_5F00_
power.pdf (visited on 11/23/2015).
[Din+10] James Dinan et al. “Hybrid Parallel Programming with MPI and Unified Parallel C”. In:
Proceedings of the 7th ACM International Conference on Computing Frontiers. CF ’10.
ACM, 2010, pp. 177–186. ISBN: 978-1-4503-0044-5. DOI: 10.1145/1787275.1787323.
[DLP03] Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet. “The LINPACK Benchmark: past,
present and future”. In: Concurrency and Computation: Practice and Experience 15.9
(2003), pp. 803–820. ISSN: 1532-0634. DOI: 10.1002/cpe.728.
[DMN12] J. Diaz, C. Munoz-Caro, and A. Nino. “A Survey of Parallel Programming Models and
Tools in the Multi and Many-Core Era”. In: Parallel and Distributed Systems, IEEE Trans-
actions on 23.8 (Aug. 2012), pp. 1369–1386. ISSN: 1045-9219. DOI: 10.1109/TPDS.2011.308.
[DOE10] U.S. Department of Energy. Challenges in exascale computing. SOS ’14. Mar. 2010. URL:
http://www.csm.ornl.gov/workshops/SOS14/documents/dosanjh_pres.pdf (visited
on 01/20/2016).
[Don+11] Jack Dongarra et al. “The International Exascale Software Project roadmap”. In: Interna-
tional Journal of High Performance Computing Applications 25.1 (2011), pp. 3–60. DOI:
10.1177/1094342010391989.
[Dor+07] J. Dorsey et al. “An Integrated Quad-Core Opteron Processor”. In: Solid-State Circuits Con-
ference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International. Feb. 2007,
pp. 102–103. DOI: 10.1109/ISSCC.2007.373608.
[Dro07] Paul J. Drongowski. Instruction-Based Sampling: A New Performance Analysis Technique
for AMD Family 10h Processors. Tech. rep. AMD, Nov. 2007. URL: http://developer.amd.
com/wordpress/media/2012/10/AMD_IBS_paper_EN.pdf (visited on 07/01/2016).
[Duc+08] Alexandre X. Duchateau et al. “P-Ray: A suite of micro-benchmarks for multi-core archi-
tectures”. In: Proc. 21st Intl. Workshop on Languages and Compilers for Parallel Com-
puting (LCPC’08). Vol. 5335. 2008, pp. 187–201. URL: http://www1.ju.edu.jo/
ecourse/cpestcourse/readings/0808-p-rayasuiteofmicro-benchmarksformulti-
corearchitectures%20.pdf (visited on 07/16/2016).
[ECP96] Marius Evers, Po-Yung Chang, and Yale N. Patt. “Using Hybrid Branch Predictors to Im-
prove Branch Prediction Accuracy in the Presence of Context Switches”. In: SIGARCH
Comput. Archit. News 24.2 (May 1996), pp. 3–11. ISSN: 0163-5964. DOI: 10.1145/232974.232975.
[Eis86] Yoram Eisenstadter. Methods for Performance Evaluation of Parallel Computer Systems.
Tech. rep. Columbia University Academic Commons, 1986. URL: http://hdl.handle.net/
10022/AC:P:11723 (visited on 02/04/2016).
[EM05] Per Ekman and Philip Mucci. Design Considerations for Shared Memory MPI Implementa-
tions on Linux NUMA Systems: An MPICH/MPICH2 Case Study. Advanced Micro Devices,
July 2005. URL: http://icl.cs.utk.edu/~mucci/latest/pubs/AMD-MPI-05.pdf (visited on
01/28/2016).
[Era08] Stéphane Eranian. “What Can Performance Counters Do for Memory Subsystem Anal-
ysis?” In: Proceedings of the 2008 ACM SIGPLAN Workshop on Memory Systems Perfor-
mance and Correctness. MSPC’08. ACM, 2008, pp. 26–30. ISBN: 978-1-60558-049-4. DOI:
10.1145/1353522.1353531.
[Eye+06] Stijn Eyerman et al. “A Performance Counter Architecture for Computing Accurate CPI
Components”. In: Proceedings of the 12th International Conference on Architectural Sup-
port for Programming Languages and Operating Systems. ASPLOS XII. ACM, 2006,
pp. 175–184. ISBN: 1-59593-451-0. DOI: 10.1145/1168857.1168880.
[Eye+09] Stijn Eyerman et al. “A Mechanistic Performance Model for Superscalar Out-of-order Pro-
cessors”. In: ACM Trans. Comput. Syst. 27.2 (May 2009), 3:1–3:37. ISSN: 0734-2071. DOI:
10.1145/1534909.1534910.
[Fan+14] Jianbin Fang et al. “Test-driving Intel Xeon Phi”. In: Proceedings of the 5th ACM/SPEC
International Conference on Performance Engineering. ICPE ’14. ACM, 2014, pp. 137–
148. ISBN: 978-1-4503-2733-6. DOI: 10.1145/2568088.2576799.
[FC07] Wu-chun Feng and K.W. Cameron. “The Green500 List: Encouraging Sustainable Super-
computing”. In: Computer 40.12 (Dec. 2007), pp. 50–55. ISSN: 0018-9162. DOI: 10.1109/
MC.2007.445.
[Fei95] Karl Feind. “Shared memory access (SHMEM) routines”. In: Proceedings of the Cray
User’s Group (CUG). Spring 1995, pp. 303–308. URL: https://cug.org/5-publications/
proceedings_attendee_lists/1997CD/S95PROC/303_308.PDF (visited on
08/12/2015).
[FG06] Karl Fürlinger and Michael Gerndt. “Finding Inefficiencies in OpenMP Applications Au-
tomatically with Periscope”. In: Computational Science - ICCS 2006. Vol. 3992. Lecture
Notes in Computer Science. Springer Berlin Heidelberg, 2006, pp. 494–501. ISBN: 978-3-
540-34381-3. DOI: 10.1007/11758525_67.
[FGD07] Karl Fürlinger, Michael Gerndt, and Jack Dongarra. “Scalability Analysis of the SPEC
OpenMP Benchmarks on Large-Scale Shared Memory Multiprocessors”. In: Computa-
tional Science - ICCS 2007. Vol. 4488. Lecture Notes in Computer Science. Springer Berlin
/ Heidelberg, 2007, pp. 815–822. DOI: 10.1007/978-3-540-72586-2_115.
[Fra+05] F. Franchetti et al. “Efficient Utilization of SIMD Extensions”. In: Proceedings of the IEEE
93.2 (Feb. 2005), pp. 409–425. ISSN: 0018-9219. DOI: 10.1109/JPROC.2004.840491.
[Fri+14] J. Friedrich et al. “The POWER8™ processor: Designed for big data, analytics, and cloud
environments”. In: IC Design Technology (ICICDT), 2014 IEEE International Conference
on. May 2014. DOI: 10.1109/ICICDT.2014.6838618.
[Fuj14] FUJITSU Server PRIMERGY RX4770 M1 Quad socket 4U rack server. data sheet. Fujitsu,
2014. URL: http://globalsp.ts.fujitsu.com/dmsp/Publications/public/ds-py-rx4770-
m1.pdf (visited on 08/19/2014).
[Fuj16] FUJITSU Workstation CELSIUS R940. data sheet. Fujitsu, 2016. URL: http://sp.ts.fujitsu.
com/dmsp/Publications/public/ds-CELSIUS-R940.pdf (visited on 01/12/2016).
[GAS13] GASPI: Global Address Space Programming Interface - Specification of a PGAS API for
communication. Version 1.01. Nov. 2013. URL: http://www.gaspi.de/fileadmin/GASPI/
pdf/GASPI-1.0.1.pdf (visited on 01/28/2016).
[Ge+07] Rong Ge et al. “CPU MISER: A Performance-Directed, Run-Time System for Power-Aware
Clusters.” In: Parallel Processing, 2007. ICPP 2007. International Conference on. IEEE.
2007. DOI: 10.1109/ICPP.2007.29.
[Gee+13] V. Geetha et al. “Improving value of forward state by increasing local caching agent for-
warding”. WO Patent App. PCT/US2012/020,408. July 2013. URL: http://www.google.
com/patents/WO2013103347A1?cl=en (visited on 03/14/2015).
[Gei+10] Markus Geimer et al. “The Scalasca performance toolset architecture”. In: Concurrency and
Computation: Practice and Experience 22.6 (Apr. 2010), pp. 702–719. ISSN: 1532-0626.
DOI: 10.1002/cpe.1556.
[GF09] Brice Goglin and Nathalie Furmento. “Memory migration on next-touch”. In: Linux Sympo-
sium. 2009. URL: https://www.kernel.org/doc/ols/2009/ols2009-pages-101-110.pdf
(visited on 08/07/2015).
[GK07] Michael Gerndt and Edmond Kereku. “Automatic Memory Access Analysis with Peri-
scope”. In: Computational Science - ICCS 2007. Vol. 4488. Lecture Notes in Computer
Science. Springer Berlin Heidelberg, 2007, pp. 847–854. ISBN: 978-3-540-72585-5. DOI:
10.1007/978-3-540-72586-2_119.
[Goc+03] Simcha Gochman et al. “The Intel Pentium M Processor: Microarchitecture and Perfor-
mance”. In: Intel Technology Journal 7.2 (2003), pp. 21–36. ISSN: 1535-864X.
[Gon+10] J. Gonzalez-Dominguez et al. “Servet: A benchmark suite for autotuning on multicore clus-
ters”. In: Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on.
Apr. 2010. DOI: 10.1109/IPDPS.2010.5470358.
[Gor04] Mel Gorman. Understanding The Linux Virtual Memory Manager. Feb. 2004. URL: http:
//www.csn.ul.ie/~mel/docs/vm/guide/pdf/understand.pdf (visited on 01/26/2016).
[Gra+03] Ananth Grama et al. Introduction to Parallel Computing. 2nd edition. Addison-Wesley
Longman Publishing Co., Inc., 2003. ISBN: 9780201648652.
[GSP11] Neil Gunther, Shanti Subramanyam, and Stefan Parvu. A Methodology for Optimizing Mul-
tithreaded System Scalability on Multi-Cores. 2011. URL: http://arxiv.org/pdf/1105.
4301v1.pdf (visited on 02/14/2016).
[Gus88] John L. Gustafson. “Reevaluating Amdahl’s Law”. In: Commun. ACM 31.5 (May 1988),
pp. 532–533. ISSN: 0001-0782. DOI: 10.1145/42411.42415.
[Gwe95] Linley Gwennap. “Intel’s P6 uses decoupled superscalar design”. In: Microprocessor Report
(1995). URL: http://www.cs.cmu.edu/afs/cs/academic/class/15213-f01/docs/mpr-
p6.pdf (visited on 10/15/2015).
[GWM92] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. “Reducing Memory and Traffic Re-
quirements for Scalable Directory-Based Cache Coherence Schemes*”. In: Scalable Shared
Memory Multiprocessors. Springer US, 1992, pp. 167–192. ISBN: 978-1-4615-3604-8. DOI:
10.1007/978-1-4615-3604-8_9.
[Hac+13] Daniel Hackenberg et al. “Introducing FIRESTARTER: A processor stress test utility”. In:
Green Computing Conference (IGCC), 2013 International. 2013. DOI: 10.1109/IGCC.2013.6604507.
[Hac+15] Daniel Hackenberg et al. “An Energy Efficiency Feature Survey of the Intel Haswell Proces-
sor”. In: Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE
International. Eleventh IEEE Workshop on High-Performance, Power-Aware Computing
(HPPAC). May 2015, pp. 896–904. DOI: 10.1109/IPDPSW.2015.70.
[Hag+14] Georg Hager et al. “Exploring performance and power properties of modern multi-core
chips via simple machine models”. In: Concurrency and Computation: Practice and Expe-
rience (2014). ISSN: 1532-0634. DOI: 10.1002/cpe.3180.
[Ham+14] P. Hammarlund et al. “Haswell: The Fourth-Generation Intel Core Processor”. In: Micro,
IEEE 34.2 (Mar. 2014), pp. 6–20. ISSN: 0272-1732. DOI: 10.1109/MM.2014.10.
[Hay02] John P. Hayes. Computer Architecture and Organization. 3rd edition. McGraw-Hill, 2002.
ISBN: 0072861983.
[HC10] Nor Asilah Wati Abdul Hamid and Paul Coddington. “Comparison of MPI Benchmark Pro-
grams on Shared Memory and Distributed Memory Machines (Point-to-Point Communica-
tion)”. In: International Journal of High Performance Computing Applications 24.4 (2010),
pp. 469–483. DOI: 10.1177/1094342010371106.
[Hei+99] M. Heinrich et al. “A quantitative analysis of the performance and scalability of distributed
shared memory cache coherence protocols”. In: Computers, IEEE Transactions on 48.2
(Feb. 1999), pp. 205–217. ISSN: 0018-9340. DOI: 10.1109/12.752662.
[Hew+13] Advanced Configuration and Power Interface Specification. Revision 5.0a. Hewlett-Packard
Corporation et al. Nov. 2013. URL: http://acpi.info/DOWNLOADS/ACPI_5_Errata%
20A.pdf (visited on 07/01/2016).
[Hew14] HP ProLiant ML350e Gen8 v2 Server data sheet. data sheet. Hewlett-Packard, 2014. URL:
http://www8.hp.com/h20195/v2/GetPDF.aspx/4AA5-0651ENW.pdf?ver=0 (visited
on 08/19/2014).
[HF05] C.-H. Hsu and Wu-chun Feng. “A Power-Aware Run-Time System for High-Performance
Computing”. In: Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Confer-
ence. Nov. 2005. DOI: 10.1109/SC.2005.3.
[HG05] Herbert H. J. Hum and James R. Goodman. “Forward state for use in cache coherency in
a multiprocessor system”. US Patent 6922756. July 2005. URL: http://www.google.com/
patents/US6922756 (visited on 07/02/2016).
[HG97] Lei Hu and Ian Gorton. Performance Evaluation for Parallel Systems: A Survey. Tech. rep.
University of New South Wales, 1997. URL: ftp://ftp.cse.unsw.edu.au/pub/doc/papers/
UNSW/9707.pdf (visited on 02/04/2016).
[Hil+10] David L. Hill et al. “The Uncore: A Modular Approach to Feeding the High-performance
Cores”. In: vol. 14. 3. Intel Press, 2010, pp. 30–49. ISBN: 978-1-934053-33-1.
[HJ10] Kai Hwang and Naresh Jotwani. Advanced Computer Architecture. 2nd edition. McGraw-
Hill, 2010. ISBN: 9780070702103.
[HLK97] Cristina Hristea, Daniel Lenoski, and John Keen. “Measuring Memory Hierarchy Perfor-
mance of Cache-coherent Multiprocessors Using Micro Benchmarks”. In: Proceedings of
the 1997 ACM/IEEE Conference on Supercomputing. SC ’97. ACM, 1997. ISBN: 0-89791-
985-8. DOI: 10.1145/509593.509638.
[HM08] M.D. Hill and M.R. Marty. “Amdahl’s Law in the Multicore Era”. In: Computer 41.7 (July
2008), pp. 33–38. ISSN: 0018-9162. DOI: 10.1109/MC.2008.209.
[HMN09] D. Hackenberg, D. Molka, and W. E. Nagel. “Comparing cache architectures and coherency
protocols on x86-64 multicore SMP systems”. In: MICRO 42: Proceedings of the 42nd
Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2009, pp. 413–
422. ISBN: 978-1-60558-798-1.
[Hoe+13] Torsten Hoefler et al. “MPI + MPI: a new hybrid approach to parallel programming with
MPI plus shared memory”. In: Computing 95.12 (2013), pp. 1121–1136. ISSN: 1436-5057.
DOI: 10.1007/s00607-013-0324-2.
[Hof+16] Johannes Hofmann et al. “Analysis of Intel’s Haswell Microarchitecture Using the ECM
Model and Microbenchmarks”. In: Proceedings of the 29th International Conference on
Architecture of Computing Systems – ARCS 2016. Springer International Publishing, 2016,
pp. 210–222. ISBN: 978-3-319-30695-7. DOI: 10.1007/978-3-319-30695-7_16.
[HP02] A. Hartstein and Thomas R. Puzak. “The Optimum Pipeline Depth for a Microprocessor”.
In: SIGARCH Comput. Archit. News 30.2 (May 2002), pp. 7–13. ISSN: 0163-5964. DOI:
10.1145/545214.545217.
[HP06] John L. Hennessy and David A. Patterson. Computer Architecture - A Quantitative Ap-
proach. 4th edition. Morgan Kaufmann Publishers, 2006. ISBN: 9780123704900.
[HS11] Torsten Hoefler and Marc Snir. “Performance Engineering: A Must for Petascale and Be-
yond”. In: Proceedings of the Third International Workshop on Large-scale System and Ap-
plication Performance. LSAP ’11. ACM, 2011. ISBN: 978-1-4503-0703-1. DOI: 10.1145/
1996029.1996031.
[HTC10] HyperTransport™ I/O Link Specification. Revision 3.10c. HyperTransport Technology
Consortium. May 2010. URL: http://www.hypertransport.org/docs/twgdocs/
HTC20051222-0046-0035.pdf (visited on 05/27/2016).
[Hua+12] Min Huang et al. “An energy efficient 32nm 20 MB L3 cache for Intel® Xeon® processor
E5 family”. In: Custom Integrated Circuits Conference (CICC), 2012 IEEE. Sept. 2012.
DOI: 10.1109/CICC.2012.6330624.
[Ibm05] PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming
Environments Manual. Version 2.06. IBM. Aug. 2005. URL:
http://math-atlas.sourceforge.net/devel/assembly/vector_simd_pem.ppc.2005AUG23.pdf
(visited on 10/15/2015).
[Ils+15] Thomas Ilsche et al. “Combining Instrumentation and Sampling for Trace-Based Applica-
tion Performance Analysis”. In: Proceedings of the 8th International Workshop on Parallel
Tools for High Performance Computing. Springer International Publishing, 2015, pp. 123–
136. ISBN: 978-3-319-16012-2. DOI: 10.1007/978-3-319-16012-2_6.
[Int02] Intel® Pentium® 4 and Intel® Xeon™ Processor Optimization Reference Manual. order
number: 248966-007. Intel. 2002. URL: https://courses.cs.washington.edu/courses/
cse582/02au/x86/24896607.pdf (visited on 01/19/2016).
[Int04] Enhanced Intel® SpeedStep® Technology for the Intel® Pentium® M Processor. white
paper. Order Number: 301170-001. Mar. 2004. URL: download.intel.com/design/
network/papers/30117401.pdf (visited on 07/01/2016).
[Int06a] 64-bit Intel® Xeon® Processor with 800 MHz System Bus (1 MB and 2 MB L2 Cache
Versions) Specification Update. Revision 022. Reference Number: 302402-022. Intel. June
2006. URL: http://www.intel.de/content/dam/www/public/us/en/documents/
specification-updates/xeon-with-800-mhz-system-bus-specification-update.pdf
(visited on 01/19/2016).
[Int06b] Dual-Core Intel® Xeon® Processor 5000 Series Specification Update. Revision 003.
Reference Number: 313065-003. Intel. Sept. 2006. URL: http://www.intel.com/Assets/en_
US/PDF/specupdate/313065.pdf (visited on 01/19/2016).
[Int09a] An Introduction to the Intel® QuickPath Interconnect. Intel. Jan. 2009. URL: http://www.
intel.com/technology/quickpath/introduction.pdf (visited on 10/23/2015).
[Int09b] First the Tick, Now the Tock: Next Generation Intel® Microarchitecture (Nehalem). white
paper. Intel, 2009. URL: http://www.intel.com/content/dam/doc/white-paper/intel-
microarchitecture-white-paper.pdf (visited on 05/30/2014).
[Int11] Intel® 7500/7510/7512 Scalable Memory Buffer. Revision 002. Document Number:
322824-002. Intel. Apr. 2011. URL: http://www.intel.com/content/dam/doc/datasheet/
7500-7510-7512-scalable-memory-buffer-datasheet.pdf (visited on 07/02/2016).
[Int12a] Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring Guide.
Intel. Mar. 2012. URL: http://www.intel.com/content/dam/www/public/us/en/
documents/design-guides/xeon-e5-2600-uncore-guide.pdf (visited on 02/06/2015).
[Int12b] Intel® Xeon® Processor E5-2600 Series-based Platforms for Intelligent Systems. platform
brief. Intel, 2012. URL: http://www.intel.de/content/dam/www/public/us/en/
documents/platform-briefs/xeon-e5-platforms-for-intelligent-systems-brief.pdf
(visited on 05/03/2014).
[Int13a] Intel® VTune™ Amplifier XE 2013. 2013. URL: https://software.intel.com/sites/default/
files/managed/0b/93/Intel-VTune-Amplifier-XE-overview-and-new-features.pdf
(visited on 02/23/2016).
[Int13b] Intel® Xeon® Processor 5400 Series Specification Update. Revision 021. Reference Number:
318585-021. Intel. Aug. 2013. URL: http://www.intel.com/content/dam/www/
public/us/en/documents/specification-updates/xeon-5400-spec-update.pdf (visited
on 01/19/2016).
[Int14a] Intel® 64 and IA-32 Architectures Optimization Reference Manual. Order Number:
248966-029. Intel. Mar. 2014. URL: http://www.intel.com/content/dam/www/public/us/
en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (visited on
05/01/2014).
[Int14b] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Combined Volumes 1,
2A, 2B, 2C, 3A, 3B and 3C. Order Number: 325462-050US. Intel. Feb. 2014. URL:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/
64-ia-32-architectures-software-developer-manual-325462.pdf (visited on 05/01/2014).
[Int14c] Intel® Math Kernel Library Reference Manual. Document Number: 630813-060US. Intel.
2014. URL: https://software.intel.com/en-us/mkl_11.1_ref_pdf (visited on 07/28/2014).
[Int14d] Intel® Xeon® Processor E5 v3 Family Uncore Performance Monitoring Reference Manual.
Intel. Sept. 2014. URL: http://www.intel.com/content/dam/www/public/us/en/zip/
xeon-e5-v3-uncore-performance-monitoring.zip (visited on 02/12/2015).
[Int15a] Intel® Xeon Phi™ Coprocessor x100 Product Family Datasheet. Intel. Apr. 2015. URL:
http://www.intel.de/content/dam/www/public/us/en/documents/datasheets/xeon-
phi-coprocessor-datasheet.pdf (visited on 07/02/2016).
[Int15b] Intel® Xeon® Processor 5600 Series Specification Update. Revision 017. Reference Number:
323372-017US. Intel. Feb. 2015. URL: http://www.intel.com/content/dam/www/
public/us/en/documents/specification-updates/xeon-5600-specification-update.pdf
(visited on 01/19/2016).
[Int15c] Intel® Xeon® Processor 7500 Series Specification Update. Revision 022. Reference Number:
323344-022. Intel. Mar. 2015. URL: http://www.intel.com/content/dam/www/
public/us/en/documents/specification-updates/xeon-processor-7500-series-
specification-update.pdf (visited on 07/02/2016).
[Int15d] Intel® Xeon® Processor E5 Product Family - Specification Update. Revision 018.
Reference Number: 326510-018. Intel. Jan. 2015. URL: http://www.intel.com/content/
dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-
update.pdf (visited on 01/19/2016).
[Int15e] Intel® Xeon® Processor E5 v3 Product Families - Specification Update. Revision 009.
Reference Number: 330785-009US. Intel. Aug. 2015. URL: http://www.intel.com/content/
dam/www/public/us/en/documents/specification-updates/xeon-e5-v3-spec-
update.pdf (visited on 10/23/2015).
[Int15f] Intel® Xeon® Processor E5-1600, E5-2600, and E5-4600 v3 Product Families, Volume
1 of 2, Electrical Datasheet. Order No.: 330783-002. Intel Corporation. June 2015. URL:
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-
e5-v3-datasheet-vol-1.pdf (visited on 07/02/2016).
[Int15g] Intel® Xeon® Processor E5-1600/2400/2600/4600 v3 Product Families Datasheet - Volume
2 of 2, Registers. Reference Number: 330784-003. Intel Corporation. June 2015. URL:
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-
e5-v3-datasheet-vol-2.pdf (visited on 07/02/2016).
[Int16] Intel® Performance Counter Monitor - A better way to measure CPU utilization. Apr. 2016.
URL: http://www.intel.com/software/pcm (visited on 07/02/2016).
[Int95] A Tour of the P6 Microarchitecture. Tech. rep. Intel Corporation, 1995. URL: http://people.
cs.clemson.edu/~mark/330/colwell/p6tour.pdf (visited on 10/15/2015).
[Iso94] Information Technology — Open Systems Interconnection — Basic Reference Model: The
Basic Model. ISO/IEC 7498-1:1994. ISO, Nov. 1994. URL: http://www.iso.org/iso/
iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=20269 (visited on
07/02/2016).
[Jai91] Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for Experi-
mental Design, Measurement, Simulation, and Modeling. Wiley Interscience, 1991. ISBN:
9780471503361.
[Juc+04] G. Juckeland et al. “BenchIT – Performance measurement and comparison for scientific
applications”. In: Parallel Computing — Software Technology, Algorithms, Architectures
and Applications. Vol. 13. Advances in Parallel Computing. North Holland Publishing Co.,
2004, pp. 501–508. DOI: 10.1016/S0927-5452(04)80064-9.
[Juc12] Guido Juckeland. “Trace-based Performance Analysis for Hardware Accelerators”. PhD
thesis. Technische Universität Dresden, 2012. URL: http://nbn-resolving.de/urn:nbn:de:
bsz:14-qucosa-105859 (visited on 03/11/2016).
[KAO05] Poonacha Kongetira, K. Aingaran, and K. Olukotun. “Niagara: a 32-way multithreaded
Sparc processor”. In: Micro, IEEE 25.2 (Mar. 2005), pp. 21–29. ISSN: 0272-1732. DOI:
10.1109/MM.2005.35.
[Kar14] Rama Karedla. Intel Xeon E5-2600 v3 (Haswell) Architecture & Features. 2014. URL: http:
//repnop.org/pd/slides/PD_Haswell_Architecture.pdf (visited on 02/12/2015).
[Kel+03] Chetana N. Keltcher et al. “The AMD Opteron processor for multiprocessor servers”. In:
Micro, IEEE 23.2 (Mar. 2003), pp. 66–76. ISSN: 0272-1732. DOI: 10.1109/MM.2003.
1196116.
[KG94] V.P. Kumar and A. Gupta. “Analyzing Scalability of Parallel Algorithms and Architectures”.
In: Journal of Parallel and Distributed Computing 22.3 (1994), pp. 379–391. ISSN: 0743-
7315. DOI: 10.1006/jpdc.1994.1099.
[KH09] R. Kumar and G. Hinton. “A family of 45nm IA processors”. In: Solid-State Circuits Con-
ference - Digest of Technical Papers, 2009. ISSCC 2009. IEEE International. Feb. 2009,
pp. 58–59. DOI: 10.1109/ISSCC.2009.4977306.
[Kim+15] D. H. Kim et al. “Design and Analysis of 3D-MAPS (3D Massively Parallel Processor with
Stacked Memory)”. In: IEEE Transactions on Computers 64.1 (Jan. 2015), pp. 112–125.
ISSN: 0018-9340. DOI: 10.1109/TC.2013.192.
[Kle05] Andreas Kleen. A NUMA API for LINUX*. Technical Linux Whitepaper, Novell Inc. Apr.
2005. URL: http://developer.amd.com/wordpress/media/2012/10/LibNUMA-WP-
fv1.pdf (visited on 08/07/2015).
[KM01] J.B. Keller and D.R. Meyer. “Messaging scheme to maintain cache coherency and con-
serve system memory bandwidth during a memory read operation in a multiprocessing com-
puter system”. US Patent 6,275,905. Aug. 2001. URL: https://www.google.ch/patents/
US6275905 (visited on 07/02/2016).
[KMC72] D. J. Kuck, Y. Muraoka, and Shyh-Ching Chen. “On the Number of Operations Simul-
taneously Executable in Fortran-Like Programs and Their Resulting Speedup”. In: IEEE
Transactions on Computers C-21.12 (Dec. 1972), pp. 1293–1310. ISSN: 0018-9340. DOI:
10.1109/T-C.1972.223501.
[Knü+08] Andreas Knüpfer et al. “The Vampir Performance Analysis Tool-Set”. In: Proceedings
of the 2nd International Workshop on Parallel Tools for High Performance Computing.
Springer Berlin Heidelberg, 2008, pp. 139–155. ISBN: 978-3-540-68564-7. DOI: 10.1007/
978-3-540-68564-7_9.
[Knü+12] Andreas Knüpfer et al. “Score-P: A Joint Performance Measurement Run-Time Infrastructure
for Periscope, Scalasca, TAU, and Vampir”. In: Proceedings of the 5th International
Workshop on Parallel Tools for High Performance Computing. Springer Berlin Heidelberg,
2012, pp. 79–91. ISBN: 978-3-642-31476-6. DOI: 10.1007/978-3-642-31476-6_7.
[Kol+13] Souad Koliaï et al. “Quantifying Performance Bottleneck Cost Through Differential Anal-
ysis”. In: Proceedings of the 27th International Conference on Supercomputing. ICS’13.
ACM, 2013, pp. 263–272. ISBN: 978-1-4503-2130-3. DOI: 10.1145/2464996.2465440.
[Kot+12] S. Kottapalli et al. “Extending a cache coherency snoop broadcast protocol with directory
information”. US20120047333 A1. US Patent Application 12/860,340. Feb. 2012. URL:
https://www.google.com.tr/patents/US20120047333 (visited on 03/14/2015).
[Kra12] William T.C. Kramer. “Top500 versus sustained performance: the top problems with the
top500 list - and what to do about them”. In: Proceedings of the 21st international confer-
ence on Parallel architectures and compilation techniques. PACT ’12. ACM, 2012, pp. 223–
230. ISBN: 978-1-4503-1182-3. DOI: 10.1145/2370816.2370850.
[Kri+12a] Manojkumar Krishnan et al. The Global Arrays User Manual. Pacific Northwest National
Laboratory. Feb. 2012. URL: http://hpc.pnl.gov/globalarrays/papers/GA-UserManual-
Main.pdf (visited on 08/14/2015).
[Kri+12b] P. Kristof et al. “Performance Study of SIMD Programming Models on Intel Multicore
Processors”. In: Parallel and Distributed Processing Symposium Workshops PhD Forum
(IPDPSW), 2012 IEEE 26th International. May 2012, pp. 2423–2432. DOI: 10.1109/
IPDPSW.2012.299.
[Kur+11] N.A. Kurd et al. “A Family of 32 nm IA Processors”. In: Solid-State Circuits, IEEE Journal
of 46.1 (Jan. 2011), pp. 119–130. ISSN: 0018-9200. DOI: 10.1109/JSSC.2010.2079430.
[Lam06] Christoph Lameter. “Local and remote memory: Memory in a Linux/NUMA system”. In:
Linux Symposium. 2006. URL: ftp://82.96.64.10/pub/linux/kernel/people/christoph/
gelato/gelato2006-paper.pdf (visited on 08/07/2015).
[Lam13] Christoph Lameter. “NUMA (Non-Uniform Memory Access): An Overview”. In: Queue
11.7 (July 2013), 40:40–40:51. ISSN: 1542-7730. DOI: 10.1145/2508834.2513149.
[LB10] Igor Loi and Luca Benini. “An Efficient Distributed Memory Interface for Many-core Plat-
form with 3D Stacked DRAM”. In: Proceedings of the Conference on Design, Automa-
tion and Test in Europe. DATE ’10. European Design and Automation Association, 2010,
pp. 99–104. ISBN: 978-3-9810801-6-2. DOI: 10.1145/2508834.2513149.
[Le+07] H.Q. Le et al. “IBM POWER6 microarchitecture”. In: IBM Journal of Research and Devel-
opment 51.6 (Nov. 2007), pp. 639–662. ISSN: 0018-8646. DOI: 10.1147/rd.516.0639.
[Lem11] Oded Lempel. 2nd Generation Intel® Core™ Processor Family: Intel® Core™ i7, i5
and i3. HotChips 23. 2011. URL: http://www.hotchips.org/wp-content/uploads/hc_
archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.911-Sandy-Bridge-Lempel-
Intel-Rev%5C%207.pdf (visited on 11/13/2015).
[Len+90] Daniel Lenoski et al. “The Directory-based Cache Coherence Protocol for the DASH Mul-
tiprocessor”. In: SIGARCH Comput. Archit. News 18.2SI (May 1990), pp. 148–159. ISSN:
0163-5964. DOI: 10.1145/325096.325132.
[Lep+12] Kevin M. Lepak et al. “Method and Apparatus for Accelerated Shared Data Migration”.
US20120144122 A1. US Patent Application 12/962,156. June 2012. URL: https://www.
google.com/patents/US20120144122 (visited on 05/16/2014).
[Lev09] David Levinthal. Performance Analysis Guide for Intel® Core™ i7 Processor and Intel®
Xeon™ 5500 processors. Tech. rep. Intel, 2009. URL: https://software.intel.com/
sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (visited on
04/14/2016).
[LHL05] Weiping Liao, Lei He, and Kevin M. Lepak. “Temperature and supply Voltage aware per-
formance and power modeling at microarchitecture level”. In: Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on 24.7 (July 2005), pp. 1042–1053.
ISSN: 0278-0070. DOI: 10.1109/TCAD.2005.850860.
[LHS13] Shigang Li, Torsten Hoefler, and Marc Snir. “NUMA-aware Shared-memory Collective
Communication for MPI”. In: Proceedings of the 22nd International Symposium on High-
performance Parallel and Distributed Computing. HPDC’13. ACM, 2013, pp. 85–96. ISBN:
978-1-4503-1910-2. DOI: 10.1145/2462902.2462903.
[Lia+14] Xiangke Liao et al. “MilkyWay-2 supercomputer: system and application”. In: Frontiers of
Computer Science 8.3 (2014), pp. 345–356. ISSN: 2095-2236. DOI: 10.1007/s11704-014-
3501-3.
[Lil00] David J. Lilja. Measuring Computer Performance. Cambridge Books Online. Cambridge
University Press, 2000. ISBN: 9780511612398. DOI: 10.1017/CBO9780511612398.
[Lim12] ITL Education Solutions Limited. Introduction to Information Technology. second edition.
Pearson Education India, 2012. ISBN: 9788131760291.
[Liv+11] Charles Lively et al. “Energy and performance characteristics of different parallel imple-
mentations of scientific applications on multicore systems”. In: International Journal of
High Performance Computing Applications 25.3 (Aug. 2011), pp. 342–350. DOI: 10.1177/
1094342011414749.
[Liv+12] Charles Lively et al. “Power-aware predictive models of hybrid (MPI/OpenMP) scientific
applications on multicore systems”. In: Computer Science - Research and Development 27.4
(2012), pp. 245–253. ISSN: 1865-2034. DOI: 10.1007/s00450-011-0190-0.
[LM13] Xu Liu and John Mellor-Crummey. “A Data-centric Profiler for Parallel Programs”. In:
Proceedings of the International Conference on High Performance Computing, Networking,
Storage and Analysis. SC ’13. ACM, 2013, 28:1–28:12. ISBN: 978-1-4503-2378-9. DOI:
10.1145/2503210.2503297.
[Lo+97] Jack L. Lo et al. “Converting Thread-level Parallelism to Instruction-level Parallelism via
Simultaneous Multithreading”. In: ACM Trans. Comput. Syst. 15.3 (Aug. 1997), pp. 322–
354. ISSN: 0734-2071. DOI: 10.1145/263326.263382.
[Lom11] Chris Lomont. Introduction to Intel® Advanced Vector Extensions. May 2011. URL:
https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions
(visited on 07/02/2016).
[Lud12] Mario Ludwig. “Performance-Analyse von AMD Bulldozer-Prozessoren”. bachelor thesis.
Technische Universität Dresden, 2012.
[Mas98] Michael Mascagni. “Parallel linear congruential generators with prime moduli”. In: Par-
allel Computing 24.5–6 (1998), pp. 923–936. ISSN: 0167-8191. DOI: 10.1016/S0167-
8191(98)00010-6.
[MB05] C. McNairy and R. Bhatia. “Montecito: a dual-core, dual-thread Itanium processor”. In:
Micro, IEEE 25.2 (Mar. 2005), pp. 10–20. ISSN: 0272-1732. DOI: 10.1109/MM.2005.34.
[McC95] John D. McCalpin. “Memory Bandwidth and Machine Balance in Current High Performance
Computers”. In: IEEE Computer Society Technical Committee on Computer Architecture
(TCCA) Newsletter (Dec. 1995), pp. 19–25. URL: http://www.cs.virginia.edu/
~mccalpin/papers/bandwidth/bandwidth.html (visited on 02/08/2016).
[McI+12] H. McIntyre et al. “Design of the Two-Core x86-64 AMD Bulldozer Module in 32 nm SOI
CMOS”. In: Solid-State Circuits, IEEE Journal of 47.1 (Jan. 2012), pp. 164–176. ISSN:
0018-9200. DOI: 10.1109/JSSC.2011.2167823.
[McK10] Paul E. McKenney. Memory Barriers: a Hardware View for Software Hackers. July 2010.
URL: http://www2.rdrop.com/~paulmck/scalability/paper/whymb.2010.07.23a.pdf
(visited on 08/09/2015).
[Meg14] SlashEight - Innovative Housing System with High Packaging Density. data sheet. Megware,
2014. URL: http://www.megware.com/config/media/files/326_SlashEight_EN.pdf
(visited on 08/20/2014).
[MG11a] Zoltan Majo and Thomas R. Gross. “Memory Management in NUMA Multicore Systems:
Trapped Between Cache Contention and Interconnect Overhead”. In: SIGPLAN Not. 46.11
(June 2011), pp. 11–20. ISSN: 0362-1340. DOI: 10.1145/2076022.1993481.
[MG11b] Zoltan Majo and Thomas R. Gross. “Memory System Performance in a NUMA Multicore
Multiprocessor”. In: Proceedings of the 4th Annual International Conference on Systems
and Storage. SYSTOR’11. ACM, 2011, 12:1–12:10. ISBN: 978-1-4503-0773-4. DOI: 10.
1145/1987816.1987832.
[MHS14] Daniel Molka, Daniel Hackenberg, and Robert Schöne. “Main Memory and Cache Perfor-
mance of Intel Sandy Bridge and AMD Bulldozer”. In: Proceedings of the Workshop on
Memory Systems Performance and Correctness. MSPC ’14. ACM, 2014, 4:1–4:10. ISBN:
978-1-4503-2917-0. DOI: 10.1145/2618128.2618129.
[MM04] Gabriel Marin and John Mellor-Crummey. “Cross-architecture Performance Predictions
for Scientific Applications Using Parameterized Models”. In: Proceedings of the Joint In-
ternational Conference on Measurement and Modeling of Computer Systems. SIGMET-
RICS ’04/Performance ’04. ACM, 2004, pp. 2–13. ISBN: 1-58113-873-3. DOI: 10.1145/
1005686.1005691.
[MML11] Ruben S Montero, Rafael Moreno-Vozmediano, and Ignacio M Llorente. “An elasticity
model for high throughput computing clusters”. In: Journal of Parallel and Distributed
Computing 71.6 (2011), pp. 750–757. DOI: 10.1016/j.jpdc.2010.05.005.
[Mog+14] A.C. Moga et al. “Allocation and write policy for a glueless area-efficient directory cache for
hotly contested cache lines”. US Patent 8,631,210. Jan. 2014. URL: https://www.google.
com.tr/patents/US8631210 (visited on 03/14/2015).
[Mol+09] D. Molka et al. “Memory Performance and Cache Coherency Effects on an Intel Nehalem
Multiprocessor System”. In: PACT ’09: Proceedings of the 2009 18th International Confer-
ence on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 2009,
pp. 261–270. ISBN: 978-0-7695-3771-9. DOI: 10.1109/PACT.2009.22.
[Mol+10] Daniel Molka et al. “Characterizing the Energy Consumption of Data Transfers and Arith-
metic Operations on x86-64 Processors”. In: Proceedings of the 1st International Green
Computing Conference. IEEE, 2010, pp. 123–133. DOI: 10.1109/GREENCOMP.2010.
5598316.
[Mol+11] Daniel Molka et al. “Memory Performance and SPEC OpenMP Scalability on Quad-Socket
x86_64 Systems”. In: Algorithms and Architectures for Parallel Processing. Vol. 7016. Lec-
ture Notes in Computer Science. Springer Berlin / Heidelberg, 2011, pp. 170–181. ISBN:
978-3-642-24650-0. DOI: 10.1007/978-3-642-24650-0_15.
[Mol+12] Daniel Molka et al. “Flexible workload generation for HPC cluster efficiency benchmark-
ing”. In: Computer Science - Research and Development 27.4 (2012), pp. 235–243. ISSN:
1865-2034. DOI: 10.1007/s00450-011-0194-9.
[Mol+15] Daniel Molka et al. “Cache Coherence Protocol and Memory Performance of the Intel
Haswell-EP Architecture”. In: Proceedings of the 44th International Conference on Par-
allel Processing (ICPP’15). IEEE, Sept. 2015. DOI: 10.1109/ICPP.2015.83.
[Mol08] Daniel Molka. “Leistungsbewertung von x86 Multicore-Prozessoren”. ZIH report: ZIH-R-
0802. Diplomarbeit. Technische Universität Dresden, July 2008.
[Moo+01] Shirley Moore et al. “Review of Performance Analysis Tools for MPI Parallel Programs”.
In: Recent Advances in Parallel Virtual Machine and Message Passing Interface: 8th Eu-
ropean PVM/MPI Users’ Group Meeting. Springer Berlin Heidelberg, 2001, pp. 241–248.
ISBN: 978-3-540-45417-5. DOI: 10.1007/3-540-45417-9_34.
[Moo02] Shirley V. Moore. “A Comparison of Counting and Sampling Modes of Using Performance
Monitoring Hardware”. In: Computational Science — ICCS 2002: International Conference
Proceedings, Part II. Springer Berlin Heidelberg, 2002, pp. 904–912. ISBN: 978-3-540-
46080-0. DOI: 10.1007/3-540-46080-2_95.
[Moo65] G. E. Moore. “Cramming More Components onto Integrated Circuits”. In: Electronics 38.8
(Apr. 1965), pp. 114–117. ISSN: 0018-9219. DOI: 10.1109/jproc.1998.658762.
[Moo75] G.E. Moore. “Progress in digital integrated electronics”. In: Electron Devices Meeting, 1975
International. Vol. 21. 1975, pp. 11–13. DOI: 10.1109/N-SSC.2006.4804410 (reprint).
[MPI09] MPI: A Message-Passing Interface Standard. Version 2.2. Message Passing Interface Forum.
Apr. 2009. URL: http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
(visited on 08/13/2015).
[MPI12] MPI: A Message-Passing Interface Standard. Version 3.0. Message Passing Interface Forum.
Sept. 2012. URL: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
(visited on 08/13/2015).
[MPI94] MPI: A Message-Passing Interface Standard. Message Passing Interface Forum. May 1994.
URL: http://www.mpi-forum.org/docs/mpi-1.0/mpi-10.ps (visited on 08/13/2015).
[MPV12] Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining. Introduction to Lin-
ear Regression Analysis. 5th ed. Wiley Series in Probability and Statistics. Wiley Global
Education, 2012. ISBN: 978-0-470-54281-1.
[MS96] Larry McVoy and Carl Staelin. “lmbench: portable tools for performance analysis”. In: Pro-
ceedings of the 1996 annual conference on USENIX Annual Technical Conference. ATEC
’96. USENIX Association, 1996. URL: https://www.usenix.org/legacy/publications/
library/proceedings/sd96/full_papers/mcvoy.pdf (visited on 02/06/2016).
[MSM04] Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill. Patterns for Parallel
Programming. The Software Patterns Series. Addison-Wesley Professional, 2004. ISBN:
9780321228116.
[Mül+04] Matthias S. Müller et al. “SPEC HPG benchmarks for high performance systems”. In: Int.
J. High Perform. Comput. Netw. 1.4 (Dec. 2004), pp. 162–170. ISSN: 1740-0562. DOI: 10.
1504/IJHPCN.2004.008345.
[Mül+07] Matthias S. Müller et al. “Developing Scalable Applications with Vampir, VampirServer and
VampirTrace”. In: Parallel Computing: Architectures, Algorithms and Applications, volume
15 of Advances in Parallel Computing. IOS Press, 2007, pp. 637–644. ISBN: 978-1-58603-
796-3.
[Mül+10] Matthias S. Müller et al. “SPEC MPI2007 - an application benchmark suite for parallel sys-
tems using MPI”. In: Concurrency and Computation: Practice and Experience 22.2 (2010),
pp. 191–205. DOI: 10.1002/cpe.1535.
[Mül+12] Matthias S. Müller et al. “SPEC OMP2012 - An Application Benchmark Suite for Paral-
lel Systems Using OpenMP”. In: OpenMP in a Heterogeneous World: 8th International
Workshop on OpenMP, IWOMP 2012, Proceedings. Vol. 7312. Lecture Notes in Computer
Science. Springer Berlin Heidelberg, 2012, pp. 223–236. ISBN: 978-3-642-30960-1. DOI:
10.1007/978-3-642-30961-8_17.
[Nik+01] Dimitrios S. Nikolopoulos et al. “The Trade-off Between Implicit and Explicit Data Dis-
tribution in Shared-memory Programming Paradigms”. In: Proceedings of the 15th Inter-
national Conference on Supercomputing. ICS ’01. ACM, 2001, pp. 23–37. ISBN: 1-58113-
410-X. DOI: 10.1145/377792.377801.
[NR98] Robert W. Numrich and John Reid. “Co-array Fortran for Parallel Programming”. In: SIG-
PLAN Fortran Forum 17.2 (Aug. 1998). ISSN: 1061-7264. DOI: 10.1145/289918.289920.
[Nvi12] Tesla K20X GPU accelerator. Nvidia. Nov. 2012. URL: http://www.nvidia.de/content/
PDF/kepler/Tesla-K20X-BD-06397-001-v05.pdf (visited on 08/21/2014).
[Nvi14] NVIDIA Tegra K1 - A New Era in Mobile Computing. white paper. Nvidia, 2014. URL:
http://www.nvidia.com/content/PDF/tegra_white_papers/tegra-K1-whitepaper.pdf
(visited on 01/12/2016).
[NW88] David M. Nicol and Frank H. Willard. “Problem Size, Parallel Architecture, and Optimal
Speedup”. In: J. Parallel Distrib. Comput. 5.4 (Aug. 1988), pp. 404–420. ISSN: 0743-7315.
DOI: 10.1016/0743-7315(88)90005-6.
[NZ08] Dorit Nuzman and Ayal Zaks. “Outer-loop Vectorization: Revisited for Short SIMD Archi-
tectures”. In: Proceedings of the 17th International Conference on Parallel Architectures
and Compilation Techniques. PACT ’08. ACM, 2008, pp. 2–11. ISBN: 978-1-60558-282-5.
DOI: 10.1145/1454115.1454119.
[Old13] Roland M. Oldenburg. “Analyse der Speicherhierarchie bei ARM Cortex A9 Prozessoren”.
Großer Beleg. Technische Universität Dresden, 2013.
[Old14] Roland M. Oldenburg. “Skalierung paralleler Anwendungen auf der SGI UV-2000: Eine
Analyse der NUMA-Eigenschaften”. ZIH report: ZIH-R-1402. Diplomarbeit. Technische
Universität Dresden, Mar. 2014.
[Oli+05] Leonid Oliker et al. “A Performance Evaluation of the Cray X1 for Scientific Applications”.
In: High Performance Computing for Computational Science - VECPAR 2004: 6th Intl.
Conference, Revised Selected and Invited Papers. Springer Berlin Heidelberg, 2005, pp. 51–
65. ISBN: 978-3-540-31854-5. DOI: 10.1007/11403937_5.
[OMP13] OpenMP Application Program Interface. Version 4.0. OpenMP Architecture Review Board.
July 2013. URL: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf (visited on
08/12/2015).
[Ope15] OpenSHMEM Application Programming Interface. Version 1.2. Open Source Software Solutions,
Inc. Mar. 2015. URL: http://bongo.cs.uh.edu/site/sites/default/site_files/
openshmem-specification-1.2.pdf (visited on 08/12/2015).
[Ora09] SUN FIRE X4140 SERVER. data sheet. Oracle, 2009. URL: http://www.oracle.com/us/
products/servers-storage/servers/x86/034352.pdf (visited on 01/15/2015).
[Ora13] SPARC M6-32 Server Architecture. white paper. Oracle, Sept. 2013. URL: http://www.
oracle.com/technetwork/server-storage/sun-sparc-enterprise/documentation/o13-
066-sparc-m6-32-architecture-2016053.pdf (visited on 08/19/2014).
[Pat+11] Avadh Patel et al. “MARSS: A Full System Simulator for Multicore x86 CPUs”. In: Pro-
ceedings of the 48th Design Automation Conference. DAC ’11. ACM, 2011, pp. 1050–1055.
ISBN: 978-1-4503-0636-2. DOI: 10.1145/2024724.2024954.
[Pen+13] S.J. Pennycook et al. “Exploring SIMD for Molecular Dynamics, Using Intel® Xeon®
Processors and Intel® Xeon Phi™ Coprocessors”. In: Parallel Distributed Processing
(IPDPS), 2013 IEEE 27th International Symposium on. May 2013, pp. 1085–1097. DOI:
10.1109/IPDPS.2013.44.
[Pfi98] Gregory F. Pfister. In Search of Clusters — the ongoing battle in lowly parallel computing.
second edition. Prentice Hall, 1998. ISBN: 0-13-899709-8.
[PGB14] B. Putigny, B. Goglin, and D. Barthou. “A benchmark-based performance model for
memory-bound HPC applications”. In: High Performance Computing Simulation (HPCS),
2014 International Conference on. July 2014, pp. 943–950. DOI: 10.1109/HPCSim.2014.
6903790.
[Pil+11] Laércio L Pilla et al. Improving parallel system performance with a NUMA-aware load
balancer. Tech. rep. University of Illinois at Urbana-Champaign, 2011. URL: http://hdl.
handle.net/2142/25911 (visited on 04/19/2016).
[PP84] Mark S. Papamarcos and Janak H. Patel. “A Low-overhead Coherence Solution for Multi-
processors with Private Cache Memories”. In: SIGARCH Comput. Archit. News 12.3 (Jan.
1984), pp. 348–354. ISSN: 0163-5964. DOI: 10.1145/773453.808204.
[PS00] Boris V. Protopopov and Anthony Skjellum. “Shared-memory communication approaches
for an MPI message-passing library”. In: Concurrency: Practice and Experience 12.9
(2000), pp. 799–820. ISSN: 1096-9128. DOI: 10.1002/1096-9128(20000810)12:9<799::
AID-CPE476>3.0.CO;2-1.
[Put14] Bertrand Putigny. “Benchmark-driven Approaches to Performance Modeling of Multi-Core
Architectures”. Theses. Université Sciences et Technologies - Bordeaux I, Mar. 2014. URL:
https://tel.archives-ouvertes.fr/tel-00984791 (visited on 07/01/2016).
[Qual15] Qualcomm® Snapdragon™ 810 processor. product brief. Qualcomm, 2015. URL: https://
www.qualcomm.com/media/documents/files/snapdragon-810-processor-product-
brief.pdf (visited on 01/12/2016).
[RH13] Sabela Ramos and Torsten Hoefler. Modeling communication in cache-coherent SMP sys-
tems: a case-study with Xeon Phi. technical report. Scalable Parallel Computing Laboratory,
ETH Zurich, Feb. 2013. URL: http://htor.inf.ethz.ch/publications/img/ramos-hoefler-
cc-modeling.pdf (visited on 03/15/2016).
[RH15] Sabela Ramos and Torsten Hoefler. “Cache Line Aware Optimizations for ccNUMA Sys-
tems”. In: Proceedings of the 24th International Symposium on High-Performance Parallel
and Distributed Computing. HPDC ’15. ACM, 2015, pp. 85–88. ISBN: 978-1-4503-3550-8.
DOI: 10.1145/2749246.2749256.
[RH16] S. Ramos and T. Hoefler. “Cache Line Aware Algorithm Design for Cache-Coherent Archi-
tectures”. In: Parallel and Distributed Systems, IEEE Transactions on PP.99 (2016). ISSN:
1045-9219. DOI: 10.1109/TPDS.2016.2516540.
[RHJ09] R. Rabenseifner, G. Hager, and G. Jost. “Hybrid MPI/OpenMP Parallel Programming on
Clusters of Multi-Core SMP Nodes”. In: Parallel, Distributed and Network-based Pro-
cessing, 2009 17th Euromicro International Conference on. Feb. 2009, pp. 427–436. DOI:
10.1109/PDP.2009.43.
[Rie14] Rik van Riel. Automatic NUMA Balancing. Red Hat Summit. Apr. 2014. URL:
http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf
(visited on 08/07/2015).
[Rus+09] S. Rusu et al. “A 45nm 8-core enterprise Xeon® processor”. In: Solid-State Circuits
Conference, 2009. A-SSCC 2009. IEEE Asian. Nov. 2009, pp. 9–12. DOI: 10.1109/ASSCC.2009.5357230.
[Sai+03] Hideki Saito et al. “Large System Performance of SPEC OMP Benchmark Suites”. In: In-
ternational Journal of Parallel Programming 31.3 (2003), pp. 197–209. ISSN: 0885-7458.
DOI: 10.1023/A:1023086618401.
[Sav+15] Pavel Saviankou et al. “Cube v4: From Performance Report Explorer to Performance
Analysis Tool”. In: Procedia Computer Science 51 (2015). International Conference On
Computational Science, ICCS 2015, pp. 1343–1352. ISSN: 1877-0509. DOI: 10.1016/j.procs.2015.05.320.
[Saw+11] S. Sawant et al. “A 32nm Westmere-EX Xeon® enterprise processor”. In: Solid-State
Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE International. Feb. 2011,
pp. 74–75. DOI: 10.1109/ISSCC.2011.5746225.
[SBO11] Jeff A. Stuart, Pavan Balaji, and John D. Owens. “Extending MPI to Accelerators”. In:
Proceedings of the 1st Workshop on Architectures and Systems for Big Data. ASBD ’11.
ACM, 2011, pp. 19–23. ISBN: 978-1-4503-1439-8. DOI: 10.1145/2377978.2377981.
[SC10] Xian-He Sun and Yong Chen. “Reevaluating Amdahl’s law in the multicore era”. In: Journal
of Parallel and Distributed Computing 70.2 (2010), pp. 183–188. ISSN: 0743-7315. DOI:
10.1016/j.jpdc.2009.05.002.
[Sca12] The Versatile SMP (vSMP) Architecture and vSMP Foundation Aggregation Platform. white
paper. available from http://www.scalemp.com/media-hub/resources/white-papers/
(registration required). ScaleMP, Aug. 2012. (Visited on 10/15/2015).
[Sch+11] Robert Schöne et al. “The VampirTrace Plugin Counter Interface: Introduction and Ex-
amples”. In: Euro-Par 2010 Parallel Processing Workshops. Vol. 6586. Lecture Notes in
Computer Science. Springer-Verlag, 2011, pp. 501–511. ISBN: 978-3-642-21878-1. DOI:
10.1007/978-3-642-21878-1_62.
[Sch07] Robert Schöne. “Leistungsbewertung von Multicore-Prozessoren”. ZIH report: ZIH-R-
0701. Diplomarbeit. Technische Universität Dresden, Mar. 2007.
[Sco15] Score-P User Manual. Version 1.4.2 (revision 8839). Virtual Institute – High Productivity
Supercomputing. June 2015. URL: https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf
(visited on 06/10/2016).
[Sew+14] Julian Seward et al. Valgrind Documentation. Release 3.10.0. Sept. 2014. URL:
http://valgrind.org/docs/manual/valgrind_manual.pdf (visited on 03/02/2016).
[SGG12] A. Silberschatz, P.B. Galvin, and G. Gagne. Operating System Concepts. 9th edition. Wiley
Global Education, 2012. ISBN: 9781118559635.
[Sgi12a] Performance and Productivity Breakthroughs with Very Large Coherent Shared Memory:
The SGI® UV Architecture. white paper. SGI, Jan. 2012. URL: http://www.sgi.com/pdfs/4250.pdf
(visited on 05/03/2014).
[Sgi12b] Technical Advances in the SGI® UV™ Architecture. white paper. SGI, June 2012. URL:
http://www.sgi.com/pdfs/4192.pdf (visited on 05/03/2014).
[She+06] S. Shende et al. “Performance and Memory Evaluation Using TAU”. In: Proceedings of
the Cray User’s Group Conference. 2006. URL:
https://cug.org/5-publications/proceedings_attendee_lists/2006CD/S06_Proceedings/pages/Authors/Shende/Shende_Paper.pdf
(visited on 07/02/2016).
[SHM12] Robert Schöne, Daniel Hackenberg, and Daniel Molka. “Memory performance at reduced
CPU clock speeds: an analysis of current x86_64 processors”. In: Proceedings of the 2012
USENIX conference on Power-Aware Computing and Systems. HotPower’12. USENIX Association,
2012. URL: https://www.usenix.org/system/files/conference/hotpower12/hotpower12-final5.pdf
(visited on 07/02/2016).
[Sim97] D. Sima. “Superscalar instruction issue”. In: Micro, IEEE 17.5 (Sept. 1997), pp. 28–39.
ISSN: 0272-1732. DOI: 10.1109/40.621211.
[Sin08] Ronak Singhal. Inside Intel® Next Generation Nehalem Microarchitecture. IDF 2008,
NGMS001. Apr. 2008. URL: http://www.cs.uml.edu/~bill/cs515/Intel_Nehalem_Processor.pdf
(visited on 08/03/2015).
[SLC08] K.V. Sistla, Y.C. Liu, and Z. Cai. “Enforcing global ordering through a caching bridge in
a multicore multiprocessor system”. US Patent 7,360,008. Apr. 2008. URL:
http://www.google.tt/patents/US7360008 (visited on 08/01/2014).
[SM06] Sameer S. Shende and Allen D. Malony. “The TAU Parallel Performance System”. In: In-
ternational Journal of High Performance Computing Applications 20 (May 2006), pp. 287–
331. ISSN: 1094-3420. DOI: 10.1177/1094342006064482.
[Smi98] James E. Smith. “A Study of Branch Prediction Strategies”. In: 25 Years of the International
Symposia on Computer Architecture (Selected Papers). ISCA ’98. ACM, 1998, pp. 202–
215. ISBN: 1-58113-058-9. DOI: 10.1145/285930.285980.
[SN93] X.H. Sun and L.M. Ni. “Scalable Problems and Memory-Bounded Speedup”. In: Journal
of Parallel and Distributed Computing 19.1 (1993), pp. 27–37. ISSN: 0743-7315. DOI:
10.1006/jpdc.1993.1087.
[Sol+03] B. Solomon et al. “Micro-operation cache: a power aware frontend for variable instruction
length ISA”. In: Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 11.5
(Oct. 2003), pp. 801–811. ISSN: 1063-8210. DOI: 10.1109/TVLSI.2003.814327.
[Spr02] B. Sprunt. “Pentium 4 performance-monitoring features”. In: Micro, IEEE 22.4 (July 2002),
pp. 72–82. ISSN: 0272-1732. DOI: 10.1109/MM.2002.1028478.
[SS95] R.H. Saavedra and A.J. Smith. “Measuring cache and TLB performance and their effect on
benchmark runtimes”. In: Computers, IEEE Transactions on 44.10 (Oct. 1995), pp. 1223–
1235. ISSN: 0018-9340. DOI: 10.1109/12.467697.
[Sta05] Carl Staelin. “lmbench: an extensible micro-benchmark suite”. In: Software: Practice and
Experience 35.11 (2005), pp. 1079–1105. ISSN: 1097-024X. DOI: 10.1002/spe.665.
[Ste+15] Holger Stengel et al. “Quantifying Performance Bottlenecks of Stencil Computations Using
the Execution-Cache-Memory Model”. In: Proceedings of the 29th ACM on International
Conference on Supercomputing. ICS ’15. ACM, 2015, pp. 207–216. ISBN: 978-1-4503-3559-1.
DOI: 10.1145/2751205.2751240.
[Str+15] E. Strohmaier et al. “The TOP500 List and Progress in High-Performance Computing”. In:
Computer 48.11 (Nov. 2015), pp. 42–49. ISSN: 0018-9162. DOI: 10.1109/MC.2015.338.
[Sup06] The SC818 Chassis Series User Guide. Revision 1.0. Super Micro Computer, Inc. Mar.
2006. URL: http://www.supermicro.nl/manuals/chassis/1U/SC818.pdf (visited on
07/22/2015).
[Sup14] H8QGi+-F H8QG6+-F User’s Manual. Revision 1.2b. Super Micro Computer, Inc. Mar.
2014. URL: http://www.supermicro.nl/manuals/motherboard/SR56x0/MNL-H8QG(6)(i)%5C_-F.pdf
(visited on 07/22/2015).
[SVL12] P.A. Salvadeo, A.C. Veca, and R.C. Lopez. “Historic behavior of the electronic technology:
The Wave of Makimoto and Moore’s Law in the Transistor’s Age”. In: Programmable Logic
(SPL), 2012 VIII Southern Conference on. Mar. 2012. DOI: 10.1109/SPL.2012.6211774.
[SWC01] A. Snavely, N. Wolter, and L. Carrington. “Modeling application performance by convolv-
ing machine signatures with application profiles”. In: Workload Characterization, 2001.
WWC-4. 2001 IEEE International Workshop on. Dec. 2001, pp. 149–156. DOI:
10.1109/WWC.2001.990754.
[Tak13] Tetsuo Takata. Perf for User Space Program Analysis. May 2013. URL:
http://events.linuxfoundation.org/sites/events/files/lcjp13%5C_takata.pdf (visited on 02/19/2016).
[Tan+13] Lingjia Tang et al. “Optimizing Google’s warehouse scale computers: The NUMA experi-
ence”. In: High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th Inter-
national Symposium on. Feb. 2013, pp. 188–197. DOI: 10.1109/HPCA.2013.6522318.
[Ter+09] Dan Terpstra et al. “Collecting Performance Data with PAPI-C”. In: Proceedings of the
3rd International Workshop on Parallel Tools for High Performance Computing. Springer,
2009, pp. 157–173. ISBN: 978-3-642-11261-4. DOI: 10.1007/978-3-642-11261-4_11.
[TH10] Jan Treibig and Georg Hager. “Introducing a Performance Model for Bandwidth-Limited
Loop Kernels”. In: Parallel Processing and Applied Mathematics. Vol. 6067. Lecture Notes
in Computer Science. Springer Berlin Heidelberg, 2010, pp. 615–624. ISBN: 978-3-642-14389-2.
DOI: 10.1007/978-3-642-14390-8_64.
[Tha+10] R. Thakur et al. “MPI at Exascale”. In: Proceedings of SciDAC 2010. June 2010. URL:
http://htor.inf.ethz.ch/publications/img/mpi_exascale.pdf (visited on 01/28/2016).
[Tho11] Michael E. Thomadakis. The Architecture of the Nehalem Processor and Nehalem-EP SMP
Platforms. Tech. rep. Texas A&M University, 2011. URL: http://sc.tamu.edu/systems/eos/nehalem.pdf
(visited on 10/15/2015).
[THW10] J. Treibig, G. Hager, and G. Wellein. “LIKWID: A Lightweight Performance-Oriented Tool
Suite for x86 Multicore Environments”. In: Parallel Processing Workshops (ICPPW), 2010
39th International Conference on. 2010, pp. 207–216. DOI: 10.1109/ICPPW.2010.38.
[THW12] Jan Treibig, Georg Hager, and Gerhard Wellein. “likwid-bench: An Extensible
Microbenchmarking Platform for x86 Multicore Compute Nodes”. In: Proceedings of the 5th
International Workshop on Parallel Tools for High Performance Computing. Springer Berlin
Heidelberg, 2012, pp. 27–36. ISBN: 978-3-642-31476-6. DOI: 10.1007/978-3-642-31476-6_3.
[THW13] Jan Treibig, Georg Hager, and Gerhard Wellein. “Performance Patterns and Hardware Met-
rics on Modern Multicore Processors: Best Practices for Performance Engineering”. In:
Euro-Par 2012: Parallel Processing Workshops. Vol. 7640. Lecture Notes in Computer
Science. Springer Berlin Heidelberg, 2013, pp. 451–460. ISBN: 978-3-642-36948-3. DOI:
10.1007/978-3-642-36949-0_50.
[Tiw+14] Ananta Tiwari et al. “Modeling the Impact of Reduced Memory Bandwidth on HPC Appli-
cations”. In: Euro-Par 2014 Parallel Processing. Vol. 8632. Lecture Notes in Computer Sci-
ence. Springer International Publishing, 2014, pp. 63–74. ISBN: 978-3-319-09872-2. DOI:
10.1007/978-3-319-09873-9_6.
[TJB03] D. Talla, L.K. John, and D. Burger. “Bottlenecks in multimedia processing with SIMD
style extensions and architectural enhancements”. In: Computers, IEEE Transactions on
52.8 (Aug. 2003), pp. 1015–1031. ISSN: 0018-9340. DOI: 10.1109/TC.2003.1223637.
[Too+11] S.S. Too et al. “Thin-core MCM assembly development for high-performance server mi-
croprocessor”. In: Electronic Components and Technology Conference (ECTC), 2011 IEEE
61st. May 2011, pp. 517–522. DOI: 10.1109/ECTC.2011.5898560.
[Top15] The TOP500 list – November 2015. Nov. 2015. URL: http://www.top500.org/lists/2015/11/
(visited on 01/12/2016).
[TW12] G. Thorson and M. Woodacre. “SGI UV2: A fused computation and data analysis machine”.
In: High Performance Computing, Networking, Storage and Analysis (SC), 2012 Interna-
tional Conference for. Nov. 2012. DOI: 10.1109/SC.2012.102.
[UPC13] UPC Language Specifications. Version 1.3. UPC Consortium. Nov. 2013. URL:
https://upc-lang.org/assets/Uploads/spec/upc-lang-spec-1.3.pdf (visited on 08/14/2015).
[Vam13] VampirTrace 5.14.4 User Manual. TU Dresden, Center for Information Services and
High Performance Computing (ZIH). 2013. URL:
http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/vampirtrace/dateien/VT-UserManual-5.14.4.pdf
(visited on 03/08/2016).
[Vam15] Vampir 9 User Manual. Manual Version: Vampir 9.0 / November 2015. GWT-TUD GmbH.
Nov. 2015. URL:
https://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/vampir/dateien/Vampir-User-Manual.pdf
(visited on 03/08/2016).
[Vet15] Jeffrey S. Vetter. Contemporary High Performance Computing: From Petascale toward Ex-
ascale, Volume Two. Chapman & Hall/CRC Computational Science. CRC Press, Mar. 2015.
ISBN: 9781498700634.
[Wae+15] Mattias De Wael et al. “Partitioned Global Address Space Languages”. In: ACM Comput.
Surv. 47.4 (May 2015), 62:1–62:27. ISSN: 0360-0300. DOI: 10.1145/2716320.
[Wal07] Brian Waldecker. AMD Quad Core Processor Overview. July 2007. URL:
http://download.boston.co.uk/downloads/9/b/9/9b9860e2-4360-4f45-8143-e04be91d44c2/AMD_QC.pdf
(visited on 02/15/2015).
[Wea15] Vincent M. Weaver. “Self-monitoring overhead of the Linux perf_event performance
counter interface”. In: Performance Analysis of Systems and Software (ISPASS), 2015 IEEE
International Symposium on. Mar. 2015, pp. 102–111. DOI: 10.1109/ISPASS.2015.7095789.
[Wec06] Ofri Wechsler. Inside Intel® Core™ Microarchitecture - Setting New Standards for
Energy-Efficient Performance. white paper. Intel, 2006. URL:
http://www.intel.com/pressroom/kits/core2duo/pdf/ICM_whitepaper.pdf (visited on 05/30/2014).
[Wei+11] D. Weiss et al. “An 8MB level-3 cache in 32nm SOI with column-select aliasing”. In: Solid-
State Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE International.
Feb. 2011, pp. 258–260. DOI: 10.1109/ISSCC.2011.5746309.
[Wen+11] D.F. Wendel et al. “POWER7™, a Highly Parallel, Scalable Multi-Core High End Server
Processor”. In: Solid-State Circuits, IEEE Journal of 46.1 (Jan. 2011), pp. 145–161. ISSN:
0018-9200. DOI: 10.1109/JSSC.2010.2080611.
[Wer14] Michael Werner. “Analyse der Energie-Effizienz paralleler Anwendungen auf der Basis von
Hardware-Performance-Countern”. Großer Beleg. Technische Universität Dresden, 2014.
[WWP09] Samuel Williams, Andrew Waterman, and David Patterson. “Roofline: An Insightful Visual
Performance Model for Multicore Architectures”. In: Commun. ACM 52.4 (Apr. 2009),
pp. 65–76. ISSN: 0001-0782. DOI: 10.1145/1498765.1498785.
[Yas14] A. Yasin. “A Top-Down method for performance analysis and counters architecture”. In:
Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Sympo-
sium on. Mar. 2014, pp. 35–44. DOI: 10.1109/ISPASS.2014.6844459.
[YMG14] L. Yavits, A. Morad, and R. Ginosar. “The effect of communication and synchronization on
Amdahl’s law in multicore systems”. In: Parallel Computing 40.1 (2014). ISSN: 0167-8191.
DOI: 10.1016/j.parco.2013.11.001.
[You07] M.T. Yourst. “PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simula-
tor”. In: Performance Analysis of Systems Software, 2007. ISPASS 2007. IEEE International
Symposium on. 2007, pp. 23–34. DOI: 10.1109/ISPASS.2007.363733.
[YP92] Tse-Yu Yeh and Yale N. Patt. “Alternative Implementations of Two-level Adaptive Branch
Prediction”. In: SIGARCH Comput. Archit. News 20.2 (Apr. 1992), pp. 124–134. ISSN:
0163-5964. DOI: 10.1145/146628.139709.
[YPS05a] K. Yotov, K. Pingali, and P. Stodghill. “X-Ray: a tool for automatic measurement of hard-
ware parameters”. In: Quantitative Evaluation of Systems, 2005. Second International Con-
ference on the. Sept. 2005, pp. 168–177. DOI: 10.1109/QEST.2005.44.
[YPS05b] Kamen Yotov, Keshav Pingali, and Paul Stodghill. “Automatic Measurement of Memory
Hierarchy Parameters”. In: SIGMETRICS Perform. Eval. Rev. 33.1 (June 2005), pp. 181–
192. ISSN: 0163-5999. DOI: 10.1145/1071690.1064233.
[Yuf+12] M. Yuffe et al. “A Fully Integrated Multi-CPU, Processor Graphics, and Memory Controller
32-nm Processor”. In: Solid-State Circuits, IEEE Journal of 47.1 (Jan. 2012), pp. 194–205.
ISSN: 0018-9200. DOI: 10.1109/JSSC.2011.2167814.
[Zen+09] H. Zeng et al. “MPTLsim: A simulator for X86 multicore processors”. In: Design
Automation Conference, 2009. DAC ’09. 46th ACM/IEEE. July 2009, pp. 226–231. ISBN:
978-1-6055-8497-3. DOI: 10.1145/1629911.1629974.
[Zia+10] D. Ziakas et al. “Intel® QuickPath Interconnect Architectural Features Supporting
Scalable System Architectures”. In: High Performance Interconnects (HOTI), 2010 IEEE 18th
Annual Symposium on. Aug. 2010. DOI: 10.1109/HOTI.2010.24.
List of Abbreviations
µop micro-op
AGU Address Generation Unit
ALU Arithmetic Logic Unit
AS Address Space
ATM Accelerated Transition to Modified
AVX Advanced Vector Extensions, a SIMD extension for x86 processors
AVX2 Advanced Vector Extensions 2, a SIMD extension for x86 processors
CA Caching Agent
CFD Computational Fluid Dynamics
CMP Chip Multi-processor
COD Cluster-on-Die
CPU Central Processing Unit
CU Compute Unit
DCT Dynamic Concurrency Throttling
DRAM Dynamic Random Access Memory
DVFS Dynamic Voltage and Frequency Scaling
ECC Error-Correcting Code
EX Execution (pipeline phase)
FLOPS Floating point Operations Per Second (also flop/s)
FMA Fused Multiply-Add
FMA4 Fused Multiply-Add with 4 address format
FMAC Floating point Multiply-ACcumulate
FPU Floating Point Unit
FSB Front-Side-Bus
GQ Global Queue
GT/s Giga Transfers per Second
HA Home Agent
HPC High Performance Computing
HT HyperTransport
HTC High Throughput Computing
IBS Instruction Based Sampling
ICU Instruction Control Unit
ID Instruction Decode (pipeline phase)
IDQ Instruction Decode Queue
IF Instruction Fetch (pipeline phase)
IMC Integrated Memory Controller
IOPS Integer Operations Per Second
IPC Instructions Per Cycle
IPS Instructions Per Second
ISA Instruction Set Architecture
L1 Level one cache
L1D Level one data cache
L2 Level two cache
L3 Level three cache
LCG Linear Congruential Generator
LLC Last Level Cache
LSU Load-Store-Unit
M, O, E, S, I, F Modified, Owned, Exclusive, Shared, Invalid, Forward
MCM Multi-Chip Module
ME Memory (pipeline phase)
MOB Memory Order Buffer
MPI Message Passing Interface
MSR Machine Specific Register
MuW ModifiedUnWritten
NUMA Non Uniform Memory Access
OS Operating System
PAPI Performance Application Programming Interface
PCB Printed Circuit Board
PCH Platform Controller Hub
PCIe PCI Express - Peripheral Component Interconnect Express
PEBS Precise Event Based Sampling
PGAS Partitioned Global Address Space
PMU Performance Monitoring Unit
PTE Page Table Entry
QPI QuickPath Interconnect
RAM Random Access Memory
RAT Register Alias Table
RAW Read-After-Write
RFO Read For Ownership
RMA Remote Memory Access
ROB Reorder Buffer
RS Reservation Station
SA System Agent
SIMD Single Instruction Multiple Data
SMI Scalable Memory Interface
SMT Simultaneous Multi-Threading
SRI System Request Interface
SSE Streaming SIMD Extensions, a SIMD extension for x86 processors
THP Transparent Huge Pages
TLB Translation Lookaside Buffer
TSC Time-Stamp Counter
UFS Uncore Frequency Scaling
UMA Uniform Memory Access
WAR Write-After-Read
WAW Write-After-Write
WB Write Back (pipeline phase)
WCC Write Coalescing Cache
List of Figures
1.1 SPEC OMPM2001 scaling on a quad-socket Intel Xeon X7560 system . . . . . . . . . . 4
2.1 Basic 5-stage pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Speculative out-of-order execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Functionality of SIMD instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Cache structure and functional principle . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Composition of multi-core processors . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Double precision floating point performance and memory bandwidth development . . . . 15
2.7 Memory distribution in systems with four processors . . . . . . . . . . . . . . . . . . . 16
2.8 Software view of large scale systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.9 Architecture of distributed memory systems . . . . . . . . . . . . . . . . . . . . . . . . 18
2.10 Projected hardware development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.11 State change diagram of the MESI coherence protocol . . . . . . . . . . . . . . . . . . . 21
2.12 State change diagram of the MESIF coherence protocol . . . . . . . . . . . . . . . . . . 22
2.13 State change diagram of the MOESI coherence protocol . . . . . . . . . . . . . . . . . . 23
2.14 State change diagram of the extended MOESI protocol . . . . . . . . . . . . . . . . . . 24
2.15 Basic principle of virtual memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.16 Address translation in 64 bit x86 processors using 4 KiB pages . . . . . . . . . . . . . . 29
2.17 Communication in parallel programming models . . . . . . . . . . . . . . . . . . . . . 31
2.18 Comparison of sampling and instrumentation . . . . . . . . . . . . . . . . . . . . . . . 37
2.19 Exemplary runtime statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.20 Utilization of shared resources in a multi-core processor . . . . . . . . . . . . . . . . . 38
3.1 Data placement mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Coherence state control mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Hardware performance counter example . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.1 AMD family 10h micro-architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Composition of the dual-socket AMD Opteron 2435 system . . . . . . . . . . . . . . . 59
4.3 Opteron 2435—memory read latency . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Opteron 2435—memory latency with enabled HT Assist . . . . . . . . . . . . . . . . . 61
4.5 Opteron 2435—TLB miss penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Opteron 2435—single-threaded read and write bandwidths . . . . . . . . . . . . . . . . 62
4.7 Intel Westmere micro-architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.8 Composition of the dual-socket Xeon X5670 test system . . . . . . . . . . . . . . . . . 66
4.9 Xeon X5670—memory read latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.10 Xeon X5670—TLB miss penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.11 Xeon X5670—single-threaded read and write bandwidths . . . . . . . . . . . . . . . . . 68
4.12 Xeon X5670—SIMD effect on bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.13 Xeon X5670—aggregated bandwidth using 128 bit loads and stores . . . . . . . . . . . 71
4.14 Intel Sandy Bridge micro-architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.15 Composition of the dual-socket Xeon E5-2670 test system . . . . . . . . . . . . . . . . 73
4.16 Xeon E5-2670—memory read latency . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.17 Xeon E5-2670—DRAM latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.18 Xeon E5-2670—impact of TLB misses on memory latency . . . . . . . . . . . . . . . . 75
4.19 Xeon E5-2670—single-threaded read and write bandwidths . . . . . . . . . . . . . . . . 76
4.20 Xeon E5-2670—the ISA’s impact on memory bandwidth . . . . . . . . . . . . . . . . . 77
4.21 Xeon E5-2670—bandwidth using multiple cores . . . . . . . . . . . . . . . . . . . . . . 78
4.22 Intel Haswell micro-architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.23 Structure of the 12-core Xeon E5 v3 die . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.24 Composition of the dual-socket Xeon E5 v3 system . . . . . . . . . . . . . . . . . . . . 81
4.25 Xeon E5-2680 v3—read latency of Modified and Exclusive data . . . . . . . . . . . . . 82
4.26 Xeon E5-2680 v3—impact of TLB misses on memory latency . . . . . . . . . . . . . . 83
4.27 Xeon E5-2680 v3—read latency of shared data . . . . . . . . . . . . . . . . . . . . . . 84
4.28 Xeon E5-2680 v3—single-threaded read and write bandwidth . . . . . . . . . . . . . . 85
4.29 Xeon E5-2680 v3—the ISA’s impact on memory bandwidth . . . . . . . . . . . . . . . 86
4.30 AMD family 15h (models 00h – 1Fh) micro-architecture . . . . . . . . . . . . . . . . . 89
4.31 Composition of the quad-socket Opteron 6274 test system . . . . . . . . . . . . . . . . 90
4.32 Opteron 6274—memory read latency . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.33 Opteron 6274—HyperTransport transfers that comprise three nodes . . . . . . . . . . . 93
4.34 Opteron 6274—impact of TLB misses on memory latency . . . . . . . . . . . . . . . . 96
4.35 Opteron 6274—memory read bandwidth of scalar and vector instructions . . . . . . . . 96
4.36 Opteron 6274—write bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.37 Opteron 6274—bandwidth scaling within compute unit . . . . . . . . . . . . . . . . . . 99
4.38 Influence of the used ISA on the achievable read bandwidth . . . . . . . . . . . . . . . . 101
4.39 Scaling of the bandwidth with the number of cores . . . . . . . . . . . . . . . . . . . . 102
5.1 Topology of quad-socket test systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2 SPEC OMPM2001—scaling with the number of used cores . . . . . . . . . . . . . . . . 105
5.3 SPEC OMPM2001—scaling with the number of used processors . . . . . . . . . . . . . 106
5.4 Comparison of access latency and out-of-order windows . . . . . . . . . . . . . . . . . 107
5.5 Xeon E5-2670—Throughput of arithmetic instructions depending on data location . . . . 108
5.6 Effect of resource scaling in multi-core processors on fixed-size speedup . . . . . . . . . 109
5.7 Average resource scaling example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.8 Workload scaling and resulting runtime distribution . . . . . . . . . . . . . . . . . . . . 111
5.9 Multi-core speedup under resource constraints . . . . . . . . . . . . . . . . . . . . . . . 112
5.10 Xeon E5-2670—read bandwidth and perf::L1-DCACHE-LOADS / -LOAD-MISSES . . 114
5.11 Xeon E5-2670—read bandwidth and MEM_LOAD_UOPS_RETIRED events . . . . . . 114
5.12 Xeon E5-2670—counters that identify the source of the accessed data . . . . . . . . . . 115
5.13 Xeon E5-2670—write bandwidth and write back events . . . . . . . . . . . . . . . . . . 116
5.14 Xeon E5-2670—last level cache counters . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.15 Xeon E5-2670—home agent counters . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.16 Xeon E5-2670—indicators for core-to-core transfers . . . . . . . . . . . . . . . . . . . 120
5.17 Xeon E5-2680 v3—QPI counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.18 Xeon E5-2670—correlation between performance counters and memory latency . . . . . 123
5.19 Xeon E5-2670—memory latency and MEM_LOAD_UOPS events . . . . . . . . . . . . 125
5.20 Impact of the memory latency on the achievable bandwidth . . . . . . . . . . . . . . . . 125
5.21 Xeon E5-2670—indicators for bandwidth-boundedness . . . . . . . . . . . . . . . . . . 126
5.22 Decomposition of stall cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.23 Xeon E5-2680 v3—correlation between performance counters and memory latency . . . 127
5.24 Custom metric for memory-boundedness in Vampir . . . . . . . . . . . . . . . . . . . . 128
5.25 Memory-boundedness—363.swim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.26 Bandwidth-boundedness of selected SPEC OMP2012 benchmarks . . . . . . . . . . . . 130
5.27 Bandwidth utilization—370.mgrid331 . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.28 NUMA awareness—351.bwaves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.29 Per package DRAM utilization—351.bwaves . . . . . . . . . . . . . . . . . . . . . . . 132
List of Tables
1.1 Comparison of Top500 results and achieved application performance . . . . . . . . . . . 3
2.1 Development of execution resources in Intel micro-architectures . . . . . . . . . . . . . 11
2.2 Latency and bandwidth for different levels in the memory hierarchy . . . . . . . . . . . 13
2.3 States of the MESI protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 States of the MESIF protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 States of the MOESI protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Number of data TLB entries per core in customary x86 processors . . . . . . . . . . . . 29
3.1 Measurement routines of the throughput kernel . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Dual-socket Test Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Opteron 2435—memory read latency . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Opteron 2435—core-to-core read and write bandwidths . . . . . . . . . . . . . . . . . . 63
4.4 Opteron 2435—L3 and main memory bandwidth scaling . . . . . . . . . . . . . . . . . 64
4.5 Xeon X5670—memory read latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6 Xeon X5670—core-to-core read and write bandwidths . . . . . . . . . . . . . . . . . . 69
4.7 Xeon X5670—L3 and main memory bandwidth scaling . . . . . . . . . . . . . . . . . . 71
4.8 Xeon E5-2670—memory read latency . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.9 Xeon E5-2670—read and write bandwidths of core-to-core transfers . . . . . . . . . . . 75
4.10 Xeon E5-2670—L3 and main memory bandwidth . . . . . . . . . . . . . . . . . . . . . 78
4.11 Test systems with complex NUMA topologies . . . . . . . . . . . . . . . . . . . . . . . 79
4.12 Xeon E5-2680 v3—L3 cache and main memory latency . . . . . . . . . . . . . . . . . . 83
4.13 Xeon E5-2680 v3—L3 latency if copies exist in multiple NUMA nodes . . . . . . . . . 84
4.14 Xeon E5-2680 v3—memory latency if data has been shared by multiple cores . . . . . . 84
4.15 Xeon E5-2680 v3—single-threaded read bandwidth in GB/s . . . . . . . . . . . . . . . 86
4.16 Xeon E5-2680 v3—L3 bandwidth scaling [GB/s] . . . . . . . . . . . . . . . . . . . . . 87
4.17 Xeon E5-2680 v3—memory read bandwidth in GB/s . . . . . . . . . . . . . . . . . . . 87
4.18 Xeon E5-2680 v3—memory write bandwidth in GB/s . . . . . . . . . . . . . . . . . . . 88
4.19 Xeon E5-2680 v3—memory read (write) bandwidth in COD mode . . . . . . . . . . . . 88
4.20 Opteron 6274—Number of HyperTransport hops for accessing remotely cached data . . 93
4.21 Opteron 6274—HyperTransport message delivery times . . . . . . . . . . . . . . . . . . 93
4.22 Opteron 6274—latency of cache-to-cache transfers that involve multiple NUMA nodes . 94
4.23 Opteron 6274—latency of local and remote cache and memory accesses . . . . . . . . . 95
4.24 Opteron 6274—read bandwidth depending on width of load instructions . . . . . . . . . 97
4.25 Opteron 6274—bandwidth scaling using SIMD instructions . . . . . . . . . . . . . . . . 100
4.26 Opteron 6274—HyperTransport bandwidths . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1 Hardware configuration of quad-socket systems . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Aggregate read and write bandwidths per NUMA node in GB/s . . . . . . . . . . . . . . 104
5.3 SPEC OMPM2001—parallel efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.4 Indicators for bandwidth usage per core . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5 Indicators for bandwidth usage per processor . . . . . . . . . . . . . . . . . . . . . . . 119
5.6 Xeon E5-2670—Counters for stall cycles . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.1 Comparison of x86-membench with other established benchmarks . . . . . . . . . . . . 133
Acknowledgments
At this point I would like to thank everyone who supported me throughout the years. This thesis would
not have come to fruition without the valuable feedback and encouragement I received.
First of all, I would like to thank Prof. Wolfgang E. Nagel for the continued support and his inexhaustible
patience in supervising this thesis. Besides the interesting and challenging work on several research
projects, I have been given the opportunity to pursue my own research interests, which finally led to this
thesis. Furthermore, I want to thank Prof. Thomas Ludwig for his guidance as well as Guido Juckeland
for reviewing countless iterations of my work.
I would also like to thank the colleagues at the Center for Information Services and High Performance
Computing who made sure that the stressful and exhausting years of writing this thesis have mostly
been a pleasant time. Particular thanks are due to the members of the energy efficiency research group
who always inspired me and critically questioned my ideas. Moreover, I want to thank the German
Federal Ministry of Education and Research for funding the research projects “eeClust”, “Cool Silicon”,
and “Score-E”, which facilitated my research.
My work would not have been possible without the numerous software developers who created the tools
that I am using. My special thanks go to Robert Schöne and Michael Werner who develop Score-P’s
plugin counter interface and the uncore counter plugin as well as Maik Schmidt who assisted in the
development of x86-membench during his time as an apprentice.
Last but not least, I want to thank my family for keeping me motivated the whole time.
