Accelerating Checkpoint/Restart Application Performance in Large-Scale Systems with Network Attached Memory by Schmidt, Juri
DISSERTATION
submitted to the
Combined Faculty of Natural Sciences and Mathematics
of the
Ruprecht–Karls University
Heidelberg
for the degree of
Doctor of Natural Sciences
put forward by
M.Sc. Juri Schmidt
born in
Alexejewka, Kazakhstan
Mannheim, 2017

Advisor: Professor Dr. Ulrich Brüning
Date of the oral examination: ...........................

To my beloved family
my parents
Vera and Georg Schmidt
and my sister
Ludmila Schmidt

Abstract
Technology scaling and a continual increase in operating frequency have been the main
driver of processor performance for several decades. A recent slowdown in this evolution
is compensated by multi-core architectures, which challenge application developers
and also increase the disparity between the processor and memory performance. The
increasing core count and growing scale of computing systems furthermore turn attention
to communication as a significant contributor to application run-times.
Larger systems also comprise many more components which are subject to failures.
In order to mitigate the eﬀects of these failures, fault tolerance techniques such as
Checkpoint/Restart are used. These techniques often rely on message-based
communication, and the associated data transport stresses the local memory
interface. In order to reduce
communication overhead it is desirable to either decrease the number of messages, or
otherwise to accelerate the execution of commonly used global operations. Finally,
power consumption of large-scale systems has become a major concern and the eﬃciency
of such systems must considerably improve to allow future Exascale systems to operate
within a reasonable power budget.
This work addresses the topics of memory interface, communication, fault tolerance, and
energy eﬃciency in large-scale systems. It presents Network Attached Memory (NAM),
an FPGA-based hardware prototype that can be directly connected to a common
high-performance interconnection network in large-scale systems. It provides access
to the emerging memory technology Hybrid Memory Cube (HMC) as shared memory
resource, tightly integrated with processing elements.
The ﬁrst part introduces the HMC memory architecture and serial interface, and
thoroughly evaluates it in an FPGA using a custom-developed host controller, which
has become an open-source initiative.
The next part describes the hardware architecture of the NAM design and prototype,
and theoretically evaluates the expected performance and bottlenecks. The NAM design
was fully prototyped in an FPGA and the contribution also comprises a corresponding
software stack.
As a ﬁrst use case NAM serves as Checkpoint/Restart target, aiming to reduce inter-
node communication and to accelerate the creation of checkpoint parity information.
Reducing checkpointing overhead improves both application run-times and energy
efficiency.
The ﬁnal part of this work evaluates the NAM performance in a 16 node test system.
It shows good read/write scaling behavior for an increasing number of nodes. For
Checkpoint/Restart with a real application, the NAM achieves a 2.1x improvement
over a standard approach. This result demonstrates the viability of a dedicated
hardware component to reduce communication and fault tolerance overhead for current
and future large-scale systems.
Zusammenfassung (Summary)
For decades, the continual increase in microprocessor performance was driven by
ever finer semiconductor structures and rising clock rates. The recently observed
slowdown of this development is compensated by multi-core architectures. These
require parallelized applications and pose great challenges to application
developers and to the processor-main memory interface alike. The continuing trend
towards ever larger distributed systems and the accompanying growth in the number
of individual components places particular demands on the interconnection network,
so that many applications already spend much of their time on pure communication.
Larger systems also increase the probability of defects. To reduce their negative
effects and to tolerate defects, Checkpoint/Restart mechanisms are commonly used.
Since these are mostly based on communication between individual nodes and
additionally load the processor-main memory interface, it makes sense either to
reduce the amount of communication required or to minimize its impact. Finally,
the power consumption of distributed systems is becoming increasingly important.
With regard to the Exascale era it is therefore imperative to improve energy
efficiency significantly in order to keep the power consumption of these systems
within reasonable limits.
This work addresses the above-mentioned issues of memory interface, communication,
fault tolerance, and energy efficiency and presents Network Attached Memory (NAM).
NAM is a hardware prototype that can be connected directly to a common
high-performance interconnection network in distributed systems. It provides
access to shared memory realized with the emerging Hybrid Memory Cube (HMC)
technology.
The first contribution comprises the introduction, technology analysis, and
evaluation of the HMC in an FPGA using a custom-developed host controller that is
freely available as an open-source initiative.
The next contribution explains the development process and hardware architecture
of the NAM design and prototype and determines its performance theoretically. For
this purpose, the NAM design was fully implemented in an FPGA and complemented by
the software components required for access.
In a first use case, the NAM serves as an accelerator for Checkpoint/Restart
processes with the goal of reducing communication between nodes and computing the
required parity information faster. This benefits application run-times and
energy efficiency.
The last contribution contains various performance measurements in a real 16-node
system. These show optimal scalability for read and write access. For
Checkpoint/Restart, a remarkable 2.1-fold speedup is achieved. This result
confirms the success of the NAM concept for reducing communication and the
computational effort for fault tolerance in current and future systems.
Contents
1 Introduction
1.1 Motivation
1.2 Vision
1.3 Contributions
1.4 Outline
2 State of the Art
2.1 Memory: Technologies and Interfaces
2.1.1 Latency
2.1.2 Bandwidth
2.1.3 Power
2.1.4 Emerging Memory Technologies
2.1.5 Serial Interfaces
2.1.6 Processing in Memory
2.1.7 Summary Memory
2.2 Communication in HPC systems
2.2.1 Interconnection Networks
2.2.2 Message Passing and Communication Characteristics
2.2.3 Summary Communication
2.3 Fault Tolerance in HPC Systems
2.3.1 Failure Causes
2.3.2 Fault Tolerance using Checkpoint/Restart
2.3.3 SCR: Scalable Checkpoint / Restart
2.3.4 Summary Fault Tolerance
2.4 Summary
3 Hybrid Memory Cube
3.1 Introduction and Architecture Analysis
3.1.1 Architecture
3.1.2 DRAM Organization and Performance
3.1.3 Link
3.1.4 Chaining
3.1.5 Protocol
3.1.6 The Flow Control Barrier
3.1.7 Summary HMC Architecture
3.1.8 Outlook
3.1.9 Lessons learned for an HMC host controller design
3.2 openHMC Host Controller
3.2.1 Configurations and Features
3.2.2 Operating Frequencies
3.2.3 Flow Control and Performance
3.2.4 Comparison with other IPs
3.2.5 ASIC Implementation
3.3 HMC Performance Evaluation
3.3.1 Metrics
3.3.2 Test Setup
3.3.3 Access Patterns
3.3.4 Bandwidth
3.3.5 Latency
3.3.6 Atomic Operations
3.3.7 Power Consumption and Energy Efficiency
3.3.8 Summary Performance Evaluation
3.4 HMC Summary
4 Network Attached Memory
4.1 DEEP-ER Project
4.2 Background: EXTOLL
4.2.1 Functional Units and Link Performance
4.2.2 From Software to Network Transactions
4.2.3 Notification Mechanism
4.2.4 Network Protocol
4.2.5 Link Flow Control
4.2.6 EMP: Network Discovery and Setup
4.3 NAM Hardware
4.3.1 Requirements
4.3.2 Prototype ’Aspin-v2’
4.3.3 FPGA Design Partitions
4.4 Summary Estimated Read/Write Performance
4.5 Checkpoint/Restart
4.5.1 Buddy Checkpointing in DEEP-ER
4.5.2 Definitions
4.5.3 Design Space Exploration
4.5.4 Vision: NAM-XOR Checkpointing in DEEP-ER
4.5.5 Configuration
4.5.6 Generating a Checkpoint
4.5.7 Restarting from a Checkpoint
4.5.8 CR Functional Unit
4.5.9 Estimated Performance
4.6 Implementation Results
4.7 NAM Software
4.7.1 EMP Extension
4.7.2 The libNAM Library
4.7.3 NAM Manager
4.8 NAM Summary
5 NAM Performance Evaluation
5.1 Read/Write Microbenchmark Results
5.1.1 Single Link Performance
5.1.2 Two Link PUT/GET Bandwidth
5.1.3 Analysis and Improvements
5.2 Checkpoint/Restart
5.2.1 Microbenchmark Results
5.2.2 Application Performance
5.3 Performance Summary
6 Conclusion and Outlook
6.1 Improvements
6.2 Outlook
A Acronyms
B List of figures
C List of tables
R References
Chapter 1
Introduction
For many years, the increase in processor and system performance was driven by
technology scaling, which allowed more transistors to be packed per area at a fixed
power budget. Increasing operating frequencies improved single-thread performance
accordingly. Since the early 2000s, however, this continual increase has slowed
down due to excessive power dissipation, which is also caused by the leakage
current of today's tiny transistor feature sizes. Multi-core architectures were
developed to keep up with the traditional growth rate, and system performance was
scaled by adding more and more components and nodes. Although these new
architectures pose challenges to application developers, as they require carefully
parallelized codes, the overall system performance kept increasing at a moderate
rate. This is documented in Figure 1.1, which shows the evolution of the number 1
systems of the TOP500 list of supercomputers.
At first glance the current trend gives no indication that there is something wrong
at all, especially not with the memory system. This is because the TOP500 LINPACK
benchmark is largely insensitive to memory performance and instead re-uses data
that remains in registers and caches [1]. In reality, however, memory access times
and bandwidth lag behind the historical evolution of CPU performance. This disparity
is well known as the memory wall [2], and the gap is widening with the ongoing
increase in the number of CPU cores per socket that share the same memory
interface.
[Figure 1.1: peak performance (logarithmic axis, 100 GFlop/s to 100 PFlop/s) and system power consumption (1 MWatt to 20 MWatt) over the years 1992 to 2016. Series: TOP500 #1 Rmax, TOP500 #1 Rmax trend, TOP500 #1 power.]
Fig. 1.1 TOP500 number 1 system performance and power development¹
Despite the lag in memory evolution, large-scale systems continually achieved
extraordinary performance gains. Fundamental for this development has been a steady
increase in component and node count, with the current number 1 TOP500 system
comprising more than 10 million cores¹. Communication between two or more cores is
typically realized via message passing, and unless an application is perfectly
parallel, inter-process communication will take place. Hence, whenever a message is
sent or received across node boundaries, it also involves the interconnection
network. Work in [3] shows that current large-scale applications spend an average
of 36 % of their runtime on point-to-point communication and on waiting for
collective operations to complete. It will be shown that the interconnection
networks for their part have substantially improved over the last decade, so that
other approaches to mitigate the overhead of inter-process communication must be
developed.
One additional and underestimated drawback of system scaling is its impact on the
system's error rates, or Mean Time Between Failure (MTBF). Hardware faults are
among the most common causes of system crashes, and every single additional
component potentially decreases the average time between two failures. An Exascale
machine is predicted to comprise about 260 thousand nodes (134 million cores) [4],
that is about 6 times more nodes (12 times more cores) than in the current
number 1 TOP500 system. Unless the per-component MTBF improves significantly,
systems will fail even more frequently in the future. In order to properly
recover from failures, resilience and fault tolerance mechanisms play an important
part in modern large-scale systems. These features are typically implemented by
periodically storing the application's or system state to disk. The checkpoint may
then be restored upon a system failure in order to reduce the amount of work lost.
The obvious disadvantage is that checkpoints are created whether or not there is an
actual failure. This process requires application time as well as memory and network
bandwidth, and can cause applications to execute more than 10 times slower [5].
¹ Data collected via the TOP500 statistics sublist generator (www.top500.org/statistics/sublist).
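The effect of node count on the system MTBF described above can be illustrated with a minimal sketch. It assumes independent nodes with constant failure rates, so that the system MTBF is the per-node MTBF divided by the node count. This model and the per-node MTBF value are assumptions made for illustration only; the node counts follow the figures quoted in the text.

```python
def system_mtbf_hours(node_mtbf_hours, num_nodes):
    """System MTBF under the assumption of independent nodes with constant
    failure rates: the individual failure rates simply add up."""
    return node_mtbf_hours / num_nodes


# Hypothetical per-node MTBF of 5 years; NOT a value from this work.
NODE_MTBF_HOURS = 5 * 365 * 24

# Node counts based on the text: about 260 thousand nodes at Exascale,
# which is about 6 times more than the current number 1 system.
exascale_nodes = 260_000
current_nodes = exascale_nodes // 6

current_mtbf = system_mtbf_hours(NODE_MTBF_HOURS, current_nodes)
exascale_mtbf = system_mtbf_hours(NODE_MTBF_HOURS, exascale_nodes)

print(f"current system MTBF : {current_mtbf * 60:.1f} minutes")
print(f"Exascale system MTBF: {exascale_mtbf * 60:.1f} minutes")
print(f"failures become {current_mtbf / exascale_mtbf:.1f}x more frequent")
```

Under this simple model the failure frequency grows linearly with the node count, regardless of the per-node MTBF that is chosen.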
Finally, power has become a main concern of today’s and future large-scale systems.
Main memory in particular is one of the largest consumers with up to 40 % for a
current system [6] and a projected 65 % at Exascale [7]. Inter-process communication
and fault tolerance mechanisms additionally reduce the actual work that can be done
within a given time period, which negatively impacts the system’s energy eﬃciency.
To allow future Exascale systems to operate within an economically and practically
reasonable power budget it was suggested to limit the power consumption of such a
system to 20 MWatt [7, 8]. As can be seen in Figure 1.1, the current number 1 system
already consumes more than 15 MWatt at less than 100 PFLOP/s peak performance.
The need for a change of the system architecture becomes clear when this
system is scaled to Exascale. At 1 Exaflop per second it would consume about 150
MWatt, which exceeds the 20 MWatt goal by a factor of 7.5.
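The extrapolation above can be reproduced with a few lines of arithmetic. The sketch uses the rounded values from the text (15 MWatt at 100 PFLOP/s) and assumes that power scales linearly with performance, which is a simplification.

```python
# Rounded values from the text.
current_power_mw = 15.0       # "more than 15 MWatt"
current_perf_pflops = 100.0   # "less than 100 PFLOP/s"
target_perf_pflops = 1000.0   # 1 Exaflop per second
power_goal_mw = 20.0          # suggested Exascale power limit


def scaled_power_mw(power_mw, perf_pflops, target_pflops):
    """Power at the target performance, assuming linear scaling."""
    return power_mw * (target_pflops / perf_pflops)


projected_mw = scaled_power_mw(current_power_mw, current_perf_pflops,
                               target_perf_pflops)
overshoot = projected_mw / power_goal_mw

print(f"projected power: {projected_mw:.0f} MWatt")
print(f"exceeds the goal by a factor of {overshoot:.1f}")
```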
In summary, the challenges on the road to Exascale machines are best described by
the following quote:
The architectural challenges for reaching Exascale are
dominated by power, memory, interconnection networks,
and resilience.
— Richard C. Murphy et. al. (2010) [9]
1.1 Motivation
The motivation for this work is based on the following three key observations:
The Processor-Memory Gap
The memory interface is one of the last parallel buses and probably the most
critical bottleneck in modern computing systems. The disparity between processor
and memory performance is ever increasing and the situation got worse with the
introduction of multi-core architectures. Since no technological breakthroughs
are expected in the near future, it is time to revisit the memory interface and
evaluate alternatives.
Inter-Node Communication
Large-scale systems typically communicate via message passing and data must be
transported between two or more nodes whenever the communicating processes
are spread across distinct nodes. While some applications mainly rely on point-
to-point communication, others spend a lot of time in processing and waiting
for the completion of collective operations. Most often the memory interface is
involved in collective operations as it holds the data elements that are placed
and retrieved by processors and the interconnection network. It is desirable to
either reduce the number of messages that are sent or otherwise to increase the
performance of point-to-point and collective communication.
Fault Tolerance
With an increasing number of components in large-scale systems, and without a
signiﬁcant improvement in component reliability, the MTBF will continue to drop
and the frequency of catastrophic failures will increase. To mitigate the eﬀects of
such failures, to reduce the amount of work lost, and to allow rapid system recovery,
today’s systems deploy fault tolerance techniques using Checkpoint/Restart.
Unfortunately, checkpointing introduces additional overhead and can take up a
large share of the application runtime. Parity checkpoints were introduced to
lower the overhead at the expense of computation and communication, which
heavily utilizes processors and the memory and storage system. It is therefore
necessary to investigate innovative approaches that reduce this overhead and
speed up the parity creation process.
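To make the idea of parity checkpoints concrete, the following minimal sketch assumes bytewise XOR as the parity function, as the term NAM-XOR used later in this work suggests. The function names and the checkpoint contents are hypothetical and only serve as an illustration; this is not the implementation described in this work.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(blocks):
    """XOR parity over a list of equally sized blocks."""
    return reduce(xor_blocks, blocks)


# Hypothetical checkpoints of four nodes (all of the same length).
checkpoints = [b"node0-st", b"node1-st", b"node2-st", b"node3-st"]
parity_block = parity(checkpoints)

# Node 2 fails: its checkpoint is rebuilt from the survivors and the parity.
lost = 2
survivors = [c for i, c in enumerate(checkpoints) if i != lost]
recovered = parity(survivors + [parity_block])

print(recovered == checkpoints[lost])
```

Only one parity block is stored for the whole group, but creating it requires every checkpoint to be combined, which is the computation and communication cost mentioned above.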
1.2 Vision
As the types of hardware and interfaces in computing systems are standardized and
systems are built from commercial oﬀ-the-shelf components, the only way to achieve
improvements in the areas mentioned above is a dedicated hardware component. The
ideal candidate avoids slow memories, offloads processors from computing collective
operations, reduces communication and synchronization effort, and provides sufficient
bandwidth to serve as target for as many processes and nodes as possible. Figure 1.2
envisions how such a component, integrated with an existing interconnection fabric,
is meant to improve collective operations by reducing inter-node communication and
associated memory accesses. This general approach can be transferred to
Checkpoint/Restart which in large parts relies on these patterns.
[Figure 1.2: (a) Example traditional scheme: A collective operation is implemented as a ring between Node 1 to Node 4, with sequential steps 1, 2, 3. (b) Envisioned scheme: NAM as central instance to execute collective operations; each node sends to the NAM in step 1 and receives the result.]
Fig. 1.2 NAM Vision: Reduce communication and offload processor computation
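The difference between the two schemes in Figure 1.2 can be sketched with a toy model that counts sequential communication steps for a sum over four nodes. The model is an assumption made for illustration: it presumes that the central instance can accept the operands of all nodes concurrently, and it ignores bandwidth limits and the distribution of the result.

```python
def ring_reduce(values):
    """Forward a running sum from node to node; each hop is one sequential step."""
    partial = values[0]
    steps = 0
    for value in values[1:]:
        partial += value  # the next node adds its operand and forwards the sum
        steps += 1
    return partial, steps


def central_reduce(values):
    """All nodes send their operand to one accumulator (the role of the NAM).

    Assumption: the writes are issued concurrently and count as one step.
    """
    accumulator = 0
    for value in values:
        accumulator += value
    return accumulator, 1


node_values = [3, 5, 7, 11]  # hypothetical operands of Node 1 to Node 4
ring_result, ring_steps = ring_reduce(node_values)
central_result, central_steps = central_reduce(node_values)

print(f"ring:    result={ring_result}, sequential steps={ring_steps}")
print(f"central: result={central_result}, sequential steps={central_steps}")
```

In this model the ring needs N-1 dependent steps for N nodes, while the central scheme needs a single step, at the price of concentrating all traffic on one component.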
1.3 Contributions
This work presents the implementation of the Network Attached Memory, a dedicated
component to serve as a global shared storage and to carry out collective operations in
large-scale systems. It therefore employs network interfaces that provide the ability to
connect it to available links within the EXTOLL high-performance interconnection
network.
Based on the motivation to replace the current parallel memory interfaces with a
ﬂexible and serial one, the NAM prototype implements the emerging Hybrid Memory
Cube memory interface. The HMC performance and power efficiency are analyzed
and evaluated in an FPGA (Field Programmable Gate Array) using a custom host
controller. This contribution comprises conference publications [10, 11] and the
well-adopted open-source initiative openHMC [12].
The developed FPGA design implements links to the EXTOLL network and the HMC.
It provides modules that allow remote hosts to read from and write to the
HMC memory on the NAM. A checkpointing module improves Checkpoint/Restart
mechanisms that are typically deployed in today's systems. It aims to reduce the
communication and synchronization overhead between participating nodes and to
offload the calculation of the corresponding parity information from the
processors. This contribution
has led to news articles [13, 14] and a conference poster [15].
Finally, the performance of the NAM for reading and writing, and for the
Checkpoint/Restart (CR) use case, is evaluated in a 16-node test system. The
results show that CR with the NAM outperforms a current approach by a factor of
2.1.
1.4 Outline
The remainder of this work is organized as follows: The next chapter covers the
three relevant topics of memory, inter-node communication, and fault tolerance. The
discussion supports the need to revisit the memory interface and indicates that
dedicated hardware may be able to mitigate the excessive overhead in communication
and fault tolerance. Chapter 3 presents the Hybrid Memory Cube (HMC) interface
and technology in detail. Using a self-developed host controller, the HMC performance
is characterized with real system measurements. Chapter 4 describes the development
of the Network Attached Memory (NAM) hardware prototype. It provides network
interfaces and integrates an HMC. The implemented FPGA design units are presented
and the theoretical NAM performance is evaluated. As a ﬁrst use case, the NAM
improves the creation of parity checkpoints in the DEEP-ER (Dynamical Exascale
Entry Platform - Extended Reach) project. The chapter is concluded by a description
of the Checkpoint/Restart process and the developed software components. Chapter
5 evaluates the NAM in a 16 node real system setup with microbenchmarks and a
DEEP-ER application mockup. The last chapter summarizes and reflects on the obtained
results and suggests improvements for a future NAM implementation.
Chapter 2
State of the Art
Today’s large-scale systems suﬀer from various limitations often caused by only very
few components. As this trend is expected to intensify in the future, and in order
to develop potential solutions, it is necessary to understand the reasons behind these
limitations.
The ﬁrst section of this chapter describes the historical evolution and current trends in
the main memory development, and motivates the adoption of serial interfaces as one
solution for most of the issues presented. Next, based on the prevalent software and
hardware components, the communication in High Performance Computing (HPC)
systems is analyzed. The third section presents currently deployed fault tolerance
techniques which will gain even more importance with increasing system sizes. A ﬁnal
summary that puts these three topics into context concludes this chapter.
2.1 Memory: Technologies and Interfaces
For many years the increase in CPU (Central Processing Unit) performance was
driven by Moore’s law which was initially formulated in 1965 [16]. It predicts that
the transistor count in microprocessors will double every 18 to 24 months, and this
prediction remained true for about four decades. Although recently a slowdown can be
observed, device manufacturers found ways to keep increasing the transistor count at a
moderate rate.
One processor characteristic that has stopped scaling, however, is the internal operating
frequency. This is because transistor power consumption is proportional
to frequency, and the power density increases as more transistors are packed per area.
Also, with smaller transistors, leakage current becomes significant, which causes the
processor to dissipate power and heat at an increasing rate. This has led to the end
of the well-known Dennard scaling [17], which states that power consumption remains
proportional to the chip area.
To keep up with the traditional performance growth rate of CPUs, multi-core architec-
tures were developed and current devices integrate as many as 72 cores on a single die
[18].
One component that historically lags behind processor performance is the Dynamic
Random-Access Memory (DRAM)-based main memory. Although it was formulated
more than 20 years ago, the current situation is very well summarized by the following
quote [19]:
Across the industry, today’s chips are largely able to execute code faster
than we can feed them with instructions and data. There are no longer
performance bottlenecks in the ﬂoating point multiplier or in having only a
single integer unit. The real design action is in memory
subsystems—caches, buses, bandwidth, and latency.
— Richard Sites: It’s the Memory, Stupid! (1996)
Similar to CPUs, DRAM obeyed Moore's law for a long time and only recently a
slowdown in capacity growth is observed. Much more critical than the capacity,
however, is the access time for a memory reference to the off-chip main memory. While
the relative single-core CPU performance increased by a factor of 10,000 in 30 years,
the resulting increase in peak memory accesses exceeds the capabilities of the memory
interface. More specifically, the DRAM access latency relative to the number of CPU
cycles it takes to serve a memory reference only improved by a factor of eight. Within
the same time period access times decreased from 250 ns in 1980 to 31 ns in 2012 [20].
Figure 2.1 illustrates this disparity in performance, which is well known as the memory
wall [2]. Although the historical development and current trend for the performance of
the number 1 system in the TOP500 list of supercomputers give little indication
that memory performance may be critical at all, it is and will remain a serious matter.
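The size of the gap follows directly from the numbers quoted above. The short calculation below is only an illustration; it treats the factor of 10,000 and the two latency values as exact, although they are rounded figures from [20].

```python
cpu_speedup = 10_000     # relative single-core performance gain (from the text)
latency_1980_ns = 250    # DRAM access time in 1980 (from the text)
latency_2012_ns = 31     # DRAM access time in 2012 (from the text)

dram_speedup = latency_1980_ns / latency_2012_ns
gap = cpu_speedup / dram_speedup

print(f"DRAM latency improvement: {dram_speedup:.2f}x")
print(f"processor-memory gap    : about {gap:.0f}x")
```

The latency improvement evaluates to roughly eight, so the processor performance has outgrown the memory latency by about three orders of magnitude.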
[Figure 2.1: normalized performance (logarithmic axis) over the years 1980 to 2010. Curves: Single Core Performance, Memory Performance; the widening difference is labeled Memory Gap.]
Fig. 2.1 Historical trend of the processor-memory gap [20]
The TOP500 list in particular is not suitable for revealing a memory
bottleneck because of the benchmark it uses to characterize systems. The LINPACK
benchmark is very insensitive to memory performance [1], which does not
accurately reflect the majority of HPC applications. It was also shown that future
application codes will be much more memory sensitive [21].
2.1.1 Latency
The recognition of the memory wall led to the introduction of caches and hierarchies
of caches in many variants and with various levels to hide the latency from a processor
view. Caches rely on the concept of temporal (a memory reference is likely to be
used more than once) and spatial locality (multiple accessed memory references are
within relatively close storage locations). Hence, caching tries to avoid accessing the
relatively slow physical memory interface by holding data in processor-local structures.
Eventually, data still needs to be transported over the memory interface with potentially
many independent processor cores competing for access. This increases the probability
of cache misses and leads to additional main memory accesses that are expensive in
terms of both latency and bandwidth. Even if only a single process accessed the
memory, it
might be limited by the interface latency if an application is not able to exploit locality
for its memory references.
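The influence of temporal and spatial locality on the number of main memory accesses can be illustrated with a toy cache model. The sketch below simulates a small direct-mapped cache; the cache geometry (64-byte lines, 8 lines) and the access patterns are assumptions chosen for illustration and do not describe any particular processor.

```python
def miss_rate(addresses, line_bytes=64, num_lines=8):
    """Fraction of accesses that miss in a direct-mapped cache."""
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        line = addr // line_bytes   # cache line that contains this address
        index = line % num_lines    # slot the line maps to
        if tags[index] != line:
            tags[index] = line      # fetch the line from main memory
            misses += 1
    return misses / len(addresses)


# Consecutive 8-byte words: spatial locality, one miss per 64-byte line.
sequential = list(range(0, 4096, 8))
# One access per cache line: no spatial reuse, every access misses.
strided = list(range(0, 4096 * 8, 64))
# The same four words again and again: temporal locality.
repeated = [0, 8, 16, 24] * 128

print(f"sequential: {miss_rate(sequential):.3f}")
print(f"strided   : {miss_rate(strided):.3f}")
print(f"repeated  : {miss_rate(repeated):.3f}")
```

The strided pattern shows the case mentioned above in which an application cannot exploit locality: every reference has to be served over the memory interface.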
Although smaller and faster transistors typically help latency, the increasing number
of transistors per area and additional memory chips to maximize capacity also result
in longer channel distances. These longer traces and higher fan-outs naturally increase
the signal latency and limit the switching speed on these lines. Also, the focus of
the DRAM semiconductor process has shifted away from maximizing performance to
increasing the capacity and reducing the memory cell’s leakage current, which is critical
with today’s small feature sizes. As the author of [22] states, bandwidth and capacity
are much easier to market than latency, which is yet another reason why latency has not
seen significant improvements.
2.1.2 Bandwidth
The situation with memory bandwidth is less critical than with latency, although
Double Data Rate (DDR), the most commonly used main memory interface, also
lags behind the requirements of modern processors. For example, an Intel Core i7
CPU with four cores can generate memory references that require a peak bandwidth of
409.6 GB/s [20]. The actual requirement can be even higher as peripheral devices may
also request access to main memory via Direct Memory Access (DMA). In contrast, a
current DDR4 module provides only 25 to 30 GB/s [20], less than a tenth of this demand.
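The peak demand quoted above can be reproduced with a short calculation. The following is a minimal sketch: the text only states the total of 409.6 GB/s, so the breakdown used here (a 3.2 GHz clock, two 64-bit data references and one 128-bit instruction reference per core and cycle) is an assumption that follows the commonly cited textbook derivation, and all names are chosen for illustration only.

```python
# Assumed breakdown of the peak memory demand of a four-core CPU.
# Only the total (409.6 GB/s) and the DDR4 range (25 to 30 GB/s) come from the text.
CORES = 4
CLOCK_HZ = 3.2e9                 # assumed core clock rate
DATA_REFS_PER_CYCLE = 2          # assumed 64-bit data references per core and cycle
INSTR_REFS_PER_CYCLE = 1         # assumed 128-bit instruction references per core and cycle
DATA_REF_BYTES = 8
INSTR_REF_BYTES = 16


def peak_demand_gb_per_s():
    """Return the peak bandwidth demand in GB/s under the assumptions above."""
    data = CORES * CLOCK_HZ * DATA_REFS_PER_CYCLE * DATA_REF_BYTES
    instr = CORES * CLOCK_HZ * INSTR_REFS_PER_CYCLE * INSTR_REF_BYTES
    return (data + instr) / 1e9


if __name__ == "__main__":
    demand = peak_demand_gb_per_s()
    for ddr4 in (25.0, 30.0):
        share = 100.0 * ddr4 / demand
        print(f"DDR4 at {ddr4:.0f} GB/s covers {share:.1f} % of {demand:.1f} GB/s")
```

Under these assumptions, a single DDR4 module covers only a single-digit percentage of the peak demand, which is the disparity the paragraph refers to.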
The main reason for this disparity is that the main memory interface has not seen
meaningful changes in more than 30 years. Although each new generation of DRAM
modules came with a slightly modified layout (which also required new processor
generations), the memory interface remains one of the last parallel interfaces in modern
computing systems. Performance gains were achieved by widening the interface, increasing
the pin speeds, and introducing DDR signaling and prefetch mechanisms. Adding more
memory modules for multi-channel operation remains another viable option but its
scaling behavior is limited by the excessive use of processor I/O pins, Printed Circuit
Board (PCB) routing issues and a lack of physical space to place the additional
components on a board.
Other approaches examined the feasibility of reducing memory latency (which correlates
with a bandwidth increase) by either improving the memory access scheduling for
multiple cores [23] or asynchronously reorganizing the DRAM banks within a memory
chip [24]. Finally, to address the rapidly growing mobile and graphics card
market, derivatives of the DDR interface were developed. These are tailored to the
varying requirements of the power-oriented mobile market (Low Power Double Data
Rate (LPDDR) [25]) and bandwidth-hungry graphics cards (Graphics Double Data
Rate (GDDR) [26]). All of these variants, however, are still limited by the memory
interface bottleneck, which needs to be revisited.
2.1.3 Power
Aside from memory performance, the power consumption of large-scale systems
increasingly moves into focus, and the memory system plays an important role in this
respect. An analysis of a high-end IBM server in [6] showed that memory consumes
as much as 40 % of the overall system power, and this trend is also observed with
current graphics cards [27].
To keep the power requirements of future Exascale systems within a reasonable budget, it
was decided that such systems should not exceed a total of 20 MW [7, 8]. Projections
in [7] show that this goal is ambitious and challenging. The authors scale a current
large-scale system to Exascale size and predict the power consumption, taking into account
technology improvements that enhance efficiency. The outcome of this experiment is
that such a system would consume 70 MW. More interestingly, memory is the largest
consumer with over 65 % of the total consumption, which corresponds to more than 45 MW
and thus to more than twice the entire 20 MW budget.
Clearly, memory and in particular its interface will remain one of the most important
targets for optimization for current and future systems. To bridge the gap until a
new technology with the potential to replace DRAM as main memory hits the market,
memory manufacturers recently started proposing alternative interfaces. Additionally,
advances in the semiconductor manufacturing process have made layer stacking and
heterogeneous stacks a viable option.
2.1.4 Emerging Memory Technologies
The increasing demand for memory performance and capacity, and the I/O and area
scalability issues of DRAM DIMMs (Dual In-line Memory Modules) have led to vertically
stacked architectures that leverage recent developments in the fabrication process. Multiple
layers of DRAM can now be stacked on top of each other, linked via tiny connections
called Through Silicon Vias (TSVs) [28]. The ability to pack more memory arrays
per area increases the capacity and results in shorter traces, hence reducing fan-out,
latency, and power consumption for signals on the memory interface channels.
Examples of stacked memories are HBM (High Bandwidth Memory) developed by
AMD and SK Hynix [29] (high-performance) and the WideIO standard [30] (low-power
mobile segment). Both memory types are DRAM based and still rely on a parallel
interface. Processor and memory components are typically placed and interconnected
on a common silicon interposer which is packaged in 2.5D technology. This brings the
two components closer to each other, thus further decreasing trace lengths and routing
eﬀort. HBM for instance is already deployed in AMD graphics cards [31] and Altera
FPGAs [32].
Although these new technologies only recently entered the market and their cost is
relatively high, it is expected that they will continue to gain considerable market
share as both significantly improve the memory performance and power characteristics
within their market segments. It must be noted that stacking is also utilized for
non-volatile storage class devices such as V-NAND from Samsung [33] as well as
3D-NAND [34] and the recently announced 3D XPoint [35] from Micron and Intel.
The second class of revolutionary packaging options is 3D integration. It benefits
from the additional advantage that TSVs enable different processes such as DRAM
and CMOS (Complementary Metal Oxide Semiconductor) to coexist within a single
stack. 3D integration allows the complexity of a memory controller to be shifted into a
logic layer at the bottom of a memory stack. Popular examples are Intel’s Multi Channel
DRAM (MCDRAM) and Micron’s HMC. Intel’s latest KNL (Intel Knights Landing)
CPUs connect multiple MCDRAM¹ devices via a proprietary interface in a 2.5D
package [18]. An MCDRAM is a stack of multiple DRAM layers on top of a logic base
that integrates the actual memory controller functions. Similarly, Micron’s HMC [37]
stacks up to 4 layers of DRAM on top of a logic base that fully integrates up to 16
independent memory controllers. The innovative part of HMC is that the traditional
parallel interface to the processor is replaced by high-speed serial links. The benefits
of such an interface are described next.
¹ According to [36], MCDRAM is based on HMC with a modified logic base and interface.
2.1.5 Serial Interfaces
Utilizing serial high-speed links to connect a memory breaks with the traditional
parallel interface, and there are several good reasons to examine this approach. A
serial interface
1. shifts the memory controller complexity into the memory stack. It decouples the
development of the memory interface from that of the actual DRAM array and other
memory technologies.
2. likely operates on packets instead of transactions. This enables the existence
of potentially many outstanding requests, which suits the demands of current
multi-core/multi-threaded CPUs with many independent request streams.
3. enables the use of application-specific packets and commands to integrate
processing capabilities close to the memory (see Section 2.1.6).
4. reduces I/O pin requirements and routing complexity. The interface itself consists
of several high-speed differential lanes and a few sideband signals.
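The second point of the list above can be illustrated with a toy model of a packet-based interface. This is a minimal sketch and not the packet format of any real device: the class name, the tag pool, and the method names are hypothetical. It only shows the principle that each request packet carries a tag, so that several requests can be outstanding at once and responses may return in any order.

```python
from collections import deque


class PacketizedLink:
    """Hypothetical host-side model of a packet-based memory interface.

    Every request carries a tag. Responses are matched by tag, so they
    may arrive in a different order than the requests were issued.
    """

    def __init__(self, num_tags):
        self.free_tags = deque(range(num_tags))
        self.outstanding = {}  # tag -> requested address

    def issue_read(self, address):
        """Send a read request; return its tag, or None if no tag is free."""
        if not self.free_tags:
            return None  # all tags in flight, the caller has to retry later
        tag = self.free_tags.popleft()
        self.outstanding[tag] = address
        return tag

    def complete(self, tag, data):
        """Match a response packet to its request and release the tag."""
        address = self.outstanding.pop(tag)
        self.free_tags.append(tag)
        return address, data


if __name__ == "__main__":
    link = PacketizedLink(num_tags=4)
    tags = [link.issue_read(addr) for addr in (0x100, 0x200, 0x300)]
    # Responses return out of order, as independent memory controllers might answer.
    for tag in reversed(tags):
        address, data = link.complete(tag, data=b"\x00" * 16)
        print(f"tag {tag}: response for address {address:#x}")
```

In such a scheme, the number of available tags bounds the number of requests in flight, which is why a packet-based interface suits processors with many independent request streams.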
The author of [38] formulated a motivation to adopt serial interfaces and in particular
highlighted the beneﬁts of Micron’s HMC as a candidate. This motivation is extended
by a detailed description and evaluation of the HMC in Chapter 3.
To the best of the author’s knowledge, at the time of writing the only other device
that stacks memory on top of a logic base with a serial interface is the SRAM (Static
Random-Access Memory)-based Bandwidth Engine (BE) by Mosys [39]. Although BE
provides the lowest memory access latency on the market (≈ 16 ns for a full memory
reference), its maximum capacity is currently 1 Gbit, which makes it unusable for most
applications and impractical as a main memory replacement. The HBM specification
similarly deﬁnes an optional interface die (e.g. with a serial interface). At the time of
writing, however, there are no devices with such an interface available.
2.1.6 Processing in Memory
This section so far has only considered changing or improving the existing parallel
memory interface. Another approach that has recently become a widely discussed topic
is to avoid data movement where applicable by shifting the actual processing into the
memory array, or at least as close to it as possible. Among other acronyms that have
emerged, the most popular is Processing In Memory (PIM).
PIM describes the tight integration of CMOS logic and memory cells within a single
chip. This idea is not novel, and several architectures that place combinational circuits
right next to the memory were already proposed in the 1990s (e.g. [40, 41]). Their
functionality, however, was limited to very basic operations, and only the recent
advancements in the fabrication process have made PIM an interesting topic for researchers.
HMC can be categorized as a PIM device as it supports atomic functions that can
autonomously add values to memory locations. The Active Memory Cube (AMC)
[42] takes this capability to the next level. Based on the HMC memory architecture, it
integrates a full Instruction Set Architecture with caches and pipelines. Although AMC
is still a research project and the performance projections are based on simulations,
it gives a glimpse into the capabilities of PIM and how it can be used to reduce the
memory interface traﬃc.
2.1.7 Summary Memory
This section highlighted the reasons for the existing and ever-increasing gap between
processor and memory performance. It became clear that the main memory
interface must change in order to keep up with the increasing number of cores and
components that access it. Although recent developments led to performance and
capacity improvements, they have not significantly changed the way memory is
accessed. A technological breakthrough that is able to replace the DRAM cell is
currently not foreseeable, but at the same time seems inevitable in order to meet the
proposed power budget of next-generation Exascale systems. To speed up the adoption
of future memory technologies, the serial memory interface was introduced. It comes
with plenty of benefits that have the potential to change the memory landscape. This
includes the possibility to rapidly develop and integrate complex processing units
within the memory stack without the need to change the interface itself.
2.2 Communication in HPC systems
Today’s HPC systems often comprise several thousand nodes, and projections show
that this number will scale up to 260,000 nodes for Exascale machines [4]. Without
a significant change in how these systems are designed, future systems will more or
less rely on the currently prevalent communication scheme: an application uses MPI
(Message Passing Interface) to exchange messages between two or more processes. These
messages are physically transported via the underlying hardware, the interconnection
network (interconnect). This section introduces several types of commonly deployed
interconnects and summarizes the communication schemes and typical patterns of
current large-scale systems.
Fig. 2.2 Energy cost for data movement across different layers [7] (y-axis: picojoule per operation, logarithmic, for the years 2010 and 2018; x-axis: DP Flop, register, 1 mm on-chip, 5 mm on-chip, off-chip/DRAM, local interconnect, cross system)
2.2.1 Interconnection Networks
To leverage the vast processing capabilities of thousands of nodes, the jobs
that run on these systems need to be partitioned and parallelized as well as possible.
Unless a job (or the problem) is perfectly parallelized, which often depends not only
on the programmer but also on the problem itself, inter-process communication will
take place. Sometimes, this communication occurs between two processes running on the
same node or even the same processor. Most often, however, inter-node communication
is inevitable, which is expensive in terms of energy and latency. Figure 2.2 illustrates
the energy cost of moving data through the different levels of the interconnect
hierarchy.
It can be seen that the energy required to transport information becomes significant
when moving off-chip (e.g. to get data from local DRAM), and increases further when
using the local (processor) interconnect or crossing node boundaries. Even worse,
whenever data is transported to a remote node, this node is likely to be waiting for it.
This additional latency results in stall states in which no useful computation is performed.
Hence, the interconnection network plays a vital role for the overall system performance
and energy efficiency. Assuming a perfectly parallelized application, the interconnect
is often the most important component and a popular optimization target to keep
communication overhead at a minimum.
Out of several interconnect technologies, Ethernet and Infiniband [43, 44] have emerged
as the most prevalent solutions in HPC. According to the TOP500 list of
supercomputers² (June 2017 edition), 208 out of 500 systems run Ethernet³ (41 %) and 177
use Infiniband⁴ (35 %). The remaining systems use proprietary interconnects such as
Intel Omnipath [46], which has been gradually gaining traction in the list since its
introduction in 2015. It must be noted that currently no machine among the 10 most powerful
supercomputers of the TOP500 list uses Ethernet or Infiniband. The leading spots are
held by non-standard, vendor-specific interconnects tailored to these machines and the
LINPACK benchmark.
Most of the interconnection networks, including Ethernet and Inﬁniband, have in
common that they come as a PCI Express (PCIe) plug-in card for the corresponding
slots in today’s commodity hardware. These Network Interface Controllers (NICs)
provide host connectivity via PCIe and one link to the network fabric. One exception
to this is Intel Omnipath, which integrates the NIC with the CPU and therefore removes
the often criticized PCIe connection as a bottleneck.
A message that targets a remote node is ﬁrst processed by the NIC and then sent
to the network. Switches and hierarchies of such are used to link all nodes together.
Unfortunately, switches come at a price and limit the scalability as the network becomes
non-uniform. Also, physical space must be reserved to place these switches in a rack.
One approach to avoid these drawbacks is the emerging interconnect EXTOLL [47,
48, 49]. Just as with the other interconnects, the EXTOLL NIC plugs into a standard
PCIe slot, but it already integrates the switching functionality. Each NIC provides six
network links and therefore allows topologies such as a 3D torus to be created directly,
maintaining scalability at all times. These are two of the main reasons why EXTOLL
has been selected to evaluate the NAM. Section 4.2 will present the technology in
detail.
² The TOP500 list of supercomputers, established in 1993, is a half-yearly updated list that ranks
the 500 most powerful supercomputers using the LINPACK benchmark [45].
³ Ethernet, 1G Ethernet, 10G Ethernet (majority), or 100G Ethernet.
⁴ Infiniband QDR, FDR (majority), or EDR.
Table 2.1 Interconnect performance comparison
Interconnect          Latency [µs]   Bandwidth [GB/s]
Ethernet 1G [50]      47             0.112
Ethernet 10G [50]     12             0.875
Infiniband QDR [50]   1.6            3.23
Infiniband EDR [51]   0.6            12.5
EXTOLL [52]           0.6-0.8        12.5
Performance-wise, all of the interconnects mentioned above have significantly improved
over the past decade. As Table 2.1 points out, Infiniband EDR and EXTOLL are
superior to 1G and 10G Ethernet in both bandwidth and latency. 10G Ethernet in
particular is still deployed that often (195 systems in the TOP500) because of its
relatively low cost. State-of-the-art HPC systems require thousands of
NICs and hundreds of switches to fully interconnect all nodes.
Whichever interconnect is used, it remains a tool for applications to facilitate
inter-process communication, and the actual utilization of the interconnect depends
on the communication characteristics of the application itself. It is particularly
important to understand these characteristics in order to optimize the overall system
performance, as simply improving the interconnect bandwidth and latency might not
pay off significantly in all cases.
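Why the communication characteristics matter can be made concrete with a simple first-order cost model. The following is a minimal sketch that assumes the time to deliver a message is the sum of the latency and the message size divided by the bandwidth. This model and the function names are illustrative assumptions and not part of the cited measurements. The latency and bandwidth values are taken from Table 2.1, and EXTOLL is omitted because its latency is given as a range.

```python
# (latency in microseconds, bandwidth in GB/s), values from Table 2.1
INTERCONNECTS = {
    "Ethernet 1G": (47.0, 0.112),
    "Ethernet 10G": (12.0, 0.875),
    "Infiniband QDR": (1.6, 3.23),
    "Infiniband EDR": (0.6, 12.5),
}


def transfer_time_us(size_bytes, latency_us, bandwidth_gb_s):
    """Assumed first-order model: latency plus serialization time.

    1 GB/s corresponds to 1e3 bytes per microsecond.
    """
    return latency_us + size_bytes / (bandwidth_gb_s * 1e3)


def latency_share(size_bytes, latency_us, bandwidth_gb_s):
    """Fraction of the modeled transfer time that is caused by latency."""
    return latency_us / transfer_time_us(size_bytes, latency_us, bandwidth_gb_s)


if __name__ == "__main__":
    for name, (lat, bw) in INTERCONNECTS.items():
        small = latency_share(8, lat, bw)          # e.g. a single 8-byte value
        large = latency_share(1024 * 1024, lat, bw)  # a 1 MiB message
        print(f"{name:15s} latency share: 8 B {small:6.1%}, 1 MiB {large:6.1%}")
```

In this model, small messages are dominated almost entirely by latency while large messages are dominated by bandwidth, so an application that mostly exchanges small messages gains little from a higher link bandwidth alone.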
2.2.2 Message Passing and Communication Characteristics
Message passing is the prevalent inter-process communication scheme in today’s large-
scale systems and has become the de-facto standard. It abstracts the underlying data
movements to a simple concept of messages that are sent and received between two
processes.
The most widely used message passing standard is MPI⁵ (first introduced in [54]),
which itself is not a library, but rather a specification that defines how message passing
libraries should operate. Based on this standard, several implementations including
MVAPICH/MVAPICH2 [55] and the very popular open-source variant Open MPI [56]
have evolved.
⁵ The full specification of the current MPI standard version 3.0 can be found in [53].
Fig. 2.3 Example MPI operations: (a) Point-to-point send/receive, (b) Broadcast, (c) Reduce, (d) Allreduce (Legend: Orange - sender. Blue - receiver. ’+’: Logical operation)
Figure 2.3 depicts four MPI operation examples. These include, but are not limited to,
asynchronous point-to-point messaging and collective operations such as Broadcast
(distribute data from one process to other processes), Reduce (move data from other
processes to one process and perform a logical operation; one process receives the
result), and Allreduce (apply a logical operation to data from all processes; all processes
receive the result). In particular, the actual implementation of Allreduce and other
similar collective operations depends on the MPI library that is used. Data exchange for
such functions can be realized with all-to-all communication (Figure 2.3d) or various
other logical topologies such as a binary tree or a ring.
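How the choice of the logical topology affects an Allreduce can be sketched with a small simulation. The code below is not an MPI implementation and does not reflect what a particular MPI library does. It is a minimal sketch in which the ranks are modeled as list entries, and the Allreduce is realized as a tree-shaped reduce to rank 0 (here a binomial tree) followed by a broadcast along the same tree. The function names and the message counting are assumptions made for illustration.

```python
import operator


def tree_allreduce(values, op=operator.add):
    """Simulate Allreduce as a tree reduce to rank 0 plus a tree broadcast.

    Returns the per-rank results, the number of messages, and the number
    of communication rounds.
    """
    p = len(values)
    partial = list(values)
    messages = 0
    rounds = 0

    # Reduce phase: in each round, rank r + stride sends its partial result to rank r.
    stride = 1
    while stride < p:
        for r in range(0, p, 2 * stride):
            partner = r + stride
            if partner < p:
                partial[r] = op(partial[r], partial[partner])
                messages += 1
        stride *= 2
        rounds += 1

    # Broadcast phase: the same tree is traversed in the opposite direction.
    result = [None] * p
    result[0] = partial[0]
    stride //= 2
    while stride >= 1:
        for r in range(0, p, 2 * stride):
            partner = r + stride
            if partner < p:
                result[partner] = result[r]
                messages += 1
        stride //= 2
        rounds += 1
    return result, messages, rounds


def all_to_all_messages(p):
    """Message count if every rank sends its value directly to every other rank."""
    return p * (p - 1)


if __name__ == "__main__":
    for p in (4, 64, 1024):
        _, msgs, rounds = tree_allreduce(list(range(p)))
        print(f"{p:5d} ranks: tree {msgs} messages in {rounds} rounds, "
              f"all-to-all {all_to_all_messages(p)} messages")
```

The tree variant needs far fewer messages than the all-to-all exchange but requires several dependent rounds, which hints at why synchronization rather than data volume tends to dominate the cost of collectives at scale.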
Apart from the actual scheme that is used for collective operations, an application
typically comes with predictable and well-known access patterns. It is essential to
characterize applications by means of their communication behavior to determine
useful system optimizations for a speciﬁc use case. Naive approaches to simply improve
the performance of components such as the processors or the interconnect will not
necessarily lead to substantial speed-ups if the actual bottleneck is somewhere else.
For example, some applications heavily utilize point-to-point communication while
others spend most of their time performing collective operations. This depends on the
application itself and how it is implemented.
Recent work in [3] analyzed the MPI characteristics of application traces collected
by the U.S. Department of Energy (DOE) [57]. The analyzed dataset comprises 18
different applications with 10 up to 13,000 ranks. The key finding of this work is that
these applications spend on average 36 % of their time in MPI routines, with a peak of
up to 60 %. Interestingly, while the vast bulk of data is transported via point-to-point
communication, the average application spends most of its MPI time in collective
operations. Although it was shown that only small amounts of data are processed
with collective operations, the synchronization overhead for a large number of processes
becomes significant. This insight is particularly important as it highlights that for
many applications the focus must be shifted to improving collective operations instead
of just increasing bandwidths.
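The potential benefit of accelerating MPI routines can be bounded with an Amdahl-style estimate. The following is a minimal sketch that assumes the time spent in MPI does not overlap with computation and that only the MPI portion is accelerated. The fractions of 36 % and 60 % are taken from the text, whereas the model itself and the function name are illustrative assumptions.

```python
def overall_speedup(mpi_fraction, mpi_speedup):
    """Amdahl-style bound: only the MPI share of the runtime is accelerated."""
    return 1.0 / ((1.0 - mpi_fraction) + mpi_fraction / mpi_speedup)


if __name__ == "__main__":
    for fraction in (0.36, 0.60):
        for factor in (2.0, 10.0, float("inf")):
            speedup = overall_speedup(fraction, factor)
            print(f"MPI share {fraction:.0%}, MPI accelerated {factor}x: "
                  f"application speed-up {speedup:.2f}x")
```

Under these assumptions, the achievable application speed-up is limited by the non-MPI share of the runtime, and it is the time spent inside MPI routines, not the transported data volume, that determines how much can be gained.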
This issue has already been identified, and the latest MPI-3.0 standard defines
non-blocking collective operations, which allow program execution to continue while
collective operations take place. However, it also raises the question of architectural
changes and encourages the use of dedicated resources to carry out these operations and to
reduce synchronization overhead.
2.2.3 Summary Communication
Current large-scale systems and applications rely on message-based communication
schemes, which send and receive messages via a physical interconnection network, and
it is not expected that this general approach will change in the near future.
Although the various types of interconnects, as an important part of inter-node
communication, showed substantial progress over the past decade, simply improving these
components may not be sufficient. This is particularly the case for applications that
spend a large amount of their execution time waiting for the completion of collective
operations. The resulting processor stalling negatively impacts the application’s
performance and energy efficiency.
Unarguably, the interconnect will remain an important system component and needs
to be further optimized. For many applications, however, there is an obvious need
to rethink system design, and a dedicated resource to mitigate existing application
bottlenecks appears reasonable and tempting.
2.3 Fault Tolerance in HPC Systems
Resilience has become a major concern for HPC systems. As systems continue to
grow in size, more and more components are added. Unfortunately, each additional
component is also subject to faults (e.g. a stuck bit), which are likely to result in errors
such as an incorrect value and false program execution. Errors, in turn, may
lead to incorrect system states or an application crash, known as a failure⁶. Previous
work in [59] showed that the number of failures per system is almost proportional
to its number of processors, which correlates with the amount of memory and other
components. Without an increase in component reliability, the MTBF of future systems
will further decrease, as it is expected that Exascale systems will comprise more than
260,000 nodes [4], 6 times more than the current number 1 ranked HPC system Sunway
TaihuLight⁷. The equation is easy: with 6 times more components at a given component
reliability, a system will fail 6 times more frequently.
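The scaling argument above can be written down as a short calculation. The following is a minimal sketch that assumes independent components with a constant failure rate and a system that fails as soon as any single component fails, in which case the system MTBF is the component MTBF divided by the number of components. The node MTBF of five years and the baseline node count are hypothetical values chosen for illustration only.

```python
HOURS_PER_YEAR = 8760.0


def system_mtbf_hours(component_mtbf_hours, num_components):
    """System MTBF for independent components with constant failure rates.

    Assumption: the failure of any single component causes a system failure.
    """
    return component_mtbf_hours / num_components


if __name__ == "__main__":
    node_mtbf = 5 * HOURS_PER_YEAR   # hypothetical node MTBF of five years
    nodes_today = 40_000             # hypothetical baseline system size
    nodes_exascale = 6 * nodes_today  # six times more nodes, as in the text
    for nodes in (nodes_today, nodes_exascale):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        print(f"{nodes:7d} nodes: system MTBF of about {mtbf * 60:.0f} minutes")
```

Independent of the chosen example values, the model reproduces the statement of the text: six times more components at the same component reliability result in a six times shorter system MTBF.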
In fact, the reliability of individual components even decreases with technology scaling
and design for power efficiency. Smaller transistors typically carry smaller charges and
also suffer from manufacturing variances, making them more error-prone. The DRAM
soft error rate, for example, has been analyzed in two studies conducted in 2004 [60]
and 2009 [61] using the respective state-of-the-art memories. A comparison of the results
reveals a 25x increase in DRAM soft failure probability in only 5 years. Although
ECC (Error Correction Code) technology is able to correct the bulk of such errors, their
occurrence will further increase.
Along with the obvious challenges caused by technology scaling, semiconductor devices
also become less reliable over their lifetime. This is known as aging, and it gets worse
with smaller feature sizes. Interestingly, the authors of [62] found a correlation between
the number of component failures and the day of the week and the hour of the day.
Hence, components are likely to fail more often under heavy workload.
⁶ For more information on the taxonomy see [58].
⁷ The TOP500 list ranks the performance of the 500 fastest supercomputers in the
world twice a year. See www.top500.org for more information.
Fig. 2.4 Hardware failure breakdown by component for three different and unrelated
systems [65] (bars for System 1, System 2, and System 3 showing the percentage of total failures [%] attributed to storage, memory, power supply, motherboard, CPU, and other components)
Table 2.2 Causes of failures by type collected by LANL from 1996 to 2007 [66]
Cause        Percentage
Hardware     60.4 %
Software     22.6 %
Network      1.8 %
Human        0.6 %
Facilities   1.5 %
Unknown      13.1 %
For today’s large-scale systems, the MTBF ranges from a few hours to several days,
mainly depending on the system size [63]. Researchers predicted that an Exascale
system might fail on the order of once every 30 minutes [64].
Fault avoidance techniques such as ECC and redundancy come at the expense of
more hardware and increased power consumption. Finally, the recent IC (Integrated
Circuit) development is mainly driven by cost-eﬀective segments (e.g. mobile) that do
not demand high reliability and can easily tolerate certain errors. The vast majority of
available hardware focuses on these markets and high-performance systems built out
of commodity hardware especially suﬀer from a lower MTBF.
2.3.1 Failure Causes
The major cause for system failures is defective hardware, and as Figure 2.4 shows it
can be any component in a system. There are, however, several other possible causes
for system failures which are summarized in Table 2.2. Software, for example, is ranked
second and is responsible for about 23 % of all failures on the example
system. With more complex hardware architectures, hierarchies, and topologies,
software also becomes increasingly complex. The authors of [59] observed that there is a
relationship between the software that runs on a system and its MTBF. Although software
layers are able to detect errors caused by lower layers, this process can be very complex.
Furthermore, this information may not necessarily be trusted since the state of the
software may be corrupted. Sometimes it is not even easy to track down the cause of
an error, especially whether or not it was caused by software, for example when the
job finishes but only the final result is incorrect.
Any of the failures mentioned above will likely cause an entire job to fail, and fault
tolerance techniques were developed to mitigate the effects of system failures. The most
commonly used approach is to periodically back up the system state in order to reduce
the penalty for restarting jobs after a failure. This is known as Checkpoint/Restart.
2.3.2 Fault Tolerance using Checkpoint/Restart
Checkpointing was introduced to avoid restarting jobs from scratch. With checkpointing,
programmers define states (checkpoints) of their application to which the job can roll back
upon recovery from a failure. Although applications can now restart from a more advanced
state, application-based checkpointing has a significant characteristic: all processes will
roll back to the last well-known state even if only one of many processes has failed.
An additional drawback is that the checkpoints have to be stored somewhere. They
require extra storage and use I/O (Input/Output) and sometimes network bandwidth
to transfer the data. Traditionally, checkpoints were written to the Parallel File
System (PFS), which provides only very limited bandwidth since it is most often a
shared resource among multiple systems. In an extreme scenario where the time it
takes to write a checkpoint is close to or exceeds the MTBF, a job would spend most
of its runtime just checkpointing its data without making progress in the actual task.
Finding the optimum time interval between two consecutive checkpoints is rather
complex and subject to intense investigation [67]. It requires deep knowledge of the
system architecture and the application.
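A common starting point for choosing the checkpoint interval is Young’s first-order approximation, which sets the interval to the square root of twice the product of the checkpoint write time and the MTBF. The following is a minimal sketch of this well-known approximation and not necessarily the method discussed in [67]. It assumes that the checkpoint time is small compared to the MTBF and ignores the restart time, and the example values (5 minutes per checkpoint, MTBFs of one day and of 30 minutes) are chosen for illustration only.

```python
import math


def young_interval(checkpoint_time, mtbf):
    """Young's first-order approximation of the optimum checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_time * mtbf)


def wasted_fraction(interval, checkpoint_time, mtbf):
    """Approximate fraction of time lost to writing checkpoints and to rework.

    First-order estimate: checkpoint overhead per interval plus, on average,
    half an interval of lost work per failure. Only meaningful while the
    checkpoint time is small compared to the MTBF.
    """
    return checkpoint_time / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    checkpoint_minutes = 5.0                      # hypothetical checkpoint write time
    for mtbf_minutes in (24 * 60.0, 30.0):        # one day, and 30 minutes
        interval = young_interval(checkpoint_minutes, mtbf_minutes)
        waste = wasted_fraction(interval, checkpoint_minutes, mtbf_minutes)
        print(f"MTBF {mtbf_minutes:6.0f} min: interval {interval:6.1f} min, "
              f"estimated waste {waste:.0%}")
```

With the shorter MTBF, the estimated waste grows to more than half of the runtime in this example, which illustrates the extreme scenario described above. In that regime the first-order model is only a rough indication.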
Recently, exceptional effort has been put into the prediction and prevention of system
failures. The results in [68] show that under certain circumstances the failure prediction
recall⁸ goes up to 50 %. Proactive checkpointing [68] can then be used to back up
the system state right before a failure occurs, reducing the amount of work lost. The
authors also suggest using spare nodes to replace nodes that are predicted to fail soon,
mitigating repair time.
All these approaches come with a certain overhead, and they are currently complementary
to periodic checkpointing, which remains the prevalent fault tolerance technique.
Checkpointing inevitably leads to longer application runtimes, and it is desirable to
reduce this overhead.
2.3.2.1 Mitigating Checkpointing Overhead
Several options to mitigate checkpointing overhead and to reduce its negative impact
on application runtimes are available:
Reduced checkpoint size
It is the responsibility of the programmer to identify the parts that need to be
stored in order to reduce the data size while still allowing for correct failure recovery.
Incremental checkpointing can be used to reduce the size of consecutive
checkpoints by only storing data that has changed since the last checkpoint. However,
current approaches such as in [69] require significant modification to operating
system kernels and may not be easily deployed.
Reduced checkpointing frequency
It is reasonable to decrease the checkpointing frequency to lower its overhead.
Since applications will lose more progress upon a failure in this case, checkpointing
frequency must be seen as a trade-off between MTBF and the time it takes to store
(and restore from) a checkpoint. Interestingly, the more frequently checkpoints are
created and written to the storage system, the more frequently specific components
such as Solid State Drives (SSDs) with limited durability will fail.
Multilevel checkpointing
Multilevel checkpointing approaches make use of intermediate levels of storage,
such as DRAM and local SSDs, that provide higher bandwidth than the slow PFS.
The checkpoint is written to this faster storage and then asynchronously flushed
to a higher storage layer via a dedicated thread [70] or an agent [71]. Meanwhile,
the corresponding process can continue with its task. Typically, the last level of
checkpoint storage is still the PFS, and not every checkpoint stored on a faster
storage level is transferred to the PFS, but instead only 1 out of 10 checkpoints, for
example.
⁸ The prediction recall is the ratio of correctly predicted errors to the number of actual detected
failures.
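The multilevel scheme described in the last item can be sketched in a few lines. The following is a minimal illustration and not the implementation of SCR or FTI: the class name, the file naming, and the use of two directories to stand in for node-local storage and the PFS are assumptions. Every checkpoint is written synchronously to the fast level, and every tenth checkpoint is additionally copied to the slow level by a background thread while the application continues.

```python
import shutil
import tempfile
import threading
from pathlib import Path


class MultilevelCheckpointer:
    """Hypothetical two-level checkpointer: fast local level plus slow PFS level."""

    def __init__(self, local_dir, pfs_dir, flush_every=10):
        self.local_dir = Path(local_dir)
        self.pfs_dir = Path(pfs_dir)
        self.flush_every = flush_every
        self.count = 0
        self._flusher = None

    def checkpoint(self, data):
        """Write a checkpoint locally; flush every n-th one in the background."""
        self.count += 1
        local_file = self.local_dir / f"ckpt_{self.count:06d}.bin"
        local_file.write_bytes(data)  # blocking write to the fast level
        if self.count % self.flush_every == 0:
            self.wait()  # allow at most one flush in flight
            target = self.pfs_dir / local_file.name
            self._flusher = threading.Thread(
                target=shutil.copy2, args=(local_file, target)
            )
            self._flusher.start()  # the application continues while this runs
        return local_file

    def wait(self):
        """Block until a pending flush to the slow level has finished."""
        if self._flusher is not None:
            self._flusher.join()
            self._flusher = None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as pfs:
        checkpointer = MultilevelCheckpointer(local, pfs, flush_every=10)
        for step in range(20):
            checkpointer.checkpoint(f"application state at step {step}".encode())
        checkpointer.wait()
        print("local:", len(list(Path(local).iterdir())), "checkpoints")
        print("PFS:  ", sorted(p.name for p in Path(pfs).iterdir()))
```

The sketch keeps all local checkpoints for simplicity. A real library would additionally manage redundancy, cleanup of old checkpoints, and the restart path.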
Multilevel checkpointing and reducing the checkpoint size both allow the
checkpointing frequency to be increased, which may also reduce the rollback penalty for
a restart. A similar form of multilevel checkpointing is accomplished with burst buffers
[72, 73], which are intermediate destinations in front of the PFS but can be mounted as
regular file systems. Burst buffers exploit the bursty characteristic of checkpoint I/O,
where high bandwidth is only occasionally requested, which leaves enough time between
two checkpoints to forward the data to the PFS as the final destination. Other approaches
such as in-memory checkpointing [74] rely on a memory-only checkpointing scheme, avoiding
the relatively slow PFS. Although checkpointing to memory undoubtedly delivers the
best performance, it also requires multiple copies of a single checkpoint and thus several
times more memory than required by the application. Moreover, when the memory is
volatile, a node failure such as a simple power outage will erase the checkpoint.
Multilevel checkpoints provide a good trade-off between traditional PFS-based
checkpointing and the in-memory approach. They allow for frequent, fine-grained
checkpointing and keep the requirement for additional memory low at a reasonable
performance degradation. One example implementation, which has evolved into a de-facto
standard, is provided by the Scalable Checkpoint / Restart (SCR) library [71, 75]. As an
alternative to SCR, the Fault Tolerant Interface (FTI) library [70] provides very similar
features and is also widely used. As SCR was used in the DEEP-ER project, it serves as
the reference and will be described in more detail.
One criteria that is often unnoticed is the eﬀect of checkpointing on power consumption.
Research in [76] showed that there is only little diﬀerence between checkpointing
protocols and redundancy schemes. Moreover, the power consumption of computing
and checkpointing was measured to be close. Depending on the checkpointing interval
and duration, creating checkpoints can signiﬁcantly inﬂuence application runtimes and
will increase the power footprint.
2.3.3 SCR: Scalable Checkpoint / Restart
The SCR library provides a multilevel checkpointing solution for MPI applications. It
is based on two key observations: First, only the most recent checkpoint is required
to successfully restart. Second, a system failure only disables a small portion of the
system.
With these two observations SCR was designed to only store the most recent checkpoint
to node-local storage, discarding any previous checkpoints. It also implements a
redundancy scheme to support some node failures at reasonable network traﬃc and
computation overhead. Storing checkpoints to node-local storage ensures system
scalability since the checkpointing bandwidth scales with the number of nodes. However,
even with SCR, checkpoints must occasionally be written to the PFS. This is still required
to recover from larger system or node-local storage failures. It must also be noted that
the node-local storage may have limited endurance and frequent checkpointing to e.g.
an SSD will limit its average lifetime to approximately 3 years. Additional techniques
based on a hybrid DRAM/SSD approach were developed to increase the SSD lifetime
[5].
Even though SCR manages checkpointing and restart by itself it is still up to the
programmer to identify the parts of the code that need to be saved, and to make use
of the respective function calls provided by SCR.
2.3.3.1 Redundancy Schemes
SCR provides three diﬀerent checkpointing schemes:
Local Checkpoints are only written to the node-local storage. It is the fastest of the
three checkpointing schemes but cannot withstand node failures.
Storage required for a checkpoint of B Bytes: B
Partner Checkpoints are written to the node-local storage and additionally to the
local storage of a remote partner node (Figure 2.5). This scheme is slower than
’Local’ but can withstand node failures, and even multiple node failures as long
as a node and its partner do not fail simultaneously.
Storage required for a checkpoint of B Bytes: 2 ·B
XOR With XOR, all available nodes are split into sets with N nodes each. Using a
bit-wise XOR reduce operation, a parity information over all checkpoints in a set
is calculated. Each node receives only a fraction of the parity which can then
be used to recover from any single node failure within a set. XOR invokes more
computation but requires less storage than ’Partner’. It can withstand multiple
node failures as long as no more than one node within a set fails simultaneously.
Storage required for a checkpoint of B Bytes: B + B/(N-1)
Fig. 2.5 SCR-Partner checkpointing scheme (Nodes 1 to 3, each with a checkpoint and node-local storage)
Local checkpointing is not a viable option for most systems as a single storage outage
causes the scheme to fail. The Partner approach ensures the highest fault tolerance
and is easy to implement, but it requires the most storage space as every
checkpoint is stored twice. SCR with XOR is a good trade-off in performance and
storage requirements between these two approaches. It will be examined in detail next.
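The storage requirements of the three schemes can be compared with a few lines of Python. This is a minimal sketch and not part of SCR; the function name storage_per_checkpoint and its parameters are assumptions chosen for illustration.

```python
def storage_per_checkpoint(scheme: str, b: float, n: int = 2) -> float:
    """Storage in bytes needed to hold one checkpoint of b bytes.

    scheme: "local", "partner", or "xor" (names follow the text above).
    n:      number of nodes in an XOR set (only used for "xor").
    """
    if scheme == "local":
        return b                 # one copy on node-local storage
    if scheme == "partner":
        return 2 * b             # local copy plus a copy on the partner node
    if scheme == "xor":
        if n < 2:
            raise ValueError("an XOR set needs at least two nodes")
        return b + b / (n - 1)   # checkpoint plus one parity segment
    raise ValueError(f"unknown scheme: {scheme}")
```

For a set of two nodes the XOR requirement equals that of Partner, and it approaches the Local requirement as the set size grows.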
2.3.3.2 XOR Redundancy
Figure 2.6 shows how SCR generates an XOR parity. As mentioned before, SCR with
XOR splits the available number of nodes into sets. In a set of N nodes, the checkpoint
ﬁle of each node is logically split into N-1 segments (Figure 2.6a). In the next stage,
zero-padded segments are inserted so that every checkpoint now consists of N segments.
All segments with the same index are then reduced via a bit-wise XOR operation
(Figure 2.6b). This process may be implemented as a typical MPI collective operation,
as described in the previous section. Finally, one XOR parity information
segment is distributed (scattered) to every node (Figure 2.6c). SCR provides several
parameters to control the size of sets and the assignments of nodes to these. The
configuration in the example provides one XOR set and can only tolerate a single node
failure. It is the responsibility of the user to create a reasonable number of sets to
allow withstanding multi-node failures.
Fig. 2.6 SCR XOR checkpointing example (four nodes, Node 0 to Node 3)
(a) Logically split checkpoints of N nodes to N-1 segments
(b) Add alternating zero segments and reduce with bit-wise XOR
(c) Scatter XOR segments to nodes, one segment per node
SCR is also able to handle multiple processes per node. In this case it will automatically
select and create XOR sets so that every set has no more than one process of a particular
node. Also, when a process writes more than one file during execution, SCR will combine
these into a single checkpoint file. Finally, checkpoint files of arbitrary size are
managed by determining the size of the largest checkpoint in a set and padding the
remaining checkpoints with zeros up to this size.
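The split, pad, reduce, and scatter steps of Figure 2.6 can be sketched in Python as follows. This is an illustration under stated assumptions and not the SCR implementation: all function names are hypothetical, the checkpoints are held in memory as byte strings, the zero segment of node i is assumed to sit at segment index i, and the original length of a lost checkpoint is assumed to be known from metadata.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bit-wise XOR of two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def pad_segments(checkpoint: bytes, node: int, n: int, seg_len: int) -> list:
    """Split a checkpoint into n-1 segments and insert a zero segment at index `node`."""
    data = checkpoint.ljust((n - 1) * seg_len, b"\0")   # zero-pad to the common size
    segs = [data[i * seg_len:(i + 1) * seg_len] for i in range(n - 1)]
    segs.insert(node, bytes(seg_len))
    return segs

def xor_parity(checkpoints: list) -> list:
    """Return n parity segments; segment j would be scattered to node j."""
    n = len(checkpoints)
    if n < 2:
        raise ValueError("an XOR set needs at least two nodes")
    longest = max(len(c) for c in checkpoints)
    seg_len = -(-longest // (n - 1))                    # ceiling division
    padded = [pad_segments(c, i, n, seg_len) for i, c in enumerate(checkpoints)]
    return [reduce(xor_bytes, (padded[i][j] for i in range(n))) for j in range(n)]

def recover(survivors: dict, parity: dict, failed: int, n: int, length: int) -> bytes:
    """Rebuild the checkpoint of node `failed` from the surviving nodes.

    survivors: node index -> checkpoint, parity: node index -> parity segment,
    both without the entry of the failed node.
    """
    seg_len = len(next(iter(parity.values())))
    padded = {i: pad_segments(c, i, n, seg_len) for i, c in survivors.items()}
    segs = []
    for j in range(n):
        if j == failed:
            continue                                    # zero segment of the failed node
        acc = parity[j]
        for i in padded:
            acc = xor_bytes(acc, padded[i][j])
        segs.append(acc)
    return b"".join(segs)[:length]
```

Because every index holds exactly one zero segment, the parity segment stored on a node never depends on that node's own data, which is why any single node in a set can be rebuilt from the remaining ones.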
2.3.4 Summary Fault Tolerance
Every component in a computing system is subject to failures and the Mean Time
Between Failures decreases with an increasing number of components in large-scale
systems. As systems fail unexpectedly and will continue to do so, work will be lost
unless the accuracy of failure prediction models and migration strategies reaches 100%.
Until then, fault tolerance using periodic checkpointing is inevitable and remains the
prevalent fault tolerance technique.
To mitigate the checkpointing overhead, multilevel checkpointing libraries such as SCR
were developed. SCR provides multiple levels of tolerance and diﬀerent redundancy
schemes to account for diﬀerent checkpointing strategies, and SCR with XOR has been
identiﬁed as a reasonable trade-oﬀ between performance and storage requirements.
SCR with XOR, however, involves both inter-node communication and computation of the
XOR parity. This keeps processors busy and increases the number of memory
references needed to move intermediate results to and from memory. It is therefore desirable
to have an additional device that is able to offload computation and at the same time
reduce communication among nodes. This communication overhead is identical to that of
MPI collective operations, which were identified as a major potential performance
bottleneck in the previous section.
2.4 Summary
This chapter highlighted the importance of the memory interface and inter-node
communication in today’s and future large-scale systems. It became clear that memory
has been and will remain one of the most critical bottlenecks with regard to performance
and power. For many applications, communication overhead is already a large part
of the overall application runtimes and the situation will become worse with growing
system sizes.
As future systems will comprise many more components this will also lead to more
frequent soft and hard errors, increasing the importance of fault tolerance using
periodic checkpointing to reduce the penalty of such failures. Unfortunately, writing
checkpoints takes application time during which no actual computation is performed. Since
the performance for writing checkpoints also depends on the memory interface, the
interconnection network, and communication performance, it is desirable to improve
these key elements.
In conclusion this chapter provided a strong motivation to develop a device that is able
to mitigate the negative eﬀects that were described above. Such a device must be able
to oﬄoad computation from a host processor and simultaneously reduce inter-node
communication.
Chapter 3
Hybrid Memory Cube
As an alternative to the DDR interface, to overcome its scalability issues (such as I/O
pin, area, and load limitations), and to increase channel bandwidth, Micron recently
proposed the Hybrid Memory Cube. The ﬁrst section of this chapter introduces the
HMC and analyzes the impact of its novel architecture on performance. Section two
presents the implementation of the open-source HMC host controller openHMC. Section
three evaluates HMC performance and power eﬃciency in a real system using the
openHMC controller. A ﬁnal summary concludes this chapter.
The ﬁndings of section one and three have been published in [11]. The implementation
of the openHMC host controller is detailed in [10].
3.1 Introduction and Architecture Analysis
HMC leverages recent 3D fabrication processes to stack multiple layers of DRAM on
top of a logic die. Its interface operates on a packet-based protocol utilizing high-speed
SerDes (Serializer / Deserializer). As opposed to DDR, the HMC interface is not a
JEDEC standard. Instead, Samsung Electronics and Micron Technology formed the
Hybrid Memory Cube Consortium (HMCC) in October 2011 [37] and released the
ﬁrst HMC speciﬁcation 1.0 in January 2013 [77]. It was later revised with the HMC
specification 1.1 (HMC Gen 2 devices) which is the reference for this work. HMC
hardware engineering samples have been available since 2013 and volume production started
in June 2017 with 2 GB devices.
Fig. 3.1 HMC architecture overview (labels: vault, partition, bank, DRAM layers, TSVs, vault controllers, logic layer with crossbar switch, SerDes layer)
3.1.1 Architecture
Figure 3.1 shows the basic HMC architecture. Multiple layers of DRAM are stacked
on top of a CMOS based logic layer using TSVs [28]. The stack is organized in 16
independent vaults where each vault connects the upper DRAM layers with a dedicated
memory controller (the vault controller) using 32 TSVs [78]. Every DRAM layer
comprises 16 partitions with 2 DRAM banks each. In [78], the HMC Gen1 DRAM
stack was introduced as a composition of 68 mm2 1 Gbit dies manufactured in 50 nm.
Initially four layers were stacked for a total capacity of 512 MB. Current HMC Gen2
devices [79] stack four 4 Gbit DRAM dies on top of the logic base which increased the
capacity to 2 GB (4 layers ·16 partitions ·2 banks= 128 banks). The capacity growth
from Gen1 to Gen2 is based on denser memory arrays with a bank capacity increase
from 4 MB (Gen1) to 16 MB (Gen2).
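The capacity figures above follow from simple arithmetic, which the following Python sketch reproduces. The function names are assumptions made for illustration; the numbers are the ones given in the text.

```python
VAULTS = 16

def bank_count(layers: int = 4, partitions_per_layer: int = 16,
               banks_per_partition: int = 2) -> int:
    """Total number of DRAM banks in the stack."""
    return layers * partitions_per_layer * banks_per_partition

def stack_capacity_mb(die_gbit: int, layers: int = 4) -> int:
    """Capacity of the DRAM stack in MB for dies of die_gbit Gbit each."""
    return die_gbit * 1024 // 8 * layers

def bank_capacity_mb(die_gbit: int, layers: int = 4) -> int:
    """Capacity of a single bank in MB."""
    return stack_capacity_mb(die_gbit, layers) // bank_count(layers)

gen1_mb = stack_capacity_mb(1)               # 1 Gbit dies: 512 MB
gen2_mb = stack_capacity_mb(4)               # 4 Gbit dies: 2048 MB
banks_per_vault = bank_count() // VAULTS     # 128 banks over 16 vaults
```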
The HMC logic layer exposes four external links which can connect processors or other
HMCs. Hence, multiple HMCs can be ’chained’ together with varying routing options
to increase the capacity (see Section 3.1.4). A single link comprises 16 bidirectional
high-speed serial lanes. Every link is local to four vaults and a crossbar ensures that
all links can access all vaults and other links (Figure 3.2). The 4-Link HMC comes in
a 31×31 mm package (896 balls)1. Figure 3.3 shows an HMC close-up view with four
DRAM layers.
Fig. 3.2 HMC logic layer top view: schematic representation (Link 0: vaults 0-3, Link 1: vaults 4-7, Link 2: vaults 8-11, Link 3: vaults 12-15; crossbar switch; link to host with 16 lanes at 10, 12.5, or 15 Gbps)
Fig. 3.3 Close-up view of an HMC stack. Image courtesy: Micron
3.1.2 DRAM Organization and Performance
HMC implements a DRAM closed-page policy, i.e. the row buﬀers become inactive
after each access. This is opposed to an open-page policy where a row stays active in
the sense ampliﬁers until it times out or another row is accessed. While an open-page
policy is in particular beneficial for applications with high locality (i.e. a high page
hit/miss ratio) it also increases power consumption since the sense amplifiers stay active
after a memory access. Additionally, an open-page policy introduces delay on a page
miss as pre-charge of the word-line does not occur immediately after the row has been
accessed. As a result, a closed-page policy theoretically performs better for random
access patterns.
The DRAM row or page size in HMC has been reduced to 256 Byte, compared with 512 Byte to 2
KB in DDR4 [80] and up to several kilobytes in DDR3 [81]. A smaller page size
reduces the probability for a DRAM over-fetch where only a fraction of the information
contained in an opened page is actually used, and therefore also reduces dynamic power
consumption. It also makes an open-page policy impracticable and is another reason
why the HMC developers preferred a closed-page policy.
1 Initially, a 2-Link device with equal characteristics was available (19.5x16 mm package). Development
and production of this device were canceled in late 2016 for unknown reasons.
Performance numbers can be obtained from many sources [78, 79, 82]. Most of them
highlight the potential link bandwidth of 240 GB/s (4 Links ·60 GB/s). The eﬀective
bandwidth, however, is limited by the vault controllers. With 32 TSVs each and a
clock frequency of 1.25 GHz [83] a single vault can deliver 10 GB/s. With 16 vaults
the maximum eﬀective bandwidth is 160 GB/s. It is reasonable to provide more
link bandwidth than the DRAM stack can deliver due to transaction layer (protocol)
overhead on the link. The protocol will be discussed in a later section. Experiments in
[83] showed that the maximum usable link bandwidth eventually ﬂattens at 240 GB/s
for a given 160 GB/s TSV or DRAM bandwidth.
Finally, HMC can be conﬁgured to internally remap memory addresses which can be a
useful tool if the most commonly used access patterns are well known. By default,
sequential requests are spread over vaults, then banks, and finally DRAM to involve
as many vaults as possible. This scheme has a simple, HMC-specific rationale: the
more vaults are accessed, the higher the parallelism and the potential bandwidth.
Other access schemes may result in an imbalance of accessed vaults and address
remapping can be used to correct this situation. The impact of various address-mapping
modes on bandwidth at ﬁxed access patterns will be evaluated in Section 3.3.3.
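The default ordering (vaults first, then banks, then DRAM) can be illustrated with a small address decoder. The field widths and bit positions below are assumptions made for this sketch, namely a 128 Byte block, 16 vaults, and 8 banks per vault; they are not taken from the HMC specification, and the function name decode is hypothetical.

```python
BLOCK_BITS = 7   # assumed 128 Byte block: bits [6:0] address bytes within a block
VAULT_BITS = 4   # 16 vaults
BANK_BITS = 3    # 8 banks per vault (128 banks / 16 vaults)

def decode(addr: int) -> dict:
    """Split a byte address into offset, vault, bank, and remaining DRAM address."""
    offset = addr & ((1 << BLOCK_BITS) - 1)
    vault = (addr >> BLOCK_BITS) & ((1 << VAULT_BITS) - 1)
    bank = (addr >> (BLOCK_BITS + VAULT_BITS)) & ((1 << BANK_BITS) - 1)
    dram = addr >> (BLOCK_BITS + VAULT_BITS + BANK_BITS)
    return {"offset": offset, "vault": vault, "bank": bank, "dram": dram}
```

With this layout, 16 consecutive blocks land in 16 different vaults before the bank index changes, which is the behavior the default mapping aims for.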
3.1.3 Link
A single HMC has four independent links, each comprised of 16 diﬀerential pairs (lanes)
per direction, i.e. data to and from the HMC can ﬂow at the same time. Individual
links can be conﬁgured to run at 8 lanes (half-width) instead of 16 (full-width) if
required. Available link speeds are 10 Gbps, 12.5 Gbps, and 15 Gbps. That is a
maximum bandwidth of 16 lanes ·15 Gbps= 240 Gbps= 30 GB/s per direction or 60
GB/s bidirectional per link and 240 GB/s total. The maximum eﬀective bandwidth is
limited to 160 GB/s due to the vault bandwidths. The polarity of individual lanes can
be inverted and the lane order can be reversed to simplify signal routing on a PCB.
Each link is complemented by two power state signals (RXPS and TXPS). Finally, each
HMC device provides an active-low reset (PRST_N) and a unidirectional, HMC-driven
fatal error indicator (FERR_N). Both sides of a link, Host and HMC, share a common
reference clock which eliminates the need to transmit a dedicated clock along with the
data-lanes.
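The bandwidth figures of this section can be reproduced with the following Python sketch. The function and variable names are assumptions made for illustration; the vault bandwidth of 10 GB/s is the value stated in Section 3.1.2.

```python
LINKS = 4
VAULTS = 16
VAULT_GBYTE_S = 10.0   # bandwidth of a single vault in GB/s

def link_bandwidth_gbyte_s(lanes: int, lane_gbps: float) -> float:
    """Raw bandwidth of one link in one direction, in GB/s."""
    return lanes * lane_gbps / 8

per_direction = link_bandwidth_gbyte_s(16, 15)    # full-width link at 15 Gbps
per_link = 2 * per_direction                      # both directions
total_link = LINKS * per_link                     # all four links
vault_limit = VAULTS * VAULT_GBYTE_S              # what the DRAM stack can deliver
effective = min(total_link, vault_limit)
```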
Fig. 3.4 HMC chain example: one host is connected to an HMC chain (HMC 0, HMC 1, ..., HMC n). Topology suggested
in [79]
Fig. 3.5 HMC chain example: two hosts are connected to an HMC chain (HMC 0, ..., HMC n). All hosts can
access any memory region. Topology suggested in [79]
3.1.4 Chaining
One notable feature is the ability to directly connect (to chain) multiple HMC devices to
each other to increase the capacity (Figure 3.4). Also, multiple hosts can be connected
to a network of HMCs for a shared memory environment (Figure 3.5). Chaining
enables novel processor-memory architectures and communication schemes. It
not only increases the overall capacity but also enables processors to communicate
through memory. Note that the HMC speciﬁcation currently limits the total number
of HMCs in a single network to 7 devices.
Enhanced approaches foresee dedicated interconnects and switches that connect only
the memory modules for a large global or partitioned address space. Such an approach
is presented in [84]. The author suggests an interconnection network for NAND-based
ﬂash chips. Such a memory subsystem provides a decent increase in capacity in
combination with a good overall power footprint and reasonable performance. Similar
memory subsystems could be created with HMCs. An intelligent interconnect with
routing features would furthermore allow for more than 7 HMCs to coexist in a single
memory subsystem. And lastly, if the interconnect provided additional interfaces to
e.g. non-volatile memory, heterogeneous memory subsystems become feasible (see
Figure 3.6).
Fig. 3.6 HMC + NAND heterogeneous memory subsystem example (hosts attached to a processor interconnect; HMC and NAND devices attached to a memory interconnect). Topology with NAND
only suggested in [84]
3.1.5 Protocol
The HMC communicates over a packet-based protocol. It defines a request-response
communication with a granularity, or Flow Unit (FLIT) size, of 16 Bytes. The protocol
supports reading and writing data packet sizes ranging from 16 to 128 Bytes along
with command support for atomic operations and HMC conﬁguration. Packets are
framed by a header and a tail (8 Byte each) which results in a 16 Byte overhead per
packet. Features such as CRC, a packet length check, and consecutive packet sequence
numbers ensure link integrity. Complemented by a retry mechanism the HMC link can
withstand bit errors that typically occur on serial high-speed links.
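The effect of the 16 Byte framing overhead on link efficiency can be estimated with a short Python sketch. It only models packets that carry a data payload, and the function names are assumptions made for illustration.

```python
FLIT_BYTES = 16       # protocol granularity
OVERHEAD_BYTES = 16   # 8 Byte header plus 8 Byte tail

def packet_flits(payload_bytes: int) -> int:
    """Number of FLITs of a packet carrying payload_bytes of data."""
    if payload_bytes % FLIT_BYTES or not 16 <= payload_bytes <= 128:
        raise ValueError("payload must be a multiple of 16 between 16 and 128 Byte")
    return (payload_bytes + OVERHEAD_BYTES) // FLIT_BYTES

def link_efficiency(payload_bytes: int) -> float:
    """Fraction of the transmitted bytes of a packet that is payload."""
    return payload_bytes / (packet_flits(payload_bytes) * FLIT_BYTES)
```

A 16 Byte payload uses only half of the transmitted bytes for data, whereas a 128 Byte payload reaches 128/144, i.e. about 89%.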
Responses are matched to non-posted requests using a 9 bit TAG ﬁeld for up to 512
outstanding requests. Since the HMC logic die is able to reorder packets for faster
execution (e.g. if a speciﬁc vault is accessed more frequently), responses may return
out of order. However, HMC internally queues requests to the same vault/bank so that
accesses to a speciﬁc location accessed from one link are always processed in order.
Care must be taken when a memory location is accessed by more than one link since
there is no guaranteed order for request execution across links. A small set of atomic
operations is provided for computation oﬄoading. These commands either add a single
16 Byte or two 8 Byte operands to a memory location via a read-modify-write operation.
The potential beneﬁts of oﬄoading computation to the HMC will be evaluated in
Section 3.3.6.
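Matching out-of-order responses to outstanding requests by their TAG can be sketched as follows. The class TagPool and its methods are hypothetical and only illustrate the bookkeeping a host controller needs; they do not describe an actual controller implementation.

```python
class TagPool:
    """Tracks outstanding non-posted requests by tag (9 bit: up to 512 tags)."""

    def __init__(self, tag_bits: int = 9):
        self.free = list(range(2 ** tag_bits))   # tags currently not in use
        self.pending = {}                        # tag -> outstanding request

    def issue(self, request):
        """Assign a free tag to a request; return None if all tags are in flight."""
        if not self.free:
            return None
        tag = self.free.pop()
        self.pending[tag] = request
        return tag

    def complete(self, tag: int):
        """Look up the request for a response tag and release the tag."""
        request = self.pending.pop(tag)
        self.free.append(tag)
        return request
```

Since responses are looked up by tag, the order in which they arrive does not matter; the host only stalls once all 512 tags are in flight.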
Flow control in both directions is achieved using tokens (credits), where one token
represents buﬀer space for one FLIT. The use of tokens prevents the input buﬀer of the
respective receiver from overflowing. Consequently, tokens are returned after packets
are processed by the receiver and the corresponding buffer space is freed up. Every
packet that is transmitted also carries a pointer, the Forward Retry Pointer (FRP).
The FRP represents the position of the packet in the retransmit/retry buffer of the
sender. Flow packets are not subject to flow control and do not carry an FRP. As
soon as the packet has been processed at the receiver, this pointer will be returned as
Return Retry Pointer (RRP). The RRP signals the former requester that the packet
was received and the space in the retransmission buffer can be reused. This process is
depicted in Figure 3.7. Such flow control features can negatively influence performance
and pose significant challenges for a host controller design. This flow control barrier
will be described next.
Fig. 3.7 HMC protocol FRP and RRP exchange loop (the host transmits PKT1 stored at retry buffer RAM address FRP1 and increments its write pointer; the HMC processes PKT1 and FRP1 travels back as RRP1, upon which the host increments its read pointer)
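The interplay of tokens and retry pointers on the sending side can be modeled with a small Python class. This is a simplified sketch under stated assumptions: the class LinkSender and its methods are hypothetical, and the FRP is modeled as the retry buffer write pointer after the packet has been stored.

```python
from collections import deque

class LinkSender:
    """Sender-side flow control with tokens and a retry buffer (simplified)."""

    def __init__(self, tokens: int, retry_flits: int):
        self.tokens = tokens              # free FLIT slots in the receiver's input buffer
        self.retry_flits = retry_flits    # retry buffer capacity in FLITs
        self.pending = deque()            # (frp, flits) of packets awaiting their RRP
        self.write_ptr = 0                # retry buffer write pointer

    def retry_used(self) -> int:
        """Number of retry buffer FLITs occupied by unacknowledged packets."""
        return sum(flits for _, flits in self.pending)

    def send(self, flits: int):
        """Try to send a packet; return its FRP, or None if the link is throttled."""
        if flits > self.tokens or self.retry_used() + flits > self.retry_flits:
            return None
        self.tokens -= flits
        self.write_ptr = (self.write_ptr + flits) % self.retry_flits
        self.pending.append((self.write_ptr, flits))
        return self.write_ptr

    def on_tokens(self, flits: int) -> None:
        """The receiver returned tokens after freeing input buffer space."""
        self.tokens += flits

    def on_rrp(self, rrp: int) -> None:
        """Free all packets up to and including the one whose FRP equals rrp."""
        if all(frp != rrp for frp, _ in self.pending):
            return                        # unknown pointer: nothing to free
        while self.pending.popleft()[0] != rrp:
            pass
```

A packet can only be sent if both resources are available, so either slow token return or slow pointer return is sufficient to throttle the link.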
3.1.6 The Flow Control Barrier
In order to maintain the best performance, the HMC speciﬁcation deﬁnes two important
metrics associated with ﬂow control: the retry pointer loop time and the token return
loop time. Both metrics originate from the fact that ﬂow control is mandatory on a
serial link that runs a protocol, and critical when it comes to saturating the theoretical
link bandwidth. They will be described in the following.
Designers of a host controller should always keep these two metrics in mind. Especially
when targeting FPGAs with relatively low operating frequencies, processing pointers
and tokens can take up a large amount of the allowable return loop times.
3.1.6.1 Retry Pointer Loop Time
As mentioned earlier every packet that is sent on a link and subject to ﬂow control
will also be placed in the retry buﬀer of the respective requester. In addition, an FRP
is also sent along with the packet, uniquely identifying the packet and its location in
this buffer. The FRP is then extracted by the remote link partner and returned on
the response link as RRP, embedded in any packet that also carries valid flow control
fields. This is the case for any packet that is not NULL, IRTRY (used to request a
link retry or to clear error status), or erroneous. After the RRP has been extracted
the read pointer of the requester's retry buffer can be moved. This invalidates the
corresponding packet for potential retry and frees up its location for other packets.
Fig. 3.8 Retry pointer loop time contributors (host delay, transmission delay, HMC delay)
Table 3.1 Retry pointer loop time summary. The HMC internal clocking frequency is
independent of the link width and speed
Lanes  Speed [Gbps]  HMC Retry Buffer  Retry Buffer          HMC Delay  Max Host
                     Size [FLIT] (a)   Full Period [ns] (b)  [ns] (c)   Delay [ns]
8      10            192               307.20                26.5       280.70
8      12.5          256               327.68                25.9       301.78
8      15            256               273.07                22.3       250.77
16     10            192               153.60                26.5       127.10
16     12.5          256               163.84                25.9       137.94
16     15            256               136.53                22.3       114.23
(a) See Equation (3.1) and Equation (3.2)
(b) 16 lane values extracted from the HMC specification [79]
(c) Extracted from the HMC specification [79]
While this process is ongoing the requester is able to issue many more FLITs and packets
which ﬁlls up the retry buﬀer. Performance is throttled when the requester continues
to send packets and ﬁlls the retry buﬀer faster than the required space is freed up.
As a result, no more requests but NULL FLITs will be sent over the link, decreasing
the eﬀective bandwidth. To avoid this situation the HMC speciﬁcation deﬁnes the
maximum allowable circulation time of the pointers, the retry pointer
loop time. Figure 3.8 identifies its contributors: host delay, transmission delay (which
is negligible), and the delay through the HMC. In the figure, HMC acts as requester
but the scheme applies to the host likewise. Now that the contributors are known,
Table 3.1 summarizes the maximum values for the host delay portion of the retry
pointer loop time. It can be seen that retry buﬀer full period and the allowable host
delay depend on the HMC link width and speed. Although not mentioned in the HMC
speciﬁcation this leads to the following two observations:
1. The HMC retry buﬀer size for a link running at 10 Gbps is smaller
than for 12.5 Gbps and 15 Gbps.
Equation (3.1) calculates the retry buﬀer size for a half-width (8 lane) link at
10 Gbps and Equation (3.2) at 12.5 Gbps, respectively. It can be seen that the
retry buﬀer size increases from 192 to 256 FLITs for the faster conﬁguration.
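\begin{align}
\text{BW (8 lanes, 10 Gbps):}\quad & 8\ \text{lanes} \cdot 10\ \text{Gbps} = 80\ \text{Gbps} \notag\\
\text{Time to process a bit:}\quad & t_{bit} = \frac{1\ \text{bit}}{80\ \text{Gbps}} = 1.25 \cdot 10^{-11}\ \text{s} = 12.5\ \text{ps} \notag\\
\text{Time to process a FLIT:}\quad & t_{FLIT} = 128 \cdot t_{bit} = 128 \cdot 12.5\ \text{ps} = 1.6\ \text{ns} \notag\\
\text{Retry buffer size [FLITs]:}\quad & \frac{\text{Full period at 10 Gbps}}{t_{FLIT}} = \frac{307.2\ \text{ns}}{1.6\ \text{ns}} = 192 \tag{3.1}
\end{align}
\begin{align}
\text{BW (8 lanes, 12.5 Gbps):}\quad & 8\ \text{lanes} \cdot 12.5\ \text{Gbps} = 100\ \text{Gbps} \notag\\
\text{Time to process a bit:}\quad & t_{bit} = \frac{1\ \text{bit}}{100\ \text{Gbps}} = 1 \cdot 10^{-11}\ \text{s} = 10\ \text{ps} \notag\\
\text{Time to process a FLIT:}\quad & t_{FLIT} = 128 \cdot t_{bit} = 128 \cdot 10\ \text{ps} = 1.28\ \text{ns} \notag\\
\text{Retry buffer size [FLITs]:}\quad & \frac{\text{Full period at 12.5 Gbps}}{t_{FLIT}} = \frac{327.68\ \text{ns}}{1.28\ \text{ns}} = 256 \tag{3.2}
\end{align}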
The reason for a smaller retry buﬀer at 10 Gbps is a decrease in the internal
HMC datapath-width to reduce power consumption by shutting down parts of
the logic, including the retry buﬀer2. In contrast, running at 12.5 or 15 Gbps
increases performance and in particular the 12.5 Gbps option provides the highest
allowable retry pointer host delay portion, which can ease the implementation of a
corresponding host controller.
2. The retry buﬀer full period is twice as high when operating in half-
width (8 lane) mode.
Table 3.1 highlights the retry buﬀer full period for all available HMC link
conﬁgurations. However, only the values for a 16 lane conﬁguration are mentioned
in the HMC specification [79]. It is a reasonable expectation that the retry buffer
full period doubles if only half of the bandwidth is provided, assuming the size
of the retry buffer is maintained. To verify this, hardware measurements were
conducted with 8 and 16 lanes. The host issued read requests without returning
RRPs so that the HMC would not free up any used retry buffer space. The results
showed that the retry buffer size is independent of the link width. This leads to
the conclusion that the time it takes to entirely fill the retry buffer is doubled
when a link is operated in half-width. In fact, HMC stopped responding although
theoretically there were a few tokens left. It is common practice to set such
thresholds lower than the actual limit suggests. This can ease the implementation
and save logic as the need for a fine-grain, FLIT or token based granularity is
eliminated.
2 Further details are available under NDA with Micron.
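The two observations can be checked numerically against Table 3.1. The following Python sketch recomputes the retry buffer full period and the maximum host delay from the retry buffer size, the link configuration, and the HMC delay; the function names are assumptions made for illustration.

```python
FLIT_BITS = 128   # one FLIT is 16 Byte

def flit_time_ns(lanes: int, lane_gbps: float) -> float:
    """Time to transfer one FLIT over the link in ns (bits divided by Gbit/s)."""
    return FLIT_BITS / (lanes * lane_gbps)

def full_period_ns(retry_flits: int, lanes: int, lane_gbps: float) -> float:
    """Time to fill the retry buffer when FLITs are sent back to back."""
    return retry_flits * flit_time_ns(lanes, lane_gbps)

def max_host_delay_ns(retry_flits: int, lanes: int, lane_gbps: float,
                      hmc_delay_ns: float) -> float:
    """Part of the retry pointer loop time that remains for the host."""
    return full_period_ns(retry_flits, lanes, lane_gbps) - hmc_delay_ns
```

For example, full_period_ns(192, 8, 10) evaluates to 307.2 ns and full_period_ns(192, 16, 10) to 153.6 ns, which matches the doubling of the full period in half-width mode.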
3.1.6.2 Token Return Loop Time
The second important factor to avoid performance throttling is the time it takes to
consume, process, and return tokens for transmitted and received packets. Similar
to the retry pointer loop time, returning tokens too slowly will throttle link packet
transmission. This is in particular the case when the input buffer in the host controller
or the HMC runs full. Processing tokens differs from processing pointers, which may be returned
immediately after passing integrity checks. Packet tokens can only be returned after
a packet has passed the receiver's input buffer. In addition, current HMC devices only
provide a maximum of 219 tokens for the host to transmit packets which can cause
the host controller to run out of tokens even faster. Therefore, the token return time
constraint is potentially even harder to meet than the retry pointer loop time.
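For a rough feeling of this constraint, the time after which a host transmitting back-to-back FLITs has consumed all of its tokens can be estimated in the same way as the retry buffer full period. The sketch below is an illustrative estimate and not a figure from the HMC specification; it assumes that no tokens are returned in the meantime, and the function name is hypothetical.

```python
FLIT_BITS = 128     # one token represents buffer space for one 16 Byte FLIT
HOST_TOKENS = 219   # tokens available to the host on current HMC devices

def token_drain_time_ns(tokens: int, lanes: int, lane_gbps: float) -> float:
    """Time in ns until all tokens are used when FLITs are sent back to back."""
    return tokens * FLIT_BITS / (lanes * lane_gbps)

full_width_15 = token_drain_time_ns(HOST_TOKENS, 16, 15)   # about 116.8 ns
half_width_10 = token_drain_time_ns(HOST_TOKENS, 8, 10)    # about 350.4 ns
```

Under these assumptions, tokens on a full-width link at 15 Gbps must start to return within roughly 117 ns to avoid throttling.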
3.1.6.3 Mitigation Techniques
Several techniques can be applied to handle retry pointer and token loop time violations.
If tokens are the limitation, the HMC-supported open-response loop mode can be
entered. In this mode, HMC will not check for free space in the host’s input buﬀer
but instead immediately return responses. Consequently, the host's input buffer can
run full if a user application is not able to receive data at the same speed. Host-sided
optimizations for both types of loops include removing link integrity checks to lower the
loop delay and a low latency transceiver design. Removing integrity checks, however,
is highly discouraged as bit errors may lead to undeﬁned link states. Increasing the
host controller's internal operating frequency and moving from 10 Gbps to 12.5 Gbps
may also be considered and both of these options require the least design modiﬁcation
eﬀort.
3.1.7 Summary HMC Architecture
In conclusion, the following advantages and disadvantages of the HMC interface can
be derived from this section. The main performance and power characteristics will be
thoroughly evaluated in Section 3.3.
3.1.7.1 Advantages
• Bandwidth Due to the high parallelism inside the HMC the total theoretical
bandwidth sums up to 160 GB/s per device. The impact of protocol overhead is
alleviated by providing 240 GB/s link bandwidth.
• Average latency High parallelism and the ability to issue many in-ﬂight trans-
actions decrease the average latency. This is opposed to DDR systems with pins
sharing transmit and receive direction and where the number of simultaneous
requests is limited by the number of banks connected to this channel.
• Heterogeneous die stacking Heterogeneous die stacking using TSVs makes it possible
to combine multiple dies that were manufactured in different technologies, such
as CMOS and DRAM. The yield increases since single layers can be tested prior to
assembly.
• Footprint and I/O requirements The HMC package signiﬁcantly reduces
footprint requirements by 90% over DDR DIMMs [85]. The serialized links lower
the number of I/O pins required to connect a processor from several hundreds to
only 64 pins per link (16 diﬀerential lanes, two directions) and a few additional
sideband signals. Therefore, HMC can help to overcome the memory scaling
issues for processors and eases PCB development.
• Capacity Stacking multiple DRAM layers increases the memory density while
the footprint remains unchanged. It is one solution to delay the impact of the
upcoming end of the miniaturization process.
• Energy eﬃciency Shorter memory subsystem traces and reduced wire-loads
contribute to the overall power eﬃciency. HMC also provides a power-down
mode to shut down one or more serial links and parts of the logic base, if desired.
Micron claims that HMC uses only 10% of the energy per bit of current memory systems
[85].
• Atomic operations HMC is able to carry out simple integer ADD operations
which can be utilized to oﬄoad such computations from the host processor. These
operations, implemented in a logic layer right next to the memory cells, can be
categorized as PIM.
• Interface abstraction The abstraction of the actual memory interface is another
key benefit of the HMC. Although the overhead through (de-)serialization and
the transaction layer increases the latency, an abstracted interface significantly
eases the implementation of a corresponding host controller. It is furthermore a
key element to accelerate future moves to other memory technologies by speeding
up their adoption.
3.1.7.2 Disadvantages
• Single access latency Serial links and protocol processing introduce additional
delay and therefore increase the single access latency. In order to beneﬁt from
the HMC characteristics the link should be kept busy with as many in-ﬂight
transactions as possible. This might require modiﬁcations to existing applications.
• Capacity The capacity of current HMC devices is 2 GB and therefore much
less than most systems and applications require. It is also approximately 5 years
behind the evolution of DDR capacities [86]. Although chaining seems to be a
viable way to increase the HMC capacity, it comes with a major drawback:
the total bandwidth available to the host will still be limited by the
link bandwidth of the ’local’ HMC. Also, to the best knowledge of the author,
chaining has not been evaluated in detail yet and the real performance remains
unclear.
3.1.8 Outlook
In 2014, the HMCC announced the next-generation HMC Gen3 devices (HMC
specification 2.0 [87]). Along with a new very-short reach interface with up to 30
Gbps per lane, Gen3 also supports quarter-width (4 lane) link operation. Initial cube
42
3.1 Introduction and Architecture Analysis
capacities were reported with 4 GB and 8 GB. The protocol is enhanced to support
additional atomic and arithmetic operations. In addition, the maximum packet size is
increased to 256 Byte to match the HMC DRAM page size and further increase the
overall link eﬃciency.
As of July 2017, Gen2 devices have reached volume production status with 2 GB
densities while HMC Gen3 has been taken oﬀ the roadmap. Micron states that at least
for now there is no demand for HMC links that can provide twice the bandwidth of
Gen2 devices.
3.1.9 Lessons learned for an HMC host controller design
The lessons learned in this chapter are very important for the design of an HMC
host controller and corresponding applications. It is particularly helpful to note the
following key characteristics in order to achieve best performance and usability.
• Serial links in combination with memory abstraction using a communication
protocol require ﬂow control and error handling mechanisms on both sides of a
link. The drawback here is that even though the maximum bandwidth could
be potentially delivered by raw link parameters, processing and the exchange
of pointers and tokens are subject to hard restrictions. A well designed host
controller must especially provide a short ﬂow control loop to perform well.
While this does not seem to affect Application-Specific Integrated Circuit (ASIC)
implementations, the relatively low frequencies in FPGAs can become a show
stopper. Another major contributor to the loop times is the delay through the
SerDes (see Section 3.3.5). In particular, user-friendly SerDes instances created
by the FPGA design tools often use deep buffer structures, which result in
unnecessarily high delays.
• As opposed to a transactional interface (such as DDR) where no overhead is
transmitted on the link, a protocol-based communication requires packet framing
to exchange ﬂow control items and to distinguish packets. These additional items
appear as pure overhead on the link and therefore lower the eﬀective bandwidth.
In case of the HMC, every packet carries an additional 16 Byte that do not hold
any data. Section 3.3 will highlight that the effective peak link bandwidth is
approximately 83 % of the theoretical bandwidth for 128 Byte requests. Smaller
requests will decrease the usable bandwidth even further as they increase the
protocol overhead and can cause bank access conflicts. Application developers
and memory management units should be aware of that fact in order to optimize
link utilization.
3.2 openHMC Host Controller
As for any other memory interface, HMC requires a host controller it can be connected
to and the previous section has identiﬁed several requirements for such a controller.
Besides compliance with the transaction layer of the speciﬁcation it became clear that
a low latency design is crucial for performance reasons. At the time of writing only
a few host controller solutions were available (e.g. [88, 89]) and none of them was
aﬀordable. One low-cost solution was provided by Altera Corporation (now part of
Intel) in 2015, called HMC Controller MegaCore Intellectual Property (IP) [90]. This
core can be generated within the Quartus II or Quartus Prime software for use in the
Arria 10 FPGA series. A second alternative was made available by Xilinx by the end
of 2016. Their IP can be generated in the Vivado Design Suite, targeting the latest Xilinx
UltraScale and UltraScale+ devices. Since the target FPGA used in this thesis is a
Xilinx Virtex 7 and because the development itself started before either of these cores
was available, a custom host controller named openHMC was developed. This section
highlights the most important technical details. A full reference is available in [10] and
[91].
openHMC is a conﬁgurable, vendor-agnostic, and open-source HMC controller IP.
The ﬁrst revision was released in September 2014. Meanwhile the ﬁfth revision is
publicly available3 as a Verilog package including a custom simulation model along
with a detailed documentation [91]. It has also been presented and evaluated in [10].
openHMC is licensed under the Lesser General Public License (LGPL) version 3. The
LGPL states that the core may be used in proprietary projects without limitations but
any changes to the core itself must be made publicly available.
3.2.1 Conﬁgurations and Features
openHMC fully complies with the HMC speciﬁcation 1.1 [79] and provides additional
valuable features such as:
3 http://www.uni-heidelberg.de/openhmc
Table 3.2 Resource utilization for an 8x half-width link at 10 Gbps in a Xilinx Virtex
7 690T FPGA. Percentages reflect the overall usage in the respective device.
Xilinx and Altera core statistics are provided as reference.

Core                       IF width   LUTs            Registers       BRAM        DSPs
openHMC standard           256 bit    11710 (2.7%)    12486 (1.4%)    8 (0.4%)    0
openHMC standard           512 bit    25307 (5.8%)    23973 (2.7%)    10 (1.0%)   0
openHMC standard           768 bit    48806 (11.2%)   36129 (4.1%)    23 (1.5%)   0
openHMC standard           1024 bit   81412 (18.8%)   48885 (5.6%)    31 (2.1%)   0
openHMC w/ XILINX define   256 bit    7133 (1.6%)     7580 (0.8%)     8 (0.4%)    10 (0.3%)
openHMC w/ XILINX define   512 bit    16426 (3.7%)    14346 (1.6%)    10 (1.0%)   10 (0.3%)
openHMC w/ XILINX define   768 bit    35652 (8.2%)    21787 (2.5%)    23 (1.5%)   10 (0.3%)
openHMC w/ XILINX define   1024 bit   63531 (14.7%)   29773 (3.4%)    31 (2.1%)   10 (0.3%)
Xilinx IP [92] (a)         512 bit    18077           19367           36          0
Altera IP [90] (b)         512 bit    24400 (ALMs)    48200           51 (M20K)   –

Legend: BRAM = Block RAM (36 Kb memory unit)
DSP = Digital Signal Processor, IF = Interface, LUT = LookUp Table
(a) Device: Ultrascale XCVU190. Results reflect a full-width configuration
(b) Device: Arria 10. ALM = Adaptive Logic Module, M20K = 20 Kb memory unit
• A conﬁgurable, synchronous or asynchronous AXI4 Stream user interface with
256, 512, 768, or 1024 bit datapath
• Half-width (8x) and full-width (16x) HMC link support for all available datapath-
widths and link speeds (10, 12.5, 15 Gbps)
• No vendor speciﬁc components to target all types of FPGAs and ASICs
• Additional switch to turn selected building blocks into Xilinx speciﬁc components
to optimally use device resources. Refer to the openHMC documentation [91] for
more details
Depending on the selected datapath-width and whether Xilinx speciﬁc components
shall be used the openHMC resource utilization varies. The amount of device resources
required after place and route is shown in Table 3.2. The results were obtained
using the default openHMC parameter set and default synthesis and implementation
strategies in Vivado 2016.2. These numbers will vary slightly for other
settings or tool versions. It can be seen that doubling the datapath-width roughly doubles
the number of registers, while the number of LUTs grows faster: by a factor of about
2.2 from 256 to 512 bit and by a factor of 3.2 to 3.9 from 512 to 1024 bit, i.e. the
logic complexity almost quadruples. Using the XILINX parameter results in easy resource savings and more
eﬃcient FPGA fabric utilization. There is no change in Block RAM usage because
Vivado automatically maps suitable register arrays such as FIFOs (First In - First
Outs) and RAM (Random-Access Memory) to Block RAM. Overall, the openHMC
controller is a very compact and easy to implement solution. Experience shows that a
512 bit datapath is most often the best trade-oﬀ between speed, design complexity,
and usability.
3.2.2 Operating Frequencies
The openHMC core provides 24 individual conﬁgurations (Table 3.3). The resulting
core clock frequency is calculated with Equation (3.3) where NUM_LANES is either 8
or 16, LINK_SPEED is 10, 12.5, or 15 Gbps and DATAPATH_WIDTH the width of
user interface in bit.
\[ \text{core clock [MHz]} = \frac{\text{NUM\_LANES} \cdot \text{LINK\_SPEED}}{\text{DATAPATH\_WIDTH} \cdot 10^{6}} \tag{3.3} \]
All frequencies marked in gray were successfully implemented and tested in hardware.
The conﬁguration that is used throughout this thesis is highlighted in boldface.
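Equation (3.3) can be checked with a few lines of Python. The following is only a minimal sketch; the function and constant names are chosen for illustration and are not part of the openHMC package. It assumes that LINK_SPEED is given in Gbps and converted to bit/s before the division.

```python
# Sketch of Equation (3.3); names are illustrative, not openHMC identifiers.

LANES = (8, 16)                      # half-width and full-width links
SPEEDS_GBPS = (10.0, 12.5, 15.0)     # supported lane speeds
WIDTHS_BITS = (256, 512, 768, 1024)  # supported datapath widths


def core_clock_mhz(num_lanes: int, link_speed_gbps: float, datapath_width_bits: int) -> float:
    """Return the core clock frequency in MHz for one configuration."""
    link_speed_bps = link_speed_gbps * 1e9  # Gbps -> bit/s
    return num_lanes * link_speed_bps / (datapath_width_bits * 1e6)


def all_configurations() -> dict:
    """Map every (lanes, speed, width) combination to its core clock in MHz."""
    return {
        (lanes, speed, width): core_clock_mhz(lanes, speed, width)
        for lanes in LANES
        for speed in SPEEDS_GBPS
        for width in WIDTHS_BITS
    }


if __name__ == "__main__":
    for (lanes, speed, width), mhz in all_configurations().items():
        print(f"{lanes:2d} lanes, {speed:4.1f} Gbps, {width:4d} bit: {mhz:9.4f} MHz")
```

Enumerating all combinations of the two link widths, three lane speeds, and four datapath widths yields the 24 configurations of Table 3.3.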
3.2.3 Flow Control and Performance
The requirements for retry pointer and token return loop times were mentioned in
Section 3.1.6. The openHMC speciﬁcation highlights that the controller meets both
requirements for most conﬁgurations depending on the operating frequency. The
results, however, assume that the SerDes are optimized for low latency. In case a host
design experiences performance limitations through loop time violations the openHMC
controller provides several options to further decrease the delay. These options include
HMC open-response loop mode and deactivation/removal of link integrity features.
Table 3.3 openHMC core clock frequencies [MHz] for various configurations. The
configuration in bold is the reference for this thesis. Configurations marked in gray
were successfully implemented and tested in a Xilinx Virtex 7 690T FPGA

HMC link width     Speed [Gbps]   256 bit    512 bit     768 bit    1024 bit
half-width (8x)    10             312.5      156.25      104.17     78.125
half-width (8x)    12.5           390.625    195.3125    130.208    97.65625
half-width (8x)    15             468.75     234.375     156.25     117.1875
full-width (16x)   10             625        312.5       208.33     156.25
full-width (16x)   12.5           781.25     390.625     260.417    195.3125
full-width (16x)   15             937.5      468.75      312.5      234.375
3.2.4 Comparison with other IPs
Table 3.2 compares the openHMC IP to the Xilinx and Altera ones. It must be
noted that the Altera core has fixed user interface widths: 256 bit for a half-width (8x)
HMC link and 512 bit for a full-width (16x) link. The comparison with the Altera core
highlights that the openHMC controller requires about 50% fewer registers and only
slightly more LookUp Tables (LUTs) without the XILINX parameter set. With this
parameter set, openHMC requires only one-third of the registers and about two-thirds
of the LUTs. In both cases it also consumes about 60% fewer memory cells. Although
the cores were mapped to a different FPGA technology (Altera vs. Xilinx), the
comparison of the two is reasonable. The largest difference is the naming since both
use 6-input LookUp Tables. In addition to the difference in resource utilization, the
biggest advantage of the openHMC controller is its flexibility. While the Altera core is
limited to only two possible user interface width/HMC link configurations, openHMC
supports 24. One remarkable feature of the Altera core is the ability to reorder incoming
HMC responses, but using this feature will further increase the resources required by
10 to 20 percent.
The resource utilization of the Xilinx IP is comparable to openHMC. It also provides a
broader range of user interface widths, from 256 bit up to 2048 bit. Response reordering
and a multi-channel user interface are additional valuable features. Both vendor IPs,
however, are based on evaluation-only licenses. The cost of purchasing an enhanced
license to integrate these cores into products is not known to the author.
Table 3.4 openHMC ASIC implementation results

Process node             Gates   Area [mm2]   SRAM share of area   Fmax
65 nm general purpose    41900   0.921        75 % (0.69 mm2)      415 MHz
28 nm high-performance   41600   0.223        62 % (0.14 mm2)      1 GHz
3.2.5 ASIC Implementation
The openHMC controller was implemented with two diﬀerent process nodes without
any additional optimizations. The conﬁguration was set to a 256 bit datapath and
all other parameters were left at their standard values. Table 3.4 summarizes the
post-synthesis results with the Cadence Genus Synthesis Solution at the slowest process
corner (minimum voltage, -40°C). As can be seen, the estimated resource utilization
between the two processes remains comparable while the required area is signiﬁcantly
smaller in 28 nm. The maximum operating frequency Fmax is expected to reach
approximately 415 MHz in a conservative, 65 nm general purpose process. The relatively
slow SRAMs prohibit higher frequencies. For the more advanced 28 nm node, however,
it scales up to 1 GHz. According to Table 3.3 it becomes clear that in 28 nm the
openHMC controller can be implemented with the fastest available link speed (15 Gbps)
at 16 lanes. It furthermore eliminates any concerns regarding ﬂow control performance
aspects as processing pointers and tokens takes place much faster.
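The relation between Fmax and the supported link configurations can be illustrated with a short sketch. It assumes that the post-synthesis Fmax of Table 3.4 can be compared directly with the core clock required by Equation (3.3) for the 256 bit datapath, without any additional timing margin; all names are illustrative.

```python
# Sketch: which HMC link configurations fit below a given Fmax (256 bit datapath).
# Assumption: Fmax from Table 3.4 is compared directly to the required core clock.

FMAX_MHZ = {
    "65 nm general purpose": 415.0,
    "28 nm high-performance": 1000.0,
}


def core_clock_mhz(num_lanes: int, link_speed_gbps: float, datapath_width_bits: int = 256) -> float:
    """Required core clock in MHz according to Equation (3.3)."""
    return num_lanes * link_speed_gbps * 1e9 / (datapath_width_bits * 1e6)


def feasible_links(fmax_mhz: float, datapath_width_bits: int = 256) -> list:
    """Return all (lanes, lane speed in Gbps) pairs whose core clock does not exceed fmax_mhz."""
    return [
        (lanes, speed)
        for lanes in (8, 16)
        for speed in (10.0, 12.5, 15.0)
        if core_clock_mhz(lanes, speed, datapath_width_bits) <= fmax_mhz
    ]


if __name__ == "__main__":
    for node, fmax in FMAX_MHZ.items():
        print(node, feasible_links(fmax))
```

Under these assumptions the 28 nm result covers all six link configurations, including 16 lanes at 15 Gbps (937.5 MHz), whereas the 65 nm result only covers half-width links at 10 and 12.5 Gbps.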
3.3 HMC Performance Evaluation
Since its introduction in 2011 [37], HMC has been a research topic and the subject of
investigation in various publications.
In [83] the author theoretically evaluates the HMC using preliminary data that was
available at that time. The evaluation is based on simulations and contains many
assumptions for a variety of parameters that aﬀect performance and power. Since
then, several other simulation models were proposed, e.g. more general ones toward
3D stacked architectures [93] and cycle-accurate simulators [94]. Others presented
techniques to improve the HMC architecture by either optimizing the DRAM re-
fresh mechanism ([95, 96]) or reducing thermal dissipation through data compression
algorithms as in [97].
[98] initially explores the HMC capabilities with application-near memory access traces.
The authors in [99] provide a more general study and highlight the importance of
request sizes and access patterns on performance. Another approach in [100] attempts
to give a more detailed characterization. In this work, however, the experimental setup
turns out to be a performance limitation as it only supports half-width (8x) HMC
links and uses the meanwhile discontinued 2-Link HMC device.
The following section extends the findings in [98], [99], and [100] by providing a
comprehensive overview of the impact of access patterns on metrics such as
bandwidth, latency, and power consumption. Understanding these characteristics is a
must for system engineers and application developers who want to optimally use this
new technology.
In order to provide solid results, the HMC is thoroughly evaluated in a real system
environment. The test setup and host controller can support the full HMC performance
in various link configurations. A comprehensible overview of various metrics will
determine whether or not HMC can satisfy the expectations.
3.3.1 Metrics
Only a few base metrics are required to qualify a memory device. In general, it is
important to clearly understand these to compare individual memory technologies and
to select the best candidate for a given scenario. The metrics evaluated in this section
are:
Bandwidth The bandwidth is one of the most important memory interface metrics.
One must distinguish between two related bandwidth measures: the total and the
effective bandwidth.
• Total: The total (raw) bandwidth is the maximum amount of information a
link can transport in a given time period.
• Effective: The effective bandwidth is the maximum amount of payload a
link is capable of transporting in a given time period. The effective bandwidth
is by definition equal to or less than the total. Serialized links that run a
protocol require certain overhead to be transmitted along with the actual
payload, e.g. 8b/10b encoding and packet framing. Hence their effective
bandwidth is less than the total.
Latency Generally, latency is defined as the time it takes to transport information
from one point to another. In case of the HMC, the read latency is the
time from the moment a request becomes available on the host controller transmit
interface until the corresponding response is seen at the host receiving application.
Power Consumption The power consumption (or simply power) describes how much
energy a device uses (or generates) per unit of time. The two most common
measures are Joule per second (J/s) and Watt (W), with 1 W = 1 J/s. Since power
increasingly moves into focus it is more important than ever to obey power
budgets. Power consumption limits can be identified on a system level and for
individual components, such as 25 Watts for a PCIe connector (if no additional,
external connector is used).
Power Efficiency Power efficiency describes how much energy (not to be confused
with power) is required to transmit a given amount of information, measured
in Joule per bit (J/bit). When referring to power efficiency only the effective
bandwidth is considered.
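The power efficiency metric follows directly from the two quantities defined above. The sketch below is illustrative only: the function name is not taken from the thesis, 1 GB/s is assumed to mean 10^9 Byte per second, and the numbers in the usage example are hypothetical.

```python
# Sketch: energy per bit from power and effective bandwidth.
# Assumption: 1 GB/s = 1e9 Byte/s; input values below are hypothetical.

def energy_per_bit_pj(power_watt: float, effective_bw_gbyte_s: float) -> float:
    """Return the energy per transferred payload bit in picojoule."""
    bits_per_second = effective_bw_gbyte_s * 1e9 * 8  # Byte/s -> bit/s
    joule_per_bit = power_watt / bits_per_second
    return joule_per_bit * 1e12                        # J -> pJ


if __name__ == "__main__":
    # Hypothetical device: 8 W while delivering 10 GB/s of payload.
    print(f"{energy_per_bit_pj(8.0, 10.0):.1f} pJ/bit")
```

Because only the effective bandwidth enters the calculation, protocol overhead on the link worsens the resulting efficiency even if the power consumption stays the same.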
3.3.2 Test Setup
In order to obtain reliable numbers a test setup as shown in Figure 3.9 was created.
It comprises a Xilinx Virtex 7 690T FPGA that connects a 2 GB, 4-Link HMC4
with a full-width (16x) link at 10 Gbps and 12.5 Gbps. Implementing a 15 Gbps
link was not possible as such high lane speeds are not supported by the Xilinx 7
series. The openHMC controller is used as HMC host controller. A low-impedance,
high-precision resistor in each individual power rail is used to measure the voltage drop via
a Linear Technology DC1613A PMBus module. Since the resistance is known, the voltage
drop yields the electric current and thereby the power consumption. The HMC address-mapping mode is configured to
low-interleave and the maximum block size is set to 128 Byte. Address (re-)mapping
in the HMC core logic can be a useful tool to optimize performance for given access
patterns and will be discussed in Section 3.3.3.
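The measurement principle can be summarized in a short sketch. It assumes an ideal series (shunt) resistor per rail and uses the nominal rail voltage for the power calculation; the rail voltage, voltage drop, and resistance in the example are hypothetical and not taken from the test setup.

```python
# Sketch: power of one supply rail from the voltage drop across a series resistor.
# Assumptions: ideal shunt resistor, nominal rail voltage; example values are hypothetical.

def rail_power_watt(rail_voltage_v: float, shunt_drop_v: float, shunt_resistance_ohm: float) -> float:
    """Return the rail power using I = V_drop / R and P = V_rail * I."""
    current_a = shunt_drop_v / shunt_resistance_ohm
    return rail_voltage_v * current_a


def total_power_watt(rails) -> float:
    """Sum the power over all rails; each rail is (rail voltage, drop, resistance)."""
    return sum(rail_power_watt(v, drop, r) for v, drop, r in rails)


if __name__ == "__main__":
    # Hypothetical rail: 1.2 V supply, 10 mV drop across a 5 mOhm resistor -> 2 A.
    print(f"{rail_power_watt(1.2, 0.010, 0.005):.2f} W")
```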
Before the actual results are presented it is crucial to understand the impact of access
patterns on bandwidth. This will also help to avoid pitfalls in a host controller and
application design.
4 Logic revision 2, ﬁrmware 0.95A, part number MT43A4G40200NFA-S15 ES:A
[Figure 3.9: block diagram with FPGA (Traffic Generator, openHMC Controller), HMC, Power Supply, and series resistors (Res) with voltage measurement (V)]
Fig. 3.9 Experimental test setup
3.3.3 Access Patterns
Traditional DDR interfaces operate on a single transactional data-bus to transmit and
receive data to and from the DRAM. Also, a single channel can only serve one command
at a time so that execution of subsequent commands requires the previous command
to complete first. In contrast, HMC comes with a bidirectional communication scheme
where requests and responses are transmitted on separate channels. This
requires well-balanced access patterns, i.e. an optimum read/write ratio, in order to
efficiently utilize both link directions and to maximize bandwidth. The HMC access
granularity is 16 Byte and supported packet sizes range from 16 Byte up to 128 Byte.
Every transmitted packet also requires an additional protocol overhead of 16 Byte.
Therefore, a maximum-sized write request of 128 Byte comprises 8 FLITs payload and
1 FLIT overhead (= 9 FLITs total) to be transmitted on the request channel. A read
request appears as 16 Byte overhead on this channel and returns one FLIT overhead
and up to 8 FLITs payload in response direction. Due to this fact the optimum HMC
read-to-write ratio is not 1 read : 1 write, as a single maximum-sized read+write pair
results in 10 FLITs on the request channel while only 9 FLITs will be returned.
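The FLIT accounting above can be turned into a small analytical model. The following sketch is a simplification and not the evaluation code of this thesis: it assumes posted writes (no write responses), an otherwise ideal link without flow control packets or retries, and a raw bandwidth of lanes x lane speed / 8 per direction. All names are illustrative.

```python
# Sketch: effective bandwidth as a function of request size and read ratio.
# Assumptions: posted writes, ideal link, 16 Byte overhead (1 FLIT) per packet.

FLIT_BYTES = 16


def link_bandwidth_gbyte_s(lanes: int = 16, lane_speed_gbps: float = 10.0) -> float:
    """Raw bandwidth of one link direction in GB/s."""
    return lanes * lane_speed_gbps / 8.0


def effective_bandwidth_gbyte_s(request_bytes: int, read_ratio: float,
                                lanes: int = 16, lane_speed_gbps: float = 10.0) -> float:
    """Payload bandwidth (both directions combined) for a given read ratio in [0, 1]."""
    payload_flits = request_bytes // FLIT_BYTES
    write_ratio = 1.0 - read_ratio
    # Average FLITs per request on each channel:
    request_flits = write_ratio * (payload_flits + 1) + read_ratio * 1
    response_flits = read_ratio * (payload_flits + 1)
    # The more heavily loaded direction limits the request rate.
    bottleneck_flits = max(request_flits, response_flits)
    return payload_flits / bottleneck_flits * link_bandwidth_gbyte_s(lanes, lane_speed_gbps)


def optimum_read_ratio(request_bytes: int) -> float:
    """Read ratio at which request and response channel carry the same load."""
    n = request_bytes // FLIT_BYTES
    return (n + 1) / (2 * n + 1)


if __name__ == "__main__":
    for size in (16, 32, 64, 128):
        print(size, f"{optimum_read_ratio(size):.3f}")
    print(f"{effective_bandwidth_gbyte_s(128, 0.53):.2f} GB/s")
```

For 128 Byte requests the balance point of this model is 9/17, i.e. about 53 % reads, and at a read ratio of 53 % it evaluates to roughly 33.54 GB/s for a full-width link at 10 Gbps.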
Figure 3.10a shows the impact of read-to-write ratio on the total request, response,
and combined bandwidth for maximum-sized, 128 Byte read and write requests. A
read ratio of 53 % maximizes the total bandwidth including packet overhead. Similarly,
Figure 3.10b presents the impact on the eﬀective bandwidth. The ﬁgures represent the
bandwidth for a full-width (16x) link at 10 Gbps. The bandwidth increases linearly
with the lane speed and results for 12.5 and 15 Gbps can be obtained by multiplying
the bandwidth by 1.25 and 1.5 respectively. It can be seen that the maximum eﬀective
bandwidth (i.e. excluding protocol overhead) in a 10 Gbps conﬁguration is 33.5 GB/s
(≈83.75 % efficiency). Furthermore, the actual optimum ratio depends on the request
sizes as shown in Figure 3.11. It can be seen that the optimum ratio shifts toward
more read requests as request sizes become smaller since the percentage of overhead
per request increases. Table 3.5 summarizes the results for each of the possible request
sizes.
[Figure 3.10: bandwidth [GB/s] over read ratio [%]. (a) Total bandwidth with request and response bandwidth; maximum of 40 GB/s at 53 %. (b) Effective bandwidth with effective request and response bandwidth; maximum of 33.54 GB/s at 53 %.]
Fig. 3.10 Impact of read/write ratio on bandwidth with 128 Byte requests
[Figure 3.11: effective bandwidth [GB/s] over read ratio [%] for 128, 64, 32, and 16 Byte requests]
Fig. 3.11 Impact of different request sizes on the optimum read/write ratio
[Figure 3.12: effective bandwidth [GB/s] over read ratio [%], measured versus theoretical]
Fig. 3.12 128 Byte request ratio sweep results: theoretical versus measured
Not only is it important to maintain the optimum ratio, but the ordering of requests
is also relevant. In a worst case scenario, instead of interleaving reads and writes in a stream
of 100 × 128 Byte requests with the optimum ratio of 53 %, the user issues 53 reads
followed by 47 writes. This is referred to as bad request practice and its impact on the
overall bandwidth will be discussed in the following.
Table 3.5 Optimum ratio and maximum effective bandwidth per request size

Request size [Byte]                           16      32     48     64      80     96      112     128
Optimum ratio [% read]                        66      60     57     55      55     54      53      53
Max. effective BW [GB/s] at 10 Gbps           14.93   22.2   26.2   28.6    30.3   31.75   32.6    33.54
Max. effective BW [GB/s] at 12.5 Gbps         18.66   27.8   32.8   35.7    37.9   39.7    40.8    41.9
Max. effective BW [GB/s] at 15 Gbps (a)       22.4    33.3   39.3   42.85   45.5   47.6    48.95   50.3
(a) Listed only for reference. Not verified in hardware
3.3.4 Bandwidth
Several access pattern schemes were identiﬁed and tested. While some of them perform
best with the default address-mapping mode (low-interleave, see Section 3.1.2), others
will beneﬁt from a diﬀerent address-mapping or maximum block size setting. Bandwidth
and eﬃciency numbers in this section were rounded down to account for measurement
errors.
The ﬁrst measurement is shown in Figure 3.12. It compares a sweep of the read ratio
for 128 Byte requests between expected and measured eﬀective bandwidth. The results
very closely match the theoretical evaluation. Only a slight deviation appears for
higher read ratios, most likely due to measurement errors and/or the negative impact
of violating the retry pointer or token return loop times. The more reads are issued,
the more responses are generated. Consequently, the retry buffer fills up faster and
the loop time constraints are tightened. Several ways to alleviate loop time violations
were proposed earlier. The results for a 12.5 Gbps link are very similar and not shown.
They can be calculated by multiplying the 10 Gbps results by 1.25.
Figure 3.13 shows a plot of various access patterns for 128 Byte requests and their
impact on the measured eﬀective bandwidth at 10 Gbps. It can be seen that linear
reading and writing deliver the theoretical maximum of 17.7 GB/s with an eﬃciency of
88.5 % per link direction (see Equation (3.4)). An additional experiment with strided
accesses unveils a drop in bandwidth for stride=16, where stride=1 represents linear
reading/writing throughout all vaults.
\[ \text{Read or write link efficiency:}\quad \frac{\text{Effective bandwidth}}{\text{Total bandwidth}} = \frac{17.7\ \text{GB/s}}{20\ \text{GB/s}} = 88.5\,\% \tag{3.4} \]
With a stride of 16 and low-interleave address-mapping only 1 vault is continuously
accessed. The peak bandwidth of a single vault is 9.8 GB/s for writing and 9.35 GB/s
for reading. These results closely reflect the maximum vault bandwidth of 10
GB/s, lowered by packet processing overhead. In general, increasing the stride will only
aﬀect the bandwidth when the number of accessed vaults and therefore the provided
vault bandwidth is lower than the eﬀective link bandwidth. For a given strided access
pattern changing the address-mapping mode can eliminate this limitation. In the
previous case, shifting vault and bank address segments to higher address bits will
improve stride=16 accesses. All other strided accesses, however, will negatively impact
performance due to vault congestion.
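The relation between stride, number of accessed vaults, and the resulting bandwidth limit can be sketched as follows. The mapping used here is an assumption made for illustration: with 128 Byte blocks and low-interleave mapping, consecutive blocks are taken to rotate through the 16 vaults, so that the vault index is the block number modulo 16. The vault bandwidth of 10 GB/s is the value stated above; the bound ignores packet processing overhead and bank conflicts.

```python
# Sketch: vaults touched by a strided 128 Byte access pattern and the resulting
# upper bandwidth bound. The address mapping below is an illustrative assumption.

BLOCK_BYTES = 128
NUM_VAULTS = 16
VAULT_BANDWIDTH_GBYTE_S = 10.0


def vault_of(address: int) -> int:
    """Vault index under the assumed low-interleave mapping."""
    return (address // BLOCK_BYTES) % NUM_VAULTS


def vaults_touched(stride: int, num_requests: int = 1024) -> int:
    """Number of distinct vaults hit when every stride-th block is accessed."""
    return len({vault_of(i * stride * BLOCK_BYTES) for i in range(num_requests)})


def bandwidth_bound_gbyte_s(stride: int, effective_link_bw_gbyte_s: float) -> float:
    """Upper bound: the smaller of link bandwidth and aggregate vault bandwidth."""
    return min(effective_link_bw_gbyte_s, vaults_touched(stride) * VAULT_BANDWIDTH_GBYTE_S)


if __name__ == "__main__":
    for stride in (1, 2, 4, 8, 16):
        print(stride, vaults_touched(stride), bandwidth_bound_gbyte_s(stride, 17.7))
```

In this model a stride of 8 still reaches two vaults, whose combined 20 GB/s exceed the 17.7 GB/s of a single link direction at 10 Gbps, while a stride of 16 is confined to one vault and therefore to at most 10 GB/s.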
The optimum read ratio of 53 % gives the maximum eﬀective bandwidth of 33.5 GB/s
for linear accesses (83.75 % eﬃciency, see Equation (3.5)) and 8.9 GB/s for a single
vault. A reasonable expectation would be that the eﬃciency stayed constant compared
to only reading or writing at a time, which was measured with an eﬃciency of 88.5 %.
Mixing reads and writes, however, increases the protocol overhead in request direction
which now carries 1 out of 9 FLITs overhead for writes and 1 FLIT pure overhead for
every read that is sent.
\[ \text{Combined R/W link efficiency:}\quad \frac{\text{Effective bandwidth}}{\text{Total bandwidth}} = \frac{33.5\ \text{GB/s}}{40\ \text{GB/s}} = 83.75\,\% \tag{3.5} \]
Random accesses do not show an impact when addressing all vaults while the single
vault bandwidth drops to 7.58 GB/s due to the increased probability of bank conﬂicts.
Increasing the lane speed to 12.5 Gbps does not improve the single vault performance
as shown in Figure 3.14. For all other access patterns, however, the results represent
what has been theoretically evaluated earlier.
The term bad request practice was introduced to describe bad ordering of requests in a
stream for a given read ratio. The negative impact of this bad request practice turns
out to be negligible in a stream of 100 requests. Restrictions in the FPGA design
prohibited the use of longer sequences, which are expected to decrease the achievable
bandwidth. Hence, although the HMC is capable of internally reordering independent
requests, bad request ordering over a longer period of time should be avoided. It will
lead to inefficient utilization of either of the two link directions. If required,
host-sided reordering should be performed in order to maintain the highest link bandwidth.
[Figure 3.13: bar chart of the effective bandwidth [GB/s] for the access patterns R/W Linear, R/W Stride=2, 4, 8, and 16, Opt, Opt 1 Vault, Opt Bad, Opt Bad 1 Vault, 80R/20W, Rdm, and Rdm 1 Vault. Legend: R = Read, W = Write, Opt = Optimum Ratio (53% reads), Bad = Bad Request Practice, Rdm = Random Addresses at Opt]
Fig. 3.13 Effective bandwidth for different access patterns with 128 Byte requests at 10 Gbps
[Figure 3.14: bar chart of the effective bandwidth [GB/s] for the access patterns R/W Linear, R/W Stride=2, 4, 8, and 16, Opt, Opt 1 Vault, Opt Bad, Opt Bad 1 Vault, 80R/20W, Rdm, and Rdm 1 Vault. Legend: R = Read, W = Write, Opt = Optimum Ratio (53% reads), Bad = Bad Request Practice, Rdm = Random Addresses at Opt]
Fig. 3.14 Effective bandwidth for different access patterns with 128 Byte requests at 12.5 Gbps
55
Hybrid Memory Cube
Request
Request Response
User App openHMC
Request
Response
Vaults/DRAM/...
Host HMC
Transceiver
Host delay  (non-HMC) HMC delay
Fig. 3.15 Host to HMC read latency contributors
Table 3.6 Host-sided read latency contributors

Type / Delay       Cycles   at 10 Gbps   at 12.5 Gbps
User application   2        6.4 ns       5.12 ns
openHMC            29       92.8 ns      74.24 ns
SerDes             19       60.8 ns      48.64 ns
Total non-HMC      50       160 ns       128 ns
10 Gbps: 312.5 MHz FPGA clock (tcycle = 3.2 ns)
12.5 Gbps: 390.625 MHz FPGA clock (tcycle = 2.56 ns)
3.3.5 Latency
The latency of individual requests (i.e. the latency of a randomly selected request in a
given request stream) for HMC is higher than for a transactional memory interface
such as DDR. Several contributors to this latency can be identiﬁed as depicted in
Figure 3.15. The delays for the user application and the openHMC controller are
well known. The SerDes internal loopback mode of the FPGA was run to quantify
the delay introduced through serialization and deserialization. It is assumed that
the transmission line is not contributing noticeably. The actual HMC read delay can
be estimated by subtracting all known delays from the overall latency. Table 3.6
summarizes the results for the individual contributors in the test design at 10 Gbps
and 12.5 Gbps. It can be seen that the overall latency can be signiﬁcantly reduced
by increasing the FPGA logic frequency. To provide an application-near scenario the
overall read request latency was measured, starting from the point where a packet
is created in the user application until the corresponding response is received there.
Figure 3.16 and Figure 3.17 plot the latency over a read ratio sweep for 128 Byte
requests for linear addressing. Each ratio was applied for 20 seconds and the best,
worst, and average latencies were measured for randomly selected individual requests
56
3.3 HMC Performance Evaluation
0 20 40 60 80 100
0
10
00
20
00
30
00
40
00
50
00
60
00
Read Ratio [%]
R
ea
d 
la
te
nc
y 
[n
s]
Average
Best
Worst
Fig. 3.16 Host to HMC read latency at 10 Gbps (tcycle = 3.2 ns)
0 20 40 60 80 100
0
10
00
20
00
30
00
40
00
50
00
60
00
Read Ratio [%]
R
ea
d 
la
te
nc
y 
[n
s]
Average
Best
Worst
10 Gbps Average
Fig. 3.17 Host to HMC read latency at 12.5 Gbps (tcycle = 2.56 ns)
57
Hybrid Memory Cube
in the access stream. It can be seen that the initial read latency starts out with an
average of 224 ns (70 FPGA cycles) for 10 Gbps and 192 ns (75 FPGA cycles) for 12.5
Gbps, respectively. The latency then remains stable until the optimum ratio threshold
is reached. At this point, more reads are requested than the HMC and in particular
the response link can supply. The read latency continues to increase when more reads
are sent and goes up to several microseconds. This is because the HMC input buﬀer
runs full with unanswered requests and the response link is in bandwidth saturation.
This prevents the host to continue. The disparities between the best and worst case
latencies originate from corner cases where a corresponding read request enters the
openHMC controller right at the time that traﬃc is throttled. The request therefore
remains in buﬀers waiting to be transmitted, while this waiting time accounts for the
overall latency.
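The subtraction described above can be written down as a short sketch. It uses the cycle counts of Table 3.6 and the average latencies quoted in the text, and it assumes that Table 3.6 covers the complete host-side round trip; the function names are illustrative.

```python
# Sketch: estimate of the HMC-internal read delay as overall latency minus the
# known host-side delays. Assumption: Table 3.6 covers the full host round trip.

HOST_CYCLES = {"user application": 2, "openHMC": 29, "SerDes": 19}  # Table 3.6


def cycle_time_ns(fpga_clock_mhz: float) -> float:
    """FPGA clock period in nanoseconds."""
    return 1e3 / fpga_clock_mhz


def host_delay_ns(fpga_clock_mhz: float) -> float:
    """Sum of all non-HMC delays in nanoseconds."""
    return sum(HOST_CYCLES.values()) * cycle_time_ns(fpga_clock_mhz)


def estimated_hmc_delay_ns(measured_latency_ns: float, fpga_clock_mhz: float) -> float:
    """Overall measured read latency minus the host-side share."""
    return measured_latency_ns - host_delay_ns(fpga_clock_mhz)


if __name__ == "__main__":
    print(f"10 Gbps:   {estimated_hmc_delay_ns(224.0, 312.5):.1f} ns")
    print(f"12.5 Gbps: {estimated_hmc_delay_ns(192.0, 390.625):.1f} ns")
```

With the numbers given above, this estimate comes out at about 64 ns for both lane speeds, which suggests that the difference between the two measurements stems from the host side.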
In summary it becomes clear that the single access latency gets much worse when
either of the link directions saturates. A low latency host design including SerDes, host
controller, and user application, in combination with well-balanced access patterns, is
the key to lowering the HMC access latency. Increasing host (FPGA) operating
frequencies or HMC lane speeds are additional options. For the given test environment,
however, increasing the link speed to 15 Gbps was not an option because it could not
be implemented in the target FPGA.
3.3.6 Atomic Operations
The HMC protocol deﬁnes packet types for atomic operations that will be executed
by the HMC logic layer, eliminating the need for expensive read-modify-write cycles
on the host. The two available commands add either an 8 Byte value to a 16 Byte
memory operand (16-Byte immediate add) or two 4 Byte values to two 8 Byte memory
operands (dual 8-Byte immediate add). Each add operation is referred to as an
update. Figure 3.18 summarizes the maximum updates per second for accessing a
single address, a single vault, and up to all available vaults. This is represented by
the corresponding access stride, where stride=16 accesses only 1 vault with HMC
standard address-mapping. It can be seen that the maximum number of updates per
second increases proportionally with the number of accessed vaults and inversely with
the stride size for both types of atomics. Since the actual packet throughput remains
the same, dual 8-Byte add immediate commands can update twice as many values
as 16-Byte adds. Figure 3.19 points out that increasing the lane speed
does not improve the maximum number of atomic operations since the HMC internal
operating frequency is maintained.
[Figure 3.18: bar chart, Megaupdates per second over address range. Bar labels:]
Address range    16 Byte immediate add   8 Byte immediate add   16 Byte posted write
Single Address   15                      31                     20
1 Vault          37                      74                     163
2 Vaults         72                      143                    325
4 Vaults         145                     291                    625
8 Vaults         292                     584                    625
16 Vaults        584                     1168                   625
Fig. 3.18 Megaupdates/second versus address range at 10 Gbps
[Figure 3.19: bar chart, Megaupdates per second over address range. Bar labels:]
Address range    16 Byte immediate add   8 Byte immediate add   16 Byte posted write
Single Address   19                      39                     21
1 Vault          37                      75                     164
2 Vaults         75                      149                    326
4 Vaults         149                     298                    651
8 Vaults         298                     596                    781
16 Vaults        581                     1162                   781
Fig. 3.19 Megaupdates/second versus address range at 12.5 Gbps
The key observation in these plots is that increasing the number of accessed vaults has
a positive impact on the total number of updates per second. This is in contrast to
regular read/write requests, which already saturate with fewer vaults. The positive effect
of accessing more vaults concurrently, however, would also apply to regular reading
and writing if more than one link were used.
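The semantics of the two add commands can be illustrated with a small software model. The following Python sketch is illustrative only: the function names, the little-endian byte order, the placement of the two 8 Byte operands next to each other, and the wrap-around on overflow are assumptions made for this example and are not taken from the measurements above.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def add16_immediate(mem: bytearray, addr: int, imm8: int) -> None:
    """Model of a 16-Byte immediate add: one 8 Byte immediate is added
    to one 16 Byte memory operand (one update per packet)."""
    operand = int.from_bytes(mem[addr:addr + 16], "little")
    result = (operand + imm8) & MASK128  # assumed wrap-around on overflow
    mem[addr:addr + 16] = result.to_bytes(16, "little")


def dual_add8_immediate(mem: bytearray, addr: int, imm4_a: int, imm4_b: int) -> None:
    """Model of a dual 8-Byte immediate add: two 4 Byte immediates are added
    to two 8 Byte memory operands (two updates per packet)."""
    for offset, imm in ((0, imm4_a), (8, imm4_b)):
        start = addr + offset
        operand = int.from_bytes(mem[start:start + 8], "little")
        result = (operand + imm) & MASK64  # assumed wrap-around on overflow
        mem[start:start + 8] = result.to_bytes(8, "little")


def updates_per_second(packets_per_second: float, dual: bool) -> float:
    """At equal packet throughput a dual add performs two updates per packet."""
    return packets_per_second * (2 if dual else 1)


if __name__ == "__main__":
    memory = bytearray(32)
    add16_immediate(memory, 0, 5)
    dual_add8_immediate(memory, 16, 1, 2)
    print(memory.hex())
```

The model only captures the arithmetic. The benefit measured above comes from the fact that the host sends a single request packet instead of a read followed by a write.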
3.3.7 Power Consumption and Energy Eﬃciency
The HMC power consumption was measured for the workloads presented in Figure 3.13
and Figure 3.14 using the test setup shown in Figure 3.9. Power was measured via the
voltage drop across high-precision resistors. The following results are best-effort
experimental measurements and are subject to parasitic effects (e.g.
efficiency of the power source and other components) and deviation (e.g. temperature,
measurement error).
Figure 3.20 and Figure 3.21 plot the HMC power consumption at 10 Gbps and 12.5
Gbps. The values for HMC power-on/reset and idle states are included as a reference.
Static and idle power make up a major fraction of the overall
consumption. While an idle link already consumes about 5 Watt, actual traffic does
not contribute excessively to the overall power footprint. As expected,
dynamic power consumption increases as more bandwidth is requested. The main
contributors here are the sources for the DRAM and the logic core. Figure 3.22 and
Figure 3.23 show the measured energy efficiency in [pJ/bit] for the individual workloads
at 10 Gbps and 12.5 Gbps. The efficiency is calculated as the power consumption
in [Watt] divided by the effective bandwidth. The figures show an idle power
consumption (i.e. after the link has trained) of 5.1 Watt and a best energy efficiency
of 23.2 pJ/bit at the optimum read/write ratio for a 10 Gbps link. Similarly, the idle
power consumed at 12.5 Gbps is 5.6 Watt and the best efficiency was measured at
21.7 pJ/bit. All efficiencies are relative to the effective delivered bandwidth. No
values are provided for reset, idle, and sleep as no data is transmitted in
these states. Accessing random addresses does not affect power efficiency except when
bank conflicts occur, which lower the effective bandwidth. Furthermore, there is no
difference between properly ordered request streams and the bad request practice
(for 100 requests) introduced earlier. As mentioned before, it would require very long
streams of disordered accesses to see an effect here.
Fig. 3.20 HMC power consumption for various workloads at 10 Gbps (lower=better)
(stacked bar chart; y-axis: power consumption [Watt]; components: static, idle/sleep
offset, dynamic; one bar per device state or workload)
Fig. 3.21 HMC power consumption for various workloads at 12.5 Gbps (lower=better)
(stacked bar chart; same axes and components as Fig. 3.20)
Fig. 3.22 HMC energy efficiency for various workloads at 10 Gbps (lower=better)
(bar chart; y-axis: energy efficiency [pJ/bit]; one bar per device state or workload)
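The efficiency metric used in this section is power divided by effective bandwidth. The following Python sketch shows the unit conversion to pJ/bit. The function name and the operating point in the usage example are hypothetical and were not measured; 1 GB/s is assumed to be 10^9 Byte/s.

```python
import math


def energy_per_bit_pj(power_watt: float, bandwidth_gbyte_s: float) -> float:
    """Energy efficiency in pJ/bit: power [W] divided by effective bandwidth [bit/s]."""
    if bandwidth_gbyte_s == 0:
        # No data is transferred (reset, idle, sleep): the metric is unbounded.
        return math.inf
    bits_per_second = bandwidth_gbyte_s * 1e9 * 8
    joule_per_bit = power_watt / bits_per_second
    return joule_per_bit * 1e12  # J -> pJ


if __name__ == "__main__":
    # Hypothetical operating point, chosen only to illustrate the arithmetic.
    print(f"{energy_per_bit_pj(7.0, 30.0):.1f} pJ/bit")
    # An idle link draws power but moves no data.
    print(energy_per_bit_pj(5.1, 0.0))
```

Because the static and idle share of the power is paid regardless of traffic, the metric improves as the delivered bandwidth grows at roughly constant power.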
In general, both plots point out that the power efficiency improves (i.e. pJ/bit drops)
when the link is kept busy. In contrast, static power consumption dominates for
inefficient link utilization and the efficiency is reduced. It is expected that increasing
the number of active links and their lane speeds will have a positive effect on energy
efficiency as static device power is a major contributor to the overall consumption.
The HMC sleep mode can be entered to reduce power consumption when the link is
in idle to save about 45 % at 10 Gbps and 49 % at 12.5 Gbps. However, it must be
noted that entering and exiting sleep mode takes time and requires an additional link
initialization sequence.
Fig. 3.23 HMC energy efficiency for various workloads at 12.5 Gbps (lower=better)
(bar chart; same axes as Fig. 3.22)
3.3.8 Summary Performance Evaluation
This section provided HMC in-system measurements for bandwidth, latency, computation
offloading using atomic operations, and energy efficiency for a single, full-width
(16x) HMC link at 10 Gbps and 12.5 Gbps. The key takeaways can be summarized as
follows:
• A proper understanding of the HMC architecture is mandatory in order to
optimize the overall performance. As long as requests access more vaults than
actually required to saturate the link bandwidth, the limiting factor is the link
itself. Link performance, however, is mainly dependent on the corresponding
host controller. A low latency optimized controller is crucial to avoid any flow
control drawbacks through the circulation time of packet pointers and tokens.
• Read latency is heavily affected by access patterns and imbalanced link utilization
should be avoided whenever possible.
• The host portion of the request/response loop strongly influences the overall
latencies. The results, however, reflect an FPGA host implementation. ASICs with
higher clock speeds and low latency SerDes implementations would significantly
reduce these numbers.
• Offloading computation using atomic operations can be a useful feature to
eliminate computation and communication overhead for add operations. More
advanced arithmetical and logic functions (such as announced with the HMC
specification 2.0) will be required to provide a real benefit for most operations.
Complex operations, however, will still remain a processor task and therefore
require data movement.
• The results for energy efficiency show that stacked memories, and HMC in
particular, are capable of reducing the power penalty for accessing memory. Additional
experiments with more links and higher lane speeds are required to ultimately
identify its full potential.
In conclusion, the HMC indeed provides the performance it claims. It
furthermore helps to meet the power and energy requirements of future systems.
3D integration of CMOS logic and processing elements will continue to gain importance,
and memory makers will hopefully integrate advanced offloading capabilities along
with the memory device in the future.
3.4 HMC Summary
This chapter introduced the Hybrid Memory Cube and highlighted its most valuable
characteristics. Its abstracted processor interface not only forces application developers
to rethink how memory is used, but also requires a corresponding host controller.
openHMC has been presented as a no-cost alternative to other, commercially available
host controllers. It was shown that openHMC outperforms at least one comparable host
controller in terms of resource efficiency and flexibility, and at the same time maintains
the best link performance. Experiments showed that an ASIC implementation of the
same design can reach up to 1 GHz with a current process node.
A test setup comprising a 2 GB HMC and an FPGA was created to qualify the HMC
performance and power efficiency. It became clear that access patterns have a major
influence on latency and bandwidth and also affect efficiency. However, if the loads on
the link are well-balanced, HMC can provide a powerful, energy efficient, and dense
memory alternative for many applications. Unfortunately, the very limited HMC capacity
of 2 GB currently restricts its use for most applications. Although next-generation
HMC devices were specified, they have been taken off the roadmap and it remains to
be seen if the capacity of current devices will increase considerably.
HMC and the openHMC host controller are essential building blocks of the Network
Attached Memory, which will be presented in the next chapter.
Chapter 4
Network Attached Memory
This chapter introduces Network Attached Memory (NAM)¹, a novel, standalone
component with EXTOLL network interfaces. It provides access to a 2 GB HMC as a
shared memory resource combined with tightly coupled processing units implemented
in an FPGA. As processing takes place in the FPGA and not in the HMC memory itself,
the NAM can be categorized as a Near-Data Computing (NDC) device and not a true
PIM architecture. The idea for the NAM originated from the desire to introduce a
network device with fast memory and processing capabilities in order to reduce network
traffic and speed up collective operations.
The NAM is first used in the DEEP-ER (Dynamical Exascale Entry Platform - Extended
Reach) project, where it can be connected to any EXTOLL NIC to provide system-wide,
high-performance DRAM access as an additional level in the memory hierarchy.
Scalability is preserved as the memory capacity and network bandwidth increase linearly
with the number of NAMs in the system. The first particular use case is to improve
the performance of the DEEP-ER resiliency features. The NAM therefore implements
a Checkpoint/Restart (CR) mechanism to speed up the creation and reconstruction of
parity checkpoints. The HMC memory interface is particularly
beneficial here as it suits the sequential access patterns of reading and writing
large checkpoint files well.
¹ The NAM concept has been prominently presented in various articles [13, 14] and as a peer-reviewed
conference poster [15].
The following sections provide background information on DEEP-ER and the EXTOLL
network technology. The NAM prototype is presented and the functional units
implemented in the FPGA are described in detail. A theoretical performance analysis will
support characterization of the measurements conducted in the next chapter. Finally,
FPGA implementation results and the required software components to actually use
the NAM are presented.
4.1 DEEP-ER Project
DEEP-ER is a European Commission funded project under the Seventh Framework
Programme (FP7/2007-2013). It addresses I/O performance and resiliency as two
important challenges in building an Exascale-ready architecture. Both problems correlate
since I/O performance also affects resiliency throughput. Within its predecessor DEEP,
an innovative cluster-booster architecture was developed. While the cluster part is
based on commodity Intel Xeon processors to execute complex, low to medium scalable
code, the booster is equipped with Intel Xeon Phi accelerators for compute-intensive
tasks. DEEP-ER extends this approach with upgraded components linked via the
high-performance interconnection network EXTOLL. In addition, to satisfy the increasing
demands on I/O performance, DEEP-ER attaches state-of-the-art non-volatile memory
and NAMs. Figure 4.1 outlines the system architecture as a high level diagram.
To establish a running hardware platform as early as possible in the project, the
DEEP-ER team created the Software Development Vehicle (SDV). It consists of 16 high-end
Intel Xeon processor nodes and 3 file servers on the cluster side, and 8 Intel Xeon
Phi accelerators as the booster part. The early availability of the SDV helped software
developers to familiarize themselves with the new components, especially the NAM. It
was used to run all kinds of application benchmarks including those for the NAM. The
final DEEP-ER prototype foresees upgrading the booster part to a total of 72 Xeon
Phi accelerators. Although the project officially ended in March 2017, at the time of
writing work to establish the final prototype is ongoing.
Fig. 4.1 DEEP-ER System Overview: The cluster part is based on Intel Xeon processors.
The booster consists of Intel KNL nodes with one NIC and NVMe device each.
Two NAMs are attached to available EXTOLL links
4.2 Background: EXTOLL
EXTOLL [47, 48, 49] is a high-performance interconnection network developed by
EXTOLL GmbH, a spin-off company of the Ruprecht-Karls University Heidelberg. Its
switch-less architecture (i.e. the switch is integrated with the NIC) removes the need
for external switches and allows a variety of network topologies to be created, including
mesh and 3D torus. Hence the network scales linearly with the system size. The EXTOLL
ASIC named Tourmalet (Figure 4.2) comes with a PCIe Gen3 x16 host interface, 6
independent network links (+1 optional), the network switching architecture, and three
different functional units used to exchange data between NICs.
Fig. 4.2 EXTOLL Tourmalet ASIC. Image courtesy: EXTOLL
Fig. 4.3 EXTOLL Tourmalet ASIC Block Diagram
(blocks shown: PCIe Gen3 x16 host interface, on-chip network, ATU, VELO, RMA, SMFU,
register file, network ports, network switch, and six link ports)
4.2.1 Functional Units and Link Performance
Figure 4.3 shows a block diagram of the Tourmalet ASIC. The PCIe Gen3 x16 interface
connects a host processor. The three functional units for data transport (RMA, VELO,
SMFU) are connected via the network crossbar switch to any of the six network links
and via an additional on-chip network to PCIe. Out of these three units, the RMA
(Remote Memory Access or Remote Memory Architecture) [101] has been identified
as the best candidate to communicate with the NAM, which in turn needs to implement
a compliant unit. RMA is a throughput oriented unit designed for medium to large
message sizes. Data is transferred and received via PUT and GET transactions and
data transport is offloaded via a DMA engine. EXTOLL furthermore provides a
low-overhead notification mechanism to inform a process whether data has been sent,
requested data has arrived, or to inform a remote process that a PUT or GET operation
has completed. The set of functional units is complemented by a Register File (RF)
that can be accessed locally or remotely and an Address Translation Unit (ATU).
In terms of link performance, each of the six EXTOLL network links operates on 12
lanes with a maximum of 8.4 Gbps per lane. Note that EXTOLL links operate in
full-duplex mode, i.e. data can be transmitted and received simultaneously, doubling
the lane count per link to 24. The following sections consider unidirectional operation
and assume that bidirectional traffic results in approximately twice the bandwidth.
The total raw bandwidth per link and direction is 12 · 8.4 Gbps = 100.8 Gbps = 12.6 GB/s.
Fig. 4.4 The EXTOLL Link gearbox converts from the 128 bit datapath to the 192 bit
link interface
(diagram over three cycles; each cycle shows the 128 bit datapath and the three 64 bit
quads 0 to 2)
Due to an 8b/10b coding scheme only 80 % of the link bandwidth is useful. The
maximum unidirectional link bandwidth is therefore:
\[ BW_{LINK} = 100.8\ \mathrm{Gbps} \cdot \frac{8\,\mathrm{b}}{10\,\mathrm{b}} = 80.64\ \mathrm{Gbps} = 10.08\ \mathrm{GB/s} \tag{4.1} \]
The links are downward compatible and support narrower links (8 lanes / 4 lanes) and lower
link speeds (4.2 Gbps / 2.1 Gbps). When the NAM project started, EXTOLL link
speeds were announced as 2.5/5/10 Gbps. For technical reasons the link speeds had
to be decreased, which influenced the reference clock selection on the NAM prototype.
This is discussed in Section 4.3.
All EXTOLL functional units including the RMA operate on a 128 bit datapath at
630 MHz, which matches BW_{LINK}:
\[ BW_{RMA} = 128\ \mathrm{bit} \cdot 630\ \mathrm{MHz} = 80.64\ \mathrm{Gbps} = 10.08\ \mathrm{GB/s} \tag{4.2} \]
Note that the equation above is only valid for a link between two EXTOLL ASICs.
The maximum RMA bandwidth with a NAM as link partner is not calculated at this
point. It is first necessary to understand how data is passed from the EXTOLL
functional units to the link.
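The bandwidth figures of this section follow from a few multiplications. The Python sketch below reproduces them; the constant and function names are chosen for this example, and 1 GB/s is taken as 8 Gbps.

```python
LANES_PER_LINK = 12
LANE_RATE_GBPS = 8.4
CODING_EFFICIENCY = 8 / 10      # 8b/10b line coding
DATAPATH_BITS = 128
CORE_CLOCK_MHZ = 630


def raw_link_bandwidth_gbps(lanes: int = LANES_PER_LINK,
                            lane_rate_gbps: float = LANE_RATE_GBPS) -> float:
    """Raw unidirectional link bandwidth before line coding."""
    return lanes * lane_rate_gbps


def effective_link_bandwidth_gbps(lanes: int = LANES_PER_LINK,
                                  lane_rate_gbps: float = LANE_RATE_GBPS) -> float:
    """Unidirectional link bandwidth after subtracting the 8b/10b overhead."""
    return raw_link_bandwidth_gbps(lanes, lane_rate_gbps) * CODING_EFFICIENCY


def datapath_bandwidth_gbps(width_bits: int = DATAPATH_BITS,
                            clock_mhz: float = CORE_CLOCK_MHZ) -> float:
    """Bandwidth of a parallel datapath: width times clock frequency."""
    return width_bits * clock_mhz / 1000.0


def gbps_to_gbyte_s(gbps: float) -> float:
    return gbps / 8.0


if __name__ == "__main__":
    for name, value in (("raw link", raw_link_bandwidth_gbps()),
                        ("effective link", effective_link_bandwidth_gbps()),
                        ("RMA datapath", datapath_bandwidth_gbps())):
        print(f"{name}: {value:.2f} Gbps = {gbps_to_gbyte_s(value):.2f} GB/s")
```

The same functions can be called with the reduced configurations mentioned above, for example `effective_link_bandwidth_gbps(8, 4.2)` for an 8-lane link at 4.2 Gbps.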
A 12x EXTOLL Link is subdivided into three quads with four lanes each, and every lane
takes 16 bit parallel data at a time. Hence, the width of the parallel data input to the
link is 12 · 16 bit = 192 bit. A gearbox is used to translate this interface to the 128 bit
datapath of the functional units in a 3-stage iterative process as depicted in Figure 4.4.
Six 64 bit cells are processed within each iteration. It will be shown
that this gearbox has a negative effect on the bandwidth when communicating with
the NAM.
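The width conversion can be pictured as regrouping 64 bit cells: three 128 bit datapath words carry six cells, which fill exactly two 192 bit link words. The following Python sketch models only this regrouping. The function name, the order in which cells are assigned to link words, and the padding of an incomplete link word with `None` are assumptions for illustration and do not describe the actual gearbox logic.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

CELL_BITS = 64
DATAPATH_CELLS = 128 // CELL_BITS   # 2 cells per 128 bit datapath word
LINK_CELLS = 192 // CELL_BITS       # 3 cells per 192 bit link word (one per quad)


def gearbox_128_to_192(
    datapath_words: Iterable[Tuple[int, ...]],
) -> Iterator[Tuple[Optional[int], ...]]:
    """Regroup 128 bit datapath words (2 cells each) into 192 bit link words (3 cells each)."""
    pending: List[Optional[int]] = []
    for word in datapath_words:
        assert len(word) == DATAPATH_CELLS
        pending.extend(word)
        while len(pending) >= LINK_CELLS:
            yield tuple(pending[:LINK_CELLS])
            del pending[:LINK_CELLS]
    if pending:
        # Assumption: an incomplete link word is padded with empty slots.
        yield tuple(pending + [None] * (LINK_CELLS - len(pending)))


if __name__ == "__main__":
    three_cycles = [(0, 1), (2, 3), (4, 5)]   # six cells in three datapath cycles
    print(list(gearbox_128_to_192(three_cycles)))
```

In this model a stream whose cell count is not a multiple of three leaves part of the last link word unused.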
4.2.2 From Software to Network Transactions
Every PUT or GET transaction carried out by the RMA is initiated by a user program.
The software places a descriptor into one of the descriptor queues of the EXTOLL
device. These descriptors contain information such as the destination node and process,
the amount of payload to be written or read, and where this payload shall be read
from or written to. For PUT operations, the source address is the start location of
the payload in the local memory and is either a virtual or physical address. In case
of a virtual address the ATU is requested to translate it to a physical one. Without
involving the host processor, the EXTOLL NIC fetches the payload via DMA. The data
is then packed into network packets and transmitted by the local RMA requester unit.
The transaction is directed to the RMA completer unit of the destination node which
forwards the data to its local target memory location (Figure 4.5a). GET operations
on the other hand will fetch data from a remote memory location and transfer it to the
local memory of the requesting node via GET Response transactions. In this case the
transaction is requested by the local RMA requester with the remote RMA responder
as turnaround unit. As the response returns to the local node it is eventually processed
by the RMA completer (Figure 4.5b).
The maximum amount of data movement initiated by a single software descriptor is
8 MB. Hence, to accommodate larger data transfers, multiple transactions must be
triggered by placing additional software descriptors.
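Splitting a large transfer into several descriptors can be sketched as follows. The `Descriptor` class and its fields are hypothetical and only mirror the information listed above; they are not the EXTOLL descriptor layout. The 8 MB limit is assumed to mean 8 · 2^20 Byte.

```python
from dataclasses import dataclass
from typing import List

MAX_DESCRIPTOR_BYTES = 8 * 1024 * 1024  # assumed interpretation of the 8 MB limit


@dataclass
class Descriptor:
    """Hypothetical software descriptor with the fields named in the text."""
    command: str      # "PUT" or "GET"
    dest_node: int
    src_addr: int
    dst_addr: int
    length: int       # payload bytes moved by this descriptor


def build_descriptors(command: str, dest_node: int, src_addr: int,
                      dst_addr: int, total_bytes: int) -> List[Descriptor]:
    """Cover a transfer of total_bytes with descriptors of at most 8 MB each."""
    descriptors = []
    offset = 0
    while offset < total_bytes:
        chunk = min(MAX_DESCRIPTOR_BYTES, total_bytes - offset)
        descriptors.append(
            Descriptor(command, dest_node, src_addr + offset, dst_addr + offset, chunk)
        )
        offset += chunk
    return descriptors


if __name__ == "__main__":
    for d in build_descriptors("PUT", dest_node=3, src_addr=0x1000_0000,
                               dst_addr=0x0, total_bytes=20 * 1024 * 1024):
        print(d)
```

In this sketch a 20 MB transfer results in three descriptors, the last one covering the remaining 4 MB.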
A third command, PUT IMMEDIATE, is provided for small data transfers (72 bit)
without involving the local DMA engine. The payload is already embedded in the
software descriptor in this case and it can be useful to e.g. access a remote RF.
It must be noted that the EXTOLL RMA supports additional commands. They are
neither relevant for this work nor supported by the NAM.
4.2.3 Notiﬁcation Mechanism
EXTOLL provides an optional notiﬁcation mechanism as depicted in Figure 4.5. It
can be used to inform processes of the progress or completion of transactions. For
PUT transactions, notifications may be generated at the nodes of the respective
RMA units involved in the process, i.e. by the local RMA requester or the remote RMA
completer. GET operations additionally involve a remote responder unit, which may
likewise generate notifications.
Fig. 4.5 EXTOLL PUT/GET operations and notification mechanism
(a) PUTs may generate up to two notifications (local requester, remote completer)
(b) GET operations may generate up to three notifications (local requester, local
completer, remote responder)
4.2.4 Network Protocol
The EXTOLL network protocol operates on cells as transmission units. Each cell is 64
bit (8 Byte) in size and a network packet consists of multiple cells. The maximum size
of one packet is limited by the network Maximum Transmission Unit (MTU), which
is fixed to 512 Byte or 512/8 = 64 cells in this work. Every packet is furthermore
preceded by a network descriptor which counts toward the MTU: 16 Byte for PUT, PUT
IMMEDIATE, and GET Response commands and 24 Byte for GETs. GET operations,
however, do not carry any payload and PUT IMMEDIATE commands transmit only
very little data. Neither transaction type is therefore limited by the MTU. PUT requests and
GET responses on the other hand may carry at most 512−16 = 496 Byte payload per
packet. This packet size limitation introduced by the network MTU also implies that a
single software descriptor with a maximum size of 8 MB may trigger multiple network
packets. The actual number is determined by splitting the software requested size into
496 Byte network packets. For each subsequent packet the initial destination target
address as provided by the software descriptor is incremented by the payload size of 496
Byte accordingly.
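The packetization rule can be written down in a few lines. The Python sketch below uses names chosen for this example and represents a packet simply as a pair of target address and payload length. It illustrates the address arithmetic only, not the RMA hardware.

```python
from typing import List, Tuple

MTU_BYTES = 512
PUT_DESCRIPTOR_BYTES = 16
MAX_PAYLOAD_BYTES = MTU_BYTES - PUT_DESCRIPTOR_BYTES   # 496 Byte per packet


def split_into_packets(target_addr: int, length: int) -> List[Tuple[int, int]]:
    """Split one software request into (target address, payload bytes) packets."""
    packets = []
    offset = 0
    while offset < length:
        payload = min(MAX_PAYLOAD_BYTES, length - offset)
        # Each packet targets the initial address plus the bytes already sent.
        packets.append((target_addr + offset, payload))
        offset += payload
    return packets


if __name__ == "__main__":
    for addr, size in split_into_packets(0x1000, 1000):
        print(hex(addr), size)
```

Since 496 is not a power of two, most packets of a long transfer start at target addresses that are not aligned to the payload size.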
Beyond the RMA, the EXTOLL network protocol frames each packet with two extra
cells. These cells (Start Of Packet (SOP) and End Of Packet (EOP)) contain additional
network information and ensure packet integrity by means of a Cyclic Redundancy
Check (CRC), which triggers a link retry mechanism if a corrupted packet is
received. SOP and EOP add another 16 Byte of overhead for a total packet
size of 512+16 = 528 Byte. With these numbers the RMA efficiency
can be calculated as:
\[ EFF_{RMA} = \frac{\text{Data Bytes}}{\text{Total Bytes}} = \frac{496}{512+16} = 93.9\,\% \tag{4.3} \]
which gives the maximum effective RMA bandwidth of:
\[ BW\_EFF_{RMA} = BW_{RMA} \cdot EFF_{RMA} = 80.64\ \mathrm{Gbps} \cdot 93.9\,\% = 75.75\ \mathrm{Gbps} = 9.47\ \mathrm{GB/s} \tag{4.4} \]
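The efficiency and the resulting bandwidth can be recomputed directly from the packet layout. The following Python sketch uses names chosen for this example and evaluates the ratio of payload bytes to total bytes on the link for a full-size packet.

```python
MTU_BYTES = 512            # network descriptor plus payload
DESCRIPTOR_BYTES = 16      # PUT / GET Response network descriptor
FRAMING_BYTES = 16         # SOP and EOP cells, 8 Byte each
BW_RMA_GBPS = 80.64


def rma_efficiency(payload_bytes: int = MTU_BYTES - DESCRIPTOR_BYTES) -> float:
    """Payload bytes divided by all bytes transmitted for one packet."""
    total_bytes = payload_bytes + DESCRIPTOR_BYTES + FRAMING_BYTES
    return payload_bytes / total_bytes


def effective_rma_bandwidth_gbps(payload_bytes: int = MTU_BYTES - DESCRIPTOR_BYTES) -> float:
    """Effective payload bandwidth for a stream of equally sized packets."""
    return BW_RMA_GBPS * rma_efficiency(payload_bytes)


if __name__ == "__main__":
    eff = rma_efficiency()
    bw = effective_rma_bandwidth_gbps()
    print(f"efficiency: {eff * 100:.1f} %")
    print(f"bandwidth:  {bw:.2f} Gbps = {bw / 8:.2f} GB/s")
```

Calling the functions with a smaller payload, e.g. `rma_efficiency(64)`, shows how quickly the fixed 32 Byte of descriptor and framing overhead dominate for short packets.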
4.2.5 Link Flow Control
Flow control between two EXTOLL links is handled via credits, which reflect both the
local retry buffer space and the input buffer space of the remote link partner. A 496
Byte RMA packet consumes four credits in total, one per 128 Byte payload. After the
packet has passed the remote input buffer these credits are returned in dedicated
flow control cells, freeing up the corresponding space in the local retry buffer.
The buffers were designed to accommodate payload for up to 128 credits, which are
shared among ten Virtual Channels (VCs). VCs can be used to create individual and
unrelated streams of traffic and to prioritize certain types of traffic, and are often used
to handle routing congestion. Some of these VCs are dedicated to specific traffic classes such
as broadcasts. Every single VC is assigned eight credits for exclusive use and the
remaining 48 are shared among all VCs on a first-come, first-served basis. Out of the
ten channels, four can be used for regular read/write commands. If packets were
distributed evenly on these four channels and no other traffic was flowing, a total of
4 · 8+48 = 80 credits would be available for reading and writing. For a single VC,
however, the maximum count is 56 (8 exclusive + 48 shared).
Throttling of the link performance occurs whenever too few or no credits are
available to transmit the next packet. This is the case when credits are consumed
faster than they are returned by the remote link partner or when shared credits have
been consumed by other VCs. In either event the flow control loop is too slow. A very
similar difficulty has been identified with the HMC token return loop time violation
in Section 3.1.6.2. A later section in this chapter will show how credits and the flow
control loop affect the NAM access performance, how this is currently handled in software,
and what needs to be done to improve the situation.
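The credit arithmetic can be made concrete with a small model. The Python sketch below is a simplification under stated assumptions: the class and method names are invented for this example, a VC is assumed to spend its exclusive credits before drawing from the shared pool, and returned credits are assumed to refill the shared pool first. None of these policies is taken from the EXTOLL implementation.

```python
TOTAL_CREDITS = 128
NUM_VCS = 10
EXCLUSIVE_PER_VC = 8
SHARED_CREDITS = TOTAL_CREDITS - NUM_VCS * EXCLUSIVE_PER_VC   # 48
CREDIT_BYTES = 128


def credits_for_payload(payload_bytes: int) -> int:
    """One credit per started 128 Byte of payload (496 Byte -> 4 credits)."""
    return -(-payload_bytes // CREDIT_BYTES)   # ceiling division


class CreditPool:
    def __init__(self) -> None:
        self.exclusive = [EXCLUSIVE_PER_VC] * NUM_VCS
        self.shared = SHARED_CREDITS
        self.borrowed = [0] * NUM_VCS   # shared credits currently held per VC

    def available(self, vc: int) -> int:
        return self.exclusive[vc] + self.shared

    def try_send(self, vc: int, payload_bytes: int) -> bool:
        """Consume credits for one packet; return False if the sender must stall."""
        need = credits_for_payload(payload_bytes)
        if need > self.available(vc):
            return False
        from_exclusive = min(need, self.exclusive[vc])   # assumed: exclusive first
        from_shared = need - from_exclusive
        self.exclusive[vc] -= from_exclusive
        self.shared -= from_shared
        self.borrowed[vc] += from_shared
        return True

    def release(self, vc: int, credits: int) -> None:
        """Credits returned by the link partner (assumed: shared pool refilled first)."""
        to_shared = min(credits, self.borrowed[vc])
        self.borrowed[vc] -= to_shared
        self.shared += to_shared
        self.exclusive[vc] += credits - to_shared


if __name__ == "__main__":
    pool = CreditPool()
    in_flight = 0
    while pool.try_send(0, 496):
        in_flight += 1
    print("packets in flight on a single VC:", in_flight)
```

In this model a single VC can keep 56/4 = 14 full-size packets in flight before it stalls, and a VC that has drained the shared pool leaves only the eight exclusive credits to each of the other channels.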
4.2.6 EMP: Network Discovery and Setup
EXTOLL devices can be connected in many different ways to create commonly used
mesh and torus topologies or individual non-standard topologies. In any case, the routing
tables of all network devices must be set up initially. The EXTOLL Management Program
(EMP) supports two types of network setup modes: discovery and topology file based
configuration. In discovery mode all EXTOLL links that show an active connection
are scanned and a topology is automatically created. Topology file based configuration
on the other hand may be used to verify that all devices are properly connected and
the desired topology was successfully created. In either of the network setup modes,
EMP assigns a unique identifier (Node ID) to every EXTOLL NIC and calculates and
sets the routing table entries according to the desired routing scheme. Eventually all
nodes are marked as active, which unlocks the network for software usage.
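The general idea of discovery followed by routing table setup can be sketched with a breadth-first scan. The Python example below is a generic illustration and not the EMP algorithm: the function name, the input format (a dictionary of active links), the numbering of Node IDs in discovery order, and the shortest-path next-hop routing are all assumptions made for this sketch.

```python
from collections import deque
from typing import Dict, List, Tuple


def discover_and_route(
    links: Dict[str, List[str]], root: str
) -> Tuple[Dict[str, int], Dict[str, Dict[int, str]]]:
    """links maps each device to the devices reachable over its active links."""
    # Step 1: discovery. Scan active links breadth-first and number the devices.
    node_ids = {root: 0}
    queue = deque([root])
    while queue:
        device = queue.popleft()
        for neighbor in links[device]:
            if neighbor not in node_ids:
                node_ids[neighbor] = len(node_ids)
                queue.append(neighbor)

    # Step 2: routing. For every source, record the first hop on a shortest path
    # to each destination Node ID.
    routing: Dict[str, Dict[int, str]] = {}
    for source in node_ids:
        next_hop: Dict[int, str] = {}
        visited = {source}
        frontier = deque()
        for neighbor in links[source]:
            if neighbor not in visited:
                visited.add(neighbor)
                next_hop[node_ids[neighbor]] = neighbor
                frontier.append((neighbor, neighbor))
        while frontier:
            device, first_hop = frontier.popleft()
            for neighbor in links[device]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_hop[node_ids[neighbor]] = first_hop
                    frontier.append((neighbor, first_hop))
        routing[source] = next_hop
    return node_ids, routing


if __name__ == "__main__":
    ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
    ids, tables = discover_and_route(ring, "a")
    print(ids)
    print(tables["a"])
```

A topology file based setup would replace step 1 with a comparison of the scanned links against the expected ones.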
4.3 NAM Hardware
The following section presents the NAM hardware prototype and functional modules
implemented in the FPGA. In order to estimate performance as early as possible in
the design process and to avoid unexpected bottlenecks, design decisions and their
potential impact on the achievable performance are evaluated.
4.3.1 Requirements
The first step in developing a new hardware device is to define its requirements. A
clear view of the physical interfaces and an understanding of their impact on the FPGA
design is mandatory and allows a prototype to be developed early in the design phase. This
lowers the risk of delays due to potential PCB manufacturing and bring-up issues. The
following physical and logical requirements for the NAM have been identified based on
the DEEP-ER use case, the available components such as the HMC, the FPGA, and the
EXTOLL interconnect, and the physical size and form factor.
4.3.1.1 Components and Connectors
EXTOLL A Samtec HDI-6 connector as physical interface to connect up to two
EXTOLL NICs with 12 lanes per link.
PCIe The PCIe edge card connector is used to power the NAM board and allows
easy integration into commodity systems. The connector could also be used to
establish host connectivity for management and/or data transport.
HMC-1 One or more HMC links, desirably in a configuration that matches or
outperforms the total EXTOLL Link bandwidth (for available HMC link configurations
see Section 3.1.3).
HMC-2 A high-speed connector that interfaces one additional HMC link provides the
ability to chain HMCs to increase the memory capacity.
RAS Advanced RAS (Reliability, Availability and Serviceability) features require a
physical programming and debug interface.
FPGA A suitable FPGA must provide enough resources and I/O capability to
implement processing elements and modules for the physical interfaces described
above. In particular, the total high-speed transceiver count must accommodate all
types of serial interfaces (PCIe, EXTOLL, HMC).
4.3.1.2 FPGA Design Functional Units
EXTOLL Link FPGA implementation The original EXTOLL Link is implemented
with a 128 bit datapath in the ASIC Tourmalet at 630 MHz. The
source code for this link was provided by EXTOLL. In order to maintain link
throughput while keeping the clock frequency within a reasonable region for an
FPGA implementation, the link must be extended to support a wider datapath.
The implementation must be able to support the maximum EXTOLL Link width
and speed.
RMA compatible unit The NAM must be able to communicate with the native
EXTOLL RMA unit. A compatible unit implements the required subset of RMA
functions and accounts for a wider datapath.
HMC host controller The development of the HMC host controller openHMC was
already discussed in Section 3.2.
RMA to HMC (and reverse) protocol converter The most basic requirement,
reading and writing to the NAM, demands a module that converts EXTOLL
RMA network packets to HMC transactions and vice versa.
RAS One module to provide remote register ﬁle conﬁguration and monitoring over
EXTOLL. A second module grants RAS access over an external debug connector.
CR The CR unit required to carry out the DEEP-ER resiliency features. It will be
discussed in Section 4.5.
4.3.2 Prototype ’Aspin-v2’
Figure 4.6 depicts the NAM hardware prototype Aspin-v2, developed as a standard
height PCIe form factor PCB. The Xilinx Virtex 7 FPGA utilizes 16 lanes at 10 Gbps
to connect a 2 GB HMC. An additional 16 lanes are connected to the 16x PCIe edge card
connector. Although the maximum link width of the Virtex 7 PCIe hard-IP² blocks
is 8x, the eight additional lanes can be useful if the connector is used in a proprietary
manner. It is also possible to set up a 16x PCIe link without using the hard-IP block. In this case,
however, PCIe Gen3 (8 Gbps per lane) will not meet timing in the FPGA, limiting the
capability of the FPGA PCIe core to Gen2 (5 Gbps) or even Gen1 (2.5 Gbps)³. The
set of high-speed connections to the FPGA is complemented by two 12x links on the
HDI-6 connector used to connect EXTOLL NICs.
The total transceiver count of 56 (16 PCIe + 16 HMC + 24 EXTOLL) narrowed down
the number of usable FPGAs from the Virtex 7 device family. Eventually the V7 690T,
the second smallest device with at least 56 transceivers, was chosen as a trade-off
between logic cells and cost.
² FPGAs typically provide several fixed (hardened) logical blocks that implement specific functions
such as a PCIe endpoint/root-port complex. Hardened IP is superior to functions implemented
with standard registers and LUTs regarding achievable performance.
³ Besides the actual lane speeds, PCIe Gen3 uses an improved lane encoding which increases the
effective bandwidth per lane to ≈98 % compared to 80 % in previous generations.
Fig. 4.6 NAM Prototype Board ’Aspin-v2’
(labeled components: Virtex 7 690T FPGA, 2GB HMC, HDI 6 EXTOLL connector, HMC addon
connector, I2C, LEDs, PCIe connector and power supply)
A second HMC link is exposed to a dedicated connector which can be used to attach
additional HMCs to increase the capacity (chaining, see Section 3.1.3). RAS features
can be carried out through dedicated I2C (Inter-Integrated Circuit) and JTAG (Joint
Test Action Group) connectors. A set of general purpose LEDs is free to use. Power is
supplied via the PCIe connector and the required voltages are generated by on-board
power regulators. A ﬂash memory chip stores the FPGA conﬁguration so that it does
not need to be reprogrammed upon a power cycle.
FPGA and HMC are supplied by a single oscillator and clock distribution network. As
stated in Section 4.2, EXTOLL Link lane speeds were reduced for technical reasons.
This happened after the NAM prototype was already built. To maintain interoperability
with the clocking infrastructure of the Xilinx GTH transceivers, the former 125 MHz
shared reference clock was increased to 127.273 MHz. This option provides the least
signiﬁcant change in the clocking infrastructure and leads to a static multiplier FMULT :
\[ F_{\mathrm{MULT}} = \frac{127.273\ \text{MHz}}{125\ \text{MHz}} = 1.018184 \tag{4.5} \]
The increased reference clock results in overclocking the HMC link as the HMC
internally uses a ﬁxed multiplier. The following paragraphs also describe the impact of
this design change on other modules.
[Figure: block diagram of the Network Partition, NAM Partition, and HMC. Block labels: two 512 bit EXTOLL FPGA Links, EXTOLL Link MUX, Network Transaction Layer (NTL) with NTL Completer (4x Tag Map), NTL Responder (2x Tag Map), Notifications, and Register File Access, RF, CR Logic with CR FIFO, HMC Transaction Layer (HTL) with RMA to HMC Converter / Packet Split, Data FIFOs, ARB, DEMUX, HTL TX, HTL RX, Packet Serializer, Packet Aligner, Reorder Buffer, and Recombine, openHMC. Clock labels: clk_extoll (200 MHz), clk_hmc (312.5 MHz), clk_cr (230 MHz)]
Fig. 4.7 NAM FPGA design block diagram: The design is partitioned into HMC, NAM,
and EXTOLL functional layers
4.3.3 FPGA Design Partitions
The NAM FPGA design is depicted in Figure 4.7. It is divided into three main
partitions: Network, NAM, and HMC. The network partition integrates two 512 bit
EXTOLL FPGA links connected to the NAM logic via a Multiplexer (MUX). Note
that this MUX does not implement any routing and will not forward packets from one
link to another. Both links are therefore EXTOLL endpoints and only packets that
target the NAM as ﬁnal destination may be received. The NAM partition translates
EXTOLL to HMC packets (and vice versa) and provides RF access to remote processes
via EXTOLL RRA (Remote Register File Access). It also integrates the CR unit which
will be described in Section 4.5. Finally, the HMC partition integrates the openHMC
controller and an autonomous HMC conﬁguration module.
Many of the modules also provide a set of registers. These are embedded in a hierarchy
of Register Files and allow design control and monitoring at runtime, accessible via
RRA or the physical I2C connector.
Figure 4.7 also identiﬁes the three main clock domains. clk_hmc is a 318.1825 MHz
clock derived from the HMC link conﬁguration, based on a 312.5 MHz clock multiplied
by FMULT (see Section 4.3.3.1), so that the throughput of the 512 bit datapath matches
the HMC link bandwidth. The openHMC speciﬁcation states that a connected user
application must operate at the frequency of clk_hmc or faster. To avoid additional
clock domain crossings and as it is unlikely that the NAM logic will meet timing
constraints for even faster clocks it is also sourced from clk_hmc. The second main
clock domain is clk_extoll which drives the logic of both EXTOLL links. Although
there is no restriction on the frequency of this clock it will be shown that it has a major
impact on performance. The third clock domain is clk_cr which drives all CR related
parts of the design. Another clock domain crossing at this point became necessary
as it turned out that the CR logic could not be implemented with a clock as fast as
clk_hmc.
This section presents the individual design partitions and theoretically evaluates
bandwidth characteristics based on design decisions. Several potential bottlenecks
will be identiﬁed which will help to interpret the in-system measurements provided in
Chapter 5. As a naming convention, packets traveling from the network to the NAM
are referred to as requests, while responses denote packets traveling in the opposite direction.
4.3.3.1 HMC Partition / openHMC
The HMC partition integrates the openHMC host controller with a 512 bit user interface
and a full-width (16x), 10 Gbps HMC link. In fact, through overclocking, the actual
speed per lane is 10 Gbps ·FMULT = 10.18184 Gbps. Based on Equation (3.3) the
resulting operating frequency clk_hmc is calculated with:
\[ \mathrm{clk\_hmc} = \frac{16\ \text{lanes} \cdot 10.18184\ \text{Gbps}}{512\ \text{bit}} = 0.3181825\ \text{GHz} = 318.1825\ \text{MHz} \tag{4.6} \]
Using the unidirectional HMC bandwidth of 17.7 GB/s (see Section 3.3.4) and the
multiplier FMULT the new HMC read or write bandwidth BWHMC is:
\[ BW_{\mathrm{HMC}} = 17.7\ \text{GB/s} \cdot 1.018184 = 18.02\ \text{GB/s} \tag{4.7} \]
It is the theoretical peak bandwidth for 128 Byte HMC read or write packets with a
sequential access pattern.
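The clock and bandwidth figures of Equations (4.5) to (4.7) can be reproduced with a few lines of Python. This is only an illustrative sketch: the constant and function names are chosen here for readability and are not part of the NAM design.

```python
# Values taken from the text; all names are illustrative assumptions.
REF_CLK_OLD_MHZ = 125.0     # former shared reference clock
REF_CLK_NEW_MHZ = 127.273   # increased shared reference clock
HMC_LANES = 16              # full-width HMC link
NOMINAL_LANE_GBPS = 10.0    # nominal HMC lane speed
DATAPATH_BITS = 512         # openHMC user interface width
NOMINAL_BW_GBYTE = 17.7     # unidirectional HMC bandwidth (Section 3.3.4)


def overclock_factor():
    """Static multiplier F_MULT caused by the increased reference clock."""
    return REF_CLK_NEW_MHZ / REF_CLK_OLD_MHZ


def clk_hmc_mhz():
    """User clock at which the 512 bit datapath matches the link rate."""
    lane_gbps = NOMINAL_LANE_GBPS * overclock_factor()
    return HMC_LANES * lane_gbps / DATAPATH_BITS * 1e3  # GHz -> MHz


def hmc_bandwidth_gbyte():
    """Theoretical peak HMC read or write bandwidth after overclocking."""
    return NOMINAL_BW_GBYTE * overclock_factor()


if __name__ == "__main__":
    print(f"F_MULT  = {overclock_factor():.6f}")
    print(f"clk_hmc = {clk_hmc_mhz():.4f} MHz")
    print(f"BW_HMC  = {hmc_bandwidth_gbyte():.2f} GB/s")
```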
4.3.3.2 Network Partition / EXTOLL FPGA Link
The EXTOLL FPGA link has been derived from the native EXTOLL ASIC link
implementation which is based on a 128 bit datapath. The nominal EXTOLL ASIC
operating frequency is 630 MHz for a throughput of 128 bit ·630 MHz= 80.64 Gbps.
To match the throughput at a reasonable frequency in the FPGA the datapath-width
had to be increased to 512 bit. Although a 256 bit datapath would have been a feasible
choice as well it would not integrate seamlessly with the remaining NAM logic. The
datapath-width here is dictated by the conﬁguration of the openHMC controller and
has been set to 512 bit. A uniﬁed datapath-width throughout all modules considerably
simpliﬁes logic design. The resulting minimum core clock frequency of the EXTOLL
FPGA link to support the RMA bandwidth of 80.64 Gbps is calculated with:
\[ \mathrm{clk\_extoll}_{\mathrm{min}} = \frac{\text{throughput}}{\text{datapath-width}} = \frac{80.64\ \text{Gbps}}{512\ \text{bit}} = 157.5\ \text{MHz} \tag{4.8} \]
At first glance it seems sufficient to set clk_extoll to the minimum required frequency.
However, the performance measurements conducted in Chapter 5 will reveal a correlation
between clk_extoll and the overall NAM performance, with clk_extollmin
performing the worst. This behavior is associated with the EXTOLL network protocol
flow control features, which support a retransmission scheme when errors are detected on
the serial link. Such link integrity features are common practice in serial link protocols.
The EXTOLL Link parameters including the size of the retry buﬀer were tailored for
the ASIC and hence dimensioned to operate on a 128 bit datapath at a frequency of
630 MHz. The relatively low frequency of the 512 bit link implementation implies that
packets and credits in the NAM are processed much slower than in the ASIC. The
number of credits for reading from and writing to the NAM is ﬁxed to a maximum of 80
using all four available Virtual Channels and 58 on one channel. Given this fixed number of
credits, the only way to mitigate bandwidth throttling is to increase
the frequency of clk_extoll in the NAM. Increasing the frequency, however, significantly
complicates placement, routing, and timing closure in the FPGA. Eventually the link
logic was successfully implemented at 200 MHz with clean timing which lowered the
negative impact of the issues mentioned above.
clk_extoll = 200 MHz (4.9)
The diﬃculty with insuﬃcient credits becomes even worse in response direction, with
traﬃc ﬂowing from the NAM to an EXTOLL ASIC. Due to an unintended limitation
in the ASIC, the maximum credit count per Virtual Channel the NAM can use to send
traﬃc is 31.
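The relation between datapath width, clock frequency, and raw throughput used in Equation (4.8) can be sketched as follows. The helper names are hypothetical, and the sketch only covers the raw datapath throughput; it does not model credits, the retry buffer, or any other flow control effect discussed above.

```python
# Values from the text; function and constant names are illustrative only.
ASIC_DATAPATH_BITS = 128
ASIC_CLOCK_MHZ = 630.0
FPGA_DATAPATH_BITS = 512
FPGA_CLOCK_MHZ = 200.0   # implemented clk_extoll


def throughput_gbps(width_bits, clock_mhz):
    """Raw datapath throughput in Gbps for a given width and clock."""
    return width_bits * clock_mhz / 1e3


def min_clock_mhz(target_gbps, width_bits):
    """Lowest clock (MHz) at which the datapath reaches the target rate."""
    return target_gbps * 1e3 / width_bits


asic_gbps = throughput_gbps(ASIC_DATAPATH_BITS, ASIC_CLOCK_MHZ)
clk_extoll_min = min_clock_mhz(asic_gbps, FPGA_DATAPATH_BITS)
fpga_gbps = throughput_gbps(FPGA_DATAPATH_BITS, FPGA_CLOCK_MHZ)

if __name__ == "__main__":
    print(f"ASIC link throughput : {asic_gbps:.2f} Gbps")
    print(f"clk_extoll (minimum) : {clk_extoll_min:.1f} MHz")
    print(f"FPGA link at 200 MHz : {fpga_gbps:.1f} Gbps raw")
```

At 200 MHz the raw datapath rate exceeds the ASIC link rate, which is the headroom used to process packets and credits faster than at the minimum frequency.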
Gearbox, Alignment, and Flow Control
Apart from the operating frequency, the 512 bit link has additional negative side-
eﬀects on the maximum achievable bandwidth with an EXTOLL ASIC link partner.
Section 4.2 introduced the gearbox that is used to pass data from a 128 bit functional
unit to the 192 bit link in the EXTOLL ASIC. Similarly, the 512 bit link implements a
gearbox that passes data from the 768 bit link layer (three quads with 256 bit parallel
data each, 4 lanes per quad with 64 bit per lane) on the receiving side. The result
is again a 3-stage iterative process. For the sake of design simplicity, it is required
that packets start at a 512 bit / 64 Byte boundary so that the very ﬁrst cell (SOP) is
seen starting at bit position 0 in a parallel 512 bit cycle. It is the responsibility of the
sending side to ensure that packets meet this requirement. Therefore, the EXTOLL
ASIC gearbox will issue ﬁller cells up to the next 64 Byte boundary whenever a packet
including protocol overhead is not a multiple of 64 Byte.
The use of ﬁller cells for packet alignment limits the eﬀective RMA bandwidth in the
EXTOLL ASIC. According to Section 4.2.4 the maximum RMA packet size is 528
Byte of which 496 Byte contain payload. The ASIC gearbox now appends additional
ﬁller cells up to the next 64 Byte boundary, which is 576 in this case. This leads to
the RMA packet eﬃciency EFFRMA_PKT of:
\[ \mathrm{EFF}_{\mathrm{RMA\_PKT}} = \frac{\text{Data Bytes}}{\text{Total Bytes}} = \frac{496}{576} = 86.1\,\% \tag{4.10} \]
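The filler-cell padding behind Equation (4.10) amounts to rounding the packet size up to the next alignment boundary. A minimal sketch, with hypothetical helper names, could look like this:

```python
# Packet sizes from Section 4.2.4 as quoted in the text; names are illustrative.
MAX_RMA_PACKET_BYTES = 528   # largest RMA packet including protocol overhead
MAX_RMA_PAYLOAD_BYTES = 496  # payload carried by that packet
FPGA_BOUNDARY_BYTES = 64     # 512 bit alignment required by the FPGA link


def padded_size(packet_bytes, boundary_bytes):
    """Round a packet size up to the next multiple of the boundary."""
    return -(-packet_bytes // boundary_bytes) * boundary_bytes


def packet_efficiency(payload_bytes, packet_bytes, boundary_bytes):
    """Fraction of the transmitted bytes that carry payload."""
    return payload_bytes / padded_size(packet_bytes, boundary_bytes)


if __name__ == "__main__":
    total = padded_size(MAX_RMA_PACKET_BYTES, FPGA_BOUNDARY_BYTES)
    eff = packet_efficiency(MAX_RMA_PAYLOAD_BYTES, MAX_RMA_PACKET_BYTES,
                            FPGA_BOUNDARY_BYTES)
    print(f"padded size: {total} Byte, efficiency: {eff:.1%}")
```

The same helper shows why a 16 Byte boundary is cheaper for this packet: 528 Byte is already a multiple of 16, so no filler cells are needed.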
Packets also consume credits, four in total for a full-sized RMA-to-NAM write packet
and one per read request. Likewise, a full-sized RMA GET Response will utilize
four credits on the EXTOLL Link in the NAM. These credits must be returned by
the remote link partner so that they eventually can be reused to transmit additional
packets. Dedicated credit cells are generated and sent to the former source node.
The threshold for the number of credits at which a credit cell is generated is conﬁgurable
and has been set to 10 credits on the NAM. This means the NAM will create a credit cell
for every 2.5 full-sized RMA write requests, and for every 10 RMA reads it has received.
These cells do not aﬀect the request bandwidth for traﬃc ﬂowing from EXTOLL to the
NAM as credit cells travel in the opposite direction. Every received credit cell, however,
must be acknowledged by the local link (acknowledge cells may also carry credits that need
to be returned, and credit cells may likewise implement acknowledge counters). This acknowledge
is an eight Byte packet which is again subject to packet alignment boundaries and occupies a full 64 Byte
of the transmission bandwidth. For writing, acknowledge cells add 64 Byte / 2.5 = 25.6 Byte
overhead per packet, which results in an actual total bytecount of 576 + 25.6 = 601.6
Byte per request. Given these results the actual RMA write efficiency for requests to
the NAM can be derived with:
\[ \mathrm{EFF}_{\mathrm{RMA\_REQ}} = \frac{\text{Data Bytes}}{\text{Total Bytes}} = \frac{496}{601.6} = 82.4\,\% \tag{4.11} \]
Using EFFRMA_REQ the maximum eﬀective link RMA request bandwidth
BW_EFFRMA_REQ is:
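\[ \begin{aligned} \mathrm{BW\_EFF}_{\mathrm{RMA\_REQ}} &= BW_{\mathrm{RMA}} \cdot \mathrm{EFF}_{\mathrm{RMA\_REQ}} = 80.64\ \text{Gbps} \cdot 82.4\,\% \\ &= 66.48\ \text{Gbps} = 8.31\ \text{GB/s} \end{aligned} \tag{4.12} \]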
Read responses that return to the local EXTOLL device must also be acknowledged.
To reduce the amount of overhead at this point the EXTOLL Link is able to pack
several acknowledgments into a single cell. To approximate the performance, it is
assumed that credits are also embedded with acknowledge cells traveling back to the
NAM.
Read requests to the NAM, on the other hand, will generate a credit cell for every 10
request packets. This adds an average of 64 Byte / 10 = 6.4 Byte overhead per packet in response
direction caused by credit cells, and the same amount of overhead in request direction
used for acknowledging these. Given this additional overhead the total bytecount is
576+6.4 = 582.4 for a packet traveling from the NAM to an ASIC. Hence the NAM to
ASIC response eﬃciency EFFRMA_RSP is:
\[ \mathrm{EFF}_{\mathrm{RMA\_RSP}} = \frac{\text{Data Bytes}}{\text{Total Bytes}} = \frac{496}{582.4} = 85.1\,\% \tag{4.13} \]
Using EFFRMA_RSP the actual maximum eﬀective link RMA response bandwidth
BW_EFFRMA_RSP is:
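\[ \begin{aligned} \mathrm{BW\_EFF}_{\mathrm{RMA\_RSP}} &= BW_{\mathrm{RMA}} \cdot \mathrm{EFF}_{\mathrm{RMA\_RSP}} = 80.64\ \text{Gbps} \cdot 85.1\,\% \\ &= 68.67\ \text{Gbps} = 8.58\ \text{GB/s} \end{aligned} \tag{4.14} \]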
Note that communication between two EXTOLL ASICs must be aligned likewise, with
reduced packet boundaries at 128 bit / 16 Byte. Therefore, no ﬁller cells are applied
for packets that come as a multiple of 16 Bytes such as the largest RMA packet and
for all other packets the overhead of alignment is signiﬁcantly lowered.
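The flow control overhead that enters Equations (4.11) to (4.14) can be sketched as below. The sketch assumes, as the text does, that each credit or acknowledge cell occupies one aligned 64 Byte slot and that a cell is due every 10 credits; all names are hypothetical. The results agree with the quoted figures up to rounding in the last digit.

```python
# Parameters quoted in the text; names are illustrative assumptions.
LINK_GBPS = 80.64          # nominal RMA link bandwidth
PAYLOAD_BYTES = 496        # payload of a full-sized RMA packet
PADDED_BYTES = 576         # full-sized packet after 64 Byte alignment
CELL_SLOT_BYTES = 64       # aligned credit/acknowledge cell
CREDIT_THRESHOLD = 10      # credits collected before a credit cell is sent
CREDITS_PER_WRITE = 4      # full-sized RMA-to-NAM write packet
CREDITS_PER_READ = 1       # RMA read request


def flow_control_overhead(credits_per_packet):
    """Average flow control bytes charged to one packet."""
    packets_per_cell = CREDIT_THRESHOLD / credits_per_packet
    return CELL_SLOT_BYTES / packets_per_cell


def link_efficiency(credits_per_packet):
    """Payload share of padded packet plus flow control overhead."""
    total = PADDED_BYTES + flow_control_overhead(credits_per_packet)
    return PAYLOAD_BYTES / total


def bandwidth_gbyte(efficiency):
    """Effective link bandwidth in GB/s."""
    return LINK_GBPS * efficiency / 8


if __name__ == "__main__":
    for name, credits in (("request", CREDITS_PER_WRITE),
                          ("response", CREDITS_PER_READ)):
        eff = link_efficiency(credits)
        print(f"{name}: {eff:.2%}, {bandwidth_gbyte(eff):.2f} GB/s")
```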
There are two ways to alleviate the impact of the gearbox. First, a smaller datapath
would reduce the packet boundaries and alignment overhead. The decision for a 512
bit link, however, was made for good reason: it seamlessly integrates with the rest of
the design. Second, a link layer design that operates on the datapath-width of the
functional units or an integer multiple of it (e.g. 256 bit link and 128 bit functional unit)
would significantly reduce the interface complexity.
4.3.3.3 Network Partition / EXTOLL Link MUX
The EXTOLL Link MUX connects both EXTOLL links to the NAM layer and acts as
clock domain crossing from clk_extoll to the faster clk_hmc clock domain. The clock
domain transition is realized with asynchronous buﬀers, one per link and direction.
The actual switching between links is then performed with the speed of clk_hmc to
eliminate the bandwidth of a single EXTOLL Link as bottleneck at this point. The
theoretical link MUX bandwidth in both directions, request and response, is linked to
the datapath-width, the operating frequency clk_hmc, and the RMA packet eﬃciency
EFFRMA_PKT (not the actual link RMA eﬃciency as ﬂow control cells were removed
already). It is calculated with:
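\[ \begin{aligned} \mathrm{BW\_EFF}_{\mathrm{MUX}} &= \mathrm{clk\_hmc} \cdot \text{datapath-width} \cdot \mathrm{EFF}_{\mathrm{RMA\_PKT}} \\ &= 318.1825\ \text{MHz} \cdot 512\ \text{bit} \cdot 86.1\,\% \\ &= 140.2\ \text{Gbps} = 17.54\ \text{GB/s} \end{aligned} \tag{4.15} \]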
In comparison the combined EXTOLL bandwidth that two links can deliver for requests
is:
\[ \mathrm{BW\_EFF}_{\mathrm{RMA\_REQ\_TWO\_LINKS}} = 2 \cdot \mathrm{BW\_EFF}_{\mathrm{RMA\_REQ}} = 2 \cdot 8.31\ \text{GB/s} = 16.62\ \text{GB/s} \tag{4.16} \]
And for responses:
\[ \mathrm{BW\_EFF}_{\mathrm{RMA\_RSP\_TWO\_LINKS}} = 2 \cdot \mathrm{BW\_EFF}_{\mathrm{RMA\_RSP}} = 2 \cdot 8.58\ \text{GB/s} = 17.16\ \text{GB/s} \tag{4.17} \]
It can be seen that the link MUX is fast enough to keep up with the performance of
both EXTOLL links in either direction. However, it will be shown that the aggregate
bandwidth of two EXTOLL links outperforms the capabilities of the subsequent NAM
logic units.
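The comparison between the MUX and the two EXTOLL links in Equations (4.15) to (4.17) can be checked numerically. The following sketch uses hypothetical names and takes the single-link bandwidths from Equations (4.12) and (4.14) as given inputs.

```python
# Inputs quoted in the text; names are illustrative assumptions.
CLK_HMC_MHZ = 318.1825
DATAPATH_BITS = 512
EFF_RMA_PKT = 496 / 576        # packet efficiency, Equation (4.10)
LINK_REQ_GBYTE = 8.31          # single link, request direction
LINK_RSP_GBYTE = 8.58          # single link, response direction


def mux_bandwidth_gbyte():
    """Effective MUX bandwidth in GB/s for full-sized RMA packets."""
    gbps = CLK_HMC_MHZ * DATAPATH_BITS * EFF_RMA_PKT / 1e3
    return gbps / 8


def two_link_bandwidth_gbyte(single_link_gbyte):
    """Aggregate bandwidth of both EXTOLL links."""
    return 2 * single_link_gbyte


if __name__ == "__main__":
    mux = mux_bandwidth_gbyte()
    for name, link in (("request", LINK_REQ_GBYTE),
                       ("response", LINK_RSP_GBYTE)):
        both = two_link_bandwidth_gbyte(link)
        print(f"{name}: two links {both:.2f} GB/s, "
              f"MUX {mux:.2f} GB/s, MUX sufficient: {mux >= both}")
```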
4.3.3.4 NTL - Network Transaction Layer
The Network Transaction Layer (NTL) connects the EXTOLL links via the link MUX
to the NAM logic and operates in the clk_hmc clock domain. It decodes and distributes
incoming RMA packets targeting HMC as read/write request, the RF for conﬁguration
or maintenance, or the NAM CR logic. It is the counterpart of the EXTOLL ASIC
RMA unit. For read requests, tag maps are used to retain information that is required
to generate corresponding responses. Packets are processed in cut-through mode, i.e.
data cycles are immediately forwarded to the next layer. This is opposed to store and
forward, where all cycles that belong to a packet are collected ﬁrst and then forwarded.
Cut-through was chosen to enable subsequent layers to receive data faster instead
of waiting for an entire packet to become available by the network partition which
operates with the relatively slow clk_extoll. Obviously the forwarding mode is only
relevant for requests that spread over more than 1 parallel cycle which aﬀects write
requests larger than 48 Byte (16 Byte network descriptor + 48 Byte payload in a 512
bit cycle). The 8 Byte SOP cell preceding the network descriptor is initially removed
by the NTL. A full-sized RMA packet that carries 496 Byte payload therefore stretches
over a total of eight cycles where the ﬁrst cycle contains 48 Byte payload (+16 Byte
protocol overhead) followed by seven cycles with 64 Byte payload each.
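The cycle layout described above (16 Byte network descriptor plus up to 48 Byte payload in the first cycle, 64 Byte payload in every further cycle) can be expressed as a small helper. The function name is hypothetical and the sketch only counts payload bytes per 512 bit cycle after the SOP cell has been removed.

```python
CYCLE_BYTES = 64        # one 512 bit parallel cycle
DESCRIPTOR_BYTES = 16   # network descriptor occupying part of the first cycle


def ntl_cycle_payloads(payload_bytes):
    """Payload bytes carried by each cycle of one RMA write packet."""
    if payload_bytes <= 0:
        raise ValueError("payload must be positive")
    first = min(payload_bytes, CYCLE_BYTES - DESCRIPTOR_BYTES)
    payloads = [first]
    remaining = payload_bytes - first
    while remaining > 0:
        chunk = min(remaining, CYCLE_BYTES)
        payloads.append(chunk)
        remaining -= chunk
    return payloads


if __name__ == "__main__":
    layout = ntl_cycle_payloads(496)
    print(f"{len(layout)} cycles: {layout}")
```

Only packets whose layout has more than one entry are affected by the choice between cut-through and store-and-forward.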
Every packet is also subject to a variety of checks. These checks increase the NAM's robustness
against incorrect usage by applications or the EMP. Incorrect accesses may include the use
of commands other than mentioned in the description of the EXTOLL FPGA link
or packets that do not target the NAM. Especially in the initial bring-up phase of
the FPGA design and software components it is essential to rather catch exceptions
than to risk unexpected behavior. Consequently, the NAM drops any packets out of
speciﬁcation and leaves some debug information in its Register File. The NAM access
granularity has been set to 16 Byte to match the granularity of the HMC protocol.
This decision reduces design complexity by eliminating various corner cases for packet
and address translation between the EXTOLL and HMC protocol.
The NTL strips the packet SOP and otherwise immediately forwards any incoming
cycles to the next stage. Its theoretical bandwidth is equal to the capability of the
MUX:
\[ \mathrm{BW\_EFF}_{\mathrm{NTL}} = \mathrm{BW\_EFF}_{\mathrm{MUX}} = 17.54\ \text{GB/s} \tag{4.18} \]
[Figure: sequence diagrams between the CPU process, EXTOLL (requester/completer), the NAM FPGA logic, and the HMC for (a) a PUT operation and (b) a GET operation]
(a) PUT operations may generate up to two notifications. A local requester notification
when data has been sent, and another when the data has successfully passed packet
checks in the NAM NTL
(b) GET operations may generate up to three notifications. A local requester notification
when the GET request has been sent, a second when the request has successfully
passed packet checks in the NAM NTL, and a final notification when the requested data
has been placed in local memory
Fig. 4.8 NAM/EXTOLL notification mechanism for PUT and GET operations
Notiﬁcations
The concept of notiﬁcations which can be used to inform processes of certain events
has been introduced in Section 4.2.3. The NAM supports this notiﬁcation mechanism,
with the following two modiﬁcations as depicted in Figure 4.8: For PUT operations,
setting the completer notification bit will not generate any notification on the NAM as
there is no actual processor present. Instead, a notification directed to the requesting
process will be sent once the packet has been accepted at the NTL and has passed integrity
checks. Similarly, such a notification can be generated when a GET request has been
processed in the NTL. Such notiﬁcations can be used to ease synchronization between
processes that share a common address space on the NAM.
4.3.3.5 HTL - HMC Transaction Layer
The HMC Transaction Layer (HTL) connects the NTL and CR logic to the HMC
partition and converts from the RMA protocol to HMC and vice versa. Several
properties of the various packet types complicate this protocol conversion. The
following section analyzes these diﬃculties and presents the implemented translation
units. The request direction is examined ﬁrst.
Requests
The HTL receives packets from either the NTL or the CR functional unit and converts
these to HMC packets. Protocol conversion at this point is non-trivial as HMC packets
must meet the following requirements:
The maximum packet size is 128 Byte: The largest HMC packet that may be
transmitted is 128 Byte, whereas an RMA packet can carry up to 496 Byte payload.
Hence, a single RMA transfer may trigger several HMC packets.
The memory access granularity is 16 Byte: The HMC protocol defines requests
with a granularity of 16 Byte and packet sizes ranging from 16 to 128 Byte.
Although HMC preserves Byte access using BIT WRITE commands, these are
not supported by the HTL to keep the complexity and corner cases of packet
conversion at a minimum. This limitation also forces the use of 16 Byte aligned
addresses which must be handled in software.
Destination address plus bytecount must not cross a 128 Byte boundary:
The HMC memory arrays are internally organized in 128 Byte blocks. An issue
arises when a request targets an address offset other than zero and the number
of Bytes to be read or written would cross a logical 128 Byte boundary. Such an
access would cause a wraparound within the block and wrong data would be
returned or false memory locations overwritten. This is depicted in Figure 4.9.
Hence, block-boundary crossing must be avoided in any case. Together with the
fact that larger RMA packets must be split regardless, this further
complicates the protocol conversion.
Overall, the requirements mentioned above not only complicate protocol conversion
but also negatively affect the achievable bandwidth.
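The three requirements can be summarized as a simple validity check. The following sketch uses hypothetical function names; the wraparound helper is a simplified model of the behavior described for Figure 4.9 (addresses wrap within the 128 Byte block), not a description of the HMC internals.

```python
BLOCK_BYTES = 128        # HMC block size and maximum packet payload
GRANULARITY_BYTES = 16   # HMC access granularity


def crosses_block_boundary(address, nbytes):
    """True if address plus bytecount would leave the 128 Byte block."""
    return address % BLOCK_BYTES + nbytes > BLOCK_BYTES


def is_valid_hmc_request(address, nbytes):
    """Check size, alignment, and block-boundary requirements."""
    aligned = (address % GRANULARITY_BYTES == 0
               and nbytes % GRANULARITY_BYTES == 0)
    in_range = GRANULARITY_BYTES <= nbytes <= BLOCK_BYTES
    return aligned and in_range and not crosses_block_boundary(address, nbytes)


def wrapped_addresses(address, nbytes, step=32):
    """Addresses touched if a crossing request wraps within its block."""
    base = address - address % BLOCK_BYTES
    return [base + (address + i) % BLOCK_BYTES for i in range(0, nbytes, step)]


if __name__ == "__main__":
    # Example of Figure 4.9: 128 Byte written to address 32.
    print(crosses_block_boundary(32, 128))
    print(wrapped_addresses(32, 128))
```

For the example of Figure 4.9 the last 32 Byte word lands at address 0 instead of address 128, which corresponds to the overwritten word A.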
To greatly reduce the complexity of combinational logic it was decided to only utilize a
subset of the available HMC packet sizes for write requests. As stated earlier the NTL
passes data cycles of an RMA packet independently, and the HTL solely operates with
this cycle based approach. Hence, the largest amount of payload to be converted in
one conversion step is equal to the datapath-width (64 Byte).
[Figure: two 128 Byte blocks of 32 Byte words, A B C D at memory address 0 and E F G H at memory address 128. Request: write 0-1-2-3 to memory address 32. Result: A -> 3, B -> 0, C -> 1, D -> 2 (wraparound within the upper block); E F G H unchanged]
Fig. 4.9 HMC 128 Byte block-boundary crossing example. Left hand side: A request
writes the pattern 0-1-2-3 (128 Byte) to memory address 32 intended to overwrite
B-C-D-E. Right hand side: Start address plus bytecount cause a wraparound in
the upper block. A false memory location was written
The easiest way to perform conversion is to map an RMA request exclusively to 16
Byte HMC packets. This will generate 496 Byte / 16 Byte = 31 HMC packets out of a full-sized
RMA transaction, packed in 16 parallel cycles with two 16 Byte packets per cycle at
most (i.e. 2 packets with 16 Byte payload and 16 Byte overhead each). On the one
hand this approach eliminates the possibility of crossing an HMC block-boundary. On
the other hand, it is desirable to decrease the number of HMC packets transmitted
as many smaller requests targeting a similar memory location limit parallelism and
are likely to cause access conﬂicts in the HMC DRAM. Smaller requests furthermore
increase the overhead on the HMC link as every HMC packet includes 16 Byte overhead
regardless of its size. 32 Byte HMC packets can be utilized to achieve a reduction in
most cases. Still, 16 Byte requests will be issued when approaching a block-boundary
or when no more data is available. It is reasonable to consider 64 Byte packets as
the payload of a cycle can be directly mapped to a single packet. However, 64 Byte
HMC packets will span over two cycles due to the HMC protocol overhead and can
be substituted by a combination of 32B/32B or 48B/16B packets. 48 Byte requests
have an additional beneﬁt. They can pack the ﬁrst cycle of an RMA packet (which
has only 48 Byte payload, + 16 Byte RMA header) into a single packet and cycle on
the HMC side, whereas a combination of 32B/16B would span over two cycles. In
conclusion the three available HMC packet sizes 16B, 32B, and 48B, have been chosen
as trade-oﬀ between design complexity and resulting number of HMC packets that will
be generated.
The HTL ﬁrst converts an RMA data cycle to HMC packets, one per output cycle,
before moving to the next RMA cycle. The obvious side eﬀect of this scheme is that it
increases the protocol overhead and some FLITs remain unused. In order to estimate
the implications on performance, Table 4.1 provides two examples of how packets are
split in response to the current address and the available HMC packet sizes. The
two example packets shown are labeled packet A and packet B. The actual layout
of the conversion is determined within the ﬁrst RMA cycle: either the full payload
can be packed into a single HMC cycle (packet type A) or it must be split due to
block-boundary crossing (packet type B). Hence, packet B type conversion is required
whenever the start address of an RMA transaction leaves only 16 or 32 Byte distance
to the next 128 Byte boundary as otherwise the full 48 Byte may be processed at once.
Due to the 16 Byte access granularity an RMA packet may target one of eight possible
address locations with regard to the 128 Byte block-boundary, i.e. the distance is 16B,
32B, ... up to 128B. Therefore, two out of eight packets will cause a packet B type
conversion which requires 16 cycles to complete, while the remaining six packets can
be represented by the packet A type and a cycle count of 15.
Using the information above it is possible to calculate the eﬀective bandwidth of
the HTL layer for write requests. Out of eight packets, six will take a total of
6 · 15 = 90 cycles. The remaining two require 2 · 16 = 32 cycles to complete. This
results in an average of (90 + 32) cycles / 8 packets = 15.25 cycles per packet. 15.25 cycles can carry
15.25 · 64 Byte = 976 Byte of which 496 Byte are actual payload. The resulting efficiency
EFFHTL_REQ is therefore:
\[ \mathrm{EFF}_{\mathrm{HTL\_REQ}} = \frac{\text{Data Bytes}}{\text{Total Bytes}} = \frac{496}{976} = 50.8\,\% \tag{4.19} \]
The eﬀective write request bandwidth BW_EFFHTL_REQ is now calculated with:
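\[ \begin{aligned} \mathrm{BW\_EFF}_{\mathrm{HTL\_REQ}} &= \mathrm{clk\_hmc} \cdot \text{datapath-width} \cdot \mathrm{EFF}_{\mathrm{HTL\_REQ}} \\ &= 318.1825\ \text{MHz} \cdot 512\ \text{bit} \cdot 50.8\,\% \\ &= 82.79\ \text{Gbps} = 10.35\ \text{GB/s} \end{aligned} \tag{4.20} \]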
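The cycle counts used in Equations (4.19) and (4.20) can be reproduced with a small simulation of the write splitting. The greedy rule below (per RMA cycle, emit the largest packet of at most 48 Byte that stays inside the current 128 Byte block, one packet per HMC cycle) is inferred from Table 4.1; it is a sketch with hypothetical names, not the HTL implementation.

```python
BLOCK_BYTES = 128          # HMC block size
CYCLE_BYTES = 64           # payload of a full 512 bit RMA cycle
FIRST_CYCLE_PAYLOAD = 48   # first RMA cycle also carries the descriptor
MAX_HMC_WRITE = 48         # largest HMC write size used by the HTL


def split_rma_write(start_address, payload_bytes=496):
    """Return (address, size) of each HMC write, one per HMC cycle."""
    packets = []
    address = start_address
    remaining = payload_bytes
    cycle_payload = min(FIRST_CYCLE_PAYLOAD, remaining)
    while remaining > 0:
        left = cycle_payload
        while left > 0:
            room = BLOCK_BYTES - address % BLOCK_BYTES
            size = min(left, room, MAX_HMC_WRITE)
            packets.append((address, size))
            address += size
            left -= size
        remaining -= cycle_payload
        cycle_payload = min(CYCLE_BYTES, remaining)
    return packets


def htl_request_efficiency(payload_bytes=496):
    """Average payload share over the eight possible start offsets."""
    counts = [len(split_rma_write(offset, payload_bytes))
              for offset in range(0, BLOCK_BYTES, 16)]
    average_cycles = sum(counts) / len(counts)
    return payload_bytes / (average_cycles * CYCLE_BYTES)


if __name__ == "__main__":
    print(len(split_rma_write(0)), len(split_rma_write(496)))
    print(f"{htl_request_efficiency():.1%}")
```

Under this rule every 64 Byte cycle maps to exactly two HMC packets, so only the first cycle decides between 15 and 16 HMC cycles.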
So far the conversion analysis between the two protocols has only considered write
requests. Read requests, however, are treated similarly as they have to obey the HMC
packet requirements described above, especially because a read request may also be
subject to block-boundary crossing. Luckily, the conversion eﬀort is greatly reduced due
to one signiﬁcant diﬀerence: HMC read requests are always 16 Byte in size regardless
of the requested payload size. Translation for a maximum-sized RMA request takes
Table 4.1 HTL request packet splitting example. Two 496 Byte RMA packets are converted.
Depending on the packet start address the first RMA cycle might be split to
avoid a 128 Byte block-boundary crossing. Packets that do not require initial
packet splitting (type A) will take 15 cycles. These packets have a target
address to block-boundary distance of 48 Byte or more. All other packets (type
B) take 16 cycles to complete

RMA packet type A. 496 Byte. Start address 0. No split in the first cycle
RMA Cycle   HMC Cycle   Next Address   Payload [Byte]   Action
1           1           0              48               First cycle with only 48 Byte
2           2           48             48               Send 48 Byte. 16 Byte remain
2           3           96             16               Send remaining 16 Byte
3           4           112            16               Send 16 Byte to avoid boundary crossing
3           5           128            48               Send remaining 48 Byte
4           6           176            48               Send 48 Byte. 16 Byte remain
4           7           224            16               Send remaining 16 Byte
5           8           240            16               Send 16 Byte to avoid boundary crossing
5           9           256            48               Send remaining 48 Byte
6           10          304            48               Send 48 Byte. 16 Byte remain
6           11          352            16               Send remaining 16 Byte
7           12          368            16               Send 16 Byte to avoid boundary crossing
7           13          384            48               Send remaining 48 Byte
8           14          432            48               Send 48 Byte. 16 Byte remain
8           15          480            16               Send remaining 16 Byte

RMA packet type B. 496 Byte. Start address 496. First cycle must be split
RMA Cycle   HMC Cycle   Next Address   Payload [Byte]   Action
1           1           496            16               Send 16 Byte to avoid boundary crossing
1           2           512            32               Send remaining 32 Byte
2           3           544            48               Send 48 Byte. 16 Byte remain
2           4           592            16               Send remaining 16 Byte
3           5           608            32               Send 32 Byte to avoid boundary crossing
3           6           640            32               Send remaining 32 Byte
4           7           672            48               Send 48 Byte. 16 Byte remain
4           8           720            16               Send remaining 16 Byte
5           9           736            32               Send 32 Byte to avoid boundary crossing
5           10          768            32               Send remaining 32 Byte
6           11          800            48               Send 48 Byte. 16 Byte remain
6           12          848            16               Send remaining 16 Byte
7           13          864            32               Send 32 Byte to avoid boundary crossing
7           14          896            32               Send remaining 32 Byte
8           15          928            48               Send 48 Byte. 16 Byte remain
8           16          976            16               Send remaining 16 Byte
[Figure: mapping of a 496 Byte RMA read to HMC Packets 1 to 5 for each start address relative to the 128 Byte boundary (0, 16, 32, 48, 64, 80, 96, 112). Each row consists of 128B reads plus one or two smaller reads between 16B and 112B]
Fig. 4.10 496 Byte RMA read request to HMC packet mapping. Depending on the start
address distance relative to the next 128 Byte boundary four or five HMC read
requests will be generated
up to three cycles and results in four to five HMC requests with up to 128 Byte each.
This process is depicted in Figure 4.10. It shows that the determination of the actual
number of HMC requests is based on whether and at what point requests have to be
split to avoid reading through block boundaries. In a given request stream that requests
eight or more full-sized RMA packets, the translation process for each subsequent
RMA packet is deterministic as it iterates through the eight possible variations. All
of these have in common that three 128 Byte reads will be issued, complemented by
one or two reads to address the remaining 112 Byte of the 496 Byte RMA request.
Eventually, eight 496 Byte RMA packets will be translated to 24 128 Byte requests
and two additional requests for each of the remaining packet sizes (16 to 112 Byte).
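The read request mapping can be sketched by splitting each RMA read at every 128 Byte boundary. The function names are hypothetical. The stream helper assumes that consecutive 496 Byte requests target consecutive addresses, which is one way to visit all eight start offsets.

```python
from collections import Counter

BLOCK_BYTES = 128   # HMC block size and largest read request


def split_rma_read(start_address, nbytes=496):
    """Sizes of the HMC read requests generated for one RMA read."""
    sizes = []
    address = start_address
    remaining = nbytes
    while remaining > 0:
        room = BLOCK_BYTES - address % BLOCK_BYTES
        size = min(remaining, room)
        sizes.append(size)
        address += size
        remaining -= size
    return sizes


def read_histogram(num_packets=8, nbytes=496, start_address=0):
    """Count HMC read sizes for a stream of back-to-back RMA reads."""
    histogram = Counter()
    address = start_address
    for _ in range(num_packets):
        histogram.update(split_rma_read(address, nbytes))
        address += nbytes
    return histogram


if __name__ == "__main__":
    print(split_rma_read(0))
    print(split_rma_read(32))
    print(sorted(read_histogram().items()))
```

For eight back-to-back packets this yields 24 requests of 128 Byte and two requests for each size from 16 to 112 Byte, 38 requests in total.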
Performance is not considered critical at this point. Although it can take up to three
cycles to request 496 Byte of data, neither the HMC nor the response path are able to
deliver the requested bandwidth. In fact, even if there was no protocol overhead at
all, three response cycles can carry only a maximum of 192 Byte payload (64 Byte per
parallel cycle). The response bandwidth will be examined later in this section.
Once packets have been converted, they are sent to the openHMC controller and each read
request is assigned a sequence number. These sequence numbers are stored in four
diﬀerent tag maps as four read requests may be packed into a single cycle. The purpose
of tagging is to allow the response path to properly reorder HMC responses and to
reassemble these back into larger RMA packets.
[Figure: 128 bit FLITs within 512 bit cycles]
Cycle 1: 0    PKT1 PKT1 PKT1
Cycle 2: PKT1 PKT2 PKT2 PKT3
Cycle 3: PKT3 PKT3 PKT3 PKT3
Fig. 4.11 Response packet sampling example: Three cycles were sampled at the openHMC
controller output. The second cycle contains a full packet PKT2 along with the
final FLIT of PKT1 and the first FLIT of PKT3

[Figure: the three cycles of Figure 4.11 after serialization]
Cycle 1: 0    PKT1 PKT1 PKT1
Cycle 2: PKT1 0    0    0
Cycle 3: 0    PKT2 PKT2 0
Cycle 4: 0    0    0    PKT3
Cycle 5: PKT3 PKT3 PKT3 PKT3
Fig. 4.12 Packet serialization example. The second cycle contains one full and parts of
two other packets. Packets are separated using two additional cycles
Responses
The HTL response path receives HMC response packets from the openHMC controller.
HMC internally operates at FLIT granularity and the FPGA buffers four FLITs
to create a 512 bit cycle. A parallel cycle may contain (parts of) several packets
at a time as shown in Figure 4.11. To make decoding and processing easier for the
following stages, the incoming packets are ﬁrst separated so that in any given cycle
only data from one packet is forwarded. Figure 4.12 depicts how extra cycles are
used to serialize the packet stream in Figure 4.11. Unfortunately, extra cycles will
also throttle the throughput. To properly estimate the achievable bandwidth at this
stage it is necessary to analyze how packets can be aligned within the parallel cycle.
Depending on this position and the packet length, the number of cycles required to
forward the packet varies. The four possible start positions and their implication on
the cycle count are depicted in Figure 4.13. As can be seen, a 32 Byte HMC response
can have four different layouts. Assuming a uniform distribution of start positions, it fits
into one cycle in 50% of the cases and spreads over two cycles in the other 50%. Its average
required cycle count is therefore 1.5. The splitting scheme shown applies to all other packet sizes likewise
and the results are summarized in Table 4.2. It lists the various packet sizes and their
minimum, maximum, and average cycle spread count for a single response. In a given
request/response stream, eight consecutive, randomly picked requests can be selected
[Figure: Layouts 1 to 4 of a packet consisting of HDR, DATA, and TAIL, placed at the four possible FLIT start positions across Cycle 1 and Cycle 2]
Fig. 4.13 Response packet layouts: A 32 Byte HMC response is embedded in a 48 Byte
packet (including 16 Byte protocol overhead). Depending on the FLIT start
position within the 512 bit word this packet is received in four different layouts
and spreads over one or two cycles
Table 4.2 HMC response packet serialization overview. The data highlights the average
cycle count that is required to forward a packet for each size. Translated to
the total number of packets and their average cycle count, the eﬃciency of the
HTL response path can be calculated
Packet Cycle counts Probability For eight 496 Byte RMA Packets
Size Min Max Min Max Avg Count Cycles Raw B Payload B
16B 1 2 3 1 1.25 2 2.5 160 32
32B 1 2 2 2 1.5 2 3 192 64
48B 1 2 1 3 1.75 2 3.5 224 96
64B 2 2 4 2 2 4 256 128
80B 2 3 3 1 2.25 2 4.5 288 160
96B 2 3 2 2 2.5 2 5 320 192
112B 2 3 1 3 2.75 2 5.5 352 224
128B 3 3 4 3 24 72 4608 3072
SUM - - - - - 38 100 6400 3968
to calculate the total number of cycles that will be spent. This cycle count, multiplied
by 64 Byte, gives the total (raw) number of Bytes that will be forwarded. The actual
number of Bytes that carry payload, however, is less than that. Eight 496 Byte RMA
read requests will generate 38 HMC packets that will take 100 cycles after serialization,
and only 3968 Bytes out of 6400 carry payload.
Since none of the following HTL stages delays the response stream any further,
this information can now be used to calculate the HTL response efficiency:
\[
\mathrm{EFF}_{\mathrm{HTL\_RSP}} = \frac{\text{Payload Bytes}}{\text{Total Bytes}} = \frac{3968}{6400} = 62\,\% \tag{4.21}
\]
which results in a maximum effective bandwidth of:
\[
\begin{aligned}
\mathrm{BW\_EFF}_{\mathrm{HTL\_RSP}} &= \mathrm{clk}_{\mathrm{hmc}} \cdot \text{datapath-width} \cdot \mathrm{EFF}_{\mathrm{HTL\_RSP}} \\
&= 318.1825\,\mathrm{MHz} \cdot 512\,\mathrm{bit} \cdot 62\,\% \\
&= 12.62\,\mathrm{GB/s}
\end{aligned} \tag{4.22}
\]
Table 4.3 NAM design building blocks bandwidth summary: single EXTOLL Link operation.
For reading and writing, achievable bandwidth is limited by the EXTOLL Link
Functional Unit   Request Bandwidth [GB/s]   Response Bandwidth [GB/s]
EXTOLL            8.31                       8.58
Link MUX          17.54                      17.54
NTL               17.54                      17.54
HTL               10.35                      12.62
openHMC           18.02                      18.02
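The cycle counts in Table 4.2 and the efficiency of Equation 4.21 can be re-derived with a few lines of C. The following is only a minimal sketch: the function names are illustrative and not part of the NAM design, and it assumes that every response carries 16 Byte of protocol overhead and that all four FLIT start positions are equally likely.

```c
#include <assert.h>
#include <stddef.h>

#define FLIT_BYTES       16u  /* one HMC FLIT is 128 bit */
#define FLITS_PER_CYCLE   4u  /* four FLITs form one 512 bit cycle */
#define OVERHEAD_BYTES   16u  /* protocol overhead per response packet */

/* Cycles needed to forward one response whose first FLIT sits at
 * position start (0..3) within a 512 bit cycle. */
unsigned cycles_for(unsigned payload_bytes, unsigned start)
{
    assert(payload_bytes % FLIT_BYTES == 0 && start < FLITS_PER_CYCLE);
    unsigned flits = (payload_bytes + OVERHEAD_BYTES) / FLIT_BYTES;
    return (start + flits + FLITS_PER_CYCLE - 1) / FLITS_PER_CYCLE;
}

/* Average cycle count over the four equally likely start positions. */
double avg_cycles(unsigned payload_bytes)
{
    unsigned sum = 0;
    for (unsigned s = 0; s < FLITS_PER_CYCLE; s++)
        sum += cycles_for(payload_bytes, s);
    return (double)sum / FLITS_PER_CYCLE;
}

/* Payload efficiency for a mix of responses: count[i] is the number of
 * packets with a payload of (i + 1) * 16 Byte, i = 0..7. */
double htl_rsp_efficiency(const unsigned count[8])
{
    double cycles = 0.0, payload = 0.0;
    for (unsigned i = 0; i < 8; i++) {
        unsigned size = (i + 1) * FLIT_BYTES;
        cycles  += count[i] * avg_cycles(size);
        payload += (double)count[i] * size;
    }
    /* every cycle forwards 64 raw Bytes */
    return payload / (cycles * FLIT_BYTES * FLITS_PER_CYCLE);
}
```

Called with the packet mix from Table 4.2 (two packets each for 16 to 112 Byte and 24 packets of 128 Byte), `htl_rsp_efficiency` evaluates 3968 / 6400, which is the 62 % used in Equation 4.21.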
The next step in response processing is to reorder HMC packets as they can return
out of order. This is done using the sequence numbers that have been placed in the
TAG maps by the HTL request modules. Finally, individual packets are recombined to
create their corresponding RMA GET responses. These are forwarded to either the
NTL if the request was a remote read/write or otherwise to the CR unit.
4.4 Summary Estimated Read/Write Performance
The previous section explained the individual NAM design building blocks and analyzed
their estimated performance. Table 4.3 summarizes the results for reading and writing
with one EXTOLL Link to easily identify existing bottlenecks. For both, reading and
writing, the EXTOLL Link bandwidth is the limiting factor. This is independent of
request sizes and access patterns since even in the worst case usage the HMC would
be able to deliver more bandwidth⁵. Therefore, the write bandwidth is expected to
peak at about 8.31 GB/s and the read bandwidth at 8.58 GB/s. Similarly,
Table 4.4 highlights the expected bottlenecks when writing to and reading from
both EXTOLL links. It assumes that request addresses are somewhat distributed and
do not target the same memory location, as in this case the HMC bandwidth could have
a negative impact⁶. Compared to Table 4.3 it becomes clear that the bottlenecks have
now shifted into the NAM logic, more specifically into the HTL request path for writes
and the HTL response path for reads. Hence the expected maximum bandwidth is
10.35 GB/s for writes and 12.62 GB/s for reads.
These results will be used as a reference for the real hardware measurements conducted
in Chapter 5.
Table 4.4 NAM design building blocks bandwidth summary: dual EXTOLL Link operation.
For reading and writing, achievable bandwidth is limited by the HTL
Functional Unit   Request Bandwidth [GB/s]   Response Bandwidth [GB/s]
2 × EXTOLL        16.62                      17.16
Link MUX          17.54                      17.54
NTL               17.54                      17.54
HTL               10.35                      12.62
openHMC           18.02                      18.02
5 For more information on HMC access patterns and bandwidths refer to Section 3.3.4.
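The bottleneck analysis of Tables 4.3 and 4.4 amounts to taking the minimum over all functional units of the data path. The sketch below illustrates this; the struct and function names are hypothetical, and the numbers are copied from the two tables.

```c
#include <assert.h>
#include <stddef.h>

struct unit_bw {
    const char *name;
    double request_gbs;   /* request (write) direction in GB/s */
    double response_gbs;  /* response (read) direction in GB/s */
};

/* Index of the unit with the lowest bandwidth in the chosen direction.
 * response == 0 selects the request path, any other value the response path. */
size_t bottleneck(const struct unit_bw *units, size_t n, int response)
{
    assert(units != NULL && n > 0);
    size_t min = 0;
    for (size_t i = 1; i < n; i++) {
        double cur  = response ? units[i].response_gbs : units[i].request_gbs;
        double best = response ? units[min].response_gbs : units[min].request_gbs;
        if (cur < best)
            min = i;
    }
    return min;
}

/* Values from Table 4.3 (single link) and Table 4.4 (dual link). */
const struct unit_bw single_link[] = {
    {"EXTOLL", 8.31, 8.58},    {"Link MUX", 17.54, 17.54},
    {"NTL", 17.54, 17.54},     {"HTL", 10.35, 12.62},
    {"openHMC", 18.02, 18.02},
};
const struct unit_bw dual_link[] = {
    {"2 x EXTOLL", 16.62, 17.16}, {"Link MUX", 17.54, 17.54},
    {"NTL", 17.54, 17.54},        {"HTL", 10.35, 12.62},
    {"openHMC", 18.02, 18.02},
};
```

For `single_link` the function selects the EXTOLL entry in both directions, for `dual_link` it selects the HTL entry, in line with the captions of the two tables.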
4.5 Checkpoint/Restart
In DEEP-ER the NAM carries out XOR-based checkpoint/restart as a potential
performance improvement over the existing SCR-Partner checkpointing scheme with
SIONlib. In contrast to this partner approach, where one checkpoint is stored at the
task-local node and also transferred to another remote node, XOR checkpointing generates
a parity via a bit-wise XOR operation from the checkpoints of all participating ranks
in a group. The result is as large as the largest individual checkpoint and can then be
used to recover from any single rank failure within a group.
XOR checkpointing reduces the overhead in storage capacity required to perform
checkpoint/restart as every node holds only a fraction of the parity information,
compared to Partner checkpointing where checkpoints are simply duplicated and
distributed across nodes. It comes, however, at the expense of calculation overhead to
generate the parity. For more information on checkpointing and fault tolerance refer
to Section 2.3.
6 See Footnote 5.
Fig. 4.14 SIONlib-Buddy checkpointing scheme with two nodes (1. store locally,
2. send to buddy, 3. receive from buddy)
Fig. 4.15 SIONlib ring fashion file exchange with more than two nodes
The NAM carries out the parity computation and stores the result in the HMC. Each
NAM in the system is associated with a set of ranks (just like a set in SCR) and within
each set a single rank failure may be recovered. Therefore, a system can have as many
sets as there are NAMs in the system.
The following section documents the design process of the CR functional unit and
describes how the NAM creates the XOR parity, and how it can be used to restart
from a failure.
4.5.1 Buddy Checkpointing in DEEP-ER
The DEEP-ER resiliency scheme is based on SCR-Partner checkpointing which has
been extended to support the SIONlib [102] parallel I/O library. SIONlib merges the
I/O streams of multiple processes into one or multiple files, removing file system
congestion due to many smaller, unaligned data blocks. This process is applied to
checkpoint data on all processes on a node so that only a single ﬁle is written per
node. The SIONlib-Buddy checkpointing approach writes this ﬁle to the local NVMe
devices and also creates the same ﬁle on a remote buddy node. It then initiates a
receive routine to fetch the local checkpoint of a remote buddy which is also placed in
the local NVMe (Figure 4.14). Note that the buddy node where the local checkpoint
is written to is not necessarily the same node a remote checkpoint is received from.
SIONlib achieves an additional speed-up over standard SCR-Partner by overlapping
the write-out functions to local storage and the buddy node. If more than two nodes
participate in the process, buddy nodes are assigned and files are exchanged in a ring
fashion (Figure 4.15).
Parity generation:   Set A  1001 1001 1111 1000  XOR  Set B   1111 0000 0000 1000  =  Parity  0110 1001 1111 0000
Reconstruction:      Set A  1001 1001 1111 1000  XOR  Parity  0110 1001 1111 0000  =  Set B   1111 0000 0000 1000
Fig. 4.16 XOR parity generation (left) and reconstruction of a missing checkpoint dataset
(right)
4.5.2 Deﬁnitions
The following deﬁnitions may be helpful to understand the remainder of this section.
XOR Parity
An XOR (exclusive OR) operation applied to several sets of data can be used to generate
a parity. The parity is as large as the largest dataset. With the help of this
parity any single missing set of data can be reconstructed. Figure 4.16 depicts a simple
example of this process.
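The parity definition can be illustrated in software. The following C function is a minimal sketch and not the NAM hardware implementation; the name `xor_parity` and its interface are chosen for illustration only. Datasets shorter than the parity are treated as if they were padded with zeros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* XOR n datasets into parity[0..parity_len). A dataset shorter than
 * parity_len contributes zeros for its missing bytes. */
void xor_parity(const uint8_t *const *sets, const size_t lens[], size_t n,
                uint8_t *parity, size_t parity_len)
{
    memset(parity, 0, parity_len);
    for (size_t i = 0; i < n; i++) {
        assert(lens[i] <= parity_len);
        for (size_t b = 0; b < lens[i]; b++)
            parity[b] ^= sets[i][b];
    }
}
```

Reconstruction uses the same function: passing the parity together with all surviving datasets yields the missing one. With the values of Figure 4.16, Set A (0x99 0xF8) and Set B (0xF0 0x08) produce the parity 0x69 0xF0, and Set A combined with this parity returns Set B.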
Segmentation
The NAM internally segments checkpoint data into smaller chunks, currently 496
Byte which is the EXTOLL network MTU and reﬂects a maximum-sized RMA packet.
Segment numbers are assigned since the XOR operation is applied on equal segment
numbers over all checkpoint data sets.
Rank
A rank may be a remote process or remote node with one or multiple processes
depending on the checkpointing granularity. For example, SIONlib merges checkpoints
of multiple processes on a node into a single ﬁle. In this context a rank equals one
node.
4.5.3 Design Space Exploration
In order for the NAM to create a parity out of a group of checkpoints it has to receive
all participating datasets. There are three ways to do so:
1. The nodes unconditionally send their checkpoints to the NAM which acts as a
passive device. This approach requires the least hardware eﬀort.
2. The nodes send their checkpoints to the NAM which acts as a semi-passive device.
All nodes send their checkpoints to the NAM upon request. Synchronization
between the nodes and the NAM is required to control data ﬂow.
3. The NAM reads checkpoint data from the nodes. It is up to the NAM hardware
to decide when and how much data to fetch. The application needs to signal
readiness and wait for a notification of completion.
Option 1: Nodes send data In the easiest approach all nodes send their checkpoints
whenever ready and without any further inter-process synchronization. If one
or more nodes delay the transmission it is not guaranteed that the NAM can
hold all relevant segments to generate the next XOR segment because the FPGA
buﬀer capacity might be insuﬃcient. As a result, all currently available segments
would have to be XORed and the temporary result would be written to the HMC.
When all of the remaining and required segments have arrived, the temporary
result would need to be read again before the parity can be generated. There is
a high risk that data will be moved between FPGA and HMC multiple times.
Option 2: Nodes send data upon request by the NAM This approach elimi-
nates the risk of ﬂooding the NAM with data from individual nodes. The NAM
requests every segment or a set of segments of deﬁned size with notiﬁcation PUTs
directed to the remote process. A disadvantage is the fact that processes stay
busy with waiting for these notiﬁcations, up to several million times for a 2 GB
checkpoint using 496 Byte RMA transactions. In addition, each transaction is
eventually sourced by a software descriptor which must be translated and might
involve address translation.
Option 3: NAM retrieves checkpoint data autonomously With this approach
the NAM has exclusive control over any data movement. It can autonomously
request segments in a way that the FPGA internal buffer space is optimally
used. Remote nodes only need to inform the NAM that checkpoint data may be
retrieved. Completion of the checkpointing process is signaled by a notification to
all participating nodes, which only have to check for this notification before the next
checkpoint may be created. Another advantage is the fact that no software to
network descriptor translation is performed, potentially increasing the overall
performance.
The dataset granularity must obey the NAM internal access granularity of 16 Byte.
It is the responsibility of the software to pad datasets with zeros up to the next 16
Byte boundary. Also the NAM must provide a reasonable amount of buﬀer space to
avoid frequent read/modify/write to the HMC (option 1) or to allow suﬃcient in-ﬂight
transactions to exist (option 2 and 3). Buﬀer space must be partitioned to allow
holding segments of up to 44 nodes at a time to cover all 88 nodes with two NAMs in
the DEEP-ER prototype.
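The padding rule and the resulting number of segments can be expressed as follows. This is a sketch under the assumptions stated in the text (16 Byte access granularity, 496 Byte segments); the function names are illustrative and not taken from libNAM.

```c
#include <assert.h>
#include <stdint.h>

#define NAM_GRANULARITY 16u   /* NAM internal access granularity in Byte */
#define SEGMENT_BYTES   496u  /* EXTOLL MTU, one maximum-sized RMA packet */

/* Round a dataset length up to the next 16 Byte boundary. */
uint64_t pad_to_granularity(uint64_t len)
{
    return (len + NAM_GRANULARITY - 1) / NAM_GRANULARITY * NAM_GRANULARITY;
}

/* Number of 496 Byte segments needed to transfer the padded dataset. */
uint64_t segment_count(uint64_t len)
{
    uint64_t padded = pad_to_granularity(len);
    assert(padded % NAM_GRANULARITY == 0);
    return (padded + SEGMENT_BYTES - 1) / SEGMENT_BYTES;
}
```

If a 2 GB checkpoint is taken as 2^31 Byte (an assumption about the unit), `segment_count` returns 4,329,605 segments, which is the order of magnitude behind the "several million" notifications mentioned for option 2.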
4.5.3.1 Summary of Design Decisions
Three approaches to transfer data were described above. Although option 3 requires
a higher hardware implementation effort, it greatly reduces software complexity and the
processor time spent on transferring data while providing the most resource-efficient solution.
To optimally utilize available Block RAM memory of the Virtex 7 FPGA the NAM
will request a maximum of 128 in-ﬂight segments with 496 Byte each per node. The
required number of nodes that must be handled by the NAM is 44, and the actual
number is slightly increased to 48 to allow for a small imbalance in node-to-NAM set
assignments.
4.5.4 Vision: NAM-XOR Checkpointing in DEEP-ER
Based on the design decisions for NAM-XOR checkpointing, Figure 4.17 depicts the
envisioned checkpoint creation ﬂow with SIONlib. First, SIONlib writes a single ﬁle
from the checkpoints of all processes on a node to its local NVMe. The ﬁle is then
re-read into the node-local memory where it is ready to be fetched by the NAM.
Fig. 4.17 NAM/SIONlib checkpoint creation example with one node (1. create file on
the local NVMe, 2. read back into local memory, 3./4. GET requests from the NAM and
returned data)
4.5.5 Conﬁguration
Before the NAM CR feature can be used it must be conﬁgured by a root process.
RRA packets are used to read and write a set of CR registers in the NAM Register
File. The CR control unit expects the number of participating ranks (register C0) and
the unique EXTOLL NodeID + Virtual Process Identiﬁer (VPID) (C2) of each rank
along with the size (C2) and remote memory start address (C1) of the corresponding
checkpoint, one rank per access. As shown in Figure 4.18 these steps are repeated until
all ranks have been conﬁgured. The conﬁguration process and any misconﬁguration are
monitored in dedicated status registers. As soon as all information has been written
the NAM is operational for CR.
Fig. 4.18 NAM CR configuration process (RRA accesses from Rank A to the NAM CR
registers C0, C1, and C2, repeated for all ranks)
Conﬁguration and all subsequent processes are carried out by the libNAM library (see
Section 4.7.2), which provides interfaces to a higher layer such as SIONlib.
4.5.6 Generating a Checkpoint
Figure 4.19 depicts the checkpointing process on a network transaction level. The
checkpoint sizes are 5 segments for Rank A, and 4 segments for Rank B respectively. In
this example, the buﬀer sizes in the FPGA have been set to accommodate 3 segments
per rank at most. After conﬁguration has been performed an application may be
executed. Whenever a rank is ready to have its checkpoint fetched by the NAM it
posts a ﬂag into one of the CR control registers via RRA. This will trigger a burst of
RMA read requests up to 128 in-ﬂight segments (only three in this example) to retrieve
data from the corresponding rank. As this process is ongoing, additional ﬂags from
other ranks may be written which will cause the RMA read request scheme to alternate
through all currently ﬂagged ranks. As GET responses return the NAM places these
segments into buﬀers, one per rank. The performance for GET responses from remote
nodes to the NAM is expected to be close to what RMA PUTs from an ASIC to the
NAM can achieve since both packet types look very similar.
When matching segments from all participating ranks have been received, an XOR
operation on this set of segments is performed and the resulting parity is written to
the HMC. With every processed set, another segment from all nodes may be requested
as the buffer space is now freed up. Since checkpoint sizes can vary for each rank,
data fetch operations for some ranks may be ongoing while others are finished. The
NAM takes the largest available checkpoint as reference and internally pads all other
checkpoints with zeros so that the XOR result stays correct. The segment request
process is repeated until all checkpoints have been fully transferred. The last request to
each rank has a notification bit set to signal completion.
For any subsequent checkpoints, step A in the sequence is obsolete when there is no
change in the conﬁguration.
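The segment-wise parity generation with unequal checkpoint sizes can be modeled in software as shown below. This is a simplified sketch and not the FPGA data path: it ignores buffering and request arbitration, assumes that every checkpoint consists of whole 496 Byte segments, and uses illustrative function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEGMENT_BYTES 496u  /* one maximum-sized RMA packet */

/* The parity has as many segments as the largest checkpoint. */
size_t parity_segments(const size_t seg_counts[], size_t ranks)
{
    size_t max = 0;
    for (size_t r = 0; r < ranks; r++)
        if (seg_counts[r] > max)
            max = seg_counts[r];
    return max;
}

/* Compute parity segment seg from segment seg of every rank.
 * data[r] points to seg_counts[r] consecutive segments of rank r.
 * A rank without a segment seg contributes zeros (implicit padding). */
void xor_segment(const uint8_t *const *data, const size_t seg_counts[],
                 size_t ranks, size_t seg, uint8_t parity[SEGMENT_BYTES])
{
    assert(seg < parity_segments(seg_counts, ranks));
    memset(parity, 0, SEGMENT_BYTES);
    for (size_t r = 0; r < ranks; r++) {
        if (seg >= seg_counts[r])
            continue;  /* shorter checkpoint: XOR with zeros changes nothing */
        const uint8_t *src = data[r] + seg * SEGMENT_BYTES;
        for (size_t b = 0; b < SEGMENT_BYTES; b++)
            parity[b] ^= src[b];
    }
}
```

For the example of Figure 4.19 (rank A with 5 segments, rank B with 4 segments) the parity has 5 segments, and the last parity segment is identical to segment A4 because rank B only contributes padding.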
4.5.7 Restarting from a Checkpoint
When a rank has failed, the root process is responsible for updating the entries in the
NAM CR control unit accordingly. With completion of this update, the NAM will
start requesting segments from all remaining ranks. This process is very similar to the
checkpoint creation process with one main difference: the checkpoint of the failed rank
is replaced by the parity information that has been stored in the HMC. A low-level
diagram that highlights the individual sequences is depicted in Figure 4.20. The result
of the XOR operation on all remaining checkpoints and the parity is again written to
the HMC. This information reflects the missing checkpoint. The NAM informs the
failed (or newly configured) rank which then fetches the data via regular RMA reads.
Fig. 4.19 NAM parity checkpoint creation example with two participating ranks A and
B. The rank A checkpoint is 5 segments in size and the rank B checkpoint
4, respectively. The resulting parity is as large as the largest checkpoint; 5
segments in this case. Segment GET requests are arbitrated among all currently
valid ranks
A mandatory precondition to restart after a failure is that a parity checkpoint has been
written previously. In the unlikely case that a rank fails while a checkpoint creation
process is ongoing, the parity information may be invalid and no restart is possible.
One possible workaround to avoid this situation is to partition the HMC address space
into two equal-sized blocks. The NAM will then alternate between the blocks for each
subsequent checkpoint. A major drawback of this scheme is that the available capacity
is cut in half.
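The workaround with two alternating blocks can be sketched as follows. The functions are hypothetical and only illustrate the address selection; how the NAM would track the checkpoint number is not described in the text and is assumed here to be a simple counter.

```c
#include <assert.h>
#include <stdint.h>

/* Base address of the parity block used for checkpoint number epoch
 * when the HMC address space is split into two equal-sized blocks. */
uint64_t parity_base(uint64_t hmc_capacity, uint64_t epoch)
{
    assert(hmc_capacity % 2 == 0);
    return (epoch % 2) * (hmc_capacity / 2);
}

/* Block that still holds the last completed parity while checkpoint
 * number epoch (epoch > 0) is being written. */
uint64_t last_valid_base(uint64_t hmc_capacity, uint64_t epoch)
{
    assert(epoch > 0);
    return parity_base(hmc_capacity, epoch - 1);
}
```

While checkpoint number `epoch` is written into one block, `last_valid_base` points to the other block, so a failure during checkpoint creation would still leave one complete parity available.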
4.5.8 CR Functional Unit
The CR functional unit is depicted in Figure 4.21. The starting point for any CR
process is the control unit. After conﬁguration, it receives the start CR ﬂags which
triggers RMA read requests to be issued. Any packets arriving at the NTL completer
must be forwarded to the correct unit, depending on whether or not a packet/segment
belongs to a CR process. This data is shifted to the input stage which looks up the
corresponding buﬀer index of the remote process. The segment is then shifted into the
buﬀer array where it remains until the matching segments from all participating nodes
have arrived. Eventually, these segments are shifted into the XOR stage which creates
the parity. A ﬁnal stage generates the HMC destination address, frames the packet
into a suitable format, and forwards it to the HTL layer.
4.5.9 Estimated Performance
The achievable CR performance depends on many factors and without actual measurements
it is not possible to make a prediction at this point. Benchmarks will have
to show if the dimensioning of the NAM internal buffers is sufficient, how well this
approach scales with the number of participating ranks, and if the newly created
software components are able to make use of this novel hardware architecture. For
the task of collecting checkpoints it is expected that the bandwidth is higher than for
just reading or writing to the NAM. The main reason for this is that the bandwidth
limiting NAM HTL layer is mostly avoided except for writing or reading the XOR
parity from the HMC. However, this is only true when both NAM links can be accessed
and Section 4.7 will discuss how this requirement is influenced by the NAM software
stack.
Fig. 4.20 NAM restart process. Rank B failed and its checkpoint is now replaced by the
parity which resides in the HMC. Similar to the checkpoint process the NAM
now collects the checkpoints from all remaining ranks and again applies an XOR
function. This operation results in the missing checkpoint of rank B
Fig. 4.21 CR functional unit block diagram. An input stage distributes incoming packets
onto one of 48 available buffers. As matching segments from all participating
nodes have arrived, data is shifted to the XOR stage which generates the parity.
The process is controlled by the CR control unit, accessible and configurable
via RRA. Two additional modules create request packets directed to the HMC
to read or write the parity, or to the network as read requests to get additional
checkpoint segments
Fig. 4.22 NAM design device view and floor plan
4.6 Implementation Results
The full NAM design was implemented in the Virtex 7 690T FPGA. Figure 4.22 shows
that a ﬂoor plan was applied to partition the available space in the FPGA. In general,
a carefully applied ﬂoor plan can reduce routing congestion and place&route runtimes,
and will also lead to more reliable results. It can also be seen that the device is
reasonably utilized. The CR logic, for example, currently supports buffer space for up
to 48 ranks at a time. A further increase of the number of ranks would increase Block
RAM usage in the specified device region, and significantly increase routing congestion
in this area. Routing congestion also becomes severe when operating frequencies
are increased, as the implementation tools start to replicate logic in order to reduce
trace lengths and fan-out. The modules that suffered most from routing congestion
are the EXTOLL links (fmax = 200 MHz) and the CR logic (fmax = 230 MHz). The
final utilization report can be found in Table 4.5.
Table 4.5 NAM design resource utilization in a Virtex 7 690T FPGA. Percentages are
listed in reference to the total number of available resources of same type
Resource Type        LUTs             Registers       BRAM            DSP
Utilization          273k (63.0%)     199k (23%)      553 (37.6%)     214 (5.9%)
Per Functional Unit
One EXTOLL Link      66.8k (15.4%)    57.2k (6.6%)    30.50 (2.1%)    47 (1.3%)
EXTOLL MUX           3.8k (0.9%)      2.3k (0.3%)     30 (2%)         0 (0%)
HTL/NTL              24.8k (5.7%)     16.2k (1.9%)    42 (2.9%)       12 (0.3%)
CR Logic             87.2k (20.1%)    43.8k (5.1%)    404 (27.5%)     102 (2.8%)
HMC Layer            21.6k (5%)       19k (2.2%)      15.5 (1.1%)     2 (0.1%)
4.7 NAM Software
Even the best hardware is useless without software that can actually use it. This section
describes the software components that were developed or modiﬁed to make use of the
NAM. There are three main components in this scope: a NAM-aware network setup
and management tool to seamlessly integrate this new device with EXTOLL ASICs;
an additional user-level Application Programming Interface (API) that provides access
to the NAM and implements the CR features; and finally, a service that acts as a central
instance to handle and manage NAMs and their allocations system-wide.
4.7.1 EMP Extension
The EMP is a software component that is integrated with the EXTOLL software
stack. It is used to initially assign NIC identifiers and to set up routing to and between
EXTOLL devices in a network. In fact, EMP must be run any time a system has been
powered down or even a single node has been replaced. In its original form, EMP does not support
NAMs as it expects that every connected device also provides routing tables and is
able to route through two links via the EXTOLL Crossbar (XBAR). The NAM is
an endpoint for any traﬃc and does not provide a routing table as routing from one
link to the other is not supported. Hence, the network must be properly conﬁgured
to ensure that only packets that actually target a speciﬁc NAM will be sent to it.
An additional hardware device type was added to the EMP which can now route to
and from NAMs but will not attempt to route through it. Currently only ﬁxed and
deterministic routing is supported with exactly one path from one node to another.
4.7.2 The libNAM Library
The libNAM library operates on top of the existing EXTOLL RMA API. The function
calls provided by libNAM are very similar to libRMA so that existing user applications
can be modiﬁed without much eﬀort. Listing 4.1 shows a code example to write and
read to and from the NAM. In the initial bring-up phase of the NAM hardware-software
interaction many of the features that were required to protect the NAM from false
usage were implemented in hardware (e.g. a violation of the 16 Byte granularity or
unsupported commands). These protection features were gradually shifted into the
software, hence reducing hardware and associated implementation complexity.
Reading and writing is realized with send and receive buﬀers organized in a ring
structure. The EXTOLL/NAM notiﬁcation mechanism is utilized to handle the buﬀer
space, i.e. to free up locations when data has been transmitted (PUT) or received
(GET). The number and size of the elements a buffer can hold are configurable and
also define the limit for outstanding transactions.
Currently, data is sent and received on only one of four available EXTOLL Virtual
Channels. Measurements conducted in Chapter 5 will have to reveal if and how strongly
this affects performance. A possible implementation that uses all VCs would require
libNAM to use dedicated buﬀers, one per VC to properly handle GET responses that
might return out of order.
#include <stdio.h>
/* The libNAM header must be included as well; its file name is not given in the text. */

int main(int argc, char **argv)
{
    nam_allocation_t *my_alloc;
    char hello[] = "Hello NAM!";
    char transferred[13];

    /* Allocate NAM space for read/write */
    my_alloc = nam_malloc(sizeof(hello));

    /* PUT the string to the NAM and GET it back */
    nam_put_sync(hello, 0, sizeof(hello), my_alloc);
    nam_get_sync(transferred, 0, sizeof(transferred), my_alloc);
    printf("Transferred from NAM: <%s>\n", transferred);

    /* Release the allocation */
    nam_free(my_alloc);
    return 0;
}
Listing 4.1 libNAM PUT/GET usage example
In subsequent libNAM implementation stages an MPI-based layer was added to
allow sharing a NAM allocation between processes. This layer furthermore allows
coordinating checkpoint and restart processes for the NAM CR use case. As there may
exist multiple NAMs in a system, libNAM forms sets of participating nodes in a CR
process and assigns these sets to one of the NAMs. This assignment process is currently
implemented in a pseudo-random fashion that balances the number of nodes among
sets.
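A balanced pseudo-random assignment could look like the following sketch. It is not the libNAM implementation, which is not shown in the text; the function names, the linear congruential generator, and the shuffle-then-deal strategy are assumptions made for illustration. The slight modulo bias of the random index is ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator so that the sketch is reproducible. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;  /* use the upper bits */
}

/* Assign n_nodes nodes to n_nams sets in a pseudo-random but balanced way.
 * order[] is caller-provided scratch space with n_nodes entries,
 * set_of[i] receives the set (NAM) index of node i. */
void assign_sets(size_t n_nodes, size_t n_nams, uint32_t seed,
                 size_t order[], size_t set_of[])
{
    assert(n_nams > 0);
    for (size_t i = 0; i < n_nodes; i++)
        order[i] = i;
    /* Fisher-Yates shuffle of the node order */
    for (size_t i = n_nodes; i > 1; i--) {
        size_t j = lcg_next(&seed) % i;
        size_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
    /* Deal the shuffled nodes round-robin: set sizes differ by at most one. */
    for (size_t k = 0; k < n_nodes; k++)
        set_of[order[k]] = k % n_nams;
}
```

For the 88 nodes and two NAMs of the DEEP-ER prototype this yields two sets of 44 nodes each. Such an assignment is balanced in size only and does not take the network topology into account.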
Unfortunately, assigning nodes to NAMs without additional information about routing
comes with obvious drawbacks. Figure 4.23 depicts various possible set assignments
for an example network with eight nodes and two NAMs. It can be seen that there
exist good mappings with potentially low routing congestion and short distances, but
also bad mappings that require more network hops and where only one NAM link will
be used. As routes are static the system behavior in response to NAM placement and
set conﬁguration is predictable. It is therefore essential to assign sets in consideration
of the network topology and routing scheme. This task can either be oﬄoaded to the
user, who must provide an appropriate mapping scheme, or to libNAM which could
use the information provided by EMP to optimally form sets.
It is also possible that the job scheduler selects a node combination that inevitably
leads to a similar condition. Figure 4.24 shows two possible node combinations for a
job running on two nodes. Assumed is a shortest-path routing algorithm with fixed
routes and the best possible XOR set assignment. The figure points out that the
NAM checkpointing performance can be significantly affected by simply scheduling the
'wrong' nodes. The impact of suboptimal mapping on performance will be evaluated in
Chapter 5.
Fig. 4.23 NAM-XOR set mapping examples (eight nodes N0 to N7, two NAMs)
(a) Optimal mapping. The logically nearest nodes are assigned. Distances are small
and all NAM links are utilized
(b) Good mapping. Larger distances and higher risk of routing congestion. All NAM
links are utilized
(c) Bad mapping. Increased number of hops leads to routing congestion. Only one
link per NAM due to static routes
Fig. 4.24 Impact of node scheduling on NAM accessibility (eight nodes N0 to N7, two NAMs)
(a) Optimal scheduling. Both NAMs will be accessed through both links
(b) Suboptimal scheduling. Both NAMs can only be accessed through one link
For CR, libNAM is also responsible for padding data chunks with zeros up to the next 16
Byte boundary, as they would otherwise violate the NAM access granularity.
The NAM address space of 2 GB per NAM can be allocated as a single or multiple
contiguous memory regions. Allocations are granted, managed, and released by a
dedicated NAM manager.
4.7.3 NAM Manager
Before a user application can access a NAM it must obtain an allocation. These
allocations are managed by the NAM manager. It is implemented as a system service
which returns a handle upon a successful allocation request. This process is depicted
in Figure 4.25. Even if a checkpoint or restart process is running, any non-allocated
NAM address space can still be allocated and read or written when CR is not using the
full memory address space. However, only one CR process may be running at a time.
Fig. 4.25 NAM manager interaction: A job requests space on a NAM via the NAM
manager. Allocations are either shared or exclusive and may be used to read
and write a NAM until released
4.8 NAM Summary
This chapter introduced the NAM hardware prototype and described the implemen-
tation and individual functional units of the FPGA design in detail. A theoretical
analysis of the estimated performance identiﬁed the expected bottlenecks, and there are
several recommendations for improving these. In particular, the EXTOLL FPGA link
implementation for single link operation and the NAM protocol conversion units for
two link operation require optimization. The estimated performance will be validated
with in-system measurements using real hardware in the next chapter.
Chapter 5
NAM Performance Evaluation
This chapter presents the performance of the NAM prototype using real hardware
setups. Various microbenchmarks were executed to characterize bandwidth and latency
for reading, writing, and Checkpoint/Restart. One example application was run on
the DEEP-ER SDV to cover the full set of functionality under real world conditions.
5.1 Read/Write Microbenchmark Results
A basic PUT/GET microbenchmark was executed to measure the read, write, and
simultaneous read/write performance. All of the figures below use a logarithmic base 4
scale on the x-axis, with abbreviated labels for the message sizes from 16 Byte to
1 GB. To amortize initial software overhead, each message size is requested
5000 times and the time between start and completion is measured. In reference to
the theoretical NAM performance evaluation in Section 4.4, the figures also depict
the theoretical bandwidth limit where applicable. The presented results highlight the
bandwidth and latency for a single NAM link, and the bandwidth for accessing both
NAM links at the same time.
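The measurement scheme can be sketched as below. This is only an outline of the
timing method, not the benchmark used in this work: `transfer` is a hypothetical
callable standing in for a blocking PUT or GET, and the size list assumes one
measurement per tick of the base 4 x-axis.

```python
import time


def message_sizes(first=16, last=1 << 30, factor=4):
    """Message sizes from 16 Byte to 1 GB in steps of factor 4 (assumed tick positions)."""
    sizes = []
    size = first
    while size <= last:
        sizes.append(size)
        size *= factor
    return sizes


def measure_bandwidth(transfer, size_bytes, repetitions=5000):
    """Time `repetitions` back-to-back transfers and return the bandwidth in MB/s.

    `transfer` is a placeholder for a blocking PUT or GET of `size_bytes`.
    """
    start = time.perf_counter()
    for _ in range(repetitions):
        transfer(size_bytes)
    elapsed = time.perf_counter() - start
    return size_bytes * repetitions / elapsed / 1e6


if __name__ == "__main__":
    # Stand-in transfer: a local memory copy, used only to make the sketch runnable.
    source = bytes(4096)
    print(measure_bandwidth(lambda n: bytearray(source[:n]), 4096, repetitions=1000))
```

Timing the whole batch rather than each request spreads the one-time start-up cost
over all repetitions.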
Fig. 5.1 Single link PUT/GET bandwidth
[Plot: bandwidth [MB/s] over message size [Byte], 16 Byte to 1 GB; series: RMA PUT,
RMA GET, Theoretical PUT MAX, Theoretical GET MAX]
5.1.1 Single Link Performance
In a first measurement series, one NAM was connected via one link to an EXTOLL
ASIC. The results for bandwidth are presented first.
5.1.1.1 Bandwidth
Figure 5.1 shows the PUT and GET bandwidth for one NAM link. It can be seen that
the achievable bandwidth depends on the message size. Larger messages initiated by a
software descriptor lead to less software overhead and network descriptor translation
effort. The PUT bandwidth peaks at 8.25 GB/s, close to the theoretical limit of 8.31
GB/s. The performance of GET requests, on the other hand, is surprisingly low at 5.5
GB/s, about 3 GB/s less than the theoretical limit of 8.58 GB/s. To understand this
behavior, it is necessary to recap some of the lessons learned in the previous chapter.
The operating frequency of the EXTOLL FPGA link in the NAM has been identified
as an important factor that affects performance. To quantify its impact on PUT and
GET bandwidth, two additional measurements were conducted, covering three different
frequencies in total. As Figure 5.2 shows, the maximum bandwidth correlates with the
frequency. The impact on GET requests is slightly higher than on PUTs, which are
already close
Fig. 5.2 Single link PUT/GET bandwidth in dependency of the NAM EXTOLL Link
operating frequency clk_extoll
[Two plots: bandwidth [MB/s] over message size [Byte], 256K to 16M; series: clk_extoll
at 200 MHz, 180 MHz, 160 MHz. (a) PUT (Y-axis range: 7000-8500 MB/s),
(b) GET (Y-axis range: 4000-5500 MB/s)]
to the theoretical maximum. The figures show that decreasing the frequency of
the EXTOLL link, which was implemented at 200 MHz, does not substantially affect
bandwidth. This leads to the conclusion that increasing the frequency alone would not
be sufficient to bring GET performance to its maximum.
In a second measurement the NAM was configured to utilize all four EXTOLL Virtual
Channels (VCs) for GET responses. Figure 5.3 shows that using more VCs takes the
GET bandwidth close to its theoretical limit, still with a slight dependency on the
EXTOLL FPGA link operating frequency. The reason why PUT and GET are affected
unequally by the credit limitation is that the NAM may only use 31 credits per VC to
send packets (GET responses), while an ASIC has up to 58 for requests (PUTs) to the
NAM (see Section 4.2.5 and Section 4.3.3.2). Although the usage of multiple VCs is
highly recommended, the current implementation of libNAM is not capable of handling
more than one VC. Using multiple VCs can cause responses to return out of order, which
must be handled separately.
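The effect of a credit limit can be illustrated with a generic window model: a sender
that holds a fixed number of credits can only keep that many units in flight, so its
sustained bandwidth is bounded by the credit window divided by the credit round-trip
time. The sketch below is not derived from the EXTOLL specification. The credit
count of 31 per VC, the four VCs, and the 8.58 GB/s limit come from the text;
`bytes_per_credit` and `credit_round_trip_s` are arbitrary placeholder values chosen
only to make the example runnable.

```python
def credit_limited_bandwidth(credits, bytes_per_credit, credit_round_trip_s,
                             link_limit=float("inf")):
    """Upper bound on sustained bandwidth [Byte/s] for a credit-based sender.

    Generic window model: at most `credits` units are in flight per round trip.
    """
    window_bandwidth = credits * bytes_per_credit / credit_round_trip_s
    return min(window_bandwidth, link_limit)


# Placeholder assumptions, NOT measured or specified values:
BYTES_PER_CREDIT = 64        # hypothetical amount of data covered by one credit
CREDIT_ROUND_TRIP_S = 400e-9 # hypothetical time until a used credit is returned
GET_LIMIT = 8.58e9           # theoretical single link GET limit from the text

one_vc = credit_limited_bandwidth(31, BYTES_PER_CREDIT, CREDIT_ROUND_TRIP_S, GET_LIMIT)
four_vcs = credit_limited_bandwidth(4 * 31, BYTES_PER_CREDIT, CREDIT_ROUND_TRIP_S,
                                    GET_LIMIT)
print(one_vc, four_vcs)
```

In this model, adding VCs enlarges the credit window until the link itself becomes
the limit, which matches the qualitative behavior described above.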
5.1.1.2 Latency
The NAM PUT and GET latency measurements are depicted in Figure 5.4. The
lowest NAM access latency is 1.8 µs for PUT and 2.8 µs for GET requests.
The latency increases with larger packet sizes. The actual
numbers and a breakdown of the individual latency contributors can be found in
Table 5.1. The values for the ASIC to ASIC communication reference were taken
from [49]. It can be seen that accesses to the NAM have a similar, yet slightly lower
Fig. 5.3 Single link GET bandwidth with four Virtual Channels
[Plot: bandwidth [MB/s] over message size [Byte], 16 Byte to 1 GB; series: clk_extoll
at 200 MHz, 180 MHz, 160 MHz, and 200 MHz Single VC]
Fig. 5.4 Single link PUT/GET latency
[Plot: latency [µs] over message size [Byte], 16 Byte to 16k; series: RMA PUT, RMA GET]
Table 5.1 Overall ASIC-NAM and ASIC-ASIC PUT and GET latencies. Breakdown by
sub-operation and functional unit delays

Sub-operation /          Delay   # for PUT   # for PUT   # for GET   # for GET
functional unit          [ns]    ASIC-ASIC   ASIC-NAM    ASIC-ASIC   ASIC-NAM
Software Overhead        300     1           1           1           1
PIO Write                150     1           1           1           1
ATU Translation          70      2           1           2           1
DMA Read                 350     1           1           1           0
RMA Unit Delay           50      2           1           3           2
Network Trip ASIC-ASIC   650     1           0           2           0
Network Trip ASIC-NAM    700     0           1           0           2
DMA Write                200     1           0           2           1
NAM Logic Delay          80      0           1           0           2
HMC Read                 200     0           0           0           1
HMC Write                80      0           1           0           0
Overall Latency [ns]             1890        1780        2790        2780
latency than packets between two EXTOLL ASICs. The network trip latency for NAM
accesses is only slightly higher although the FPGA operating frequency is much lower.
The reason for this is that the NAM does not have a network crossbar, which saves
part of the delay that exists in the ASIC. For PUT requests, an additional DMA write
and ATU translation are avoided, but latency through the slow FPGA clock domains
is added. Although GET requests require two network trips to complete (round trip,
to the NAM and back), their latency is far lower than that of two PUTs. This can be
explained by the fact that several delay contributors, such as the software overhead,
appear only once. Unfortunately, the advantage of avoiding PCIe for DMA reads on
the NAM is nullified by the high delays in the FPGA.
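The overall latencies in Table 5.1 are weighted sums of the per-unit delays. The
following sketch recomputes them from the delays and occurrence counts listed in the
table; the variable and function names are chosen for illustration only.

```python
# Per-unit delays in ns, in the row order of Table 5.1.
DELAY_NS = {
    "Software Overhead": 300, "PIO Write": 150, "ATU Translation": 70,
    "DMA Read": 350, "RMA Unit Delay": 50, "Network Trip ASIC-ASIC": 650,
    "Network Trip ASIC-NAM": 700, "DMA Write": 200, "NAM Logic Delay": 80,
    "HMC Read": 200, "HMC Write": 80,
}

# Occurrence counts per table column, in the same row order as DELAY_NS.
COUNTS = {
    "PUT ASIC-ASIC": [1, 1, 2, 1, 2, 1, 0, 1, 0, 0, 0],
    "PUT ASIC-NAM":  [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1],
    "GET ASIC-ASIC": [1, 1, 2, 1, 3, 2, 0, 2, 0, 0, 0],
    "GET ASIC-NAM":  [1, 1, 1, 0, 2, 0, 2, 1, 2, 1, 0],
}


def overall_latency_ns(path):
    """Sum of delay times occurrence count over all rows for one table column."""
    return sum(d * n for d, n in zip(DELAY_NS.values(), COUNTS[path]))


for path in COUNTS:
    print(path, overall_latency_ns(path))
```

With the counts as listed, the first three columns reproduce the stated totals of
1890 ns, 1780 ns, and 2790 ns. The GET ASIC-NAM column sums to 2580 ns, which is
200 ns below the stated 2780 ns; the table does not indicate which contributor
accounts for the difference.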
5.1.2 Two Link PUT/GET Bandwidth
To measure the performance of both NAM links simultaneously, two EXTOLL ASICs
were connected with one link each to a NAM. The MPI benchmark executes sequential
reading and writing and the ﬁnal result is calculated by aggregating the individual
Fig. 5.5 Two link PUT/GET bandwidth
[Plot: bandwidth [MB/s] over message size [Byte], 16 Byte to 1 GB; series: RMA PUT,
RMA GET, Theoretical PUT MAX, Theoretical GET MAX]
bandwidths. The results are depicted in Figure 5.5. While the PUT bandwidth peaks
at the theoretical limit of 10.35 GB/s, the GET response bandwidth again suffers
from the credit limitation of the EXTOLL FPGA links. Eventually the total GET
bandwidth on two links settles at 10.15 GB/s, approximately twice the single link
bandwidth.
5.1.3 Analysis and Improvements
The estimated values and actual measurement results for PUT and GET operations are
summarized in Table 5.2. Based on this comparison and the ﬁndings in the previous
sections, two key observations and improvement recommendations can be derived.
PUT performance is as expected: While the EXTOLL Link itself is the bottleneck
in single link operation, the bandwidth limiting component shifts to the NAM
internal HTL functional unit when using two EXTOLL Links. Both measured
values, however, match the theoretical estimation. Currently the HTL converts
a large RMA packet to multiple, smaller sized HMC packets, and the splitting is
subject to various restrictions. A possible solution to increase the HTL performance
is to replace the packet conversion logic with a cache-like unit, for example a
Table 5.2 NAM bandwidth comparison: estimated versus actual measured

Operation      Bandwidth Estimated [GB/s]   Bandwidth Measured [GB/s]
PUT            8.31                         8.27
PUT 2 Links    10.35                        10.35
GET            8.58                         5.54
GET 2 Links    12.62                        10.15
set-associative cache with a line size of 496 Byte. The NAM will then be able to
select and write out larger HMC packets more efficiently. Other characteristics
such as the cache eviction strategy are implementation specific and can be chosen
to optimize for area, bandwidth, or power consumption. The latter option
would not only reduce the FPGA power footprint but also HMC dynamic power,
which is a significant fraction of its overall consumption.
GET performance falls short of expectations: For single link operation a bandwidth
drop of 3 GB/s below the theoretical limit can be observed. The cause
for this lack of performance has been identified as the FPGA EXTOLL Link
operating frequency and, even more critically, the number of credits available
for packet transmission. Hence, future ASIC link implementations should
increase the credit count to enable best performance with FPGAs and other
devices that run on slower clocks. It is the easiest way to further scale the
bandwidth, as increasing the operating frequencies will cause implementation
issues in the FPGA. Alternatively, the NAM could be implemented as an ASIC
to eliminate all of the issues mentioned above.
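The two observations can be quantified directly from Table 5.2. The following
sketch computes the ratio of measured to estimated bandwidth for each operation;
the names are illustrative and the values are copied from the table.

```python
# (estimated, measured) bandwidth in GB/s, copied from Table 5.2.
TABLE_5_2 = {
    "PUT": (8.31, 8.27),
    "PUT 2 Links": (10.35, 10.35),
    "GET": (8.58, 5.54),
    "GET 2 Links": (12.62, 10.15),
}


def efficiency(operation):
    """Measured bandwidth as a fraction of the estimated bandwidth."""
    estimated, measured = TABLE_5_2[operation]
    return measured / estimated


def gap_gb_s(operation):
    """Absolute shortfall of the measurement against the estimate in GB/s."""
    estimated, measured = TABLE_5_2[operation]
    return estimated - measured


for op in TABLE_5_2:
    print(f"{op}: {efficiency(op):.1%} of estimate, gap {gap_gb_s(op):.2f} GB/s")
```

PUT reaches about 99.5 % of the estimate on one link and 100 % on two links, whereas
GET reaches about 64.6 % on one link (a gap of 3.04 GB/s) and about 80.4 % on two
links.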
5.2 Checkpoint/Restart
The NAM CR functionality to speed up the creation of parity checkpoints has been
evaluated in the DEEP-ER SDV. Here, two NAMs are connected with both links to
a 16-node torus type network. The logical NAM placement within the topology was
carefully chosen to balance out the distances from each node to the nearest NAM. A
set of microbenchmarks was implemented to independently evaluate the performance
of creating and restarting from checkpoints. These measurements are complemented
by an application benchmark with one of the DEEP-ER applications.
5.2.1 Microbenchmark Results
The following set of microbenchmarks analyzes the bandwidth of the NAM CR and its
scaling behavior in the DEEP-ER SDV from 1 to 16 nodes with 4 processes each. The
checkpoint sizes range from 4 KB up to 2 GB per node. The benchmarks directly call
libNAM CR functions without involving an additional layer such as SIONlib, and each
process is treated as an independent rank. Hence a maximum of 64 checkpoints are
created and evenly assigned to both NAMs (maximum 32 checkpoints per NAM).
5.2.1.1 Checkpointing
The ﬁrst benchmark measures the overall bandwidth for creating XOR parity check-
points. A root process conﬁgures the NAM CR unit and distributes the job to all
participating ranks. Each rank then creates a checkpoint and informs the NAM in
order to fetch the data and generate the parity. The bandwidth measurement is started
as soon as the MPI job starts and stopped when all ranks have received a notiﬁcation
that the parity has been generated. The actual checkpointing bandwidth is calculated
using the total amount of data that has been processed divided by the time the process
took, which includes MPI start-up times and synchronization. The results of this
benchmark are depicted in Figure 5.6. It can be seen that the bandwidth scales with
the number of available nodes.
For one participating node, only one NAM is utilized and only one link of this NAM is
accessed since there is a static route between the two endpoints. The resulting peak
bandwidth is 6.2 GB/s, which is less than what has been measured for PUT requests
from a node to the NAM. This is surprising, as the NAM issues GET requests, and GET
responses traveling back to the NAM are very similar to PUTs with respect to how
they are handled by the EXTOLL network. The reason for this disparity is software
synchronization overhead and the generation of the XOR parity, which is then also
written to the HMC. It is reasonable to include this overhead in the measurements
since it is part of the overall CR process.
With two nodes the effective bandwidth already more than doubles to 14 GB/s,
as now both NAMs are involved and the software overhead remains at a comparable
level. Adding more nodes to the checkpointing process eventually leads to a bandwidth
saturation at 24.8 GB/s with 16 nodes. At first glance this result is surprising, as it
implies that the bandwidth per NAM, assuming an equal distribution, is
(24.8 GB/s) / (2 NAMs) = 12.4 GB/s.
This is higher than what has been measured for writing data to a NAM via both links.
Fig. 5.6 XOR checkpointing bandwidth with 2 NAMs in the DEEP-ER SDV. 4 processes
per node with one checkpoint per process
[Plot: bandwidth [MB/s] over checkpoint size per node [Byte], 8K to 2G; series: number
of nodes 1, 2, 4, 8, 16 (4 processes each)]
However, the theoretical NAM bandwidth analysis in Section 4.4 pointed out that
the bottleneck for two link operation sits in the HTL protocol conversion logic. In
the case of Checkpoint/Restart this module is completely avoided except for the task
of writing out the XOR parity to the HMC. All other data is directed to the CR layer,
which operates at a higher throughput (17.54 GB/s) than two EXTOLL links can
deliver (16.62 GB/s). Achieving even higher bandwidths for checkpointing remains
difficult due to the inherent overhead of generating and storing the XOR parity, and
process synchronization among participating nodes.
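The ordering of the bandwidth figures quoted in this section can be checked with a
few lines of arithmetic. All values are taken from the text and Table 5.2; the
variable names are illustrative.

```python
TOTAL_CHECKPOINT_BW = 24.8   # GB/s, saturation with 16 nodes and 2 NAMs
NUM_NAMS = 2
TWO_LINK_PUT_MEASURED = 10.35  # GB/s, limited by the HTL (Table 5.2)
TWO_LINK_RAW = 16.62           # GB/s, what two EXTOLL links can deliver
CR_LAYER = 17.54               # GB/s, throughput of the CR layer

per_nam = TOTAL_CHECKPOINT_BW / NUM_NAMS
print(f"per-NAM checkpoint bandwidth: {per_nam:.1f} GB/s")

# Checkpointing exceeds the HTL-limited PUT bandwidth but stays below the raw
# link bandwidth, which in turn is below the CR layer throughput.
print(TWO_LINK_PUT_MEASURED < per_nam < TWO_LINK_RAW < CR_LAYER)
```

The per-NAM value of 12.4 GB/s corresponds to roughly 75 % of the 16.62 GB/s that
two links can deliver.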
5.2.1.2 Restart
Benchmarking a restart requires that an XOR parity has already been generated. Hence,
a checkpoint is first created following the scheme presented in the previous section. The
bandwidth measurement is started as soon as the root process informs the NAM that
a rank failure has occurred and stopped after the failed rank has entirely retrieved its
missing checkpoint. Figure 5.7 shows that restart scales similarly to checkpointing for
an increasing number of participating nodes. The resulting bandwidths, however, are
Fig. 5.7 XOR restart bandwidth with 2 NAMs in the DEEP-ER SDV. 4 processes per node
with one checkpoint per process
[Plot: bandwidth [MB/s] over checkpoint size per node [Byte], 8K to 2G; series: number
of nodes 1, 2, 4, 8, 16 (4 processes each)]
consistently lower than for checkpointing. The reason for this behavior is the additional
read process to fetch the missing checkpoint after reconstruction has finished.
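The principle behind XOR parity checkpointing and restart can be shown in software.
The sketch below is a plain Python illustration of the XOR scheme and says nothing
about how the NAM CR unit is implemented in hardware; function names are
illustrative, and all checkpoints are assumed to have equal (padded) length.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def xor_parity(checkpoints):
    """Parity over all checkpoints of one XOR set (equal sizes assumed)."""
    if len({len(c) for c in checkpoints}) != 1:
        raise ValueError("all checkpoints in a set must have the same size")
    return reduce(xor_blocks, checkpoints)


def reconstruct(surviving, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_blocks, surviving, parity)


# Four ranks with 16 Byte checkpoints; rank 2 fails after the parity was created.
checkpoints = [bytes([rank + 1] * 16) for rank in range(4)]
parity = xor_parity(checkpoints)
survivors = checkpoints[:2] + checkpoints[3:]
print(reconstruct(survivors, parity) == checkpoints[2])  # True
```

Restart touches every surviving checkpoint plus the parity and then delivers the
result to the failed rank, which is the extra read step mentioned above.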
5.2.1.3 Impact of XOR Set Mapping on CR Performance
One important property that affects CR performance is the assignment of nodes to an
XOR set, or more specifically, the mapping of ranks to one of the two NAMs. The libNAM
library currently maps nodes to a set in pseudo-random fashion, and the actual topology
and routing setup is not considered. As Section 4.7.2 highlighted, there exist good and
bad mappings for the same node/routing/NAM setup. The measurements so far were
executed with manually assigned XOR sets. This is reasonable for a system such as the
DEEP-ER SDV. For larger systems and many different applications, however, it is up
to libNAM to form these sets. Therefore, it is necessary to measure the performance
impact of the mapping scheme.
Figure 5.8 compares the checkpointing bandwidth for two different mappings with
4 nodes. It shows that the potential performance loss for a bad mapping scheme is
significant. With the current libNAM implementation and without any
additional effort, the best mapping is therefore not guaranteed. In
Fig. 5.8 Impact of XOR set to NAM mapping on achievable bandwidth
[Plot: bandwidth [MB/s] over checkpoint size per node [Byte], 8K to 2G; node to NAM
mapping impact with 4 nodes; series: Good mapping, Bad mapping]
addition, the job scheduler may make a bad mapping inevitable. In
this case the user is responsible for reserving nodes whose routing is guaranteed to
target all available NAM links.
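Whether a given node selection targets all NAM links can be checked with a simple
load count. The sketch below is illustrative only: the link names and routing
tables are hypothetical and do not reflect the SDV routing, and the check is not
part of libNAM or EMP.

```python
from collections import Counter

# Hypothetical link identifiers for two NAMs with two links each.
LINKS = ["nam0.link0", "nam0.link1", "nam1.link0", "nam1.link1"]


def link_load(routes):
    """Number of nodes whose static route ends on each NAM link."""
    return Counter(routes.values())


def covers_all_links(routes, links=LINKS):
    """True if every available NAM link is targeted by at least one node."""
    return set(routes.values()) == set(links)


# Hypothetical routing tables for 4 nodes (node name -> NAM link it is routed to).
good = {"node0": "nam0.link0", "node1": "nam0.link1",
        "node2": "nam1.link0", "node3": "nam1.link1"}
bad = {"node0": "nam0.link0", "node1": "nam0.link0",
       "node2": "nam1.link0", "node3": "nam1.link0"}

print(covers_all_links(good), dict(link_load(good)))
print(covers_all_links(bad), dict(link_load(bad)))
```

In the second table two links remain idle while the other two each carry the
traffic of two nodes, which is the kind of imbalance a topology-aware mapping
would avoid.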
5.2.2 Application Performance
One DEEP-ER application was selected to compare the NAM checkpoint/restart
approach with the existing SIONlib-Buddy checkpointing scheme.
iPic3D [103] is a space weather application developed by the Katholieke Universiteit
(KU) Leuven. It is meant to deepen the understanding and increase the forecasting
accuracy of the impact of solar emissions on the Earth's weather. The application
itself operates on two distinct items: computation-intensive particle operations, and
communication-dominated inter-particle field calculations. This suits the
cluster-booster architecture of the DEEP-ER system well and makes iPic3D a good
candidate to prove its concept and also to evaluate the NAM as a CR target.
In order to run benchmarks, the application code was modified to only execute the
checkpointing portion, leaving out actual computation because it is irrelevant for the
measurements. iPic3D operates on particle and cell datatypes, where each cell is
approximately 64 KB in size and consists of 1024 particles. Each scenario, NAM-XOR¹
and SIONlib-Buddy², was run for various checkpoint sizes on 2 to 16 nodes with four
processes per node and a total of 2 XOR sets. Participating nodes were evenly assigned
to the two NAMs. Runs were executed 20 times, with 10 checkpoints per iteration,
and the total runtime was taken to even out measurement errors.
Two types of scalability were evaluated: weak scaling, which means that the problem
size linearly increases with the number of nodes, and strong scaling, where the problem
size stays constant but the number of processes and nodes varies.
5.2.2.1 Weak Scaling
The first set of benchmarks measures the weak scaling behavior. Checkpoint sizes
range from a total of 64 MB per node (16 MB per process) up to 2 GB per node (512 MB
per process), which is the maximum NAM checkpoint size. The results are depicted
in Figure 5.9 and the values reflect the average time over 20 runs. The best case
runtimes are slightly better and the worst case runtimes may be significantly higher
due to file system and network congestion. Note that the Y-axis range differs between
the plots.
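The checkpoint sizes follow from the cell size and the process count. The sketch
below reproduces the sizes named in the text and in the panel labels of Figures 5.9
and 5.10, assuming exactly 64 KB per cell (the text states approximately 64 KB) and
four processes per node. Function names are illustrative.

```python
CELL_BYTES = 64 * 1024       # approximately 64 KB per cell (from the text)
PROCESSES_PER_NODE = 4
MB = 1024**2
GB = 1024**3


def weak_scaling_bytes_per_node(cells_per_process):
    """Checkpoint size per node when every process holds a fixed number of cells."""
    return cells_per_process * CELL_BYTES * PROCESSES_PER_NODE


def strong_scaling_bytes_per_node(total_cells, nodes):
    """Checkpoint size per node when a fixed dataset is spread over all nodes."""
    return total_cells * CELL_BYTES // nodes


print(weak_scaling_bytes_per_node(256) // MB)          # 64 (MB per node)
print(weak_scaling_bytes_per_node(8192) // GB)         # 2 (GB per node)
print(strong_scaling_bytes_per_node(65536, 2) // GB)   # 2 (GB per node on 2 nodes)
```

In the weak scaling case the per-node size is independent of the node count, while
in the strong scaling case it shrinks as nodes are added.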
The results clearly show an advantage with the NAM approach and the achievable
speed-ups range from 1.06X to approximately 2.1X, i.e. with two NAMs in the system,
checkpoints may be created up to 2.1 times faster than with SIONlib-Buddy. For a given
dataset size per node it can be seen that the runtimes on the NAM remain almost
constant. This is an indication that the NAM internal CR request mechanism is
well-balanced. It also shows that two links per NAM provide sufficient bandwidth
for at least 8 nodes. However, at a certain point two NAMs alone will no longer be
able to keep up with the link bandwidth of additional nodes and more NAMs should be
added to the system.
Noticeably, the speed-up continually increases with larger checkpoint sizes. The reason
for this is that with larger checkpoints, SIONlib-Buddy writes more and more data to
the local NVMe drives which is much slower than moving the data over EXTOLL to
the NAM. It is expected that the achievable speed-up will further increase for larger
datasets but such measurements were not feasible due to the NAM memory capacity
of 2 GB.
1 For more information on how the NAM creates checkpoints see Section 4.5.4 and Section 4.5.6
2 For more information on how SIONlib-Buddy creates checkpoints see Section 4.5.1
Fig. 5.9 Xpic3d application performance comparison for weak scaling: NAM-XOR versus
SIONlib-Buddy. Note the variable Y-Axes
[Six plots: runtime [s] over number of nodes (2, 4, 8, 16; 4 processes per node);
series: NAM XOR, SIONlib Buddy]
(a) 256 cells per process / 64 MB checkpoint size per node. Average speed-up: 1.06 X
(b) 512 cells per process / 128 MB checkpoint size per node. Average speed-up: 1.2 X
(c) 1024 cells per process / 256 MB checkpoint size per node. Average speed-up: 1.38 X
(d) 2048 cells per process / 512 MB checkpoint size per node. Average speed-up: 1.57 X
(e) 4096 cells per process / 1 GB checkpoint size per node. Average speed-up: 1.87 X
(f) 8192 cells per process / 2 GB checkpoint size per node. Average speed-up: 2.1 X
Fig. 5.10 Xpic3d application performance comparison for strong scaling: NAM-XOR versus
SIONlib-Buddy. Note the variable Y-Axes
[Four plots: runtime [s] over number of nodes (2, 4, 8, 16; 4 processes per node);
series: NAM XOR, SIONlib Buddy]
(a) 8192 cells / 512 MB total dataset size. Average speed-up: 1.37 X
(b) 16384 cells / 1 GB total dataset size. Average speed-up: 1.44 X
(c) 32768 cells / 2 GB total dataset size. Average speed-up: 1.65 X
(d) 65536 cells / 4 GB total dataset size. Average speed-up: 1.9 X
5.2.2.2 Strong Scaling
The strong scaling behavior was measured on 2 to 16 nodes on the SDV. Figure 5.10
depicts the results for 4 different dataset sizes, where the largest dataset is 4 GB in
total, or 2 GB per node when run on two nodes. This limitation is again due to the
limited NAM capacity. As for weak scaling, NAM-XOR performs better and achieves
speed-ups from 1.37X to 1.9X. It can be seen that the runtimes for the NAM
significantly decrease from 2 to 4 nodes. The reason for this is that with two nodes,
both NAMs may be accessed, but only via one link each. Already with 4 nodes all NAM
links are utilized³. From this point the performance remains mostly constant, with
only a very slight decrease since the inter-node MPI communication and NAM
configuration overhead increases.
3 Node selection is often handled by the job scheduler. A proper selection is critical for the NAM
performance. See also Section 4.7.2.
5.3 Performance Summary
This chapter evaluated the NAM prototype and software stack using micro- and
application benchmarks in real system setups. It was shown that read and write
operations to the NAM perform reasonably well, although several limitations
discussed in the theoretical performance evaluation were observed. The
EXTOLL Link FPGA implementation has to be improved to further increase the
single link performance, in particular for GET requests. This affects the operating
frequency constraints and, more importantly, the credit-based flow control mechanism.
Two-link read/write performance, on the other hand, suffers from protocol translation
requirements and overhead, and other approaches need to be developed to reduce this
penalty.
It was pointed out that the NAM access latency is very close to, yet slightly better
than, what can be achieved between two EXTOLL ASICs. Increasing the operating
frequency in all parts of the design would ultimately put the NAM at an advantage.
For an ASIC implementation it is expected that both bandwidth and latency would
significantly improve.
An additional set of microbenchmarks was executed to measure the Checkpoint/Restart
performance. It became clear that checkpoint and restart with two NAMs in the 16
node SDV test system achieve high bandwidth (up to 24.8 GB/s for checkpointing) and
a good scaling behavior. As NAMs can be attached to any unused EXTOLL Link in the
system, the overall memory capacity and bandwidth scale with the system size and the
number of NAMs attached to it. For future use, however, the libNAM library and EXTOLL
EMP application should be made aware of the topology to allow the best possible
node-to-NAM XOR set mapping.
Finally, application benchmarks with one of the DEEP-ER applications showed that
checkpointing using the NAM is superior to SIONlib-Buddy. With a maximum speed-up
of 2.1X for weak scaling and 1.9X for strong scaling, NAM-XOR is able to signiﬁcantly
reduce the overhead of fault tolerance features of today’s and future large-scale systems.

Chapter 6
Conclusion and Outlook
The goal of this work was to develop a hardware prototype that is able to mitigate the
negative effects of three common problems in today's and future large-scale systems.
These are the memory interface and performance, the rapidly increasing amount of
inter-process communication, and fault tolerance. Network Attached Memory was developed
and presented as an innovative solution. It is a dedicated device that speeds up
collective operations and provides shared memory access at network bandwidth. As a
first use case the NAM serves as a target for commonly deployed Checkpoint/Restart
mechanisms. The resulting hardware prototype provides high-performance interconnection
network interfaces and implements the emerging memory technology Hybrid Memory Cube.
The NAM design was fully prototyped in an FPGA and the performance
results show that the NAM is able to speed up the creation of checkpoints in a
16 node test system by a factor of up to 2.1.
The first contribution comprises an overview of memory technology and interface
evolution, and typical communication methods and patterns in distributed systems.
It became clear that the memory interface must be optimized in order to keep up
with the latency and bandwidth requirements of multi-core architectures. Inter-process
and in particular inter-node communication already today take up a large share
of the overall application runtimes. Since the general message passing scheme is not
expected to change in the near future, it is desirable either to reduce the number
of messages that are sent or otherwise to speed up messaging and commonly used
collective operations. The overview is complemented by an introduction to fault
tolerance, which has become a major concern in today's and future large-scale systems.
Checkpointing introduces additional application overhead and most often stresses
the memory interface and the interconnection network due to communication and
synchronization. Hence, fault tolerance will also benefit from optimizing either of
these two components. Finally, power and energy play an increasingly important role,
especially since the total power budget for an Exascale system should not exceed 20
MW and today's fastest supercomputers are already close to this limit. The NAM
can help to increase the energy efficiency, allowing systems to consume less power or
to execute more actual work in a given time period.
The next contribution presented the Hybrid Memory Cube technology and interface as
one approach to overcome the bandwidth, scalability, and power efficiency issues of
the parallel memory interface. The HMC architecture was thoroughly analyzed and the
obtained insights contributed to the development of an HMC host controller, which has
become a popular and widely used open-source initiative. This development enabled
the evaluation of HMC performance and power characteristics in an FPGA test system.
The results clearly emphasize the motivation to adopt serial and abstracted interfaces
to enable memory access parallelism and the independent development of memory
technology and its interface.
The development of the NAM prototype hardware and FPGA design was described in
detail. The PCB was developed as a standard height PCIe form factor card that can
be plugged into common PCIe slots. It provides interfaces to directly connect up to 2
EXTOLL high-performance NICs and integrates a 2 GB HMC device. This development
process was driven by the vision initially formulated in the introduction and the use
case as checkpointing target in the DEEP-ER project. The architectural contributions
comprise an FPGA implementation of the native EXTOLL ASIC network link, protocol
conversion logic, and an application-speciﬁc Checkpoint/Restart functional unit. The
NAM design operates in several clock domains with 200 MHz for the EXTOLL links,
220 MHz for Checkpoint/Restart, and 312.5 MHz for the remaining logic at more than
60 % LUT utilization in the Virtex 7 FPGA. A theoretical performance evaluation
identiﬁed potential bottlenecks and served as reference for the in-system validation.
NAM read/write access, allocations, and Checkpoint/Restart functions are orchestrated
by libNAM, a user-level API derived from existing EXTOLL libraries.
The ﬁnal contribution of this work evaluated the presented hardware and software
components in a real 16 node system. Although the theoretical performance evaluation
and microbenchmark results have identified existing bottlenecks for reading and writing,
checkpointing with the NAM shows a speed-up of up to 2.1X over the SIONlib-Buddy
approach which is currently deployed in the DEEP-ER system.
In summary, the vision that has been formulated in the introduction of this work was
successfully transferred to a hardware prototype. The performance evaluation of the
developed hardware and software components shows that the presented
approach is able to reduce inter-node communication and to relieve the memory
interface in general, while improving state of the art fault tolerance mechanisms of
large-scale systems. These contributions were publicly recognized and
an enhanced NAM device will be developed in the follow-up DEEP-EST (Dynamical
Exascale Entry Platform - Extreme Scale Technologies) project.
6.1 Improvements
For a future implementation of the NAM, several possible improvements for the
following components were identiﬁed:
HMC Link
Due to the limited number of available FPGA transceivers, the HMC evaluation
only covered accesses through 1 out of 4 possible links. A different FPGA with more
transceivers and a suitable PCB would be required to connect more than one link.
Also, the Xilinx Virtex 7 GTH transceivers do not support 15 Gbps line-rates, so a
newer FPGA generation (e.g. Xilinx Ultrascale) device would be required. For both
multi-link and 15 Gbps operation, it is expected that the achievable bandwidth and
energy efficiency will further improve.
NAM Hardware and FPGA Design
The NAM design revealed only a few weaknesses; in particular, the read performance
lags behind expectations. This is mainly due to the fact that the credits in the EXTOLL
network protocol are exchanged too slowly when communicating with an FPGA that
runs at relatively low operating frequencies. The complex FPGA logic prohibits faster
clock domains and the number of credits the EXTOLL ASIC can provide is fixed. Hence,
the only way to alleviate bandwidth throttling is to make use of all available traﬃc
classes as provided by the EXTOLL network protocol.
For writing and reading in 2-link operation, the FPGA design internal protocol
conversion units have been identified as the performance limiting factor. Currently,
EXTOLL network packets are directly translated to (many smaller) HMC packets. The main
issues here are the different access and address granularities and packet sizes. A
cache-like unit would allow writing 'cache-lines' with a maximum size of the network
MTU to an internal buffer array, decoupling the HMC-sided access units, which may
take data and return GET responses in a more efficient way. Although there are many
considerations left open, such as associativity and eviction strategies, this approach
would make protocol translation much easier.
For Checkpoint/Restart, a clear drawback of the NAM is the volatility of the memory
array. Hence, a future NAM implementation should combine fast DRAM access (e.g.
HMC) with a second-level, non-volatile storage such as NAND flash. It is also desirable
to partition the available address space into two equally sized regions so that a
stable copy of the XOR parity is always preserved, even while a new checkpoint is
being created.
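The two-region idea can be sketched as a simple double-buffering scheme: a new parity is always written to the region that does not hold the last complete parity, and the stable marker is switched only after the write has finished. The class below is a hypothetical software model under these assumptions, not the NAM hardware design.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of a non-empty list of equally sized byte strings."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of equal size")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


class ParityStore:
    """Model of a parity memory split into two equally sized regions."""

    def __init__(self):
        self.regions = [None, None]
        self.stable = None  # index of the region holding the last complete parity

    def checkpoint(self, blocks):
        # Write into the region that is NOT stable, so the previous parity
        # survives if this checkpoint is interrupted.
        target = 0 if self.stable is None else 1 - self.stable
        self.regions[target] = xor_blocks(blocks)
        self.stable = target  # switch only after the write is complete

    def rebuild(self, surviving_blocks):
        """Reconstruct the single missing block from the survivors and the parity."""
        if self.stable is None:
            raise RuntimeError("no complete checkpoint available")
        return xor_blocks(list(surviving_blocks) + [self.regions[self.stable]])


if __name__ == "__main__":
    a, b, c = b"\x01\x02", b"\x10\x20", b"\xff\x00"
    store = ParityStore()
    store.checkpoint([a, b, c])
    print(store.rebuild([a, c]) == b)
```

The price of this scheme is that only half of the address space is available for each checkpoint.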
NAM Software
In order to utilize the traffic classes mentioned above, a future libNAM implementation
should issue GET requests on alternating traffic classes. This results in a slightly more
complex software component, as responses may return out of order. However, it also
unlocks the full network bandwidth. A second libNAM optimization is to use the topology
and routing information provided by EMP, which could be used to always create the best
possible XOR set mapping for checkpointing.
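The added software complexity of alternating traffic classes can be sketched as follows: requests are assigned to classes round-robin, and each response is placed into the destination buffer by its offset rather than by its arrival order. The function names and the tuple layout are hypothetical and do not reflect the libNAM API.

```python
def plan_gets(total_bytes, chunk_bytes, num_classes):
    """Return (offset, size, traffic_class) tuples with round-robin class assignment."""
    requests = []
    for index, offset in enumerate(range(0, total_bytes, chunk_bytes)):
        size = min(chunk_bytes, total_bytes - offset)
        requests.append((offset, size, index % num_classes))
    return requests


def reassemble(total_bytes, responses):
    """Place (offset, payload) responses into one buffer, in any arrival order."""
    buffer = bytearray(total_bytes)
    received = 0
    for offset, payload in responses:
        buffer[offset:offset + len(payload)] = payload
        received += len(payload)
    if received != total_bytes:
        raise ValueError("missing or duplicate responses")
    return bytes(buffer)


if __name__ == "__main__":
    data = bytes(range(100))
    requests = plan_gets(len(data), 32, 4)
    print(requests)
    # Simulate responses arriving in reverse order.
    responses = [(off, data[off:off + size]) for off, size, _ in reversed(requests)]
    print(reassemble(len(data), responses) == data)
```

Because every response carries its own destination offset, no ordering guarantee between traffic classes is needed.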
6.2 Outlook
The NAM approach will soon be taken to the next level in the European-funded
DEEP-EST project. Here, the existing NAM prototype will be extended by multiple
terabytes of non-volatile memory. Additionally, a newer and larger FPGA will be
introduced to accommodate more complex processing capabilities. A more general use
case targeting MPI collective operations will motivate even more application developers
to make use of this novel concept. A performance boost is expected in particular for
applications that make use of non-blocking collective operations, which are supported
by the latest MPI 3 standard. These operations allow host processors to continue
program execution while the NAM collects the required information, carries out the
collective operation, and then redistributes the results.
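The overlap pattern behind non-blocking collectives (start the operation, keep computing, wait for the result) can be emulated with a small sketch. This is not MPI code: a worker thread stands in for the device that performs the collective, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def offloaded_allreduce_sum(contributions):
    """Stand-in for the offloaded collective: elementwise sum over all contributions."""
    return [sum(column) for column in zip(*contributions)]


def run_step(contributions, local_work):
    """Start the collective, do independent local work, then wait for the result."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(offloaded_allreduce_sum, contributions)  # start
        local_result = local_work()  # host continues program execution
        reduced = pending.result()   # wait for completion
    return local_result, reduced


if __name__ == "__main__":
    per_node_data = [[1, 2], [3, 4], [5, 6]]
    print(run_step(per_node_data, lambda: sum(range(1000))))
```

The benefit depends on how much independent local work is available between starting the collective and waiting for it.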

Appendix A
Acronyms
AMC Active Memory Cube
API Application Programming Interface
ASIC Application-Specific Integrated Circuit
ATU Address Translation Unit
BE Bandwidth Engine
CMOS Complementary Metal Oxide Semiconductor
CPU Central Processing Unit
CR Checkpoint/Restart
CRC Cyclic Redundancy Check
DDR Double Data Rate
DEEP Dynamical Exascale Entry Platform
DEEP-ER Dynamical Exascale Entry Platform - Extended Reach
DEEP-EST Dynamical Exascale Entry Platform - Extreme Scale Technologies
DIMM Dual In-line Memory Module
DMA Direct Memory Access
DOE U.S. Department of Energy
DRAM Dynamic Random-Access Memory
ECC Error Correction Code
EMP EXTOLL Management Program
EOP End Of Packet
FIFO First In - First Out
FLIT Flow Unit
FPGA Field Programmable Gate Array
FRP Forward Retry Pointer
FTI Fault Tolerant Interface
GDDR Graphics Double Data Rate
HBM High Bandwidth Memory
HMC Hybrid Memory Cube
HMCC Hybrid Memory Cube Consortium
HPC High Performance Computing
HTL HMC Transaction Layer
I2C Inter-Integrated Circuit
IC Integrated Circuit
I/O Input/Output
IP Intellectual Property
JTAG Joint Test Action Group
KNL Intel Knights Landing
LED Light-Emitting Diode
LGPL Lesser General Public License
LPDDR Low Power Double Data Rate
LUT LookUp Table
MCDRAM Multi Channel DRAM
MPI Message Passing Interface
MTBF Mean Time Between Failures
MTU Maximum Transmission Unit
MUX Multiplexer
NDC Near-Data Computing
NIC Network Interface Controller
NAM Network Attached Memory
NAND NAND Flash Memory
NTL Network Transaction Layer
NVMe Non-volatile Memory Express
PCB Printed Circuit Board
PCIe PCI Express
PIM Processing In Memory
PFS Parallel File System
RAM Random-Access Memory
RAS Reliability, Availability and Serviceability
RF Register File
RMA Remote Memory Access or Remote Memory Architecture
RRA Remote Register File Access
RRP Return Retry Pointer
SCR Scalable Checkpoint / Restart
SDV Software Development Vehicle
SerDes Serializer / Deserializer
SOP Start Of Packet
SRAM Static Random-Access Memory
SSD Solid State Drive
TSV Through Silicon Via
VC Virtual Channel
VPID Virtual Process Identifier
XBAR Crossbar

Appendix B
List of figures
Chapter 1: Introduction
1.1 TOP500 number 1 system performance and power development
1.2 NAM Vision: Reduce communication and offload processor computation
Chapter 2: State of the Art
2.1 Historical trend of the processor-memory gap
2.2 Energy cost for data movement across different layers
2.3 Example MPI operations
2.4 Hardware failure breakdown by component
2.5 SCR-Partner checkpointing scheme
2.6 SCR XOR checkpointing example
Chapter 3: Hybrid Memory Cube
3.1 HMC architecture overview
3.2 HMC logic layer top view: schematic representation
3.3 Close-up view of an HMC stack
3.4 HMC chain example 1
3.5 HMC chain example 2
3.6 HMC + NAND heterogeneous memory subsystem example
3.7 HMC protocol FRP and RRP exchange loop
3.8 Retry pointer loop time contributors
3.9 Experimental test setup
3.10 Impact of read/write ratio on bandwidth with 128 Byte requests
3.11 Impact of different request sizes on the optimum read/write ratio
3.12 128 Byte request ratio sweep results: theoretical versus measured
3.13 Effective bandwidth at 10 Gbps
3.14 Effective bandwidth at 12.5 Gbps
3.15 Host to HMC read latency contributors
3.16 Host to HMC read latency at 10 Gbps
3.17 Host to HMC read latency at 12.5 Gbps
3.18 Megaupdates/second versus address range at 10 Gbps
3.19 Megaupdates/second versus address range at 12.5 Gbps
3.20 HMC power consumption at 10 Gbps
3.21 HMC power consumption at 12.5 Gbps
3.22 HMC energy efficiency at 10 Gbps
3.23 HMC energy efficiency at 12.5 Gbps
Chapter 4: Network Attached Memory
4.1 DEEP-ER System Overview
4.2 EXTOLL Tourmalet ASIC
4.3 EXTOLL Tourmalet ASIC Block Diagram
4.4 EXTOLL Link gearbox
4.5 EXTOLL PUT/GET operations and notification mechanism
4.6 NAM Prototype Board 'Aspin-v2'
4.7 NAM FPGA design block diagram
4.8 NAM/EXTOLL notification mechanism for PUT and GET operations
4.9 HMC 128 Byte block-boundary crossing example
4.10 496 Byte RMA read request to HMC packet mapping
4.11 Response packet sampling example
4.12 Packet serialization example
4.13 Response packet layouts
4.14 SIONlib-Buddy checkpointing scheme with two nodes
4.15 SIONlib ring fashion file exchange with more than two nodes
4.16 XOR parity generation and reconstruction
4.17 NAM/SIONlib checkpoint creation example with one node
4.18 NAM CR configuration process
4.19 NAM parity checkpoint creation example
4.20 NAM restart process
4.21 CR functional unit block diagram
4.22 NAM design device view and floor plan
4.23 NAM-XOR set mapping examples
4.24 Impact of node scheduling on NAM accessibility
4.25 NAM manager interaction
Chapter 5: NAM Performance Evaluation
5.1 Single link PUT/GET bandwidth
5.2 Single link PUT/GET bandwidth in dependency of clk_extoll
5.3 Single link GET bandwidth with four Virtual Channels
5.4 Single link PUT/GET latency
5.5 Two link PUT/GET bandwidth
5.6 XOR checkpointing bandwidth with 2 NAMs
5.7 XOR restart bandwidth with 2 NAMs
5.8 Impact of XOR set to NAM mapping on achievable bandwidth
5.9 Xpic3d application performance comparison for weak scaling
5.10 Xpic3d application performance comparison for strong scaling

Appendix C
List of tables
Chapter 2: State of the Art
2.1 Interconnect performance comparison
2.2 Causes of failures by type
Chapter 3: Hybrid Memory Cube
3.1 Retry pointer loop time summary
3.2 Resource utilization of different HMC host controllers
3.3 openHMC core clock frequencies
3.4 openHMC ASIC implementation results
3.5 Optimum ratio and maximum effective bandwidth per request size
3.6 Host-sided read latency contributors
Chapter 4: Network Attached Memory
4.1 HTL request packet splitting example
4.2 HMC response packet serialization overview
4.3 NAM design building blocks bandwidth summary: 1 EXTOLL Link
4.4 NAM design building blocks bandwidth summary: 2 EXTOLL Links
4.5 NAM design resource utilization
Chapter 5: NAM Performance Evaluation
5.1 Overall ASIC-NAM and ASIC-ASIC PUT and GET latencies
5.2 NAM bandwidth comparison: estimated versus actual measured
A
p
p
e
n
d
ix
R
References
[1] Erich Strohmaier et al. “The TOP500 List and Progress in High-Performance
Computing”. In: Computer 48.11 (Nov. 2015), pp. 42–49. issn: 0018-9162. doi:
10.1109/MC.2015.338.
[2] Wm. A. Wulf and Sally A. McKee. “Hitting the Memory Wall: Implications of
the Obvious”. In: SIGARCH Comput. Archit. News 23.1 (Mar. 1995), pp. 20–24.
issn: 0163-5964. doi: 10.1145/216585.216588. url: http://doi.acm.org/10.1145/
216585.216588.
[3] Benjamin Klenk and Holger Fröning. “An Overview of MPI Characteristics of
Exascale Proxy Applications”. In: High Performance Computing: 32nd Inter-
national Conference, ISC High Performance 2017, Frankfurt, Germany, June
18–22, 2017, Proceedings. Ed. by Julian M. Kunkel et al. Cham: Springer Interna-
tional Publishing, 2017, pp. 217–236. isbn: 978-3-319-58667-0. doi: 10.1007/978-
3-319-58667-0_12. url: https://doi.org/10.1007/978-3-319-58667-0_12.
[4] Daniel Dauwe et al. “A Performance and Energy Comparison of Fault Tolerance
Techniques for Exascale Computing Systems”. In: 2016 IEEE International
Conference on Computer and Information Technology (CIT). Dec. 2016, pp. 436–
443. doi: 10.1109/CIT.2016.44.
[5] Nilmini Abeyratne et al. “Checkpointing Exascale Memory Systems with Exist-
ing Memory Technologies”. In: Proceedings of the Second International Sympo-
sium on Memory Systems. MEMSYS ’16. Alexandria, VA, USA: ACM, 2016,
pp. 18–29. isbn: 978-1-4503-4305-3. doi: 10 . 1145 / 2989081 . 2989121. url:
http://doi.acm.org/10.1145/2989081.2989121.
[6] Malcolm Ware et al. “Architecting for power management: The IBM®;
POWER7™; approach”. In: HPCA - 16 2010 The Sixteenth International
Symposium on High-Performance Computer Architecture. Jan. 2010, pp. 1–11.
doi: 10.1109/HPCA.2010.5416627.
143
References
[7] John Shalf, Sudip Dosanjh, and John Morrison. “Exascale Computing Technol-
ogy Challenges”. In: High Performance Computing for Computational Science –
VECPAR 2010: 9th International conference, Berkeley, CA, USA, June 22-25,
2010, Revised Selected Papers. Ed. by José M. Laginha M. Palma et al. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2011, pp. 1–25. isbn: 978-3-642-19328-6.
doi: 10.1007/978-3-642-19328-6_1. url: https://doi.org/10.1007/978-3-642-
19328-6_1.
[8] Avinash Sodani. “Race to Exascale: Opportunities and Challenges”. In: Keynote
at the Annual IEEE/ACM 44th Annual International Symposium on Microar-
chitecture. 2011.
[9] Brian Barrett et al. “On the Path to Exascale”. In: Int. J. Distrib. Syst. Technol.
1.2 (Apr. 2010), pp. 1–22. issn: 1947-3532. doi: 10.4018/jdst.2010040101. url:
http://dx.doi.org/10.4018/jdst.2010040101.
[10] Juri Schmidt and Ulrich Brüning. “openHMC - a Conﬁgurable Open-Source
Hybrid Memory Cube Controller”. In: 2015 International Conference on Re-
ConFigurable Computing and FPGAs (ReConFig). Dec. 2015, pp. 1–6. doi:
10.1109/ReConFig.2015.7393331.
[11] Juri Schmidt, Holger Fröning, and Ulrich Brüning. “Exploring Time and Energy
for Complex Accesses to a Hybrid Memory Cube”. In: Proceedings of the Second
International Symposium on Memory Systems. MEMSYS ’16. Alexandria, VA,
USA: ACM, 2016, pp. 142–150. isbn: 978-1-4503-4305-3. doi: 10.1145/2989081.
2989099. url: http://doi.acm.org/10.1145/2989081.2989099.
[12] Computer Architecture Group - University of Heidelberg. openHMC - an Open-
Source Hybrid Memory Cube Controller. [Accessed 28-July-2017]. url: www.uni-
heidelberg.de/openhmc.
[13] Sabrina Eisenreich and Juri Schmidt. Interview: Experimenting with DEEP-ER
NAM Technology. [Accessed 27-July-2017]. url: https://insidehpc.com/2014/
10/interview-experimenting-deep-er-memory-technology/.
[14] Primeur Magazine. European exascale projects DEEP-ER and Mont-Blanc to
investigate new Exascale technologies. [Accessed 12-June-2017]. url: https :
//www.youtube.com/watch?v=tr_co6vu-4s.
[15] Juri Schmidt. Network Attached Memory. [Accessed 12-June-2017]. url: http:
//sc16.supercomputing.org/sc-archive/doctoral_showcase/doc_ﬁles/drs106s2-
ﬁle7.pdf.
[16] Gordon E. Moore. “Cramming more components onto integrated circuits,
Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp.114 ﬀ.” In:
IEEE Solid-State Circuits Society Newsletter 11.5 (Sept. 2006), pp. 33–35. issn:
1098-4232. doi: 10.1109/N-SSC.2006.4785860.
[17] Robert H. Dennard et al. “Design of ion-implanted MOSFET’s with very small
physical dimensions”. In: IEEE Journal of Solid-State Circuits 9.5 (Oct. 1974),
pp. 256–268. issn: 0018-9200. doi: 10.1109/JSSC.1974.1050511.
[18] Avinash Sodani. “Knights landing (KNL): 2nd Generation Intel®Xeon Phi
processor”. In: 2015 IEEE Hot Chips 27 Symposium (HCS). Aug. 2015, pp. 1–
24. doi: 10.1109/HOTCHIPS.2015.7477467.
144
References
[19] Richard Sites. “It’s the Memory, Stupid!” In:Microprocessor Report 10.10 (1996),
pp. 2–3.
[20] John L. Hennessy and David A. Patterson. Computer Architecture, Fifth Edition:
A Quantitative Approach. 5th. San Francisco, CA, USA: Morgan Kaufmann
Publishers Inc., 2011. isbn: 012383872X, 9780123838728.
[21] Richard Murphy. “On the Eﬀects of Memory Latency and Bandwidth on
Supercomputer Application Performance”. In: 2007 IEEE 10th International
Symposium on Workload Characterization. Sept. 2007, pp. 35–43. doi: 10.1109/
IISWC.2007.4362179.
[22] David A. Patterson. “Latency Lags Bandwith”. In: Commun. ACM 47.10
(Oct. 2004), pp. 71–75. issn: 0001-0782. doi: 10.1145/1022594.1022596. url:
http://doi.acm.org/10.1145/1022594.1022596.
[23] Saud Wasly and Rodolfo Pellizzoni. “Hiding memory latency using ﬁxed priority
scheduling”. In: 2014 IEEE 19th Real-Time and Embedded Technology and
Applications Symposium (RTAS). Apr. 2014, pp. 75–86. doi: 10.1109/RTAS.
2014.6925992.
[24] Young Hoon Son et al. “Reducing Memory Access Latency with Asymmetric
DRAM Bank Organizations”. In: Proceedings of the 40th Annual International
Symposium on Computer Architecture. ISCA ’13. Tel-Aviv, Israel: ACM, 2013,
pp. 380–391. isbn: 978-1-4503-2079-5. doi: 10.1145/2485922.2485955. url:
http://doi.acm.org/10.1145/2485922.2485955.
[25] JEDEC SOLID STATE TECHNOLOGY ASSOCIATION. JEDEC Standard
JESD 209-2B: Low Power Double Data Rate 2 (LPDDR2). 2010.
[26] JEDEC SOLID STATE TECHNOLOGY ASSOCIATION. JEDEC Standard
JESD 212: GDDR5 SGRAM. 2009.
[27] Indrani Paul et al. “Harmonia: Balancing Compute and Memory Power in
High-performance GPUs”. In: Proceedings of the 42Nd Annual International
Symposium on Computer Architecture. ISCA ’15. Portland, Oregon: ACM,
2015, pp. 54–65. isbn: 978-1-4503-3402-0. doi: 10.1145/2749469.2750404. url:
http://doi.acm.org/10.1145/2749469.2750404.
[28] Guruprasad Katti et al. “Electrical Modeling and Characterization of Through
Silicon via for Three-Dimensional ICs”. In: IEEE Transactions on Electron
Devices 57.1 (Jan. 2010), pp. 256–262. issn: 0018-9383. doi: 10.1109/TED.2009.
2034508.
[29] JEDEC SOLID STATE TECHNOLOGY ASSOCIATION. JEDEC Standard
JESD 235A: High Bandwidth Memory (HBM) DRAM. 2015.
[30] JEDEC SOLID STATE TECHNOLOGY ASSOCIATION. JEDEC Standard
JESD 229-2: Wide I/O 2 (WideIO2). 2014.
[31] Joe Macri. “AMD’s next generation GPU and high bandwidth memory archi-
tecture: FURY”. In: 2015 IEEE Hot Chips 27 Symposium (HCS). Aug. 2015,
pp. 1–26. doi: 10.1109/HOTCHIPS.2015.7477461.
[32] Manish Deo, Jeﬀrey Schulz, and Lance Brown. Intel Stratix 10 MX Devices
Solve the Memory Bandwidth Challenge. Tech. rep. Altera, now part of Intel,
2016.
145
References
[33] Samsung V-NAND technology: Yield more capacity, performance, endurance
and power eﬃciency. Tech. rep. Samsung Electronics, 2014.
[34] Micron Technology, Inc. 3D-NAND. [Accessed 28-July-2014]. url: https://www.
micron.com/about/emerging-technologies/3d-nand.
[35] Micron Technology, Inc and Intel Corporation. 3D XPoint Technology. [Accessed
20-July-2017]. url: https://www.micron.com/about/our-innovation/3d-xpoint-
technology.
[36] Micron Technology, Inc. Hybrid Memory Cube Webinar July 2017. [Accessed
20-July-2017]. url: https://www.micron.com/~/media/documents/products/
presentation/hmc_webinar_july_2017.pdf.
[37] Hybrid Memory Cube Consortium. Micron and Samsung Launch Consor-
tium to Break Down the Memory Wall. Oct 6, 2011. url: http : / / www .
hybridmemorycube.org/.
[38] Myles G. Watson. “Applications for Packetized Memory Interfaces”. PhD thesis.
University of Heidelberg, 2014.
[39] Michael J. Miller. “Bandwidth Engine® Serial Memory Chip Breaks 2 Billion
Accesses/sec”. In: 2011 IEEE Hot Chips 23 Symposium (HCS). Aug. 2011,
pp. 1–23. doi: 10.1109/HOTCHIPS.2011.7477493.
[40] Maya Gokhale, Bill Holmes, and Ken Iobst. “Processing In Memory: The Terasys
Massively Parallel PIM Array”. In: Computer 28.4 (Apr. 1995), pp. 23–31. issn:
0018-9162. doi: 10.1109/2.375174.
[41] Patterson, David and Anderson, Thomas and Cardwell, Neal and Fromm,
Richard and Keeton, Kimberly and Kozyrakis, Christoforos and Thomas, Randi
and Yelick, Katherine. “A case for intelligent RAM: IRAM”. In: IEEE Micro
17.2 (Mar. 1997), pp. 34–44. issn: 0272-1732. doi: 10.1109/40.592312.
[42] Ravi Nair et al. “Active Memory Cube: A processing-in-memory architecture
for exascale systems”. In: IBM Journal of Research and Development 59.2/3
(Mar. 2015), 17:1–17:14. issn: 0018-8646. doi: 10.1147/JRD.2015.2409732.
[43] Tsuyoshi Hamada and Naohito Nakasato. “InﬁniBand Trade Association, In-
ﬁniBand Architecture Speciﬁcation, Volume 1, Release 1.0”. In: International
Conference on Field Programmable Logic and Applications. Citeseer. 2005.
[44] Gregory F. Pﬁster. “An Introduction to the Inﬁniband Architecture”. In: High
Performance Mass Storage and Parallel I/O 42 (2001), pp. 617–632.
[45] Jack J Dongarra et al. LINPACK users’ guide. SIAM, 1979.
[46] Mark S Birrittella et al. “Intel®; Omni-path Architecture: Enabling Scalable,
High Performance Fabrics”. In: 2015 IEEE 23rd Annual Symposium on High-
Performance Interconnects. Aug. 2015, pp. 1–9. doi: 10.1109/HOTI.2015.22.
[47] Mondrian Nüssle et al. “An FPGA-Based Custom High Performance Interconnec-
tion Network”. In: 2009 International Conference on Reconﬁgurable Computing
and FPGAs. Dec. 2009, pp. 113–118. doi: 10.1109/ReConFig.2009.23.
[48] Holger Fröning et al. “On Achieving High Message Rates”. In: 2013 13th
IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing.
May 2013, pp. 498–505. doi: 10.1109/CCGrid.2013.43.
146
References
[49] Mondrian B. Nüßle. “Acceleration of the hardware-software interface of a com-
munication device for parallel systems”. PhD thesis. Universität Mannheim,
2009.
[50] HPC Advisory Council. Interconnect Analysis: 10GigE and InﬁniBand in
High Performance Computing. [Accessed 26-June-2017]. url: http://www.
hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf.
[51] Mellanox Technologies. EDR Inﬁniband. [Accessed 26-June-2017]. url: https:
//www.openfabrics.org/images/eventpresos/workshops2015/UGWorkshop/
Friday/friday_01.pdf.
[52] EXTOLL GmbH. EXTOLL Technology Overview. [Accessed 26-June-2017]. url:
http://extoll.de/images/pdf/Extoll_Technology_Overview_2016.pdf.
[53] University of Tennessee. MPI: A Message-Passing Interface Standard. Version
3.0. [Accessed 26-June-2017]. url: http://mpi-forum.org/docs/mpi-3.0/mpi30-
report.pdf.
[54] William Gropp et al. “A high-performance, portable implementation of the MPI
message passing interface standard”. In: Parallel Computing 22.6 (1996), pp. 789–
828. issn: 0167-8191. doi: http://dx.doi.org/10.1016/0167-8191(96)00024-5.
url: http://www.sciencedirect.com/science/article/pii/0167819196000245.
[55] NOWLAB: Network Based Computing Lab, Ohio State University. MVAPICH:
MPI over InﬁniBand, 10GigE/iWARP and RoCE. [Accessed 26-July-2017]. url:
http://mvapich.cse.ohio-state.edu/.
[56] Edgar Gabriel et al. “Open MPI: Goals, Concept, and Design of a Next Genera-
tion MPI Implementation”. In: Recent Advances in Parallel Virtual Machine
and Message Passing Interface: 11th European PVM/MPI Users’ Group Meeting
Budapest, Hungary, September 19 - 22, 2004. Proceedings. Berlin, Heidelberg:
Springer Berlin Heidelberg, 2004, pp. 97–104. isbn: 978-3-540-30218-6. doi:
10.1007/978-3-540-30218-6_19. url: https://doi.org/10.1007/978-3-540-30218-
6_19.
[57] U.S. DOE. Characterization of the DOE Mini-apps. [Accessed 28-July-2017].
url: https://portal.nersc.gov/project/CAL/doe-miniapps.htm.
[58] Algirdas Avizienis et al. “Basic Concepts and Taxonomy of Dependable and Se-
cure Computing”. In: IEEE Transactions on Dependable and Secure Computing
1.1 (Jan. 2004), pp. 11–33. issn: 1545-5971. doi: 10.1109/TDSC.2004.2.
[59] Ifeanyi P Egwutuoha et al. “A survey of fault tolerance mechanisms and check-
point/restart implementations for high performance computing systems”. In:
The Journal of Supercomputing 65.3 (Sept. 2013), pp. 1302–1326. issn: 1573-
0484. doi: 10.1007/s11227-013-0884-0. url: https://doi.org/10.1007/s11227-
013-0884-0.
[60] Tezzaron Semiconductor. Terrazon Semiconductor. Soft Errors in Electronic
Memory — A White Paper. [Accessed 28-July-2017]. url: http://www.tezzaron.
com/about/papers/soft_errors_1_1_secure.pdf.
147
References
[61] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. “DRAM Errors
in the Wild: A Large-scale Field Study”. In: Commun. ACM 54.2 (Feb. 2011),
pp. 100–107. issn: 0001-0782. doi: 10 . 1145/1897816 . 1897844. url: http :
//doi.acm.org/10.1145/1897816.1897844.
[62] Bianca Schroeder and Garth Gibson. “A Large-Scale Study of Failures in
High-Performance Computing Systems”. In: IEEE Transactions on Dependable
and Secure Computing 7.4 (Oct. 2010), pp. 337–350. issn: 1545-5971. doi:
10.1109/TDSC.2009.4.
[63] Fabrizio Petrini, Kei Davis, and José Carlos Sancho. “System-level fault-tolerance
in large-scale parallel machines with buﬀered coscheduling”. In: 18th Interna-
tional Parallel and Distributed Processing Symposium, 2004. Proceedings. Apr.
2004, pp. 209–. doi: 10.1109/IPDPS.2004.1303239.
[64] Marc Snir et al. “Addressing Failures in Exascale Computing”. In: Int. J. High
Perform. Comput. Appl. 28.2 (May 2014), pp. 129–173. issn: 1094-3420. doi:
10.1177/1094342014522573. url: http://dx.doi.org/10.1177/1094342014522573.
[65] Bianca Schroeder and Garth A. Gibson. “Understanding Disk Failure Rates:
What Does an MTTF of 1,000,000 Hours Mean to You?” In: Trans. Storage
3.3 (Oct. 2007). issn: 1553-3077. doi: 10.1145/1288783.1288785. url: http:
//doi.acm.org/10.1145/1288783.1288785.
[66] Xiangyu Dong et al. “Leveraging 3D PCRAM Technologies to Reduce Check-
point Overhead for Future Exascale Systems”. In: Proceedings of the Confer-
ence on High Performance Computing Networking, Storage and Analysis. SC
’09. Portland, Oregon: ACM, 2009, 57:1–57:12. isbn: 978-1-60558-744-8. doi:
10.1145/1654059.1654117. url: http://doi.acm.org/10.1145/1654059.1654117.
[67] Guillaume Aupy et al. “Optimal Checkpointing Period: Time vs. Energy”. In:
High Performance Computing Systems. Performance Modeling, Benchmarking
and Simulation: 4th International Workshop, PMBS 2013, Denver, CO, USA,
November 18, 2013. Revised Selected Papers. Cham: Springer International
Publishing, 2014, pp. 203–214. isbn: 978-3-319-10214-6. doi: 10.1007/978-3-
319-10214-6_10. url: https://doi.org/10.1007/978-3-319-10214-6_10.
[68] Mohamed Slim Bouguerra et al. “Improving the Computing Eﬃciency of HPC
Systems Using a Combination of Proactive and Preventive Checkpointing”. In:
2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
May 2013, pp. 501–512. doi: 10.1109/IPDPS.2013.74.
[69] Kurt B. Ferreira et al. “Accelerating Incremental Checkpointing for Extreme-
scale Computing”. In: Future Gener. Comput. Syst. 30 (Jan. 2014), pp. 66–77.
issn: 0167-739X. doi: 10.1016/j.future.2013.04.017. url: http://dx.doi.org/10.
1016/j.future.2013.04.017.
[70] Leonardo Bautista-Gomez et al. “FTI: High performance Fault Tolerance Inter-
face for hybrid systems”. In: 2011 International Conference for High Performance
Computing, Networking, Storage and Analysis (SC). Nov. 2011, pp. 1–12. doi:
10.1145/2063384.2063427.
148
References
[71] Adam Moody et al. “Design, Modeling, and Evaluation of a Scalable Multi-level
Checkpointing System”. In: 2010 ACM/IEEE International Conference for High
Performance Computing, Networking, Storage and Analysis. Nov. 2010, pp. 1–11.
doi: 10.1109/SC.2010.18.
[72] Ning Liu et al. “On the role of burst buﬀers in leadership-class storage systems”.
In: 012 IEEE 28th Symposium on Mass Storage Systems and Technologies
(MSST). Apr. 2012, pp. 1–11. doi: 10.1109/MSST.2012.6232369.
[73] Melissa Romanus, Robert B Ross, and Manish Parashar. “Challenges and
Considerations for Utilizing Burst Buﬀers in High-Performance Computing”.
In: CoRR abs/1509.05492 (2015). url: http://arxiv.org/abs/1509.05492.
[74] Gengbin Zheng, Xiang Ni, and Laxmikant V Kalé. “A Scalable Double In-
memory Checkpoint and Restart Scheme towards Exascale”. In: IEEE/IFIP
International Conference on Dependable Systems and Networks Workshops (DSN
2012). June 2012, pp. 1–6. doi: 10.1109/DSNW.2012.6264677.
[75] Lawrence Livermore National Laboratory. SCR v1.1.8 User Manual. [Accessed
12-June-2017]. url: https://computation.llnl.gov/sites/default/ﬁles/public/
scr_users_manual.pdf.
[76] Mohammed el Mehdi Diouri et al. “Energy considerations in Checkpointing
and Fault Tolerance protocols”. In: IEEE/IFIP International Conference on
Dependable Systems and Networks Workshops (DSN 2012). June 2012, pp. 1–6.
doi: 10.1109/DSNW.2012.6264670.
[77] Hybrid Memory Cube Consortium. Hybrid Memory Cube Speciﬁcation 1.0.
http://www.hybridmemorycube.org/.
[78] Joe Jeddeloh and Brent Keeth. “Hybrid Memory Cube. New DRAMArchitecture
Increases Density and Performance”. In: 2012 Symposium on VLSI Technology
(VLSIT). June 2012, pp. 87–88. doi: 10.1109/VLSIT.2012.6242474.
[79] Hybrid Memory Cube Consortium. Hybrid Memory Cube Speciﬁcation 1.1.
http://www.hybridmemorycube.org/.
[80] JEDEC SOLID STATE TECHNOLOGY ASSOCIATION. JEDEC Standard
JESD 79-4: DDR4 SDRAM. 2012.
[81] JEDEC SOLID STATE TECHNOLOGY ASSOCIATION. JEDEC Standard
JESD 79-3B: DDR3 SDRAM. 2008.
[82] J. Thomas Pawlowski. “Hybrid Memory Cube (HMC)”. In: 2011 IEEE Hot
Chips 23 Symposium (HCS). Aug. 2011, pp. 1–24. doi: 10.1109/HOTCHIPS.
2011.7477494.
[83] Paul Rosenfeld. “Performance Exploration of the Hybrid Memory Cube”. PhD
thesis. University of Maryland, 2014.
[84] Bruce Jacob. “The 2 PetaFLOP, 3 Petabyte, 9 TB/s, 90 kW Cabinet: A System
Architecture for Exascale and Big Data”. In: IEEE Computer Architecture
Letters 15.2 (July 2016), pp. 125–128. issn: 1556-6056. doi: 10.1109/LCA.2015.
2451652.
[85] Micron Technology, Inc. Revolutionary Advancements in Memory Performance.
[Accessed 28-July-2017]. Aug. 2011. url: https://www.youtube.com/watch?v=
kaV2nZSkw8A.
149
References
[86] Dan McMorrow. Technical Challenges of Exascale Computing. Tech. rep. MITRE
Corporation, 2013.
[87] Hybrid Memory Cube Consortium. Hybrid Memory Cube Speciﬁcation 2.0.
http://www.hybridmemorycube.org/.
[88] Open Silicon, Inc. Hybrid Memory Cube (HMC) Controller IP. [Accessed 28-
July-2017]. url: http://www.open-silicon.com/open-silicon-ips/hmc/.
[89] Pico Computing, Inc (now Micron Technology, Inc). Hybrid Memory Cube
(HMC) Controller IP. [Accessed 28-July-2017]. url: http://picocomputing.
com/hmc-ip/.
[90] Altera Corporation. Hybrid Memory Cube Controller IP Core User Guide UG-
01152. [Accessed 28-July-2017]. url: http://design.altera.com/HMCWP.
[91] Computer Architecture Group - University of Heidelberg. openHMC documen-
tation Rev1.5. [Accessed 28-July-2017]. url: http://www.uni-heidelberg.de/
openhmc.
[92] Xilinx, Inc. XHMC v1.0 LogiCORE IP Product Guide. Nov. 2016.
[93] John D. Leidel and Yong Chen. “HMC-Sim: A Simulation Framework for
Hybrid Memory Cube Devices”. In: 2014 IEEE International Parallel Distributed
Processing Symposium Workshops. May 2014, pp. 1465–1474. doi: 10.1109/
IPDPSW.2014.164.
[94] Dong-Ik Jeon and Ki-Seok Chung. “CasHMC: A Cycle-Accurate Simulator for
Hybrid Memory Cube”. In: IEEE Computer Architecture Letters 16.1 (Jan.
2017), pp. 10–13. issn: 1556-6056. doi: 10.1109/LCA.2016.2600601.
[95] Yinhe Han et al. “Data-Aware DRAM Refresh to Squeeze the Margin of Re-
tention Time in Hybrid Memory Cube”. In: 2014 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD). Nov. 2014, pp. 295–300. doi:
10.1109/ICCAD.2014.7001366.
[96] Ishan G Thakkar and Sudeep Pasricha. “Massed Refresh: An Energy-Eﬃcient
Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures”.
In: 2016 29th International Conference on VLSI Design and 2016 15th Interna-
tional Conference on Embedded Systems (VLSID). Jan. 2016, pp. 104–109. doi:
10.1109/VLSID.2016.13.
[97] Mushﬁque Junayed Khurshid and Mikko Lipasti. “Data Compression for Ther-
mal Mitigation in the Hybrid Memory Cube”. In: 2013 IEEE 31st Interna-
tional Conference on Computer Design (ICCD). Oct. 2013, pp. 185–192. doi:
10.1109/ICCD.2013.6657041.
[98] Maya Gokhale, Scott Lloyd, and Chris Macaraeg. “Hybrid Memory Cube
Performance Characterization on Data-centric Workloads”. In: Proceedings
of the 5th Workshop on Irregular Applications: Architectures and Algorithms.
IA3 ’15. Austin, Texas: ACM, 2015, 7:1–7:8. isbn: 978-1-4503-4001-4. doi:
10.1145/2833179.2833184. url: http://doi.acm.org/10.1145/2833179.2833184.
150
References
[99] Khaled Z. Ibrahim et al. “Characterizing the Performance of Hybrid Memory
Cube Using ApexMAP Application Probes”. In: Proceedings of the Second
International Symposium on Memory Systems. MEMSYS ’16. Alexandria, VA,
USA: ACM, 2016, pp. 429–436. isbn: 978-1-4503-4305-3.
doi: 10.1145/2989081.2989090. url: http://doi.acm.org/10.1145/2989081.2989090.
[100] Ramyad Hadidi et al. “Demystifying the Characteristics of 3D-Stacked Memories:
A Case Study for Hybrid Memory Cube”. In: CoRR abs/1706.02725 (2017).
url: http://arxiv.org/abs/1706.02725.
[101] Mondrian Nüssle, Martin Scherer, and Ulrich Brüning. “A Resource Optimized
Remote-Memory-Access Architecture for Low-latency Communication”. In: 2009
International Conference on Parallel Processing. Sept. 2009, pp. 220–227. doi:
10.1109/ICPP.2009.62.
[102] Wolfgang Frings, Felix Wolf, and Ventsislav Petkov. “Scalable Massively Parallel
I/O to Task-local Files”. In: Proceedings of the Conference on High Performance
Computing Networking, Storage and Analysis. SC ’09. Portland, Oregon: ACM,
2009, 17:1–17:11. isbn: 978-1-60558-744-8. doi: 10.1145/1654059.1654077. url:
http://doi.acm.org/10.1145/1654059.1654077.
[103] Stefano Markidis, Giovanni Lapenta, and Rizwan-uddin. “Multi-scale
Simulations of Plasma with iPIC3D”. In: Math. Comput. Simul. 80.7 (Mar. 2010),
pp. 1509–1519. issn: 0378-4754. doi: 10.1016/j.matcom.2009.08.038. url:
http://dx.doi.org/10.1016/j.matcom.2009.08.038.

Acknowledgements
My biggest thanks go out to my parents, Vera and Georg Schmidt, and my sister,
Ludmila Schmidt. All my life, you were there for me unconditionally and always
encouraged me to follow my path. I dedicate this dissertation to you.
I would like to express my sincere gratitude to Prof. Ulrich Brüning. He taught me
so many things about both work and life. With his experience and knowledge, he has
always been a fantastic supporter, advisor, and my personal role model.
Many thanks to my colleagues at the Computer Architecture Group and at EXTOLL
GmbH. Your advice helped me in countless situations and I appreciate the teamwork
which has always been excellent. It was a pleasure to work with every single one of
you.
I am grateful to the committee members, Prof. Norbert Eicker, Prof. Michael Gertz,
and Artur Andrzejak for their support. And of course I would like to say thanks to the
Ruprecht-Karls University Heidelberg and especially to the Faculty of Mathematics
and Computer Science for the great opportunity to carry out the research that led to
this dissertation.
A special thanks goes out to my friends, especially Kevin Rodenhausen, Niklas Sachs,
Alexander Schepp, and Christoph Hick. You always supported me in every situation
and knew how to properly distract me from work - weekend after weekend ;-)
Thank you, Anja Stolzheise, for taking this journey together with me. I am grateful
to have you in my life and am looking forward to everything that lies ahead of us. And by
the way, as I promised: I made it in time! ;-)
Last but not least, thanks to my roommate and friend Benjamin Klenk, all the friends
I made all around the world, and everyone else who believed in me.
