Exploring the Performance Benefit of Hybrid Memory System on HPC
  Environments by Peng, Ivy Bo et al.
Exploring the Performance Benefit of Hybrid Memory System
on HPC Environments
Ivy Bo Peng∗, Roberto Gioiosa‡, Gokcen Kestor‡, Pietro Cicotti†, Erwin Laure∗ and Stefano Markidis∗
∗Department of Computational Science and Technology, KTH Royal Institute of Technology, Sweden
†Advanced Technology Laboratory, San Diego Supercomputer Center, USA
‡Computational Science and Mathematics Division, Pacific Northwest National Laboratory, USA
Abstract—Hardware accelerators have become a de-facto
standard to achieve high performance on current supercom-
puters and there are indications that this trend will increase
in the future. Modern accelerators feature high-bandwidth
memory next to the computing cores. For example, the Intel
Knights Landing (KNL) processor is equipped with 16 GB
of high-bandwidth memory (HBM) that works together with
conventional DRAM memory. Theoretically, HBM can provide
∼ 4× higher bandwidth than conventional DRAM. However,
many factors impact the effective performance achieved by
applications, including the application memory access pattern,
the problem size, the threading level and the actual memory
configuration. In this paper, we analyze the Intel KNL system
and quantify the impact of the most important factors on the
application performance by using a set of applications that
are representative of scientific and data-analytics workloads.
Our results show that applications with regular memory access
benefit from MCDRAM, achieving up to 3× performance when
compared to the performance obtained using only DRAM. On
the contrary, applications with random memory access pattern
are latency-bound and may suffer from performance degrada-
tion when using only MCDRAM. For those applications, the
use of additional hardware threads may help hide latency and
achieve higher aggregated bandwidth when using HBM.
Keywords-Intel Knights Landing (KNL); HBM; MCDRAM;
application performance on Intel KNL; hybrid memory system
I. INTRODUCTION
Current and next-generation supercomputers will heavily
rely on hardware accelerators to achieve extreme paral-
lelism and high performance. Accelerators come in many
forms, including general-purpose GPUs, FPGAs, or other
energy-efficient architectures, such as ARM big.LITTLE [1]
and Intel Many Integrated Core (MIC) [2]. Accelerators
generally present a different architecture compared to the
host processors and require programmers to adapt their
applications to fully exploit the underlying hardware fea-
tures. Besides a large number of computing threads, modern
accelerators, such as AMD Fiji, NVIDIA Pascal GPUs,
and Intel Knights Landing (KNL), also feature a relatively
small, high-bandwidth on-chip memory implemented with
3D-stacked technology. This memory can be used to cache
application data but also as an extension of DRAM. The
effective use of these memory technologies helps reducing
data movement [3] and is of paramount importance to
achieve high performance on modern and future supercom-
puters.
The simultaneous presence of two different memory
technologies that work side-by-side, e.g., the traditional
DRAM and the new on-chip memory, makes the system
heterogeneous from the memory perspective. Deeper and
heterogeneous memory systems are likely to be the norm
in future mainstream HPC systems. Thus it is important
to understand how applications can make efficient use of
these new high-bandwidth memory technologies and what
the limitations and constraints are [4]. While several studies
have successfully ported specific applications to hardware
accelerators equipped with 3D-stacked memory [5], [6], a
deep and comprehensive understanding on how to leverage
the high-bandwidth memory in HPC applications is still an
open issue. In this paper we target this problem and aim
at providing recommendations to programmers on which
applications may benefit from these new memory technolo-
gies and what are the limiting factors to consider and their
impact on the application performance. Without loss of
generality, we analyze the performance benefits of the new
3D-stacked high-bandwidth memory (HBM) packaged with
the Intel KNL processor, the second-generation Intel Xeon
Phi processor. This processor provides the main computing
power of the Cori [7] supercomputer at NERSC, while
the upcoming Aurora supercomputer at Argonne National
Laboratory [8] will use the third-generation Intel Xeon Phi
processors.
The HBM 3D-stacked memory on the KNL processor
is implemented as a Multi-Channel DRAM (MCDRAM).
DRAM and MCDRAM differ significantly in the metrics of
capacity, bandwidth and latency. In addition, Intel provides
three memory modes to configure HBM: flat, cache, and
hybrid mode. In flat mode, the user sees HBM as a memory
pool exposed by the operating system (OS) as an additional
NUMA node. In cache mode, HBM is transparent to the OS
and is directly managed by the hardware as a large cache
for DRAM transfers. This is the expected default mode, as
it requires neither modifications to the application code nor
re-structuring of the algorithm or data structures. In hybrid
mode, a part of HBM is explicitly managed by the user (as
in the flat mode), while the rest is managed by the hardware
ar
X
iv
:1
70
4.
08
27
3v
1 
 [c
s.D
C]
  2
6 A
pr
 20
17
as a cache for DRAM transfers. Because of the capacity,
bandwidth, and latency differences between DRAM and
HBM, and also because of the possibility of configuring
the memory system in different modes, performance will
vary widely depending on the application characteristics, the
usage of memory resources, and the chosen configuration.
The goal of this work is to investigate the correlation be-
tween application characteristics and attainable performance
on Intel KNL processor.
We follow a two-step methodology. First, we characterize
the memory performance on KNL systems through precise
micro-benchmarks. From the extracted hardware character-
istics, such as memory bandwidth and latency, we iden-
tify factors correlated to the observed performance. Next,
we perform an exhaustive performance evaluation using
kernels (DGEMM and GUPS), two scientific DOE proxy
applications (MiniFE and XSBench), and one benchmark
mimicking data analytics workloads (Graph500). In this
step we vary different parameters to collect performance
measurements from a vast array of configurations including
different problem sizes, number of OpenMP threads, and
memory modes (i.e., flat or cache). Analyzing the results
we were able to assess performance variations due to dif-
ferent configurations. Thus we can evaluate the application
sensitivity to the memory configurations, including a quan-
tification of the performance improvement/degradation due
to memory configuration and usage.
Our results show that the memory access pattern impacts
the application performance the most. Applications with
regular data access, such as DGEMM and MiniFE, are
mostly limited by memory bandwidth, thus they consid-
erably benefit from HBM. These applications can achieve
3× performance improvement when using HBM compared
to solely using DRAM. However, when the problem size
is larger than the HBM capacity, the benefit of using
HBM as cache decreases with increasing problem size. The
performance of applications with random memory access,
such as GUPS, Graph500 and XSBench, is largely affected
by the memory latency and does not benefit from using
HBM, as the latency is higher than the DRAM. We also
found that the use of additional hardware threads per core
leads to a performance improvement when using HBM as the
simultaneous use of multiple hardware threads can help hide
HBM latency. For MiniFE, we observe a 3.8× performance
improvement with respect to the performance obtained with
only DRAM when we use four hardware threads per core.
In this paper we make the following contributions:
1) We identify three main factors (access pattern, prob-
lem size, and threading) that determine application
performance on systems with KNL processors.
2) We quantify the impact of using different memory
configurations (HBM-only, cache and DRAM-only),
problem size, number of OpenMP threads on applica-
tion performance.
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
L2 
1MB 
MCDRAM	
16	GB	
DDR 
96 GB 
(Up to 384 GB) 
Off-package On-package 
DDR4 x 6 
Figure 1: Layout of different memories on KNL. The grid
indicates the Intel mesh of tiles. Each tile consists of two
cores sharing a 1-MB L2 cache. MCDRAM is integrated
on-package while DRAM is off-package and connected by
six DDR4 channels.
3) We show that applications with regular memory access
patterns largely benefit the HBM. On the contrary, ap-
plications with random access patterns do not benefit
HBM.
4) We identify the HBM higher latency as the main
obstacle to achieve performance gain in applications
with random memory access.
5) We show that the use of multiple hardware threads in
the same core is critical to take advantage of HBM
high bandwidth.
6) We provide a guideline for setting correct expectation
for performance improvement on systems with 3D-
stacked high-bandwidth memories.
The rest of this paper is organized as follows: Section II
provides a brief description of the Intel KNL memory
system. Section III describes our system set-up, software
environment and selected benchmarks and mini-applications.
We show our experimental results in Section IV and com-
pare our approach and methodology to previous work in
Section V. Finally, we discuss the results and conclude the
paper in Section VI.
II. INTEL KNIGHTS LANDING PROCESSOR
In this section we briefly summarize the main character-
istics of the Intel KNL processor, with particular emphasis
on the memory sub-system and its configuration options.
A detailed description of the system can be found in the
literature [9].
Fig. 1 presents a KNL processor and its connection
to the main memory. Each computing core has a private
32KB L1 cache (not presented in the diagram) and supports
four hardware threads. Computing cores are grouped into
tiles connected through a mesh network on-chip. Each tile
consists of the two computing cores that share a 1-MB L2
cache. The L2 cache of these tiles are maintained coherent
by a distributed tag directory implemented with a MESIF
protocol, which enables cache-to-cache line forwarding. The
example in Fig. 1 shows that the 36 tiles provide a total of
36 MB L2 cache.
In this work, we are mainly interested in the KNL
memory sub-system. Computing cores in a KNL system
are connected to two different memory technologies: MC-
DRAM and DDR. The MCDRAM modules are located on
package while DDR is off-package and connected by six
DDR4 channels driven by two memory controllers. The 3D
architecture and the shorter distance to the computing cores
allow MCDRAM to deliver a peak bandwidth of ~400 GB/s
while DDR can deliver ~90 GB/s [9]. This 4× difference
in bandwidth can be crucial for many applications that are
memory bound. However, MCDRAM is also more expensive
and has much smaller capacity compared to DDR. On the
current version of KNL, MCDRAM is extensible up to 16
GB while DDR can reach 384 GB in size. An optimal usage
of KNL processor needs to consider the features of both
memory technologies, aiming for a balance of the two.
MCDRAM can be configured in three modes: cache, flat
or hybrid mode.
Cache: The default mode is cache mode, where MC-
DRAM is used as the last-level cache (i.e., L3 cache)
for DDR transfers. In this mode, the MCDRAM is
completely managed by the hardware so that it is trans-
parent to users and OS. This mode requires no code
modifications or manual allocation of data structures.
Thus, it is the most convenient mode in terms of manual
efforts and flexibility. The MCDRAM cache protocol
uses a direct mapping scheme, which is an effective
protocol but can result in higher conflict misses. In this
mode, the memory hierarchy consists of L1 and L2
SRAM caches, an L3 cache backed by MCDRAM and
a homogeneous main memory backed by DDR.
Flat: When MCDRAM is set in the flat mode, it
works side-by-side with DDR. Together, they form a
heterogeneous main memory. This mode requires either
code modifications or using numactl to explicitly
control the data placement. If an application can isolate
data structures that benefit from the higher bandwidth
of MCDRAM, it is feasible to have fine-grained data
placement using heap memory management libraries,
such as the memkind library [10]. On the other hand,
if the entire application fits into MCDRAM, coarse-
grained placement by numactl is more convenient. In
flat mode, the memory hierarchy includes L1 and L2
caches and a heterogeneous main memory composed
of small capacity MCDRAM and large capacity DDR.
Hybrid: The last memory mode is the hybrid mode. In
this mode, MCDRAM can be partitioned into two parts,
one in cache mode and the other in flat mode. This
Table I: List of Evaluated Applications
Application Type Access Pattern Max. Scale
DGEMM Scientific Sequential 24 GB
MiniFE Scientific Sequential 30 GB
GUPS Data analytics Random 32 GB
Graph500 Data analytics Random 35 GB
XSBench Scientific Random 90 GB
mode should be used when the data structures in an
application are accessed in different patterns. Changing
partition requires a system reboot and modification of
the BIOS to indicate the ratio of MCDRAM in cache
mode and the one in flat mode. Thus, it is arguably
cumbersome to change the MCDRAM partitioning for
each application.
III. EXPERIMENTAL ENVIRONMENT
This section describes the hardware and software setup
used in this work and provides a brief description of the
tested applications.
A. Hardware Setup
We carry out experiments on a cluster of 12 KNL-based
compute nodes on the Cray Archer supercomputer [11].
Each node is equipped with a 64-core KNL processor (model
7210) running at 1.3 GHz. Each core can support up to
four hardware threads. The testbed is configured with 96-
GB DDR and 16-GB MCDRAM. MCDRAM consists of
eight 2-GB modules. The six DDR4 channels are running
at 2.1 GHz. Two compute nodes are pre-configured with
MCDRAM in flat mode, while the remaining ten nodes
have MCDRAM configured in cache mode. All nodes use
the quadrant cluster mode for the mesh of tiles. These
compute nodes are connected via Cray’s proprietary Aries
interconnect.
B. Benchmarks and Applications
We select benchmarks and applications whose perfor-
mance is sensitive to the characteristics of the memory
system. The list of applications is presented in Table I. The
applications are summarized as follows:
• DGEMM [12] benchmark performs dense-matrix mul-
tiplication.The memory access pattern is sequential
and optimization in data locality is crucial. For this
evaluation, we compile the code linking to the Intel
Math Kernel Library (MKL), which provides highly-
optimized multithreaded kernels. We report Giga float-
ing point operation per second (GFLOPS) for perfor-
mance evaluation.
• MiniFE [13] is a DOE proxy application that is rep-
resentative of implicit Finite Element applications. The
most performance critical part of the application solves
the linear-system using a Conjugate-Gradient algorithm
similar to HPCG benchmark. We report the total Mflops
in the CG part of the application output for performance
evaluation.
• GUPS [14] is a synthetic benchmark that measures
the Giga-updates-per-second (GUPS) by reading and
updating uniformly distributed random addresses in
a table. The memory access pattern is random with
poor data locality. This synthetic problem is often
used for profiling the memory architecture. We report
GUPS measured by the benchmark for performance
evaluation.
• Graph500 [15] represents data-analytics workloads.
The memory access pattern is data-driven with poor
temporal and spatial locality. Thus this application
is featured with random access pattern. We use the
reference implementation version 2.1.4 from the official
website( www.graph500.org/) for experiments. We use
the OpenMP implementation using CSR compression
format. We report the harmonic mean of traversed edge
per second (TEPS) for performance evaluation.
• XSBench [16] mimics the most time consuming part
of OpenMC, a Monte Carlo neutron transport code.
XSBench isolates the macroscopic cross section lookup
kernel. As the random input conditions for the macro-
scopic cross section lookups is retained in the appli-
cation, the memory access is random and demands for
large memory capacity. We use a large fixed problem-
size and vary the number of grid points (option -g) to
scale up the memory footprint of the test. We report
the rate of performed lookups per second (lookups/s)
for performance evaluation.
All applications are compiled using the Intel compiler
version 17.0.0.098 (gcc version 4.8.0 compatibility) with
OpenMP, -xMIC-AVX512 flags and optimization option
-O3. Applications are linked to the Cray MPICH library
version 7.4.4.
C. Execution Setup
For all applications, we scale up the problem size to
the maximum containable size for a single node. We use
OpenMP threads to take advantage of all the 64 computing
cores. By default, 64 OpenMP threads are used for experi-
ments unless otherwise specified.
We consider three memory configurations in the scope of
this paper. First, if HBM is configured in flat mode and all
memory allocation is bound to HBM, we refer this set-up
as HBM in the results as it only uses the high-bandwidth
memory. Second, if HBM is configured in flat mode but all
memory allocation is bound to DRAM, we refer this setup as
DRAM in the results. Third, if HBM is configured in cache
mode, this configuration uses DRAM as main memory and
HBM as the last-level cache. We refer this setup as Cache
Mode.
We use numactl to control the data placement into the
two memories. When HBM is in flat mode, there are two
NUMA nodes available, corresponding to DRAM node and
HBM node respectively. When HBM is in cache mode, there
is only one NUMA node available. We report the NUMA
domain distances obtained by the command numactl
--hardware in these two setups in Table II.
Table II: The left panel presents the distance between DRAM
node (0) and HBM node (1) when HBM is flat mode. The
right panel shows that only one NUMA node is available
when HBM is in cache mode.
Distances: 0 (96 GB) 1 (16 GB)
0 10 31
1 31 10
Distances: 0 (96 GB)
0 10
The DRAM configuration uses HBM in flat mode and
use numactl --membind=0 to allocate all data in
DRAM. The HBM configuration uses HBM in flat mode
and use numactl --membind=1 to allocate all data in
HBM. The Cache Mode configuration uses HBM in cache
mode and use numactl --membind=0 for consistency
even though there is only one NUMA domain available.
IV. EVALUATION
In this section, we evaluate the performance benefit from
HBM in a hybrid memory system. HBM can be used as
main memory or as the cache for DRAM transfers. Thus,
for each application, we carried out two identical experiment
sets, one in each configuration. To establish the baseline, we
also repeat each set of experiments on DRAM. We first use
two micro-benchmarks to understand the hardware charac-
teristics of the system. We then evaluate the performance
impact from the memory sub-system to five applications that
are representative for traditional scientific applications and
data-analytics workloads.
A. Memory Bandwidth and Latency
We first use micro-benchmarks to experimentally mea-
sure the peak bandwidth and latency of the testbed. The
peak bandwidth is measured using the OpenMP version of
STREAM benchmark with varying data sizes [17]. We report
the results from the triad computational kernel in Fig. 2.
In this experiments we run one thread per core. Fig. 2
shows that DRAM achieves a maximum of 77 GB/s band-
width (red line) while HBM reaches a maximum of 330
GB/s (blue line). Thus HBM provides a 4× higher band-
width than DRAM. As we will show later, HBM can reach
as high as 420 GB/s using more hardware threads per
core. The bandwidth that is measured in the cache mode
configuration (black line) largely depends on the data size.
In this configuration, we measure a peak performance of 260
GB/s when the input set size that is approximately half of the
HBM capacity. Beyond that size, the measured bandwidth
rapidly drops to 125 GB/s at 11.4 GB. When the input set is
22.8 GB, the cache mode performance even becomes lower
10 20 30 40
100
200
300
Size (GB)
B
an
dw
id
th
(G
B
/s
) DRAM
HBM
Cache Mode
Figure 2: Peak bandwidth measured by the STREAM bench-
mark under the three memory configurations are reported.
Measurements for HBM configuration stops when the data
size does not fit into HBM.
than DRAM. The results indicate that HBM can provide the
highest bandwidth if the data size fits in HBM. Cache mode
is arguably the easiest way to use HBM and, although the
caching protocol might introduces some overhead compared
to direct data placement in HBM. The cache mode provides
the second highest bandwidth and it is comparable to HBM
up to 10GB. When the data size reaches the range that is
larger than HBM size but still comparable to HBM (see
16-24 GB stream size range in Fig. 2), the cache mode
provides higher bandwidth than DRAM. When the data size
is much larger than HBM size, using HBM as cache for
DRAM might even bring lower bandwidth than DRAM. This
is likely due to the direct mapping scheme of HBM cache
protocol, which results in higher capacity conflicts when data
size increases. On the other hand, STREAM memory access
pattern is very regular, thus the prefetcher can effective bring
in data to the cache in advance, hiding memory latencies
and reducing the differences between HBM in cache mode
and DRAM. The STREAM benchmark also shows that the
overhead of cache could even result in lower bandwidth than
DRAM when the data size is larger than 24 GB.
While HBM provides the highest bandwidth, it also
presents a higher latency than DRAM Our results are con-
sistent with previous work[18]: we report 154.0 ns latency
for HBM and 130.4 ns for DRAM. This indicates that
accessing to HBM could be ∼ 18% slower than accessing
DRAM. Besides the memory latency, we are also interested
to quantify the correlation between the data size and time
delay for accessing data at a random address. We use the
open-source benchmark, TinyMemBench [19] to measure
the read latency for random accesses to an allocated buffer
with varying sizes. This benchmark reports the average mea-
sured latency with less than 0.1% deviation. We report the
measurement for performing two simultaneous random read
(dual read latency) in Fig. 3. This measurement reflects the
capability of handling concurrent requests by the underlying
memory system. Thus it is more relevant for KNL out-of-
12
8K
25
6K
51
2K 1M 2M 4M 8M16
M
32
M
64
M
12
8M
25
6M
51
2M 1G
0
200
400
Block Size (B)
D
ua
l
R
an
do
m
R
ea
d
L
at
en
cy
(n
s) DRAM HBM
0
10
20
Pe
rf
or
m
an
ce
G
ap
(%
)Performance Gap (%)
Figure 3: Dual random read latency to a buffer allocated in
DRAM (blue) and HBM (red) respectively. As the buffer size
increases, the latency includes effects from cache misses,
TLB misses and page walk.
order cores. We use the block size larger than L1 cache and
also exclude L1 latency from the measurements.
Fig. 3 shows three tiers of the time delay in accessing
random addresses. The first range is for block sizes smaller
than 1 MB (local L2 cache). In this range, latencies are
approximately 10 ns. The second range is block sizes that
are smaller than 64 MB (two mesh L2 cache size). This
range is characterized by a latency that is approximately
200 ns. In this region, the performance gap reaches the peak
of 20% when the data size is just slightly larger than the tile
local L2 cache. Beyond that, the gap starts decreasing to
about 15% till the data size reaches four times of the mesh
L2 cache, i.e., 128 MB. Starting from 128 MB, the latency
values for DRAM and HBM increase with the increase of
the block size. We observed that DRAM can be accessed
15-20% faster than HBM in the three block size ranges.
The percentage difference between DRAM and HBM is
represented with a black line in Fig. 3.
B. Impact of Access Pattern on Application Performance
The top panels in Fig. 4 present the performance of appli-
cations with regular access patterns (DGEMM and MiniFE)
with varying problem size. Binding memory allocation to
HBM provides the best performance for both DGEMM
and MiniFE. The two applications achieve 2× and 3× the
performance improvement by allocating data structures in
HBM compared DRAM, respectively. The bottom panels in
Fig. 4 present the performance of applications with random
memory access patterns (GUPS, Graph500 and XSBench).
We note that HBM does not bring any performance improve-
ment, neither in flat mode (red bars) nor in cache mode (grey
bars). On the contrary, these applications achieve the highest
performance by only allocating data to DRAM (blue bars).
0.1 0.4 1.5 6 24
200
400
600
Array Size (GB)
G
FL
O
PS
DRAM HBM Cache Mode
1.4
1.6
1.8
2
2.2
Im
pr
ov
em
en
t
(x
)
(a) DGEMM
0.1 0.9 1.8 3.6 7.2 14.4 28.8
0.5
1
1.5
·104
Matrix Size (GB)
C
G
M
FL
O
PS
1
2
3
Im
pr
ov
em
en
t
(x
)
Speedup by HBM w.r.t. DRAM
Speedup by Cache w.r.t. DRAM
(b) MiniFE
1 2 4 8 16 32
1.06
1.08
1.1
·10−2
Table Size (GB)
G
U
PS
(c) GUPS
1.1 2.2 4.4 8.8 17.5 35
1
1.5
2
·108
Graph Size (GB)
T
E
PS
(d) Graph500
5.6 11.3 22.5 45 90
2.5
3
·106
Problem Size (GB)
L
oo
ku
ps
/s
(e) XSBench
Figure 4: Application performance in different memory modes varying the problem size. Evaluation is based on single-node
execution using 64 OpenMP threads. No measurements for HBM in flat mode (red bar) when the problem size exceeds its
capacity.
By comparing the top and bottom panels of Fig. 4
it is evident that the application memory access pattern
is an important factor to achieve high performance when
using HBM. Regular applications are generally sensitive to
memory bandwidth. These applications can directly benefit
from HBM and show promising speedup. On the other
hand, applications exhibiting random access pattern are more
sensitive to memory latency, thus they are penalized by the
higher latency of HBM. By Little’s Law [20], the memory
throughout equals to the ratio between the outstanding mem-
ory requests and the memory latency. If an application has
regular access pattern, both prefetcher and the out-of-order
core can perform well to increase the number of memory
requests. In that case, the bottleneck becomes the maximum
number of concurrent requests that can be supported by the
hardware. Since HBM provides 4× higher bandwidth than
DRAM, applications with regular access benefit significantly
from such high-bandwidth memories. In the case of ran-
dom access pattern, the memory addresses cannot be pre-
determined and the number of requests remains small. Thus
the throughput is bound by the latency. The benchmarking
on the testbed shows that DRAM has about 20% lower
latency than HBM (presented in Fig. 3 and also reported
by [9]). As a result, the latency-bound applications do not
benefit from HBM but performs better on DRAM.
In general, we observe the cache mode does not show
clear advantage when compared to HBM or DRAM, but
performance in this mode generally fall in between the
highest and the lowest. For applications that benefit from
HBM, the cache mode is an effective way to achieve high
performance when the data size is larger than the HBM
capacity.
C. Impact of Problem Size on Application Performance
The problem size directly determines whether it is feasible
or convenient to store the application data entirely in HBM.
First, even if the application access pattern favors HBM
in flat mode, the problem size might be too large to fit
in HBM. Second, HBM can be used to augment the total
memory capacity of the platform, which might be beneficial
for applications with very large input sets, especially on
system with a relatively small DRAM. On platforms with
similar ratio between DRAM and HBM, the only way to
run some large problems might be to use both HBM and
DRAM side-by-side, e.g., setting HBM in flat mode and
interleaving memory allocation between the two memories.
The problem size also impacts the application perfor-
mance from hybrid memory system. When the problem size
fits in HBM, the optimal configuration for applications with
regular access is to use HBM in flat mode and allocate data
only in HBM. This is clear from the top panel of Fig. 4,
where the configuration with HBM (red bars) achieve the
highest performance for both applications over all problem
sizes that fit in HBM. When the problem size is larger
than HBM but still comparable to its capacity, the cache
mode can significantly improve the performance compared
to the performance on DRAM. However, the performance
improvement from the cache mode decreases as the problem
size increases. In the case of the MiniFE, the improvement
from the cache mode drops to 1.05× when the problem size
is nearly twice HBM capacity. This performance degradation
in cache mode is consistent with the results of the STREAM
benchmark. This result can be useful for determining multi-
node configuration of a given problem. If the application
has good parallel efficiency across multi-nodes, with enough
compute nodes, the optimal setup is to decompose the
problem so that each compute node is assigned with a sub-
problem that has a size close to the HBM capacity.
The performance gap between DRAM and HBM increases
as the problem size increases in latency-bound applications.
The bottom panels of Fig. 4 show that when the problem
is small, i.e., 1 GB table for GUPS, 1.1 GB graph for
Graph500 and 5.6 GB grid points for XSBench, running
on DRAM (blue bar) or HBM (red bar) or cache mode
(grey bar) results in small performance difference. However,
when the problem size is larger than the HBM capacity, the
performance gap increases. For instance, running Graph500
on a large scale graph, allocating data only to DRAM
leads to a 1.3× performance increase compared with the
performance from using HBM in cache mode. There are two
main reasons for this result. First, as presented in Fig. 3, the
difference between the random read latency to DRAM and to
HBM increases when the data size increases. Second, these
applications have poor data locality so that data brought in
2 4 6 8 10
50
150
250
350
450
Size (GB)
B
an
dw
id
th
(G
B
/s
)
DRAM (ht = 1) HBM (ht = 1)
DRAM (ht = 2) HBM (ht = 2)
DRAM (ht = 3) HBM (ht = 3)
DRAM (ht = 4) HBM (ht = 4)
Figure 5: Impact on memory bandwidth by using hardware
threads on DRAM and HBM measured by the STREAM
benchmark.
cache is not used. In this case, the cache brings limited
benefit to hide memory latency but the overhead of cache
is still paid. As HBM in cache mode uses direct mapping
scheme, when the data size increases to be comparable to the
HBM capacity, the number of capacity conflicts increases,
further lowering the performance of cache mode.
D. Impact of Hardware Threads on Application Perfor-
mance
Each KNL core supports four hardware threads so that
up to 256 hardware threads are available on a single node.
Intel hyper-threading allows multiple hardware threads to
issue instructions without performing a complete context
switch. Hardware threads that are waiting for data or an
internal resource are temporarily parked while the processor
core fetches instructions from another hardware thread.
Thus hardware multi-threading is generally considered as an
effective way to hide latency, especially for latency-bound
applications.
On Intel KNL, using additional hardware threads per core
might help hide memory latency, which can results in an
overall performance improvement especially when accessing
HBM in flat mode. As the first step, we quantify the impact
of using hardware threads on the memory bandwidth. We
measure the aggregated bandwidth by using one to four
hardware threads per core when running the STREAM
benchmark presented in previous section. The measured
bandwidth is reported in Fig. 5. For HBM, using two
hardware threads per core (blue dashed lines) reaches 1.27×
the bandwidth measured with only one hardware thread per
core at all data sizes. Other thread counts have varying
performance on different sizes, but the graph shows that one
thread per core cannot achieve peak bandwidth. Applications
using DRAM (red lines) only slightly benefit from more
64 128 192
200
400
600
800
No. of Threads
G
FL
O
PS
DRAM HBM Cache Mode
1
1.2
1.4
1.6
Im
pr
ov
em
en
t
(x
)
(a) DGEMM
64 128 192 256
1
2
3
·104
No. of Threads
C
G
M
FL
O
PS
1
1.2
1.4
1.6
Im
pr
ov
em
en
t
(x
)
Speedup on DRAM
Speedup on HBM
Speedup in Cache Mode
(b) MiniFE
64 128 192 256
2
2.5
3
·108
No. of Threads
T
E
PS
1
1.2
1.4
Im
pr
ov
em
en
t
(x
)
(c) Graph500
64 128 192 256
3
4
5
6
·106
No. of Threads
L
oo
ku
ps
/s
1
1.5
2
Im
pr
ov
em
en
t
(x
)
(d) XSBench
Figure 6: Impact on the performance of applications with sequential access (DGEMM and MiniFE in the top two panels)
and on applications with random memory access (Graph500 and XSBench in the bottom two panels) using different number
of hardware threads. The black lines indicate the speedup by using X threads w.r.t 64 threads.
hardware threads so that all four red lines overlap with each
other in Fig. 5.
The top panel of Fig. 6 shows the performance of ap-
plications with regular memory access patterns, DGEMM
and MiniFE, varying the number of hardware threads.
As the plots show, these applications significantly benefit
from HBM when using multiple hardware threads because
they achieve higher aggregated bandwidth. We observe a
1.7× performance improvement when moving from 64 (one
hardware thread/core) to 192 (three hardware threads/core)
hardware threads for both DGEMM and MiniFE.1 As
explained above, the additional hardware threads do not
effectively hide DRAM latency, thus neither DGEMM nor
MiniFE show particular performance benefit from using
hyper-threading.
Hyper-threading can be used to hide memory latency in
irregular applications as well. We quantify the impact of
1Note that results relative to DGEMM with 256 hardware threads are
not available as the run can not complete successfully.
hyper-threading on Graph500 and XSBench performance.
The results are reported in the bottom two panels of Fig. 6.
Both applications benefit from additional hardware threads
in all three memory configurations. XSBench reaches the
highest performance (2.5×) by using 256 threads in HBM
and in cache mode, while the DRAM configuration achieves
1.5× performance compared to the execution with one
thread/core. For this application, multi-threading can effec-
tively hide latency so that the best configuration with hyper-
threading (HBM) is different from the best option with a
single thread/core (DRAM). Graph500 shows similar im-
provements but DRAM still remains the best configuration,
as it shows the highest performance when using 128 threads.
All three memory configurations achieve 1.5× performance
when using multiple hardware threads. These results indi-
cates memory-bound applications can significantly benefit
from multi-threading, both in HBM and DRAM configura-
tions.
Our results also show that the optimal number of hardware
threads varies from application to application. XSBench
shows a trend of higher performance from larger count of
threads on all three memory configurations. In Graph500,
all memory configurations achieves the best performance on
128 threads. Different memory configurations have different
optimal counts of threads in MiniFE. An extensive study to
find the optimal setup for each application is out of the scope
of this work. However, it is important to note that HBM
is more sensitive to multi-threading and it could become
the optimal options even for latency-bound applications like
XSBench.
V. RELATED WORK
Next-generation supercomputers will likely feature a hy-
brid memory sub-systems consisting of conventional DRAM
systems next to 3D-stacked memory and/or Non-Volatile
RAM. 3D-stacked DRAM is a high density, high bandwidth
memory obtained by stacking multiple DRAM semiconduc-
tor dies on top of each other and communicating with
Through-Silicon-Vias (TSV) [21]. Currently, besides the
Intel KNL analyzed in this work, other systems such as
Altera Stratix 10 and Micron Hybrid Memory Cube are also
equipped with HBM [22], [23], [24]. Chang et al. [25] show
that the most significant performance benefit of 3D-Stack
memory is the largely increased bandwidth with respect to
conventional DRAM. However, the same study also shows
that latency to HBM is not reduced as expected. One of the
first examples of HBM is the MCDRAM that is a 3D-stacked
DRAM that is used in the Intel Xeon Phi processor. Since
the delivery of 3D-stack memories on the KNL, several
performance evaluation with HPC applications have been
conducted to understand the performance of the HBM. Ros-
ales et al. [26] study the performance differences observed
when using flat, cache and hybrid HBM configurations
together with the effect of memory affinity and process
pinning in Mantevo suite and NAS parallel benchmark.
Barnes et al. [27] present the initial performance results of
20 NESAP scientific applications running on KNL nodes
of the Cori system and comparing KNL hardware features
with traditional Intel Xeon architectures. This study mainly
targets at how to effectively run NESAP applications in the
Cori system whereas we focus on giving general guidelines
on what kind of applications characteristics benefit from
running on a hybrid memory system. Moreover, both works
analyze traditional scientific applications while we also
evaluate the data-analytics applications. Other studies [5],
[6] focus on the performance of a specific application on a
KNL system and show how to optimize the application to
increase the performance benefit. More specifically, Ba´lint et
al. [6] shows of a performance analysis of linear solvers and
kernels used in Lattice Quantumchromodynamics (QCD) on
16 KNL nodes. This work also emphasizes the importance of
using hyper-threads to effectively exploit HBM bandwidth.
DeTar et al. [5] optimize a staggered conjugate gradient
solver in the MILC application for KNL and report that
use of HBM brings a large improvement for the MILC
application.
VI. DISCUSSION AND CONCLUSIONS
The 3D-stacked high-bandwidth memories are emerging
on mainstream processors and hardware accelerators. These
new memory technologies potentially bring significant im-
provement to bandwidth-bound applications but they are
currently more expensive and present an higher latency than
traditional DRAM technologies. Thus their usage is limited
to small capacities. To mitigate these limitations, current
compute nodes often consist of a small capacity HBM (high
bandwidth, high latency) side-by-side with large capacity
DRAM (lower bandwidth, lower latency). The optimal usage
of such hybrid memory systems is still largely unknown, as
the first hardware prototypes have just arrived to the market.
In this paper, we analyzed one of such hybrid memory
systems, the Intel KNL, which features a 16 GB HBM next
to a 96 GB DRAM. We studied a variety of traditional
HPC and emerging data-analytics applications and identified
memory access pattern, problem size, and threading as the
main applications characteristics to determine whether HBM
is beneficial or not. Our conclusions are backed up by
experiments on real state-of-the-art compute node equipped
with Intel KNL processors.
Our results show that applications with sequential ac-
cess pattern are usually bandwidth-bound and benefit the
most from HBM. Our evaluation indicates that using HBM
provides up to 3× performance improvement compared to
storing data structures in DRAM. Applications with random
access pattern, such as XSBench and Graph500, are instead
latency-bound. Thus they suffer from the higher latency on
HBM and achieve better performance when storing data
structures in DRAM. For those applications, using multiple
hardware threads per core can effectively hide HBM higher
latency. Our results show that when using multiple hardware
threads per core, HBM becomes a more effective solution.
However, other applications, such as Graph500, might not be
able to completely hide the memory latency, hence DRAM
still gives the best performance.
Although we studied a specific hybrid memory system,
we believe that our conclusions can be generalized to other
heterogeneous memory systems with similar characteristics.
Our study provides guidelines for selecting suitable memory
allocation based on application characteristic and problem to
solve. It also helps set the correct expectation of performance
improvement when porting and optimizing applications on
such systems. However, we used a coarse-grained approach,
in which all application data structures are either stored in
HBM or DRAM. In the future, we plan to investigate a finer-
grained approach in which we can apply our conclusions to
individual data structures and eventually employ Intel KNL
hybrid HBM mode whenever necessary.
ACKNOWLEDGMENTS
This work was funded by the European Commission
through the SAGE project (grant agreement no. 671500,
http://www.sagestorage.eu/). This work used the ARCHER
UK National Supercomputing Service (http://www.archer.ac.
uk. This work was supported by the DOE Office of Science,
Advanced Scientific Computing Research, under the ARGO
project (award number 66150) and the CENATE project
(award number 64386).
REFERENCES
[1] P. Greenhalgh, “Big. little processing with arm cortex-a15 &
cortex-a7,” ARM White paper, pp. 1–8, 2011.
[2] A. Duran and M. Klemm, “The Intel® many integrated core
architecture,” in High Performance Computing and Simula-
tion (HPCS), 2012 International Conference on. IEEE, 2012,
pp. 365–366.
[3] G. Kestor, R. Gioiosa, D. J. Kerbyson, and A. Hoisie,
“Quantifying the energy cost of data movement in scientific
applications,” in Workload Characterization (IISWC), 2013
IEEE International Symposium on. IEEE, 2013, pp. 56–65.
[4] I. B. Peng, S. Markidis, E. Laure, G. Kestor, and R. Gioiosa,
“Exploring application performance on emerging hybrid-
memory supercomputers,” in High Performance Computing
and Communications; IEEE 14th International Conference
on Smart City; IEEE 2nd International Conference on Data
Science and Systems (HPCC/SmartCity/DSS), 2016 IEEE
18th International Conference on. IEEE, 2016, pp. 473–
480.
[5] R. Li, C. DeTar, D. Doerfler, S. Gottlieb, A. Jha, D. Kalamkar,
and D. Toussaint, “MILC staggered conjugate gradient per-
formance on intel KNL,” arXiv preprint arXiv:1611.00728,
2016.
[6] B. Joo´, D. D. Kalamkar, T. Kurth, K. Vaidyanathan, and
A. Walden, “Optimizing wilson-dirac operator and linear
solvers for intel® KNL,” in International Conference on High
Performance Computing. Springer, 2016, pp. 415–427.
[7] K. Antypas, N. Wright, N. P. Cardo, A. Andrews, and
M. Cordery, “Cori: a cray xc pre-exascale system for nersc,”
Cray User Group Proceedings. Cray, 2014.
[8] Aurora supercomputer. [Online]. Available: http://aurora.alcf.
anl.gov
[9] A. Sodani, “Knights Landing (KNL): 2nd generation Intel®
Xeon Phi processor,” in Hot Chips 27 Symposium (HCS),
2015 IEEE. IEEE, 2015, pp. 1–24.
[10] C. Cantalupo, V. Venkatesan, J. Hammond, K. Czurlyo, and
S. D. Hammond, “memkind: An extensible heap memory
manager for heterogeneous memory platforms and mixed
memory policies.” Sandia National Laboratories (SNL-NM),
Albuquerque, NM (United States), Tech. Rep., 2015.
[11] (2017) Knights Landing (KNL) Testing & Development
Platform. http://www.archer.ac.uk/documentation/knl-guide/.
[Online; accessed 15-January-2017].
[12] “The DGEMM Benchmark,” http://www.nersc.gov/research-
and-development/apex/apex-benchmarks/dgemm/, 2017,
[Online; accessed 15-January-2017].
[13] M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willen-
bring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter,
H. K. Thornquist, and R. W. Numrich, “Improving Perfor-
mance via Mini-applications,” Sandia National Laboratories,
Tech. Rep. SAND2009-5574, 2009.
[14] “HPC Challenge Benchmark,” http://icl.cs.utk.edu/hpcc/
index.html, 2017, [Online; accessed 15-January-2017].
[15] “The Graph500 Benchmark,” http://www.graph500.org/,
2017, [Online; accessed 15-January-2017].
[16] “Xsbench: The Monte Carlo macroscopic cross sec-
tion lookup benchmark.” https://github.com/ANL-CESAR/
XSBench, 2017, [Online; accessed 01-January-2017].
[17] J. D. McCalpin, “A survey of memory bandwidth and machine
balance in current high performance computers,” IEEE TCCA
Newsletter, pp. 19–25, 1995.
[18] ——. (2016) Memory Latency on the Intel Xeon Phi
x200 Knights Landing Processor. [Online]. Available: https:
//sites.utexas.edu/jdm4372/2016/12/06/memory-latency-on-
the-intel-xeon-phi-x200-knights-landing-processor/
[19] “TinyMemBench,” https://github.com/ssvb/tinymembench,
2017, [Online; accessed 15-January-2017].
[20] J. L. Gustafson, “Littles law,” in Encyclopedia of Parallel
Computing. Springer, 2011, pp. 1038–1041.
[21] G. H. Loh, “3D-stacked memory architectures for multi-core
processors,” in Computer Architecture, 2008. ISCA’08. 35th
International Symposium on. IEEE, 2008, pp. 453–464.
[22] “Micron Technology - Hybrid Memory Cube, Memory and
Storage,” https://www.micron.com/products/hybrid-memory-
cube.
[23] J. T. Pawlowski, “Hybrid memory cube (hmc),” in 2011 IEEE
Hot Chips 23 Symposium (HCS), Aug 2011, pp. 1–24.
[24] “Stratix 10 TX,” https://www.altera.com/products/fpga/
stratix-series/stratix-10/overview.html.
[25] D. W. Chang, G. Byun, H. Kim, M. Ahn, S. Ryu, N. S.
Kim, and M. Schulte, “Reevaluating the latency claims of 3d
stacked memories,” in Design Automation Conference (ASP-
DAC), 2013 18th Asia and South Pacific. IEEE, 2013, pp.
657–662.
[26] C. Rosales, J. Cazes, K. Milfeld, A. Go´mez-Iglesias,
L. Koesterke, L. Huang, and J. Vienne, “A comparative
study of application performance and scalability on the Intel
Knights Landing processor,” in International Conference on
High Performance Computing. Springer, 2016, pp. 307–318.
[27] T. Barnes, B. Cook, J. Deslippe, T. K. He, T. Koskela,
M. Lobet, T. Malas, L. Oliker, A. Ovsyannikov, A. Sarje et al.,
“Evaluating and optimizing the nersc workload on knights
landing,” in Supercomputing Conference, 7th International
Workshop on Performance Modeling, Benchmarking and Sim-
ulation of High Performance Computer Systems, 2016, pp.
43–53.
