PROFET: modeling system performance and energy without simulating the CPU by Radulovic, Milan et al.
PROFET: Modeling System Performance and Energy Without
Simulating the CPU
Milan Radulovic
Barcelona Supercomputing Center
(BSC) & Universitat Politècnica de
Catalunya (UPC)
Barcelona, Spain
milan.radulovic@bsc.es
Rommel Sánchez Verdejo
BSC & UPC
Barcelona, Spain
rommel.sanchez@bsc.es
Paul Carpenter
BSC
Barcelona, Spain
paul.carpenter@bsc.es
Petar Radojković
BSC
Barcelona, Spain
petar.radojkovic@bsc.es
Bruce Jacob
University of Maryland
College Park, Maryland, USA
blj@umd.edu
Eduard Ayguadé
BSC & UPC
Barcelona, Spain
eduard@ac.upc.edu
ABSTRACT
The approaching end of DRAM scaling and expansion of emerg-
ing memory technologies is motivating a lot of research in future
memory systems. Novel memory systems are typically explored by
hardware simulators that are slow and often have a simplified or
obsolete abstraction of the CPU. This study presents PROFET, an
analytical model that predicts how an application’s performance
and energy consumption change when it is executed on different
memory systems. The model is based on instrumentation of an
application execution on actual hardware, so it already takes into
account CPU microarchitectural details such as the data prefetcher
and out-of-order engine. PROFET is evaluated on two real platforms,
Sandy Bridge-EP E5-2670 and Knights Landing Xeon Phi, with various
memory configurations. The evaluation results show that PROFET’s
predictions are accurate, typically with
only 2% difference from the values measured on actual hardware.
We release the PROFET source code and all input data required for
memory system and application profiling. The released package
can be seamlessly installed and used on high-end Intel platforms.
CCS CONCEPTS
• Computing methodologies → Model development and analysis;
• Hardware → Dynamic memory.
KEYWORDS
Memory bandwidth; Memory access latency; DRAM; MCDRAM;
Performance; Power; Energy; Modeling
1 INTRODUCTION
The memory system is a major contributor to the deployment
and operational costs of a large-scale high-performance comput-
ing (HPC) cluster [24, 35, 38], and in terms of system performance
it is one of the most critical aspects of the system’s design [20, 41].
For decades, most server and HPC cluster memory systems have
been based on DRAM DIMMs. However, it is becoming question-
able whether DRAM DIMMs will continue to scale and meet the
industry’s demand for high performance and high capacity memory.
Significant effort is therefore being invested into the research and
development of future memory systems.
Application performance on novel memory systems is typically
estimated using a hardware simulator. The simulation is, however,
time consuming, which limits the number of design options that
can be explored within a practical length of time. Also, although
memory simulators are typically well validated [23, 32], current
CPU simulators have various shortcomings, such as simplified out-
of-order execution, an obsolete data prefetcher and a lack of virtual-
to-physical memory translation, all of which can make a huge
difference between the simulated and actual memory system, in
terms of behavior and performance.
This study proposes PROFET (PROFiling-based EsTimation of
performance and energy), an analytical model that predicts how an
application’s performance, power and energy consumption would
change when it is executed on a new memory system. The method
is based on instrumentation of an application running on actual
hardware, so it already takes account of CPU microarchitectural
details such as the real (and not publicly disclosed) data prefetcher
and out-of-order engine. Therefore, it can be used to model various
platforms as long as they support the required application profiling.
PROFET was initially developed for the Sandy Bridge platform,
and later we evaluated it for the Knights Landing (KNL) server.
Adjusting the PROFET model to the KNL system was trivial,
as it required changes to only a few hardware parameters, such as
the reorder buffer size.
We evaluated PROFET on two actual platforms: Sandy Bridge-EP
E5-2670 with four DRAM configurations DDR3-800/1066/1333/1600,
and Knights Landing Xeon Phi with DDR4 and 3D-stacked MCDRAM.
The evaluation results show that PROFET’s predictions are very
accurate: the average difference from the performance, power and
energy measured on the actual hardware is only 2%, 1.1% and 1.7%,
respectively. We also compare PROFET’s performance predictions
with simulation results for the Sandy Bridge-EP E5-2670 system
with ZSim [33, 40] and DRAMSim2 [32], and PROFET is significantly
more accurate than the simulator. PROFET is also faster
than the hardware simulators by three orders of magnitude, so it
can be used to analyze production HPC applications, on arbitrarily
sized systems.
[Figure 1 plot. X-axis: used memory bandwidth; y-axis: memory access
latency. Labeled features: constant region, lead-off latency, loaded
latency, maximum sustained bandwidth, maximum theoretical bandwidth,
and a "25 – 35% reduction" annotation.]
Figure 1: Bandwidth–latency curve showing how the mem-
ory access latency depends on the used memory bandwidth.
It is critical to distinguish between the lead-off and loaded
memory access latency regions [20].
We release the PROFET source code as open source [31]. The
release includes all inputs and outputs and evaluation results for the
case study that is used in the rest of this paper. The package includes
the memory system profiles, CPU parameters, application profiles
and memory power parameters, as well as the power, performance
and energy outputs from PROFET and the measurements on the
baseline and target platforms. The released PROFET model is ready
to be used on high-end Intel platforms, and we would encourage
the community to use it, adapt it to other platforms, and share their
own evaluations.
2 PROFET OVERVIEW
This section summarizes the main idea behind PROFET’s analytical
models and it describes the inputs and outputs to PROFET.
2.1 Background: Memory bandwidth and
latency
The memory access latency and used bandwidth are often described
as independent concepts, but they are in fact inherently interre-
lated [20]. We start by clarifying what is meant by the lead-off and
loaded memory access latencies, and then we address the connec-
tion between memory access latency and used memory bandwidth.
Lead-off memory access latency corresponds to the single-ac-
cess read latency in an idle system. This latency includes the time
spent in the CPU load/store queues, cache memory, memory con-
troller, memory channel and main memory. Loaded memory ac-
cess latency corresponds to the read latency in a loaded system. In
addition to all timings included in the lead-off latency, the loaded
memory latency includes shared-resource contention among con-
current memory requests. As illustrated in Figure 1, the loaded
memory access latency increases (non-linearly) with the used band-
width, due to increasing contention among concurrent memory
requests. It is critical to distinguish between the lead-off and loaded
latencies because the difference between them can be on the order
of hundreds of nanoseconds.
2.2 The idea: Moving between memory curves
The main idea of this paper is that we can understand the effect
of changing the memory system by understanding how the appli-
cation moves from one bandwidth–latency curve to another. We
illustrate this idea using the DDR4 and MCDRAM memories on
[Figure 2 plot. X-axis: used memory bandwidth; y-axis: memory access
latency. Two curves, DDR4 and MCDRAM, with the application marked at
$(BW_{used}^{DDR4}, Lat_{mem}^{DDR4})$ when using DDR4 and at
$(BW_{used}^{MCDRAM}, Lat_{mem}^{MCDRAM})$ when using MCDRAM.]
Figure 2: High-level view of the transition from DDR4 to
high-bandwidth MCDRAM memory on the KNL platform.
Intel’s Knights Landing platform. This platform has two memory
systems, so there are two bandwidth–latency curves, shown together
on the same plot in Figure 2.¹ When used bandwidth is high,
as seen towards the right of the figure, MCDRAM is clearly better.
In contrast, when used bandwidth is low, as seen towards the left,
DDR4 has lower latency due to its lower lead-off latency.
When an application (or application phase) executes on the
DDR4 main memory, it will be positioned at some point on the
DDR4 curve, e.g. $(BW_{used}^{DDR4}, Lat_{mem}^{DDR4})$, illustrated
in Figure 2. Analogously, when the same application is executed on
the MCDRAM memory, it will be positioned at some point on the MCDRAM
curve, e.g. $(BW_{used}^{MCDRAM}, Lat_{mem}^{MCDRAM})$. We see that
the application in Figure 2 benefits from running on the MCDRAM
through a lower memory latency (MCDRAM point is lower) and a higher
used bandwidth (MCDRAM point is to the right). This idea, of moving
between bandwidth–latency curves, is central to the PROFET
performance, power and energy models presented in this paper.
2.3 PROFET inputs
Figure 3 gives a high-level overview of the whole process of per-
formance, power and energy estimation. The inputs to PROFET,
shown towards the left of the figure, are the bandwidth–latency
curves, measured for the baseline memory system and the target
memory system, parameters for the CPU (which is the same for
both memory systems), as well as the application profiles on the
baseline memory system. These inputs can all be easily obtained on
mainstream platforms and many emerging platforms. The outputs
from PROFET will be the predicted performance, power and energy
consumption on the target memory system.
Memory system profiling is done via bandwidth–latency curves,
for the baseline and target memory systems, along the lines out-
lined in Section 2.1. The precise method for obtaining these curves
is given in Section 3, which describes the memory profiling mi-
crobenchmarks and their outputs.
CPU parameters are needed, alongside the application profil-
ing (see below), to characterize the relationship between memory
system latency and execution time. This relationship is dependent
on the processor’s ability to hide memory latency by overlapping
memory accesses with independent instructions. As detailed in
Section 4.3, the PROFET performance model therefore requires some
basic parameters of the processor under study: re-order buffer (ROB)
capacity, miss information status holding register (MSHR) capacity
and minimum theoretical cycles-per-instruction (CPI).
¹ Figure 2 shows a simplified bandwidth–latency curve, as discussed
in Section 2.1. Detailed curves are given in Figure 4 in Section 3.
[Figure 3 diagram. Inputs: memory system profiling (bandwidth–latency
curves of the baseline and target memory, Section 3); CPU parameters
(ROB, IPCmax, LLClat); application profiling with hardware counters
on the baseline system, sampled per time segment (Section 4.1);
memory power parameters (I, V, t for each memory system). Models:
performance model (Sections 4.2, 4.3, 4.4 and 4.5), power model
(Appendix A.1), energy model (Appendix A.2, E = P × Δt). Outcomes:
performance, power and energy estimation.]
Figure 3: Diagram of the whole process of performance, power and
energy estimation. The cross-references indicate which section
describes which part of the estimation process.
Application profiling is done on the baseline memory system,
and consists of executing the application and profiling it using
hardware performance counters. Application performance profil-
ing obtains the number of CPU cycles, number of instructions,
number of last-level cache (LLC) misses and the read and write
memory bandwidths. Application power profiling measures the total
power consumption using integrated or external power measure-
ment infrastructure, and memory-related power parameters using
performance counters. Since the application’s behavior changes
over time, application profiling is done by sampling over regular
time intervals, which we refer to as segments. Further details on
application profiling are given in Section 4.1.
Memory power parameters characterize the baseline and tar-
get memory systems, in terms of the power consumption in various
operational modes, idle state and power-down states, as well as the
energy consumption for various operations such as read and write
transfers, row buffer hits and misses. These figures are typically
provided by the memory device manufacturers [28].
2.4 Performance, power and energy estimation
Figure 3 gives an overview of the whole process of performance,
power and energy estimation. Since application profiling involves
collecting a trace over the program’s execution, the PROFET perfor-
mance and power models are run for each segment (time interval)
in the trace. This gives the predicted execution time, power and
energy consumption of each segment. Summing over time gives
the final execution time and energy for the whole application. The
application’s average power demand is total energy divided by total
execution time.
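The per-segment aggregation described above can be sketched as
follows. This is a minimal illustration under stated assumptions,
not the released PROFET code: the `SegmentPrediction` record and the
`aggregate` function are hypothetical names, and each segment is
assumed to carry a predicted duration and a predicted average power.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SegmentPrediction:
    """Hypothetical per-segment output of the performance and power models."""
    time_s: float   # predicted execution time of the segment, in seconds
    power_w: float  # predicted average power during the segment, in watts


def aggregate(segments: Iterable[SegmentPrediction]) -> Tuple[float, float, float]:
    """Return (total time [s], total energy [J], average power [W])."""
    segments = list(segments)
    total_time = sum(s.time_s for s in segments)
    # Energy of a segment is its average power times its duration (E = P x dt).
    total_energy = sum(s.power_w * s.time_s for s in segments)
    # Average power demand is total energy divided by total execution time.
    avg_power = total_energy / total_time if total_time > 0 else 0.0
    return total_time, total_energy, avg_power
```

Note that the average power is weighted by segment duration, so it
differs from a plain mean of the per-segment power values whenever
the predicted segment durations are not all equal.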
The PROFET performance model reads the application per-
formance information from the profiling trace-file and determines
the application’s position on the bandwidth–latency curve for the
baseline memory system. As described in detail in Sections 4.2, 4.3
and 4.5, PROFET then estimates the application’s position on the
memory bandwidth–latency curve for the target memory system,
and uses it to predict the application’s performance on the target
memory system.
The PROFET power model estimates the power consumption
of the target memory system using the application performance
profiling and the memory power parameters. Finally, the PROFET
energy model combines the outputs of the performance and
power models. Due to a lack of space, the detailed description and
evaluation of the PROFET power and energy models are presented
in Appendices A.1, A.2 and B.1.
3 MEMORY SYSTEM PROFILING
The baseline and target memory systems are characterized using
bandwidth–latency curves. For mature technologies, these curves
are measured on a real platform. For emerging memory devices that
are not yet available in off-the-shelf servers, the bandwidth–latency
curve can be measured on a developer board with a prototype
of the new device [2], or alternatively it can be provided by the
manufacturer.
The bandwidth–latency curve is determined using a pointer-
chasing microbenchmark designed to measure latency [34] running
concurrently with a derivative of the STREAM benchmark [27] that
was modified to vary the load on the memory system. Currently,
profiling of a single memory system configuration, e.g., DDR3-1600,
is performed in approximately 15 minutes (see Appendix B.2.3).
Although Figure 1 (Section 2.1) plots a single bandwidth–latency
curve, in reality a single memory system has a family of curves that
depend on the ratio between reads and writes in the overall memory
traffic.
As an example, Figure 4 shows the measured bandwidth–latency
curves for the Knights Landing and Sandy Bridge platforms, as the
proportion of reads is varied between 50% and 100%. The lightest
curves correspond to 50% reads and the darkest curves correspond
to 100% reads. Instead of the single bandwidth–latency curve per
memory system that was illustrated in Figure 1, we now see a family
of curves. When the used memory bandwidth is low or moderate,
the read fraction has negligible impact on the memory access la-
tency and the bandwidth–latency curves practically overlap. As the
stress to the memory system increases, however, the read fraction
starts to have a significant impact on latency. For example, in Fig-
ure 4b, at an aggregate bandwidth of 41.5 GB/s, the (read) latency
with 100% reads is 132 ns, but the read latency with 50% reads and
50% writes is 232 ns, an increase of 100 ns (76%). In general, in
all the experiments shown in Figure 4, curves with a higher
percentage of writes (lighter curves) lie higher (at higher
latency) on the chart. The main reason is that write requests incur
additional delays that are not required by memory reads [19].²
Increasing the proportion of write requests therefore reduces the
sustainable bandwidth and increases the loaded latency.
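One way to represent such a family of curves in software is sketched
below. This is an illustrative assumption, not the method used by
PROFET: the curve for the nearest measured read fraction is selected
and the latency is linearly interpolated along it. The names
`interpolate_latency`, `lookup_latency` and `example_curves` are
hypothetical. Only the two points at 41.5 GB/s are taken from
Figure 4b; the 80 ns idle point is a made-up placeholder.

```python
from bisect import bisect_left


def interpolate_latency(curve, bw):
    """Linearly interpolate latency [ns] at bandwidth bw [GB/s].

    curve is a list of (bandwidth, latency) points sorted by bandwidth,
    with no duplicate bandwidth values. Queries outside the measured
    range are clamped to the nearest end point.
    """
    bws = [point[0] for point in curve]
    if bw <= bws[0]:
        return curve[0][1]
    if bw >= bws[-1]:
        return curve[-1][1]
    i = bisect_left(bws, bw)
    (b0, l0), (b1, l1) = curve[i - 1], curve[i]
    return l0 + (l1 - l0) * (bw - b0) / (b1 - b0)


def lookup_latency(curves, read_fraction, bw):
    """Pick the curve with the closest read fraction, then interpolate."""
    nearest = min(curves, key=lambda rf: abs(rf - read_fraction))
    return interpolate_latency(curves[nearest], bw)


# Keys are read fractions. The (41.5, ...) points come from Figure 4b;
# the (0.0, 80.0) points are illustrative placeholders, not measurements.
example_curves = {
    1.0: [(0.0, 80.0), (41.5, 132.0)],
    0.5: [(0.0, 80.0), (41.5, 232.0)],
}

delta = (lookup_latency(example_curves, 0.5, 41.5)
         - lookup_latency(example_curves, 1.0, 41.5))
relative_increase = delta / lookup_latency(example_curves, 1.0, 41.5)
```

With these points, `delta` reproduces the 100 ns gap quoted in the
text, and `relative_increase` is 100/132, i.e. roughly 76%.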
Recently, Clapp et al. [4] also did a preliminary analysis of
bandwidth–latency curves for different memory frequencies (DDR3-
1333 and DDR3-1600) and different read-to-write ratios (3:1 and 2:1).
Based on an analysis of four curves, the authors conclude that it is
sufficient to use a single, generic memory bandwidth–latency curve
for different frequencies and memory traffic compositions. Our
analysis is based on numerous measurements on a wide range of
DDR3, DDR4 and MCDRAM frequencies, with fine-grain changes
in the read-to-write ratio. Our findings show that different mem-
ory frequencies have fundamentally different bandwidth–latency
curves with different shapes and different lead-off and maximum
memory access latencies. We also show that the read-to-write ratio
may have a significant impact on memory access latency. Directly
contrary to the conclusion of Clapp et al. [4], our study shows
that the relationship between bandwidth and latency cannot be
approximated with a single curve. To the best of our knowledge,
ours is the first study of memory system read latency to reach
this conclusion.
4 PERFORMANCE MODEL
This section presents the PROFET analytical model that predicts
the application’s performance. We start, in Section 4.1, by outlining
the application characteristics that must be measured on the base-
line system. Then, in Section 4.2, we introduce the problem with a
simple case, that of an in-order processor. Next, in Section 4.3, we
analyse a complex out-of-order processor. Section 4.4 completes the
analysis of out-of-order processor performance as a function of la-
tency. Finally, in Section 4.5, we explain how PROFET combines this
latency–performance characterization with the bandwidth–latency
curves to obtain the estimate, with error bars, of the application
performance on the target memory system.
4.1 Application profiling
As outlined in Section 2.3, the application’s execution is divided
into segments at regular time intervals. For each segment, we
measure, using performance measuring counters, the number of cycles,
number of instructions, read last-level cache (LLC) misses, used
memory bandwidth, and the overall fraction of reads. These parameters
and the notation used in the paper are listed in Table 1.
² Write Recovery time or tWR is the minimum delay between the end of
a write and the next precharge command. The Write To Read delay time
or tWTR is the minimum time interval between a memory write and a
consecutive read.
[Figure 4 plots. X-axis: used memory bandwidth [GB/s]; y-axis: memory
access latency [ns]. Curve shading ranges from 50% RD (lightest) to
100% RD (darkest). Panel (a) shows DDR4-2400 and MCDRAM; panel (b)
shows DDR3-800 and DDR3-1600, with annotations 232 ns at (41.5 GB/s,
50% RD, 50% WR), 132 ns at (41.5 GB/s, 100% RD, 0% WR) and
Δlatency = 100 ns.]
(a) Knights Landing platform with a DDR4-2400 and MCDRAM.
(b) Sandy Bridge platform with DDR3-800/1066/1333/1600. DDR3-1066
and DDR3-1333 are excluded to improve the visibility.
Figure 4: Bandwidth–latency curves for the platforms under study.
Memory access latency w.r.t. used memory bandwidth cannot be
approximated with a single curve: as the used memory bandwidth
increases, memory traffic read/write composition makes a significant
latency impact.
Table 1: PROFET performance model input parameters
| Input parameter | Symbol |
|---|---|
| Number of cycles | $Cyc_{tot}$ |
| Number of instructions | $Ins_{tot}$ |
| Application read LLC misses | $Miss_{LLC}$ |
| Used memory bandwidth for total traffic | $BW_{used}^{(1)}$ |
| Fraction of reads in total traffic | $Ratio_{R/W}$ |
The used memory bandwidth, BWused , and fraction of reads,
RatioR/W, include all memory accesses, whether issued by the ap-
plication or the prefetcher, since both types of accesses cause con-
tention and have a similar impact on memory system read latency.
In contrast, MissLLC only includes LLC read misses issued by the
application. This parameter is used to estimate how the memory
read latency impacts application performance, and only application
read misses have a direct performance impact.
In order to determine the duration of the sampling interval, we
analyzed the tradeoff among the measurement overhead, trace-file
size and PROFET accuracy. An interval of 1 s was selected because it
provided high accuracy (see Section 7) while introducing negligible
measurement overhead of below 1%. With this sampling interval,
the trace-file size of the benchmarks used in the study is in the
range of hundreds of megabytes, which is acceptable.
4.2 In-order processors
This section derives the relationship between latency and perfor-
mance for a simple in-order CPU. In the interest of helping the
reader to follow the formulas, we start by summarizing PROFET’s
inputs and outputs in Table 2.
Table 2: Notation used in formulas: In-order processors
| Description | Symbol |
|---|---|
| Inputs (in addition to Table 1) | |
| Memory access latency from bandwidth–latency curve | $Lat_{mem}$ |
| Memory access penalty ($Lat_{mem}$ minus LLC hit latency) | $Pen_{mem}$ |
| Intermediate outputs | |
| Single LLC miss penalty (number of CPU stall cycles) | $Stalls_{LLC}$ |
| Application cycles-per-instruction | $CPI_{tot}$ |
| CPI component in the case of perfect LLC | $CPI_0$ |
| CPI component due to LLC miss penalties | $CPI_{LLC}$ |
| Outputs | |
| Application instructions-per-cycle ($1/CPI_{tot}$) | $IPC_{tot}$ |
Our analysis distinguishes between memory access latency, memory
access penalty and LLC miss penalty. Memory access latency,
Latmem, is the number of CPU cycles necessary for a single load
instruction that reads data from the main memory. It is measured
as part of the memory system profiling and given in the memory
bandwidth–latency curve. Memory access penalty, Penmem, is
the difference between the latency of a main memory access and the
latency of an LLC hit. The values of Latmem and Penmem are inputs
to PROFET. The values for the baseline memory system are found by
looking up the application’s used bandwidth, measured on the
baseline memory system. The values for the target memory system are
generated as explained in Section 4.5. Finally, LLC miss penalty,
StallsLLC, is calculated by PROFET as the average number of cycles
for which the CPU pipeline is stalled because of each LLC miss.
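The relationship between these quantities can be sketched in a few
lines. This is a hedged illustration rather than PROFET's code: the
function names are hypothetical, and the sketch assumes that the
latency read from a bandwidth–latency curve is given in nanoseconds
(as on the axes of Figure 4) and is converted to CPU cycles with the
core clock frequency.

```python
def latency_ns_to_cycles(latency_ns: float, cpu_freq_ghz: float) -> float:
    """Convert a latency in nanoseconds to CPU cycles.

    One nanosecond at f GHz corresponds to f cycles, so the conversion
    is a plain multiplication. The frequency is an input assumption.
    """
    return latency_ns * cpu_freq_ghz


def memory_access_penalty(lat_mem_cycles: float, llc_hit_cycles: float) -> float:
    """Pen_mem: main memory access latency minus LLC hit latency, in cycles."""
    return lat_mem_cycles - llc_hit_cycles


# Illustrative numbers only (not measurements from the paper):
# a 100 ns loaded latency on a 2.0 GHz core with a 30-cycle LLC hit.
lat_mem = latency_ns_to_cycles(100.0, 2.0)
pen_mem = memory_access_penalty(lat_mem, 30.0)
```

Keeping the unit conversion separate from the penalty calculation
makes it explicit that the curves and the CPI formulas use different
time units.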
We start by partitioning the application cycles-per-instruction,
$CPI_{tot}$, into two components [4, 8, 14, 22]:
$CPI_{tot} = CPI_0 + CPI_{LLC}$.
The first component, $CPI_0$, is the application’s CPI for the
hypothetical case of a 100% LLC hit rate. This component is not
affected by the memory access latency. The second component,
$CPI_{LLC}$, accounts for the execution stalls caused by LLC misses.
Once we know the number of stall cycles to be attributed to each LLC
miss, we can calculate its value as [3, 14]:
$$CPI_{LLC} = \frac{Miss_{LLC} \times Stalls_{LLC}}{Ins_{tot}} \tag{1}$$
We use the superscripts (1) and (2) to distinguish between the
baseline and target memory systems, respectively:
$$\text{Baseline memory: } CPI_{tot}^{(1)} = CPI_0^{(1)} + CPI_{LLC}^{(1)}$$
$$\text{Target memory: } CPI_{tot}^{(2)} = CPI_0^{(2)} + CPI_{LLC}^{(2)} \tag{2}$$
Since the memory access latency does not affect $CPI_0$, we have
$CPI_0^{(1)} = CPI_0^{(2)}$. Therefore, $CPI_{tot}^{(2)}$ from Eq. 2
can be expressed as:
$$CPI_{tot}^{(2)} = CPI_{tot}^{(1)} + \left( CPI_{LLC}^{(2)} - CPI_{LLC}^{(1)} \right) \tag{3}$$
Next, we assume that a change in the memory system, which for
this section is a change in the read access latency, does not change
the number of instructions, $Ins_{tot}$, or the application’s memory
access pattern. This is a reasonable assumption for applications or
computational kernels that do not use busy-waiting or dynamic
scheduling. The assumption is valid for both in-order and
out-of-order processors. A change in the memory access latency may
affect the timeliness of the prefetcher, but it should not
consistently affect its coverage or accuracy; in any case, we found
this effect to be small.³ We therefore also assume that the LLC miss
rate is unaffected by the change in memory latency. In summary, we
conclude that we do not need superscripts (1) or (2) on $Ins_{tot}$
and $Miss_{LLC}$. We can therefore substitute Eq. 1 into Eq. 3 to
obtain:
$$CPI_{tot}^{(2)} = CPI_{tot}^{(1)} + \frac{Miss_{LLC}}{Ins_{tot}} \times \left( Stalls_{LLC}^{(2)} - Stalls_{LLC}^{(1)} \right) \tag{4}$$
We have not yet assumed an in-order processor, so all the above
equations are also true for out-of-order processors. For an in-order
processor, we make the single observation that LLC misses directly
lead to pipeline stalls, i.e.: $Stalls_{LLC} = Pen_{mem}$.
Substituting this into Eq. 4 gives:
$$CPI_{tot}^{(2)} = CPI_{tot}^{(1)} + \frac{Miss_{LLC}}{Ins_{tot}} \times \left( Pen_{mem}^{(2)} - Pen_{mem}^{(1)} \right) \quad \text{for an in-order processor} \tag{5}$$
Eq. 5 is important because it shows that the difference in
application CPI for different memory systems, $CPI_{tot}^{(2)}$ and
$CPI_{tot}^{(1)}$, can be calculated based on the corresponding
memory access penalties, $Pen_{mem}^{(2)}$ and $Pen_{mem}^{(1)}$.
Finally, by substituting $IPC = 1/CPI$, the application performance
on the target memory system is calculated as:
$$IPC_{tot}^{(2)} = \frac{1}{\dfrac{1}{IPC_{tot}^{(1)}} + \dfrac{Miss_{LLC}}{Ins_{tot}} \times \left( Pen_{mem}^{(2)} - Pen_{mem}^{(1)} \right)} \quad \text{for an in-order processor} \tag{6}$$
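Eqs. 5 and 6 translate directly into a short function. The sketch
below is an illustration of the in-order formulas only, not the
released PROFET implementation; the function name
`predict_ipc_in_order` and the example numbers are hypothetical.

```python
def predict_ipc_in_order(cyc_tot: float, ins_tot: float, miss_llc: float,
                         pen_mem_base: float, pen_mem_target: float) -> float:
    """Predict IPC on the target memory system for an in-order CPU (Eq. 6).

    cyc_tot, ins_tot and miss_llc are the counters measured on the
    baseline memory system for one segment; the penalties are in cycles.
    """
    # Baseline CPI measured on the baseline memory system.
    cpi_base = cyc_tot / ins_tot
    # Eq. 5: each LLC miss stalls the pipeline for the full penalty,
    # so the CPI changes by misses-per-instruction times the penalty delta.
    cpi_target = cpi_base + (miss_llc / ins_tot) * (pen_mem_target - pen_mem_base)
    # Eq. 6: IPC is the reciprocal of CPI.
    return 1.0 / cpi_target


# Illustrative segment: 2000 cycles, 1000 instructions, 10 LLC read misses,
# with the memory access penalty dropping from 200 to 100 cycles.
ipc_target = predict_ipc_in_order(2000, 1000, 10, 200, 100)
```

In this made-up example the baseline CPI is 2.0 and the ten misses
each save 100 cycles, so the predicted CPI is 1.0 and the predicted
IPC doubles from 0.5 to 1.0.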
4.3 Out-of-order processors
The analysis for out-of-order (OOO) processors is more complex
because, following an LLC miss, the processor can continue executing
independent instructions without immediately being stalled.
Consequently, the number of stall cycles per LLC miss is no longer
equal to the full memory access penalty; it is typically strictly
lower: StallsLLC < Penmem. To handle this inequality, it
is necessary to introduce the additional symbols given in Table 3.
Table 3: Notation used in formulas: Out-of-order processors
| Description | Symbol |
|---|---|
| Inputs | |
| Instructions in reorder buffer | $Ins_{ROB}$ |
| Size of miss information status holding register (MSHR) | $MSHR$ |
| Minimum CPI, equal to reciprocal of maximum IPC | $CPI_{min}$ |
| Intermediate outputs | |
| Number of execution cycles overlapped with LLC miss stalls, due to OOO mechanism | $Cyc_{ooo}$ |
| Number of instructions executed during LLC miss stalls, due to OOO mechanism | $Ins_{ooo}$ |
| $Cyc_{tot}$ component in the case of perfect LLC | $Cyc_0$ |
| Memory level parallelism: number of concurrent LLC misses (memory accesses) | $MLP$ |
4.3.1 Isolated LLC miss. We first consider an isolated LLC miss.
As illustrated in Figure 5, when an LLC miss occurs (isolated or
not), the corresponding instruction must wait for data from memory,
but the CPU pipeline continues issuing and executing independent
instructions. Execution may halt, however, before the LLC miss is
resolved, for two reasons. First, instruction issue may stop because
the instruction window has filled with instructions, all of which
are dependent, directly or indirectly, on the instruction waiting
for data from main memory. Second, instruction commit may stop
because the reorder buffer (ROB) has filled with instructions that
cannot be committed until after the waiting instruction has itself
been committed. The upper part of Figure 5 shows a timeline
indicating whether instruction execution has halted, while the lower
part of the figure shows snapshots of the ROB occupancy before an
isolated LLC miss, while the processor is waiting for data, and
after the data has been received. Following the LLC miss, the ROB is
occupied with a certain number of instructions. The ROB begins to
fill, as the processor executes independent instructions. At some
point, either there are no more independent instructions or the ROB
becomes full. In either case, instruction execution will stall. Once
the LLC miss has been resolved and the LLC miss data are available,
the instructions waiting for the data can be executed and committed.
This allows the instructions in the ROB to be committed, so the
processor can resume issuing and executing new instructions.
³ We measured the number of prefetches per instruction on our Sandy
Bridge evaluation platform (Section 6.1). The overall difference
between DDR3-800 and DDR3-1600 memory configurations across all
benchmarks is less than 5%.
In Figure 5, the period after the LLC miss in which the processor
is executing new independent instructions is labeled as $Cyc_{ooo}$.
So, the number of stall cycles is given by [14, 22]:
$$Stalls_{LLC} = Pen_{mem} - Cyc_{ooo} \quad \text{for an isolated miss} \tag{7}$$
If execution is immediately halted following the LLC miss, then
$Cyc_{ooo}$ would equal zero, and the LLC miss penalty would equal
the memory access penalty, as for the in-order case in Section 4.2.
The number of independent instructions that are executed during
this period, of $Cyc_{ooo}$, is referred to as $Ins_{ooo}$, and
indicated in the lower half of Figure 5. The connection between
$Cyc_{ooo}$ and $Ins_{ooo}$ requires knowledge of the CPI over the
period. Our analysis partitions the application execution in
sampling segments of 1 second (detailed in Section 4.1), and
considers there to be a steady average execution rate. So, during
these execution segments the CPI equals its average rate of $CPI_0$,
as detailed in Table 2 and the corresponding text. Therefore,
$Cyc_{ooo}$ can be calculated as:
$$Cyc_{ooo} = CPI_0 \times Ins_{ooo} \quad \text{for an isolated miss} \tag{8}$$
The value of $Ins_{ooo}$ depends on the number of independent
instructions and the number of free instruction slots in the ROB. It
is analyzed in detail in the next section.
4.3.2 Estimating Insooo. State-of-the-art architectures do not
incorporate counters that can be used to measure the value of
Insooo. We therefore calculate bounds on its value and incorporate
these bounds into the PROFET error estimate. We use the
platform-specific parameters and the application’s measured CPI.
[Figure 5 diagram. Upper part: instruction execution timeline with
the events LLC miss, ROB fills and requested miss data, showing the
intervals Cycooo, LLC miss penalty (StallsLLC) and memory access
penalty (Penmem); after the data arrives, commit resumes and
execution ramps back to a steady state. Lower part: ROB occupancy
snapshots (free and occupied entries, InsROB, Insooo) while
committing and when the ROB is full.]
Figure 5: In OOO processors, LLC misses overlap with the execution
of the instructions independent of the missing data. The overlap
depends on the number of independent instructions in the instruction
window and number of free entries in ROB [22].
The Insooo lower bound is trivial: $Ins_{ooo} \geq 0$, since OOO
execution may stop immediately after the LLC miss and continue
being stalled until the requested miss data arrives.
The Insooo upper bound is calculated as the lower of two
constraints. The first is the reorder buffer size, $Ins_{ROB}$,
which corresponds to the maximum number of instructions that can be
stored in the ROB [22]:
$$Ins_{ooo}^{max1} = Ins_{ROB} \tag{9}$$
The ROB size is a characteristic of the target architecture. In our
study, we analyze two architectures, as described in Section 6. In
the Sandy Bridge-EP E5-2670 CPU the ROB comprises 168 entries, while
in the Intel Knights Landing Xeon Phi 7230 it has 72 entries.
The second upper bound is determined by the maximum number
of instructions that can be executed during the LLC miss. In this
scenario, the whole $Pen_{mem}$ is covered by the OOO execution, so
$Cyc_{ooo}$ would equal $Pen_{mem}$. Therefore, since $Ins_{ooo}$ is
calculated as $Cyc_{ooo}/CPI_0$ (Eq. 8), we can combine the two
equations to find that the maximum number of instructions that the
processor will execute in this time is $Pen_{mem}/CPI_0$. Since the
second upper bound assumes that OOO execution covers the whole
memory access penalty, there cannot be any stalls due to LLC misses,
i.e., $CPI_{LLC}$ would equal 0. Therefore, since the application’s
overall CPI is defined to be $CPI_{tot} = CPI_0 + CPI_{LLC}$, it must
be (in this case) that $CPI_{tot}$ equals $CPI_0$. Combining these
facts with $CPI_{tot} = Cyc_{tot}/Ins_{tot}$, the final form of the
second upper bound, given in terms of inputs to PROFET, becomes:
$$Ins_{ooo}^{max2} = Pen_{mem} \times \frac{Ins_{tot}}{Cyc_{tot}} \tag{10}$$
The overall upper bound on $Ins_{ooo}$ is the minimum of the two
limits:
$$Ins_{ooo}^{max} = \min\left( Ins_{ROB},\; Pen_{mem} \times \frac{Ins_{tot}}{Cyc_{tot}} \right) \tag{11}$$
[Figure 6 diagram. Instruction execution timeline with overlapping
LLC misses 1, ..., n-1, n and the arrival of the requested miss data
for each, showing the interval Cycooo, the memory access penalty of
each miss, and the total LLC misses penalty (n × StallsLLC).]
Figure 6: Handling overlapping LLC misses in an OOO processor: the
penalty of a single miss is divided by a number of concurrent LLC
misses [22].
Since the value of Insooo can be anywhere between its bounds we
consider Insooo to be a free parameter and perform a sensitivity
analysis when calculating other dependent parameters.
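The bounds in Eqs. 9 to 11 can be sketched in a few lines of Python. This is a minimal illustration, not the released PROFET code: the function name, argument names and the numeric inputs are assumptions made for the example.

```python
def ins_ooo_bounds(ins_rob, pen_mem, ins_tot, cyc_tot):
    """Return (lower, upper) bounds on Ins_ooo following Eqs. 9-11.

    ins_rob : reorder buffer size in instructions (architecture parameter)
    pen_mem : memory access penalty in CPU cycles
    ins_tot : instructions executed in the profiled segment
    cyc_tot : CPU cycles spent in the profiled segment
    """
    lower = 0.0                               # OOO execution may stall right after the miss
    upper_rob = float(ins_rob)                # Eq. 9: limited by ROB capacity
    upper_pen = pen_mem * ins_tot / cyc_tot   # Eq. 10: Pen_mem / CPI_0 with CPI_0 = CPI_tot
    return lower, min(upper_rob, upper_pen)   # Eq. 11


# Illustrative inputs only (not measurements from the paper).
lo, hi = ins_ooo_bounds(ins_rob=168, pen_mem=200.0,
                        ins_tot=1_000_000, cyc_tot=2_000_000)
print(lo, hi)
```

With these example inputs the penalty-based limit (200 cycles at an IPC of 0.5) is tighter than the 168-entry ROB; with a 72-entry ROB the ROB limit would dominate instead.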
4.3.3 Overlapping LLC misses: Impact of memory level parallelism. Previously, in Section 4.3.1, specifically in Eq. 7, we considered
the case of an isolated read LLC miss. This section now
considers the general case, in which after an LLC miss occurs, and
while the corresponding instruction is waiting for data from mem-
ory, the CPU pipeline generates one or more additional LLC misses.
This situation is illustrated in Figure 6. Any stall cycles that occur
should be counted once per group of overlapping LLC misses rather
than once per LLC miss, which was the case for Eq. 7. The num-
ber of concurrent LLC misses is typically known as the memory
level parallelism, and is denoted MLP [3, 13]. The penalty per LLC
miss is therefore given by the number of stall cycles divided by
MLP [3, 22] (see footnote 4):
$$Stalls_{LLC} = \frac{1}{MLP} \times (Pen_{mem} - Cyc_{ooo}) = \frac{1}{MLP} \times (Pen_{mem} - CPI_0 \times Ins_{ooo}) \tag{12}$$
Karkhanis et al. [22] analyze this in detail and show that Eq. 12
is correct independently of the moment in which second, third,
or subsequent LLC misses occur, as long as they occur within the
Cycooo interval. If MLP equals 1, then the above equation becomes
identical to Eq. 7; so Eq. 12 covers both cases, that of isolated and
overlapping LLC misses.
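As a small numeric illustration of Eq. 12, the sketch below computes the stall cycles attributed to each LLC miss. The function name and the input values are hypothetical; all quantities are assumed to be expressed in CPU cycles or instructions, as in the surrounding text.

```python
def stalls_llc(pen_mem, cpi0, ins_ooo, mlp):
    """Eq. 12: stall cycles per LLC miss, shared among MLP overlapping misses."""
    if mlp < 1:
        raise ValueError("MLP must be at least 1 (an isolated miss)")
    cyc_ooo = cpi0 * ins_ooo          # cycles hidden by out-of-order execution
    return (pen_mem - cyc_ooo) / mlp


# Illustrative inputs only: a 200-cycle penalty, CPI_0 = 0.5, Ins_ooo = 100.
isolated = stalls_llc(200.0, 0.5, 100, mlp=1)   # reduces to the isolated-miss case of Eq. 7
overlapped = stalls_llc(200.0, 0.5, 100, mlp=2)
print(isolated, overlapped)
```

Setting `mlp=1` reproduces the isolated-miss case, which matches the remark that Eq. 12 covers both situations.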
Current processors cannot directly measure MLP, so it must be
estimated based on the parameters that are available. We derive
lower and upper bounds on MLP and a point estimate.
The MLP lower and upper bounds can be computed starting
from the equation $CPI_{tot}^{(1)} = CPI_0 + CPI_{LLC}^{(1)}$, then substituting
$CPI_{LLC}^{(1)}$ from Eq. 1 and $Stalls_{LLC}^{(1)}$ from Eq. 12:
$$CPI_{tot}^{(1)} = CPI_0 + \frac{Miss_{LLC} \times \left(Pen_{mem}^{(1)} - CPI_0 \times Ins_{ooo}\right)}{Ins_{tot} \times MLP} \tag{13}$$
Footnote 4: LLC misses can also overlap with front-end miss events such as instruction cache
misses, branch mispredictions, etc. These overlaps, however, tend to be rare, leading to an
insignificant performance impact [9].
Rearranging to isolate MLP and writing it as a function to make
clear which values are unknown gives:
$$MLP(Ins_{ooo}, CPI_0) = \frac{\frac{Miss_{LLC}}{Ins_{tot}} \times \left(Pen_{mem}^{(1)} - CPI_0 \times Ins_{ooo}\right)}{CPI_{tot}^{(1)} - CPI_0} \tag{14}$$
This equation expresses MLP, which we want to know, in terms
of Insooo, the free variable that we will vary later, and CPI0, which is
unknown but can be bounded. The lower bound on CPI0 is CPImin,
the reciprocal of the processor’s highest theoretical IPC. The upper
bound on CPI0 is $CPI_{tot}^{(1)}$, since CPI0 was defined to be one (of two)
components contributing to $CPI_{tot}^{(1)}$. Now that CPI0 is bounded, and
assuming a value of Insooo, it is possible to use Eq. 14 to obtain
the range of potential values of MLP, either via a sweep on CPI0
between its lower and upper bounds or using differential calculus.
A second upper bound on MLP is the size of the Miss Information
Status Holding register (MSHR) [25]. The MSHR is the hardware
structure that keeps information about in-flight cache misses, so
they can be resolved once the corresponding data arrives. Its size is
CPU-specific; e.g., it is 10 for Sandy Bridge [17] and 12 for KNL [21].
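The sweep over CPI0 described above can be sketched as follows. This is one possible reading, not the released implementation: the function names, the step count and the inputs are assumptions. In particular, the sketch stops the sweep just short of CPI0 = CPItot, where Eq. 14 divides by zero; how PROFET itself treats that endpoint is not stated in this section, so here the MSHR size simply caps the upper bound.

```python
def mlp_eq14(ins_ooo, cpi0, miss_llc, ins_tot, pen_mem, cpi_tot):
    """Eq. 14: MLP implied by an assumed Ins_ooo and CPI_0 (baseline measurements)."""
    return (miss_llc / ins_tot) * (pen_mem - cpi0 * ins_ooo) / (cpi_tot - cpi0)


def mlp_bounds(ins_ooo, cpi_min, miss_llc, ins_tot, pen_mem, cpi_tot, mshr, steps=1000):
    """Sweep CPI_0 over [cpi_min, cpi_tot) and cap the result by the MSHR size."""
    values = []
    for i in range(steps):  # i < steps, so cpi0 stays strictly below cpi_tot
        cpi0 = cpi_min + (cpi_tot - cpi_min) * i / steps
        values.append(mlp_eq14(ins_ooo, cpi0, miss_llc, ins_tot, pen_mem, cpi_tot))
    lower = min(values)
    upper = min(max(values), float(mshr))  # second upper bound: number of MSHR entries
    return lower, upper


# Illustrative inputs only (not measurements from the paper).
lo, hi = mlp_bounds(ins_ooo=50, cpi_min=0.25, miss_llc=10_000, ins_tot=1_000_000,
                    pen_mem=200.0, cpi_tot=2.0, mshr=10)
print(lo, hi)
```

For these example inputs the Eq. 14 value grows without limit as CPI0 approaches CPItot, so the MSHR size is what actually bounds MLP from above.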
The MLP point estimate is derived by assuming that the ap-
plication’s behavior is uniform (in a sense to be clarified below)
over the sampling segment. Specifically, we assume that the num-
ber of LLC misses per instruction is homogeneous across the time
segment, in which case it must equal MissLLC/Instot. In the period
between the LLC miss and the arrival of its data, the processor
executes Insooo instructions, so with a constant rate of LLC misses,
the total number of additional LLC misses is (MissLLC/Instot) × Insooo. The
value of Insooo is a free parameter, as described in Section 4.3.2,
so the value being calculated here is a function of that parameter.
In order to account for the first LLC miss, which has not yet been
counted, the point estimate for the total number of LLC misses, as a
function of Insooo, to which the stall cycles must be attributed, is:
$$E_{MLP}(Ins_{ooo}) = \frac{Miss_{LLC}}{Ins_{tot}} \times Ins_{ooo} + 1 \tag{15}$$
Note that EMLP (Insooo) is a point estimate for MLP based on
the available information. If the point estimate is outside the valid
range, between the lower and upper bounds described above, then
it is corrected to lie in the range.
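The point estimate of Eq. 15 and its correction can be sketched as below. The text only says that an out-of-range estimate is "corrected to lie in the range"; clamping to the nearest bound is an assumption of this sketch, as are the function name and the example values.

```python
def mlp_point_estimate(ins_ooo, miss_llc, ins_tot, mlp_lower, mlp_upper):
    """Eq. 15 point estimate for MLP, clamped to [mlp_lower, mlp_upper].

    Clamping to the nearest bound is an assumption; the source only states
    that the estimate is corrected to lie within the valid range.
    """
    estimate = miss_llc / ins_tot * ins_ooo + 1.0   # additional misses plus the first one
    return min(max(estimate, mlp_lower), mlp_upper)


# Illustrative inputs only (not measurements from the paper).
inside = mlp_point_estimate(50, 10_000, 1_000_000, mlp_lower=1.0, mlp_upper=10.0)
clamped = mlp_point_estimate(168, 100_000, 1_000_000, mlp_lower=1.0, mlp_upper=10.0)
print(inside, clamped)
```

The second call has a raw estimate of about 17.8, which exceeds the example upper bound of 10 and is therefore clamped.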
4.4 Performance as a function of latency
This section completes the analysis of out-of-order processor performance as a function of latency. We start by repeating Eq. 4, which
gives the predicted CPI in terms of StallsLLC:
$$CPI_{tot}^{(2)} = CPI_{tot}^{(1)} + \frac{Miss_{LLC}}{Ins_{tot}} \times \left(Stalls_{LLC}^{(2)} - Stalls_{LLC}^{(1)}\right) \tag{4 again}$$
As remarked at the beginning of Section 4.3, in comparison with
an in-order processor, an out-of-order processor has a more complex
expression for StallsLLC, and this was given in Eq. 12:
$$Stalls_{LLC} = \frac{1}{MLP} \times (Pen_{mem} - CPI_0 \times Ins_{ooo}) \tag{12 again}$$
Finally, we replace the MLP parameter in this equation with the
point estimate in Eq. 15:
$$E_{MLP}(Ins_{ooo}) = \frac{Miss_{LLC}}{Ins_{tot}} \times Ins_{ooo} + 1 \tag{15 again}$$
[Figure 7 (plot): x axis: memory access latency on the target memory system; y axis: $IPC_{tot}^{(2)}$. A family of curves, from Insooo = 0 to Insooo = $Ins_{ooo}^{max}$, with the baseline point ($Lat_{mem}^{(1)}$, $IPC_{tot}^{(1)}$) marked.]
Figure 7: Performance as a function of the memory access
latency. The different curves arise by varying the unknown
parameter Insooo within its bounds. Please read the text before interpreting this figure.
In fact, as explained in Section 4.3.3, the value of this point estimate is restricted to
lie between the lower and upper bounds given in that section. For
the sake of clarity, we consider the more common case in which
no such correction is necessary.
Combining Eq. 4, Eq. 12 and Eq. 15, and assuming that Instot,
MissLLC, CPI0, Insooo and MLP do not change when moving from
one memory system configuration to another, $CPI_{tot}^{(2)}$ can be
calculated as:
$$CPI_{tot}^{(2)} = CPI_{tot}^{(1)} + \frac{Pen_{mem}^{(2)} - Pen_{mem}^{(1)}}{Ins_{ooo} + Ins_{tot}/Miss_{LLC}} \tag{16}$$
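The algebra behind Eq. 16 can be spelled out step by step. The following is a sketch of the substitution, relying only on Eqs. 4, 12 and 15 and on the stated assumption that CPI0, Insooo and MLP are the same on both memory configurations:

```latex
\begin{align*}
CPI_{tot}^{(2)} - CPI_{tot}^{(1)}
  &= \frac{Miss_{LLC}}{Ins_{tot}} \times \left(Stalls_{LLC}^{(2)} - Stalls_{LLC}^{(1)}\right)
     && \text{(Eq. 4)} \\
  &= \frac{Miss_{LLC}}{Ins_{tot}} \times \frac{Pen_{mem}^{(2)} - Pen_{mem}^{(1)}}{MLP}
     && \text{(Eq. 12; the } CPI_0 \times Ins_{ooo} \text{ terms cancel)} \\
  &= \frac{Miss_{LLC}}{Ins_{tot}} \times
     \frac{Pen_{mem}^{(2)} - Pen_{mem}^{(1)}}{\frac{Miss_{LLC}}{Ins_{tot}} \times Ins_{ooo} + 1}
     && \text{(Eq. 15)} \\
  &= \frac{Pen_{mem}^{(2)} - Pen_{mem}^{(1)}}{Ins_{ooo} + Ins_{tot}/Miss_{LLC}}
\end{align*}
```

The last line follows by multiplying the numerator and denominator by Instot/MissLLC.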
This equation is written in terms of the memory access penalty,
Penmem, but at the system level, outside a detailed analysis of a
particular processor’s pipeline, only Latmem is relevant. Recall that
Penmem was defined to be the memory access latency, Latmem, minus
the cost of an LLC hit. We note, therefore, that the expression
$Pen_{mem}^{(2)} - Pen_{mem}^{(1)}$ is equal to $Lat_{mem}^{(2)} - Lat_{mem}^{(1)}$. Taking account of
this and rewriting in terms of the IPC instead of the CPI gives:
$$IPC_{tot}^{(2)} = \frac{IPC_{tot}^{(1)}}{1 + IPC_{tot}^{(1)} \times \dfrac{Lat_{mem}^{(2)} - Lat_{mem}^{(1)}}{Ins_{ooo} + Ins_{tot}/Miss_{LLC}}} \tag{17}$$
The values $IPC_{tot}^{(1)}$, Instot, and MissLLC in Eq. 17 are
known because they were measured on the baseline memory configuration. All other inputs to PROFET, such as InsROB, MSHR and CPImin
(see Table 3), appear in the upper and lower bounds of Insooo (see footnote 5).
Eq. 17 is plotted in Figure 7. The x axis is the target system
memory latency, $Lat_{mem}^{(2)}$, and the y axis is the predicted IPC, $IPC_{tot}^{(2)}$.
Eq. 17 is a function of the independent parameter Insooo, which
we cannot measure or calculate exactly. We bounded its value in
the previous section, and varying it between the lower and upper
bounds gives the family of curves shown in the figure. Note that the
case of Insooo = 0 corresponds to an in-order processor. This can
be seen by comparing Eq. 17 and Eq. 6. As indicated on the figure,
when the target memory latency is the same as the baseline memory
latency, $Lat_{mem}^{(1)}$, PROFET correctly “predicts” the measured IPC to
be that of the baseline system, $IPC_{tot}^{(1)}$.
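Eq. 17 translates directly into code. The sketch below is illustrative only: the function name and the inputs are assumptions, and the latencies are assumed to be expressed in CPU cycles so that they are consistent with the IPC.

```python
def predict_ipc(ipc1, lat1, lat2, ins_ooo, ins_tot, miss_llc):
    """Eq. 17: predicted IPC on the target memory system.

    lat1, lat2 are the baseline and target memory access latencies,
    assumed here to be given in CPU cycles.
    """
    denom = ins_ooo + ins_tot / miss_llc
    return ipc1 / (1.0 + ipc1 * (lat2 - lat1) / denom)


# Illustrative inputs only (not measurements from the paper).
base = dict(ipc1=0.5, lat1=300.0, ins_tot=1_000_000, miss_llc=10_000)
for ins_ooo in (0, 50, 100):   # sweep the free parameter within some bounds
    print(ins_ooo, predict_ipc(lat2=150.0, ins_ooo=ins_ooo, **base))
```

Each value of `ins_ooo` yields one curve of the family in Figure 7, and passing `lat2 == lat1` returns the baseline IPC unchanged, whatever the value of `ins_ooo`.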
It is easy to be misled by Figure 7. For instance, a decrease in
the memory latency by a fixed value, e.g. reducing the lead-off load
penalty by 10 ns, is not equivalent to simply moving by 10 ns to
the left on the x axis. This is because, as seen in the figure, such a
change will result in an increase in the IPC, which will itself cause
an increase in the used memory bandwidth, for reasons explained
in the next section. This increase in used bandwidth will cause a
movement to the right in the memory’s bandwidth–latency curve
and therefore increase the memory system latency, counterbalancing
the original decrease in memory system latency. We address this
problem in the next section.
Footnote 5: Some of the input parameters appear only in the upper or lower bounds of MLP,
which are not in Eq. 17 but are considered in the full PROFET model.
[Figure 8 (plot): x axis: used memory bandwidth [GB/s], 0 to 350; y axis: memory access latency [ns]. Shows the measured bandwidth-latency curves, the PROFET performance model curves, and the operating points of the application using DDR4 and using MCDRAM.]
Figure 8: Graphical interpretation of a performance estimation as a merged solution of Sections 3 and 4.4.
4.5 Performance estimation: the ultimate step
This section completes the PROFET performance model. We start
from the bandwidth–latency curves described in Section 3, which
give the loaded memory access latency as a function of used mem-
ory bandwidth. We then combine these curves with the analysis
in Section 4.4, which gives application performance as a function
of the loaded memory latency. Doing so, in the right way, gives a
prediction of the performance on the target memory system, with
error bars.
The solution will be explained through Figure 8. The x axis is
the used memory bandwidth and the y axis is the memory access
latency. We show the measured bandwidth–latency curves for the
Knights Landing platform, exactly as in Figure 4a. As before, the
lightest curves correspond to 50% reads and 50% writes, and the
darkest curves correspond to 100% reads. Now, however, since we
are analysing a specific application segment, the proportion of reads
is known (it is RatioR/W), so we know which curve from the family
to select. The selected curve, which corresponds to RatioR/W, is
shown as a dashed white curve.
We now turn to Figure 7, and use the latency–performance plot
to construct a latency–bandwidth plot, i.e. to find the used memory
bandwidth as a function of the memory access latency. This is possible because the total number of memory accesses performed over the application’s execution is a constant, as argued in Section 4.2. The total
number of accesses is evaluated, for the baseline memory system, by
multiplying bandwidth by time, giving $BW_{used}^{(1)} \times (Cyc_{tot}^{(1)}/Freq_{CPU})$.
Dividing this expression by the execution time on the target memory system gives:
$$BW_{used}^{(2)} = \frac{BW_{used}^{(1)} \times (Cyc_{tot}^{(1)}/Freq_{CPU})}{Cyc_{tot}^{(2)}/Freq_{CPU}} = \frac{BW_{used}^{(1)}}{IPC_{tot}^{(1)}} \times IPC_{tot}^{(2)} \tag{18}$$
The second equality holds because Instot is the same on both memory systems, so $Cyc_{tot}^{(1)}/Cyc_{tot}^{(2)} = IPC_{tot}^{(2)}/IPC_{tot}^{(1)}$.
We therefore obtain a plot of bandwidth vs. latency simply
by multiplying the value on the y axis of Figure 7 by the factor
$BW_{used}^{(1)}/IPC_{tot}^{(1)}$. The axes now match those in Figure 8 except they
are transposed. We therefore transpose the bandwidth vs. latency
plot, by swapping the x and y axes, and superimpose it onto Figure 8. This gives the family of lines for the PROFET performance
model.
When the application runs on a memory system, it must be
located on the memory system’s bandwidth–latency curve and on
one of the PROFET performance model curves that was just added.
It must therefore be located on the intersection of these curves, as
indicated in Figure 8. For the baseline memory system, we find that
all PROFET performance model curves intersect the bandwidth–
latency curve in the same place, at the bandwidth measured on the
real system.
Each pair of bandwidth–latency and PROFET performance model
curves will intersect in exactly one place. There cannot be more
than one intersection because the memory system’s bandwidth–latency
curve is increasing (as a function of latency) whereas the
application’s performance model curve is decreasing (as a function
of latency). In addition, there must be an intersection point, since the
application’s curve decreases from a very high latency necessary
to get a small used memory bandwidth whereas the memory’s
bandwidth–latency curve increases to a very high latency close to
the maximum sustainable bandwidth.
In summary, we start from the target memory system’s bandwidth–latency
curve and Eq. 17, which defines a family of PROFET performance model curves. We perform a sweep of the valid range
for Insooo, and for each value, find the intersection of its performance model curve with the target bandwidth–latency curve. To
find this intersection we use the bisection method. The intersection point gives
a used memory bandwidth (the x axis of Figure 8), which can be converted to an IPC by
rearranging Eq. 18. Varying Insooo in this way gives the minimum,
maximum and point estimate for IPC, from which the number of
cycles and execution time can also be easily deduced. Recall that the
discussion so far is related to a single segment (time interval) of the
application. Summing over all segments gives the predicted mini-
mum, maximum and point estimate execution time for the whole
application. Execution of the complete PROFET prediction for the
whole application is very fast. For example, for the benchmarks un-
der study the performance estimate for each target memory system
is completed within seconds.
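The procedure summarized above can be sketched for a single segment and a single value of Insooo as follows. This is a minimal illustration under stated assumptions, not the released PROFET code: the analytic bandwidth–latency curves (`baseline_curve`, `target_curve`) are hypothetical stand-ins for the measured curves of Section 3, latencies are assumed to be in CPU cycles, and all function names and numbers are invented for the example.

```python
def latency_from_bandwidth(bw2, bw1, ipc1, lat1, ins_ooo, ins_tot, miss_llc):
    """Invert Eqs. 17 and 18: latency at which the application would use bw2."""
    ipc2 = ipc1 * bw2 / bw1                           # Eq. 18 rearranged for IPC
    denom = ins_ooo + ins_tot / miss_llc
    return lat1 + (1.0 / ipc2 - 1.0 / ipc1) * denom   # Eq. 17 solved for Lat(2)


def find_operating_point(mem_curve, bw_lo, bw_hi, tol=1e-9, **app):
    """Bisection on g(bw) = mem_curve(bw) - application latency; g is increasing."""
    def g(bw):
        return mem_curve(bw) - latency_from_bandwidth(bw, **app)
    if g(bw_lo) > 0 or g(bw_hi) < 0:
        raise ValueError("bracket does not contain the intersection")
    while bw_hi - bw_lo > tol:
        mid = 0.5 * (bw_lo + bw_hi)
        if g(mid) > 0:
            bw_hi = mid      # memory is slower than the application needs: move left
        else:
            bw_lo = mid
    return 0.5 * (bw_lo + bw_hi)


def ipc_from_bandwidth(bw2, bw1, ipc1):
    """Eq. 18 rearranged: IPC on the target system from its used bandwidth."""
    return ipc1 * bw2 / bw1


# Hypothetical curves: constant lead-off latency plus a term that grows near saturation.
def baseline_curve(bw):      # saturates at 40 bandwidth units
    return 200.0 + 20.0 * bw / (40.0 - bw)

def target_curve(bw):        # saturates at 80 bandwidth units
    return 200.0 + 20.0 * bw / (80.0 - bw)


# Illustrative baseline measurements; (bw1, lat1) lies on baseline_curve.
app = dict(bw1=20.0, ipc1=0.5, lat1=220.0, ins_ooo=50, ins_tot=1e6, miss_llc=1e4)
bw_base = find_operating_point(baseline_curve, 1.0, 39.0, **app)
bw_target = find_operating_point(target_curve, 1.0, 79.0, **app)
print(bw_base, bw_target, ipc_from_bandwidth(bw_target, app["bw1"], app["ipc1"]))
```

On the baseline curve the search recovers the measured bandwidth, as the text states for all performance model curves. Repeating the call for each Insooo in its valid range, and summing the resulting execution times over all segments, would give the minimum, maximum and point estimates described above.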
4.6 Novelties of the presented analytical model
Since our PROFET analytical model is based on CPI stack analy-
sis as widely used for performance modeling [14], it is important
to emphasize the contributions of our work beyond the previous
studies.
Computation of the CPI component that corresponds to the
execution stalls due to the LLC misses (Eq. 1) is described by previ-
ous studies [3, 4, 14]. Hennessy and Patterson [14] and Karkhanis
and Smith [22] also analyze execution of independent instructions
Insooo after the LLC miss (Eq. 7) and define some of the Insooo
bounds (Eq. 9). Finally, the MLP and its impact on CPI stack analy-
sis (Eq. 12) are also well explored by the community [3, 4, 6, 10, 22].
CPI stack, Insooo and MLP are the foundation of various ana-
lytical models that quantify the performance impact of the main
memory latency [3, 6, 10, 22]. The previous analytical models, how-
ever, have one great challenge: they require detailed application
profiling, which can be performed only with hardware simulators.
The main objective of our work is to avoid the use of the simulators,
and to develop an analytical model based only on the parameters
that can be obtained or derived from performance-counter measurements
on actual platforms. This requires a novel approach to
the MLP estimate, presented in Eq. 13–15. In addition to this, we
also present additional Insooo bounds in Eq. 10 and Eq. 11. Finally,
to the best of our knowledge, this is the first study that combines
analysis of application performance as a function of memory access
latency (Sec 4.4) with bandwidth–latency curves (Sec 4.5) and that
shows that its analysis leads to a unique solution.
5 POWER AND ENERGY ESTIMATION
Similarly to performance estimation, we develop the PROFET power
and energy models. Based on the application profiling and memory
power parameters, these models predict the variation of the system
power and energy consumption due to the change of the memory
systems. PROFET’s power and energy models are both validated on
the Sandy Bridge E5-2670 server running SPEC2006 and scientific
HPC applications (see footnote 6). As for the PROFET performance model, the
error of the power and energy estimation is low: less than 2% for
power and less than 3% for energy consumption. Due to the lack of
space, the detailed description of the PROFET power and energy
models is moved to Appendix A, while their evaluation is given in
Appendix B.1.
6 METHODOLOGY
In this section, we present the hardware platforms and benchmarks
used in the evaluation of PROFET. We also list the tools used for
the application profiling and server power measurements. Finally,
we summarize the main steps of the PROFET evaluation process.
6.1 Hardware platforms
We evaluate PROFET on Sandy Bridge-EP E5-2670 and Knights
Landing (KNL) Xeon Phi platforms. The most important features
of the platforms are summarized in Table 4.
The Sandy Bridge-EP server is a representative of mainstream
high-performance computing (HPC) servers, and it is still in use,
especially in smaller Tier-0 systems [1]. In the server under study,
we were able to test four memory frequencies: DDR3-800, DDR3-
1066, DDR3-1333 and DDR3-1600, and we used these configurations
to evaluate PROFET’s performance, power and energy models.
The Intel Knights Landing (KNL) Xeon Phi platform [36] is an
emerging platform that combines two types of memory with different memory bandwidths and access latencies: DDR4 DIMMs and
3D-stacked MCDRAM [36]. Since it uses two types of memory, the
system offers three modes of operation: cache mode, flat mode and
hybrid mode. In our experiments, we use flat mode, in which the
DDR4 and MCDRAM are configured as separate NUMA nodes, and
we execute our workloads either in DDR4 or MCDRAM memory.
Footnote 6: The PROFET power and energy models were not developed for the KNL server,
because we lacked reliable MCDRAM power parameters. This research is a part of
ongoing work.
Table 4: The most important features of the experimental platforms
Platform                  Sandy Bridge E5-2670                  Knights Landing Xeon Phi 7230
Sockets                   2                                     1
Cores per socket          8                                     64
CPU freq. [GHz]           3.0                                   1.3
L1i, L1d                  32 kB, 32 kB                          32 kB, 32 kB
L2                        512 kB                                1 MB
L3                        20 MB                                 /
Memory conf. per socket   4 chann. DDR3-800/1066/1333/1600      8 chann. MCDRAM, 6 chann. DDR4-2400
Memory capacity           64 GB                                 16 GB MCDRAM, 96 GB DDR4
6.2 Benchmarks
We evaluated PROFET on a set of SPEC CPU2006 benchmarks [37]
and scientific HPC applications. The scientific applications are se-
lected from the Unified European Application Benchmark Suite (UE-
ABS) [30]. These applications are parallelized using Message Pass-
ing Interface (MPI) and are representative of production applica-
tions running on HPC systems in Europe. We choose four applica-
tions: ALYA, representative of the computational mechanics codes,
and GROMACS, NAMD, and Quantum Espresso (QE) computa-
tional chemistry applications. The remaining UEABS applications
could not be executed because their input dataset sizes exceed the
main memory capacity of our hardware platforms.
In all the experiments on the Sandy Bridge platform, we fully utilize
the available 16 CPU cores (2 sockets, 8 cores each): we execute
16 copies of each SPEC CPU2006 benchmark, or 16 application
processes for each UEABS application. The KNL platform comprises 16 GB of MCDRAM, which was insufficient to execute
any of the UEABS applications. Also, for each SPEC CPU2006 benchmark, we had to determine the maximum number of instances
whose cumulative memory footprint would fit into the MCDRAM.
In the charts, this number of benchmark instances is specified in
square brackets after the benchmark name.
In order to quantify the level of stress that our workloads put on
the memory system, we measure their memory bandwidth on the
platforms under study. Figure 9 shows the bandwidth utilization, relative
to the maximum sustained memory bandwidth measured with
the STREAM benchmark [27]. When reporting the evaluation of PROFET
in Section 7, we emphasize the results for the high-bandwidth
benchmarks, which use over 50% of the maximum sustained memory bandwidth. These
benchmarks are the most affected by the changes in the memory
system, and therefore the most challenging to model.
6.3 Tools and methodology
Application profiling requires measurements of the CPU cycles, instructions, LLC misses, read and write memory bandwidths, as well
as the row-buffer access statistics, number of page activations and
page misses, and number of cycles spent in memory power-down
states. All these inputs are measured by the hardware counters and
the LIKWID performance tool suite [39]. The counters used in the
study are widely available in mainstream HPC servers [16, 18].
[Figure 9 (two stacked-bar charts): y axis: used portion of the maximum sustained bandwidth, 0% to 100%, split into read and write memory bandwidth; benchmarks are grouped into High Bandwidth (HBW) and Low Bandwidth (LBW).]
(a) Sandy Bridge E5-2670. Fully utilized server: 16 SPEC CPU2006
instances or 16 UEABS MPI processes. Benchmarks, in plotted order: libquantum, bwaves, lbm, milc, soplex, GemsFDTD, leslie3d, mcf, sphinx3, wrf, omnetpp, zeusmp, cactusADM, astar, gcc, dealII, bzip2, gobmk, xalancbmk, sjeng, hmmer, tonto, gromacs, h264ref, calculix, namd, perlbench, gamess, povray, QE, ALYA, GROMACS, NAMD.
(b) Knights Landing Xeon Phi. The MCDRAM capacity limits the
number of benchmark instances, specified in square brackets. Benchmarks, in plotted order: sphinx3 [60], leslie3d [48], libquantum [36], lbm [32], omnetpp [64], soplex [36], GemsFDTD [18], milc [20], cactusADM [20], gcc [24], astar [48], hmmer [64], zeusmp [30], xalancbmk [36], bwaves [18], wrf [22], dealII [24], h264ref [64], bzip2 [18], mcf [8], gobmk [64], gromacs [64], sjeng [64], perlbench [26], tonto [44], namd [64], calculix [48], gamess [48], povray [56].
Figure 9: The workloads under study show a wide range of
memory bandwidth utilization, and different ratios of the
Read and Write memory traffic.
We used a Yokogawa WT230 [5] power meter to measure the
server power consumption. The power meter measures the voltage
and current at the power plug to calculate the power consumption
of the whole server, including power supply, motherboard with
all its components, CPUs, and memory. The measurements were
sampled at one-second intervals. The energy consumption was
calculated by summing the one-second power samples over the execution
time.
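The energy calculation can be illustrated with a short sketch. It assumes a fixed sampling period and a simple rectangle-rule sum, which is one plausible reading of the description above; the function name and the sample values are hypothetical.

```python
def energy_joules(power_samples_w, period_s=1.0):
    """Energy as the sum of power samples (W) times the sampling period (s)."""
    return sum(power_samples_w) * period_s


# Illustrative samples only: three one-second power readings in watts.
print(energy_joules([100.0, 110.0, 120.0]))
```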
7 EVALUATION
Evaluation of PROFET is done in four steps. First, we execute a
benchmark on the baseline memory system, e.g., Sandy Bridge server
with DDR3-800. In this run, we measure the benchmark perfor-
mance, power and energy consumption, and collect all the hard-
ware counters needed for prediction using PROFET. Second, we use
PROFET to estimate the benchmark performance, power and en-
ergy on the target memory configuration, e.g., DDR3-1600. Third, we
change the platform memory configuration from the baseline to the
target memory, e.g., from DDR3-800 to DDR3-1600. This requires
changing the BIOS settings for the Sandy Bridge, and changing the
execution NUMA node for the Knights Landing platform. Finally,
we execute the benchmark on the target memory system, measure
the actual performance, power and energy consumption, and compare them with the values estimated by PROFET.
[Figure 10 (bar chart): y axis: relative IPC difference w.r.t. DDR3-800, 0% to 100%; x axis: the SPEC CPU2006 benchmarks and UEABS applications, in the same order as in Figure 9a. For each benchmark: estimated DDR3-800 -> DDR3-1066, DDR3-1333 and DDR3-1600 bars with error bars, and markers for the measured performance improvement.]
Figure 10: Sandy Bridge, DDR3-800 → 1066/1333/1600: Changing the DRAM frequency has a significant performance impact.
PROFET’s performance model estimations are precise, with narrow error bars, and accurate, with a small difference from the values
measured on the actual hardware.
[Figure 11 (bar chart): y axis: relative IPC difference w.r.t. DDR4-2400, 0% to 250%; x axis: the SPEC CPU2006 benchmarks with instance counts in square brackets, in the same order as in Figure 9b. For each benchmark: estimated DDR4-2400 -> MCDRAM bar with error bars, and a marker for the measured performance improvement.]
Figure 11: Knights Landing, DDR4-2400 → MCDRAM: Despite the wide range of the performance variation, between −9% (mcf)
and 212% (leslie3d), the model shows high accuracy. The smaller KNL reorder buffer leads to a smaller range of the Insooo sensitivity
analysis, and therefore a more precise performance prediction, i.e., smaller error bars.
7.1 Sandy Bridge: DDR3-800→ 1066/1333/1600
The results for the Sandy Bridge server are displayed in Figure 10.
For each benchmark we plot three sets of bars, one per DRAM
frequency change: DDR3-800→1066/1333/1600. As described in
Section 4.3, since we could not determine the exact value of the
Insooo parameter in the experimental platform, we performed a
sensitivity analysis by varying Insooo between 0 and $Ins_{ooo}^{max}$, and
computing the performance estimate for each possible value. The solid
bars correspond to the mean performance estimate, while the er-
ror bars show the lowest and highest estimated performance. In
addition, we plot the actual performance improvement measured
on the real platform, marked with a (red) cross marker for each
experiment.
First, we can see that the error bars, i.e., ranges of the estimated
performance are narrow. Across the high-bandwidth benchmarks,
the average width of the error bars is only 2.6%, 5.5% and 7.4%,
for DDR3-1066/1333/1600 respectively. Across the low-bandwidth
benchmarks, the average width of the error bars is even lower, at
1.1%, 1.6% and 2%. This means that, although for the architectures
under study we cannot determine the precise value of the Insooo pa-
rameter, we can, in all cases, apply a sensitivity analysis and obtain
a narrow range of estimated performance. As expected, the high-
bandwidth benchmarks, which are more sensitive to the CPU’s
ability to hide memory latencies through memory parallelism, have
a greater sensitivity to Insooo, leading to wider, but still acceptable,
error bars. Second, PROFET’s predictions are highly accurate. The
average difference from the performance measured on the actual
hardware for DDR3-1066/1333/1600 frequencies is 1.8%, 3.8% and
5.1% for the high-bandwidth benchmarks, and it drops to just 1%,
1.3% and 1.6% over the low-bandwidth benchmarks. Finally, the pre-
sented results show that the DRAM frequency increase indeed has a
significant performance impact. For example, increasing the DRAM
frequency from the baseline DDR3-800 to the target DDR3-1600
causes average performance improvement of 22%, and it reaches
80% for the libquantum benchmark. Therefore it is important to un-
derstand the relation between the available memory bandwidth and
the overall application performance, which is the main objective
of our work.
7.2 Knights Landing: DDR4-2400→MCDRAM
Figure 11 shows the estimated and measured performance improvement
of the Knights Landing Xeon Phi with high-bandwidth
MCDRAM with respect to the DDR4-2400 memory. Again, PROFET
shows high precision. Actually, the width of the estimation error
bars is only 3.2% for high-bandwidth benchmarks and 0.4% for low-
bandwidth benchmarks, both of which are significantly smaller
than the corresponding figures for the Sandy Bridge system. The
main reason for this is the 72-instruction reorder buffer in the KNL
platform, w.r.t. the 168 instructions in the Sandy Bridge. A smaller
reorder buffer leads to a narrower range for the Insooo sensitivity
analysis, and therefore a more precise performance prediction. In
the case of low-bandwidth benchmarks, an additional reason for the small
error bars is that their overall performance sensitivity to the change
from the baseline to the target memory system is low. Therefore,
PROFET also predicts a low performance difference with close values
for the minimum and maximum of the Insooo sensitivity analysis. In
their case, the Insooo parameter from Eq. 17 has negligible impact on
the estimation result, either because of the close values of $Lat_{mem}^{(2)}$
and $Lat_{mem}^{(1)}$ or because of the low number of LLC misses in Eq. 17.
The MCDRAM provides 4.2-fold higher bandwidth than DDR4,
which leads to significant performance improvement for the high-bandwidth
benchmarks, up to a 212% improvement for leslie3d.
However, the MCDRAM also has a 23 ns higher lead-off latency
(see Figure 4a), which penalizes benchmarks with low and moderate
memory bandwidth requirements. In fact, for bzip2 and mcf, execution
on the MCDRAM leads to performance loss. For the mcf
benchmark, this performance loss reaches a non-negligible 9%.
Despite the wide range of the performance variation, between
−9% and 212%, when moving from DDR4 to the MCDRAM, PRO-
FET shows high accuracy. The difference between PROFET’s per-
formance estimates and the measurements on the actual hardware
is 7% for high-bandwidth benchmarks, and drops down to 1.6%
for the low-bandwidth benchmarks. Also, PROFET’s predictions
accurately distinguish between the benchmarks that significantly
benefit from the MCDRAM, and the benchmarks that show negli-
gible performance improvements or even performance loss. This
confirms that PROFET properly considers both segments of the
memory bandwidth–latency curves: the constant latency segment,
close to the lead-off memory latency, and the exponential segment,
close to the memory bandwidth saturation point.
7.3 Additional evaluations
Section 7.1 summarizes the evaluation of the performance model
on the Sandy Bridge when increasing the DRAM frequency from
DDR3-800 to DDR3-1066/1333/1600. Appendix B.1 extends these
results with the evaluation of the power and energy model on
the same platform and DRAM configurations. PROFET’s power
and energy estimates are evaluated versus measurements of an
external power meter taken on the actual Sandy Bridge server.
As remarked above, PROFET’s power and energy models were
not developed for the KNL server, because we lacked the reliable
MCDRAM power parameters. As a part of the ongoing work, we
are exploring different ways to estimate these parameters.
The evaluation results show that PROFET’s predictions are highly
accurate, typically with only 2% difference from the performance,
power and energy consumption measurements on the actual hard-
ware. PROFET provides accurate estimations even when the base-
line and target memory systems have fundamentally different
bandwidth–latency curves, with different lead-off latency and
with an n-fold difference in the available bandwidth.
In Appendix B.2, we compare the performance estimates of PRO-
FET with ZSim+DRAMSim2 hardware simulators. The comparison
is done based on the actual measurements on the Sandy Bridge
platform. The presented results show that PROFET has much better
accuracy than the simulators. Also, PROFET’s estimations closely
follow the trend of the actual measurements, while the simulated
results show completely different trends. Appendix B.2 also sum-
marizes additional advantages of PROFET. PROFET is faster than
the hardware simulators by three orders of magnitude, so it can
be used to analyze production HPC applications, arbitrarily sized
systems, and numerous design options, within a practical length of
time. Also, PROFET does not require detailed CPU modeling since
the CPU functionality, including actual data prefetcher and out-of-
order engine, is already accounted for by the application profiling.
Finally, PROFET can be used on various platforms as long as they
support the required application profiling. PROFET was initially
developed for the Sandy Bridge platform, and later we evaluated it
for the KNL server. Adjustment of PROFET to the KNL system was
trivial, requiring changes to only a few hardware parameters.
8 RELATED WORK
CPU and memory analytical models: Numerous studies pro-
pose analytical models as an alternative to cycle-accurate hardware
simulation. We summarize the ones that are directly related to our
work.
Karkhanis and Smith [22] present an OOO processor model
that estimates the performance impact of instruction window size,
branch misprediction, instruction cache misses and data cache
misses. The model is validated versus detailed superscalar OOO
CPU simulation and the authors conclude that, although the model
provides much less data than detailed simulation, it still provides
insights on what is going on inside the processor. Published in 2004,
this work became the foundation for numerous advanced proces-
sor modeling approaches. Eyerman et al. [10] extend the work of
Karkhanis and Smith [22] and develop the mechanistic model and
interval analysis. The interval analysis breaks the total execution
time into intervals based on the miss events, branch mispredictions
and TLB/cache misses, and then predicts the execution time of each
interval. Genbrugge et al. [12] propose interval simulation which
accelerates multi-core simulation via a combination of the high-
level mechanistic analytical model [10] and detailed simulation.
The mechanistic analytical model is used to estimate the core-level
performance between two miss events, while the miss events are
determined through the simulation of branch predictor and the
entire memory hierarchy: private per-core caches and TLBs, shared
caches, cache coherence, network on chip, memory controller, and
main memory. Finally, Van den Steen et al. [6] present various en-
hancements of the interval model [10]. First, the study incorporates
architecture-independent application profiling, so the application
can be profiled once and then simulated on any given platform.
The authors also demonstrate that the analytical model can be
connected to the McPAT power tool [26] for estimation of the pro-
cessor power and energy consumption. Finally, the authors make
first steps in memory system modeling by considering congestion
on the memory bus. Unlike our study, however, they do not take
into account contention in the memory controller and memory
device itself, which are more challenging.
The greatest challenge of the presented modeling approaches
is that, although they do not require simulation to carry out the
performance prediction, they do require detailed application profil-
ing, which can be performed only with hardware simulators. Even
with recent advances in hardware performance counters, we are
still far away from being able to read values such as, for example,
the number of cold, capacity and conflict cache misses. Since these
methods require simulation results even for application profiling,
they entail significant effort to set up and tune the simulator for a
target architecture, a considerable amount of simulation time, and potentially
high simulation errors. Our approach avoids these issues by using
only those parameters that can be read using performance counters
on real hardware. This is the greatest advantage of our work com-
pared with previously-mentioned studies. Limiting the application
profiling to the performance counters available on real hardware
requires a novel approach to infer various application parameters,
such as the MLP or number of executed OOO instructions. There-
fore, although we start with the same foundation as the previous
studies, as detailed in Section 4.6, estimation of the important ap-
plication parameters and the overall performance calculation are
fundamentally different from previously-presented models.
The second important difference of our work compared with the
previous studies is the treatment of the main memory access la-
tency. The previous studies use the memory access latency obtained
from a detailed simulation of the whole memory hierarchy [12] or
simply use a constant latency [6, 10, 22]. Our study performs a de-
tailed analysis of memory bandwidth–latency curves and uses these
curves as an intrinsic part of the overall performance estimation,
as detailed in Section 4.5.
Third, unlike previous studies, PROFET encompasses data prefetch-
ing. This is an important difference because prefetching may have
significant impact on the application performance, behavior and
memory bandwidth usage.
Fourth, we complement the PROFET performance model with
power and energy consumption estimates, and evaluate all three
PROFET models against fully-utilized actual HPC servers with
multi-threaded or multi-programmed benchmark execution. The
previous models are validated versus the same simulators used for
the application profiling. This means that the previous evaluations
overlook potentially high errors of application profiling on a sim-
ulation versus the actual hardware. Finally, most of the previous
studies [6, 10, 22] build and validate the models for single-core
processors.
Workload characterization and memory DVFS: The perfor-
mance impact of memory bandwidth and latency is frequently
estimated by workload characterization studies and memory DVFS
proposals. Memory models that could quantify this impact are miss-
ing, or not publicly available, so some studies develop their own
models [4, 7]. These models are developed for very specific tasks
in the context of the larger studies and they successfully fulfil their
objectives. Still it is questionable whether they can be applied gen-
erally because of two main limitations. First, the modeled CPU and
memory systems are much simpler than state-of-the-art production
platforms. Second, the models are not validated versus any real hardware
or hardware simulators, so it is difficult to quantify the errors
they may introduce. Even so, we consider these studies as very valu-
able for any follow-up on this topic because they analyze different
approaches for main memory modeling and share their experiences.
9 CONCLUSIONS
This study presents PROFET, an analytical model that quantifies the
impact of the main memory on application performance and system
power and energy consumption. PROFET is based on memory
system profiling and instrumentation of an application execution
on a real platform with a baseline memory system. By running on
the real platform, PROFET handles many aspects (e.g. prefetcher
and out-of-order engine) that are oversimplified in state-of-the-art
methods. The outputs from PROFET are the predicted performance,
power and energy consumption on the target memory.
PROFET is evaluated on two actual platforms: Sandy Bridge-
EP E5-2670 and Knights Landing Xeon Phi platforms with various
memory configurations. The evaluation results show that PROFET’s
predictions are very accurate — the average difference from the
performance, power and energy measured on the actual hardware is
typically only about 2%. We also compare PROFET’s performance
predictions with simulation results for the Sandy Bridge-EP E5-
2670 system with ZSim and DRAMSim2 simulators. PROFET shows
significantly better accuracy while being three orders of magnitude
faster than the hardware simulators. We release PROFET’s source
code and all input data required for memory system and application
profiling [31]. The released model is ready to be used on high-end
Intel platforms, and we would encourage the community to use it,
adapt it to other platforms, and share their own evaluations.
ACKNOWLEDGMENTS
This work was supported by the Spanish Ministry of Science and
Technology (project TIN2015-65316-P), Generalitat de Catalunya
(contracts 2014-SGR-1051 and 2014-SGR-1272), Severo Ochoa Pro-
gramme (SEV-2015-0493) of the Spanish Government; and the Eu-
ropean Union’s Horizon 2020 research and innovation programme
under ExaNoDe project (grant agreement No 671578) and EuroEXA
project (grant agreement No 754337); the U.S. Department of De-
fense under Contract FA8075-14-D-0002-0007, TAT 15-1158; and
the U.S. National Science Foundation under Award 1642424.
A POWER AND ENERGY MODELING
In this appendix, we give the detailed description of the PROFET
power and energy models.
A.1 Power modeling
Apart from performance, power and energy demand are important
system constraints. In modern HPC systems, the memory subsystem
contributes 10–16% of the total server power consumption [11]. It
is therefore valuable to quantify the trade-offs in power and energy
consumption due to the change of the memory system.
As mentioned in Section 5, we analyse the difference in total
system power consumption when we move from baseline to target
memory system. In our study, we assume that when the memory
system changes, the biggest impact on total system power con-
sumption is the change of the main memory power consumption.
Hence, we focus on modeling the power consumption of the mem-
ory subsystem and consider that power consumption of the rest
of the system does not change [7]. Estimating power and energy
consumption requires measurements of the total platform power
consumption, ratio of time spent in memory power-down states
(active standby, precharge power-down and self-refresh states in
our system) and row-buffer access statistics (rate of page hits and
page misses). Table 5 summarizes the symbols used in the formulas
for power and energy estimation.
Table 5: Notation used in formulas: Power modeling
Description | Symbol
Input parameters |
Total platform power consumption | $P_{tot}$
Percentage of time spent in active standby state | $t_{act}$
Percentage of time spent in precharge power-down state | $t_{ppd}$
Percentage of time spent in self-refresh state | $t_{sr}$
Percentage of row-buffer hits | $p_{hit}$
Percentage of row-buffer misses | $p_{miss}$
Used memory bandwidth for read/write traffic | $BW^{rd/wr,(1)}_{used}$
Intermediate outputs |
Total memory power | $P_{mem}$
Power consumption of the rest of the system, apart from the memory | $P_{rest}$
Operational memory power | $P_{op}$
Background memory power | $P_{bg}$
Memory power in active standby state | $P_{act}$
Memory power in precharge power-down state | $P_{ppd}$
Memory power in self-refresh state | $P_{sr}$
Total memory read/write operations power | $P_{rd/wr}$
Power of memory refresh operations | $P_{ref}$
Duration of sampling segment | $T_{sample}$
Number of read/write memory accesses on a sampling segment | $N^{rd/wr}_{access}$
Energy on termination resistors for a single read/write memory access | $E^{rd/wr}_{term}$
Single memory read/write access energy | $E^{rd/wr}_{access}$
Energy of a read/write row buffer miss access | $E^{rd/wr}_{miss}$
Energy of a read/write row buffer hit access | $E^{rd/wr}_{hit}$
To calculate the components of memory power consumption,
we use Micron’s guide for calculating power consumption of DDR3
memory systems [28]. Apart from the parameters in Table 5, Mi-
cron’s power consumption guide requires IDD currents, memory
system voltage and DIMM timing parameters, which are detailed
in the DIMM documentation [29] (see footnote 7). The PROFET power model was
not developed for the KNL server, because we lacked reliable
MCDRAM power parameters. The estimation of these power
parameters is part of ongoing work.
Micron’s power consumption guide defines how to calculate
power consumption of individual read or write memory accesses
and DRAM power-down states. In our power analysis, we begin
from total platform power consumption and expand its components
down to these basic DRAM operations. We start by dividing the
total platform power consumption into two components, power
of the memory system and power of the rest of the platform apart
from the memory [7]: Ptot = Pmem + Prest. As in the analysis of
the performance in Section 4, we use the superscripts (1) and (2)
to distinguish between the Ptot and its components in the baseline
and target memory systems, respectively:
$$\text{Baseline memory: } P^{(1)}_{tot} = P^{(1)}_{mem} + P^{(1)}_{rest}$$
$$\text{Target memory: } P^{(2)}_{tot} = P^{(2)}_{mem} + P^{(2)}_{rest} \tag{19}$$
Since we assume that the power of the rest of the system stays the
same, we can write $P^{(2)}_{rest} = P^{(1)}_{rest}$ (see footnote 8). Therefore, total platform power
consumption in the target memory system can be expressed as:
$$P^{(2)}_{tot} = P^{(1)}_{tot} + (P^{(2)}_{mem} - P^{(1)}_{mem}) \tag{20}$$
In further analysis we focus on Pmem. It comprises two components,
background power Pbg and operational power Pop:
$$P_{mem} = P_{bg} + P_{op} \tag{21}$$
Background power accounts for the current state of the memory
system. There are several possible states of the memory system,
depending on which power-down states are used. On our exper-
imental system, there are three supported memory power states:
active standby, precharge power-down and self-refresh. They differ
in terms of power consumption and the transition latency to the
active state. In active standby state, the memory device consumes
the highest power Pact but executes the commands immediately,
without any latency penalty. Precharge power-down state consumes
power Pppd, which is less than Pact, with a moderate latency penalty.
In self-refresh mode, memory consumes the least power Psr, but
has a significant latency penalty for coming back to active standby
mode. Multiplying each of these powers with the corresponding
time share spent in each of them (tact, tppd and tsr, respectively; see footnote 9),
and summing these products gives the background power Pbg:
$$P_{bg} = t_{act} \times P_{act} + t_{ppd} \times P_{ppd} + t_{sr} \times P_{sr} \tag{22}$$
Footnote 7: During the evaluation on different memory frequencies (DDR3-800/1066/1333/1600)
we used the same DIMMs. Each memory frequency, however, uses different timing
and IDD current parameters.
Footnote 8: Quantifying the impact of a change from baseline to target memory system on Prest
is a part of ongoing work.
Footnote 9: Please note that tact + tppd + tsr = 1.
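The background-power calculation of Eq. 21 and Eq. 22 can be sketched in a few lines of Python. This is a minimal illustration rather than the released PROFET code: the function names are hypothetical, and the per-state powers and time shares in the example are made-up values (in practice the powers would be derived from the IDD currents via Micron's guide).

```python
def background_power(t_act, t_ppd, t_sr, p_act, p_ppd, p_sr):
    """Eq. 22: time-weighted sum of the power drawn in each memory power state (W)."""
    if abs((t_act + t_ppd + t_sr) - 1.0) > 1e-9:
        # Footnote 9: the three time shares must add up to 1.
        raise ValueError("time shares must sum to 1")
    return t_act * p_act + t_ppd * p_ppd + t_sr * p_sr


def memory_power(p_bg, p_op):
    """Eq. 21: total memory power is background plus operational power (W)."""
    return p_bg + p_op


# Hypothetical inputs, chosen only to exercise the formula.
p_bg = background_power(t_act=0.6, t_ppd=0.3, t_sr=0.1,
                        p_act=4.0, p_ppd=2.0, p_sr=0.5)
p_mem = memory_power(p_bg, p_op=3.5)
print(f"P_bg = {p_bg:.2f} W, P_mem = {p_mem:.2f} W")
```

The explicit check on the time shares is a design choice of this sketch; it catches counter readings that do not cover the whole sampling segment.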
Operational power represents the sum of the power consumed while
reading or writing the data, plus the power consumption of the
refresh operations:
$$P_{op} = P_{rd} + P_{wr} + P_{ref} \tag{23}$$
Components Prd and Pwr represent the power consumption of all read
or write memory accesses, respectively, on a sampling interval.
Expanding these components further leads to the power consumption
of individual read or write memory accesses, which can be
calculated using Micron’s power consumption guide. However,
using the power consumption of individual read or write memory
accesses to calculate Prd and Pwr is not trivial. These individual
reads or writes are interleaved and overlapped in time. In order
to sum the powers of individual reads or writes, we would have to know
their distribution over intervals on the order of 1 ns, which is
infeasible on current hardware platforms.
To mitigate this problem we calculate the energy of an individual
read or write memory access. Using energy instead of power implies
that we do not have to know the distribution of memory accesses
on a sampling segment to calculate the sum of energies of all the
individual reads or writes. This way, we calculate Prd and Pwr in
two steps. First, we sum the energies of all the individual reads or
writes on the sampling segment. Second, we divide this cumulative
energy by the duration of the sampling
segment. If the energy of a single read or write memory access
is $E^{rd/wr}_{access}$ and there are $N^{rd/wr}_{access}$ reads or writes on a
sampling segment $T_{sample}$, we can write:
$$P_{rd} = \frac{\sum_{i=1}^{N^{rd}_{access}} E^{rd}_{access,i}}{T_{sample}} = \frac{E^{rd}_{access} \times N^{rd}_{access}}{T_{sample}} \tag{24}$$
$$P_{wr} = \frac{\sum_{i=1}^{N^{wr}_{access}} E^{wr}_{access,i}}{T_{sample}} = \frac{E^{wr}_{access} \times N^{wr}_{access}}{T_{sample}} \tag{25}$$
The number of reads or writes on the sampling segment can be measured
with memory bandwidth hardware counters. A single read or
write memory access transfers an amount of data equal to the width
of the memory bus (64 bits) multiplied by the number of bursts (8), so
8 Bytes × 8 bursts = 64 Bytes of data. Therefore, the number of reads
or writes equals the total read or write traffic during the sampling
segment, divided by the size of a single memory access:
$$N^{rd}_{access} = \frac{BW^{rd}_{used} \times T_{sample}}{64\,\mathrm{B}} \tag{26}$$
$$N^{wr}_{access} = \frac{BW^{wr}_{used} \times T_{sample}}{64\,\mathrm{B}} \tag{27}$$
Using Eq. 26 and 27 with Eq. 24 and 25, we get:
$$P_{rd} = E^{rd}_{access} \times \frac{BW^{rd}_{used}}{64\,\mathrm{B}} \tag{28}$$
$$P_{wr} = E^{wr}_{access} \times \frac{BW^{wr}_{used}}{64\,\mathrm{B}} \tag{29}$$
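The step from Eq. 24 to Eq. 28 can be checked numerically with a short Python sketch. The function names are hypothetical, and the bandwidth and per-access energy are illustrative values, not measurements from the paper. The point is that the sampling-segment duration cancels out, so read power depends only on the used bandwidth and the per-access energy.

```python
ACCESS_BYTES = 64  # 8-byte bus width x 8 bursts per access


def access_count(bw_bytes_per_s, t_sample_s):
    """Eq. 26/27: number of accesses in a sampling segment."""
    return bw_bytes_per_s * t_sample_s / ACCESS_BYTES


def access_power(e_access_j, bw_bytes_per_s):
    """Eq. 28/29: power of all read (or write) accesses, in watts."""
    return e_access_j * bw_bytes_per_s / ACCESS_BYTES


# Hypothetical inputs: 12.8 GB/s of read traffic, 20 nJ per read access.
bw_rd = 12.8e9
e_rd = 20e-9
t_sample = 1.0

n_rd = access_count(bw_rd, t_sample)
p_rd = access_power(e_rd, bw_rd)

# Cross-check against the two-step form of Eq. 24: total energy / duration.
p_rd_eq24 = e_rd * n_rd / t_sample
print(f"N_rd = {n_rd:.3g} accesses, P_rd = {p_rd:.2f} W")
```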
The energy of a single memory access accounts for the read or write
operation and its sub-operations. It also includes the energy on the
termination resistors, which are common in DDR devices. Our Sandy Bridge
experimental platform uses an adaptive open-page policy [15, 19], so
the sub-operations performed depend on whether the target row in the
memory array was open or closed when accessing it. If the target
row was open, it is a row-buffer hit access and it consumes only
the energy for reading or writing the data. When the target row
is closed, there are two scenarios. In the first scenario, there
is no other open row in the same bank, and initially the energy
accounts for opening a row and reading or writing the data. This
row will eventually be closed after a time-out, so the precharge energy
should be added afterwards. In the second scenario, there is
an open row (which is not the target one) in the same bank, and it has
to be closed first. This accounts for the energy of precharging and closing
the open row, activating the target row and reading or writing
the data. Both scenarios include the same sub-operations from the
energy point of view and we consider them as the row-buffer miss case in
further analysis. Since we measure the ratio of row-buffer hits phit
and row-buffer misses pmiss (see footnote 10) w.r.t. the total number of accesses, the
energy per memory access can be represented as:
$$E^{rd}_{access} = E^{rd}_{hit} \times p_{hit} + E^{rd}_{miss} \times p_{miss} + E^{rd}_{term} \tag{30}$$
$$E^{wr}_{access} = E^{wr}_{hit} \times p_{hit} + E^{wr}_{miss} \times p_{miss} + E^{wr}_{term} \tag{31}$$
The energy parameter Emiss represents the row-buffer miss en-
ergy, including opening the row, sending or receiving the data
and precharging the row. Ehit represents the row-buffer hit energy
and it includes sending or receiving the data. Eterm represents the
energy on termination resistors. These parameters are calculated
using Micron’s power consumption guide.
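As a small illustration of Eq. 30 and Eq. 31, the per-access energy is a hit-ratio-weighted average of the hit and miss energies, plus the termination energy that is paid on every access. The function name and all energy values below are assumptions made for this sketch; real values would come from Micron's power consumption guide and the DIMM datasheet.

```python
def access_energy(e_hit, e_miss, e_term, p_hit):
    """Eq. 30/31: expected energy of one read (or write) access, in joules."""
    if not 0.0 <= p_hit <= 1.0:
        raise ValueError("p_hit must be a ratio between 0 and 1")
    p_miss = 1.0 - p_hit  # footnote 10: p_hit + p_miss = 1
    return e_hit * p_hit + e_miss * p_miss + e_term


# Hypothetical energies (joules) and a 75% row-buffer hit ratio.
e_rd = access_energy(e_hit=10e-9, e_miss=30e-9, e_term=5e-9, p_hit=0.75)
print(f"E_rd_access = {e_rd * 1e9:.1f} nJ")
```

With these numbers a lower hit ratio raises the per-access energy, because each miss also pays for activating and precharging a row.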
Now that we have all the power components, we can substitute
Equations 21 to 31 into Eq. 20 and calculate $P^{(2)}_{tot}$
on the target memory system. Before this step, we have to make
two assumptions. The first assumption is that the ratio of time spent in
memory power-down states stays the same on the baseline and target
memory systems. Hence, $t^{(2)}_{act} = t^{(1)}_{act}$, $t^{(2)}_{ppd} = t^{(1)}_{ppd}$ and $t^{(2)}_{sr} = t^{(1)}_{sr}$.
The second assumption is that the row-buffer access statistics do not
change from baseline to target memory system. So, $p^{(2)}_{hit} = p^{(1)}_{hit}$ and
$p^{(2)}_{miss} = p^{(1)}_{miss}$. These assumptions are reasonable, since we assume
that Instot, MissLLC, CPI0, RatioR/W and the memory access pattern
do not change from baseline to target memory system (detailed in
Section 4.2). The final solution for $P^{(2)}_{tot}$ on the target memory
system is given in Eq. 32.
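The complete substitution can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the released PROFET implementation: the `MemoryParams` container, the function names and every numeric value are hypothetical. The time shares and the row-buffer hit ratio are passed once and reused for both memory systems, which mirrors the two assumptions above.

```python
from dataclasses import dataclass

ACCESS_BYTES = 64  # bytes transferred by one read or write access


@dataclass
class MemoryParams:
    """Inputs for one memory system; the values used below are made up."""
    p_act: float      # active standby power (W)
    p_ppd: float      # precharge power-down power (W)
    p_sr: float       # self-refresh power (W)
    p_ref: float      # refresh power (W)
    e_rd_hit: float   # read energies per access (J)
    e_rd_miss: float
    e_rd_term: float
    e_wr_hit: float   # write energies per access (J)
    e_wr_miss: float
    e_wr_term: float
    bw_rd: float      # used read bandwidth (bytes/s)
    bw_wr: float      # used write bandwidth (bytes/s)


def memory_power(m, t_act, t_ppd, t_sr, p_hit):
    """P_mem = P_bg + P_ref + P_rd + P_wr (Eq. 21 to Eq. 31)."""
    p_miss = 1.0 - p_hit
    p_bg = t_act * m.p_act + t_ppd * m.p_ppd + t_sr * m.p_sr
    e_rd = m.e_rd_hit * p_hit + m.e_rd_miss * p_miss + m.e_rd_term
    e_wr = m.e_wr_hit * p_hit + m.e_wr_miss * p_miss + m.e_wr_term
    p_rd = e_rd * m.bw_rd / ACCESS_BYTES
    p_wr = e_wr * m.bw_wr / ACCESS_BYTES
    return p_bg + m.p_ref + p_rd + p_wr


def target_total_power(p_tot_base, base, target, t_act, t_ppd, t_sr, p_hit):
    """Eq. 20: baseline platform power plus the change in memory power."""
    shared = dict(t_act=t_act, t_ppd=t_ppd, t_sr=t_sr, p_hit=p_hit)
    return p_tot_base + memory_power(target, **shared) - memory_power(base, **shared)


base = MemoryParams(4.0, 2.0, 0.5, 0.5,
                    10e-9, 30e-9, 5e-9, 10e-9, 30e-9, 5e-9,
                    6.4e9, 3.2e9)
target = MemoryParams(5.0, 2.5, 0.6, 0.6,
                      10e-9, 30e-9, 5e-9, 10e-9, 30e-9, 5e-9,
                      9.6e9, 4.8e9)
p_tot_target = target_total_power(100.0, base, target,
                                  t_act=0.6, t_ppd=0.3, t_sr=0.1, p_hit=0.75)
print(f"P_tot on target memory = {p_tot_target:.2f} W")
```

Note that the target bandwidth is itself an output of the performance model, so in a real workflow `bw_rd` and `bw_wr` for the target would be predicted rather than entered by hand.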
Footnote 10: Note that phit + pmiss = 1.

$$
\begin{aligned}
P^{(2)}_{tot} &= P^{(1)}_{tot} + (P^{(2)}_{mem} - P^{(1)}_{mem}) = P^{(1)}_{tot} + \left( P^{(2)}_{bg} + P^{(2)}_{op} - (P^{(1)}_{bg} + P^{(1)}_{op}) \right) \\
&= P^{(1)}_{tot} + P^{(2)}_{bg} - P^{(1)}_{bg} + P^{(2)}_{op} - P^{(1)}_{op} \\
&= P^{(1)}_{tot} + P^{(2)}_{bg} - P^{(1)}_{bg} + P^{(2)}_{ref} - P^{(1)}_{ref} + P^{(2)}_{rd} - P^{(1)}_{rd} + P^{(2)}_{wr} - P^{(1)}_{wr} \\
&= P^{(1)}_{tot} + \overbrace{t_{act} \times (P^{(2)}_{act} - P^{(1)}_{act}) + t_{ppd} \times (P^{(2)}_{ppd} - P^{(1)}_{ppd}) + t_{sr} \times (P^{(2)}_{sr} - P^{(1)}_{sr})}^{P^{(2)}_{bg} - P^{(1)}_{bg} \text{ (Eq. 22)}} + (P^{(2)}_{ref} - P^{(1)}_{ref}) \\
&\quad + \overbrace{\frac{1}{64\,\mathrm{B}} \times \left( BW^{rd,(2)}_{used} \times (E^{rd,(2)}_{hit} \times p_{hit} + E^{rd,(2)}_{miss} \times p_{miss} + E^{rd,(2)}_{term}) - BW^{rd,(1)}_{used} \times (E^{rd,(1)}_{hit} \times p_{hit} + E^{rd,(1)}_{miss} \times p_{miss} + E^{rd,(1)}_{term}) \right)}^{P^{(2)}_{rd} - P^{(1)}_{rd} \text{ (Eq. 28 and Eq. 30)}} \\
&\quad + \underbrace{\frac{1}{64\,\mathrm{B}} \times \left( BW^{wr,(2)}_{used} \times (E^{wr,(2)}_{hit} \times p_{hit} + E^{wr,(2)}_{miss} \times p_{miss} + E^{wr,(2)}_{term}) - BW^{wr,(1)}_{used} \times (E^{wr,(1)}_{hit} \times p_{hit} + E^{wr,(1)}_{miss} \times p_{miss} + E^{wr,(1)}_{term}) \right)}_{P^{(2)}_{wr} - P^{(1)}_{wr} \text{ (Eq. 29 and Eq. 31)}}
\end{aligned}
\tag{32}
$$
Equation 32: The final solution for $P^{(2)}_{tot}$, showing all the required parameters.

A.2 Energy modeling
Once we have the performance and the power consumption estimations,
we can estimate the total system energy consumption
with the target memory system. In general, energy is defined as the
integral of power over time:
$$E_{tot} = \int_{0}^{t_{tot}} P_{tot}(t)\,dt$$
In our experiments, we measure and analyse power consumption
on sampling segments of 1 s. Hence, we represent total energy from
our experiments in a discrete form:
$$E_{tot} = \sum_{i=1}^{N} P_{tot,i} \times \Delta t_i \tag{33}$$
The parameter $\Delta t_i$ is the duration of the sampling segment, and
equals $\Delta t^{(1)}_i = 1$ s on a baseline memory system (detailed in
Section 4.1). During the $\Delta t^{(1)}_i$ segment, $Ins^{(1)}_{tot,i}$ instructions are
executed. As we mentioned in Section 4.2, we assume that Instot
does not change from baseline to target memory system, therefore
$Ins^{(1)}_{tot,i} = Ins^{(2)}_{tot,i}$. However, the duration of the corresponding time
interval $\Delta t^{(2)}_i$ on the target memory system is not the same as $\Delta t^{(1)}_i$.
Because the same number of instructions is executed in both intervals,
$\Delta t^{(2)}_i$ on the target memory system is inversely
proportional to the estimated performance improvement:
$$\Delta t^{(2)}_i = \frac{IPC^{(1)}_{tot,i}}{IPC^{(2)}_{tot,i}} \times \Delta t^{(1)}_i \tag{34}$$
Finally, total energy on the target memory system is:
$$E^{(2)}_{tot} = \sum_{i=1}^{N} P^{(2)}_{tot,i} \times \Delta t^{(2)}_i = \sum_{i=1}^{N} P^{(2)}_{tot,i} \times \frac{IPC^{(1)}_{tot,i}}{IPC^{(2)}_{tot,i}} \times \Delta t^{(1)}_i \tag{35}$$
Parameter $P^{(2)}_{tot,i}$ can be calculated using Equation 32, and $IPC^{(2)}_{tot,i}$
can be calculated using the PROFET performance model from Section 4.
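The discrete energy estimate of Eq. 33 to Eq. 35 can be sketched as a short Python function. The function name and the three-segment example are hypothetical; the per-segment power and IPC values are illustrative, not results from the paper.

```python
def target_energy(p_tot_target_w, ipc_base, ipc_target, dt_base_s=1.0):
    """Eq. 35: energy on the target memory system, in joules.

    Each baseline segment of dt_base_s seconds is rescaled by
    IPC_base / IPC_target (Eq. 34) before being weighted by the
    predicted target power for that segment.
    """
    if not (len(p_tot_target_w) == len(ipc_base) == len(ipc_target)):
        raise ValueError("all per-segment series must have the same length")
    return sum(p * (i1 / i2) * dt_base_s
               for p, i1, i2 in zip(p_tot_target_w, ipc_base, ipc_target))


# Three hypothetical 1 s baseline segments.
p2 = [102.0, 101.0, 100.0]   # predicted target power per segment (W)
ipc1 = [1.0, 0.8, 1.2]       # measured baseline IPC per segment
ipc2 = [1.25, 1.0, 1.2]      # predicted target IPC per segment

e2 = target_energy(p2, ipc1, ipc2)
print(f"E_tot on target memory = {e2:.1f} J")
```

In this example the first two segments shrink to 0.8 s each, so the total energy is lower than the baseline even though the predicted power per segment is higher.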
B ADDITIONAL EVALUATIONS
Due to the lack of space, Sections 7.1 and 7.2 of the paper show
a subset of the PROFET evaluation experiments performed in the
study. Rest of the evaluation results are presented in this appendix.
Appendix B.1 shows the evaluation of the PROFET power and en-
ergy model for Sandy Bridge when increasing the DRAM frequency
from DDR3-800 to DDR3-1066/1333/1600. PROFET’s power and
energy estimates are evaluated versus measurements of an external
power meter taken on the actual Sandy Bridge server. These results
are a direct extension of the Sandy Bridge performance model eval-
uation presented in Section 7.1. Afterwards, in Appendix B.2 we
compare PROFET’s performance estimates with the outcomes of
the ZSim+DRAMSim2 hardware simulator.
B.1 Sandy Bridge: DDR3-800→ 1066/1333/1600
Power and energy estimation
In Section 7, we showed the performance evaluation of PROFET on
a Sandy Bridge server when increasing the DRAM frequency from
DDR3-800 to DDR3-1066/1333/1600. In this section, we present the
evaluation results for system-level estimation of power and energy
consumption.
B.1.1 System power. Figure 12 shows the estimated and measured
system power. As in Figure 10, for each benchmark we plot three
sets of bars, one per DRAM configuration: DDR3-800 → 1066/1333/1600.
Also, as in all previous evaluation charts, we plot PROFET’s point
estimate (solid bars) and estimation bounds (error bars), and the
power consumption measured on the actual server (cross markers).
Although some of the high-bandwidth benchmarks, from libquantum
to GemsFDTD, show a moderate prediction error of 3–4%, in general
the power prediction error of PROFET is small, below 1% on average.
The most important finding of the results presented in Figure 12
is that the significant increment in the DRAM frequency causes
very small change in the overall server power consumption. For
example, increasing the DRAM frequency from DDR3-800 to 1600
(100% increment) causes average power increment of only 2%. Even
if we focus on the high-stress memory benchmarks, the power
increment is still below 5%. This is not surprising. As assumed in
PROFET’s power model, when changing the DRAM frequency, the
most important impact on total system power consumption is the
change of the main memory power consumption. Although the
relative change of the memory power itself could be significant,
this is still a small portion of the overall server power [11].
[Figure 12: Sandy Bridge, DDR3-800 → 1066/1333/1600: DRAM frequency has a minor impact on the overall server power
consumption. Increasing DRAM frequency from DDR3-800 to DDR3-1600 (100% increment) causes average power increment
of only 2%. Y-axis: relative power consumption difference w.r.t. DDR3-800 (0% to 8%). X-axis: SPEC CPU2006 benchmarks (libquantum to povray) and the HPC applications QE, ALYA, GROMACS and NAMD. Series: measured power consumption increment; estimated DDR3-800 → DDR3-1066, DDR3-1333 and DDR3-1600.]

B.1.2 Energy consumption. Estimated and measured changes in the
system energy consumption when increasing the DRAM frequency
from DDR3-800 to DDR3-1066/1333/1600 are given in Figure 13.
The results show that the DRAM frequency has a significant impact
on the overall energy consumption. For example, increasing the
DRAM frequency from DDR3-800 to DDR3-1600 leads to average
energy savings of 13% and reaches 41% savings for the libquantum
benchmark. The results also show that PROFET’s energy predictions
are precise, with narrow error bars, and accurate, with an average
prediction error below 2%.
The presented results and findings are not surprising. Our previous
results showed that increasing the DRAM frequency causes
significant execution time reductions, and only a few percent increase
in server power. Since energy is the integral of the power
consumption over the application execution time, it is reasonable to
expect significant energy savings. Also, it is anticipated that the
energy predictions are precise and accurate, since they are derived
from the performance and power estimations.
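The reasoning above can be made explicit with a short worked relation. This is an illustrative sketch, not a formula taken from the evaluation: $\bar{P}$ (average system power) and $T$ (execution time) are symbols introduced here, and the 20% execution-time reduction is a hypothetical value. Only the 2% power increment comes from the results reported in B.1.1.

$$\frac{E^{(2)}}{E^{(1)}} = \frac{\bar{P}^{(2)}}{\bar{P}^{(1)}} \times \frac{T^{(2)}}{T^{(1)}} = 1.02 \times 0.80 = 0.816$$

Under these assumed numbers the energy drops by roughly 18% even though the power rises, because the execution-time factor dominates the product.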
B.2 PROFET vs. Hardware simulator
Novel memory systems are typically explored using hardware simu-
lators. In this section we compare PROFET’s performance estimates
with ZSim+DRAMSim2 hardware simulators.
B.2.1 Experimental. We compared the hardware simulators and
PROFET on the Intel Sandy Bridge platform described in Sec-
tion 6.1. Although the Sandy Bridge is a main-stream HPC archi-
tecture released eight years ago (January 2011), finding a CPU
simulator that accurately models this architecture was not trivial.
After extensive search and analysis of the available options, we
decided to use the ZSim simulator. ZSim [33] was initially devel-
oped to simulate Intel Westmere architecture (released in 2008),
and it was recently upgraded to simulate the Sandy Bridge [40]. Another
reason for selecting ZSim was its simulation speed. ZSim
was designed for simulation of large-scale systems, and, to the
best of our knowledge, it is one of the fastest system simulators
available. Different main memory configurations were simulated
with DRAMSim2 [32]. DRAMSim2 is a cycle accurate model of
the DDR3 main memory validated against manufacturer Verilog
models.
The accuracy of PROFET and ZSim+DRAMSim2 was compared
for the case of increasing the memory frequency from DDR3-800 to
DDR3-1066/1333/1600. To keep the simulation time at an acceptable
level, we focus on the CPU2006 benchmarks and exclude the
HPC scientific applications. In all the experiments we fully utilize
the available 16 cores of the Sandy Bridge platform by executing
16 benchmark copies. The simulations were limited to 150 billion
instructions of each benchmark. In order to make a fair comparison
with the simulator, the benchmark execution on the actual system
and the corresponding PROFET predictions were done for the same
150 billion instructions.
[Figure 13: Sandy Bridge, DDR3-800 → 1066/1333/1600: DRAM frequency has a significant impact on the system energy
consumption. Energy predictions of PROFET are precise, with low error bars, and accurate, with small difference from the actual
values measured on real hardware. Y-axis: relative energy consumption difference w.r.t. DDR3-800 (−50% to 10%). X-axis: SPEC CPU2006 benchmarks (libquantum to povray) and the HPC applications QE, ALYA, GROMACS and NAMD. Series: measured energy consumption decrease; estimated DDR3-800 → DDR3-1066, DDR3-1333 and DDR3-1600.]

[Figure 14: Sandy Bridge, DDR3-800 → 1600: Comparison of the estimated (PROFET) and simulated (ZSim+DRAMSim2)
performance improvement. Estimations of PROFET correspond to the real-system measurements much better than the simulated
performance. Y-axis: relative IPC difference w.r.t. DDR3-800 (0% to 100%). X-axis: SPEC CPU2006 benchmarks (libquantum to povray). Series: measured performance improvement; estimated DDR3-800 → DDR3-1600; simulated DDR3-800 → DDR3-1600.]
B.2.2 Results. In Figure 14, we compare the measured, simulated
and estimated performance improvement when increasing the
DRAM frequency from DDR3-800 to DDR3-1600. The DDR3-800 → DDR3-1066/1333
results show the same trend and lead to the same conclusions.
The actual measured values (red cross markers) show significant
performance improvement for the high-bandwidth benchmarks
(left-hand side of the chart) and no performance changes for the
low-bandwidth benchmarks (right-hand side of the chart). PROFET’s
estimations are accurate, with an average error of 3.6%, and closely
follow the trend of the actual measurements. Also, for the benchmarks
for which the estimation error is moderate, e.g., lbm and milc, the
estimated performance has wide error bars, clearly indicating a
limited estimation precision in these cases. The simulated performance
shows significant discrepancy with the actual measured values. The
average simulation error is 15.7%, which is significant considering
that the average DDR3-800 → DDR3-1600 performance improvement
is 17.5%. The range of the simulator error is also very wide,
between −23.7% (libquantum) and 45.7% (omnetpp). Finally, the trend of
the simulated results is completely different from the actual one. As
already mentioned, on the actual hardware the high-bandwidth benchmarks
experience high performance improvement, while the low-bandwidth
benchmarks show insignificant performance changes. The simulator underestimates
performance gains of the high-bandwidth benchmarks (left-hand
side of the chart) while it overestimates gains for the low-bandwidth
benchmarks (right-hand side of the chart). Therefore, the simulated
performance gains are roughly uniform over the benchmark suite,
which is a completely different trend from the actual measurements.
B.2.3 Time to perform prediction. We now compare the total time
for prediction using PROFET against ZSim and DRAMSim2 hard-
ware simulators. In our experiments, running the hardware simula-
tors on a representative segment of the SPEC CPU2006 benchmark
suite took 242 hours. Running the PROFET model required three
steps. First, memory profiling to obtain 26 curves (0% to 50% pro-
portion of reads at 2% increments) with 35 points each, for DDR4,
at a cost of 1 second per experiment, required 910 seconds. Sec-
ond, profiling of the representative segments of the benchmark
suite on the real machine required 127 seconds. Third, running
the PROFET model required a couple of seconds. In total, PROFET
required 1040 seconds, which is 838 times faster than the hardware
simulators.
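The timing figures above can be reproduced with a few lines of arithmetic. The sketch below uses only numbers quoted in this section, with one labeled assumption: the text says the model run takes "a couple of seconds", and 3 seconds is assumed here so that the total matches the reported 1040 seconds.

```python
# Memory profiling: 26 bandwidth-latency curves x 35 points x 1 s per experiment.
curves = len(range(0, 51, 2))        # 0% to 50% in 2% steps
points_per_curve = 35
seconds_per_experiment = 1
memory_profiling_s = curves * points_per_curve * seconds_per_experiment

application_profiling_s = 127        # profiling on the real machine
model_run_s = 3                      # ASSUMPTION: "a couple of seconds"

profet_total_s = memory_profiling_s + application_profiling_s + model_run_s

simulation_s = 242 * 3600            # 242 hours of hardware simulation
speedup = simulation_s / profet_total_s

print(f"memory profiling: {memory_profiling_s} s")
print(f"PROFET total:     {profet_total_s} s")
print(f"speedup:          {speedup:.0f}x")
```

Most of the PROFET time is the one-off memory profiling, which does not have to be repeated for each additional application on the same memory system.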
B.2.4 Discussion. In addition to the better accuracy, PROFET has
various advantages over hardware simulators. PROFET is faster
than the hardware simulators by almost three orders of magnitude,
so it can be used to analyze production HPC applications, arbitrarily
sized systems, and numerous design options. In this paper we pre-
sented experiments on two hardware platforms, Sandy Bridge and
KNL. We analyzed four memory configurations for Sandy Bridge
(DDR3-800/1066/1333/1600), and two for the KNL (DDR4-2400 and
MCDRAM). In all the experiments, we analyzed all the benchmarks
from the SPEC CPU2006 suite. Finally, for the Sandy Bridge plat-
form, we also analyzed power and energy consumption in each
memory configuration, and four HPC production applications. Performing
a study of this size using hardware simulators would
be impossible within a practical length of time.
Additionally, the method is based on profiling of the applica-
tion’s memory behavior, so it does not require detailed modeling
of the CPU as it already takes account of the real (and not publicly
disclosed) data prefetcher and out-of-order engine. Therefore, it
can be used to model various platforms as long as they support the
required application profiling. PROFET was initially developed for
the Sandy Bridge platform, and later we evaluated it for the KNL
server. Adjusting PROFET to the KNL system was trivial, as it
required changes to only a few hardware parameters, such as
the reorder buffer size.
We release the PROFET source code as open source [31]. The
release includes all PROFET inputs and outputs and evaluation
results for the case study that is used in the rest of this paper. The
package includes the memory system profiles, CPU parameters,
application profiles and memory power parameters, as well as the
power, performance and energy outputs from PROFET and the
measurements on the baseline and target platforms. The released
PROFET model is ready to be used on high-end Intel platforms, and
we encourage the community to evaluate PROFET against
actual hardware platforms and hardware simulators, and to share their
findings.
REFERENCES
[1] 2018. TOP500 List. http://www.top500.org/.
[2] Arira Design. 2013. Hybrid Memory Cube Evaluation & Development Board.
http://www.ariradesign.com/hmc-board.
[3] Yuan Chou, Brian Fahs, and Santosh Abraham. 2004. Microarchitecture Opti-
mizations for Exploiting Memory-Level Parallelism. In Proceedings of the 31st
Annual International Symposium on Computer Architecture. 76–87.
[4] R. Clapp, M. Dimitrov, K. Kumar, V. Viswanathan, and T. Willhalm. 2015. Quan-
tifying the Performance Impact of Memory Latency and Bandwidth for Big
Data Workloads. In IEEE International Symposium on Workload Characterization.
213–224. https://doi.org/10.1109/IISWC.2015.32
[5] Yokogawa Test & Measurement Corporation. [n.d.]. WT230 Digital Power Meter.
https://cdn.tmi.yokogawa.com/IM760401-01E.pdf.
[6] S. Van den Steen, S. Eyerman, S. De Pestel, M. Mechri, T. E. Carlson, D. Black-
Schaffer, E. Hagersten, and L. Eeckhout. 2016. Analytical Processor Performance
and Power Modeling Using Micro-Architecture Independent Characteristics. IEEE
Trans. Comput. 65, 12 (Dec 2016), 3537–3551. https://doi.org/10.1109/TC.2016.
2547387
[7] Qingyuan Deng, David Meisner, Luiz Ramos, Thomas F. Wenisch, and Ricardo
Bianchini. 2011. MemScale: Active Low-power Modes for Main Memory. In Pro-
ceedings of the International Conference on Architectural Support for Programming
Languages and Operating Systems. 225–238.
[8] P. G. Emma. 1997. Understanding some simple processor-performance limits.
IBM Journal of Research and Development 41, 3 (May 1997), 215–232. https:
//doi.org/10.1147/rd.413.0215
[9] Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. 2006. A
Performance Counter Architecture for Computing Accurate CPI Components.
In Proceedings of the 12th International Conference on Architectural Support for
Programming Languages and Operating Systems. 175–184.
[10] Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. 2009. A
Mechanistic Performance Model for Superscalar Out-of-order Processors. ACM
Trans. Comput. Syst. 27, 2 (May 2009), 3:1–3:37.
[11] Xizhou Feng, Rong Ge, and K. W. Cameron. 2005. Power and energy profiling of
scientific applications on distributed systems. In IEEE International Parallel and
Distributed Processing Symposium. https://doi.org/10.1109/IPDPS.2005.346
[12] D. Genbrugge, S. Eyerman, and L. Eeckhout. 2010. Interval simulation: Raising
the level of abstraction in architectural simulation. In The Sixteenth International
Symposium on High-Performance Computer Architecture. 307–318. https://doi.
org/10.1109/HPCA.2010.5416636
[13] Andrew Glew. 1998. MLP yes! ILP no! International Conference on Architectural
Support for Programming Languages and Operating Systems, Wild and Crazy Ideas
Session (Oct. 1998).
[14] John L. Hennessy and David A. Patterson. 2017. Computer Architecture: A Quan-
titative Approach (6th ed.).
[15] Intel Corporation. 2012. Intel® Xeon® Processor E5-1600/E5-2600/E5-4600 Product
Families Datasheet - Volume One. Technical Report 326508.
[16] Intel Corporation. 2012. Intel® Xeon® Processor E5-2600 Product Family Uncore
Performance Monitoring Guide. Technical Report.
[17] Intel Corporation. 2016. Intel® 64 and IA-32 Architectures Optimization Reference
Manual. Technical Report.
[18] Intel Corporation. 2017. Intel® Xeon Phi™ Processor Performance Monitoring
Reference Manual - Volume 2: Events. Technical Report.
[19] Bruce Jacob, Spencer Ng, and David Wang. 2007. Memory Systems: Cache, DRAM,
Disk.
[20] Bruce L. Jacob. 2009. The Memory System: You Can’t Avoid It, You Can’t Ignore
It, You Can’t Fake It. Synthesis Lectures on Computer Architecture 4, 1 (2009),
1–77.
[21] James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor
High Performance Programming: Knights Landing Edition (2nd ed.).
[22] Tejas S. Karkhanis and James E. Smith. 2004. A First-Order Superscalar Proces-
sor Model. In Proceedings of the Annual International Symposium on Computer
Architecture. 338–349.
[23] Y. Kim, W. Yang, and O. Mutlu. 2016. Ramulator: A Fast and Extensible DRAM
Simulator. IEEE Computer Architecture Letters 15, 1 (Jan. 2016), 45–49.
[24] Peter Kogge, Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson,
William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon
Hiller, Sherman Karp, Stephen Keckler, Dean Klein, Robert Lucas, Mark Richards,
Al Scarpelli, Steven Scott, Allan Snavely, Thomas Sterling, R. Stanley Williams,
and Katherine Yelick. 2008. ExaScale Computing Study: Technology Challenges
in Achieving Exascale Systems.
[25] David Kroft. 1981. Lockup-free Instruction Fetch/Prefetch Cache Organization.
In Proceedings of the Annual Symposium on Computer Architecture. 81–87.
[26] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen,
and Norman P. Jouppi. 2009. McPAT: An Integrated Power, Area, and Timing
Modeling Framework for Multicore and Manycore Architectures. In Proceedings
of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture.
469–480.
[27] John D. McCalpin. 1991-2007. STREAM: Sustainable Memory Bandwidth in High
Performance Computers. Technical Report. University of Virginia. http://www.
cs.virginia.edu/stream/
[28] Micron Technology, Inc. 2007. Calculating Memory System Power for DDR3.
Technical Report TN-41-01.
[29] Micron Technology, Inc. 2013. MT36JSF1G72PZ-1G6M1, 8GB (x72, ECC, DR)
240-Pin DDR3 RDIMM. http://www.micron.com/~/media/documents/products/data-sheet/modules/parity_rdimm/jsf36c1gx72pz.pdf.
[30] Partnership for Advanced Computing in Europe (PRACE). 2013. Unified European
Applications Benchmark Suite. www.prace-ri.eu/ueabs/.
[31] Milan Radulovic, Rommel Sanchez Verdejo, Paul Carpenter, Petar Radojković,
Bruce Jacob, and Eduard Ayguadé. 2019. PROFET — Analytical model that
quantifies the impact of the main memory on application performance and
system power and energy consumption. https://github.com/bsc-mem/PROFET.
[32] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. 2011. DRAMSim2: A Cycle Accurate
Memory System Simulator. IEEE Computer Architecture Letters 10, 1 (Jan. 2011),
16–19.
[33] Daniel Sanchez and Christos Kozyrakis. 2013. ZSim: Fast and Accurate Mi-
croarchitectural Simulation of Thousand-core Systems. In Proceedings of the 40th
Annual International Symposium on Computer Architecture. 475–486.
[34] Rommel Sanchez Verdejo, Kazi Asifuzzaman, Milan Radulovic, Petar Radojković,
Eduard Ayguadé, and Bruce Jacob. 2018. Main Memory Latency Simulation: The
Missing Link. In Proceedings of the International Symposium on Memory Systems.
1–9.
[35] Avinash Sodani. 2011. Race to Exascale: Opportunities and Challenges. Keynote
Presentation at the 44th Annual IEEE/ACM International Symposium on Mi-
croarchitecture (MICRO).
[36] A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, S. Hutsell,
R. Agarwal, and Y. C. Liu. 2016. Knights Landing: Second-Generation Intel Xeon
Phi Product. IEEE Micro 36, 2 (March 2016), 34–46. https://doi.org/10.1109/MM.
2016.25
[37] Standard Performance Evaluation Corporation. [n.d.]. SPEC CPU 2006. http:
//www.spec.org/cpu2006/.
[38] Rick Stevens, Andy White, Pete Beckman, Ray Bair-ANL, Jim Hack, Jeff Nichols,
Al Geist-ORNL, Horst Simon, Kathy Yelick, John Shalf-LBNL, Steve Ashby, Moe
Khaleel-PNNL, Michel McCoy, Mark Seager, Brent Gorda-LLNL, John Morrison,
Cheryl Wampler-LANL, James Peery, Sudip Dosanjh, Jim Ang-SNL, Jim Dav-
enport, Tom Schlagel, BNL, Fred Johnson, and Paul Messina. 2010. A Decadal
DOE Plan for Providing Exascale Applications and Technologies for DOE Mis-
sion Needs. Presentation at Advanced Simulation and Computing Principal
Investigators Meeting.
[39] J. Treibig, G. Hager, and G. Wellein. 2010. LIKWID: A Lightweight Performance-
Oriented Tool Suite for x86 Multicore Environments. In International Conference
on Parallel Processing Workshops. 207–216. https://doi.org/10.1109/ICPPW.2010.38
[40] R. S. Verdejo and P. Radojković. 2017. Microbenchmarks for Detailed Validation
and Tuning of Hardware Simulators. In 2017 International Conference on High
Performance Computing & Simulation (HPCS). 881–883.
[41] Wm. A. Wulf and Sally A. McKee. 1995. Hitting the Memory Wall: Implications
of the Obvious. ACM SIGARCH Computer Architecture News 23, 1 (March 1995),
20–24.
