A Bottleneck-Centric Tuning Policy for Optimizing Energy in Parallel Programs by Endrei, Mark et al.
A Bottleneck-centric Tuning Policy for 
Optimizing Energy in Parallel Programs 
Mark ENDREI a,1, Chao JIN a, Minh DINH a, David ABRAMSON a, Heidi POXON b, 
Luiz DEROSE b, and Bronis R DE SUPINSKI c 
a Research Computer Center and School of ITEE, The University of Queensland, QLD, 
Australia 
b Cray Inc., Bloomington, MN, USA 
c Lawrence Livermore National Laboratory, Livermore, CA, USA 
Abstract. In order to operate within power supply constraints, the next generation 
of supercomputers must be energy efficient. Both the capacities of the target HPC 
system architecture and workload features impact the energy efficiency of parallel 
applications. These system and workload factors form a complicated optimization 
search space. Further, a typical workload may consist of multiple algorithmic 
kernels each with different power consumption patterns. Using the Parallel Research 
Kernels as a case study, we identify key bottlenecks that change the energy usage 
pattern and develop strategies that improve energy efficiency by optimizing both 
workload and system parameters in an automated manner. The method provides 
significant insights to identify repeatable, statistically significant energy saving 
opportunities for parallel applications at various scales. 
Keywords. High Performance Computing, Energy Efficiency, Power Usage 
1. Introduction 
Power consumption constraints limit the number of active cores that can be used on a 
single processor, and the number of machines that run in a supercomputer center at peak 
performance [1]. Understanding the energy consumption of applications at various scales 
[2, 3] is critical to improve energy efficiency for parallel computing. The hierarchical 
memory subsystem is a significant bottleneck in terms of both performance and power 
for scientific computing, especially for memory-intensive applications [1]. Performance 
and power models, such as the roofline energy model [4] and the execution-cache-
memory (ECM) model [5], provide high-level guidance to balance energy consumption 
and performance optimally for a given application on a target platform. In practice, 
however, achieving a program’s optimal configuration frequently requires fine-grained 
tuning that searches a complicated space that consists of application characteristics, such 
as computational intensity, memory access patterns, and communication frequency, and 
system factors, such as the number of cores, cache design, memory bandwidth, dynamic 
voltage and frequency scaling (DVFS), and I/O performance. In addition, varying the 
data size of the same program often changes the optimal configuration. 
                                                          
1 Corresponding Author, The University of Queensland, School of ITEE, Brisbane QLD 4072 Australia; 
E-mail: mark.endrei@uq.edu.au. 
Unlike most previous work that examines complicated scientific workloads, we 
examine the power behavior of major computation kernels on a NUMA (non-uniform 
memory access) node. The Parallel Research Kernels (PRK) [6] represent common 
compute, memory, and communication patterns of parallel workloads. Our investigation 
focuses on memory-bound kernels, including Stencil, Sparse, Transpose, and Random. 
We present a framework to identify changes in the energy consumption patterns that 
arise from hardware and software interactions. We investigate the divergence of energy 
consumption patterns while saturating system resources, such as memory bandwidth. A 
case study of representative memory-bound applications illustrates our method. We find 
that the best tuning policy varies even for the same kernel at different scales. Further, our 
results highlight the significant impact of the memory sub-system on power consumption. 
Specifically, this paper presents the following contributions: 
• An extensible, platform-independent energy optimization framework;
• Identification of the power and performance transitions for a set of kernels;
• Strategies that use Pareto optimization to balance power usage and performance.
The rest of this paper is organized as follows. Section 2 provides background 
information including related work and our motivation. Section 3 introduces our tuning 
methodology and associated tools. Section 4 applies our method and tools in a case study 
using the OpenMP runtime and Parallel Research Kernels. Our conclusions follow in 
Section 5. 
2. Background and Related Work 
Understanding the energy consumption of a given application at various scales is 
essential for optimizing the utilization of power-constrained computers [1]. This section 
reviews related work in measuring and modeling power consumption, and practically 
identifying the optimal settings for scientific computing kernels. 
2.1. Scientific Computing Kernels 
Scientific computing applications often consist of several algorithmic kernels, and the 
computing pattern within each kernel is typically consistent. For example, Asanovic et 
al. [7] identified 13 “dwarfs” that represent significant computation and communication 
algorithmic patterns. Comprehensively understanding the energy behavior of each 
pattern provides insight that is critical for improving application energy efficiency. 
We investigate the behavior of several kernels that represent typical scientific 
computing matrix- and vector-based operations on a single node. We use an OpenMP 
implementation of four kernels from the Parallel Research Kernels project [6]. The PRK 
provides 11 highly portable kernels that enable system designers to explore common 
compute, memory, and communication patterns. 
2.2. Auto-tuning Tools 
Tuning an application on a target platform is a practical way to find its energy efficient 
configuration. Several auto-tuning tools optimize the energy efficiency of parallel 
applications. For instance, Insieme [8], Active Harmony [9], POET [10], and AutoTune 
[11] have been used to improve energy efficiency at either compilation time or runtime. 
The OpenTuner [12] application tuning framework includes search ensembles that can 
improve times to locate optimal configurations with compiler and code optimization. 
Most of these tuning tools treat the application and the system as “black-boxes”, where 
bottlenecks due to interactions between the application and system are not considered. 
The Nimrod toolkit [13] is an alternative with more generalized capabilities such as 
search optimization and hybrid HPC resource management. It has a relatively wide user- 
and skill-base with a range of uses in the scientific community. It uses declarative job 
specifications to minimize barriers for non-programmers. We improve Nimrod by 
providing a bottleneck-centric tuning method for optimizing energy efficiency for 
parallel applications. In addition, we show how to use this tool to identify bottlenecks in 
an application on a target platform, and to apply appropriate optimization policies 
according to transitions in energy consumption patterns. 
3. Tuning Framework and Methodology 
In this section we propose a method and automation tools to investigate the impact of 
parallel application workload features on performance and energy efficiency. We 
examine strategies to manage large optimization search spaces and statistical approaches 
to validate experimental measurements. 
3.1. Overview of the Tools 
Figure 1 (a) provides an architecture overview of our tuning framework. The framework 
uses Nimrod/O to orchestrate many energy experiment tasks on Cray XC series 
systems. In particular, the framework supports CrayPAT (Cray Performance Analysis 
Tool) for instrumenting a program, and CrayPM (Cray Power Management) for 
measuring power. These components are integrated with Nimrod/O using a pluggable 
architecture that can work on other platforms with simple extensions. To facilitate 
empirical energy efficiency studies, this framework supports two advanced features: 1) 
measurement confidence intervals; and 2) application configuration tuning through 
automatic exploration of large optimization search spaces. 
 
Figure 1. (a) Architecture Overview, (b) Method Workflow 
[Figure 1 labels. (a): Experiment Configuration and Tuning Policy; Nimrod/O Search/Optimization; Tools Integration; Build Adapter; Scheduler Adapter; Parameters Configuration; Objectives Parser; Cray Compiler; CrayPAT Instrumentation; ALPS Scheduler; HWPC Measurement; CrayPM Measurement; Cray XC40 Supercomputer. (b): Bottleneck Identification (Application Parameters Sweep, Platform Parameters Sweep); Local Optimization (Application Parameters Configuration, Platform Parameters Tuning).]
Measuring power on Cray XC Series. The Cray XC series system has node-level 
sensors to measure temperature, current, and voltage. Cray power management counters 
(pm_counters) provide real-time power and energy measurements, which are updated at 
a frequency of 10 Hz. CrayPAT [14] is a performance analysis tool that uses performance 
counters, including pm_counters and hardware performance counters (HWPC), to 
evaluate program behavior. It instruments the program, collects the specified counters at 
runtime, and reports the collected counters. CrayPAT also supports Intel Running 
Average Power Limit (RAPL) [15] counters. 
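To illustrate what the 10 Hz update rate implies for short runs, the following minimal sketch integrates evenly spaced power samples into energy. The function name, the sample values, and the use of the trapezoidal rule are assumptions made for illustration; pm_counters also report energy directly, so this is not a description of how the framework obtains its energy figures.

```python
SAMPLE_HZ = 10  # pm_counters update rate stated in the text


def energy_joules(power_watts, sample_hz=SAMPLE_HZ):
    """Trapezoidal integration of evenly spaced power samples (W) into energy (J)."""
    dt = 1.0 / sample_hz
    return sum((a + b) / 2.0 * dt for a, b in zip(power_watts, power_watts[1:]))


# Hypothetical samples: one second at a steady 200 W yields 11 samples at 10 Hz.
samples = [200.0] * 11
print(energy_joules(samples))
```

A run that lasts only a few sample intervals contributes very few terms to this sum, which is one reason to keep run times well above the sample interval.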
Orchestrating optimization tasks with Nimrod/O. Energy experiments normally use 
empirical research methods with full factorial design. Integration of the Nimrod/O tools, 
as Figure 1 (a) shows, allows users to orchestrate the following tasks: program 
compilation; instrumentation with CrayPAT; setting of application parameters; resource 
reservations using ALPS; program execution including energy and performance 
collection; and parsing of CrayPAT reports. Each task needs a different set of input 
parameters, such as thread placement policies for execution. Experiment outputs, or 
objectives, are captured and parsed from various measurement sources as the experiment 
runs. The captured objectives include performance metrics such as MFlops/s, execution 
time, cache miss rates, CPU stall cycles, and power and energy usage. 
Validating energy measurement. Performance and power measurements are prone to 
experimental error and noise due to the non-deterministic nature of systems. Physical 
measurements whose errors arise from a range of random factors are 
often normally distributed. Consequently, we assume power and performance 
measurements are normally distributed when analyzing confidence levels. We use 
normality tests to validate this assumption. We use the t-distribution to analyze 
measurement confidence intervals [16] and the Q-Q (quantile-quantile) plot to compare 
a measured data distribution with the standard normal distribution visually [17]. 
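As a small sketch of the confidence interval calculation, the code below computes a 95% t-distribution interval for the mean of a handful of repeated measurements. The function name and the sample readings are hypothetical; the critical values are standard two-sided 95% values from a Student's t table, hard-coded so that the example needs only the standard library.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def mean_ci95(samples):
    """Return (mean, half_width) of the 95% t-distribution confidence interval."""
    n = len(samples)
    t = T_CRIT_95[n - 1]
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    return statistics.mean(samples), t * sem


# Hypothetical energy readings (J) from five repeated runs.
mean, half = mean_ci95([101.0, 99.0, 100.0, 102.0, 98.0])
print(f"{mean:.1f} J +/- {half:.2f} J")
```

With only five samples the critical value (2.776) is noticeably larger than the 1.96 of the normal distribution, which is why the t-distribution is the more cautious choice for small sample counts.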
Exploring the optimization space efficiently. Typically, an energy experiment must 
handle many parameters, which form a complex search space. Table 1 lists the input 
parameters and associated ranges that we use in our case study. These parameters 
generate a full factorial design with 348,480 combinations. 
Table 1. Experimental Parameters for the Case Study 
Parameter               | Component | Configuration               | Levels
Compute Nodes           | Scheduler | 1                           | 1
CPU Frequency           | Scheduler | 1.2 to 2.2 GHz              | 11
Thread Placement Policy | Scheduler | Compact, Scatter            | 2
Grid Size               | Workload  | 500 to 90,000 (for Stencil) | 180
Iterations              | Workload  | 50, 5,000 (for Stencil)     | 2
Thread Count            | Workload  | 1 to 44                     | 22
Counters                | Workload  | Performance, Energy         | 2
Nimrod/O automates exploration for the optimal combinations, and allows users to use 
built-in optimization algorithms such as simplex [18] and PICS [19] to identify optimal 
configurations. Users can also provide optimization algorithms as Nimrod/O plugins. 
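The size of the full factorial design quoted for Table 1 can be checked by multiplying the number of levels of each parameter. The dictionary keys below are illustrative names only; the level counts are copied from Table 1.

```python
import math

# Levels per parameter, copied from Table 1.
levels = {
    "compute_nodes": 1,
    "cpu_frequency": 11,
    "thread_placement": 2,
    "grid_size": 180,
    "iterations": 2,
    "thread_count": 22,
    "counters": 2,
}

full_factorial = math.prod(levels.values())
print(full_factorial)  # 348480
```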
Our framework allows users to choose different tuning policies for optimizing 
energy efficiency. Tuning policies include optimization objectives and priorities for 
optimization, such as energy efficiency ahead of performance. When optimizing multiple 
objectives, the optimal parameter settings are selected from the dominant Pareto set [20]. 
3.2. Tuning Methodology 
We propose a high-level tuning methodology that consists of two groups of activities: 1) 
Bottleneck Identification, and 2) Local Optimization, as shown in Figure 1 (b). This 
methodology helps users to identify important energy consumption patterns of the 
program and platform, and to select appropriate tuning policies. 
Bottleneck Identification explores how performance changes across key application 
and platform parameters. Bottlenecks, either inherent in the computer architecture or 
existing within the interactions between software and hardware, typically cause 
performance transition points. We can identify these transitions using application 
parameter sweeps, such as grid size in the Stencil case study, and platform parameters, 
such as CPU frequency and thread count. Thus, these sweeps provide a view on 
parameter ranges that result in degraded performance. These transitions mark the edge 
conditions for Local Optimization. 
Typically, parameter tuning sensitivity changes with each transition. Sensitivity to 
the same level of parameter adjustment is comparatively consistent between transitions. 
Users investigate the local energy consumption pattern and tailor an appropriate 
optimization strategy. This investigation may include selection of transition-specific 
starting points for Nimrod/O optimization algorithms, for example. In the next section, 
we provide case studies to illustrate our tuning methodology. 
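One simple way to flag candidate transitions in a parameter sweep is to look for points where performance falls sharply relative to the previous point. The sketch below is only a heuristic under that assumption: the function name, the 20% threshold, and the sweep data are hypothetical and are not taken from the framework.

```python
def find_transitions(sizes, perf, drop=0.2):
    """Return sweep sizes where performance falls by more than `drop`
    (as a fraction) relative to the previous sweep point."""
    hits = []
    for i in range(1, len(perf)):
        if perf[i] < perf[i - 1] * (1.0 - drop):
            hits.append(sizes[i])
    return hits


# Hypothetical grid-size sweep (GFlops/s), loosely shaped like a cache transition.
sizes = [1000, 2000, 3000, 5000, 10000, 20000]
perf = [98.0, 100.0, 95.0, 60.0, 57.0, 56.0]
print(find_transitions(sizes, perf))
```

A threshold on consecutive points catches abrupt transitions; a gradual decline would need a different test, such as a change in slope over a window.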
4. Case Study 
We investigate the PRK Stencil, Sparse, Transpose and Random kernels in this section. 
4.1. Experimental Setup 
Using the framework that we described in Section 3, we conduct experiments on a Cray 
XC system equipped as Table 2 shows. Each run uses a single, exclusively allocated 44-
core node. We monitor energy and power consumption for the entire node with 
pm_counters. We use RAPL counters to monitor DRAM power and energy. We collect 
performance and energy counter events in separate runs to avoid counter multiplexing, 
which reduces application perturbation and improves measurement accuracy. To assess 
energy efficiency, we use operations per Joule (for example, Flops/J, Bytes/J, Updates/J). 
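The operations-per-Joule metric can be obtained by dividing a processing rate by the average power, since a Watt is one Joule per second. The function name and the numbers below are hypothetical and serve only to show the unit conversion.

```python
def ops_per_joule(ops_per_second, watts):
    """Energy efficiency: operations per Joule = (operations/s) / (J/s)."""
    return ops_per_second / watts


# Hypothetical run: 90 GFlops/s at an average node power of 180 W.
efficiency = ops_per_joule(90e9, 180.0)
print(efficiency / 1e6, "MFlops/J")
```

Equivalently, the metric is the total operation count divided by the total energy of the run.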
Table 2. System Specifications 
Component              | Specification
CPU model              | Intel Xeon CPU E5-2699 v4 (Broadwell)
CPU clock              | 2.2 GHz
Sockets (NUMA Nodes)   | 2 per compute node
Cores                  | 22 per socket
Last Level Cache (LLC) | 55 MB per socket
Main memory (DRAM)     | 64 GB per socket
Memory bandwidth       | 76.8 GB/s max
The kernels listed in Table 3 provide configurable OpenMP workloads for our 
performance and energy efficiency research on memory-bound application workloads. 
We set the iteration counts so that each run lasts at least 10 times the measurement sample interval. 
Each kernel uses double precision floating point values. We derive the listed problem 
size limits from the LLC and DRAM capacities of our system. 
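The following sketch shows one way such limits could be derived from the capacities in Table 2. It assumes decimal units (1 MB = 10^6 bytes), two resident square arrays of doubles for the matrix kernels, and 8-byte table entries for Random; these are our assumptions, and the function names are hypothetical. Under them, the results appear consistent with the Stencil, Transpose, and Random limits in Table 3. Sparse is omitted because its storage footprint is not described here.

```python
import math

BYTES_PER_DOUBLE = 8
LLC_BYTES = 2 * 55 * 10**6   # 55 MB per socket, two sockets (Table 2)
DRAM_BYTES = 2 * 64 * 10**9  # 64 GB per socket, two sockets (Table 2)


def max_square_order(capacity_bytes, arrays=2):
    """Largest n such that `arrays` n-by-n double arrays fit in capacity_bytes.
    The default of two arrays (input and output) is an assumption."""
    return math.isqrt(capacity_bytes // (arrays * BYTES_PER_DOUBLE))


def max_table_entries(capacity_bytes):
    """Largest number of 8-byte table entries that fit in capacity_bytes."""
    return capacity_bytes // BYTES_PER_DOUBLE


print(max_square_order(LLC_BYTES), max_square_order(DRAM_BYTES))
print(max_table_entries(LLC_BYTES), max_table_entries(DRAM_BYTES))
```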
Table 3. Kernels Summary 
Name      | Description                                                       | Configuration                                          | Size Limits (Cache / Mem)
Stencil   | Explicit stencil operation on a 2D square discretization grid     | Stencil radius: 2; Iterations: Cache 5,000 / Mem 50    | 2.6k / 89k
Sparse    | Canonically indexed, sparse-matrix by dense-vector product        | Difference stencil radius: 2; Iterations: 1,000        | 0.8k / 28k
Transpose | Dense matrix transposition ($C = A^{T}$)                          | Blocking/tiling: disabled; Iterations: 20              | 2.6k / 89k
Random    | Random updates to a table, stressing memory bandwidth and latency | Update ratio, vector length: number of OpenMP threads  | 13M / 16G
4.2. Bottleneck Identification 
We conduct the initial grid size sweep for each kernel with 34 threads (17 per socket) and a 
CPU frequency of 1.8 GHz. These settings are about 75% of the maximum. Measurement 
resolution limits the sweep range at the lower end, while memory size limits it at the upper 
end. Figure 2 shows performance of each kernel with numbered performance transitions 
that identify the system bottlenecks. 
We identify Stencil transitions (1, 2) at grid sizes of around 5k and 30k in Figure 2 (a). Performance reaches 
100 GFlops/s before the first transition but drops sharply as grid size increases. The 
processing rate remains approximately 56 GFlops/s until the second transition at grid 
size 30k. From this point, the performance curve declines steadily. 
Sparse also reaches its highest performance of 33 GFlops/s before the first transition 
(3). Figure 2 (b) shows that the processing rate remains relatively flat at around 13 
GFlops/s until the second transition (4) at grid size 8k, after which performance declines. 
The Stencil and Sparse transitions coincide with LLC saturation and increasing CPU 
stalls, as we show in the next section. 
Figure 2 (c) shows Transpose performance transitions (5, 6) at around matrix order 
500 and 3,000. These transitions coincide with lower level cache and LLC saturation. 
Figure 2 (d) shows Random performance transitions (7, 8) at around table sizes of 500M 
and 4G. These transitions are due to CPU stalls leveling off and increasing LLC 
saturation. We focus mostly on Stencil for the rest of the case study, touching briefly on 
aspects of the other kernels, in the interests of space. 
  
  
Figure 2. Problem Size Sweep for (a) Stencil, (b) Sparse, (c) Transpose, and (d) Random. Markers label the performance transitions: (1, 2) in (a), (3, 4) in (b), (5, 6) in (c), and (7, 8) in (d). 
4.3. Local Optimization 
We investigate performance and energy efficiency optimization around the identified 
bottlenecks in this section, including cache and stalls transitions. 
4.3.1. Cache Transition 
Figure 3 (a) and (b) provide a close-up view of the LLC saturation transition for Stencil. 
As the processing rate drops, we observe the combined effects of increasing LLC miss 
rate and power consumption. Figure 3 (a) shows that Stencil power consumption 
increases substantially from 160W to 260W, with increased DRAM power consumption 
accounting for approximately 60% of this increase. This change confirms that moving 
data across the deep memory hierarchy consumes a significant amount of power. 
Sparse (not shown) has a similar pattern. However, for Transpose and Random (not 
shown), the ramp up to maximum DRAM power occurs at lower problem sizes. Peak 
DRAM power coincides with LLC miss rates above 20-40%. 
Figure 3 (b) shows DRAM bandwidth (MB/s) for Stencil ramping up towards the 
CPU limit, indicating that memory bandwidth becomes a constraint at this transition. The 
NUMA curve shows that NUMA nodes are directing 100% of in-memory processing to 
their local DRAM bank, so NUMA misses are not a significant factor. We observe 
similar behavior for Sparse. However, NUMA misses are a factor for Transpose and Random, 
with their NUMA locality leveling off at 55% and 50% local, respectively. Memory bandwidth 
utilization for Transpose and Random is accordingly lower. 
4.3.2. Stalls Transition 
Figure 4 provides a breakdown of Stencil and Sparse stall types along with the 
performance curve from Figure 2. For Stencil, the second transition at 50k may not be 
predicted by model-based methods, such as the ECM or Roofline models. It corresponds 
to an increase in CPU stall cycles. Reorder Buffer (ROB) and Store Buffer (SB) stalls are 
both flat for Stencil, but Reservation Station (RS) stalls increase as performance tapers 
off. RS stalls occur when entries are not available in the instruction pipeline. RS stalls 
also dominate Transpose (not shown). However, Figure 4 (b) shows 
that ROB stalls dominate Sparse and Random (not shown). ROB stalls occur when the 
CPU front-end allocates instructions faster than the execution engine can retire them. 
  
Figure 3. Cache Analysis – Stencil (a) DRAM Power, (b) DRAM Bandwidth, NUMA Locality and LLC Miss Rate 
Figure 4. CPU Stalls Analysis – (a) Stencil and (b) Sparse 
4.3.3. Thread Scaling 
Thread scaling first compares the Scatter and Compact thread placement policies. 
Compact places threads in one socket before moving to another, while Scatter utilizes 
both sockets uniformly. We confirm Scatter performance is equivalent or superior across 
the problem size range, and as such, use Scatter placement in the rest of this section. 
Figure 5 shows Stencil OpenMP thread scaling and LLC miss rates for 24, 34 and 
44 threads at 1.8 GHz. All kernels exhibit superior thread/core scaling for in-cache 
operation compared to in-memory operation. In-cache operation also exhibits good 
energy efficiency improvement for all kernels with core scaling. LLC effects drive a 
rapid cache transition at small problem sizes for Stencil and Sparse. For Transpose and 
Random, LLC effects are gradual, extending across the problem size range as miss rates 
increase. 
 
Figure 5. Stencil OpenMP Thread Scaling – (a) Performance, (b) Energy Efficiency, (c) LLC Miss Rate 
The error bars in Figure 5 show the 95% t-distribution confidence interval for the mean 
of five measurement samples consisting of 5,000 iterations. 
Table 4 provides comparison values for performance and energy efficiency at a 
problem size within each of the identified bottleneck regimes. The Stencil DRAM row 
shows that in-memory (DRAM) Stencil performance is 58% of the in-cache (LLC) 
performance, and in-memory energy efficiency is 37% of in-cache energy efficiency. 
Table 4. Thread Scaling at 1.8 GHz 
Kernel (Unit)      | Bottleneck | Size  | Perf Max (Unit/s) | Perf Percent | Perf Threads | EE Max (Unit/J) | EE Percent | EE Threads
Stencil (Flops)    | LLC        | 2k    | 99G               | 100          | 34           | 551M            | 100        | 34
                   | DRAM       | 5k    | 57G               | 58           | 34           | 205M            | 37         | 24
                   | Stalls     | 72k   | 49G               | 49           | 24           | 199M            | 36         | 24
Sparse (Flops)     | LLC        | 512   | 33G               | 100          | 34           | 237M            | 100        | 34
                   | DRAM       | 2k    | 13G               | 39           | 34           | 52M             | 22         | 24
                   | Stalls     | 16.4k | 13G               | 39           | 24           | 47M             | 20         | 24
Transpose (Bytes)  | LLC        | 15k   | 33G               | 100          | 44           | 128M            | 100        | 44
                   | DRAM       | 60k   | 7.7G              | 23           | 34           | 29M             | 23         | 24
                   | Stalls     | 60k   | 7.7G              | 23           | 34           | 29M             | 23         | 24
Random (Updates)   | Stalls     | 540M  | 700M              | 100          | 34           | 2.9M            | 100        | 24
                   | LLC        | 1.1G  | 660M              | 94           | 34           | 2.7M            | 93         | 34
                   | DRAM       | 4.3G  | 550M              | 79           | 44           | 2.0M            | 69         | 44
Figure 6 shows Stencil OpenMP thread scaling and CPU stall rate for 24, 34 and 44 threads 
at 1.8 GHz. Stall rates impact larger matrix orders, as Figure 6 (c) shows. As we add 
more cores, the Stalls/MFlop ramp-up commences at lower matrix orders. This correlates 
closely with the drop-off in both performance and energy efficiency. 
Random stall rates peak at smaller problem sizes, which correlates with poor energy 
efficiency scaling. Table 4 also provides stalls regime comparisons for each kernel. It 
shows that LLC Random performance is 94% of stalls regime performance, and LLC 
energy efficiency is 93% of stalls regime energy efficiency. We observe CPU stall effects 
dominating at large problem sizes for Stencil and at small problem sizes for Random. 
 
Figure 6. Stencil OpenMP Thread Scaling – (a) Performance, (b) Energy Efficiency, (c) Stall Rate 
4.3.4. Frequency Scaling 
Figure 7 shows CPU frequency scaling and LLC miss rates for 1.4, 1.8 and 2.2 GHz with 
34 OpenMP threads for Stencil. All kernels except Random exhibit superior frequency 
scaling for in-cache operation. In-cache operation of Stencil, Sparse and Transpose also 
exhibits good energy efficiency improvement with frequency scaling. Random energy 
efficiency scaling is poor for all problem sizes. 
Table 5 provides comparison values for performance and energy efficiency at a 
problem size within each of the identified bottleneck regimes. It shows that in-memory 
(DRAM) Stencil performance is 46% of the in-cache (LLC) performance, and 
in-memory energy efficiency is 39% of in-cache energy efficiency. 
 
Figure 7. Stencil CPU Frequency Scaling – (a) Performance, (b) Energy Efficiency, (c) LLC Miss Rate 
Table 5. Frequency Scaling at 34 OpenMP Threads 
Kernel (Unit)      | Bottleneck | Size  | Perf Max (Unit/s) | Perf Percent | Perf Frequency | EE Max (Unit/J) | EE Percent | EE Frequency
Stencil (Flops)    | LLC        | 2k    | 125G              | 100          | 2.2            | 551M            | 100        | 1.8
                   | DRAM       | 5k    | 58G               | 46           | 2.2            | 214M            | 39         | 1.4
                   | Stalls     | 72k   | 43G               | 34           | 2.2            | 173M            | 31         | 1.4
Sparse (Flops)     | LLC        | 512   | 42G               | 100          | 2.2            | 260M            | 100        | 2.2
                   | DRAM       | 2k    | 13G               | 31           | 2.2            | 52M             | 20         | 1.4
                   | Stalls     | 16.4k | 12G               | 29           | 2.2            | 45M             | 17         | 1.4
Transpose (Bytes)  | LLC        | 15k   | 29G               | 100          | 2.2            | 111M            | 100        | 1.8
                   | DRAM       | 60k   | 7.9G              | 27           | 2.2            | 32M             | 29         | 1.4
                   | Stalls     | 60k   | 7.9G              | 27           | 2.2            | 32M             | 29         | 1.4
Random (Updates)   | Stalls     | 0.54G | 690M              | 100          | 2.2            | 2.9M            | 100        | 1.4
                   | LLC        | 1.1G  | 660M              | 96           | 1.4            | 3.0M            | 103        | 1.4
                   | DRAM       | 4.3G  | 510M              | 74           | 2.2            | 2.1M            | 72         | 1.4
Figure 8 shows Stencil CPU frequency scaling and CPU stall rate for 1.4, 1.8 and 2.2 
GHz with 34 OpenMP threads. Stall rates impact Stencil across the matrix order range, 
jumping with each increase in frequency. The stall rate increase correlates closely with 
the observed drop in energy efficiency. Stall rates have a similar effect on Random 
energy efficiency that is most visible at smaller table sizes. 
  
Figure 8. Stencil CPU Frequency Scaling – (a) Performance, (b) Energy Efficiency, (c) Stall Rate 
4.4. Tuning Strategy 
Tuning policies should match energy consumption patterns with identified bottlenecks. 
In-cache operation shows compute-intensive features and favors maximum CPU 
frequency and higher core counts to achieve peak energy efficiency and performance. As CPU 
stalls increase, DVFS tuning becomes important. We summarize a range of tuning 
opportunities for each kernel in Table 4 and Table 5. The boxed values in the Threads 
and Frequency columns highlight bottleneck-related opportunities, where adjusting 
DVFS or core counts improves performance or energy efficiency. When the optimal 
configurations for energy efficiency and performance diverge, we identify a dominant 
set of Pareto-optimal performance and energy efficiency points. 
 
Figure 9. Stencil Pareto Set – (a) Pareto Front (b) MFlops/s, (c) MFlops/J 
Figure 9 (a) and Figure 10 show optimal trade-off points between performance and 
energy efficiency along the Pareto front for each kernel. Points off the front are not 
Pareto-optimal: for each of them, some point on the front is at least as good in both 
objectives and strictly better in one. The surfaces in Figure 9 (b) and (c) represent performance 
and energy efficiency across the CPU frequency and thread count search space. 
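The notion of a Pareto front can be made concrete with a short dominance filter. This is a generic sketch, not the Nimrod/O implementation; the function names are hypothetical, and the three candidate points are the Stencil values reported in Table 6 (baseline and the two ends of the front).

```python
def pareto_front(points):
    """Return the points not dominated by any other point.
    Each point is (performance, energy_efficiency); higher is better for both."""
    def dominates(p, q):
        # p dominates q if it is at least as good in both objectives and differs from q.
        return p[0] >= q[0] and p[1] >= q[1] and p != q

    return [q for q in points if not any(dominates(p, q) for p in points)]


# Stencil values from Table 6: (GFlops/s, MFlops/J).
candidates = [(38, 132), (45, 236), (51, 220)]
print(pareto_front(candidates))
```

The baseline point (38, 132) is dominated by both other points, while the remaining two trade performance against energy efficiency and therefore both stay on the front.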
 
Figure 10. Pareto Front for (a) Sparse, (b) Transpose, and (c) Random 
Table 6 provides a summary of the kernel Pareto sets. It shows the maximum Stencil 
performance of 51 GFlops/s is achieved at 2.2 GHz and the maximum energy efficiency 
of 236 MFlops/J is achieved at 1.3 GHz. The optimal OpenMP thread count is 20 and 28 
respectively. Tuning the thread count and CPU frequency along the Pareto front provides 
18-34% performance improvement and 67-79% energy efficiency improvement 
compared to simply maximizing threads and CPU frequency at 44 and 2.2 GHz. 
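The improvement ranges quoted for Stencil can be reproduced from the Table 6 values as percentage changes relative to the baseline. The helper name is hypothetical, and rounding to the nearest integer is an assumption about how the reported percentages were obtained.

```python
def pct_gain(value, baseline):
    """Percentage change of value relative to baseline, rounded to an integer."""
    return round((value / baseline - 1.0) * 100.0)


# Stencil row of Table 6: baseline at 44 threads and 2.2 GHz, then the two
# ends of the Pareto front.
perf_base, ee_base = 38, 132  # GFlops/s, MFlops/J
perf_min, ee_max = 45, 236    # 28 threads, 1.3 GHz
perf_max, ee_min = 51, 220    # 20 threads, 2.2 GHz

perf_range = (pct_gain(perf_min, perf_base), pct_gain(perf_max, perf_base))
ee_range = (pct_gain(ee_min, ee_base), pct_gain(ee_max, ee_base))
print(perf_range, ee_range)
```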
Our method structures the parameter search space to make the number of 
combinations that need to be explored manageable. We reduce the 348,480 combinations 
identified in Table 1 to 562. Searches include sweeps for initial problem size and thread 
placement, local thread and frequency scaling, and the Pareto front search. The tuning 
policy must also consider measurement error margins. Table 6 shows a performance tuning 
range of only ±1% for Random. If the performance measurement error margin exceeds this 
range, ±5% for example, the policy can automatically prefer energy efficiency. 
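A margin-aware rule of this kind could be sketched as follows. The function name, the return labels, and the comparison rule are hypothetical; the performance ranges are taken from Table 6 and the 5% margin is the example value from the text.

```python
def preferred_objective(perf_range_pct, error_margin_pct):
    """Prefer energy efficiency when every achievable performance change
    lies within the measurement error margin; otherwise keep both objectives."""
    lo, hi = perf_range_pct
    if max(abs(lo), abs(hi)) <= error_margin_pct:
        return "energy_efficiency"
    return "pareto_tradeoff"


# Random: performance range of -1% to 1% (Table 6), example 5% margin.
print(preferred_objective((-1, 1), 5))
# Stencil: performance range of 18% to 34% (Table 6), same margin.
print(preferred_objective((18, 34), 5))
```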
Table 6. Tuning Performance and Energy Efficiency along the Pareto Front 
Kernel    | Size | Perf base (1) | EE base (1) | Core (2) | Freq (2) | Perf min (2) | EE max (2) | Core (3) | Freq (3) | Perf max (3) | EE min (3) | Perf % Range | EE % Range
Stencil   | 70k  | 38            | 132         | 28       | 1.3      | 45           | 236        | 20       | 2.2      | 51           | 220        | 18 to 34     | 67 to 79
Sparse    | 16k  | 12            | 43          | 20       | 1.2      | 12           | 77         | 24       | 2.0      | 13           | 63         | 0 to 8       | 47 to 79
Transpose | 20k  | 32            | 99          | 44       | 1.8      | 30           | 102        | 44       | 2.2      | 32           | 99         | -6 to 0      | 0 to 3
Random    | 1G   | 661           | 2.2         | 32       | 1.2      | 652          | 2.8        | 40       | 1.6      | 668          | 2.5        | -1 to 1      | 14 to 27
Notes: (1) Performance and energy efficiency at maximum threads and CPU frequency (44 threads, 2.2 GHz). 
(2) Values at bottom RHS of Pareto Front (i.e., min perf, max EE in Figure 9 (a) and Figure 10). 
(3) Values at top LHS of Pareto Front (i.e., max perf, min EE in Figure 9 (a) and Figure 10). 
Our future work includes adopting key parameters identified in this work in an improved 
performance and energy efficiency model for workloads and systems. We expect that 
system parameters such as cache and memory capacity, memory bandwidth and locality, 
and CPU pipeline capacity will be important. Workload classification metrics will 
include size, and computation and memory intensity for predicting cache miss rates and 
CPU stall cycles. Further, we expect extending our work to multi-node systems will 
provide additional tuning opportunities as we add nodes to scale up core, cache, and 
memory capacity to suit the workload. 
5. Conclusions 
We investigated the divergence of energy consumption patterns while saturating system 
resources, such as memory bandwidth, for major kernels. Our approach finds appropriate 
tuning policies for various kernels at different scales. In particular, as resource contention 
rises with memory access, additional cores and higher CPU frequencies can still provide 
small performance improvements at the cost of significant drops in energy efficiency. 
We confirmed that sensitivity to system parameter adjustment varies across transitions. 
Thus, we can optimize using a limited range of configurations derived from appropriate 
starting points. Our case study of various benchmarking kernels shows that this method 
can make the investigation of energy efficiency at different scales more efficient. We 
demonstrate up to 34% performance improvement and 79% energy efficiency 
improvement, while reducing the parameter search space from 348,480 to 562 combinations. 
Our methodology can also detect some software-hardware interaction bottlenecks that 
many model-based methods can miss. 
References 
[1] C. Jin, B. R. de Supinski, D. Abramson, H. Poxon, L. DeRose, M. N. Dinh, M. Endrei, and E. R. Jessup, 
"A survey on software methods to improve the energy efficiency of parallel computing," International 
Journal of High Performance Computing Applications, 2016. 
[2] O. Sarood, A. Langer, A. Gupta, and L. Kale, "Maximizing throughput of overprovisioned HPC data 
centers under a strict power budget," in Proceedings of the International Conference for High 
Performance Computing, Networking, Storage and Analysis, 2014, pp. 807-818: IEEE Press. 
[3] T. Patki, D. K. Lowenthal, B. Rountree, M. Schulz, and B. R. De Supinski, "Exploring hardware 
overprovisioning in power-constrained, high performance computing," in Proceedings of the 27th 
international ACM conference on supercomputing, 2013, pp. 173-182: ACM. 
[4] J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc, "A roofline model of energy," in Parallel & Distributed 
Processing (IPDPS), 2013 IEEE 27th International Symposium on, 2013, pp. 661-672: IEEE. 
[5] J. Hofmann and D. Fey, "An ECM-based energy-efficiency optimization approach for bandwidth-limited 
streaming kernels on recent Intel Xeon processors," in Proceedings of the 4th International Workshop on 
Energy Efficient Supercomputing, 2016, pp. 31-38: IEEE Press. 
[6] R. F. Van der Wijngaart and T. G. Mattson, "The Parallel Research Kernels," in HPEC, 2014, pp. 1-6. 
[7] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. 
Plishker, J. Shalf, and S. W. Williams, "The landscape of parallel computing research: A view from 
Berkeley," Technical Report UCB/EECS-2006-183, EECS Department, University of California, 
Berkeley, 2006. 
[8] P. Gschwandtner, J. J. Durillo, and T. Fahringer, "Multi-objective auto-tuning with insieme: Optimization 
and trade-off analysis for time, energy and resource usage," in Euro-Par 2014 Parallel Processing: 
Springer, 2014, pp. 87-98. 
[9] C. Ţăpuş, I.-H. Chung, and J. K. Hollingsworth, "Active harmony: Towards automated performance 
tuning," in Proceedings of the 2002 ACM/IEEE conference on Supercomputing, 2002, pp. 1-11: IEEE 
Computer Society Press. 
[10] S. F. Rahman, J. Guo, and Q. Yi, "Automated empirical tuning of scientific codes for performance and 
power consumption," in Proceedings of the 6th International Conference on High Performance and 
Embedded Architectures and Compilers, 2011, pp. 107-116: ACM. 
[11] R. Miceli, G. Civario, A. Sikora, E. César, M. Gerndt, H. Haitof, C. Navarrete, S. Benkner, M. Sandrieser, 
and L. Morin, "Autotune: A plugin-driven approach to the automatic tuning of parallel applications," in 
Applied Parallel and Scientific Computing: Springer, 2012, pp. 328-342. 
[12] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O'Reilly, and S. 
Amarasinghe, "Opentuner: An extensible framework for program autotuning," in Proceedings of the 23rd 
international conference on Parallel architectures and compilation, 2014, pp. 303-316: ACM. 
[13] D. Abramson, R. Sosic, J. Giddy, and B. Hall, "Nimrod: a tool for performing parametrised simulations 
using distributed workstations," in High Performance Distributed Computing, 1995., Proceedings of the 
Fourth IEEE International Symposium on, 1995, pp. 112-121: IEEE. 
[14] L. DeRose, B. Homer, D. Johnson, S. Kaufmann, and H. Poxon, "Cray performance analysis tools," in 
Tools for High Performance Computing: Springer, 2008, pp. 191-199. 
[15] Intel Corp, "System Programming Guide, volume 3B-2 of Intel 64 and IA-32 Architectures Software 
Developer’s Manual," 2011. 
[16] S. Patil and D. J. Lilja, "Statistical methods for computer performance evaluation," Wiley 
Interdisciplinary Reviews: Computational Statistics, vol. 4, no. 1, pp. 98-106, 2012. 
[17] A. Loy, L. Follett, and H. Hofmann, "Variations of Q–Q Plots: The Power of Our Eyes!," The American 
Statistician, vol. 70, no. 2, pp. 202-214, 2016. 
[18] D. Abramson, A. Lewis, and T. Peachey, "Nimrod/O: a tool for automatic design optimisation using 
parallel and distributed systems," in Proc. 4th International Conference on Algorithms & Architectures 
for Parallel Processing (ICA3PP 2000), 2000: Citeseer. 
[19] T. Peachey, M. Riley, D. Abramson, and J. Stewart, "A simplex-like search method for bi-objective 
optimization," in EngOpt 2012: 3rd International Conference on Engineering Optimization, 2012, pp. 1-
10: Federal University of Rio de Janeiro. 
[20] T. Kipouros, T. Peachey, D. Abramson, and A. M. Savill, "Enhancing and developing the practical 
optimisation capabilities and intelligence of automatic design software," in 8th AIAA Multi-Disciplinary 
Design Optimization Specialist Conference, Honolulu, Hawaii, 2012, pp. 1-7. 
 
