Current microprocessors include several knobs to modify the hardware behavior in order to improve performance under different workload demands. An impractical and time-consuming offline profiling is needed to evaluate the design space to find the optimal knob configuration. Different knobs are typically configured in a decoupled manner to avoid the time-consuming offline profiling process. This can often lead to underperforming configurations and sometimes to conflicting decisions that jeopardize system power-performance efficiency.
INTRODUCTION
Multicore architecture is the main trend in processor development nowadays. Every new generation of processors increases the number of cores and the number of threads that can run within the same core (i.e. Simultaneous Multithreading or SMT). As a result, processor shared resources experience contention, which might lead to performance degradation. Processors have several hardware knobs to prevent performance degradation by adapting their behavior to workload demands, such as the SMT and DVFS levels, the decode priorities or the data prefetcher settings. These knobs allow the user to tune the hardware to adapt it to workload demands.
Figure 1: Application behavior of the NPB suite under different hardware knob configurations. The Y axis shows speedup with respect to the default hardware configuration. The X axis shows power consumption normalized to the maximum value observed. SMT level is represented by color and the shape corresponds to the data prefetcher configuration.
Multiple policies have been proposed to derive suitable configurations for the hardware knobs, but these policies have always treated them independently of each other [3, 5, 25, 42, 43]. This independent actuation can lead to conflicting decisions that jeopardize system power-performance efficiency [39]. For example, a higher SMT level increases the overall system throughput, but it reduces the effective bandwidth and last-level cache size per thread. As a result, coordinating these decisions with other knobs that also contend for the memory bandwidth, such as the data prefetcher or DVFS, is required to optimize the overall system power-performance efficiency.
To illustrate the need for a coordinated adaptive system, Figure 1 shows the performance and the average power consumption of the NAS Parallel Benchmarks (NPB) [26] suite with different knob configurations: four SMT levels and four levels of aggressiveness for the data prefetcher. Performance is normalized to the default configuration (SMT8 level and default prefetcher setting) and power is normalized to the maximum observed value. The highest SMT level is not always the optimal configuration due to increased last-level cache misses and contention in the execution units. An aggressive configuration for the prefetcher does not imply better performance either, but it usually results in more power consumption. In some cases, running in a low SMT level provides a modest 5% performance improvement, while disabling the prefetcher and running in the highest SMT level provides significant performance benefits (up to 40%) and reduces power consumption. Also, Figure 1 shows that different knob configurations yield a wide range of speedup and power consumption tradeoffs across applications. Furthermore, applications can have different intra-application resource demands, further increasing the variety of best-performing configurations.
An extensive and exhaustive offline profiling is required to discover the best hardware configuration per application. However, given the number of possible hardware configurations, performing an exhaustive profiling of each of them for each application and input data size is a time-consuming process. In addition, since the optimal hardware configuration of an application changes across its execution phases, exploring all the hardware configurations for each application phase becomes unfeasible in a practical amount of time. Thus, we believe that using an adaptive online coordinated management of related hardware knobs is a more robust and less costly approach to performance tuning than exhaustive offline profiling.
In this paper, we propose libPRISM², an interposition library for shared memory parallel programming models that transparently adapts the different hardware knobs available in the architecture. During execution time, libPRISM discovers the best hardware configuration for different fine-grained regions of the application without user intervention and without modifying the original source code of the application.
Overall, the main contributions of this paper are:
• We present a detailed power/performance characterization of a wide set of parallel benchmark suites (NPB [26], SPEC OMP [33], CORAL [9]) on an IBM POWER8 platform. The results show that the best-performing SMT level and data prefetcher configuration differ between applications and between application phases, leading to differences in performance and power of up to 113% and 12% respectively.
• We introduce libPRISM, a library to dynamically manage hardware resources in a transparent way to the user for OpenMP parallel applications without the need to recompile applications or runtimes. libPRISM can be used in different runtimes, with different hardware knobs and it can be easily extended.
• We describe an implementation of an adaptive policy to manage SMT and prefetcher hardware knobs in a coordinated fashion using the libPRISM infrastructure. We demonstrate speedups of up to 220% in execution time (15.4% on average) and up to 13% reduction in power consumption (2.0% on average), without any significant slowdown across the suites when compared to the static default knob configuration.
This paper is organized as follows: Section 2 provides the required background for this work, while Section 3 introduces libPRISM and our adaptive policy. Section 4 describes the experimental setup and Section 5 shows the evaluation of our framework. Next, Section 6 discusses the related work and, finally, Section 7 presents the conclusions of this paper.
BACKGROUND
This section provides the required background about the SMT and data prefetch knobs targeted in this work. The runtime systems for shared memory programming models that we leverage to manage these knobs are also described.
Simultaneous Multithreading
SMT increases the number of executing threads within the same core, which can be very useful to hide memory latency and exploit more instruction level parallelism (ILP). In a processor with SMT capabilities, the processor fetches instructions from different threads and puts them on a shared instruction queue. Then, in the execution stage, all threads share the hardware resources of the core where they run, effectively increasing the overall resource utilization and the system throughput. However, individual thread performance may degrade due to the contention on the shared hardware resources.
² libPRISM source code is available at: https://github.com/criort/libPRISM
Multi-programmed workloads can significantly benefit from SMT capabilities, since the different threads stress different functional units or have different memory patterns. Therefore, the usage of the hardware resources is higher [15, 18, 32, 36]. In contrast, parallel applications that follow a traditional fork-join parallelization scheme execute the very same code on the different threads. Consequently, all threads compete for the same hardware resources, leading to higher contention on shared hardware resources, and a higher SMT level can even degrade overall system performance [10, 21].
Hardware Data Prefetching
Hardware data prefetching reduces memory latency by bringing data to the processor's cache before it is needed. This reduces stalls due to memory accesses. Almost all current processors include a hardware data prefetch engine as it is a powerful technique to reduce memory latency, which is one of the main bottlenecks for performance.
Applications with predictable (e.g. regular) memory access patterns and spatial locality significantly benefit from data prefetching. Other workloads with unpredictable (e.g. random) memory patterns do not benefit at all from the prefetcher, and it can even degrade performance. Useless prefetches waste memory bandwidth (increase in power consumption) and pollute the cache hierarchy (decrease in performance).
The data prefetching algorithm is usually hardcoded in the processor design and it is not possible to modify it. Vendors often add instructions to let the programmer or the compiler do software prefetching; this adds a step in the optimization process of a code. Some processors allow the user to configure the data prefetcher to match the workload characteristics by selecting the number of lines to bring ahead of time, whether to prefetch data on load and/or store instructions, etc. A correctly configured data prefetcher can speed up the execution time, save memory bandwidth and reduce power consumption [24, 25].
In this work, we propose an automatic management of the data prefetcher that is transparent to the user while coordinating it with the SMT knob. This needs to be done in a coordinated fashion because the number of threads impacts the data prefetcher, and the data prefetcher configuration can determine the optimal number of threads to be used. This will be seen in detail in Section 5.
Runtime Systems and Shared Memory Programming Models
With the increasing number of cores, orchestrating the parallel execution of an application is becoming more difficult. The usage of a runtime system to manage this complexity is a common practice to exploit the parallelism of multi-core systems. Runtimes are used as an abstraction layer in the software stack to parallelize codes. Usually, they need compiler support to translate from keywords to real code that will be executed: the programmer just needs to use a specific keyword or directive to spawn all the desired threads, share the data among them, or synchronize them. This method reduces the burden of developing parallel applications and drives the design of future architectures [4, 16, 30, 38].
Figure 2: libPRISM execution stack and workflow.
major vendors. OpenMP is based on directives added by the developer to a sequential source code. Then, these directives are translated to parallel code at compile time. Directives delimit a part of the source code that is executed in parallel. We refer to this code executed in parallel as a parallel region. Depending on the specific runtime implementation, at the beginning of a parallel region, the runtime creates or activates the requested number of threads and executes the parallel code. At the end of the parallel region, the runtime destroys the created threads or deactivates them.
We take advantage of these runtimes in order to automatically manage hardware knobs for several reasons: applications have different phases already annotated, which provides intra-application granularity. This can be exploited to adapt the hardware per phase instead of per application. Those phases usually behave regularly over time and we can learn from their previous executions. Finally, it is possible to use library interposition with the runtime to capture and reconfigure the hardware at the beginning and at the end of each phase of the application.
In the next section, we introduce libPRISM, which leverages these properties to adapt the hardware knobs in runtime systems for programming models based on shared memory systems.
LIBPRISM
libPRISM is an interposition library that reconfigures the available hardware knobs at execution time. Its decisions are based on the custom-defined policies that are implemented. Policies can leverage the information from the different sensors available, such as performance counters, temperature, power, etc., to drive the different knobs present in the system. Each time the application enters a parallel region, libPRISM asks the policy for the required knob configurations and the sensors to be tracked during the execution of the parallel region. Then, libPRISM sets the different hardware sensors and knobs accordingly. If there are multiple parallel regions, they will be executed with their respective knob configuration according to the implemented policy.
To achieve this goal without recompiling the application or the runtime system, libPRISM is located on top of the runtime system, as shown in Figure 2. When a parallel region starts, the application calls our library instead of the runtime system. Then, libPRISM takes care of communicating configuration changes to the runtime system and to the underlying hardware. The software stack shown in Figure 2 allows libPRISM to: (1) communicate changes to the runtime system; (2) gather data from the runtime and the hardware; and (3) avoid the need to recompile the application or the runtime itself. In this scenario, the application executes as usual without being aware that libPRISM is dynamically adapting the hardware resources based on a custom-defined policy.
libPRISM uses a library interposition mechanism to intercept calls from the application to the runtime. Figure 2 gives an overview of the workflow of libPRISM. When the application calls the runtime to start or finish a parallel region, libPRISM intercepts these calls and executes the policy-specific code before calling the runtime system. libPRISM records information about the parallel region that is going to be executed and reconfigures the knobs based on the implemented policy. Then, libPRISM calls the runtime system with the selected parameters as if it were the application. As a result, the application executes with the best-performing knob configuration found without requiring any modification.
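The interception workflow described above can be mimicked in a few lines. The sketch below is only an illustration in Python and not libPRISM's implementation: the real library interposes on the runtime's entry points, whereas here a plain function wrapper plays that role. All names (`Policy`, `interpose`, `set_knobs`, `region_id`) are hypothetical.

```python
import time
from typing import Any, Callable, Dict

Config = Dict[str, str]
RuntimeCall = Callable[[Callable[[], Any], int], Any]


class Policy:
    """Hypothetical policy: picks a knob configuration per parallel region."""

    def __init__(self) -> None:
        self.executions: Dict[int, int] = {}   # region id -> times executed
        self.last_time: Dict[int, float] = {}  # region id -> last duration (s)

    def on_region_start(self, region_id: int) -> Config:
        self.executions[region_id] = self.executions.get(region_id, 0) + 1
        # Placeholder decision: always request the default configuration.
        return {"smt": "SMT8", "prefetcher": "default"}

    def on_region_end(self, region_id: int, elapsed: float) -> None:
        self.last_time[region_id] = elapsed


def interpose(runtime_start: RuntimeCall,
              policy: Policy,
              set_knobs: Callable[[Config], None]) -> RuntimeCall:
    """Wrap a runtime entry point so the policy runs before and after it."""

    def intercepted(region_fn: Callable[[], Any], region_id: int) -> Any:
        config = policy.on_region_start(region_id)    # ask the policy
        set_knobs(config)                             # reconfigure the knobs
        start = time.perf_counter()
        result = runtime_start(region_fn, region_id)  # forward to the runtime
        policy.on_region_end(region_id, time.perf_counter() - start)
        return result

    return intercepted
```

In this sketch the application would call `intercepted` exactly as it would call the runtime, so it stays unaware of the policy, which is the property the interposition design relies on.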
Our goal is to implement a policy using the libPRISM infrastructure to tune SMT and hardware prefetcher knobs in order to exploit the optimization opportunities to maximize performance and, if possible, reduce the power consumption. libPRISM tracks and profiles at execution time every parallel region of the application. At compile time, parallel regions are transformed into functions that are called by the application. Parallel regions can be identified by their next program counter (PC) in the program stack of the intercepted runtime function calls. libPRISM identifies a parallel region using this PC, as shown in Figure 2. libPRISM passes that information to the policy, which keeps track of the number of times a parallel region is executed. For every parallel region that is executed, the policy records a performance profile under different knob configurations.
The policy builds this performance profile for each parallel region using different performance counters (executed instructions and cycles) and the execution time of the region.
Note that, in several programming models, there exists the possibility to use a master thread that creates work for the other worker threads. This is the case when using the task abstraction available in OpenMP. This behavior is usually not exposed to the user, and it is handled internally in the runtime system. To support this type of parallelism in libPRISM, we use the master thread to measure system performance after it creates all the tasks to be executed by the worker threads without requiring any modification in the runtime.
Adaptive Algorithm
The MAXPERF policy explores different knob configurations in order to identify the best configuration per parallel region at execution time. The policy manages the two hardware knobs that are targeted in this work, the SMT level and the data prefetcher, but it can work for N hardware knobs. It is optimized to handle parallel applications that use common runtimes such as OpenMP.
The policy implements a greedy search through the different hardware configurations in order to identify the best-performing configuration found. The use of a greedy algorithm instead of an exhaustive one helps to reduce the overhead cost of exploring all the possible configurations of the hardware knobs.
Listing 1: MAXPERF policy exploration phase algorithm.
The MAXPERF policy adopts a hierarchical search algorithm. It explores different configurations for one particular hardware knob at a time. MAXPERF first tunes the hardware resources that have more impact on the final performance of the application. We base our heuristic on a single-factor search rather than a multi-factor search to reduce the exploration space and, therefore, the overhead cost of exploring. Our heuristic converges faster to a hardware knob configuration while taking into account inter-knob effects.
In this work, our heuristic achieves a competitive performing hardware knob configuration with respect to the best static hardware knob configuration found for each application when tuning the SMT level and the prefetcher aggressiveness knobs.
For instance, we have measured that the best-performing SMT level can lead to a performance boost larger than 10% (with respect to the default SMT level), while the best-performing data prefetcher setting boosts performance around 5% (with respect to the default data prefetcher). As a result, the MAXPERF policy first explores the different SMT configurations from SMT8 to single thread (ST) to find a competitive performing SMT setting. Then it explores the different prefetcher configurations from the most aggressive to the least aggressive one. Starting with the hardware knobs configured to the most aggressive configuration allows the policy to maximize performance, reducing the possibility of degrading performance.
The policy implements an exploration phase followed by a steady-state phase. In the steady-state phase, it is possible to do a correction phase if needed.
This approach minimizes overhead by leveraging the repetitive behavior of the parallel regions of the applications, and it corrects the hardware knob configuration in case the behavior changes over time.
The pseudo-code of the exploration phase is shown in Listing 1. A parallel region is identified by the PC of the intercepted function call. If the duration of the parallel region is too short (i.e. below a threshold), libPRISM stops the exploration phase as the cost of reconfiguring the available hardware knobs would negate the potential performance benefits of an optimized hardware configuration (Line 3 in Listing 1). This threshold has to take into account the time spent in changing the specific hardware knobs.
Listing 2: MAXPERF policy steady-state phase algorithm.
In the exploration phase, the first time a parallel region is executed, libPRISM sets the available hardware knobs to the most aggressive configuration and measures its performance. This is done to spend the minimum amount of time in a knob configuration that degrades performance, which is usually the least aggressive knob configuration. This measurement is repeated a number of times (repetitions) in order to avoid measurement noise due to the new knob configuration. For instance, the first parallel region execution after changing the SMT level might suffer from an increased number of cache misses (cold cache effects).
The next time the same parallel region is executed, libPRISM lowers the aggressiveness level of the knob and measures performance again. If lowering the aggressiveness of the knob leads to a slowdown in performance, the exploration phase for this knob stops and the previous configuration is selected as the best-performing configuration found (Lines 17 to 19 in Listing 1). Then, the policy continues the exploration phase with the next knob to configure (Line 20 in Listing 1).
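The exploration procedure described above (start at the most aggressive setting, lower one knob at a time, stop at the first slowdown) can be sketched as follows. This is a minimal illustration, not a transcription of Listing 1: the function name `greedy_explore`, the `measure` callback and the dictionary layout are assumptions, and the short-region threshold, the repetitions and the re-exploration step are deliberately left out. Ties are resolved toward the less aggressive setting, in line with the policy's stated preference.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def greedy_explore(knobs: Dict[str, List[str]],
                   measure: Callable[[Config], float]) -> Config:
    """Tune one knob at a time, from most to least aggressive setting.

    `knobs` maps a knob name to its settings ordered from most to least
    aggressive; dictionary order is the tuning priority (hypothetical layout).
    `measure` returns the execution time of the region under a configuration.
    """
    # First execution: every knob at its most aggressive setting.
    best: Config = {name: levels[0] for name, levels in knobs.items()}
    best_time = measure(best)

    for name, levels in knobs.items():
        for level in levels[1:]:
            candidate = {**best, name: level}
            elapsed = measure(candidate)
            if elapsed > best_time:
                break  # slowdown: keep the previous setting for this knob
            best, best_time = candidate, elapsed
    return best
```

Passing `{"smt": ["SMT8", "SMT4", "SMT2", "ST"], "prefetcher": ["aggressive", "medium", "default", "disabled"]}` as `knobs` reproduces the exploration order used in this work; a real policy would also apply a noise tolerance when comparing times.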
The maximum number of iterations for the exploration phase without taking into account re-explorations is: 2 × Number of SMT levels + Data prefetch aggressiveness configurations. In our experiments, we observe that the maximum number of iterations is never reached. Our observations show that fewer than 10 iterations (6.1 iterations on average) are enough to tune non-variable parallel regions with our algorithm. This is typically a low number of iterations with respect to the total number of iterations.
After the exploration phase, the policy identifies a competitive performing knob configuration for a particular parallel region and reaches a steady-state phase. The pseudo-code of this phase is shown in Listing 2. Every time the parallel region is executed, the knobs are set to the best-performing knob configuration found. In order to identify phase changes in the application, the execution time of the parallel region is compared against the average execution time found during the exploration phase. If the last execution time significantly differs from the average execution time, the exploration phase starts again but with an increased number of repetitions in order to minimize continuous reconfiguration overheads and take into account different control flow paths in the execution of the parallel region (Line 4 in Listing 2). In our experiments, we select a threshold of 5.0% to start the exploration phase again.
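The phase-change check of the steady-state phase can be illustrated with a short sketch. It is not a transcription of Listing 2: the names `RegionState` and `steady_state_step` are hypothetical, the deviation is assumed to be relative and two-sided (slower or faster), and the number of repetitions is assumed to grow by one per re-exploration, since the text does not state by how much it increases.

```python
from dataclasses import dataclass


@dataclass
class RegionState:
    """Per-region bookkeeping kept by the policy (hypothetical layout)."""
    avg_time: float          # average execution time from the exploration phase
    repetitions: int = 1     # measurements taken per configuration
    exploring: bool = False  # True once a re-exploration has been triggered


def steady_state_step(state: RegionState, last_time: float,
                      threshold: float = 0.05) -> bool:
    """Return True if the region must be re-explored after this execution."""
    deviation = abs(last_time - state.avg_time) / state.avg_time
    if deviation > threshold:
        state.exploring = True
        state.repetitions += 1  # assumption: one extra repetition per restart
    return state.exploring
```

With the 5.0% threshold, a region whose average time is 1.0 s stays in steady state after a 1.04 s execution and is re-explored after a 1.10 s execution.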
Case Study: SMT level and Data Prefetcher
To illustrate the detailed behavior of the MAXPERF policy, we describe a case study in which we use libPRISM and the MAXPERF policy to select the best SMT level and hardware data prefetcher configuration for the CG application from the NPB suite. In the exploration phase, libPRISM explores SMT8, SMT4, SMT2 and ST for the SMT level; for the data prefetcher it explores aggressive, medium, default and disabled prefetcher configurations. The policy starts the exploration with the most aggressive configuration, SMT8 level and an aggressive data prefetcher. Figure 3 shows how the exploration phase is performed on the longest parallel region of the CG benchmark.
Figure 3: Adaptive algorithm in libPRISM to select a competitive performing configuration for SMT level and data prefetcher for the CG application. Details on the hardware knob configuration are explained in Section 4. Repetitions is set to 1 (algorithm shown in Listing 1).
This figure shows the selected SMT level and prefetcher configuration in a particular iteration of the parallel region, as well as the execution time of the parallel region under this configuration. In the first three iterations, the policy lowers the SMT level from SMT8 to SMT2 until there is a slowdown in execution time. Therefore, the SMT4 level is chosen as the best SMT level. Then, in the next four iterations, the policy lowers the prefetcher aggressiveness to the point where it totally disables the prefetcher.
When performance changes significantly during the prefetcher tuning, the MAXPERF policy re-explores the SMT level. Since disabling the prefetcher provides a 20% performance improvement, the policy triggers the exploration phase for the SMT level again, this time with the prefetcher disabled.
This is shown in Lines 21 and 22 of Listing 1, which correspond to the correction phase. In Figure 3, this is shown in iterations 5 and 6. At the end of iteration 5, the policy knows which prefetcher configuration is competitive in terms of performance for the SMT4 level. In iteration 6, after setting the hardware knobs, the policy realizes it needs to restart the exploration for the SMT knob, which takes place in iteration 7. In this exploration phase, the policy just lowers the SMT level to SMT4, which offers worse performance than the SMT8 level. As a result, this parallel region will be reconfigured every time to the SMT8 level and the disabled prefetcher configuration, which leads to a 38.3% performance improvement with respect to the default configuration.
The policy does not detect any phase change during the rest of the execution in CG for this particular parallel region.
EXPERIMENTAL SETUP
We evaluate the solution on a POWER8 based system (8247.42L model) [31]. The system has an IBM POWER8 processor that runs at 3.15GHz with 512GB of DDR3 CDIMM memory running at 1.6GHz.
The POWER8 processor in this system is packaged as a dual-chip module where each chip has 6 cores. Each core has 64KB L1 data and 32KB L1 instruction caches, a 512KB L2 cache and an 8MB L3 cache.
The system runs the Ubuntu 14.10 operating system with kernel version 3.16. We compile all the benchmarks with GCC version 4.9.3, which fully supports OpenMP 4.0.
Simultaneous Multithreading
The POWER8 processor has a maximum SMT level of 8: each core can run up to eight threads simultaneously. It also supports running 1, 2 and 4 threads (ST, SMT2 and SMT4 levels). The operating system (OS) sees a physical core as a group of 8 virtual cores. When the machine boots, it automatically sets the SMT level to 8. If no application is running, the SMT level is adjusted automatically by the hypervisor based on the utilization of the system. For example, when the system is in SMT8 level, the OS exposes 8 virtual cores per physical core. When just one of those virtual cores is used, the system sets the SMT level to ST level automatically, making all the core hardware resources available to the application. To set the correct SMT level, we need to specify the number of threads running in a physical core. This can be done manually by setting the desired number of threads of the application and pinning threads to physical cores accordingly. Also, it can be done by disabling the virtual cores through specific online registers exposed by the OS. In OpenMP, the required number of threads can be defined through an environment variable or directly from the application code with specific calls to the runtime. By default, the parallel applications evaluated use all the threads available in SMT8 level.
Data Prefetcher
The data prefetcher can be controlled at the core level by a special purpose register called Data Streams Control Register (DSCR) [20], which is exposed by the OS. The DSCR has 12 different fields, offering a total of 2^25 possible configurations. The most relevant fields are the following ones:
• LDS: Enables data prefetching for load instructions.
• SNSE: Enables data prefetching for load and store instructions that have a stride bigger than a cache block.
• URG: Number of cache blocks that will be prefetched, from 1 cache block up to 7 cache blocks.
When the machine boots, it automatically sets the prefetcher to the default configuration: LDS activated, URG set to 4, and all the other options disabled. libPRISM considers this default configuration, as well as three more prefetcher configurations. When disabling the data prefetcher, we disable all its available options. The medium configuration has URG set to 7, LDS activated and all the other options disabled. The aggressive configuration has URG set to 4, LDS and SNSE activated, and all the other options disabled.
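The four prefetcher configurations described above can be written down as plain data, which is convenient when a policy has to iterate over them. The sketch below only restates the field values given in the text; the class and variable names are hypothetical, no real DSCR bit layout is encoded, and the URG value of 0 for the disabled configuration is an assumption meaning "not set", since the text only says that all options are disabled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PrefetcherConfig:
    """Subset of DSCR fields used in this work (names as in the text)."""
    lds: bool   # prefetch on load instructions
    snse: bool  # prefetch on strided load/store streams
    urg: int    # URG field value; 0 is an assumed marker for "not set"


PREFETCHER_CONFIGS: Dict[str, PrefetcherConfig] = {
    "aggressive": PrefetcherConfig(lds=True, snse=True, urg=4),
    "medium": PrefetcherConfig(lds=True, snse=False, urg=7),
    "default": PrefetcherConfig(lds=True, snse=False, urg=4),
    "disabled": PrefetcherConfig(lds=False, snse=False, urg=0),
}

# Order in which the policy walks the configurations during exploration.
EXPLORATION_ORDER: List[str] = ["aggressive", "medium", "default", "disabled"]
```

Keeping the configurations in an ordered list mirrors the most-to-least aggressive walk of the exploration phase; turning a `PrefetcherConfig` into an actual register value would require the field offsets from the processor documentation.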
Benchmarks
To evaluate the effectiveness of the policy implemented in libPRISM, we use a wide set of benchmarks from three different suites: NPB [26] with the class D inputs, SPEC OMP 2012 [33] with the reference input, and a subset of the CORAL [9] benchmarks with the recommended input size.
The NPB suite is composed of 5 kernels and 3 pseudo-applications, which are derived from computational fluid dynamics (CFD). The SPEC OMP 2012 suite contains 14 applications from CFD to image modeling. They are focused on compute-intensive performance. All SPEC OMP benchmarks are evaluated except imagick and smithwa, as these two benchmarks did not pass SPEC's validation tools in our environment.
The CORAL suite tests different parts of the systems, from CPU to network performance. It includes applications from scalable science benchmarks, data-centric benchmarks or kernels. We selected four of the most relevant benchmarks in the suite: Lulesh, HACC, graph500 and AMG. All the benchmarks are parallelized with OpenMP and written in C, C++ or Fortran.
Benchmarks are executed on 6 cores and pinned to them to avoid thread migration. We pin the different threads with the environment variable OMP_PLACES. Benchmarks can run with 6, 12, 24 and 48 threads for ST, SMT2, SMT4 and SMT8 levels, respectively, and they are executed in isolation until completion.
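The thread counts quoted above follow directly from the number of cores and the SMT level. The short sketch below makes that arithmetic explicit; the helper names are hypothetical, and only `OMP_NUM_THREADS`, the standard OpenMP variable for the thread count, is set, because the text does not give the value used for `OMP_PLACES`.

```python
from typing import Dict

# Hardware threads per core for each SMT level named in the text.
SMT_THREADS_PER_CORE: Dict[str, int] = {"ST": 1, "SMT2": 2, "SMT4": 4, "SMT8": 8}


def threads_for(level: str, cores: int = 6) -> int:
    """Number of application threads needed to run `cores` cores at `level`."""
    return cores * SMT_THREADS_PER_CORE[level]


def openmp_env(level: str, cores: int = 6) -> Dict[str, str]:
    """Environment fragment selecting the thread count for an OpenMP run."""
    return {"OMP_NUM_THREADS": str(threads_for(level, cores))}
```

For the 6-core setup this yields 6, 12, 24 and 48 threads for ST, SMT2, SMT4 and SMT8, matching the counts listed in the text.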
Metrics
In Section 5, we report speedup in execution time, power consumption and energy-delay product (EDP) for all the benchmarks. We measure wall time for the entire application. When running with the libPRISM infrastructure, it also reads the timebase register from the POWER8 processor for fine-grained analysis of parallel regions. To analyze the execution of the different benchmarks, multiple performance counters are collected using perf [11]. We use AMESTER (Automated Measurement of Systems for Energy and Temperature Reporting) [19] to measure the power consumption of the processor and memory chips. The tool remotely collects power, thermal and performance metrics from the system using the Flexible Service Processor (FSP). The FSP allows reading of different sensors from the system without using any of the processing cycles of the system. Therefore, it has no impact on the performance of the running benchmarks. In Section 5, we report the average power consumption for the total execution and the energy-delay product (EDP). Power consumption results do not include the idle power of the system to put more emphasis on active power consumption savings. When reporting EDP, we report energy (taking the idle power of the system into account) multiplied by execution time.
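The EDP metric defined above is energy multiplied by execution time, so a speedup counts twice: once through the energy and once through the delay. The following sketch shows the computation under the assumption that energy is obtained as average power times execution time; the function names and the sample numbers are illustrative only and are not measurements from this work.

```python
def energy_delay_product(avg_power_watts: float, exec_time_s: float) -> float:
    """EDP in joule-seconds; power should include idle power, as in the text."""
    energy_joules = avg_power_watts * exec_time_s  # assumed: E = P_avg * t
    return energy_joules * exec_time_s


def normalized_edp(power: float, time_s: float,
                   base_power: float, base_time_s: float) -> float:
    """EDP of a run relative to a baseline run (values below 1 are better)."""
    return (energy_delay_product(power, time_s)
            / energy_delay_product(base_power, base_time_s))
```

As an illustration, a run that is 10% faster and draws 10% less average power than its baseline has a normalized EDP of 0.9 × 0.9 × 0.9 = 0.729.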
EVALUATION
In this section we evaluate the execution time, power consumption and EDP of libPRISM and the MAXPERF policy.
Performance
We compare the policy against different static predefined configurations. Figure 4 shows the execution time speedups for the following configurations:
• ST + DEFAULT prefetcher: Single thread (ST) and default data prefetcher.
• SMT8 + DEFAULT prefetcher: Default configuration when the machine boots (SMT8 level and default data prefetcher), used as the baseline to normalize speedups.
• Best static per application (BSA): Best hardware configuration found for each application after an exhaustive offline profiling.
• MAXPERF: Dynamically sets the hardware knob configuration for every parallel region in the application based on the MAXPERF policy, which seeks the maximum performance in terms of execution time. This policy uses the libPRISM infrastructure.
Figure 4 shows that the default hardware configuration is already the best-performing configuration for 10 out of 24 evaluated benchmarks. For the remaining 14 benchmarks, half of them can reach performance improvements above 10%, illustrating the need for an adaptive system that manages shared hardware resources. On average, BSA reaches a 14.9% performance improvement over the default configuration. The MAXPERF policy slightly increases this performance improvement (15.4%) and achieves competitive results for all benchmarks without any significant slowdown across the benchmarks and without requiring any offline profiling.
Figure 5 shows the breakdown of the selected hardware configurations during the execution with libPRISM and the MAXPERF policy. A first observation from this figure is that only 4 benchmarks run 90% of the time with the default configuration. A second observation is that 8 benchmarks have parallel regions with different hardware requirements. libPRISM and the MAXPERF policy effectively detect this situation and select the appropriate hardware configuration per parallel region.
In the case of the NPB suite, there are several benchmarks that achieve the best performance with the default hardware configuration: EP, IS and MG. achieves better performance with the SMT4 level, as explained in Section 3.2. In FT, the MAXPERF policy gets a speedup of 1.71x by selecting SMT2 and SMT4 levels in different parallel regions, and keeping the prefetcher in the default configuration, as shown in Figure 5. LU and SP get an 8.0% and 5.5% speedup respectively by lowering only the SMT level. LU uses SMT4 the most, and SP needs to use SMT4 56.3% of the time and SMT2 the remainder of the time. However, there is no performance difference between BSA and the MAXPERF policy as the performance in SMT4 and SMT2 levels is very similar. MAXPERF always chooses the least aggressive configuration possible if multiple configurations have similar performance. This reduces the power consumption, as Section 5.2 shows.
From the evaluated benchmarks in the SPEC OMP 2012 suite, 8 of them run faster with the default configuration. In the case of botsalgn and botsspar, they get a 7.1% and 25.6% performance improvement, respectively, by lowering the SMT level to SMT4. Ilbdc also needs an SMT4 level and the prefetcher enabled to get a 12.6% improvement. Finally, mgrid331 achieves up to a 2.2x speedup. Some parallel regions of this benchmark suffer significant performance slowdowns in the default configuration. MAXPERF combines SMT4 and SMT8 levels for different parallel regions to achieve its maximum speedup, as shown in Figure 5. In some benchmarks such as applu331, bt331, fma3d or md, the MAXPERF policy decides to turn off the data prefetcher. However, no performance improvement is achieved as the memory component for those benchmarks is very low.
In the case of the CORAL benchmarks, results are also workload dependent. Lulesh achieves a 1.7x speedup with the MAXPERF policy, mostly with a combination of SMT2 and SMT4 levels (32.9% and 40.7% of the time is spent in these configurations), while the prefetcher needs to be enabled but its aggressiveness does not matter. In HACC, an SMT8 level is enough to get the best performance, and MAXPERF disables the data prefetcher without impacting performance. In the case of graph500, the policy disables the data prefetcher in some parallel regions to boost performance by 5.9%. In AMG, the best configuration is the SMT4 level with the most aggressive prefetcher. However, MAXPERF does not achieve the performance of BSA because this benchmark is composed of many small parallel regions (less than 1 ms), where MAXPERF decides to maintain the default configuration to avoid reconfiguration overheads.
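Two selection rules appear in the results above: MAXPERF prefers the least aggressive configuration when several perform similarly, and it keeps the default configuration for very short parallel regions. The following Python sketch illustrates how such rules could be combined. It is not the libPRISM implementation: the function names, the 2% similarity tolerance, the prefetcher labels and the aggressiveness ranking (lower SMT level and weaker prefetching count as less aggressive) are assumptions made for illustration. Only the 1 ms region threshold comes from the text.

```python
from typing import Dict, Tuple

# (SMT level, prefetcher setting), e.g. (4, "default"). Labels are hypothetical.
Config = Tuple[int, str]

# Assumed ranking: a lower rank means a less aggressive prefetcher.
PREFETCH_RANK = {"off": 0, "default": 1, "aggressive": 2}


def aggressiveness(cfg: Config) -> Tuple[int, int]:
    """Sort key: lower SMT level first, then weaker prefetching (assumption)."""
    smt, prefetch = cfg
    return (smt, PREFETCH_RANK[prefetch])


def select_config(times_ms: Dict[Config, float],
                  default: Config,
                  min_region_ms: float = 1.0,
                  tolerance: float = 0.02) -> Config:
    """Pick a configuration for one parallel region from measured times."""
    # Short regions keep the default to avoid reconfiguration overhead.
    if times_ms[default] < min_region_ms:
        return default
    best = min(times_ms.values())
    # Configurations within `tolerance` of the best time count as similar.
    similar = [c for c, t in times_ms.items() if t <= best * (1.0 + tolerance)]
    # Among similar candidates, prefer the least aggressive one.
    return min(similar, key=aggressiveness)


if __name__ == "__main__":
    measured = {(8, "default"): 10.0, (4, "default"): 8.0, (2, "default"): 8.1}
    print(select_config(measured, default=(8, "default")))
```

In this example SMT4 and SMT2 differ by about 1%, so the sketch selects SMT2, mirroring the behavior described for SP, where SMT4 and SMT2 perform very similarly.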
Energy Efficiency
Next, we discuss the energy efficiency results obtained with libPRISM using the MAXPERF policy. Figure 6 shows the power consumption of the processor and the memory components when running in the default configuration and when using libPRISM with the MAXPERF policy. Power results are normalized to the default configuration. The processor power represents 82.9% of the power in both configurations. With the MAXPERF policy, the memory power is reduced from 16.5% to 13.9%. However, different benchmark suites show very different power profiles: NPB and CORAL benchmarks have a high memory power consumption, while SPEC OMP 2012 spends 95.1% of the power on the processor.
To enhance the analysis of the energy efficiency of libPRISM and the MAXPERF policy, Figure 7 shows the energy-delay product (EDP) when using libPRISM and MAXPERF, normalized to the default configuration. This figure considers the entire system power consumption, including the idle power consumption. The combination of better performance results (as shown in Figure 4) and reduced power consumption (as shown in Figure 6) explains the significant reduction in EDP (15.9% on average), which is above 19.8% for seven of the evaluated benchmarks.
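To make the relation between speedup, power and EDP concrete, the following sketch computes a normalized EDP from a power ratio and a speedup. It assumes the standard definition EDP = energy × delay with energy = average power × time, so the normalized EDP is (P/P0) · (T/T0)². The function names and the input values are hypothetical and are not measurements from this evaluation, which uses whole-system power including idle power.

```python
def normalized_edp(power_ratio: float, time_ratio: float) -> float:
    """EDP relative to a baseline.

    EDP = energy * time = (power * time) * time, so normalizing to the
    baseline gives (P / P0) * (T / T0) ** 2.
    """
    return power_ratio * time_ratio ** 2


def edp_reduction_percent(power_ratio: float, speedup: float) -> float:
    """EDP reduction in percent for a given power ratio and speedup."""
    time_ratio = 1.0 / speedup  # a speedup of s shortens the time to T0 / s
    return (1.0 - normalized_edp(power_ratio, time_ratio)) * 100.0


if __name__ == "__main__":
    # Hypothetical inputs: 10% less power and a 1.25x speedup.
    print(round(edp_reduction_percent(power_ratio=0.9, speedup=1.25), 1))
```

Because the execution time enters quadratically, a benchmark can increase its power consumption and still reduce its EDP when the speedup is large enough, which is the situation reported for CG and FT.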
In the case of the NPB suite, the memory power ranges from 15.7% to 47.3% of the total power consumption. In the case of EP, IS, and MG, MAXPERF does not reduce the power consumption or the EDP. In the case of BT, LU and SP, setting a less aggressive configuration implies a reduction in the power consumption. BT reduces processor and memory power consumption by 3.4% and 6.9%, respectively, and EDP is reduced by 19.8%. LU reduces memory power consumption by 1.5% and EDP by 14.8%. SP reduces processor and memory power consumption by 5.3% and 2.6%, respectively, and EDP by 12.3%. Finally, in the case of CG and FT, MAXPERF slightly increases the power consumption, as these benchmarks exhibit a much higher performance with libPRISM and the MAXPERF policy. CG increases processor and memory power consumption by 7.6% and 8.6%, respectively, but EDP is reduced by 43.8% thanks to the reduced execution time. FT increases processor power consumption by 9.4% and reduces memory power consumption by 5.1%, while EDP is reduced by 65.4%.
The SPEC OMP 2012 suite is mostly CPU-intensive, as can be seen from the power distribution in Figure 6. Although MAXPERF disables the data prefetcher in multiple cases (see Figure 5), the overall power consumption is not significantly reduced. MAXPERF is able to speed up execution and lower the power consumption and the EDP of different benchmarks. Botsalgn and botsspar reduce the SMT level to SMT4, which reduces processor power consumption by 5.6% and 9.0%, respectively, and EDP by 13.8% and 38.4%, respectively. A similar situation occurs in mgrid331, with a 13.1% reduction in processor power and an 80.2% reduction in EDP. The SPEC OMP 2012 suite shows that when no performance optimization opportunities exist with respect to the default knob configuration, libPRISM with the MAXPERF policy does not introduce noticeable overheads.
Finally, in the case of the CORAL benchmarks, we see different behaviors in the power consumption. The processor accounts for 73.1% to 98.6% of the power consumption and the memory component for 1.4% to 26.9%. libPRISM and the MAXPERF policy improve the energy efficiency and the overall performance. In Lulesh, libPRISM with the MAXPERF policy reduces processor and memory power consumption by 5.5% and 6.8%, respectively, and EDP is reduced by 65.4%. HACC runs fastest with the default configuration, and even though MAXPERF disables the data prefetcher, there is no performance or power difference, as this benchmark has very low memory utilization. graph500 shows a 7.1% reduction in memory power consumption after disabling the prefetcher, while EDP is reduced by 12.7%. AMG runs better with a lower SMT level but, as described in the previous section, short parallel regions prevent libPRISM from reconfiguring the hardware and no differences in execution time are seen. Nevertheless, power consumption and EDP are slightly reduced.
Individual Performance Analysis
In this section, we provide a detailed performance analysis of three interesting benchmarks: CG, FT and Lulesh. For this purpose, we read the required performance monitoring counters (PMCs) to obtain the CPI breakdown [20]. We focus on these PMCs: GRP_CMPL, completed instructions; GCT_NOSLOT_CYC, cycles when there are no instructions from threads; CMPLU_STALL_VSU, cycles stalled by the vector-and-scalar unit; CMPLU_STALL_DMISS_L2L3, completion stall due to a data cache miss that is resolved in the L2 or L3 caches; CMPLU_STALL_DMISS_L3MISS, completion stall due to a cache miss that also misses the L3; CMPLU_STALL_THRD, a thread could not complete an instruction because the completion port was being used by another thread; and CMPLU_STALL_DCACHE_MISS, cycles stalled by data cache misses in the L1 cache. These PMCs are the most significant ones for the applications shown. The rest of the PMCs from the CPI breakdown are represented as OTHER_STALL.
...
      q(j) = suml
   enddo
!$omp end do
...
Listing 3: Relevant parallel region of CG.
CG has several parallel regions, but one of them covers more than 96% of the total execution time. Figure 8 shows the CPI breakdown for this parallel region when using SMT8 or SMT4 and the prefetcher set to default or disabled. libPRISM chooses the SMT8 level and disables the prefetcher, as shown in Figure 5. In the default configuration, the main reason for stalled cycles is data cache misses, as reflected by the STALL_DMISS_L2L3 and STALL_DMISS_L3MISS CPI breakdown components in Figure 8. When changing from SMT8 to SMT4 with the prefetcher enabled (third bar in Figure 8), we see these two PMCs decrease by 3.2% and 3.6%. When the prefetcher is turned off, there is a reduction of 12.1% in the SMT8 level and of 11.8% in the SMT4 level. Overall, the main performance bottleneck in CG is the large number of cache misses, which are reduced when disabling the prefetcher or lowering the SMT level.
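A CPI breakdown of the kind used in this analysis attributes the cycles per instruction to individual stall causes by dividing each stall-cycle counter by the number of completed instructions. The sketch below shows this arithmetic in Python. It assumes that the listed stall counters are disjoint, which may not hold for hierarchical counters on real hardware, and it groups every unattributed cycle into a single remainder component. The function name, the component labels and the counter values are hypothetical.

```python
from typing import Dict


def cpi_breakdown(cycles: int, instructions: int,
                  stall_cycles: Dict[str, int]) -> Dict[str, float]:
    """Split CPI = cycles / instructions into per-cause components.

    Assumes the counters in `stall_cycles` do not overlap (assumption).
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    breakdown = {name: count / instructions
                 for name, count in stall_cycles.items()}
    accounted = sum(stall_cycles.values())
    # Cycles not covered by any listed counter form the remainder component.
    breakdown["REMAINDER"] = (cycles - accounted) / instructions
    return breakdown


if __name__ == "__main__":
    # Hypothetical counter readings for one parallel region.
    parts = cpi_breakdown(cycles=1000, instructions=500,
                          stall_cycles={"DMISS_L2L3": 200,
                                        "DMISS_L3MISS": 300})
    for name, value in parts.items():
        print(f"{name}: {value:.2f}")
```

By construction the components sum to the total CPI, so a drop in one component between two configurations can be read directly as cycles per instruction saved.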
Using libPRISM we can relate the executed parallel region with the source code, which is shown in Listing 3.
This code iterates through a vector and accumulates its values with a non-regular access pattern (p(colidx(k))). For this type of access pattern, the data prefetcher is not only unable to bring useful data into the cache, but it also degrades the performance of the benchmark by polluting the cache and reducing the effective memory bandwidth.
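As an illustration of why an access such as p(colidx(k)) defeats a stride-based prefetcher, the following Python sketch shows a sparse matrix-vector product in compressed sparse row (CSR) form. It is an assumption that the region has this shape: the full listing is not reproduced here, and only p, colidx, q and suml appear in the text, while rowstr and a are hypothetical names. Indices are 0-based in this sketch.

```python
from typing import List


def spmv_csr(rowstr: List[int], colidx: List[int],
             a: List[float], p: List[float]) -> List[float]:
    """Compute q = A * p for a matrix A stored in CSR form."""
    q = []
    for j in range(len(rowstr) - 1):
        suml = 0.0
        # Nonzeros of row j occupy positions rowstr[j] .. rowstr[j + 1] - 1.
        for k in range(rowstr[j], rowstr[j + 1]):
            # a[k] and colidx[k] are read sequentially, but p[colidx[k]] is an
            # indirect, data-dependent load with no fixed stride.
            suml += a[k] * p[colidx[k]]
        q.append(suml)
    return q


if __name__ == "__main__":
    # The 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form.
    print(spmv_csr(rowstr=[0, 2, 3], colidx=[0, 2, 1],
                   a=[1.0, 2.0, 3.0], p=[1.0, 2.0, 3.0]))
```

The address of each load from p depends on the value just read from colidx, so lines fetched ahead by a sequential prefetcher around p are unlikely to be the ones needed next.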
In the case of FT, MAXPERF selects the default prefetcher and reduces the SMT level to SMT4 and SMT2. Figure 9 shows the CPI breakdown and the instructions per cycle (IPC) for a parallel region of the FT benchmark when executed in different SMT levels with the default prefetcher configuration. The IPC is maximized when we execute the parallel region in the SMT2 level. Stalls in the SMT8 and SMT4 levels are mainly due to cache misses, which are reduced by lowering the SMT level. Once running in SMT2, the prefetcher can hide the memory latency of cache misses and the main bottleneck becomes the number of available execution units: a large percentage of cycles are stalled in the vector-scalar unit (CMPLU_STALL_VSU CPI component). Lowering the SMT level to ST exacerbates this problem (less ILP) and reduces the total throughput.
Finally, we analyze the Lulesh benchmark. This benchmark has 17 short regions, which are not reconfigured by libPRISM (their duration is below the specified threshold), and 7 long parallel regions, which are executed 2500 times and have an execution time between 2 and 4 milliseconds. As shown in Figure 5, MAXPERF selects all possible SMT levels for these parallel regions, achieving speedups from 1.03x to 2.42x. MAXPERF also selects all possible prefetcher configurations for different parallel regions, with speedups ranging from 1.01x to 1.04x. In Lulesh, the parallel regions access nodes from a list in a non-regular pattern. As a result, running with fewer threads makes better use of the memory bandwidth and the last-level cache. In 2 of the relevant parallel regions, SMT4 is the best SMT level because threads are not only loading from memory but also doing intensive CPU operations with the loaded data. In contrast, the SMT2 level provides better performance for those parallel regions where threads are memory intensive and perform few operations with the loaded data.
Discussion
In this section we discuss potential applicability of libPRISM together with its limitations.
In order to define a good policy, basic knowledge of the architecture is needed and some experimentation is required to identify the order in which the different hardware knobs are explored. After this basic profiling, which can be done by running a small training set of benchmarks, the exploration does not hurt overall performance.
Although we only demonstrated the usage of libPRISM for coordinating the management of SMT and prefetcher knobs for OpenMP applications on a POWER8-based system, the infrastructure can be leveraged for other purposes. For instance, other shared-memory programming models that mark parallel regions can be supported by libPRISM using the same library interposition mechanism. Also, other hardware knobs and sensors can be used by the policies implemented within libPRISM. This is enabled by the generic, modular, extensible and architecture-agnostic design of libPRISM.
The library interposition mechanism, reading the sensors, and configuring the hardware knobs can add overhead to the execution of the application. We measured this overhead by running the benchmarks with and without the libPRISM infrastructure. In this experiment, libPRISM only tracks and profiles the different parallel regions without reconfiguring the hardware knobs. The measured overhead in terms of execution time is always below 2.3% (1.0% on average), mainly because of monitoring small parallel regions. After selecting an appropriate threshold to control which parallel regions are explored, the exploration overhead is effectively reduced to less than 1.0%, which makes the energy overhead negligible as well.
RELATED WORK
As far as we know, this is the first work to combine SMT with data prefetching knobs to study their interaction and achieve a jointly optimized configuration of these hardware knobs.
Simultaneous Multi-threading
Previous work on SMT focuses on achieving fairness [1-3, 6, 7, 37]. Other authors predict IPC when running on an SMT processor and schedule serial applications on virtual cores in order to boost the overall performance of the system [17, 18, 32, 36]. These works focus on multi-programmed workloads, in contrast to this work, which targets parallel workloads.
There is work on dynamically choosing the best SMT level for parallel workloads. Zhang et al. [42, 43] and Heirman et al. [21] propose a dynamic algorithm inside the OMP runtime in order to choose the best number of threads. Jia et al. [23] use machine learning to boost performance by setting the correct SMT level. Besides not configuring multiple hardware knobs, these previous works differ from ours in two aspects: (1) their solutions are implemented inside the runtime, which limits where they can be used, and (2) their search space is small compared to ours.
Data Prefetching
There are previous works that propose hardware modifications of the prefetcher implementations in order to improve performance on multicore chips [12-14, 40, 41, 44]. Our proposal benefits from already implemented data prefetchers; therefore, there is no extra cost and it can be used on existing hardware to improve performance.
In terms of software, most of the previous work has been developed for serial applications or multi-programmed workloads [22, 27, 29]. Using similar workloads, Jimenez et al. detect phases of applications at runtime and change the data prefetch configuration according to the overall demands of the applications running on the system [24, 25]. These phases are not explicitly defined in the workloads; therefore, the algorithm constantly iterates through the different data prefetch configurations.
In this work, we use the already annotated parallel regions as phases. Phases are re-explored only when their behavior changes, reducing exploration time and minimizing possible slowdowns due to low-performing hardware knob configurations. Also, we take into account possible interactions between the SMT level and the data prefetcher knobs. In addition, in [24, 25], the operating system needs to be modified. Our solution works without any modification to the software stack.
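The idea of treating each annotated parallel region as a phase, exploring its configurations once and re-exploring only when its behavior changes, can be sketched as a small per-region state machine. The Python code below is an illustration under stated assumptions, not the libPRISM design: the class name, its methods, the use of execution time as the only sensor and the 20% drift threshold that triggers re-exploration are all hypothetical.

```python
from typing import Dict, Hashable, List, Optional


class RegionExplorer:
    """Tracks exploration state for a single parallel region (hypothetical)."""

    def __init__(self, configs: List[Hashable], drift: float = 0.2):
        if not configs:
            raise ValueError("at least one configuration is required")
        self.configs = list(configs)   # exploration order, assumed unique
        self.drift = drift             # relative change that triggers re-exploration
        self.samples: Dict[Hashable, float] = {}
        self.chosen: Optional[Hashable] = None
        self.reference: Optional[float] = None

    def next_config(self) -> Hashable:
        """Configuration to apply for the next execution of the region."""
        if self.chosen is not None:
            return self.chosen
        return self.configs[len(self.samples)]

    def report(self, config: Hashable, elapsed: float) -> None:
        """Record the execution time observed under `config`."""
        if self.chosen is None:
            self.samples[config] = elapsed
            if len(self.samples) == len(self.configs):
                # Exploration finished: keep the fastest configuration.
                self.chosen = min(self.samples, key=self.samples.get)
                self.reference = self.samples[self.chosen]
        elif abs(elapsed - self.reference) > self.drift * self.reference:
            # Behavior changed noticeably: discard results and explore again.
            self.samples.clear()
            self.chosen = None
            self.reference = None


if __name__ == "__main__":
    explorer = RegionExplorer(["SMT8", "SMT4", "SMT2"])
    for elapsed in (10.0, 6.0, 7.0):
        explorer.report(explorer.next_config(), elapsed)
    print(explorer.next_config())
```

In this sketch a region settles on one configuration after a single pass over the candidates, so steady-state executions pay no exploration cost, which matches the motivation given above for using annotated regions as phases.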
Also, Chilimbi et al. make use of software prefetching to speed up applications at execution time [8]. Wang et al. use compile-time information to correctly set the data prefetcher aggressiveness [40]. In contrast, in this work we focus on parallel workloads that are common in high performance computing.
Little research has addressed parallel workloads. Li et al. applied machine learning to automatically reconfigure the data prefetcher for different workloads [28]. Prat et al. added intelligence to a task-based runtime to automatically manage the aggressiveness of the data prefetcher for parallel workloads [35].
These works do not control the number of threads working on the same task. Therefore, they also miss the possible interaction between the SMT level and the data prefetcher.
CONCLUSIONS
Because of the potential resource contention among threads in the memory subsystem, current processors offer the user a wide range of configurable knobs, such as the SMT level or the data prefetcher aggressiveness. Unfortunately, finding the optimal settings of these knobs is difficult because of the large search space, the strong interactions between different architectural knobs and the different hardware demands of application phases.
In this work we introduce libPRISM, an infrastructure for parallel applications to dynamically adapt the architectural knobs based on a custom policy. On top of libPRISM we develop the MAXPERF policy, which manages the SMT level and the data prefetcher aggressiveness with the goal of increasing performance.
We evaluate our solution for a wide set of OpenMP benchmarks running on a POWER8 system. Results show a speedup of up to 2.2x (a 15.4% performance improvement on average), a dynamic power consumption reduction of up to 13% and an energy-delay product reduction of up to 80% when compared to the default static system configuration.
