Abstract-While several hardware mechanisms have been proposed to control the interaction between hardware threads in an SMT processor, few have addressed the issue of software-controllable SMT performance. The IBM POWER5 and POWER6 are the first high-performance processors implementing a software-controllable hardware-thread prioritization mechanism that controls the rate at which each hardware thread decodes instructions. This paper shows the potential of this basic mechanism to improve several target metrics for various applications on POWER5 and POWER6 processors. Our results show that although the software interface is exactly the same, the software-controlled priority mechanism has a different effect on POWER5 and POWER6. For instance, hardware threads in POWER6 are less sensitive to priorities than in POWER5 due to the in-order design. We study the SMT thread malleability to enable user-level optimizations that leverage software-controlled thread priorities. We also show how to achieve various system objectives, such as parallel application load balancing, in order to reduce execution time. Finally, we characterize user-level transparent execution on POWER5 and POWER6, and identify the workload mix that best benefits from it.

INTRODUCTION
THE limitations in exploiting instruction-level parallelism (ILP) have motivated thread-level parallelism (TLP) as a common strategy to improve processor performance. There are several TLP paradigms, which offer different benefits as they exploit TLP in different ways. For example, simultaneous multithreading (SMT) reduces fragmentation in on-chip resources. In addition to SMT, chip multiprocessing (CMP) is also effective in exploiting TLP with a limited transistor and power budget. This motivates processor vendors to combine both TLP paradigms in their processors. For instance, the Intel i7 as well as the IBM POWER5 and POWER6 combine SMT and CMP.
Because SMT processors share most of the core resources among threads, some of them implement mechanisms to better partition the shared resources. For instance, the 2-way SMT processors POWER5 and POWER6 improve the usage of resources across threads with hardware mechanisms [14], [15] that stop a thread from consuming more resources when it stalls on a long-latency operation. One of the interesting new features that enables better resource balancing is that POWER5 and POWER6 allow software to control the instruction decode rate of each thread in a core through eight priority levels, from 0 to 7. The higher the priority difference between the two threads in each core, the higher the difference in decode cycles and, hence, the difference in hardware resources received by the two threads. 1 The Operating System (OS) can provide a user interface to change thread priorities so that software can control the speed at which each hardware thread runs with respect to the other hardware thread in a core. The default priority configuration (i.e., both hardware threads having priority 4) is designed to guarantee fair hardware resource allocation between hardware threads. From a software point of view, the main motivation to override the default configuration is to address instances where nonuniform hardware resource allocation is desirable. Several examples can be enumerated, such as virtualization in SMT, the OS idle thread, a thread waiting on a spin-lock, latency-sensitive threads, software-determined nonuniform balance, and power management [6], [14], [22]. In some cases, software-controlled thread priorities can also improve instruction throughput or the execution time of parallel applications [1], [3] by optimizing hardware resource allocation.
Although software-controlled thread priorities have considerable potential, the lack of quantitative studies limits their use in real-world applications. In this paper, we provide a quantitative study of the POWER5 and POWER6 prioritization mechanism. We show that the effect of thread prioritization depends on the characteristics of the two threads running simultaneously in a core. We also show that thread priorities have different effects on applications in POWER5 and POWER6. We analyze the major processor characteristics that lead to this different behavior. In particular, although both processors are dual-core and each core is two-way SMT, their internal architectures are different. While POWER5 has out-of-order cores with many shared hardware resources, POWER6 follows a high-frequency design optimized for performance, leading to a mostly in-order design in which fewer resources are shared between threads. Finally, we show the benefits of software-controlled thread priorities in real-world applications, including parallel applications and multiprogrammed environments.
1. Note that software-controlled hardware priorities are independent of the operating system's concept of process or task prioritization. In fact, task priorities are used to prioritize the scheduling of running tasks among CPUs and are, therefore, a pure software concept.
We define SMT thread malleability (or simply malleability) as the ratio between the performance of a thread with a given priority configuration and its performance with default priority configuration. To characterize POWER5 and POWER6 thread prioritization mechanism, we developed a set of microbenchmarks that stress specific hardware resources such as data cache, issue queues, and memory bus. Moreover, we measure the malleability of real workloads, represented by some of the SPEC CPU2006 benchmarks [24] . Also, we develop a Linux kernel patch that provides an interface to the user to set all possible priorities available in kernel mode. Without a kernel patch, only three of the eight priorities are available to the user.
The main contributions of this paper are:
- We quantify the effect of software-controlled priorities in POWER5 and POWER6, measuring the average per-thread malleability using microbenchmarks.
- We explain the observed reduction in malleability in POWER6 with respect to POWER5. We also explain why applications that are memory bound or have deep data-dependency chains show similar malleability.
- We measure the malleability of a subset of SPEC CPU2006 benchmarks using higher priorities, to describe the effects of hardware-thread priorities on real workloads.
- We quantify the implications of using priority 1 and show that it can be effectively used to provide transparent execution [7]. Our results with SPEC CPU2006 show that POWER6 can achieve more than 94 percent performance for the foreground thread, relative to its single-thread (ST) performance.
- We show how hardware-thread priorities can effectively be used to reduce the execution time of a parallel application from the NAS multizone benchmarks.
The rest of this paper is organized as follows: Section 2 presents the related work. Section 3 describes the POWER5 and POWER6 microarchitecture. Section 4 describes the experimental setup and provides a description of the microbenchmarks. Section 5 contains experimental results and an analysis based on architectural considerations. Section 6 describes two use cases of hardware-thread priorities. Section 7 concludes with guidelines on performance tuning using the hardware priority mechanism.
RELATED WORK
Some previous studies focus on ensuring QoS in SMT architectures. Cazorla et al. introduce a mechanism to force predictable performance in SMT architectures [4] . They manage to run time-critical jobs at a given percentage of their maximum instructions per clock (IPC). To attain this goal, they need to control all shared resources of the SMT architecture.
Regarding CMP architectures, Rafique et al. propose to manage shared caches with a hardware cache quota enforcement mechanism and an interface between the architecture and the OS that lets the latter decide quotas [21]. Nesbit et al. introduce Virtual Private Caches (VPC) [20], which consist of an arbiter to control cache bandwidth and a capacity manager to control cache storage. They show how the arbiter allows meeting QoS performance objectives or fairness. A similar framework is presented by Iyer et al. [12], where resource management policies are guided by thread priorities. Individual applications can specify their own QoS target (e.g., IPC, miss rate, cache space) and the hardware dynamically adjusts cache partition sizes to meet these targets. An extension of this work with an admission mechanism to accept jobs is presented in [10].
Also, previous works show that SMT performance heavily depends on the nature of the concurrently running applications [6], [23]. Tuck and Tullsen analyze the performance of a real SMT processor [25], concluding that SMT architectures provide an average speedup over single-thread architectures of about 20 percent and that, even if the processor is designed to isolate threads, performance is still affected by resource conflicts.
Other works propose the use of hardware-thread priorities to control thread execution in SMT processors. Many of these proposals implement fetch policies to maximize throughput and fairness by reducing the priority, stalling, or flushing threads that experience long latency [5] , [26] . Boneti et al. analyze the effect of hardware priorities on POWER5 [1] , and use hardware priorities to balance resources in SMT processors [2] and to implement dynamic scheduling for HPC [3] .
In this context, the concept of Fair CPU utilization accounting for CMP and SMT processors, introduced by Luque et al. [16] , [17] and by Eyerman and Eeckhout [8] , can be used to improve the efficiency of thread prioritization mechanisms.
Let us assume a workload composed of several tasks (Ta, Tb, ..., Tn) running on an n-core multicore or n-way SMT processor. The mechanisms proposed in [8], [16], [17] provide an estimate of the execution time that each of these tasks would have if it ran in isolation (Ti_isolation). By comparing the execution time in CMP/SMT with the execution time in isolation (Ti_cmp/Ti_isolation or Ti_smt/Ti_isolation), we can determine the slowdown each task suffers in CMP/SMT. The slowdown (or the relative speed) could be used to guide the SMT prioritization mechanism (or any other prioritization mechanism for multicores, such as the one described by Moreto et al. [18]) to ensure Quality of Service, that is, to ensure that tasks do not suffer a performance degradation greater than a preestablished threshold.
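As an illustration of how such estimates could feed a prioritization policy, the following minimal sketch computes the per-task slowdown and flags the tasks that exceed a threshold. The function names, the task times, and the threshold of 1.5 are assumptions chosen for illustration; they are not taken from the cited works.

```python
def slowdown(t_shared: float, t_isolation: float) -> float:
    """Slowdown of a task: time in CMP/SMT divided by the estimated isolated time."""
    if t_isolation <= 0:
        raise ValueError("isolated execution time must be positive")
    return t_shared / t_isolation


def tasks_violating_qos(times: dict, threshold: float) -> list:
    """Return the sorted names of tasks whose slowdown exceeds the threshold.

    `times` maps a task name to a pair (t_shared, t_isolation).
    """
    return sorted(
        name
        for name, (t_shared, t_iso) in times.items()
        if slowdown(t_shared, t_iso) > threshold
    )


# Hypothetical execution times in seconds (illustrative values only).
example_times = {"Ta": (12.0, 10.0), "Tb": (30.0, 15.0)}
violators = tasks_violating_qos(example_times, threshold=1.5)
```

A policy built on this check could, for example, raise the hardware priority of the flagged tasks; how to map a slowdown to a priority value is left open here.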
To our knowledge, this is the first extensive study that quantifies the effect of hardware-thread priorities on two SMT processors with substantially different microarchitectures, such as POWER5 and POWER6.
POWER5 AND POWER6 MICROARCHITECTURE
This section provides a brief description of the POWER5 and POWER6 microarchitecture and of the features that are relevant to SMT and thread priorities. A detailed description of the processors can be found in the works of Le et al. [15] and Sinharoy et al. [22]. Fig. 1 shows a high-level diagram of the POWER5 and POWER6 processors. Both processors have two cores and each core supports 2-thread SMT. In both processors, each core has its own L1 data and instruction cache. In POWER5, the L2 cache is shared among cores, whereas in POWER6 each core has its own L2 cache. In both processors, the off-chip L3 cache is shared. The POWER6 microprocessor has an ultrahigh-frequency core and represents a significant change from the POWER5 design. Register renaming and massive out-of-order execution as implemented in POWER5 are not employed in POWER6. However, POWER6 implements limited out-of-order execution for floating point instructions [15].
POWER5 and POWER6 Core Microarchitecture
Simultaneous Multithreading
POWER5 has separate instruction buffers for each thread. Based on thread prioritization, up to five instructions are selected from one of the instruction buffers and a group is formed. Instructions in a group are all from the same thread. POWER6 core implements an independent dispatch pipe with a dedicated instruction buffer and decode logic for each thread. At the dispatch stage, each group of up to five instructions per thread is formed independently. Later, these groups are merged into a dispatch group of up to seven instructions to be sent to the execution units. Several other features have been implemented in POWER6 to improve SMT performance. For instance, the L1 I-cache and D-cache size and associativity have been increased from the POWER5 design. POWER6 core has dedicated completion tables (GCT) per thread to allow more outstanding instructions [15] .
Both processors deploy two levels of resource control among threads through dynamic resource balancing in hardware and through thread prioritization in software. POWER5 and POWER6 dynamic hardware resource-balancing mechanisms monitor processor resources to determine whether one thread is potentially blocking the other thread execution. Under that condition, the progress of the offending thread is throttled back allowing the sibling thread to progress (automatic throttling mechanism). For example, POWER5 considers that there is an unbalanced use of resources when a thread reaches a threshold of L2 cache misses or TLB misses, or when a thread uses too many GCT (reorder buffer) entries [22] .
Software-Controlled Hardware-Thread Priorities
In POWER5 and POWER6, software-controlled priorities range from 0 to 7, where 0 means the thread is switched off and 7 means the thread is running in single-thread mode (i.e., the other thread is off).
Using priority 1 for both threads has the effect of executing the threads in low-power mode. In addition, the execution of one thread with priority 1 while the other has a priority >1 causes the former to use only hardware resources leftover by the latter.
The enforcement of software-controlled priorities is carried out in the decode stage. In general, a higher priority translates into a higher number of decode cycles. In POWER5, assuming a primary thread and a secondary thread 2 with priorities P and Q (where P > 1 and Q > 1), decode cycles are allocated as follows:
1. compute R:
$$R = 2^{|P - Q| + 1}$$
2. decode cycle rates:
$$r_{high} = \frac{R - 1}{R}, \qquad r_{low} = \frac{1}{R}$$
where $r_{high}$ is the decode cycle rate of the thread with higher priority and $r_{low}$ is the decode cycle rate of the thread with lower priority. The thread with higher priority receives $R - 1$ of every R decode cycles, while the thread with lower priority receives 1 of every R decode cycles. For instance, assuming that the primary thread has priority 6 and the secondary thread has priority 2, R would be 32, so the core decodes 31 times from the primary thread ($r_{high} = 31/32$) and once from the secondary thread ($r_{low} = 1/32$). Hence, the performance of the process running as primary thread increases to the detriment of the one running as secondary thread.
In the special case when threads have the same priority, R would be 2, and each thread alternately receives one slot ($r_{high} = r_{low} = 1/2$).
The previous formula is available for POWER5, while for POWER6 we assume that the decode cycle rate is a monotonic function of the priority difference.
Table 1 shows priority values and levels, required privilege levels, and instructions used to set priorities. The supervisor or OS can set six of the eight priorities, ranging from 1 to 6, while user software can only set priorities 2, 3, and 4. The hypervisor can use the whole range of priorities. Priorities can be set by issuing an or pseudo-instruction of the form or X,X,X, where X is a specific register number [9], [11]. This operation only changes the thread priority and performs no other operation. If it is not supported (i.e., when running on earlier POWER processors), or in case of insufficient privileges, the instruction is simply treated as a nop.
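The POWER5 decode-cycle allocation described above can be sketched as follows. The sketch assumes $R = 2^{|P - Q| + 1}$, which agrees with the two cases given in the text (R = 32 for priorities 6/2 and R = 2 for equal priorities). The function name and the restriction to priorities 2 to 6 are assumptions of this sketch; priorities 0, 1, and 7 have special meanings and are not covered.

```python
from fractions import Fraction


def decode_rates(p: int, q: int):
    """Return (R, r_high, r_low) for POWER5 thread priorities p and q.

    Assumes R = 2 ** (|p - q| + 1); only priorities 2..6 are handled,
    since 0, 1, and 7 select special modes rather than a decode ratio.
    """
    if not (2 <= p <= 6 and 2 <= q <= 6):
        raise ValueError("this sketch only covers priorities 2 to 6")
    r = 2 ** (abs(p - q) + 1)
    if p == q:
        # Equal priorities: the two threads alternate decode slots.
        return r, Fraction(1, 2), Fraction(1, 2)
    # The higher-priority thread gets R - 1 of every R decode cycles.
    return r, Fraction(r - 1, r), Fraction(1, r)
```

For priorities 6/2 the function returns R = 32 with rates 31/32 and 1/32, matching the worked example in the text. No equivalent function is given for POWER6, for which only monotonicity in the priority difference is assumed.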
EXPERIMENTAL SETUP
In order to explore the capabilities of the software-controlled priority mechanism in the POWER5 and POWER6 processors, we perform a detailed set of experiments. Our approach consists of measuring the performance of microbenchmarks running in SMT mode as the priority of each thread is increased or reduced.
The performance of a process in an SMT processor is conditioned by the programs running simultaneously on the other hardware thread, and by their phases. Evaluating all the possible programs and all their phase combinations is infeasible. Moreover, the evaluation of a real system, with several layers of running software, OS interference, and all the asynchronous services, becomes even more difficult.
For this reason, we use a set of microbenchmarks that stress particular processor characteristics. While this scenario is not typical of real applications, it is a systematic way to understand the hardware priority mechanism. In fact, this methodology provides a uniform characterization based on specific program characteristics that can be mapped onto real applications.
To verify the effects of hardware priorities on real applications, we measure the malleability of a subset of SPEC CPU2006 benchmarks with different priority configurations. To ensure that all the benchmarks are fairly represented in the final results, we use the FAirly MEasuring Multithreaded Architectures (FAME) methodology [27], [28], which requires running the same benchmark pair in SMT mode multiple times until both benchmarks are equally represented in the total execution time.
Running all pairwise combinations of SPEC CPU2006 benchmarks and all priority combinations with the FAME methodology would take too much time to complete. 3 In order to reduce experimentation time, we choose a subset of SPEC CPU2006 as follows: 1) we choose benchmarks such that the spectrum of performance, memory, and execution unit characteristics is fairly represented in the subset, and 2) following the recommendation of Snavely et al. [23] on symbiotic OS scheduling, we pair high-IPC (CPU-intensive) benchmarks with low-IPC (memory-bound) benchmarks, in order to provide efficient utilization of the SMT core.
High-IPC benchmarks are bzip, calculix, cactusADM, and h264ref. Low-IPC benchmarks are mcf, omnetpp, and milc. The resulting combination represents mixes of high-IPC and low-IPC benchmarks as well as integer and floating point benchmarks.
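As a small illustration of this pairing rule, the sketch below enumerates the candidate workload mixes. It assumes that every high-IPC benchmark is paired with every low-IPC benchmark, which the text does not state explicitly; the names `HIGH_IPC`, `LOW_IPC`, and `benchmark_pairs` are introduced here for illustration only.

```python
from itertools import product

# Benchmark classes as listed in the text.
HIGH_IPC = ["bzip", "calculix", "cactusADM", "h264ref"]
LOW_IPC = ["mcf", "omnetpp", "milc"]


def benchmark_pairs(high=HIGH_IPC, low=LOW_IPC):
    """Pair each high-IPC benchmark with each low-IPC benchmark."""
    return list(product(high, low))


pairs = benchmark_pairs()
```

Under this assumption there are 12 mixes, each of which would then be run under every priority configuration of interest.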
Experimental Environment
The results presented are obtained by compiling the benchmarks with gcc version 4.1.2 20070115 (SUSE Linux), Linux kernel version 2.6.23, libpfm-3.8, and mpich2-1.0.8. We executed the experiments on an Open Power 710 (Op710) and on a JS22 IBM server, with the same executable. It is worth noting that the Op710 POWER5 processor is equipped with a third-level (L3) cache while the JS22 POWER6 processor we use does not have the third-level cache.
The Linux Kernel Modification
Some of the priority levels are not available in user mode (Section 3.3). In fact, only three of the eight levels can be used by user-mode applications; the others are available only to the OS or the hypervisor. Modern Linux kernels running on POWER5 and POWER6 processors exploit software-controlled priorities in a few cases, such as reducing the priority of a process when it is not performing useful computation. Basically, the kernel uses thread priorities in three cases:
- The processor is spinning for a lock in kernel mode. In this case, the priority of the spinning process is reduced.
- A CPU is waiting for operations to complete. For example, when the kernel requests a specific CPU to perform an operation by means of smp_call_function(), it cannot proceed until the operation completes. Under this condition, the priority of the thread is reduced.
- The kernel is running the idle process because there is no other process ready to run. In this case, the kernel reduces the priority of the idle thread and eventually puts the core in single-thread mode.
In all these cases, the kernel reduces the priority of a hardware thread and restores it to MEDIUM (4) as soon as there is some work to perform. Furthermore, since the kernel does not keep track of the actual priority, to ensure responsiveness it also resets the thread priority to MEDIUM every time it enters a kernel service routine (e.g., interrupt, exception handler, or system call). This is a conservative choice induced by the fact that it is not clear how and when to prioritize a hardware thread and what the effect of that prioritization is.
In order to explore the entire priority range, we develop a kernel patch that provides an interface to the user to set all the possible priorities available in kernel mode:
- We make priorities 0 to 7 available to the user. As mentioned in Section 3.3, only three priorities (4, 3, 2) are directly available to the user. Without this kernel patch, any attempt to use other priorities results in a nop operation. Priorities 0 and 7 (thread off and single-thread mode, respectively) are available to the user through a hypervisor call.
- We avoid the use of software-controlled priorities inside the kernel; otherwise, experiments would be affected by unpredictable priority changes.
- Finally, we provide an interface through the /proc pseudo file system which allows user applications to change their priority.
Microbenchmarks Description
In order to build a basic knowledge of the effect of software-controlled priorities, we used METbench (Minimum Execution Time Benchmark [3]), a microbenchmark suite designed to stress specific processor characteristics. We classify microbenchmarks into three classes: Integer, Floating Point, and Memory, as shown in Table 2. In the Integer class, there are cpu_int, which contains mixed integer instructions (one multiplication every two additions); cpu_int_add, which contains integer additions; cpu_int_mul, which contains integer multiplications; and lng_chain, which is composed of mixed integer instructions with high inter-instruction dependency. The latter is designed to limit ILP exploitation on out-of-order processors (i.e., POWER5). In the Floating Point class, cpu_fp_asm is a benchmark that has a high percentage of mixed-type floating point instructions. In the Memory class, there are three microbenchmarks: ldint_l1, ldint_l2, and ldint_mem. Microbenchmarks ldint_l1 and ldint_l2 are designed to always hit in the L1 and L2 cache, respectively, while ldint_mem is designed to always miss in cache.
All the microbenchmarks share the same structure: they implement a for loop with enough iterations to run for at least one second. Hence, the microbenchmarks differ mainly in the loop body, which shows a different instruction mix according to the desired behavior.
We validated the behavior of each microbenchmark through analyzing performance counters.
Integer Microbenchmarks
The four integer microbenchmarks share a common loop structure. Listing 1 shows the main loop code for cpu_int. The code for cpu_int_add is the same except that in the loop there are only additions (for instance, instead of c = c * c * it we used c = c + c + it). Analogously, the loop for cpu_int_mul contains only multiplications. Note the macro LOOP_UNROL_100, used to repeat the same code 100 times, reducing control-flow instructions in the loop.
Floating Point Microbenchmarks
In the Floating Point class, we implement the cpu_fp_asm microbenchmark in POWER assembly in order to have better control over its behavior, and hence maximize the use of the floating point unit.
Memory Microbenchmarks
The three memory microbenchmarks share a common loop structure. In the loop, loads are executed using a pointer chasing technique. In this technique, an array is initialized with pointers, so that each element contains the address of the next element to access. In order to execute several times, the last element of the array contains the address of the element at the beginning.
Listing 2 shows the main loop code for ldint_l1. The code for ldint_l2 and ldint_mem is exactly the same except that the array size varies in order to obtain the desired use of the cache hierarchy. Specifically, ldint_l1 uses approximately 25 percent of the first-level cache and makes all loads hit in L1. The microbenchmark ldint_l2 fills the first-level cache, uses approximately 25 percent of the second-level cache, and makes all loads hit in L2. Finally, ldint_mem fills all the cache levels and makes all loads access main memory.
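The pointer-chasing structure described above can be illustrated with the following sketch, in which list indices stand in for C pointers. It is not the METbench code: the function names, the 8-byte element size, and the 32 KiB cache size used in the example are assumptions, and a Python loop is unsuitable for actual timing measurements.

```python
def build_chain(num_elements: int) -> list:
    """Build a circular chain: element i holds the index of the next element.

    The last element points back to the first, so the chain can be
    traversed any number of times.
    """
    if num_elements <= 0:
        raise ValueError("the chain needs at least one element")
    return [(i + 1) % num_elements for i in range(num_elements)]


def chase(chain: list, steps: int, start: int = 0) -> int:
    """Follow the chain for `steps` loads; each load depends on the previous one."""
    idx = start
    for _ in range(steps):
        idx = chain[idx]
    return idx


def elements_for_footprint(cache_bytes: int, fraction: float,
                           element_bytes: int = 8) -> int:
    """Number of elements whose total size is `fraction` of a cache."""
    return int(cache_bytes * fraction) // element_bytes


# Example: an array covering 25 percent of an assumed 32 KiB L1 cache.
l1_elements = elements_for_footprint(32 * 1024, 0.25)
final_index = chase(build_chain(l1_elements), steps=l1_elements)
```

Because every load address comes from the previous load, the loads cannot overlap, so the time per iteration reflects the latency of the cache level that holds the array.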
ANALYSIS OF RESULTS
In this section, we analyze the performance variation obtained with the software-controlled priority mechanism. First, we analyze the performance of microbenchmarks running in SMT mode with default priorities (priorities 4/4), then we analyze the malleability for threads running with higher and lower priorities. Subsequently, we show the effect of using the maximum priority difference in SMT (priorities 6 and 1). Finally, we show the malleability of benchmarks from SPEC CPU2006 suite.
Default Priorities
When running with default priorities (priorities 4/4), core resources are equally shared between threads. The default priority configuration is used to optimize throughput when knowledge about workload characteristics is not available. Threads running in SMT mode have lower performance compared to running in ST mode. Table 3 shows the average instructions per clock (IPC) decrement of each microbenchmark running in SMT mode with default priorities against all other microbenchmarks, with respect to running in ST mode. We observe the following:
- CPU-intensive microbenchmarks (cpu_fp_asm, cpu_int, cpu_int_add, and lng_chain) show a larger IPC decrement when they run in POWER5 than when they run in POWER6.
- Microbenchmarks with instruction dependencies and memory-bound microbenchmarks (ldint_l1, ldint_l2, ldint_mem) show a less significant IPC decrement, which is quite similar in POWER5 and in POWER6.
Based on the results, our main conclusions are:
- CPU-intensive microbenchmarks that exploit out-of-order execution are more affected by SMT execution in POWER5 than in POWER6.
- Microbenchmarks with instruction dependencies and memory-bound microbenchmarks cannot fully exploit execution resources in ST mode due to their intrinsic dependencies. Therefore, two threads of these types can overlap and efficiently use the execution units in SMT mode.
Malleability
Let $IPC_{ST}$ be the IPC that a given thread has when it runs in ST mode (single-thread mode [22]). In ST mode, all core resources are allocated to the only running thread. Let $IPC^{P/Q}_{SMT}$ be the IPC of the same thread when it runs in SMT mode with another thread, the first thread having priority P and the second thread having priority Q. For instance, $IPC^{4/4}_{SMT}$ is the IPC of that thread when it runs with another thread, both having priority 4 (default priority configuration).
We define SMT thread malleability (or simply malleability) as the ratio between the IPC in SMT mode with a given priority configuration and the IPC in SMT mode with default priority configuration:
$$\mathit{Malleability}^{P/Q} = \frac{IPC^{P/Q}_{SMT}}{IPC^{4/4}_{SMT}}$$
The highest IPC achievable by a thread is still $IPC_{ST}$, that is, for any priority configuration P/Q we have $IPC^{P/Q}_{SMT} \le IPC_{ST}$. Hence, the malleability of a thread is upper bounded by the IPC in ST mode normalized to the default priority configuration:
$$\mathit{Malleability}^{P/Q} \le \frac{IPC_{ST}}{IPC^{4/4}_{SMT}}$$
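As a worked illustration of the definition and its upper bound, the following sketch computes both quantities from measured IPC values. The function names and the IPC values are hypothetical and serve only to show the arithmetic.

```python
def malleability(ipc_smt_pq: float, ipc_smt_default: float) -> float:
    """IPC in SMT mode with priorities P/Q over IPC in SMT mode with priorities 4/4."""
    if ipc_smt_default <= 0:
        raise ValueError("the default-priority IPC must be positive")
    return ipc_smt_pq / ipc_smt_default


def malleability_upper_bound(ipc_st: float, ipc_smt_default: float) -> float:
    """Upper bound: single-thread IPC normalized to the default-priority IPC."""
    if ipc_smt_default <= 0:
        raise ValueError("the default-priority IPC must be positive")
    return ipc_st / ipc_smt_default


# Hypothetical IPC values for one thread (not measurements from this paper).
ipc_st, ipc_44, ipc_62 = 2.0, 1.0, 1.6
m = malleability(ipc_62, ipc_44)
bound = malleability_upper_bound(ipc_st, ipc_44)
```

With these values the thread speeds up by a factor of 1.6 under priorities 6/2, below the bound of 2.0. A thread whose default-priority IPC already equals its ST IPC has a bound of 1 and cannot benefit from a higher priority.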
We take the maximum malleability to be the one obtained using priorities 6/2; we exclude priority 1 because it is designed for low-power execution. Fig. 2 shows the correlation between the maximum malleability and the IPC in ST mode normalized to the IPC in SMT mode. The two axes report these two quantities, and each dot in the graph represents one pair of microbenchmarks.
As Fig. 2 shows, there is a clear positive correlation between these two variables (coefficient estimate $b = 1.19$ and coefficient of determination $R^2 = 0.97$). In fact, as explained before, the maximum performance that a task can obtain with priorities is upper bounded by the ST performance.
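For readers who want to reproduce this kind of analysis on their own measurements, the sketch below computes a least-squares slope and the coefficient of determination. It assumes an ordinary linear fit with an intercept; the text does not state which regression model was used, and the function name is introduced here for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a + b * x; returns (b, a, r_squared)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each take at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1.0 - ss_res / ss_tot
```

Each input point would correspond to one microbenchmark pair, with one coordinate per axis of the scatter plot.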
Higher Priority
In this section, we analyze the malleability of a thread when it runs in SMT mode with higher priority than the other thread. We use priorities in the range 6-2 because priority 1 is used for low-power mode, and it will be examined in detail in Section 5.5.
Graphs in Fig. 3 show a higher malleability on POWER5 compared to POWER6, when running CPU-intensive microbenchmarks. In POWER5, the thread speedup with higher priorities is up to six times, while in POWER6 it is less than two times. We can derive the following conclusions:
- The main reason for the lower impact of priorities on POWER6 is that the performance in SMT with priorities 4/4 is close to the upper bound.
- In POWER5, we observe a high speedup in cpu_int (Fig. 3a) and in cpu_int_add (Fig. 3b) when their priorities are increased and they run with cpu_int_mul. The reason is that cpu_int_mul executes integer multiplications that take several cycles. Because the rate at which cpu_int_mul instructions complete is lower than the rate at which they are fetched into the processor, it clogs the issue queue. As a result, cpu_int_mul stalls the execution of CPU-intensive microbenchmarks like cpu_int or cpu_int_add. When we increase the priority of CPU-intensive microbenchmarks, their decode rate is higher, and hence they are less affected by cpu_int_mul.
- In POWER5, we can observe a high speedup when CPU-intensive microbenchmarks run with ldint_l2 and we increase the priority of the CPU-intensive microbenchmarks (Figs. 3a and 3b). The reason is that, with priorities 4/4, ldint_l2 fills the load/store queue with high-latency loads and prevents any other instruction from being dispatched. Consequently, using higher priorities for the CPU-intensive microbenchmarks when they run with ldint_l2 results in a high speedup. However, the same behavior cannot be observed when CPU-intensive microbenchmarks run with ldint_l1, because load operations in ldint_l1 have a lower latency and hence do not clog the load/store queue. This behavior is also not observed when running with ldint_mem, because of the automatic throttling mechanism [14] triggered by in-flight L2 misses. The same phenomenon happens when ldint_l1 runs together with ldint_l2 (Fig. 3f). Finally, in POWER6 this phenomenon cannot be observed, because it is an in-order design. In fact, instructions in POWER6 can execute in the fixed point units even if the load/store queue is clogged. As a result, in POWER6 ldint_l2 does not affect CPU-intensive microbenchmarks as it does in POWER5.
- For POWER6, we observe that the maximum speedup with priorities is obtained when executing two copies of microbenchmarks that mainly use a single functional unit (cpu_int_add in Fig. 3b and cpu_int_mul in Fig. 3c).
For cpu_fp_asm, ldint_l1, ldint_l2, and ldint_mem in Fig. 3, we observe the following:
- In POWER6, for most of the microbenchmarks there is no speedup, because they reach the upper bound performance (performance in ST mode) with priorities 4/4 (Figs. 3e, 3g, and 3h).
- In POWER6, there is a speedup as we increase the priority of ldint_l1 when it runs with cpu_int_mul (Fig. 3f), because ldint_l1 uses the fixed point unit to compute the effective address [15]. Since cpu_int_mul uses the fixed point unit with long-latency operations, it competes with ldint_l1 for this resource. As we increase the priority of ldint_l1, it gets access to this resource more frequently, hence improving its performance. This is not observed in POWER5 (Fig. 3f) because the effective address is computed through a dedicated adder inside the load/store unit.
- In POWER5, as we increase the priority of ldint_l1 when it runs with ldint_l2, we observe a high speedup (Fig. 3f). This is because ldint_l2 completely fills the L1 cache, evicting the data of ldint_l1. By increasing the priority of ldint_l1, we increase its cache access frequency, hence reducing the effect of ldint_l2. This speedup cannot be seen when ldint_l1 runs with ldint_mem. The main reason is that ldint_mem has a lower cache access frequency per cycle, as every load has to go to main memory. As a result, the cache lines belonging to ldint_l1 are more frequently accessed and thus are not evicted by the least recently used (LRU) replacement policy.
- In POWER6, there is only a small ldint_l2 speedup, because in most of the cases we already reach the upper bound (ST mode) with priorities 4/4.
- In POWER5, the speedup of ldint_l2 running with ldint_mem is due to the fact that ldint_mem completely fills the L2 cache, thus increasing the number of L2 misses of the former and hence reducing its performance.
- In POWER5, the speedup of ldint_l2 running with itself is lower than when running with ldint_mem. Since ldint_l2 uses only 25 percent of the L2 cache, two running copies of ldint_l2 can fit in the L2 cache. On the other hand, because ldint_mem completely fills the L2 cache, it considerably affects the performance of ldint_l2.
- In POWER5 and POWER6, ldint_mem (Fig. 3h) is almost insensitive to a higher priority. This confirms the observation that microbenchmarks with very low IPC cannot be improved using priorities, since with priorities 4/4 the upper bound (IPC in single-thread mode) is already reached.
[Fig. 3 caption fragment: the x-axis is the hardware priority for the primary and secondary threads (primary-priority/secondary-priority). Note the different scale for cpu_int_add and ldint_l2.]
Lower Priority
In this section, we present the malleability of a thread running with lower priority than the other thread. We consider the range of priorities 6-2, while priority 1, because of its special behavior (low-power mode), will be examined in Section 5.5. Fig. 4 shows that lower priorities significantly affect the performance of all microbenchmarks. Microbenchmarks cpu_int, cpu_int_add, cpu_int_mul, cpu_fp_asm, lng_chain, and ldint_l1 in Fig. 4 show that thread slowdowns are of the same order of magnitude in POWER5 and POWER6.
Microbenchmarks ldint_l2 and ldint_mem in Fig. 4 show that in POWER6 lower priorities have a smaller impact than in POWER5. Note also the higher impact observed in POWER5 with priorities 3/6 and 2/6 (priority difference ≥ 3) when running with a memory-bound microbenchmark. This behavior is not observed in POWER6.
Based on the results, we can conclude the following:
• Low-IPC microbenchmarks are less affected by changing thread priority. For instance, ldint_l2 is less affected than ldint_l1, and ldint_mem is less affected than ldint_l2.
• The use of lower priorities with memory-bound benchmarks leads to a smaller impact in POWER6 than in POWER5; this confirms the lower level of thread resource sharing in the POWER6 microarchitecture compared to POWER5.
• In POWER5, a microbenchmark running against ldint_l2 or ldint_mem with priorities 3/6 and 2/6 (priority difference ≥ 3) shows a significant slowdown, while this cannot be observed in POWER6.
Maximum Priority Difference
The maximum priority difference in SMT is obtained when one thread has priority 6 and the other priority 1. The use of priorities 6/1 has an interesting effect: the thread with priority 6 shows a performance close to its single-thread mode. This result means that the priority mechanism can be used to provide an SMT configuration where we can run a background thread with minimum effect on the foreground thread. Graphs in Fig. 5 show the execution time (y-axis) of the primary thread when running with different secondary threads (x-axis) using priorities 6/1, for POWER5 and POWER6. Values are normalized to the primary thread ST execution time. In POWER6, the performance impact on the primary thread is almost zero except when ldint_l2 or ldint_mem runs with another memory-intensive microbenchmark, mostly due to interactions at cache and memory levels. This shows that a thread can run in background without significantly affecting the primary thread. Table 4 shows the performance of the secondary thread when running with priority 1 as a percentage of its single-thread performance. On POWER5, ldint_l2 and ldint_mem achieve 19.09 and 86.57 percent of their single-thread performance, while on POWER6, ldint_l2 and ldint_mem achieve, respectively, 3.64 and 53.79 percent of their single-thread performance. For both machines, while CPU-intensive microbenchmarks report low performance with priority 1, ldint_mem maintains significant performance even when running with priority 1.
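The two metrics used in Fig. 5 and Table 4 can be written down as a minimal sketch. The function names and the sample measurements below are illustrative assumptions; they are not the tools or the values used in the experiments above.

```python
def normalized_execution_time(t_smt: float, t_st: float) -> float:
    """Primary-thread execution time in SMT mode, normalized to its ST time.

    A value of 1.0 means the secondary thread has no visible effect.
    """
    if t_st <= 0:
        raise ValueError("single-thread time must be positive")
    return t_smt / t_st


def percent_of_st_performance(perf_smt: float, perf_st: float) -> float:
    """Secondary-thread performance (e.g., IPC) as a percentage of its ST value."""
    if perf_st <= 0:
        raise ValueError("single-thread performance must be positive")
    return 100.0 * perf_smt / perf_st


# Hypothetical measurements (seconds and IPC), chosen only for illustration.
primary_norm = normalized_execution_time(t_smt=10.5, t_st=10.0)
secondary_pct = percent_of_st_performance(perf_smt=0.2, perf_st=0.8)
print(f"primary normalized time: {primary_norm:.2f}")
print(f"secondary thread: {secondary_pct:.1f} percent of ST performance")
```

A normalized time close to 1.0 for the primary thread, together with a nonzero percentage for the secondary thread, corresponds to the background-thread scenario described above.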
Malleability of SPEC CPU2006
The two primary uses of software-controlled priorities are providing imbalanced thread execution, as needed by the applications, and improving instruction throughput. In an imbalanced thread execution, software can control core resource allocation to improve a given target metric, for instance, by enabling faster execution of higher-priority jobs or by implementing load balancing [3]. To achieve higher throughput, software can intentionally imbalance SMT resource sharing to improve the performance of the primary thread, without significantly reducing the performance of the secondary thread. For instance, when a CPU-intensive thread is running together with a memory-bound thread, throughput can be improved by providing more resources to the CPU-intensive thread.
In order to reduce hardware resource contention, high-IPC loads can be paired with low-IPC loads on the same core. As shown in previous sections, the effect of hardware priorities on memory-bound microbenchmarks is smaller than the effect on CPU-intensive microbenchmarks. In this experiment, we run pairs in which the primary thread is high-IPC and the secondary thread is low-IPC. Based on the benchmark profile, we used bzip, cactusADM, calculix, and h264ref as high-IPC benchmarks and mcf, milc, and omnetpp as low-IPC benchmarks.
[Figure caption fragment: the X-axis is the hardware priority for the primary and secondary threads (primary-priority/secondary-priority).]
In these experiments, we focus on the effects of higher priorities on the primary thread, assuming that the performance requirements of the secondary threads are subordinate to the performance requirements of the primary thread. Fig. 6 shows the speedup of the primary thread as we increase its priority with respect to the secondary thread. Using priorities 6/2 (primary thread priority is 6 and secondary thread priority is 2), the primary thread in POWER5 obtains a speedup of up to 1.70 times the performance with default priorities, while in POWER6 the speedup is up to 1.18 times the performance with default priorities.
Overall, hardware-thread priorities can be used when threads in a core present different hardware resource use, in particular when the primary thread is CPU intensive and the secondary thread is memory bound. In this situation, we increase the primary thread malleability without affecting the overall throughput.
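The pairing experiment can be sketched as an enumeration of (primary, secondary, priority) runs plus a speedup computed against the default 4/4 configuration. The benchmark names come from the text; the exact list of priority pairs swept between 4/4 and 6/2 and all function names are assumptions made for illustration.

```python
from itertools import product

HIGH_IPC = ["bzip", "cactusADM", "calculix", "h264ref"]  # primary threads
LOW_IPC = ["mcf", "milc", "omnetpp"]                      # secondary threads

# Assumed sweep from the default 4/4 up to 6/2 (primary/secondary).
PRIORITY_PAIRS = [(4, 4), (5, 4), (6, 4), (6, 3), (6, 2)]


def experiment_matrix():
    """Return every (primary, secondary, (prio_primary, prio_secondary)) run."""
    return [
        (primary, secondary, prio)
        for primary, secondary in product(HIGH_IPC, LOW_IPC)
        for prio in PRIORITY_PAIRS
    ]


def speedup(perf_at_priority: float, perf_at_default: float) -> float:
    """Primary-thread speedup relative to its performance with priorities 4/4."""
    if perf_at_default <= 0:
        raise ValueError("default-priority performance must be positive")
    return perf_at_priority / perf_at_default


runs = experiment_matrix()
print(f"{len(runs)} runs, first: {runs[0]}, last: {runs[-1]}")
```

Each run would then be measured on the target machine, and `speedup` applied to the primary thread's performance at each priority pair.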
USE CASES
In this section, we present two use cases of hardware-thread priorities. Our objective is to show that, even if not for all kinds of workloads, this feature can be effectively used to improve load balancing (use case A) and to implement transparent threads (use case B). The applications we use are taken from two different domains: a benchmark from the NAS Parallel Benchmarks [19] and six benchmarks from the SPEC CPU2006 suite [24].
Use Case A-Load Balancing
This use case shows how to use hardware-thread priorities to reduce parallel applications' execution time.
Block Tridiagonal (also called BT) is a benchmark from the NAS Parallel Benchmarks. BT is designed to solve discretized versions of the Navier-Stokes equations in three dimensions and uses a structured discretization mesh. BT Multizone (BT-MZ) [13] is a version of the same benchmark that uses several meshes (also called zones), because a single mesh is often not enough to describe a realistic complex domain. When BT-MZ runs on either POWER5 or POWER6, its Message Passing Interface (MPI) processes are imbalanced: during each iteration, MPI processes have to wait for the last process to complete, thus spending a significant fraction of time in the waiting state without performing any useful work.
To balance the application, tasks having high waiting time can be paired with tasks having low waiting time (bottlenecks), then scheduled on the same SMT core. Then, hardware-thread priorities of tasks with low waiting time can be increased, to reduce the overall waiting time.
To balance BT-MZ, we run processes 1 and 4 on the first core and processes 2 and 3 on the second core. We found that the best combination of priorities is 4/5 for the first core and 4/6 for the second core. This configuration allows BT-MZ to be better balanced on both architectures. Tables 5 and 6 show the breakdown of MPI states when BT-MZ runs with the original configuration and with the balanced configuration, on POWER5 and POWER6, respectively. The column running refers to the percentage of time the process is effectively running on the core, waiting refers to the percentage of time spent waiting for a synchronization and others refers to other MPI states with negligible contribution to the total time. The percentage of time a process is in waiting state decreases when BT-MZ is executed with the balanced configuration. Consequently, the execution time is reduced by 11.4 percent on POWER5 and by 16 percent on POWER6.
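The pairing strategy described above can be sketched as follows. The waiting-time percentages are made up for illustration (they are not taken from Tables 5 and 6), the function names are hypothetical, and the priority boosts are tuning parameters that mirror the 4/5 and 4/6 settings only as an example; which process in each pair is the bottleneck is likewise an assumption.

```python
DEFAULT_PRIORITY = 4


def pair_by_waiting_time(waiting):
    """Pair the process that waits most with the one that waits least.

    `waiting` maps a process id to its waiting-time percentage.
    Returns a list of (waiter, bottleneck) tuples, one per SMT core.
    """
    ordered = sorted(waiting, key=waiting.get)  # ascending waiting time
    pairs = []
    while len(ordered) >= 2:
        bottleneck = ordered.pop(0)  # lowest waiting time
        waiter = ordered.pop()       # highest waiting time
        pairs.append((waiter, bottleneck))
    return pairs


def assign_priorities(pairs, boosts):
    """Keep the waiter at the default priority and boost the bottleneck."""
    plan = []
    for core, ((waiter, bottleneck), boost) in enumerate(zip(pairs, boosts)):
        plan.append({
            "core": core,
            "priorities": {
                waiter: DEFAULT_PRIORITY,
                bottleneck: DEFAULT_PRIORITY + boost,
            },
        })
    return plan


# Hypothetical waiting-time percentages for four MPI processes.
waiting_pct = {1: 60.0, 2: 40.0, 3: 20.0, 4: 5.0}
pairs = pair_by_waiting_time(waiting_pct)
plan = assign_priorities(pairs, boosts=[1, 2])
for entry in plan:
    print(entry)
```

In practice the boosts would be found empirically, as was done for BT-MZ, by rerunning the application and checking how the waiting time changes.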
Use Case B-Transparent Threads
Dorai and Yeung [7] propose transparent threads: an SMT resource allocation policy that allows the background thread to use resources not required by the foreground thread. The objective is to obtain minimum performance degradation of the foreground thread compared to when it runs in single-thread mode. In POWER5 and POWER6, this can be achieved using priority 6 for the foreground thread and 1 for the background thread. Potential uses are for instance in garbage collection, prefetching, virus scanning, file indexing, defragmentation, or other low-priority kernel tasks.
The characterization with microbenchmarks described in Section 5.5 shows that transparent threading is more effective when the background thread is a memory-bound thread. To this end, we select six benchmarks from the SPEC CPU2006 benchmark suite: three CPU-intensive benchmarks to be used as foreground threads (bzip, cactusADM, and calculix) and three memory-bound benchmarks to be used as background threads (mcf, milc, and omnetpp). Fig. 7a reports the performance of the foreground thread using transparent thread execution with respect to its performance when running in isolation on POWER5 and POWER6. As shown in Fig. 7a, the use of transparent threads is particularly effective on POWER6, with a performance degradation of up to 5.5 percent for the selected benchmarks. On the other hand, due to the higher level of thread resource sharing, using transparent threads on POWER5 leads to a performance degradation of up to 20.86 percent. This result confirms the different effect of hardware-thread priorities on POWER5 and POWER6 and leads us to conclude that the POWER6 architecture design is better suited to exploit transparent execution. Fig. 7b reports the performance of the background thread using transparent thread execution with respect to its performance when running in isolation. As Fig. 7b shows, the degradation of the background thread is considerable, especially on POWER6. This nonetheless should not be considered a drawback, given that the purpose of transparent execution is to run a thread in background that does not have performance requirements.
CONCLUSIONS
In this paper, we characterize the software-controlled hardware-priority mechanism for IBM POWER5 and POWER6, based on the use of microbenchmarks.
We use a systematic approach in which we execute experiments with all the priority combinations and with different running modes (ST and SMT). With this methodology, we obtain several architectural insights that explain different behaviors of the thread prioritization mechanism on POWER5 and POWER6. The main conclusions are the following:
• The use of priorities generally leads to a smaller performance difference between ST and SMT modes in POWER6 than in POWER5, mostly due to the absence of out-of-order execution on POWER6. Since in POWER6 the per-thread SMT malleability is smaller than in POWER5, increasing the priority of a thread generally leads to a smaller speedup than in POWER5.
• On both processors, we have confirmed the correlation between high IPC and high sensitivity to priorities.
• In POWER5 with a priority difference greater than or equal to 3, there is a significant malleability of the memory-bound threads. Therefore, performance tuning using priority differences greater than or equal to 3 should be performed with a good understanding of the workload's memory behavior.
• We empirically measure the correlation of the malleability with the performance variation between SMT and single-thread execution.
• We show that hardware priorities can be used to improve load balancing for parallel applications: the execution of BT-MZ (NAS benchmarks) with a balanced configuration obtains an execution time reduction of 11.4 percent on POWER5 and of 16 percent on POWER6.
• We evaluate transparent execution, a mechanism that allows the foreground thread to run in SMT mode with performance close to single-thread mode. With applications from the SPEC CPU2006 benchmark suite, the foreground thread reaches up to 94 percent of its performance in single-thread mode running on POWER5, and up to 99 percent running on POWER6.

As future work, we plan to study POWER7 which, like its predecessors, also features hardware-thread priorities.
Overall, we believe this study can be useful to the OS community and to other software communities to tune software performance by exploiting the software-controlled priority mechanism of current and future SMT processors.
