To realize the performance potential of multicore systems, we must effectively manage the interactions between memory reference behavior and the operating system policies for thread scheduling and migration decisions. We observe that these interactions lead to significant variations in the performance of a given application, from one execution to the next, even when the program input remains unchanged and no other applications are being run on the system. Our experiments with multithreaded programs, including the TATP database application, SPECjbb2005, and a subset of PARSEC and SPEC OMP programs, on a 24-core Dell PowerEdge R905 server running OpenSolaris confirm this observation. In this work we develop Thread Tranquilizer, an automatic technique for simultaneously reducing performance variation and improving performance by dynamically choosing appropriate memory allocation and process scheduling policies. Thread Tranquilizer uses simple utilities available on modern operating systems to monitor cache misses and thread context-switches, and then uses the collected information to dynamically select appropriate memory allocation and scheduling policies. In our experiments, Thread Tranquilizer yields up to 98% (average 68%) reduction in performance variation and up to 43% (average 15%) improvement in performance over the default policies of OpenSolaris. We also demonstrate that Thread Tranquilizer simultaneously reduces performance variation and improves performance of the programs on Linux. Thread Tranquilizer is easy to use as it does not require any changes to the application source code or the OS kernel.
INTRODUCTION
The advent of multicore systems presents an attractive opportunity for achieving high performance on a wide range of applications. However, due to non-uniform memory access (NUMA) latency, the performance of applications on a multicore system is highly sensitive to operating system (OS) policies. The impact of OS policies on performance makes consistently maximizing performance a significant challenge. Even after an application has been tuned, it may exhibit significantly different levels of performance from one execution to the next. This is demonstrated by the results of the following experiment. We executed 15 multithreaded programs, including the TATP database application, SPECjbb2005, as well as programs from the PARSEC and SPEC OMP suites, on a 24-core Dell PowerEdge R905 server running OpenSolaris.2009.06. Each program was executed 10 times while no other applications were being run. As shown in Table I, we observed significant variation in the performance of each program over its ten executions. The table provides the minimum and maximum execution times observed, the standard deviation (SD) for execution time, the minimum and maximum speedups achieved, and the percentage difference between Max Speedup and Min Speedup (% Diff). Table I also shows the type of each program: memory-intensive (Mem) or CPU-intensive (CPU).
Table I. Performance Variation of the Programs. Programs are executed with OPT threads, where OPT threads is the minimum number of threads that gives the best performance on our 24-core machine. Speedup is relative to the serial version of the programs. Performance of TATP is expressed in transactions per second (throughput) and speedup is relative to the single-client throughput. Performance of SPECjbb2005 is expressed in 'SPECjbb2005 bops' (throughput).
In the above experiment, the number of application threads (OPT threads) created for each application was chosen such that it maximized the observed performance on our 24-core machine.
As shown in Table I, most of the programs exhibit significant performance variation. For example, the standard deviation of the performance of streamcluster is 10.2. Since about 68% of a normal distribution lies within one standard deviation of the mean, a standard deviation of 10.2 means that there is a 32% chance that the performance will lie more than ±4.8% away from the mean. Minimizing performance variation while simultaneously maximizing performance is clearly beneficial. Elimination of performance variation has another advantage. Many optimization techniques for improving performance or optimizing power consumption [Merkel et al. 2010; Zhuravlev et al. 2010; Dhiman et al. 2010] on multicore machines rely on performance monitoring data. High variation in performance degrades the accuracy of the information collected and hence the benefits of the optimization techniques. Moreover, with high performance variation we must use multiple runs of the target programs to collect average performance values, whereas with low performance variation one may collect information of the same quality from fewer runs.
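The 68% / 32% split quoted above follows from the normal distribution and can be checked numerically. The sketch below is illustrative only: the function names are ours, and the mean passed to relative_band is a caller-supplied value because the mean for streamcluster is not stated in this section.

```python
import math


def prob_outside(k_sigma: float) -> float:
    """Two-sided probability that a normally distributed value lies more
    than k_sigma standard deviations away from its mean."""
    return 1.0 - math.erf(k_sigma / math.sqrt(2.0))


def relative_band(sd: float, mean: float) -> float:
    """Width of one standard deviation expressed as a percentage of the mean."""
    return 100.0 * sd / mean


if __name__ == "__main__":
    # Roughly 0.32: the chance of landing outside one standard deviation.
    print(f"P(outside 1 SD) = {prob_outside(1.0):.3f}")
```

The calculation assumes that run times are approximately normally distributed, which the text also assumes.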
To find the causes of performance variation, we conducted a performance variation study of 15 multithreaded programs. Intuitively, the cause of performance variation of an application depends upon the kind of resources the application requires. Since these 15 programs mainly exploit the CPU and main memory (not I/O), in this paper we study performance variation under different OS memory allocation and scheduling policies and then identify the policies that lead to high performance and low performance variation.
An OS scheduler migrates threads from one core to another to balance the load across the cores. However, thread migrations are expensive events as they cause a thread to pull its working set into cold caches, often at the expense of other threads. The impact of thread migrations on the performance of memory-intensive programs is significant because of their large working sets, the high cache miss-rates caused by migrations, and non-uniform memory access latency. The negative impact of thread migrations on performance variation is worse when improper memory allocation policies are used. For example, memory-intensive programs experience a high cache miss-rate, and high variation in cache miss-rate, due to migrations when the default next-touch memory allocation policy is used.
For CPU-intensive programs, thread context-switches significantly impact performance. Thread context-switches can be involuntary or voluntary. Involuntary context-switches (ICX) happen when a thread is taken off a core due to expiration of its time quantum or preemption by a higher-priority thread. Voluntary context-switches (VCX) are the result of blocking system calls (e.g., for I/O) or of a thread failing to acquire a lock. CPU-intensive programs exhibit a higher ICX-rate than memory-intensive programs. With the default Time Share (TS) scheduling policy, the priorities of threads change very frequently to adjust the corresponding time quantum for balancing load and providing fairness in scheduling. Both the ICX-rate and the variation in ICX-rate increase due to these frequent changes in priorities under TS scheduling. The variation in ICX-rate significantly impacts the performance variation of CPU-intensive programs.
In summary, the variation in cache miss-rate is the major cause of performance variation of memory-intensive programs and the variation in ICX-rate is the major cause of performance variation of CPU-intensive programs. Other events, such as kernel intervention, system processes, and voluntary context-switches, further increase performance variation. In this work, we focus on reducing the negative impact of thread migrations on performance without preventing thread migrations, because on a machine with a large number of cores (e.g., a 24-core machine) thread migrations lead to better performance (see Section 3.2.1). Consider the model in which we bind one thread per core and thus eliminate thread migrations. This approach is effective for a small number of cores but not for machines with a large number of cores. From extensive experimentation on our 24-core machine we found that with the one-thread-per-core model most programs perform significantly worse than with the default policies that permit thread migrations.
Based upon the above observations we develop the Thread Tranquilizer framework, which uses simple utilities available on modern operating systems to dynamically monitor application behavior and adapt execution to reduce performance variation. Thread Tranquilizer uses the cputrack(1) utility to monitor cache miss-rate and the mpstat(1) utility to monitor ICX-rate. Based on these two events, Thread Tranquilizer dynamically applies proper memory allocation and scheduling policies to achieve high performance and low variation in performance. The overhead of Thread Tranquilizer is negligible and it requires no changes to the application source code or the OS kernel.
The remainder of the paper is organized as follows. The causes of performance variation are identified in Section 2 through an in-depth performance analysis of several multithreaded programs on a 24-core multicore system. We show that the combination of Random or Round-Robin memory allocation and Fixed-Priority (FX) scheduling policies simultaneously reduces performance variation and improves performance of memory-intensive programs. The FX policy simultaneously reduces performance variation and improves performance of CPU-intensive programs. In Section 3 we present Thread Tranquilizer, a framework that uses simple utilities available on modern operating systems to monitor the cache miss-rate and ICX-rate of a target program and, based on these events, dynamically applies appropriate memory allocation and scheduling policies. It yields up to 98% (on average 68%) reduction in performance variation and up to 43% (on average 15%) improvement in performance over the default policies of OpenSolaris. On Linux, Thread Tranquilizer simultaneously reduces performance variation by up to 91% and improves performance by up to 53%. Related work and conclusions are given in Sections 4 and 5.
PERFORMANCE VARIATION STUDY
The reasons for performance variation of an application depend upon the kind of resources it uses. In this work we use benchmark programs that stress mainly the CPU and main memory. Therefore, we analyze OS policies that affect the usage of the CPU and the memory hierarchy. We discuss how different memory allocation policies, along with thread migrations, affect the performance of memory-intensive programs and then study the effect of CPU scheduling policies on the performance of CPU-intensive programs. In our discussion, cache miss-rate refers to the last-level cache miss-rate. Before presenting our results we describe the experimental setup used.
Experimental Setup
2.1.1. Target Machine and Benchmarks. Our experimental setup consists of a Dell PowerEdge R905 server whose configuration is shown in Table II. This machine has 24 cores. It is a ccNUMA machine in which a remote memory access takes about 2.5 times as long as a local access.
We use a total of 15 benchmark programs: eight programs are from PARSEC [Bienia et al. 2008]. In this work we ran each experiment 10 times and we present standard deviation results for the ten runs. We could not use some of the PARSEC and SPEC OMP programs for the following reasons. Since Thread Tranquilizer takes around five seconds for one pass, the running time of the input program should be longer than five seconds. In the blackscholes and dedup PARSEC programs, however, the main thread runs for most of the time and the worker threads run for only a fraction of it. We were unable to compile raytrace and vips of PARSEC, and fma3d, art, applu, and ammp of SPEC OMP, on OpenSolaris. The remaining programs of SPEC OMP (galgel, apsi) do not scale beyond 6 threads.
Tuning the Programs. Since we are conducting a study on a machine with 24 cores, we first examined the programs to see if their implementations require any tuning consistent with the use of a large number of cores. We manually tuned the programs using the following three techniques before using them for performance variation analysis. These tuning techniques significantly improved performance and reduced performance variation. The performance of the tuned versions of the programs is considered as the baseline in the evaluation of Thread Tranquilizer.
(1) Lock-Contention and Libmtmalloc. For programs that make extensive use of heap memory, to avoid the high overhead of malloc, we used the libmtmalloc library to allow multiple threads to access the heap concurrently [Attardi and Nadgir 2003]. We also studied the impact of the memory allocators libumem [Benson 2003] and libhoard [Berger et al. 2000] on performance and found that libmtmalloc performs better.
(2) TLB Misses and Larger Pages. Larger page sizes for the processor's memory management unit allow more efficient use of the TLB, ultimately resulting in improved application performance [Boyd-Wickizer et al. 2010]. Larger pages reduce TLB misses, which in turn reduces the kernel intervention needed to serve the TLB misses, and thus improve performance and reduce performance variation. Since our machine only supports 2MB pages along with the default 4KB pages, we used 2MB pages for tuning memory-intensive programs.
(3) Balancing Input Load across Worker Threads. In some of the applications the input load is not distributed evenly across the worker threads. In particular, when the load is not divisible by the number of threads, the extra load is assigned to a single thread. This uneven distribution of load increases performance variation. We improved the load distribution code so that the load is spread more evenly across the worker threads. For example, consider the swaptions program from PARSEC with an input load of 128 swaptions. In the original code, when 24 threads are used, the input load of 128 swaptions is distributed as follows: five swaptions each for 23 threads, and 13 swaptions for the 24th thread. We improved the input-load distribution code so that it assigns six swaptions each to eight threads and five swaptions each to the remaining 16 threads. This simple modification simultaneously reduced performance variation and improved performance.
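The two load distributions described for swaptions can be reproduced with a few lines of code. This is a minimal sketch of the idea, not the PARSEC source: the function names are hypothetical, and only the item counts (128 swaptions, 24 threads) come from the text.

```python
from typing import List


def original_split(items: int, threads: int) -> List[int]:
    """Every thread gets the integer quotient; the last thread also
    receives the entire remainder."""
    counts = [items // threads] * threads
    counts[-1] += items % threads
    return counts


def balanced_split(items: int, threads: int) -> List[int]:
    """Spread the remainder one item at a time over the first threads, so
    that no two threads differ by more than one item."""
    base, extra = divmod(items, threads)
    return [base + 1 if i < extra else base for i in range(threads)]


if __name__ == "__main__":
    print(original_split(128, 24))  # 23 threads with 5 items, one with 13
    print(balanced_split(128, 24))  # 8 threads with 6 items, 16 with 5
```

With the balanced split, the most heavily loaded thread processes 6 swaptions instead of 13, so the slowest thread finishes much closer to the others.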
2.1.2. Operating System. We carried out the performance variation study using OpenSolaris.2009.06 as it provides a rich set of tools to examine and understand the behavior of programs. The memory placement optimization feature and chip multithreading optimizations allow OpenSolaris to support hardware with asymmetric memory hierarchies, such as cache coherent NUMA systems and systems with chip-level multithreading and multiprocessing. To capture the distance between different CPUs and memories, a new abstraction called "locality group (lgroup)" has been introduced in OpenSolaris. Lgroups are organized into a hierarchy that represents the latency topology of the machine.
Thread Migrations & Memory Allocation Policies
The OS scheduler migrates threads from one core to another to balance the load across the cores. Thread migrations are expensive as they cause a thread to pull its working set into cold caches, often at the expense of other threads. On NUMA machines the negative impact of thread migrations is even higher because of the variation in memory latency, and memory-intensive applications are affected the most.
To understand the impact of thread migrations on our machine, we created two single-threaded microbenchmarks: one program is CPU-intensive, as it executes arithmetic operations in a loop, and the other is memory-intensive, as it creates several large arrays using malloc and then reads and writes to them. We ran these programs for the same amount of time (nearly 16 seconds) 100 times in two different configurations: the no-migration configuration, where we bind the program to a single core; and the allow-migration configuration, where we leave control of the program to the OS scheduler without binding it to a core, so that thread migrations under default policies are possible. We used DTrace scripts [Cantrill et al. 2004] to find the number of migrations experienced by these two programs and also to measure the average time a thread takes to migrate. In Figure 1, the table shows the average and standard deviation of the execution time and the average number of migrations per run for these programs. Using DTrace scripts we find that a thread migration takes on average around 100 μs on our machine. Therefore, the migration cost from the OS side is around 420 μs for the CPU-intensive program, but the program experiences an overhead of around 8000 μs. For the memory-intensive program, the system overhead due to thread migrations is around 330 μs, but the program experiences an overhead of around 459000 μs. This simple experiment shows that the impact of thread migration on the performance and performance variation of the memory-intensive program is much higher than on the CPU-intensive program.
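The gap between the OS-side migration cost and the overhead seen by the program can be made explicit with a short calculation. The sketch below only uses the numbers quoted above; the migration counts are implied by dividing the OS-side cost by the 100 μs per-migration figure and are not read from Figure 1.

```python
MIGRATION_COST_US = 100.0  # average cost of one migration, measured with DTrace


def implied_migrations(os_side_cost_us: float) -> float:
    """Average number of migrations per run implied by the OS-side cost."""
    return os_side_cost_us / MIGRATION_COST_US


def amplification(observed_overhead_us: float, os_side_cost_us: float) -> float:
    """How many times larger the overhead seen by the program is than the
    cost of the migrations themselves."""
    return observed_overhead_us / os_side_cost_us


if __name__ == "__main__":
    for name, os_cost, observed in (("CPU-intensive", 420.0, 8000.0),
                                    ("memory-intensive", 330.0, 459000.0)):
        print(f"{name}: ~{implied_migrations(os_cost):.1f} migrations/run, "
              f"overhead amplified ~{amplification(observed, os_cost):.0f}x")
```

The amplification factor is about 19 for the CPU-intensive program and about 1391 for the memory-intensive program, which is why the cost of refilling caches and accessing remote memory dominates the cost of the migration itself.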
Figure 1 also shows that the memory-intensive program experiences significantly larger performance variation than the CPU-intensive program. To minimize thread migration overhead and preserve locality awareness, OpenSolaris tries to migrate threads among the cores belonging to the same chip. Indeed, in this experiment the thread migrations happened among the cores of the same chip. However, when a program is running with more threads than there are cores, we can expect the OS to migrate threads from one chip to another to balance the load across the cores. This will further increase the migration cost and degrade the speedups.
2.2.1. Next-Touch (the Default Policy). The memory allocation policies significantly affect the impact of thread migrations on performance. The key to delivering performance on a NUMA system is to ensure that physical memory is allocated close to the threads that are expected to access it. The next-touch policy is based on this fact and it is the default policy on OpenSolaris. Thus, memory allocation defaults to the home lgroup of the thread performing the memory allocation. Under the next-touch policy, a memory-intensive thread can experience high memory latency overhead, a high cache miss-rate, and most importantly high variance in cache miss-rate when it is started on one core and migrated to another core that is not in its home locality group. This also leads to HyperTransport traffic, which degrades performance due to the high variation in memory latency of a NUMA system. Moreover, lock contention, I/O, and memory-demanding behavior cause an increase in thread migrations. The thread migrations cause changes in thread priorities, which further increases the variation in performance.
2.2.2. Random and Round-Robin Policies. While next-touch is the default memory allocation policy for private memory (heap and stack), the random allocation policy is the default policy for shared memory with explicit sharing when the size of the shared memory is beyond the default threshold value of 8MB. This threshold is set based on the communication characteristics of Message Passing Interface (MPI) programs. Therefore, it is not guaranteed that the random policy will always be applied to the shared memory of multithreaded programs that are based on pthreads and OpenMP. If the shared memory is less than 8MB, then the next-touch policy is the default for the shared memory as well. More importantly, programs with huge private memory (e.g., the heap size for facesim is around 306MB) benefit dramatically from the Random/RR policies rather than the default next-touch policy. Instead of using the next-touch policy for private memory, we use the Random or Round-Robin (RR) policies for both the private and the shared memory of memory-intensive multithreaded programs. Table III lists these memory allocation policies. The RR policy allocates a page from each leaf lgroup in round-robin order. Random memory allocation picks a random leaf lgroup for each page. Therefore, both the RR and Random policies eventually allocate memory across all the leaf lgroups, and the threads of memory-intensive workloads get a chance to reuse the data in both private and shared memory. This reduces the cache miss-rate and the memory latency penalty. These policies optimize for bandwidth while trying to minimize average latency for the threads accessing the memory throughout the NUMA system. They spread the memory across as many memory banks as possible, distributing the load across many memory controllers and bus interfaces, thereby preventing any single component from becoming a performance-limiting hot spot.
Moreover, random placement improves the reproducibility of performance measurements by ensuring that the relative locality of threads and memory remains roughly constant across multiple runs of an application. Therefore, the RR or Random policies minimize cache miss-rate and, more importantly, variation in cache miss-rate.
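The effect of the three placement policies on locality can be illustrated with a toy model. This is a simulation sketch, not OpenSolaris code: the function names are ours, and the choice of four leaf lgroups is an assumption made only for the example.

```python
import random
from typing import List


def next_touch_placement(num_pages: int, home_lgroup: int) -> List[int]:
    """All pages land in the home lgroup of the allocating thread."""
    return [home_lgroup] * num_pages


def round_robin_placement(num_pages: int, num_lgroups: int) -> List[int]:
    """Page i is placed in leaf lgroup i mod num_lgroups."""
    return [page % num_lgroups for page in range(num_pages)]


def random_placement(num_pages: int, num_lgroups: int, seed: int = 0) -> List[int]:
    """Each page is placed in a uniformly random leaf lgroup."""
    rng = random.Random(seed)
    return [rng.randrange(num_lgroups) for _ in range(num_pages)]


def local_fraction(placement: List[int], running_on: int) -> float:
    """Fraction of pages that are local to the lgroup the thread runs on."""
    return placement.count(running_on) / len(placement)


if __name__ == "__main__":
    lgroups, pages = 4, 10000  # hypothetical sizes for illustration
    policies = {
        "next-touch": next_touch_placement(pages, home_lgroup=0),
        "round-robin": round_robin_placement(pages, lgroups),
        "random": random_placement(pages, lgroups),
    }
    for name, placement in policies.items():
        fractions = [local_fraction(placement, g) for g in range(lgroups)]
        print(name, [f"{f:.2f}" for f in fractions])
```

In this model, a thread under next-touch sees all of its pages as local while it stays in its home lgroup and none after it migrates elsewhere. Under RR or Random, about one quarter of the pages are local wherever the thread runs, so migrations no longer change the local-to-remote ratio.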
Dynamic Priorities and ICX
The main goal of a modern general-purpose OS scheduler is to provide fairness. Since it is not guaranteed that all the threads of a multithreaded program behave similarly at any moment (e.g., due to differences in accessing resources such as CPU, memory, lock data structures, and disk), we can expect the OS scheduler to make frequent changes to thread priorities to maintain an even distribution of processor resources among the threads. By default, the OS scheduler prioritizes and runs threads on a time-shared basis as implemented by the TS scheduler class. The adjustments in priorities are made based on the time a thread spends waiting for or consuming processor resources, and the thread's time quantum varies according to its priority.
Thread priorities can change as a result of event-driven or time-interval-driven events. Event-driven changes are asynchronous in nature; they include state transitions as a result of a blocking system call, a wakeup from sleep, a preemption, etc. Preemption and expiration of the allotted time quantum produce involuntary thread context-switches (ICX). Here, changing priorities means updating the priority of threads based on their CPU usage and moving them from one priority-class queue to another according to their updated priority. If multiple threads have their priorities updated to the same value, the system implicitly favors the one that is updated first since it winds up being ahead in the run queue. To avoid this unfairness, the traversal of threads in the run queue starts at the list indicated by a marker. When threads in more than one list have their priorities updated, the marker is moved. Thus the order in which threads are placed in the run queue of a core is altered the next time the thread priority update function is called, and fairness over the long run is preserved.
Since the threads of an application do not all behave similarly at any moment (e.g., due to their CPU usage, lock-contention time, sleep time, etc.), the positions of the threads in the run queues differ from one run to another. The frequent changes in thread priorities produce variation in ICX-rate and thus variation in performance. Moreover, ICX often includes lock-holder thread preemptions, and we can expect the frequency of lock-holder preemptions to increase as the load increases (i.e., as the thread count grows). More importantly, whenever a lock-holder thread is preempted, the threads that are spinning for that lock will be blocked, which in turn increases the VCX-rate and leads to poor performance under high loads.
2.3.1. Fixed Priority Scheduling. The Fixed Priority (FX) scheduling class attempts to solve the issue of frequent thread ping-ponging with the TS class. Threads execute on a CPU until they block on a system call, are preempted by a higher-priority thread that has become runnable, or have used up their time quantum. The allotted time quantum varies according to the scheduling class and the priority of the thread. The OS maintains time quanta for each scheduling class in an object called a dispatch table. Threads in the fixed priority class are scheduled according to the parameters in a simple fixed-priority dispatcher parameter table. The parameter table contains a global priority level and its corresponding time quantum. Once a process is at a priority level it stays at that level with a fixed time quantum. The time quantum value is only a default or starting value for processes at a particular level, as the time quantum of a fixed priority process can be changed by the user with the priocntl(1) command or the priocntl(2) system call. By giving the same priority to all the threads of a multithreaded application, the FX class dramatically reduces ICX, completely avoids lock-holder thread preemptions, and thus reduces performance variation. Moreover, unlike the TS class, only time-driven tick processing is done for the FX class. This reduces dispatcher locks, minimizes OS intervention, and minimizes performance variation.
Role of Time-Quantum. The FX class priority range is 0 to 60, and the corresponding time-quantum range is 200 ms to 20 ms. By default, programs are run with fixed priority level zero (200 ms time-quantum) when we apply the FX scheduler class. However, to get more benefit from the FX class, we also need to find a proper time-quantum value for the programs running under it. It is appropriate to provide a small time-quantum for the threads of a CPU-intensive application, as these threads compete heavily for CPU resources. In this way no thread will wait for a long time for a CPU. Since the threads of a memory-intensive application do not compete heavily for CPU resources, a large time-quantum is appropriate for them. Therefore, by allocating an appropriate time-quantum according to the resource-usage characteristics of the target multithreaded program, we can provide fair allocation of CPU cycles for all the threads of the program. To find the appropriate time-quantum according to the resource usage of applications, we first categorize the applications as memory-intensive or CPU-intensive. We then chose a couple of applications from each category (a total of 4 out of 15) and ran them with time-quantum values ranging from 5 ms to 300 ms. The 4 applications used to find the appropriate time-quanta are streamcluster, swaptions, ferret, and bodytrack. From extensive experiments, we identified that a 100 ms time-quantum gives better performance and low performance variation for memory-intensive programs, and that a 20 ms time-quantum works well for CPU-intensive programs. Therefore, we conclude that the FX class with an appropriate time-quantum reduces the ICX-rate, more importantly reduces the variation in ICX-rate, and thus reduces performance variation.
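The time-quantum choice above can be expressed as a small helper that builds a priocntl(1) command line. This is a hedged sketch: the helper name and the dictionary are ours, the command is only constructed and never executed, and the option letters (-s, -c, -t, -i) reflect our reading of the priocntl(1) manual page and should be verified on the target system before use.

```python
from typing import List

# Time quanta reported in the text, in milliseconds.
QUANTUM_MS = {"memory": 100, "cpu": 20}


def fx_command(pid: int, kind: str) -> List[str]:
    """Build (but do not run) a priocntl(1) command that moves process `pid`
    into the FX class with the time quantum chosen for its category.
    `kind` is "memory" or "cpu"; the flags are an assumption to verify."""
    quantum = QUANTUM_MS[kind]
    return ["priocntl", "-s", "-c", "FX", "-t", str(quantum),
            "-i", "pid", str(pid)]


if __name__ == "__main__":
    # 1234 is a placeholder process ID.
    print(" ".join(fx_command(1234, "cpu")))
    print(" ".join(fx_command(1234, "memory")))
```

Keeping the quanta in a single table makes it easy to repeat the 5 ms to 300 ms sweep described above for other programs.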
Combination of Memory Allocation and Scheduling Policies
To find the appropriate configuration, we tested the 15 programs in the six configurations shown in Table IV. Figure 2 shows the running times and cache miss-rates of the facesim program (a memory-intensive program) in 10 runs with the six configurations. As shown in the boxplots of Figure 2, there is a reduction in the cache miss-rate and also in the variation of the cache miss-rate with the combination of the Random (or RR) and FX policies. The reduction in the variation of the cache miss-rate in turn reduced the performance variation. This indicates that threads reuse the data from private memory that is spread across the nodes by the RR or Random policy. There is also a significant improvement in performance. As we expected, memory-intensive programs experience low performance variation with improved performance using the Random or RR memory allocation policies. Figure 3 shows the running times and ICX-rates of mgrid for 10 runs. Mgrid is a CPU-intensive program and scales well in comparison to facesim. As shown in the boxplots of Figure 3, with the FX scheduling policy the variation in the ICX-rate is reduced and thus there is a reduction in the variation of running times. As we expected, there is no significant impact of the Random and RR policies on the performance of mgrid. Table IV lists the configurations. Figures 4 and 5 show that the combination of the Random and FX policies reduces performance variation and improves performance simultaneously for memory-intensive programs. Figure 6 shows that the FX scheduling policy reduces performance variation and improves performance simultaneously for CPU-intensive programs. There is no significant impact of memory allocation policies on CPU-intensive programs. FX is very effective for programs with high lock-contention. Since swaptions, x264, and ferret are CPU-intensive and have low lock-contention, FX only slightly improves their performance.
However, the performance variation with (FX + Next) is low compared to (TS + Next) for these three programs. Since bodytrack is CPU-intensive and has high lock-contention, its variation with (FX + Next) is significantly lower than with (TS + Next). Moreover, among the five CPU-intensive programs, mgrid benefits significantly from the FX scheduling policy. This is because the tuning techniques (libmtmalloc and larger pages) already improved the performance and reduced the performance variation significantly for the other four CPU-intensive programs (swaptions, bodytrack, ferret, and x264). From the above experiments, it is clear that memory-intensive programs benefit from the combination of the Random memory allocation and FX scheduling policies, while CPU-intensive programs benefit significantly from the FX scheduling policy alone. Since the variation in cache miss-rate causes the performance variation of memory-intensive programs, and the variation in ICX-rate causes the performance variation of CPU-intensive programs, in the next section we present a framework that monitors the cache miss-rate and ICX-rate of the target program online and, based on these events, dynamically applies proper memory allocation and scheduling policies.
THE THREAD TRANQUILIZER FRAMEWORK
Thread Tranquilizer monitors the cache miss-rate and thread ICX-rate of a running program and, based on their variation, dynamically applies appropriate memory allocation and scheduling policies. The execution of the target program begins with the default Next-Touch and TS policies, and the program's miss-rate and ICX-rate are monitored once the worker threads have been created. Thread Tranquilizer takes a maximum of five seconds to complete one pass, in which it selects appropriate memory allocation and scheduling policies according to the phase changes of the program. Therefore, programs with very short running times will not benefit from Thread Tranquilizer. Thus, in this work, we evaluated Thread Tranquilizer with programs whose worker threads run for more than five seconds.
We used the cputrack(1) utility to measure miss-rate and the mpstat(1) utility to measure ICX-rate. One second is the minimum time interval supported by the default implementation of the mpstat(1) utility, so we modified this utility to allow time intervals with millisecond resolution and measure the ICX-rate at a 100 ms interval. Using the cputrack(1) utility and the modified mpstat utility, we collect 10 samples of miss-rate and ICX-rate at a 100 ms interval and then derive from them a profile data structure that contains the average miss-rate, the average ICX-rate, and the standard deviations of miss-rate and ICX-rate. Figure 7 shows the state-transition diagram for one pass of Thread Tranquilizer, i.e., for a time interval of five seconds. If the average miss-rate is greater than the miss-rate threshold, then we treat the program as memory-intensive and we apply the Random memory allocation policy through the pmadvise(1) utility with proper advice options. Alternatively, we can use the kernel debugger mdb utility. We also apply the FX policy with a 100 ms time-quantum using the priocntl(1) utility. To see the effectiveness of these new policies, we again collect a new profile with 10 samples of the program's miss-rate and ICX-rate. Since we would like to reduce performance variation without reducing performance, we consider the average miss-rate along with the standard deviation of the miss-rate. If the average miss-rate is less than the previous average miss-rate and the standard deviation of the miss-rate is less than the previous standard deviation, we continue running the program with the new policies. Otherwise, we reset to the default Next-Touch memory allocation policy and apply the FX scheduling policy with a 100 ms time-quantum. Under the default policy, the choice between Next-Touch and Random is then made for each allocation based on the size of the shared memory requested by the program.
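The profile data structure described above can be sketched as follows. The field and function names are ours, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used. The sample values would come from cputrack(1) and the modified mpstat utility; here they are plain lists.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Sequence


@dataclass
class Profile:
    avg_miss_rate: float
    sd_miss_rate: float
    avg_icx_rate: float
    sd_icx_rate: float


def build_profile(miss_samples: Sequence[float],
                  icx_samples: Sequence[float]) -> Profile:
    """Summarize one monitoring window (10 samples at 100 ms in the text)
    as the averages and standard deviations of both event rates."""
    return Profile(
        avg_miss_rate=mean(miss_samples),
        sd_miss_rate=pstdev(miss_samples),
        avg_icx_rate=mean(icx_samples),
        sd_icx_rate=pstdev(icx_samples),
    )


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    miss = [2.0, 2.4, 1.9, 2.2, 2.1, 2.3, 2.0, 2.2, 2.1, 1.8]
    icx = [30, 42, 35, 38, 33, 40, 36, 31, 39, 34]
    print(build_profile(miss, icx))
```

Two such profiles, one taken before and one after a policy change, are all that the comparison step needs.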
If the program is CPU-intensive (i.e., the average miss-rate is less than the miss-rate threshold), then we apply only the FX scheduling policy with a 20 ms time-quantum. To see the effectiveness of the FX policy, we again collect the average ICX-rate and the standard deviation of ICX-rate. If the average ICX-rate is less than the previous average ICX-rate and the standard deviation of ICX-rate is less than the previous standard deviation of ICX-rate, then we keep running the target program with the FX policy until the completion of the pass. Otherwise, we reset to the default TS scheduling policy. Thread Tranquilizer uses a daemon thread to continuously monitor the target program and to deal with its phase changes. Every five seconds, a timer sends a signal; the daemon thread catches the signal and repeats the above process to deal with the phase changes of the target program.
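The decision logic of one pass, for both the memory-intensive and the CPU-intensive branch, can be sketched as below. This is a hedged illustration under stated assumptions: `one_pass`, `collect`, `apply`, and the policy label strings are hypothetical names, the threshold is passed in as a parameter because its value is not given here, and a miss-rate exactly equal to the threshold is treated as CPU-intensive, a case the text leaves open. The callbacks stand in for the cputrack/mpstat sampling and the pmadvise/priocntl invocations.

```python
from typing import Callable, NamedTuple, Tuple


class Profile(NamedTuple):
    avg_miss: float
    sd_miss: float
    avg_icx: float
    sd_icx: float


def one_pass(collect: Callable[[], Profile],
             apply: Callable[[str, str], None],
             miss_threshold: float) -> Tuple[str, str]:
    """Run one selection pass; return the (memory, scheduling) policies kept."""
    before = collect()
    if before.avg_miss > miss_threshold:
        # Memory-intensive: try Random placement plus FX with a 100 ms quantum.
        trial = ("Random", "FX-100ms")
        apply(*trial)
        after = collect()
        improved = (after.avg_miss < before.avg_miss
                    and after.sd_miss < before.sd_miss)
        fallback = ("Next-Touch", "FX-100ms")
    else:
        # CPU-intensive (assumption: equality with the threshold lands here):
        # keep the default memory policy and try FX with a 20 ms quantum.
        trial = ("Next-Touch", "FX-20ms")
        apply(*trial)
        after = collect()
        improved = (after.avg_icx < before.avg_icx
                    and after.sd_icx < before.sd_icx)
        fallback = ("Next-Touch", "TS")
    if improved:
        return trial
    # Both the average and the standard deviation must drop; otherwise revert.
    apply(*fallback)
    return fallback


# Hypothetical run: a memory-intensive program whose miss-rate improves.
_profiles = iter([Profile(9.0, 2.0, 50.0, 5.0), Profile(7.0, 1.0, 50.0, 5.0)])
applied = []
kept = one_pass(lambda: next(_profiles),
                lambda mem, sched: applied.append((mem, sched)),
                miss_threshold=5.0)
```

In a full system this function would be invoked once per five-second pass by the monitoring daemon, so that a program changing phase is re-evaluated on the next pass.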
Thus, Thread Tranquilizer monitors the target program's miss-rate and ICX-rate online and, according to these events, dynamically applies appropriate memory-allocation and scheduling policies, simultaneously reducing performance variation and improving performance.
Evaluating Thread Tranquilizer
We evaluated Thread Tranquilizer with the 15 programs that we used for the above performance variation study. Figure 8 and Table V show that performance variation (coefficient of variation) is reduced and performance is improved simultaneously. The combination of Random and FX policies has a significant impact on the performance variation of the programs: memory-intensive programs benefit from the combination of Random and FX policies, and CPU-intensive programs benefit from the FX scheduling policy alone. Though there is no significant performance improvement for the swim program, its performance variation is reduced dramatically with Thread Tranquilizer. Overall, performance variation is reduced by up to 98% (68% on average) and performance is improved by up to 43% (15% on average) with the framework.

Fig. 9. Thread Tranquilizer is very effective against parallel runs of more than one application. We simultaneously ran four programs, swim (SM), equake (EQ), facesim (FS), and streamcluster (SC), 10 times with the (Next + TS) and (Random + FX) policy configurations. The figures in the first row show the running-times of individual programs in 10 runs, the figures in the second row show the running-times of the four programs in individual runs, and the figures in the third row show the total running-times of the four programs in each run. As we can see, the combination of Random and FX policies provides fairness relative to the default policies of OpenSolaris.
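As a worked illustration of the variation metric used in this evaluation, the sketch below computes the coefficient of variation of a set of running times and the percentage reduction between two configurations. The timing values are invented for illustration and are not measurements from the paper; the use of the sample standard deviation and the definition of reduction relative to the default configuration are assumptions.

```python
from statistics import mean, stdev


def coefficient_of_variation(times):
    """Standard deviation divided by the mean (sample form, an assumption)."""
    return stdev(times) / mean(times)


def percent_reduction(default_times, tuned_times):
    """Reduction in coefficient of variation, relative to the default runs."""
    cv_default = coefficient_of_variation(default_times)
    cv_tuned = coefficient_of_variation(tuned_times)
    return 100.0 * (cv_default - cv_tuned) / cv_default


def percent_improvement(default_times, tuned_times):
    """Reduction in mean running time, relative to the default runs."""
    return 100.0 * (mean(default_times) - mean(tuned_times)) / mean(default_times)


# Hypothetical running times in seconds (not taken from the paper).
default_runs = [90.0, 100.0, 110.0]
tuned_runs = [79.0, 80.0, 81.0]

variation_cut = percent_reduction(default_runs, tuned_runs)
speed_gain = percent_improvement(default_runs, tuned_runs)
```

Dividing by the mean makes the metric comparable across programs with very different running times, which a raw standard deviation would not be.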
3.1.1. Improves Fairness and Effectiveness Under High Loads. The combination of Random memory allocation and FX scheduling policies improves fairness in scheduling when there is more than one application running. Figure 9 shows that this combination not only reduces the performance variation of individual multithreaded programs, but also reduces the performance variation of the total running-times of the multithreaded programs in concurrent runs. As we can see from the second row of figures in Figure 9, the fluctuation in running-times of the individual programs is low across ten runs in the presence of other programs with the combination of Random and FX policies. That is, in each run, the OS allocates resources fairly to all four programs, and the total running-time (throughput) is also improved. Moreover, as we can see in Figure 9, the combination of FX and Random policies is very effective under heavy loads. There are a total of 90 threads from the four multithreaded programs running on 24 cores, i.e., over 375% load. Similar performance improvements resulted from several experiments of running multiple applications with Thread Tranquilizer. We are unable to present those due to lack of space. Thus, Thread Tranquilizer is also effective when there is more than one application running on the system.

3.1.2. Performance Improvements on Linux. On Linux Kernel 2.6.32 (Ubuntu 10.04 LTS server), using the numactl(1) utility we applied the RR memory allocation policy along with the default next-touch policy to memory-intensive programs. Linux does not support the FX scheduling policy or the Random policy. As shown in Figure 10, the RR policy simultaneously reduces the performance variation of memory-intensive workloads by up to 91% and improves their performance by up to 53%.
Discussion
Thread migrations, voluntary context-switches (VCX), and kernel intervention also contribute to performance variation. VCXs are generated because of the lock contention and the I/O properties of the applications, and they have the consequence of increasing the thread migration rate. This is because both lock contention and I/O result in sleep-to-wakeup and run-to-sleep state transitions for the threads involved. When a thread wakes up from the sleep state, the OS scheduler immediately tries to give a core to that thread by increasing the priority of the thread. If it fails to schedule the thread on the same core that it used last, it migrates the thread to another core. Moreover, the generation of VCXs is not controlled by the operating system. However, in order to reduce lock contention, we tuned the programs studied above with the library libmtmalloc. The library libmtmalloc minimizes lock contention and consequently minimizes the VCX rate and the thread migration rate. This tuning technique improved the performance of most of the 15 programs and also reduced their performance variation.
3.2.1. Binding Threads vs Performance. We can minimize thread migrations using simple utilities available on modern operating systems. For example, OpenSolaris provides the pbind(1) utility to bind a thread/process to a core (no migrations), and the psrset(1) utility to bind a thread/process to a set of cores, i.e., threads can be migrated among the cores of the core-set. We applied both of these utilities to all the applicable programs, i.e., the programs whose OPT threads is less than or equal to 24. However, only one of these programs benefited, and that too only from the processor set (psrset) utility. Moreover, OPT threads is greater than 24 for 9 of the 15 programs studied (see Table I). Though pbind(1) reduces thread migrations to zero, the programs experienced significant performance loss. On a machine with few cores, binding one thread per core may slightly improve performance, but this is not true for machines with a larger number of cores. As shown in Figure 11, we conducted experiments with the one-thread-per-core binding model and observed that for most programs performance is significantly worse. For example, fluidanimate performs best with 21 threads without binding on our 24-core machine. When it is run with 21 threads with binding, the performance loss is 7%. swim performs best with 48 threads without binding on our 24-core machine. If we use the one-thread-per-core binding model (24 threads on 24 cores), then the performance loss of swim is 14%. Likewise, the performance losses of several other programs (ferret, swaptions, facesim, and bodytrack) are also significant. If the OPT threads of an application is greater than the number of cores, it is not reasonable to apply psrset(1) or pbind(1). This is because, along with the application threads, there are several high-priority system processes running in the system. Since the user does not have control over these system processes, the OS scheduler can manage load balancing better than user-initiated "static" thread-to-core mappings.
More importantly, "#threads > #cores" leads to several thread-to-core mappings, and therefore exploration of these mappings to find the best one is not a simple task.
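To give a sense of the size of this search space, the sketch below counts static thread-to-core assignments under a deliberately simple model. It is an illustrative upper bound, not an analysis from the paper: it assumes threads are distinguishable, every thread may be pinned to any core, and any number of threads may share a core. Many of these assignments are equivalent by symmetry, so the number of meaningfully different mappings is smaller. The function name `mapping_count` is hypothetical.

```python
def mapping_count(threads: int, cores: int) -> int:
    """Count static assignments when each thread may be pinned to any core.

    Model assumption: each of the `threads` threads independently picks one
    of `cores` cores, giving cores ** threads assignments in total.
    """
    if threads < 0 or cores <= 0:
        raise ValueError("threads must be >= 0 and cores must be > 0")
    return cores ** threads


# swim runs best with 48 threads on the 24-core machine.
swim_mappings = mapping_count(48, 24)

# A smaller case that is easy to check by hand: 3 threads on 2 cores.
small_case = mapping_count(3, 2)
```

Even under this crude count, exhaustively timing each mapping is out of reach, which is consistent with leaving placement to the OS scheduler when the thread count exceeds the core count.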
We conducted several experiments with different "threads-to-cores" mappings (e.g., 12 threads on a 12-core set, 18 threads on an 18-core set) and identified that it is useful to use either pbind or psrset when the OPT threads is less than the number of cores. Among the six applicable programs, only one program, streamcluster, benefits from processor sets (PSET). Figure 12 shows that the performance variation of streamcluster (with OPT threads of 13) is dramatically reduced with the processor set configuration (Next + TS + PSET), where 13 streamcluster threads run on a 12-core set, i.e., on two NUMA nodes. However, the Thread Tranquilizer configuration (i.e., RA + FX + PSET) further reduces performance variation and improves the performance of streamcluster by around 36% over the default PSET configuration, i.e., (Next + TS + PSET). Here, Thread Tranquilizer applies the PSET RANDOM policy.
3.2.2. Improved Scalability. Thread Tranquilizer also improves the scalability of streamcluster. The OPT threads of streamcluster with Thread Tranquilizer is 25, with an average running time of 105.3 seconds on 24 cores, which represents a performance improvement of 24.5% over the Random + FX + PSET configuration, for which OPT threads is 13. Thread Tranquilizer improves the scalability of most of the programs.
3.2.3. Lessons Learned. Since multicore systems are rapidly taking over the computing world and the NUMA architecture solves the scalability problem, we can expect many-core machines in the future. In order to fully exploit many-core machines, modern operating systems must therefore evolve with appropriate process scheduling and memory management techniques. More specifically, two important challenges need to be addressed.
(1) The OS scheduler should adaptively allocate resources such as the number of cores according to the resource usage of the whole application instead of individual threads of the application; thread-to-core mapping should be done at the application level instead of the thread level.
(2) The OS should adaptively apply appropriate process scheduling and memory allocation policies according to the application characteristics.
Thus, in this work, we show how to reduce performance variation and improve performance simultaneously through proper choice of memory placement and process scheduling policies. We achieved this by implementing Thread Tranquilizer, a framework using simple utilities available on modern operating systems. Thread Tranquilizer monitors cache misses and thread context-switches online and, based on these events, dynamically applies proper memory placement and scheduling policies that can yield up to 98% reduction in performance variation and up to 43% improvement in performance over the default policies of the OpenSolaris operating system. The overhead of Thread Tranquilizer is negligible (0.07% of CPU utilization). Moreover, there is no need to modify the target application source code or the OS kernel. It is also worth mentioning that, since modern architectures and operating systems provide support for the performance monitoring events and utilities that we used in this work, Thread Tranquilizer can be easily ported to other systems.
RELATED WORK
Many researchers have studied performance variability of parallel applications on large-scale parallel computers. Mraz [1994] examined the effect of variance in message passing communications introduced by the AIX operating system in parallel machines built of commodity workstations. They mainly considered the variance introduced by the interrupts of the AIX operating system and concluded that globally synchronizing the system clocks gives the best results overall, as it generally caused the daemons to run in a co-scheduled fashion and did not degrade system stability. Petrini et al. [2003] showed that for large-scale parallel computers, OS noise such as periodic I/O monitoring could cause serious performance problems. The techniques they proposed to eliminate system noise include turning off unnecessary system daemons and moving heavyweight daemons to one node instead of spreading them across multiple nodes. Gu et al. [2004] demonstrated that a significant source of noise in benchmark measurement in a Java Virtual Machine is due to code layout. Like Kramer and Ryan [2003], Skinner and Kramer [2005] showed that variability in performance is inherently tied to contention for resources between applications and the operating system. Hensley et al. [2001] minimized performance variation using cpusets on an SGI Origin 3800. Most of the above works and some other works [Wright et al. 2009; Evans et al. 2003; Lofstead et al. 2010; Tabatabaee et al. 2005] used MPI-based parallel applications on large-scale parallel systems and studied the performance variability caused by network and I/O communications. Gioiosa et al. [2010] also studied the effect of OS noise on the performance of MPI-based NAS applications running on a quad-core machine, and provided the design and implementation of a scheduler that optimizes performance by dynamically binding threads to cores and thus minimizing thread migrations.
Several other researchers investigated the interference of the OS with application performance. Ferreira et al. [2008] showed how to quantify the application performance costs due to local OS interference on a range of real-world large-scale applications using over ten thousand nodes. Tsafrir et al. [2005] identified a major source of noise to be the indirect overhead of periodic OS clock interrupts, which are used by all general-purpose operating systems as a means of maintaining control. Using simulations, Gupta et al. [1991] explored the trade-offs between the use of busy-waiting and blocking synchronization primitives and their interactions with the scheduling strategies. Sandholm and Lai [2009] presented a system for allocating resources in shared data and compute clusters that improves MapReduce job scheduling by regulating user-assigned priorities to offer different service levels to jobs and users over time. Shen [2010] provided a characterization of request behavior variations using server applications and found that the inter-core resource sharing on multicore platforms obfuscates the request execution performance. Some studies [Constantinou et al. 2005; Teng et al. 2009; Becchi and Crowley 2006] investigated the impact of thread migrations and context-switches on application performance, and some others [Nataraj et al. 2007; De et al. 2007; Mann and Mittaly 2009; Seelam et al. 2010; Beckman et al. 2008] explored the impact of OS noise on application performance. Verghese et al. [1996] developed page migration and replication policies for ccNUMA systems and showed that migration and replication policies improved memory locality and reduced the overall stall time. Koita et al. [2000] presented several policies for cluster-based NUMA multiprocessors that are combinations of a processor scheduling scheme and a page placement scheme, and investigated the interaction between them through simulations.
McCurdy and Vetter [2010] introduced Memphis, a data-centric toolset that uses Instruction Based Sampling to help pinpoint problematic memory accesses to locate NUMA problems in some NAS parallel benchmarks. Alameldeen and Wood [2003] provided a methodology to compensate for variability, that combines pseudo-random perturbations, multiple simulations and standard statistical techniques; and Hocko and Kalibera [2010] used page coloring and bin hopping page allocation algorithms to minimize performance variation. Touati and Worms [2010] proposed a statistical methodology called "Speedup-Test" for analyzing the distribution of the observed execution times of benchmark programs and improving the reproducibility of the experimental results. While Touati and Worms [2010] focuses on statistical methodology for enhancing performance analysis methods to improve the reproducibility of the experimental results for SPEC CPU2006 and SPEC OMP 2001, Thread Tranquilizer reduces performance variation of multithreaded workloads by applying appropriate scheduling and memory allocation policies. Pinter and Zalmanovici [2006] developed an extension to the Linux scheduler that exploits inter-task data relations to reduce data cache misses in multi-threaded applications running on SMP machines and improves performance and optimizes energy consumption. Several researchers [Merkel et al. 2010; Zhuravlev et al. 2010; Dhiman et al. 2010; Blagodurov et al. 2011 ] have proposed optimization techniques based on use of hardware performance monitoring to find performance bottlenecks and these techniques dynamically improve performance and optimize power consumption of single threaded and multithreaded benchmark programs on multicore machines.
In this work we identified the reasons for performance variation of multithreaded programs. We also show how high performance and low performance variation can be simultaneously achieved through proper choice of memory management and scheduling policies without changing a single line of application code. Moreover, users can easily apply these techniques on the fly and adjust OS environment for multithreaded applications to achieve better performance predictability.
CONCLUSIONS
This paper provided an in-depth performance variation analysis of emerging multithreaded programs on a 24-core machine running OpenSolaris.2009.06 and identified the causes of the performance variation in precise detail. We identified that the variation in cache miss-rate causes performance variation of memory-intensive programs and the variation in ICX-rate causes performance variation of CPU-intensive programs. Based on these observations, we developed a framework, Thread Tranquilizer, that monitors cache misses and thread context-switches of a target benchmark program online and, based on these events, dynamically applies proper memory placement and scheduling policies. On OpenSolaris, it simultaneously improves performance by 15% on average and reduces performance variation by 68% on average. On Linux, Thread Tranquilizer simultaneously reduces performance variation by up to 91% and improves performance by up to 53%. Since modern architectures and operating systems provide support for performance monitoring events, our framework can be easily ported to other systems.
