Abstract
Introduction
Exploiting thread-level parallelism is believed to be a reliable way to achieve higher performance improvements in the future. Moreover, as technology advances provide microprocessor design with more options, finding new solutions to use the possible capabilities is necessary. Chip multiprocessing offers an attractive solution as using multiple cores makes efficient execution of parallel threads possible.
There are many design decisions that impact performance in a CMP. Designers have to make many choices, including the number of cores, the appropriate interconnect and the memory system. Server applications focus primarily on throughput. A CMP targeting these applications ideally uses a large number of small low-power cores. On the other hand, for desktop users the performance of a single application is more important. Architects of a desktop processor would more likely focus on a smaller number of larger and more complex cores with better single-thread performance. As CMPs become more popular in different design spaces, and considering the variety of programs that a typical processor is expected to run, exploiting heterogeneous CMPs seems to be a reasonable future choice. A heterogeneous CMP equipped with both high- and low-complexity cores provides enough resources to execute both latency- and throughput-sensitive applications efficiently. Recent research in heterogeneous CMPs has identified significant advantages over homogeneous CMPs in terms of power and throughput [1].
+ The author was spending his sabbatical leave at the School of Computer Science of IPM when this work was done.
In order to achieve the potential benefits of heterogeneous CMPs, an effective application-to-core scheduler is essential. Such a scheduler would assign latency-sensitive applications to stronger cores while leaving the throughput-sensitive ones to the weaker cores. Running demanding applications on weak cores hurts performance. Meanwhile, running low-demand applications on strong cores results in unnecessary power dissipation. The task of the scheduler is to avoid both scenarios, finding the best assignment that prevents both resource over-utilization and under-utilization.
Of the two possible scheduling policies, static and dynamic, the latter has significant advantages. This is particularly true for heterogeneous CMPs. In addition to the behavior variations among applications, there are behavior changes within an application. Therefore, an application's demand for processor resources varies during runtime. Dynamic scheduling uses runtime information to tune core-application assignments as such changes occur.
Many scheduling policies have been introduced for parallel programs (e.g., gang scheduling and uncoordinated scheduling). Such policies often focus on concurrent execution of the threads of an application on distinct processors. In this study applications are independent. To achieve b gned to core ication demand namic schedul rding to resou wn that heterog duling increa ove average re y loads [6].
HARD comes with two important benefits. The first benefit is maximizing throughput. This is achieved as more demanding threads are assigned to more powerful cores as soon as such demands are noted and suitable cores become available. The second benefit is reducing power. The scheduler assigns threads and applications to simpler cores as soon as the system detects that the application is underutilizing its core. In this work we show that both benefits are obtainable using HARD.
HARD Scheduler
In this section we introduce HARD. The scheduler relies on two units: the phase detection unit and the reassignment unit.
Phase Detection Unit
In order to record application behavior we run our phase detection algorithm at the end of every 100K clock cycle interval. Very short intervals can degrade performance as the result of frequent and high switching overhead. Long intervals, on the other hand, may miss application phase changes. We picked 100K intervals after testing many alternatives.
The phase detection algorithm relies on two measurements: throughput and core utilization. In [17] Kumar used throughput alone to detect phase changes. We use both throughput and core resource utilization to identify a phase change. We use a counter to record the number of retired instructions as an estimate of application throughput. The size of the counter depends on the maximum possible throughput of each core. For example, the maximum nominal throughput of a 6-way core during a 100K clock interval is 600K instructions. Therefore, a 20-bit binary counter is large enough to store the throughput of an interval.
To calculate core utilization, instruction window occupancy is measured. Two counters, one for integer instruction window utilization and one for floating point instruction window utilization, are used to estimate core utilization. The phase detection algorithm uses the greater of the two to decide core utilization. The sizes of the counters depend on both the instruction window size and the interval size. For example, for the core with the largest instruction windows among all cores (i.e., 104 and 48 entries for integer and floating point instructions, respectively) and 100K clock intervals, the maximum accumulated numbers of occupied instruction window entries are 104×100K and 48×100K, respectively. These numbers can be stored using 24-bit and 23-bit binary counters. Core utilization within each interval is estimated by dividing the counter values by the interval size.
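The counter sizes quoted above can be checked with a short calculation. The following is a minimal sketch, not the hardware design itself: the function names are hypothetical, 100K is taken to mean 100,000 cycles, and the utilization estimate follows the text literally (greater of the two occupancy counters divided by the interval size).

```python
INTERVAL_CYCLES = 100_000  # assumed meaning of "100K clock interval"


def counter_bits(max_count: int) -> int:
    """Smallest number of bits able to hold any value in 0..max_count."""
    return max(1, max_count.bit_length())


# Throughput counter: a 6-way core retires at most 6 instructions per cycle.
throughput_bits = counter_bits(6 * INTERVAL_CYCLES)

# Occupancy counters for the largest instruction windows (104 int, 48 fp).
int_window_bits = counter_bits(104 * INTERVAL_CYCLES)
fp_window_bits = counter_bits(48 * INTERVAL_CYCLES)


def core_utilization(int_occupancy_sum: int, fp_occupancy_sum: int,
                     interval_cycles: int = INTERVAL_CYCLES) -> float:
    """Average occupied window entries per cycle, using the greater counter."""
    return max(int_occupancy_sum, fp_occupancy_sum) / interval_cycles


print(throughput_bits, int_window_bits, fp_window_bits)
```

Under these assumptions the three widths come out as 20, 24 and 23 bits, matching the sizes stated in the text.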
The phase detection algorithm uses throughput and core utilization to categorize intervals into one of the following classes.
• Upgrading (UG): An interval belongs to this category if a) the utilization exceeds the high utilization threshold (HUT) or b) the throughput exceeds the high throughput threshold (HTT). Either condition implies that the application could use more resources and that switching to a stronger core will most likely boost performance.
• Downgrading (DG): An interval belongs to this category if utilization is less than the low utilization threshold (LUT) and throughput is less than the low throughput threshold (LTT). Under these circumstances we assume that the application is under-utilizing core resources and switching it to a weaker core will most likely reduce power dissipation without compromising performance.
• No-change (NC): An interval not belonging to either of the classes discussed above is assumed to be in this class. We assume that the core is running the application within reasonable power and performance budgets. Therefore there is no need to switch to a weaker or stronger core. 
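The three classes above can be expressed as a small decision function. This is a sketch under stated assumptions: the `Thresholds` container and the function name are hypothetical, and the threshold values in the usage example are placeholders rather than the values used in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    hut: float  # high utilization threshold
    lut: float  # low utilization threshold
    htt: float  # high throughput threshold
    ltt: float  # low throughput threshold


def classify_interval(utilization: float, throughput: float,
                      t: Thresholds) -> str:
    """Classify one interval as 'UG', 'DG' or 'NC'."""
    # Upgrading: either measurement exceeds its high threshold.
    if utilization > t.hut or throughput > t.htt:
        return "UG"
    # Downgrading: both measurements fall below their low thresholds.
    if utilization < t.lut and throughput < t.ltt:
        return "DG"
    # Everything else is left on its current core.
    return "NC"


# Placeholder thresholds, chosen only to exercise the three branches.
example = Thresholds(hut=80.0, lut=20.0, htt=400_000, ltt=100_000)
print(classify_interval(90.0, 50_000, example))   # high utilization
print(classify_interval(10.0, 50_000, example))   # both low
print(classify_interval(50.0, 200_000, example))  # in between
```

Note the asymmetry: upgrading needs only one condition, while downgrading needs both, so an application is moved to a weaker core only when neither measurement indicates demand.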
Reassignment Unit
As we discussed before, our dynamic scheduler uses the phase detection algorithm to detect application phase changes. When the phase detection unit detects an upgrading or downgrading phase change, the scheduler activates the reassignment unit. This unit records application phase change history. History is used to decide if the application needs to switch to another core.
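One way such a history could be kept is with a signed counter per application. The sketch below is illustrative only and rests on assumptions that the text does not fully specify: the counter is incremented on a UG interval, decremented on a DG interval, left unchanged on NC, and a switch is requested (and the counter reset) when it reaches +6 or -6. The class and method names are hypothetical.

```python
from typing import Optional


class ReassignmentHistory:
    """Hypothetical per-application phase-change history counter."""

    def __init__(self, switch_threshold: int = 6) -> None:
        # Assumed threshold magnitude; a switch is requested at +/- this value.
        self.switch_threshold = switch_threshold
        self.counter = 0

    def record(self, phase: str) -> Optional[str]:
        """Record one interval class and return a switch request, if any."""
        if phase == "UG":
            self.counter += 1
        elif phase == "DG":
            self.counter -= 1
        # "NC" leaves the counter unchanged (assumption).

        if self.counter >= self.switch_threshold:
            self.counter = 0
            return "upgrade"
        if self.counter <= -self.switch_threshold:
            self.counter = 0
            return "downgrade"
        return None


history = ReassignmentHistory()
requests = [history.record("UG") for _ in range(6)]
print(requests)
```

Requiring several consistent intervals before switching is one way to amortize the switching cost, since a single noisy interval cannot trigger a migration on its own.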
Note that there is a cost associated with switching. For example, the cache should be flushed to save all dirty cache data in caches is an simulator we u phase detection r referred to as h core has its p record of the ation. DHC is t interval was y. The counter a stronger or application to a value is above ely (we picked testing many m every 100K uires up to six one of the pretice that using n applications n frequency.
require special is effective. In if and when a cur. Our study counter is reset 6 (or -6), the aker) core. An at a time. For resenting three where a UG est core. The e to one of the n one choice at more idle core ation from the our study the ll the possible choose one of plications. The ollows. When core with the wngrading to a reatest demand to make sure g cores benefit ined below).
plications 
Methodology
We simulate a heterogeneous multicore processor using three types of cores with different performance levels: one EV6-like, two EV5-like and two EV4-like cores, as reported in Table 1. Although there are many other possible configurations, our study shows that this configuration provides adequate execution resources to run the workloads chosen in this study.
The EV6-like core has the highest performance. The level one cache is private for each core and the instruction and data caches are separated. A large unified level two cache is shared between all cores. MESI protocol is used for cache coherency. All cores are simulated in 100 nm technology and run at 2.1 GHz. The simulations have been carried out using a modified version of the SESC simulator [7] . We use eight scientific/technical parallel workloads from Splash2 [8] . These workloads consist of four applications, i.e., barnes, water-spatial, ocean and fmm, and four computational kernels, i.e., radix, lu, cholesky and fft. We have used multi-program workloads which are composed of different sets of Splash2 benchmarks. Table 2 shows the benchmarks of each multi-program workload.
Experimental results
In this section we present the simulation results. To provide better understanding we also compare HARD to a static scheduler. Although there are many possible static schedules for each application mix, we assign applications to cores based on application run time. For example, the most time-consuming application is assigned to the strongest core. Each application runs on the same core during its entire runtime.
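The static baseline described above amounts to ranking applications by run time and pairing them with cores ordered from strongest to weakest. The following sketch assumes hypothetical core labels and made-up run times; it is not the scheduler used in the simulator.

```python
from typing import Dict, List


def static_schedule(run_times: Dict[str, float],
                    cores_by_strength: List[str]) -> Dict[str, str]:
    """Map the longest-running application to the strongest core, and so on.

    cores_by_strength must be ordered from strongest to weakest. If there
    are more applications than cores, the extra applications are left
    unassigned in this sketch.
    """
    ranked = sorted(run_times, key=run_times.get, reverse=True)
    return dict(zip(ranked, cores_by_strength))


# Core labels follow the one EV6 / two EV5 / two EV4 configuration.
cores = ["EV6", "EV5-0", "EV5-1", "EV4-0", "EV4-1"]

# Made-up run times, for illustration only.
times = {"radix": 3.0, "lu": 7.5, "barnes": 9.0,
         "water-spatial": 5.0, "cholesky": 1.5}

print(static_schedule(times, cores))
```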
Quantifying the performance of a computer system executing multiple programs is not a straightforward task, as programs of the same mix interfere with each other. In [2], Eyerman et al. introduced three metrics for quantifying the performance of a computer system executing multiple programs. Average Normalized Turnaround Time (ANTT) quantifies the average turnaround time slowdown due to multi-program execution. System Throughput (STP) quantifies accumulated single-program performance under multi-program execution. The third metric is fairness: a system is fair if the co-executing programs in multi-program mode experience equal relative progress with respect to single-program mode.
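The three metrics can be computed from per-program execution times in single-program and multi-program mode. The sketch below uses the formulation in which ANTT is the mean slowdown, STP is the sum of normalized progress values, and fairness is the ratio of the smallest to the largest normalized progress; these formulas are stated here as an assumption, since the text above does not spell them out, and the function names are hypothetical.

```python
from typing import Sequence


def antt(single: Sequence[float], multi: Sequence[float]) -> float:
    """Average Normalized Turnaround Time: mean slowdown (lower is better)."""
    return sum(m / s for s, m in zip(single, multi)) / len(single)


def stp(single: Sequence[float], multi: Sequence[float]) -> float:
    """System Throughput: sum of normalized progress (higher is better)."""
    return sum(s / m for s, m in zip(single, multi))


def fairness(single: Sequence[float], multi: Sequence[float]) -> float:
    """Smallest over largest normalized progress; 1.0 means equal progress."""
    progress = [s / m for s, m in zip(single, multi)]
    return min(progress) / max(progress)


# Two programs, each taking twice as long when co-scheduled.
single_times = [10.0, 20.0]
multi_times = [20.0, 40.0]
print(antt(single_times, multi_times))
print(stp(single_times, multi_times))
print(fairness(single_times, multi_times))
```

In this example both programs slow down by the same factor, so the system is perfectly fair even though each program's turnaround time doubles.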
Performance Oriented Configuration
In this section we report results assuming that performance is the main goal. To estimate performance of each mix we measure performance for each core and report average performance across all cores. We tune the scheduler using the parameters reported in Table 3 . We run all mixed benchmarks using both HARD and the static scheduler. We measure and report (in Table 3 ) average performance and power achieved for all mixed benchmarks. 
Power Oriented Scheduling
In this section we report results assuming that reducing power is the main goal. Again we measure and report average power across all cores. We use the parameters introduced in Table 4, which shows three sets of thresholds. As reported, Set-C results in maximum power saving while Set-A leads to minimum performance loss. We run all mixed benchmarks using both HARD and the static scheduler. We measure and report (in Table 4) average performance and power achieved for all mixed benchmarks. Figures 8 and 9 show the amount of performance loss and power saving, respectively, using Set-B from Table 4. As reported, some mixes of benchmarks (e.g., mix7) come with higher performance loss but also show very high power reduction compared to others. On the other hand, there are mixes (e.g., mix4) that show considerable power savings at the expense of modest performance loss.
To provide better understanding we measured the number of UG/DG switches for mix2 and mix7 of Table 2. For mix2, the numbers of UG/DG switches are 1/1, 1/0, 1/0, 2/2 and 2/1 for performance-oriented scheduling and 0/0, 3/2, 0/0, 1/2 and 3/2 for power-oriented scheduling for the radix, lu, barnes, water-spatial and cholesky benchmarks, respectively. For mix7, the numbers of UG/DG switches are 1/1, 1/0, 1/1, 1/1 and 1/0 for performance-oriented scheduling and 0/0, 5/4, 1/3, 7/7 and 4/2 for power-oriented scheduling for the radix, cholesky, water-spatial, fmm and lu benchmarks, respectively.
In Figure 11 we show the cores participating in the execution of the barnes benchmark chosen from mix3. Regular and dotted lines show the cores picked to execute the application under the performance-oriented and power-oriented configurations, respectively. For example, under the power-oriented configuration (represented by the dotted line), barnes starts its execution on the middle-performance core, switches to the weakest core and ends its execution on the middle-performance core.
Related Works
In this section we briefly review previous works on phase change detection techniques and dynamic scheduling in multiprocessor, multithreaded and multicore platforms. Comparison between our proposed scheme and previous studies is part of our ongoing research.
In [9], the authors improved the execution time and power of multicore processors by predicting the optimal number of threads depending on the amount of data synchronization and the minimum number of threads required to saturate the off-chip bus.
In [10], Accelerated Critical Sections (ACS) is introduced, which leverages the high-performance core(s) of an Asymmetric Chip Multiprocessor (ACMP) to accelerate the execution of critical sections. In ACS, selected critical sections are executed by a high-performance core, which can execute the critical section faster than the other, smaller cores. Consequently, ACS reduces serialization.
In [11], the authors proposed scheduling algorithms based on the Hungarian Algorithm and artificial intelligence (AI) search techniques. Because of dynamic heterogeneity, hard errors and process variations, the performance and power characteristics of future large-scale multicore processors will differ among the cores in an unanticipated manner. These thread assignment policies effectively match the capabilities of each degraded core with the requirements of the applications.
In [12], phase co-scheduling policies for a dual-core CMP of dual-threaded SMT processors were introduced. The authors explored a number of approaches and found that the use of ready and in-flight instruction metrics [...]
In [13], the authors proposed a scheme for assigning applications to appropriate cores based on the information presented by the job as an architectural signature of the application.
In [14], the authors made a case that thread schedulers for heterogeneous multicore systems should balance three objectives: optimal performance, fair CPU sharing, and balanced core assignment. They argued that thread-to-core assignment may conflict with the enforcement of fair CPU sharing, and demonstrated the need for balanced core assignment.
In [15], the authors introduced a cache-fair algorithm which ensures that the application runs as quickly as it would under fair cache allocation, regardless of how the cache is actually allocated. If the thread executes fewer instructions per cycle than it would under fair cache allocation, the scheduler increases that thread's CPU time slice.
In [16], Kumar monitored workload run-time behavior and detected significant behavior changes. The study also considered different trigger classes based on how IPC changes from one steady phase to another.
Conclusion
In this work we presented HARD, a dynamic scheduler for heterogeneous multi-core systems. HARD uses past thread assignments to find the best matching core for every application. HARD saves power by downgrading applications with low resource utilization to weaker cores. HARD improves performance by upgrading demanding applications to stronger cores.
We study different program mixes and show that HARD can reduce power by up to 46% while improving performance by 10% for the application mixes used in this study.
