Multithreading is widely used to increase processor throughput. As the number of shared resources increases, managing them while guaranteeing predictable performance becomes a major problem. Previous work has attempted to ease this problem with various fairness mechanisms. In this article, we present a new approach that controls resource allocation and sharing via a service level agreement (SLA)-based mechanism; that is, via an agreement in which multithreaded processors guarantee a minimal level of service to the running threads. We introduce a new metric, C SLA , for conformance to SLA in multithreaded processors and show that controlling resources using SLA allows for higher gains than are achievable by previously suggested fairness techniques. It also permits improving one metric (e.g., power) while maintaining SLA in another (e.g., performance). We compare SLA enforcement to schemes based on other fairness metrics, which are mostly targeted at equalizing execution parameters. We show that using SLA rather than fairness-based algorithms provides a range of acceptable execution points from which we can select the point that best fits our optimization target, such as maximizing the weighted speedup (sum of the speedups of the individual threads) or reducing power. We demonstrate the effectiveness of the new SLA approach using switch-on-event (coarse-grained) multithreading. Our weighted speedup improvement scheme successfully enforces SLA while improving the weighted speedup by an average of 10% for unbalanced threads. This result is significant when compared with the performance losses that may be incurred by fairness enforcement methods. When optimizing for power reduction in unbalanced threads, SLA enforcement reduces the power by an average of 15%. SLA may be complemented by other power reduction methods to achieve further power savings while maintaining the same service level for the threads.
We also demonstrate differentiated SLA, where weighted speedup is maximized while each thread may have a different throughput constraint.
Authors' address: R. Gabor, School of Electrical Engineering, Tel Aviv University; Intel Corporation. Email: rongabor@post.tau.ac.il or ron.gabor@intel.com; A. Mendelson: Microsoft Corporation. Email: avim@microsoft.com; S. Weiss: School of Electrical Engineering, Tel Aviv University. Email: weiss@eng.tau.ac.il.
INTRODUCTION
Power and manufacturing considerations push the computer industry toward multicore and multithreaded systems. There is an ongoing trend of increasing the number of cores on a chip and the number of threads of a single processor. Processor vendors already supply quad-core processors as mainstream general-purpose processors [Halfhill 2006; McGregor 2007], as well as multithreaded eight-core processors for servers and high performance computing [Kongetira et al. 2005]. Some vendors have also announced plans for "many-core" products with a large number of cores integrated on a single chip [Pawlowski 2007]. Highly threaded processors and virtual machines are used for server consolidation [Rosenblum and Garfinkel 2005]. These servers can run a number of independent threads from different users. The higher the number of threads sharing a resource, the higher the probability of imbalance in the resource usage [Gabor et al. 2007; Raasch and Reinhardt 1999]. For example, in a multicore processor, one thread may hog the memory bandwidth, causing other threads to run very slowly. In multithreaded processors, where the pipelines and other resources are shared, one thread may use most of a shared resource, such as the data cache, evicting the data of the other threads and causing them to slow down severely. Fairness enforcement [Gabor et al. 2006; Raasch and Reinhardt 2003] was proposed in order to ease the effect of such imbalanced resource allocation, but fairness does not always work as well as expected, as we will show later in this article.
Resources are shared between threads in order to achieve high power and area utilization. As the number of shared resources increases, the resource allocation problem becomes more severe. For example, in SMT (Simultaneous Multi-Threading), where the pipelines and other structures are shared, special fairness mechanisms need to be implemented in order to achieve predictable performance. In CMP (Chip Multi Processing), last level caches, buses and power consumption are shared, and usually require fairness mechanisms [Chang and Sohi 2007; Hsu et al. 2006; Iyer 2004; Iyer et al. 2007; Kim et al. 2004; Rafique et al. 2006; Yeh and Reinman 2005]. SOE (Switch-on-Event or coarse-grained) multithreading is an in-between multithreading solution that allows sharing all the resources of the processor in a timeshared mode. SOE is an attractive multithreading approach that allows us to use the analytical framework developed by Gabor et al. [2006, 2007] for this work.
Admission control can be used to reduce fairness issues by deciding which jobs are allowed to run together. A different approach to fairness issues is to build fairness mechanisms into the processor to enforce fairness among the running threads. These mechanisms include, for example, resource allocation and partitioning, and complement admission control.
Previous work on chip resource allocation usually focused on achieving fairness of allocation between the different threads. In this article, we propose a new approach, based on Service Level Agreement (SLA); that is, based on an agreement in which multithreaded processors guarantee a minimal level of service to the running threads. The use of SLA allows us to shift the notion of "fair systems" to "balanced systems." The balance between the resources is based on the demand, or requirement of the resources, as viewed by the system at any given time. The system maintains a minimum level of service for these resources for each thread and divides the leftover between the threads so that optimal execution is achieved.
SLA acts as a contract specifying the minimal service level that is acceptable to each thread. It can relate to different parameters, such as performance, resource allocation and even to power and thermal budgets. Service level can be quantified using the ratio of the measured parameter value when threads are executing together over their single thread execution value. For example, performance service level can be quantified by using the speedups of the individual threads, and service level in shared caches can be quantified by using the decrease in hit rate or decrease in cache share caused by the sharing of the cache. The case when all of the threads experience similar degradation in the measured parameter (due to the sharing) is usually considered fair. For example, fairness in multithreaded processors can be considered as equal speedups to all threads [Gabor et al. 2006; Kondo et al. 2007; Luo et al. 2001]. This case shows that fairness does not necessarily provide conformance to SLA (all threads getting an equal but unacceptable service level is considered fair). Conforming to SLA, however, provides each thread with a minimal service level, which ensures that no thread hogs any resource or starves any other thread.
This article presents a new metric, C SLA , that measures conformance to SLA. The C SLA metric is different from fairness metrics, which attempt to equalize parameters of different threads. The new C SLA metric is suitable for use as a constraint on the execution while other metrics are being maximized or minimized. For example, we demonstrate in this article that our C SLA metric is suitable for use as a constraint while maximizing the weighted speedup (WeightedSpeedup), achieving a higher WeightedSpeedup than that achieved using fairness metrics. We also demonstrate that C SLA may be used as a constraint on the speedups of the individual threads while reducing the overall average power.
Table I. Abbreviations and Symbols
F min max: Fairness metric based on the max-min ratio.
IPC SOE j: IPC of thread j when executed using SOE with other threads.
IPC no miss: IPC of a thread excluding last-level misses (as if the last level cache always hits).
InstPerMiss j: Instructions per miss in thread j. The average number of instructions between two consecutive misses in thread j.
CycPerMiss j: Cycles per miss in thread j. The average number of cycles between two consecutive misses in thread j.
InstPerSwitch j: Instructions per switch in thread j. The average number of instructions thread j executes before it is switched out (in SOE).
CycPerSwitch j: Cycles per switch in thread j. The average number of cycles thread j executes before it is switched out (in SOE).
The contributions of this article are as follows:
- We introduce a novel metric for measuring the Conformance to Service Level Agreement, C SLA , for multithreaded processors. This metric is applicable to any multithreaded or multicore system and for measuring conformance to SLA in any measured parameter (performance, resource sharing, etc.).
- We compare the new C SLA metric to previously suggested fairness metrics, such as F stdev , F max-min , F min max and HarmonicMean (see Table I). This comparison gives insight into the usefulness of each metric.
- We show examples of using C SLA for setting a constraint on the execution, which provides a range of acceptable execution points allowing for the optimization of the execution. We use C SLA based on the speedups of the individual threads to maximize the weighted speedup or reduce the average power, while satisfying the SLA requirements.
- We demonstrate the effectiveness of the SLA approach in a differentiated service case, in which the weighted speedup is maximized while each thread may have a different throughput constraint.
- We study the new C SLA metric in the context of switch-on-event (SOE) multithreading and show how the weighted speedup can be increased beyond that which is achieved while using traditional fairness approaches, and how to apply the C SLA metric to reduce the power.
The rest of this article is organized as follows. Section 2 describes previously suggested fairness metrics. Section 3 presents the new conformance to service level agreement metric (C SLA ). Section 4 describes the inherent SLA issues in switch-on-event (SOE) multithreading and presents the SOE analytical model. Section 5 is an extended example. Sections 6 and 7 show how to maximize the weighted speedup and how to reduce the power while satisfying SLA requirements in a two-thread SOE model, and present empirical results. Section 8 demonstrates the effectiveness of the SLA approach in the differentiated service case, where each thread may have a different throughput constraint. Section 9 describes the simulation tools and methodology, including machine configuration and SOE implementation details. Section 10 describes related work. We conclude with a summary in Section 11.
FAIRNESS METRICS
SLA guarantees that each thread receives the required level of service. We will define a metric for measuring the conformance to SLA in Section 3. Several metrics have been suggested for the quantification of fairness in multithreaded processors. This section describes these metrics. In the following sections, we will compare SLA enforcement to fairness enforcement.
Admission control attempts to decide which threads should be allowed to run together, before they are executed, in order to reduce mutual interference effects. For the purpose of admission control, absolute metrics (e.g., a stated absolute cache or bandwidth requirement) are intuitive and useful. Absolute metrics can be easily summed up and used for admission decisions. On the other hand, SLA and fairness measure the threads' mutual effects while they are running, and are usually measured as relative numbers compared to a baseline (e.g., single-threaded execution parameters or even relative to the stated requirement). Fairness and SLA can complement admission control, by using the relative metrics to determine if all threads are treated fairly. This is important when resources are insufficient (e.g., resource oversubscription), which can still happen even when using admission control. SLA can also be used to decide how to divide the excess, in order to maximize or minimize some target (e.g., maximize performance or reduce power).
SLA and fairness can relate to different parameters such as performance, resource allocation, and even to power and thermal budgets. Fairness is usually quantified using the ratio of the measured parameter value when threads are executing together over their single thread execution value. We denote this ratio for thread j by p j :

\[ p_j = \frac{\text{value for thread } j \text{ when executing together with the other threads}}{\text{value for thread } j \text{ in single-thread execution}} \]

Using this notation we can define, for example, fairness or service level in throughput using the IPC of the different threads. In this case, the fairness or service level will be defined using the speedups of the individual threads, that is, p j is the IPC of thread j when running with the other threads divided by its single-thread IPC (see Table I for abbreviations and symbols used in this article). Several metrics have been proposed for fairness quantification in microprocessors.
Variance (F var ) and Stdev (F stdev ). Fairness can be quantified using the variance or the standard deviation of p j [Hsu et al. 2006; Luo et al. 2001]. The situation is considered fair when the variance or the standard deviation is minimized, and absolute (perfect) fairness is when p 1 = p 2 = ... = p n , which results in F var = F stdev = 0. For example, when performance is considered, fairness can be quantified using the standard deviation of the speedups of the individual threads, and absolute fairness is when all threads experience the same speedup. Since variance and standard deviation are conceptually similar, we will use only the standard deviation in the rest of the article.
Max-Min Difference (F max-min ). Fairness can be quantified using the difference between the maximum and minimum of p j :

\[ F_{max-min} = \max_j(p_j) - \min_j(p_j) \]

The lower the value, the higher the fairness, and a value of 0 is absolute fairness (p 1 = p 2 = ... = p n ) [Kim et al. 2004; Kondo et al. 2007].
Max-Min Ratio (F min max ). Fairness can be quantified using the ratio between the minimum and the maximum of p j :

\[ F_{min/max} = \frac{\min_j(p_j)}{\max_j(p_j)} \]

A value of 1 is absolute fairness (p 1 = p 2 = ... = p n ), and 0 means that at least one of the threads is starved [Gabor et al. 2006; Chen and Ma 2007].
Harmonic Mean (HarmonicMean). The harmonic mean of p j can be used to encapsulate fairness as well as the goodness of p j into a single metric [Cazorla et al. 2003; Luo et al. 2001]:

\[ \mathit{HarmonicMean} = \frac{N}{\sum_{j=1}^{N} 1/p_j} \]

For example, it can be used to encapsulate both the fairness in throughput (IPCs of the individual threads) and the achieved speedups (throughput improvement) into a single metric.

Figure 1 shows how the different metrics behave in the case of two threads, for 0 ≤ p j ≤ 1. F stdev (Figure 1(a)) and F max-min (Figure 1(b)) behave in a similar manner, except that the standard deviation is never higher than the max-min difference. Both F stdev and F max-min achieve highest fairness when p 1 = p 2 = ... = p n , in which case F stdev = F max-min = 0. The difference between these two metrics appears when more than two threads are used, in which case the standard deviation averages the fairness over all threads while the max-min difference (F max-min ) uses the worst-case difference. Figure 1(c) shows F min max , which also achieves its highest fairness when p 1 = p 2 = ... = p n (in which case F min max = 1). F min max considers fairness from a worst-case scenario, using the two threads that achieve the max and min values of p j . The HarmonicMean metric (Figure 1(d)) achieves its highest value when p 1 = p 2 = ... = p n = max j (p j ), and its value decreases as we get further away from that point. The HarmonicMean metric attempts to encapsulate both the fairness of p j and its value into a single metric.

Jain et al. [1998] defined desired properties of fairness metrics:
(i) Population size independence: fairness metrics should not depend on the number of threads in the processor. This requirement is satisfied by all of the previously described metrics.
(ii) Scale and metric independence: a fairness metric should not depend on the measured parameter or on its scale. This requirement is satisfied only by F min max (see footnote 2).
(iii) Boundedness: a fairness metric should be bounded between 0 and 1. This is useful for the definition of discrimination metrics and enables intuitive use of the metric value. F stdev and F max-min have a minimum value of 0 (best fairness) but do not have an upper bound (see footnote 3). F min max is always bounded between 0 and 1 (as required) and HarmonicMean has no upper bound (but HarmonicMean is a combined measure of performance and fairness, so it can deviate from this requirement).
(iv) Continuity: a change in system fairness should be reflected in a continuous manner in the metric. All of the metrics satisfy this requirement.

F min max is the only metric that satisfies all of the requirements of Jain et al. [1998]. However, a given value of F min max says little about the service each thread actually receives. For example, F min max = 0.5 can mean that the two threads run at 1% and 2% of their single-thread (ST) performance (they lost 99% and 98% of their performance), but it can also mean that one thread maintained its full performance while the other lost 50%.
All of the previously described fairness metrics, with the exception of HarmonicMean, consider a system as fair whenever all of the threads have the same value of p j . This sounds reasonable, but it may lead to strange effects when it is enforced. Let us apply the fairness metrics to cache usage in a dual-core processor with a shared cache. Let us assume we run two independent threads. The first thread is a large matrix calculation, whose dataset fits exactly in the cache. The second thread is a small matrix calculation whose dataset fits within a small part of the cache. Before we can use any fairness metric, we have to define p j . In the case of cache usage, p j can measure, for example, cache share, cache hit rate, or hit count. Assuming we use hit rate as p j , the first thread, whose dataset size is the same as the cache size, will suffer a large decrease in its hit rate due to the sharing. This happens because its hit rate when it was running alone was 1 (always hit), but when the other thread takes up part of the cache, the matrix no longer fits, and each miss causes some other useful part of the matrix to be evicted. In order to achieve high fairness using F stdev , F max-min or F min max , the degradation in hit rate should be the same for both threads. This means that since the first thread has high degradation in its hit rate, the other thread's hit rate should also severely degrade. Enforcing fairness, in this example, based on equalizing p j harms one of the threads with little or no benefit for the other.

Footnote 2: Let us consider, for example, the case of measuring throughput using IPC (instructions per cycle) or using IPS (instructions per second). IPC and IPS measure the throughput using different scales (IPS = IPC·Freq). The value of F stdev , F max-min and HarmonicMean will depend on the scale.
Footnote 3: There is no guarantee that p j ≤ 1. Even in the case of IPC, an upper bound is not necessarily 1, since there may be cases where one of the threads will have a speedup > 1 when the two threads are running together (e.g., when the other thread acts intentionally or unintentionally as a prefetcher, helping the first thread to achieve higher IPC when the two are executed together).
The previously described example illustrates the shortcoming of any fairness metric that attempts to equalize the degradation (or change) caused by the sharing. Furthermore, from a user's perspective, equalizing p j can lead to strange behavior. For example, consider a dual-threaded processor. An application of a certain user may lose most of its performance because of another user's application. This situation may still be considered fair if the other application also lost the same amount of performance. From the user's perspective, a fair situation is when the sharing of the dual-threaded processor does not degrade the performance by more than a factor of 2; otherwise it would be better to simply run the applications one after the other.
It should be noted that fairness is not limited to parameters whose values represent "goodness" (where a high value is good), such as IPC or cache hit rate. Fairness can also apply to parameters where a lower value represents goodness, such as CPI or miss rate. In order to simplify the discussion, we assume that in such cases 1/p j is used instead of p j .
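The fairness metrics described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and F stdev is computed here as the population standard deviation, which is an assumption since the text does not say whether the population or the sample form is meant.

```python
import statistics


def f_stdev(p):
    """Standard deviation of the ratios p_j (0 means perfectly fair).

    Assumption: population standard deviation.
    """
    return statistics.pstdev(p)


def f_max_min(p):
    """Difference between the largest and smallest p_j (0 means perfectly fair)."""
    return max(p) - min(p)


def f_min_max(p):
    """Ratio of the smallest to the largest p_j (1 means perfectly fair).

    Assumes at least one thread makes progress, so max(p) > 0.
    """
    return min(p) / max(p)


def harmonic_mean(p):
    """Harmonic mean of the ratios p_j; defined as 0 if any thread is starved."""
    if min(p) == 0:
        return 0.0
    return len(p) / sum(1.0 / x for x in p)


if __name__ == "__main__":
    cases = {
        "both nearly starved": [0.01, 0.02],
        "one thread at full speed": [1.0, 0.5],
    }
    for name, p in cases.items():
        print(name, f_stdev(p), f_max_min(p), f_min_max(p), harmonic_mean(p))
```

Both cases yield F min max = 0.5 even though the service received by the threads differs by two orders of magnitude, which is the ambiguity that motivates a conformance metric.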
CONFORMANCE TO SERVICE LEVEL AGREEMENT METRIC (C SLA )
The shortcomings of the various fairness metrics previously described lead us to the conclusion that there is a need for an alternative, a metric that would encapsulate the user's perspective of sharing resources and degradation in measured parameters. Hence we suggest a different approach, service level agreement (SLA). A system satisfies SLA as long as all the threads achieve a value of p j that is higher than or equal to a predefined requirement, R.
The conformance to SLA metric, C SLA , can be formally defined as shown in (1):

\[ C_{SLA} = \min\left(1, \frac{\min_{1 \le j \le N} p_j}{R}\right) \tag{1} \]

where N is the number of threads and the required value is R = 1/N . The definition of C SLA allows comparing threads based on how close they are to the required level of p j . The thread that experiences the worst degradation determines the overall metric value. C SLA saturates at 1, which means that the requirement is satisfied by all of the threads. The minimal value of C SLA is 0, which is the lowest conformance value (at least one of the threads has p j = 0).
Note that C SLA is the system conformance level; if one of the threads has p j =0, then C SLA =0 even if other threads receive service at or above the required level.
In a dual-threaded processor, we can define the performance (IPC based) SLA using a required value of R = 1/2. In this case, SLA is maintained as long as p j ≥ 1/2 (that is, the speedup of the thread is at least 1/2) for both threads. Figure 1(e) shows how C SLA behaves in a two-thread scenario, with R = 1/2. As shown, as long as both threads have their p j ≥ R, the value of C SLA is 1 (shown as white). The value of the C SLA metric decreases when p j of any thread goes below R = 1/2.
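The behavior described above can be sketched as a small function. This is an illustrative sketch only: it assumes that C SLA is the worst-case ratio p j / R saturated at 1, which matches the properties stated in the text (saturation at 1, a value of 0 when some thread has p j = 0, and the worst thread determining the result). The function name and the optional `required` parameter are hypothetical.

```python
def c_sla(p, required=None):
    """Conformance to SLA for per-thread ratios p_j.

    Assumed form: min(1, min_j(p_j) / R), with R = 1/N unless given.
    """
    r = 1.0 / len(p) if required is None else required
    return min(1.0, min(p) / r)


if __name__ == "__main__":
    # Two threads, so the default requirement is R = 1/2.
    print(c_sla([0.60, 0.50]))  # both threads meet R
    print(c_sla([0.90, 0.44]))  # the worst thread is below R
    print(c_sla([1.00, 0.00]))  # one thread is starved
```

Unlike the fairness metrics, the value does not change when a thread that already meets the requirement receives more service, so any point with C SLA = 1 is acceptable and the remaining freedom can be spent on another optimization target.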
Sections 6, 7, and 8 demonstrate the effectiveness of using C SLA =1 as a constraint when optimizing execution or reducing the power in SOE. Our approach and the C SLA metric are by no means limited to SOE and are also applicable to CMP and SMT.
In order to use C SLA , a required value (R, the constraint) needs to be specified. This can be done, for example, by the system administrator, who should also specify the target execution optimization when the requirement is met by all threads (e.g., minimizing the power or maximizing the performance). When the requirement cannot be met, the system should attempt to maximize C SLA , which means getting the worst-case thread as close as possible to the requirement. This usually means equalizing the measured metric degradation over all threads (fair execution). The use of a relative requirement is useful when thread characteristics are unknown or have large variations. It is also useful when determining the relative degradation one is willing to tolerate in order to optimize the execution (e.g., tolerate some performance degradation in order to reduce power).
SLA AND FAIRNESS IN SWITCH ON EVENT MULTITHREADING
In this section, we analyze the conformance level and fairness in Switch-On-Event Multithreading (SOE) using an analytical model. In SOE, the core runs a single thread at any given time. When the thread is expected to stall for a long time (e.g., on a last level cache miss), the thread is swapped out and replaced with another. Switching threads is done using a pipeline flush [Eickemeyer et al. 1997; Farrens and Pleszkun 1991; Thekkath and Eggers 1994]. From a software perspective, the core contains multiple threads that run concurrently. The core, however, multiplexes the threads using predefined detectable events, such as long-latency miss events. SOE has been implemented in several multithreaded commercial processors, such as IBM's RS64 IV [Borkenhagen et al. 2000].
Gabor et al. [2006] describe the fairness problem in SOE, when threads with different CycPerMiss (average cycles between misses or long stall events) are executed together. Let us consider a two-thread SOE processor switching on last-level cache misses. The fairness problem arises when one of the executed threads has a lower miss frequency than the other. Figure 2 shows this case, where T 1 has a much lower miss frequency (higher average number of cycles between misses) than T 2. In this case, the thread with the high miss frequency (T 2) switches out quickly, giving way to the low miss-frequency thread (T 1). The low miss-frequency thread (T 1) will then execute for a longer time, until it switches out. This kind of round robin between the two threads produces the situation where the low miss-frequency thread (T 1) gets more cycles to run than the high miss-frequency thread (T 2), leading to unfair execution, which may not provide an acceptable service level.
Fig. 2. Example of unfair execution in SOE. T 1 has lower miss frequency (higher average number of cycles between misses) than T 2, which leads to T 1 getting more execution time than T 2.
Fig. 3. Representation of single-thread runs (top) and SOE (bottom).
We use the SOE analytical model of Gabor et al. [2006, 2007], which represents the execution of single-thread applications as sequences of instructions delimited by long latency stalls (last-level cache misses). InstPerMiss (instructions per miss) is the average number of useful instructions executed between two consecutive misses, and CycPerMiss (cycles per miss) is the average number of cycles between those misses. When a thread executes alone on the processor, each miss causes an execution stall of MissLatency cycles (MissLatency is the average memory access time in processor cycles). InstPerMiss j and CycPerMiss j (InstPerMiss and CycPerMiss of thread j ) are illustrated in Figure 3 for two threads, both when running together (using SOE), and when executed separately (alone).
We can use InstPerMiss j and CycPerMiss j to analyze the IPCs, speedups, weighted speedup, conformance level and the fairness using different approaches. The IPC of each thread in single-thread mode and in SOE mode is shown in Eq. (2) and Eq. (3), where SwitchLatency is the average number of cycles a thread switch takes in SOE.
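One form of Eq. (2) and Eq. (3) that is consistent with the definitions above can be written as follows. This is a sketch under stated assumptions, not a verbatim restatement: it assumes N threads, round-robin switching on every miss, and that each miss is fully hidden by the execution of the other threads. The superscripts ST (single-thread) and SOE are labels chosen for this illustration.

```latex
\begin{align}
\mathit{IPC}^{ST}_j &= \frac{\mathit{InstPerMiss}_j}
                            {\mathit{CycPerMiss}_j + \mathit{MissLatency}} \tag{2} \\
\mathit{IPC}^{SOE}_j &= \frac{\mathit{InstPerMiss}_j}
                             {\sum_{k=1}^{N}\left(\mathit{CycPerMiss}_k + \mathit{SwitchLatency}\right)} \tag{3}
\end{align}
```

In the single-thread case a thread retires InstPerMiss j instructions and then stalls for MissLatency cycles; in SOE it retires the same number of instructions once per round, where a round consists of one execution interval and one switch per thread.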
This representation is based on several assumptions. It has been assumed that the work done speculatively while misses are being resolved is relatively small and that misses do not overlap. It has also been assumed that resource sharing (e.g., sharing of branch prediction history tables) has a negligible impact on the model accuracy. These assumptions simplify the model and do not significantly reduce its accuracy [Gabor et al. 2006]. In order to work around the miss overlapping effect, we limit our analysis in the rest of the article to the case when CycPerMiss j ≥ MissLatency−2SwitchLatency (in a two-thread SOE model), a reasonable assumption that holds in all of the cases we analyzed. The model uses round robin thread switching, which is reasonable for an SOE implementation [Gabor et al. 2007].
To enforce fairness or SLA in SOE, we induce additional thread switches, which do not necessarily hide any long stall event. The use of induced switches introduces two variables: CycPerSwitch j and InstPerSwitch j , which are the average number of cycles and instructions between switches in thread j . In this case, IPC SOE j is defined in (4).
Using Eq. (2) and Eq. (4) we can express Speedup j , the speedup of thread j , as shown in Eq. (5):
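A form of Eq. (4) and Eq. (5) that follows from the definitions of CycPerSwitch j and InstPerSwitch j is sketched below. It assumes round-robin switching among N threads and a single-thread IPC of InstPerMiss j /(CycPerMiss j + MissLatency); the superscripts ST and SOE are labels chosen for this illustration.

```latex
\begin{align}
\mathit{IPC}^{SOE}_j &= \frac{\mathit{InstPerSwitch}_j}
      {\sum_{k=1}^{N}\left(\mathit{CycPerSwitch}_k + \mathit{SwitchLatency}\right)} \tag{4} \\
\mathit{Speedup}_j &= \frac{\mathit{IPC}^{SOE}_j}{\mathit{IPC}^{ST}_j}
    = \frac{\mathit{InstPerSwitch}_j}{\mathit{InstPerMiss}_j} \cdot
      \frac{\mathit{CycPerMiss}_j + \mathit{MissLatency}}
           {\sum_{k=1}^{N}\left(\mathit{CycPerSwitch}_k + \mathit{SwitchLatency}\right)} \tag{5}
\end{align}
```

If a thread is further assumed to retire instructions at its IPC no miss between switches, the first factor reduces to CycPerSwitch j / CycPerMiss j , so the speedup depends only on cycle counts and latencies.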
Eq. (5) and Eq. (4) can be used to calculate different metrics, such as aggregate IPC, weighted speedup (WeightedSpeedup), harmonic mean of speedups (HarmonicMean), as well as fairness metrics, such as F min max [Gabor et al. 2006, 2007]. C SLA can be calculated as defined in Eq. (1).
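The calculation of these metrics can be sketched as follows. The sketch rests on assumptions that are not taken from the article: a thread retires instructions at IPC no miss between switches, switching is round robin, misses are fully hidden, and C SLA is the worst-case ratio to R saturated at 1. The thread parameters, latencies, and all function names are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    ipc_no_miss: float   # IPC excluding last-level misses
    cyc_per_miss: float  # average cycles between two consecutive misses


def ipc_single(t, miss_latency):
    """Single-thread IPC: every miss stalls the core for miss_latency cycles."""
    inst_per_miss = t.ipc_no_miss * t.cyc_per_miss
    return inst_per_miss / (t.cyc_per_miss + miss_latency)


def ipc_soe(threads, cyc_per_switch, switch_latency):
    """Per-thread IPC under round-robin SOE with the given switch intervals."""
    round_cycles = sum(c + switch_latency for c in cyc_per_switch)
    # Assumption: between switches a thread retires at its ipc_no_miss.
    return [t.ipc_no_miss * c / round_cycles
            for t, c in zip(threads, cyc_per_switch)]


def speedups(threads, cyc_per_switch, miss_latency, switch_latency):
    """Speedup of each thread: SOE IPC divided by single-thread IPC."""
    soe = ipc_soe(threads, cyc_per_switch, switch_latency)
    return [s / ipc_single(t, miss_latency) for s, t in zip(soe, threads)]


def c_sla(p, required=None):
    """Assumed form of C SLA: min(1, min_j(p_j) / R), with R = 1/N by default."""
    r = 1.0 / len(p) if required is None else required
    return min(1.0, min(p) / r)


if __name__ == "__main__":
    # Hypothetical two-thread workload and latencies (in cycles).
    threads = [Thread(1.0, 400.0), Thread(2.5, 2000.0)]
    plain_soe = [t.cyc_per_miss for t in threads]  # switch only on misses
    s = speedups(threads, plain_soe, miss_latency=300.0, switch_latency=10.0)
    print("aggregate IPC   :", sum(ipc_soe(threads, plain_soe, 10.0)))
    print("speedups        :", s)
    print("WeightedSpeedup :", sum(s))
    print("F min max       :", min(s) / max(s))
    print("C SLA           :", c_sla(s))
```

Plain SOE corresponds to CycPerSwitch j = CycPerMiss j ; passing smaller switch intervals to the same functions models induced switches.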
EXAMPLE: SOE WITH DIFFERENT OPTIMIZATION TARGETS
In this section, we provide an example that illustrates the concepts presented so far. In the example, we use the analytical model described in the previous section.
[...] IPC SOE j ; optimizing for "aggregate IPC & SLA" maximizes the aggregate IPC under the constraint of C SLA = 1 (conforming to SLA); optimizing for F min max (which is the same as optimizing for F max-min or F stdev ) means equalizing the speedups of both threads; optimizing for HarmonicMean (harmonic mean of speedups) maximizes the combined throughput and fairness metric of Luo et al. [2001]; optimizing for WeightedSpeedup maximizes the weighted speedup (with or without the constraint of C SLA = 1). Figure 4 shows the results of the different optimization points. The vertical axis is the achieved value of the five metrics. The values of the five metrics for the different optimization points are grouped on the horizontal axis. Plain SOE is the baseline, which achieves a low degree of fairness and C SLA (F min max = 0.5; C SLA = 0.88). Optimizing for aggregate IPC yields the lowest fairness, C SLA and WeightedSpeedup when compared to any other optimization point, including plain SOE. This is because optimizing for aggregate IPC means giving the thread with the higher IPC no miss (T 2 ) as much time to run as possible, switching to the other thread for the duration of any miss handling, and switching back as soon as the miss is resolved. Optimizing the execution for WeightedSpeedup also yields low fairness and C SLA , since the execution is biased strongly toward the thread with the lower CycPerMiss (T 1 ), allowing the other thread to run only while misses in T 1 are handled. In this example, optimizing for WeightedSpeedup reduces the aggregate IPC and vice versa (we will show later that this is not always the case).
Optimizing the execution for F min max also achieves high C SLA , as well as high HarmonicMean and WeightedSpeedup (higher than plain SOE). Optimizing for aggregate IPC or WeightedSpeedup with SLA (under the constraint of C SLA =1) achieves nearly the same results as when optimizing with no constraints at all. When SLA is enforced, the optimization yields acceptable execution (as defined by C SLA ) and improves the WeightedSpeedup but causes a decrease in aggregate IPC (since it biases the execution toward the lower IPC no miss thread). On the other hand, enforcing SLA using the C SLA =1 constraint increases WeightedSpeedup and causes only a slight decrease in the aggregate IPC.
We can also examine the various runs of the example as points in the two-dimensional space of switching points CycPerSwitch j . CycPerSwitch j ranges from MissLatency−2SwitchLatency to CycPerMiss. Figure 5(a) shows the aggregate IPC as a function of CycPerSwitch 1 and CycPerSwitch 2 , with two annotated points, plain SOE and the maximum aggregate IPC point (shown in Table II as "Optimized for Agg IPC"). The thin dark stripe marks F min max =1 (where the speedups of the individual threads are equalized), the wider stripe shows where C SLA =1 (the speedup of the individual threads is ≥1/2), and the rest is where C SLA <1. Figure 5(b) shows the same space for WeightedSpeedup.
As shown, the C SLA =1 constraint is less "strict" (meaning more points satisfy the constraint) than F min max =1. The larger area in the graph covered by C SLA =1, relative to F min max =1, allows the selection of points with higher aggregate IPC or WeightedSpeedup. Plain SOE as well as the points with maximum aggregate IPC or maximum WeightedSpeedup have unacceptable values of C SLA and fairness (C SLA <1; F min max too low). In this example, enforcing fairness or SLA (F min max =1 or C SLA =1) increases the WeightedSpeedup but reduces the aggregate IPC. Had the IPC no miss of the two threads been switched (so that IPC no miss =2.5 and 1.0 for T 1 and T 2 , respectively), then optimizing for aggregate IPC or for WeightedSpeedup (with or without the constraint of C SLA =1) would have increased the aggregate IPC over that of plain SOE. This is so because, in this case, the low CycPerMiss thread (which is favored when optimizing for WeightedSpeedup) is the same thread that has the higher IPC no miss (which is favored when optimizing for aggregate IPC). In this case, optimizing for aggregate IPC or for WeightedSpeedup with the constraint of C SLA =1 would result in the same execution parameters (improving WeightedSpeedup and aggregate IPC over plain SOE). Optimizing for F min max would also improve the aggregate IPC because it would also bias the execution toward the high IPC thread. This case shows that enforcing fairness may improve the aggregate IPC. 
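The difference in strictness between the two constraints can be illustrated with a small selection routine. This is a sketch under stated assumptions: the candidate points and their speedups are hypothetical values, not points taken from Figure 5, and the metric definitions (F min max as the min-to-max speedup ratio, C SLA as the worst speedup relative to 1/N, capped at 1) are assumed forms that agree with the values quoted in the text.

```python
def f_min_max(s):
    """Assumed form: smallest speedup over largest speedup."""
    return min(s) / max(s)


def c_sla(s):
    """Assumed form: worst speedup relative to 1/N, capped at 1."""
    return min(1.0, min(x * len(s) for x in s))


def best_point(points, acceptable):
    """Return the acceptable point with the highest WeightedSpeedup,
    or None when no point satisfies the constraint."""
    candidates = [p for p in points if acceptable(p[1])]
    return max(candidates, key=lambda p: sum(p[1]), default=None)


# Hypothetical execution points: (label, (speedup of T1, speedup of T2)).
points = [
    ("A", (0.44, 0.88)),  # unconstrained maximum, one thread below 1/2
    ("B", (0.62, 0.62)),  # equal speedups
    ("C", (0.80, 0.50)),  # unequal, but both threads reach 1/2
    ("D", (0.95, 0.30)),  # strongly biased toward T1
]

unconstrained = best_point(points, lambda s: True)
fair = best_point(points, lambda s: f_min_max(s) >= 0.999)
sla = best_point(points, lambda s: c_sla(s) >= 1.0)
print(unconstrained[0], fair[0], sla[0])
```

With these made-up points, the fairness constraint admits only the equal-speedup point, while the SLA constraint also admits an unequal point with a higher WeightedSpeedup. This mirrors the wider C SLA =1 stripe described above.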
MAXIMIZING THE WEIGHTED SPEEDUP UNDER SLA
When enforcing SLA, we can choose to optimize performance by maximizing either the aggregate IPC or the WeightedSpeedup. The two optimization targets produce different results in most cases. In this section, we demonstrate the effectiveness of using the C SLA metric for maximizing the weighted speedup. A similar method can be used with aggregate IPC as the target.
We simulated a two-thread SOE processor, as described in Section 9. In each thread, the number of misses causing a switch and the number of cycles up to the switch are counted. These counters are then used to compute the CycPerMiss of each thread every 250,000 cycles. The results are used to calculate the CycPerSwitch j that needs to be maintained in order to maximize WeightedSpeedup under SLA (see Appendix A for more details). This CycPerSwitch j is enforced in the following 250,000 cycles using a deficit mechanism. A thread switches out when its cycle quota (based on CycPerSwitch j ) elapses without a miss. The deficit mechanism keeps track of the CycPerSwitch j quota left over when a miss triggers a thread switch. The leftover quota is added to the quota for the next time that thread switches in. This approach maintains the required average CycPerSwitch j while allowing misses to trigger additional switch points. The same technique, with a different CycPerSwitch j calculation, was used for F min max enforcement in the work of Gabor et al. [2006, 2007].
Figure 6 shows the achieved conformance level of the different simulation runs, as measured by C SLA . It shows the achieved C SLA of plain SOE (switches only on misses) as well as that of the runs which enforce F min max =1 (equalize the speedups of the individual threads [Gabor et al. 2007]) and C SLA =1 (maximize WeightedSpeedup under the constraint that the speedups of the two threads are both ≥1/2). The simulations are ordered by the achieved C SLA of plain SOE. Nine out of 38 runs achieve an extremely low C SLA , of no more than 0.1, when executed using plain SOE (shown on the left of the figure). This means that one thread lost 95% or more of its performance. In these cases, enforcing F min max =1 or SLA improves the achieved C SLA significantly. Plain SOE achieves a high C SLA , of at least 0.9, in 19 out of 38 runs (runs on the right in the figure).
In these cases, enforcing F min max =1 or SLA does not significantly change the achieved C SLA . The results show that SLA enforcement significantly improves the C SLA for the unbalanced cases and has negligible effect on the balanced cases. Figure 7 shows the achieved WeightedSpeedup and aggregate IPC. Figure 7(a) shows that SLA enforcement improves WeightedSpeedup by an average of 10% over plain SOE in unbalanced cases (the cases where C SLA ≤0.85 in plain SOE). The highest WeightedSpeedup improvement is 85% (gzip_025:mgrid_019). When threads are balanced, SLA enforcement has little effect on the WeightedSpeedup, with an average of 5% WeightedSpeedup improvement over all the runs (balanced and unbalanced). As shown in Figure 7(c), there is a drop in the aggregate IPC in the unbalanced cases (shown on the left). In these cases, SLA enforcement biased the execution toward the lower IPC no miss thread. The aggregate IPC is hardly affected in the balanced runs (runs on the right, which achieve high C SLA in plain SOE). F min max =1 enforcement usually causes a smaller decrease in the aggregate IPC than SLA enforcement does. This is because in all of our runs, the low CycPerMiss thread is the low IPC no miss thread. In these cases, optimizing for WeightedSpeedup (under the C SLA =1 constraint) causes additional biasing toward the lower IPC no miss thread (as described in Section 5). Had we optimized for aggregate IPC instead of for WeightedSpeedup, we would have minimized the aggregate IPC drop caused by the fairness (or SLA) enforcement, and would have achieved higher aggregate IPC than was achieved by F min max =1 enforcement (with lower WeightedSpeedup).
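The deficit mechanism described earlier in this section can be sketched as follows. This is an illustrative model rather than the simulator's implementation: the class and function names are hypothetical, a single thread is modeled in isolation, and the miss gaps in the usage example are made-up values.

```python
class DeficitQuota:
    """Per-thread cycle quota that carries unused cycles forward."""

    def __init__(self, cyc_per_switch):
        self.cyc_per_switch = cyc_per_switch  # target average cycles per switch
        self.leftover = 0                     # quota banked for the next switch-in

    def run(self, cycles_until_miss):
        """Run the thread once; return (cycles_run, reason_for_switch)."""
        budget = self.cyc_per_switch + self.leftover
        if cycles_until_miss <= budget:
            # A miss triggers the switch first: bank the unused quota.
            self.leftover = budget - cycles_until_miss
            return cycles_until_miss, "miss"
        # The quota elapsed without a miss: forced switch, nothing to bank.
        self.leftover = 0
        return budget, "quota"


def run_lengths(miss_gaps, cyc_per_switch):
    """Replay a sequence of cycles-between-misses for one thread and
    return the length of each run plus the quota still banked."""
    quota = DeficitQuota(cyc_per_switch)
    lengths = []
    for gap in miss_gaps:
        remaining = gap
        while True:
            ran, reason = quota.run(remaining)
            lengths.append(ran)
            if reason == "miss":
                break
            remaining -= ran  # the miss is still ahead after a forced switch
    return lengths, quota.leftover


lengths, banked = run_lengths([40, 300], cyc_per_switch=100)
print(lengths, banked)
```

In this sketch, the cycles run plus the quota still banked always equal the number of switches times the target, which is how the required average CycPerSwitch j is maintained even though misses add switch points.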
To gain further insight, we take a closer look at three representative simulation runs. We examine the WeightedSpeedup space using the average CycPerMiss j values of both threads as measured in the plain SOE run. The CycPerMiss values shown are: CycPerMiss gzip_025 =14,976; CycPerMiss mgrid_019 =328; CycPerMiss gap_029 =1,159; CycPerMiss vpr_008 =2,268; CycPerMiss perlbmk_024 =3,990. Figure 8 shows the WeightedSpeedup space for gzip_025:mgrid_019, gap_029:vpr_008 and perlbmk_024:perlbmk_024. The first pair has the highest improvement in WeightedSpeedup when C SLA =1 (rather than F min max =1) is enforced. In the same situation, the second pair has only a slight improvement. Finally, the third pair uses the same application point on both threads.
The 3D graphs in Figure 8 show F min max =1 as a thin dark stripe. A wider stripe marks C SLA =1 and the rest is where C SLA <1. In the two examples that use different threads (Figures 8(a) and 8(c)), the C SLA =1 constraint enables us to increase the WeightedSpeedup over that of F min max =1. This can be clearly seen in the 2D plots (Figures 8(d) and 8(f)). These 2D plots show the WeightedSpeedup : CycPerSwitch 2 surface of the 3D plots, for the case of CycPerSwitch 1 =CycPerMiss 1 . In the third case (Figure 8(e)), where both threads have the same CycPerMiss, there is little to gain from SLA over F min max =1 or plain SOE. In real life, however, the two threads rarely run at exactly the same speed. Instead, they have slightly different CycPerMiss values, which can be translated into WeightedSpeedup improvement when using SLA (we calculate fairness parameters every 250,000 cycles, which allows us to closely track performance changes). For the same reason, we see a slight improvement in WeightedSpeedup in most of the cases when the same application, from the same starting point, is run in both threads. As shown, the analytical model correlates with our empirical results and provides a good estimate of the WeightedSpeedup gain of the SLA enforcement method over the others. It should be noted that the simulations themselves adjust to the measured CycPerMiss every 250,000 cycles (a finer granularity than the global average shown in the figure), and can thus achieve higher gains than indicated by the analytical model (which uses the average CycPerMiss value of the whole run).
REDUCING THE POWER UNDER SLA
In this section, we show that maximizing the conformance to SLA (C SLA ) can be used as a constraint on one metric while optimizing a different metric. As before, we define SLA using the speedups of the individual threads. The requirement C SLA = 1 is satisfied as long as each thread achieves its required speedup. We use 1/2 as the speedup requirement for two-thread SOE.
An operation mode in which power is reduced while keeping C SLA =1 can be useful, for example, when the processor overheats (e.g., limiting the temperature in laptop or handheld devices). In this case, power should be reduced in order to lower the processor temperature. An added advantage is in the case of limited power supplies, where power consumption of the processor should be temporarily reduced due to some circumstances, such as other servers or components temporarily requiring more power than the power supply has been provisioned for.
Power reduction can be done by using Dynamic Voltage Scaling (DVS) [Pering et al. 1998], where voltage and frequency are reduced in order to save power. It can also be done by Dynamic Clock Disabling (DCD), where the clock is dynamically disabled for periods of time, or by throttling resources such as fetch or data cache accesses [Brooks and Martonosi 2001]. The drawback of these approaches is that they reduce the power by reducing the performance of all the threads, regardless of their power consumption. This means that when one thread is a power hog, the other thread has to lose performance as well in order to reduce the overall power (e.g., through processor frequency reduction). We can work around this drawback by selecting the minimal power execution point out of the acceptable (C SLA =1) execution points. In this way, we bias execution toward the lower power thread while maintaining SLA. SLA is important in order to avoid starvation of any thread. It should be noted that even after biasing the execution toward the threads with the lower power requirement, we may still have to employ methods such as DVS or DCD in order to guard against pathological cases. Effective biasing will, however, prevent unnecessary performance loss.
The Data Cache Unit (DCU) power is used as a case study for our power reduction technique. We simulated a two-thread SOE processor (Section 9 describes the processor in detail). We measured the DCU power using access counts multiplied by the energy per access. The energy per access of the simulated data cache (32K, 8-ways, 64 bytes per way, 2 read ports and 1 write port) was estimated using CACTI 4.2 [Tarjan et al. 2006] at 70nm process technology. The average power was periodically calculated for each thread using its access counter, the CACTI energy per access, and the cycles it has been running. We then calculate the acceptable (C SLA =1) execution points based on the SOE analytical model (see Eq. (11) in Appendix A), and choose the execution point (CycPerSwitch 1 and CycPerSwitch 2 ) which is acceptable and which maximizes the execution time of the thread with the lower average power per cycle. This calculation and the sampling of thread parameters were done every 250,000 cycles. Using this method, we conceptually bias the execution toward the lower average power thread while maintaining SLA.
Figure 9 shows the power of the power-reduction execution mode relative to the power of plain SOE, as well as the achieved C SLA value of the two execution modes. The simulation runs are ordered by their achieved C SLA in plain SOE. The simulation runs that are balanced in plain SOE, shown on the right side of the figure, are indifferent to the power reduction execution mode. This is because, in these cases, both threads have the same characteristics (many of them are the same application on both threads). Biasing execution toward one of the threads has low potential power gain since both threads are similar (consume the same average power). When threads of different characteristics are executed together, they usually achieve low C SLA , mostly due to the difference in CycPerMiss values, as shown in the example in Section 4.
In these cases, shown on the left side of the figure, plain SOE achieves low C SLA . Power reduction under SLA execution mode improves the achieved C SLA , usually to a value that is close to 1, and reduces the power by up to 26%. This can also occur when both threads are running the same application.
For example, ammp_038:ammp_038 (the left-most run) shows 26% power reduction. This occurs due to different phases in the application being executed on the two threads, which allows for power optimization under SLA. The power-reduction execution mode achieved an average of 15% power reduction in the 10 nonbalanced runs, shown on the left side, while maintaining high C SLA .
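The per-thread power estimate that drives the biasing in this section can be sketched as follows. This is a simplified illustration, not the simulator code: the function names, the counter values, and the energy-per-access figure are hypothetical, and the step that selects the acceptable (C SLA =1) execution point from the analytical model is omitted.

```python
def dcu_power_per_cycle(accesses, cycles_run, energy_per_access):
    """Average DCU energy per cycle for one thread in a sampling interval:
    access count times energy per access, over the cycles the thread ran."""
    if cycles_run <= 0:
        raise ValueError("thread did not run in this interval")
    return accesses * energy_per_access / cycles_run


def lower_power_thread(counters, energy_per_access):
    """Return (index of the thread with the lowest average power,
    list of per-thread power values).

    counters holds one (accesses, cycles_run) pair per thread."""
    powers = [
        dcu_power_per_cycle(accesses, cycles, energy_per_access)
        for accesses, cycles in counters
    ]
    favored = min(range(len(powers)), key=powers.__getitem__)
    return favored, powers


# Hypothetical counters for one 250,000-cycle interval and a made-up
# energy per access (arbitrary units); neither comes from the article.
counters = [(60_000, 150_000), (20_000, 100_000)]
favored, powers = lower_power_thread(counters, energy_per_access=0.5)
print(favored, powers)
```

The thread returned here is the one whose execution time would be maximized among the acceptable execution points, so the power saving never comes at the cost of violating the SLA of the other thread.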
DIFFERENTIATED SERVICE UNDER SLA
SLA can incorporate differentiated service concepts in which each thread may have a different SLA constraint. This could be useful for scheduling threads with different requirements together. For example, one thread may have a real-time requirement, while the other can be best effort. Even in this case, we may need to guarantee some minimal level of throughput for the best-effort thread. An example of such a scenario is a real-time (online) video encoder thread, which has to process a certain amount of data per second, scheduled together with a background thread that performs a calculation with less strict throughput requirements. Such a scenario requires a slight change to the C SLA metric defined in Eq. (1). It requires per-thread SLA constraints, replacing the constant 1/N constraint.
The SLA-based metric, C SLA , for differentiated service can be defined as shown in Eq. (10), where C j is the speedup constraint for thread j , C j >0. It should be noted that Eq. (1) is a special case of Eq. (10) in which C j =1/N for all threads.
Using C j =0 means that thread j is a best-effort thread. This means that the SLA constraint allows for that thread not to run at all. In order to allow for C j =0, Eq. (10) needs to be modified so that the minimum is taken only over the threads j for which C j >0 (that is, the best-effort threads are ignored).
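A differentiated conformance metric of this kind can be sketched as follows. Eq. (10) is not reproduced in this section, so the form used here is an assumption: the smallest ratio of achieved speedup to required speedup over the threads with C j >0, capped at 1. It reduces to the uniform case when every C j equals 1/N. The function name and the sample values are hypothetical.

```python
def c_sla_differentiated(speedups, constraints):
    """Assumed differentiated C SLA: worst ratio of achieved to required
    speedup over the threads that have a non-zero requirement."""
    ratios = [s / c for s, c in zip(speedups, constraints) if c > 0]
    if not ratios:
        # Every thread is best effort, so nothing can be violated.
        return 1.0
    return min(1.0, min(ratios))


# Hypothetical speedups under alpha : 1-alpha constraints.
print(c_sla_differentiated([0.7, 0.3], [0.6, 0.4]))  # second thread falls short
print(c_sla_differentiated([0.7, 0.3], [0.6, 0.0]))  # second thread is best effort
```

Skipping the threads with C j =0 avoids a division by zero and matches the stated intent that a best-effort thread may receive no service without reducing conformance.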
We have implemented differentiated service support in the simulator, using the model described in Appendix B, where the sum of the speedup constraints is 1 (the required speedup of thread 2 is 1 minus the required speedup of thread 1). We maintain CycPerSwitch j for each thread such that the WeightedSpeedup is maximized under the constraint that C SLA =1. Appendix B describes the algorithm used for setting the CycPerSwitch for both threads as well as the mathematical reasoning behind it. Figure 10 shows the results of differentiated service simulations for two thread combinations: gap_029:vpr_008 and perlbmk_024:perlbmk_024. Each combination was simulated using seven differentiated service constraints, labeled as α : 1−α, where α is the constraint on the speedup of the first thread and 1−α is the constraint on the speedup of the second thread. The results of the simulation runs are shown in Figure 10. Simulation results show that controlling CycPerSwitch, as formulated in Appendix B, is effective at maintaining differentiated service constraints. Comparing the results to the analytical model WeightedSpeedup charts shows that the trend of WeightedSpeedup follows that of the analytical model. In gap_029:vpr_008, the analytical model shows that WeightedSpeedup increases with α (WeightedSpeedup increases by 0.03), while in the case of perlbmk_024:perlbmk_024, changing α has negligible effect on WeightedSpeedup (WeightedSpeedup changes by 0.002). This is because in the first case, increasing α increases the time the slower thread gets (the thread with the higher CycPerMiss). This means, conceptually, that the faster thread gives up some fraction of the time in favor of the slower thread. The slower thread gains more speedup than the faster thread loses. In the case of perlbmk_024:perlbmk_024, both threads are from the same application (and run at similar speed), hence moving execution cycles from one to the other has a smaller effect on the overall WeightedSpeedup.
It should be noted that the analytical model projects higher total WeightedSpeedup than what we get in the simulation runs. This is due to the inaccuracies of the model, which tends to underestimate the single-threaded performance [Gabor et al. 2006]. This is because sharing of resources reduces the IPC in between misses of SOE, compared to ST. This inaccuracy affects the accuracy of the WeightedSpeedup projection but, as our simulation runs show, has a small effect on the fairness or C SLA projection (which is used by our SLA enforcement). This can be viewed as the effect of maintaining fairness of multithreaded runs when compared to hypothetical single-thread runs with reduced effective resource sizes (reduced due to the sharing in the multithreaded runs).
The differentiated SLA demonstrates the effectiveness of using C SLA =1 as a constraint while maximizing the WeightedSpeedup. The C SLA =1 constraint defines the range of acceptable execution points (CycPerSwitch j values), out of which we choose the one that maximizes our target (WeightedSpeedup).
SIMULATION METHODOLOGY
An out-of-order processor was simulated using a detailed cycle-accurate, execution-driven simulator. The processor is derived from the P6 micro-architecture [Gwennap 1995]. The simulator provides fully accurate simulation of the application as well as interrupts, operating system IO and DMA side effects. This simulator and tool chain have been extensively used in many studies [Akkary et al. 2004; Cooksey et al. 2002; Falcon et al. 2004; Gabor et al. 2007; Mutlu et al. 2003; Singhal et al. 2004].
We simulated an out-of-order core, with first-level instruction and data caches, a unified second-level cache (L2), a pipelined bus and a constant latency memory. Table III summarizes machine configuration parameters. Structure sizes were based on Intel's disclosure of the core processor [Krewell 2006], and were slightly increased to reflect our view on a future version of that processor. Memory latency was set to 300 cycles, which is 75ns at 4GHz processor frequency.
We have modified the simulator to support SOE multithreading, switching on L2 cache misses (last-level cache misses). Misses induced by load instructions as well as i/d TLB page walks are tracked by flagging them in the Reorder Buffer (ROB). A thread switch is triggered when the head of the ROB (the next instruction or micro-operation to be retired) is flagged as handling a miss which has not been resolved. A thread switch causes instructions to be drained from the RS (Reservation Station), ROB and load buffers (LBs). Switch latency, the number of cycles from the start of the switch until the first instruction of the next thread retires, is simulated in a cycle-accurate manner. The switch latency is not constant (e.g., depends on the instruction that was switched in), and usually accumulates to around 25 cycles.
Structures such as iTLB, dTLB, caches and branch prediction history are shared, and are not flushed on switch. This approach maintains performance after thread switch events [McNairy and Bhatia 2005]. Some structures, such as the 128-entry 4-way dTLB, include a thread tag per entry, to remove the flushing requirement. The L1 D-cache is physically tagged (does not require a thread tag per entry). The store buffer keeps dispatching retired stores even after a flush, but does not forward their data if they are not from the same thread. The miss and cycle counters are sampled every 250,000 cycles, for the recalculation of fairness or SLA parameters. To ensure that all the threads run in every 250,000 cycles, each thread is limited to a maximum of 50,000 cycles before it is forced to switch out.
SPEC CPU2000 benchmarks [CPU2000] were simulated on a two-thread SOE configuration. Caches were warmed up using 10,000,000 instructions from each thread (interleaved). Threads were simulated until both of them completed at least 6,000,000 instructions. The first 1,000,000 simulated instructions were not included in the results (statistics), and were used to warm up the internal micro-architecture state (internal structures such as the branch prediction tables as well as the state of the fairness or SLA mechanism). When running the same benchmark on both threads, the two threads were offset by 1,000,000 instructions.
We used noncooperative combinations of ST workloads. An equal amount of work is guaranteed by sampling the simulator separately for each thread when it reaches its designated instruction count. Both threads continue to run, even after one of them is sampled, until the last one reaches its designated instruction count. This guarantees that we use the same work (instruction segment of each thread) in the different runs.
We used 38 combinations of benchmarks. In 18 cases, we used the same benchmark at the same starting point on both threads. Each combination was simulated using plain SOE (no fairness or SLA enforcement), with F min max maximization (max-min speedup ratio enforcement [Gabor et al. 2007] ) and with SLA enforcement while maximizing WeightedSpeedup or reducing the power. In addition, we simulated each benchmark separately on the processor in order to obtain its actual IPC ST j . We use name-number as naming convention to refer to the executed benchmark application on each thread, where name is the application name and number is the point we used in that execution (number is an enumeration of starting points at regular intervals).
RELATED WORK
Fairness has been studied in the context of job scheduling. Job scheduling includes, for example, process scheduling on multiprocessor systems (at the operating system level) and packet stream processing in communication systems (e.g., network packet processing in routers). Job scheduling deals with the decision of how to handle queues of jobs in resource-limited environments. Wang and Morris [1985] defined the Q-factor as the mean response time relative to a global first-come-first-served (FCFS) policy. The Q-factor measures both performance and fairness, using FCFS as a baseline. Many communication scheduling algorithms were analyzed by comparing their performance (handling or completion time) to that of General Processor Sharing (GPS), which is a byte-level round-robin. Such algorithms include, for example, Self-Clocked Fair Queuing (SCFQ) [Golestani 1994], Weighted Fair Queuing (WFQ) [Goyal et al. 1997] and Worst-case Fair-Weighted Fair Queuing (WF 2 Q) [Bennett and Zhang 1996]. These algorithms attempt to create a behavior similar to that of the theoretical GPS model, and hence achieve fairness. Avi-Itzhak and Levy [2004] based their fairness measure on the service order. Wierman and Harchol-Balter [2003] suggested using E[T (x)/x], the expected normalized response time for a job of size (handling time) x, as a measure of fairness. The expected normalized response time is the expected slowdown induced by the scheduling algorithm to a job of size x. Raz et al. [2004] defined a resource-allocation queuing fairness measure (RAQFM). RAQFM is based on the principle that all jobs present in a system at time t deserve an equal share of the resources. Hence, the share of each job at time t is 1/N (t), where N (t) is the number of jobs present at time t.
Our C SLA metric bears some conceptual resemblance to GPS and RAQFM [Raz et al. 2004]. In both of these job-scheduling concepts, a theoretical system where each job gets an equal share of resources from the time it arrives until it completes is considered a fair system. These job-scheduling concepts then use this theoretical system in order to determine job scheduling. In our case, we already have threads running, and we need to optimize the processor resources among them. Furthermore, in job-scheduling concepts, the required service time per job is usually not affected by the other coscheduled jobs, whereas in our case it is. In the SOE case, we use the mutual effects between the coscheduled threads in order to optimize the weighted speedup under the C SLA =1 constraint.
Admission control has been proposed to reduce fairness issues between concurrent threads by controlling the allowed thread combinations. Admission control uses prestated or expected demands of jobs to make decisions before the execution. Absolute resource requirements, such as cache or bandwidth requirements, are useful for admission control, since they can be easily summed up. Fairness or SLA enforcement is done among the already running threads and can use the actual execution as feedback to the enforcement. In this case, relative metrics such as Speedup can be effectively used. SLA and fairness enforcement can be used to complement admission control, and their metrics can be used as feedback for future admission control decisions.
Cache sharing between multiple coscheduled threads has been shown to be a potential cause of serious problems such as thread starvation. Cache sharing can be extremely unfair when, for example, a thread with high miss rate and poor locality constantly causes evictions of another thread's data that will be required soon after. Dynamic and static resource partitioning schemes have been proposed to improve fairness in sharing caches and other resources [Chang and Sohi 2007; Hsu et al. 2006; Iyer 2004; Iyer et al. 2007; Kim et al. 2004; Rafique et al. 2006; Yeh and Reinman 2005]. Kim et al. [2004] studied fairness issues in cache sharing in CMP. They showed that optimizing for fairness also increases throughput while maximizing throughput does not necessarily improve fairness. Nesbit et al. [2006] showed that a fair memory scheduler can also improve performance.
Fairness of threads was studied in the context of SMT [Cazorla et al. 2006; Luo et al. 2001; Reinhardt 1999, 2003; Tullsen and Brown 2001; Chen and Ma 2007]. Tuck and Tullsen [2003] noted that cache and other resource conflicts exist even when SMT processors are designed to minimize conflicts between threads. Raasch and Reinhardt [2003] showed that resource partitioning in SMT improves fairness. For example, a statically partitioned ROB improves fairness as compared to competitive sharing. Luo et al. [2001] used fetch policy as a heuristic to prioritize the different threads, in order to improve fairness. Cazorla et al. [2006] studied Quality of Service (QoS) enforcement in SMT. They sample the throughput of threads when executed separately (for a short period of time). They then enforce QoS by dynamically managing resources to achieve the operating system QoS requirements (per-thread percentage of IPC requirement). Lee and Asanovic [2006] suggested using measurements from various components in order to enforce QoS requirements. Chen and Ma [2007] suggested measuring the single-threaded performance of each thread in an SMT processor while the other thread is being stalled by a cache miss. These measurements can be used to estimate the single-thread IPC of each of the threads, as well as the overall fairness.
SUMMARY AND CONCLUSIONS
This article presents a new SLA metric based on requirements rather than on equalizing a measured property or a ratio. Previously suggested fairness metrics, such as F min max , F max-min , and F stdev , concentrate on equalizing a parameter (e.g., equalizing the speedups of the individual threads). The use of a fairness metric provides a way to share the hardware resources. A service level agreement is a contract between the software (a thread) and a multithreading or multicore platform. Each thread receives the agreed capacity regardless of the service level provided to other threads. We use SLA as a constraint. In addition, after satisfying SLA, if some processing capacity is left unused, as is often the case, we show that this capacity can be used for optimizing a target metric such as weighted speedup or power.
Our simulation results show that the SLA enforcement mechanism works well for a variety of thread combinations. In some cases, plain SOE achieves high service level even without any enforcement. In these cases, our mechanism has little if any effect on the optimized parameter (WeightedSpeedup, aggregate IPC, or power) or on the achieved C SLA . On the other hand, the SLA enforcement mechanism has a significant impact on instances in which plain SOE produces low service level, improving the achieved C SLA as well as the optimized metric. In these cases, we were able to increase WeightedSpeedup by an average of 10% or reduce the power by an average of 15%. While these improvements may not be very impressive when considered by themselves, their significance becomes evident when compared with performance losses that may occur by enforcing fairness, if a fairness mechanism is used. Where power reduction is concerned, other methods such as DVS or DCD (discussed in Section 7) may be necessary, in conjunction with SLA, to target tighter power budgets. In these situations, the use of SLA combined with other power reduction methods provides equitable reduction in the service level for various users or applications.
SLA can include priorities between threads by setting a different required value (constraint) to each thread. We have demonstrated such a differentiated service case, showing that SLA approach is useful for setting a constraint on the execution, while allowing for the optimization of a target metric (WeightedSpeedup).
For the purpose of introducing the new concepts of SLA and C SLA , SOE is an attractive multithreading approach that allows us to extend an analytical model [Gabor et al. 2006, 2007] that is rich enough to emulate a processor's shared resources but simple enough to be mathematically tractable. The concept of SLA is not, however, limited to SOE. For example, it can be used to enforce a service level in SMT and CMP processors (e.g., SLA in shared caches). In order to apply methods such as the one demonstrated here for SOE, one needs to know or be able to estimate how changes in policy will affect the service level and the optimization target (e.g., WeightedSpeedup or power as used in our case). In SMT, this can be done using profiling-based techniques, where the single-thread IPC of the individual threads is profiled before the threads are executed together. It can also be applied to SMT by sampling each thread on its own at regular intervals [Cazorla et al. 2006] or during cache misses of the other thread [Chen and Ma 2007]. These single-thread IPC samples can then be used, for example, to periodically estimate the fairness and bias the fetch priority of the threads accordingly [Cazorla et al. 2006; Chen and Ma 2007].
The SLA approach can be extended to include multiple metrics. For example, the system may offer SLA in performance (IPC), cache occupancy, bus usage, and power budget. In this case, each thread can have four required values, one for each shared resource under SLA (some of which may be 0, for best effort). These per-thread multidimensional SLA requirements define the set of acceptable execution points, out of which the system has to choose the one that optimizes its target (e.g., WeightedSpeedup or power). In some cases, when it is difficult to estimate or measure the effects of the shared resources, a search for the optimal execution point (such that C SLA =1) can be performed. In this case, the usage of each resource is biased toward some threads, and a feedback loop is required in order to tune the biasing for optimal execution. This sort of multidimensional SLA introduces many challenges. We leave this for future work.
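To illustrate how multidimensional requirements define a set of acceptable execution points, the following sketch filters candidate points by per-thread requirements and then picks the one that maximizes a caller-supplied target. The metric names, the data layout, and the assumption that every metric is a service level for which higher is better are all hypothetical simplifications (a power budget, for example, would need an upper bound instead).

```python
def meets_sla(achieved, required):
    """True if every thread reaches every required value.

    Both arguments are lists with one dict per thread, mapping a metric
    name to a value. A required value of 0 means best effort.
    """
    for thread_achieved, thread_required in zip(achieved, required):
        for metric, minimum in thread_required.items():
            if thread_achieved.get(metric, 0.0) < minimum:
                return False
    return True


def best_acceptable_point(points, required, target):
    """Among points that meet the SLA, return the one maximizing target."""
    acceptable = [p for p in points if meets_sla(p, required)]
    if not acceptable:
        return None  # no candidate satisfies all requirements
    return max(acceptable, key=target)


def total_speedup(point):
    """Example target: WeightedSpeedup as the sum of per-thread speedups."""
    return sum(thread["speedup"] for thread in point)
```

A search with feedback, as described above, would generate the candidate points iteratively rather than receiving them up front; this sketch covers only the selection step.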
APPENDIXES
APPENDIX A. MAXIMIZING WEIGHTEDSPEEDUP UNDER SLA
In order to maximize the weighted speedup (WeightedSpeedup) under SLA in two-thread SOE, we have to determine the CycPerSwitch j that needs to be maintained. Corollary A.1 shows how to obtain the CycPerSwitch j so that WeightedSpeedup is maximized under the constraint C SLA =1. The corollary can be proven algebraically. A high-level outline of the proof is given in the following analysis.
The following analysis assumes that MissLatency>2SwitchLatency>0 and that, without loss of generality, MissLatency−2SwitchLatency≤CycPerMiss 1 ≤ CycPerMiss 2 . It also uses MissLatency−2SwitchLatency≤CycPerSwitch j ≤ CycPerMiss j for j =1 and 2. CycPerSwitch j should not be lower than MissLatency−2SwitchLatency, otherwise threads might switch back after a miss, before that miss has been resolved (this is a reasonable assumption according to our simulations). CycPerSwitch j cannot be more than CycPerMiss j as there is no way to force the average number of cycles between switches to be more than those between misses (we switch on misses as well). Based on these assumptions and on the SOE analytical model (see Eq. (5) and Eq. (7)), we can make the following observations (all of which can be proved):
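[The observations and the statement of Corollary A.1 are missing at this point; only the fragment "≥ MissLatency" survives.]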
We use Corollary A.1 to set CycPerSwitch j in the empirical studies of Section 6. We also set CycPerSwitch j to be at least MissLatency − 2SwitchLatency to ensure that misses are resolved by the time we perform switches.
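The bounds stated in this appendix, MissLatency − 2SwitchLatency ≤ CycPerSwitch j ≤ CycPerMiss j, can be applied to any requested value with a simple clamp. The sketch below is illustrative only: the function name and argument names are assumptions, and it does not compute the value given by Corollary A.1.

```python
def clamp_cyc_per_switch(requested, cyc_per_miss, miss_latency, switch_latency):
    """Clamp a requested CycPerSwitch into the range assumed in Appendix A.

    Lower bound: MissLatency - 2*SwitchLatency, so that a miss is resolved
    before the thread is switched back in.
    Upper bound: CycPerMiss, since the thread also switches on every miss.
    """
    if not (miss_latency > 2 * switch_latency > 0):
        raise ValueError("requires MissLatency > 2*SwitchLatency > 0")
    lower = miss_latency - 2 * switch_latency
    if cyc_per_miss < lower:
        raise ValueError("requires CycPerMiss >= MissLatency - 2*SwitchLatency")
    return max(lower, min(requested, cyc_per_miss))


if __name__ == "__main__":
    # Example numbers are arbitrary: MissLatency=300, SwitchLatency=20.
    print(clamp_cyc_per_switch(50, 1000, 300, 20))
```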
APPENDIX B. SOE AND DIFFERENTIATED SERVICE
In differentiated service, each thread has a different SLA constraint (Section 8). The same method shown in Appendix A can be used to analyze the differentiated service case. This case, however, is more complex, as it is not guaranteed that there exists a point where CycPerSwitch 1 =CycPerMiss 1 that satisfies the differentiated SLA constraints. It can be shown that, if any point exists that satisfies these SLA constraints, then there is also at least one point on the border CycPerSwitch 2 =CycPerMiss 2 or CycPerSwitch 1 =CycPerMiss 1 that satisfies them. The rest of this appendix discusses the differentiated service case for a two-thread SOE model. Empirical results of using this model are shown in Section 8.
The following discussion uses the differentiated service version of C SLA shown in Eq. (10). Let us use α as the SLA constraint for the first thread and 1−α for the second thread, where 0<α<1. In this case, the range where C SLA =1 and WeightedSpeedup is maximized is usually found where CycPerSwitch 1 =CycPerMiss 1 and CycPerSwitch 2 is in the range defined in Eq. (12). 
Our domain of 0<2SwitchLatency<MissLatency and MissLatency−2SwitchLatency≤CycPerMiss 1 ≤CycPerMiss 2 guarantees that there is always some point that satisfies the differentiated SLA constraints given as α and 1−α. It should be noted that Eq. (11) is a special case of Eq. (12), in which α=1/2.
The empirical results in Section 8 are for the described model, where the sum of the speedup constraints is 1 (the required speedup of thread 2 is 1 minus the required speedup of thread 1). In order to determine CycPerSwitch for both threads, we used Eq. (12), with cond from Corollary A.1 determining which limit to use for CycPerSwitch 2 (in order to maximize WeightedSpeedup). When the range defined by Eq. (12) was not achievable (greater than CycPerMiss 2), we used Eq. (13) to set CycPerSwitch 1 (using the maximum CycPerSwitch 1 that is in that range).
It is also possible to analyze the general two-thread differentiated service case. In this case, the speedup of the first thread must be at least C 1 and the speedup of the second thread must be at least C 2, where C 1 >0 and C 2 >0. The analysis follows the same approach as shown earlier, yielding the domain in which the given SLA constraints can be met and how to maximize WeightedSpeedup in those cases. The analysis of this case has an additional range in which the SLA constraints cannot be met (in which case we may wish to maximize C SLA). The analysis of the general case is beyond the scope of this article.
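In the absence of a closed-form analysis, the general case could be explored numerically. The sketch below scans only the two borders named in this appendix (CycPerSwitch 1 =CycPerMiss 1 and CycPerSwitch 2 =CycPerMiss 2) and assumes a caller-supplied function `speedup_model(c1, c2)` that returns the two speedups. That function, the grid resolution, and the conformance formula are all hypothetical, and the SOE analytical model itself is not reproduced here.

```python
def search_borders(speedup_model, cyc_per_miss, lower, required, steps=50):
    """Grid-search the two borders for the best CycPerSwitch pair.

    speedup_model(c1, c2) -> (s1, s2) is a hypothetical caller-supplied model.
    cyc_per_miss = (CycPerMiss_1, CycPerMiss_2); lower is the common lower
    bound on CycPerSwitch; required = (C1, C2), both > 0.
    If some point meets both constraints, the feasible point with the highest
    WeightedSpeedup is returned; otherwise the point with the highest
    conformance (an assumed measure, capped at 1) is returned.
    """
    m1, m2 = cyc_per_miss

    def grid(lo, hi):
        return [lo + (hi - lo) * i / steps for i in range(steps + 1)]

    candidates = [(m1, c2) for c2 in grid(lower, m2)]   # border 1
    candidates += [(c1, m2) for c1 in grid(lower, m1)]  # border 2

    best_ok = None   # (weighted speedup, c1, c2) among feasible points
    best_any = None  # (conformance, c1, c2) among all points
    for c1, c2 in candidates:
        s1, s2 = speedup_model(c1, c2)
        conf = min(1.0, s1 / required[0], s2 / required[1])
        if conf >= 1.0 and (best_ok is None or s1 + s2 > best_ok[0]):
            best_ok = (s1 + s2, c1, c2)
        if best_any is None or conf > best_any[0]:
            best_any = (conf, c1, c2)

    if best_ok is not None:
        return {"feasible": True, "cyc_per_switch": best_ok[1:], "value": best_ok[0]}
    return {"feasible": False, "cyc_per_switch": best_any[1:], "value": best_any[0]}
```

Because the search is restricted to the borders and to a finite grid, its result is only an approximation of what a full analysis would yield.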
