Single-thread performance and throughput often pose different design constraints and require compromises. Mainstream CPUs today incorporate a non-trivial number of cores, even for mobile devices. For power and thermal considerations, by default, a single core does not operate at the maximum performance level. When operating conditions allow, however, commercial products often rely on turbo boosting, which temporarily increases the clock frequency to increase single-thread performance. However, increasing clock speed may result in a poor performance return for invested energy. In this article, we make a case for a more effective boosting strategy, which invests energy in activities with the best estimated return. In addition to running faster clocks, we can also use a look-ahead thread to overlap the penalties of cache misses and branch mispredicts. Overall, for similar power consumptions, the proposed adaptive turbo boosting strategy can achieve about twice the performance benefits while halving the energy overhead.
INTRODUCTION
The central goal of processor design has always been to best utilize available resources to achieve maximum performance. One challenge is that the operating situation is unpredictable: Sometimes there are multiple ready software threads, and then improving system performance may be done by executing more of them in parallel; other times, improving single-thread performance is the only way to improve that of the whole system.
In general, single-thread performance and throughput pull the resource allocation decision in different directions and require compromises. For instance, mainstream CPUs today incorporate a non-trivial number of cores, even for mobile devices, and for power and thermal considerations, by default, a single core does not operate at the maximum performance level. One general way of addressing the difference in resource demand is to allow dynamic reallocation. Simultaneous multithreading and turbo boosting are two examples widely adopted in high-end commercial products [40, 66] . In turbo boosting, the system reallocates unused thermal and power headroom from idle cores to boost the frequency of a busy core to improve its single-thread performance.
5:2 S. Kondguli and M. Huang
While the general idea of dynamic resource allocation is sound, experience has shown that increasing clock speed has not been a very effective approach to boosting performance in general; this is partly why chip operating frequency has been relatively static for more than a decade now. In this article, we make a case for a more effective boosting strategy, which is cognizant of the behavior of the code and invests energy in activities with the best estimated performance benefits. In some cases, increasing the core frequency is the right action. Other times, especially in the presence of significant cache misses and/or branch mispredictions, we can dynamically increase the strength of branch prediction and prefetching by executing a look-ahead thread on the idle core. We will show that both the architectural and software support for such a decoupled look-ahead (DLA) system are practical. We also find that simple controller logic can do a reasonably good job at managing such an adaptive turbo boosting system. Overall, the proposed design is significantly more effective in improving performance and is much more energy efficient: Compared to frequency-only boosting, the adaptive design achieves about twice the performance benefits at half the energy overhead. To summarize, we make the following contributions in this article:
(1) Even though turbo boosting has been studied extensively, we propose a relatively simple addition that makes it significantly more efficient than the state of the art.
(2) We present architectural support for DLA, purposefully kept at minimum complexity, making it a compelling feature to add to modern architectures.
(3) We perform detailed, multifaceted analyses to permit a reader to understand clearly where the benefits come from.
Helper Threading
Various flavors of helper threading have been proposed in the past. Micro helper threading [4, 16, 21, 30, 43, 47, 48, 52, 53, 60, 73, 74] launches short threads ahead of potentially stalling instructions to avoid performance bottlenecks. These threads include specific instructions (often identified via lengthy manual tuning) required to precompute addresses or branch directions in a timely manner. Also, these threads are better suited to thread contexts that are designed specifically for them [16]. Another body of work attempts to provide a hardware solution to helper threading [14, 15, 18, 27, 36, 41, 42, 44, 54, 65, 67]. These proposals either start from an out-of-order core design and add hardware components to alleviate the performance loss from stalls or use a modified in-order core with additional hardware components to optimally extract ILP. Most of these techniques generate a reduced copy of the original program at runtime. However, the amount of information available at runtime is limited, and this limits their performance.
An idle core in a multicore system can also be used to execute a reduced copy of the original program on a separate thread context [5, 11, 31, 33, 34, 50, 69, 72]. Many of these proposals target the same performance hurdles but differ in the way they accelerate the look-ahead core. We call this flavor of helper threads DLA. Unlike micro helper threads, a decoupled look-ahead thread is a continuous process that requires less micromanagement from the main thread and therefore presents a powerful option for turbo boosting without slowing down the main thread.
OPPORTUNITY ANALYSIS
When operating conditions allow, a number of things can be done to improve the execution speed of the processor, albeit at the expense of extra energy expenditure. Increasing clock frequency is straightforward in that commercial processors already come with the capability to dynamically adjust clock frequency and voltage supply. However, the need to also increase the supply voltage can result in disproportional energy overhead for the performance gain.
Another way to improve performance is to reduce the number of stall cycles. Conceptually, this can be done by activating more powerful branch predictors or prefetchers or other means of increasing the effort to expose and exploit inherent parallelism. Depending on the execution behavior, targeting these stall cycles may offer a better opportunity to improve performance than boosting the clock frequency.
To have a more concrete understanding of the potential benefits, we take SPEC CPU benchmarks and compare a few configurations. Specifically, we measure the speedup obtained through an increase of clock frequency (by 33%) and elimination of branch mispredicts, L2 cache misses, or L1 cache misses. Since some results are from idealized configurations, they are only meant as rough characterizations of the potentials.
In Figure 1 , the y-coordinate of each application shows the speedup achieved by a particular boosting configuration. To show the correlation between boosting effect and baseline performance, the x-coordinate shows the average IPC of that application (over the entire observation window in the baseline configuration). Note that showing such correlation does not imply one metric is a function of the other. Linear regression results are shown as solid lines to indicate some first-order trend.
We can observe several things from the experiments. Not surprisingly, different boosting mechanisms do not have the same effect on all applications. Also, boosting clock frequency is complementary to the other types of boosting (suggested by the idealizations). Indeed, the effect of increasing clock frequency has a positive correlation with baseline performance, while the other boosting mechanisms have a negative correlation. In other words, increasing the clock frequency tends to be less effective for slower applications. Intuitively, those applications are often slow due to poor cache behavior and hence elevated memory accesses. Increasing core clock frequency has limited benefit when memory access becomes the limiting factor. Instead, these applications tend to enjoy significant speedups from better caching behavior.
Clearly, when deciding on the proper form of boosting, we need to take into account the program behavior. Even within one program, the behavior also varies, demonstrating what is often called phase behavior [2, 10, 12, 24, 26, 49, 61, 63, 64] . In Figure 2 , we show a few common statistics about program behavior measured over different loops of a random example application. The result illustrates the behavior difference even within the same program.
To sum up, there is significant variation of program behavior both within and across different applications. As a result of this variation, the best mechanism to boost system performance (when allowed additional energy budget) may be very different. A fixed policy of always investing that energy budget in clock frequency boosting is likely to be a suboptimal strategy. A sensible turbo boosting strategy should also include mechanisms that target stall cycle reduction and a controller that can make judicious selections. We discuss our proposed designs next.
ARCHITECTURE DESIGN
The concept of a dual-threaded execution mode that we call decoupled look-ahead is not new. We keep this concept (which has many different implementation proposals) but provide a simple implementation. We first describe the underlying components in Section 4.1.
Additionally, we propose a control logic to enable what we call Adaptive Turbo Boosting (ATB) to achieve the goal of maximizing performance gain for the invested energy. As it turns out, energy expenditure is strongly correlated with performance benefit: The best performing boosting configuration also costs the least energy in a vast majority of the phases. Our design, therefore, simply picks the best performing configuration for each phase of execution based on its code structure. We discuss the controller in Section 4.2.
Dynamic Boosting Mechanisms
Increasing clock speed is a readily available mechanism in commercial processors. Dynamically boosting the performance of cache and branch predictor is less straightforward. One can imagine reconfigurable branch predictors [19, 38] and prefetchers [75, 76, 77] . But in this article, we focus on using the available idle core, which is often (though not always) the source of available power and thermal headroom to apply boosting.
Prior Art: Indeed, a number of designs have been proposed to use an otherwise idle core to perform some kind of look-ahead action to assist the main core. The key challenge is to make the look-ahead action fast enough that there will be sufficient assistance rendered to the actual execution thread. A number of techniques are proposed to achieve that by reducing logic complexity of the core design [8, 9, 50], skipping predicted dead instructions [57, 69], skipping long-latency accesses [11, 72], optimistic over-clocking [33], and using a special-purpose look-ahead thread (with various speculative optimizations) [5, 7, 31, 32, 55].
Note that another large body of work uses more specialized micro threads to perform more targeted look-ahead (e.g., References [4, 16, 21, 25, 30, 47, 52, 60, 73]). The support for executing these special micro threads tends to be an integrated part of the core, and thus does not lend itself directly to the situation of leveraging an idle core in a multicore architecture. Nevertheless, the overall take-away point is that there are numerous ideas about software-directed look-ahead to reduce stalls. It is not difficult to imagine a future processor supporting some combination of the proposed features to allow dynamic activation of a look-ahead thread as a performance boosting mechanism.
Overview of Support:
In this article, we use a similar setup along the lines of a number of prior proposals [7, 11, 31, 57, 69, 72] . On top of a generic multi-core architecture, we require the following architectural support, ordered from least to most special-purpose:
(1) Containment of speculation: The look-ahead thread usually involves speculative optimizations and thus cannot be allowed to update the architectural state. The support is simple as most of the state is already naturally confined to the thread context. The only additional support needed concerns dirty lines in the private caches (in our study the L1 data cache) in the look-ahead mode: They are not used to supply coherence requests from other cores and are not written back on eviction but simply discarded.
(2) Communication of look-ahead results: In a multicore architecture with shared lower level caches, the look-ahead thread can already warm up the shared caches without any additional support. However, a mechanism to explicitly pass on results from the look-ahead thread is valuable. First, we can send over the branch outcomes. Second, we can also send prefetching hints. This is useful when we want to delay the release of prefetches until the appropriate time, which is helpful for L1 prefetches. We use FIFO buffers for these purposes (Figure 3). Finally, with proper multiplexing support, such buffers can be shared among a group of cores so that they can be dynamically configured to pair up any two cores in the group at a given time.
(3) Support for instruction masking: Finally, we find some convenience in having the code of the look-ahead thread be a subset of the main thread. In other words, the look-ahead thread uses the same program binary and we only manipulate a set of bit masks to indicate whether instructions are to be part of the look-ahead thread or not. With proper hardware support, instructions masked off will be deleted immediately on fetch in the look-ahead thread. These mask bits can be generated either offline or online through dependence analysis of the program binary [31]. We create a few different versions as will be discussed later.
Operations in Detail:
When an idle core becomes available, the controller will decide on how to boost the execution speed (Section 4.2). When it chooses to enable the look-ahead mode, the currently executing thread context will be spawned onto the idle core in the look-ahead mode. The bit masks will be stored in a fixed location in the program binary, properly aligned, and will be fetched into the instruction cache. In other words, when an I-cache miss happens, the controller will issue two read requests to the L2 cache: one for the instructions (at the address $A_i$) and one for their masks stored at address $A_m = f(A_i)$. Based on the masks embedded in the I-cache, the instruction fetch logic will delete unwanted instructions.
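The mask-driven fetch described above can be pictured with a small software model. This is only a sketch under stated assumptions: the article does not give the form of $f$, so `mask_address`, `MASK_BASE`, and the fixed 4-byte instruction width are hypothetical choices made for illustration.

```python
# Hypothetical model of skeleton fetch: one mask bit per instruction decides
# whether the look-ahead thread keeps that instruction.

MASK_BASE = 0x8000_0000   # assumed start of the mask region in the binary
INSN_BYTES = 4            # assumed fixed-width instructions


def mask_address(insn_addr: int) -> int:
    """Assumed form of f: one mask bit per instruction, packed contiguously."""
    insn_index = insn_addr // INSN_BYTES
    return MASK_BASE + insn_index // 8


def fetch_skeleton(line: list[str], mask_bits: int) -> list[str]:
    """Keep instruction k of the fetched line only if bit k of mask_bits is set."""
    return [insn for k, insn in enumerate(line) if (mask_bits >> k) & 1]


line = ["ld r1,0(r2)", "add r3,r1,r4", "st r3,8(r2)", "bne r3,r0,L1"]
kept = fetch_skeleton(line, 0b1011)   # bit 2 cleared: the store is deleted
print(kept)
```

In this toy example the store does not feed the branch, so masking it off leaves the branch and its backward dependence chain intact.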
The communication buffers (Figure 3) will be configured accordingly so that the look-ahead thread will write into the queues for the main thread to read from them. Specifically, at commit time, the outcome of a conditional branch ("taken" or "not taken") will be stored in FIFO order into a branch outcome queue (BOQ). The main thread reads from the queue at the fetch stage (thus also in program order) and uses the result for branch prediction unless the entry is invalid. The BOQ serves a multitude of purposes. First, it passes a branch outcome as a prediction to the main thread. This ensures that in the steady state, the majority of branch mispredictions are experienced only in the look-ahead thread. Second, it is a simple and effective mechanism to detect incorrect look-ahead control flow. When a branch prediction fed by the look-ahead thread turns out wrong (which is relatively rare), it means that the look-ahead thread is executing down the wrong path. We then reboot the look-ahead thread from the current state of the main thread. Third, we can easily know and control the depth of look-ahead: The number of unread entries in the BOQ equals the number of dynamic basic blocks the look-ahead thread is ahead of the main thread. To prevent run-away prefetching, we only need to limit the size of the BOQ (512 entries in this article). Fourth, it is a convenient way to allow delayed (just-in-time) prefetching. When a prefetch hint is generated, it can be associated with a branch entry and released only on the dequeuing of that BOQ entry.
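The first three roles of the BOQ (prediction hand-off, wrong-path detection, and depth control) can be sketched in software as follows. This is a toy behavioral model, not the hardware design; the class and function names are hypothetical, and "reboot" is reduced to discarding the queued hints.

```python
from collections import deque

BOQ_CAPACITY = 512  # queue size used in the article


class BranchOutcomeQueue:
    """Toy FIFO: the look-ahead thread pushes at commit, the main thread pops at fetch."""

    def __init__(self, capacity: int = BOQ_CAPACITY):
        self.capacity = capacity
        self.entries = deque()

    def depth(self) -> int:
        # Unread entries = dynamic basic blocks the look-ahead thread is ahead.
        return len(self.entries)

    def push(self, taken: bool) -> bool:
        """Look-ahead side. Returns False when full, so look-ahead must stall."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(taken)
        return True

    def pop_prediction(self):
        """Main-thread side. Returns None when no hint is available."""
        return self.entries.popleft() if self.entries else None

    def reboot(self) -> None:
        # A wrong hint means look-ahead is on the wrong path: drop all hints.
        self.entries.clear()


def main_thread_branch(boq, actual_taken: bool, fallback_prediction: bool) -> bool:
    """Resolve one branch in the main thread; True if it was predicted correctly."""
    hint = boq.pop_prediction()
    prediction = fallback_prediction if hint is None else hint
    if hint is not None and hint != actual_taken:
        boq.reboot()
    return prediction == actual_taken


boq = BranchOutcomeQueue(capacity=4)
accepted = [boq.push(t) for t in (True, False, True, True, False)]
first = main_thread_branch(boq, actual_taken=True, fallback_prediction=False)
second = main_thread_branch(boq, actual_taken=True, fallback_prediction=False)
```

With a capacity of four, the fifth push is rejected, which is how a bounded queue throttles run-away look-ahead. The second branch receives a wrong hint, so the queue is emptied and the look-ahead thread would restart from the main thread's state.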
In addition to the continuous branch direction hints, occasionally the look-ahead thread has other hints such as a data or TLB prefetching hint or a branch target hint. We put these less frequent hints with more information content into another wider but shallower FIFO queue that we call the footnote queue (FQ). When one or more such hints are inserted for a particular basic block, the footnote bit is set for the branch entry in the BOQ telling the main thread to dequeue footnotes. Each footnote entry contains bits to identify the type of hints, the address, and whether this is the last hint for the basic block.
When a memory instruction in the look-ahead thread incurs a primary L1 data cache miss, the controller inserts the cache line address into the FQ. To achieve just-in-time prefetching, we associate this footnote with the branch a few entries upstream. For simplicity, this is done by attaching the footnote to the last entry in BOQ, corresponding to the most recently committed branch instruction. When the main thread acts on the hint and starts the prefetch, it will have some time to overlap with the fetch.
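The interplay between the footnote bit in the BOQ and the entries of the FQ can be sketched as below. The field names, the hint type string, and the policy of dropping a hint when the BOQ is empty are assumptions made for this illustration only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class BranchEntry:
    taken: bool
    has_footnote: bool = False   # the footnote bit described in the text


@dataclass
class Footnote:
    kind: str       # e.g. "data_prefetch", "tlb_prefetch", "branch_target"
    address: int
    last: bool      # True if this is the last hint for the basic block


boq = deque()   # narrow, deep queue of branch outcomes
fq = deque()    # wider, shallower queue of footnotes


def commit_branch(taken: bool) -> None:
    boq.append(BranchEntry(taken))


def record_l1_miss(line_addr: int) -> None:
    """Attach a prefetch hint to the most recently committed branch."""
    if not boq:
        return                    # assumed policy: no branch to attach to
    tail = boq[-1]
    if tail.has_footnote:
        fq[-1].last = False       # the earlier hint is no longer the last one
    tail.has_footnote = True
    fq.append(Footnote("data_prefetch", line_addr, last=True))


def main_thread_fetch_branch():
    """Dequeue one branch hint plus the prefetch addresses released with it."""
    entry = boq.popleft()
    released = []
    if entry.has_footnote:
        while True:
            note = fq.popleft()
            released.append(note.address)
            if note.last:
                break
    return entry.taken, released


commit_branch(True)
record_l1_miss(0x1000)
record_l1_miss(0x2040)
commit_branch(False)
first = main_thread_fetch_branch()
second = main_thread_fetch_branch()
```

Both prefetches are released only when the main thread dequeues the first branch entry, which is the just-in-time behavior the text describes.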
Versions of Skeleton:
There are different options in deciding what static instructions to keep on the look-ahead thread. Each option effectively creates a different static code for the look-ahead thread that we call skeleton. The most basic version of the skeleton includes all branches and their backward dependence chains and is produced with a binary parser [31] . On top of this, we can add or subtract instructions. Specifically, there are three high-level options to decide:
• L2 prefetch targets (L2): Instructions that account for significant portions of L2 misses can be added to the skeleton;
• L1 prefetch targets (L1): Additionally, instructions that account for significant portions of L1 misses can be added to the skeleton;
• Biased branches (Br): Conditional branches with a bias over a threshold can be converted to unconditional branches in the skeleton.
Experiments show that the performance benefit of using a particular skeleton, and which set of options is best, depend on the underlying binary. We therefore leave it to the controller to decide on these options. A combination of these three options can produce up to eight different skeleton versions. To reduce the search space, we use only four (Br, Br+L2, L1+L2, and Br+L1+L2) that empirically work well. The remaining versions are seldom the best performing version in our test cases.
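The count of eight follows from choosing any subset of the three options. A small enumeration makes the pruned search space explicit; the variable names are illustrative only.

```python
from itertools import combinations

OPTIONS = ("Br", "L1", "L2")

# Every subset of the three options, including the basic skeleton (empty set).
all_versions = [frozenset(c)
                for r in range(len(OPTIONS) + 1)
                for c in combinations(OPTIONS, r)]

# The four versions the article keeps.
used_versions = [frozenset(v) for v in
                 ({"Br"}, {"Br", "L2"}, {"L1", "L2"}, {"Br", "L1", "L2"})]

skipped_versions = [v for v in all_versions if v not in used_versions]
print(len(all_versions), len(used_versions), len(skipped_versions))
```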
Controller
With a number of options to boost performance, the goal of the controller is to find the optimal configuration, i.e., the one that maximizes the benefit (execution time reduction) for the incurred cost (extra energy). The key capability to this goal is to predict the impact of a particular option. There are two components to our proposed solution. First, the controller divides the execution into units. Second, measurements of execution characteristics will be fed into a model that predicts either the impact of various boosting mechanisms or the best option directly.
Dividing Execution into Segments.
Prediction of future behavior is always based on the (implicit) assumption that future behavior repeats or at least correlates with the observed past behavior. This assumption is more plausible when the pipeline executes the same code in the period being predicted as in the past period of observation. The quintessential example is a loop. Execution cycles can be broken down into iterations. Statistics collected in one period consisting of one or more iterations can then be used to estimate the same statistics of future iterations. Thus, we propose to break execution sequence into segments corresponding to that of repeating code execution (rather than those with an arbitrary length) and make predictions based on these segments. We discuss an architectural support for doing so, but we note that this can also be done mostly with software using profiling.
There are a number of considerations. First, the units need to be of sufficiently coarse granularity: For short instruction sequences, we can neither accurately measure execution statistics nor profitably adjust the system configuration. For this reason, if the inner loops are short in duration, our system ignores them and tries to capture outer loops. Second, the unit needs to capture most of the execution time and yet be simple enough. In some cases, significant execution time is spent in some form of recursion without a canonical looping structure. We thus track function calls in addition to regular backward branches.
The general idea is to capture the "loop branches" that manifest as back-to-back instances of the same backward branch, that is, with no other backward branches occurring in between. These branches then serve to mark the boundaries between iterations of the same code. Certain complications prevent these branches from occurring back-to-back. For example, a backward branch within the loop body can intersperse among instances of the loop branch. To filter these branches out, we use a loop-branch register to keep track of both the PC and the target of a backward branch. When a newly encountered backward branch matches that stored in the loop-branch register, the loop is identified. We then start to count cycles to measure the granularity of this loop. If we encounter a backward branch different from the one in the loop-branch register, then there are a few different possibilities: The new branch is a loop branch and the branch tracked in the loop-branch register is part of the body, the other way around, both are loop branches with one of them being the outer branch, and so on. While the detailed decision logic is shown in Figure 4, the general heuristic is that a backward branch is considered to be nested within another one if its address falls between the PC and the target of the latter. This heuristic does not cover all possibilities but works well for normal code.
Finally, some backward branches are not loop branches. Once the hardware realizes that, it remembers them in a table to help reduce the time it takes to identify a stable loop. In our evaluation we use a Non-Loop PC Table (NLPT) of 20 entries to store the PC of such branches. If a branch is found in this table, then it is skipped by the loop marker.
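The nesting heuristic and the role of the loop-branch register and the NLPT can be illustrated with a simplified software model. It does not reproduce the decision logic of Figure 4: the policy of always keeping the outer branch, the FIFO replacement in the NLPT, and all names below are assumptions of this sketch.

```python
from typing import NamedTuple, Optional


class BackwardBranch(NamedTuple):
    pc: int
    target: int   # target < pc for a backward branch


def is_nested(inner: BackwardBranch, outer: BackwardBranch) -> bool:
    """Heuristic from the text: inner's address lies between outer's target and PC."""
    return inner != outer and outer.target <= inner.pc <= outer.pc


class LoopMarker:
    def __init__(self, nlpt_capacity: int = 20):
        self.loop_reg: Optional[BackwardBranch] = None
        self.nlpt: list[int] = []          # PCs of known non-loop branches
        self.nlpt_capacity = nlpt_capacity

    def mark_non_loop(self, pc: int) -> None:
        if pc not in self.nlpt:
            if len(self.nlpt) == self.nlpt_capacity:
                self.nlpt.pop(0)           # assumed FIFO replacement
            self.nlpt.append(pc)

    def on_backward_branch(self, br: BackwardBranch) -> bool:
        """Return True when br closes an iteration of the tracked loop."""
        if br.pc in self.nlpt:
            return False                   # skipped by the loop marker
        if self.loop_reg is None:
            self.loop_reg = br
            return False
        if br == self.loop_reg:
            return True
        if is_nested(br, self.loop_reg):
            return False                   # assumed: part of the body, keep outer
        self.loop_reg = br                 # assumed: otherwise track the new branch
        return False


outer = BackwardBranch(pc=0x140, target=0x100)
inner = BackwardBranch(pc=0x120, target=0x110)
marker = LoopMarker()
events = [marker.on_backward_branch(b)
          for b in (inner, inner, outer, inner, inner, outer)]
```

In this trace the marker first locks onto the inner loop, then switches to the outer branch once it appears, and afterwards treats inner-loop branches as part of the outer loop's body.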
Characterization of Segments.
Once the appropriate loop branches are identified, the execution is divided into segments, each further divided into repeating instances. Depending on the control policy, we may need to characterize the behavior of a segment to decide the most appropriate boosting mechanism. We do so by measuring several statistics (Table 1) for a whole number of iterations such that the period is long enough to be meaningful. These statistics are used both in training for building models and online when applying the models to pick the right boosting mechanism. In either case, the hardware starts to track the statistics on the detection of a new loop branch in the loop register. On a subsequent execution of the branch where the total number of executed instructions passes the threshold (50,000 instructions in this article), the branch becomes a qualified loop branch and a handler is invoked to process the event. In offline model building, the handler records the statistics and uses them according to the specific model as will be discussed later. In the case of online application, the handler runs the computation required by the model using the collected statistics and sets the recommended boosting mechanism in the loop table. From this point on, the hardware only needs to apply the selected boosting mechanism and no longer collects statistics for the loop represented by this branch. In our evaluations, we use a loop table of 20 entries, each taking about 8B of storage. Storing more than 20 entries has no noticeable benefit.
Prediction Models.
We use three different models to predict the best boosting strategy. Each one has its own tradeoffs. Our broader point is that no matter what the control policy is, having support for decoupled look-ahead will make the overall system more effective.
Regression Analysis: Regression analysis is perhaps the most commonly applied method to find relationships among variables. We apply regression analysis as follows. In the model building phase, we run different code segments and measure the effect of a particular boosting mechanism on those segments. We then feed the model with training data where the inputs are the aforementioned execution statistics without any boosting mechanism (Table 1) for every segment and the output is the performance gain produced by every boosting configuration.
After training the model, a set of coefficients will be obtained to make a prediction on a new code segment for each of the boosting mechanisms. Specifically, the coefficients describe a quadratic function of all statistics,
where $SP_b$ is the predicted speedup of boosting configuration $b$ and the $S_i$'s are the statistics of the code segment. At runtime, on collecting the statistics for a newly discovered segment, the controller evaluates all $SP_b$ and picks the configuration with the maximum predicted speedup.
The benefit of the regression model is its simplicity. A quick computation of a quadratic formula is all that is needed during runtime. However, this simplicity comes at the expense of simplifying the complex relationship between the behavior of the code and the effectiveness of different boosting mechanisms. This leads to more suboptimal choices as we will see later.
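The runtime side of the regression model can be sketched as follows. The exact polynomial is not reproduced here, so this sketch assumes a general quadratic with cross terms, $SP_b = c_0 + \sum_i c_i S_i + \sum_{i \le j} c_{ij} S_i S_j$; the two statistics and every coefficient value in the example are made up for illustration.

```python
def predict_speedup(coeffs, stats):
    """Evaluate an assumed quadratic model.

    coeffs = (c0, linear, quad); quad[i][j] is used only for i <= j.
    """
    c0, linear, quad = coeffs
    n = len(stats)
    value = c0
    for i in range(n):
        value += linear[i] * stats[i]
        for j in range(i, n):
            value += quad[i][j] * stats[i] * stats[j]
    return value


def pick_configuration(models, stats):
    """Return the boosting configuration with the highest predicted speedup."""
    return max(models, key=lambda b: predict_speedup(models[b], stats))


# Hypothetical coefficients over two statistics: [IPC, L2 misses per kilo-instruction].
ZERO_QUAD = [[0.0, 0.0], [0.0, 0.0]]
models = {
    "FB":  (1.0, [0.10, -0.01], ZERO_QUAD),
    "DLA": (1.0, [-0.05, 0.03], ZERO_QUAD),
}

memory_bound = pick_configuration(models, [0.5, 20.0])
compute_bound = pick_configuration(models, [2.0, 1.0])
```

With these invented coefficients the low-IPC, miss-heavy segment is steered to DLA and the high-IPC segment to frequency boosting, mirroring the trend observed in Figure 1.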
Decision Tree: Another common predictive model is a decision tree. In our case, we provide the execution statistics (Table 1) along with the identity of the boosting configuration that provides the most performance gain for a set of code segments in the offline training phase. We use the Orange data mining package [22] to build the decision tree. At runtime, we feed the execution statistics from a newly encountered code segment to the handler code that evaluates the decision tree. The boosting configuration picked by the decision tree will be used for that code segment.
Compared to the regression model, the decision tree is much more flexible and tends to pick better performing boosting mechanisms more often. The cost of the model is increased online computation to carry out the decision. On average, about 20,000 comparisons are made per decision, roughly 100 times more computation than the regression model. However, since on average a decision is made every 3.9 million instructions, this overhead is still negligible.
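As a back-of-the-envelope check of the overhead claim, the figures above can be combined directly. The sketch assumes that each comparison costs roughly one instruction, which the article does not state.

```python
comparisons_per_decision = 20_000        # decision tree, from the text
instructions_per_decision = 3_900_000    # one decision every 3.9M instructions

# Assumption: one comparison costs about one executed instruction.
tree_overhead = comparisons_per_decision / instructions_per_decision

# The regression model needs roughly 100 times less computation.
regression_overhead = tree_overhead / 100

print(f"decision tree: {tree_overhead:.2%}, regression: {regression_overhead:.4%}")
```

Under that assumption the decision tree adds about half a percent of extra work, which is consistent with calling the overhead negligible.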
Empirical Search: Finally, we also experiment with a straightforward predictive policy of trial and error. There is no training phase involved in this model. Instead, the controller simply goes through all boosting configurations one at a time recording the execution speed (instructions committed per unit time), and selects the configuration showing the most gain.
Compared with the previous two models, the empirical search model requires no offline training but essentially performs an (oversimplified) online training. Unless the number of configurations is very high, this model is practical, and, as we will show later, rather accurate.
Putting It All Together.
As explained in the beginning of this section, the goal of this controller is to identify the optimal configuration for each segment of the program. To that effect, loop hardware identifies loop boundaries to detect segments, and prediction models find the best configuration for that segment. Figure 5 shows a descriptive diagram of the workings of ATB using the empirical search model. At every branch commit, the loop hardware is consulted to recognize whether this particular instruction marks the beginning of a new loop or a new iteration of the same loop. The empirical search model requires only one of the eight statistics described in Table 1 (Cycles). At the beginning of every new loop, the model checks whether the optimum configuration for this loop is already present in the Loop Control Table (LCT). If found, then ATB informs the boosting mechanism of the optimum configuration and no further action is necessary while this loop is being executed. If the LCT does not contain any information about the new loop, then the Loop Register (LR) is reset and its LoopBr field is updated with the PC of the loop branch. The empirical search model enables every boosting configuration for a few (Loop Threshold) loop iterations and identifies the configuration that provides maximum speedup (IPC). Once the optimum boosting configuration is identified, its information is stored in the LCT and the system is informed of the right boosting mechanism.
Note that the controller is only invoked when the loop hardware identifies a new loop branch. There are times when a program executes code that does not belong to a discernible loop before transitioning to another loop. In our evaluations such transient phases make up less than 2% of the execution time. To reduce unnecessary mode-switching overheads, during transients, the system continues to use the configuration set for the previous loop.
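The empirical search flow can be sketched as a small software controller. This is a behavioral illustration only: the candidate list `CONFIGS`, the value of `LOOP_THRESHOLD`, the FIFO replacement in the LCT, and the use of elapsed time per iteration as the speed metric (assuming each iteration commits about the same number of instructions) are all assumptions.

```python
LOOP_THRESHOLD = 2     # assumed number of trial iterations per configuration
LCT_CAPACITY = 20      # loop table size used in the article
CONFIGS = ("FB", "Br", "Br+L2", "L1+L2", "Br+L1+L2")   # assumed candidates


class EmpiricalSearchController:
    def __init__(self):
        self.lct = {}      # loop branch PC -> chosen configuration
        self.trials = {}   # loop branch PC -> {config: [elapsed times]}

    def next_config(self, loop_pc):
        """Configuration to apply to the next iteration of this loop."""
        if loop_pc in self.lct:
            return self.lct[loop_pc]
        seen = self.trials.setdefault(loop_pc, {})
        for cfg in CONFIGS:
            if len(seen.get(cfg, [])) < LOOP_THRESHOLD:
                return cfg                 # still trying this configuration
        return self._finalize(loop_pc)

    def record(self, loop_pc, cfg, elapsed):
        """Log the measured time of one iteration during the search phase."""
        if loop_pc not in self.lct:
            self.trials.setdefault(loop_pc, {}).setdefault(cfg, []).append(elapsed)

    def _finalize(self, loop_pc):
        seen = self.trials.pop(loop_pc)
        # Shortest mean time per iteration = highest execution speed.
        best = min(seen, key=lambda c: sum(seen[c]) / len(seen[c]))
        if len(self.lct) >= LCT_CAPACITY:
            self.lct.pop(next(iter(self.lct)))   # assumed FIFO replacement
        self.lct[loop_pc] = best
        return best


# Made-up per-iteration times for one loop under each configuration.
fake_time = {"FB": 90, "Br": 100, "Br+L2": 70, "L1+L2": 75, "Br+L1+L2": 72}
ctrl = EmpiricalSearchController()
history = []
for _ in range(12):
    cfg = ctrl.next_config(0x400)
    ctrl.record(0x400, cfg, fake_time[cfg])
    history.append(cfg)
```

After ten trial iterations (two per configuration), the controller commits to the fastest configuration and stops collecting measurements for that loop.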
EXPERIMENTAL ANALYSIS
Using execution-driven simulations (Section 5.1), we first show (in Section 5.2) the primary figures of merit for the proposed system. We show that significantly more performance gain is obtained with the new boosting design than by adjusting clock speed only. We then perform an in-depth analysis of various components and usage scenarios of the design (Sections 5.2 to 5.4).
Simulation Setup
We model various configurations using execution-driven simulation. Our simulator is based on Gem5 [13] . The baseline system is a multicore chip with out-of-order cores. Table 2 lists additional technical details.
The turbo boosting configurations are loosely based on an Intel i7 chip [40], where a baseline frequency of 3GHz (at 0.8V) can be scaled up by up to seven steps, each step being 133.3MHz [39] with a corresponding voltage increase of 50mV [17]. For DLA we model a communication delay of three cycles between the lead core and the main core, similar to an L1 cache access. For energy consumption modeling, we use McPAT [45] and assume a 22nm technology node. We modified McPAT to correctly model our proposed architecture. All the pipeline stages, buffers, caches, network, and memory are faithfully modeled. The baseline configuration operates at 3GHz and all the figures are normalized to it unless otherwise mentioned. We evaluate our proposal on multiple benchmark suites. In addition to SPEC2006 [35], we use the CRONO [1] (graph applications) and STARBENCH [3] (embedded applications) benchmark suites. For SPEC2006 we use reference inputs. For STARBENCH we picked large inputs from the provided input set, and for CRONO we use graph input data structures from Google, Amazon, Twitter, MathOverflow, and California road networks. All benchmarks are compiled using gcc with the -O3 option. We use the SimPoint methodology to capture all phases of the application. We generate five simpoints per benchmark with a 10-million-instruction interval. Simpoints are generated using the SimPoint Tool [62]. Note that simulation results are prone to errors due to simulators [6], sampling methodology [62], baseline configuration, and so on. We report relative numbers instead of absolute values in an effort to reduce the impact of such errors.
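To make the cost of the frequency steps in this setup concrete, the following sketch applies the classical first-order CMOS model, in which dynamic power scales with $V^2 f$. It is not the McPAT model used in the evaluation: it ignores leakage and assumes that the 50mV increase applies to each 133.3MHz step.

```python
BASE_FREQ_MHZ = 3000.0
BASE_VOLT = 0.8
STEP_MHZ = 133.3
STEP_VOLT = 0.05   # assumed to apply per frequency step


def boost_point(step: int):
    """Return (frequency ratio, energy-per-cycle ratio, dynamic power ratio)."""
    freq = BASE_FREQ_MHZ + step * STEP_MHZ
    volt = BASE_VOLT + step * STEP_VOLT
    freq_ratio = freq / BASE_FREQ_MHZ
    energy_per_cycle = (volt / BASE_VOLT) ** 2   # dynamic energy scales with V^2
    power = energy_per_cycle * freq_ratio        # dynamic power scales with V^2 * f
    return freq_ratio, energy_per_cycle, power


for step in range(8):
    f, e, p = boost_point(step)
    print(f"step {step}: freq x{f:.3f}  energy/cycle x{e:.3f}  power x{p:.3f}")

top_freq, top_energy, top_power = boost_point(7)
```

Under these assumptions, the top step raises frequency by about 31% while roughly doubling dynamic energy per cycle, which illustrates why frequency-only boosting can yield a poor return on invested energy even for fully compute-bound code.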
Primary Figures of Merit
Performance Benefit: A primary use of today's turbo boosting mechanism is to leverage the power and thermal budget made available by one or more idle cores in a multicore system. Our main argument for this article is that in those situations, the best strategy to leverage the idle core depends on characteristics of the application, and thus an adaptive turbo boosting architecture is needed. With proper support for ATB, we can improve performance much more than that achievable with just conventional frequency-only boosting. Figure 6 shows the speedup of applications under three different boosting policies: frequency-only boosting (FB), always employing DLA, and ATB. In this figure, we only show one variant that uses the empirical search controller. For visual clarity, we sort the applications based on increasing ratio of $SP_{ATB}/SP_{FB}$, where $SP_{ATB}$ and $SP_{FB}$ are the speedups of ATB and FB, respectively.
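As a small illustration of how such a figure can be organized, the sketch below sorts applications by the ratio of ATB speedup to FB speedup and computes geometric means. The application names and speedup values are made up for illustration only; they are not results from this article.

```python
import math

# HYPOTHETICAL per-application speedups as (FB, ATB) pairs; illustrative only.
speedups = {
    "app_a": (1.30, 1.32),
    "app_b": (1.10, 1.60),
    "app_c": (1.20, 1.25),
}


def geomean(values):
    """Geometric mean, the usual way to average speedups."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Sort applications by increasing ratio of ATB speedup to FB speedup.
order = sorted(speedups, key=lambda app: speedups[app][1] / speedups[app][0])

if __name__ == "__main__":
    for app in order:
        fb, atb = speedups[app]
        print(f"{app}: FB={fb:.2f} ATB={atb:.2f} ratio={atb / fb:.3f}")
    print("geomean FB :", round(geomean(fb for fb, _ in speedups.values()), 3))
    print("geomean ATB:", round(geomean(atb for _, atb in speedups.values()), 3))
```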
From the figure, we can see three main things.
(1) As already suggested earlier in Figure 1, boosting clock frequency and performing look-ahead are complementary: When one shows a low performance gain, the other tends to show a high gain. Not surprisingly, therefore, having the two options at our disposal is much better: When FB does not work well for a particular code segment, ATB will choose to deploy DLA and vice versa. Even though the major benefit of ATB is from the complementarity between DLA and FB, we note that ATB allows a better implementation of DLA. Through tuning, ATB can pick a better skeleton compared to the heuristically derived skeleton. This effect can be seen in the shaded portion of the ATB: Without frequency boosting, ATB can improve DLA's overall performance by 4.8%. The inclusion of frequency boosting provides an additional performance gain of 16%.
(2) With traditional FB, the speedup ranges from 1.0 to 1.5 with a (geometric) mean of 1.19. With our ATB, the speedup ranges from 1.05 to 2.34 with a mean of 1.4, essentially doubling the gain in performance on average. With the exception of the first five applications, the benefit of ATB over FB ranges from noticeable to dramatic. Figure 7 shows the resulting distribution of execution between the two boosting modes under the ATB controller.
(3) For the CRONO benchmark suite, the benefit of DLA and ATB is more pronounced than that in the SPEC suite, whereas the benefit of increasing frequency is less. As our computing systems are increasingly used for the analysis of large data sets, the case for including support for DLA becomes more compelling.
It is not always possible to use the frequency boosting feature at its maximum potential. Runtime factors such as core temperature, estimated current/power consumption, and the number of active cores determine the amount of boost that the processor can support. Figure 8 shows the impact of FB and ATB under different boosting headrooms. Note that ATB is even more appealing when the range of frequency boosting is limited. As the frequency boost shrinks from 1GHz down to 330MHz, the performance improvement obtained by FB drops from 19% to 9%. In comparison, ATB's benefit drops from 40% to no lower than 34%. It is worth mentioning that with limited frequency boosting, ATB favors DLA in certain epochs that earlier favored frequency boosting.
Energy Overhead: Another consideration for a turbo boosting mechanism is the energy cost paid for the performance gain. In this respect, our proposed system is also an improvement over conventional frequency boosting. Before we inspect experimental data, we first look at a common approximation that energy cost doubles in DLA due to executing the program twice. Though a tempting assumption, it is nevertheless a significant overestimation. First, not all instructions are on the look-ahead thread. On average, 66% of instructions are on the look-ahead thread even if the system always runs in DLA mode. In ATB, DLA is enabled for about 62% of the time, leading to roughly 44% extra committed instructions.
Second, adding these 44% extra instructions does not increase energy overhead by the same proportion, as many activities are not duplicated but only time shifted. For instance, the baseline microarchitecture decodes 25% more instructions than it commits. In DLA, wrong-path instructions are almost completely avoided in the main thread. Memory accesses are also mostly just time shifted, with the traffic being only 3% higher in baseline DLA than in the underlying microarchitecture.
Third, still other overheads do not increase at all or even decrease. Leakage and clock distribution energy for the main thread generally decrease due to shorter execution time.
Roughly speaking, about 52% of the baseline energy budget increases in proportion to added instructions; 12% remains constant regardless of added instructions; and 36% decreases in proportion to cycles saved. Combining these contributions, Figure 9 shows the energy consumption of the whole system under different boosting strategies.
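The three-way breakdown above suggests a first-order estimate of relative energy. The sketch below is only an illustration under stated assumptions: the instruction-proportional share scales with one plus the fraction of added instructions, the cycle-proportional share scales with the reciprocal of speedup, and voltage or frequency changes are ignored entirely. It is not the McPAT-based model used for Figure 9, and the function name is hypothetical.

```python
# Shares of baseline energy, taken from the rough breakdown in the text.
INSTR_SHARE = 0.52   # grows in proportion to added instructions
CONST_SHARE = 0.12   # unaffected by added instructions
CYCLE_SHARE = 0.36   # shrinks in proportion to cycles saved


def relative_energy(extra_instr_frac, speedup):
    """First-order energy relative to baseline (ASSUMPTION: no voltage/frequency change).

    extra_instr_frac: added committed instructions as a fraction of baseline.
    speedup: baseline cycles divided by cycles with look-ahead enabled.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (INSTR_SHARE * (1.0 + extra_instr_frac)
            + CONST_SHARE
            + CYCLE_SHARE / speedup)


if __name__ == "__main__":
    for sp in (1.0, 1.2, 1.5, 2.0):
        print(f"speedup {sp:.1f}: energy ~{relative_energy(0.44, sp):.3f}x baseline")
```

Even with 44% extra instructions and no speedup at all, this estimate stays near 1.23x rather than 2x, which illustrates why the "energy doubles" approximation overestimates the cost.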
We can see two things from the figure. First, the energy cost of boosting tends to have a negative correlation with performance, because a reduction in execution time directly translates to reduced static energy cost. As a result, the energy overhead is high when the particular boosting mechanism has a small performance benefit. This leads to the second observation: Since our ATB naturally avoids the ineffective configurations, it also avoids the high energy overhead. Take s.xa, for instance, where ATB avoids the high energy overhead of DLA. On average, the energy overhead of ATB, at 22%, is about half that of frequency-only boosting (45%).
Overall, with half of the energy cost and twice the performance gain, ATB is not only decidedly superior to FB but also lowers the system's energy-delay product on average.
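The energy-delay product claim can be checked with simple arithmetic on the averages quoted above. The sketch assumes that relative energy is one plus the reported overhead and that relative delay is the reciprocal of the average speedup; combining averages this way is an approximation, and the function name is hypothetical.

```python
def relative_edp(energy_overhead, speedup):
    """Energy-delay product relative to baseline.

    ASSUMPTIONS: relative energy = 1 + overhead, relative delay = 1 / speedup.
    """
    return (1.0 + energy_overhead) / speedup


# Average values quoted in the text.
edp_atb = relative_edp(0.22, 1.40)   # about 0.87, below the baseline of 1.0
edp_fb = relative_edp(0.45, 1.19)    # about 1.22, above the baseline of 1.0

if __name__ == "__main__":
    print(f"ATB relative EDP: {edp_atb:.3f}")
    print(f"FB  relative EDP: {edp_fb:.3f}")
```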
Controller Analysis
Prediction Errors: For ATB, the controller needs to make good predictions about the effects of different configurations to make a judicious choice. Errors in prediction lead to suboptimal choices. We quantify such errors for the three prediction models (Section 4.2.3).
Recall that in all three models, the system obtains execution parameters for a period of time (a few iterations of a detected segment $S_i$) and makes a prediction of the best boosting configuration $C_k$. We ran simulations to get the actual best execution time for $S_i$ among all configurations: $T_{min} = \min_j T(S_i, C_j)$. The difference between the time of the predicted best configuration and $T_{min}$ is considered the error. Thus, the relative prediction error for segment $S_i$ is
$$\frac{T(S_i, C_k) - T_{min}}{T_{min}},$$
and the error for the entire application is a weighted average of those for each segment. We show this error for all three prediction models and applications in Figure 10.
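The error metric described here can be sketched in a few lines. Two points are assumptions rather than details from the text: the difference between the predicted configuration's time and the best time is normalized by the best time, and the per-segment weights are supplied by the caller (for example, each segment's share of execution). All names are hypothetical.

```python
def segment_error(times, predicted_config):
    """Relative prediction error for one segment.

    times: dict mapping each configuration to its measured execution time.
    predicted_config: the configuration the controller picked.
    ASSUMPTION: the error is normalized by the best (minimum) time.
    """
    t_min = min(times.values())
    return (times[predicted_config] - t_min) / t_min


def application_error(segments):
    """Weighted average of segment errors.

    segments: iterable of (weight, times, predicted_config) tuples.
    ASSUMPTION: weights are chosen by the caller; the text only says "weighted".
    """
    segments = list(segments)
    total_weight = sum(w for w, _, _ in segments)
    return sum(w * segment_error(t, c) for w, t, c in segments) / total_weight


if __name__ == "__main__":
    segs = [
        (1.0, {"FB": 10.0, "DLA": 8.0}, "FB"),    # wrong pick: 25% slower than best
        (3.0, {"FB": 5.0, "DLA": 6.0}, "FB"),     # correct pick: zero error
    ]
    print(application_error(segs))
```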
To understand this error, we loosely break it down into two components due to the two assumptions made in the prediction process:
(1) Given the execution statistics, the model can accurately predict the effect of all boosting configurations; and
(2) The observed behavior in the initial sample iterations accurately represents the entire segment.
The error due to the first assumption (which we call the model error) can be quantified by feeding the model with the exact execution statistics of the segment post hoc. Note that for the empirical model, there is no model error by definition. The remaining error (which we call the sampling error) can be attributed to assumption 2.
Figure 10(a) compares the errors of the three prediction models. For clarity, we only show a subset of the applications from the SPEC2006 benchmark suite. Specifically, we sort all applications by the sum of errors from all three models and show every other application. The general observation is the same across all applications: First, for the same application, higher prediction errors correspond to more suboptimal choices and thus lower overall speedup. Second, the sampling errors of the three models are reasonably close to each other (on average, about 3.2-3.5%). The decision tree model, being more sophisticated than regression, has a lower model error (2.1% vs. 3.7%) and a better performance in general (a speedup of 1.33 vs. 1.32). The empirical approach essentially forgoes the prediction model altogether and thus has no model error and better performance (a speedup of 1.37) than the other two.
Segment Marking: In our proposed design, we rely on architectural support to mark segments so that they are based on the same code executed. This is to reduce the sampling error. To justify the architectural support, we compare to an often-used simpler alternative that assumes behavior stays stable within a "phase" and goes through phase changes with observable effects on execution statistics.
In this traditional approach, for each fixed-length segment (100,000 instructions in this experiment), we collect execution statistics to act as a "signature." Any significant deviation of the signature in a new segment suggests the start of a new phase. At the beginning of a new phase, statistics collected for a fixed-length segment are assumed to represent the behavior of the phase and drive the boosting configuration decision. Figure 11 contrasts these two approaches. A direct comparison is to measure the performance of the system under two controllers where the only difference is how execution is broken into self-similar units. In this experiment, we use the regression prediction model and show the difference in speedup in Figure 11(a). We see that in general as well as on average, using our proposed segment marking hardware produces noticeably better results. This is largely because by better capturing the repeating code structure, the assumption of predictable future behavior holds better. Figure 11(b) quantifies this by measuring the deviation of statistics of the segments.
Specifically, in both cases, statistics (Table 1) are measured for a segment $s_m$ and are used to predict those for another segment $s_p$. The deviation for statistic $St$ is thus the absolute difference between their respective values $St_{s_m}$ and $St_{s_p}$. Since these statistics can have different magnitudes, we normalize the deviations to average them meaningfully. To do so, we take the global mean value of the statistic, $mean(St)$, as the standard yardstick for normalization. For example, if the statistic is branch mispredicts, then the mean value would be the global average mispredicts per 100,000 instructions across all applications. Note that the exact value of the mean is not nearly as important as its order of magnitude. We then average the normalized deviation ($|St_{s_m} - St_{s_p}| / mean(St)$) across all statistics, and then take a weighted average across all segments. Figure 11(b) compares the per-application average deviation between the two approaches. We can see that the average deviation is much lower when using our segment marking hardware, indicating better similarity of behavior between segments deemed similar.
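The deviation metric can be sketched as follows: for each statistic, take the absolute difference between the measured segment and the segment it is used to predict, divide by the global mean of that statistic, average over statistics, and then take a weighted average over segment pairs. The statistic names, the example values, and the choice of weights are hypothetical.

```python
def normalized_deviation(measured, target, global_mean):
    """Average over statistics of |measured - target| / global mean of the statistic."""
    devs = [abs(measured[name] - target[name]) / global_mean[name]
            for name in global_mean]
    return sum(devs) / len(devs)


def average_deviation(pairs, global_mean):
    """Weighted average of normalized deviations over (weight, measured, target) pairs.

    ASSUMPTION: weights are supplied by the caller; the text does not define them.
    """
    pairs = list(pairs)
    total = sum(w for w, _, _ in pairs)
    return sum(w * normalized_deviation(m, t, global_mean) for w, m, t in pairs) / total


if __name__ == "__main__":
    # Hypothetical statistics per 100,000 instructions.
    mean = {"branch_mispredicts": 10.0, "l2_misses": 2.0}
    s_m = {"branch_mispredicts": 10.0, "l2_misses": 2.0}
    s_p = {"branch_mispredicts": 12.0, "l2_misses": 1.0}
    print(normalized_deviation(s_m, s_p, mean))
```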
Other Usage Scenarios
In Section 5.2 we showed that with DLA support and an adaptive controller, ATB can achieve a much more significant performance benefit than FB for less energy consumption, thanks to the fact that DLA is often the more effective boosting mechanism. But the benefit of DLA support goes beyond that, as we show in the following usage scenarios.
Boosting Range: In the earlier experiment, the controller chooses either DLA or FB, not both. This has the side effect of limiting the power consumption to well within twice the baseline power consumption. When the thermal and power headrooms allow, the controller can choose both and thus materially extend the range of boosting achievable. In the following experiment, we select the best-performing option without concern for the overall power. Figure 12(a) shows the speedup obtained for SPEC benchmarks. We can see that the range of boosting is further expanded to up to 2.5x with a geometric mean of 1.55x, compared with 1.37x before. The energy cost rises to 1.48x (Figure 12(b)). Note that this is a non-trivial extension of the boosting range. To achieve the same effect by only increasing clock frequency would require the processor to raise the frequency by 100% on average, to more than 6GHz, and sometimes even more (Figure 12(c)). At these levels, it is no longer just a matter of paying exorbitant energy costs, but one of feasibility.
Broader Choices: For high-performance processors, frequency boost usually has a very limited range [40], and a resulting limited range of speedups. Decoupled look-ahead, on the other hand, has a much wider range of achievable speedups. Consequently, in a multi-program environment with a variety of threads, there is more chance to find a target thread that can receive significant speedup.
To illustrate this point more concretely, consider a 4-core processor chip with three active threads and one idle core that provides the turbo boosting. To exploit the additional freedom of selecting which thread to boost, the controller will be extended as follows. When any thread enters a new segment, the controller calculates the benefit of boosting the new segment and compares that to the current target of boosting. If the benefit is sufficiently higher, then we can terminate the boosting for the current target and engage boosting for the new segment. Of course, a practical implementation would have to consider a number of secondary issues. 7 We model this scenario with randomly chosen mixes of three applications each (Table 3) . We use the empirical controller modified as discussed above. In this experiment, we favor the thread that will receive the most speedup (relative increase of speed) as the target of boosting. 8 In Figure 13 , we show the result of the multi-program (MP) version of both FB and ATB. Since the controller can change the target of boosting, the three programs can each gain some part-time boosting. In order to meaningfully compare with the earlier result where boosting is applied to a fixed thread (Figure 6 ), we piece together the portion of the program being boosted and treat it as a virtual program with its own baseline execution time. Effective speedup is the speedup of this virtual program. As we can see, MP-FB averages 1.23 effective speedup for these mixes, which is a small improvement over the speedup of 1.19 for FB in the single-program case. In contrast, MP-ATB averages 1.57 effective speedup, which is a more significant improvement over 1.4x for ATB.
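The retargeting rule described above (re-evaluate when a thread enters a new segment, and switch only if the new benefit is sufficiently higher) can be sketched as a small arbiter. This is an illustrative sketch, not the controller evaluated in the article: the class name, the 5% switching margin, and the choice to remember the latest estimate for every thread are all assumptions.

```python
class BoostArbiter:
    """Chooses which thread receives boosting from the single idle core."""

    def __init__(self, margin=0.05):
        self.margin = margin   # HYPOTHETICAL threshold for "sufficiently higher"
        self.estimates = {}    # thread id -> predicted speedup of its current segment
        self.target = None     # thread currently being boosted

    def on_new_segment(self, thread_id, predicted_speedup):
        """Record a thread's new segment and return the (possibly new) boost target."""
        self.estimates[thread_id] = predicted_speedup
        best = max(self.estimates, key=self.estimates.get)
        if self.target is None:
            self.target = best
        elif best != self.target:
            current = self.estimates[self.target]
            # Switch only when the best candidate clearly beats the current target.
            if self.estimates[best] > current * (1.0 + self.margin):
                self.target = best
        return self.target


if __name__ == "__main__":
    arbiter = BoostArbiter()
    print(arbiter.on_new_segment("t0", 1.20))  # first segment seen becomes the target
    print(arbiter.on_new_segment("t1", 1.22))  # not sufficiently better, keep t0
    print(arbiter.on_new_segment("t2", 1.80))  # clearly better, switch to t2
```

The margin acts as hysteresis so that small differences in predicted benefit do not cause the boosting target to bounce between threads; a practical design would also need to account for the cost of switching.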
The key to this improvement is that we can choose to "invest" boosting in the threads with the best return. Based on the characteristics of frequency and look-ahead boosting, the effect is statistically much more pronounced for look-ahead. To show this point more quantitatively, we perform a statistical analysis. We take thousands of random mixes of applications and, given the speedup observed for each segment under different boosting options, estimate the effective speedup obtained with MP-FB and MP-ATB. Without using execution-driven simulation, this estimate ignores some inefficiencies and is thus a bit optimistic. For the mixes shown in Figure 13, the statistical estimate of effective speedup is 1.22 for MP-FB and 1.56 for MP-ATB. In other words, the error of the statistical estimate appears to be negligible.
With the statistical analysis we can obtain the probability distribution of the effective speedup for both MP-FB and MP-ATB. We present this probability distribution in a form that shows the probability of obtaining a particular speedup or better. Figure 14 summarizes the implications, some of which were already discussed:
(1) The MP-ATB curve lies significantly to the right of (or above) that of MP-FB. The gap between the two curves summarizes the benefit of including decoupled look-ahead support: For a typical situation, it is likely to significantly improve the speedup, or, alternatively, there is more chance to achieve the same target speedup.
(2) The expectation of the speedup of MP-ATB (1.56) is significantly higher than that of MP-FB (1.22).
(3) The range of achievable speedup is greatly extended. In a minority of cases, the effective speedup of MP-ATB goes beyond 2, suggesting that in those situations, it may even be more profitable to intentionally vacate a core for boosting use.
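The "probability of obtaining a particular speedup or better" view can be illustrated with a small Monte Carlo sketch. It is a deliberately crude stand-in for the analysis in the article: it works with one whole-program speedup per application instead of per-segment speedups, and it assumes the boost always goes to the application in the mix with the highest speedup. The application names and values are hypothetical.

```python
import random


def sample_mix_speedups(app_speedups, mix_size=3, trials=1000, seed=0):
    """Draw random mixes and return one effective speedup per mix.

    SIMPLIFYING ASSUMPTION: the effective speedup of a mix is the best
    whole-program speedup among its applications.
    """
    rng = random.Random(seed)
    apps = list(app_speedups)
    return [max(app_speedups[a] for a in rng.sample(apps, mix_size))
            for _ in range(trials)]


def exceedance(samples, thresholds):
    """Fraction of samples that reach each threshold or better."""
    n = len(samples)
    return [sum(1 for s in samples if s >= t) / n for t in thresholds]


if __name__ == "__main__":
    # HYPOTHETICAL whole-program speedups; not data from the article.
    apps = {"a": 1.1, "b": 1.3, "c": 1.9, "d": 1.2, "e": 1.5}
    samples = sample_mix_speedups(apps)
    for t, p in zip((1.2, 1.5, 1.8), exceedance(samples, (1.2, 1.5, 1.8))):
        print(f"P(speedup >= {t}) = {p:.2f}")
```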
CONCLUSIONS
When operating conditions allow, processor chips can choose to invest more energy to speed up execution in a turbo boosting mode. Today's turbo boosting mechanisms only involve increasing clock frequency at a higher operating voltage. In this article, we have argued for an ATB architecture that is significantly more effective in achieving performance gain and at the same time more energy efficient than an FB design. To enable ATB, we need to add support for decoupled look-ahead execution and runtime support to make judicious decisions about the right boosting actions for the executing code. Through experimental analysis, we have shown that, thanks to the complementarity of decoupled look-ahead and frequency boosting, ATB is much more effective at improving performance. When keeping the power roughly the same, it achieves a 1.4 average speedup compared to 1.19 with FB, while halving the energy overhead from 45% to 22%. ATB also greatly extends the range of achievable speedup, especially in a multi-program environment where it can choose among multiple target threads to accelerate. In a number of randomly created application mixes, ATB can achieve an even higher effective speedup of 1.56. Overall, compared to existing turbo boosting designs, the proposed adaptive turbo boosting architecture is a much more effective and yet more energy-efficient solution.
