In this paper, we address the problem of timing speculation for multi-threaded workloads executing on a multi-core processor. Our approach is based on a new observation: heterogeneity in path sensitization delays across different threads in multi-threaded programs. Leveraging this heterogeneity, we propose Synergistic Timing Speculation (SynTS) to jointly optimize the energy and execution time of multi-threaded applications. In particular, SynTS uses a sampling-based online error probability estimation technique, coupled with a polynomial-time algorithm, to optimally determine the voltage, frequency and amount of timing speculation for each thread. Our experimental evaluations, based on detailed cross-layer simulations, demonstrate that SynTS reduces the energy-delay product by up to 21% compared to existing timing speculation schemes.
Introduction
Timing speculation is a promising approach to increase processor performance and energy efficiency [9]. Under timing speculation, an integrated circuit is allowed to operate at a speed faster than that dictated by its slowest path, the critical path. It is based on the empirical observation that these critical path delays are rarely manifested during program execution. Consequently, as long as the processor is equipped with an error detection and recovery mechanism, its performance can be increased and/or energy consumption reduced beyond that achievable by conventional operation.
While many past works have dealt with timing speculation within a single core, in this work, we uncover a new direction by exploring timing speculation for a multi-core processor executing a parallel, multi-threaded application. Through a rigorous circuit-architectural analysis, we observe that during the execution of a multi-threaded program, there is a significant variation in circuit delay characteristics across different threads. Consequently, under timing speculation (e.g., a higher operating frequency), some threads exhibit higher timing error probabilities; we refer to these as timing speculation critical threads. For the same amount of timing speculation (i.e., frequency increase), a timing speculation critical thread has a longer execution time than other threads, even if the thread execution times are balanced during nominal operation. Figure 1 shows an example of a timing speculation critical thread in the Radix benchmark from the SPLASH-2 benchmark suite (see Section 4 for our detailed methodology). In this example, Thread 0 consistently has the highest error probability with decreasing clock period, about 4× greater than the thread with the lowest error probability.
Motivational Example
To exploit this intriguing thread-level heterogeneity in timing error characteristics, we propose to perform synergistic timing speculation for multi-threaded workloads by incorporating timing speculation criticality. We call our approach SynTS. Figure 2 illustrates the opportunities for SynTS to substantially improve system energy efficiency by exploiting heterogeneity in thread error probabilities.
We start with a scenario in which all four threads run at the nominal voltage and frequency without timing speculation. We assume that the threads are racing to a barrier, and reach the barrier at the same time at the nominal voltage and frequency. This means that the threads are perfectly balanced, with perfect work distribution and perfect cache access latencies. In Step 1, we exploit timing speculation to reduce thread execution time by increasing the frequency (reducing the clock period) for all threads at the expense of a small error probability. As shown in Figure 2, reducing the clock period by 24% reduces the execution time of Thread 0 by 7%, as observed empirically. The execution times of Threads 1, 2 and 3 are reduced even further, since they have lower error probabilities. Any further increase in clock frequency hurts performance, since it causes an increase in timing errors and their associated recovery overheads in Thread 0, nullifying the positive effects of the higher frequency.
At the end of Step 1, we note that although the threads all reach the barrier sooner than in the nominal case, Thread 0 is now critical, i.e., it reaches the barrier last, while the other three threads have some slack.
Figure 2: Overview of the SynTS approach. The data here are generated based on the error probability curve in Figure 1. More details about the algorithm and experimental parameters are given in Section 3 and Section 4, respectively.
In Step 2, we leverage the slack that has been created to reduce the voltage and potentially the frequency of Threads 1, 2 and 3, thus reducing their energy consumption without hurting the application's execution time. In this case, the voltage of all three threads is reduced to 0.9V. Overall, for this motivational example, the execution time and the energy consumption of the barrier interval are each reduced by 7%. This example demonstrates the dual benefits in execution time and energy consumption, and hence the significant potential of synergistic timing speculation, which is not achievable by naively adapting existing single-core timing speculation approaches to the multi-core setting.
In summary, this example motivates the key research question that we answer in this paper: how can we synergistically determine the optimal voltage, frequency, and as a result the error probability, for each thread/core, so as to jointly optimize for execution time and energy consumption?
Paper Contributions
To the best of our knowledge, ours is the first work to deal with the problem of synergistic timing speculation for multithreaded applications. To this end, we make the following novel contributions:
• Using an elaborate cross-layer methodology, we provide the first empirical evidence of heterogeneity in error probability of threads under timing speculation for SPLASH-2 benchmark applications. We propose SynTS, an approach that is aware of this heterogeneity and synergistically performs timing speculation for all threads to optimize the performance and energy.
• We propose a mathematical formulation for SynTS as a discrete optimization problem, and propose a polynomial-time algorithm that optimally solves this problem. Subsequently, we propose a practical online implementation of SynTS based on error probability sampling and our polynomial-time algorithm.
• Using rigorous circuit-architectural simulations, we demonstrate a significant improvement in energy efficiency: up to 21% reduction in the energy-delay product (EDP) compared to existing timing speculation schemes, and up to 28% compared to no timing speculation.
Related Work
Previous works most relevant to our work fall in two broad categories: (a) timing speculation architectures for single-core processors; and (b) conventional voltage/frequency scaling techniques for multi-threaded applications. The works in the first category aim at reducing the clock cycle time to boost performance, and/or reducing the supply voltage to save energy. Many of these techniques are recovery-based, where rare timing errors are detected and recovered using micro-architectural techniques [9, 5, 1]. Some recent works propose proactive techniques that anticipate an upcoming timing error before the clock edge, using various sensors embedded in the pipe stages [8, 13, 11]. Several other works advocate dynamic clock skewing [16, 18], in combination with timing speculation, for energy efficiency. All of these papers focus on isolated processor components and/or a single-core pipeline, and do not address how to synergistically overclock or voltage scale multi-core processors.
The second category of works explores the use of conventional voltage/frequency scaling (i.e., without any timing speculation) to optimize the energy and execution time of multi-threaded applications. In these works, the criticality of threads is assessed from their individual execution latency variance due to specific architectural events (e.g., cache misses) or the balance of work among threads [12, 2, 15, 4]. None of these works address timing speculation for multi-threaded applications or exploit timing speculation criticality, which our work does for the first time.
SynTS Design
We describe our design of SynTS in three steps. First, we discuss our mathematical model for timing speculation on multi-core processors (Section 3.1). Then, we describe an ideal offline implementation of SynTS (Section 3.2), and finally present a practical online version (Section 3.3).
System Model
In this paper, we consider a multi-core processor consisting of M homogeneous cores, and a multi-threaded application executing on the processor with one thread per core (that is, the number of threads is also M). The cores are equipped to handle timing speculation, that is, they can both detect and recover from errors using schemes proposed in the literature [9].
Each core can dynamically tune its voltage and clock frequency (or equivalently, clock period), a capability that is available in several commercial multi-core processors. The voltage of core $i$ ($i \in [1, M]$) is $V_i \in \mathcal{V}$, picked from $Q$ discrete voltage levels, i.e., $\mathcal{V} = \{V_1, V_2, \ldots, V_Q\}$. For every voltage level $V \in \mathcal{V}$, there is a nominal clock period $t^{nom}(V)$ at which the core is guaranteed to operate without any errors.
To enable timing speculation, core $i$ can operate at a clock period smaller than the nominal period. That is, the clock period of core $i$, $t_i$, is given by

$$t_i = r_i \cdot t^{nom}(V_i), \qquad 0 < r_i \le 1.$$
We refer to ri as the timing speculation ratio (TSR), and assume that it is picked from one of S discrete levels including 1. That is, ri ∈ R, where R = {R1, R2, . . . , RS = 1}. Note that the TSR implicitly corresponds to the discrete clock frequency at which cores can operate.
For a given $r_i$, the error probability is given as $p_i^{err} = err_i(r_i)$. Here $err_i$ is a decreasing function of $r_i$: longer clock periods imply lower error probability. An example of the error probability function is shown in Figure 1. As observed before, the error probability function can vary from one thread to another, i.e., it is thread specific.
With these preliminaries, we can now model system performance and energy consumption. Our performance model is based on the one proposed by [10] for processors with fine-grained error recovery mechanisms like Razor. In particular, the seconds per instruction (SPI, the inverse of instructions per second) of thread $i$ can be written as:

$$SPI_i = t_i \cdot \left( CPI_i^{base} + p_i^{err} \cdot C_{penalty} \right), \qquad (1)$$

where $t_i$ is the clock period of core $i$, $C_{penalty}$ is the error recovery penalty (5 cycles for the Razor processor [10]) and $CPI_i^{base}$ is the baseline cycles per instruction of thread $i$ in the absence of errors.
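As a rough sanity check against the motivational example, the following worked calculation takes SPI to be the clock period multiplied by (baseline CPI plus error probability times recovery penalty). It assumes, purely for illustration and not as something stated in the text, a baseline CPI of 1, a 5-cycle penalty, an unchanged voltage, and a negligible error probability at the nominal clock period. Under those assumptions, a 24% shorter clock period together with a 7% shorter execution time would correspond to an error probability $p$ of roughly 4.5% for Thread 0:

$$\frac{SPI(r = 0.76)}{SPI(r = 1)} = 0.76 \cdot (1 + 5p) = 0.93 \;\;\Rightarrow\;\; p = \frac{0.93 / 0.76 - 1}{5} \approx 0.045.$$

The same expression suggests why speculating further can hurt: once the growth in $5p$ outpaces the reduction in clock period, the product starts increasing again.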
Our focus in this work is on parallel applications that use barrier synchronization. Without loss of generality, we model application execution time for a single barrier interval; the total execution time can be obtained by summing over all barrier intervals. In a given barrier interval, thread $i$ executes $N_i$ instructions. The execution time of the barrier phase, $t_{exec}$, is determined by the last thread to reach the barrier [2]:

$$t_{exec} = \max_{i \in [1, M]} \; N_i \cdot SPI_i. \qquad (2)$$
Finally, the energy consumption of a thread, $en_i$, can be written as:

$$en_i = \alpha V_i^2 \cdot N_i \cdot \left( CPI_i^{base} + p_i^{err} \cdot C_{penalty} \right), \qquad (3)$$

where $\alpha$ is the average switching capacitance of a core. The energy equation multiplies the energy consumed per clock cycle by the number of clock cycles, including the extra cycles introduced due to timing speculation. We note that although the model does not currently account for leakage power, it can be easily extended to do so. Furthermore, although our focus in this paper is on exploring the energy versus execution time trade-offs, the proposed approach can be generalized to address power consumption as well.
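The system model above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `Thread` container, the function names, the default `alpha = 1.0`, and the dictionary representation of the error probability function are all assumptions made for the example.

```python
from dataclasses import dataclass

C_PENALTY = 5  # error recovery penalty in cycles (the Razor value quoted in the text)


@dataclass
class Thread:
    """Hypothetical container for the per-thread quantities used by the model."""
    n_instr: int      # N_i: instructions executed in the barrier interval
    cpi_base: float   # baseline cycles per instruction without errors
    err: dict         # maps a timing speculation ratio r to an error probability


def cycles_per_instr(thread, r):
    # Baseline cycles plus the expected recovery cycles at speculation ratio r.
    return thread.cpi_base + thread.err[r] * C_PENALTY


def spi(thread, r, t_nom):
    # Seconds per instruction: clock period (r * t_nom) times cycles per instruction.
    return r * t_nom * cycles_per_instr(thread, r)


def exec_time(thread, r, t_nom):
    # Time for this thread to reach the barrier.
    return thread.n_instr * spi(thread, r, t_nom)


def energy(thread, r, v, alpha=1.0):
    # Dynamic energy: (alpha * V^2) per cycle times the total number of cycles.
    return alpha * v ** 2 * thread.n_instr * cycles_per_instr(thread, r)


def barrier_time(threads, ratios, t_noms):
    # The barrier interval ends when the slowest thread arrives.
    return max(exec_time(t, r, tn) for t, r, tn in zip(threads, ratios, t_noms))
```

For instance, `barrier_time([a, b], [0.8, 0.8], [1.0, 1.0])` returns the arrival time of whichever of `a` and `b` is slower at a speculation ratio of 0.8, which is how a thread with a higher error probability becomes critical even when the instruction counts are balanced.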
Offline Optimization
We now describe our proposed synergistic timing speculation methodology. We begin by optimistically assuming that the workload characteristics, that is, each thread's error probability function $err_i(r_i)$, are known in advance. In Section 3.3, we will relax this assumption and describe an online policy that estimates the error probability function for each thread on the fly. Our goal is to determine the optimal voltage and clock period of cores in each barrier interval, so as to minimize a weighted sum of the total energy consumption and the barrier execution time. This can be formulated as the following optimization problem, SynTS-OPT:

$$\min_{V_i,\, r_i} \;\; \sum_{i=1}^{M} en_i + \theta \cdot t_{exec} \qquad (4)$$

such that $V_i \in \mathcal{V}$ and $r_i \in \mathcal{R}$ for all $i \in [1, M]$. Here $\theta$ is a designer-specified weighting factor that determines the importance of execution time vis-a-vis energy.
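MILP Formulation SynTS-OPT is a discrete, non-linear optimization problem. We first reduce it to a mixed integer linear programming (MILP) problem, which we refer to as SynTS-MILP. To do so, we introduce binary variables $x_{ijk}$ that are set to 1 if thread $i$ runs at voltage level $V_j$ and TSR $R_k$, and 0 otherwise. The objective function can now be written as:

$$\min \;\; \sum_{i=1}^{M} en_i + \theta \cdot t_{exec} \qquad (5)$$

subject to:

$$t_{exec} \ge N_i \cdot t_i \cdot \left( CPI_i^{base} + p_i^{err} \cdot C_{penalty} \right) \quad \forall i \in [1, M], \qquad (6)$$

$$t_i = \sum_{j=1}^{Q} \sum_{k=1}^{S} x_{ijk} \, R_k \, t^{nom}(V_j), \qquad (7)$$

$$p_i^{err} = \sum_{j=1}^{Q} \sum_{k=1}^{S} x_{ijk} \, err_i(R_k), \qquad (8)$$

$$en_i = \sum_{j=1}^{Q} \sum_{k=1}^{S} x_{ijk} \, \alpha V_j^2 \, N_i \left( CPI_i^{base} + err_i(R_k) \cdot C_{penalty} \right), \qquad (9)$$

and

$$\sum_{j=1}^{Q} \sum_{k=1}^{S} x_{ijk} = 1, \quad x_{ijk} \in \{0, 1\} \quad \forall i \in [1, M]. \qquad (10)$$

Equation 6 constrains the execution time of the barrier to be at least that of each thread, while Equations 7, 8 and 9 compute the clock period, error probability and energy, respectively, in terms of the $x_{ijk}$ variables. Equation 10 ensures that each thread gets assigned to only one voltage and frequency level. SynTS-MILP can be input to a standard MILP solver to obtain optimal voltage and frequency levels for each thread.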
Polynomial-time Algorithm Solving a MILP problem is, however, not practical in an online setting, since the runtime of MILP solvers scales poorly with the problem size. Fortunately, the specific form of SynTS-MILP lends itself to a polynomial-time solution, shown in Algorithm 1, which is a key contribution of our paper. The intuition behind our algorithm, SynTS-Poly, is as follows. We iteratively designate each thread as the critical thread, i.e., the thread that has the longest execution time (O(M) iterations). For a critical thread, we try all combinations of voltage and timing speculation ratios; for each such combination, we obtain the thread's execution time (O(QS) iterations). Then, for every other thread, we search for the lowest energy configuration that allows it to finish before or with the critical thread (O(MQS) iterations). These steps yield a pair of energy and execution time values for each thread, voltage and clock period combination (a total of MQS pairs). Of these, we return the configuration with the lowest weighted cost. The run-time of SynTS-Poly is quadratic in the number of threads/cores, voltage levels and timing speculation ratios, i.e., O(M²Q²S²).
Lemma 3.1. Algorithm 1 is guaranteed to return an optimal solution for SynTS-OPT.
Proof. Detailed proof is omitted for brevity, but we note that it depends on the fact that the non-critical threads only impact the energy component of the cost function.
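The search described above can be sketched as follows. This is an illustrative reading of the prose description, not a transcription of Algorithm 1: the function name `synts_poly`, the dictionary-based thread representation (`'N'`, `'cpi'`, `'err'`), the `t_nom` lookup table, and the default values of `alpha` and `c_penalty` are assumptions made for the example.

```python
from itertools import product


def synts_poly(threads, volts, t_nom, tsrs, theta, alpha=1.0, c_penalty=5):
    """Pick a (voltage, TSR) pair per thread minimizing sum(energy) + theta * t_exec.

    threads: list of dicts with keys 'N' (instructions), 'cpi' (baseline CPI)
             and 'err' (dict mapping each TSR to an error probability).
    t_nom:   dict mapping each voltage to its nominal clock period.
    """
    def time(th, v, r):
        return th['N'] * r * t_nom[v] * (th['cpi'] + th['err'][r] * c_penalty)

    def energy(th, v, r):
        return alpha * v * v * th['N'] * (th['cpi'] + th['err'][r] * c_penalty)

    configs = list(product(volts, tsrs))
    best_cost, best_assign = float('inf'), None

    # Outer loops: hypothesize a critical thread and its configuration.
    for c, crit in enumerate(threads):
        for vc, rc in configs:
            deadline = time(crit, vc, rc)
            assign = [None] * len(threads)
            assign[c] = (vc, rc)
            total_energy = energy(crit, vc, rc)
            feasible = True

            # Inner loop: every other thread takes its lowest-energy
            # configuration that still meets the critical thread's deadline.
            for i, th in enumerate(threads):
                if i == c:
                    continue
                candidates = [(energy(th, v, r), (v, r))
                              for v, r in configs if time(th, v, r) <= deadline]
                if not candidates:
                    feasible = False  # this hypothesis cannot be realized
                    break
                e, cfg = min(candidates)
                total_energy += e
                assign[i] = cfg

            if not feasible:
                continue
            cost = total_energy + theta * deadline
            if cost < best_cost:
                best_cost, best_assign = cost, assign

    return best_cost, best_assign
```

The two outer loops contribute M·QS hypotheses and the inner search costs at most (M−1)·QS evaluations each, which matches the quadratic run-time stated above. On small instances the result can be cross-checked against exhaustive enumeration of all (QS)^M assignments.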
Online Optimization
In practice, the error probability versus clock period data is not available in advance, and must be estimated on the fly. We propose an online sampling-based approach to address these practical constraints.
At the beginning of each barrier interval, each thread spends the first Nsamp instructions in a sampling phase. During the sampling phase, all threads run at a fixed voltage Vsamp ∈ V, but at different clock periods. In particular, each thread spends Nsamp/S instructions at each of the S available frequency levels (or equivalently, timing speculation ratios) for the voltage level Vsamp.
At the end of the sampling phase, we obtain an estimate of the error probability function, erri, for each thread i. We refer to the estimated error probability function as ẽrri. Note that although the error probability estimate is obtained for a single voltage level (Vsamp), the error at any other voltage V ∈ V is estimated as ẽrri(
). Finally, the estimated error probability functions are provided as the input to SynTS-Poly algorithm (Algorithm 1), which returns optimized voltage and timing speculation ratios (i.e., clock periods) for each thread. The threads are then run at these optimized levels for the remaining barrier interval.
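The sampling phase can be illustrated with the following sketch. It is a simulation-style approximation under stated assumptions: real hardware would count errors flagged by the detection circuitry, whereas here an instruction is assumed to err when its sensitized path delay exceeds the speculative clock period. The function name `estimate_error_function` and its arguments are hypothetical.

```python
def estimate_error_function(delays, t_nom_samp, tsrs, n_samp):
    """Estimate a thread's error probability at each TSR from a sampling phase.

    delays:     per-instruction sensitized path delays at the sampling voltage
                (assumed available from a trace; hardware would observe only
                the resulting error flags).
    t_nom_samp: nominal clock period at the sampling voltage.
    tsrs:       the S timing speculation ratios to sample.
    n_samp:     total number of instructions in the sampling phase.
    """
    per_level = n_samp // len(tsrs)  # Nsamp / S instructions per level
    estimate = {}
    for k, r in enumerate(tsrs):
        chunk = delays[k * per_level:(k + 1) * per_level]
        period = r * t_nom_samp
        # An instruction is counted as a timing error if its delay
        # does not fit in the speculative clock period.
        errors = sum(1 for d in chunk if d > period)
        estimate[r] = errors / len(chunk) if chunk else 0.0
    return estimate
```

The returned dictionary has the same shape as an error probability function indexed by TSR, so it can be handed to an optimizer in place of the true function. A larger `n_samp` gives each level more instructions and hence a less noisy estimate, at the cost of a longer sampling phase.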
The number of instructions in the sampling phase, Nsamp and the voltage at which cores execute in the sampling phase, Vsamp, are both knobs in our online approach. Increasing Nsamp provides more precise error estimates, but results in greater energy and execution time overheads during sampling. Increasing Vsamp increases the energy but reduces 
Evaluation Methodology
In this section, we describe the cross-layer methodology that we use to evaluate the proposed synergistic timing speculation techniques. Figure 3 gives an overview of our methodology, which we now describe in detail.
Architectural Simulator We use the cycle-accurate Gem5 simulator [3] to model a 4-core Alpha processor. We run several SPLASH-2 benchmarks [17], representing a range of real-world applications, on the simulator and extract cycle-by-cycle input vectors for each stage in the processor. These input vectors are used to drive the netlist in the circuit-level timing analysis, to estimate the actual propagation delay of each instruction during program execution. We run each benchmark for 4 barrier intervals, or to its completion, whichever comes first.
Timing Error Modeling Using Synopsys Design Compiler, we synthesize the Illinois Verilog Model [14] of the Alpha processor to obtain a gate-level netlist for each pipe stage. Next, by feeding cycle-by-cycle input vectors for each stage to its structural RTL netlist, we record the sensitized paths for each instruction. In this work, we focus on the execute pipe stage, specifically the SimpleALU. The propagation delays of gates on the sensitized paths are obtained from HSPICE simulation using the Predictive Technology Model (PTM) for the 22nm node [19]. Finally, to model the impact of voltage scaling on the propagation delay, we use HSPICE simulations of 22nm ring oscillators and record the clock period versus voltage, as shown in Table 1.
Based on this cross-layer methodology, we obtain traces of propagation delays for each instruction, for different supply voltages. Now, for any given speculative clock period, we are able to identify which instructions induce timing failures. We replay the traces using the oneIPC model [7], and add a penalty of five clock cycles for error-inducing instructions [10].
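The trace replay step can be sketched as below, assuming a trace is simply a list of per-instruction propagation delays at a given voltage. The function name `replay_trace` and the returned dictionary keys are illustrative choices, not part of the described toolflow.

```python
def replay_trace(delays, clock_period, c_penalty=5):
    """Replay a per-instruction delay trace at a speculative clock period.

    Uses a one-instruction-per-cycle baseline and charges c_penalty extra
    cycles for every instruction whose delay exceeds the clock period.
    """
    if not delays:
        raise ValueError("trace must contain at least one instruction")
    errors = sum(1 for d in delays if d > clock_period)
    cycles = len(delays) + c_penalty * errors
    return {
        'error_prob': errors / len(delays),  # fraction of failing instructions
        'cycles': cycles,                    # baseline plus recovery cycles
        'time': cycles * clock_period,       # total execution time
    }
```

Sweeping `clock_period` over the available speculative periods and recording `error_prob` at each point yields an error probability curve for the thread that produced the trace.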
Benchmarks We use the above methodology to characterize the error probability functions of 5 SPLASH-2 benchmarks-FMM, Radix, LU, FFT and Ocean. Of these, FFT and Ocean have homogeneous error probabilities for all threads, for which conventional timing speculation and our approach would work just as well. In fact, the FFT error probabilities are high and do not permit any timing speculation. Hence, we report our results on FMM, Radix and LU. Figure 4 for the Radix benchmark.
Experimental Results
We compare our proposed SynTS approach to several baseline schemes, as described below:
• Nominal V/F (Nominal): each core runs at its nominal voltage and corresponding clock period, i.e., without any V/F scaling and without any timing speculation.
• Optimal V/F without timing speculation (No-TS): each core runs at its own voltage level, chosen so as to minimize the weighted cost function of energy and execution time in Equation 4, but without any timing speculation. The No-TS baseline reflects existing approaches that attempt to balance workload variations between threads using DVFS, as proposed by [12].
• Per-core timing speculation (Per-core TS): each core leverages timing speculation to independently minimize its own energy and execution time cost, as measured using Equation 4. Per-core TS serves as a best possible bound for single-core timing speculation techniques like Razor [9], since it has offline access to the error probability functions for each core.
We begin by presenting results for the offline version of SynTS, and then present results for a practical online implementation of SynTS.
Offline Optimization Results
We begin by comparing the above schemes to SynTS in the offline setting, i.e., assuming that the error probability functions in each barrier interval are known in advance. Although such an offline approach cannot be implemented in practice, it allows us to quantify the best results that can be obtained from SynTS. We make several observations from these figures. First, we note that both TS approaches have lower best-case execution times than the No-TS baseline, since both use over-clocking to increase performance. Moreover, SynTS provides high performance at lower energy costs than Per-core TS: 21% lower for FMM and 10% lower for Radix. Second, in its low energy configuration, SynTS is 18% and 8% faster than Per-core TS for FMM and Radix, respectively. In both cases, Per-core TS consumes marginally less (< 2%) energy. The relative savings of SynTS over No-TS are even greater. Finally, although omitted due to space constraints, the results for LU are qualitatively similar.
Online Optimization Results
A critical factor in the success of SynTS online optimization is the fidelity of the error estimates obtained from the sampling phase. Figure 6 shows the actual and estimated error probability functions for an entire barrier interval of the FMM and Radix benchmarks. Here, the length of the sampling phase, Nsamp, is set to 10% of the total number of instructions in the barrier interval. Observe that in both cases, (1) the estimated error probabilities are close to the actual probabilities, and (2) importantly, the critical thread from a timing speculation perspective is always identified. We have observed the same behaviour over other barrier intervals and for the LU benchmark.
For the online experiments, we set Nsamp to 50K instructions, except for FMM, which has very short barrier intervals; for FMM, we set Nsamp to 10K instructions. In all cases, Vsamp is set to the nominal chip voltage. For every voltage level in Table 1, each core can choose among six clock periods that are a fraction r ∈ [0.64, 1] of the nominal clock period. Finally, since our focus here is on addressing heterogeneity in error probability (and the impact of estimating it online), we assume that information on workload heterogeneity (Ni for each thread) is available from offline characterization or using online workload prediction techniques proposed in the literature [12, 6, 2].
Figure 7 plots the execution time, energy and energy-delay product (EDP) of our online implementation of SynTS against competing approaches for the FMM, Radix and LU benchmarks. Results are for a fixed value of θ that weights energy and execution time equally. The results are normalized to SynTS (offline), allowing us to evaluate the overheads of implementing SynTS online. We can make several observations from Figure 7. (1) The overhead of online versus offline SynTS is relatively low: 10.3% in EDP on average over the three benchmarks. These overheads arise from imperfect error probability estimation, and because the sampling phase is executed at sub-optimal voltage and frequency levels to estimate the error probabilities. (2) Notwithstanding this overhead, SynTS outperforms the competing approaches for all three benchmarks. Compared to existing timing speculation, online SynTS is better by up to 21% in terms of EDP across the three benchmarks. The benefits are even greater when compared to No-TS.
Conclusion
In this paper, we proposed Synergistic Timing Speculation (SynTS), a novel technique to optimize the energy and execution time of multi-threaded applications executing on multi-cores. SynTS is based on a new empirical observation: heterogeneity in the sensitized delay distributions, and thereby in the error probabilities under timing speculation, across threads. Our empirical evaluation of SynTS illustrates that it improves EDP by up to 21% compared to per-core timing speculation and by up to 28% compared to no timing speculation. As future work, we plan to evaluate a larger set of benchmark applications, and to extend our approach to multi-threaded applications that use synchronization mechanisms other than barriers.
