Abstract. As process technology scales down, the power wall starts to hinder improvements in processor performance. Performance optimization has to proceed under a power constraint. This co-optimization requires exploring a huge design space containing both performance and power factors, whose size makes extensive traditional simulation too costly. This paper describes a unified model covering both performance and power. The model consists of workload parameters, architectural parameters and the corresponding power parameters, and shows a good degree of accuracy compared with physical processors and simulators. We apply the model to the problem of co-optimizing power and performance. Concrete insights into the design tradeoffs between performance and power are obtained in the process of co-optimization.
Introduction
The tradeoffs between power and performance, especially in embedded processors, have attracted much attention. Although there are many analytical models and simulators that address power or performance separately, there is still a need for a holistic model that provides insights into complex tradeoffs in an integrated manner. Because of the complexity, even integrated models tend to focus on a few processor components such as pipelines or the instruction queue without offering a system-wide view.
In order to arrive at a more realistic system-wide view of the power-performance trade-off, we propose an integrated model based on a previous performance model of superscalar processors. In that model, nearly all major processor components were modelled, including instruction classes, instruction dependencies, the cache, the branch unit, the decoder unit, the central instruction buffer, the functional units, the retirement buffer, the retirement unit, and the instruction issue policy. Later, we extended the model to out-of-order-issue processors. We further extended this performance model by linking the performance metrics with the dynamic capacitance of each processor component, thereby deriving the power consumption of each component and finally of the processor as a whole. The major component of static power, leakage power [1, 7], was also incorporated in the model. We validated this model by comparing its predicted power consumption with simulation results over the same benchmarks using Sim-Wattch [4]; the predictions are on average within 10.9% of the simulated values. The average power consumption obtained by our model agrees with the result reported by Synopsys Power Compiler with a power library from Virginia Tech [12]. Our average result also agrees with the analytical outcome of the Berkeley Advanced Chip Performance Calculator (BACPAC) [13].
We explain the definitions and results of the performance model in Section 3. We then present our power model in Section 5 and show how it is combined with the performance model. Section 6 describes the validation results. In Section 7, we formulate a co-optimization problem and show how the combined model handles it, together with other concrete tradeoffs for co-optimization. This is followed by a conclusion.
Related Work
The performance component of our model resembles that of Noonburg and Shen [9] in its use of similar separable components. Part of our model, namely the modelling of the instruction window, is based on the work of Pyun et al. [10]. We go beyond their work by proposing a comprehensive model that accounts for all the key components of a state-of-the-art superscalar processor.
In addition to many traditional issues such as performance, area, cost and reliability, power consumption has been recognized as a major concern of architects of portable and embedded computer processors. High-level models have been proposed to identify areas of significant power density, such as the one by Cai [5]. The BACPAC calculator [13] also falls into this category. Bergamaschi and Wang [2] added power states and symbolic simulation to the calculation. These models are based on architectural complexity in terms of gate equivalents, activities in a circuit, instruction-level costs, behavior-level abstraction, or system-level power estimation. However, they did not consider power-performance tradeoffs in an integrated way.
Some unified approaches that address both power and performance have been proposed recently. Brooks et al. [3] introduced a measured metric called the power-performance efficiency. Conte et al. [6] separated the architectural and technology components of dynamic power, and used a near-optimal search to tailor a processor design to different benchmarks. While Conte's model used trace-driven simulation to collect high-level statistics about pipeline stages, our model delves into greater detail for each processor component. Their approach only considers a subset of the parameters accounted for in our integrated model. Most importantly, they do not account for the clock frequency. Some other unified approaches addressed part of the parameters in our model. Srinivasan et al. [11] focused on pipeline optimization in terms of the best power-performance efficiency. Moreshet and Bahar [2] centered on the instruction issue queue.
A recent work [16] briefly presented the model integrating power and performance in an extended abstract. Another, more recent work [17] focused on the co-optimization process without full details of the proofs for the analytical model. In this paper, besides full details of the models, we provide for the first time a generic solution to the non-linear recurrences involved in the analytical model.
Performance Model
A multiple-class multiple-resource (MCMR) system is a queuing system where there are several classes of customers, each requiring a particular set of resources to service. To model a generic superscalar processor, we used a network of MCMR systems. Each stage of the pipelines contributes to the final results of the processor. The lowest throughput of all the pipeline stages is the bottleneck of the entire processor and determines the maximum possible throughput of the processor. We shall now recall the main results of the performance model.
The throughput of the processor Θ is the minimum of the service rates of the decoder unit (μ dec ), the central window (μ win ), and the retirement unit (μ ret ):
\[ \Theta = \min(\mu_{dec}, \mu_{win}, \mu_{ret}) \qquad (1) \]
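The bottleneck rule above can be illustrated with a minimal sketch. The function name and the three service rates below are illustrative assumptions, not values taken from the model's tables.

```python
def processor_throughput(mu_dec, mu_win, mu_ret):
    """Throughput Theta: the smallest of the three stage service rates."""
    return min(mu_dec, mu_win, mu_ret)


# Hypothetical service rates (instructions per cycle), for illustration only.
rates = {"dec": 2.4, "win": 1.548, "ret": 1.9}

theta = processor_throughput(rates["dec"], rates["win"], rates["ret"])
# The stage with the lowest service rate is the bottleneck.
bottleneck = min(rates, key=rates.get)
print(theta, bottleneck)
```

With these assumed rates the central window limits the throughput, so widening the decoder or the retirement unit alone would not raise Θ.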
Let W dec denote the decode width, i.e. the maximum number of instructions that can be decoded in one cycle. Let I br be the average number of (non-branch) instructions between two branch instructions (inclusive of one of the branches), T br be the misprediction penalty time (the time taken to fetch and decode the correct instructions), p ins,miss be the instruction cache hit ratio, t ins,pen be the instruction cache miss penalty time, and p br,prdt be the probability of a correct branch prediction. If I br < W dec , the average decoding rate without overflow in the central window, μ dec , is:
where C 1 , C 2 , and C 3 are linear functions of I br , T br , W dec , and p br,prdt . The remaining cases for the relation between I br and W dec are given in [15] . Let W ret denote the retire width, i.e. the maximum number of instructions that can be retired in one cycle. Let D be the average dependence distance (inclusive of one of the instructions in the dependence) between two instructions that have a data dependence relation. Under an in-order retirement policy, the average retirement rate for D < W ret is given below:
where the average time for an antecedent instruction to pass through functional units is:
where type ∈ {ieu, fpu, lsu, br} ranges over the types of functional units in the processor, namely the integer execution unit, the floating point unit, the load store unit and the branch unit. S i ∈ [0, 1] is the fraction of the total number of instructions that is executed on functional unit i for a given benchmark, and t i is the average service time of each functional unit of type i.
represent the probabilities of the data cache prediction and the instruction cache prediction, respectively. These parameters are determined by the benchmark and therefore vary from one benchmark to another.
In the model, the central window works as the instruction buffer. Instructions stay in the central window after they are decoded until they are issued. For out-of-order processors, any independent and ready instruction in the instruction window may be dispatched to an available functional unit. Given ρ k,t (Z win ) as the probability that k instructions of type t are issued from the window of size Z win , then:
and
is the probability that k independent instructions are extracted from Z win instructions and φ pipe,t (k) is the probability that at least k pipeline units of type t are available [10, 15] .
Solving the Recurrence
Let us solve the above non-linear recurrences (6) by abstracting the type t from them. In other words, we shall consider a simpler but equivalent description of the non-linear recurrences. Initial cases (INI):
Zwin−1 ) where k and Z win are natural numbers, and p ∈ [0, 1]. In practice, usually k ∈ {1, ..., 10} and Z win ∈ {4, ..., 20}.
First, we show that the above recurrence terminates after a finite number of iterations by determining the degree of the polynomial P k (Z win ) in the variable p for any parameters k and Z win . Moreover, for some particular cases, we can even give the analytical form of that polynomial. For the polynomial P (X) given by a 0 + a 1 X+ ... +a m X m , we use the following notation:
, that is, m is the maximum exponent of P with a non-zero coefficient. In this case, a m is called the dominant coefficient; 
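As a small illustration of this notation, the sketch below computes these quantities for a polynomial stored as a coefficient list. It assumes that minDeg denotes the smallest exponent with a non-zero coefficient and that the subordinate coefficient is the coefficient at that exponent; the function names and the sample polynomial are hypothetical.

```python
def deg(coeffs):
    """Largest exponent with a non-zero coefficient; coeffs[i] multiplies X**i.

    Raises ValueError for the zero polynomial, which has no such exponent.
    """
    return max(i for i, a in enumerate(coeffs) if a != 0)


def min_deg(coeffs):
    """Smallest exponent with a non-zero coefficient (assumed meaning of minDeg)."""
    return min(i for i, a in enumerate(coeffs) if a != 0)


def dominant_coefficient(coeffs):
    """Coefficient of the highest-degree term."""
    return coeffs[deg(coeffs)]


def subordinate_coefficient(coeffs):
    """Coefficient of the lowest-degree non-zero term (assumed definition)."""
    return coeffs[min_deg(coeffs)]


# Hypothetical example: P(X) = 2X - 3X^2 + X^4
example = [0, 2, -3, 0, 1]
print(deg(example), min_deg(example),
      dominant_coefficient(example), subordinate_coefficient(example))
```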
Proof. (a) Considering the identity (REC) for k = 2, it follows that
, for any i ∈ {0, ..., Z win − 2}. According to (INI), the identity for Z win − 2 is P 2 (2) = p. By substituting P 2 (2), ..., P 2 (Z win − 1), in this order, into the previous identities, we obtain identity (a).
(b) The identity is still a recurrence, but it depends only on the terms P k−1 (Z win −i), that is, both arguments are smaller than those of P k (Z win ). Considering the identity (REC) for
, in this order, in the previous identities, we obtain identity (b).
The following result ensures the finiteness of (REC) by specifying deg, minDeg, as well as the dominant and subordinate coefficients for the polynomial P k (Z win ). and the dominant coefficient of
and the subordinate coefficient of P k (Z win ) is 1.
Proof.
We proceed by induction on k ≥ 2. Base: k = 2. According to identity (a) of Lemma 4.1, the highest exponent of p in
Moreover, the dominant coefficient is (−1)
. The subordinate coefficient, as well as minDeg, can be easily obtained by considering i = 1 in identity (a) of Lemma 4.1.
Inductive Step: We suppose that (a) and (b) hold for any P k' (Z' win ), where k' < k and Z' win < Z win . Considering the identity (b) of Lemma 4.1, the subordinate coefficient can be obtained by taking i = Z win − k + 1, that is, according to the inductive hypothesis, p
. To compute the dominant coefficient for P k (Z win ), we need to sum all the dominant terms for i = 1 to Z win − k + 1. According to the inductive hypothesis, the dominant term of
The analytical form of the polynomial P k (Z win ) is very hard to obtain. However, there are two general forms for which this is possible (Theorem 4.2).
Theorem 4.2. For any k ≥ 2, we have
. For obtaining P k (k + 1), we proceed by induction on k. The case k = 2 holds obviously. According to (REC), we have
According to the inductive hypothesis, this can be continued by p
, which is equivalent to what was to be proved.
Power Model
The power consumption of a resource consists of a dynamic and a static component, i.e., π tot,res = π static,res +π dyn,res . The static portion is given by π static,res = I static,res × V dd . The leakage current I static,res is an exponential function of threshold voltage V t (in mV) by Sylvester and Keutzer [14] :
where ω is the device width in micrometers. According to the formula, the static power increases as process technology scales down. For any given technology node, the static power usually takes a stable portion of the total power. Khouri and Jha [7] summarized the ratios of the static power to the total power based on 6 different circuits, which are listed in Table 1 . For the dynamic power component, which depends on the workload, we used a model similar to that of several recent studies [2] , [8] . We model dynamic power as a function of the dynamic capacitance (C res ), the supply voltage (V dd ) and the clock frequency (Ω):
\[ \pi_{dyn,res} = C_{res} \times V_{dd}^{2} \times \Omega \qquad (8) \]
For each component of the processor, the capacitance is obtained either by using the same empirical formulas used by Sim-Wattch or by summing up the bit stream changes. With the total dynamic capacitance and the number of accesses of a resource, we can obtain the dynamic capacitance per access to the resource (C a,res ), as listed in Table 3 . The total power of a processor is the sum of the power consumed by each resource/component.
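The dynamic power calculation can be sketched as follows. The sketch assumes that the dynamic capacitance of a resource is the product of its per-access capacitance, its service rate and its number of accesses per request, which is the arithmetic used in the instruction window example in the co-optimization section, and that dynamic power is capacitance times supply voltage squared times clock frequency. The function names are illustrative; the numbers are the instruction window values quoted in that example.

```python
def dynamic_capacitance(c_access, mu, n_access_per_request):
    """C_res = C_a,res * mu_res * N_a,req,res (farads); assumed product form."""
    return c_access * mu * n_access_per_request


def dynamic_power(c_res, v_dd, omega_hz):
    """Dynamic power of one resource in watts: C_res * V_dd^2 * Omega."""
    return c_res * v_dd ** 2 * omega_hz


def total_dynamic_power(resources, v_dd, omega_hz):
    """Sum the dynamic power over all resources.

    resources maps a resource name to (C_a,res, mu_res, N_a,req,res).
    """
    return sum(
        dynamic_power(dynamic_capacitance(c, mu, n), v_dd, omega_hz)
        for c, mu, n in resources.values()
    )


# Instruction window example: C_a,win = 0.631e-10 F, mu_win = 1.548,
# N_a,req,win = 6, V_dd = 2.5 V, clock frequency = 600 MHz.
c_win = dynamic_capacitance(0.631e-10, 1.548, 6)
p_win = dynamic_power(c_win, 2.5, 600e6)
print(c_win, p_win)
```

Passing a dictionary with one entry per processor component to `total_dynamic_power` gives the processor-level dynamic power under the same assumptions.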
Validation of Models
The earlier version of the performance model was for in-order-issue processors, and it was validated against results measured physically on an UltraSPARC processor, with an average error of 5.1%. This performance model was extended to out-of-order-issue processors and validated against a SimpleScalar simulated out-of-order-issue processor, with a small average error of 5.9%.
As further validation, we also compared our results with those of other power models. The BACPAC [13] calculator shows that the typical power consumption is 24.03 watts for a 5-million-transistor processor running at 600MHz and V dd of 2.5V. This power consumption is close to the averaged analytical power of 27.38 watts. Using the same V dd , clock frequency and a 0.25μm technology based power library by Sulistyo and Ha [12] , we also obtained a total power of 32.1 watts reported by the Synopsys Power Compiler for a RISC processor design of similar scale. More details of the validations are available in [17] .
The inputs to the performance model are given in Table 2 . The capacitance parameters from Table 3 are inputs to our power model. Our architectural analysis yields the values of N a,req,res : N a,req,win = 6, N a,req,regfile = 2 and N a,req,dec = N a,req,ieu = N a,req,fpu = N a,req,lsu = N a,req,br = N a,req,icache = N a,req,dcache = 1. We assume that the service rate of the register file equals that of the retirement unit, that is, μ ret = μ regfile .
Co-optimization Applications of the Models
We shall now show by examples how the model can be used to explore the design space to reach a co-optimized solution.
A Co-optimization Issue: To co-optimize power and performance, we need to minimize π dyn,tot in (10), while maximizing the throughput in terms of the number of instructions per second, i.e. Θ × Ω. First, we let the user set an upper limit, π U say, i.e. π dyn,tot ≤ π U . Within this constraint, we seek to maximize Θ in (1) while varying Ω. In short, this approach maximizes the throughput under a power budget.
In order to obtain the configuration with the least energy consumption for a computation, we look for the minimal total energy needed to finish a task of n i instructions. Let π u,x be the upper power limit for the x-th optimization case; the constraint π dyn,tot ≤ π u,x ≤ π U must hold when seeking the maximum performance Θ × Ω. If such a case x exists, then the time to execute the application is n i /(Θ × Ω). Consequently, this also yields the minimal total energy, at the x-th case where the power is π dyn,tot , as the product of that power and the execution time:
\[ \min_{x} \; \pi_{dyn,tot} \times \frac{n_i}{\Theta \times \Omega} \]
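The energy comparison can be sketched as power multiplied by execution time. The three cases below, given as (power in watts, throughput per cycle, clock frequency in Hz), and the instruction count are hypothetical values chosen only to show the calculation.

```python
def execution_time(n_instructions, theta, omega_hz):
    """Seconds needed to run n_instructions at theta * omega instructions/s."""
    return n_instructions / (theta * omega_hz)


def task_energy(power_watts, n_instructions, theta, omega_hz):
    """Energy in joules: power times execution time."""
    return power_watts * execution_time(n_instructions, theta, omega_hz)


# Hypothetical optimization cases: (pi_dyn_tot, theta, omega).
cases = [(20.0, 1.5, 600e6), (12.0, 1.4, 400e6), (5.0, 1.2, 200e6)]
n_i = 1_000_000_000  # hypothetical instruction count

energies = [task_energy(p, n_i, th, om) for p, th, om in cases]
# Index of the case with the least total energy.
best = min(range(len(cases)), key=energies.__getitem__)
print(energies, best)
```

In this made-up data set the slowest case uses the least energy, which shows why the fastest configuration is not automatically the most energy-efficient one.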
Impact of Clock Frequency: We will now use bzip2 as an example to show how co-optimization is achieved. To begin, we set an upper bound on the dynamic power, π dyn,U = 25 watts. The co-optimized solution is obtained by the following search procedure:
1. Read the performance values of 256.bzip2 from 
2. For the power constraint on the dynamic power, π u,x , from 25 watts down to 1 watt in steps of 1 watt, do:
2.1. For each clock frequency Ω from 100 to 600 MHz in steps of 100 MHz, repeat the following steps to obtain the maximum performance Θ × Ω under the current power constraint π u,x .
2.1.1. With the above performance service ratios of resources and Ω, we obtain π res in (8), where C a,res is obtained from Table 3 .
2.1.2. Sum up π res for all components. If the total π dyn,tot does not exceed π u,x , then we have found a configuration within the constraints. We note down the performance Θ × Ω and π dyn,tot .
3. Find the maximum of Θ × Ω i and its associated π dyn,tot and Ω i .
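The search procedure above can be sketched as two nested loops. This is only an illustration: `co_optimize`, `power_at` and `theta_at` are hypothetical names, the total dynamic capacitance of 0.9e-8 F and the constant throughput of 1.5 are made-up stand-ins for the per-benchmark values the model would supply, and the power is computed as capacitance times supply voltage squared times clock frequency.

```python
def co_optimize(power_at, theta_at, frequencies_hz, power_limits):
    """For each power limit, return (performance, power, frequency) of the
    fastest feasible configuration, or None if no frequency fits the limit."""
    results = {}
    for limit in power_limits:
        best = None
        for omega in frequencies_hz:
            power = power_at(omega)
            if power <= limit:  # configuration within the constraint
                performance = theta_at(omega) * omega  # instructions per second
                if best is None or performance > best[0]:
                    best = (performance, power, omega)
        results[limit] = best
    return results


V_DD = 2.5        # volts
C_TOTAL = 0.9e-8  # farads; hypothetical total dynamic capacitance


def power_at(omega_hz):
    """Total dynamic power in watts at the given clock frequency."""
    return C_TOTAL * V_DD ** 2 * omega_hz


def theta_at(omega_hz):
    """Hypothetical throughput per cycle, held constant for simplicity."""
    return 1.5


frequencies = [mhz * 1e6 for mhz in range(100, 700, 100)]  # 100 to 600 MHz
limits = range(25, 0, -1)                                  # 25 W down to 1 W
results = co_optimize(power_at, theta_at, frequencies, limits)
print(results[25], results[10], results[5])
```

Under these assumed numbers the 25 watt budget admits 400 MHz, the 10 watt budget admits only 100 MHz, and no frequency fits within 5 watts.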
Impact of Leakage Power:
Using our model, we can study the impact of leakage power on the maximum clock frequencies and dynamic power consumption. If we vary the clock frequency while keeping the remaining parameters fixed, we can observe clear changes in both leakage power and dynamic power. We find that the leakage power without optimization grows consistently as the feature size shrinks. At the 0.07μm technology node, the leakage power overtakes the dynamic power as the dominant power factor. This trend hinders the increase of clock frequencies, which range from 400 MHz for 0.35μm technology to 3 GHz for 0.07μm technology. With optimizations [7] on leakage power, the total power budget can be spent more effectively on the dynamic power consumption. The leakage power is then kept lower than the dynamic power, and the maximum clock frequency for 0.07μm technology can be improved to 5.2 GHz.
Projection of Minimum Dynamic Power:
We also apply our model to gauge the minimum dynamic power for different benchmarks. We keep the processor configuration fixed, and seek the lowest possible dynamic power for a given workload. In practice, we use the service rates of resources μ res and the capacitance primitives of resources C a,res in Table 3 to obtain the dynamic capacitances of resources C res . The dynamic power π dyn,res is obtained by feeding C res , V dd and Ω into Eq. (8).
For example, the minimum C a,win for the instruction window is 0.631 × 10 −10 farad in Table 3 , and the minimum μ win for the instruction window is 1.548. Along with the average number of accesses to the instruction window per request, N a,req,win = 6, we obtain the minimum C win as C win = 0.631 × 10 −10 × 1.548 × 6 ≈ 5.861 × 10 −10 farad. Then the minimum π dyn,win = 5.861 × 10 −10 × 2.5 2 × 600 × 10 6 ≈ 2.198 watts. The minimum total dynamic power of 15.81 watts is found by repeating the above process for all the resources. This bound implies that the processor dynamic power can be reduced to lower than the bound with proper scheduling and choice of workloads.
Conclusion
In this paper, we present an approach to power and performance co-optimization using our unified model, which accounts for both issues. Validation against an established power simulator using large SPEC2000 benchmarks indicates the accuracy of the model. The results are also in agreement with previous analytical studies and experimental results.
In the process of co-optimization, we showed the impact of leakage power on performance improvements for different technology nodes. We also obtained a bound on the minimum dynamic power. In addition, we found that the clock frequency is the dominant factor, compared to the cache, instruction window and functional units, in improving performance under dynamic power constraints. These results illustrate that our model is a useful tool for designers making power-aware decisions at early stages of co-optimization.
