Abstract: This paper presents systematic techniques to find low-power, high-performance superscalar processors tailored to specific user applications. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is based on case studies of actual designs. It is used to solve an important problem: decreasing power consumption in a superscalar processor without greatly impacting performance. Results are presented from runs using simulated annealing to reduce power consumption subject to performance reduction bounds.
The major contributions of this paper are the separation of architectural and technology components of dynamic power, the use of trace-driven simulation for architectural power measurement, and the use of a near-optimal search to tailor a processor design to a benchmark.
Keywords: Superscalar, power dissipation, instruction-level parallelism, near-optimal search, high-level synthesis.
I. Introduction
All recent high-performance desktop processor offerings are superscalar designs. These processors use duplicated, independent functional units to execute instructions in parallel. The ability to execute in parallel is limited by the flow of information between instructions, since some instructions depend on results calculated earlier in the program. Superscalar processor organizations use hardware techniques such as the Tomasulo algorithm [1] to detect parallelism and execute code correctly. Empirical results suggest as much as a five times speed improvement when instruction-level parallelism is exploited [2]. Current designs seek parallelism by examining and issuing four to six instructions per cycle, with higher rates expected [4], [5], [6], [7]. Successful use of these high issue rates requires careful tuning of the microarchitecture. There is a wealth of technological alternatives for this task. These include branch handling strategies [8], functional unit duplication [2], and instruction fetch, issue, completion and retirement policies [9]. The deciding factor between the various techniques is a function of the performance each adds versus the cost each incurs. Unfortunately, this tradeoff analysis rarely takes power consumption into account. Consequently, current superscalar processors consume anywhere from 30 to 70 watts of power and will soon approach 100 watts. This can lead to problems with die packaging and package cooling, as well as decreased battery life for portable devices such as notebook computers and cellular telephones.
This is a revised and expanded version of the paper presented by the first three authors at the 28th Hawaii International Conference on System Sciences, Jan.
The organization of a high-performance microprocessor is determined using results from behavioral simulations. Performance is measured as the number of cycles per instruction or the overall run time for a set of test programs. Power consumption is not considered until much later in the design process, and, as such, it is the responsibility of the circuit designers rather than the architects. However, parallelism and pipelining have a direct impact on processor designs. Highly-parallel processors consume more power per cycle than non-parallel hardware. Deeply-pipelined functional units consume more power, since energy is consumed over a shorter period of time. This suggests tradeoffs between power consumption and processor organization that defy simple, rule-of-thumb approaches.
This research develops a system-level, behavioral model of power consumption for designing low-power, high-performance superscalars. This model is a separable cost function that can be used to optimize such architectures. The cost function is separated into organizational and technological components. The organizational component is measurable from a behavioral-level simulation of the type used for high-level design. The technological component depends on the implementation technology. The components can be combined after simulation to estimate power dissipation. A near-optimal search algorithm is employed to reduce the power consumption of superscalar processor designs without large sacrifices in performance. The combined cost function and near-optimal search method is suitable for tradeoff analysis of processor organizations. The method introduces power considerations into the organizational design process, reducing overall power consumption through organizational changes.
II. Methods and Models
The processor model for this study is a superscalar engine with full-Tomasulo scheduling and pipelined functional units. To achieve high parallelism, integer and floating-point functional units are duplicated and the functional unit latencies are varied. This paper focuses on power-centric design of the processor's execution unit and its pool of functional units. For the Alpha 21264, this unit comprises roughly half of the chip area [5]. The execution unit has nine functional units; their types are shown in Table I. A 64-bit word size is assumed. The integer class is composed of 64-bit integer ALU units (IALU), 64-bit shifter hardware (Shift) and branch hardware (Branch). The floating-point units are grouped into addition (FPAdd), multiplication (FPMul) and division (FPDiv). FPDiv is a pseudo-unit: division actually takes place in the multiplier using the quadratic convergence division method in an iterative, unpipelined fashion (see footnote 1). All units are designed using static CMOS with input buffering.
The data cache is accessed through three functional units: the Load, Store and PMiss units. PMiss is an abbreviation for Pending Miss. Any Load operation that causes a cache miss is automatically coupled with a dynamically created PMiss operation. These operations fetch the missing cache block independently from other cache accesses. Once a PMiss operation completes, its associated Load operation is allowed to execute. This unit incorporates the lockup-free cache design presented by Kroft [14].
Consider a processor design space composed of one or more of the functional units described above.
Footnote 1: This algorithm can achieve the precision required by the IEEE standard at reasonable cost and speed [12] and was implemented in the RS/6000 [13].
A. A System-Level Power Model
Excessive power dissipation is known to cause serious packaging and thermal problems. Examples are the 72 watts dissipated by the 600 MHz DEC Alpha 21264 [10] and the estimated 100+ watts dissipated by the upcoming Compaq Alpha 21364, which will run at speeds exceeding 1 GHz [11]. As clock rates increase, power dissipation becomes as important a design concern as performance and die area.
Power dissipation in static CMOS can be divided into static, dynamic, and short-circuit components. Static power dissipation is due to the reverse-bias leakage current between diffusion regions and the substrate during steady state. This component is highly technology dependent. The static power dissipation, $P_{static}$, for a particular functional unit is estimated by Equation 2.
An example helps illustrate the model. System designers often assume that pipelining does not affect power dissipation. The theory is that if any instruction uses a functional unit, it must travel through all stages of the unit in turn, which implies it consumes the same power as it would on an unpipelined unit (neglecting latching costs). Figure 2 shows why this assumption is false. Here three instructions are executed on a pipelined unit (Figure 2a) and on an unpipelined unit (Figure 2b). The corresponding power cost for each is shown below the figure. The unpipelined version uses 55% of the power of the pipelined version. The reason for this difference is the pipeline speedup effect, which is an architectural phenomenon. The assumption that pipelining does not matter has also been persuasively disproved in [16].
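The pipelining example can be checked with a short calculation. The sketch below is an illustration under stated assumptions, not the paper's model: it assumes a three-stage unit (the stage count of Figure 2 is not given in the text), back-to-back issue of the instructions, and equal total energy for both versions (latch costs neglected). The function name is hypothetical.

```python
def unpipelined_power_fraction(stages: int, instructions: int) -> float:
    """Average power of an unpipelined unit relative to a pipelined one.

    Assumption: both versions spend the same total energy, so the power
    ratio is the inverse of the ratio of execution times (in cycles).
    """
    if stages < 1 or instructions < 1:
        raise ValueError("stages and instructions must be positive")
    # Pipelined: fill the pipe once, then retire one instruction per cycle.
    pipelined_cycles = stages + instructions - 1
    # Unpipelined: each instruction occupies the whole unit in turn.
    unpipelined_cycles = stages * instructions
    # Same energy over a longer time means proportionally lower power.
    return pipelined_cycles / unpipelined_cycles


fraction = unpipelined_power_fraction(stages=3, instructions=3)
print(f"unpipelined / pipelined power: {fraction:.3f}")
```

With three stages and three instructions the ratio is 5/9, or about 0.556, which is close to the 55% quoted above. The same energy is spent in five cycles instead of nine, so the pipelined unit draws more power per unit time.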
The total dynamic power consumption can be calculated from the power consumption of each unit. Let $S_{ijk}$ be the $i$th pipeline stage in the $j$th copy of functional unit type $k$. The total dynamic power consumption, $P_{TOT}$, is then obtained by summing over all stages $S_{ijk}$. Further details concerning this model are presented below.
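The separation into architectural and technology components can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: the architectural component is taken to be a per-stage activation count from simulation, the technology component an energy per activation for each stage, and average power the total energy divided by execution time. All names (`stage_activations`, `stage_energy_joules`, `total_dynamic_power`) are hypothetical.

```python
from typing import Dict, Tuple

# A stage S_ijk is identified by (i, j, k):
# pipeline stage i of copy j of functional unit type k.
StageKey = Tuple[int, int, str]


def total_dynamic_power(
    stage_activations: Dict[StageKey, int],
    stage_energy_joules: Dict[StageKey, float],
    cycles: int,
    cycle_time_s: float,
) -> float:
    """Combine simulation counts with technology energies after the fact.

    Assumed form: sum over all stages of (activations * energy per
    activation), divided by the total execution time.
    """
    if cycles <= 0 or cycle_time_s <= 0:
        raise ValueError("cycles and cycle_time_s must be positive")
    energy = sum(
        count * stage_energy_joules[key]
        for key, count in stage_activations.items()
    )
    return energy / (cycles * cycle_time_s)


# Invented numbers: one two-stage IALU, each stage active in 10 of 20 cycles.
activations = {(0, 0, "IALU"): 10, (1, 0, "IALU"): 10}
energies = {(0, 0, "IALU"): 1e-9, (1, 0, "IALU"): 1e-9}
print(total_dynamic_power(activations, energies, cycles=20, cycle_time_s=1e-9))
```

Because the activation counts do not depend on the energy table, one simulation run can be re-scored against several technology models without re-simulating.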
B. Simulation techniques
System-level design employs trace-driven behavioral simulation, where the traces are taken from a set of industry-standard benchmark programs. Members of the SPEC92 workstation benchmarks [17] are used here, summarized in Table II. The benchmarks are compiled using the public-domain GNU C compiler, which implements an aggressive set of code-improving optimizations, including a priority-based list scheduling algorithm [18]. This shortens the critical dependence path of instruction sequences as much as possible, enhancing parallelism between instructions and resulting in higher superscalar processor performance. The benchmark traces are generated using the Spike tracing tool [19].
The simulator implements a dynamic instruction scheduling model, with the window for instruction scheduling moving between correctly predicted branches. Yeh's adaptive training branch algorithm is used to predict branch behavior, since it is a highly accurate prediction scheme [20]. Since the benchmarks can generate extremely long traces, trace-sampling techniques are employed to reduce trace size and simulation time (see [21], [22] for details). Only the pipeline state is sampled. The entire memory system, including branch hardware and caches, is simulated using the full trace. This results in a relative error of no more than 3% for the processor performance.
C. Tradeoff analysis using near-optimal search
One goal of this study is to determine designs that achieve
The following method is used to guide the simulated annealing algorithm. At each step of the algorithm, the next processor design, $m_{i+1}$, is derived from the current design, $m_i$, using a restricted random selection procedure: (1) select $l$ functional units at random from $m_i$, where $l$ is a random integer in the range [1, 3]; (2) change the number of each of these functional units in $m_i$ by a random integer in the range [-3, 3]. Any number greater than the issue rate (four instructions per cycle) or less than 1 is rejected. For units with several possible pipeline latencies, a slightly more restrictive procedure is used to randomly alter the latencies.
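The selection procedure can be sketched in a few lines. The sketch makes two assumptions that the text does not settle: a design is represented as a mapping from functional unit type to its count, and a rejected count is handled by redrawing the random change for that unit. The latency perturbation is omitted because its rules are not given. The name `next_design` is hypothetical.

```python
import random
from typing import Dict, Optional

ISSUE_RATE = 4  # instructions per cycle, as stated in the text


def next_design(
    current: Dict[str, int], rng: Optional[random.Random] = None
) -> Dict[str, int]:
    """Derive a candidate design m_{i+1} from the current design m_i."""
    rng = rng or random.Random()
    if any(not 1 <= n <= ISSUE_RATE for n in current.values()):
        raise ValueError("every unit count must lie in [1, ISSUE_RATE]")
    candidate = dict(current)
    # Step 1: pick l unit types at random, with l drawn from [1, 3].
    num_types = rng.randint(1, min(3, len(candidate)))
    for unit in rng.sample(sorted(candidate), num_types):
        # Step 2: change the count by a random integer in [-3, 3].
        # Assumption: an out-of-range count is rejected by redrawing.
        # A change of 0 is always valid, so the loop terminates.
        while True:
            count = candidate[unit] + rng.randint(-3, 3)
            if 1 <= count <= ISSUE_RATE:
                candidate[unit] = count
                break
    return candidate


m0 = {"IALU": 2, "Shift": 1, "Branch": 1, "FPAdd": 1, "FPMul": 1, "Load": 2}
print(next_design(m0, random.Random(0)))
```

Bounding each count by the issue rate keeps the search away from designs with units that could never all be issued to in a single cycle.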
The initial design used as the starting point for the search is given by Equation 6.
D. Performance metrics
A performance metric is used that takes into account performance due to both processor organization and technological considerations. Parallelism, or instructions per cycle (IPC), is often used for architectural performance. IPC is ultimately limited by the issue rate (a design feature) and inter-instruction dependencies (a benchmark characteristic). IPC alone lacks technology considerations. For example, short-latency functional units produce high IPC, since dependencies are resolved more quickly using shorter latencies (shallow pipeline depths). However, lower degrees of pipelining may lengthen the execution unit's critical path. This has an impact on the total time to execute a program, but is not reflected by the IPC metric. Hence, tradeoff analysis employing only IPC would result in a sub-optimal design. The critical path that determines cycle time is typically through the first level of the memory hierarchy (e.g., the data cache). Shallow pipelines can shift this critical path into the execution unit. Since this study concentrates on the superscalar execution unit, the aim is to optimize the critical path within the pipelines of the functional units. This reduces the impact of the execution unit's critical path on the external cycle time of the processor. A metric that combines IPC and critical path delay is the critical time per instruction (CTPI). CTPI is the ratio of the critical path delay to the number of instructions per cycle. Optimizing the execution unit for low CTPI reduces the chance of affecting the processor's cycle time. For this reason, CTPI is used in the search algorithm's cost model.
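The CTPI definition translates directly into code. In the sketch below the function names and the two example designs are invented for illustration; only the ratio itself (critical path delay divided by IPC) comes from the text.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle measured by the simulator."""
    if instructions <= 0 or cycles <= 0:
        raise ValueError("instructions and cycles must be positive")
    return instructions / cycles


def ctpi(critical_path_delay_ns: float, instructions: int, cycles: int) -> float:
    """Critical time per instruction: critical path delay divided by IPC."""
    return critical_path_delay_ns / ipc(instructions, cycles)


# Hypothetical designs: A has shallow pipelines (higher IPC, longer
# critical path); B has deeper pipelines (lower IPC, shorter path).
design_a = ctpi(critical_path_delay_ns=5.0, instructions=2000, cycles=1000)
design_b = ctpi(critical_path_delay_ns=4.0, instructions=1800, cycles=1000)
print(design_a, design_b)
```

Design A has the higher IPC (2.0 against 1.8) but the worse CTPI (2.5 ns against roughly 2.22 ns), which is the situation that an IPC-only analysis would miss.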
E. Example technology cost model
The example technology cost model considers a processor implementation technology with a budget of 1.7 million transistors and a supply voltage of 3.3 volts. This is based on the reported figures in [3] for a 0.75 µm three-metal-layer CMOS process technology. Although the first-level data cache is not included in the execution unit, its miss rate impacts the overall performance of the superscalar core. A 16 KB, 2-way associative data cache is assumed. This design assumes a page size of 8 KB so that cache data store indexing can occur in parallel with TLB access. Cache misses are handled by the hardware using a lockup-free mechanism [14]. The latency to repair a missing block from the L2 cache is assumed to be 10 cycles.
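The relation between cache geometry and page size can be verified with a one-line check. The sketch relies on the usual condition for indexing a cache in parallel with address translation: the bytes covered by one way must not exceed the page size, so that all index bits fall inside the untranslated page offset. The function name is hypothetical.

```python
def index_fits_in_page_offset(
    cache_bytes: int, associativity: int, page_bytes: int
) -> bool:
    """True if the cache can be indexed using page-offset bits only."""
    if cache_bytes <= 0 or associativity <= 0 or page_bytes <= 0:
        raise ValueError("all sizes must be positive")
    way_bytes = cache_bytes // associativity  # bytes addressed by index + offset
    return way_bytes <= page_bytes


# The configuration assumed in the text: 16 KB, 2-way, 8 KB pages.
print(index_fits_in_page_offset(16 * 1024, 2, 8 * 1024))
```

For the assumed cache each way covers exactly 8 KB, so the condition holds with no slack. A direct-mapped 16 KB cache with the same page size would not satisfy it.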
The specific cost model depends on CTPI and power consumption estimates. CTPI is calculated from the number of instructions, the number of cycles for the execution of the program, and an estimate of the critical path. The deepest pipeline stage in the execution unit is used to find the critical path, employing a technique presented in [22]. It is rarely true that the functional units can be pipelined such that the cycle time is exactly inversely proportional to the degree of pipelining. Instead, the deepest pipeline stage for each degree of pipelining is determined. The sum of the device propagation delays within this stage constitutes the cycle time.
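The cycle time estimate can be sketched as below. The sketch interprets the "deepest" stage as the stage with the largest summed device propagation delay, which is an assumption about the terminology. The delay values are invented and the function name is hypothetical.

```python
from typing import Sequence


def cycle_time(stages: Sequence[Sequence[float]]) -> float:
    """Cycle time set by the slowest stage of a pipelined unit.

    Each inner sequence holds the device propagation delays (ns) along
    the longest path through one pipeline stage.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    return max(sum(device_delays) for device_delays in stages)


# Invented example: 12 ns of logic split unevenly over three stages.
three_stage = [[2.0, 2.5], [4.0], [1.5, 2.0]]
print(cycle_time(three_stage))  # 4.5 rather than the ideal 12 / 3 = 4.0
```

The uneven split illustrates why cycle time is not exactly inversely proportional to the degree of pipelining: the slowest stage, not the average stage, sets the clock.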
The CTPI of processor $m_i$, $\mathrm{CTPI}_{m_i}$, is constrained to a fractional increase over $\mathrm{CTPI}_{m_0}$:
$$\mathrm{CTPI}_{m_i} \le K \cdot \mathrm{CTPI}_{m_0}$$
where $K$ is the CTPI budget.
Transistor-level analysis of published work provided the approximations for each functional unit type. This model is presented in Table III. Since the FPMul unit is used iteratively for division, the FPDiv unit does not consume any die space and is not mentioned in the table.
Only relative power dissipation increases are required for the cost model. Therefore, the power estimate is normalized to remove any multiplicative error in the model. The coefficients are adjusted such that dynamic power is 10,000 times larger than static power for a single device (a typical ratio). Static power is estimated using Equation 2. Equation 5 is used to estimate dynamic power. Stage energies are calculated using the functional unit designs of Table III.
The overall goal is to minimize power subject to constrained performance degradation. An expression for the combined cost function is
$$f(m_i) = \begin{cases} \text{power of } m_i, & \text{if } \mathrm{CTPI}_{m_i} \le K \cdot \mathrm{CTPI}_{m_0}, \\ \infty, & \text{otherwise.} \end{cases} \qquad (8)$$
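A minimal sketch of the combined cost function follows. It assumes that the infeasible branch evaluates to infinity and that the budget $K$ is expressed as a multiplier (for example 1.05 for a 5% allowance). The function and parameter names are hypothetical.

```python
import math


def cost(power: float, ctpi: float, ctpi_initial: float, budget_k: float) -> float:
    """Cost of a candidate design: its power if within the CTPI budget.

    Designs whose CTPI exceeds budget_k * ctpi_initial are given infinite
    cost, so a minimizing search never prefers them.
    """
    if ctpi <= budget_k * ctpi_initial:
        return power
    return math.inf


# Invented values: initial CTPI of 10.0 and a 1.05 budget (limit 10.5).
print(cost(power=0.90, ctpi=10.4, ctpi_initial=10.0, budget_k=1.05))
print(cost(power=0.50, ctpi=10.6, ctpi_initial=10.0, budget_k=1.05))
```

The second design uses less power but violates the budget, so it is rejected regardless of its savings.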
III. Experimental Results
This section presents example results of the system-level power dissipation model and tradeoff analysis method. The initial design, $m_0$, is selected using Equation 6 with the issue rate equal to four instructions per cycle (IR = 4). Figure 3 illustrates the evolution of the cost function during a near-optimal search for the espresso benchmark. As may be seen, an immediate attempt is made to reduce the power from that of the initial design, $m_0$. Although the new power is better than the original, the search continues for a more global minimum. The search is initially liberal in its design selections but eventually settles into a low-power region of the design space.
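The search behavior described here (liberal at first, then settling) is characteristic of simulated annealing. The loop below is a generic sketch, not the paper's implementation: the geometric cooling schedule, the acceptance rule, the parameter defaults, and the toy one-dimensional design space are all assumptions chosen for illustration.

```python
import math
import random
from typing import Callable, Optional, Tuple, TypeVar

D = TypeVar("D")


def anneal(
    initial: D,
    neighbor: Callable[[D, random.Random], D],
    cost: Callable[[D], float],
    steps: int = 2000,
    t_start: float = 1.0,
    cooling: float = 0.995,
    rng: Optional[random.Random] = None,
) -> Tuple[D, float]:
    """Return the lowest-cost design seen during the search."""
    rng = rng or random.Random()
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    temperature = t_start
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Infeasible (infinite-cost) candidates are never accepted.
        # Worse candidates are accepted with a probability that shrinks
        # as the temperature falls.
        if math.isfinite(candidate_cost) and (
            delta <= 0 or rng.random() < math.exp(-delta / temperature)
        ):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= cooling
    return best, best_cost


# Toy stand-in for the design space: an integer with a feasibility bound.
def toy_cost(x: int) -> float:
    return float((x - 7) ** 2) if x <= 20 else math.inf


def toy_neighbor(x: int, rng: random.Random) -> int:
    return x + rng.choice([-1, 1])


print(anneal(18, toy_neighbor, toy_cost, rng=random.Random(1)))
```

At high temperature the loop accepts some uphill moves, which corresponds to the liberal early selections. As the temperature decays, only improvements are accepted and the search settles.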
A. Performance of initial designs
Table IV shows the performance of the $m_0$ designs for the 12 benchmarks. Power consumption has been normalized to the tomcatv result. The integer benchmarks achieve lower performance, in general, than the floating-point benchmarks. Execution of integer code also consumes less power, by approximately 40% on average. Floating-point units consume more power than integer units because they contain more transistors per unit. Note also the strong correlation between high IPC (low CTPI) and high power usage: more instructions executing in parallel implies more functional units active.
Table notes:
* No change is seen in the number of transistors from latency 3 to 4, since the placement of the latches results in fewer bits that need to be latched.
† Load is through the data cache, which is excluded from the execution unit. However, slight overhead is required for each load operation to latch the values. Multiple load units are implemented by interleaving the cache.
‡ Value shown is extrapolated from [14].
B. Optimized designs
The optimized designs for each benchmark are presented in this section. Four CTPI budgets are considered: 105%, 110%, 120% and 150% of the initial $\mathrm{CTPI}_{m_0}$. The CTPI, IPC and relative decrease in power consumption values are also presented. The CTPI and power dissipation of the designs are presented graphically in Figures 4 and 5, respectively. Figure 4 shows several interesting trends. The integer benchmark designs do not sacrifice considerable performance except for the 150% budget (recall that lower CTPI is the figure of merit). The 110% designs achieve performance comparable to the initial designs for espresso, gcc, and sc, while achieving reductions in power. A similar result occurs for hydro2d, mdljdp2, and tomcatv. A slightly less impressive result can be seen for the remainder of the optimized designs. This demonstrates that the tradeoff technique is successful in finding lower-power yet high-performance designs.
The 150% designs are clearly different from the other designs. These achieve considerable power consumption savings.
Tables V and VI present the specific optimized designs for CTPI budgets of 105% (Table Va), 110% (Table Vb), 120% (Table VIa), and 150% (Table VIb). The tables also present the IPC, CTPI and the percentage reduction in power consumption over $m_0$ (Table IV) for the optimized designs. The designs are presented in terms of their per-functional-unit $n$ and $\ell$ parameters.
105% and 110% CTPI budget designs
Designs optimized for the 105% and 110% budgets represent applications where power must be reduced, but overall superscalar performance is of prime importance. Such applications would include general-purpose computing and mission-critical embedded systems. The reduction in power consumption for the 105% budget is modest for all benchmark-based designs (2.26% to 7.64%), with the exception of ora (36.4%). The 110% budget presents similar behavior. The most common unit to duplicate for both budgets is the integer ALU, followed by the Load units. Power is reduced primarily through optimized Load pipeline depths. Several designs for the floating-point benchmarks choose to include multiple copies of the floating-point units, in spite of their heavy power burden. This is a result of the high-performance goals of this tradeoff analysis.
[Table V: optimized designs for the (a) 105% and (b) 110% CTPI budgets. Columns: Benchmark, IPC, CTPI, % Reduction, then $n$ and $\ell$ for each functional unit. Table Va, compress: 2.08 7.21 5.43 2 1 1 1 1 1 1 5 1 6 2 3 2 1 1. Table Vb, compress: 2.01 7.]
120% and 150% CTPI budget designs
The 120% and 150% budget designs represent different design goals from the 105% and 110% budget designs. Here the goal is to trade superscalar performance for reduced power consumption. An example application would be a low-power embedded system. The 120% budget designs for the integer benchmarks (Table VIa) do not differ considerably from the 105% and 110% designs. This is not the case for the 150% budget designs (Table VIb), where duplicated IALU units have been eliminated and CTPI has increased for four of the six integer benchmarks. The effect of this change on power consumption is dramatic, with power reductions of 32.8% to 37.8%. The two exceptions are compress and eqntott, where the IALU units are retained and the power reduction is much less impressive. This clearly shows that optimization of the IALU unit is critical for low-power embedded systems that execute primarily integer code.
The floating-point benchmarks force several interesting tradeoff decisions for the 120% and 150% budgets. The most interesting of these is the method chosen for power reduction of the floating-point hardware. The number of floating-point units is reduced relative to the 105% and 110% budget designs, but several benchmarks continue to use duplicated units (e.g., doduc, hydro2d, tomcatv, and wave5). The power is reduced by decreasing the degree of pipelining from six to four stages for the FPMul units and from five to four stages for the FPAdd units. Such reductions are reflected in higher CTPI, but the improvements in power consumption are significant. For example, the 4.68% power reduction of the 110% budget mdljdp2 design (Table Vb) improves to 26.24% for the 150% design. Other floating-point-specific designs achieve lower power by eliminating duplicated IALU units.
When combined, these results show that low-power designs can be achieved by judiciously adjusting processor organization for power reduction.
IV. Conclusion
This study has presented new techniques for high-level tradeoff analysis and system-level modeling of power consumption before circuit implementation. The major contributions of this paper are the separation of architectural and technology components of dynamic power, the use of trace-driven simulation for architectural power measurement, and the use of near-optimal search for organizational tradeoff analysis.
An example cost model was developed to demonstrate the technique and applied to two application areas: high-performance, power-optimized designs (105% and 110% CTPI budgets) and embedded, low-power designs (120% and 150% budgets). Several insights can be drawn from the results. Overall power consumption can be reduced via organizational changes alone. For high-performance designs, the techniques in this paper find significant reductions in power for little performance penalty. This result argues for the use of these techniques before circuit design commences. For embedded, low-power designs, two specific trends emerged. For the floating-point applications, the degree of pipelining is a critical parameter. For several integer-intensive applications, the IALU unit is the most critical for power consumption. Although this is an intuitive result, it is not universally true. Two applications (compress and eqntott) did not eliminate IALU unit duplication, even when performance was allowed to degrade by as much as 50%. This suggests that some applications require higher-power designs.
Two extensions to this work are possible. One is the study of additional benchmarks. In particular, reducing power consumption via organizational adjustment is an application-specific task. The methods presented in this paper can be used to study any application. An additional extension is to consider different example technology power consumption models. Naturally, the optimized design space will vary according to the technology-dependent aspect of the cost function (namely, the functional unit energy models). Also, more accurate functional unit energy models (with respect to input transition properties) may lead to a shift in the optimized design space. Both extensions are readily achieved with only minor changes to the overall framework.
