Power consumption is a major concern in embedded microprocessor design. Reducing power has also become a critical design goal for general-purpose microprocessors. Since they require high performance as well as low power, power reduction at the cost of performance cannot be accepted. There are many device-level techniques that reduce power while maintaining performance. They select non-critical paths as candidates for low-power design, and performance-oriented design is used only in speed-critical paths. The same philosophy can be applied to architectural-level design. We evaluate a technique that exploits dynamic information regarding instruction criticality in order to reduce power. We also evaluate an instruction steering policy for a clustered microarchitecture, which is based on instruction criticality, and find that it is substantially energy-efficient, although it suffers performance degradation.
Introduction
Power consumption is becoming one of the most important constraints for microprocessor design in nanometer-scale technologies. Device engineers, circuit designers, and system architects are faced with many challenges. In the area of embedded microprocessors, power has long been a major design constraint. However, it is also a limiting factor in general-purpose microprocessors, because they are used in mobile and embedded computer platforms. For example, microprocessors that support out-of-order execution are designed for modern embedded systems [8]. In order to manage the impact of increasing microprocessor power consumption, architectural-level techniques are required as well as device-level design improvements.
Design tradeoffs can be achieved using many device-level techniques such as transistor size optimization [5], multiple supply voltages [13], and multiple threshold voltages [14]. Power reduction without performance loss is achieved by selecting non-critical paths as candidates for low-power design. In other words, performance-oriented design is used only in speed-critical paths. The same philosophy can be applied to architectural-level design. Seng et al. proposed to exploit dynamic information regarding instruction criticality in order to reduce power [10]. We have also studied this technique independently in parallel [3]. To further reduce power consumption, we propose to adapt it to the clustered architecture [9] by splitting the instruction queue into a fast queue and a slow one. In this clustered datapath, the fast cluster consists of the fast instruction queue and fast functional units. Similarly, the slow cluster consists of the slow queue and slow units. In other words, each cluster belongs to a different power-supply (and clock) domain.
The remainder of this paper is organized as follows. Section 2 discusses related studies. Section 3 explains how energy consumption is reduced by critical path prediction. Section 4 describes our evaluation methodology. Section 5 discusses our simulation results. Finally, Sect. 6 provides our conclusions.
Previous Work
The critical path is a chain of dependent instructions that determines the number of cycles required to execute the program. Thus, the performance of the processor is limited by the speed at which it executes the instructions along the critical path. If we can identify which instructions are critical, we can accelerate their execution by any available means. Kobayashi et al. [7] used instruction criticality for an instruction steering policy and proposed the path info table, which keeps a dynamically generated data flow graph. They calculate the length of each critical path based solely on the data dependence graph.
Critical path prediction [3] , [12] is another technique for identifying critical instructions dynamically. Exploiting information regarding instruction criticality is effective not only for improving processor performance but also for reducing power consumption [3] , [10] . Seng et al. [10] utilized a dynamic mechanism. They proposed to use the critical path predictor to identify non-critical instructions. The critical instructions are executed on fast and power-hungry functional units while the non-critical ones are executed on slow and power-efficient units. We extended Tune et al.'s study [12] and proposed several correlation-based critical path predictors [3] .
Energy-Efficient Clustered Datapath
We exploit dynamic information regarding instruction criticality in order to reduce power. This is based on critical path prediction [3] , [12] . Figure 1 is a data flow graph, where each node represents an instruction and each arc represents dependence between instructions. When we assume that every execution latency is 1 cycle, it is easily observed that the instruction path that determines the execution cycle of these instruction flows consists of instructions {I0, I2, I5, I6, I8, I9}. This is the critical path.
In contrast, instructions I1, I3, I4, and I7, which we call non-critical instructions, do not determine the execution cycle, and thus their latency can be increased without affecting the execution cycle. This is the key observation of this technique. If we can find non-critical instructions dynamically, they can be executed on slow functional units whose supply voltage is low. In summary, only critical instructions are dispatched into fast and power-hungry functional units, while non-critical ones are dispatched into slow and power-efficient units. Since the number of execution cycles is not expected to increase, energy consumption will be reduced. In the remainder of this paper, we assume the slow unit is two times slower than the fast unit. Note that the slow unit can be implemented either with or without pipelining.
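The observation above can be illustrated with a minimal sketch. The arcs of Fig. 1 are not reproduced in the text, so the edge list below is a hypothetical graph chosen only so that its critical path matches {I0, I2, I5, I6, I8, I9}; the function names are likewise illustrative. The sketch computes the earliest finish cycle of each instruction and shows that, for this particular graph, doubling the latency of I1, I3, I4, and I7 leaves the total cycle count unchanged.

```python
from collections import defaultdict

# Hypothetical arcs (producer -> consumer); NOT the actual arcs of Fig. 1.
EDGES = [
    ("I0", "I1"), ("I1", "I4"), ("I4", "I9"),
    ("I0", "I2"), ("I2", "I3"), ("I3", "I9"),
    ("I2", "I5"), ("I5", "I6"), ("I5", "I7"), ("I7", "I9"),
    ("I6", "I8"), ("I8", "I9"),
]
NODES = [f"I{i}" for i in range(10)]  # already a topological order for EDGES


def predecessors():
    """Map each instruction to the list of instructions it depends on."""
    preds = defaultdict(list)
    for src, dst in EDGES:
        preds[dst].append(src)
    return preds


def finish_times(latency):
    """Earliest finish cycle of every instruction for the given latencies."""
    preds = predecessors()
    finish = {}
    for node in NODES:
        start = max((finish[p] for p in preds[node]), default=0)
        finish[node] = start + latency[node]
    return finish


def critical_path(latency):
    """Walk back from the last-finishing instruction via its latest producer."""
    preds = predecessors()
    finish = finish_times(latency)
    node = max(NODES, key=lambda n: finish[n])
    path = [node]
    while preds[node]:
        node = max(preds[node], key=lambda n: finish[n])
        path.append(node)
    return path[::-1]


unit = {n: 1 for n in NODES}
path = critical_path(unit)
# Double the latency of every instruction that is off the critical path.
slowed = {n: (1 if n in path else 2) for n in NODES}
total_unit = max(finish_times(unit).values())
total_slowed = max(finish_times(slowed).values())
print(path, total_unit, total_slowed)
```

In a general graph, a non-critical instruction can be slowed only as far as its slack allows; the example graph was constructed so that a latency of two cycles fits within that slack.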
In order to find non-critical instructions dynamically, we use a critical path prediction (CPP) buffer [12]. The CPP buffer is a table that resembles a cache, as shown in Fig. 2. It is indexed by the program counter, and each entry holds a saturating up-down counter. When an instruction is identified as critical, its corresponding counter is incremented. Otherwise, it is decremented. When the counter value exceeds a threshold value, the instruction is predicted as critical. The counter length and the threshold value are implementation dependent. When the instruction is predicted as critical, it is dispatched into a fast functional unit. Otherwise, it is executed on a slow unit.
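The behavior of the CPP buffer can be sketched as follows. This is a minimal model under stated assumptions, not the simulator's implementation: the class name, the index function (dropping the two low-order bits of the program counter, on the assumption of 4-byte instructions), and the default threshold are hypothetical. The 3-bit counter width and the direct-mapped organization follow the configuration described in the methodology section.

```python
class CPPBuffer:
    """Direct-mapped table of saturating up-down counters, indexed by PC."""

    def __init__(self, entries=4096, counter_bits=3, threshold=3):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1  # 7 for 3-bit counters
        self.threshold = threshold                # assumed value
        self.counters = [0] * entries

    def _index(self, pc):
        # Assumed index function: drop the byte offset of a 4-byte
        # instruction, then wrap around the table size (no tags).
        return (pc >> 2) % self.entries

    def train(self, pc, was_critical):
        """Increment on a critical observation, decrement otherwise."""
        i = self._index(pc)
        if was_critical:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)

    def predict_critical(self, pc):
        """Predict critical when the counter exceeds the threshold."""
        return self.counters[self._index(pc)] > self.threshold


buf = CPPBuffer(entries=16)
pc = 0x1000
for _ in range(4):
    buf.train(pc, was_critical=True)
print(buf.predict_critical(pc))
```

Because the table has no tags in this sketch, two instructions whose addresses map to the same entry share a counter, which is one reason the number of entries can affect prediction quality.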
In order to further reduce power consumption, we split the instruction queue into a fast queue and a slow one, as shown in Fig. 3. In this clustered datapath, the fast cluster consists of the fast instruction queue and fast functional units. Similarly, the slow cluster consists of the slow queue and slow units. When the slow queue is pipelined, instructions are dispatched every fast clock cycle, although back-to-back execution of dependent instructions is impossible. Otherwise, instructions are dispatched every slow clock cycle. Each cluster belongs to a different power-supply (and clock) domain. In order to connect different power-supply domains, level converters are generally used. In the case where a synchronizing mechanism is required between different clock domains because the slow cluster is not pipelined, we assume that the two clusters are connected by small FIFOs. The combination of the FIFOs and level converters or power isolation cells makes it possible to synchronize the clusters [20]. An SoC with such multiple voltage and asynchronous clock domains has already been reported [15]. Since the instruction queue is one of the most power-hungry blocks [4], reducing supply voltage and clock frequency in the slow queue is very effective. Splitting the instruction queue has the additional advantage of reducing design complexity [9].
In this clustered microarchitecture, we should consider an additional delay when inter-cluster bypassing occurs. In this study, we assume that the bypass delay is 1 cycle, regardless of the existence of FIFOs between clusters. The bypass delay of 1 cycle follows the actual implementation of the Alpha 21264 [19]. In addition to the bypass delay, there is another delay when the instruction queue is not pipelined. In this case, the inter-cluster delay consists of the bypass delay and the synchronizing delay, as shown in Fig. 4. The upper part of Fig. 4 explains bypassing from the fast cluster to the slow one, and the lower part explains the reverse. In the case where the fast cluster finishes an execution at the beginning of the slow clock period, the synchronizing delay is required when the result is used in the slow cluster. Due to the additional delay, the clustered microarchitecture might suffer performance loss.
Methodology
We implemented a timing simulator using the SimpleScalar/Alpha tool set (ver.3.0a) [1] . The baseline model is an out-of-order execution superscalar processor and its configuration is summarized in Table 1 .
Every fast functional unit can execute most integer operations in one cycle, as shown in Table 1, and every slow functional unit executes operations in two cycles. In the rest of this paper, the term functional units refers to integer units. Both fast and slow units share their circuit design, while each transistor's size and threshold voltage might be optimized independently. Instructions can be dispatched into both fast and slow functional units. That is, if there are no free fast (slow) units, a critical (non-critical) instruction can be dispatched into a slow (fast) unit regardless of its predicted criticality. Note that this flexibility is lost when we split the instruction queue into the fast and slow queues, and that issuing instructions into the queues stalls when either of the queues is full. We will evaluate pipelined and non-pipelined slow units. The former can sustain the same throughput as the fast units. The latter halves the throughput. We assume that the increase in power due to extra latches in the pipelined units is a factor of 1.15 [2]. Note that, in the split instruction queue model, each instruction dispatched into a functional unit does not release its corresponding entry in the queue but remains there until it is committed.
We will evaluate two cases of scaling for supply voltage and clock frequency; one is based on the Transmeta TM5800 [11], and the other is based on the Intel XScale [6]. These are summarized in Table 2. The higher frequency and voltage are applied to the fast functional units and the fast instruction queue, and the lower ones are applied to the slow units and the slow queue. Note that clock frequency and supply voltage are not dynamically adaptable but are predetermined.
The energy consumed in a functional unit is calculated as the product of its active power and execution time. Because power is proportional to clock frequency and to the square of supply voltage, we can estimate the energy from the active rate of each unit and the execution cycles, using the supply voltage and frequency scaling shown in Table 2.
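The energy estimate described above can be sketched as follows, as an illustration rather than the actual evaluation code. The function names, the active rates, and the cycle counts are hypothetical. The voltages 1.8 V and 1.15 V are the XScale pair quoted later in the text; the halved slow clock is an assumption consistent with a slow unit being two times slower, and does not reproduce the entries of Table 2.

```python
def unit_power(voltage, freq, latch_overhead=1.0):
    """Active power in arbitrary units: proportional to f * V^2."""
    return latch_overhead * freq * voltage ** 2


def units_energy(units, fast_cycles, fast_freq):
    """Energy of a set of functional units.

    units: list of (active_power, active_rate) pairs, where active_rate
    is the fraction of the run during which the unit is busy.
    Energy = sum(active power * active rate) * execution time.
    """
    exec_time = fast_cycles / fast_freq
    return sum(power * rate for power, rate in units) * exec_time


# Assumed operating points (normalized frequency); not the Table 2 values.
V_FAST, F_FAST = 1.8, 1.0
V_SLOW, F_SLOW = 1.15, 0.5
LATCH_OVERHEAD = 1.15  # power factor for pipelined units, from the text

fast = unit_power(V_FAST, F_FAST)
slow_pipelined = unit_power(V_SLOW, F_SLOW, LATCH_OVERHEAD)

# Hypothetical active rates and cycle counts, for illustration only.
baseline = units_energy([(fast, 0.3)] * 6, fast_cycles=1000, fast_freq=F_FAST)
proposed = units_energy([(fast, 0.3)] * 3 + [(slow_pipelined, 0.3)] * 3,
                        fast_cycles=1050, fast_freq=F_FAST)
print(f"relative energy: {proposed / baseline:.3f}")
```

The sketch makes the tradeoff visible: a slow unit draws far less power, but any increase in execution cycles lengthens the time over which every unit, fast or slow, consumes energy.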
We choose to use the critical path predictor proposed in [12] due to its simplicity. Since we are interested in low-power architectures in this study, the complex mechanism [7] cannot be accepted. The predictor has a direct-mapped table of 3-bit saturating up-down counters. When an instruction is identified as critical according to the QOLD heuristic [12], its associated counter is incremented by 1. Otherwise, it is decremented by 1. The QOLD heuristic marks the oldest instruction in the instruction queue as critical. The oldest [10].
The SPEC2000 CINT benchmark suite is used for this study. Table 3 lists the benchmarks and the input sets. We use the object files provided by the University of Michigan. They were compiled by DEC C V5.9-008 on Digital UNIX V4.0 (Rev.1229). For each program except for 252.eon, 1 billion instructions are skipped before the actual simulation begins. Each program is executed to completion or for 100 million instructions. We do not count nop instructions.
Results
In this section, simulation results are presented. First, criticality-based instruction scheduling is investigated. Then, the clustered datapath is evaluated.
Energy Reduction via Criticality-Based Scheduling
Figure 5 shows processor performance when a 64K-entry CPP buffer is used. We use committed instructions per cycle (IPC) as a metric for evaluating performance. We evaluate five variations, each of which has a different combination of a total of 6 functional units. For each group of five bars, the first one is for the model that has 6 slow units and the last one is for the model that has 6 fast units. The figure shows the configurations of the remaining bars. Note that the baseline model is 6fast/0slow. As explained above, we use two types of slow units. One is the pipelined unit and the other is the non-pipelined one. Figure 5 (i) shows the results when slow units are pipelined, and Fig. 5 (ii) shows the results when slow units are not pipelined. First, it is easily observed that processor performance is significantly diminished when all functional units are replaced by slow units. However, except in the case of 256.bzip2, the configuration of 2fast/4slow has performance comparable to that of the baseline configuration. From these observations, we confirm that critical path prediction mitigates the performance loss due to lower supply voltage. Now, we look at the results for the model using the non-pipelined units. We can see no significant differences between Figs. 5(i) and (ii), and thus we might expect that the non-pipelined units are more desirable than the pipelined units because of their lower power consumption. However, this expectation is refuted below.
Fig. 5 Instructions per cycle (64K). (i) Pipelined. (ii) Non-pipelined.
Fig. 6 %Increase in execution cycles (64K). (i) Pipelined.
Because we are interested in energy efficiency, execution cycle count is a more useful metric than IPC. Figure 6 shows the percent increase in execution cycles when the 64K-entry CPP buffer is used. For example, in the case of 164.gzip, the processor model 1fast/5slow is 1.5 times slower than the baseline model. Figure 6 gives us a different impression from Fig. 5, in which the difference between IPCs for two configurations disappears due to the low resolution of the figure. To make this observation clear, we show the IPCs of the configuration 0fast/6slow in Table 4 with those of the baseline (6fast/0slow). We can see a considerable difference between each pair of IPCs. From Fig. 6, it is observed that 3 fast units are required in order to keep processor performance. When this configuration is used, the increase in execution cycles is less than approximately 10% on average. Hence, we will use 3 fast functional units in the rest of this paper, except when specified otherwise. Another interesting observation is that there are considerable differences between the results for the pipelined and non-pipelined models. The latter model suffers a serious impact on execution cycles. However, when 3 fast units are used, the difference is insignificant, as shown in Fig. 7. In summary, we have found that the configuration of 3fast/3slow is a good tradeoff point when we consider the increase in execution cycles.
We now consider the influence of the number of CPP buffer entries on execution cycles. Figure 8 shows the results when the number of CPP buffer entries is reduced from 64K to 32K, 16K, 8K, and 4K. The functional unit configuration is 3fast/3slow. As can be easily observed, the influence is insignificant in most cases. Thus, we will use a 4K-entry CPP buffer in the rest of this paper, except when specified otherwise.
Next, we study how frequently the slow units are used. Figure 9 presents the percent distribution of dispatched functional units. We show three processor configurations, which are 1fast/5slow, 2fast/4slow, and 3fast/3slow. Each bar is divided into four parts. The bottom part indicates the percentage of instructions that are predicted to be non-critical and are dispatched into slow units (NS). The next indicates that of instructions that are predicted to be critical and are dispatched into slow units (CS). The next part is for the instructions that are predicted to be non-critical and are dispatched into fast units (NF). The top indicates the percentage of instructions that are predicted to be critical and are dispatched into fast units (CF). As the sum of NS and CS becomes larger, power consumption is reduced. However, execution cycles increase with CS, resulting in a possible increase in energy consumption. On the other hand, a large NF diminishes power efficiency. In summary, a large NS and small CS and NF are desirable.
Fig. 9 %Distribution of dispatched units (4K).
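The four categories can be tallied with a short sketch. It assumes the centralized instruction queue model, in which an instruction may fall back to the other unit type when no unit of the preferred type is free; the function names and the sample input are hypothetical and are not taken from the simulator.

```python
from collections import Counter


def steer(predicted_critical, free_fast, free_slow):
    """Return 'F' or 'S' for the chosen unit type, or None if all are busy."""
    preferred, fallback = ("F", "S") if predicted_critical else ("S", "F")
    free = {"F": free_fast, "S": free_slow}
    if free[preferred] > 0:
        return preferred
    if free[fallback] > 0:
        return fallback  # dispatched against the predicted criticality
    return None


def distribution(dispatches):
    """Percentage of NS, CS, NF, and CF over (predicted_critical, unit) pairs.

    Expects a non-empty sequence of dispatches.
    """
    counts = Counter(("C" if critical else "N") + unit
                     for critical, unit in dispatches)
    total = sum(counts.values())
    return {k: 100.0 * counts[k] / total for k in ("NS", "CS", "NF", "CF")}


# Hypothetical dispatch trace: two NS, one CS, one CF.
trace = [(False, "S"), (False, "S"), (True, "S"), (True, "F")]
print(distribution(trace))
```

CS and NF arise only through the fallback branch of `steer`, which is why both grow when the unit mix does not match the mix of predicted criticality.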
First, it is easily observed that, as the number of slow units increases, CS increases, resulting in performance degradation. This confirms the results shown in Figs. 5 and 6. Second, when we use the non-pipelined slow units, NF becomes large. Because the non-pipelined units have low throughput, more units are required for non-critical instructions. Last, in the case of the 3fast/3slow configuration, approximately 70% and 60% of instructions are dispatched into slow functional units on average in the cases of the pipelined and non-pipelined unit models, respectively. Figure 10 shows the influence of the number of CPP buffer entries on the distribution, indicating that it is less significant than that of the processor configuration.
Next, we compare the conventional superscalar processor with the proposed ones. Figure 11 presents the energy-delay product (EDP) reduction ratio over the conventional processor. The EDP is a well-known metric for evaluating the tradeoff between energy and performance. We focus on functional units. One of the interesting observations is that the energy efficiency of the pipelined units is better than that of the non-pipelined units, while the power reduction of the pipelined units is smaller than that of the non-pipelined units. This is because the increase in execution cycles is insignificant due to the pipelined units' sustained throughput. When we use the pipelined units, the average EDP reduction is 16.7% and 27.2% for the TM5800 and XScale models, respectively. Because integer functional units consume 10% of the total power [16], the proposed technique reduces roughly 3% of the total power. On the other hand, a 64 KB instruction cache and a 64 KB data cache together consume 15% of the total power [16]. Using the CACTI simulator [21], we estimated the power consumed in the 64K-entry CPP buffer, and it is 13.6% of the power consumed by the 64 KB data cache [3]. Hence, the 64K-entry CPP buffer consumes roughly 1% of the total power. Since, as explained above, we can replace the 64K-entry CPP buffer with the 4K-entry one without severe performance loss, the power consumed by the CPP buffer is negligible. Based on these discussions, we can conclude that the proposed technique substantially improves energy efficiency.
Figure 12 shows the relationship between the supply voltage ratio and the average EDP reduction. The supply voltage ratio is defined as the supply voltage for slow units over that for fast units. Thus, the smaller the ratio is, the larger the expected energy reduction is. Note that every slow unit works at the same supply voltage. For example, the ratio is $(1.15 / 1.8) \times 100 = 63.9\%$ for XScale.
As can be easily observed, the pipelined units are better in energy reduction when the ratio is less than approximately 97% while the ratio of more than 80% is required to really reduce EDP. Therefore, it is confirmed that the pipelined units are better both in performance and in energy reduction than the non-pipelined units.
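The two quantities used in this comparison can be written down as a short sketch. The function names and the sample energy and delay values are hypothetical; only the XScale voltages 1.15 V and 1.8 V come from the text.

```python
def edp(energy, delay):
    """Energy-delay product."""
    return energy * delay


def edp_reduction(base_energy, base_delay, new_energy, new_delay):
    """Fractional EDP reduction relative to the baseline (negative if worse)."""
    return 1.0 - edp(new_energy, new_delay) / edp(base_energy, base_delay)


def supply_voltage_ratio(v_slow, v_fast):
    """Supply voltage of the slow units over that of the fast units, in %."""
    return v_slow / v_fast * 100.0


# XScale voltages quoted in the text.
print(f"{supply_voltage_ratio(1.15, 1.8):.1f}%")

# Hypothetical case: 30% less energy at the cost of 5% more execution time.
print(f"{edp_reduction(1.0, 1.0, 0.70, 1.05):.3f}")
```

Because the delay multiplies the energy, a configuration that saves power can still lose on EDP once its execution time grows enough, which is the effect that separates the pipelined and non-pipelined units.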
Energy Reduction via Clustered Datapath
The power reduction in the instruction queue, which is explained in Sect. 3, is effective because the queue is one of the most power-hungry blocks [4]. This section evaluates two configurations for the queue. One consists of a 32-entry fast queue and a 32-entry slow queue (denoted as 32fastQ/32slowQ). The other consists of a 16-entry fast queue and a 48-entry slow queue (denoted as 16fastQ/48slowQ). The first configuration is chosen because both clusters have three functional units each. The second configuration is chosen because we have found that over 70% of instructions are dispatched into slow units. Under the assumption of the voltage/frequency scaling in XScale, 30.4% and 36.3% power reduction in the queue can be attained for the 16fastQ/48slowQ configuration with the pipelined and non-pipelined unit models, respectively.
Figure 13 shows the percent increase in execution cycles when the splitting of the instruction queue is applied to the processor configuration of 3fast/3slow with the 4K-entry CPP buffer. Significant performance loss is observed. The major reason for the increase in execution cycles is the decrease in instruction throughput due to the slow instruction queue. Compared with the centralized instruction queue used in the previous subsection, it can attain at most half the instruction dispatch rate. The other reason is the extra delay due to inter-cluster bypassing. Table 5 presents how frequently inter-cluster bypassing occurs. Approximately 20% of bypassing is between the two clusters. This is especially serious in the non-pipelined model. Another reason is the loss of flexibility in instruction issuing and dispatching. However, 11.7% energy reduction in the instruction queue is achieved when we use the 16fastQ/48slowQ configuration with the voltage/frequency scaling in XScale. Because the instruction issue queue consumes 18% of the total power [16], roughly 2% reduction of the total energy is achieved. Figure 14 shows the distribution of dispatched units.
It is observed that approximately 80% of instructions are dispatched into slow units, and thus are issued into the slow instruction queue. Therefore, it is confirmed that the 16fastQ/48slowQ configuration is a good tradeoff point. Figure 15 shows how EDP changes from the baseline model in functional units when the 16fastQ/48slowQ configuration is used with the voltage/frequency scaling in XScale. Unfortunately, EDP is not improved. It is increased by 13.3% on average when the pipelined units are used. Since integer units consume 10% of the total power [16], the total energy increases by roughly 1.3%. Hence, when this is combined with the energy reduction in the instruction issue queue, a reduction of 0.7% of the total energy is achieved. The decrease in energy efficiency is due to the significant increase in execution cycles. As explained above, its main reason is frequent inter-cluster bypassing. In order to mitigate the performance loss due to inter-cluster bypassing, the instruction steering policy has to be improved. Steering policies are an active research topic [17], [18]. This is an interesting direction for future study.
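The whole-processor accounting above is simple arithmetic, reproduced here as a sketch. The function name is hypothetical; the shares (18% and 10%) and the block-level changes (11.7% and 13.3%) are the figures quoted in the text. The text rounds the two contributions to 2% and 1.3% before subtracting, which gives 0.7%; the unrounded values computed below give a slightly larger net figure.

```python
def share_of_total(block_share, block_change_pct):
    """Change of the total, in percent, when a block that accounts for
    block_share of the total changes by block_change_pct percent."""
    return block_share * block_change_pct


# Instruction queue: 18% of the total, 11.7% reduction within the block.
queue_saving = share_of_total(0.18, 11.7)
# Integer units: 10% of the total, 13.3% increase within the block.
unit_increase = share_of_total(0.10, 13.3)
net_saving = queue_saving - unit_increase

print(f"queue: -{queue_saving:.2f}%  units: +{unit_increase:.2f}%  "
      f"net: -{net_saving:.2f}%")
```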
Conclusions
This paper investigated a clustered superscalar processor for improving energy efficiency in microprocessors. We evaluated utilizing critical path prediction for this purpose, and found that it is effective for power reduction. In addition, we evaluated the effect of sustaining throughput on power and performance. We found that the pipelined functional units are better in energy reduction as well as in performance than the non-pipelined units even if the increase in hardware due to extra latches is considered.
