Exploiting compiler-generated schedules for energy savings in high-performance processors by Madhavi Valluri et al.
Exploiting Compiler-Generated Schedules for Energy
Savings in High-Performance Processors
Madhavi Valluri
Laboratory for Computer
Architecture
The University of Texas at
Austin
Austin, TX 78712
valluri@ece.utexas.edu
Lizy John
Laboratory for Computer
Architecture
The University of Texas at
Austin
Austin, TX 78712
ljohn@ece.utexas.edu
Heather Hanson
Computer Architecture and
Technology Laboratory
The University of Texas at
Austin
Austin, TX 78712
hhanson@ece.utexas.edu
ABSTRACT
This paper develops a technique that uniquely combines
the advantages of static scheduling and dynamic schedul-
ing to reduce the energy consumed in modern superscalar
processors with out-of-order issue logic. In this Hybrid-
Scheduling paradigm, regions of the application containing
large amounts of parallelism visible at compile-time com-
pletely bypass the dynamic scheduling logic and execute
in a low power static mode. Simulation studies using the
Wattch framework on several media and scientific benchmarks
demonstrate large improvements in overall energy
consumption of 43% in kernels and 25% in full applications
with only a 2.8% performance degradation on average.
Categories and Subject Descriptors
C.1 [Processor Architectures]: RISC/CISC, VLIW Ar-
chitectures
General Terms
Performance, Design
Keywords
Low Energy, Instruction-Level Parallelism, Dynamic Issue
Processors, Very Long Instruction Word Architectures
1. INTRODUCTION
A large portion of the energy consumed in modern superscalar
processors such as the DEC Alpha 21264, Pentium Pro,
Pentium 4, HAL SPARC64, HP PA-8000, etc., can be
attributed to the dynamic scheduling hardware (or out-of-order
issue logic) responsible for identifying multiple instructions
to issue in parallel. Comprising highly associative
and multi-ported queues such as the instruction window and
reorder buffer, in addition to the complex logic circuitry required
for the wake-up and select of instructions, the dynamic issue
hardware consumes a significant amount of the microprocessor's
energy budget. The energy consumption of the
dynamic issue hardware accounts for nearly 30% of the overall
energy in existing processors [2][4][7], and is projected to
grow even further with increasing issue widths and window
sizes [20]. A recent power distribution estimate for the
Alpha 21464 microprocessor shows that the issue logic is
responsible for nearly 46% of the total power dissipation
on the chip [18].
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ISLPED August 25–27, 2003, Seoul, Korea
Copyright 2003 ACM 1-58113-682-X/03/0008 ...$5.00.
In processors with dynamic scheduling logic, the hardware
searches for parallel instructions, irrespective of whether the
compiler-generated schedule is perfect or not. The complex,
power-hungry dynamic scheduling hardware is perfectly
justifiable in many applications with irregular control
structures, where it is difficult for the compiler to derive
compact schedules due to unpredictable branches and small
basic blocks. However, for regular and well-structured programs
such as media and scientific applications, the compiler
is easily able to generate efficient schedules for considerable
portions of the code. The dependence analysis and scheduling
performed by the hardware is completely redundant in
these regions. Therefore, for these regular regions, a large
amount of energy is being expended repeatedly in the dynamic
issue hardware for unnecessary work.
We present a novel technique that combines the advantages
of both compile-time static scheduling and run-time
dynamic scheduling to lower the energy consumption in a
processor. In this Hybrid-Scheduling paradigm, regions of
code containing large amounts of parallelism that can be
identified and exploited at compile-time bypass the out-of-order
issue logic and are issued and executed exactly in the
order prescribed by the compiler. The processor runs in the
superscalar mode with dynamic scheduling until a special
instruction that indicates the beginning of a statically scheduled
region (S-Region) is detected. At that point, the out-of-order
issue engine is shut down, and the processor switches to a
VLIW-like static mode in which instruction packets scheduled
by the compiler are issued sequentially in consecutive
cycles with minimal hardware support. Energy is conserved
primarily by reducing the work done in the out-of-order issue
logic.
The hybrid-scheduling scheme is particularly suited for
high-performance general-purpose systems which need to
cater to diverse application domains such as integer, media
and scientific applications. It is important to tune the
system to the varying needs of the diverse applications. The
hybrid-scheduling architecture thus allows us to use aggressive
and power-hungry scheduling hardware for applications
that warrant it, while facilitating low energy execution for
structured applications that do not need such hardware.
Figure 1: The Hybrid-Scheduling Microarchitecture
(blocks shown: Program, Decode, Issue Window, Reorder Buffer,
S-Buffer, Exception Handling Unit, and the ALU, MUL, DIV and
FP function units)
We evaluate the effectiveness of this scheme for several
media and scientific benchmarks using the Wattch 1.0 [2]
simulator. Our results reveal that we can achieve large
improvements in overall energy consumption amounting to
43% in kernels and 25% in full applications with minimal
performance degradation.
The rest of the paper is organized as follows: The details of
the microarchitecture supporting the hybrid-scheduling scheme
are described in Section 2. The compiler support required
for this scheme is described in Section 3. Our experimental
framework and the benchmarks used are explained in Section 4.
Section 5 presents our simulation results. Related work is
discussed in Section 6 and, finally, we present concluding
remarks in Section 7.
2. THE HYBRID-SCHEDULING MICROARCHITECTURE
The hybrid-scheduling microarchitecture is shown in Figure 1.
In this architecture, statically scheduled regions or
S-Regions execute in a low power static mode. S-Region
instructions are scheduled by the compiler into groups or
packets of independent instructions that can be issued in
parallel. The compiler also annotates regions with special
"start-static" and "end-static" instructions to indicate the
beginning and the end of S-Regions. Once an S-Region is
detected, instructions in the region can be issued to the
function units without any dynamic dependence checks.
Initial Program Execution: In this scheme, the program begins
execution in the normal superscalar fashion, i.e. decoded
instructions are dispatched to the instruction window
where they wait for their operands, ready instructions are
issued to the function units and finally instructions retire
in-order from the reorder buffer. Execution continues in the
superscalar mode until "start-static", an instruction indicating
the beginning of an S-Region, is detected.
The S-Buffer: All instructions following the "start-static"
instruction, i.e. the S-Region instructions, are decoded and stored
in the S-Buffer (shown in Figure 1). The instructions are
placed into the buffer before register renaming is performed.
The S-Buffer, similar to a decoded instruction cache [8][12],
is a circular buffer that stores instructions in decoded
form. Instructions fetched from this buffer need not
repeat the full decoding step; only the register read is
performed. A detailed structure is shown in Figure 2. Each
S-Buffer line holds one instruction packet. The packet size
is fixed and can be determined at design time. The maximum
packet size is the number of function units available
in the processor. Each S-Buffer line also contains a PC field,
where the program counter value of the first instruction in
the group is stored, and two special bits (S-bit and B-bit),
which are set if the line holds the first instruction of an
S-Region or if it holds a branch instruction, respectively. The
buffer also maintains a pointer to the next available entry
for filling, and consecutive instruction packets are placed in
consecutive lines of the S-Buffer. Instructions are placed until
the "end-static" instruction is detected, indicating that all
the instructions belonging to the S-Region have been captured
in the buffer.
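As an illustration only, the fill path described above can be modelled in a few lines of Python. The names SBufferLine, SBuffer and fill_region are hypothetical and do not come from the paper; the sketch assumes four-instruction packets and a 128-line buffer, as in the evaluated configuration, and it detects branches by a "br" mnemonic prefix purely for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PACKET_SIZE = 4   # instructions per line (assumed: one per issue slot)
NUM_LINES = 128   # S-Buffer lines (value used in the evaluation)


@dataclass
class SBufferLine:
    pc: Optional[int] = None      # PC of the first instruction in the packet
    s_bit: bool = False           # line holds the first packet of an S-Region
    b_bit: bool = False           # line holds a branch instruction
    instrs: List[str] = field(default_factory=list)


class SBuffer:
    """Circular buffer of decoded instruction packets (illustrative model)."""

    def __init__(self, num_lines: int = NUM_LINES):
        self.lines = [SBufferLine() for _ in range(num_lines)]
        self.fill_ptr = 0         # next available line for filling

    def fill_region(self, packets: List[Tuple[int, List[str]]]) -> int:
        """Store consecutive packets in consecutive lines.

        `packets` holds (pc, decoded instructions) pairs for everything
        between "start-static" and "end-static". Returns the number of the
        line that received the first packet.
        """
        first = self.fill_ptr
        for i, (pc, instrs) in enumerate(packets):
            if len(instrs) > PACKET_SIZE:
                raise ValueError("packet exceeds the fixed packet size")
            # Assumption of this sketch: branch mnemonics start with "br".
            has_branch = any(op.startswith("br") for op in instrs)
            self.lines[self.fill_ptr] = SBufferLine(pc, i == 0, has_branch, list(instrs))
            self.fill_ptr = (self.fill_ptr + 1) % len(self.lines)
        return first
```

The wrap-around of fill_ptr is what makes the buffer circular; a later region can therefore overwrite the lines of an earlier one, which is why the invalidation mechanism based on the S-bit is needed.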
Figure 2: Structures used in the static mode
((a) S-Buffer, (b) PC Table, (c) Branch Table)
Switching to Static Mode: After the S-Region has been captured
within the S-Buffer, fetch from the instruction cache
is stalled and the processor prepares to switch to the static
mode of issue. S-Region instructions are scheduled assuming
that all the function units are available for use in the
static mode and that all the register values are available
in the register file, implying that the S-Region instructions
can begin execution only after the last instruction in the
superscalar mode has completed. Therefore, static mode issue
can begin immediately provided the superscalar pipeline
has drained while the S-Region was being captured. If not,
we wait a few additional cycles for the superscalar pipeline
to drain completely. Note that since we have to wait until
the superscalar pipeline is empty before we can start issue
in the static mode, there is a cycle-time overhead incurred
in switching between the two modes. However, if we choose
S-Regions such that they execute for long durations in the
static mode, the switching cost is amortized, leading to
negligible performance degradation. After the last instruction
in the superscalar mode has completed, the dynamic issue
logic is turned off and instructions begin issuing from the
S-Buffer.
Instruction Execution in Static Mode: Instructions from the
S-Buffer are issued to the functional units without any further
dependence analysis. One complete S-Buffer line is issued
every cycle. Memory misses, if any, are handled by hardware
interlocks and cause the static mode issue to stall. All
instructions issued in the static mode completely bypass
the issue logic, leading to enormous energy savings in the
processor. Instructions issue in the static mode until the
last instruction of the S-Region is executed (detected by
the "end-static" instruction). The processor then exits the
static mode, starts fetching from the instruction cache and
returns to the superscalar mode of execution.
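The timing of a switch and of static-mode issue can be approximated with simple cycle arithmetic. The following is a minimal sketch under the assumptions stated in the comments; the function names are hypothetical and the model ignores everything except pipeline drain, one line issued per cycle, and miss stalls.

```python
def switch_wait_cycles(drain_cycles: int, capture_cycles: int) -> int:
    """Extra cycles spent waiting for the superscalar pipeline to drain.

    The drain overlaps with capturing the S-Region into the S-Buffer, so only
    the part of the drain that outlasts the capture is exposed.
    """
    return max(0, drain_cycles - capture_cycles)


def static_mode_cycles(lines_issued: int, miss_stall_cycles: int = 0) -> int:
    """One S-Buffer line issues per cycle; cache misses stall issue."""
    return lines_issued + miss_stall_cycles


def region_cycles(drain_cycles: int, capture_cycles: int,
                  lines_issued: int, miss_stall_cycles: int = 0) -> int:
    """Cycles from the end of capture to the end of the S-Region."""
    return (switch_wait_cycles(drain_cycles, capture_cycles)
            + static_mode_cycles(lines_issued, miss_stall_cycles))
```

For a loop, lines_issued would be the number of S-Buffer lines per iteration multiplied by the iteration count, which is why long-running loops amortize the exposed drain time.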
Tracking S-Regions: The scheme uses a small content addressable
table, called the PC Table (shown in Figure 2), to keep
track of all the S-Region blocks in the S-Buffer. Each entry
in the table contains a PC field, the corresponding
S-Buffer line number holding the first instruction of the region
and a valid bit. Each time a new S-Region is filled into
the S-Buffer, an entry is created in the PC Table. When a
"start-static" instruction is decoded, the PC Table is probed
to check if the corresponding S-Region is already present in
the S-Buffer. Invalidation of S-Regions is handled by adding
an extra bit to the instruction buffer, namely the S-bit. The
S-Buffer line holding the first instruction in the region has
its S-bit set. The value of the S-bit is always checked before
filling a line in the S-Buffer. If the bit is set, we invalidate
the entry of the previous S-Region in the PC Table before
proceeding any further.
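A software analogue of the PC Table may help make the probe and invalidation steps concrete. This is a sketch, not the hardware design: the class and method names are hypothetical, a Python dict stands in for the content addressable lookup, and the eviction policy used when the table is full is an assumption, since the text does not specify one.

```python
from typing import Dict, Optional


class PCTable:
    """Maps the start PC of a captured S-Region to its first S-Buffer line."""

    def __init__(self, num_entries: int = 16):
        self.num_entries = num_entries
        self.entries: Dict[int, int] = {}   # only valid entries are stored

    def probe(self, pc: int) -> Optional[int]:
        """Return the S-Buffer line of a captured region, or None on a miss."""
        return self.entries.get(pc)

    def insert(self, pc: int, line_no: int) -> None:
        """Record a newly captured region."""
        if pc not in self.entries and len(self.entries) >= self.num_entries:
            # Assumption: evict the oldest entry (dicts keep insertion order).
            self.entries.pop(next(iter(self.entries)))
        self.entries[pc] = line_no

    def invalidate_line(self, line_no: int) -> None:
        """Called when a line whose S-bit is set is about to be overwritten."""
        for pc, ln in list(self.entries.items()):
            if ln == line_no:
                del self.entries[pc]
```

On decoding "start-static", a hit in probe means the region can be issued from the S-Buffer directly; a miss means it must first be captured and then registered with insert.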
Branching in the Static Mode: Branches in the static mode
are by default predicted as `always taken'. The Branch Table
shown in Figure 2 is used for storing the target address
of the branch instruction. Entries in the branch table are
created when the branch is decoded. Each entry holds the
PC of the branch and the S-Buffer line number of the target
instruction. The branch table is content addressable with an
S-Buffer line number. When issuing an S-Buffer line containing
a branch instruction (indicated by the B-bit), the branch
table is accessed and the target S-Buffer line number is
obtained. Subsequent issue of instructions is performed from
this S-Buffer line. If the branch is not taken, as soon as the
branch has been resolved in the execute stage, the instruction
packet in the decode stage is squashed. The PC Table
is searched with the computed target address and, if there is
no valid entry corresponding to the address, the processor
exits the static mode and returns to the superscalar mode
of execution. This branching scheme allows us to have zero-cycle
branch instructions in the taken case, and a unit cycle
latency for the not-taken case when the target instruction is
in the S-Buffer. Hence, this scheme is particularly suited for
branches highly biased in the `taken' direction. The hybrid-scheduling
compiler employs if-conversion [1] to eliminate
unbiased branches wherever possible.
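The taken and not-taken paths can be summarized in a short function. This is an illustrative sketch: resolve_branch and its parameters are hypothetical names, both tables are modelled as plain dicts, and the cost of leaving the static mode is not modelled beyond the one squashed packet.

```python
from typing import Dict, Optional, Tuple


def resolve_branch(line_no: int,
                   branch_table: Dict[int, int],
                   pc_table: Dict[int, int],
                   taken: bool,
                   not_taken_pc: int) -> Tuple[Optional[int], int]:
    """Return (next S-Buffer line or None, penalty in cycles).

    branch_table maps the S-Buffer line holding a branch to the line of its
    target; pc_table maps region PCs to S-Buffer lines.
    """
    predicted_line = branch_table[line_no]    # `always taken' prediction
    if taken:
        return predicted_line, 0              # zero-cycle taken branch
    # Mispredicted: one packet is squashed, then the PC Table is searched.
    next_line = pc_table.get(not_taken_pc)    # None: exit the static mode
    return next_line, 1
```

A result of (None, 1) corresponds to the case in which the processor returns to the superscalar mode because the computed target is not in the S-Buffer.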
In-order Retirement and Exceptions: In the static mode,
instructions are issued in packetized groups and also commit
as a group. This feature allows us to design a low-power
reorder buffer with very few ports and low associativity.
Therefore, rather than using the reorder buffer present in
the superscalar mode, we provide a separate buffer for the
static mode and refer to it as the Exception Handling Unit
(EHU). We use the reorder buffer with future file approach
proposed for VLIW processors by Ozer et al. [14] to support
in-order commit of instructions and for handling precise
exceptions in the static mode.
This section discussed in detail the microarchitecture of
the hybrid-scheduling scheme. The following section de-
scribes how the compiler selects regions for issue in the low
power static mode.
3. ROLE OF THE COMPILER: IDENTIFY-
ING AND SCHEDULING S-REGIONS
S-Regions bypass the dynamic scheduling logic, hence it
is critical that the compiler generates schedules comparable
to the schedules generated by the out-of-order issue logic for
these parts of the code. An S-Region could be a basic block
or a group of basic blocks such as loops, hyperblocks [11],
superblocks [19] or subroutines. There are several criteria
that must be considered before a region can be selected for issue
in the static mode:
1. The region should exhibit large amounts of ILP (in-
struction level parallelism), visible and exploitable at
compile-time. Typical examples of such regions include
regions without function calls, code sequences without
hard-to-predict branches etc.
2. It is desirable for the region to have a single entry
point. This greatly simplifies the task of keeping track
of all S-Regions in the S-Buffer. Multiple entry points
will require a larger PC Table to keep track of the regions
in the S-Buffer.
3. Regions should have regular, predictable memory access
patterns. Due to the absence of dynamic scheduling
in the static mode, it is difficult to hide cache
miss latencies by instruction overlap. Hence, memory
access patterns of the region are critical. Regions
with regular access patterns that are amenable to techniques
such as prefetching should be chosen.
4. S-Regions of long durations are preferred since there
is an overhead for switching from dynamic to static
mode as described in the previous section. With long
running S-Regions, this cost becomes negligible. The
minimum duration of the region is determined by the
switching overhead. For example, in our experiments,
the switching overhead is observed to be between 20 and
30 cycles. In order to keep the overhead below 1%,
the S-Region should run for at least 2000-3000 cycles.
Eligible regions can be profiled and regions of sufficient
duration can be selected for the static mode of
execution.
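The duration threshold in the last criterion follows from dividing the switching overhead by the tolerated overhead fraction. A small sketch of that arithmetic, with hypothetical function names, is:

```python
import math


def min_region_cycles(switch_overhead_cycles: int, max_overhead_percent: float) -> int:
    """Shortest S-Region duration that keeps switching overhead within budget."""
    return math.ceil(switch_overhead_cycles * 100 / max_overhead_percent)


def overhead_percent(switch_overhead_cycles: int, region_cycles: int) -> float:
    """Switching overhead as a percentage of the region's duration."""
    return 100 * switch_overhead_cycles / region_cycles
```

With the 20 to 30 cycle overhead quoted above, min_region_cycles(20, 1) and min_region_cycles(30, 1) give 2000 and 3000 cycles respectively.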
The compiler schedules instructions in the selected
S-Regions into fixed-size packets. To make the schedules compact,
techniques such as unrolling, software pipelining, trace
scheduling, etc. are employed. All load and store instructions
are scheduled assuming they hit in the cache. Misses, if
any, are handled by hardware interlocks. The compiler also
annotates the S-Regions with the special "start-static" and
"end-static" instructions.
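To make the notion of a packet concrete, the sketch below packs instructions into fixed-size packets with a plain greedy list scheduler. This is a deliberately simplified stand-in, not the compiler used in the paper: it assumes unit latency for every instruction, ignores function unit types, and performs none of the unrolling or software pipelining mentioned above. All names are hypothetical.

```python
from typing import Dict, List, Set


def schedule_packets(instrs: List[str],
                     deps: Dict[str, Set[str]],
                     width: int = 4) -> List[List[str]]:
    """Greedily pack instructions into packets of at most `width` entries.

    instrs: instruction names in program order.
    deps:   maps an instruction to the set of instructions it depends on.
    An instruction may issue once all of its producers were issued in an
    earlier packet (unit latency assumed).
    """
    remaining = list(instrs)
    done: Set[str] = set()
    packets: List[List[str]] = []
    while remaining:
        ready = [i for i in remaining if deps.get(i, set()) <= done]
        packet = ready[:width]
        if not packet:
            raise ValueError("cyclic or unsatisfiable dependence")
        packets.append(packet)
        done.update(packet)
        remaining = [i for i in remaining if i not in done]
    return packets
```

Each inner list corresponds to one S-Buffer line; instructions within a packet are mutually independent, so the hardware can issue the line without any dependence check.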
4. EXPERIMENTAL EVALUATION
4.1 Benchmarks
Media and scientific applications are both important domains
of applications targeting high-performance general-purpose
processors, and these programs have tremendous potential
for applying the hybrid-scheduling technique. We
study five audio and video compression/encoding applications
in the Mediabench [10] suite (adpcm, epic, g.721, jpeg,
mpeg2) and several media kernels (iir, add, scale, autocorr,
fir, dct). We also study two scientific applications from the
SPECFP suite of benchmarks (swim, tomcatv).
In our study, we choose loops as static regions and consider
all loops without function calls to be potential S-Regions.
Loops in the programs were identified and profiled. Table 1
shows the characteristics of the S-Regions in the applications.
Columns 2 and 3 show the number of loops without
function calls in each benchmark and the percentage of
execution time spent in them. Column 4 in the table
shows the average duration of the potential S-Regions. The
duration shown is the weighted average, where the weights
are determined based on the percentage of program time
attributed to a region. The candidate S-Regions were selected
based on the duration of the loop. We set the minimum duration
of a loop that could be selected as an S-Region to 1000
cycles, corresponding to approximately 3% switching overhead.
Further, loops that exhibited irregular (non-constant
stride) memory access patterns were eliminated. We observe
that except in G.721, the programs spend a significant
amount of time in S-Regions. In G.721, all the S-Regions
were disqualified due to their short durations and hence we
could not apply the hybrid-scheduling mechanism to this
benchmark.
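The weighted average and the selection rule can be expressed compactly. The sketch below uses a hypothetical Loop record and made-up numbers in its usage; it is not derived from the profiling data in Table 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Loop:
    avg_cycles: float        # average duration of one execution of the loop
    time_fraction: float     # fraction of program time spent in the loop
    constant_stride: bool    # True if memory accesses have a constant stride


def weighted_avg_duration(loops: List[Loop]) -> float:
    """Average duration, weighted by each loop's share of program time."""
    total = sum(l.time_fraction for l in loops)
    return sum(l.avg_cycles * l.time_fraction for l in loops) / total


def select_s_regions(loops: List[Loop], min_cycles: float = 1000) -> List[Loop]:
    """Keep loops that are long enough and have regular memory accesses."""
    return [l for l in loops if l.avg_cycles >= min_cycles and l.constant_stride]
```

Weighting by time share means that a few long-running loops dominate the reported average, which is consistent with the very large averages seen for the scientific codes.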
Benchmarks were compiled on a DEC Alpha 21064 machine
with the cc compiler at the highest optimization level;
optimizations such as loop unrolling, software pipelining,
if-conversion and prefetching were applied to create
compact schedules.
Table 1: S-Regions in Media and Scientific Applications
Benchmark  Number of   % Time in   Avg. Duration  Number of selected  % Time in selected
           S-Regions   S-Regions   (cycles)       S-Regions           S-Regions
ADPCM      1           99          17607          1                   99
EPIC       18          87          1.2e6          11                  87
G721       5           49          84             0                   0
MPEG       77          81          6107           32                  75
JPEG       20          50          6403           4                   31
SWIM       14          93          3.3e8          13                  67
TOMC       6           93          2.1e7          4                   71
4.2 Simulation Environment
We have implemented the hybrid-scheduling architecture
within the Wattch 1.0 simulator [2] framework. Complete
configuration details of the simulated processor are given
in Table 2. The base processor has an issue width of four.
Power distributions for different hardware structures in the
base processor are shown in Table 3. The power breakdowns
in the table represent the maximum power per unit. We
assume that aggressive clock-gating is employed and hence
power is scaled linearly with port or unit usage. Unused
units dissipate 10% of their maximum power.
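The clock-gating assumption stated above amounts to a simple per-cycle power rule. The function below is a sketch of that rule as described in the text, not a reproduction of the simulator's code; the name unit_power and its parameters are hypothetical.

```python
def unit_power(max_power: float, ports_used: int, ports_total: int,
               idle_fraction: float = 0.10) -> float:
    """Power drawn by one unit in one cycle under linear clock gating.

    An unused unit still dissipates `idle_fraction` of its maximum power;
    otherwise power scales linearly with the number of ports in use.
    """
    if ports_used == 0:
        return idle_fraction * max_power
    return max_power * ports_used / ports_total
```

Under this rule, a structure that is never accessed in the static mode, such as the issue window, is charged only the 10% idle floor of its maximum power.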
The static mode structures are modeled using the RAM
and CAM models provided by Wattch 1.0. We assume that
the static mode also supports execution of only four instructions
per cycle. The structures introduced for the static mode
execution are inherently low energy structures due to small
sizes, low associativity and fewer port requirements. More
details of the structures introduced are given in Table 4.
The size of the S-Buffer needs to be large enough to at least
hold the largest S-Region in the programs. For the benchmarks
studied, an S-Buffer size of 128 lines was found to be
sufficient (the largest S-Region in the applications was only
34 lines long, found in tomcatv). Each S-Buffer line contains
4 instruction entries. Each line in the buffer also holds a PC
and two state bits. The S-Buffer is accessed only once per
cycle, either during filling or during issue.
Table 2: Processor Configuration
Feature     Attributes            Feature      Attributes
IW/LSQ      128/64 entries        S-Buffer     128 rows, 194 bits/row
ROB         128 entries           EHU          16 rows, 284 bits/row
Width       4-way                 PC/Br Table  16/10 entries
L1 Dcache   32K 4-way, 1-cycle    IALU (4)     1-cycle latency
                                  FPALU (4)    2-cycle latency
L1 Icache   32K DM, 1-cycle       IMult (2)    3-cycle mult lat., 10-cycle div lat.
L2 Cache    512KB, 4W, 8-cycle    FPMult (2)   3-cycle mult lat., 15-cycle div lat.
Memory      40-cycle lat          LD/ST (2)    1-cycle
Table 3: Power distribution for different hardware
structures in the baseline processor. Total processor
power for this configuration is 125W.
Unit              Power    Unit          Power
Branch Predictor  3.49%    Rename Logic  0.46%
IW/LSQ            15.45%   ROB           19.70%
Register File     2.63%    Result Bus    3.98%
Int ALU           3.42%    FP ALU        10.5%
ICache            2.61%    ITLB          0.20%
DCache            5.22%    DTLB          0.68%
L2 Cache          3.42%    Clock         28.20%
The size of the exception handling unit is also small, since
it needs to be only one entry longer than the longest latency
operation [14]. Instructions access the EHU associatively
during instruction writeback. However, unlike the reorder
buffer, during issue and commit only a single entry is
accessed [14]. Each row in the EHU contains four instruction
entries. Each entry holds one result value, a destination
register number and two state bits. The maximum power
consumed by the EHU is 1.1% and the future file, which is
similar to the register file, is 2.63% of the overall processor
power. The PC and branch tables are small structures since
we place only a few S-Regions in the S-Buffer. They account
for only 0.24% of the total processor power.
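The sizing rule for the EHU can be checked against the latencies listed in Table 2. The sketch below is a back-of-the-envelope calculation with hypothetical names; reading the 16-row EHU as "15-cycle divide plus one" is an interpretation of the tables, not a statement made explicitly in the text.

```python
# Operation latencies in cycles, taken from Table 2.
LATENCIES = {
    "int_alu": 1, "fp_alu": 2,
    "int_mult": 3, "int_div": 10,
    "fp_mult": 3, "fp_div": 15,
    "load_store": 1,
}


def ehu_rows(latencies: dict) -> int:
    """One entry more than the longest-latency operation."""
    return max(latencies.values()) + 1


def storage_bits(rows: int, bits_per_row: int) -> int:
    """Total storage of a structure with uniform rows."""
    return rows * bits_per_row
```

With these values ehu_rows(LATENCIES) is 16, and the raw storage works out to 4544 bits for the EHU (16 rows of 284 bits) against 24832 bits for the S-Buffer (128 rows of 194 bits).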
Table 4: Static Mode Structures
Unit         Entries  Bits/row  Ports  Max access  Assoc. access  Power (% of total)
S-Buffer     128      194       1R/1W  1/cyc       No             0.68%
EHU          16       284       1R/5W  6/cyc       Partial        1.1%
Future File  32       64        8R/4W  12/cyc      No             2.63%
5. EXPERIMENTAL RESULTS
Figure 3 provides the energy and energy-delay results for
all benchmarks. The corresponding performance degradation
suffered by the programs is given in Figure 4. We observe
that the hybrid-scheduling scheme is able to achieve
very large improvements in energy consumption without any
significant increase in the execution time.
Figure 3: Energy and Energy-Delay improvements
(y-axis: % improvement in energy and energy-delay, 0 to 50;
series: Energy, Energy-Delay; x-axis: iir, add, scale, auto, fir,
dct, avg, adpcm, epic, jpeg, mpeg2, swim, tomc, avg)
Figure 4: Performance degradation results
(y-axis: % degradation in execution time, 0 to 9;
x-axis: iir, add, scale, auto, fir, dct, avg, adpcm, epic,
jpeg, mpeg2, swim, tomc, avg)
In kernels, the average energy improvement is 43.6%,
with the improvement ranging from 40% (scale)
to 46% (autocorr). The average performance degradation
caused by the hybrid-scheduling approach is a mere 2.23%.
The highest performance drop of 4.8% is observed in add
and the lowest, 0.16%, in iir. On average, the
energy-delay product improves by 42.3%.
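The relation between the three kernel averages can be checked with a one-line formula: if energy falls by a fraction e and execution time grows by a fraction d, the energy-delay product falls by 1 - (1 - e)(1 + d). The sketch below applies this to the reported averages; applying it to averages rather than to each benchmark individually is a simplification, and the function name is hypothetical.

```python
def edp_improvement(energy_saving: float, slowdown: float) -> float:
    """Fractional improvement in energy-delay product.

    energy_saving: fractional reduction in energy (0.436 for 43.6%).
    slowdown:      fractional increase in execution time (0.0223 for 2.23%).
    """
    return 1.0 - (1.0 - energy_saving) * (1.0 + slowdown)
```

For the kernel averages, edp_improvement(0.436, 0.0223) evaluates to about 0.423, in line with the 42.3% figure quoted above.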
The energy improvement and performance results for the
applications are also included in Figures 3 and 4. On aver-
age, in applications, we observe an energy improvement of
25% and a performance degradation of 3.4%. In the appli-
cations, the energy savings are directly proportional to the
amount of time spent in S-Regions. The highest improve-
ment in energy is seen in ADPCM (33%), in which almost
99% of the execution time is spent in S-Regions. JPEG
shows the lowest savings (12%).
The energy improvements seen are primarily due to the
savings in the issue window and reorder buffer power.
Additional power savings are observed in the fetch and decode
phases of the pipeline. Since static mode instructions are
accessed from the S-Buffer, which is significantly smaller than
the instruction cache, fetch power reduces considerably. Further,
since instructions are stored in a decoded form, decoder
power is also saved in the static mode. Additionally, we do
not access the branch predictor in the static mode, which leads
to further energy savings. Figure 5 shows the energy savings
in each hardware structure. Energy savings in the clock
nodes of the structures are shown separately. Note that these
are not absolute values but only portray the ratio of savings
from each structure.
Figure 5: Energy improvements in different hardware structures
(y-axis: improvement in energy, 0% to 100%; series: rename,
bpred, window, rob, lsq, icache, clock; x-axis: iir, add, scale,
auto, fir, dct, adpcm, epic, jpeg, mpeg2, swim, tomc)
The performance degradation suffered by an application
depends on the nature of the loop schedules, static mode
cache misses and the switching overhead incurred. One of
the key constraining factors observed in this study is that
the lengths of the schedules generated by the compiler are
restricted to integral numbers of cycles. The performance
degradation caused by rounding the lengths of schedules to
integer values can be significant but is considerably reduced
by unrolling the loop. In our experiments we explored different
degrees of unrolling and software pipelining to find perfect
or near-perfect schedules for the loops. In addition to the
integral length schedule limitation, the degradation observed in
Figure 4 is caused by the switching overhead and cache
misses, if any.
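The effect of unrolling on the rounding loss can be illustrated with a resource-bound estimate. The sketch assumes that only issue width limits the schedule (dependences and function unit types are ignored) and uses hypothetical numbers; it is meant to show the arithmetic, not to model the schedules found in the experiments.

```python
import math


def cycles_per_iteration(ops_per_iter: int, width: int, unroll: int) -> float:
    """Cycles per original iteration when `unroll` iterations are scheduled together.

    The unrolled body needs a whole number of packets, so the rounding
    penalty is shared by all iterations in the unrolled body.
    """
    return math.ceil(ops_per_iter * unroll / width) / unroll
```

For a loop body of 6 operations on a 4-wide machine, the ideal is 1.5 cycles per iteration. Without unrolling the schedule rounds up to 2 cycles, whereas unrolling by two gives 3 cycles for two iterations, i.e. exactly 1.5 cycles each.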
6. RELATED WORK
Recent work in the area of energy-effective microprocessors
has shown that it is possible to reduce power in the
out-of-order issue units and other units by dynamically resizing
the structures [3][4][6][15][16]. Some approaches monitor the
changes in the IPC (instructions per cycle) of the program
and alter the issue window size [3][4]. Ghiasi et al. [6]
propose a technique wherein the OS dictates the expected
IPC of the program for different phases, and the hardware
chooses between different processor configurations such as
pipeline-gating, in-order issue and out-of-order issue. Another
approach uses the immediate history of the actual usage
of the different queues to resize them [15]. Dynamic
critical path information used for steering critical and
non-critical instructions separately to two smaller queues rather
than one large queue has shown considerable gains in energy
consumption [16]. Iyer et al. explore a technique which profiles
different characteristics of the program such as ALU
usage, register file usage, instruction window usage, etc. [9].
Hotspots in the program are determined and processor units
are scaled accordingly.
In all the above techniques, the sizes of issue logic structures
are manipulated. In our approach we completely eliminate
the use of an instruction window and a complex reorder
buffer for some regions of the code. Further, all dynamic resizing
methods that reduce power in the superscalar pipeline
structures could potentially be applied in the superscalar
mode of execution in the hybrid-scheduling approach, leading
to larger overall savings in energy consumption.
Talpes et al. [17] suggest a technique that collects schedules
created by the dynamic issue logic into a large trace
cache and reuses them to save issue power. Franklin et
al. [5] and Nair et al. [13] proposed similar schemes that
were aimed primarily at improving the clock frequency of
the processor. An important difference between these techniques
and the hybrid-scheduling scheme is that these techniques
are insensitive to the available ILP in different phases
of programs, leading to inefficient use of the caches holding
the scheduled instructions. Moreover, schedules created by
the dynamic issue hardware, when looking at a limited window
of instructions, are not known to be optimal.
7. CONCLUDING REMARKS
In this work, we introduce a hybrid-scheduling technique
that conserves energy in a superscalar processor by exploiting
compiler-generated schedules for regular and structured
regions of applications. Regular regions bypass the power-hungry
units such as the issue window and reorder buffer and
execute in a low power static mode. Execution in the static
mode also results in additional energy savings in the decode
logic, instruction cache and branch predictor. The hybrid-scheduling
architecture employs dynamic scheduling for the
less regular instruction sequences. We show that the hybrid-scheduling
technique can reduce energy consumption by as
much as 46% for kernels and up to 33% in full-length applications
with minimal performance degradation. Further, an
attractive feature of this architecture is that it allows us to
orthogonally apply all previously proposed dynamic resizing
techniques to reduce energy consumption in the superscalar
mode of issue for much larger overall energy savings in the
processor. The proposed technique will be effective for any
application which contains regions of code that can be statically
scheduled effectively. While it might be difficult to
identify such regions in many general purpose integer applications,
the necessity to support media and other applications
with structured code on the desktop computer makes
this technique important for general purpose microprocessors.
8. ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers
for their valuable comments. We thank Steve Keckler and
Doug Burger for many useful discussions during the early
stages of this work. We also thank Juan Rubio, Anand Ra-
machandran and Vivekananda Vedula for their suggestions
to improve the initial draft of the paper.
This research is partially supported by the National Sci-
ence Foundation under grant number 0113105; the Defense
Advanced Research Projects Agency under contract F33615-
01-C-1892; and by the AMD, Intel, IBM, Tivoli and Mi-
crosoft corporations.
9. REFERENCES
[1] J. R. Allen, K. Kennedy, C. Porterfield, and
J. Warren. Conversion of control dependence to data
dependence. In POPL83, Austin, Jan 1983.
[2] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A
framework for architectural-level power analysis and
optimizations. In 27th International Symposium on
Computer Architecture, Jun 2000.
[3] A. Buyuktosunoglu, S. Schuster, D. Brooks, P. Bose,
P. Cook, and D. Albonesi. An adaptive issue queue for
reduced power at high performance. In Workshop on
Power-Aware Computers Systems, held in conjunction
with ASPLOS, Nov 2000.
[4] D. Folegnani and A. Gonzalez. Energy-effective issue
logic. In 28th International Symposium on Computer
Architecture, Jun. 2001.
[5] M. Franklin and M. Smotherman. A fill-unit approach
to multiple instruction issue. In 27th Annual
International Symposium on Microarchitecture, 1994.
[6] S. Ghiasi, J. Casmira, and D. Grunwald. Using IPC
variation in workloads with externally specified rates
to reduce power consumption. In Workshop on
Complexity Effective Design, Vancouver, Canada, Jun.
2000.
[7] M. K. Gowan, L. L. Biro, and D. B. Jackson. Power
considerations in the design of the Alpha 21264
microprocessor. In Design Automation Conference,
pages 726–731, 1998.
[8] G. Hinton, D. Sager, M. Upton, D. Boggs,
D. Carmean, A. Kyker, and P. Roussel. The
microarchitecture of the Pentium 4 processor.
Technical report, Intel, Feb. 2001.
[9] A. Iyer and D. Marculescu. Run-time scaling of
microarchitecture resources in a processor for energy
savings. In Kool Chips Workshop, held in conjunction
with MICRO-33, 2000.
[10] C. Lee, M. Potkonjak, and W. H. Mangione-Smith.
Mediabench: A tool for evaluating and synthesizing
multimedia and communications systems. In 30th
International Symposium on Microarchitecture, pages
330–335, Dec. 1997.
[11] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank,
and R. A. Bringmann. Effective compiler support for
predicated execution using the hyperblock. In 25th
International Symposium on Microarchitecture, pages
45–54, 1992.
[12] S. W. Melvin, M. Shebanow, and Y. N. Patt.
Hardware support for large atomic units in
dynamically scheduled machines. In 21st International
Symposium on Microarchitecture, Dec. 1988.
[13] R. Nair and M. E. Hopkins. Exploiting instruction
level parallelism in processor by caching scheduled
groups. In ISCA, 1997.
[14] E. Ozer, S. W. Sathaye, K. N. Menezes, S. Banerjia,
M. D. Jennings, and T. M. Conte. A fast interrupt
handling scheme for VLIW processors. In
International Conference on Parallel Architectures and
Compilation Technique, Oct. 1998.
[15] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing
power requirements of instruction scheduling through
dynamic allocation of multiple datapath resources. In
34th International Symposium on Microarchitecture,
pages 90–101, Dec 2001.
[16] J. S. Seng, E. S. Tune, and D. M. Tullsen. Reducing
power with dynamic critical path information. In 34th
Annual International Symposium on
Microarchitecture, Dec. 2001.
[17] E. Talpes and D. Marculescu. Power reduction
through work reuse. In International Symposium on
Low Power Electronics and Design, 2001.
[18] K. Wilcox and S. Manne. Alpha processors: A history
of power issues and a look to the future. In CoolChips
Tutorial, An Industrial Perspective on Low Power
Processor Design in conjunction with Micro-33, Dec.
1999.
[19] W. W. Hwu et al. The superblock: An effective
technique for VLIW and superscalar compilation. The
Journal of Supercomputing, 7(1), Jan 1993.
[20] V. V. Zyuban and P. Kogge. Inherently lower-power
high-performance superscalar architectures. In IEEE
Transactions on Computers, pages 268–285, Mar.
2001.