CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution by Mohammadi, Milad et al.
arXiv:1606.01607v1 [cs.AR] 6 Jun 2016
CG-OoO
Energy-Efficient Coarse-Grain Out-of-Order Execution
Milad Mohammadi⋆, Tor M. Aamodt†, William J. Dally⋆‡
⋆Stanford University, †University of British Columbia, ‡NVIDIA Research
milad@cs.stanford.edu, aamodt@ece.ubc.ca, dally@stanford.edu
ABSTRACT
We introduce the Coarse-Grain Out-of-Order (CG-
OoO) general purpose processor designed to achieve
close to In-Order processor energy while maintaining
Out-of-Order (OoO) performance. CG-OoO is an
energy-performance proportional general purpose
architecture that scales according to the program
load1. Block-level code processing is at the heart of
this architecture; CG-OoO speculates, fetches,
schedules, and commits code at block-level granu-
larity. It eliminates unnecessary accesses to energy
consuming tables, and turns large tables into smaller
and distributed tables that are cheaper to access.
CG-OoO leverages compiler-level code optimizations
to deliver efficient static code, and exploits dynamic
instruction-level parallelism and block-level parallelism.
CG-OoO introduces Skipahead issue, a complexity
effective, limited out-of-order instruction scheduling
model. Through the energy efficiency techniques
applied to the compiler and processor pipeline stages,
CG-OoO closes 64% of the average energy gap between
the In-Order and Out-of-Order baseline processors at
the performance of the OoO baseline. This makes
CG-OoO 1.9× more efficient than the OoO on the
energy-delay product inverse metric.
1. INTRODUCTION
This paper revisits the Out-of-Order (OoO) execution
model and devises an alternative model that achieves
the performance of the OoO at over 50% lower energy
cost. Czechowski et al. [2] discuss the energy
efficiency techniques used in the recent generations of
the Intel CPU architectures (e.g. Core i7, Haswell)
including Micro-op cache, Loop cache, and Single In-
struction Multiple Data (SIMD) instruction set archi-
tecture (ISA). This paper questions the inherent en-
ergy efficiency attributes of the OoO execution model
and provides a solution that is over 50% more energy
efficient than the baseline OoO. The energy efficiency
techniques discussed in [2] can also be applied to the
CG-OoO model to make it even more energy efficient.
1Not to be confused with energy-proportional designs [1].
Energy-performance proportional scaling refers to linear
change in energy as the processor configuration allows higher
peak performance (Figure 26).
Despite the significant achievements in improving en-
ergy and performance properties of the OoO proces-
sor in the recent years [2], studies show the energy
and performance attributes of the OoO execution model
remain superlinearly proportional [3, 4]. Studies indicate
that control speculation and dynamic scheduling techniques
account for 88% and 10%, respectively, of the OoO performance
advantage over the In-Order (InO) processor [5].
Scheduling and speculation in OoO are performed
at instruction granularity regardless of the instruction
type even though they are mainly effective during un-
predictable dynamic events (e.g. unpredictable cache
misses) [5]. Furthermore, our studies show speculation
and dynamic scheduling amount to 67% and 51% of
the OoO excess energy compared to the InO processor.
These observations suggest any general purpose proces-
sor architecture that aims to maintain the superior per-
formance of OoO while closing the energy efficiency gap
between InO and OoO ought to implement architectural
solutions in which low energy program speculation and
dynamic scheduling are central.
Our study provides four high-level observations.
First, OoO excess energy is well distributed across all
pipeline stages. Thus, an energy efficient architecture
should reduce energy of each stage. Second, OoO
execution model imposes tight functional dependencies
between stages requiring a solution to enable energy
efficiency across all stages. Third, as mentioned by
others, complexity effective micro-architectures such
as ILDP [6] and Palacharla et al. [7] enable simpler
hardware, such as local and global register files that
improve energy efficiency. A block-level execution
model, like CG-OoO, enables energy efficiency by
simplifying complex, energy consuming modules
throughout the pipeline stages. Fourth, since dynamic
scheduling and speculation techniques mainly benefit
unpredictable dynamic events, they should be applied
to instructions selectively. Unpredictable events are
hard to detect and design for; however, we show a
hierarchy of scheduling techniques can adjust the
processing energy according to the program runtime
state.
CG-OoO contributes a hierarchy of scheduling
techniques centered around clustering instructions;
static instruction scheduling organizes instructions
at basic-block level granularity to reduce stalls. The
CG-OoO dynamic block scheduler dispatches multiple
code blocks concurrently. Blocks issue instructions
in-order when possible. In case of an unpredictable
stall, each block allows limited out-of-order instruction
issue using a complexity effective structure named
Skipahead; Skipahead accomplishes this by performing
dynamic dependency checking between a very
small collection of instructions at the head of each
code block. Section 4.4.1 discusses the Skipahead
micro-architecture.
CG-OoO contributes a complexity effective block-
level control speculation model that saves speculation
energy throughout the entire pipeline by allowing
block-level control speculation, fetch, register renaming
bypass, dispatch, and commit. Several front-end
architectures have shown block-level speculation can be
done with high accuracy and low energy cost [8, 9, 10].
CG-OoO uses a distributed register file hierarchy to
allow static allocation of block-level, short-living regis-
ters, and dynamic allocation of long-living registers.
The rest of this paper is organized as follows. Sec-
tion 2 presents the related work, Section 3 describes
the CG-OoO execution model, Section 4 discusses the
processor architecture, Section 5 presents the evaluation
methodology, Section 6 provides the evaluation results,
and Section 7 concludes the paper.
2. OVERVIEW & RELATED WORK
CG-OoO aims to be an energy efficient, high-performance,
single-threaded processor by targeting a design point
where the hardware complexity is close to that of an
in-order processor while instruction-level parallelism
(ILP) remains paramount. Table 1 compares several high-level
design features that distinguish the CG-OoO processor
from the previous literature. Unlike others, CG-OoO’s
primary objective is energy efficient computing (col. 3).
To that end, it introduces several complexity effective (col.
4), energy-aware techniques including an efficient register
file hierarchy (col. 9), block-level control speculation,
and a static and dynamic block-level instruction
scheduler (col. 7, 8) coupled with a complexity effective
out-of-order issue model named Skipahead. CG-OoO is
a distributed instruction-queue model (col. 2) that clus-
ters execution units with instruction queues to achieve
an energy-performance proportional solution (col. 6).
Braid [11] clusters static instructions at sub-basic-
block granularity. Each braid runs in-order as a block
of code. Clustering static instructions at this granular-
ity requires additional control instructions to guarantee
program execution correctness. Injecting instructions
increases instruction cache pressure and processor energy
overhead. Braid performs branch prediction, issue, and
commit at instruction level. WiDGET [3] is a power-proportional
grid execution design consisting of a decoupled
thread context management and a large set of
simple execution units. WiDGET performs instruction-level
dynamic data dependency detection to schedule instructions.
In contrast to these proposals, the CG-OoO
clusters basic-block instructions statically such that, at
runtime, control speculation, fetch, commit, and squash
are done at block granularity. Furthermore, CG-OoO
leverages energy efficient, limited out-of-order schedul-
ing from each code block (col. 8).
DESIGN | Distributed µ-architecture / Coarse-Grain | Energy Modeling | Complexity Effective Design | Profiling NOT done | Pipeline Clustering | Static & Dynamic Scheduling Hybrid | Block-level Out-of-Order Scheduling | Register File Hierarchy
CG-OoO ! ! ! ! ! ! ! !
Braid [11] ! ! ! !
WiDGET [3] ! ! ! ! !
TRIPS [12, 13] ! ! ! ! !
Multiscalar [14] ! ! ! !
CE [7] ! ! ! !
TP [15] ! ! ! !
MorphCore [16] ! ! !
BOLT [17] ! ! !
iCFP [18] ! !
ILDP [6] ! ! ! !
WaveScalar [19] ! ! ! ! !
Table 1: Eight high level design features of the CG-OoO archi-
tecture compared to the previous literature.
Multiscalar [14] evaluates a multi-processing unit
capable of steering coarse grain code segments, often
larger than a basic-block, to its processing units. It
replicates register context for each computation unit,
increasing the data communication across its register
files. TRIPS and EDGE [12, 20] are high-performance,
grid-processing architectures that use static instruction
scheduling in space and dynamic scheduling in
time. They use Hyperblocks [21] to map instructions
to the grid of processors. Hyperblocks use branch
predication to group basic-blocks that are connected
together through weakly biased branches. To construct
Hyperblocks, the TRIPS compiler uses program
profiling. While effective for improving instruction
parallelism, Hyperblocks lead to energy inefficient
mis-speculation recovery events. Palacharla et al. [7]
support a distributed instruction window model that
simplifies the wake-up logic, issue window, and the
forwarding logic. In that work, instruction scheduling
and steering are done at instruction granularity. Trace
Processors [15] is an instruction flow design based
on dynamic code trace processing. The register
file hierarchy in this work consists of several local
register files and a global register file. ILDP [6] is
a distributed processing architecture that consists of
a hierarchical register file built for communicating
short-lived registers locally and long-lived registers
globally. ILDP uses profiling and in-order scheduling
from each processing unit. In contrast to all of these
proposals, the CG-OoO compiler does not use program
profiling (col. 5), and avoids static control prediction
by clustering instructions at basic-block granularity.
CG-OoO uses local and segmented global registers to
reduce data movement and SRAM storage energy.
iCFP [18] addresses the head-of-queue2 blocking
problem in the InO processor by building an execution
model that, on every cache miss, checkpoints the
program context, steers miss-dependent instructions to
a side buffer enabling miss-independent instructions to
make forward progress. CFP [22] addresses the same
problem in an OoO processor. Similarly, BOLT [17],
Flea Flicker [23], and Runahead Execution [24] are
high ILP, high MLP3, latency-tolerant architecture
designs for energy efficient out-of-order execution.
All these architectures follow the runahead execution
model. BOLT uses a slice buffer design that utilizes
minimal hardware resources. CG-OoO solves the
head-of-queue scheduling problem through a hierarchy
of energy efficient solutions including the Skipahead
(Section 4.4.1) scheduler (col. 8).
WaveScalar [19] and SEED [25] are out-of-order data-
flow architectures. The former focuses on solving the
problem of long wire delays by bringing computation
close to data. The latter is a complexity effective de-
sign that groups data-dependent instructions dynam-
ically and manages control-flow using switch instruc-
tions. MorphCore [16] is an InO, OoO hybrid archi-
tecture designed to enable single-threaded energy effi-
ciency. It utilizes either core depending on the pro-
gram state and resource requirements. It uses dynamic
instruction scheduling to execute and commit instructions.
In contrast to the above, CG-OoO is a single-threaded,
block-level, energy efficient design that addresses
the long wire delay problem by clustering
execution units, register files, and instruction queues
close to one another. CG-OoO is end-to-end coarse-grain,
and code blocks do not need additional instructions
to manage control flow.
3. CG-OOO ARCHITECTURE
The goal of the CG-OoO processor is to reach near the
energy of the InO while maintaining the performance
level of OoO. This section introduces the CG-OoO as
a block-level execution model that leverages a hierar-
chy of solutions (software and hardware) to save energy.
Section 3.3 provides an execution flow example.
CG-OoO consists of multiple instruction queues,
called Block Windows (BW), each holding a dynamic
basic-block and issuing instructions concurrently.
BW’s share execution units (EU) to issue instructions
(Figure 1). Several BW’s and EU’s are grouped to form
execution clusters. CG-OoO uses compiler support to
group and statically schedule instructions.
2Stall of a ready operation behind another stalling operation
in a first-in-first-out (FIFO) queue.
3MLP: Memory level parallelism.
Figure 1: The CG-OoO {BW’s, EU’s} cluster network.
3.1 Hierarchical Design
3.1.1 Hierarchical Architecture
CG-OoO groups instructions into code-blocks that
are fetched, dispatched, and committed together. At
runtime, each dynamic block is processed from a dedi-
cated BW. To manage data communication energy, BW
and EU’s are grouped together to form clusters. Fig-
ure 1 shows CG-OoO clusters highlighted; thin wires,
in blue, enable data forwarding between EU’s. Mi-
croarchitecture clustering provides proportional energy-
performance scaling based on program load demands.
Scalable architectures have previously been studied in [3, 26,
27]. CG-OoO extends this concept to energy efficient,
block-level execution.
3.1.2 Hierarchical Instruction Scheduling
We use static instruction list scheduling on each
basic-block to improve performance and energy (a) by
optimizing the schedule of predictable instructions
along the critical path, (b) by improving MLP via
hoisting memory operations to the top of basic-blocks,
and (c) by minimizing wasted computation due to
memory mis-speculation (Section 3.2.1). The compiler
assumes L1-cache latency for memory operations.
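The hoisting rule in (b) can be illustrated with a minimal list-scheduling sketch. It assumes a basic block described as a list of `(name, is_mem, deps)` tuples, a representation invented here for illustration, and a simplified priority rule (ready memory operations first, then program order). It is not the compiler used in this work, and it ignores latencies and critical-path priorities.

```python
def list_schedule(block):
    """Reorder one basic block so ready memory operations are hoisted.

    block: list of (name, is_mem, deps) in program order, where deps names
    earlier instructions of the same block. The last entry is assumed to be
    the block terminator (e.g. the control operation) and stays last.
    """
    if not block:
        return []
    body, terminator = list(block[:-1]), block[-1]
    done, order = set(), []
    while body:
        # Instructions whose producers have all been scheduled.
        ready = [ins for ins in body if set(ins[2]) <= done]
        # Memory operations sort first; sorted() is stable, so ties keep
        # program order.
        pick = sorted(ready, key=lambda ins: not ins[1])[0]
        body.remove(pick)
        done.add(pick[0])
        order.append(pick[0])
    order.append(terminator[0])
    return order


# Hypothetical block: lw_b is independent and gets hoisted to the top.
example_block = [
    ("add", False, []),
    ("sll", False, ["add"]),
    ("lw_a", True, ["sll"]),
    ("lw_b", True, []),
    ("bne", False, ["lw_a"]),
]
scheduled = list_schedule(example_block)
```

Here `lw_b` moves ahead of `add` because it has no producers, while `lw_a` must still wait for `sll`.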
BW’s in each cluster schedule instructions con-
currently to hide each other’s head-of-queue stalls.
We call this scheduling model block level parallelism
(BLP). Furthermore, each BW supports a complexity
effective, limited out-of-order instruction issue model
(Section 4.4.1) to address unpredictable cases where
coarse-grain scheduling cannot provide enough ILP.
These techniques combined help save energy by limiting
the processor scheduling granularity to the program
runtime needs (Section 3.3 shows an example).
3.1.3 Hierarchical Register Files
The CG-OoO register file hierarchy consists of a
Global Register File (GRF) and a Local Register File
(LRF). The GRF provides a small set of architecturally
visible registers that are dynamically managed while
LRF is statically managed, small, and energy efficient.
The GRF is used for data communication across
BW’s while LRF is used for data communication
within each BW. Each BW has its dedicated LRF. As
shown in Section 6.2.2, 30% of data communication
(register→register and register↔memory) is done
through LRF’s. To further save energy, the GRF
is segmented and distributed among BW’s. GRF
segmentation does not rely on a block-level execution
model and may be used independently. Similar register
file models are studied in [6, 11, 15, 28]. CG-OoO
evaluates them from the energy standpoint.
Figure 2: The head instruction format
Figure 3: A simple do-while loop program and its assembly code
3.2 Block-level Speculation
OoO processors avoid fetch stall cycles by performing
BPU lookups immediately before every fetch irrespec-
tive of the fetched instruction types [29]; this leads to
excessive speculation energy cost and redundant BPU
lookup traffic by non-control instructions which in turn
may cause lower prediction accuracy due to aliasing [30].
CG-OoO supports energy efficient, block-level specu-
lation by using only one BPU lookup per code block.
The compiler generates an instruction named head to
(a) specify the start of a new code block, (b) access
the BPU to predict the next code block, and (c) trigger the
Block Allocation unit to allocate a new BW and steer
upcoming instructions to it (Figure 4). head is often
ahead of its branch by at least one cycle making the
probability of front-end stall due to delayed branch pre-
diction low.
Figure 2 shows the head instruction fields: (a) op-
code, (b) control instruction presence bit, (c) block
size4, (d) control instruction least significant address
bits. The example code in Figure 3 shows head has
HasCtrl=1’b1 indicating a control operation ends the
basic-block. If HasCtrl=1’b0, BPU lookup is disabled
to save energy. In Figure 3, local and global operands
are identified by r and g prefixes respectively.
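To make the lookup saving concrete, the following sketch counts BPU lookups for a short sequence of blocks. The `Head` record, the one-lookup-per-fetch-group baseline, and the block sizes are illustrative assumptions rather than the simulator used in this work; whether the block size field counts head itself is also not modeled.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Head:
    has_ctrl: bool   # HasCtrl bit: a control operation ends the block
    block_size: int  # instructions in the block (the compiler caps it at 32)


def block_level_lookups(heads):
    """One BPU lookup per head whose HasCtrl bit is set."""
    return sum(1 for h in heads if h.has_ctrl)


def per_fetch_lookups(heads, fetch_width):
    """Assumed baseline: one lookup per fetch group, whatever it contains."""
    total = sum(h.block_size for h in heads)
    return ceil(total / fetch_width)


# Four hypothetical blocks of 5 instructions; the third has no control op.
heads = [Head(True, 5), Head(True, 5), Head(False, 5), Head(True, 5)]
block_level = block_level_lookups(heads)
baseline_w1 = per_fetch_lookups(heads, fetch_width=1)
baseline_w4 = per_fetch_lookups(heads, fetch_width=4)
```

Under these assumptions the block with HasCtrl=1’b0 costs no lookup at all, and the other blocks cost one lookup each regardless of their size.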
3.2.1 Squash Model
CG-OoO supports block-level speculative control
and memory squash. Upon control mis-prediction,
the front-end stalls fetching new instructions, all
code blocks younger than the mis-speculated control
operation are flushed, and the remaining code blocks
are retired. The data produced by wrong-path blocks
are automatically discarded as such blocks never
retire. Once the BROB is empty, the processor state is
non-speculative, and it can resume normal execution.
3.3 CG-OoO Program Execution Flow
4The compiler partitions code blocks larger than 32 instructions.
Bird et al. [31] show that the average sizes of basic-blocks
in the SPEC CPU 2006 integer and floating-point benchmarks
are 5 and 17 operations respectively.
Figure 4: CG-OoO processor pipeline stages
This section illustrates CG-OoO architecture with a
code example. To better understand the execution flow,
Figure 4 shows the CG-OoO processor pipeline. The
highlighted stages differ from those of the traditional OoO. Control
speculation, dispatch, and commit are at block granularity,
and rename is only used for global operands. Section 4
discusses how each stage saves energy.
Figure 5a illustrates a two-wide superscalar CG-OoO.
The instruction scheduler issues one instruction per BW
per cycle to the two EU’s. The code in BW’s are two
consecutive iterations of the abovementioned do-while
loop. Figure 5b shows the cycle-by-cycle flow of in-
structions through the CG-OoO pipeline. Instructions
in iterations 1 and 2 are green and red respectively. It
also shows the contents of BW0, BW1, and the Block
Re-Order Buffer (BROB). Here, lw is a 4-cycle opera-
tion, and all others 1-cycle.
In cycle 1, {head.1, add.1} instructions are fetched
from the instruction cache. In cycle 2, the immedi-
ate field of head.1 is forwarded to the BPU. In cy-
cle 3, head.1 speculates the next code block before the
control operation, bne.1, is fetched; furthermore, the
Block Allocator assigns BW0 to the instructions following
head.1, and BROB reserves an entry for head.1 to
store the runtime status of its instructions. In cycle 4,
BW0 receives its first instruction. In cycle 5, add.1 is
issued while more instructions join BW0. In cycle 10,
the last instruction of iteration 1 leaves BW0. In cycle
11, BW0 is available to hold new code blocks. In cycle
13, head.1 is retired as all its instructions complete ex-
ecution; at this point, all data generated by the block
operations will be marked non-speculative.
4. CG-OOO MICRO-ARCHITECTURE
This section presents the CG-OoO pipeline micro-architecture
details and highlights their energy saving
attributes. These stages save energy by utilizing sev-
eral complexity effective techniques through (a) the use
of small tables, (b) reduced number of table accesses,
and (c) hardware-software hybrid instruction schedul-
ing.
4.1 Branch Prediction
Figure 6a shows the micro-architectural details of the
branch prediction stage in the CG-OoO processor; it
consists of the Branch Predictor (BP) [29], Branch Tar-
get Buffer (BTB), Return Address Stack (RAS), and
Next Block-PC. Equation 1 shows the Next Block-PC
computation relationship.
$$PC_{\text{Next-head}} = PC_{\text{head}} + \text{fall-through-block-offset} \qquad (1)$$
The fall-through-block-offset is the immediate
field of the head instruction shown in Figure 2. In the
CG-OoO model, only head PC’s access the BPU. Upon
lookup, a head PC is used to predict the next head PC.
Speculated PC’s are pushed into a FIFO queue, named
Block PC Buffer, dedicated to communicate block ad-
dresses to Fetch (Figure 6b).
CYCLE BR PRED FETCH DECODE RENAME DISPATCH EXECUTE WB COMMIT BW0 BW1 BROB
1 head.1, add.1
2 head.1 sll.1, lw.1 head.1, add.1
3 bne.1, head.2 sll.1, lw.1 head.1, add.1 head.1
4 head.2 add.2, sll.2 bne.1, head.2 sll.1, lw.1 add.1 add.1 head.1
5 lw.2, bne.2 add.2, sll.2 bne.1, head.2 sll.1, lw.1 add.1 sll.1, lw.1 head.1, head.2
6 lw.2, bne.2 add.2, sll.2 bne.1 sll.1 add.1 lw.1, bne.1 head.1, head.2
7 lw.2, bne.2 add.2, sll.2 lw.1 sll.1 bne.1 add.2, sll.2 head.1, head.2
8 lw.2, bne.2 add.2 bne.1 sll.2, lw.2, bne.2 head.1, head.2
9 sll.2 add.2 bne.1 lw.2, bne.2 head.1, head.2
10 lw.2 sll.2 bne.1 bne.2 head.1, head.2
11 bne.1 lw.1 bne.2 head.1, head.2
12 bne.1 bne.2 head.1, head.2
13 head.1 bne.2 head.2
14 bne.2 lw.2 head.2
15 bne.2 head.2
16 head.2
Figure 5: (a) The CG-OoO architecture models. (b) Cycle-by-cycle flow of example instructions in CG-OoO.
Figure 6: The Branch Prediction Unit (BPU) micro-architecture.
Once completed, each control operation verifies the
next-block prediction correctness and updates the cor-
responding BPU entry(ies) accordingly. Since the BPU
is indexed by head PC’s, control operations access their
corresponding BPU entry(ies) by computing their head
PC using Equation 2.
$$PC_{\text{head}} = PC_{\text{control-op}} - \text{code-block-offset} \qquad (2)$$
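Equations 1 and 2 can be exercised together in a small sketch. The `BlockPredictor` class, its dictionary-backed table, the default of predicting the fall-through block when no history exists, and the addresses used are all assumptions made for illustration; the BP, BTB, and RAS structures of Figure 6 are not modeled.

```python
class BlockPredictor:
    """Toy next-block predictor indexed by head PC."""

    def __init__(self):
        self.table = {}  # head PC -> last observed next head PC

    def predict(self, pc_head, fall_through_block_offset):
        # Assumption: with no history, predict the fall-through block,
        # whose address follows Equation 1.
        fall_through = pc_head + fall_through_block_offset
        return self.table.get(pc_head, fall_through)

    def update(self, pc_control_op, code_block_offset, actual_next_head):
        # A resolved control operation recovers its head PC with
        # Equation 2 and trains the entry indexed by that head PC.
        pc_head = pc_control_op - code_block_offset
        self.table[pc_head] = actual_next_head


# Hypothetical addresses: head at 0x400, its control op at 0x420.
bp = BlockPredictor()
first_guess = bp.predict(0x400, 0x28)   # no history yet: fall-through
bp.update(0x420, 0x20, 0x400)           # the loop branch was taken
second_guess = bp.predict(0x400, 0x28)  # now predicts the loop head
```

Both the lookup and the update use the head PC as the index, so non-control instructions never touch the table.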
4.2 Fetch Stage
Figure 7a illustrates a control flow graph with five
basic-blocks. Each block is marked with its head iden-
tifier, h, at the top, and its control operation identifier
(if any), c, at the bottom. Figure 7b illustrates the
mapping of these basic-blocks to the instruction cache
where each box represents an instruction and each set
of adjacent boxes with the same color represent a fetch
group.5 Entries marked I represent non-control, non-
head operations in each basic-block in Figure 7a. As
shown in Figure 7b, an arbitrary number of head’s may
exist in a fetch group. Fetch and BPU handle all cases.
The Block PC Buffer holds either a next-PC address
or a 64’b0 where the former is the next code block
to fetch from the cache, and the latter is a hint that
next-PC is unknown. An unknown PC happens when
a head operation, hh, is predicted not-taken and hh it-
self is not yet fetched. Recall, the predictor needs to
have the fall-through-block-offset of hh to predict
the next block. In such cases, Fetch assumes the fall-
through block is adjacent to the hh block in memory;
so, it continues fetching the next block while the fall-
through block address for hh is computed.
5This section assumes no fetch-alignment [32].
Figure 7: (a) A control flow graph (CFG) with 5 blocks labeled B1
to B5. (b) Mapping of CFG operations onto a 4-wide instruction
cache.
4.3 Decode & Register Rename Stages
The Decode micro-architecture follows that of the
conventional OoO except for its additional function-
ality to identify global and local register operands by
appending a 1-bit flag, named Register Rename Flag
(RRF), next to each register identifier. If an instruction
holds a global operand, it accesses the register rename
(RR) tables for its physical register identifier; other-
wise, it would skip RR lookup (Figure 6d). Skipping
the register rename stage reduces the renaming lookup
energy by 30% on average. This saving is realized due
to our block-level execution model. Our RR evaluations
use the Merged Rename and Architectural Register File
model discussed in [33, 34, 35].
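The rename bypass can be sketched as follows, assuming operands are written with the r (local) and g (global) prefixes used in Figure 3. The function names, the dictionary-based rename table, and the physical register name `p17` are hypothetical.

```python
def decode_operands(operands):
    """Attach the 1-bit Register Rename Flag (RRF) to each operand.

    RRF is True for global operands ('g' prefix), False for local ones.
    """
    return [(op, op.startswith("g")) for op in operands]


def rename(decoded, rename_table):
    """Rename global operands only; return (operands, RR table lookups)."""
    renamed, lookups = [], 0
    for op, rrf in decoded:
        if rrf:
            renamed.append(rename_table[op])  # RR table access
            lookups += 1
        else:
            renamed.append(op)                # local operand: bypass RR
    return renamed, lookups


# One global and two local operands need a single RR lookup.
operands, rr_lookups = rename(decode_operands(["g1", "r0", "r2"]),
                              {"g1": "p17"})
```

An instruction with only local operands returns a lookup count of zero, which corresponds to skipping the RR lookup entirely.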
4.4 Issue Stage
Before discussing the CG-OoO issue model, let us
visit the Block Window micro-architecture shown
in Figure 8a. It consists of an Instruction Queue
(IQ), a Head Buffer (HB), a dedicated LRF, a GRF
segment, and a number of EU’s. IQ is a FIFO that
holds code block instructions. HB is a small buffer
that holds instructions waiting to be issued by the
Instruction Scheduler in a content addressable memory
(CAM) array. HB pulls instructions from the IQ and
waits for their operands to become ready for issue.
The CG-OoO issue model allows register file accesses
only to operations in the HB thereby (a) avoiding the
OoO post-issue register file read cycle, and (b) saving
the pre-issue large data storage overhead [36] by only
storing operands in HB’s. Because the number of
operations in all HB’s is a fraction of all in-flight
instructions, this model is as fast as the OoO pre-issue
model, and more energy efficient than both models.
4.4.1 Skipahead Instruction Scheduler
The Skipahead model allows limited out-of-order issue
of operations. The term limited means out-of-order
instruction issue is restricted to only the HB instructions,
a subset of all code block instructions. When a
HB instruction, Ins, becomes ready prior to HB instruction(s)
ahead of it, if Ins does not create a true or false
dependency with older instructions, it may be issued
out-of-order. Figure 8b shows the complexity effective
XOR logic used for dependency checking.
Figure 8: (a) The Block Window (BW) microarchitecture. (b) The logic to support the Skipahead issue model in the Head Buffer (HB).
For example, assuming a three-entry HB, in Figure 9,
instructions are issued as {1, 3} followed by {2, 4}.
Before issuing 3, its operands are dependency checked
against those of 2.
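The issue order in this example can be reproduced with a minimal software model of Skipahead. The four instructions below are hypothetical stand-ins for the Figure 9 snippet (2 depends on 1, and 4 depends on 3). The model also assumes single-cycle operations, two issue slots per cycle, and that a register becomes ready only once its producer has issued; none of these details are taken from the hardware design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ins:
    tag: int
    dst: str
    srcs: tuple


def conflicts(young, old):
    """True (RAW) or false (WAR, WAW) dependency of young on old."""
    raw = old.dst in young.srcs
    war = young.dst in old.srcs
    waw = young.dst == old.dst
    return raw or war or waw


def skipahead_schedule(program, ready_regs, hb_size=3, width=2):
    """Return the groups of instruction tags issued in each cycle."""
    iq, hb = list(program), []
    ready = set(ready_regs)
    groups = []
    while iq or hb:
        while iq and len(hb) < hb_size:       # HB pulls from the IQ
            hb.append(iq.pop(0))
        issued, skipped = [], []
        for ins in hb:                        # scan oldest first
            ok = (len(issued) < width
                  and all(s in ready for s in ins.srcs)
                  and not any(conflicts(ins, old) for old in skipped))
            (issued if ok else skipped).append(ins)
        if not issued:
            raise RuntimeError("no HB entry can issue")
        ready.update(i.dst for i in issued)   # visible from the next cycle
        hb = skipped
        groups.append([i.tag for i in issued])
    return groups


program = [
    Ins(1, "r1", ("r0",)),
    Ins(2, "r2", ("r1",)),   # waits for instruction 1
    Ins(3, "r3", ("r0",)),   # ready, and independent of skipped 2
    Ins(4, "r4", ("r3",)),   # waits for instruction 3
]
issue_groups = skipahead_schedule(program, ready_regs={"r0"})
in_order = skipahead_schedule(program, ready_regs={"r0"}, hb_size=1)
```

With a one-entry HB the same model degenerates to in-order issue, which is the sense in which out-of-order issue is limited by the HB size.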
The Skipahead model improves the CG-OoO perfor-
mance by 41% (Section 6.1) while enabling a selection
and wakeup model no more energy hungry than an in-
order issue model. The wakeup unit presents three
sources of energy efficiency. (a) In each BW, the wakeup
unit uses a small HB storage space to hold operand
data. In contrast to OoO, operations stored in the In-
struction Queue are not included in the wakeup pro-
cess. (b) The wakeup unit searches small CAM tables
for source operands. For instance, in a CG-OoO proces-
sor with 8 BW’s, each with 3 HB entries, the wakeup
unit accesses 48 CAM source operand entries. The OoO
baseline assumes 128 [24, 37] in-flight operations in the
Instruction Window to search for ready operands. (c) Local
write operands wake up source operands associated
with their own BW only.
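The count of 48 entries in (b) is consistent with two source operands per HB entry, which is an assumption read from the arithmetic rather than a stated parameter. A short sketch of that arithmetic, with a hypothetical helper name, applied to both designs:

```python
def wakeup_cam_entries(queues, entries_per_queue, srcs_per_entry=2):
    """Source-operand CAM entries searched by the wakeup unit.

    srcs_per_entry=2 is an assumption (two source operands per operation).
    """
    return queues * entries_per_queue * srcs_per_entry


cg_ooo_entries = wakeup_cam_entries(queues=8, entries_per_queue=3)
# The same assumption applied to a single 128-entry Instruction Window.
ooo_entries = wakeup_cam_entries(queues=1, entries_per_queue=128)
```

Under the same two-operand assumption, the 128-entry OoO window would expose 256 source operand entries to the wakeup search.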
4.5 Memory Stage
The CG-OoO Memory stage consists of a load-store-
unit (LSU) that operates at instruction granularity. A
squash is triggered when a sw conflicts with a younger
lw at which point the block holding the lw is flushed;
this means useful instructions older than lw are also
squashed. For instance, in Figure 9, if operation 2 were
to trigger a memory mis-speculation event, the entire
block, including operation 1, would be squashed. Flush-
ing useful operations is called wasted squash which the
compiler reduces by hoisting memory operations toward
the top of basic-blocks. Efficient memory speculation
models such as NoSQ [38] can further improve processor
energy efficiency by replacing associative LSU lookups
with indexed lookups. Evaluating the energy impact of
such designs is outside of the scope of this work.
4.6 Write Back & Commit Stage
Figure 10 shows the contents of a BROB entry; it
holds the block sequence number, block size, and block
global write (GW) register operand identifiers. BlkSize
is initialized by the corresponding head operation. The
GW fields are updated by instructions with global write
registers as they are steered from the Register Rename
stage to their BW. The compiler controls the number
of global write operands per code block.
Figure 9: A code snippet with two data-dependencies.
Figure 10: A Block Re-Order Buffer (BROB) entry
4.6.1 Write-Back Stage
Once an instruction completes, it writes its results
into either a designated register file entry (global or lo-
cal) or into the store queue. In Figure 10, BlkSize is
decremented upon each instruction completion; once its
value is zero, the corresponding block is completed.
4.6.2 Commit Stage
A block is committed when it is completed and is
at the head of the BROB. During commit, all global
registers modified by the block are marked Architec-
tural using the GW fields in BROB. Upon commit, sw
operations in the committing code block retire; in do-
ing so, the Store-Queue “commit” pointer moves to the
youngest sw belonging to the committing block. This
sw is found via searching for the youngest store oper-
ation whose Block SN matches that of the committing
block. Note, our LSU holds a Block SN column.
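The write-back and commit rules can be sketched with a small software model of the BROB. The class and method names are hypothetical, each entry is a plain dictionary, and the store-queue pointer update is omitted; only the BlkSize countdown and the in-order block commit are modeled.

```python
from collections import deque


class BROB:
    """Toy Block Re-Order Buffer: oldest block at the left."""

    def __init__(self):
        self.entries = deque()

    def allocate(self, block_sn, blk_size):
        # BlkSize is initialized by the head operation.
        self.entries.append({"sn": block_sn, "remaining": blk_size, "gw": []})

    def _find(self, block_sn):
        return next(e for e in self.entries if e["sn"] == block_sn)

    def add_global_write(self, block_sn, reg):
        self._find(block_sn)["gw"].append(reg)

    def complete_instruction(self, block_sn):
        # Write-back: BlkSize is decremented on each completion.
        self._find(block_sn)["remaining"] -= 1

    def commit(self):
        """Commit completed blocks from the head; return (SN, GW list)."""
        committed = []
        while self.entries and self.entries[0]["remaining"] == 0:
            entry = self.entries.popleft()
            committed.append((entry["sn"], entry["gw"]))
        return committed


brob = BROB()
brob.allocate(block_sn=1, blk_size=2)
brob.allocate(block_sn=2, blk_size=1)
brob.add_global_write(1, "g0")
brob.complete_instruction(2)        # block 2 completes first
early = brob.commit()               # nothing commits: block 1 is at the head
brob.complete_instruction(1)
brob.complete_instruction(1)
late = brob.commit()                # both blocks commit, oldest first
```

The returned GW lists stand for the global registers that would be marked Architectural at commit.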
Checkpoint-based processors [39, 40, 41] propose a
general concept applicable to many architectures (e.g.
iCFP [18]). While outside of the scope of this work,
coarse-grain checkpoint-based processing is promising
for extending the energy efficiency of CG-OoO.
4.7 Squash Handling
CG-OoO handles squash events through the follow-
ing steps: (a) the BPU history queue and Block PC
Buffer flush the content corresponding to wrong-path
blocks. The code block PC resets to the start of the
right path; in case of a control mis-speculation, the
right path is the opposite side of the control opera-
tion, and in case of a memory mis-speculation, it is the
start of the same code block. (b) All BW’s holding
code blocks younger than the mis-speculated operation
flush their IQ, Head Buffer, and mark LRF registers in-
valid. (c) LSU flushes operations corresponding to the
code block younger than the mis-speculated operation
by comparing the mis-speculated Block SN against that
of memory operations. (d) BROB flushes code block en-
tries younger than the mis-speculated operation. The
remaining blocks complete execution and commit.
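The block selection behind steps (b) through (d) can be sketched as a comparison of Block SNs. The function below is illustrative: it assumes sequence numbers grow with block age and never wrap, and it treats a memory mis-speculation as flushing the block that holds the lw as well, following Section 4.5.

```python
def squash(in_flight_sns, bad_sn, kind):
    """Split in-flight blocks into (kept, flushed) on a mis-speculation.

    in_flight_sns: Block SNs, oldest first.
    bad_sn: SN of the block holding the mis-speculated operation.
    kind: "control" or "memory".
    """
    if kind == "control":
        # The control operation ends its block, so that block is kept and
        # fetch restarts on the other side of the branch.
        first_flushed = bad_sn + 1
    elif kind == "memory":
        # Fetch restarts at the start of the same code block.
        first_flushed = bad_sn
    else:
        raise ValueError("kind must be 'control' or 'memory'")
    kept = [sn for sn in in_flight_sns if sn < first_flushed]
    flushed = [sn for sn in in_flight_sns if sn >= first_flushed]
    return kept, flushed


control_case = squash([4, 5, 6, 7], bad_sn=5, kind="control")
memory_case = squash([4, 5, 6, 7], bad_sn=5, kind="memory")
```

The BW’s, the LSU, and the BROB would each apply the same comparison to their own entries; the kept blocks then complete execution and commit.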
5. METHODOLOGY
The evaluation setup consists of an in-house compiler,
simulator and energy model. The compiler performs Local
Register Allocation and Global Register Allocation as
well as Static Block-Level List Scheduling for each program
basic-block. This means the output ISA differs
from the gcc generated x86 ISA. The simulator consists
of a Pin-based functional emulator attached to a tim-
ing simulator [42]. The emulator supports wrong-path
execution. The dynamic instructions produced by the
emulator are mapped to an internal ISA for process-
ing by the timing simulator. Table 2 outlines the con-
figurations used by the simulator to support the tim-
ing and energy model for the CG-OoO, OoO, and InO
processors. The simulator uses the cache and mem-
ory model in [43]. All evaluations support instruction
fetch alignment. They also support data forwarding between
EU’s. Evaluations start after 2 billion instructions,
warm up for 2 million instructions, and simulate
the following 20 million instructions. The simulator
handles precise exceptions by executing instructions in
its in-order mode. Once recovered from the exception,
the program resumes normal execution.
Parameter                        Value
-------------------------------  ---------------------------
Shared Parameters
  ISA                            x86
  Technology Node                22nm
  System clock frequency         2GHz
  L1, assoc, latency             32KB, 8, 4 cycles
  L2, assoc, latency             256KB, 8, 12 cycles
  L3, assoc, latency             4MB, 8, 40 cycles
  Main memory latency            100
  Instruction Fetch Width        1 to 8
  Branch Predictor               Hybrid
  - G-Share, Bimodal, Meta       8Kb, 8Kb, 8Kb
  - Global Pattern Hist          13b
  BTB, assoc, tag                4096 entries, 8, 16b
  Load/Store Queues              64, 32 entries, CAM
InO Processor
  Pipeline Depth                 7 cycles
  Instruction Queue              8 entries, FIFO
  Register File                  70 entries (64bit)
  Execution Unit                 1-8 wide
OoO Processor
  Pipeline Depth                 13 cycles
  Instruction Queue              128 entries, RAM/CAM
  Register File                  256 entries (64bit)
  Execution Unit                 1-8 wide
  Re-Order Buffer                160 entries
CG-OoO Processor
  Pipeline Depth                 13 cycles
  Number of BW’s                 3-18
  Instruction Queue / BW         10 entries, FIFO
  Head Buffer / BW               2-5 entries, RAM/CAM
  Execution Unit / BW            1-8 wide
  GRF, LRF / BW                  256, 20 entries (64bit)
  GRF Segments                   1-18
  Number of Clusters             1-3 clusters
  Block Re-Order Buffer          16 entries
Table 2: System parameters for each individual core
5.1 Energy Model
Our energy model produces per-access energy num-
bers for the simulator to use to compute the total en-
ergy of each hardware unit. This model extends the en-
ergy model in [43] to support tables, caches, wires, stage
registers, and execution unit energies and areas. It es-
timates per-access dynamic energy and per-cycle static
energy consumption. The simulator computes the total
dynamic energy by accumulating the per-access energy of
each unit on every access. It computes the total static energy by
multiplying the number of simulation cycles by the per-cycle
leakage energy of each unit. Other logical blocks in the
processor (e.g. control modules) are assumed to have
similar energy costs for the baseline OoO and the CG-OoO,
and to have a secondary effect on the overall energy
difference.
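The accounting described above can be illustrated with a short Python sketch. It is not the in-house simulator: the class name `EnergyMeter`, the unit names, and every energy value in the usage note are hypothetical placeholders chosen only to show the bookkeeping.

```python
from collections import Counter


class EnergyMeter:
    """Accumulates per-access dynamic energy and per-cycle static energy."""

    def __init__(self, per_access_pj, leakage_pj_per_cycle):
        self.per_access_pj = per_access_pj                # unit -> pJ per access
        self.leakage_pj_per_cycle = leakage_pj_per_cycle  # unit -> pJ per cycle
        self.accesses = Counter()
        self.cycles = 0

    def access(self, unit, count=1):
        # Called by the timing model whenever a hardware unit is accessed.
        self.accesses[unit] += count

    def tick(self, cycles=1):
        # Called once per simulated cycle.
        self.cycles += cycles

    def dynamic_pj(self):
        return sum(self.per_access_pj[u] * n for u, n in self.accesses.items())

    def static_pj(self):
        return self.cycles * sum(self.leakage_pj_per_cycle.values())

    def energy_per_cycle_pj(self):
        # Energy per cycle (EPC); requires at least one simulated cycle.
        return (self.dynamic_pj() + self.static_pj()) / self.cycles
```

For example, with made-up costs of 2.0 pJ per register file access and 1.0 pJ per instruction queue access, three register file accesses and one queue access contribute 7.0 pJ of dynamic energy, independent of how many cycles elapse.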
RAM tables are modeled as standard SRAM units
accessed through a decoder and read through sense amplifiers.
Static and dynamic energy numbers are generated using
SPICE. Additional steps then estimate area and scale
energy for different port configurations
and cache structures. Similarly, CAM tables
are designed as standard SRAM units accessed through
a driver input module and read through sense ampli-
fiers. To evaluate the energy and area of pipeline stage
registers, 6-NAND gate positive edge-triggered flip-flops
(FF) are simulated in SPICE.
Different 64-bit execution units including the add,
multiply, divide units for arithmetic and floating-point
operations are developed in Verilog and simulated in
the Design Compiler [4]. The Design Compiler provides
per-operation energy numbers for each unit.
HotSpot [44] is used for optimal chip floorplan and
wire length optimization. To extract wire energy num-
bers, we upgraded HotSpot to report wire energy us-
ing its wire length outputs. The energy per access
used for wires is 0.08 pJ/b-mm at the 22nm technol-
ogy node [45]. The simulator assumes all wires have
0.5 activity factor; so, every time the simulator drives a
wire, its energy consumption is incremented by half of
its per-access energy.
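As a worked example of the wire model, the sketch below combines the 0.08 pJ/b-mm figure and the 0.5 activity factor quoted above. The function name and the 64-bit, 1 mm transfer used in the usage note are illustrative assumptions; actual wire lengths come from the floorplan.

```python
WIRE_PJ_PER_BIT_MM = 0.08   # per-access wire energy at 22nm, from the text
ACTIVITY_FACTOR = 0.5       # assumed fraction of bits that toggle per transfer


def wire_energy_pj(bits, length_mm, transfers=1):
    """Energy charged for driving `bits` wires of `length_mm` `transfers` times."""
    return WIRE_PJ_PER_BIT_MM * ACTIVITY_FACTOR * bits * length_mm * transfers
```

Under these assumptions, driving a 64-bit operand over a hypothetical 1 mm wire costs 0.08 x 0.5 x 64 x 1 = 2.56 pJ per transfer.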
6. EVALUATION
CG-OoO achieves the performance of OoO at 48% of
its total energy cost on SPEC Int 2006 benchmarks [46].
This section quantifies the performance and energy ben-
efits of the CG-OoO processor and the pipeline stages
that contribute to its superior energy profile.
6.1 CG-OoO Performance Analysis
Figure 11a uses a 4-wide OoO superscalar processor
as the baseline for illustrating the relative performance
of a 4-wide InO processor with a CG-OoO processor (4-
wide front-end and 4 EU’s arranged as a single cluster).
In this case, the CG-OoO harmonic mean performance
is 7% lower than the OoO baseline. Performance results
are measured in terms of instructions per cycle (IPC).
In Figure 11b, the same 4-wide InO and OoO config-
urations are compared against a CG-OoO model with
a 4-wide front-end and 12 EU’s spread across 3 clus-
ters. In this configuration, the CG-OoO ILP reaches
that of the OoO. As can be observed for Hmmer, Bzip2,
and Libquantum benchmarks, the higher availability of
computation resources allows exploiting higher ILP.
Figure 11: Performance of InO and CG-OoO normalized to OoO.
(a) CG-OoO with 4-wide front-end & back-end. (b) CG-OoO with
4-wide front-end.
The first source of performance gain is static block-level
list scheduling. Figure 12 shows the effect of static
scheduling on performance. On average, static scheduling
increases the CG-OoO performance by 14%. In case
of Hmmer, 19% more MLP is observed with the original
binary schedule generated using gcc (with -O3 optimization
flag) than the code generated using static
block-level list scheduling. The higher MLP is due to
a superior global code schedule which in turn leads to
fewer stall cycles. In both cases, Hmmer performs better
than the OoO baseline.
Figure 12: Effect of static block-level list scheduling on CG-OoO
performance. The Skipahead 4 dynamic scheduling model is used
for both results (see Figure 13).
The next source of performance gain is through BLP.
To illustrate the contribution of BLP, let us assume
each BW can issue up to four operations in-order; that
is, if an instruction at the head of a BW queue is not
ready to issue, younger, independent operations in the
same queue do not issue. Other BW’s, however, can is-
sue ready operations to hide the latency of the stalling
BW. The No Skipahead bar in Figure 13 refers to this
setup. It shows on average 17% of the performance
gap between the InO and OoO is closed through BLP.
Benchmarks like H264ref and Sjeng exhibit better per-
formance for the InO model. This is because the InO
processor has a shallower pipeline depth (7 cycles) com-
pared to the CG-OoO processor (13 cycles) allowing
faster control mis-speculation recovery.
The last source of performance improvement in CG-
OoO is the Skipahead model where limited out-of-order
instruction scheduling within each BW is provided.
This feature leverages the Head Buffer tables. Figure 13
shows the performance gain obtained via varying the
number of HB entries. Without Skipahead, 17% of
the gap between OoO and InO is closed. Skipahead 2
refers to a HB with two entries; Skipahead 2 closes an
additional 67% of the performance gap between InO
and OoO. Skipahead 4 (i.e. 4-entry HB) closes the rest
of the performance gap. No significant performance
difference is observed for larger HB sizes. All CG-OoO
results use the statically list scheduled code.
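The difference between the No Skipahead and Skipahead N configurations can be illustrated with a simplified per-cycle issue model for a single BW. This is a sketch under assumptions, not the hardware algorithm: it assumes the Head Buffer holds stalled instructions that younger instructions skip past, that `ready` already reflects all operand dependences, and that at most `width` instructions issue per cycle. The names `issue_from_bw`, `hb_entries`, and `width` are hypothetical.

```python
def issue_from_bw(iq, ready, hb_entries, width=4):
    """Select instructions to issue from one Block Window in one cycle.

    iq: instruction ids in program order, oldest first.
    ready: set of ids whose operands are available this cycle.
    hb_entries: stalled instructions that may be skipped (0 means No Skipahead).
    Returns (issued, parked), where parked instructions wait in the Head Buffer.
    """
    issued, parked = [], []
    for inst in iq:
        if len(issued) == width:
            break                      # issue bandwidth exhausted
        if inst in ready:
            issued.append(inst)
        elif len(parked) < hb_entries:
            parked.append(inst)        # skip ahead past the stalled instruction
        else:
            break                      # Head Buffer full: stop scanning
    return issued, parked
```

With `hb_entries=0` the first stalled instruction blocks everything behind it, which corresponds to the in-order per-BW issue of the No Skipahead bar; raising `hb_entries` lets more ready instructions behind a stall issue in the same cycle.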
Figure 13: The performance attribute of the Skipahead model.
Figure 14 shows the CG-OoO performance as the pro-
cessor front-end width varies from 1 to 8. Comparing
the harmonic mean results for the OoO and CG-OoO
shows the CG-OoO processor is superior on narrower
designs. A wider front-end delivers more dynamic op-
erations to the back-end. Because the OoO model has
access to all in-flight operations, it can exploit a larger
effective instruction window. Despite the larger number
of in-flight operations, the CG-OoO model maintains a
limited view of the in-flight operations, making an 8-wide
CG-OoO machine not much superior to its 4-wide
counterpart.
6.2 CG-OoO Energy Analysis
In this section, the source of energy saving within
each stage is discussed. Overall, CG-OoO shows an
average 48% energy reduction across all benchmarks.
Energy results are measured in terms of energy per
cycle (EPC). Figure 15a shows the total energy level
for the CG-OoO, OoO, and InO processors; Figure 15b
shows the harmonic mean energy breakdown for dif-
ferent pipeline stages; all benchmarks follow a similar
energy breakdown trend as the harmonic mean. This
figure shows the main energy savings are in the Branch
Prediction, Register Rename, Issue, Register File access,
and Commit stages. Figure 15c shows 61% average
energy saving for the CG-OoO compared to the OoO
baseline with similar performance. Since the main contribution
of this paper is an energy efficient processor
core, Figure 15c excludes cache and memory energy.
Figure 14: The speedup of the OoO and CG-OoO for different
front-end widths.
Figure 15: CG-OoO, OoO, InO energy per cycle (EPC).
(a) Normalized EPC of processors (cache energy included).
(b) Harmonic mean EPC breakdown of all processors.
(c) Normalized EPC of processors (cache energy excluded).
Figure 16 shows the inverse of energy-delay (ED)
product indicating the favorable energy-delay charac-
teristics of the CG-OoO over OoO for all benchmarks,
even those that fall short of the OoO performance such
as Sjeng and Gobmk. The CG-OoO is 1.9× more effi-
cient than the OoO on average.
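The relation between IPC, EPC, and the inverse energy-delay product can be made explicit with a small sketch. It assumes both designs execute the same instruction count at the same clock frequency, so relative delay is the inverse of relative IPC and relative total energy is relative EPC times relative delay. The function name and the numbers in the usage note are hypothetical and are not measurements from this work.

```python
def normalized_inverse_edp(ipc, epc, ipc_base, epc_base):
    """Inverse energy-delay product of a design relative to a baseline.

    Assumes equal instruction counts and equal clock frequency for both designs.
    """
    delay = ipc_base / ipc               # relative execution time
    energy = (epc / epc_base) * delay    # relative total energy
    return 1.0 / (energy * delay)
```

For instance, a design with half the baseline EPC at equal IPC scores 2.0, while the same design at 90% of the baseline IPC scores 0.9^2 / 0.5 = 1.62. The quadratic dependence on IPC is why benchmarks that fall short of the OoO performance need a larger EPC saving to stay ahead on this metric.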
Figure 17 shows the static and dynamic energy break-
down for different benchmarks relative to the OoO base-
line. On average, the leakage energy is smaller than 4%
of the total energy.
6.2.1 Block Level Branch Prediction
Block-level branch prediction is primarily focused on
saving energy by accessing the branch prediction unit
at block granularity rather than fetch-group granularity.
Figure 18 shows the average block sizes for SPEC Int
2006 benchmarks. For a benchmark application with
average block size of eight running on a 4-wide proces-
sor, this translates to roughly 2× reduction in the num-
ber of accesses to the BPU tables. Figure 19 shows the
relative energy-per-cycle for the CG-OoO model compared
to the OoO baseline. On average, Block Level BP
is 53% more energy efficient than the OoO model. Hmmer
shows 83% reduction in branch prediction energy
because of its larger average code block size.
Figure 16: The CG-OoO inverse of energy-delay product normalized
to OoO.
Figure 17: Static and dynamic EPC normalized to OoO.
Figure 18: Average code block size of SPEC Int 2006 benchmarks.
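The roughly 2× figure quoted for an eight-instruction block on a 4-wide processor follows from counting lookups. The sketch below assumes the baseline accesses the BPU once per fetch group, that blocks start on a fetch-group boundary, and that block-level prediction needs one lookup per code block; the function name is hypothetical.

```python
import math


def bpu_lookups(block_size, fetch_width):
    """Return (lookups per block at fetch-group granularity, at block granularity)."""
    per_fetch_group = math.ceil(block_size / fetch_width)  # one lookup per fetch group
    per_block = 1                                          # one lookup per code block
    return per_fetch_group, per_block
```

With `block_size=8` and `fetch_width=4` this gives two lookups against one, i.e. a 2× reduction; larger blocks increase the ratio, which is consistent with the larger saving reported for Hmmer.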
6.2.2 Register File Hierarchy
The CG-OoO register file hierarchy contributes to the
processor energy savings in four different ways, each of
which is discussed here: (a) LRF’s are low energy tables,
(b) the segmented GRF reduces access energy, (c) local
operands bypass register renaming, and (d) register renaming
is optimized to reduce on-chip data movement.
Local registers are statically managed, and account
for 30% of the total data communication. The 20-entry
LRF energy-per-access is about 25× smaller than that
of a unified, 256-entry register file in the baseline OoO
processor. The LRF has 2 read and 2 write ports and
the unified register file has 8 read and 4 write ports.
In addition, since each BW holds a LRF near its in-
struction window and execution units, operand reads
and writes take place over shorter average wire lengths.
LRF’s also enable additional energy saving by avoid-
ing local write-operand wakeup broadcasts. Figure 20
shows the contribution of the local register file energy
compared to the OoO baseline; it shows an average 26%
reduction in register file energy consumption due to local
register accesses.
Figure 19: The BPU access EPC normalized to OoO.
Figure 20: The register file (RF) access EPC normalized to the
OoO processor. Overall, the CG-OoO RF hierarchy is 94% more
efficient than that of the OoO. S-GRF 9 shows the energy of a
CG-OoO GRF with 9 segments, and RF shows the energy of an
InO processor register file with the same number of ports as the
OoO processor.
Because local register operands are statically allo-
cated, they do not require register renaming. As a
result, 23% average energy consumption reduction is
observed in the register rename stage (see Figure 21).
The global register file used in both OoO and CG-
OoO has 256 entries. While the use of local registers en-
ables the use of a smaller global register file in CG-OoO
without noticeable reduction in performance, our exper-
iments use equal global register file sizes for fair energy
and performance modeling between CG-OoO and OoO.
To reduce the access energy overhead of a unified reg-
ister file and to increase the aggregate number of ports
in the CG-OoO, this processor model breaks the global
register file (GRF) into multiple segments. Each seg-
ment is placed next to a BW. The access energy to
each register file segment is divided by the number of
segments relative to the OoO unified register file access
energy. Figure 20 also shows the contribution of the
global register file energy compared to the OoO base-
line; it shows an average 68% reduction in the global
register file energy consumption due to register file seg-
mentation. Notice GRF segmentation is not commonly
used in OoO architectures; some ARM architectures
bank the register file for various purposes such as better
thread context switching support [47]. Figure 22 shows
the effect of register file segmentation on energy. It
shows the case of a unified GRF, one GRF segment per
cluster (for a 3-cluster CG-OoO), and one GRF segment
per BW. As the number of register segments increases,
energy consumption decreases linearly.
Figure 21: The register rename access EPC normalized to OoO.
Figure 22: The S-GRF access EPC normalized to OoO.
Placing a GRF segment next to each BW is energy
saving when operations read/write global operands
from/to the closest segment. Our register renaming
algorithm reduces data communication over wires by
allocating an available physical register from the GRF
segment nearest to the BW of the renamed instruction.
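The nearest-segment allocation policy can be sketched as follows. The paper states only that a free physical register is taken from the segment nearest to the renaming BW; the fallback to the next-nearest segment when that segment is empty, the linear distance metric over segment indices, and the names used below are assumptions made for illustration.

```python
def allocate_global_register(free_lists, bw_segment):
    """Pick a free physical register, preferring the GRF segment nearest the BW.

    free_lists: one list of free physical register ids per GRF segment.
    bw_segment: index of the segment placed next to the renaming BW.
    Returns (segment, register), or None if no register is free.
    """
    # Visit segments by increasing distance from the BW (ties: lower index first).
    order = sorted(range(len(free_lists)), key=lambda s: (abs(s - bw_segment), s))
    for seg in order:
        if free_lists[seg]:
            return seg, free_lists[seg].pop()
    return None  # rename must stall until a register is freed
```

Allocating from the local segment whenever possible keeps most global operand reads and writes on the short wires next to the BW, which is the source of the data-movement saving described above.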
6.2.3 Instruction Scheduling
The CG-OoO processor introduces the Skipahead is-
sue model. In OoO and CG-OoO, in-flight instructions
are maintained in queues that are partly RAM and
partly CAM tables. For the InO model, instructions are
held in a small FIFO buffer. Figure 23 shows the en-
ergy breakdown of the dynamic scheduling hardware; it
shows the majority of the OoO scheduling energy (75%)
is in reading and writing instructions from the RAM
table. Another 20% of the OoO energy is in CAM ta-
ble accesses. The “Rest” of the energy is consumed in
stage registers and the interconnects used for instruc-
tion wakeup and select. This figure also indicates 90%
average reduction in the CG-OoO RAM table energy
(relative to OoO RAM energy) which is due to access-
ing smaller SRAM tables, and 95% average reduction
in the CAM table energy which is due to using 2 to 4-
entry Head Buffers (HB) instead of the 128-entry CAM
tables used in the baseline OoO instruction queue. The
“Rest” average energy is increased by 40% due to the
additional pipeline registers at the issue stage. Overall, the
CG-OoO issue stage is 84% more efficient than OoO.
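The overall issue-stage figure can be cross-checked from the component numbers quoted above. The only assumption in this sketch is that the "Rest" share of the OoO scheduling energy is the remaining 5% after the 75% RAM and 20% CAM shares; the function name is hypothetical.

```python
def cg_issue_energy_fraction():
    """CG-OoO issue energy as a fraction of the OoO issue energy."""
    ooo_share = {"RAM": 0.75, "CAM": 0.20, "Rest": 0.05}                    # OoO breakdown
    cg_scale = {"RAM": 1 - 0.90, "CAM": 1 - 0.95, "Rest": 1 + 0.40}         # CG-OoO vs. OoO
    return sum(ooo_share[k] * cg_scale[k] for k in ooo_share)
```

The sum is 0.075 + 0.010 + 0.070 = 0.155, i.e. a saving of about 84.5%, which is consistent with the reported 84%. It also shows that, after the RAM and CAM reductions, the stage registers and interconnect account for nearly half of the remaining CG-OoO issue energy.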
Figure 23: The instruction issue EPC normalized to OoO.
6.2.4 Block Re-Order Buffer
The CG-OoO processor maintains program order at
block-level granularity. This makes read-write accesses
to the BROB substantially fewer than those to the OoO
ROB. Block write operations are done after decoding
each head and block reads are done at the commit stage.
Instructions access BROB to notify the corresponding
block entry of their completion. In addition, since the
BROB is designed to maintain program order at block
granularity, it is provisioned to have 16 entries rather
than 160 entries used for OoO [37]. The 10× reduction
in the re-order buffer size makes all read-write opera-
tions 10× less energy consuming. Figure 24 shows 76%
average energy saving for CG-OoO.
Figure 24: The commit EPC normalized to OoO.
6.3 Clustering and Scaling Analysis
The CG-OoO architecture focuses on reducing pro-
cessor energy through designing a complexity-effective
architecture; to remain competitive with the OoO per-
formance, this architecture supports a larger number
of execution units (EU). To do so, the CG-OoO model
must employ a design strategy that is more scalable
than the OoO. A cluster consists of a number of BW’s
sharing a number of EU’s. To illustrate the effect of
different clustering configurations, the experimental re-
sults in this section assume three clusters.
Figures 25a and 25b show the normalized average per-
formance and energy of SPEC Int 2006 benchmarks ver-
sus the number of BW’s per cluster for various number
of EU’s per cluster. The speedup figure shows some
clustering configurations reach beyond the performance
of the OoO. All clustering models exhibit substantially
lower energy consumption overhead compared to the
OoO design. The most energy efficient configuration is
the one with 1 BW and 1 EU per cluster; it is 63%
more energy efficient than the OoO, but only at 65%
of the OoO performance. The highest-performance
configuration evaluated here is the one with 6 BW’s
and 8 EU’s per cluster; it is 39% more energy efficient
than the OoO, and reaches 104% of the OoO performance. The
design configuration evaluated throughout this section
corresponds to the cross-over performance point with 3
BW’s and 4 EU’s per cluster.
Figure 26 shows the energy-performance char-
acteristics of the CG-OoO model plotting all the
cluster configurations presented above. The lowest
energy-performance point in the plot refers to the 1
BW, 1EU per cluster configuration and the highest
energy-performance point refers to the 6 BW, 8 EU
per cluster configuration. This figure suggests as the
processor resource complexity increases, the energy-
performance characteristics grow proportionally.
Beyond a certain scaling point, the wakeup/select and
load-store unit wire latencies become so large that the
energy-performance proportionality of the CG-OoO
will break. Identifying this energy-performance point
is outside of the scope of this work.
Figure 25: Normalized Performance & Energy for different clustering
configurations: (a) performance, (b) energy. All configurations
assume a 3-cluster CG-OoO model; the total number of BW’s and
EU’s is calculated through multiplying the above numbers by 3.
Here, performance is measured as the harmonic mean of the IPC
and the energy is measured as the harmonic mean of the EPC over
all the SPEC Int 2006 benchmarks.
Figure 26: The energy versus performance plot showing different
CG-OoO configurations normalized to the OoO. The CG-OoO
core configurations illustrate the energy-performance proportionality
attribute of the CG-OoO. Performance is measured as the
harmonic mean of the IPC and energy is measured as the harmonic
mean of the EPC over the SPEC Int 2006 benchmarks.
7. CONCLUSION
The CG-OoO leverages a distributed micro-
architecture capable of issuing instructions from
multiple code blocks concurrently. The key enablers of
energy efficiency in the CG-OoO are (a) its end-to-end
complexity effective design, and (b) its effective use
of compiler assistance in doing code clustering and
generating efficient static code schedules. Despite the
reliance of the CG-OoO architecture on energy-efficient
static code, it requires no profiling. This
architecture is an energy-proportional design capable
of scaling its hardware resources to larger or smaller
computation units according to the workload demands
of programs at runtime. The CG-OoO supports an
out-of-order issue model at block granularity and a lim-
ited out-of-order issue model at instruction granularity
(i.e. within block). It leverages a hierarchical register
file model designed for energy efficient data transfer.
Unlike most previous studies, this work performs a
detailed processor energy modeling analysis. CG-OoO
reaches the performance of the out-of-order execution
model with over 50% energy saving.
8. REFERENCES
[1] L. A. Barroso and U. Hölzle, “The case for
energy-proportional computing,” Computer, no. 12,
pp. 33–37, 2007.
[2] K. Czechowski, V. W. Lee, E. Grochowski, R. Ronen,
R. Singhal, R. Vuduc, and P. Dubey, “Improving the energy
efficiency of big cores,” in Proceeding of the 41st annual
international symposium on Computer architecture,
pp. 493–504, IEEE Press, 2014.
[3] Y. Watanabe, J. D. Davis, and D. A. Wood, “Widget:
Wisconsin decoupled grid execution tiles,” in ACM
SIGARCH Computer Architecture News, vol. 38, pp. 2–13,
ACM, 2010.
[4] O. Azizi, A. Mahesri, B. C. Lee, S. J. Patel, and
M. Horowitz, “Energy-performance tradeoffs in processor
architecture and circuit design: a marginal cost analysis,”
in ACM SIGARCH Computer Architecture News, vol. 38,
pp. 26–36, ACM, 2010.
[5] D. S. McFarlin, C. Tucker, and C. Zilles, “Discerning the
dominant out-of-order performance advantage: is it
speculation or dynamism?,” in ACM SIGPLAN Notices,
vol. 48, pp. 241–252, ACM, 2013.
[6] H.-S. Kim and J. E. Smith, “An instruction set and
microarchitecture for instruction level distributed
processing,” in Computer Architecture, 2002. Proceedings.
29th Annual International Symposium on, pp. 71–81,
IEEE, 2002.
[7] S. Palacharla, N. P. Jouppi, and J. E. Smith,
Complexity-effective superscalar processors, vol. 25. ACM,
1997.
[8] A. Zmily and C. Kozyrakis, “Simultaneously improving
code size, performance, and energy in embedded
processors,” in Proceedings of the conference on Design,
automation and test in Europe: Proceedings, pp. 224–229,
European Design and Automation Association, 2006.
[9] G. Reinman, T. Austin, and B. Calder, “A scalable
front-end architecture for fast instruction delivery,” in ACM
SIGARCH Computer Architecture News, vol. 27,
pp. 234–245, IEEE Computer Society, 1999.
[10] E. Hao, P.-Y. Chang, M. Evers, and Y. N. Patt, “Increasing
the instruction fetch rate via block-structured instruction
set architectures,” International Journal of Parallel
Programming, vol. 26, no. 4, pp. 449–478, 1998.
[11] F. Tseng and Y. N. Patt, “Achieving out-of-order
performance with almost in-order complexity,” in Computer
Architecture, 2008. ISCA’08. 35th International
Symposium on, pp. 3–12, IEEE, 2008.
[12] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh,
D. Burger, S. W. Keckler, and C. R. Moore, “Exploiting ilp,
tlp, and dlp with the polymorphous trips architecture,” in
Computer Architecture, 2003. Proceedings. 30th Annual
International Symposium on, pp. 422–433, IEEE, 2003.
[13] D. Burger and S. W. Keckler, “19.5: Breaking the gop/watt
barrier with edge architectures,” in GOMACTech
Intelligent Technologies Conference, 2005.
[14] G. S. Sohi, S. E. Breach, and T. Vijaykumar, “Multiscalar
processors,” in ACM SIGARCH Computer Architecture
News, vol. 23, pp. 414–425, ACM, 1995.
[15] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith,
“Trace processors,” in Microarchitecture, 1997.
Proceedings., Thirtieth Annual IEEE/ACM International
Symposium on, pp. 138–148, IEEE, 1997.
[16] K. Khubaib, M. A. Suleman, M. Hashemi, C. Wilkerson,
and Y. N. Patt, “Morphcore: An energy-efficient
microarchitecture for high performance ilp and high
throughput tlp,” in Microarchitecture (MICRO), 2012 45th
Annual IEEE/ACM International Symposium on,
pp. 305–316, IEEE, 2012.
[17] A. Hilton and A. Roth, “Bolt: energy-efficient out-of-order
latency-tolerant execution,” in High Performance Computer
Architecture (HPCA), 2010 IEEE 16th International
Symposium on, pp. 1–12, IEEE, 2010.
[18] A. Hilton, S. Nagarakatte, and A. Roth, “icfp: Tolerating
all-level cache misses in in-order processors,” in High
Performance Computer Architecture, 2009. HPCA 2009.
IEEE 15th International Symposium on, pp. 431–442,
IEEE, 2009.
[19] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin,
“Wavescalar,” in Proceedings of the 36th annual
IEEE/ACM International Symposium on
Microarchitecture, p. 291, IEEE Computer Society, 2003.
[20] D. Burger, S. W. Keckler, K. e. McKinley, M. Dahlin, L. K.
John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, and
W. Yoder, “Scaling to the end of silicon with edge
architectures,” Computer, vol. 37, no. 7, pp. 44–55, 2004.
[21] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and
R. A. Bringmann, “Effective compiler support for
predicated execution using the hyperblock,” in ACM
SIGMICRO Newsletter, vol. 23, pp. 45–54, IEEE Computer
Society Press, 1992.
[22] S. T. Srinivasan, R. Rajwar, H. Akkary, A. Gandhi, and
M. Upton, “Continual flow pipelines,” ACM SIGPLAN
Notices, vol. 39, no. 11, pp. 107–119, 2004.
[23] R. D. Barnes, S. Ryoo, and W.-m. W. Hwu, “Flea-flicker
multipass pipelining: An alternative to the high-power
out-of-order offense,” in Proceedings of the 38th annual
IEEE/ACM International Symposium on
Microarchitecture, pp. 319–330, IEEE Computer Society,
2005.
[24] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt,
“Runahead execution: An alternative to very large
instruction windows for out-of-order processors,” in
High-Performance Computer Architecture, 2003. HPCA-9
2003. Proceedings. The Ninth International Symposium on,
pp. 129–140, IEEE, 2003.
[25] T. Nowatzki, V. Gangadhar, and K. Sankaralingam,
“Exploring the potential of heterogeneous von
neumann/dataflow execution models,” in Proceedings of the
42nd Annual International Symposium on Computer
Architecture, pp. 298–310, ACM, 2015.
[26] P. Salverda and C. Zilles, “Fundamental performance
constraints in horizontal fusion of in-order cores,” in High
Performance Computer Architecture, 2008. HPCA 2008.
IEEE 14th International Symposium on, pp. 252–263,
IEEE, 2008.
[27] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen,
“Low-power cmos digital design,” IEICE Transactions on
Electronics, vol. 75, no. 4, pp. 371–382, 1992.
[28] J. H. Tseng and K. Asanović, “Banked multiported register
files for high-frequency superscalar microprocessors,” ACM
SIGARCH Computer Architecture News, vol. 31, no. 2,
pp. 62–71, 2003.
[29] A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides, “Design
Tradeoffs for the Alpha EV8 Conditional Branch
Predictor,” in Proc. IEEE/ACM Symp. on Computer
Architecture (ISCA), pp. 295–306, 2002.
[30] M. Mohammadi, S. Han, T. Aamodt, and W. Dally,
“On-demand dynamic branch prediction,”
[31] S. Bird, A. Phansalkar, L. K. John, A. Mericas, and
R. Indukuru, “Performance characterization of spec cpu
benchmarks on intel’s core microarchitecture based
processor,” in SPEC Benchmark Workshop, 2007.
[32] S. W. Mahin, S. M. Conor, S. J. Ciavaglia, L. H.
Moulton III, S. E. Rich, and P. D. Kartschoke, “Superscalar
instruction pipeline using alignment logic responsive to
boundary identification logic for aligning and appending
variable length instructions to instructions stored in cache,”
Apr. 29 1997. US Patent 5,625,787.
[33] R. E. Kessler, E. J. McLellan, and D. A. Webb, “The alpha
21264 microprocessor architecture,” in Computer Design:
VLSI in Computers and Processors, 1998. ICCD’98.
Proceedings. International Conference on, pp. 90–95,
IEEE, 1998.
[34] L. Gwennap, “Mips r12000 to hit 300 mhz,” Microprocessor
Report, vol. 11, no. 13, p. 1, 1997.
[35] G. Hinton, D. Sager, M. Upton, D. Boggs, et al., “The
microarchitecture of the pentium® 4 processor,” in Intel
Technology Journal, Citeseer, 2001.
[36] A. González, F. Latorre, and G. Magklis, “Processor
microarchitecture: An implementation perspective,”
Synthesis Lectures on Computer Architecture, vol. 5, no. 1,
pp. 1–116, 2010.
[37] Intel Corporation, “Intel 64 and ia-32 architectures
optimization reference manual,” 2009.
[38] T. Sha, M. M. Martin, and A. Roth, “Nosq: Store-load
communication without a store queue,” in Proceedings of
the 39th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 285–296, IEEE Computer Society,
2006.
[39] H. Akkary, R. Rajwar, and S. T. Srinivasan, “Checkpoint
processing and recovery: Towards scalable large instruction
window processors,” in Microarchitecture, 2003.
MICRO-36. Proceedings. 36th Annual IEEE/ACM
International Symposium on, pp. 423–434, IEEE, 2003.
[40] A. Cristal, M. Valero, J. Llosa, and A. Gonzalez, “Large
virtual robs by processor checkpointing,” tech. rep.,
Technical Report UPC-DAC-2002-39, Universitat
Politecnica de Catalunya, 2002.
[41] A. Cristal, D. Ortega, J. Llosa, and M. Valero,
“Out-of-order commit processors,” in Software, IEE
Proceedings-, pp. 48–59, IEEE, 2004.
[42] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser,
G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood,
“Pin: building customized program analysis tools with
dynamic instrumentation,” ACM Sigplan Notices, vol. 40,
no. 6, pp. 190–200, 2005.
[43] S. Das, T. M. Aamodt, and W. J. Dally, “Slip: reducing
wire energy in the memory hierarchy,” in Proceedings of the
42nd Annual International Symposium on Computer
Architecture, pp. 349–361, ACM, 2015.
[44] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan,
K. Skadron, and M. R. Stan, “Hotspot: A compact thermal
modeling methodology for early-stage vlsi design,” Very
Large Scale Integration (VLSI) Systems, IEEE
Transactions on, vol. 14, no. 5, pp. 501–513, 2006.
[45] Y. Cao, T. Sato, D. Sylvester, M. Orshansky, and C. Hu,
“Predictive technology model,” Internet: http://ptm.asu.edu,
2002.
[46] J. L. Henning, “Spec cpu2006 benchmark descriptions,”
ACM SIGARCH Computer Architecture News, vol. 34,
no. 4, pp. 1–17, 2006.
[47] Keil Tools by ARM, “ARM registers.”
http://www.keil.com/support/man/docs/armasm/armasm_dom1359731128950.htm.
[Online; accessed 24-August-2015].
