An energy efficient dependence driven scalable dispatch scheme by Nadathur, SriRam G
Retrospective Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 
1-1-2003 
An energy efficient dependence driven scalable dispatch scheme 
SriRam G. Nadathur 
Iowa State University 
Recommended Citation 
Nadathur, SriRam G., "An energy efficient dependence driven scalable dispatch scheme" (2003). 
Retrospective Theses and Dissertations. 19516. 
https://lib.dr.iastate.edu/rtd/19516 
An energy efficient dependence driven scalable dispatch scheme 
by 
SriRam G Nadathur 
A thesis submitted to the graduate faculty 
in partial fulfillment of the requirements for the degree of 
MASTER OF SCIENCE 
Major: Computer Engineering 
Program of Study Committee: 




Iowa State University 
Ames, Iowa 
2003 
Copyright © SriRam G Nadathur, 2003. All rights reserved.
Graduate College 
Iowa State University 
This is to certify that the master's thesis of 
SriRam G Nadathur 
has met the thesis requirements of Iowa State University 
Signatures have been redacted for privacy 
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1. INTRODUCTION
1.1 Trends in high performance computing
1.2 The conventional superscalar microarchitecture
1.3 Pipeline stages of a superscalar processor
1.3.1 Fetch stage
1.3.2 Dispatch stage
1.3.3 Issue stage
1.3.4 Write-back stage
1.3.5 Commit stage
1.4 Cache memory
1.5 Thesis contribution
1.6 Thesis organization
CHAPTER 2. COMPLEXITY OF THE DISPATCH LOGIC
2.1 Register renaming and dependence check
2.2 Register renaming logic
2.3 Rename map table (RMT)
2.3.1 RMT design
2.3.2 RMT delay model
2.4 Dependence check logic (DCL)
2.4.1 DCL design
2.4.2 DCL delay model
2.5 Dispatch scalability problem
2.6 Complexity of other critical structures
2.7 Related work
CHAPTER 3. STUDY OF DEPENDENCE BEHAVIOR IN PROGRAMS
3.1 Program behavior and dependency relationships
3.2 RMT port utilization in SPEC CPU benchmarks
3.3 Implications of port utilization on RMT design
CHAPTER 4. DEPENDENCE DRIVEN DISPATCH STAGE
4.1 Proposed dispatch stage schema
4.2 Port contention resolution
4.3 Delay analysis of modified dispatch scheme
4.4 Accommodating wider dispatch in the proposed schema
CHAPTER 5. PERFORMANCE EVALUATION
5.1 Simulation environment
5.2 Performance characteristics
5.3 Energy characteristics
CHAPTER 6. IPC DRIVEN CACHE ARCHITECTURE
6.1 Introduction
6.2 Overview
6.3 IPC based load/store classification
6.4 Sequential access schedule determination
6.4.1 Consistency of IPC classification
6.5 Cache Energy-Delay Model
CHAPTER 7. EXPERIMENTAL EVALUATION
7.1 Methodology
7.2 Simulation environment
7.3 Performance characteristics
7.4 Energy characteristics
CHAPTER 8. CONCLUSIONS
8.1 Dependence driven dispatch
8.2 IPC driven sequential cache
LIST OF TABLES
2.1 RMT Access Delay for 8, 12 and 16 Wide Dispatch in 0.18 µm Process
5.1 Access and activity energy reduction
6.1 Sequential Way Access Schedule
6.2 Energy Consumption Methodology
7.1 Performance and Energy Characteristics
LIST OF FIGURES
1.1 Reservation Station Model Based Superscalar Pipeline
2.1 Register Rename Logic
2.2 DCL Schematic: XORs, Precharged NOR array and Pass transistor based Chain Logic
3.1 Actual Port Utilization for gcc, mcf & equake
3.2 Normalized Port Utilization for gcc, mcf & equake
3.3 Port Utilization for SPEC 2000 Benchmarks (Average)
4.1 Modified Dispatch Scheme with Reduced Number of Ports
4.2 DCL Schematic: XORs, Precharged NOR array and Pass transistor based Chain Logic
5.1 IPC Behavior of Modified Dispatch Stage for SPEC2000 benchmarks
5.2 RMT Access Energy Reduction
5.3 Port Activity Reduction
6.1 Issue Select Stage Classifies the IPC of Load/Store Instruction
6.2 Implementing Sequential Way Access
6.3 Issue IPC Based Classification of Loads in SPEC2000 Benchmarks
6.4 Distribution of associativity requirements at each IPC epoch
6.5 Consistency of IPC Classification
6.6 Set Associative Cache Architecture
7.1 Performance Characteristics of IPC driven cache architecture
7.2 Energy Characteristics of IPC driven cache architecture
ABSTRACT 
This thesis proposes two schemes that target superscalar microarchitectures. The first
scheme aims at alleviating the complexity of the dispatch logic by reducing the number of ports
to the renamer. The second scheme targets L-1 data cache energy reduction by proposing a
cache architecture incorporating IPC-aware dynamic associativity management.
The dispatch stage constitutes a key critical path in the scalability of current generation
superscalar processors. The rename map table (RMT) access and the dependence check logic
(DCL) delays scale unfavorably with the dispatch width (DW) of a superscalar processor. It
is a well-known program property that the results of most instructions are consumed within
a window of the following 4-6 instructions. This program behavior can be exploited to reduce
the rename delay by reducing the number of read/write ports in the RMT to significantly below
the current 3 × DW. We propose an algorithm to dynamically allocate a reduced number of
RMT ports to instructions in the current dispatch window, matching dispatch resources to
average needs rather than peak needs. This results in shorter RMT access delays as well as
lower energy in the dispatch stage. The IPC reduction due to rename map table read/write
port contention in the proposed scheme stays within 2-4%. The cycle time saved can also be
leveraged to support wider dispatch in the same cycle time in order to offset this degradation.
Data caches are designed with higher associativities to support data sets corresponding to
peak load/store bandwidths. Existing schemes have incorporated a limited degree of dynamic
associativity: either direct mapped or the full available associativity (say 4-way). The second
part of this thesis explores a more general design space for dynamic associativity (for a 4-way
associative cache, consider 1-way, 2-way, and 4-way associative accesses). The other
major departure is in the associativity control mechanism. We use the actual instruction level
parallelism exhibited by the instructions surrounding a given load to classify it as an IPC_k load
(for 1 ≤ k ≤ IW with an issue width of IW) in a superscalar architecture. The lookup schedule
(such as a 2-way lookup in (Way 0, Way 1) in the first cycle, followed on a miss by a 2-way
lookup in (Way 2, Way 3)) is fixed in advance for each IPC classifier 1 ≤ k ≤ IW, for up to IW
distinct lookup schedules. The schedules are as way-disjoint as possible for load/stores with
different IPC classifications. The energy savings over SPEC2000 CPU benchmarks average
28.6% for a 32KB, 4-way, L-1 data cache. The resulting performance (IPC) degradation from
the dynamic way schedule is restricted to less than 2.25%, mainly because IPC based placement
ends up being an excellent classifier.
CHAPTER 1. INTRODUCTION
1.1 Trends in high performance computing 
The superscalar microarchitecture is widely deployed in current generation microprocessors.
To satisfy the ever increasing demand for higher levels of computing power, computer architects
are investigating techniques to improve the performance of superscalar processors. The trend in
the microprocessor industry favors increasingly complex out-of-order microarchitectures with
the intention of exploiting larger amounts of instruction level parallelism. On one front, there
is a drive towards finding more and more instructions to be executed in parallel, while on the
other front, there is effort at the microarchitecture, circuit and process levels to reduce the
cycle time. Both of these contribute to higher Instructions Per Second (IPS). At the same time,
with decreasing feature sizes and exploding numbers of transistors on chip, energy consumption
and heat dissipation have become issues of great importance. In addition, computer architects
face the non-scalability of the superscalar implementation with respect to parameters such as
the dispatch width, issue width and window size. This thesis targets, on
one hand, the complexity and scalability problems faced by computer architects with respect to
some pipeline stages, and on the other hand, certain cache optimizations that aim at achieving
energy efficiency in L-1 data caches.
1.2 The conventional superscalar microarchitecture 
A block diagram of the modern superscalar processor is shown in Figure 1.1. The
microarchitecture delivers high performance by executing multiple instructions in parallel every
cycle. The operation may be explained as follows. Multiple instructions are fetched from the
instruction cache every cycle. The instructions are then decoded in the dispatch stage, which also
performs the register renaming and the dependence check. The instructions are dispatched to
the reservation stations at the end of the cycle. The map table for the dispatch stage requires
2 × n read ports and n write ports in order to dispatch n instructions in one cycle. Thus
the complexity of the dispatch stage increases quadratically with respect to the dispatch width.
Reducing this complexity forms the first objective of this thesis. The instructions wait in the
reservation stations for their operands to become available. The instruction window logic
continuously monitors the dependencies among the instructions it holds and selects
the appropriate instructions for parallel execution. The number of instructions selected by the
hardware to issue in parallel is determined by the issue width of the processor. The issue logic
is one of the most performance-critical components in a superscalar processor. Along with the
dynamic scheduler window, it largely determines the amount of instruction level parallelism
that can be extracted. Instructions are issued either to the integer or floating point pipelines,
or, if they are load/stores, they are sent to the load/store queue for subsequent data cache
access. Current generation processors come with multi-banked, multi-level associative cache
memories and try to maintain L-1 data cache access times of a few cycles (typically two to
three cycles). Instructions that complete write their results into the reorder buffer in the
write-back stage of the pipeline. The results of those instructions are also bypassed to the
dependent instructions waiting in the reservation stations. The bypass delay is one of the biggest
problems as far as scalability of superscalar processors is concerned. Finally, the instructions
are committed in program order to the register file/memory from the reorder buffer.
1.3 Pipeline stages of a superscalar processor 
Modern superscalar processors typically have at least five pipeline stages: fetch, dispatch,
issue, write-back and commit. The main reason for increasing the pipeline depth is the
growing demand for high speed in general purpose applications. This directly results in
higher clock frequencies as technology parameters are scaled. As the clock frequency
increases, the amount of work that can be done in a single clock cycle decreases. So
current superscalar processors are designed with a pipeline depth of 8 to 20 stages. The
important stages and their functionalities are discussed below. Note that each of these stages
might take one or more cycles depending on its pipelinability.
Figure 1.1 Reservation Station Model Based Superscalar Pipeline
1.3.1 Fetch stage 
The fetch stage of the superscalar processor fetches the instructions from the instruction 
cache (I-Cache). As most of the superscalar processors allow speculative execution, they employ 
dynamic branch prediction techniques. To determine whether a branch is predicted taken, the
branch prediction hardware is accessed in parallel with the I-Cache. The Branch Target Buffer
(BTB) supplies the target address of the branch if the branch is predicted taken and its entry is
present in the BTB. The address given by the BTB then becomes the Program Counter (PC) for
the next fetch cycle.
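The next-PC selection described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function `next_fetch_pc`, the dictionary-based BTB, the `predict_taken` callback and the fixed 4-byte instruction size are all hypothetical, and the sketch tracks a single fetch address rather than a full fetch group.

```python
# Hypothetical model of next-PC selection in the fetch stage.
FETCH_BYTES = 4  # assumed fixed instruction size, for illustration only


def next_fetch_pc(pc, btb, predict_taken):
    """Return the PC for the next fetch cycle.

    btb: dict mapping a branch PC to its predicted target address.
    predict_taken: callable(pc) -> bool, standing in for the branch predictor.
    """
    target = btb.get(pc)              # BTB lookup, done alongside the I-Cache access
    if target is not None and predict_taken(pc):
        return target                 # BTB hit and predicted taken: redirect fetch
    return pc + FETCH_BYTES           # otherwise continue with sequential fetch


btb = {0x1000: 0x2000}
always_taken = lambda pc: True
never_taken = lambda pc: False

print(hex(next_fetch_pc(0x1000, btb, always_taken)))
print(hex(next_fetch_pc(0x1000, btb, never_taken)))
```

The redirect happens only when both conditions hold, which mirrors the requirement that the branch be predicted taken and have an entry in the BTB.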
1.3.2 Dispatch stage 
The dispatch stage is where the dependencies are checked among instructions that are being
dispatched to the reservation stations in the subsequent cycle. This is the last stage in the
pipeline that maintains the instructions in program order. The logical destination registers
of all the instructions dispatched are also renamed to physical destination registers. The
source operands of all these instructions have to get the latest renamed values. So there is a
map table from which all the instructions read the physical register names (physical tags) of
their source operands. For a dispatch width of n instructions, this table should have 2 × n read
ports and n write ports. The instructions with their physical tags are then dispatched to the
reservation stations, which may be centralized (RUU) [6] or distributed (Tomasulo Algorithm)
[7].
1.3.3 Issue stage 
The issue stage looks for independent instructions in the instruction window and tries to
issue the ready instructions to the free functional units, subject to the issue width and
the number of free functional units. The instructions thus issued start executing in these
functional units. The load instructions check the load/store queue (LSQ) to see if any
outstanding (uncommitted) stores are accessing the same address. If so, bypassing of the value
occurs within the LSQ. Otherwise, a data cache read/write is initiated depending on the
availability of a cache port and the system bus. From the implementation point of view, the issue
stage may be said to contain the wakeup logic and the select logic. When an instruction completes
and a result operand is produced, the wakeup logic wakes up consumer instructions (that
hibernate in the instruction window). An instruction is ready when both its operands are ready;
the select logic chooses a subset of the ready instructions and arbitrates among them for the
available functional unit resources. There is a lot of research into breaking the wakeup-select
loop while still maintaining back-to-back execution of dependent instructions. There are also
novel issue schemes such as queuing, dependence based issue, tag elimination and broadcast free
scheduling.
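The wakeup and select steps described above can be illustrated with a small software model. This is a minimal sketch, not the hardware organization: the entry layout (`name`, `waiting`), the physical tag names and the oldest-first selection policy are assumptions made for the example.

```python
def wakeup(window, result_tag):
    """Broadcast result_tag: every operand waiting on it becomes ready."""
    for entry in window:
        entry["waiting"].discard(result_tag)


def select(window, issue_width, free_units):
    """Issue up to min(issue_width, free_units) ready entries, oldest first (assumed policy)."""
    limit = min(issue_width, free_units)
    ready = [e for e in window if not e["waiting"]]   # no outstanding source tags
    chosen = ready[:limit]
    for e in chosen:
        window.remove(e)                              # issued entries leave the window
    return [e["name"] for e in chosen]


# Window in program order; "waiting" holds the tags of operands not yet produced.
window = [
    {"name": "i0", "waiting": set()},
    {"name": "i1", "waiting": {"p7"}},
    {"name": "i2", "waiting": {"p7", "p9"}},
    {"name": "i3", "waiting": set()},
]

first = select(window, issue_width=2, free_units=4)    # i0 and i3 are ready
wakeup(window, "p7")                                   # i1 becomes ready; i2 still waits on p9
second = select(window, issue_width=2, free_units=4)
print(first, second)
```

In hardware the tag comparison happens in every window entry at once; the loop in `wakeup` only stands in for that parallel match.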
1.3.4 Write-back stage 
The completing instructions write their result values into the Re-Order Buffer (ROB) entries
allocated to them; the ROB also takes care of in-order commit. The results of the
completing instructions are also broadcast to the dependent instructions. The instructions in
the reservation stations check the broadcast bus to see if the tag on the bus matches their
source operand tags (as explained in the issue stage functionality).
1.3.5 Commit stage 
It is imperative to support a precise exception model even in a superscalar processor that
supports out-of-order processing of instructions. The hardware structures that
ensure in-order commit are the ROB or the RUU. The instructions are retired in program
order to the register file from the ROB. The RUU acts as both the reservation station
and the ROB; the reservation stations are centralized in that case. The ROB model can be
used with both centralized and distributed reservation stations.
1.4 Cache memory 
The instruction and data caches provide low latency access to instructions and memory
operands respectively. In order to provide the necessary load/store bandwidth in a superscalar
processor, the cache has to be banked or duplicated (and/or multi-ported). The associativities
range from direct mapped to 4-way set associative, depending on the applications that a
given processor targets. Not only do caches take up a significant fraction of the die area, but
they also contribute anywhere from 15% to 45% of the total processor energy. The access
time of a cache frequently lies on the critical path (due to the high hit rates of current
day caches) and depends on the associativity as well as the size of the cache. Let us
assume a 4-way set associative cache for all discussions from here on in this thesis. It should
be noted that irrespective of the actual ILP delivered in the aforementioned pipeline stages,
each cache access initiates activity in all four ways. Data ends up being found in only one
of these ways, so the work done in the other ways is wasted. To alleviate this problem,
sequential associative caches were proposed. But they suffer from very high access times and
hence are not efficient. This thesis seeks to explore a relationship between the ILP delivered
in the pipeline and the amount of work done by each load in accessing the cache. Based on
this relationship, we propose a dynamic associativity management cache architecture that has
the energy advantages of a sequential associative cache without compromising on
access times. In this way we seek to reduce the energy per cache access, within reasonable
performance degradation. In effect, the thesis proposes an IPC assisted placement scheme for
facilitating sequential access with high hit rates.
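To make the trade-off between parallel and sequential way access concrete, the following sketch counts how many ways are activated, and how many lookup cycles are used, before the data is found. It assumes that the number of activated ways is a rough proxy for access energy, which is a simplification; the function name and the schedule encoding are hypothetical. The two-step schedule follows the example given in the abstract.

```python
def ways_probed(hit_way, schedule):
    """Count ways activated until hit_way is found, or until the schedule ends.

    schedule: list of tuples of way indices, one tuple per lookup cycle.
    hit_way: index of the way holding the data, or None for a cache miss.
    Returns (ways_activated, cycles_used).
    """
    activated = 0
    for cycle, ways in enumerate(schedule, start=1):
        activated += len(ways)        # every way in this step is read
        if hit_way in ways:
            return activated, cycle   # data found, later steps are skipped
    return activated, len(schedule)   # miss: every step was tried


PARALLEL = [(0, 1, 2, 3)]                      # conventional 4-way access
TWO_STEP = [(0, 1), (2, 3)]                    # 2-way lookup, then the remaining two ways
FULLY_SEQUENTIAL = [(0,), (1,), (2,), (3,)]    # one way per cycle

for name, sched in [("parallel", PARALLEL), ("two-step", TWO_STEP),
                    ("sequential", FULLY_SEQUENTIAL)]:
    print(name, ways_probed(0, sched), ways_probed(3, sched))
```

A hit in the first step of a sequential schedule activates fewer ways at no extra latency, while a hit in a later step costs extra cycles, which is why placing data in the ways probed first matters.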
1.5 Thesis contribution 
This thesis proposes two independent techniques targeting the complexity and energy issues
explained in the above sections. The first technique targets the dispatch stage complexity and
aims at reducing the number of ports in the rename map table to reduce cycle time and
rename energy. The second technique aims at reducing the cache access energy by exploiting
the relationship between the ILP delivered and the number of ways that need to be accessed.
1.6 Thesis organization 
The rest of the thesis is organized as follows: Chapter 2 introduces the dispatch stage
in greater detail and develops a complexity framework and a VLSI model for the dispatch
components. Chapter 3 presents a detailed analysis of the dependence behavior in programs
and the way in which it can be leveraged to reduce the number of ports in the renamer.
Chapter 4 explains the proposed dispatch stage schema and addresses the issues arising from
its new organization. Chapter 5 presents a detailed account of the experimental evaluation
of the scalable dispatch stage. Chapter 6 introduces the dynamic associativity management
cache architecture that forms the second proposed design in this thesis. This chapter also
builds a model for estimating the energy-delay product of the proposed scheme as against a
conventional set associative cache. Chapter 7 presents performance and energy characteristics
of the proposed cache access schedules with respect to SPEC2000 CPU benchmarks. Finally,
Chapter 8 summarizes the contribution of the thesis.
CHAPTER 2. COMPLEXITY OF THE DISPATCH LOGIC 
In this chapter, the dispatch stage functionality, design and complexity are explored. We
build a model to analyze the delay associated with each hardware structure constituting the
dispatch stage of a current generation superscalar pipeline.
The dispatch stage of a superscalar processor appears right after the instruction fetch stage.
During the dispatch stage, sets of instructions, fetched in the previous cycle, are removed from
the instruction fetch buffer by the dispatcher. Each such set constitutes a dispatch group.
Instruction decoding, register renaming and dependence check are the three operations
carried out on every dispatch group. Instruction decoding refers to the act of deciphering the
opcode and associating the instruction with the appropriate data path of the processor. Once
the opcode is decoded, the respective resource allocator is apprised of the potential future
resource requirement (e.g. functional unit requirement). Apart from instruction decode, the
dispatch stage also sets up control and data dependence based linkages among instructions, which
become a guideline for the remaining pipeline stages. The dispatch stage is examined in greater
detail in the following sections.
2.1 Register renaming and dependence check 
Data hazards prevent a program from executing instructions in parallel by imposing
restrictions on their schedule. This seriously limits the ideal gains obtainable through
pipelining. Data dependences occur among instructions that may access (read or write) the
same storage location (a register or a memory location). Measures must be taken to enforce the
correct ordering of instructions that reference the same location, failing which data integrity,
and hence program correctness, would be violated. Data hazards may be classified into true
dependencies and artificial/name dependencies. A true dependency hazard arises when the consuming
instruction tries to read a source before the producer writes into it. This is also referred to as
the read-after-write (RAW) hazard. Write-after-read (WAR) hazards occur when an instruction
tries to write a destination before it is read by a preceding instruction. Write-after-write
(WAW) hazards occur when an instruction tries to write a value before the same location
is written by a preceding instruction. WAR and WAW hazards are artificial hazards since
they can be removed completely by providing a larger set of registers to work with. Superscalar
processors remove artificial hazards by providing a run-time mapping of a small set of logical
registers to a large rename space of physical registers. This process is called register renaming.
RAW hazards cannot be removed by any modern superscalar technique, since they are a natural
program property. Hence the detection of inter-instruction dependencies forms the basis for
determining the schedule of dependent instructions. This action is performed by the dependence
check logic. This sums up the dispatch stage functionalities.
Figure 2.1 Register Rename Logic
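The three hazard classes can be stated compactly in code. The sketch below is illustrative only: the instruction encoding (a `dst` register and a tuple of `srcs`) and the function name `classify_hazards` are assumptions, and memory dependences are ignored.

```python
def classify_hazards(earlier, later):
    """Return the set of register hazards between an earlier and a later instruction.

    Each instruction is a dict with 'dst' (register name or None) and
    'srcs' (tuple of register names). Memory dependences are not modeled.
    """
    hazards = set()
    if earlier["dst"] is not None and earlier["dst"] in later["srcs"]:
        hazards.add("RAW")   # later reads what earlier writes (true dependence)
    if later["dst"] is not None and later["dst"] in earlier["srcs"]:
        hazards.add("WAR")   # later overwrites a register that earlier still reads
    if later["dst"] is not None and later["dst"] == earlier["dst"]:
        hazards.add("WAW")   # both write the same register
    return hazards


add = {"dst": "r1", "srcs": ("r2", "r3")}   # r1 = r2 + r3
sub = {"dst": "r4", "srcs": ("r1", "r5")}   # r4 = r1 - r5
mul = {"dst": "r2", "srcs": ("r6", "r7")}   # r2 = r6 * r7
mov = {"dst": "r1", "srcs": ("r8",)}        # r1 = r8

print(classify_hazards(add, sub), classify_hazards(add, mul), classify_hazards(add, mov))
```

Only the RAW case survives renaming: giving `mul` and `mov` fresh physical destination registers removes the WAR and WAW conflicts with `add`.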
2.2 Register renaming logic 
The high level block diagram of the register rename logic is shown in Figure 2.1. The register
rename logic is used to translate logical register designators into physical register designators
(a larger set). This is accomplished by accessing a map table that holds the current logical to
physical mappings, specifically for the instructions in flight. The logical register designators are
used to index into the map table during accesses. If a given instruction produces a result, the
logical destination register is assigned a free physical register from the physical rename space
and the map table is updated to reflect the new mapping for this logical register. This is done
for the benefit of subsequent instructions that consume this value. However, since renaming
of all instructions within a dispatch window happens in parallel within the same cycle, there
could be cases where a logical register being renamed is written by an earlier instruction
within the current dispatch window. The Dependence Check Logic (DCL) detects such reuse
in order to prevent violation of true dependencies. The DCL compares the logical register
designator being renamed against the destination register designators of earlier instructions
in the current rename group. Upon a match, the tag corresponding to the physical register
assigned to the earlier instruction (producer) is used instead of the tag read from the map table.
This is achieved by setting the pre-RMT multiplexers based on the outcome of the comparison.
The multiplexers select the new name that is being written into the RMT in place of the
old value that is being read from the RMT. This process is further simplified since the reads
happen during the first half of the cycle and the writes happen during the second half. For our
discussions, let us assume the dispatch width to be n. For the dispatch of n instructions in one
cycle, the RMT needs to be multi-ported with 3n ports (2n read ports and n write ports). This
is because each of the n instructions needs the rename mapping of its two source operands and
needs to write the rename mapping of one destination operand. For example, for a dispatch
width of 16 instructions, the RMT has 48 ports. This linear scaling of port count forms a
fundamental source of complexity in the dispatch stage. The dependence check logic proceeds
in parallel with the register renaming. In order to compare a logical register designator with
the destination registers of earlier instructions, the DCL employs $n^2 - n$ comparators. Each
comparison is 5 bits wide (with 32 logical registers in the architecture). Hence, as the
dispatch stage is scaled from, say, 8-wide to 12-wide or 16-wide in order to extract more ILP, the
number of comparators required increases quadratically, thus forming another source of
complexity in the dispatch stage.
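The interaction between the map table and the dependence check can be modeled in software as follows. This is a behavioral sketch of a conventional rename stage under stated assumptions, not the circuit and not the reduced-port scheme proposed later in the thesis: the function `rename_group`, the dictionary-based RMT, the list-based free list and the tag names are all hypothetical.

```python
def rename_group(group, rmt, free_list):
    """Rename one dispatch group; returns a list of (phys_dst, phys_srcs).

    group: list of (dst, srcs) with logical register numbers; dst may be None.
    rmt: dict mapping logical register -> physical tag (state before this group).
    free_list: list of free physical tags, consumed from the front.
    """
    renamed = []
    new_names = []                      # (logical dst, physical tag) of earlier instructions
    for dst, srcs in group:
        phys_srcs = []
        for src in srcs:
            tag = rmt[src]              # RMT read: mapping from before this group
            for prev_dst, prev_tag in new_names:
                if prev_dst == src:     # DCL comparator match with an earlier destination
                    tag = prev_tag      # later matches overwrite: closest producer wins
            phys_srcs.append(tag)
        phys_dst = free_list.pop(0) if dst is not None else None
        new_names.append((dst, phys_dst))
        renamed.append((phys_dst, tuple(phys_srcs)))
    for dst, phys_dst in new_names:     # RMT writes happen after all reads
        if dst is not None:
            rmt[dst] = phys_dst
    return renamed


rmt = {1: "p1", 2: "p2", 3: "p3"}
free_list = ["p10", "p11", "p12"]
group = [(1, (2, 3)),    # r1 = r2 op r3
         (1, (1, 2)),    # r1 = r1 op r2   (reads the r1 written just above)
         (3, (1, 1))]    # r3 = r1 op r1   (reads the most recent r1)
print(rename_group(group, rmt, free_list))
print(rmt)
```

Deferring the table writes to the end of the function mirrors the read-in-first-half, write-in-second-half convention described above; the inner loop stands in for the comparators and the priority logic that picks the closest preceding producer.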
2.3 Rename map table (RMT)
2.3.1 RMT design
High performance VLSI implementations of the Rename Map Table (RMT) use the RAM
(Random Access Memory) scheme because of the flexibility and scalability it offers, as against
the CAM scheme. Popular machines like the MIPS R10000 conform to the RAM implementation
of the RMT. In the RAM scheme, the map table is implemented as a register file where each
entry contains the physical register that is mapped to the logical register whose designator is
used to index the table. The bits of the physical register designators are stored in cross-coupled
inverters in each cell. A read operation starts with the logical register designator being applied
to the decoder. The decoder decodes the logical register designator and raises one of the word
lines. This triggers bit-line changes which are sensed and amplified by the sense amplifier,
and the appropriate output is generated. Precharged bit lines are used to speed up the read
operations. It should be recalled that the RMT is multi-ported.
2.3.2 RMT delay model 
The critical path for the rename logic is the time it takes for the bits of the physical register
designators to be output after the logical register designators are applied to the address decoder.
The delay of the critical path consists of the following components: the time taken to decode
the logical register designator, the time taken to drive the word line, the time taken by the
access stack to pull the bit line low and the time taken to sense this change and trigger the
corresponding output. Each extra port adds one word line per bit row and one bit line (or
two if the complemented bit is also stored) per bit column. This stretches the word line and bit
line lengths and capacitances by a factor proportional to the number of ports, and hence increases
the delay along the critical path. Palacharla analyzes each component of the critical path delay
and provides a mathematical model for the overall rename table access time as:
$$\mathrm{Delay} = c_0 + c_1 \cdot DW + c_2 \cdot DW^2 \qquad (2.1)$$
Table 2.1 RMT Access Delay for 8, 12 and 16 Wide Dispatch in 0.18 µm Process
The effective parasitics involved in the decoder path are those of the predecode lines and the
gate structures. Typically 3 or 4 bits are pre-decoded, followed by the actual decode, using a
NAND-NOR structure. The word line turns on the access transistors of each cell. As the number
of physical registers increases, the number of bits needed to represent them also increases,
resulting in longer word lines. The bit line capacitance (and hence delay) is derived from the
drain diffusion capacitance of the access transistor for each row and the metal capacitance of
the bit line. An increase in the number of ports increases the bit line and word line lengths (and
their capacitances). Given that the quadratic components, which result from the intrinsic RC
delay, are negligible in the design space involving DW = 8, 12 or 16, the overall delay
of the rename map table can be approximated as a linear function of dispatch width, as shown
by the following equation:
$$\mathrm{Delay} = c_0 + c_1 \cdot DW \qquad (2.2)$$
Palacharla [11] estimated the constants involved in the RMT access for different technologies
for dispatch widths up to 8. We extend the same methodology to find the delay for 12 wide and
16 wide dispatch windows. These delays for a 0.18 µm MOSIS process are shown in Table 2.1.
The constant delay component $c_0$ in Eq. 2.2 evaluates to 396.1 ps, whereas the scaling factor
for the linear component of the delay, $c_1$, is found to be 20.7 ps ($c_2$ is neglected).
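As a quick numerical illustration, Eq. 2.2 can be evaluated with the constants quoted above. The function name is hypothetical, and the printed values come from the linear approximation only; they are not the measured entries of Table 2.1.

```python
C0_PS = 396.1   # constant component c0 of Eq. 2.2, in picoseconds
C1_PS = 20.7    # linear coefficient c1 of Eq. 2.2, in picoseconds per unit of dispatch width


def rmt_delay_ps(dispatch_width):
    """Linear approximation of the RMT access delay (Eq. 2.2), in picoseconds."""
    return C0_PS + C1_PS * dispatch_width


for dw in (8, 12, 16):
    print(f"DW = {dw:2d}: {rmt_delay_ps(dw):.1f} ps")
```

Under this approximation, every additional instruction of dispatch width adds 20.7 ps to the access time, so going from 8-wide to 16-wide dispatch adds 165.6 ps.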
2.4 Dependence check logic (DCL)
2.4.1 DCL design
The dependence check logic proceeds in parallel with the map table access. Every logical
register designator being renamed is compared against the destination register designators
(logical) of earlier instructions in the current rename group. If there is a match, then the tag
corresponding to the physical register assigned to the earlier instruction is used instead of the
tag read from the map table. It should be noted that here we refer to the closest preceding
producer of a value, in case there is more than one match. The number of comparators
required in the dependence check logic is a cause of concern when the scalability of the dispatch
stage is considered. The DCL requires $n^2 - n$ comparators to check the dependencies among
n instructions. As the issue width widens, the area attributable to the comparators and routing
becomes a potential problem.
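The comparator count follows from a direct enumeration, sketched below under the assumption of two source operands per instruction (as in the 3n port count used earlier). The function name is hypothetical. The instruction at position i in the group compares each of its two sources against the i earlier destinations, giving 2(0 + 1 + ... + (n - 1)) = n(n - 1) comparators.

```python
def dcl_comparators(n, srcs_per_instr=2):
    """Count source-versus-earlier-destination comparisons in an n-wide dispatch group."""
    count = 0
    for i in range(n):                 # i = number of earlier instructions in the group
        count += srcs_per_instr * i    # each source is checked against i destinations
    return count


for n in (8, 12, 16):
    print(n, dcl_comparators(n), n * n - n)
```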
2.4.2 DCL delay model 
There are three components to the DCL delay, namely the bus delay in propagating the 
logical source and destination registers to the comparators, the latency in comparison logic 
operation and finally the delay in propagating the results of comparison through a chain logic 
to determine the closest preceding producer. Of these three components, the chain logic delay 
forms the non-scalable component since it involves a signal that has to sequentially ripple across 
the entire dispatch width. Since the DCL delay is a fraction of the RMT access time (and 
DCL is in parallel with RMT), it does not affect the c*itical path. In our scheme, however, we 
propose to perform DCL before RMT access. Hence careful modeling of the DCL delay becomes 
important. We model the DCL delay as follows: delay of comparisons followed by the chain 
delay: DCL = tcornp + ~chazn • All comparisons proceed in parallel. Hence t~~p corresponds to 
the delay of one 5-bit comparison for an architecture with 321ogical registers. XOR structures 
[19] are used for comparison followed by a precharged parallel HMOs network to determine the 
overall outcome of comparison. SPICE modeling of the aforementioned circuitry (Figure 2.2) 
using TSMC .18~ process parameters from MOSIS [10] provides the critical path delay. In 
13 






























Figure 2.2 DCL Schematic: XOR,s, Precharged NOR array and Pass tran-
sistor based Chain Logic 
Figure 2.2 the Eq_i signal is asserted if the ith instruction's result should be garnered by the
corresponding source operand. The match_i signal indicates if a DCL match for this source
operand has been found. For dispatch widths of 8, 12 and 16, we found the DCL delay t_DCL
to be 41 ps, 52 ps, and 65 ps respectively. This conforms to approximately linear scaling of
the DCL delay with increasing dispatch widths.
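The function computed by the comparators and chain logic can be illustrated in software. The following is a behavioral sketch only, assuming each instruction is described by a hypothetical (dest, src1, src2) tuple of logical register numbers; it says nothing about circuit timing. The backward walk mirrors the role of the chain logic in selecting the closest preceding producer.

```python
def dependence_check(group):
    """Behavioral model of the DCL for one dispatch group.

    group: list of (dest, src1, src2) logical register numbers, in program
    order (None marks an unused operand). For every instruction, returns a
    tuple with one entry per source operand: the index of the closest
    preceding producer inside the group, or None when no in-group producer
    exists and the operand's mapping must be read from the RMT.
    """
    result = []
    for j, (_, *sources) in enumerate(group):
        producers = []
        for src in sources:
            match = None
            # Walk backwards so the first hit is the nearest producer.
            for i in range(j - 1, -1, -1):
                if src is not None and group[i][0] == src:
                    match = i
                    break
            producers.append(match)
        result.append(tuple(producers))
    return result


# r1 is defined by slot 0 and redefined by slot 2; slot 3 must pick slot 2.
group = [(1, 2, 3), (4, 1, 5), (1, 4, 4), (6, 1, 2)]
print(dependence_check(group))
```

In hardware all comparisons happen in parallel and only the selection ripples along the chain, which is why the chain is the component whose delay grows with dispatch width.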
2.5 Dispatch scalability problem 
It is observed from the detailed analysis of rename map table and dependence check logic 
delay that it is going to be extremely difficult to improve the performance of a superscalar
microarchitecture by designing for higher instruction level parallelism. This thesis proposes 
to alleviate this scalability issue by utilizing the dependency structure inherent to a program 
to arrive at a design point that matches average dispatch needs (rather than peak needs). 
Simulations show that the negative impact of designing for average dispatch needs on the 
performance is negligibly small and is offset by the energy gains and greater dispatch potential. 
2.6 Complexity of other critical structures 
The other pipeline stages of a superscalar processor also present scalability problems with 
increasing issue widths. A lot of research emphasis is attached to cache memory, branch
prediction and fetch structures that can efficiently support aggressive backends. The window
wakeup logic, coupled with the selection logic, forms the critical loop that limits cycle time in
current generation designs. The wake-up and select functionalities are implemented using a CAM
(Content-Addressable Memory) scheme and form the window for dynamic scheduling. Every
time a result is produced, the tag associated with the result is broadcast to all instructions in 
the issue window. Each instruction then compares the tag with the tag of its source operands. 
If a match occurs, the operand is marked available and depending on the availability of the 
next operand, the instruction is deemed ready for execution. The issue window, as discussed 
earlier, is a CAM array holding one instruction per entry. It is found that the wakeup logic 
delay is linearly proportional to the issue width. The selection logic picks instructions from 
a pool of ready instructions in the issue window for execution and arbitrates ready instruc-
tions for respective functional unit resources. Selection logic delay depends on the number of 
function units and the window size. It is found to increase logarithmically with the window 
size, for each functional unit. The number of levels (depth) of gates needed for arbitration 
also goes up as the window size increases leading to routability and area issues. Apart from 
these structures, the data bypass logic, responsible for bypassing result values to consumer 
instructions in the window, is another source of complexity. The number of bypasses required 
depends on the depth of the pipeline as well as the issue width of the microarchitecture. It 
is found that the number of bypass paths and their delay grow quadratically with the issue
width and window size.
2.7 Related work
The complexity of different stages as the factors limiting superscalar scalability was
quantified by Palacharla [11], [12]. Franklin & Sohi [5] characterize the liveness spans of
typical register allocated results. We use this information in characterizing the access
patterns of the rename map table. Sankaranarayanan & Tyagi [13] propose a hierarchical scheme
for DCL to ameliorate the scalability issues related to DCL. They, however, do not address the
scalability issues relating to the rename map table. Sprangle & Carmean [17] introduce a
multi-cycle map table access methodology for deeply pipelined machines. Our work tries to
maximize dispatch capability and could be used in multi-cycle as well as single-cycle map table
access models. Sprangle & Patt [18] reduce the comparator complexity of the DCL through static
and dynamic means. Ernst & Austin [4] considered reducing the number of comparators in the
result buses of a superscalar microarchitecture. The driving philosophy in our work is similar
to this work [4] inasmuch as a more efficient microarchitecture can be derived by focusing on
average needs rather than peak needs.
CHAPTER 3. STUDY OF DEPENDENCE BEHAVIOR IN PROGRAMS 
In this chapter, the locality in the placement of producer and consumer instructions is
examined with respect to the SPEC CPU 2000 benchmarks. An intuitive observation about
dependence based instruction behavior is developed. Based on simulations, this intuitive
observation is justified and the actual need for rename map table ports is calculated. This is
followed by a discussion of how to leverage the understanding of average port needs in
designing an RMT-DCL combination that efficiently caters to average dispatch needs rather than
peak needs.
3.1 Program behavior and dependency relationships 
Typical programs exhibit strong locality in the placement of producer and consumer
instructions. Franklin & Sohi [5] show that 60% of the register allocated results have a useful
lifetime (number of instructions separating the definition of a result from its last use) of
five. This program behavior is reflected in the following expected dispatch window
characteristic. Consider a dispatch group. The instructions at the head-end of the dispatch
window will most likely garner their source operand(s) from the instructions in flight. Hence,
they will read their source operands' rename mappings from the RMT. The allocation of two RMT
read ports for these instructions is justified. For a dispatch window exceeding size five, it is
extremely likely that all the consumers of their result are within the dispatch window [5].
Such instructions, then, need not commit their result operand's rename mapping to the RMT with
high likelihood. This implies that the allocation of one RMT write port for these instructions
is an over-commitment. Similarly, the instructions at the tail-end of the dispatch window will
almost certainly need to commit their result operand's rename mapping to the RMT (since all the
consumer instructions will appear in the following dispatch window(s)). However, most of their
source operands are likely to be generated (defined) by instructions within the same dispatch
window. Hence, once again, two RMT read ports allocated to these instructions seem to be
too many. For the instructions in the middle, the average need for RMT read ports is likely
to be below two, and the average need for write ports is likely to be below one. This intuition
drove the proposed dispatch stage design.
3.2 RMT port utilization in SPEC CPU benchmarks 
In order to validate the aforementioned intuition, we collected profiles of SPEC2000 integer
and floating point benchmarks. We associate three counters with each instruction slot, two
for the source operands and one for the destination. For each cycle, if an instruction needs
the rename mapping entry for a source operand (the DCL outcome shows that the result
is not produced within this dispatch group), the corresponding counter is incremented. If a
destination register is not redefined in the same dispatch window (the write port is needed),
the write port counter is incremented. At the end of the simulation, these counters are divided
by the number of simulated cycles to arrive at a port utilization number. We present these
numbers in two flavors. In one case, the port use counters accumulated through the simulations
are divided by the total number of cycles (Actual Port Utilization graphs in Figure 3.1). This
figure counts a port as not used even if the lack of need for the port arose from no instruction
being present in a dispatch window instruction slot. If there is a choice in distributing ports
amongst instructions, which is what this thesis proposes, these numbers are the ones that
should guide the dispatch stage RMT port distribution. An understanding of the fundamental
program behavior, however, is aided if the counters are also normalized for the occupancy
rate of the instruction slot (Normalized Port Utilization graph in Figure 3.2). That is, the
counter value is divided only by the number of simulated cycles when that instruction slot was
actually occupied. This number is a better indicator of the program characteristics, not
tainted by the microarchitectural observation limitations. It could serve as a guideline to see
what the implications would be if, in the future, branch and fetch inefficiencies are overcome.
It should be noted that instructions that are stalled in the dispatch stage for many cycles are
counted only once.
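The counting procedure can be sketched as follows. This is a simplified illustration and not the simulator instrumentation: it assumes one dispatch group per simulated cycle, a hypothetical (dest, src1, src2) tuple per instruction, and no dispatch stalls (so the count-only-once rule for stalled instructions is not modeled).

```python
def port_needs(group):
    """Per-slot RMT port needs, as (rd1, rd2, wr) flags, for one group."""
    needs = []
    for j, (dest, src1, src2) in enumerate(group):
        earlier_dests = {g[0] for g in group[:j]}
        later_dests = {g[0] for g in group[j + 1:]}
        # A read port is needed when no in-group producer exists.
        rd1 = int(src1 is not None and src1 not in earlier_dests)
        rd2 = int(src2 is not None and src2 not in earlier_dests)
        # A write port is needed unless the register is redefined later.
        wr = int(dest is not None and dest not in later_dests)
        needs.append((rd1, rd2, wr))
    return needs


def utilization(groups, width):
    """Return (actual, normalized) utilization per slot and per port."""
    counts = [[0, 0, 0] for _ in range(width)]
    occupied = [0] * width
    for group in groups:                      # one group per cycle
        for slot, flags in enumerate(port_needs(group)):
            occupied[slot] += 1
            for port, flag in enumerate(flags):
                counts[slot][port] += flag
    cycles = len(groups)
    # Actual: divide by all cycles, so empty slots count as unused ports.
    actual = [[c / cycles for c in slot] for slot in counts]
    # Normalized: divide only by the cycles in which the slot was occupied.
    normalized = [[c / occ if occ else 0.0 for c in slot]
                  for slot, occ in zip(counts, occupied)]
    return actual, normalized


groups = [[(1, 2, 3), (4, 1, 5)], [(1, 2, 3)]]
actual, normalized = utilization(groups, width=2)
print(actual)
print(normalized)
```

In this toy trace the second slot is empty in one of the two cycles, so its actual utilization is half of its normalized utilization; this is the occupancy effect that separates the two flavors of graphs.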
This data was collected for two dispatch widths: 8 and 16. Figure 3.1 and Figure 3.2 show
this data for gcc, mcf and equake. Figure 3.3 averages these numbers for both dispatch widths.
The following observations are made from Figure 3.3. The normalized read port use frequency
declines from 100% at the head-end to about 30% after the four head-end instruction slots
(and stays consistently around 30%). The write port use starts at about 60% at the head end
in almost all cases and stays flat until about four tail-end instructions. From there, it goes up
more or less linearly to 100% as expected by our intuitive model. Figure 3.3 also shows this
data without normalization. The read port use frequency declines from 100% at the head-end
to consistently below 20% for all benchmarks and dispatch widths. The write port use shows
a slightly different pattern. It starts at about 60% at the head end in almost all cases and
declines to about 40% at the tail end (the effect of occupancy is more pronounced here).
3.3 Implications of port utilization on RMT design
The discussion in the previous section provides the basis for designing a modified dispatch
stage with the objective of economizing on the RMT ports. We use the raw port use frequency
data of Figure 3.3 to drive this distribution. Note that a typical instruction in the middle of
a wide dispatch window has an average need for 0.4 read port per source operand and 0.4 to 0.5
write port for the destination operand. So it should be sufficient if a choice of 1.5 ports per
instruction is made for an average instruction. The needs at the head end are slightly higher.
Both the read ports are needed by the instruction at the head end. However, only 0.5 to 0.6
write ports per instruction seem to suffice. Given that any realistic branch resolution strategy
will do better for the first basic block than for the later ones, it seemed reasonable to commit
more port resources to the first basic block in the dispatch window. All these factors argue
for committing two read ports each for the first four instructions. As we move further away
from the middle of the instruction window, the graphs seem to stabilize. Thus the same choice




























































































[Figure: x-axis is Instruction Position]
Figure 3.1 Actual Port Utilization for gcc, mcf & equake








[Figure panels: equake (8), equake (16); x-axis: Instruction Position; series: % Rd1, % Rd2]
[Figure panels: Actual Port Utilization (DW = 8), Actual Port Utilization (DW = 16), Normalized Port Utilization (DW = 8), Normalized Port Utilization; x-axis: Instruction Position; series: % Rd1, % Rd2, % Wr]
Figure 3.3 Port Utilization for SPEC 2000 Benchmarks (Average)
CHAPTER 4. DEPENDENCE DRIVEN DISPATCH STAGE 
In this chapter, a novel design for the dispatch stage that matches the number and location 
of RMT ports to average needs of programs is developed. This is followed by a detailed 
analysis of the delay of the modified dispatch components and a discussion on how to leverage 
the reduced dispatch time to dispatch more instructions within the same cycle time. The issues 
arising as a result of designing the RMT with a reduced number of ports and sequentializing the
RMT and DCL activities are addressed.
4.1 Proposed dispatch stage schema 
As observed in the previous chapter, the average dispatch needs of SPEC2000 CPU benchmarks
are well below the conventionally designed resources (3 x n ports) that target peak needs.
The port use frequency data of Figure 3.3 is used to drive the distribution of k < (3 x DW)
ports to match the average demands for each instruction slot in the dispatch window.
Note that a typical instruction in the middle of a wide dispatch window has an average need for
0.4 read port per source operand and 0.4 to 0.5 write port for the destination operand. This
led us to the choice of 1.5 ports per instruction for an average instruction. The needs at the
head end are slightly higher. Both the read ports are needed by the instruction at the head
end. However, only 0.5 to 0.6 write ports per instruction seem to suffice. Given that any
realistic branch resolution strategy will do better for the first basic block than for the later
ones, it seemed reasonable to commit more port resources to the first basic block in the
dispatch window. Hence, we commit two read ports each for the first four instructions. However,
we allocate only two write ports for these four instructions. This port distribution is shown in
Figure 4.1.
[Figure labels: 8-wide dispatch inst. order; Rename Map Table]
Figure 4.1 Modified Dispatch Scheme with Reduced Number of Ports
The proposed dispatch stage functions as follows. The DCL detects dependencies within a
dispatch group and sets up the appropriate 2-1 Mux/Demux so that a source operand generated
within the same dispatch group need not read the RMT (i.e., need not access the port). Also,
if a redefinition of the destination operand occurs, then the first definition is prevented from
writing its rename mapping to the RMT.
4.2 Port contention resolution 
The port allocation algorithm always prioritizes reads over writes. For instance, in Figure 4.1,
if Instruction I_{t+4} needs two read ports and I_{t+5} needs one read port and one write port
(on the basis of the DCL outcome), the implemented scheme will match I_{t+4} with two read
ports, and allow I_{t+5} to proceed with its read. The write for I_{t+5} will be delayed to the
following stall cycle. This is to minimize the probability of a dispatch stall due to a stalled
RMT read. If an RMT read is stalled, no writes from the instructions in the following dispatch
slots should be allowed, especially if the read and write are for the rename mapping of the same
logical register. Otherwise, it would lead to incorrect program behavior. For instance, with
reference to Figure 4.1, assume that both I_{t+4} and I_{t+5} need two read ports each (and no
write ports). The proposed scheme will allocate two read ports to I_{t+4} and one read port to
I_{t+5}. Let us assume that the read of the rename mapping of logical register R5 in Instruction
I_{t+5} is delayed by a cycle due to this stall. Let us also assume that Instruction I_{t+6} has
R5 as its destination operand, and it can be allocated a write port based on the port needs of
I_{t+6} and I_{t+7}. However, if the write of rename[R5] from I_{t+6} is allowed to proceed, the
read of rename[R5] by I_{t+5} in the following cycle would be erroneous. Hence, whenever a read
of an instruction I_k is stalled, we stall all the writes of the following instructions I_{k'}
for k' > k, shown by the 'Write Disable' line originating at the DCL. It is possible to include
reverse DCL logic to determine if a destination of a later instruction is aliased with the
source operand of the current instruction whose read is being stalled. Only then would the write
be selectively disabled. However, the incremental IPC gains from selective write disabling were
only marginally better than the simple scheme of disabling all the following writes on a read
stall. Moreover, the area cost of reverse DCL logic would be quadratic in DW, and hence it did
not appear to be worth it. The following pseudo-algorithm describes the port prioritization.
for each pair of instruction slots (I (t) , I (t+1) ) 
{ 
if (I (t) needs the shared port : needs two or three accesses ) 
{ 
allocate the shared port to I (t) ; 
if (I (t) has 2 reads and 1 write) 




allocate the shared port to I(t+1); 
} 





case 1 write and some reads: delay the write; 
if a read also needs to be stalled, assert 
disable-following-writes-signal; 
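
The prioritization rules described in this section can be sketched in executable form. This is a simplified model under stated assumptions, not the thesis implementation: it covers only the instruction pairs that share three ports (not the four head-end slots), it describes each slot by a hypothetical (reads, writes) tuple of pending RMT accesses, and it conservatively holds back the write of the instruction whose own read stalled as well as the writes of all later instructions.

```python
def allocate_ports(needs, ports_per_pair=3):
    """Grant RMT ports for one cycle.

    needs: list of (reads, writes) per slot in program order, with reads in
    0..2 and writes in 0..1; consecutive slots (0,1), (2,3), ... share
    ports_per_pair ports. Returns (read_grant, write_grant) lists.
    """
    n = len(needs)
    read_grant = [0] * n
    free = []                                  # leftover ports per pair
    # Pass 1: reads have priority and are served in program order.
    for base in range(0, n, 2):
        avail = ports_per_pair
        for k in range(base, min(base + 2, n)):
            read_grant[k] = min(needs[k][0], avail)
            avail -= read_grant[k]
        free.append(avail)
    # Pass 2: writes use leftover ports until the first stalled read.
    write_grant = [0] * n
    write_disable = False
    for k in range(n):
        if read_grant[k] < needs[k][0]:
            write_disable = True               # disable following writes
        if needs[k][1] and not write_disable and free[k // 2] > 0:
            write_grant[k] = 1
            free[k // 2] -= 1
    return read_grant, write_grant


# Slots t+4..t+7 of the second example: the read of t+5 stalls, so the
# write of t+6 is held back even though its pair has a free port.
print(allocate_ports([(2, 0), (2, 0), (0, 1), (1, 0)]))
```

Accesses that are not granted would be retried in the following stall cycle; that retry loop is omitted here to keep the sketch short.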




4.3 Delay analysis of modified dispatch scheme 
An overview of the delay model for RMT access and DCL is provided in Chapter 2. Using
the TSMC 0.18 µm process parameters from MOSIS, the DCL delay for a single comparison
cell followed by chain logic propagation is evaluated. Note that we are comparing logical
register designators with each other and each comparison is potentially a 5-bit comparison.
Since all 5 bit comparisons can work in parallel, only one comparator delay needs to be
accounted for in the delay calculations. We implemented the SPICE model for 5 parallel
bit-comparisons (Figure 4.2), which is effectively 5 XOR operations. The figure shows comparator
cells driving a pass transistor based chain logic that spans all the instructions that form a
dispatch group. The outputs of the XOR gates drive a 5-input precharged NOR gate array, which
folds the result of the comparison into one line. This line drives the chain logic, which
determines if a match happens anywhere in the window, and if so, it indicates the location of
the match (so as to locate the producer and consumer). The DCL delay for tag match and worst
case chain logic propagation is found to be around 41 ps, 52 ps and 65 ps respectively for
dispatch widths of 8, 12 and 16. We obtain the map table access delay for the three dispatch
widths from (Eq. 2.1) as:
t_RMT = 396.1 + 20.7 * DW ps (4.1)
If the DCL is performed before map table access, the time available for rename table access is
reduced by the chain logic delay t_chain. Consider the cycle time of the original dispatch stage
that is 8-wide, assuming it is limited by the RMT access time. Its delay based on Equation 2
Figure 4.2 DCL Schematic: XORs, Precharged NOR array and Pass transistor based Chain Logic
is 561.7 ps. The key question is what dispatch width the new scheme can support in the
same cycle time of 561.7 ps. We first subtract the DCL delay t_DCL from this
time: 561.7 - 41 = 520.7 ps. The constant coefficient c0 of Equation 4.1 does not change (or
changes very insignificantly) for the new dispatch stage with a different number of ports.
Solving Equation 4.1 for DW with this reduced time budget, the number of instructions with
three ports each that can be supported is 6.01, which corresponds to 18 ports. In our port
distribution scheme, the first four instructions are allocated 10 ports. That leaves 8 ports for
the other instructions. Each group of two instructions is allocated three ports, and hence with
8 ports we can support 5 additional instructions. This leads us to the conclusion that a 9-wide
proposed dispatch can be supported in the time it took to dispatch 8 instructions in the
existing scheme. This calculation can be carried out for other dispatch widths in the same way.
The number of ports that can be supported is calculated to be 28 and 38 respectively for 12 and
16 wide machines. The first four instructions need 10 ports in our scheme and each of the
subsequent instructions requires 1.5 ports on average. Hence one could support a total of 16
and 22 instructions in the same cycle time that originally accommodated only 12 or 16
instructions. The extent by which one can increase the dispatch width by saving on ports
increases as we move from 8 to 16 wide windows.
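The arithmetic above can be reproduced with a short script. This is a sketch of the calculation only, using the coefficients of Equation 4.1 and the DCL delays reported in this section; the constant and function names are illustrative, and rounding down to whole ports and whole instructions is an assumption that happens to match the figures quoted in the text.

```python
C0, C1 = 396.1, 20.7                    # Equation 4.1 coefficients (ps)
DCL_DELAY = {8: 41, 12: 52, 16: 65}     # reported DCL delays (ps)
HEAD_SLOTS, HEAD_PORTS = 4, 10          # first four instructions: 10 ports
PORTS_PER_PAIR = 3                      # later slots: 3 ports per 2 slots


def rmt_delay(dw):
    """RMT access delay of a conventional dw-wide dispatch stage (Eq. 4.1)."""
    return C0 + C1 * dw


def supported_width(dw):
    """Ports and dispatch width that fit in the cycle time of a dw-wide base."""
    budget = rmt_delay(dw) - DCL_DELAY[dw]      # DCL now precedes RMT access
    ports = int(3 * (budget - C0) / C1)         # three ports per instruction
    extra = (ports - HEAD_PORTS) * 2 // PORTS_PER_PAIR
    return ports, HEAD_SLOTS + extra


for dw in (8, 12, 16):
    ports, new_dw = supported_width(dw)
    print(f"base DW={dw}: {ports} ports, proposed DW={new_dw}")
```

Because the DCL delay grows more slowly than the RMT delay, the port budget left after subtracting it grows with the base width, which is why the gain widens from one extra instruction at 8-wide to six at 16-wide.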
4.4 Accommodating wider dispatch in the proposed schema 
From the discussion of the modified dispatch scheme, we see that it is possible to
sequentialize the DCL and RMT activities, reduce the number of ports to the RMT to less than
3 x DW and support a given dispatch width with a less complex rename map table. This
saves cycle time. On the other hand, it is also possible to retain the same 3 x DW
ports, but support a wider dispatch within the same cycle time. It is important to be aware of
the side effects of accommodating a wider dispatch within the same cycle time. Under current
trends, the issue stage loop, namely the wakeup and select logic, forms the critical cycle time
limiting structure. So increasing the dispatch width and consequently supporting a wider
issue could increase the window size and lead to complexity problems in dynamic scheduling.
However, many techniques have been proposed by researchers to alleviate the issue bottleneck,
namely clustered issue windows, broadcast-free scheduling, dependence based issue, queuing
schemes, etc. Each of these techniques reduces the issue critical path, under which condition
the proposed scheme of dispatching more instructions becomes practical. It is also important to
consider the fan-out from dispatch to the issue window and the subsequent effect on the
scheduler CAM capacitances whenever one looks to increase dispatch width/issue window size.
Recent research on serial vs parallel frontends indirectly accounts for the fan-out
capacitances. In all, the proposed dispatch scheme comes in handy since designers have no option
but to move to novel issue mechanisms and, in the process, end up facing the dispatch bottleneck.
CHAPTER 5. PERFORMANCE EVALUATION 
5.1 Simulation environment 
The SimpleScalar out-of-order simulator [3] version 3.0 targeting the Alpha ISA [24] is
employed to evaluate the proposed dispatch stage. We employ SPEC2000 CPU benchmarks to
compare the performance of the proposed microarchitecture against base models. Contemporary
and projected dispatch widths (DW) of n = 8, 12 and 16 are considered for evaluation.
The dynamic scheduler window size is maintained at 64, 128 and 256 entries respectively. The
number of pipelined function units is also scaled accordingly. In a wide issue domain, the
scheduler configuration (clustering, etc.) will determine the cycle time. The choice of a
suitable window is beyond the scope of this discussion and negligibly affects the potential of
the proposed dispatch scheme.
The memory system consists of split Level-1 caches, the L-1 data cache being a 128KB,
4-way set-associative cache and the L-1 instruction cache being a 64KB direct mapped cache.
Data cache access is assumed to have a 3 cycle latency. The instruction cache is designed to
have a block size of 64 bytes in order to fetch wider packets of instructions. There is also a
512KB 4-way set-associative unified L-2 cache with a 12 cycle access latency. The capability
to predict beyond one branch is also inherent to the front end. We have used both perfect
prediction and the gshare [8] scheme to evaluate the actual effectiveness of the dispatch
algorithm. The gshare predictor uses an 8-bit global history and an 8K entry BTB. The Level-2
table for the branch predictor is of size 2K.
The base microarchitecture is evaluated separately without accounting for any stalls
attributable to port contention. The proposed modifications are built into the SimpleScalar
simulator by introducing a detailed dependency check stage and port scheduling before
dispatch of instructions. The simulations were run for at least 500 million instructions and all
simulations are forwarded through their respective transient phases [15] to prevent cold-start
effects.
The circuit delay parameters presented by Palacharla for 0.18 µm technology in [11] are
employed for the rename map table access delay calculations. The delay of the DCL subsystem
is obtained through detailed modeling and optimization of the comparator-chain logic
components using SPICE. CACTI 2.0 [16] based simulations provide energy numbers obtained
through conservative porting of the rename table.
5.2 Performance characteristics 
The metric used for comparing the performance of the modified dispatch scheme against
the base machine is instructions per cycle (IPC). The performance under both perfect and
imperfect (gshare) branch prediction schemes is observed. Initially the IPC delivered by the
base architecture for DW = 8, 12 and 16 is found ("Base IPC" in Fig. 5.1). The only constraint
that prevents renaming, and hence the propagation of instructions to the issue stage, in the
modified dispatch scheme is the contention for the reduced number of read and write ports.
This results in one or more additional cycles to dispatch the current dispatch window. For
the same widths, the IPC delivered by the modified dispatch stage is evaluated and the
degradation due to port contention is reported. Figure 5.1 (Reduced ports with contention stalls)
gives the IPC values obtained with perfect branch prediction and gshare branch prediction
schemes as against respective base IPC values. Note that, for 33%, 38% and 41% savings in
ports (DW = 8, 12 and 16), the respective IPC degradation values are found to be between
1% and 4%.
We also conducted simulations to quantify the effect on performance of both the port
contention penalty and the increased IPC benefit (due to improved dispatch capability)
simultaneously ("Improved DW + Contention stalls" in Fig 5.1). The cycle-equivalent new dispatch
schemes (DW = 9, 16, 22) are incorporated into the dispatch stage of the Simplescalar simu-






[Figure panels: 8-Wide Dispatch and Perfect Prediction; 12-Wide Dispatch and Perfect Prediction; 16-Wide Dispatch and Perfect Prediction; 8-Wide Dispatch and gshare Prediction; 12-Wide Dispatch and gshare Prediction; 16-Wide Dispatch and gshare Prediction; series: Base IPC, Reduced ports with contention stalls, Improved DW + contention stalls]
Figure 5.1 IPC Behavior of Modified Dispatch Stage for SPEC2000 benchmarks
architectural parameters such as the issue window. For perfect branch prediction, -1% to 2% 
speedup (for 8-, 12- and 16- wide original dispatch) across benchmarks is observed. The gshare 
prediction scheme shows 1% to 2.5% speedup. Thus, the increased dispatch capability offsets 
the stalls caused due to port contention. We also conducted simulations to observe the effects 
of increasing the issue window size and the number of function units in tune with the increased 
dispatch capability. If such an implementation becomes feasible within the same cycle time 
(due to novel issue schemes like clustering, queues, etc.), the performance benefits due to the
proposed average case port allocation scheme range from 4% to 12%. It should be noted that
these improvements are only an upper bound on the benefits obtainable from the modified 
dispatch scheme. 
5.3 Energy characteristics 
There are two factors that reduce the RMT energy. The first factor reflects the
reduced number of ports used in the proposed dispatch algorithm. The word line and bit line
capacitances and the number of sense amplifiers in a RAM based RMT are linearly proportional
to the number of ports. In the modified RMT, due to a reduction in the number of ports,
there is a considerable reduction in the capacitances of these components and hence the energy
per RMT access decreases. In order to quantify this factor, we conducted CACTI 2.0 based
simulations of the RMT with different configurations corresponding to different numbers of
ports and observed the energy values for both the base scheme and the modified scheme. It is
seen that the RMT access energy is reduced by 43% to 57%.
DW   % Energy Saved   % Reduction in RMT port activity   Net Energy Saved in RMT
     Per Access       Perfect Prediction   gshare        Perfect Prediction   gshare
8    43.7             72.9                 67.2          88.1%                85.7%
12   52.5             80.1                 75.7          89.5%                87.2%
16   51.1             84.1                 78.2          91.8%                88.6%
Table 5.1 Access and activity energy reduction
The second factor leading to energy reduction is attributable to the reduced activity at the 
[Figure panels: Port Activity: Perfect Prediction; x-axis: Dispatch Width (8, 12, 16); series: Original Activity, Reduced Activity]
Figure 5.3 Port Activity Reduction
RMT ports. In the new scheme, an instruction does not access the RMT three times during a
clock cycle. Instead, access is selective, driven by the DCL outcome. Whenever a part of the
dispatch window stalls, the second cycle port access activity for the same dispatch window is
very small. The activity at the ports is also a function of branch prediction accuracy and is
very low at the tail end of the instruction window for the gshare scheme as against perfect
prediction. We see a dramatic decrease in the activity of the ports, of the order of 67% to 84%.
A composite energy reduction of 85-91% is observed, accounting for both RMT energy per
access reduction and reduced port activity. Table 5.1 shows the data on the overall energy
reduction. A rename cache could be used to reduce the activity on the rename map table.
Even in this scenario, the access energy savings alone are quite significant.
CHAPTER 6. IPC DRIVEN CACHE ARCHITECTURE 
This chapter introduces the second objective of this thesis. Dynamic associativity
management in caches has been utilized for energy savings. The existing schemes have incorporated
a limited degree of dynamic associativity: either direct mapped or full available associativity
(say 4-way). In this part of the thesis, we explore a more general design space for dynamic
associativity (for a 4-way associative cache, consider 1-way, 2-way, and 4-way associative
accesses). The other major departure is in the associativity control mechanism. We use the
actual instruction level parallelism exhibited by the instructions surrounding a given load to
classify it as an IPC k load (for 1 <= k <= IW with an issue width of IW) in a superscalar
architecture. The lookup schedule (such as a 2-way lookup in (Way 0, Way 1) in the first cycle,
followed on a miss by a 2-way lookup in (Way 2, Way 3)) is fixed in advance for each IPC
classifier 1 <= k <= IW, for up to IW distinct lookup schedules. Generally, a lower IPC load
attempts to satisfy itself with a lower associativity access, whereas a higher IPC load tends to
start with a higher associativity. The intuition is that in a higher ILP epoch of a program,
a higher number of conflicting working sets are active, and hence need higher associativity.
The schedules are as way-disjoint as possible for load/stores with different IPC classifications.
The energy savings over SPEC2000 CPU benchmarks average 28.6% per array for a 32KB,
4-way, L-1 data cache. The resulting performance (IPC) degradation from the dynamic way
schedule is restricted to less than 2.25%, mainly because IPC based placement ends up being
an excellent classifier.
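The notion of a fixed lookup schedule per IPC class can be illustrated with a small model. This is a sketch under loudly stated assumptions: the schedule table below is hypothetical (only the (Way 0, Way 1) then (Way 2, Way 3) sequence is taken from the text), an issue width of 4 is assumed, and the number of ways probed is used as a first-order proxy for access energy.

```python
# Hypothetical schedules for a 4-way cache and IW = 4: each IPC class maps
# to the groups of ways probed in successive cycles. Only the class 2 entry
# mirrors the example in the text; the others are illustrative assumptions.
SCHEDULES = {
    1: [(3,), (0, 1, 2)],       # low IPC: start with a 1-way lookup
    2: [(0, 1), (2, 3)],        # 2-way lookup, then the remaining 2 ways
    3: [(1, 2), (0, 3)],        # different first ways than class 2
    4: [(0, 1, 2, 3)],          # high IPC: full associativity at once
}


def lookup(resident_way, ipc_class):
    """Probe the cache following the schedule of the given IPC class.

    resident_way: way currently holding the block, or None for a miss.
    Returns (hit, cycles, ways_probed); ways_probed approximates energy.
    """
    probed = 0
    schedule = SCHEDULES[ipc_class]
    for cycle, ways in enumerate(schedule, start=1):
        probed += len(ways)
        if resident_way in ways:
            return True, cycle, probed
    return False, len(schedule), probed


print(lookup(0, 2))      # hit in the first 2-way probe
print(lookup(3, 2))      # hit only after the second probe
print(lookup(None, 4))   # miss after a single full lookup
```

A hit in the first probe of a low associativity schedule switches fewer ways than a full lookup, at the cost of an extra cycle whenever the block resides in a way that is probed later.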
6.1 Introduction 
Energy efficiency within a computing system is a desirable characteristic both from a
fundamental computing optimality perspective and from a system engineering perspective. On-chip
caches have been singled out as significant contributors to the processor energy consumption.
Set associative caches (currently 2-way to 4-way) are widely deployed in current processors due
to their ability to lower miss rates with acceptable cycle times. Program characteristics
have been evolving to include a larger number of working sets of larger sizes over time. This
trend favors higher associativities in the future. A full associativity access switches the
capacitance in all the ways simultaneously, resulting in a maximum energy access per load. This
was the motivation for several earlier research efforts [21], [20], [28], [26] where some of the
load/store instructions are accessed in a direct-mapped, 1-way lookup. If that misses, a full
associativity access is generated. All these research efforts use different prediction and
control mechanisms to flag certain loads as direct-mapped. In these schemes, some of the loads
switch the capacitance of only one way (approximately 1/k-th the capacitance of all the k ways,
to a first order approximation) instead of all the designed ways. This saving is significant
given the cache energy's contribution towards the total processor energy. Almost all current
generation processors, including Alpha [24], PentiumPro [25], StrongARM [22] and XScale,
dissipate from 15% to 45% of their total energy in caches. In the subsequent parts of this
thesis, we focus on an energy-efficient scheme for the L1 data cache based on dynamic
associativity control driven by an IPC (instructions per cycle) classification.
The primary motivation behind the proposed cache architecture is as follows. It strives to
design microarchitecture components that consume energy in proportion to the delivered work.
The incumbent microarchitecture design paradigm targets a peak instruction-level parallelism
(ILP) such as 4, 6 or 8 (the issue width of the superscalar microarchitecture). Each component
of the microarchitecture is designed to sustain this targeted peak ILP. However, typical
programs exhibit a high variance in IPC over time. Moreover, this variance has short temporal
and spatial periods. In other words, the IPC variance is visible within extremely small time
windows (of less than 10 cycles) and at each stage in the microarchitecture, including fetch,
dispatch and issue. This leads to a designed capacitance proportional to the peak IPC, which
switches every cycle. Ideally, the switched capacitance would be proportional to the actual
IPC (ILP) delivered in a given cycle. Hence, a commitment of silicon resources proportional
to the peak IPC hurts both the delay and energy performance for every cycle with IPC less
than the peak. One possible solution to this dilemma is to design microarchitecture components
that switch capacitance proportional to the delivered IPC, leading to a delay and energy
performance in line with the actual program progress (IPC for that cycle). This constitutes
the motivation for our work. Efficiency in the energy-delay domain is achieved at a design
point that matches resources to requirements. With this aim in mind, an adiabatic compiler &
micro-architecture framework [27] was proposed to alleviate the energy and complexity issues.
This thesis addresses the design of a data cache with such a characteristic, to be part of the
proposed microarchitecture.
¹ A full associative access in this context refers to a k-way associative access for a k-way cache.
The Level-1 data caches are designed to be highly associative in order to sustain the tar-
geted peak IPC of the microarchitecture. They are also designed to be multi-ported to support 
multiple loads/stores driven by high IPC. A high IPC exposes the cache to a larger number of 
working sets, many of which are conflicting. A conventional 4-way set associative cache imple-
mentation probes all the four tag and data arrays (ways) in parallel. Hence each load/store 
needs energy for probing all the four ways. The simplistic intuition that drove our vision 
originally is as follows. The load/store instructions occur in epochs with variable IPC. Some 
loads belong to high ILP regions of a program, and some to the low ILP regions. Let each 
load/store be classified with the ILP of the constituent program region. The conflicting (not 
direct-mapped) working sets are the ones to generate pressure on a data cache for higher as-
sociativity needs. The conflicting loads constitute a certain percentage of all the loads for a 
given program (on average) . In fact, this percentage appears to be quite predictable accord-
ing to [2]. Again, assuming for intuitive simplicity, that the conflicting loads are uniformly 
distributed across the program, a higher ILP epoch would tend to have a higher number of 
conflicting loads. Hence, higher associativity would support a higher ILP program region more 
naturally. Similarly, a lower ILP region could still be supported with a lower associativity with 
an acceptable miss rate. This says that the number of instantiated ways can be proportional 
to the ILP of the program region around a given load. A simplistic scheme would be to map 
all the load/stores within ILP 1 region to Way-0, within ILP 2 region to Way 0 and Way 1, 
within ILP 3 region to Way 0, Way 1 and Way 2; and within ILP 4 region to all the four ways. 
In other words, the associativity of an access is determined by its ILP region. This simplistic 
notion forms the basis for this work, and is refined later to make it feasible. 
Note that such a mapping from IPC classification to the instantiated ways has the targeted
property of energy dissipation being proportional to the ILP. An approximate energy consumption
model in this scenario would be as follows: ILP 1 load energy ≈ $E_{1\text{-way}}$; ILP 2 load
energy ≈ $2 E_{1\text{-way}}$; ILP 3 load energy ≈ $3 E_{1\text{-way}}$; ILP 4 load energy ≈
$4 E_{1\text{-way}}$. This assumes that the capacitances of all the ways are equal, that
$E_{1\text{-way}}$ is the energy of a direct-mapped access, and that the bit line and word line
energies dominate.
Related Work: Many researchers have studied techniques to alleviate both cache energy 
and access time bottlenecks. Techniques encompass partitioning, decomposition and sequen-
tializing way access patterns. Albonesi [20] proposed a technique to partition ways selectively. 
Sequentializing way-access patterns was proposed by Grunwald et al. [21]. Circuit design 
techniques [23] were also proposed to conserve cache energy. Vijaykumar et al. [26] combined 
predictive schemes with selective access techniques to achieve low energy without compromising 
on access times. 
6.2 Overview
The fundamental difference between the proposed scheme and all the earlier schemes lies 
in their intrinsic goals. Even though all the existing schemes try to reduce energy based on 
the data set requirements, none of them aim at dissipating only as much energy as the work 
done (IPC delivered). We propose a selective, sequential cache architecture with the explicit 
objective of adapting its energy needs to the ILP. The ILP could be measured at one of many 
microarchitecture stages. The issue stage select logic is the stage we utilize as the dynamic 
classifier of the load/store ILP region. The sequential-way-access schedule is fixed a priori for 
each ILP classification based on two factors. The first factor is the distribution of load/stores 
among the different IPC epochs. This determines the tolerance of a given IPC class loads to 
a mismatched schedule. The second factor is the distribution of different associativity needs
among the loads classified as IPC-k, for all k. A sequential access schedule is chosen for each IPC
classification on the basis of this distribution. This sequential access schedule is an ordering
of the cache ways. For instance, the schedule [(0, 2); (1); (3)] calls for probing Ways 0 & 2 in
the first cycle; if that results in a miss, then Way 1 is probed in the next cycle; if that is still a
miss, then Way 3 is probed.
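The ordering described above can be illustrated with a minimal sketch. It assumes a schedule is represented as an ordered list of way groups, one group per cycle; the names `Schedule` and `probe` are hypothetical and are not taken from the thesis or from any simulator.

```python
from typing import Optional, Sequence, Tuple

# Assumed representation: one tuple of way indices per access cycle.
Schedule = Sequence[Tuple[int, ...]]


def probe(schedule: Schedule, resident_way: Optional[int]) -> Tuple[bool, int, int]:
    """Walk a sequential access schedule for one set.

    resident_way is the way that holds the requested block, or None if the
    block is not in the cache. Returns (hit, cycles_used, ways_probed).
    """
    ways_probed = 0
    for cycle, group in enumerate(schedule, start=1):
        ways_probed += len(group)          # every enabled way switches its arrays
        if resident_way in group:
            return True, cycle, ways_probed
    return False, len(schedule), ways_probed   # final miss after the last group


# The example schedule from the text: Ways 0 & 2, then Way 1, then Way 3.
EXAMPLE_SCHEDULE = [(0, 2), (1,), (3,)]
```

With this schedule, a block resident in Way 2 is found in the first cycle after probing two ways, whereas a block in Way 3 needs three cycles and all four ways are probed by then.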
The issue stage classifies each load/store as belonging to one of the IPC epochs. If k
instructions are selected by the issue wakeup/select logic in a given cycle, then all the load/store
instructions among these k instructions are classified as IPC-k load/stores. When a load/store
is issued from the load/store queue (LSQ) to the cache, its IPC classification (performed earlier
at the issue stage) determines its sequential schedule, say [(0, 1); (2, 3)]. In each cycle it
accesses only one or two of the four available ways. Only the tags from the accessed ways are
compared. If a miss is indicated, a tag comparison is initiated in the other ways sequentially,
before a final miss is signaled. Data access is initiated in each accessed way in parallel with
the tag comparison, as usual. Correct data placement is fundamental to the efficiency of any
sequential access scheme. The same schedule is also used for data placement. We observe in the
following sections that the dynamic IPC at the issue stage ends up being a good classifier for
load/store instructions, resulting in a good placement within the cache ways (with little
cross-talk between different ways).
6.3 IPC based load/store classification 
Load/store instructions involve two operations: effective address computation followed by
the load/store dispatch from the LSQ. The issue stage wakeup & select logic wakes up all the
instructions whose source operands just became available. These instructions are placed in
the ready queue. For a k-wide superscalar microarchitecture, up to k of the ready instructions
are selected to be issued. We use the number of selected instructions at the issue stage to
annotate each load/store instruction among these selected instructions, as shown in Figure 6.2.
The number of selected instructions at this stage can vary from 0 to k (the peak ILP capacity
of the microarchitecture). If there is a load/store among the selected instructions, then this
number ranges from 1 to k. Hence each load/store gets placed into one of the k IPC buckets.
Figure 6.1 shows an example of this classification. The load sitting in the issue ready queue is
selected along with the two preceding instructions Insn 1 and Insn 2. It issues into the slot
labeled Load (4) of the LSQ and is annotated as an IPC-3 class load/store.
This annotation is carried along with the load all the way until the cache access.
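The classification step can be sketched in a few lines. This is only an illustration under stated assumptions: the `Insn` record and `select_and_classify` function are hypothetical names, and the oldest-first selection policy is assumed here because the text does not specify how the select logic picks among ready instructions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Insn:
    name: str
    is_mem: bool          # True for loads and stores
    ipc_class: int = 0    # 0 means "not annotated"


def select_and_classify(ready: List[Insn], issue_width: int) -> List[Insn]:
    """Select up to issue_width ready instructions (assumed oldest first) and
    tag every load/store among them with the size of the selected group."""
    selected = ready[:issue_width]
    k = len(selected)                 # number of instructions issued this cycle
    for insn in selected:
        if insn.is_mem:
            insn.ipc_class = k        # the load/store becomes an IPC-k access
    return selected


# The situation of Figure 6.1: a load selected together with Insn 1 and Insn 2.
ready_queue = [Insn("Insn 1", False), Insn("Insn 2", False), Insn("Load", True)]
issued = select_and_classify(ready_queue, issue_width=4)
```

Here three instructions are selected in a 4-wide machine, so the load carries the IPC-3 annotation, as in the example of Figure 6.1.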
Binding of sequential access schedule: The sequential access schedule specifying the 
temporal ordering of the ways for a given IPC classification can be either hardwired into the 
cache control or could be dynamically bound. A late binding results in greater generality
allowing for per process (through compiler analysis) or even per procedure specification of the 
access schedule. The block containing the access schedules in Figure 6.2 is meant to represent 
the dynamic binding. The compiler could store such a sequential access schedule table into 
a memory mapped region. At the process initiation, the table could be read into a micro 
architectural table, which is read by the cache controller to initialize the sequential access 
schedule. 
The dynamic sequential access schedule binding logic, as seen in Figure 6.2 implements the 
sequential way access schedule by setting up appropriate mask bits before a cache access can 
occur. 
Dynamic associativity cache access: When a load is issued to the data cache subsystem
from the LSQ, its IPC bits are used to decode the sequential access schedule table
in parallel to retrieve the temporal way access masks into the Selected Way Mask register (as
shown in Figure 6.2). In the first access cycle, the Cycle I way mask is used to enable/disable
the chosen ways. The Selected Way Mask register then shifts the temporal access schedule
down by one position so that the Cycle II way mask is at the head of the register. If the first
access is a hit, the access is done. Otherwise, the Cycle II way mask drives the second cycle
way enable/disable signals. Similarly, on a miss, the Cycle III way mask is utilized for a third
cycle access.

Issue Ready Queue (selected group, IPC 3): Load | Insn 2 | Insn 1
Load/Store Queue (LSQ): Load (0): IPC-1 | Load (1): IPC-2 | Store (2): IPC-2 | Load (3): IPC-4 | Load (4): IPC-3
Figure 6.1 Issue Select Stage Classifies the IPC of Load/Store Instruction
IPC class   Cycle I          Cycle II         Cycle III
1           Way 0            Way 1            (Way 2, Way 3)
2           (Way 0, Way 1)   (Way 2, Way 3)
3           (Way 2, Way 3)   (Way 0, Way 1)
4           (Way 2, Way 3)   (Way 0, Way 1)
Table 6.1 Sequential Way Access Schedule
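As a rough illustration of how the schedule of Table 6.1 could be turned into per-cycle way-enable masks, the sketch below encodes each cycle as a bit mask (bit i enables Way i) and models the Selected Way Mask register as a queue that shifts by one position per cycle. The bit ordering, the function names and the queue model are assumptions made for this example; the thesis describes the hardware only at the block-diagram level.

```python
from typing import Dict, List, Optional, Tuple

# Table 6.1, written as ordered way groups per IPC class.
SCHEDULE_TABLE: Dict[int, List[Tuple[int, ...]]] = {
    1: [(0,), (1,), (2, 3)],
    2: [(0, 1), (2, 3)],
    3: [(2, 3), (0, 1)],
    4: [(2, 3), (0, 1)],
}


def way_masks(ipc_class: int) -> List[int]:
    """Return one way-enable mask per cycle (assumed encoding: bit i = Way i)."""
    return [sum(1 << way for way in group) for group in SCHEDULE_TABLE[ipc_class]]


def access_cycles(ipc_class: int, resident_way: Optional[int]) -> Optional[int]:
    """Return the hit cycle, or None when every scheduled mask misses."""
    mask_register = way_masks(ipc_class)   # loaded when the load leaves the LSQ
    cycle = 0
    while mask_register:
        mask = mask_register.pop(0)        # head mask drives the enables, then shift
        cycle += 1
        if resident_way is not None and mask & (1 << resident_way):
            return cycle
    return None
```

For an IPC-1 load, `way_masks(1)` yields the masks `0b0001`, `0b0010` and `0b1100`, so a block resident in Way 3 is only found in the third cycle.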
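[Figure 6.2: labels recovered from the figure are "Step 2" and "Dynamic Binding of Sequential"; the figure includes a table of way mask bits over Ways 0 to 3 with rows 0011, 1100, 0011, 1100, 1100, 0011, 1000, 0100.]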
[Plot: Distribution of Loads in IPC Packets]
Figure 6.3 Issue IPC Based Classification of Loads in SPEC2000 Benchmarks
Figure 6.4 Distribution of associativity requirements at each IPC epoch
IPC class   First Cycle Hit        Second Cycle Hit           Third Cycle Hit
1           Ed + Ea + Econs        2 * Ed + 2 * Ea + Econs    3 * Ed + 4 * Ea + Econs
2, 3, 4     Ed + 2 * Ea + Econs    2 * Ed + 4 * Ea + Econs
Table 6.2 Energy Consumption Methodology
6.4 Sequential access schedule determination: 
How do we determine a good schedule for each IPC classification? As we observed earlier,
the intuitive explanation for the IPC based classification is that the number of conflicting
loads scales linearly with the number of instructions k in an issue group. Hence, a higher
associativity access benefits a higher ILP load. A good access schedule needs to determine the
distribution of conflicting loads with respect to the IPC classification. We also need to know
and understand the distribution of loads into different IPC classifications.
Consider a 4-way associative data cache within a 4 issue superscalar microarchitecture. 
The load/store instructions are classified into IPC-1, IPC-2, IPC-3, and IPC-4 classes. In 
order to determine the distribution of load/store instructions among these four IPC buckets, 
we classified load/store instructions in five SPEC 2000 benchmarks (gcc, mesa, equake, gzip
and bzip) according to issue IPC (Figure 6.3). Approximately 60-65% of load/store instructions
are issued in an IPC-4 group, whereas 20% are issued in an IPC-3 group. The frequencies of
IPC-2 and IPC-1 load/store instructions were approximately 15% and 5% respectively.
We also need to know the associativity needs of the loads classified as IPC-k in the following
sense. Consider all the IPC-4 classified loads. Some of these loads are direct-mappable: they
are non-conflicting with respect to a 1-way associative access. We denote this class of loads by
$IPC_{4,1}$. Some of these loads are non-conflicting with a 2-way associative access. These loads
are denoted by $IPC_{4,2}$. We can similarly classify all the loads into $IPC_{1,1}$, $IPC_{1,2}$,
$IPC_{1,3}$, $IPC_{1,4}$, $IPC_{2,1}$, $IPC_{2,2}$, $IPC_{2,3}$, $IPC_{2,4}$, $IPC_{3,1}$,
$IPC_{3,2}$, $IPC_{3,3}$, $IPC_{3,4}$, $IPC_{4,1}$, $IPC_{4,2}$, $IPC_{4,3}$, $IPC_{4,4}$.
This helps us determine a suitable sequential way access schedule as follows.
If $IPC_{4,4}$ dominates the other load sets (highest, dominant frequency) then the access schedule
for IPC-4 classified loads ought to be a single 4-way access [(0, 1, 2, 3)].
In order to assess $IPC_{i,j}$ for 1 ≤ i, j ≤ 4, we counted the number of accesses that map
to the same set over a reasonable temporal window. We followed an experimental scheme
similar to the one in [2] to derive this data. We implemented the most restrictive version of
the cache: direct mapped. We maintain a buffer where all the replaced loads are placed. At
certain time intervals (these are the temporal windows within which the working sets are being
captured), we analyze this buffer to see how many loads map into the same set, to determine
the minimum associativity that will make these loads non-conflicting. Let us say a given load
classified as IPC-3 was found to be 2-way associative. Then the access count for this load over
the observation temporal window is added to a counter $IPC_{3,2}$. In general, we maintain 16
counters for $IPC_{i,j}$ which accumulate such numbers. The resulting observations are shown in
Figure 6.4. The four sub-groups along the x-axis are for IPC-1 through IPC-4 classified
load/store instructions. Within each sub-group IPC-i, the bars denote $IPC_{i,1}$, $IPC_{i,2}$,
$IPC_{i,3}$, $IPC_{i,4}$ from left to right respectively. Note that direct-mapped accesses
dominate in each IPC class. However, more interestingly, among the conflicting load/store
instructions, 2-way accesses dominate by far for all the four IPC classifications. The 3-way and
4-way accesses are very rare in all cases. This leads us to the conclusion that all the accesses
in our sequential way access schedules will be limited to be at most 2-way associative. Anything
above a 2-way associative access is overkill. This, however, raises the following question: why
should we ever consider a cache that is more than 2-way associative? In a traditional cache
organization, the replacement policy does not make an effort to keep the multiple working sets
orthogonal (not intertwined), even when it is possible to do so. Hence, a composite footprint of
multiple 2-way associative working sets might appear as requiring 4-way or higher associativity.
We believe that it is this ability of the IPC based classification to maintain the limited
associativity working sets in isolation that results in its effectiveness despite limiting its
schedules to be at
most 2-way associative.
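The counting experiment can be approximated with the following sketch. It is a simplification and not the author's instrumentation: instead of replaying a buffer of replaced loads, it takes the minimum non-conflicting associativity of a set within one temporal window to be the number of distinct blocks that map to that set, capped at the maximum associativity. The function name, the argument layout and this proxy are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def associativity_counters(
    accesses: Iterable[Tuple[int, int]],   # (byte address, IPC class) per access
    num_sets: int,
    block_bytes: int,
    issue_width: int = 4,
    max_assoc: int = 4,
) -> List[List[int]]:
    """Accumulate counters[i-1][j-1], standing for IPC_{i,j}, over one window."""
    blocks_in_set: Dict[int, Set[int]] = defaultdict(set)
    accesses_by_set_class: Dict[Tuple[int, int], int] = defaultdict(int)

    for address, ipc_class in accesses:
        block = address // block_bytes
        set_index = block % num_sets            # direct-mapped set index
        blocks_in_set[set_index].add(block)
        accesses_by_set_class[(set_index, ipc_class)] += 1

    counters = [[0] * max_assoc for _ in range(issue_width)]
    for (set_index, ipc_class), count in accesses_by_set_class.items():
        # Assumed proxy: distinct blocks competing for this set in the window.
        needed = min(len(blocks_in_set[set_index]), max_assoc)
        counters[ipc_class - 1][needed - 1] += count
    return counters


# Two blocks collide in set 0 (three IPC-4 accesses); one block sits alone in set 1.
window = [(0x00, 4), (0x00, 4), (0x40, 4), (0x10, 1)]
result = associativity_counters(window, num_sets=4, block_bytes=16)
```

In this toy window the three IPC-4 accesses are attributed to $IPC_{4,2}$ and the single IPC-1 access to $IPC_{1,1}$.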
The insights gained from the two sets of data (in Figures 6.3 and 6.4) are combined to 
derive the sequential access schedule in Table 6.1. We tweaked the search space around this 
point for different access schedules, but these parameters gave us our best performance so far. 
This schedule represents least cross-traffic and hence a beneficial energy-delay tradeoff. 
Placement & lookup algorithm: Both placement and lookup proceed as follows. If
the IPC classifier is 1, only Way 0 is used for tag comparison and data access. If there is a
miss, then Way 1 is enabled in the next cycle. If there is a miss there as well, then (Way 2,
Way 3) are enabled in the next cycle. A miss here results in an L1 miss, leading to an L2 access.
[Bar chart; x-axis: Benchmark (art, bzip, equake, gcc, gzip, mcf, mesa, vpr, Average); legend: Load Found & IPC Matches, Load Found & IPC Mismatch, Not found]
Figure 6.5 Consistency of IPC Classification
Similarly, for an IPC-4 load, Way 2 & Way 3 are enabled in the first cycle, followed by Way
0 & Way 1 in the next cycle (if need be). This creates a pseudo-associative cache controlled
by IPC classifiers. There are multiple hit times (three in this example, but only two for the
most frequent case of IPC 4). If desired, all the IPC classifiers can be reduced to a two cycle
access (the ways are partitioned into two sets for each IPC value). Note that on a miss, the
data is also placed into a specific way according to the way-ordering in Table 6.1. We currently
place the data in a direct-mapped manner by forcing it into the first way enabled by the
schedule. Based on the PSAC [21] observation, we expected to see at least 70-75% of the cache
traffic completely contained within single ways with no cross traffic (only one cycle hit).
Figures 6.3 and 6.4 show this figure to be close to 66%. Note that we do not use any prediction
that comes in the critical path for cache access. The IPC classifiers can also be compiler
generated and LSQ validated. In this study, we have only considered dynamically generated (by
the issue stage) IPC
classifiers.
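A minimal sketch of the combined placement and lookup behavior for a single cache set follows. It assumes the schedule of Table 6.1 and, on a final miss, places the block in the lowest-numbered way of the first scheduled group; the text only says the block is forced into the first way enabled, so the tie-break within a two-way group is an assumption, as are the class and method names.

```python
from typing import Dict, List, Optional, Tuple

SCHEDULE: Dict[int, List[Tuple[int, ...]]] = {
    1: [(0,), (1,), (2, 3)],
    2: [(0, 1), (2, 3)],
    3: [(2, 3), (0, 1)],
    4: [(2, 3), (0, 1)],
}


class IpcScheduledSet:
    """One set of a 4-way cache, probed and filled according to SCHEDULE."""

    def __init__(self) -> None:
        self.tags: List[Optional[int]] = [None] * 4

    def access(self, tag: int, ipc_class: int) -> Tuple[bool, int]:
        """Return (hit, cycles spent in the L-1 lookup)."""
        schedule = SCHEDULE[ipc_class]
        for cycle, group in enumerate(schedule, start=1):
            if any(self.tags[way] == tag for way in group):
                return True, cycle
        # Final miss: fill the first way enabled by this class's schedule
        # (assumed: lowest-numbered way of the first group).
        self.tags[schedule[0][0]] = tag
        return False, len(schedule)


cache_set = IpcScheduledSet()
trace = [
    cache_set.access(0xA, 1),   # cold miss, block placed in Way 0
    cache_set.access(0xA, 1),   # same class: first cycle hit
    cache_set.access(0xA, 4),   # reclassified as IPC-4: found only in cycle 2
    cache_set.access(0xB, 4),   # cold miss, block placed in Way 2
    cache_set.access(0xB, 1),   # reclassified as IPC-1: found in cycle 3
]
```

The trace shows why consistent classification matters: the same block hits in one cycle under its original class but needs two or three cycles once the load is reclassified.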
6.4.1 Consistency of IPC classification 
Since we rely on classifying loads at run time, it is important that a load classified as
belonging to a particular IPC epoch is classified as belonging to the same IPC epoch for all its
occurrences (if it appears again). If not, the recurring load can probe a different way, leading
to misses as well as energy degradation. The methodology to capture the consistency of IPC
classification is as follows. We employ a 1024 entry buffer to store the replaced loads. Every
time a load occurs, the buffer is checked to see if it has occurred before. If so, the current IPC
classification is compared with the previous IPC classification. If they are the same, a counter
is incremented; if different, another counter is incremented. If the load is not found in the
buffer, a new entry is created for the load. Note that such a scheme captures IPC classification
consistency within a time window corresponding to 1024 misses. This is a significantly large
temporal window: with a 99% hit rate, 1024 misses span 102,400 load/stores, and with a 20%
load/store frequency these correspond to 512,000 instructions.
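The bookkeeping of this experiment can be sketched as below. Several details are assumptions because the text does not fix them: entries are keyed by a load identifier (for instance its PC), the buffer evicts in FIFO order when full, and an entry is updated to the most recent classification after each check. The class name `ConsistencyTracker` is hypothetical.

```python
from collections import OrderedDict


class ConsistencyTracker:
    """Counts how often a recurring load keeps its previous IPC class."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self.last_class: "OrderedDict[int, int]" = OrderedDict()
        self.match = 0        # load found, IPC class unchanged
        self.mismatch = 0     # load found, IPC class changed
        self.not_found = 0    # load not present in the buffer

    def observe(self, load_id: int, ipc_class: int) -> None:
        if load_id in self.last_class:
            if self.last_class[load_id] == ipc_class:
                self.match += 1
            else:
                self.mismatch += 1
        else:
            self.not_found += 1
            if len(self.last_class) >= self.capacity:
                self.last_class.popitem(last=False)   # assumed FIFO eviction
        self.last_class[load_id] = ipc_class           # remember the latest class


tracker = ConsistencyTracker(capacity=2)
for load_id, ipc_class in [(0x10, 4), (0x10, 4), (0x10, 3), (0x20, 1), (0x30, 1), (0x10, 3)]:
    tracker.observe(load_id, ipc_class)
```

In this small run the load `0x10` matches once, mismatches once, and is counted as not found again after being evicted from the two-entry buffer.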
The observations from this experiment are shown in Figure 6.5. It is seen that, on average,
the fraction of loads getting reclassified as belonging to the same IPC epoch is about 65% to
70%. The mis-classified loads are a mere 16%. This vouches for the consistency of the IPC
based classification methodology. This is one of the reasons why the scheme performs as well
as it does (as reported in Chapter 7).
6.5 Cache Energy-Delay Model
The hardware organization of a typical set associative cache architecture is shown in Figure
6.6. In order to compare the benefits of the proposed cache design against conventional cache
designs, it is important to understand the relationship between energy and delay based on
this model. The N-way set associative cache has N static RAM arrays each for data and tag.
Each row of the data array stores memory words. For a given CPU address, tag decoding and
the data accesses are both performed in parallel. This design methodology tries to balance
the tag and data accesses so that the select signal from the tag array and the data are both
available at the select multiplexors simultaneously. Every cache access results in decoding the
address issued by the CPU, which asserts exactly one of the word lines. A word line is common
to all 4 ways since a single set-address designator corresponds to all the ways in a given set.
The bit lines are changed to reflect the selected bit cell state, followed by a sense amplifier
that accelerates this change. Since the word lines stretch across all four ways, their capacitance
contributes significantly to the energy ($E_{wl}$ per way) and delay. For every access, since tag
resolution would not be completed before data is available in the data-output drivers, energy
($E_{bl}$ per way) is spent in the bit lines of all four ways (of a set). Added to that is the energy
spent in redundant tag comparisons ($E_{cmp}$) and sense amplifiers ($E_{sa}$). Once tag resolution
is done, energy is spent in the multiplexor that drives one of the data output drivers ($E_{drv}$).

Figure 6.6 Set Associative Cache Architecture

The energy spent per access is developed as follows. Let us define the array energy to be
$E_{arr} = E_{wl} + E_{bl} (+ E_{cmp})$. There are constant components of energy ($E_{cons}$) that
involve the drivers, inverters and multiplexors. Then, the energy per access is roughly given by:
$$E = (E_{dec} + 4 E_{arr})_{tag} + (E_{dec} + 4 E_{arr})_{data} + E_{cons}$$
Even if the access results in a hit in the cache, almost three fourths of the total energy spent
is redundant. In a sequential access cache, since data access follows tag resolution, even though
energy is saved in the data array, cache access time is stretched and hence the energy-delay
product suffers.
In the proposed scheme the accesses are sequentialized across ways: the data array accesses
selected by the schedule are initiated in parallel with the tag access/comparison. Note that
both tag and data access proceed only in the scheduled ways and not in all the ways. Now that
the decoders drive a smaller number of ways, cache access delay is reduced (the set associative
mapping tends towards a direct mapped cache architecture). The data and tag array energy in the
1-way (2-way) access schedule would be roughly 25% (50%) of the original energy. In the best
case, a 1-way access may result in a hit. But the scheme can result in two or three cycle
accesses as well. In general, for an n cycle hit probing one way per cycle, where n = 1, 2 or 3,
the energy is roughly given by:
$$E = n \left[ (E_{dec} + E_{arr})_{tag} + (E_{dec} + E_{arr})_{data} \right] + E_{cons}$$
For reasons discussed in Chapter 6 and validated in Chapter 7, the proposed scheme results
in good placement, which leads to a single cycle access for most loads. Hence, the energy saved
in the arrays during every single cycle access offsets the extra energy used in the decoders
during every multi-cycle hit. These energy computations, for varying hit times, are formulated
in Table 6.2.
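The entries of Table 6.2 can be reproduced with a small helper. The sketch assumes that Ed and Ea in the table lump the tag and data contributions of the decoder and of one way's arrays respectively, that the decoder is exercised once per cycle, and that every enabled way costs one Ea. The numeric values used in the usage example are arbitrary placeholders chosen for easy arithmetic; they are not CACTI results.

```python
from typing import List, Tuple

Schedule = List[Tuple[int, ...]]

IPC1_SCHEDULE: Schedule = [(0,), (1,), (2, 3)]
IPC234_SCHEDULE: Schedule = [(0, 1), (2, 3)]   # way order differs per class, sizes do not


def scheduled_energy(schedule: Schedule, hit_cycle: int,
                     e_dec: float, e_arr: float, e_cons: float) -> float:
    """Energy of an access that hits in the given cycle of its schedule."""
    if not 1 <= hit_cycle <= len(schedule):
        raise ValueError("hit_cycle must lie within the schedule")
    ways_switched = sum(len(group) for group in schedule[:hit_cycle])
    return hit_cycle * e_dec + ways_switched * e_arr + e_cons


def conventional_energy(e_dec: float, e_arr: float, e_cons: float,
                        num_ways: int = 4) -> float:
    """Energy of a parallel access that probes all ways at once."""
    return e_dec + num_ways * e_arr + e_cons


# Placeholder unit energies (hypothetical): Ed = 1, Ea = 10, Econs = 2.
ED, EA, ECONS = 1, 10, 2
ipc1_row = [scheduled_energy(IPC1_SCHEDULE, n, ED, EA, ECONS) for n in (1, 2, 3)]
ipc234_row = [scheduled_energy(IPC234_SCHEDULE, n, ED, EA, ECONS) for n in (1, 2)]
baseline = conventional_energy(ED, EA, ECONS)
```

With these placeholder values a first cycle hit costs 13 units for an IPC-1 load and 23 units for the other classes, against 43 units for the conventional 4-way access; only a third cycle hit (45 units) exceeds the baseline.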
The influence of capacitance reduction is not just evident in energy savings but also
manifests itself in cache access latency reduction. In every cycle, the proposed schedule
accesses a maximum of two ways instead of four. Even in a cache design that involves
partitioned word lines, the drive necessary is roughly one half of what is needed in a four-way
access design. Hence the cache access latency comes down significantly. CACTI based simulations
quantify this reduction. We observe that the cache access time reduces by 13% (from 1.695 ns
to 1.475 ns).
CHAPTER 7. EXPERIMENTAL EVALUATION
7.1 Methodology
The experimental evaluation of the modified cache control is simulation based. The first
exercise determined the distribution of load/stores among different IPC classifications. A base
model was simulated to obtain this distribution (presented in Figure 6.3). These distribution
numbers were critical in understanding and developing a practical way-access schedule, detailed
in Chapter 6. The performance and energy benefits of the selective way-access schedule were
then evaluated as follows.
7.2 Simulation environment
The SimpleScalar simulator [3] version 3.0 targeting Alpha ISA [24] is employed to evaluate 
the IPC driven sequential selective cache access scheme. The simulation models a current gen-
eration 4-way dynamically scheduled processor microarchitecture with two levels of instruction 
and data cache memories. We employ nine SPEC2000 benchmarks to compare the performance 
of the proposed microarchitecture against base models. 
The baseline model can fetch and issue up to 4 instructions per cycle. The dynamic
scheduler window is 64 entries deep and the load/store queue can store up to 32 entries. The
memory system consists of split Level-1 caches, the L-1 data cache being a 32KB, 4-way
set-associative cache and the L-1 instruction cache being a 64KB direct mapped cache. Data
cache access is assumed to have a 3 cycle latency. There is also a 256KB 4-way set associative
unified L-2 cache with a 6 cycle hit latency. Memory access is assumed to take 60 cycles.
The model uses a 2-level branch prediction mechanism, the gshare branch predictor, which
uses an 8-bit global history table and a 4K entry BTB. The simulations were run for at least
500 million instructions and all simulations are appropriately forwarded through the transient
phases based on [15]. Energy numbers are related to the number of cache accesses and more
specifically to the number of way accesses. The energy per access is quantified through CACTI
2.0 [16] based simulations. CACTI also provides a breakdown of the L-1 data cache access energy,
which is utilized to compute the energy gains from the proposed design based on Table 6.2.
7.3 Performance characteristics
The metric used to compare the performance of the proposed scheme against the base
scheme is Instructions per Cycle (IPC), which is reflective of the ILP delivered. The base
microarchitecture is evaluated separately without accounting for any penalty attributable to
mis-classification/mis-placement, which leads to multi-cycle accesses. The proposed
classification scheme is integrated into the SimpleScalar environment by adding a detailed
IPC-aware sequential cache probing/placement scheme for every L-1 data cache access. The IPC
based classifier is implemented in the scheduler stage to guide cache access. The only
contribution to performance degradation is attributable to loads that take more cycles than
usual to return data (as a result of the sequential way access schedule).
The performance degradation numbers are reported in Table 7.1. Figure 7.1 shows that the
average degradation remains within a conservative limit of 2.25% and is around 1% for four
benchmarks. These numbers indicate that the proposed IPC based classification leads to a better
overall energy-performance design point.
7.4 Energy characteristics
In order to capture the energy benefits of the proposed scheme, CACTI 2.0 [16] based
simulation results were used as a starting point. A 32KB, 4-way set associative cache with a
line size of 16 bytes is simulated to obtain the energy and delay values. We obtain a breakdown
of total energy/delay in terms of the decoder, wordline, bitline and sense-amp components for
both tag and data arrays. The energy associated with the comparators (tag), multiplexors
[Bar chart; x-axis: SPEC2000 CPU Benchmark (mcf, bzip, gcc, equake, art, gzip, mesa, vpr, parser, Mean); legend: Base IPC, IPC due to selective sequential cache access]
Figure 7.1 Performance Characteristics of IPC driven cache architecture
cache of aforementioned dimensions. The respective energy components are scaled according 
to the number of accesses to the different ways obtained from SimpleScalar simulations.
The implementation of the proposed scheme in the SimpleScalar simulator also takes care of
decomposing the scaling factors for sequential access schedules. This becomes necessary be-
cause, every time a load results in a first cycle miss, the decoder has to be exercised again to 
access the next way in the schedule. This doubles the decoder energy for that load. At the 
same time, every access might result in the switching of one or two ways, depending on the 
IPC-epoch it originates from. The net energy for the whole simulation cycle time is obtained 
as the product of the energy per access obtained from CACTI and the number of accesses, 
obtained from the simulator modeling. These figures are tabulated in Table 7.1 and shown in 
Figure 7.2. 
The energy saving obtainable from the IPC-aware sequential access schedule is dramatic.
Over nine benchmarks, the average saving in L-1 data cache energy is about 28.6%.
In a majority of the benchmarks, the savings are at least 20%. Note that the









Benchmark   Base IPC   Proposed IPC   % IPC Degradation   Original Energy (nJ)   Reduced Energy (nJ)   % Energy Saved
mcf         0.829      0.801          3.401               4.373                  3.075                 29.677
bzip        1.755      1.736          1.128               3.253                  1.921                 40.933
gcc         1.194      1.165          2.454               6.441                  3.740                 41.939
equake      2.088      2.065          1.092               3.802                  2.431                 36.067
art         0.727      0.709          2.395               4.883                  4.715                 3.442
gzip        1.859      1.828          1.667               4.276                  2.967                 30.624
mesa        2.583      2.568          0.612               3.069                  2.425                 20.971
vpr         1.625      1.565          3.729               5.319                  3.582                 32.649
parser      2.267      2.197          3.079               5.766                  4.548                 21.133
Mean        1.659      1.626          2.173               4.576                  3.267                 28.604
Table 7.1
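The savings column can be cross-checked from the two energy columns. The short sketch below recomputes the percentage saved per benchmark from the tabulated original and reduced energies; because the tabulated energies are rounded to three decimals, the recomputed values agree with the table only to within a few hundredths of a percentage point. The helper name `percent_saved` is an illustrative choice.

```python
# (original energy in nJ, reduced energy in nJ, tabulated % energy saved)
TABLE_7_1 = {
    "mcf":    (4.373, 3.075, 29.677),
    "bzip":   (3.253, 1.921, 40.933),
    "gcc":    (6.441, 3.740, 41.939),
    "equake": (3.802, 2.431, 36.067),
    "art":    (4.883, 4.715, 3.442),
    "gzip":   (4.276, 2.967, 30.624),
    "mesa":   (3.069, 2.425, 20.971),
    "vpr":    (5.319, 3.582, 32.649),
    "parser": (5.766, 4.548, 21.133),
}


def percent_saved(original: float, reduced: float) -> float:
    """Relative energy reduction, in percent of the original energy."""
    return 100.0 * (original - reduced) / original


recomputed = {name: percent_saved(orig, red) for name, (orig, red, _) in TABLE_7_1.items()}
mean_saved = sum(recomputed.values()) / len(recomputed)
at_least_20 = sorted(name for name, value in recomputed.items() if value >= 20.0)
```

The recomputed mean is close to the reported 28.6%, and eight of the nine benchmarks save at least 20%, with art as the exception.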













[Bar chart; x-axis: SPEC2000 CPU Benchmark (mcf, bzip, gcc, equake, art, gzip, mesa, vpr, parser, Mean); legend: Original energy, Reduced Energy]
Figure 7.2 Energy Characteristics of IPC driven cache architecture
average. Coupled with the fact that the scheme has no negative implications on the cycle time,
this makes it an impressive scheme for effective energy reduction in caches.
CHAPTER 8. CONCLUSIONS 
8.1 Dependence driven dispatch 
With decreasing feature sizes and the ability to incorporate more computing resources 
on silicon, contemporary processors are well equipped to extract maximum performance im-
provement out of any increase in the window size. The computer architecture community has 
adopted dynamic scheduling schemes leading to the dominance of superscalar microarchitec-
tures in the commercial domain. Apart from the issue logic, one of the key dynamic activity in 
superscalar microarchitectures that limits our ability to capture maximum ILP is the dispatch 
stage. In other words, designers have been crippled by the dispatch stage complexity which 
currently restricts the maximum pipeline bandwidth to around 4-8 instructions. In this thesis, 
we propose a novel dispatch methodology based on actual program behavior to alleviate the 
dispatch bottleneck. The dispatch resources are allocated to support the average program be-
havior instead of the peak. The overall IPC reduction resulting from this viewpoint is limited 
to less than 4%. Since current RMT port design strategies over-commit resources towards 
renaming, our scheme also reduces the energy by careful allocation of resources. This complements 
the theory that least energy is achieved at a design point which balances resources and 
requirements. The most important contributions are in the form of a new paradigm in rename 
map table design, with the potential for increased dispatch width (9, 16, 24 from 8, 12, 16) and 
reduced rename energy through a 33-44% reduction in ports as well as a 67-84% reduction in 
the activity at the ports. 
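
The idea of provisioning for average rather than peak behavior can be illustrated by counting rename map table (RMT) read lookups for one dispatch group. The Python sketch below is a simplified model written for illustration, not the dispatch hardware evaluated in this thesis. It assumes that a source register produced by an earlier instruction in the same group obtains its mapping from the intra-group dependence check and so needs no RMT read port. The `(dest, src1, src2)` encoding and the function name `rmt_read_ports_needed` are hypothetical.

```python
def rmt_read_ports_needed(group):
    """Count RMT read lookups for one dispatch group.

    group: list of (dest, src1, src2) register names; None marks an unused field.
    Assumption: a source written earlier in the same group is resolved by the
    dependence check and does not consume an RMT read port.
    """
    produced = set()  # destinations renamed earlier in this group
    reads = 0
    for dest, *sources in group:
        for src in sources:
            if src is not None and src not in produced:
                reads += 1
        if dest is not None:
            produced.add(dest)
    return reads


# A hypothetical 4-instruction dispatch group with a dependence chain.
group = [
    ("r1", "r2", "r3"),
    ("r4", "r1", "r5"),
    ("r6", "r4", "r1"),
    ("r7", "r6", None),
]
peak = 2 * len(group)  # two source operands per instruction
needed = rmt_read_ports_needed(group)
print(f"peak read ports: {peak}, needed for this group: {needed}")
```

In this made-up group most sources are produced within the group, so the demand on the RMT is well below the peak of two reads per instruction.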
8.2 IPC driven sequential cache 
As a second contribution, this thesis proposes an IPC-aware sequential associative cache 
architecture, to alleviate the L-1 data cache energy bottleneck. This novel cache-access method-
ology creates a sequential schedule for cache accesses based on the ILP delivered at every cycle. 
The load/stores get classified by the instantaneous IPC and follow the access pattern during 
the access. Consequently, due to reduction in the number of ways looked up, the resources 
allocated tend to dissipate energy in proportion to the actual ILP delivered and not the peak 
ILP. The overall IPC degradation remains within tolerable limits of 2%, while the cache array 
energy savings are at least 20% for a 32KB, 4-way set associative L-1 data cache. There are also 
cache access time advantages (13%) obtainable as a result of reduced capacitance per action. 
(This reduction is not leveraged in our study to show performance and energy results.) This 
again reinforces our hypothesis that least energy is achieved at a design point which balances 
resources and requirements. 
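
The relation between an access schedule and the number of ways looked up can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the evaluated design: the class labels, the `SCHEDULES` table and the function `ways_probed` are all hypothetical, array energy is taken to be proportional to the number of ways probed, and the sketch deliberately leaves open which IPC class is assigned which schedule.

```python
# Hypothetical schedules for a 4-way cache: each schedule is a list of steps,
# and each step is the tuple of ways probed in parallel in that step.
SCHEDULES = {
    "A": [(0, 1, 2, 3)],            # all ways in parallel
    "B": [(0, 1), (2, 3)],          # two ways per step
    "C": [(0,), (1,), (2,), (3,)],  # one way per step
}


def ways_probed(schedule, hit_way):
    """Return (ways looked up, steps taken) until hit_way is found.

    hit_way is None on a miss, in which case every step is executed.
    """
    probed = 0
    for step_index, step in enumerate(schedule, start=1):
        probed += len(step)
        if hit_way in step:
            return probed, step_index
    return probed, len(schedule)


for name, schedule in SCHEDULES.items():
    probed, steps = ways_probed(schedule, hit_way=1)
    print(f"class {name}: {probed} ways probed in {steps} step(s)")
```

For a hit in way 1, the parallel schedule probes all four ways in one step, while the two sequential schedules probe only two ways, at the cost of an extra step in the one-way-per-step case.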
An additional benefit of this scheme is to increase the available ports from the data cache. 
Consider a 2-port data cache. If there are two loads with a disjoint access schedule (say one 
accesses Ways 0 & 1 and the other accesses Ways 2 & 3), then both can be supported by the 
single port. The only extra cost for this scheme is in the extra decoders for each way for each 
port. Many interesting problems arise in this context. Specifically, can the access schedules be 
tweaked so that they result in no further IPC loss but are at the same time more way-disjoint? We 




[1] V.Agarwal, M.Hashish, S.Heckler, and D.Burger, "Clock Rate versus IPC: The End of the 
Road for Conventional Microarchitectures" , Proceedings of the 27th Annual International 
Symposium on Computer Architecture,June 2000. 
[2] B. Balton and T. N.Vijaykumar, "Reactive Associative Caches" , Proceedings of the in-
ternational symposium on parallel architectures and compiler techniques (PACT), 2001. 
[3] D. Burger, T. Austin, S. Bennett, Evaluating Future Microprocessors: The SimpleScalar 
Toolset, Technical Report CS-TR-96-1308, University of Wisconsin, Computer Science 
Dept., Madison, 1996. 
[4] D. Ernst, and T. Austin, "Efficient Dynamic Scheduling Through Tag Elimination" , Proc. 
29th Annual International Symposium on Computer Architecture, May 2002. 
[5J Manoj Franklin and Gurindar S. Sohi, "Register Traffic Analysis for Streamlining Inter- 
Qperation Communication in Fine-Grain Parallel Processors", Proceedings of the 25th In-
ternational Symposium on Microarchitecture, pp. 236--245, IEEE Computer Society Press, 
1992. 
[6] S.Vajapeyam and Gurindar S. Sohi, "Instruction Issue Logic for high perfoemance, in-
terruptible pipelined processors" , International Symposium on Computer Architecture, 
1987. 
[7] R.M.Tomasulo, "An e$icient algorithm for exploiting multiple arithmetic units", IBM 
Journal of Research and Development, January, 1967. 
57 
[8] McFarling, Scott, Combining Branch Predictors, Technical Report Number:TN-36, June, 
1993. 
[9] M. Moudgill, K. Pingali, and S. Vassiliadis, "Register Renaming and Dynamic Speculation: 
An Alternative Approach" , Proc. ~~th Annual International Symposium on Microarchi-
Lecture, 1993, pp. 202-213. 
[10] The MOSIS Service, Date Accessed: 11/20/02, http://www.mosis.com/Technical/Processes/proc-
tsmc-smos018.htm1 
[11] S. Palacharla, N. P. Jouppi, and J. E. Smith, "Complexity-Effective Superscalar Proces-
sors", Proc. ~.~th Annual International Symposium on Computer Architecture, 1997, pp. 
206-218. 
[12] S. Palacharla, "Complexity-Effective Superscalar Processors", Ph.D. Thesis, University 
of Wisconsin-Madison, 1997. 
[13] V. Sankaranarayanan, and A. Tyagi, "A Hierarchical Dependence Check and Folded Re-
name Mapping based Scalable Dispatch Stage" , Proc. of the IEEE International Conf er- 
ence On Computer Design: VLSI In Computers ~ Processors (ICCD 'O1), 2001, pp.249-
255. 
[14] V. Sankaranarayanan," A Hierarchical Dependence Check and Folded Rename Mapping 
based Scalable Dispatch Stage" , Masters' Thesis, Iowa ,state University, 2001. 
[15] Tim Sherwood, Erez Perelman and Brad Calder, "Block Distribution Analysis to Find 
Periodic Behavior and Simulation Points in Applications",Proc.l0th International Con- 
f erence on Architectural Support for Programming Languages and Operating Systems, Oc-
tober 2002. 
[16] P. Shivakumar and N. Jouppi, CACTI 3.0: An Integrated Cache Timing, Power and Area 
Model, Technical Report WRL 2001 f 2, DEC/Compaq Western P,esearch Labs, Palo Alto, 
CA. August 2001. 
58 
[17] Eric Sprangle and Doug Carmean, "Increasing processor performance by implementing 
deeper pipelines",Proceedings of the 29th annual international symposium on Computer 
architecture, 2002. 
[18] Eric Sprangle and Yale Patt, "Facilitating superscalar processing via a combined 
static/dynamic register renaming scheme" ,Proceedings of the 27th annual international 
symposium on Microar+chitecture, 1994. 
[19] J.Wang, S.F~ang and W.Feng, New E~ciertt Designs for XOR and XNOR functions on 
the Transistor Level, IEEE Journal of Solid State Circuits Vol. 29, No.7, pp. 780-786. 
[20] D. H. Albonesi, Selective cache ways: On-demand cache resource allocation, International 
Symposium on Microarchitecture, pages 248—, 1999. 
[21] B. Calder, D. Grunwald, and J. Emer, Predictive sequential associative cache, Proc. 2nd 
Symp. High-Performance Comp. Arch., ,San Jose, CA, pages 244-253, January 1996. 
[22] J. M. et. al, A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor, IEEE Journal of 
Solid-State Circuits, 11(31) :1703-1714, 1996. 
[23] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge, Drowsy caches: Simple 
techniques for reducing leakage power, ISCA, June 2002. 
[24] Kessler.J, The 21264: A Superscalar Alpha processor with out-of--order execution, Pre-
sented at the 9th Annual Microprocessor Forum, San Jose, CA., 1996. 
[25] S. Marne, A. Klauser, and D. Grunwald, Pipeline gating: Speculation control for energy 
reduction, ISCA, pages 132-141, 1998. 
[26] M. D. Powell, A. Agarwal, T. N. Vijaykumar, B. Falsafi, and K. Roy, Reducing set-
associative cache energy via way-prediction and selective direct-mapping, Proceedings 
of the 3.~th annual ACM/IEEE international symposium on Microarchitecture, Austin, 
Texas, pages 54-65, 2001. 
59 
[27] P. Ramarao and A. Tyagi, An adiabatic framework for a low energy micro-architecture 
and compiler, Worl~shop on interaction between Compilers and Computer Architecture 
(I~TEPA CT - 7), 2002. 
[28] T. Sherwood, E. Perelman, and B. Calder, Reducing set-associative cache energy via 
way-prediction and selective direct-mapping, Proceedings of the 3.~ th annual A C1Vl/I.~EE 
international symposium on Microarchitecture, Austin, Teaas, pages 54-65, 2001. 
