An Investigation of thread scheduling heuristics for a simultaneous multithreaded processor by Zajac, Ralph, Jr
Rochester Institute of Technology
RIT Scholar Works
Theses Thesis/Dissertation Collections
9-1-2000
An Investigation of thread scheduling heuristics for
a simultaneous multithreaded processor
Ralph Zajac Jr
Recommended Citation
Zajac, Ralph Jr, "An Investigation of thread scheduling heuristics for a simultaneous multithreaded processor" (2000). Thesis.
Rochester Institute of Technology. Accessed from
An Investigation of Thread Scheduling Heuristics for a
Simultaneous Multithreaded Processor
by
Ralph Zajac, Jr.
A Thesis Submitted
In
Partial Fulfillment of the
Requirements for the Degree of
Master of Science
in
Computer Engineering
Primary Advisor:
Dr. Muhammad Shaaban, Assistant Professor
Committee Member:
Dr. Roy Czernikowski, Professor
Committee Member:
Dr. Hans-Peter Bischof, Assistant Professor, Computer Science Dept.
Department of Computer Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
September, 2000
Release Permission Form
Rochester Institute of Technology
An Investigation of Thread Scheduling Heuristics for a Simultaneous
Multithreaded Processor
I, Ralph Zajac, Jr., hereby grant permission to any individual or organization to reproduce this
thesis in whole or in part for non-commercial and non-profit purposes only.
Ralph Zajac, Jr.
9~ IS'-oQ
Date
Abstract
Over the years, the von Neumann model of computing has undergone many enhancements.
These changes include an improved memory hierarchy, multiple instruction issue, and branch
prediction. Since the model's introduction, the performance of processors has increased at a much
greater rate than that of memory. Several modifications to hide this ever-widening gap in
performance are being examined in current research. A very promising one is the Simultaneous
Multithreaded processor. This architecture strives to further reduce the effects of long latency
instructions, such as memory accesses, by allowing multiple threads of execution to be active in the
processor at the same time. With the introduction of multiple active threads in a single processor,
several new aspects of processor operation can have a sizeable effect on performance. One such
aspect is how to choose the thread from which to fetch instructions during the next cycle.
For this project, three different classes of fetch scheduling mechanisms were defined and
examples of each were either studied or proposed. The proposed mechanisms were then tested using
a set of four sample programs by adding the mechanisms to a Simultaneous Multithreading
simulator based on the Simple Scalar tool set from the University of Wisconsin-Madison. With the
proper configuration, each of the proposed mechanisms improved the performance of the simulated
architecture. However, the best increase in performance was produced by the Effectiveness History
Table. It achieved an IPC of 2.0995 for two threads while overriding the primary scheduling
mechanism only 0.070% of the time.
Contents
Abstract
List of Figures
List of Tables
Glossary
Acknowledgments
Trademarks
1 Introduction
2 Computer Architecture Overview
2.1 Modern Architectural Components
2.1.1 Reduced Instruction Set Computing
2.1.2 Pipelining
2.1.3 Out-of-Order Execution
2.1.4 Branch Prediction
2.1.5 Multiple Issue Techniques
2.1.6 The Memory-Hierarchy
2.2 Overcoming the Processor/Memory Speed Gap
2.2.1 Wide Superscalar and Superspeculative Processors
2.2.2 Simultaneous Subordinate Microthreading
2.2.3 Chip Multiprocessors
2.2.4 Multithreading Architectures
3 Thread Scheduling Methods
3.1 Basic Scheduling Methods
3.1.1 Multithreaded Superscalar Digital Signal Processor (MSDSP)
3.1.2 Simultaneous Multithreading Project
3.2 Compound Scheduling Methods
3.2.1 Thread Prioritization
3.2.2 Weighted Averages
3.3 Hierarchical Scheduling Methods
3.3.1 MISSCOUNT Overwatch
3.3.2 Effectiveness History Table
4 Simulator Overview
4.1 Base Simulator
4.1.1 Simulator Pipeline
4.1.2 Miscellaneous Structures
4.1.3 Simulator Memory Space
4.1.4 Simulator Completion
4.1.5 Simulator Approximations and Shortcuts
4.2 Simulator Modifications
4.2.1 Weighted BRCOUNT/ICOUNT Average
4.2.2 MISSCOUNT Overwatch
4.2.3 The Effectiveness History Table
5 Simulation Results
5.1 The Test Programs
5.2 Base Simulator
5.3 Weighted BRCOUNT/ICOUNT Average
5.4 MISSCOUNT Overwatch
5.5 The Effectiveness History Table
5.6 Summary
6 Conclusions and Future Work
6.1 Future Work
List of Figures
2.1 The standard 5-stage pipeline
2.2 Branch misprediction penalty for a basic pipeline
2.3 Abstraction of a memory hierarchy
3.1 Block diagram of a basic scheduling mechanism
3.2 Block diagram of a compound scheduling mechanism
3.3 Block diagram of a hierarchical scheduling mechanism
4.1 The Simple Scalar pipeline
4.2 sim-SMT memory space
5.1 Test results for base simulator
5.2 Performance of different weighted averages
5.3 System performance of the weighted averages
5.4 System performance of the base simulator
5.5 Results for MISSCOUNT overwatch
5.6 MISSCOUNT overwatch performance with two and four threads
5.7 Performance of mc5.2
5.8 Results for the EHT
5.9 EHT performance for two and four threads
5.10 Performance of eht7.20
List of Tables
4.1 Summary of switches and constants for the various scheduling mechanisms
5.1 Percentage of Cycles MISSCOUNT was used
5.2 Usage statistics for mc5.2
5.3 Percentage of Cycles the EHT was used
5.4 Usage statistics for eht7.20
Glossary
arithmetic instruction An instruction in which a mathematical operation such as ADD, SUBTRACT,
MULTIPLY, or DIVIDE is applied to either integer or floating point operands.
BHT branch history table A branch prediction mechanism that keeps track of whether a branch
has been taken in the past and is accessed via the branch's address.
block multithreaded See coarse-grained multithreaded.
branch misprediction When the branch prediction hardware of a processor makes a wrong guess
as to whether a branch is taken or not.
BRCOUNT A scheduling heuristic where the thread with the smallest number of branch
instructions in the static portion of the pipeline is given the highest priority.
basic scheduling Using a simple scheduling mechanism that is not dynamically modified and that
monitors only one aspect of processor performance.
BTB branch target buffer A memory for storing the last address that a branch went to.
cache A smaller, faster set of memory that is used to hold the most recently used data/instructions.
capacity miss A cache miss that occurs when the cache is not large enough to hold a program's
working set.
CDB common data bus A bus used by some processors capable of out-of-order execution to allow
all functional units waiting for an operand to be loaded at the same time.
coarse-grained multithreaded A processor which holds multiple contexts in hardware and switches
between them when a predefined event occurs.
cache miss When the data requested is not currently in the cache addressed.
CMP chip multiprocessor A proposed architecture where multiple, simple RISC cores are placed
on the same die and share an on-die L2 cache.
compulsory miss A cache miss that occurs on the very first access to a cache block.
conflict miss A cache miss that occurs when too many blocks map to the same set.
compound scheduling Using more than one basic scheduling mechanism, possibly weighting
them, to determine a thread's priority.
control hazards The possibility of non-sequential program execution caused by jump and branch
instructions.
D-cache A cache dedicated to storing data.
deep pipelined A processor designed with a pipeline with many stages. This is one technique
that is used to increase the frequency that a processor is capable of running at. See also
pipelining.
DSP digital signal processor A microprocessor designed to produce high performance when
processing signals of different types.
EPIC explicitly parallel instruction computing A VLIW-like design philosophy developed by Intel
and HP for future processor designs. See also VLIW.
EX The stage of the basic pipeline where instructions are executed.
execution-driven A simulator that executes a compiled program.
execution window In a processor capable of out-of-order execution, the instructions that are able
to be issued and are waiting for execution resources.
fine-grained multithreaded A processor which holds multiple contexts in hardware, and switches
between the contexts every cycle, or every couple of cycles.
floating-point instruction An instruction that performs arithmetic and comparison operations
on floating-point data/registers.
functional simulator A processor simulator that will produce the results of a program but ignore
all of the internal details.
FUs functional units
hierarchical scheduling Using a basic scheduling mechanism in a supervisor for a second basic
scheduling mechanism.
ICOUNT A scheduling heuristic which gives the highest priority to the thread with the smallest
number of instructions in the static portion of the pipeline.
I-cache First level cache that contains only instructions.
ID The stage of the basic pipeline where instructions are decoded and their operands are read.
IF The stage of the basic pipeline where instructions are fetched from memory.
ILP instruction level parallelism The overlapping of the execution of independent instructions.
in-order execution The execution of instructions by a microprocessor in the order that the
instructions were fetched from memory.
IQ instruction queue A structure used by processors that execute instructions out-of-order to
maintain the program order of the instructions. See also ROB.
IQPOSN A scheduling heuristic which gives the lowest priority to the thread with instructions
closest to the head of the instruction queue.
IS The stage of the out-of-order execution pipeline where instructions are decoded and checked for
structural hazards.
issue width The number of instructions that a processor can issue in a single clock cycle.
L1 The first level of cache.
L2 The second level of cache.
MISSCOUNT A scheduling heuristic which gives the highest priority to the thread with the
smallest number of outstanding D-cache misses.
MEM The stage of the basic pipeline where memory accesses occur.
Multi-Hybrid branch predictor A branch predictor consisting of several small branch
predictors that each perform well on specific types of branches.
microRAM Random access memory used by a SSMT processor to hold the code for the current
set of microthreads. See also SSMT.
MTA multithreaded architecture A processor capable of holding multiple program contexts and
sharing processor resources between them.
out-of-order execution The execution of instructions by a microprocessor as the data they need
becomes available without regard to program order.
overwatch mechanism The basic scheduling mechanism that corrects the decisions of the
primary scheduling mechanism in a hierarchical scheduling system.
performance simulator A processor simulator that models a specific microarchitecture.
PHT pattern history table A mechanism for keeping track of whether specific conditional branches
were taken in the past.
pipelining A technique where a processor overlaps the execution of multiple instructions.
prefetching The ability to fetch control-independent instructions that come after a branch and
to delay fetching instructions that follow a hard-to-predict branch.
primary mechanism The basic scheduling mechanism that handles the cycle-to-cycle thread
scheduling for a hierarchical scheduling mechanism. See also hierarchical scheduling.
RAW read after write A data hazard where an instruction attempts to read data that has not
yet been written by an earlier instruction.
RO The stage of the out-of-order execution pipeline where instructions wait until all data hazards
are cleared and then read their operands.
ROB reorder buffer A structure used in processors that use out-of-order execution to maintain
program order and to store results until they can be committed.
round-robin A selection algorithm where a different item from a list is selected when called for.
SMT simultaneous multithreading A processor which holds multiple contexts in hardware and
can issue instructions from any number of these contexts each clock cycle.
Spatial locality The principle that states that items that are located at addresses that are
physically close are usually referenced close to one another.
SSMT Simultaneous Subordinate Microthreading
superscalar A processor that can issue a varying number of instructions in a single clock cycle.
trace cache A cache used to store up to an issue-width's worth of instructions over multiple
branches.
Temporal locality The principle that states that programs tend to reuse data and instructions
that have been used recently.
TLP thread level parallelism The overlapping of the execution of different programs, or
independent portions of the same program.
trace-driven A simulator that reads a previously generated list of instructions and executes them.
thread scheduling The process of assigning control of the fetch unit in an SMT to a thread or
threads for a specific clock cycle.
VLIW very long instruction word A processor that issues a fixed number of instructions in a
single clock cycle, usually as one large instruction.
WAR write after read A data hazard where an instruction attempts to write to an operand before
an earlier instruction can read from it.
WAW write after write A data hazard where an instruction attempts to write to an operand
before an earlier instruction has written to it.
WB The stage of the basic pipeline where data is written to the register file.
Acknowledgments
I would like to thank the people who have helped me in the completion of this thesis. First, I
thank my graduate committee members, Dr. Muhammad Shaaban, Dr. Roy Czernikowski and Dr. Hans-Peter
Bischof, for their guidance and suggestions throughout this project. I would also like to thank many
of my fellow students, who helped me by offering comments during the process of choosing a topic
and the early stages of the work on this project. Finally, I would like to thank my parents for
always supporting me.
Trademarks
Alpha is a registered trademark of Compaq Corporation.
HP, PA-RISC and PA-8000 are trademarks of Hewlett-Packard Company.
MAP1000 is a registered trademark of Equator.
MIPS is a registered trademark of MIPS Technologies, Inc.
POWER4 and PowerPC are registered trademarks of International Business Machines
Corporation.
Tera and Tera Computer System are registered trademarks of Tera Computer Systems, Inc.
Chapter 1
Introduction
The majority of the current commercial microprocessors achieve high performance through the use
of superscalar and out-of-order execution designs. These design methods attempt to increase the
amount of instruction level parallelism (ILP) that the processor can take advantage of. The standard
RISC superscalar processor, as illustrated by current high performance commercial microprocessors,
is quickly reaching the limits of the ILP that can be extracted from the code. Efforts to extract even
more ILP from current programs include the extension of current architectural mechanisms [26]
and both speculative execution and speculative selection of data [18].
This approach tries to create more ILP by executing more instructions, some of which will have
to be discarded once the true target of a branch is known. While this approach may or may not be
a viable way to improve performance, it ignores a very important problem: the ever growing gap
between the speed of the processor and the speed of memory.
Current research is trying to solve this problem through two related approaches. The first is
to design a chip multiprocessor (CMP) [8, 24, 25, 14]. This approach is based on the realization
that integrated circuit (IC) manufacturing processes that will be available soon will allow the
placement of a number (2-8) of simple RISC processor cores onto a single die. In this way,
designers hope to improve performance by allowing the processor to take advantage of thread
level parallelism (TLP). The second approach is to design a multithreaded architecture (MTA).
This path expands the usual microarchitecture of a processor to allow it to hold more than one
context, or thread, at one time. Several different approaches to how these threads act within the
MTA have been explored, including fine grain MTAs [2, 3, 20], block MTAs [23, 30] and
simultaneous multithreading (SMT) designs [7, 16, 9, 33, 11, 32]. Of all these approaches, SMT has
been shown not only to provide the best performance under a wide variety of loads, but also to be
the most efficient, both in terms of processor resource utilization [33, 19, 34, 14] and in
performance increase per unit increase in die area used [5].
Even with these advantages, there are still areas in the design of an SMT processor that require
special attention. Of these, the memory hierarchy, register file and instruction fetching mechanism
are perhaps the most important [33, 19, 35, 10]. Research has shown that the impact of different
register files and memory hierarchies is effectively hidden by the SMT process. This places a focus
on how the processor decides which thread gets control of the fetch unit each cycle. Until now, the
research has focused on single level heuristics to make this decision.
The goal of this project was to design, simulate, and analyze several thread scheduling
mechanisms for a simultaneous multithreaded architecture. Chapter 2 presents background information
on computer architecture and proposed and implemented solutions to the current problems with
the disparity between processor and memory speeds. Chapter 3 describes how the scheduling for
control of the fetch unit is handled in several proposed Multithreaded Architectures, as well as the
Hierarchical Thread Scheduling mechanisms explored in this study. Chapter 4 gives an overview of
the simulator that was used for this project and describes the modifications made to this simulator.
The results that were extracted from the simulations are illustrated and explained in Chapter 5.
Finally, Chapter 6 summarizes the project's findings, and suggests possible future work that could
be performed.
Chapter 2
Computer Architecture Overview
Computer architecture is defined as "the structure of a computer that a machine language
programmer must understand to write a correct (timing independent) program for that machine" [15].
The concept was first developed in the 1960s to avoid having to rewrite every program for each
new computer that came out. This concept was pushed by IBM as they tried to design a family of
computers that would run the same software.
This chapter starts by giving a quick overview of the architectural components of a modern RISC
processor. This is necessary for an understanding of the effects of making a modern processor SMT
capable. The second part of this chapter is an overview of several techniques suggested by research
to overcome the widening gap between the speed of memory and the speed of the processor.
2.1 Modern Architectural Components
The microprocessor of today is a complex collection of various techniques and mechanisms. These
different components vary in complexity and usefulness and often affect each other. For instance, a
complex branch prediction scheme is unlikely to provide a significant improvement in performance
for processors that have a short pipeline since the penalty for a mispredicted branch is small. For
processors with deep pipelines, however, a complex branch predictor can greatly increase
performance. The following sections will give an overview of the major properties of the modern,
high-performance processor.
2.1.1 Reduced Instruction Set Computing
The design technique that makes many of the mechanisms used in modern processors possible,
or at least much less complex, is the idea of Reduced Instruction Set Computing (RISC). Many
early computers used Instruction Set Architectures (ISAs) that were complex. Instructions could
be different lengths, it was possible for many different types of instructions to access memory and
there were many different addressing modes available. Many processors also supported instructions
to help with operations such as string processing.
When actual code was studied, however, it was found that programmers rarely used the more
complex instructions or addressing modes. This information was used to design an ISA that was
smaller and simpler than those in current use. This new ISA was called a reduced instruction set
computer and it had the following properties:
Small Overall Size There were fewer instructions in the RISC ISA, allowing each of the
instructions to be optimized for speed;
Constant Instruction Length All the instructions in the new ISA were the same length. This
simplified the fetching and decoding of instructions;
Load/Store Architecture The new ISA was based on a load/store architecture where only load
and store instructions can access memory. This made the execution of all instructions much
more predictable;
Few Addressing Modes The number of available addressing modes was reduced to simplify
memory accesses even more;
Simple Instructions The complex instructions of older ISAs were done away with. With a pure
RISC processor, each instruction does one thing and one thing only. The more complex
instructions of older ISAs, such as the decrement, test and branch conditionally family of
instructions, were accomplished by using multiple instructions from the new ISA.
There are disadvantages to a RISC design, however. Since only load and store instructions can
access memory and because there is a smaller number of total instructions that are simpler,
programs written for a RISC processor will be larger. This makes a RISC processor less desirable
in applications where there is a limited amount of memory. However, in most applications, the
performance gained from the optimizations possible because of using RISC outweighs the cost of
the larger program size.
Figure 2.1: The standard 5-stage pipeline (IF, ID, EX, MEM, WB)
An example of a RISC processor is the DLX designed by Hennessy and Patterson [15]. There
are also numerous commercial processors that are designed using the RISC principle. These include
Compaq's Alpha, Hewlett Packard's (HP) PA-RISC and the MIPS core.
2.1.2 Pipelining
The process of executing an instruction has several steps. First, the instruction is fetched from
memory. Then, it is decoded. Next, it is executed and finally the results are written. In a
non-pipelined processor, only one instruction is worked on at a time, with each instruction possibly
taking multiple clock cycles to complete. Since an instruction needs each step (fetch, decode, and
so on) only once, a large portion of the processor is idle most of the time. This situation allows
for the possibility of working on more than one instruction at a time, thus leading to pipelining.
Pipelining breaks the total execution process down into a number of independent steps.
Registers are used between the stages of the pipeline to store the information necessary to continue
execution of the instruction. As an instruction works its way through the pipeline, new instructions
are brought in behind it to keep the earlier stages busy. A typical pipeline is shown in Figure 2.1.
The stages of this pipeline are as follows:
IF Instruction Fetch: the next instruction is fetched from memory based on the current value of
the program counter (PC), and the PC is incremented;
ID Instruction Decode: the instruction is decoded to determine what it is supposed to do. This is
also where data is read from the registers;
EX Execution: this is where the instruction is actually executed;
MEM Memory Access/Branch Completion: access memory if needed, replace the PC if a branch;
WB Write-back: write the results of the instruction to the register [15].
In general, each pipeline stage takes one clock cycle to complete. (The exception to this is the EX
stage, which may take multiple cycles to finish, depending on what instruction is being executed.)
Therefore, for a fixed amount of work per instruction, the clock cycle time is roughly inversely
proportional to the number of pipeline stages, with more pipeline stages leading to a shorter clock
cycle. A processor with many pipeline stages is said to be deep pipelined.
Once the pipeline is full, it is theoretically possible to finish executing one instruction each
cycle. In practice, however, this rate cannot be sustained because of pipeline hazards.
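The ideal, hazard-free timing described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, every stage (including EX) is assumed to take exactly one cycle, and no stalls are modeled.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return {cycle: {stage: instruction index}} for an ideal pipeline.

    Assumes one cycle per stage and no hazards, so instruction i
    enters the first stage in cycle i + 1.
    """
    diagram = {}
    for instr in range(num_instructions):
        for depth, stage in enumerate(stages):
            cycle = instr + depth + 1
            diagram.setdefault(cycle, {})[stage] = instr
    return diagram


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed to finish n instructions in the ideal case.

    The first instruction needs num_stages cycles to fill the pipeline;
    after that, one instruction completes every cycle.
    """
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1
```

Under these assumptions, ten instructions finish in 14 cycles on the 5-stage pipeline, compared with 50 cycles if each instruction had to pass through all five stages before the next one started.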
Pipeline hazards come in two main varieties. The first are data hazards. These occur when
pipelining changes the order that instructions access operands from what would occur in a non-
pipelined processor [15]. Following is a list of possible data hazards:
RAW read after write: an instruction tries to read data before an instruction that entered the
pipeline before it has had a chance to write it;
WAW write after write: an instruction tries to write an operand before it is written by an
instruction that entered the pipeline before it. This hazard only occurs in pipelines that write
to registers in more than one stage;
WAR write after read: an instruction tries to write an operand before it is read by an instruction
that entered the pipeline before it. This occurs when some instructions write early in the
pipeline and others read late [15].
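The three hazard classes above come down to comparing the registers that two instructions read and write. The following sketch makes that comparison explicit. The `Instr` record and the register names are hypothetical, and the function reports only potential hazards: whether one actually occurs depends on the pipeline, as noted in the list above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, any number of sources."""
    dest: Optional[str]      # register written, or None (for example, a store)
    srcs: Tuple[str, ...]    # registers read


def data_hazards(earlier, later):
    """Return the set of potential data hazards of `later` with respect to `earlier`."""
    hazards = set()
    if earlier.dest is not None and earlier.dest in later.srcs:
        hazards.add("RAW")   # later reads what earlier writes
    if earlier.dest is not None and earlier.dest == later.dest:
        hazards.add("WAW")   # both write the same register
    if later.dest is not None and later.dest in earlier.srcs:
        hazards.add("WAR")   # later writes what earlier reads
    return hazards
```

For example, an instruction that writes `r1` followed by one that reads `r1` yields `{"RAW"}`, which is the case that forwarding or a stall must resolve.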
There are several techniques used to eliminate these hazards. The simplest to implement is to write
to the registers on the first half of the clock cycle and to read from them on the second half. A
second technique is forwarding, or bypassing. This allows data produced from an instruction to be
available as soon as it is ready instead of waiting until the data is written to a register. Some data
hazards, however, cannot be avoided. In this case, the pipeline is stalled until the data is ready.
When the pipeline is stalled, the instructions in the pipeline stages before the data producing
instruction do not progress through the pipeline.
The second class of pipeline hazards are control hazards. These hazards occur when a branch
is encountered and can cause more performance to be lost than data hazards [15]. The branch may
or may not affect the PC, but the processor does not know for sure until the end of the EX stage
of the pipeline. This leads to the possibility that some instructions will be fetched from the wrong
section of code.
The simplest way to handle this type of hazard is to stall the pipeline as soon as a branch is
detected. This is, however, not the best way of handling branches. Some alternatives are having
the compiler place instructions that do not depend on whether the branch is taken immediately
after it, or using a branch prediction strategy to make an educated guess as to which direction the
branch will go. Branch prediction will be explored in more depth in Section 2.1.4.
2.1.3 Out-of-Order Execution
As the processor has been described so far, there are still times when the pipeline must be stalled
because of a dependency. This causes all instructions that follow the stalled instruction to stall as
well, leading to the possibility of functional units being idle. This is known as in-order execution.
It is possible, however, to maintain in-order instruction issue but to execute the instructions as
soon as the data they need are available. This gives out-of-order execution and completion, which
by itself makes it impossible to maintain precise exceptions. Since precise exceptions are
desirable, they can be preserved by forcing the instructions to complete in-order. This is
accomplished through the use of a reorder buffer (ROB), an instruction queue (IQ) or another similar
device that keeps track of the original order of the instructions and forces them to complete
in-order [15].
To accommodate out-of-order execution, the standard pipeline (see Figure 2.1) must be modified
by replacing the ID stage with the two following stages:
IS Issue: decode the instruction and check for structural hazards;
RO Read Operands: wait until there are no data hazards, then read the operands.
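To illustrate how a reorder buffer lets instructions finish executing out of order while still completing in program order, here is a minimal sketch. The class and method names are hypothetical, the buffer is unbounded for simplicity, and exceptions and misspeculation recovery are not modeled.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: entries are kept in program (issue) order."""

    def __init__(self):
        self._entries = deque()  # oldest instruction at the left

    def issue(self, tag):
        """Allocate an entry at the tail when an instruction is issued."""
        entry = {"tag": tag, "done": False, "result": None}
        self._entries.append(entry)
        return entry

    def complete(self, entry, result):
        """Record a result; this may happen in any order."""
        entry["done"] = True
        entry["result"] = result

    def commit(self):
        """Retire finished instructions from the head only, preserving program order."""
        committed = []
        while self._entries and self._entries[0]["done"]:
            entry = self._entries.popleft()
            committed.append((entry["tag"], entry["result"]))
        return committed
```

In this sketch a younger instruction that finishes early simply waits in the buffer: `commit` returns nothing until the oldest entry is done, and then releases every consecutive finished entry behind it.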
2.1.4 Branch Prediction
From Section 2.1.2 we know that branches can cause problems for a pipelined processor. There
are several ways to reduce the penalty incurred from branches. Some of the static techniques have
been discussed already: scheduling of instructions by the compiler and the extreme, though simple
to implement, response of stalling the pipeline on any branch. Another way to deal with these
hazards is to try to predict where the branch is going to go.
It is possible to generate this prediction using both static and dynamic techniques. One of
the static techniques was already mentioned: placing instructions that do not depend on where the
branch goes directly after the branch to avoid having to stall the pipeline. A second static technique
is based on the 85/60 Branch-Taken Rule. This rule is based on the observation that around 85%
of backward-moving branches and about 60% of forward-moving branches are taken [15]. Based on
this rule, a processor would assume that any branch is taken. Alternatively, a processor may
assume that only backward-moving branches are taken and that forward-moving branches are not
taken. It is also possible for the compiler to give the processor clues about whether the branch
will be taken or not, possibly based on a profiling run of the program being compiled.
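The two static policies just described can be written as one-line predictors, and the 85/60 figures give a rough estimate of how often each would be right. This is only a sketch: the function names are hypothetical, and `backward_fraction` (the share of branches that jump backward) is an assumed input that the rule itself does not supply.

```python
def predict_always_taken(branch_pc, target_pc):
    """Static policy 1: assume every branch is taken."""
    return True


def predict_backward_taken(branch_pc, target_pc):
    """Static policy 2: taken if the target lies before the branch, else not taken."""
    return target_pc < branch_pc


def expected_accuracy(policy, backward_fraction,
                      backward_taken=0.85, forward_taken=0.60):
    """Estimated fraction of correct predictions under the 85/60 rule.

    policy is "always_taken" or "backward_taken".
    """
    forward_fraction = 1.0 - backward_fraction
    if policy == "always_taken":
        forward_correct = forward_taken          # right whenever the branch is taken
    elif policy == "backward_taken":
        forward_correct = 1.0 - forward_taken    # right whenever it is not taken
    else:
        raise ValueError("unknown policy: " + policy)
    return backward_fraction * backward_taken + forward_fraction * forward_correct
```

Both policies treat backward branches identically, so they differ only on forward branches. With the quoted rates, assuming a forward branch is taken is right about 60% of the time and assuming it is not taken about 40% of the time; measured programs may behave differently.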
While the static methods are better than not using branch prediction at all, it is possible to
gain even more performance by predicting dynamically whether the branch will be taken.
The simplest form of dynamic predictor keeps a single-bit history of whether a branch was taken.
The history is stored under a tag made by taking a fixed number of lower-order bits from the
branch's address. An effective, and relatively inexpensive, way to improve the performance of this
mechanism is to increase the history to two bits. While it is possible to continue adding bits to the
history, the mechanism does not then perform significantly better than the two-bit schemes, so
schemes with more than two bits are rarely used [15].
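A two-bit scheme is commonly built from saturating counters, and the sketch below takes that form. The class name, the table size, the initial counter value, and the choice to drop the two low address bits (assuming 4-byte instructions) are all illustrative assumptions rather than details from this thesis.

```python
class TwoBitPredictor:
    """Table of two-bit saturating counters indexed by low-order branch address bits."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        # Starting at 1 (weakly not taken) is an arbitrary choice.
        self.counters = [1] * (1 << index_bits)

    def _index(self, pc):
        # Drop the byte offset of a 4-byte instruction (assumption), keep the low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong guesses."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong
```

For a loop branch that is taken nine times and then falls through, the second bit keeps the single not-taken outcome from flipping the prediction, so the next run of the loop starts with a correct guess.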
Figure 2.2: Branch misprediction penalty for a basic pipeline
While a dynamic predictor can take advantage of program behavior in ways that static prediction
mechanisms cannot, there is still behavior that it cannot detect. One area where this is true is the
behavior of other branches in the program. A predictor that takes this behavior into account is
called a correlating predictor. It uses n global history bits to choose between 2^n two-bit predictors
for a particular branch. It should be noted that both the single level predictor and the correlating
predictors could be basing their predictions for any given branch on the behavior of different
branches that map to the same history in the predictor, depending on what the program execution
has been up to this point.
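The selection step of a correlating predictor can be sketched as follows: the last n branch outcomes form an index that picks one of 2^n tables of two-bit counters. The class name, table sizes, and initial values are illustrative assumptions, not details taken from [15] or from the simulator used later in this thesis.

```python
class CorrelatingPredictor:
    """n global history bits select one of 2**n tables of two-bit counters."""

    def __init__(self, history_bits=2, index_bits=10):
        self.history_bits = history_bits
        self.history = 0  # outcomes of the last n branches, newest in bit 0
        self.index_mask = (1 << index_bits) - 1
        self.tables = [[1] * (1 << index_bits)
                       for _ in range(1 << history_bits)]

    def _slot(self, pc):
        # Different branches can map to the same slot (aliasing).
        return self.tables[self.history], (pc >> 2) & self.index_mask

    def predict(self, pc):
        table, i = self._slot(pc)
        return table[i] >= 2

    def update(self, pc, taken):
        table, i = self._slot(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        # Shift the newest outcome into the global history register.
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

On a branch that strictly alternates between taken and not taken, this sketch settles after two wrong guesses because each history pattern gets its own counter. With `history_bits=0` it degenerates to a single two-bit counter table, which, started in the same state, keeps guessing wrong on that pattern.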
While it is helpful to know if a branch is taken or not, it would also be helpful to know what
address the branch is likely to point to. For this a branch target buffer (BTB) is used. This
mechanism determines if an instruction is a branch and, if it is, it sends the target address that was
last produced by the branch at this position in memory to the fetch unit. This allows the processor
to begin fetching from the address before that address has been calculated.
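A branch target buffer behaves like a small tagged cache keyed by the branch's address. The sketch below assumes a direct-mapped organization, 4-byte instructions, and hypothetical method names; real designs vary in associativity and in what else they store per entry.

```python
class BranchTargetBuffer:
    """Direct-mapped table mapping a branch address to its last target."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [None] * entries  # each slot holds (tag, target) or None

    def _slot(self, pc):
        word = pc >> 2  # assumes 4-byte instructions
        return word % self.entries, word // self.entries

    def lookup(self, pc):
        """Return the stored target if pc is a known branch, else None."""
        index, tag = self._slot(pc)
        entry = self.table[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def record(self, pc, target):
        """Store the target most recently produced by the branch at pc."""
        index, tag = self._slot(pc)
        self.table[index] = (tag, target)
```

A hit in `lookup` both identifies the instruction as a branch and supplies an address to fetch from; a miss means the fetch unit simply continues with the next sequential address.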
No branch prediction, static or dynamic, is perfect. Occasionally, the different predictions will
be wrong. How often this occurs depends on the size of the predictor, the specific mechanism
being used and the behavior of the program currently being executed. When these mechanisms
are wrong, a branch misprediction is said to have occurred. This misprediction incurs a penalty
based on where in the pipeline the branch's behavior and target are determined. Figure 2.2 shows
the penalty for a branch misprediction with the basic pipeline from [15]. It should be obvious that
the length of the pipeline affects this penalty and that deeper pipelines would benefit from more
accurate branch prediction mechanisms.
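The effect of pipeline depth can be estimated with a common first-order model in which each mispredicted branch adds a fixed number of stall cycles. The branch fraction, misprediction rate and penalties below are assumed values chosen only for illustration.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Base CPI plus the average stall cycles per instruction from mispredictions."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% branches, 10% of them mispredicted.
shallow = effective_cpi(1.0, 0.2, 0.1, 3)    # short pipeline, 3-cycle penalty
deep = effective_cpi(1.0, 0.2, 0.1, 15)      # deep pipeline, 15-cycle penalty
```

With the same predictor accuracy, the deeper pipeline loses far more cycles per instruction, which is why it benefits more from accurate prediction.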
2.1.5 Multiple Issue Techniques
Until this point, the mechanisms looked at have been trying to achieve the theoretical ideal of one
instruction per cycle (IPC). In order to improve performance beyond this point, the processor must
be able to take advantage of the instruction level parallelism (ILP) in programs by issuing more
than one instruction every cycle. There are two techniques that allow this: superscalar processors
and very long instruction word (VLIW) processors [15].
Superscalar processors are designed to allow a variable number of instructions to be issued each
clock cycle. These instructions can be chosen either statically by the compiler or dynamically using
techniques similar to those used for out-of-order execution (see Section 2.1.3). Modern superscalar
processors can usually issue a maximum of three to four instructions each cycle from the execution
window, or all instructions that are ready to be issued and are waiting for execution resources. The
number of instructions in the execution window each cycle varies because of dependencies, and the
number actually issued varies based on the availability of the processor's execution resources.
A superscalar processor is usually implemented by allowing certain classes of instructions to be
issued in the same clock cycle, assuming that there are no dependencies between them. An example
of this would be allowing an integer arithmetic instruction and a floating-point instruction to be
issued in the same cycle. It is also common to allow a memory instruction, a load or a store, to be
issued each cycle, in addition to the other instructions that are being issued. Other practices are
to have more than one functional unit of a given type to allow more instructions to be issued in a
single clock cycle and to pipeline the functional units where possible to allow the issuing of long
latency instructions each clock cycle.
The complexity added to a design in making it a dynamically issuing superscalar processor can
be substantial. VLIW processors try to offset this hardware complexity by having the compiler
statically group instructions that can execute at the same time. A typical instruction for this type
of processor might include two integer, a floating-point, a memory reference and a branch operation.
If the compiler cannot find an operation to fill one of these slots in the instruction, a NOP is used.
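The slot filling described above can be sketched as a greedy packer. It assumes the operations handed to it have already been found independent by the compiler; the slot layout follows the typical instruction mentioned in the text, and the function name is hypothetical.

```python
# Slot layout of one long instruction word: two integer, one FP, one memory, one branch.
SLOTS = ("int", "int", "fp", "mem", "branch")


def pack_bundle(ops):
    """Fill one instruction word from (kind, name) operations assumed independent.

    Returns the bundle and the operations left over for later words.
    """
    remaining = list(ops)
    bundle = []
    for slot in SLOTS:
        match = next((op for op in remaining if op[0] == slot), None)
        if match is None:
            bundle.append("NOP")   # no operation available for this slot
        else:
            bundle.append(match[1])
            remaining.remove(match)
    return bundle, remaining


bundle, leftover = pack_bundle(
    [("int", "add"), ("mem", "load"), ("int", "sub"), ("int", "mul")]
)
```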
Both techniques allowing multiple issue of instructions are in use in commercial processors.
The majority of the currently available general purpose processors are superscalar designs. These
include Compaq's Alpha, HP's PA-RISC and the Motorola/IBM PowerPC. VLIW has found a home
in special purpose processors. One example of this is the MAP1000 digital signal processor (DSP)
from Equator. In the general purpose market, Intel and HP are currently developing processors
based on their explicitly parallel instruction computing (EPIC) technology, which is basically a
VLIW approach.
2.1.6 The Memory Hierarchy
It was recognized early on that, if given the choice, programmers would want unlimited amounts
of fast memory. This is impractical for most applications for many reasons, not the least of which
is the prohibitive cost this would incur. Fortunately, the principles of spatial and temporal locality
were discovered. Temporal locality states that programs tend to reuse data and instructions that
have been recently used. Spatial locality states that items located at addresses that are physically
close to one another are usually referenced close together in time [15]. The locality principles were
used to develop the idea of a memory hierarchy.
The usefulness of the memory hierarchy is that, if a small amount of fast memory is placed
"close" to the processor and a larger amount of slower memory is placed "farther" from the processor,
the total memory system of the computer will have performance close to that of the fast memory
and a cost close to that of the slow memory. The faster of these memories are called caches.
The smallest, and "closest" to the processor, is known as the first level, or L1, cache. Additional
levels of cache between the processor and the main memory are numbered successively, with the
second level, or L2, cache being larger than the L1 cache and "farther" from the processor. This
pattern can continue indefinitely, though few processors use more than three levels. The hierarchy
is accessed in such a way that the contents of each level of the memory hierarchy are a superset of the
next level up the hierarchy. (Using this convention, the register file of the processor is level 0 and
is considered at the "top" of the hierarchy and the main memory is considered at the "bottom".
See Figure 2.3 for a common representation of a memory hierarchy.)
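The claim that the hierarchy performs close to its fastest level can be checked with the usual average access time calculation, in which each level contributes its hit time plus its miss rate times the cost of going one level down. The latencies and miss rates below are assumed values, not measurements.

```python
def average_access_time(levels, memory_latency):
    """levels: list of (hit_time, miss_rate) from L1 outward, times in cycles."""
    time_below = memory_latency
    # Work upward from main memory: each level adds its hit time and pays
    # the cost of the level below only on a miss.
    for hit_time, miss_rate in reversed(levels):
        time_below = hit_time + miss_rate * time_below
    return time_below


# Assumed hierarchy: 1-cycle L1 (5% misses), 10-cycle L2 (20% misses), 100-cycle memory.
with_caches = average_access_time([(1, 0.05), (10, 0.2)], 100)
without_caches = average_access_time([], 100)
```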
Figure 2.3: Abstraction of a memory hierarchy (register file at the top)
A cache is addressed by blocks. A block is several words long and is the unit of memory that
the cache operates with when accessing the next level of the memory hierarchy. There are three
ways that blocks can be placed in a cache. The easiest to implement is direct mapped. In this
type of cache each address can only map to one block. The second type of placement scheme is fully
associative. This allows an address to be placed in any block in the cache. The third placement
scheme, set associative, is the general form of these two specific methods. With this placement
algorithm, a memory address maps to a set and can be placed in any block in that set. When a
set contains n blocks, the cache is said to be n-way set associative. The vast majority of caches
currently in use are either direct mapped, 2-way set associative or 4-way set associative; the other
alternatives do not produce a significant performance increase [15].
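The three placement schemes differ only in the number of blocks per set, which the following sketch makes explicit. The function name and the cache dimensions are illustrative assumptions.

```python
def candidate_frames(address, block_size, num_blocks, ways):
    """Return the cache block frames in which the given address may be placed."""
    num_sets = num_blocks // ways
    set_index = (address // block_size) % num_sets
    first = set_index * ways
    return list(range(first, first + ways))


# Assumed cache: 8 blocks of 16 bytes; address 0x130 lies in memory block 19.
direct_mapped = candidate_frames(0x130, 16, 8, ways=1)   # one possible frame
two_way = candidate_frames(0x130, 16, 8, ways=2)         # either frame of one set
fully_assoc = candidate_frames(0x130, 16, 8, ways=8)     # any frame
```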
Since the locality principles only describe tendencies of programs, the information that is needed
is not always available in a specific level of cache. When the information requested is not present
in the current level of cache, a cache miss has occurred. The following list describes the three types
of cache misses [15]:
compulsory miss These misses occur on the very first access to a cache block since the block
cannot be in the cache yet.
capacity miss These misses occur when the cache is not large enough to hold all the blocks needed
during execution of a program.
conflict miss These misses occur for direct mapped and set associative caches when too many
blocks map to the same set.
There is no way to remove the compulsory misses. Capacity and conflict misses can be reduced
by increasing the size of the cache. Conflict misses can also be reduced by increasing the associativity
of the cache.
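The effect of associativity on conflict misses can be seen in a small simulation. The sketch below assumes least recently used replacement, which the text does not specify, and a trace of block numbers chosen so that two blocks collide in a direct-mapped cache.

```python
from collections import OrderedDict


def count_misses(block_numbers, num_blocks, ways):
    """Count misses in a set-associative cache with LRU replacement (assumed)."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_numbers:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)          # mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)     # evict the least recently used block
            lines[block] = True
    return misses


# Blocks 0 and 4 map to the same frame of a 4-block direct-mapped cache.
trace = [0, 4] * 4
direct_mapped_misses = count_misses(trace, num_blocks=4, ways=1)
two_way_misses = count_misses(trace, num_blocks=4, ways=2)
```

In the direct-mapped case every access evicts the block needed next; with two ways only the two compulsory misses remain.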
2.2 Overcoming the Processor/Memory Speed Gap
Advances in both the design and the manufacture of microprocessors have created an ever-growing
gap between the performance of the microprocessor and the memory that the processor needs.
While cache can help to economically hide some of this gap (see Section 2.1.6), the differing rates
at which processors and memory are advancing are causing this gap to widen at an increasing rate.
This trend has caused research into various techniques of hiding that gap. In the following sections,
I will give a brief overview of some of these techniques.
2.2.1 Wide Superscalar and Superspeculative Processors
Researchers at both the University of Michigan (UM) and Carnegie Mellon University (CMU) are
attempting to get more performance out of the basic design of the current modern RISC processor
by using techniques to expose more ILP in programs [26, 18]. The team at UM has argued that
the highest level of performance will be attained by using very high performance processors in a
multiprocessor configuration and is designing a superscalar processor capable of issuing a maximum
of between 16 and 32 instructions each cycle from a pool of close to 2,000 instructions in the
execution window. Some of the mechanisms that they propose using to effectively use a processor
of this size are:
a trace cache, which is a cache to store up to an issue-width's worth of instructions across
multiple branches;
a Multi-Hybrid branch predictor that uses several small branch predictors that perform well
on specific types of branches, one of which is a buffer for the target addresses of indirect
jumps;
the ability to fetch control-independent instructions that come after a branch and to delay
the fetching of instructions that follow a hard-to-predict branch;
the use of data and instruction prefetching, prediction of data values and dependencies
between loads and stores;
24 to 48 FUs clustered into groups of three to five units to reduce delays on the common data
bus (CDB).
Simulations of this architecture have achieved an IPC of close to 13 [26].
The CMU project, called Superflow, is expanding the capabilities of current commercial
processors while adding aggressive predictive capabilities. Some of the characteristics of this architecture
include:
an issue width of 32;
a 128-entry ROB;
64 KB data and instruction caches with a 10 cycle miss delay to the L2 cache;
a 128-entry, fully associative store queue.
Value and dependence prediction mechanisms are added to this relatively conventional superscalar
architecture. Alias prediction, a mechanism to predict a load address, is also used.
A trace cache is provided in addition to the standard instruction, data and second level caches as
well as a pattern history table (PHT). Simulations of the Superflow processor have shown significant
improvement in sustained IPC.
2.2.2 Simultaneous Subordinate Microthreading
Like the wide superscalar and superspeculative efforts, Simultaneous Subordinate Microthreading
(SSMT) focuses on improving the performance of a single thread [6]. It attempts to do this
by executing microcode threads to improve the branch prediction, cache behavior and prefetch
effectiveness of the processor. Since the performance of modern processors falls short of the ideal
made possible by assuming perfect operation of these mechanisms, Chappell et al. propose that
executing microcode that improves the effectiveness of these mechanisms will improve the overall
performance of the processor.
The code for these microthreads is stored in dedicated, on-chip random access memory (RAM)
called microRAM. A microthread can be activated through either a special instruction added to the
Instruction Set Architecture (ISA) or through pre-defined events. These pre-defined events spawn
a microthread in much the same way an interrupt spawns a handler. Some of the advantages to
using microthreads are:
complex algorithms can be used without increasing hardware complexity, since the microthreads
use the existing datapath;
the optimizations are very flexible, being tuned to specific applications and processor
implementations, or disabled when not needed;
use of microRAM means that the microthread does not compete for fetch bandwidth or
I-cache blocks;
the microthreads are written in an internal instruction set, allowing the ISA to remain
untouched, with the exception of a SPAWN instruction to launch a microthread.
This framework allows the microthreading mechanism to be very flexible. The use of microthreads
requires support from the compiler to load the microRAM and to spawn the microthreads. It is
assumed that some form of profiling will be used as part of the compilation process. Some support
from the operating system (OS) is also needed to deal with the addressing of the microRAM. The
only modifications to the processor that are needed are the addition of the microRAM, the ability
to issue instructions from that microRAM and support for the context of at least one microthread.
In [6], Chappell et al. demonstrate the use of SSMT by creating a software PAg branch
predictor using a branch history table (BHT) and a PHT. In order to keep the microthread from
interfering with the main thread too much, this predictor was only used on branches that the
hardware branch prediction mechanism does not handle well. Simulations in which the SPECint95
benchmarks were run showed that the microthread could significantly improve the performance of
some of the programs. However, on others it either produced no improvement or resulted in a slight
decrease in performance.
2.2.3 Chip Multiprocessors
The techniques that have been examined so far are all focused on improving the performance of a
single thread. The wide superscalar/superspeculative approach tries to do this through mechanisms
that either find or create more ILP, even if there is not much more to find, or by using aggressive
predictive techniques. The SSMT technique hopes to improve performance by running microthreads
aimed at improving the performance of some of the hardware already present in the processor in
parallel with the primary thread. The hope here is that the additional load on the processor
resources will be compensated for by the gain in performance.
On the other hand, chip multiprocessors (CMPs) attempt to improve the performance of the
processor by exploiting thread level parallelism (TLP) [14, 25]. The idea is to place multiple copies
of a simple RISC core on the same die. This allows the processors to share a high speed connection
to the same L2 cache, which is also placed on the die. In most cases, better performance is gained
by each core having its own L1 cache instead of all the cores sharing an L1 cache [24]. There is
little to no degradation of performance for a single thread because the RISC cores, even though
they are less aggressive, can be run at a higher frequency. Another advantage of CMPs is that they
are inherently divided into relatively small, mostly independent components. This characteristic
leads to the elimination of many of the long-latency interconnects that can be found on modern
processors, thus eliminating many of the timing problems present in these designs.
There are drawbacks to this architecture, however. The most obvious is that execution resources,
i.e. functional units, are not shared between the cores. This leads to large portions of the transistors
on the die not being used when there is only a single thread available. Another result of this inability
to share FUs is that a thread could be stalled on one core because of a lack of FUs, even though
another core has a usable FU free. There is one processor resource that the cores share: memory
bandwidth. Whether this will cause a problem depends on the memory access pattern of the threads
running on the CMP and the memory bandwidth made available to the chip.
2.2.4 Multithreading Architectures
Designed around the same basic set of ideals that CMPs are, multithreaded architectures (MTAs)
create a processor that is capable of executing and keeping track of several threads or programs
on a single die, thus taking advantage of the TLP available. There is one fundamental difference,
however: MTAs share processor resources among the threads instead of having several processor
cores. There are three groups of MTAs: fine-grained, coarse-grained and simultaneous. Each is
described below.
Fine-Grained Multithreading
In a fine-grained multithreaded architecture, the processor is capable of holding the contexts for
several threads or programs at a time. Each cycle, a new thread gets control of the processor [20, 2].
There are usually restrictions on which threads are eligible to take control. For instance, in the
Tera computer system, each thread can only have one instruction in the pipeline at a time
and an outstanding cache miss makes a thread ineligible to gain control [2]. The Tera system is a
commercial multiprocessor supercomputer meant for use in a multi-user scientific environment. It
is capable of running existing scientific code, in many cases only needing a simple recompile to do
so. It has been shown to decrease the execution time for several standard scientific workloads and
to be much easier to use as a platform to solve problems with irregularly shaped data sets [3].
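The per-cycle thread selection can be sketched as follows. The eligibility rules mirror the two Tera restrictions mentioned above; the round-robin search order and the field names are assumptions made for illustration.

```python
def next_thread(threads, last):
    """Pick the next eligible thread after `last`, searching in round-robin order.

    A thread is ineligible while it has an instruction in the pipeline or an
    outstanding cache miss. Returns None if no thread may take control.
    """
    n = len(threads)
    for step in range(1, n + 1):
        tid = (last + step) % n
        state = threads[tid]
        if not state["in_pipeline"] and not state["miss_outstanding"]:
            return tid
    return None


threads = [
    {"in_pipeline": False, "miss_outstanding": False},
    {"in_pipeline": True, "miss_outstanding": False},
    {"in_pipeline": False, "miss_outstanding": True},
    {"in_pipeline": False, "miss_outstanding": False},
]
after_thread_0 = next_thread(threads, last=0)
after_thread_3 = next_thread(threads, last=3)
```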
Coarse-Grained Multithreading
Coarse-grained multithreaded architectures, also called block multithreaded architectures, are
similar to fine-grained ones. The major difference is that control of the processor does not change
every clock cycle. In block multithreaded architectures, an event of some kind triggers the control
switch [30, 23]. Some examples of this event are a cache miss or the execution of a long latency
instruction. Generally, the trigger is some event that would usually cause the thread that has
control to stall.
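The event-triggered switch can be sketched with a simplified model in which each thread is a list of instruction events and a cache miss hands control to the next thread. The model ignores miss latency and switch overhead, and the names are hypothetical.

```python
def block_multithread(traces):
    """traces: one list of events per thread, each event 'op' or 'miss'.

    Returns the thread that held control for each executed instruction.
    """
    position = [0] * len(traces)
    current = 0
    order = []
    while any(p < len(t) for p, t in zip(position, traces)):
        if position[current] >= len(traces[current]):
            current = (current + 1) % len(traces)   # thread finished, move on
            continue
        event = traces[current][position[current]]
        position[current] += 1
        order.append(current)
        if event == "miss":
            current = (current + 1) % len(traces)   # stalling event triggers a switch
    return order


order = block_multithread([["op", "miss", "op"], ["op", "op", "miss"]])
```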
Simultaneous Multithreading
The most flexible of all the MTAs are the simultaneous multithreading (SMT) architectures. A
SMT can hold the contexts of several threads at once, like the other MTAs. Unlike the other MTAs,
however, a SMT can issue instructions from more than one thread in the same cycle. This gives
SMTs the ability to dynamically adjust to varying levels of ILP and TLP [19]. Because of this,
SMTs have the potential to make the most efficient use of processor resources and have been shown
to outperform CMPs when the issue bandwidths are the same [14, 34]. In spite of the ability to
issue instructions from any thread, SMTs still have to determine which thread has control of the
instruction fetch resources for each clock cycle. Several different algorithms have been proposed
for making this determination, from simple round-robin schemes to adaptive heuristics [11, 33, 27].
These mechanisms will be covered in greater detail in Chapter 3.
Chapter 3
Thread Scheduling Methods
By their very nature, simultaneous multithreading (SMT) processors do not have to worry about
which thread is able to issue instructions in a clock cycle (see Section 2.2.4). One decision that
does have to be made, however, is how to partition the use of the processor's fetch unit. There
are a wide variety of ways that this can be done and it has been shown that the use of a relatively
simple heuristic can produce a significant increase in performance [33]. In the context of this work,
thread scheduling refers to the process of choosing which thread or threads will have control of the
processor's fetch unit during a clock cycle. This chapter describes some of the mechanisms that
have been explored.
3.1 Basic Scheduling Methods
The projects described here all use a simple, single level mechanism to schedule the fetch unit to
different threads. Only one metric is used to make decisions and there is no attempt to monitor
the performance of this mechanism. Figure 3.1 illustrates the general structure of this mechanism.
3.1.1 Multithreaded Superscalar Digital Signal Processor (MSDSP)
This processor was proposed by Gulati in a master's thesis [11]. It is based on the Superscalar
Digital Signal Processor (SDSP) project at the University of California, Irvine [36]. This is a 32-
bit pipelined RISC processor that can fetch four instructions per cycle and issue up to eight. It
has been designed for integer processing with full considerations for a VLSI implementation. The
MSDSP adds floating point units, as well as the ability to have more than one thread running in
the processor at a time. A simple round-robin mechanism is used to give control of the fetch unit
to a single, eligible thread each cycle [11].
Figure 3.1: Block diagram of a basic scheduling mechanism (blocks: input metrics, PCs from n threads, basic scheduling mechanism, fetch unit)
3.1.2 Simultaneous Multithreading Project
The Simultaneous Multithreading Project at the University of Washington is where the idea of
simultaneous multithreading got its start and where much of the research on this architecture
has been carried out [31, 9, 34, 33, 19]. Tullsen et al. have explored several different ways of
partitioning and scheduling the fetch resources of a SMT processor [33]. The architecture that was
developed uses an ISA based on the Alpha 21164 and attempts to anticipate what structures will
be feasible in 2 to 3 years.
The first area that was explored was the partitioning of the fetch resources between the threads
in the processor. This was done by varying the number of threads that could fetch instructions
each clock cycle, as well as the number of instructions each thread could fetch. Four different
partitioning schemes were used:
1. one thread fetching up to eight instructions,
2. two threads fetching up to four instructions each,
3. two threads fetching up to eight instructions each and
4. four threads fetching up to two instructions each.
Simulation showed that two threads fetching four instructions each and two threads fetching eight
instructions each produced the best results. However, two threads fetching eight instructions each
produced the most consistent results [33].
After choosing how to partition the fetch unit, Tullsen et al. went on to explore how to choose
which threads would get a chance to use the fetch unit during a particular clock cycle. The following
heuristics were devised and tested:
Round-Robin this is the simplest of the heuristics; every cycle a different set of eligible threads
gets to fetch instructions;
BRCOUNT the thread with the fewest branch instructions in the static portion of the
pipeline (decode, rename and instruction queues) gets the highest priority;
MISSCOUNT highest priority is given to the thread with the smallest number of outstanding
D-cache misses;
ICOUNT this heuristic gives the highest priority to the thread with the smallest number of
instructions in the static portion of the pipeline;
IQPOSN this heuristic gives the lowest priority to the thread with instructions closest to the head
of the instruction queue (IQ).
ICOUNT and IQPOSN perform similarly, but ICOUNT always produces the best performance [33].
Because of this, it is often used in research that explores the behavior of SMTs [27, 32].
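The three count-based heuristics share one structure: rank the threads by a per-thread counter and let the lowest counts fetch. The sketch below assumes the counters are already maintained elsewhere; the dictionary layout, the tie-break by thread number and the sample values are illustrative and are not taken from [33].

```python
def fetch_priority(threads, metric, fetch_threads=2):
    """Return the threads allowed to fetch this cycle, lowest count first.

    threads: thread id -> dict of counters ('icount', 'brcount', 'misscount').
    Ties are broken by thread id (an assumption).
    """
    ranked = sorted(threads, key=lambda tid: (threads[tid][metric], tid))
    return ranked[:fetch_threads]


counters = {
    0: {"icount": 12, "brcount": 1, "misscount": 0},
    1: {"icount": 5, "brcount": 3, "misscount": 2},
    2: {"icount": 9, "brcount": 0, "misscount": 1},
}
by_icount = fetch_priority(counters, "icount")
by_brcount = fetch_priority(counters, "brcount")
by_misscount = fetch_priority(counters, "misscount")
```

The same state yields a different pair of fetching threads under each heuristic, which is why the choice of metric matters.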
3.2 Compound Scheduling Methods
What I call the compound scheduling methods fall somewhere between the simple heuristics described
above (see Section 3.1) and the more complex heuristics described below (see Section 3.3).
These compound methods take more than a single metric into account when determining what
priority to assign to a given thread and do not necessarily treat the metrics as having an equal
importance in determining a thread's priority. Figure 3.2 shows a block diagram of a compound
thread scheduling unit.
Figure 3.2: Block diagram of a compound scheduling mechanism (blocks: input metrics, PCs from n threads, two basic scheduling mechanisms, weight function, fetch unit)
3.2.1 Thread Prioritization
Raasch and Reinhardt noticed that the vast majority of research into MTAs in general, and SMTs
in particular, has focused on improving the overall throughput of the processor. (One exception is
the Anaconda processor developed by Moore [23].) They contend that, while valid, this approach
will not lead to processors that provide the highest level of performance in a wide enough variety
of situations [27]. Two commonly occurring areas that are pointed out as being ignored by the
prevalent approaches are latency of user threads in an interactive system and fair distribution of
processor resources in a multiuser system.
The solution that is proposed is to add priority to the context of each thread in the processor.
This priority is set by software, presumably the OS. The priority system was added to a modified
version of the Simple Scalar tool set, using both the round-robin and the ICOUNT heuristics. It
was found that while latency and fairness were improved, and even though the processor performed
better than a conventional single threaded processor, the total throughput fell well short of that
produced by the other systems [27].
3.2.2 Weighted Averages
This mechanism was proposed by Tullsen et al. as a suggestion for future work [33]. The specific
average suggested was one of BRCOUNT and ICOUNT. It was noted that these heuristics attack
different problems and therefore may work well together to increase performance. It would be
possible to do this with most of the other heuristics explored by Tullsen et al. A notable
exception would be the round-robin method. One of the mechanisms explored by this project was
a weighted average of BRCOUNT and ICOUNT (see Chapter 4 for a more detailed description).
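As a rough illustration only, a weighted average of the two counts could be formed as below. The weights and the function names are assumptions and are not the implementation described in Chapter 4.

```python
def weighted_score(icount, brcount, branch_weight):
    """Weighted average of BRCOUNT and ICOUNT; a lower score means higher priority."""
    return branch_weight * brcount + (1.0 - branch_weight) * icount


def rank_threads(counters, branch_weight):
    """counters: thread id -> (icount, brcount). Returns ids, best priority first."""
    return sorted(
        counters,
        key=lambda tid: (weighted_score(*counters[tid], branch_weight), tid),
    )


counters = {0: (12, 1), 1: (5, 3), 2: (9, 0)}
equal_weights = rank_threads(counters, branch_weight=0.5)
branch_heavy = rank_threads(counters, branch_weight=0.9)
```

Shifting the weight toward BRCOUNT changes which thread is favored, so the weights decide which of the two problems the combined heuristic attacks more strongly.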
3.3 Hierarchical Scheduling Methods
The hierarchical scheduling mechanisms are the most complex of the three discussed here. A
hierarchical mechanism uses two completely independent scheduling heuristics to assign priorities
to the threads. The first, or primary, mechanism is responsible for the cycle-to-cycle determination
of priorities. This mechanism is directly analogous to the basic scheduling methods discussed earlier.
The second, or overwatch, mechanism collects data every cycle, just like the primary mechanism.
However, it does not directly affect the cycle-to-cycle operation of the primary mechanism. The
overwatch mechanism is used as a check on the primary mechanism, and only overrides it after
deciding that the primary mechanism has been making bad decisions for several cycles. One of
the big advantages of the overwatch mechanism not needing to act every cycle is that it can use
metrics that the primary mechanism cannot due to clock cycle restrictions. Figure 3.3 shows a
block diagram of a hierarchical scheduling mechanism.
For the work presented here, ICOUNT, as described by Tullsen et al. [33], was used as the
primary scheduling mechanism, because of its superior performance. The following sections give
an overview of the overwatch mechanisms explored in this work. Chapter 4 will describe the
implementation of these mechanisms in more detail.
Figure 3.3: Block diagram of a hierarchical scheduling mechanism (blocks: input metrics, PCs from n threads, primary scheduling mechanism, overwatch scheduling mechanism)
3.3.1 MISSCOUNT Overwatch
The first overwatch mechanism that was explored used a form of the MISSCOUNT heuristic. In
this implementation, the number of cache misses in each cycle is tracked and then remembered
for a number of cycles. Each cycle, the number of cache misses remembered is summed for each
thread and compared to a threshold. If the thread that is given the highest priority by the primary
mechanism has a number of cache misses equal to or greater than the threshold, the overwatch
mechanism overrides the primary and reduces the priority of that thread. It should be noted that
the MISSCOUNT described here differs from the one in Section 3.1.2. That MISSCOUNT keeps
track of all outstanding cache misses while this MISSCOUNT simply counts the number of cache
misses in a single cycle and remembers it.
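The bookkeeping described in this section can be sketched as follows. The history length, the threshold and the choice to demote an overridden thread to the lowest priority are assumptions for illustration; the text only states that the thread's priority is reduced.

```python
from collections import deque


class MissCountOverwatch:
    """Remembers per-cycle cache miss counts and checks the primary's top choice."""

    def __init__(self, num_threads, history_cycles=4, threshold=3):
        self.threshold = threshold
        # One bounded history per thread; old cycles fall out automatically.
        self.history = [deque(maxlen=history_cycles) for _ in range(num_threads)]

    def record_cycle(self, misses_per_thread):
        for tid, misses in enumerate(misses_per_thread):
            self.history[tid].append(misses)

    def review(self, primary_order):
        """Return the fetch order, overriding the primary if its top thread
        has reached the miss threshold."""
        top = primary_order[0]
        if sum(self.history[top]) >= self.threshold:
            return primary_order[1:] + [top]   # assumed form of the demotion
        return list(primary_order)


overwatch = MissCountOverwatch(num_threads=2, history_cycles=3, threshold=2)
overwatch.record_cycle([1, 0])
overwatch.record_cycle([1, 0])
overridden = overwatch.review([0, 1])
overwatch.record_cycle([0, 0])
overwatch.record_cycle([0, 0])
restored = overwatch.review([0, 1])
```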
3.3.2 Effectiveness History Table
The second overwatch mechanism that was explored has been called an effectiveness history table
(EHT). This is essentially a compound scheduling mechanism overseeing a basic mechanism. It
can keep track of any feasible number of metrics for a uniform or a variable number of cycles on a
per metric basis. These metrics can then be combined using a variety of weights to influence their
effect on the operation of the primary mechanism. There is, of course, a practical limit to what an
EHT can keep track of and how it can weight the data it collects. In all likelihood, using more than
two metrics will be rare and using more than three will almost never happen. The EHT used for
this work was kept simple. Only two metrics were tracked and they were not weighted. Chapter 4
describes the implementation used in more detail.
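The general shape of such a table, with a separate history length and weight per metric, might look like the sketch below. The metric names "metric_a" and "metric_b" are placeholders, since this section does not name the two metrics that were tracked, and the class is not the implementation described in Chapter 4.

```python
from collections import deque


class EffectivenessHistoryTable:
    """Per-thread histories for several metrics, each with its own length and weight."""

    def __init__(self, num_threads, metrics):
        # metrics: name -> (history length in cycles, weight)
        self.weights = {name: weight for name, (_, weight) in metrics.items()}
        self.history = [
            {name: deque(maxlen=length) for name, (length, _) in metrics.items()}
            for _ in range(num_threads)
        ]

    def record_cycle(self, tid, samples):
        for name, value in samples.items():
            self.history[tid][name].append(value)

    def score(self, tid):
        """Weighted sum of everything currently remembered for one thread."""
        return sum(
            self.weights[name] * sum(values)
            for name, values in self.history[tid].items()
        )


# Two unweighted metrics (weight 1) kept for different numbers of cycles.
eht = EffectivenessHistoryTable(1, {"metric_a": (2, 1), "metric_b": (3, 1)})
eht.record_cycle(0, {"metric_a": 1, "metric_b": 2})
eht.record_cycle(0, {"metric_a": 0, "metric_b": 1})
eht.record_cycle(0, {"metric_a": 3, "metric_b": 0})
thread_score = eht.score(0)
```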
Chapter 4
Simulator Overview
The simulator used in this study is a modified version of the simulator developed by Torrant in [32].
It models a SMT processor with a variable number of thread contexts that is defined at compile
time. This simulator was built on the out-of-order simulator of the Simple Scalar tool set from the
University of Wisconsin-Madison [4]. The first part of this chapter will discuss the basic architecture
of the simulator. The second part of this chapter will discuss the modifications made to the base
simulator.
4.1 Base Simulator
There are two ways a simulator is classified. The first is whether it simulates just the ISA or if it
models a microarchitecture. If it simply models an ISA, it is called a functional simulator. If it
models a microarchitecture in more detail, it is called a performance simulator. The other way a
simulator is classified is if it is trace-driven or execution-driven. A trace-driven simulator uses a
pregenerated list of instructions to execute a program. An execution-driven simulator generates its
own trace as it runs. The Simple Scalar tool set provides both a functional simulator and several
performance simulators, all of which are execution-driven. The simulator developed by Torrant,
sim-SMT, is based on the out-of-order issue version of the Simple Scalar performance simulator,
sim-outorder.
An execution-driven simulator is useless without a way to generate executables for it to run.
Some simulators provide this service by being able to run programs compiled for an existing
architecture. SMTSIM, by Tullsen et al., went this route [34]. The Simple Scalar tool set went a
different route in providing a compiler, gcc, and other binary utilities to produce executables for
its own ISA. Also, the libc standard C library was ported for use with Simple Scalar [4]. The
simulator runs a.out binaries that are produced by the included compiler for either a little-endian
or big-endian architecture, depending on what kind of architecture the simulator is running on.
There are built-in mechanisms for gathering and reporting different statistics about a particular
simulator run. When a program is run, the simulator outputs, through standard error, all the
statistics that have been registered with it. In addition to basic statistics, the simulator is also
capable of generating output based on statistical formulas. All that is needed to add statistics is
to add them to the code, register them in the code and update them as the simulator runs.
4.1.1 Simulator Pipeline
The simulator models a pipeline that logically has seven stages. Five of the stages are implemented
as the following discrete functions:
ruu_fetch This function fetches instructions from memory. The address is checked to ensure
that it is within the text boundaries of the current thread. If not, a NOP is sent to the
pipeline. If a branch is fetched, the control unit will stop fetching instructions from that
thread for that cycle, unless the branch prediction routine produces a predicted PC value.
Up to four instructions will be fetched from each of the two threads that have the highest
priority according to the ICOUNT method. This is also the function where penalties for
cache misses, branches and branch mis-predictions are enforced.
ruu_dispatch This function enters instructions in the register update unit (RUU) and load store
queue (LSQ). It will continue to do so for a thread until either the RUU and/or LSQ fill,
the fetch queue is empty or the decode bandwidth is used. This is the function where the
instruction is actually executed. If the instruction is a memory reference, it is placed in both
the RUU and the LSQ.
ruu_issue This function is where the instructions from the different threads are mixed for the first
time. It starts with the highest priority thread and issues instructions from its ready queue
until the queue is empty or the issue bandwidth is met. It then moves on to the next highest
priority thread and continues until all instructions have been issued from all the threads or
until the issue bandwidth has been met. An instruction is only issued if there is a functional
unit (FU) free. If there is not one free, then the instruction is placed back in the ready queue.
After all possible instructions have been issued, the ready queue is reclaimed and sorted.
There is also an option for round-robin issue.
ruu_writeback This function watches for instructions that have completed and marks them as
such. Any instructions that are dependent on the just finished instructions are updated.
This is also the function where mis-predicted paths are detected and handled.
ruu_commit This function is where instructions are committed and loads and stores are completed.
Once an instruction is committed, the RUU and LSQ entries for that instruction are released.
The sixth stage that is implemented by the simulator is the execution stage. This is controlled
by the event queues, the functional units and the ruu_release_fu function. Logically, there is also
a decode stage to the pipeline. However, Simple Scalar decodes the program during initialization.
Therefore, that stage does not truly exist [4]. Figure 4.1 shows a logical diagram of the simulated
pipeline.
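The priority-ordered issue policy described for ruu_issue can be sketched as follows. This is a simplified model under stated assumptions: each thread's ready queue is reduced to a count of ready instructions, and the functional units are treated as one interchangeable pool. The function name issue_by_priority and its parameters are hypothetical.

```c
#define NUM_THREADS 4

/*
 * Issue up to issue_width instructions in one cycle, visiting threads in
 * priority order. ready[t] holds the number of ready instructions for
 * thread t, priority[] lists thread ids from highest to lowest priority,
 * and free_fus is the number of free functional units (assumed to be
 * interchangeable). issued[t] receives the per-thread issue count and
 * the total number issued is returned.
 */
int issue_by_priority(const int priority[NUM_THREADS],
                      int ready[NUM_THREADS],
                      int issue_width, int free_fus,
                      int issued[NUM_THREADS])
{
    int total = 0;

    for (int t = 0; t < NUM_THREADS; t++)
        issued[t] = 0;

    for (int p = 0; p < NUM_THREADS; p++) {
        int t = priority[p];
        /* Drain this thread until its queue, the bandwidth or the FUs run out. */
        while (ready[t] > 0 && total < issue_width && free_fus > 0) {
            ready[t]--;
            free_fus--;
            issued[t]++;
            total++;
        }
    }
    return total;
}
```

Instructions that cannot issue simply remain counted in ready[t], which plays the role of placing them back in the ready queue.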
4.1.2 Miscellaneous Structures
This section will cover the other important structures of the simulated microarchitecture that have
not yet been discussed.
Register Update Unit
The Simple Scalar sim-outorder simulator, and therefore Torrant's sim-SMT, uses a register update
unit to allow out-of-order instruction execution while maintaining in-order commit. The RUU was
first proposed by Sohi in [29]. It is a combination reservation station pool and tag unit (TU)
(RSTU). A tag unit is sometimes used in processors that use Tomasulo's approach to out-of-order
instruction execution [15, 29]. Instead of having a tag for every possible destination register in the
register file, the TU consolidates the tags from all of the currently active registers through the use
of a busy bit. (These tags are used to eliminate data dependencies between instructions.) An RUU
differs from an RSTU in that it is managed like a queue to limit it to committing instructions in the
order that it receives them. For a more in-depth description of how an RUU works, please see [29].
Figure 4.1: The Simple Scalar pipeline
[Diagram: instruction decode, ruu_fetch, ruu_dispatch, ruu_issue, ruu_writeback and ruu_commit
around the RUU, with the TLBs, L1 cache, L2 cache and main memory.]
In sim-SMT, the RUU is the main entity that tracks the progress of a small number of instructions,
and it interacts with all of the pipeline stages except for the fetch stage. The RUU keeps track
of many aspects of an instruction, including the thread the instruction belongs to, the PC where
the instruction was fetched from, the operands needed by the instruction, whether the instruction is
valid, whether it has been put into the ready queue and whether it has been committed.
For a complete list of what the RUU keeps track of, see [32].
Ready Queue
This structure keeps track of all the instructions that are ready to be issued to the functional units.
It keeps track of the sequence number of the instruction, the input or output operand number, a
reservation station for each instruction and a pointer to the next instruction in the queue. The
ready queue is scanned in the ruu_issue function to find issuable instructions. The sequence
number is used to keep the instructions in order, with the exception of memory, control and long
latency instructions. These instructions are usually issued first, but there is an option to give the
same priority to all types of instructions.
Event Queue
The event queue is built from the same data structure as the ready queue. However, unlike
the ready queue, it is not polled to see if there are entries; it is polled to see if any entries need
to be removed. Each entry in the queue keeps track of the cycle in which it is to complete. When
the simulator gets to that cycle, the ruu_writeback function will remove the instruction from the
queue and update the instructions that depended on the instruction that was removed.
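The polling behavior of the event queue can be illustrated with a minimal sketch, assuming a singly linked list kept sorted by completion cycle. The struct layout and the names eventq_insert and eventq_next_due are hypothetical and chosen only for this example.

```c
#include <stdlib.h>

/* Hypothetical event record: an instruction id and its completion cycle. */
struct event {
    int insn_id;
    long when;
    struct event *next;
};

/* Insert keeping the list sorted by completion cycle (earliest first). */
struct event *eventq_insert(struct event *head, int insn_id, long when)
{
    struct event *ev = malloc(sizeof *ev);
    if (ev == NULL)
        return head;   /* out of memory: leave the queue unchanged */
    ev->insn_id = insn_id;
    ev->when = when;

    struct event **link = &head;
    while (*link != NULL && (*link)->when <= when)
        link = &(*link)->next;
    ev->next = *link;
    *link = ev;
    return head;
}

/*
 * Remove the first event that completes at or before cycle `now`.
 * Returns its instruction id, or -1 if nothing is due yet.
 */
int eventq_next_due(struct event **head, long now)
{
    struct event *ev = *head;
    if (ev == NULL || ev->when > now)
        return -1;
    *head = ev->next;
    int id = ev->insn_id;
    free(ev);
    return id;
}
```

A writeback stage built on this sketch would call eventq_next_due repeatedly each cycle until it returns -1, waking the dependents of each returned instruction.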
4.1.3 Simulator Memory Space
A diagram of the memory space for sim-SMT can be seen in Figure 4.2 [32]. For a diagram of the
layout of the Simple Scalar simulators' memory space, see [32]. When the simulator is started, it
reads a program from the binary file and loads it into the simulator's memory space and sets up
the memory protection boundaries. All memory accesses made by a program run in the simulator
are checked to make sure they are either in the text section or the data section for that program.
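The per-access protection check can be sketched as below. This is only an illustration: the struct thread_mem layout, the half-open range convention and the function names are assumptions, not the sim-SMT code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread layout: half-open [start, end) address ranges. */
struct thread_mem {
    uint32_t text_start, text_end;
    uint32_t data_start, data_end;
};

static bool in_range(uint32_t addr, uint32_t start, uint32_t end)
{
    return addr >= start && addr < end;
}

/* An instruction fetch must fall inside the thread's text section. */
bool fetch_addr_ok(const struct thread_mem *m, uint32_t pc)
{
    return in_range(pc, m->text_start, m->text_end);
}

/* A load or store must fall inside the thread's text or data section. */
bool access_addr_ok(const struct thread_mem *m, uint32_t addr)
{
    return in_range(addr, m->text_start, m->text_end) ||
           in_range(addr, m->data_start, m->data_end);
}
```

With one such record per thread, a fetch that fails fetch_addr_ok would be replaced by a NOP, matching the behavior described for ruu_fetch.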
One of the limitations of sim-SMT is its memory space. The simulator has one memory space
from 0x00000000 to 0xFFFFFFFF. Since part of this space needs to be reserved for the simulator
itself, only addresses below 0x7FFFFFFF are usable. Since this area must hold the text, data
and stack spaces for up to four threads, the individual programs cannot be very large. Also, the
compiler and linker that are provided for the Simple Scalar tool set have some limitations in the
context of sim-SMT. Since I did not develop my own programs to run on the simulator, but simply
used precompiled versions of the same test programs used in [32], the reader is referred to [32] for
the details on these limitations.
Boundary address            Region
below 0x00400000            Unused
0x00400000                  Code 1
0x00500000                  Code 2
0x00600000                  Code 3
0x00700000                  Code 4
0x10000000                  Data 1
0x20000000                  Data 2
0x30000000                  Data 3
0x40000000                  Data 4
0x4FFFFFFF                  Stack 1
                            Stack 2
0x6FFFFFFF                  Stack 3
0x7FFFFFFF                  Stack 4
0x80000000 to 0xFFFFFFFF    Simulator Code & Data
Figure 4.2: sim-SMT memory space
4.1.4 Simulator Completion
The Simple Scalar compiler will insert an exit system call at the end of a program that it compiles.
Normally, this system call would execute a long-jump back to the main-SMT.c main function
which would then proceed to uninitialize the simulator and print out the statistics collected. This
behavior would cause sim-SMT to exit when the first of the threads had completed. The solution
to this was to keep track of the active threads by adding a boolean array called thread_in_use. In
sim-SMT, an exit system call by a thread sets that thread's thread_in_use entry to FALSE and
then exits in the same manner as sim-outorder if there are no other active threads. If other threads
are active, then only this thread exits and the others continue normally.
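The exit handling described above amounts to clearing one flag and testing whether any thread is still active. A minimal sketch, assuming four thread contexts and a hypothetical thread_exit helper:

```c
#include <stdbool.h>

#define NUM_THREADS 4

static bool thread_in_use[NUM_THREADS];

/*
 * Handle an exit system call from thread tid.
 * Returns true when no active threads remain, i.e. when the simulator
 * itself should shut down and print its statistics.
 */
bool thread_exit(int tid)
{
    thread_in_use[tid] = false;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (thread_in_use[t])
            return false;   /* another thread is still running */
    }
    return true;
}
```

Only when thread_exit returns true would the simulator take the same exit path as sim-outorder.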
4.1.5 Simulator Approximations and Shortcuts
Since sim-SMT could possibly have to search through a queue for each thread, the time complexity
of the simulator would increase. To get around this, the thread_in_use variable is used as a
shortcut. If the thread_in_use variable for a thread is FALSE, then many of the operations of the
pipeline are skipped and the ruu_fetch function will not fetch instructions for that thread.
In order to simplify the conversion of sim-outorder to sim-SMT, each thread was given its own,
reduced size, branch prediction hardware [32]. Another solution would have been to add a field to
the branch prediction mechanisms to keep track of which thread an entry belonged to.
There was also a similar decision possible with the cache. When first implemented, the L1 cache
was direct mapped. However, because of the placement of the programs in the simulator's memory
space (see Section 4.1.3) the threads mapped to the same cache block and set up a state of thrashing.
The solution used was to make the cache set associative, but it also would have been possible to
either replicate the cache for each thread or to add a thread tag to the cache. Also, the caches
are unified, handling both instructions and data and the cache access bandwidth is not tracked.
Because of this, the memory statistics may not be accurate. For a more realistic simulator, there
should be separate caches for instructions and data and the cache bandwidth should be enforced.
The simulator does not support precise exceptions. While this would be a severe limitation for
a commercial, or even a prototype or research processor that was going to be fabricated, it is not
so for sim-SMT. The lack of precise exceptions removes the possibility of using an OS with the
simulator and the ability would need to be added if this was desired.
As with other processor simulators, sim-SMT and the Simple Scalar simulators have the
underlying OS handle any system calls made by the programs run on the simulators [34, 32]. This
means that the instructions executed as part of any system calls are not reflected in the simulator
statistics. It is possible to write these calls into the simulator. If this were done, it would allow
a more accurate measure of the performance of the simulated processor. It would also allow the
simulator to keep statistics about the amount of time each thread spent in system calls.
4.2 Simulator Modifications
In modifying the sim-SMT simulator, I attempted to be as unintrusive and simple as possible.
In pursuit of this, all modifications are controlled by #define statements added to the file ss.h
instead of command line arguments. There is one variable called HTS that is a master switch for
all of the modifications. If this is not defined, none of the other added controls will be defined and
the simulator will be compiled without any of my modifications. There is also a different switch to
turn on each of the three scheduling methods that I explored. In order to ensure proper operation,
only one of these should be defined at a time as I don't do any complicated checks. In addition to
these control switches, constants to set memory lengths and thresholds for some of the mechanisms
are also defined in ss.h.
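The switch arrangement can be pictured with the following sketch of how such a section of ss.h might look. The switch names follow Table 4.1; the conflict guard and the scheduler_name helper are hypothetical additions for illustration and are not part of the code described here, which performs no such check.

```c
/* Master switch: comment this out to build the unmodified simulator. */
#define HTS

#ifdef HTS
/* Enable exactly one of the three scheduling mechanisms. */
#define WEIGHTED_AVERAGE
/* #define MISSC_OVERWATCH */
/* #define EHT */
#endif

/* Hypothetical guard (an assumption, not in the thesis code). */
#if (defined(WEIGHTED_AVERAGE) && defined(MISSC_OVERWATCH)) || \
    (defined(WEIGHTED_AVERAGE) && defined(EHT)) || \
    (defined(MISSC_OVERWATCH) && defined(EHT))
#error "define only one scheduling mechanism at a time"
#endif

/* Report which heuristic was compiled in; plain ICOUNT is the fallback. */
const char *scheduler_name(void)
{
#if defined(WEIGHTED_AVERAGE)
    return "weighted average";
#elif defined(MISSC_OVERWATCH)
    return "MISSCOUNT overwatch";
#elif defined(EHT)
    return "EHT";
#else
    return "ICOUNT";
#endif
}
```

Because the scheduling switches sit inside the HTS block, removing the single HTS definition disables all of them at once.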
Between the calls to ruu_issue and ruu_fetch in sim-SMT's sim_main function, a call is made
to the determine_thread_priority function. This function and the functions it calls model the
thread scheduling heuristic that is in use for the processor. This function was modified to call
the different functions that implemented the scheduling mechanisms explored. Unmodified, the
function would first count the instructions in the static portion of the pipeline by calling the
count_instructions function. It would then sort an array of priorities based on the results of the
instruction count. The priority array was then used by ruu_fetch to choose the threads to fetch
instructions from.
The count_instructions function is a loop which does the same thing to all the threads in the
processor. First, it checks thread_in_use to see if the thread it is currently on is active. If it is
not, then the instruction count for that thread is made artificially high to prevent it from getting
the highest priority. If the thread is active, the ruu_fetch_issue_delay variable for that thread
is then checked to see if instruction fetch should be delayed for this thread. If the thread is not
delayed, then every instruction that is in the RUU for that thread is checked to see if it is in the
static portion of the pipeline. For every one that is, a counter is incremented.
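The counting and sorting steps can be sketched as follows. The simplified RUU entry, the per-thread RUU array, the penalty values and the handling of a delayed thread (here given a fixed penalty) are assumptions made for this illustration.

```c
#include <stdbool.h>

#define NUM_THREADS 4
#define RUU_SIZE 16
#define INACTIVE_PENALTY 1000   /* assumed: larger than any real count */
#define DELAY_PENALTY 500       /* assumed: penalty for a delayed thread */

/* Hypothetical, simplified RUU entry. */
struct ruu_entry {
    bool valid;      /* entry holds an instruction */
    bool in_static;  /* instruction is in the static portion of the pipeline */
};

/* Compute the ICOUNT value for every thread. */
void count_instructions(const bool thread_in_use[NUM_THREADS],
                        const int fetch_delay[NUM_THREADS],
                        struct ruu_entry ruu[NUM_THREADS][RUU_SIZE],
                        int icount[NUM_THREADS])
{
    for (int t = 0; t < NUM_THREADS; t++) {
        icount[t] = 0;
        if (!thread_in_use[t]) {
            icount[t] += INACTIVE_PENALTY;   /* never the highest priority */
            continue;
        }
        if (fetch_delay[t] > 0) {
            icount[t] += DELAY_PENALTY;
            continue;
        }
        for (int i = 0; i < RUU_SIZE; i++) {
            if (ruu[t][i].valid && ruu[t][i].in_static)
                icount[t]++;
        }
    }
}

/* Fill priority[] with thread ids ordered from lowest count to highest. */
void sort_priorities(const int icount[NUM_THREADS], int priority[NUM_THREADS])
{
    for (int t = 0; t < NUM_THREADS; t++)
        priority[t] = t;
    /* Insertion sort: stable, so ties keep the lower thread id first. */
    for (int i = 1; i < NUM_THREADS; i++) {
        int t = priority[i];
        int j = i - 1;
        while (j >= 0 && icount[priority[j]] > icount[t]) {
            priority[j + 1] = priority[j];
            j--;
        }
        priority[j + 1] = t;
    }
}
```

The thread with the fewest instructions in the static portion of the pipeline ends up in priority[0], which is the ordering the fetch stage consumes.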
Scheduling Mechanism   Switches            Needed Constants            Associated Statistic
Weighted Average       HTS,                none                        See Section 5.3
                       WEIGHTED_AVERAGE
MISSCOUNT Overwatch    HTS,                OVERWATCH_MEMORY_LENGTH,    sim_mc_use
                       MISSC_OVERWATCH     OVERWATCH_THRESHOLD
EHT                    HTS,                DCMISS_LAT_COUNT_MAX,       sim_eht_use
                       EHT                 OVERWATCH_MEMORY_LENGTH,
                                           OVERWATCH_THRESHOLD
Table 4.1: Summary of switches and constants for the various scheduling mechanisms
The following sections discuss the additions and modifications to this basic mechanism.
Table 4.1 summarizes the switches and constants that need to be defined for each of the mechanisms
implemented.
4.2.1 Weighted BRCOUNT/ICOUNT Average
In order to implement this weighted average, existing functions were modified and a function to
count branches, count_branches, was written. The determine_thread_priority function was
modified to call count_instructions and then count_branches before sorting the priority array.
In count_instructions, the values added to the instruction count for an inactive thread or for
a thread that has been delayed were increased to account for the higher possible counts that this
mechanism can produce. The last modification to the count_instructions function was a separate
statement that changes the instruction count. Keeping it separate preserves the original state of the
code and makes disabling the weighted average as simple as commenting out a single
#define statement.
It was only necessary to add a single function to implement this scheduling mechanism. The
count_branches function is very similar to the count_instructions function. It is a loop that
runs through a set of checks for each of the threads. If a thread is not in use or is delayed, the function
does nothing because the ICOUNT mechanism has already ensured that threads that match these
conditions are given a properly low priority. If a thread is active and not delayed, each entry in the
RUU for that thread is checked to see if it is a branch in the static portion of the pipeline. If it is,
the same counter that the count_instructions function incremented is incremented again.
The only other modification to the code that was needed for this weighted average was to
include a #define for the WEIGHTED_AVERAGE switch, to tell the compiler to select the code for this
mechanism. Weights for the two scheduling mechanisms are selected by changing how much
the count_instructions and count_branches functions add to the count that is used to assign
priorities to the threads. Several different weightings were tested. More information about these
and the results they produced can be found in Chapter 5.
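Since both functions add to the same counter, the combined count for a thread works out to the ICOUNT weight times the number of instructions in the static portion of the pipeline, plus the BRCOUNT weight times the number of those instructions that are branches. A minimal sketch, with the two passes merged into one loop and with hypothetical names throughout:

```c
#include <stdbool.h>

/* Hypothetical, simplified RUU entry. */
struct ruu_entry {
    bool valid;      /* entry holds an instruction */
    bool in_static;  /* instruction is in the static portion of the pipeline */
    bool is_branch;  /* instruction is a branch */
};

/*
 * Combined count for one thread. w_icount is the amount added per
 * instruction (the count_instructions step) and w_brcount is the extra
 * amount added per branch (the count_branches step).
 */
int weighted_count(const struct ruu_entry *ruu, int n,
                   int w_icount, int w_brcount)
{
    int count = 0;
    for (int k = 0; k < n; k++) {
        if (!ruu[k].valid || !ruu[k].in_static)
            continue;
        count += w_icount;
        if (ruu[k].is_branch)
            count += w_brcount;
    }
    return count;
}
```

For a thread holding three instructions in the static portion of the pipeline, one of which is a branch, an ICOUNT weight of 2 and a BRCOUNT weight of 1 give a count of 2*3 + 1*1 = 7. A lower count means a higher fetch priority.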
4.2.2 MISSCOUNT Overwatch
For this mechanism, the modifications to determine_thread_priority are similar to those made
for the weighted average. First, the function counts the instructions and sorts the priority array as
it normally would. Then three new functions are called: update_mc_history, sum_mc_histories
and check_priority. There were two other modifications to existing sim-SMT code. The first was
the addition of an mcount variable for each thread to keep track of how many cache misses occur in
a clock cycle. The code to update this was added to the sections of ruu_issue and ruu_rr_issue
that check for a cache miss. The second was the addition of the MISSC_WATCH switch to ss.h to
select the MISSCOUNT code to be compiled into the simulator.
The update_mc_history function is a loop that updates the history of the number of cache
misses for each thread, which is stored in the mc_history variable. The number of cycles to remember
cache misses for is controlled by the OVERWATCH_MEMORY_LENGTH constant, which is set with a #define
statement in ss.h. The process starts at the end of this history and moves entries toward the end
of the list, overwriting the last (oldest) entry in the history and placing the count from the current
cycle at the start of the history array. After the history is updated and the current count is saved,
the function resets the mcount variable for the current thread.
The sum_mc_histories function is a loop that generates the sum of the number of cache misses
that are remembered in the mc_history variable for each thread and stores that value in the
mcount_sums variable for that thread. It ensures that an accurate sum is provided by first clearing
the sum generated in the previous clock cycle.
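The history update and summation can be sketched as below, assuming four threads and a memory length of 5 cycles. The variable and function names follow the text; the array shapes and the chosen constant value are assumptions.

```c
#define NUM_THREADS 4
#define OVERWATCH_MEMORY_LENGTH 5   /* assumed value for this sketch */

static int mcount[NUM_THREADS];                               /* misses this cycle */
static int mc_history[NUM_THREADS][OVERWATCH_MEMORY_LENGTH];  /* index 0 is newest */
static int mcount_sums[NUM_THREADS];

/* Shift each thread's history by one cycle and record the current count. */
void update_mc_history(void)
{
    for (int t = 0; t < NUM_THREADS; t++) {
        /* Start at the end so the oldest entry is overwritten first. */
        for (int k = OVERWATCH_MEMORY_LENGTH - 1; k > 0; k--)
            mc_history[t][k] = mc_history[t][k - 1];
        mc_history[t][0] = mcount[t];
        mcount[t] = 0;   /* ready for the next cycle */
    }
}

/* Sum each thread's remembered misses, clearing the previous sum first. */
void sum_mc_histories(void)
{
    for (int t = 0; t < NUM_THREADS; t++) {
        mcount_sums[t] = 0;
        for (int k = 0; k < OVERWATCH_MEMORY_LENGTH; k++)
            mcount_sums[t] += mc_history[t][k];
    }
}
```

A miss recorded in one cycle therefore contributes to the sum for OVERWATCH_MEMORY_LENGTH consecutive cycles before it ages out of the history.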
The last function that was added for this mechanism is the check_priority function. It starts
with the last thread that would be allowed to fetch instructions this clock cycle (in the case of
sim-SMT this would be the thread with the second highest priority) and checks the number of
cache misses generated by this thread against the OVERWATCH_THRESHOLD. This constant is set by
a #define statement in ss.h. If the number of misses recorded is equal to or greater than this
threshold, the thread is moved to the end of the priority queue. This comparison is then repeated
for each thread that had a high enough priority from ICOUNT to be allowed to fetch instructions,
moving toward the thread with the highest priority. When the threshold is reached, the function
also updates the sim_mc_use statistic, which is registered with the simulator.
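The demotion step can be sketched as follows. The constants FETCH_THREADS and the threshold value of 3 are assumptions for illustration, as is the choice to count sim_mc_use once per cycle in which ICOUNT is overruled rather than once per demoted thread.

```c
#define NUM_THREADS 4
#define FETCH_THREADS 2          /* threads allowed to fetch each cycle */
#define OVERWATCH_THRESHOLD 3    /* assumed value for this sketch */

static unsigned long sim_mc_use;  /* cycles in which ICOUNT was overruled */

/*
 * Walk the fetch candidates from the lowest-priority one toward the
 * highest. Any candidate whose remembered miss count has reached the
 * threshold is moved to the end of the priority array.
 */
void check_priority(int priority[NUM_THREADS],
                    const int mcount_sums[NUM_THREADS])
{
    int overruled = 0;

    for (int p = FETCH_THREADS - 1; p >= 0; p--) {
        int t = priority[p];
        if (mcount_sums[t] >= OVERWATCH_THRESHOLD) {
            /* Close the gap, then append the demoted thread. */
            for (int k = p; k < NUM_THREADS - 1; k++)
                priority[k] = priority[k + 1];
            priority[NUM_THREADS - 1] = t;
            overruled = 1;
        }
    }
    if (overruled)
        sim_mc_use++;
}
```

Threads that move up into a fetch slot because of a demotion are not themselves checked, since only threads that earned a fetch slot from ICOUNT are compared against the threshold.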
4.2.3 The Effectiveness History Table
The changes made to sim-SMT to implement the EHT are very similar to those made for the
MISSCOUNT Overwatch mechanism. As mentioned in Section 3.3.2, only two simple metrics
were used in this implementation of an EHT. The first metric was the number of outstanding cache
misses for each thread and the second was the BRCOUNT mechanism used in the weighted average
described above. These metrics were weighted equally in the decision making process of the EHT.
As with the other mechanisms, the EHT code is selected by defining the EHT switch in ss.h.
The count_branches function was only modified slightly. In its original implementation, the
variable that it was to modify when a branch was found in the static portion of the pipeline was
passed as an argument. In the implementation used for the EHT, that variable was declared as a
static variable accessible from the entire sim-SMT.c file, and was named brcount. This made it
easier to use this count in all the functions that need it.
The MISSCOUNT mechanism was implemented similarly to that noted above. However, instead
of simply counting the number of misses, the latency was recorded in a miss_lats array for each
thread. This array was made large enough that it would be impossible for a single cycle to fill it,
even if there are still latencies in the array from previous cycles. The count of items in this array,
and the index into it, was kept in the mlindex variable for each thread. The maximum number of
latencies that can be kept track of at one time is set by the DCMISS_LAT_COUNT_MAX constant that
is defined in ss.h.
The first new function called by determine_thread_priority after it has the ICOUNT mechanism
determine priorities for the threads is the update_eht function. This function first calls count_branches
and then calls update_lat_history to update the number of outstanding cache misses and to update
the miss_lats array. Once this is done, it updates the history stored in eht_history in a
manner similar to that described in Section 4.2.2 for the MISSCOUNT Overwatch mechanism. The
difference is that eht_history keeps track of the sum of brcount and the number of outstanding
cache misses. The brcount variable is cleared after it has been saved in the eht_history.
The update_lat_history function is simple. For each thread, it first saves the current value
of mlindex into mlcount. It then decrements each of the latencies in miss_lats and removes any
that have become zero. The sum_eht_data function is similar to the sum_mc_histories function
used for the MISSCOUNT Overwatch mechanism. It takes the entries in the eht_history for a
thread, sums them and stores the sum in the eht_sums variable for that thread, after having cleared
it. Once again, the OVERWATCH_MEMORY_LENGTH constant is used to set the number of cycles the
eht_history will keep track of.
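The latency bookkeeping can be sketched as below. The variable names follow the text, while the array shapes and the compaction strategy used to remove expired entries are assumptions for illustration.

```c
#define NUM_THREADS 4
#define DCMISS_LAT_COUNT_MAX 50

static int miss_lats[NUM_THREADS][DCMISS_LAT_COUNT_MAX]; /* cycles left per miss */
static int mlindex[NUM_THREADS];   /* number of latencies currently stored */
static int mlcount[NUM_THREADS];   /* outstanding misses saved for this cycle */

/* Age every outstanding miss by one cycle and drop those that complete. */
void update_lat_history(void)
{
    for (int t = 0; t < NUM_THREADS; t++) {
        mlcount[t] = mlindex[t];   /* save the count before aging */
        int kept = 0;
        for (int k = 0; k < mlindex[t]; k++) {
            int left = miss_lats[t][k] - 1;
            if (left > 0)
                miss_lats[t][kept++] = left;   /* compact the survivors */
        }
        mlindex[t] = kept;
    }
}
```

After each call, mlcount holds the number of misses that were outstanding in the cycle just modeled and mlindex points at the next free slot for a newly recorded miss.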
The final modifications were made to the check_priority function. All that was needed was
to change the threshold check from being performed on mcount_sums to a check of eht_sums. Also,
instead of updating the sim_mc_use statistic, the sim_eht_use statistic was registered and updated.
Once again, the threshold that is checked is defined by the OVERWATCH_THRESHOLD constant from
ss.h.
Chapter 5
Simulation Results
This chapter describes the tests run on the different scheduling mechanisms proposed. As stated
in Section 4.1.3, precompiled versions of the programs used by Torrant in [32] were used to test
these mechanisms. Section 5.1 gives a brief overview of these programs. The remaining sections
of this chapter describe how each of the mechanisms was tested and what configurations of these
mechanisms were used.
5.1 The Test Programs
All of the tests described here used the same four programs. Because of the memory limitations
mentioned in Section 4.1.3, these programs are small so that they can all fit in the same memory
space. The programs were coded by Torrant and used in the original development of sim-SMT as
described in [32]. Besides a small memory footprint, other goals in developing these programs were
to ensure that a mix of functional units was used in the tests and to have roughly the same number
of instructions in each program. The following is a list of the programs used and a brief description of
each.
my-test This program performs a large number of integer operations. It is a loop that continues
until the product of two increasing numbers reaches a specified value. At each step, the
numbers are output via printf to the screen.
matsolve This program solves a matrix via LU decomposition and backwards substitution. When
executed, it solves four matrices. This is the largest program and it also has the greatest time
complexity. While it contains a fairly large number of floating-point instructions, it also has
a good number of integer and branch instructions.
newton This program runs a Newton interpolation on several different functions. In addition to
branch instructions, this program consists of a large number of floating-point instructions.
fp-test This program is made up mostly of floating-point and branch instructions and is similar to
my-test, except that the numbers are declared as floats. At each step, the current values
are output as integers. The values are not output as floats because of the limited stack space.
(See Section 4.1.3.)
Because of the various limitations of the simulator itself and the nature of the programs used
in testing the simulator, the results obtained are usable only as a proof of concept. The major
limitation of the tests run is the small size of the programs used to test. In addition, each program
is made up of a diverse collection of instructions although they have not been profiled to obtain
exact percentages. The performance of the simulator is also dependent on which thread context
the programs are placed in since thread 0 starts with the highest priority and will get to execute
first. Other resources that are not as heavily utilized as they would be under a "real" workload are
the cache and branch prediction. These mechanisms are modeled, but not necessarily heavily used
by the program load used in testing. There were cache misses and branch mispredictions, but the
frequency at which they occur is likely not representative of an actual system.
5.2 Base Simulator
Before any modifications were made to sim-SMT, a set of tests was run to determine the base
performance of the architecture. These tests consisted of running each of the programs described
in Section 5.1 by itself and with both two and four copies of the same program running at the same
time. Also, for two and four threads a "system" test was performed. In this test, the four programs
were run in various combinations either two or four at a time. Figure 5.1 shows the results of these
tests. For the individual programs, the results represent a single run of the simulator. For the
system tests, the results represent an average of the IPC produced by the different combinations of
programs. The system results are a better approximation of how the architecture would perform
under a "real" workload. All the test runs were run from batch files generated by scripts developed
by Torrant using the basic configuration file that he developed. For more details on these scripts
see [32]. For more details on the configuration files for sim-SMT, and for Simple Scalar, see [32, 4].
Figure 5.1: Test results for base simulator
[Bar chart with series my-test, matsolve, newton, fp-test and system.]
5.3 Weighted BRCOUNT/ICOUNT Average
The first mechanism to be implemented and tested was a weighted average of BRCOUNT and
ICOUNT. This mechanism differs from the others in how the weighting of the average is controlled.
Instead of being set by constants located in ss.h, the code that increments the icount and brcount
variables for each thread is changed to add different values. When deciding what values to use,
the effectiveness of both BRCOUNT and ICOUNT, as reported in [33], was taken into account.
The first weighting that was tried was the obvious one of equally weighting both metrics. After
this first run of tests, consideration was given to the performance of the individual metrics when
used alone. Since ICOUNT was shown to perform the best, it was always given more weight. Two
more test runs were then done: one with ICOUNT given twice the weight of BRCOUNT and one
with ICOUNT given three times the weight of BRCOUNT. The results of these runs are shown
in Figure 5.2, graphs a), b) and c). Information about the different weighting of the metrics is
presented in the labeling of these graphs in the form of wi.b. For this notation, i is the value that
ICOUNT adds to the count and b is the value that BRCOUNT adds to the count. Because these
mechanisms give priority to the thread that has the smallest number of instructions or branches
in the static portion of the pipeline, an individual mechanism is given more weight by increasing
the number it adds to the count. As an example, in w1.2, BRCOUNT is given twice the weight of
ICOUNT in determining the priority of the threads.
Figure 5.2: Performance of different weighted averages
[Four bar charts, a) w1.1, b) w2.1, c) w3.1 and d) w3.2, plotted against Number of Threads with
series my-test, matsolve, newton, fp-test and system.]
As can be seen from Figures 5.1 and 5.2, the performance of the weighted averages is identical to
the performance of the base simulator for a single thread. It can also be seen from Figure 5.2 that
the change in performance from different weights is not always easily seen at this scale. Figure 5.3,
graphs a), b) and c), shows the performance for these weightings for two and four threads. The
performance of the base simulator for two and four threads only is shown in Figure 5.4. From these
graphs, it can be seen that the best performance is gained when ICOUNT is given twice the weight
of BRCOUNT.
Figure 5.3: System performance of the weighted averages
[Four bar charts, a) w1.1, b) w2.1, c) w3.1 and d) w3.2, plotted against Number of Threads (2 and 4)
with series my-test, matsolve, newton, fp-test and system.]
After this first set of test runs, it was decided that a fourth weighting would be tried where ICOUNT was
given a weight of 3 and BRCOUNT was given a weight of 2. This weighting still gives more weight
to ICOUNT, but its overall importance is less than in the other weightings. The results of this
test run are shown in graph d) of Figures 5.2 and 5.3. From these tests, it was found that this new
weighting performed the same when running two threads and slightly better when running four.
Figure 5.4: System performance of the base simulator
[Bar chart plotted against Number of Threads (2 and 4) with series my-test, matsolve, newton,
fp-test and system.]
5.4 MISSCOUNT Overwatch
After testing the different weightings of the BRCOUNT/ICOUNT average, the MISSCOUNT
overwatch mechanism was implemented and tested. Several parameters were varied to determine
the setup that produced the best performance. The first of these parameters was the
OVERWATCH_THRESHOLD. It controls the number of cache misses it takes for the mechanism to overrule
ICOUNT. The second was the OVERWATCH_MEMORY_LENGTH. This parameter controls for how
many cycles the mechanism remembers cache misses. Three sets of values were used in the testing.
The values were chosen with an eye towards causing the overwatch mechanism to overrule
ICOUNT, but not very often. To achieve this, the OVERWATCH_MEMORY_LENGTH parameter was kept
larger than the threshold. Figure 5.5 shows the results of these tests. Information about the
parameters used in these tests is in the form of mcm.t. In this notation, m is the value used for the
memory length parameter and t is the value used for the threshold. Table 5.1 lists the percentage
of cycles that the MISSCOUNT overwatch mechanism overruled ICOUNT.
Parameter   Number of Threads        Number of System Threads
Values      1       2       4        2       4
mc3.2       0.011   0.026   0.020    0.023   0.019
mc5.3       0.006   0.013   0.009    0.012   0.008
mc7.5       0.000   0.002   0.001    0.001   0.000
Table 5.1: Percentage of Cycles MISSCOUNT was used
Figure 5.5: Results for MISSCOUNT overwatch
[Three bar charts, a) mc3.2, b) mc5.3 and c) mc7.5, plotted against Number of Threads with
series my-test, matsolve, newton, fp-test and system.]
Unlike the weighted averages tested, the single thread performance of this mechanism is not
identical to the base performance. The difference between the two, however, is not significant.
This mechanism is similar to the weighted averages in how close the performance of the mechanism
is between the different parameters tried. Figure 5.6 shows the performance of the three sets of
parameters for both two and four threads. From this figure, it can be seen that mc5.3 produces
the best overall performance. From Table 5.1, it appears that the best performance occurs when
the percentage of cycles in which MISSCOUNT interferes is kept below 0.012%, but above 0.006%.
Figure 5.6: MISSCOUNT overwatch performance with two and four threads
[Three bar charts, a) mc3.2, b) mc5.3 and c) mc7.5, plotted against Number of Threads (2 and 4)
with series my-test, matsolve, newton, fp-test and system.]
This observation prompted a new test case to see if performance could be improved. This test
run used a memory of 5 cycles and a threshold of 2 (mc5.2). Figure 5.7 shows the results of this
configuration. These graphs show significant improvement in the performance of some tests.
However, the system tests produced slightly worse performance than the mc5.3 parameter set.
Table 5.2 shows the percentage of cycles in which the MISSCOUNT mechanism took over. The
performance numbers are slightly surprising in light of the fact that MISSCOUNT overrode ICOUNT
between 0.020% and 0.108% of the time, as these numbers are noticeably higher than those obtained
when the best performance of this mechanism was reached.

Test      Threads   Percentage of Cycles
Threads   1         0.020
          2         0.043
          4         0.108
System    2         0.042
          4         0.036
Table 5.2: Usage statistics for mc5.2

[Figure 5.7: two panels, a) All tests, b) Two and four threads, each plotting the my-test,
matsolve, newton, fp-test and system tests against Number of Threads.]
Figure 5.7: Performance of mc5.2
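The comparison between the mc5.2 usage statistics and the earlier band can be made explicit with
a small calculation. The following is only an illustrative sketch in Python: it copies the
percentages from Table 5.2 and compares them against the 0.006% to 0.012% band noted above. The
names MC52_OVERRIDE_PCT, BAND_LOW, BAND_HIGH, in_band and times_above_band are hypothetical and
are not taken from the simulator.

```python
# Illustrative only: all names are hypothetical; values are copied from Table 5.2.
# Keys are (test group, number of threads); values are the percentage of
# cycles in which MISSCOUNT overrode ICOUNT under the mc5.2 parameter set.
MC52_OVERRIDE_PCT = {
    ("threads", 1): 0.020,
    ("threads", 2): 0.043,
    ("threads", 4): 0.108,
    ("system", 2): 0.042,
    ("system", 4): 0.036,
}

# Band (in percent of cycles) in which the earlier parameter sets did best.
BAND_LOW, BAND_HIGH = 0.006, 0.012


def in_band(rates, low, high):
    """Return the keys whose override rate lies strictly inside the band."""
    return [key for key, pct in rates.items() if low < pct < high]


def times_above_band(rates, high):
    """Return, for each rate above `high`, how many times larger it is."""
    return {key: pct / high for key, pct in rates.items() if pct > high}


if __name__ == "__main__":
    print("inside band:", in_band(MC52_OVERRIDE_PCT, BAND_LOW, BAND_HIGH))
    factors = times_above_band(MC52_OVERRIDE_PCT, BAND_HIGH)
    for (group, threads), factor in factors.items():
        print(f"{group:8s} {threads} thread(s): {factor:.1f}x the upper bound")
```

None of the mc5.2 entries falls inside the band; every one lies above it, by factors ranging
from roughly 1.7 to 9, which is why the improvement seen in some tests was unexpected.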
5.5 The Effectiveness History Table
The final mechanism to be implemented and tested was the effectiveness history table (EHT). Only
two of the three constants that control the operation of this mechanism were varied in these tests.
The DCMISS_LAT_COUNT_MAX constant was set to a value of 50 for all of the tests. Since this constant
controls the size of the array that keeps track of the number of cycles left for cache misses to
complete, it was felt that this value would provide more than enough room to keep track of
outstanding cache misses given the memory latencies involved and the number of memory instructions
a thread could issue in one cycle. The parameters that were varied were OVERWATCH_MEMORY_LENGTH and
OVERWATCH_THRESHOLD. As with the MISSCOUNT overwatch mechanism, values for these parameters
were selected to give the EHT a chance to override ICOUNT, but not too often. To achieve this,
the exact opposite approach from the one used with the MISSCOUNT overwatch mechanism was
taken. For the EHT, the threshold was kept at roughly two to three times the number of
cycles to remember. This was done because this mechanism adds the total of the remaining latency
of all the outstanding cache misses to the number of branches in the static portion of the pipeline
for each thread. This has the potential of producing much larger values than the MISSCOUNT
mechanism described above. Figure 5.8 shows the results of varying these parameters. As in the
figures for MISSCOUNT performance, the parameter sets are labeled in the form ehtm.t, where m is
the number of cycles that are kept track of and t is the threshold used to determine if the EHT
interferes with ICOUNT. Table 5.3 lists the percentage of cycles that the EHT overrode ICOUNT.

Parameter   Number of Threads         Number of System Threads
Values      1       2       4         2       4
eht5.15     0.050   0.606   0.427     0.260   0.267
eht7.25     0.010   0.031   0.032     0.070   0.046
Table 5.3: Percentage of Cycles the EHT was used
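To make the per-thread quantity described above more concrete, the following Python sketch shows
one way it could be computed. It is a rough illustration under stated assumptions, not the
simulator's implementation: the class EhtThreadState, its methods, the function parse_param_set
and the layout of the latency array (one slot per number of cycles remaining) are all
hypothetical. Only the constant name DCMISS_LAT_COUNT_MAX, its value of 50, the ehtm.t and mcm.t
label forms, and the sum of remaining miss latency plus static-pipeline branches come from the
text. How the remembered values are compared against the threshold is not modeled here.

```python
import re
from collections import deque

# Value used for all EHT tests in the text; everything else here is assumed.
DCMISS_LAT_COUNT_MAX = 50


def parse_param_set(label):
    """Split a label such as 'eht7.25' or 'mc5.3' into (mechanism, m, t)."""
    match = re.fullmatch(r"([a-z]+)(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized parameter set label: {label!r}")
    mechanism, memory, threshold = match.groups()
    return mechanism, int(memory), int(threshold)


class EhtThreadState:
    """Hypothetical per-thread bookkeeping for an EHT-style metric."""

    def __init__(self, memory_length):
        # Assumed layout: lat_count[i] is the number of outstanding
        # data-cache misses that still have i cycles left to complete.
        self.lat_count = [0] * DCMISS_LAT_COUNT_MAX
        # Metric values for the most recent `memory_length` cycles.
        self.history = deque(maxlen=memory_length)

    def record_miss(self, latency):
        # Clamp so that a very long miss still fits in the fixed-size array.
        self.lat_count[min(latency, DCMISS_LAT_COUNT_MAX - 1)] += 1

    def tick(self, static_branches):
        """Record this cycle's metric, then age every outstanding miss."""
        remaining = sum(cycles * misses
                        for cycles, misses in enumerate(self.lat_count))
        # Total remaining miss latency plus branches in the static
        # portion of the pipeline, as described in the text.
        metric = remaining + static_branches
        self.history.append(metric)
        # Shift every miss one slot closer to completion; slot 0 is dropped.
        self.lat_count = self.lat_count[1:] + [0]
        return metric


if __name__ == "__main__":
    _, memory, threshold = parse_param_set("eht7.25")
    state = EhtThreadState(memory)
    state.record_miss(10)
    state.record_miss(3)
    print(state.tick(static_branches=2), state.tick(static_branches=0), threshold)
```

Because the metric sums remaining latencies rather than counting events, a single long miss can
contribute a value of ten or more in one cycle, which is consistent with the thresholds for the
EHT being set well above those used for MISSCOUNT.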
Like the MISSCOUNT overwatch mechanism, the single-thread performance of the EHT differs
from the base performance, but not significantly. Again, the differences between the parameter
selections are hard to see on these graphs. Figure 5.9 shows the two- and four-thread performance
in greater detail. From this figure we can see that the eht7.25 parameter set provides the best
performance. Once again, better performance appears to be tied to keeping low the percentage of
cycles in which the mechanism interferes, this time between 0.010% and 0.070%.
[Figure 5.8: two panels, a) eht5.15, b) eht7.25, each plotting the my-test, matsolve, newton,
fp-test and system tests against Number of Threads.]
Figure 5.8: Results for the EHT

[Figure 5.9: two panels, a) eht5.15, b) eht7.25, each plotting the my-test, matsolve, newton,
fp-test and system tests for two and four threads against Number of Threads.]
Figure 5.9: EHT performance for two and four threads

As with the MISSCOUNT mechanism, this behavior suggested a new set of parameters to
try. In this case, the EHT was set to remember information for seven cycles and to override
ICOUNT at a threshold of 20 (eht7.20). Figure 5.10 shows the performance of this set of
parameters. With a few exceptions, the performance of this parameter set is worse than the
performance of eht7.25, although mostly not significantly so. Table 5.4 shows how often the EHT
overruled ICOUNT. The performance of this set of parameters is in keeping with how often the EHT
overrode ICOUNT under the other sets of parameters: eht7.20 intervened more often than eht7.25
in every configuration, and its performance was generally slightly lower.

Test      Threads   Percentage of Cycles
Threads   1         0.041
          2         0.448
          4         0.236
System    2         0.235
          4         0.133
Table 5.4: Usage statistics for eht7.20

[Figure 5.10: two panels, a) All tests, b) Two and four threads, each plotting the my-test,
matsolve, newton, fp-test and system tests against Number of Threads.]
Figure 5.10: Performance of eht7.20
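The effect of lowering the threshold from 25 to 20 on how often the EHT intervenes can be
quantified directly from Tables 5.3 and 5.4. The Python sketch below is illustrative only. The
names EHT_7_25, EHT_7_20 and override_growth are hypothetical, and the column assignment for
Table 5.3 (thread tests with 1, 2 and 4 threads, then system tests with 2 and 4 threads) is
assumed from the table header.

```python
# Illustrative only: names are hypothetical; values are copied from
# Table 5.3 (row eht7.25) and Table 5.4 (eht7.20).
# Keys are (test group, number of threads); values are the percentage of
# cycles in which the EHT overrode ICOUNT.
EHT_7_25 = {
    ("threads", 1): 0.010,
    ("threads", 2): 0.031,
    ("threads", 4): 0.032,
    ("system", 2): 0.070,
    ("system", 4): 0.046,
}
EHT_7_20 = {
    ("threads", 1): 0.041,
    ("threads", 2): 0.448,
    ("threads", 4): 0.236,
    ("system", 2): 0.235,
    ("system", 4): 0.133,
}


def override_growth(lower_threshold, higher_threshold):
    """Ratio of override rates per configuration (lower / higher threshold)."""
    return {key: lower_threshold[key] / higher_threshold[key]
            for key in higher_threshold}


if __name__ == "__main__":
    for (group, threads), ratio in override_growth(EHT_7_20, EHT_7_25).items():
        print(f"{group:8s} {threads} thread(s): {ratio:.1f}x more overrides")
```

With these figures, the lower threshold raises the override rate in every configuration, by
factors between roughly 3 and 14.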
5.6 Summary
As can be seen from these results, the EHT produces the largest increase in performance. With a
memory length of 7 cycles and a threshold of 25, it produces an IPC of 2.0995 for two threads in the
system test and an IPC of 2.1662 for four threads in the system test. The overwatch mechanism
overrode the primary mechanism 0.070% of the time for two threads and 0.046% of the time for
four threads when using the same parameters.
Chapter 6
Conclusions and Future Work
Over the years, the von Neumann model of computing has undergone many enhancements. These
changes include an improved memory hierarchy, multiple instruction issue and branch prediction.
Since the model's introduction, the performance of processors has increased at a much greater rate
than that of memory. A very promising proposed modification which attempts to hide this gap is the
Simultaneous Multithreaded (SMT) processor. With the introduction of multiple, simultaneously
active threads in a single processor, several new aspects of processor operation can have a sizeable
effect on performance. One such aspect is how to choose which thread to fetch instructions from
during the next cycle.
The goal of this project was to modify an existing model of an SMT processor to determine
the performance impact of several thread selection mechanisms that are more complex than
most of the mechanisms used in current research. Of the three mechanisms explored, the proposed
effectiveness history table (EHT) produced the greatest increase in overall system performance.
It produced an IPC of 2.0995 in a two-threaded system test while overriding the primary
scheduling mechanism, ICOUNT, only 0.070% of the time.
6.1 Future Work
The concept of an SMT processor is very new to computer architecture. It has not been studied
nearly as much as multiple issue and out-of-order architectures, or even as much as traditional
multithreading architectures. The amount of research that explores effective methods to select
the best way to partition an SMT processor's fetch resources between the threads is limited as well.
Because of the memory limitations of the simulator used in this project, this work should be
considered only on a "proof of concept" basis, and not as an indication of the actual potential of
the mechanisms proposed here. With this in mind, some projects to expand on this work might include:
- The use of these mechanisms with a simulator able to run industry standard benchmarks,
  such as Tullsen's SMTSIM [34] or the simulator used by Raasch et al. [27].
- A more in-depth exploration of the effects of different parameters on the performance of the
  EHT. In particular, the effect of weighting the individual metrics tracked by the EHT and
  the addition of more metrics to those being tracked.
- The effects of instruction content on the effectiveness of the mechanisms. A natural extension
  of this would be to modify the EHT to take advantage of this information by using different
  metrics in different situations, based on the mix of instructions being executed by a particular
  thread or all of the threads.
Bibliography
[1] A revolutionary approach to programming: The Tera MTA.
http://www.tera.com/www/mta.html, 1996. Updated 1999.
[2] G. Alverson, R. Alverson, D. Callahan, B. Koblenz, A. Porterfield, and B. Smith. Exploiting
heterogeneous parallelism on a multithreaded multiprocessor. In Proceedings of the International
Conference on Supercomputing, pages 188-197, Washington, DC, Jul 1992.
[3] S. H. Bokhari and D. J. Mavriplis. The Tera Multithreaded Architecture and unstructured
meshes. Technical report, Institute for Computer Applications in Science and Engineering,
NASA Langley Research Center, Hampton, VA, Dec 1998.
[4] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical Report 1342,
University of Wisconsin-Madison Computer Sciences Department, Jun 1997.
[5] J. Burns and J. Gaudiot. Quantifying the SMT layout overhead - Does SMT pull its weight?
In Proceedings of the Second International Symposium on High-Performance Computer
Architecture, Toulouse, France, Jan 1998.
[6] R. S. Chappell, J. Stark, S. P. Kim, S. K. Reinhardt, and Y. N. Patt. Simultaneous subordinate
microthreading (SSMT). In Proceedings of the 26th International Symposium on Computer
Architecture, Philadelphia, PA, May 1998.
[7] Y. Chou, D. P. Siewiorek, and J. P. Shen. A realistic study on multithreaded superscalar
processor design. In Proceedings of the 1997 Euro-par, Passau, Germany, Aug 1997.
[8] K. Diefendorff. POWER4 focuses on memory bandwidth. Microprocessor Report, 13(13), Oct
1999.
[9] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen.
Simultaneous multithreading: A foundation for next-generation processors. IEEE Micro, pages 12-18,
Sep/Oct 1997.
[10] R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of
multithreaded uniprocessors for commercial application environments. In Proceedings of the
23rd Annual International Symposium on Computer Architecture, pages 203-212, Philadelphia,
PA, May 1996.
[11] M. Gulati. Multithreading on a superscalar microprocessor. Master's thesis, University of
California, Irvine, 1995.
[12] M. Gulati and N. Bagherzadeh. Performance study of a multithreaded superscalar
microprocessor. In Proceedings of the Second International Symposium on High-Performance Computer
Architecture, pages 291-301, Feb 1996.
[13] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W. Dietrich Weber. Comparative
evaluation of latency reducing and tolerating techniques. In Proceedings of the 18th
International Symposium on Computer Architecture, pages 254-263, Philadelphia, PA, May 1991.
[14] L. Hammond, B. A. Nayfeh, and K. Olukotun. A single-chip multiprocessor. Computer,
30(9):79-85, Sep 1997.
[15] J. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan
Kaufmann Publishers, Inc., Palo Alto, CA, second edition, 1996.
[16] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa.
An elementary processor architecture with simultaneous instruction issuing from multiple
threads. In Proceedings of the 19th International Symposium on Computer Architecture, pages
136-145, Philadelphia, PA, May 1992.
[17] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A multithreading technique targeting
multiprocessors and workstations. In Proceedings of the 6th International Conference on
Architectural Support for Programming Languages and Operating Systems, pages 308-318, 1994.
[18] M. H. Lipasti and J. P. Shen. Superspeculative microarchitecture for beyond AD 2000.
Computer, 30(9):59-66, Sep 1997.
[19] J. L. Lo, J. S. Emer, H. M. Levy, R. L. Stamm, D. M. Tullsen, and S. J. Eggers. Converting
thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM
Transactions on Computer Systems, 15(3):322-354, Aug 1997.
[20] M. Loikkanen and N. Bagherzadeh. A fine-grain multithreading superscalar architecture. In
Proceedings of the Conference on Parallel Architectures and Compilation Techniques, Boston,
MA, Oct 1996. IEEE Press.
[21] 5/21/99 update, http://www.tera.com/www/threads/update3.html, 1999.
[22] J. C. Mogul and A. Borg. The effect of context switches on cache. In Proceedings of the 4th
International Conference on Architectural Support for Programming Languages and Operating
Systems, pages 75-84, 1991.
[23] S. W. Moore. Multithreaded Processor Design. Kluwer Academic Publishers, Boston, MA,
1996.
[24] B. A. Nayfeh, L. Hammond, and K. Olukotun. Evaluation of design alternatives for a
multiprocessor microprocessor. In Proceedings of the 23rd Annual International Symposium on
Computer Architecture, pages 67-77, Philadelphia, PA, May 1996.
[25] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The case for a single-chip
multiprocessor. In Proceedings of the 7th International Conference on Architectural Support
for Programming Languages and Operating Systems, pages 2-11, 1996.
[26] Y. N. Patt, S. J. Patel, M. Evers, D. H. Friendly, and J. Stark. One billion transistors, one
uniprocessor, one chip. Computer, 30(9):51-57, Sep 1997.
[27] S. E. Raasch and S. K. Reinhardt. Applications of thread prioritization in SMT processors.
In Proceedings of the 1999 Workshop on Multithreaded Execution and Compilation, Jan 1999.
[28] S. K. Reinhardt and S. S. Mukherjee. Transient fault detection via simultaneous
multithreading. In Proceedings of the 27th International Symposium on Computer Architecture, Jun 2000.
[29] G. S. Sohi. Instruction issue logic for high-performance, interruptible, multiple functional unit,
pipelined computers. IEEE Transactions on Computers, 39(3):349-359, Mar 1990.
[30] S. N. Storino, A. Aipperspach, J. M. Borkenhagen, R. J. Eickemeyer, S. R. Kunkel,
S. Levenstein, and G. Uhlmann. A commercial multi-threaded RISC processor. In Proceedings of
the IEEE International Solid-State Circuits Conference, pages 234-235, Philadelphia, PA, Feb
1998.
[31] R. Thekkath and S. J. Eggers. The effectiveness of multiple hardware contexts. In Proceedings
of the 6th International Conference on Architectural Support for Programming Languages and
Operating Systems, pages 328-337, 1994.
[32] M. Torrant. Investigation of simultaneous multithreading as a method to enhance multiprocess
context switching. Master's thesis, Rochester Institute of Technology, 1999.
[33] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting
choice: Instruction fetch and issue on an implementable simultaneous multithreading
processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture,
Philadelphia, PA, May 1996.
[34] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing
on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer
Architecture, Jun 1995.
[35] R. Uhlig, D. Nagle, T. Mudge, S. Sechrest, and J. Emer. Instruction fetching: Coping with code
bloat. In Proceedings of the 22nd Annual International Symposium on Computer Architecture,
Jun 1995.
[36] S. Wallace and N. Bagherzadeh. Performance issues of a superscalar microprocessor. In
Proceedings of the International Conference on Parallel Processing, volume 1, pages 293-297,
Aug 1994.
[37] W. Weber and A. Gupta. Exploring the benefits of multiple hardware contexts in a
multiprocessor architecture: Preliminary results. In Proceedings of the 16th International Symposium
on Computer Architecture, pages 273-280, Philadelphia, PA, May 1992.
