Methods for minimizing performance degradation caused by branch delays by St. Onge, Debbie
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
7-1-1995 
Methods for minimizing performance degradation caused by 
branch delays 
Debbie St. Onge 
Recommended Citation 
St. Onge, Debbie, "Methods for minimizing performance degradation caused by branch delays" (1995). 
Thesis. Rochester Institute of Technology. Accessed from 
Methods for Minimizing Performance Degradation
Caused by Branch Delays
by
Debbie St. Onge
Under the direction of Ji-en Morris Chang
A Thesis Submitted to the Graduate Faculty of
Rochester Institute of Technology
in partial fulfillment of the requirements for
the Degree of Masters of Science




Dr. Ji-en Morris Chang (Thesis Advisor)
Prof.
Dr. Tony H. Chang
Prof.
Dr. Pratapa V.C.V. Reddy
Prof.
Dr. R. Unnikrishnan (Department Head)




98 Dutchess Hill Road
Poughkeepsie NY 12601
(914) 473-5308 (home)
(914) 435-7908 (work)
yamaha@pkedvm9.vnet.ibm.com
I hereby grant permission for copies to be made of all/parts of my thesis provided
all use of my work is properly acknowledged.
Debbie St. Onge (July 1995)
Abstract
The presence of branch instructions in an instruction stream may adversely
affect the performance of a processor by introducing significant delays in the
execution process. As processors become more pipelined, the impact these
delays have upon performance increases. This thesis investigates why delays
occur when branch instructions are encountered. It also summarizes various
hardware methodologies which can alleviate the performance degradation due
to these delays. Simulation results show that these hardware methodologies
can improve branch performance by up to 45 percent.
Some branches are inherently necessary in order to implement programming
decisions. However, the use of branches within programs can inadvertently
cause significant performance degradation. This thesis analyzes several
methods to implement a programming decision and the performance of each
method, thus providing insight into programming guidelines which can be
followed to improve branch performance. Measurements of these software
techniques show performance improvements of up to 178 percent.
Table Of Contents
Chapter 1. Introduction 1
1.1 Processor Performance Overview 2
1.1.1 Sample Pipeline Stages 3
1.1.2 Performance Measurements 7
1.1.3 Various Processor Implementations 10
1.1.3.1 Serial Processors 10
1.1.3.2 Parallel Processors 11
1.1.4 Pipeline Disruptions 20
1.2 Impact of Branches on the Pipeline 23
1.2.1 Serial Processor 23
1.2.2 Parallel Processors 26
1.2.2.1 Parallel Processor A 27
1.2.2.2 Parallel Processor B 31
1.2.3 Two Execution Units 37
1.3 Conclusions 40
1.4 Thesis Organization 40
Chapter 2. Branch Prediction Methods 42
2.1 Opcode-Based Branch Prediction 43
2.2 Branching Statistics From Typical S/390 Operating Environments ... 48
2.2.1 Workload Definitions 49
2.2.2 Taken Characteristics of Branches 51
2.2.3 Distances Between Taken Branches and Their Targets 53




2.3.3.1 Branch Predicted Taken 60
2.3.3.2 Branches Predicted Not Taken 61
2.3.3.3 Other Implementation Considerations 61
2.3.4 Cases Which Negate the Benefit of a DHT 63
2.4 BHT (Branch History Table) 66
2.4.1 Introduction 66
2.4.2 Hashing 68
2.4.3 Replacement Algorithms 71
2.4.4 BHT Operation 72
2.4.5 Recoverability Aspects 74
2.4.6 Implementation 76
2.4.6.1 Synchronous versus Asynchronous Implementations 76
2.4.6.2 Reducing the Impact of Instruction Data Fetches 79
2.4.6.3 Updating the BHT ... 80
2.4.6.4 Moving Targets 81
2.4.7 Cases Which Negate the Benefit of the BHT 83
2.5 Active Streams 83
2.6 Condition Codes 86
2.7 Instruction Buffers 89
2.7.1 Flushing the Instruction Buffers 90
2.7.2 Branching in the Instruction Buffers 94
2.8 Conclusions 97
Chapter 3. Performance Gains Due to Branch Prediction 98
3.1 Predicting Branches Taken/Not Taken 99
3.1.1 Parallel Processor A 99
3.1.2 Parallel Processor B 103
3.2 BHT Branch Prediction 107
3.2.1 Alleviating Register Interlocks 107
3.2.1.1 Parallel Processor A 108
3.2.1.2 Parallel Processor B 110
3.2.2 Alleviating Delays Due to Instruction Data Fetches 112
3.3 Branch Guess Wrong Penalty 115
3.3.1 Moving Targets 118
3.3.2 Address Generation Interlocks 124
3.3.3 Conclusions 129
Chapter 4. Discussion of Programming Methods 130
4.1 Introduction 131
4.2 Reducing The Decision Making Within a Program 133
4.2.1 C++ Examples 133
4.3 Cost of Incorrect Branch Prediction 144
4.3.1 Assembler Examples 154
4.4 Subroutine Branches 161
4.4.1 Take Decisions Outside of Subroutines if Possible 174
4.5 Branch Target Offsets 176
4.6 Conclusions 186
Chapter 5. Summary 187
5.1 Conclusions 187
5.2 Related Concepts Not Explored Within This Thesis 189
5.2.1 Finite Benefits of Branch Prediction 189
5.2.2 Optimal Decision Statements 190
5.3 Future Work 191
Appendix A. S/390 Instruction Formats 192
Appendix B. S/390 Branch Instructions 196
Appendix C. Multiple Decodes per Cycle 197
C.1 Definition of Steps in Variable Length Instruction Decode 198
C.2 Definition of Steps in Fixed-Length Instruction Decode 199
Appendix D. References 202
Appendix E. Definition of Terms 204
Appendix F. Acknowledgements 214
Appendix G. Biography 215
Index 216
List of Figures
1. An Example of a Programming Decision (FORTRAN) 1
2. Timing Diagram, Instruction Data in the Instruction Buffer 3
3. Timing Diagram, Instruction Data not in the Instruction Buffer 3
4. Performance Measurement Calculations 8
5. Two instructions processed by a serial processor 11
6. Two instructions processed, one operand access at a time 12
7. Two instructions processed, two operand accesses at a time 13
8. Six instructions processed, one operand access at a time 16
9. Six instructions processed, two operand accesses at a time 17
10. Minimum CPI formula for Parallel Processor A 18
11. Minimum CPI formulae for Parallel Processor B 19
12. Register Interlock 21
13. Condition Code Interlock 23
14. Three instructions processed by a serial processor 24
15. Three instructions processed by a serial processor, target not in
instruction buffers 25
16. Six instructions processed, one operand access at a time, no interlocks 27
17. Six instructions processed, one operand access at a time, no interlocks,
target not in instruction buffers 28
18. Six instructions processed, one operand access at a time, condition
code interlock 29
19. Six instructions processed, one operand access at a time, register
interlock 30
20. Six instructions processed, two operand accesses at a time, no
interlocks 32
21. Six instructions processed, two operand accesses at a time, no
interlocks, target not in instruction buffers 33
22. Six instructions processed, two operand accesses at a time, condition
code interlock 34
23. Six instructions processed, two operand accesses at a time, register
interlock 35
24. Six instructions processed, one operand access at a time, two
execution units 38
25. Six instructions processed, two operand accesses at a time, two
execution units 39
26. Unconditional Branches 44
27. Program Loop 47
28. DHT Layout 57
29. S/390 Instruction Address Bit Layout 59
30. BHT Layout 67
31. Four branch addresses - hashing using instruction address bits 24-27
70
32. Active streams when a branch is predicted not taken 84
33. Active streams when a branch is predicted taken 84
34. Traditional handling of a BCT instruction 88
35. Parallel handling of a BCT instruction 88
36. Example CXX1, two FOR loops - assembler code 91
37. Flushing the instruction buffers 93
38. Loop residing in the instruction buffers 95
39. Six instructions processed, one operand access at a time, no register
interlock, branch predicted correctly 100
40. Six instructions processed, one operand access at a time, no register
interlock, branch predicted correctly, target not in instruction buffers 101
41. Six instructions processed, one operand access at a time, register
interlock, branch predicted correctly 102
42. Six instructions processed, two operand accesses at a time, no register
interlock, branch predicted correctly 104
43. Six instructions processed, two operand accesses at a time, no register
interlock, branch predicted correctly, target not in instruction buffers 105
44. Six instructions processed, two operand accesses at a time, register
interlock, branch predicted correctly 106
45. Six instructions processed, one operand access at a time, branch
predicted correctly, synchronous BHT
109
46. Six instructions processed, one operand access at a time, branch
predicted correctly, asynchronous BHT 110
47. Six instructions processed, two operand accesses at a time, branch
predicted correctly, synchronous BHT 111
48. Six instructions processed, two operand accesses at a time, branch
predicted correctly, asynchronous BHT 112
49. Six instructions processed, one operand access at a time, branch
predicted correctly 113
50. Six instructions processed, two operand accesses at a time, branch
predicted correctly 114
51. Branch prediction benefit equation 116
52. Example MT1, AGI on BCTR target due to AR 119
53. Example MT2, no AGI on BCTR target 120
54. Example AGI1, AGI on BCTR target due to AR 125
55. Example AGI2, no AGI on BCTR target 126
56. Example CXX1, two FOR loops 134
57. Example CXX1, two FOR loops, assembler code 135
58. Example CXX2, one FOR loop 136
59. Example CXX2, one FOR loop, assembler code 137
60. Example CXX3, one FOR loop, i always odd 144
61. Example CXX3, one FOR loop, assembler code 145
62. Example CXX4, one FOR loop, i always even 146
63. Example CXX4, one FOR loop, assembler code 147
64. Example ASM1, branch at 000036 alternates between taken and not
taken 155
65. Example ASM2, branch at 000036 always taken 156
66. Example ASM3, branch at 000036 always not taken 157
67. Example CXX5, one FOR loop, inline function 163
68. Example CXX5, one FOR loop, assembler code 164
69. Example CXX6, one FOR loop, subroutine call 165
70. Example CXX6, one FOR loop, assembler code 166
71. Decision within the subroutine 175
72. Decision outside the subroutine 175
73. Example ASM4, DW aligned CSECT 178
74. Example ASM5, DW + 1 HW aligned CSECT 180
75. Example ASM6, DW aligned CSECT, data shifted by one word ... 182
76. Example ASM7, instructions shifted by one word 184
77. Example ASM8, instructions shifted by three half words 185
78. Multiple Part Decision Statement 190
79. Variable instruction length (S/390) 200
80. Fixed instruction length 200
List of Tables
1. Workload Descriptions 49
2. Percent of Branches Taken in Different Environments 52
3. Distance Between Taken Conditional Branches and Their Targets ... 54
4. Distance Between Taken Unconditional Branches and Their Targets 55
5. Performance Comparison between MT1 and MT2 122
6. Performance Comparison between AGI1 and AGI2 128
7. Number of Instructions Executed, CXX1 and CXX2 139
8. Number of Branches Executed, CXX1 and CXX2 141
9. Performance, CXX1 and CXX2 142
10. Number of Instructions Executed, CXX2, CXX3, and CXX4 148
11. Number of Branches Executed, CXX2, CXX3, and CXX4 150
12. Performance, CXX2, CXX3, and CXX4 152
13. Total Number of Instructions Executed, ASM1, ASM2, and ASM3 158
14. Performance, ASM1, ASM2, and ASM3 160
15. Number of Instructions Executed, CXX2, CXX5, and CXX6 167
16. Number of Branches Executed, CXX2, CXX5, and CXX6 170
17. Performance, CXX2, CXX5, and CXX6 172
18. Percent increase when compared to CXX2 173
19. E Format 192
20. RR Format 192
21. RRE Format 192
22. RX Format 193
23. RS Format 193
24. S Format 193
25. SSE Format 194
26. SS Format - Single Length Version 194
27. SS Format - Two Length Version 194
28. SS Format - Register Range Version 195
29. S/390 Branch Instructions 196
Chapter 1. Introduction
The instruction set that a processor implements consists of a variety of
instructions. One frequently occurring set of instructions is the branches.
Branches are the instructions which implement the decision-making within a
program. Decision-making is a key part of information processing, enabling
program execution to vary dependent upon the current operating environment.
Figure 1 depicts a programming decision. To implement this decision first I
must be compared to 100. Based upon the result of the compare, the branch
either directs instruction processing to the code which increments I, or to the
code immediately following the IF statement. The decision making process is
composed of two types of instructions: the condition-code setting instruction
which performs a test, and the branch instruction which uses the result of the
test to direct instruction processing.
IF (I < 100) THEN I=I+1;
Figure 1. An Example of a Programming Decision (FORTRAN)
When implementing the decision in Figure 1, the branch instruction must
wait until the compare is done and the condition code updated. Before
processing the instruction following the branch, the outcome of the branch must
be known. These delays before and after a branch could cause significant
performance degradation in pipelined processors.
1.1, "Processor Performance Overview" gives an overview of a variety of
processor configurations. These processors are useful for illustrating the
principles discussed within this thesis. 1.2, "Impact of Branches on the
Pipeline" on page 23 illustrates why branch instructions degrade performance.
1.1 Processor Performance Overview
This section introduces the processor characteristics that are the basis for the
branch performance discussions throughout this thesis. The processing of a
single instruction is accomplished through a series of steps. The steps which
can be done during a single clock cycle are referred to as belonging to a stage.
The series of stages necessary to process an instruction is called the
processor's pipeline. A sample pipeline, consisting of seven stages, is used for
the processor performance discussions within this paper.
1.1.1 Sample Pipeline Stages
Figure 2 is a timing diagram showing the flow of an instruction through the
sample pipeline. Figure 3 shows the elongation when the instruction data is
not in the instruction buffers. Following these timing diagrams are the
descriptions of the stages which make up the sample pipeline. The letters in
parentheses are used to identify each stage in the timing diagrams.
N I D O F F F F F E C
<---- 11 cycles ---->
Figure 2. Timing Diagram, Instruction Data in the Instruction Buffer
Figure 3. Timing Diagram, Instruction Data not in the Instruction Buffer
(N) Compute instruction address
Calculating the address of the next sequential instruction consists of adding
the length of the current instruction to the address of the current
instruction. Most processors overlap the calculation of the next sequential
instruction address with other operations to reduce the impact on
performance.
(I) Fetch instruction data
This is the process of moving the next few bytes of instruction data to the
decode unit. If the data is in the instruction buffers, as Figure 2 on
page 3 illustrates, then this takes one cycle. Otherwise, a fetch request to
the cache is necessary and the instruction fetch takes six cycles, as depicted
in Figure 3 on page 3.
(D) Decode the instruction
The instruction is deciphered to determine the next operation to perform.
The register(s) or other facilities which the instruction uses are identified.
Appendix C, "Multiple Decodes per Cycle" on page 197 has more
information on this stage of the pipeline.
(O) Compute the address of the operand(s)
Some instructions access one or more data locations. This stage calculates
the address(es) of these data location(s). If registers are used to compute
the data address then the instruction must wait until all preceding
instructions have finished updating these register(s). The timing diagrams
within this paper use instructions with only one operand in order to
simplify the examples.
This is the stage which computes the branch target address. The branch
target is often referred to as an operand of the branch.
(F) Fetch the operand(s)
This stage issues the request(s) for the various pieces of data the instruction
needs. Some instructions may also issue stores. The timing diagrams
within this paper use instructions which issue only one fetch. For
simplicity we assume that the data is always found in the processor's cache;
thus, each fetch takes five cycles to access the data.
(E) Execute the instruction
The execution unit within the processor performs a specific task for the
instruction. For instance, the execution stage performs the addition for an
add instruction. Some instructions may require more than one cycle to
finish the execution stage. For simplicity, the timing diagrams within this
paper only use instructions with one cycle executions.
(C) Completion of the instruction.
This is often referred to as the point-of-no-return. Any register updates are
done, any updates to storage locations have been finished, any condition
codes or other facilities have been updated, and the instruction has met all
its architected requirements. The completion unit gathers the results from the
other units (e.g. fetch, store, execution, etc.) and when all work has been
finished it marks the instruction complete.
The preceding pipeline is referenced for the examples used to illustrate the
concepts discussed within this paper. Other processors' pipelines vary
depending upon the instruction set and how the work is divided into stages.
For example, some pipelines have fewer than seven stages: the (N) and (I)
stages may be done together in a single cycle, and the (O) and (F)
stages may also be combined into a single stage.
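The stage latencies described above can be summarized in a short sketch. The
function name and the dictionary layout are illustrative assumptions, not part
of the thesis; the latencies themselves are the ones stated in the stage
descriptions. The 16 cycle result for an instruction whose data is not in the
instruction buffers is derived from those latencies rather than quoted from
the text.

```python
# Order of the stages in the sample pipeline.
STAGES = ["N", "I", "D", "O", "F", "E", "C"]


def instruction_cycles(in_instruction_buffer=True):
    """Cycles for one instruction to flow through the sample pipeline."""
    latency = {
        "N": 1,
        # One cycle if the data is in the instruction buffers,
        # six cycles if it must be fetched from the cache.
        "I": 1 if in_instruction_buffer else 6,
        "D": 1,
        "O": 1,
        "F": 5,  # operand fetch, data assumed to be in the cache
        "E": 1,  # single-cycle execution assumed
        "C": 1,
    }
    return sum(latency[stage] for stage in STAGES)


print(instruction_cycles(True))   # 11
print(instruction_cycles(False))  # 16
```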
Several assumptions associated with the sample processors are listed below.
Instruction and operand data can be found within the processor's cache.
The cache has an access time of five cycles.
Several buffers, called instruction buffers, are kept close to the instruction
decoding unit. If the instruction data is found in these buffers the
instruction fetch takes one cycle. If the instruction data is not found within
these buffers, it needs to be fetched from cache and takes six cycles.
Each instruction buffer is a double word¹ in length.
Each execution stage takes only one cycle.
¹ Eight bytes.
1.1.2 Performance Measurements
When designing a processor, understanding the time to complete each piece of
work is necessary in order to define the pipeline stages. The longest path
(critical path) is the one which limits the cycle time of the processor. There
are two approaches to improve performance. These approaches are not
mutually exclusive.
One approach is to keep the work (pathlength) done each cycle to a
minimum. This results in more stages but allows the design to be run at a
lower cycle time, i.e. less time is required to process the longest stage. The
other approach is to do as much as possible in a single cycle, thus reducing
the number of stages within the pipeline but raising the minimum cycle time
at which the processor can operate.
TI = CPI * CT    or    MIPS = 1 / (CPI * CT) = 1 / TI
where CT is in seconds
Figure 4. Performance Measurement Calculations
The equations used in this discussion are shown in Figure 4. Time per
instruction (TI) is equal to cycles per instruction (CPI) multiplied by cycle
time (CT). Performance is commonly represented by the MIPS (millions of
instructions per second) rating of the processor. MIPS is a misleading measure
when comparing processors which implement different instruction sets. S/390²
has numerous instructions which perform the equivalent work of a series of
less complex instructions. These complex instructions typically have an
execution stage which requires multiple cycles.
² IBM System/390: IBM's mainframe architecture, an evolution from the S/360 and
S/370 architectures.
An MVC (move character) instruction is an example of a common, complex,
S/390 instruction. This instruction moves data of up to 256 bytes in length
from one storage location to another. Other instruction sets may require a
series of register loads and stores to perform the same function.
Thus, it is misleading to use MIPS to compare processors which support
different instruction sets. Comparisons of processors which implement the same
instruction set can be made using MIPS or CPI*CT. This thesis has each of
the sample processors operating at the same cycle time, so the comparisons
are made simply using CPI.
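As a worked illustration of the performance measurement calculations, the
sketch below evaluates time per instruction and MIPS for a given CPI and cycle
time. The function names and the 10 nanosecond cycle time are hypothetical
values chosen for the example. The cycle time is passed in seconds, so the
reciprocal gives instructions per second and is divided by one million to
express it in MIPS; the division is unnecessary only when the cycle time is
expressed in microseconds.

```python
def time_per_instruction(cpi, cycle_time_s):
    """TI = CPI * CT, with CT in seconds."""
    return cpi * cycle_time_s


def mips(cpi, cycle_time_s):
    """Millions of instructions per second for a given CPI and cycle time."""
    instructions_per_second = 1.0 / time_per_instruction(cpi, cycle_time_s)
    return instructions_per_second / 1e6


# Hypothetical processor: CPI of 6 at a 10 ns cycle time.
ti = time_per_instruction(6, 10e-9)
print(round(ti * 1e9, 3))           # 60.0 (nanoseconds per instruction)
print(round(mips(6, 10e-9), 2))     # 16.67

# At equal cycle times the ratio of MIPS ratings depends only on CPI.
print(round(mips(6, 10e-9) / mips(8, 10e-9), 3))  # 1.333
```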
1.1.3 Various Processor Implementations
1.1.3.1 Serial Processors
Figure 5 on page 11 depicts the flow of two instructions through the pipeline
of a serial processor. All instruction data is found in the instruction buffers
so the (I) stage takes only one cycle. Each stage in a pipeline has specialized
hardware to perform a unique function. In a serial processor each piece of
hardware is only utilized for a small portion of the time. In the example in
Figure 5 on page 11, the bus to access data is utilized 45.5³ percent of the
time, while all the other stages are only utilized 9.1⁴ percent of the time.
³ Utilized for 10 out of the 22 cycles.
⁴ Utilized for 2 out of the 22 cycles.
N I D O F F F F F E C
                      N I D O F F F F F E C
<--------------- 22 cycles --------------->
Figure 5. Two instructions processed by a serial processor
This example illustrates that the hardware is not highly utilized in a serial
processor. Restructuring the processor to better utilize the hardware may
improve the price/performance characteristics. This approach attempts to
improve the performance of a processor without increasing the cost.
1.1.3.2 Parallel Processors
A parallel processor overlaps pipeline stages such that more than one
instruction is processed during a cycle. This allows each hardware stage to be
better utilized. Ideally, the parallel approach keeps the price very close to
that of a serial processor by minimizing the hardware added to allow stages to
be overlapped. The parallel implementation dramatically increases instruction
throughput, as is shown by the examples within this section.
There are many degrees to which processor pipelines can be parallelized. The
following examples illustrate two parallel processor variations. Figure 6 on
page 12 has only one bus to the cache, allowing only one fetch to be
processed at a time. The + denotes cycles in which a piece of work is waiting
for a resource. In this example the bus to cache, handling the operand access,
causes a delay to occur in the pipeline. In further discussions this processor
configuration will be referred to as Parallel Processor A.
N I D O F F F F F E C
  N I D O + + + + F F F F F E C
<--------- 16 cycles --------->
Figure 6. Two instructions processed, one operand access at a time
Knowing that this bus is the bottleneck, the system designer is able to
improve performance by widening the bus. The timing diagram in Figure 7
on page 13 illustrates what happens when two fetches can be processed
concurrently. Adding the capability of simultaneously handling two operand
accesses is what accounts for the performance difference between these two
parallel processors. The processor configuration which can handle two fetches
simultaneously will be referred to as Parallel Processor B.
N I D O F F F F F E C
  N I D O F F F F F E C
<----- 12 cycles ----->
Figure 7. Two instructions processed, two operand accesses at a time
The bus in Figure 6 on page 12 is utilized 62.5⁵ percent, while the other stages
of the processor are utilized 12.5⁶ percent. In Figure 7 the utilization of each
bus drops to 41.7⁷ percent, while the utilization of each of the other stages
rises to 16.7⁸ percent.
⁵ Utilized for 10 out of the 16 cycles.
⁶ Utilized for 2 out of the 16 cycles.
⁷ Utilized for 5 out of the 12 cycles.
⁸ Utilized for 2 out of the 12 cycles.
It is important to note that a resource can be a bottleneck without being fully
utilized. The bus, in Figure 6 on page 12, was only utilized 62.5 percent, but
it was a bottleneck. Performance models not only measure the utilization of
resources, but also the time spent waiting for a resource. In Figure 6 on
page 12, 25 percent of the time was spent waiting on the bus. Understanding
both the resource utilizations and stages incurring delays gives the processor
designer the insight necessary to make the appropriate design decisions.
The processors used throughout this paper are implementing the same
instruction set and are operating at the same cycle time. Thus, their
performance can be compared using cycles per instruction. The serial processor
in Figure 5 on page 11 has a CPI of 11⁹. Parallel Processor A in Figure 6 on
page 12 has a CPI of 8; Parallel Processor B in Figure 7 on page 13 has a
CPI of 6. In this situation the wider bus is worth a performance gain of 2
CPI.
⁹ Requires 22 cycles to process 2 instructions.
The best performance the serial processor can achieve is 11 CPI. However,
the parallel processors have more potential. After six instructions Parallel
Processor A is able to reach a CPI of 6, as shown in Figure 8 on page 16.
Figure 9 on page 17 illustrates Parallel Processor B after it has processed six
instructions. This processor is able to obtain a CPI of 3.7. In this situation,
the capability to do two operand fetches is worth 2.3 CPI.
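The cycle counts quoted for the two parallel processors can be reproduced with
a small model. This is a minimal sketch under stated assumptions, not the
performance model used for the thesis: the function names are hypothetical, a
new instruction is assumed to enter the pipeline every cycle, operand fetches
are assumed to be granted in program order to whichever bus frees first, and
the only contended resource is the bus to the cache.

```python
def simulate(num_instructions, max_fetches, fetch_cycles=5):
    """Total cycles for the sample parallel pipeline (illustrative model).

    Instruction k (0-based) enters stage N in cycle k + 1 and spends one
    cycle each in N, I, D and O. It then waits for a free bus, holds that
    bus for fetch_cycles, and finishes with one cycle each of E and C.
    """
    bus_free = [1] * max_fetches      # first cycle in which each bus is free
    total = 0
    for k in range(num_instructions):
        ready = k + 5                 # first cycle after the O stage
        bus = min(range(max_fetches), key=lambda i: bus_free[i])
        start = max(ready, bus_free[bus])
        bus_free[bus] = start + fetch_cycles
        total = start + fetch_cycles + 1   # E follows F; C is the last cycle
    return total


def report(num_instructions, max_fetches, fetch_cycles=5):
    """Return (total cycles, CPI, average bus utilization in percent)."""
    total = simulate(num_instructions, max_fetches, fetch_cycles)
    cpi = total / num_instructions
    bus_util = num_instructions * fetch_cycles / (max_fetches * total)
    return total, round(cpi, 1), round(100 * bus_util, 1)


print(report(2, 1))   # (16, 8.0, 62.5)  Parallel Processor A, two instructions
print(report(2, 2))   # (12, 6.0, 41.7)  Parallel Processor B, two instructions
print(report(6, 1))   # (36, 6.0, 83.3)  Parallel Processor A, six instructions
print(report(6, 2))   # (22, 3.7, 68.2)  Parallel Processor B, six instructions
```

Setting max_fetches to 1 models the single bus of Parallel Processor A and
setting it to 2 models the two concurrent operand accesses of Parallel
Processor B.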









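N I D O F F F F F E C
  N I D O + + + + F F F F F E C
    N I D O + + + + + + + + F F F F F E C
      N I D O + + + + + + + + + + + + F F F F F E C
        N I D O + + + + + + + + + + + + + + + + F F F F F E C
          N I D O + + + + + + + + + + + + + + + + + + + + F F F F F E C
<----------------------------- 36 cycles ----------------------------->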




Figure 8. Six instructions processed, one operand access at a time
N I D O F F F F F E C
  N I D O F F F F F E C
    N I D O + + + F F F F F E C
      N I D O + + + F F F F F E C
        N I D O + + + + + + F F F F F E C
          N I D O + + + + + + F F F F F E C
<--------------- 22 cycles --------------->






Figure 9. Six instructions processed, two operand accesses at a time
As instruction processing continues, the utilization of the various stages
increases. The bus in Figure 8 on page 16 is now utilized 83.3¹⁰ percent, while
the other stages are utilized 16.7¹¹ percent. In Figure 9 the utilization of
each bus is 68.2¹² percent, while the utilization of each of the other stages
increases to 27.3¹³ percent. The longer instruction processing operates without
interruption, the lower the CPI and the higher the utilization of the stages.
¹⁰ Utilized for 30 out of the 36 cycles.
¹¹ Utilized for 6 out of the 36 cycles.
¹² Utilized for 15 out of the 22 cycles.
¹³ Utilized for 6 out of the 22 cycles.
As Parallel Processor A continues, uninterrupted, the CPI approaches 5. The
equation which describes this processor's CPI is shown in Figure 10.
CPI = (5N + 6) / N
N = Number of Instructions Processed (N > 0)
Figure 10. Minimum CPI formula for Parallel Processor A
Parallel Processor B can obtain a minimum CPI of 2.5. The equations which
describe this processor's CPI are shown in Figure 11.
If N is Odd
CPI = (5(N-1) + 22) / 2N  ==>  CPI = (5N + 17) / 2N
If N is Even
CPI = (5(N-2) + 24) / 2N  ==>  CPI = (5N + 14) / 2N
N = Number of Instructions Processed (N > 0)
Figure 11. Minimum CPI formulae for Parallel Processor B
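The minimum CPI formulae can be checked against the cycle counts quoted
earlier in this chapter. In the sketch below the function names are
illustrative. The Parallel Processor B expressions are the odd and even
formulae of Figure 11. The Parallel Processor A expression, (5N + 6) / N, is
an assumption derived from the quoted cycle counts (16 cycles for two
instructions, 36 cycles for six instructions) and from the statement that the
CPI approaches 5.

```python
def min_cpi_a(n):
    """Minimum CPI of Parallel Processor A after n instructions (derived)."""
    return (5 * n + 6) / n


def min_cpi_b(n):
    """Minimum CPI of Parallel Processor B after n instructions (Figure 11)."""
    if n % 2:                       # n is odd
        return (5 * n + 17) / (2 * n)
    return (5 * n + 14) / (2 * n)   # n is even


print(min_cpi_a(2), min_cpi_a(6))    # 8.0 6.0
print(min_cpi_b(1), min_cpi_b(2))    # 11.0 6.0
print(round(min_cpi_b(6), 1))        # 3.7

# For long uninterrupted sequences the CPI approaches its limit.
print(round(min_cpi_a(10**6), 3))    # 5.0
print(round(min_cpi_b(10**6), 3))    # 2.5
```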
1.1.4 Pipeline Disruptions
The previous processor examples demonstrate the benefits of parallel
processors. Parallel processors allow existing hardware to be better utilized,
thus improving the price/performance characteristics. However, as with the
contention for the bus, there are situations which cause delays in the pipeline.
These delays cause CPI to increase.
There are other interactions between instructions, referred to as interlocks,
which can cause delays in the pipeline. Interlocks occur when multiple
instructions require the same resource. These instructions must wait until the
resource is available. This wait results in a delay in the pipeline. Interlocks
commonly occur on registers, pieces of data, the condition code, and data
busses.
Register interlock occurs when an instruction needs to use a register which
is updated by a preceding instruction. The processor needs to recognize
this dependency and wait until the register has been updated before
allowing the dependent instruction to continue. By recognizing
dependencies, architectural integrity is maintained. In order to remain
architecturally correct, the highly pipelined processor must give the
appearance that each instruction is processed sequentially.
Figure 12 on page 21 illustrates some coding examples which may cause
register interlock. If the variable B is kept in a register then this register
needs to be updated before the comparison can be done in the first coding
example. B must be updated before Y can be calculated in the second
example. Each case could cause a delay in a processor with an overlapped
pipeline.
B = A + X;
IF B < 100 THEN Y = Z;
- or -
B = A + X;
Y = B + C;
Figure 12. Register Interlock
Operand interlock occurs when an instruction requires data which is
modified by a preceding instruction. The instruction must wait until the data
is updated before receiving its copy. Some processors may utilize hardware
that recognizes this dependency and expedites the data to the waiting
instruction. Other processors have to wait until the store from the
preceding instruction is complete before honoring the fetch for the updated
data.
If the variable B was not kept in a register but instead was kept in a
storage location then the coding examples in Figure 12 on page 21 would
illustrate operand interlock. The storage location containing B needs to be
updated before the comparison can be done in the first coding example. B
must be updated before Y can be calculated in the second example. Each
case could cause a delay in a processor with an overlapped pipeline.
Condition code interlock occurs when a branch needs to use the condition
code but it is being updated by a preceding instruction. The branch has to
wait until the update is complete. Figure 13 on page 23 illustrates a
decision statement which may cause a condition code interlock. I must first be
compared to 100 and the condition code updated to reflect this comparison
before the branch can determine if the target instruction or next sequential
instruction will be executed. Condition code interlock can also occur when
two consecutive instructions want to update the condition code; the second
must wait until the first is complete.
IF I < 100 THEN GOTO TOP;
Figure 13. Condition Code Interlock
These are a few of the interlocks which can degrade performance. The next
section looks more closely at the delays induced by branch instructions.
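The interlock conditions described in this section can be sketched as a check
between an earlier and a later instruction. The Instr record and the
interlocks function are hypothetical illustrations, not a description of any
processor's hardware; register and operand interlocks are treated alike here,
as a location written by the earlier instruction and read by the later one.
Real hardware must track every instruction still in the pipeline, not just
one adjacent pair.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    text: str
    writes: set = field(default_factory=set)  # registers/storage updated
    reads: set = field(default_factory=set)   # registers/storage used
    sets_cc: bool = False                     # updates the condition code
    uses_cc: bool = False                     # tests the condition code


def interlocks(earlier, later):
    """List the interlocks the later instruction has on the earlier one."""
    found = []
    if earlier.writes & later.reads:
        found.append("register or operand")
    if earlier.sets_cc and (later.uses_cc or later.sets_cc):
        found.append("condition code")
    return found


add = Instr("B = A + X", writes={"B"}, reads={"A", "X"})
use = Instr("Y = B + C", writes={"Y"}, reads={"B", "C"})
compare = Instr("compare I with 100", reads={"I"}, sets_cc=True)
branch = Instr("branch to TOP if I < 100", uses_cc=True)

print(interlocks(add, use))         # ['register or operand']
print(interlocks(compare, branch))  # ['condition code']
print(interlocks(use, compare))     # []
```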
1.2 Impact of Branches on the Pipeline
This section examines the pipeline flow when a taken branch instruction is
encountered. In the following examples, the second instruction in the
sequence is a branching instruction. During the (O) stage the branch target
address is computed. Since the target address has been calculated, the (N)
stage is not necessary for the instruction following a branch.
1.2.1 Serial Processor
Figure 14 on page 24 illustrates the serial processor handling a branch
instruction. The target of the branch is in the instruction buffers. The CPI
for these three instructions is 9.
N I D O F F F F F E C
                      N I D O E C                      <== Branch Instruction
                                  I D O F F F F F E C
<-------------------- 27 cycles -------------------->
Figure 14. Three instructions processed by a serial processor
With a serial processor, the condition code is known at the time the branch is
executed. The register(s) needed to calculate the target address have been
updated. Predicting the branch condition14 is not beneficial since condition code
interlocks do not occur. Predicting the branch target address may be beneficial.
The examples within this paper assume the data is found in the first-level
cache. Since the branch target is often not in the same instruction buffer as
the branch instruction15, a fetch from the first-level cache is often necessary
before the target instruction can be processed. Figure 15 on page 25 shows
the delay if the branch target is not in the instruction buffers. This delay
causes the CPI to increase to 10.7. If the instruction data is not in the
first-level cache the delay is even longer. The length of the delay depends upon
the storage access times and at which level of the storage sub-system the data
resides. Predicting the branch target may result in a shortened wait for the
target instruction data.
14 Taken or not taken.
[Timing diagram. Instruction 1: N I D O F E C. Instruction 2 (branch): N I D O E C.
Instruction 3: I D O F E C, delayed while its instruction data is fetched.
Total: 32 cycles.]
Figure 15. Three instructions processed by a serial processor, target not in
instruction buffers
15




The sooner the branch target address is calculated, the sooner the fetch can be
issued for the needed data. Branch prediction can reap a substantial benefit,
especially when the storage penalty16 is severe and the processor is highly
parallelized. Depending upon the storage penalty, it may be cost-effective to
implement branch prediction in a serial processor.
1.2.2 Parallel Processors
The impact of a branch on the parallel processors is examined in the following
section. The second instruction in the sequence is a branch. The
branch interaction is shown in conjunction with condition code interlock, register
interlock, and storage delays.
16 The storage penalty is the time, in number of cycles, which it takes to access the
data.
1.2.2.1 Parallel Processor A
The example in Figure 16 has no interlocks. The branch causes some increase
of the CPI due to the delay in instruction processing. The resulting CPI is
6.2.
[Timing diagram. Instruction 1: N I D O F E C. Instruction 2 (branch): N I D O E C.
Instruction 3: I D O F E C. Instructions 4 through 6: N I D O F E C each.
Total: 37 cycles.]
Figure 16. Six instructions processed, one operand access at a time, no interlocks
If the instruction data for the branch target is not in the instruction buffers
then the timing in Figure 17 is obtained.
[Timing diagram. Instruction 1: N I D O F E C. Instruction 2 (branch): N I D O E C.
Instruction 3: I D O F E C, delayed while its instruction data is fetched.
Instructions 4 through 6: N I D O F E C each. Total: 42 cycles.]
Figure 17. Six instructions processed, one operand access at a time, no interlocks,
target not in instruction buffers
Figure 18 shows the timing when a condition code interlock exists. The
branch execution must wait until the preceding instruction completes. The
resulting CPI is 6.3.
[Timing diagram. Six instructions, the second a branch whose execution waits
until the preceding instruction completes. Total: 38 cycles.]
Figure 18. Six instructions processed, one operand access at a time, condition
code interlock
Figure 19 shows the timing when a register interlock exists. The (O) stage of
the branch must wait until the preceding instruction completes in order to use
the updated register. This interlock causes a seven cycle delay and the
resulting CPI is 6.5. If the instruction data is not in the instruction buffers
then the 39 cycles are further elongated to 44 cycles, resulting in a CPI of 7.3.
From the optimum case in Figure 16 on page 27, the interlocks and storage
delay can cause CPI to increase by up to 1.1.
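As a quick check on the figures quoted for Parallel Processor A, CPI is total cycles divided by instructions processed. The sketch below recomputes the values from the cycle counts given in the text (37, 42, 38, 39, and 44 cycles for six instructions). The dictionary labels and the function name are illustrative only and do not come from the thesis.

```python
# Cycle counts quoted in the text for six instructions on Parallel Processor A.
# The labels are illustrative; only the cycle counts come from the text.
CYCLES = {
    "no interlocks (Figure 16)": 37,
    "target not in instruction buffers (Figure 17)": 42,
    "condition code interlock (Figure 18)": 38,
    "register interlock (Figure 19)": 39,
    "register interlock, target not in instruction buffers": 44,
}
INSTRUCTIONS = 6


def cpi(cycles, instructions=INSTRUCTIONS):
    """Cycles per instruction, rounded to one decimal place as in the text."""
    return round(cycles / instructions, 1)


cpi_by_case = {case: cpi(cycles) for case, cycles in CYCLES.items()}
optimum = cpi_by_case["no interlocks (Figure 16)"]
worst_increase = round(max(cpi_by_case.values()) - optimum, 1)

for case, value in cpi_by_case.items():
    print(f"{case}: CPI {value}")
print(f"largest increase over the optimum case: {worst_increase}")
```

The rounded results are 6.2, 7.0, 6.3, 6.5, and 7.3, and the largest increase over the optimum case is 1.1, in agreement with the text.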






[Timing diagram. Six instructions, the second a branch whose (O) stage waits
until the preceding instruction has updated the register. Total: 39 cycles.]
Figure 19. Six instructions processed, one operand access at a time, register interlock
1.2.2.2 Parallel Processor B
Figure 20 on page 32 shows Parallel Processor B's timing with a branch
instruction as the second instruction in the sequence. Even though there are
no interlocks or storage delays, the branch causes pipeline delays, resulting in
a CPI of 4.7. The branch instruction increased CPI by 0.2 when processed by
Parallel Processor A. The branch has caused a 1.0 CPI increase in Parallel
Processor B. The impact of the branch is much more severe on this
processor. Generally, the more overlapped the stages of a processor, the
greater the impact of pipeline delays.
[Timing diagram. Six instructions on Parallel Processor B, the second a branch.
Total: 28 cycles.]
Figure 20. Six instructions processed, two operand accesses at a time, no interlocks
Figure 21 shows the impact when waiting for instruction data to be loaded
into the instruction buffers. This delay increases the CPI to 5.5.
[Timing diagram. Six instructions on Parallel Processor B, the second a branch,
with the instruction following the branch delayed while its instruction data
is fetched. Total: 33 cycles.]
Figure 21. Six instructions processed, two operand accesses at a time, no interlocks,
target not in instruction buffers
Figure 20 on page 32 is the ideal case with no interlocks. Figure 22 has a
condition code interlock which increases the CPI to 4.8. Figure 23 on
page 35 has a register interlock, increasing the CPI to 5.0. If the instruction
data is not in the instruction buffers the CPI would increase to 5.8.
[Timing diagram. Six instructions on Parallel Processor B, the second a branch
subject to a condition code interlock. Total: 29 cycles.]
Figure 22. Six instructions processed, two operand accesses at a time, condition
code interlock
[Timing diagram. Six instructions on Parallel Processor B, the second a branch
subject to a register interlock. Total: 30 cycles.]
Figure 23. Six instructions processed, two operand accesses at a time, register
interlock
The previous examples have shown that a branch's impact on performance
depends on many factors. First, the branch itself induces some delay.
Second, branches can suffer from both condition code and register interlocks,
increasing their performance degradation. Finally, waiting for the target data
can delay instruction processing even further.
When comparing the two parallel implementations, branching has a more
adverse effect on Parallel Processor B. This is because it is the more parallel
of the two processors. The preceding examples illustrate the effects of the
various delays upon these two parallel processors.
1.2.3 Two Execution Units
When looking at design changes that improve performance, two execution
units might be considered. Figure 24 on page 38 and Figure 25 on page 39
show the new timings with two execution units. These timings reflect the
ideal case where no interlocks exist and the branch target is in the instruction





in order to achieve these timings19.
17 Out-of-order execution occurs because two execution units handle two instructions
independent of each other. In these examples, the branch instruction executes before
the preceding instruction. This execution is out of order.
18 The branch conditionally completes, allowing the next instruction to be processed.
The branch instruction is not architecturally complete until the preceding instruction
completes.
19 Refer to section 2.5, "Active Streams" on page 83 for a detailed discussion.
[Timing diagram. Six instructions, the second a branch, processed with two
execution units.]
Figure 24. Six instructions processed, one operand access at a time, two execution
units
[Timing diagram. Six instructions, the second a branch, processed with two
execution units. Total: 23 cycles.]
Figure 25. Six instructions processed, two operand accesses at a time, two execution
units
Two execution units allow the branch to be hidden20 and the CPIs to drop to
5.3 and 3.8, respectively. The additional execution unit also increases
performance when the branch target is not in the instruction buffers because the
fetch is issued sooner. The second execution unit does not provide a performance
gain when a condition code or register interlock exists. Before adding a
second execution unit, the designer needs to understand where the largest
delays exist in order to assess the potential benefits of additional hardware.
20 The branch is completely overlapped with the preceding instruction. This situation is
often referred to as being hidden.
1.3 Conclusions
This chapter has provided an understanding of the various delays which can
degrade performance. Many of these delays occur when branch instructions
are processed. This chapter has shown that parallel processors are more sensitive
to branching delays than serial processors.
1.4 Thesis Organization
The remainder of this thesis is organized as follows:
Chapter 2 describes various hardware branch prediction methods and
implementation considerations.
Chapter 3 discusses the performance gains of these hardware branch prediction
mechanisms.
Chapter 4 discusses the performance impact when implementing programming
decisions.
Chapter 5 summarizes the results.
Chapter 2. Branch Prediction Methods
Many algorithms have been developed to reduce the delay caused by branch
instructions. The delayed branch scheme implements branches that take effect
several instructions after the branch instruction [ST 93]. Branch spreading
attempts to separate a conditional branch instruction from the instruction
which sets the condition code. Branch spreading can be implemented by compilers
and/or processors. Branch folding makes branches execute in zero
cycles by compounding them with the preceding instruction(s) [DiMc 87]. Branch
prediction avoids delays by predicting the outcome of a branch. This thesis
focuses on the benefits of branch prediction.
Three distinct methods of branch prediction will be investigated in the following
sections. These three methods are: opcode-based branch prediction;
decode history table prediction; and branch history table prediction. Statistics
describing typical branch behavior will be reviewed in order to support the
prediction implementations. Other concepts, relevant to branch prediction
and branch performance, are introduced in this chapter.
2.1 Opcode-Based Branch Prediction
One of the simplest forms of branch prediction is based upon the branch
instruction's operation code (opcode). The opcode can be used to determine
if the branch21 is an unconditional or conditional branch. If the branch is
unconditional, then whether it is taken or not taken can be accurately predicted
by examining its opcode. An unconditional branch is one that either
always branches or never branches, depending upon its opcode. Unconditional
branches which never branch are often referred to as no-ops. Figure 26 on
page 44 contains examples of unconditional branches.
21 See Appendix B, "S/390 Branch Instructions" on page 196 for descriptions of the
S/390 branch instructions.
BCR 0,A    This branch is followed by the next sequential instruction
BCR F,A    This branch is followed by the instruction located at the
           target address specified by GPR A
BAL A,B    This branch is followed by the instruction located at the
           target specified by GPR B
Figure 26. Unconditional Branches
A conditional branch is one that may or may not branch, depending upon the
current state of the processor. Each time a conditional branch is encountered
there are two possible directions instruction processing can go. One path of
instruction processing is referred to as the correct path, the other as the
incorrect path. The paths are also referred to as the taken (target) path and not
taken (next sequential instruction) path.
Combining these two variations produces four possibilities:
1. Taken and correct
2. Taken and incorrect
3. Not taken and correct
4. Not taken and incorrect
When predicting branches there is the possibility that incorrect instruction
processing will occur. In order to recover from incorrect predictions, the
processor needs a mechanism to back up to the point of the branch instruction
and continue processing down the correct path.
Conditional branches usually are predicted not taken. Conditional branches
tend to be taken approximately fifty percent of the time22. When attempting
to predict the conditional branches correctly, defaulting to taken or not taken
does not make much difference. If typical workloads exhibited a much higher
percentage of conditional branches taken, this default might not be optimal.
22 Section 2.2.2, "Taken Characteristics of Branches" on page 51 has additional details.
For conditional branches, the default to not taken is chosen because the next
sequential instruction path is the simplest to execute. From a performance
perspective, defaulting to the next sequential path reduces the number of
instruction fetches which are issued but not used. This is due to the fact that
the branch target data often requires an instruction fetch23 because the target
data does not reside in the instruction buffers.
Another set of branches which can be predicted rather successfully using
opcode-based branch prediction are the loop-controlling branches. In loop
structures the branch is taken every time except the last iteration. If
loop-controlling branches are predicted taken, then they are predicted correctly a
large percentage of the time. Figure 27 on page 47 illustrates a program
loop. A loop-controlling branch would be utilized to iterate through this loop
fifty times.
23 Section 2.2.3, "Distances Between Taken Branches and Their Targets" on page 53
gives an understanding of the typical distances between a branch and its target.
DO I = 1 TO 50;
END;
Figure 27. Program Loop
BCT is one of the loop-controlling branches in the S/390 instruction set.
Other loop-controlling branches are: BXLE, BXH, and BCTR. In the typical
workload environments, loop-controlling branches are taken 70 to 99 percent
of the time, with most environments clustered around 85 to 95 percent taken.
Given this type of behavior opcode-based branch prediction does quite well.
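The high taken percentages follow directly from loop structure: a loop-controlling branch is taken on every iteration except the last. The sketch below is a minimal illustration, assuming the branch executes exactly once per iteration. The helper name `taken_fraction` is hypothetical.

```python
def taken_fraction(iterations):
    """Fraction of executions in which a loop-closing branch is taken.

    Assumes the branch executes once per iteration and falls through
    only on the final iteration.
    """
    if iterations < 1:
        raise ValueError("a loop must run at least once")
    return (iterations - 1) / iterations


# The fifty-iteration loop of Figure 27: taken 49 times out of 50.
print(f"{taken_fraction(50):.0%}")   # prints 98%
```

Under this assumption, predicting such a branch taken is wrong exactly once per execution of the loop, so short loops pull the taken percentage down and long loops push it toward 100 percent.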
The main benefit of opcode-based branch prediction is that it allows instruction
processing to continue without waiting for the branch instruction to complete.
Even when a more sophisticated branch prediction method is
implemented, opcode-based branch prediction is often used in conjunction
with the other branch prediction method.
2.2 Branching Statistics From Typical S/390 Operating
Environments
The tables in the following two sections contain branch statistics from typical
S/390 environments. These are benchmarks which represent typical customer
environments. They are referred to as the LSPR24 workloads.
In order to understand the behavior of customer environments, instruction
traces of these environments are gathered. These instruction traces are executed
by processor models to produce an in-depth picture of performance
characteristics. One of these performance characteristics is branching
behavior.
The following branching statistics are generated by processing these LSPR
workload traces. The traces represent four operating systems: VSE, MVS,
VM, and TPF25. They represent ten typical customer workload environments.
24 Large Systems Performance Reference.
25 MVS - Multiple Virtual Storage
TPF - Transaction Processing Facility
2.2.1 Workload Definitions
Table 1. Workload Descriptions
Environment          Description
Batch MVS            Background jobs typical of what are running during batch
                     periods on an MVS system.
Batch VM             Background jobs typical of what are running during batch
                     periods on a VM system.
CICS MVS             Customer Information Control System, an interactive
                     workload where large numbers of users execute
                     transactions of varying sizes typical of what is found
                     on large MVS systems.
CICS VSE             Customer Information Control System, an interactive
                     workload where large numbers of users execute
                     transactions of varying sizes typical of what is found
                     on VSE systems. VSE systems are usually smaller than
                     MVS systems.
DB2 MVS              DATABASE 2, IBM's relational database management
                     system. A workload where a large number of users submit
                     a variety of database transactions.
IMS MVS              Information Management System, IBM's hierarchical
                     database management system. A workload where a large
                     number of users submit a variety of database
                     transactions.
Interactive VM       Hundreds of users executing functions that VM users
                     typically run during the online period of the day.
Interactive - MVS    Thousands of users executing functions that MVS users
                     typically run during the online period of the day.
Scientific - MVS     Fortran jobs representing airflow calculations.
TPF                  Transaction Processing Facility, users executing
                     transactions that would typically be found in a TPF
                     environment.
VM - Virtual Machine
VSE - Virtual Storage Extended
2.2.2 Taken Characteristics of Branches
The Percent of Branches Taken column is the percent of branch instructions,
out of the total branch instructions within the workload, which cause program
execution to proceed at the branch's target. The Percent of Conditional
Branches Taken column is the percent of conditional branches which are
taken26.
26 For example, in a trace with 80 conditional branches, if 40 of these conditional
branches are taken, then fifty percent of the conditional branches are taken.
Table 2. Percent of Branches Taken in Different Environments
Environment          Percent of Branches Taken    Percent of Conditional Branches Taken
Batch MVS            57.0                         53.5
Batch VM             68.8                         59.8
CICS MVS             54.7                         48.3
CICS VSE             56.3                         48.7
DB2 MVS              55.0                         47.6
IMS - MVS            56.4                         49.5
Interactive VM       61.8                         57.3
Interactive - MVS    52.9                         50.6
Scientific - MVS     51.8                         61.6
TPF                  52.9                         63.6
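The conditional-branch column of Table 2 can be read directly as the accuracy of a fixed guess: always predicting taken is right as often as the branch is taken, and always predicting not taken is right the rest of the time. The sketch below works this out per environment. The function name and the unweighted mean are illustrative choices, not part of the thesis.

```python
# Percent of conditional branches taken, copied from Table 2.
CONDITIONAL_TAKEN = {
    "Batch MVS": 53.5,
    "Batch VM": 59.8,
    "CICS MVS": 48.3,
    "CICS VSE": 48.7,
    "DB2 MVS": 47.6,
    "IMS - MVS": 49.5,
    "Interactive VM": 57.3,
    "Interactive - MVS": 50.6,
    "Scientific - MVS": 61.6,
    "TPF": 63.6,
}


def static_accuracy(percent_taken, predict_taken):
    """Percent of conditional branches that a fixed guess gets right."""
    return percent_taken if predict_taken else 100.0 - percent_taken


for env, taken in CONDITIONAL_TAKEN.items():
    not_taken_acc = static_accuracy(taken, predict_taken=False)
    taken_acc = static_accuracy(taken, predict_taken=True)
    print(f"{env:18} not taken: {not_taken_acc:5.1f}%   taken: {taken_acc:5.1f}%")

# Unweighted mean across the ten environments (ignores branch counts per trace).
mean_taken = sum(CONDITIONAL_TAKEN.values()) / len(CONDITIONAL_TAKEN)
print(f"mean percent taken: {mean_taken:.2f}")
```

The unweighted mean comes to about 54 percent taken, so neither fixed guess does much better than a coin toss for conditional branches. That leaves room for the history-based methods described in this chapter.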
2.2.3 Distances Between Taken Branches and Their Targets
The information in the following two tables, in conjunction with knowing the
size of a cache line, gives an understanding of whether the target is typically
within the same cache line as the branch. If a target is not in the same cache
line as the branch there is a potential that the target does not reside in the
cache. This cache miss27 can cause a significant delay.
Conditional branches tend to be closer to their targets than unconditional
branches. Conditional branches are used for implementing program decisions
and tend to be used to bypass a section of code. Unconditional branches tend
to be used for subroutine and function calls and returns.
27 A cache miss occurs when data is not found in the cache.
[Two tables giving the distances between taken branches and their targets; the
table contents are not recoverable.]
2.3 DHT (Decode History Table)
2.3.1 Introduction
The decode history table is a simple array. Each entry in the array contains
an indicator used to predict branch condition. By predicting branch condition,
the DHT can reduce branch delays caused by condition code interlocks.
Figure 28 on page 57 illustrates a typical DHT configuration. The number of
rows in the DHT is often a power of two. X equals 2^N, where N is the
number of bits used to select the DHT row.












Figure 28. DHT Layout
2.3.2 Hashing
The ideal DHT has each conditional branch map into its own row within the
DHT. In practice, this is not feasible. The branch instruction's address is
used to index into this array. This indexing is often referred to as hashing
because only a subset of the branch instruction address bits is used to choose
which entry to access. These bits are chosen such that branches within
sequential pieces of code do not map into the same DHT entries. Hashing
attempts to spread the conditional branch instructions evenly throughout the
DHT.
Choosing which instruction address bits are used to select the DHT row
requires further understanding of typical branch frequency and proximity to
other branches. Typical code executed by a processor can be examined to
understand branch proximity. Examining code cannot provide insight into the
taken/not taken characteristics of the conditional branches. Operating environments
in which there are multiple applications active need to be monitored
in order to characterize their branch behavior. One method to understand the
typical operating environment is to gather instruction traces of the activity
occurring in this typical environment.
Understanding the proximity of branches helps dictate which bits hashing
uses. If a branch occurs approximately every double word then making each
row in the DHT represent a double word results in approximately one branch
per entry. Every time the instruction address is incremented such that it is
within the next double word, it hashes into the next sequential row of the
DHT. Eventually the instruction address increases such that it wraps and
hashes into the first rows of the DHT.
| 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
Bit 31 is always zero for instruction addresses
Figure 29. S/390 Instruction Address Bit Layout
To accomplish a mapping where each row in the DHT represents a double
word, the lowest three address bits are not used for hashing. Each time bit 28
changes by one, the address hashes into the next DHT row. If the DHT has
4096 rows, 12 bits are needed to hash into the DHT. The high-order instruction
address bits fluctuate less, so often they are not used for hashing. To
keep things simple, sequential address bits are often used to hash into the
array. Based on the preceding information, instruction address bits 17 through 28
are chosen to hash into the DHT.
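The row selection described above amounts to a shift and a mask. The following sketch assumes a 32-bit address numbered S/390 style, with bit 0 leftmost and bit 31 rightmost, so that bit 28 carries a weight of eight bytes. The constant and function names are hypothetical.

```python
DHT_ROWS = 4096          # 2**12 rows, as in the example above
LOW_BITS_IGNORED = 3     # bits 29-31 select a byte within a double word


def dht_row(instruction_address):
    """Select a DHT row from instruction address bits 17 through 28.

    S/390 numbers bits from the left, so bit 31 is the least significant
    bit of a 32-bit word and bit 28 has a weight of 8 bytes.
    """
    return (instruction_address >> LOW_BITS_IGNORED) & (DHT_ROWS - 1)


# Addresses in the same double word share a row; the next double word moves
# to the next row; after 4096 double words the row number wraps to zero.
print(dht_row(0x0000), dht_row(0x0007), dht_row(0x0008),
      dht_row(0x7FF8), dht_row(0x8000))   # prints 0 0 1 4095 0
```

The wrap at address 0x8000 is the point at which, in the words of the text, the instruction address "wraps and hashes into the first rows of the DHT".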
2.3.3 Implementation
When a conditional branch is decoded the instruction address is sent to the
DHT. The DHT is accessed by using the appropriate instruction address bits
to select a row within the DHT. The taken/not taken indicator contained
within the row is sent back to the decode unit to be used to direct instruction
processing.
2.3.3.1 Branch Predicted Taken
If the branch is predicted taken and the target address is able to be
computed28, then instruction processing continues at the target address. If the
target address cannot be computed then the branch has to wait until the target
address can be computed. If the prediction is correct then no updates to the
DHT are necessary.
28 No register interlocks.
If the prediction is found to be wrong then the DHT is sent both the branch
instruction address and the request to set the prediction to not taken. The
incorrect instruction processing has to be backed off and instruction processing
resumes at the next sequential instruction. The actual unit to send the
update request to the DHT can vary; it can be the completion unit when it
recognizes that the prediction is incorrect or the execution unit when it
executes the branch.
2.3.3.2 Branches Predicted Not Taken
If the branch is predicted not taken then instruction processing continues at
the next sequential instruction. If this prediction is correct, no updates to the
DHT are necessary. If the prediction is incorrect, a request is sent to the
DHT to update the branch entry to taken. An incorrect prediction causes
instruction processing to be reset. Instruction processing resumes at the target
address.
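The predict-and-update behavior of sections 2.3.3.1 and 2.3.3.2 can be summarized in a few lines. This is a minimal sketch rather than a model of any particular hardware: it assumes a direct-mapped table of one-bit indicators indexed by address bits 17 through 28, with every row initially predicting not taken. The class and method names are hypothetical.

```python
class DecodeHistoryTable:
    """Direct-mapped table of one-bit taken/not-taken indicators (illustrative)."""

    def __init__(self, index_bits=12, low_bits_ignored=3):
        self.mask = (1 << index_bits) - 1
        self.shift = low_bits_ignored
        # Every row starts out predicting not taken (an assumption of this sketch).
        self.rows = [False] * (1 << index_bits)

    def _row(self, address):
        return (address >> self.shift) & self.mask

    def predict(self, address):
        """Return True if the branch at this address is predicted taken."""
        return self.rows[self._row(address)]

    def resolve(self, address, taken):
        """Compare the prediction with the outcome; update only on a miss.

        Returns True when the prediction was correct.
        """
        row = self._row(address)
        correct = self.rows[row] == taken
        if not correct:
            self.rows[row] = taken
        return correct


dht = DecodeHistoryTable()
outcomes = [True, True, False, True]
results = [dht.resolve(0x1000, taken) for taken in outcomes]
print(results)   # prints [False, True, False, False]
```

Only wrong predictions write to the table, which matches the statement that no update is necessary when the prediction is correct.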
2.3.3.3 Other Implementation Considerations
Most DHT implementations only have one entry per row and thus are
referred to as direct-mapped. An associative DHT is one that has more than
one entry per row. If there are two entries per row then the DHT is said to
be two-way associative. An associative implementation requires more bits in
order to decide which entry to use. An associative implementation also
requires a more complex replacement algorithm to determine which entry to
replace.
Most implementations do not put all taken branches into the DHT. Often
the DHT is used in conjunction with opcode-based branch prediction. An
unconditional branch is predicted correctly by an opcode-based branch prediction
methodology. Putting unconditional branches in the table might displace
another branch entry while providing no additional benefit.
It is not sensible to keep loop-controlling branches in a DHT. This is due to
the fact that these branches are predicted more accurately using opcode-based
branch prediction than using a DHT. When opcode-based branch prediction
is used, loop-controlling branches are always predicted taken. On the last
loop iteration the prediction is incorrect. If a DHT is used, the first time
through the loop the branch is incorrectly predicted not taken. As a result the
prediction is updated to taken. The last time through the loop the branch is
incorrectly predicted taken. The prediction is then updated to not taken. So
for every occurrence of the loop, the DHT branch prediction is incorrect twice
instead of just once. Because of this, opcode-based prediction is used for
loop-controlling branches.
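The once-versus-twice argument can be checked with a small count. The sketch below assumes a single loop-closing branch, a one-bit history entry that starts at not taken, and no interference from other branches. The function name and parameters are illustrative.

```python
def mispredictions(loop_runs, iterations, use_dht):
    """Count wrong predictions for one loop-closing branch.

    use_dht=False models opcode-based prediction (always predict taken).
    use_dht=True models a one-bit history entry that starts at not taken
    and is rewritten after every wrong prediction.
    """
    history_taken = False
    wrong = 0
    for _ in range(loop_runs):
        for i in range(iterations):
            actual_taken = i < iterations - 1   # falls through on the last pass
            predicted_taken = history_taken if use_dht else True
            if predicted_taken != actual_taken:
                wrong += 1
                history_taken = actual_taken
    return wrong


print(mispredictions(loop_runs=10, iterations=50, use_dht=False))   # prints 10
print(mispredictions(loop_runs=10, iterations=50, use_dht=True))    # prints 20
```

For ten executions of a fifty-iteration loop, opcode-based prediction is wrong ten times and the one-bit history is wrong twenty times.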
There is a possibility that the decode unit can request information29 from the
DHT on the same cycle the completion controls request the DHT to update
its information30. If the DHT array cannot simultaneously handle a read and
write operation, then some priority and staging logic is necessary.
2.3.4 Cases Which Negate the Benefit of a DHT
Every branch prediction method has the potential to be wrong. A program
can be written such that it defeats the branch prediction, causing it to predict
incorrectly. It will be shown that the DHT can still predict correctly even
when it uses incorrect information. This is because it only predicts the
branch's condition.
A DHT can be defeated by a branch which alternates between being taken





uses the last occurrence to determine the prediction. Some DHTs may have a
mechanism to recognize where incorrectly predicted branches are occurring
and selectively turn off the branch prediction to avoid the cost of resetting the
instruction stream.
Since only a subset of the instruction address bits is used for hashing, synonyms
can occur. Synonyms are two branch instructions with different instruction
addresses which map into the same DHT row. If the high-order bits
change quite frequently then the potential for synonyms increases since these
bits often are not used for hashing. Also, if a program is much more branchy
than expected, with more than one branch occurring per double word of
instruction data, then the number of synonyms increases.
Synonyms can cause branch prediction to be incorrect. Two branches which
hash into the same DHT row, with one being taken and the other not taken,
can toggle the branch prediction of that row just as effectively as a single
alternating branch. A clear understanding, by the hardware designer, of
typical instruction data access patterns can minimize the occurrences of synonyms.
How many bits and which bits to use during hashing directly impacts
the frequency of synonyms.
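A small worked case shows how two synonyms can toggle a shared row. The sketch assumes the hashing on address bits 17 through 28 chosen in section 2.3.2 and a one-bit indicator per row. The two branch addresses are made up for illustration and differ only in bits of higher order than bit 17.

```python
def dht_row(address, index_bits=12, low_bits_ignored=3):
    """Row selected by instruction address bits 17 through 28."""
    return (address >> low_bits_ignored) & ((1 << index_bits) - 1)


# Two hypothetical branch addresses that differ only in high-order bits.
branch_a = 0x00012340
branch_b = 0x0005A340

assert dht_row(branch_a) == dht_row(branch_b)   # synonyms: same row

# One shared one-bit indicator; branch A is always taken, branch B never is.
row_predicts_taken = False
wrong = 0
for _ in range(5):
    for actual_taken in (True, False):          # A executes, then B
        if row_predicts_taken != actual_taken:
            wrong += 1
            row_predicts_taken = actual_taken
print(wrong)   # prints 10
```

All ten predictions are wrong in this interleaving, even though each branch taken alone would be perfectly predictable.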
Synonyms can occur which do not negatively impact branch prediction. If the
DHT is accessed and the prediction it uses belongs to a synonym, the prediction
can still be correct. The two branches may have identical conditions. It
will be shown that the Branch History Table (BHT) is not as resilient as the
DHT when encountering synonyms. The resilience of a DHT is shown in the
performance runs done on moving target branches in section 2.4.6.4, "Moving
Targets" on page 81.
2.4 BHT (Branch History Table)
2.4.1 Introduction
A branch history table is another mechanism used to predict branches. It
predicts a branch's condition. If a branch is predicted taken, it also provides a
predicted target address of the branch. A BHT is depicted in Figure 30 on
page 67.
A subset of the branch instruction address bits is used to index rows in the
BHT. The number of BHT rows is often a power of two, such that X equals
2^N, where N is the number of branch instruction address bits used to hash into
the BHT row. The entire target address may be kept in the BHT or just an
offset to the target address.
A BHT row may contain one or more entries. Some of the branch instruction
address bits may be kept in the entry to further distinguish it. These extra
bits in the entry are necessary if the BHT is implemented with multiple entries
per row. The more unique information which is kept in the BHT entry, the
lower the likelihood of synonyms.
          <   y entries   >
Row 1   |    |   / /   |    |
Row 2   |    |   / /   |    |
Row 3   |    |   / /   |    |
  .     |    |   / /   |    |
  .     |    |   / /   |    |
Row X   |    |   / /   |    |
Figure 30. BHT Layout
2.4.2 Hashing
When designing the BHT, it is important to study which bits within the
branch instruction address fluctuate the most. Since address bit 31 in
S/390 architecture is never equal to one, it does not make sense to use this bit
in BHT operations. If most programs run in 24-bit mode, then using the
high-order bits either to select the BHT row or keep within the BHT entry is
a waste of resources.
The optimal selection of bits spreads activity throughout the BHT. Hot
spots31 indicate that thrashing is occurring. Thrashing is when entries in the
BHT are replaced at such a high rate that they are not in the BHT long
enough to be used for branch prediction. To prevent thrashing, the typical
frequency and proximity of branch instructions need to be known. This, and
how many entries there are per row, dictate which bits are most effective for
row selection. An example helps to illustrate this point.
31 Hot spots are entries within the array which experience a high rate of update
activity.
A BHT with sixteen rows uses four bits to select a row. The BHT has one
entry per row. If one branch is observed approximately every quad
word[32], then having each row represent a quad word of addressing results in
approximately one branch per row. Figure 31 on page 70 shows four branch
instructions hashing into this BHT. Bits 28-31 are not used; bit 27 is the
lowest order bit used. Every time bits 24-27 change by one[33], the instruction
address hashes into the next BHT entry. This is what is meant by having each
row represent a quad word of addressing.
[32] A quad word is sixteen bytes.
[33] For example, from 0000 to 0001.
00000008 ==== maps into =========> | Row #0 |
00000016 ==== maps into =========> | Row #1 |
0000002C ==== maps into =========> | Row #2 |
0000003A ==== maps into =========> | Row #3 |
Figure 31. Four branch addresses - hashing using instruction address bits 24-27
In the previous example, if bits 23-26 are used for row selection, then the first
two branches map into the first row; the second two branches map into the
second row. With these branches, this hashing results in thrashing. Hashing
using bits 23-26 is the optimal design point for this BHT if branches occur
every 32 bytes. Using bits 23-26 does not cause thrashing if the BHT is
two-way associative.
Using the lower order bits (such as 25-28) causes the BHT accesses not to be
sequential. With bits 25-28, the branches hash into the following rows in the
BHT: 1, 2, 5, and 7, respectively. Section 2.4.6.1, "Synchronous versus
Asynchronous Implementations" on page 76 explains why sequential accesses are
desired.
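The row selections quoted in this section can be checked with a short sketch. It assumes S/390 bit numbering, in which bit 0 is the most significant bit of a 32-bit address and bit 31 the least significant. The helper name `select_row` is illustrative only and does not come from any particular design.

```python
def select_row(address: int, first_bit: int, last_bit: int) -> int:
    """Extract bits first_bit..last_bit of a 32-bit address.

    Assumes S/390 numbering: bit 0 is the most significant bit,
    bit 31 the least significant.
    """
    width = last_bit - first_bit + 1
    shift = 31 - last_bit
    return (address >> shift) & ((1 << width) - 1)


# The four branch addresses from Figure 31.
BRANCHES = [0x00000008, 0x00000016, 0x0000002C, 0x0000003A]

for first, last in [(24, 27), (23, 26), (25, 28)]:
    rows = [select_row(a, first, last) for a in BRANCHES]
    print(f"bits {first}-{last}: rows {rows}")
```

With bits 24-27 each branch lands in its own row, with bits 23-26 the branches pair up in two rows, and with bits 25-28 the rows are no longer sequential.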
2.4.3 Replacement Algorithms
When multiple entries exist within a BHT row, a replacement algorithm is
necessary to determine which branch entry is replaced when a new branch
needs to be installed. An LRU (least recently used) algorithm is a common
approach. The simplest LRU algorithm puts a timestamp on the entry when it is
created.
A branch is removed from the BHT by marking its entry invalid. Invalid
entries are replaced first. If there are no invalid entries then the timestamp is
used to determine the oldest entry. The oldest entry is replaced with the new
branch. Every time a write operation is initiated, the timestamp of the entry
being modified is updated. With an update-when-new implementation, if a
branch entry is frequently read its LRU bits do not indicate this. Eventually




A more complex LRU replacement algorithm updates the LRU bits of the branch
entry whenever the branch entry is referenced. This ensures that frequently
encountered branches are kept in the BHT. This approach requires a write
(update) whenever the branch entry is accessed. Often the LRU bits are kept
separate from the branch entry to allow simultaneous read and write
operations to the BHT.
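The two LRU variants can be contrasted with a small model of a single BHT row. This is a minimal sketch under stated assumptions: a logical counter stands in for the hardware timestamp or LRU bits, the class and field names are hypothetical, and `install` is assumed to be called only after a miss.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class Entry:
    valid: bool = False
    branch_address: int = 0
    target: int = 0
    stamp: int = 0  # logical timestamp standing in for the LRU bits


class BhtRow:
    """One BHT row holding several entries (illustrative model only)."""

    def __init__(self, ways: int, update_on_reference: bool) -> None:
        self.entries: List[Entry] = [Entry() for _ in range(ways)]
        self.update_on_reference = update_on_reference
        self._clock = count(1)

    def lookup(self, branch_address: int) -> Optional[int]:
        for e in self.entries:
            if e.valid and e.branch_address == branch_address:
                if self.update_on_reference:
                    e.stamp = next(self._clock)  # costs a write per access
                return e.target
        return None

    def install(self, branch_address: int, target: int) -> None:
        # Invalid entries are replaced first; otherwise the oldest stamp loses.
        victim = next((e for e in self.entries if not e.valid), None)
        if victim is None:
            victim = min(self.entries, key=lambda e: e.stamp)
        victim.valid = True
        victim.branch_address = branch_address
        victim.target = target
        victim.stamp = next(self._clock)

    def invalidate(self, branch_address: int) -> None:
        for e in self.entries:
            if e.valid and e.branch_address == branch_address:
                e.valid = False
```

In a two-entry row, a branch that is installed first and then read repeatedly is still the replacement victim when `update_on_reference` is false, but survives when it is true.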
The previous sections illustrated a few of the considerations which need to be
made when designing a BHT. The workload characteristics are crucial when






The BHT is a costly piece of hardware in that it needs to contain a considerable
amount of data in order to be effective. It requires a significant number
of address bits to be used (for hashing and within the entry) in order to avoid
synonyms. It also contains the target address or an offset to the target
address.
[35] Instruction addresses reflect workload execution behavior.
A unique benefit of using a BHT is that it predicts the target address of the
taken branches. One way to gain maximum benefit from this prediction is to
fetch the data at the target address when the prediction is made. Incorrect
branch prediction, however, may introduce the performance degradation
indicated below.
• Instruction data is brought into the instruction buffers but may not be
  used.
• This unused data may displace usable instruction data, causing it to be
  refetched at a later time.
• The additional requests for data might tie up the bus to the instruction
  buffers so that the fetch for the correct target address is delayed.
These scenarios can potentially cause the performance to be worse than that
without a BHT. This behavior is illustrated by some of the programming
examples in Chapter 4, "Discussion of Programming Methods" on page 130.
2.4.5 Recoverability Aspects
Processors that use a BHT to improve their performance may experience
significant performance degradation if their BHT hardware fails. Therefore, a
designer should explore the reliability and recoverability of the BHT. The
BHT itself may not be one large array, but instead be constructed of numerous
smaller arrays. For example, assume the BHT with sixteen rows, discussed in
section 2.4.2, "Hashing" on page 68, is composed of two sub-arrays. One of
the hashing bits, bit 24, determines which of the two arrays to access. The
other three hashing bits (25-27) determine which of the eight rows in the
selected sub-array is accessed.
To recover from a failure in one of the arrays, a designer could choose one of
the following recovery algorithms:
1. Ignore bit 24 and have all branches hash into the usable array via bits
25-27.
2. Continue to have the logic use bit 24, but all accesses to the
non-functioning array are indicated as BHT misses.
With the first approach there are twice the number of branches accessing the
functioning array. Synonyms and thrashing increase, reducing the branch
prediction effectiveness. With the second approach the branches accessing the
non-functioning array are never predicted using the BHT. If opcode-based
branch prediction is available then these branches revert to that lower level of
branch prediction. The synonym and thrashing characteristics of the
functioning array do not change.
In either approach there is some branch prediction capability remaining.
When these approaches are applied to BHTs consisting of more than two
sub-arrays, however, the second approach is more resilient. Consider the case
where the sixteen-row BHT is built using four sub-arrays. Two address bits
select the sub-array (bits 24-25) and two address bits select the row within the
sub-array (bits 26-27). If one of these arrays ceases to function and the
address bits continue to be used in the same manner, then 75 percent of the
BHT continues functioning. This 75 percent experiences no change in the
thrashing and synonym behavior.
If the first approach is used, half of the sub-arrays are no longer selected and
the remaining half experience a rise in thrashing and synonyms. The resulting
branch prediction capacity is degraded to less than half of the original BHT.
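The difference between the two recovery algorithms in the four sub-array case can be illustrated with a small model. This sketch assumes that "ignoring" a select bit means forcing bit 24 to the value that steers accesses away from the failed sub-array; the text does not specify the mechanism, and the function and policy names are hypothetical.

```python
from typing import Optional, Tuple


def locate(address: int, failed: Optional[int],
           policy: str) -> Optional[Tuple[int, int]]:
    """Return (sub_array, row) for an address, or None for a forced BHT miss."""
    index = (address >> 4) & 0xF      # address bits 24-27 (bit 31 = LSB)
    sub_array = index >> 2            # bits 24-25 select the sub-array
    row = index & 0x3                 # bits 26-27 select the row
    if failed is None:
        return sub_array, row
    if policy == "ignore_bit":
        # Approach 1 (assumed form): stop using bit 24 and force it to the
        # value that avoids the failed sub-array.
        healthy_high_bit = 1 - (failed >> 1)
        sub_array = (healthy_high_bit << 1) | (sub_array & 1)
        return sub_array, row
    if policy == "force_miss":
        # Approach 2: keep the hashing; accesses to the failed array miss.
        return None if sub_array == failed else (sub_array, row)
    raise ValueError(policy)


def usable_rows(failed: int, policy: str) -> int:
    """Count distinct rows still reachable across all sixteen hash values."""
    hits = {locate(i << 4, failed, policy) for i in range(16)}
    hits.discard(None)
    return len(hits)


for policy in ("ignore_bit", "force_miss"):
    print(policy, usable_rows(2, policy), "of 16 rows still in use")
```

Under these assumptions the first approach leaves 8 of the 16 rows in use, each now shared by twice as many addresses, while the second leaves 12 rows in use with unchanged hashing.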
This example can be extended to a very large BHT built of many small
sub-arrays. When an error occurs and renders one sub-array unusable, the
impact to the branch prediction capacity of the entire BHT is minimal.
2.4.6 Implementation
Unlike the DHT, it is beneficial to keep unconditional and loop-controlling
branches in a BHT. Often opcode-based branch prediction is used to predict
the condition of these branches. The BHT is then referenced to predict the
target address. When loop-controlling branches are included in the BHT, they
are not removed from the BHT when they are not taken. This permits the
target address to be predicted the next time the loop is encountered.
2.4.6.1 Synchronous versus Asynchronous Implementations
A BHT can operate asynchronously or synchronously with the instruction
processing unit[36]. Both implementations are discussed in this section.
With the synchronous implementation the instruction processing unit sends
the BHT a branch instruction address. The BHT is accessed using this
instruction address. The I-unit is sent a response indicating a hit[37] or a
miss[38]. If a hit occurs, the BHT returns the predicted target address. The
I-unit issues the fetch for the instruction data, if necessary, then continues
instruction processing. A synchronous BHT is always responding to an outside
request, whether it be to provide a prediction or update an entry.
[36] I-unit.
The asynchronous implementation utilizes a buffer[39] between the BHT and the
I-unit. The I-unit sends an instruction address to the BHT. The BHT is
indexed using this address. If a hit occurs, the BHT puts the hit in the hit
buffer and searches at the location of the predicted target. By doing this, the
BHT directs its search along the predicted instruction path. When a hit does
not occur, the BHT assumes that there is no branch and the next sequential
instruction path is searched.
An asynchronous implementation needs all the taken branches to be contained
within the BHT because it directs BHT searching without continual input
from the I-unit. An asynchronous BHT must contain both the unconditional
and loop-controlling branches in order to independently direct instruction
processing. The asynchronous BHT is not only predicting the branch
condition and target, it is also predicting where branches exist.
[37] Entry matching the branch instruction address is found.
[38] Entry matching the branch instruction address is not found.
[39] This buffer is commonly referred to as the hit buffer.
The asynchronous BHT continues searching until all entries in the hit buffer
are full. As the I-unit removes hits from the buffer and uses them to direct
instruction processing the BHT continues to update the buffer with new
entries. As long as the branch prediction continues to be correct, the BHT
continues predicting the instruction stream independent of the I-unit.
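The asynchronous search can be modeled in a few lines. The sketch below makes several simplifying assumptions that are not taken from the text: the BHT is a dictionary from branch instruction address to predicted target, the search advances two bytes (one halfword) at a time on a miss, and `max_steps` merely bounds the loop. All names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Tuple


def fill_hit_buffer(bht: Dict[int, int], start: int, capacity: int,
                    max_steps: int = 1000,
                    step: int = 2) -> Deque[Tuple[int, int]]:
    """Search along the predicted path until the hit buffer is full.

    bht maps a branch instruction address to its predicted target.
    """
    hit_buffer: Deque[Tuple[int, int]] = deque()
    address = start
    for _ in range(max_steps):
        if len(hit_buffer) == capacity:
            break                                  # buffer full: stop searching
        target = bht.get(address)
        if target is not None:
            hit_buffer.append((address, target))   # hit: follow the target
            address = target
        else:
            address += step                        # miss: next sequential
    return hit_buffer


bht = {0x10: 0x40, 0x44: 0x10}
print([(hex(b), hex(t)) for b, t in fill_hit_buffer(bht, 0x00, 3)])
```

A reset after a wrong prediction corresponds to discarding the returned buffer and calling the function again with the corrected address as `start`.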
If a prediction by the asynchronous BHT is wrong, the I-unit resets
instruction processing at the correct address. It sends this new address to the
BHT with a signal to initiate a reset. The BHT clears the hit buffer and begins
searching at the new instruction address. The BHT receives an update, most
likely from the completion unit, as to whether it should install a branch,
delete a branch, or update an entry.
The asynchronous BHT is efficient in that the I-unit does not have to initiate
a request and wait for a response from the BHT. Except for immediately
after a reset, the BHT is ahead of the I-unit. The asynchronous BHT can
further increase performance by initiating the fetch for the target data before
the branch is decoded.
Since an asynchronous BHT is out of sync with the I-unit, it can encounter
false branches. A false branch occurs when the BHT predicts that a branch
exists at a certain location but when the I-unit decodes the instruction data it
does not find a branch. False branches can occur when only a subset of the
instruction bits are used by the BHT to identify an entry or when instruction
data has been altered since the last time it was processed.
False branches can only occur with asynchronous BHT implementations. A
synchronous BHT is only queried after a branch instruction is decoded, so it
never predicts a branch where one does not exist. Most asynchronous BHT
implementations invalidate the entry causing the false branch in order to
minimize the chance of another false branch.
2.4.6.2 Reducing the Impact of Instruction Data Fetches
Often the hit data held within the hit buffer is used to generate instruction
data fetches. If this is done soon enough, perhaps when the BHT encounters
the hit, then the wait for instruction data can be eliminated. This can be true
even in the case of a cache miss. Section 3.2.2, "Alleviating Delays Due to
Instruction Data Fetches" on page 112, illustrates the benefit of
prefetching[40] the branch target data.
However, prefetching may result in fetches being issued for instruction data
which is not needed. This data can replace useful data within the instruction
buffers. To minimize performance impact, some processors have the capability
to cancel requests when it is discovered that they are not needed. This
cancelling capability is useful for both instruction and operand data requests
since executing down an incorrect path can also cause operand fetches which
are not necessary.
2.4.6.3 Updating the BHT
When branch prediction is incorrect the BHT is updated to reflect the latest
condition of the branch. If the branch is not taken, the BHT entry
corresponding to this branch is marked invalid. Loop-controlling branches are
the exception to this rule; they are left in the BHT when not taken. If the
branch is taken, one of two responses occurs:
1. If the branch entry is not in the BHT, install the branch in the BHT. In a
BHT with a single entry per row this branch replaces the previous branch,
if one existed. In a BHT with multiple entries per row, often an LRU[41]
algorithm is employed to determine which branch the new one replaces.
2. The branch is predicted taken but the predicted target is incorrect.
   a. Update the branch entry with the latest target address.
   or
   b. The BHT is implemented such that it can indicate the presence of a
   moving target and sets the corresponding indicator bit. For a detailed
   explanation of moving targets see section 2.4.6.4, "Moving Targets."
[40] Prefetching is when a fetch for data is issued before it is known if the data is
needed. Prediction methods can drastically reduce the finite cache penalty if
prefetching is employed.
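The update rules of this section can be summarized as a short sketch for a BHT with a single entry per row. The data layout, the function name `update_bht`, and the two flags are assumptions made for illustration; in particular, how a design knows that a branch is loop-controlling is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BhtEntry:
    branch_address: int
    target: int
    moving_target: bool = False


def update_bht(bht: Dict[int, BhtEntry], row: int, branch_address: int,
               taken: bool, actual_target: int = 0,
               loop_controlling: bool = False,
               track_moving_targets: bool = False) -> None:
    """Apply the update rules after a branch completes (illustrative only)."""
    entry = bht.get(row)
    present = entry is not None and entry.branch_address == branch_address
    if not taken:
        # Not-taken branches are invalidated, except loop-controlling ones.
        if present and not loop_controlling:
            del bht[row]
        return
    if not present:
        # Case 1: install, replacing whatever occupied the row.
        bht[row] = BhtEntry(branch_address, actual_target)
    elif entry.target != actual_target:
        # Case 2: predicted taken, but to the wrong target.
        if track_moving_targets:
            entry.moving_target = True      # option b: flag a moving target
        else:
            entry.target = actual_target    # option a: remember latest target
```

A BHT with multiple entries per row would replace the dictionary assignment in case 1 with a replacement algorithm such as the LRU scheme of section 2.4.3.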
2.4.6.4 Moving Targets
Some branches have what is referred to as a moving target. A moving target
means the target address of the branch changes from one execution to the
next. A single branch may be always taken but it directs instruction
processing to a different target address, depending upon the current contents
of the register(s) used to compute the target address.
[41] Least recently used. See section 2.4.3, "Replacement Algorithms" on page 71
for a more detailed discussion.
For a DHT, moving target branches are correctly predicted because only the
branch condition is predicted. However, a BHT predicts the target incorrectly
for these branches. If cleaning up after this erroneous prediction incurs a
significant penalty, it is advantageous to recognize a moving target branch and
wait to compute the target address using the register(s).
When a moving target branch is indicated, the BHT reverts back to the DHT
level of branch prediction. An asynchronous BHT stops searching when it
encounters a moving target branch. The asynchronous BHT is given an
address at which to continue searching once the correct target is calculated.
Moving target branches account for 0.4 percent to 0.9 percent of all branch
instructions in typical workload environments.
2.4.7 Cases Which Negate the Benefit of the BHT
A BHT is defeated by a branch which alternates between taken and not taken.
The BHT always predicts this branch incorrectly. Some BHTs have a
mechanism to recognize where incorrectly predicted branches are occurring and
selectively turn off the branch prediction to avoid the cost of resetting the
instruction stream.
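The alternating case can be reproduced with a deliberately simplified model of one branch. The sketch assumes that an entry in the BHT means "predict taken", that no entry means "predict not taken", that the branch is not loop-controlling, and that there is no opcode-based fallback; the function name is hypothetical.

```python
from typing import Iterable


def count_mispredictions(outcomes: Iterable[bool]) -> int:
    """Count wrong predictions for one branch under a presence-based BHT.

    The entry is installed when the branch is taken and invalidated
    when it is not taken.
    """
    in_bht = False
    wrong = 0
    for taken in outcomes:
        if in_bht != taken:   # prediction (entry present) disagrees with outcome
            wrong += 1
        in_bht = taken        # install on taken, invalidate on not taken
    return wrong


alternating = [True, False] * 5
mostly_taken = [True] * 9 + [False]
print(count_mispredictions(alternating), "of", len(alternating))
print(count_mispredictions(mostly_taken), "of", len(mostly_taken))
```

Under these assumptions the alternating branch is mispredicted on every execution, whereas a branch taken nine times and then not taken is mispredicted only on its first and last executions.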
While the BHT offers increased performance over the DHT, it is much easier
to defeat than a DHT. In the BHT there are a couple of components to the
prediction. If any one component is predicted incorrectly, then the entire
prediction is wrong. A BHT implementation which does not recognize a moving
target branch can easily be defeated by that category of branches. The
performance degradation due to incorrectly predicted branches is discussed
further in section 3.3, "Branch Guess Wrong Penalty" on page 115.
2.5 Active Streams
An active stream is the sequential group of instructions which is currently
active within the processor's pipeline. An active stream that follows a
predicted branch is referred to as a conditional stream. These streams remain
conditional until the branch completes.












Figure 33. Active streams when a branch is predicted taken
Often the instructions following a branch are referred to as conditional
instructions within the conditional stream. To allow the processor to back up,
these instructions may execute with a temporary set of resources. When the
branch instruction is completed then the processor either continues with the
conditional instruction path or resets to the correct path.
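One way to picture the temporary set of resources is a shadow copy of the register writes made by conditional instructions, which is either merged into the committed state or thrown away when the branch completes. The class below is a minimal sketch of that idea; its name and methods are hypothetical and do not describe any specific processor.

```python
from typing import Dict


class ConditionalStream:
    """Holds register writes made by conditional instructions."""

    def __init__(self, architected: Dict[str, int]) -> None:
        self.architected = architected      # committed register state
        self.pending: Dict[str, int] = {}   # temporary resources

    def write(self, register: str, value: int) -> None:
        self.pending[register] = value

    def read(self, register: str) -> int:
        # Conditional instructions see their own earlier results first.
        return self.pending.get(register, self.architected[register])

    def resolve(self, prediction_correct: bool) -> None:
        if prediction_correct:
            self.architected.update(self.pending)  # continue on this path
        self.pending.clear()                        # otherwise discard


registers = {"r2": 1}
stream = ConditionalStream(registers)
stream.write("r2", 5)
stream.resolve(prediction_correct=False)
print(registers["r2"])  # the conditional write was discarded
```

When the prediction is wrong the committed registers are untouched, so the processor can restart at the correct path without undoing any architected state.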
The conditional stream can be thought of as a branch in a tree structure.
When a prediction is found to be incorrect then the incorrect branch is nipped
and a new branch is started. To maximize performance a processor may want
to have the capability of handling several branches within the pipeline.
Numerous conditional streams are active simultaneously when multiple branch
instructions are in the pipeline.
The number of active streams that processors can handle varies. In the simple
examples within this paper there are two active streams:
1. The stream containing the instructions preceding and including the branch
instruction.
2. The stream containing the instructions following the branch instruction.
The first active stream is not conditional; the second active stream is
conditional.
Some processors have the capability of traversing both the next sequential
instruction and target paths simultaneously. When one path is determined to
be correct, instruction processing along this path is already in progress.
When allowing conditional streams to exist, a processor must be able to
invalidate all the operations done by conditional instructions. When resetting
after an incorrect prediction the processor may have to wait a few cycles while
the correct state is restored. This delay is commonly referred to as the branch
guess wrong penalty. See section 3.3, "Branch Guess Wrong Penalty" on
page 115 for a discussion of how this penalty reduces the benefit of branch
prediction.
2.6 Condition Codes
The S/390 condition code is used by conditional branches to determine if they
branch. Since the condition code is often set by the preceding instruction,
condition code interlocks occur frequently. Making the condition code
available earlier can improve performance. One approach is to provide the
condition code as soon as it is updated. A branch instruction should not have to
wait until the instruction updating the condition code completes. Often
processors can expedite the condition code to the branch immediately after
the preceding instruction executes.
Loop-controlling branches can be internally streamlined in order to update the
condition code sooner. A BCT is used to illustrate this point. The BCT
instruction decrements a register and branches if the result is not zero. If this
instruction is implemented the way it was just described, the register is first
input to an adder to decrement it and then the result is compared to
zero. This implementation is depicted in Figure 34 on page 88.
Another way to implement the BCT is shown in Figure 35 on page 88. It is
possible to implement the same function but reduce the pathlength. If the
register is compared to one to determine if the branch is taken, the result is
the same as waiting until the register is decremented and then comparing to
zero. In parallel with the comparison, the register can be decremented. The
compare and the subtract can be performed simultaneously. The condition
code is available sooner with this implementation.
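The equivalence of the two BCT implementations can be checked directly. The sketch below assumes a 32-bit register with wraparound on decrement; the function names are illustrative. In the parallel version the branch decision does not depend on the subtracter's result, which is the point of the shorter path.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def bct_traditional(reg: int) -> Tuple[int, bool]:
    """Decrement first, then compare the result to zero."""
    result = (reg - 1) & MASK32
    return result, result != 0


def bct_parallel(reg: int) -> Tuple[int, bool]:
    """Compare the original value to one while the decrement proceeds."""
    taken = reg != 1                  # comparator path
    result = (reg - 1) & MASK32       # subtracter path, independent of compare
    return result, taken


for value in (0, 1, 2, 1000, MASK32):
    assert bct_traditional(value) == bct_parallel(value)
```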
[Diagram: Register A and the constant 1 feed a subtracter; its result, (Register A) - 1, is then compared to 0 and the condition code is updated.]
Figure 34. Traditional handling of a BCT instruction
[Diagram: Register A feeds a subtracter (sub) and a comparator (cmp) in parallel; the subtracter produces (Register A) - 1 while the comparator updates the condition code.]
Figure 35. Parallel handling of a BCT instruction
Many compilers separate branches from the instructions which set the
condition code whenever this is possible. In cases where there are unrelated
instructions which can be swapped without altering the outcome of the
program, shifting instructions can reduce both condition code and register
interlocks. See the register interlock discussion in section 3.3.2, "Address
Generation Interlocks" on page 124 for a more in-depth discussion of compiler
optimizations.
S/390 supports one condition code at any given time. This creates a source of
contention. Other architectures implement multiple condition codes. Multiple
condition codes reduce the affinity between instructions which update a
condition code and the branch instructions waiting to use the condition code.
This makes it easier to shift instructions in order to reduce condition code
interlocks. One instruction can be updating a condition code while a branch
is using another condition code. In these architectures the condition code is
not a single point of contention as it is in S/390.
2.7 Instruction Buffers
A processor may have different levels of cache in which its data can be found.
The objective is to keep the most frequently accessed data close to the
processor so as to reduce the access time. This implementation applies
equally well to instruction data. Often a processor has a few double words of
instruction data kept close to the decode unit for quick access. These buffers
of instruction data are often referred to as the instruction buffers.
2.7.1 Flushing the Instruction Buffers
In one instruction buffer implementation, each buffer holds data which is
used only once. The decode unit has a pointer to the location within the
instruction buffer that contains the next instruction to be processed. Once an
instruction is processed, the pointer either moves to the next sequential
instruction or to the branch target address. Often the decode pointer is not
implemented such that it can go back to a previously used instruction buffer.
This necessitates the constant refetching of data into the instruction buffers.
Depending upon the instruction buffer implementation, branching loops may
cycle through the instruction buffers. When instruction buffer data is only
used once, the loop fills the buffers with its data, flushing out other
instructions. Unless the program is a self-modifying program, the instruction
data remains constant during the execution of a loop. This flushing of the
instruction buffers should be avoided. The code in Figure 36 on page 91 is
used to illustrate what is meant by flushing the instruction buffers.
OFFSET  OBJECT CODE   LINE#  PSEUDO ASSEMBLY LISTING
                             00049 |  *   for( i = 1; i < 1000; i+=2) {
0000DE  4120 0001      115          LA    r2,1
0000E2  5800 D0EC      116          L     r0,236(,r13)
0000E6  41E0 0002      117          LA    r14,2
0000EA  41F0 03E7      118          LA    r15,999
0000EE  0700           119          NOPR
0000F0                 120   04L2   DS    0D
                             00050 |  *

0000F2  872E 30A0      128          BXLE  r2,r14,04L2
0000F6                 129   04L4   DS    0H
0000F6  5000 D0EC      130          ST    r0,236(,r13)
                             00052 |  *   for( i = 2; i < 1000; i+=2) {
0000FA  41E0 0002      132          LA    r14,2
0000FE  182E           133          LR    r2,r14
000100  5800 D0F0      134          L     r0,240(,r13)
000104  41F0 03E7      135          LA    r15,999

00010A  872E 30B8      140          BXLE  r2,r14,04L7
00010E                 141   04L9   DS    0H
00010E  5000 D0F0      142          ST    r0,240(,r13)
Figure 36. Example CXX1, two FOR loops assembler code
For this example, the instruction buffers hold a double word of instruction
data each. Data in the instruction buffers is only accessed once. There are a
total of eight instruction buffers. Figure 37 on page 93 illustrates the
contents of the instruction buffers after processing the first five iterations
of the loop. Notice that after a few more iterations of the loop, all of the
instruction buffers contain the eight bytes starting at 0000F0. If the loop
(the AR and BXLE instructions in this example) encompasses more than one
double word then the instruction buffers are flushed after fewer iterations.
Buffer #0 | 0000D8 - 0000DF |
Buffer #1 | 0000E0 - 0000E7 |
Buffer #2 | 0000E8 - 0000EF |
Buffer #3 | 0000F0 - 0000F7 |
Buffer #4 | 0000F0 - 0000F7 |
Buffer #5 | 0000F0 - 0000F7 |
Buffer #6 | 0000F0 - 0000F7 |
Buffer #7 | 0000F0 - 0000F7 |
Figure 37. Flushing the instruction buffers
2.7.2 Branching in the Instruction Buffers
If the instruction buffers are implemented such that the decode pointer can go
back to a previously used buffer, then after the BXLE at 0000F2 the decode
pointer is moved back to the AR at 0000F0. This operation is often referred
to as branching in the instruction buffers. Figure 38 on page 95 illustrates the
outcome of this implementation. The result is that instruction processing
operates out of instruction buffer #3 for a substantial period of time. The
entire program fits within the eight instruction buffers.

















Figure 38. Loop residing in the instruction buffers
Loop-controlling branches cause the same subset of instruction data to be
used numerous times before moving on to another set of instructions. Since
these branches usually go to the same target address and are taken a large
percentage of the time, they benefit from branching within the instruction
buffers.
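The contrast between flushing and branching in the instruction buffers can be simulated for the loop of Figure 36. The sketch assumes eight buffers of one double word each, filled round-robin, with the straight-line code entering the loop from the double word at 0000D8; the round-robin fill order and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

DOUBLE_WORD = 8


def run_loop(iterations: int, branch_in_buffers: bool,
             n_buffers: int = 8) -> Tuple[List[Optional[int]], int]:
    """Return (buffer contents, number of fetches) for the Figure 36 loop.

    Each buffer records the starting address of the double word it holds.
    """
    buffers: List[Optional[int]] = [None] * n_buffers
    fetches = 0

    def fetch(address: int) -> None:
        nonlocal fetches
        buffers[fetches % n_buffers] = address   # round-robin fill (assumed)
        fetches += 1

    # Straight-line code leading into the loop: 0000D8, 0000E0, 0000E8.
    for address in range(0xD8, 0xF0, DOUBLE_WORD):
        fetch(address)
    fetch(0xF0)                      # first iteration of the loop body
    for _ in range(iterations - 1):  # each taken BXLE returns to 0000F0
        if not branch_in_buffers:
            fetch(0xF0)              # data used once: refetch the target
        # otherwise the decode pointer simply moves back within the buffer
    return buffers, fetches


for mode in (False, True):
    contents, fetches = run_loop(5, branch_in_buffers=mode)
    shown = ["------" if a is None else f"{a:06X}" for a in contents]
    print(f"branch_in_buffers={mode}: {fetches} fetches, buffers {shown}")
```

With data used only once, five iterations leave buffers #3 through #7 all holding the double word at 0000F0, as in Figure 37. With branching in the buffers, the loop runs out of buffer #3 and no further fetches are issued.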
A loop slightly larger than the capacity of the instruction buffers will still
flush through the instruction buffers. With any mechanism, there is a
program which can be written to defeat the mechanism. For branching in the
instruction buffers, this is a simple loop, slightly larger than the capacity of
the instruction buffers.
An advantage of branching within the instruction buffers is the reduction of
traffic into the instruction buffers. No additional fetch activity into the
instruction buffers is necessary during the execution of this tight loop. When
trying to minimize the complexity of the processor, inefficiencies[42] may
result. The processor designer has to balance control complexity with data
handling efficiency. Branching in the instruction buffers is a very simple,
efficient method to improve branch performance.
[42] Fetching more instruction data.
2.8 Conclusions
The techniques for branch prediction vary in complexity and performance
potential. Chapter 3, "Performance Gains Due to Branch Prediction" on
page 98 illustrates the performance gains these hardware mechanisms provide.
Other techniques may not predict branches, but instead make their processing
more efficient. For example, expediting the condition code decreases the
impact of condition code interlocks. Branching in the instruction buffers
minimizes the fetch activity into the instruction buffers through the reuse of
current data. Many of these methods can be combined to improve branch
performance.
Chapter 3. Performance Gains Due to Branch Prediction
Chapter 3 builds upon the processor timings introduced in Chapter 1. It does
this by applying the branch prediction techniques discussed in Chapter 2. The
shortcomings, implementation considerations, and performance benefits of
the branch prediction techniques were discussed in Chapter 2. These aspects
of branch prediction are reinforced by the timing diagrams within this
chapter.
Figure 6 on page 12 exhibits the significant performance benefits that parallel
processors derive through the overlapping of instructions. Figure 16 on
page 27 shows the degradation a branch instruction introduces by reducing
the operations that can be done concurrently. In order for parallel processors
to benefit from branch prediction, instruction processing needs to continue as
soon as the next instruction address is known. If the branch is predicted
taken, then as soon as the target address can be calculated by stage (O), the
target instruction needs to be processed. If the branch is predicted not taken,
then the instruction processing continues with the next sequential instruction.
3.1 Predicting Branches Taken/Not Taken
The simplest form of branch prediction is to predict the branch condition. Two
methods of implementing this type of branch prediction are the DHT and
opcode-based branch prediction. The processor examples, in the following
sections, implement the DHT/opcode level of branch prediction. This type of
branch prediction improves the performance when condition code interlocks
occur.
3.1.1 Parallel Processor A
Figure 39 on page 100 illustrates the timing of Parallel Processor A. The
branch target is loaded into the decode unit the cycle after the (O) stage. If
the branch prediction is correct, the timing illustrated by Figure 39 on
page 100 is achieved. Branch prediction saves eight cycles when compared to
the timing shown in Figure 18 on page 29. The resulting CPI is 5. Branch
prediction results in a 21.1[43] percent reduction in cycles and a 1.3 decrease
in CPI.
[43] Eight cycles saved divided by 38 original cycles is 0.211, or 21.1 percent.
[Timing diagram: six instructions, one of them the branch instruction, overlapped through the pipeline stages N, I, D, O, F, E, and C; 30 cycles in total.]
Figure 39. Six instructions processed, one operand access at a time, no register
interlock, branch predicted correctly
If the branch target instruction data is not in the instruction buffers then the
timing in Figure 40 is achieved. Branch prediction saves eight cycles. This is
an 18.6[44] percent performance gain and a 1.3 decrease in CPI.
[Timing diagram: six instructions, one of them the branch instruction, overlapped through the pipeline stages N, I, D, O, F, E, and C; 35 cycles in total.]
Figure 40. Six instructions processed, one operand access at a time, no register
interlock, branch predicted correctly, target not in instruction buffers
[44] Eight cycles saved divided by 43 original cycles is 0.186, or 18.6 percent.
When a register interlock does occur, the branch is delayed and the timing of
37 cycles is achieved. This timing is illustrated in Figure 41 and has a CPI of
6.2. Comparing Figure 41 to Figure 19 on page 30 shows that branch
prediction has saved two cycles. This is due to the fact that the (I) stage can
now occur on the branch execution cycle, the (E) stage, rather than waiting
until the branch completes.



















[Timing diagram: six instructions overlapped through the pipeline stages N, I, D, O, F, E, and C; 37 cycles in total.]
Figure 41. Six instructions processed, one operand access at a time, register
interlock, branch predicted correctly
3.1.2 Parallel Processor B
With Parallel Processor B, a timing of 21 cycles is obtained using the
DHT/opcode level of branch prediction, illustrated by Figure 42 on page 104.
This is a savings of eight cycles over the timing shown in Figure 22 on
page 34. Both parallel processors receive the same cycle reduction and CPI
decrease due to branch prediction. However, for Parallel Processor B, the
eight cycle gain translates into a 27.6[45] percent performance gain. An identical
savings in CPI translates into a larger performance savings because Parallel
Processor B is initially functioning at a lower CPI.
[45] Eight cycles saved divided by 29 original cycles is 0.276, or 27.6 percent.
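The percentages and CPI figures quoted for the two parallel processors follow from simple arithmetic on the cycle counts, which the sketch below reproduces. The cycle counts are taken from the text and footnotes of this chapter; the helper name `summarize` and the case labels are illustrative.

```python
from typing import Tuple

INSTRUCTIONS = 6

# (description, cycles without prediction, cycles with prediction)
CASES = [
    ("Processor A, target in buffers", 38, 30),
    ("Processor A, target not in buffers", 43, 35),
    ("Processor B, target in buffers", 29, 21),
    ("Processor B, target not in buffers", 34, 26),
]


def summarize(before: int, after: int) -> Tuple[float, float, float]:
    """Return (percent reduction in cycles, CPI before, CPI after)."""
    reduction = 100.0 * (before - after) / before
    return (round(reduction, 1),
            round(before / INSTRUCTIONS, 2),
            round(after / INSTRUCTIONS, 2))


for name, before, after in CASES:
    percent, cpi_before, cpi_after = summarize(before, after)
    print(f"{name}: {percent}% fewer cycles, CPI {cpi_before} -> {cpi_after}")
```

The same eight-cycle saving is a larger fraction of Parallel Processor B's shorter baseline, which is why its percentage gain is higher.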
[Timing diagram: six instructions, one of them the branch instruction, overlapped through the pipeline stages N, I, D, O, F, E, and C; 21 cycles in total.]
Figure 42. Six instructions processed, two operand accesses at a time, no register
interlock, branch predicted correctly
Chapter 3. Performance Gains Due to Branch Prediction 104
If the instruction data following the branch is not in the instruction buffers
then the timing of 26 cycles, illustrated in Figure 43, is obtained. This is a
savings of eight cycles, a performance gain of 23.5 percent [46].

[Timing diagram omitted; total: 26 cycles]
Figure 43. Six instructions processed, two operand accesses at a time, no register
interlock, branch predicted correctly, target not in instruction buffers
46 Eight cycles saved divided by 34 original cycles is 0.235, or 23.5 percent.
The timing, when a register interlock occurs, is shown in Figure 44 on
page 106. This is a two-cycle savings when compared to Figure 23 on
page 35.

[Timing diagram omitted; total: 28 cycles]
Figure 44. Six instructions processed, two operand accesses at a time, register
interlock, branch predicted correctly
3.2 BHT Branch Prediction
Both of the parallel processors encounter instruction processing delays due to
branches. Some prediction methods alleviate delays caused by condition code
interlocks, but give minor benefit when faced with other branching delays. A
branch prediction methodology which predicts the branch target address can
alleviate the impact of both the register interlock and target fetch delays, in
addition to condition code interlock delays.
Predicting the target address allows processing to continue without waiting for
the target address to be calculated. This type of branch prediction can be
incorrect in two respects. First, it can predict the condition of a branch
incorrectly. Second, it can predict the target address incorrectly. Thus, even
when the branch is correctly predicted taken, a reset is still necessary if
instruction processing was directed to the wrong target address.
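The two ways in which a target-address prediction can fail may be sketched as
follows. The record layout and all names are hypothetical and serve only to
illustrate the distinction; they do not describe the BHT hardware itself.

```cpp
#include <cstdint>

// Hypothetical record of what a BHT predicted for one branch.
struct BhtPrediction {
    bool taken;            // predicted direction
    std::uint32_t target;  // predicted target address (meaningful only if taken)
};

enum class Outcome { Correct, WrongDirection, WrongTarget };

// Compare a prediction with the resolved branch.
Outcome classify(const BhtPrediction& p, bool actual_taken, std::uint32_t actual_target) {
    if (p.taken != actual_taken) return Outcome::WrongDirection;
    if (actual_taken && p.target != actual_target) return Outcome::WrongTarget;
    return Outcome::Correct;
}

// Either kind of error forces a reset of instruction processing.
bool needs_reset(Outcome o) { return o != Outcome::Correct; }
```

A branch predicted taken to one address that is in fact taken to another
address is classified as WrongTarget and still requires a reset.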
3.2.1 Alleviating Register Interlocks
The timings in the following sections use BHT branch prediction. These
timings are compared to timings without branch prediction which have both
condition code and register interlocks.
3.2.1.1 Parallel Processor A
A synchronous BHT achieves the timing depicted by Figure 45 on page 109.
An asynchronous BHT can achieve the timing illustrated by Figure 46 on
page 110. This is a gain of eight or nine cycles when compared to the
DHT/opcode level of branch prediction in Figure 41 on page 102. It is a
gain of ten or eleven cycles over the model with no branch prediction, the
timing of which is shown in Figure 19 on page 30. The synchronous BHT
has improved performance by 25.6 percent [47] and the asynchronous BHT has
improved performance by 28.2 percent [48] when compared to the processor
without branch prediction.
47 Ten cycles saved divided by 39 original cycles is 0.256, or 25.6 percent.
48 Eleven cycles saved divided by 39 original cycles is 0.282, or 28.2 percent.
[Timing diagram omitted; total: 29 cycles]
Figure 45. Six instructions processed, one operand access at a time, branch
predicted correctly, synchronous BHT
[Timing diagram omitted; total: 28 cycles]
Figure 46. Six instructions processed, one operand access at a time, branch
predicted correctly, asynchronous BHT
3.2.1.2 Parallel Processor B
Parallel Processor B, with a synchronous BHT, has the timing shown in
Figure 47 on page 111, with a CPI of 3.3. With an asynchronous BHT this
timing improves to 3.2 CPI, shown in Figure 48 on page 112. The savings is
eight or nine cycles when compared to the DHT/opcode level of branch
prediction, which is illustrated in Figure 44 on page 106. An improvement of ten
or eleven cycles is observed relative to the model without branch prediction,
Figure 23 on page 35. When compared to the processor without branch
prediction, the synchronous BHT has improved performance by 33.3 percent [49]
and the asynchronous BHT has improved performance by 36.7 percent [50].
[Timing diagram omitted; total: 20 cycles]
Figure 47. Six instructions processed, two operand accesses at a time, branch
predicted correctly, synchronous BHT
49 Ten cycles saved divided by 30 original cycles is 0.333, or 33.3 percent.
50 Eleven cycles saved divided by 30 original cycles is 0.367, or 36.7 percent.
[Timing diagram omitted; total: 19 cycles]
Figure 48. Six instructions processed, two operand accesses at a time, branch
predicted correctly, asynchronous BHT
3.2.2 Alleviating Delays Due to Instruction Data Fetches
There are several examples in earlier sections showing the impact on
performance if the target instruction data is not in the instruction buffers.
Figure 49 on page 113 and Figure 50 on page 114 show the latest cycle at
which the BHT can predict the target address and avoid any instruction data
fetch penalty. Only with an asynchronous BHT is it feasible to obtain
predictions this far in advance. Waiting for the branch instruction to decode,
as the synchronous BHT does, does not permit the instruction data fetch
penalty to be completely avoided.

[Timing diagram omitted; total: 28 cycles]
Figure 49. Six instructions processed, one operand access at a time, branch
predicted correctly
[Timing diagram omitted; total: 19 cycles. Diagram annotations: "predict",
"fetch instruction data"]
Figure 50. Six instructions processed, two operand accesses at a time, branch
predicted correctly
Compared to the processor without branch prediction, the asynchronous BHT
can save up to sixteen cycles for each of the parallel processors when the
instruction data is not in the instruction buffers. This savings translates into a
36.4 [51] and 45.7 [52] percent performance improvement, respectively. If the
storage penalty is greater than five cycles, the potential performance gain is
greater.
3.3 Branch Guess Wrong Penalty
The previous sections do not illustrate what occurs when the branch
prediction is incorrect. This is a situation which processor designers must address.
If a processor incurs a penalty when a prediction is incorrect then the timing
achieved is worse than that without branch prediction. These additional
cycles are referred to as the branch guess wrong penalty. Some branch
prediction methodologies do not encounter additional penalties for being wrong; the
timing achieved is identical to the timing without branch prediction. When
there is a penalty for predicting a branch wrong, this penalty needs to be kept
to a minimum.
51 Sixteen cycles saved divided by 44 original cycles is 0.364, or 36.4 percent.
52 Sixteen cycles saved divided by 35 original cycles is 0.457, or 45.7 percent.
In order for a processor to maximize the benefit of branch prediction, the
penalty for incorrect predictions needs to be minimized. The following
example illustrates this point. The equation which is used to calculate cycles
saved is shown in Figure 51.
(%correct * average cycles saved) - (%wrong * average cycles for reset)
= total cycles gained per 100 branches
Figure 51. Branch prediction benefit equation
If after each branch guessed wrong it takes approximately five cycles to reset
and continue instruction processing at the correct location, then the branch
guess wrong penalty is five cycles. If each correctly predicted branch gains
three cycles and the prediction is correct 70 percent of the time then the cycle
gain is:
(70*3) - (30*5) = 60 cycles per 100 branches
If 25 percent [53] of the instructions are branches the result is a 0.15 (60/400) CPI
gain. This is a small savings for the investment.
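The calculation above can be written as a small routine. This is a sketch
only; the function and parameter names are assumptions chosen for
illustration, and percentages are passed as whole numbers as in Figure 51.

```cpp
// Cycles gained per 100 branches, following the equation in Figure 51.
int cycles_gained_per_100_branches(int pct_correct, int cycles_saved, int reset_penalty) {
    int pct_wrong = 100 - pct_correct;
    return pct_correct * cycles_saved - pct_wrong * reset_penalty;
}

// CPI gain, given the percentage of instructions that are branches.
double cpi_gain(int gained_per_100_branches, int pct_branches) {
    // 100 branches occur in (100 * 100 / pct_branches) instructions,
    // which is 400 instructions when 25 percent are branches.
    double instructions = 100.0 * 100.0 / pct_branches;
    return gained_per_100_branches / instructions;
}
```

With 70 percent accuracy, three cycles saved per correct prediction and a
five-cycle reset, the first function returns 60 and the second converts this
to the 0.15 CPI gain quoted above.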
Three areas can be optimized to increase the branch prediction performance:
Improve the prediction accuracy.
Reduce the guessed wrong penalty.
Increase the cycles saved through branch prediction.
The prediction accuracy can be improved slightly by modifying the branch
prediction algorithm. However, the prediction accuracy is largely determined
by the workload characteristic: how predictable the branches are. After
reaching a given point, investment in additional hardware does not provide a
significant increase in branch prediction accuracy.
The cycles saved through branch prediction are also limited to the delay
introduced by the branch instruction. Algorithms such as the asynchronous BHT
provide additional benefit by reducing the impact of data fetch delays. A
branch prediction algorithm can be characterized by the cycles it saves. This
is illustrated by the timing diagrams earlier in this thesis. However, a penalty
for predicting incorrectly must be subtracted from this gain. If this penalty is
too large, branch prediction is not a worthwhile performance enhancement.
53 Typical of most S/390 workload environments.
In the next equation the branch guessed wrong penalty is decreased to one
cycle. The cycles saved is 180, therefore the CPI gain is 0.45 (180/400). The
CPI gain from decreasing the branch guessed wrong penalty from five cycles to
one cycle is 0.30 (0.45 - 0.15). This is a significant increase from changing
just this one variable.
(70*3) - (30*1) = 180 cycles per 100 branches
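Setting the gain in the Figure 51 equation to zero gives the prediction
accuracy at which branch prediction neither helps nor hurts. The following
sketch is an algebraic consequence of that equation rather than a result
stated in this thesis, and the function name is an assumption.

```cpp
// Break-even accuracy p (as a fraction) solves
//   p * cycles_saved = (1 - p) * reset_penalty
// which gives p = reset_penalty / (cycles_saved + reset_penalty).
double break_even_accuracy(double cycles_saved, double reset_penalty) {
    return reset_penalty / (cycles_saved + reset_penalty);
}
```

With three cycles saved per correct prediction, a five-cycle penalty requires
better than 62.5 percent accuracy before any cycles are gained, while a
one-cycle penalty lowers that threshold to 25 percent.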
3.3.1 Moving Targets
MT1 and MT2 are programming examples which contain moving target
branches. MT1, shown in Figure 52 on page 119, updates the register
containing the target address immediately before the branch is executed. The
target address in register 15 is updated by the AR 15,3 (add register)
instruction, causing an AGI [54] on the target address. MT2, shown in Figure 53 on
page 120, updates the target address well before the branch is executed so
there is no AGI.
OFFSET  OBJECT CODE      LINE#  PSEUDO ASSEMBLY LISTING
00000A  5820 C4B2 004B8      9           L     2,=F'10'
00000E  5830 C4B6 004BC     10           L     3,=F'2'
000012  58F0 C4BA 004C0     11  INITIAL  L     15,=A(INITIAL+2)
000016                      12  FIRSTBR  EQU   *
000016  1852                13           LR    5,2
000018  1A83                14           AR    8,3
00001A  1B73                15           SR    7,3
00001C  1862                16           LR    6,2
00001E  1A93                17           AR    9,3
000020  1B83                18           SR    8,3
000022  1872                19           LR    7,2
000024  1A53                20           AR    5,3
000026  1B93                21           SR    9,3
000028  1882                22           LR    8,2
00002A  1B63                23           SR    6,3
00002C  1A53                24           AR    5,3
00002E  1892                25           LR    9,2
000030  1A73                26           AR    7,3
000032  1B63                27           SR    6,3
000034  1AF3                28           AR    15,3      <== Update GPR 15
000036  062F                29           BCTR  2,15      <== Use GPR 15
Figure 52. Example MT1, AGI on BCTR target due to AR
54 Address Generation Interlock. An address generation interlock is a register
interlock in which the register is needed to compute an address.
OFFSET  OBJECT CODE      LINE#  PSEUDO ASSEMBLY LISTING
00000A  5820 C4B2 004B8      9           L     2,=F'10'
00000E  5830 C4B6 004BC     10           L     3,=F'2'
[listing lines missing in source]
000016  1852                13           LR    5,2
000018  1A83                14           AR    8,3
00001A  1B73                15           SR    7,3
00001C  1862                16           LR    6,2
00001E  1A93                17           AR    9,3
000020  1B83                18           SR    8,3
000022  1872                19           LR    7,2
000024  1A53                20           AR    5,3
000026  1B93                21           SR    9,3
000028  1AF3                22           AR    15,3      <== Update GPR 15
00002A  1882                23           LR    8,2
00002C  1B63                24           SR    6,3
00002E  1A53                25           AR    5,3
000030  1892                26           LR    9,2
000032  1A73                27           AR    7,3
000034  1B63                28           SR    6,3
000036  062F                29           BCTR  2,15      <== Use GPR 15
Figure 53. Example MT2, no AGI on BCTR target
The target address of the branch changes with each iteration of the loop. This
moving target cannot be predicted correctly by the BHT. The correct target
address is known only after the register containing the target address is
updated.
The relative performance of these examples is shown in Table 5 on page 122. Both
BHT implementations are asynchronous BHTs. The first BHT implementation
does not recognize moving targets and always traverses down a wrong path,
incurring a branch guessed wrong penalty. The second BHT implementation
recognizes that a moving target branch has occurred. Subsequent encounters predict
the branch taken but wait until the register is updated to calculate the target
address. See section 4.1, "Introduction" on page 131 for a more detailed
description of the processors used for these performance measurements.
Table 5. Performance Comparison between MT1 and MT2
Prediction Method                                MT1    MT2
BHT does not recognize moving target branches    1.00   0.99
BHT recognizes moving target branches            0.98   0.90
DHT                                              0.98   0.89
Opcode Branch Prediction                         0.98   0.89
No Branch Prediction                             0.98   0.90
Note: Performance is normalized to MT1 running with the BHT implementation
which does not recognize moving targets. This is done by
dividing the cycles needed to run each variation by the cycles consumed by
MT1 with the BHT which does not recognize moving targets. The smaller
the number, the better the performance.
Comparing the two BHT runs of MT2, recognizing the moving target gives
the BHT a nine percent gain in performance (0.99 versus 0.90). The BHT that
recognizes a moving target branch closely matches the performance of the
DHT. This is because the BHT reverts to a DHT level of branch
prediction. For branches with unpredictable targets the DHT is the optimal branch
prediction method.
The impact of the AGI can be determined when comparing the MT1 and
MT2 runs. The AGI is responsible for approximately an eight percent
reduction in performance (0.98 versus 0.90). AGIs cause serious performance
degradation in highly pipelined processors. This performance degradation can
often be alleviated by the careful placement of instructions by the compiler.
If a compiler recognizes register dependencies and moves instructions to
separate the ones with the dependencies, significant performance improvements
can be achieved.
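The kind of dependency check a compiler might apply can be sketched as
follows. The instruction record is a deliberately simplified, hypothetical
model (one written register and one address register per instruction), and
the distance at which an AGI disappears depends on the pipeline, which this
sketch does not model.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction record: the register it writes and
// the register (if any) it needs to form an address. -1 means none.
struct Instr {
    int writes;
    int address_reg;
};

// Number of instructions between the most recent writer of the address
// register and the instruction at `index` that uses it. Returns -1 if the
// instruction needs no address register or no earlier writer is found.
int writer_distance(const std::vector<Instr>& code, std::size_t index) {
    int reg = code[index].address_reg;
    if (reg < 0) return -1;
    for (std::size_t back = 1; back <= index; ++back) {
        if (code[index - back].writes == reg) return static_cast<int>(back);
    }
    return -1;
}
```

Applied to the tail of MT1, where AR 15,3 immediately precedes BCTR 2,15, the
distance is 1. In MT2 six instructions separate the two, so the distance is 7.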
3.3.2 Address Generation Interlocks
Address generation interlock is a type of register interlock. It occurs when a
register is updated which is used by a subsequent instruction to compute an
address. This instruction must wait until the register is updated. The
preceding section had an example (MT1) where an AGI and a moving target
both occurred. This section focuses on the impact of just the AGI.
AGI1 and AGI2 are based upon the examples in the previous section, MT1
and MT2. They do not have a moving target as the register is updated but
its value is not changed (zero is added to it). They cannot be directly
compared to the performance runs of MT1 and MT2 as they execute more
instructions.
AGI1, shown in Figure 54 on page 125, illustrates the BHT benefit in a
predictable AGI situation. AGI2, shown in Figure 55 on page 126, does not
have an AGI.
OFFSET  OBJECT CODE      LINE#  PSEUDO ASSEMBLY LISTING
00000A  5820 C4B2 004B8      9           L     2,=F'10'
00000E  5830 C4B6 004BC     10           L     3,=F'0'
000012  58F0 C4BA 004C0     11  INITIAL  L     15,=A(FIRSTBR)
000016                      12  FIRSTBR  EQU   *
000016  1852                13           LR    5,2
000018  1A83                14           AR    8,3
00001A  1B73                15           SR    7,3
00001C  1862                16           LR    6,2
00001E  1A93                17           AR    9,3
000020  1B83                18           SR    8,3
000022  1872                19           LR    7,2
000024  1A53                20           AR    5,3
000026  1B93                21           SR    9,3
000028  1882                22           LR    8,2
00002A  1B63                23           SR    6,3
00002C  1A53                24           AR    5,3
00002E  1892                25           LR    9,2
000030  1A73                26           AR    7,3
000032  1B63                27           SR    6,3
000034  1AF3                28           AR    15,3      <== Update GPR 15
000036  062F                29           BCTR  2,15      <== Use GPR 15
Figure 54. Example AGI1, AGI on BCTR target due to AR
OFFSET  OBJECT CODE      LINE#  PSEUDO ASSEMBLY LISTING
00000A  5820 C4B2 004B8      9           L     2,=F'10'
00000E  5830 C4B6 004BC     10           L     3,=F'0'
000012  58F0 C4BA 004C0     11  INITIAL  L     15,=A(FIRSTBR)
000016                      12  FIRSTBR  EQU   *
000016  1852                13           LR    5,2
000018  1A83                14           AR    8,3
00001A  1B73                15           SR    7,3
00001C  1862                16           LR    6,2
00001E  1A93                17           AR    9,3
000020  1B83                18           SR    8,3
000022  1872                19           LR    7,2
000024  1A53                20           AR    5,3
000026  1B93                21           SR    9,3
000028  1AF3                22           AR    15,3      <== Update GPR 15
00002A  1882                23           LR    8,2
00002C  1B63                24           SR    6,3
00002E  1A53                25           AR    5,3
000030  1892                26           LR    9,2
000032  1A73                27           AR    7,3
000034  1B63                28           SR    6,3
000036  062F                29           BCTR  2,15      <== Use GPR 15
Figure 55. Example AGI2, no AGI on BCTR target
The BHT is able to obtain nearly the same performance on both AGI1 and
AGI2 (0.93 versus 0.92), even though AGI1 has an AGI and AGI2 does not.
With the less sophisticated branch prediction methods the AGI causes a seven
percent performance decrease (0.92 versus 0.99). The BHT is beneficial in a
predictable AGI situation.
Table 6. Performance Comparison between AGI1 and AGI2
Prediction Method                                AGI1   AGI2
BHT does not recognize moving target branches    0.93   0.92
[remaining rows not recoverable from source]
Note: Performance is normalized to AGI1 running with no branch prediction.
This is done by dividing the cycles needed to run each variation by
the cycles consumed by AGI1 running with no branch prediction. The
smaller the number, the better the performance.
3.3.3 Conclusions
Branch prediction is an investment in hardware which can reduce the delays in
the pipeline caused by branching instructions. Minimizing the delays increases
the processor utilization and throughput. Branch prediction hardware varies
in complexity, cost, and performance benefit. Cost and performance are
crucial considerations when deciding what investment is appropriate. The
number of circuits consumed, the person-hours required to implement and test
the prediction hardware, and the additional complexity needed to handle
branch prediction all have to be weighed against the actual performance gain
when deciding upon the appropriate approach.
Chapter 4. Discussion of Programming Methods
Chapter 3, "Performance Gains Due to Branch Prediction" on page 98, examined
branch prediction techniques which improve performance. That chapter
also briefly introduced programming techniques that can affect performance.
Chapter 4 expands upon this discussion by examining various programming
choices which can inadvertently induce performance degradation. Different
programming methods are used to produce the same result. As a result, the
quantity and type of instructions executed and the performance vary for each
implementation. Some of the performance degradation is more severe when
branch prediction is employed.
4.1 Introduction
All of the performance measurements within this thesis use the following
processor model:
The processor is able to decode two instructions a cycle.
The processor is able to complete two instructions a cycle.
The processor has four execution units which are able to execute
instructions out of order.
The processor can have at most 32 instructions active in the pipeline.
The processor can have up to two branches active with two conditional
streams.
The branch guessed wrong penalty is approximately two cycles.
Four levels of branch prediction are measured:
No branch prediction. The processor must wait until the branch completes
before continuing with the next instruction.
Chapter 4. Discussion of Programming Methods 131
Opcode-based branch prediction. The processor predicts the branch taken or
not taken based upon the operation code of the branch instruction.
Conditional branches are predicted not taken.
Decode History Table (DHT). The processor predicts the branch taken or
not taken using a DHT. The DHT has 256 entries and is two-way
associative. It operates in conjunction with opcode-based branch prediction.
Branch History Table (BHT). The processor predicts the branch taken or
not taken and the target address of the branch based upon the data
contained in an asynchronous BHT. The BHT has 256 entries and is two-way
associative. The BHT operates in conjunction with opcode-based branch
prediction.
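To make the table geometry concrete, the following sketch models a 256-entry,
two-way associative history table as 128 sets of two entries, falling back to
the opcode-based prediction when a branch has no entry. Everything beyond
the size and associativity is an assumption made for illustration: the single
history bit per entry, the address bits used for index and tag, and the
replacement policy are not taken from the measured processor model.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative decode history table: 256 entries, two-way set associative.
class DecodeHistoryTable {
public:
    // Returns the predicted direction. `opcode_default` is the opcode-based
    // prediction, used when the branch address has no entry in the table.
    bool predict(std::uint32_t addr, bool opcode_default) const {
        const Set& set = sets_[index(addr)];
        for (const Entry& e : set.ways) {
            if (e.valid && e.tag == tag(addr)) return e.taken;
        }
        return opcode_default;
    }

    // Records the resolved direction of the branch at `addr`.
    void update(std::uint32_t addr, bool taken) {
        Set& set = sets_[index(addr)];
        for (std::size_t w = 0; w < 2; ++w) {
            Entry& e = set.ways[w];
            if (e.valid && e.tag == tag(addr)) {
                e.taken = taken;
                set.next_victim = 1 - w;  // the other way is replaced next
                return;
            }
        }
        Entry& victim = set.ways[set.next_victim];
        victim.valid = true;
        victim.tag = tag(addr);
        victim.taken = taken;
        set.next_victim = 1 - set.next_victim;
    }

private:
    struct Entry { bool valid; std::uint32_t tag; bool taken; };
    struct Set { std::array<Entry, 2> ways; std::size_t next_victim; };

    // Assumed mapping: halfword-aligned addresses, bits 1-7 select the set.
    static std::size_t index(std::uint32_t addr) { return (addr >> 1) % 128; }
    static std::uint32_t tag(std::uint32_t addr) { return addr >> 8; }

    std::array<Set, 128> sets_{};  // value-initialized: all entries invalid
};
```

A third branch that maps to an already full set displaces one of the two
resident entries, after which the displaced branch is again predicted by
opcode alone.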
4.2 Reducing The Decision Making Within a Program
There is a performance benefit to programming such that the number of
branch instructions is kept to a minimum. Unfortunately, it is not possible to
eliminate a large quantity of the branches since they are inherently necessary.
Most programs modify their execution based upon the current value of
variables. Every IF, DO WHILE, or other decision-making statement within a
high-level program translates into one or more branch instructions. The
following examples illustrate the performance gains which can be achieved by
minimizing the decision-making within a program.
4.2.1 C++ Examples
Examples CXX1 and CXX2 illustrate two methods to sum all the even and
odd numbers from 1 up to 999 into even and odd sums. CXX1, shown in
Figure 56 on page 134, uses two FOR loops. The first loop sums the odd
numbers from 1 up to 999. The second loop sums the even numbers from 2
up to 998. There is no decision-making within the program as to whether i is
even or odd.
for(i = 1; i < 1000; i+=2)
    sum_odd += i;
for(i = 2; i < 1000; i+=2)
    sum_even += i;
Figure 56. Example CXX1, two FOR loops
OFFSET  OBJECT CODE  LINE#  PSEUDO ASSEMBLY LISTING
00049 |  * for(i = 1; i < 1000; i+=2) {
0000DE  4120 0001      115           LA    r2,1
0000E2  5800 D0EC      116           L     r0,236(,r13)
0000E6  41E0 0002      117           LA    r14,2
0000EA  41F0 03E7      118           LA    r15,999
0000EE  0700           119           NOPR
[listing lines missing in source]
0000F2  872E 30A0      128           BXLE  r2,r14,@4L2
0000F6                 129  @4L4     DS    0H
0000F6  5000 D0EC      130           ST    r0,236(,r13)
00052 |  * for(i = 2; i < 1000; i+=2) {
0000FA  41E0 0002      132           LA    r14,2
0000FE  182E           133           LR    r2,r14
000100  5800 D0F0      134           L     r0,240(,r13)
000104  41F0 03E7      135           LA    r15,999
[listing lines missing in source]
00010A  872E 30B8      140           BXLE  r2,r14,@4L7
00010E                 141  @4L9     DS    0H
Figure 57. Example CXX1, two FOR loops, assembler code
CXX2 has one FOR loop with an IF statement used to direct program
execution. Program execution is dependent upon the current value of i. If i is
even it is added to sum_even. Otherwise it is added to sum_odd.
for(i = 1; i < 1000; i++)
{
if (i % 2) sum_odd += i;
else sum_even += i ;
}
Figure 58. Example CXX2, one FOR loop
OFFSET  OBJECT CODE  LINE#  PSEUDO ASSEMBLY LISTING
00049 |  * for(i = 1; i < 1000; i++) {
0000DE  41E0 0001      115           LA    r14,1
0000E2  182E           116           LR    r2,r14
0000E4  5810 D0EC      117           L     r1,236(,r13)
0000E8  5800 D0F0      118           L     r0,240(,r13)
0000EC  41F0 03E7      119           LA    r15,999
0000F0                 120  @4L2     DS    0D
00050 |  * if (i % 2) sum_odd += i;
0000F0  1842           126           LR    r4,r2
0000F2  1854           127           LR    r5,r4
0000F4  8B40 001E      128           SLA   r4,30
0000F8  8A40 001E      129           SRA   r4,30
0000FC  5450 ****      130           N     r5,=F'1'
000100  1355           131           LCR   r5,r5
000102  8A50 001F      132           SRA   r5,31
000106  1445           133           NR    r4,r5
000108  4780 ****      134           BZ    @4L5
[listing lines missing in source]
000112                 137  @4L5     DS    0H
00051 |  *
[listing lines missing in source]
000114                 141  @4L3     DS    0F
000114  872E 30A0      142           BXLE  r2,r14,@4L2
000118                 143  @4L4     DS    0D
Figure 59. Example CXX2, one FOR loop, assembler code
Table 7 on page 139 and Table 8 on page 141 summarize the number of
instructions executed by each program. CXX2 executes 5.7 times as many
instructions and 2.5 times as many branching instructions as CXX1. Programming with
the intent to minimize the number of branch instructions often results in a
simplified program with fewer non-branch instructions, as is illustrated by
these examples.
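For reference, the two fragments can be wrapped into complete functions and
checked against each other. This is a sketch only: the struct and function
names are assumptions, and the loop bodies follow Figure 56 and Figure 58.

```cpp
// Result pair shared by both variants.
struct Sums {
    long odd;
    long even;
};

// CXX1 style: two loops, no test on i inside either loop body.
Sums sum_two_loops() {
    Sums s{0, 0};
    for (int i = 1; i < 1000; i += 2) s.odd += i;
    for (int i = 2; i < 1000; i += 2) s.even += i;
    return s;
}

// CXX2 style: one loop with a data-dependent IF on every iteration.
Sums sum_one_loop() {
    Sums s{0, 0};
    for (int i = 1; i < 1000; i++) {
        if (i % 2) s.odd += i;
        else       s.even += i;
    }
    return s;
}
```

Both functions produce an odd sum of 250000 and an even sum of 249500. The
difference lies only in how many instructions and branches are executed to
obtain them.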
Table 7. Number of Instructions Executed, CXX1 and CXX2
Instruction Type               CXX1    CXX2
[individual instruction rows not recoverable from source]
Total Instructions Executed    2017   11504
Table 8. Number of Branches Executed, CXX1 and CXX2
Instruction Type               CXX1    CXX2
[leading rows not recoverable from source]
BC Conditional                    0     999
BC Unconditional                  0     500
Total Branches Executed        1001    2499
Table 9. Performance, CXX1 and CXX2
[table rows not recoverable from source]
Note: Performance is normalized to CXX1 running with the BHT implementation.
This is done by dividing the cycles needed to run each test by
the cycles consumed by CXX1 running with BHT branch prediction. The
smaller the number, the better the performance.
The predominant branch, BXLE, in CXX1 is a loop-controlling branch which
is predicted taken by all three prediction methods. CXX1 is a very small loop
so the target is always in the instruction buffers of the processor. These
characteristics result in negligible differences between three of the model runs for
CXX1. The run without branch prediction suffers because it needs to wait
until the BXLE completes before beginning the next instruction.
The IF statement within CXX2 wreaks havoc with most branch prediction
methods. Every other time it is encountered the branch at offset 000108, in
Figure 59 on page 137, is taken. Unless a very sophisticated branch
prediction methodology is employed, this alternating pattern is not recognized and
is predicted incorrectly every time it is encountered.
The performance differences between the three CXX2 runs illustrate the
impact of the incorrectly predicted branches. The opcode-based branch
prediction is impacted the least because the conditional branch is predicted not
taken. Due to this default, the prediction is correct half of the time. The
DHT and BHT implementations always predict the conditional branch
incorrectly.
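The effect of the alternating pattern can be reproduced with a small
simulation. The sketch below assumes that the history-based methods behave
like a one-bit "same as last time" predictor, which is a simplification made
for illustration and not a description of the measured DHT or BHT. All names
are hypothetical.

```cpp
#include <vector>

// Correct predictions for a static "not taken" guess, as opcode-based
// prediction makes for conditional branches.
int correct_static_not_taken(const std::vector<bool>& outcomes) {
    int correct = 0;
    for (bool taken : outcomes) {
        if (!taken) ++correct;
    }
    return correct;
}

// Correct predictions for a one-bit history: predict whatever the branch
// did last time. `initial` is the guess used before any history exists.
int correct_last_outcome(const std::vector<bool>& outcomes, bool initial) {
    int correct = 0;
    bool guess = initial;
    for (bool taken : outcomes) {
        if (guess == taken) ++correct;
        guess = taken;  // remember the resolved direction
    }
    return correct;
}

// Outcomes of the BZ in CXX2 for i = 1..999: not taken when i is odd,
// taken when i is even.
std::vector<bool> cxx2_branch_outcomes() {
    std::vector<bool> outcomes;
    for (int i = 1; i < 1000; ++i) outcomes.push_back(i % 2 == 0);
    return outcomes;
}
```

Over the 999 executions of the branch, the static guess is right for the 500
odd values of i, while the one-bit history is right at most once, on the very
first encounter.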
Comparing the opcode-based prediction runs, CXX2 executes 5.7 times as many
instructions and uses 5.1 times as many cycles to produce the same result as CXX1.
For the DHT and BHT runs the coding choice has an even more adverse
effect on performance. These examples illustrate how programming
techniques affect performance. The basic coding guideline derived from these two
examples is to minimize and simplify the decision-making that a program
does.
4.3 Cost of Incorrect Branch Prediction
The next two examples, CXX3 and CXX4, illustrate the performance of each
of the paths traversed in CXX2. Each executes the same number of iterations
as CXX2. When i is odd the conditional branch at 00010A is never taken.
This path, traversed by CXX3, has more instructions than the path traversed
by CXX4. This is because CXX3 always executes the unconditional branch at
000110.
for(i = 1; i < 1998; i+=2)
{
    if (i % 2) sum_odd += i;
    else sum_even += i;
}
Figure 60. Example CXX3, one FOR loop, i always odd
OFFSET  OBJECT CODE  LINE#  PSEUDO ASSEMBLY LISTING
00049 |  * for(i = 1; i < 1998; i+=2) {
0000DE  4120 0001      115           LA    r2,1
0000E2  41E0 0002      116           LA    r14,2
0000E6  5810 D0EC      117           L     r1,236(,r13)
0000EA  5800 D0F0      118           L     r0,240(,r13)
0000EE  41F0 07CD      119           LA    r15,1997
0000F2                 120  @4L2     DS    0H
00050 |  * if (i % 2) sum_odd += i;
0000F2  1842           126           LR    r4,r2
0000F4  1854           127           LR    r5,r4
0000F6  8B40 001E      128           SLA   r4,30
0000FA  8A40 001E      129           SRA   r4,30
0000FE  5450 ****      130           N     r5,=F'1'
000102  1355           131           LCR   r5,r5
000104  8A50 001F      132           SRA   r5,31
000108  1445           133           NR    r4,r5
00010A  4780 ****      134           BZ    @4L5
[listing lines missing in source]
000114  4700 0000      137           NOP
000118                 138  @4L5     DS    0D
00051 |  *
[listing lines missing in source]
00011A                 142  @4L3     DS    0H
00011A  872E 30A2      143           BXLE  r2,r14,@4L2
00011E                 144  @4L4     DS    0H
Figure 61. Example CXX3, one FOR loop, assembler code
for(i = 0; i < 1998; i+=2)
{
    if (i % 2) sum_odd += i;
    else sum_even += i;
}
Figure 62. Example CXX4, one FOR loop, i always even
OFFSET  OBJECT CODE  LINE#  PSEUDO ASSEMBLY LISTING
00049 |  * for(i = 0; i < 1998; i+=2) {
0000DE  4120 0000      115           LA    r2,0
0000E2  41E0 0002      116           LA    r14,2
0000E6  5810 D0EC      117           L     r1,236(,r13)
0000EA  5800 D0F0      118           L     r0,240(,r13)
0000EE  41F0 07CD      119           LA    r15,1997
0000F2                 120  @4L2     DS    0H
00050 |  * if (i % 2) sum_odd += i;
0000F2  1842           126           LR    r4,r2
0000F4  1854           127           LR    r5,r4
0000F6  8B40 001E      128           SLA   r4,30
0000FA  8A40 001E      129           SRA   r4,30
0000FE  5450 ****      130           N     r5,=F'1'
000102  1355           131           LCR   r5,r5
000104  8A50 001F      132           SRA   r5,31
000108  1445           133           NR    r4,r5
00010A  4780 ****      134           BZ    @4L5
00010E  1A12           135           AR    r1,r2
000110  47F0 ****      136           B     @4L3
000114  4700 0000      137           NOP
000118                 138  @4L5     DS    0D
00051 |  *
[listing lines missing in source]
00011A                 142  @4L3     DS    0H
00011A  872E 30A2      143           BXLE  r2,r14,@4L2
00011E                 144  @4L4     DS    0H
Figure 63. Example CXX4, one FOR loop, assembler code
Table 10. Number of Instructions Executed, CXX2, CXX3, and CXX4
Instruction Type               CXX2    CXX3    CXX4
Branches                       2499    2997    1999
LCR                             999     999     999
NR                              999     999     999
LR                             2000    1999    1999
AR                              999     999     999
SR                                1       1       1
LA                                4       5       5
ST                                3       3       3
N                               999     999     999
L                                 3       3       3
SRA                            1998    1998    1998
SLA                             999     999     999
OI                                1       1       1
Total Instructions Executed   11504   12003   11004
Table 11. Number of Branches Executed, CXX2, CXX3, and CXX4
Instruction Type               CXX2    CXX3    CXX4
BXLE                            999     999     999
BALR                              1       1       1
BC Conditional                  999     999     999
BC Unconditional                500     999       0
Total Branches Executed        2499    2997    1999
CXX3 executes more branches than CXX2 or CXX4. This is because it
executes both the conditional and unconditional branch in its path. CXX4
executes only the conditional branch in its path. CXX2 alternates between these
two paths so it is mid-way between CXX3 and CXX4 with respect to the
number of branch instructions executed.
Comparing CXX3 and CXX4 shows that the performance of these two paths
varies depending upon the type of branch prediction implemented. CXX3
performs significantly better than CXX4 when utilizing the opcode-based
branch prediction. Opcode-based branch prediction always predicts the condi
tional branch in CXX4 incorrectly. Programming so that the conditional
branch is predicted correctly more than makes up for the extra instructions
executed when using opcode-based branch prediction (4.6 versus 6.0).
The cost of resetting after an incorrectly predicted branch can cause perform
ance to be worse than that of the processor without branch prediction. This
point is illustrated by CXX4 when comparing the opcode-based branch pre
diction run to the no branch prediction run. Since the conditional branch is
always predicted incorrectly by opcode-based branch prediction, the processor
without branch prediction performs better.
Comparing the performance of CXX2 to both CXX3 and CXX4 shows the
impact alternating branches have on processors with DHTs or BHTs. When
ever possible, branches should be coded such that the same path is traversed
each iteration. Unpredictable branches should be avoided.
Table 12. Performance, CXX2, CXX3, and CXX4
Prediction Method  CXX2  CXX3  CXX4
BHT                 6.3   4.0   3.6
DHT                 6.4   4.3   4.0
Opcode-based        5.1   4.6   6.0
None                5.8   6.0   5.6
Note: Performance is normalized to CXX1 running with the BHT
implementation. This is done by dividing the cycles needed to run each test by
the cycles consumed by CXX1 running with BHT branch prediction. The
smaller the number, the better the performance.
Coding guidelines often recommend that the fall-through path, the then path
represented by CXX3, contain the code most often executed. The fall-through
path is the better performing of the two paths when using opcode-based
branch prediction. Generic coding guidelines are often based upon
opcode-based branch prediction performance since it is the most prevalent
form of branch prediction.

When programming, an understanding of the mainline (most commonly
executed) paths allows the programmer to make those the fall-through paths.
Following these guidelines, a programmer should not use the then clause for
handling unusual events, such as error conditions. But understanding the
implementation of the processor on which the program is executed can also
influence programming decisions. To take full advantage of branch
prediction, code has to be optimized for a prediction methodology.

When a DHT or BHT is utilized, the path with the fewest instructions
performs best. This creates a dilemma when deciding how to optimize code
which runs on a variety of processors. There is always a trade-off between
minimizing the number of instructions executed and minimizing the taken
conditional branches. It is important to note that for both paths the
processors with a DHT or a BHT outperform the one with opcode-based
branch prediction.
4.3.1 Assembler Examples
To further illustrate the impact of alternating branches, examples ASM1,
ASM2, and ASM3 are used. They are assembler routines which mimic the
previous C++ examples. However, an unconditional branch has been placed
in the else path to equalize the number of instructions in each path. These
assembler routines are used to analyze the performance of a conditional
branch always taken, a conditional branch never taken, and a conditional
branch alternating between taken and not taken.
In example ASM1, the conditional branch located at offset 000036 alternates
between taken and not taken.
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
00000A 4120 0001 00001 12 LA 2,1
00000E 41E0 0001 00001 13 LA 14,1
000012 41F0 03E7 003E7 14 LA 15,999
000016 4110 0000 00000 23 LA 1,0
00001A 4100 0000 00000 24 LA 0,0
0001E 25 TOPIF EQU *
00001E 1842 26 LR 4,2
000020 1854 27 LR 5,4
000022 8B40 001E 0001E 28 SLA 4,30
000026 8A40 001E 0001E 29 SRA 4,30
00002A 5450 C1A2 001A8 30 N 5,=F'1'
00002E 1355 31 LCR 5,5
000030 8A50 001F 0001F 32 SRA 5,31
000034 1445 33 NR 4,5
000036 4780 C03A 00040 34 BZ EVEN
00003A 1A12 35 AR 1,2
00003C 47F0 C040 00046 36 B BOT
00040 37 EVEN EQU *
000040 1A02 38 AR 0,2
000042 47F0 C040 00046 39 B BOT
00046 40 BOT EQU *

00004A 41F0 C15A 00160 43 LA 15,$SAVE
00004E 50FD 0008 00008 44 ST 15,8(13)
000052 18DF 45 LR 13,15
000054 58D0 C15E 00164 46 L 13,$SAVE+4
000058 98EC D00C 0000C 47 LM 14,12,12(13)
00005C 07FE 48 BR 14
Figure 64. Example ASM1, branch at 000036 alternates between taken and not
taken.
In example ASM2, the conditional branch at offset 000036 is always taken.
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
00000A 4120 0000 00000 16 LA 2,0
00000E 41E0 0002 00002 17 LA 14,2
000012 41F0 07CD 007CD 18 LA 15,1997
000016 4110 0000 00000 23 LA 1,0
00001A 4100 0000 00000 24 LA 0,0
0001E 25 TOPIF EQU *
00001E 1842 26 LR 4,2
000020 1854 27 LR 5,4
000022 8B40 001E 0001E 28 SLA 4,30
000026 8A40 001E 0001E 29 SRA 4,30
00002A 5450 C1A2 001A8 30 N 5,=F'1'
00002E 1355 31 LCR 5,5
000030 8A50 001F 0001F 32 SRA 5,31
000034 1445 33 NR 4,5
000036 4780 C03A 00040 34 BZ EVEN
00003A 1A12 35 AR 1,2
00003C 47F0 C040 00046 36 B BOT
00040 37 EVEN EQU *
000040 1A02 38 AR 0,2
000042 47F0 C040 00046 39 B BOT
00046 40 BOT EQU *

00004A 41F0 C15A 00160 43 LA 15,$SAVE
00004E 50FD 0008 00008 44 ST 15,8(13)
000052 18DF 45 LR 13,15
000054 58D0 C15E 00164 46 L 13,$SAVE+4
000058 98EC D00C 0000C 47 LM 14,12,12(13)
00005C 07FE 48 BR 14
Figure 65. Example ASM2, branch at 000036 always taken
In example ASM3, the conditional branch at offset 000036 is never taken.
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
00000A 4120 0001 00001 20 LA 2,1
00000E 41E0 0002 00002 21 LA 14,2
000012 41F0 07CD 007CD 22 LA 15,1997
000016 4110 0000 00000 23 LA 1,0
00001A 4100 0000 00000 24 LA 0,0
0001E 25 TOPIF EQU *
00001E 1842 26 LR 4,2
000020 1854 27 LR 5,4
000022 8B40 001E 0001E 28 SLA 4,30
000026 8A40 001E 0001E 29 SRA 4,30
00002A 5450 C1A2 001A8 30 N 5,=F'1'
00002E 1355 31 LCR 5,5
000030 8A50 001F 0001F 32 SRA 5,31
000034 1445 33 NR 4,5
000036 4780 C03A 00040 34 BZ EVEN
00003A 1A12 35 AR 1,2
00003C 47F0 C040 00046 36 B BOT
00040 37 EVEN EQU *
000040 1A02 38 AR 0,2
000042 47F0 C040 00046 39 B BOT
00046 40 BOT EQU *

00004A 41F0 C15A 00160 43 LA 15,$SAVE
00004E 50FD 0008 00008 44 ST 15,8(13)
000052 18DF 45 LR 13,15
000054 58D0 C15E 00164 46 L 13,$SAVE+4
000058 98EC D00C 0000C 47 LM 14,12,12(13)
00005C 07FE 48 BR 14
Figure 66. Example ASM3, branch at 000036 always not taken
Table 13. Total Number of Instructions Executed, ASM1, ASM2, and ASM3
Instruction Type              ASM1   ASM2   ASM3
Branches                      3000   3000   3000
LCR                            999    999    999
NR                             999    999    999
LR                            1999   1999   1999
AR                             999    999    999
SR                               1      1      1
LA                               7      7      7
ST                               2      2      2
N                              999    999    999
L                                3      3      3
SRA                           1998   1998   1998
SLA                            999    999    999
STM                              1      1      1
LM                               1      1      1
Total Instructions Executed  12008  12008  12008
Table 14. Performance, ASM1, ASM2, and ASM3
Prediction Method  ASM1  ASM2  ASM3
BHT                 1.6   1     1
DHT                 1.5   1.3   1.2
Opcode-based        1.5   1.8   1.2
None                1.7   1.8   1.6
Note: Performance is normalized to ASM2 running with a BHT
implementation. This is done by dividing the cycles needed to run each test by
the cycles consumed by ASM2 running with BHT branch prediction. The
smaller the number, the better the performance.
All three examples have the same quantity and type of instructions executed.
Comparing the performance of these three programs, it is observed that
opcode-based branch prediction is more effective when conditional branches
are not taken (ASM3). It always predicts these correctly.
Comparing the performance of the three programs, the BHT does equally well
whether the conditional branch is taken or not taken. The DHT does slightly
better when the branch is not taken. This is because it does not predict the
target address when the branch is taken, thus delaying instruction processing
slightly. Both the BHT and DHT perform poorly on ASM1. The alternating
branch causes a sixty percent performance loss (1.0 versus 1.6) with a BHT.
Slightly less of a performance loss is experienced by the DHT.
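The effect of an alternating branch on history-based prediction can be reproduced with a very small model. The sketch below is not the BHT or DHT measured in this thesis. It assumes a hypothetical one-bit predictor that simply repeats the last outcome of the branch, and a static predictor that always guesses not taken, which is how opcode-based prediction treats the conditional branch in these examples. The function names and the Pattern type are illustrative only.

```cpp
#include <cstddef>
#include <vector>

// Branch behaviours mirroring ASM3 (never taken), ASM2 (always taken)
// and ASM1 (alternating, starting with not taken).
enum class Pattern { NeverTaken, AlwaysTaken, Alternating };

// Build the taken/not-taken outcome of n executions of one branch.
std::vector<bool> make_outcomes(Pattern p, std::size_t n) {
    std::vector<bool> taken(n, false);
    for (std::size_t i = 0; i < n; ++i) {
        if (p == Pattern::AlwaysTaken) taken[i] = true;
        else if (p == Pattern::Alternating) taken[i] = (i % 2 == 1);
    }
    return taken;
}

// Static guess "not taken": wrong every time the branch is taken.
std::size_t mispredicts_static_not_taken(const std::vector<bool>& taken) {
    std::size_t wrong = 0;
    for (bool t : taken) {
        if (t) ++wrong;
    }
    return wrong;
}

// Assumed one-bit history: guess whatever the branch did last time.
std::size_t mispredicts_last_outcome(const std::vector<bool>& taken,
                                     bool initial_guess = false) {
    std::size_t wrong = 0;
    bool guess = initial_guess;
    for (bool t : taken) {
        if (t != guess) ++wrong;
        guess = t;  // remember the most recent outcome
    }
    return wrong;
}
```

With 999 executions, the alternating stream makes this one-bit model guess wrong on 998 of them, while the static guess is wrong only on the 499 taken executions. That is the same qualitative ordering as the ASM1 column of Table 14, although the model says nothing about cycle counts.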
From examining the performance of examples CXX2 through CXX4 and
ASM1 through ASM3, two coding guidelines have emerged. First, programs
should be written such that conditional branches are not taken. Second, if a
program is written to run on a processor which has branch prediction, the
branches need to be coded such that they are predictable by the branch
prediction methodology. If these guidelines are not followed, performance is
negatively impacted.
4.4 Subroutine Branches
Subroutine calls are utilized in programming to modularize code, thus
facilitating program development. The following examples demonstrate
that subroutine calls have a negative impact on performance. One way to
reduce the number of subroutine calls but still maintain code readability is to
use functions which are inlined (see footnote 5). An inlined function is one
whose code is inserted, at compilation time, at the point where the call to the
function appears. This eliminates the branches and other instructions which
are necessary to branch to and return from a subroutine, thus reducing the
cycles consumed and increasing the performance. Inlining functions increases
the size of the load module, because the code for the function appears not
once, but instead at each location the function is invoked.

Examples CXX5 and CXX6 use a function to sum the numbers from one to
999 into even and odd sums. Both use the same function, but in CXX5 the
function is inlined, whereas in CXX6 the function is not inlined. The code
within the function is identical to the IF statement in example CXX2.

Footnote 5: Please refer to a programming text, such as The C++ Primer, by
Stanley B. Lippman [C++91], to understand the usage of inlined functions.
inline void addtobucket(int i, int sum_odd, int sum_even)
{
if (i % 2) sum_odd += i;




int i;
int sum_odd,sum_even;
sum_odd = sum_even = 0;
for(i = 1; i < 1000; i++)
addtobucket(i,sum_odd,sum_even);
Figure 67. Example CXX5, one FOR loop, inline function
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
eeeeee 5 main DS 8F
eeeeee 47F8 F828 6 B 46(,rl5)
eeeeiE 41E8 F63C 7 LA rl4,69(,rl5)
896822 58F6 C874 8 L rl5,116(,rl2)
888826 87FF 9 BR rl5
668828 98E8 D88C 16 STM rl4,i-8,12(rl3)
868620 5828 D84C 11 L r2.76(,rl3)
868838 4188 2118 12 LA r8,288(,r2)
686834 5588 C88C 13 CL r8.12(.rl2)
866838 4728 F81E 14 BH 38(,rl5)
88883C 58F8 D848 15 L rl5,72(,rl3)
666648 98F8 2648 16 STM rl5.r8,72{r2)
888844 9218 2686 17 MVI 6(r2),16
686848 58D8 2884 18 ST rl3,4(,r2)
888840 18D2 19 LR rl3,r2
88884E 8538 26 BALR r3,r8
866858 End of Prolo 3
*
in ine void addtobucket(int i ,
*
{
int sura odd, int sum even










88829 1 * {
88855 1 * for(i 1; i < 1668; i++) {
6868DE 58E6 D8F8 125 L rl4,248(.rl3)
B88eE2 58F6 D184 126 L rl5,268(,rl3)
eeeeE6 5819 D168 127 L rl,256(,rl3)
eeeeEA 4148 8881 128 LA r4,l
eeeeEE 1824 129 LR r2,r4
eeeeFe 5888 D8FC 138 L r8, 252 (, 1-13)
8888F4 4158 83E7 131 LA r5,999
8888F8 132 @4L2 DS 8D
88956 1 * addtobucket(i,sum odd, sun even);
686eF8 18FE 134 LR rl5,rl4
8688FA 5868 D6EC 135 L r6,236(,rl3)
6888FE 1872 136 LR r7.r2
688189 1887 137 LR r8,r7
888162 8B78 891E 138 SLA r7,38
888166 8A78 991E 139 SRA r7,38
88818A 5489 **** 148 N
r8,=F'l'
eeeieE 1388 141 LCR r8,r8
666118 8A89 991F 142 SRA 1-8,31
888114 1478 143 NR t-7,r8
688116 1816 144 LR rl,r6
888118 1882 145 LR rB,r2
88811A 4788 **** 146 BZ @4L6
eeeiiE 1816 147 LR rl,r6
888126 1A12 148 AR rl,r2
666122 47F9 **** 149 B @4L3
686126 8788 158 NOPR






86612A 154 (ML3 DS 8H
86812A 8724 38A8 155 BXLE r2,r4.<84L2
66612E 156 @4L4 DS 8H
Figure 68. Example CXX5, one FOR loop, assembler code
void addtobucket(int i, int sum_odd, int sum_even)
{
if (i % 2) sum_odd += i;






sum_odd = sum_even = 0;
for(i = 1; i < 1000; i++)
addtobucket(i,sum_odd,sum_even);
Figure 69. Example CXX6, one FOR loop, subroutine call
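For readers who want to experiment with the two variants, the following is a minimal, self-contained sketch in the spirit of CXX5 and CXX6. It is not the thesis source: the figures as printed show plain int parameters and have lost several lines, so the sketch assumes the sums are passed by reference so that the caller sees the updates. The names add_to_bucket_inline, add_to_bucket_call, and Sums are illustrative. Note also that in C++ the inline keyword is only a request, and a compiler may inline or not inline either function.

```cpp
// Assumption: sums are passed by reference so the caller observes them.
inline void add_to_bucket_inline(int i, int& sum_odd, int& sum_even) {
    if (i % 2) sum_odd += i;
    else       sum_even += i;
}

// Same body without the inline request, standing in for the CXX6 variant.
void add_to_bucket_call(int i, int& sum_odd, int& sum_even) {
    if (i % 2) sum_odd += i;
    else       sum_even += i;
}

struct Sums {
    int odd;
    int even;
};

// One FOR loop over 1..999 using the inline-requested function.
Sums sum_with_inline() {
    Sums s{0, 0};
    for (int i = 1; i < 1000; i++)
        add_to_bucket_inline(i, s.odd, s.even);
    return s;
}

// One FOR loop over 1..999 using the ordinary function.
Sums sum_with_call() {
    Sums s{0, 0};
    for (int i = 1; i < 1000; i++)
        add_to_bucket_call(i, s.odd, s.even);
    return s;
}
```

Both loops produce the same sums (250000 for the odd numbers and 249500 for the even numbers), so any difference observed between them on a given compiler and processor comes from the call linkage rather than from the work performed.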
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
998688 5 addtobuc OS 9F
eessee 47F8 F836 6 8 54(,|-15)
eeeuzc 41E6 F84A 7 LA rl4,74(.rl5)
909638 5BF8 C874 8 L rl5.116C.rl2)
999934 67FF 9 BB rl5
969936 99E3 D88C 18 STM rl4,r3.12trl3)
66893A 5829 084C 11 I r2.76(,rl3)
98993E 4199 2698 12 LA r8.144(,r2)
999642 5599 ceec 13 CL r-e.lZ(.rlZ)
999946 4729 F82C 14 BH 44(.rl5)
99694* 5BF8 0948 15 L rl5.72(.rl3)
99664E 99F9 2948 16 STM rl5,r9,72(r2)
698852 9219 2988 17 MVI 8l>2),16
999956 5906 2884 18 ST rl3.4(.r2)
6699SA 1802 19 LP. rl3.r2
98995C 9538 29 BALR r3,r8
988B5E End of Prolog
99985E 58E9 1888 22 L rl4.8(,rl)
86919 1 * void addtob CKet (i nt i , int sum
int sum_even)
98929 1 * {
98822 1 if (i <i 2 sum_odd += i :
889962 188E 45 LH r6.rl4
966864 18F8 46 LR rl5,r8
689966 8B89 881 E 47 SLA r6. 38
96886A BA88 981E 48 SRA r8. 38
99666E 54F8 *"" 49 N rl5.=FT
688872 13FF 59 LCR rl5.rl5
869974 8AF8 881F 51 SRA rl5.31
866878 148F 52 NR re.rlS
88887A 58F8 1684 53 L rl5.4f.rl)
98887E 5888 1888 54 L re.S(.rl)
969982 4789 .... 55 BZ 05 LI
988896 1AFE 56 AR rl5.rl4
888888 47F9 .... 57 B 95L2










even += i ;








888892 Start of Epilog
699892 5808 0884 69 L rl3.4(,rl3)
998996 5SE8 D88C 79 L rl4.12(,rl3)
B6889A 9823 D81C 71 LM r2.r3.28(rl3)
86889E 951E 72 BALR rl.rl4
888899 199 main DS 8F
988688 47F8 F828 119 B 48(.rl5)
88881E 41E8 F83C 111 LA rl4,68(.rl5)
866822 58F8 C874 112 L rl5.116(.rl2)
888826 B7FF 113 BR rl5
99992B 98ES DS8C 114 STM rl4,rS,12(rl3J
99982C 5828 084C 115 L r2,76C,rl3)
889939 4188 2118 116 LJf r8.272(.r2)
988834 5588 C88C 117 CL r8,12t,rl2)
988839 4728 F81E 118 BH 38(.rl5)
99683C 58F8 0848 119 L rl5.72(.rl3)
999949 99F9 2848 129 STM rl5.r8,72(r2)
899944 9219 2888 125 MVI 0(r2).16
888649 5909 2864 126 ST rl3,4(.r2)
98B64C 1602 127 LR rl3.r2







86855 for i - l: i < 1999; i) {
B888DE 4148 8881 2B5 LA r4,l
8888E2 1884 298 LR r6.r4
8889E4 5878 D6EC 287 L r7,236(.rl3)
8888EB 5868 D8F8 23B L r6.248(.rl3)
8888EC 5B2B *""* 289 L r2,=V(addtobucket FiN21)
8888F9 4158 63E7 219 LA r5,999
8888F4 4799 9988 211 NOP
eeeeFa 212 ML2 DS 9D
99856 * addtobucfcetfj .sum odd, sum even);
8999F8 5888 0188 214 ST rB,256(.rl3)
8689FC 5878 D184 215 ST r7.268(.rl3)
999198 1BF2 216 LR rl5.r2
999192 5866 0189 217 ST r6.264(.rl3)






eeeiec 8784 39A8 221 BXLE r8,r4,@4L2
888119 222 ML4 OS 9D
Figure 70 (Part 2 of 2). Example CXX6, one FOR loop, assembler code
Table 15. Number of Instructions Executed, CXX2, CXX5, and CXX6
Instruction Type              CXX2   CXX5   CXX6
Branches                      2499   2499   7495
LCR                            999    999    999
NR                             999    999    999
LR                            2000   5497   3998
AR                             999    999    999
SR                               1      1      1
LA                               4      4   2002
ST                               3      4   3997
N                              999    999    999
CL                               0      0    999
L                                3   1004   6997
SRA                           1998   1998   1998
SLA                            999    999    999
OI                               1      1      1
MVI                              0      0    999
STM                              0      0   1998
LM                               0      0    999
Total Instructions Executed  11504  16003  36479
Table 16. Number of Branches Executed, CXX2, CXX5, and CXX6
Instruction Type         CXX2  CXX5  CXX6
BXLE                      999   999   999
BALR                        1     1  2998
BC - Conditional          999   999  1998
BC - Unconditional        500   500  1500
Total Branches Executed  2499  2499  7495
The inlined function utilizes the same number of branch instructions as
CXX2. It does, however, execute more instructions. Most of these additional
instructions are loads. The subroutine call more than triples the number of
instructions executed. A large percentage of the additional instructions are
branch instructions. It is important to note that this is a very simplistic
subroutine. As more processing is done within the subroutine, the overhead of
invoking the subroutine becomes a smaller portion of the total number of
instructions executed.
There is some performance degradation when using an inlined function versus
actually inlining the code. This is due to additional instructions which are
executed. The additional instructions necessary for the subroutine call
severely impact performance. The subroutine degrades performance by 98
to 178 percent when compared to CXX2. Programmers concerned with
performance should use subroutines sparingly.

The BHT does demonstrate its superior performance potential in this
comparison. There are so many additional branches in CXX6 that the BHT
implementation is able to outperform the opcode-based branch prediction, even
with the alternating branch.
Table 17. Performance, CXX2, CXX5, and CXX6
Prediction Method  CXX2  CXX5  CXX6
BHT                 6.3   6.6  12.5
DHT                 6.4   6.9  15.0
Opcode-based        5.1   5.6  14.2
None                5.8   6.1  15.1
Note: Performance is normalized to CXX1 running with the BHT
implementation. This is done by dividing the cycles needed to run each
variation by the cycles consumed by CXX1 running with BHT branch
prediction. The smaller the number, the better the performance.
Table 18. Percent increase when compared to CXX2





Note: The percent increase is calculated by using the performance numbers
in Table 17 on page 172. The equation for the CXX5 column is: ((CXX5 -
CXX2)/CXX2)*100. The equation for the CXX6 column is: ((CXX6 -
CXX2)/CXX2)*100. So the 98 percent in the CXX6 column is calculated
by doing: ((12.5 - 6.3)/6.3)*100 = 98.
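The calculation described in the note can be written as a one-line function. This is only a sketch of the stated formula applied to the values of Table 17; the function name is illustrative, and how the original table rounded its entries is not known.

```cpp
// Percent increase of `value` over `baseline`, as in the note above:
// ((value - baseline) / baseline) * 100.
double percent_increase(double value, double baseline) {
    return (value - baseline) / baseline * 100.0;
}
```

Applied to the BHT row, percent_increase(12.5, 6.3) is about 98.4, and applied to the opcode-based row, percent_increase(14.2, 5.1) is about 178.4. These are the two ends of the 98 to 178 percent range quoted for CXX6.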
4.4.1 Take Decisions Outside of Subroutines if Possible
The more processing which is done within the subroutine, the less impact the
subroutine call has upon the overall performance. Subroutine calls should be
avoided when very little processing is to be done by the subroutine. Moving
decisions outside of the subroutine is one way to reduce the number of
subroutine calls.

This coding guideline is illustrated by Figure 71 on page 175 and Figure 72
on page 175. The subroutine in Figure 71 on page 175 is called on each
iteration of the loop, not just when the report needs to be produced. The
program in Figure 72 on page 175 only calls the subroutine once, when the
report needs to be generated. By moving the code performing the test, the
number of instructions executed is significantly reduced.
























Figure 71. Decision within the subroutine
Figure 72. Decision outside the subroutine
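The guideline can be sketched in C++ as follows. This is a hypothetical stand-in rather than the code of the two figures: every name is invented for illustration, and the "report" is reduced to a counter so that the number of calls can be observed.

```cpp
// Counts how often the subroutine is entered and how often it does real work.
struct CallCounter {
    int calls = 0;
    int reports = 0;
};

// Variant 1: the subroutine itself decides whether there is work to do,
// so it must be called on every iteration.
void report_if_last(int i, int last, CallCounter& c) {
    ++c.calls;
    if (i == last) ++c.reports;  // the report would be produced here
}

// Variant 2: the subroutine always does work; the caller decides.
void report(CallCounter& c) {
    ++c.calls;
    ++c.reports;  // the report would be produced here
}

CallCounter decision_inside(int iterations) {
    CallCounter c;
    for (int i = 1; i <= iterations; ++i)
        report_if_last(i, iterations, c);
    return c;
}

CallCounter decision_outside(int iterations) {
    CallCounter c;
    for (int i = 1; i <= iterations; ++i) {
        if (i == iterations) report(c);  // test moved into the caller
    }
    return c;
}
```

For 999 iterations, decision_inside makes 999 calls to produce one report, while decision_outside makes a single call. The test itself still runs on every iteration in both variants; only the call linkage is saved.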
4.5 Branch Target Offsets
To optimize the usage of the data in the instruction buffers discussed in 2.7,
"Instruction Buffers" on page 89, it is best if the target of a branch is aligned
on a double word boundary. If the target of the branch is at the end of a
double word, then the other bytes in that instruction buffer may not be
utilized. The same is true if a branch instruction occupies the beginning of a
double word and the next sequential instruction path is not executed. Both
instruction and data alignment can affect performance. A programming
example is used to illustrate these points.

The following example is taken from a customer benchmark. The customer
varied the alignment of a code segment and experienced performance
degradation in excess of ten percent. The customer wanted to understand why
instruction alignment is so critical. The following examples examine the
causes of this performance degradation. These performance runs use a
processor with an asynchronous BHT.

The subroutine, SUBI001, within example ASM4 (shown in Figure 73 on
page 178) is aligned on a double word boundary. The subroutine, SUBI002,
in example ASM5 (shown in Figure 74 on page 180) is aligned on a halfword
boundary. ASM5 consumes 11.3 percent more cycles than ASM4.
When examining this benchmark it was found that what was perceived as the
impact of instruction alignment is actually the impact of data alignment. The
data addresses of the STM and LM in the subroutine are all based on the
subroutine entry address. When the subroutine entry is shifted, so are the
data areas.
The data alignment affects how many double words are accessed by the
program and the cycles necessary to process the data. The STM and LM
perform best when their data is on a double word boundary. If the data
fetched by a LM is not on a double word boundary then it takes additional
execution cycles to adjust the data alignment before transferring the data into
the registers. If the data stored by the STM is not going to be put on a
double word boundary, it takes additional execution cycles to adjust the
alignment of the data.
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
0000A 9 CALLS000 EQU *
00000A 5820 C4F2 004F8 12 L 2,=F'10'
00000E 58F0 C4F6 004FC 13 L 15,=A(SUBI001)
000012 05EF 14 BALR 14,15
000014 4620 C008 0000E 15 BCT 2,*-6
000018 41F0 C4AA 004B0 16 LA 15,$SAVE
00001C 50FD 0008 00008 17 ST 15,8(13)
000020 18DF 18 LR 13,15
000022 58D0 C4AE 004B4 19 L 13,$SAVE+4
000026 98EC D00C 0000C 20 LM 14,12,12(13)
00002A 07FE 21 BR 14
00002C 00080 22 ORG CALLS000+118
00080 23 SUBI001 EQU *
000080 90EC F064 00064 24 STM 14,12,100(15)
000084 4110 F038 00038 25 LA 1,56(,15)
000088 9801 F04C 0004C 26 LM 0,1,76(15)
00008C 07FE 27 BR 14
000090 28
29 SAVECI01 DS 200F
0003B0 30 STOR1 DS 64F
0004B0 0000000000000000 31 $SAVE DC 18F'0'
Figure 73. Example ASM4, DW aligned CSECT
Example ASM4
- The STM stores fifteen GPRs starting on a word boundary, accessing
  eight double words.
- The LM loads two GPRs starting on a word boundary, accessing two
  double words.
- The STM and LA occupy the first instruction buffer; the LM and branch
  are in the second instruction buffer.
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
0000A 9 CALLS000 EQU *
00000A 5820 C4F2 004F8 12 L 2,=F'10'
00000E 58F0 C4F6 004FC 13 L 15,=A(SUBI002)
000012 05EF 14 BALR 14,15
000014 4620 C008 0000E 15 BCT 2,*-6
000018 41F0 C4AA 004B0 16 LA 15,$SAVE
00001C 50FD 0008 00008 17 ST 15,8(13)
000020 18DF 18 LR 13,15
000022 58D0 C4AE 004B4 19 L 13,$SAVE+4
000026 98EC D00C 0000C 20 LM 14,12,12(13)
00002A 07FE 21 BR 14
00002C 00082 22 ORG CALLS000+120
00082 23 SUBI002 EQU *
000082 90EC F064 00064 24 STM 14,12,100(15)
000086 4110 F038 00038 25 LA 1,56(,15)
00008A 9801 F04C 0004C 26 LM 0,1,76(15)

0003B0 30 STOR1 DS 64F
0004B0 0000000000000000 31 $SAVE DC 18F'0'
Figure 74. Example ASM5, DW + 1 HW aligned CSECT
Example ASM5
- The STM stores fifteen GPRs starting on a three-halfword boundary,
  accessing nine double words.
- The LM loads two GPRs starting on a three-halfword boundary,
  accessing two double words.
- The four instructions occupy two instruction buffers.

In this example the performance degradation is found to be solely due to data
alignment. Having the data aligned on halfword boundaries causes the STM
and LM to take more execution cycles. In addition, this alignment causes the
STM to access more double words. Many variations of this program were
analyzed in order to give the customer an accurate understanding of why their
benchmark performed differently when the subroutine alignment changed.

If the subroutine is aligned on a double word boundary, but the data areas
are shifted by one word, then a 1.1 percent performance improvement is
obtained. The code for this variation is in Figure 75 on page 182, example
ASM6. Shifting the data by one word puts the data on a double word
boundary and is the reason for this performance gain.
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
0000A 9 CALLS000 EQU *
00000A 5820 C4F2 004F8 12 L 2,=F'10'
00000E 58F0 C4F6 004FC 13 L 15,=A(SUBI009)
000012 05EF 14 BALR 14,15
000014 4620 C008 0000E 15 BCT 2,*-6
000018 41F0 C4AA 004B0 16 LA 15,$SAVE
00001C 50FD 0008 00008 17 ST 15,8(13)
000020 18DF 18 LR 13,15
000022 58D0 C4AE 004B4 19 L 13,$SAVE+4
000026 98EC D00C 0000C 20 LM 14,12,12(13)
00002A 07FE 21 BR 14
00002C 00080 22 ORG CALLS000+118
00080 23 SUBI009 EQU *
000080 90EC F068 00068 24 STM 14,12,104(15)
000084 4110 F03C 0003C 25 LA 1,60(,15)
000088 9801 F050 00050 26 LM 0,1,80(15)

0003B0 30 STOR1 DS 64F
0004B0 0000000000000000 31 $SAVE DC 18F'0'
Figure 75. Example ASM6, DW aligned CSECT, data shifted by one word
Example ASM6
- The STM stores fifteen GPRs starting on a double word boundary,
  accessing eight double words.
- The LM loads two GPRs starting on a double word boundary, accessing
  one double word.
- Instruction alignment is the same as ASM4.
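The double word counts quoted for ASM4, ASM5, and ASM6 follow from simple address arithmetic. The sketch below assumes that register 15 holds the subroutine entry address, so that each operand address is the entry address plus the displacement shown in the listing, that a GPR is four bytes wide (so STM 14,12 stores 60 bytes and LM 0,1 loads 8 bytes), and that a double word is an aligned 8-byte unit. The function name is illustrative.

```cpp
#include <cstdint>

// Number of aligned 8-byte double words touched by an access of `length`
// bytes starting at byte address `start` (length must be greater than 0).
std::uint32_t doublewords_touched(std::uint32_t start, std::uint32_t length) {
    std::uint32_t first = start / 8;
    std::uint32_t last = (start + length - 1) / 8;
    return last - first + 1;
}
```

With the entry addresses and displacements from the listings, doublewords_touched(0x80 + 100, 60) gives 8 and doublewords_touched(0x80 + 76, 8) gives 2 for ASM4; the corresponding values are 9 and 2 for ASM5 (entry 0x82), and 8 and 1 for ASM6 (entry 0x80 with displacements 104 and 80).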
The instruction alignment does have a slight impact on performance.
Examples ASM7 and ASM8 keep the data alignment identical to ASM4. ASM7,
shown in Figure 76 on page 184, has the subroutine instructions shifted by a
word. ASM8, shown in Figure 77 on page 185, has the subroutine
instructions shifted by three halfwords. ASM7 suffers a 1.7 percent
performance loss and ASM8 suffers a 2.2 percent performance loss when
compared to ASM4.
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
0000A 9 CALLS000 EQU *
00000A 5820 C4FA 00500 12 L 2,=F'10'
00000E 58F0 C4FE 00504 13 L 15,=A(SUBI006)
000012 05EF 14 BALR 14,15
000014 4620 C008 0000E 15 BCT 2,*-6
000018 41F0 C4AE 004B4 16 LA 15,$SAVE
00001C 50FD 0008 00008 17 ST 15,8(13)
000020 18DF 18 LR 13,15
000022 58D0 C4B2 004B8 19 L 13,$SAVE+4
000026 98EC D00C 0000C 20 LM 14,12,12(13)
00002A 07FE 21 BR 14
00002C 00084 22 ORG CALLS000+122
00084 23 SUBI006 EQU *
000084 90EC F060 00060 24 STM 14,12,96(15)
000088 4110 F034 00034 25 LA 1,52(,15)
00008C 9801 F048 00048 26 LM 0,1,72(15)

0003B4 30 STOR1 DS 64F
0004B4 0000000000000000 31 $SAVE DC 18F'0'
Figure 76. Example ASM7, instructions shifted by one word
Example ASM7
- This example requires three instruction buffers. The STM resides in the
  first instruction buffer; the LA and LM reside in the second instruction
  buffer; and the branch occupies the third instruction buffer.
OFFSET OBJECT CODE LINE# PSEUDO ASSEMBLY LISTING
0000A 9 CALLS000 EQU *
00000A 5820 C4FA 00500 12 L 2,=F'10'
00000E 58F0 C4FE 00504 13 L 15,=A(SUBI007)
000012 05EF 14 BALR 14,15
000014 4620 C008 0000E 15 BCT 2,*-6
000018 41F0 C4AE 004B4 16 LA 15,$SAVE
00001C 50FD 0008 00008 17 ST 15,8(13)
000020 18DF 18 LR 13,15
000022 58D0 C4B2 004B8 19 L 13,$SAVE+4
000026 98EC D00C 0000C 20 LM 14,12,12(13)
00002A 07FE 21 BR 14
00002C 00086 22 ORG CALLS000+124
00086 23 SUBI007 EQU *
000086 90EC F05E 0005E 24 STM 14,12,94(15)
00008A 4110 F032 00032 25 LA 1,50(,15)
00008E 9801 F046 00046 26 LM 0,1,70(15)

0003B4 30 STOR1 DS 64F
0004B4 0000000000000000 31 $SAVE DC 18F'0'
Figure 77. Example ASM8, instructions shifted by three half words
Example ASM8
- This example needs three instruction buffers. The STM is split between
  the first and second instruction buffers; the LM is split between the
  second and third instruction buffers; and the branch is in the third
  instruction buffer.
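The instruction buffer occupancy described for ASM4, ASM5, ASM7, and ASM8 can be checked with the same kind of arithmetic. The sketch below assumes that each instruction buffer holds one aligned double word (8 bytes), which is consistent with the descriptions above but is stated here as an assumption, and it uses the instruction lengths visible in the listings: 4 bytes each for STM, LA, and LM, and 2 bytes for BR. All names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Lengths in bytes of the subroutine's instructions: STM, LA, LM, BR.
const std::vector<std::uint32_t> kSubroutineLengths = {4, 4, 4, 2};

// True when an instruction of `length` bytes at `addr` straddles two
// aligned 8-byte buffers.
bool is_split(std::uint32_t addr, std::uint32_t length) {
    return addr / 8 != (addr + length - 1) / 8;
}

// Number of distinct aligned 8-byte buffers occupied by a run of
// back-to-back instructions starting at `entry`.
std::size_t buffers_occupied(std::uint32_t entry,
                             const std::vector<std::uint32_t>& lengths) {
    std::set<std::uint32_t> buffers;
    std::uint32_t addr = entry;
    for (std::uint32_t len : lengths) {
        buffers.insert(addr / 8);              // buffer holding the first byte
        buffers.insert((addr + len - 1) / 8);  // buffer holding the last byte
        addr += len;
    }
    return buffers.size();
}
```

For the entry addresses in the listings this gives two buffers for ASM4 (0x80) and ASM5 (0x82), and three buffers for ASM7 (0x84) and ASM8 (0x86). In ASM8, is_split reports that the STM at 0x86 and the LM at 0x8E each straddle a buffer boundary, matching the description above.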
These simple examples illustrate the impact that instruction alignment can
have upon performance. It is not as critical as data alignment but it can be a
contributing factor to performance degradation.
4.6 Conclusions
Several programming guidelines can be derived from the performance
information within this chapter. First, minimize the number of branch and other
instructions used to perform a piece of work. Second, if the processor
implementation is known to favor certain branch paths, implement all mainline
code on the favored path. Third, minimize subroutine calls by ensuring that a
routine is only called when it is required to do work. Finally, when data
alignment is known to affect performance, implement code on the optimal
storage boundary. Following these guidelines will improve program
performance.
Chapter 5. Summary
5.1 Conclusions
This thesis uses examples to explain concepts related to branch performance
and branch prediction. These examples are kept simple in order to avoid
overwhelming the reader with the complexity found within current processors.
A pipeline and several processor configurations are described in Chapter 1.
These examples illustrate the delays that occur before and after branch
instructions. Additional delays due to interlocks or data fetches are shown to
degrade performance further. This first chapter provides insight into why
branch instructions cause delays in the pipeline.
Chapter 2 introduces the branch prediction methodologies discussed in this
thesis. It also discusses concerns that need to be addressed when
implementing branch prediction. Some advanced concepts such as timing and
recovery implications are discussed. Chapter 2 presents several means by
which to reduce the branching delays discussed in Chapter 1.

Chapter 3 summarizes the performance gains when implementing branch
prediction. This chapter uses the processor examples introduced in Chapter 1
and applies the branch prediction techniques introduced in Chapter 2. In
order to increase the throughput of pipelined processors, branch prediction is
necessary. Chapter 3 illustrates how branch prediction is able to reduce, and
in some cases eliminate, the delays associated with branch instructions.

Compilers have a tremendous impact on performance via the code they
produce. Poor programming techniques, depicted in Chapter 4, severely
degrade program performance. Programmers need to understand how code
structure impacts performance when designing their code. Code optimized for
one type of branch prediction may not be the optimal code for another type
of branch prediction. This makes it more difficult to write optimized code
which runs on a variety of processors.

Subroutines are used to modularize code, allowing for more efficient code
development. However, simplified code development comes at the expense of
performance. Understanding the impact of subroutines on a program's
performance helps the programmer decide when a subroutine is appropriate.

The many types of branch prediction alleviate the degradation caused by
branches. The degree to which branch prediction improves performance
varies considerably. It was shown that the more parallelized a processor, the
greater the impact delays have upon performance. With the trend towards
more pipelined processors and parallel operation within a processor, the use
of branch prediction is necessary in order to keep pipeline delays to a
minimum.
5.2 Related Concepts Not Explored Within This Thesis
5.2.1 Finite Benefits of Branch Prediction
The examples within this paper did not show the full finite benefit of branch
prediction. They execute a small segment of code which, after the first
occurrence, is found in the instruction buffers. The finite benefits derived from
branch prediction are very important in a workload environment. In a typical
workload environment the benefit derived from branch prediction is:
- 1 to 5 percent for opcode-based branch prediction.
- 3 to 12 percent for DHT based branch prediction.
- 9 to 35 percent for BHT based branch prediction.
5.2.2 Optimal Decision Statements
A program can also influence the number of instructions executed by the
arrangement of multiple parts of a decision statement. Figure 78 on
page 190 will be used to briefly illustrate this concept. If the most frequently
occurring check that causes the ELSE portion to be executed is that I exceeds
100, then this check should be performed first. Statements should be placed in
the order they are most likely to cause a branch to the ELSE statement. This
concept could be a subject of another paper on Branch Performance.
IF ((-BROKEN) &





Figure 78. Multiple Part Decision Statement
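The ordering argument can be made concrete by counting how many individual checks are evaluated under each arrangement. The sketch below is a hypothetical C++ analogue of a two-part decision: it relies on the short-circuit evaluation of && in C++, and the Sample type, the limit of 100, and the sample data are all illustrative rather than taken from the figure.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical input: a "broken" flag and a counter i that may exceed 100.
struct Sample {
    bool broken;
    int i;
};

// Build test data: over_limit samples with i > 100, within_limit with i <= 100.
std::vector<Sample> make_samples(std::size_t over_limit,
                                 std::size_t within_limit) {
    std::vector<Sample> data;
    data.insert(data.end(), over_limit, Sample{false, 150});
    data.insert(data.end(), within_limit, Sample{false, 50});
    return data;
}

// Checks evaluated for "if (!broken && i <= 100)".
std::size_t checks_broken_first(const std::vector<Sample>& data) {
    std::size_t checks = 0;
    for (const Sample& s : data) {
        ++checks;                  // !s.broken is always evaluated
        if (!s.broken) ++checks;   // s.i <= 100 evaluated only if first passes
    }
    return checks;
}

// Checks evaluated for "if (i <= 100 && !broken)".
std::size_t checks_limit_first(const std::vector<Sample>& data) {
    std::size_t checks = 0;
    for (const Sample& s : data) {
        ++checks;                    // s.i <= 100 is always evaluated
        if (s.i <= 100) ++checks;    // !s.broken evaluated only if first passes
    }
    return checks;
}
```

With eight samples that exceed the limit and two that do not, testing the limit first evaluates 12 checks, while testing the flag first evaluates 20. The check most likely to send control to the ELSE path is therefore placed first.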
5.3 Future Work
To adequately explore the benefits of branch prediction in a workload
environment requires an understanding of the typical workloads. Much of this
work has been done, but documenting the workloads and the finite effects of
various branch prediction techniques would require a large effort. With
DHT and BHT measurements, not only the workload, but the priming of the
prediction mechanisms needs to be understood.
Another item which could be explored in more depth is the trade-off between
load module size and program performance. The use of inlined functions
improves performance, but increases the amount of storage a program
consumes. Exploring this item would be a tuning exercise which would be
meaningful when applied to a typical workload environment.
Appendix A. S/390 Instruction Formats
Table 19. E Format
Bits 0-15
Opcode







Table 21. RRE Format





Opcode Not Used R1 R2









Opcode R1 X2 B2 D2









Opcode R1 R3 B2 D2














Opcode B1 D1 B2 D2
Table 26. SS Format - Single Length Version







Opcode L B1 D1 B2 D2












Opcode L1 L2 B1 D1 B2 D2








Opcode B1 D1 B2 D2
Appendix B. S/390 Branch Instructions







Instruction | Opcode | Mnemonic | Format | Reason branches | Target Address
Branch and Link | 05 | BALR | RR | Always branches unless R2 field is 0 | Register denoted by R2 field
Branch on Count | 06 | BCTR | RR | Always branches unless contents of register specified by R1 is one or R2 field is 0 | Register denoted by R2 field
Branch on Condition | 07 | BCR | RR | Branches if Condition Code field matches value specified by R1 field (mask) | Register denoted by R2 field
Branch and Set Mode | 0B | BSM | RR | Always branches unless R2 field is 0 | Register denoted by R2 field
Branch and Save and Set Mode | 0C | BASSM | RR | Always branches unless R2 field is 0 | Register denoted by R2 field
Branch and Save | 0D | BASR | RR | Always branches unless R2 field is 0 | Register denoted by R2 field
Branch and Link | 45 | BAL | RX | Always branches | X2 + B2 + D2
Branch on Count | 46 | BCT | RX | Always branches unless contents of register specified by R1 is one | X2 + B2 + D2
Branch on Condition | 47 | BC | RX | Branches if Condition Code field matches value specified by R1 field (mask) | X2 + B2 + D2
Branch and Save | 4D | BAS | RX | Always branches | X2 + B2 + D2
Branch on Index High | 86 | BXH | RS | Branches if sum greater than compare value | B2 + D2
Branch on Index Low or Equal | 87 | BXLE | RS | Branches if sum less than or equal to compare value | B2 + D2
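As a rough illustration of the table, the sketch below computes the
X2 + B2 + D2 target address of an RX-format branch and evaluates the Branch on
Condition mask. The function names and register contents are hypothetical. The
sketch assumes the RX layout of Appendix A (8-bit opcode, 4-bit R1, X2, and B2
fields, 12-bit D2), that an X2 or B2 field of zero contributes no register
value, and that the leftmost mask bit corresponds to condition code 0.
Addressing-mode truncation of the result is ignored.

```python
def decode_rx(instr: bytes):
    """Split a four-byte RX-format instruction into its fields."""
    if len(instr) != 4:
        raise ValueError("RX-format instructions are four bytes long")
    opcode = instr[0]
    r1 = instr[1] >> 4          # for BC this field holds the mask
    x2 = instr[1] & 0x0F
    b2 = instr[2] >> 4
    d2 = ((instr[2] & 0x0F) << 8) | instr[3]
    return opcode, r1, x2, b2, d2


def rx_target(gpr, x2, b2, d2):
    """Target address X2 + B2 + D2; a zero X2 or B2 field adds nothing."""
    index = gpr[x2] if x2 != 0 else 0
    base = gpr[b2] if b2 != 0 else 0
    return index + base + d2


def bc_taken(mask, cc):
    """Branch on Condition: taken when the mask bit for the current CC is one."""
    return bool(mask & (0x8 >> cc))


gpr = [0] * 16
gpr[12] = 0x1000                          # hypothetical base register contents
instr = bytes([0x47, 0x80, 0xC0, 0x20])   # BC with mask 8, X2=0, B2=12, D2=0x020
opcode, mask, x2, b2, d2 = decode_rx(instr)
target = rx_target(gpr, x2, b2, d2)
print(hex(opcode), mask, hex(target), bc_taken(mask, cc=0), bc_taken(mask, cc=1))
```

With the assumed register contents, the branch targets address 0x1020 and is
taken only when the condition code is 0.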
Appendix C. Multiple Decodes per Cycle
S/390 instructions can be two, four, or six bytes in length. See Appendix A,
"S/390 Instruction Formats" on page 192 for a detailed layout of the
instruction formats which are supported by the S/390 architecture. Other
architectures use fixed-length instruction sets. This can be an important
difference when it comes to implementing an instruction set within a
processor. This section illustrates the impact the instruction set can have
upon instruction decode performance.
One aspect of performance is the capability to decode more than one
instruction per cycle. When a processor has multiple execution units, multiple
decodes per cycle are necessary to keep the execution units fully utilized. The
number of instructions which can be decoded in one cycle is limited by
numerous constraints. Ignoring all other constraints, the critical limit is the
amount of work which can be accomplished within a single cycle. The
flexibility S/390 offers with its instruction formats can hinder the multiple
decode capability of processors which implement its instruction set.
The following examples decode four four-byte instructions. The example in
Figure 79 on page 200 depicts the steps an S/390 processor has to complete in
order to decode all four instructions within a single cycle. Figure 80 on
page 200 shows the steps to process four fixed-length instructions within a
single cycle. An S/390 processor requires more steps because it does not know
the length of an instruction until it decodes it. The decode stage can become
a critical path due to the amount of processing necessary to complete four
instruction decodes. If the path becomes too long, then either the number of
decodes per cycle has to be decreased or the cycle time has to be increased.
Both of these alternatives degrade performance.
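The difference can be sketched in a few lines. The function names are
hypothetical, and the sketch assumes the ESA/390 convention that the two
leftmost bits of the first opcode byte give the instruction length (00 for two
bytes, 01 or 10 for four bytes, 11 for six bytes). The step-by-step decode
described in this appendix is simplified here to a single length lookup. With
variable-length instructions each start offset depends on the length of the
preceding instruction, while fixed-length start offsets can all be computed
independently.

```python
def s390_length(first_byte: int) -> int:
    """Instruction length in bytes from the two leftmost opcode bits (assumed convention)."""
    code = first_byte >> 6
    return {0b00: 2, 0b01: 4, 0b10: 4, 0b11: 6}[code]


def variable_length_starts(stream: bytes, count: int):
    """Find start offsets sequentially: each one depends on the previous length."""
    starts, offset = [], 0
    for _ in range(count):
        if offset >= len(stream):
            break
        starts.append(offset)
        offset += s390_length(stream[offset])
    return starts


def fixed_length_starts(count: int, width: int = 4):
    """Fixed-length start offsets are known up front, independent of content."""
    return [i * width for i in range(count)]


# BALR (2 bytes), BC (4 bytes), a six-byte SS-format instruction, BCT (4 bytes)
stream = bytes([0x05, 0xEF,
                0x47, 0x80, 0xC0, 0x20,
                0xD2, 0x07, 0x10, 0x00, 0x20, 0x00,
                0x46, 0x30, 0xC0, 0x10])
print(variable_length_starts(stream, 4))
print(fixed_length_starts(4))
```

The loop in variable_length_starts carries a dependency from one iteration to
the next, which corresponds to the sequential decode steps of Figure 79. The
list built by fixed_length_starts has no such dependency.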
C.1 Definition of Steps in Variable-Length Instruction Decode
DECODE 1&2
Decode bytes 1 and 2. Determine from the byte contents that the instruction
is not a two-byte instruction, direct processing to next step, DECODE 3&4.
DECODE 3&4
Decode bytes 3 and 4. Determine from the byte contents that this instruction
is a four-byte instruction. Now done decoding instruction, decide
which instruction queue this instruction belongs on. Is there room in this
instruction queue for a new instruction?
PUT ON QUEUE OR SUSPEND DECODING
If there is room then put this instruction on the appropriate instruction
queue, otherwise stop decoding instructions.
C.2 Definition of Steps in Fixed-Length Instruction Decode
DECODE
Decode instruction, decide which instruction queue this instruction belongs
on. Is there room in this instruction queue for a new instruction?
PUT ON QUEUE OR SUSPEND DECODING
If there is room then put this instruction on the appropriate instruction
queue, otherwise stop decoding instructions.
[Figure 79. Variable instruction length (S/390). Instr 1 through Instr 4 each
pass through DECODE 1&2, DECODE 3&4, and PUT ON QUEUE OR SUSPEND DECODING;
decoding of an instruction starts only after the preceding instruction has
been decoded.]
[Figure 80. Fixed instruction length. Instr 1 through Instr 4 each pass through
DECODE at the same time, followed by PUT ON QUEUE OR SUSPEND DECODING steps
which are slightly offset from one another.]
The fixed-length instruction decoder knows the length of each instruction so it
can decode multiple instructions simultaneously. The order in which the
instructions are put on the instruction queues is the only piece of work which
is done sequentially. This accounts for the small amount of offset between
the four instructions.
The deciphering of the variable-length instructions increases the complexity of
the decoding hardware. The decode unit can only decode the instructions
sequentially since it needs to figure out where one instruction ends before it
can begin decoding the next instruction. This complexity limits how much
decoding can be done within a given cycle. The complexity of a given
architecture directly impacts the complexity of the processors which support the
architecture and the overall performance potential.
Appendix D. References
[ESA 390]. Enterprise Systems Architecture/390, Principles of Operation,
SA22-7201-00.
[RISC 90]. IBM RISC System/6000 Technology, SA23-2619-00, 1990.
[CoM 90]. John Cocke and V. Markstein, "The Evolution of RISC Technology at
IBM," IBM Journal of Research and Development 34, 4-11 (1990).
[BaGrMo 90]. H.B. Bakoglu, G.F. Grohoski, and R.K. Montoye, "The IBM
RISC System/6000 processor: Hardware Overview," IBM Journal of Research
and Development 34, 12-22 (1990).
[OeGr 90]. R.R. Oehler and R.D. Groves, "IBM RISC System/6000 Processor
Architecture," IBM Journal of Research and Development 34, 23-36 (1990).
[Gr 90]. G.F. Grohoski, "Machine Organization of the IBM RISC
System/6000 Processor," IBM Journal of Research and Development 34, 37-58
(1990).
[Li 92]. John S. Liptay, "Design of the IBM Enterprise System/9000 High-end
Processor," IBM Journal of Research and Development 36, 713-731 (1992).
[DiMc 87]. David R. Ditzel and Hubert R. McLellan, "Branch Folding in the
CRISP Microprocessor: Reducing the Branch Delay to Zero," 14th Annual
Symposium on Computer Architecture, pp. 2-9, June 1987.
[St 93]. Harold S. Stone, "High-Performance Computer Architecture,"
Addison-Wesley Publishing Company, Inc., June 1993.
[GoLl 93]. Antonio M. Gonzalez and Jose M. Llaberia, "Reducing Branch
Delay to Zero in Pipelined Processors," IEEE Transactions on Computers,
42:3, March 1993, pp. 363-371.




Company, Inc., September 1991.
Appendix E. Definition of Terms
Access Register. There are 16 access registers available for use by certain
S/390 instructions. A four-bit field is used to select a register when
appropriate.
Active Instruction Streams. The current sets of instructions being processed
but not yet completed. See section 2.5, "Active Streams" on page 83 for a
detailed discussion of active streams.
B
Branch. An instruction which can be used to direct instruction processing to
another segment of code.
BHT (Branch History Table). A table which is used to predict if a conditional
branch is to be taken or not. It also predicts the branch target address for




Complete (when referring to instructions). An instruction is complete when the
instruction has met all of the architectural requirements. All facilities and
storage locations are updated.
Conditional Branch. A branch instruction which may or may not be taken
dependent upon the current value of a condition code.
Condition Code. A facility which is set to different values based upon the
status of an instruction at completion time. See section 2.6, "Condition
Codes" on page 86 for a detailed discussion.
Conditional Completion. Conditional completion occurs when conditional
instructions complete in order to allow subsequent conditional instructions to
proceed. An instruction which conditionally completes does not architecturally
complete until the instructions preceding it have been architecturally
completed.
Conditional Stream. A set of instructions which is processed after a predicted
branch, before it is known whether these are the correct instructions to be
processed.
Control Register. There are 16 control registers available for use by certain
S/390 instructions. A four-bit field is used to select a register when
appropriate.
CSECT. An acronym meaning control section.
D
DHT (Decode History Table). A table which is used to predict if a
conditional branch instruction is to be taken or not. See section 2.3, "DHT
(Decode History Table)" on page 56 for more detail.
Double Word. Eight bytes of data.
Fall-through Path. The set of instructions physically following a branch
instruction. These are executed if a branch is not taken.
False Branch. False branches can occur when using an asynchronous BHT.
A branch is predicted to exist and instruction fetching may continue at what
is predicted to be the target address. When the instruction is decoded it is
found not to be a branch. This occurs when not all instruction bits are used
in the branch prediction or the instruction data at that location has been
altered since the last occurrence.
Fastpath. When a piece of information is able to get from one place to
another quicker than through conventional methods. One example of this is
fastpathing the resultant condition code to the branch that needs it.
Finite Cache. A cache which has a limited capacity. References to a finite
cache will sometimes not find their data within the cache. Such a cache miss
lengthens the access time for the data. All processors which utilize a cache
use a finite cache.
General Purpose Register. There are 16 general purpose registers available for
use by certain S/390 instructions. A four-bit field is used to select a register
when appropriate.
H
Hashing. Method to use certain bits within an instruction address to map
into a branch prediction array. Hashing is used to disperse the activity evenly
throughout the array.
Hit. When an entry is found which matches the search address.
Hit Buffer. Buffer between the BHT and I-unit which holds the most recent
BHT hits.
Infinite Cache. An infinite cache is one in which every access finds the
requested data within the cache. Each cache access will be a cache hit. An
infinite cache is often used during performance modelling to understand an
ideal case.
Inlining. A function is written as a modular piece of code. When a program
calling this function is compiled, it is done as if this function were a
sequential piece of code within this program. Each call to the function is
replaced by the actual code within the function. If a function is not inlined,
then to execute the code within the function a program must use a branch
instruction to go to and return from the function.
I-fetch. A request for more instruction data.
Instruction Buffer. A facility within the CPU which contains the current
instruction data. See section 2.7, "Instruction
Buffers"
on page 89.
I-Unit. Instruction processing unit. It handles instruction decode and
instruction data requests.
Interlock. An interlock occurs when a resource is needed by multiple
instructions but only one can have it at a time. The interlock causes one or
more instructions to wait until the resource is available. Some of the
common interlocks occur on the following resources: the condition code,
registers, execution units, and data locations.
LRU (Least Recently Used). An algorithm commonly used to determine
which item within a set is replaced by a new arrival. This algorithm keeps
track of which item is the least recently used (accessed) and replaces this item




Miss. When a search of an array does not find an entry matching the search
address.
N
No-op. An unconditional branch which never branches. Such branches are often
referred to as no-ops because they perform no operation except to consume an
execution cycle. Some no-ops, such as BCTR, are actually used by
programmers to efficiently decrement a register.
NSI. Next sequential instruction. The instruction data physically following
the branch instruction.
O
Out-of-order execution. This is when instructions are handled in an order
different from that in which they are presented to the processor. In highly
overlapped processors there are multiple execution units which can
concurrently process instructions. The processor is responsible for ensuring
that it functions as if the instructions are handled sequentially.
Opcode. Instruction operation code.
Parallel Processor. A processor which can be working on multiple units of
work (usually referred to as instructions) concurrently.
Pipeline. The stages of work necessary in order to interpret and process an
instruction. The actual components which make up the stages of the pipeline
vary among processor designs.
Prefetching. Prefetching is when a fetch for data is issued before it is known
whether the data is needed.
Q
Quad word. Sixteen bytes of data.
Sequential Processor. A processor which completes a single unit of work
(usually referred to as an instruction) before starting the next unit of work.
Stage. Work that is grouped together as one logical action. Often one stage
takes one cycle to complete.
Storage Penalty. The varying cost to bring data from storage into the
processor. The data can be in any one of the levels of cache within the
storage subsystem or on auxiliary storage. The number of cycles required to
access data varies depending upon how close the data is to the processor.
Synonym. In the case of a branch prediction methodology this is when two
branches look identical to the branch prediction methodology but are in fact
different branches. This can occur when not all of the branch instruction
address bits are examined during the branch prediction process.
Taken Branch. A branch instruction which changes the instruction flow to
another code segment, i.e. it is not followed by the next sequential instruction.
U
Unconditional Branch. A branch which is always taken or not taken
independent of the current state of the processor.
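Several of the terms defined above (Hashing, Hit, Miss, and Synonym) can be
illustrated together. The following is a minimal sketch and not the table
organization used in this thesis: the array size, the choice of address bits,
and the class and method names are all assumptions made for illustration.

```python
class PredictionArray:
    """Direct-mapped branch prediction array indexed by hashed address bits."""

    def __init__(self, index_bits=4, tag_bits=4):
        self.index_bits = index_bits
        self.tag_bits = tag_bits
        self.entries = [None] * (1 << index_bits)

    def _hash(self, address):
        # Instructions start on even addresses, so the low bit is dropped.
        # The next index_bits select the entry; tag_bits form a partial tag.
        index = (address >> 1) & ((1 << self.index_bits) - 1)
        tag = (address >> (1 + self.index_bits)) & ((1 << self.tag_bits) - 1)
        return index, tag

    def lookup(self, address):
        """Return the stored prediction on a hit, or None on a miss."""
        index, tag = self._hash(address)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, address, taken):
        """Record the outcome of the branch at this address."""
        index, tag = self._hash(address)
        self.entries[index] = (tag, taken)


array = PredictionArray()
array.update(0x1020, taken=True)
print(array.lookup(0x1020))   # hit: the entry matches the search address
print(array.lookup(0x1024))   # miss: hashes to a different, empty entry
print(array.lookup(0x3020))   # synonym: differs only in bits not examined
```

Because only nine low-order address bits are examined in this sketch, the
branches at 0x1020 and 0x3020 look identical to the array. Examining more
address bits reduces synonyms at the cost of a larger array.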
Appendix F. Acknowledgements
Dr. Chang has dedicated a tremendous amount of time towards the
composition of this thesis. This thesis flows well not because of my writing
abilities, but due to Dr. Chang's numerous reviews. I also appreciate his
willingness to meet on Saturdays to accommodate my work schedule.
Dr. Unnikrishnan matched Dr. Chang and me for the purpose of completing
this thesis; for this I am very grateful.
I would like to thank IBM for allowing me to use my work at IBM as a basis
for this thesis. I have been able to use their facilities to compose my work.
I would like to thank my colleagues for reviewing my work. Their questions
enabled me to clarify the paper's contents.
I would like to thank my parents for instilling in me an appreciation for
knowledge.
My husband and children were very supportive during the long periods during
which mom was busy.
Appendix G. Biography
Debbie St. Onge, IBM S/390 Division, 522 South Road, Poughkeepsie, NY,
12601 (YAMAHA at PKEDVM9, yamaha@pkedvm9.vnet.ibm.com). Debbie
St. Onge received a B.S. in Computer Engineering from Rochester Institute of
Technology, Rochester, NY, in 1988. Debbie St. Onge first worked a six-month
internship with the processor performance group at IBM in 1985. She
returned to the processor performance area upon completion of her Bachelor's
degree in 1988. Her current duties involve performance analysis of the future
mainframe class processors; analyzing customer performance and benchmark
performance on current IBM mainframe products; and analyzing the
performance of various mainframe-based solutions.

