Reducing branch delay to zero in pipelined processors by González Colás, Antonio María & Llaberia Griñó, José M.
IEEE TRANSACTIONS ON COMPUTERS, VOL. 42, NO. 3. MARCH 1993 363 
Reducing Branch Delay to Zero in Pipelined Processors 
Antonio M. Gonzalez and Jose M. Llaberia 
Abstract-A mechanism to reduce the cost of branches in pipelined 
processors is described and evaluated. It is based on the use of multiple 
prefetch, early computation of the target address, delayed branch, and 
parallel execution of branches. The implementation of this mechanism 
using a Branch Target Instruction Memory is described. An analytical 
model of the performance of this implementation is presented, which 
allows us to measure the efficiency of the mechanism with a very low 
computational cost. The model is used to determine the size of cache lines 
that maximizes the processor performance, to compare the performance 
of the mechanism with other schemes, and to analyze the performance of 
the mechanism with two alternative cache organizations. 
Index Terms-Branch instructions, branch target instruction memory, 
computer architecture, instruction cache memory, instruction dependen- 
cies, performance evaluation, pipelined processors. 
I. INTRODUCTION 
Pipelining is a technique frequently used in the design of processors 
in order to increase their performance by executing several instruc- 
tions simultaneously. However, the efficiency brought by pipelining 
may be significantly reduced by hazards caused by instruction depen- 
dencies. Those due to branches, also known as control dependencies, 
may have a severe impact on the processor performance since these 
instructions account for a high percentage of executed instructions. 
The present work focuses on the design and evaluation of mech- 
anisms for reducing the negative effect due to hazards produced by 
branch instructions in pipelined processors. We present and evaluate a 
mechanism called COBRA (Cost Optimization of BRAnches) which 
eliminates most of the hazards caused by branches and allows the 
processor to execute branches in parallel with the rest of instructions. 
In this way, the cost of most branches can be reduced to zero. To 
evaluate the performance of this mechanism, a mathematical model 
of COBRA is developed and used to tune the design. 
The rest of this paper is organized as follows. Section II is a
review of previous work on reducing the cost of branches. Section III
describes the COBRA mechanism. A mathematical model of COBRA 
is presented in Section IV. Section V discusses the performance of 
COBRA and compares it with other schemes. 
II. REDUCING THE COST OF BRANCHES
Several mechanisms have been proposed in the literature in order 
to reduce the cost of branches [14], [15]. They make use of either 
one or several of the five techniques described briefly below. 
a) Delayed branch. A delayed branch with length equal to n is
a branch instruction that takes effect after the execution of the n 
instructions below it. The compiler is responsible for benefiting from 
this mechanism because it is in charge of finding the instructions that 
must be scheduled in the n delay slots. Among others, the mechanism 
is used by the MIPS R3000 [16]. 
If the processor is provided with the possibility of nullifying the
execution of the instructions in the delay slots, the number of delay
slots that can be profitably used increases. This mechanism is called
delayed branch with squashing. This is the case of the SPARC [4].
Manuscript received June 15, 1990; revised March 15, 1992. This work was
supported in part by the Comisión Interministerial de Ciencia y Tecnología
(CICYT) under grant TIC89/0300.
The authors are with the Department of Computer Architecture, Universitat
Politècnica de Catalunya, Barcelona, Spain.
IEEE Log Number 9202843.
b) Early execution of branches. Hazards caused by a branch can be 
reduced by executing some of its operations in advance. For example, 
the Motorola 68040 [3] has an additional adder to compute the target 
address as soon as a branch is fetched. 
c) Branch prediction. Another way of advancing the possible result
of a branch is to predict it. As an example we could mention the
Intel 8096CLNext Generation [11]. In this processor, each branch
instruction includes a bit that is used by the compiler to predict the 
most likely result of the branch. 
d) Multiple prefetch. It is based on prefetching after each branch 
some of the instructions at the beginning of each possible path. In 
this way, when the result of the branch is known, the fetch stage has 
been already performed, regardless of the taken path. This technique 
is implemented in the Intel i486 [2]. 
e) Parallel execution of branches. The preceding techniques try to
reduce the negative effect caused by control dependencies. A greater 
increase in performance can be achieved if the execution of branches 
is completely overlapped with the execution of the rest of instructions. 
This is the case of the IBM RS/6000 [9]. 
In many processors we find that several techniques from those types 
listed above are combined in order to build a particular mechanism 
to reduce the cost of branches. This is the case of the COBRA 
mechanism. 
III. COBRA MECHANISM
In this section we present the COBRA mechanism. It was devised 
for pipelined processors with any number of stages and with condition 
codes. A preliminary study of the COBRA mechanism was presented 
in [7], [8], and [6]. 
The COBRA mechanism combines several techniques to allow the 
processor to execute branches in parallel with the rest of instructions. 
These techniques are early computation of the target address, multiple
prefetch, delayed branch, and parallel execution of branches. At
the time COBRA was first proposed [7], what was novel about it in
relation to other mechanisms was the approach used to implement the
parallel execution of branches, which is based on early computation of
the target address and prefetching the two paths of branches. Besides,
it was the first mechanism (as far as we know) that combined all
four types of techniques in order to reduce the branch cost to
zero. Since then, a few commercial processors, such as the IBM
RS/6000 [9], have also implemented a mechanism based on the combination
of these four types of techniques. The same concept has different
implementations that lead to different performance levels, so the
other contribution of COBRA is the way it is implemented. COBRA
can be implemented using either a conventional instruction cache or 
a branch target instruction memory (both terms are defined later). We 
show in this paper that the implementation using the latter memory 
organization has a better performance in terms of cost-effectiveness. 
To explain the functioning of COBRA, we distinguish two main 
units in the processor: the Instruction Unit (IU), which is respon- 
sible for fetching and sequencing instructions, and the Execution 
Unit (EU), which executes only data manipulation instructions (all 
instructions except control transfer instructions). The target address 
is computed in advance by the use of prefetching techniques. When 
the IU finds a branch (usually some cycles before it must take 
effect), it computes its target address and prefetches some of the first 
instructions of the two possible paths (multiple prefetch). When the 
0018-9340/93$03.00 © 1993 IEEE
[Figure 1: pipeline timing diagram. Annotations: the next branch instruction is fetched; instructions from both the taken and not-taken paths are fetched; condition codes are computed and, depending on their value, at the end of the cycle the processor chooses between the two possible paths, t and n being their respective first instructions. Stage legend: IF, instruction fetch; D, decode; OF, operands fetch; ALU, ALU operation; WR, write result into destination register.]
Fig. 1. Execution of a branch instruction using the COBRA mechanism.
result of the branch condition is known, one of the two prefetched 
flows of instructions is chosen. In this way, the delay introduced 
by branches is decreased by one unit (in general, it is decreased by 
the same amount of units as a fetch operation takes). The remaining 
delay slots are utilized by means of the delayed branch technique. 
All the operations required by branch instructions are performed by 
the IU in parallel with the EU activity, that is, with the execution 
of instructions different from branches. In this way, the time cost of 
many branches can be reduced to zero. 
The scheme proposed by Katevenis in [13] is used to codify the 
target address of PC-relative branches. The basic idea of this approach 
is that the instruction contains the least-significant bits of the target 
address, rather than its offset. This scheme allows the IU to perform 
the prefetch in cache memory of the instructions at the target address 
in the cycle next to the fetch of the branch instruction, in parallel 
with the computation of the most-significant bits of the target address. 
In this way, the delay cycle caused by the addition operation in the
conventional scheme is avoided.
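The encoding idea can be illustrated with a short sketch. It assumes a 12-bit field for the low bits of the target and uses the sign bit (s) and carry bit (c) named in the legend of Fig. 2; the field width, the function names, and the exact bit layout are illustrative assumptions, not details taken from [13].

```python
K_BITS = 12                    # assumed width of the target-LSB field (illustrative)
MASK = (1 << K_BITS) - 1


def encode_branch(pc, offset):
    """Assembler side: return (lsb, c, s) for a PC-relative branch.

    lsb -- low K_BITS bits of the target address, stored in the instruction
    c   -- carry out of the low-bits addition
    s   -- sign of the offset (1 if negative)
    """
    if not -(1 << K_BITS) <= offset < (1 << K_BITS):
        raise ValueError("offset out of range for the assumed field width")
    s = 1 if offset < 0 else 0
    low_sum = (pc & MASK) + (offset & MASK)
    return low_sum & MASK, low_sum >> K_BITS, s


def decode_target(pc, lsb, c, s):
    """IU side: the low bits are usable immediately (for example, to index the
    cache); only the high bits need a small increment or decrement."""
    msb = (pc >> K_BITS) + c - s
    return (msb << K_BITS) | lsb
```

Under these assumptions the low bits are available without any addition, and the high bits of the target differ from those of the PC by at most one.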
Fig. 1 shows a possible execution of a branch using the COBRA 
mechanism for a sample pipeline. In this example the IU finds a 
branch in cycle n. After that, it continues fetching instructions that 
follow in sequence and also some instructions from the taken path. 
When the instruction that sets the condition codes finishes its ALU 
stage (cycle n + 3) the IU decides which path must be selected and 
sends the corresponding first instruction to the EU. From then on, 
the IU fetches instructions from the selected path until a new branch 
is found. The delay introduced by computing the condition codes 
(two cycles in this example) is used by means of the delayed branch 
technique [10]. If the ALU is the Nth stage of the pipeline, with this
scheme each branch will have N - 2 delay slots. 
A. Memory Organization 
Two different cache memory organizations have been considered
for the implementation of COBRA. We call these organizations
conventional instruction cache memory and branch target instruction
memory (BTIM).
In a conventional instruction cache memory the mapping unit 
is a fixed size block. For a branch target instruction memory, the 
mapping unit consists of the instructions between two consecutive 
taken branches (including the latter branch). In this case, the mapping 
unit has a variable size and is defined at execution time. This unit
will be called a sequence.
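As an illustration of this definition, the following sketch splits an execution trace into sequences and tabulates their lengths, which is the distribution the model later calls D(d). The trace format, a list of (is_branch, taken) pairs, and the function names are assumptions made for the example.

```python
from collections import Counter


def sequence_lengths(trace):
    """Return the length of every complete sequence in the trace.

    trace -- iterable of (is_branch, taken) pairs in execution order.
    A sequence ends at, and includes, a taken branch. Instructions after
    the last taken branch form an incomplete sequence and are dropped.
    """
    lengths = []
    current = 0
    for is_branch, taken in trace:
        current += 1
        if is_branch and taken:
            lengths.append(current)
            current = 0
    return lengths


def empirical_D(lengths):
    """Relative frequency of each sequence length."""
    if not lengths:
        return {}
    counts = Counter(lengths)
    return {d: n / len(lengths) for d, n in sorted(counts.items())}
```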
To reduce the complexity that the management of information units 
with a variable size implies, a usual approach to implement a BTIM 
consists in limiting to a fixed amount the number of instructions of 
a sequence that are stored in cache memory. If a sequence is greater 
than this size, the remaining instructions are obtained from the next 
level of the memory hierarchy. If it is smaller, the line is filled up 
with the instructions that follow in sequence. An implementation like 
this is used in the Am29000 processor [12]. 
Each entry of the cache memory will be called a line. A line stores 
a block in the case of a conventional cache or part of a sequence in 
the case of a BTIM. 
To access the next level of memory, a burst-mode protocol is 
used. With this protocol, transactions are not fixed in length. After 
sending the instructions corresponding to a given line, the memory 
can continue sending the instructions of the following lines, one 
instruction per cycle, without any delay until the processor or memory 
decides to terminate the transaction. In this way, the latency of the 
external memory is experienced just once as long as the requested 
instructions are at consecutive addresses. 
Each time a cache miss occurs, an entire new line is loaded into 
cache memory. The instructions of the line arrive at the rate of one 
per cycle, in the order they are stored in the line. As soon as the 
instruction that caused the miss is available, it is passed to the IU 
and begins execution. If a new cache memory access is required while 
a line is being loaded (for instance, when the line contains a taken 
branch), the former line must be completely loaded before beginning 
the new cache access. 
B. Design of the Instruction Unit 
The main components of the instruction unit that implements the 
COBRA mechanism are shown in Figs. 2 and 3. The IU is composed 
of a BTIM and the hardware necessary for selecting the instruction 
that must feed the EU in each cycle, detecting branch instructions in 
advance and eliminating them from the flow of instructions sent to 
the EU. The implementation using a conventional instruction cache 
can be found in [8]. 
The IU uses the BTIM to prefetch the first line from the taken path 
of branches. Since the BTIM provides a complete line just in one 
cycle, the prefetch of the taken line can be postponed until the same 
cycle in which the condition codes for the branch are set. Accessing 
the BTIM earlier does not provide any additional benefit except for 
the case when the requested line is not in the BTIM. In this case, a 
further anticipation could be used to prefetch the line from external 
memory but, since the IU has just one path to external memory, this 
implies suspending the fetching of instructions that follow in sequence 
before the outcome of the branch is known. In [5] we demonstrated 
that this alternative does not provide any additional benefit. 
In consequence, the IU must only analyze in each cycle the 
instruction that follows in sequence to the one that is in the first stage 
of the EU pipeline. If the analyzed instruction is a branch, the BTIM 
is accessed to obtain (if hit) the taken line. In the same cycle, the 
instruction that sets the condition codes will be in the ALU stage. In 
this way, at the end of this cycle, the BTIM line (or the corresponding 
miss) will be selected or discarded, depending on the condition codes. 
The IU has a register to store the line obtained from the BTIM 
in case of hit. The first instruction of this line does not need to be 
stored because it must immediately be sent to the EU. 
X1 is a multiplexer that selects the instruction to be sent to the EU.
The X2 multiplexer selects the instruction next to the one selected
by X1. This instruction is examined by the early branch detection
circuit to check if it is a branch (in a RISC architecture it could be 
as simple as testing just one or very few bits of the op-code). The 
circuit that generates the control signals for these two multiplexers 
(not shown in Fig. 2) is basically a counter with the possibility of 
being incremented by one or two units depending on the result of 
[Figure 2: block diagram. Legend: MSB, most significant bits; LSB, least significant bits; Ta, target address; s, sign bit, used to compute the MSB of the target address; c, carry bit, used to compute the MSB of the target address; bt, indicates whether the branch is a computed branch or not; TAC, Target Address Computation circuit (see Fig. 3); branch detection circuit.]
Fig. 2. Block diagram of the Instruction Unit.
[Figure 3: block diagram. Recoverable labels: L + K (line size); Ta + line size; computed branch (from EU); MSB(Ta); c+s; LSB(Ta); bt. Legend: MSB, most significant bits; LSB, least significant bits; Ta, target address; s, sign bit, used to compute the MSB of the target address; c, carry bit, used to compute the MSB of the target address; bt, indicates whether the branch is a computed branch or not.]
Fig. 3. Block diagram of the Target Address Computation circuit (TAC in Fig. 2).
the branch detection circuit. When a branch is taken, this counter is 
reset to zero. 
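The behavior of this counter can be sketched as follows. This is a simplified illustration and not the circuit itself: it covers a single line held in the IU, ignores the reset on taken branches, and assumes that the first instruction is not a branch and that two branches are never adjacent. The function and parameter names are hypothetical.

```python
def issue_order(line, is_branch):
    """Yield the index of the instruction sent to the EU in each cycle.

    line      -- list of instructions held in the IU
    is_branch -- predicate standing in for the early branch detection circuit
    """
    counter = 0
    while counter < len(line):
        yield counter                  # X1: instruction sent to the EU
        nxt = counter + 1              # X2: instruction examined by the detector
        if nxt < len(line) and is_branch(line[nxt]):
            counter += 2               # skip the branch; the IU handles it
        else:
            counter += 1
```

For the line `["add", "sub", "beq", "or", "bne", "and"]` with a predicate that flags mnemonics starting with `b`, the EU receives the instructions at indices 0, 1, 3, and 5, so the two branches never occupy an EU cycle.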
The instructions supplied by the external memory should arrive at 
the IU one cycle before the EU can start its execution in order to be 
analyzed by the branch detection circuit and processed by the IU if 
they are really branches. A further anticipation, as explained above, 
does not provide any additional benefit. If for any reason, like a BTIM 
miss, they arrive later, some bubbles will occur in the EU pipeline, 
causing a degradation in the processor performance. During the cycle 
that an instruction supplied by the external memory is processed by 
the IU, it is held in the Delay register. 
When a branch is detected, the BTIM is searched for the target line 
while the instruction that sets the condition codes is in the ALU stage. 
At the end of this cycle, the condition codes determine whether the 
branch is to be taken. If the branch is taken, the PC block is loaded 
with the target address and X ,  selects the address that is sent to 
external memory. If the access to the BTIM produced a cache miss 
the selected address is the branch target address. Otherwise, it is the 
branch target address plus the cache line size (S = L + K ) .  Note 
that the burst transaction initiated for the last taken branch is not yet 
suspended and, therefore, it can be continued if the branch is not 
taken. 
The target address of computed branches is calculated by the EU 
and sent to the IU. Call and Return instructions are also a particular 
kind of branches. Call instructions can be sent to the execution unit, 
like an arithmetic instruction, with the sole objective of storing the 
return address (the target address is computed by the IU). Return
instructions are also sent to the EU and are treated like computed 
TABLE I
NOTATION FOR THE MODELS
From the applications:
B: Probability that an instruction is a branch
T: Probability that a branch is taken
D(d): Probability density function of the distance between two consecutive taken branches (length of sequences)
F(d): Probability density function of the distance between two consecutive branch instructions
branches that obtain the target address from the place where the 
corresponding call instruction stored it. The drawback of this solution 
is that Call and Return instructions, unlike the rest of branches, spend 
one cycle in the EU, and, therefore, cannot be completely executed 
in parallel. A more efficient solution, also more expensive, consists 
in adding a hardware stack to the IU, where the IU will store the 
return address of Call instructions in parallel with the EU activity. In 
this case, when the IU finds a Return instruction, the target address 
is obtained from the top of this stack, also completely in parallel 
with the EU activity. In this way, Call and Return instructions can 
be executed with zero time cost. The results presented in the next 
section assume that the IU has this hardware stack available.
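A minimal software sketch of such a return-address stack is given below. The class and method names are hypothetical, and the stack is unbounded here, whereas a hardware stack would have a fixed depth that the text does not specify.

```python
class ReturnAddressStack:
    """Toy model of the IU hardware stack for Call and Return instructions."""

    def __init__(self):
        self._stack = []

    def on_call(self, return_address):
        # The IU saves the return address while the EU keeps executing.
        self._stack.append(return_address)

    def on_return(self):
        # The target of a Return is taken from the top of the stack.
        if not self._stack:
            raise RuntimeError("Return without a matching Call")
        return self._stack.pop()
```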
IV. MODELING COBRA 
A mathematical model of COBRA for the implementation that 
uses a BTIM is developed in this section. The model has some 
input parameters listed in Table I. These input parameters can be 
classified in three types: a) Those that depend on the applications 
(B, T, D(d), F(d)), b) those that depend on the implementation
(L and S), and c) those that depend on both the applications and
the implementation (H). This model will be used to compute the
performance of the processor for different system configurations. 
In addition, an analytical model for the Delayed Branch scheme is 
presented. Its objective is to compare COBRA with Delayed Branch 
in order to show the extra performance of COBRA in relation to its 
hardware cost (shown in the previous section). 
A.  Pipeline 
The efficiency of both COBRA and Delayed Branch depends on the
length of the pipeline. In this paper we concentrate on a pipeline in
which the ALU stage is the second one. For this pipeline, the Delayed
Branch scheme has one delay slot per branch whereas COBRA does 
not need any delay slot and, in addition, branches are executed in 
parallel with other instructions. A deeper pipeline will imply an 
increase in the number of delay slots of both schemes. 
B. Analytical Model for COBRA 
The peak performance of the processor using COBRA is zero 
cycles for branches and one cycle for any other instruction. However, 
to achieve this peak performance several conditions must hold: 
• The target line of each taken branch should be in the BTIM. The
ratio of lines that are actually found in the BTIM depends on
the number of lines of the BTIM, the BTIM organization, and
the temporal locality of the program.
• Each cycle, the EU should begin the execution of a nonbranch
instruction and, in parallel, the IU should deal with the instruction
that follows in sequence. Even if every target line were
in the BTIM, there would be no guarantee that this condition
is met, since the IU relies on the external memory for part of
those sequences whose size is greater than a BTIM line. So
the line size and the external memory latency also affect the
performance of the processor.
TABLE I (continued)
From the implementation:
L: Latency of external memory
S: Size of BTIM lines
From both the applications and the implementation:
H: BTIM target hit ratio, which is computed as the number of taken branches whose target sequence is found in the BTIM divided by the total number of taken branches
In the development of the analytical model we assume that two 
branches never occur without at least one instruction between them. 
This hypothesis simplifies the model by introducing a negligible error, 
since in practice this fact happens very rarely. 
The processor performance (P) is computed as the average number 
of useful instructions executed per cycle. Useful instructions are those 
instructions processed by the EU (all instructions but branches). In 
this way, P = (1 - B)/(1 - B + D), where D is the average number
of lost cycles per instruction (including branches). To compute D, the
different sources of penalization will be characterized. Lost cycles
are due to five different causes: 1) Memory latency due to BTIM
misses, 2) Complete replacement of lines, 3) Memory latency for
BTIM hits, 4) Lack of anticipation due to BTIM misses, and 5) Loss
of anticipation due to not taken branches. Then, D = D1 + D2 +
D3 + D4 + D5, where Di represents the average number of lost
cycles per instruction due to cause i. Next, expressions for each Di
are developed. 
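The performance expression can be written as a small helper, shown here as a sketch. The function name and the choice of passing the five components as a list are assumptions made for illustration.

```python
def performance(B, lost):
    """Useful instructions per cycle, P = (1 - B) / (1 - B + D).

    B    -- probability that an instruction is a branch
    lost -- iterable with the lost cycles per instruction for each cause
            (D1 to D5); their sum is D
    """
    D = sum(lost)
    return (1.0 - B) / (1.0 - B + D)
```

With no lost cycles the result is 1, the peak rate of one useful instruction per cycle; any positive D lowers it.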
1) Memory Latency Due to BTIM Misses: This happens when a 
branch is taken and a cache miss occurs when the IU accesses the 
BTIM to fetch the next sequence. The cost of this cache miss is L 
cycles. The probability that this event happens is BT(1 - H), and,
therefore, the average number of lost cycles per instruction due to
this cause is D1 = LBT(1 - H).
2) Complete Replacement of Lines: This happens when the IU is
dealing with a branch that turns out to be taken, a BTIM miss occurred
in the previous taken branch, and the distance between these two
branches (here called d) is less than S - 1. In this case, the IU must
finish the replacement of the former line before beginning to search
the BTIM for the new line. The additional cycles needed to complete
the replacement are S - 1 - d, and the average number of lost cycles
per instruction due to this cause is
$$D_2 = B\,T\,(1 - H) \sum_{d=2}^{S-2} D(d)\,(S - 1 - d).$$
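The first two terms can be evaluated directly from the model parameters. The sketch below assumes that D(d) is supplied as a dictionary mapping each distance d to its probability; the function names are illustrative.

```python
def d1(B, T, H, L):
    """Lost cycles per instruction due to memory latency on BTIM misses."""
    return L * B * T * (1.0 - H)


def d2(B, T, H, S, D):
    """Lost cycles per instruction due to completing a line replacement.

    D -- dict mapping distance d to the probability D(d); missing keys
         are treated as probability zero.
    """
    # Sum over d = 2 .. S-2 of D(d) * (S - 1 - d)
    extra = sum(D.get(d, 0.0) * (S - 1 - d) for d in range(2, S - 1))
    return B * T * (1.0 - H) * extra
```

For example, with B = 0.2, T = 0.5, H = 0.8 and L = 3 the first term is 3 x 0.2 x 0.5 x 0.2 = 0.06 lost cycles per instruction.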
3) Memory Latency for BTIM Hits: This happens when the current 
sequence was found in the BTIM but it is larger than a line, and 
therefore, only the first instructions are in the BTIM; the remaining 
instructions are provided by the external memory. If the external 
memory latency (L) is greater than the line size (S), then L - S 
cycles will be lost for each one of those sequences. The average 
number of lost cycles per instruction due to this cause is 
4) Lack of Anticipation Due to a BTIM Miss: This happens for any 
branch when a BTIM miss occurred in the previous taken branch. 
In this case, all the instructions between the last taken branch and 
the next taken one are provided by the external memory at the rate 
of one per cycle and therefore branches cost one cycle since they
are not detected early enough to overlap their execution with some
previous instruction. In this way, while the IU is dealing with the
branch a NOP is sent to the EU. The average number of lost cycles
per instruction due to this cause is D4 = B(1 - H).
5) Loss of Anticipation Due to not Taken Branches: This happens 
for sequences that are found in the BTIM and are larger than a line. 
Let us assume that Y is the size of the sequence and that it contains X
branches. The number of cycles needed to read the complete sequence
from memory is Y - S + L and the number of useful instructions in
the block is Y - X. Then, the number of cycles that the EU will be
idle is (Y - S + L) - (Y - X) = X - (S - L). When S < L, from
this amount we must subtract the L - S cycles that have already been 
taken into account in cause 3. In conclusion, we must count a lost 
cycle for each branch that is preceded by at least S - L not taken 
branches, assuming that if S - L < 0 the previous sentence must be 
interpreted as preceded by at least zero not taken branches (this holds 
for any branch). The average number of lost cycles per instruction 
due to this cause is (see equation at bottom of page) 
where N and I are random variables. N represents the number of 
not taken branches between the current branch and the previous taken 
branch and I represents the number of instructions of the sequence 
to which the branch being analyzed belongs. 
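The cycle count used in the derivation of cause 5 can be checked numerically. The sketch below simply evaluates the two quantities from the text for one sequence; the function name is illustrative, and the result is not clamped, so a negative value means the EU is never left idle by this cause.

```python
def idle_cycles(Y, X, S, L):
    """EU idle cycles for a sequence of Y instructions with X branches.

    S -- BTIM line size, L -- external memory latency.
    """
    fetch_cycles = Y - S + L      # cycles to read the sequence from memory
    useful = Y - X                # instructions actually executed by the EU
    return fetch_cycles - useful  # equals X - (S - L)
```

For Y = 10, X = 3, S = 4 and L = 3, reading the sequence takes 9 cycles for 7 useful instructions, leaving 2 idle cycles, which matches X - (S - L) = 3 - 1.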
Computing Pr(N ≥ K): We assume that the probability that a
branch is taken is independent of what happened in the branches 
executed before, which implies that the random variable N follows 
a geometric law. Note that in this case, the previous branches 
correspond to not taken branches and therefore all the previous 
branches and the one analyzed are different instructions. Then, it is 
reasonable to assume that each branch instruction is independent of 
the others, although this is not necessarily true. This introduces some
error in our analysis, but not enough to affect the results, as
the validation of the model (next section) shows. Therefore,
$$\Pr(N \ge K) = \sum_{i=K}^{\infty} T\,(1 - T)^{i} = (1 - T)^{K}.$$
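The closed form of this geometric tail can be confirmed numerically. The sketch below compares a truncated version of the sum with (1 - T)^K; the function names and the truncation length are arbitrary choices for the example.

```python
def tail_closed_form(T, K):
    """Pr(N >= K) for a geometric law with success probability T."""
    return (1.0 - T) ** K


def tail_by_summation(T, K, terms=10_000):
    """Truncated sum of T * (1 - T)**i for i = K, K+1, ..."""
    return sum(T * (1.0 - T) ** i for i in range(K, K + terms))
```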
Computing Pr(I > S | N ≥ K): To compute this probability,
we will first calculate Pr(I > S). To do that, we define B_Y as
the average number of branch instructions in a sequence with Y
instructions. We have that
P r ( I = Y ) =  oo B y D ( Y )  + Prob(1 > Y )  c B A A  
3=2 
,=2 
B_Y can be computed using the expression
$$B_Y = \sum_{j=1}^{Y} j\,A_j\,C_j(Y)$$
where A_j is the probability that a sequence is composed of j branches
and C_j(Y) represents the probability that a sequence with j branches
has a length equal to Y.
Because of the hypothesis made before, the value of A_j is given
by the probability density function of a geometric law, which means
that
$$A_j = T\,(1 - T)^{j-1}.$$
C_j(Y) depends on F(d) and can be computed using the following
expressions.
$$C_1(Y) = F(Y)$$
$$C_j(Y) = \sum_{k=j-1}^{Y-1} F(Y - k)\,C_{j-1}(k) \quad \text{if } j > 1.$$
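These three expressions translate directly into a recursive computation. The sketch below assumes that F(d) is given as a dictionary mapping each distance to its probability; the factory function and its return convention are illustrative choices, and the recursion is memoized to avoid recomputing C_j(Y).

```python
from functools import lru_cache


def make_model(T, F):
    """Return the functions A(j), C(j, Y) and B_Y(Y) for given T and F.

    T -- probability that a branch is taken
    F -- dict mapping distance d to the probability F(d)
    """
    def A(j):
        # Probability that a sequence contains exactly j branches
        return T * (1.0 - T) ** (j - 1)

    @lru_cache(maxsize=None)
    def C(j, Y):
        # Probability that a sequence with j branches has length Y
        if j == 1:
            return F.get(Y, 0.0)
        return sum(F.get(Y - k, 0.0) * C(j - 1, k) for k in range(j - 1, Y))

    def B_Y(Y):
        return sum(j * A(j) * C(j, Y) for j in range(1, Y + 1))

    return A, C, B_Y
```

As a sanity check, with F(2) = F(3) = 0.5 a sequence with two branches has length 4, 5, or 6 with probabilities 0.25, 0.5, and 0.25.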
The evaluation of Pr(I > S | N ≥ K) is similar to the calculation
of Pr(I > S), with the difference that only those sequences with
more than K branches must be considered, and the contribution of
the first K branches must not be taken into account for computing
this probability. Thus, we have that
k=2 
where h l ~  ( k ) represents the average number of branches left (not 
including the first K branches) in a sequence with k instructions and 
assuming that the sequence has at least K + 1 branch instructions. 
Its value is equal to 
L. 
where A_j and C_j(k) are the functions defined above.
6) Validation of the Model: The correctness of the analytical model
was validated by comparing its results with the ones obtained by
simulating the execution of four benchmark programs: LEX, NROFF,
PCC, and YACC¹ (9, 12, 21, and 42 million executed instructions,
respectively). These programs, written in C, were compiled
to RISC-II assembly language [13] and their execution was simulated
using the approach presented in [1]. From this simulation, in addition
to the COBRA performance, the input parameters to the model (see
Table I) were also obtained. The simulation was carried out for
several values of the cache size, line size, and external memory
latency. In this way, the processor performance was obtained for
31 sets of different values for these three parameters. The difference
between the processor performance predicted by the model and the
performance obtained by simulation was always less than 3.76%, and the
average difference over the 31 simulations was 1.36%.
C. Analytical Model for Delayed Branch 
For the memory organization that we call a BTIM, a line size 
equal to the external memory latency (S = L) is enough to obtain 
the maximum benefit from the delayed branch mechanism in terms 
of instruction execution rate. A further increase in the line size 
would reduce the external memory traffic but would not provide any 
additional gain in terms of execution rate since these extra instructions
can be supplied by the external memory without any performance 
degradation. In consequence, the following model assumes that S is 
equal to L. The average number of lost cycles per instruction is the 
sum of the following four terms: 
a) Execution of branch instructions: B 
¹ Unix utilities. Unix is a trademark of AT&T Bell Labs.
$$D_5 = \begin{cases} B\,H \Pr(N \ge (S - L) \cap I > S) = B\,H \Pr(I > S \mid N \ge (S - L)) \Pr(N \ge (S - L)) & \text{if } S \ge L \\ B\,H \Pr(N \ge 0 \cap I > S) = B\,H \Pr(I > S \mid N \ge 0) \Pr(N \ge 0) & \text{if } S < L \end{cases}$$
b) No optimization of the delay slot: B(1 - Po). The value of
Po for each benchmark was obtained by the simulation of its
execution.
c) BTIM miss for a taken branch: BT(1 - H)
d) A taken branch occurs before concluding the replacement of the
line corresponding to the previous BTIM miss: BT(1 - H)Nc,
where Nc is the average number of entries in the cache line that
have not yet been filled. It can be calculated by the following
expression:
$$N_c = \sum_{d=2}^{L-2} D(d)\,(L - 1 - d).$$
Then, the processor performance computed as the average number 
of useful instructions executed per cycle is equal to 
$$P = \frac{1 - B}{1 + B\,(1 - P_o + T(1 - H) + T(1 - H)\,N_c)}.$$
The difference between the processor performance estimated by 
means of this model and the results obtained by simulation of the 
four benchmarks for 15 different sets of parameters was always less 
than 0.22%, and the mean value of the difference was 0.05%. 
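The Delayed Branch model is simple enough to evaluate in a few lines. The sketch below assumes that D(d) is supplied as a dictionary and reads Po as the fraction of delay slots that are usefully filled, which is an interpretation of the text rather than a stated definition; the function names are illustrative.

```python
def nc(L, D):
    """Average number of line entries not yet filled, for S = L.

    D -- dict mapping distance d to the probability D(d).
    """
    # Sum over d = 2 .. L-2 of D(d) * (L - 1 - d)
    return sum(D.get(d, 0.0) * (L - 1 - d) for d in range(2, L - 1))


def delayed_branch_performance(B, T, H, Po, L, D):
    """Useful instructions per cycle for the Delayed Branch scheme."""
    miss = T * (1.0 - H)
    lost_per_branch = (1.0 - Po) + miss + miss * nc(L, D)
    return (1.0 - B) / (1.0 + B * lost_per_branch)
```

With every delay slot filled (Po = 1) and a perfect hit ratio (H = 1) the result reduces to 1 - B, since each branch still occupies one cycle of the pipeline.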
V. PERFORMANCE MEASURES 
In this section, the efficiency of the COBRA mechanism is an- 
alyzed. First, we investigate which is the BTIM line size that 
maximizes the performance of COBRA. Next, the improvement 
achieved by COBRA in relation to the delayed branch mechanism 
is shown. Finally, the performance of COBRA with two alternative 
cache memory organizations is compared.
A. Size of the Cache Line 
The first application of the mathematical model was to determine 
the optimum size of BTIM lines for the COBRA mechanism. A typical
value of the external memory latency (three cycles) was assumed for 
this analysis. In this section we show that, for the assumed external 
memory latency, the best tradeoff between cost and performance is 
provided by a cache line equal to four instructions. 
The performance of the processor was obtained for a BTIM line 
size ranging from 1 to 6 instructions and a hit ratio ranging from 0 to
1 (note that the hit ratio, as it is defined in Table I, only depends on
the number of lines, not on the line size). The other input parameters
to the model (B, T, F(d), and D(d); see Table I), which depend
on the applications, were assumed to be equal to the average of the 
values obtained for the four benchmarks. The results are shown in 
Fig. 4. 
The main conclusion that can be drawn from Fig. 4 is that, for a
given hit ratio, the processor performance improves as the line
size increases, but only up to a certain size. A further increase in the
line size produces a decrease in the processor performance due to
the cost of loading a new line on cache misses. In this figure we can
also see that the higher the hit ratio, the greater the size from which
the performance begins to decrease. At the left end of the graphs
(hit = 0) performance decreases as the line size increases whereas at
the right end, performance increases as the line size gets larger.
When the line size is lower than the external memory latency (1
or 2 instructions) the performance of the system is rather low. If we
compare a line size of three with a line size of four in Fig. 4, we can
observe that the latter performs better from low values of the
hit ratio onward (hit ≥ 0.4), and the difference between them is substantial
for typical values of the target hit ratio (0.7-0.9). A further increment
in the line size (5 instructions) is useful only if the hit ratio is greater
than 0.7 and, in this case, the increase in performance is so low
that it does not justify the additional occupied chip area. So, we can
conclude that the best tradeoff between cost and efficiency is a line
size of four instructions.
Fig. 4. Processor performance for different values of the BTIM hit ratio and
line size, assuming an external memory latency of three cycles.
B. COBRA Versus Delayed Branch 
In this section we show the benefits brought by COBRA. We
have already seen the hardware cost needed to implement it. Here
we compare the performance of COBRA against the delayed branch
mechanism. Since the latter mechanism does not use any additional
hardware, the comparison gives an idea of the performance gained
in exchange for the additional hardware of COBRA.
Fig. 5 shows the performance of the COBRA and delayed branch
mechanisms. In both cases, the same cache memory organization has
been assumed, that is, a BTIM with direct mapping and 32, 64, 128, or
256 lines. The line size is equal to the memory latency (3 instructions)
for the delayed branch scheme and equal to the latency plus one
unit (4 instructions) for the COBRA mechanism. The line size for
COBRA is justified in the previous section whereas the choice for
delayed branch, as explained in Section IV-C, is due to the fact that
having a line greater than the external memory latency does not
provide any additional increase in the instruction execution rate.
Consequently, for a four-instruction line size, the performance figures
(useful instructions per cycle) of the delayed branch mechanism with
a BTIM will be the same as the ones depicted in Fig. 5. The other
input parameters to the analytical models ($H$, $B$, $T$, $F(d)$, $D(d)$,
see Table I) were obtained from the simulation of the execution of
each benchmark.
The efficiency of the COBRA mechanism is between 36% (BTIM 
with 32 lines) and 40% (BTIM with 256 lines) higher than the 
delayed branch for LEX; between 6 and 21% for NROFF; between 
12 and 21% for PCC and between 24 and 26% for YACC. The higher 
the cache hit ratio, the greater the difference between them. 
C. BTIM Versus Conventional Instruction Cache
It is also interesting to compare the efficiency of COBRA for 
different cache organizations. Fig. 6 shows the performance of the 
COBRA mechanism with a BTIM and with a conventional instruction 
cache. In both cases we assume the same number of cache lines, the 
same size of lines (4 instructions), a direct mapping and a three-cycle 
external memory latency. The performance figures for a conventional 
cache were obtained using the approach presented in [5].
Fig. 5. COBRA versus delayed branch. (One panel per benchmark; useful
instructions per cycle versus number of BTIM lines: 32, 64, 128, 256.)
Fig. 6 shows that, for the cache parameters evaluated, a conventional
instruction cache and a BTIM have similar performance for
LEX and YACC (a little better for a conventional cache) whereas for
NROFF and PCC, the performance of a BTIM is considerably better
than that of a conventional cache. The improvement of the BTIM in relation
to the conventional cache ranges from -1 to -3% for LEX, 29 to
2% for NROFF, 18 to 12% for PCC, and 4 to -5% for YACC. The
main difference between LEX and YACC on the one hand and PCC and
NROFF on the other is that the former two programs exhibit a higher
temporal locality. We can also observe in Fig. 6 that the improvement
of a BTIM in relation to a conventional cache decreases as the number
of lines (and therefore the hit ratio) increases. So the conclusion
regarding efficiency alone is that both schemes provide about the same
efficiency when the cache hit ratio is very close to 1 and the
performance of the BTIM is considerably better when the hit ratio is
not so high.
On the other hand, the BTIM generates much more traffic than
a conventional cache. For LEX the BTIM traffic is between 424
and 5220% higher than the conventional cache traffic; 28-234% for
NROFF; 46-104% for PCC; 422-5956% for YACC. The reason is
that, in a BTIM, there are many instructions that must always be
supplied by the external memory, regardless of the number of lines
of the cache and the cache hit ratio. These instructions belong to
sequences longer than a cache line. In this case, the BTIM only
stores the first instructions of the sequence (just one line) and the rest
of the instructions are supplied by external memory even when a BTIM
hit occurs for that sequence. Note that this extra traffic does not
penalize the processor speed since the access to external
memory is overlapped with the execution of instructions provided by
the BTIM.
Finally, regarding hardware cost, the implementation of the IU
requires simpler hardware for a BTIM. The design of the IU for a
conventional cache can be found in [8]. In conclusion, a BTIM offers
a better cost-efficiency tradeoff than a conventional cache since
the former simplifies the implementation of the IU and, in addition,
provides in many cases a considerably higher efficiency than a
conventional cache.
VI. CONCLUSIONS 
We have presented and evaluated a mechanism (COBRA) for 
reducing the cost of branches in pipelined processors. The mechanism 
is based on the following techniques: a) early computation of the 
target address, b) multiple prefetch, c) delayed branch, and d) parallel 
execution of branch instructions. 
Fig. 6. COBRA with a BTIM versus COBRA with a conventional instruction cache.
(Panels: LEX, NROFF, PCC, YACC; useful instructions per cycle versus number
of lines: 32, 64, 128, 256; curves: BTIM and conventional cache.)
An implementation of the mechanism using a Branch Target
Instruction Memory (BTIM) is proposed. The behavior of the system
has been characterized by means of an analytical model. This model
has been used to select the most adequate size of BTIM lines which,
for an external memory with a latency of three cycles, turned out to
be equal to the latency plus one unit.
The efficiency of the COBRA mechanism is on average about
25% higher than that of the delayed branch, and the additional hardware
needed to implement COBRA is quite simple. We have also compared
two implementations of the COBRA mechanism, each one using
a different cache organization. The conclusion was that, in terms
of cost-effectiveness, the BTIM performs better than a
conventional instruction cache although the former generates higher
memory traffic. This extra traffic does not penalize
the processor speed since it is overlapped with the execution of
instructions provided by the BTIM.
ACKNOWLEDGMENT 
We would like to thank T. Lang and the anonymous referees for 
many suggestions that improved the quality of this paper. 
REFERENCES
[1] J. Cortadella and J. M. Llaberia, “Low cost evaluation methodology for
new architectures,” in Proc. IASTED Int. Symp. Appl. Informatics, Feb.
1987, pp. 192-195.
[2] J. H. Crawford, “The i486 CPU: Executing instructions in one clock
cycle,” IEEE Micro, vol. 10, no. 1, pp. 27-36, Feb. 1990.
[3] R. W. Edenfield, “The 68040 Processor. Part 1, Design and
implementation,” IEEE Micro, vol. 10, no. 1, pp. 66-78, Feb. 1990.
[4] R. B. Garner et al., “The scalable processor architecture (SPARC),” in
Proc. 33rd IEEE Int. Comput. Soc. Conf., COMPCON’88, Feb. 1988,
pp. 278-283.
[5] A. González, “Designing an instruction cache for reducing the cost of
branches,” Res. Rep. UPCDAC RR-91/02, Comput. Architecture Dep.,
Polytechnic Univ. of Catalonia, Barcelona, Jan. 1991.
[6] A. González and J. M. Llaberia, “Instruction fetch unit for parallel
execution of branch instructions,” in Proc. 3rd Int. Conf. Supercomput.,
ACM SIGARCH ICS-89, June 1989, pp. 417-426.
[7] A. González, J. M. Llaberia, and J. Cortadella, “Zero-delay cost branches
in RISC architectures,” in Proc. IASTED Int. Symp. Appl. Informatics,
Feb. 1988, pp. 24-27.
[8] A. González, J. M. Llaberia, and J. Cortadella, “A mechanism for
reducing the cost of branches in RISC architectures,” Microprocessing
and Microprogramming, vol. 24, no. 1-5, pp. 565-572, Aug. 1988.
[9] G. F. Grohoski, “Machine organization of the IBM RISC System/6000
Processor,” IBM J. Res. Develop., vol. 34, no. 1, pp. 37-58, Jan. 1990.
[10] T. R. Gross and J. L. Hennessy, “Optimizing delayed branches,” in Proc.
15th Annu. Workshop Microprogramming, ACM SIGMICRO, Oct. 1982,
pp. 114-120.
[11] G. Hinton, “80960 - Next generation,” in Proc. 34th IEEE Comput.
Soc. Conf., COMPCON’89, Feb. 1989, pp. 13-17.
[12] M. Johnson, “System considerations in the design of the Am29000,”
IEEE Micro, vol. 7, no. 4, pp. 29-41, Aug. 1987.
[13] M. G. H. Katevenis, Reduced Instruction Set Computer Architecture for
VLSI. Cambridge, MA: MIT Press, 1985.
[14] D. L. Lilja, “Reducing the branch penalty in pipelined processors,” IEEE
Comput. Mag., vol. 21, no. 7, pp. 47-55, July 1988.
[15] S. McFarling and J. Hennessy, “Reducing the cost of branches,” in Proc.
13th Int. Symp. Comput. Architecture, 1986, pp. 396-403.
[16] T. Riordan et al., “System design using the MIPS R3000/3010 RISC
Chipset,” in Proc. 34th IEEE Comput. Soc. Conf., COMPCON’89, Feb.
1989, pp. 494-498.
Constant Geometry Fast Fourier 
Transforms on Array Processors 
George Miel 
Abstract-Matrix algebra is used to design and validate parallel algo- 
rithms for large constant geometry FFT’s on fixed-size array processors. 
The N-point radix 2 case for a linear array processor with N/2 cells is 
identical to the usual procedure corresponding to the matrix factorization 
of M. C. Pease. The algorithms are engendered by matrix factorizations, 
which themselves depend on a basic factorization of the perfect shuffle. 
The resulting data movement is realized in parallel as relatively small 
perfect shuffles inside each local memory and along each row and column 
of the array processor, without requiring that the complete array itself 
have the shuffle-exchange network. 
Index Terms-Array processing, fast Fourier transforms, parallel al- 
gorithms. 
I. INTRODUCTION 
The matrix approach, as a means to design and validate algorithms 
for parallel architectures, was used and advocated by Pease [9] 
in his modification of the Cooley-Tukey procedure. The resulting 
algorithm is often called a constant geometry FFT because its 
communication pattern, namely, the addressing of operands for the 
butterfly operations, is kept the same from stage to stage. For the 
$N$-point radix 2 case, the algorithm consists of $\log_2 N$ stages each
preceded by a perfect shuffle of the data. The most natural mapping of
this algorithm is onto a linear array architecture with $N/2$ cells and a
shuffle-exchange interconnection network [2], [14], [15]. Thompson
[16] has shown that the VLSI design of this architecture achieves
$\text{area} \cdot \text{time}^2$ performance of $\Omega(N^2 \log^2 N)$, which is the optimum
theoretical limit for the $N$-element Fourier transform established by
Vuillemin [17].
The matrix factorization of the Fourier transform given by Pease is
invaluable in the study of parallel FFT's. The problem of parallelizing
an FFT is essentially that of scheduling onto a targeted architecture
the tasks engendered by the matrix factors in the corresponding
factorization. This approach was used by Norton and Silberger [8] in
the parallelization and performance prediction of FFT algorithms for
MIMD shared-memory architectures. Recently, Whelchel and others
[18] used the Pease factorization to describe a pipeline architecture,
based on matrix factors called systolic phase rotations, which
eliminates delay commutator switches used in the Purdy McClellan
processor.
Manuscript received June 15, 1990; revised March 15, 1992. This work was
done at and supported by Hughes Research Laboratories, Malibu, CA 90265.
The author is with the Department of Mathematical Sciences, University
of Nevada, Las Vegas, NV 89154.
IEEE Log Number 9202844.
Our aim is to decompose the Pease factorization in order to
map large constant geometry FFT’s onto fixed-size rectangular array
processors. Section II shows that our results depend fundamentally on
a factorization of the perfect shuffle permutation. The resulting data
movement is realized in parallel as relatively small perfect shuffles
inside each local memory and along each row and column of the
array processor, without requiring that the complete array itself have
the shuffle-exchange interconnection network. Section III uses these
results to validate parallel algorithms for rectangular array processors.
The effectiveness of a mapping of a constant geometry FFT onto
an array processor depends primarily on two items. The first item is
the efficiency with which the interconnection network of the array
processor realizes the data movement required by the algorithm. The
second item involves a divide-and-conquer strategy for the SIMD
evaluation of specialized matrix-vector products. Suppose that a
product $Dz$, where $D$ is the direct sum
$$D = \bigoplus_{i=0}^{N-1} A_i$$
with each $A_i$ of dimension $M \times M$ and $z$ is an $MN$-vector, is to
be computed on an array processor with $N$ cells. The vector is first
divided into $N$ $M$-tuples
$$z = (z_0, z_1, \ldots, z_{N-1})^t, \qquad z_i = (z_{iM}, \ldots, z_{(i+1)M-1})^t;$$
each cell computes in parallel a product $A_i z_i$, and the subvectors
are then concatenated to get the result. Whereas the first item deals
with the communication complexity of the mapping, the second item
pertains to its parallel arithmetic complexity.
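The block-wise evaluation of the direct-sum product described above can be sketched in a few lines of Python. This is only a sequential illustration under stated assumptions: the function names are hypothetical, matrices are plain nested lists, and the loop over blocks stands in for the N cells that would each compute their own product in parallel on an array processor.

```python
def matvec(A, x):
    """Dense matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def direct_sum_product(blocks, z):
    """Compute D z where D is the direct sum of the M x M matrices in blocks.

    blocks: list of N square matrices, each M x M.
    z: vector of length M * N.
    """
    N = len(blocks)
    M = len(blocks[0])
    if len(z) != M * N:
        raise ValueError("z must have length M * N")
    result = []
    for i, A_i in enumerate(blocks):      # one iteration per cell
        z_i = z[i * M:(i + 1) * M]        # the i-th M-tuple of z
        result.extend(matvec(A_i, z_i))   # concatenate the subvectors
    return result


blocks = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
print(direct_sum_product(blocks, [1, 1, 5, 7]))
```

Because no block touches data owned by another block, the N products need no communication; only the data movement between stages involves the interconnection network.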
II. MATRIX FACTORIZATIONS
A perfect shuffle is a permutation that transforms the $2m$-vector
$$z = (0, 1, \ldots, m-1, m, m+1, \ldots, 2m-1)^t \qquad (1)$$
to the vector
$$(0, m, 1, m+1, \ldots, i, m+i, \ldots, m-1, 2m-1)^t. \qquad (2)$$
Components that were $m$ apart become adjacent as a result of the
perfect shuffle. For simplicity, we henceforth call (2) the shuffle of $z$.
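A small Python sketch may make the permutation concrete. The function name is hypothetical and the vector is represented as a plain list; the code simply interleaves the first and second halves as in the definition above.

```python
def perfect_shuffle(z):
    """Return the perfect shuffle of a vector of even length 2m.

    Element i of the first half is followed by element m + i of the
    second half, so entries that were m apart become adjacent.
    """
    m, remainder = divmod(len(z), 2)
    if remainder:
        raise ValueError("perfect shuffle needs an even-length vector")
    shuffled = []
    for i in range(m):
        shuffled.append(z[i])
        shuffled.append(z[m + i])
    return shuffled


print(perfect_shuffle(list(range(8))))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

For m = 4 the result is (0, 4, 1, 5, 2, 6, 3, 7), which matches the pattern (0, m, 1, m+1, ..., m-1, 2m-1).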
Permutations by cutting and shuffling were studied by Golomb 
[3]. Computational applications of the shuffle were conceived by 
Batcher [l] for bitonic sorting and by Singleton [13] and Pease 
[9] for the fast Fourier transform. In particular, Pease presented a 
matrix factorization of the transform, (4)-(5) below, suitable for 
parallel implementation. The relevance of the shuffle permutation 
in parallel processing was further established by Stone [14]. The 
shuffle-exchange interconnection network in a multiprocessor system 
provides useful capabilities [2]. For instance, Wu and Feng [19] have
shown that a shuffle-exchange network of size $N$ can realize an
arbitrary permutation in $3\log_2 N - 1$ passes.
0018-9340/93$03.00 © 1993 IEEE
