Employing Variable Cross-Reference Prediction and
Iterative Dispatch to Raise Dynamic Branch
Prediction Accuracy
Po-Jen Chuang*, Yue-Ter Liau, Young-Tzong Hsiao and Yu-Shian Chiu
Department of Electrical Engineering, Tamkang University,
Tamsui, Taiwan 251, R.O.C.
Abstract
To improve branch prediction accuracy for the two-level adaptive branch predictor, two
schemes, dealing respectively with the prediction and dispatch parts, are presented in this paper. The
proposed VCR prediction scheme is able to achieve desirable prediction accuracy, with reasonably low
time complexity and no extra hardware cost, by variably cross-referring traces in the PHT to make
predictions. The Iterative dispatch approach utilizes the PHT history to do dispatching for an
additional layer of pattern history which helps provide more information for making better
predictions. To attain desirable prediction accuracy at reduced cost, a combined predictor formed by
the proposed VCR scheme and the optimal PPM algorithm is also considered. Extensive trace-driven
simulation runs have been conducted to evaluate the performance of our proposed schemes and other
predictors. As the results indicate, our proposed schemes compare favorably in most of the situations
in terms of prediction accuracy.
Key Words: Branch History, Dynamic Branch Prediction, Performance Evaluation, Prediction
Accuracy, Trace-driven Simulation, Two-level Adaptive Branch Predictor
1. Introduction
Branch prediction is important in maintaining pro-
cessor performance. As high prediction accuracy ensures
better performance, raising prediction accuracy becomes
essential [1]. A number of new schemes (such as [2-9])
have been built to raise prediction accuracy for the two-level
adaptive branch predictor [10]. These prediction schemes,
such as the 2-bit counter [2], the Markov predictor [5,
11], the PPM algorithm [5], and the gshare [3], agree [4],
bi-mode [7], YAGS [8] and DHLF predictors [9], deal
with either the prediction part or the dispatch part or with
both parts of the two-level predictor, each with its own
advantages and disadvantages. For instance, the 2-bit
counter and the Markov predictor, though generally used in branch
prediction, are unable to provide adequate prediction accuracy for
high performance processors. The PPM algorithm is able to yield
remarkable prediction accuracy but the involved complexity is
considerably high. The gshare, agree, bi-mode and YAGS predictors aim
to reduce the so-called pattern history table (PHT) interference
(happening when the outcome of a branch interferes with the subsequent
prediction of a completely unrelated branch) in order to raise the
prediction accuracy. The DHLF predictor tries to dynamically find the
amount of history that performs best for each code and input data at
execution time.
The goal of this paper is to improve branch prediction accuracy at
reasonably low cost. A new prediction scheme dealing with the
prediction part is first established. The proposed scheme is called
the variable cross-reference (VCR) prediction scheme because it is
able to make desirable predictions (in terms of prediction accuracy,
time complexity and hardware cost) by variably cross-referring to
traces in the PHTs. Different from previous schemes, the VCR scheme
involves the loop history which, existing in most programs and easily
observable in small programs, is helpful in elevating the prediction
accuracy. Also presented is an iterative dispatch approach which
involves some structural changes in the dispatch part of the branch
prediction. In our new design, the PHT functions as an intermediary
index tag stage and the branch history in each PHT entry is used as an
index tag indexing to an entry in the corresponding sub-PHT at an
additional stage. The branch history in the indexed sub-PHT entry is
then used for prediction. In this way the iterative approach helps
divide information into more classes to reduce PHT interference and to
enhance prediction accuracy accordingly. Besides, a new combined
predictor formed by bringing the proposed VCR scheme and the optimal
PPM algorithm together is also introduced to attain desirable
performance with reduced complexity.
Tamkang Journal of Science and Engineering, Vol. 11, No. 1, pp. 37-48 (2008)
*Corresponding author. E-mail: pjchuang@ee.tku.edu.tw
Extensive trace-driven simulation runs using the
SPEC CINT95 benchmarks [12] have been conducted to
evaluate and compare the performance of our proposed
schemes and other related schemes. The collected results show that the
proposed VCR scheme yields lower misprediction rates, i.e., higher
prediction accuracy, than the 2-bit counter and the agree predictor
(schemes dealing with the prediction part), and under certain
conditions it even outperforms the optimal PPM algorithm at much lower
cost. The proposed iterative dispatch approach is shown to outperform
the gshare and DHLF predictors (schemes dealing with the dispatch
part). When compared with schemes dealing with both the prediction and
dispatch parts, such as the bi-mode and YAGS predictors, the VCR
scheme shows superior performance over the two predictors (at no extra
hardware cost), except in some situations of a certain benchmark where
the performance of the VCR scheme may not be as satisfactory as that
of the two predictors (whose performance has been obtained with extra
hardware and cost). It is also shown that the VCR prediction scheme
can readily work with schemes dealing with the dispatch part, and the
iterative dispatch approach can be handily incorporated with schemes
dealing with the prediction part, to enhance performance especially
for schemes with lower prediction accuracy. Simulation results also
show that the combined PPM-VCR predictor is able to make optimal
prediction at lowered cost.
2. Background and Previous Works
The two-level adaptive branch predictor [10] uses
two levels of branch history information to make pre-
dictions (Figure 1). Prediction is made according to the
branch behavior for the history of the last k branches en-
countered and the last s occurrences of the specific pat-
tern of these k branches. Two major data structures, the
branch history register (BHR) and the pattern history ta-
ble (PHT), are employed to record the two levels of in-
formation. The information is collected at run-time by
updating the contents of the BHR and the bits in the
entries of the PHT. The update is made following the
branch outcomes.
As depicted in Figure 1, the structure of a two-level
adaptive branch predictor can be divided into two
parts – the dispatch part (where the information is dis-
patched from the BHR to the PHT through some dispatch
mechanism) and the prediction part (where the content
of the addressed PHT entry is used to predict the branch
outcome through some prediction decision function).
Figure 1. A two-level adaptive branch predictor.
2.1 Schemes Dealing with the Prediction Part
The 2-bit Counter
The 2-bit counter approach [2] uses a 2-bit saturating
up-down counter to predict which path the branch will
take according to the PHT. In two-level adaptive branch prediction,
the 2-bit counter scans the data of a certain entry in the PHT. It is
incremented by 1 when a branch is taken and decremented by 1 when it
is not taken, saturating at 0 and 3. If the counter value is 2 or
greater, a branch is predicted “taken”; otherwise it is predicted
“not taken”. The 2-bit counter involves little complexity, but its
prediction accuracy cannot meet the needs of high performance
processors.
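The counter behaviour described above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulated implementation: the class name `TwoBitCounter` and the initial value 1 ("weakly not taken") are assumptions made here.

```python
class TwoBitCounter:
    """2-bit saturating up-down counter with values 0..3."""

    def __init__(self, value=1):
        # Assumption: start at 1; the paper does not state an initial value.
        self.value = value

    def predict(self):
        # A value of 2 or greater means "taken" (1), otherwise "not taken" (0).
        return 1 if self.value >= 2 else 0

    def update(self, taken):
        # Count up on taken, down on not taken, saturating at 3 and 0.
        if taken:
            self.value = min(3, self.value + 1)
        else:
            self.value = max(0, self.value - 1)
```

Because two consecutive outcomes in the same direction are needed to flip a strong prediction, a single anomalous outcome does not change what the counter predicts.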
The Markov Predictor
A Markov predictor of order j predicts the next bit
based on the j immediately preceding bits and the simple
Markov chain [11]. As specified in [5], the transition
probabilities are proportional to the observed frequen-
cies of a 1 or 0 that occurs given that the predictor is in a
particular state (the bit pattern is associated with the
state). The predictor builds the transition frequency by
recording the number of times a 1 or 0 occurs in the
(j+1)th bit that follows the j-bit pattern. The chain is thus
built for prediction. To predict a branch outcome, the
predictor simply uses the j immediately preceding bits
(outcomes of previous branches) to index a state and predicts the next
bit to correspond to the most frequent transition out of that state.
Note that the prediction accuracy of the Markov predictor can be
impaired by PHTs that are too long or too short: if the PHT is too
long, outdated information may affect prediction accuracy; if it is
too short, it may cause zero frequency counts, result in incomplete
Markov chains, and thus provide inadequate information for prediction.
The predictor is likewise unable to make a prediction when the
frequency counts are equal.
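The frequency counting behind an order-j Markov predictor, and the two cases in which it cannot decide, may be sketched as follows. The class `MarkovPredictor`, the use of bit strings, and returning `None` for an undecidable case are illustrative assumptions, not part of [5] or [11].

```python
from collections import defaultdict


class MarkovPredictor:
    """Order-j Markov predictor over a string of '0'/'1' outcomes."""

    def __init__(self, order):
        self.order = order
        # counts[pattern] = [times followed by 0, times followed by 1]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, bits):
        # Record which bit followed each j-bit pattern in the sequence.
        for i in range(self.order, len(bits)):
            pattern = bits[i - self.order:i]
            self.counts[pattern][int(bits[i])] += 1

    def predict(self, bits):
        """Return 0 or 1, or None for zero or equal frequency counts."""
        pattern = bits[len(bits) - self.order:]
        zeros, ones = self.counts.get(pattern, (0, 0))
        if zeros == ones:
            return None  # zero-frequency or tie: no prediction possible
        return 1 if ones > zeros else 0
```

With an order-2 predictor trained on 10110101, the pattern 10 has only ever been followed by 1, so it is predictable, while 01 has been followed once by 1 and once by 0 (a tie) and 00 has never been seen (zero frequency).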
The Prediction by Partial Matching (PPM) Algorithm
The zero frequency situation of the Markov predic-
tor can be improved by the Prediction by Partial Match-
ing (PPM) algorithm [5]. The basis of a PPM algorithm
of order m is a set of m+1 Markov predictors. The PPM
uses the m immediately preceding bits to search a pattern
in the highest order Markov predictor. If the search suc-
ceeds (i.e., if the pattern with a non-zero frequency count
appears in the input sequence), the PPM will predict the
next bit using the mth order Markov predictor. If the pat-
tern is not found, the PPM will use the m-1 immediately
preceding bits to search the next (m-1)th order Markov
predictor. Whenever a search fails, the PPM reduces the
pattern by one bit and uses it to search in the next lower
order Markov predictor until the pattern is found or until
searching in the 0th order Markov predictor. Thus, by re-
cording and updating information through variably re-
ferring the data in the PHT, the PPM algorithm is able to
improve the zero frequency situation in a Markov pre-
dictor. The prediction accuracy achieved by the PPM al-
gorithm is remarkable, but the involved overhead (to
keep necessary information) is also conspicuous.
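The fall-back search through successively lower-order Markov predictors can be sketched as below. This is a simplified illustration under stated assumptions: counts are recomputed from the history string on every call instead of being kept in tables, ties are broken toward 1, and the function name `ppm_predict` and its `default` argument are hypothetical.

```python
def ppm_predict(history, m, default=1):
    """Predict the next bit of `history` (a '0'/'1' string) with PPM of order m."""
    for order in range(min(m, len(history)), -1, -1):
        # The `order` most recent bits form the pattern to search for.
        pattern = history[len(history) - order:]
        counts = [0, 0]
        # Count which bit followed every earlier occurrence of the pattern.
        for i in range(order, len(history)):
            if history[i - order:i] == pattern:
                counts[int(history[i])] += 1
        if counts[0] + counts[1] > 0:
            # Pattern found with a non-zero count: predict at this order.
            # Assumption: a tie is resolved as 1.
            return 1 if counts[1] >= counts[0] else 0
        # Otherwise drop one bit and try the next lower order.
    return default  # empty history: nothing to count
```

For the history 0001 with m = 3, the patterns 001, 01 and 1 have no earlier occurrence with a successor, so the search falls through to order 0, where 0 is the more frequent bit.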
The Agree Predictor
Since the number of PHT entries is finite, it is likely
that two unrelated branches in the instruction stream are
mapped to the same PHT entry by the predictor’s index-
ing function. The situation is known as the PHT interfer-
ence [4] because the outcome of one branch will inter-
fere with the subsequent prediction of another comple-
tely unrelated branch. The Agree predictor [4] attempts
to reduce the PHT interference by taking advantage of
the biased behavior. It attaches a biasing bit to each
branch in the Branch Target Buffer (BTB) according to
the branch direction just before the biasing bit is written
into the BTB. The biasing bit, which predicts the most
likely outcome of the branch, will stay the same until the
branch is replaced in the BTB by another branch. The
PHT records “agreeing” or “not agreeing” with the biasing bit,
and a 2-bit counter is used to predict whether or not the
branch will go in the direction indicated by the biasing
bit. The counter will be incremented if the branch’s di-
rection agrees with the biasing bit or be decremented if it
disagrees.
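A small sketch may help show how the agree/disagree counters interact with the biasing bit. Several details are assumptions made for illustration: a dictionary stands in for the biasing bits held in the BTB, the bias is fixed by the first observed outcome, the PHT is indexed by the BHR alone, counters start at "strongly agree", and a branch not yet seen defaults to a taken bias.

```python
class AgreePredictor:
    def __init__(self, pht_bits=4):
        self.mask = (1 << pht_bits) - 1
        self.pht = [3] * (1 << pht_bits)  # 2-bit agree counters (assumed initial 3)
        self.bias = {}                    # branch address -> biasing bit (BTB stand-in)
        self.bhr = 0

    def predict(self, pc):
        bias = self.bias.get(pc, 1)       # assumption: unseen branches biased taken
        agree = self.pht[self.bhr & self.mask] >= 2
        return bias if agree else 1 - bias

    def update(self, pc, taken):
        # The biasing bit is written once and then stays fixed.
        bias = self.bias.setdefault(pc, taken)
        i = self.bhr & self.mask
        if taken == bias:
            self.pht[i] = min(3, self.pht[i] + 1)  # outcome agrees with bias
        else:
            self.pht[i] = max(0, self.pht[i] - 1)  # outcome disagrees with bias
        self.bhr = ((self.bhr << 1) | taken) & self.mask
```

Two branches with opposite but stable biases both push a shared counter toward "agree", which is why sharing a PHT entry is less harmful here than with plain taken/not-taken counters.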
2.2 Schemes Dealing with the Dispatch Part
The Gshare Predictor
To reduce the PHT interference, the gshare [3] pre-
dictor tries to use the PHT entries more effectively by
XOR-ing the k-bit BHR with the lower k bits of the
branch address to generate the index into the PHT. The
introduction of address bits into the index helps distribute
branches usefully across all PHT entries, but the resulting
interference reduction is relatively limited in contrast to
indexing by the BHR [8].
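The gshare index computation is a single XOR. In the sketch below the function name `gshare_index` is hypothetical, and the raw low-order address bits are used directly; a real design may first discard instruction-alignment bits.

```python
def gshare_index(bhr, branch_address, k):
    """XOR the k-bit BHR with the lower k bits of the branch address."""
    mask = (1 << k) - 1
    return (bhr ^ branch_address) & mask
```

With the same 4-bit history 1011, the addresses 0x400123 and 0x400124 map to different PHT entries, whereas indexing by the BHR alone would send both to entry 1011.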
The DHLF Predictor
As each code requires a specific amount of branch
history to give the best results, the DHLF (Dynamic His-
tory-Length Fitting) predictor [9] tries to dynamically
find the amount of history that performs best for each
code and input data at execution time. The DHLF predic-
tor works on the basis of monitoring the mispredictions
during program execution and changing the history length accordingly.
During the execution of the program the number of mispredictions for
each interval (consisting of a fixed number of consecutive dynamic
branches) is computed using a fixed history length. At the end of each
interval the history length to be used in the next interval is
determined based on the current number of mispredictions and the
minimum number of mispredictions encountered so far. If the current
number of mispredictions is less than or equal to the minimum number
of mispredictions, the history length is not changed for the next
interval. If it is greater, the history length will be changed to the
one corresponding to the minimum number of mispredictions, or moved
toward it by increasing or decreasing the current history length by
one.
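The end-of-interval decision can be written as a small function. This is a simplified reading of the rule stated above, not the full algorithm of [9]: the name `dhlf_step`, the `jump` flag selecting between the two adjustment options, and updating the recorded minimum whenever the current interval is at least as good are assumptions.

```python
def dhlf_step(length, misses, best_length, best_misses, jump=False):
    """Return (next_length, best_length, best_misses) at the end of an interval."""
    if misses <= best_misses:
        # Current length is at least as good as the best so far: keep it.
        return length, length, misses
    if jump:
        # Option 1: switch directly to the best history length seen so far.
        return best_length, best_length, best_misses
    # Option 2: move one step toward the best history length.
    if best_length > length:
        return length + 1, best_length, best_misses
    if best_length < length:
        return length - 1, best_length, best_misses
    return length, best_length, best_misses
```

For example, a length-5 interval with 120 mispredictions, against a recorded best of 100 at length 3, moves to length 4 in the step-by-one variant and to length 3 in the jump variant.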
2.3 Schemes Dealing with both the Prediction and
Dispatch Parts
The Bi-mode Predictor
As it is uncertain if the behavior of a branch will cor-
respond to its bias when the branch is first introduced to
the BTB, the bi-mode predictor [7] helps eliminate mischosen fixed
biases by dynamically choosing the branches’ biases. There are three
PHTs in the bi-mode predictor: the choice PHT, the taken direction PHT
and the not-taken direction PHT. A branch is first indexed by its
address to the choice PHT. The choice PHT then chooses one of the
direction PHTs that is indexed by the BHR XOR-ed
with the branch address. The selected direction PHT
makes the final prediction and gets updated. The choice
PHT will also be updated unless it gives a prediction
contradicting the branch outcome while the selected di-
rection PHT gives the correct prediction. The choice
PHT functions like the biasing bit in the agree predictor,
only that it can dynamically choose the bias (the biasing
bit in the agree predictor is fixed).
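The selection and update rules of the bi-mode predictor can be sketched as follows. Table sizes, initial counter values and the helper `_bump` are assumptions made for illustration; only the indexing and the partial-update rule follow the description above.

```python
def _bump(counter, taken):
    # 2-bit saturating counter update.
    return min(3, counter + 1) if taken else max(0, counter - 1)


class BiModePredictor:
    def __init__(self, bits=4):
        size = 1 << bits
        self.mask = size - 1
        self.choice = [1] * size  # choice PHT, indexed by branch address
        # Direction PHTs: key 1 = taken PHT, key 0 = not-taken PHT (assumed init).
        self.direction = {1: [2] * size, 0: [1] * size}
        self.bhr = 0

    def _lookup(self, pc):
        c = pc & self.mask
        bank = 1 if self.choice[c] >= 2 else 0
        d = (pc ^ self.bhr) & self.mask  # BHR XOR-ed with the branch address
        return c, bank, d

    def predict(self, pc):
        _, bank, d = self._lookup(pc)
        return 1 if self.direction[bank][d] >= 2 else 0

    def update(self, pc, taken):
        c, bank, d = self._lookup(pc)
        table = self.direction[bank]
        prediction = 1 if table[d] >= 2 else 0
        table[d] = _bump(table[d], taken)  # only the selected PHT is updated
        # Skip the choice update when the choice contradicted the outcome
        # but the selected direction PHT still predicted correctly.
        if not (bank != taken and prediction == taken):
            self.choice[c] = _bump(self.choice[c], taken)
        self.bhr = ((self.bhr << 1) | taken) & self.mask
```

After a branch has been taken repeatedly, its choice counter saturates toward the taken direction PHT, so later lookups for that branch no longer disturb the not-taken PHT.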
The YAGS Predictor
To reduce the amount of unnecessary information in
the PHT, the YAGS (Yet Another Global Scheme) pre-
dictor [8] stores in the direction PHTs only the instances
when the branch does not comply with its bias. To identify those
instances in the direction PHTs, tags (the least significant bits of
the branch address) are added to each entry, and the tagged direction
PHTs are referred to as direction caches. When a
branch occurs in the instruction stream, the choice PHT
is accessed. If it indicates “taken,” the “not taken” cache
will be accessed to check if the prediction does not agree
with the bias (a special case). If there is a miss in the “not
taken” cache, the choice PHT is used for prediction. If
there is a hit in the “not taken” cache, it provides the pre-
diction. A similar process is activated in the “taken” ca-
che when the choice PHT indicates “not taken.” The cho-
ice PHT is addressed and updated like the choice PHT in
the bi-mode predictor. The “not taken” cache will be up-
dated when a prediction from it is used or when the cho-
ice PHT indicates “taken” while the branch outcome is
“not taken” (the mentioned special case). The same pro-
cess applies to the “taken” cache.
3. The Proposed Schemes
To attain desirable prediction accuracy with reduced
overhead for the two-level adaptive branch prediction,
we propose in this paper a new prediction scheme and an
iterative dispatch approach. The proposed prediction
scheme is also combined to work with the PPM algori-
thm to maintain optimal prediction accuracy at decreas-
ed cost.
3.1 The Variable Cross-Reference Prediction
Scheme
Different from previous schemes dealing with the prediction part, our
proposed scheme exploits loop history. Loops exist in most programs,
are easily observable in small programs, and can execute a large
quantity of branch instructions. We believe recognizing the loop
property can be of significant use in elevating the accuracy of branch
prediction.
In the PHT, i.e., the second level of the two-level dy-
namic predictor, a prediction scheme is used to predict
the outcome of a branch according to the sequence of
branch outcomes (taken or not taken, represented by a
single bit 1 or 0) in the addressed PHT entry. Our proposed
prediction scheme operates as follows. The sequence of
outcomes in the addressed PHT entry is first divided
into two parts that are equal in length, i.e., number
of outcomes. (If there is an odd number of outcomes in
the PHT entry, the “least recent” outcome could be ig-
nored.) The two parts are then cross-referred to see if
they match with each other. If both parts are completely
the same, we assume the same outcome will repeat again
(according to the loop history) and thus predict the com-
ing branch outcome to be the first outcome in the first
part (also the first outcome in the second part). If the two
parts do not match, ignore the next two “least recent”
outcomes in the sequence and again divide the remaining
outcomes into two equal-length parts. Check the two
parts: If they are the same, predict the outcome to be the
first outcome of the first part; if not, repeat the above re-
ferring process until a match for the two parts is located.
In case no match is found when the number of referred
outcomes is reduced to only one in each part, employ
some other scheme (such as the 2-bit counter or 0th
Markov predictor) to assist the prediction.
Figure 2 demonstrates the operation of the proposed scheme. The
sequence of branch outcomes in the addressed PHT entry, assumed to be
$R_{c-s} R_{c-s+1} \cdots R_{c-1}$ with $s$ (the number of outcomes in
each PHT entry) being an even number, is first divided into two
equal-length parts, i.e., $R_{c-s} R_{c-s+1} \cdots R_{c-s/2-1}$ and
$R_{c-s/2} R_{c-s/2+1} \cdots R_{c-1}$. The two parts are then
compared. If they match each other, that is, if
$R_{c-s} = R_{c-s/2},\ R_{c-s+1} = R_{c-s/2+1},\ \ldots,\ R_{c-s/2-1} = R_{c-1}$,
the coming branch outcome is predicted to be $R_c = R_{c-s}$. If they
do not match each other, ignore the two “least recent” outcomes
$R_{c-s}$ and $R_{c-s+1}$ and divide the remaining outcomes into two
new parts $R_{c-s+2} R_{c-s+3} \cdots R_{c-s/2}$ and
$R_{c-s/2+1} R_{c-s/2+2} \cdots R_{c-1}$. Check again. If the two
parts match, that is, if
$R_{c-s+2} = R_{c-s/2+1},\ R_{c-s+3} = R_{c-s/2+2},\ \ldots,\ R_{c-s/2} = R_{c-1}$,
our prediction will be $R_c = R_{c-s+2}$. If they do not match, ignore
the next two “least recent” outcomes $R_{c-s+2}$ and $R_{c-s+3}$, and
again divide the remaining outcomes into
$R_{c-s+4} R_{c-s+5} \cdots R_{c-s/2+1}$ and
$R_{c-s/2+2} R_{c-s/2+3} \cdots R_{c-1}$. If the two parts match each
other, the coming branch outcome is predicted to be
$R_c = R_{c-s+4}$. If they do not match, repeat the same comparison
process (by ignoring the next two “least recent” outcomes at each
comparison attempt). Prediction can be made whenever a match is found
by this variable cross-reference. (Note that in our scheme the two
parts under comparison are with variable, not fixed, lengths.) If
eventually only one outcome is left in each part and they still do not
match each other, the prediction is handed over to a 2-bit counter or
a 0th Markov predictor. Featured by such a variable cross-reference
process, which needs no extra hardware at all, the proposed prediction
scheme is called the Variable Cross-Reference (VCR) scheme.
Figure 2. Prediction flowchart of our VCR scheme.
Figure 3 further illustrates the VCR scheme. As it shows, there are 11
bits (01010101101) in the addressed PHT entry. Ignore the most
significant bit, i.e., the “least recent” branch outcome, and divide
the remaining 10 bits into two equal-length parts 10101 and 01101.
Compare the two parts. As there is no match, ignore the next two
“least recent” bits and cross-refer the newly divided two parts 1010
and 1101. Since there is no match, ignore the next two “least recent”
bits and divide the remaining bits (101101) into two new parts 101 and
101. With the two parts matching each other, we thus predict the
coming branch outcome to be ‘1’ (i.e., the most significant bit in
both parts).
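The cross-reference loop can be expressed compactly in Python. The sketch below is one possible rendering under stated assumptions: the PHT entry is a string with the least recent outcome first, the function name `vcr_predict` is hypothetical, and the hand-over to a 2-bit counter or 0th Markov predictor is represented simply by returning a caller-supplied `fallback` value.

```python
def vcr_predict(history, fallback=None):
    """Variable cross-reference over a '0'/'1' string, least recent bit first."""
    # With an odd number of outcomes, ignore the least recent one.
    start = len(history) % 2
    while len(history) - start >= 2:
        window = history[start:]
        half = len(window) // 2
        if window[:half] == window[half:]:
            # Both parts match: predict the first outcome of the first part.
            return int(window[0])
        start += 2  # ignore the next two least recent outcomes
    # No match even with one outcome per part: defer to another scheme.
    return fallback
```

Applied to the entry 01010101101 of Figure 3, the function compares 10101/01101, then 1010/1101, then 101/101 and returns 1. Applied to 10110101, it finds the match 01/01 and returns 0.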
3.2 The Iterative Dispatch Approach
To enhance branch prediction accuracy, we also pre-
sent a new approach that involves some structural changes
in the dispatch part. In our new design, the PHT func-
tions as an intermediary index tag stage and the branch
history in each PHT entry is used as an index tag index-
ing to an entry in the corresponding sub-PHT at an addi-
tional stage. The branch history in the indexed sub-PHT
entry is then used for prediction. For instance, if the bits in the
branch history register (BHR) are $R_{c-k} R_{c-k+1} \cdots R_{c-1}$,
it will address an entry in the PHT. However, the history bits in the
addressed PHT entry are used not for prediction but as an index tag
indexing the corresponding sub-PHT at the additional stage. Suppose
the length of the PHT entry is $m$ bits, we can index to a
corresponding sub-PHT with $2^m$ entries in the same way as indexing
the BHR to the PHT, and have $2^{m+k}$ sub-PHT entries in total.
Predictions are then made by referring to the bits in the sub-PHT
entries. The BHR, PHT and sub-PHTs will update their contents, after
the outcome of each branch turns out, to lead the sub-PHTs for future
predictions. In this way the “traces of traces” are referred to as a
kind of information to improve prediction accuracy. That is, this
iterative dispatch approach utilizes the PHT history to do dispatching
for an additional layer of pattern history and the information can
hence be further divided into $2^m$ classes, providing more
information and less PHT interference than employing only the
traditional PHTs in making predictions.
Figure 3. Example of our VCR scheme.
Figure 4. Structure of the iterative dispatch.
Figure 4 exhibits the structure of our proposed iterative dispatch
approach on a two-level adaptive branch predictor. As shown here, the
PHT exists between the BHR and the sub-PHTs as an intermediary index
tag stage. Data in the BHR are first classified by the intermediary
index tag stage (the PHT) which is much shorter than the entire
sub-PHTs. Based on the behavior of the branch outcomes in a sub-PHT
entry, predictions are then made. (Note that due to such a structural
change, the number of table entries increases and so does the needed
warm-up time.) Assume the length of the BHR is $k$ bits. It can
address the PHT with $2^k$ tags and each tag indexes to an entry of
the corresponding sub-PHT with $2^m$ entries. Before a prediction is
made, an entry (i.e., an index tag) in the PHT is addressed according
to the bits $R_{c-k} R_{c-k+1} \cdots R_{c-1}$ in the BHR. The index
tag then addresses an entry in the corresponding sub-PHT (say
sub-PHT$_x$, $0 \le x \le 2^k - 1$). A prediction is finally made by
referring to the bits in the indexed sub-PHT entry. If the branch
result is $R_c$, it is then shifted into the BHR and the bits in the
BHR are updated as $R_{c-k+1} R_{c-k+2} \cdots R_{c-1} R_c$. The PHT
entry and the indexed sub-PHT entry are also updated by the bit $R_c$.
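The two-stage lookup and update can be sketched as below. This is an illustrative model, not the simulated hardware: dictionaries stand in for the PHT and the sub-PHTs, index tags are initialized to all zeros, `s` is the number of outcomes kept per sub-PHT entry, and the prediction function `last_outcome` is a deliberately trivial stand-in for whichever prediction scheme is plugged in.

```python
def last_outcome(history):
    """Stand-in prediction function (an assumption): repeat the latest outcome."""
    return int(history[-1]) if history else 0


class IterativeDispatch:
    def __init__(self, k, m, s, decide=last_outcome):
        self.k, self.m, self.s = k, m, s
        self.decide = decide
        self.bhr = "0" * k   # last k branch outcomes
        self.pht = {}        # BHR pattern -> m-bit index tag
        self.sub_pht = {}    # (BHR pattern, index tag) -> last s outcomes

    def _lookup(self):
        # Stage 1: the BHR addresses a PHT entry, which yields an index tag.
        tag = self.pht.get(self.bhr, "0" * self.m)
        # Stage 2: the tag addresses an entry in the corresponding sub-PHT.
        return tag, self.sub_pht.get((self.bhr, tag), "")

    def predict(self):
        _, history = self._lookup()
        return self.decide(history)

    def update(self, outcome):
        bit = str(outcome)
        tag, history = self._lookup()
        # Shift the outcome into the sub-PHT entry, the PHT entry and the BHR.
        self.sub_pht[(self.bhr, tag)] = (history + bit)[-self.s:]
        self.pht[self.bhr] = (tag + bit)[-self.m:]
        self.bhr = (self.bhr + bit)[-self.k:]


# Replay 10110101 with k = 2 and m = 1: the BHR starts with the first two bits.
d = IterativeDispatch(k=2, m=1, s=4)
d.bhr = "10"
for outcome in (1, 1, 0, 1, 0, 1):
    d.update(outcome)
```

After the replay the BHR holds 01, the PHT entry 01 holds tag 0, and entry 0 of the sub-PHT for 01 holds the single outcome 1, so `d.predict()` returns 1.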
As mentioned, the iterative dispatch approach is de-
signed to assist predictors in elevating prediction accu-
racy. Take the proposed VCR predictor as an example.
Given the sequence of branch outcomes
10110101, the PPM algorithm, Markov predictor and 2-
bit counter will predict the next bit to be 1, while the
VCR scheme will predict it to be 0. In fact, the sequence
displays a loop history of 1011 with an extra 0 in the middle,
a situation which may lead the VCR scheme to
wrong predictions. For situations like this, the iterative
dispatch approach can be brought in to help as demonstrated
in Figure 5. Assuming m = 1, we first initialize the
intermediary index tag stage to be 0 and the sequence of
branch outcome to be 10110101. After the BHR encoun-
ters the first two bits 10 and makes the prediction, shift
the branch outcome (i.e., 1) into both entry 10 of the PHT
and entry 0 of the corresponding sub-PHT (i.e., sub-
PHT10). Now the newly updated information of both the
addressed PHT entry and the indexed sub-PHT entry be-
comes 1. Then based on the branch outcome for the next
2 bits 01, the content of the PHT entry 01 and the indexed
sub-PHT01 entry are also updated with the outcome 1. As
the original content of the PHT entry 01 is 0, we update
entry 0, instead of entry 1, of sub-PHT01. The content is
now updated to 1. Such a shifting and updating process is
repeated following every two bits of the sequence until
01 is again shifted into the BHR. With the content of the
PHT entry 01 being updated to 1, entry 1 of sub-PHT01
will be updated accordingly. When the end bits of the sequence, 01,
are shifted into the BHR, the referred entry will be entry 0 of
sub-PHT01 because the content of the updated PHT entry 01 is 0 (due to
the branch outcome after the last 2-bit sequence 01 being 0). As the
content of entry 0 of sub-PHT01 is 1, the VCR scheme will thus predict
the branch outcome to be 1, like the other schemes. The
proposed iterative dispatch approach is shown through
simulation results to work not only for the VCR scheme
but also for other schemes, especially for schemes with
lesser performance, such as the 2-bit counter (to be dis-
cussed in later sections).
Figure 5. Example of the iterative dispatch.
3.3 The PPM-VCR Predictor
It has been maintained that a single predictor able to record a larger
quantity of trace data proves to be the most effective predictor, but
the performance of a combined predictor is usually better than that of
a single predictor at the same hardware cost [6]. A combined predictor
is composed of at least two single predictors which simultaneously
make predictions when a branch occurs. A selector is employed to
evaluate the prediction performance of each (single) predictor. Based
on previous outcomes, the selector will check and choose the predictor
most likely to make the correct prediction for the current branch.
Branch prediction outcomes made by each predictor are recorded in a
2-bit counter that updates itself with each new result. Following the
continually updated data, the 2-bit counter is able to decide a better
predictor and select it for predicting the incoming branch.
As mentioned, the PPM algorithm achieves remark-
able prediction accuracy at the cost of substantial com-
plexity, whereas the proposed VCR scheme depicts quite
satisfying prediction accuracy with much less complex-
ity. Indeed the VCR scheme performs even better than
the optimal PPM algorithm under certain conditions,
such as during the warm-up period and with shorter
PHTs. (When the PHTs are short, “referring” tends to
yield the same probability for “taken” and “not taken” of
the branch, making the PPM algorithm unable to make
correct predictions. The VCR scheme is free of such lim-
itations. It can make fast and correct predictions when-
ever the cross-reference finds a match.) We are thus in-
terested in combining the two prediction schemes to-
gether to see if performance of the combined predictor
can be strengthened with reduced complexity. For the
combined PPM-VCR predictor, we choose not to use a
2-bit counter as the priority selector considering the per-
formance and overhead of the two prediction schemes.
Instead, the priority selector for the PPM algorithm is expanded into
an (n-1)-bit counter (the length of the PHT is $2^n$) and that for the
VCR scheme is set to be a 1-bit counter. When making predictions,
employ the PPM algorithm to do the job if it displays larger priority;
otherwise, employ the VCR scheme. The priority selectors are updated
with each new prediction result for future predictions.
4. Performance Evaluation
Extensive trace-driven simulation runs using four
SPEC CINT95 benchmarks [12] (vortex, perl, m88ksim
and gcc) are conducted to evaluate performance of the
proposed schemes and other schemes. The SimpleScalar
Toolset [13] is used to generate and capture address
traces. Prediction accuracy is the performance measure
of interest, but for more informative presentation, mis-
prediction rates (one minus prediction accuracy) are pre-
sented in the following discussions, as in [7]. Note that
the Markov predictor is not included in this simulation
because of its prediction limitations for zero and equal
frequencies, and the PPM algorithm adopted here is the
optimal one, i.e., with its best performance. The mispre-
diction rates are collected under various BHR lengths
(2~7 bits) and PHT lengths (8~256 bits). However, we
present only the misprediction rates collected under
PHT lengths = 8~256 bits with BHR length = 7 bits,
and under BHR lengths = 2~7 bits with PHT length =
256 bits, due to limited space.
Comparison among predictors dealing with the
prediction part
Depicted in Figure 6(a) are the misprediction rates
for the 2-bit counter, the PPM algorithm, the agree pre-
dictor and the VCR scheme resulting from running the
four SPEC CINT benchmarks under PHT lengths = 8~
256 bits with BHR length = 7 bits (a similar performance
trend can be found with any of the BHR lengths). As exhibited,
our VCR scheme yields consistently lower
misprediction rates than the 2-bit counter
and the agree predictor. In fact, the proposed scheme out-
performs even the optimal PPM algorithm at shorter
PHTs, such as 8 bits, in some benchmarks. This is be-
cause with shorter PHTs, “referring” for the PPM algo-
rithm tends to yield the same probability for “taken” and
“not taken” of the branch, making the algorithm unable
to predict correctly. By contrast, misprediction rates for
the PPM algorithm at longer PHT lengths are apparently lower than
those for the 2-bit counter, the agree predictor and the VCR scheme.
However, it should be pointed out that the high performance of the PPM
algorithm is achieved at substantial cost because our simulation
adopts the largest predictable PHT length (256 bits), which makes the
PPM algorithm use the 255th-order PPM predictor, i.e., 256 Markov
predictors, to predict branches.
Figure 6(b) presents the misprediction rates of these
schemes under BHR lengths = 2~7 bits with PHT length
= 256 bits. The results exhibit a trend similar to that
shown in Figure 6(a), i.e., our VCR scheme always
yields lower misprediction rates than the 2-bit counter
and the agree predictor, and sometimes even matches the
optimal PPM algorithm.
Comparison among predictors dealing with the
dispatch part
Performance of the gshare predictor, the DHLF pre-
dictor and our iterative dispatch approach is illustrated in
Figure 7(a) where the misprediction rates are collected
under various PHT lengths with BHR length = 7 bits.
The figures show that misprediction rates obtained from
the four benchmarks are always lower for our iterative
dispatch approach than for the other two schemes. This is
because the iterative dispatch approach utilizes the PHT
history to do dispatching for an additional layer of pat-
tern history and by dividing the information into more
classes it is able to provide more information and reduce
PHT interference in making predictions. Similar results
can also be learned from Figure 7(b) which depicts the
misprediction rates under various BHR lengths with
PHT length = 256 bits (collected simulation results show
that performance of the three schemes follows almost the
same trend with any of the PHT lengths).
Comparison between our VCR scheme and the
bi-mode and YAGS predictors
Figure 8 provides the misprediction rates for the bi-
mode predictor, the YAGS predictor (predictors that deal
with both the prediction and dispatch parts) and our VCR
scheme under (a) various PHT lengths with BHR length
= 7 bits and (b) various BHR lengths with PHT length =
256 bits. As we can see, the overall performance of the
VCR scheme exceeds that of the other two schemes, except in benchmark
gcc where the VCR scheme falls behind the bi-mode predictor and also
the YAGS predictor in some situations, with slight differences.
Actually in more practical situations, such as with longer PHT or BHR
lengths, the performance of our VCR scheme compares favorably, at no
extra cost, to the other two predictors which need extra hardware and
cost, such as two extra direction PHTs and doubled predictions in the
choice and direction PHTs for the bi-mode predictor.
Effect of incorporating our schemes with other predictors
The VCR scheme, as mentioned, is ready to work
with schemes that deal with the dispatch part, while the
Figure 6. Misprediction rates for schemes dealing with the
prediction part.
Figure 7. Misprediction rates for schemes dealing with the
dispatch part.
iterative dispatch approach can be handily incorporated
with schemes dealing with the prediction part to enhance
prediction accuracy, especially for schemes with
lesser performance. The effect (performance gain) of
incorporating the proposed schemes with other approaches
is also examined through simulation, whose results indicate
different degrees of performance gain for different
incorporated schemes. To save space, Figure 9 presents only
the performance of the 2-bit counter (a scheme with low
prediction accuracy) and the PPM algorithm (a scheme
with high prediction accuracy), with and without our
iterative dispatch approach. Figures 9(a) and 9(b) show
significant performance gain for the 2-bit counter with
the iterative dispatch approach, while Figures 9(c) and
9(d) exhibit a slightly enhanced PPM algorithm with the
same dispatch approach. The varied performance gain
for the two predictors when incorporated with the itera-
tive dispatch approach results from the fact that the PPM
algorithm is already an optimal prediction scheme by it-
self while the 2-bit counter alone yields relatively low
overall prediction accuracy and hence leaves more room
for improvement.
Performance of the PPM-VCR predictor
As mentioned in Section 3, the VCR scheme works
better and faster than the PPM algorithm under shorter
PHTs while the PPM algorithm performs more desirably
at longer PHTs. To make the most of their advantages,
we bring the two schemes together to form a combined
predictor and conduct a performance comparison under
various PHT lengths with BHR length = 7 bits and various
BHR lengths with PHT length = 256 bits. The results
show that the PPM-VCR predictor performs as well as or
better than the PPM algorithm alone. This is especially
significant when the potentially reducible complexity
due to the VCR scheme is taken into account. (Figure
presentation is omitted due to limited space.)
Discussions
It is interesting to see from the performance compar-
ison in Figures 6 and 8 that the proposed VCR scheme
outperforms the existing schemes quite obviously in
some benchmarks, like m88ksim, and less obviously in
some other benchmarks, like gcc. Recall that the VCR
scheme, which distinguishes itself from previous schemes
by taking advantage of the loop history, divides the
sequence of outcomes in the addressed PHT entry into two
parts that are equal in length. The two parts are then
cross-referred to see if they match each other. If they do,
the coming branch outcome is predicted to be the first
outcome in either part; if they do not, the same process
is repeated, ignoring the two “least recent” outcomes at each
comparison attempt, until a match is located. If a match is
located and a prediction is made at an earlier comparison
attempt, the two matched parts are longer (indicating
that the program’s history carries the “bigger loop”
property), and vice versa. As prediction may be made upon
matched parts of different lengths, we are interested to
see the percentages, along with the misprediction rates,
of branch prediction made upon varied matched parts.
For every benchmark, it is found through simulation that
misprediction rates are always lower (i.e., prediction
accuracies are higher) when predictions are made at earlier
comparison attempts (i.e., upon longer matched parts, that is,
parts with more bits). Take m88ksim and gcc as an example.
Figure 8. Misprediction rates for our VCR scheme and
schemes dealing with both parts.
For m88ksim, most of the predictions are made after
only a few comparison attempts (i.e., with much longer
matched parts), while the situation for gcc is quite the
reverse. This explains our previous performance comparison
results. It also highlights the fact that our VCR
scheme works even better for programs whose history
carries the “bigger loop” property.
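The cross-reference procedure recalled above, together with the per-attempt statistics just discussed, can be illustrated with a minimal sketch. It is not the simulator used in this paper: the names vcr_predict and tally_by_attempt are hypothetical, a PHT entry is modeled as a plain Python list of outcomes (oldest first, 1 = taken, 0 = not taken), and dropping the oldest bit of an odd-length history is an assumption made only to keep the two parts equal in length.

```python
from collections import defaultdict


def vcr_predict(history):
    """Return (prediction, attempt), or (None, None) if no two parts match.

    `history` holds the past outcomes of one PHT entry, oldest first.
    """
    start = len(history) % 2      # assumption: drop the oldest bit if length is odd
    attempt = 1
    while len(history) - start >= 2:
        window = history[start:]
        half = len(window) // 2
        # Cross-refer the two equal-length parts.
        if window[:half] == window[half:]:
            # Predict the first outcome in either part.
            return window[0], attempt
        start += 2                # ignore the two least recent outcomes
        attempt += 1
    return None, None             # non-matched situation, left to the caller


def tally_by_attempt(outcomes, length):
    """Replay one branch through a single history of `length` bits.

    Returns {attempt: [predictions, mispredictions]} for matched cases only.
    """
    stats = defaultdict(lambda: [0, 0])
    history = []
    for outcome in outcomes:
        if len(history) == length:
            guess, attempt = vcr_predict(history)
            if guess is not None:
                stats[attempt][0] += 1
                stats[attempt][1] += int(guess != outcome)
        # Shift the new outcome in and keep only the newest `length` bits.
        history = (history + [outcome])[-length:]
    return dict(stats)
```

In this sketch a match at attempt 1 corresponds to the longest matched parts (the "bigger loop" case), so a tally dominated by small attempt numbers mirrors the m88ksim behavior described above, while a tally dominated by large attempt numbers mirrors gcc.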
It may happen that no match can be located even when the
number of referred outcomes is reduced to only one in each
compared part. In this case, prediction can be handed over to
some other scheme, like the 2-bit counter adopted in our
previous simulation. The “last bit” prediction is also a
feasible alternative, especially when reducing complexity
is a concern. In the above non-matched situation,
the “last bit” prediction instantly predicts the
coming branch outcome to be the last bit, i.e., the “most
recent” bit in the PHT entry. The performance of our
VCR scheme using the last bit prediction in non-matched
situations has also been simulated. It turns out that the
prediction accuracies for our VCR scheme using the 2-bit
counter and the last bit are almost the same (e.g., with
only about 1% accuracy difference for gcc).
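The two alternatives for the non-matched situation can be sketched as follows. This is an illustration under stated assumptions, not the configuration simulated in this paper: vcr_match, predict_last_bit, predict_two_bit and TwoBitCounter are hypothetical names, the history is a non-empty Python list of outcomes (oldest first), and the counter is the usual saturating 2-bit counter with an assumed initial state of weakly taken. How the counter is attached to a PHT entry is likewise an assumption.

```python
def vcr_match(history):
    """Return the VCR prediction, or None when no two parts match."""
    start = len(history) % 2      # assumption: drop the oldest bit if length is odd
    while len(history) - start >= 2:
        window = history[start:]
        half = len(window) // 2
        if window[:half] == window[half:]:
            return window[0]
        start += 2                # ignore the two least recent outcomes
    return None


def predict_last_bit(history):
    """VCR with the "last bit" alternative: reuse the most recent outcome."""
    guess = vcr_match(history)
    return history[-1] if guess is None else guess


class TwoBitCounter:
    """Saturating 2-bit counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self, state=2):  # assumed initial state: weakly taken
        self.state = state

    def predict(self):
        return 1 if self.state >= 2 else 0

    def update(self, outcome):
        if outcome:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def predict_two_bit(history, counter):
    """VCR with the 2-bit counter alternative for non-matched situations."""
    guess = vcr_match(history)
    return counter.predict() if guess is None else guess
```

The last bit variant needs no state beyond the PHT entry itself, which is why it is the cheaper of the two; the counter variant must additionally be updated with every resolved outcome.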
5. Conclusion
To improve branch prediction accuracy, a variable
cross-reference (VCR) prediction scheme and an itera-
tive dispatch approach are proposed in this paper. The
proposed VCR scheme can be easily implemented and is
able to yield desirable prediction accuracy for a high
performance processor at low cost. To further enhance
prediction accuracy, an iterative dispatch approach is
provided. The approach utilizes the PHT history to do
dispatching for an additional layer of pattern history, which
helps provide more information for making better
predictions. It is shown that the proposed VCR scheme and
Figure 9. Incorporating our iterative approach with the 2-bit counter & PPM algorithm.
iterative dispatch approach can handily work with other
predictors to fortify performance. A PPM-VCR predictor
is also presented to demonstrate the advantages of a com-
bined predictor.
Performance of the proposed schemes and other
prediction schemes is simulated (by conducting trace-driven
simulation runs using four SPEC CINT95 benchmarks)
for evaluation and comparison. The results
show that the overall performance of our VCR scheme
compares favorably to other schemes, such as the 2-bit
counter and the agree predictor, due to its variable
cross-reference to the traces in the PHT. With much less
complexity, the VCR scheme even outperforms the optimal
yet complicated PPM algorithm under some conditions.
When compared with the bi-mode and YAGS predictors,
which deal with both the prediction and dispatch
parts of the two-level predictor and require extra
hardware and cost, the VCR scheme still produces better
performance in most of the situations. Simulation results
show that the proposed iterative dispatch approach
outperforms the gshare and DHLF predictors, the
schemes dealing with the dispatch part. It is also shown that
the iterative dispatch approach can lift prediction accuracy
for different schemes, especially for schemes with
lesser performance, such as the 2-bit counter. On the other
hand, performance of the PPM-VCR combined predictor
reveals slight degrees of improvement over the optimal
PPM algorithm. The performance gain alone may not
appear significant enough, but the potentially reducible
complexity (due to the VCR scheme) is appealing.
References
[1] Boggs, D. et al., “The Microarchitecture of the Intel
Pentium 4 Processor on 90 nm Technology,” Intel
Technology Journal, Vol. 8, Feb. (2004).
[2] Yeh, T.-Y. and Patt, Y. N., “Alternative Implementations
of Two-Level Adaptive Branch Prediction,”
Proc. 19th Annual Int’l Symp. on Computer Architecture,
May, pp. 124-134 (1992).
[3] McFarling, S., “Combining Branch Predictors,” Technical
Report, TN-36, Digital Western Research Laboratory,
June (1993).
[4] Sprangle, E., Chappell, R. S., Alsup, M. and Patt, Y.
N., “The Agree Predictor: A Mechanism for Reducing
Negative Branch History Interference,” Proc. 24th Annual
Int’l Symp. on Computer Architecture, May, pp.
284-291 (1997).
[5] Chen, I.-C. K., Coffey, J. T. and Mudge, T. N., “Analysis
of Branch Prediction via Data Compression,” Proc.
7th Int’l Conf. on Architectural Support for Programming
Languages and Operating Systems, Oct., pp.
128-137 (1996).
[6] Sechrest, S., Lee, C.-C. and Mudge, T., “Correlation
and Aliasing in Dynamic Branch Predictors,” Proc.
23rd Annual Int’l Symp. on Computer Architecture,
May, pp. 22-32 (1996).
[7] Lee, C.-C., Chen, I.-C. K. and Mudge, T. N., “The
Bi-Mode Branch Predictor,” Proc. 30th Int’l Symp. on
Microarchitecture, Dec., pp. 4-13 (1997).
[8] Eden, A. N. and Mudge, T., “The YAGS Branch Prediction
Scheme,” Proc. 31st Int’l Symp. on
Microarchitecture, Dec., pp. 69-77 (1998).
[9] Juan, T., Sanjeevan, S. and Navarro, J. J., “Dynamic
History-Length Fitting: A Third Level of Adaptivity
for Branch Prediction,” Proc. 25th Annual Int’l Symp.
on Computer Architecture, May, pp. 155-166 (1998).
[10] Yeh, T.-Y. and Patt, Y. N., “Two-Level Adaptive Branch
Prediction,” Proc. 24th Annual Int’l Symp. on
Microarchitecture, Nov., pp. 51-61 (1991).
[11] Ross, S. M., Introduction to Probability Models,
London, United Kingdom: Academic Press (1985).
[12] SPEC CPU’95, Technical Manual, Aug. (1995).
[13] Burger, D. and Austin, T. M., “The SimpleScalar Tool
Set, Version 2.0,” Univ. of Wisconsin-Madison CS
Dept. Technical Report #1342, June (1997).
Manuscript Received: Sep. 15, 2006
Accepted: Mar. 13, 2007
