The PA-8500:  The Evolution Continues
Hewlett-Packard Company
Abstract
The PA-8500 is the newest member of the PA-RISC
family of processors.  The design is based on the PA-8000
and PA-8200 processors, but is implemented in a 0.25
micron process.  The new process allows the large first
level caches to be moved on-chip so that the frequency
can be boosted without the need to add tightly-coupled
second level caches.  Improvements have also been made
in the branch prediction hardware to allow a single
branch prediction structure to take advantage of
hardware and software branch optimization techniques
seamlessly.  These improvements, combined with some
other minor enhancements, will allow the PA-8500 to
deliver industry-leading application performance.
1.  Introduction
The PA-8500 processor, now under development at
Hewlett-Packard’s Engineering Systems lab in Ft. Collins,
Colorado, is the latest addition to the PA-RISC family of
processors.  The design builds on the base established by
the PA-8000 and PA-8200 and pushes to even higher per-
formance levels.  The design goals of the development
project are:
- Industry-leading application performance
- Full binary compatibility with all existing PA-RISC binaries
- Optimal performance on binaries tuned for the PA-8000 and PA-8200
- Upgrade to some existing systems
The PA-8000 was the first all-new PA-RISC processor
in five years.  With its large complement of functional
units, dual-ported data cache and sophisticated
instruction reordering hardware, it effectively exploited
instruction level parallelism to set new standards of
performance.
While the PA-8000 realized most of its performance
gains through superscalar support, the PA-8200 design
built upon the PA-8000 by increasing the sizes of the
caches, branch prediction cache and TLB as well as increasing
the operating frequency.  To push to still
higher levels of performance, the PA-8500 design emphasizes
increased frequency over CPI improvements in order
to make the most of the infrastructure established in the
PA-8000.  To achieve these higher frequencies,
the PA-8500 design is implemented in a 0.25 micron process.
Migrating to this new process also allowed the inclusion
of enough on-chip SRAM to implement effective first
level caches.  These caches are large enough to achieve
good performance even without a second level cache,
thereby allowing cheaper systems to be constructed and
establishing the 64-bit PA2.0 architecture across the prod-
uct line.
Although the main emphasis of the PA-8500 project
is to increase the frequency, other incremental changes
have been incorporated to improve CPI (Cycles Per In-
struction) as well.
2.  Caches
2.1  Large on-chip data cache
The PA-8500 implements a large 4-way set associa-
tive 1M data cache on chip.  Large primary caches allow
the PA-8500 to achieve strong performance across a wide
range of workloads by providing a low cache miss rate,
and avoiding the overhead of having to access an L2
cache to reach the data.
Some small benchmarks do well with a small one-
cycle on-chip cache backed by a larger level two cache.
However, many real applications suffer from
this arrangement.  They don't fit well into the small level
one cache, and the level two cache can't support the bandwidth
since a full cache line needs to be moved at a time.
Even level one bandwidth can be insufficient since every
access that misses generates a copy-in in addition to the
initial access.  It is possible to mitigate this somewhat by
supporting a smaller line size for the level one cache, but
then the complexities of differing cache line sizes between
the two caches must be managed.
A large level one cache avoids all these problems by
providing high bandwidth directly to a large store of data.
Hewlett-Packard has consistently built processors with
high bandwidth access to a large primary cache.  Now
with 0.25 micron technology, large caches can be put di-
rectly on the CPU.
For most workloads, a set associative cache looks
larger than an equivalently sized direct mapped cache.  As
a direct mapped cache fills up with useful data, it becomes
increasingly unlikely that the next piece of requested data
will find an unoccupied location in cache to use. Increas-
ing the cache size will improve the likelihood of avoiding
a conflict.  Making the cache set associative can do the
same.
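The conflict effect described above can be illustrated with a toy cache model.  This is a minimal sketch, not a model of the PA-8500 cache: the function name count_misses, the 32-byte line size, the tiny eight-line capacity and the LRU replacement policy are all assumptions chosen for illustration.

```python
def count_misses(addresses, num_lines, ways, line_bytes=32):
    """Count misses in a toy cache; ways=1 models a direct mapped cache.

    Replacement within a set is LRU, which is an assumption made for
    this sketch rather than a statement about any real processor.
    """
    num_sets = num_lines // ways
    sets = [[] for _ in range(num_sets)]  # tags per set, most recent last
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index, tag = line % num_sets, line // num_sets
        tags = sets[index]
        if tag in tags:
            tags.remove(tag)          # hit: refresh its LRU position below
        else:
            misses += 1
            if len(tags) == ways:     # set is full: evict least recently used
                tags.pop(0)
        tags.append(tag)
    return misses


# Two addresses exactly one cache size apart, accessed alternately.
trace = [0, 256] * 10
direct_mapped_misses = count_misses(trace, num_lines=8, ways=1)
four_way_misses = count_misses(trace, num_lines=8, ways=4)
```

In the direct mapped case the two addresses evict each other on every access, so every access misses.  With four ways both lines fit in the same set and only the two initial misses occur, even though the total capacity is unchanged.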
The PA-8000 had more than 500 I/O’s dedicated to
the off chip caches.  These were needed to provide the
bandwidth required by the computation units.  Increasing
the number of I/O’s to support an associative lookup was
not an option.  On the PA-8500, however, since the cache
is on chip, it is not a great burden to add set associativity
and increase the effective cache size, as measured by the
miss rate.
As a point of reference, we simulated the miss rates
of the PA-8500’s 1M 4-way set associative cache and a
64K 4-way cache to get a sense of the benefit of the larger
cache.  On a measure of TPC code we saw a 1.3% miss
rate with the 1M cache versus a 3.3% miss rate on the
smaller 64K cache, a 2.5X difference.  On a Verilog run,
we measured a x.x miss rate versus x.x for the smaller
cache, or a xX difference.
One drawback of a large level one cache is that it
cannot be accessed in a single cycle.  Fortunately, the reordering
queues the PA-8500 inherited from the PA-8000
do an excellent job of finding other useful work for the
processor to do while the cache is being accessed.  Our
studies show a 5% reduction in performance when the
cache access is extended from one to two cycles, assuming
the cache size is fixed.  Of course, a single-cycle
cache would actually be much smaller.  The penalty for its
higher miss rate easily swamps the 5% benefit of the
quicker access.
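The tradeoff can be made concrete with a simple stall-cycle model.  This is only a sketch under stated assumptions: the base CPI of 1.0, the 0.3 memory references per instruction and the 50-cycle miss penalty are hypothetical values, not measurements.  It also assumes the single-cycle alternative is the 64K cache from the TPC simulation above, and it approximates the 5% performance loss as a 5% increase in base CPI.

```python
def effective_cpi(base_cpi, refs_per_instr, miss_rate, miss_penalty):
    """Base CPI plus the average memory stall cycles per instruction."""
    return base_cpi + refs_per_instr * miss_rate * miss_penalty


# Hypothetical machine parameters (assumptions, not from the paper).
REFS_PER_INSTR = 0.3   # memory references per instruction
MISS_PENALTY = 50      # cycles to service a level one miss

# Miss rates are the TPC figures quoted above: 1.3% (1M) and 3.3% (64K).
large_two_cycle = effective_cpi(1.05, REFS_PER_INSTR, 0.013, MISS_PENALTY)
small_one_cycle = effective_cpi(1.00, REFS_PER_INSTR, 0.033, MISS_PENALTY)
```

Under these assumptions the large two-cycle cache comes out near 1.25 CPI against roughly 1.50 for the small single-cycle cache.  The exact numbers depend entirely on the assumed parameters; the point is only that the miss term grows much faster than the access-time term.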
2.2  Fitting and keeping the speed up
The data cache must support two simultaneous
memory operations, while maintaining a two cycle access
and fitting into a very full chip. This is accomplished by
using the same dual bank system developed for the
PA-8000’s off chip data cache.  With this system the cache
can be implemented with a simple single-port RAM design,
conserving area.  Also, since each lookup only needs to
concern itself with half the cache, the challenge of achieving
the cache lookup in the allotted time is less daunting.
Access to the two banks of the cache is controlled
by the address reorder buffer (ARB) portion of the
PA-8500's reorder queue.  The 28-entry ARB receives addresses
that have been calculated by the processor's address
units.  It prioritizes memory accesses that have had
their addresses calculated by program order, and picks
one even and one odd doubleword access to use the cache
each cycle.  A bypass is provided from the address unit if
there is no outstanding access with its address calculated.
This arrangement allows the cache ports to be kept
busy even when simultaneously calculated addresses hap-
pen to access the same half of the cache.  Any delay in
getting data in this scenario is hidden by the out of order
queues.
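The selection of one even and one odd doubleword access per cycle can be sketched as follows.  This is an illustrative model only: the names bank_of and pick_accesses are hypothetical, the bank is assumed to be chosen by the low bit of the doubleword index, "oldest first" is assumed as the meaning of program-order priority, and the bypass path is not modeled.

```python
DOUBLEWORD_BYTES = 8


def bank_of(address):
    """Return 0 for an even doubleword, 1 for an odd doubleword."""
    return (address // DOUBLEWORD_BYTES) % 2


def pick_accesses(arb_entries):
    """Pick at most one access per bank, oldest in program order first.

    arb_entries is a list of (program_order, address) tuples for
    accesses whose addresses have already been calculated.
    Returns (even_pick, odd_pick); a slot is None if no access is ready.
    """
    picks = [None, None]
    for entry in sorted(arb_entries):      # sorts by program_order
        bank = bank_of(entry[1])
        if picks[bank] is None:
            picks[bank] = entry
    return tuple(picks)


ready = [(3, 0x108), (1, 0x100), (2, 0x110), (4, 0x118)]
even_pick, odd_pick = pick_accesses(ready)
```

Here entries 1 and 2 both target the even bank, so entry 2 waits a cycle while entry 3, the oldest odd access, is issued alongside entry 1.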
Each 0.5M cache block is implemented as four one-
eighth Megabyte arrays, each providing a double-word of
data plus error correction bits. Data is organized within
the arrays such that a cache line can be addressed at a
time, or four ways of associativity can be addressed together.
The cache line tags are held in four smaller, separate
and independently addressable RAM arrays.  In this
way the data and tag can either be accessed together for a
data read, or independently to effect a data store at the
same time another store is assessing its cache line status.
2.3  Large on-chip instruction cache
The PA-8500 instruction cache is a one-half mega-
byte four-way set associative pipelined cache that pro-
vides 128 bits of instruction plus predecode bits each
cycle to the instruction fetch engine.  Like the data cache,
each access takes two cycles.
To support some real-world applications, especially
commercial applications such as transaction processing, a
large instruction cache with a lot of bandwidth is required.
For the PA-8000 this led us to provide a wide path to external
RAMs, but for the PA-8500, with the higher density of
a 0.25 micron process, we can include sufficient cache
on-chip.
Alternate schemes might have been employed to
provide a smaller instruction cache with a single state of latency,
but the benefits would be small.  The PA-8500 includes
a branch target address cache (BTAC) to avoid penalties when
taken branches are encountered, so the only case that suffers
from the extra state of latency is a mispredicted branch.  A
mispredicted branch is a relatively rare event, and when it does
occur this only represents one state out of a mispredicted
branch penalty of six states.
2.4  No L2 cache supported directly from the CPU
With on-chip primary caches larger than many L2
caches, and the rescheduling capabilities of its 56-entry reorder
queue, the PA-8500 doesn't need to support a tightly
coupled L2 cache directly from the CPU.  This is important
in that there is not sufficient room on the die for both
the large caches and support for a second level cache on-
chip. To provide meaningful bandwidth, a level two cache
would require significant area for both the cache control
and a large number of data, tag and address I/O’s to com-
municate with off-chip rams.
A minimum configuration PA-8000 processor requires
the CPU and eleven RAMs.  A PA-8500 with more
cache requires just the CPU.  This then allows the
PA-8500 to serve the low end of the PA-RISC product
space as well as the high.
2.5  Correct data
All data stored in the on-chip caches is protected
against single-bit errors.  With a total of
1.5 Megabytes of data and instruction, plus storage for
tags and predecode, implemented in a dense process and
covering three-quarters of the die, measures are required
to maintain data integrity.  Simple parity is sufficient for
the instruction cache since its contents are always clean,
so corrupted data can be safely refetched from memory.
The data cache requires error correction to allow re-
covery when a dirty line in cache becomes corrupted.  The
PA-8500 provides six extra bits per word to enable single
bit error correction to protect the cache data.  The correc-
tion however is not in the critical cache access path. Er-
rors are instead detected on the side.  If an error is
detected the corrupted data is forced out of the cache, with
the data being corrected in the copy out path if the line
was dirty.  The access is then re-executed, causing the line
to be brought back into cache with the corrected data.
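The figure of six extra bits is consistent with a single-error-correcting Hamming code over a 32-bit word.  The sketch below computes the minimum number of check bits for such a code; it assumes that "word" means 32 bits and that a Hamming-style code is used, neither of which the text states explicitly.  The function name sec_check_bits is hypothetical.

```python
def sec_check_bits(data_bits):
    """Minimum check bits r for single-error correction.

    A single-error-correcting code must distinguish "no error" from an
    error in any one of the data_bits + r stored bits, so r must
    satisfy 2**r >= data_bits + r + 1.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


bits_for_32 = sec_check_bits(32)
bits_for_64 = sec_check_bits(64)
```

For 32 data bits, five check bits are not enough (32 < 38) while six are (64 >= 39).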
The data cache's tag RAMs are also correctable, but in
a different way.  The PA-8500 maintains two copies of
each cache line's tag to allow two accesses to be serviced simultaneously.
Each is protected with parity.  If an error is
detected, a copy out of the line is started.  During the copy
out, the tag array that didn't have the parity error (remember
there were two copies) provides the physical address
for memory, and the correct status.  If the status indicates
the line was dirty, the line is truly copied back to memory.
If the line was not dirty, the cache location is simply invalidated.
Once the cache has been scrubbed in this fashion,
the access is re-executed, bringing the line into cache and
performing the access.
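The tag recovery decision can be sketched as a small function.  This is a simplified illustration, not the hardware algorithm: the TagEntry layout, the single parity bit covering both tag and dirty status, and the names make_entry and scrub_action are assumptions introduced here.

```python
from typing import NamedTuple


class TagEntry(NamedTuple):
    tag: int      # physical address tag
    dirty: bool   # line status
    parity: int   # stored parity over tag and status (assumed layout)


def parity_of(tag, dirty):
    return (bin(tag).count("1") + int(dirty)) % 2


def make_entry(tag, dirty):
    return TagEntry(tag, dirty, parity_of(tag, dirty))


def is_good(entry):
    return parity_of(entry.tag, entry.dirty) == entry.parity


def scrub_action(copy_a, copy_b):
    """Decide how to scrub a line given its two tag copies.

    Returns (action, tag) where action is "none", "copy_back" or
    "invalidate", and tag comes from a copy with good parity.
    """
    if is_good(copy_a) and is_good(copy_b):
        return ("none", copy_a.tag)
    good = copy_a if is_good(copy_a) else copy_b
    if not is_good(good):
        raise RuntimeError("both tag copies have parity errors")
    return ("copy_back" if good.dirty else "invalidate", good.tag)


intact = make_entry(0x1234, dirty=True)
corrupted = intact._replace(tag=intact.tag ^ 1)   # flip one tag bit
action, tag = scrub_action(corrupted, intact)
```

After the scrub, the original access would be re-executed as described above; that step is outside this sketch.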
3.  Branch prediction improvements
Branch prediction is a hot topic these days, and for
good reason.  Today’s high frequency, superscalar proces-
sors pay a high cost every time they guess wrong about
the actual path the program execution will take.  This pen-
alty is high due to the increased pipeline depth, but just as
importantly, mispredicted branches prevent the processor
from effectively exploiting instruction level parallelism
across branches, thus leading to a high "opportunity cost".
Considering the importance of the topic, it is no surprise
that a lot of attacks on the problem have been proposed
and implemented.  These approaches may be roughly
categorized into software mechanisms and hardware
mechanisms.  
Software techniques generally rely on the compiler
to optimize branch performance and, perhaps, to pass a
hint to the hardware about the likely direction a given
branch will take.  Compiler optimizations are usually
based on execution profile information gathered during a
training run (Profile Based Optimization or PBO) or they
may be based on the structure of the source code using
heuristics.  Hardware based branch prediction techniques
predict the future behavior of a branch based on the past
behavior of that branch and/or other branches.  
Compiler-based optimizations have a number of
benefits over hardware branch prediction.  First, they gen-
erally do not consume chip area.  A second advantage to
compiler-based optimizations is that there is no capacity
limit to some structure, as there is in hardware based solu-
tions.  That is, if the compiler can hint each branch with
its expected direction, then most branches will be pre-
dicted fairly well, no matter how many branches are in the
program.  Still another advantage that compiler-based op-
timization has over hardware-based branch prediction is
that the compiler may be able to eliminate some branches
altogether.  For example,  PA-RISC includes the capabil-
ity of conditionally "nullifying" an instruction, which
amounts to the ability to skip over an instruction without
executing a branch.  Using this capability, the compiler
can generate code to calculate two possible results and
then keep only the desired result by nullifying a copy be-
tween two registers based upon the condition.  If a branch
is difficult to predict, this strategy may lead to better
overall performance than code including a branch, even
though more total instructions may be generated.  
Hardware prediction solutions have their advantages
as well, of course.  One example is the case where a
branch always goes the same way in a given run of an ap-
plication, but that direction is the function of a mode, such
as a command line option.  Thus a branch might be gener-
ated by a statement such as "if input == stdin".  The com-
piler, of course, cannot know which way the branch will
go when the program is run, but for a given run of the pro-
gram, it will go the same way each time the branch is exe-
cuted.  Even the simplest of dynamic branch prediction
schemes will figure out this branch in short order.
One can argue that, if all the help the compiler can
provide (outside of branch elimination) is to hint a branch
to be taken or not taken forever, a simple hardware-based
scheme will be able to do just as well, and it will also
handle all of the "mode dependent" branches
that stump the compiler.  The Achilles'
Heel of hardware-based dynamic branch prediction is a
limited resource for tracking branches.  If two branches
which are each well-behaved (i.e., almost always go in a
consistent direction) map to the same location in the
branch prediction cache, they will interfere with one an-
other and the prediction accuracy of each will suffer.
Thus, whether the overall best single policy for branch
prediction is to follow compiler hints or to let the hard-
ware sort it all out at runtime will depend on the applica-
tion being run.
This dilemma was recognized during the develop-
ment of the PA-8000, so accommodations were made to
be able to run in either "static" prediction mode (i.e., al-
ways follow the hint provided by the compiler for each
branch) or "dynamic" prediction mode (i.e., always follow
the result supplied by the branch prediction cache).  At
compile time, the application developer can mark a binary
to be run in one mode or the other depending on which
performs better for that application.
In the PA-8500, we have created a means of com-
bining the advantages of static and dynamic branch pre-
diction.  The branch prediction cache is a standard array
of two bit counters (as described by --), but the informa-
tion stored in the prediction cache is not the direction of
the branch (taken or not taken), but whether the branch
went in the direction indicated by the static hint supplied
by the compiler or not.  If the static hint disagrees with the
actual direction the branch followed, the counter is incre-
mented; if it agrees, the counter is decremented.  Each
time a branch is fetched, the prediction cache is consulted
and, if the counter is zero or one, the static hint encoded in
the instruction is followed.  If the counter is two or three,
the hardware predicts that the branch will go in the direc-
tion opposite the static hint.
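The scheme just described can be sketched in a few lines.  This is a minimal model under stated assumptions: the class name AgreePredictor, the 256-entry table, the indexing by word address and the initial counter value of zero are all hypothetical, and the counters are assumed to saturate at zero and three as two-bit counters normally do.

```python
class AgreePredictor:
    """Two-bit counters that track disagreement with the static hint."""

    def __init__(self, entries=256):
        self.counters = [0] * entries   # 0 = strongly agrees with hint

    def _index(self, pc):
        return (pc >> 2) % len(self.counters)   # assumed index function

    def predict(self, pc, hint_taken):
        """Return the predicted direction (True means taken)."""
        if self.counters[self._index(pc)] >= 2:
            return not hint_taken       # counter is two or three: override
        return hint_taken               # counter is zero or one: follow

    def update(self, pc, hint_taken, taken):
        i = self._index(pc)
        if taken != hint_taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = AgreePredictor()
# A branch hinted not-taken that is in fact always taken.
for _ in range(2):
    predictor.update(0x1004, hint_taken=False, taken=True)
overridden = predictor.predict(0x1004, hint_taken=False)
```

One consequence of storing agreement rather than direction is that two correctly hinted branches sharing a counter do not disturb each other, even if one is always taken and the other never is: both simply keep the counter at zero.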
One can see that, if the compiler has done a good
job of setting the static hints, and most of the branches are
"well behaved", most of the counters in the prediction
cache will wind up with counts of zero or one.  For
branches which are well behaved, but which are mis-hinted,
the counts will tend to be two or three and the
hardware will override the static hint provided by the
compiler.  The hardware capacity limit then makes itself
felt only when two "well-behaved" branches, one of which
is correctly hinted and one of which is not, map to the
same location in the prediction cache and interfere with
one another.  Here again, software comes to the rescue.
In HP-UX version 10.30 and later, on each context
switch (at least 100 times per second) the kernel will examine
the next instruction to be executed to see whether
or not it is a branch which can be hinted.  If so, it examines
the operands of the branch to determine whether that
branch will, in fact, go the way it is hinted.  It then records
the information in a table.  If, on a later context switch,
the same branch is encountered and the accumulated his-
tory indicates that the branch is mis-hinted, the kernel will
recode the branch in memory to correct the hint.  This will
allow the mis-hinted branch to stop interfering with an-
other branch which is hinted correctly and improve the
prediction accuracy of both.  Note that the more often a
given branch is executed, the more likely it is to be sam-
pled by the kernel and therefore the more likely it is to get
corrected.  Therefore, the branches with the most potential
to do harm are the ones most likely to get corrected.
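The sampling and rehinting loop can be sketched as follows.  This is a hypothetical illustration, not the HP-UX implementation: the Rehinter class, its table layout and the decision rule (at least min_samples observations with disagreements in the majority) are assumptions, since the text does not say how the kernel decides that a branch is mis-hinted.

```python
from collections import defaultdict


class Rehinter:
    """Accumulate per-branch samples and flip hints that look wrong."""

    def __init__(self, min_samples=4):
        # branch address -> [agree_count, disagree_count]
        self.history = defaultdict(lambda: [0, 0])
        self.min_samples = min_samples   # assumed threshold

    def sample(self, address, hint_taken, will_take):
        """Record one sample and return the hint to leave in memory."""
        record = self.history[address]
        record[0 if will_take == hint_taken else 1] += 1
        agree, disagree = record
        if agree + disagree >= self.min_samples and disagree > agree:
            del self.history[address]    # start fresh for the new hint
            return not hint_taken        # recode the branch
        return hint_taken


rehinter = Rehinter(min_samples=4)
hint = False                             # branch is hinted not-taken
for _ in range(4):                       # but is taken every time sampled
    hint = rehinter.sample(0x4000, hint, will_take=True)
```

Because samples are taken at context switches, a frequently executed branch is more likely to reach the threshold, which matches the observation above that the most harmful branches are the most likely to be corrected.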
The branch prediction cache allows the PA-8500 to
combine the advantages of static and dynamic branch pre-
diction techniques.  The combination of static and dy-
namic information is achieved with a single hardware
structure, rather than having to implement two or three
hardware arrays to manage the prediction information and
arbitrate between them.  Of course, this method of com-
bining static and dynamic information could be combined
with many other more elaborate branch prediction
schemes.  At this design point, however, we felt that maxi-
mizing the first level cache available on-chip was more
important than adding large amounts of storage to provide
a minimal improvement in branch prediction accuracy.
Finally, note that the enhancements made to branch
prediction for the PA-8500 are backward compatible with
the PA-8200 and PA-8000.  If either of these machines is
operating in static prediction mode and the kernel rehints
a branch, they too will see a performance benefit.  If they
are operating in dynamic mode, the rehint makes no
difference.
4.  Performance migration
The radical new design of the PA-8000 opened up
many new opportunities for compiler optimizations.  In
addition, the PA-8000 was the first processor to imple-
ment the new PA-2.0 architecture.  This meant that a lot
of work went into the development of the compilers as
well as into retuning of applications.  By leveraging the
core micro-architecture of the PA-8000, the PA-8500 en-
sures that all of the investment in compilers and applica-
tion tuning is protected.  
It is also important to note that the employment of
Profile Based Optimization (PBO) has truly come of age
with the PA-8000 family of processors.  This technique
has shown its value in enabling the compiler to perform
optimizations which are not possible without knowledge
of the dynamic behavior of the program.  The investments
which have been made by software developers in moving
to the PBO paradigm will continue to pay off more and
more as compilers continue to improve in their ability to
exploit the ever increasing potential of the processors of
the future.
5.  Other features
In addition to increasing the frequency, adding
cache associativity and enhancing branch prediction capa-
bility, some other changes have been incorporated in the
PA-8500 in order to improve its performance:
- The size of the TLB has been increased from 120 to 160 entries.
- Modifications have been made to support a higher performance system bus.
- Support for higher graphics bandwidth has been added.
- Various performance penalties have been removed or reduced.
- Redundant RAM
6.  Conclusions
The PA-8500 is the next member of the PA-RISC
processor family.  By migrating to a 0.25 micron process,
implementing large, associative on-chip caches, and enhancing
the branch prediction hardware, the PA-8500 will
deliver industry leading performance on binaries tuned for
the PA-8000 microarchitecture.
Acknowledgements
The authors would like to acknowledge all of the individuals who
have worked so hard to make the PA-8500 possible, including
everyone who worked on the PA-8000 and PA-8200.
Special thanks are also in order for Sridhar Ramakrishnan, Wei
Hsu and the rest of the language operation as well as individuals
in the performance community and the field who have worked
and continue to work to bring the full potential of the PA-8000
family of processors to customers.  A final word of thanks goes
out to the HP-UX lab, especially Balakrishna Raghunath and to
Doug Larson for his work on rehinting branches on-the-fly.
References
[1] T. Yeh and Y. Patt, "Two-Level Adaptive Branch Prediction,"
24th International Symposium on Microarchitecture,
Nov. 1991, pp. 51-61.
[2] D. Hunt, "Advanced Performance Features of the 64-bit
PA-8000," Compcon Digest of Papers, March 1995, pp.
123-128.