Predictive Read Cache Memories for Reducing Primary Cache Miss Latency in Embedded Microprocessor Systems by Fouts, Douglas Jai
Calhoun: The NPS Institutional Archive
Faculty and Researcher Publications
2000-04-04
The United States of America as represented by the Secretary of the Navy, Washington, DC (US)
http://hdl.handle.net/10945/7237
United States Patent [19] 
Fouts 
[54] PREDICTIVE READ CACHE MEMORIES 
FOR REDUCING PRIMARY CACHE MISS 
LATENCY IN EMBEDDED 
MICROPROCESSOR SYSTEMS 









Assignee: The United States of America as 
Represented by the Secretary of the 
Navy, Washington, D.C. 
Appl. No.: 08/964,046 
Filed: Nov. 4, 1997 
Int. Cl. ...................................................... G06F 12/08 
U.S. Cl. ................................ 711/137; 711/122; 711/3 








U.S. PATENT DOCUMENTS 
11/1993 Jouppi et al. ........................... 711/122 
4/1994 Palmer .................................... 382/305 
11/1994 Westberg ................................. 711/137 
3/1995 DeLano et al. ......................... 712/207 
7/1998 Kedem et al. .......................... 711/137 
OTHER PUBLICATIONS 
Fouts, Douglas J. and Arthur B. Billingsley, "Predictive 
Read Caches: An Alternative to On-Chip Second-Level 
Cache Memories", Journal of Microelectronic Systems 
Integration, vol. 2, No. 2, pp. 109-121, Jun. 1994. 
US006047359A 
[11] Patent Number: 6,047,359 
[45] Date of Patent: Apr. 4, 2000 
Primary Examiner-Reginald G. Bragdon 
Attorney, Agent, or Firm-Donald E. Lincoln 
[57] ABSTRACT 
A predictive read cache reduces primary cache miss latency 
in a microprocessor system that includes a microprocessor, 
a main memory and a primary cache memory connected 
between the main memory and the microprocessor via an 
instruction address bus, a data address bus and a data bus. 
The predictive read cache tracks the pattern of data read 
addresses that cause misses in the primary cache and asso-
ciates the pattern with the specific instruction that generates 
the pattern of miss addresses. When a pattern has been 
determined, the address where the next cache data read miss 
will occur is predicted and sent to memory at a time when 
the memory is not busy with other transactions. The data at 
the predicted miss address is then fetched and stored in the 
predictive read cache. The next time a data read miss occurs 
in the primary cache, if the miss address matches one of the 
predicted miss addresses stored in the cache, then the 
required data is immediately sent to the primary cache from 
the predictive cache, rather than having to be read out of the 
much slower main memory. 































































































[FIG. 4A: microprocessor 22, primary cache memory 32, and RPB, joined by address bus 40 and data bus 42]
[FIG. 4B: microprocessor 22, primary cache memory 32, and PRC 52, joined by address bus 40 and data bus 42]
[FIG. 6: block fields IATG 92 (instruction address tag), MRMA 93 (most recent miss address), PRMA 94 (previous miss address), PDMA 95 (predicted miss address)]
U.S. Patent    Apr. 4, 2000    Sheet 3 of 8    6,047,359 













[FIG. 7: drawing labels PREFETCH QUEUE 78 and BUS INTERFACE]
[FIG. 8 (Sheet 5 of 8): fields IATG 92, MRMA 93, PRMA 94, PDMA 95, and PDDT 96 (PREDICTED DATA), each with blocks BLK. 0 through BLK. n; the log2 n LSBs of the INSTRUCTION ADDRESS select a block; arithmetic outputs A-B and A+B; INSTRUCTION ADDRESS BUS]
[Sheet 6 of 8: drawing labels BINARY ENCODER, MRMA 93 with blocks BLK. 0 through BLK. 3 and onward, INSTRUCTION ADDRESS BUS, PREDICTED DATA]
[FIG. 10 (Sheet 7 of 8): set-associative PRC 120 with blocks BLK. 0 through BLK. 3 in each group; drawing labels MUX, PDMA HIT ENCODER, IATG HIT ENCODER, PREDICTED DATA]
PREDICTIVE READ CACHE MEMORIES 
FOR REDUCING PRIMARY CACHE MISS 
LATENCY IN EMBEDDED 
MICROPROCESSOR SYSTEMS 

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates generally to memories for digital computer systems and particularly to multilevel hierarchical memories. Still more particularly, this invention relates to a cache memory that reduces cache miss latency by tracking multiple cache data read miss address patterns and by associating each cache data read miss address pattern with the specific instruction that generated the miss address pattern to improve the probability of a correct prediction.

2. Description of the Prior Art

Modern, high-performance microprocessors have extremely high memory bandwidth requirements and very short memory latency requirements. Memory latency is defined as the time between when the processor sends out a memory read address and when it receives the data back. In such systems, if a single-level memory hierarchy is used, then the memory subsystem must be constructed using high-speed static random access memory (SRAM) integrated circuits (ICs) because no other technology can meet the memory bandwidth and latency requirements. However, implementing a large main memory system with high-speed SRAM is not practical for most applications because of cost, size, power consumption, cooling, and weight constraints. Therefore, most computers utilize a multilevel, hierarchical memory subsystem that consists of a large, but relatively slow, main memory augmented by a much smaller but very high-speed cache memory. The main memory is usually constructed with dynamic RAM (DRAM) ICs. With modern microprocessors, the cache memory is usually implemented on the microprocessor chip using high-speed static RAM technology, although an off-chip cache can be constructed using high-speed static RAM ICs.

The use of a high-performance microprocessor chip with an on-board primary cache memory leads to the problem of cache-miss latency. The read access time to data in an on-board, high-speed, cache memory is typically one clock cycle. However, the read access time to data that is not in the cache can be as high as hundreds of clock cycles. This extreme difference in access time between the cache and the main memory is very significant with modern reduced instruction set computing (RISC) microprocessors that execute instructions at a rate of at least one every clock and operate at clock rates in the hundreds of megahertz. Therefore, the latency encountered when a miss occurs in the on-board cache can become a significant portion of the average read access time, even if the cache miss ratio is small.

Second-level, off-chip, cache memories are the usual means for reducing the cache-miss latency of high-performance workstations, file servers, and mainframe computers. The problem with second-level cache memories is that they require an array of power consuming, heat generating, and expensive SRAM ICs that can significantly increase the size, weight, power consumption, and generated heat. Therefore, second-level cache memories are generally unsatisfactory for embedded computers. Embedded computers are normally designed to be small, lightweight, consume small amounts of power, and generate small amounts of heat in applications where they provide control and communications, such as satellites, weapon systems, and portable, mobile, and aeronautical computing systems.

SUMMARY OF THE INVENTION

The predictive read cache memory according to the present invention can be used in place of an entire second-level cache memory to obtain nearly the same result, depending on the application. The predictive read cache tracks the pattern of data read addresses that cause misses in the on-board primary cache and associates the pattern with the specific instruction that generates the pattern of miss addresses. When a pattern has been determined, the address where the next cache data read miss will occur is predicted and sent to memory at a time when the memory is not busy with other transactions. The data at the predicted miss address is then fetched and stored in the relatively small but high-speed predictive read cache. The next time a data read miss occurs in the primary cache, if the miss address matches one of the predicted miss addresses stored in the cache, then the required data is immediately sent to the primary cache from the predictive cache, rather than having to be read out of the much slower main memory.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a typical prior art microprocessor-memory subsystem interface without cache;

FIG. 2 is a block diagram of a typical prior art microprocessor-memory subsystem interface with cache;

FIG. 3 is a block diagram of a typical prior art microprocessor-memory subsystem interface with both primary and second-level cache;

FIG. 4A is a block diagram of a prior art microprocessor-memory subsystem interface with a read prediction buffer;

FIG. 4B is a block diagram of a microprocessor-memory subsystem interface with a predictive read cache according to the present invention;

FIG. 5 is a flow chart of a prediction algorithm for the read prediction buffer of FIG. 4A and the predictive read cache of FIG. 4B;

FIG. 6 illustrates fields in each block of the predictive read cache of FIG. 4B;

FIG. 7 is a block diagram of a reduced instruction set microprocessor with on-chip predictive read cache according to the present invention;

FIG. 8 is a block diagram of a direct-mapped predictive read cache according to the present invention;

FIG. 9 is a block diagram of a fully associative mapped predictive read cache according to the present invention;

FIG. 10 is a block diagram of a set-associative mapped predictive read cache according to the present invention;

FIG. 11 is a block diagram of an alternative memory subsystem architecture that has a predictive read cache according to the present invention and no primary cache memory; and

FIG. 12 is a block diagram of an alternative memory subsystem architecture that includes a primary cache, a second-level cache and a predictive read cache.

DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 illustrates a typical microprocessor-memory subsystem interface 20 without a cache memory. In FIG. 1 a microprocessor 22 is connected to a main memory 24 via an address bus and a data bus.

FIG. 2 illustrates a typical microprocessor-memory subsystem interface 26 that includes a cache memory 28 connected between the microprocessor 22 and the main memory 24. The cache memory 28 is typically formed on the same semiconductor chip (not shown) as the microprocessor 22.

FIG. 3 illustrates a typical microprocessor-memory subsystem interface 30 with both a primary cache memory 32 and a second-level cache 34. The primary cache memory 32 is connected to the microprocessor 22, and the second-level cache memory 34 is connected between the main memory 24 and the primary cache memory 32. The primary cache memory 32 is also typically formed on the microprocessor chip.

FIG. 4A illustrates a microprocessor-memory interface 36 that includes the primary cache memory 32 connected to the microprocessor 22 and a read prediction buffer (RPB) 38 connected between the main memory 24 and the primary cache memory 32. An address bus 40 and a data bus 42 are connected between the primary cache memory 32 and the RPB 38. Similarly, an address bus 44 and a data bus 46 are connected between the RPB 38 and the main memory 24. An instruction fetch address bypass bus 48 is connected between the address bus 40 and the address bus 44, and an instruction fetch bypass bus 50 is connected between the data bus 42 and the data bus 46.

Referring to FIG. 4B, the present invention replaces the RPB 38 with a predictive read cache (PRC) 52. Suitable structures and methods of operation of the PRC 52 are presented subsequently. An explanation of the functions of the RPB 38 will facilitate understanding of the PRC 52. Additional details of the RPB may be obtained by referring to the following references: (1) G. J. Nowicki, "The Design and Implementation of a Read Prediction Buffer", Master's Thesis, U.S. Naval Postgraduate School, Monterey, Calif., December 1992; (2) M. E. Aguilar, "Testing of the Read Predictive Buffer Chip, Design and Implementation of the Predictive Read Cache Chip", Master's Thesis, U.S. Naval Postgraduate School, Monterey, Calif., March 1995; (3) D. J. Fouts, G. J. Nowicki, and M. E. Aguilar, "A CMOS Read Prediction Buffer IC for Embedded Microprocessor Systems", Journal of Microelectronic Systems Integration, Vol. 5, No. 3, pp. 145-157, 1997; and (4) D. J. Fouts and A. B. Billingsley, "Predictive Read Caches: An Alternative to On-Chip Second Level Cache Memories", Journal of Microelectronic Systems Integration, Vol. 2, No. 2, pp. 109-121, June 1994.

Both the RPB 38 and the PRC 52 are normally situated between the primary cache 32, which is usually implemented on the microprocessor chip, and the main memory, as shown in FIGS. 4A and 4B. The RPB 38 operates by tracking the sequence of data read addresses going from the microprocessor 22 to the main memory 24. For microprocessors with an on-board cache, any off-chip data read operation will, by definition, be the result of a miss in the on-board cache.

When the RPB 38 tracks an address sequence, it executes the algorithm shown in the flow chart of FIG. 5. Initially, a new read address is designated as the most recent memory address, or MRMA. When the next cache data read miss address is obtained, the old MRMA becomes the previous read memory address, or PRMA, and the new address becomes the MRMA. The PRMA is then subtracted from the MRMA to obtain a displacement. The displacement is then added to the MRMA to obtain the predicted address of the next cache data read miss. Once the predicted address has been obtained, the RPB 38 waits for a free memory bus cycle and then initiates a main memory read at the predicted address. When the data is obtained from memory, it is loaded into a high-speed buffer (not shown) along with the predicted address and made ready for sending to the microprocessor 22. When the microprocessor 22 initiates the next data read, the address is compared against the predicted address field in the high-speed buffer. If a match occurs, the contents of the data field in the high-speed buffer are sent to the microprocessor and the predicted address is used as the MRMA for a new address prediction.

The displacement-based algorithm followed by the RPB 38 has several important features. First, and most importantly, the required calculations can always be accomplished during the amount of time between successive cache data read misses. This time can be very short, depending on the characteristics of the microprocessor 22 and the software being executed. Second, the algorithm is demand driven so that if the prediction is wrong, the data at the incorrectly predicted address does not pollute the primary cache memory 32 and reduce performance. Third, the data at the predicted address is read from main memory 24 during a free memory cycle and thus does not use up a significant amount of useful memory bandwidth. Fourth, the displacement-based algorithm can be implemented on a single VLSI IC (not shown). In fact, the number of logic gates required to implement the RPB 38 is small enough such that the entire RPB 38 could conceivably be implemented on the microprocessor 22 chip itself.

However, the RPB 38 has one major disadvantage which limits its effectiveness for many applications. The RPB 38 can track only a single address pattern because it only has one address tracking mechanism and one read prediction data buffer. Therefore, as soon as the microprocessor 22 performs a context switch, such as executing a subroutine call, a trap, or an interrupt handler, the probability that the prediction is incorrect becomes very high. In fact, the probability of an incorrect prediction is very high even if the software just breaks out of an iterative loop within the same context.

Replacing the RPB 38 with the PRC 52 overcomes this problem by incorporating the ability to track multiple cache data read miss address patterns. Furthermore, each cache data read miss address pattern is associated with the specific instruction that generates the miss address pattern, which further improves the probability of a correct prediction.

The PRC 52 simultaneously tracks a greater number of address patterns than the RPB 38. Only one block was allowed in the RPB 38, which is why it can track only one address pattern. In the PRC 52, the number of blocks is n, where n is an even power of 2 and practically ranges from a minimum of about 256 to a maximum of 65,536 or more.

Referring to FIG. 6, each block still maintains all of the same fields as the single block in the RPB 38, including the most recent miss address (MRMA 93), the previous miss address (PRMA 94), the predicted memory address (PDMA 95), and the predicted data (PDDT 96). In addition, each block of the PRC 52 includes a new field that is not included in a read prediction buffer. The new field stores the most significant bits (MSBs) of the address of the instruction that generated the data read miss address pattern. This new field is referred to as the instruction address tag (IATG 92). The least significant bits (LSBs) of the address of the instruction that generates the data read miss address pattern are used to select a specific block within the PRC 52. The dividing line between which bits of the address are used to select a block and which bits are stored in the IATG depends on the number of blocks in the PRC 52. For a PRC 52 with n blocks, the least significant log2 n bits of the instruction address are used to select a block.
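
The division of the instruction address into a block-select portion and an IATG portion can be sketched as follows. This is only an illustrative model and not part of the disclosed hardware: the function name `split_instruction_address` is hypothetical, the address is treated as a plain integer, and no instruction alignment is assumed.

```python
from typing import Tuple


def split_instruction_address(instr_addr: int, n_blocks: int) -> Tuple[int, int]:
    """Return (block_index, iatg_tag) for a PRC with n_blocks blocks.

    Hypothetical helper: n_blocks is assumed to be a power of two.
    """
    if n_blocks <= 0 or n_blocks & (n_blocks - 1):
        raise ValueError("n_blocks must be a power of two")
    index_bits = n_blocks.bit_length() - 1     # log2(n)
    block_index = instr_addr & (n_blocks - 1)  # least significant log2(n) bits
    iatg_tag = instr_addr >> index_bits        # remaining most significant bits
    return block_index, iatg_tag
```

With 256 blocks, for example, the low 8 bits of the instruction address would select the block and the remaining upper bits would be held in the IATG field.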
The number of bytes that are stored in the PDDT field 96 will usually be an even multiple of the number of bytes in the data word for the microprocessor 22. Typical values range from a minimum of 1 for a small microcontroller to 128 or more for a high-performance microprocessor. The number of bits in the MRMA field 93 and the PRMA field 94 will usually be equal to the number of address bits that the microprocessor 22 uses. The number of bits in the PDMA field 95 will usually be equal to the number of address bits the microprocessor 22 uses less p, where p = log2 q and q is the number of bytes that are stored in the predicted data field at each block of the cache, assuming the microprocessor 22 uses byte addressing. The lower p bits of the predicted address are discarded after the address has been used to prefetch the data from the main memory 24 and store it in the PDDT field 96.

The design of the PRC 52 according to the present invention requires some modifications to typical microprocessor architecture. The PRC 52 must be provided with the address of the instruction that causes a data read miss in the primary cache memory 32 in addition to the normally required address of the read data. If the PRC 52 is implemented on a separate chip from the microprocessor 22, then an extra set of output drivers and output pins will be required to send the instruction address to the PRC 52. However, the complexity of the PRC 52 is such that it can be easily implemented on the chip with the microprocessor 22. If the PRC 52 is designed as an on-chip component, the external interface of the microprocessor 22 will not be affected. Only an extra register (not shown) and dedicated internal bus (not shown) for instruction addresses need to be added.

FIG. 7 shows a block diagram of a reduced instruction set computing (RISC) microprocessor 60 that utilizes on-chip primary instruction and data caches 72 and 76, respectively, and an on-chip predictive read cache 74. The microprocessor 60 shown is assumed to have a decode/dispatch unit 62 and three execution units 64, 66 and 68 operating on a register file 70. A bus interface shown at the top of FIG. 7 provides all of the required off-chip interfaces with the instruction cache 72, the predictive read cache 74 and the data cache 76. A prefetch queue 78 is connected between the decode and dispatch unit 62 and the instruction cache 72.

It can be seen from FIG. 7 that the hardware support required for the PRC 74 can be provided by using an additional instruction address register (IAR) 80 and a dedicated address path connecting the output of the IAR 80 to the PRC 74. A program counter 82 is connected between the IAR 80 and the decode and dispatch unit 62. A memory data register 84 is connected to the data cache 76, and a memory address register 86 is connected to both the data cache 76 and the PRC 74.

As with most cache memories, three different methods may be used for mapping an address into the PRC 74. These three methods are direct mapping, set-associative mapping, and fully-associative mapping. FIGS. 8-10 are block diagrams showing how these three mapping methods may be implemented. With most cache memories, the mapping method chosen is applied to the address of the data. However, with the PRC 74, the mapping method chosen is applied to the address of the instruction that generates the cache data read miss address because that is the address that is used to select a block out of the PRC 74.

A block diagram of a direct-mapped PRC 90 is shown in FIG. 8. The PRC 90 includes five fields 92-96 of n blocks each, as per FIG. 6. Blocks are designated as BLK 0, BLK 1, ..., BLK n. The IATG field 92 is connected to the instruction address bus. The MRMA field 93 is connected to the data address bus to receive the MRMA. The output of the MRMA field 93 is connected to the A input of a subtracter 98 and the A input of an adder 100. The output of the MRMA field 93 is also connected to the input of the PRMA field 94. The output of the PRMA field 94 is connected to the B input of the subtracter 98. The subtracter 98 combines the output of the PRMA field 94 with the output of the MRMA field 93 to produce a displacement, A-B. The output of the MRMA field 93 is also input A of an adder 100. Input B of the adder 100 is the displacement A-B output of the subtracter 98. The output A+B of the adder 100 is the PDMA, which is input to the PDMA field 95. It can also be sent to the main memory 24 on the data address bus. The PDDT field 96 input is connected to the data bus. The output of the PDDT field 96 can also be connected to the data bus.

When the PRC 90 receives a data read miss address from the primary cache 32, it also receives the address of the instruction that generated the memory read. The log2 n least significant bits of the instruction address are used to select a specific block within the PRC 90. A comparator 102 compares the remaining higher-order bits against the value stored in the IATG field 92 at the selected block.

While this is happening, the data address bits, less the lower-order bits required to select a specific byte within a block, are compared by an address comparator 104 against the value stored in the PDMA field 95 at the selected block. If a match occurs in both the IATG and PDMA fields 92 and 95, respectively, the required data stored in the PDDT field 96 is sent to the primary cache 32 via the data bus. If the data address does not match the PDMA field 95, but the instruction address does match the IATG field 92, then the required data is read from main memory.

After the required data has been sent to the primary cache 32 from either the PRC 90 or the main memory 24, a new prediction is made by moving the MRMA at the selected block into the PRMA and the current address into the MRMA at the selected block. The foregoing sequence of steps executes the same prediction algorithm used by the read prediction buffer 38 and illustrated in FIG. 5. When the main memory 24 is not busy with other transactions, the new predicted address is used to perform a main memory read. The resulting data is stored in the PDDT field 96, and the higher-order bits of the predicted address are stored in the PDMA field 95, both at the selected block.

When checking an incoming data read miss address from the primary cache 32, it is possible that a match will be found with the address in the PDMA field 95 but not in the IATG field 92. This situation is possible because different modules in the executing program may access the same data structures. In this situation, the data in the PDDT field 96 can still be forwarded to the primary cache 32. However, a prediction of the next data read miss address does not need to be done because the current data read miss address was generated by a different instruction than the one represented in the selected block of the PRC 90. The only actions that are taken are to load the current data address into the MRMA field 93 and the higher-order bits of the instruction address into the IATG field 92.

It is also possible that both the instruction address does not match against the IATG 92 and the data read miss address does not match against the PDMA 95. In this case, the required data is fetched from main memory 24. Again, no new prediction is done because the data read miss address was generated by a different instruction than the one represented in the selected block of the PRC 90. In this case, the only actions that are taken are to load the current data address into the MRMA 93 and the higher-order bits of the instruction address into the IATG 92.
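
The behavior of the direct-mapped PRC on a data read miss, including the four combinations of IATG and PDMA match, can be sketched in software as follows. This is a minimal behavioral model under stated assumptions, not the disclosed circuit: the names `Block`, `DirectMappedPRC`, and `read_memory` are hypothetical, each block is assumed to hold a single addressable unit (so no lower-order bits are dropped from the PDMA), and the prefetch is performed immediately instead of waiting for a free memory cycle.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Block:
    iatg: Optional[int] = None  # instruction address tag (MSBs of instruction address)
    mrma: Optional[int] = None  # most recent miss address
    prma: Optional[int] = None  # previous miss address
    pdma: Optional[int] = None  # predicted miss address
    pddt: Optional[int] = None  # predicted data fetched from the PDMA


class DirectMappedPRC:
    """Behavioral sketch only; read_memory stands in for main memory 24."""

    def __init__(self, n_blocks: int, read_memory: Callable[[int], int]) -> None:
        if n_blocks <= 0 or n_blocks & (n_blocks - 1):
            raise ValueError("n_blocks must be a power of two")
        self.index_bits = n_blocks.bit_length() - 1
        self.blocks: List[Block] = [Block() for _ in range(n_blocks)]
        self.read_memory = read_memory

    def read_miss(self, instr_addr: int, data_addr: int) -> Tuple[int, str]:
        # The LSBs of the instruction address select the block; the MSBs form the tag.
        blk = self.blocks[instr_addr & (len(self.blocks) - 1)]
        tag = instr_addr >> self.index_bits
        iatg_hit = blk.iatg == tag
        pdma_hit = blk.pdma == data_addr

        # A PDMA match supplies the data even when the IATG does not match.
        if pdma_hit:
            data, source = blk.pddt, "PRC"
        else:
            data, source = self.read_memory(data_addr), "memory"

        if iatg_hit:
            # Same instruction: MRMA moves to PRMA, then predict MRMA + (MRMA - PRMA).
            blk.prma = blk.mrma
            blk.mrma = data_addr
            blk.pdma = blk.mrma + (blk.mrma - blk.prma)
            blk.pddt = self.read_memory(blk.pdma)  # assumption: prefetch done at once
        else:
            # Different instruction: only the MRMA and IATG are loaded, no prediction.
            blk.mrma = data_addr
            blk.iatg = tag
        return data, source
```

In this model, an instruction that misses at addresses 1000, 1008, and 1016 would be served from main memory for the first two misses and from the PRC for the third, because the displacement of 8 is known after the second miss.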
Although the PRC 90 is designed to improve the average memory access time during data read operations, it cannot ignore write operations. If write operations are ignored, data in the PRC 90 could become stale. The PRC 90 can use any of the write policies normally used for cache memories: write through, write invalidate (write around), and write back. However, write back is not recommended because it could be a very long time between when the PRC 90 is written and when a block is flushed out of the cache 32 and main memory 24 is updated. This is especially true for the set-associative and fully-associative mapped PRC designs described subsequently. Therefore, to maintain consistency between the PRC 90 and main memory 24, either write-through or write-invalidate policies are preferred, especially in multiple CPU systems.

With the direct-mapped PRC 90 described with reference to FIG. 8, it is possible for two or more frequently-executed instructions with different addresses to have the same least significant address bits. When this occurs, the multiple address patterns tracked by the PRC 90 will get mapped to the same block. This is undesirable because only one tracked address pattern can actually reside in a cache block at one time. When this situation occurs, one tracked address pattern will be immediately replaced by another tracked address pattern which will immediately be replaced by another tracked address pattern, possibly the first tracked address pattern. This is known as thrashing. A way to prevent thrashing, at the expense of increased hardware and design complexity, is to use fully-associative mapping.

With fully-associative mapping, the instruction address bus is not divided into two parts, as shown in FIG. 8, for selecting a block out of the cache and for comparing against the IATG. Instead, all bits of the instruction address are simultaneously compared against the IATG fields in all of the blocks in the cache.

FIG. 9 illustrates a fully-associative mapped PRC 108. Note that the fully-associative mapped PRC 108 includes a separate address comparator 110a, 110b, ..., 110n in the IATG field 92 for each block BLK 0, BLK 1, ..., BLK n, respectively. The fully-associative mapped PRC 108 shown in FIG. 9 also includes n comparators labeled 112a, 112b, ..., 112n for the PDMA field 95 in every block of the PRC 108. The outputs of the comparators 110a, 110b, ..., 110n are input to a binary encoder 114, which produces the IATG hit output. Similarly, the outputs of the comparators 112a, 112b, ..., 112n are input to a binary encoder 116, which produces the PDMA hit output.

The comparators 110a, 110b, ..., 110n and 112a, 112b, ..., 112n for the IATG and PDMA fields respectively, along with the method of handling the instruction address, allow a tracked address pattern to be stored in any block of the PRC 108 and still be located rapidly when a data read miss occurs in the primary cache 32.

Still referring to FIG. 9, when a data read miss occurs in the primary cache 32, both the data address and the address of the instruction that generated the data read miss are simultaneously compared against the PDMA and IATG fields in all blocks of the PRC 108. If both IATG field and PDMA field hits occur, the correctly predicted data is sent to the primary cache 32, and a new prediction is done. The new predicted address and the new predicted data, once read from main memory, are stored at the same block in the cache 32 where the previous correct prediction was found. If there is a hit in the PDMA field 95 but a miss in the IATG field 92, then the correct data has been located in the PRC 108, but the data is associated with another address pattern being tracked by the PRC 108 for another instruction. In this case, the correct data is sent to the primary cache 32. Then, a new block in the cache 32 is used to start tracking the new address pattern.

If a miss occurs with the PDMA 95, it means that no predicted data is available in the PDDT field 96 in any block of the PRC 108. In this case, a read from main memory 24 must be performed and the data obtained sent to the primary cache 32. If the PDMA field miss is accompanied by an IATG field hit, then a new prediction is attempted. The new predicted address and data are stored in the block where the IATG field hit occurred. However, if the PDMA field miss is accompanied by an IATG field miss, then a new block in the cache 32 must be obtained and a new address pattern tracked.

When a new block is required to track a new address 
pattern, any previously unused block can be used because of 
the additional comparators available and the ability of a 
fully-associative PRC 108 to simultaneously search all 
blocks in both the IATG field 92 and the PDMA field 95. 
However, there will be times when a new address pattern 
needs to be tracked, but there are no unused blocks available. 
This same situation can also occur in conventional, fully-
associative mapped, demand-driven caches that only use the 
data address for finding the correct block in the cache. As 
with a conventional cache, any of the normally-used block 
replacement algorithms can be used to select a victim block, 
including random, least recently used (LRU), first in first out 
(FIFO), working set, etc. With respect to write operations, 
the fully-associative mapped PRC 108 is no different than 
the direct-mapped PRC 90. The write-through, write-
invalidate (write-around), and write-back policies can all be 
used, although the write-back policy is not recommended. 
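As one illustration of the victim block selection mentioned above, the following Python sketch models least recently used (LRU) replacement. LRU is picked here only as an example from the listed algorithms; the class name and its methods are hypothetical and do not describe any particular hardware implementation.

```python
from collections import OrderedDict


class LruTracker:
    """Tracks block usage order; the least recently used block is the victim."""

    def __init__(self, num_blocks: int) -> None:
        # Blocks start in index order, so block 0 is the first victim.
        self._order = OrderedDict((index, None) for index in range(num_blocks))

    def touch(self, block_index: int) -> None:
        # Mark a block as most recently used, e.g. after a correct prediction.
        self._order.move_to_end(block_index)

    def victim(self) -> int:
        # The block at the front of the order has gone longest without use.
        return next(iter(self._order))


tracker = LruTracker(4)
tracker.touch(0)
tracker.touch(2)
print(tracker.victim())  # block 1 has gone longest without being touched
```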
An advantage of the fully-associative PRC 108 design is 
that any tracked address pattern can be stored in any block 
of the cache. This eliminates most of the thrashing that can 
occasionally occur with the direct-mapped PRC 90. 
However, the fully-associative PRC 108 design has high 
hardware costs, relative to a direct-mapped PRC 90, because 
comparators are required for both the IATG field 92 and the 
PDMA field 95 at every block of the PRC 108. 
The direct-mapped and the fully-associative mapped 
designs can be combined to obtain performance nearly as 
great as the performance of the fully-associative mapped 
PRC 108 at a hardware cost that is only slightly higher than 
that of the direct-mapped PRC 90. The combined design is 
referred to as a set-associative mapped PRC. A block 
diagram of a 4-way, set-associative mapped PRC 120 is shown 
in FIG. 10. 
Referring to FIG. 10, all blocks in the PRC 120 are 
grouped into sets, which are identified as SET 0, SET 1, SET 
2, ... , SET n. The number of blocks in each set is a 
power of two, such as 2, 4, or 8. In the exemplary 
embodiment of FIG. 10, the set size is 4. 
A comparator array 122 is connected between the IATG 
92 and an encoder 124. The output of the encoder 124 
indicates an IATG hit. Similarly, a comparator array 126 is 
connected between the PDMA 95 and an encoder 128. 
A multiplexer 130 is connected to the output of the 
MRMA 93. The output of the multiplexer 130 is input A of 
a subtracter 132 and an adder 134. The output of the PRMA 
94 is input to a multiplexer 136, which provides an input B 
to the subtracter 132. The output displacement A-B of the 
subtracter 132 is input B to the adder 134 which provides the 
predicted address to the data address bus and to the PDMA 
95. The comparator array 126 is connected to the data 
address bus to receive data addresses for comparison with 
addresses output from the PDMA 95. If the comparator array 
126 detects a match, then the encoder 128 outputs a signal 
indicating a PDMA hit. 
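The subtracter and adder arrangement described above amounts to adding the most recent displacement to the most recent miss address. A minimal Python sketch follows, assuming unsigned addresses that wrap at a fixed width; the 32-bit width and the function name are illustrative assumptions and are not specified in the description.

```python
ADDRESS_BITS = 32  # assumed address width; the description does not fix one
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def predict_next_miss_address(mrma: int, prma: int) -> int:
    """Return MRMA + (MRMA - PRMA), wrapped to the assumed address width."""
    displacement = mrma - prma                   # subtracter 132: A - B
    return (mrma + displacement) & ADDRESS_MASK  # adder 134: A + displacement


# Misses at 0x1000 and then 0x1040 give a displacement of 0x40,
# so the next miss is predicted at 0x1080.
print(hex(predict_next_miss_address(0x1040, 0x1000)))
```

A negative displacement works the same way, which is how a pattern that walks downward through memory would be followed in this model.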
A multiplexer 138 is connected to the outputs of the 
PDDT 96. The multiplexer 138 provides the predicted data 
to the data bus. 
The log2(s) least significant bits of the address of the 
instruction that generated the cache data read miss are used 
to select one of the sets in the PRC 120, where s is the total 
number of sets in the PRC. Therefore, once a set has been 
selected, the desired address pattern can be tracked only by 
one of the blocks in the selected set. This limits the number 
of parallel comparisons that need to be executed in the IATG 
field 92 and the PDMA field 95 to the number of blocks in 
a set, or 4 for the embodiment shown in FIG. 10. For the 
IATG field 92, the most significant bits of the instruction 
address are compared against the IATG fields of all blocks 
in the selected set. For the PDMA field 95, assuming byte 
addressing, all data address bits, less the least significant bits 
that are used to select a byte within a block, are compared 
against the predicted address in the PDMA field 95. The 
comparison is done in parallel with all blocks in the selected 
set. 
If a hit occurs in both the IATG field 92 and the PDMA 
field 95, then the block with the hit is identified, and the 
correctly predicted data is forwarded to the primary cache 
32. A new address prediction is then performed and stored 
in the selected block. The data is fetched when the main 
memory 24 is not busy and is also stored in the PDDT field 
96 at the selected block. If a hit occurs in the IATG field 92 
but not in the PDMA field 95, then the address pattern is 
being tracked by the block that produced the IATG hit, but 
the predicted address was incorrect. Therefore, a read from 
main memory 24 must be performed. Once the read has been 
completed, a new predicted address can be calculated and 
stored in the selected block. When the main memory 24 is 
not busy, the data at the predicted address can be read from 
memory and stored in the PDDT field 96 at the selected 
block. 
If a miss occurs in all IATG fields in the selected set but 
a hit occurs in one of the PDMA fields, then the required data 
has been located in the PRC 120 and can be forwarded to the 
primary cache 32. However, the miss in the IATG field 
indicates that the selected block is not actually tracking the 
address pattern generated by the current instruction being 
processed. Therefore, an unused block within the selected 
set must be used to track the new address pattern. If a miss 
occurs in both the IATG field 92 and the PDMA field 95 in 
all blocks of the selected set, then the required data must be 
read from the main memory 24. Once the required data has 
been obtained from the main memory 24 and forwarded to 
the primary cache 32, an unused block within the selected 
set must be used to track the new address pattern. 
It is possible for all blocks within a selected set to be in use 
tracking other address patterns at a point in time when a new 
address pattern is identified and needs to be tracked. In this 
case, one of the older address patterns must be deleted from 
one of the blocks within the selected set. The block to be 
removed can be selected with any of the victim block 
selection algorithms commonly used with standard, 
demand-driven, set-associative, cache memories that are 
addressed using only the data address. Algorithms that will 
work include random, least recently used, first in first out, 
working set, etc. 
The present invention has several significant advantages 
over the prior art. One such advantage is reduced average 
access time to memory. Research has been conducted to 
quantify the improvement in performance that can be 
attained by using a predictive read cache according to the 
present invention in a memory hierarchy. The study was 
conducted using a highly accurate, address-trace driven, 
simulation program that utilizes an analytic model and 
actual address traces captured from executing benchmark 
programs. 
Two benchmark programs that are indicative of the 
performance improvement that can be attained from using a 
PRC according to the present invention are the Kenbus20 
and Kenbus80 benchmarks. These programs are part of a 
standardized set of benchmark programs known as the 
SPECmark suite and represent a typical work load for a 
computer in a multi-user environment with 20 users for the 
Kenbus20 benchmark and 80 users for the Kenbus80 
benchmark. Using these benchmarks, the baseline performance of 
a RISC-type CPU with a primary cache 32 memory but no 
second-level cache or predictive read cache is given in Table 
1, which is appended to this description of the invention. 
A fully-associative mapped predictive read cache was 
modeled in the simulator with an analytic model. 
Simulations were then performed using the address traces obtained 
from executing Kenbus20 and Kenbus80 benchmark 
programs. The fully-associative mapped design produced the 
best performance improvement, as can be seen in Table 2. In 
Table 2, the average read access time, the speedup 
percentage, and the PRC read hit rate are listed for PRC sizes 
of 256 bytes to 512 Kbytes. It should be noted that 256 bytes 
is an extremely small size compared to the size of a typical 
second-level cache and represents a tremendous hardware 
savings. Yet, the 256 byte PRC yielded an 18.82% speedup 
in performance for the Kenbus80 benchmark and a 12.58% 
speedup for the Kenbus20 benchmark. A 512 Kbyte 
fully-associative cache is extremely large and represents a very 
large hardware investment. This much larger PRC yielded 
performance improvements of 20.19% on the Kenbus80 
benchmark and 14.32% on the Kenbus20 benchmark. 
The design of a 4-way, set-associative PRC was also 
modeled using an analytic model in the simulation study. Its 
performance was also studied using actual address traces 
from various different executing benchmark programs. For 
the Kenbus20 and Kenbus80 benchmarks, the results of the 
simulation study are given in Table 3. As can be seen from 
Table 3, the performance improvement attained by using a 
set-associative PRC is not as great as the performance 
improvement attained using a fully-associative PRC. 
However, the hardware costs of a 4-way, set-associative 
mapped PRC are less than for a fully-associative mapped 
PRC because of the reduced number of required 
comparators. Also, the victim block selection algorithm needs only to 
select between the various different blocks in the selected 
set, rather than between all blocks in the cache. Referring to 
Table 3, a 256-byte PRC yields a speedup of 10.39% for the 
Kenbus80 benchmark and a speedup of 8.10% for the 
Kenbus20 benchmark. For a 512 Kbyte, 4-way, 
set-associative PRC, the speedup is 18.78% for the Kenbus80 
benchmark and 12.77% for the Kenbus20 benchmark. It 
should be noted that the different size set-associative PRCs 
listed in Table 3 all have reasonable hardware costs, relative 
to both fully-associative PRCs and second-level caches. 
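The reduced comparator count of the set-associative design follows from the way a set is chosen before any comparison takes place. The Python sketch below models that selection and the comparison within one 4-block set. It is a simplified illustration: the names, the valid bit, the 16-set and 16-byte-block example sizes, and the use of the raw instruction address bits are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

WAYS = 4  # blocks per set, matching the 4-way example discussed above


@dataclass
class Way:
    valid: bool = False  # assumed valid bit
    iatg: int = 0        # most significant instruction address bits (tag)
    pdma: int = 0        # predicted data address without the byte-in-block bits


def split_instruction_address(address: int, num_sets: int) -> Tuple[int, int]:
    """Split into (set index, tag); num_sets must be a power of two."""
    index_bits = num_sets.bit_length() - 1  # log2(num_sets)
    return address & (num_sets - 1), address >> index_bits


def set_associative_lookup(
    sets: List[List[Way]], instruction_address: int,
    data_address: int, bytes_per_block: int,
) -> Tuple[int, Optional[int], Optional[int]]:
    """Return (set index, IATG hit way, PDMA hit way); None means no hit."""
    set_index, tag = split_instruction_address(instruction_address, len(sets))
    # Drop the bits that select a byte within a block before comparing.
    wanted = data_address >> (bytes_per_block.bit_length() - 1)
    iatg_way: Optional[int] = None
    pdma_way: Optional[int] = None
    # Only the WAYS blocks of the selected set are compared.
    for way, block in enumerate(sets[set_index]):
        if not block.valid:
            continue
        if iatg_way is None and block.iatg == tag:
            iatg_way = way
        if pdma_way is None and block.pdma == wanted:
            pdma_way = way
    return set_index, iatg_way, pdma_way


sets = [[Way() for _ in range(WAYS)] for _ in range(16)]
sets[7][2] = Way(valid=True, iatg=0x4A, pdma=0x2000 >> 4)
print(set_associative_lookup(sets, 0x4A7, 0x2008, 16))  # set 7, way 2 hits in both fields
```

In this model only four tag comparisons and four predicted-address comparisons are made per lookup, regardless of how many sets the cache holds.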
As mentioned previously, the best prior art method for 
reducing cache miss latency is to utilize a second-level 
cache. For comparison purposes, the performance 
improvement that can be attained by using a second-level cache 
together with a RISC-type CPU and a primary cache 32 was 
also studied by a simulation study. The second-level cache 
utilized an analytical model and the same address traces 
from the same benchmark programs as were used for 
simulating the predictive read cache designs. The results of 
the simulations that use the address traces from the 
Kenbus20 and Kenbus80 benchmarks are recorded in Table 4. 
Referring to Table 4, a 64 Kbyte, second-level cache 
provides a 3.88% speedup for the Kenbus80 benchmark and 
a 0.50% speedup for the Kenbus20 benchmark. This is 
significantly less than what is provided by even the smallest 
predictive read cache. The 256 byte, fully-associative PRC 
provided an 18.82% speedup for the Kenbus80 test case and 
a 12.58% speedup for the Kenbus20 test case. Even the 
4-way, set-associative, 256-byte PRC provided significantly 
better speedup than the second-level cache. The 
set-associative design provided a speedup of 10.39% for the 
Kenbus80 benchmark and 8.10% for the Kenbus20 
benchmark. It should be noted that the hardware cost of a 
256-byte PRC is significantly less than that of a 64 Kbyte 
second-level cache, even if the PRC is fully-associative 
mapped. 
The characteristics of the PRC are such that as the number 
of bytes in the PRC increases, the speedup provided by the 
PRC rapidly increases up to a point and then further 
increases are minimal. The characteristics of a second-level 
cache are such that as the number of bytes increases, the 
speedup provided slowly but continuously increases. 
Eventually, the performance of the second-level cache 
exceeds that of the PRC. However, the performance of a 
second-level cache does not exceed that of a 
fully-associative PRC until the size of the caches is 512 Kbytes for 
the Kenbus20 benchmark and 256 Kbytes for the Kenbus80 
benchmark. The performance of a second-level cache does 
not exceed that of a 4-way, set-associative PRC until the size 
of the caches is 256 Kbytes for both the Kenbus20 and the 
Kenbus80 benchmarks. For embedded microprocessor 
systems performing high-speed control and communications 
functions in space-based, weapon-based, and portable, 
mobile, and aeronautical computing applications, the 
physical size, weight, power consumption, and generated heat of 
a 256 Kbyte to 512 Kbyte, second-level, cache memory can 
be prohibitive. 
The present invention has the added advantage of 
providing decreased hardware costs. In addition to studying the 
performance of various different PRC designs, the hardware 
costs of various different PRC designs have been studied. 
The cost of computing hardware, including component 
costs, assembly costs, design and test costs, etc., is directly 
proportional to the number of transistors required to 
implement the required logic functions. This is especially true for 
VLSI components. Table 5 summarizes the results of this 
study for 256 byte through 64 Kbyte PRCs. Transistor 
counts are given for direct-mapped, 4-way set-associative 
mapped, and fully-associative mapped PRCs. 
The hardware costs, in number of transistors, for a typical 
second-level cache are approximately one-third of the 
hardware costs of a direct-mapped PRC for caches with the same 
number of blocks and bytes per block. Upon initial 
inspection, this would tend to indicate that a PRC does not 
have a hardware cost advantage over a standard, 
second-level cache. However, it is not reasonable to directly 
compare second-level caches against PRCs of the same size 
except for very large caches. The appropriate comparison to 
make is to compare a given PRC design against the 
second-level cache design that yields the same performance 
improvement. If this is done, it will be seen that for practical 
cache sizes, the hardware costs of a PRC are usually 
significantly lower than the hardware costs of the 
second-level cache that provides the same performance 
improvement. For example, referring to Tables 2, 4, and 5, it can be 
seen that a fully-associative mapped PRC with only 256 
bytes provides better speedup than a 128 Kbyte second-level 
cache. A second-level cache would have to have 256 Kbytes 
to have better performance than the 256-byte, 
fully-associative mapped PRC, and such a cache would require 
approximately 270 times the number of transistors. 
Referring to Tables 3, 4, and 5, a 4-way, set-associative 
mapped PRC that is 1 Kbyte in size provides better 
performance than a 128 Kbyte second-level cache. The 
second-level cache would need to have 256 Kbytes in order to 
provide better performance than the PRC. This would 
require approximately 77 times the number of transistors in 
a 1 Kbyte 4-way, set-associative PRC. 
The present invention also allows decreased power 
consumption in comparison to second-level cache memories. In 
a space-based, weapon-based, portable, mobile, or 
aeronautical computing system, minimizing power consumption is 
often a critical issue for two reasons. First, for many such 
systems, the only available power to operate the computer 
comes from batteries, solar cells, or other means that are not 
capable of producing large amounts of power. Second, the 
integrated circuits used to construct computers convert most 
of the consumed electrical energy into heat energy which 
must then be dissipated from the system. Although this is not 
a difficult engineering problem in a desktop computer, it can 
be an extremely limiting factor in certain applications such 
as space-based computers where convection cooling is not 
possible and all cooling must be accomplished by radiation. 
The power consumed by a digital integrated circuit is 
dependent on the frequency of operation, the power supply 
voltage, the type of logic circuit, and the total parasitic 
capacitance of the chip. When comparing the power 
consumption of a PRC against the power consumption of a 
second-level cache, it is reasonable to assume that both will 
be implemented with the same fabrication and logic circuit 
technology. Therefore, it is reasonable to assume that the 
power supply voltage of a PRC would be the same as that of 
a second-level cache. It has been shown that both a PRC and 
a second-level cache will improve the speed of operation of 
a computer. However, this speed increase is not the result of 
an increase in the clock rate, or operating frequency. As 
indicated in Tables 2, 3, and 4, both the PRC and the 
second-level cache improve performance by reducing the 
number of clocks required to fetch data. Therefore, when 
comparing the power consumption of a PRC against the 
power consumption of a second-level cache, it is reasonable 
to assume that the frequency of operation will be the same 
for both. 
It can be shown that the total parasitic capacitance of an 
integrated circuit is approximately linearly proportional to 
the number of transistors used to implement the chip. 
Therefore, if a PRC and a second-level cache are 
implemented with the same fabrication and logic circuit 
technology and operate at the same frequency, then the design that 
uses the fewest transistors will consume the least power with 
the ratio of the power consumptions being approximately 
proportional to the ratio of the number of transistors. It has 
been mentioned previously that the PRC uses significantly 
fewer transistors than do second-level caches of equivalent 
performance. For second-level caches and PRCs of 
approximately equivalent performance, transistor ratios of 77/1 to 
270/1 are possible. Thus, the power consumption of a PRC 
can be as low as 1/77 to 1/270 that of a second-level cache 
memory of equivalent performance. 
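The speedup and transistor figures quoted above can be cross-checked with simple arithmetic. The Python sketch below is a hedged illustration: the speedup definition is inferred from the tabulated numbers and is not stated in the description, the pairing of the 1.513 and 1.721 clock baselines in Table 1 with Kenbus20 and Kenbus80 is likewise an assumption, and the function names are hypothetical. The transistor counts 26,616 (256-byte fully-associative PRC) and 92,567 (1 Kbyte 4-way PRC) are taken from Table 5.

```python
def speedup_percent(baseline_clocks: float, improved_clocks: float) -> float:
    """Percentage reduction in average read access time (assumed definition)."""
    return 100.0 * (baseline_clocks - improved_clocks) / baseline_clocks


def implied_transistors(prc_transistors: int, ratio: int) -> int:
    """Transistor count implied for a cache that needs `ratio` times as many."""
    return prc_transistors * ratio


# 256-byte fully-associative PRC: read access times from the first row of Table 2.
print(round(speedup_percent(1.513, 1.323), 2))  # compare with 12.58% in Table 2
print(round(speedup_percent(1.721, 1.397), 2))  # compare with 18.82% in Table 2

# Second-level cache sizes implied by the quoted 270/1 and 77/1 ratios.
print(implied_transistors(26_616, 270))  # 7186320
print(implied_transistors(92_567, 77))   # 7127659
```

Under these assumptions the computed speedups land within a few hundredths of a percentage point of the tabulated values, and both ratios imply a second-level cache of roughly 7 million transistors, which by the proportionality argument above is also the approximate ratio of power consumption.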
The present invention provides an increased level of 
integration. The level of integration of a digital system refers 
to the number of different logic functions that can be placed 
on a single chip. The more functions on a given chip, the 
higher the integration level, the higher the performance, the 
higher the reliability, the lower the power consumption, and 
the lower the manufacturing costs. It has been shown that the 
number of transistors required to implement a PRC is 1/77 
to 1/270 the number of transistors required to implement a 
second-level cache of approximately equal performance. 
Thus, what would have required a VLSI controller chip and 
an array of high-speed static random access memory ICs 
can be accomplished with a single VLSI integrated circuit, 
the PRC. However, based on the transistor counts required 
to actually implement a PRC, as shown in Table 5, and 
taking into consideration current VLSI fabrication 
technology which is capable of producing ICs with over 10 million 
transistors with high yield, it is now feasible to implement 
an entire PRC as an integral part of the microprocessor chip. 
Thus, the use of a PRC would completely eliminate the need 
for any memory-related ICs outside the microprocessor 
chip, except for the main memory which is usually 
implemented with low-power, low-speed, DRAM ICs. 
The advantages of the present invention are achieved by 
providing several new features. These new features include 
a predictive read cache memory that tracks multiple data 
read miss address patterns from the primary cache memory 
and the use of a displacement-based algorithm for tracking 
multiple data read miss address patterns from the primary 
cache memory. Another new feature is the association of the 
multiple data read miss address patterns from the primary 
cache with the specific instructions that generate the 
patterns. Still another advantage of the predictive read cache 
memory according to the present invention is identification 
of instructions that generate the data read miss address 
patterns from the primary cache by using the addresses of 
the instructions that generate the patterns. The use of the 
least significant bits of the instruction address that generates 
a data read miss in the primary cache to select a block in the 
predictive read cache memory and the most significant bits 
to compare against a tag stored in the block is also a new 
feature. 
One design alternative that is possible for the PRC is to 
reverse the roles of the instruction address and the data 
address. For example, referring to FIG. 8, the least 
significant log2(n) bits of the data read miss address from the 
primary cache 32 could be used to select a block in the PRC 
90. The higher-order bits of the data address would then 
become a tag and would be used in a manner similar to the 
way the instruction address tag is used in FIG. 8. If this 
design were used, the address of the instruction generating 
the data read miss address pattern would need to be used in 
a manner similar to that of the data address in FIG. 8. The 
entire instruction address would have to be stored in a field 
in each block. When checking to see if an incoming primary 
cache 32 data read miss address had been correctly 
predicted, the incoming instruction address would need to be 
compared against the value stored in the instruction address 
field in the selected block. This alternative method of 
selecting a block in the cache is compatible with all three 
possible methods for implementing address mapping as 
described previously. 
Other memory subsystem architectures are possible using 
the predictive read cache. For example, referring to FIG. 11, 
the primary cache 32 memory could be completely elimi-
nated and the predictive read cache 52 could be connected 
between the microprocessor 22 and the main memory 24. 
In another architecture, the predictive read cache 52 could 
be logically situated between the main memory 24 and the 
second-level cache 34 as shown in FIG. 12. Essentially, a 
predictive read cache according to the present invention can 
be placed anywhere in the memory hierarchy, although 
research has indicated that it provides the best performance 
improvement if used along with a primary cache as a 
replacement for a second-level cache. 
APPENDIX 
TABLE 1 
Baseline Performance of RISC CPU With Primary Cache Only 
Average Read                  Average Write 
Access Time     Cache Read    Access Time 
1.513           89.94%        1.00 
1.721           86.44%        1.00 
TABLE 2 
Performance of RISC CPU With Primary 
Cache and Fully-Associative Mapped PRC 
Kenbus 20                               Kenbus 80 
Ave. Read                  PRC Read     Ave. Read 
Access Time                Hit Rate     Access Time 
(clocks)       Speedup                  (clocks)       Speedup 
1.323          12.58%      37.49%       1.397          18.82% 
1.317          12.94%      38.40%       1.393          19.04% 
1.314          13.19%      39.15%       1.391          19.18% 
1.312          13.31%      39.57%       1.390          19.25% 
1.311          13.37%      39.94%       1.389          19.28% 
1.309          13.47%      40.40%       1.387          19.39% 
1.306          13.65%      41.11%       1.383          19.64% 
1.302          13.95%      42.30%       1.375          20.10% 
1.297          14.27%      43.54%       1.374          20.19% 
1.296          14.32%      43.70%       1.374          20.19% 
1.296          14.32%      43.70%       1.374          20.19% 
1.296          14.32%      43.70%       1.374          20.19% 
TABLE 3 
Performance of RISC CPU With Primary 
Cache and 4-Way, Set-Associative Mapped PRC 
Kenbus 20                               Kenbus 80 
Ave. Read                  PRC Read     Ave. Read 
Access Time                Hit Rate     Access Time 
(clocks)       Speedup                  (clocks)       Speedup 
1.390          8.10%       26.84%       1.542          10.39% 
1.390          8.15%       27.10%       1.542          10.42% 
1.324          12.47%      36.92%       1.399          18.71% 
1.320          12.73%      37.75%       1.398          18.79% 
1.320          12.76%      37.73%       1.398          18.76% 
1.320          12.75%      37.81%       1.398          18.74% 
1.320          12.76%      37.89%       1.398          18.75% 
TABLE 3-continued 
Performance of RISC CPU With Primary 
Cache and 4-Way, Set-Associative Mapped PRC 
           Kenbus 20                               Kenbus 80 
PRC        Ave. Read                  PRC Read     Ave. Read 
Size       Access Time                Hit Rate     Access Time 
(bytes)    (clocks)       Speedup                  (clocks)       Speedup 
64         1.320          12.76%      37.98%       1.398          18.78% 
128        1.320          12.76%      37.99%       1.398          18.78% 
256        1.320          12.76%      37.99%       1.398          18.78% 
TABLE 4 
Performance of RISC CPU With Primary and 
Second-Level Cache Memories 
           Kenbus 20                     Kenbus 80 
Cache      Average Read                  Average Read 
Size       Access Time                   Access Time 
(Kbytes)   (clocks)        Speedup       (clocks)        Speedup 
64         1.505           0.50%         1.654           3.88% 
128        1.414           6.54%         1.485           13.70% 
256        1.308           13.54%        1.319           23.39% 
512        1.210           20.02%        1.221           29.08% 
TABLE 5 
Transistor Counts for Direct-Mapped, 4-Way Set Associative Mapped and 
Fully-Associative Mapped PRCs 
PRC Size   Direct-Mapped      4-Way Set-Associative   Full-Associative 
(bytes)    Transistor Count   Transistor Count        Transistor Count 
256        23,276             29,813                  26,616 
512        44,161             50,706                  51,000 
1,024      86,006             92,567                  100,024 
2,048      169,835            176,428                 198,584 
4,096      337,760            344,417                 396,728 
8,192      674,133            680,918                 795,064 
16,384     1,347,914          1,354,955               1,595,832 
32,768     2,697,535          2,705,088               3,205,560 
65,536     5,400,884          5,409,461               6,441,400 
1. A method for reducing primary cache miss latency in a 
microprocessor system that includes a microprocessor, a 
main memory and a primary cache connected between the 
main memory, and microprocessor comprising the steps of: 
connecting a predictive read cache memory between the 
main memory and the primary cache; 
tracking multiple data read miss address patterns from the 
primary cache memory; and 
associating the multiple data read miss address patterns 
from the primary cache with specific instructions from 
the microprocessor that generate the multiple data read 
miss address patterns. 
2. The method of claim 1 wherein the step of tracking 
multiple data read miss address patterns from the primary 
cache memory further comprises the step of applying a 
displacement-based algorithm to the multiple data read miss 
address patterns. 
3. The method of claim 1, further comprising the step of 
identifying the instructions that generate the data read miss 
address patterns from the primary cache memory by using 
the addresses of the instructions that generate the patterns. 
4. The method of claim 3 further comprising the steps of: 
using the least significant bits of the instruction address 
that generates a data read miss in the primary cache 
memory to select a block in the predictive read cache 
memory; and 
comparing the most significant bits of the instruction 
address that generates a data read miss in the primary 
cache memory with an address tag stored in the block. 
5. A method for forming a predictive read cache for 
reducing primary cache miss latency in a microprocessor 
system that includes a microprocessor, a main memory and 
a primary cache memory connected between the main 
memory and the microprocessor via an instruction address 
bus, a data address bus and a data bus, the primary cache 
being arranged to output both a data read miss address when 
it receives an instruction to read data that is not in the 
primary cache memory and the address of the instruction 
that generated the memory read, comprising the steps of: 
connecting a first plurality of memory blocks arranged to 
form an instruction address tag field (IATG) to the 
instruction address bus; 
connecting a first comparator means to the IATG field and 
to the instruction address bus to receive the most 
significant bits of an instruction address that generated 
a memory read which resulted in a data cache read 
miss, 
connecting a second plurality of memory blocks arranged 
to form a most recent miss memory address (MRMA) 
field to the data address bus; 
arranging a third plurality of memory blocks to receive 
the MRMA field and form a previous miss address field 
(PRMA) as new memory addresses are received in the 
MRMA field from the data address bus; 
processing the MRMA field and the PRMA field with a 
displacement algorithm to provide a predicted address 
to the data address bus and to the PDMA field; 
arranging a fourth plurality of memory blocks to receive 
the predicted address and form a predicted memory 
address (PDMA) field; 
arranging a second comparator means to compare data 
address bits from the data address bus with data in the 
PDMA field and; 
arranging a fifth plurality of memory blocks to form a 
predicted data (PDDT) field that is sent to the primary 
cache memory via the data bus if the first and second 
comparators produce outputs indicating the occurrence 
of matches in both the IATG and PDMA fields, 
respectively. 
6. The method of claim 5, further comprising the steps of: 
connecting a first set of comparators to the IATG field so 
that each memory block in the IATG field has a 
corresponding comparator; 
connecting a first binary encoder to the first set of 
comparators; 
connecting a second set of comparators to the PDMA field 
so that each memory block in the PDMA field has a 
corresponding comparator; 
connecting a second binary encoder to the second set of 
comparators. 
7. The method of claim 5, further comprising the steps of: 
arranging the memory blocks in each of the IATG, 
MRMA, PRMA, PDMA and PDDT fields in a plurality 
of sets with a selected number of blocks per set; 
connecting a first comparator array to the IATG field; 
connecting a first encoder to the first comparator array; 
connecting a second comparator array to the PDMA field; 
connecting a second encoder to the second comparator 
array; 
connecting a first multiplexer to receive the output of the 
MRMA field; 
connecting a second multiplexer to receive the output of 
the PRMA field and; 
connecting a third multiplexer to receive the output of the 
PDDT field. 
8. A predictive read cache for reducing primary cache 
miss latency in a microprocessor system that includes a 
microprocessor, a main memory and a primary cache 
memory connected between the main memory and the 
microprocessor via an instruction address bus, a data address 
bus and a data bus, the primary cache being arranged to 
output both a data read miss address when it receives an 
instruction to read data that is not in the primary cache 
memory and the address of the instruction that generated the 
memory read, comprising: 
a first plurality of memory blocks connected to the 
instruction address bus and arranged to form an 
instruction address tag field (IATG); 
a first comparator having a first input connected to receive 
an output from the IATG field and a second input 
connected to the instruction address bus to receive the 
most significant bits of an instruction address that 
generated a memory read which resulted in a data cache 
read miss; 
a second plurality of memory blocks connected to the data 
address bus and arranged to form a most recent miss 
memory address (MRMA) field; 
a third plurality of memory blocks arranged to receive the 
MRMA field and form a previous miss address field as 
new memory addresses are received in the MRMA field 
from the data address bus; 
means for processing the MRMA field and the PRMA 
field to provide a predicted address to the data address 
bus; 
a fourth plurality of memory blocks arranged to receive 
the predicted address and form a predicted memory 
address (PDMA) field; 
a second comparator connected to the PDMA field and to 
the data address bus arranged to compare data address 
bits with data in the PDMA field; and 
a fifth plurality of memory blocks connected to the data 
bus and arranged to form a predicted data (PDDT) field 
that is sent to the primary cache memory via the data 
bus if the first and second comparators produce outputs 
indicating the occurrence of matches in both the IATG 
and PDMA fields, respectively. 
9. The predictive read cache of claim 8 further comprising 
means for selecting a specific block of the IATG field with 
the log2n least significant bits of the instruction address; and 
means for comparing the instruction address less its log2n 
least significant bits with data stored in the selected 
block of the IATG field. 
* * * * * 
