



THE DESIGN OF A
PREDICTIVE READ CACHE
by
Joseph R. Robert, Jr.
March, 1996
Thesis Advisor: Douglas J. Fouts
Approved for public release; distribution is unlimited.
REPORT DATE: March 1996
REPORT TYPE AND DATES COVERED: Master's Thesis
TITLE AND SUBTITLE: THE DESIGN OF A PREDICTIVE READ CACHE
AUTHOR(S): Robert, Joseph Roy, Jr.
SUPPLEMENTARY NOTES: The views expressed in this thesis are those of the author and do not reflect the
official policy or position of the Department of Defense or the U.S. Government.
DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release; distribution is unlimited.
ABSTRACT
The objective of this research has been the creation of a hardware design for a Predictive Read Cache (PRC).
The PRC is a developmental cache intended to replace second-level caches common in modern microprocessor
systems. The PRC has the potential of being faster and cheaper than current second-level caches and is distinctive
in its ability to predict data addresses to be referenced by a central processing unit.
Previous research has analyzed the behavior that the PRC must exhibit. During the described research, the
behavior was modeled in the Verilog hardware description language. Verilog-XL was used for simulation, which
uses the Verilog behavioral model as input. The behavioral model suggests that the internal structure of the PRC
could be divided into six modules, each performing part of the function of the whole PRC. Each of these blocks
was studied for hardware equivalents, easing the development of the total structural model.
Using Verilog structural models as input, Epoch was used to automatically perform a very large-scale
integrated (VLSI) circuit layout and to generate timing information. The Epoch output files are used for further
simulation with Verilog-XL to identify critical parts of the design. The result of this research is a complete
hardware design for the PRC.
14. SUBJECT TERMS VLSI (very large scale integrated) design; memory address
















THE DESIGN OF A PREDICTIVE READ CACHE
Joseph R. Robert, Jr.
Lieutenant, United States Navy
B.S., State University of New York at Buffalo, 1988
Submitted in partial fulfillment
of the requirements for the degree of














B. PRINCIPLE OF OPERATION
C. RESEARCH GOALS
D. THESIS STRUCTURE
II. TESTBENCH
A. OVERVIEW OF TESTBENCH





G. TEST RESULTS
III. PRC BEHAVIORAL MODEL DESIGN PHASE
A. PSEUDOCODE MODEL
B. DATA STRUCTURE
C. BLOCK DIAGRAM
D. CONTROLLER
E. SNOOPER
F. LINE MANAGER
G. PREDICTOR
H. DATA LIST
I. BUS INTERFACE UNIT
J. PREDICTION TESTS
K. CONCLUSION
IV. PRC STRUCTURAL MODEL DESIGN PHASE
A. PRC
B. CONTROLLER
C. SNOOPER
D. LINE MANAGER
E. PREDICTOR
F. DATA LIST
G. BUS INTERFACE
H. TESTING
V. CAD TOOLS
A. VERILOG-XL
B. CWAVES
C. EPOCH
VI. CONCLUSIONS AND RECOMMENDATIONS
A. CONCLUSIONS
B. RECOMMENDATIONS
APPENDIX A. LAYOUTS
APPENDIX B. TESTBENCH VERILOG FILES
A. TESTBENCH
B. CPU
C. ARBITER
D. MEMORY
APPENDIX C. PRC BEHAVIOR FILES
A. PRC
B. CONTROLLER
C. SNOOPER
D. LINE MANAGER
E. PREDICTOR
F. DATA LIST
G. BUS INTERFACE UNIT
H. PREDICTION TEST
I. PREDICTION TEST RESULTS
J. LINE REPLACEMENT TEST
K. LINE REPLACEMENT TEST RESULTS
APPENDIX D. PRC STRUCTURE FILES
A. PRC
B. CONTROLLER
C. SNOOPER
1. Thirty-Two-Input, Odd-Parity Checker
D. LINE MANAGER
1. Address Register With Equal Comparator
2. AND Gate With 128 Inputs and One Output
3. Codefile for Seven-to-128 Decoder (dec7to128e.codefile)
4. One-Hundred-and-Twenty-Eight-Input, Seven-Output Encoder, Priority to Low Bits
5. Thirty-Two-Input, Five-Output Encoder, Priority to Low Bits
6. Eight-Input, Three-Output Encoder, Priority to Low Bits
7. Line Replacement Unit
8. OR Gate With 128 Inputs, One Output
9. Predicted Memory Address List
10. One-to-128 Wire Splitter
11. One-to-Seven Wire Splitter
12. Set, Reset Latch
13. Set, Reset Latch Array 128 Bits Wide
E. PREDICTOR
F. DATA LIST
G. BUS INTERFACE
1. Odd Parity Checker/Generator With 256 Inputs
2. Odd Parity Generator With 32 Inputs
H. TEST RESULTS
LIST OF REFERENCES




Billingsley and Fouts demonstrated the viability of using
an address predicting buffer to reduce memory latency in
computer systems. "The implementation of a MPB [Memory
Prediction Buffer] is less expensive than a next-level cache
and delivers a comparable performance enhancement."
(Billingsley, 1992)
With this in mind, Nowicki designed a Read Prediction
Buffer (RPB) as part of his thesis work in 1992 (Nowicki,
1992). This RPB was capable of prefetching data based on the
previous pattern of memory accesses. Continuing the work of
Nowicki, Aguilar tested that design and suggested several
enhancements to improve it (Aguilar, 1995). A tentative
design of this new Predictive Read Cache (PRC) was a part of
his thesis work.
Aguilar proposed a design consisting of six modules which
together would comprise the PRC. He designed four of those
six modules, testing each independently, but not together.
B. PRINCIPLE OF OPERATION
The Predictive Read Cache stores data only, not
instructions. The design is based on two observations
about data fetches from main memory. First, within a
specific block of data, the accesses often occur in sequential
patterns such as every element in order, or every other
element in reverse order. The second observation is that a
program often uses several blocks of data concurrently.
The PRC takes advantage of the access patterns to predict
future memory access addresses. The prediction is based on a
linear displacement of the addresses. The PRC calculates the
difference between two given addresses, then adds the
difference to the most recent address to arrive at the
predicted address. For example, if the Central Processing
Unit (CPU) accesses the data at address 20h (hexadecimal 20)
and then at address 40h, the PRC predicts that the CPU soon
will need the data at 60h. Once the PRC has predicted an
address, it fetches the data from that address. Once the data
is stored in the PRC, the PRC can deliver that data to the CPU
much more quickly than the main memory could deliver the data.
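The displacement rule described above can be sketched in a few lines of Python. This is an illustration only and not part of the thesis design; the function name predict_next is hypothetical.

```python
def predict_next(previous_addr: int, current_addr: int) -> int:
    """Predict the next address by repeating the last displacement."""
    displacement = current_addr - previous_addr
    return current_addr + displacement


# Example from the text: accesses at 20h and then 40h predict 60h.
print(hex(predict_next(0x20, 0x40)))  # 0x60

# A descending pattern (elements visited in reverse order) is handled
# the same way, because the displacement is simply negative.
print(hex(predict_next(0x80, 0x60)))  # 0x40
```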
The PRC handles multiple data blocks through its "lines."
Each line is capable of tracking the pattern of accesses
within a unique block of data. Thus, the PRC can track only
as many access patterns as it has lines.
When the cache is full and a new access pattern begins,
a line has to be replaced. Lines that have not been used
recently become aged. Aged lines are the first to be replaced
when the cache is full.
Data incoherency is avoided through the process of
flushing lines. When a line is flushed, that line is marked
as containing invalid data and is made available for tracking
new access patterns. If the CPU writes data to an address from
which the PRC has prefetched data, the PRC flushes the line
with that data.
C. RESEARCH GOALS
The objective of this research is to create a complete
hardware design of the PRC. Completing the design has
priority over performance, though the performance must be
better than that of main memory for this design to
be of any value.
Performance is measured in terms of the rate at which
the CPU can access the data in the
PRC. In the microprocessor system for which this PRC design
is created, data accesses occur in groups called "bursts."
Each access within a burst is called a
"beat." With a 60-ns memory and a 66-MHz system clock, the
four-beat burst operation takes 8-3-3-3 cycles, that is, eight
cycles for the first beat and three more cycles for each of
the three remaining beats. The design of the PRC must perform
at least this well and preferably much faster.
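As a worked illustration of the 8-3-3-3 figure, the following Python sketch converts the cycle counts into a total burst time. The constant and function names are hypothetical; only the cycle counts and the 66-MHz clock come from the text.

```python
CLOCK_MHZ = 66                # system clock rate from the text
BURST_CYCLES = (8, 3, 3, 3)   # cycles per beat with a 60-ns memory


def burst_time_ns(cycles_per_beat, clock_mhz):
    """Return the duration of one burst in nanoseconds."""
    period_ns = 1000.0 / clock_mhz          # one clock period
    return sum(cycles_per_beat) * period_ns


total_cycles = sum(BURST_CYCLES)
total_ns = burst_time_ns(BURST_CYCLES, CLOCK_MHZ)
print(total_cycles, round(total_ns, 1))     # 17 257.6
```

Under these numbers, main memory needs 17 clock cycles (about 258 ns) to deliver one 32-byte burst, which is the baseline the PRC has to beat.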
D. THESIS STRUCTURE
The Testbench, the Verilog model of the environment in
which the PRC is expected to operate, is presented first.
This description includes a summary of the bus
protocol and results of tests that show the correct
performance of the Testbench.
The description of the behavioral model design phase is
presented next. This chapter presents a simple pseudocode
model of the PRC which is used to develop an appropriate data
structure and block diagram for the PRC. The individual
blocks are each modeled with Verilog and then connected
together in the Testbench to verify that the entire PRC works
as desired.
Once the behavioral model design phase is complete, each
block is converted into a hardware (structural) model. This
phase of the design is detailed in Chapter IV.
This thesis also contains a description of the Computer
Aided Design (CAD) tools used for this research. The
descriptions include tips for making their use easier and
descriptions of any problems encountered.
II. TESTBENCH
This chapter describes the Testbench, the environment in
which the Predictive Read Cache (PRC) was designed to operate.
In particular, it summarizes the bus arbitration protocol and
explains the important aspects of each part of the Testbench.
The chapter concludes with the test results of the Testbench
itself.
A. OVERVIEW OF TESTBENCH
The Testbench models and simulates the environment in
which the PRC design was tested. As indicated in Figure 1, it
comprises four blocks, one of which is the PRC itself. The
Testbench was developed with Verilog behavioral models. The
CPU module simulates various functions of a PowerPC-603. The
Memory module simulates the behavior of a 60-ns dynamic random
access memory (DRAM). The Arbiter controls access to both the
address and data busses. Each of these modules is described
in more detail in the following sections, after a description









Figure 1. Block Diagram of Testbench.
There were four major decisions made regarding the design
of the Testbench. The first decision was to use a PowerPC-603
microprocessor system as the environment in which this PRC
will operate. The work of Aguilar was started using the '603
(Aguilar, 1995). It is still a current member of the PowerPC
family; the protocol should not be out of date for quite some
time.
The second design decision was to limit the '603 to in-order
transactions. The '603 is capable of performing certain
sequences of data transfers out of order. That is, the order
of the data bus cycles can be different from the order of the
address bus cycles. Prohibiting these transactions made the
CPU model simpler and simplified the design of the PRC. This
did not undermine the demonstration of the PRC as a viable
memory management tool.
The third design decision was to use a 66-MHz system bus
and CPU clock rate. Sixty-six MHz is a reasonably fast system
bus speed. Designing for a slower bus speed could severely
reduce the applicability of this design to modern systems.
The fourth decision was to use the 64-bit data bus rather
than the optional 32-bit configuration. When configured with the
64-bit data bus, the PowerPC-603 can access memory in one of
two modes: single-beat or four-beat burst. A single beat is
one memory access of one to eight bytes. A four-beat burst is
a sequence of four sequential memory accesses, eight bytes per
beat, totaling 32 bytes. When configured with the 32-bit data
bus, the '603 can access memory in one of three modes:
single-beat (one to four bytes), two-beat burst (eight bytes), or
eight-beat burst (32 bytes). Data transfers are less
complicated with the 64-bit data bus since there are fewer
transfer options and a smaller number of beats. Also, the
time from one cache miss to the next is independent of the
data bus size. Since a burst transfer on the 32-bit bus takes
more cycles, there is much less time between cache misses for
the PRC to do its job, perhaps too little time. Further, the
32-bit mode is specific to the '603; therefore, the PRC would
have to be redesigned to be used with the other 64-bit bus
members of the PowerPC family. A disadvantage of the 64-bit
option is that the number of pins required for the PRC
increases from about 108 to about 140.
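The arithmetic behind the two bus configurations can be checked with a short Python sketch. The table and function names are hypothetical; the bus widths and beat counts are those given in the text.

```python
# (bus width in bits, beats per burst) for the burst modes in the text
BURST_MODES = {
    "64-bit bus, four-beat burst": (64, 4),
    "32-bit bus, two-beat burst": (32, 2),
    "32-bit bus, eight-beat burst": (32, 8),
}


def burst_bytes(bus_bits: int, beats: int) -> int:
    """Return the number of bytes moved by one burst."""
    return (bus_bits // 8) * beats


for name, (bits, beats) in BURST_MODES.items():
    print(f"{name}: {burst_bytes(bits, beats)} bytes in {beats} beats")
```

Both the four-beat burst on the 64-bit bus and the eight-beat burst on the 32-bit bus move 32 bytes, but the 32-bit bus needs twice as many beats to do so.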
B. SUMMARY OF '603 PROTOCOL
The PowerPC-603 has separate data and address busses,
each with independent cycles, referred to as tenures by the
Motorola engineers. Each tenure has three phases: Arbitration,
Transfer, and Termination.
The system has a bus arbitration unit which controls the
passing of bus mastership between the requesting units. In
this implementation, the CPU and the PRC are the only
candidates for bus mastership. Module Arbiter is the
arbitration unit.
When a unit wants the bus, it asserts BR_ (bus request).
If the unit can have the bus next, the arbiter asserts BG_
(bus grant) back to that unit. Then the unit waits, if
necessary, for the previous master to finish its tenure, after
which the unit takes mastership by asserting ABB_ (address bus
busy). When the current master is done with the address bus,
it negates ABB_.
This system has no external cache or multiple processors;
thus, there are no address-only transactions. If a unit wants
the address bus, it will also want the data bus. After
granting the address bus by asserting BG_, the arbiter then
grants the data bus by asserting DBG_.
Both BG_ and DBG_ remain asserted until the requesting
unit takes mastership or withdraws its request by negating
BR_. If there are no pending bus requests, the arbiter "parks"
the CPU by granting it the busses. If the CPU is parked, it
does not have to take the time to request the bus, thereby
reducing the time for the memory access. If the CPU is parked
and the PRC requests the bus, the arbiter unparks the CPU and
grants the bus to the PRC.
C. TESTBENCH
The Testbench is the highest level in the design
hierarchy. It connects the CPU, PRC, memory, and arbitration
unit. This module establishes the system clock rate and
controls the simulation time.
D. CPU
The CPU module simulates PowerPC-603 memory accesses.
The Sequencer is a sub-module of the CPU which makes the
Testbench able to simulate every transaction relevant to the
memory and PRC. These transactions can occur in any order.
Many of the possible '603 transactions are not applicable to
this particular system configuration. For example, none of
the "address only" transactions are relevant, since they are
for systems with multiple processors or second-level caches.
Bus arbitration is accurately modeled, including the pipelined
address tenures.
E. MEMORY
This module emulates the main memory of the system. For
simulation efficiency, the memory has only enough physical
address space for four-beat burst reads: 128 bytes. The
address bus width allows a virtual address space of four
Gbytes. Accesses to addresses past the first 128 bytes map to
addresses within the first 128 bytes.
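The address mapping can be pictured with the following Python sketch. It assumes the mapping is a simple wrap-around (modulo 128); the text states only that higher addresses map into the first 128 bytes, so the exact rule and the names used here are assumptions.

```python
PHYSICAL_BYTES = 128  # size of the modeled memory array, from the text


def physical_address(bus_address: int) -> int:
    """Map a bus address onto the 128-byte array (assumed wrap-around)."""
    return bus_address % PHYSICAL_BYTES


print(hex(physical_address(0x20)))    # 0x20
print(hex(physical_address(0x1A0)))   # 0x20, aliases the line above
```

Under this assumption, two different bus addresses can return the same stored data, which is acceptable for a testbench whose purpose is to exercise the bus protocol.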
The time required for memory accesses is determined by
the parameters Delay1 and Delay2. The heading in
the file memory.v describes how to adjust these parameters to
achieve a realistic memory access rate.
There were two significant decisions made about the main
memory design. First, the memory emulates a 60-ns DRAM.
With a 60-ns memory and a 66-MHz system clock, the
four-beat burst operation takes 8-3-3-3 cycles, that is, eight
cycles for the first beat and three more cycles for each of
the three remaining beats.
The second design decision was to add a cancel feature to
the main memory chip. The memory module has an input called
CANX which cancels the current read operation. It is through
this signal that the PRC stops the memory module from
delivering data to the CPU when the PRC already has the data.
Another option would be to put the PRC between the CPU
and Memory, not allowing a read request to get to the memory
chip until after the PRC had checked its contents. This
scheme would increase the time of all memory accesses.
F. ARBITER
The Arbiter emulates the external bus arbitration unit,
implemented as a Finite State Machine (FSM) corresponding to
the state diagram in Figure 2.
The memory unit in this Testbench is capable of handling
up to two memory accesses in the pipeline at a time, which is
the maximum that the CPU will ever cause. Adding the PRC to
the system creates the possibility of three accesses in the
pipe. For example, the PRC could initiate a third address
tenure before the first of two CPU transactions is complete.
This potential problem is handled by the Arbiter, which keeps
track of the pipelining depth. It will not grant the address
bus to any unit if that address tenure would put a third
transaction in the pipeline. Rather, the Arbiter will stall
until the data tenure from the first transaction is complete,




[Figure 2 legend: A: Start; D: Grant CPU data bus; E: Grant PRC addr bus;
F: Wait for PRC. Numbers refer to Verilog state numbers.]
Figure 2. State diagram for Arbiter FSM.
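The pipeline-depth rule enforced by the Arbiter can be sketched in Python as a simple counter. This is not the Arbiter FSM itself; the class and method names are hypothetical, and only the limit of two outstanding transactions comes from the text.

```python
MAX_PIPELINE_DEPTH = 2  # the Memory module handles at most two accesses


class PipelineDepthTracker:
    """Counts transactions whose address tenure has started but whose
    data tenure has not yet completed."""

    def __init__(self) -> None:
        self.depth = 0

    def may_grant_address_bus(self) -> bool:
        # A grant is allowed only if it would not create a third transaction.
        return self.depth < MAX_PIPELINE_DEPTH

    def address_tenure_started(self) -> None:
        if not self.may_grant_address_bus():
            raise RuntimeError("grant would exceed the pipeline depth")
        self.depth += 1

    def data_tenure_completed(self) -> None:
        if self.depth == 0:
            raise RuntimeError("no transaction in the pipeline")
        self.depth -= 1


tracker = PipelineDepthTracker()
tracker.address_tenure_started()          # first CPU transaction
tracker.address_tenure_started()          # second CPU transaction
print(tracker.may_grant_address_bus())    # False: a PRC request must wait
tracker.data_tenure_completed()           # first data tenure finishes
print(tracker.may_grant_address_bus())    # True
```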
G. TEST RESULTS
Testing the Testbench itself was important to establish
that the models matched the behavior described in the PowerPC
User's Manual. The Testbench passed all tests of reads,
writes and burst operations, in various sequences of
transactions and using an assortment of memory access delays.
Figure 3 shows the fastest possible burst operations, as
if the memory access time were not the limiting factor. Note
again that the address tenure of the second transaction can





















Figure 3. Burst write, then burst read. Delay=0. [cWaves output]
Figure 4 shows a burst write transaction with an access
delay of three cycles and a delay of one cycle between
beats. A realistic 60-ns DRAM will have a delay of 8-3-3-3
rather than the 3-1-1-1 shown here. The PRC, however, should be
able to supply data this quickly.
[Figure 4. cWaves waveform output. Legible trace names: CPU1.clk, CPU1.BR_, CPU1.BG_, CPU1.ABB_, CPU1.ts_, CPU1.DBG_, CPU1.DBB_, CPU1.ta_.]




III. PRC BEHAVIORAL MODEL DESIGN PHASE
This chapter presents the development of the behavioral
models for the PRC. A simple pseudocode model is presented
first. This model was used to develop an appropriate data
structure and block diagram for the PRC. The individual
blocks in this block diagram were implemented with Verilog
behavioral modules and tested together to verify the
behavioral model of the PRC. The next step was to convert
each module into a hardware model compatible with Epoch,
detailed in the next chapter.
A. PSEUDOCODE MODEL
The behavior of the PRC is explained in detail in the
paper by Fouts & Billingsley (1994, p. 113) and summarized in
the Introduction chapter of this thesis. Another way of
summarizing this behavior is through a pseudocode model as
shown in Figure 5, which is just detailed enough to identify
the most significant capabilities the PRC must have. The
purpose of taking this approach was to clarify the function of
the PRC and to aid in identifying specific behaviors of this
cache which the hardware needs to exhibit.
B. DATA STRUCTURE
A possible data structure for the PRC is shown in Figure
6. Each of the 128 lines within the PRC must contain two
addresses, some status information and data. The two
addresses are required to maintain the memory access pattern.
There are also two seven-bit pointers, each containing a
value in the range of zero to 127. The ActiveLine pointer
contains the number of the line that is currently being used
by the PRC. The ReplaceLine pointer contains the number of
the next line to be replaced when a new line is needed.
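One way to picture this data structure is the following Python sketch. It is an illustration rather than the thesis model: the class and field names are hypothetical, and the status information is assumed to consist of the Valid and Aged flags named later in the Line Manager section.

```python
from dataclasses import dataclass, field
from typing import List

NUM_LINES = 128    # lines in the PRC
LINE_BYTES = 32    # data bytes per line (one four-beat burst)


@dataclass
class Line:
    pred_ma: int = 0                  # predicted memory address (PredMA)
    mrma: int = 0                     # most recent memory address (MRMA)
    valid: bool = False               # assumed status flag
    aged: bool = False                # assumed status flag
    data: bytes = bytes(LINE_BYTES)   # 32 bytes of prefetched data


@dataclass
class PrcState:
    lines: List[Line] = field(
        default_factory=lambda: [Line() for _ in range(NUM_LINES)])
    active_line: int = 0    # ActiveLine pointer, 0..127
    replace_line: int = 0   # ReplaceLine pointer, 0..127


state = PrcState()
# Seven bits are enough to index 128 lines.
print(len(state.lines), (NUM_LINES - 1).bit_length())  # 128 7
```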
*** PRC BEHAVIOR MODEL IN PSEUDOCODE ***
// CAR = current address register
// MRMA = most recent memory address
// PredMA = predicted memory address
always at negative edge of HRESET_
    clear all status flags;
    put PRC in IDLE state;
    ActiveLine = 0; ReplaceLine = 0;
<IDLE>
wait for next transaction
CASE (transaction)
    data burst-read:
        if CAR hits in PRC, // PRC has requested data
            switch ActiveLine to line that was hit;
            send data to CPU;
            send cancel signal to memory;
            predict next address;
            if next address is not already in PRC,
                read next address;
                store in ActiveLine;
                update MRMA and PredMA;
        else if CAR misses, // PRC does not have requested data
            switch ActiveLine to the next ReplaceLine;
            if this is the first miss for this line,
                store this address in MRMA;
            if this is the second miss for this line,
                initiate search for next ReplaceLine;
                predict next address;
                if next address not already in PRC,
                    read next address;
                    store in ActiveLine;








Figure 5. PRC Pseudocode Model.
DATA STRUCTURE
PredMA (0:26)   MRMA (0:26)   status   DATA (32 bytes = 4 x 64 bits)
PredMA = Predicted Memory Address
Figure 6. PRC Data Structure.
C. BLOCK DIAGRAM
The pseudocode model revealed several specific tasks the
PRC must be able to accomplish. Identifying and clarifying
these tasks resulted in the development of six blocks within
the PRC. These blocks are shown in the block diagram of
Figure 7 and are described briefly here.
The Snooper watches transactions between the CPU and
memory, raising appropriate signals if the transaction is one
in which the PRC is interested.
The Line Manager contains the Address List and Line
Replacement Unit as sub-blocks. The Address List contains all
the recently-accessed memory addresses and all the predicted
addresses. The Line Replacement Unit determines which of the
128 lines will be replaced the next time a new line is needed.
These two blocks are grouped together because they share
status information about the lines and work closely together
for line management.
The Predictor module uses its two input addresses to
predict its output address.
The Data List stores 128 lines of data, 32 bytes in each
line, which is the amount of data in each burst read or burst
write.
The Bus Interface handles the protocol of data transfers
into and out of the PRC.
Finally, the Controller coordinates the actions of all




[Figure 7 legend: CAR = Current Addr Register; NAR = Next Addr Register.]
Figure 7. PRC Block Diagram.
D. CONTROLLER
This module is a Finite State Machine which coordinates
the actions of all the other functional blocks of the PRC.
All control signals are synchronous with the system clock.
HRESET_ causes the Controller to go to the IDLE state. The
state diagram and state output tables are shown in Figures 8
and 9.
Controller State Output Table
                          test     store    send    new_replace
STATE          a_select   predict  flush    hold    fetch
IDLE           X
TEST_CAR(R)    CAR        1 1
SEND_DATA      X          1 1
TEST_NAR       NAR        1
FETCH_DATA     NAR        1
IS_LINE_EMPTY  X          1
PREDICT_NA     X          1 1 1
STORE_CAR      CAR        1 1
TEST_CAR(W)    CAR        1 1
FLUSH_LINE     X          1 1




Figure 9. Controller State Diagram.
E. SNOOPER
This module watches the system bus activity and makes
appropriate reports to the PRC Controller.
If the transaction is a data burst read or any kind of
write and if the address parity is correct, then two actions
occur. First, read or write is asserted as appropriate.
Second, the address is placed in the Current Address Register
(CAR). The snoop_ignore signal tells this unit to ignore the
current transaction, because it was initiated by the Bus
Interface Unit. The snoop_ignore signal must be asserted
concurrently with the transfer attributes.
Reads that are not burst reads or are not data related are
ignored by the PRC. The CAR is updated only on transactions
relevant to the PRC.
Due to the two-stage pipelining capability of the PowerPC
with respect to memory accesses, a second address tenure can
occur shortly after the first, well before the first data
tenure is complete. To compensate for this, the read and
write outputs of the Snooper remain asserted until acknowledged
by the Controller with hold. The rising edge of hold
indicates that the read or write signal was received by the
Controller. The Snooper then can negate these signals but
must leave CAR alone until hold is negated. After hold is
negated, CAR can be updated to the new address.
F. LINE MANAGER
This module contains the address list, status flags for
each line (Valid, Aged), a general status flag (line_empty),
the line replacement unit, and two pointers
(ActiveLine, ReplaceLine). On HRESET_, Valid=0 (all lines),
Aged=0 (all lines), line_empty=1, ActiveLine=0.
The MRMA output is always the MRMA of the ActiveLine.
The line_empty flag indicates that the currently active line
has no addresses in it yet; therefore, the addresses cannot be
used by the PRC to make a prediction.
The input a_select determines which address input is used
for a particular operation. The two address inputs are the CAR
and the NAR.
When the Line Manager receives a test signal, it compares
the input address with the contents of the PredMA List. If
there is a match with the CAR, it asserts the hit signal and
changes the ActiveLine pointer to the line number of the hit.
If there is a miss with the CAR, then the ActiveLine
switches to the same line to which ReplaceLine points.
If, during a test, there is a match with the NAR, two
actions occur. First, hit is asserted. Second, the value in
ActiveLine becomes irrelevant since it will not be used. If
there is a miss with the NAR, the ActiveLine must remain
unchanged from the test.
The fetch_done signal from the Bus Interface causes the
NAR to be stored in PredMA[ActiveLine], the CAR to be stored
in MRMA[ActiveLine], the Valid flag to be set, and the Aged
flag to be reset.
The flush signal causes the current ActiveLine to become
invalid by setting Valid[ActiveLine] = 0.
The store signal causes the input address to be stored
into the MRMA of the ActiveLine. This is only used for the
first address in a new line. The store signal also causes the
line_empty flag to be reset.
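The test, store, fetch_done and flush operations described above can be summarized in a small Python sketch. It is a behavioral illustration under stated assumptions, not the thesis Verilog: the class and method names are hypothetical, only valid lines are assumed to take part in the address comparison, and the lowest-numbered matching line is assumed to win.

```python
class LineManager:
    def __init__(self, num_lines: int = 128) -> None:
        self.pred_ma = [0] * num_lines      # PredMA list
        self.mrma = [0] * num_lines         # MRMA list
        self.valid = [False] * num_lines
        self.aged = [False] * num_lines
        self.line_empty = True
        self.active_line = 0
        self.replace_line = 0

    def test_car(self, car: int) -> bool:
        """Test with CAR: a hit selects the hit line, a miss selects ReplaceLine."""
        for i, addr in enumerate(self.pred_ma):
            if self.valid[i] and addr == car:
                self.active_line = i
                return True
        self.active_line = self.replace_line
        return False

    def test_nar(self, nar: int) -> bool:
        """Test with NAR: report a hit but leave ActiveLine unchanged."""
        return any(v and a == nar for a, v in zip(self.pred_ma, self.valid))

    def store(self, address: int) -> None:
        """Store the first address of a new line and clear line_empty."""
        self.mrma[self.active_line] = address
        self.line_empty = False

    def fetch_done(self, car: int, nar: int) -> None:
        i = self.active_line
        self.pred_ma[i], self.mrma[i] = nar, car
        self.valid[i], self.aged[i] = True, False

    def flush(self) -> None:
        self.valid[self.active_line] = False


lm = LineManager(num_lines=4)
lm.store(0x00)                       # first miss: remember the address
lm.fetch_done(car=0x20, nar=0x40)    # second miss: prefetch of 40h done
print(lm.test_car(0x40))             # True, the predicted address hits
```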
Line replacement: ReplaceLine always points to the line
to be replaced at the next PRC miss. HRESET_ causes this to
be zero.
As soon as the PRC starts predicting the first address
for a line, it asserts new_replace. The replacement unit then








    elseif (all_line_are_valid AND Aged[ReplaceLine]) then
        Done = true;
    else
        Aged[ReplaceLine] = 1;
until Done;
line_empty=1;
In words, the Line Replacement Unit searches sequentially
for the next line with invalid data and marks that line as the
next line to be replaced. If all lines contain valid data,
then it scans for the next line that is "aged," indicated by
a set Aged flag. As it scans for an aged line, it sets the
Aged bits in the "unaged" lines it passes. Therefore, as it
wraps around in the search for an aged line, it will
eventually come upon one, even if none were aged when the
search began.
All of this occurs while the PRC is fetching data.
Therefore, the PRC has several clock periods in which to
complete the search.
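The search can be sketched in Python as follows. This is a reconstruction for illustration, not the thesis code: the function name is hypothetical, and the start of the loop (advancing ReplaceLine by one with wrap-around and stopping at the first invalid line) is an assumption based on the description above.

```python
def next_replace_line(valid, aged, replace_line):
    """Return the next line to replace; sets Aged flags on lines passed."""
    num_lines = len(valid)
    all_valid = all(valid)
    while True:
        # Assumed: the search advances one line at a time and wraps around.
        replace_line = (replace_line + 1) % num_lines
        if not valid[replace_line]:
            return replace_line            # invalid line found
        if all_valid and aged[replace_line]:
            return replace_line            # aged line found
        aged[replace_line] = True          # mark the line passed as aged


valid = [True, True, True, True]
aged = [False, False, False, False]
# No line is aged at the start, so the search wraps around once and
# stops at the first line it aged on the way.
print(next_replace_line(valid, aged, replace_line=0))  # 1
```

Because every line passed is marked as aged, the loop ends after at most one full pass plus one step through the lines.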
G. PREDICTOR
The Predictor module has two address inputs, the Most
Recent Memory Address (MRMA) and the Current Address (stored
in the Current Address Register, CAR). It has a single
output, the Next Address, which is stored in the Next Address
Register, NAR.
This module calculates the Next Address based on the Most
Recent Memory Address and the Current Address. The rising
edge of predict initiates the prediction calculation. The
original equation is
NAR = CAR + (CAR - MRMA)
which is implemented as
NAR = 2*CAR - MRMA.
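The equivalence of the two forms can be checked with a short Python sketch. It assumes a 32-bit address (consistent with the four-Gbyte address space mentioned for the Testbench) and wrap-around arithmetic; the function names are hypothetical. Doubling CAR is written as a one-bit left shift, which is how the multiplication would typically reduce in hardware.

```python
ADDRESS_BITS = 32                  # assumed address width
MASK = (1 << ADDRESS_BITS) - 1     # keeps results within the address width


def predict_original(car: int, mrma: int) -> int:
    """NAR = CAR + (CAR - MRMA)."""
    return (car + (car - mrma)) & MASK


def predict_implemented(car: int, mrma: int) -> int:
    """NAR = 2*CAR - MRMA, with the doubling done as a left shift."""
    return ((car << 1) - mrma) & MASK


# Ascending pattern: accesses at 180h and then 1A0h predict 1C0h.
print(hex(predict_implemented(0x1A0, 0x180)))   # 0x1c0
# Descending pattern: accesses at 60h and then 40h predict 20h.
print(hex(predict_implemented(0x40, 0x60)))     # 0x20
```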




The inputs to the Data List are upload, download and
ActiveLine. The 256-bit bus data_line is an input and output.
An upload signal causes the Data List to store the data
on data_line into the address specified by ActiveLine. A
download signal causes the Data List to assert onto data_line
the data in the address specified by ActiveLine.
I. BUS INTERFACE UNIT
This module handles the protocol of data transfers into
and out of the PRC, coordinating these activities through the
use of a Finite State Machine.
When this module receives a fetch signal, it latches the
address in the NAR and requests the bus for a burst read. It
stores the incoming data until all four beats have been
received. Then, it uploads the data into the Data List and
asserts fetch_complete.
When this module receives a send signal, it sends a
cancel signal (CANX) to the memory module, downloads data from
the Data List and then sends the data to the CPU. When the
transfer is finished, it asserts send_done.
J. PREDICTION TESTS
There are two large-scale tests included in this thesis.
The first is the Prediction Test. The second is the Line
Replacement Test. Together, these tests are sufficient to
demonstrate that the behavioral model functions as desired.
Once the behavioral model of the PRC passed these tests, it
was ready for conversion to a hardware model.
The tests are both conducted by connecting the behavioral
model of the PRC to the Testbench described in the previous
chapter and running a simulation with a sequence of events.
The sequence of events for the Prediction Test is included in
the sequencer4.v file. The sequence of events for the Line
Replacement Test is located in the sequencer5.v file. The
following procedure lists the steps necessary to conduct a
test:




2. Modify the file verilog_arguments so that it contains
sequencer4.v or sequencer5.v as desired and all the
parts to the PRC and to the Testbench.
3. Modify the file testbench.v to set the simulation
duration as described in the heading of the desired
sequencer. Modify the trace flags in every file
listed in verilog_arguments as described in the
sequencer file.
4. At the Unix command prompt, enter the command
verilog -f verilog_arguments.
The Verilog-XL outputs of both tests are included in the
appendices. Together, these tests show that this behavioral
model performs all the desired functions.
The Prediction Test, using Sequencer4, causes a series of
CPU transactions that tests the ability of the PRC to make the
prediction calculation and to fetch the data. The
transactions are as follows:
burst_read at 00h:   The PRC stores this address.
burst_read at 20h:   The PRC should predict a next address of
                     40h and then fetch the data from that
                     address.
burst_read at 180h:  The PRC should store this address in a new
                     line.
burst_read at 1A0h:  The PRC should predict a next address of
                     1C0h and then fetch the data from that
                     address.
burst_read at 40h:   This data is already in the PRC, so the PRC
                     should send it to the CPU and then fetch
                     data from 60h.
burst_write at 1C0h: This data is in the PRC, so this line
                     should be flushed.
burst_read at 60h:   The PRC should deliver this data to the CPU
                     and then fetch the data at 80h.
burst_read at 100h:  The PRC should start a new line and store
                     this address.
This test successfully demonstrates a majority of the
capabilities of the PRC, showing when the Line Manager selects
new lines, when and how the Predictor functions, and when the
CPU starts a read or write and the data involved. The test
also shows when the Bus Interface Unit fetches data from
memory, and the Data List reports the flow of data into and
out of itself.
The only significant behavior not exercised by this test
is the function of the Line Replacement Unit when the PRC is
full. That is handled with Sequencer5 in the Line Replacement
Test.
The Line Replacement Test was accomplished by a series of
CPU transactions that quickly fill the PRC. The test shows
that the Line Replacement Unit correctly selected invalid
lines to be replaced first. When all the lines in the PRC
contained valid data, the Line Replacement Unit executed the
algorithm described in the section on the Line Replacement
Unit.
K. CONCLUSION
At this point in the development of the PRC, the
behavioral model was functioning properly. Therefore, it
could be converted piece by piece into a hardware model. This
was accomplished using the subset of Verilog understood by
Epoch, as described in the next chapter.
IV: PRC STRUCTURAL MODEL DESIGN PHASE
This chapter presents the development of the hardware
model of the PRC. In this phase of the design process, each
of the behavioral blocks developed in the previous phase was
implemented with hardware. Converting the blocks in order of
increasing complexity worked well, because it allowed the
early effort to concentrate on learning how to use Epoch.
Like the behavioral models, the hardware (structural)
models are Verilog files. Epoch uses these Verilog files to
create VLSI layouts. From those layouts, Epoch calculates
timing information and generates new VerilogOut files with
this timing information. As each block is converted into
hardware, the new VerilogOut model can replace the original
behavioral model in the Testbench for testing with Verilog-XL.
The following hardware blocks result from using this
procedure.
Each section of this chapter also includes a figure
displaying some important geometric information about the
module, including surface area and transistor count. This
information can be obtained from Epoch with the shell command
geostat -trancount <module name>.
A. PRC
The top level module is only a connection of each of the
modules described in the following sections. The geostat
information is shown in Figure 10. Of particular significance
are the transistor count and the total chip area.
Bounding Box:
9080.748 x 11278.224 microns, 102414707.226 square microns.
357.510 x 444.025 mils, 158743.109 square mils.
Number of Pins = 316.
Number of unique cells = 6.
Number of Datapaths = 1
Number of Sub-Glues = 5
Total Number of Instances = 6
Total number of nets = 498.
Total metal1 layer route length = 2120297.98 microns.
Total metal2 layer route length = 699802.75 microns.
Total metal3 layer route length = 0.00 microns.
Total route length = 2820100.74 microns.
Total number of vias = 2460.
Total number of segments = 16989.
Reading transistor view ...
Total number of 454310 transistors.
0.349 Square mils per Transistor.
2.862 Transistors per square mil.
Power Dissipation = 4742486.500 micro-watts.
Figure 10. PRC Geostat Information. [Epoch output]
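The derived figures in a geostat report can be cross-checked from the primary ones. The short Python sketch below does this for the values in Figure 10. It is not part of Epoch; the function name geostat_summary and the rounding to three decimal places are assumptions made for the example. The only outside fact used is that one mil is 25.4 microns.

```python
MICRONS_PER_MIL = 25.4  # 1 mil = 0.001 inch = 25.4 microns

def geostat_summary(width_um, height_um, area_sq_mils, transistors):
    """Recompute the derived values that appear in a geostat listing."""
    return {
        "width_mils": round(width_um / MICRONS_PER_MIL, 3),
        "height_mils": round(height_um / MICRONS_PER_MIL, 3),
        "sq_mils_per_transistor": round(area_sq_mils / transistors, 3),
        "transistors_per_sq_mil": round(transistors / area_sq_mils, 3),
    }

# Values taken from the top-level PRC report in Figure 10.
prc = geostat_summary(9080.748, 11278.224, 158743.109, 454310)
print(prc)
```

The same check can be applied to the listing of any other module by substituting its bounding box, area and transistor count.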
B. CONTROLLER
This module is a Finite State Machine which coordinates
the actions of all the other functional blocks of the PRC.
All control signals are synchronous with the system clock.
HRESET_ causes the Controller to go to the IDLE state. The
revised state output table (Figure 11) and the revised state
diagram (Figure 12) give more details.
Of significance are the wait states added to the state
diagram of the behavioral model. These changes are boldface
in the Revised Controller State Output Table. The changes
were required by the Line Manager, in which there is a
significant propagation delay for the addresses. This delay
is described in more detail in the Line Manager section of
this chapter and is a prime candidate for future work to
improve this design of the PRC. The geostat information is
shown in Figure 13.
Controller State Output Table
test store send new replace








TEST_CAR(R) CAR 1 1








STORE_CAR CAR 1 1
TEST_CAR(W) CAR 1 1
FLUSH_LINE X 1 1






















Figure 12. Revised Controller State Diagram
Bounding Box:
267.516 x 215.964 microns, 57773.825 square microns.
10.532 x 8.503 mils, 89.550 square mils.
Number of Pins = 26.
Number of unique cells = 18.
Number of Standard cells = 60
Total Number of Instances = 60
Total number of nets = 71.
Total metal1 layer route length = 7073.14 microns.
Total metal2 layer route length = 7073.46 microns.
Total metal3 layer route length = 0.00 microns.
Total route length = 14146.60 microns.
Total number of vias = 226.
Total number of segments = 1074.
Reading transistor view ...
Total number of 460 transistors.
0.195 Square mils per Transistor.
5.137 Transistors per square mil.
Power Dissipation = 3665.888 micro-watts.
Figure 13. Controller Geostat Information. [Epoch output]
C. SNOOPER
This module watches the system bus activity and makes
appropriate reports to the PRC Controller.
If the transaction is a data-burst read or any kind of
write and if the address parity is correct, then the read or
write signal is asserted as appropriate. Also, the address is
placed in the CAR. The snoop_ignore signal tells this unit to
ignore the current transaction, because it was initiated by
the Bus Interface Unit. The snoop_ignore signal must be
asserted concurrently with the transfer attributes. Reads
that are not burst or data related are ignored by the PRC.
The CAR is updated only on transactions relevant to the PRC.
Due to the two-stage pipelining capability of the PowerPC
with respect to memory accesses, a second address tenure can
occur shortly after the first, well before the first data
tenure is complete. To compensate for this, the read and
write outputs of the Snooper remain asserted until
acknowledged by the Controller with hold. The rising edge of
hold indicates that the read or write signal was received by
the Controller. The Snooper then can negate these signals,
but must leave CAR alone until hold is negated. After hold is
negated, CAR can be updated to the new address.
In Stage 0, the transfer attributes are latched in
registers. Combinational logic determines if these transfer
attributes represent a valid read or a valid write and if the
address parity is correct. If the transaction is valid and
one in which the PRC is interested, then Stage 0 raises a
transaction_waiting signal.
A Finite State Machine in Stage 1 sits in the IDLE
state until it receives the transaction_waiting signal. Then
it latches the signals needed from Stage 0, resets the
transaction_waiting signal and then waits for the hold signal
to go low. A high hold signal indicates that the PRC is not
done with the previous transaction. Once hold goes low, the
read and write flags are set according to the type of the
current transaction. Also, the input address is stored in the
Current Address Register. The FSM then waits for the rising
edge of hold before returning to the IDLE state where it can
check if there is another transaction waiting. The geostat
information is shown in Figure 14.
Bounding Box:
607.500 x 409.536 microns, 248793.127 square microns.
23.917 x 16.123 mils, 385.630 square mils.
Number of Pins = 88.
Number of unique cells = 19.
Number of Standard cells = 169
Total Number of Instances = 169
Total number of nets = 219.
Total metal1 layer route length = 28547.10 microns.
Total metal2 layer route length = 14615.39 microns.
Total metal3 layer route length = 0.00 microns.
Total route length = 43162.49 microns.
Total number of vias = 464.
Total number of segments = 2268.
Reading transistor view ...
Total number of 3608 transistors.
0.107 Square mils per Transistor.
9.356 Transistors per square mil.
Power Dissipation = 26722.156 micro-watts.
Figure 14. Snooper Geostat Information. [Epoch output]
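The handshake between the second stage of the Snooper and the Controller, as described in this section, can be illustrated with a small software model. The Python below is a sketch only, written from the prose description and not from the structural Verilog. The class name StageOne and the state names WAIT_HOLD_LOW and WAIT_HOLD_HIGH are assumptions (only IDLE is named in the text), and exact cycle-level timing is not modeled.

```python
class StageOne:
    """Software sketch of the second-stage Snooper state machine."""

    def __init__(self):
        self.state = "IDLE"
        self.read = False
        self.write = False
        self.car = None          # Current Address Register
        self._pending = None     # (is_read, address) latched from Stage 0

    def clock(self, transaction_waiting, is_read, address, hold):
        """Advance one clock. Returns True when transaction_waiting
        should be reset because the transaction has been latched."""
        if self.state == "IDLE":
            if transaction_waiting:
                self._pending = (is_read, address)
                self.state = "WAIT_HOLD_LOW"
                return True
        elif self.state == "WAIT_HOLD_LOW":
            if not hold:         # previous transaction is finished
                pending_read, pending_addr = self._pending
                self.read, self.write = pending_read, not pending_read
                self.car = pending_addr
                self.state = "WAIT_HOLD_HIGH"
        elif self.state == "WAIT_HOLD_HIGH":
            if hold:             # rising edge: Controller saw read/write
                self.read = self.write = False
                self.state = "IDLE"
        return False
```

In this sketch the CAR changes only while hold is low, which reflects the requirement that the CAR be left alone until the Controller has finished with the previous transaction.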
D. LINE MANAGER
This structural model uses a high speed RAM (hsram) for
the MRMA List. The CAR is stored into this RAM on a store or
fetch_done signal.
The predicted_ma_list is a register file for storing
predicted memory addresses. This list is composed of 128
address registers, 128 equality comparators and 128 Valid
status flags. The NAR is stored in this list at the
fetch_done pulse. If there is a match with the input address




The Line Replacement Unit determines the next line to be
replaced whenever the PRC needs to start a new line. It first
selects invalid lines. If all the lines are valid, then it
selects lines that have been "aged." A priority encoder
(ENC_D) chooses the line with the lowest index among all the
lines that can be replaced. If all lines are valid, the
output enable (oe) signal of the encoder is used to cause
aging. A line X can be replaced if the following holds true
for that line:
not (X = ActiveLine) AND {not Valid[X] OR (all_lines_valid
AND Aged[X])}
Aging is accomplished by the use of a seven-bit counter
(ager_counter), initially set to zero. When the cause_aging
signal from the encoder is high, the counter advances. A
decoder (DEC_B) output causes the appropriate Aged flag to be
set.
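The replacement condition and the priority encoder can be expressed compactly in software. The following Python is a sketch under stated assumptions and is not the structural Verilog: the function names replaceable and next_replacement are invented for the example, the flags are held in plain lists, and the aging counter and decoder are not modeled.

```python
def replaceable(x, active_line, valid, aged):
    """Replacement condition for line x, as given in the text."""
    all_lines_valid = all(valid)
    return x != active_line and (
        not valid[x] or (all_lines_valid and aged[x])
    )

def next_replacement(active_line, valid, aged):
    """Priority-encoder behavior: the lowest replaceable index.

    Returns None when no line qualifies. Treating that case as the
    moment when aging is triggered is an assumption of this sketch.
    """
    for x in range(len(valid)):
        if replaceable(x, active_line, valid, aged):
            return x
    return None
```

Note that the Aged flags matter only once every line is valid; while any invalid line remains, only invalid lines are candidates.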
Changes in the value of the CAR or NAR have a propagation
delay of 25 ns (1.8 cycles) through the input address
multiplexer (in_addr_mux). This required the addition of wait
states in the Controller before each of the tests. The
Revised Controller State Output Table and the Revised
Controller State Diagram found in the Controller section of
this chapter show the required changes. The geostat
information is shown in Figure 15.
Bounding Box:
6704.064 x 8897.364 microns, 59648499.103 square microns.
263.940 x 350.290 mils, 92455.359 square mils.
Number of Pins = 505.
Number of unique cells = 22.
Number of Standard cells = 123
Number of Blocks = 1
Number of Sub-Glues = 2
Total Number of Instances = 126
Total number of nets = 357.
Total metal1 layer route length = 1017746.50 microns.
Total metal2 layer route length = 463265.70 microns.
Total metal3 layer route length = 0.00 microns.
Total route length = 1481012.19 microns.
Total number of vias = 2157.
Total number of segments = 10524.
Reading transistor view ...
Total number of 207467 transistors.
0.446 Square mils per Transistor.
2.244 Transistors per square mil.
Power Dissipation = 1777694.500 micro-watts.
Figure 15. Line Manager Geostat Information. [Epoch output]
E. PREDICTOR
The purpose of this module is to calculate the Next
Address (stored in NAR) based on the Most Recent Memory Address
(MRMA) and the Current Address (in the CAR). The prediction
calculation is
NAR = 2*CAR - MRMA
In this structural implementation of the Predictor, the
predict signal latches the CAR and MRMA registers. The
subtraction is accomplished as a two's complement addition
with a high speed adder.
The CAR is multiplied by two with an arithmetic shift left
of one bit. The most significant bit of the CAR is not retained,
as it will not have an effect on the 27-bit output of the
adder. This will adversely affect address prediction only
around the midpoint of the four gigabytes of memory. The
applicable Golden Rule of computer design "is to make the
common case fast: In making a design tradeoff, favor the
frequent case over the infrequent case." (Hennessy, 1990)
A number is negated in two's complement by inverting all
the bits and adding '1'. The MRMA is negated by inverting all
its bits. Adding the required '1' is implemented as a
Carry-In to the adder.
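The shift, inversion and carry-in can be checked against the plain arithmetic form of the equation. The Python below is a sketch for that purpose only. It assumes that both operands are handled as 27-bit values, matching the stated adder width; how those 27 bits map onto the 32-bit bus address is not modeled, and the names WIDTH, MASK and predict_nar are illustrative.

```python
WIDTH = 27                      # adder output width stated in the text
MASK = (1 << WIDTH) - 1

def predict_nar(car, mrma):
    """NAR = 2*CAR - MRMA, built from a shift, an inversion and a carry-in."""
    shifted = (car << 1) & MASK     # shift left by one; the old MSB is dropped
    inverted = ~mrma & MASK         # invert every bit of MRMA
    carry_in = 1                    # completes the two's complement negation
    return (shifted + inverted + carry_in) & MASK

# The result agrees with ordinary arithmetic taken modulo 2**27.
assert predict_nar(0x1A0, 0x180) == (2 * 0x1A0 - 0x180) % (1 << WIDTH)
```

Because the sum is truncated to the adder width, a descending stride that passes below zero wraps around to the top of the 27-bit range instead of producing a negative value.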
The Epoch TACTIC tool reported the propagation delay from
predict to NAR to be 4.90 ns. The geostat information is
shown in Figure 16.
Bounding Box:
261.900 x 895.824 microns, 234616.293 square microns.
10.311 x 35.269 mils, 363.656 square mils.
Number of Pins = 113.
Number of unique cells = 10.
Number of Blocks = 107
Total Number of Instances = 107
Total number of nets = 230.
Total metal1 layer route length = 12158.68 microns.
Total metal2 layer route length = 15209.06 microns.
Total metal3 layer route length = 0.00 microns.
Total route length = 27367.74 microns.
Total number of vias = 392.
Total number of segments = 1793.
Reading transistor view ...
Total number of 3027 transistors.
0.120 Square mils per Transistor.
8.324 Transistors per square mil.
Power Dissipation = 27722.887 micro-watts.
Figure 16. Predictor Geostat Information. [Epoch output]
F. DATA LIST
This module stores the data retrieved from memory in
anticipation of a request by the CPU. The basic memory cell
is the Epoch part hsramoe (high speed RAM with output enable).
Since each hsram has a maximum word size of 128 bits, there
are two hsram parts in parallel to get the required 256-bit
width.
An upload signal causes the Data List to store the data
on data_line into the address specified by ActiveLine. The
input upload has to be inverted to match the active-low WR
input of the Epoch hsram component. A download signal causes
the Data List to assert onto data_line the data in the address
specified by ActiveLine. This signal also has to be inverted
for the same reason.
Both inverters can probably be removed if the Bus
Interface Unit makes the upload and download signals active
low. That could only improve the response time of the data
memory.
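The arrangement of two 128-bit memories side by side can be pictured with a short software model. The Python below is a sketch and not the Epoch implementation. The class name DataList, the choice of which memory holds the upper half of the line, and the default of 128 lines are assumptions made for the example; the active-low control inputs are not modeled.

```python
HALF_BITS = 128
HALF_MASK = (1 << HALF_BITS) - 1

class DataList:
    """Two 128-bit-wide memories addressed together as one 256-bit line."""

    def __init__(self, lines=128):       # line count is an assumption
        self.upper = [0] * lines         # bits 255..128 of each line
        self.lower = [0] * lines         # bits 127..0 of each line

    def upload(self, active_line, data_line):
        """Store a 256-bit value at the address given by active_line."""
        self.upper[active_line] = (data_line >> HALF_BITS) & HALF_MASK
        self.lower[active_line] = data_line & HALF_MASK

    def download(self, active_line):
        """Return the 256-bit value stored at active_line."""
        return (self.upper[active_line] << HALF_BITS) | self.lower[active_line]
```

Both memories receive the same address, so a single ActiveLine value selects one full 256-bit line.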
Epoch calculated the following timing delays:
download -> hsramoe.DOUT 2.3 ns
ActiveLine -> hsramoe.DOUT 7.3 ns
A design alternative is to use the regular speed version,
ramoe, which gives the following timing delays:
download -> ramoe.DOUT 4 ns
ActiveLine -> ramoe.DOUT 16 ns
Using this slower RAM is possible, but would require a
significant modification to the PRC behavior to handle the
longer delay and would add a cycle delay to CPU reads when
there is a hit in the PRC.
Putting the VerilogOut file of this module into the
original PRC behavioral model for mixed-mode simulation caused
a timing error that had to be corrected in the Bus Interface
Unit behavioral model. After an upload to the Data List,
data_line must remain valid long enough to meet the data hold
time requirement of the Epoch part hsramoe. The geostat
information is shown in Figure 17.
Bounding Box:
3834.792 x 3222.936 microns, 12359289.299 square microns.
150.976 x 126.887 mils, 19156.938 square mils.
Number of Pins = 282.
Number of unique cells = 3.
Number of Standard cells = 2
Number of Blocks = 2
Total Number of Instances = 4
Total number of nets = 269.
Total metal1 layer route length = 198805.54 microns.
Total metal2 layer route length = 52952.76 microns.
Total metal3 layer route length = 0.00 microns.
Total route length = 251758.30 microns.
Total number of vias = 728.
Total number of segments = 2422.
Reading transistor view ...
Total number of 214712 transistors.
0.089 Square mils per Transistor.
11.208 Transistors per square mil.
Power Dissipation = 2181481.250 micro-watts.
Figure 17. Data List Geostat Information. [Epoch output]
G. BUS INTERFACE
This module connects the PRC with the system bus. It
handles the protocol of data transfer in and out of the PRC.
When this module receives a fetch signal, it latches the
address in the NAR and requests the bus for a burst read. It
stores the incoming data until all four beats have been
received. Then it uploads the data into the Data List and
asserts fetch_done. If there is a parity error during the
fetch, the Bus Interface informs the Controller by asserting
fetch_abort. Also, the transaction is canceled.
When this module receives a send signal, it sends a
cancel signal (CANX) to the memory module, downloads data from
the Data List and then sends the data to the CPU. When the
transfer is finished, it asserts send_done.
The coordination of these activities is accomplished
through the use of two Finite State Machines. One acts as an
address bus master. The other controls the flow of data. The
geostat information is shown in Figure 18.
Bounding Box: -6264, -6408, 2246040, 1972980.
2252.304 x 1979.388 microns, 4458183.285 square microns.
88.673 x 77.929 mils, 6910.198 square mils.
Number of Pins = 448.
Number of unique cells = 56.
Number of Standard cells = 1393
Number of Sub-Glues = 1
Total Number of Instances = 1394
Total number of nets = 1843.
Total metal1 layer route length = 676479.94 microns.
Total metal2 layer route length = 469079.94 microns.
Total metal3 layer route length = 0.00 microns.
Total route length = 1145559.87 microns.
Total number of vias = 9679.
Total number of segments = 44298.
Reading transistor view ...
Total number of 24403 transistors.
0.283 Square mils per Transistor.
3.531 Transistors per square mil.
Power Dissipation = 237269.750 micro-watts.




The most significant large-scale test of the structural
model is the Prediction Test, which is similar to the
Prediction Test of the behavioral model. The test runs the
same series of CPU transactions to exercise all functional
blocks of the PRC. The sequence of events for the Prediction
Test is included in the sequencer4.v file.
The following steps are required to conduct a test:
1. Change directories (cd) to the .../verilog/hardware/
directory on the Computer Center (CC) system.
2. At the Unix command prompt, enter the command
verilog -f verilog_arguments.
The Verilog-XL output of the test is included in the
appendices. This test shows that the structural model of the
PRC performs the desired functions. The output of the
structural model test is different from the output of the
behavioral model test mainly because the new structural model
does not contain the same display commands. These commands
interfere with the Epoch compilation of the modules. Other
display commands were added to the Testbench, which is still
a behavioral model. The displays are sufficient to show that
the PRC performs as desired.
While compiling the source files, Verilog-XL reports four
warnings about implicit wires having no fanin. These wires
are labeled NC0 and NC1, deriving their initials from
"not-connected." They are unused outputs on a couple of Epoch
parts. Therefore, these warnings can be ignored.
The section with comments about SDF Annotation is the
result of incorporating the Epoch timing analysis into the
Verilog model. Once that annotation is complete, the actual
simulation begins.
The error messages at the beginning of the simulation can
be ignored. These error messages are generated by Epoch parts
and indicate improper signal values or timing. All these
errors occur before the system hard reset and are expected.
Having those errors after the system hard reset would have
indicated a real problem.
Once the system has reset, the CPU starts its series of
transactions, beginning with reads from addresses 00h and 20h.
The comment "PRC requested the bus" indicates that the PRC is
prefetching data. It appears that the prefetch occurs before
the start of the second CPU transaction, but in reality it
occurs just after the second CPU address tenure, which is not
shown in the report. The data prefetched by the PRC is also
not shown, because of the limitation on display commands
within the PRC. That the data is correct can be seen later in
the report, when the PRC sends the data to the CPU.
During the CPU to Memory transactions, there are 60 ns
between each of the four beats of data. When the CPU reads
from address 40h, the speed advantage of the PRC is
demonstrated. Note that there are now only 15 ns between each
beat. That is the period of the system clock and is therefore
the maximum possible rate at which the CPU can receive data.
The write to address 1C0h occurred after the PRC had
prefetched that data. The PRC should have flushed the
prefetched data, because it was no longer valid. Later, when
the CPU performs a read from the same address, it can be seen
from the read data and from the timing (60 ns per beat) that
the CPU is getting the data from main memory. In accordance




The three primary design tools used in the development of
this PRC were Verilog-XL, cWaves and Epoch. This chapter
describes some of the particularly useful features of these
tools and gives some tips for using these tools together.
A. VERILOG-XL
Verilog-XL allows the modeling of circuits in a
programming language. Circuits can be modeled by behavior or
structure. For the complex design of the PRC, it was
convenient to start by dividing the design into six blocks and
then using Verilog to model the behavior of each block. This
allowed clarification of the required behaviors, deferring the
search for hardware solutions until after the desired
behaviors were well defined.
Currently, Verilog-XL is available only on the Computer
Center (CC) network. The following steps make it easier to
use from an Electrical and Computer Engineering (ECE)
workstation:
1. Add the following line to the .cshrc file in the ECE
account: alias rcc 'xhost in50204.cc.nps.navy.mil;
rlogin -l <username> in50204.cc.nps.navy.mil'.
2. Re-source the session by typing "sc <return>".
3. Type "rcc <return>" to log into the CC account.
4. Add the following line to the .cshrc file in the CC
account: alias remote3 'setenv DISPLAY
sun3.ece.nps.navy.mil:0.0'. The .cshrc file can
contain similar lines for other workstations.
5. Re-source as in Step 2.
Now the ECE workstation becomes the display for the CC
workstation. Typing "filemgr &" will call up the CC file
manager.
Typing "verilog <return>" should give a list of options
for use with Verilog-XL and will verify access to the program.
One particularly useful option is to put all the arguments in
a file, such as verilog_arguments, and put the following line
in the CC .cshrc file:
alias veri 'verilog -f verilog_arguments'
Typing "veri" is much easier than listing the names of all the
files that need to be included in the simulation.
The Cadence online documentation can be accessed with the
command "openbook &". The Main Menu is the starting point.
The Alphabetical List on the bottom is the easiest way to find
the desired information. In this list there is a Verilog-XL




This tool is indispensable for the analysis of
complicated circuits. There is nothing like seeing a timing
diagram to track down design errors.
The database for the cWaves Viewer is created while
running the Verilog simulation. The highest level Verilog






where <name> is the instance name of the module to be
observed. More information about these $shm commands can be
found in the cWaves Reference Manual, which is a little
difficult to find. It is in the Cadence Online Library
accessed with "openbook & <return>". Once the Main Menu
appears, select the Alphabetical List on the bottom. The
cWaves Reference Manual is filed under Composer (Schematic
Entry), Design Framework II. Section 4 of this manual is
particularly useful.
C. EPOCH
A circuit designer would find it very convenient if Epoch
would take as input the raw behavioral models, but it does
not. Each behavioral block must be converted into a
structural model. Then, Epoch can automatically generate a
Very Large Scale Integrated (VLSI) circuit layout using a rule
set from a specific manufacturer. From the layout, Epoch
performs a timing analysis of the circuit and generates a new
Verilog file, which includes the timing information. This new
file then can replace the behavioral model for resimulation
with Verilog-XL. This allows the designer to verify each
block as it is designed. cWaves can be used to track down
timing errors.
Epoch is available on the ECE system. To access Epoch,
add "/tools3/epoch/bin" to the "set path" command in the
.cshrc file. Also, add "setenv CASCADE /tools3/epoch".
The Epoch User's Tutorial and the Epoch Verilog Interface
Reference are both very useful. The former is located at
/tools3/epoch/data/examples/tutorial. The latter can be
accessed through pull-down menus in Epoch:
Help => On-Line Manual ...
Sometimes calling up this manual causes a FrameViewer error,
but the manual does come up after a slight delay.
The VerilogOut option proved very useful in the
development of the PRC. With this option, Epoch creates a new
Verilog file after laying out a design. The new model can be
inserted in place of the old behavioral model for simulation
with Verilog-XL. The Verilog Interface reference describes
how this is done. In addition to the procedures described
there, it will be necessary to take a few extra steps.
1. If the files must be moved from the vout directory to
another directory for simulation with Verilog-XL,
correct the $sdf_annotate path in the .v file.
2. In all the behavioral files, add a `timescale
directive like the one in the .v file generated by
Epoch. This must appear before the "module"
statement.
3. It may be necessary to copy primelib.v from
/tools3/epoch/data/verilog into the CC directory.
The PowerPC uses bit zero as the most significant bit of
buses, so it was convenient to follow that convention in this
PRC design. For example, the PowerPC address bus is
designated A[0:31]. Unfortunately, this causes a problem with
the VerilogOut program, which reorders some of the indices and
connects buses in reverse order. This problem seems to be
unique to the VerilogOut file generation. The physical layout
itself gets connected correctly regardless of the index
numbering convention. Resolving this problem required
renumbering the indices of all modules used for Epoch input so




VI: CONCLUSIONS AND RECOMMENDATIONS
A. CONCLUSIONS
In conclusion, the objective of this research has been
met. This thesis presents a complete hardware design for the
PRC. The simulation results show that the PRC can deliver
data to the CPU at the rate of 8-1-1-1, that is, eight cycles
for the first beat and one cycle for each of the remaining
three beats. This performance is better than the performance
of main memory (8-3-3-3). With a little more work on the
design, the PRC should be able to deliver data at a rate of
4-1-1-1.
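The burst patterns quoted above can be compared by adding up the cycles in each. The Python below is a small worked example, not a simulation result. It assumes that every cycle is one 15 ns system clock period, the value given in the discussion of the structural Prediction Test; the function name burst_cycles is illustrative.

```python
CLOCK_NS = 15   # system clock period, from the structural Prediction Test

def burst_cycles(pattern):
    """Total cycles for a four-beat burst written as, e.g., '8-1-1-1'."""
    return sum(int(beat) for beat in pattern.split("-"))

for label, pattern in [
    ("main memory", "8-3-3-3"),
    ("PRC, current design", "8-1-1-1"),
    ("PRC, projected", "4-1-1-1"),
]:
    cycles = burst_cycles(pattern)
    print(f"{label}: {pattern} = {cycles} cycles, {cycles * CLOCK_NS} ns")
```

Under these assumptions a full burst takes 17 cycles from main memory, 11 cycles from the current PRC design and 7 cycles at the projected rate.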
Aguilar proposed a design consisting of six modules which
together would comprise the PRC. He took a bottom-up
approach, designing four of those six modules, testing each
independently, but not together (Aguilar, 1995). As a result,
the designs of these modules require modifications to enable
them to function correctly together. Rather than redesigning
the four modules, the approach taken during this research was
top-down. That is, a single working behavioral model was
divided into six behavioral models that functioned together,
and then each of the six behavioral models was converted into
a hardware model. The result is still a six-module design,
but the six modules of this design have different functions
than the six modules of the design by Aguilar. The top-down
approach worked exceedingly well to clarify the design and to
minimize inter-module signal problems.
This research required a total of three academic
quarters. The work during the first quarter primarily
involved studying the problem, analyzing the design
requirements, and learning about the PowerPC system. Two more
quarters were required for the creation of the design, one
quarter each for the behavioral design phase and the
structural design phase.
Epoch and Verilog-XL proved reliable and highly useful
during the development of this hardware design. Verilog-XL
performed the simulations necessary to verify the design.
Epoch performed the VLSI circuit layout and timing analysis
that were required by Verilog-XL in order to produce
simulation results that could be considered accurate.
Simulations with Verilog-XL are conveniently short while
testing small modules. However, simulations of the entire PRC
design typically ran for half an hour on a Sun SPARC-10
workstation. Similarly, on small designs Epoch runs fast enough
that a user could wait at the workstation. To compile
complex modules, Epoch requires much more time. For example,
Epoch takes over an hour to compile the Bus Interface of the
PRC and more than three hours to compile the entire PRC.
Both Verilog-XL and Epoch have functions and options
which are not readily apparent. That problem is compounded by
inadequate indexes in the user's manuals for each of these
tools. On the other hand, the tutorials are very helpful for
revealing some of those functions and options.
Some of the options in Epoch require significant studying
before use. The pull-down menus in Epoch could be better
organized. Both of these characteristics work to make Epoch
less user-friendly than it should be.
B. RECOMMENDATIONS
As with any complex design, there is much more that a
designer could do to improve this PRC. This section describes
some areas of potential future research related to this
hardware design.
The first recommendation is to consider including the
Arbiter on the PRC chip. This PRC design was developed for a
PowerPC-603 microprocessor system, in which both the PRC and
the CPU are candidates for bus mastership. This requires that
there be a bus arbitration unit to prevent both devices from
trying to use the bus simultaneously. The bus arbitration
unit is a simple device whose function can be fulfilled with
a single finite state machine (FSM). It would be very easy to
add this FSM to the PRC chip, eliminating the requirement to
fabricate a separate integrated circuit chip.
The second recommendation concerns improving the
Line Manager design. The Line Manager is the block that
requires the wait states in the Controller State Diagram. The
impact of these wait states is a delay of three cycles in
determining if there is a hit within the PRC. Finding a way
of eliminating these wait states could improve the speed at
which the PRC delivers the first beat of data to the CPU and
the speed at which the PRC prefetches data from main memory.
Specifically, the performance would improve from 8-1-1-1 to
5-1-1-1. There is a strong chance that Epoch would prove useful
in this endeavor. Epoch has timing analysis routines and can
perform layouts in such a way as to minimize propagation
delays for critical signals. Epoch also has automatic buffer
sizing algorithms which could be used to ensure the output
signals of each part are buffered sufficiently to drive their
55
loads. These capabilities of Epoch do require considerable
CPU time. For example, running an automatic compilation on
the current design of the Bus Interface Unit takes over an
hour of actual CPU time on a Sun SPARC 10 workstation if the
buffer sizing option is selected.
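The benefit of removing the wait states can be put in numbers. The Python sketch below totals the clock cycles of each burst pattern and converts them to time. It assumes the 15 ns bus clock period (66 MHz) used by the Testbench in Appendix B; the function names are illustrative only.

```python
CLOCK_PERIOD_NS = 15  # assumed: 66 MHz bus clock, as in the Testbench


def burst_cycles(pattern: str) -> int:
    """Total clock cycles for a burst pattern such as '8-1-1-1'."""
    return sum(int(beat) for beat in pattern.split("-"))


def first_beat_cycles(pattern: str) -> int:
    """Cycles until the first beat of data is delivered."""
    return int(pattern.split("-")[0])


if __name__ == "__main__":
    for pattern in ("8-1-1-1", "5-1-1-1"):
        total = burst_cycles(pattern)
        first = first_beat_cycles(pattern)
        print(pattern, "first beat:", first * CLOCK_PERIOD_NS, "ns,",
              "full burst:", total * CLOCK_PERIOD_NS, "ns")
```

Under these assumptions the full four-beat burst shortens from 11 cycles (165 ns) to 8 cycles (120 ns), and the first beat arrives after 5 cycles (75 ns) instead of 8 cycles (120 ns).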
The next recommendation is to study the rest of the
design for critical paths. With Epoch as an analysis tool, it
should be straightforward to analyze the entire PRC for critical
timing paths. Some timing limitations may be improved through
the buffer-sizing and timing-critical layout capabilities of
Epoch. Other timing limitations may require modifying the
design. The current PRC design includes only parts that were
available in the Epoch library. It may be possible to design
parts that outperform the Epoch parts.
The final recommendation regards fabrication. If the PRC
design detailed in this thesis is to be fabricated, it must
undergo two steps. First, the power rails should be studied
using Epoch to determine if there is a requirement for
additional power and ground rails. Second, the design must be
put inside a pad ring. Epoch may be able to create the pad
ring automatically with minimal intervention by the designer.
APPENDIX A. LAYOUTS
This appendix contains the VLSI (Very-Large-Scale-Integrated)
circuit layouts for the PRC. These layouts were






The PRC expanded to the first level. The four
blocks in the lower left corner, in order of
decreasing size, are the Bus Interface,




Figure A2. The PRC fully expanded. [Epoch output]

Figure A3. The Controller. [Epoch output]







































The Line Manager expanded one level. The bottom






The Line Manager. Detail of the bottom portion
in the previous figure. [Epoch output]
Figure A8. The Predictor fully expanded. [Epoch output]

Figure A9. The Data List fully expanded. [Epoch output]

Figure A10. The Bus Interface fully expanded. [Epoch output]




















APPENDIX B. TESTBENCH VERILOG FILES
This appendix contains the Verilog files for the
Testbench. They are all behavioral models, used together to
test the PRC design. The files are located on the Computer






* Author: Joseph R. Robert, Jr.
* Date: 24AUG95
* Revised: 10JAN96
* Purpose: This module is the highest level in the design hierarchy. It
* emulates a complete computer system, composed of
* 1. cpu: a PowerPC-603 microprocessor.
* 2. ram: random access memory.
* 3. arbiter: the bus arbitration unit.
* 4. prc: the predictive read cache under design.
*
* System configuration and features:
* Single CPU
* 64-bit data bus
* No out-of-order split-bus transactions.
* Synchronous interface: all I/O sampled on rising edge of bus clock.
* 66 MHz system clock, 66 MHz CPU clock.
* Simulation should be done with a time unit = 1 ns.
module testbench;
// Signal Declarations - conforms to PowerPC-603 notation
// Address Arbitration
wire CPU_BR_, //Bus Request
CPU_BG_; //Bus Grant
tri1 ABB_; //Address Bus Busy
tri1 TS_; //Transfer Start (memory only, not I/O)
// Address bus
wire [0:31] A; //Address (note Motorola's reverse notation)
wire [0:3] AP; //Address Parity
wire APE_; //Address Parity Error
// Transfer attributes
wire [0:4] TT; //Transfer Type
wire [0:2] TSIZ; //Transfer Size
wire [0:1] TC; //Transfer Code






tri1 AACK_; //Address Acknowledge
reg ARTRY_; //Address Retry
// Data Arbitration
wire CPU_DBG_; //Data Bus Grant
reg DBWO_; //Data Bus Write Only
tri1 DBB_; //Data Bus Busy
// Data Transfer
wire [0:63] D; //Data
wire [0:7] DP; //Data Parity
wire DPE_, //Data Parity Error
DBDIS_; //Data Bus Disable
// Data Termination
tri1 TA_; //Transfer Acknowledge
reg DRTRY_; //Data Retry
reg TEA_; //Transfer Error Acknowledge
// System control
reg HRESET_; //Hard Reset
wire PRC_BR_; //PRC Bus Request
wire CANX;
//Declare variables, constants, parameters








DBWO_ = hi; //Limits CPU to in-order transactions.
TEA_ = hi; //Only asserted for nonrecoverable bus error events.
ARTRY_ = hi; //Retries used only with multiprocessor or multi-
DRTRY_ = hi; // level memory systems.
HRESET_ = hi;
end
//define system clock, 66 MHz, T = 15 ns.
reg clk;
initial clk = 1;
always
begin

















#5 HRESET_ = low; //Reset entire system.
#5 HRESET_ = hi;
//#4000;
//$shm_probe(PRC1,"AS");













* Purpose: This module emulates the PowerPC-603 microprocessor. Note that
* most signals are active low. This makes it slightly more difficult to work
* one's way through all the double negatives in this code's conditional
* statements, but makes it much easier to correlate against the timing diagrams
* in the PowerPC-603 User Manual. This model uses the same notations for
* signals that connect to other modules.
* This module uses the sequencer module to determine the operations the CPU
* will perform. This model of the PowerPC-603 is capable of performing reads,
* writes, burst reads, and burst writes. It handles bus arbitration just like
* the '603 including the pipelined address tenures. Please refer to the
* PowerPC-603 User Manual for a detailed description of the nature and timing

























//declare variables, constants, parameters







reg [0:31] addr_reg, address [0:1];
reg [0:31] a_reg;
assign A = a_reg;
reg [0:3] ap_reg, addr_parity_in, addr_parity_calc;
assign AP = ap_reg;
//Data related
reg [0:63] data [0:1];
wire [0:63] seq_data;
reg [0:63] d_reg, load_data, data_reg;
assign D = d_reg;
reg [0:255] line_reg, line [0:1];
wire [0:255] seq_line;
reg [0:7] dp_reg, d_parity_in, d_parity_calc;
assign DP = dp_reg;
//Other external control signals
reg Transfer_start [0:1];
reg abb_reg_, dbb_reg_, ts_reg_, tbst_reg_;
assign ABB_ = abb_reg_;
assign TS_ = ts_reg_;
assign DBB_ = dbb_reg_;
assign TBST_ = tbst_reg_;
reg [0:4] Transfer_type [0:1];
wire [0:4] seq_Transfer_type;
reg [0:4] tt_reg;
assign TT = tt_reg;
parameter //for Transfer_type
none = 5'bz,
write = 5'b00010, //02
write_atomic = 5'b10010, //12
read = 5'b01010, //0A
read_atomic = 5'b11010, //1A
burst_write = 5'b00110, //06
burst_read = 5'b01110, //0E
burst_read_atomic = 5'b11110; //1E





assign TSIZ = tsiz_reg;
reg [0:1] Transfer_code [0:1];
wire [0:1] seq_Transfer_code;
reg [0:1] tc_reg;





reserved = 2'b11;







































































// *** 1. Address bus arbitration
always @(negedge need_bus_trigger_)
need_bus_ = low;
//Parked means that the CPU can take the bus as soon as it needs it.
assign parked = (!BG_ & ABB_ & ARTRY_);
//If CPU needs bus, it needs to assert BR_ only if not parked.
always @(posedge clk)
if (BR_ == hi)
BR_ = #7 ~(need_bus_ == low & parked == FALSE);





abb_reg_ = #7 low;
AB_Master = TRUE;
BR_ <= #1 hi;
need_bus_ <= #2 hi;
end
//







addr_parity_calc[2] <= ~^addr_reg[16:23];
addr_parity_calc[3] <= ~^addr_reg[24:31];
ts_reg_ = #7 low;
Transfer_start[pp] <= TRUE;





if (Transfer_type[pp] == burst_read
|| Transfer_type[pp] == burst_write)
tbst_reg_ <= low;
//insert other address transfer characteristics here.
end
always @(posedge clk)
if (AB_Master & TS_ == low)
begin















//insert other addr transfer characteristics here.
abb_reg_ <= #2 hi;






assign qual_DBG_ = ~(!DBG_ & DBB_ & DRTRY_);
always @(posedge clk)
begin





#2 dpp = ~dpp;
case(Transfer_type[dpp])
none: begin end
//Note: TS is an implied data bus request. CPU can assume mastership if it
//has a qualified data bus grant.
read: begin
//wait for qualified data bus grant and transfer start.
wait(qual_DBG_ == low & Transfer_start[dpp]);
@(posedge clk) //assume data bus mastership
dbb_reg_ <= #7 low;
















$display("CPU read %h from address %h.",
data[dpp],address[dpp]);
$display(" Completed at time %d",$time);
end
dbb_reg_ = #4 hi;
dbb_reg_ = #8 'bz;
if (d_parity_in != d_parity_calc)
begin
$display("CPU: data parity error.");
$display(" Calculated parity: %b",
d_parity_calc);








d_parity_calc[2] <= ~^data_reg[16:23];





//wait for qualified data bus grant and transfer start.
wait(qual_DBG_ == low & Transfer_start[dpp]);
@(posedge clk) //assume data bus mastership




d_reg <= #7 64'bz;




$display("CPU wrote %h to address %h.",
data[dpp],address[dpp]);
$display(" Completed at time %d",$time);
end
dbb_reg_ = #4 hi;
dbb_reg_ = #8 'bz;
end
burst_read: begin
//wait for qualified data bus grant and transfer start.
wait(qual_DBG_ == low & Transfer_start[dpp]);
@(posedge clk) //assume data bus mastership
dbb_reg_ <= #7 low;
if (trace)

















#2 if (d_parity_in != d_parity_calc)
begin
$display("CPU: data parity error.");
$display(" Calculated parity: %b",
d_parity_calc);







dbb_reg_ = #4 hi;
dbb_reg_ = #8 'bz;
end
burst_write: begin
//wait for qualified data bus grant and transfer start.
wait(qual_DBG_ == low & Transfer_start[dpp]);
if (trace)
$display("CPU started write to address %h at time %d.",
address[dpp],$time);
@(posedge clk) //assume data bus mastership














$display(" CPU write beat 1: %h at %d",d_reg,$time);




d_parity_calc[2] <= ~^data_reg[16:23];
d_parity_calc[3] <= ~^data_reg[24:31];
d_parity_calc[4] <= ~^data_reg[32:39];




#7 d_reg = line_reg[64:127];
#1 if (trace)
$display(" CPU write beat 2: %h at %d",d_reg,$time);
@(transfer_acknowledged); //second beat done
data_reg = line_reg[128:191];
d_parity_calc[0] <= ~^data_reg[0:7];
d_parity_calc[1] <= ~^data_reg[8:15];
d_parity_calc[2] <= ~^data_reg[16:23];
d_parity_calc[3] <= ~^data_reg[24:31];
d_parity_calc[4] <= ~^data_reg[32:39];




#7 d_reg = line_reg[128:191];
#1 if (trace)
$display(" CPU write beat 3: %h at %d",d_reg,$time);












#7 d_reg = line_reg[192:255];
#1 if (trace)
$display(" CPU write beat 4: %h at %d",d_reg,$time);
@(transfer_acknowledged); //fourth beat done
d_reg <= #7 64'bz;
dp_reg <= #7 8'bz;
line_reg <= #7 256'bz;
Transfer_type[dpp] <= #7 none;
Transfer_code[dpp] <= #7 reserved;
Transfer_start[dpp] <= #7 FALSE;
dbb_reg_ = #4 hi;
dbb_reg_ = #8 'bz;
end
default: $display("CPU module has bad TT[%b] = %b",dpp,





* BUS ARBITRATION UNIT
* Filename: arbiter.v
* Author: Joseph R. Robert, Jr.
* Date: 24AUG95
* Revised: 10JAN96
* Purpose: This module emulates the system's external bus arbitration unit.
* It is implemented as a Finite State Machine.
* There are only two possible bus masters in this system: the CPU and the PRC.
* Also, the address bus and data bus are each arbitrated for independently,
* though the data bus arbitration occurs after the corresponding address bus
* arbitration.
* If a unit wants the address bus, it asserts BR_. If the bus is available,
* the arbiter asserts BG_ back to that unit, which can then take mastership by
* asserting ABB_. When it is done with the address bus, it negates ABB_.
* It is assumed that if a unit wants the address bus, it will also want the
* data bus. "Address only" transactions will not occur in this system, since
* there is no external cache or multiprocessor. Therefore, after asserting
* BG_ to the requesting unit, the arbiter asserts DBG_ on the next cycle.
* BG_ and DBG_ are both asserted until the requesting unit takes mastership,
* unless the requesting unit withdraws its request by negating BR_.
* If there are no pending bus requests, the arbiter "parks" the CPU by
* granting it the busses. This reduces memory access time for the CPU. If the
* CPU is parked, and then the PRC requests the bus, the CPU is unparked, and
* the arbiter can then grant the bus to the PRC.
* The PowerPC can conduct a second address tenure long before the first data
* tenure is complete. This pipelining has a maximum depth of two transactions,
* meaning that a third address tenure will not start before the first data
* tenure is complete. The Memory Unit in this Testbench is capable of handling
* that situation. However, adding the PRC to the system creates the
* possibility that the PRC will initiate a third address tenure before the
* first of two CPU transactions is complete. This situation is handled by this
* Arbiter, which keeps track of the pipelining depth. It will not grant the
* address bus to any unit if that address tenure would put a third transaction
* in the pipeline. Rather, the arbiter will stall until the data tenure from





output CPU_BG_, CPU_DBG_, PRC_BG_, PRC_DBG_;
input CPU_BR_, PRC_BR_, ABB_, DBB_, clk;
reg CPU_BG_, CPU_DBG_, PRC_BG_, PRC_DBG_;
wire CPU_BR_, PRC_BR_, clk;
//Declare variables, constants, parameters




reg [1:0] requests; //concatenated input signals
reg [1:0] depth;
tri stall;
//Finite State Machine variables and parameters
reg [2:0] state, next_state;




















//Track depth of pipeline
always @(posedge ABB_)
begin




depth = depth - 1;
end














@(posedge clk) requests = {CPU_BR_,PRC_BR_};
case (requests)
2'b00: next_state = grant_cpu_a;
2'b01: next_state = grant_cpu_a;
2'b10: next_state = grant_prc_a;



















@(posedge clk) requests = {CPU_BR_,PRC_BR_};
case (requests)
2'b00: next_state = park_cpu;
2'b01: next_state = park_cpu;
2'b10: next_state = grant_cpu_d;









@(posedge clk) requests = {CPU_BR_,PRC_BR_};
case (requests)
2'b00: next_state = park_cpu;
2'b01: next_state = park_cpu;
2'b10: next_state = grant_prc_a;


















@(posedge clk) requests = {CPU_BR_,PRC_BR_};
case (requests)
2'b00: next_state = wait_for_prc;
2'b01: next_state = grant_cpu_d;
2'b10: next_state = wait_for_prc;










@(posedge clk) requests = {CPU_BR_,PRC_BR_};
case (requests)
2'b00: next_state = grant_cpu_a;
2'b01: next_state = grant_cpu_a;
2'b10: next_state = grant_prc_a;
2'b11: next_state = grant_cpu_a;
endcase
end





* RANDOM ACCESS MEMORY
* Filename: memory.v





* Purpose: This module emulates the system's main memory. For simulation
* efficiency, the memory has only enough physical address space for four burst
* reads. Thus, 128 bytes. The address bus width allows a virtual address space
* of 4 G-bytes. Accesses to addresses past the first 128 bytes map to within
* the first 128 bytes.
* The time required for memory accesses is determined by Delay1 and
* Delay2. Delay1 is the delay, in cycles, required for the initial access.
* Delay2 is the delay required for each successive beat of four-beat
* operations. Set them both to for fastest memory response. Set them to 8
* and 3 respectively for realistic memory response of a 60 ns DRAM. Do not set
* Delay2 > Delay1. That will not represent a realistic memory response, and
* will probably cause this module to act weird.
* There is a two-stage pipeline involved with memory accesses, such that a
* memory tenure can be started while the previous data tenure is still active.
* To accomplish this, some signals have [0:1] in their declaration, and are
* indexed using pp and dpp, which are the address pipeline position pointer,
* and the data pipeline position pointer, respectively.
* To keep this model simple, a single-beat read will always return a
* single byte of data, regardless of TSIZ, in byte lane 0, which is different
* from the way the PowerPC really operates. See Table 10-4 on pg. 10-15 of
* the PowerPC-603 Users Manual for actual alignment. This simplification is
* irrelevant to the performance of the PRC which deals only with burst
* operations.
* It is important to note that this memory module had to have one feature
* that is not typical of memory modules. It has a CANX input which cancels the
* current read operation. It is through this signal that the PRC stops the

























reg [0:63] d_reg, data;
assign D = d_reg;
//
//Declare variables, constants, parameters




Size = 128, //Size of memory in bytes.
Length = 7, //Length of physical address in bits.
Delay1 = 8, //Delay for address translation.






read_atomic = 5'b11010,
burst_write = 5'b00110,
burst_read = 5'b01110,
burst_read_atomic = 5'b11110;
reg [0:31] virtual_addr, index;
reg [0:3] addr_parity_calc,addr_parity_in;
reg [0:Length-1] pa_reg, physical_addr [0:1];
reg [0:7]
mem [0:Size-1],
mem_reg; //Memory data register
reg [0:4] Transfer_type [0:1];
reg [0:2] Transfer_size [0:1];
reg burst [0:1];
reg [0:1] i, burst_start;
reg pp,dpp; //current pipeline and data pipeline positions
reg abort;
reg ta_reg_;




















for (index = 0; index<Size; index=index+1)












burst [pp] <= TBST_;








if (addr_parity_in != addr_parity_calc)
begin
$display("Memory: address parity error.");
$display(" Calculated parity: %b",addr_parity_calc);
$display(" Received parity: %b",addr_parity_in);
end















































#2 pa_reg = physical_addr[dpp];
burst_start = pa_reg[Length-5:Length-4];
//align to cache line
pa_reg[Length-5:Length-l] = 5'b00000;
physical_addr[dpp] = pa_reg;
if (!abort) if (Delay1-Delay2-1 >= 0)
repeat(Delay1-Delay2-1) @(posedge clk);
for (index=0; index<4; index=index+1)
begin
if (!abort) repeat(Delay2) @(posedge clk);
if (Delay1-Delay2!=0 || index!=0) @(posedge clk);
if (!abort) begin
#7 ta_reg_ <= low;



















ta_reg_ <= #7 'bz;







//burst-writes are always performed in order
if (Delay1-Delay2>=0)
repeat(Delay1-Delay2) @(posedge clk);
for (index=0; index<4; index=index+1)
begin
repeat(Delay2) @(posedge clk);
#7 ta_reg_ <= low;
i = index;
@(posedge clk) //latch data
data = D;
mem[physical_addr[dpp]+8*i] <= data[ 0: 7];
mem[physical_addr[dpp]+8*i+1] <= data[ 8:15];








ta_reg_ <= #7 'bz;
end
ta_reg_ <= #7 'bz;





default: $display("Memory module received bad TT[%d] = %b",dpp,






APPENDIX C. PRC BEHAVIOR FILES
The files in this appendix are the result of the
behavioral design phase. They include the Verilog behavioral
models of the PRC and the testing results. The files are
located on the Computer Center system at
joshua_u2/jrrobert/thesis/verilog/behavior.
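As orientation for the Predictor model listed in this appendix, the short Python sketch below works through its calculation, NAR = 2*CAR - MRMA, which extends the stride between the two most recent accesses by one more step. It is an illustration under stated assumptions, not the Verilog source: the function names are hypothetical, and it assumes that the 27-bit CAR and NAR hold the address bits above a 5-bit offset within a 32-byte line.

```python
LINE_ADDR_BITS = 27   # CAR and NAR are declared as 27-bit registers
OFFSET_BITS = 5       # assumed: 32-byte lines (four 64-bit beats)
LINE_ADDR_MASK = (1 << LINE_ADDR_BITS) - 1


def predict_nar(car: int, mrma: int) -> int:
    """Next Address Register value: NAR = 2*CAR - MRMA, wrapped to 27 bits."""
    return (2 * car - mrma) & LINE_ADDR_MASK


def predict_byte_address(current: int, previous: int) -> int:
    """Same prediction expressed in byte addresses (hypothetical helper)."""
    car = current >> OFFSET_BITS
    mrma = previous >> OFFSET_BITS
    return predict_nar(car, mrma) << OFFSET_BITS


if __name__ == "__main__":
    # Address pairs taken from Sequence #4 of the Prediction Test sequencer.
    for previous, current in [(0x00, 0x20), (0x180, 0x1A0), (0x20, 0x40)]:
        print(hex(current), "->", hex(predict_byte_address(current, previous)))
```

With reads at 00h and 20h the sketch predicts 40h, and with reads at 180h and 1A0h it predicts 1C0h, matching the expectations listed for Sequence #4 of the sequencer.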
A. PRC
* Predictive Read Cache
* Filename: prc.v




* Purpose: This module emulates the predictive read cache.
*
******************************************************************************/
module prc(CPU_BR_,BR_,BG_,ABB_,TS_,A,AP,APE_,TT,TSIZ,TC,TBST_,AACK_,
DBG_,DBB_,D,DP,DPE_,TA_,HRESET_,CANX,clk);
// Signals are defined in system.v. Notations follow conventions used in





















//declare variables, constants, parameters




//Other internal control signals
wire CAR_latch, predict, snoop_ignore;
wire [0:255] DATALINE;
wire [0:26] CAR; //current address register
wire [0:26] NAR; // next address register













line_mgr LM1(CAR,NAR,HRESET_,a_select,test,fetch_done,flush,store,
new_replace,MRMA,ActiveLine,line_empty,hit);
datalist DL1(DATALINE,ActiveLine,upload,download);
endmodule
B. CONTROLLER
/******************************************************************************
* CONTROLLER
* Filename: controller.v
* Author: Joseph R. Robert, Jr.
* Date: 21DEC95
* Revised: 05JAN96
*
* Purpose: This module is a Finite State Machine which coordinates the actions
* of all the other functional blocks of the PRC. All control signals are
* synchronous with the system clock. HRESET_ causes the Controller to go to
* the IDLE state. See state diagram and state output tables.
******************************************************************************/





output a_select,test,predict,store,flush,send,hold,new_replace,fetch;
reg a_select,test,predict,store,flush,send,hold,new_replace,fetch;
//declare variables, constants, parameters





//Finite State Machine variable and parameters
reg [0:3] state, next_state;
reg [0:2] inputs3;
reg [0:1] inputs2;
reg input1;














state <= idle; //The state variables must be initialized to












#2 state = next_state;
if (trace)


















2'b00: next_state = idle;
2'b01: next_state = test_car_w;
2'b10: next_state = test_car_r;














@(posedge clk) input1 = hit;
case (input1)
1'b0: next_state = is_line_empty;















@(posedge clk) input1 = send_done;
case (input1)
1'b0: next_state = send_data;














@(posedge clk) inputs3 = {hit,read,write};
case (inputs3)
3'b000: next_state = fetch_data;
3'b001: next_state = idle;
3'b010: next_state = idle;
3'b011: next_state = idle; //This should not happen.
3'b100: next_state = idle;
3'b101: next_state = idle;
3'b110: next_state = idle;















@(posedge clk) input1 = fetch_done;
case (input1)
1'b0: next_state = fetch_data;














@(posedge clk) input1 = line_empty;
case (input1)
1'b0: next_state = predict_na;









































@(posedge clk) input1 = hit;
case (input1)
1'b0: next_state = idle;


















$display("state error in module controller.");










* Author: Joseph R. Robert Jr.
* Date: 21DEC95
* Revised: 05JAN96
* Purpose: This module watches the system bus activity, and makes appropriate
* reports to the PRC Controller.
* If the transaction is a data burst read or any kind of write, and if the
* address parity is correct, then the read or write signal is asserted as
* appropriate, and the address is placed in the CAR. The snoop_ignore signal
* tells this unit to ignore the current transaction, because it was initiated
* by the Bus Interface Unit. The snoop_ignore signal must be asserted
* concurrently with the transfer attributes.
* Reads that are not burst or data related are ignored by the PRC. The CAR
* is updated only on transactions relevant to the PRC.
* Due to the two-stage pipelining capability of the PowerPC, with respect to
* memory accesses, a second address tenure can occur shortly after the first,
* well before the first data tenure is complete. To compensate for this, the
* read and write outputs of the Snooper will remain asserted until acknowledged
* by the Controller with hold. The rising edge of hold indicates that the read
* or write signal was received by the Controller. The Snooper can then negate
* these signals, but must leave CAR alone until hold is negated. After hold is















//declare variables, constants, parameters












write = 5'b00010, //02
write_atomic = 5'b10010, //12
read = 5'b01010, //0A
read_atomic = 5'b11010, //1A
burst_write = 5'b00110, //06
burst_read = 5'b01110, //0E








//Other internal control signals





























assign parity_valid = (addr_parity_calc == addr_parity);
//If there is a transaction,
// and that transaction is a data burst read or any kind of write
// and the transaction is not initiated by the PRC itself,
// and if the address parity is correct










#2 valid_read_0 = Transfer_code == data_transfer &
(Transfer_type == burst_read |
Transfer_type == burst_read_atomic);
valid_write_0 = Transfer_type == write |
Transfer_type == write_atomic |
Transfer_type == burst_write;






































* Purpose: This module contains the address list, status flags for each line
* (Valid, Aged), a general status flag (line_empty), the line replacement unit,
* and a couple of pointers (ActiveLine, ReplaceLine).
* The MRMA output is always the MRMA of the ActiveLine. The line_empty
* flag indicates that the currently active line has no addresses in it yet, and
* therefore, cannot be used by the PRC to make a prediction.
* The input a_select determines which address input is used for a particular
* operation. The two address inputs are the CAR and the NAR.
* When the Line Manager receives a test signal, it compares the input address
* with the contents of the PredMA List. If there is a match with the CAR, it
* asserts the hit signal, and changes the ActiveLine pointer to the line number
* of the match.
* If there is a miss with the CAR, then the ActiveLine switches to the same
* line pointed to by ReplaceLine.
* If, during a test, there is a match with the NAR, hit is asserted, and the
* value in ActiveLine is irrelevant since it will not be used. If there is a
* miss with the NAR, the ActiveLine must remain unchanged from the test.
* The fetch_done signal from the Bus Interface Unit causes the NAR to be
* stored in PredMA[ActiveLine], the CAR to be stored in MRMA[ActiveLine], the
* Valid flag to be set, and the Aged flag to be reset.
* The flush signal causes the current ActiveLine to become invalid by setting
* Valid[ActiveLine] = 0.
* The store signal causes the input address to be stored into the MRMA of the
* ActiveLine. This is only used for the first address in a new line. Store












//declare variables, constants, parameters























































#2 i2 = 0;
while (!match & i2 < 128)
if (PredMA[i2] == in_addr & Valid[i2])
match = hi;
else
i2 = i2 + 1;





else if (match & a_select==1) // a match with the NAR
hit <= hi;
else if (!match & a_select==0) //a miss with the CAR
ActiveLine <= ReplaceLine;
else if (!match & a_select==1) //a miss with the NAR



















MRMA_out = MRMA [ActiveLine];
line_empty = 0;
end
* LINE REPLACEMENT UNIT
*
* ReplaceLine always points to the line to be replaced at the next PRC miss.
* As soon as the PRC starts predicting the first address for a line it
* asserts new_replace, and the Line Replacement Unit can then find a new line
* to mark as the next ReplaceLine. It searches sequentially for the next line
* with invalid data and marks that line as the next to be replaced. If all
* lines contain valid data, then it scans for the next line that is "aged",
* indicated by a set Aged flag. As it scans for an aged line, it sets the Aged
* bits in the lines it passes. Therefore, as it wraps around in search of an
* aged line, it will eventually come upon one, even if none were aged when the
* search began.
* All of this occurs while the PRC is fetching data, so it has several clock




for (i3=0; i3<=127; i3=i3+1)
if (!Valid[i3])
temp = FALSE;
#1 all_lines_are_valid = temp;
end





ReplaceLine = ReplaceLine + 1; //mod 128 addition
if(!Valid[ReplaceLine])
done = TRUE;















* Purpose: This module calculates the Next Address (stored in NAR) based on the
* Most Recent Memory Access (MRMA) and the Current Address (in the CAR). The
* prediction calculation is
* NAR = 2*CAR - MRMA
* The calculation is initiated upon each rising edge of the predict signal.














NAR = 2*CAR - MRMA;
if (trace)
begin
$display( "Predictor: NAR = 2*CAR - MRMA");











* Purpose: This module emulates the PRC's Data List.
*
* An upload signal causes the Data List to store the data on data_line into
* the address specified by ActiveLine.
* A download signal causes the Data List to assert onto data_line the data in






//declare variables, constants, parameters







reg [0:255] line [0:127],
line_reg,
data_line_reg;



































* BUS INTERFACE UNIT
* Filename: bus_interface.v
* Author: Joseph R. Robert, Jr.
* Date: 09OCT95
* Revised: 05JAN96
* Purpose: This module connects the PRC with the system bus. It handles
* the protocol of data transfer in and out of the PRC.
* When this module receives a fetch signal, it latches the address in the
* NAR, and requests the bus for a burst read. It stores the incoming data
* until all four beats have been received. Then it uploads the data into the
* Data List and asserts fetch_complete.
* When this module receives a send signal, it sends a cancel signal (CANX) to
* the memory module, downloads data from the Data List, and then sends the data
* to the CPU. When the transfer is finished, it asserts send_done.
* The coordination of these activities is accomplished through the use of a




























//declare variables, constants, parameters








assign A = a_reg;
reg [0:3] ap_reg, addr_parity_calc;
assign AP = ap_reg;
reg [0:1] burst_start;
//Data related
reg [0:255] data_line_reg, data_line;
assign DATALINE = data_line_reg;
reg [0:63] d_reg,data_reg;
assign D = d_reg;
reg [0:7] dp_reg, data_parity_calc, data_parity_in;
assign DP = dp_reg;
//Other external control signals
reg [0:4] tt_reg,Transfer_type;





burst_read_atomic = 5'b11110; //1E
reg [0:2] tsiz_reg;
assign TSIZ = tsiz_reg;
reg abb_reg_,dbb_reg_,ts_reg_,tbst_reg_,ta_reg_;
assign ABB_ = abb_reg_;
assign DBB_ = dbb_reg_;
assign TS_ = ts_reg_;
assign TBST_ = tbst_reg_;
assign TA_ = ta_reg_;
//Other internal control signals
reg [0:2] i; //counter




assign DPE_ = ~Data_Parity_Error;
event transfer_acknowledged,start_send;
//Finite State Machine variable and parameters
reg [0:3] state, next_state;
reg [0:1] inputs2;
reg input1;
parameter idle = 0,













































input1 <= 'bz;
end
//ADDRESS BUS ARBITRATION





abb_reg_ = #2 low;
AB_Master = TRUE;



























if (AB_Master & TS_ == low)
begin
















//insert other addr transfer characteristics here.
abb_reg_<=#2 hi;




//DATA BUS ARBITRATION FOR FETCHES











data_parity_calc[2] <= ~^data_reg[16:23];
data_parity_calc[3] <= ~^data_reg[24:31];
data_parity_calc[4] <= ~^data_reg[32:39];






//wait for qualified data bus grant and transfer start.
wait(qual_DBG_ == low & Transfer_start);
@(posedge clk) //assume data bus mastership








#2 if (trace) $display(" BIU: %h at %d",data_reg,$time);
#2 if (data_parity_in != data_parity_calc)
begin
$display("BIU: data parity error.");
$display(" Calculated parity: %b",
data_parity_calc);







if (i==0) data_line[ 0: 63] = data_reg;
if (i==1) data_line[ 64:127] = data_reg;
if (i==2) data_line[128:191] = data_reg;






dbb_reg_ = #4 hi;
dbb_reg_ = #8 'bz;
end







j = burst_start+i; //j is mod 4
if (j==0) data_reg = data_line[ 0: 63];
if (j==1) data_reg = data_line[ 64:127];
if (j==2) data_reg = data_line[128:191];
if (j==3) data_reg = data_line[192:255];
d_reg = data_reg;







ta_reg_ <= #7 'bz;
d_reg <= #7 64'bz;























@(posedge clk) inputs2 = {send,fetch};
case (inputs2)
2'b00: next_state = idle;
2'b01: next_state = fetch1;
2'b10: next_state = send1;

















//1. Wait for all data to be received.
@(posedge clk) input1 = Transfer_in_progress;
case (input1)
1'b0: next_state = fetch3;



















//3. Download data from the data list.
download <= hi;










@(posedge clk) input1 = {send_done};
case (input1)
1'b0: next_state = send2;






$display("state error in module bus_interface.");






* Transaction Sequencer - Prediction Test
* Filename: sequencer4.v




* Purpose: This is one in a set of modules which perform a sequence of CPU
* transactions. This sequencer causes a series of CPU operations that provide
* a comprehensive test of the PRC. It demonstrates a majority of the PRC's
* capabilities, showing when the Line Manager selects new lines, when and how
* the Predictor functions, when the CPU starts a read or write and the data
* involved. It shows when the Bus Interface Unit fetches data from memory.
* The DataList reports the flow of data in and out of it. The only significant
* behavior not exercised by this test is the function of the Line Replacement
* Unit when the PRC is full. That is handled with Sequencer #5.
* Sequence #4:
* burst_read 00h
* burst_read 20h - PRC should predict 40h and fetch data.
* burst_read 180h - PRC should start a new line.
* burst_read 1A0h - PRC should predict 1C0h.
* burst_read 40h - already in PRC, should predict 60h.
* burst_write 1C0h - should flush line.
* burst_read 60h - already in PRC, predicts 80h.
* burst_read 100h - PRC should start a new line.
* When using this sequencer, set all trace flags to TRUE (except the
* Controller), and run the simulation for 6000 steps.
*
* General Timing instructions for all Sequencers:
* Use an initial block for each transaction. You must ensure that the
* following rules are adhered to:
* 1. Before the first transaction, use
* repeat(2)@(posedge clk)
* 2. Before the first line of the second transaction, use
* wait(ABB_==low);
* wait(ABB_==hi);
* 3. There can be only two transactions pipelined at a time. You must ensure
* manually that the first operation is complete before the third begins.
* When scheduling the current transaction, look at the transaction before
* last. Wait for that TA_ to finish. Also, wait for the ABB_ from the
* previous transaction to go high.



















//declare variables, constants, parameters






write = 5'b00010, //02
write_atomic = 5'b10010, //12
read = 5'b01010, //0A
read_atomic = 5'b11010, //1A
burst_write = 5'b00110, //06
burst_read = 5'b01110, //0E
























need_bus_trigger_ <= #4 low;











need_bus_trigger_ <= #4 low;










need_bus_trigger_ <= #4 low;











need_bus_trigger_ <= #4 low;










need_bus_trigger_ <= #4 low;










line <= {64'h7777777777777777, 64'h8888888888888888,
64'h1111111111111111, 64'h3333333333333333};
need_bus_trigger_ <= #4 low;







address <= 32'h00000060;
Transfer_type <= burst_read;
Transfer_code <= data_transfer;
need_bus_trigger_ <= #4 low;











need_bus_trigger_ <= #4 low;
need_bus_trigger_ <= #6 hi;
end
endmodule
















VERILOG-XL 2.1.2 log file created Feb 2, 1996 13:14:29
VERILOG-XL 2.1.2 Feb 2, 1996 13:14:29
Copyright (c) 1994 Cadence Design Systems, Inc. All Rights Reserved.
Unpublished — rights reserved under the copyright laws of the United States.
Copyright (c) 1994 UNIX Systems Laboratories, Inc. Reproduced with Permission.
THIS SOFTWARE AND ON-LINE DOCUMENTATION CONTAIN CONFIDENTIAL INFORMATION
AND TRADE SECRETS OF CADENCE DESIGN SYSTEMS, INC. USE, DISCLOSURE, OR
REPRODUCTION IS PROHIBITED WITHOUT THE PRIOR EXPRESS WRITTEN PERMISSION OF
CADENCE DESIGN SYSTEMS, INC.
RESTRICTED RIGHTS LEGEND
Use, duplication, or disclosure by the Government is subject to
restrictions as set forth in subparagraph (c)(1)(h) of the Rights in
Technical Data and Computer Software clause at DFARS 252.227-7013 or
subparagraphs (c)(1) and (2) of Commercial Computer Software — Restricted
Rights at 48 CFR 52.227-19, as applicable.
Cadence Design Systems, Inc.
555 River Oaks Parkway
San Jose, California 95134
For technical assistance please contact the Cadence Response Center at
1-800-CADENC2 or send email to crc_customers@cadence.com











Compiling source file '
Compiling source file "

























































ActiveLine = 0 at $d 5
ActiveLine = 1 at $d 1162
ActiveLine = 2 at $d 2287
ActiveLine = 3 at $d 3412
ActiveLine = 4 at $d 4537
ActiveLine = 5 at $d 5662
ActiveLine = 6 at $d 6787
ActiveLine = 7 at $d 7912
ActiveLine = 8 at $d 9037
ActiveLine = 9 at $d 10162
ActiveLine = 10 at $d 11287
ActiveLine = 11 at $d 12412
ActiveLine = 12 at $d 13537
ActiveLine = 13 at $d 14662
ActiveLine = 14 at $d 15787
ActiveLine = 15 at $d 16912
ActiveLine = 16 at $d 18037
ActiveLine = 17 at $d 19162
ActiveLine = 18 at $d 20287
ActiveLine = 19 at $d 21412








































































































































































































































































































































































































































































ActiveLine = 71 at $d 79912
ActiveLine = 72 at $d 81037
ActiveLine = 73 at $d 82162
ActiveLine = 74 at $d 83287
ActiveLine = 75 at $d 84412
ActiveLine = 76 at $d 85537
ActiveLine = 77 at $d 86662
ActiveLine = 78 at $d 87787
ActiveLine = 79 at $d 88912
ActiveLine = 80 at $d 90037
ActiveLine = 81 at $d 91162
ActiveLine = 82 at $d 92287
ActiveLine = 83 at $d 93412
ActiveLine = 84 at $d 94537
ActiveLine = 85 at $d 95662
ActiveLine = 86 at $d 96787
ActiveLine = 87 at $d 97912
ActiveLine = 88 at $d 99037
ActiveLine = 89 at $d 100162
ActiveLine = 90 at $d 101287
ActiveLine = 91 at $d 102412
ActiveLine = 92 at $d 103537
ActiveLine = 93 at $d 104662
ActiveLine = 94 at $d 105787
ActiveLine = 95 at $d 106912
ActiveLine = 96 at $d 108037
ActiveLine = 97 at $d 109162
ActiveLine = 98 at $d 110287
ActiveLine = 99 at $d 111412
ActiveLine = 100 at $d 112537
ActiveLine = 101 at $d 113662
ActiveLine = 102 at $d 114787
ActiveLine = 103 at $d 115912
ActiveLine = 104 at $d 117037
ActiveLine = 105 at $d 118162
ActiveLine = 106 at $d 119287
ActiveLine = 107 at $d 120412
ActiveLine = 108 at $d 121537






ActiveLine = 112 at $d 126037
ActiveLine = 113 at $d 127162
ActiveLine = 114 at $d 128287
ActiveLine = 115 at $d 129412
ActiveLine = 116 at $d 130537
ActiveLine = 117 at $d 131662
ActiveLine = 118 at $d 132787
ActiveLine = 119 at $d 133912
ActiveLine = 120 at $d 135037
Line_mgr selected new ActiveLine = 121 at $d 136162
Line_mgr selected new ActiveLine = 122 at $d 137287
Line_mgr selected new ActiveLine = 123 at $d 138412
Line_mgr selected new ActiveLine = 124 at $d 139537
Line_mgr selected new ActiveLine = 125 at $d 140662
Line_mgr selected new ActiveLine = 126 at $d 141787
Line_mgr selected new ActiveLine = 127 at $d 142912
Line_mgr selected new ActiveLine = 0 at $d 145162
Line_mgr selected new ActiveLine = 1 at $d 146287
Line_mgr selected new ActiveLine = 2 at $d 147412
Line_mgr selected new ActiveLine = 3 at $d 148537
L122 "testbench.v": $finish at simulation time 152010
31769681 simulation events + 8392 accelerated events
CPU time: 1.0 secs to compile + 0.9 secs to link + 116.2 secs in simulation
End of VERILOG-XL 2.1.2 Feb 2, 1996 13:16:34
J. LINE REPLACEMENT TEST
* Transaction Sequencer - Line Replacement Test
* Filename: sequencer5.v




* Purpose: This is one in a set of modules which perform a sequence of CPU
* transactions. This Sequencer causes a series of CPU operations which will
* quickly fill the PRC. This will test the Line Replacement Unit's behavior




* burst_read i00h - PRC should switch to new line i.
* burst_read i20h - PRC should predict i40h, and store data in line i.
* next i
*
* When using this sequencer, set all trace flags to FALSE, except for the Line
* Manager, and run the simulation for 152000 steps.
*
* General Timing instructions for all Sequencers:
* Use an initial block for each transaction. You must ensure that the
* following rules are adhered to:
* 1. Before the first transaction, use
* repeat(2)@(posedge clk)




* 3. There can be only two transactions pipelined at a time. You must ensure
* manually that the first operation is complete before the third begins.
* When scheduling the current transaction, look at the transaction before
* last. Wait for that TA_ to finish. Also, wait for the ABB_ from the
* previous transaction to go high.



















//declare variables, constants, parameters






write = 5'b00010, //02
write_atomic = 5'b10010, //12
read = 5'b01010, //0A
read_atomic = 5'b11010, //1A
burst_write = 5'b00110, //06
burst_read = 5'b01110, //0E






//Other internal control signals

















need_bus_trigger_ <= #4 low;











need_bus_trigger_ <= #4 low;










address <= {12'b0, i, 12'b0};
Transfer_type <= burst_read;
Transfer_code <= data_transfer;
need_bus_trigger_ <= #4 low;





address <= {12'b0, i, 12'h020};
Transfer_type <= burst_read;
Transfer_code <= data_transfer;
need_bus_trigger_ <= #4 low;




















VERILOG-XL 2.1.2 log file created Feb 2, 1996 13:22:22
VERILOG-XL 2.1.2 Feb 2, 1996 13:22:22
Copyright (c) 1994 Cadence Design Systems, Inc. All Rights Reserved.
Unpublished — rights reserved under the copyright laws of the United States.
Copyright (c) 1994 UNIX Systems Laboratories, Inc. Reproduced with Permission.
THIS SOFTWARE AND ON-LINE DOCUMENTATION CONTAIN CONFIDENTIAL INFORMATION
AND TRADE SECRETS OF CADENCE DESIGN SYSTEMS, INC. USE, DISCLOSURE, OR
REPRODUCTION IS PROHIBITED WITHOUT THE PRIOR EXPRESS WRITTEN PERMISSION OF
CADENCE DESIGN SYSTEMS, INC.
RESTRICTED RIGHTS LEGEND
Use, duplication, or disclosure by the Government is subject to
restrictions as set forth in subparagraph (c)(1)(h) of the Rights in
Technical Data and Computer Software clause at DFARS 252.227-7013 or
subparagraphs (c)(1) and (2) of Commercial Computer Software — Restricted
Rights at 48 CFR 52.227-19, as applicable.
Cadence Design Systems, Inc.
555 River Oaks Parkway
San Jose, California 95134
For technical assistance please contact the Cadence Response Center at
1-800-CADENC2 or send email to crc_customers@cadence.com
For more information on Cadence's Verilog-XL product line send email to
talkverilog@cadence.com
Compiling source file "bus_interface.v"
Compiling source file "prc.v"
Compiling source file "snooper.v"
Compiling source file "controller.v"
Compiling source file "datalist.v"
Compiling source file "line_mgr.v"
Compiling source file "predictor.v"
Compiling source file "testbench.v"
Compiling source file "arbiter.v"
Compiling source file "cpu.v"
Compiling source file "memory.v"
Compiling source file "sequencer4.v"
Highest level modules:
testbench
Line_mgr selected new ActiveLine = 0 at $d 5
CPU started read from address 00000000 at time 45.
CPU read: 0001020304050607 at 181
CPU read: 08090a0b0c0d0e0f at 241
CPU read: 1011121314151617 at 301
CPU read: 18191a1b1c1d1e1f at 361
CPU started read from address 00000020 at time 390.
BIU started read from address 00000040 at time 412.
CPU read: 2021222324252627 at 496
CPU read: 28292a2b2c2d2e2f at 556
CPU read: 3031323334353637 at 616
CPU read: 38393a3b3c3d3e3f at 676
BIU: 4041424344454647 at 812
BIU: 48494a4b4c4d4e4f at 872
BIU: 5051525354555657 at 932
BIU: 58595a5b5c5d5e5f at 992
DATALIST uploaded this data into line 00 at time 1008.
404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f
CPU started read from address 00000180 at time 1140.
Line_mgr selected new ActiveLine = 1 at $d 1162
CPU read: 0001020304050607 at 1276
CPU read: 08090a0b0c0d0e0f at 1336
CPU read: 1011121314151617 at 1396
CPU read: 18191a1b1c1d1e1f at 1456
CPU started read from address 000001a0 at time 1515.
CPU read: 2021222324252627 at 1651
BIU started read from address 000001c0 at time 1657.
CPU read: 28292a2b2c2d2e2f at 1711
CPU read: 3031323334353637 at 1771
CPU read: 38393a3b3c3d3e3f at 1831
BIU: 4041424344454647 at 1967
BIU: 48494a4b4c4d4e4f at 2027
BIU: 5051525354555657 at 2087
BIU: 58595a5b5c5d5e5f at 2147
DATALIST uploaded this data into line 01 at time 2163.
404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f
CPU started read from address 00000040 at time 2265.
Line_mgr selected new ActiveLine = 0 at $d 2287
DATALIST downloaded this data from line 00 at time 2313.
404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f
CPU read: 4041424344454647 at 2356
CPU read: 48494a4b4c4d4e4f at 2371
CPU read: 5051525354555657 at 2386
CPU read: 58595a5b5c5d5e5f at 2401
BIU started read from address 00000060 at time 2482.
BIU: 6061626364656667 at 2627
BIU: 68696a6b6c6d6e6f at 2687
BIU: 7071727374757677 at 2747
BIU: 78797a7b7c7d7e7f at 2807
DATALIST uploaded this data into line 00 at time 2823.
606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f
CPU started write to address 000001c0 at time 3007.
CPU write beat 1: 7777777777777777 at 3022
Line_mgr selected new ActiveLine = 1 at $d 3037
Line manager flushed line 1 at time 3048.
CPU write beat 2: 8888888888888888 at 3158
CPU write beat 3: 1111111111111111 at 3218
CPU write beat 4: 3333333333333333 at 3278
CPU started read from address 00000060 at time 3390.
Line_mgr selected new ActiveLine = 0 at $d 3412
DATALIST downloaded this data from line 00 at time 3438.
606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f
CPU read: 6061626364656667 at 3481
CPU read: 68696a6b6c6d6e6f at 3496
CPU read: 7071727374757677 at 3511
CPU read: 78797a7b7c7d7e7f at 3526
BIU started read from address 00000080 at time 3607.
BIU: 0001020304050607 at 3752
BIU: 08090a0b0c0d0e0f at 3812
BIU: 1011121314151617 at 3872
BIU: 18191a1b1c1d1e1f at 3932
CPU started read from address 00000100 at time 3945.
DATALIST uploaded this data into line 00 at time 3948.
000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f
Line_mgr selected new ActiveLine = 2 at $d 3982
CPU read: 0001020304050607 at 4066
CPU read: 08090a0b0c0d0e0f at 4126
CPU read: 1011121314151617 at 4186
CPU read: 18191a1b1c1d1e1f at 4246
L123 "testbench.v": $finish at simulation time 6010
1661039 simulation events + 265 accelerated events
CPU time: 0.8 secs to compile + 0.8 secs to link + 5.0 secs in simulation
End of VERILOG-XL 2.1.2 Feb 2, 1996 13:22:29
APPENDIX D. PRC STRUCTURE FILES
This appendix contains the Verilog files for the final
hardware design. They include the Verilog structural models
of the PRC and the testing results. The files are located on
the ECE system at home5/robert/thesis/epoch/verilog.
PRC
* Predictive Read Cache
* Filename: prc.v
* Author: Joseph R. Robert, Jr.
* Date: 02OCT95
* Revised: 14MAR96
Purpose: This module emulates the predictive read cache, connecting all the parts.
module prc(HRESET_,clk,BG_,DBG_,BR_,CANX,DA,DP,TT,AP,TSIZ,TC,ABB_,AACK_,TS_,
TBST_,DBB_,TA_,DPE_);


























//Connect parts which have been converted to hardware.
// epoch precompiled predictor
predictor PRE1 (MRMA,CAR[25:0],predict,NAR,HRESET_);
// epoch precompiled line_mgr
line_mgr LM1 (CAR,NAR,HRESET_,a_select,test,fetch_done,flush,store,
new_replace,MRMA,ActiveLine,line_empty,hit,clk);
// epoch precompiled datalist
datalist DL1 (DATALINE,ActiveLine,upload,download);
// epoch precompiled snooper
snooper SN1 (A,AP,TT,TC,TS_,snoop_ignore,hold,clk,CAR,BURSTSTART,
read,write,HRESET_);

















Purpose: This module is a Finite State Machine which coordinates the actions of all the other functional
blocks of the PRC. All control signals are synchronous with the system clock. HRESET_ causes the Controller
to go to the IDLE state. The state diagram and state output tables give more details.
Of significance are the wait states added to the state diagram of the behavioral model. These changes are
highlighted in the State Output Table. The changes were required by the Line Manager, in which there is a
significant propagation delay for the addresses. This is described in more detail in the Line Manager section of this































reg [4:0] /* epoch enum stat */ state, next_state;
reg a_select,fetch,flush,hold,new_replace,predict,send,store,test;























if (read==1'b0 & write==1'b0) next_state = idle;
else if (read==1'b0 & write==1'b1) next_state = wait_d;
else if (read==1'b1) next_state = wait_a;





















































a_select = 1'b1; //NAR
test = 1'b1;
hold = 1'b1;
if ({hit,read,write} == 3'b000) next_state = fetch_data;








if ({fetch_done,fetch_abort} == 2'b00) next_state = fetch_data;



































































* Author: Joseph R. Robert, Jr.
* Date: 21DEC95
* Revised: 06MAR96
Purpose: This module watches the system bus activity and makes appropriate reports to the PRC
Controller.
If the transaction is a data burst read or any kind of write, and if the address parity is correct, then the read
or write signal is asserted as appropriate, and the address is placed in the CAR. The snoop_ignore signal tells this
unit to ignore the current transaction, because it was initiated by the Bus Interface Unit. The snoop_ignore signal
must be asserted concurrently with the transfer attributes. Reads that are not burst or data related are ignored by
the PRC. The CAR is updated only on transactions relevant to the PRC.
Due to the two-stage pipelining capability of the PowerPC with respect to memory accesses, a second
address tenure can occur shortly after the first, well before the first data tenure is complete. To compensate for this,
the read and write outputs of the Snooper remain asserted until acknowledged by the Controller with hold. The
rising edge of hold indicates that the read or write signal was received by the Controller. The Snooper can then
negate these signals, but must leave CAR alone until hold is negated. After hold is negated, CAR can be updated
to the new address.
In Stage 0, the transfer attributes are latched in registers. Combinational logic determines if these transfer
attributes represent a valid read or a valid write, and if the address parity is correct. If the transaction is valid,
and one that the PRC is interested in, then Stage 0 raises a transaction_waiting signal.
A Finite State Machine in Stage One sits in the IDLE state until it receives that signal. Then it latches
the signals needed from Stage 0, resets the transaction_waiting signal, and then waits for the hold signal to go low.
A high hold signal indicates that the PRC is not done with the previous transaction. Once hold goes low, the read
and write flags are set according to the type of the current transaction. Also, the input address is stored in the
Current Address Register. The FSM then waits for the rising edge of hold before returning to the IDLE state where
it can check if there is another transaction waiting.
module snooper (A,AP,TT,TC,TS_,snoop_ignore,hold,clk,CAR,BURSTSTART,
read_flag,write_flag,HRESET_);



























dff #( 4,0,"AUTO","1") AddrParityLatch (.CLK(latch0),.D(AP),.Q(addr_parity));
dff #( 5,0,"AUTO","1") TransferTypeLatch (.CLK(latch0),.D(TT),.Q(TransferType));

















































reg [2:0] /* epoch enum stat */ state, next_state;
reg latch1,tw_reset1_,flag_clk,car_latch;

















































1. Thirty-Two-Input, Odd-Parity Checker
* ODD PARITY CHECKER
* Filename: parityo_chk32.v




























Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 20MAR96
Purpose: The function of this module is completely described in the behavioral model.
This structural model uses a high speed RAM (hsram) for the MRMA List. The CAR is stored into this
RAM on a store or fetch_done signal.
The predicted_ma_list is a register file for storing predicted memory addresses. This list is composed
of 128 address registers, 128 equality comparators, and 128 Valid status flags. The NAR is stored in this list at
the fetch_done pulse. If there is a match with the input address (in_addr), a priority encoder (ENC_C) determines
which line matches.
The line replacement unit determines the next line to be replaced whenever the PRC needs to start a new
line. It first selects invalid lines. If all the lines are valid, then it selects lines that have been "aged". A priority
encoder (ENC_1) chooses the line with the lowest index among all the lines that can be replaced. If all lines are
valid, the encoder's output enable (oe) signal is used to cause aging.
Aging is accomplished by the use of a 7-bit counter (ager_counter), initially set to zero. When the
cause_aging signal from the encoder is high, the counter advances. A decoder (DEC_B) output causes the
appropriate Aged flag to be set.
Changing values of the CAR or NAR have a propagation delay of 25 ns (1.8 cycles) through the input
address multiplexer (in_addr mux). This required the addition of wait states in the Controller before each of the
tests.
The Revised Controller State Diagram and Revised Controller State Output Table show the required changes.
module line_mgr (CAR,NAR,HRESET_,a_select,test,fetch_done,flush,store,
new_replace,MRMA_out,ActiveLine,line_empty,hit,clk);















buff #(27,0,"AUTO","20") InAddrBuffer (.IN0(in_addr),.Y(in_addr_buf));
//MRMA_list
stdnor2 MRMA_NOR (.IN0(store),.IN1(fetch_done),.Y(MRMA_write_));
hsram #(27, 128,7,32,1, "2")
MRMA_list (.A(ActiveLine),.DIN(CAR),.WR(MRMA_write_),.DOUT(MRMA_out));
//PredMA_list


























1. Address Register With Equal Comparator
ADDRESS REGISTER WITH EQUALITY COMPARATOR for PredMA storage
Filename: addre.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 13FEB96
Purpose: This structural model is a building block for the Predicted Memory Address List (PredMA_List). It
consists of a single 27-bit register and an equality comparator. The output of the register is compared with the
input address (in_addr).
module addre (NAR,in_addr,store_enable,eq,HRESET_);






dff_c #(27,0,"AUTO","1") PredMA_reg (.CLK(store_enable),.CLR(HRESET_),
.D(NAR),.Q(w1));
equal #(27,0,"AUTO","1") equal1 (.A(w1),.B(in_addr),.Y(eq));
endmodule
AND Gate With 128 Inputs and One Output
128-INPUT AND GATE
Filename: and128.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 20MAR96
Purpose: This structural model is a 128-input AND gate.










and4 #( 8,0,"AUTO","1") AND_B (.IN0( A[31:24]),.IN1( A[23:16]),
.IN2( A[15:8]),.IN3( A[7:0]),.Y(B));
and4 #( 2,0,"AUTO","1") AND_C (.IN0( B[7:6]),.IN1( B[5:4]),
.IN2( B[3:2]),.IN3( B[1:0]),.Y(C));
stdand2 AND_D (.IN0(C[0]),.IN1(C[1]),.Y(out_unbuffered));
stdbuf #("15") OutputBuffer (.IN0(out_unbuffered),.Y(out));
endmodule
Codefile for Seven-to-128 Decoder
(dec7to128e.codefile)
//PLA TABLE for 7 to 128 decoder with enable

































001 1 1 1 1
1
//line




































































































Encoder, Priority to Low Bits
128 TO 7 ENCODER, PRIORITY LOW
Filename: enc128to7lo.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 13FEB96
Purpose: This structural model is a 128-bit input, 7-bit output priority encoder. The highest priority is given to
the bit with the lowest index. Inputs and outputs are active high. It is composed of four 32 to 5 priority encoders
and the logic gates necessary to connect them together.
module enc128to7lo (I,A,ei,eo,gs);






wire g3eo,g2eo,g1eo,g3gs,g2gs,g1gs,g0gs,eo,gs;
enc32to5lo ENCg3 (I[127:96],g3A,g2eo, eo,g3gs);
enc32to5lo ENCg2 (I[ 95:64],g2A,g1eo,g2eo,g2gs);
enc32to5lo ENCg1 (I[ 63:32],g1A,g0eo,g1eo,g1gs);















5. Thirty-Two-Input, Five-Output Encoder, Priority to
Low Bits
32 TO 5 ENCODER, PRIORITY LOW
Filename: enc32to5lo.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 13FEB96
Purpose: This structural model is a 32-bit input, 5-bit output priority encoder. The highest priority is given to the
bit with the lowest index. Inputs and outputs are active high.
This module is composed of four 8 to 3 priority encoders and the logic gates necessary to connect them
together. This module is a building block for the 128 to 7 priority encoder.
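The cascading of the group encoders can be illustrated with a small behavioral sketch. It is written in Python rather than structural Verilog, the function names are hypothetical, and the port meanings (ei = enable in, eo = enable out, gs = group select) are taken from the module headers; it is an assumption-laden model, not the Epoch netlist.

```python
def enc8to3_lo(bits, ei=1):
    """Behavioral model of an 8-input priority encoder (bit 0 wins).

    Returns (a, gs, eo): a is the index of the lowest set input, gs is 1
    when some input is set, and eo is 1 when the encoder is enabled but
    found nothing, which enables the next group in the cascade.
    """
    if not ei:
        return 0, 0, 0
    for index in range(8):
        if (bits >> index) & 1:
            return index, 1, 0
    return 0, 0, 1


def enc32to5_lo(bits, ei=1):
    """32-input encoder built from four 8-input groups, group 0 first."""
    enable = ei
    for group in range(4):                       # group 0 holds bits [7:0]
        a, gs, eo = enc8to3_lo((bits >> (8 * group)) & 0xFF, enable)
        if gs:
            return (group << 3) | a, 1, 0        # group number forms the high bits
        enable = eo                              # pass the enable up the chain
    return 0, 0, enable
```

For example, `enc32to5_lo((1 << 9) | (1 << 11))` selects input 9, the lowest set index. The same pattern, applied to four 32-input groups, gives the 128 to 7 encoder.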
module enc32to5lo (i,A,ei,eo,gs);






wire g3eo,g2eo,g1eo,g3gs,g2gs,g1gs,g0gs,eo,gs;
enc8to3lo ENCg3 (i[31:24],g3A,g2eo, eo,g3gs);
enc8to3lo ENCg2 (i[23:16],g2A,g1eo,g2eo,g2gs);
enc8to3lo ENCg1 (i[15: 8],g1A,g0eo,g1eo,g1gs);













6. Eight-Input, Three-Output Encoder, Priority to Low
Bits
8 TO 3 ENCODER, PRIORITY LOW
Filename: enc8to3lo.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 13FEB96
Purpose: This structural model is an 8-bit input, 3-bit output priority encoder. The highest priority is given to the
bit with the lowest index. Inputs and outputs are active high.
Truth table
Inputs Outputs




1 x x 1 10 110
1 x x x 1 10 10
1 x x x x 1 1110
lxxxxxlOO 10 10














//Standard cell implementation is more efficient here. See User Man. 5-34.






//Group Select ("Got Something")
stdnor2 NOR_D (EI_,EO,GS);
//Encode A2 = EI.(I7.I6_.I5_.I4_.I3_.I2_.I1_.I0_ + I6.I5_.I4_.I3_.I2_.I1_.I0_ +
// I5.I4_.I3_.I2_.I1_.I0_ + I4.I3_.I2_.I1_.I0_)
// = EI.(I7.I3_.I2_.I1_.I0_ + I6.I3_.I2_.I1_.I0_ +
// I5.I3_.I2_.I1_.I0_ + I4.I3_.I2_.I1_.I0_)




//Encode Al = EI.(I7.I6_.I5_.I4_.I3_.I2_.I1_.I0_ + I6.I5_.I4_.I3_.I2_.I1_.I0_ +
// I3.I2_.I1_.I0_ + I2.I1_.I0_)





//Encode A0 = EI.(I7.I6_.I5_.I4_.I3_.I2_.I1_.I0_ + I5.I4_.I3_.I2_.I1_.I0_ +
// I3.I2_.I1_.I0_ + I1.I0_)











Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 13FEB96
Purpose: This structural model determines the next line to be replaced whenever the PRC needs to start a new
line. It first selects invalid lines. If all the lines are valid, then it selects lines that have been "aged". A priority
encoder (ENC_1) chooses the line with the lowest index among all the lines that can be replaced. If all lines are
valid, the encoder's output enable (oe) signal is used to cause aging. A line X can be replaced if the following holds
true for that line:
not (X=ActiveLine) AND {not Valid[X] OR (all_lines_valid AND Aged[X])}
Aging is accomplished by the use of a 7-bit counter (ager_counter), initially set to zero. When the
cause_aging signal from the encoder is high, the counter advances. A decoder (DEC_B) output causes the
appropriate Aged flag to be set.
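The replacement rule above can be exercised with a short behavioral sketch in Python. The function names are hypothetical, and the ordering inside `age_next_line` (set the flag at the current count, then advance the counter) is an assumption; the text states only that the counter advances and that the decoder sets the matching Aged flag.

```python
def replaceable(x, active_line, valid, aged):
    """The rule from the text for a single line X."""
    all_lines_valid = all(valid)
    return x != active_line and (not valid[x] or (all_lines_valid and aged[x]))


def select_replace_line(active_line, valid, aged):
    """Lowest-index replaceable line (priority encoder), or None.

    None models the case where no line qualifies, which is when the
    hardware asserts cause_aging.
    """
    for x in range(len(valid)):
        if replaceable(x, active_line, valid, aged):
            return x
    return None


def age_next_line(aged, ager_counter):
    """One aging step. Assumed order: set the flag, then advance the counter."""
    aged[ager_counter] = True
    return (ager_counter + 1) % len(aged)
```

With all 128 lines valid, none aged, and line 0 active, `select_replace_line` returns None. The first aging step marks line 0, which is still excluded because it is the active line; the second marks line 1, which then becomes the replacement candidate.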
module line_replacement_unit( Valid,ActiveLine,all_lines_valid,
new_replace,fetch_done,HRESET_,CLK,ReplaceLine);































and2 #(128,0,"AUTO","1") AND_C (.IN0(w4),.IN1(Valid),.Y(w5));
nor2 #(128,0,"AUTO","1") NOR_D (.IN0(w1),.IN1(w5),.Y(w6));
stdor2 OR_F (.IN0(new_replace),.IN1(cause_aging),.Y(latch_en));






OR Gate With 128 Inputs, One Output
128-INPUT OR GATE
Filename: or128.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 23JAN96
** *** ************************************************************ ***********









or4 #( 8,0,"AUTO","1") OR_B (.IN0(A[31:24]),.IN1(A[23:16]),
.IN2(A[15:8]),.IN3(A[7:0]),.Y(B));






9. Predicted Memory Address List
PREDICTED MEMORY ADDRESS LIST
Filename: predma_list.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 13FEB96
Purpose: This structural model is a register file for storing predicted memory addresses. This list is composed of
128 address registers, 128 equality comparators, and 128 Valid status flags. The NAR is stored in this list at the




// epoch set_attribute FIXEDBLOCK = 1




























enc128to7lo ENC_C (.I(m),.A(match_line),.ei(Vdd),.eo(nc1),.gs(nc2));













































(NAR,in_addr,store_en_buf[10],equal[10],HRESET_);
(NAR,in_addr,store_en_buf[11],equal[11],HRESET_);




(NAR,in_addr,store_en_buf[16],equal[16],HRESET_);
(NAR,in_addr,store_en_buf[17],equal[17],HRESET_);
































































































































































































































































































































































































addre PredMA100 (NAR,in_addr,store_en_buf[100],equal
addre PredMA101 (NAR,in_addr,store_en_buf[101],equal
addre PredMA102 (NAR,in_addr,store_en_buf[102],equal







addre PredMA110 (NAR,in_addr,store_en_buf[110],equal
addre PredMA111 (NAR,in_addr,store_en_buf[111],equal
addre PredMA112 (NAR,in_addr,store_en_buf[112],equal
addre PredMA113 (NAR,in_addr,store_en_buf[113],equal
addre PredMA114 (NAR,in_addr,store_en_buf[114],equal
addre PredMA115 (NAR,in_addr,store_en_buf[115],equal
addre PredMA116 (NAR,in_addr,store_en_buf[116],equal
addre PredMA117 (NAR,in_addr,store_en_buf[117],equal
addre PredMA118 (NAR,in_addr,store_en_buf[118],equal






addre PredMA125 (NAR,in_addr,store_en_buf[125],equal
addre PredMA126 (NAR,in_addr,store_en_buf[126],equal




































10. One-to-128 Wire Splitter
1 TO 128 WIRE SPLITTER
Filename: split128.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 23JAN96
Purpose: Splits a wire into 128 wires.
module split128 (in,out); //Splits a wire into 128 wires.
input in;
output [127:0] out;










11. One-to-Seven Wire Splitter
1 TO 7 WIRE SPLITTER
Filename: split7.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 23JAN96
Purpose: Splits a wire into 7 wires.
module split7 (in,out); //Splits a wire into 7 wires.
input in;
output [6:0] out;
assign out = {in,in,in,in,in,in,in};
endmodule
12. Set, Reset Latch
STANDARD SET,RESET LATCH
Filename: srlatch.v















13. Set, Reset Latch Array 128 Bits Wide
ARRAY OF 128 SET,RESET LATCHES
Filename: srlatch128.v
Author: Joseph R. Robert, Jr.
Date: 21DEC95
Revised: 23JAN96





nand2 #(128,0,"AUTO","1") NAND_B (.IN0(R_),.IN1(Q),.Y(Q_));











Purpose: This module calculates the Next Address (stored in NAR) based on the Most Recent Memory Access
(MRMA) and the Current Address (in the CAR). The prediction calculation is
NAR = 2*CAR - MRMA
In this structural implementation of the Predictor, the predict signal latches in the CAR and MRMA inputs. The
subtraction is accomplished as a 2's complement addition with a high speed adder.
The CAR is multiplied by 2 by concatenating a zero at the least significant end. The most significant bit of the
CAR is not retained, since it will not have an effect on the 27-bit output of the adder. This would adversely affect
address prediction only around the mid-point of the 4 gigabytes of memory. The Golden Rule here is "Design for
the common case."
A number is negated in 2's complement by inverting all the bits and adding 1. The MRMA is negated by inverting
all its bits. Adding the required 1 is implemented as a Carry-In to the adder.
Epoch's TACTIC reported the propagation delay from predict to NAR to be 4.90 ns.
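The arithmetic described above can be checked with a few lines of Python. This is a behavioral sketch only: the function names and the `line_index` helper are hypothetical, and the 27-bit width and the 5 discarded offset bits are taken from the port comments of the module (addresses are bits [31:5]).

```python
ADDR_BITS = 27                        # MRMA and NAR carry bits [31:5] of the address
MASK = (1 << ADDR_BITS) - 1


def predict_nar(car, mrma):
    """NAR = 2*CAR - MRMA, modulo 2**27, built the way the text describes."""
    a = (car << 1) & MASK             # CAR times 2: append a zero, drop the top bit
    b = ~mrma & MASK                  # invert every bit of MRMA
    return (a + b + 1) & MASK         # the +1 of the negation is the adder's carry-in


def line_index(byte_address):
    """Hypothetical helper: drop the 5 offset bits of a 32-byte line."""
    return byte_address >> 5
```

Using the first two reads of Sequence #4, `predict_nar(line_index(0x20), line_index(0x00)) << 5` gives 0x40, the address the simulation log shows the BIU fetching. A descending stride works the same way, since the subtraction wraps modulo 2**27.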
module predictor (MRMA,CAR,predict,NAR,HRESET_);
//CAR is [30:5] of 32-bit address
//MRMA and NAR are [31:5] of 32-bit address










assign A[0] = gnd;













Author: Joseph R. Robert, Jr.
Date: 15DEC95
Revised: 07FEB96
Purpose: This module stores the data retreived from memory in anticipation of a request by the CPU.
The basic memory cell is Epoch's hsramoe (high speed ram with output enable). Since each hsram has
a maximum word size of 128 bits, mere are two hsram parts in parallel to get the required 256-bit width.
An upload signal causes the Data List to store the data on data_line into the address specified by
ActiveLine. The input upload has to be inverted to match the active-low WR input of the Epoch hsram component.
A download signal causes the Data List to assert onto datajine the data in the address specified by
ActiveLine. This signal also has to be inverted for the same reason.
Both the inverters can probably be removed if the Bus Interface Unit makes the upload and download
signals active low. That could only improve the response time of this data memory.
Epoch calculated the following timing delays:
download -> hsramoe.DOUT 2.3 ns
ActiveLine -> hsramoe.DOUT 7.3 ns
A design alternative is to use the regular speed version, ramoe, with the following timing delays.
download -> ramoe.DOUT 4 ns
ActiveLine -> ramoe.DOUT 16 ns
Using this slower RAM is possible, but would require a significant modification to the PRC behavior to handle the
longer delay, and would add a cycle delay to CPU reads when there is a hit in the PRC.
Putting this module's VerilogOut file into the original PRC behavioral model for mixed-mode simulation
caused a timing error that had to be corrected in the Bus Interface Unit. After an upload to the DataList, data_line
must remain valid for long enough to meet the data hold time requirement of Epoch's hsramoe.
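The upload and download behavior described above can be sketched as a purely behavioral model. This is not Epoch's hsramoe and not the thesis netlist: the module name `datalist_sketch`, the 7-bit address width, the level-sensitive write, and the internal array names are all assumptions made for illustration.

```verilog
// Behavioral sketch only; timing and hold requirements are not modeled.
module datalist_sketch (data_line, ActiveLine, upload, download);
    inout  [255:0] data_line;
    input  [6:0]   ActiveLine;     // assumed: 128 lines, 7 address bits
    input          upload, download;

    // The RAM controls are active low, so the active-high inputs are inverted.
    wire write_  = ~upload;
    wire enable_ = ~download;

    // Two 128-bit halves side by side give the 256-bit line width.
    reg [127:0] ram_lo [0:127];
    reg [127:0] ram_hi [0:127];

    // Assumed level-sensitive write while write_ is low.
    always @(write_ or ActiveLine or data_line)
        if (!write_) begin
            ram_lo[ActiveLine] = data_line[127:0];
            ram_hi[ActiveLine] = data_line[255:128];
        end

    // Drive the shared bus only while the output enable is asserted (low).
    assign data_line = !enable_ ? {ram_hi[ActiveLine], ram_lo[ActiveLine]}
                                : 256'bz;
endmodule
```

The sketch assumes upload and download are never asserted together; if they were, the model would drive the bus while also sampling it.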
*****************************************************************************/
module datalist (data_line,ActiveLine,upload,download);










hsramoe #(128,128,7,32,1,"1")
data_ram1 (.A(ActiveLine),.DIN(data_line[127:0]),.DOUT(data_line[127:0]),
.WR(write_),.OE(enable_));
hsramoe #(128,128,7,32,1,"1")




* BUS INTERFACE UNIT
* Filename: bus_interface.v
* Author: Joseph R. Robert, Jr.
* Date: 09OCT95
* Revised: 20MAR96
Purpose: This module connects the PRC with the system bus. It handles the protocol of data transfer in and out
of the PRC.
When this module receives a fetch signal, it latches the address in the NAR, and requests the bus for a
burst read. It stores the incoming data until all four bursts have been received. Then it uploads the data into the
Data List and asserts fetch_done. If there is a parity error during the fetch, the Bus Interface informs the Controller
by asserting fetch_abort, and the transaction is cancelled.
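The fetch sequence in the paragraph above can be sketched as a small state machine. This is a simplified, hypothetical model and not the Bus Interface Unit's actual address or data FSM: the module name, the state names, and the `clk`, `reset_`, `beat_valid`, and `parity_error` inputs are assumptions, and bus arbitration is left out entirely.

```verilog
// Simplified sketch: collect four beats, then upload, or abort on a parity error.
module fetch_fsm_sketch (clk, reset_, fetch, beat_valid, parity_error,
                         upload, fetch_done, fetch_abort);
    input  clk, reset_, fetch, beat_valid, parity_error;
    output upload, fetch_done, fetch_abort;

    parameter IDLE = 2'd0, COLLECT = 2'd1, DONE = 2'd2, ABORT = 2'd3;

    reg [1:0] state;
    reg [2:0] beats;   // number of beats received so far

    always @(posedge clk or negedge reset_)
        if (!reset_) begin
            state <= IDLE;
            beats <= 3'd0;
        end else
            case (state)
                IDLE: begin
                    beats <= 3'd0;
                    if (fetch) state <= COLLECT;
                end
                COLLECT:
                    if (parity_error) state <= ABORT;
                    else if (beat_valid) begin
                        beats <= beats + 3'd1;
                        if (beats == 3'd3) state <= DONE;  // fourth beat arriving
                    end
                DONE, ABORT:
                    if (!fetch) state <= IDLE;  // wait for the request to drop
            endcase

    // In this sketch the upload strobe simply coincides with fetch_done.
    assign upload      = (state == DONE);
    assign fetch_done  = (state == DONE);
    assign fetch_abort = (state == ABORT);
endmodule
```

Holding DONE or ABORT until fetch is released mirrors the wait-for-not-fetch idea visible in the listing's state names, but the exact handshake here is an assumption.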
When this module receives a send signal, it sends a cancel signal (CANX) to the memory module,
downloads data from the Data List, and then sends the data to the CPU. When the transfer is finished, it asserts
send_done.
The coordination of these activities is accomplished through the use of two Finite State Machines. One







// epoch set_attribute FIXEDBLOCK = 1








































assign qual_BG_ = ~(ABB_ & !BG_);





AddrParityGen (.D( {NAR,GND,GND,GND,GND,GND} ),.PGEN(addr_parity_gen));






a_buffer (.EN(a_en_buf_),.IN0( {a_reg,GND,GND,GND,GND,GND} ),.Y(A));




tt_buffer (.EN(a_en_),.IN0( {GND,VDD,VDD,VDD,GND} ),.Y(TT));
tribuf #(3,0,"AUTO","AUTO")
tsize_buffer (.EN(a_en_),.IN0( {GND,VDD,GND} ),.Y(TSIZ));
tribuf #(2,0,"AUTO","AUTO")




//ADDRESS FINITE STATE MACHINE









reg [2:0] /* epoch enum astat */ astate, next_astate;
reg a_latch,a_en_,abb_reg_,abb_en_,BR_,NARLatch,snoop_ignore,
tbst_en_,ts_reg_,ts_en_;






astate = next_astate;
end






















BR_ = 1'b0; // Request the bus.
NARLatch = 1'b1; // Latch the Next Address.
if (qual_BG_ == 1'b0)
next_astate = MASTER;




a_latch = 1'b1; // Latch transfer attributes.
a_en_ = 1'b0; // Enable attribute outputs.
abb_reg_ = 1'b0; // Take the address bus.
abb_en_ = 1'b0;
snoop_ignore = 1'b1; // Tell snooper to ignore this transaction.
tbst_en_ = 1'b0; // Another transfer attribute.






















































// Odd Parity Generator/Checker
// epoch precompiled parityo_chkgen256
parityo_chkgen256 DataParityGen
(.D(data),.PIN(dparity),.ERROR(data_parity_error),.PGEN(dparity_gen));
assign DPE_ = ~data_parity_error;
//data registers
stdbufinv TA_INV (.IN0(TA_),.Y(ta));
//Delay buffer required for timing of latch signals. Gates = 4 results in smallest layout area.
stddelaybuf #(1,4,"AUTO") LatchDelay0 (.IN0(latch0),.Y(latch0_delay));
stddelaybuf #(1,4,"AUTO") LatchDelay1 (.IN0(latch1),.Y(latch1_delay));
stddelaybuf #(1,4,"AUTO") LatchDelay2 (.IN0(latch2),.Y(latch2_delay));









stdbuf #("CRITICAL") DR0_BUF (.IN0(dreg0_clk),.Y(dreg0_clk_buf));
stdbuf #("CRITICAL") DR1_BUF (.IN0(dreg1_clk),.Y(dreg1_clk_buf));
stdbuf #("CRITICAL") DR2_BUF (.IN0(dreg2_clk),.Y(dreg2_clk_buf));
stdbuf #("CRITICAL") DR3_BUF (.IN0(dreg3_clk),.Y(dreg3_clk_buf));
dff #(72,0,"AUTO","AUTO")
DataReg0 (.CLK(dreg0_clk_buf),.D({mux_out[63:0],DP}),





DataReg2 (.CLK(dreg2_clk_buf),.D({mux_out[191:128],DP}),
.Q({data[191:128],dparity[23:16]}));
dff #(72,0,"AUTO","AUTO")







































stdbuf #("26") CANX_BUF (.IN0(cancel),.Y(CANX));
//DATA FINITE STATE MACHINE








































































else if (send) next_dstate = START_SEND;





















latch1 = 1'b1;
if (TA_ == 1'b1)
next_dstate = SECOND_BEAT;

























if (data_parity_error == 1'b1)
next_dstate = ABORT1;
















dataline_en_ = 1'b0; // To meet data hold requirements of hsram
//in Data List.
fetch_done = 1'b1;
if (fetch == 1'b1)
next_dstate = D_WAIT_FOR_NOT_FETCH_A;
else next_dstate = D_IDLE;
end




if (fetch == 1'b1)
next_dstate = D_WAIT_FOR_NOT_FETCH_B;











if (burst_start == 2'd0) next_dstate = SEND00;
else if (burst_start == 2'd1) next_dstate = SEND11;
else if (burst_start == 2'd2) next_dstate = SEND22;
else if (burst_start == 2'd3) next_dstate = SEND33;




















































































































1. Odd Parity Checker/Generator With 256 Inputs
* ODD PARITY CHECKER AND GENERATOR
* Filename: parityo_chkgen256.v




Purpose: This module checks the parity of the input data, comparing it to the input parity. Parity is odd including
the parity bit. This module also generates the parity for the input data in groups of eight input bits.
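The parity convention described above (odd parity counting the parity bit, one parity bit per eight data bits) can be sketched for a single 64-bit group. This is an illustration rather than the thesis's module: the name `parityo_sketch64` is hypothetical, the mapping of parity bit i to data bits [8i+7:8i] is an assumption, and the Verilog-2001 generate syntax is used for brevity where the original instantiates library cells.

```verilog
// Hypothetical sketch of an odd parity checker/generator for 64 data bits.
module parityo_sketch64 (D, PIN, ERROR, PGEN);
    input  [63:0] D;      // data
    input  [7:0]  PIN;    // received parity, one bit per byte (assumed mapping)
    output        ERROR;  // high when any byte fails the parity check
    output [7:0]  PGEN;   // generated parity, one bit per byte

    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : byte_lane
            // Odd parity: the parity bit makes the count of ones across the
            // eight data bits plus the parity bit odd, so it is the inverted
            // XOR reduction of the byte.
            assign PGEN[i] = ~^D[8*i +: 8];
        end
    endgenerate

    // The check passes only when every received bit matches the generated one.
    assign ERROR = |(PGEN ^ PIN);
endmodule
```

For example, a byte of all zeros has an even number of ones (zero), so its generated parity bit is 1.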
module parityo_chkgen256 (D,PIN,ERROR,PGEN);







(.D(D[ 63: 0]),.PIN(PIN[ 7: 0]),.ERROR(ERROR_0),.PGEN(PGEN[ 7: 0]));
parityo_chk64 parity_group_1




































2. Odd Parity Generator With 32 Inputs
* ODD PARITY GENERATOR
* Filename: parityo_gen32.v
* Author: Joseph R. Robert, Jr.
* Date: 12FEB96
* Revised: 29FEB96






parityo #(8,0,"AUTO","1") parity_group_0 (.D(D[ 7: 0]),.PGEN(PGEN[0]));
parityo #(8,0,"AUTO","1") parity_group_1 (.D(D[15: 8]),.PGEN(PGEN[1]));
parityo #(8,0,"AUTO","1") parity_group_2 (.D(D[23:16]),.PGEN(PGEN[2]));














VERILOG-XL 2.1.2 log file created Mar 19, 1996 11:53:03
VERILOG-XL 2.1.2 Mar 19, 1996 11:53:03
Copyright (c) 1994 Cadence Design Systems, Inc. All Rights Reserved.
Unpublished — rights reserved under the copyright laws of the United States.
Copyright (c) 1994 UNIX Systems Laboratories, Inc. Reproduced with Permission.
THIS SOFTWARE AND ON-LINE DOCUMENTATION CONTAIN CONFIDENTIAL INFORMATION
AND TRADE SECRETS OF CADENCE DESIGN SYSTEMS, INC. USE, DISCLOSURE, OR
REPRODUCTION IS PROHIBITED WITHOUT THE PRIOR EXPRESS WRITTEN PERMISSION OF
CADENCE DESIGN SYSTEMS, INC.
RESTRICTED RIGHTS LEGEND
Use, duplication, or disclosure by the Government is subject to
restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in
Technical Data and Computer Software clause at DFARS 252.227-7013 or
subparagraphs (c)(1) and (2) of Commercial Computer Software — Restricted
Rights at 48 CFR 52.227-19, as applicable.
Cadence Design Systems, Inc.
555 River Oaks Parkway
San Jose, California 95134
For technical assistance please contact the Cadence Response Center at
1-800-CADENC2 or send email to crc_customers@cadence.com
For more information on Cadence's Verilog-XL product line send email to
talkverilog@cadence.com
Compiling source file "prc.v"
Compiling source file "prc_top.v"
Compiling source file "sequencer4.v"
Compiling source file "tarbiter.v"
Compiling source file "tcpu.v"
Compiling source file "testbench.v"
Compiling source file "tmemory.v"
Scanning library file "/tmp_mnt/h/joshua_u2/jrrobert/thesis/epoch/primlib.v"
Scanning library file "/tmp_mnt/h/joshua_u2/jrrobert/thesis/epoch/primlib.v"
Warning! Implicit wire has no fanin [Verilog-IWFA]
"prc.v", 23159: NCO
Warning! Implicit wire has no fanin [Verilog-IWFA]
"prc.v", 23159: NCI
Warning! Implicit wire has no fanin [Verilog-IWFA]
"prc.v", 23159: NCO




*** SDF Annotator version 1.6_beta.3
*** SDF file: /tmp_mnt/h/joshua_u2/jrrobert/thesis/verilog/hardware/prc.sdf
Back-annotation scope: testbench.PRC1.PRC1
No configuration file specified - using default options
*** SDF Annotator log file: sdf.log
*** No MTM selection parameter specified
*** No SCALE FACTORS parameter specified
No SCALE TYPE parameter specified
Configuring for back-annotation...
Reading SDF file and back-annotating timing data...
*** SDF back-annotation successfully completed
PRC granted the data bus.
(ERROR): WR and A are both unknown at time 6.700
(ERROR): WR and A are both unknown at time 6.700
***
(ERROR): WR and A are both unknown at time 6.700
(ERROR) WR transition to unknown and (din != MEM[a]) at time 7.000
Instance: testbench.PRC1.PRC1.LM1.MRMA_list.hsram.inst1
(ERROR) WR transition to unknown and (din != MEM[a]) at time 7.000
Instance: testbench.PRC1.PRC1.DL1.data_ram1.hsram.inst1
(ERROR) WR transition to unknown and (din != MEM[a]) at time 7.000
Instance: testbench.PRC1.PRC1.DL1.data_ram0.hsram.inst1
System hard reset at time 35.
CPU started read from address 00000000 at time 45.
CPU read: 0001020304050607 at 211
CPU read: 08090a0b0c0d0e0f at 271
CPU read: 1011121314151617 at 331
PRC requested the bus.
CPU read: 18191a1b1c1d1e1f at 391
CPU started read from address 00000020 at time 420.
CPU read: 2021222324252627 at 556
CPU read: 28292a2b2c2d2e2f at 616
CPU read: 3031323334353637 at 676
CPU read: 38393a3b3c3d3e3f at 736
PRC granted the data bus.
CPU started read from address 00000180 at time 1215.
CPU read: 0001020304050607 at 1381
CPU read: 08090a0b0c0d0e0f at 1441
CPU read: 1011121314151617 at 1501
CPU read: 18191a1b1c1d1e1f at 1561
CPU started read from address 000001a0 at time 1665.
CPU read: 2021222324252627 at 1831
PRC requested the bus.
CPU read: 28292a2b2c2d2e2f at 1891
CPU read: 3031323334353637 at 1951
CPU read: 38393a3b3c3d3e3f at 201
PRC granted the data bus.
CPU started read from address 00000040 at time 2490.
CPU read: 4041424344454647 at 2641
CPU read: 4041424344454647 at 2656
CPU read: 5051525354555657 at 2671
CPU read: 4041424344454647 at 2686
PRC requested the bus.
PRC granted the data bus.
CPU started write to address 000001c0 at time 3307.
CPU write beat 1: 7777777777777777 at 3322
CPU write beat 2: 8888888888888888 at 3488
CPU write beat 3: 1111111111111111 at 3548
CPU write beat 4: 3333333333333333 at 3608
CPU started read from address 00000060 at time 3765.
CPU read: 6061626364656667 at 3916
CPU read: 6061626364656667 at 3931
CPU read: 7071727374757677 at 3946
CPU read: 6061626364656667 at 3961
PRC requested the bus.
PRC granted the data bus.
CPU started read from address 000001c0 at time 4440.
CPU read: 7777777777777777 at 4606
CPU read: 8888888888888888 at 4666
CPU read: 1111111111111111 at 4726
CPU read: 3333333333333333 at 4786
L125 "testbench.v": $finish at simulation time 5035000
4 warnings
158647 simulation events + 266655 accelerated events + 926440 timing check events
CPU time: 6.1 secs to compile + 161.8 secs to link + 377.5 secs in simulation
End of VERILOG-XL 2.1.2 Mar 19, 1996 12:15:44
186
LIST OF REFERENCES
Aguilar F., M.E., "Testing of a Read Prediction Buffer
Integrated Circuit and Design of a Predictive Read Cache,
"
Master's Thesis, Naval Postgraduate School, Monterey, CA,
March 1995.
Billingsley, A.B. and D.J. Fouts, "Memory Latency Reduction
Using an Address Prediction Buffer, " Twenty-Sixth Asilomar
Conference on Signals , Systems , and Computers , Vol.1, pp.78-
82, 1992.
Fouts, D.J. and A.B. Billingsley, "Predictive Read Caches: An
Alternative to On-Chip Second-Level Cache Memories," Journal
of Microelectronic Systems Integration, Vol.2, No . 2 , 19 94.
Hennessy, J.L. and D.A. Patterson, Computer Architecture : A
Quantitative Approach, Morgan Kaufmann Publishers, Inc., San
Mateo, CA, 1990.
Miller, R.W., "Simulation and Analysis of Predictive Read
Cache Performance," Master's Thesis, Naval Postgraduate
School, Monterey, CA, June 1995.
Nowicki, G.J., "The Design and Implementation of a Read
Prediction Buffer," Master's Thesis, Naval Postgraduate
School, Monterey, CA, December 1992.





