A performance measure of page mode dram as a second level cache in microprocessors by Shoemaker, David R. (David Robert)
A PERFORMANCE MEASURE OF PAGE MODE DRAM AS
A SECOND LEVEL CACHE IN MICROPROCESSORS
by
David R Shoemaker
SUBMITTED TO THE DEPARTMENT OF ELECTRICAL
ENGINEERING AND COMPUTER SCIENCE IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS





MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 1992
Copyright © David R Shoemaker, 1992. All rights reserved.
The author hereby grants to MIT permission to reproduce and to
distribute copies of this thesis document in whole or in part.
Submitted to the Department of Electrical Engineering and Computer Science on
June 1, 1992 in partial fulfillment of the requirements for the degrees of Bachelor
of Science and Master of Science.
Abstract
An intensive study on three different types of page mode DRAM configurations was
conducted to determine the effect of each on a microprocessor. Pure page mode schemes
involve holding RAS lines down after an access to main memory in order to cache an entire
row. Register-based cache DRAMs utilize a row of registers by the sense amplifiers to
create a slightly more flexible cache. Cache DRAMs with embedded SRAMs allow for a
fully-functional small SRAM cache to be included inside each DRAM chip. The
advantages and disadvantages of each scheme are discussed.
A microprocessor simulator was created to model the performance of each page mode
scheme. By incorporating time, this simulator modeled the interaction of microprocessor
resources as well as the miss rates for the various second level caches. While the performance
results are somewhat specific to the particular microprocessor modeled, the second level
miss rates will not change as resources are modified. The simulator modeled a floating
point unit, integer unit, and a store buffer, as well as a memory system that included first
and second level caches, and main memory.
Over two trillion instructions were simulated from the SPEC Benchmarks. A variety of
first and second level cache sizes were swept to give comprehensive data on the
performance of the page mode configurations. A total of forty-two sweeps were completed
across the ten SPEC Benchmarks. Surprisingly, second level miss rates slightly improved
as first level cache sizes were increased.
Thesis Supervisor: Professor Steve Ward
Title: Professor of Electrical Engineering and Computer Science
Thesis Supervisor: Dr. Patrick Bosshart, TI Fellow
Title: Integrated Systems Laboratory
Dedication
I have enjoyed having the opportunity to work under Pat Bosshart for the duration of this
thesis, and his patience, insight, and understanding helped make this all possible. He was
truly an inspiring force, both on and off the hockey rink ice.
I would also like to thank Steve Ward for overseeing my thesis from the MIT end,
particularly for providing background material and resources when needed.
I am indebted to the members of TI, Integrated Systems Lab for allowing my use of their
SPARCstations for background jobs.
I would also like to thank Sun Microsystems, especially Robert Cmelick, for the use of the
Shadow tracing program.
Finally, I would like to thank my parents for supporting me during my school years. My
ambition and drive are a reflection of the encouragement they have given me over the years.
I also thank God for giving me the ability to complete this thesis and helping me to keep a








2.2 Microprocessor Simulator
2.2.1 Integer Unit
2.2.2 Floating Point Unit
2.2.3 Memory System
2.3 Microprocessor Simulator Statistics
2.4 Range of Simulations
2.5 SPEC Benchmark
3. Memory System
3.1 Pure Page Mode DRAM
3.1.1 Pure Page Mode Model
3.1.2 Trade-offs
3.2 Register-Based Cache DRAMs
3.2.1 Register Based Cache DRAM Model
3.2.2 Trade-offs
3.3 Embedded SRAMs




4.1 First Level Cache Results
4.2 First Sweep Miss Rates
4.2.1 Integer Benchmarks
4.2.2 Floating Point Benchmarks
4.3 First Sweep Performance
4.3.1 Integer Benchmarks
4.3.2 Floating Point Benchmarks
4.4 Second Sweep Miss Rates
4.4.1 Integer Benchmarks
4.4.2 Floating Point Benchmarks
4.5 Second Sweep Performance
4.5.1 Integer Benchmarks
4.5.2 Floating Point Benchmarks
5. Conclusions
Appendix A. First Level Miss Rates
Appendix B. CPI Breakdown
Appendix C. 1st Sweep Miss Rates
Appendix D. 1st Sweep Performance Impact
Appendix E. 2nd Sweep Miss Rates
Appendix F. 2nd Sweep Performance Impact
Appendix G. Sample Raw Data
Chapter 1
Introduction
As microprocessors are clocked at increasingly faster rates, cache performance
becomes continually more important. Second level caches have been shown to improve
performance, but the cost sometimes prevents designers from including them in a
microprocessor. Because memory systems prove to be more of a bottleneck with each new
generation of microprocessors, the development of an economical, but effective second
level cache becomes more important to the designer.
This thesis explores the performance impact of using various schemes in main
memory to create a physically mapped, second level, unified cache. Three categories of
page mode schemes are compared. Pure page mode DRAMs can cache rows of data in the
sense amps by holding RAS lines down after memory accesses. Register-based cache
DRAMs are special purpose DRAMs which allow a very long row to be split up into
multiple blocks. Cache DRAM architectures with embedded SRAMs contain a complete
SRAM cache on chip with the DRAM. While many such page mode or cache DRAM
systems have been proposed, no large scale performance studies have been completed on
them. While a successful page mode scheme can save a significant number of cycles, a
poor performing system can actually degrade overall performance. A simulator to model
microprocessor resources and execution time was created to evaluate the effect of these
page mode schemes on both the memory system and the overall microprocessor
performance.
The SPEC Benchmark Suite [Dixit 91], release 1, was used for the simulations.
While no benchmarks can accurately predict an overall performance for a microprocessor,
SPEC seems to be the most comprehensive benchmark suite broadly available today. The
SPEC benchmark suite consists of approximately 45 billion instructions. A series of
simulations were completed over different first level cache sizes, second level cache sizes,
and second level block sizes. In addition, a performance penalty charge was considered for
those page mode schemes that degraded performance on a miss. Result data for this project
was compiled by simulating over two trillion instructions.
All graphs for this thesis are included in the appendices. Appendices A and B
include information on first level miss rates and clocks per instruction (CPI), respectively.
Appendices C and D represent a sweep of second level total cache sizes and second level
block sizes while the first level instruction and data caches are held constant. Appendix C
shows miss rates for the benchmarks, while Appendix D shows the total performance
impact as well as the memory impact of the second level caches on the microprocessor.
Appendices E and F represent a second sweep of cache parameters. Two optimal second
level block sizes were chosen and the total sizes for both the first and second level cache
sizes were increased. Appendix E represents miss rates and Appendix F shows
performance impacts. Appendix G includes a partial listing of the raw data collected during
the various simulator runs.
The microprocessor resources modeled for this project were from a low-end SPARC
processor. While numerous performance indicators are represented in this thesis, one
should realize that performance results are somewhat particular to the processor
implemented. However, since relatively simple integer and floating point units were
implemented, one might expect that if better performing units were used, the impact of the
page mode systems would be even greater, since the memory system would be more of a
bottleneck to the entire system. The graphs for performance impact could be looked upon
as a pessimistic indication of the effect of page mode systems. The graphs for miss rates




The Shadow [Hsu 89] program from Sun Microsystems was used to gather traces.
Shadow is a traceless routine which executes a program while filling up a trace buffer with
structures that contain information such as the instruction word, effective address, etc., for
each instruction. When the buffer fills up, Shadow calls the simulation program, which runs a
virtual processor to gather performance results. Once the simulator has executed all the
instructions in the trace buffer, the buffer is emptied and the original benchmark program continues
while Shadow refills the trace buffer. This scheme allows the simulation of large programs
without needing prohibitive amounts of memory. The simulations are all based on
programs running on the SPARC architecture.
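The fill-and-drain scheme described above can be sketched as follows. This is only an illustration in Python, not the Shadow interface: the names TraceRecord, TraceBuffer, and analyze are hypothetical, and the record holds only the two fields the text mentions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TraceRecord:
    """One traced instruction (fields named in the text; layout is assumed)."""
    instruction_word: int
    effective_address: int


class TraceBuffer:
    """Fixed-capacity buffer that hands its contents to an analyzer when full."""

    def __init__(self, capacity: int, analyze: Callable[[List[TraceRecord]], None]):
        self.capacity = capacity
        self.analyze = analyze          # stands in for the simulation program
        self.records: List[TraceRecord] = []

    def append(self, record: TraceRecord) -> None:
        """Add one record; drain the buffer as soon as it becomes full."""
        self.records.append(record)
        if len(self.records) == self.capacity:
            self.flush()

    def flush(self) -> None:
        """Run the analyzer on the buffered records, then empty the buffer."""
        if self.records:
            self.analyze(self.records)
            self.records = []
```

Because the analyzer only ever sees one buffer's worth of records at a time, memory use is bounded by the buffer capacity rather than by the length of the benchmark.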
2.2 Microprocessor Simulator
The simulator created for this project was written in standard C, and modeled all of
the major resources of a microprocessor including the integer unit, floating point unit, first
and second level caches, store buffer, and main memory. It measured performance by
counting clock cycles using an event-driven timing mechanism. Each of these resources
contributed to the overall CPI of the system, and the simulator determined how much of the




The integer unit was modeled by advancing the time counter an appropriate number of
clock cycles for each instruction executed. Most integer instructions took one cycle,
store instructions took two cycles, and multiplies and divides took eight and sixteen cycles,
respectively. Conditional traps took five cycles. All loads, including double words,
were accomplished in one cycle. These clock cycles were also stored in a separate counter
which allowed the calculation of an instruction CPI. For an optimal RISC processor, this
number should approach one, as should the overall CPI.
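As a rough illustration of this accounting, the per-instruction charges can be kept in a small table and summed over a trace. The sketch below is in Python rather than the C of the simulator, the category names are hypothetical labels, and the conditional trap charge is read here as five cycles.

```python
# Cycle costs quoted in the text; the category names are hypothetical labels.
CYCLES = {
    "alu": 1,        # most integer instructions
    "load": 1,       # all loads, including double words
    "store": 2,
    "multiply": 8,
    "divide": 16,
    "trap": 5,       # conditional trap (assumed to mean five cycles)
}


def instruction_cpi(trace):
    """Return cycles per instruction for a sequence of instruction categories."""
    trace = list(trace)
    if not trace:
        raise ValueError("empty trace")
    return sum(CYCLES[kind] for kind in trace) / len(trace)
```

For example, a trace of two ALU operations, one load, and one store costs five cycles over four instructions, an instruction CPI of 1.25.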
2.2.2 Floating Point Unit
The floating point unit was modeled by creating a linked list of floating point
structures with a length equal to the length of the floating point queue. Each time a floating
point operate instruction was executed, an element was added to the linked list. In the
structure, six pieces of information were kept. First, the time was recorded at which the
instruction entered the floating point queue. Then three 32 bit register masks were created
for the two source registers and the destination register that the instruction used. For these
masks, a one in a particular bit number indicated that the corresponding register was used.
Next a bit was kept to determine whether the instruction was a floating point compare,
which needed to be handled as a special case. Finally, the total amount of time required to
complete the instruction was also kept. The floating point queue length could be set to an
arbitrary number in the program, but the microprocessor implemented had a queue length of
one, so all simulations reflect this length.
When a floating point entry was sent to the queue, the processor was able to continue
with other operations provided that some sort of stall was not necessary. These stalls could
occur in a number of ways. If the queue were full, and another floating point operation
occurred, the processor had to wait for an item to leave the queue. Each time a floating
point load occurred, the source and destination registers of instructions in the queue had to
be checked against the source and destination registers of the load to see if a collision
occurred. This check could be completed very quickly through a logical AND of the load
register masks and the floating point operate register masks. If the destination register of
the load matched any of the source or destination registers of pending operations in the
queue, then the processor had to wait for the queue to finish the conflicting operation before
the load occurred. Similarly, when a floating point store occurred, the destination register
of the store had to be compared against the destination registers of each pending instruction
in the queue, and a stall would occur on such a collision. A different stall could occur if a
floating point branch were executed while there was a pending compare statement in the
queue. A floating point branch required the processor to stall until all compares were
completed. Finally, a state load or store required the processor to stall until the queue was
empty. These processor stall cycles were kept in a separate counter and comprised the
floating point CPI. If collisions were avoided, floating point instructions could be executed
in parallel with other resources, so no "time" would be charged to the system.
The simulator handled the floating point queue by putting every operation in the
queue, and constantly checking for one of the above collisions. If a collision occurred, the
completion time of the offending operate instruction was calculated and compared to
"current" time. If current time was greater than the completion time, then presumably the
floating point instruction had time to complete and no actual collision occurred. If not,
current time was updated to equal the completion time of the instruction in order to
simulate the stall.
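The mask comparison and the stall calculation for a floating point load can be sketched as below. This is a Python illustration under stated assumptions, not the simulator's code: the names FpEntry, reg_mask, and time_after_fp_load are hypothetical, and the completion time is assumed to be the queue entry time plus the instruction's total execution time.

```python
from dataclasses import dataclass


def reg_mask(*regs: int) -> int:
    """32-bit mask with one bit set per floating point register number."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


@dataclass
class FpEntry:
    enter_time: int      # time the instruction entered the queue
    src1_mask: int
    src2_mask: int
    dest_mask: int
    is_compare: bool
    latency: int         # total cycles needed to complete

    @property
    def completion_time(self) -> int:
        # Assumption: completion is entry time plus execution time.
        return self.enter_time + self.latency


def time_after_fp_load(now: int, load_dest: int, queue: list) -> int:
    """Return the time at which a floating point load to load_dest may proceed."""
    load_mask = reg_mask(load_dest)
    for entry in queue:
        used = entry.src1_mask | entry.src2_mask | entry.dest_mask
        # A logical AND of the masks detects a register collision.
        if load_mask & used and entry.completion_time > now:
            now = entry.completion_time   # stall until the conflicting op finishes
    return now
```

If the conflicting instruction has already had time to complete, the returned time is unchanged and no stall is charged.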
2.2.3 Memory System
The simulator had the ability to model multiple levels of caches, complete with write-
back/write-through capability, set associativity, LRU replacement, and variable block size
options. The different page mode schemes were modeled as second level caches. To do
this accurately, a few changes were needed. Since the caches were modeled as virtual
caches and the page mode schemes created physical caches, the page number bits were
scrambled to simulate a random mapping. To simulate this random mapping, the following
algorithm was used. First, the bottom twelve bits of the address were changed to zero.
These were the offset bits for the page. Then the XOR of the top sixteen bits of the address
and the bottom sixteen bits of the address was calculated. The resulting sixteen bits were
again split and an XOR was performed on the two sets of eight bits. The resulting eight bits
were divided into a top and bottom half for a final XOR. The resulting four bits replaced
bits twelve through fifteen of the original address to simulate a random bank and page
number for the physical address. Only accesses to the second level, physical DRAM caches
received this address.
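The XOR folding described above can be written out as a short function. This is a sketch in Python, assuming 32 bit addresses and assuming that the page offset of the original address is kept in the result; the function name is hypothetical.

```python
def scramble_page_bits(addr: int) -> int:
    """Fold the page number down to four bits and place them in bits 12-15."""
    addr &= 0xFFFFFFFF
    page = addr & 0xFFFFF000                  # zero the twelve offset bits
    x16 = ((page >> 16) ^ page) & 0xFFFF      # top sixteen XOR bottom sixteen
    x8 = ((x16 >> 8) ^ x16) & 0xFF            # fold sixteen bits to eight
    x4 = ((x8 >> 4) ^ x8) & 0xF               # fold eight bits to four
    # Replace bits twelve through fifteen; all other bits are left unchanged.
    return (addr & 0xFFFF0FFF) | (x4 << 12)
```

For the address 0x12345678 the intermediate values are 0x4234, 0x76, and 0x1, so the scrambled address is 0x12341678.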
Additionally, second level block sizes were not allowed to be larger than the 4K byte
page size of the operating system. Any larger physical block size would have resulted in
two unrelated virtual pages being cached in the same block. The chances of accesses
jumping from one virtual page to the other one in the same block would have been
extremely small.
The caches were set up as large arrays that contained only the tag since the simulator
never needed the actual data. There was no array set up for main memory since all data not
found in the caches was assumed to be in main memory. There were a variety of time
penalties associated with the different caches. A first level Icache or Dcache miss had a
latency charge required to get the first word out. This charge was unavoidable, since the
processor had to wait while this occurred. In addition, there was also a possible throughput
charge as the Icache or Dcache was busy filling the rest of the block. Once the first word
was returned, the processor could begin subsequent operations, but the Icache or Dcache,
and the DRAM would be marked busy for the time it took to fill the rest of the block. Any
subsequent request to these resources would have to check the busy counter to see if the
block fill was completed. If not, then the processor had to stall. The exception to this case
was an Icache collision that resulted from the program counter being incremented by one. The
processor avoided the stall for this case. The latency and throughput charges were part of
the memory CPI, and consequently the total CPI since the processor could do nothing else
during these periods. A second level miss added an additional penalty to the memory
system, but was strictly a latency charge since the second level caches were actually
specialized main memory DRAM chips. The second level penalties will be discussed more
thoroughly when the page mode configurations are discussed.
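The split between the latency charge and the throughput charge can be illustrated with a busy counter per resource. The Python sketch below is a simplification under stated assumptions: BusyResource and first_level_miss are hypothetical names, the latency and fill times are placeholder parameters rather than values from this thesis, and the sequential Icache exception is not modeled.

```python
class BusyResource:
    """Tracks the time until which a resource (cache or DRAM) is busy."""

    def __init__(self):
        self.busy_until = 0

    def wait(self, now: int) -> int:
        """Return the time at which a request issued at `now` can start."""
        return max(now, self.busy_until)


def first_level_miss(now, cache, dram, latency, fill_time):
    """Charge a first level miss and return the time the processor resumes.

    `latency` is the stall for the first word; `fill_time` is the extra time
    both resources stay busy while the rest of the block is filled.
    """
    start = max(cache.wait(now), dram.wait(now))   # throughput stall, if any
    resume = start + latency                       # unavoidable latency charge
    cache.busy_until = dram.busy_until = resume + fill_time
    return resume
```

A request that arrives while an earlier block fill is still in progress is delayed until the busy counter expires, which is the throughput charge; a request that arrives later pays only the latency.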
The store buffer in the microprocessor contained two entries. If a dirty miss
occurred, then the block needing to be written back to main memory would be kept in the
store buffer. If the store buffer had an entry in it, then the simulator checked each cycle for
the DRAM to be marked free to allow the store buffer entry to be written to DRAM. Once
the DRAM was free, the store buffer entry was deleted and the DRAM marked busy with
the write. If two entries were already in the store buffer and another dirty miss occurred,
then the processor stalled to allow one of the entries to be written to main memory. These
penalties were included in the memory CPI.
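A minimal sketch of the store buffer behavior follows, written in Python with hypothetical names. It assumes that a stall on a full buffer lasts until the DRAM is free to accept the oldest entry, and the write duration is a placeholder, not a figure from this thesis.

```python
from collections import deque


class StoreBuffer:
    """Two-entry write-back buffer that drains to DRAM when the DRAM is free."""

    def __init__(self, capacity=2, write_time=4):
        self.capacity = capacity
        self.write_time = write_time    # placeholder DRAM write duration
        self.entries = deque()
        self.dram_busy_until = 0

    def tick(self, now):
        """Called each cycle: start one pending write if the DRAM is free."""
        if self.entries and now >= self.dram_busy_until:
            self.entries.popleft()
            self.dram_busy_until = now + self.write_time

    def dirty_miss(self, now, block):
        """Queue a dirty block and return the time after any stall."""
        if len(self.entries) == self.capacity:
            now = max(now, self.dram_busy_until)   # stall until the DRAM is free
            self.tick(now)                         # oldest entry starts its write
        self.entries.append(block)
        return now
```

The difference between the returned time and the time passed in is the stall that would be added to the memory CPI.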
The final component of the CPI came from a load-use stall that occurred in the
SPARC Architecture [Sun 89] whenever an instruction tried to use the result of a load that
occurred on the previous cycle. The four parts of the CPI give an idea of not only the
performance of the microprocessor, but the relative impact an improvement to the memory
system might have on the entire system. The graphs in Appendix B show the CPI
components for each of the SPEC Benchmarks, as well as the averages for the integer and
floating point benchmarks. The first two pages (pp 44-45) show the CPI for a memory
system using a simple page mode scheme with 32M bytes of main memory. The memory
system accounts for an average of 22 percent of the CPI for the integer benchmarks and 44
percent of the CPI for the floating point benchmarks. The last two pages (pp 46-47) of
Appendix B show the CPI breakdown for the largest memory configuration tested,
consisting of 32K byte instruction and data caches, and a 1M byte second level cache.
For this case the memory only accounts for 7 percent of the CPI for the integer benchmarks
and 16 percent for the floating point benchmarks.
2.3 Microprocessor Simulator Statistics
The simulator collected a wide variety of miss ratio statistics including miss ratios for
reads, writes, data accesses, instruction accesses, and different length data accesses, all for
both the first and second levels. Percentages of write backs and read modify writes for the
first level caches were also included. A number of performance statistics were also kept,
including clocks per instruction (CPI) for each program, broken down by microprocessor
resource as discussed earlier. Additionally, DRAM utilization was calculated, as well as
percentage of floating point operate instructions.
In order to compare the three page mode schemes, the simulator also returned the
number of clock cycles that the page mode systems saved, and gave a percentage
performance increase of using a page mode scheme versus a traditional single level caching
scheme that uses page mode only for block fills. Finally the simulator returned the number
of 2nd level dirty misses that would have occurred if the page mode schemes had somehow
been made into write back caches. The total SPECmark performance rating for a
benchmark execution was given as well.
For the memory system, the simulator kept two additional sets of statistics. The first
set determined the number of cycles the Icache, Dcache, and store buffer caused the
microprocessor to stall due to throughput charges. If an Icache request had to wait for an
Icache fill from DRAM, then an appropriate counter was increased. Nine such counters
were kept to account for each dependency. The second set of statistics tried to determine
patterns that caused the second level to miss. If a second level Dcache access was followed
by a second level Dcache miss, then the "d after d" counter was updated. The reasoning
behind this was to detect cases for which page mode would most likely fail.
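The pattern counters can be pictured as a tally keyed by the kind of the previous second level access and the kind of the access that missed. The Python sketch below is illustrative only: apart from "d after d", the counter labels and the function name are assumptions.

```python
from collections import Counter


def miss_patterns(accesses):
    """Count second level misses by (missing access kind, previous access kind).

    `accesses` is an iterable of (kind, hit) pairs, where kind is "i" or "d".
    """
    counts = Counter()
    previous = None
    for kind, hit in accesses:
        if not hit and previous is not None:
            counts[f"{kind} after {previous}"] += 1
        previous = kind
    return counts
```

A large "d after i" or "i after d" count, for instance, would suggest that instruction and data streams are evicting each other's rows.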
A complete example of one of the simulation files is included in the following two
pages.
Shadow: version 1.1 (10/Jan/90)
Analyzer: /nfs/ray/u3/shadow/cache5lru: version 3.1 (16/August/90)
Application: fpppp
Hostname: gladstone
Date: Wed Dec 25 05:23:03 1991
Speed: 3 IPS
Status: final
1448153371 instructions (including annulled)
1443743811 instructions (excluding annulled)
34.7 SPECmarks for fpppp

1.934% D write backs

for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks

453736166 # of ticks saved = 7.76 percent of total
1239614 # of 2nd level dirty misses
2.4 Range of Simulations
Over two trillion instructions were simulated for this project by utilizing twenty
SPARCstations with low priority background jobs for a period of about two months. Each
run of the SPEC benchmarks, which contained 45 billion instructions, took about two
machine weeks on a SPARCstation 2.
Two groups of simulations were run. In the first group, a range of second level block
sizes and second level total cache sizes were swept, while the first level caches were held
constant at roughly typical values for today's microprocessors. These simulations helped
determine optimal block sizes. Total second level cache sizes were kept near values
attainable using page mode DRAM in typical memory systems.
There had been some concern that the larger first level cache sizes of future
microprocessors could render page mode caches ineffective. Therefore the second group of
simulations focused on increasing both the first and second level cache sizes. To reduce
simulation time, only two of the best second level block sizes (1K bytes and 512 bytes)
were used for these ranges. Additionally, since one run of the Spice Benchmark took as
long as the other nine benchmarks combined, some of those runs were eliminated from this
sweep. Table 2-I shows a list of all simulations.
Table 2-I: Parameter sweeps of Cache Sizes in Bytes
1st Lvl          2nd Lvl       BlockSize    Sweep
I=8K  D=4K       16K - 64K     128 - 4K     First
I=8K  D=8K       128K - 1M     512 - 1K     Second
I=16K D=16K      128K - 1M     512 - 1K     Second
I=32K D=32K      128K - 1M     512 - 1K     Second
2.5 SPEC Benchmark
The SPEC benchmark is composed of ten individual benchmarks that perform
minimal I/O and are designed to be CPU intensive. The programs are large enough to
avoid fitting into most first level caches. Four of the programs are integer benchmarks,
while six are floating point. The following is a brief description of each of the ten programs
[Dixit 91].
001.gcc1.35 - This is the GNU C compiler, Version 1.35; the benchmark measures the time for the
compilation of nineteen source files. This program was chosen to test caches, and exhibits
a load/store percentage of about 25%. This program is an integer benchmark written in
C. 1.2 billion instructions are executed, making it the smallest benchmark.
008.espresso - This is a tool from the University of California at Berkeley that
generates and optimizes PLAs. Four input models are run on espresso for this benchmark.
The program is relatively small, spending a reasonable amount of time looping. 30% of the
instructions are load/store. The benchmark executes a total of 2.9 billion instructions and is
an integer benchmark written in C.
013.spice2g6 - Another tool from Berkeley, this is the standard analog circuit
simulator widely used in industry. Five copies of a grey code counter are simulated for this
benchmark. Although considered a floating-point benchmark, this benchmark only
executes about 4% floating point operate instructions, and another 4% floating point
load/store. Written in Fortran, this program executed by far the most instructions with a
total of 22.8 billion.
015.doduc - This is another floating point benchmark that completes a Monte Carlo
simulation of the time evolution of a thermohydraulic model for a nuclear reactor's
components. Many subroutines are executed, causing the code to jump around quite often.
26% of the instructions are floating point operations while another 24% are floating point
load/store. Most loads are double words. A total of 1.3 billion instructions are run for this
benchmark.
020.nasa7 - This benchmark is a collection of seven kernels that test common
scientific computations. Written in Fortran, this floating point program executes 30%
floating point operations, and another 44% floating point load/store instructions. A total of
6.8 billion instructions are executed for this benchmark.
022.li - The third of the integer benchmarks, this is a LISP interpreter written in
C. The performance is measured in the time it takes li to solve the Nine Queens problem. A
total of 4.9 billion instructions are executed, with about 25% being load/store operations.
023.eqntott - The fourth and last of the integer benchmarks, eqntott translates a
logical representation of a Boolean equation into a truth table. About 32% of the
instructions are load/store. Although the program will fit in some instruction caches, the
data cache is significantly thrashed. This program executes 1.3 billion instructions.
030.matrix300 - This is a double-precision floating point intensive benchmark that runs
operations on 300 by 300 matrices. 38% of the instructions are floating point load/store and
another 25% are floating point operations. 1.44 Megabytes of data accesses can cause
significant data cache problems. 1.7 billion instructions are executed for this program.
042.fpppp - This double-precision floating point benchmark measures performance of
the two electron integral derivative computation that occurs in a Gaussian series of
programs. (I don't know what this means either). 41% of the instructions are floating point
operations and another 44% are floating point load/store. A total of 1.4 billion instructions
are executed.
047.tomcatv - The sixth floating point benchmark, this benchmark is a mesh
generation program. 31% of the instructions are floating point operations and 26% are
floating point load/store instructions. A total of 1.6 billion instructions are executed. This
program was included because it thrashes the data cache.
When calculating the SPECmark rating for a microprocessor, the time it takes each of
the benchmarks to complete on a VAX-11/780 is divided by the completion time on the
microprocessor. The geometric mean of these ten ratios is considered the SPECmark rating
for the microprocessor. Since each of the benchmarks is given equal consideration in the
SPECmark rating, when completing the average floating point and integer graphs for miss





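The SPECmark calculation described in Section 2.5 is a geometric mean of per-benchmark speed ratios. A small Python sketch, with a hypothetical function name and with made-up times in the usage note, might look like this:

```python
import math


def specmark(reference_times, measured_times):
    """Geometric mean of reference/measured time ratios, one per benchmark."""
    if len(reference_times) != len(measured_times) or not reference_times:
        raise ValueError("need one measured time per reference time")
    ratios = [ref / t for ref, t in zip(reference_times, measured_times)]
    # Averaging the logarithms is equivalent to taking the n-th root of the
    # product of the ratios.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

With two benchmarks whose ratios are 10 and 40, the rating is the square root of 400, or 20. Because the mean is geometric, each benchmark carries equal weight regardless of its running time.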
The main memory system modeled by the simulator is shown in figure 3-1. The total
main memory size was 32M bytes, divided into four independent banks, each configured
with 16 1Mx4 bit DRAMs for a bank configuration of 1Mx64 bits. The banks were square
with a row consisting of 1Kx64 bits, or 8K Bytes. A normal DRAM access consists of
driving the row address and dropping RAS, and then driving the column address and
dropping CAS. Immediately after the read is finished, the RAS and CAS lines are
precharged high for the next access. For the processor implemented, the DRAMs had an
access time of 80ns while the clock period was 15ns. For such a memory system, three
ways to design a page mode cache will be presented.
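The sizes quoted for this configuration can be checked with a few lines of arithmetic. The Python sketch below uses hypothetical variable names; the last line assumes that a DRAM access is rounded up to a whole number of clock periods, which is an assumption and not a figure stated in the text.

```python
import math

CHIPS_PER_BANK = 16
CHIP_WORDS = 1 << 20         # 1M addresses per chip
CHIP_WIDTH_BITS = 4
BANKS = 4
ROW_WORDS = 1 << 10          # square array: 1K rows by 1K columns

bank_width_bits = CHIPS_PER_BANK * CHIP_WIDTH_BITS     # 64 bits wide
bank_bytes = CHIP_WORDS * bank_width_bits // 8         # 8M bytes per bank
total_bytes = BANKS * bank_bytes                       # 32M bytes in all
row_bytes = ROW_WORDS * bank_width_bits // 8           # 8K bytes per row

ACCESS_NS, CLOCK_NS = 80, 15
access_cycles = math.ceil(ACCESS_NS / CLOCK_NS)        # whole clocks per access
```

Sixteen 4 bit wide chips give the 64 bit bank width, four 8M byte banks give the 32M byte total, and one row of 1K 64 bit words is 8K bytes.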
3.1 Pure Page Mode DRAM
Pure page mode DRAMs can cache a row of data in the sense amps by not
precharging the RAS lines after an access to main memory. If a subsequent access is in the
same row, then only the shorter CAS access need occur. A miss causes the normal RAS
access to occur, but must also first precharge the RAS lines. As a result, a miss in a page
mode cache is actually slower than a normal DRAM access. This additional time will be
referred to as the precharge penalty. If a page mode DRAM cache has too high a miss rate,
the added penalty of precharging the RAS lines can decrease overall performance of the
memory system.
Pure page mode DRAMs have the limitation of utilizing a small number of very large
blocks. The memory configuration discussed earlier consisted of rows 8K bytes long. A
four bank main memory allows page mode DRAM to cache only four blocks, each with a
size equal to one row. This is illustrated in figure 3-1.
Figure 3-1: Typical SPARC Memory System. [Figure not reproduced. Upper panel, "Memory System": an 8K Icache (32 byte block size) and a 4K Dcache (16 byte block size) on 64 bit paths to main memory (1M x 64 per bank, 4 banks), backed by a disk with a 4K byte page size. Lower panel, "Memory System With Page Mode": the same system with main memory acting as a second level unified physical "cache" of 4 blocks, 4K bytes each.]
3.1.1 Pure Page Mode Model
Although the memory system modeled created rows that were 8K bytes in length, in
practice, half of the 8K byte block size is lost. The virtual page size for the SPARC
architecture is 4K bytes. Since the mapping from virtual addresses to physical addresses is
essentially random, each 8K byte block in a page mode DRAM caches two distinct and
unrelated physical pages. Therefore, the simulator assumed that the extra 4K byte page was
useless, and limited the effective block size to a maximum of 4K bytes.
Once the virtual address bits were scrambled, and the page size of the operating
system was taken into account, the page mode systems behaved similarly to a second level
cache. A second level cache hit saved the system the RAS access time, which for the
processor modeled equaled three clock cycles. For a pure page mode system, a second
level miss required not only the three clocks of the RAS access, but cost five additional
cycles due to the precharge penalty.
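A minimal model of this behavior keeps one open block per bank and charges each access relative to a normal DRAM access. The Python sketch below uses hypothetical names, assumes the bank is selected by the low bits of the block number, and charges the full five cycle penalty on every miss, including cold misses; the two special cases discussed next are not modeled.

```python
RAS_CYCLES = 3           # saved on a second level hit
PRECHARGE_PENALTY = 5    # maximum extra cycles on a second level miss


class PurePageMode:
    """One open 4K byte block per bank, as in a pure page mode second level."""

    def __init__(self, banks=4, block_bytes=4096):
        self.banks = banks
        self.block_bytes = block_bytes
        self.open_block = [None] * banks
        self.hits = self.misses = 0

    def access(self, addr):
        """Return the cycle change relative to a plain DRAM access."""
        block = addr // self.block_bytes
        bank = block % self.banks            # assumed bank selection
        if self.open_block[bank] == block:
            self.hits += 1
            return -RAS_CYCLES
        self.open_block[bank] = block        # the new row replaces the old one
        self.misses += 1
        return PRECHARGE_PENALTY

    def cycles_saved(self):
        """Net clocks saved over a single level caching system."""
        return RAS_CYCLES * self.hits - PRECHARGE_PENALTY * self.misses
```

A negative result from cycles_saved corresponds to the case where page mode degrades performance.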
When comparing systems with page mode to a single level caching system, the
simulator determined the number of clocks saved by using page mode. For a pure page
mode system that included the precharge penalty, the simulator determined exactly how
many cycles were lost on a second level miss. While the maximum loss in such a case was
five cycles, there were two cases during which a portion of this penalty was not suffered,
relative to the single level caching system.
In a single level caching system, whenever a DRAM access is finished, the RAS lines
are immediately precharged. However, if an access to a particular DRAM bank was
immediately followed by an access to the same bank, the precharge would not be finished
in time for the access to complete so the processor would stall. A portion of the precharge
penalty assigned to the pure page mode scheme would also be suffered by the single level
caching system. The simulator modeled this case when calculating the number of clocks
saved by a pure page mode system. Accesses to the same bank were noted, and the time
between DRAM requests was calculated. If this time was less than five, then the number of
clocks saved by page mode was adjusted accordingly.
The second case was unique to the processor implemented. If the DRAM was busy
filling a block, and then another access came for the DRAM, the processor was able to
utilize a second bus to look ahead and check for a second level hit if the access was to a
different bank. If a miss was detected from the address on the second bus, the bank could
begin its precharging process, thereby reducing the precharge penalty suffered. The
simulator took this case into account, although it proved to be a fairly rare case.
3.1.2 Trade-offs
The biggest drawback to the pure page mode DRAMs comes from the very large
block sizes. As future DRAMs get bigger, rows will get longer, and page mode DRAM
schemes will merely cache more unrelated physical pages, causing even more cached bytes
to be wasted. In addition, the total cache size is limited since only one block can be cached
per DRAM bank.
A useful side benefit of even this simple page mode scheme comes from the
significant power savings of not having to decode the RAS address on a hit. Also, since
page mode DRAMs are a commodity item, implementation of this scheme would be
relatively simple. However, the precharge penalty can become significant and even
degrade overall performance. For the microprocessor implemented, a second level hit
saved three clock cycles while a second level miss cost as many as five additional cycles,
indicating that a miss rate of more than forty percent in the second level cache would start
to cause a degradation of performance.
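The break-even point can be checked with a small calculation. The sketch below assumes that every second level miss pays the full five-cycle penalty, which is the worst case described above; because some misses pay less, the real break-even miss rate is somewhat higher, which is consistent with the figure of about forty percent. The function names are illustrative.

```python
from fractions import Fraction

HIT_SAVING = 3    # cycles saved by a second level hit (the RAS access time)
MISS_PENALTY = 5  # worst-case additional cycles on a second level miss


def net_cycles_saved(hits: int, misses: int, miss_penalty: int = MISS_PENALTY) -> int:
    """Clocks saved relative to a single level system for given hit/miss counts."""
    return HIT_SAVING * hits - miss_penalty * misses


def break_even_miss_rate(miss_penalty: int = MISS_PENALTY) -> Fraction:
    """Miss rate m at which HIT_SAVING * (1 - m) equals miss_penalty * m."""
    return Fraction(HIT_SAVING, HIT_SAVING + miss_penalty)


print(break_even_miss_rate())         # 3/8
print(float(break_even_miss_rate()))  # 0.375
```

With a penalty of zero the same formula gives a break-even miss rate of 1, meaning a scheme with no precharge penalty cannot lose cycles under this model.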
3.2 Register-Based Cache DRAMs
Special purpose cache DRAMs have been proposed which try to split up the large
blocks of page mode DRAMs into smaller, more workable blocks [Arimoto 90], [Asukura
89]. Register-based cache DRAMs solve this problem by placing a row of registers near
the sense amps of a conventional DRAM [Goodman 84], [Ward 88], [Ward 90]. These
registers can be loaded in blocks of an arbitrary size whenever a RAS access occurs. In
addition to allowing smaller block sizes, this scheme also avoids the precharge penalty of
page mode DRAMs since the registers now act as the cache and allow the RAS lines to be
precharged immediately after the reference.
3.2.1 Register-Based Cache DRAM Model
The simulator did not have to worry about special cases when handling register-based
cache DRAMs. Since the precharge penalty was eliminated for these second level caches,
the number of saved cycles over a single level caching system was three times the number
of second level hits. The only modeling difference the simulator took into account was the
elimination of this precharge penalty.
3.2.2 Trade-offs
While solving the page mode DRAM problem of huge block sizes, register-based
cache DRAMs do not address the issue of total second level cache size. Still, only one row is
typically cached per DRAM bank. Also, since cache DRAMs are not currently commodity
parts, their additional cost must be weighed against the benefits gained over pure page
mode DRAM.
However, the elimination of the precharge penalty ensures that this scheme will never
degrade performance. In addition, the power savings can still be significant, since the RAS
lines will not have to be dropped on a second level hit.
3.3 Embedded SRAMs
The most extreme form of main memory caching being introduced [Dosaka
92] involves putting a small SRAM memory along with each DRAM chip. This SRAM
functions as a complete second level cache and can be designed with an arbitrary total size,
block size, and set number. While the design of the previous two page mode schemes leads
to a direct-mapped cache, embedded SRAMs have the flexibility of adding multiple sets.
This page mode scheme solves both the problem of total cache size for the second level
cache, and the problem of block sizes. Like the previous cache DRAM architecture, it also
avoids the precharge penalty.
3.3.1 Embedded SRAM Model
There was no modeling difference between this case and the register-based cache
DRAMs. The only difference lies in the number of graphs that are applicable to this
scheme, since a greater flexibility is attained for DRAMs with embedded SRAMs. As an
additional feature, the simulator calculated the number of second level misses that would
have occurred if the embedded SRAM had been implemented with a write back scheme.
This may give a designer some feel for the gain that implementing these
embedded SRAMs as write back caches might yield. Once again the number of cycles
saved over a single level caching system was simply three times the number of second level
hits.
3.3.2 Trade-offs
This system gives maximum flexibility by avoiding all of the penalties the other two
systems incurred. There is no precharge penalty, block size limitation, or total cache size
limitation. However, these DRAMs are new and very far from becoming common chips.
While yielding the most flexibility, they will presumably be the most expensive to
implement.
3.4 Simulations
The simulations for this project sweep across many second level block sizes and total
sizes. The graphs shown do not differentiate between the three page mode schemes, other
than including graphs of second level caches with and without the precharge penalty.
However, each page mode scheme merely translates to a second level cache with a different
total cache size and block size. When analyzing the memory system of the processor
implemented (32MB total), the pure page mode system correlates to 16K bytes for the total
cache size, and a block size of 4K bytes. The register-based cache DRAMs correspond to a
32K total cache size (since the entire row can now be used) and any of the different block
sizes. Finally DRAMs with embedded SRAMs can correlate to virtually any of the points
for different total cache and block sizes. For the pure page mode systems, those graphs that
include the precharge penalty should be observed, while the graphs without the penalty
should be used for the other two schemes. Graphs with and without the penalty were
included for sweeps of all the parameters. By varying the block sizes and the total cache
sizes, and deciding whether to include the total precharge penalty, information about all




Graphs for both miss rates and performance impact are included in the appendices.
While the miss ratio statistics gathered are valid for any architecture implementation, the
performance impact of the page mode schemes is strictly applicable only to the specific
microprocessor implemented. Care must be taken when making generalizations to other
systems.
4.1 First Level Cache Results
The miss rates of the first level caches determined the number and frequency of the
accesses to the second level page mode caches. Four first level configurations were swept.
The first group of simulations kept the first level cache sizes constant with an 8K Icache
and a 4K Dcache. The second group of simulations increased these sizes, first increasing
the Dcache to 8K bytes, and then increasing both to 16K bytes and then 32K bytes. The
block sizes for all configurations were 32 bytes and 16 bytes for the instruction and data
caches, respectively. Both caches were two-way set associative and used an LRU
replacement scheme. First level miss rates for the different configurations are included in
Appendix A. The top graph indicates the total miss rate of each benchmark. The middle
graph gives the number of instruction misses divided by instruction references, and the
bottom graph works the same way for data references. The columns marked integer and
floating point give the arithmetic mean of the miss rates for the appropriate benchmarks.
The total miss rates indicate a miss rate range between one and two percent for the
integer benchmarks, and between four and ten percent for the floating point benchmarks. These
miss rates indicate that the benchmarks used were reasonable and that a significant number
of references were reaching the second level. Also, the graphs indicate that more can be
gained by increasing the Dcache rather than the Icache. If the benchmarks are assumed to
have forty percent load/store operations, then a 1% miss rate drop in the Icache is
equivalent to a 2.5% drop in the miss rate of the Dcache. By comparing the benefits of
doubling the Icache and the Dcache, it can readily be observed that the Dcache benefits
much more, even when taking the above ratio into account. One of the reasons for this is
that even with a small Icache, the miss rates are extremely low. For a similarly sized
Dcache, the miss rate is much higher, leaving a subsequent doubling much more
room for improvement.
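The 2.5 to 1 ratio follows from counting references per instruction. The sketch below assumes one instruction reference per instruction and a data reference on the stated forty percent of instructions; the function names are illustrative and not taken from the simulator.

```python
LOAD_STORE_FRACTION = 0.40  # assumed share of instructions that are loads or stores


def misses_per_instruction(icache_miss_rate: float, dcache_miss_rate: float,
                           load_store_fraction: float = LOAD_STORE_FRACTION) -> float:
    """First level misses per instruction: one instruction fetch per
    instruction plus a data reference on load/store instructions only."""
    return icache_miss_rate + load_store_fraction * dcache_miss_rate


def equivalent_dcache_drop(icache_drop: float,
                           load_store_fraction: float = LOAD_STORE_FRACTION) -> float:
    """Dcache miss rate drop that removes as many misses per instruction
    as the given Icache miss rate drop."""
    return icache_drop / load_store_fraction


print(f"{equivalent_dcache_drop(0.01):.3f}")  # 0.025
```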
4.2 First Sweep Miss Rates
The first wave of measurements used constant first level cache sizes of 8K bytes for
the instruction cache and 4K bytes for the data cache. Six block sizes were swept, ranging
from 128 bytes to 4K bytes, doubling at each step. Additionally, three total
cache sizes from 16K bytes to 64K bytes were simulated. A total of eighteen different
settings were swept. Appendix C shows a plot of these miss rates. As block sizes are
changed, one would expect to see a U-shaped curve of miss rates since very large or very
small block sizes should yield higher miss rates [Hennessy 90].
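The eighteen settings of the first sweep can be enumerated directly. This is only an illustration of the parameter grid described above, not part of the original simulator; the blocks-per-cache figure is derived here for convenience.

```python
from itertools import product

KB = 1024

# Six block sizes from 128 bytes to 4K bytes, doubling at each step.
block_sizes = [128 << i for i in range(6)]
# Three total second level cache sizes: 16K, 32K, and 64K bytes.
total_sizes = [(16 * KB) << i for i in range(3)]

settings = list(product(total_sizes, block_sizes))

for total, block in settings:
    print(f"{total // KB}K total, {block}-byte blocks: {total // block} blocks")
```

The 16K total, 4K block corner of this grid has only four blocks, which is the pure page mode configuration for a four bank memory.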
4.2.1 Integer Benchmarks
The graphs for the integer benchmarks in Appendix C show a standard U-shaped
curve. The optimal block sizes are 512 bytes and 1K bytes. As the total cache sizes are
increased, the curves are shifted down since miss rates improve. The page mode DRAM
cache corresponds to a block size of 4K bytes since an entire row of DRAM effectively
caches 4K bytes of data. For a four bank main memory scheme (16K bytes total cache), a
page mode DRAM scheme has a miss rate ranging from nine percent for 008.espresso to about
thirty-two percent for 023.eqntott. The average integer benchmark miss rate is around
twenty-two percent for pure page mode DRAM. The break even miss rate for
implementing pure page mode DRAM is around forty percent for the processor
implemented. The register-based cache DRAMs give the ability to break large blocks into
smaller blocks. For a four bank main memory scheme, they can take advantage of an entire
row of DRAM giving a total cache size of 32K Bytes. The ability to break large blocks into
smaller blocks improves the miss rate between five and ten percent for these benchmarks.
The effect is more pronounced on the smaller cache sizes. The ability to increase total
cache size yields improvement of about six percent per doubling.
The added flexibility of smaller block sizes and larger total cache sizes of the
register-based cache DRAMs and DRAMs with Embedded SRAMs could improve the miss
rate by a maximum of about fifteen percent. The optimal case tested is for a DRAM with
embedded SRAMs that resulted in a total cache size of 64K and a block size of 1K bytes.
The miss rate for such a case is about 8.5%.
4.2.2 Floating Point Benchmarks
The graphs for the floating point benchmarks are much more irregular. While most
of the graphs show the standard U-shape, the graphs for 030.matrix300 and 020.nasa7 (p
52) exhibit bizarre behavior. For these two benchmarks, the extreme block sizes, both very
large and very small, exhibit the best miss rates. In
addition, these two benchmarks change the most when parameters are swept, causing the
graph of the average floating point benchmark to be nearly flat.
The configuration for the pure page mode shows an average floating point miss rate
of about 37%, just slightly better than the 40% break even point. However, only two of the
six floating point benchmarks have a miss rate above 40%. 013.spice2g6 gives a miss rate
of 62% while 047.tomcatv gives a miss rate of 45%. Both miss rates drop significantly
if the block size is decreased or if the total cache size is increased.
The average floating point miss rate does not change much when block sizes are
altered, due primarily to the odd behavior of a couple of the benchmarks. Increasing the
total cache size decreases the miss rate by eight percent per doubling. All points on the
average floating point graph lie under the forty percent mark, indicating that any page
mode scheme will help, with or without the precharge penalty. The floating point graphs
indicate the wide variety of performance that caches can have. Not all benchmarks will
behave in an intuitive manner.
4.3 First Sweep Performance
One might expect the performance impact curves to look roughly like the inverse of
the cache miss rate curves. The magnitude of the performance gains depends on the
implementation of the rest of the system. Performance gains were reduced if an
improvement to the memory system caused the processor to be bound by some other
resource: for example, an improved memory system performance could make a processor
more bound by the floating point execution time. Also, if the CPI of a program is
dominated by some resource other than the memory, then no matter how much the memory
system is improved, the overall performance will not dramatically increase. The
performance graphs are located in Appendix D. For each benchmark, as well as the average
integer and floating point benchmarks, four graphs are shown. Performance impact with
and without the precharge penalty are considered, as well as the memory impact with and
without the precharge penalty. When comparing the impact with and without the RAS
precharge penalty, the three lines representing total cache size get bunched together and
each line varied over block size gets flattened out when the penalty is eliminated. Since a
better performing benchmark with a bigger performance gain must have a better miss rate,
it incurs fewer precharge penalties, so eliminating the RAS precharge penalty helps it less
than it helps a poorer performing benchmark with a smaller gain. Similarly, the lines
get flattened out over block size, since the poor performing block sizes have more to gain
from the elimination of the precharge penalty.
The performance impact is calculated by dividing the number of clocks saved by the
page mode scheme by the total number of cycles. The memory impact divides the number
of clocks saved by the total number of clocks that contributed to the memory CPI. The
memory CPI includes only those clocks charged to the processor for a first level miss. A
program that had a zero percent first level miss rate would have a zero memory CPI. The
graphs of the memory impact are the same as the graphs for the total impact, with an
appropriate scaling factor that models how much of the total CPI is devoted to the memory
system. The graphs with the precharge penalty assume a pure page mode scheme, while those
without the precharge penalty assume either a register-based cache DRAM or a DRAM
with embedded SRAMs.
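The two impact measures and the scaling factor between them can be written out as below. The cycle counts in the example are made up for illustration and are not results from the thesis. The sketch also assumes that the totals are those of the single level baseline system.

```python
def performance_impact(clocks_saved: float, total_cycles: float) -> float:
    """Fraction of all cycles saved by the page mode scheme."""
    return clocks_saved / total_cycles


def memory_impact(clocks_saved: float, memory_cycles: float) -> float:
    """Fraction of the memory CPI cycles saved by the page mode scheme."""
    return clocks_saved / memory_cycles


def memory_cpi_share(perf_impact: float, mem_impact: float) -> float:
    """Both impacts share one numerator, so their ratio equals
    memory_cycles / total_cycles, the memory share of the total CPI."""
    return perf_impact / mem_impact


# Hypothetical counts, chosen only to show the arithmetic.
total_cycles = 1_000_000
memory_cycles = 200_000
clocks_saved = 40_000

perf = performance_impact(clocks_saved, total_cycles)
mem = memory_impact(clocks_saved, memory_cycles)
print(f"{perf:.1%} {mem:.1%} {memory_cpi_share(perf, mem):.2f}")
```

Reading the ratio of a total impact graph to the matching memory impact graph in this way gives an estimate of how much of the CPI is spent in the memory system.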
4.3.1 Integer Benchmarks
The overall performance impact for pure page mode DRAM schemes on the integer
benchmarks averages to between three and six percent. These graphs include the penalty of
the additional RAS precharge time needed on a miss. The ability to change block sizes can
gain about one percent while increasing the total cache size from 16K bytes to 64K bytes
gains roughly two percent. In general, the first level miss rates are small enough that not
much total performance can be gained by improvements to the second level cache. The
individual benchmarks do not stray much from the average.
The average integer memory impact for the graphs with the precharge penalty is
between fifteen and twenty percent. The ability to change block sizes can give about a
seven percent swing, while increasing the total cache size gives an eight percent memory
performance increase for each factor of two increase. The ratio of these two graphs shows
that the memory CPI is roughly one-fifth of the total CPI on average.
By eliminating the precharge penalty and using either register-based DRAMs or
DRAMs with embedded SRAMs, the average total performance impact goes up by about
two percent, and the memory impact goes up by about ten percent. Changing the block size
now only gains a fraction of a percent for the total system, and about four percent for the
memory impact. Doubling the cache size gains about five percent on average. Each of the
individual integer benchmarks performs similarly. The ratios between the total impact and
the memory impact still show the memory CPI to be about one-fifth of the total CPI.
4.3.2 Floating Point Benchmarks
The performance of the floating point benchmarks is much more erratic. The graph
for the average floating point benchmark shows an overall performance gain between two
and seven percent and a memory performance gain between five and twenty-five percent
when the precharge penalty is included. The ratio between these two numbers indicates
that the memory CPI is roughly one-third of the total CPI. Changing the block size gives a
small and inconsistent percentage swing while increasing the total cache size shows an
overall performance gain of three percent per doubling, and a memory gain of about six
percent per doubling. Analyzing any individual benchmark can give wildly different
results. 030.matrix300 (p 63) yields negative performance gains for all block sizes other
than the maximum, or 4K byte size. Increasing the total cache size does not significantly
help the middle block sizes for this benchmark. However, the average of the floating point
benchmarks never yields a negative performance for any combination of block size and
total cache size.
When the precharge penalty is removed, the difference in performance impact is
significant. Since the floating point benchmarks tended to have higher miss rates, the
elimination of the precharge penalty impacted these graphs much more than the integer
benchmark graphs. The average floating point benchmark without the precharge penalty
yielded overall performance gains between nine and twelve percent and memory
improvement between twenty and thirty percent. Changing block sizes still had small and
unpredictable effects, while an increase in total cache size gained an additional four percent
of both overall and memory performance per doubling. Eliminating the precharge penalty
also tended to make all the individual floating point benchmarks behave much closer to the
average.
4.4 Second Sweep Miss Rates
The second sweep of simulations increased the first and second level total cache sizes
while keeping the second level block sizes constant. Ignoring the 030.matrix300
benchmark, block sizes of 512 Bytes and 1 Kilobyte performed the best and were used for
these simulations. Since each new generation of microprocessors will yield higher on-chip
first level cache sizes, this group of simulations was aimed at determining whether page
mode schemes will be applicable for future processors. First level caches were swept from
8K bytes for both the Icache and the Dcache to 32K bytes. The performance of those first
level caches is shown in Appendix A. Second level cache sizes were swept from 128K
bytes to 1M bytes by successive powers of two.
Since bigger first level caches would lead to fewer and less frequent second level
accesses, one might believe that second level performance would significantly decrease as
first level sizes increase. The graphs in Appendix E show that the miss rates actually
improve slightly as first level caches get bigger for both the integer and floating point
benchmarks. This was a very interesting and somewhat unexpected phenomenon that
speaks well for the future of page mode systems. The result is that small second level
caches perform better than expected with the larger first level caches. Additionally, as the
total size of the second level cache increases, the miss rate continues to drop. For the





4.4.1 Integer Benchmarks
For each of the first level configurations, increasing the total second level cache size
significantly reduces the miss rate of the average integer benchmark from five percent
(128K total) to about one percent (1M total). Increasing the first level sizes slightly reduces
the miss rate for the smaller second level total sizes, while for the larger total cache sizes
the miss rate stays constant. For 022.li (p 71), each time the first level cache is doubled, the
second level miss rate goes down more than a factor of two. The shape of each of the
individual integer benchmark graphs is similar, although the actual values of the miss rates
are significantly different. 023.eqntott has a miss rate range from fifteen to five percent,
while 008.espresso has a range from one to nearly zero percent.
4.4.2 Floating Point Benchmarks
Once again, the floating point benchmarks exhibit higher average miss rates, with average
values ranging from around fifteen percent for the 128K total size to about three percent for
the 1M total cache size. Again, the average miss rates slightly decrease as the first level cache
sizes are increased. The higher average miss rates for the floating point benchmarks are
largely due to 030.matrix300, since the 512 bytes and 1K bytes block sizes were shown to
be non-optimal block sizes for this benchmark. Miss rates for this benchmark start at about
45 percent (128K bytes total) but end up around two percent (1M bytes total). For these
benchmarks, the second level miss rate of 042.fpppp (p 76) decreases the most as first level
cache sizes are doubled.
4.5 Second Sweep Performance
The graphs in Appendix F indicate diminishing yields on the overall performance
impact as the first level sizes increase. Since the raw number of accesses to the second
level cache is going down, even an improved miss rate will have a lesser impact on the
processor because the memory CPI will be a smaller percentage of the total CPI. The
graphs in Appendix B indicate how the CPI breakdown changes from the worst case
configuration (16K total, 4K blocks) to the largest configuration (1M total, 1K blocks).
The memory CPI percentage goes from 22 percent to 7 percent for the integer benchmarks
and from 44 percent to 16 percent for the floating point benchmarks. Since the simulator
still models the original microprocessor, the simulations for the largest configuration
correspond to a processor with relatively weak integer and floating point units, but an
extremely powerful memory system. If one assumes that the processor itself will improve
as memory systems get larger, then the impact of these page mode schemes would become
significantly greater. When comparing the performance with and without the precharge
penalty, the difference is barely noticeable since the miss rates are too small for the
precharge elimination to make any significant contribution to performance.
4.5.1 Integer Benchmarks
Since the miss rates were so small for all of the second level total cache sizes, the
performance impact remains fairly constant as this parameter is varied. As first level sizes
increase, the overall performance gain drops by two percent per doubling. The smallest
first level configuration (8K Icache 8K Dcache) yields an almost constant performance gain
of six percent over varying second level cache sizes. This decreases to two percent as the
first level cache sizes are quadrupled. The memory impact slightly increases as the first
level cache sizes are increased. Since the miss rates were getting slightly better with each
first level increase, the memory performance impact also gets slightly better. Also, the
increase in total cache size has a more pronounced effect on the memory impact. Memory
improvement lies between 33 and 40 percent for the smallest first level size, and increases
an additional four percent for the largest first level size. The elimination of the precharge
penalty shows little difference since the integer miss rates were so small.
4.5.2 Floating Point Benchmarks
The average floating point benchmark overall performance gain shows a steady
increase as second level total cache size is increased, since the miss rate was significantly
changing for each doubling. Also, the difference between the lines with the precharge
penalty and without is much more significant since the average miss rates were higher for
these benchmarks. For the smallest first level sizes, the overall performance gain was
between six and twelve percent, going up by roughly two percent for each second level
doubling. Each doubling of the first level cache sizes resulted in a halving of this
performance gain. The elimination of the precharge penalty gave an additional gain of
about two percent for the smallest second level cache size but gave no additional gain for
the largest, since the miss rate approached zero.
The memory impact was fairly constant as the first level sizes were increased,
indicating a second level miss rate that was largely independent of first level size. With the
precharge penalty included, increasing the second level size changed the memory impact
from 23 percent to about 35 percent. Without the precharge penalty, this change was from




A first sweep of cache parameters showed that page mode DRAM could indeed
improve the performance of a system. With a constant first level cache size, a sweep of
second level block sizes indicated that performance was gained by allowing smaller blocks.
In addition, increasing the size of the second level cache continued to increase the
performance of the page mode DRAM. Increasing total second level cache size was much
more effective than changing the block size.
A second sweep of cache parameters studied the value of page mode schemes for
future microprocessors. While two optimal second level block sizes were held constant, the
total sizes of the first and second level caches were increased to model the increasing
memory system sizes of future processors. The page mode schemes showed improved miss
rates for both larger first level cache sizes and larger second level cache sizes. The
performance impact on the microprocessor went down as first level cache sizes increased,
due to the smaller number of second level cache accesses.
Using pure page mode DRAMs limits both the total cache size and the number of
blocks allowed in the second level cache. By not being able to break up rows of DRAM
into smaller blocks, 4K bytes were wasted for each cached block. Larger DRAMs would
cause even more waste since the rows would get even longer. The total effective cache size
could only be increased by adding more banks of main memory DRAM since only 4K
bytes per DRAM row can effectively be cached. A four bank main memory system with a
total of 32M bytes of memory could be configured with page mode DRAM to create only a
16K bytes second level cache, which is probably too small to be very effective. In addition,
as the miss rate got worse, the RAS precharge penalty started to significantly degrade
performance.
Register-based cache DRAMs solve the problems of wasted bytes and large blocks.
Allowing smaller block sizes helped the miss rates, but not tremendously. Cache DRAMs
make effective use of the entire DRAM row, but also could get no bigger than the size of a
row multiplied by the number of banks. 32M bytes of main memory translate into 32K
bytes of second level cache for a four bank scheme. Simulations showed that this performs
reasonably well, but could get even better if the total cache size could somehow be
increased. Additionally, register-based cache DRAMs avoid the problem of the RAS
precharge penalty, which becomes significant when the second level cache miss rate is
marginal.
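The second level sizes quoted for the four bank, 32M byte system follow from multiplying the number of banks by the bytes each bank can usefully hold. In the sketch below, the 4K byte figure for pure page mode comes from the text, while the 8K byte row length is inferred from the 32K byte total rather than stated directly. The names are illustrative.

```python
KB = 1024


def second_level_size(banks: int, bytes_cached_per_bank: int) -> int:
    """Total second level cache size when each bank caches one row."""
    return banks * bytes_cached_per_bank


BANKS = 4
PAGE_MODE_BYTES_PER_BANK = 4 * KB  # only 4K bytes of a row are effectively cached
REGISTER_BYTES_PER_BANK = 8 * KB   # whole row usable; inferred from 32K / 4 banks

print(second_level_size(BANKS, PAGE_MODE_BYTES_PER_BANK) // KB)  # 16
print(second_level_size(BANKS, REGISTER_BYTES_PER_BANK) // KB)   # 32
```

Under this model the only way to enlarge either cache is to add banks or lengthen rows, which is the size limit that both schemes share.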
Cache DRAMs with embedded SRAMs allow arbitrary sizes for both second level
blocks and total cache sizes, and eliminate the RAS precharge penalty. Additionally, the
designer may consider design issues such as set associativity or write-back caching. This
increased flexibility could dramatically increase a memory system's performance since
larger second level cache sizes showed increased effectiveness for each doubling.
Achieving a large enough total size is the main problem with page mode caches, and only
the SRAM-based cache DRAM solves it.
The integer benchmarks showed a very regular second level cache behavior, and
second level miss rates were very low. The floating point benchmarks exhibited higher
miss rates and responded to parameter changes erratically. For the largest cache sizes, all
benchmarks exhibited very small miss rates and significant memory performance
improvement.
All three schemes improved the performance of a microprocessor. While page mode
DRAMs showed the least improvement, they are the most readily available. Register-based
cache DRAMs and cache DRAMs with embedded SRAMs were significantly better, but
they are not yet commodity parts. Perhaps the most interesting result shown by the
simulations was that the second level miss rate did not degrade with increasing first level
cache sizes. Based on this result, the interaction between first and second level cache sizes
merits further investigation. Additionally, with the low miss rates shown in the




First Level Miss rates
-42-





























[Figure: first level miss rates by benchmark. Surviving panel title: "Miss Rates for Instructions Only". Surviving legend entries: 8K Icache 4K Dcache; 8K Icache 8K Dcache; 16K Icache 16K Dcache; 8K Dcache; 16K Dcache; 32K Dcache. Axis labels and plotted values are not recoverable.]











[Figure: CPI component pie charts. Surviving titles: "CPI for Integer Benchmarks", "Floating Point CPI Components", "Average CPI for Floating Point Benchmark". Surviving legend entries: mem CPI, inst CPI. Surviving chart labels, in extraction order: Total CPI = 1.316, 1.768, 2.330, 3.751, 4.517, 1.303, 1.146, 1.249, 1.151. Surviving slice percentages, in extraction order: 46.72%, 45.22%, 2.72%, 24.38%, 50.14%, 45.48%, 92.99%. The association of values with benchmarks is not recoverable.]
-47-










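[Figure: CPI component pie charts, continued. Surviving title: "Average CPI for Floating Point Benchmark". Surviving chart labels, in extraction order: Total CPI = 2.695, 2.504, 2.862. Surviving slice percentages, in extraction order: 37.12%, 3.99%, 29.90%, 31.66%, 16.28%, 52.88%, 10.06%. The association of values with benchmarks is not recoverable.]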





1st Sweep Miss Rates
-49-
[Figures: 2nd Level Miss Rates (8K Icache 4K Dcache), one graph each for 001.gcc1.35, 008.espresso, 022.li, 023.eqntott, 013.spice2g6, 015.doduc, 030.matrix300, 020.nasa7, 042.fpppp, 047.tomcatv, and the Integer Benchmarks average, followed by one further graph whose title did not survive. X axis: Block Size (bytes), with points at 128, 256, 512, 1024, 2048, and 4096. Surviving legend entries: 64K Total, 32K Total. Plotted values are not recoverable.]





1st Sweep Performance Impact
-56-
Performance Impact of 2nd Level
for 001.gccl.35
(8K Icache 4K Deache)
Total Impact With Penalty
10 F
128 256 512 1024 2048 4096
Block Size (bytes)





128 256 512 1024 2048 4096
Block Size (bytes)
--- 64K Total Cache Size
A 32K Total Cache Size
--- 16K Total Cache Size
Total Impact Without Penalty
10 1-




128 256 512 1024 2048 4096
Block Size (bytes)
128 256 512 1024 2048 4096
Block Size (bytes)
Integer Benchmark





Performance Impact of 2nd Level
for 008.espresso
(8K Icache 4K Dcache)





128 256 512 1024 2048
Block Size (bytes)
4096









-- >- 64K Total Cache Size
-z-- 32K Total Cache Size
-+-16K Total Cache Size
Total Impact Without Penalty
10 1-
I I I f 1
128 256 512 1024 2048 4096
Block Size (bytes)
128 256 512 1024 2048 4096
Block Size (bytes)






F F I I I I
128 256 512 1024 2048 4096
Block Size (bytes)
Integer Benchmark




Performance Impact of 2nd Level
for 022.1i
(8K Icache 4K Dcache)
Total Impact With Penalty
10 I-
128 256 512 1024 2048 4096
Block Size (bytes)
Memory Impact With Penalty









128 256 512 1024 2048 4096
Block Size (bytes)
64K Total Cache Size
-,ar-- 32K Total Cache Size
---- 16K Total Cache Size
Total Impact Without Penalty
128 256 512 1024 2048 4096
Block Size (bytes)





















[Figure: Performance Impact of 2nd Level for 023.eqntott (8K Icache 4K Dcache). Panel title recovered: Memory Impact With Penalty. X-axis: Block Size (bytes), 128 to 4096. Curves: 64K, 32K, and 16K Total Cache Size.]
[Figure: Performance Impact of 2nd Level for 013.spice2g6 (8K Icache 4K Dcache). Panel title recovered: Total Impact With Penalty. X-axis: Block Size (bytes), 128 to 4096. Curves: 64K, 32K, and 16K Total Cache Size.]
[Figure: Performance Impact of 2nd Level for 015.doduc (8K Icache 4K Dcache). Panel titles recovered: Total Impact With Penalty; Memory Impact Without Penalty. X-axis: Block Size (bytes), 128 to 4096. Curves: 64K, 32K, and 16K Total Cache Size.]
[Figure: Performance Impact of 2nd Level for 020.nasa7 (8K Icache 4K Dcache). Panel titles recovered: Total Impact With Penalty; Memory Impact With Penalty; Total Impact Without Penalty. X-axis: Block Size (bytes), 128 to 4096. Curves: 64K, 32K, and 16K Total Cache Size.]
[Figure: Performance Impact of 2nd Level for 030.matrix300 (8K Icache 4K Dcache). Panel titles recovered: Total Impact With Penalty; Total Impact Without Penalty; Memory Impact Without Penalty. X-axis: Block Size (bytes), 128 to 4096. Curves: 64K, 32K, and 16K Total Cache Size.]
[Figure: Performance Impact of 2nd Level for 042.fpppp (8K Icache 4K Dcache). Panel titles recovered: Memory Impact With Penalty; Total Impact Without Penalty. X-axis: Block Size (bytes), 128 to 4096. Curves: 64K, 32K, and 16K Total Cache Size.]
[Figure: Performance Impact of 2nd Level for 047.tomcatv (8K Icache 4K Dcache). Panel title recovered: Total Impact Without Penalty. X-axis: Block Size (bytes), 128 to 4096. Curves: 64K, 32K, and 16K Total Cache Size.]
[Figure: Performance Impact of 2nd Level for Integer Benchmarks (8K Icache 4K Dcache). Panel titles recovered: Total Impact With Penalty; Memory Impact With Penalty; Total Impact Without Penalty. X-axis: Block Size (bytes), 128 to 4096. Curves: 64K, 32K, and 16K Total Cache Size.]
[Figure: Performance Impact of 2nd Level for Floating Point Benchmarks (8K Icache 4K Dcache). Panel titles recovered: Total Impact With Penalty; Total Impact Without Penalty. X-axis: Block Size (bytes), 128 to 4096. Curves: 64K, 32K, and 16K Total Cache Size.]
[Figure: 2nd Level Miss Rates for 001.gcc1.35. Panels: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: 2nd level cache size (KBytes), 128 to 1024. Curves: 1K Bytes Block Size and 512 Bytes Block Size.]
[Figure: 2nd Level Miss Rates for 008.espresso. Panels: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: 2nd level cache size (KBytes), 128 to 1024. Legend entry recovered: 1K Bytes Block Size.]
[Figure: 2nd Level Miss Rates for 022.li. Panels: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: 2nd level cache size (KBytes), 128 to 1024. Legend entry recovered: 1K Bytes Block Size.]
[Figure: 2nd Level Miss Rates for 023.eqntott. Panels: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: 2nd level cache size (KBytes), 128 to 1024.]
[Figure: 2nd Level Miss Rates for 015.doduc. Panels: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: 2nd level cache size (KBytes), 128 to 1024. Curves: 1K Bytes Block Size and 512 Bytes Block Size.]
[Figure: 2nd Level Miss Rates for 020.nasa7. Panels: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: 2nd level cache size (KBytes), 128 to 1024.]
[Figure: 2nd Level Miss Rates for 030.matrix300. Panels: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: 2nd level cache size (KBytes), 128 to 1024. Legend entry recovered: 1K Bytes Block Size.]
[Figure: 2nd Level Miss Rates for 042.fpppp. Panel recovered: (8K Icache 8K Dcache). X-axis: 2nd level cache size (KBytes), 128 to 1024. Legend entry recovered: 1K Bytes Block Size.]
[Figure: 2nd Level Miss Rates for 047.tomcatv. Panels recovered: (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: 2nd level cache size (KBytes), 128 to 1024.]
[Figure: 2nd Level Miss Rates for Integer Benchmarks. Panels recovered: (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024.]
[Figure: 2nd Level Miss Rates for Floating Point Benchmarks. Panels recovered: (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1K Bytes Block Size and .5K Bytes Block Size.]
Appendix F
2nd Sweep Performance Impact
[Figure: Performance Impact of 2nd Level and Memory Impact of 2nd Level for 008.espresso. Panels for each: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1 KB Blks With Penalty, .5KB Blks With Penalty, 1 KB Blks Without Penalty, .5KB Blks Without Penalty.]
[Figure: Performance Impact of 2nd Level for 022.li, with a second plot group also titled for 022.li. Panels for each: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1 KB Blks With Penalty, .5KB Blks With Penalty, 1 KB Blks Without Penalty, .5KB Blks Without Penalty.]
[Figure: two plot groups titled "2nd Level for 023.eqntott" (the leading words of each title are lost). Panels for each: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1 KB Blks With Penalty, .5KB Blks With Penalty, 1 KB Blks Without Penalty, .5KB Blks Without Penalty.]
[Figure: two plot groups titled "2nd Level for 015.doduc" (the leading words of each title are lost). Panels for each: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1 KB Blks With Penalty, .5KB Blks With Penalty, 1 KB Blks Without Penalty, .5KB Blks Without Penalty.]
[Figure: two plot groups titled "2nd Level for 020.nasa7" (the leading words of each title are lost). Panels for each: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1 KB Blks With Penalty, .5KB Blks With Penalty, 1 KB Blks Without Penalty, .5KB Blks Without Penalty.]
[Figure: plots titled "2nd Level for 030.matrix300" (the leading words of the title are lost). Panels recovered: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1 KB Blks With Penalty, .5KB Blks With Penalty, 1 KB Blks Without Penalty, .5KB Blks Without Penalty.]
[Figure: Memory Impact of 2nd Level for 042.fpppp, preceded by a plot group whose title is lost. Panels recovered: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1 KB Blks With Penalty, .5KB Blks With Penalty, 1 KB Blks Without Penalty, .5KB Blks Without Penalty.]
[Figure: two plot groups titled "2nd Level for 047.tomcatv"; of the leading title words only "Performance" survives. Panels for each: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1 KB Blks With Penalty, .5KB Blks With Penalty, 1 KB Blks Without Penalty, .5KB Blks Without Penalty.]
[Figures: Performance Impact of 2nd Level and Memory Impact of 2nd Level, benchmark summary plots. The only benchmark-group title fragment recovered is "for Floating Point Benchmarks". Panels recovered: (8K Icache 8K Dcache), (16K Icache 16K Dcache), (32K Icache 32K Dcache). X-axis: Total Cache Size (KBytes), 128 to 1024. Curves: 1 KB Blks With Penalty, .5KB Blks With Penalty, 1 KB Blks Without Penalty, .5KB Blks Without Penalty.]
1258984771 instructions (including annulled)
1217214604 instructions (excluding annulled)



































































































































3.953% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















109294979 # of ticks saved = 5.09 percent of total




































































































































2.161% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















192602952 # of ticks saved = 5.02 percent of total
























































































































































3.018% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks

















# of ticks saved = -1.47 percent of total
Application: doduc
1316441137 instructions (including annulled)
1304567925 instructions (excluding annulled)


























































































































































i for i (icache busy)
i for d (DRAM busy)
i for store (DRAM busy)
d for d (dcache busy)
d for i (DRAM busy)
d for store (DRAM busy)
store for d (DRAM busy)
store for i (DRAM busy)
store for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













159611912 # of ticks saved = 3.83 percent of total
Application: dnasa7
6800274187 instructions (including annulled)
6784406507 instructions (excluding annulled)













































































































































19.203% D write backs



























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = 6.51 percent of total
Application: xlisp li-input.lsp
4962043458 instructions (including annulled)
4661592279 instructions (excluding annulled)
































































































































3.489% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















310412461 # of ticks saved = 4.53 percent of total
11990588 # of 2nd level dirty misses
Application: eqntott -s -.ioplte int_pri_3.eqn
1376907962 instructions (including annulled)
1326073659 instructions (excluding annulled)

















































































0.170% D write backs
0.049% D read mod writes





















21310 i for i (icache busy)
104 i for d (DRAM busy)
240 i for store (DRAM busy)
89775 d for d (dcache busy)
1111 d for i (DRAM busy)
35206 d for store (DRAM busy)
0 store for d (DRAM busy)
0 store for i (DRAM busy)





























total ticks of fpu
fpOP instructions
total dram ticks





















23227366 # of ticks saved = 1.49 percent of total
108409 # of 2nd level dirty misses
Application: matrix300
1695008913 instructions (including annulled)
1693559295 instructions (excluding annulled)























































































































19.912% D write backs


















i for i (icache busy)
i for d (DRAM busy)
i for store (DRAM busy)
d for d (dcache busy)
d for i (DRAM busy)
d for store (DRAM busy)
store for d (DRAM busy)
store for i (DRAM busy)










total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = -2.89 percent of total
Application: fpppp
1448153391 instructions (including annulled)
1443743830 instructions (excluding annulled)







































































4.329% D write backs


















































for i (icache busy)
for d (DRAM busy)



























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks



















































































































107292 i for i (icache busy)
















8.925% D write backs


















i for store (DRAM busy)
d for d (dcache busy)
d for i (DRAM busy)
d for store (DRAM busy)
store for d (DRAM busy)
store for i (DRAM busy)
store for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
































1258987043 instructions (including annulled)
1217217202 instructions (excluding annulled)

































































































































3.950% D write backs























for d (DRAM busy)
for i (DRAM busy)










total ticks of fpu
fpOP instructions
total dram ticks





















# of ticks saved = 6.68 percent of total


























































































































































i for i (icache busy)
i for d (DRAM busy)
i for store (DRAM busy)
d for d (dcache busy)
d for i (DRAM busy)
d for store (DRAM busy)
store for d (DRAM busy)
store for i (DRAM busy)
store for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = 5.49 percent of total






















































































































































3.013% D write backs



























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks






































































































































5.056% D write backs
























for d (DRAM busy)
for i (DRAM busy)









total ticks of fpu
fpOP instructions
total dram ticks













220044568 # of ticks saved = 5.37 percent of total
Application: dnasa7
6800274207 instructions (including annulled)
6784406515 instructions (excluding annulled)






























































































































for i (icache busy)
for d (DRAM busy)







19.201% D write backs





















for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks

















4962043458 instructions (including annulled)
4661592279 instructions (excluding annulled)





































































































































3.489% D write backs



























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















398601524 # of ticks saved = 5.90 percent of total
5595815 # of 2nd level dirty misses
Application: eqntott -s -.ioplte int_pri_3.eqn
1376907962 instructions (including annulled)
1326073659 instructions (excluding annulled)





































































































































0.170% D write backs



























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















26549482 # of ticks saved = 1.70 percent of total













































































19.912% D write backs



















































































for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks






































































































































4.309% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks























































































































































8.921% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks




































1259060745 instructions (including annulled)
1217290497 instructions (excluding annulled)




































































































































3.930% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















# of ticks saved = 8.04 percent of total









































































































































2.144% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















# of ticks saved = 5.73 percent of total
649461 # of 2nd level dirty misses
Application: spice2g6
23810783660 instructions (including annulled)
22775128206 instructions (excluding annulled)




































































































































3.018% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks

























































































































































5.056% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = 6.18 percent of total
Application: dnasa7
6800274171 instructions (including annulled)
6784406481 instructions (excluding annulled)











































































































































19.209% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = 10.30 percent of total
Application: xlisp li-input.lsp
4962043458 instructions (including annulled)
4661592279 instructions (excluding annulled)

































































































for i (icache busy)
for d (DRAM busy)
for store (DRAM busy)






























3.464% D write backs




















for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















# of ticks saved = 6.44 percent of total
3213768 # of 2nd level dirty misses
Application: eqntott -s -.ioplte int_pri_3.eqn
1376907962 instructions (including annulled)
1326073659 instructions (excluding annulled)



































































































































0.170% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















30142570 # of ticks saved = 1.94 percent of total
71442 # of 2nd level dirty misses
Application: matrix300
1695008957 instructions (including annulled)
1693559338 instructions (excluding annulled)
42.5 SPECmarks for matrix300
level size
1st I 8 K
1st D 4 K






































































































i for i (icache busy)
























19.912% D write backs






















for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks






















































































































































4.309% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = 9.08 percent of total
Application: tomcatv
1626566032 instructions (including annulled)
1626346353 instructions (excluding annulled)








































































































































8.925% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = 13.16 percent of total
Application: 001.gccl.35
1258997667 instructions (including annulled)
1217227427 instructions (excluding annulled)
49.4 SPECmarks for gcc
level size
1st I 8 KB
1st D 8 KB











































































































i for i (icache busy)




















3.349% D write backs






















for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















# of ticks saved = 8.44 percent of total
1442927 # of 2nd level dirty misses
Application: 008.espresso
3102930952 instructions (including annulled)
2930507476 instructions (excluding annulled)

































































































































































for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















173009550 # of ticks saved = 4.70 percent of total














1st I 1st I
1st D 8 KB





subblk assoc write miss





























































2.604% D write backs
1.537% D read mod writes
















































































for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = 6.11 percent of total






1316441149 instructions (including annulled)
1304567934 instructions (excluding annulled)













































































































for d (DRAM busy)
3.823% D write backs


















52435 store for i (DRAM busy)










total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = 5.54 percent of total
























6800274227 instructions (including annulled)
6784406530 instructions (excluding annulled)
































































































































i for i (icache busy)








18.764% D write backs






















for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













2768768231 # of ticks saved = 9.94 percent of total
177082961 # of 2nd level dirty misses
Application: xlisp li-input.lsp
4962043458 instructions (including annulled)
4661592279 instructions (excluding annulled)
























































































































for i (icache busy)
for d (DRAM busy)







2.935% D write backs





















for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















# of ticks saved = 5.51 percent of total
1257483 # of 2nd level dirty misses


















































































































































i for i (icache busy)
i for d (DRAM busy)
i for store (DRAM busy)
d for d (dcache busy)
d for i (DRAM busy)
d for store (DRAM busy)
store for d (DRAM busy)
store for i (DRAM busy)
store for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















32017425 # of ticks saved = 2.07 percent of total










1st I 8 KB
1st D 8 KB

































































































for i (icache busy)
for d (DRAM busy)
for store (DRAM busy)
for d (dcache busy)
for i (DRAM busy)
7.130% D write backs





















39247512 d for store (DRAM busy)
0 store for d (DRAM busy)
8 store for i (DRAM busy)









total ticks of fpu
fpOP instructions
total dram ticks













-6606058 # of ticks saved = -0.12 percent of total




1448153349 instructions (including annulled)
1443743790 instructions (excluding annulled)
34.7 SPECmarks for fpppp
level size
1st I 8 KB
1st D 8 KB



























































































































1.934% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













454009051 # of ticks saved = 7.77 percent of total





































































7.898% D write backs



















































































for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks













# of ticks saved = 11.05 percent of total








1259003629 instructions (including annulled)
1217233775 instructions (excluding annulled)
56.6 SPECmarks for gcc
level size
1st I 16 KB
1st D 16 KB
































































































































2.691% D write backs
























for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks





















# of ticks saved = 5.95 percent of total






































































































































0.917% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
81747824 # of ticks saved = 2.37 percent of total

2.354% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 4.64 percent of total
37576704 # of 2nd level dirty misses
Application: doduc
1316441149 instructions (including annulled)
1304567934 instructions (excluding annulled)
33.9 SPECmarks for doduc
level size
1st I 16 KB
1st D 16 KB
2.648% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 3.13 percent of total
996036 # of 2nd level dirty misses
Application: dnasa7
6800274227 instructions (including annulled)
6784406530 instructions (excluding annulled)
16.540% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 8.08 percent of total
# of 2nd level dirty misses
Application: xlisp li-input.lsp
4962043458 instructions (including annulled)
4661592279 instructions (excluding annulled)
2.407% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 3.90 percent of total
349179 # of 2nd level dirty misses





1st I 16 KB
1st D 16 KB
0.151% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
30351541 # of ticks saved = 1.96 percent of total

1st I 16 KB
1st D 16 KB
0.786% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
9960652 # of ticks saved = 0.22 percent of total
3998486 # of 2nd level dirty misses
Application: fpppp
1448153349 instructions (including annulled)
1443743790 instructions (excluding annulled)
37.5 SPECmarks for fpppp
level size
1st I 16 KB
1st D 16 KB
0.150% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 5.33 percent of total

1st I 16 KB
1st D 16 KB
5.432% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 6.43 percent of total
6197797 # of 2nd level dirty misses
Application: 001.gcc1.35
2.149% D write backs
for d (DRAM busy)
for i (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
66060946 # of ticks saved = 4.13 percent of total

1st I 32 KB
1st D 32 KB
0.504% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
43236030 # of ticks saved = 1.29 percent of total

1316441191 instructions (including annulled)
1304567974 instructions (excluding annulled)
20400719 i for i (icache busy)
1.413% D write backs
for d (DRAM busy)
for i (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
54540015 # of ticks saved = 1.55 percent of total
620997 # of 2nd level dirty misses
Application: dnasa7
6800274207 instructions (including annulled)
6784406522 instructions (excluding annulled)
write back write allocate
write back write allocate
14.692% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 7.36 percent of total

4962043458 instructions (including annulled)
4661592279 instructions (excluding annulled)
1.287% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 2.11 percent of total
54265 # of 2nd level dirty misses
Application: eqntott -s -.ioplte int_pri_3.eqn
1376907962 instructions (including annulled)
1326073659 instructions (excluding annulled)
0.138% D write backs
i for i (icache busy)
i for d (DRAM busy)
i for store (DRAM busy)
d for d (dcache busy)
d for i (DRAM busy)
d for store (DRAM busy)
store for d (DRAM busy)
store for i (DRAM busy)
store for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
27812596 # of ticks saved = 1.81 percent of total

0.112% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
4918397 # of ticks saved = 0.11 percent of total

1st I 32 KB
1st D 32 KB
0.030% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 2.63 percent of total

5.368% D write backs
for d (DRAM busy)
for i (DRAM busy)
for store (DRAM busy)
total ticks of fpu
fpOP instructions
total dram ticks
# of ticks saved = 5.81 percent of total

